Computational Pattern Analysis and Statistical Learning · Lecture 5: Supervised learning

Page 1: Computational Pattern Analysis and Statistical Learning

Lecture 5: Supervised learning

Tijl De Bie, Konstantin Tretyakov
(Largely based on joint work with Nello Cristianini and John Shawe-Taylor)

Tartu, Estonia, November 2006

Page 2: Outline

1. Lecture 5A: Regression and classification
   - Linear regression
   - Fisher's discriminant analysis
   - Support Vector Machines
2. Lecture 5B: Kernel regression and classification, and stability analysis
   - Kernel ridge regression
   - How to "kernelise" an algorithm? (you should know this by now)
   - Kernel support vector machines
   - Statistical analysis of ridge regression
3. Wrap-up of Lecture 5

Page 3: Overview

- Recapitulation of ridge regression, now with offset
- Fisher's discriminant analysis
- Support Vector Machines

Page 4: Least squares regression

We want to approximate y_i as a linear function of x_i. In terms of a weight vector w, this means y_i ≈ x_i'w, or ||y_i − x_i'w|| ≈ 0. The pattern function is parameterised by w (note the minus sign):

π_w(Z) = −(1/n) Σ_{i=1}^n (y_i − x_i'w)² = −(1/n) ||y − Xw||²

Formal pattern recognition problem:

max_w π_w(Z)  ⇔  max_w −(1/n) ||y − Xw||²  ⇔  min_w ||y − Xw||²
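
As a minimal illustration of the problem min_w ||y − Xw||², the sketch below fits w by ordinary least squares with NumPy; the data, dimensions and noise level are made up for the example and are not from the slides.

```python
import numpy as np

# Toy data, invented for this example: 100 points in 3 dimensions with noisy linear labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)

# min_w ||y - Xw||^2: ordinary least squares.
w_hat, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(w_hat)  # should be close to w_true
```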

Page 5: Least squares regression with offset

Page 6: Least squares regression with offset

We want to approximate y_i as an affine function of x_i. In terms of a weight vector w and offset b, this means y_i ≈ x_i'w + b, or ||y_i − (x_i'w + b)|| ≈ 0. The pattern function is parameterised by w and b (note the minus sign):

π_{w,b}(Z) = −(1/n) Σ_{i=1}^n (y_i − (x_i'w + b))² = −(1/n) ||y − Xw − 1b||²

Formal pattern recognition problem:

max_{w,b} π_{w,b}(Z)  ⇔  max_{w,b} −(1/n) ||y − Xw − 1b||²  ⇔  min_{w,b} ||y − Xw − 1b||²

Page 7: Ridge regression with offset

Danger of overfitting (usually not an issue in 1- or low-dimensional regression, but it is in high-dimensional spaces, such as when using the kernel trick to do nonlinear regression).

Capacity control: regularise by additionally controlling C(π_{w,b}) = ||w||².

Page 8: Ridge regression with offset

min_{w,b} ||y − Xw − 1b||² + γ ||w||²

Solve by taking the gradient w.r.t. w and the derivative w.r.t. b, and equating them to 0:

(γI + X'X)w + X'1b − X'y = 0
1'Xw + 1'1b − 1'y = 0

Solved by a linear system of equations:

[ γI + X'X   X'1 ] [ w ]   [ X'y ]
[ 1'X        1'1 ] [ b ] = [ 1'y ]
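
A small NumPy sketch of this solve; the function name and interface are my own choices for illustration, not from the slides. It assembles the block system above and returns (w, b); predictions are then X_test @ w + b.

```python
import numpy as np

def ridge_with_offset(X, y, gamma):
    """Solve min_{w,b} ||y - Xw - 1b||^2 + gamma ||w||^2 via the block system
    [[gamma*I + X'X, X'1], [1'X, 1'1]] [w; b] = [X'y; 1'y]."""
    n, d = X.shape
    ones = np.ones(n)
    A = np.zeros((d + 1, d + 1))
    A[:d, :d] = gamma * np.eye(d) + X.T @ X
    A[:d, d] = X.T @ ones
    A[d, :d] = ones @ X
    A[d, d] = n                      # 1'1 = n
    rhs = np.concatenate([X.T @ y, [ones @ y]])
    sol = np.linalg.solve(A, rhs)
    return sol[:d], sol[d]           # w, b
```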

Page 9: Fisher's discriminant analysis

Let's assume binary classification: y_i ∈ {−1, 1}.

Pattern function: learn the classifier as a thresholded linear function, ŷ = sign(x'w + b). Then:

−g_{π_{w,b}}(x, y) = ((1 − sign(y(x'w + b))) / 2)²

However, this is hard to optimise... non-convex! Hence, use a convex upper bound:

−g_{π_{w,b}}(x, y) = (1 − y(x'w + b))²

Page 10: Fisher's discriminant analysis

Ideal:

−g_{π_{w,b}}(x, y) = ((1 − sign(y(x'w + b))) / 2)²

Convex upper bound:

−g_{π_{w,b}}(x, y) = (1 − y(x'w + b))²

Page 11: Fisher's discriminant analysis

Note that, for y binary (y ∈ {−1, 1}),

−g_{π_{w,b}}(x, y) = (1 − y(x'w + b))² = (y − (x'w + b))²

Same as for ridge regression! Hence, the exact same methodology as for (ridge) regression can be used.
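
To make the equivalence concrete, here is a short sketch that trains a classifier by ridge-regressing onto ±1 labels and thresholding the output. It reuses the ridge_with_offset helper sketched earlier in this transcript, which is an illustration of mine rather than code from the course.

```python
import numpy as np

# Assumes the ridge_with_offset(X, y, gamma) sketch defined above.
def fisher_via_ridge(X, y, gamma=1.0):
    """Binary classifier obtained by ridge regression onto labels y in {-1, +1}."""
    w, b = ridge_with_offset(X, np.asarray(y, dtype=float), gamma)
    return lambda X_new: np.sign(X_new @ w + b)
```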

Page 12: Fisher's discriminant analysis

π_{w,b}(X) = (1/n) Σ_{i=1}^n g_{w,b}(x_i)  with  g_{w,b}(x_i) = −(y_i − (x_i'w + b))² = −(1 − y_i(x_i'w + b))²

−g_{w,b}(x_i) is the cost associated with each (x_i, y_i). Quite sensitive to outliers (quadratic!)

Page 13: Fisher's discriminant analysis

π_{w,b}(X) = (1/n) Σ_{i=1}^n g_{w,b}(x_i)  with  g_{w,b}(x_i) = −(y_i − (x_i'w + b))² = −(1 − y_i(x_i'w + b))²

This is the cost associated with each (x_i, y_i). Quite sensitive to outliers (quadratic!)

Page 14: Support Vector Machines for robust regression

Solution: use another cost (not quadratic), also an upper bound on

−g_{π_{w,b}}(x, y) = ((1 − sign(y(x'w + b))) / 2)²

But keep it convex...

Page 15: Support vector machines

Averaging pattern function with:

g_{w,b}(x_i) = −max(0, 1 − y_i(x_i'w + b))

Pattern function itself:

π_{w,b}(X) = −(1/n) Σ_{i=1}^n max(0, 1 − y_i(x_i'w + b))

Capacity functional:

C(π_{w,b}(X)) = ||w||²

Pattern recognition problem:

min_{w,b} (1/n) Σ_{i=1}^n max(0, 1 − y_i(x_i'w + b)) + γ ||w||²

Page 16: Support vector machines

Introduce new variables ξ_i ≥ 0 with ξ_i ≥ 1 − y_i(x_i'w + b). Then

Σ_{i=1}^n max(0, 1 − y_i(x_i'w + b)) = min_ξ Σ_i ξ_i

Hence:

min_{w,b,ξ} (1/n) Σ_{i=1}^n ξ_i + γ ||w||²
s.t.  ξ_i ≥ 0
      ξ_i ≥ 1 − y_i(x_i'w + b)

This is easy to solve using any quadratic programming toolbox...
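
As one possible concretisation of "any quadratic programming toolbox", the sketch below states this QP with the CVXPY modelling library; the library choice, function name and default γ are my own for illustration, not prescribed by the slides.

```python
import numpy as np
import cvxpy as cp

def svm_primal(X, y, gamma=1.0):
    """Solve min_{w,b,xi} (1/n) sum_i xi_i + gamma ||w||^2
    s.t. xi_i >= 0 and xi_i >= 1 - y_i (x_i'w + b), with y in {-1, +1}^n."""
    n, d = X.shape
    w, b, xi = cp.Variable(d), cp.Variable(), cp.Variable(n)
    objective = cp.Minimize(cp.sum(xi) / n + gamma * cp.sum_squares(w))
    constraints = [xi >= 0, xi >= 1 - cp.multiply(y, X @ w + b)]
    cp.Problem(objective, constraints).solve()
    return w.value, b.value
```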

Page 17: Support vector machines

Property: many ξ_i = 0, corresponding to y_i(x_i'w + b) ≥ 1, i.e.

(x_i'w + b) ≥ 1   if y_i = 1
(x_i'w + b) ≤ −1  if y_i = −1

Hence: many (x_i, y_i) can be separated by a certain margin.

The points for which y_i(x_i'w + b) ≤ 1 are known as the support vectors. For some of them, (x_i'w + b) = y_i holds.

Page 18: Support vector machines

Size of the margin: take a point x_i on the margin, i.e. for which (x_i'w + b) = y_i = 1, and another point x_j for which (x_j'w + b) = −1.

The margin is the length of the projection of x_i − x_j onto w:

(x_i − x_j)'w / ||w|| = 2 / ||w||

Page 19: Support vector machines

The capacity functional ||w||² makes sure the margin is large... At the same time, the pattern function makes sure the classification error on the training set is small...

The combination of these two features makes sure that the error on another set of data points, a test set, can be expected to be small.

Page 20: Ridge regression: recapitulation

The optimal w and b are found as:

[ γI + X'X   X'1 ] [ w ]   [ X'y ]
[ 1'X        1'1 ] [ b ] = [ 1'y ]

Estimate the label for a data point x as y = x'w + b.

Page 21: Kernel ridge regression

Note:

(X'X + γI)w + X'1b − X'y = 0  ⇔  w = X' (1/γ)(y − Xw − 1b)

Let's denote α = (1/γ)(y − Xw − 1b). Then

w = X'α = Σ_{i=1}^n α_i x_i

The weight vector is a linear combination of the data points (representer theorem). The projection of a data point on the weight vector is a weighted sum of kernels (inner products):

x'w + b = x'X'α + b = Σ_{i=1}^n α_i k(x, x_i) + b

Page 22: Kernel ridge regression

Let's plug this into the equations (assuming that K = XX' is full rank):

[ γI + X'X   X'1 ] [ w ]   [ X'y ]
[ 1'X        1'1 ] [ b ] = [ 1'y ]

[ X   0 ] [ γI + X'X   X'1 ] [ X'α ]   [ X   0 ] [ X'y ]
[ 0'  1 ] [ 1'X        1'1 ] [ b   ] = [ 0'  1 ] [ 1'y ]

[ γK + K²   K1  ] [ α ]   [ K   0 ] [ y   ]
[ 1'K       1'1 ] [ b ] = [ 0'  1 ] [ 1'y ]

Page 23: Kernel ridge regression

[ γK + K²   K1  ] [ α ]   [ K   0 ] [ y   ]
[ 1'K       1'1 ] [ b ] = [ 0'  1 ] [ 1'y ]

⇔

[ γI + K   1   ] [ α ]   [ y   ]
[ 1'K      1'1 ] [ b ] = [ 1'y ]

Again: a set of linear equations...

Page 24: Kernel ridge regression

In summary, the dual vector α and the offset b can be found efficiently by solving

[ γI + K   1   ] [ α ]   [ y   ]
[ 1'K      1'1 ] [ b ] = [ 1'y ]

Then, for a test object x the label y can be predicted as

y = Σ_{i=1}^n α_i k(x, x_i) + b
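
A minimal NumPy sketch of this summary, assuming a precomputed kernel matrix K with K_ij = k(x_i, x_j); the function names and interface are illustrative, not from the slides.

```python
import numpy as np

def kernel_ridge_with_offset(K, y, gamma):
    """Solve [[gamma*I + K, 1], [1'K, 1'1]] [alpha; b] = [y; 1'y]
    for the dual vector alpha and offset b, given the kernel matrix K."""
    n = K.shape[0]
    ones = np.ones(n)
    A = np.zeros((n + 1, n + 1))
    A[:n, :n] = gamma * np.eye(n) + K
    A[:n, n] = ones
    A[n, :n] = ones @ K
    A[n, n] = n                      # 1'1 = n
    rhs = np.concatenate([y, [ones @ y]])
    sol = np.linalg.solve(A, rhs)
    return sol[:n], sol[n]           # alpha, b

def kernel_ridge_predict(k_test, alpha, b):
    """k_test[i] = k(x, x_i): prediction y = sum_i alpha_i k(x, x_i) + b."""
    return k_test @ alpha + b
```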

Page 25: Kernel Fisher discriminant analysis

Just a different use of kernel ridge regression, with binary labels y.

→ We will not discuss this in greater detail here.

Page 26: Recurring themes and tricks

You should have noticed that all methods relying on inner products, distances, ... can be expressed in terms of kernel functions:

1. The first step in kernelising invokes an instance of the representer theorem: the parameters (weight vector, cluster centre) can be represented as a linear combination of the data: w = X'α
2. The second step plugs this equation in, and left-multiplies the equations to obtain inner products XX' where possible...
3. Kernel trick: substitute the inner products with kernels (see the kernel-function sketch below).
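
As an illustration of step 3, here is one common kernel choice, the Gaussian (RBF) kernel; the function and its parameter are examples of mine, not prescribed by the slides. Its output can be fed directly into the kernel ridge regression sketch above.

```python
import numpy as np

def gaussian_kernel(X, Z, sigma=1.0):
    """Kernel trick, step 3: replace inner products x'z by a kernel k(x, z).
    Here k(x, z) = exp(-||x - z||^2 / (2 sigma^2)), computed for all pairs
    of rows of X and Z."""
    sq_dists = (np.sum(X**2, axis=1)[:, None]
                + np.sum(Z**2, axis=1)[None, :]
                - 2 * X @ Z.T)
    return np.exp(-sq_dists / (2 * sigma**2))

# Example: nonlinear (kernel) ridge regression = kernel_ridge_with_offset(gaussian_kernel(X, X), y, gamma).
```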

Page 27: Kernel support vector machines

The same trick works for support vector machines.

But a different approach is more common here: relying on optimisation theory.

It can be used for ridge regression, PCA, etc. as well!

Page 28: Kernel support vector machines

Support vector machine:

min_{w,b,ξ} (1/n) Σ_{i=1}^n ξ_i + γ ||w||²
s.t.  ξ_i ≥ 0
      ξ_i ≥ 1 − y_i(x_i'w + b)

Use Lagrange multipliers α ≥ 0 and β ≥ 0 for the two sets of inequalities.

Page 29: Kernel support vector machines

min_{w,b,ξ} max_{α,β} (1/n) Σ_{i=1}^n ξ_i + γ ||w||² − β'ξ − α'(ξ − 1 + diag(y)(Xw + 1b))

max_{α,β} min_{w,b,ξ} (1/n) 1'ξ + γ ||w||² − (β' + α')ξ + α'1 − α'y b − α' diag(y) Xw

Take the gradient w.r.t. w and equate it to 0:

2γw = X' diag(y) α

The same for ξ:

(1/n) 1 = β + α

The same for b:

α'y = 0

Page 30: Kernel support vector machines

Plugging all of this into the objective gives:

max_{α,β} −(1/(4γ)) α' (diag(y) XX' diag(y)) α + α'1

Hence, using kernels and with the constraints made explicit:

max_α −(1/(4γ)) α' (K ∘ yy') α + α'1
s.t.  (1/n) 1 ≥ α ≥ 0,  α'y = 0

This is the Lagrange dual formulation. Lagrange duals are often directly in kernel form...
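
A rough sketch of this dual, again using CVXPY as one possible QP toolbox (library choice, names and the Cholesky jitter are mine). It exploits α'(K ∘ yy')α = ||L'(y ∘ α)||² with K = LL' to keep the objective in a form the solver accepts without explicit PSD checks.

```python
import numpy as np
import cvxpy as cp

def kernel_svm_dual(K, y, gamma=1.0):
    """Solve max_alpha -(1/(4*gamma)) alpha'(K o yy')alpha + alpha'1
    s.t. 0 <= alpha_i <= 1/n and alpha'y = 0 (the dual stated on this slide)."""
    n = K.shape[0]
    L = np.linalg.cholesky(K + 1e-10 * np.eye(n))     # K = L L', small jitter for safety
    alpha = cp.Variable(n)
    quad = cp.sum_squares(L.T @ cp.multiply(y, alpha))  # = alpha'(K o yy')alpha
    objective = cp.Maximize(-quad / (4 * gamma) + cp.sum(alpha))
    constraints = [alpha >= 0, alpha <= 1.0 / n, y @ alpha == 0]
    cp.Problem(objective, constraints).solve()
    return alpha.value

# The primal weight vector would then be recovered as w = (1/(2*gamma)) X' diag(y) alpha.
```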

Page 31: Averaging pattern functions

The analysis will follow the same pattern as the bound for PCA. This is due to the fact that both are based on an averaging pattern function.

Let us first do the study in full generality, for averaging pattern functions:

π(X) = (1/n) Σ_{i=1}^n g_π(x_i)

Page 32: Averaging pattern functions

In general:

π(X) − E_X{π(X)} ≤ max_{π∈Π} (π(X) − E_X{π(X)})
                 ≈ E_Z{ max_{π∈Π} (π(Z) − E_X{π(X)}) }
                 ≤ E_{XZ}{ max_{π∈Π} (π(Z) − π(X)) }

We should turn the approximate inequality into a rigorous inequality...

Then devise an upper bound for the last quantity.

Page 33: Averaging pattern functions

The approximate equality for averaging pattern functions:

max_{π∈Π} (π(X) − E_X{π(X)}) ≈ E_Z{ max_{π∈Π} (π(Z) − E_X{π(X)}) }

Let us assume that |g_π(x) − g_π(x̃)| ≤ M (true e.g. if 0 ≤ g_π(x) ≤ M).

Then, replacing one data point x_i by a different value x̃_i can change the value of this function of X by at most M/n (this requires some thought... check it!)

→ McDiarmid...

Page 34: McDiarmid's inequality (again)

Theorem (McDiarmid's inequality)
For f a function of X = {x_1, x_2, ..., x_i, ..., x_n} with the x_i i.i.d., if f(X) has bounded differences c_i, meaning that |f(X) − f(X^i)| ≤ c_i (where X^i denotes X with x_i replaced by another value), we have that

P( f(X) − E{f(X)} < ε ) ≥ 1 − exp( −2ε² / Σ_{i=1}^n c_i² ).
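
A tiny numerical sanity check of the inequality, for the simplest case where f is the sample mean of i.i.d. Uniform[0,1] variables (so each c_i = 1/n); the sample sizes and ε are arbitrary choices of mine for illustration.

```python
import numpy as np

def mcdiarmid_check(n=200, eps=0.1, trials=20000, seed=0):
    """Compare the empirical probability P(f(X) - E f(X) >= eps), for f = sample mean
    of Uniform[0,1] variables, against McDiarmid's bound exp(-2 eps^2 / sum_i c_i^2)."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(trials, n))
    f = X.mean(axis=1)
    empirical = np.mean(f - 0.5 >= eps)                       # E f = 0.5 here
    bound = np.exp(-2 * eps**2 / (n * (1.0 / n) ** 2))        # sum c_i^2 = n * (1/n)^2
    return empirical, bound                                   # empirical should not exceed bound

print(mcdiarmid_check())
```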

Page 35: Averaging pattern functions

McDiarmid's inequality with c_i = M/n: with probability at least 1 − exp(−2nε²/M²),

max_{π∈Π} (π(X) − E_X{π(X)}) − E_Z{ max_{π∈Π} (π(Z) − E_X{π(X)}) } < ε

In other words (setting exp(−2nε²/M²) = δ/2 and solving for ε), with a probability of at least 1 − δ/2 we have that:

max_{π∈Π} (π(X) − E_X{π(X)}) ≤ E_Z{ max_{π∈Π} (π(Z) − E_X{π(X)}) } + M √( ln(2/δ) / (2n) ).

Page 36: Averaging pattern functions

⇒ The quantity to be bounded: E_{XZ}{ max_{π∈Π} (π(Z) − π(X)) }

It is bounded by the Rademacher complexity (σ_i i.i.d., equal to 1 or −1, each with probability 1/2):

E_{XZ}{ max_π (π(Z) − π(X)) } = E_{XZ}{ max_{π∈Π} (1/n) Σ_{i=1}^n ( g_π(z_i) − g_π(x_i) ) }
                              = E_{XZσ}{ max_{π∈Π} | (1/n) Σ_{i=1}^n σ_i ( g_π(z_i) − g_π(x_i) ) | }
                              ≤ E_{Xσ}{ max_{π∈Π} | (2/n) Σ_{i=1}^n σ_i g_π(x_i) | } ≜ R(Π)

Page 37: Rademacher complexity

Definition (Rademacher and empirical Rademacher complexity)
The Rademacher complexity R(Π) of a space Π of additive pattern functions π with π(X) = (1/n) Σ_{i=1}^n g_π(x_i) is given by

R(Π) = E_{Xσ}{ max_{π∈Π} | (2/n) Σ_{i=1}^n σ_i g_π(x_i) | }.

The empirical Rademacher complexity R̂_X(Π) of the same pattern space, for given data X = {x_1, x_2, ..., x_n}, is given by

R̂_X(Π) = E_σ{ max_{π∈Π} | (2/n) Σ_{i=1}^n σ_i g_π(x_i) | }.
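
Since the expectation over σ rarely has a closed form, one can approximate R̂_X(Π) by Monte Carlo; the sketch below does this for the simplified case of a finite set of candidate pattern functions (that restriction, and all names and defaults, are my own simplifications for illustration).

```python
import numpy as np

def empirical_rademacher(G, n_draws=2000, seed=None):
    """Monte Carlo estimate of R_hat_X(Pi) = E_sigma max_pi |(2/n) sum_i sigma_i g_pi(x_i)|,
    where G has shape (num_candidate_functions, n) and G[p, i] = g_pi(x_i)."""
    rng = np.random.default_rng(seed)
    m, n = G.shape
    vals = []
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)       # Rademacher signs
        vals.append(np.max(np.abs((2.0 / n) * G @ sigma)))
    return float(np.mean(vals))
```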

Page 38: Rademacher complexity

McDiarmid's inequality ⇒ R̂_X(Π) ≈ R(Π) with high probability.

Indeed, it is easy to verify that McDiarmid's theorem applies with c_i = 2M/n, showing that with a probability of at least 1 − δ/2,

R(Π) ≤ R̂_X(Π) + 2M √( ln(2/δ) / (2n) )

Page 39: Rademacher bounds

The resulting empirical Rademacher-type bound is given by

π(X) − E_X{π(X)} = π(X) − E_x{g_π(x)} ≤ R̂_X(Π) + 3M √( ln(2/δ) / (2n) )

which holds with probability 1 − 2·(δ/2) = 1 − δ over random draws of X. Here, M is an upper bound on |g_π(x) − g_π(x̃)| for all x, x̃.

The power of this type of bound:
- Quite tight, data-dependent
- R̂_X(Π) is usually easy to bound

Page 40: Ridge regression stability bound (without offset)

We prove stability for the constrained (norm-bounded) formulation:

min_w (1/n) ||Xw − y||²  s.t.  ||w||² ≤ c

Assume ||x||² ≤ R_x² and |y| ≤ R_y. Then

0 ≤ g_{π_w}(x, y) = (x'w − y)² ≤ cR_x² + 2√c R_x R_y + R_y², so:

M = cR_x² + 2√c R_x R_y + R_y²

Empirical Rademacher complexity:

R̂_X(Π) ≤ (2/n) c √( Σ_{i=1}^n (x_i'x_i)² ) + (2/n) √( Σ_{i=1}^n y_i⁴ ) + (4/n) √c √( Σ_{i=1}^n y_i² (x_i'x_i) ).
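
The final bound is a simple data-dependent quantity; a short sketch of its computation (function name and interface are mine):

```python
import numpy as np

def ridge_rademacher_bound(X, y, c):
    """Data-dependent bound on the empirical Rademacher complexity of constrained
    ridge regression (||w||^2 <= c), following the formula on this slide."""
    n = X.shape[0]
    sq_norms = np.sum(X * X, axis=1)                      # x_i' x_i
    term1 = (2.0 / n) * c * np.sqrt(np.sum(sq_norms ** 2))
    term2 = (2.0 / n) * np.sqrt(np.sum(y ** 4))
    term3 = (4.0 / n) * np.sqrt(c) * np.sqrt(np.sum((y ** 2) * sq_norms))
    return term1 + term2 + term3
```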

Page 41: Ridge regression stability bound (without offset)

R̂_X(Π) = E_σ{ max_w | (2/n) Σ_{i=1}^n σ_i (x_i'w − y_i)² | }
        = E_σ{ max_w | (2/n) Σ_{i=1}^n σ_i ( (x_i'w)² + y_i² − 2 y_i x_i'w ) | }
        ≤ (2/n) E_σ{ max_w ( |Σ_{i=1}^n σ_i (x_i'w)²| + |Σ_{i=1}^n σ_i y_i²| + 2 |Σ_{i=1}^n σ_i y_i x_i'w| ) }
        ≤ (2/n) E_σ{ max_w |⟨Σ_{i=1}^n σ_i x_i x_i', ww'⟩| + |Σ_{i=1}^n σ_i y_i²| + 2 max_w |⟨Σ_{i=1}^n σ_i y_i x_i, w⟩| }

Page 42: Ridge regression stability bound (without offset)

R̂_X(Π) ≤ (2/n) E_σ{ c √( Σ_{i,j=1}^n ⟨σ_i x_i x_i', σ_j x_j x_j'⟩ ) + √( Σ_{i,j=1}^n σ_i σ_j y_i² y_j² ) + 2√c √( Σ_{i,j=1}^n ⟨σ_i y_i x_i, σ_j y_j x_j⟩ ) }

        ≤ (2/n) c √( E_σ{ Σ_{i,j=1}^n σ_i σ_j ⟨x_i x_i', x_j x_j'⟩ } ) + (2/n) √( E_σ{ Σ_{i,j=1}^n σ_i σ_j y_i² y_j² } ) + (4/n) √c √( E_σ{ Σ_{i,j=1}^n σ_i σ_j ⟨y_i x_i, y_j x_j⟩ } )

(The first step uses Cauchy–Schwarz for the maxima over ||w||² ≤ c; the second uses Jensen's inequality to move E_σ inside the square roots.)

Page 43: Ridge regression stability bound (without offset)

R̂_X(Π) ≤ (2/n) c √( E_σ{ Σ_{i,j=1}^n σ_i σ_j ⟨x_i x_i', x_j x_j'⟩ } ) + (2/n) √( E_σ{ Σ_{i,j=1}^n σ_i σ_j y_i² y_j² } ) + (4/n) √c √( E_σ{ Σ_{i,j=1}^n σ_i σ_j ⟨y_i x_i, y_j x_j⟩ } )

        ≤ (2/n) c √( Σ_{i=1}^n (x_i'x_i)² ) + (2/n) √( Σ_{i=1}^n y_i⁴ ) + (4/n) √c √( Σ_{i=1}^n y_i² (x_i'x_i) )

(Since E_σ{σ_i σ_j} = 1 if i = j and 0 otherwise, only the diagonal terms survive.)

Page 44: Wrap-up

Supervised learning methods:
- Ridge regression revisited (now with offset)
- Fisher's discriminant analysis
- Support vector machine

Kernel versions of these methods

Statistical study:
- In general, for averaging pattern functions, using Rademacher complexities
- In particular, applied to ridge regression

