Page 1

The Learning Problem and Regularization

Tomaso Poggio

9.520 Class 02

February 2011

Page 2

About this class

Theme: We introduce the learning problem as the problem of function approximation from sparse data. We define the key ideas of loss functions, empirical error and generalization error. We then introduce the Empirical Risk Minimization approach and the two key requirements on algorithms using it: generalization and stability. We then describe a key algorithm – Tikhonov regularization – that satisfies these requirements.

Math Required: Familiarity with basic ideas in probability theory.


Page 4

Plan

Setting up the learning problem: definitions
Generalization and Stability
Empirical Risk Minimization
Regularization
Appendix: Sample and Approximation Error

Page 5

Data Generated By A Probability Distribution

We assume that there are an "input" space X and an "output" space Y. We are given a training set S consisting of n samples drawn i.i.d. from the probability distribution µ(z) on Z = X × Y:

(x_1, y_1), ..., (x_n, y_n)

that is, z_1, ..., z_n. We will use the conditional probability of y given x, written p(y|x):

µ(z) = p(x, y) = p(y|x) · p(x)

It is crucial to note that we view p(x, y) as fixed but unknown.
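As a concrete illustration (not part of the original slides), here is a minimal Python sketch of this setting: a particular p(x) and p(y|x) are fixed, and a training set of n i.i.d. pairs is drawn from them; the specific distributions (uniform inputs, a noisy sine) are illustrative assumptions, and the learner is only ever shown the samples, never the distribution.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_training_set(n):
        """Draw n i.i.d. pairs (x_i, y_i) from mu(z) = p(y|x) p(x).
        The particular p(x) and p(y|x) are assumptions for illustration."""
        x = rng.uniform(0.0, 1.0, size=n)                      # x ~ p(x)
        y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=n)   # y ~ p(y|x)
        return x, y

    S = sample_training_set(20)   # the training set S = {(x_1, y_1), ..., (x_n, y_n)}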

Page 6

Probabilistic setting

[Diagram: input space X with marginal P(x); output space Y with conditional P(y|x).]

Page 7

Hypothesis Space

The hypothesis space H is the space of functions that we allow our algorithm to provide. For many algorithms (such as optimization algorithms) it is the space the algorithm is allowed to search. As we will see in future classes, it is often important to choose the hypothesis space as a function of the amount of data n available.

Page 8

Learning As Function Approximation From Samples: Regression and Classification

The basic goal of supervised learning is to use the training set S to "learn" a function f_S that looks at a new x value x_new and predicts the associated value of y:

y_pred = f_S(x_new)

If y is a real-valued random variable, we have regression. If y takes values from an unordered finite set, we have pattern classification. In two-class pattern classification problems, we assign one class a y value of 1, and the other class a y value of −1.

Page 9

Loss Functions

In order to measure the goodness of our function, we need a loss function V. In general, we let V(f, z) = V(f(x), y) denote the price we pay when we see x and guess that the associated y value is f(x) when it is actually y.

Page 10

Common Loss Functions For Regression

For regression, the most common loss function is square loss or L2 loss:

V(f(x), y) = (f(x) − y)^2

We could also use the absolute value, or L1 loss:

V(f(x), y) = |f(x) − y|

Vapnik's more general ε-insensitive loss function is:

V(f(x), y) = (|f(x) − y| − ε)_+
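A short Python transcription of these three regression losses (an illustration, not code from the course; the default ε below is an arbitrary choice):

    import numpy as np

    def square_loss(fx, y):
        # L2 loss: (f(x) - y)^2
        return (fx - y) ** 2

    def absolute_loss(fx, y):
        # L1 loss: |f(x) - y|
        return np.abs(fx - y)

    def eps_insensitive_loss(fx, y, eps=0.1):
        # Vapnik's eps-insensitive loss: (|f(x) - y| - eps)_+
        return np.maximum(np.abs(fx - y) - eps, 0.0)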

Page 11

Common Loss Functions For Classification

For binary classification, the most intuitive loss is the 0-1 loss:

V(f(x), y) = Θ(−y f(x))

where Θ(−y f(x)) is the step function and y is binary, e.g. y = +1 or y = −1. For tractability and other reasons, we often use the hinge loss (implicitly introduced by Vapnik) in binary classification:

V(f(x), y) = (1 − y · f(x))_+
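And the corresponding sketch for the two classification losses, with y in {−1, +1} (again only an illustration; here a prediction of exactly zero is counted as an error):

    import numpy as np

    def zero_one_loss(fx, y):
        # 0-1 loss: Theta(-y f(x)); 1 if the sign of f(x) disagrees with y, else 0
        return (y * fx <= 0).astype(float)

    def hinge_loss(fx, y):
        # hinge loss: (1 - y f(x))_+, a convex surrogate for the 0-1 loss
        return np.maximum(1.0 - y * fx, 0.0)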

Page 12

The learning problem: summary so far

There is an unknown probability distribution on the product space Z = X × Y, written µ(z) = µ(x, y). We assume that X is a compact domain in Euclidean space and Y a bounded subset of R. The training set S = {(x_1, y_1), ..., (x_n, y_n)} = {z_1, ..., z_n} consists of n samples drawn i.i.d. from µ.

H is the hypothesis space, a space of functions f : X → Y.

A learning algorithm is a map L : Z^n → H that looks at S and selects from H a function f_S : x → y such that f_S(x) ≈ y in a predictive way.

Page 13

Expected error, empirical error

Given a function f, a loss function V, and a probability distribution µ over Z, the expected or true error of f is:

I[f] = E_z V[f, z] = ∫_Z V(f, z) dµ(z)

which is the expected loss on a new example drawn at random from µ. We would like to make I[f] small, but in general we do not know µ.

Given a function f, a loss function V, and a training set S consisting of n data points, the empirical error of f is:

I_S[f] = (1/n) Σ_{i=1}^n V(f, z_i)
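The sketch below contrasts the two quantities on the synthetic distribution used earlier: the empirical error is an average over the n training points, while the expected error is approximated here by Monte Carlo on a large fresh sample. The Monte Carlo step is only possible because this is a simulation; in practice µ is unknown and I[f] cannot be computed.

    import numpy as np

    rng = np.random.default_rng(0)

    def draw(n):
        # illustrative mu(z): x ~ Uniform(0,1), y = sin(2*pi*x) + noise
        x = rng.uniform(0.0, 1.0, size=n)
        y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=n)
        return x, y

    def V(fx, y):
        return (fx - y) ** 2            # square loss

    f = lambda x: 0.0 * x               # a fixed (here: constant zero) predictor

    x_tr, y_tr = draw(20)
    I_S = np.mean(V(f(x_tr), y_tr))     # empirical error I_S[f] on the training set

    x_big, y_big = draw(200_000)
    I = np.mean(V(f(x_big), y_big))     # Monte Carlo estimate of the expected error I[f]

    print(f"I_S[f] = {I_S:.3f},  I[f] ~= {I:.3f}")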

Page 14

Plan

Setting up the learning problem: definitions
Generalization and Stability
Empirical Risk Minimization
Regularization
Appendix: Sample and Approximation Error

Page 15

A reminder: convergence in probability

Let {X_n} be a sequence of bounded random variables. We say that

lim_{n→∞} X_n = X in probability

if

∀ε > 0   lim_{n→∞} P{|X_n − X| ≥ ε} = 0.
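A quick numerical illustration (not from the slides): for sample means X_n of i.i.d. Uniform(0,1) variables converging to X = 1/2, the probability that |X_n − X| exceeds a fixed ε shrinks as n grows, which is exactly what convergence in probability asserts. The ε and sample sizes below are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(0)
    eps, trials = 0.05, 2_000

    for n in (10, 100, 1000, 5000):
        # X_n is the mean of n Uniform(0,1) draws; the limit X is 1/2
        Xn = rng.uniform(size=(trials, n)).mean(axis=1)
        p = np.mean(np.abs(Xn - 0.5) >= eps)   # empirical estimate of P{|X_n - X| >= eps}
        print(f"n = {n:5d}:  P(|X_n - 1/2| >= {eps}) ~= {p:.4f}")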

Page 16

Generalization

A natural requirement for f_S is distribution-independent generalization

lim_{n→∞} |I_S[f_S] − I[f_S]| = 0 in probability

This is equivalent to saying that for each n there exist an ε_n and a δ(ε_n) such that

P{ |I_{S_n}[f_{S_n}] − I[f_{S_n}]| ≥ ε_n } ≤ δ(ε_n),

with ε_n and δ going to zero for n → ∞. In other words, the training error for the solution must converge to the expected error and thus be a "proxy" for it. Otherwise the solution would not be "predictive".

A desirable additional requirement is consistency:

∀ε > 0   lim_{n→∞} P{ I[f_S] − inf_{f∈H} I[f] ≥ ε } = 0.

Page 17

Finite Samples and Convergence Rates

More satisfactory results give guarantees for a finite number of points: this is related to convergence rates.

Suppose we can prove that with probability at least 1 − e^{−τ^2} we have

|I_S[f_S] − I[f_S]| ≤ (C/√n) τ

for some (problem-dependent) constant C.

The above result gives a convergence rate. If we fix ε and τ and solve the equation ε = (C/√n) τ for n, we obtain the sample complexity:

n(ε, τ) = C^2 τ^2 / ε^2

the number of samples needed to obtain an error ε with confidence 1 − e^{−τ^2}.
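In code the algebra is a one-liner; the constant C below is a placeholder, since it is problem dependent and not specified in the slides.

    import math

    def sample_complexity(eps, tau, C=1.0):
        """n(eps, tau) = C^2 * tau^2 / eps^2, obtained by solving
        eps = (C / sqrt(n)) * tau for n.  C is an assumed constant."""
        return math.ceil((C ** 2) * (tau ** 2) / (eps ** 2))

    # e.g. error eps = 0.1 with confidence 1 - exp(-tau^2) for tau = 2 (about 0.98)
    print(sample_complexity(eps=0.1, tau=2.0))   # -> 400 samples when C = 1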

Page 18

Remark: Finite Samples and Convergence Rates

Asymptotic results for generalization and consistency are valid for any distribution µ. It is impossible, however, to guarantee a given convergence rate independently of µ. This is Devroye's no free lunch theorem (see Devroye, Gyorfi, Lugosi, 1997, pp. 112-113, Theorem 7.1). So there are rules that asymptotically provide optimal performance for any distribution. However, their finite sample performance is always extremely bad for some distributions.

So... how do we find good learning algorithms?

Page 19

A learning algorithm should be well-posed, e.g. stable

In addition to the key property of generalization, a "good" learning algorithm should also be stable: f_S should depend continuously on the training set S. In particular, changing one of the training points should affect the solution less and less as n goes to infinity. Stability is a good requirement for the learning problem and, in fact, for any mathematical problem. We open here a small parenthesis on stability and well-posedness.

Page 20

General definition of Well-Posed and Ill-Posed problems

A problem is well-posed if its solution:

exists
is unique
depends continuously on the data (e.g. it is stable)

A problem is ill-posed if it is not well-posed. In the context of this class, well-posedness is mainly used to mean stability of the solution.

Page 21

More on well-posed and ill-posed problems

Hadamard introduced the definition of ill-posedness. Ill-posed problems are typically inverse problems. As an example, assume g is a function in Y and u is a function in X, with Y and X Hilbert spaces. Then, given the linear, continuous operator L, consider the equation

g = Lu.

The direct problem is to compute g given u; the inverse problem is to compute u given the data g. In the learning case L is somewhat similar to a "sampling" operation and the inverse problem becomes the problem of finding a function that takes the values

f(x_i) = y_i,   i = 1, ..., n

The inverse problem of finding u is well-posed when the solution exists, is unique, and is stable, that is, depends continuously on the initial data g.

Ill-posed problems fail to satisfy one or more of these criteria. Often the term ill-posed applies to problems that are not stable, which in a sense is the key condition.
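A tiny numerical example of an unstable linear inverse problem in the spirit of g = Lu (the nearly singular operator below is an illustrative choice, not one taken from the slides): a small perturbation of the data g produces a large change in the recovered u.

    import numpy as np

    # A nearly singular "forward operator" L (assumed, for illustration only)
    L = np.array([[1.0, 1.0],
                  [1.0, 1.0001]])

    u_true = np.array([1.0, 1.0])
    g = L @ u_true                      # direct problem: compute g from u

    u_hat = np.linalg.solve(L, g)       # inverse problem: recover u from g
    u_pert = np.linalg.solve(L, g + np.array([0.0, 1e-3]))   # tiny perturbation of g

    print("u from clean data:     ", u_hat)     # close to [1, 1]
    print("u from perturbed data: ", u_pert)    # roughly [-9, 11]: a huge change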

Page 22

Plan

Setting up the learning problem: definitions
Generalization and Stability
Empirical Risk Minimization
Regularization
Appendix: Sample and Approximation Error

Page 23

ERM

Given a training set S and a function space H, empirical risk minimization (Vapnik introduced the term) is the class of algorithms that look at S and select f_S as

f_S = argmin_{f∈H} I_S[f].

For example, linear regression is ERM when V(z) = (f(x) − y)^2 and H is the space of linear functions f = ax.
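A minimal sketch of this example (ERM with the square loss over H = {f(x) = ax}), written against illustrative 1-D data; the closed-form minimizer a* = <x, y> / <x, x> is just the least-squares solution for a line through the origin.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0.0, 1.0, size=30)
    y = 2.0 * x + 0.1 * rng.normal(size=30)     # assumed data with a roughly linear trend

    def empirical_risk(a):
        # I_S[f] for f(x) = a x with the square loss
        return np.mean((a * x - y) ** 2)

    # ERM: minimize I_S over H = {f(x) = a x}; here the minimizer has a closed form
    a_star = np.dot(x, y) / np.dot(x, x)
    f_S = lambda x_new: a_star * x_new

    print(f"a* = {a_star:.3f},  I_S[f_S] = {empirical_risk(a_star):.4f}")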

Page 24

Generalization and Well-posedness of Empirical Risk Minimization

For ERM to represent a "good" class of learning algorithms, the solution should

generalize
exist, be unique and – especially – be stable (well-posedness).

Page 25

ERM and generalization: given a certain number of samples...

Page 26

...suppose this is the “true” solution...

Page 27

... but suppose ERM gives this solution.

Page 28

Under which conditions does the ERM solution converge, with an increasing number of examples, to the true solution? In other words, what are the conditions for generalization of ERM?

Page 29

ERM and stability: given 10 samples...

Page 30

...we can find the smoothest interpolating polynomial (which degree?).

Page 31

But if we perturb the points slightly...

Page 32

...the solution changes a lot!

Page 33

If we restrict ourselves to degree two polynomials...

Page 34

...the solution varies only a small amount under a small perturbation.
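The experiment in these figures can be reproduced with a few lines of Python (a sketch under assumed data; the original points are not given): fit a degree-9 interpolating polynomial and a degree-2 polynomial to 10 points, perturb the points slightly, and compare how much each fitted curve moves.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0.05, 0.95, 10)
    y = 0.5 + 0.3 * np.sin(2 * np.pi * x)        # illustrative "true" values in [0, 1]
    y_pert = y + 0.01 * rng.normal(size=10)      # slightly perturbed training points

    grid = np.linspace(0.0, 1.0, 200)

    def fit_and_eval(deg, targets):
        # least-squares polynomial fit of the given degree, evaluated on a grid
        coeffs = np.polyfit(x, targets, deg)
        return np.polyval(coeffs, grid)

    for deg in (9, 2):
        change = np.max(np.abs(fit_and_eval(deg, y) - fit_and_eval(deg, y_pert)))
        print(f"degree {deg}: max change of the fitted curve = {change:.3f}")
    # the degree-9 interpolant typically moves far more than the degree-2 fit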

Page 35

ERM: conditions for well-posedness (stability) and predictivity (generalization)

Since Tikhonov, it is well known that a generally ill-posed problem such as ERM can be guaranteed to be well-posed, and therefore stable, by an appropriate choice of H. For example, compactness of H guarantees stability. It seems intriguing that the classical conditions for consistency of ERM – thus quite a different property – consist of appropriately restricting H. It seems that the same restrictions that make the approximation of the data stable may provide solutions that generalize...

Page 36

ERM: conditions for well-posedness (stability) and predictivity (generalization)

We would like to have a hypothesis space that yields generalization. Loosely speaking, this would be an H for which the solution of ERM, say f_S, is such that |I_S[f_S] − I[f_S]| converges to zero in probability as n increases. Note that the above requirement is NOT the law of large numbers; the requirement, for a fixed f, that |I_S[f] − I[f]| converges to zero in probability as n increases IS the law of large numbers.

Page 37

ERM: conditions for well-posedness (stability) and predictivity (generalization)

Theorem [Vapnik and Cervonenkis (71), Alon et al. (97), Dudley, Giné, and Zinn (91)]

A (necessary) and sufficient condition for generalization (and consistency) of ERM is that H is uGC.

Definition: H is a (weak) uniform Glivenko-Cantelli (uGC) class if

∀ε > 0   lim_{n→∞} sup_µ P_S { sup_{f∈H} |I[f] − I_S[f]| > ε } = 0.

Page 38

ERM: conditions for well-posedness (stability) and predictivity (generalization)

The theorem (Vapnik et al.) says that a proper choice of the hypothesis space H ensures generalization of ERM (and consistency, since for ERM generalization is necessary and sufficient for consistency and vice versa). Other results characterize uGC classes in terms of measures of complexity or capacity of H (such as the VC dimension).

A separate theorem (Niyogi, Poggio et al., mentioned in the last class) also guarantees stability (defined in a specific way) of ERM. Thus, with the appropriate definition of stability, stability and generalization are equivalent for ERM.

Thus the two desirable conditions for a learning algorithm – generalization and stability – are equivalent (and they correspond to the same constraints on H).

Page 39

Plan

Setting up the learning problem: definitions
Generalization and Stability
Empirical Risk Minimization
Regularization
Appendix: Sample and Approximation Error

Page 40

Regularization

Regularization (originally introduced by Tikhonov independently of the learning problem) ensures well-posedness and (because of the above argument) generalization of ERM by constraining the hypothesis space H. The direct way – minimize the empirical error subject to f in a ball in an appropriate H – is called Ivanov regularization. The indirect way is Tikhonov regularization (which is not strictly ERM).

Page 41

Ivanov and Tikhonov Regularization

ERM finds the function in H which minimizes

(1/n) Σ_{i=1}^n V(f(x_i), y_i)

which in general – for an arbitrary hypothesis space H – is ill-posed.

Ivanov regularizes by finding the function that minimizes

(1/n) Σ_{i=1}^n V(f(x_i), y_i)

while satisfying R(f) ≤ A.

Tikhonov regularization minimizes over the hypothesis space H, for a fixed positive parameter γ, the regularized functional

(1/n) Σ_{i=1}^n V(f(x_i), y_i) + γ R(f).   (1)

R(f) is the regularizer, a penalization on f. In this course we will mainly discuss the case R(f) = ‖f‖_K^2, where ‖f‖_K is the norm in the Reproducing Kernel Hilbert Space (RKHS) H defined by the kernel K.
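As a concrete instance of (1), here is a sketch of Tikhonov regularization with the square loss and R(f) = ‖f‖_K^2 for a Gaussian kernel, i.e. kernel ridge regression. The representation f_S(x) = Σ_j c_j K(x, x_j) and the linear system for c are only previewed here (the solution is derived in a later class); the kernel width, γ, and the synthetic data are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0.0, 1.0, size=30)
    y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=30)

    def gaussian_kernel(a, b, sigma=0.2):
        # K(a, b) = exp(-|a - b|^2 / (2 sigma^2)); sigma is an assumed choice
        return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * sigma ** 2))

    n, gamma = len(x), 1e-3
    K = gaussian_kernel(x, x)

    # Minimizing (1/n) sum_i (f(x_i) - y_i)^2 + gamma ||f||_K^2 over the RKHS
    # leads to f_S(x) = sum_j c_j K(x, x_j), with c solving a linear system:
    c = np.linalg.solve(K + gamma * n * np.eye(n), y)

    def f_S(x_new):
        return gaussian_kernel(np.atleast_1d(x_new), x) @ c

    print(f"training error I_S[f_S] = {np.mean((f_S(x) - y) ** 2):.4f}")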

Page 42

Tikhonov Regularization

As we will see in future classes

Tikhonov regularization ensures well-posedness, e.g. existence, uniqueness and especially stability (in a very strong form) of the solution
Tikhonov regularization ensures generalization
Tikhonov regularization is closely related to – but different from – Ivanov regularization, e.g. ERM on a hypothesis space H which is a ball in an RKHS.

Page 43

Next Class

In the next class we will introduce RKHS: they will be the hypothesis spaces we will work with. We will also derive the solution of Tikhonov regularization.

Page 44

Plan

Setting up the learning problem: definitions
Generalization and Stability
Empirical Risk Minimization
Regularization
Appendix: Sample and Approximation Error

Page 45

Generalization, Sample Error and Approximation Error

Generalization error is I_S[f_S] − I[f_S].
Sample error is I[f_S] − I[f_H].
Approximation error is I[f_H] − I[f_0].
Error is I[f_S] − I[f_0] = (I[f_S] − I[f_H]) + (I[f_H] − I[f_0]).

Page 46

Appendix: Target Space, Sample and Approximation Error

In addition to the hypothesis space H, the space we allow our algorithms to search, we define... The target space T is a space of functions, chosen a priori in any given problem, that is assumed to contain the "true" function f_0 that minimizes the risk. Often, T is chosen to be all functions in L2, or all differentiable functions. Notice that the "true" function, if it exists, is defined by µ(z), which contains all the relevant information.

Page 47

Sample Error (also called Estimation Error)

Let f_H be the function in H with the smallest true risk. We have defined the generalization error to be I_S[f_S] − I[f_S]. We define the sample error to be I[f_S] − I[f_H], the difference in true risk between the best function in H and the function in H we actually find. This is what we pay because our finite sample does not give us enough information to choose the "best" function in H. We'd like this to be small. Consistency – defined earlier – is equivalent to the sample error going to zero for n → ∞.

A main goal in classical learning theory (Vapnik, Smale, ...) is "bounding" the generalization error. Another goal – for learning theory and statistics – is bounding the sample error, that is, determining conditions under which we can state that I[f_S] − I[f_H] will be small (with high probability).

As a simple rule, we expect that if H is "well-behaved", then, as n gets large, the sample error will become small.

Page 48

Approximation Error

Let f_0 be the function in T with the smallest true risk. We define the approximation error to be I[f_H] − I[f_0], the difference in true risk between the best function in H and the best function in T. This is what we pay when H is smaller than T. We'd like this error to be small too. In much of the following we can assume that I[f_0] = 0.

We will focus less on the approximation error in 9.520, but we will explore it.

As a simple rule, we expect that as H grows bigger, the approximation error gets smaller. If T ⊆ H – which is a situation called the realizable setting – the approximation error is zero.

Page 49

Error

We define the error to be I[f_S] − I[f_0], the difference in true risk between the function we actually find and the best function in T. We'd really like this to be small. As we mentioned, often we can assume that the error is simply I[f_S]. The error is the sum of the sample error and the approximation error:

I[f_S] − I[f_0] = (I[f_S] − I[f_H]) + (I[f_H] − I[f_0])

If we can make both the approximation and the sample error small, the error will be small. There is a tradeoff between the approximation error and the sample error...

Page 50

The Approximation/Sample Tradeoff

It should already be intuitively clear that making H big makes the approximation error small. This implies that we can (help) make the error small by making H big. On the other hand, we will show that making H small will make the sample error small. In particular, for ERM, if H is a uGC class, the generalization error and the sample error will go to zero as n → ∞, but how quickly depends directly on the "size" of H. This implies that we want to keep H as small as possible. (Furthermore, T itself may or may not be a uGC class.) Ideally, we would like to find the optimal tradeoff between these conflicting requirements.

Page 51

Generalization, Sample Error and Approximation Error

Generalization error is I_S[f_S] − I[f_S].
Sample error is I[f_S] − I[f_H].
Approximation error is I[f_H] − I[f_0].
Error is I[f_S] − I[f_0] = (I[f_S] − I[f_H]) + (I[f_H] − I[f_0]).
