Page 1

The Learning Problem and Regularization

Tomaso Poggio

9.520 Class 02

February 2011

Page 2

About this class

Theme: We introduce the learning problem as the problem of function approximation from sparse data. We define the key ideas of loss functions, empirical error and generalization error. We then introduce the Empirical Risk Minimization approach and the two key requirements on algorithms using it: generalization and stability. We then describe a key algorithm – Tikhonov regularization – that satisfies these requirements.

Math Required: Familiarity with basic ideas in probability theory.


Page 4

Plan

Setting up the learning problem: definitions
Generalization and Stability
Empirical Risk Minimization
Regularization
Appendix: Sample and Approximation Error

Page 5

Data Generated By A Probability Distribution

We assume that there are an "input" space X and an "output" space Y. We are given a training set S consisting of n samples drawn i.i.d. from the probability distribution µ(z) on Z = X × Y:

(x_1, y_1), ..., (x_n, y_n)

that is, z_1, ..., z_n. We will use the conditional probability of y given x, written p(y|x):

µ(z) = p(x, y) = p(y|x) · p(x)

It is crucial to note that we view p(x, y) as fixed but unknown.
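As a concrete illustration (not part of the original slides), here is a minimal Python sketch of this setting: a particular p(x) and p(y|x) are fixed, and a training set of n i.i.d. pairs is drawn from them; the specific distributions (uniform inputs, a noisy sine) are illustrative assumptions, and the learner is only ever shown the samples, never the distribution.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_training_set(n):
        """Draw n i.i.d. pairs (x_i, y_i) from mu(z) = p(y|x) p(x).
        The particular p(x) and p(y|x) are assumptions for illustration."""
        x = rng.uniform(0.0, 1.0, size=n)                      # x ~ p(x)
        y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=n)   # y ~ p(y|x)
        return x, y

    S = sample_training_set(20)   # the training set S = {(x_1, y_1), ..., (x_n, y_n)}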

Page 6

Probabilistic setting

[Diagram: input space X with marginal P(x); output space Y with conditional P(y|x).]

Page 7

Hypothesis Space

The hypothesis space H is the space of functions that we allow our algorithm to provide. For many algorithms (such as optimization algorithms) it is the space the algorithm is allowed to search. As we will see in future classes, it is often important to choose the hypothesis space as a function of the amount of data n available.

Page 8

Learning As Function Approximation From Samples: Regression and Classification

The basic goal of supervised learning is to use the training set S to "learn" a function f_S that looks at a new x value x_new and predicts the associated value of y:

y_pred = f_S(x_new)

If y is a real-valued random variable, we have regression. If y takes values from an unordered finite set, we have pattern classification. In two-class pattern classification problems, we assign one class a y value of 1, and the other class a y value of −1.

Page 9

Loss Functions

In order to measure the goodness of our function, we need a loss function V. In general, we let V(f, z) = V(f(x), y) denote the price we pay when we see x and guess that the associated y value is f(x) when it is actually y.

Page 10

Common Loss Functions For Regression

For regression, the most common loss function is square loss or L2 loss:

V(f(x), y) = (f(x) − y)^2

We could also use the absolute value, or L1 loss:

V(f(x), y) = |f(x) − y|

Vapnik's more general ε-insensitive loss function is:

V(f(x), y) = (|f(x) − y| − ε)_+
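A short Python transcription of these three regression losses (an illustration, not code from the course; the default ε below is an arbitrary choice):

    import numpy as np

    def square_loss(fx, y):
        # L2 loss: (f(x) - y)^2
        return (fx - y) ** 2

    def absolute_loss(fx, y):
        # L1 loss: |f(x) - y|
        return np.abs(fx - y)

    def eps_insensitive_loss(fx, y, eps=0.1):
        # Vapnik's eps-insensitive loss: (|f(x) - y| - eps)_+
        return np.maximum(np.abs(fx - y) - eps, 0.0)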

Page 11

Common Loss Functions For Classification

For binary classification, the most intuitive loss is the 0-1 loss:

V(f(x), y) = Θ(−y f(x))

where Θ(−y f(x)) is the step function and y is binary, e.g. y = +1 or y = −1. For tractability and other reasons, we often use the hinge loss (implicitly introduced by Vapnik) in binary classification:

V(f(x), y) = (1 − y · f(x))_+
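And the corresponding sketch for the two classification losses, with y in {−1, +1} (again only an illustration; here a prediction of exactly zero is counted as an error):

    import numpy as np

    def zero_one_loss(fx, y):
        # 0-1 loss: Theta(-y f(x)); 1 if the sign of f(x) disagrees with y, else 0
        return (y * fx <= 0).astype(float)

    def hinge_loss(fx, y):
        # hinge loss: (1 - y f(x))_+, a convex surrogate for the 0-1 loss
        return np.maximum(1.0 - y * fx, 0.0)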

Page 12

The learning problem: summary so far

There is an unknown probability distribution on the product space Z = X × Y, written µ(z) = µ(x, y). We assume that X is a compact domain in Euclidean space and Y a bounded subset of R. The training set S = {(x_1, y_1), ..., (x_n, y_n)} = {z_1, ..., z_n} consists of n samples drawn i.i.d. from µ.

H is the hypothesis space, a space of functions f : X → Y.

A learning algorithm is a map L : Z^n → H that looks at S and selects from H a function f_S : x → y such that f_S(x) ≈ y in a predictive way.

Page 13

Expected error, empirical error

Given a function f, a loss function V, and a probability distribution µ over Z, the expected or true error of f is:

I[f] = E_z V[f, z] = ∫_Z V(f, z) dµ(z)

which is the expected loss on a new example drawn at random from µ. We would like to make I[f] small, but in general we do not know µ.

Given a function f, a loss function V, and a training set S consisting of n data points, the empirical error of f is:

I_S[f] = (1/n) Σ_{i=1}^n V(f, z_i)
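The sketch below contrasts the two quantities on the synthetic distribution used earlier: the empirical error is an average over the n training points, while the expected error is approximated here by Monte Carlo on a large fresh sample. The Monte Carlo step is only possible because this is a simulation; in practice µ is unknown and I[f] cannot be computed.

    import numpy as np

    rng = np.random.default_rng(0)

    def draw(n):
        # illustrative mu(z): x ~ Uniform(0,1), y = sin(2*pi*x) + noise
        x = rng.uniform(0.0, 1.0, size=n)
        y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=n)
        return x, y

    def V(fx, y):
        return (fx - y) ** 2            # square loss

    f = lambda x: 0.0 * x               # a fixed (here: constant zero) predictor

    x_tr, y_tr = draw(20)
    I_S = np.mean(V(f(x_tr), y_tr))     # empirical error I_S[f] on the training set

    x_big, y_big = draw(200_000)
    I = np.mean(V(f(x_big), y_big))     # Monte Carlo estimate of the expected error I[f]

    print(f"I_S[f] = {I_S:.3f},  I[f] ~= {I:.3f}")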

Page 14

Plan

Setting up the learning problem: definitions
Generalization and Stability
Empirical Risk Minimization
Regularization
Appendix: Sample and Approximation Error

Page 15

A reminder: convergence in probability

Let {X_n} be a sequence of bounded random variables. We say that

lim_{n→∞} X_n = X in probability

if

∀ε > 0   lim_{n→∞} P{|X_n − X| ≥ ε} = 0.
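A quick numerical illustration (not from the slides): for sample means X_n of i.i.d. Uniform(0,1) variables converging to X = 1/2, the probability that |X_n − X| exceeds a fixed ε shrinks as n grows, which is exactly what convergence in probability asserts. The ε and sample sizes below are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(0)
    eps, trials = 0.05, 2_000

    for n in (10, 100, 1000, 5000):
        # X_n is the mean of n Uniform(0,1) draws; the limit X is 1/2
        Xn = rng.uniform(size=(trials, n)).mean(axis=1)
        p = np.mean(np.abs(Xn - 0.5) >= eps)   # empirical estimate of P{|X_n - X| >= eps}
        print(f"n = {n:5d}:  P(|X_n - 1/2| >= {eps}) ~= {p:.4f}")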

Page 16

Generalization

A natural requirement for f_S is distribution-independent generalization

lim_{n→∞} |I_S[f_S] − I[f_S]| = 0 in probability

This is equivalent to saying that for each n there exist an ε_n and a δ(ε_n) such that

P{ |I_{S_n}[f_{S_n}] − I[f_{S_n}]| ≥ ε_n } ≤ δ(ε_n),

with ε_n and δ going to zero for n → ∞. In other words, the training error for the solution must converge to the expected error and thus be a "proxy" for it. Otherwise the solution would not be "predictive".

A desirable additional requirement is consistency:

∀ε > 0   lim_{n→∞} P{ I[f_S] − inf_{f∈H} I[f] ≥ ε } = 0.

Page 17

Finite Samples and Convergence Rates

More satisfactory results give guarantees for a finite number of points: this is related to convergence rates.

Suppose we can prove that with probability at least 1 − e^{−τ^2} we have

|I_S[f_S] − I[f_S]| ≤ (C/√n) τ

for some (problem-dependent) constant C.

The above result gives a convergence rate. If we fix ε and τ and solve the equation ε = (C/√n) τ for n, we obtain the sample complexity:

n(ε, τ) = C^2 τ^2 / ε^2

the number of samples needed to obtain an error ε with confidence 1 − e^{−τ^2}.
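In code the algebra is a one-liner; the constant C below is a placeholder, since it is problem dependent and not specified in the slides.

    import math

    def sample_complexity(eps, tau, C=1.0):
        """n(eps, tau) = C^2 * tau^2 / eps^2, obtained by solving
        eps = (C / sqrt(n)) * tau for n.  C is an assumed constant."""
        return math.ceil((C ** 2) * (tau ** 2) / (eps ** 2))

    # e.g. error eps = 0.1 with confidence 1 - exp(-tau^2) for tau = 2 (about 0.98)
    print(sample_complexity(eps=0.1, tau=2.0))   # -> 400 samples when C = 1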

Page 18

Remark: Finite Samples and Convergence Rates

Asymptotic results for generalization and consistency are valid for any distribution µ. It is impossible, however, to guarantee a given convergence rate independently of µ. This is Devroye's no free lunch theorem (see Devroye, Gyorfi, Lugosi, 1997, pp. 112-113, Theorem 7.1). So there are rules that asymptotically provide optimal performance for any distribution. However, their finite sample performance is always extremely bad for some distributions.

So... how do we find good learning algorithms?

Page 19

A learning algorithm should be well-posed, e.g. stable

In addition to the key property of generalization, a "good" learning algorithm should also be stable: f_S should depend continuously on the training set S. In particular, changing one of the training points should affect the solution less and less as n goes to infinity. Stability is a good requirement for the learning problem and, in fact, for any mathematical problem. We open here a small parenthesis on stability and well-posedness.

Page 20

General definition of Well-Posed and Ill-Posed problems

A problem is well-posed if its solution:

exists
is unique
depends continuously on the data (e.g. it is stable)

A problem is ill-posed if it is not well-posed. In the context of this class, well-posedness is mainly used to mean stability of the solution.

Page 21

More on well-posed and ill-posed problems

Hadamard introduced the definition of ill-posedness. Ill-posed problems are typically inverse problems. As an example, assume g is a function in Y and u is a function in X, with Y and X Hilbert spaces. Then, given the linear, continuous operator L, consider the equation

g = Lu.

The direct problem is to compute g given u; the inverse problem is to compute u given the data g. In the learning case L is somewhat similar to a "sampling" operation and the inverse problem becomes the problem of finding a function that takes the values

f(x_i) = y_i,   i = 1, ..., n

The inverse problem of finding u is well-posed when the solution exists, is unique, and is stable, that is, depends continuously on the initial data g.

Ill-posed problems fail to satisfy one or more of these criteria. Often the term ill-posed applies to problems that are not stable, which in a sense is the key condition.
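A tiny numerical example of an unstable linear inverse problem in the spirit of g = Lu (the nearly singular operator below is an illustrative choice, not one taken from the slides): a small perturbation of the data g produces a large change in the recovered u.

    import numpy as np

    # A nearly singular "forward operator" L (assumed, for illustration only)
    L = np.array([[1.0, 1.0],
                  [1.0, 1.0001]])

    u_true = np.array([1.0, 1.0])
    g = L @ u_true                      # direct problem: compute g from u

    u_hat = np.linalg.solve(L, g)       # inverse problem: recover u from g
    u_pert = np.linalg.solve(L, g + np.array([0.0, 1e-3]))   # tiny perturbation of g

    print("u from clean data:     ", u_hat)     # close to [1, 1]
    print("u from perturbed data: ", u_pert)    # roughly [-9, 11]: a huge change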

Page 22

Plan

Setting up the learning problem: definitions
Generalization and Stability
Empirical Risk Minimization
Regularization
Appendix: Sample and Approximation Error

Page 23

ERM

Given a training set S and a function space H, empirical risk minimization (Vapnik introduced the term) is the class of algorithms that look at S and select f_S as

f_S = argmin_{f∈H} I_S[f].

For example, linear regression is ERM when V(z) = (f(x) − y)^2 and H is the space of linear functions f = ax.
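A minimal sketch of this example (ERM with the square loss over H = {f(x) = ax}), written against illustrative 1-D data; the closed-form minimizer a* = <x, y> / <x, x> is just the least-squares solution for a line through the origin.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0.0, 1.0, size=30)
    y = 2.0 * x + 0.1 * rng.normal(size=30)     # assumed data with a roughly linear trend

    def empirical_risk(a):
        # I_S[f] for f(x) = a x with the square loss
        return np.mean((a * x - y) ** 2)

    # ERM: minimize I_S over H = {f(x) = a x}; here the minimizer has a closed form
    a_star = np.dot(x, y) / np.dot(x, x)
    f_S = lambda x_new: a_star * x_new

    print(f"a* = {a_star:.3f},  I_S[f_S] = {empirical_risk(a_star):.4f}")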

Page 24

Generalization and Well-posedness of Empirical Risk Minimization

For ERM to represent a "good" class of learning algorithms, the solution should

generalize
exist, be unique and – especially – be stable (well-posedness).

Page 25

ERM and generalization: given a certain number of samples...

Page 26

...suppose this is the “true” solution...

Page 27

... but suppose ERM gives this solution.

Page 28

Under which conditions does the ERM solution converge, with an increasing number of examples, to the true solution? In other words, what are the conditions for generalization of ERM?

Page 29

ERM and stability: given 10 samples...

Page 30

...we can find the smoothest interpolating polynomial (which degree?).

Page 31

But if we perturb the points slightly...

Page 32

...the solution changes a lot!

Page 33

If we restrict ourselves to degree two polynomials...

Page 34

...the solution varies only a small amount under a small perturbation.
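The experiment in these figures can be reproduced with a few lines of Python (a sketch under assumed data; the original points are not given): fit a degree-9 interpolating polynomial and a degree-2 polynomial to 10 points, perturb the points slightly, and compare how much each fitted curve moves.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0.05, 0.95, 10)
    y = 0.5 + 0.3 * np.sin(2 * np.pi * x)        # illustrative "true" values in [0, 1]
    y_pert = y + 0.01 * rng.normal(size=10)      # slightly perturbed training points

    grid = np.linspace(0.0, 1.0, 200)

    def fit_and_eval(deg, targets):
        # least-squares polynomial fit of the given degree, evaluated on a grid
        coeffs = np.polyfit(x, targets, deg)
        return np.polyval(coeffs, grid)

    for deg in (9, 2):
        change = np.max(np.abs(fit_and_eval(deg, y) - fit_and_eval(deg, y_pert)))
        print(f"degree {deg}: max change of the fitted curve = {change:.3f}")
    # the degree-9 interpolant typically moves far more than the degree-2 fit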

Page 35

ERM: conditions for well-posedness (stability) and predictivity (generalization)

Since Tikhonov, it is well known that a generally ill-posed problem such as ERM can be guaranteed to be well-posed, and therefore stable, by an appropriate choice of H. For example, compactness of H guarantees stability. It seems intriguing that the classical conditions for consistency of ERM – thus quite a different property – consist of appropriately restricting H. It seems that the same restrictions that make the approximation of the data stable may provide solutions that generalize...

Page 36

ERM: conditions for well-posedness (stability) and predictivity (generalization)

We would like to have a hypothesis space that yields generalization. Loosely speaking, this would be an H for which the solution of ERM, say f_S, is such that |I_S[f_S] − I[f_S]| converges to zero in probability as n increases. Note that the above requirement is NOT the law of large numbers; the requirement, for a fixed f, that |I_S[f] − I[f]| converges to zero in probability as n increases IS the law of large numbers.

Page 37

ERM: conditions for well-posedness (stability) and predictivity (generalization)

Theorem [Vapnik and Cervonenkis (71), Alon et al. (97), Dudley, Giné, and Zinn (91)]

A (necessary) and sufficient condition for generalization (and consistency) of ERM is that H is uGC.

Definition: H is a (weak) uniform Glivenko-Cantelli (uGC) class if

∀ε > 0   lim_{n→∞} sup_µ P_S { sup_{f∈H} |I[f] − I_S[f]| > ε } = 0.

Page 38

ERM: conditions for well-posedness (stability) and predictivity (generalization)

The theorem (Vapnik et al.) says that a proper choice of the hypothesis space H ensures generalization of ERM (and consistency, since for ERM generalization is necessary and sufficient for consistency and vice versa). Other results characterize uGC classes in terms of measures of complexity or capacity of H (such as the VC dimension).

A separate theorem (Niyogi, Poggio et al., mentioned in the last class) also guarantees stability (defined in a specific way) of ERM. Thus, with the appropriate definition of stability, stability and generalization are equivalent for ERM.

Thus the two desirable conditions for a learning algorithm – generalization and stability – are equivalent (and they correspond to the same constraints on H).

Page 39

Plan

Setting up the learning problem: definitions
Generalization and Stability
Empirical Risk Minimization
Regularization
Appendix: Sample and Approximation Error

Page 40

Regularization

Regularization (originally introduced by Tikhonov independently of the learning problem) ensures well-posedness and (because of the above argument) generalization of ERM by constraining the hypothesis space H. The direct way – minimize the empirical error subject to f in a ball in an appropriate H – is called Ivanov regularization. The indirect way is Tikhonov regularization (which is not strictly ERM).

Page 41

Ivanov and Tikhonov Regularization

ERM finds the function in H which minimizes

(1/n) Σ_{i=1}^n V(f(x_i), y_i)

which in general – for an arbitrary hypothesis space H – is ill-posed.

Ivanov regularizes by finding the function that minimizes

(1/n) Σ_{i=1}^n V(f(x_i), y_i)

while satisfying R(f) ≤ A.

Tikhonov regularization minimizes over the hypothesis space H, for a fixed positive parameter γ, the regularized functional

(1/n) Σ_{i=1}^n V(f(x_i), y_i) + γ R(f).   (1)

R(f) is the regularizer, a penalization on f. In this course we will mainly discuss the case R(f) = ‖f‖_K^2, where ‖f‖_K is the norm in the Reproducing Kernel Hilbert Space (RKHS) H defined by the kernel K.
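As a concrete instance of (1), here is a sketch of Tikhonov regularization with the square loss and R(f) = ‖f‖_K^2 for a Gaussian kernel, i.e. kernel ridge regression. The representation f_S(x) = Σ_j c_j K(x, x_j) and the linear system for c are only previewed here (the solution is derived in a later class); the kernel width, γ, and the synthetic data are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0.0, 1.0, size=30)
    y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=30)

    def gaussian_kernel(a, b, sigma=0.2):
        # K(a, b) = exp(-|a - b|^2 / (2 sigma^2)); sigma is an assumed choice
        return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * sigma ** 2))

    n, gamma = len(x), 1e-3
    K = gaussian_kernel(x, x)

    # Minimizing (1/n) sum_i (f(x_i) - y_i)^2 + gamma ||f||_K^2 over the RKHS
    # leads to f_S(x) = sum_j c_j K(x, x_j), with c solving a linear system:
    c = np.linalg.solve(K + gamma * n * np.eye(n), y)

    def f_S(x_new):
        return gaussian_kernel(np.atleast_1d(x_new), x) @ c

    print(f"training error I_S[f_S] = {np.mean((f_S(x) - y) ** 2):.4f}")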

Page 42

Tikhonov Regularization

As we will see in future classes

Tikhonov regularization ensures well-posedness, e.g. existence, uniqueness and especially stability (in a very strong form) of the solution
Tikhonov regularization ensures generalization
Tikhonov regularization is closely related to – but different from – Ivanov regularization, e.g. ERM on a hypothesis space H which is a ball in an RKHS.

Page 43

Next Class

In the next class we will introduce RKHS: they will be the hypothesis spaces we will work with. We will also derive the solution of Tikhonov regularization.

Page 44

Plan

Setting up the learning problem: definitions
Generalization and Stability
Empirical Risk Minimization
Regularization
Appendix: Sample and Approximation Error

Page 45

Generalization, Sample Error and Approximation Error

Generalization error is I_S[f_S] − I[f_S].
Sample error is I[f_S] − I[f_H].
Approximation error is I[f_H] − I[f_0].
Error is I[f_S] − I[f_0] = (I[f_S] − I[f_H]) + (I[f_H] − I[f_0]).

Page 46

Appendix: Target Space, Sample and Approximation Error

In addition to the hypothesis space H, the space we allow our algorithms to search, we define... The target space T is a space of functions, chosen a priori in any given problem, that is assumed to contain the "true" function f_0 that minimizes the risk. Often, T is chosen to be all functions in L2, or all differentiable functions. Notice that the "true" function, if it exists, is defined by µ(z), which contains all the relevant information.

Page 47

Sample Error (also called Estimation Error)

Let f_H be the function in H with the smallest true risk. We have defined the generalization error to be I_S[f_S] − I[f_S]. We define the sample error to be I[f_S] − I[f_H], the difference in true risk between the best function in H and the function in H we actually find. This is what we pay because our finite sample does not give us enough information to choose the "best" function in H. We'd like this to be small. Consistency – defined earlier – is equivalent to the sample error going to zero for n → ∞.

A main goal in classical learning theory (Vapnik, Smale, ...) is "bounding" the generalization error. Another goal – for learning theory and statistics – is bounding the sample error, that is, determining conditions under which we can state that I[f_S] − I[f_H] will be small (with high probability).

As a simple rule, we expect that if H is "well-behaved", then, as n gets large, the sample error will become small.

Page 48

Approximation Error

Let f_0 be the function in T with the smallest true risk. We define the approximation error to be I[f_H] − I[f_0], the difference in true risk between the best function in H and the best function in T. This is what we pay when H is smaller than T. We'd like this error to be small too. In much of the following we can assume that I[f_0] = 0.

We will focus less on the approximation error in 9.520, but we will explore it.

As a simple rule, we expect that as H grows bigger, the approximation error gets smaller. If T ⊆ H – which is a situation called the realizable setting – the approximation error is zero.

Page 49

Error

We define the error to be I[f_S] − I[f_0], the difference in true risk between the function we actually find and the best function in T. We'd really like this to be small. As we mentioned, often we can assume that the error is simply I[f_S]. The error is the sum of the sample error and the approximation error:

I[f_S] − I[f_0] = (I[f_S] − I[f_H]) + (I[f_H] − I[f_0])

If we can make both the approximation and the sample error small, the error will be small. There is a tradeoff between the approximation error and the sample error...

Page 50

The Approximation/Sample Tradeoff

It should already be intuitively clear that making H big makes the approximation error small. This implies that we can (help) make the error small by making H big. On the other hand, we will show that making H small will make the sample error small. In particular, for ERM, if H is a uGC class, the generalization error and the sample error will go to zero as n → ∞, but how quickly depends directly on the "size" of H. This implies that we want to keep H as small as possible. (Furthermore, T itself may or may not be a uGC class.) Ideally, we would like to find the optimal tradeoff between these conflicting requirements.

Page 51

Generalization, Sample Error and Approximation Error

Generalization error is I_S[f_S] − I[f_S].
Sample error is I[f_S] − I[f_H].
Approximation error is I[f_H] − I[f_0].
Error is I[f_S] − I[f_0] = (I[f_S] − I[f_H]) + (I[f_H] − I[f_0]).
