Page 1:

Regularization

Instructor: Dr. Saeed Shiry

Page 2:

Hypothesis Space

The hypothesis space H is the space of functions that our algorithm is allowed to provide, i.e. the space in which the algorithm is allowed to search. It is often important to choose the hypothesis space as a function of the amount of data available.

Page 3:

Learning As Function Approximation From Samples: Regression and Classification

The basic goal of supervised learning is to use the training set S to “learn” a function f that, for a new x value, predicts the associated value of y: y ≈ f(x).

Regression: y is a real-valued random variable.

Pattern classification: y takes values from an unordered finite set. In two-class pattern classification problems, we assign one class a y value of 1 and the other class a y value of −1.

Page 4:

Loss Functions

In order to measure the goodness of our function, we need a loss function V.

In general, we let V(f, z) = V(f(x), y) denote the price we pay when we see x and guess that the associated y value is f(x) when it is actually y.

Page 5:

Common Loss Functions For Regression

The most common loss function is the square loss or L2 loss: V(f(x), y) = (f(x) − y)^2

L1 loss: V(f(x), y) = |f(x) − y|

Vapnik’s more general ε-insensitive loss: V(f(x), y) = max(|f(x) − y| − ε, 0), which ignores errors smaller than ε.
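To make these definitions concrete, here is a minimal Python sketch of the three losses (the function names and the default ε = 0.1 are illustrative choices, not from the slides):

```python
import numpy as np

def l2_loss(fx, y):
    """Square (L2) loss: V(f(x), y) = (f(x) - y)^2."""
    return (fx - y) ** 2

def l1_loss(fx, y):
    """Absolute (L1) loss: V(f(x), y) = |f(x) - y|."""
    return np.abs(fx - y)

def eps_insensitive_loss(fx, y, eps=0.1):
    """Vapnik's eps-insensitive loss: max(|f(x) - y| - eps, 0)."""
    return np.maximum(np.abs(fx - y) - eps, 0.0)

fx = np.array([1.0, 2.0, 3.0])   # predictions f(x)
y = np.array([1.05, 2.5, 3.0])   # observed values
print(l2_loss(fx, y), l1_loss(fx, y), eps_insensitive_loss(fx, y))
```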

Page 6:

Problem of risk minimization

In order to choose the best available approximation to the supervisor's response, one measures the loss or discrepancy L(y, f(x, α)) between the response y of the supervisor to a given input x and the response f(x, α) provided by the learning machine. Consider the expected value of the loss, given by the risk functional

R(α) = ∫ L(y, f(x, α)) dP(x, y)    (1.2)

The goal is to find the function f(x, α0) which minimizes the risk functional R(α) over the class of functions f(x, α), α ∈ Λ, in the situation where the joint probability distribution P(x, y) is unknown and the only available information is contained in the training set.

Page 7:

Three Main Learning Problems

1. Pattern Recognition

Let the supervisor's output y take only two values, y ∈ {0, 1}, and let f(x, α), α ∈ Λ, be a set of indicator functions (functions which take only two values: zero and one).

Consider the following loss function:

L(y, f(x, α)) = 0 if y = f(x, α), and L(y, f(x, α)) = 1 if y ≠ f(x, α).

For this loss function, the functional (1.2) determines the probability of different answers given by the supervisor and by the indicator function f(x, α). We call the case of different answers a classification error.

The problem, therefore, is to find a function that minimizes the probability of classification error when the probability measure P(x, y) is unknown, but the data are given.

Page 8:

Three Main Learning Problems

2. Regression Estimation

Let the supervisor's answer y be a real value, and let f(x, α), α ∈ Λ, be a set of real functions that contains the regression function

f(x, α0) = ∫ y dP(y|x).

It is known that the regression function is the one that minimizes the functional (1.2) with the following loss function:

L(y, f(x, α)) = (y − f(x, α))^2.

Thus the problem of regression estimation is the problem of minimizing the risk functional (1.2) with the above loss function in the situation where the probability measure P(x, y) is unknown but the data are given.

Page 9:

Three Main Learning Problems

3. Density Estimation (Fisher-Wald Setting)

Finally, consider the problem of density estimation from the set of densities p(x, α), α ∈ Λ. For this problem we consider the following loss function:

L(p(x, α)) = −log p(x, α).

It is known that the desired density minimizes the risk functional (1.2) with the above loss function.

Thus, again, to estimate the density from the data one has to minimize the risk functional under the condition that the corresponding probability measure P(x) is unknown, but i.i.d. data are given.
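As a small illustration of this setting, the sketch below assumes a Gaussian family p(x; μ, σ) (our choice): minimizing the empirical average of −log p(x, α) over this family is exactly maximum likelihood estimation, which here has a closed form.

```python
import numpy as np

# Draw i.i.d. data from an unknown density (here secretly Gaussian).
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)

def neg_log_gauss(x, mu, sigma):
    """Loss -log p(x, alpha) for the Gaussian family p(x; mu, sigma)."""
    return 0.5 * np.log(2 * np.pi * sigma**2) + (x - mu) ** 2 / (2 * sigma**2)

# Minimizing the empirical risk (1/l) sum_i -log p(x_i, alpha) over the
# Gaussian family is maximum likelihood, with the closed-form solution below.
mu_hat, sigma_hat = x.mean(), x.std()
print(mu_hat, sigma_hat, neg_log_gauss(x, mu_hat, sigma_hat).mean())
```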

Page 10:

Expected error, empirical error

Given a function f, a loss function V, and a probability distribution μ over Z, the expected or true error of f is

I[f] = E_z V(f, z) = ∫ V(f, z) dμ(z),

the expected loss on a new example drawn at random from μ. We would like to make I[f] small, but in general we do not know μ.

Given a function f, a loss function V, and a training set S consisting of n data points z_1, ..., z_n, the empirical error of f is

I_S[f] = (1/n) Σ_i V(f, z_i).
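A numerical sketch of the distinction: since μ is unknown in practice, the toy example below (the data-generating model is our own choice) stands in for I[f] with a very large held-out Monte Carlo sample.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):                      # a fixed candidate function
    return 2.0 * x

def V(fx, y):                  # square loss
    return (fx - y) ** 2

def sample(n):                 # z = (x, y) drawn from mu: y = 2x + noise
    x = rng.uniform(-1, 1, n)
    return x, 2.0 * x + rng.normal(0, 0.5, n)

x_train, y_train = sample(20)
emp_err = V(f(x_train), y_train).mean()   # I_S[f], computable from S alone

x_big, y_big = sample(1_000_000)          # Monte Carlo stand-in for mu
true_err = V(f(x_big), y_big).mean()      # approximates I[f] (about 0.25 here)

print(f"empirical error I_S[f] = {emp_err:.3f}, expected error I[f] ~ {true_err:.3f}")
```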

Page 11:

A reminder: convergence in probability

Let {X_n} be a sequence of bounded random variables. We say that X_n converges in probability to X (written X_n → X in probability) if, for every ε > 0, lim_{n→∞} P(|X_n − X| ≥ ε) = 0.
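For instance, the sample mean of bounded i.i.d. variables converges in probability to the true mean (the weak law of large numbers). The toy simulation below (sample sizes and tolerance are our own choices) estimates P(|X_n − μ| ≥ ε) and watches it shrink:

```python
import numpy as np

rng = np.random.default_rng(2)
eps, mu = 0.05, 0.5            # tolerance and true mean of Uniform(0, 1)

for n in [10, 100, 1000, 10000]:
    # 500 independent realizations of the sample mean X_n
    means = rng.uniform(0, 1, size=(500, n)).mean(axis=1)
    prob = np.mean(np.abs(means - mu) >= eps)
    print(f"n={n:6d}  P(|X_n - mu| >= {eps}) ~ {prob:.3f}")
```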

Page 12:

Generalization

Page 13:

A learning algorithm should be well-posed, e.g. stable

In addition to the key property of generalization, a “good” learning algorithm should also be stable: fS should depend continuously on the training set S. In particular, changing one of the training points should affect the solution less and less as n goes to infinity.

Page 14:

General definition of Well-Posed and Ill-Posed problems

A problem is well-posed if its solution exists, is unique, and depends continuously on the data (i.e. it is stable).

A problem is ill-posed if it is not well-posed. In this context, well-posedness is mainly used to mean stability of the solution.

Page 15:

Theory of Solving Ill-Posed Problems

In the early 1900s Hadamard observed that under some (very general) circumstances the problem of solving (linear) operator equations

Af = F, f ∈ F

(finding the f ∈ F that satisfies the equality) is ill-posed: even if there exists a unique solution to this equation, a small deviation on the right-hand side (Fδ instead of F, where ||F − Fδ|| < δ is arbitrarily small) can cause large deviations in the solution (it can happen that ||fδ − f|| is large).

In this case, if the right-hand side F of the equation is not exact (e.g., it equals Fδ, where Fδ differs from F by some level δ of noise), the functions fδ that minimize the functional

R(f) = ||Af − Fδ||^2

do not guarantee a good approximation to the desired solution even if δ tends to zero.
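A tiny numerical illustration of this phenomenon (the 2×2 operator A below is our own example): a perturbation of size 2e-4 on the right-hand side changes the solution by order 1.

```python
import numpy as np

# A nearly singular operator A: the equation A f = F is ill-conditioned.
A = np.array([[1.0, 1.0],
              [1.0, 1.0001]])
F = np.array([2.0, 2.0001])

f = np.linalg.solve(A, F)              # exact solution: f = (1, 1)

# A tiny perturbation F_delta of the right-hand side...
F_delta = F + np.array([0.0, 0.0002])  # ||F - F_delta|| = 2e-4
f_delta = np.linalg.solve(A, F_delta)

print("f       =", f)        # [1. 1.]
print("f_delta =", f_delta)  # wildly different: [-1.  3.]
print("condition number of A:", np.linalg.cond(A))
```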

Page 16:

Real-life problems were found to be ill-posed

Hadamard thought that ill-posed problems were a purely mathematical phenomenon and that all real-life problems are “well-posed.”

However, in the second half of the century a number of very important real-life problems were found to be ill-posed. Notably, one of the main problems of statistics, estimating the density function from the data, is ill-posed.

Page 17:

Regularization theory

Regularization theory was one of the first signs of the existence of intelligent inference: in the middle of the 1960s it was discovered that if instead of the functional R(f) one minimizes another, so-called regularized, functional

R*(f) = ||Af − Fδ||^2 + γ(δ) Ω(f),

where Ω(f) is some functional (belonging to a special class of functionals) and γ(δ) is an appropriately chosen constant (depending on the level of noise), then one obtains a sequence of solutions that converges to the desired one as δ tends to zero.
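Continuing the toy example from page 15, here is a sketch of minimizing the regularized functional with the stabilizer Ω(f) = ||f||^2 (our choice of Ω and of γ, for illustration only):

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 1.0001]])
F_delta = np.array([2.0, 2.0003])      # the noisy right-hand side from before

gamma = 1e-4                           # gamma(delta): regularization constant
# Minimize ||A f - F_delta||^2 + gamma * ||f||^2 via the normal equations:
#   (A^T A + gamma I) f = A^T F_delta
f_reg = np.linalg.solve(A.T @ A + gamma * np.eye(2), A.T @ F_delta)

print(f_reg)   # close to the true solution (1, 1) despite the noise
```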

Page 18:

ERM

Given a training set S and a function space H, empirical risk minimization (Vapnik introduced the term) is the class of algorithms that look at S and select fS as

fS = arg min_{f ∈ H} I_S[f].

For example, linear regression is ERM when V(z) = (f(x) − y)^2 and H is the space of linear functions f = ax.
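A minimal sketch of exactly this example, ERM with square loss over H = {f(x) = ax}, which has a one-line closed-form minimizer:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 30)
y = 1.7 * x + rng.normal(0, 0.1, 30)   # data roughly on the line y = 1.7 x

# H = {f(x) = a x}; ERM picks the a minimizing the empirical square loss
#   I_S[a] = (1/n) sum_i (a x_i - y_i)^2,
# whose closed-form minimizer is a = <x, y> / <x, x>.
a_hat = np.dot(x, y) / np.dot(x, x)
print(a_hat)   # close to 1.7
```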

Page 19:

THE EMPIRICAL RISK MINIMIZATION (ERM) INDUCTIVE PRINCIPLE

In order to minimize the risk functional for an unknown probability measure P(z), the following induction principle is usually employed.

The expected risk functional R(α) is replaced by the empirical risk functional

R_emp(α) = (1/l) Σ_{i=1}^{l} Q(z_i, α)    (1.8)

constructed on the basis of the training set.

The principle is to approximate the function Q(z, α0) which minimizes the risk by the function Q(z, αl) which minimizes the empirical risk (1.8).

This principle is called the Empirical Risk Minimization induction principle (ERM principle).

Page 20:

Generalization and Well-posedness of Empirical Risk Minimization

For ERM to represent a “good” class of learning algorithms, the solution should generalize and be well-posed: it should exist, be unique and, especially, be stable.

Page 21:

ERM and generalization: given a certain number of samples...

Page 22:

...suppose this is the “true” solution...

Page 23:

... but suppose ERM gives this solution.

Page 24:

Under which conditions does the ERM solution converge to the true solution as the number of examples increases? In other words, what are the conditions for generalization of ERM?

Page 25:

ERM and stability: given 10 samples...

Page 26:

...we can find the smoothest interpolating polynomial (of degree nine: ten points determine a unique interpolating polynomial of degree at most nine).

Page 27:

But if we perturb the points slightly...

Page 28:

...the solution changes a lot!

Page 29:

If we restrict ourselves to degree two polynomials...

Page 30:

...the solution varies only a small amount under a small perturbation.
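The story told on pages 25-30 can be reproduced numerically. The sketch below (data and perturbation sizes are our own choices) fits degree-9 interpolating polynomials and degree-2 polynomials to ten points before and after a slight perturbation, and compares how much the fitted curves move:

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x)                # 10 samples of a smooth curve
y_pert = y + rng.normal(0, 0.02, 10)     # ...perturbed very slightly

for deg in (9, 2):                       # interpolating vs. restricted H
    c = np.polyfit(x, y, deg)
    c_pert = np.polyfit(x, y_pert, deg)
    grid = np.linspace(0, 1, 200)
    change = np.max(np.abs(np.polyval(c, grid) - np.polyval(c_pert, grid)))
    print(f"degree {deg}: max change in fitted curve = {change:.3f}")
# Degree 9 (interpolation) moves a lot; degree 2 barely moves.
```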

Page 31:

ERM: conditions for well-posedness (stability) and predictivity (generalization)

Since Tikhonov, it has been well known that a generally ill-posed problem such as ERM can be guaranteed to be well-posed, and therefore stable, by an appropriate choice of H. For example, compactness of H guarantees stability.

It is intriguing that the classical conditions for consistency of ERM, a quite different property, also consist of appropriately restricting H.

Page 32:

ERM: conditions for well-posedness (stability) and predictivity (generalization)

We would like to have a hypothesis space that yields generalization. Loosely speaking, this would be an H for which the solution of ERM, say fS, is such that |I_S[fS] − I[fS]| converges to zero in probability as n increases.

Note that the above requirement is NOT the law of large numbers; the requirement that, for a fixed f, |I_S[f] − I[f]| converges to zero in probability as n increases IS the law of large numbers.

Page 33:

ERM: conditions for well-posedness (stability) and predictivity (generalization)

The theorem referred to on the next page, stated informally (the slide's formal statement is not reproduced here): ERM generalizes, and is consistent, if and only if H is a uniform Glivenko-Cantelli (uGC) class, i.e. sup over f ∈ H of |I_S[f] − I[f]| converges to zero in probability as n increases.

Page 34:

The theorem says that a proper choice of the hypothesis space H ensures generalization of ERM (and consistency, since for ERM generalization and consistency are equivalent).

A separate theorem also guarantees stability (defined in a specific way) of ERM.

Thus, with the appropriate definition of stability, stability and generalization are equivalent for ERM.

Other results characterize uGC classes in terms of measures of complexity or capacity of H (such as the VC dimension).

Thus the two desirable conditions for a learning algorithm, generalization and stability, are equivalent (and they correspond to the same constraints on H).

Page 35:

Regularization

Regularization is a method of improving the stability of solutions of ill-conditioned inverse problems.

The basic idea in the treatment of ill-conditioned problems is to use some a priori knowledge about solutions to disqualify meaningless ones. Such knowledge can be:

a regularity condition on the solution, expressed as the existence of derivatives up to a certain order with bounds on the magnitudes of these derivatives;

a localization condition, such as a bound on the support of the solution or on its behavior at infinity.

Tikhonov's regularization penalizes undesired solutions by adding a term called a stabilizer.

Page 36:

Regularization

Generally speaking, any regularization method tries to analyze a related well-posed problem whose solution approximates the solution of the original ill-posed problem.

Well-posedness is achieved by implementing one or more of the following basic ideas: restriction of the data; change of the space and/or topologies; modification of the operator itself; the concept of regularization operators; and well-posed stochastic extensions of ill-posed problems.

Page 37:

Regularization

Regularized cost function = empirical cost function + regularization parameter × regularizer function
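Rendered as code, with a mean square loss as the empirical cost and Ω(w) = ||w||^2 as the regularizer (both illustrative choices, not mandated by the slide):

```python
import numpy as np

def regularized_cost(w, X, y, gamma):
    """Regularized cost = empirical cost + regularization parameter * regularizer.

    Empirical cost: mean square loss of the linear model X @ w against y.
    Regularizer:    Omega(w) = ||w||^2 (a ridge penalty, an illustrative choice).
    """
    empirical = np.mean((X @ w - y) ** 2)
    return empirical + gamma * np.sum(w ** 2)

# For this choice, the minimizer has the closed form
#   w* = (X^T X / n + gamma I)^(-1) X^T y / n
rng = np.random.default_rng(6)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, 50)
n, gamma = len(y), 0.1
w_star = np.linalg.solve(X.T @ X / n + gamma * np.eye(3), X.T @ y / n)
print(w_star, regularized_cost(w_star, X, y, gamma))
```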

Page 38:

Image restoration – An ill-posed problem

Degradation model (in the frequency domain):

G(u, v) = H(u, v) F(u, v) + N(u, v)

H is ill-conditioned, which makes the image restoration problem an ill-posed problem: the solution is not stable. Naive inverse filtering gives

F̂(u, v) = G(u, v) / H(u, v) = F(u, v) + N(u, v) / H(u, v),

so wherever H(u, v) is small, the noise term N(u, v) / H(u, v) is enormously amplified.
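A one-dimensional sketch of this instability and its regularized fix (the signal, blur kernel, noise level, and γ are all our own toy choices; a Tikhonov/Wiener-style filter stands in for the general regularized restoration):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 256
f = np.zeros(n)
f[96:160] = 1.0                                # a simple 1-D "image" F

# Degradation: Gaussian blur (transfer function H) plus additive noise N.
h = np.exp(-0.5 * ((np.arange(n) - n // 2) / 4.0) ** 2)
h /= h.sum()
H = np.fft.fft(np.fft.ifftshift(h))
G = H * np.fft.fft(f) + np.fft.fft(rng.normal(0, 1e-3, n))

# Naive inverse filter G / H: divides by tiny |H| and blows up the noise.
f_inv = np.real(np.fft.ifft(G / H))

# Tikhonov-regularized inverse: conj(H) G / (|H|^2 + gamma), stable.
gamma = 1e-3
f_reg = np.real(np.fft.ifft(np.conj(H) * G / (np.abs(H) ** 2 + gamma)))

print("max restoration error, naive inverse:", np.max(np.abs(f_inv - f)))
print("max restoration error, regularized  :", np.max(np.abs(f_reg - f)))
```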

Page 39:

Tikhonov’s Regularization

Theory proposed by Tikhonov in 1963. It proposes the use of prior knowledge to regularize mappings.

Most common application: exploit the smoothness property, i.e. for an input-output mapping to be smooth, similar inputs should produce similar outputs.

Page 40:

Ivanov and Tikhonov Regularization

Ivanov regularization: minimize the empirical risk I_S[f] subject to the constraint Ω(f) ≤ τ, i.e. ERM over a restricted hypothesis space.

Tikhonov regularization: minimize I_S[f] + γ Ω(f), with the constraint moved into the objective as a penalty term.

Page 41:

Tikhonov Regularization

As we will see in future classes, Tikhonov regularization ensures well-posedness, e.g. existence, uniqueness and, especially, stability (in a very strong form) of the solution.

Tikhonov regularization also ensures generalization. It is closely related to, but different from, Ivanov regularization, e.g. ERM on a hypothesis space H which is a ball in an RKHS.

