
Interpolation RBF Regularized RBF Generalized RBF XOR problem References

Optimization Methods for Machine Learning: Radial Basis Functions

Laura Palagi
http://www.dis.uniroma1.it/~palagi

Dipartimento di Ingegneria informatica automatica e gestionale A. Ruberti
Sapienza Università di Roma

Via Ariosto 25

RBF Networks L. Palagi 1 / 29


Interpolation problem

Given P distinct points in R^n:

X = {x^i ∈ R^n, i = 1, . . . , P},

and a corresponding set of real numbers

Y = {y^i ∈ R, i = 1, . . . , P}.

The interpolation problem consists in finding a function f : R^n → R, in a given class of real functions F, which satisfies:

f(x^i) = y^i,   i = 1, . . . , P.   (1)


Interpolation properties

For n = 1 the interpolation problem can be solved explicitly using polynomials:

f(x) = ∑_{i=0}^{P−1} c_i x^i

For n > 1, the 2-layer MLP with g not polynomial satisfies

∑_{j=1}^{P} v_j g((w^j)^T x^i − b_j) = y^i,   i = 1, . . . , P

for some w^j ∈ R^n and v_j, b_j ∈ R.

The MLP possesses the universal approximation property, i.e. it can approximate a continuous function arbitrarily well (provided that an arbitrarily large number of units is available).


Interpolation properties

Being a universal approximator may not be enough from a theoretical point of view. An important property is the

existence of a best approximation

Informally: given a function f belonging to some set of functions F, and given a subset A of F, find an element of A which is closest to f. If d(f, g) is the distance between two elements f, g in F, we consider the problem

d*_A = inf_{a ∈ A} d(f, a)

If there exists a* ∈ A that attains the infimum, namely d*_A = d(f, a*), then a* is the best approximation to f from A.


Best approximation properties

MLP does not have the best approximation property [1].

Consider another approximation scheme:

f(x) = ∑_{j=1}^{P} v_j φ(‖x − x^j‖),   (2)

where ‖·‖ is the Euclidean norm and φ : R+ → R is a suitable continuous function, called a radial basis function (RBF), since the argument of φ is the radius r = ‖x − x^j‖. The data points x^j ∈ X are referred to as the centers.


Gaussian RBF

φ(r) = e^{−(r/σ)²}

with radius r > 0 and spread σ > 0 (very sensitive parameter)

It is a localized function, in the sense that it goes to zero with increasing radius (far from the centers).


Multiquadric

φ(r) = (r² + σ²)^{1/2}

with radius r > 0 and spread σ > 0 (very sensitive parameter)

It grows with the distance from the centers


Inverse Multiquadric

φ(r) = (r² + σ²)^{−1/2}

with radius r > 0 and spread σ > 0 (very sensitive parameter)

It goes to zero with increasing radius (as the Gaussian)


Other RBF

φ(r) = r   (linear spline)
φ(r) = r³   (cubic spline)
φ(r) = r² log r   (thin plate spline)
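The basis functions above are one-liners in code. A minimal Python sketch (the function names and the φ(0) = 0 convention for the thin plate spline are my choices, not from the slides):

```python
import numpy as np

# Common RBFs as functions of the radius r >= 0; sigma is the spread.
def gaussian(r, sigma=1.0):
    return np.exp(-(r / sigma) ** 2)

def multiquadric(r, sigma=1.0):
    return np.sqrt(r ** 2 + sigma ** 2)

def inverse_multiquadric(r, sigma=1.0):
    return 1.0 / np.sqrt(r ** 2 + sigma ** 2)

def linear_spline(r):
    return r

def cubic_spline(r):
    return r ** 3

def thin_plate_spline(r):
    # r^2 log r, with the usual convention phi(0) = 0
    r = np.asarray(r, dtype=float)
    return np.where(r > 0, r * r * np.log(np.where(r > 0, r, 1.0)), 0.0)
```

Note how the Gaussian and inverse multiquadric decay with r (localized), while the multiquadric and the splines grow.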


Interpolation by RBF

By imposing the interpolation conditions we get:

∑_{j=1}^{P} v_j φ(‖x^i − x^j‖) = y^i,   i = 1, . . . , P.   (3)

Define the vectors v = (v_1 · · · v_P)^T and y = (y^1 · · · y^P)^T, and the symmetric P × P matrix Φ with elements

Φ_{i,j} = φ(‖x^i − x^j‖),   1 ≤ i, j ≤ P.

Then system (3) can be written as:

Φv = y.

It is a linear system of P equations in P unknowns.
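As a sketch, the system Φv = y can be assembled and solved directly (the 1-D data and the Gaussian spread here are hypothetical, chosen only for illustration):

```python
import numpy as np

def rbf_interpolate(X, y, phi):
    """Solve Phi v = y, where Phi[i, j] = phi(||x^i - x^j||)."""
    r = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # P x P radii
    Phi = phi(r)
    return np.linalg.solve(Phi, y), Phi

# Hypothetical 1-D example with a Gaussian RBF, spread sigma = 0.5
X = np.linspace(0.0, 1.0, 5).reshape(-1, 1)
y = np.sin(2 * np.pi * X[:, 0])
v, Phi = rbf_interpolate(X, y, lambda r: np.exp(-(r / 0.5) ** 2))
# the interpolation conditions f(x^i) = y^i hold exactly
assert np.allclose(Phi @ v, y)
```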


Matrix Φ is nonsingular provided that P ≥ 2, the interpolation points x^j, j = 1, . . . , P, are distinct, and φ is one of the following:

Gaussian (Φ positive definite)
multiquadric
inverse multiquadric (Φ positive definite)
linear spline

Thus, the interpolation problem Φv = y admits a unique solution.

When Φ is positive definite, the solution can be computed by minimizing the strictly convex quadratic function in R^P

F(v) = (1/2) v^T Φ v − y^T v,

whose gradient is given by ∇F(v) = Φv − y.
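Since ∇F(v) = Φv − y, even plain gradient descent on F recovers the weights. A minimal sketch (the step size and iteration count are my choices; the step must stay below 2/λ_max(Φ) for convergence):

```python
import numpy as np

def minimize_F(Phi, y, lr=0.1, iters=5000):
    """Gradient descent on F(v) = 1/2 v^T Phi v - y^T v."""
    v = np.zeros_like(y, dtype=float)
    for _ in range(iters):
        v -= lr * (Phi @ v - y)  # step along -grad F(v) = -(Phi v - y)
    return v

# Small positive definite example: the minimizer solves Phi v = y
Phi = np.array([[2.0, 0.5], [0.5, 1.0]])
y = np.array([1.0, 2.0])
v = minimize_F(Phi, y)
assert np.allclose(v, np.linalg.solve(Phi, y))
```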


From Interpolation to approximation properties

Because of the remarkable properties of the RBFs, the RBF method is one of the most often applied approaches in multivariable interpolation.

This has motivated the attempt to employ RBFs also within approximation algorithms for the solution of classification and regression problems in data mining.

What does it change?


Regularized RBF neural networks

Consider the data set {(x^p, y^p), p = 1, . . . , P} obtained by random sampling (in the presence of noise) of a function belonging to some space of functions F.

We already discussed that the problem of recovering the function, or an estimate of it, from a finite set of data is ill posed, in the sense that it has an infinite number of solutions.

In order to choose one particular and stable solution we need to impose some regularization property on the function; that is, the problem becomes the

Minimization of the Regularized Empirical Risk

min_f ∑_{p=1}^{P} ℓ(f(x^p; ω) − y^p) + R(λ, f)

where the first term is the empirical error and the second is the regularization term.


Regularized RBF neural networks

The most common form of a priori knowledge consists in assuming that the function is smooth enough, in the sense that two similar inputs correspond to two similar outputs.

Smoothness is a measure of the "oscillatory" behavior of f. Within a class of differentiable functions, one function is said to be smoother than another one if it oscillates less. A smoothness functional E(f) is defined and we consider

min_f (1/2) ∑_{p=1}^{P} [y^p − f(x^p)]² + λ E(f),

where the first term enforces closeness to the data and the second smoothness, while the regularization parameter λ > 0 controls the tradeoff between these two terms.


Regularized RBF neural networks

It can be shown that, for a wide class of smoothness functionals E(f), the solutions of the minimization all have the same form:

f(x) = ∑_{i=1}^{P} v_i φ(‖x − c^i‖)

The centers coincide with the inputs,

c^i = x^i,   i = 1, . . . , P,

and the weights solve the regularized system

(Φ + λI)v = y

where Φ = {Φ_{ij}}_{i,j=1,...,P} = {φ(‖x^i − x^j‖)}_{i,j=1,...,P}.
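A minimal training/prediction sketch of this regularized network (function names and the example data are mine, for illustration):

```python
import numpy as np

def fit_regularized_rbf(X, y, phi, lam):
    """Centers = training inputs; weights solve (Phi + lam I) v = y."""
    r = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    return np.linalg.solve(phi(r) + lam * np.eye(len(y)), y)

def predict_rbf(Xnew, X, v, phi):
    """f(x) = sum_i v_i phi(||x - x^i||), centers fixed at the data."""
    r = np.linalg.norm(Xnew[:, None, :] - X[None, :, :], axis=2)
    return phi(r) @ v

# Hypothetical noisy 1-D data, Gaussian RBF with spread 0.3
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(20, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(20)
gauss = lambda r: np.exp(-(r / 0.3) ** 2)
v = fit_regularized_rbf(X, y, gauss, lam=1e-2)
yhat = predict_rbf(X, X, v, gauss)
```

With λ > 0 the fit no longer interpolates the noisy targets exactly, which is the point of the regularization.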


Equivalent convex quadratic optimization

Assume that the matrix Φ is positive semidefinite. Then the unique solution of the system

(Φ + λI)v = y

can be computed by minimizing the strictly convex quadratic function in R^P

F_R(v) = (1/2) v^T (Φ + λI) v − y^T v = F(v) + (λ/2)‖v‖²,

where F(v) = (1/2) v^T Φ v − y^T v is the interpolation objective and (λ/2)‖v‖² is the regularization term. The gradient is given by ∇F_R(v) = (Φ + λI)v − y.


2-layer Regularized RBF network

f(x) = ∑_{i=1}^{P} v_i φ(‖x − c^i‖)

(Figure: two-layer network. The input x feeds P hidden units computing φ(‖x − x^1‖), φ(‖x − x^2‖), . . . , φ(‖x − x^P‖); their outputs, weighted by v_1, . . . , v_P, are summed to produce the output y(x).)


2-layer Regularized RBF network

RBFs are universal approximators: any continuous function can be approximated arbitrarily well on a compact set, provided a sufficiently large number of units and an appropriate choice of the parameters.

RBFs possess the best approximation property, namely the best approximation exists and in most cases (under assumptions often satisfied) is unique (the RBF is linear in the parameters v) [1].

The value of λ can be selected by employing cross-validation techniques, and this may require that the system (Φ + λI)v = y is solved several times.

The spread σ is a hyper-parameter too
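A hold-out grid search over σ and λ can be sketched as follows: one linear solve of (Φ + λI)v = y per candidate pair, keeping the pair with the lowest validation error (the grids and the data split are my choices, for illustration):

```python
import numpy as np

def gaussian_design(Xa, Xb, sigma):
    """Matrix of phi(||a_i - b_j||) for the Gaussian RBF."""
    r = np.linalg.norm(Xa[:, None, :] - Xb[None, :, :], axis=2)
    return np.exp(-(r / sigma) ** 2)

def select_hyperparams(Xtr, ytr, Xval, yval, sigmas, lams):
    """Hold-out grid search; returns (best_error, best_sigma, best_lambda)."""
    best = None
    for sigma in sigmas:
        Phi = gaussian_design(Xtr, Xtr, sigma)
        for lam in lams:
            v = np.linalg.solve(Phi + lam * np.eye(len(ytr)), ytr)
            err = np.mean((gaussian_design(Xval, Xtr, sigma) @ v - yval) ** 2)
            if best is None or err < best[0]:
                best = (err, sigma, lam)
    return best

# Hypothetical data: noisy sine, split into train / validation
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(40, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(40)
err, sigma, lam = select_hyperparams(X[:30], y[:30], X[30:], y[30:],
                                     sigmas=[0.1, 0.3, 1.0],
                                     lams=[1e-4, 1e-2, 1.0])
```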


2-layer Generalized RBF network

When P is very large, the cost of constructing a regularized RBF network can be prohibitive. Indeed, the computation of the weights v ∈ R^P requires the solution of a possibly ill-conditioned linear system, which costs O(P³).

Generalized RBF neural networks are used, in which the number N of neural units is much smaller than P.

The output of the network can be defined by

y(x) = ∑_{j=1}^{N} v_j φ(‖x − c_j‖),   (4)

where both the centers c_j ∈ R^n and the weights v_j, j = 1, . . . , N, must be selected appropriately.
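One cheap way to pick the N centers, used here purely as an illustration (the slides leave the selection open), is a random subset of the data; with the centers fixed, the weights are a linear least-squares fit:

```python
import numpy as np

def fit_grbf(X, y, N, sigma, seed=0):
    """Generalized RBF sketch: N << P fixed centers, linear fit for v."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=N, replace=False)]  # random subset
    r = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)  # P x N
    G = np.exp(-(r / sigma) ** 2)
    v, *_ = np.linalg.lstsq(G, y, rcond=None)  # least-squares weights
    return centers, v

def predict_grbf(Xnew, centers, v, sigma):
    r = np.linalg.norm(Xnew[:, None, :] - centers[None, :, :], axis=2)
    return np.exp(-(r / sigma) ** 2) @ v

# Hypothetical data: P = 50 points, N = 5 units
rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(50, 1))
y = np.sin(2 * np.pi * X[:, 0])
centers, v = fit_grbf(X, y, N=5, sigma=0.3)
yhat = predict_grbf(X, centers, v, sigma=0.3)
```

Replacing the random subset with clustering (e.g. k-means centroids) is a common refinement.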


2-layer Generalized RBF network

(Figure: two-layer network. The input x feeds N hidden units computing φ(‖x − c_1‖), φ(‖x − c_2‖), . . . , φ(‖x − c_N‖); their outputs, weighted by v_1, . . . , v_N, are summed to produce the output y(x).)


2-layer Generalized RBF network

GRBFs are universal approximators: any continuous function can be approximated arbitrarily well on a compact set, provided a sufficiently large number of units and an appropriate choice of the parameters.

GRBFs may NOT possess the best approximation property. However, if the centers are fixed, the approximation problem becomes linear with respect to v and the existence of a best approximation is guaranteed.

In the general case, both the centers and the weights are treated as variable parameters and the approximation is nonlinear.

As N ≪ P, the GRBF inherently performs a structural stabilization, which may prevent the occurrence of overtraining.


An example: Exclusive OR

The logical function XOR

p   x1   x2   yp
1   −1   −1   −1
2   −1    1    1
3    1   −1    1
4    1    1   −1

(Figure: the four points in the (x1, x2) plane; points 2 and 3 have label +1, points 1 and 4 have label −1.)

A perceptron (linear separator) doesn't work.


Two layer MLP

(Figure: two-layer MLP. Inputs x1, x2 feed two hidden units with weights w11, w12, w21, w22, biases b1, b2 and sign(·) activations, producing a1, a2; the hidden outputs, weighted by v1, v2 with output bias b3, are summed and passed through a final sign(·) to give y(x).)


Two layer MLP

Choose w11 = w22 = 1, w12 = w21 = −1, b1 = b2 = −1, v1 = v2 = 1, b3 = 0.1 (output bias). We get

a1 = x1 − x2 − 1,   z1 = sign(a1)
a2 = −x1 + x2 − 1,  z2 = sign(a2)
y = sign(z1 + z2 + 0.1)

input p   a1   a2   z1   z2   z1 + z2 + 0.1   y
1         −1   −1   −1   −1   −1.9            −1
2         −3    1   −1    1    0.1             1
3          1   −3    1   −1    0.1             1
4         −1   −1   −1   −1   −1.9            −1
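The table above can be verified directly with a few lines of Python:

```python
import numpy as np

def mlp_xor(x1, x2):
    # Hidden layer with the weights chosen above
    z1 = np.sign(x1 - x2 - 1)
    z2 = np.sign(-x1 + x2 - 1)
    # Output node with bias b3 = 0.1
    return np.sign(z1 + z2 + 0.1)

# The four XOR patterns and their targets, row by row from the table
patterns = [((-1, -1), -1), ((-1, 1), 1), ((1, -1), 1), ((1, 1), -1)]
for (x1, x2), target in patterns:
    assert mlp_xor(x1, x2) == target
```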


Two layer MLP

This MLP network with two hidden nodes realizes a nonlinear separation (each hidden node describes one of the two lines). The output node combines the outputs of the two hidden nodes.

(Figure: the four XOR points in the (x1, x2) plane with the two parallel separating lines a1 = 0 and a2 = 0.)


RBF network

Consider an RBF network with two units (N = 2) with centers c1, c2, and assume the activation function is a Gaussian g_j = e^{−(‖x−c_j‖/σ)²}.

(Figure: inputs x1, x2 feed two hidden units computing z1 = e^{−‖x−c1‖²/σ²} and z2 = e^{−‖x−c2‖²/σ²}; the weighted sum v1 z1 + v2 z2 + b is passed through sign(·) to give y(x).)


RBF network

Choose σ = √2 and c1 = (1, 1)^T, c2 = (−1, −1)^T. We transform the problem into a linearly separable form.

p   z1 = e^{−‖x−c1‖²/σ²}   z2 = e^{−‖x−c2‖²/σ²}   yp
1   e^{−4}                 1                      −1
2   e^{−2}                 e^{−2}                  1
3   e^{−2}                 e^{−2}                  1
4   1                      e^{−4}                 −1

(Figure: in the (z1, z2) plane the four points are linearly separable.)
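The feature values in the table follow from σ² = 2; a quick check:

```python
import math

c1, c2 = (1.0, 1.0), (-1.0, -1.0)

def z(x, c, sigma2=2.0):
    # z = exp(-||x - c||^2 / sigma^2), with sigma = sqrt(2)
    d2 = (x[0] - c[0]) ** 2 + (x[1] - c[1]) ** 2
    return math.exp(-d2 / sigma2)

pts = [(-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0), (1.0, 1.0)]
feats = [(z(x, c1), z(x, c2)) for x in pts]
# Row 1: (e^-4, 1); rows 2-3: (e^-2, e^-2); row 4: (1, e^-4)
```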


The output takes the form

y(x) = v1 e^{−‖x−c1‖²/σ²} + v2 e^{−‖x−c2‖²/σ²} + b

Minimizing the training error

min_{v,b} ∑_{p=1}^{4} (y(x^p) − y^p)²

we get the optimal solution (v*, b*) that gives E = 0:

(v1, v2, b) = (−2.675065656, −2.675065656, 1.72406123)

and the RBF network has been trained.
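These values can be reproduced with an ordinary least-squares solve (NumPy sketch; the design matrix collects the features z1, z2 and a constant column for b):

```python
import numpy as np

e2, e4 = np.exp(-2.0), np.exp(-4.0)
# Rows: (z1, z2, 1) for the four XOR patterns; t holds the targets yp
A = np.array([[e4, 1.0, 1.0],
              [e2, e2, 1.0],
              [e2, e2, 1.0],
              [1.0, e4, 1.0]])
t = np.array([-1.0, 1.0, 1.0, -1.0])
(v1, v2, b), *_ = np.linalg.lstsq(A, t, rcond=None)
# v1 = v2 = -2/(1 - e^-2)^2 and b = 1 + 2|v1|e^-2, matching the slide;
# the residual is zero, so the training error E = 0
```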


References

[1] Girosi, F., and Poggio, T. (1990). Networks and the best approximation property. Biological Cybernetics, 63(3), 169–176.
