Study of Sparse Online Gaussian Process for Regression EE645 Final Project May 2005 Eric Saint...

transcript

Study ofSparse Online Gaussian Process

for Regression

EE645 Final ProjectMay 2005

Eric Saint Georges

A. IntroductionB. OGP

1. Definition of Gaussian Process2. Sparse Online GP algorithm (OGP)

C. Simulation Results1. Comparison with LS SVM on Boston Housing data

set (Batch)2. Time Series Prediction using OGP3. Optical Beam Position Optimization

D. Conclusion

Contents

Introduction

Possible Application of OGP to Optical Free Space Communication

Monitoring and Optimization in a noisy environment

Sparse OGP Algorithm developed by Lehel Csato and al.

D. Conclusion

Contents

Gaussian Process Definition

Collection of indexed random variables– Mean– Covariance defined by a Kernel function

• function: can be any Positive Semi Definite function• Defines the assumptions on the prior distribution• Wide scope of choices• Popular Kernels are stationary functions: f (x-x’)

– Index can be time or space or anything else

On line GP Process

• Bayesian Process:Prior distribution (GP Process)

Likelihood Function

Posterior distribution

(Using Bayes rule)

Solving a Gaussian Process: Given measures n inputs and n measures t i

(with ti = yi + ei) being zero mean and e variance

Prior distribution over y i is given by the covariance matrix K ij = C(xi,xj).

Prior distribution over the measures t i is given by K+ e In

Prediction on function y* for an input x* consists in calculating the mean and variances:

y*(x* )= i C(xi,xj)

(x* )=C(x*,x*) –kT(x*)(K + e In )-1 k (x*)

With i = (K + e In )-1 t

Solving the Gaussian Process:

Solving requires inversion of (K + e In ) which is a n x n matrix, n being the number of training inputs.

Memory is in n2 and cpu time in n3.

Sampling from a Gaussian Process:

• Example of Kernel:

a = amplitude

s = scale (smoothness)

Sampling from a GP: before Training

-20 -15 -10 -5 0 5 10 15 20-10

10Drawing Samples with Scale = 10

SampleMean+Standard Deviation-Standard Deviation

0 20 40 60 80 100-10

Effect of Scale

Small Scale=1 Large Scale =100

0 20 40 60 80 100-10

Effect of Scale

Small Scale=1 Large Scale =100

Sampling from a GP: After Training

-20 -15 -10 -5 0 5 10 15 20-3

After 3 Training samples

Sampling from a GP: After Training

-20 -15 -10 -5 0 5 10 15 20-3

After 10 Training samples

Online Gaussian Process: Issues

Two Major Issues with the GP process;

1. Data Set size is limited by Memory and CPU

2. Posterior distribution is usually not Gaussian

Algorithm developed by Csato and al.

Sparse Online Gaussian Algorithm

Sparsity createdusing limited number

of SVs

GaussianApproximation

Posterior DistributionNot usually Gaussian

Data Set size limitedby Memory and CPU

Matlab Software available on the Web

SOGP Process defined by:– Kernel Parameters m + 2 Vector for RBF Kernel– Support Vectors: d x 1 Vector (indexes)– GP Parameters:

: d x 1 Vector• K: d x n Matrix

• m dimension of input space• d number of support vectors

Sparse Online Gaussian Algorithm

1. Definition of Gaussian Process2. Sparse Online GP algorithm (SOGP)

C. Simulation Results1. Comparison with LS SVM on Boston Housing

data set (Batch)2. Time Series Prediction using OGP3. Optical Beam Position Optimization

D. Conclusion

LS SVM on Boston Housing Data Set

RBF KernelC=10, =4304 training samples averaged on 10 Random draws

MeanAverage MSE on Training: 3 3 3 3.0Average MSE on Test: 7 7 6 6.5Standard Deviation on Training 0 0 1 0.4Standard Deviation on Test 2 2 2 1.6

Average cpu = 3 sec / run

•Kernel :

•Initial Hyper-parameters: and (i=1 to 13 for BH)

•Number of Hyper-parameter optimization iterations: tried

between 3 and 6

•Max number of Support Vectors: Variable

OGP on Boston Housing Data Set

6 Iterations, MaxBV between 10 and 250

0 50 100 1500

15GP Regression on Boston Housing Data Set

Max number of Support Vectors

maxHyp = 3

MSE av eraged ov er 5 draws. Hy per Parameters updated 3 times. Max numbers of Support Vectors = 10, 20, 50, 100, 150, Number of Training samples = 304 Total Elapsed Time = 2733 sec. Results sav ed in Result_5120_4.mat net Structure sav ed in net_5120_4.mat Figure sav ed in Fig_5120_4.jpg

5120_4 30-Apr-2005

0 50 100 1500

maxHyp = 4

5120_5 30-Apr-2005

0 50 100 1500

maxHyp = 6

5120_6 30-Apr-2005

Cpu Time

0 5 10 15 20 250

600Processing Time

Maximum Number of SVs

MaxHyp=3

MaxHyp=4MaxHyp=6

0 50 100 1500.4

1.1Processing Time /maxBV/maxHyp

Maximum Number of SVs

MaxHyp=3

MaxHyp=4MaxHyp=6

0 50 100 150

Number of Support Vectors

Alpha*(Beta+SVs2)./SVs

a*(b+SVs2)/SVsas a function of SVs

10 15 20 25 30 35 40 45 50 55 600

MSE av eraged ov er 10 draws. Hy per Parameters updated 4 times. Max numbers of Support Vectors = 10, 20, 30, 40, 50, 60, Number of Training samples = 304 Total Elapsed Time = 5749 sec. Results sav ed in Result_5121_1.mat net Structure sav ed in net_5121_1.mat Figure sav ed in Fig_5121_1.jpg

5121_1 01-May-2005

Run with 4 Iterations, MaxBV between 10 and 60

25 30 35 40 452

MSE av eraged ov er 50 draws. Hy per Parameters updated 4 times. Max numbers of Support Vectors = 30, 40, Number of Training samples = 304 Total Elapsed Time = 8335 sec. Results sav ed in Result_5122_1.mat net Structure sav ed in net_5122_1.mat Figure sav ed in Fig_5122_1.jpg

5122_1 02-May-2005

Final Run with 4 Iterations, MaxBV 30 and 40Average over 50 random draws

Max Number of SVs 30 40Average MSE on Training: 3.80 3.20Average MSE on Test: 7.10 6.90Standard Deviation on Training 0.23 0.24Standard Deviation on Test 1.20 1.10

MSE not as good as LS SVM (6.9 versus to 6.5)But Standard deviation better than LS SVM (1.1 versus 1.6).

Cpu Time much longer (90sec versus 3 sec per run)But increases slower with number of samples than LS SVM.Might do better with large data sets.

Conclusion

TSP (Time Series Prediction)

0 1000 2000 3000 4000 5000-600

TSP Data

Training DataPrediction Data

OGP on TSP

RUN = 10 Kpar(1) = 0.0100, kpar(2) = 2000 Overlap between section = 0 Training samples. Max numbers of Support Vectors = 50,

5128_18 08-May-2005

Training DataTest DataGP EstimationOPG Prediction

0 200 400 600 800 1000-600

Initial kpar(1) = 1.00e-002Final kpar(1) = 1.30e-003MSE on Prediction = 2489.7.

980 Training Samples.

OGP on TSP: Initial Runs

5128_13 08-May-2005

800 850 900 950 1000-400

Initial kpar(1) = 1.00e-002

Final kpar(1) = 1.61e-002

MSE on Prediction = 1400.2.

5128_15 08-May-2005

700 750 800 850 900 950 1000-400

Initial kpar(1) = 1.00e-002

RUN = 10 281 Training Samples. Kpar(1) = 0.0100, kpar(2) = 2000 Overlap between section = 0 Training samples. Max numbers of Support Vectors = 50,

5128_33 08-May-2005

700 750 800 850 900 950 1000-400

200Initial kpar(1) = 1.00e-002

Training DataTest DataGP EstimationOPG PredictionSupport Vectors

5128_24 08-May-2005

700 750 800 850 900 950 1000-400

OGP on TSP: Local Minimum

For both runs: Initial kpar = 1e-2

Final kpar = 1.42e-2MSE = 1132

Final kpar = 2.45e-3MSE = 95

OGP on TSP: Impact of Over-fitting

940 960 980 1000 1020 1040

RUN = 7

Kpar(1) = 0.3000, kpar(2) = 2000

Overlap between section = 30 Training samples.

Max numbers of Support Vectors = 50,

OGP on TSP: Impact of Number of Samples on Prediction

cpu=6sec

5128_46 08-May-2005

900 920 940 960 980 1000-300

cpu=16 sec

5128_52 08-May-2005

800 850 900 950 1000-400

5128_42 08-May-2005

700 750 800 850 900 950 1000-400

cpu=27sec

5128_44 08-May-2005

600 650 700 750 800 850 900 950 1000-400

cpu=45sec

cpu=109sec

5128_45 08-May-2005

500 600 700 800 900 1000-400

cpu=233sec

5128_51 08-May-2005

400 500 600 700 800 900 1000-600

200Initial kpar(1) = 1.00e-003Final kpar(1) = 3.58e-003MSE on Prediction = 632.6.

OGP on TSP: Impact of Number SVs on Prediction

cpu=19sec

5128_54 08-May-2005

800 850 900 950 1000-400

cpu=16sec

5128_52 08-May-2005

800 850 900 950 1000-400

cpu=16sec

5128_53 08-May-2005

800 850 900 950 1000-400

OGP on TSP: Final Runs

5128_69 08-May-2005

5128_70 08-May-2005

5128_71 08-May-2005

5128_72 08-May-2005

5128_73 08-May-2005

5128_74 08-May-2005

5128_75 08-May-2005

5128_76 08-May-2005

5128_77 08-May-2005

5128_78 08-May-2005

5128_79 08-May-2005

5128_80 08-May-2005

0 500 1000 1500 2000 2500-600

5128_81 08-May-2005

Running 200 Sample at a time, with 30 sample overlap

1880 1900 1920 1940 1960 1980 2000 2020 2040 2060 2080

RUN = 6

Kpar(1) = 1.0000, kpar(2) = 2200

Overlap between section = 0 Training samples.

OGP on TSP: Why an overlap?

5128_100 08-May-2005

2000 2500 3000 3500 4000 4500 5000-300

Does not always behaves!...

Difficult to find the right set of parameters

•Initial Kernel Parameter

•Number of Support Vectors

•Number of Training Samples per run

OGP on TSP: Conclusion

Beam Position Optimization

Gaussian Beam

-200-100

200-200

With Small Noise

Gaussian Beam

-200-100

200-200

With Noise

Gaussian Beam

-200-100

200-200

With Noise

Gaussian Beam

-200-100

200-200

With Noise

Gaussian Beam

-200-100

200-200

With Noise

-200-100

200-30

Gaussian Beam Position Optimization

Sampling the beam at a given position, and measuring Power

Objective: Find the top of the beam

• Idea:

– With a few initial samples, use the OGP to get an estimate of the beam profile and position

– Move toward the max of the estimate– Add this new sample to the training set– Iterate

-200-100

200-30

-200-100

200-30

-200-100

200-30

-200-100

200-30

-200-100

200-30

-200-100

200-30

-200-100

200-30

-200-100

200-30

-200-100

200-30

-200-100

200-30

-200-100

200-30

-200-100

200-30

-200-100

200-30

Works faster than current algorithm (Finds the top with less

steps)

Does not work well if no noise.

Can be improved

OGP for Beam Optimization: Conclusion

• Specific Kernel: s1 = s2 (Beam is symmetric in x, y)

• Use the known Beam divergence to set the initial Kernel

Parameters

• Optimize choice of sample

– Going directly to the estimated top might not be the best

(Because it does not help to improve the estimate)

– Improve robustness by minimizing probability to sample at lower power

OGP for Beam Optimization: Possible Improvements

D. Conclusion

• OGP is an interesting tool

• Complex software

• Many tunings to insure stability and convergence

• Not easy to use

• Next Steps:

More comparison with online LS SVM

– Performance

– Cpu Time

Conclusion

References

• [1] Gaussian Processes, C.K. Williams, March1, 2002

• [2] Efficient Implementation of Gaussian Processes, Mark Gibbs, David J.C. MacKay, May 28, 1997

• [3] Sparse Online Gaussian Processes, Lehel Csató, Manfred Opper, October 9, 2002

• [4] Neural Networks for pattern Recognition, Christopher M. Bishop

• [5] Time Series Competition Data downloaded from http://www.esat.kuleuven.ac.be/sista/workshop/competition.html

• [6] Castó OGP toolbox for Matlab and Demo Program Tutorial from http://www.kyb.tuebingen.mpg.de/bs/people/csatol/ogp/

Study of Sparse Online Gaussian Process for Regression EE645 Final Project May 2005 Eric Saint...

Documents