Post on 24-Jan-2015
Modern Machine Learning: Regression Methods*
Igor & Igor**
*Based on our lecture at the Strasbourg Chemoinformatics School
**Igor Baskin, MSU
The Goal of Learning
The goal of statistical learning (in chemoinformatics) consists in finding a function f relating the value of some property y (which can be a physicochemical property, biological activity, etc.) to the values of descriptors x1, ..., xM (which can describe chemical compounds, reactions, etc.):

y = f(x1, ..., xM)

Continuous y are handled by regression analysis, function approximation, etc.
Discrete y are handled by discriminant analysis, classification, pattern recognition, etc.
The Goal of Learning
So, the goal of statistical learning is to find such a function class F and such parameters c1, ..., cP that minimize the risk function and therefore provide a model with the highest predictive ability:

R(c1, ..., cP) → min

In classical statistics it is assumed that minimizing the empirical risk also minimizes the true risk:

min R_emp(c1, ..., cP) ⇒ min R(c1, ..., cP)

Is this correct? Almost correct for big data sets, and may not be correct for small data sets.
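The empirical risk is simply the average loss of the model over the training sample. A minimal numpy sketch (the quadratic loss and all names here are illustrative assumptions, not from the lecture):

```python
import numpy as np

def empirical_risk(f, X, y):
    """Average quadratic loss of model f over the training sample.

    The quadratic loss is an illustrative choice; any loss could be used.
    """
    predictions = np.array([f(x) for x in X])
    return float(np.mean((predictions - y) ** 2))

# A model that reproduces the training data exactly has zero empirical risk,
# which by itself says nothing about the true (expected) risk on new data.
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0.0, 2.0, 4.0])
risk = empirical_risk(lambda x: 2.0 * x[0], X, y)
```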
Incorrectness of Empirical Risk Minimization
Data approximation with polynomials of different order:

Pol(x, p) = c0 + Σ_{i=1..p} ci x^i

Minimization of the empirical risk function does not guarantee the best predictive performance of the model.
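This is easy to see numerically: fitting the same noisy sample with polynomials of growing order drives the empirical risk toward zero, regardless of predictive quality. A sketch with made-up data and orders:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 10)
y = x ** 2 + rng.normal(scale=0.05, size=x.size)   # noisy quadratic data

def train_mse(p):
    """Empirical risk (training MSE) of a least-squares polynomial of order p."""
    coeffs = np.polyfit(x, y, deg=p)
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

errors = {p: train_mse(p) for p in (1, 2, 9)}
# Order 9 interpolates all 10 points: near-zero empirical risk,
# but it oscillates wildly (and predicts badly) between the points.
```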
Occam’s Razor
Entia non sunt multiplicanda praeter necessitatem
Entities should not be multiplied beyond necessity
All things being equal, the simplest solution tends to be the best one
William of Ockham (c. 1285–1349)
Machine Learning Regression Methods
• Multiple Linear Regression (MLR)
• Kernel version of this method
• K Nearest Neighbours (kNN)
• Back-Propagation Neural Network (BPNN)
• Associative Neural Network (ASNN)
Multiple Linear Regression

y = c0 + c1 x1 + ... + cM xM

In matrix form Y = XC, with the least-squares solution

C = (XᵀX)⁻¹ XᵀY

C = (c0, c1, ..., cM)ᵀ — regression coefficients
Y = (y1, y2, ..., yN)ᵀ — experimental property values
X — N×(M+1) matrix of descriptor values, whose i-th row is (1, xi1, ..., xiM)

M < N !!!
Topliss: M < N/5 for good models
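The normal-equation solution above takes a few lines of numpy. A sketch with made-up toy data (in practice `numpy.linalg.lstsq` is numerically safer than forming the inverse explicitly):

```python
import numpy as np

# Toy data: y = 1 + 2*x1 + 3*x2 exactly, N = 5 samples, M = 2 descriptors (M < N)
descriptors = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.], [2., 1.]])
y = 1.0 + 2.0 * descriptors[:, 0] + 3.0 * descriptors[:, 1]

X = np.column_stack([np.ones(len(y)), descriptors])  # leading column of 1s for c0
C = np.linalg.inv(X.T @ X) @ X.T @ y                 # C = (X^T X)^(-1) X^T Y
```

With noise-free toy data the recovered coefficients are exactly (c0, c1, c2) = (1, 2, 3).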
Mystery of the “Rule of 5”
John G. Topliss
J.G. Topliss & R.J. Costello, J. Med. Chem., 1972, Vol. 15, No. 10, pp. 1066-1068.
Mystery of the “Rule of 5”
John G. Topliss
J.G. Topliss & R.J. Costello, J. Med. Chem., 1972, Vol. 15, No. 10, pp. 1066-1068.
For example, if R² = 0.40 is regarded as the maximum acceptable level of chance correlation, then the minimum number of observations required to test five variables is about 30, for 10 variables 50 observations, for 20 variables 65 observations, and for 30 variables 85 observations.
Mystery of the “Rule of 5”
C. Hansch, K.W. Kim, R.H. Sarma, JACS, 1973, Vol. 95, No. 19, pp. 6447-6449.
Topliss and Costello have pointed out the danger of finding meaningless chance correlations with three or four data points per variable.
The correlation coefficient is good and there are almost five data points per variable.
Topliss & Hansch: M < N/5 for good models
Kernel ridge regression
• Regression + regularization, in feature space:

min_w Σᵢ (yᵢ − ⟨w, φ(xᵢ)⟩)² + λ‖w‖²

• Avoids large, canceling coefficients
• Bias unnecessary if samples and labels are centered (in feature space)
• Solution lies in the span of the samples, w = Σᵢ αᵢ φ(xᵢ), with α = (K + λI)⁻¹ y, where K is the kernel matrix and I the identity matrix
• Predict new samples x as f(x) = Σᵢ αᵢ k(xᵢ, x)
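The two formulas above — fit α = (K + λI)⁻¹y, predict f(x) = Σᵢ αᵢ k(xᵢ, x) — translate directly to numpy. The RBF kernel and the toy numbers are illustrative assumptions:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian RBF kernel matrix (one common kernel choice)."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

def krr_fit(X, y, lam=1e-3):
    """Dual coefficients alpha = (K + lambda*I)^(-1) y."""
    K = rbf_kernel(X, X)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def krr_predict(X_train, alpha, X_new):
    """f(x) = sum_i alpha_i k(x_i, x)."""
    return rbf_kernel(X_new, X_train) @ alpha

X = np.array([[0.0], [0.5], [1.0], [1.5], [2.0]])
y = np.sin(X[:, 0])
alpha = krr_fit(X, y)
fitted = krr_predict(X, alpha, X)   # in-sample fit, shrunk slightly by lambda
```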
K Nearest Neighbors

D_ij(Euclid) = sqrt( Σ_{k=1..M} (x_ik − x_jk)² )

D_ij(Manhattan) = Σ_{k=1..M} |x_ik − x_jk|

Non-weighted:
y_i(pred) = (1/k) Σ_{j ∈ k neighbours} y_j

Weighted:
y_i(pred) = Σ_{j ∈ k neighbours} (1/D_ij) y_j / Σ_{j ∈ k neighbours} (1/D_ij)
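Both averaging schemes fit in one short function; a sketch with made-up one-descriptor data:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=2, weighted=False):
    """kNN regression with the Euclidean metric, non-weighted or 1/D-weighted."""
    D = np.sqrt(((X_train - x) ** 2).sum(axis=1))   # Euclidean distances D_ij
    nn = np.argsort(D)[:k]                          # indices of the k nearest neighbours
    if not weighted:
        return float(y_train[nn].mean())            # simple average
    w = 1.0 / D[nn]                                 # weights 1/D_ij (assumes D > 0)
    return float((w * y_train[nn]).sum() / w.sum())

X = np.array([[0.0], [1.0], [2.0], [10.0]])
y = np.array([0.0, 1.0, 2.0, 10.0])
# Query at x = 0.9: the two nearest training points are x = 1 and x = 0;
# distance weighting pulls the prediction toward the closer neighbour.
```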
Overfitting by Variable Selection in kNN
Golbraikh A., Tropsha A. Beware of q²! JMGM, 2002, 20, 269-276.
Artificial Neuron

Inputs o1, ..., o5 with incoming weights wi1, ..., wi5

net_i = Σ_j w_ji o_j − t_i
o_i = f(net_i)

Transfer function:
sigmoid: f(x) = 1 / (1 + e^(−x))
threshold: f(x) = 1 if x ≥ 0, 0 otherwise
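The forward pass of a single neuron is one dot product and one nonlinearity. A minimal sketch (the function names are assumptions):

```python
import numpy as np

def neuron_output(o, w, t):
    """Forward pass of one artificial neuron with the sigmoid transfer function.

    o -- outputs of the preceding neurons (o1..o5 on the slide),
    w -- the incoming weights wi1..wi5, t -- the neuron's threshold.
    """
    net = np.dot(w, o) - t               # net_i = sum_j w_ji * o_j - t_i
    return 1.0 / (1.0 + np.exp(-net))    # f(x) = 1 / (1 + e^(-x))

# At net = 0 the sigmoid sits exactly at its midpoint, 0.5.
```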
Multilayer Neural Network
Input Layer
Hidden Layer
Output Layer
Neurons in the input layer correspond to descriptors, neurons in the output layer to the properties being predicted, and neurons in the hidden layer to nonlinear latent variables.
Generalized Delta Rule
This is the application of the steepest descent method to training backpropagation neural networks:

Δw_ij = −η ∂R_emp/∂w_ij = η δ_j y_i,  where δ_j = f′(net_j)(d_j − y_j)

η – learning rate constant
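For a single sigmoid output neuron with quadratic loss, the steepest-descent update can be sketched directly (a toy illustration with made-up data; full backpropagation additionally propagates the δ terms back through the hidden layers):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def delta_rule_step(w, x, d, eta=0.5):
    """One steepest-descent update w <- w - eta * dR_emp/dw."""
    y = sigmoid(w @ x)                 # neuron output
    delta = (d - y) * y * (1.0 - y)    # delta = f'(net)(d - y) for the sigmoid
    return w + eta * delta * x         # w_ij <- w_ij + eta * delta_j * y_i

rng = np.random.default_rng(0)
w = rng.normal(size=2)
x = np.array([1.0, 0.5])
for _ in range(200):
    w = delta_rule_step(w, x, d=1.0)   # drive the output toward the target 1.0
```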
1974: Paul Werbos
1986: David Rumelhart, James McClelland, Geoffrey Hinton
Multilayer Neural Network
Input Layer
Hidden Layer
Output Layer
The number of weights corresponds to the number of adjustable parameters of the method.
Origin of the “Rule of 2”
The number of weights (adjustable parameters) for the case of one hidden layer:

W = (I+1)H + (H+1)O

Parameter ρ = N/W; recommended 1.8 ≤ ρ ≤ 2.2
T.A. Andrea and H. Kalayeh, J. Med. Chem., 1991, 34, 2824-2836.

End of the “Rule of 2”
I.V. Tetko, D.J. Livingstone, A.I. Luik, J. Chem. Inf. Comput. Sci., 1995, 35, 826-833.
Baskin, I.I. et al. Foundations Comput. Decision Sci., 1997, Vol. 22, No. 2, pp. 107-116.
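The weight count W = (I+1)H + (H+1)O is easy to reproduce; the network sizes below are made-up examples:

```python
def n_weights(I, H, O):
    """Number of weights for one hidden layer: W = (I+1)H + (H+1)O."""
    return (I + 1) * H + (H + 1) * O

# e.g. 10 descriptors, 4 hidden neurons, 1 output property:
W = n_weights(10, 4, 1)   # (10+1)*4 + (4+1)*1 = 49 weights
rho = 100 / W             # with N = 100 training compounds, rho = N/W
```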
Overtraining and Early Stopping
[Figure: error curves for the training set, test set 1, and test set 2; the minimum of the test-set error marks the point for early stopping]
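In code, early stopping just means remembering the epoch at which the held-out error bottoms out; the per-epoch error values below are invented for illustration:

```python
import numpy as np

# Hypothetical per-epoch errors: the training error keeps falling,
# but the held-out (test set 1) error turns back up -- overtraining.
train_err = np.array([1.00, 0.60, 0.40, 0.30, 0.25, 0.22])
test_err  = np.array([1.10, 0.70, 0.50, 0.45, 0.55, 0.70])

stop_epoch = int(np.argmin(test_err))   # the point for early stopping
```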
Associative Neural Network (ASNN)

Ensemble approach — a prediction for case i:
x_i → ANN ensemble of M networks → z_i = (z_i1, ..., z_iM)

z̄_i = (1/M) Σ_{k=1..M} z_ik

ASNN bias correction:
z′_i = z̄_i + (1/k) Σ_{j ∈ N_k(i)} (y_j − z̄_j)

Pearson’s (Spearman) correlation coefficient r_ij = R(z_i, z_j) > 0 in the space of residuals.
The correction of the neural network ensemble value is performed using the errors (biases) calculated for the neighbor cases of the analyzed case x_i, detected in the space of neural network models.
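A much-simplified sketch of this bias correction, with toy numbers (the real ASNN differs in detail; see the Tetko references):

```python
import numpy as np

def asnn_predict(z_train, y_train, z_query, k=2):
    """Correct the query's ensemble mean by the average bias of the k
    training cases most correlated with it in the space of models."""
    z_bar = z_train.mean(axis=1)                                  # ensemble means
    r = np.array([np.corrcoef(z_query, z)[0, 1] for z in z_train])
    nn = np.argsort(-r)[:k]                                       # k nearest in model space
    bias = (y_train[nn] - z_bar[nn]).mean()                       # average residual y_j - z_bar_j
    return float(z_query.mean() + bias)

# Toy ensemble of M = 3 models that systematically underestimates by 1.0:
z_train = np.array([[1.0, 1.2, 1.4], [2.0, 2.2, 2.4], [3.0, 3.2, 3.4]])
y_train = z_train.mean(axis=1) + 1.0
pred = asnn_predict(z_train, y_train, np.array([2.5, 2.7, 2.9]))
# The neighbours' common bias (+1.0) is added back to the query's ensemble mean.
```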
Correction of a model by the nearest neighbors
Nearest neighbors for the Gauss function
[Figure: panels A and B show y vs. x, panel C shows x2 vs. x1]
Detection of nearest neighbors in the space of models uses invariants in “structure-property” space.
ASNN local correction strongly improves prediction of extreme data
Tetko, Methods Mol Biol, 2008, 458, 826-833.
Prediction of similar properties
[Figure: training “LogP” data and the “LogD” function]
10 new points are measured
[Figure: training “LogP” data and “LogD” values, with the 10 new points]
Library mode (no retraining)
[Figure: training “LogP” data, “LogD” values, and the library “LogD” prediction]
Library mode vs development of a new model using 10 cases
[Figure: target function “LogD”, the library “LogD” prediction, and retraining from “scratch”]
How to validate a model?
• Training/validation set partition
  – How to select the validation set?
  – How many molecules should be in it?
• N-fold cross-validation
What is n-fold cross-validation?
White & blue rectangles are the training & validation sets, respectively.
n = 5
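Generating the n partitions is a few lines of numpy; a minimal sketch (the function name is an assumption, and real QSAR workflows usually shuffle the sample first):

```python
import numpy as np

def n_fold_splits(n_samples, n_folds=5):
    """Yield (training, validation) index pairs for n-fold cross-validation.

    Each fold plays the validation set (the "blue rectangle") exactly once;
    the remaining folds form the training set.
    """
    folds = np.array_split(np.arange(n_samples), n_folds)
    for i in range(n_folds):
        train = np.concatenate(folds[:i] + folds[i + 1:])
        yield train, folds[i]

splits = list(n_fold_splits(10, n_folds=5))   # 5 splits of 8 train / 2 validation
```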
Conclusions
• There are many machine learning methods
• Different problems may require different methods
• All methods can be prone to overfitting
• But all of them have facilities to tackle this problem
• To correctly evaluate the performance of your model, use a validation set or, better, n-fold cross-validation (a built-in option)
Exam. Question 1
What is it?
1. Support Vector Regression
2. Backpropagation Neural Network
3. Partial Least Squares Regression
Exam. Question 2
Which method is not prone to overfitting?
1. Multiple Linear Regression
2. Partial Least Squares
3. Support Vector Regression
4. Backpropagation Neural Networks
5. K Nearest Neighbours
Neither!
Thank you for your attention!