Post on 24-Jan-2015
Modern Machine Learning: Regression Methods*
Igor & Igor**
*Based on our lecture at the Strasbourg Chemoinformatics School
**Igor Baskin, MSU
The Goal of Learning
The goal of statistical learning (in chemoinformatics) consists in finding a function f relating the value of some property y (which can be a physicochemical property, biological activity, etc.) to the values of descriptors x1, ..., xM (which can describe chemical compounds, reactions, etc.):

y = f(x1, ..., xM)

Continuous y are handled by regression analysis, function approximation, etc.
Discrete y are handled by discriminant analysis, classification, pattern recognition, etc.
The Goal of Learning
So, the goal of statistical learning is to find such a function class F and such parameters c1, ..., cP that minimize the risk function and therefore provide a model with the highest predictive ability:

R(c1, ..., cP) → min

In classical statistics it is assumed that minimizing the empirical risk also minimizes the true risk:

min R_emp(c1, ..., cP) ⇒ min R(c1, ..., cP)

Is this correct? Almost correct for big data sets, and may not be correct for small data sets.
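The empirical risk is simply the average loss of the model over the training sample. A minimal numpy sketch (the quadratic loss and all names here are illustrative assumptions, not from the lecture):

```python
import numpy as np

def empirical_risk(f, X, y):
    """Average quadratic loss of model f over the training sample.

    The quadratic loss is an illustrative choice; any loss could be used.
    """
    predictions = np.array([f(x) for x in X])
    return float(np.mean((predictions - y) ** 2))

# A model that reproduces the training data exactly has zero empirical risk,
# which by itself says nothing about the true (expected) risk on new data.
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0.0, 2.0, 4.0])
risk = empirical_risk(lambda x: 2.0 * x[0], X, y)
```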
Incorrectness of Empirical Risk Minimization
Data approximation with polynomials of different order:

Pol(x, p) = c0 + Σ_{i=1..p} ci x^i

Minimization of the empirical risk function does not guarantee the best predictive performance of the model.
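This is easy to see numerically: fitting the same noisy sample with polynomials of growing order drives the empirical risk toward zero, regardless of predictive quality. A sketch with made-up data and orders:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 10)
y = x ** 2 + rng.normal(scale=0.05, size=x.size)   # noisy quadratic data

def train_mse(p):
    """Empirical risk (training MSE) of a least-squares polynomial of order p."""
    coeffs = np.polyfit(x, y, deg=p)
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

errors = {p: train_mse(p) for p in (1, 2, 9)}
# Order 9 interpolates all 10 points: near-zero empirical risk,
# but it oscillates wildly (and predicts badly) between the points.
```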
Occam’s Razor
Entia non sunt multiplicanda praeter necessitatem
Entities should not be multiplied beyond necessity
All things being equal, the simplest solution tends to be the best one
William of Ockham (c. 1285–1349)
Machine Learning Regression Methods
• Multiple Linear Regression (MLR)
• Kernel version of this method
• K Nearest Neighbours (kNN)
• Back-Propagation Neural Network (BPNN)
• Associative Neural Network (ASNN)
Multiple Linear Regression

y = c0 + c1 x1 + ... + cM xM

In matrix form Y = XC, with the least-squares solution

C = (XᵀX)⁻¹ XᵀY

C = (c0, c1, ..., cM)ᵀ — regression coefficients
Y = (y1, y2, ..., yN)ᵀ — experimental property values
X — N×(M+1) matrix of descriptor values, whose i-th row is (1, xi1, ..., xiM)

M < N !!!
Topliss: M < N/5 for good models
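The normal-equation solution above takes a few lines of numpy. A sketch with made-up toy data (in practice `numpy.linalg.lstsq` is numerically safer than forming the inverse explicitly):

```python
import numpy as np

# Toy data: y = 1 + 2*x1 + 3*x2 exactly, N = 5 samples, M = 2 descriptors (M < N)
descriptors = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.], [2., 1.]])
y = 1.0 + 2.0 * descriptors[:, 0] + 3.0 * descriptors[:, 1]

X = np.column_stack([np.ones(len(y)), descriptors])  # leading column of 1s for c0
C = np.linalg.inv(X.T @ X) @ X.T @ y                 # C = (X^T X)^(-1) X^T Y
```

With noise-free toy data the recovered coefficients are exactly (c0, c1, c2) = (1, 2, 3).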
Mystery of the “Rule of 5”
John G. Topliss
J.G. Topliss & R.J. Costello, J. Med. Chem., 1972, Vol. 15, No. 10, pp. 1066-1068.
Mystery of the “Rule of 5”
John G. Topliss
J.G. Topliss & R.J. Costello, J. Med. Chem., 1972, Vol. 15, No. 10, pp. 1066-1068.
For example, if R² = 0.40 is regarded as the maximum acceptable level of chance correlation, then the minimum number of observations required to test five variables is about 30, for 10 variables 50 observations, for 20 variables 65 observations, and for 30 variables 85 observations.
Mystery of the “Rule of 5”
C. Hansch, K.W. Kim, R.H. Sarma, JACS, 1973, Vol. 95, No. 19, pp. 6447-6449.
Topliss and Costello have pointed out the danger of finding meaningless chance correlations with three or four data points per variable.
The correlation coefficient is good and there are almost five data points per variable.
Topliss & Hansch: M < N/5 for good models
Kernel ridge regression
• Regression + regularization, in feature space:

min_w Σᵢ (yᵢ − ⟨w, φ(xᵢ)⟩)² + λ‖w‖²

• Avoids large, canceling coefficients
• Bias unnecessary if samples and labels are centered (in feature space)
• Solution lies in the span of the samples, w = Σᵢ αᵢ φ(xᵢ), with α = (K + λI)⁻¹ y, where K is the kernel matrix and I the identity matrix
• Predict new samples x as f(x) = Σᵢ αᵢ k(xᵢ, x)
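The two formulas above — fit α = (K + λI)⁻¹y, predict f(x) = Σᵢ αᵢ k(xᵢ, x) — translate directly to numpy. The RBF kernel and the toy numbers are illustrative assumptions:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian RBF kernel matrix (one common kernel choice)."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

def krr_fit(X, y, lam=1e-3):
    """Dual coefficients alpha = (K + lambda*I)^(-1) y."""
    K = rbf_kernel(X, X)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def krr_predict(X_train, alpha, X_new):
    """f(x) = sum_i alpha_i k(x_i, x)."""
    return rbf_kernel(X_new, X_train) @ alpha

X = np.array([[0.0], [0.5], [1.0], [1.5], [2.0]])
y = np.sin(X[:, 0])
alpha = krr_fit(X, y)
fitted = krr_predict(X, alpha, X)   # in-sample fit, shrunk slightly by lambda
```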
K Nearest Neighbors

D_ij(Euclid) = sqrt( Σ_{k=1..M} (x_ik − x_jk)² )

D_ij(Manhattan) = Σ_{k=1..M} |x_ik − x_jk|

Non-weighted:
y_i(pred) = (1/k) Σ_{j ∈ k neighbours} y_j

Weighted:
y_i(pred) = Σ_{j ∈ k neighbours} (1/D_ij) y_j / Σ_{j ∈ k neighbours} (1/D_ij)
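Both averaging schemes fit in one short function; a sketch with made-up one-descriptor data:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=2, weighted=False):
    """kNN regression with the Euclidean metric, non-weighted or 1/D-weighted."""
    D = np.sqrt(((X_train - x) ** 2).sum(axis=1))   # Euclidean distances D_ij
    nn = np.argsort(D)[:k]                          # indices of the k nearest neighbours
    if not weighted:
        return float(y_train[nn].mean())            # simple average
    w = 1.0 / D[nn]                                 # weights 1/D_ij (assumes D > 0)
    return float((w * y_train[nn]).sum() / w.sum())

X = np.array([[0.0], [1.0], [2.0], [10.0]])
y = np.array([0.0, 1.0, 2.0, 10.0])
# Query at x = 0.9: the two nearest training points are x = 1 and x = 0;
# distance weighting pulls the prediction toward the closer neighbour.
```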
Overfitting by Variable Selection in kNN
Golbraikh A., Tropsha A. Beware of q²! JMGM, 2002, 20, 269-276.
Artificial Neuron

Inputs o1, ..., o5 with incoming weights wi1, ..., wi5

net_i = Σ_j w_ji o_j − t_i
o_i = f(net_i)

Transfer function:
sigmoid: f(x) = 1 / (1 + e^(−x))
threshold: f(x) = 1 if x ≥ 0, 0 otherwise
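The forward pass of a single neuron is one dot product and one nonlinearity. A minimal sketch (the function names are assumptions):

```python
import numpy as np

def neuron_output(o, w, t):
    """Forward pass of one artificial neuron with the sigmoid transfer function.

    o -- outputs of the preceding neurons (o1..o5 on the slide),
    w -- the incoming weights wi1..wi5, t -- the neuron's threshold.
    """
    net = np.dot(w, o) - t               # net_i = sum_j w_ji * o_j - t_i
    return 1.0 / (1.0 + np.exp(-net))    # f(x) = 1 / (1 + e^(-x))

# At net = 0 the sigmoid sits exactly at its midpoint, 0.5.
```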
Multilayer Neural Network
Input Layer
Hidden Layer
Output Layer
Neurons in the input layer correspond to descriptors, neurons in the output layer to the properties being predicted, and neurons in the hidden layer to nonlinear latent variables.
Generalized Delta Rule
This is the application of the steepest descent method to training backpropagation neural networks:

Δw_ij = −η ∂R_emp/∂w_ij = η δ_j y_i,  where δ_j = f′(net_j)(d_j − y_j)

η – learning rate constant
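For a single sigmoid output neuron with quadratic loss, the steepest-descent update can be sketched directly (a toy illustration with made-up data; full backpropagation additionally propagates the δ terms back through the hidden layers):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def delta_rule_step(w, x, d, eta=0.5):
    """One steepest-descent update w <- w - eta * dR_emp/dw."""
    y = sigmoid(w @ x)                 # neuron output
    delta = (d - y) * y * (1.0 - y)    # delta = f'(net)(d - y) for the sigmoid
    return w + eta * delta * x         # w_ij <- w_ij + eta * delta_j * y_i

rng = np.random.default_rng(0)
w = rng.normal(size=2)
x = np.array([1.0, 0.5])
for _ in range(200):
    w = delta_rule_step(w, x, d=1.0)   # drive the output toward the target 1.0
```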
1974: Paul Werbos
1986: David Rumelhart, James McClelland, Geoffrey Hinton
Multilayer Neural Network
Input Layer
Hidden Layer
Output Layer
The number of weights corresponds to the number of adjustable parameters of the method.
Origin of the “Rule of 2”
The number of weights (adjustable parameters) for the case of one hidden layer:

W = (I+1)H + (H+1)O

Parameter ρ = N/W; recommended 1.8 ≤ ρ ≤ 2.2
T.A. Andrea and H. Kalayeh, J. Med. Chem., 1991, 34, 2824-2836.

End of the “Rule of 2”
I.V. Tetko, D.J. Livingstone, A.I. Luik, J. Chem. Inf. Comput. Sci., 1995, 35, 826-833.
Baskin, I.I. et al. Foundations Comput. Decision Sci., 1997, Vol. 22, No. 2, pp. 107-116.
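The weight count W = (I+1)H + (H+1)O is easy to reproduce; the network sizes below are made-up examples:

```python
def n_weights(I, H, O):
    """Number of weights for one hidden layer: W = (I+1)H + (H+1)O."""
    return (I + 1) * H + (H + 1) * O

# e.g. 10 descriptors, 4 hidden neurons, 1 output property:
W = n_weights(10, 4, 1)   # (10+1)*4 + (4+1)*1 = 49 weights
rho = 100 / W             # with N = 100 training compounds, rho = N/W
```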
Overtraining and Early Stopping
[Figure: error curves for the training set, test set 1, and test set 2; the minimum of the test-set error marks the point for early stopping]
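In code, early stopping just means remembering the epoch at which the held-out error bottoms out; the per-epoch error values below are invented for illustration:

```python
import numpy as np

# Hypothetical per-epoch errors: the training error keeps falling,
# but the held-out (test set 1) error turns back up -- overtraining.
train_err = np.array([1.00, 0.60, 0.40, 0.30, 0.25, 0.22])
test_err  = np.array([1.10, 0.70, 0.50, 0.45, 0.55, 0.70])

stop_epoch = int(np.argmin(test_err))   # the point for early stopping
```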
Associative Neural Network (ASNN)

Ensemble approach — a prediction for case i:
x_i → ANN ensemble of M networks → z_i = (z_i1, ..., z_iM)

z̄_i = (1/M) Σ_{k=1..M} z_ik

ASNN bias correction:
z′_i = z̄_i + (1/k) Σ_{j ∈ N_k(i)} (y_j − z̄_j)

Pearson’s (Spearman) correlation coefficient r_ij = R(z_i, z_j) > 0 in the space of residuals.
The correction of the neural network ensemble value is performed using the errors (biases) calculated for the neighbor cases of the analyzed case x_i, detected in the space of neural network models.
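A much-simplified sketch of this bias correction, with toy numbers (the real ASNN differs in detail; see the Tetko references):

```python
import numpy as np

def asnn_predict(z_train, y_train, z_query, k=2):
    """Correct the query's ensemble mean by the average bias of the k
    training cases most correlated with it in the space of models."""
    z_bar = z_train.mean(axis=1)                                  # ensemble means
    r = np.array([np.corrcoef(z_query, z)[0, 1] for z in z_train])
    nn = np.argsort(-r)[:k]                                       # k nearest in model space
    bias = (y_train[nn] - z_bar[nn]).mean()                       # average residual y_j - z_bar_j
    return float(z_query.mean() + bias)

# Toy ensemble of M = 3 models that systematically underestimates by 1.0:
z_train = np.array([[1.0, 1.2, 1.4], [2.0, 2.2, 2.4], [3.0, 3.2, 3.4]])
y_train = z_train.mean(axis=1) + 1.0
pred = asnn_predict(z_train, y_train, np.array([2.5, 2.7, 2.9]))
# The neighbours' common bias (+1.0) is added back to the query's ensemble mean.
```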
Correction of a model by the nearest neighbors
Nearest neighbors for the Gauss function
[Figure: panels A and B show y vs. x, panel C shows x2 vs. x1]
Detection of nearest neighbors in the space of models uses invariants in “structure-property” space.
ASNN local correction strongly improves prediction of extreme data
Tetko, Methods Mol Biol, 2008, 458, 826-833.
Prediction of similar properties
[Figure: training “LogP” data and the “LogD” function]
10 new points are measured
[Figure: training “LogP” data and “LogD” values, with the 10 new points]
Library mode (no retraining)
[Figure: training “LogP” data, “LogD” values, and the library “LogD” prediction]
Library mode vs development of a new model using 10 cases
[Figure: target function “LogD”, the library “LogD” prediction, and retraining from “scratch”]
How to validate a model?
• Training/validation set partition
  – How to select the validation set?
  – How many molecules should be in it?
• N-fold cross-validation
What is n-fold cross-validation?
White & blue rectangles are the training & validation sets, respectively.
n = 5
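Generating the n partitions is a few lines of numpy; a minimal sketch (the function name is an assumption, and real QSAR workflows usually shuffle the sample first):

```python
import numpy as np

def n_fold_splits(n_samples, n_folds=5):
    """Yield (training, validation) index pairs for n-fold cross-validation.

    Each fold plays the validation set (the "blue rectangle") exactly once;
    the remaining folds form the training set.
    """
    folds = np.array_split(np.arange(n_samples), n_folds)
    for i in range(n_folds):
        train = np.concatenate(folds[:i] + folds[i + 1:])
        yield train, folds[i]

splits = list(n_fold_splits(10, n_folds=5))   # 5 splits of 8 train / 2 validation
```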
Conclusions
• There are many machine learning methods
• Different problems may require different methods
• All methods can be prone to overfitting
• But all of them have facilities to tackle this problem
• To correctly evaluate the performance of your model, use a validation set or, better, n-fold cross-validation (a built-in option)
Exam. Question 1
What is it?
1. Support Vector Regression
2. Backpropagation Neural Network
3. Partial Least Squares Regression
Exam. Question 2
Which method is not prone to overfitting?
1. Multiple Linear Regression
2. Partial Least Squares
3. Support Vector Regression
4. Backpropagation Neural Networks
5. K Nearest Neighbours
Neither!
Thank you for your attention!