CS 2710 Foundations of Machine Learning
Lecture 24
Milos Hauskrecht
5329 Sennott Square
Linear regression (cont.)
Logistic regression
Linear regression
• Vector definition of the model
  – Include bias constant in the input vector: $\mathbf{x} = (1, x_1, x_2, \ldots, x_d)$
  – $f(\mathbf{x}) = w_0 + w_1 x_1 + w_2 x_2 + \ldots + w_d x_d = \mathbf{w}^T \mathbf{x}$
  – $w_0, w_1, \ldots, w_d$ – parameters (weights)
[Figure: a single linear unit with inputs $1, x_1, x_2, \ldots, x_d$, weights $w_0, w_1, w_2, \ldots, w_d$, and output $f(\mathbf{x}, \mathbf{w})$]
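To make the vector form concrete, a minimal sketch in Python/NumPy (variable names and values are illustrative, not from the lecture):

```python
import numpy as np

def predict(w, x):
    """Linear model f(x) = w^T x, where x already includes the bias constant 1."""
    return w @ x

# Example: d = 2 inputs, so x = (1, x1, x2) and w = (w0, w1, w2)
w = np.array([0.5, 2.0, -1.0])
x = np.array([1.0, 3.0, 4.0])   # bias constant prepended
print(predict(w, x))            # 0.5 + 2*3 - 1*4 = 2.5
```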
Linear regression. Error.
• Data: $D = \{\langle \mathbf{x}_i, y_i \rangle\}$
• Function: $\mathbf{x}_i \to f(\mathbf{x}_i)$
• We would like to have $y_i \approx f(\mathbf{x}_i)$ for all $i = 1, \ldots, n$
• Error function
  – measures how much our predictions deviate from the desired answers
  – Mean-squared error: $J_n = \frac{1}{n} \sum_{i=1..n} (y_i - f(\mathbf{x}_i))^2$
• Learning:
We want to find the weights minimizing the error!
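A small sketch of the mean-squared error computation, assuming NumPy and a made-up dataset:

```python
import numpy as np

def mse(w, X, y):
    """J_n = (1/n) * sum_i (y_i - w^T x_i)^2, for rows x_i of X."""
    residuals = y - X @ w
    return np.mean(residuals ** 2)

# Hypothetical dataset: n = 3 examples, bias column already included
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
print(mse(np.array([1.0, 2.0]), X, y))  # 0.0 -- this w fits the data exactly
```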
Linear regression. Example
• 1 dimensional input $\mathbf{x} = (x_1)$
[Figure: data points and the fitted regression line for a one-dimensional input]
Linear regression. Example.
• 2 dimensional input $\mathbf{x} = (x_1, x_2)$
[Figure: data points and the fitted regression plane for a two-dimensional input]
Linear regression. Optimization.
• We want the weights minimizing the error
$J_n = \frac{1}{n} \sum_{i=1..n} (y_i - f(\mathbf{x}_i))^2 = \frac{1}{n} \sum_{i=1..n} (y_i - \mathbf{w}^T \mathbf{x}_i)^2$
• For the optimal set of parameters, derivatives of the error with respect to each parameter must be 0
• Vector of derivatives:
$\operatorname{grad}_{\mathbf{w}}(J_n(\mathbf{w})) = \nabla_{\mathbf{w}}(J_n(\mathbf{w})) = -\frac{2}{n} \sum_{i=1}^{n} (y_i - \mathbf{w}^T \mathbf{x}_i)\, \mathbf{x}_i = \mathbf{0}$
• For the $j$-th weight:
$\frac{\partial}{\partial w_j} J_n(\mathbf{w}) = -\frac{2}{n} \sum_{i=1}^{n} (y_i - w_0 x_{i,0} - w_1 x_{i,1} - \ldots - w_d x_{i,d})\, x_{i,j} = 0$
Solving linear regression
By rearranging the terms we get a system of linear equations with $d+1$ unknowns, one equation per weight $w_j$:
$w_0 \sum_{i=1}^{n} x_{i,0} x_{i,j} + w_1 \sum_{i=1}^{n} x_{i,1} x_{i,j} + \ldots + w_j \sum_{i=1}^{n} x_{i,j} x_{i,j} + \ldots + w_d \sum_{i=1}^{n} x_{i,d} x_{i,j} = \sum_{i=1}^{n} y_i x_{i,j}$
For example, for $j = 0$ (recall $x_{i,0} = 1$) and for $j = 1$:
$w_0 \sum_{i=1}^{n} x_{i,0} + w_1 \sum_{i=1}^{n} x_{i,1} + \ldots + w_d \sum_{i=1}^{n} x_{i,d} = \sum_{i=1}^{n} y_i$
$w_0 \sum_{i=1}^{n} x_{i,0} x_{i,1} + w_1 \sum_{i=1}^{n} x_{i,1} x_{i,1} + \ldots + w_d \sum_{i=1}^{n} x_{i,d} x_{i,1} = \sum_{i=1}^{n} y_i x_{i,1}$
In matrix form: $\mathbf{A}\mathbf{w} = \mathbf{b}$
Solving linear regression
• The optimal set of weights satisfies:
$\nabla_{\mathbf{w}}(J_n(\mathbf{w})) = -\frac{2}{n} \sum_{i=1}^{n} (y_i - \mathbf{w}^T \mathbf{x}_i)\, \mathbf{x}_i = \mathbf{0}$
Leads to a system of linear equations (SLE) with $d+1$ unknowns of the form
$w_0 \sum_{i=1}^{n} x_{i,0} x_{i,j} + w_1 \sum_{i=1}^{n} x_{i,1} x_{i,j} + \ldots + w_j \sum_{i=1}^{n} x_{i,j} x_{i,j} + \ldots + w_d \sum_{i=1}^{n} x_{i,d} x_{i,j} = \sum_{i=1}^{n} y_i x_{i,j}$
that is, $\mathbf{A}\mathbf{w} = \mathbf{b}$
Solution to SLE:
• matrix inversion: $\mathbf{w} = \mathbf{A}^{-1}\mathbf{b}$
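The system can be assembled and solved directly; a sketch in Python/NumPy with hypothetical data ($\mathbf{A}$ and $\mathbf{b}$ below are $\mathbf{X}^T\mathbf{X}$ and $\mathbf{X}^T\mathbf{y}$; the common $1/n$ factors cancel):

```python
import numpy as np

def solve_linear_regression(X, y):
    """Solve the normal equations A w = b, with A = X^T X and b = X^T y.
    np.linalg.solve is preferred over forming an explicit matrix inverse."""
    A = X.T @ X
    b = X.T @ y
    return np.linalg.solve(A, b)

# Hypothetical data: the column of ones is the bias input x_0
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.1, 2.9, 5.2, 6.8])
print(solve_linear_regression(X, y))  # approximately [1.09, 1.94]
```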
Gradient descent solution
Goal: the weight optimization in the linear regression model
$J_n = Error(\mathbf{w}) = \frac{1}{n} \sum_{i=1..n} (y_i - f(\mathbf{x}_i, \mathbf{w}))^2$
An alternative to the SLE solution:
• Gradient descent
Idea:
– Adjust weights in the direction that improves the Error
– The gradient tells us what the right direction is
$\mathbf{w} \leftarrow \mathbf{w} - \alpha \nabla_{\mathbf{w}} Error(\mathbf{w})$
$\alpha > 0$ – a learning rate (scales the gradient changes)
Gradient descent method
• Descend using the gradient information
• Change the value of $\mathbf{w}$ according to the gradient:
$\mathbf{w} \leftarrow \mathbf{w} - \alpha \nabla_{\mathbf{w}} Error(\mathbf{w})\big|_{\mathbf{w}^*}$
[Figure: the curve $Error(\mathbf{w})$ with the gradient evaluated at a point $\mathbf{w}^*$ and the direction of the descent]
Gradient descent method
• New value of the parameter:
$w_j^* \leftarrow w_j - \alpha \frac{\partial}{\partial w_j} Error(\mathbf{w})\big|_{\mathbf{w}^*}$ for all $j$
$\alpha > 0$ – a learning rate (scales the gradient changes)
[Figure: one descent step from $\mathbf{w}^*$ on the $Error(\mathbf{w})$ curve]
Gradient descent method
• Iteratively approaches the optimum of the Error function
[Figure: the sequence $\mathbf{w}^{(0)}, \mathbf{w}^{(1)}, \mathbf{w}^{(2)}, \mathbf{w}^{(3)}, \ldots$ descending the $Error(\mathbf{w})$ curve]
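A sketch of the batch update in Python/NumPy (the step size and iteration count are illustrative choices):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.05, n_steps=1000):
    """Batch gradient descent on J_n(w) = (1/n) sum_i (y_i - w^T x_i)^2.
    The gradient is -(2/n) sum_i (y_i - w^T x_i) x_i."""
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        grad = -(2.0 / len(y)) * X.T @ (y - X @ w)
        w = w - alpha * grad            # step against the gradient
    return w

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.1, 2.9, 5.2, 6.8])
print(gradient_descent(X, y))  # converges toward the exact solution [1.09, 1.94]
```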
Batch vs Online regression algorithm
• The error function is defined on the complete dataset $D$:
$J_n = Error(\mathbf{w}) = \frac{1}{n} \sum_{i=1..n} (y_i - f(\mathbf{x}_i, \mathbf{w}))^2$
• We say we are learning the model in the batch mode:
  – All examples are available at the time of learning
  – Weights are optimized with respect to all training examples
• An alternative is to learn the model in the online mode:
  – Examples arrive sequentially
  – Model weights are updated after every example
  – If needed, examples seen so far can be forgotten
Online gradient algorithm
• The error function is defined for the complete dataset $D$:
$J_n = Error(\mathbf{w}) = \frac{1}{n} \sum_{i=1..n} (y_i - f(\mathbf{x}_i, \mathbf{w}))^2$
• Error for one example $D_i = \langle \mathbf{x}_i, y_i \rangle$:
$J_{online} = Error_i(\mathbf{w}) = \frac{1}{2} (y_i - f(\mathbf{x}_i, \mathbf{w}))^2$
• Online gradient method: changes weights after every example
$w_j \leftarrow w_j - \alpha \frac{\partial}{\partial w_j} Error_i(\mathbf{w})$
• vector form: $\mathbf{w} \leftarrow \mathbf{w} - \alpha \nabla_{\mathbf{w}} Error_i(\mathbf{w})$
$\alpha > 0$ – learning rate that depends on the number of updates
Online gradient method
Linear model: $f(\mathbf{x}) = \mathbf{w}^T \mathbf{x}$
On-line error: $J_{online} = Error_i(\mathbf{w}) = \frac{1}{2} (y_i - f(\mathbf{x}_i, \mathbf{w}))^2$
On-line algorithm: generates a sequence of online updates.
$(i)$-th update step with $D_i = \langle \mathbf{x}_i, y_i \rangle$, for the $j$-th weight:
$w_j^{(i)} \leftarrow w_j^{(i-1)} - \alpha(i) \frac{\partial Error_i(\mathbf{w})}{\partial w_j}\Big|_{\mathbf{w}^{(i-1)}} = w_j^{(i-1)} + \alpha(i)\, (y_i - f(\mathbf{x}_i, \mathbf{w}^{(i-1)}))\, x_{i,j}$
Annealed learning rate: $\alpha(i) \sim \frac{1}{i}$
– Gradually rescales changes
Fixed learning rate: $\alpha(i) = C$
– Use a small constant
Online regression algorithm
Online-linear-regression (stopping_criterion)
  initialize weights $\mathbf{w} = (w_0, w_1, w_2, \ldots, w_d)$
  initialize $i = 1$
  while stopping_criterion = FALSE
    select the next data point $D_i = \langle \mathbf{x}_i, y_i \rangle$
    set learning rate $\alpha(i)$
    update weight vector $\mathbf{w} \leftarrow \mathbf{w} + \alpha(i)\, (y_i - f(\mathbf{x}_i, \mathbf{w}))\, \mathbf{x}_i$
    $i = i + 1$
  end
  return weights $\mathbf{w}$
Advantages: very easy to implement, handles continuous data streams
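A possible implementation of this loop in Python/NumPy, with a simulated data stream (the stream, the learning-rate choices, and the stopping rule are assumptions for illustration):

```python
import numpy as np

def online_linear_regression(stream, d, C=0.01, annealed=True):
    """Online-linear-regression sketch: one gradient update per example.
    stream yields (x, y) pairs with the bias constant already in x."""
    w = np.zeros(d + 1)
    for i, (x, y) in enumerate(stream, start=1):
        alpha = 1.0 / i if annealed else C   # annealed or fixed learning rate
        w += alpha * (y - w @ x) * x         # w <- w + alpha(i)(y_i - f(x_i, w)) x_i
    return w

# Hypothetical stream drawn from y = 1 + 2*x1 plus noise
rng = np.random.default_rng(0)
data = [(np.array([1.0, x]), 1 + 2 * x + rng.normal(0, 0.1))
        for x in rng.uniform(-1, 1, 500)]
print(online_linear_regression(data, d=1))  # drifts toward [1, 2]
```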
On-line learning. Example
[Figure: four snapshots (1–4) of the regression line fitted by successive online updates]
Adaptive models
Linear model: $f(\mathbf{x}) = \mathbf{w}^T \mathbf{x}$
On-line error: $J_{online} = Error_i(\mathbf{w}) = \frac{1}{2} (y_i - f(\mathbf{x}_i, \mathbf{w}))^2$
On-line algorithm:
• Sequence of online updates (one example at a time)
• Useful for continuous data streams
Adaptive models:
• the underlying model is not stationary and can change over time
• Example: seasonal changes
• The on-line algorithm can be made adaptive by keeping the learning rate at some constant value: $\alpha(i) = C$
Extensions of simple linear model
Replace inputs to linear units with feature (basis) functions to model nonlinearities:
$f(\mathbf{x}) = w_0 + \sum_{j=1}^{m} w_j \phi_j(\mathbf{x})$
$\phi_j(\mathbf{x})$ – an arbitrary function of $\mathbf{x}$
[Figure: a linear unit whose inputs are $1, \phi_1(\mathbf{x}), \phi_2(\mathbf{x}), \ldots, \phi_m(\mathbf{x})$ with weights $w_0, w_1, w_2, \ldots, w_m$ and output $f(\mathbf{x})$]
The same techniques as before to learn the weights!
Extensions of the linear model
• Models linear in the parameters we want to fit:
$f(\mathbf{x}) = w_0 + \sum_{k=1}^{m} w_k \phi_k(\mathbf{x})$
$\phi_1(\mathbf{x}), \phi_2(\mathbf{x}), \ldots, \phi_m(\mathbf{x})$ – feature or basis functions
$w_0, w_1, \ldots, w_m$ – parameters
• Basis functions examples:
  – a higher order polynomial, one-dimensional input $\mathbf{x} = (x_1)$:
    $\phi_1(x) = x$, $\phi_2(x) = x^2$, $\phi_3(x) = x^3$
  – Multidimensional quadratic, $\mathbf{x} = (x_1, x_2)$:
    $\phi_1(\mathbf{x}) = x_1$, $\phi_2(\mathbf{x}) = x_1^2$, $\phi_3(\mathbf{x}) = x_2$, $\phi_4(\mathbf{x}) = x_2^2$, $\phi_5(\mathbf{x}) = x_1 x_2$
  – Other types of basis functions: $\phi_1(x) = \sin x$, $\phi_2(x) = \cos x$
Example. Regression with polynomials.
Regression with polynomials of degree $m$
• Data points: pairs of $\langle x, y \rangle$
• Feature functions: $m$ feature functions $\phi_i(x) = x^i$, $i = 1, 2, \ldots, m$
• Function to learn:
$f(x, \mathbf{w}) = w_0 + \sum_{i=1}^{m} w_i \phi_i(x) = w_0 + \sum_{i=1}^{m} w_i x^i$
[Figure: a linear unit with inputs $1, x, x^2, \ldots, x^m$ and weights $w_0, w_1, w_2, \ldots, w_m$]
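A sketch of polynomial regression via this basis expansion, reusing the same least-squares machinery as before (the data and degree are made up):

```python
import numpy as np

def poly_features(x, m):
    """Map scalar inputs to (1, x, x^2, ..., x^m) feature vectors."""
    return np.vander(x, N=m + 1, increasing=True)

# Hypothetical data from y = 0.5 - x + 0.25*x^3 plus noise
rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, 50)
y = 0.5 - x + 0.25 * x**3 + rng.normal(0, 0.1, 50)

Phi = poly_features(x, m=3)                   # the feature matrix
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # least-squares fit in the new features
print(w)  # roughly [0.5, -1.0, 0.0, 0.25]
```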
Multidimensional model example
[Figures: two surface plots of a two-input nonlinear model fitted to data]
Regularized linear regression
• If the number of parameters is large relative to the number of data points used to train the model, we face the threat of overfitting (the generalization error of the model goes up)
• The prediction accuracy can often be improved by setting some coefficients to zero
  – Increases the bias, reduces the variance of estimates
• Solutions:
  – Subset selection
  – Ridge regression
  – Lasso regression
  – Principal component regression
• Next: ridge regression
Ridge regression
• Error function for the standard least squares estimates:
$J_n(\mathbf{w}) = \frac{1}{n} \sum_{i=1..n} (y_i - \mathbf{w}^T \mathbf{x}_i)^2$
• We seek: $\mathbf{w}^* = \arg\min_{\mathbf{w}} \frac{1}{n} \sum_{i=1..n} (y_i - \mathbf{w}^T \mathbf{x}_i)^2$
• Ridge regression:
$J_n(\mathbf{w}) = \frac{1}{n} \sum_{i=1..n} (y_i - \mathbf{w}^T \mathbf{x}_i)^2 + \lambda \|\mathbf{w}\|_2^2$
• Where $\|\mathbf{w}\|_2^2 = \sum_{i=0}^{d} w_i^2$ and $\lambda \geq 0$
• What does the new error function do?
Ridge regression
• Standard regression: $J_n(\mathbf{w}) = \frac{1}{n} \sum_{i=1..n} (y_i - \mathbf{w}^T \mathbf{x}_i)^2$
• Ridge regression: $J_n(\mathbf{w}) = \frac{1}{n} \sum_{i=1..n} (y_i - \mathbf{w}^T \mathbf{x}_i)^2 + \lambda \|\mathbf{w}\|_{L_2}^2$
• $\|\mathbf{w}\|_{L_2}^2 = \sum_{i=0}^{d} w_i^2$ penalizes non-zero weights with the cost proportional to $\lambda$ (a shrinkage coefficient)
• If an input attribute $x_j$ has a small effect on improving the error function, it is "shut down" by the penalty term
• Inclusion of a shrinkage penalty is often referred to as regularization.
(ridge regression is related to Tikhonov regularization)
Regularized linear regression
How do we solve the least squares problem if the error function is enriched by the regularization term $\lambda \|\mathbf{w}\|_2^2$?
Answer: The solution to the optimal set of weights $\mathbf{w}$ is obtained again by solving a set of linear equations.
Standard linear regression:
$\nabla_{\mathbf{w}}(J_n(\mathbf{w})) = -\frac{2}{n} \sum_{i=1}^{n} (y_i - \mathbf{w}^T \mathbf{x}_i)\, \mathbf{x}_i = \mathbf{0}$
Solution: $\mathbf{w}^* = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$
where $\mathbf{X}$ is an $n \times d$ matrix with rows corresponding to examples and columns to inputs
Regularized linear regression:
$\mathbf{w}^* = (\mathbf{X}^T \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^T \mathbf{y}$
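A sketch of the regularized closed-form solution in Python/NumPy (hypothetical data; note that, exactly like the formula above, this version penalizes the bias weight $w_0$ too, though in practice the bias is often left unpenalized):

```python
import numpy as np

def ridge_weights(X, y, lam):
    """Closed-form ridge solution: w* = (X^T X + lam*I)^(-1) X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.1, 2.9, 5.2, 6.8])
print(ridge_weights(X, y, lam=0.0))   # ordinary least squares: [1.09, 1.94]
print(ridge_weights(X, y, lam=5.0))   # weights shrunk toward zero
```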
Lasso regression
• Standard regression: $J_n(\mathbf{w}) = \frac{1}{n} \sum_{i=1..n} (y_i - \mathbf{w}^T \mathbf{x}_i)^2$
• Lasso regression/regularization: $J_n(\mathbf{w}) = \frac{1}{n} \sum_{i=1..n} (y_i - \mathbf{w}^T \mathbf{x}_i)^2 + \lambda \|\mathbf{w}\|_{L_1}$
• $\|\mathbf{w}\|_{L_1} = \sum_{i=0}^{d} |w_i|$ penalizes non-zero weights with the cost proportional to $\lambda$
• The $L_1$ penalty is more aggressive in pushing the weights to 0 than the $L_2$ penalty.
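Unlike ridge, lasso has no closed-form solution and is solved iteratively; to see the difference between the two penalties empirically, a short sketch using scikit-learn (assuming it is available; the data and penalty strengths are arbitrary):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Hypothetical data: 10 inputs, only 2 of which are relevant
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, 100)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
print(np.round(ridge.coef_, 2))  # all weights shrunk, but typically nonzero
print(np.round(lasso.coef_, 2))  # L1 drives most irrelevant weights exactly to 0
```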
Classification
• Data: $D = \{d_1, d_2, \ldots, d_n\}$, where $d_i = \langle \mathbf{x}_i, y_i \rangle$
  – $y_i$ represents a discrete class value
• Goal: learn $f: X \to Y$
• Binary classification
  – A special case when $Y = \{0, 1\}$
• First step:
  – we need to devise a model of the function $f$
Discriminant functions
• A common way to represent a classifier is by using
  – Discriminant functions
• Works for both the binary and multi-way classification
• Idea:
  – For every class $i$ define a function $g_i(\mathbf{x})$ mapping $X \to \mathbb{R}$
  – When the decision on input $\mathbf{x}$ should be made, choose the class with the highest value of $g_i(\mathbf{x})$:
$y^* = \arg\max_i g_i(\mathbf{x})$
• So what happens with the input space? Assume a binary case.
Discriminant functions
• Decision boundary: the set of points where the discriminant functions are equal, $g_1(\mathbf{x}) = g_0(\mathbf{x})$
[Figure: two-class data in the plane; the region where $g_1(\mathbf{x}) \geq g_0(\mathbf{x})$, the region where $g_1(\mathbf{x}) \leq g_0(\mathbf{x})$, and the decision boundary $g_1(\mathbf{x}) = g_0(\mathbf{x})$ between them]
Quadratic decision boundary
[Figure: two-class data in the plane separated by a quadratic decision boundary $g_1(\mathbf{x}) = g_0(\mathbf{x})$, with $g_1(\mathbf{x}) \geq g_0(\mathbf{x})$ on one side and $g_1(\mathbf{x}) \leq g_0(\mathbf{x})$ on the other]
Logistic regression model
• Defines a linear decision boundary
• Discriminant functions:
$g_1(\mathbf{x}) = g(\mathbf{w}^T \mathbf{x})$ and $g_0(\mathbf{x}) = 1 - g(\mathbf{w}^T \mathbf{x})$
• where $g(z) = 1/(1 + e^{-z})$ is a logistic function
$f(\mathbf{x}, \mathbf{w}) = g_1(\mathbf{x}) = g(\mathbf{w}^T \mathbf{x})$
[Figure: a unit with input vector $\mathbf{x} = (1, x_1, x_2, \ldots, x_d)$, weights $w_0, w_1, w_2, \ldots, w_d$, summed input $z$, and a logistic function producing $f(\mathbf{x}, \mathbf{w})$]
Logistic function
Function: $g(z) = \frac{1}{1 + e^{-z}}$
• Is also referred to as a sigmoid function
• Takes a real number and outputs a number in the interval [0,1]
• Models a smooth switching function; replaces the hard threshold function
[Figure: logistic (smooth) switching vs. threshold (hard) switching]
Logistic regression model
• Discriminant functions:
$g_1(\mathbf{x}) = g(\mathbf{w}^T \mathbf{x})$ and $g_0(\mathbf{x}) = 1 - g(\mathbf{w}^T \mathbf{x})$
• Values of discriminant functions vary in the interval [0,1]
  – Probabilistic interpretation:
$f(\mathbf{x}, \mathbf{w}) = p(y = 1 \mid \mathbf{x}, \mathbf{w}) = g_1(\mathbf{x}) = g(\mathbf{w}^T \mathbf{x})$
[Figure: the same logistic unit, whose output is interpreted as $p(y = 1 \mid \mathbf{x}, \mathbf{w})$]
Logistic regression
• We learn a probabilistic function $f: X \to [0,1]$
  – where $f$ describes the probability of class 1 given $\mathbf{x}$:
$f(\mathbf{x}, \mathbf{w}) = g_1(\mathbf{x}) = g(\mathbf{w}^T \mathbf{x}) = p(y = 1 \mid \mathbf{x}, \mathbf{w})$
• Note that: $p(y = 0 \mid \mathbf{x}, \mathbf{w}) = 1 - p(y = 1 \mid \mathbf{x}, \mathbf{w})$
• Making decisions with the logistic regression model:
If $p(y = 1 \mid \mathbf{x}) \geq 1/2$ then choose 1
Else choose 0
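A minimal sketch of this decision rule in Python/NumPy (the weights are hypothetical, not learned):

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def classify(w, x):
    """Choose class 1 iff p(y=1 | x, w) = g(w^T x) >= 1/2."""
    p1 = sigmoid(w @ x)
    return 1 if p1 >= 0.5 else 0

w = np.array([-1.0, 2.0])                 # hypothetical weights (bias first)
print(classify(w, np.array([1.0, 0.2])))  # g(-0.6) ~ 0.354 -> class 0
print(classify(w, np.array([1.0, 0.9])))  # g(0.8)  ~ 0.690 -> class 1
```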
Linear decision boundary
• Logistic regression model defines a linear decision boundary
• Why?
• Answer: Compare the two discriminant functions.
• Decision boundary: $g_1(\mathbf{x}) = g_0(\mathbf{x})$
• For the boundary it must hold:
$\log \frac{g_1(\mathbf{x})}{g_0(\mathbf{x})} = \log \frac{g(\mathbf{w}^T \mathbf{x})}{1 - g(\mathbf{w}^T \mathbf{x})} = 0$
• Expanding the logistic function:
$\log \frac{g_1(\mathbf{x})}{g_0(\mathbf{x})} = \log \frac{\;\frac{\exp(\mathbf{w}^T \mathbf{x})}{1 + \exp(\mathbf{w}^T \mathbf{x})}\;}{\;\frac{1}{1 + \exp(\mathbf{w}^T \mathbf{x})}\;} = \log \exp(\mathbf{w}^T \mathbf{x}) = \mathbf{w}^T \mathbf{x} = 0$
Logistic regression model. Decision boundary
• LR defines a linear decision boundary
• Example: 2 classes (blue and red points)
[Figure: two classes of points in the plane separated by the linear decision boundary $\mathbf{w}^T \mathbf{x} = 0$]
Likelihood of outputs
• Let $D_i = \langle \mathbf{x}_i, y_i \rangle$ and $\mu_i = p(y_i = 1 \mid \mathbf{x}_i, \mathbf{w}) = g(z_i) = g(\mathbf{w}^T \mathbf{x}_i)$
• Then:
$L(D, \mathbf{w}) = \prod_{i=1}^{n} P(y = y_i \mid \mathbf{x}_i, \mathbf{w}) = \prod_{i=1}^{n} \mu_i^{y_i} (1 - \mu_i)^{1 - y_i}$
• Find weights $\mathbf{w}$ that maximize the likelihood of outputs
  – Apply the log-likelihood trick. The optimal weights are the same for both the likelihood and the log-likelihood:
$l(D, \mathbf{w}) = \log \prod_{i=1}^{n} \mu_i^{y_i} (1 - \mu_i)^{1 - y_i} = \sum_{i=1}^{n} y_i \log \mu_i + (1 - y_i) \log(1 - \mu_i)$
Logistic regression: parameter learning
• Notation: $\mu_i = p(y_i = 1 \mid \mathbf{x}_i, \mathbf{w}) = g(z_i) = g(\mathbf{w}^T \mathbf{x}_i)$
• Log likelihood:
$l(D, \mathbf{w}) = \sum_{i=1}^{n} y_i \log \mu_i + (1 - y_i) \log(1 - \mu_i)$
• Derivatives of the loglikelihood:
$\nabla_{\mathbf{w}}\, l(D, \mathbf{w}) = \sum_{i=1}^{n} (y_i - g(\mathbf{w}^T \mathbf{x}_i))\, \mathbf{x}_i = \sum_{i=1}^{n} (y_i - f(\mathbf{x}_i, \mathbf{w}))\, \mathbf{x}_i$
$\frac{\partial}{\partial w_j}\, l(D, \mathbf{w}) = \sum_{i=1}^{n} x_{i,j}\, (y_i - g(z_i))$
Nonlinear in weights!
• Gradient ascent (to maximize the log-likelihood):
$\mathbf{w}^{(k)} \leftarrow \mathbf{w}^{(k-1)} + \alpha(k)\, \nabla_{\mathbf{w}} [l(D, \mathbf{w})]\big|_{\mathbf{w}^{(k-1)}}$
$\mathbf{w}^{(k)} \leftarrow \mathbf{w}^{(k-1)} + \alpha(k) \sum_{i=1}^{n} [y_i - f(\mathbf{x}_i, \mathbf{w}^{(k-1)})]\, \mathbf{x}_i$
Derivation of the gradient
• Log likelihood:
$l(D, \mathbf{w}) = \sum_{i=1}^{n} y_i \log \mu_i + (1 - y_i) \log(1 - \mu_i)$
• Derivatives of the loglikelihood (chain rule, with $z_i = \mathbf{w}^T \mathbf{x}_i$):
$\frac{\partial}{\partial w_j}\, l(D, \mathbf{w}) = \sum_{i=1}^{n} \frac{\partial}{\partial z_i} \left[ y_i \log g(z_i) + (1 - y_i) \log(1 - g(z_i)) \right] \frac{\partial z_i}{\partial w_j}$
• Derivative of a logistic function: $\frac{\partial g(z_i)}{\partial z_i} = g(z_i)(1 - g(z_i))$
• Therefore:
$\frac{\partial}{\partial z_i} \left[ y_i \log g(z_i) + (1 - y_i) \log(1 - g(z_i)) \right] = y_i \frac{1}{g(z_i)} \frac{\partial g(z_i)}{\partial z_i} - (1 - y_i) \frac{1}{1 - g(z_i)} \frac{\partial g(z_i)}{\partial z_i} = y_i (1 - g(z_i)) - (1 - y_i)\, g(z_i) = y_i - g(z_i)$
• And: $\frac{\partial z_i}{\partial w_j} = x_{i,j}$
• Putting it together: $\frac{\partial}{\partial w_j}\, l(D, \mathbf{w}) = \sum_{i=1}^{n} x_{i,j}\, (y_i - g(z_i))$
Logistic regression. Online gradient descent
• On-line component of the loglikelihood:
$J_{online}(D_k, \mathbf{w}) = y_k \log \mu_k + (1 - y_k) \log(1 - \mu_k)$
• On-line learning update for weight $\mathbf{w}$:
$\mathbf{w}^{(k)} \leftarrow \mathbf{w}^{(k-1)} + \alpha(k)\, \nabla_{\mathbf{w}} [J_{online}(D_k, \mathbf{w})]\big|_{\mathbf{w}^{(k-1)}}$
• $k$-th update for the logistic regression and $D_k = \langle \mathbf{x}_k, y_k \rangle$:
$\mathbf{w}^{(k)} \leftarrow \mathbf{w}^{(k-1)} + \alpha(k)\, [y_k - f(\mathbf{x}_k, \mathbf{w}^{(k-1)})]\, \mathbf{x}_k$
Online logistic regression algorithm
Online-logistic-regression (stopping_criterion)
  initialize weights $\mathbf{w} = (w_0, w_1, w_2, \ldots, w_d)$
  while stopping_criterion = FALSE
    do select next data point $D_i = \langle \mathbf{x}_i, y_i \rangle$
    set $\alpha(i)$
    update weights (in parallel): $\mathbf{w} \leftarrow \mathbf{w} + \alpha(i)\, [y_i - f(\mathbf{x}_i, \mathbf{w})]\, \mathbf{x}_i$
  end
  return weights $\mathbf{w}$
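A possible implementation of the online loop in Python/NumPy, with a simulated labeled stream (the stream and the annealed learning rate are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def online_logistic_regression(stream, d):
    """One update per example: w <- w + alpha(i)[y_i - g(w^T x_i)] x_i."""
    w = np.zeros(d + 1)
    for i, (x, y) in enumerate(stream, start=1):
        alpha = 1.0 / i                        # annealed learning rate
        w += alpha * (y - sigmoid(w @ x)) * x
    return w

# Hypothetical stream: class 1 becomes likely as x1 grows
rng = np.random.default_rng(4)
stream = []
for _ in range(1000):
    x1 = rng.uniform(-3, 3)
    label = float(rng.uniform() < sigmoid(-1 + 2 * x1))
    stream.append((np.array([1.0, x1]), label))
print(online_logistic_regression(stream, d=1))  # drifts toward [-1, 2]
```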
Online algorithm. Example.
[Figures: snapshots of the logistic regression fit over successive online updates]