Logistic Regression
Rong Jin
Logistic Regression Model

In the Gaussian generative model, the log-ratio of the two classes is (approximately) linear in x:

\[
\log\frac{p(y=1)\,p(x\mid y=1)}{p(y=-1)\,p(x\mid y=-1)}
\;\sim\; \sum_i \frac{m_i\,x_i}{\sigma_i^{2}} + c
\]

Generalize the ratio to a linear model:

\[
\log\frac{p(y=1\mid x)}{p(y=-1\mid x)} = x\cdot w + c
\]

Parameters: w and c
Logistic Regression Model

The log-ratio of positive class to negative class:

\[
\log\frac{p(y=1\mid x)}{p(y=-1\mid x)} = x\cdot w + c
\]

Results:

\[
\frac{p(y=1\mid x)}{p(y=-1\mid x)} = \exp(x\cdot w + c), \qquad
p(y=1\mid x) + p(y=-1\mid x) = 1
\]

\[
p(y=1\mid x) = \frac{1}{1+\exp(-x\cdot w - c)}, \qquad
p(y=-1\mid x) = \frac{1}{1+\exp(x\cdot w + c)}
\]

In one formula:

\[
p(y\mid x) = \frac{1}{1+\exp\!\big(-y\,(x\cdot w + c)\big)}
\]
Logistic Regression Model

Assume the inputs and outputs are related through the log-linear function

\[
p(y\mid x; w, c) = \frac{1}{1+\exp\!\big(-y\,(x\cdot w + c)\big)},
\qquad \{w_1, w_2, \ldots, w_d, c\}
\]

Estimate the weights with the MLE approach:

\[
(w^*, c^*) = \arg\max_{w,c}\, l(D_{train})
           = \arg\max_{w,c} \sum_{i=1}^{n} \log p(y_i\mid x_i; w, c)
           = \arg\max_{w,c} \sum_{i=1}^{n} \log\frac{1}{1+\exp\!\big(-y_i\,(x_i\cdot w + c)\big)}
\]
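Below is a minimal sketch of this MLE formulation in Python. The toy data, the use of scipy.optimize.minimize, and all parameter values are illustrative assumptions, not part of the lecture.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(params, X, y):
    """Negative log-likelihood of logistic regression with labels y in {-1, +1}."""
    w, c = params[:-1], params[-1]
    margins = y * (X @ w + c)                    # y_i (x_i . w + c)
    return np.sum(np.logaddexp(0.0, -margins))   # -sum_i log p(y_i | x_i; w, c)

# Toy data: two Gaussian blobs labeled -1 and +1
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
y = np.concatenate([-np.ones(50), np.ones(50)])

res = minimize(neg_log_likelihood, x0=np.zeros(X.shape[1] + 1), args=(X, y))
w_hat, c_hat = res.x[:-1], res.x[-1]
print("w =", w_hat, "c =", c_hat)
```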
Example 1: Heart Disease

• Input feature x: age group id
  1: 25-29   2: 30-34   3: 35-39   4: 40-44
  5: 45-49   6: 50-54   7: 55-59   8: 60-64
• Output y: having heart disease or not
  +1: having heart disease
  -1: no heart disease

[Figure: number of people in each age group, heart disease vs. no heart disease]
Example 1: Heart Disease

• Logistic regression model:

\[
p(y\mid x) = \frac{1}{1+\exp\!\big(-y\,(xw + c)\big)}, \qquad \{w, c\}
\]

• Learning w and c: MLE approach

\[
l(D_{train}) = \sum_{i=1}^{8} n_+(i)\,\log p(+\mid i) + \sum_{i=1}^{8} n_-(i)\,\log p(-\mid i)
= \sum_{i=1}^{8} n_+(i)\,\log\frac{1}{1+\exp(-iw - c)} + \sum_{i=1}^{8} n_-(i)\,\log\frac{1}{1+\exp(iw + c)}
\]

• Numerical optimization: w = 0.58, c = -3.34
Example 1: Heart Disease

• w = 0.58 > 0: an older person is more likely to have heart disease
• c = -3.34:
  xw + c < 0  ⇒  p(+|x) < p(-|x)
  xw + c > 0  ⇒  p(+|x) > p(-|x)
  xw + c = 0  ⇒  decision boundary
• Decision boundary: x* = 5.78, i.e., about 53 years old

\[
p(+\mid x; w, c) = \frac{1}{1+\exp(-xw - c)}, \qquad
p(-\mid x; w, c) = \frac{1}{1+\exp(xw + c)}
\]
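A small numerical check of the decision boundary, using the fitted values quoted above. Note that with the rounded parameters the boundary comes out near 5.76 rather than the slide's 5.78, which presumably comes from the unrounded fit.

```python
import numpy as np

w, c = 0.58, -3.34                    # fitted values from the slide (rounded)

def p_pos(x, w, c):
    """p(+1 | x) under the logistic model."""
    return 1.0 / (1.0 + np.exp(-(x * w + c)))

x_star = -c / w                       # decision boundary: xw + c = 0
print("x* =", round(x_star, 2))       # ~5.76 with these rounded parameters
print("p(+|x=3) =", round(p_pos(3, w, c), 3))   # younger group: below 0.5
print("p(+|x=8) =", round(p_pos(8, w, c), 3))   # oldest group: above 0.5
```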
Naïve Bayes Solution

• Inaccurate fitting: the input distribution is not Gaussian
• Decision boundary i* = 5.59, close to the estimate given by logistic regression
• Even though naïve Bayes does not fit the input patterns well, it still works fine for the decision boundary

[Figure: Gaussian fits to the per-class age-group histograms]
Problems with Using Histogram Data?

[Figure: number of people in each age group, heart disease vs. no heart disease]
Uneven Sampling for Different Ages

[Figure: number of people sampled in each age group]
Solution

To correct for the uneven sampling across age groups, replace the raw counts n_+(i), n_-(i) by the within-group percentages p_+(i), p_-(i) in the log-likelihood:

\[
l(D_{train}) = \sum_{i=1}^{\#group} n_+(i)\,\log p(+\mid i) + \sum_{i=1}^{\#group} n_-(i)\,\log p(-\mid i)
= \sum_{i=1}^{\#group} n_+(i)\,\log\frac{1}{1+\exp(-iw - c)} + \sum_{i=1}^{\#group} n_-(i)\,\log\frac{1}{1+\exp(iw + c)}
\]

\[
l(D_{train}) = \sum_{i=1}^{\#group} p_+(i)\,\log p(+\mid i) + \sum_{i=1}^{\#group} p_-(i)\,\log p(-\mid i)
= \sum_{i=1}^{\#group} p_+(i)\,\log\frac{1}{1+\exp(-iw - c)} + \sum_{i=1}^{\#group} p_-(i)\,\log\frac{1}{1+\exp(iw + c)}
\]

n_+(i), p_+(i): number / percentage of people in the ith age group that have heart disease
n_-(i), p_-(i): number / percentage of people in the ith age group that do not have heart disease

Result: w = 0.63, c = -3.56, i* = 5.65 < 5.78
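A minimal sketch of the two weightings on hypothetical per-group counts (the slide's actual data is not reproduced here): the first fit weights each group by its raw counts, the second by within-group percentages, so heavily sampled age groups no longer dominate.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical counts per age group (1..8), deliberately unevenly sampled
n_pos = np.array([1, 1, 2, 2, 3, 3, 3, 3], dtype=float)      # heart disease
n_neg = np.array([18, 15, 12, 8, 6, 5, 3, 2], dtype=float)   # no heart disease
groups = np.arange(1, 9, dtype=float)

def weighted_nll(params, w_pos, w_neg):
    """Negative weighted log-likelihood over the age groups."""
    w, c = params
    margin = groups * w + c
    log_p_pos = -np.logaddexp(0.0, -margin)   # log p(+ | i)
    log_p_neg = -np.logaddexp(0.0, margin)    # log p(- | i)
    return -np.sum(w_pos * log_p_pos + w_neg * log_p_neg)

p_pos = n_pos / (n_pos + n_neg)   # within-group percentages
p_neg = n_neg / (n_pos + n_neg)
fit_counts = minimize(weighted_nll, [0.0, 0.0], args=(n_pos, n_neg))
fit_pcts   = minimize(weighted_nll, [0.0, 0.0], args=(p_pos, p_neg))
print("count-weighted   w, c =", fit_counts.x)
print("percent-weighted w, c =", fit_pcts.x)
```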
Example 2: Text Classification

Learn to classify text into predefined categories.

Input x: a document, represented by a vector of words.
Example: {(president, 10), (bush, 2), (election, 5), …}

Output y: whether the document is about politics or not.
+1 for a political document, -1 for a non-political document.

Training data:

\[
D_+ = \{d_1^+, d_2^+, \ldots, d_N^+\}; \qquad D_- = \{d_1^-, d_2^-, \ldots, d_N^-\}
\]

\[
d_i = \big\{(word_1, t_{i,1}), (word_2, t_{i,2}), \ldots, (word_n, t_{i,n})\big\}
\]
Example 2: Text Classification

Logistic regression model: every term t_i is assigned a weight w_i.

\[
p(y\mid d; w, c) = \frac{1}{1+\exp\!\Big(-y\,\big(\sum_i w_i t_i + c\big)\Big)},
\qquad \{w_1, w_2, \ldots, w_n, c\}
\]

Learning parameters: MLE approach (requires numerical solutions)

\[
l(D_{train}) = \sum_{i=1}^{N} \log p(+\mid d_i^+) + \sum_{i=1}^{N} \log p(-\mid d_i^-)
= \sum_{i=1}^{N} \log\frac{1}{1+\exp\!\big(-\sum_{t_j\in d_i^+} w_j t_j - c\big)}
+ \sum_{i=1}^{N} \log\frac{1}{1+\exp\!\big(\sum_{t_j\in d_i^-} w_j t_j + c\big)}
\]

\[
d = \big\{(word_1, t_1), (word_2, t_2), \ldots, (word_n, t_n)\big\}
\]
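A minimal sketch of this setup with scikit-learn, on a made-up four-document corpus (not the Reuters data used later). Note that sklearn's LogisticRegression applies L2 regularization by default, so this is closer to the regularized variant discussed further below than to pure MLE.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["president wins election", "bush election campaign",
        "soccer match tonight", "team wins soccer cup"]
labels = [1, 1, -1, -1]                     # +1 = political, -1 = not political

vec = CountVectorizer()
X = vec.fit_transform(docs)                 # term-count vector t for each document
clf = LogisticRegression().fit(X, labels)   # fits one weight w_i per term, plus c

for term, w in zip(vec.get_feature_names_out(), clf.coef_[0]):
    print(f"{term}: {w:+.2f}")              # sign of w_i = direction of evidence
print("c =", clf.intercept_[0])
```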
Example 2: Text Classification

Weight w_i:
  w_i > 0: term t_i is positive evidence
  w_i < 0: term t_i is negative evidence
  w_i = 0: term t_i is irrelevant to the category of the document
  The larger |w_i| is, the more important term t_i is in determining the category of the document.

Threshold c:
  \(\sum_i w_i t_i + c > 0\): more likely to be a political document
  \(\sum_i w_i t_i + c < 0\): more likely to be a non-political document
  \(\sum_i w_i t_i + c = 0\): decision boundary
Example 2: Text Classification

• Dataset: Reuters-21578
• Classification accuracy
  • Naïve Bayes: 77%
  • Logistic regression: 88%
Why Does Logistic Regression Work Better for Text Classification?

Optimal linear decision boundary
  In the generative model, weight ~ log p(w|+) - log p(w|-), which gives sub-optimal weights.

Independence assumption
  Naïve Bayes assumes that each word is generated independently.
  Logistic regression is able to take the correlation between words into account.
Discriminative Model

Logistic regression is a discriminative model: it models the conditional probability p(y|x), i.e., the decision boundary.

The Gaussian generative model models p(x|y), i.e., the input patterns of the different classes.
Comparison

Generative Model
• Models P(x|y), i.e., the input patterns
• Usually converges quickly
• Cheap computation
• Robust to noisy data
But: usually performs worse

Discriminative Model
• Models P(y|x) directly, i.e., the decision boundary
• Usually good performance
But: slow convergence, expensive computation, sensitive to noisy data
A Few Words about Optimization

The objective function is concave (no local maxima), but the solution could be non-unique:

\[
(w^*, c^*) = \arg\max_{w,c} \sum_{i=1}^{n} \log\frac{1}{1+\exp\!\big(-y_i\,(x_i\cdot w + c)\big)}
\]
Problems with Logistic Regression?

\[
p(\pm 1\mid x; w, c) = \frac{1}{1+\exp\!\big(\mp(x_1 w_1 + x_2 w_2 + \cdots + x_m w_m + c)\big)},
\qquad \{w_1, w_2, \ldots, w_m, c\}
\]

What about words that only appear in one class?
Overfitting Problem with Logistic Regression

Consider a word t that only appears in one document d, and d is a positive document. Let w be its associated weight.

\[
l(D_{train}) = \sum_{i=1}^{N} \log p(+\mid d_i^+) + \sum_{i=1}^{N} \log p(-\mid d_i^-)
= \log p(+\mid d) + \sum_{d_i^+\neq d} \log p(+\mid d_i^+) + \sum_{i=1}^{N} \log p(-\mid d_i^-)
= \log p(+\mid d) + l_+ + l_-
\]

Consider the derivative of l(D_train) with respect to w. Since t appears only in d, the terms l_+ and l_- do not depend on w:

\[
\frac{\partial l(D_{train})}{\partial w}
= \frac{\partial \log p(+\mid d)}{\partial w} + \frac{\partial l_+}{\partial w} + \frac{\partial l_-}{\partial w}
= \frac{1}{1+\exp(c + x\cdot w)} + 0 + 0 > 0
\]

The derivative is always positive, so w will go to infinity!
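The blow-up can be seen numerically. Below is a small sketch (made-up data and step size) in which feature 0 plays the role of the rare word: it is nonzero only in a single positive example, so its weight keeps growing under unregularized gradient ascent.

```python
import numpy as np

X = np.array([[1.0, 1.0],    # the only document containing the rare word (feature 0)
              [0.0, 1.0],
              [0.0, 1.0]])
y = np.array([+1.0, +1.0, -1.0])

w, c, eta = np.zeros(2), 0.0, 0.5
for step in range(2000):
    p = 1.0 / (1.0 + np.exp(-y * (X @ w + c)))   # p(y_i | x_i)
    w += eta * ((y * (1 - p)) @ X)               # gradient of the log-likelihood
    c += eta * np.sum(y * (1 - p))
print("rare-word weight after 2000 steps:", w[0])   # still growing; it never converges
```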
Example of Overfitting for LogRes

[Figure: classification accuracy on the test data decreases as the number of training iterations grows]
Solution: Regularization

Regularized log-likelihood:

\[
l_{reg}(D_{train}) = l(D_{train}) - s\,\|w\|_2^2
= \sum_{i=1}^{N} \log p(+\mid d_i^+) + \sum_{i=1}^{N} \log p(-\mid d_i^-) - s\sum_{i=1}^{m} w_i^2
\]

\(s\|w\|_2^2\) is called the regularizer: it favors small weights and prevents the weights from becoming too large.
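A sketch of the regularized objective on synthetic data (the data, the choice of solver, and the values of s are all illustrative): as s grows, the norm of the fitted weight vector shrinks.

```python
import numpy as np
from scipy.optimize import minimize

def reg_neg_log_likelihood(params, X, y, s):
    """Negative regularized log-likelihood: -l(D_train) + s * ||w||^2 (c is not penalized)."""
    w, c = params[:-1], params[-1]
    nll = np.sum(np.logaddexp(0.0, -y * (X @ w + c)))
    return nll + s * np.dot(w, w)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = np.sign(X[:, 0] + 0.5 * rng.normal(size=100))   # labels in {-1, +1}

for s in (0.0, 1.0, 10.0):
    res = minimize(reg_neg_log_likelihood, np.zeros(X.shape[1] + 1), args=(X, y, s))
    print(f"s = {s:>4}: ||w|| = {np.linalg.norm(res.x[:-1]):.3f}")
```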
The Rare Word Problem

Consider a word t that only appears in one document d, and d is a positive document. Let w be its associated weight.

\[
l(D_{train}) = \sum_{i=1}^{N} \log p(+\mid d_i^+) + \sum_{i=1}^{N} \log p(-\mid d_i^-)
= \log p(+\mid d) + l_+ + l_-
\]

\[
l_{reg}(D_{train}) = \sum_{i=1}^{N} \log p(+\mid d_i^+) + \sum_{i=1}^{N} \log p(-\mid d_i^-) - s\sum_{i=1}^{m} w_i^2
= \log p(+\mid d) + l_+ + l_- - s\sum_{i=1}^{m} w_i^2
\]
The Rare Word Problem

Consider the derivative of the objective with respect to w. Without regularization the derivative is always positive. With regularization, it is still positive when w is small, but it becomes negative when w is large, so w stays finite:

\[
\frac{\partial l(D_{train})}{\partial w}
= \frac{\partial \log p(+\mid d)}{\partial w} + \frac{\partial l_+}{\partial w} + \frac{\partial l_-}{\partial w}
= \frac{1}{1+\exp(c + x\cdot w)} + 0 + 0 > 0
\]

\[
\frac{\partial l_{reg}(D_{train})}{\partial w}
= \frac{\partial \log p(+\mid d)}{\partial w} - 2sw + \frac{\partial l_+}{\partial w} + \frac{\partial l_-}{\partial w}
= \frac{1}{1+\exp(c + x\cdot w)} - 2sw
\]
Regularized Logistic Regression

[Figure: test accuracy vs. iteration, with regularization compared to without regularization]
Interpretation of Regularizer

\[
l_{reg}(D_{train}) = l(D_{train}) - s\,\|w\|_2^2
\]

There are many interpretations of the regularizer:
  Bayesian statistics: a prior over the model
  Statistical learning: minimizing the generalization error
  Robust optimization: a min-max solution
Regularizer: Robust Optimization

Assume each data point is unknown-but-bounded within a sphere of radius s centered at x_i.
Find the classifier w that is able to classify these unknown-but-bounded data points with high classification confidence.
Sparse Solution

What does the solution of regularized logistic regression look like?
A sparse solution: most weights are small and close to zero.
Why Do We Need a Sparse Solution?

Two types of solutions:
  1. Many non-zero weights, most of them small
  2. Only a small number of non-zero weights, many of them large

Occam's Razor: the simpler, the better
  A simple model that fits the data is unlikely to be a coincidence.
  A complicated model that fits the data might be a coincidence.
  A smaller number of non-zero weights means less evidence to consider, hence a simpler model, so case 2 is preferred.
Occam’s Razer
0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 50
0.5
1
1.5
2
2.5
Occam’s Razer: Power = 1
0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 50
0.5
1
1.5
2
2.5
1y a x
Occam’s Razer: Power = 3
2 31 2 3y a x a x a x
0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 50
0.5
1
1.5
2
2.5
Occam’s Razor: Power = 10
2 101 2 10...y a x a x a x
0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 50
0.5
1
1.5
2
2.5
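A quick sketch of the same idea in code, on made-up roughly linear data: higher-degree polynomials drive the training error down, but the simpler model is the preferable explanation.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 15)
y = 0.4 * x + 0.2 * rng.normal(size=x.size)     # roughly linear data with noise

for degree in (1, 3, 10):
    fit = Polynomial.fit(x, y, degree)          # least-squares fit on a rescaled domain
    train_err = np.mean((y - fit(x)) ** 2)
    print(f"degree {degree:>2}: training error = {train_err:.4f}")
# The degree-10 fit has the smallest training error but wiggles wildly between the
# points; by Occam's Razor the degree-1 model is the better explanation of the data.
```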
Finding Optimal Solutions

Concave objective function: there is no local maximum, so many standard optimization algorithms work.

\[
l_{reg}(D_{train}) = l(D_{train}) - s\,\|w\|_2^2
= \sum_{i=1}^{N} \log p(y_i\mid x_i) - s\sum_{i=1}^{m} w_i^2
= \sum_{i=1}^{N} \log\frac{1}{1+\exp\!\big(-y_i\,(c + x_i\cdot w)\big)} - s\sum_{i=1}^{m} w_i^2
\]
Gradient Ascent

Maximize the log-likelihood by iteratively adjusting the parameters in small increments. In each iteration, we adjust w in the direction that increases the log-likelihood (toward the gradient):

\[
w \leftarrow w + \eta\,\frac{\partial}{\partial w}\Big[\sum_{i=1}^{N}\log p(y_i\mid x_i) - s\sum_{i=1}^{m} w_i^2\Big]
= w + \eta\Big[\sum_{i=1}^{N} x_i\,y_i\,\big(1 - p(y_i\mid x_i)\big) - 2sw\Big]
\]

\[
c \leftarrow c + \eta\,\frac{\partial}{\partial c}\sum_{i=1}^{N}\log p(y_i\mid x_i)
= c + \eta\sum_{i=1}^{N} y_i\,\big(1 - p(y_i\mid x_i)\big)
\]

where \(\eta\) is the learning rate. The data term corrects prediction errors; the \(-2sw\) term prevents the weights from becoming too large.

Graphical Illustration

[Figure: gradient ascent trajectory on the log-likelihood surface, no-regularization case]
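The update rules above translate directly into code. This is a minimal sketch on synthetic data; the regularization strength, learning rate, and iteration count are illustrative choices.

```python
import numpy as np

def fit_logreg_gradient_ascent(X, y, s=0.1, eta=0.01, n_iter=5000):
    """Regularized logistic regression (labels in {-1, +1}) fit by gradient ascent."""
    n, m = X.shape
    w, c = np.zeros(m), 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-y * (X @ w + c)))   # p(y_i | x_i; w, c)
        grad_w = (y * (1 - p)) @ X - 2 * s * w       # prediction-error term minus penalty
        grad_c = np.sum(y * (1 - p))
        w += eta * grad_w
        c += eta * grad_c
    return w, c

# Toy usage on two Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
y = np.concatenate([-np.ones(50), np.ones(50)])
w, c = fit_logreg_gradient_ascent(X, y)
print("w =", w, "c =", c)
```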
When Should We Stop?

The gradient ascent learning method converges when there is no incentive to move the parameters in any particular direction, i.e., when both gradients vanish:

\[
\frac{\partial}{\partial w}\Big[\sum_{i=1}^{N}\log p(y_i\mid x_i) - s\sum_{i=1}^{m} w_i^2\Big]
= \sum_{i=1}^{N} x_i\,y_i\,\big(1 - p(y_i\mid x_i)\big) - 2sw = 0
\]

\[
\frac{\partial}{\partial c}\sum_{i=1}^{N}\log p(y_i\mid x_i)
= \sum_{i=1}^{N} y_i\,\big(1 - p(y_i\mid x_i)\big) = 0
\]
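In practice one stops when the gradient norm falls below a small tolerance. A sketch of such a stopping rule, on made-up data and with an illustrative tolerance and step size:

```python
import numpy as np

def gradient_norm(X, y, w, c, s):
    """Norm of the regularized log-likelihood gradient; it approaches zero at convergence."""
    p = 1.0 / (1.0 + np.exp(-y * (X @ w + c)))
    grad_w = (y * (1 - p)) @ X - 2 * s * w
    grad_c = np.sum(y * (1 - p))
    return np.sqrt(np.sum(grad_w ** 2) + grad_c ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.sign(X[:, 0] + 0.3 * rng.normal(size=100))

w, c, s, eta = np.zeros(3), 0.0, 0.1, 0.01
for step in range(100_000):
    if gradient_norm(X, y, w, c, s) < 1e-4:       # stop: no incentive to move further
        break
    p = 1.0 / (1.0 + np.exp(-y * (X @ w + c)))
    w += eta * ((y * (1 - p)) @ X - 2 * s * w)
    c += eta * np.sum(y * (1 - p))
print(f"stopped after {step} steps; gradient norm = {gradient_norm(X, y, w, c, s):.2e}")
```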