Logistic Regression Rong Jin
Transcript
Page 1: Logistic Regression

Logistic Regression

Rong Jin

Page 2: Logistic Regression

Logistic Regression Model

In the Gaussian generative model:

log [ p(y=+1) p(x|y=+1) / ( p(y=-1) p(x|y=-1) ) ] ≈ Σ_i 2 m_i x_i / σ_i² + c

Generalize this ratio to a linear model:

log [ p(y=+1|x) / p(y=-1|x) ] = x·w + c

Parameters: w and c


Page 4: Logistic Regression

Logistic Regression Model

The log-ratio of positive class to negative class:

log [ p(y=+1|x) / p(y=-1|x) ] = x·w + c

Results:

p(y=+1|x) / p(y=-1|x) = exp(x·w + c),   p(y=+1|x) + p(y=-1|x) = 1

p(y=+1|x) = 1 / (1 + exp(-(x·w + c)))

p(y=-1|x) = 1 / (1 + exp(x·w + c))

In general: p(y|x) = 1 / (1 + exp(-y (x·w + c)))
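As a minimal sketch of this model (not from the slides), the two conditional probabilities can be evaluated directly from w and c; the weights and input below are purely hypothetical.

```python
import numpy as np

def p_y_given_x(y, x, w, c):
    """Logistic model: p(y|x) = 1 / (1 + exp(-y (x·w + c))), with y in {+1, -1}."""
    return 1.0 / (1.0 + np.exp(-y * (np.dot(x, w) + c)))

# Hypothetical weights and input, for illustration only.
w = np.array([0.5, -1.2])
c = 0.3
x = np.array([2.0, 1.0])

p_pos = p_y_given_x(+1, x, w, c)
p_neg = p_y_given_x(-1, x, w, c)
print(p_pos + p_neg)                              # the two probabilities sum to 1
print(np.log(p_pos / p_neg), np.dot(x, w) + c)    # the log-odds equals x·w + c
```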


Page 6: Logistic Regression

Logistic Regression Model

Assume the inputs and outputs are related by the log-linear function:

p(y|x; θ) = 1 / (1 + exp(-y (x·w + c))),   θ = {w_1, w_2, ..., w_d, c}

Estimate weights: MLE approach

(w*, c*) = argmax_{w,c} l(D_train) = argmax_{w,c} Σ_{i=1}^n log p(y_i|x_i; θ)

         = argmax_{w,c} Σ_{i=1}^n log [ 1 / (1 + exp(-y_i (x_i·w + c))) ]
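A minimal sketch of the MLE objective, assuming a design matrix X (one row per example) and labels y in {+1, -1}; the data here is illustrative only. Maximizing this function over w and c, as the slides do numerically, gives the MLE.

```python
import numpy as np

def log_likelihood(w, c, X, y):
    """l(D_train) = sum_i log p(y_i | x_i; w, c) for the logistic model above."""
    margins = y * (X @ w + c)                   # y_i (x_i·w + c)
    return np.sum(-np.log1p(np.exp(-margins)))  # sum_i log [1 / (1 + exp(-margin_i))]

# Illustrative data: four 2-dimensional points.
X = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([+1, +1, -1, -1])
print(log_likelihood(np.zeros(2), 0.0, X, y))   # equals 4 * log(1/2) at w = 0, c = 0
```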

Page 7: Logistic Regression

Example 1: Heart Disease

• Input feature x: age group id

• Output y: having heart disease or not

• +1: having heart disease

• -1: no heart disease

1: 25-29

2: 30-34

3: 35-39

4: 40-44

5: 45-49

6: 50-54

7: 55-59

8: 60-64

[Figure: bar chart of number of people in each age group (1-8), split into "No heart disease" and "Heart disease"]

Page 8: Logistic Regression

Example 1: Heart Disease

• Logistic regression model:

  p(y|x) = 1 / (1 + exp(-y (xw + c))),   θ = {w, c}

• Learning w and c: MLE approach

  l(D_train) = Σ_{i=1}^8 [ n_+(i) log p(+|i) + n_-(i) log p(-|i) ]

             = Σ_{i=1}^8 [ n_+(i) log 1/(1 + exp(-(iw + c))) + n_-(i) log 1/(1 + exp(iw + c)) ]

• Numerical optimization: w = 0.58, c = -3.34

[Figure: age-group histogram, as on the previous slide]
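As a hedged sketch of the numerical optimization step, one can maximize the grouped log-likelihood above with a generic optimizer; the per-group counts below are hypothetical, since the slide's raw counts are not in the transcript.

```python
import numpy as np
from scipy.optimize import minimize

groups = np.arange(1, 9)                                    # age group ids 1..8
n_pos  = np.array([1, 1, 2, 3, 4, 5, 4, 2], dtype=float)    # hypothetical counts with heart disease
n_neg  = np.array([12, 10, 9, 8, 5, 4, 2, 1], dtype=float)  # hypothetical counts without

def neg_log_likelihood(theta):
    w, c = theta
    z = groups * w + c
    # l(D) = sum_i [ n_+(i) log p(+|i) + n_-(i) log p(-|i) ]
    return -np.sum(n_pos * -np.log1p(np.exp(-z)) + n_neg * -np.log1p(np.exp(z)))

w_hat, c_hat = minimize(neg_log_likelihood, x0=np.zeros(2)).x
print(w_hat, c_hat)   # the slide reports w = 0.58, c = -3.34 for its data
```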

Page 9: Logistic Regression

Example 1: Heart Disease

w = 0.58: an older person is more likely to have heart disease

c = -3.34:
  xw + c < 0  ⇒  p(+|x) < p(-|x)
  xw + c > 0  ⇒  p(+|x) > p(-|x)
  xw + c = 0  ⇒  decision boundary

x* = -c/w = 5.78, i.e. roughly a 53-year-old

p(+|x; θ) = 1 / (1 + exp(-(xw + c)));   p(-|x; θ) = 1 / (1 + exp(xw + c))

[Figure: age-group histogram, as on the previous slides]
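The decision boundary follows directly from xw + c = 0; a tiny check with the rounded estimates (the slide's 5.78 comes from the unrounded fit):

```python
w, c = 0.58, -3.34    # rounded estimates reported on the slide
x_star = -c / w       # xw + c = 0  =>  x* = -c / w
print(x_star)         # ≈ 5.76 with these rounded values; the slide reports 5.78
```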

Page 10: Logistic Regression

Naïve Bayes Solution

• Inaccurate fitting: non-Gaussian distribution

• i* = 5.59: close to the estimate from logistic regression

• Even though naïve Bayes does not fit the input patterns well, it still works fine for the decision boundary

[Figure: naïve Bayes fit over the age groups]

Page 11: Logistic Regression

Problems with Using Histogram Data?

[Figure: age-group histogram of "Heart Disease" vs. "No heart disease"]

Page 12: Logistic Regression

Uneven Sampling for Different Ages

[Figure: number of people in each age group (1-8), showing uneven sampling across ages]

Page 13: Logistic Regression

Solution

Weight each age group by its class percentages instead of raw counts:

l(D_train) = Σ_{i=1}^{#group} [ n_+(i) log p(+|i) + n_-(i) log p(-|i) ]

           = Σ_{i=1}^{#group} [ n_+(i) log 1/(1 + exp(-(iw + c))) + n_-(i) log 1/(1 + exp(iw + c)) ]

becomes

l(D_train) = Σ_{i=1}^{#group} [ p_+(i) log p(+|i) + p_-(i) log p(-|i) ]

           = Σ_{i=1}^{#group} [ p_+(i) log 1/(1 + exp(-(iw + c))) + p_-(i) log 1/(1 + exp(iw + c)) ]

n_+(i) (n_-(i)): number of people in the ith age group that have (do not have) heart disease

p_+(i) (p_-(i)): percentage of people in the ith age group that have (do not have) heart disease

w = 0.63, c = -3.56, i* = 5.65 < 5.78
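A sketch of the percentage-weighted fit, reusing the hypothetical counts from the earlier sketch; replacing the counts n_±(i) with within-group percentages p_±(i) removes the effect of uneven sampling.

```python
import numpy as np
from scipy.optimize import minimize

groups = np.arange(1, 9)
n_pos  = np.array([1, 1, 2, 3, 4, 5, 4, 2], dtype=float)    # hypothetical counts
n_neg  = np.array([12, 10, 9, 8, 5, 4, 2, 1], dtype=float)
p_pos, p_neg = n_pos / (n_pos + n_neg), n_neg / (n_pos + n_neg)

def neg_weighted_ll(theta):
    w, c = theta
    z = groups * w + c
    # l(D) = sum_i [ p_+(i) log p(+|i) + p_-(i) log p(-|i) ]
    return -np.sum(p_pos * -np.log1p(np.exp(-z)) + p_neg * -np.log1p(np.exp(z)))

w_hat, c_hat = minimize(neg_weighted_ll, x0=np.zeros(2)).x
print(w_hat, c_hat, -c_hat / w_hat)   # the slide reports w = 0.63, c = -3.56, i* = 5.65
```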


Page 16: Logistic Regression

Example: Text Classification

Learn to classify text into predefined categories

Input x: a document
  Represented by a vector of words
  Example: {(president, 10), (bush, 2), (election, 5), ...}

Output y: whether the document is about politics or not
  +1 for a political document, -1 for a non-political document

Training data: positive documents d_1^+, d_2^+, ..., d_N^+ and negative documents d_1^-, d_2^-, ..., d_n^-

Each document: d_i = { (word_1, t_{i,1}), (word_2, t_{i,2}), ..., (word_n, t_{i,n}) }

Page 17: Logistic Regression

Example 2: Text Classification

Logistic regression model:

p(y|d; θ) = 1 / (1 + exp(-y (Σ_i w_i t_i + c))),   θ = {w_1, w_2, ..., w_n, c}

Every term t_i is assigned a weight w_i

Learning parameters: MLE approach

l(D_train) = Σ_{i=1}^N log p(+|d_i^+) + Σ_{i=1}^N log p(-|d_i^-)

           = Σ_{i=1}^N log 1/(1 + exp(-(Σ_j w_j t_{i,j} + c))) + Σ_{i=1}^N log 1/(1 + exp(Σ_j w_j t_{i,j} + c))

Need numerical solutions

d = { (word_1, t_1), (word_2, t_2), ..., (word_n, t_n) }
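A minimal sketch of scoring a document under this model; the vocabulary, weights, and counts are hypothetical.

```python
import numpy as np

vocab    = ["president", "bush", "election", "soccer"]   # hypothetical vocabulary
w        = np.array([0.8, 0.4, 0.9, -1.1])               # one weight per term (hypothetical)
c        = -0.5
t_counts = np.array([10, 2, 5, 0])                       # t_i: term counts in document d

score = np.dot(w, t_counts) + c               # sum_i w_i t_i + c
p_political = 1.0 / (1.0 + np.exp(-score))    # p(+1 | d; theta)
print(score, p_political)
```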


Page 19: Logistic Regression

Example 2: Text Classification

Weight w_i:
  w_i > 0: term t_i is positive evidence
  w_i < 0: term t_i is negative evidence
  w_i = 0: term t_i is irrelevant to the category of the document
  The larger |w_i| is, the more important term t_i is in determining the document's category.

Threshold c:
  Σ_i w_i t_i + c > 0: more likely to be a political document
  Σ_i w_i t_i + c < 0: more likely to be a non-political document
  Σ_i w_i t_i + c = 0: decision boundary
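The threshold rule above amounts to taking the sign of the score; a small sketch with the same hypothetical weights:

```python
import numpy as np

def classify(t_counts, w, c):
    """Return +1 (political) if sum_i w_i t_i + c > 0, otherwise -1."""
    return +1 if np.dot(w, t_counts) + c > 0 else -1

w, c = np.array([0.8, 0.4, 0.9, -1.1]), -0.5       # hypothetical weights
print(classify(np.array([10, 2, 5, 0]), w, c))     # +1: positive-evidence terms dominate
print(classify(np.array([0, 0, 0, 6]), w, c))      # -1: only a negative-evidence term appears
```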


Page 21: Logistic Regression

Example 2: Text Classification

• Dataset: Reuters-21578

• Classification accuracy

• Naïve Bayes: 77%

• Logistic regression: 88%

Page 22: Logistic Regression

Why Does Logistic Regression Work Better for Text Classification?

Optimal linear decision boundary
  Generative model: weight ~ log p(w|+) - log p(w|-), which gives sub-optimal weights

Independence assumption
  Naïve Bayes assumes that each word is generated independently
  Logistic regression is able to take the correlation of words into account

Page 23: Logistic Regression

Discriminative Model

Logistic regression is a discriminative model
  Models the conditional probability p(y|x), i.e., the decision boundary

Gaussian generative model
  Models p(x|y), i.e., the input patterns of the different classes

Page 24: Logistic Regression

Comparison

Generative Model
  • Models P(x|y)
  • Models the input patterns
  • Usually converges fast
  • Cheap computation
  • Robust to noisy data
  But:
  • Usually performs worse

Discriminative Model
  • Models P(y|x) directly
  • Models the decision boundary
  • Usually good performance
  But:
  • Slow convergence
  • Expensive computation
  • Sensitive to noisy data


Page 26: Logistic Regression

A Few Words about Optimization

Convex objective function
  Solution could be non-unique

(w*, c*) = argmax_{w,c} Σ_{i=1}^n log [ 1 / (1 + exp(-y_i (x_i·w + c))) ]

Page 27: Logistic Regression

Problems with Logistic Regression?

p(y|x; θ) = 1 / (1 + exp(-y (x_1 w_1 + x_2 w_2 + ... + x_m w_m + c))),   θ = {w_1, w_2, ..., w_m, c}

How about words that only appear in one class?

Page 28: Logistic Regression

Overfitting Problem with Logistic Regression

Consider a word t that appears in only one document d, and d is a positive document. Let w be its associated weight.

l(D_train) = Σ_{i=1}^N log p(+|d_i^+) + Σ_{i=1}^N log p(-|d_i^-)

           = log p(+|d) + Σ_{d_i^+ ≠ d} log p(+|d_i^+) + Σ_{i=1}^N log p(-|d_i^-)

           = log p(+|d) + l_1 + l_2

Consider the derivative of l(D_train) with respect to w:

∂l(D_train)/∂w = ∂ log p(+|d)/∂w + ∂l_1/∂w + ∂l_2/∂w = 1/(1 + exp(x·w + c)) + 0 + 0 > 0

The derivative is always positive, so at the maximum w will be infinite!
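A small numerical illustration (not from the slides) of this effect: a feature that is active in exactly one positive example keeps a strictly positive gradient, so plain gradient ascent pushes its weight up indefinitely.

```python
import numpy as np

# Feature 0 is a common word; feature 1 is the rare word that appears
# only in the single positive document.
X = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [1.0, 0.0]])
y = np.array([+1, -1, -1])
w, c, lr = np.zeros(2), 0.0, 0.5

for _ in range(20000):
    p = 1.0 / (1.0 + np.exp(-y * (X @ w + c)))   # p(y_i | x_i; w, c)
    err = y * (1.0 - p)                          # prediction errors
    w += lr * (X.T @ err)                        # unregularized gradient step
    c += lr * np.sum(err)

print(w)   # w[1] (the rare word's weight) keeps growing with more iterations
```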


Page 30: Logistic Regression

Example of Overfitting for LogRes

[Figure: classification accuracy on test data vs. iteration, showing a decrease in accuracy as training continues]

Page 31: Logistic Regression

Solution: Regularization

Regularized log-likelihood:

l_reg(D_train) = l(D_train) - s||w||²₂

               = Σ_{i=1}^N log p(+|d_i^+) + Σ_{i=1}^N log p(-|d_i^-) - s Σ_{i=1}^m w_i²

s||w||²₂ is called the regularizer
  Favors small weights
  Prevents the weights from becoming too large
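A sketch of the regularized objective in code (leaving the bias c unpenalized is a common choice, assumed here rather than stated on the slide):

```python
import numpy as np

def reg_log_likelihood(w, c, X, y, s):
    """l_reg(D) = sum_i log p(y_i | x_i; w, c) - s * ||w||^2."""
    margins = y * (X @ w + c)                    # y_i (x_i·w + c)
    return np.sum(-np.log1p(np.exp(-margins))) - s * np.dot(w, w)
```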

Page 32: Logistic Regression

The Rare Word Problem

Consider a word t that appears in only one document d, and d is a positive document. Let w be its associated weight.

Without regularization:

l(D_train) = Σ_{i=1}^N log p(+|d_i^+) + Σ_{i=1}^N log p(-|d_i^-)

           = log p(+|d) + Σ_{d_i^+ ≠ d} log p(+|d_i^+) + Σ_{i=1}^N log p(-|d_i^-)

           = log p(+|d) + l_1 + l_2

With regularization:

l_reg(D_train) = Σ_{i=1}^N log p(+|d_i^+) + Σ_{i=1}^N log p(-|d_i^-) - s Σ_{i=1}^m w_i²

               = log p(+|d) + Σ_{d_i^+ ≠ d} log p(+|d_i^+) + Σ_{i=1}^N log p(-|d_i^-) - s Σ_{i=1}^m w_i²

               = log p(+|d) + l_1 + l_2 - s Σ_{i=1}^m w_i²

Page 33: Logistic Regression

The Rare Word Problem

Consider the derivative of l(D_train) with respect to w:

∂l(D_train)/∂w = ∂ log p(+|d)/∂w + ∂l_1/∂w + ∂l_2/∂w = 1/(1 + exp(x·w + c)) + 0 + 0 > 0

∂l_reg(D_train)/∂w = ∂ log p(+|d)/∂w + ∂l_1/∂w + ∂l_2/∂w - 2sw = 1/(1 + exp(x·w + c)) - 2sw

When w is small, the derivative is still positive; but it becomes negative when w is large, so the regularized optimum keeps w finite.


Page 35: Logistic Regression

Regularized Logistic Regression

[Figure: learning curves over iterations, comparing using regularization vs. without regularization]

Page 36: Logistic Regression

Interpretation of Regularizer

Many interpretations of the regularizer:
  Bayesian statistics: model prior
  Statistical learning: minimize the generalization error
  Robust optimization: min-max solution

l_reg(D_train) = l(D_train) - s||w||²₂

Page 37: Logistic Regression

Regularizer: Robust Optimization

Assume each data point is unknown-but-bounded in a sphere of radius s centered at x_i

Find the classifier w that is able to classify the unknown-but-bounded data points with high classification confidence

Page 38: Logistic Regression

Sparse Solution

What does the solution of regularized logistic regression look like?
  A sparse solution: most weights are small and close to zero


Page 40: Logistic Regression

Why Do We Need a Sparse Solution?

Two types of solutions:
1. Many non-zero weights, but many of them are small
2. Only a small number of non-zero weights, and many of them are large

Occam's Razor: the simpler the better
  A simpler model that fits the data is unlikely to be a coincidence
  A complicated model that fits the data might be a coincidence
  A smaller number of non-zero weights → less evidence to consider → simpler model → case 2 is preferred

Page 41: Logistic Regression

Occam's Razor

[Figure: scatter of data points, x from 0 to 5, y from 0 to 2.5]

Page 42: Logistic Regression

Occam's Razor: Power = 1

y = a_1 x

[Figure: linear fit to the data points]

Page 43: Logistic Regression

Occam's Razor: Power = 3

y = a_1 x + a_2 x² + a_3 x³

[Figure: cubic fit to the data points]

Page 44: Logistic Regression

Occam’s Razor: Power = 10

y = a_1 x + a_2 x² + ... + a_10 x^10

[Figure: degree-10 polynomial fit to the data points]
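An illustrative sketch (with made-up data, not the slides' points): fitting the same points with polynomials of degree 1, 3, and 10 shows the higher-degree fits chasing noise rather than the trend.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.2, 5.0, 20)
y = 0.4 * x + rng.normal(scale=0.15, size=x.size)   # roughly linear data plus noise

for degree in (1, 3, 10):
    coeffs = np.polyfit(x, y, degree)   # numpy may warn that the degree-10 fit is poorly conditioned
    residual = np.sum((np.polyval(coeffs, x) - y) ** 2)
    print(degree, residual)             # training residual shrinks with degree, but generalization worsens
```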

Page 45: Logistic Regression

Finding Optimal Solutions

Concave objective function
  No local maxima
  Many standard optimization algorithms work

l_reg(D_train) = l(D_train) - s||w||²₂ = Σ_{i=1}^N log p(y_i|x_i) - s Σ_{i=1}^m w_i²

              = Σ_{i=1}^N log [ 1 / (1 + exp(-y_i (x_i·w + c))) ] - s Σ_{i=1}^m w_i²

Page 46: Logistic Regression

Gradient Ascent

Maximize the log-likelihood by iteratively adjusting the parameters in small increments

In each iteration, we adjust w in the direction that increases the log-likelihood (toward the gradient):

w ← w + η ∂/∂w [ Σ_{i=1}^N log p(y_i|x_i) - s Σ_{i=1}^m w_i² ]
  = w + η [ Σ_{i=1}^N x_i y_i (1 - p(y_i|x_i)) - 2sw ]

c ← c + η ∂/∂c [ Σ_{i=1}^N log p(y_i|x_i) - s Σ_{i=1}^m w_i² ]
  = c + η Σ_{i=1}^N y_i (1 - p(y_i|x_i))

where η is the learning rate.

The y_i (1 - p(y_i|x_i)) terms are the prediction errors; the -2sw term prevents the weights from becoming too large.
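A self-contained sketch of these updates (the learning rate, regularization strength, and data are illustrative choices, not the slides' values):

```python
import numpy as np

def fit_logistic_gradient_ascent(X, y, s=0.1, lr=0.1, n_iters=2000):
    """Regularized logistic regression trained by gradient ascent on l_reg."""
    w, c = np.zeros(X.shape[1]), 0.0
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-y * (X @ w + c)))   # p(y_i | x_i; w, c)
        err = y * (1.0 - p)                          # prediction errors
        w += lr * (X.T @ err - 2.0 * s * w)          # gradient step on w (with -2sw shrinkage)
        c += lr * np.sum(err)                        # gradient step on c
    return w, c

# Illustrative data only.
X = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([+1, +1, -1, -1])
print(fit_logistic_gradient_ascent(X, y))
```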

Page 47: Logistic Regression

Graphical Illustration

No regularization case


Page 49: Logistic Regression

[Figure: learning curves over iterations, comparing using regularization vs. without regularization]

Page 50: Logistic Regression

When Should We Stop?

The gradient ascent learning method converges when there is no incentive to move the parameters in any particular direction:

∂/∂w [ Σ_{i=1}^N log p(y_i|x_i) - s Σ_{i=1}^m w_i² ] = Σ_{i=1}^N x_i y_i (1 - p(y_i|x_i)) - 2sw = 0

∂/∂c [ Σ_{i=1}^N log p(y_i|x_i) - s Σ_{i=1}^m w_i² ] = Σ_{i=1}^N y_i (1 - p(y_i|x_i)) = 0
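In practice the condition is checked numerically: stop once the gradients are close enough to zero. A minimal sketch (the tolerance is an illustrative choice):

```python
import numpy as np

def gradients(w, c, X, y, s):
    """Gradients of l_reg with respect to w and c, as in the equations above."""
    p = 1.0 / (1.0 + np.exp(-y * (X @ w + c)))
    err = y * (1.0 - p)
    return X.T @ err - 2.0 * s * w, np.sum(err)

def converged(w, c, X, y, s, tol=1e-6):
    """Stop gradient ascent once both gradients are numerically zero."""
    grad_w, grad_c = gradients(w, c, X, y, s)
    return np.linalg.norm(grad_w) < tol and abs(grad_c) < tol
```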
