Linear classifiers Lecture 3. David Sontag, New York University. Slides adapted from Luke Zettlemoyer, Vibhav Gogate, and Carlos Guestrin.
Page 1

Linear classifiers Lecture 3

David Sontag

New York University

Slides adapted from Luke Zettlemoyer, Vibhav Gogate, and Carlos Guestrin

Page 2

Example: Spam
•  Imagine 3 features (spam is the "positive" class):
   1.  free (number of occurrences of "free")
   2.  money (occurrences of "money")
   3.  BIAS (intercept, always has value 1)

Weight vector w:                         BIAS : -3   free : 4   money : 2   ...
Input "free money", features f(x):       BIAS : 1    free : 1    money : 1   ...

w.f(x) = (-3)(1) + (4)(1) + (2)(1) = 3 > 0   →   SPAM!
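The rule is just a sparse dot product followed by a threshold. A minimal sketch in Python (the extract_features helper and its word counting are illustrative, not from the slides):

def extract_features(text):
    """Map an email's text to the three features used on the slide."""
    words = text.lower().split()
    return {"BIAS": 1, "free": words.count("free"), "money": words.count("money")}

def score(w, f):
    """Sparse dot product w.f(x) over shared feature names."""
    return sum(w.get(name, 0.0) * value for name, value in f.items())

w = {"BIAS": -3.0, "free": 4.0, "money": 2.0}
f = extract_features("free money")
print(score(w, f))                            # (-3)(1) + (4)(1) + (2)(1) = 3.0
print("SPAM" if score(w, f) > 0 else "HAM")   # SPAM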

Page 3

Binary Decision Rule

•  In the space of feature vectors
   –  Examples are points
   –  Any weight vector defines a hyperplane
   –  One side corresponds to Y = +1
   –  The other side corresponds to Y = -1

[Figure: the weight vector w = (BIAS: -3, free: 4, money: 2) drawn as a line in the (free, money) plane; the side where w.f(x) > 0 is labeled +1 = SPAM, the other side -1 = HAM.]

Page 4

The perceptron algorithm

•  Start with weight vector w = 0
•  For each training instance (xi, yi*):
   –  Classify with current weights: predict y = +1 if w.f(xi) ≥ 0, otherwise y = -1
   –  If correct (i.e., y = yi*), no change!
   –  If wrong: update

      w ← w + y* f(x)

(add the feature vector to w when a positive example is misclassified; subtract it when a negative example is misclassified; equivalently w ← w + [y* − y(x; w)] f(x), which is zero when the prediction is correct. Compare the gradient update for logistic regression, w ← w + η Σj [yj* − p(yj* | xj, w)] f(xj).)

(Also on this page, from the Gaussian Naive Bayes derivation of linear decision boundaries: for each feature i,
   ln [ (1/(σi√(2π))) exp(−(xi − μi0)² / (2σi²))  /  (1/(σi√(2π))) exp(−(xi − μi1)² / (2σi²)) ]
      = −(xi − μi0)² / (2σi²) + (xi − μi1)² / (2σi²)
      = ((μi0 − μi1) / σi²) xi + (μi1² − μi0²) / (2σi²),
so the corresponding linear classifier has wi = (μi0 − μi1) / σi² and w0 = ln((1 − θ) / θ) + Σi (μi1² − μi0²) / (2σi²).)
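A compact sketch of the algorithm described above, in Python (the number of passes over the data is an illustrative choice, not something the slide specifies):

import numpy as np

def perceptron(X, y, n_passes=10):
    """X: (n, d) array of feature vectors f(x) (include a constant BIAS column).
    y: array of labels in {+1, -1}.  Returns the learned weight vector."""
    n, d = X.shape
    w = np.zeros(d)                              # start with the all-zeros weight vector
    for _ in range(n_passes):
        for i in range(n):
            y_hat = 1 if X[i] @ w >= 0 else -1   # classify with current weights
            if y_hat != y[i]:                    # if wrong: update
                w = w + y[i] * X[i]
    return w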

Page 5

Def: Linearly separable data

A training set {(xt, yt)} is linearly separable with margin γ > 0 if there exists a weight vector w with ||w|| = 1 such that yt (w.xt) ≥ γ for all t. γ is called the margin.

Equivalently, for yt = +1, w.xt ≥ γ, and for yt = -1, w.xt ≤ -γ.

[Figure: separable points with the separating hyperplane, the weight vector w drawn as its normal, and the margin γ marked as the distance to the closest points.]

Page 6

Mistake Bound: Separable Case

•  Assume the data set D is linearly separable with margin γ, i.e., there is a w* with ||w*|| = 1 such that yt (w*.xt) ≥ γ for all t

•  Assume that ||xt|| ≤ R for all t

•  Theorem: The maximum number of mistakes made by the perceptron algorithm is bounded by R² / γ²

[Rong Jin]
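The bound can be sanity-checked numerically. A sketch on synthetic separable data (the data generation and the stopping rule are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
w_star = np.array([0.6, 0.8])                  # unit-norm separator, ||w_star|| = 1
X = rng.uniform(-1, 1, size=(200, 2))
keep = np.abs(X @ w_star) >= 0.2               # keep only points with margin at least 0.2
X, y = X[keep], np.sign(X[keep] @ w_star)

R = np.max(np.linalg.norm(X, axis=1))          # radius of the data
gamma = np.min(y * (X @ w_star))               # achieved margin

w = np.zeros(2)
mistakes = 0
changed = True
while changed:                                  # run the perceptron to convergence
    changed = False
    for xi, yi in zip(X, y):
        if yi * (xi @ w) <= 0:                  # mistake (ties count as mistakes)
            w += yi * xi
            mistakes += 1
            changed = True

print(mistakes, "<=", (R / gamma) ** 2)         # never exceeds R^2 / gamma^2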

Page 7

Proof by induction

[Rong Jin]


(full proof given on board)

Page 8

Properties of the perceptron algorithm

•  Separability: some parameters get the training set perfectly correct

•  Convergence: if the training data is linearly separable, the perceptron will eventually converge

[Figures: a linearly separable data set and a non-separable one.]

Page 9

Problems with the perceptron algorithm

•  Noise: if the data isn't linearly separable, no guarantees of convergence or accuracy

•  Frequently the training data is linearly separable! Why?
   –  When the number of features is much larger than the number of data points, there is lots of flexibility
   –  As a result, the perceptron can significantly overfit the data

•  Averaged perceptron is an algorithmic modification that helps with both issues
   –  Averages the weight vectors across all iterations (see the sketch below)
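A sketch of the averaged perceptron, under the common reading that the returned classifier uses the average of the weight vectors seen after every training example (the number of passes is again an illustrative choice):

import numpy as np

def averaged_perceptron(X, y, n_passes=10):
    """Train like the ordinary perceptron, but return the average of all
    intermediate weight vectors instead of the final one."""
    n, d = X.shape
    w = np.zeros(d)
    w_sum = np.zeros(d)                    # running sum of weight vectors
    count = 0
    for _ in range(n_passes):
        for i in range(n):
            if y[i] * (X[i] @ w) <= 0:     # mistake: standard perceptron update
                w = w + y[i] * X[i]
            w_sum += w                     # accumulate after every example
            count += 1
    return w_sum / count                   # averaged weights, used at test time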

Page 10

Linear Separators

•  Which of these linear separators is optimal?

Page 11

Support Vector Machine (SVM)

•  SVMs (Vapnik, 1990s) choose the linear separator with the largest margin

•  Good according to intuition, theory, practice

•  SVM became famous when, using images as input, it gave accuracy comparable to neural networks with hand-designed features on a handwriting recognition task

V. Vapnik

Robust to outliers!

Page 12

Review: Normal to a plane

w.x + b = 0

Write xj = xj⊥ + r (w / ||w||), where:

•  xj⊥ -- the projection of xj onto the plane

•  w / ||w|| -- unit vector parallel to w

•  r -- the length of the vector from the plane to xj, i.e. the signed distance r = (w.xj + b) / ||w||
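A tiny numeric illustration of this decomposition (the particular w, b, and point are made up):

import numpy as np

w = np.array([3.0, 4.0])                   # hyperplane w.x + b = 0
b = -5.0
x = np.array([4.0, 3.0])

w_hat = w / np.linalg.norm(w)              # unit vector parallel to w
r = (w @ x + b) / np.linalg.norm(w)        # signed distance from x to the plane
x_perp = x - r * w_hat                     # projection of x onto the plane

print(r)                                   # 3.8
print(w @ x_perp + b)                      # ~0: the projection lies on the plane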

Page 13

Scale invariance

w.x + b = 0

Any other ways of writing the same dividing line?
•  w.x + b = 0
•  2w.x + 2b = 0
•  1000w.x + 1000b = 0
•  ...

Page 14

Scale invariance

w.x + b = +1
w.x + b = -1
w.x + b = 0

During learning, we set the scale by asking that, for all t:

   for yt = +1,  w.xt + b ≥ +1
   and for yt = -1,  w.xt + b ≤ -1

That is, we want to satisfy all of the linear constraints

   yt (w.xt + b) ≥ 1   for all t
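A small check of these constraints on toy data (both the data and the particular (w, b) are illustrative):

import numpy as np

X = np.array([[2.0, 1.0], [3.0, 2.0], [-2.0, -1.0], [-3.0, 0.0]])
y = np.array([+1, +1, -1, -1])

w = np.array([1.0, 0.5])
b = -0.5

ok = y * (X @ w + b) >= 1                  # the constraints yt (w.xt + b) >= 1
print(ok)                                  # [ True  True  True  True ]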

Page 15

w.x + b = +1
w.x + b = -1
w.x + b = 0

What is γ as a function of w?

Take x1 on the line w.x + b = +1 and x2 on the line w.x + b = -1, with x1 - x2 parallel to w, so that

   x1 = x2 + 2γ (w / ||w||)

We also know that:

   w.x1 + b = +1
   w.x2 + b = -1

So, subtracting, w.(x1 - x2) = 2, which gives 2γ ||w|| = 2, i.e.

   γ = 1 / ||w||

Final result: can maximize margin by minimizing ||w||²!
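A quick numeric check of γ = 1 / ||w|| (the specific w, b, and starting point are made up):

import numpy as np

w = np.array([3.0, 4.0])                      # ||w|| = 5
b = 1.0
gamma = 1.0 / np.linalg.norm(w)               # claimed margin, 0.2

x2 = np.array([0.0, -0.5])                    # lies on w.x + b = -1
x1 = x2 + 2 * gamma * w / np.linalg.norm(w)   # step 2*gamma along the unit normal

print(w @ x2 + b)                             # -1.0
print(w @ x1 + b)                             # +1.0, as the derivation predicts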

Page 16

Support vector machines (SVMs)

•  Example of a convex optimization problem
   –  A quadratic program
   –  Polynomial-time algorithms to solve!

•  Hyperplane defined by support vectors
   –  Could use them as a lower-dimensional basis to write down the line, although we haven't seen how yet

•  More on these later

w.x + b = +1
w.x + b = -1
w.x + b = 0

margin 2γ

Support vectors:
•  data points on the canonical lines

Non-support vectors:
•  everything else
•  moving them will not change w
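For concreteness, a sketch of fitting a maximum-margin linear classifier with scikit-learn and reading off its support vectors (assumes scikit-learn is installed; the toy data is made up, and the large C is used to approximate the hard-margin SVM described here):

import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 1.5],
              [-2.0, -2.0], [-3.0, -1.0], [-1.5, -2.5]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6)          # large C ~ hard margin
clf.fit(X, y)

w = clf.coef_[0]
b = clf.intercept_[0]
print("w =", w, "b =", b)
print("gamma =", 1.0 / np.linalg.norm(w))            # margin 1 / ||w||, as derived above
print("support vectors:\n", clf.support_vectors_)    # points on the canonical lines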
