EE104 S. Lall and S. Boyd
Multi-Class Classification
Sanjay Lall and Stephen Boyd
EE104, Stanford University
Multi-class classification
▶ multi-class classification with V = {1, . . . , K}
▶ embed the K classes as ψ1, . . . , ψK ∈ R^m
▶ use nearest-neighbor un-embedding, v̂ = argmin_i ‖ŷ − ψi‖₂
▶ use RERM to fit the predictor
▶ validate using the Neyman-Pearson metric on test data
▶ Neyman-Pearson metric is Σ_j κj Ej (see the sketch below)
  ▶ Ej is the rate of mistaking v = j
  ▶ κj is our relative distaste for mistaking v = j
  ▶ with κ1 = · · · = κK = 1, it reduces to the error rate
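Below is a minimal numpy sketch of nearest-neighbor un-embedding and the Neyman-Pearson metric. The helper names (unembed, neyman_pearson) and the toy embedding, predictions, and labels are placeholders for illustration, not from the lecture; classes are indexed 0, . . . , K−1 in code.

```python
import numpy as np

def unembed(yhat, psi):
    """Nearest-neighbor un-embedding: index of the embedding vector closest to yhat."""
    return int(np.argmin(np.linalg.norm(psi - yhat, axis=1)))   # psi is (K, m), yhat is (m,)

def neyman_pearson(v, vhat, kappa):
    """Sum_j kappa_j * E_j, where E_j is the fraction of samples with v = j mis-predicted."""
    v, vhat = np.asarray(v), np.asarray(vhat)
    E = np.array([np.mean((v == j) & (vhat != j)) for j in range(len(kappa))])
    return float(kappa @ E)

psi = np.eye(3)                                   # toy one-hot embedding, K = 3, m = 3
yhats = np.array([[0.9, 0.2, 0.1], [0.1, 0.4, 0.8], [0.3, 0.9, 0.2]])
vhat = np.array([unembed(y, psi) for y in yhats])
v = np.array([0, 2, 2])                           # true classes
print(vhat, neyman_pearson(v, vhat, kappa=np.ones(3)))   # with kappa = 1: the error rate
```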
Signed distances
[figure: points a and b in the plane, with the hyperplane H separating points closer to a from points closer to b]
When is a vector closer to one given vector than another?
▶ when is ŷ ∈ R^m closer to a than to b, where a ≠ b?
▶ square both sides of ‖ŷ − a‖₂ < ‖ŷ − b‖₂ to get
      2(b − a)ᵀŷ − ‖b‖₂² + ‖a‖₂² < 0
  (the expansion is spelled out below)
▶ the decision boundary is given by ‖ŷ − a‖₂ = ‖ŷ − b‖₂, i.e.,
      2(b − a)ᵀŷ − ‖b‖₂² + ‖a‖₂² = 0
▶ this defines a hyperplane H in R^m, with normal vector b − a, passing through the midpoint (a + b)/2
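For completeness, the squaring step written out (the ‖ŷ‖₂² terms cancel):

```latex
\|\hat y - a\|_2^2 < \|\hat y - b\|_2^2
\iff \|\hat y\|_2^2 - 2a^T \hat y + \|a\|_2^2 < \|\hat y\|_2^2 - 2b^T \hat y + \|b\|_2^2
\iff 2(b - a)^T \hat y - \|b\|_2^2 + \|a\|_2^2 < 0 .
```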
[figure: points a and b, the hyperplane H, and a point ŷ at signed distance D(ŷ, a, b) from H]
Signed distance to the decision boundary
▶ the signed distance of ŷ to H is
      D(ŷ, a, b) = (2(b − a)ᵀŷ − ‖b‖₂² + ‖a‖₂²) / (2‖b − a‖₂)
▶ D(ŷ, a, b) < 0 when ŷ is closer to a than to b
▶ |D(ŷ, a, b)| is the distance of ŷ to H
▶ D(ŷ, a, b) = 0 gives the decision boundary
▶ D(ŷ, a, b) is an affine function of ŷ (see the sketch below)
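A minimal numpy sketch of this signed distance, with a quick numerical check of the properties above; the function name signed_dist and the test vectors are illustrative placeholders.

```python
import numpy as np

def signed_dist(yhat, a, b):
    """Signed distance of yhat to the hyperplane of points equidistant from a and b.
    Negative when yhat is closer to a, positive when closer to b."""
    num = 2 * (b - a) @ yhat - b @ b + a @ a
    return num / (2 * np.linalg.norm(b - a))

a, b = np.array([0.0, 0.0]), np.array([2.0, 1.0])
yhat = np.array([0.5, 2.0])

d = signed_dist(yhat, a, b)
closer_to_a = np.linalg.norm(yhat - a) < np.linalg.norm(yhat - b)
print(d, closer_to_a)                     # d < 0 exactly when yhat is closer to a
print(signed_dist((a + b) / 2, a, b))     # the midpoint lies on the boundary: 0.0
```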
Signed distances
▶ now consider un-embedding the prediction ŷ ∈ R^m, i.e., finding which ψi is closest
▶ define signed distance functions Dij, for i ≠ j, as
      Dij(ŷ) = D(ŷ, ψi, ψj) = (2(ψj − ψi)ᵀŷ − ‖ψj‖₂² + ‖ψi‖₂²) / (2‖ψj − ψi‖₂)
▶ Dij(ŷ) < 0 means ŷ is closer to ψi than to ψj
▶ ŷ is closest to ψi when Dij(ŷ) < 0 for all j ≠ i, i.e., when
      max_{j≠i} Dij(ŷ) < 0
▶ loss function should encourage this, when y = ψi (see the sketch below)
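A small numpy sketch checking the statement above: ŷ un-embeds to class i exactly when max_{j≠i} Dij(ŷ) < 0. The one-hot embedding, the test point, and the helper name signed_dists are placeholders for illustration; classes are indexed from 0 in code.

```python
import numpy as np

def signed_dists(yhat, psi):
    """Matrix of pairwise signed distances D_ij(yhat) for the embedding rows psi (D_ii = 0)."""
    K = psi.shape[0]
    D = np.zeros((K, K))
    for i in range(K):
        for j in range(K):
            if i != j:
                num = 2 * (psi[j] - psi[i]) @ yhat - psi[j] @ psi[j] + psi[i] @ psi[i]
                D[i, j] = num / (2 * np.linalg.norm(psi[j] - psi[i]))
    return D

psi = np.eye(3)                           # placeholder embedding: K = 3 one-hot vectors
yhat = np.array([0.2, 0.7, 0.4])

D = signed_dists(yhat, psi)
i_star = int(np.argmin(np.linalg.norm(psi - yhat, axis=1)))    # nearest-neighbor un-embedding
print(i_star, [bool(np.max(np.delete(D[i], i)) < 0) for i in range(3)])  # True only at i_star
```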
Examples
▶ Boolean, with ψ1 = −1 and ψ2 = 1:
      D12(ŷ) = ŷ,    D21(ŷ) = −ŷ
  so, when y = −1, we'd like ŷ < 0; when y = +1, we'd like ŷ > 0
▶ one-hot, with ψj = ej, j = 1, . . . , K:
      Dij(ŷ) = (ŷj − ŷi)/√2,    i ≠ j
  so, when y = ei, we want max_{j≠i} Dij(ŷ) < 0, i.e., argmax_j ŷj = i
Multi-class loss functions
Loss function for multi-class classification
▶ we need to specify the K functions of ŷ
      ℓ(ŷ, i),    i = 1, . . . , K
▶ ℓ(ŷ, i) is how much we dislike predicting ŷ when y = ψi
▶ the loss function ℓ(ŷ, i) should be
  ▶ small when max_{j≠i} Dij(ŷ) < 0
  ▶ larger when max_{j≠i} Dij(ŷ) ≥ 0
Neyman-Pearson loss
▶ Neyman-Pearson loss is
      ℓ(ŷ, i) = { 0    if max_{j≠i} Dij(ŷ) < 0
                { κi   otherwise
  i.e., zero when class i is decoded correctly, κi otherwise (evaluated in the sketch below)
▶ it's hard to minimize L(θ) with this loss
▶ we do better with a proxy loss that
  ▶ approximates, or at least captures the flavor of, the Neyman-Pearson loss
  ▶ is more easily optimized (e.g., is convex, differentiable)
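A tiny sketch that just evaluates this loss under a placeholder one-hot embedding, decoding by nearest neighbor (equivalent to max_{j≠i} Dij(ŷ) < 0 up to ties); it illustrates the definition rather than a way to minimize it, and np_loss is an illustrative helper name.

```python
import numpy as np

def np_loss(yhat, i, psi, kappa):
    """Neyman-Pearson loss: 0 if yhat un-embeds to class i, kappa_i otherwise."""
    decoded = int(np.argmin(np.linalg.norm(psi - yhat, axis=1)))
    return 0.0 if decoded == i else kappa[i]

psi, kappa = np.eye(3), np.ones(3)        # placeholder embedding and weights, classes 0..2
print(np_loss(np.array([0.1, 0.9, 0.2]), 1, psi, kappa))   # 0.0: decoded correctly
print(np_loss(np.array([0.1, 0.9, 0.2]), 2, psi, kappa))   # 1.0: decoded incorrectly
```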
Multi-class hinge loss
▶ hinge loss is
      ℓ(ŷ, i) = κi max_{j≠i} (1 + Dij(ŷ))₊
▶ ℓ(ŷ, i) is zero when ŷ is correctly un-embedded, with a margin of at least one
▶ convex but not differentiable
▶ with quadratic regularization, called multi-class SVM
▶ for Boolean embedding with ψ1 = −1, ψ2 = 1, reduces to
      ℓ(ŷ, −1) = κ1(1 + ŷ)₊,    ℓ(ŷ, 1) = κ2(1 − ŷ)₊
  the usual hinge loss when κ1 = κ2 = 1 (see the sketch below)
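A minimal numpy sketch of this loss under an assumed one-hot embedding; signed_dists and hinge_loss are illustrative helper names, and the test points are made up.

```python
import numpy as np

def signed_dists(yhat, psi):
    """Pairwise signed distances D_ij(yhat) for embedding rows psi, with D_ii = 0."""
    sq = np.sum(psi ** 2, axis=1)
    num = 2 * (psi[None, :, :] - psi[:, None, :]) @ yhat - sq[None, :] + sq[:, None]
    den = 2 * np.linalg.norm(psi[None, :, :] - psi[:, None, :], axis=2)
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

def hinge_loss(yhat, i, psi, kappa):
    """Multi-class hinge loss: kappa_i * max_{j != i} (1 + D_ij(yhat))_+."""
    D = signed_dists(yhat, psi)
    return kappa[i] * np.max(np.maximum(1 + np.delete(D[i], i), 0.0))

psi, kappa = np.eye(3), np.ones(3)        # placeholder one-hot embedding, unit weights
print(hinge_loss(np.array([2.0, -1.0, -1.0]), 0, psi, kappa))  # 0.0: correct with margin >= 1
print(hinge_loss(np.array([0.2, 0.6, 0.2]), 0, psi, kappa))    # > 0: yhat is not closest to psi_0
```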
Multi-class hinge loss
[figure: an embedding of three classes ψ1, ψ2, ψ3 in the plane, and surface plots of the multi-class hinge losses ℓ(ŷ, 1), ℓ(ŷ, 2), ℓ(ŷ, 3) over (ŷ1, ŷ2)]
Multi-class logistic loss
▶ logistic loss is
      ℓ(ŷ, i) = κi log( Σ_{j=1}^K exp(Dij(ŷ)) )
  (where we take Dii = 0)
▶ convex and differentiable
▶ called multi-class logistic regression
▶ for Boolean embedding with ψ1 = −1, ψ2 = 1, reduces to
      ℓ(ŷ, −1) = κ1 log(1 + e^ŷ),    ℓ(ŷ, 1) = κ2 log(1 + e^(−ŷ))
  the usual logistic loss when κ1 = κ2 = 1 (see the sketch below)
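A minimal numpy sketch of this loss, again under an assumed one-hot embedding; logistic_loss is an illustrative helper, and it evaluates the sum (with Dii = 0) using the log-sum-exp shift described on the next slide.

```python
import numpy as np

def logistic_loss(yhat, i, psi, kappa):
    """Multi-class logistic loss: kappa_i * log(sum_j exp(D_ij(yhat))), with D_ii = 0."""
    diff = psi - psi[i]                                  # rows: psi_j - psi_i
    num = 2 * diff @ yhat - np.sum(psi ** 2, axis=1) + psi[i] @ psi[i]
    den = 2 * np.linalg.norm(diff, axis=1)
    D = np.divide(num, den, out=np.zeros_like(num), where=den > 0)   # D[i] = 0
    m = np.max(D)
    return kappa[i] * (m + np.log(np.sum(np.exp(D - m))))            # stable log-sum-exp

psi, kappa = np.eye(3), np.ones(3)        # placeholder one-hot embedding, unit weights
print(logistic_loss(np.array([3.0, -1.0, -1.0]), 0, psi, kappa))   # small: confident and correct
print(logistic_loss(np.array([0.0, 1.0, 0.0]), 0, psi, kappa))     # larger: yhat closest to psi_1
```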
Multi-class logistic loss
[figure: an embedding of three classes ψ1, ψ2, ψ3 in the plane, and surface plots of the multi-class logistic losses ℓ(ŷ, 1), ℓ(ŷ, 2), ℓ(ŷ, 3) over (ŷ1, ŷ2)]
Log-sum-exp function
▶ the function f : R^n → R given by
      f(x) = log( Σ_{i=1}^n exp(xi) )
  is called the log-sum-exp function
▶ it is a convex, differentiable approximation of the max function
▶ sometimes called the softmax function, but that term is also used for other functions
▶ we have
      max{x1, . . . , xn} ≤ f(x) ≤ max{x1, . . . , xn} + log(n)
  (checked numerically below)
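A short, numerically stable implementation with a check of the bounds above; shifting by max(x) before exponentiating is the standard way to avoid overflow.

```python
import numpy as np

def logsumexp(x):
    """log(sum(exp(x))), computed with a shift by max(x) to avoid overflow."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

x = np.random.default_rng(0).normal(size=5) * 10
print(np.max(x) <= logsumexp(x) <= np.max(x) + np.log(len(x)))   # True: the bounds above
print(logsumexp(np.array([1000.0, 1000.0])))                     # 1000 + log 2, no overflow
```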
Example: Iris
Example: Iris
▶ famous example dataset by Fisher, 1936
▶ measurements of 150 plants, 50 from each of 3 species
  ▶ iris setosa, iris versicolor, iris virginica
▶ four measurements: sepal length, sepal width, petal length, petal width
Example: Iris
[figure: scatterplot matrix of the four Iris features: sepal length, sepal width, petal length, and petal width]
Classification with two features
[figure: the Iris data in the sepal length (horizontal) vs. sepal width (vertical) plane]
▶ using only sepal_length and sepal_width
▶ one-hot embedding, multi-class logistic loss with κi = 1 for all i, trained on all data
▶ confusion matrix
      C = [ 50   0   0 ]
          [  0  38  13 ]
          [  0  12  37 ]
Classification with all four features
▶ use all four features
▶ one-hot embedding, multi-class logistic loss with κi = 1 for all i, trained on all data (see the sketch below)
▶ confusion matrix
      C = [ 50   0   0 ]
          [  0  49   1 ]
          [  0   1  49 ]
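A sketch of both Iris experiments above (two features, then all four) using scikit-learn; the choice of library is an assumption, since the lecture does not name the software. sklearn's LogisticRegression minimizes the multi-class logistic loss with its own regularization, and its confusion_matrix puts true classes in rows, so the numbers and orientation may differ slightly from the slides.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

X, y = load_iris(return_X_y=True)             # 150 plants, 4 features, 3 species
for cols in ([0, 1], [0, 1, 2, 3]):           # sepal length/width only, then all four
    model = LogisticRegression(max_iter=1000).fit(X[:, cols], y)
    yhat = model.predict(X[:, cols])
    print(cols)
    print(confusion_matrix(y, yhat))           # trained and evaluated on all the data
```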
Summary
Summary
▶ loss functions for multi-class classification should encourage correct un-embedding, i.e.,
  ▶ ℓ(ŷ, i) is small when ŷ is closest to ψi
  ▶ ℓ(ŷ, i) is not small when ŷ is not closest to ψi
▶ the most common losses are the multi-class hinge loss and the multi-class logistic loss
▶ the associated classifiers are called multi-class SVM and multi-class logistic regression
▶ both losses are convex, so the ERM or RERM problems are easy to solve