  • EE104 S. Lall and S. Boyd

    Multi-Class Classification

    Sanjay Lall and Stephen Boyd

    EE104, Stanford University

    1

  • Multi-class classification

    - multi-class classification with V = {1, ..., K}

    - embed the K classes as ψ_1, ..., ψ_K ∈ R^m

    - use nearest-neighbor un-embedding, v̂ = argmin_i ‖ŷ − ψ_i‖₂ (a code sketch follows this slide)

    - use RERM to fit the predictor

    - validate using the Neyman-Pearson metric on test data

    - the Neyman-Pearson metric is Σ_j π_j E_j

      - E_j is the rate of mistaking v = j

      - π_j is our relative distaste for mistaking v = j

      - with π_1 = ... = π_K = 1, it reduces to the error rate

    2
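
    Below is a minimal Python sketch, not from the lecture, of nearest-neighbor un-embedding and the Neyman-Pearson metric. The helper names (unembed, neyman_pearson_metric), the one-hot embedding, and the tiny example data are my own choices for illustration; E_j is taken here as the error rate among test examples with v = j.

        import numpy as np

        def unembed(yhat, psi):
            # nearest-neighbor un-embedding: v_hat = argmin_i ||yhat - psi_i||_2
            # psi is a K x m array whose rows are the embedded classes psi_1, ..., psi_K
            return int(np.argmin(np.linalg.norm(psi - yhat, axis=1))) + 1  # 1-based, as on the slide

        def neyman_pearson_metric(v_true, v_pred, pi):
            # sum_j pi_j * E_j, with E_j taken as the error rate among test examples with v = j
            v_true, v_pred = np.asarray(v_true), np.asarray(v_pred)
            total = 0.0
            for j, pi_j in enumerate(pi, start=1):
                in_class_j = v_true == j
                if in_class_j.any():
                    total += pi_j * np.mean(v_pred[in_class_j] != j)
            return total

        psi = np.eye(3)                                  # one-hot embedding, K = 3
        print(unembed(np.array([0.2, 0.7, 0.1]), psi))   # 2
        print(neyman_pearson_metric([1, 2, 3, 3], [1, 2, 3, 2], pi=[1.0, 1.0, 1.0]))  # 0.5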

  • Signed distances

    3

  • 2 1 0 1 2 3 41

    0

    1

    2

    3

    4

    5

    a

    bH

    When is a vector closer to one given vector than another?

    - when is ŷ ∈ R^m closer to a than to b, where a ≠ b?

    - square both sides of ‖ŷ − a‖₂ < ‖ŷ − b‖₂ to get (the expansion is written out after this slide)

      2(b − a)ᵀŷ − ‖b‖₂² + ‖a‖₂² < 0

    - the decision boundary is given by ‖ŷ − a‖₂ = ‖ŷ − b‖₂, i.e.,

      2(b − a)ᵀŷ − ‖b‖₂² + ‖a‖₂² = 0

    - this defines a hyperplane H in R^m, with normal vector b − a, passing through the midpoint (a + b)/2

    4
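
    The squaring step, written out (this expansion is added here; it is implicit on the slide):

      ‖ŷ − a‖₂² < ‖ŷ − b‖₂²  ⟺  ‖ŷ‖₂² − 2aᵀŷ + ‖a‖₂² < ‖ŷ‖₂² − 2bᵀŷ + ‖b‖₂²  ⟺  2(b − a)ᵀŷ − ‖b‖₂² + ‖a‖₂² < 0

    where the last step cancels the ‖ŷ‖₂² terms and rearranges.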

  • 2 1 0 1 2 3 41

    0

    1

    2

    3

    4

    5

    a

    bH

    Signed distance to the decision boundary

    - the signed distance of ŷ to H is

      D(ŷ, a, b) = (2(b − a)ᵀŷ − ‖b‖₂² + ‖a‖₂²) / (2‖b − a‖₂)

      (a small numeric sketch follows this slide)

    - D(ŷ, a, b) < 0 when ŷ is closer to a than to b

    - |D(ŷ, a, b)| is the distance of ŷ to H

    - D(ŷ, a, b) = 0 gives the decision boundary

    - D(ŷ, a, b) is an affine function of ŷ

    5
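
    A small numeric sketch of the signed distance (my own code, not the authors'; the points a, b, and ŷ are arbitrary examples):

        import numpy as np

        def signed_distance(yhat, a, b):
            # D(yhat, a, b) = (2 (b - a)^T yhat - ||b||^2 + ||a||^2) / (2 ||b - a||)
            return (2 * (b - a) @ yhat - b @ b + a @ a) / (2 * np.linalg.norm(b - a))

        a, b = np.array([0.0, 0.0]), np.array([2.0, 0.0])
        print(signed_distance(np.array([0.5, 1.0]), a, b))  # -0.5: closer to a, distance 0.5 from H
        print(signed_distance(np.array([1.0, 3.0]), a, b))  #  0.0: on the boundary H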

  • Signed distances

    - now consider un-embedding the prediction ŷ ∈ R^m, i.e., finding which ψ_i is closest

    - define signed distance functions D_ij, for i ≠ j, as

      D_ij(ŷ) = D(ŷ, ψ_i, ψ_j) = (2(ψ_j − ψ_i)ᵀŷ − ‖ψ_j‖₂² + ‖ψ_i‖₂²) / (2‖ψ_j − ψ_i‖₂)

    - D_ij(ŷ) < 0 means ŷ is closer to ψ_i than to ψ_j

    - ŷ is closest to ψ_i when D_ij(ŷ) < 0 for all j ≠ i, or

      max_{j≠i} D_ij(ŷ) < 0

      (a sketch checking this against nearest-neighbor un-embedding follows this slide)

    - the loss function should encourage this when y = ψ_i

    6
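
    To tie this to nearest-neighbor un-embedding, here is a sketch (my own, with an arbitrary embedding chosen for illustration) checking that max_{j≠i} D_ij(ŷ) < 0 holds exactly when ψ_i is the nearest embedded point:

        import numpy as np

        def D(yhat, a, b):
            return (2 * (b - a) @ yhat - b @ b + a @ a) / (2 * np.linalg.norm(b - a))

        psi = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])  # arbitrary embedding, K = 3, m = 2
        yhat = np.array([0.4, 0.3])

        nearest = np.argmin(np.linalg.norm(psi - yhat, axis=1))
        for i in range(len(psi)):
            max_Dij = max(D(yhat, psi[i], psi[j]) for j in range(len(psi)) if j != i)
            print(i, max_Dij < 0, i == nearest)  # the two booleans agree for every i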

  • Examples

    - Boolean, with ψ_1 = −1 and ψ_2 = 1:

      D_12(ŷ) = ŷ,   D_21(ŷ) = −ŷ

      so, when y = −1, we'd like ŷ < 0; when y = +1, we'd like ŷ > 0

    - one-hot, with ψ_j = e_j, j = 1, ..., K:

      D_ij(ŷ) = (ŷ_j − ŷ_i)/√2,   i ≠ j

      so, when y = e_i, we want max_{j≠i} D_ij(ŷ) < 0, i.e., argmax_j ŷ_j = i

      (both formulas are checked numerically in the sketch after this slide)

    7
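
    A quick numeric check of both examples (my own sketch; the test vectors are arbitrary):

        import numpy as np

        def D(yhat, a, b):
            return (2 * (b - a) @ yhat - b @ b + a @ a) / (2 * np.linalg.norm(b - a))

        # Boolean embedding: psi_1 = -1, psi_2 = +1, so D_12(yhat) should equal yhat
        yhat = np.array([0.3])
        print(D(yhat, np.array([-1.0]), np.array([1.0])))  # 0.3

        # one-hot embedding: D_ij(yhat) should equal (yhat_j - yhat_i) / sqrt(2)
        e = np.eye(3)
        yhat = np.array([0.2, 0.7, 0.1])
        print(D(yhat, e[0], e[1]), (yhat[1] - yhat[0]) / np.sqrt(2))  # both approximately 0.354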

  • Multi-class loss functions

    8

  • Loss function for multi-class classification

    - we need to give the K functions of ŷ

      ℓ(ŷ, i),   i = 1, ..., K

    - ℓ(ŷ, i) is how much we dislike predicting ŷ when y = ψ_i

    - the loss function ℓ(ŷ, i) should be

      - small when max_{j≠i} D_ij(ŷ) < 0

      - larger when max_{j≠i} D_ij(ŷ) ≥ 0

    9

  • Neyman-Pearson loss

    - the Neyman-Pearson loss is

      ℓ(ŷ, i) = 0 if max_{j≠i} D_ij(ŷ) < 0, and π_i otherwise

      i.e., zero when class i is decoded correctly, π_i otherwise (a direct implementation follows this slide)

    - it's hard to minimize L(θ)

    - we do better with a proxy loss that

      - approximates, or at least captures the flavor of, the Neyman-Pearson loss

      - is more easily optimized (e.g., is convex, differentiable)

    10
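
    A direct implementation of this loss, as a sketch (the helper name np_loss, the one-hot embedding, and the test point are my own choices):

        import numpy as np

        def D(yhat, a, b):
            return (2 * (b - a) @ yhat - b @ b + a @ a) / (2 * np.linalg.norm(b - a))

        def np_loss(yhat, i, psi, pi):
            # 0 if class i is decoded correctly, pi_i otherwise (indices are 0-based here)
            max_Dij = max(D(yhat, psi[i], psi[j]) for j in range(len(psi)) if j != i)
            return 0.0 if max_Dij < 0 else pi[i]

        psi, pi = np.eye(3), [1.0, 1.0, 1.0]  # one-hot embedding, equal weights
        yhat = np.array([0.2, 0.7, 0.1])
        print(np_loss(yhat, 1, psi, pi))  # 0.0, since yhat is closest to the second unit vector
        print(np_loss(yhat, 0, psi, pi))  # 1.0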

  • Multi-class hinge loss

    - the hinge loss is

      ℓ(ŷ, i) = π_i max_{j≠i} (1 + D_ij(ŷ))_+

    - ℓ(ŷ, i) is zero when ŷ is correctly un-embedded, with a margin of at least one

    - convex but not differentiable

    - with quadratic regularization, the resulting classifier is called the multi-class SVM

    - for the Boolean embedding with ψ_1 = −1, ψ_2 = 1, it reduces to

      ℓ(ŷ, −1) = π_1 (1 + ŷ)_+,   ℓ(ŷ, 1) = π_2 (1 − ŷ)_+

      the usual hinge loss when π_i = 1 (a code sketch follows this slide)

    11
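
    A sketch of the multi-class hinge loss (my own code; the final line reproduces the Boolean reduction stated above):

        import numpy as np

        def D(yhat, a, b):
            return (2 * (b - a) @ yhat - b @ b + a @ a) / (2 * np.linalg.norm(b - a))

        def hinge_loss(yhat, i, psi, pi):
            # pi_i * max_{j != i} (1 + D_ij(yhat))_+   (indices are 0-based here)
            return pi[i] * max(max(0.0, 1.0 + D(yhat, psi[i], psi[j]))
                               for j in range(len(psi)) if j != i)

        # Boolean embedding: for y = -1 this should equal pi_1 * (1 + yhat)_+
        psi, pi = np.array([[-1.0], [1.0]]), [1.0, 1.0]
        yhat = np.array([0.4])
        print(hinge_loss(yhat, 0, psi, pi), max(0.0, 1 + yhat[0]))  # both 1.4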

  • Multi-class hinge loss

    [surface plots of the multi-class hinge losses ℓ(ŷ, 1), ℓ(ŷ, 2), ℓ(ŷ, 3) as functions of ŷ_1 and ŷ_2]

    12

  • Multi-class logistic loss

    - the logistic loss is

      ℓ(ŷ, i) = π_i log( Σ_{j=1}^K exp(D_ij(ŷ)) )

      (where we take D_jj = 0)

    - convex and differentiable

    - the resulting classifier is called multi-class logistic regression

    - for the Boolean embedding with ψ_1 = −1, ψ_2 = 1, it reduces to

      ℓ(ŷ, −1) = π_1 log(1 + e^ŷ),   ℓ(ŷ, 1) = π_2 log(1 + e^(−ŷ))

      the usual logistic loss when π_i = 1 (a code sketch follows this slide)

    13
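
    A sketch of the multi-class logistic loss (my own code; the final line reproduces the Boolean reduction stated above):

        import numpy as np

        def D(yhat, a, b):
            return (2 * (b - a) @ yhat - b @ b + a @ a) / (2 * np.linalg.norm(b - a))

        def logistic_loss(yhat, i, psi, pi):
            # pi_i * log( sum_j exp(D_ij(yhat)) ), taking D_ii = 0  (indices are 0-based here)
            Dij = [0.0 if j == i else D(yhat, psi[i], psi[j]) for j in range(len(psi))]
            return pi[i] * np.log(np.sum(np.exp(Dij)))

        # Boolean embedding: for y = -1 this should equal pi_1 * log(1 + exp(yhat))
        psi, pi = np.array([[-1.0], [1.0]]), [1.0, 1.0]
        yhat = np.array([0.4])
        print(logistic_loss(yhat, 0, psi, pi), np.log(1 + np.exp(yhat[0])))  # both approximately 0.913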

  • Multi-class logistic loss

    [surface plots of the multi-class logistic losses ℓ(ŷ, 1), ℓ(ŷ, 2), ℓ(ŷ, 3) as functions of ŷ_1 and ŷ_2]

    14

  • Log-sum-exp function

    - the function f : R^n → R with

      f(x) = log Σ_{i=1}^n exp(x_i)

      is called the log-sum-exp function

    - it is a convex, differentiable approximation to the max function

    - it is sometimes called the softmax function, but that term is also used for other functions

    - we have

      max{x_1, ..., x_n} ≤ f(x) ≤ max{x_1, ..., x_n} + log n

      (checked numerically in the sketch after this slide)

    15
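
    A quick numeric check of these bounds (my own sketch, using scipy.special.logsumexp, which computes the same function in a numerically stable way; the vector x is arbitrary):

        import numpy as np
        from scipy.special import logsumexp

        x = np.array([1.0, -2.0, 0.5, 3.0])
        f = logsumexp(x)  # log(sum_i exp(x_i))
        print(max(x), f, max(x) + np.log(len(x)))  # 3.0 <= ~3.20 <= ~4.39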

  • Example: Iris

    16

  • Example: Iris

    - famous example dataset by Fisher, 1936

    - measurements of 150 plants, 50 from each of 3 species

      - iris setosa, iris versicolor, iris virginica

    - four measurements: sepal length, sepal width, petal length, petal width

    17

  • Example: Iris

    [figure: scatter-plot matrix of the four measurements: sepal length, sepal width, petal length, petal width]

    18

  • Classification with two features

    [figure: the iris data plotted over sepal length (horizontal axis) and sepal width (vertical axis)]

    - using only sepal_length and sepal_width

    - one-hot embedding, multi-class logistic loss with π_i = 1 for all i, trained on all data (a scikit-learn sketch follows this slide)

    - confusion matrix

      C = [ 50  0  0 ;  0 38 13 ;  0 12 37 ]

    19
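
    A sketch of this experiment using scikit-learn (not the authors' code; the solver and regularization settings are my own choices, so the resulting confusion matrix may differ slightly from the one above):

        from sklearn.datasets import load_iris
        from sklearn.linear_model import LogisticRegression
        from sklearn.metrics import confusion_matrix

        iris = load_iris()
        X = iris.data[:, :2]  # sepal length and sepal width only
        y = iris.target       # the three species, coded 0, 1, 2

        # multinomial logistic regression; a very large C means very little regularization
        model = LogisticRegression(C=1e6, max_iter=10_000).fit(X, y)

        print(confusion_matrix(y, model.predict(X)))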

  • Classification with all four features

    - use all four features

    - one-hot embedding, multi-class logistic loss with π_i = 1 for all i, trained on all data (variant sketch below)

    - confusion matrix

      C = [ 50  0  0 ;  0 49  1 ;  0  1 49 ]

    20
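
    The same sketch with all four features (again my own code, not the authors'; only the column selection changes):

        from sklearn.datasets import load_iris
        from sklearn.linear_model import LogisticRegression
        from sklearn.metrics import confusion_matrix

        iris = load_iris()
        model = LogisticRegression(C=1e6, max_iter=10_000).fit(iris.data, iris.target)
        print(confusion_matrix(iris.target, model.predict(iris.data)))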

  • Summary

    21

  • Summary

    - loss functions for multi-class classification should encourage correct un-embedding, i.e.,

      - ℓ(ŷ, i) is small when ŷ is closest to ψ_i

      - ℓ(ŷ, i) is not small when ŷ is not closest to ψ_i

    - the most common losses are the multi-class hinge loss and the multi-class logistic loss

    - the associated classifiers are called the multi-class SVM and multi-class logistic regression

    - both losses are convex, so the ERM or RERM problems are easy to solve

    22

