
Probabilistic Outputs for SVMs and

Comparisons to Regularized Likelihood Methods

John Platt¹

January 31st, 2007

¹ Presented by Nikos Karampatziakis


Outline

Background.

Related Work.

Platt’s Method.

Results and Conclusions.


SVM and Probabilities

In many settings we give an input to a classifier and are more interested in its degree of belief that the output should be +1 than in the hard decision itself.

Typical examples include combining individual predictions and supporting a “reject” option.

In such cases it is useful to produce a probability P(y = 1 | x).

However, SVMs do not produce such probabilities on their own.
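As a concrete illustration (a minimal sketch of mine, assuming scikit-learn and a toy dataset; it is not part of the slides), an SVM's decision function returns unbounded margin values rather than probabilities:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)
svm = SVC(kernel="linear").fit(X, y)

# Real-valued margins such as -2.3 or 0.7: useful for ranking and thresholding,
# but not interpretable as P(y = 1 | x).
print(svm.decision_function(X[:5]))
```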


SVM and Probabilities (2)

Traditional SVM training:

$$\min_{w,\,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_i \xi_i \qquad \text{s.t.}\quad y_i(w \cdot x_i) \ge 1 - \xi_i,\ \ \xi_i \ge 0$$

Classification of a point x: f(x) = w · x.

The focus is on accuracy, i.e. the zero/one loss.

When the loss function is not symmetric, probabilities can help.
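For example (a worked step of mine, not on the slide): with cost $c_{FN}$ for a missed positive and $c_{FP}$ for a false alarm, minimizing expected cost means

$$\text{predict } +1 \iff P(y = 1 \mid x)\,c_{FN} > \bigl(1 - P(y = 1 \mid x)\bigr)\,c_{FP} \iff P(y = 1 \mid x) > \frac{c_{FP}}{c_{FP} + c_{FN}},$$

so with, say, $c_{FN} = 5$ and $c_{FP} = 1$ the decision threshold moves to $P(y = 1 \mid x) = 1/6$, which requires an actual probability rather than just the sign of $f(x)$.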


(Not so) Recent Work (1)

Wahba’s Approach: Train an SVM to minimize the negative log-likelihood

$$\min_f\ \sum_i \Bigl[-y_i f(x_i) + \log\bigl(1 + e^{f(x_i)}\bigr)\Bigr]$$

Can add a regularization term to control the complexity of f versus the fit to the data.
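One plausible form of that regularized objective (my notation for the penalty, with $\lambda$ trading off a norm of $f$ against the data fit; this exact form is not on the slide):

$$\min_f\ \lambda\,\|f\|^2 + \sum_i \Bigl[-y_i f(x_i) + \log\bigl(1 + e^{f(x_i)}\bigr)\Bigr]$$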

In this case $P(y = 1 \mid x) = \dfrac{1}{1 + e^{-f(x)}}$.

This formulation gives solutions with many support vectors.


(Not so) Recent Work (2)

Vapnik’s Approach: Fit a function P(y = 1 | t, u) on the data, where t is the coordinate of the input along the normal w and u collects the remaining (orthogonal) directions.

[Figure: the input decomposed into the coordinate t along w and the orthogonal coordinates u.]

Vapnik uses a linear combination of basis functions (cosines) to fit P(y = 1 | t, u).

Need to solve a linear system for every new input x.

Issues with monotonicity and outputs outside [0, 1]?


(Not so) Recent Work (3)

Hastie & Tibshirani: Fit Gaussians(?) to p(f | y = ±1). Writing q = P(y = 1) and assuming a common variance σ²:

$$P(y = 1 \mid f) = \frac{p(f \mid y = 1)\,P(y = 1)}{\sum_{i = \pm 1} P(y = i)\,p(f \mid y = i)}$$

$$= \frac{q \exp\!\bigl(-\tfrac{(f - \mu_1)^2}{\sigma^2}\bigr)}{q \exp\!\bigl(-\tfrac{(f - \mu_1)^2}{\sigma^2}\bigr) + (1 - q)\exp\!\bigl(-\tfrac{(f - \mu_2)^2}{\sigma^2}\bigr)}$$

$$= \frac{1}{1 + \tfrac{1 - q}{q}\exp\!\bigl(-\tfrac{(f - \mu_2)^2}{\sigma^2} + \tfrac{(f - \mu_1)^2}{\sigma^2}\bigr)}$$

$$= \frac{1}{1 + \exp\!\bigl(2\tfrac{\mu_2 - \mu_1}{\sigma^2}\,f + \tfrac{\mu_1^2 - \mu_2^2}{\sigma^2} + \ln\tfrac{1 - q}{q}\bigr)}$$

With unequal variances we get $P(y = 1 \mid f) = \dfrac{1}{1 + e^{a f^2 + b f + c}}$, where

$$a = \frac{\sigma_2^2 - \sigma_1^2}{\sigma_2^2 \sigma_1^2}$$

Non-monotonic in f.
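A small illustrative sketch of this Gaussian-fit idea (my own code, not from the slides), assuming we already have decision values f and labels y in {−1, +1}:

```python
import numpy as np

def gaussian_posterior(f_train, y_train, f_new):
    """Fit one Gaussian per class to the decision values and return P(y=1 | f_new).

    Variances are not forced to be equal, so the posterior has the quadratic
    (possibly non-monotonic) form 1 / (1 + exp(a*f^2 + b*f + c)).
    """
    pos, neg = f_train[y_train == 1], f_train[y_train == -1]
    mu1, s1 = pos.mean(), pos.std()
    mu2, s2 = neg.mean(), neg.std()
    q = len(pos) / len(f_train)  # prior P(y = 1)

    def normal_pdf(f, mu, s):
        return np.exp(-(f - mu) ** 2 / (2 * s ** 2)) / (s * np.sqrt(2 * np.pi))

    num = q * normal_pdf(f_new, mu1, s1)
    return num / (num + (1 - q) * normal_pdf(f_new, mu2, s2))
```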


Platt’s Method (1)

Idea: Look at the data! (but not Vapnik’s u)

p(f | y = i) seems to be exponentially distributed when f is on the wrong side of the margin, e.g. $p(f \mid y = 1) = r_1 e^{-r_1(1 - f)}$ for $f \le 1$. If we use Bayes’ rule we get

$$P(y = 1 \mid f) = \frac{1}{1 + \exp(Af + B)}$$

where $A = -(r_1 + r_2)$ and $B = r_1 - r_2 + \ln\frac{P(y = -1)}{P(y = 1)}$.


Platt’s Method (2)

Platt’s sigmoid vs. Hastie & Tibshirani’s sigmoid: they differ in the number of parameters and in the training procedure.

Using the previous histogram we can compute estimates of $p(f \in \text{bin}_i)$ and $p(f \in \text{bin}_i \mid y = 1)$. Then by Bayes’ rule:

$$P(y = 1 \mid f \in \text{bin}_i) = \frac{p(f \in \text{bin}_i \mid y = 1)\,P(y = 1)}{p(f \in \text{bin}_i)}$$

Plotting these probabilities and the fitted sigmoid we see that Platt’s sigmoid is doing well in practice.

The reliability diagrams we saw are also sigmoid-shaped.
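A sketch of this binned sanity check (mine, not from the slides), assuming decision values f, labels y in {−1, +1}, and already-fitted sigmoid parameters A, B:

```python
import numpy as np

def binned_check(f, y, A, B, n_bins=10):
    """Compare binned empirical estimates of P(y=1 | f) against the fitted sigmoid."""
    edges = np.quantile(f, np.linspace(0, 1, n_bins + 1))
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (f >= lo) & (f <= hi)
        if not mask.any():
            continue
        empirical = np.mean(y[mask] == 1)                        # fraction of positives in the bin
        fitted = 1.0 / (1.0 + np.exp(A * 0.5 * (lo + hi) + B))   # sigmoid at the bin center
        print(f"bin [{lo:+.2f}, {hi:+.2f}]  empirical={empirical:.2f}  sigmoid={fitted:.2f}")
```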


Platt’s Method (3)

Training data: $(p_i, t_i)$, where $p_i$ is the sigmoid’s response to $f_i$ and $t_i = \frac{y_i + 1}{2}$.

Parameters are fit by minimizing $-\log \prod_i p_i^{t_i}(1 - p_i)^{1 - t_i}$:

$$\min\ -\sum_i \bigl[t_i \log(p_i) + (1 - t_i)\log(1 - p_i)\bigr]$$
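A minimal sketch of this fit (mine; it uses a generic SciPy optimizer rather than the pseudocode from Platt’s paper):

```python
import numpy as np
from scipy.optimize import minimize

def fit_platt_sigmoid(f, t):
    """Fit (A, B) in P(y=1 | f) = 1 / (1 + exp(A*f + B)) by minimizing the
    cross-entropy between the targets t and the sigmoid outputs."""
    def nll(params):
        A, B = params
        z = A * f + B
        # Stable form of -sum(t*log(p) + (1-t)*log(1-p)) with p = 1/(1+exp(z))
        return np.sum(np.logaddexp(0.0, z) - (1.0 - t) * z)

    result = minimize(nll, x0=np.zeros(2), method="Nelder-Mead")
    return result.x  # A, B
```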

Issue: How to choose the training set?

Using the output of the SVM on its own training set gives a biased estimate, both for linear and non-linear SVMs.

Better: (re)use a hold-out set, or use cross-validation.
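For instance (a sketch of mine assuming scikit-learn), out-of-fold decision values can be obtained with cross-validation, so the sigmoid is not fit on scores the SVM has already adapted to:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

# Each decision value is produced by a model that never saw that example.
f_unbiased = cross_val_predict(SVC(kernel="linear"), X, y, cv=3,
                               method="decision_function")
```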


Platt’s Method (4)

Another issue: How to avoid overfitting?

Overfitting occurs when there are very few examples from one class and they are separable from the other class.

Then the learned sigmoid is essentially a step function.

We are back to bad probabilities.

Add some regularization by changing the targets $t_i$ ($0 \to \varepsilon_-$, $1 \to 1 - \varepsilon_+$).

Minimizing the same function is still valid.

A similar trick is used in neural net training, where no error is propagated back if the difference between the target and the output is small.


Platt’s Method (5)

MAP estimate

$$t_i = \begin{cases} \dfrac{N_+ + 1}{N_+ + 2} & \text{if } y_i = +1 \\[2mm] \dfrac{1}{N_- + 2} & \text{if } y_i = -1 \end{cases}$$

These are derived in the same way as class probabilities in the leaves of a decision tree when we use Laplace smoothing.

We start with two examples, one positive and one negative. Then we get $N_+$ positives ($N_-$ negatives).
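Putting the pieces together, a small sketch of computing these smoothed targets before calling the fitting routine sketched earlier (fit_platt_sigmoid and f_unbiased are the hypothetical names from the previous sketches):

```python
import numpy as np

def platt_targets(y):
    """Map labels in {-1, +1} to the MAP-smoothed targets from the slide."""
    n_pos = np.sum(y == 1)
    n_neg = np.sum(y == -1)
    t_pos = (n_pos + 1.0) / (n_pos + 2.0)
    t_neg = 1.0 / (n_neg + 2.0)
    return np.where(y == 1, t_pos, t_neg)

# t = platt_targets(y)
# A, B = fit_platt_sigmoid(f_unbiased, t)
```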


Experiments

Three models: raw SVM, SVM+sigmoid, and an SVM trained to maximize the log-likelihood.

Reuters, Adult and Web data sets.

Linear and quadratic kernels.

Accuracy of the raw SVM (threshold at f(x) = 0) vs. SVM+sigmoid (threshold at P(y = 1) = 0.5).

Quality of the probabilities: log-likelihood SVM vs. SVM+sigmoid.
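Why the two accuracy thresholds can differ (a one-line derivation of mine, not on the slide):

$$P(y = 1 \mid f) = \tfrac{1}{2} \iff Af + B = 0 \iff f = -\tfrac{B}{A},$$

so the sigmoid places its decision boundary at $f = -B/A$, which coincides with the raw SVM threshold $f = 0$ only when $B = 0$.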


Results

Zero threshold is not always optimal (we knew that from 578). The sigmoid threshold is significantly better for unbalanced problems.

The produced probabilities are no worse than those of the regularized likelihood SVM. The solution is sparser, and fitting the sigmoid is much simpler than implementing a regularized likelihood kernel machine.

The SVM with sigmoid and the regularized likelihood SVM are each trained to optimize one measure, but they perform similarly on both accuracy and log-likelihood. For a particular set of hypotheses (e.g. SVMs with quadratic kernels) it is hard to know in advance which training method will perform better.


More Recent Results

Beware of the pseudocode! A more recent paper by Chih-Jen Lin and colleagues points out bugs and numerical difficulties in it.

Platt’s method is not specific to SVMs.

Any model that predicts poor probabilities can be calibrated this way, but already well-calibrated models, such as neural nets, cannot benefit from any type of calibration.

Reliability diagrams in the latter case are very close to straight lines, and a sigmoid is not a good model for fitting a straight line.
