Topics
• Regularization
– prior, penalties, MAP estimation
– the effect of regularization, generalization
– regularization and discrimination
• Discriminative classification
– criterion, margin
– support vector machine
Tommi Jaakkola, MIT CSAIL 2
MAP estimation, regularization
• Consider again a simple 2-d logistic regression model
    P(y = 1|x, w) = g(w0 + w1 x1 + w2 x2)
• Before seeing any data we may prefer some values of the parameters over others (e.g., small over large values).
• We can express this preference through a prior distribution over the parameters (here omitting w0)
    p(w1, w2; σ²) = (1 / (2πσ²)) exp{ −(w1² + w2²) / (2σ²) }
  where σ² determines how tightly around zero we want to constrain the values of w1 and w2.
• To combine the prior with the available data we find the MAP (maximum a posteriori) parameter estimates:
    w_MAP = argmax_w [ ∏_{i=1}^n P(y_i|x_i, w) ] p(w1, w2; σ²)
MAP estimation, regularization
• The estimation criterion is now given by a penalized log-likelihood (cf. log-posterior):
    l(D; w) = Σ_{i=1}^n log P(y_i|x_i, w) + log p(w1, w2; σ²)
            = Σ_{i=1}^n log P(y_i|x_i, w) − (w1² + w2²) / (2σ²) + const.
• We'd like to understand how the solution changes as a function of the prior variance σ² (or, more generally, with different priors).
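The penalized log-likelihood above can be maximized by plain gradient ascent: the log-prior simply adds a −w/σ² term to the usual logistic regression gradient. A minimal NumPy sketch (the function names and the toy data are illustrative, not from the slides; labels are 0/1 as in the logistic regression model):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_map(X, y, sigma2, n_iter=5000, lr=0.1):
    """Gradient ascent on the penalized log-likelihood.
    X: (n, d) inputs; y: (n,) labels in {0, 1}.
    The Gaussian prior (variance sigma2) penalizes w1, w2 but not w0."""
    n = len(y)
    w0, w = 0.0, np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(w0 + X @ w)                       # P(y = 1 | x, w)
        w0 += lr * np.sum(y - p) / n                  # unpenalized gradient for w0
        w += lr * (X.T @ (y - p) - w / sigma2) / n    # penalized gradient for w1, w2
    return w0, w

def penalized_ll(X, y, w0, w, sigma2):
    """l(D; w) up to the constant term."""
    p = sigmoid(w0 + X @ w)
    ll = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return ll - np.sum(w ** 2) / (2 * sigma2)

# toy 2-d data (made up for illustration)
X = np.array([[1., 2.], [2., 1.], [2., 2.],
              [-1., -2.], [-2., -1.], [-2., -2.]])
y = np.array([1., 1., 1., 0., 0., 0.])
w0, w = fit_map(X, y, sigma2=1.0)
```

The ascent starts from w = 0, so the fitted parameters should score at least as well as the all-zero model under the penalized criterion.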
The effect of regularization
• Let's first understand graphically how the addition of the prior changes the solution
    l(D; w) = Σ_{i=1}^n log P(y_i|x_i, w)   [log-likelihood]
              − (w1² + w2²) / (2σ²)         [log-prior]
              + const.
  [Figure: contour plots over (w1, w2) of the log-prior, the log-likelihood, and the resulting log-posterior]
The effect of regularization cont'd
    l(D; w) = Σ_{i=1}^n log P(y_i|x_i, w) − (w1² + w2²) / (2σ²) + const.
  [Figure: resulting decision boundaries for σ² = ∞, σ² = 10/n, and σ² = 1/n]
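The shrinking effect of the prior can also be seen numerically: refit the model for several prior variances and compare ‖(w1, w2)‖. A hedged sketch (`map_norm` and the toy data are illustrative, not from the slides):

```python
import numpy as np

def map_norm(X, y, sigma2, n_iter=4000, lr=0.1):
    """Return ||(w1, w2)|| after a gradient-ascent MAP fit with prior variance sigma2."""
    n = len(y)
    w0, w = 0.0, np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(w0 + X @ w)))
        w0 += lr * np.sum(y - p) / n
        w += lr * (X.T @ (y - p) - w / sigma2) / n   # prior pulls w toward zero
    return np.linalg.norm(w)

# toy separable 2-d data (made up for illustration)
X = np.array([[1., 2.], [2., 1.], [2., 2.],
              [-1., -2.], [-2., -1.], [-2., -2.]])
y = np.array([1., 1., 1., 0., 0., 0.])

# decreasing sigma^2 corresponds to a tighter prior around zero
norms = [map_norm(X, y, s2) for s2 in (100.0, 1.0, 0.01)]
```

The fitted norms should decrease monotonically as σ² shrinks, mirroring the sequence of boundaries on the slide.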
The effect of regularization cont'd
    l(D; w) = Σ_{i=1}^n log P(y_i|x_i, w) − w1² / (2σ²) + const.
  [Figure: decision boundaries when only w1 is penalized, for σ² = ∞, σ² = 10/n, σ² = 1/n, and σ² = 0.1/n]
The effect of regularization: train/test
• (Scaled) penalized log-likelihood criterion
    l(D; w)/n = (1/n) Σ_{i=1}^n log P(y_i|x_i, w) − (c/2)(w1² + w2²) + const.
  where c = 1/(nσ²); increasing c results in stronger regularization.
• Resulting average log-likelihoods
    training log-lik. = (1/n) Σ_{i=1}^n log P(y_i|x_i, w_MAP)
    test log-lik. = E_{(x,y)∼P} { log P(y|x, w_MAP) }
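These two quantities can be traced out by sweeping c on synthetic data. A sketch under stated assumptions (`fit`, `avg_ll`, and the Gaussian-cluster data are all made up for illustration; labels are 0/1). One property worth checking: the regularized optimum can never fit the training data better than the unregularized one.

```python
import numpy as np

def fit(Xtr, ytr, c, n_iter=3000, lr=0.2):
    """Maximize (1/n) sum_i log P(y_i|x_i, w) - (c/2)||w||^2 by gradient ascent."""
    n = len(ytr)
    w0, w = 0.0, np.zeros(Xtr.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(w0 + Xtr @ w)))
        w0 += lr * np.mean(ytr - p)
        w += lr * (Xtr.T @ (ytr - p) / n - c * w)
    return w0, w

def avg_ll(X, y, w0, w):
    """Average log-likelihood (1/n) sum_i log P(y_i|x_i, w)."""
    p = 1.0 / (1.0 + np.exp(-(w0 + X @ w)))
    return np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# overlapping Gaussian classes: small training set, larger test set
rng = np.random.default_rng(0)
Xtr = np.vstack([rng.normal(1, 1, (20, 2)), rng.normal(-1, 1, (20, 2))])
ytr = np.array([1.0] * 20 + [0.0] * 20)
Xte = np.vstack([rng.normal(1, 1, (200, 2)), rng.normal(-1, 1, (200, 2))])
yte = np.array([1.0] * 200 + [0.0] * 200)

fits = {c: fit(Xtr, ytr, c) for c in (0.0, 0.1, 1.0)}
```

Comparing `avg_ll` on `Xtr` and `Xte` across the values of c reproduces the qualitative train/test curves discussed next.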
The effect of regularization: train/test
  [Figure: training and test mean log-likelihoods as functions of c = 1/(nσ²)]
Likelihood, regularization, and discrimination
• Regularization by penalizing ‖w1‖² = w1² + w2² in
    l(D; w)/n = (1/n) Σ_{i=1}^n log P(y_i|x_i, w) − (c/2)(w1² + w2²) + const.
  does not directly limit the logistic regression model as a classifier. For example:
  [Figure: mean log-likelihood (train/test) and classification accuracy (train/test) as functions of c = 1/(nσ²)]
Likelihood, regularization, and discrimination
• Regularization by penalizing ‖w1‖² = w1² + w2² in
    l(D; w)/n = (1/n) Σ_{i=1}^n log P(y_i|x_i, w) − (c/2)(w1² + w2²) + const.
  does not directly limit the logistic regression model as a classifier.
• Classification decisions only depend on the sign of the discriminant function
    f(x; w) = w0 + x^T w1 = (x − x0)^T w1
  where w1 = [w1, w2]^T and x0 is chosen such that w0 = −x0^T w1. Limiting ‖w1‖² = w1² + w2² does not reduce the possible signs.
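The scale invariance of the decision rule is easy to check numerically: multiplying all parameters by any α > 0 rescales f(x; w) without changing its sign anywhere. A small sketch with made-up parameters and points:

```python
import numpy as np

# arbitrary parameters and inputs, chosen only for illustration
w0, w1 = 0.5, np.array([2.0, -1.0])
X = np.array([[1.0, 1.0], [-2.0, 0.5], [0.0, 3.0], [3.0, -2.0]])

f = w0 + X @ w1                      # discriminant values f(x; w)
for alpha in (0.01, 1.0, 100.0):     # shrink or inflate all parameters
    f_scaled = alpha * w0 + X @ (alpha * w1)
    # every predicted label sign(f) is unchanged by the rescaling
    assert np.array_equal(np.sign(f_scaled), np.sign(f))
```

This is exactly why penalizing ‖w1‖² alone cannot remove any classifier from the model class: only the scale changes, never the sign pattern.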
Topics
• Regularization
– prior, penalties, MAP estimation
– the effect of regularization, generalization
– regularization and discrimination
• Discriminative classification
– criterion, margin
– support vector machine
Discriminative classification
• Consider again a binary classification task with y = ±1 labels (not 0/1 as before) and linear discriminant functions
    f(x; w) = w0 + x^T w1
  parameterized by w0 and w1 = [w1, ..., wd]^T.
• The predicted label is simply given by the sign of the discriminant function: y = sign(f(x; w))
• We are only interested in getting the labels correct; no probabilities are associated with the predictions.
Discriminative classification
• When the training set {(x1, y1), ..., (xn, yn)} is linearly separable we can find parameters w such that
    y_i [w0 + x_i^T w1] > 0,  i = 1, ..., n
  i.e., the sign of the discriminant function agrees with the label.
  [Figure: a linearly separable 2-d training set with one separating boundary]
  (there are many possible solutions)
Discriminative classification
• Perhaps we can find a better discriminant boundary by requiring that the training examples are separated with a fixed "margin":
    y_i [w0 + x_i^T w1] − 1 ≥ 0,  i = 1, ..., n
  [Figure: separable training sets with boundaries satisfying the fixed-margin constraints]
  The problem is the same as before. The notion of "margin" used here depends on the scale of ‖w1‖.
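Why is the problem the same as before? Any w that strictly separates the data can be rescaled until the fixed-margin constraints hold, so the constraint alone does not pin down a boundary. A small numerical sketch (data and parameters made up for illustration):

```python
import numpy as np

# toy separable data and a deliberately small-scale parameter vector
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w0, w1 = 0.0, np.array([0.1, 0.1])

margins = y * (w0 + X @ w1)
assert np.all(margins > 0)           # w separates the data...
assert not np.all(margins >= 1)      # ...but misses the fixed margin of 1

alpha = 1.0 / margins.min()          # rescale by the worst-case margin
margins_scaled = y * (alpha * w0 + X @ (alpha * w1))
assert np.all(margins_scaled >= 1.0 - 1e-9)   # now every constraint holds
```

Since scaling fixes any strictly separating w, something beyond the constraints (a penalty on ‖w1‖) is needed to make the margin meaningful.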
Margin and regularization
• We get a more meaningful (geometric) notion of margin by regularizing the problem:
    minimize (1/2)‖w1‖² = (1/2) Σ_{i=1}^d w_i²
  subject to
    y_i [w0 + x_i^T w1] − 1 ≥ 0,  i = 1, ..., n
• What can we say about the solution?
Margin and regularization
• One dimensional example: f(x; w) = w0 + w1 x
  Relevant constraints:
    +1 [w0 + w1 x+] − 1 ≥ 0
    −1 [w0 + w1 x−] − 1 ≥ 0
  At the mid point the value of the margin is |x+ − x−|/2.
  [Figure: 1-d examples with f(x; w*) = w*0 + w*1 x crossing zero at the midpoint between x+ and x−]
• We can find the maximum margin solution by minimizing the slope |w1| while satisfying the classification constraints.
• The resulting margin is directly tied to the minimizing slope (slope = 1/margin): |w*1| = 2/|x+ − x−|
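The 1-d solution can be written in closed form and checked directly: with the boundary at the midpoint, the smallest slope that makes both margin constraints tight is |w1*| = 2/|x+ − x−|. A sketch with illustrative values (x+ = 3, x− = 1 are made up):

```python
# closest positive and negative examples, assuming x_plus > x_minus
x_plus, x_minus = 3.0, 1.0

w1 = 2.0 / (x_plus - x_minus)        # minimizing slope |w1*| = 2 / |x+ - x-|
w0 = -w1 * (x_plus + x_minus) / 2.0  # puts the zero crossing at the midpoint

f = lambda x: w0 + w1 * x
assert abs(f(x_plus) - 1.0) < 1e-12   # the +1 constraint is tight
assert abs(f(x_minus) + 1.0) < 1e-12  # the -1 constraint is tight
# the geometric margin equals 1/|w1*| = |x+ - x-|/2
assert abs(1.0 / abs(w1) - (x_plus - x_minus) / 2.0) < 1e-12
```

Both constraints being tight at the optimum is exactly the "support vector" picture that generalizes to higher dimensions.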
Support vector machine
• We minimize the regularization penalty
    (1/2)‖w1‖² = (1/2) Σ_{i=1}^d w_i²
  subject to the classification constraints
    y_i [w0 + x_i^T w1] − 1 ≥ 0,  i = 1, ..., n.
  [Figure: maximum margin linear boundary in 2-d; the support vectors lie on the margin]
• Analogously to the one dimensional case, the "slope" is related to the geometric margin: ‖w*1‖ = 1/margin.
• The solution is again defined only on the basis of a subset of examples or "support vectors".
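The slide poses a constrained quadratic program, normally handed to a QP solver. As a rough numerical sketch one can instead minimize the equivalent hinge-loss objective by stochastic subgradient descent (a Pegasos-style method, not from the slides). The bias w0 is dropped here, which assumes the classes are roughly symmetric about the origin; the data are made up:

```python
import numpy as np

def svm_subgradient(X, y, lam=0.01, n_epochs=200):
    """Stochastic subgradient descent on the hinge-loss form of the SVM:
    lam/2 ||w||^2 + mean_i max(0, 1 - y_i x_i.w), with y_i in {+1, -1}.
    No bias term (assumes data centered about the origin)."""
    w = np.zeros(X.shape[1])
    t = 0
    for _ in range(n_epochs):
        for i in range(len(y)):
            t += 1
            eta = 1.0 / (lam * t)            # decaying step size
            violated = y[i] * (X[i] @ w) < 1.0
            w *= (1.0 - eta * lam)           # shrink: subgradient of the penalty
            if violated:                     # margin constraint violated
                w += eta * y[i] * X[i]       # subgradient of the hinge term
    return w

# two well-separated clusters, symmetric about the origin (made-up data)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(2.0, 0.3, (20, 2)),
               rng.normal(-2.0, 0.3, (20, 2))])
y = np.array([1.0] * 20 + [-1.0] * 20)
w = svm_subgradient(X, y)
```

On separable data like this the returned w should classify every training example correctly; only the points whose hinge term is active (the support vectors) ever contribute update steps.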