Logistic Regression
Announcements
1. HW 1 due today
2. HW 2 released today, due 5/15
Process
1. Decide on a model
2. Find the function which fits the data best: choose a loss function, then pick the function which minimizes loss on the data
3. Use the function to make predictions on new examples
Applied to classification:
- Data: {(x_i, y_i)}_{i=1}^n, with y_i ∈ {0, 1} and x_i ∈ R^d
- Model: f: R^d → [0, 1], where f(x) estimates P(Y = 1 | X = x)
- Learn: f̂ = argmin_f (1/n) Σ_{i=1}^n ℓ(f(x_i), y_i), then predict on new x
Logistic Regression
Actually classification, not regression :)
Logistic function (or sigmoid): σ(z) = 1 / (1 + exp(−z))
Learn P(Y = 1|X = x) using σ(w^T x) as the link function.
Features can be discrete or continuous!
P[Y = 1|X = x,w] = �(wTx) =1
1 + exp(�wTx)<latexit sha1_base64="lK3XZHT7juGteOWzhLgwqXNsfog=">AAACMXicdVDLSgMxFM34tr6qLt1cLEJFLTO11LoQCm5cVrBa6Ywlk2ZqMPMgyWjLOL/kxj8RNy4UcetPmLEVVPRAuCfn3EtyjxtxJpVpPhlj4xOTU9Mzs7m5+YXFpfzyyqkMY0Fok4Q8FC0XS8pZQJuKKU5bkaDYdzk9c68OM//smgrJwuBEDSLq+LgXMI8RrLTUyR/ZPlaXrps00vY5HIAFt9DStb8NN46utmQ9HxdvLk6gv6mvnsAksdLEgi2waT8q7gyttJMvmCXT3CtXq5CRWmXXykjFNPerYGklQwGN0OjkH+xuSGKfBopwLGXbMiPlJFgoRjhNc3YsaYTJFe7RtqYB9ql0ks+NU9jQShe8UOgTKPhUv08k2Jdy4Lu6M9tP/vYy8S+vHSuv5iQsiGJFAzJ8yIs5qBCy+KDLBCWKDzTBRDD9VyCXWIeidMg5HcLXpvA/OS2XrN1S+bhSqJdHccygNbSOishCe6iOjlADNRFBd+gRPaMX4954Ml6Nt2HrmDGaWUU/YLx/AKjUpjo=</latexit>
P[Y = 0|X = x,w] = 1� �(wTx) =exp(�wTx)
1 + exp(�wTx)
=1
1 + exp(wTx)<latexit sha1_base64="xnYqZTBi0eGmbMByz6xx1hQ4vUA=">AAACYHicdZFPTyIxGMY7o7si4op608ubJW7Y7Eo6SFAPJiRePGIiyoaZJZ3SwcbOn7QdgYx8SW8evPhJ7AAa2axv0vTJ73nftH3qJ4IrjfGTZa+sfvm6VlgvbpQ2v22Vt3euVZxKyjo0FrHs+kQxwSPW0VwL1k0kI6Ev2I1/d577N/dMKh5HV3qSMC8kw4gHnBJtUL88ckOib30/a097f+AMMDxA1+zj3zDyzO4cgqv4MCTV0d8rGP+EH2fgBpLQzGXjpHo4p9PMgV+wRFy3+N7qfPAXdr9cwTWMj+vNJuTipHHk5KKB8WkTHEPyqqBFtfvlR3cQ0zRkkaaCKNVzcKK9jEjNqWDTopsqlhB6R4asZ2REQqa8bBbQFA4MGUAQS7MiDTP6cSIjoVKT0DedeRzqXy+H//N6qQ5OvIxHSapZROcHBakAHUOeNgy4ZFSLiRGESm7uCvSWmEy0+ZOiCeHtpfC5uK7XnKNa/bJRadUXcRTQPvqOqshBx6iFLlAbdRBFz9aKVbI2rRe7YG/Z2/NW21rM7KKlsvdeAW75r0Y=</latexit>
Here x ∈ R^d, w ∈ R^d, y ∈ {0, 1}.
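A minimal NumPy sketch of the sigmoid (the function name and demo values are illustrative, not from the slides); splitting on the sign of z keeps exp from overflowing:

```python
import numpy as np

def sigmoid(z):
    """Numerically stable sigma(z) = 1 / (1 + exp(-z))."""
    z = np.atleast_1d(np.asarray(z, dtype=float))
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))  # exp of a non-positive number: safe
    ez = np.exp(z[~pos])                      # here z < 0, so exp cannot overflow
    out[~pos] = ez / (1.0 + ez)
    return out

print(sigmoid([-5.0, 0.0, 5.0]))  # ~[0.0067, 0.5, 0.9933]
```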
Sigmoid for binary classes
P(Y = 0|w, X) = 1 / (1 + exp(w_0 + Σ_k w_k X_k))
P(Y = 1|w, X) = 1 − P(Y = 0|w, X) = exp(w_0 + Σ_k w_k X_k) / (1 + exp(w_0 + Σ_k w_k X_k))
P(Y = 1|w, X) / P(Y = 0|w, X) = exp(w_0 + Σ_k w_k X_k)
Here w_0, w_1, …, w_d ∈ R, with w_0 the offset, and X_k ∈ R the k-th feature. If the magnitude of w_0 + Σ_k w_k X_k is large, the odds ratio is extremely large (or extremely small for large negative values).
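A tiny numeric illustration (the value 10 is made up): even a moderate log-odds already yields overwhelming odds.

```python
import numpy as np

z = 10.0                        # hypothetical value of w0 + sum_k w_k X_k
odds = np.exp(z)                # P(Y=1|w,X) / P(Y=0|w,X)
print(odds, odds / (1 + odds))  # ~22026.5 and P(Y=1|w,X) ~0.99995
```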
Sigmoid for binary classes (continued)
Taking the log of the odds ratio gives:
log [ P(Y = 1|w, X) / P(Y = 0|w, X) ] = w_0 + Σ_k w_k X_k
Linear Decision Rule!
The classifier f(x) = argmax_y P(y|x) predicts 1 when w_0 + Σ_k w_k x_k > 0 and 0 when it is < 0; at exactly 0, both predictions are equally good.
Logistic Regression – a Linear classifier
[Plot: sigmoid curve rising from 0 to 1 over w^T x ∈ [−6, 6]]
log [ P(Y = 1|w, X) / P(Y = 0|w, X) ] = w_0 + Σ_k w_k X_k
When w^T x is large, σ(w^T x) ≈ 1, so predict 1; when it is very negative, predict 0. The decision boundary w_0 + Σ_k w_k x_k = 0 is a hyperplane, so data that is not linearly separable needs a nonlinear classifier.
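A sketch of the linear decision rule with hypothetical weights: predict 1 exactly when the log-odds w_0 + Σ_k w_k x_k is positive.

```python
import numpy as np

def predict(X, w, w0):
    """Predict 1 where w0 + x^T w > 0, else 0 (row-wise)."""
    return (w0 + X @ w > 0).astype(int)

w, w0 = np.array([2.0, -1.0]), 0.5       # hypothetical parameters
X = np.array([[1.0, 1.0], [-1.0, 2.0]])  # two example points
print(predict(X, w, w0))                 # [1 0]
```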
Process
1. Decide on a model
2. Find the function which fits the data best: choose a loss function, then pick the function which minimizes loss on the data
3. Use the function to make predictions on new examples
Model: f(x) = σ(w^T x), with parameter w ∈ R^d.
Loss function: Conditional Likelihood
■ Have a bunch of iid data: {(x_i, y_i)}_{i=1}^n, x_i ∈ R^d, y_i ∈ {−1, 1}
(the {−1, 1} label encoding is used only for simplicity; w is the parameter we want to learn)
P(Y = 1 | x, w) = exp(w^T x) / (1 + exp(w^T x))
P(Y = −1 | x, w) = 1 / (1 + exp(w^T x))
■ This is equivalent to:
P(Y = y | x, w) = 1 / (1 + exp(−y w^T x))
■ So we can compute the maximum likelihood estimator:
ŵ_MLE = argmax_w Π_{i=1}^n P(y_i | x_i, w)
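A quick numeric spot-check (weights chosen arbitrarily) that the two-case definition collapses to the single formula under the {−1, +1} encoding:

```python
import numpy as np

w = np.array([0.3, -0.7])  # arbitrary weights for the check
x = np.array([1.0, 2.0])
z = w @ x

p_plus  = np.exp(z) / (1 + np.exp(z))  # P(Y = +1 | x, w)
p_minus = 1 / (1 + np.exp(z))          # P(Y = -1 | x, w)
for y, p in [(+1, p_plus), (-1, p_minus)]:
    assert np.isclose(p, 1 / (1 + np.exp(-y * z)))  # single-formula version
```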
Loss function: Conditional Likelihood
■ Have a bunch of iid data: {(x_i, y_i)}_{i=1}^n, x_i ∈ R^d, y_i ∈ {−1, 1}
P(Y = y | x, w) = 1 / (1 + exp(−y w^T x))
ŵ_MLE = argmax_w Π_{i=1}^n P(y_i | x_i, w) = argmin_w Σ_{i=1}^n log(1 + exp(−y_i x_i^T w))
(the two are equivalent because −log is monotonically decreasing)
Logistic loss (for classification): ℓ_i(w) = log(1 + exp(−y_i x_i^T w))
Squared error loss (for regression): ℓ_i(w) = (y_i − x_i^T w)^2 (MLE for Gaussian noise)
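The two per-example losses side by side, as a sketch; np.logaddexp(0, -m) computes log(1 + exp(-m)) without overflow.

```python
import numpy as np

def logistic_loss(w, X, y):
    """l_i(w) = log(1 + exp(-y_i x_i^T w)), one value per example."""
    return np.logaddexp(0.0, -y * (X @ w))

def squared_loss(w, X, y):
    """l_i(w) = (y_i - x_i^T w)^2, the MLE loss under Gaussian noise."""
    return (y - X @ w) ** 2
```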
Process
1. Decide on a model
2. Find the function which fits the data best: choose a loss function, then pick the function which minimizes loss on the data
3. Use the function to make predictions on new examples

What we really care about is the 0/1 loss, ℓ(f(x), y) = 1{f(x) ≠ y}, but it is hard to optimize. So for training we instead minimize the logistic loss, argmin_w (1/n) Σ_i log(1 + exp(−y_i x_i^T w)), as a surrogate, justified by the MLE principle.
Loss function: Conditional Likelihood
■ Have a bunch of iid data: {(x_i, y_i)}_{i=1}^n, x_i ∈ R^d, y_i ∈ {−1, 1}
P(Y = y | x, w) = 1 / (1 + exp(−y w^T x))
ŵ_MLE = argmax_w Π_{i=1}^n P(y_i | x_i, w) = argmin_w Σ_{i=1}^n log(1 + exp(−y_i x_i^T w)) = J(w)
What does J(w) look like? Is it convex?
Consider the per-example function g(z) = log(1 + exp(−z)) with z = y w^T x. It is convex, tends to 0 as z → ∞, and grows roughly linearly as z → −∞. Intuition: if y and w^T x have the same sign, the loss is small; if they have different signs, the loss is big.
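A small numeric check of both claims, as a sketch: a midpoint test of convexity for g(z) = log(1 + exp(−z)), and its decay to 0 for large positive margins.

```python
import numpy as np

g = lambda z: np.logaddexp(0.0, -z)  # g(z) = log(1 + exp(-z))

# Convexity spot-check: g((a+b)/2) <= (g(a) + g(b))/2 on random pairs.
rng = np.random.default_rng(0)
a, b = rng.normal(size=1000), rng.normal(size=1000)
assert np.all(g((a + b) / 2) <= (g(a) + g(b)) / 2 + 1e-12)

print(g(np.array([-5.0, 0.0, 5.0])))  # ~[5.007, 0.693, 0.007]: big for wrong sign, ~0 for right sign
```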
Loss function: Conditional Likelihood
■ Have a bunch of iid data: {(x_i, y_i)}_{i=1}^n, x_i ∈ R^d, y_i ∈ {−1, 1}
P(Y = y | x, w) = 1 / (1 + exp(−y w^T x))
ŵ_MLE = argmax_w Π_{i=1}^n P(y_i | x_i, w) = argmin_w Σ_{i=1}^n log(1 + exp(−y_i x_i^T w)) = J(w)
Good news: J(w) is a convex function of w, so there are no local-optima problems.
Bad news: there is no closed-form solution minimizing J(w).
Good news: convex functions are easy to optimize, e.g., by gradient descent.
When is this loss small? When every y_i x_i^T w is large and positive, i.e., each example is confidently and correctly classified.
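A minimal gradient-descent sketch for J(w); the step size and iteration count are arbitrary choices, not values from the lecture. It uses ∇J(w) = −Σ_i y_i σ(−y_i x_i^T w) x_i.

```python
import numpy as np

def grad_J(w, X, y):
    """Gradient of J(w) = sum_i log(1 + exp(-y_i x_i^T w))."""
    s = 1.0 / (1.0 + np.exp(y * (X @ w)))  # sigma(-y_i x_i^T w); overflow to inf just gives s = 0
    return -X.T @ (y * s)

def fit_logistic(X, y, step=0.1, iters=1000):
    """Plain gradient descent starting from w = 0."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w = w - step * grad_J(w, X, y)
    return w
```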
Overfitting and Linear Separability
The predicted label depends only on the direction of w: sign(x_i^T w) = sign(x_i^T (c w)) for any c > 0, but scaling w up makes the probabilities more extreme.
Large parameters → overfitting: as ‖w‖ grows (e.g., from 2 to 4 to 10), σ(w^T x) approaches a step function, so P(Y = 1|x) is pushed to 0 or 1 even where such confidence is often not accurate.
When the data is linearly separable, the MLE weights diverge: ‖w‖ → ∞.
Penalize high weights to prevent overfitting?
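A tiny demonstration of the blow-up on made-up 1-D separable data: scaling a separating weight by c > 1 strictly decreases J, so the minimizer runs off to infinity.

```python
import numpy as np

X = np.array([[-2.0], [-1.0], [1.0], [2.0]])  # linearly separable toy data
y = np.array([-1.0, -1.0, 1.0, 1.0])

J = lambda w: np.sum(np.logaddexp(0.0, -y * (X @ w)))
for c in [1.0, 10.0, 100.0]:
    print(c, J(np.array([c])))  # loss keeps shrinking toward 0 as c grows
```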
Regularized Conditional Log Likelihood
Add a penalty to avoid high weights/overfitting:
argmin_{w,b} Σ_{i=1}^n log(1 + exp(−y_i (x_i^T w + b))) + λ‖w‖_2^2
Be sure not to regularize the offset b!
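A sketch of one regularized gradient step (names and step size are illustrative); note the λ term touches only w, never the offset b.

```python
import numpy as np

def reg_step(w, b, X, y, lam, step=0.1):
    """One step on sum_i log(1 + exp(-y_i (x_i^T w + b))) + lam * ||w||_2^2."""
    s = 1.0 / (1.0 + np.exp(y * (X @ w + b)))  # sigma(-y_i (x_i^T w + b))
    grad_w = -X.T @ (y * s) + 2 * lam * w      # penalty gradient applies to w ...
    grad_b = -np.sum(y * s)                    # ... but not to the offset b
    return w - step * grad_w, b - step * grad_b
```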
[Plots: two sigmoid curves over w^T x ∈ [−6, 6], with and without regularization]
Here λ > 0 controls the regularization strength: if λ is small, ‖w‖ can stay large; if λ is large, ‖w‖ is forced toward 0. Other penalties (e.g., ‖w‖_1) can also be used.