Intro: Logistic Regression, Gradient Descent + SGD, AdaGrad
Machine Learning for Big Data, CSE547/STAT548, University of Washington
Emily Fox, January 7th, 2014
©Emily Fox 2014
Case Study 1: Estimating Click Probabilities
Ad Placement Strategies
n Companies bid on ad prices
n Which ad wins? (many simplifications here)
¨ Naively:
¨ But:
¨ Instead:
Key Task: Estimating Click Probabilities
n What is the probability that user i will click on ad j?
n Important not just for ads:
¨ Optimize search results
¨ Suggest news articles
¨ Recommend products
n Methods are much more general, useful for:
¨ Classification
¨ Regression
¨ Density estimation
Learning Problem for Click Prediction
n Prediction task:
n Features:
n Data:
¨ Batch:
¨ Online:
n Many approaches (e.g., logistic regression, SVMs, naïve Bayes, decision trees, boosting, …)
¨ Focus on logistic regression; captures main concepts, ideas generalize to other approaches
Logistic Regression
Logistic function (or Sigmoid):
n Learn P(Y|X) directly
¨ Assume a particular functional form
¨ Sigmoid applied to a linear function of the data:
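The functional form itself does not survive in the extracted text; a standard reconstruction, consistent with the conditional log likelihood used later in these slides, is:
P(Y=0 \mid x, w) = \frac{1}{1 + \exp\big( w_0 + \sum_{i=1}^d w_i x_i \big)}, \qquad P(Y=1 \mid x, w) = \frac{\exp\big( w_0 + \sum_{i=1}^d w_i x_i \big)}{1 + \exp\big( w_0 + \sum_{i=1}^d w_i x_i \big)}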
Features can be discrete or continuous!
Very convenient!
[Figure: sigmoid plot taking values between 0 and 1; implies linear classification rule!]
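A short sketch of why this implies a linear rule (reconstructed, not taken verbatim from the slide): predict Y = 0 whenever P(Y=0 \mid x, w) > P(Y=1 \mid x, w), and under the sigmoid form above
\frac{P(Y=0 \mid x, w)}{P(Y=1 \mid x, w)} = \exp\Big( -\big( w_0 + \sum_{i=1}^d w_i x_i \big) \Big) > 1 \iff w_0 + \sum_{i=1}^d w_i x_i < 0,
so the decision boundary w_0 + \sum_i w_i x_i = 0 is a hyperplane.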
Digression: Logistic regression more generally
n Logistic regression in the more general case, where Y ∈ {y_1,…,y_R}
for k<R
for k=R (normalization, so no weights for this class)
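The two equations are not in the extracted text; presumably they are the standard multiclass (softmax-style) form:
P(Y = y_k \mid X, w) = \frac{\exp\big( w_{k0} + \sum_{i=1}^d w_{ki} X_i \big)}{1 + \sum_{j=1}^{R-1} \exp\big( w_{j0} + \sum_{i=1}^d w_{ji} X_i \big)} \quad \text{for } k < R
P(Y = y_R \mid X, w) = \frac{1}{1 + \sum_{j=1}^{R-1} \exp\big( w_{j0} + \sum_{i=1}^d w_{ji} X_i \big)} \quad \text{for } k = R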
Features can be discrete or continuous!
Loss function: Conditional Likelihood
n Have a bunch of iid data of the form:
n Discriminative (logistic regression) loss function:
Conditional Data Likelihood
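The dataset and likelihood these bullets refer to are presumably the standard setup (a reconstruction):
D = \{(x_j, y_j)\}_{j=1}^N, \quad x_j \in \mathbb{R}^d,\ y_j \in \{0, 1\}, \qquad P(D_Y \mid D_X, w) = \prod_{j=1}^N P(y_j \mid x_j, w)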
Expressing Conditional Log Likelihood
\ell(w) = \sum_j y_j \ln P(Y=1 \mid x_j, w) + (1 - y_j) \ln P(Y=0 \mid x_j, w)
= \sum_j y_j \Big( w_0 + \sum_{i=1}^d w_i x_{ji} \Big) - \ln\Big( 1 + \exp\big( w_0 + \sum_{i=1}^d w_i x_{ji} \big) \Big)
Maximizing Conditional Log Likelihood
Good news: ℓ(w) is a concave function of w, so there are no local optima problems
Bad news: no closed-form solution to maximize ℓ(w)
Good news: concave functions are easy to optimize
\ell(w) = \sum_j y_j \Big( w_0 + \sum_{i=1}^d w_i x_{ji} \Big) - \ln\Big( 1 + \exp\big( w_0 + \sum_{i=1}^d w_i x_{ji} \big) \Big)
Optimizing concave function – Gradient ascent
n Conditional likelihood for Logistic Regression is concave. Find optimum with gradient ascent
n Gradient ascent is the simplest of optimization approaches
¨ e.g., Conjugate gradient ascent is much better (see reading)
Gradient:
Step size, η>0
Update rule:
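The gradient and update rule are not in the extracted text; the standard forms are presumably:
\nabla_w \ell(w) = \Big[ \frac{\partial \ell(w)}{\partial w_0}, \dots, \frac{\partial \ell(w)}{\partial w_d} \Big], \qquad w^{(t+1)} \leftarrow w^{(t)} + \eta\, \nabla_w \ell(w^{(t)})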
Gradient Ascent for LR
Gradient ascent algorithm: iterate until change < ε
For i = 1,…,d,
repeat
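The per-coordinate updates are not in the extracted text; a reconstruction consistent with the batch stochastic-gradient slide later in this deck (with no regularization yet):
w_0^{(t+1)} \leftarrow w_0^{(t)} + \eta \sum_j \big[ y_j - P(Y=1 \mid x_j, w^{(t)}) \big]
w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta \sum_j x_{ji} \big[ y_j - P(Y=1 \mid x_j, w^{(t)}) \big]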
Regularized Conditional Log Likelihood
n If data is linearly separable, weights go to infinity
n Leads to overfitting → Penalize large weights
n Add regularization penalty, e.g., L2:
n Practical note about w0:
\ell(w) = \ln \prod_j P(y_j \mid x_j, w) - \frac{\lambda}{2} \|w\|_2^2
Standard vs. Regularized Updates
n Maximum conditional likelihood estimate
n Regularized maximum conditional likelihood estimate
w^* = \arg\max_w\ \ln\Big[ \prod_j P(y_j \mid x_j, w) \Big] - \lambda \sum_{i>0} \frac{w_i^2}{2}
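The update rules being compared are not in the extracted text; reconstructed from the batch stochastic-gradient slide later in the deck, the regularized version only adds a -\eta \lambda w_i^{(t)} shrinkage term:
\text{MLE:}\quad w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta \sum_j x_{ji} \big[ y_j - P(Y=1 \mid x_j, w^{(t)}) \big]
\text{Regularized:}\quad w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta \Big\{ -\lambda w_i^{(t)} + \sum_j x_{ji} \big[ y_j - P(Y=1 \mid x_j, w^{(t)}) \big] \Big\}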
Stopping criterion
n Regularized logistic regression is strongly concave
¨ Negative second derivative bounded away from zero:
n Strong concavity (convexity) is super helpful!!
n For example, for strongly concave ℓ(w):
\ell(w) = \ln \prod_j P(y_j \mid x_j, w) - \frac{\lambda}{2} \|w\|_2^2
\ell(w^*) - \ell(w) \le \frac{1}{2\lambda} \|\nabla \ell(w)\|_2^2
Convergence rates for gradient descent/ascent
n Number of iterations to get to accuracy ϵ:
n If func is Lipschitz: O(1/ϵ²)
n If gradient of func is Lipschitz: O(1/ϵ)
n If func is strongly convex: O(ln(1/ϵ))
\ell(w^*) - \ell(w) \le \epsilon
Challenge 1: Complexity of computing gradients
n What’s the cost of a gradient update step for LR??? (Each step sums over all N data points, so it costs O(Nd) per update.)
Challenge 2: Data is streaming
n Assumption thus far: Batch data
n But, click prediction is a streaming data task:
¨ User enters query, and ad must be selected:
  n Observe x_j, and must predict y_j
¨ User either clicks or doesn’t click on ad:
  n Label y_j is revealed afterwards
¨ Google gets a reward if user clicks on ad
¨ Weights must be updated for next time:
Learning Problems as Expectations
n Minimizing loss in training data:
¨ Given dataset:
  n Sampled iid from some distribution p(x) on features:
¨ Loss function, e.g., hinge loss, logistic loss, …
¨ We often minimize loss in training data:
n However, we should really minimize expected loss on all data:
n So, we are approximating the integral by the average on the training data
\ell(w) = E_x[\ell(w, x)] = \int p(x)\, \ell(w, x)\, dx
\ell_D(w) = \frac{1}{N} \sum_{j=1}^N \ell(w, x_j)
Gradient Ascent in Terms of Expectations
n “True” objective function:
n Taking the gradient:
n “True” gradient ascent rule:
n How do we estimate expected gradient?
\ell(w) = E_x[\ell(w, x)] = \int p(x)\, \ell(w, x)\, dx
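The gradient and the “true” update rule referenced in the bullets are not in the extracted text; the natural forms, consistent with the SGD slide that follows, are presumably:
\nabla \ell(w) = E_x[\nabla \ell(w, x)] = \int p(x)\, \nabla \ell(w, x)\, dx, \qquad w^{(t+1)} \leftarrow w^{(t)} + \eta\, E_x[\nabla \ell(w^{(t)}, x)]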
SGD: Stochastic Gradient Ascent (or Descent)
n “True” gradient:
n Sample-based approximation:
n What if we estimate the gradient with just one sample???
¨ Unbiased estimate of gradient
¨ Very noisy!
¨ Called stochastic gradient ascent (or descent), among many other names
¨ VERY useful in practice!!!
\nabla \ell(w) = E_x[\nabla \ell(w, x)]
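The sample-based approximation referred to above is presumably (a reconstruction):
\nabla \ell(w) \approx \frac{1}{N} \sum_{j=1}^N \nabla \ell(w, x_j), \qquad \text{or, with a single sample, } \nabla \ell(w) \approx \nabla \ell(w, x_t)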
Stochastic Gradient Ascent: general case
n Given a stochastic function of parameters:
¨ Want to find maximum
n Start from w(0)
n Repeat until convergence:
¨ Get a sample data point x_t
¨ Update parameters (see the sketch below):
n Works in the online learning setting!
n Complexity of each gradient step is constant in number of examples!
n In general, step size changes with iterations
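A sketch of the update step referenced above (ascent form, with iteration-dependent step size \eta_t), consistent with the logistic regression version on the next slide:
w^{(t+1)} \leftarrow w^{(t)} + \eta_t\, \nabla_w \ell(w^{(t)}, x_t)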
Stochastic Gradient Ascent for Logistic Regression
n Logistic loss as a stochastic function:
n Batch gradient ascent updates:
n Stochastic gradient ascent updates:
¨ Online setting:
E_x[\ell(w, x)] = E_x\Big[ \ln P(y \mid x, w) - \frac{\lambda}{2} \|w\|_2^2 \Big]
w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta \Big\{ -\lambda w_i^{(t)} + \frac{1}{N} \sum_{j=1}^N x_i^{(j)} \big[ y^{(j)} - P(Y=1 \mid x^{(j)}, w^{(t)}) \big] \Big\}
w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta_t \Big\{ -\lambda w_i^{(t)} + x_i^{(t)} \big[ y^{(t)} - P(Y=1 \mid x^{(t)}, w^{(t)}) \big] \Big\}
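A minimal Python sketch of the stochastic update above (illustrative only; function and variable names, the data interface, and the step-size schedule are my own choices, not from the slides):

import numpy as np

def sigmoid(z):
    # P(Y=1 | x, w) when z = w0 + sum_i w_i * x_i
    return 1.0 / (1.0 + np.exp(-z))

def sgd_step(w, x, y, eta_t, lam):
    # One stochastic gradient ascent step on the regularized conditional
    # log likelihood, using a single example (x, y); w[0] is the intercept w0.
    z = w[0] + np.dot(w[1:], x)
    residual = y - sigmoid(z)               # y - P(Y=1 | x, w)
    w = w.copy()
    w[0] += eta_t * residual                # intercept typically left unregularized
    w[1:] += eta_t * (-lam * w[1:] + residual * x)
    return w

def run_sgd(stream, d, eta0=0.1, lam=0.01):
    # Online setting: observe examples one at a time, with a decaying step size.
    w = np.zeros(d + 1)
    for t, (x, y) in enumerate(stream, start=1):
        w = sgd_step(w, np.asarray(x, dtype=float), y, eta0 / np.sqrt(t), lam)
    return w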
Convergence rate of SGD
n Theorem:
¨ (see Nemirovski et al. ’09 from readings)
¨ Let f be a strongly convex stochastic function
¨ Assume gradient of f is Lipschitz continuous and bounded
¨ Then, for step sizes:
¨ The expected loss decreases as O(1/t):
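The step sizes and the bound itself are not in the extracted text; the usual statement of this kind of result (constants omitted, so treat this as a sketch) uses decaying steps and gives a 1/t rate:
\eta_t \propto \frac{1}{t}, \qquad E[f(w^{(t)})] - f(w^*) = O\Big(\frac{1}{t}\Big)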
Convergence rates for gradient descent/ascent versus SGD
n Number of iterations to get to accuracy ϵ:
n Gradient descent:
¨ If func is strongly convex: O(ln(1/ϵ)) iterations
n Stochastic gradient descent:
¨ If func is strongly convex: O(1/ϵ) iterations
n Seems exponentially worse, but much more subtle:
¨ Total running time, e.g., for logistic regression:
  n Gradient descent: O(Nd) per iteration × O(ln(1/ϵ)) iterations
  n SGD: O(d) per iteration × O(1/ϵ) iterations
  n SGD can win when we have a lot of data
¨ And, when analyzing true error, situation even more subtle… expected running time about the same, see readings
\ell(w^*) - \ell(w) \le \epsilon
Motivating AdaGrad (Duchi, Hazan, Singer 2011)
n Assuming w ∈ ℝ^d, standard stochastic (sub)gradient descent updates are of the form:
n Should all features share the same learning rate?
n Often have high-dimensional feature spaces
¨ Many features are irrelevant
¨ Rare features are often very informative
n AdaGrad provides a feature-specific adaptive learning rate by incorporating knowledge of the geometry of past observations
w_i^{(t+1)} \leftarrow w_i^{(t)} - \eta\, g_{t,i}
Why adapt to geometry?

y_t    φ_{t,1}   φ_{t,2}   φ_{t,3}
 1      1         0         0
-1      .5        0         1
 1     -.5        1         0
-1      0         0         0
 1      .5        0         0
-1      1         0         0
 1     -1         1         0
-1     -.5        0         1

φ_1: Frequent, irrelevant
φ_2: Infrequent, predictive
φ_3: Infrequent, predictive
Examples from Duchi et al. ISMP 2012 slides
Not All Features are Created Equal
n Examples:
Motivation
Text data: “The most unsung birthday in American business and technological history this year may be the 50th anniversary of the Xerox 914 photocopier.” (The Atlantic, July/August 2010)
High-dimensional image features
Other motivation: selecting advertisements in online advertising, document ranking, problems with parameterizations of many magnitudes...
Images from Duchi et al. ISMP 2012 slides
Projected Gradient
n Brief aside…
n Consider an arbitrary feature space w ∈ W
n If w ∈ W, can use projected gradient for (sub)gradient descent:
w_i^{(t+1)} \leftarrow w_i^{(t)} - \eta\, g_{t,i} \qquad \text{(unconstrained)}
w^{(t+1)} = \arg\min_{w \in \mathcal{W}} \|w - (w^{(t)} - \eta g_t)\|_2^2 \qquad \text{(projected onto } \mathcal{W}\text{)}
Regret Minimization
n How do we assess the performance of an online algorithm?
n Algorithm iteratively predicts w^{(t)}
n Incur loss f_t(w^{(t)})
n Regret:
R(T) = \sum_{t=1}^T f_t(w^{(t)}) - \inf_{w \in \mathcal{W}} \sum_{t=1}^T f_t(w)
What is the total incurred loss of the algorithm relative to the best choice of w that could have been made retrospectively?
Regret Bounds for Standard SGD
n Standard projected gradient stochastic updates:
n Standard regret bound:
w^{(t+1)} = \arg\min_{w \in \mathcal{W}} \|w - (w^{(t)} - \eta g_t)\|_2^2
\sum_{t=1}^T f_t(w^{(t)}) - f_t(w^*) \le \frac{1}{2\eta} \|w^{(1)} - w^*\|_2^2 + \frac{\eta}{2} \sum_{t=1}^T \|g_t\|_2^2
Projected Gradient using Mahalanobis
n Standard projected gradient stochastic updates:
n What if, instead of an L2 metric for projection, we considered the Mahalanobis norm defined by a matrix A?
w^{(t+1)} = \arg\min_{w \in \mathcal{W}} \|w - (w^{(t)} - \eta g_t)\|_2^2
w^{(t+1)} = \arg\min_{w \in \mathcal{W}} \|w - (w^{(t)} - \eta A^{-1} g_t)\|_A^2
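For reference (a standard definition, not spelled out in the extracted text): for a positive definite A, the Mahalanobis norm used above is
\|x\|_A^2 = x^\top A x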
Mahalanobis Regret Bounds
n What A to choose?
n Regret bound now:
n What if we minimize upper bound on regret w.r.t. A in hindsight?
w^{(t+1)} = \arg\min_{w \in \mathcal{W}} \|w - (w^{(t)} - \eta A^{-1} g_t)\|_A^2
\sum_{t=1}^T f_t(w^{(t)}) - f_t(w^*) \le \frac{1}{2\eta} \|w^{(1)} - w^*\|_A^2 + \frac{\eta}{2} \sum_{t=1}^T \|g_t\|_{A^{-1}}^2
\min_A \sum_{t=1}^T \langle g_t, A^{-1} g_t \rangle
Mahalanobis Regret Minimization
n Objective:
n Solution:
For proof, see Appendix E, Lemma 15 of Duchi et al. 2011. Uses “trace trick” and Lagrangian.
n A defines the norm of the metric space we should be operating in
\min_A \sum_{t=1}^T \langle g_t, A^{-1} g_t \rangle \quad \text{subject to } A \succeq 0,\ \mathrm{tr}(A) \le C
A = c \Big( \sum_{t=1}^T g_t g_t^T \Big)^{1/2}
AdaGrad Algorithm
n At time t, estimate the optimal (sub)gradient modification A by
n For d large, A_t is computationally intensive to compute. Instead, use only the diagonal, diag(A_t)
n Then, the algorithm is a simple modification of the normal updates:
A_t = \Big( \sum_{\tau=1}^t g_\tau g_\tau^T \Big)^{1/2}
w^{(t+1)} = \arg\min_{w \in \mathcal{W}} \|w - (w^{(t)} - \eta A^{-1} g_t)\|_A^2
w^{(t+1)} = \arg\min_{w \in \mathcal{W}} \|w - (w^{(t)} - \eta\, \mathrm{diag}(A_t)^{-1} g_t)\|_{\mathrm{diag}(A_t)}^2
AdaGrad in Euclidean Space
n For \mathcal{W} = \mathbb{R}^d, the update is
w_i^{(t+1)} \leftarrow w_i^{(t)} - \eta_{t,i}\, g_{t,i}
n For each feature dimension i,
\eta_{t,i} = \frac{\eta}{\sqrt{\sum_{\tau=1}^t g_{\tau,i}^2}}
n That is,
w_i^{(t+1)} \leftarrow w_i^{(t)} - \frac{\eta}{\sqrt{\sum_{\tau=1}^t g_{\tau,i}^2}}\, g_{t,i}
n Each feature dimension has its own learning rate!
¨ Adapts with t
¨ Takes geometry of the past observations into account
¨ Primary role of η is determining the rate the first time a feature is encountered
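A minimal Python sketch of the diagonal AdaGrad update above (illustrative only; names are my own, and the small eps added inside the square root to avoid division by zero before a feature is first seen is a common practical tweak, not from the slides):

import numpy as np

def adagrad_step(w, g, G, eta=0.1, eps=1e-8):
    # One diagonal AdaGrad (descent) step for W = R^d.
    # w: parameters, g: current (sub)gradient, G: running sum of squared gradients.
    G = G + g * g                            # accumulate g_{tau,i}^2 per coordinate
    w = w - eta * g / (np.sqrt(G) + eps)     # feature-specific learning rate
    return w, G

# Usage sketch with stand-in gradients (replace with real subgradients of f_t):
d = 5
w, G = np.zeros(d), np.zeros(d)
for g in np.random.randn(100, d):
    w, G = adagrad_step(w, g, G)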
AdaGrad Theoretical Guarantees
n AdaGrad regret bound:
n So, what does this mean in practice?
n Many cool examples. This really is used in practice!
n Let’s just examine one…
\sum_{t=1}^T f_t(w^{(t)}) - f_t(w^*) \le 2 R_\infty \sum_{i=1}^d \|g_{1:T,i}\|_2, \qquad R_\infty := \max_t \|w^{(t)} - w^*\|_\infty
AdaGrad Theoretical Example
n Expect AdaGrad to outperform when gradient vectors are sparse
n SVM hinge loss example:
f_t(w) = [1 - y_t \langle x^t, w \rangle]_+ \qquad \text{where } x^t \in \{-1, 0, 1\}^d
n If x_j^t \neq 0 with probability \propto j^{-\alpha}, \alpha > 1, AdaGrad achieves
E\Big[ f\Big( \frac{1}{T} \sum_{t=1}^T w^{(t)} \Big) \Big] - f(w^*) = O\Big( \frac{\|w^*\|_\infty}{\sqrt{T}} \cdot \max\{\log d,\ d^{1-\alpha/2}\} \Big)
n Previously best known method:
E\Big[ f\Big( \frac{1}{T} \sum_{t=1}^T w^{(t)} \Big) \Big] - f(w^*) = O\Big( \frac{\|w^*\|_\infty}{\sqrt{T}} \cdot \sqrt{d} \Big)
Neural Network Learning
[Figure (Dean et al. 2012): Accuracy on Test Set, average frame accuracy (%) vs. time (hours), for SGD (GPU), Downpour SGD, Downpour SGD w/Adagrad, and Sandblaster L-BFGS]
Distributed, d = 1.7 · 10^9 parameters. SGD and AdaGrad use 80 machines (1,000 cores), L-BFGS uses 800 (10,000 cores).
Wildly non-convex problem:
f(x; \xi) = \log\big( 1 + \exp\big( \langle [\, p(\langle x_1, \xi_1 \rangle) \cdots p(\langle x_k, \xi_k \rangle) \,], \xi_0 \rangle \big) \big)
where p(\alpha) = \frac{1}{1 + \exp(\alpha)}
[Network diagram: \xi_1, \dots, \xi_5, x_1, \dots, x_5, units p(\langle x_1, \xi_1 \rangle), \dots]
Idea: Use stochastic gradient methods to solve it anyway
n Very non-convex problem, but use SGD methods anyway
Images from Duchi et al. ISMP 2012 slides
What you should know about Logistic Regression (LR) and Click Prediction
n Click prediction problem:
¨ Estimate probability of clicking
¨ Can be modeled as logistic regression
n Logistic regression model: Linear model
n Gradient ascent to optimize conditional likelihood
n Overfitting + regularization
n Regularized optimization
¨ Convergence rates and stopping criterion
n Stochastic gradient ascent for large/streaming data
¨ Convergence rates of SGD
n AdaGrad motivation, derivation, and algorithm