Lecture 1: Supervised Learning
Tuo Zhao
Schools of ISYE and CSE, Georgia Tech
ISYE6740/CSE6740/CS7641: Computational Data Analysis/Machine Learning
(Supervised) Regression Analysis
Example: living areas and prices of 47 houses:
From the CS229 lecture notes (Andrew Ng), "Supervised learning":
Let's start by talking about a few examples of supervised learning problems. Suppose we have a dataset giving the living areas and prices of 47 houses from Portland, Oregon:
Living area (feet²)    Price (1000$s)
2104                   400
1600                   330
2400                   369
1416                   232
3000                   540
...
We can plot this data:
[Scatter plot of housing prices: square feet (500 to 5000) on the x-axis, price in $1000 (0 to 1000) on the y-axis.]
Given data like this, how can we learn to predict the prices of other houses in Portland, as a function of the size of their living areas?
Given $x_1, \ldots, x_n \in \mathbb{R}^d$, $y_1, \ldots, y_n \in \mathbb{R}$, and $f^*: \mathbb{R}^d \to \mathbb{R}$,
$$y_i = f^*(x_i) + \epsilon_i \quad \text{for } i = 1, \ldots, n,$$
where the $\epsilon_i$'s are i.i.d. with $\mathbb{E}[\epsilon_i] = 0$ and $\mathbb{E}[\epsilon_i^2] = \sigma^2 < \infty$.
Simple linear function: $f^*(x_i) = x_i^\top \theta^*$.
Why is it called supervised learning?
Why Supervised?
Play on Words?
Two unknown functions $f_0^*, f_1^*: \mathbb{R}^d \to \mathbb{R}$?
$$y_i = \mathbb{1}(z_i = 1) \cdot f_1^*(x_i) + \mathbb{1}(z_i = 0) \cdot f_0^*(x_i) + \epsilon_i,$$
where $i = 1, \ldots, n$, and the $z_i$'s are i.i.d. with
$$\mathbb{P}(z_i = 1) = \delta \quad \text{and} \quad \mathbb{P}(z_i = 0) = 1 - \delta \quad \text{for } \delta \in (0, 1).$$
The $z_i$'s are latent variables. Supervised? Unsupervised?
Linear Regression
Given $x_1, \ldots, x_n \in \mathbb{R}^d$, $y_1, \ldots, y_n \in \mathbb{R}$, and $\theta^* \in \mathbb{R}^d$,
$$y_i = x_i^\top \theta^* + \epsilon_i \quad \text{for } i = 1, \ldots, n,$$
where the $\epsilon_i$'s are i.i.d. with $\mathbb{E}[\epsilon_i] = 0$ and $\mathbb{E}[\epsilon_i^2] = \sigma^2 < \infty$.
Ordinary Least Squares (OLS) Regression:
$$\hat{\theta}_{\mathrm{OLS}} = \arg\min_{\theta} \frac{1}{2n}\sum_{i=1}^{n}(y_i - x_i^\top \theta)^2.$$
Least Absolute Deviation (LAD) Regression:
$$\hat{\theta}_{\mathrm{LAD}} = \arg\min_{\theta} \frac{1}{n}\sum_{i=1}^{n}|y_i - x_i^\top \theta|.$$
Robust Regression
Linear Regression — Matrix Notation
$X = [x_1^\top, \ldots, x_n^\top]^\top \in \mathbb{R}^{n \times d}$, $y = [y_1, \ldots, y_n]^\top \in \mathbb{R}^n$,
$$y = X\theta^* + \epsilon,$$
where $\mathbb{E}[\epsilon] = 0$ and $\mathbb{E}[\epsilon\epsilon^\top] = \sigma^2 I_n$.
Ordinary Least Squares Regression:
$$\hat{\theta}_{\mathrm{OLS}} = \arg\min_{\theta} \frac{1}{2n}\|y - X\theta\|_2^2.$$
Least Absolute Deviation Regression:
$$\hat{\theta}_{\mathrm{LAD}} = \arg\min_{\theta} \frac{1}{n}\|y - X\theta\|_1.$$
Least Squares Regression — Analytical Solution
Ordinary Least Squares Regression:
$$\hat{\theta}_{\mathrm{OLS}} = \arg\min_{\theta} \underbrace{\frac{1}{2n}\|y - X\theta\|_2^2}_{\mathcal{L}(\theta)}.$$
First-order optimality condition:
$$\nabla\mathcal{L}(\theta) = \frac{1}{n}X^\top(X\theta - y) = 0 \;\Rightarrow\; X^\top X\theta = X^\top y.$$
Analytical solution and unbiasedness:
$$\hat{\theta} = (X^\top X)^{-1}X^\top y = (X^\top X)^{-1}X^\top(X\theta^* + \epsilon) = \theta^* + (X^\top X)^{-1}X^\top\epsilon \;\Rightarrow\; \mathbb{E}_{\epsilon}[\hat{\theta}] = \theta^*.$$
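As a quick illustration, here is a minimal NumPy sketch (the synthetic data, dimensions, and noise level are assumptions for illustration) that solves the normal equations and checks that $\hat{\theta}$ lands close to $\theta^*$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 2000, 5, 0.5                      # assumed sizes and noise level
theta_star = rng.normal(size=d)                 # "true" parameter
X = rng.normal(size=(n, d))                     # design matrix
y = X @ theta_star + sigma * rng.normal(size=n)

# Closed-form OLS: solve the normal equations X^T X theta = X^T y.
# Solving the linear system is preferred to forming (X^T X)^{-1} explicitly.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

print(np.linalg.norm(theta_hat - theta_star))   # small for large n (unbiased estimator)
```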
Least Squares Regression — Convexity
Ordinary Least Squares Regression:
$$\hat{\theta}_{\mathrm{OLS}} = \arg\min_{\theta} \underbrace{\frac{1}{2n}\|y - X\theta\|_2^2}_{\mathcal{L}(\theta)}.$$
Second-order optimality condition:
$$\nabla^2\mathcal{L}(\theta) = \frac{1}{n}X^\top X \succeq 0.$$
Convexity:
$$\mathcal{L}(\theta') \geq \mathcal{L}(\theta) + \nabla\mathcal{L}(\theta)^\top(\theta' - \theta).$$
Convex vs. Nonconvex Optimization
Stationary solutions: $\nabla\mathcal{L}(\theta) = 0$.
[Figure: a convex objective has a single global optimum (easy but restrictive); a nonconvex objective can have global optima, local optima, local maxima, and saddle points (difficult but flexible).]
We may get stuck at a local optimum or saddle point for nonconvex optimization.
Maximum Likelihood Estimation
$X = [x_1^\top, \ldots, x_n^\top]^\top \in \mathbb{R}^{n \times d}$, $y = [y_1, \ldots, y_n]^\top \in \mathbb{R}^n$,
$$y = X\theta^* + \epsilon, \quad \text{where } \epsilon \sim N(0, \sigma^2 I_n).$$
Likelihood function:
$$L(\theta) = (2\pi\sigma^2)^{-\frac{n}{2}}\exp\left(-\frac{1}{2\sigma^2}(y - X\theta)^\top(y - X\theta)\right).$$
Maximum log-likelihood estimation:
$$\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta}\; \log L(\theta) = \arg\max_{\theta}\; -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\|y - X\theta\|_2^2.$$
Maximum Likelihood Estimation
Maximum log-likelihood estimation:
$$\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta}\; -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\|y - X\theta\|_2^2.$$
Given $\sigma^2$ as some unknown constant,
$$\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta}\; -\frac{1}{2n}\|y - X\theta\|_2^2 = \arg\min_{\theta}\; \frac{1}{2n}\|y - X\theta\|_2^2.$$
Probabilistic interpretation:
Simple and illustrative.
Restrictive and potentially misleading.
Remember the t-test? What if the model is wrong?
Computational Cost of OLS
The number of basic operations, e.g., addition, subtraction, multiplication, division.
Matrix multiplication $X^\top X$: $O(nd^2)$
Matrix inverse $(X^\top X)^{-1}$: $O(d^3)$
Matrix-vector multiplication $X^\top y$: $O(nd)$
Matrix-vector multiplication $[(X^\top X)^{-1}][X^\top y]$: $O(d^2)$
Overall computational cost: $O(nd^2)$, given $n \gg d$.
Scalability and Efficiency of OLS
Simple closed form.
Overall computational cost: $O(nd^2)$.
Massive data: both $n$ and $d$ are large.
Not very efficient or scalable.
Better ways to improve the computation?
Optimization for Linear Regression
Vanilla Gradient Descent
$$\theta^{(k+1)} = \theta^{(k)} - \eta_k\nabla\mathcal{L}(\theta^{(k)}).$$
$\eta_k > 0$ is the step size parameter (fixed or chosen by line search).
Stop when the gradient is small: $\|\nabla\mathcal{L}(\theta^{(K)})\|_2 \leq \delta$.
[Figure: one gradient descent step on $f(\theta)$, moving from $\theta^{(k)}$ along $-\nabla f(\theta^{(k)})$ toward the minimizer $\hat{\theta}$, where $\nabla f(\hat{\theta}) = 0$.]
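Below is a minimal sketch of vanilla gradient descent for the OLS objective; the synthetic data, fixed step size, and stopping tolerance are assumptions for illustration:

```python
import numpy as np

def gradient_descent_ols(X, y, eta, tol=1e-6, max_iter=10000):
    """Vanilla gradient descent for L(theta) = ||y - X theta||^2 / (2n)."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(max_iter):
        grad = X.T @ (X @ theta - y) / n        # O(nd) per iteration
        if np.linalg.norm(grad) <= tol:         # stop when the gradient is small
            break
        theta = theta - eta * grad
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X @ np.ones(5) + 0.1 * rng.normal(size=500)
# A safe fixed step size is 1/L with L = lambda_max(X^T X / n).
L = np.linalg.eigvalsh(X.T @ X / X.shape[0]).max()
print(gradient_descent_ols(X, y, eta=1.0 / L))
```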
Computational Cost of VGD
Gradient: $\nabla\mathcal{L}(\theta^{(k)}) = \frac{1}{n}X^\top(X\theta^{(k)} - y)$.
Matrix-vector multiplication $X\theta^{(k)}$: $O(nd)$
Vector subtraction $X\theta^{(k)} - y$: $O(n)$
Matrix-vector multiplication $X^\top(X\theta^{(k)} - y)$: $O(nd)$
Overall computational cost per iteration: $O(nd)$.
Better than $O(nd^2)$, but how many iterations?
Rate of Convergence
What are good algorithms?
Asymptotic convergence: $\theta^{(k)} \to \hat{\theta}$ as $k \to \infty$?
Nonasymptotic rate of convergence: the optimization error after $k$ iterations.
Example (gap in objective value, sublinear convergence):
$$f(\theta^{(k)}) - f(\hat{\theta}) = O(L/k^2) \quad \text{vs.} \quad O(L/k),$$
where $L$ is some constant depending on the problem.
Example (gap in parameter, linear convergence):
$$\|\theta^{(k)} - \hat{\theta}\|_2^2 = O\big((1 - 1/\kappa)^k\big) \quad \text{vs.} \quad O\big((1 - 1/\sqrt{\kappa})^k\big),$$
where $\kappa$ is some constant depending on the problem.
Iteration Complexity of Gradient Descent
Iteration complexity: we need at most
$$K = O\left(\kappa\log\left(\frac{1}{\epsilon}\right)\right)$$
iterations such that
$$\|\theta^{(K)} - \hat{\theta}\|_2^2 \leq \epsilon,$$
where $\kappa$ is some constant depending on the problem.
What is $\kappa$? It is related to smoothness and convexity.
Strong Convexity
There exists a constant $\mu$ such that for any $\theta$ and $\theta'$, we have
$$\mathcal{L}(\theta') \geq \mathcal{L}(\theta) + \nabla\mathcal{L}(\theta)^\top(\theta' - \theta) + \frac{\mu}{2}\|\theta' - \theta\|_2^2.$$
[Figure: $\mathcal{L}(\theta')$ lies above the quadratic lower bound $\mathcal{L}(\theta) + \nabla\mathcal{L}(\theta)^\top(\theta' - \theta) + \frac{\mu}{2}\|\theta' - \theta\|_2^2$, which lies above the linear lower bound $\mathcal{L}(\theta) + \nabla\mathcal{L}(\theta)^\top(\theta' - \theta)$.]
Strong Smoothness
There exists a constant $L$ such that for any $\theta$ and $\theta'$, we have
$$\mathcal{L}(\theta') \leq \mathcal{L}(\theta) + \nabla\mathcal{L}(\theta)^\top(\theta' - \theta) + \frac{L}{2}\|\theta' - \theta\|_2^2.$$
[Figure: $\mathcal{L}(\theta')$ lies below the quadratic upper bound $\mathcal{L}(\theta) + \nabla\mathcal{L}(\theta)^\top(\theta' - \theta) + \frac{L}{2}\|\theta' - \theta\|_2^2$ and above the linear lower bound $\mathcal{L}(\theta) + \nabla\mathcal{L}(\theta)^\top(\theta' - \theta)$.]
Condition Number κ = L/µ
[Figure: contour plots of $f(\theta) = 0.9\,\theta_1^2 + 0.1\,\theta_2^2$ (ill-conditioned) and $f(\theta) = 0.5\,\theta_1^2 + 0.5\,\theta_2^2$ (well-conditioned).]
Vector Field Representation
[Figure: gradient vector fields of $f(\theta) = 0.9\,\theta_1^2 + 0.1\,\theta_2^2$ and $f(\theta) = 0.5\,\theta_1^2 + 0.5\,\theta_2^2$.]
Understanding Regularity Conditions
Mean value theorem: there exists a constant $z \in [0, 1]$ such that for any $\theta$ and $\theta'$, we have
$$\mathcal{L}(\theta') - \mathcal{L}(\theta) - \nabla\mathcal{L}(\theta)^\top(\theta' - \theta) = \frac{1}{2}(\theta' - \theta)^\top\nabla^2\mathcal{L}(\bar{\theta})(\theta' - \theta),$$
where $\bar{\theta}$ is a convex combination: $\bar{\theta} = z\theta + (1 - z)\theta'$.
Hessian matrix for OLS:
$$\nabla^2\mathcal{L}(\theta) = \frac{1}{n}X^\top X.$$
Control the remainder:
$$\underbrace{\Lambda_{\min}\!\left(\tfrac{1}{n}X^\top X\right)}_{\mu} \;\leq\; \frac{(\theta' - \theta)^\top\nabla^2\mathcal{L}(\bar{\theta})(\theta' - \theta)}{\|\theta' - \theta\|_2^2} \;\leq\; \underbrace{\Lambda_{\max}\!\left(\tfrac{1}{n}X^\top X\right)}_{L}.$$
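For OLS, the constants $\mu$, $L$, and $\kappa = L/\mu$ can therefore be read off the eigenvalues of $\frac{1}{n}X^\top X$; a minimal NumPy sketch with an assumed synthetic design:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10
X = rng.normal(size=(n, d))

H = X.T @ X / n                       # Hessian of the OLS loss (constant in theta)
eigvals = np.linalg.eigvalsh(H)       # eigenvalues in ascending order
mu, L = eigvals[0], eigvals[-1]       # strong convexity and smoothness constants
print("mu =", mu, "L =", L, "condition number kappa =", L / mu)
```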
Understanding Gradient Descent Algorithms
Iteratively minimize a quadratic approximation: at the (k+1)-th iteration, we consider
$$Q(\theta; \theta^{(k)}) = \mathcal{L}(\theta^{(k)}) + \nabla\mathcal{L}(\theta^{(k)})^\top(\theta - \theta^{(k)}) + \frac{L}{2}\|\theta - \theta^{(k)}\|_2^2.$$
We have
$$Q(\theta; \theta^{(k)}) \geq \mathcal{L}(\theta) \quad \text{and} \quad Q(\theta^{(k)}; \theta^{(k)}) = \mathcal{L}(\theta^{(k)}).$$
We take
$$\theta^{(k+1)} = \arg\min_{\theta} Q(\theta; \theta^{(k)}) = \theta^{(k)} - \frac{1}{L}\nabla\mathcal{L}(\theta^{(k)}).$$
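To see where the update comes from, set the gradient of $Q(\cdot\,;\theta^{(k)})$ with respect to $\theta$ to zero:
$$\nabla_\theta Q(\theta; \theta^{(k)}) = \nabla\mathcal{L}(\theta^{(k)}) + L\big(\theta - \theta^{(k)}\big) = 0 \;\Rightarrow\; \theta = \theta^{(k)} - \frac{1}{L}\nabla\mathcal{L}(\theta^{(k)}),$$
which is exactly a gradient step with fixed step size $\eta = 1/L$.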
Backtracking Line Search
The worst case is a fixed step size: $\eta_k = 1/L$.
At the (k+1)-th iteration, we first try $\eta_k = \eta_{k-1}$, i.e.,
$$\theta^{(k+1)} = \theta^{(k)} - \eta_k \nabla\mathcal{L}(\theta^{(k)}) \quad \text{if } Q_{\eta_k}(\theta^{(k+1)}; \theta^{(k)}) \geq \mathcal{L}(\theta^{(k+1)}),$$
where $Q_{\eta_k}$ is the quadratic approximation above with $1/\eta_k$ in place of $L$. Otherwise, we take
$$\eta_k = (1 - \delta)^m \eta_{k-1},$$
where $\delta > 0$ and $m$ is the smallest positive integer such that $Q_{\eta_k}(\theta^{(k+1)}; \theta^{(k)}) \geq \mathcal{L}(\theta^{(k+1)})$.
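A minimal sketch of one backtracking gradient step for the OLS loss; the helper names, the shrinkage factor, and the use of the simplified bound $Q_{\eta}(\theta^+;\theta) = \mathcal{L}(\theta) - \frac{\eta}{2}\|\nabla\mathcal{L}(\theta)\|_2^2$ for a gradient step are assumptions for illustration:

```python
import numpy as np

def ols_loss(theta, X, y):
    n = X.shape[0]
    return np.sum((y - X @ theta) ** 2) / (2 * n)

def backtracking_step(theta, X, y, eta, shrink=0.75):
    """One gradient step with backtracking: shrink eta until the quadratic
    upper bound at the candidate point dominates the actual loss."""
    n = X.shape[0]
    grad = X.T @ (X @ theta - y) / n
    loss_theta = ols_loss(theta, X, y)
    gnorm2 = float(grad @ grad)
    while True:
        candidate = theta - eta * grad
        # For a gradient step, Q_eta(candidate; theta) = L(theta) - (eta/2)*||grad||^2.
        if ols_loss(candidate, X, y) <= loss_theta - 0.5 * eta * gnorm2:
            return candidate, eta              # accept the step, reuse eta next time
        eta *= shrink                          # eta_k = (1 - delta)^m * eta_{k-1}
```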
Backtracking Line Search
[Figure: backtracking at $\theta^{(k)}$, with candidate step sizes $\eta_k = \eta_{k-1}$, $\eta_k = 0.75 \times \eta_{k-1}$, and $\eta_k = 0.75^2 \times \eta_{k-1}$.]
Trading Off Statistics and Computation
High Precision or Low Precision
Can we tolerate a large ε?
From a learning perspective, our interest is $\theta^*$, not $\hat{\theta}$.
Error decomposition:
$$\|\theta^{(K)} - \theta^*\|_2 \leq \underbrace{\|\theta^{(K)} - \hat{\theta}\|_2}_{\text{Opt. Error}} + \underbrace{\|\hat{\theta} - \theta^*\|_2}_{\text{Stat. Error}}.$$
High precision expects something like $\|\theta^{(K)} - \hat{\theta}\|_2 \approx 10^{-10}$.
Does it make any difference?
High Precision or Low Precision
The statistical error is undefeatable!
[Figure, excerpted from a paper on the high-dimensional noisy Lasso: plots of the optimization error $\log(\|\beta^t - \hat{\beta}\|_2)$ and statistical error $\log(\|\beta^t - \beta^*\|_2)$ versus iteration number $t$, generated by running projected gradient descent on the nonconvex objective from 10 different starting points. The optimization error decreases geometrically until it reaches the statistical error floor.]
Trading Off Statistical and Optimization Errors
The statistical error is undefeatable!
The statistical error of the optimal solution:
$$\mathbb{E}\|\hat{\theta} - \theta^*\|_2^2 = \mathbb{E}\|(X^\top X)^{-1}X^\top\epsilon\|_2^2 = \sigma^2\,\mathrm{tr}[(X^\top X)^{-1}] = O\!\left(\frac{\sigma^2 d}{n}\right).$$
We only need
$$\|\theta^{(K)} - \hat{\theta}\|_2 \lesssim \|\hat{\theta} - \theta^*\|_2.$$
Given $K = O\left(\kappa\log\left(\frac{n}{\sigma^2 d}\right)\right)$, we have
$$\mathbb{E}\|\theta^{(K)} - \theta^*\|_2^2 = O\!\left(\frac{\sigma^2 d}{n}\right).$$
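The middle equality above follows from a short calculation, sketched here under the assumption of a well-conditioned design with $X^\top X \approx n\Sigma$:
$$\mathbb{E}\|(X^\top X)^{-1}X^\top\epsilon\|_2^2 = \mathbb{E}\big[\epsilon^\top X(X^\top X)^{-2}X^\top\epsilon\big] = \sigma^2\,\mathrm{tr}\big(X(X^\top X)^{-2}X^\top\big) = \sigma^2\,\mathrm{tr}\big[(X^\top X)^{-1}\big] \approx \frac{\sigma^2\,\mathrm{tr}(\Sigma^{-1})}{n} = O\!\left(\frac{\sigma^2 d}{n}\right),$$
using $\mathbb{E}[\epsilon^\top A\epsilon] = \sigma^2\,\mathrm{tr}(A)$ and the cyclic property of the trace.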
Agnostic Learning
All models are wrong, but some are useful!
Data generating process: $(X, Y) \sim \mathcal{D}$.
The oracle model: $f_{\mathrm{oracle}}(X) = X^\top\theta_{\mathrm{oracle}}$, where
$$\theta_{\mathrm{oracle}} = \arg\min_{\theta} \mathbb{E}_{\mathcal{D}}(Y - X^\top\theta)^2.$$
The estimated model: $\hat{f}(X) = X^\top\hat{\theta}$, where
$$\hat{\theta} = \arg\min_{\theta} \frac{1}{2n}\|y - X\theta\|_2^2,$$
and $(x_1, y_1), \ldots, (x_n, y_n) \sim \mathcal{D}$.
At the K-th iteration: $f^{(K)}(X) = X^\top\theta^{(K)}$.
Agnostic Learning (See more details in CS-7545)
All models are wrong, but some are useful!
[Figure: the class of all linear models contains the "oracle" model and the "estimated" model, while the "true" model may lie outside it.]
Agnostic Learning
"Approximation" error: $\mathbb{E}_{\mathcal{D}}(Y - f_{\mathrm{oracle}}(X))^2$
"Estimation" error: $\mathbb{E}_{\mathcal{D}}(\hat{f}(X) - f_{\mathrm{oracle}}(X))^2$
Optimization error: $\mathbb{E}_{\mathcal{D}}(f^{(K)}(X) - \hat{f}(X))^2$
Decomposition of the statistical error:
$$\mathbb{E}_{\mathcal{D}}(Y - f^{(K)}(X))^2 \leq \mathbb{E}_{\mathcal{D}}(Y - f_{\mathrm{oracle}}(X))^2 + \mathbb{E}_{\mathcal{D}}(\hat{f}(X) - f_{\mathrm{oracle}}(X))^2 + \mathbb{E}_{\mathcal{D}}(f^{(K)}(X) - \hat{f}(X))^2.$$
How should we choose $\epsilon$?
Scalable Computation of Linear Regression
Stochastic Approximation
What if n is too large?
Empirical risk minimization: $\mathcal{L}(\theta) = \frac{1}{n}\sum_{i=1}^{n}\ell_i(\theta)$.
For least squares regression:
$$\ell_i(\theta) = \frac{1}{2}(y_i - x_i^\top\theta)^2 \quad \text{or} \quad \ell_i(\theta) = \frac{1}{2|\mathcal{M}_i|}\sum_{j \in \mathcal{M}_i}(y_j - x_j^\top\theta)^2.$$
Randomly sample $i$ from $1, \ldots, n$ with equal probability; then
$$\mathbb{E}_i[\nabla\ell_i(\theta)] = \nabla\mathcal{L}(\theta) \quad \text{and} \quad \mathbb{E}\|\nabla\ell_i(\theta) - \nabla\mathcal{L}(\theta)\|_2^2 \leq M^2.$$
Stochastic gradient (SG): replace $\nabla\mathcal{L}(\theta)$ with $\nabla\ell_{i_k}(\theta)$,
$$\theta^{(k+1)} = \theta^{(k)} - \eta_k\nabla\ell_{i_k}(\theta^{(k)}).$$
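A minimal sketch of SGD for the least squares objective, using a decreasing step size of roughly $1/(\mu k)$ (discussed on the following slides) capped for numerical stability; the helper name, synthetic data, and the cap are assumptions for illustration:

```python
import numpy as np

def sgd_least_squares(X, y, mu, n_iter=50000, seed=0):
    """SGD with decreasing step size ~ 1/(mu*k); returns the averaged iterate."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    L_max = np.max(np.sum(X * X, axis=1))          # largest per-sample smoothness
    theta = np.zeros(d)
    theta_bar = np.zeros(d)
    for k in range(n_iter):
        i = rng.integers(n)                        # sample one index uniformly
        grad_i = (X[i] @ theta - y[i]) * X[i]      # gradient of l_i(theta)
        eta = min(0.5 / L_max, 1.0 / (mu * (k + 1)))  # capped decreasing step size
        theta = theta - eta * grad_i
        theta_bar += (theta - theta_bar) / (k + 1) # running average of iterates
    return theta_bar

rng = np.random.default_rng(1)
X = rng.normal(size=(10000, 5))
y = X @ np.ones(5) + 0.1 * rng.normal(size=10000)
mu = np.linalg.eigvalsh(X.T @ X / X.shape[0]).min()
print(sgd_least_squares(X, y, mu))
```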
Why Stochastic Gradient?
Perturbed Descent Directions
Convergence of Stochastic Gradient Algorithms
How many iterations do we need?
A sequence of decreasing step size parameters: $\eta_k \asymp \frac{1}{k\mu}$.
Given a pre-specified error $\epsilon$, we need
$$K = O\left(\frac{M^2 + L^2}{\mu^2\epsilon}\right)$$
iterations such that
$$\mathbb{E}\|\bar{\theta}^{(K)} - \hat{\theta}\|_2^2 \leq \epsilon, \quad \text{where } \bar{\theta}^{(K)} = \frac{1}{K}\sum_{k=1}^{K}\theta^{(k)}.$$
When $\mu^2\epsilon n \gg M^2 + L^2$, i.e., when $n$ is super large,
$$O\left(\frac{d(M^2 + L^2)}{\mu^2\epsilon}\right) \quad \text{vs.} \quad O(\kappa nd) \quad \text{vs.} \quad O(nd^2).$$
Why decreasing step size?
Control Variance + Sufficient Descent ⇒ Convergence
Intuition:
$$\theta^{(k+1)} = \theta^{(k)} - \underbrace{\eta\,\nabla\mathcal{L}(\theta^{(k)})}_{\text{Descent}} + \underbrace{\eta\big(\nabla\mathcal{L}(\theta^{(k)}) - \nabla\ell_i(\theta^{(k)})\big)}_{\text{Error}}.$$
Not summable (sufficient exploration):
$$\sum_{k=1}^{\infty}\eta_k = \infty.$$
Square summable (diminishing variance):
$$\sum_{k=1}^{\infty}\eta_k^2 < \infty.$$
Minibatch Variance Reduction
Mini-batch SGD: if $|\mathcal{M}_i| \uparrow$, then $M^2 \downarrow$.
$|\mathcal{M}_i| \uparrow$ means more computational cost per iteration.
$M^2 \downarrow$ means fewer iterations.
Variance Reduction by Control Variates
Stochastic Variance Reduced Gradient algorithm (SVRG):
At the start of the k-th epoch,
$$\tilde{\theta} = \theta^{[k]}, \quad \theta^{(0)} = \theta^{[k]}.$$
At the t-th iteration of the k-th epoch,
$$\theta^{(t+1)} = \theta^{(t)} - \eta_t\big(\nabla\ell_i(\theta^{(t)}) - \nabla\ell_i(\tilde{\theta}) + \nabla\mathcal{L}(\tilde{\theta})\big).$$
After $m$ iterations of the k-th epoch,
$$\theta^{[k+1]} = \theta^{(m)}.$$
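A minimal sketch of SVRG for least squares, following the epoch structure above; the step size, epoch length, and synthetic data are assumptions for illustration:

```python
import numpy as np

def svrg_least_squares(X, y, eta, n_epochs=30, m=None, seed=0):
    """SVRG: one full gradient per epoch, variance-reduced stochastic steps inside."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = m if m is not None else 2 * n              # inner iterations per epoch
    theta_epoch = np.zeros(d)
    for _ in range(n_epochs):
        snapshot = theta_epoch.copy()
        full_grad = X.T @ (X @ snapshot - y) / n   # gradient of L at the snapshot
        theta = snapshot.copy()
        for _ in range(m):
            i = rng.integers(n)
            g_i = (X[i] @ theta - y[i]) * X[i]             # grad of l_i at theta
            g_i_snap = (X[i] @ snapshot - y[i]) * X[i]     # grad of l_i at snapshot
            theta = theta - eta * (g_i - g_i_snap + full_grad)
        theta_epoch = theta                        # theta^{[k+1]} = theta^{(m)}
    return theta_epoch

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 5))
y = X @ np.ones(5) + 0.1 * rng.normal(size=2000)
L_max = np.max(np.sum(X * X, axis=1))              # smoothness of the worst-case l_i
print(svrg_least_squares(X, y, eta=0.1 / L_max))
```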
Strong Smoothness and Convexity
Regularity Conditions
(Strong smoothness) There exist constants $L_i$ such that for any $\theta$ and $\theta'$, we have
$$\ell_i(\theta') - \ell_i(\theta) - \nabla\ell_i(\theta)^\top(\theta' - \theta) \leq \frac{L_i}{2}\|\theta' - \theta\|_2^2.$$
(Strong convexity) There exists a constant $\mu$ such that for any $\theta$ and $\theta'$, we have
$$\mathcal{L}(\theta') - \mathcal{L}(\theta) - \nabla\mathcal{L}(\theta)^\top(\theta' - \theta) \geq \frac{\mu}{2}\|\theta' - \theta\|_2^2.$$
Condition number:
$$\kappa_{\max} = \frac{\max_i L_i}{\mu} \geq \kappa = \frac{L}{\mu}.$$
Why does SVRG work?
The strong smoothness implies
$$\|\nabla\ell_i(\theta^{(t)}) - \nabla\ell_i(\tilde{\theta})\|_2 \leq L_{\max}\|\theta^{(t)} - \tilde{\theta}\|_2.$$
Bias correction:
$$\mathbb{E}[\nabla\ell_i(\theta^{(t)}) - \nabla\ell_i(\tilde{\theta})] = \nabla\mathcal{L}(\theta^{(t)}) - \nabla\mathcal{L}(\tilde{\theta}).$$
Variance reduction: as $\theta^{(t)} \to \hat{\theta}$ and $\tilde{\theta} \to \hat{\theta}$,
$$\mathbb{E}\|\nabla\ell_i(\theta^{(t)}) - \nabla\ell_i(\tilde{\theta}) + \nabla\mathcal{L}(\tilde{\theta}) - \nabla\mathcal{L}(\theta^{(t)})\|_2^2 \leq \mathbb{E}\|\nabla\ell_i(\theta^{(t)}) - \nabla\ell_i(\tilde{\theta})\|_2^2 \to 0.$$
Convergence of SVRG
How many iterations do we need?
Fixed step size parameter: $\eta_t \asymp \frac{1}{L_{\max}}$.
Given a pre-specified error $\epsilon$ and $m \asymp \kappa_{\max}$, we need
$$K = O\left(\log\left(\frac{1}{\epsilon}\right)\right)$$
epochs such that
$$\mathbb{E}\|\theta^{[K]} - \hat{\theta}\|_2^2 \leq \epsilon.$$
Total number of operations:
$$O(nd + d\kappa_{\max}) \quad \text{vs.} \quad O\left(\frac{dM^2}{\mu^2\epsilon} + \frac{d\kappa^2}{\epsilon}\right) \quad \text{vs.} \quad O(nd\kappa).$$
Comparison of GD, SGD and SVRG
Summary
The empirical performance highly depends on the implementation.
Cyclic or shuffled order (not truly stochastic) is often used in practice.
"Too many tuning parameters" means "my algorithm might only work in theory".
Theoretical bounds can be very loose. The constants may matter a lot in practice.
Good software engineers with B.S./M.S. degrees can earn much more than Ph.D.'s, if they know how to code efficient algorithms.
Classification Analysis
Classification vs. Regression
Logistic Regression
Given $x_1, \ldots, x_n \in \mathbb{R}^d$ and $\theta^* \in \mathbb{R}^d$,
$$y_i \sim \mathrm{Bernoulli}\big(h(x_i^\top\theta^*)\big) \quad \text{for } i = 1, \ldots, n,$$
where $h: (-\infty, \infty) \to [0, 1]$ is the logistic/sigmoid function
$$h(z) = \frac{1}{1 + \exp(-z)}.$$
Remark: $h(0) = 0.5$, $h(-\infty) = 0$, and $h(\infty) = 1$.
Logistic/Sigmoid Function
Logistic Regression
Maximum likelihood estimation:
$$\begin{aligned}
\hat{\theta} &= \arg\max_{\theta}\; L(\theta) = \arg\max_{\theta}\; \log\prod_{i=1}^{n}\big(h(x_i^\top\theta)\big)^{y_i}\big(1 - h(x_i^\top\theta)\big)^{1 - y_i} \\
&= \arg\max_{\theta}\; \sum_{i=1}^{n}\Big[y_i\log h(x_i^\top\theta) + (1 - y_i)\log\big(1 - h(x_i^\top\theta)\big)\Big] \\
&= \arg\max_{\theta}\; \sum_{i=1}^{n}\Big[y_i \cdot x_i^\top\theta - \log\big(1 + \exp(x_i^\top\theta)\big)\Big] \\
&= \arg\min_{\theta}\; \frac{1}{n}\sum_{i=1}^{n}\Big[\log\big(1 + \exp(x_i^\top\theta)\big) - y_i \cdot x_i^\top\theta\Big].
\end{aligned}$$
Optimization for Logistic Regression
Convex problem? Let $\mathcal{F}(\theta) = -L(\theta)$ denote the negative log-likelihood objective. Then
$$\nabla^2\mathcal{F}(\theta) = \frac{1}{n}\sum_{i=1}^{n}h(x_i^\top\theta)\big(1 - h(x_i^\top\theta)\big)\,x_i x_i^\top \succeq 0.$$
There is no closed-form solution.
Gradient descent and stochastic gradient algorithms are applicable.
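Since gradient descent applies here, a minimal sketch that minimizes the averaged logistic loss above; the helper names, synthetic data, step size, and iteration count are assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gd(X, y, eta=0.5, n_iter=5000):
    """Gradient descent on F(theta) = (1/n) sum_i [log(1 + exp(x_i^T theta)) - y_i x_i^T theta]."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (sigmoid(X @ theta) - y) / n   # gradient of the negative log-likelihood
        theta = theta - eta * grad
    return theta

rng = np.random.default_rng(0)
n, d = 2000, 5
X = rng.normal(size=(n, d))
theta_star = rng.normal(size=d)
y = (rng.uniform(size=n) < sigmoid(X @ theta_star)).astype(float)  # y_i ~ Bernoulli(h(x_i^T theta*))
print(fit_logistic_gd(X, y))
```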
Prediction for Logistic Regression
Prediction: given $x_*$, we predict $\hat{y}_* = 1$ when
$$\mathbb{P}(y_* = 1) = \frac{1}{1 + \exp(-\hat{\theta}^\top x_*)} \geq 0.5.$$
Why linear classification?
$$\mathbb{P}(y_* = 1) \geq 0.5 \;\Leftrightarrow\; \hat{\theta}^\top x_* \geq 0 \;\Leftrightarrow\; \hat{y}_* = \mathrm{sign}(\hat{\theta}^\top x_*).$$
Logistic Loss
Given $x_1, \ldots, x_n \in \mathbb{R}^d$, $y_1, \ldots, y_n \in \{-1, 1\}$, and $\theta^* \in \mathbb{R}^d$,
$$\mathbb{P}(y_i = 1) = \frac{1}{1 + \exp(-x_i^\top\theta^*)} \quad \text{for } i = 1, \ldots, n.$$
An alternative formulation:
$$\hat{\theta} = \arg\min_{\theta}\frac{1}{n}\sum_{i=1}^{n}\log\big(1 + \exp(-y_i x_i^\top\theta)\big).$$
We can also use the 0-1 loss:
$$\hat{\theta} = \arg\min_{\theta}\frac{1}{n}\sum_{i=1}^{n}\mathbb{1}\big(\mathrm{sign}(x_i^\top\theta) \neq y_i\big).$$
Loss Functions for Classification
Newton’s Method
At the k-th iteration, we take
$$\theta^{(k+1)} = \theta^{(k)} - \eta_k\big[\nabla^2\mathcal{F}(\theta^{(k)})\big]^{-1}\nabla\mathcal{F}(\theta^{(k)}),$$
where $\eta_k > 0$ is a step size parameter.
The second-order Taylor approximation:
$$\theta^{(k+0.5)} = \arg\min_{\theta}\; \mathcal{F}(\theta^{(k)}) + \nabla\mathcal{F}(\theta^{(k)})^\top(\theta - \theta^{(k)}) + \frac{1}{2}(\theta - \theta^{(k)})^\top\nabla^2\mathcal{F}(\theta^{(k)})(\theta - \theta^{(k)}).$$
Backtracking line search:
$$\theta^{(k+1)} = \theta^{(k)} + \eta_k\big(\theta^{(k+0.5)} - \theta^{(k)}\big).$$
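A minimal sketch of (damped) Newton's method for the logistic loss; the fixed step size of 1, the small iteration budget, and the synthetic data are assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_newton(X, y, n_iter=10, eta=1.0):
    """Newton's method for the averaged negative log-likelihood of logistic regression."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_iter):
        p = sigmoid(X @ theta)
        grad = X.T @ (p - y) / n
        W = p * (1.0 - p)                             # h(x_i^T theta)(1 - h(x_i^T theta))
        hessian = (X * W[:, None]).T @ X / n          # (1/n) sum_i w_i x_i x_i^T
        theta = theta - eta * np.linalg.solve(hessian, grad)
    return theta

rng = np.random.default_rng(0)
n, d = 2000, 5
X = rng.normal(size=(n, d))
theta_star = rng.normal(size=d)
y = (rng.uniform(size=n) < sigmoid(X @ theta_star)).astype(float)
print(fit_logistic_newton(X, y))                      # converges in a handful of iterations
```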
Newton’s Method
Sublinear + quadratic convergence:
Given $\|\theta^{(k)} - \hat{\theta}\|_2^2 \leq R \ll 1$, we have
$$\|\theta^{(k+1)} - \hat{\theta}\|_2^2 \leq (1 - \delta)\|\theta^{(k)} - \hat{\theta}\|_2^4.$$
Given $\|\theta^{(k)} - \hat{\theta}\|_2^2 \geq R$, we have
$$\|\theta^{(k+1)} - \hat{\theta}\|_2^2 = O(1/k).$$
Iteration complexity (some parameters hidden for simplicity):
$$O\left(\log\left(\frac{1}{R}\right) + \log\log\left(\frac{1}{\epsilon}\right)\right).$$
Newton’s Method
Advantages:
More efficient for highly accurate solutions.
Avoid extensively calculating log or exp functions.
— Taylor expansions combined with a table.
Fewer line search steps.
— Due to quadratic convergence.
Often more efficient than gradient descent.
Newton’s Method
Disadvantages:
Computing inverse Hessian matrices is expensive!
Storing inverse Hessian matrices is expensive!
Subsampled Newton:
$$\theta^{(k+1)} = \theta^{(k)} - \eta_k\big[H(\theta^{(k)})\big]^{-1}\nabla\mathcal{F}(\theta^{(k)}),$$
where $H(\theta^{(k)})$ is a Hessian approximation computed from a subsample of the data.
Quasi-Newton methods: DFP, BFGS, Broyden, SR1
— Use differences of gradient vectors to approximate Hessian matrices.
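As a practical note, here is a minimal sketch that fits the logistic loss with SciPy's built-in L-BFGS routine (a quasi-Newton method); the data setup and helper names are assumptions for illustration:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d = 500, 10
X = rng.normal(size=(n, d))
theta_star = rng.normal(size=d)
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-X @ theta_star))).astype(float)

def neg_log_likelihood(theta):
    z = X @ theta
    # log(1 + exp(z)) computed stably via logaddexp
    return np.mean(np.logaddexp(0.0, z) - y * z)

def gradient(theta):
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return X.T @ (p - y) / n

result = minimize(neg_log_likelihood, np.zeros(d), jac=gradient, method="L-BFGS-B")
print(result.x)  # quasi-Newton estimate of theta
```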