Recitation 4 for Big Data
Jay Gu, Feb 7, 2013
LASSO and Coordinate Descent
A numerical example
Generate some synthetic data (a sketch follows below):
• N = 50, P = 200, # non-zero coefficients = 5
• X ~ Normal(0, I)
• beta_1, beta_2, beta_3 ~ Normal(1, 2)
• sigma ~ Normal(0, 0.1*I)
• Y = X*beta + sigma
• Split training vs. testing: 80/20
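A minimal NumPy sketch of how this synthetic setup could be generated. The seed, the interpretation of Normal(1, 2) as (mean, standard deviation), and drawing all five non-zero coefficients from that distribution (the slide lists only beta_1..beta_3) are assumptions, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)           # seed is an assumption
N, P, k = 50, 200, 5                     # 5 non-zero coefficients

X = rng.standard_normal((N, P))          # X ~ Normal(0, I)
beta = np.zeros(P)
beta[:k] = rng.normal(1.0, 2.0, size=k)  # non-zero coefficients ~ Normal(1, 2)
noise = rng.normal(0.0, 0.1, size=N)     # sigma ~ Normal(0, 0.1*I)
y = X @ beta + noise                     # Y = X*beta + sigma

# 80/20 train/test split
n_train = int(0.8 * N)
X_train, X_test = X[:n_train], X[n_train:]
y_train, y_test = y[:n_train], y[n_train:]
```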
Practicalities (a sketch of both steps follows below)
• Standardize your data:
  – Center X and Y to remove the intercept.
  – Scale each column of X to have unit norm, so regularization is fair across all covariates.
• Warm start: run the lambdas from large to small, starting from the largest lambda, max|X'y|, which guarantees a support size of zero (all coefficients are exactly zero).
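A minimal sketch of these practicalities, assuming the synthetic data from the sketch above; the function names, the grid size, and the log spacing of the lambda path are illustrative assumptions.

```python
import numpy as np

def standardize(X, y):
    """Center X and y (removes the intercept) and scale each column of X
    to unit norm, so every covariate is penalized on the same scale."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    norms = np.linalg.norm(Xc, axis=0)
    return Xc / norms, yc, norms

def lambda_path(X, y, n_lambdas=100, ratio=1e-3):
    """Lambdas from large to small for warm starts; at lambda_max = max|X'y|
    the lasso solution is all zeros."""
    lambda_max = np.max(np.abs(X.T @ y))
    return np.logspace(np.log10(lambda_max), np.log10(ratio * lambda_max), n_lambdas)

# Usage with the synthetic data above:
# Xs, ys, _ = standardize(X_train, y_train)
# lambdas = lambda_path(Xs, ys)
```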
Algorithm
Ridge Regression: closed-form solution, beta_hat = (X'X + lambda*I)^{-1} X'y.
LASSO: iterative algorithms (two of them are sketched below):
• Subgradient Descent
• Generalized Gradient Methods (ISTA)
• Accelerated Generalized Gradient Methods (FISTA)
• Coordinate Descent
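A minimal sketch of two of these solvers: the ridge closed form and ISTA for the lasso. Function names, the step-size choice 1/L, and the fixed iteration count are illustrative assumptions, not from the slides.

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge regression: closed-form solution (X'X + lam*I)^{-1} X'y."""
    P = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(P), X.T @ y)

def soft_threshold(z, t):
    """Prox operator of t*||.||_1: elementwise soft-thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista(X, y, lam, n_iter=500):
    """ISTA for the lasso 0.5*||y - X b||^2 + lam*||b||_1:
    a gradient step on the smooth part followed by soft-thresholding.
    Step size 1/L, where L = ||X||_2^2 is the Lipschitz constant of the gradient."""
    L = np.linalg.norm(X, 2) ** 2
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y)
        b = soft_threshold(b - grad / L, lam / L)
    return b
```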
Subdifferentials and Coordinate Descent
• Slides from Ryan Tibshirani:
  – http://www.cs.cmu.edu/~ggordon/10725-F12/slides/06-sg-method.pdf
  – http://www.cs.cmu.edu/~ggordon/10725-F12/slides/25-coord-desc.pdf
Coordinate Descent: does it always find the global optimum?
• Convex and differentiable? Yes.
• Convex and non-differentiable? No.
• Convex, but with the non-differentiable part separable across coordinates? Yes. Proof sketch below.
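The proof from the original slide is not reproduced here; the following is a sketch of the standard argument, assuming the objective splits into a smooth convex part g and a coordinate-separable convex non-smooth part.

```latex
% f(x) = g(x) + \sum_i h_i(x_i), with g convex and differentiable, each h_i convex
% (lasso: g(\beta) = \tfrac{1}{2}\|y - X\beta\|_2^2, \; h_j(\beta_j) = \lambda|\beta_j|).
% Let x be a point that cannot be improved along any single coordinate. For any y,
f(y) - f(x) \;\ge\; \nabla g(x)^{\top}(y - x) + \sum_i \bigl[h_i(y_i) - h_i(x_i)\bigr]
          \;=\; \sum_i \bigl[\nabla_i g(x)\,(y_i - x_i) + h_i(y_i) - h_i(x_i)\bigr] \;\ge\; 0,
% where the first inequality is convexity of g and each bracketed term is nonnegative
% by the one-dimensional optimality of x_i. Hence x is a global minimizer.
```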
CD for Linear Regression
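The update formula from the original slide is not reproduced here; below is a minimal sketch assuming the standard lasso coordinate update (soft-thresholding of the partial-residual correlation). The function names and the fixed number of sweeps are illustrative assumptions.

```python
import numpy as np

def soft_threshold(z, t):
    """Elementwise soft-thresholding, the prox of t*|.|."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def cd_lasso(X, y, lam, n_sweeps=100, b0=None):
    """Coordinate descent for the lasso 0.5*||y - X b||^2 + lam*||b||_1.
    Each coordinate update is the exact 1-D minimizer: soft-threshold the
    correlation of x_j with the partial residual, then rescale by ||x_j||^2."""
    N, P = X.shape
    b = np.zeros(P) if b0 is None else b0.copy()
    r = y - X @ b                     # residual, maintained incrementally
    col_sq = (X ** 2).sum(axis=0)     # ||x_j||^2 (all 1 if columns are unit-norm)
    for _ in range(n_sweeps):
        for j in range(P):
            r += X[:, j] * b[j]       # remove coordinate j's contribution
            rho = X[:, j] @ r         # x_j' (partial residual)
            b[j] = soft_threshold(rho, lam) / col_sq[j]
            r -= X[:, j] * b[j]       # add back the updated contribution
    return b
```

With unit-norm columns (see the standardization sketch above) the division by ||x_j||^2 is a no-op, and running this over the warm-start lambda path, passing each solution in as b0 for the next lambda, gives the full regularization path.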
Rate of Convergence?
• Assuming the gradient is Lipschitz continuous:
• Subgradient Descent: 1/sqrt(k)
• Gradient Descent: 1/k
• Optimal rate for first-order methods: 1/k^2
• Coordinate Descent:
  – Only known for some special cases
Summary: Coordinate Descent
• Good for large P
• No tuning parameters (no step size to choose)
• In practice converges much faster than the optimal first-order methods
• Only applies to certain cases
• Convergence rate unknown for general function classes