Recitation 4 for Big Data
Jay Gu, Feb 7, 2013
LASSO and Coordinate Descent
A numerical example
Generate some synthetic data (a sketch follows below):
• N = 50, P = 200, # non-zero coefficients = 5
• X ~ Normal(0, I)
• beta_1, beta_2, beta_3 ~ Normal(1, 2)
• sigma ~ Normal(0, 0.1*I)
• Y = X*beta + sigma
• Split training vs. testing: 80/20
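A minimal NumPy sketch of how this synthetic setup could be generated. The seed, the interpretation of Normal(1, 2) as (mean, standard deviation), and drawing all five non-zero coefficients from that distribution (the slide lists only beta_1..beta_3) are assumptions, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)           # seed is an assumption
N, P, k = 50, 200, 5                     # 5 non-zero coefficients

X = rng.standard_normal((N, P))          # X ~ Normal(0, I)
beta = np.zeros(P)
beta[:k] = rng.normal(1.0, 2.0, size=k)  # non-zero coefficients ~ Normal(1, 2)
noise = rng.normal(0.0, 0.1, size=N)     # sigma ~ Normal(0, 0.1*I)
y = X @ beta + noise                     # Y = X*beta + sigma

# 80/20 train/test split
n_train = int(0.8 * N)
X_train, X_test = X[:n_train], X[n_train:]
y_train, y_test = y[:n_train], y[n_train:]
```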
Practicalities (a sketch of both steps follows below)
• Standardize your data:
  – Center X and Y to remove the intercept.
  – Scale each column of X to have unit norm, so regularization is fair across all covariates.
• Warm start: run the lambdas from large to small, starting from the largest lambda, max|X'y|, which guarantees a support size of zero (all coefficients are exactly zero).
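A minimal sketch of these practicalities, assuming the synthetic data from the sketch above; the function names, the grid size, and the log spacing of the lambda path are illustrative assumptions.

```python
import numpy as np

def standardize(X, y):
    """Center X and y (removes the intercept) and scale each column of X
    to unit norm, so every covariate is penalized on the same scale."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    norms = np.linalg.norm(Xc, axis=0)
    return Xc / norms, yc, norms

def lambda_path(X, y, n_lambdas=100, ratio=1e-3):
    """Lambdas from large to small for warm starts; at lambda_max = max|X'y|
    the lasso solution is all zeros."""
    lambda_max = np.max(np.abs(X.T @ y))
    return np.logspace(np.log10(lambda_max), np.log10(ratio * lambda_max), n_lambdas)

# Usage with the synthetic data above:
# Xs, ys, _ = standardize(X_train, y_train)
# lambdas = lambda_path(Xs, ys)
```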
Algorithm
Ridge Regression: closed-form solution, beta_hat = (X'X + lambda*I)^{-1} X'y.
LASSO: iterative algorithms (two of them are sketched below):
• Subgradient Descent
• Generalized Gradient Methods (ISTA)
• Accelerated Generalized Gradient Methods (FISTA)
• Coordinate Descent
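A minimal sketch of two of these solvers: the ridge closed form and ISTA for the lasso. Function names, the step-size choice 1/L, and the fixed iteration count are illustrative assumptions, not from the slides.

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge regression: closed-form solution (X'X + lam*I)^{-1} X'y."""
    P = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(P), X.T @ y)

def soft_threshold(z, t):
    """Prox operator of t*||.||_1: elementwise soft-thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista(X, y, lam, n_iter=500):
    """ISTA for the lasso 0.5*||y - X b||^2 + lam*||b||_1:
    a gradient step on the smooth part followed by soft-thresholding.
    Step size 1/L, where L = ||X||_2^2 is the Lipschitz constant of the gradient."""
    L = np.linalg.norm(X, 2) ** 2
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y)
        b = soft_threshold(b - grad / L, lam / L)
    return b
```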
Subdifferentials and Coordinate Descent
• Slides from Ryan Tibshirani:
  – http://www.cs.cmu.edu/~ggordon/10725-F12/slides/06-sg-method.pdf
  – http://www.cs.cmu.edu/~ggordon/10725-F12/slides/25-coord-desc.pdf
Coordinate Descent: does it always find the global optimum?
• Convex and differentiable? Yes.
• Convex and non-differentiable? No.
• Convex, but with the non-differentiable part separable across coordinates? Yes. Proof sketch below.
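The proof from the original slide is not reproduced here; the following is a sketch of the standard argument, assuming the objective splits into a smooth convex part g and a coordinate-separable convex non-smooth part.

```latex
% f(x) = g(x) + \sum_i h_i(x_i), with g convex and differentiable, each h_i convex
% (lasso: g(\beta) = \tfrac{1}{2}\|y - X\beta\|_2^2, \; h_j(\beta_j) = \lambda|\beta_j|).
% Let x be a point that cannot be improved along any single coordinate. For any y,
f(y) - f(x) \;\ge\; \nabla g(x)^{\top}(y - x) + \sum_i \bigl[h_i(y_i) - h_i(x_i)\bigr]
          \;=\; \sum_i \bigl[\nabla_i g(x)\,(y_i - x_i) + h_i(y_i) - h_i(x_i)\bigr] \;\ge\; 0,
% where the first inequality is convexity of g and each bracketed term is nonnegative
% by the one-dimensional optimality of x_i. Hence x is a global minimizer.
```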
CD for Linear Regression
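The update formula from the original slide is not reproduced here; below is a minimal sketch assuming the standard lasso coordinate update (soft-thresholding of the partial-residual correlation). The function names and the fixed number of sweeps are illustrative assumptions.

```python
import numpy as np

def soft_threshold(z, t):
    """Elementwise soft-thresholding, the prox of t*|.|."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def cd_lasso(X, y, lam, n_sweeps=100, b0=None):
    """Coordinate descent for the lasso 0.5*||y - X b||^2 + lam*||b||_1.
    Each coordinate update is the exact 1-D minimizer: soft-threshold the
    correlation of x_j with the partial residual, then rescale by ||x_j||^2."""
    N, P = X.shape
    b = np.zeros(P) if b0 is None else b0.copy()
    r = y - X @ b                     # residual, maintained incrementally
    col_sq = (X ** 2).sum(axis=0)     # ||x_j||^2 (all 1 if columns are unit-norm)
    for _ in range(n_sweeps):
        for j in range(P):
            r += X[:, j] * b[j]       # remove coordinate j's contribution
            rho = X[:, j] @ r         # x_j' (partial residual)
            b[j] = soft_threshold(rho, lam) / col_sq[j]
            r -= X[:, j] * b[j]       # add back the updated contribution
    return b
```

With unit-norm columns (see the standardization sketch above) the division by ||x_j||^2 is a no-op, and running this over the warm-start lambda path, passing each solution in as b0 for the next lambda, gives the full regularization path.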
Rate of Convergence?
• Assuming the gradient is Lipschitz continuous:
• Subgradient Descent: 1/sqrt(k)
• Gradient Descent: 1/k
• Optimal rate for first-order methods: 1/k^2
• Coordinate Descent:
  – Only known for some special cases
Summary: Coordinate Descent
• Good for large P
• No tuning parameters (no step size to choose)
• In practice converges much faster than the optimal first-order methods
• Only applies to certain cases
• Convergence rate unknown for general function classes