Sparsity Models

Tong Zhang

Rutgers University

Topics

Standard sparse regression model
  algorithms: convex relaxation and greedy algorithm
  sparse recovery analysis: high level view

Some extensions (complex regularization)
  structured sparsity
  graphical model
  matrix regularization

Modern Sparsity Analysis: Motivation

Modern datasets are often high dimensional
  statistical estimation suffers from the curse of dimensionality

Sparsity: popular assumption to address the curse of dimensionality
  motivated by real applications

Challenges:
  formulation, focusing on efficient computation
  mathematical analysis

Standard Sparse Regression

Model: Y = X β̄ + ε

  Y ∈ R^n: observation
  X ∈ R^{n×p}: design matrix
  β̄ ∈ R^p: parameter vector to be estimated
  ε ∈ R^n: zero-mean stochastic noise with variance σ²

High-dimensional setting: n ≪ p

Sparsity: β̄ has few nonzero components
  supp(β̄) = {j : β̄_j ≠ 0}
  ‖β̄‖_0 = |supp(β̄)| is small: ≪ n
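
To make the setup concrete, here is a minimal synthetic instance of the model Y = X β̄ + ε (a sketch; the Gaussian design, Gaussian noise, and the sizes n = 100, p = 500, k̄ = 10 are illustrative assumptions, not values from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k_bar = 100, 500, 10           # high-dimensional setting: n << p (illustrative)
sigma = 0.5                          # noise standard deviation (illustrative)

X = rng.standard_normal((n, p))      # design matrix
beta_bar = np.zeros(p)               # sparse true parameter
support = rng.choice(p, size=k_bar, replace=False)
beta_bar[support] = rng.choice([-1.0, 1.0], size=k_bar)
Y = X @ beta_bar + sigma * rng.standard_normal(n)   # observations
```

Later sketches in this transcript reuse the X, Y, β̄, and support generated here.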

Algorithms for Standard Sparsity

L0 regularization: natural method (computationally inefficient)

  β̂_L0 = arg min_β ‖Y − Xβ‖_2^2, subject to ‖β‖_0 ≤ k

L1 regularization (Lasso): convex relaxation (computationally efficient)

  β̂_L1 = arg min_β [ ‖Y − Xβ‖_2^2 + λ‖β‖_1 ]

Theoretical question: how well can we estimate the parameter β̄ (recovery performance)? (A Lasso sketch follows below.)
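
A minimal Lasso sketch using scikit-learn, reusing the synthetic data above. Note that sklearn's Lasso minimizes (1/(2n))‖Y − Xβ‖_2^2 + α‖β‖_1, so its α corresponds to λ/(2n) in the slide's objective; the value α = 0.05 is an arbitrary illustrative choice:

```python
import numpy as np
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.05, fit_intercept=False)   # alpha ~ lambda / (2n) in the slide's notation
lasso.fit(X, Y)
beta_hat = lasso.coef_

print("estimated support:", np.flatnonzero(beta_hat))
print("true support     :", np.sort(support))
print("parameter error  :", np.sum((beta_hat - beta_bar) ** 2))
```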

Greedy Algorithms for standard sparse regularization

Reformulation: find a variable set F ⊂ {1, . . . , p} to minimize

  min_β ‖Xβ − Y‖_2^2  subject to supp(β) ⊂ F, |F| ≤ k

Forward Greedy Algorithm (OMP): select variables one by one (a sketch follows below)

  Initialize the variable set F^0 = ∅ at k = 0
  Iterate k = 1, . . . , p:
    find the best variable j to add to F^{k−1} (maximum reduction of squared error)
    F^k = F^{k−1} ∪ {j}
  Terminate with some criterion;
  output β̂ using regression with the selected variables F^k

Theoretical question: recovery performance?
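
A minimal OMP sketch in NumPy following the greedy scheme above, reusing the synthetic data; stopping after a fixed number of steps k is an illustrative choice of criterion:

```python
import numpy as np

def omp(X, Y, k):
    """Forward greedy selection: add the variable giving the largest reduction of
    squared error on the current residual, then refit least squares on the set F."""
    p = X.shape[1]
    F, beta = [], np.zeros(p)
    residual = Y.copy()
    for _ in range(k):
        # squared-error reduction if column j alone is fitted to the current residual
        scores = (X.T @ residual) ** 2 / np.sum(X ** 2, axis=0)
        j = int(np.argmax(scores))
        if j not in F:
            F.append(j)
        beta = np.zeros(p)
        beta[F], *_ = np.linalg.lstsq(X[:, F], Y, rcond=None)   # refit on selected variables
        residual = Y - X @ beta
    return beta, F

beta_omp, F_omp = omp(X, Y, k=10)
print("selected variables:", sorted(F_omp))
```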

Conditions and Results

Type of results (sparse recovery):

  Variable selection (can we find the nonzero variables): can we recover the true support F̄?

    supp(β̂) ≈ F̄ ?

  Parameter estimation (how well can we estimate β̄): can we recover the parameters?

    ‖β̂ − β̄‖_2^2 ≤ ?

Are efficient algorithms (such as L1 or OMP) good enough?

Yes, but they require conditions:
  Irrepresentable condition: for support recovery
  RIP (Restricted Isometry Property): for parameter recovery

KKT Condition for Lasso Solution

Lasso solution:

  β̂_L1 = arg min_β [ ‖Y − Xβ‖_2^2 + λ‖β‖_1 ]

KKT condition at β̂ = β̂_L1:
there exists a subgradient equal to zero, i.e., for all j = 1, . . . , p (X_j is the j-th column of X):

  2 X_j^T (X β̂ − Y) + λ ∇|β̂_j| = 0.

Subgradient of the absolute value: ∇|u| = sign(u) = 1 if u > 0; −1 if u < 0; any value in [−1, 1] if u = 0.

If we can find a β̂ that satisfies the KKT condition, then it is a Lasso solution. A slightly stronger condition implies uniqueness. (A numerical KKT check is sketched below.)
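
A minimal sketch that checks the KKT condition numerically for a Lasso solution computed with scikit-learn, reusing the synthetic data above. Since sklearn minimizes (1/(2n))‖Y − Xβ‖_2^2 + α‖β‖_1, the slide's λ corresponds to 2nα (a scaling assumption, not stated in the slides):

```python
import numpy as np
from sklearn.linear_model import Lasso

alpha = 0.05
beta_hat = Lasso(alpha=alpha, fit_intercept=False, tol=1e-10).fit(X, Y).coef_
lam = 2 * len(Y) * alpha                 # slide's lambda corresponding to sklearn's alpha

g = 2 * X.T @ (X @ beta_hat - Y)         # 2 X_j^T (X beta_hat - Y) for every j
nz = beta_hat != 0
# KKT: g_j + lam * sign(beta_hat_j) = 0 on the support, |g_j| <= lam off the support
print("on-support KKT residual:", np.max(np.abs(g[nz] + lam * np.sign(beta_hat[nz]))))
print("off-support max |g_j|  :", np.max(np.abs(g[~nz])), "  lambda =", lam)
```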

Feature Selection Consistency of Lasso

Idea: construct a solution and check the KKT condition.

Define β̂ such that β̂_F̄ satisfies

  2 X_F̄^T (X_F̄ β̂_F̄ − Y) + λ sign(β̄)_F̄ = 0,

and set β̂_{F̄^c} = 0.

Condition A:
  X_F̄^T X_F̄ is full rank
  sign(β̂_F̄) = sign(β̄_F̄)
  2 |X_j^T (X_F̄ β̂_F̄ − Y)| < λ for j ∉ F̄.

Under Condition A:
  β̂ is the unique Lasso solution (satisfies KKT)

Irrepresentable Condition

The condition

  μ = sup_{j ∉ F̄} |X_j^T X_F̄ (X_F̄^T X_F̄)^{−1} sign(β̄)_F̄| < 1

is called the irrepresentable condition. It implies Condition A when Y = X β̄ and λ is sufficiently small.

Under the irrepresentable condition, if
  the noise is sufficiently small, and
  min_{j ∈ F̄} |β̄_j| is larger than the noise level,
then there exists an appropriate λ such that Condition A holds. Thus the Lasso solution is unique and feature-selection consistent.

A condition similar to the irrepresentable condition can be derived for OMP. (A sketch computing μ is given below.)
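
A minimal sketch computing the irrepresentable quantity μ for the synthetic design generated earlier (F̄ and sign(β̄) are taken from that data; this is purely a numerical illustration):

```python
import numpy as np

F = np.sort(support)                               # true support F-bar
notF = np.setdiff1d(np.arange(p), F)
XF = X[:, F]
s = np.sign(beta_bar[F])

# mu = sup_{j not in F-bar} | X_j^T X_F (X_F^T X_F)^{-1} sign(beta_bar)_F |
w = np.linalg.solve(XF.T @ XF, s)
mu = np.max(np.abs(X[:, notF].T @ (XF @ w)))
print("irrepresentable constant mu =", mu, "(condition requires mu < 1)")
```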

RIP Conditions

Feature selection consistency implies good parameter estimation. However, the irrepresentable condition is too strong.

RIP (restricted isometry property): a weaker condition which can be used to obtain parameter estimation results.

Definition of RIP: for some c > 1, the following condition holds

  ρ_+(c k̄) / ρ_−(c k̄) < ∞

  ρ_+(s) = sup { β^T X^T X β / (β^T β) : ‖β‖_0 ≤ s }
  ρ_−(s) = inf { β^T X^T X β / (β^T β) : ‖β‖_0 ≤ s }

with k̄ = |F̄| = ‖β̄‖_0. (A brute-force sketch of ρ_± is given below.)
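
A brute-force sketch of ρ_+(s) and ρ_−(s): the extreme Rayleigh quotients over s-sparse vectors equal the extreme eigenvalues of the s × s principal submatrices of X^T X, so for tiny p they can be computed by enumeration (the small design below is an illustrative assumption; this does not scale to the high-dimensional case):

```python
import numpy as np
from itertools import combinations

def restricted_eigs(X, s):
    """rho_plus(s), rho_minus(s): sup / inf of b^T X^T X b / (b^T b) over ||b||_0 <= s."""
    G = X.T @ X
    rho_plus, rho_minus = -np.inf, np.inf
    for F in combinations(range(X.shape[1]), s):
        eigs = np.linalg.eigvalsh(G[np.ix_(F, F)])   # eigenvalues of the principal submatrix
        rho_plus = max(rho_plus, eigs[-1])
        rho_minus = min(rho_minus, eigs[0])
    return rho_plus, rho_minus

rng = np.random.default_rng(0)
X_small = rng.standard_normal((50, 12)) / np.sqrt(50)   # tiny design so enumeration is feasible
print(restricted_eigs(X_small, s=3))
```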

Results under Restricted Isometry Property

Parameter estimation under RIP:

  ‖β̄ − β̂‖_2^2 = O(σ² ‖β̄‖_0 ln p / n),

where σ² is the noise variance.
  this result can be obtained both for Lasso and for OMP
  it is the best possible

Feature selection under RIP:
  neither procedure achieves feature selection consistency

Improvement:
  a non-convex formulation is needed for optimal feature selection under RIP
  trickier to analyze: a general theory only appeared very recently

Complex Regularization: structured sparsity

Wavelet domain: sparsity pattern not random (structured)

[Figure: image domain vs. wavelet domain]

can we take advantage of structure?

Structured Sparsity Characterization

Observation:
  the sparsity pattern is the set of nonzero coefficients
  not all sparse patterns are equally likely

Our proposal: an information-theoretical characterization of “structure”:
  a sparsity pattern F is associated with a cost c(F)
  c(F) is the negative log-likelihood of F (or a multiple of it)

Optimization problem:

  min_β ‖Xβ − Y‖_2^2  subject to ‖β‖_0 + c(supp(β)) ≤ s.

  c(supp(β)): cost for selecting the support supp(β)
  ‖β‖_0: cost for estimation after feature selection

Example: Group Structure

Variables are divided into pre-defined groups G_1, . . . , G_{p/m}
  m variables per group
  Example (m = 4): [Figure: nodes are variables arranged in groups G_1, G_2, . . . , G_{p/m}; gray nodes are the selected variables (groups 1, 2, 4)]

Assumption:
  coefficients are not completely random
  coefficients in each group are simultaneously (or nearly simultaneously) zero or nonzero

How to take advantage of group structure?

Example: Group Structure

Variables are divided into pre-defined groups G_1, . . . , G_{p/m}
  m variables per group

Assumption:
  coefficients in each group are simultaneously zero or nonzero

Group sparsity pattern cost: ‖β‖_0 + m^{−1} ‖β‖_0 ln p
Standard sparsity pattern cost (for Lasso): ‖β‖_0 ln p

Theoretical question: can we take advantage of the group sparsity structure to improve on Lasso?

Convex Relaxation for group sparsity

L1-L2 convex relaxation (group Lasso)

  β̂ = arg min_β ‖Xβ − Y‖_2^2 + λ ∑_j ‖β_{G_j}‖_2.

This is supposed to take advantage of the group sparsity structure
  within each group: L2 regularization (does not encourage sparsity)
  across groups: L1 regularization (encourages sparsity)

Question: what is the benefit of the group Lasso formulation? (A proximal-gradient sketch follows below.)
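
A minimal proximal-gradient sketch for the group-Lasso objective above, reusing the synthetic X and Y. There is no built-in group Lasso in scikit-learn, so the solver is hand-rolled; the consecutive group layout, λ, and iteration count are illustrative assumptions:

```python
import numpy as np

def group_lasso(X, Y, groups, lam, n_iter=500):
    """Proximal gradient for ||X b - Y||_2^2 + lam * sum_j ||b_{G_j}||_2."""
    step = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)    # 1 / Lipschitz constant of the gradient
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z = beta - step * 2 * X.T @ (X @ beta - Y)  # gradient step on the squared loss
        for g in groups:                            # block soft-thresholding (prox of the group norm)
            nrm = np.linalg.norm(z[g])
            z[g] = 0.0 if nrm == 0 else max(0.0, 1 - step * lam / nrm) * z[g]
        beta = z
    return beta

m = 5                                               # illustrative group size
groups = [np.arange(i, min(i + m, p)) for i in range(0, p, m)]
beta_gl = group_lasso(X, Y, groups, lam=5.0)
print("selected groups:", [int(g[0] // m) for g in groups if np.any(beta_gl[g] != 0)])
```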

Recovery Analysis for Lasso and Group Lasso

Simple sparsity: s = ‖β̄‖_0 variables out of p variables
  information-theoretical complexity (log of the number of choices): O(s ln p)
  statistical recovery performance:

    ‖β̂ − β̄‖_2^2 = O(σ² ‖β̄‖_0 ln p / n)

Group sparsity: g groups out of p/m groups (ideally g = ‖β̄‖_0 / m)
  information-theoretical complexity (log of the number of choices): O(g ln(p/m))

Statistical recovery performance for group Lasso: if supp(β̄) is covered by g groups, then under group RIP (weaker than RIP)

  ‖β̂ − β̄‖_2^2 = O( (σ²/n) · ( g ln(p/m) [group selection] + m·g [estimation after group selection] ) )

(A quick numerical comparison of the two rates follows below.)
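
A quick numerical comparison of the two complexity terms, dropping the common σ²/n factor and constants; the sizes p, m, s used here are illustrative assumptions:

```python
import numpy as np

p_, m_, s_ = 512, 8, 64            # illustrative: s = 64 nonzeros arranged in g = s/m = 8 groups
g_ = s_ // m_

lasso_term = s_ * np.log(p_)                  # ~ s ln p          (simple sparsity)
group_term = g_ * np.log(p_ / m_) + m_ * g_   # ~ g ln(p/m) + m g (group sparsity)
print(f"Lasso complexity term ~ {lasso_term:.0f}, group-Lasso term ~ {group_term:.0f}")
```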

Group sparsity: correct group structure

[Figure: coefficient plots over indices 1 to 500 comparing (a) the original signal, (b) the Lasso estimate, and (c) the group Lasso estimate, with the correct group structure]

Group sparsity: incorrect group structure

[Figure: coefficient plots over indices 1 to 500 comparing (a) the original signal, (b) the Lasso estimate, and (c) the group Lasso estimate, with an incorrect group structure]

Matrix Formulation: Graphical Model Example

Learning gene interaction network structure

Formulation: Gaussian Graphical Model

Multi-dimensional Gaussian vectors: X_1, . . . , X_n ∼ N(μ, Σ). Precision matrix Θ = Σ^{−1}.

Non-zeros of the precision matrix give the graphical model structure:

  P(X_i) ∝ |Θ|^{1/2} exp[ −(1/2) (X_i − μ)^T Θ (X_i − μ) ],

where |·| is the determinant.

Estimation: L1-regularized maximum likelihood estimator

  Θ̂ = arg min_Θ [ − ln |Θ| + tr(Σ̂ Θ) + λ ‖Θ‖_1 ],

  ‖·‖_1: element-wise L1 regularization to encourage sparsity
  Σ̂: empirical covariance matrix

Analysis exists (feature selection and parameter estimation): techniques similar to the L1 analysis, but not satisfactory. (A sketch using an off-the-shelf solver follows below.)
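
A minimal sketch using scikit-learn's GraphicalLasso, which fits this L1-penalized likelihood; the tridiagonal precision matrix, sample size, and α below are illustrative assumptions:

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
d = 10
Theta = np.eye(d)                                   # sparse ground-truth precision (tridiagonal)
Theta[np.arange(d - 1), np.arange(1, d)] = 0.4
Theta[np.arange(1, d), np.arange(d - 1)] = 0.4
Sigma = np.linalg.inv(Theta)
Z = rng.multivariate_normal(np.zeros(d), Sigma, size=2000)   # samples X_1, ..., X_n

Theta_hat = GraphicalLasso(alpha=0.05).fit(Z).precision_
print("estimated nonzero pattern:\n", (np.abs(Theta_hat) > 1e-3).astype(int))
```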

Matrix Completion

User × movie rating matrix (rows: users U1-U5, columns: movies M1-M8, "?": missing):

        M1  M2  M3  M4  M5  M6  M7  M8
  U1     1   ?   0   4   2   1   ?   ?
  U2     2   4   ?   ?   4   ?   1   2
  U3     2   3   2   ?   4   ?   ?   2
  U4     ?   1   0   3   2   ?   1   1
  U5     1   ?   ?   1   ?   ?   3   2

m × n matrix: m users and n movies with incomplete ratings
Can we fill in the missing values?

This requires assumptions:
  intuition: U2 and U3 have similar ratings on the observed entries — assume they have similar preferences
  low-rank (rank-r) structure: map user i to u_i ∈ R^r and movie j to v_j ∈ R^r, with rating ≈ u_i^T v_j. Let X be the true rating matrix:

    X ≈ U V^T    (U: m × r, V: n × r)

Formulation

Let S = {observed (i, j) entries}
Let y_ij be the observed value for (i, j) ∈ S
Let X be the true underlying rating matrix

We want to find X to fit the observed y_ij, assuming X is low-rank:

  min_{X ∈ R^{m×n}}  ∑_{(i,j)∈S} (X_ij − y_ij)² + λ · rank(X).

rank(X): a nonconvex function of X
convex relaxation: the trace norm ‖X‖_*, defined as the sum of the singular values of X.

The convex reformulation is

  min_{X ∈ R^{m×n}}  ∑_{(i,j)∈S} (X_ij − y_ij)² + λ · ‖X‖_*.

The solution of trace-norm regularization is low-rank. (A singular-value-thresholding sketch follows below.)
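
A minimal proximal-gradient sketch for the trace-norm formulation above, using singular-value soft-thresholding as the proximal step; the observation mask, λ, step size, and iteration count are illustrative assumptions:

```python
import numpy as np

def svt(Z, tau):
    """Proximal operator of tau * ||.||_*: soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def complete(Yobs, mask, lam, n_iter=300, step=0.5):
    """Minimize sum_{(i,j) observed} (X_ij - y_ij)^2 + lam * ||X||_* by proximal gradient."""
    Xhat = np.zeros_like(Yobs)
    for _ in range(n_iter):
        grad = 2 * mask * (Xhat - Yobs)        # gradient of the squared loss on observed entries
        Xhat = svt(Xhat - step * grad, step * lam)
    return Xhat

# illustrative low-rank ground truth with about half of the entries observed
rng = np.random.default_rng(0)
m_, n_, r_ = 30, 40, 2
Xtrue = rng.standard_normal((m_, r_)) @ rng.standard_normal((r_, n_))
mask = rng.random((m_, n_)) < 0.5
Xhat = complete(Xtrue * mask, mask, lam=1.0)
print("relative error:", np.linalg.norm(Xhat - Xtrue) / np.linalg.norm(Xtrue))
```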

Sparsity versus Low-rank

A vector β ∈ R^p: p parameters
  reduce dimension — sparsity: ‖β‖_0 is small
  constraint ‖β‖_0 ≤ s is nonconvex
  convex relaxation: convex hull of unit 1-sparse vectors, which gives L1 regularization ‖β‖_1 ≤ 1
  vector solution with L1 regularization is sparse

A matrix X ∈ R^{m×n}: m × n parameters
  reduce dimension — low rank: X = ∑_{j=1}^r u_j v_j^T, where u_j ∈ R^m and v_j ∈ R^n are vectors
  number of parameters — no more than rm + rn
  rank constraint is nonconvex
  convex relaxation: convex hull of unit rank-one matrices, which gives trace-norm regularization ‖X‖_* ≤ 1
  matrix solution with trace-norm regularization is low-rank

Matrix Regularization Example: mixed sparsity and low rank

Y (observed) = X_L (low-rank) + X_S (sparse)

  [X̂_S, X̂_L] = arg min  (1/(2μ)) ‖(X_S + X_L) − Y‖_2^2 + λ ‖X_S‖_1 + ‖X_L‖_*

trace norm ‖·‖_*: sum of the singular values of a matrix – encourages a low-rank matrix
(A sketch of this decomposition follows below.)
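
A minimal proximal-gradient sketch for this objective, alternating a gradient step on the quadratic term with soft-thresholding for X_S and singular-value thresholding for X_L; svt is the routine from the matrix-completion sketch above, and μ, λ, and the iteration count are illustrative assumptions:

```python
import numpy as np

def soft(Z, tau):
    """Entry-wise soft-thresholding: proximal operator of tau * ||.||_1."""
    return np.sign(Z) * np.maximum(np.abs(Z) - tau, 0.0)

def sparse_plus_lowrank(Y, mu=1.0, lam=0.1, n_iter=500):
    """Proximal gradient for (1/(2 mu)) ||XS + XL - Y||_2^2 + lam ||XS||_1 + ||XL||_*."""
    XS, XL = np.zeros_like(Y), np.zeros_like(Y)
    step = mu / 2.0                          # the joint smooth term has Lipschitz constant 2/mu
    for _ in range(n_iter):
        R = (XS + XL - Y) / mu               # shared gradient of the quadratic term
        XS = soft(XS - step * R, step * lam) # prox of lam * ||.||_1
        XL = svt(XL - step * R, step * 1.0)  # prox of ||.||_* (trace norm)
    return XS, XL
```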

Theoretical Analysis

We want to know under what conditions we can recover X_S and X_L.
The matrix is m × n.
X_S: sparse (spike-like outliers)
  no more than n_0 outliers per row
  no more than m_0 outliers per column
X_L: rank is r
  incoherence: X_L is “flat” – no component is large

Question: how many outliers per row n_0 and per column m_0 are allowed in order to recover X_S and X_L?
Partial answer (not completely satisfactory):
  Sparsity pattern supp(X_S) is random: exact recovery under the conditions m_0 = O(m) and n_0 = O(n).
  Sparsity pattern supp(X_S) does not have to be random: m_0 ≤ c(m/r) and n_0 ≤ c(n/r) for some constant c (r is the rank of X_L).

References

Statistical Science Special Issue on Sparsity and Regularization, 2012: http://www.imstat.org/sts/future_papers.html

Structured Sparsity: F. Bach et al.
General Theoretical Analysis: S. Negahban et al.
Graphical Models: J. Lafferty et al.
Nonconvex Methods: C.-H. Zhang and T. Zhang
...
