A bootstrap framework for low-rank matrix estimation

Julie Josse – Stefan Wager

Laboratoire de statistique, Agrocampus Ouest, Rennes, France – Statistics Department, Stanford University

Malgorzata Bogdan group meeting, Wroclaw University of Technology, 31 October 2014
Outline

• Low-rank matrix estimation
• A bootstrap framework
• Results
• Other noise
Low-rank matrix estimation

⇒ Model: $X \in \mathbb{R}^{n \times p} \sim \mathcal{L}(\mu)$ with $\mathbb{E}[X] = \mu$ of low rank $k$

⇒ Gaussian noise model:
$$X = \mu + \varepsilon, \quad \varepsilon_{ij} \overset{iid}{\sim} \mathcal{N}(0, \sigma^2)$$

Examples: images, collaborative filtering, ...

⇒ Classical solution: truncated SVD
$$\hat{\mu}_k = \sum_{l=1}^{k} u_l \, d_l \, v_l^\top$$
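To make the setting concrete, here is a minimal numpy sketch (mine, not from the talk) of the Gaussian model and the truncated-SVD estimator; the dimensions, rank, and noise level are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k, sigma = 200, 50, 5, 0.5

# Low-rank signal mu plus i.i.d. Gaussian noise.
mu = rng.normal(size=(n, k)) @ rng.normal(size=(k, p))
X = mu + sigma * rng.normal(size=(n, p))

# Truncated SVD: keep only the k leading singular triplets.
U, d, Vt = np.linalg.svd(X, full_matrices=False)
mu_hat = (U[:, :k] * d[:k]) @ Vt[:k, :]

print("MSE of truncated SVD:", np.mean((mu_hat - mu) ** 2))
```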
Shrinking - thresholding SVD

⇒ Hard thresholding: $d_l \cdot 1(l \le k) = d_l \cdot 1(d_l \ge \tau)$
$$\hat{\mu}_k = \operatorname*{argmin}_{\mu} \left\{ \|X - \mu\|_2^2 : \operatorname{rank}(\mu) \le k \right\}$$
Selecting $k$? Cross-validation (Josse & Husson, 2012). Chatterjee (2013) and Donoho & Gavish (2014): an optimal hard threshold with better asymptotic MSE than $k$.

⇒ Soft thresholding: $d_l \max(1 - \tau/d_l, \, 0)$
$$\operatorname*{argmin}_{\mu} \left\{ \|X - \mu\|_2^2 + \lambda \|\mu\|_* \right\}$$
Selecting $\lambda$? Stein's Unbiased Risk Estimate (Candès et al., 2012)
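A hedged sketch of the two rules above (function name and interface are my choices). Note that soft thresholding reduces to subtraction, since $d_l \max(1 - \tau/d_l, 0) = \max(d_l - \tau, 0)$:

```python
import numpy as np

def svd_threshold(X, tau, kind="hard"):
    """Reconstruct X after hard or soft thresholding of its singular values."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    if kind == "hard":
        d_new = d * (d >= tau)            # d_l * 1(d_l >= tau)
    else:
        d_new = np.maximum(d - tau, 0.0)  # d_l * max(1 - tau/d_l, 0)
    return (U * d_new) @ Vt
```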
Shrinking - thresholding SVD

⇒ Non-linear shrinkage: $\hat{\mu}_{\text{shrink}} = \sum_{l=1}^{\min\{n,p\}} u_l \, \psi(d_l) \, v_l^\top$

• Shabalin & Nobel (2013); Gavish & Donoho (2014). Asymptotics: $n = n_p$ and $p \to \infty$, $n_p/p \to \beta$, $0 < \beta \le 1$:
$$\psi(d_l) = \frac{1}{d_l} \sqrt{\left(d_l^2 - \beta - 1\right)^2 - 4\beta} \; \cdot \, 1\!\left(d_l \ge 1 + \sqrt{\beta}\right)$$

• Verbanck, Josse & Husson (2013). Asymptotics: $n, p$ fixed, $\sigma \to 0$:
$$\psi(d_l) = d_l \left(\frac{d_l^2 - \sigma^2}{d_l^2}\right) \cdot 1(l \le k) = \left(d_l - \frac{\sigma^2}{d_l}\right) \cdot 1(l \le k)$$
where the ratio $(d_l^2 - \sigma^2)/d_l^2$ is a signal variance over a total variance.
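A hedged sketch of the two shrinkage rules above (function names are mine; the first assumes singular values already on the normalized scale where the noise bulk edge is $1 + \sqrt{\beta}$, as in the asymptotic setup):

```python
import numpy as np

def psi_gavish_donoho(d, beta):
    """Keep d_l above the bulk edge 1 + sqrt(beta), shrunk non-linearly."""
    out = np.zeros_like(d)
    keep = d >= 1.0 + np.sqrt(beta)
    out[keep] = np.sqrt((d[keep] ** 2 - beta - 1.0) ** 2 - 4.0 * beta) / d[keep]
    return out

def psi_verbanck(d, sigma, k):
    """Shrink the k leading singular values by d_l - sigma^2 / d_l."""
    out = np.zeros_like(d)
    out[:k] = d[:k] - sigma ** 2 / d[:k]
    return out
```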
Shrinking - thresholding SVD

• Josse & Sardy (2014). Adaptive estimator, finite sample:
$$\psi(d_l) = d_l \max\left(1 - \frac{\tau^\gamma}{d_l^\gamma}, \, 0\right)$$
$$\operatorname*{argmin}_{\mu} \left\{ \|X - \mu\|_2^2 + \lambda \|\mu\|_{*,w} \right\}$$
Selecting $(\tau, \gamma)$? Data-dependent: SURE - GSURE ($\sigma$ unknown)

• Bayesian SVD models:
  • Hoff (2009): uniform priors on $U$ and $V$, $d_l \sim \mathcal{N}(0, s^2_\gamma)$
  • Todeschini et al. (2013): each $d_l$ has its own $s^2_{\gamma_l}$, with a hierarchical prior
A bootstrap framework
SVD via the autoencoder

Model: $X = \mu + \varepsilon$, with $\varepsilon_{ij} \overset{iid}{\sim} \mathcal{N}(0, \sigma^2)$

⇒ Classical least-squares formulation:
• LS: $\hat{\mu}_k = \operatorname*{argmin}_{\mu} \{ \|X - \mu\|_2^2 : \operatorname{rank}(\mu) \le k \}$
• Truncated SVD: $\hat{\mu}_k = \sum_{l=1}^{k} u_l \, d_l \, v_l^\top$
• Shrinkage to better recover $\mu$ (remark: LS = maximum likelihood, which is not the best in MSE)

⇒ Another formulation: the autoencoder (Bourlard & Kamp, 1988)
$$\hat{\mu}_k = X B_k, \quad B_k = \operatorname*{argmin}_{B} \left\{ \|X - XB\|_2^2 : \operatorname{rank}(B) \le k \right\}$$
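A quick numerical check (a sketch under the slide's definitions, not code from the talk) that the rank-$k$ autoencoder solution is $B_k = V_k V_k^\top$, the projector onto the top $k$ right singular vectors, so that $XB_k$ coincides with the truncated SVD:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 50))
k = 5

U, d, Vt = np.linalg.svd(X, full_matrices=False)
B_k = Vt[:k, :].T @ Vt[:k, :]   # projector onto top k right singular vectors
mu_hat_autoencoder = X @ B_k
mu_hat_tsvd = (U[:, :k] * d[:k]) @ Vt[:k, :]

print(np.allclose(mu_hat_autoencoder, mu_hat_tsvd))  # True
```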
Parametric Bootstrap

⇒ Autoencoder: compress $X$
$$\hat{\mu}_k = X B_k, \quad B_k = \operatorname*{argmin}_{B} \left\{ \|X - XB\|_2^2 : \operatorname{rank}(B) \le k \right\}$$

⇒ Aim: recover $\mu$
$$\hat{\mu}^*_k = X B^*_k, \quad B^*_k = \operatorname*{argmin}_{B} \left\{ \mathbb{E}_{X \sim \mathcal{L}(\mu)}\left[ \|\mu - XB\|_2^2 \right] : \operatorname{rank}(B) \le k \right\}$$

⇒ Parametric bootstrap
$$\hat{\mu}^{\text{boot}}_k = X \hat{B}_k, \quad \hat{B}_k = \operatorname*{argmin}_{B} \left\{ \mathbb{E}_{\tilde{X} \sim \mathcal{L}(X)}\left[ \|X - \tilde{X}B\|_2^2 \right] : \operatorname{rank}(B) \le k \right\}$$
Stable Autoencoder

Model: $X = \mu + \varepsilon$, with $\varepsilon_{ij} \overset{iid}{\sim} \mathcal{N}(0, \sigma^2)$

$$\hat{\mu}^{\text{boot}}_k = X \hat{B}_k, \quad \hat{B}_k = \operatorname*{argmin}_{B} \left\{ \mathbb{E}_{\tilde{X} \sim \mathcal{L}(X)}\left[ \|X - \tilde{X}B\|_2^2 \right] : \operatorname{rank}(B) \le k \right\}$$
$$\hat{B}_k = \operatorname*{argmin}_{B} \left\{ \mathbb{E}_{\varepsilon}\left[ \|X - (X + \varepsilon)B\|_2^2 \right] : \operatorname{rank}(B) \le k \right\}$$

⇒ Solution: a singular-value shrinkage estimator
$$\hat{\mu}^{\text{boot}}_k = \sum_{l=1}^{k} u_l \frac{d_l}{1 + \lambda/d_l^2} v_l^\top \quad \text{with } \lambda = n\sigma^2$$
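A sketch of this closed form in numpy (the function name is mine; $\sigma$ is assumed known, as on the slide):

```python
import numpy as np

def stable_autoencoder(X, sigma, k=None):
    """Singular-value shrinkage d_l / (1 + lam/d_l^2) with lam = n * sigma^2."""
    n = X.shape[0]
    lam = n * sigma ** 2
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    d_shrunk = d / (1.0 + lam / d ** 2)
    if k is not None:        # optional rank-k truncation
        d_shrunk[k:] = 0.0
    return (U * d_shrunk) @ Vt
```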
Feature Noising = Ridge = Shrinkage (Proof)

$$\mathbb{E}_{\varepsilon}\left[ \|X - (X + \varepsilon)B\|_2^2 \right] = \|X - XB\|_2^2 + \mathbb{E}_{\varepsilon}\left[ \|\varepsilon B\|_2^2 \right]$$
$$= \|X - XB\|_2^2 + \sum_{i,j,k} B_{jk}^2 \operatorname{Var}\left[\varepsilon_{ij}\right] = \|X - XB\|_2^2 + n\sigma^2 \|B\|_2^2$$

Let $X = UDV^\top$ be the SVD of $X$ and $\lambda = n\sigma^2$:
$$\hat{B}_\lambda = V \tilde{B}_\lambda V^\top, \quad \tilde{B}_\lambda = \operatorname*{argmin}_{B} \left\{ \|D - DB\|_2^2 + \lambda \|B\|_2^2 \right\}$$
$$\tilde{B}_{ii} = \operatorname*{argmin}_{B_{ii}} \left\{ (1 - B_{ii})^2 D_{ii}^2 + \lambda B_{ii}^2 \right\} = \frac{D_{ii}^2}{\lambda + D_{ii}^2}$$
$$\hat{\mu}_\lambda = \sum_{i=1}^{\min(n,p)} u_i \frac{D_{ii}}{1 + \lambda/D_{ii}^2} v_i^\top$$
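The proof can be confirmed numerically. The sketch below (small random $X$, arbitrary $\lambda$; my code, not the talk's) checks that the ridge autoencoder $\hat{B}_\lambda = (X^\top X + \lambda I)^{-1} X^\top X$ yields exactly the singular-value shrinkage $d_l / (1 + \lambda/d_l^2)$:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 10))
lam = 5.0

# Ridge autoencoder in closed form.
G = X.T @ X
mu_ridge = X @ np.linalg.solve(G + lam * np.eye(10), G)

# Equivalent singular-value shrinkage.
U, d, Vt = np.linalg.svd(X, full_matrices=False)
mu_shrink = (U * (d / (1.0 + lam / d ** 2))) @ Vt

print(np.allclose(mu_ridge, mu_shrink))  # True
```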
Feature Noising = Ridge = Shrinkage

⇒ Bootstrap = feature noising
$$\hat{\mu}^{\text{boot}}_k = X \hat{B}_k, \quad \hat{B}_k = \operatorname*{argmin}_{B} \left\{ \mathbb{E}_{\varepsilon}\left[ \|X - (X + \varepsilon)B\|_2^2 \right] : \operatorname{rank}(B) \le k \right\}$$

⇒ Equivalent to a ridge autoencoder problem
$$\hat{\mu}^{(k)}_\lambda = X \hat{B}_\lambda, \quad \hat{B}_\lambda = \operatorname*{argmin}_{B} \left\{ \|X - XB\|_2^2 + \lambda \|B\|_2^2 : \operatorname{rank}(B) \le k \right\}$$

The ridge estimator is robust to perturbations of the data.

$$\hat{\mu}^{\text{boot}}_k = \sum_{l=1}^{k} u_l \frac{d_l}{1 + \lambda/d_l^2} v_l^\top \quad \text{with } \lambda = n\sigma^2$$
$$\hat{\mu}_\lambda = X \hat{B}_\lambda, \quad \hat{B}_\lambda = \left( X^\top X + S \right)^{-1} X^\top X, \quad S = \operatorname{diag}(\lambda)$$
Feature noising and regularization in regression

Bishop (1995). Training with noise is equivalent to Tikhonov regularization. Neural Computation.

$$\hat{\beta} = \operatorname*{argmin}_{\beta} \left\{ \mathbb{E}_{\varepsilon_{ij} \overset{iid}{\sim} \mathcal{N}(0, \sigma^2)}\left[ \|Y - (X + \varepsilon)\beta\|_2^2 \right] \right\}$$

Training on many noisy copies of the data and averaging out the auxiliary noise is equivalent to ridge regularization with $\lambda = n\sigma^2$:
$$\hat{\beta}^{(R)}_\lambda = \operatorname*{argmin}_{\beta} \left\{ \|Y - X\beta\|_2^2 + \lambda \|\beta\|_2^2 \right\}$$

⇒ Control overfitting by artificially corrupting the training data
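A small Monte Carlo sketch of Bishop's equivalence (sizes, seed, and number of noise draws are arbitrary choices of mine): the least-squares solution averaged over noised copies of $X$ approaches the ridge solution with $\lambda = n\sigma^2$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma = 100, 5, 0.3
X = rng.normal(size=(n, p))
Y = X @ np.arange(1.0, p + 1) + rng.normal(size=n)

# Normal equations averaged over many noisy copies of X.
B = 2000
XtX, XtY = np.zeros((p, p)), np.zeros(p)
for _ in range(B):
    Xn = X + sigma * rng.normal(size=(n, p))
    XtX += Xn.T @ Xn / B
    XtY += Xn.T @ Y / B
beta_noise = np.linalg.solve(XtX, XtY)

# Ridge with lambda = n * sigma^2.
lam = n * sigma ** 2
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

print(np.round(beta_noise, 2), np.round(beta_ridge, 2))  # nearly identical
```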
Drop-out training

Drop-out (Hinton et al., 2012) randomly omits subsets of features at each iteration of a training algorithm (improves neural networks).

Wager et al. (2013). Dropout training as adaptive regularization.
• GLMs: equivalence between noising schemes and regularization
• Full potential with drop-out noise: $X_{ij} = 0$ with probability $\delta$, and $X_{ij}/(1 - \delta)$ otherwise → a nice penalty (e.g., logistic regression with rare features)

Wager et al. (2014). Altitude training: bounds for single-layer dropout.
• "Like a marathon runner who practices at altitude: once a classifier learns to perform well on training examples corrupted by dropout, it will do very well on the uncorrupted test set."
A centered parametric bootstrap

⇒ Aim: recover $\mu$
$$\hat{\mu}^*_k = X B^*_k, \quad B^*_k = \operatorname*{argmin}_{B} \left\{ \mathbb{E}_{X \sim \mathcal{L}(\mu)}\left[ \|\mu - XB\|_2^2 \right] : \operatorname{rank}(B) \le k \right\}$$

⇒ $X$ as a proxy for $\mu$:
$$\hat{\mu}^{\text{boot}}_k = X \hat{B}_k, \quad \hat{B}_k = \operatorname*{argmin}_{B} \left\{ \mathbb{E}_{\tilde{X} \sim \mathcal{L}(X)}\left[ \|X - \tilde{X}B\|_2^2 \right] : \operatorname{rank}(B) \le k \right\}$$

⇒ But $X$ is bigger than $\mu$! Use $\hat{\mu} = X\hat{B}$ as a proxy for $\mu$:
$$\hat{B} = \operatorname*{argmin}_{B} \left\{ \mathbb{E}_{\tilde{\hat{\mu}} \sim \mathcal{L}(\hat{\mu})}\left[ \|\hat{\mu} - \tilde{\hat{\mu}}B\|_2^2 \right] \right\}$$
A centered parametric bootstrap

$$\hat{\mu}^{\text{free}} = X \hat{B}, \quad \hat{B} = \operatorname*{argmin}_{B} \left\{ \mathbb{E}_{\tilde{\hat{\mu}} \sim \mathcal{L}(\hat{\mu})}\left[ \|\hat{\mu} - \tilde{\hat{\mu}}B\|_2^2 \right] \right\}$$

Iterative algorithm, with initialization $\hat{\mu} = X$:
1. $\hat{\mu} = X\hat{B}$
2. $\hat{B} = \operatorname*{argmin}_{B} \left\{ \mathbb{E}_{\tilde{\hat{\mu}} \sim \mathcal{L}(\hat{\mu})}\left[ \|\hat{\mu} - \tilde{\hat{\mu}}B\|_2^2 \right] \right\}$, which in the Gaussian case has the closed form $\hat{B} = (\hat{\mu}^\top\hat{\mu} + S)^{-1}\hat{\mu}^\top\hat{\mu}$

⇒ $\hat{\mu}^{\text{free}}$ will automatically be of low rank!
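A sketch of this iteration in the Gaussian case (my implementation; it takes $S = n\sigma^2 I$ as in the ridge form above, and uses a fixed iteration count since convergence is raised as an open question at the end of the talk):

```python
import numpy as np

def free_estimator(X, sigma, n_iter=100):
    """Iterative centered-bootstrap ('FREE') estimator, Gaussian noise."""
    n, p = X.shape
    S = n * sigma ** 2 * np.eye(p)
    mu_hat = X.copy()                  # initialization: mu_hat = X
    for _ in range(n_iter):
        G = mu_hat.T @ mu_hat
        B = np.linalg.solve(G + S, G)  # B = (mu^T mu + S)^{-1} mu^T mu
        mu_hat = X @ B                 # mu_hat = X B
    return mu_hat
```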
Results
Simulation design

Simulated data: $X = \mu + \varepsilon$

Varying parameters:
• $n/p$: $n = 200$, $p = 500$
• rank $k$: 10, 100
• SNR: 4, 2, 1, 0.5

Estimators:
• TSVD: truncated SVD at $k$ or at $\tau^*$ (Donoho & Gavish, 2013)
• OS: optimal shrinkage (Shabalin & Nobel, 2013)
• SVST: soft thresholding
• RIDGE: $\hat{\mu} = X(X^\top X + \operatorname{diag}(n\sigma^2))^{-1}X^\top X$, bootstrap with $X$
• FREE: the iterative estimator, bootstrap with $\hat{\mu}$
Simulation results

MSE:

k    SNR  RIDGE  FREE   TSVD(k)  TSVD(τ)  OS     SVST
10   4    0.004  0.004  0.004    0.004    0.004  0.008
100  4    0.037  0.036  0.038    0.038    0.037  0.045
10   2    0.017  0.017  0.017    0.016    0.017  0.033
100  2    0.142  0.143  0.152    0.158    0.146  0.156
10   1    0.067  0.067  0.072    0.072    0.067  0.116
100  1    0.511  0.775  0.733    0.856    0.600  0.448
10   0.5  0.277  0.251  0.321    0.321    0.250  0.353
100  0.5  1.600  1.000  3.164    1.000    0.961  0.852

Estimated rank:

k    SNR  FREE  TSVD(τ)  OS   SVST
10   4    10    10       10   65
100  4    100   100      100  193
10   2    10    10       10   63
100  2    100   100      100  181
10   1    10    10       10   59
100  1    29.6  38       64   154
10   0.5  10    10       10   51
100  0.5  0     0        15   86
Simulation results

⇒ Different noise regimes:
• low noise: TSVD (the rank selected by soft thresholding is too big)
• moderate noise: OS, RIDGE, FREE
• high noise (low SNR, large $k$): SVST

⇒ Adaptive estimator

⇒ FREE performs well in MSE and, as a by-product, accurately estimates the rank!

Remark: not the usual behavior. $n > p$: rows more perturbed than the columns; $n < p$: columns more perturbed; $n = p$.
Partial conclusion

⇒ Parametric bootstrap: a flexible framework for transforming a noise model into a regularized matrix estimator

⇒ Gaussian noise: singular-value shrinkage

⇒ Gaussian noise is not always appropriate; the procedure is most useful outside the Gaussian framework
Other noise
Count data

⇒ Model: $X \in \mathbb{R}^{n \times p} \sim \mathcal{L}(\mu)$ with $\mathbb{E}[X] = \mu$ of low rank $k$

⇒ Poisson noise model: $X_{ij} \sim \operatorname{Poisson}(\mu_{ij})$
⇒ Binomial noising: $X_{ij} \sim \frac{1}{1-\delta}\operatorname{Binomial}(\mu_{ij}, 1-\delta)$

⇒ $\hat{\mu}_k = \sum_{l=1}^{k} u_l \, d_l \, v_l^\top$

⇒ Bootstrap estimator = noising: $\tilde{X}_{ij} \sim \frac{1}{1-\delta}\operatorname{Binomial}(X_{ij}, 1-\delta)$
$$\hat{\mu}^{\text{boot}} = X\hat{B}, \quad \hat{B} = \operatorname*{argmin}_{B} \left\{ \mathbb{E}_{\tilde{X} \sim \mathcal{L}(X)}\left[ \|X - \tilde{X}B\|_2^2 \right] \right\}$$

⇒ The estimator is robust to subsampling of the observations used to build $X$
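A sketch of the binomial noising step (names are mine): each count keeps each of its units independently with probability $1 - \delta$ and is then rescaled, so the noised matrix is unbiased for $X$:

```python
import numpy as np

def binomial_noise(X, delta, rng):
    """Draw X_tilde_ij ~ Binomial(X_ij, 1-delta) / (1-delta)."""
    return rng.binomial(X.astype(np.int64), 1.0 - delta) / (1.0 - delta)

rng = np.random.default_rng(4)
X = rng.poisson(5.0, size=(20, 50))
X_tilde = binomial_noise(X, delta=0.3, rng=rng)
print(X.mean(), X_tilde.mean())  # close: the noising is unbiased
```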
Bootstrap estimators

⇒ Feature noising = regularization
$$\hat{B} = \operatorname*{argmin}_{B} \left\{ \|X - XB\|_2^2 + \left\| S^{1/2}B \right\|_2^2 \right\}, \quad S_{jj} = \sum_{i=1}^{n} \operatorname{Var}_{\tilde{X} \sim \mathcal{L}(X)}\left[ \tilde{X}_{ij} \right]$$
$$\hat{\mu} = X\left( X^\top X + \frac{\delta}{1-\delta} S_R \right)^{-1} X^\top X, \quad S_R \text{ diagonal with } (S_R)_{jj} = \sum_{i=1}^{n} X_{ij}$$

⇒ A new estimator $\hat{\mu}$ that does not reduce to singular-value shrinkage

⇒ FREE estimator, iterative algorithm:
1. $\hat{\mu} = X\hat{B}$
2. $\hat{B} = (\hat{\mu}^\top\hat{\mu} + S)^{-1}\hat{\mu}^\top\hat{\mu}$
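A sketch of the resulting closed-form estimator (my reading of the slide: $(S_R)_{jj}$ holds the column totals $\sum_i X_{ij}$, consistent with the variance formula above, since $\operatorname{Var}[\tilde{X}_{ij}] = \frac{\delta}{1-\delta} X_{ij}$ under binomial noising):

```python
import numpy as np

def count_bootstrap_estimator(X, delta):
    """mu_hat = X (X^T X + delta/(1-delta) * S_R)^{-1} X^T X for count data."""
    X = np.asarray(X, dtype=float)
    S_R = np.diag(X.sum(axis=0))   # (S_R)_jj = sum_i X_ij, the column totals
    G = X.T @ X
    B = np.linalg.solve(G + (delta / (1.0 - delta)) * S_R, G)
    return X @ B
```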
First results

Simulated data: $X_{ij} \sim \operatorname{Poisson}(\mu_{ij})$

Design:
• $n = 20$; $p = 50$
• $d_1 = 489.49$, $d_2 = 72.21$, $d_3 = 6.75$

Comparison criterion: the RV coefficient (Escoufier, 1976), a correlation between matrices.

        MSE   d1      d2     d3     RV U  RV V  k
TSVD    1.57  489.28  73.47  20.46
SHRINK  1.44  489.14  72.51  16.83
FREE    1.12  488.43  71.84  5.85   0.80  0.77  2.71

⇒ The singular vectors are not the same
Regularized Correspondence Analysis

⇒ Correspondence analysis:
$$M = R^{-1/2}\left( X - \frac{1}{N} r c^\top \right) C^{-1/2}, \quad R = \operatorname{diag}(r), \; C = \operatorname{diag}(c)$$
$$\hat{\mu}^{\text{CA}}_k = R^{1/2} \hat{M}_k C^{1/2} + \frac{1}{N} r c^\top$$

⇒ Regularized CA:
$$\tilde{X}_{ij} \sim \frac{1}{1-\delta}\operatorname{Binomial}(X_{ij}, 1-\delta)$$
$$\hat{B}_\lambda = \operatorname*{argmin}_{B} \left\{ \|M - MB\|_2^2 + \frac{\delta}{1-\delta}\left\| S_M^{1/2} B \right\|_2^2 \right\}$$
where $S_M$ is diagonal with $(S_M)_{jj} = c_j^{-1} \sum_{i=1}^{n} \operatorname{Var}\left[ \tilde{X}_{ij} \right] / r_i$
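A sketch of the CA residual matrix $M$ from a contingency table (my helper; $r$ and $c$ are the raw row and column margins and $N$ the grand total, matching the definition above):

```python
import numpy as np

def ca_residual_matrix(X):
    """M = R^{-1/2} (X - r c^T / N) C^{-1/2} for a contingency table X."""
    X = np.asarray(X, dtype=float)
    N = X.sum()
    r = X.sum(axis=1)   # row margins
    c = X.sum(axis=0)   # column margins
    # Elementwise: (X_ij - r_i c_j / N) / sqrt(r_i c_j)
    return (X - np.outer(r, c) / N) / np.sqrt(np.outer(r, c))
```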
Regularized Correspondence Analysis

⇒ Population data: the perfume data set
• 12 luxury perfumes described by 39 words; $N = 1075$
• $d_1 = 0.44$, $d_2 = 0.15$

⇒ Sample data: $N = 400$, same proportions

        d1    d2    RV U  RV V  k  RV row  RV col
CA      0.59  0.31  0.83  0.75     0.91    0.71
SHRINK  0.35  0.10  0.83  0.75     0.93    0.74
FREE    0.42  0.12  0.86  0.77  2  0.94    0.75
Regularized Correspondence Analysis

[Figure: factor maps of the perfume data. Left: CA, Dim 1 (60.18%) vs. Dim 2 (39.82%). Right: Regularized CA, Dim 1 (62.43%) vs. Dim 2 (37.57%). Both display the 12 perfumes (Angel, Aromatics Elixir, Chanel 5, Cinéma, Coco Mademoiselle, J_adore, J_adore_et, L_instant, Lolita Lempika, Pleasures, Pure Poison, Shalimar) together with the descriptive words.]
Regularized Correspondence Analysis

[Figure: factor maps of the perfumes with cluster assignments. Left: CA, Dim 1 (65.41%) vs. Dim 2 (34.59%), 4 clusters. Right: Regularized CA, Dim 1 (77.13%) vs. Dim 2 (22.87%), 3 clusters.]
Discussion

• FREE: a good denoiser with automatic rank selection
⇒ No free lunch: one must still choose $\sigma$ or $\delta$

• Convergence of the iterative algorithm?

• Regularized CA

• Extension to tensors: Hoff (2013), a Bayesian treatment of Tucker decomposition methods with hierarchical priors; an empirical Bayesian approach

⇒ The SVD: so many points of view!