
Subspace estimation in linear dimension reduction

Hannu Oja (with Klaus Nordhausen and David E. Tyler)

BIRS workshop, Banff, November 2015

1

The plan

• Linear and nonlinear dimension reduction

• Supervised and unsupervised dimension reduction

• Similarities between PCA, FOBI and SIR

• Signal and noise subspaces

• Bootstrap tests for the dimension of the signal subspace

• Estimation of the dimension of the signal subspace

2

Introduction

• Let x be a p-variate random vector with cumulative distribution Fx.

• Linear dimension reduction.

Find a projection matrix P such that you do not lose information

if you transform x → z = Px:

(i) x|Px is not “interesting” (unsupervised)

(ii) y ⊥⊥ x |Px for some “interesting” y (supervised)

• Nonlinear dimension reduction - not discussed here.

Find a (nonlinear) function H : Rp → Rk such that you do not lose information

if you transform x → z = H(x):

(i) x|H(x) is not “interesting” (unsupervised)

(ii) y ⊥⊥ x |H(x) for some “interesting” y (supervised)

3

Linear dimension reduction

• The dimension of x is reduced using a k × p matrix B.

Then

x → z = Bx

or

x → z = PBx where PB = B′(BB′)−1B.

• The idea is that k ≪ p and that “no information is lost” in the transformation (see the sketch after this list).

• Dimension reduction methods (unsupervised and supervised):

PCA, ICA, ICS, SIR, SAVE, etc.
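A minimal numpy sketch of the projection step above. The matrix B here is an arbitrary k × p matrix drawn at random purely for illustration, not one produced by any of the methods listed:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical k x p reduction matrix B, random and for illustration only.
p, k = 5, 2
B = rng.normal(size=(k, p))

# Projection matrix P_B = B'(BB')^{-1}B onto the row space of B.
P_B = B.T @ np.linalg.solve(B @ B.T, B)

x = rng.normal(size=p)   # one p-variate observation
z = B @ x                # reduced k-dimensional representation
x_proj = P_B @ x         # same information, expressed in the original coordinates

# P_B is a projection: symmetric and idempotent (up to rounding error).
assert np.allclose(P_B, P_B.T) and np.allclose(P_B @ P_B, P_B)
```

Both z = Bx and PBx carry the same information; PBx simply represents it in the original p-dimensional coordinates.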

4

Looking for similarities: PCA, FOBI, SIR

• Assume that E(x) = 0. In PCA, one then finds the p × p transformation matrix W such that

WW′ = Ip and WE(xx′)W′ = D

where D is a diagonal matrix with diagonal elements d1 ≥ ... ≥ dp ≥ 0.

• In independent component analysis (ICA), FOBI finds a transformation matrix W such that

WE(xx′)W′ = Ip and WE(xx′E(xx′)−1xx′)W′ = D

where the diagonal elements of D are ordered so that

|d1 − (p + 2)| ≥ ... ≥ |dp − (p + 2)|.

• Sliced inverse regression (SIR) uses a dependent variable y and finds a transformation matrix W which satisfies

WE(xx′)W′ = Ip and WE(E(x|y)E(x|y)′)W′ = D

where the diagonal elements of D are d1 ≥ ... ≥ dp ≥ 0.
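A minimal numpy sketch of the PCA and FOBI decompositions above, with sample moments in place of expectations. This is an illustrative reconstruction from the formulas on this slide, not the authors' code; the SIR decomposition would be analogous with a slicing estimate of E(x|y):

```python
import numpy as np

def pca_wd(X):
    """PCA: rows of W are eigenvectors of the covariance matrix, so that
    WW' = I and W Cov(x) W' = D with decreasing diagonal."""
    Xc = X - X.mean(axis=0)
    S = np.cov(Xc, rowvar=False)
    d, U = np.linalg.eigh(S)                 # ascending eigenvalues
    order = np.argsort(d)[::-1]
    return U[:, order].T, d[order]

def fobi_wd(X):
    """FOBI: W Cov(x) W' = I and W E(xx' Cov(x)^{-1} xx') W' = D,
    with the diagonal ordered by |d_i - (p + 2)|."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    S = np.cov(Xc, rowvar=False)
    evals, evecs = np.linalg.eigh(S)
    S_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T   # symmetric S^{-1/2}
    Y = Xc @ S_inv_sqrt                                     # whitened data
    # Fourth-moment matrix E(||y||^2 y y') of the whitened data.
    B = (Y * (Y ** 2).sum(axis=1, keepdims=True)).T @ Y / n
    d, U = np.linalg.eigh(B)
    order = np.argsort(np.abs(d - (p + 2)))[::-1]
    return U[:, order].T @ S_inv_sqrt, d[order]
```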

5

• The idea in dimension reduction is then that W = (W1′, W2′)′ where

– the k-dimensional W1x represents the information (signal), and

– the (p − k)-dimensional W2x represents noise.

6

Figure 1: Data set 1, Fisher’s Iris Data: Original variables. [Scatterplot matrix of Sepal.Length, Sepal.Width, Petal.Length and Petal.Width.]

7

Figure 2: Data set 1, Fisher’s Iris Data: Principal components. [Scatterplot matrix of Comp.1 to Comp.4.]

8

Figure 3: Data set 1, Fisher’s Iris Data: FOBI coordinates. [Scatterplot matrix of IC.1 to IC.4.]

9

Figure 4: Data set 2: Original variables. [Scatterplot matrix of V1 to V6.]

10

Figure 5: Data set 2: Principal components. [Scatterplot matrix of the six components, labelled V1 to V6.]

11

Figure 6: Data set 2: FOBI coordinates. [Scatterplot matrix of the six coordinates, labelled V1 to V6.]

12

Figure 7: Data set 3: Original variables. [Scatterplot matrix of V1 to V6.]

13

Figure 8: Data set 3: Principal components. [Scatterplot matrix of the six components, labelled V1 to V6.]

14

Figure 9: Data set 3: FOBI coordinates. [Scatterplot matrix of the six coordinates, labelled V1 to V6.]

15

Figure 10: Data set 4: Original variables. [Scatterplot matrix of X.1 to X.5 and y.]

16

Figure 11: Data set 4: SIR coordinates. [Scatterplot matrix of Z.1 to Z.5 and y.]

17

Testing whether W2x is noise

• In dimension reduction W = (W1′, W2′)′ and the k-variate W1x is assumed to carry the relevant information. We then wish to test the following null hypotheses, which state that W2x represents noise:

– PCA:

(i) H0 : W2x ∼ Np−k(0, σ²Ip−k),

(ii) H0 : W2x is spherically symmetric, or

(iii) H0 : W2x has exchangeable components.

– FOBI:

H0 : W2x ∼ Np−k(0, Ip−k).

– SIR:

H0 : (y,W1x) ⊥⊥ W2x (this implies y ⊥⊥ W2x | W1x and the linearity condition).

• Unconventional semiparametric bootstrapping is used in the following to test for these

hypotheses.

18

Test statistics for the dimension of W1x

• Let X = (x1, ..., xn)′ (or (y,X)) be a random sample from the distribution of x (or of (y,x)), and let W and D now denote natural sample estimates of the corresponding population transformation and eigenvalue matrices. We then have the following.

• PCA: H0 implies that d1 ≥ ... ≥ dk > dk+1 = ... = dp. We choose

T(X) = − log [ ( ∏_{i=k+1}^{p} di )^{1/(p−k)} / ( ∑_{i=k+1}^{p} di / (p−k) ) ],

i.e. minus the log of the ratio of the geometric mean to the arithmetic mean of dk+1, ..., dp.

• FOBI: H0 implies that d1 ≥ ... ≥ dk > dk+1 = ... = dp = p + 2. We choose

T(X) = ∑_{i=k+1}^{p} (di − (p + 2))².

• SIR: H0 implies that d1 ≥ ... ≥ dk > dk+1 = ... = dp = 0. We choose

T(y,X) = log [ ( ∏_{i=k+1}^{p} di )^{1/(p−k)} ]   (or ∑_{i=k+1}^{p} di²).
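As an illustration, these statistics could be computed from an estimated eigenvalue vector d as in the sketch below (for SIR the second, sum-of-squares variant is implemented):

```python
import numpy as np

def t_pca(d, k):
    """-log(geometric mean / arithmetic mean) of the p - k smallest eigenvalues."""
    tail = np.sort(d)[::-1][k:]
    geo_mean = np.exp(np.mean(np.log(tail)))
    return float(-np.log(geo_mean / tail.mean()))

def t_fobi(d, k, p):
    """Sum of squared deviations from p + 2 over the p - k 'noise' eigenvalues,
    the eigenvalues being ordered by |d_i - (p + 2)|."""
    tail = sorted(d, key=lambda di: abs(di - (p + 2)), reverse=True)[k:]
    return float(np.sum((np.asarray(tail) - (p + 2)) ** 2))

def t_sir(d, k):
    """Sum of squares of the p - k smallest SIR eigenvalues."""
    tail = np.sort(d)[::-1][k:]
    return float(np.sum(tail ** 2))
```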

19

Tests based on limiting distributions

• Let X = (x1, ..., xn)′ (or (y,X)) be a random sample from the distribution of x (or of (y,x)), and let W and D again denote natural sample estimates. We then have the following.

• PCA: Tyler (1981), Schott (2006), etc.

• FOBI: ?

• SIR: Li (1991), Bura and Cook (2001)

20

PCA: Strategies for bootstrapping

• Write Z = (X − 1nµ′)W′ and Z = (Z1, Z2) = (X − 1nµ′)(W1′, W2′).

• Our bootstrap samples X∗ under the null model are then obtained as follows.

1. Write Z̃ = (Z̃1, Z̃2) for a bootstrap sample of size n from {z1, ..., zn}.

2. Set Z∗1 = Z̃1 and

2.1 Z∗2 = (O1z̃21, ..., Onz̃2n)′ for n independent random orthogonal (p − k) × (p − k) matrices O1, ..., On (subsphericity of W2x), or

2.2 Z∗2 = (P1z̃21, ..., Pnz̃2n)′ for n independent random (p − k) × (p − k) permutation matrices P1, ..., Pn (exchangeability of W2x).

3. Write Z∗ = (Z∗1, Z∗2).

4. Write X∗ = Z∗(W′)−1 + 1nµ′.

• An estimated p-value for a bootstrap test with the test statistic T(X) is then obtained as M−1 #{j : T(X∗j) ≥ T(X)}, where X∗1, ..., X∗M are M independent bootstrap samples.
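A compact sketch of the whole PCA bootstrap test under option 2.1 (random rotations of the noise block). It reuses the hypothetical pca_wd and t_pca helpers from the earlier sketches and is an illustration of the scheme, not the authors' implementation:

```python
import numpy as np

def random_orthogonal(q, rng):
    """Random orthogonal q x q matrix (QR decomposition of a Gaussian matrix)."""
    Q, R = np.linalg.qr(rng.normal(size=(q, q)))
    return Q * np.sign(np.diag(R))

def pca_bootstrap_pvalue(X, k, M=200, seed=0):
    """Bootstrap p-value for H0: the last p - k PCA components are spherical noise."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    mu = X.mean(axis=0)
    W, d = pca_wd(X)                       # estimated W and D
    T_obs = t_pca(d, k)                    # observed test statistic
    Z = (X - mu) @ W.T                     # scores Z = (Z1, Z2)
    count = 0
    for _ in range(M):
        idx = rng.integers(0, n, size=n)   # step 1: bootstrap rows of Z
        Z1, Z2 = Z[idx, :k], Z[idx, k:]
        # Step 2.1: rotate every 'noise' row by an independent orthogonal matrix.
        Z2 = np.vstack([random_orthogonal(p - k, rng) @ row for row in Z2])
        Xstar = np.hstack([Z1, Z2]) @ np.linalg.inv(W.T) + mu   # steps 3-4
        _, dstar = pca_wd(Xstar)
        count += t_pca(dstar, k) >= T_obs
    return count / M
```

Option 2.2 would simply replace the random orthogonal matrices by random permutation matrices acting on each row of Z2.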

21

PCA: Simulation results

• 500 repetitions (random samples) for sample sizes n = 50, 100, 150, 200 were

generated from N5(0, diag(3, 2, 1, 1, 1)).

• For each random sample, M = 200 bootstrap samples were generated for null

hypotheses k = 3, 2, 1 under the assumptions of subsphericity (O) and

subexchangeability (P).

• The proportion of bootstrap p-values below 0.05 is reported in the following. The true

value is k = 2.

n    | k = 3 (O) | k = 3 (P) | k = 2 (O) | k = 2 (P) | k = 1 (O) | k = 1 (P)
50   | 0.026     | 0.020     | 0.032     | 0.028     | 0.416     | 0.392
100  | 0.016     | 0.016     | 0.044     | 0.050     | 0.828     | 0.814
150  | 0.016     | 0.016     | 0.054     | 0.060     | 0.970     | 0.972
200  | 0.022     | 0.018     | 0.036     | 0.034     | 0.998     | 0.998

22

FOBI: Strategies for bootstrapping

• Write Z = (X − 1nµ′)W′ and Z = (Z1, Z2) = (X − 1nµ′)(W1′, W2′).

• Our bootstrap samples X∗ under the null model are then obtained as follows.

1. Write Z∗1 for a matrix of componentwise bootstrap samples of size n from Z1.

2. Let Z∗2 be a random sample of size n from Np−k(0, Ip−k).

3. Write Z∗ = (Z∗1, Z∗2).

4. Write X∗ = Z∗(W′)−1 + 1nµ′.

• An estimated p-value for a bootstrap test with the test statistic T(X) is then obtained as M−1 #{j : T(X∗j) ≥ T(X)}, where X∗1, ..., X∗M are M independent bootstrap samples.
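Only steps 1-2 differ from the PCA scheme. A sketch of the FOBI version of the resampling step (again an illustration under the same assumptions as the earlier sketches, not the authors' code):

```python
import numpy as np

def fobi_null_scores(Z, k, rng):
    """Steps 1-2 of the FOBI scheme: componentwise bootstrap of the signal block
    Z1 and fresh N_{p-k}(0, I_{p-k}) rows for the noise block Z2."""
    n, p = Z.shape
    Z1 = np.column_stack(
        [rng.choice(Z[:, j], size=n, replace=True) for j in range(k)]
    )
    Z2 = rng.standard_normal(size=(n, p - k))
    return np.hstack([Z1, Z2])
```

The bootstrap data matrix is then X∗ = Z∗(W′)−1 + 1nµ′, and the p-value is computed exactly as on the PCA slide.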

23

FOBI: Simulation results

• 500 repetitions (random samples) for sample sizes n = 50, 100, 200, ..., 1000 were generated from a 5-variate independent component model (Settings 1 and 2 below).

• For each random sample, M = 200 bootstrap samples were generated for null hypotheses k = 3, 2, 1 under the assumption of subgaussianity (W2x Gaussian).

• The proportion of bootstrap p-values below 0.05 is reported in the following.

24

• FOBI, Setting 1: The distributions of the independent components are χ²₃, N(0, 1), N(0, 1), N(0, 1) and U(0, 1), and the mixing matrix is I5. The true value is k = 2.

n    | k = 3 | k = 2 | k = 1
50   | 0.026 | 0.030 | 0.044
100  | 0.028 | 0.050 | 0.104
200  | 0.030 | 0.062 | 0.236
500  | 0.018 | 0.062 | 0.890
1000 | 0.028 | 0.044 | 1.000

25

• FOBI, Setting 2: The distributions of the independent components are exp(1), t6, N(0, 1), N(0, 1), N(0, 1), and the mixing matrix is I5. The true value is k = 2.

n    | k = 3 | k = 2 | k = 1
50   | 0.038 | 0.034 | 0.102
100  | 0.040 | 0.048 | 0.206
200  | 0.058 | 0.094 | 0.468
500  | 0.024 | 0.058 | 0.798
1000 | 0.028 | 0.070 | 0.962

26

SIR: Strategies for bootstrapping

• Write Z = (X − 1nµ′)W′ and Z = (Z1, Z2) = (X − 1nµ′)(W1′, W2′).

• Our bootstrap samples (y∗,X∗) under the null model are then obtained as follows.

1. Let (y∗, Z∗1) be a bootstrap sample of size n from (y, Z1).

2. Let Z∗2 be a bootstrap sample of size n from Z2.
(The two bootstrap samples are drawn independently of each other.)

3. Write Z∗ = (Z∗1, Z∗2).

4. Write X∗ = Z∗(W′)−1 + 1nµ′.

• An estimated p-value for a bootstrap test with the test statistic T(y,X) is then obtained as M−1 #{j : T((y∗,X∗)j) ≥ T(y,X)}, where (y∗,X∗)1, ..., (y∗,X∗)M are M independent bootstrap samples.
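A sketch of the SIR resampling step, drawing the (y, Z1) rows and the Z2 rows independently (illustrative only, under the same assumptions as the earlier sketches):

```python
import numpy as np

def sir_null_sample(y, Z, k, rng):
    """SIR scheme: resample the (y, Z1) rows jointly and the Z2 rows separately,
    so that the bootstrap noise block is independent of (y*, Z1*)."""
    n = len(y)
    idx1 = rng.integers(0, n, size=n)   # rows for (y*, Z1*)
    idx2 = rng.integers(0, n, size=n)   # independent rows for Z2*
    return y[idx1], np.hstack([Z[idx1, :k], Z[idx2, k:]])
```

As before, X∗ = Z∗(W′)−1 + 1nµ′ completes the bootstrap sample.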

27

SIR: Simulation results

• 500 repetitions (random samples) for sample sizes n = 100, 200, 500, ..., 10000 were generated from a nonlinear model for the response y and 5-variate x (Settings 1 and 2 below).

• For each random sample, M = 200 bootstrap samples were generated for null hypotheses k = 3, 2, 1 under the assumption that (y,W1x) and W2x are independent.

• The proportion of bootstrap p-values below 0.05 is reported in the following.

28

• SIR, Setting 1: Now x ∼ N5(0, I5) and y = x1(x1 + x2 + 1) + ε, where ε ∼ N(0, 0.25) and ε ⊥⊥ x. Again, the true value is k = 2.

n     | k = 3 | k = 2 | k = 1
100   | 0.010 | 0.024 | 0.162
200   | 0.004 | 0.034 | 0.298
500   | 0.004 | 0.042 | 0.552
1000  | 0.012 | 0.038 | 0.740
2000  | 0.010 | 0.040 | 0.908
5000  | 0.006 | 0.046 | 0.982
10000 | 0.010 | 0.052 | 0.996
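For concreteness, Setting 1 data could be generated as in the short sketch below (taking 0.25 to be the variance of ε, an assumption of this sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
x = rng.standard_normal(size=(n, 5))            # x ~ N_5(0, I_5)
eps = rng.normal(scale=0.5, size=n)             # N(0, 0.25), 0.25 read as the variance
y = x[:, 0] * (x[:, 0] + x[:, 1] + 1) + eps     # Setting 1 response
# y depends on x only through (x1, x2), so the true signal dimension is k = 2.
```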

29

• SIR, Setting 2: Now x ∼ N5(0, I5) and y = x1 / (0.5 + (x2 + 1.5)²) + ε, where ε ∼ N(0, 0.25) and ε ⊥⊥ x. The true value is k = 2.

n     | k = 3 | k = 2 | k = 1
100   | 0.006 | 0.030 | 0.242
200   | 0.010 | 0.036 | 0.398
500   | 0.006 | 0.060 | 0.710
1000  | 0.004 | 0.032 | 0.856
2000  | 0.006 | 0.030 | 0.950
5000  | 0.002 | 0.028 | 0.986
10000 | 0.008 | 0.060 | 0.996

30

Final remarks

• FOBI and SIR serve here only as first examples of ICA (ICS) methods and of supervised dimension reduction methods. Our approach works for other methods as well.

• Comparison: Asymptotic tests vs. bootstrap tests

• How to robustify?

– PCA: Replace the covariance matrix by a robust scatter matrix (elliptic case)

– FOBI: Use two robust scatter matrices with the independence property

– SIR: Robustify both Cov(x) and Cov(x|y). Gather et al. (2001, 2002), Yohai and Noste

(2005)

31

• Estimation of k: test H0,0 first; if it is accepted, the estimate is k = 0. If it is rejected, test H0,1; if that is accepted, the estimate is k = 1; if rejected, test H0,2, and so on. The estimate is thus the smallest k for which H0,k is accepted.

Figure 12: Estimation through a stepwise testing procedure.
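A sketch of this stepwise rule, where pvalue_for_k is a hypothetical wrapper around one of the bootstrap tests above (e.g. the PCA bootstrap sketch):

```python
def estimate_k(pvalue_for_k, p, alpha=0.05):
    """Stepwise estimate of the signal dimension: the smallest k for which
    H_{0,k} is accepted at level alpha."""
    for k in range(p):
        if pvalue_for_k(k) >= alpha:
            return k
    return p   # every null rejected: no noise subspace detected
```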

32

Some references

Bura, E. and Cook, R.D. (2001). Extending Sliced Inverse Regression: the Weighted Chi-Squared Test. Journal of the American Statistical Association, 96, 996-1003.

Dray, S. (2008). On the number of principal components: A test of dimensionality based on measurements of similarity between matrices. Computational Statistics & Data Analysis, 52, 2228-2237.

Ilmonen, P., Serfling, R., and Oja, H. (2012). Invariant coordinate selection (ICS) functionals. International Statistical Review, 80, 93-110.

Li, K.C. (1991). Sliced Inverse Regression for Dimension Reduction. Journal of the American Statistical Association, 86, 316-342.

Liski, E., Nordhausen, K., and Oja, H. (2013). Supervised invariant coordinate selection. Statistics: A Journal of Theoretical and Applied Statistics, 48, 711-731.

Miettinen, J., Nordhausen, K., Oja, H. and Taskinen, S. (2015). Fourth moments and independent component analysis. Statistical Science, 30, 372-390.

Tyler, D.E. (1981). Asymptotic Inference for Eigenvectors. The Annals of Statistics, 9, 725-736.

Tyler, D.E., Critchley, F., Dümbgen, L. and Oja, H. (2009). Invariant coordinate selection. Journal of the Royal Statistical Society B, 71, 549-592.

33

THANK YOU FOR YOUR INTEREST!

34