1
Dependence Analysis with Reproducing Kernel Hilbert Spaces
Kenji Fukumizu
Institute of Statistical Mathematics
Graduate University for Advanced Studies
Based on collaborations with M. Jordan (UC Berkeley), F. Bach (Ecole Normale Supérieure), A. Gretton, X. Sun, and
B. Schölkopf (Max-Planck Institute)
7th World Congress on Statistics and Probability July 14-19, 2008. Singapore
2
Outline
Introduction
Independence and conditional independence with RKHS
Kernel dimension reduction for regression
Summary
“RKHS methods” for statistical inference
– Reproducing kernel Hilbert space (RKHS) / positive definite kernel: captures “nonlinearity” or “higher-order moments” of data, e.g. the support vector machine.
– Recent studies: RKHS applied to independence and conditional independence.
RKHS for statistical inference
[Diagram: data X in the original space Ω is mapped by the feature map Φ to Φ(X) in the RKHS H; linear methods are then applied on the transformed data.]
4
Positive definite kernel and RKHS
Positive definite kernel
Ω: set. k : Ω × Ω → R is positive definite if k(x, y) = k(y, x) and, for any n ∈ N and x_1, …, x_n ∈ Ω, the matrix (Gram matrix) ( k(x_i, x_j) )_{ij} is positive semidefinite.
– Example: Gaussian RBF kernel k(x, y) = exp( −‖x − y‖² / σ² ).
Reproducing kernel Hilbert space (RKHS)
k: positive definite kernel on Ω. There uniquely exists a Hilbert space H consisting of functions on Ω such that
1) k(·, x) ∈ H for all x ∈ Ω,
2) Span{ k(·, x) | x ∈ Ω } is dense in H,
3) ⟨f, k(·, x)⟩_H = f(x) for all f ∈ H and x ∈ Ω (reproducing property).
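To make the definitions concrete, here is a minimal numpy sketch (our own illustration, not from the talk): it builds the Gram matrix of the Gaussian RBF kernel on a random sample and confirms numerically that it is positive semidefinite.

    import numpy as np

    def gaussian_gram(X, sigma):
        """Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / sigma^2)."""
        sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
        return np.exp(-sq / sigma ** 2)

    X = np.random.randn(50, 3)           # 50 points in R^3
    K = gaussian_gram(X, sigma=1.0)
    eigs = np.linalg.eigvalsh(K)         # eigenvalues of a PSD matrix are >= 0
    print(eigs.min() >= -1e-10)          # True, up to numerical round-off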
5
How to use RKHS for data analysis?
Transform data into the RKHS by the feature map
Φ : Ω → H, x ↦ k(·, x), i.e. Φ(x) = k(·, x).
Data: X_1, …, X_N ↦ Φ(X_1), …, Φ(X_N) : functional data.
Illustration of dependence analysis with RKHS
[Diagram: X ∈ Ω_X and Y ∈ Ω_Y are mapped by the feature maps Φ_X and Φ_Y to Φ_X(X) ∈ H_X and Φ_Y(Y) ∈ H_Y; the dependence between X and Y is analyzed through these RKHS-valued variables.]
Why RKHS? Easy empirical computation
The inner product of H is efficiently computable, while the dimensionality may be infinite.
– The computational cost essentially depends on the sample size N (c.f. L2 inner product / power expansion).
– Advantageous for high-dimensional data of moderate sample size.
– Can be applied to non-Euclidean data (strings, graphs, etc.).
6
For f = Σ_{i=1}^N a_i Φ(x_i) and g = Σ_{j=1}^N b_j Φ(x_j),
⟨f, g⟩ = Σ_{i,j=1}^N a_i b_j k(x_i, x_j).
c.f. explicit power expansion: (X, Y, Z, W) ↦ (X, Y, Z, W, X², Y², Z², W², XY, XZ, XW, YZ, …).
⟨Φ(x), Φ(y)⟩ = k(x, y).
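In code, the inner product ⟨f, g⟩ is just the quadratic form aᵀKb of the Gram matrix, computed in O(N²) operations regardless of the dimensionality of H. A minimal sketch (our illustration, with an arbitrary Gaussian bandwidth):

    import numpy as np

    def gaussian_gram(X, sigma):
        sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
        return np.exp(-sq / sigma ** 2)

    X = np.random.randn(100, 5)
    a = np.random.randn(100)             # coefficients of f = sum_i a_i k(., x_i)
    b = np.random.randn(100)             # coefficients of g = sum_j b_j k(., x_j)
    K = gaussian_gram(X, sigma=2.0)
    inner = a @ K @ b                    # <f, g>_H = sum_{i,j} a_i b_j k(x_i, x_j)
    print(inner)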
7
Outline
Introduction
Independence and conditional independence with RKHS
Kernel dimension reduction for regression
Summary
8
Covariance on RKHS
(X, Y): random vector taking values on Ω_X × Ω_Y. (H_X, k_X), (H_Y, k_Y): RKHS on Ω_X and Ω_Y, resp.
Define random variables on the RKHSs H_X and H_Y by
Φ_X(X) = k_X(·, X), Φ_Y(Y) = k_Y(·, Y).
Def. Cross-covariance operator Σ_YX : H_X → H_Y:
⟨g, Σ_YX f⟩ = E[g(Y)f(X)] − E[g(Y)] E[f(X)] ( = Cov[f(X), g(Y)] ) for all f ∈ H_X, g ∈ H_Y,
i.e. Σ_YX = E[Φ_Y(Y) ⊗ Φ_X(X)] − E[Φ_Y(Y)] ⊗ E[Φ_X(X)].
c.f. ordinary covariance matrix: V_YX = Cov[Y, X] = E[YXᵀ] − E[Y]E[X]ᵀ.
9
Characterization of independence
Independence and cross-covariance operator
If the RKHSs are “rich enough” to express all the moments,
X ⊥⊥ Y ⇔ Σ_XY = O ⇔ E[g(Y)f(X)] = E[g(Y)] E[f(X)] for all f ∈ H_X, g ∈ H_Y.
f and g are test functions to compare the moments with respect to P_XY and P_X P_Y.
– Analog to Gaussian random vectors: X ⊥⊥ Y ⇔ V_YX = O.
– c.f. characteristic function: X ⊥⊥ Y ⇔ E[ e^{√−1 ωᵀX} e^{√−1 ηᵀY} ] = E[ e^{√−1 ωᵀX} ] E[ e^{√−1 ηᵀY} ] for all ω and η.
– Applied to independence tests (Gretton et al. 2008).
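The test of Gretton et al. (2008) is based on the Hilbert–Schmidt norm of the empirical cross-covariance operator (HSIC). A minimal sketch of the biased statistic ‖Σ̂_YX‖²_HS = Tr[K_X H K_Y H] / N², with H = I − 11ᵀ/N the centering matrix (our code; the bandwidth is arbitrary and no significance threshold is computed):

    import numpy as np

    def gaussian_gram(X, sigma):
        sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
        return np.exp(-sq / sigma ** 2)

    def hsic(X, Y, sigma=1.0):
        """Biased HSIC statistic = squared HS norm of the empirical Sigma_YX."""
        N = X.shape[0]
        H = np.eye(N) - np.ones((N, N)) / N      # centering matrix
        Kx = gaussian_gram(X, sigma)
        Ky = gaussian_gram(Y, sigma)
        return np.trace(H @ Kx @ H @ Ky) / N ** 2

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 1))
    print(hsic(X, rng.normal(size=(200, 1))))    # independent: close to 0
    print(hsic(X, X ** 2))                       # dependent: clearly larger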
10
Characteristic kernels
A class for determining a probability
X: random variable taking values on Ω. (H, k): RKHS on Ω with a bounded measurable kernel k.
H (or k) is called characteristic if, for probabilities P and Q on Ω,
E_{X~P}[f(X)] = E_{X~Q}[f(X)] for all f ∈ H means P = Q.
(H works as a class of test functions to determine a probability.)
– If H_X ⊗ H_Y, given by the product kernel k_X k_Y, is characteristic, then X ⊥⊥ Y ⇔ Σ_XY = O.
( Σ_XY = O ⇒ E_{P_XY}[f(X)g(Y)] = E_{P_X P_Y}[f(X)g(Y)] for all f, g ⇒ P_XY = P_X P_Y. )
– An example on R^m: Gaussian RBF kernel exp( −‖x − y‖² / σ² ).
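To see what the definition buys: E_{X~P}[f(X)] = ⟨f, m_P⟩ with the mean element m_P = E[k(·, X)], so a characteristic kernel means m_P determines P, and the RKHS distance ‖m_P − m_Q‖² = E[k(X, X′)] + E[k(Z, Z′)] − 2E[k(X, Z)] vanishes iff P = Q. A hedged sketch of its (biased) estimate from two samples (our code and sample choices):

    import numpy as np

    def gaussian_gram2(X, Z, sigma=1.0):
        sq = np.sum((X[:, None, :] - Z[None, :, :]) ** 2, axis=2)
        return np.exp(-sq / sigma ** 2)

    def mean_embedding_dist2(X, Z, sigma=1.0):
        """Biased estimate of ||m_P - m_Q||_H^2 for samples X ~ P, Z ~ Q."""
        return (gaussian_gram2(X, X, sigma).mean()
                + gaussian_gram2(Z, Z, sigma).mean()
                - 2 * gaussian_gram2(X, Z, sigma).mean())

    rng = np.random.default_rng(5)
    P1 = rng.normal(size=(500, 1))
    P2 = rng.normal(size=(500, 1))
    Q = rng.standard_t(df=3, size=(500, 1))      # a different distribution
    print(mean_embedding_dist2(P1, P2))          # near 0: same distribution
    print(mean_embedding_dist2(P1, Q))           # larger: distributions differ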
11
Estimation of cross-cov. operator
(X_1, Y_1), …, (X_N, Y_N): i.i.d. sample on Ω_X × Ω_Y.
Empirical estimator (rank ≤ N):
Σ̂_YX^(N) = (1/N) Σ_{i=1}^N k_Y(·, Y_i) ⊗ k_X(·, X_i) − ( (1/N) Σ_{i=1}^N k_Y(·, Y_i) ) ⊗ ( (1/N) Σ_{i=1}^N k_X(·, X_i) ).
Σ̂_YX^(N) is represented by the Gram matrices:
⟨g, Σ̂_YX^(N) f⟩ = (1/N) Σ_{i=1}^N g(Y_i) f(X_i) − ( (1/N) Σ_{i=1}^N g(Y_i) ) ( (1/N) Σ_{i=1}^N f(X_i) ).
Theorem
‖ Σ̂_YX^(N) − Σ_YX ‖_HS = O_p( N^{−1/2} ) (N → ∞).
– A uniform law of large numbers follows:
sup_{‖f‖_{H_X} ≤ 1, ‖g‖_{H_Y} ≤ 1} | Cov_emp[f(X), g(Y)] − Cov[f(X), g(Y)] | → 0 in pr. (N → ∞).
– Weak convergence of √N ( Σ̂_YX^(N) − Σ_YX ) to a Gaussian process on H_X ⊗ H_Y is also known.
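The Gram-matrix representation above is directly computable. In this sketch (our notation: f and g are expanded over the sample points with coefficient vectors a and b), ⟨g, Σ̂_YX^(N) f⟩ is simply the empirical covariance of the vectors (f(X_i))_i and (g(Y_i))_i:

    import numpy as np

    def gaussian_gram(X, sigma):
        sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
        return np.exp(-sq / sigma ** 2)

    N = 200
    rng = np.random.default_rng(1)
    X = rng.normal(size=(N, 2))
    Y = np.sin(X[:, :1]) + 0.1 * rng.normal(size=(N, 1))
    a = rng.normal(size=N)               # f = sum_j a_j k_X(., X_j)
    b = rng.normal(size=N)               # g = sum_j b_j k_Y(., Y_j)

    fX = gaussian_gram(X, 1.0) @ a       # (f(X_1), ..., f(X_N))
    gY = gaussian_gram(Y, 1.0) @ b       # (g(Y_1), ..., g(Y_N))
    # <g, Sigma_hat_YX f> = mean(f(X_i) g(Y_i)) - mean(f(X_i)) * mean(g(Y_i))
    cov_fg = np.mean(fX * gY) - np.mean(fX) * np.mean(gY)
    print(cov_fg)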
12
RKHS and conditional independence
Conditional covariance operator
X and Y: random variables. H_X, H_Y: RKHS with kernel k_X, k_Y, resp.
Def. Conditional covariance operator on H_Y:
Σ_YY|X ≡ Σ_YY − Σ_YX Σ_XX^{−1} Σ_XY.
(Analogous to the conditional covariance matrix V_YY − V_YX V_XX^{−1} V_XY.)
– Relation to conditional variance: if k_X is characteristic (e.g. Gaussian RBF kernel),
⟨g, Σ_YY|X g⟩ = E[ Var[g(Y) | X] ] = inf_{f ∈ H_X} E[ ( (g(Y) − E[g(Y)]) − (f(X) − E[f(X)]) )² ] for all g ∈ H_Y.
– Empirical estimator (can be represented by Gram matrices):
Σ̂_YY|X^(N) = Σ̂_YY^(N) − Σ̂_YX^(N) ( Σ̂_XX^(N) + ε_N I )^{−1} Σ̂_XY^(N),
where ε_N is a regularization coefficient.
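A short calculation with the Gram-matrix representation (our derivation from the formulas above, not stated on the slide) gives Tr[ Σ̂_YY|X^(N) ] = ε_N Tr[ G_Y ( G_X + Nε_N I )^{−1} ], with centered Gram matrices G_X = HK_XH and G_Y = HK_YH. A minimal sketch (helper names and bandwidths are ours):

    import numpy as np

    def gaussian_gram(X, sigma):
        sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
        return np.exp(-sq / sigma ** 2)

    def cond_cov_trace(X, Y, sigma=1.0, eps=1e-3):
        """Tr[ Sigma_hat_{YY|X} ] = eps * Tr[ G_Y (G_X + N*eps*I)^{-1} ]."""
        N = X.shape[0]
        H = np.eye(N) - np.ones((N, N)) / N
        Gx = H @ gaussian_gram(X, sigma) @ H
        Gy = H @ gaussian_gram(Y, sigma) @ H
        return eps * np.trace(np.linalg.solve(Gx + N * eps * np.eye(N), Gy))

    rng = np.random.default_rng(2)
    X = rng.normal(size=(300, 1))
    Y = X ** 2 + 0.1 * rng.normal(size=(300, 1))
    print(cond_cov_trace(X, Y))                          # small: X explains Y well
    print(cond_cov_trace(rng.normal(size=(300, 1)), Y))  # larger: residual variance remains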
13
Conditional independence
Theorem (FBJ 2004, 2006)
U, V, and Y are random variables on Ω_U, Ω_V, and Ω_Y, resp. H_U, H_V, H_Y: RKHS on Ω_U, Ω_V, Ω_Y with kernel k_U, k_V, k_Y, resp. X = (U, V), and the RKHS on Ω_X = Ω_U × Ω_V is defined by k_X = k_U k_V. Assume H_X and H_U are characteristic. Then
Σ_YY|U ≥ Σ_YY|X,
where ≥ is the partial order of self-adjoint operators (A ≤ B means that B − A is positive semidefinite). If further H_Y is characteristic, then
Σ_YY|U = Σ_YY|X ⇔ Y ⊥⊥ X | U.
Tr[ Σ_YY|U − Σ_YY|X ] works as a measure of conditional independence.
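In code, the only new ingredient for the product kernel k_X = k_U k_V is that the Gram matrix of X = (U, V) is the elementwise (Hadamard) product of K_U and K_V. A self-contained sketch of the empirical measure (our construction; bandwidths and ε are placeholders):

    import numpy as np

    def gaussian_gram(X, sigma=1.0):
        sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
        return np.exp(-sq / sigma ** 2)

    def trace_term(Kx, Ky, eps=1e-3):
        # eps * Tr[ G_Y (G_X + N*eps*I)^{-1} ] with centered Gram matrices
        N = Kx.shape[0]
        H = np.eye(N) - np.ones((N, N)) / N
        Gx, Gy = H @ Kx @ H, H @ Ky @ H
        return eps * np.trace(np.linalg.solve(Gx + N * eps * np.eye(N), Gy))

    rng = np.random.default_rng(3)
    U = rng.normal(size=(300, 1))
    V = rng.normal(size=(300, 1))
    Y = U + 0.1 * rng.normal(size=(300, 1))      # here Y _||_ X | U holds
    Ku, Kv, Ky = gaussian_gram(U), gaussian_gram(V), gaussian_gram(Y)
    measure = trace_term(Ku, Ky) - trace_term(Ku * Kv, Ky)   # K_X = K_U * K_V elementwise
    print(measure)                               # near 0 when Y _||_ X | U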
14
Outline
Introduction
Independence and conditional independence with RKHS
Kernel dimension reduction for regression
Summary
15
Dimension reduction for regression
– Regression: Y: response variable, X = (X_1, …, X_m): m-dimensional explanatory variable.
– Goal of dimension reduction for regression = find an effective direction for regression (EDR space):
p(Y | X) = p̃(Y | b_1ᵀX, …, b_dᵀX) = p̃(Y | BᵀX), i.e. Y ⊥⊥ X | BᵀX,
where B = (b_1, …, b_d) is an m × d matrix and d is fixed.
– Existing methods: Sliced Inverse Regression (SIR, Li 1991), principal Hessian directions (pHd, Li 1992), SAVE (Cook & Weisberg 1991), MAVE (Xia et al. 2002), contour regression (Li et al. 2005), among others.
16
Kernel Dimension Reduction
(Fukumizu, Bach, Jordan 2004, 2006)
Use characteristic kernels for BᵀX and Y. Then
Σ_YY|BᵀX ≥ Σ_YY|X, and Σ_YY|BᵀX = Σ_YY|X ⇔ Y ⊥⊥ X | BᵀX (EDR space).
– KDR objective function:
min_{B : BᵀB = I_d} Tr[ Σ_YY|BᵀX ].
– KDR contrast function with finite sample:
min_{B : BᵀB = I_d} Tr[ G_Y ( G_{BᵀX} + N ε_N I_N )^{−1} ],
where ( K_{BᵀX} )_{ij} = k_d( BᵀX_i, BᵀX_j ) and
G_{BᵀX} = ( I_N − (1/N) 1_N 1_Nᵀ ) K_{BᵀX} ( I_N − (1/N) 1_N 1_Nᵀ ) : centered Gram matrix.
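A minimal sketch of evaluating this contrast for a candidate B (our code; the kernel bandwidth and ε_N are placeholders, and the optimization over {B : BᵀB = I_d}, done in the talk by a gradient method with annealing, is not shown):

    import numpy as np

    def gaussian_gram(Z, sigma):
        sq = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=2)
        return np.exp(-sq / sigma ** 2)

    def kdr_contrast(B, X, Y, sigma=1.0, eps=1e-3):
        """Tr[ G_Y (G_{B^T X} + N*eps*I)^{-1} ]; smaller is better."""
        N = X.shape[0]
        H = np.eye(N) - np.ones((N, N)) / N
        Gz = H @ gaussian_gram(X @ B, sigma) @ H   # centered Gram of B^T X_i
        Gy = H @ gaussian_gram(Y, sigma) @ H
        return np.trace(np.linalg.solve(Gz + N * eps * np.eye(N), Gy))

    rng = np.random.default_rng(4)
    X = rng.normal(size=(200, 4))
    Y = (X[:, :1] + 1.0) ** 2 + 0.1 * rng.normal(size=(200, 1))
    B_true = np.array([[1.0], [0.0], [0.0], [0.0]])    # true EDR direction e_1
    B_rand = np.linalg.qr(rng.normal(size=(4, 1)))[0]  # random orthonormal direction
    print(kdr_contrast(B_true, X, Y), kdr_contrast(B_rand, X, Y))  # true B typically scores lower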
17
KDR method
Wide applicability of KDR
– The most general approach to dimension reduction:
• no model is used for p(Y|X) or p(X),
• no strong assumptions on the distribution of X and Y or on the dimensionality/type of Y.
– Most conventional methods have some restrictions.
Computational issues
– Computational cost: the Gram matrices are of the size of the sample. Use a low-rank approximation, e.g. incomplete Cholesky decomposition (a sketch follows this list).
– Non-convex contrast function, possibly with local minima: use a gradient method with an annealing technique, starting from a large σ in the Gaussian RBF kernel.
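For the low-rank step, a textbook pivoted (“incomplete”) Cholesky sketch (our code, not the talk's implementation): it factorizes K ≈ LLᵀ with rank r ≪ N, reducing the subsequent linear algebra from O(N³) toward O(Nr²):

    import numpy as np

    def incomplete_cholesky(K, tol=1e-6):
        # Pivoted Cholesky: K ~ L @ L.T, stopping when the residual diagonal is small.
        N = K.shape[0]
        d = np.diagonal(K).copy()        # residual diagonal
        L = np.zeros((N, 0))
        while d.max() > tol:
            i = int(np.argmax(d))        # pivot on the largest residual
            col = (K[:, i] - L @ L[i]) / np.sqrt(d[i])
            L = np.column_stack([L, col])
            d -= col ** 2
        return L

    rng = np.random.default_rng(6)
    X = rng.normal(size=(500, 2))
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    K = np.exp(-sq)                      # Gaussian Gram matrix, N = 500
    L = incomplete_cholesky(K)
    print(L.shape[1], np.abs(K - L @ L.T).max())   # rank << 500, small error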
18
Consistency of KDR
Theorem (FBJ 2006)
Suppose k_d is bounded and continuous, and
ε_N → 0, N^{1/2} ε_N → ∞ (N → ∞).
Let S_0 be the set of the optimal parameters:
S_0 = { B | BᵀB = I_d, Tr[ Σ_YY|BᵀX ] = min_{B'} Tr[ Σ_YY|B'ᵀX ] }.
Estimator:
B̂^(N) = arg min_{B : BᵀB = I_d} Tr[ G_Y ( G_{BᵀX} + N ε_N I_N )^{−1} ].
Then, under some conditions, for any open set U ⊃ S_0,
Pr( B̂^(N) ∈ U ) → 1 (N → ∞).
19
Numerical results with KDR
Synthetic data (A)
Y = 0.5 (X_1 + 1.5)² + (X_2 + 1)² + W,
X: 4 dim., X ~ N(0, I_4); W ~ N(0, τ²), τ = 0.1, 0.4, 0.8.
Sample size N = 100. Frobenius norms of the projection matrices over 100 samples (means ± standard deviations):

τ      KDR            SIR            SAVE           pHd
0.1    0.11 ± 0.07    0.55 ± 0.28    0.77 ± 0.35    1.04 ± 0.34
0.4    0.17 ± 0.09    0.60 ± 0.27    0.82 ± 0.34    1.03 ± 0.33
0.8    0.34 ± 0.22    0.69 ± 0.25    0.94 ± 0.35    1.06 ± 0.33
20
Synthetic data (B)
Y = (1/2) (X_1 − a)² W,
W ~ N(0, 1), a = 0, 0.5, 1; X: 10 dim., X ~ N(0, I_10).
Sample size N = 500.
Frobenius norms as in (A):

a      KDR            SIR            SAVE           pHd
0.0    0.17 ± 0.05    1.83 ± 0.22    0.30 ± 0.07    1.48 ± 0.27
0.5    0.17 ± 0.04    0.58 ± 0.19    0.35 ± 0.08    1.52 ± 0.28
1.0    0.18 ± 0.05    0.30 ± 0.08    0.57 ± 0.20    1.58 ± 0.28
21
KDR on Real data
Wine data
Data: 13 dim., 178 data points, 3 classes; 2-dim. projection.
Kernel for KDR: k(z_1, z_2) = exp( −‖z_1 − z_2‖² / σ² ), σ = 30.
[Figure: 2-dim. projections of the wine data by KDR (σ = 30), CCA, Partial Least Squares, and Sliced Inverse Regression.]
22
Swiss bank notes data
X: 6 dim. (measurements of each bank note). Y: binary (genuine/counterfeit). 100 counterfeit and 100 genuine notes.
Kernel for KDR: k(z_1, z_2) = exp( −‖z_1 − z_2‖² / a ).
[Figure: 2-dim. KDR projections with the Gaussian kernel (a = 10, 100, 10000) and the linear kernel, compared with SAVE.]
23
Summary
Positive definite kernels give a nice tool for dependence analysis
– Covariance and conditional covariance operators on RKHS characterize independence and conditional independence.
Kernel dimension reduction for regression (KDR)
– The most general approach to dimension reduction.
Future/ongoing studies
– Choice of kernel, beyond heuristics.
– Choice of the dimensionality d for KDR.
– Further asymptotic properties of the KDR estimator.
References
Fukumizu, K., F.R. Bach, and M.I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5(Jan):73–99, 2004.
Fukumizu, K., F.R. Bach, and M.I. Jordan. Kernel dimension reduction in regression. Technical Report 715, Dept. of Statistics, University of California, Berkeley, 2006.
Gretton, A., K. Fukumizu, C.H. Teo, L. Song, B. Schölkopf, and A. Smola. A kernel statistical test of independence. Advances in Neural Information Processing Systems 20:585–592, 2008.
Fukumizu, K., A. Gretton, X. Sun, and B. Schölkopf. Kernel measures of conditional dependence. Advances in Neural Information Processing Systems 20:489–496, 2008.
24