Multiple Kernel Learning
Hossein Hajimirsadeghi
School of Computing Science
Simon Fraser University
November 5, 2013
Introduction - SVM

Decision function:
f(x) = w·φ(x) + b
Separating hyperplane: w·φ(x) + b = 0, with margin boundaries w·φ(x) + b = ±1.
Maximizing the margin 1/‖w‖ is equivalent to the hard-margin problem:
min_{w,b} (1/2)‖w‖²   s.t.  y_i (w·φ(x_i) + b) ≥ 1, ∀i
Introduction - SVM (soft margin)

Allowing margin violations with slack variables ξ_i:
min_{w,b,ξ} (1/2)‖w‖² + C Σ_i ξ_i
s.t.  y_i (w·φ(x_i) + b) ≥ 1 − ξ_i,  ξ_i ≥ 0, ∀i

Eliminating the slacks gives the unconstrained hinge-loss form:
min_{w,b} (1/2)‖w‖² + C Σ_i max(0, 1 − y_i (w·φ(x_i) + b))
i.e.  Regularizer + Loss function Σ_i l(f(x_i), y_i)
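The hinge-loss objective above is straightforward to evaluate directly. A minimal numpy sketch with a linear feature map (φ(x) = x) and a hypothetical two-point toy dataset:

```python
import numpy as np

def svm_objective(w, b, X, y, C):
    """Soft-margin SVM objective: (1/2)||w||^2 + C * sum of hinge losses.

    X: (n, d) feature matrix (here phi(x) = x, i.e. a linear kernel),
    y: (n,) labels in {-1, +1}.
    """
    margins = y * (X @ w + b)                # y_i (w . x_i + b)
    hinge = np.maximum(0.0, 1.0 - margins)   # max(0, 1 - margin)
    return 0.5 * np.dot(w, w) + C * hinge.sum()

# Toy check: both points lie outside the margin, so only the
# regularizer contributes.
X = np.array([[2.0, 0.0], [-2.0, 0.0]])
y = np.array([1.0, -1.0])
w = np.array([1.0, 0.0])
print(svm_objective(w, b=0.0, X=X, y=y, C=1.0))  # 0.5 = (1/2)||w||^2
```

Points with margin y_i f(x_i) ≥ 1 contribute zero loss; the C parameter trades the regularizer against the total hinge loss.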
SVM: Optimization Problem

min_{w,b,ξ} (1/2)‖w‖² + C Σ_i ξ_i
s.t.  y_i (w·φ(x_i) + b) ≥ 1 − ξ_i,  ξ_i ≥ 0, ∀i

Lagrangian (with multipliers α_i ≥ 0, μ_i ≥ 0):
L(w, b, ξ, α, μ) = (1/2)‖w‖² + C Σ_i ξ_i − Σ_i α_i [ y_i (w·φ(x_i) + b) − 1 + ξ_i ] − Σ_i μ_i ξ_i
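The dual on the following slides comes from the standard stationarity conditions of this Lagrangian; as a reminder (the intermediate steps are not on the slide):

```latex
% Stationarity of L(w, b, \xi, \alpha, \mu) in the primal variables:
\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_i \alpha_i y_i \,\phi(x_i),
\qquad
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_i \alpha_i y_i = 0,
\qquad
\frac{\partial L}{\partial \xi_i} = 0 \;\Rightarrow\; C - \alpha_i - \mu_i = 0.
% Substituting back eliminates (w, b, \xi); together with \mu_i \ge 0
% this yields the box constraint 0 \le \alpha_i \le C and the dual
\max_{\alpha}\; \sum_i \alpha_i \;-\; \tfrac{1}{2}\sum_{i,j}
\alpha_i \alpha_j\, y_i y_j\, \phi(x_i)\!\cdot\!\phi(x_j).
```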
SVM: Dual

Primal:
min_{w,b,ξ} (1/2)‖w‖² + C Σ_i ξ_i
s.t.  y_i (w·φ(x_i) + b) ≥ 1 − ξ_i,  ξ_i ≥ 0, ∀i

Dual:
max_α Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j φ(x_i)·φ(x_j)
s.t.  Σ_i α_i y_i = 0,  0 ≤ α_i ≤ C, ∀i
SVM: Dual

max_α Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j φ(x_i)·φ(x_j)
s.t.  Σ_i α_i y_i = 0,  0 ≤ α_i ≤ C, ∀i

Resulting classifier:
f(x) = w·φ(x) + b = Σ_i α_i y_i φ(x_i)·φ(x) + b
b = y_j − Σ_i α_i y_i φ(x_i)·φ(x_j)  (for any support vector x_j)

The data enter only through inner products φ(x_i)·φ(x_j) = K(x_i, x_j).
Kernel Methods

Define K : X × X → ℝ, called a kernel, such that
K(x, y) = φ(x)·φ(y)

Ideas:
• K is often interpreted as a similarity measure.
• Benefits: efficiency, flexibility.

Example (polynomial kernel, x, y ∈ ℝ²):
K(x, y) = (x·y + c)² = (x₁y₁ + x₂y₂ + c)²
corresponds to the explicit feature map
φ(x) = ( x₁², x₂², √2 x₁x₂, √(2c) x₁, √(2c) x₂, c )
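The identity K(x, y) = φ(x)·φ(y) for this polynomial kernel can be checked numerically. A short sketch (the function names are illustrative):

```python
import numpy as np

def poly_kernel(x, y, c):
    """K(x, y) = (x . y + c)^2 for 2-D inputs."""
    return (np.dot(x, y) + c) ** 2

def feature_map(x, c):
    """Explicit map phi with K(x, y) = phi(x) . phi(y)."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([x1**2, x2**2, s * x1 * x2,
                     s * np.sqrt(c) * x1, s * np.sqrt(c) * x2, c])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
c = 2.0
# Both sides evaluate to (1*3 + 2*(-1) + 2)^2 = 9.
print(poly_kernel(x, y, c))
print(np.dot(feature_map(x, c), feature_map(y, c)))
```

The efficiency benefit is visible here: the kernel needs one inner product in ℝ², while the explicit map works in a 6-dimensional space (and the gap grows quickly with the polynomial degree).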
Kernelized SVM

max_α Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j φ(x_i)·φ(x_j)
s.t.  Σ_i α_i y_i = 0,  0 ≤ α_i ≤ C, ∀i
Classifier:  f(x) = w·φ(x) + b = Σ_i α_i y_i φ(x_i)·φ(x) + b,  b = y_j − Σ_i α_i y_i φ(x_i)·φ(x_j)

Kernelized:
max_α Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j K(x_i, x_j)
s.t.  Σ_i α_i y_i = 0,  0 ≤ α_i ≤ C, ∀i
Classifier:  f(x) = Σ_i α_i y_i K(x_i, x) + b,  b = y_j − Σ_i α_i y_i K(x_i, x_j)
Kernelized SVM (matrix form)

Gram matrix:
K = [ K(x₁, x₁)  K(x₁, x₂)  …  K(x₁, x_N)
      K(x₂, x₁)  K(x₂, x₂)  …  K(x₂, x_N)
      …
      K(x_N, x₁) K(x_N, x₂) …  K(x_N, x_N) ]

max_α 1ᵀα − (1/2) αᵀ Y K Y α
Subject to  1ᵀYα = 0,  0 ≤ α ≤ C
(Y = diag(y))
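The Gram matrix above can be assembled directly from any kernel function; for a valid kernel it must be symmetric positive semidefinite. A small numpy sketch (the RBF kernel with γ = 1 and the random data are arbitrary choices for illustration):

```python
import numpy as np

def gram_matrix(X, kernel):
    """N x N matrix K with K[i, j] = kernel(x_i, x_j)."""
    N = len(X)
    K = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            K[i, j] = kernel(X[i], X[j])
    return K

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2))  # RBF kernel, gamma = 1

K = gram_matrix(X, rbf)
# A valid kernel yields a symmetric PSD Gram matrix with unit diagonal
# (for the RBF kernel, K(x, x) = 1).
print(np.allclose(K, K.T), np.min(np.linalg.eigvalsh(K)) > -1e-10)
```

Once K is in hand, the dual above is a quadratic program in α over the box and equality constraints; off-the-shelf QP or SVM solvers accept such a precomputed matrix.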
Ideal Kernel Matrix

The ideal kernel matrix is K = yyᵀ, i.e.
K(x_i, x_j) = y_i y_j = +1 if y_i = y_j, −1 otherwise.

Plugging it into the classifier f(x) = Σ_i α_i y_i K(x_i, x) + b: for a point x with true label y,
f(x) = Σ_i α_i y_i (y_i y) + b = y Σ_i α_i y_i² + b = y Σ_i α_i + b
so the sign of f(x) matches the true label.
Motivation for MKL

• The success of an SVM depends on the choice of a good kernel:
  – How to choose the kernel function and its parameters?
• Practical problems involve multiple heterogeneous data sources:
  – How can kernels help to fuse features, especially features from different modalities?
Multiple Kernel Learning

General MKL:
K(x_i, x_j) = f( { K_m(x_i^m, x_j^m) }_{m=1}^P )

Linear MKL:
K(x_i, x_j) = Σ_{m=1}^P η_m K_m(x_i^m, x_j^m)

(x_i^m denotes the representation of example i in the m-th feature space.)
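Linear MKL with nonnegative weights keeps the combined matrix a valid (PSD) kernel, since a nonnegative combination of PSD matrices is PSD. A minimal numpy sketch with two toy base kernels and hypothetical weights:

```python
import numpy as np

def combine_kernels(Ks, eta):
    """Linear MKL: K = sum_m eta_m * K_m (eta_m >= 0 keeps K PSD)."""
    Ks = np.asarray(Ks)       # (P, N, N) stack of base Gram matrices
    eta = np.asarray(eta)     # (P,) kernel weights
    assert np.all(eta >= 0), "negative weights may break PSD-ness"
    return np.tensordot(eta, Ks, axes=1)

# Two toy base kernels on 3 points: a linear kernel and an RBF kernel.
X = np.array([[0.0], [1.0], [2.0]])
K_lin = X @ X.T
K_rbf = np.exp(-np.square(X - X.T))
K = combine_kernels([K_lin, K_rbf], eta=[0.7, 0.3])
print(np.min(np.linalg.eigvalsh(K)) > -1e-10)  # combined matrix stays PSD
```

In the heterogeneous-data setting, each K_m would come from a different feature channel (e.g. one per modality), and learning η is exactly the MKL problem discussed next.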
MKL Algorithms

• Fixed rules
• Heuristic approaches
• Similarity optimization
  – Maximizing the similarity to the ideal kernel matrix
• Structural risk optimization
  – Minimizing "regularization term" + "error term"
Similarity Optimization

• Similarity measures:
  – Kernel alignment
  – Euclidean distance
  – Kullback-Leibler (KL) divergence

Kernel alignment:
A(K₁, K₂) = ⟨K₁, K₂⟩ / √( ⟨K₁, K₁⟩ ⟨K₂, K₂⟩ )
where  ⟨K₁, K₂⟩ = Σ_{i,j} K₁(x_i, x_j) K₂(x_i, x_j)

MKL objective: maximize A(K, yyᵀ).
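The alignment above is a few lines of numpy. A sketch, with a check that the ideal kernel yyᵀ is perfectly aligned with itself:

```python
import numpy as np

def alignment(K1, K2):
    """Kernel alignment A(K1, K2) = <K1, K2> / sqrt(<K1, K1><K2, K2>),
    where <K1, K2> is the Frobenius inner product sum_ij K1[i,j]*K2[i,j]."""
    inner = np.sum(K1 * K2)
    return inner / np.sqrt(np.sum(K1 * K1) * np.sum(K2 * K2))

y = np.array([1.0, 1.0, -1.0])
ideal = np.outer(y, y)                # ideal kernel matrix y y^T
print(alignment(ideal, ideal))         # perfectly aligned: 1.0
print(alignment(np.eye(3), ideal))     # identity vs ideal: 3 / sqrt(9 * 3)
```

By Cauchy-Schwarz, A always lies in [−1, 1], with 1 meaning the two Gram matrices are positive multiples of each other.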
Similarity Optimization

• Lanckriet et al. (2004):
max_K A(K, yyᵀ)
s.t.  tr(K) = 1,  K ⪰ 0,  K = Σ_{m=1}^P η_m K_m

Can be converted to a semidefinite programming (SDP) problem.
Better results: centered kernel alignment (Cortes et al. 2010).
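Centered alignment first removes the mean from each Gram matrix. A sketch assuming the usual centering K_c = HKH with H = I − (1/n)11ᵀ; one consequence, checked below, is that the alignment becomes invariant to adding a constant to every kernel entry:

```python
import numpy as np

def center(K):
    """Center a Gram matrix: Kc = H K H with H = I - (1/n) 11^T."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def centered_alignment(K1, K2):
    """Centered kernel alignment: <Kc1, Kc2> / (||Kc1||_F ||Kc2||_F)."""
    Kc1, Kc2 = center(K1), center(K2)
    return np.sum(Kc1 * Kc2) / (np.linalg.norm(Kc1) * np.linalg.norm(Kc2))

y = np.array([1.0, 1.0, -1.0, -1.0])
ideal = np.outer(y, y)
print(centered_alignment(ideal, ideal))        # 1.0
print(centered_alignment(ideal + 5.0, ideal))  # constant shift: still 1.0
```

Plain alignment can be inflated by a large constant offset in the kernel values; centering removes that effect, which is one reason Cortes et al. report better correlation with downstream accuracy.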
Structural Risk Optimization

min_η r(η) + J(K(η))
Subject to  K(η) ⪰ 0

where r(η) is a regularizer on the kernel parameters and J(K(η)) is the SVM dual objective:
J(K(η)) = max_α 1ᵀα − (1/2) αᵀ Y K(η) Y α
Subject to  1ᵀYα = 0,  0 ≤ α ≤ C
Structural Risk Optimization

General MKL (Varma et al. 2009):
min_η r(η) + J(K(η))
Subject to  K(η) ⪰ 0,  with  J(K(η)) = max_α 1ᵀα − (1/2) αᵀ Y K(η) Y α

Coordinate descent algorithm:
1. Fix the kernel parameters η and solve for α (a standard SVM).
2. Fix α and update η by a gradient step:
   ∂/∂η [ r(η) + J(K(η)) ] = ∂r/∂η − (1/2) αᵀ Y (∂K(η)/∂η) Y α
Structural Risk: Another View

Linear MKL:
K(x_i, x_j; η) = Σ_{m=1}^P η_m K_m(x_i^m, x_j^m) = φ_η(x_i)·φ_η(x_j),  η_m ≥ 0

with the stacked, scaled feature map
φ_η(x) = ( √η₁ φ₁(x¹), √η₂ φ₂(x²), …, √η_P φ_P(x^P) )
Structural Risk: Another View

f_{w,b,η}(x) = w·φ_η(x) + b = Σ_{m=1}^P √η_m w̃_m·φ_m(x^m) + b,  where w = (w̃₁, …, w̃_P)

Writing d_m = η_m and rescaling w_m = w̃_m / √d_m (so that ‖w‖² = Σ_m d_m ‖w_m‖²):
f_{w,b,d}(x) = Σ_{m=1}^P d_m w_m·φ_m(x^m) + b

Each weight d_m scales the contribution of the m-th feature channel.
Structural Risk: Another View

min_{w,b,d,ξ} (1/2) Σ_m d_m ‖w_m‖² + C Σ_i ξ_i
s.t.  y_i ( Σ_{m=1}^P d_m w_m·φ_m(x_i^m) + b ) ≥ 1 − ξ_i,  ξ_i ≥ 0, ∀i

Substituting v_m := d_m w_m (so that d_m ‖w_m‖² = ‖v_m‖² / d_m):
min_{v,b,d,ξ} (1/2) Σ_m (1/d_m) ‖v_m‖² + C Σ_i ξ_i
s.t.  y_i ( Σ_{m=1}^P v_m·φ_m(x_i^m) + b ) ≥ 1 − ξ_i,  ξ_i ≥ 0, ∀i
Structural Risk Optimization

SimpleMKL (Rakotomamonjy et al. 2008):
min_d J(d)  Such that  Σ_{m=1}^P d_m = 1,  d_m ≥ 0

where
J(d) = min_{v,b,ξ} (1/2) Σ_m (1/d_m) ‖v_m‖² + C Σ_i ξ_i
s.t.  y_i ( Σ_{m=1}^P v_m·φ_m(x_i^m) + b ) ≥ 1 − ξ_i,  ξ_i ≥ 0, ∀i
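SimpleMKL alternates between solving a standard SVM with the combined kernel Σ_m d_m K_m (which gives J(d) and its dual solution α) and updating d. A minimal numpy sketch of one d-update, assuming α comes from an SVM solver (here a made-up toy value) and simplifying SimpleMKL's reduced-gradient step with line search to a single projected-gradient step:

```python
import numpy as np

def mkl_gradient(alpha, y, Ks):
    """dJ/dd_m = -(1/2) (alpha*y)^T K_m (alpha*y), from the SVM dual with
    combined kernel K(d) = sum_m d_m K_m (alpha held at the dual optimum)."""
    beta = alpha * y
    return np.array([-0.5 * beta @ K @ beta for K in Ks])

def project_simplex(d):
    """Euclidean projection onto {d : d >= 0, sum_m d_m = 1}."""
    u = np.sort(d)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(d)) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(d + theta, 0.0)

# One (hypothetical) update step; Ks, alpha are toy stand-ins.
Ks = [np.eye(3), np.ones((3, 3))]
alpha = np.array([0.5, 0.5, 1.0])
y = np.array([1.0, 1.0, -1.0])
d = np.array([0.5, 0.5])
d_new = project_simplex(d - 0.1 * mkl_gradient(alpha, y, Ks))
print(d_new)  # stays on the simplex: nonnegative, sums to 1
```

The full algorithm repeats "solve SVM with K(d), step d" until convergence; kernels whose weight is driven to 0 drop out of the combination, which is the sparsity the simplex constraint encourages.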
Multi-Class SVM

Score function:  f_w(x, y) = w·φ(x, y)
(e.g. the one-vs-all form f_{w,b}(x, y) = w_y·φ(x) + b_y)

min_{w,ξ} (1/2)‖w‖² + C Σ_i ξ_i
s.t.  f_w(x_i, y_i) − f_w(x_i, y) ≥ Δ(y_i, y) − ξ_i,  ∀i, y;  ξ_i ≥ 0

with the 0-1 margin  Δ(y, y′) = 0 if y = y′, 1 otherwise.
Equivalently,  ξ_i = max_{y≠y_i} ( Δ(y_i, y) + f_w(x_i, y) − f_w(x_i, y_i) ).
Latent SVM

Potential function:  F_w(x, h) = w·Φ(x, h)
Score:  f_w(x) = max_h w·Φ(x, h)

min_{w,ξ} (1/2)‖w‖² + C Σ_i ξ_i
s.t.  y_i f_w(x_i) ≥ 1 − ξ_i,  ξ_i ≥ 0, ∀i

[Figure: examples x₁, x₂, …, x_m, each with latent variables h₁, h₂, …, h_m; the feature map Φ(x, h) depends on the latent value, and max_h w·Φ(x_i, h) picks the best configuration.]
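When the latent set is small enough to enumerate, the score f_w(x) = max_h w·Φ(x, h) can be computed by brute force. A toy numpy sketch (the window-based feature map and the data are hypothetical, chosen only to make the max over h concrete):

```python
import numpy as np

def latent_score(w, x, H, phi):
    """f_w(x) = max over h in H of w . Phi(x, h), for an enumerable H.
    Returns the best score and the maximizing latent value h*."""
    scores = [w @ phi(x, h) for h in H]
    best = int(np.argmax(scores))
    return scores[best], H[best]

# Toy setup: x is a 1-D signal, h selects which length-2 window to score
# (loosely analogous to a part location in a deformable part model).
phi = lambda x, h: x[h:h + 2]
x = np.array([0.0, 3.0, 1.0, -2.0])
w = np.array([1.0, 1.0])
H = [0, 1, 2]                     # candidate window offsets

score, h_star = latent_score(w, x, H, phi)
print(score, h_star)               # window at h=1 scores 3 + 1 = 4
```

In real latent SVMs (e.g. deformable part models), h ranges over part placements and the max is computed with dynamic programming or distance transforms rather than enumeration.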
Multi-Class Latent SVM

Potential function:  F_w(x, h, y) = w·Φ(x, h, y)
Score:  f_w(x, y) = max_h w·Φ(x, h, y)

min_{w,ξ} (1/2)‖w‖² + C Σ_i ξ_i
s.t.  f_w(x_i, y_i) − f_w(x_i, y) ≥ Δ(y_i, y) − ξ_i,  ∀i, y;  ξ_i ≥ 0

[Figure: examples x₁, …, x_m with latent variables h₁, …, h_m; the score of class y is max_h w·Φ(x_i, h, y).]
Latent Kernelized Structural SVM

Wu and Jia 2012
F_w(x, h, y) = w·Φ(x, h, y),  f_w(x, y) = max_h w·Φ(x, h, y)

min_{w,ξ} (1/2)‖w‖² + C Σ_i ξ_i,  with
ξ_i = max( 0, 1 + max_{y≠y_i, h} F_w(x_i, h, y) − max_h F_w(x_i, h, y_i) )
Latent Kernelized Structural SVM

min_w (1/2)‖w‖² + C Σ_i max( 0, 1 + max_{y≠y_i, h} F_w(x_i, h, y) − max_h F_w(x_i, h, y_i) )

Find the dual. The dual variables α_{iu} are indexed by example i and configuration u = (h, y) ∈ S, so w becomes a combination of feature maps and
F_w(x, v) = Σ_i Σ_{u∈S} α_{iu} K( (x_i, u), (x, v) )

Inference:
f_w(x) = max_{v∈S} F_w(x, v) = max_{v∈S} Σ_i Σ_{u∈S} α_{iu} K( (x_i, u), (x, v) )

NO EFFICIENT EXACT SOLUTION in general: the max over latent configurations no longer decomposes once the score is a sum of kernel evaluations, e.g. how to compute max_{h_i, h_j} K( (x_i, h_i), (x_j, h_j) ) ?
Latent MKL

Vahdat et al. 2013: a latent version of SimpleMKL.

f_{w,d}(x) = max_h Σ_{m=1}^P d_m w_m·φ_m(x, h) + b

min_{v,b,d,ξ} (1/2) Σ_m (1/d_m) ‖v_m‖² + C Σ_i ξ_i
s.t.  y_i ( Σ_{m=1}^P v_m·φ_m(x_i, h_i) + b ) ≥ 1 − ξ_i,  ξ_i ≥ 0, ∀i;  d_m ≥ 0

For positive samples (y_i = 1), h_i is fixed to the inferred configuration h_i*; for negative samples (y_i = −1), the constraint must hold for all h.

Coordinate descent learning algorithm:
1. Perform inference for the positive samples (find h_i*).
2. Find the dual and solve the optimization problem as in SimpleMKL.
Some other works

• Hierarchical MKL (Bach 2008)
• Latent Kernel SVM (Yang et al. 2012)
• Deep MKL (Strobl and Visweswaran 2013)
References
• Gönen, M., & Alpaydın, E. (2011). Multiple kernel learning algorithms. Journal of Machine Learning Research, 12, 2211-2268.
• Rakotomamonjy, A., Bach, F., Canu, S., & Grandvalet, Y. (2008). SimpleMKL. Journal of Machine Learning Research, 9, 2491-2521.
• Varma, M., & Babu, B. R. (2009). More generality in efficient multiple kernel learning. In Proceedings of the 26th Annual International Conference on Machine Learning (pp. 1065-1072).
• Cortes, C., Mohri, M., & Rostamizadeh, A. (2010). Two-stage learning kernel algorithms. In Proceedings of the 27th International Conference on Machine Learning (pp. 239-246).
• Lanckriet, G. R., Cristianini, N., Bartlett, P., El Ghaoui, L., & Jordan, M. I. (2004). Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5, 27-72.
• Wu, X., & Jia, Y. (2012). View-invariant action recognition using latent kernelized structural SVM. In Computer Vision - ECCV 2012 (pp. 411-424). Springer.
• Felzenszwalb, P., McAllester, D., & Ramanan, D. (2008). A discriminatively trained, multiscale, deformable part model. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 1-8).
• Yang, W., Wang, Y., Vahdat, A., & Mori, G. (2012). Kernel Latent SVM for Visual Recognition. In Advances in Neural Information Processing Systems (pp. 818-826).
• Vahdat, A., Cannons, K., Mori, G., Oh, S., & Kim, I. (2013). Compositional Models for Video Event Detection: A Multiple Kernel Learning Latent Variable Approach. In IEEE International Conference on Computer Vision (ICCV).
• Cortes, C., Mohri, M., & Rostamizadeh, A. (2011). Learning Kernels. ICML 2011 Tutorial.