Machine Learning
Kernels and the Kernel Trick
Support vector machines
• Training by maximizing margin
• The SVM objective
• Solving the SVM optimization problem
• Support vectors, duals and kernels
This lecture
1. Support vectors
2. Kernels
3. The kernel trick
4. Properties of kernels
5. Another example of the kernel trick
So far we have seen
• Support vector machines
• Hinge loss and optimizing the regularized loss
More broadly, different algorithms for learning linear classifiers
What about non-linear models?
One way to learn non-linear models
Explicitly introduce non-linearity into the feature space
If the true separator is quadratic, transform all input points $(x_1, x_2)$ into a higher dimensional feature vector $\phi(x_1, x_2)$
Now, we can try to find a weight vector in this higher dimensional space
That is, predict using $w^T \phi(x_1, x_2) \ge b$
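As a small illustration of this lifting idea, here is a minimal sketch. It assumes one particular quadratic map, $\phi(x_1, x_2) = (x_1, x_2, x_1^2, x_2^2, x_1 x_2)$, which is only one of several possible choices; the helper names `phi` and `predict` and the unit-circle separator are hypothetical and chosen for illustration.

```python
def phi(x1, x2):
    """Assumed quadratic lifting of a 2-D point (one of several possible choices)."""
    return [x1, x2, x1 * x1, x2 * x2, x1 * x2]


def predict(w, b, x1, x2):
    """Linear rule in the lifted space: +1 if w^T phi(x1, x2) >= b, else -1."""
    score = sum(wi * fi for wi, fi in zip(w, phi(x1, x2)))
    return 1 if score >= b else -1


# Hypothetical separator: points on or outside the unit circle are positive,
# i.e. x1^2 + x2^2 >= 1. This is quadratic in (x1, x2) but linear in phi.
w = [0.0, 0.0, 1.0, 1.0, 0.0]
b = 1.0

print(predict(w, b, 2.0, 0.0))  # a point outside the circle
print(predict(w, b, 0.1, 0.2))  # a point inside the circle
```

The classifier itself stays linear; all of the non-linearity lives in `phi`.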
SVM: Primals and duals
The SVM objective
This is called the primal form of the objective
This can be converted to its dual form, which will let us prove a very useful property
The dual form is another optimization problem
It has the property that max Dual = min Primal
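For reference, one standard way to write the soft-margin primal with hinge loss, together with its dual, is sketched below. The trade-off constant $C$ and the omission of an explicit bias term are assumptions about notation, not something stated on this slide.

```latex
% Primal (assumed form: hinge loss, no explicit bias term)
\min_{w} \;\; \frac{1}{2} w^T w \;+\; C \sum_{i=1}^{m} \max\left(0,\; 1 - y_i\, w^T x_i\right)

% Corresponding dual
\max_{\alpha} \;\; \sum_{i=1}^{m} \alpha_i
  \;-\; \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j\, y_i y_j\, x_i^T x_j
\qquad \text{subject to } 0 \le \alpha_i \le C
```

In this form the dual touches the training data only through the dot products $x_i^T x_j$.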
Support vector machines
Let w be the minimizer of the SVM problem for some dataset with m examples: $\{(x_i, y_i)\}$. Then, for i = 1…m, there exist $\alpha_i \ge 0$ such that the optimum w can be written as
$w = \sum_{i=1}^{m} \alpha_i y_i x_i$
Furthermore, with C denoting the trade-off constant on the hinge loss in the SVM objective,
• $\alpha_i = 0$ when $y_i w^T x_i > 1$: all points outside the margin
• $\alpha_i = C$ when $y_i w^T x_i < 1$: all points on the wrong side of the margin
• $0 \le \alpha_i \le C$ when $y_i w^T x_i = 1$: all points on the margin
[Figure: positive (+) and negative (-) training examples on either side of the margin]
Support vectors
The weight vector is completely defined by training examples whose $\alpha_i$s are not zero
These examples are called the support vectors
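A minimal sketch of this statement: the weight vector is rebuilt from the multipliers, and examples with a zero multiplier are skipped without changing the result. The toy data, the multiplier values and the name `weight_vector` are hypothetical and chosen only for illustration.

```python
def weight_vector(alphas, ys, xs):
    """w = sum_i alpha_i * y_i * x_i, skipping examples with alpha_i == 0."""
    dim = len(xs[0])
    w = [0.0] * dim
    for a, y, x in zip(alphas, ys, xs):
        if a == 0:
            continue  # not a support vector: contributes nothing to w
        for d in range(dim):
            w[d] += a * y * x[d]
    return w


# Hypothetical toy data and multipliers.
xs = [[1.0, 1.0], [3.0, 3.0], [-1.0, -1.0], [-4.0, -2.0]]
ys = [1, 1, -1, -1]
alphas = [0.25, 0.0, 0.25, 0.0]

support = [i for i, a in enumerate(alphas) if a != 0]
w = weight_vector(alphas, ys, xs)
print(support, w)
```

Only the examples indexed by `support` are needed to store or apply the classifier.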
This lecture
✓ Support vectors
2. Kernels
3. The kernel trick
4. Properties of kernels
5. Another example of the kernel trick
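Predicting with linear classifiers
• Prediction = $\mathrm{sgn}(w^T x)$ and $w = \sum_{i=1}^{m} \alpha_i y_i x_i$
• That is, we just showed that $w^T x = \sum_{i=1}^{m} \alpha_i y_i\, x_i^T x$
  – We only need to compute dot products between training examples and the new example x
• This is true even if we map examples to a high dimensional space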
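To see that an explicit weight vector and the dot-product form give the same predictions, here is a minimal sketch. The data, the multipliers and the function names are hypothetical, and a score of exactly zero is arbitrarily mapped to +1.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def predict_primal(w, x):
    """sgn(w^T x) using an explicit weight vector."""
    return 1 if dot(w, x) >= 0 else -1


def predict_dual(alphas, ys, xs, x):
    """sgn(sum_i alpha_i y_i x_i^T x): only dot products with training examples."""
    score = sum(a * y * dot(xi, x) for a, y, xi in zip(alphas, ys, xs))
    return 1 if score >= 0 else -1


# Hypothetical training set and multipliers.
xs = [[1.0, 1.0], [3.0, 3.0], [-1.0, -1.0], [-4.0, -2.0]]
ys = [1, 1, -1, -1]
alphas = [0.25, 0.0, 0.25, 0.0]
w = [sum(a * y * xi[d] for a, y, xi in zip(alphas, ys, xs)) for d in range(2)]

new_points = [[2.0, 0.5], [-0.5, -3.0], [1.0, -2.0]]
primal = [predict_primal(w, x) for x in new_points]
dual = [predict_dual(alphas, ys, xs, x) for x in new_points]
print(primal, dual)
```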
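Dot products in high dimensional spaces
Let us define a dot product in the high dimensional space
$K(x, z) = \phi(x)^T \phi(z)$
So prediction with this high dimensional lifting map is
$\mathrm{sgn}(w^T \phi(x)) = \mathrm{sgn}\left(\sum_i \alpha_i y_i K(x_i, x)\right)$
because
$w^T \phi(x) = \sum_i \alpha_i y_i\, \phi(x_i)^T \phi(x) = \sum_i \alpha_i y_i K(x_i, x)$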
Kernel based methods
Predict using $\mathrm{sgn}\left(\sum_i \alpha_i y_i K(x_i, x)\right)$
What does this new formulation give us? If we have to compute $\phi$ every time anyway, we gain nothing
If we can compute the value of K without explicitly writing the blown up representation, then we will have a computational advantage.
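The prediction rule can be sketched as a function that takes the kernel as an argument, so the lifted representation is never built. The XOR-style data, the multiplier values and the names `kernel_predict` and `quadratic_kernel` are assumptions made for this example.

```python
def kernel_predict(kernel, alphas, ys, xs, x):
    """sgn(sum_i alpha_i y_i K(x_i, x)); phi is never formed explicitly."""
    score = sum(a * y * kernel(xi, x)
                for a, y, xi in zip(alphas, ys, xs) if a != 0)
    return 1 if score >= 0 else -1


def quadratic_kernel(x, z):
    """K(x, z) = (1 + x^T z)^2, computed in the original space."""
    return (1.0 + sum(xi * zi for xi, zi in zip(x, z))) ** 2


# Hypothetical support vectors and multipliers for an XOR-like problem,
# which no linear classifier in the original 2-D space can separate.
xs = [[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]]
ys = [1, 1, -1, -1]
alphas = [0.125, 0.125, 0.125, 0.125]

preds = [kernel_predict(quadratic_kernel, alphas, ys, xs, x) for x in xs]
print(preds)
```

Swapping in a different kernel function changes the feature space without touching `kernel_predict`.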
This lecture
✓ Support vectors
✓ Kernels
3. The kernel trick
4. Properties of kernels
5. Another example of the kernel trick
Example: Polynomial Kernel
• Given two examples x and z we want to map them to a high dimensional space [for example, quadratic]
  $\phi(x_1, \ldots, x_n) = (1,\; x_1, \ldots, x_n,\; x_1^2, \ldots, x_n^2,\; x_1 x_2, \ldots, x_{n-1} x_n)$
  (all degree zero terms, all degree one terms, all degree two terms)
  and compute the dot product $A = \phi(x)^T \phi(z)$ [takes time]
• Instead, in the original space, compute $B = k(x, z) = (1 + x^T z)^2$
Claim: A = B (up to constant coefficients on the features, which do not really matter)
Example: Two dimensions, quadratic kernel
$A = \phi(x)^T \phi(z)$
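A worked version of the two dimensional case can be sketched as follows. The $\sqrt{2}$ coefficients in the feature map are an assumption, chosen so that the identity holds exactly.

```latex
\phi(x) = \left(1,\; \sqrt{2}\,x_1,\; \sqrt{2}\,x_2,\; x_1^2,\; x_2^2,\; \sqrt{2}\,x_1 x_2\right)

\begin{aligned}
A = \phi(x)^T \phi(z)
  &= 1 + 2 x_1 z_1 + 2 x_2 z_2 + x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 x_2 z_1 z_2 \\
  &= \left(1 + x_1 z_1 + x_2 z_2\right)^2 \\
  &= \left(1 + x^T z\right)^2
\end{aligned}
```

Dropping the $\sqrt{2}$ factors changes only how the individual features are scaled, not which features are present.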
The Kernel Trick
Suppose we wish to compute $K(x, z) = \phi(x)^T \phi(z)$
Here $\phi$ maps x and z to a high dimensional space
The Kernel Trick: Save time/space by computing the value of K(x,z) by performing operations in the original space (without a feature transformation!)
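A minimal numerical check of the trick for degree 2 features in n dimensions: the explicit map is built once for comparison and the kernel is then evaluated directly in the original space. The $\sqrt{2}$ scaling inside `phi_quadratic` is an assumption that makes the two values agree exactly; the function names and test vectors are hypothetical.

```python
import math
from itertools import combinations


def phi_quadratic(x):
    """Explicit degree-2 feature map, scaled so phi(x)^T phi(z) = (1 + x^T z)^2."""
    feats = [1.0]                                        # degree zero term
    feats += [math.sqrt(2.0) * xi for xi in x]           # degree one terms
    feats += [xi * xi for xi in x]                       # squared terms
    feats += [math.sqrt(2.0) * x[i] * x[j]               # cross terms
              for i, j in combinations(range(len(x)), 2)]
    return feats


def kernel_quadratic(x, z):
    """Same quantity computed with a single n-dimensional dot product."""
    return (1.0 + sum(xi * zi for xi, zi in zip(x, z))) ** 2


x = [1.0, 2.0, -1.0, 0.5]
z = [0.5, -1.0, 3.0, 2.0]

A = sum(a * b for a, b in zip(phi_quadratic(x), phi_quadratic(z)))
B = kernel_quadratic(x, z)
print(len(phi_quadratic(x)), A, B)
```

For n = 4 the explicit map already has 15 coordinates, while the kernel needs only a 4-dimensional dot product.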
Computing dot products efficiently
Kernel Trick: You want to work with degree 2 polynomial features, $\phi(x)$. Then, your dot product will operate on vectors in a space of dimensionality n(n+1)/2 (counting only the degree two terms).
The kernel trick allows you to save time/space and compute dot products in an n dimensional space.
(Not just for degree 2 polynomials)
This lecture
✓ Support vectors
✓ Kernels
✓ The kernel trick
4. Properties of kernels
5. Another example of the kernel trick
Which functions are kernels?
• Can we use any function K(.,.)?
  – No! A function K(x,z) is a valid kernel if it corresponds to an inner product in some (perhaps infinite dimensional) feature space.
• General condition: construct the Gram matrix $\{K(x_i, x_j)\}$; check that it's positive semi-definite
Reminder: Positive semi-definite matrices
A symmetric matrix M is positive semi-definite if, for any non-zero vector z, we have $z^T M z \ge 0$
(A useful property characterizing many interesting mathematical objects)
The Kernel Matrix
• The Gram matrix of a set of n vectors $S = \{x_1, \ldots, x_n\}$ is the n×n matrix G with $G_{ij} = x_i^T x_j$
  – The kernel matrix is the Gram matrix of $\{\phi(x_1), \ldots, \phi(x_n)\}$
  – (size depends on the # of examples, not dimensionality)
• Showing that a function K is a valid kernel
  – Direct approach: If you have the $\phi(x_i)$, you have the Gram matrix (and it's easy to see that it will be positive semi-definite). Why?
  – Indirect: If you have the Kernel, write down the Kernel matrix $K_{ij}$, and show that it is a legitimate kernel, without an explicit construction of $\phi(x_i)$
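One way to answer the "Why?" in the direct approach is the short derivation sketched below, where $c$ is an arbitrary real coefficient vector (a symbol introduced here for the argument).

```latex
c^T G\, c
  = \sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j\, \phi(x_i)^T \phi(x_j)
  = \left( \sum_{i=1}^{n} c_i\, \phi(x_i) \right)^{T} \left( \sum_{j=1}^{n} c_j\, \phi(x_j) \right)
  = \left\| \sum_{i=1}^{n} c_i\, \phi(x_i) \right\|^2
  \;\ge\; 0
```

The quadratic form is a squared norm, so it can never be negative, which is exactly positive semi-definiteness.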
Mercer's condition
Let K(x,z) be a function that maps two n dimensional vectors to a real number
K is a valid kernel if for every finite set $\{x_1, x_2, \ldots\}$, for any choice of real valued $c_1, c_2, \ldots$, we have
$\sum_i \sum_j c_i c_j K(x_i, x_j) \ge 0$
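The condition can be probed numerically with a rough sketch like the one below. Random search over coefficient vectors can only find counterexamples; it cannot prove that a function is a valid kernel. The function names, the sample points and the deliberately invalid "kernel" `neg_distance` are assumptions made for this example.

```python
import random


def gram(kernel, points):
    return [[kernel(a, b) for b in points] for a in points]


def quad_form(K, c):
    """sum_i sum_j c_i c_j K_ij"""
    n = len(c)
    return sum(c[i] * c[j] * K[i][j] for i in range(n) for j in range(n))


def violates_mercer(kernel, points, trials=2000, seed=0):
    """Search for coefficients c whose quadratic form is negative."""
    rng = random.Random(seed)
    K = gram(kernel, points)
    for _ in range(trials):
        c = [rng.uniform(-1, 1) for _ in points]
        if quad_form(K, c) < -1e-9:   # small tolerance for rounding error
            return True
    return False


def poly2(x, z):
    """Quadratic polynomial kernel (valid)."""
    return (1.0 + sum(a * b for a, b in zip(x, z))) ** 2


def neg_distance(x, z):
    """Minus the squared distance: not a valid kernel."""
    return -sum((a - b) ** 2 for a, b in zip(x, z))


points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(violates_mercer(poly2, points), violates_mercer(neg_distance, points))
```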
Polynomial kernels
• Linear kernel: $k(x, z) = x^T z$
• Polynomial kernel of degree d: $k(x, z) = (x^T z)^d$
  – only dth-order interactions
• Polynomial kernel up to degree d: $k(x, z) = (x^T z + c)^d$ (c > 0)
  – all interactions of order d or lower
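The three kernels above translate almost directly into code. This is a minimal sketch; the function names and the default `c=1.0` are choices made here for illustration.

```python
def dot(x, z):
    return sum(a * b for a, b in zip(x, z))


def linear_kernel(x, z):
    """k(x, z) = x^T z"""
    return dot(x, z)


def homogeneous_poly_kernel(x, z, d):
    """k(x, z) = (x^T z)^d: only d-th order interactions."""
    return dot(x, z) ** d


def poly_kernel(x, z, d, c=1.0):
    """k(x, z) = (x^T z + c)^d with c > 0: all interactions of order <= d."""
    if c <= 0:
        raise ValueError("c must be positive")
    return (dot(x, z) + c) ** d


x, z = [1.0, 2.0], [3.0, -1.0]
print(linear_kernel(x, z),
      homogeneous_poly_kernel(x, z, 2),
      poly_kernel(x, z, 2))
```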
Gaussian Kernel (or the radial basis function kernel)
$k(x, z) = \exp\left(-\|x - z\|^2 / c\right)$
– $\|x - z\|^2$: squared Euclidean distance between x and z
– $c = \sigma^2$: a free parameter
– very small c: K ≈ identity matrix (every item is different)
– very large c: K ≈ all-ones matrix (all items are the same)
– k(x,z) ≈ 1 when x, z close
– k(x,z) ≈ 0 when x, z dissimilar
Exercises:
1. Prove that this is a kernel.
2. What is the "blown up" feature space for this kernel?
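The effect of the free parameter can be seen in a small sketch that assumes the form $k(x, z) = \exp(-\|x - z\|^2 / c)$ with $c = \sigma^2$. The sample points, the values of `c` and the helper names are hypothetical.

```python
import math


def gaussian_kernel(x, z, c):
    """k(x, z) = exp(-||x - z||^2 / c), with c = sigma^2 > 0."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / c)


def kernel_matrix(points, c):
    return [[gaussian_kernel(a, b, c) for b in points] for a in points]


points = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]

# Very small c pushes K towards the identity; very large c towards all ones.
for c in (1e-3, 1.0, 1e6):
    K = kernel_matrix(points, c)
    print(c, [[round(v, 3) for v in row] for row in K])
```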
Constructing New Kernels
You can construct new kernels k'(x,x') from existing ones:
– Multiplying k(x,x') by a positive constant c
  c k(x,x')
– Multiplying k(x,x') by a function f applied to x and x'
  f(x) k(x,x') f(x')
– Applying a polynomial (with non-negative coefficients) to k(x,x')
  P(k(x,x')) with $P(z) = \sum_i a_i z^i$ and $a_i \ge 0$
– Exponentiating k(x,x')
  exp(k(x,x'))
Constructing New Kernels (2)
• You can construct k'(x,x') from k1(x,x'), k2(x,x') by:
  – Adding k1(x,x') and k2(x,x'):
    k1(x,x') + k2(x,x')
  – Multiplying k1(x,x') and k2(x,x'):
    k1(x,x') k2(x,x')
• Also:
  – If $\phi(x) \in \mathbb{R}^m$ and $k_m(z, z')$ is a valid kernel in $\mathbb{R}^m$, then $k(x, x') = k_m(\phi(x), \phi(x'))$ is also a valid kernel
  – If A is a symmetric positive semi-definite matrix, $k(x, x') = x^T A x'$ is also a valid kernel
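The construction rules can be sanity-checked numerically. The sketch below builds a few combined kernels from a linear and a quadratic kernel and evaluates the quadratic form $\sum_{ij} c_i c_j K(x_i, x_j)$ for random coefficients. This is a spot check, not a proof, and the base kernels, the function `f`, the sample points and the tolerance are assumptions.

```python
import math
import random


def dot(x, z):
    return sum(a * b for a, b in zip(x, z))


def k1(x, z):   # linear kernel
    return dot(x, z)


def k2(x, z):   # quadratic kernel
    return (1.0 + dot(x, z)) ** 2


def f(x):       # arbitrary real-valued function of a single example
    return 1.0 + x[0]


# New kernels built with the construction rules.
combos = {
    "c * k1": lambda x, z: 3.0 * k1(x, z),
    "f(x) k1 f(z)": lambda x, z: f(x) * k1(x, z) * f(z),
    "exp(k1)": lambda x, z: math.exp(k1(x, z)),
    "k1 + k2": lambda x, z: k1(x, z) + k2(x, z),
    "k1 * k2": lambda x, z: k1(x, z) * k2(x, z),
}


def min_quad_form(kernel, points, trials=1000, seed=0):
    """Smallest sum_ij c_i c_j K(x_i, x_j) seen over random coefficient vectors."""
    rng = random.Random(seed)
    K = [[kernel(a, b) for b in points] for a in points]
    n = len(points)
    best = float("inf")
    for _ in range(trials):
        c = [rng.uniform(-1, 1) for _ in range(n)]
        q = sum(c[i] * c[j] * K[i][j] for i in range(n) for j in range(n))
        best = min(best, q)
    return best


points = [[0.5, -1.0], [1.0, 2.0], [-1.5, 0.5], [0.0, 1.0]]
results = {name: min_quad_form(k, points) for name, k in combos.items()}
for name, value in results.items():
    print(name, value >= -1e-9)
```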
This lecture
✓ Support vectors
✓ Kernels
✓ The kernel trick
✓ Properties of kernels
5. Another example of the kernel trick
Kernel Trick: An example
Let the blown up feature space represent the space of all $3^n$ conjunctions. Then,
$K(x, z) = \phi(x)^T \phi(z) = \sum_i \phi_i(z)\, \phi_i(x) = 2^{\mathrm{same}(x, z)}$
where same(x,z) is the number of features that have the same value for both x and z
Example: Take n = 3; x = (001), z = (011); we have conjunctions of size 0, 1, 2, 3
Proof: let m = same(x,z); construct "surviving" conjunctions by
1. choosing to include one of these m literals with the right polarity in the conjunctions, or
2. choosing to not include it at all.
Conjunctions with literals outside this set disappear. Each of the m agreeing features therefore has two options, giving $2^m$ surviving conjunctions.
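The counting argument can be checked by brute force for small n. The sketch below enumerates all $3^n$ conjunctions explicitly, with each variable required to be 1, required to be 0, or absent, and compares the explicit dot product with $2^{\mathrm{same}(x, z)}$. The encoding of a conjunction as a tuple and the function names are assumptions made for this example.

```python
from itertools import product


def phi(x):
    """Indicator vector over all 3^n conjunctions of n Boolean variables.

    Each position of a conjunction is 1 (variable must be 1),
    0 (variable must be 0) or None (variable absent).
    """
    feats = []
    for conj in product((1, 0, None), repeat=len(x)):
        satisfied = all(lit is None or lit == xi for lit, xi in zip(conj, x))
        feats.append(1 if satisfied else 0)
    return feats


def same(x, z):
    """Number of features with the same value in x and z."""
    return sum(1 for a, b in zip(x, z) if a == b)


def conjunction_kernel(x, z):
    """Kernel value computed in the original space."""
    return 2 ** same(x, z)


x, z = (0, 0, 1), (0, 1, 1)
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))
print(len(phi(x)), explicit, conjunction_kernel(x, z))
```

The explicit route touches all $3^n$ features, while the kernel needs a single pass over the n original features.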
Exercises
1. Show that this argument works for a specific example
  – Take $X = \{x_1, x_2, x_3, x_4\}$
  – $\phi(x)$ = the space of all $3^n$ conjunctions; $|\phi(x)| = 81$
  – Consider x = (1100), z = (1101)
  – Write $\phi(x)$, $\phi(z)$, the representation of x, z in the $\phi$ space
  – Compute $\phi(x)^T \phi(z)$
  – Show that $K(x, z) = \phi(x)^T \phi(z) = \sum_i \phi_i(z)\, \phi_i(x) = 2^{\mathrm{same}(x, z)} = 8$
2. Try to develop another kernel, e.g., where the feature space is the space of all conjunctions of size exactly 3
Summary: Kernel trick
• To make the final prediction, we are computing dot products
• The kernel trick is a computational trick to compute dot products in higher dimensional spaces
• This is applicable not just to SVMs. The same idea can be extended to Perceptron too: the Kernel Perceptron
• Important: All the bounds we have seen (e.g., Perceptron bound, etc.) depend on the underlying dimensionality
  – By moving to a higher dimensional space, we are incurring a penalty on sample complexity
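The Kernel Perceptron mentioned in the summary can be sketched as follows. This is a minimal version that assumes a mistake-driven update with no bias term; the function names, the epoch limit and the XOR-style data are choices made for illustration.

```python
def kernel_perceptron(xs, ys, kernel, epochs=10):
    """Train a kernel perceptron; returns per-example mistake counts (alphas)."""
    alphas = [0] * len(xs)
    for _ in range(epochs):
        mistakes = 0
        for i, (x, y) in enumerate(zip(xs, ys)):
            score = sum(a * yj * kernel(xj, x)
                        for a, yj, xj in zip(alphas, ys, xs) if a)
            if y * score <= 0:       # mistake (or zero score): add this example
                alphas[i] += 1
                mistakes += 1
        if mistakes == 0:            # a full pass with no mistakes: stop early
            break
    return alphas


def predict(xs, ys, alphas, kernel, x):
    """sgn(sum_i alpha_i y_i K(x_i, x)) using only kernel evaluations."""
    score = sum(a * yj * kernel(xj, x)
                for a, yj, xj in zip(alphas, ys, xs) if a)
    return 1 if score > 0 else -1


def quadratic_kernel(x, z):
    return (1.0 + sum(a * b for a, b in zip(x, z))) ** 2


# XOR-style data: not linearly separable in the original 2-D space.
xs = [[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]]
ys = [1, 1, -1, -1]

alphas = kernel_perceptron(xs, ys, quadratic_kernel)
preds = [predict(xs, ys, alphas, quadratic_kernel, x) for x in xs]
print(alphas, preds)
```

As with the SVM, the learned classifier is stored as a set of multipliers over training examples rather than as an explicit weight vector in the lifted space.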