Linearized Kernel Dictionary Learning
Alona Golts, Prof. Miki Elad
236862 – Introduction to Sparse and Redundant Representations
4.1.18
What We Shall See Today
§ Sparse representations as a model for signal processing.
§ This model is successful in machine learning tasks as well.
§ Kernels are also extremely popular in machine learning.
§ Sparse representations and kernels "give birth" to an interesting combination.
§ This new model has its share of growing pains, in both space and runtime.
§ Our pre-processing, called LKDL, preserves the "good" while dealing with the "bad".
Wright et al. ('09); http://cs.stanford.edu/people/karpathy; http://alex.smola.org/books.html
Outline
1) Intro to sparse representations
2) Intro to kernels
3) Kernel dictionary learning
4) Linearized kernel dictionary learning (LKDL)
5) Results and summary
(1–3: background; 4–5: our work)
Intro to Sparse Representations
Why Use Sparse Representations?
Denoising [1], Inpainting [2], Super-Resolution [3], Compression [4]
[1] Dabov, Foi, Katkovnik and Egiazarian ('07)
[2] Mairal, Elad and Sapiro ('08)
[3] Yang, Wright, Huang and Ma ('10)
[4] Bryt and Elad ('08)
Sparse Coding
§ "Sparse coding" – representing a signal with a sparse combination of "dictionary atoms", where x ∈ ℝ^d is the signal, D ∈ ℝ^{d×m} the dictionary and γ ∈ ℝ^m the sparse vector:

    (∗)  argmin_γ ‖x − Dγ‖₂²  s.t.  ‖γ‖₀ ≤ q      (q – the "cardinality")

§ Naïve solution of (∗): scanning all (combinatorially many) candidate supports of γ, solving least squares for each, and choosing the best reconstruction – NOT a good idea!
Greedy Approach – OMP
§ Step 1: choose the atom that best matches x.
§ Next steps: given the previously found atoms, choose the next one that best fits the residual r_{t−1}:

    j₀ = argmax_j |⟨r_{t−1}, d_j⟩|

§ Update the coefficients of the sparse vector and the residual:

    γ_t = argmin_γ ‖x − D_t γ‖₂²,   r_t = x − D_t γ_t

§ Repeat q times or until a target threshold is reached.
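The OMP steps above can be sketched in a few lines of NumPy; this is a minimal illustrative implementation (the function name and interface are mine, not from the slides), assuming a dictionary with unit-norm columns:

```python
import numpy as np

def omp(D, x, q):
    """Orthogonal Matching Pursuit: greedily build a q-sparse code of x
    over a dictionary D with (approximately) unit-norm columns."""
    m = D.shape[1]
    support, gamma, residual = [], np.zeros(m), x.copy()
    for _ in range(q):
        # atom selection: atom best correlated with the current residual
        j0 = int(np.argmax(np.abs(D.T @ residual)))
        if j0 in support:          # no new atom helps; stop early
            break
        support.append(j0)
        Dt = D[:, support]
        # least-squares update of the coefficients on the current support
        coeffs, *_ = np.linalg.lstsq(Dt, x, rcond=None)
        gamma[:] = 0.0
        gamma[support] = coeffs
        residual = x - Dt @ coeffs
    return gamma
```

When x is exactly one atom, a single greedy step recovers it with coefficient 1.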
Dictionary Learning
§ "Dictionary learning" – finding a set of atoms and representations that "best sparsify" a collection of inputs X, where X ∈ ℝ^{d×N} is the input signal matrix, D ∈ ℝ^{d×m} the dictionary and Γ ∈ ℝ^{m×N} the sparse representation matrix:

    argmin_{D,Γ} ‖X − DΓ‖_F²  s.t.  ‖γ_i‖₀ ≤ q,  ∀i = 1…N
Dictionary Learning
    argmin_{D,Γ} ‖X − DΓ‖_F²  s.t.  ‖γ_i‖₀ ≤ q,  ∀i = 1…N

§ Basic strategy: block coordinate descent.
§ Iterate over the following for T iterations:
  Ø Given D, find the sparse representations Γ.
  Ø Given Γ, update the dictionary D:
    o MOD [1] – update the entire dictionary at once.
    o KSVD [2] – update one atom at a time, along with its coefficients, solving a rank-1 SVD problem.

[1] Engan, Aase and Husoy ('99)
[2] Elad and Aharon ('06)
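The MOD dictionary update is a closed-form least-squares solve. A minimal NumPy sketch (the function name, the trailing atom normalization, and the test data are illustrative additions, not from the slides):

```python
import numpy as np

def mod_update(X, Gamma):
    """MOD dictionary update: given signals X and fixed sparse codes Gamma,
    solve argmin_D ||X - D Gamma||_F^2 in closed form,
    D = X Gamma^T (Gamma Gamma^T)^{-1}, then re-normalize the atoms."""
    D = X @ Gamma.T @ np.linalg.inv(Gamma @ Gamma.T)
    return D / np.linalg.norm(D, axis=0, keepdims=True)
```

If X was generated exactly as D₀Γ with unit-norm atoms and full-row-rank Γ, the update recovers D₀.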
Intro to Kernels
Classification Problem
[Figure: a nonlinearly separable classification problem in the "input space" 𝒳, with coordinates (x₁, x₂), becomes linearly separable after the mapping Φ into the "feature space" ℱ with coordinates (z₁, z₂, z₃) = (x₁², √2·x₁x₂, x₂²).]
Kernel Trick
§ For the previous mapping, let us calculate the inner product between two signals in the feature space:

    ⟨Φ(x), Φ(y)⟩ = ⟨(x₁², √2·x₁x₂, x₂²), (y₁², √2·y₁y₂, y₂²)⟩
                 = x₁²y₁² + 2x₁x₂y₁y₂ + x₂²y₂²
                 = (x₁y₁ + x₂y₂)² = ⟨x, y⟩² = κ(x, y)      (the "kernel")
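This identity is easy to check numerically. A small sketch using the explicit degree-2 feature map from the slide (the helper name `phi` and the test vectors are illustrative):

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    return np.array([v[0] ** 2, np.sqrt(2.0) * v[0] * v[1], v[1] ** 2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
lhs = phi(x) @ phi(y)   # inner product computed in the feature space
rhs = (x @ y) ** 2      # same quantity via the kernel, in the input space
```

Both sides agree, so the feature-space inner product never has to be formed explicitly.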
Positive Definite Kernels
The following two are equivalent:
§ κ is positive definite (PD), i.e., for any training points x₁, …, x_N ∈ 𝒳 and for arbitrary scalars a₁, …, a_N ∈ ℝ, the following holds:

    Σ_{i,j} a_i a_j K_{i,j} ≥ 0,   K_{i,j} = κ(x_i, x_j)

§ There exists a map Φ into a dot-product space ℋ s.t.:

    κ(x, x′) = ⟨Φ(x), Φ(x′)⟩
Types of Kernels
Commonly used kernels:
§ Linear: κ(x, x′) = ⟨x, x′⟩ + c
§ Polynomial: κ(x, x′) = (⟨x, x′⟩ + c)^d
§ Gaussian/RBF: κ(x, x′) = exp(−‖x − x′‖²/2σ²)
The kernel matrix consists of the inner products of the feature vectors in the high-dimensional space:

    X = [x₁, x₂, …, x_N],   K = Φ(X)ᵀΦ(X),   K_{i,j} = κ(x_i, x_j) = ⟨Φ(x_i), Φ(x_j)⟩
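As an illustration, the Gaussian/RBF kernel matrix can be computed entirely in the input space, without ever forming the feature vectors; a minimal NumPy sketch (function name and test data are illustrative):

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma):
    """K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2)) for the columns x_i of X."""
    sq = np.sum(X ** 2, axis=0)
    # pairwise squared distances: ||x_i||^2 + ||x_j||^2 - 2 <x_i, x_j>
    dist2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X.T @ X, 0.0)
    return np.exp(-dist2 / (2.0 * sigma ** 2))
```

The result is symmetric, has unit diagonal, and (as a PD kernel matrix) no meaningfully negative eigenvalues.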
Kernel Matrix
[Figure: the data matrix X ∈ ℝ^{d×N} and the resulting kernel matrix K ∈ ℝ^{N×N}.]
Kernels in Machine Learning
§ Kernels lend powerful representational power to linear machine learning algorithms, and have thus been used extensively over the past 20 years:
  § SVM
  § Kernel PCA
  § Kernel Regression
  § Kernel K-means
  § Kernel NN
  § ...
Classification using Sparsity
§ The sparsity model is also effective in discriminative tasks, as well as generative ones:
  § "Sparse Representation for Signal Classification", Huang et al. ('06)
  § "Robust Face Recognition using Sparse Representations", Wright et al. ('09)
  § "Linear Spatial Pyramid Matching Using Sparse Coding for Image Classification", Yang et al. ('09)
  § "Sparse Representation for Computer Vision and Pattern Recognition", Wright et al. ('10)
  § "Robust Visual Tracking and Vehicle Classification via Sparse Representation", Mei et al. ('11)
  § "Learning Sparse Representations for Human Action Recognition", Guha et al. ('12)
  § "Learning Structured Low-rank Representations for Image Classification", Zhang et al. ('13)
  § "Multiview Hessian Discriminative Sparse Coding for Image Annotation", Liu et al. ('14)
  § "Learning Discriminative Sparse Representations for Hyperspectral Image Classification", Du et al. ('15)
§ Why not then "kernelize" classic sparse representation algorithms?
Kernel Sparse Representations
§ In the past 5 years there has been a multitude of work concentrated on kernel sparse representations.
§ Some examples:
  § Vincent & Bengio ('02)
  § Gao, Tsang & Chia ('10)
  § Zhang, Zhou, Chang, Liu, Wang & Li ('12)
  § Nguyen, Patel, Nasrabadi & Chellappa ('12)
§ We choose to concentrate on kernel dictionary learning to highlight the benefit of our approach.
Kernel Dictionary Learning
Nguyen, Patel, Nasrabadi and Chellappa ('12)
Kernel Dictionary Learning
§ Perform linear dictionary learning in the feature space, X → Φ(X), D → Φ(D):

    argmin_{Φ(D),Γ} ‖Φ(X) − Φ(D)Γ‖_F²  s.t.  ‖γ_i‖₀ ≤ q,  ∀i = 1…N

§ Writing the dictionary as (∗) Φ(D) = Φ(X)A, A ∈ ℝ^{N×m}, this becomes:

    argmin_{A,Γ} ‖Φ(X) − Φ(X)AΓ‖_F²  s.t.  ‖γ_i‖₀ ≤ q,  ∀i = 1…N

(∗) "Representer theorem" – Kimeldorf and Wahba ('71); "Double Sparsity" – Rubinstein, Zibulevsky and Elad ('10)
Kernel Dictionary Learning
    argmin_{A,Γ} ‖Φ(X) − Φ(X)AΓ‖_F²  s.t.  ‖γ_i‖₀ ≤ q,  ∀i = 1…N
[Figure: Φ(X) is approximated by the product of Φ(X), A ∈ ℝ^{N×m} and Γ ∈ ℝ^{m×N}.]
"Kernelization" of OMP
AS: Choose the atom that best matches the residual:
§ Classic:

    j₀ = argmax_j |⟨x − D_{t−1}γ_{t−1}, d_j⟩| = argmax_j |⟨r_{t−1}, d_j⟩|

§ Kernel:

    j₀ = argmax_j |⟨Φ(x) − Φ(X)A_{t−1}γ_{t−1}, Φ(X)a_j⟩|
       = argmax_j |K(x, X)a_j − γ_{t−1}ᵀ A_{t−1}ᵀ K(X, X)a_j|

where K(x, X) ∈ ℝ^{1×N} involves the input signal and the train set, and K(X, X) ∈ ℝ^{N×N}.
"Kernelization" of OMP
LS: Update the sparse vector using least squares:
§ Classic:

    γ_t = argmin_γ ‖x − D_t γ‖₂² = (D_tᵀD_t)^{−1} D_tᵀ x

§ Kernel:

    γ_t = argmin_γ ‖Φ(x) − Φ(X)A_t γ‖₂² = (Φ(X)A_t)† Φ(x)
        = (A_tᵀ K(X, X) A_t)^{−1} A_tᵀ K(X, x)
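Both KOMP steps touch the data only through kernel values. A minimal NumPy sketch combining the two (the function name, interface, and the linear-kernel test below are illustrative assumptions, not code from the slides):

```python
import numpy as np

def kernel_omp(K_XX, k_xX, A, q):
    """Kernel OMP: sparse-code Phi(x) over the dictionary Phi(X)A using
    kernel values only. K_XX = K(X, X); k_xX = K(x, X) as a length-N
    vector; A holds one coefficient column per atom."""
    m = A.shape[1]
    support, gamma = [], np.zeros(m)
    for _ in range(q):
        # atom selection: <Phi(x) - Phi(X) A gamma, Phi(X) a_j> for every j
        corr = k_xX @ A - (gamma @ A.T) @ K_XX @ A
        corr[support] = 0.0                      # never re-select an atom
        support.append(int(np.argmax(np.abs(corr))))
        # least squares on the support: (A_t^T K A_t)^{-1} A_t^T K(X, x)
        At = A[:, support]
        coeffs = np.linalg.solve(At.T @ K_XX @ At, At.T @ k_xX)
        gamma[:] = 0.0
        gamma[support] = coeffs
    return gamma
```

With a linear kernel and A = I (atoms = training signals), this reduces to ordinary OMP over the columns of X.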
"Kernelization" of MOD
§ Once the sparse representation Γ is known, update A:

    argmin_A ‖Φ(X) − Φ(X)AΓ‖_F²

§ Update for Kernel MOD:

    A = Γ† = Γᵀ(ΓΓᵀ)^{−1}

§ KSVD can likewise be updated using kernels only (see the Kernel KSVD appendix).
Problems with KDL
(N − number of signals, q − target cardinality, d − signal dimension; typically N ≫ d ≫ q)

Memory: X ∈ ℝ^{d×N} vs. the kernel matrix K ∈ ℝ^{N×N}.

Runtime per step:
§ OMP – Atom Selection: O(dq + d)
§ KOMP – Atom Selection: O(N² + Nq + N)
§ OMP – Least Squares: O(dq² + dq + q³)
§ KOMP – Least Squares: O(N²q + Nq + q³)
KDL: Pros and Cons
The Good:
§ Introduces nonlinearity to sparse representation algorithms.
§ Fairly easy to substitute dot products with kernels.
§ Flexibility in the choice of kernel.
The Bad:
§ High dependence on a possibly huge kernel matrix.
§ Complexity of the algorithms depends on the number of signals instead of their dimension.
§ A specific "tailoring" of the kernel is needed in each individual algorithm.
§ The algorithm cannot always be written using dot products.
Our Work: Linearized Kernel Dictionary Learning
Our Objective
§ Incorporate nonlinearity into dictionary learning by kernelizing.
§ Faster runtime, less memory.
§ Turn any DL algorithm into a kernel DL algorithm in an easy way.
Kernel Matrix Decomposition
Any PD kernel matrix can be decomposed into:

    K = Φ(X)ᵀΦ(X) = FᵀF

where X ∈ ℝ^{d×N} holds the original samples and F ∈ ℝ^{N×N} the "virtual samples".
Zhang, Lan, Wang and Moerchen ('12)
Linearized Kernel DL (LKDL)
§ Decompose the kernel matrix into an inner product of "virtual samples": K = FᵀF.
§ Perform classical (linear) DL on the virtual samples:

    argmin_{D,Γ} ‖F − DΓ‖_F²

§ Produce the classification result.
How to Decompose K?
§ Eigendecomposition:

    UΣUᵀ = K = FᵀF  →  F = Σ^{1/2} Uᵀ

§ Not practical for large kernel matrices K ∈ ℝ^{N×N}: computational cost O(N³), or O(N²k) for a rank-k decomposition.
Nyström Method
§ Find an approximation of the PD matrix: K̃ ≈ K.
§ Sampling: choose c columns from K ∈ ℝ^{N×N} to form C ∈ ℝ^{N×c}, c ≪ N.
§ In block form, K = [W Sᵀ; S B], and the sampled columns are C = [W; S], with W ∈ ℝ^{c×c} the sampled block.
Nyström Method
§ Approximation, for k ≤ c:

    K̃ = C W† Cᵀ,   C ∈ ℝ^{N×c},  W ∈ ℝ^{c×c}
Virtual Sample Computation
§ Nyström: K̃ = C W† Cᵀ
§ Eigendecomposition of the small matrix: W = VΛVᵀ → W† = VΛ†Vᵀ
  with W† ∈ ℝ^{c×c}, V ∈ ℝ^{c×k}, Λ† ∈ ℝ^{k×k}, Vᵀ ∈ ℝ^{k×c}
  (c − number of sampled columns in Nyström, k − degree of the eigendecomposition)
Virtual Sample Computation
§ "Virtual sample" computation:

    K̃ = FᵀF  →  F = (Λ†)^{1/2} Vᵀ Cᵀ

  with (Λ†)^{1/2} ∈ ℝ^{k×k}, Vᵀ ∈ ℝ^{k×c}, Cᵀ ∈ ℝ^{c×N}, so F ∈ ℝ^{k×N} approximates the full K ∈ ℝ^{N×N}.
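The virtual-sample formula above is a few lines of linear algebra; a minimal NumPy sketch (the function name, the `eps` clamp for near-zero eigenvalues, and the test are my illustrative choices):

```python
import numpy as np

def nystrom_virtual_samples(C, W, k=None, eps=1e-10):
    """Nystrom virtual samples: given C = K(X, X_s) (N x c) and
    W = K(X_s, X_s) (c x c), return F of shape (k x N) with F^T F ~= K."""
    eigvals, V = np.linalg.eigh(W)           # W = V Lambda V^T
    eigvals, V = eigvals[::-1], V[:, ::-1]   # sort eigenvalues descending
    if k is not None:                        # keep a rank-k truncation
        eigvals, V = eigvals[:k], V[:, :k]
    # (Lambda^+)^{1/2}: invert square roots of the (numerically) nonzero eigenvalues
    inv_sqrt = np.where(eigvals > eps, 1.0 / np.sqrt(np.maximum(eigvals, eps)), 0.0)
    return (inv_sqrt[:, None] * V.T) @ C.T   # F = (Lambda^+)^{1/2} V^T C^T
```

When all N columns are sampled (c = N, so C = W = K), the approximation is exact: FᵀF = K K† K = K.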
Classification using DL
§ Train L dictionaries with KSVD, one for each class, where X = [X₁, X₂, …, X_L]:

    argmin_{D_i,Γ_i} ‖X_i − D_i Γ_i‖_F²

yielding the dictionaries D₁, D₂, …, D_L.
Classification using DL
§ Sparse code each test sample over the L dictionaries with OMP:

    argmin_{γ_i} ‖x_test − D_i γ_i‖₂²  s.t.  ‖γ_i‖₀ ≤ q,  ∀i = 1…L

yielding the sparse codes γ₁, γ₂, …, γ_L.
Classification using DL
§ The chosen class is the one with the minimal representation error:

    r_i = ‖x_test − D_i γ_i‖₂²,  ∀i = 1…L
    class = argmin_i r_i
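The decision rule above is a one-liner once the per-class codes are available; a minimal sketch (function name and test data are illustrative, and the per-class codes are assumed to be precomputed, e.g. by OMP):

```python
import numpy as np

def classify_by_residual(x_test, dictionaries, codes):
    """Pick the class whose dictionary/code pair reconstructs x_test best:
    class = argmin_i ||x_test - D_i gamma_i||_2."""
    errs = [np.linalg.norm(x_test - D @ g) for D, g in zip(dictionaries, codes)]
    return int(np.argmin(errs))
```

A sample generated exactly by class 1's dictionary is assigned to class 1 (zero residual there, nonzero elsewhere).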
Classification using KDL
§ Train L dictionaries with Kernel KSVD, one per class, using the per-class kernel matrices K₁, …, K_L of X₁, …, X_L:

    argmin_{A_i,Γ_i} ‖Φ(X_i) − Φ(X_i)A_i Γ_i‖_F²

yielding the coefficient matrices A₁, …, A_L ∈ ℝ^{N_i×m}.
Classification using KDL
§ Sparse code each test sample over the L dictionaries with KOMP, using the kernel vectors k_test^i = κ(x_test, X_i); solve:

    argmin_{γ_i} ‖Φ(x_test) − Φ(X_i)A_i γ_i‖₂²  s.t.  ‖γ_i‖₀ ≤ q,  ∀i = 1…L

yielding the sparse codes γ₁, …, γ_L.
Classification using KDL
§ The chosen class is the one with the minimal representation error:

    r_i = ‖Φ(x_test) − Φ(X_i)A_i γ_i‖₂²,  ∀i = 1…L
    class = argmin_i r_i
Classification using LKDL
1. Sample signals from the training set: X → X_s
2. Compute C = K(X, X_s)
3. Compute W = K(X_s, X_s)
4. Approximate W = VΛVᵀ
5. Compute the virtual train set: F = (Λ†)^{1/2} Vᵀ Cᵀ
6. Compute c_test = K(x_test, X_s)
7. Compute the virtual test sample: f_test = (Λ†)^{1/2} Vᵀ c_testᵀ
8. Classification using DL on the virtual samples.
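The steps above can be sketched end-to-end in NumPy. This is an illustrative implementation under my own naming and interface assumptions (`kernel(A, B)` returns the matrix of kernel values between the columns of A and B; the eigenvalue clamp is mine):

```python
import numpy as np

def lkdl_features(X, X_test, kernel, c, k):
    """LKDL pre-processing (steps 1-7): sample c training signals, build the
    Nystrom factors, and return virtual train/test samples for linear DL."""
    N = X.shape[1]
    idx = np.random.choice(N, size=c, replace=False)  # 1. sample X -> X_s
    Xs = X[:, idx]
    C = kernel(X, Xs)                                 # 2. C = K(X, X_s)
    W = kernel(Xs, Xs)                                # 3. W = K(X_s, X_s)
    eigvals, V = np.linalg.eigh(W)                    # 4. W = V Lambda V^T
    eigvals, V = eigvals[::-1][:k], V[:, ::-1][:, :k]
    inv_sqrt = 1.0 / np.sqrt(np.maximum(eigvals, 1e-10))
    B = inv_sqrt[:, None] * V.T                       # (Lambda^+)^{1/2} V^T
    F_train = B @ C.T                                 # 5. virtual train set
    C_test = kernel(X_test, Xs)                       # 6. K(x_test, X_s)
    F_test = B @ C_test.T                             # 7. virtual test samples
    return F_train, F_test                            # 8. feed these to any DL
```

With a linear kernel, full sampling (c = N) and full rank (k = N), the virtual samples reproduce the kernel values exactly: F_trainᵀF_train = K(X, X) and F_trainᵀF_test = K(X, X_test).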
LKDL Results

Results – Objective
1. LKDL improves discriminability over linear DL.
2. LKDL works as well as or better than KDL.
3. LKDL is more efficient than KDL.
4. LKDL can be incorporated seamlessly in virtually any DL algorithm.
USPS Dataset
§ signal dim.: 256
§ size of train set: 7291
§ size of test set: 2007
§ # classes: 10 (digits)
§ # atoms per class: 300
§ cardinality: 5
§ # iterations: 5
§ kernel: polynomial, parameter 2
§ c − number of samples in Nyström: 20% of train samples
§ k − approx. dim.: 256
Approximation Quality

    err = ‖K − K̃‖_F / ‖K‖_F
Dependence on c/N
Robustness to Corruptions
[Figures: effect of noise; effect of missing pixels.]
MNIST Dataset
LeCun et al. ('98)
§ signal dim.: 784
§ size of train set: 60,000
§ size of test set: 10,000
§ # classes: 10 (digits)
§ # atoms per class: 700
§ cardinality: 11
§ # iterations: 2
§ kernel: polynomial, parameter 2
§ c − number of samples in Nyström: 15% of train samples
§ k − approx. dim.: 784
Runtime Improvement
[Figures: accuracy, test time and train time comparisons.]
LKDL – Pros and Cons
The Good:
§ Introduces nonlinearity to sparse representation algorithms.
§ Can scale up and deal with a relatively high number of input samples.
§ Can easily be added to any dictionary learning algorithm.
§ Flexibility in the choice of kernel.
The Bad:
§ The Nyström method requires calculating and storing the matrix C, which is large in itself.
§ The eigendecomposition of W is computationally demanding for very large datasets.
§ The virtual samples usually bear no relation to the original data, so image processing tasks are off limits.
01/03/16, MSc Seminar, Alona Golts
Summary
§ There are benefits in using kernels in DL-based classification tasks.
§ Kernel DL improves accuracy over DL but suffers from dimensionality problems.
§ LKDL – a method that computes kernel-based features and runs linear DL on top of them – was presented.
§ LKDL provides comparable accuracy to KDL, with faster training and testing.
§ LKDL can be combined with any DL algorithm.

Thank You!
Kernel KSVD (Appendix)
Rank-1 update stage:

    ‖Φ(X) − Φ(X)AΓ‖_F² = ‖Φ(X) − Φ(X) Σ_{j=1}^{m} a_j γ^j‖_F²
    = ‖Φ(X)(I − Σ_{j≠k} a_j γ^j) − Φ(X) a_k γ^k‖_F²
    = ‖Φ(X)E_k − Φ(X)M_k‖_F²

Restrict to the support of γ^k, E_k^R = E_k Ω_k:

    ‖Φ(X)E_k^R − Φ(X) a_k γ_R^k‖_F²

Solve the rank-1 problem via the SVD, Φ(X)E_k^R = UΣVᵀ → Φ(X) a_k γ_R^k = σ₁u₁v₁ᵀ, giving:

    γ_R^k = σ₁v₁ᵀ,   Φ(X)a_k = u₁,   a_k = σ₁^{−1} E_k^R v₁