Deep Learning III: Unsupervised Learning
Russ Salakhutdinov
Machine Learning Department, Carnegie Mellon University
Canadian Institute for Advanced Research
Unsupervised Learning

Non-probabilistic Models
Ø Sparse Coding
Ø Autoencoders
Ø Others (e.g. k-means)

Probabilistic (Generative) Models

Explicit Density p(x)
• Tractable Models
Ø Fully observed Belief Nets
Ø NADE
Ø PixelRNN
• Non-Tractable Models
Ø Boltzmann Machines
Ø Variational Autoencoders
Ø Helmholtz Machines
Ø Many others…

Implicit Density
Ø Generative Adversarial Networks
Ø Moment Matching Networks
Talk Roadmap
• Basic Building Blocks:
Ø Sparse Coding
Ø Autoencoders
• Deep Generative Models
Ø Restricted Boltzmann Machines
Ø Deep Belief Networks and Deep Boltzmann Machines
Ø Helmholtz Machines / Variational Autoencoders
• Generative Adversarial Networks
• Model Evaluation
[Figure: Deep Belief Network and Deep Boltzmann Machine, each with visible layer v and hidden layers h1, h2, h3 connected by weights W1, W2, W3]
DBNs vs. DBMs

DBNs are hybrid models:
• Inference in DBNs is problematic due to explaining away.
• Only greedy pretraining, no joint optimization over all layers.
• Approximate inference is feed-forward only: no combined bottom-up and top-down pass.
Mathematical Formulation

Deep Boltzmann Machine

[Figure: DBM with input v and hidden layers h1, h2, h3 connected by weights W1, W2, W3]

$$P_\theta(v) = \frac{1}{Z(\theta)} \sum_{h^1, h^2, h^3} \exp\Big( v^\top W^1 h^1 + {h^1}^\top W^2 h^2 + {h^2}^\top W^3 h^3 \Big)$$

where $\theta = \{W^1, W^2, W^3\}$ are the model parameters.

• Bottom-up and Top-down:

$$p(h^1_j = 1 \mid v, h^2) = \sigma\Big( \sum_i W^1_{ij} v_i + \sum_k W^2_{jk} h^2_k \Big)$$

Unlike many existing feed-forward models (ConvNet (LeCun), HMAX (Poggio et al.), Deep Belief Nets (Hinton et al.)):
• Dependencies between hidden variables.
• All connections are undirected.
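As a concrete toy illustration of this formulation, the sketch below evaluates the DBM energy and the corresponding unnormalized log-probability for random binary states. All layer sizes and parameter values are made-up assumptions for the demo, not values from the talk, and bias terms are omitted.

```python
# Illustrative sketch: energy of a 3-hidden-layer Deep Boltzmann Machine,
# E(v,h1,h2,h3) = -v'W1h1 - h1'W2h2 - h2'W3h3 (bias terms omitted).
import numpy as np

def dbm_energy(v, h1, h2, h3, W1, W2, W3):
    """Energy of a DBM state; lower energy means higher probability."""
    return -(v @ W1 @ h1 + h1 @ W2 @ h2 + h2 @ W3 @ h3)

rng = np.random.default_rng(0)
sizes = [6, 5, 4, 3]                              # |v|, |h1|, |h2|, |h3|
W1, W2, W3 = (0.1 * rng.standard_normal((m, n))
              for m, n in zip(sizes, sizes[1:]))  # random toy parameters
v, h1, h2, h3 = (rng.integers(0, 2, s).astype(float) for s in sizes)

# Unnormalized log-probability: log P(v,h) = -E(v,h) - log Z(theta),
# where the partition function Z(theta) is the intractable part.
print(-dbm_energy(v, h1, h2, h3, W1, W2, W3))
```

The partition function $Z(\theta)$ is deliberately left out: evaluating the energy is cheap, while summing it over all joint states is what makes exact learning intractable.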
Mathematical Formulation

[Figure: Deep Boltzmann Machine vs. Deep Belief Network, each with input v, hidden layers h1, h2, h3, and weights W1, W2, W3; in the DBM all connections are undirected, while in the DBN the lower connections are directed]

Unlike many existing feed-forward models (ConvNet (LeCun), HMAX (Poggio), Deep Belief Nets (Hinton)), inference in the DBM incorporates both bottom-up and top-down signals.
Mathematical Formulation

Deep Boltzmann Machine

[Figure: DBM with layers v, h1, h2, h3 and weights W1, W2, W3]

Maximum likelihood learning:

$$\frac{\partial \log P_\theta(v)}{\partial W^1} = \mathbb{E}_{P_{\text{data}}}\big[ v\, {h^1}^\top \big] - \mathbb{E}_{P_\theta}\big[ v\, {h^1}^\top \big]$$

where $\theta = \{W^1, W^2, W^3\}$ are the model parameters. This is the standard learning rule for undirected graphical models: MRFs, CRFs, factor graphs.

Problem: Both expectations are intractable!
• Dependencies between hidden variables.
Approximate Learning

(Approximate) Maximum Likelihood:

$$\frac{\partial \log P_\theta(v)}{\partial W^1} = \mathbb{E}_{P_{\text{data}}}\big[ v\, {h^1}^\top \big] - \mathbb{E}_{P_\theta}\big[ v\, {h^1}^\top \big]$$

• Both expectations are intractable!
• Data-dependent expectation: the posterior over the hidden units is not factorial anymore! → Variational Inference
• Data-independent (model) expectation: → Stochastic Approximation (MCMC-based)
Previous Work

Many approaches for learning Boltzmann machines have been proposed over the last 20 years:
• Hinton and Sejnowski (1983)
• Peterson and Anderson (1987)
• Galland (1991)
• Kappen and Rodriguez (1998)
• Lawrence, Bishop, and Jordan (1998)
• Tanaka (1998)
• Welling and Hinton (2002)
• Zhu and Liu (2002)
• Welling and Teh (2003)
• Yasuda and Tanaka (2009)

Many of the previous approaches were not successful for learning general Boltzmann machines with hidden variables.

Real-world applications involve thousands of hidden and observed variables with millions of parameters.

Algorithms based on Contrastive Divergence, Score Matching, Pseudo-Likelihood, Composite Likelihood, MCMC-MLE, and Piecewise Learning cannot handle multiple layers of hidden variables.
New Learning Algorithm

Key Idea:
• Conditional (Posterior Inference): approximate the conditional $p(h \mid v)$. Data-dependent: Variational Inference, mean-field theory.
• Unconditional (Simulate from the Model): approximate the joint distribution $p(v, h)$. Data-independent: Stochastic Approximation, MCMC-based.
• Match the data-dependent and data-independent statistics.

(Salakhutdinov, 2008; NIPS 2009)
Stochastic Approximation

[Figure: persistent Gibbs chain over (v, h1, h2) at times t = 1, 2, 3, alternately updating the hidden and visible states]

• Generate $(v_t, h_t)$ by simulating from a Markov chain that leaves $P_{\theta_t}$ invariant (e.g. a Gibbs or M-H sampler).
• Update $\theta_t$ by replacing the intractable model expectation with a point estimate at $(v_t, h_t)$.

In practice we simulate several Markov chains in parallel.

(Robbins and Monro, Ann. Math. Stats, 1957; L. Younes, Probability Theory 1989)
Learning Algorithm

Update $\theta_t$ and the chain state $(v_t, h_t)$ sequentially. The update rule decomposes into a true gradient plus a perturbation term, with almost sure convergence guarantees as the learning rate decreases to zero.

Connections to the theory of stochastic approximation and adaptive MCMC.

Problem: With high-dimensional data, the probability landscape is highly multimodal.

Key insight: The transition operator can be any valid transition operator: Tempered Transitions, Parallel/Simulated Tempering.
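The stochastic-approximation loop above can be sketched for the simplest case, a one-hidden-layer binary RBM, where the data-dependent term is exact and persistent Gibbs chains supply the point estimate of the model expectation. The toy data, layer sizes, and learning-rate schedule below are illustrative assumptions.

```python
# Sketch of stochastic approximation (persistent chains + Robbins-Monro
# updates) for a binary RBM, the one-hidden-layer special case of a DBM.
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

n_vis, n_hid, n_chains = 8, 4, 10
W = 0.01 * rng.standard_normal((n_vis, n_hid))
data = rng.integers(0, 2, (100, n_vis)).astype(float)   # toy binary data
v_chain = rng.integers(0, 2, (n_chains, n_vis)).astype(float)

for t in range(200):
    batch = data[rng.choice(len(data), 10)]
    # Data-dependent term: for an RBM the posterior over h is exact.
    pos = batch.T @ sigmoid(batch @ W) / len(batch)
    # Data-independent term: one Gibbs sweep on the persistent chains,
    # which leave the current model distribution invariant.
    h_chain = (sigmoid(v_chain @ W) > rng.random((n_chains, n_hid))).astype(float)
    v_chain = (sigmoid(h_chain @ W.T) > rng.random((n_chains, n_vis))).astype(float)
    neg = v_chain.T @ sigmoid(v_chain @ W) / n_chains
    # Robbins-Monro step with a decreasing learning rate.
    W += (0.1 / (1.0 + 0.01 * t)) * (pos - neg)

print(W.shape)  # (8, 4)
```

The chains persist across updates rather than restarting at the data, which is what makes this a stochastic-approximation procedure rather than plain Contrastive Divergence.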
Variational Inference

Posterior Inference: approximate the intractable distribution $P_\theta(h \mid v)$ with a simpler, tractable distribution $Q_\mu(h \mid v)$.

Variational Lower Bound:

$$\log P_\theta(v) \ge \sum_h Q_\mu(h \mid v) \log P_\theta(v, h) + \mathcal{H}(Q_\mu) = \log P_\theta(v) - \mathrm{KL}\big( Q_\mu(h \mid v) \,\|\, P_\theta(h \mid v) \big)$$

Minimize the KL between the approximating and true distributions with respect to the variational parameters $\mu$.

Mean-Field: Choose a fully factorized distribution:

$$Q_\mu(h \mid v) = \prod_j q(h_j \mid v), \qquad q(h_j = 1 \mid v) = \mu_j$$

1. Variational Inference: Maximize the lower bound w.r.t. the variational parameters $\mu$. This yields nonlinear fixed-point equations for the $\mu_j$.

2. MCMC: Apply stochastic approximation to update the model parameters $\theta$.

Almost sure convergence guarantees to an asymptotically stable point.

Fast Inference: learning can scale to millions of examples.
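The nonlinear fixed-point equations can be sketched for a two-hidden-layer DBM, where each hidden unit's mean receives both bottom-up and top-down input. Layer sizes and weights below are illustrative assumptions.

```python
# Sketch of the mean-field fixed-point equations for a 2-hidden-layer DBM:
#   mu1 <- sigmoid(W1' v + W2 mu2)   (bottom-up AND top-down input)
#   mu2 <- sigmoid(W2' mu1)
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
rng = np.random.default_rng(0)
W1 = 0.5 * rng.standard_normal((6, 4))   # toy weights v -> h1
W2 = 0.5 * rng.standard_normal((4, 3))   # toy weights h1 -> h2
v = rng.integers(0, 2, 6).astype(float)  # one observed binary input

mu1, mu2 = np.full(4, 0.5), np.full(3, 0.5)   # initialize marginals at 1/2
for _ in range(50):                            # iterate to a fixed point
    mu1 = sigmoid(W1.T @ v + W2 @ mu2)
    mu2 = sigmoid(W2.T @ mu1)

# mu1, mu2 approximate the posterior marginals q(h_j = 1 | v).
print(mu1.round(3), mu2.round(3))
```

For strongly coupled weights a damped update is often used in practice to keep the iteration from oscillating; with the small toy weights here the plain iteration settles quickly.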
Good Generative Model? Handwritten Characters

[Figure: real data vs. simulated samples of handwritten characters]
Good Generative Model? MNIST Handwritten Digit Dataset
Handwriting Recognition

MNIST Dataset (60,000 examples of 10 digits):

Learning Algorithm                        Error
Logistic regression                       12.0%
K-NN                                      3.09%
Neural Net (Platt 2005)                   1.53%
SVM (Decoste et al. 2002)                 1.40%
Deep Autoencoder (Bengio et al. 2007)     1.40%
Deep Belief Net (Hinton et al. 2006)      1.20%
DBM                                       0.95%

Optical Character Recognition (42,152 examples of 26 English letters):

Learning Algorithm                        Error
Logistic regression                       22.14%
K-NN                                      18.92%
Neural Net                                14.62%
SVM (Larochelle et al. 2009)              9.70%
Deep Autoencoder (Bengio et al. 2007)     10.05%
Deep Belief Net (Larochelle et al. 2009)  9.68%
DBM                                       8.40%

Permutation-invariant version.
Generative Model of 3-D Objects

24,000 examples; 5 object categories; 5 different objects within each category; 6 lighting conditions; 9 elevations; 18 azimuths.
3-D Object Recognition

Learning Algorithm                        Error
Logistic regression                       22.5%
K-NN (LeCun 2004)                         18.92%
SVM (Bengio & LeCun 2007)                 11.6%
Deep Belief Net (Nair & Hinton 2009)      9.0%
DBM                                       7.2%
Pattern Completion

Permutation-invariant version.
Where else can we use generative models?

Data: a Collection of Modalities
• Multimedia content on the web: image + text + audio.
• Product recommendation systems.
• Robotics applications: audio, vision, touch sensors, motor control.

[Figure: example tagged images, e.g. "sunset, pacific ocean, baker beach, seashore, ocean" and "car, automobile"]
Challenges - I

Very different input representations:
• Images: real-valued, dense
• Text: discrete, sparse

Difficult to learn cross-modal features from low-level representations.

[Figure: an image (dense) paired with its tags "sunset, pacific ocean, baker beach, seashore, ocean" (sparse)]
Challenges - II

Noisy and missing data.

[Figure: images with noisy or missing tags: "pentax, k10d, pentaxda50200, kangaroo island, sa, australian sea lion"; "micki krimmel, mickipedia, headshot"; "unseulpixel, naturey"; "<no text>"]
Challenges - II (continued)

For the same images with noisy or missing tags ("pentax, k10d, pentaxda50200, kangaroo island, sa, australian sea lion"; "micki krimmel, mickipedia, headshot"; "unseulpixel, naturey"; "<no text>"), text generated by the model:
• beach, sea, surf, strand, shore, wave, seascape, sand, ocean, waves
• portrait, girl, woman, lady, blonde, pretty, gorgeous, expression, model
• night, notte, traffic, light, lights, parking, darkness, low light, nacht, glow
• fall, autumn, trees, leaves, foliage, forest, woods, branches, path
A Simple Multimodal Model
• Use a joint binary hidden layer.
• Problem: Inputs have very different statistical properties.
• Difficult to learn cross-modal features.

[Figure: dense, real-valued image features modeled with a Gaussian model; sparse 1-of-K word counts modeled with a Replicated Softmax]
Multimodal DBM

[Figure: a Deep Boltzmann Machine over two input pathways, a Gaussian model over dense, real-valued image features and a Replicated Softmax over word counts, joined by higher hidden layers with bottom-up + top-down inference]

(Srivastava & Salakhutdinov, NIPS 2012; JMLR 2014)
Text Generated from Images

Given image → Generated tags:
• canada, nature, sunrise, ontario, fog, mist, bc, morning
• insect, butterfly, insects, bug, butterflies, lepidoptera
• graffiti, streetart, stencil, sticker, urbanart, graff, sanfrancisco
• portrait, child, kid, ritratto, kids, children, boy, cute, boys, italy
• dog, cat, pet, kitten, puppy, ginger, tongue, kitty, dogs, furry
• sea, france, boat, mer, beach, river, bretagne, plage, brittany
Text Generated from Images (continued)

Given image → Generated tags:
• water, glass, beer, bottle, drink, wine, bubbles, splash, drops, drop
• portrait, women, army, soldier, mother, postcard, soldiers
• obama, barackobama, election, politics, president, hope, change, sanfrancisco, convention, rally
Generating Text from Images

Samples drawn after every 50 steps of Gibbs updates.
MIR-Flickr Dataset (Huiskes et al., 2010)

• 1 million images along with user-assigned tags, e.g.:
• sculpture, beauty, stone
• nikon, green, light, photoshop, apple, d70
• white, yellow, abstract, lines, bus, graphic
• sky, geotagged, reflection, cielo, bilbao, reflejo
• food, cupcake, vegan
• d80
• anawesomeshot, theperfectphotographer, flash, damniwishidtakenthat, spiritofphotography
• nikon, abigfave, goldstaraward, d80, nikond80
Results
• Logistic regression on the top-level representation.
• Multimodal inputs: labeled 25K examples + 1 million unlabeled.

Learning Algorithm        MAP     Precision@50
Random                    0.124   0.124
LDA [Huiskes et al.]      0.492   0.754
SVM [Huiskes et al.]      0.475   0.758
DBM-Labelled              0.526   0.791
Deep Belief Net           0.638   0.867
Autoencoder               0.638   0.875
DBM                       0.641   0.873

(MAP = Mean Average Precision)
Talk Roadmap
• Basic Building Blocks:
Ø Sparse Coding
Ø Autoencoders
• Deep Generative Models
Ø Restricted Boltzmann Machines
Ø Deep Belief Networks, Deep Boltzmann Machines
Ø Helmholtz Machines / Variational Autoencoders
• Generative Adversarial Networks
• Model Evaluation
Helmholtz Machines
• Hinton, G. E., Dayan, P., Frey, B. J. and Neal, R., Science 1995

[Figure: input data v with hidden layers h1, h2, h3 and weights W1, W2, W3; a generative (top-down) process paired with approximate (bottom-up) inference]

Related work:
• Kingma & Welling, 2014
• Rezende, Mohamed, Wierstra, 2014
• Mnih & Gregor, 2014
• Bornschein & Bengio, 2015
• Tang & Salakhutdinov, 2013
Helmholtz Machines, DBNs, DBMs

[Figure: side-by-side architectures of a Deep Boltzmann Machine, a Helmholtz Machine, and a Deep Belief Network, each with layers v, h1, h2, h3 and weights W1, W2, W3]
Variational Autoencoders (VAEs)
• The VAE defines a generative process in terms of ancestral sampling through a cascade of hidden stochastic layers:

$$p_\theta(x) = \sum_{h^1, \ldots, h^L} p_\theta(h^L) \, p_\theta(h^{L-1} \mid h^L) \cdots p_\theta(x \mid h^1)$$

• Each term may denote a complicated nonlinear relationship.
• $\theta$ denotes the parameters of the VAE.
• $L$ is the number of stochastic layers.
• Sampling and probability evaluation is tractable for each conditional $p_\theta(h^{l} \mid h^{l+1})$.
VAE: Example
• The VAE defines a generative process in terms of ancestral sampling through a cascade of hidden stochastic layers: for instance, two stochastic layers with a deterministic layer in between, where each conditional term denotes a one-layer neural net.
• $\theta$ denotes the parameters of the VAE.
• $L$ is the number of stochastic layers.
• Sampling and probability evaluation is tractable for each conditional.
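The ancestral sampling procedure above can be sketched directly: draw the top layer from its prior, then sample each lower layer conditioned on the one above. The two-layer sizes, tanh nonlinearities, and unit-variance Gaussian conditionals below are hypothetical placeholders for the one-layer nets the slide describes.

```python
# Sketch of ancestral sampling through a cascade of stochastic layers:
#   h2 ~ p(h2),  h1 ~ p(h1 | h2),  x ~ p(x | h1),
# with each conditional mean produced by a toy one-layer net.
import numpy as np

rng = np.random.default_rng(0)
W2 = rng.standard_normal((3, 4))   # placeholder net weights for p(h1 | h2)
W1 = rng.standard_normal((4, 5))   # placeholder net weights for p(x  | h1)

def sample_gaussian(mean, std=1.0):
    """Sample from N(mean, std^2 I); probability evaluation is equally easy."""
    return mean + std * rng.standard_normal(mean.shape)

h2 = rng.standard_normal(3)                 # top layer: h2 ~ N(0, I)
h1 = sample_gaussian(np.tanh(h2 @ W2))      # h1 ~ p(h1 | h2)
x = sample_gaussian(np.tanh(h1 @ W1))       # x  ~ p(x  | h1)
print(x.shape)  # (5,)
```

Each conditional here is cheap to sample from and to evaluate, exactly the tractability property the slide requires of every term in the cascade.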
Variational Bound
• The VAE is trained to maximize the variational lower bound:

$$\log p_\theta(x) \ge \mathbb{E}_{q_\phi(h \mid x)} \Big[ \log \frac{p_\theta(x, h)}{q_\phi(h \mid x)} \Big] = \mathcal{L}(x; \theta, \phi)$$

• Trading off the data log-likelihood and the KL divergence from the true posterior.
• Hard to optimize the variational bound with respect to the recognition network $q_\phi$ (high-variance gradient estimates).
• Key idea of Kingma and Welling is to use the reparameterization trick.
Reparameterization Trick
• Assume that the recognition distribution is Gaussian:

$$q_\phi(h \mid x) = \mathcal{N}\big(h;\, \mu(x), \Sigma(x)\big)$$

with mean and covariance computed from the state of the hidden units at the previous layer.
• Alternatively, we can express this in terms of an auxiliary variable:

$$\epsilon \sim \mathcal{N}(0, I), \qquad h = \mu(x) + \Sigma^{1/2}(x)\, \epsilon$$

• The recognition distribution can thus be expressed in terms of a deterministic mapping (a deterministic encoder), where the distribution of $\epsilon$ does not depend on the parameters $\phi$.
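A minimal numerical sketch of the trick with a diagonal Gaussian: the sample is a deterministic function of the parameters, so derivatives with respect to the mean and scale are available exactly, while the noise distribution stays fixed. The particular values of mu and sigma are arbitrary.

```python
# Sketch of the reparameterization trick: h ~ N(mu, sigma^2) rewritten as
# h = mu + sigma * eps with eps ~ N(0, I), so h is deterministic in (mu, sigma).
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])        # arbitrary illustrative parameters
sigma = np.array([0.5, 1.5])

eps = rng.standard_normal((100_000, 2))   # noise does NOT depend on params
h = mu + sigma * eps                      # deterministic map of (mu, sigma)

# The samples have the intended mean/std up to Monte Carlo error, and the
# derivatives dh/dmu = 1 and dh/dsigma = eps are exact, so gradients of any
# expectation over h can flow through the sample by backprop.
print(h.mean(axis=0), h.std(axis=0))
```

Without the trick, the sampling step blocks the gradient path to the recognition parameters; with it, the only stochastic node is parameter-free noise.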
Computing the Gradients
• The gradient w.r.t. the parameters, both recognition and generative:

$$\nabla_{\theta, \phi}\, \mathbb{E}_{q_\phi(h \mid x)}\Big[ \log \frac{p_\theta(x, h)}{q_\phi(h \mid x)} \Big] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\Big[ \nabla_{\theta, \phi} \log \frac{p_\theta\big(x, h(x, \epsilon)\big)}{q_\phi\big(h(x, \epsilon) \mid x\big)} \Big]$$

• Gradients can be computed by backprop: the mapping $h(x, \epsilon)$ is a deterministic neural net for fixed $\epsilon$.
Importance Weighted Autoencoders
• Can improve on the VAE by using the following k-sample importance weighting of the log-likelihood:

$$\mathcal{L}_k(x) = \mathbb{E}_{h^{(1)}, \ldots, h^{(k)} \sim q_\phi(h \mid x)} \Big[ \log \frac{1}{k} \sum_{i=1}^{k} \frac{p_\theta(x, h^{(i)})}{q_\phi(h^{(i)} \mid x)} \Big]$$

where the $h^{(i)}$ are sampled from the recognition network and $w_i = p_\theta(x, h^{(i)}) / q_\phi(h^{(i)} \mid x)$ are unnormalized importance weights.

(Burda, Grosse, Salakhutdinov, ICLR 2016)
Tighter Lower Bound
• This is a lower bound on the marginal log-likelihood: $\log p_\theta(x) \ge \mathcal{L}_k(x)$.
• Special case of k = 1: same as the standard VAE objective.
• For all k, the lower bounds satisfy:

$$\log p_\theta(x) \ge \mathcal{L}_{k+1}(x) \ge \mathcal{L}_k(x)$$

• Using more samples can only improve the tightness of the bound.
• Moreover, if $p_\theta(x, h) / q_\phi(h \mid x)$ is bounded, then $\mathcal{L}_k(x) \to \log p_\theta(x)$ as $k \to \infty$.
Computing the Gradients
• We can use an unbiased estimate of the gradient via the reparameterization trick:

$$\nabla \mathcal{L}_k(x) \approx \sum_{i=1}^{k} \tilde{w}_i \, \nabla \log w_i, \qquad \tilde{w}_i = \frac{w_i}{\sum_{j=1}^{k} w_j}$$

where the $\tilde{w}_i$ are normalized importance weights.

IWAEs vs. VAEs
• Draw k samples from the recognition network (or, equivalently, k sets of auxiliary variables $\epsilon$) and use the Monte Carlo gradient estimate $\sum_i \tilde{w}_i \nabla \log w_i$.
• Compare this to the VAE's estimate of the gradient, $\frac{1}{k} \sum_i \nabla \log w_i$, which weights all samples equally.
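The k-sample bound and its monotonicity can be checked numerically on a toy model where every density is a tractable Gaussian. The generative model p(h) = N(0,1), p(x|h) = N(h,1) and the mismatched recognition distribution q below are illustrative stand-ins, not the networks from the paper.

```python
# Sketch of the k-sample importance-weighted bound L_k on log p(x),
# estimated by Monte Carlo on a fully tractable Gaussian toy model.
import numpy as np

rng = np.random.default_rng(0)

def log_p_xh(x, h):        # log p(x,h) = log N(h; 0,1) + log N(x; h,1)
    return -0.5 * (h**2 + (x - h)**2) - np.log(2 * np.pi)

def log_q(h, mu, std):     # log q(h|x) = log N(h; mu, std^2)
    return -0.5 * ((h - mu) / std)**2 - np.log(std * np.sqrt(2 * np.pi))

def log_mean_exp(a, axis):  # numerically stable log(mean(exp(a)))
    m = a.max(axis=axis, keepdims=True)
    return np.squeeze(m, axis) + np.log(np.mean(np.exp(a - m), axis=axis))

def L_k(x, k, mu=0.0, std=1.5, n_mc=20_000):
    h = mu + std * rng.standard_normal((n_mc, k))   # k samples from q(h|x)
    log_w = log_p_xh(x, h) - log_q(h, mu, std)      # unnormalized log-weights
    return np.mean(log_mean_exp(log_w, axis=1))     # E[log (1/k) sum_i w_i / 1]

# The bounds tighten as k grows, and all stay below the exact marginal
# log-likelihood log p(x) = log N(1; 0, 2), roughly -1.52.
bounds = [L_k(1.0, k) for k in (1, 2, 8)]
print(np.round(bounds, 3))
```

With a deliberately mismatched q (wrong mean and scale), the gap between L_1 and L_8 is clearly visible; with q equal to the true posterior, all the bounds would coincide with log p(x).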
Motivating Example
• Can we generate images from natural language descriptions?

"A stop sign is flying in blue skies."
"A pale yellow school bus is flying in blue skies."
"A herd of elephants is flying in blue skies."
"A large commercial airplane is flying in blue skies."

(Mansimov, Parisotto, Ba, Salakhutdinov, 2015)
Generating Images from Captions
• Generative Model: a stochastic recurrent network, a chained sequence of Variational Autoencoders with a single stochastic layer.
• Recognition Model: a deterministic recurrent network.

(Gregor et al., 2015)
Flipping Colors
"A yellow school bus parked in the parking lot."
"A red school bus parked in the parking lot."
"A green school bus parked in the parking lot."
"A blue school bus parked in the parking lot."
Novel Scene Compositions
"A toilet seat sits open in the bathroom."
"A toilet seat sits open in the grass field."
Ask Google? [Figure: Google image-search results for comparison; credit: Bloomberg News]