Introduction to Machine Learning Summer School, June 18 - June 29, 2018, Chicago
Instructor: Suriya Gunasekar, TTI Chicago
28 June 2018

Day 9: Unsupervised learning, dimensionality reduction
Topics so far
• Linear regression
• Classification
  o Logistic regression
  o Maximum margin classifiers, kernel trick
  o Generative models
  o Neural networks
  o Ensemble methods
• Today and tomorrow
  o Unsupervised learning: dimensionality reduction, clustering
  o Review
Unsupervised learning
• Unsupervised learning: requires data $x \in \mathcal{X}$, but no labels
• Goal: compact representation of the data by detecting patterns
  o e.g., group emails by topic
• Useful when we don't know what we are looking for
  o makes evaluation tricky
• Applications in visualization, exploratory data analysis, and semi-supervised learning
Clustering
Clustering languages
Clustering species (phylogeny)
Image clustering / segmentation
Current trend is to use datasets with labels for such tasks, e.g., MS COCO.
Dimensionality reduction
• Input data $x \in \mathcal{X}$ may have thousands or millions of dimensions!
  o e.g., text data represented as bag of words
  o e.g., video stream of images
  o e.g., fMRI data: #voxels × #timesteps
• Dimensionality reduction: represent data with fewer dimensions
  o easier learning in subsequent tasks (preprocessing)
  o visualization
  o discover intrinsic patterns in the data
Manifolds
Embeddings
Low dimensional embedding
• Given high dimensional feature $\boldsymbol{x} = (x_1, x_2, \ldots, x_d)$, find transformations $z(\boldsymbol{x}) = (z_1(\boldsymbol{x}), z_2(\boldsymbol{x}), \ldots, z_k(\boldsymbol{x}))$ so that "almost all useful information" about $\boldsymbol{x}$ is retained in $z(\boldsymbol{x})$
• In general $k \ll d$, and $z(\boldsymbol{x})$ is not invertible
• The transformation is learned from a dataset of examples of $x$: $S = \{\boldsymbol{x}^{(i)} \in \mathbb{R}^d : i = 1, 2, \ldots, N\}$
  o Note: typically no labels $y$
Linear dimensionality reduction
• Given high dimensional feature $\boldsymbol{x} = (x_1, x_2, \ldots, x_d)$, find transformations $\boldsymbol{z} = z(\boldsymbol{x}) = (z_1(\boldsymbol{x}), z_2(\boldsymbol{x}), \ldots, z_k(\boldsymbol{x}))$
• Restrict $z(\boldsymbol{x})$ to be a linear function of $\boldsymbol{x}$:
$$z_1 = \boldsymbol{w}_1 \cdot \boldsymbol{x}, \quad z_2 = \boldsymbol{w}_2 \cdot \boldsymbol{x}, \quad \ldots, \quad z_k = \boldsymbol{w}_k \cdot \boldsymbol{x}$$
$$\begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_k \end{bmatrix} = \begin{bmatrix} \leftarrow \boldsymbol{w}_1 \rightarrow \\ \leftarrow \boldsymbol{w}_2 \rightarrow \\ \vdots \\ \leftarrow \boldsymbol{w}_k \rightarrow \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{bmatrix}, \qquad \text{i.e., } \boldsymbol{z} = \boldsymbol{W}\boldsymbol{x}$$
where $\boldsymbol{z} \in \mathbb{R}^k$, $\boldsymbol{W} \in \mathbb{R}^{k \times d}$, $\boldsymbol{x} \in \mathbb{R}^d$
• The only question is: which $\boldsymbol{W}$?
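As a minimal NumPy sketch (all names and data here are illustrative), a linear embedding of a whole dataset is a single matrix product:

```python
import numpy as np

d, k, N = 100, 5, 1000        # ambient dim, embedding dim, number of points
X = np.random.randn(N, d)     # toy data matrix, one example x^(i) per row
W = np.random.randn(k, d)     # some projection matrix (which W? see below)

Z = X @ W.T                   # row i is z = W x^(i); Z has shape (N, k)
```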
Linear dimensionality reduction: 2D example
• Given points $S = \{\boldsymbol{x}^{(i)} : i = 1, 2, \ldots, N\}$ in 2D, we want a 1D representation
  o project $\boldsymbol{x}^{(i)}$ onto a line $\boldsymbol{w} \cdot \boldsymbol{x} = 0$
  o find $\boldsymbol{w}$ to minimize the sum of squared distances to the line
Vector projections
• $\boldsymbol{x} \cdot \boldsymbol{u} = \|\boldsymbol{x}\|\,\|\boldsymbol{u}\| \cos\theta$
• Assuming $\|\boldsymbol{u}\| = 1$: $\boldsymbol{x} \cdot \boldsymbol{u} = \|\boldsymbol{x}\| \cos\theta = z_u$, the value of $\boldsymbol{x}$ along $\boldsymbol{u}$
• distance of $\boldsymbol{x}$ to its projection is $\|z_u \boldsymbol{u} - \boldsymbol{x}\| = \|(\boldsymbol{x} \cdot \boldsymbol{u})\,\boldsymbol{u} - \boldsymbol{x}\|$

[Figure: vector $\boldsymbol{x}$ at angle $\theta$ to unit vector $\boldsymbol{u}$; the projection has length $\|\boldsymbol{x}\|\cos\theta = z_u$]
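These identities are easy to verify numerically; a small sketch with illustrative values:

```python
import numpy as np

x = np.array([3.0, 4.0])
u = np.array([1.0, 0.0])         # unit vector, ||u|| = 1

z_u = x @ u                      # value of x along u: ||x|| cos(theta) = 3.0
proj = z_u * u                   # projection of x onto the line spanned by u
dist = np.linalg.norm(proj - x)  # distance of x to its projection = 4.0
```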
Principal component analysis
• For a 1D embedding along direction $\boldsymbol{u}$, the distance of $\boldsymbol{x}$ to its projection along $\boldsymbol{u}$ is given by $\|z_u \boldsymbol{u} - \boldsymbol{x}\| = \|(\boldsymbol{x} \cdot \boldsymbol{u})\,\boldsymbol{u} - \boldsymbol{x}\|$
• More generally, for a $k$ dimensional embedding:
  o find an orthonormal basis of the $k$ dimensional subspace, $\boldsymbol{u}_1, \boldsymbol{u}_2, \ldots, \boldsymbol{u}_k \in \mathbb{R}^d$, i.e., $\boldsymbol{u}_i \cdot \boldsymbol{u}_j = 1$ if $i = j$, and $0$ otherwise
  o let $\boldsymbol{U} \in \mathbb{R}^{k \times d}$ be the matrix with $\boldsymbol{u}_1, \boldsymbol{u}_2, \ldots, \boldsymbol{u}_k$ along the rows
  o the distance of $\boldsymbol{x}$ to its projection onto $\mathrm{span}\{\boldsymbol{u}_1, \ldots, \boldsymbol{u}_k\}$ is $\|\boldsymbol{U}^\top\boldsymbol{U}\boldsymbol{x} - \boldsymbol{x}\|$
  o also, from orthonormality of $\boldsymbol{u}_1, \ldots, \boldsymbol{u}_k$, check that $\boldsymbol{U}\boldsymbol{U}^\top = \boldsymbol{I}$
• PCA objective:
$$\min_{\boldsymbol{U} \in \mathbb{R}^{k \times d}} \sum_{i=1}^{N} \left\|\boldsymbol{U}^\top\boldsymbol{U}\boldsymbol{x}^{(i)} - \boldsymbol{x}^{(i)}\right\|^2 \quad \text{s.t.} \quad \boldsymbol{U}\boldsymbol{U}^\top = \boldsymbol{I}$$
PCA
• PCA objective:
$$\min_{\boldsymbol{U} \in \mathbb{R}^{k \times d}} \frac{1}{N}\sum_{i=1}^{N} \left\|\boldsymbol{U}^\top\boldsymbol{U}\boldsymbol{x}^{(i)} - \boldsymbol{x}^{(i)}\right\|^2 \quad \text{s.t.} \quad \boldsymbol{U}\boldsymbol{U}^\top = \boldsymbol{I}$$
• Also, for all $\boldsymbol{U}$ with $\boldsymbol{U}\boldsymbol{U}^\top = \boldsymbol{I}$:
$$\left\|\boldsymbol{U}^\top\boldsymbol{U}\boldsymbol{x} - \boldsymbol{x}\right\|^2 = \|\boldsymbol{x}\|^2 + \boldsymbol{x}^\top\boldsymbol{U}^\top\boldsymbol{U}\boldsymbol{U}^\top\boldsymbol{U}\boldsymbol{x} - 2\,\boldsymbol{x}^\top\boldsymbol{U}^\top\boldsymbol{U}\boldsymbol{x} = \|\boldsymbol{x}\|^2 - \boldsymbol{x}^\top\boldsymbol{U}^\top\boldsymbol{U}\boldsymbol{x} = \|\boldsymbol{x}\|^2 - \|\boldsymbol{U}\boldsymbol{x}\|^2$$
• Equivalent PCA objective:
$$\max_{\boldsymbol{U}} \frac{1}{N}\sum_{i=1}^{N} \left\|\boldsymbol{U}\boldsymbol{x}^{(i)}\right\|^2 = \sum_{j \in [k]} \boldsymbol{u}_j^\top \Sigma_{xx}\, \boldsymbol{u}_j \quad \text{s.t.} \quad \boldsymbol{U}\boldsymbol{U}^\top = \boldsymbol{I}$$
where $\Sigma_{xx} = \frac{1}{N}\sum_{i=1}^{N} \boldsymbol{x}^{(i)}\boldsymbol{x}^{(i)\top}$ (derivation on board)
• This is the same as finding the top $k$ eigenvectors of $\Sigma_{xx}$
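Since the derivation is left for the board, here is the one step it needs, using only the definitions above: expand $\|\boldsymbol{U}\boldsymbol{x}\|^2$ coordinate-wise and swap the order of summation,
$$\frac{1}{N}\sum_{i=1}^{N} \left\|\boldsymbol{U}\boldsymbol{x}^{(i)}\right\|^2 = \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{k} \left(\boldsymbol{u}_j \cdot \boldsymbol{x}^{(i)}\right)^2 = \sum_{j=1}^{k} \boldsymbol{u}_j^\top \left(\frac{1}{N}\sum_{i=1}^{N} \boldsymbol{x}^{(i)}\boldsymbol{x}^{(i)\top}\right) \boldsymbol{u}_j = \sum_{j \in [k]} \boldsymbol{u}_j^\top \Sigma_{xx}\, \boldsymbol{u}_j.$$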
PCA algorithm
• Given $S = \{\boldsymbol{x}^{(i)} \in \mathbb{R}^d : i = 1, 2, \ldots, N\}$
• Let $\boldsymbol{X} \in \mathbb{R}^{N \times d}$ be the data matrix
  o make sure $\boldsymbol{X}$ is re-centered so that the column means are 0
• $\Sigma_{xx} = \frac{1}{N}\sum_{i=1}^{N} \boldsymbol{x}^{(i)}\boldsymbol{x}^{(i)\top} = \frac{1}{N}\boldsymbol{X}^\top\boldsymbol{X} \in \mathbb{R}^{d \times d}$
• $\boldsymbol{u}_1, \boldsymbol{u}_2, \ldots, \boldsymbol{u}_k \in \mathbb{R}^d$ are the top $k$ eigenvectors of $\Sigma_{xx}$
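A minimal NumPy sketch of this algorithm (function and variable names are illustrative):

```python
import numpy as np

def pca(X, k):
    """Top-k PCA via eigendecomposition of the covariance matrix."""
    X = X - X.mean(axis=0)                # re-center: column means become 0
    Sigma = (X.T @ X) / X.shape[0]        # Sigma_xx = (1/N) X^T X, shape (d, d)
    evals, evecs = np.linalg.eigh(Sigma)  # eigh returns ascending eigenvalues
    U = evecs[:, ::-1][:, :k].T           # top-k eigenvectors as rows, (k, d)
    return U, X @ U.T                     # directions and embeddings z = Ux

# usage: U, Z = pca(np.random.randn(500, 20), k=3)
```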
How to pick $k$?
• Data assumed to be low dimensional projection + noise
• Only keep projections onto components with large eigenvalues and ignore the rest
Slide credit: Aarti Singh
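One common way to operationalize this (a heuristic sketch, not from the slides): keep enough components to explain a target fraction of the total variance.

```python
import numpy as np

def pick_k(X, threshold=0.95):
    """Smallest k whose top components explain `threshold` of the variance."""
    X = X - X.mean(axis=0)
    evals = np.linalg.eigvalsh((X.T @ X) / X.shape[0])[::-1]  # descending
    explained = np.cumsum(evals) / evals.sum()   # cumulative variance ratio
    return int(np.searchsorted(explained, threshold) + 1)
```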
Eigenfaces
• Turk and Pentland '91
SVD version
• Given $S = \{\boldsymbol{x}^{(i)} \in \mathbb{R}^d : i = 1, 2, \ldots, N\}$
• Let $\boldsymbol{X} \in \mathbb{R}^{N \times d}$ be the data matrix
  o make sure $\boldsymbol{X}$ is re-centered so that the column means are 0
• Let $\boldsymbol{X} = \widetilde{\boldsymbol{V}}\widetilde{\boldsymbol{S}}\widetilde{\boldsymbol{U}}^\top$ be the Singular Value Decomposition (SVD) of $\boldsymbol{X}$, where
  o $\widetilde{\boldsymbol{V}} \in \mathbb{R}^{N \times d}$ has orthonormal columns, i.e., $\widetilde{\boldsymbol{V}}^\top\widetilde{\boldsymbol{V}} = \boldsymbol{I}$
    § columns of $\widetilde{\boldsymbol{V}}$ are called left singular vectors
  o $\widetilde{\boldsymbol{U}} \in \mathbb{R}^{d \times d}$ also has orthonormal columns, i.e., $\widetilde{\boldsymbol{U}}^\top\widetilde{\boldsymbol{U}} = \boldsymbol{I}$
    § columns of $\widetilde{\boldsymbol{U}}$ are called right singular vectors
  o $\widetilde{\boldsymbol{S}} = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_d) \in \mathbb{R}^{d \times d}$
    § $\sigma_1, \sigma_2, \ldots, \sigma_d$ are called the singular values
• First $k$ columns of $\widetilde{\boldsymbol{U}}$ are the $\boldsymbol{u}_1, \boldsymbol{u}_2, \ldots, \boldsymbol{u}_k$ we want
• Representation of $\boldsymbol{x} \in \mathbb{R}^d$ as $z(\boldsymbol{x}) \in \mathbb{R}^k$ is given by $z(\boldsymbol{x})_j = \sigma_j\, \boldsymbol{u}_j \cdot \boldsymbol{x}$ for $j = 1, 2, \ldots, k$
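A NumPy sketch of the SVD route (NumPy names the factors (U, S, Vh); below they are renamed to match the slide's $\widetilde{\boldsymbol{V}}\widetilde{\boldsymbol{S}}\widetilde{\boldsymbol{U}}^\top$ convention; the data is illustrative):

```python
import numpy as np

X = np.random.randn(500, 20)        # toy data matrix, N x d
X = X - X.mean(axis=0)              # re-center columns
V, s, Ut = np.linalg.svd(X, full_matrices=False)  # X = V @ diag(s) @ Ut
k = 3
U = Ut[:k]                          # rows are u_1, ..., u_k
Z = (X @ U.T) * s[:k]               # z(x)_j = sigma_j * (u_j . x), per the slide
```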
Other linear dimensionality reduction
• PCA: given data $x \in \mathbb{R}^d$, find $U \in \mathbb{R}^{k \times d}$ to minimize
$$\left\|U^\top U x - x\right\|_2^2 \quad \text{s.t.} \quad UU^\top = I$$
• Canonical correlation analysis: given two "views" of the data, $x \in \mathbb{R}^d$ and $x' \in \mathbb{R}^{d'}$, find $U \in \mathbb{R}^{k \times d}$, $U' \in \mathbb{R}^{k \times d'}$ to minimize
$$\left\|Ux - U'x'\right\|_2^2 \quad \text{s.t.} \quad UU^\top = U'U'^\top = I$$
• Sparse dictionary learning: learn a sparse representation of $x$ as a linear combination of an over-complete dictionary (see the sketch after this list):
$$x \to Dz \quad \text{where } D \in \mathbb{R}^{d \times m},\ z \in \mathbb{R}^m$$
  o unlike PCA, here $m \gg d$, so $z$ is higher dimensional, but learned to be sparse!
• Independent component analysis
• Factor analysis
• Linear discriminant analysis
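A sketch of sparse dictionary learning with scikit-learn's DictionaryLearning (the sizes and solver settings are illustrative choices):

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

X = np.random.randn(200, 10)               # N = 200 points in d = 10 dims
dl = DictionaryLearning(n_components=32,   # m = 32 >> d: over-complete
                        transform_algorithm='lasso_lars',
                        transform_alpha=0.1)
Z = dl.fit_transform(X)                    # sparse codes z, shape (200, 32)
D = dl.components_.T                       # dictionary D, shape (10, 32)
# x -> D z: most entries of each row of Z are (near) zero
```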
Nonlinear dimensionality reduction
• Isomap
• Autoencoders
• Kernel PCA
• Locally linear embedding
• Check out t-SNE for 2D visualization
• …
Isomap
Isomap: algorithm
• Dataset of $N$ points $S = \{\boldsymbol{x}^{(i)} \in \mathbb{R}^d : i = 1, 2, \ldots, N\}$
• Represent the points as a kNN graph with edge weights proportional to the distances between the points
• The geodesic distance $d(x, x')$ between points on the manifold is the length of the shortest path in the graph
• Any shortest path algorithm can be used to construct a matrix $M \in \mathbb{R}^{N \times N}$ of $d(x^{(i)}, x^{(j)})$ for all $x^{(i)}, x^{(j)} \in S$
• MDS: find a (low dimensional) embedding $z(x)$ of $x$ so that the distances are preserved:
$$\min_{z} \sum_{i,j \in [N]} \left( \left\|z(x^{(i)}) - z(x^{(j)})\right\| - M_{ij} \right)^2$$
  o sometimes a normalized variant of this objective is used
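A compact sketch of these steps (assuming the kNN graph comes out connected; classical MDS on $M$ stands in for the stress objective above, as in the original Isomap paper):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def isomap(X, n_neighbors=10, k=2):
    """Isomap sketch: kNN graph -> geodesic distances -> classical MDS."""
    G = kneighbors_graph(X, n_neighbors, mode='distance')  # weighted kNN graph
    M = shortest_path(G, directed=False)     # geodesic distance matrix, N x N
    N = X.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N      # centering matrix
    B = -0.5 * J @ (M ** 2) @ J              # classical MDS Gram matrix
    evals, evecs = np.linalg.eigh(B)
    idx = np.argsort(evals)[::-1][:k]        # top-k eigenpairs
    return evecs[:, idx] * np.sqrt(np.maximum(evals[idx], 0))
```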
Autoencoders
• Recall neural networks as feature learning
  o $\phi(x)$ was learned for some supervised learning task
  o weights were learned by minimizing $\ell(v_{\text{out}}, y)$
  o but we don't have $y$ anymore!
  o instead, use another "decoder" network to reconstruct $x$

[Figure: network with inputs $x_1, x_2, \ldots, x_d$, hidden units $v_1, v_2, v_3, \ldots$, learned features $\phi(x)_1, \ldots, \phi(x)_k$, and output $v_{\text{out}}$]
Autoencoders
• $\phi(\boldsymbol{x}) = f_{\boldsymbol{W}_1}(\boldsymbol{x})$
• $\hat{\boldsymbol{x}} = f_{\boldsymbol{W}_2}(\phi(\boldsymbol{x}))$
• some loss $\ell(\hat{x}, x)$:
$$\widehat{W}_1, \widehat{W}_2 = \arg\min_{W_1, W_2} \sum_{i=1}^{N} \ell\!\left( f_{\boldsymbol{W}_2}\!\left(f_{\boldsymbol{W}_1}\!\left(\boldsymbol{x}^{(i)}\right)\right),\ \boldsymbol{x}^{(i)} \right)$$
• learn using SGD with backpropagation

[Figure: encoder maps inputs $x_1, x_2, \ldots, x_d$ to features $\phi(x)_1, \ldots, \phi(x)_k$; a decoder maps these back to reconstructions $\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_d$]
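A minimal PyTorch sketch of this training loop (the slides leave $f_{\boldsymbol{W}_1}$, $f_{\boldsymbol{W}_2}$, and $\ell$ generic; the one-hidden-layer architecture, squared loss, and sizes below are illustrative choices):

```python
import torch
import torch.nn as nn

d, k = 784, 32                                        # input dim, embedding dim
encoder = nn.Sequential(nn.Linear(d, k), nn.ReLU())   # phi(x) = f_W1(x)
decoder = nn.Linear(k, d)                             # x_hat = f_W2(phi(x))
loss_fn = nn.MSELoss()                                # a common loss l(x_hat, x)
opt = torch.optim.SGD(list(encoder.parameters()) + list(decoder.parameters()),
                      lr=1e-2)

X = torch.randn(256, d)               # toy data batch
for _ in range(100):                  # SGD with backpropagation
    x_hat = decoder(encoder(X))
    loss = loss_fn(x_hat, X)          # reconstruction loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```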