08 Dimensionality Reduction (SJTU Big Data course, ~yshen)


5/2/17

Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff Ullman, Stanford University

http://www.mmds.org

Note to other teachers and users of these slides: We would be delighted if you found our material useful for giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org

¡ Assumption: Data lies on or near a low d-dimensional subspace

¡ Axes of this subspace are an effective representation of the data

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org


¡ Compress / reduce dimensionality:
§ 10^6 rows; 10^3 columns; no updates
§ Random access to any cell(s); small error: OK


The above matrix is really “2-dimensional.” All rows can be reconstructed by scaling [1 1 1 0 0] or [0 0 0 1 1]

¡ Q: What is the rank of a matrix A?
¡ A: The number of linearly independent columns of A
¡ For example:
§ Matrix A = [1 2 1; -2 -3 1; 3 5 0] has rank r = 2

§ Why? The first two rows are linearly independent, so the rank is at least 2, but all three rows are linearly dependent (the first is equal to the sum of the second and third), so the rank must be less than 3.

¡ Why do we care about low rank?
§ We can write A using two “basis” vectors: [1 2 1] [-2 -3 1]
§ And new coordinates of the rows: [1 0] [0 1] [1 -1]



¡ Cloud of points in 3-D space:
§ Think of point positions as a matrix:

¡ We can rewrite the coordinates more efficiently!
§ Old basis vectors: [1 0 0] [0 1 0] [0 0 1]
§ New basis vectors: [1 2 1] [-2 -3 1]
§ Then A has new coordinates [1 0], B: [0 1], C: [1 -1]

§ Notice: We reduced the number of coordinates!

(Figure: point-coordinate matrix, 1 row per point: A, B, C.)


¡ Goal of dimensionality reduction is to discover the axis of the data!

Rather than representing every point with 2 coordinates, we represent each point with 1 coordinate (corresponding to the position of the point on the red line).

By doing this we incur a bit of error, as the points do not exactly lie on the line.



Why reduce dimensions?
¡ Discover hidden correlations/topics
§ Words that occur commonly together

¡ Remove redundant and noisy features
§ Not all words are useful

¡ Interpretation and visualization
¡ Easier storage and processing of the data


A[m×n] = U[m×r] S[r×r] (V[n×r])ᵀ

¡ A: Input data matrix
§ m × n matrix (e.g., m documents, n terms)

¡ U: Left singular vectors
§ m × r matrix (m documents, r concepts)

¡ S: Singular values
§ r × r diagonal matrix (strength of each ‘concept’) (r: rank of the matrix A)

¡ V: Right singular vectors
§ n × r matrix (n terms, r concepts)


(Figure: A [m×n] ≈ U [m×r] · S [r×r] · Vᵀ [r×n], drawn as rectangles.)

(Figure: A [m×n] ≈ σ₁u₁v₁ᵀ + σ₂u₂v₂ᵀ, where σᵢ is a scalar and uᵢ, vᵢ are vectors.)


It is always possible to decompose a real matrix A into A = U S Vᵀ, where

¡ U, S, V: unique
¡ U, V: column orthonormal
§ UᵀU = I; VᵀV = I (I: identity matrix)
§ (Columns are orthogonal unit vectors)

¡ S: diagonal
§ Entries (singular values) are non-negative, and sorted in decreasing order (σ₁ ≥ σ₂ ≥ … ≥ 0)


Nice proof of uniqueness: http://www.mpi-inf.mpg.de/~bast/ir-seminar-ws04/lecture2.pdf
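These properties are easy to check numerically; a minimal sketch with NumPy, using the users-to-movies ratings matrix from the example on the following slides:

```python
import numpy as np

# Users-to-movies ratings matrix used throughout these slides
A = np.array([[1, 1, 1, 0, 0],
              [3, 3, 3, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# A = U S V^T (up to floating-point error)
assert np.allclose(U @ np.diag(s) @ Vt, A)
# U, V column orthonormal: U^T U = I, V^T V = I
assert np.allclose(U.T @ U, np.eye(5), atol=1e-10)
assert np.allclose(Vt @ Vt.T, np.eye(5), atol=1e-10)
# Singular values non-negative and sorted in decreasing order
assert np.all(s >= 0) and np.all(s[:-1] >= s[1:])
print(s)  # three significant values; the rest are ~0, since rank(A) = 3
```

Note that `np.linalg.svd` returns Vᵀ (here `Vt`), not V, and returns the singular values as a vector rather than a diagonal matrix.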

¡ A = U S Vᵀ - example: Users to Movies


SciFi and Romance users rating five movies, 1 row per user:

          Matrix  Alien  Serenity  Casablanca  Amelie
SciFi        1      1       1          0         0
             3      3       3          0         0
             4      4       4          0         0
             5      5       5          0         0
Romance      0      2       0          4         4
             0      0       0          5         5
             0      1       0          2         2

A = U S Vᵀ; the “concepts” are AKA latent dimensions, AKA latent factors.


¡ A = U S Vᵀ - example: Users to Movies


A (SciFi and Romance users × movies, as above) = U × S × Vᵀ:

A =
1 1 1 0 0
3 3 3 0 0
4 4 4 0 0
5 5 5 0 0
0 2 0 4 4
0 0 0 5 5
0 1 0 2 2

U =
0.13  0.02 -0.01
0.41  0.07 -0.03
0.55  0.09 -0.04
0.68  0.11 -0.05
0.15 -0.59  0.65
0.07 -0.73 -0.67
0.07 -0.29  0.32

S =
12.4   0    0
 0    9.5   0
 0     0   1.3

Vᵀ =
0.56  0.59  0.56  0.09  0.09
0.12 -0.02  0.12 -0.69 -0.69
0.40 -0.80  0.40  0.09  0.09

¡ A = U S Vᵀ - example: Users to Movies
(Same decomposition as above; the first two columns of U correspond to the SciFi-concept and the Romance-concept.)


¡ A = U S Vᵀ - example:
(Same decomposition; U is the “user-to-concept” similarity matrix, with one column per concept: SciFi, Romance.)

¡ A = U S Vᵀ - example:
(Same decomposition; the first singular value, 12.4, is the “strength” of the SciFi-concept.)


¡ A = U S Vᵀ - example:
(Same decomposition; V is the “movie-to-concept” similarity matrix, with the first row of Vᵀ giving the SciFi-concept.)


In terms of ‘movies’, ‘users’ and ‘concepts’:
¡ U: user-to-concept similarity matrix

¡ V: movie-to-concept similarity matrix

¡ S: its diagonal elements give the ‘strength’ of each concept



(Figure: users plotted by Movie 1 rating vs. Movie 2 rating; v₁ is the first right singular vector.)

¡ Instead of using two coordinates (x, y) to describe point locations, let’s use only one coordinate z

¡ A point’s position is its location along vector v₁
¡ How to choose v₁? Minimize reconstruction error


¡ Goal: Minimize the sum of reconstruction errors:

    Σ_{i=1..N} Σ_{j=1..D} (xᵢⱼ − zᵢⱼ)²

§ where xᵢⱼ are the “old” and zᵢⱼ are the “new” coordinates

¡ SVD gives the ‘best’ axis to project on:
§ ‘best’ = minimizing the reconstruction errors

¡ In other words, minimum reconstruction error
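This claim can be checked numerically: for a small 2-D point cloud (hypothetical Movie-1/Movie-2 ratings made up for illustration), the first right singular vector beats any other projection direction we try:

```python
import numpy as np

# 2-D points (e.g., Movie-1 / Movie-2 ratings) lying roughly along a line
X = np.array([[1.0, 1.2], [2.0, 1.9], [3.0, 3.1], [4.0, 4.2], [5.0, 4.8]])

def recon_error(X, u):
    """Sum of squared reconstruction errors after projecting rows of X onto unit vector u."""
    u = u / np.linalg.norm(u)
    Z = np.outer(X @ u, u)          # reconstructed points, back in 2-D
    return np.sum((X - Z) ** 2)

_, _, Vt = np.linalg.svd(X)
v1 = Vt[0]                          # first right singular vector

# v1 minimizes the error over all directions (rank-1 Eckart-Young)
for angle in np.linspace(0, np.pi, 180, endpoint=False):
    u = np.array([np.cos(angle), np.sin(angle)])
    assert recon_error(X, v1) <= recon_error(X, u) + 1e-12
```

The loop compares v₁ against 180 other candidate axes; none does better.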



¡ A = U S Vᵀ - example:
§ V: “movie-to-concept” matrix
§ U: “user-to-concept” matrix




¡ A = U S Vᵀ - example:
(Same decomposition; the singular value σ₁ = 12.4 measures the variance, i.e. ‘spread’, of the points along the v₁ axis.)

A = U S Vᵀ - example:
¡ U S: Gives the coordinates of the points on the projection axes


(Figure: users plotted by Movie 1 rating vs. Movie 2 rating, with v₁, the first right singular vector.)

Projection of the users onto the concept axes, U S:

1.61  0.19 -0.01
5.08  0.66 -0.03
6.82  0.85 -0.05
8.43  1.04 -0.06
1.86 -5.60  0.84
0.86 -6.93 -0.87
0.86 -2.75  0.41

The first column is the projection of the users on the “Sci-Fi” axis.


More details
¡ Q: How exactly is dim. reduction done?



More details
¡ Q: How exactly is dim. reduction done?
¡ A: Set smallest singular values to zero







More details
¡ Q: How exactly is dim. reduction done?
¡ A: Set smallest singular values to zero

Keeping only the top two singular values, A ≈ U′ S′ V′ᵀ:

U′ =
0.13  0.02
0.41  0.07
0.55  0.09
0.68  0.11
0.15 -0.59
0.07 -0.73
0.07 -0.29

S′ =
12.4   0
 0    9.5

V′ᵀ =
0.56  0.59  0.56  0.09  0.09
0.12 -0.02  0.12 -0.69 -0.69

A is then approximated by the rank-2 matrix B = U′ S′ V′ᵀ:

B ≈
 0.92  0.95  0.92  0.01  0.01
 2.91  3.01  2.91 -0.01 -0.01
 3.90  4.04  3.90  0.01  0.01
 4.82  5.00  4.82  0.03  0.03
 0.70  0.53  0.70  4.11  4.11
-0.69  1.34 -0.69  4.78  4.78
 0.32  0.23  0.32  2.01  2.01

Frobenius norm: ǁMǁF = √(Σᵢⱼ Mᵢⱼ²), so ǁA − BǁF = √(Σᵢⱼ (Aᵢⱼ − Bᵢⱼ)²) is “small”
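A quick numerical check of the truncation step (keeping k = 2 of the 3 concepts), showing that the Frobenius error is exactly the energy of the dropped singular values:

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 0], [3, 3, 3, 0, 0], [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0], [0, 2, 0, 4, 4], [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Set the smallest singular values to zero: keep only the top k = 2
k = 2
B = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]

# The error we incur is exactly sqrt of the sum of dropped sigma_i^2 (~ 1.3 here)
err = np.linalg.norm(A - B, 'fro')
assert np.isclose(err, np.sqrt(np.sum(s[k:] ** 2)))
```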



(Figure: A = U Sigma Vᵀ and B = U Sigma′ Vᵀ; B is the best approximation of A.)

¡ Theorem: Let A = U S Vᵀ and B = U S′ Vᵀ, where S′ is a diagonal r×r matrix with s′ᵢ = σᵢ (i = 1…k), else s′ᵢ = 0. Then B is a best rank(B) = k approximation to A

What do we mean by “best”:
§ B is a solution to minB ǁA − BǁF where rank(B) = k

ǁA − BǁF = √( Σᵢⱼ (Aᵢⱼ − Bᵢⱼ)² )



¡ Theorem: Let A = U S Vᵀ (σ₁ ≥ σ₂ ≥ …, rank(A) = r). Then B = U S′ Vᵀ,
§ where S′ is a diagonal r×r matrix with s′ᵢ = σᵢ (i = 1…k), else s′ᵢ = 0,
is a best rank-k approximation to A:
§ B is a solution to minB ǁA − BǁF where rank(B) = k

¡ We will need 2 facts:
§ ǁMǁF² = Σᵢ qᵢᵢ², where M = P Q Rᵀ is the SVD of M
§ U S Vᵀ − U S′ Vᵀ = U (S − S′) Vᵀ

Details!


¡ We will need 2 facts:
§ ǁMǁF² = Σᵢ qᵢᵢ², where M = P Q Rᵀ is the SVD of M
§ U S Vᵀ − U S′ Vᵀ = U (S − S′) Vᵀ

We apply:
-- P column orthonormal
-- R row orthonormal
-- Q is diagonal

Details!



¡ A = U S Vᵀ, B = U S′ Vᵀ (σ₁ ≥ σ₂ ≥ … ≥ 0, rank(A) = r)
§ S′ = diagonal r×r matrix where s′ᵢ = σᵢ (i = 1…k), else s′ᵢ = 0; then B is a solution to minB ǁA − BǁF, rank(B) = k

¡ Why?
¡ We want to choose the sᵢ to minimize Σᵢ (σᵢ − sᵢ)²
¡ The solution is to set sᵢ = σᵢ (i = 1…k) and the other sᵢ = 0

min_{B: rank(B)=k} ǁA − BǁF = min_{sᵢ} ǁS − S′ǁF = min_{sᵢ} √( Σ_{i=1..r} (σᵢ − sᵢ)² )

min_{sᵢ} Σ_{i=1..r} (σᵢ − sᵢ)² = min_{sᵢ} [ Σ_{i=1..k} (σᵢ − sᵢ)² + Σ_{i=k+1..r} σᵢ² ] = Σ_{i=k+1..r} σᵢ²

We used: U S Vᵀ − U S′ Vᵀ = U (S − S′) Vᵀ

Details!


Equivalent: ‘spectral decomposition’ of the matrix:

A = [u₁ u₂] × diag(σ₁, σ₂) × [v₁ᵀ; v₂ᵀ]

(A is the users-to-movies ratings matrix from the running example.)


Equivalent: ‘spectral decomposition’ of the matrix

A = σ₁ u₁ v₁ᵀ + σ₂ u₂ v₂ᵀ + …   (k terms; each term is a column vector times a row vector)

Assume: σ₁ ≥ σ₂ ≥ σ₃ ≥ … ≥ 0

Why is setting small σᵢ to 0 the right thing to do? Vectors uᵢ and vᵢ are unit length, so σᵢ scales them. So, zeroing small σᵢ introduces less error.


Q: How many σs to keep?
A: Rule-of-a-thumb: keep 80-90% of the ‘energy’ Σᵢ σᵢ²

A = σ₁ u₁ v₁ᵀ + σ₂ u₂ v₂ᵀ + …, with σ₁ ≥ σ₂ ≥ σ₃ ≥ …
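The rule of thumb can be sketched directly: compute the cumulative energy fraction and keep the smallest k that reaches the threshold. On the running example, two concepts already carry over 99% of the energy:

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 0], [3, 3, 3, 0, 0], [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0], [0, 2, 0, 4, 4], [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]], dtype=float)

s = np.linalg.svd(A, compute_uv=False)

# Cumulative fraction of energy sum_i sigma_i^2 retained by the top-k values
energy = np.cumsum(s ** 2) / np.sum(s ** 2)
k = int(np.searchsorted(energy, 0.90)) + 1   # smallest k with >= 90% energy
print(k, np.round(energy, 3))
```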


¡ To compute the SVD:
§ O(nm²) or O(n²m) (whichever is less)

¡ But:
§ Less work, if we just want the singular values
§ or if we want the first k singular vectors
§ or if the matrix is sparse

¡ Implemented in linear algebra packages like
§ LINPACK, Matlab, SPlus, Mathematica ...


¡ SVD: A = U S Vᵀ: unique
§ U: user-to-concept similarities
§ V: movie-to-concept similarities
§ S: strength of each concept

¡ Dimensionality reduction:
§ keep the few largest singular values (80-90% of ‘energy’)

§ SVD picks up linear correlations



¡ SVD gives us:
§ A = U S Vᵀ

¡ Eigen-decomposition:
§ A = X Λ Xᵀ
§ A is symmetric
§ U, V, X are orthonormal (UᵀU = I)
§ Λ, S are diagonal

¡ Now let’s calculate:
§ A Aᵀ = U S Vᵀ (U S Vᵀ)ᵀ = U S Vᵀ (V Sᵀ Uᵀ) = U S Sᵀ Uᵀ

§ Aᵀ A = V Sᵀ Uᵀ (U S Vᵀ) = V Sᵀ S Vᵀ


¡ As above: A Aᵀ = U S Sᵀ Uᵀ and Aᵀ A = V Sᵀ S Vᵀ are exactly the eigen-decompositions X Λ² Xᵀ of A Aᵀ and Aᵀ A.

This shows how to compute the SVD using eigenvalue decomposition!
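A small numerical confirmation of this relationship: the eigenvalues of AᵀA are exactly the squared singular values of A.

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 0], [3, 3, 3, 0, 0], [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0], [0, 2, 0, 4, 4], [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# A^T A = V S^2 V^T: its eigenvalues are the squared singular values of A
lam = np.linalg.eigvalsh(A.T @ A)[::-1]      # sorted descending
assert np.allclose(lam, s ** 2, atol=1e-8)
```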


¡ A Aᵀ = U S² Uᵀ

¡ Aᵀ A = V S² Vᵀ

¡ (Aᵀ A)^k = V S^(2k) Vᵀ
§ E.g.: (Aᵀ A)² = V S² Vᵀ V S² Vᵀ = V S⁴ Vᵀ

¡ (Aᵀ A)^k ≈ v₁ σ₁^(2k) v₁ᵀ for k ≫ 1

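This observation is the basis of power iteration: repeatedly multiplying a vector by AᵀA (normalizing each time) converges to v₁. A minimal sketch:

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 0], [3, 3, 3, 0, 0], [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0], [0, 2, 0, 4, 4], [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]], dtype=float)

M = A.T @ A
x = np.ones(M.shape[0])              # any start vector not orthogonal to v1
for _ in range(100):                 # (A^T A)^k x -> direction of v1 for k >> 1
    x = M @ x
    x /= np.linalg.norm(x)

_, _, Vt = np.linalg.svd(A)
assert np.allclose(np.abs(x), np.abs(Vt[0]), atol=1e-6)   # x ~ +/- v1
```

Convergence speed depends on the gap σ₁²/σ₂²; here it is large enough that 100 iterations are plenty.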


¡ Q: Find users that like ‘Matrix’
¡ A: Map the query into a ‘concept space’ - how?





q = [5 0 0 0 0]   (Matrix, Alien, Serenity, Casablanca, Amelie): a user q who rated only ‘Matrix’

(Figure: q plotted in the (v₁, v₂) concept plane.)

Project into concept space: Inner product with each ‘concept’ vector vᵢ




q = [5 0 0 0 0]   (Matrix, Alien, Serenity, Casablanca, Amelie)

(Figure: q’s projection onto v₁ is the inner product q·v₁.)

Project into concept space: Inner product with each ‘concept’ vector vᵢ

Compactly, we have: q_concept = q V

E.g.:


movie-to-concept similarities (V):

                     0.56  0.12
                     0.59 -0.02
q = [5 0 0 0 0]  ×   0.56  0.12   =   [2.8  0.6]
                     0.09 -0.69
                     0.09 -0.69

The first coordinate (2.8) is q’s strength on the SciFi-concept.
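The projection q_concept = q V is a single matrix-vector product; a sketch using the rounded V from these slides:

```python
import numpy as np

# Rounded movie-to-concept matrix V (columns: SciFi-concept, Romance-concept)
V = np.array([[0.56,  0.12],
              [0.59, -0.02],
              [0.56,  0.12],
              [0.09, -0.69],
              [0.09, -0.69]])

q = np.array([5, 0, 0, 0, 0])        # user q rated only 'Matrix' with 5
q_concept = q @ V                    # project the query into concept space
print(np.round(q_concept, 1))        # -> [2.8 0.6]: strongly SciFi
```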


¡ How would the user d that rated (‘Alien’, ‘Serenity’) be handled?
d_concept = d V

E.g.:


movie-to-concept similarities (V):

                     0.56  0.12
                     0.59 -0.02
d = [0 4 5 0 0]  ×   0.56  0.12   =   [5.2  0.4]
                     0.09 -0.69
                     0.09 -0.69

The first coordinate (5.2) is d’s strength on the SciFi-concept.

¡ Observation: User d that rated (‘Alien’, ‘Serenity’) will be similar to user q that rated (‘Matrix’), although d and q have zero ratings in common!


q = [5 0 0 0 0]  →  q_concept = [2.8 0.6]
d = [0 4 5 0 0]  →  d_concept = [5.2 0.4]

Zero ratings in common, yet similarity ≠ 0 in the (SciFi-)concept space.
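This can be verified with cosine similarity. A sketch (using the rounded V from the slides, so the concept coordinates of d come out slightly different from the slide's 5.2/0.4, but the conclusion is the same):

```python
import numpy as np

V = np.array([[0.56, 0.12], [0.59, -0.02], [0.56, 0.12],
              [0.09, -0.69], [0.09, -0.69]])

q = np.array([5, 0, 0, 0, 0])        # rated 'Matrix' only
d = np.array([0, 4, 5, 0, 0])        # rated 'Alien' and 'Serenity' only

cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Zero ratings in common -> similarity 0 in the raw rating space ...
assert cos(q, d) == 0
# ... but high similarity in concept space: both are SciFi users
assert cos(q @ V, d @ V) > 0.95
```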


+ Optimal low-rank approximation in terms of Frobenius norm

- Interpretability problem:
§ A singular vector specifies a linear combination of all input columns or rows

- Lack of sparsity:
§ Singular vectors are dense!


(Figure: A = U S Vᵀ with dense singular vectors.)


¡ Goal: Express A as a product of matrices C, U, R:
Make ǁA − C·U·RǁF small

¡ “Constraints” on C and R: C is a set of actual columns of A, and R a set of actual rows of A

(Figure: A ≈ C · U · R.)

Frobenius norm: ǁXǁF = √(Σᵢⱼ Xᵢⱼ²)

¡ Same goal; additionally, U is built from the pseudo-inverse of the intersection of C and R.


¡ Let Ak be the “best” rank-k approximation to A (that is, Ak comes from the SVD of A)

Theorem [Drineas et al.]
CUR in O(m·n) time achieves
§ ǁA − CURǁF ≤ ǁA − AkǁF + εǁAǁF
with probability at least 1-δ, by picking
§ O(k log(1/δ)/ε²) columns, and
§ O(k² log³(1/δ)/ε⁶) rows


In practice:Pick 4k cols/rows

¡ Sampling columns (similarly for rows):


Note this is a randomized algorithm, same column can be sampled more than once
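The sampling procedure itself was in a figure that did not survive extraction. In the CUR literature it is importance sampling by squared column norm; a hedged sketch, where the 1/√(c·pⱼ) rescaling convention is an assumption carried over from the Drineas et al. analysis:

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 0], [3, 3, 3, 0, 0], [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0], [0, 2, 0, 4, 4], [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]], dtype=float)

rng = np.random.default_rng(0)

def sample_columns(A, c):
    """Sample c columns with replacement, with probability proportional to
    squared column norm; rescale each picked column by 1/sqrt(c * p_j)."""
    p = np.sum(A ** 2, axis=0) / np.sum(A ** 2)
    idx = rng.choice(A.shape[1], size=c, p=p, replace=True)
    return A[:, idx] / np.sqrt(c * p[idx]), idx

C, idx = sample_columns(A, 4)
print(idx)   # duplicates possible: the same column can be sampled twice
```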


¡ Let W be the “intersection” of the sampled columns C and rows R
§ Let the SVD of W = X Z Yᵀ

¡ Then: U = Y (Z⁺)² Xᵀ
§ Z⁺: reciprocals of the non-zero singular values: Z⁺ᵢᵢ = 1/Zᵢᵢ


(Figure: sampled columns C, sampled rows R, and their intersection W inside A.)
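A simplified sketch of the C·U·R construction, using the plain Moore-Penrose pseudo-inverse U = W⁺ (the second variant on these slides) and deterministically chosen columns/rows instead of random sampling, purely for illustration. Because the picked columns and rows span A here (rank(A) = 3), the reconstruction is exact:

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 0], [3, 3, 3, 0, 0], [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0], [0, 2, 0, 4, 4], [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]], dtype=float)

cols = [0, 1, 3]                 # actual columns of A (illustrative choice)
rows = [0, 4, 5]                 # actual rows of A

C = A[:, cols]                   # sampled columns
R = A[rows, :]                   # sampled rows
W = A[np.ix_(rows, cols)]        # intersection of C and R
U = np.linalg.pinv(W)            # U = W^+ (pseudo-inverse)

# rank(W) = rank(A), so C U R reconstructs A exactly in this case
assert np.allclose(C @ U @ R, A)
```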

¡ For example:
§ Select c = O(k log k / ε²) columns of A using the ColumnSelect algorithm
§ Select r = O(k log k / ε²) rows of A using the ColumnSelect algorithm
§ Set U = W⁺

¡ Then: with probability 98%, the CUR error is at most (2 + ε) times the SVD error

In practice: pick 4k columns/rows for a “rank-k” approximation


+ Easy interpretation
• Since the basis vectors are actual columns and rows

+ Sparse basis
• Since the basis vectors are actual columns and rows

- Duplicate columns and rows
• Columns of large norms will be sampled many times


(Figure: a dense singular vector vs. an actual, sparse column.)

¡ If we want to get rid of the duplicates:
§ Throw them away
§ Scale (multiply) the columns/rows by the square root of the number of duplicates


(Figure: duplicated columns/rows Cd, Rd collapsed into scaled Cs, Rs; construct a small U.)



SVD: A = U S Vᵀ
(A: huge but sparse; U, Vᵀ: big and dense; S: dense but small)

CUR: A = C U R
(A: huge but sparse; C, R: big but sparse; U: dense but small)

¡ DBLP bibliographic data
§ Author-to-conference big sparse matrix
§ Aij: Number of papers published by author i at conference j
§ 428K authors (rows), 3659 conferences (columns)
§ Very sparse

¡ Want to reduce dimensionality
§ How much time does it take?
§ What is the reconstruction error?
§ How much space do we need?


5/2/17

32

¡ Accuracy:
§ 1 − relative sum squared errors

¡ Space ratio:
§ #output matrix entries / #input matrix entries

¡ CPU time


(Plots: accuracy, space ratio, and CPU time for SVD, CUR, and CUR with no duplicates.)

Sun, Faloutsos: Less is More: Compact Matrix Decomposition for Large Sparse Graphs, SDM ’07.

¡ SVD is limited to linear projections:
§ Lower-dimensional linear projection that preserves Euclidean distances

¡ Non-linear methods: Isomap
§ Data lies on a nonlinear low-dimensional curve, aka manifold

§ Use the distance as measured along the manifold

§ How?
§ Build adjacency graph
§ Geodesic distance is graph distance

§ SVD/PCA the graph pairwise distance matrix

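The three Isomap steps above can be sketched end-to-end on toy data: points on a half circle (a 1-D manifold in 2-D), an ε-neighborhood graph, Floyd-Warshall for graph distances, and classical MDS (an eigendecomposition, PCA-style) on the distance matrix. The ε value and data are illustrative choices, not from the slides:

```python
import numpy as np

# Points on a 1-D curve (a half circle) embedded in 2-D
t = np.linspace(0, np.pi, 10)
X = np.c_[np.cos(t), np.sin(t)]
n = len(X)

# 1) Build adjacency graph: connect points closer than eps
eps = 0.4
D = np.full((n, n), np.inf)
for i in range(n):
    D[i, i] = 0.0
    for j in range(n):
        dist = np.linalg.norm(X[i] - X[j])
        if 0 < dist < eps:
            D[i, j] = dist

# 2) Geodesic distance = graph distance (Floyd-Warshall)
for k in range(n):
    D = np.minimum(D, D[:, [k]] + D[[k], :])

# Endpoints: straight-line distance 2, distance along the manifold ~ pi
assert np.linalg.norm(X[0] - X[-1]) < 2.01
assert D[0, -1] > 3.0

# 3) PCA/MDS on the pairwise distances: double-center D^2, take top eigenvector
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (D ** 2) @ J
w, v = np.linalg.eigh(B)
coord = v[:, -1] * np.sqrt(w[-1])    # 1-D embedding

# The 1-D coordinates recover the ordering of the points along the curve
if coord[0] > coord[-1]:
    coord = -coord
assert np.all(np.diff(coord) > 0)
```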


¡ Drineas et al.: Fast Monte Carlo Algorithms for Matrices III: Computing a Compressed Approximate Matrix Decomposition, SIAM Journal on Computing, 2006.

¡ J. Sun, Y. Xie, H. Zhang, C. Faloutsos: Less is More: Compact Matrix Decomposition for Large Sparse Graphs, SDM 2007.

¡ P. Paschou, M. W. Mahoney, A. Javed, J. R. Kidd, A. J. Pakstis, S. Gu, K. K. Kidd, and P. Drineas: Intra- and interpopulation genotype reconstruction from tagging SNPs, Genome Research, 17(1), 96-107 (2007).

¡ M. W. Mahoney, M. Maggioni, and P. Drineas: Tensor-CUR Decompositions For Tensor-Based Data, Proc. 12th Annual SIGKDD, 327-336 (2006).
