
Topic Modeling with LSA, PLSA, LDA & lda2Vec

Joyce Xu in NanoNets

May 25, 2018 · 12 min read

This article is a comprehensive overview of Topic Modeling and its associated techniques.

In natural language understanding (NLU) tasks, there is a hierarchy of lenses through which we can extract meaning: from words to sentences to paragraphs to documents. At the document level, one of the most useful ways to understand text is by analyzing its topics. The process of learning, recognizing, and extracting these topics across a collection of documents is called topic modeling. In this post, we will explore topic modeling through 4 of the most popular techniques today: LSA, pLSA, LDA, and the newer, deep learning-based lda2vec.

Overview

All topic models are based on the same basic assumption:

each document consists of a mixture of topics, and each topic consists of a collection of words.

In other words, topic models are built around the idea that the semantics of our document are actually being governed by some hidden, or "latent," variables that we are not observing. As a result, the goal of topic modeling is to uncover these latent variables, the topics, that shape the meaning of our document and corpus. The rest of this blog post will build up an understanding of how different topic models uncover these latent topics.

LSA

Latent Semantic Analysis, or LSA, is one of the foundational techniques in topic modeling. The core idea is to take a matrix of what we have (documents and terms) and decompose it into a separate document-topic matrix and a topic-term matrix.

The first step is generating our document-term matrix. Given m documents and n words in our vocabulary, we can construct an m × n matrix A in which each row represents a document and each column represents a word. In the simplest version of LSA, each entry can simply be a raw count of the number of times the j-th word appeared in the i-th document. In practice, however, raw counts do not work particularly well because they do not account for the significance of each word in the document. For example, the word "nuclear" probably informs us more about the topic(s) of a given document than the word "test."

Consequently, LSA models typically replace raw counts in the document-term matrix with a tf-idf score. Tf-idf, or term frequency-inverse document frequency, assigns a weight for term j in document i as follows:

w(i, j) = tf(i, j) × log(N / df(j))

where tf(i, j) counts the occurrences of term j in document i, df(j) is the number of documents containing term j, and N is the total number of documents.


Intuitively, a term has a large weight when it occurs frequently across the document but infrequently across the corpus. The word "build" might appear often in a document, but because it's likely fairly common in the rest of the corpus, it will not have a high tf-idf score. However, if the word "gentrification" appears often in a document, because it is rarer in the rest of the corpus, it will have a higher tf-idf score.
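To make the weighting concrete, here is a minimal sketch using sklearn's TfidfVectorizer on a made-up three-document corpus (the documents and word choices are invented for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: "build" appears in every document, "gentrification" in one only
docs = [
    "we build houses and build roads",
    "they build schools in the city",
    "gentrification changes the city as developers build",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)
vocab = vectorizer.vocabulary_

# Compare the weights of the two words within the third document
print("build:", tfidf[2, vocab["build"]])
print("gentrification:", tfidf[2, vocab["gentrification"]])
# "gentrification" scores higher: frequent here, but rare across the corpus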

Once we have our document-term matrix A, we can start thinking about our latent topics. Here's the thing: in all likelihood, A is very sparse, very noisy, and very redundant across its many dimensions. As a result, to find the few latent topics that capture the relationships among the words and documents, we want to perform dimensionality reduction on A.

This dimensionality reduction can be performed using truncated SVD. SVD, or singular value decomposition, is a technique in linear algebra that factorizes any matrix M into the product of 3 separate matrices: M = U * S * V^T, where S is a diagonal matrix of the singular values of M. Critically, truncated SVD reduces dimensionality by selecting only the t largest singular values, and only keeping the first t columns of U and V. In this case, t is a hyperparameter we can select and adjust to reflect the number of topics we want to find.

Intuitively, think of this as only keeping the t most significant dimensions in our transformed space. In this case, U ∈ ℝ^(m × t) emerges as our document-topic matrix, and V ∈ ℝ^(n × t) becomes our term-topic matrix. In both U and V, the columns correspond to one of our t topics. In U, rows represent document vectors expressed in terms of topics; in V, rows represent term vectors expressed in terms of topics.
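As a quick sanity check of these shapes, a minimal numpy sketch of truncated SVD, assuming a made-up 6 × 5 document-term matrix and t = 2 topics, might look like this:

import numpy as np

# Hypothetical document-term matrix: m=6 documents, n=5 terms
A = np.random.rand(6, 5)

U, S, Vt = np.linalg.svd(A, full_matrices=False)

t = 2                      # number of topics to keep
U_t = U[:, :t]             # document-topic matrix, shape (6, 2)
S_t = np.diag(S[:t])       # top-t singular values, shape (2, 2)
V_t = Vt[:t, :].T          # term-topic matrix, shape (5, 2)

# Rank-t reconstruction approximates A using only t latent topics
A_approx = U_t @ S_t @ V_t.T
print(U_t.shape, V_t.shape, np.linalg.norm(A - A_approx))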

With these document vectors and term vectors, we can now easily apply measures such as cosine similarity to evaluate:

- the similarity of different documents
- the similarity of different words
- the similarity of terms (or "queries") and documents (which becomes useful in information retrieval, when we want to retrieve passages most relevant to our search query)

Code

In sklearn, a simple implementation of LSA might look something like this:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline

documents = ["doc1.txt", "doc2.txt", "doc3.txt"]

# raw documents to tf-idf matrix (input='filename' tells the vectorizer
# to read each file, rather than treat the strings as the text itself):
vectorizer = TfidfVectorizer(input='filename',
                             stop_words='english',
                             use_idf=True,
                             smooth_idf=True)

# SVD to reduce dimensionality:
svd_model = TruncatedSVD(n_components=100,  # number of latent dimensions (topics)
                         algorithm='randomized',
                         n_iter=10)

# pipeline of tf-idf + SVD, fit to and applied to documents:
svd_transformer = Pipeline([('tfidf', vectorizer),
                            ('svd', svd_model)])
svd_matrix = svd_transformer.fit_transform(documents)

# svd_matrix can later be used to compare documents, compare words,
# or compare queries with documents
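As a follow-up, comparing two of the reduced document vectors could look like the sketch below (assuming the svd_matrix produced above):

from sklearn.metrics.pairwise import cosine_similarity

# similarity between the first two documents in topic space
sim = cosine_similarity(svd_matrix[0:1], svd_matrix[1:2])
print(sim[0, 0])  # close to 1.0 = very similar, close to 0.0 = unrelated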

LSA is quick and efficient to use, but it does have a few primary drawbacks:

- lack of interpretable embeddings (we don't know what the topics are, and the components may be arbitrarily positive/negative)
- need for a really large set of documents and vocabulary to get accurate results
- less efficient representation

PLSA

pLSA, or Probabilistic Latent Semantic Analysis, uses a probabilistic method instead of SVD to tackle the problem. The core idea is to find a probabilistic model with latent topics that can generate the data we observe in our document-term matrix. In particular, we want a model P(D, W) such that for any document d and word w, P(d, w) corresponds to that entry in the document-term matrix.

Recall the basic assumption of topic models: each document consists of a mixture of topics, and each topic consists of a collection of words. pLSA adds a probabilistic spin to these assumptions:

- given a document d, topic z is present in that document with probability P(z|d)
- given a topic z, word w is drawn from z with probability P(w|z)

Formally, the joint probability of seeing a given document and word together is:

P(D, W) = P(D) Σ_Z P(Z|D) P(W|Z)

Intuitively, the right-hand side of this equation is telling us how likely it is to see some document, and then, based upon the distribution of topics of that document, how likely it is to find a certain word within that document.

In this case, P(D), P(Z|D), and P(W|Z) are the parameters of our model. P(D) can be determined directly from our corpus. P(Z|D) and P(W|Z) are modeled as multinomial distributions, and can be trained using the expectation-maximization algorithm (EM). Without going into a full mathematical treatment of the algorithm, EM is a method of finding the likeliest parameter estimates for a model which depends on unobserved, latent variables (in our case, the topics).

Interestingly, P(D, W) can be equivalently parameterized using a different set of 3 parameters:

P(D, W) = Σ_Z P(Z) P(D|Z) P(W|Z)

We can understand this equivalency by looking at the model as a generative process. In our first parameterization, we were starting with the document with P(d), and then generating the topic with P(z|d), and then generating the word with P(w|z). In this parameterization, we are starting with the topic with P(z), and then independently generating the document with P(d|z) and the word with P(w|z).

https://www.slideshare.net/NYCPredictiveAnalytics/introduction-to-probabilistic-latent-semantic-analysis

The reason this new parameterization is so interesting is because we can see a direct parallel between our pLSA model and our LSA model:

P(D, W) = Σ_Z P(D|Z) P(Z) P(W|Z)   ↔   A = U * S * V^T

where the probability of our topic P(Z) corresponds to the diagonal matrix of our singular topic probabilities, the probability of our document given the topic P(D|Z) corresponds to our document-topic matrix U, and the probability of our word given the topic P(W|Z) corresponds to our term-topic matrix V.
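To make the parallel concrete, here is a tiny numpy sketch of the factorization structure (made-up sizes and random stochastic matrices; this illustrates the algebra of the parameterization, not a trained pLSA model):

import numpy as np

rng = np.random.default_rng(0)
m, n, t = 4, 6, 2  # documents, words, topics

P_z = rng.dirichlet(np.ones(t))               # P(Z): topic probabilities, like diag(S)
P_d_given_z = rng.dirichlet(np.ones(m), t).T  # P(D|Z): m x t, columns sum to 1, like U
P_w_given_z = rng.dirichlet(np.ones(n), t)    # P(W|Z): t x n, rows sum to 1, like V^T

# Joint P(D, W): the same three-factor product as U * S * V^T
P_dw = P_d_given_z @ np.diag(P_z) @ P_w_given_z
print(P_dw.shape, P_dw.sum())  # (4, 6); sums to 1.0, as a joint distribution must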

So what does that tell us? Although it looks quite different and approaches the problem in a very different way, pLSA really just adds a probabilistic treatment of topics and words on top of LSA. It is a far more flexible model, but still has a few problems. In particular:

- Because we have no parameters to model P(D), we don't know how to assign probabilities to new documents
- The number of parameters for pLSA grows linearly with the number of documents we have, so it is prone to overfitting

We will not look at any code for pLSA because it is rarely used on its own. In general, when people are looking for a topic model beyond the baseline performance LSA gives, they turn to LDA. LDA, the most common type of topic model, extends pLSA to address these issues.

LDA

LDA stands for Latent Dirichlet Allocation. LDA is a Bayesian version of pLSA. In particular, it uses dirichlet priors for the document-topic and word-topic distributions, lending itself to better generalization.

I am not going to go into an in-depth treatment of dirichlet distributions, since there are very good intuitive explanations here and here. As a brief overview, however, we can think of dirichlet as a "distribution over distributions." In essence, it answers the question: "given this type of distribution, what are some actual probability distributions I am likely to see?"

Consider the very relevant example of comparing probability distributions of topic mixtures. Let's say the corpus we are looking at has documents from 3 very different subject areas. If we want to model this, the type of distribution we want will be one that very heavily weights one specific topic, and doesn't give much weight to the rest at all. If we have 3 topics, then some specific probability distributions we'd likely see are:

- Mixture X: 90% topic A, 5% topic B, 5% topic C
- Mixture Y: 5% topic A, 90% topic B, 5% topic C
- Mixture Z: 5% topic A, 5% topic B, 90% topic C

If we draw a random probability distribution from this dirichlet distribution, parameterized by large weights on a single topic, we would likely get a distribution that strongly resembles either mixture X, mixture Y, or mixture Z. It would be very unlikely for us to sample a distribution that is 33% topic A, 33% topic B, and 33% topic C.

That's essentially what a dirichlet distribution provides: a way of sampling probability distributions of a specific type.
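To see this behavior directly, here is a minimal numpy sketch drawing topic mixtures from a sparse symmetric dirichlet (the alpha values are made up; concentration parameters below 1 push nearly all the mass onto a single topic, like mixtures X, Y, and Z above):

import numpy as np

rng = np.random.default_rng(0)

# Sparse dirichlet over 3 topics: alpha < 1 favors one dominant topic
alpha = [0.1, 0.1, 0.1]
samples = rng.dirichlet(alpha, size=5)

for mixture in samples:
    print(np.round(mixture, 2))
# Typical rows look like [0.97 0.02 0.01]: one dominant topic per draw,
# while a near-uniform [0.33 0.33 0.33] draw is very unlikely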

Recall the model for pLSA:

P(D, W) = P(D) Σ_Z P(Z|D) P(W|Z)

In pLSA, we sample a document, then a topic based on that document, then a word based on that topic. Here is the model for LDA:


From a dirichlet distribution Dir(α), we draw a random sample representing the topic distribution, or topic mixture, of a particular document. This topic distribution is θ. From θ, we select a particular topic Z based on the distribution.

Next, from another dirichlet distribution Dir(β), we select a random sample representing the word distribution of the topic Z. This word distribution is φ. From φ, we choose the word w.

Formally, the process for generating each word from a document is as follows (beware this algorithm uses c instead of z to represent the topic):

https://cs.stanford.edu/~ppasupat/a9online/1140.html
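In code, the generative story above can be sketched roughly as follows (numpy only; the topic count, vocabulary size, document length, and prior values are made-up illustrations, not the algorithm at the link verbatim):

import numpy as np

rng = np.random.default_rng(0)

n_topics, vocab_size, doc_len = 3, 10, 8
alpha = np.full(n_topics, 0.1)     # document-topic prior Dir(alpha)
beta = np.full(vocab_size, 0.01)   # topic-word prior Dir(beta)

# One word distribution phi per topic, drawn from Dir(beta)
phi = rng.dirichlet(beta, size=n_topics)

# Per document: draw a topic mixture theta from Dir(alpha), then for each
# word position draw a topic z from theta and a word w from phi[z]
theta = rng.dirichlet(alpha)
doc = []
for _ in range(doc_len):
    z = rng.choice(n_topics, p=theta)
    w = rng.choice(vocab_size, p=phi[z])
    doc.append(w)
print(doc)  # a generated document, as a list of word ids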

LDA typically works better than pLSA because it can generalize to new documents easily. In pLSA, the document probability is a fixed point in the dataset. If we haven't seen a document, we don't have that data point. In LDA, the dataset serves as training data for the dirichlet distribution of document-topic distributions. If we haven't seen a document, we can easily sample from the dirichlet distribution and move forward from there.

Code

LDA is easily the most popular (and typically most effective) topic modeling technique out there. It's available in gensim for easy use:

from gensim.corpora import Dictionary, MmCorpus
from gensim.models import LdaModel

document = "This is some document..."

# load id -> word mapping (the dictionary)
id2word = Dictionary.load_from_text('wiki_en_wordids.txt')

# load corpus iterator
mm = MmCorpus('wiki_en_tfidf.mm')

# extract 100 LDA topics, updating once every 10,000 documents
lda = LdaModel(corpus=mm, id2word=id2word, num_topics=100,
               update_every=1, chunksize=10000, passes=1)

# use LDA model: transform new doc to bag-of-words, then apply lda
doc_bow = id2word.doc2bow(document.split())
doc_lda = lda[doc_bow]

# doc_lda is a sparse list of (topic id, weight) pairs representing the
# weighted presence of each topic in the doc
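To inspect what the learned topics actually look like, gensim's print_topics helper lists the top words per topic; a quick sketch:

# show the 5 most significant topics as weighted word lists
for topic_id, topic in lda.print_topics(num_topics=5, num_words=8):
    print(topic_id, topic)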

With LDA, we can extract human-interpretable topics from a document corpus, where each topic is characterized by the words it is most strongly associated with. For example, topic 2 could be characterized by terms such as "oil, gas, drilling, pipes, Keystone, energy," etc. Furthermore, given a new document, we can obtain a vector representing its topic mixture, e.g. 5% topic 1, 70% topic 2, 10% topic 3, etc. These vectors are often very useful for downstream applications.

LDA in Deep Learning: lda2vec

So where do these topic models factor into more complex natural language processing problems?

At the beginning of this post, we talked about how important it is to be able to extract meaning from text at every level: word, paragraph, document. At the document level, we now know how to represent the text as mixtures of topics. At the word level, we typically use something like word2vec to obtain vector representations. lda2vec is an extension of word2vec and LDA that jointly learns word, document, and topic vectors.

Here's how it works.

lda2vec specifically builds on top of the skip-gram model of word2vec to generate word vectors. If you're not familiar with skip-gram and word2vec, you can read up on it here, but essentially it's a neural net that learns a word embedding by trying to use the input word to predict surrounding context words.


With lda2vec, instead of using the word vector directly to predict context words, we leverage a context vector to make the predictions. This context vector is created as the sum of two other vectors: the word vector and the document vector.

The word vector is generated by the same skip-gram word2vec model discussed earlier. The document vector is more interesting. It is really a weighted combination of two other components:

- the document weight vector, representing the "weights" (later to be transformed into percentages) of each topic in the document
- the topic matrix, representing each topic and its corresponding vector embedding

Together, the document vector and the word vector generate "context" vectors for each word in the document. The power of lda2vec lies in the fact that it not only learns word embeddings (and context vector embeddings) for words, it simultaneously learns topic representations and document representations as well.
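As a rough sketch of that composition (numpy only; the dimensions, random initialization, and softmax weighting are assumptions following the description above, not Moody's actual implementation):

import numpy as np

rng = np.random.default_rng(0)
n_topics, embed_dim = 20, 300

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# learned pieces (randomly initialized here for illustration)
word_vector = rng.normal(size=embed_dim)               # from skip-gram
doc_weights = rng.normal(size=n_topics)                # document weight vector
topic_matrix = rng.normal(size=(n_topics, embed_dim))  # one embedding per topic

# document vector = topic proportions (softmaxed weights) times topic matrix
doc_vector = softmax(doc_weights) @ topic_matrix

# context vector = word vector + document vector, used to predict context words
context_vector = word_vector + doc_vector
print(context_vector.shape)  # (300,)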


https://multithreaded.stitchfix.com/blog/2016/05/27/lda2vec

For a more detailed overview of the model, check out Chris Moody's original blog post (Moody created lda2vec in 2016). Code can be found at Moody's github repository and this Jupyter Notebook example.

Conclusion

All too often, we treat topic models as black-box algorithms that "just work." Fortunately, unlike many neural nets, topic models are actually quite interpretable and much more straightforward to diagnose, tune, and evaluate. Hopefully this blog post has been able to explain the underlying math, motivations, and intuition you need, and leave you enough high-level code to get started. Please leave your thoughts in the comments, and happy hacking!

About Nanonets

Nanonets makes it super easy to use Deep Learning. You can build a model with your own data to achieve high accuracy & use our APIs to integrate the same in your application.
