Raghava Mukkamala and Ravi Vatrapu
Centre for Business Data Analytics (bda.cbs.dk), Department of IT Management, Copenhagen Business School
Phone: +45-4185-2299 Email: [email protected]
Web: http://www.cbs.dk/en/staff/rrmitm
Pre-ICIS Workshop on Text Mining as a Strategy of Inquiry in Information Systems Research, Sunday, 11 December 2016, Dublin, Ireland
Motivation
• Automated text analysis has gained prominence because it can substantially reduce the cost of analyzing large volumes of text
• New Big Social Data Analytics techniques increasingly use text analysis as part of their mainstream analysis
  • due to the ubiquitous use of social media platforms by millions of users
  • huge amounts of content (including text) accumulate continuously
• Goal: to find out users' opinions, emotions, etc. from text documents
Big Social Data Analytics – Overall Methodology
[IEEE Big Data Congress 2014], [IEEE Access 2016]
Text Classification
• Classification: assigning text documents* to predefined categories
• Category: a set of labels modeling a domain-specific concept
  • E.g. Sentiment or Emotion [1] analysis
  • Social Influence [2]: Reciprocity, Commitment and Consistency, Social Proof, Authority, Liking, Scarcity
• Classifying documents into known categories:
  • Dictionary methods (or lexicon-based models)
  • Supervised learning methods
• Discovering (unknown) categories and topics
  • Topic modeling
1) Ekman, Paul. "An argument for basic emotions." Cognition & Emotion 6.3-4 (1992): 169-200.
2) Cialdini, Robert B. Influence. Vol. 3. A. Michel, 1987.
* Text document: a unit of text, which could be one or more words, sentences, or paragraphs
Dictionary Methods
• Dictionary methods:
  • Use the rate/proportion at which keywords appear in documents
  • Use a list of words with scores to determine the document's category label
  • E.g. boring: -1, disgust: -3, inspire: +2, masterpiece: +5
• Limited to categories for which dictionaries are available (Sentiment, Emotion, etc.)
• Domain-specific, i.e. accuracy depends on the domain from which the words are taken
  • Usage of the word {crude} in “crude oil” vs “a crude joke”
• Validating dictionaries is hard: when we use them, we do not know in advance how much accuracy we will get
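A minimal sketch of the word-score idea above, using only the four example scores from the slide. The function names, the tokenizer, and the zero thresholds for the labels are assumptions made for illustration; this is not the lexicon or code used in the tool.

```python
import re

# Word scores taken from the slide's example; a real lexicon would be far larger.
LEXICON = {"boring": -1, "disgust": -3, "inspire": 2, "masterpiece": 5}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def dictionary_score(text, lexicon=LEXICON):
    """Sum the scores of all lexicon words that occur in the text."""
    return sum(lexicon.get(token, 0) for token in tokenize(text))


def dictionary_label(text, lexicon=LEXICON):
    """Map the summed score to a coarse label (thresholds are hypothetical)."""
    score = dictionary_score(text, lexicon)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Note that `dictionary_label("A boring film, not a masterpiece")` comes out as "positive" (-1 + 5 = +4), because plain word scoring ignores context such as negation. This is the same kind of weakness as the “crude oil” vs “a crude joke” example.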
Supervised Learning Methods
• Supervised text classification
  • Uses training sets (documents with labels) manually coded by domain experts
  • Can be used with any domain-specific model or category
  • More accurate results than dictionary-based approaches
  • Validation is easy, as these methods provide performance measures
  • Drawback: preparing training sets can be expensive
• Applications: spam detection, age/gender identification, language identification, sentiment analysis, and so on
Ambiguity makes NLP hard
Real newspaper headlines
• Teacher Strikes Idle Kids
  • #1 The teacher is on strike, which idles the kids.
  • #2 A teacher strikes kids who are idle.
• Ban on Nude Dancing on Governor's Desk
  • #1 Ban on [Nude Dancing on Governor’s Desk]
  • #2 [Ban on Nude Dancing] on Governor’s Desk
• If the text is ambiguous, classification accuracy may vary
Dan Jurafsky and Christopher Manning. Natural Language Processing (Coursera - Stanford University) https://www.coursera.org/course/nlp
MUTATO: Frontend
MUTATO: Architecture
[Figure: MUTATO architecture diagram. Recoverable labels:
• Actors: Domain Expert, Text Analyst
• Input: Text Corpus (social data, documents, etc.); example shown: Global Perspective documents: 21
• Multi-dimensional Text Classification Tool, built with the Natural Language Toolkit (NLTK), Gensim, Python, and ASP.Net
• Text mining / topic modeling: Text Extraction, Text Preprocessing, Word Frequency Analysis (most frequent words, frequency distributions), Collocation Analysis (bigrams, trigrams, and n-grams), Keyword Analysis (factors with search words, keyword counts, most prominent categories), Topic Modeling (discovering topics and categories)
• Text classification: domain experts code the training set, Classifier Training on the training set with labels, Classified Texts with Labels (multi-label, domain-specific)
• Performance measures: classification performance (Accuracy, Precision, Recall, F-Measure) and inter-coder agreement (inter-rater agreement, Cohen's Kappa)]
LDIC 2016 (Best Paper Nomination)
Text Mining (Unsupervised)
• Keyword analysis: keyword counts using the Natural Language Toolkit (NLTK)
• Word frequency analysis: frequently occurring words from a given text corpus, computed using the term-document matrix (e.g. the top 100 most frequent words)
• Collocation analysis: collocations are multi-word expressions that commonly co-occur in the documents
  • Provides insight into the documents through bigrams, trigrams, and n-grams of co-occurring words
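The three analyses above can be sketched with plain counting. The slides name NLTK for this step; the sketch below deliberately uses only the Python standard library so that it stays self-contained. All function names and the toy corpus are assumptions for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_frequencies(documents, top_n=100):
    """Return the top_n most frequent words across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return counts.most_common(top_n)


def keyword_counts(documents, keywords):
    """Count how often each search word occurs in the corpus."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return {kw: counts[kw.lower()] for kw in keywords}


def ngram_counts(documents, n=2):
    """Count word n-grams (n=2 bigrams, n=3 trigrams) within each document."""
    counts = Counter()
    for doc in documents:
        tokens = tokenize(doc)
        # Zip n shifted copies of the token list to form consecutive n-grams.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts


docs = ["crude oil prices rise", "crude oil exports fall", "a crude joke"]
print(word_frequencies(docs, top_n=3))
print(ngram_counts(docs, n=2).most_common(1))
```

Raw n-gram counts are the simplest possible collocation signal. Ranking candidates by an association measure instead of raw frequency is a common refinement that this sketch leaves out.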
Topic Modeling (Unsupervised)
• To identify/discover topics and information patterns in text
• Clustering techniques group words based on similarity distances
• Tool based on the Gensim [1] library + Python
1) Gensim, topic modeling for humans: https://radimrehurek.com/gensim/
Text Classification
• Input:
  • a document d
  • a fixed set of classes C = {c1, c2, …, cJ}
  • a training set of m hand-labeled documents {(d1, c1), ..., (dm, cm)}
• Output:
  • a learned (trained) classifier γ: d → c, with c ∈ C
• Classifiers using the Naïve Bayes algorithm
  • Alternatives: logistic regression, support-vector machines, neural networks
• Naïve Bayes classifier
  • Based on Bayes' rule of conditional probabilities
  • Bag-of-words approach
  • Requires training sets manually coded by domain experts
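A minimal sketch of a bag-of-words Naïve Bayes classifier matching the input/output description above: it is trained on hand-labeled pairs (d, c) and returns the class with the highest posterior. It assumes a multinomial model with add-one (Laplace) smoothing, which the slides do not specify. The class name, tokenizer, and toy training set are illustrative, and this is not the tool's own script.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesClassifier:
    """Multinomial Naive Bayes over bag-of-words counts, add-one smoothing."""

    def fit(self, documents, labels):
        self.class_doc_counts = Counter(labels)   # documents per class
        self.word_counts = defaultdict(Counter)   # word counts per class
        self.vocabulary = set()
        for text, label in zip(documents, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        self.total_docs = len(labels)
        return self

    def log_posteriors(self, text):
        """Return log P(c) + sum of log P(w|c) for each class c."""
        # Words never seen in training are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocabulary]
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, doc_count in self.class_doc_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(doc_count / self.total_docs)
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            scores[label] = score
        return scores

    def predict(self, text):
        """Return the class with the highest log posterior."""
        scores = self.log_posteriors(text)
        return max(scores, key=scores.get)


train_docs = [
    "a masterpiece that will inspire you",
    "truly inspiring masterpiece",
    "boring and full of disgust",
    "a boring disgusting mess",
]
train_labels = ["pos", "pos", "neg", "neg"]
clf = NaiveBayesClassifier().fit(train_docs, train_labels)
print(clf.predict("what a masterpiece"))
```

Working in log space avoids numeric underflow when many small word probabilities are multiplied.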
Training Sets – Manual Coding
• Systematic approach for manual content analysis suggested by Rebecca Morris [1]
• Reliability: Cohen's kappa value
  • po (observed agreement) = 0.16 + 0.31 + 0.41 = 0.88
  • pc (agreement expected by chance) = (0.20 × 0.21) + (0.37 × 0.34) + (0.43 × 0.45) = 0.362
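A minimal sketch of how these two figures combine into Cohen's kappa, κ = (po − pc) / (1 − pc). It assumes the factor pairs in pc are the two coders' marginal proportions per category; the function and variable names are illustrative. With the rounded figures shown, pc works out to about 0.361 (the slide reports 0.362, presumably from unrounded proportions), giving κ of roughly 0.81.

```python
def chance_agreement(marginals_a, marginals_b):
    """Expected agreement by chance: sum of products of per-category marginals."""
    return sum(a * b for a, b in zip(marginals_a, marginals_b))


def cohens_kappa(observed_agreement, expected_agreement):
    """Cohen's kappa: agreement beyond chance, scaled by the maximum possible."""
    return (observed_agreement - expected_agreement) / (1.0 - expected_agreement)


# Figures from the slide (rounded to two decimals there).
p_o = 0.16 + 0.31 + 0.41
p_c = chance_agreement([0.20, 0.37, 0.43], [0.21, 0.34, 0.45])
kappa = cohens_kappa(p_o, p_c)
print(round(p_o, 2), round(p_c, 3), round(kappa, 2))
```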
1. R. Morris, “Computerized content analysis in management research: A demonstration of advantages & limitations,” Journal of Management, vol. 20, no. 4, pp. 903–931, 1994.
Model Training Tool
• https://textmining.cbs.dk/TextClassification/ClsssifyTextModels.aspx
Domain-Specific Classifier #01: Marketing
“Heres an idea. If you like their food eat there. If you dont like their food eat somewhere else or make your own meal. I really dont understand what the big deal is.”
[Figure labels: User Consumer, Organisation, Social Influence]
Domain-Specific Classifier #02: Operations Research
“Brazilian highway transport showcases a series of positive features such as flexibility, availability, and speed. However, when compared to other modes, it bears limitations such as low productivity, low energy efficiency, and low safety indices.”
Domain-Specific Classifier #03: Public Health
“What this post is saying: Some obese people don't suffer from Type 2 Diabetes.. What this post isn't saying: Obesity doesn't cause Type 2 Diabetes.. You can be healthy and obese.”
Classifier
• Uses the Natural Language Toolkit and Python
• Custom Python script (~1000 lines) used for training and classification of the texts
• MUTATO 1.0 automates the whole process as a tool
Performance Measures
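The measures named for the tool (accuracy, precision, recall, F-measure) can be sketched for a single category as follows. This is a plain-Python illustration with hypothetical names and toy labels. It is not MUTATO's code, and the numbers are not results from the tool.

```python
def classification_metrics(true_labels, predicted_labels, positive_label):
    """Compute accuracy, precision, recall, and F1 for one target category."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive_label and p == positive_label)
    fp = sum(1 for t, p in pairs if t != positive_label and p == positive_label)
    fn = sum(1 for t, p in pairs if t == positive_label and p != positive_label)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Toy labels, invented purely for illustration.
truth = ["pos", "pos", "neg", "neg", "pos"]
predicted = ["pos", "neg", "neg", "pos", "pos"]
print(classification_metrics(truth, predicted, positive_label="pos"))
```

For multi-label classification, the same function can be called once per category, treating that category as the positive label.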
Tool Statistics
• Languages supported: English, Danish, Norwegian, Swedish, [Finnish]
• Classification done for
  • 20 BDA/BSDA student projects with a variety of datasets: H&M, Danske Bank, Volkswagen crisis, Skavlan talk show, TV2 Norway, etc.
  • ~10 Master's thesis projects: Patient-journey, Jabra-Classification, Skat data, SAS vs Norwegian Airlines, Transportation Logistics, Couchsurfing FB data
  • Research articles: 12
Research Publications: Text Analytics
IEEE EDOC 2014, IEEE Big Data 2016, IEEE Big Data 2016
Research Publications: Social Media Crisis
IEEE Big Data Congress 2015, IEEE EDOC 2015, IEEE Access Journal
Research Publications: Crowdfunding & Crowdsourcing
ACM Mindtrek 2016, HICSS 2016
Research Publications: Public Health
ICTH 2016, IEEE HealthCom 2016
Research Publications: Operations Research
LDIC 2016 (Best Paper Nomination), WCTR 2016
Future Research
• Text summarization techniques for asynchronous communication
  • Danish, Norwegian, and Finnish languages
  • Discourse analysis for asynchronous communication (such as blogs, social media)
  • Based on Hidden Markov models and graph optimization techniques
  • Using intra-sentential rhetorical parse trees and aspect-based discourse trees
Thank You [email protected], [email protected]