Website and Social Media Analytics - A Text Mining Approach
Yilu Zhou, PhD, Associate Professor
Gabelli School of Business, Fordham University
The Text Mining Task Pyramid
Case 1: Social Network and Content Analysis
Who links to whom and who influences whom? How are the sites used? Which sites are more sophisticated?
Yilu Zhou, Jialun Qin, Hsinchun Chen, Edna Reid
MDS Visualization of Arab Group Web Sites
Clusters shown: Hizb-Ut-Tahrir, Jihad supporters, Palestinian supporters, Hizballah cluster, Palestinian terrorists
Comparison - Content Analysis
[Figure: two charts of normalized content levels (scale 0 to 1) per group.
Panel 1, U.S. Domestic Extremist Web sites: Black Separatists, Christian Identity, Militia, Neo-Confederates, Neo-Nazis/White Supremacists, Eco-Terrorism.
Panel 2, Middle Eastern Extremist Web sites: Hizb-ut-Tahrir, Hizbollah, Al-Qaeda Linked Websites, Jihad Sympathizers, Palestinian terrorist groups.
Content categories in both panels: Communications, Fundraising, Ideology / Sharing Ideology, Propaganda (insiders), Propaganda (outsiders), Virtual Community, Command and Control, Recruitment and Training.]
Forum Content Analysis
Content codes: 1 - Links to terrorist Web sites; 2 - Messages from terrorist groups; 3 - Ideology; 4 - Computer viruses

Yahoo! Group Name | Description/Organization | Content | Language | Postings after June 2004 | Members | Total postings
black_heart2010 · سرايا الموت للجهاد الإلكتروني (Death Brigades of Electronic Jihad) | Salafi Groups / Al-Qaeda; registration required | 1-2-3-4 | Arabic | 80 | 183 | 331
Jehaaadlast · حي على الجهاد في سبيل الله | Salafi Groups / Al-Qaeda; registration not required | 1-2-3 | Arabic | 235 | 436 | 494
mojahidon · المجاهدون | Salafi Groups / Al-Qaeda; registration required | 1-2-3-4 | Arabic | 23 | 110 | 41
shahed4pal | Salafi Groups / Al-Qaeda; registration required | 1-2-3-4 | Arabic | 25 | 132 | 85
islamic-union · الإتحاد الإسلامي | Salafi Groups / Al-Qaeda; registration required | 1-2-3-4 | Arabic | 62 | 108 | 594
Some opposing users intentionally post attachments containing viruses to disrupt the operation of these groups. The number of postings increased in July 2004.
Topic Map: Sub-topic identification
U.S. and Middle Eastern Intensity Scores
U.S. Forum | Racism | Violence | Middle Eastern Forum | Racism | Violence
Angelic Adolf | 5.513 | 0.962 | Azzamy | 30.182 | 19.833
Aryan Nation | 9.921 | 5.683 | Friends | 2.076 | 6.238
CCNU | 3.712 | 14.546 | Islamic Union | 2.657 | 9.198
Neo-Nazi | 5.458 | 5.614 | Kataeb | 2.610 | 6.605
NSM | 10.740 | 10.740 | Kataeb Qassam | 25.203 | 18.670
Smash Nazi | 12.424 | 10.591 | Taybah | 14.989 | 15.348
White Knights | 19.313 | 6.353 | Osama Lover | 14.369 | 14.584
World Knights | 2.468 | 2.234 | Wa Islamah | 4.075 | 9.193
All Forums | 10.988 | 6.902 | All Forums | 11.892 | 12.644
Case 2: TelCorp Example
Ahmed Abbasi, Yilu Zhou, Shasha Deng, Pengzhu Zhang
• One of the ten largest telecommunications and data service providers in the US
• Has a dedicated social media monitoring team of 20+ members
• Uses a social media analysis system that processes 43,000+ messages per day (1,800/hour)
  • During peak times: 5,000/hour; 83/minute
  • About 4,000 discussion thread updates per day
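The volume figures above can be cross-checked with a few lines of arithmetic. This is a minimal sketch that uses only the numbers quoted on the slide; the variable names are illustrative.

```python
# Figures quoted on the slide.
MESSAGES_PER_DAY = 43_000
PEAK_MESSAGES_PER_HOUR = 5_000

# Average hourly load implied by the daily volume.
avg_per_hour = MESSAGES_PER_DAY / 24

# Peak load expressed per minute, and relative to the average hour.
peak_per_minute = PEAK_MESSAGES_PER_HOUR / 60
peak_to_average = PEAK_MESSAGES_PER_HOUR / avg_per_hour

print(f"average per hour: {avg_per_hour:.0f}")
print(f"peak per minute:  {peak_per_minute:.0f}")
print(f"peak / average:   {peak_to_average:.1f}x")
```

The daily figure works out to roughly 1,800 messages per hour, and the peak rate is a little under three times that average.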
• Fall 2012: increased max speeds for premium Internet plan customers
  • Announced through various social media channels
• Monitoring team's 24-hour assessment of 2,000 new discussion threads: overall sentiment positive
  • However, call center noticed an uptick in complaints
• Over the next 24 hours, monitoring team combed the data
  • Speed increase only applied to 20% of customers
  • Poorly broadcast to all customers
  • Created initial euphoria followed by confusion and anger
• Exactly 54 hours after the initial announcement, the company made amends by:
  • Introducing similar max speed increases for customers on non-premium plans
  • Providing promotional offers on additional services and upgrades
  • Apologizing for the confusion
• Nevertheless, customer churn rate was 5x
  • Estimated $110 million in lost revenue over the next 12 months
  • Not to mention long-term losses based on customer lifetime value
LAP Goal
Conversation Disentanglement
• Coherence Analysis
  • Nearly 50% of messages in discussion forum threads don't respond to the prior or first posting
  • Facebook and Twitter: 20% to 30%
  • Due to system limitations and/or lack of proper usage of system features
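To make the coherence statistic concrete, here is a minimal sketch of how such a share could be computed for one thread. It assumes the true reply target of every message is already known (for example, from manual annotation). The function name, the list representation, and the sample thread are all hypothetical and not taken from the study.

```python
def incoherent_share(reply_targets):
    """Fraction of replies aimed at neither the first nor the prior message.

    reply_targets[i] is the index of the message that message i actually
    responds to, or None for the thread's opening post (assumed layout).
    """
    replies = [(i, t) for i, t in enumerate(reply_targets) if t is not None]
    if not replies:
        return 0.0
    # A reply is "off default" if its target is not the first post (index 0)
    # and not the message immediately before it (index i - 1).
    off_default = sum(1 for i, t in replies if t != 0 and t != i - 1)
    return off_default / len(replies)


# Hypothetical 7-message thread: messages 3, 5, and 6 reply to an earlier
# message that is neither the first post nor their immediate predecessor.
thread = [None, 0, 1, 1, 0, 3, 2]
share = incoherent_share(thread)
print(f"{share:.0%} of replies break the default reply structure")
```

A system that assumes every message answers the previous or the first post would misattribute exactly this share of replies, which is why disentanglement is needed.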
• Speech Acts
• TelCorp Conversations:
  • Positive expressives early
  • Followed by conversations encompassing questions, suggestions, assertions of indifference/negligence, negative expressives, and declarations of having switched to other providers
• Incorporates many LAP principles
  • Interplay between conversations, interactions, and message acts
  • Importance of conversation beginnings
  • Contextualization, lexical chaining, and thematization
  • Speech act inter-dependencies
• Primitive Message Detection
  • Leverages LAP principle of lexical chains (topic "breadcrumbs")
• Conversation Affiliation Classification
  • Leverages LAP principle of thematization
  • Examines primitive/secondary message similarities across thread regions
• Primitive Message Detection
Where $w_{xt} = \mathit{tf}_{xt} \cdot \mathit{idf}_t$, $t$ is one of the $k$ unique terms in $X$, $r$ is one of the $j$ unique terms in $Y$, and $t$ and $r$ are nouns, verbs, noun/verb phrases, or named entities.
$s_{tr}$ is the similarity between $t$ and $r$ based on the shortest path that connects them in the is-a (hypernym/hyponym) taxonomy in WordNet (Miller 1995).
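The two ingredients defined above, tf-idf term weights and a path-based term similarity, can be sketched as follows. Several things here are assumptions made for illustration only. The tiny `PARENT` taxonomy stands in for WordNet. The form 1 / (1 + path length) for the term similarity is one common path-based measure. The idf form and the way `message_similarity` combines the pieces are guesses, not the equation from the slide.

```python
import math
from collections import Counter

# Hypothetical toy is-a taxonomy (child -> parent), standing in for WordNet.
PARENT = {
    "router": "device", "modem": "device", "device": "entity",
    "speed": "attribute", "bandwidth": "attribute", "attribute": "entity",
}

def ancestors(term):
    """Path from a term up to the taxonomy root, starting with the term."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def term_similarity(t, r):
    """Assumed s_tr = 1 / (1 + shortest is-a path length); 0 if unconnected."""
    up_t, up_r = ancestors(t), ancestors(r)
    depth_in_r = {node: d for d, node in enumerate(up_r)}
    for d_t, node in enumerate(up_t):
        if node in depth_in_r:  # lowest common ancestor found
            return 1.0 / (1 + d_t + depth_in_r[node])
    return 0.0

def tfidf_weights(tokens, doc_freq, n_docs):
    """w_xt = tf_xt * idf_t, with idf_t = log(n_docs / df_t) (assumed form)."""
    tf = Counter(tokens)
    return {t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf}

def message_similarity(wx, wy):
    """Assumed combination: weight-averaged s_tr over all term pairs."""
    num = sum(wx[t] * wy[r] * term_similarity(t, r) for t in wx for r in wy)
    den = sum(wx[t] * wy[r] for t in wx for r in wy)
    return num / den if den else 0.0

# Hypothetical document frequencies over a 10-message thread.
doc_freq = {"router": 2, "speed": 5, "modem": 1, "bandwidth": 4}
wx = tfidf_weights(["router", "speed", "speed"], doc_freq, n_docs=10)
wy = tfidf_weights(["modem", "bandwidth"], doc_freq, n_docs=10)
print(round(message_similarity(wx, wy), 3))
```

With real data, the toy taxonomy would be replaced by WordNet lookups, and the terms would first be filtered to nouns, verbs, noun/verb phrases, and named entities as the definitions require.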
• Coherence Analysis Method
Let $s_i$, $l_i$, and $c_i$ represent the system, linguistic, and conversation structure feature vectors for a given message pair $X$ and $Y$.
We define a combinatorial ensemble of kernels $K = \{K_1, \ldots, K_Q\}$ encompassing all combinations of linear composite kernels involving $s$, $l$, and $c$ (here $Q = 7$ due to $2^3 - 1$ total combinations).
Given two instance rows in the training data matrix, their similarity is defined based on the inner product between all combinations of their three vectors $s_1, l_1, c_1$ and $s_2, l_2, c_2$.
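The count Q = 7 comes from enumerating every non-empty subset of the three feature groups. The sketch below illustrates this, assuming each composite kernel is the sum of the inner products of the groups in its subset. That summation, the function names, and the sample vectors are illustrative assumptions rather than details confirmed by the slides.

```python
from itertools import combinations

FEATURE_GROUPS = ("s", "l", "c")  # system, linguistic, conversation structure

def dot(u, v):
    """Inner product of two equal-length vectors (the linear kernel)."""
    return sum(a * b for a, b in zip(u, v))

def kernel_ensemble(a, b):
    """Return {subset of groups: kernel value} for all non-empty subsets.

    a and b map each group name to that instance's feature vector. Each
    composite kernel is assumed to be the sum of the groups' linear kernels.
    """
    kernels = {}
    for size in range(1, len(FEATURE_GROUPS) + 1):
        for subset in combinations(FEATURE_GROUPS, size):
            kernels[subset] = sum(dot(a[g], b[g]) for g in subset)
    return kernels

# Two hypothetical instance rows with tiny feature vectors.
x1 = {"s": [1.0, 0.0], "l": [0.5, 0.5], "c": [1.0]}
x2 = {"s": [1.0, 1.0], "l": [1.0, 0.0], "c": [2.0]}

ensemble = kernel_ensemble(x1, x2)
for subset, value in ensemble.items():
    print("+".join(subset), value)
```

Each of the seven kernel values could then feed its own classifier, with the ensemble combining their outputs.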
• Speech Act Two-Stage Classifier
  • Leverages LAP principle about the relation between speech acts and message interactions
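• Speech Act Tree Kernel Classifier Similarity Function
$K_{AC}(x_i, x_j)$ is a similarity measure between $S_{x_i}$ and $S_{x_j}$ computed by comparing all tree fragments in $S_{x_i}$ and $S_{x_j}$, where a fragment is defined as any sub-graph containing more than one node (Collins and Duffy 2002).
$K_{AC}(x_i, x_j)$ is simply equal to two times the number of common fragments in $S_{x_i}$ and $S_{x_j}$, divided by the total number of fragments in $S_{x_i}$ and $S_{x_j}$.
Formally, let $h_k(x_i)$ denote the presence of the $k$th tree fragment in $S_{x_i}$ (where $h_k(x_i) = 1$ if the $k$th tree fragment exists in $x_i$) such that $S_{x_i}$ is now represented as a binary vector $h(x_i) = (h_1(x_i), \ldots, h_n(x_i))$:
$$K_{AC}(x_i, x_j) = \frac{2 \sum_{k=1}^{n} h_k(x_i)\, h_k(x_j)}{\sum_{k=1}^{n} h_k(x_i) + \sum_{k=1}^{n} h_k(x_j)}$$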
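The ratio described in words above (twice the common fragments over the total fragment count) can be sketched directly on sets of fragments. This sketch assumes the tree fragments have already been enumerated and serialized as strings. The fragment strings and the function name are hypothetical, and the enumeration step itself is not shown.

```python
def fragment_kernel(fragments_i, fragments_j):
    """2 * |common fragments| / (|fragments_i| + |fragments_j|).

    Each argument is the collection of distinct fragments found in one
    speech act tree, i.e. the positions where its binary vector h is 1.
    """
    set_i, set_j = set(fragments_i), set(fragments_j)
    total = len(set_i) + len(set_j)
    if total == 0:
        return 0.0  # no fragments on either side; similarity undefined, use 0
    return 2 * len(set_i & set_j) / total


# Hypothetical serialized fragments from two speech act trees.
tree_a = {"(S (Q))", "(S (A))", "(Q (A))"}
tree_b = {"(S (Q))", "(Q (A))", "(Q (E))", "(S (E))"}

print(round(fragment_kernel(tree_a, tree_b), 3))  # 2 * 2 / (3 + 4)
```

The measure is 1 for identical fragment sets and 0 when the trees share no fragment, so it is bounded regardless of tree size.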
• Four-month A/B test
• Worked with TelCorp IT to incorporate LTAS into the existing topic-sentiment-based system
• Split the 23-member monitoring team into two groups
  • A-team: 12 members assigned to the existing system
  • B-team: 11 members assigned to the new system that also had LTAS
  • Teams were not co-located, and each only had access to its respective system
• TelCorp received 5.2 million social media messages during the 4-month period
Overview of TelCorp's Social Media Monitoring Workflow
Questions?