Arizona State University Data Mining and Machine Learning Lab
Searching for Credible Information via Social Media Mining
Huan Liu
Data Mining and Machine Learning Lab, Arizona State University

Thanks to Former and Current PhD Students of DMML
• Reza Zafarani, Asst Prof, Syracuse U
• Xia Hu, Asst Prof, Texas A&M U
• Magdiel Galan, Intel
• Shamanth Kumar, Castlight Health
• Pritam Gundecha, IBM Res Almaden
• Jiliang Tang, Asst Prof, MSU
• Huiji Gao, LinkedIn
• Ali Abbasi, Machine Zone
• Salem Alelyani, Asst Prof, King Khalid U
• Xufei Wang, LinkedIn
• Geoffrey Barbier, AFRL
• Lei Tang, Clari
• Zheng Zhao, Google
• Nitin Agarwal, Chair Prof, UALR
• Sai Moturu, PostDoc, MIT Media Lab
• Lei Yu, Assc Prof, Binghamton U, NY
• Robert Trevino, AFRL
• Yunzhong Liu, LeEco, US
• Somnath Shahapurkar, FICO
• Fred Morstatter
• Isaac Jones
• Suhas Ranganath
• Suhang Wang
• Tahora Nazer
• Jundong Li
• Liang Wu
• Ghazaleh Beigi
• Kai Shu
• Justin Sampson

Arizona State University Data Mining and Machine Learning Lab, Searching for Credible Information, BJUT 2016

False, Misleading, and Inaccurate Information
• Spam
• Fraud
• Fake News
• Rumor
• Urban Legend
• Gossip
• Information can be: true, false, or uncertain
• Big Data: 6th ‘V’ Everyone Should Know About
  – Vulnerability
  – Social media has all 6 V’s
Disinformation (purposeful)
Misinformation (unintentional) & Disinformation

Spam in Social Media
• Unwanted content generated by spamming users, such as comments, chat messages, and fake requests, that is used to promote products or spread malicious information.
  – Fake reviews
  – Malicious links
  – Fake requests

Fraud (Scam) in Social Media
• Social media fraud is the act of defrauding and/or taking advantage of social media users through social media services.
  – Swindle money
  – Steal personal information

Fake News Websites and Social Media
• Fake news websites deliberately publish hoaxes, propaganda, and disinformation to drive traffic, exacerbated by social media
• Fake news can affect domestic politics, inflamed by social media, due to limited resources to check the veracity of claims
  – Easy to “like” and “share”, but taking effort to check, albeit just a few clicks away (effort asymmetry)
• Fake news + Social media → Cyberwarfare

Fake News Is Rampant in Social Media
• Fake news spreads on social media
  – Spreads rapidly
  – Evolves fast
• Crossover to other networks
• Modified content

Fake News Can Cause Real Harm
• Pizzagate: fake news stories from Reddit led to a real shooting
• A false rumor erased $136 billion in 10 minutes
Fake News Onslaught Targets Pizzeria as Nest of Child-Trafficking, New York Times, 2016

Rumors
• Wikipedia: “A tall tale of explanations circulating from person to person and pertaining to an object, event, or issue in public concern”.
• Rumors can be true or false.
  – False rumor

Gossip in Social Media
• Gossip is idle chat and rumor about personal and/or private affairs of others.
• Social media allows for faster, larger-scale, and more convenient idle chat.
  – Celebrity: “Obamas moving to Asheville”
  – Friends: People “are much more likely to gossip when a story unites a familiar person with an interesting scenario.”
Familiarity with Interest Breeds Gossip: Contributions of Emotion, Expectation, and Reputation, PLoS ONE, 2014

Urban Legend in Social Media
• Fictional stories with macabre elements rooted in local popular culture.
  – On social media, they develop faster and spread wider
• In summary, it is imperative to study credibility checking
• Urban legend of Feng shui

On Credibility Checking
• Studying different types of credibility and the need for different data and information sources in credibility checking
  – We don’t have to reinvent wheels in social media mining and can “stand on the shoulders of giants”
  – Machines differ from humans in credibility checking
• About Credibility Checking
  – Types of Credibility (social sciences, psychology, CS)
  – Aspects of Credibility Checking
  – Components of Credibility Checking in Social Media

Four Types of Credibility
• Presumed credibility (general assumptions)
  – “Our friends usually tell the truth”
• Reputed credibility (based on third parties’ reports)
  – For instance, prestigious awards or official titles
• Surface credibility (simple inspection)
  – “People judge a book by its cover”
• Experienced credibility (first-hand experience)
  – “Time can tell” (Chinese proverb: “Distance tests a horse’s strength; time reveals a person’s heart”)
• Any new type to explore in social media?

Aspects of Credibility Checking (CC)
• Can we turn CC into a problem easier for users or AMTurks (without much expertise) to check?
• Issues about Credibility Checking Measures
  – Reputation and History (time)
  – Accuracy and Relevance
  – Transparency and Integrity (consistency)
  – Response from independent sources (consistency)
• Implication or impact assessment
  – Not every piece of fake news is disastrous
  – “Warn or not to warn”: how to balance?

Components in Credibility Checking in Social Media
News/Post: Fake (Yes / No / Uncertain)
• Recipients: expertise, experience, background, occupation
• Senders: reputation, length of online presence, social networks
• Source of information: provenance, reputation, curation/editing, length
• Content: writing style, topics, URLs, multimedia
• Network context: topic thread (outlier detection), retweets, replies, comments
• Crowdsourcing (fact-checking sites, e.g., Snopes)
• Ground truth (multifaceted, gold standard)

Searching for Credible Information
[Figure labels: Credible Data; Spam; Bots (automatically generated content); Fake News; Rumor]
• A Unique Challenge
  – Ground truth
• Additional Challenges
  – Credibility verification
  – Dynamic change
  – Timeliness
• Alternative Approaches
  – Rumor Detection
  – Spam Detection
  – Bot Detection
  – Inferring Distrust

Using Social Media for Credibility Checking
• Velocity and Volume
  – 6,000 tweets per second, about 500 million per day on Twitter
  – 55 million status updates and 300 million photos per day on FB
• Variety
  – Geo-spatial, textual, pictorial, temporal, social dimensions
  – Cross modality (e.g., geotagged pictures)
• Veracity
  – Truthfulness and accuracy of information
• Use big data, multi-source info, and social networks to compensate for lack of expertise (turn the adversary’s own spear against its shield)

A decent breakdown of all things real and fake news. http://imgur.com/7xHaUXf

Rumor Detection
• Rumor: unverified and relevant information that circulates in the context of ambiguity.
• Goal: detecting emerging rumors with minimum information as early as possible
  – If intervention is not feasible, get an early warning or be prepared
• Challenges:
  – How to overcome the lack of information in a single tweet?
  – How to detect rumors in their formative stage?

Insufficient Information in a Single Tweet
• A single tweet could be damaging, but contains little information w/o context for detection
• Treat batches of tweets as “conversations”
  – Based on keyword similarities
  – Based on reply chains
[Figure: conversation size, from 1 to 9 tweets up to 10+ tweets, with the Point of Acceptable Accuracy marked]
• Aggregate conversations
  – Shared hashtags
  – Common links
  – Cosine similarity
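
The aggregation signals listed on this slide (shared hashtags, common links, cosine similarity) can be illustrated with a minimal sketch. The function names, the regular expressions, and the `min_cosine` threshold of 0.6 are illustrative assumptions, not details taken from the talk or from the cited paper.

```python
import math
import re
from collections import Counter

URL = r"https?://\S+"

def features(text):
    """Extract hashtags, links, and a bag-of-words vector from one tweet."""
    hashtags = set(re.findall(r"#\w+", text.lower()))
    links = set(re.findall(URL, text))
    words = Counter(re.findall(r"[a-z']+", re.sub(URL, "", text.lower())))
    return hashtags, links, words

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def aggregate(tweets, min_cosine=0.6):
    """Merge tweets into conversations (union-find over implicit links)."""
    parent = list(range(len(tweets)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    feats = [features(t) for t in tweets]
    for i in range(len(tweets)):
        for j in range(i + 1, len(tweets)):
            tags_i, links_i, words_i = feats[i]
            tags_j, links_j, words_j = feats[j]
            # Hypothetical linking rule: any one of the three signals suffices.
            if (tags_i & tags_j) or (links_i & links_j) or cosine(words_i, words_j) >= min_cosine:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(tweets)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Each returned group is a list of tweet indices; a downstream classifier would then see the whole conversation rather than a single tweet.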

Detection of Emerging Rumors
• Emergent detection: link the first tweet in a rumor with those already posted
• Standard rumor classifications are not effective for small conversations
  – Lack of network and statistical data
  – Data sparsity issues
• Implicit linking works effectively for detecting small rumor cascades

Bot Detection
• Bots
  – Innocuous: relay information from official sources
  – Malicious: spread rumors and false information
• Goal: Remove bots from social media data with high Recall
  – WHY?
• Challenges
  – Acquiring ground truth
  – Increasing Recall without significantly reducing Precision

Bots in Social Media
• Bots on Twitter:
  – Twitter claims 5% of 230M users are bots*.
  – One study found 20M bot accounts = 9%**.
  – 24% of all tweets are generated by bots***.
• 5-11% of Facebook accounts are fake****.
* http://blogs.wsj.com/digits/2014/03/21/new-report-spotlights-twitters-retention-problem/
** http://www.nbcnews.com/technology/1-10-twitter-accounts-fake-say-researchers-2D11655362
*** https://sysomos.com/inside-twitter/most-active-twitter-user-data
**** http://thenextweb.com/facebook/2014/02/03/facebook-estimates-5-5-11-2-accounts-fake/

Finding Ground Truth
• Three states of a Twitter user:
  – Active
  – Suspended
  – Deleted
• Idea:
  – Use these states as labels
  – Two snapshots of each user are taken
Initial Crawl
• Finds seed set of users.
• Crawls Profile, Network, ...
User status at the later snapshot: Suspended, Deleted, Active
Status on Twitter as a labeling mechanism
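
A minimal sketch of the two-snapshot labeling idea follows. The mapping from status to label (suspended means bot, still active means human, deleted left unlabeled) is an assumption made for illustration; the slide only says that the three states are used as labels. `Status` and `label_users` are hypothetical names.

```python
from enum import Enum

class Status(Enum):
    ACTIVE = "active"
    SUSPENDED = "suspended"
    DELETED = "deleted"

def label_users(first_snapshot, second_snapshot):
    """Label users from the initial crawl by their status at the second snapshot.

    Both arguments map user id -> Status. Assumed mapping (not from the slide):
    suspended -> "bot", still active -> "human", deleted -> None (unlabeled).
    """
    labels = {}
    for user, status in first_snapshot.items():
        if status is not Status.ACTIVE:
            continue  # only users active at the initial crawl have crawled features
        later = second_snapshot[user]
        if later is Status.SUSPENDED:
            labels[user] = "bot"
        elif later is Status.ACTIVE:
            labels[user] = "human"
        else:
            labels[user] = None
    return labels
```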

Ground Truth - Honeypots
• Act as obvious bot accounts
• Attract other bot accounts
• Bots are identified when they follow our account
• Assumption: Real users do not follow bots

Honeypots - Logic
• Post “Luring” Content
  – Post content that will be seen
  – Trending topics, hashtags, “famous” tweets ...
• Maintain Network Connections
  – “Follow back”, Retweets
  – Fame begets fame
• Promote Other Honeypots
  – Retweet each other’s tweets
  – Mention each other
Flowchart (Honeypot Accounts):
• Choose honeypot, h
• 10%: retweet random honeypot
• 90%: sample random tweet, t
  – 30%: h retweets t
  – 70%: h copies t
• Record h’s new friends
• Wait 10 s
• Follow new friends
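
The flowchart can be read as a simple loop. The sketch below simulates it without any real Twitter client: `new_followers` and `wait` are hypothetical callbacks, and the assignment of the percentages to branches follows one reading of the extracted diagram (10% retweet another honeypot, 90% sample a tweet, of which 30% are retweeted and 70% copied).

```python
import random

def honeypot_step(h, honeypots, public_tweets, rng):
    """Pick one posting action for honeypot h using the diagram's percentages."""
    if rng.random() < 0.10:
        other = rng.choice([x for x in honeypots if x != h])
        return ("retweet_honeypot", other)      # promote another honeypot
    t = rng.choice(public_tweets)               # sample a random tweet t
    if rng.random() < 0.30:
        return ("retweet", t)                   # h retweets t
    return ("copy", t)                          # h copies t

def run(honeypots, public_tweets, new_followers, steps, rng, wait=lambda seconds: None):
    """Simulate the loop; new_followers(h) returns accounts that just followed h."""
    suspected_bots = set()
    log = []
    for _ in range(steps):
        h = rng.choice(honeypots)
        log.append((h,) + honeypot_step(h, honeypots, public_tweets, rng))
        friends = new_followers(h)              # record h's new friends
        suspected_bots.update(friends)          # assumption: real users do not follow bots
        wait(10)                                # the diagram waits 10 s
        # Following the new friends back would go here, via a real client.
    return suspected_bots, log
```

Passing `rng` explicitly keeps the simulation reproducible; at least two honeypots are needed so that one can retweet another.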

BoostOR
• Based on AdaBoost
• Try to increase Recall without drastic decrease in Precision
• Iteratively update the weight of instances:
  – Unchanged if correctly classified
  – Decreased if false negative
  – Increased if false positive
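
The three update rules can be written as one reweighting round. This is only a sketch of the direction of each change as listed on the slide: the multiplicative factors `down` and `up` are hypothetical, and the final renormalisation is borrowed from standard AdaBoost practice rather than stated in the talk.

```python
def update_weights(weights, y_true, y_pred, down=0.5, up=2.0):
    """One instance-reweighting round following the slide's three rules.

    Labels: 1 = bot (positive), 0 = human (negative).
    `down` and `up` are illustrative factors, not values from the source.
    """
    new = []
    for w, y, p in zip(weights, y_true, y_pred):
        if y == p:
            new.append(w)           # correctly classified: unchanged
        elif y == 1 and p == 0:
            new.append(w * down)    # false negative: decreased
        else:
            new.append(w * up)      # false positive: increased
    total = sum(new)
    return [w / total for w in new]  # renormalise so the weights sum to 1
```

After renormalisation the "unchanged" weights keep their value relative to one another, while false positives gain and false negatives lose relative weight.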

Trust-Distrust Prediction
• Goal
  – Trust and distrust relations can play an important role in helping online users collect reliable information
  – Finding trustworthy users and reliable information is of significant importance
  – How to predict trust relations between users?
• Challenges
  – Trust relations are extremely sparse
  – Distrust relations are even sparser than trust ones
  – Finding substitute features indicative of trust and distrust

Trust and Emotions
• According to psychology, users’ emotions can be strong indicators of trust and distrust relations
• Emotional information is more available than that of trust/distrust
• There exists a correlation between emotions and trust/distrust relations

Modeling Emotional Information
• Users with positive (negative) emotions are more likely to establish trust (distrust) relations
• Users with high positive (negative) emotion strengths are more likely to establish trust (distrust)
• The Emotional Trust Distrust framework ETD
  – Low-rank matrix factorization
  – Emotional information regularization
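
As a rough illustration of how the two ingredients named on this slide might be combined, a generic regularized low-rank factorization objective is shown below. Every symbol here is an assumption introduced for illustration: G is a signed user-user trust/distrust matrix, W masks its observed entries, U holds low-rank user factors, V is a small interaction matrix, and Ω_emo is a placeholder for whatever emotion-based regularizer the ETD framework actually uses. The exact formulation is given in [Beigi SDM’16], not here.

```latex
\min_{U,\,V}\;
\bigl\lVert W \odot \bigl(G - U V U^{\top}\bigr) \bigr\rVert_F^{2}
\;+\; \alpha\, \Omega_{\mathrm{emo}}(U, V)
\;+\; \lambda \bigl( \lVert U \rVert_F^{2} + \lVert V \rVert_F^{2} \bigr)
```

The first term fits the sparse observed relations, the second pulls the factors toward agreement with emotional signals, and the third is an ordinary norm penalty against overfitting.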

Studying Bias in Social Media Data
• Twitter shares its data
  – “Firehose” feed - 100% - costly
  – “Streaming API” feed - 1% - free
• We usually obtain data via sampling
  – Is the sampled data from the Streaming API representative of the true activity on Twitter’s Firehose?
• Challenges
  – How to determine if the sample is biased when we do not have access to the whole data?
  – How to obtain an unbiased sample?

Twitter’s Streaming API vs. Firehose
• Data from the Firehose and the Streaming API were collected over a specific period of time for analysis
• More than 90% of all geotagged tweets are available via the Streaming API, and there is no significant difference in location distribution
• Based on in-degree centrality and betweenness centrality in user-user retweet networks, the Streaming API finds ~50% of the key users
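
The key-user comparison can be sketched for the in-degree case alone (betweenness centrality is omitted to keep the example short). The function names and the edge-list input format are assumptions made for illustration.

```python
from collections import Counter

def top_k_by_in_degree(retweet_edges, k):
    """retweet_edges: (retweeter, original_author) pairs.

    In the user-user retweet network an author's in-degree is the number
    of times that author was retweeted.
    """
    in_degree = Counter(author for _, author in retweet_edges)
    return [user for user, _ in in_degree.most_common(k)]

def key_user_coverage(firehose_edges, streaming_edges, k):
    """Fraction of the Firehose top-k users also found in the Streaming top-k."""
    full = set(top_k_by_in_degree(firehose_edges, k))
    sample = set(top_k_by_in_degree(streaming_edges, k))
    return len(full & sample) / len(full) if full else 0.0
```

A coverage value near 0.5 would correspond to the "~50% of the key users" observation on the slide.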

Mitigating Bias in Twitter’s Streaming API
Can we find bias without the Firehose?
Estimating Bias from Streaming API:
  – Obtain trend of hashtag from Sample API and Streaming API
  – Bootstrap Sample API to obtain confidence intervals
  – Mark regions where Streaming API is outside of confidence intervals
Mitigating Bias:
  – Leverage multiple crawlers to maximize data for each query
  – Round Robin Splitting
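
The estimation steps and the round-robin split can be sketched as follows. This is a simplified reading: each time bin is represented by 0/1 flags saying whether a Sample API tweet carries the hashtag, the confidence interval is a plain percentile bootstrap, and all function names and defaults (`n_boot`, `alpha`) are assumptions rather than the procedure of the cited paper.

```python
import random

def bootstrap_interval(sample_flags, n_boot=1000, alpha=0.05, rng=None):
    """Percentile bootstrap CI for the share of tweets carrying a hashtag.

    sample_flags: list of 0/1 flags, one per Sample API tweet in a time bin.
    """
    rng = rng or random.Random(0)
    n = len(sample_flags)
    shares = sorted(sum(rng.choices(sample_flags, k=n)) / n for _ in range(n_boot))
    lo = shares[int((alpha / 2) * n_boot)]
    hi = shares[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

def biased_bins(sample_bins, streaming_shares, **kwargs):
    """Indices of time bins where the Streaming API share falls outside the CI."""
    flagged = []
    for i, (flags, share) in enumerate(zip(sample_bins, streaming_shares)):
        lo, hi = bootstrap_interval(flags, **kwargs)
        if not lo <= share <= hi:
            flagged.append(i)
    return flagged

def round_robin_split(keywords, n_crawlers):
    """Deal keywords to crawlers in turn so each crawler tracks a smaller query."""
    return [keywords[i::n_crawlers] for i in range(n_crawlers)]
```

Splitting the query this way lets each crawler collect under its own rate limit, which is one plausible way to "maximize data for each query".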

Time-Critical Information in Crisis Response
• Social media is used to request immediate assistance during a crisis
• Time-critical posts demand immediate attention
• Addressing these queries promptly can help in emergency response
• How can these posts be distinguished from others?
• What Is Required in Finding Time-Critical Responses?
  – Users with expertise or knowledge
  – Fast response
  – Relevant answers

Finding Time-Critical Responses
• Many questions asked during a crisis need immediate attention
• Many responders are busy
• How can we find a prompt responder who can provide a relevant answer?
• Challenges of Identifying Prompt Responders
  – How do we estimate the reply time of users to identify prompt responders?
  – Timeliness and relevance: how do we integrate timeliness with relevance to rank candidate responders?

Information Seeking in Social Media
• Social media is used to request help during a crisis
• Addressing these queries promptly can help in emergency response

Identifying Candidate Responders
• Timeliness
  – A user can respond more quickly if she is available soon after the question is posted. Availability can be estimated from her previous posting times
  – A user responds to questions faster if she has replied promptly to similar questions in the past
• Relevance
  – Users whose previous content is similar to the question have higher relevance, and their response is more likely to be a relevant answer
• Timeliness and relevance are integrated by combining the ranking scores
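
One simple way to combine the two scores is a weighted sum, sketched below. The exponential decay for timeliness, the `scale` of 30 minutes, the equal `weight`, and the word-count cosine for relevance are all illustrative assumptions; the slide says only that the ranking scores are combined.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def relevance(question, past_posts):
    """Similarity between the question and a user's previous content."""
    q = Counter(question.lower().split())
    p = Counter(" ".join(past_posts).lower().split())
    return cosine(q, p)

def timeliness(past_reply_delays_min, scale=30.0):
    """Map the mean past reply delay (minutes) to a score in [0, 1]."""
    if not past_reply_delays_min:
        return 0.0  # no reply history: no evidence of promptness
    mean_delay = sum(past_reply_delays_min) / len(past_reply_delays_min)
    return math.exp(-mean_delay / scale)

def rank_responders(question, users, weight=0.5):
    """users: name -> {"posts": [...], "delays": [...]}; best candidate first."""
    scored = []
    for name, info in users.items():
        score = weight * timeliness(info["delays"]) + (1 - weight) * relevance(question, info["posts"])
        scored.append((score, name))
    return [name for _, name in sorted(scored, reverse=True)]
```

Raising `weight` favors fast repliers; lowering it favors users whose past content matches the question.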

Searching for Credible Information
[Figure labels: Credible Data; Spam; Bots (automatically generated content); Fake News; Rumor]
• A Unique Challenge
  – Ground truth
• Additional Challenges
  – Credibility verification
  – Dynamic change
  – Timeliness
• Alternative Approaches
  – Rumor Detection
  – Spam Detection
  – Bot Detection
  – Inferring Distrust
Turn the adversary’s own spear against its shield

Thank You All
• Professor Yang’s kind invitation and warm hospitality
• Funding support from ONR, NSF, ARO, among others
• DMML Lab former and current members, and Liang Wu for helping with the preparation of this presentation
Search for “Huan Liu” for more information about DMML
H. Liu, F. Morstatter, J. Tang, and R. Zafarani. “The good, the bad, and the ugly: uncovering novel research opportunities in social media mining”, in Trends of Data Science, International Journal on Data Science and Analytics, Springer International Publishing Switzerland. September, 2016. DOI 10.1007/s41060-016-0023-0

Repositories and Recent Books
• scikit-feature – an open source feature selection repository in Python
• Social Computing Repository
http://dmml.asu.edu/smm/

References
1. [Beigi SDM’16] Ghazaleh Beigi, Jiliang Tang, Suhang Wang, and Huan Liu. “Exploiting Emotional Information for Trust/Distrust Prediction”. SIAM International Conference on Data Mining (SDM16), May 5-7, 2016. Miami, Florida.
2. [Morstatter ASONAM’16] Fred Morstatter, Liang Wu, Tahora H. Nazer, Kathleen M. Carley, and Huan Liu. “A New Approach to Bot Detection: Striking the Balance Between Precision and Recall”. IEEE/ACM International Conference on Advances in Social Network Analysis and Mining (ASONAM 2016), August 18-21, San Francisco, CA.
3. [Morstatter WWW’14] Fred Morstatter, Jürgen Pfeffer, and Huan Liu. “When is it Biased? Assessing the Representativeness of Twitter's Streaming API”. WWW Web Science 2014.
4. [Morstatter ICWSM’13] Fred Morstatter, Jürgen Pfeffer, Huan Liu, and Kathleen M. Carley. “Is the Sample Good Enough? Comparing Data from Twitter's Streaming API with Twitter's Firehose”. ICWSM 2013.
5. [Sampson CIKM’16] Justin Sampson, Fred Morstatter, Liang Wu, and Huan Liu. “Leveraging the Implicit Structure within Social Media for Emergent Rumor Detection”. Short paper, ACM International Conference of Information and Knowledge Management (CIKM 2016), October 24-28, 2016. Indianapolis, Indiana.
6. [Sampson ICDM’15] Justin Sampson, Fred Morstatter, Reza Zafarani, and Huan Liu. “Real-Time Crisis Mapping Using Language Distribution”. Demo. In Proceedings of IEEE International Conference on Data Mining (ICDM 2015), November 14-17, 2015. Atlantic City, NJ.