Date post: | 27-Jul-2015 |
Category: | Education |
Upload: | denis-parra-santander |
Identifying Relevant Messages in a Twitter-based Citizen Channel for Natural Disaster Situations
Alfredo Cobo [email protected]
Denis Parra [email protected]
Jaime Navón [email protected]
Pontificia Universidad Católica de Chile, Departamento de Ciencia de la Computación
Av. Vicuña Mackenna 4860, Macul, Santiago, Chile
I (… and some other people in this room)
… come from Chile
Picture from http://www.quadrodemedalhas.com/images/mapas/mapa-chile.jpg
http://upload.wikimedia.org/wikipedia/commons/thumb/9/91/Chile_in_South_America_(-mini_map_-rivers).svg/409px-Chile_in_South_America_(-mini_map_-rivers).svg.png
Chile, well-known for its..
• Copper (Top Producer)
"Top 5 Copper Producers" by Plazak - Own work. Licensed under CC BY-SA 3.0 via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:Top_5_Copper_Producers.png#/media/File:Top_5_Copper_Producers.png
https://www.google.com/url?sa=i&rct=j&q=&esrc=s&source=images&cd=&cad=rja&uact=8&ved=0CAYQjB0&url=http%3A%2F%2Fcommons.wikimedia.org%2Fwiki%2FFile%3ANative_Copper_(mineral).jpg&ei=L31ZVbOsL4r1UrbRgKAB&bvm=bv.93564037,d.d24&psig=AFQjCNHr2zm5m4Jmim7AgkCwwSb0b5mGUA&ust=1432014509629311
Chile, well-known for its..
• Wine (Price + quality)
"Fiesta de Vendimia" by LuxoDresden - Own work. Licensed under CC BY-SA 3.0 via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:Fiesta_de_Vendimia.JPG#/media/File:Fiesta_de_Vendimia.JPG
If you start typing in Google…
9 out of 10 disasters …
If you start typing in Google…
9 out of 10 disasters … prefer Chile
… and for Natural Disasters :(
• Largest earthquake ever recorded: Valdivia, Chile, May 22, 1960 (magnitude 9.5)
• We usually have 1 large earthquake every 30 years (magnitude ~8)
• Last one in 2010, close to Concepción, but it also affected Santiago (the capital)
… so, at PUC Chile
• We created CIGIDEN, the “National Research Center for the Integrated Administration of Natural Disasters”
CIGIDEN’s Goal in this project
• Help citizens stay informed during natural disasters by using social media.
• Build a mobile application (Carlos Molina)
• Automatically filter relevant messages from those not related to earthquakes (Alfredo Cobo) to feed the application
Our Task: Building a Twitter classifier
- Separate tweets related to natural disasters from those that are not.
Related Work
• Manual Classification: Vieweg et al. (2010), Imran et al. (2013), Mendoza et al. (2010)
• Data Post-processing: Mendoza et al. (2010), Castillo et al. (2011) (Information Credibility on Twitter)
• Feature Generation: Gimpel et al. (2011), Kouloumpis et al. (2011), Liu et al. (2012), Wu et al. (2011), Lee et al. (2014) (not necessarily for natural disasters)
• Tools for Disaster Management: Hiltz et al. (2013), Power et al. (2013), Caragea et al. (2011), Abel et al. (2012), Middleton et al. (2014), Morstatter et al. (2013), Imran et al. (2014)
Why would building this classifier be a contribution?
• Building and validating a ground truth for classifying tweets in Spanish.
• Building the classifier and dealing with:
  • Class imbalance
  • Number of latent dimensions (feature generation using LDA)
Workflow of Activities
[Workflow diagram. Labels: Chile’s Earthquake 2010, Castillo et al. (2010); Sampling, Cleaning & filtering; Our ground truth; Non-relevant messages (10% - 80%); Realistic dataset; Classifiers: feature selection (LDA), class imbalance]
Building the ground truth
• Random sample of 5,000 tweets from the Castillo et al. (2010) dataset, which was used to study credibility during Chile’s 2010 earthquake.
• Dates: from February 27th until March 2nd (spanning 4 days in 2010)
• We kept only Spanish messages and removed messages that were too similar (Levenshtein distance): 2,187 messages left
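The near-duplicate filtering step can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the lowercase normalization, and the `max_ratio` threshold are all assumptions, since the slides only state that Levenshtein distance was used.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (dynamic programming, two rows)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def drop_near_duplicates(messages, max_ratio=0.2):
    """Keep a message only if it is not too close to one already kept.

    max_ratio is a hypothetical threshold: two messages count as near
    duplicates when distance <= max_ratio * length of the longer one.
    """
    kept, kept_norm = [], []
    for msg in messages:
        text = msg.lower().strip()
        is_duplicate = any(
            levenshtein(text, other) <= max_ratio * max(len(text), len(other), 1)
            for other in kept_norm
        )
        if not is_duplicate:
            kept.append(msg)
            kept_norm.append(text)
    return kept


tweets = ["Terremoto en Chile", "terremoto en chile!", "Necesito agua en Concepción"]
unique_tweets = drop_near_duplicates(tweets)
```

The pairwise comparison is quadratic in the number of messages, which is acceptable for a sample of a few thousand tweets but would need blocking or hashing on a larger corpus.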
Validating the ground truth
• Fleiss' kappa: κ = 0.645, p < .001
• Intraclass correlation, ICC(2,1): ICC = 0.646, p < .001
• Landis and Koch (1977): κ between 0.61 and 0.80 indicates substantial agreement
• Relevant messages were labeled based on the Imran et al. (2013) classification:
  • Caution/Warning
  • Casualties and Damage
  • People (missing, found, etc.)
  • Information source
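For reference, the Fleiss' kappa statistic reported above can be computed from a table of rating counts. The sketch below is a generic implementation under the standard definition; the input layout (`counts[i][j]` = number of raters who put item i in category j) is an assumption, and the slides do not state how many raters labeled each tweet.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of
    raters who assigned item i to category j. Every item must have the
    same number of ratings."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")
    n_categories = len(counts[0])

    # Share of all ratings that fell into each category.
    p_j = [sum(row[j] for row in counts) / (n_items * n_raters)
           for j in range(n_categories)]
    # Observed agreement per item, then averaged over items.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_bar = sum(p_i) / n_items
    # Agreement expected by chance.
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)


# Toy example: 4 tweets, 3 raters, categories (relevant, non-relevant).
perfect = [[3, 0], [0, 3], [3, 0], [0, 3]]
kappa_perfect = fleiss_kappa(perfect)
```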
Classification Problem
Features
• User Network: Followers, Followees
• Content (4,766 unique words): Hashtags, Words, User mentions
Class Imbalance
• The ground truth is not a realistic representation of Twitter
• We added “noise”: introduced tweets not relevant to the event (20% - 80%)
• Sampled non-relevant tweets from 5 months
• Removed all tweets posted during days of seismic activity
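One way to build such a noisy dataset is sketched below. It assumes that "noise proportion" means the fraction of the final dataset made up of added non-relevant tweets; the slides do not define it precisely, and the function name, label encoding (1 = relevant, 0 = non-relevant), and seed are hypothetical.

```python
import random


def add_noise(ground_truth, non_relevant_pool, noise_proportion, seed=0):
    """Mix labeled ground-truth tweets with sampled non-relevant tweets.

    ground_truth: list of (text, label) pairs.
    non_relevant_pool: list of tweet texts from days without seismic activity.
    noise_proportion: assumed to be the share of the final dataset that
    consists of added noise, in [0, 1).
    """
    if not 0 <= noise_proportion < 1:
        raise ValueError("noise_proportion must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_noise / (n_noise + n_truth) = noise_proportion for n_noise.
    n_noise = round(len(ground_truth) * noise_proportion / (1 - noise_proportion))
    if n_noise > len(non_relevant_pool):
        raise ValueError("not enough non-relevant tweets in the pool")
    noise = [(text, 0) for text in rng.sample(non_relevant_pool, n_noise)]
    dataset = list(ground_truth) + noise
    rng.shuffle(dataset)
    return dataset


truth = [(f"relevant tweet {i}", 1) for i in range(20)]
pool = [f"unrelated tweet {i}" for i in range(200)]
noisy = add_noise(truth, pool, noise_proportion=0.8)
```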
Classification Results
Model               | Precision | Recall | F1 score | Accuracy | AUC   | Dimensions | Noise Proportion
Baseline            | 0.625     | 0.545  | 0.53     | 0.5      | 0.568 | -          | 0
Bernoulli NB        | 0.831     | 0.226  | 0.355    | 0.594    | 0.605 | 2000       | 0
Logistic Regression | 0.827     | 0.641  | 0.722    | 0.756    | 0.834 | 2000       | 0.6
Linear SVM          | 0.687     | 0.677  | 0.682    | 0.687    | 0.719 | 1000       | 0.6
Random Forest       | 0.807     | 0.673  | 0.734    | 0.758    | 0.844 | 1000       | 0.8
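As a reminder of what the reported columns measure, here is a small sketch of the standard binary definitions. It is illustrative only: the slides do not say whether the reported scores are per-class or averaged, and the function names and toy data are assumptions. AUC is computed with the pairwise (Mann-Whitney) formulation.

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall, F1 and accuracy for labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(y_true)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


def auc(y_true, scores):
    """Probability that a random positive is scored above a random
    negative (ties count as half). Quadratic, fine for small samples."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]
metrics = binary_metrics(y_true, y_pred)
area = auc(y_true, scores)
```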
Analysis ~ LDA Dimensions and Noise
Conclusions & Future Work
• We built and validated a ground truth of tweets in Spanish relevant to disasters
• We implemented a classifier and analyzed its performance across several algorithms while dealing with the class imbalance problem
• Future Work: Move the application from prototype to production, test online scalability
That’s all folks!
• Thanks and questions to corresponding author Alfredo Cobo: [email protected] or Denis Parra: [email protected]
Chile, a small country, but well-known for its..
• Length (4,300 km)
[Map labels: ~4,300 km, ~8,000 km]
Model Features
• Newman et al. (2007) • Biro et al. (2008) • Wei et al. (2006) • Wang et al. (2012) • Han (2005)
Features Corpora
Features: Followers, Hashtags, Friends, Words, User mentions
Results
• Amatriain et al. (2013)
Architecture
Plots of bootstrap
[Plot panels: Agreement Day 1, Agreement Day 2, Agreement Day 3, Agreement Day 4]
Word Frequencies
Just “Terremoto”: AUC
Related Work
Manual classification
• Vieweg et al. (2010) • Imran et al. (2013)
Post Processing
• Castillo et al. (2011) • Mendoza et al. (2010)
Feature Generation Approaches
• Gimpel et al. (2011) • Kouloumpis et al. (2011) • Liu et al. (2012) • Wu et al. (2011) • Lee et al. (2014)
Tools For Disaster Management
• Hiltz et al. (2013) • Power et al. (2013) • Caragea et al. (2011) • Abel et al. (2012) • Middleton et al. (2014) • Morstatter et al. (2013) • Imran et al. (2014)
Building the ground truth
• Mendoza et al. (2010)
• Imran et al. (2013)
Algorithms and evaluation procedure
• Castillo et al. (2011) • Fawcett et al. (2004) • Manning et al. (2008) • Wen et al. (2014)