Flux of MEME - final report

Date post: 12-May-2015
Upload: thomas-alisi
Description:
final presentation and results of topic extraction and analysis tool, developed for telecom italia working capital research grant
Transcript
Page 1: Flux of MEME - final report

flux of meme - final report
telecom italia, milan 30.9.11
thomas alisi @grudelsud

Friday, September 30, 11

Page 2: Flux of MEME - final report

the basics

Page 3: Flux of MEME - final report

the idea

Meme: a postulated unit or element of cultural ideas transmitted from one mind to another through speech or similar phenomena.

Zeitgeist: German language expression referring to "the spirit of the times"

Semantic Web: an evolving development of the World Wide Web in which the meaning (semantics) of information on the web is defined, making it possible for machines to process it

Flux of MEME: analysis of the web Zeitgeist through geo-localized Memes, updated and shared on social media mainly via mobile networks

Page 4: Flux of MEME - final report

background

yahoo research
WWW2011 - Who Says What to Whom on Twitter - Wu, Hofman, Mason, Watts
WSDM2011 - Who Uses Web Search for What? And How? - Weber, Jaimes
CSCW2011 - Peaks and Persistence: Modeling the Shape of Microblog Conversations - Shamma, Kennedy, Churchill

others
WWW2010 - What is Twitter, a Social Network or a News Media? - Kwak, Lee, Park, Moon
Tech report 2009 (Princeton / Carnegie Mellon) - Topic Models - Blei, Lafferty
Tech report 2009 (Facebook / Maryland / Princeton) - Reading Tea Leaves: How Humans Interpret Topic Models - Chang, Boyd-Graber, Gerrish, Wang, Blei

Page 5: Flux of MEME - final report

algorithm steps

1. fetch data
2. create clusters
3. extract topics
4. analyze stats

Page 6: Flux of MEME - final report

implementation

Page 7: Flux of MEME - final report

step 1. fetch data!

using the free Spritzer access to the Twitter streaming API (~1% of total tweets)
defined set of location boxes (Italy, UK, France, Spain)
reinforcing locations with geonames didn’t prove to be efficient (origin: from a galaxy far far away)
enrich content through web scraping, also carrying meta & opengraph keywords
blacklist of noisy sources
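The location-box and blacklist filters can be pictured with a minimal sketch. Everything specific here is an assumption for illustration: the box coordinates are rough guesses, the blacklist entry is made up, and the post layout (a dict with `coordinates` and `source`) is hypothetical rather than the tool's actual data model.

```python
from typing import NamedTuple, Optional


class LocationBox(NamedTuple):
    """A rectangular area given by its south-west and north-east corners."""
    sw_lat: float
    sw_lon: float
    ne_lat: float
    ne_lon: float

    def contains(self, lat: float, lon: float) -> bool:
        return (self.sw_lat <= lat <= self.ne_lat
                and self.sw_lon <= lon <= self.ne_lon)


# Hypothetical, very rough boxes: the real coordinates are not in the slides.
BOXES = {
    "italy": LocationBox(36.0, 6.5, 47.5, 19.0),
    "uk": LocationBox(49.5, -8.5, 61.0, 2.0),
}

# Hypothetical noisy source, standing in for the real blacklist.
BLACKLIST = {"spambot.example"}


def keep_post(post: dict) -> Optional[str]:
    """Return the name of the first matching box, or None if the post is dropped."""
    if post.get("source") in BLACKLIST:
        return None
    coords = post.get("coordinates")
    if coords is None:  # not geotagged
        return None
    lat, lon = coords
    for name, box in BOXES.items():
        if box.contains(lat, lon):
            return name
    return None
```

Posts without coordinates are simply dropped here, which matches the note above that reinforcing free-text locations with geonames did not pay off.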

Page 8: Flux of MEME - final report

step 2. create geo-clusters

create time slices
select all the posts within a time slice
choose geo-granularity (radius of clusters)
agglomerate posts with Hierarchical Agglomerative Clustering (HAC)
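A small sketch of the time slicing and agglomeration steps, under stated assumptions: the slides do not say which linkage or distance the tool uses, so this version merges the two clusters with the closest centroids (haversine distance) until no pair is within the chosen radius. The function names and the post tuples are illustrative only.

```python
import math
from collections import defaultdict
from datetime import datetime, timezone


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def centroid(points):
    """Arithmetic mean of (lat, lon) pairs; fine for small areas away from the date line."""
    return (sum(p[0] for p in points) / len(points),
            sum(p[1] for p in points) / len(points))


def slice_posts(posts, hours=24):
    """Group (timestamp, lat, lon) posts into fixed-width time slices."""
    slices = defaultdict(list)
    width = hours * 3600
    for ts, lat, lon in posts:
        slices[int(ts.timestamp()) // width].append((lat, lon))
    return slices


def hac(points, radius_km):
    """Merge the two closest clusters until no centroid pair is within radius_km."""
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = haversine_km(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > radius_km:
            break
        clusters[i].extend(clusters.pop(j))  # j > i, so index i stays valid
    return clusters


# Usage with made-up posts: two near Milan, one near Rome, all on the same day.
posts = [
    (datetime(2011, 5, 7, 10, tzinfo=timezone.utc), 45.46, 9.19),
    (datetime(2011, 5, 7, 12, tzinfo=timezone.utc), 45.47, 9.20),
    (datetime(2011, 5, 7, 20, tzinfo=timezone.utc), 41.90, 12.50),
]
for key, points in slice_posts(posts).items():
    print(key, [len(c) for c in hac(points, radius_km=50)])
```

The pairwise search makes each merge quadratic in the number of clusters, which is acceptable for a sketch but would need an index or a library implementation on real volumes.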

Page 9: Flux of MEME - final report

step 3. extract topics

a geo-cluster represents the whole bag of words used to define a document
topic extraction is implemented with LDA
α: Dirichlet prior param. on the per-document topic distributions (frontend output: weight)
β: Dirichlet prior param. on the per-topic word distributions
θ_i: the topic distribution for document i
z_ij: the topic for the j-th word in document i
w_ij: the specific word

user defined params: number of topics, number of words per topic, min followers
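To make the symbols concrete, here is a minimal sketch of the generative process that LDA assumes, using only the standard library. It is not the tool's topic extraction (the slides list MALLET among the libraries, and inference runs in the opposite direction, from words back to topics). The name `phi` for the per-topic word distributions and all function names are assumptions introduced for this example.

```python
import random


def dirichlet(concentration, k, rng):
    """Sample from a symmetric Dirichlet over k outcomes via normalized gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(vocab, n_topics, n_docs, doc_len, alpha, beta, seed=0):
    """Generate documents the way LDA assumes they were written."""
    rng = random.Random(seed)
    # One word distribution per topic, drawn with prior beta.
    phi = [dirichlet(beta, len(vocab), rng) for _ in range(n_topics)]
    docs, thetas = [], []
    for _ in range(n_docs):
        theta = dirichlet(alpha, n_topics, rng)  # theta_i, drawn with prior alpha
        words = []
        for _ in range(doc_len):
            z = rng.choices(range(n_topics), weights=theta)[0]  # z_ij
            w = rng.choices(vocab, weights=phi[z])[0]           # w_ij
            words.append(w)
        docs.append(words)
        thetas.append(theta)
    return docs, thetas
```

In this picture each geo-cluster plays the role of one document, and the user-defined number of topics corresponds to `n_topics`.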

Page 10: Flux of MEME - final report

step 4. analyze data

define search context: topics or keywords
perform live search with TF-IDF indicators
display time-lapse of clusters’ analytics evolution (log-scale count and average size)
quick and easy interface: toggle visibility of clusters
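As a rough illustration of a keyword search ranked by TF-IDF, the sketch below treats each cluster as a list of tokens. The slides do not state which TF-IDF variant the tool uses, so the weighting here (relative term frequency times the log of inverse document frequency) is an assumption, as are the function names.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: score} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({t: (c / len(doc)) * math.log(n / df[t])
                       for t, c in tf.items()})
    return scores


def search(docs, keyword):
    """Rank document indices by the TF-IDF score of keyword, best first."""
    scores = tf_idf(docs)
    hits = [(s[keyword], i) for i, s in enumerate(scores) if keyword in s]
    return [i for _, i in sorted(hits, reverse=True)]
```

A term that occurs in every document scores zero under this variant, so very common words fall to the bottom of the ranking.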

Page 11: Flux of MEME - final report

step 4. analyze data

drag and zoom on specific location boxes
select time interval
display aggregated stats of clusters (count and size) within location box
show and export breakdown of posts’ languages
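The aggregation behind this view could look roughly like the sketch below. The record layout (dicts with `lat`, `lon`, `day`, `size`, `lang`) is hypothetical, and "size" is assumed to mean the number of posts in a cluster, which the slides do not spell out.

```python
from collections import Counter
from datetime import date


def in_box(lat, lon, sw_lat, sw_lon, ne_lat, ne_lon):
    """True if the point lies inside the south-west / north-east rectangle."""
    return sw_lat <= lat <= ne_lat and sw_lon <= lon <= ne_lon


def cluster_stats(clusters, box, since, until):
    """Count clusters and average their size inside a box and a date interval."""
    selected = [c for c in clusters
                if in_box(c["lat"], c["lon"], *box) and since <= c["day"] <= until]
    count = len(selected)
    avg_size = sum(c["size"] for c in selected) / count if count else 0.0
    return {"count": count, "avg_size": avg_size}


def language_breakdown(posts, box):
    """Count posts per language code inside a box."""
    return Counter(p["lang"] for p in posts if in_box(p["lat"], p["lon"], *box))


# Usage with made-up clusters and a box (sw_lat, sw_lon, ne_lat, ne_lon) around northern Italy.
box = (44.61, 8.52, 45.57, 11.33)
clusters = [
    {"lat": 45.46, "lon": 9.19, "day": date(2011, 5, 8), "size": 10},
    {"lat": 41.90, "lon": 12.50, "day": date(2011, 5, 8), "size": 5},
]
print(cluster_stats(clusters, box, date(2011, 5, 7), date(2011, 5, 10)))
```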

Page 12: Flux of MEME - final report

step 4. analyze data

show stats and content of specific clusters

lat-lon of centroids, std. deviation, surface and radius

display weighted topics, TF-IDF of terms within topics, TF-IDF of meta keywords
show / export list of posts
show related links
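One plausible way to derive the per-cluster figures (centroid, standard deviation, radius, surface) is sketched below. The definitions are assumptions, since the slides only name the quantities: the radius is taken as the largest haversine distance from the centroid, and the surface as the area of a circle with that radius.

```python
import math
from statistics import fmean, pstdev


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def cluster_geometry(points):
    """points: list of (lat, lon) in degrees for the posts of one cluster."""
    lats = [p[0] for p in points]
    lons = [p[1] for p in points]
    center = (fmean(lats), fmean(lons))
    # Assumed definition: distance from the centroid to the farthest post.
    radius = max(haversine_km(center, p) for p in points)
    return {
        "centroid": center,
        "std_dev": (pstdev(lats), pstdev(lons)),  # in degrees, per axis
        "radius_km": radius,
        "surface_km2": math.pi * radius ** 2,
    }
```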

Page 13: Flux of MEME - final report

step 4. analyze data

show query metrics and parameters
display overall TF-IDF for the selected query

Page 14: Flux of MEME - final report

demo
http://fom.londondroids.com/fom/

Page 15: Flux of MEME - final report

sorry guys, now the boring stuff...
backend, front-end API, cron jobs

Page 16: Flux of MEME - final report

Backend

Streaming API
a batch process is constantly running and saving data on the db
options: fetch by search query, expand terms with wikiminer, access all the stream, filter geotagged, filter location box, fetch related content

Clustering and Topic extraction
define geo granularity time/size of geo clusters
followers and retweets
number of topics / keywords
language mapping

Page 17: Flux of MEME - final report

API

search clusters containing specific topics / keywords
returns lists of clusters ordered by topic weight
all data extraction APIs conform to a RESTful model and return JSON structured data

Page 18: Flux of MEME - final report

API

read list of geographic clusters
usually called after a search topic has been raised

Page 19: Flux of MEME - final report

API

read semantic content of a geographic cluster
topics grouped by score (alpha parameter in LDA) and words weighted with TF-IDF with respect to the whole cluster content

Page 20: Flux of MEME - final report

API

read meta / opengraph content of a geographic cluster

Page 21: Flux of MEME - final report

API

export list of posts

exports all the posts contained in a cluster

example request: /cluster/export_posts/1026/csv

read post content

reads the content of a post

example request: /cluster/read_post/560951

read related link

read the content of a link related to a post (the id is usually fetched through the variable “links” returned by the function above)

example request: /cluster/read_link/16268

execute cluster stats within a location box

reads the list of clusters contained within a location box and creates stat charts (in the form of google chart images)

example request: /cluster/dzstat/c_since=2011-05-07/c_until=2011-05-10/swLat=44.61/swLon=8.52/neLat=45.57/neLon=11.33

execute post stats within a location box

reads the list of posts contained within a location box and performs stats on languages

example request: /search/dzstat/p_since=2011-05-07/p_until=2011-05-10/p_timespan=daily/swLat=44.61/swLon=8.52/neLat=45.57/neLon=11.33

read query content

reads the list of geo-clusters associated to a specific query id (usually fetched by the function above)

example request: /cluster/read/2
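The example requests above follow a path-segment convention (`/controller/action/...` with either positional values or `key=value` pairs), which a client could assemble as in this sketch. The helper names are made up for illustration, the base URL is the demo address from the slides (it may no longer be online, and it is an assumption that the API lives under it), and the JSON response schema is not documented in the slides, so `fetch_json` just returns whatever it decodes.

```python
import json
from urllib.parse import urljoin
from urllib.request import urlopen

# Demo address from the slides; assumed, not confirmed, to host the API.
BASE_URL = "http://fom.londondroids.com/fom/"


def build_path(controller, action, *args, **params):
    """Compose /controller/action/arg/.../key=value/... as in the example requests."""
    segments = [controller, action]
    segments += [str(a) for a in args]
    segments += [f"{k}={v}" for k, v in params.items()]
    return "/" + "/".join(segments)


def fetch_json(path, base_url=BASE_URL):
    """GET a path below base_url and decode the JSON body."""
    with urlopen(urljoin(base_url, path.lstrip("/"))) as response:
        return json.load(response)


# Rebuilds two of the example requests listed above.
print(build_path("cluster", "export_posts", 1026, "csv"))
print(build_path("cluster", "dzstat",
                 c_since="2011-05-07", c_until="2011-05-10",
                 swLat=44.61, swLon=8.52, neLat=45.57, neLon=11.33))
```

Keyword arguments keep their call order, so the parameters appear in the path in the same order as in the slides.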

Page 22: Flux of MEME - final report

Cron

keep everything running
restart the streaming API now and then, so as to keep twitter happy
create the clusters at the end of the day

Page 23: Flux of MEME - final report

Page 24: Flux of MEME - final report

servers

Page 25: Flux of MEME - final report

final thoughts

Page 26: Flux of MEME - final report

improvements

optimize time slicing!
emerging topics should be checked on an hourly basis among the complete dataset

train models!
a training set would be ideal to create models and optimize performances of the topic extraction algorithm
models could relate to specific context in order to improve results (e.g. all the tweets from newspapers)

create language classifiers
increase the precision of language detection with naive bayes classifiers
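A naive Bayes language classifier of the kind suggested here could start from something like the sketch below: a multinomial model over character trigrams with add-one smoothing. This is an illustration of the proposed improvement, not something the tool is stated to contain. The class name and the tiny training sentences are made up, and a usable classifier would need far more training text per language.

```python
import math
from collections import Counter, defaultdict


def trigrams(text):
    """Character trigrams of the lowercased text, padded at both ends."""
    padded = f"  {text.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


class NaiveBayesLanguage:
    def __init__(self):
        self.counts = defaultdict(Counter)  # language -> trigram counts
        self.docs = Counter()               # language -> number of training texts

    def train(self, text, lang):
        self.counts[lang].update(trigrams(text))
        self.docs[lang] += 1

    def predict(self, text):
        """Return the language with the highest log posterior for the text."""
        vocab = {g for c in self.counts.values() for g in c}
        total_docs = sum(self.docs.values())
        best_lang, best_score = None, -math.inf
        for lang, counts in self.counts.items():
            total = sum(counts.values())
            score = math.log(self.docs[lang] / total_docs)  # log prior
            for g in trigrams(text):
                # Add-one smoothing so unseen trigrams do not zero the score.
                score += math.log((counts[g] + 1) / (total + len(vocab)))
            if score > best_score:
                best_lang, best_score = lang, score
        return best_lang


# Made-up toy training data, only to show the calls.
clf = NaiveBayesLanguage()
clf.train("il treno per milano parte alle otto", "it")
clf.train("oggi sciopero dei treni a roma", "it")
clf.train("the train to london leaves at eight", "en")
clf.train("there is a strike on the trains today", "en")
print(clf.predict("lo sciopero dei treni"))
```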

think of scalability
increasing the amount of data makes it necessary to scale up to Map/Reduce architectures

increase flexibility (e.g. manage multimedia data, offer a rich contextualized API, ...)
enhance analysis and visualization (e.g. reinforce topic correlation / n-grams)

Page 27: Flux of MEME - final report

other refs

algorithms
LDA - http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
HAC - http://en.wikipedia.org/wiki/Cluster_analysis

libraries
twitter 4 java - http://twitter4j.org
machine learning - http://mallet.cs.umass.edu/
jquery (core + ui) - http://jquery.org/
data tables - http://datatables.net/
chart api - http://code.google.com/apis/chart/

image courtesy
http://yesyesno.com/nike-city-runs

Page 28: Flux of MEME - final report

thanks!
codebase source + wiki https://github.com/grudelsud/fom
thomas alisi @grudelsud
giuseppe serra @giuseppeserra
marco bertini @bertinimarco
?
