Date post: | 08-Jul-2015 |
Upload: | michael-vogiatzis |
How to spot first stories on Twitter using Storm
Michael Vogiatzis - @mvogiatzis
Software Engineer
The Task
Find the first document in a stream of documents that discusses a
specific event.
@mvogiatzis
Spam
◦ It’s Cooooooooooooolddd !! Brrrrrr…
Neutral
◦ #nowplaying ♫ Live At The BBC – Dire Straits
Events
◦ The 6.4-magnitude quake struck just after 9.20pm (CST) on Sunday in the Banda Sea northeast of East Timor.
Algorithm
TF-IDF on input Tweet
Convert it to Vector
TF - IDF
Split text into words
Term Frequency * Inverse Document Frequency
More frequent words – less weight
Remove stop words and out-of-vocabulary words, e.g. “the”, “lol”
Remove URLs and mentions (@)
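The TF-IDF steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code from the talk: the `STOP_WORDS` set, the `tokenize` and `tf_idf` helpers, and the smoothed IDF formula are all assumptions made for the example.

```python
import math
import re
from collections import Counter

# Hypothetical word filter; a real system would use a much larger list.
STOP_WORDS = {"the", "a", "an", "is", "on", "in", "of", "lol"}


def tokenize(tweet):
    """Lowercase, strip URLs and @mentions, split into words, drop filtered words."""
    text = re.sub(r"https?://\S+|@\w+", " ", tweet.lower())
    return [w for w in re.findall(r"[a-z0-9#']+", text) if w not in STOP_WORDS]


def tf_idf(tokens, doc_freq, num_docs):
    """Return a sparse {term: weight} vector for one tweet.

    doc_freq maps a term to the number of previously seen tweets containing it;
    num_docs is the number of tweets seen so far.
    """
    counts = Counter(tokens)
    vector = {}
    for term, count in counts.items():
        tf = count / len(tokens)
        # Smoothed IDF (an assumption): terms seen in many tweets weigh less.
        idf = math.log((1 + num_docs) / (1 + doc_freq.get(term, 0))) + 1.0
        vector[term] = tf * idf
    return vector
```

With this sketch, a term that appears in most earlier tweets ends up with a smaller weight than a rare term in the same tweet, which is the "more frequent words, less weight" behaviour the slide describes.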
Algorithm
TF-IDF on input Tweet
Convert it to Vector
Find N nearest neighbours
◦ Locality Sensitive Hashing
Locality Sensitive Hashing
Data Clustering – Near neighbour search
Buckets – Hash Tables for similar documents
Random projection creates a hash
Identical hash -> nearest neighbour candidate
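A rough sketch of the random-projection idea, assuming sparse `{term: weight}` vectors and a single hash table. The class name `RandomProjectionLSH`, the `num_bits` and `seed` parameters, and the per-term Gaussian components are illustrative choices, not details taken from the talk.

```python
import random
from collections import defaultdict


class RandomProjectionLSH:
    """One hash table: each bit is the sign of a dot product with a random hyperplane."""

    def __init__(self, num_bits=8, seed=42):
        self.num_bits = num_bits
        self.seed = seed
        self.buckets = defaultdict(list)  # hash string -> list of document ids
        self._planes = {}                 # term -> one Gaussian component per bit

    def _components(self, term):
        # Draw the hyperplane components for a term lazily and deterministically,
        # so the vocabulary does not have to be known up front.
        if term not in self._planes:
            rng = random.Random(f"{self.seed}:{term}")
            self._planes[term] = [rng.gauss(0.0, 1.0) for _ in range(self.num_bits)]
        return self._planes[term]

    def hash(self, vector):
        sums = [0.0] * self.num_bits
        for term, weight in vector.items():
            for i, component in enumerate(self._components(term)):
                sums[i] += weight * component
        return "".join("1" if s >= 0 else "0" for s in sums)

    def add(self, doc_id, vector):
        """Store the document and return the ids already in its bucket."""
        key = self.hash(vector)
        candidates = list(self.buckets[key])
        self.buckets[key].append(doc_id)
        return candidates
```

Vectors pointing in a similar direction tend to fall on the same side of each hyperplane and so share a hash. Using several independent tables (different seeds) would raise the chance that a true neighbour collides in at least one of them.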
Locality Sensitive Hashing cont’d
Algorithm
TF-IDF on input Tweet
Convert it to Vector
Find N nearest neighbours
◦ Locality Sensitive Hashing
Compare distances and find the closest
If distance < threshold -> not a first story
Extra Step
If the closest distance found in the buckets is not short enough
Compare with a fixed number of recent tweets
Check the distance against the threshold again
Algorithm
TF-IDF on input Tweet
Convert it to Vector
Find N nearest neighbours
◦ Locality Sensitive Hashing
Compare distances and find the closest
If distance < threshold -> not a first story
Else compare with X most recent tweets (optimization)
If new_distance > threshold -> first story!
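The decision logic of the last three steps could look like the sketch below. It assumes cosine distance over sparse `{term: weight}` vectors and a threshold of 0.5; both are placeholders, as the slides do not state the distance measure or the threshold value. The names `cosine_distance` and `is_first_story` are hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def is_first_story(vector, candidates, recent, threshold=0.5):
    """candidates: vectors from the LSH buckets; recent: the X most recent tweets."""
    nearest = min((cosine_distance(vector, c) for c in candidates), default=1.0)
    if nearest < threshold:
        return False  # a close neighbour exists, so not a first story
    # Extra step: LSH may have missed a neighbour, so check recent tweets too.
    nearest_recent = min((cosine_distance(vector, r) for r in recent), default=1.0)
    return nearest_recent > threshold
```

The second comparison only runs when the buckets fail to produce a close neighbour, so the cost of scanning recent tweets is paid for a fraction of the stream.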
Storm
Real-time computation made easy
Storm
Distributed real-time computation system
Fault tolerant
Fast
Scalable
Guaranteed message processing
Open source
Multilang capabilities
Elements
Streams
◦ Set of tuples
◦ Unbounded sequence of data
Spout
◦ Source of streams
Bolts
◦ Application logic
◦ Functions
◦ Streaming aggregations, joins, DB ops
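To make the spout and bolt roles concrete, here is an analogy in plain Python generators. It does not use Storm's API, and it runs in a single process over a finite list rather than an unbounded stream; the function names are made up for the example.

```python
def tweet_spout(tweets):
    """Spout: the source of the stream, emitting one (id, text) tuple per tweet."""
    for tweet_id, text in enumerate(tweets):
        yield (tweet_id, text)


def tokenize_bolt(stream):
    """Bolt applying a function to each tuple: split the text into words."""
    for tweet_id, text in stream:
        yield (tweet_id, text.lower().split())


def count_bolt(stream):
    """Bolt doing a streaming aggregation: running word counts so far."""
    counts = {}
    for tweet_id, words in stream:
        for word in words:
            counts[word] = counts.get(word, 0) + 1
        yield (tweet_id, dict(counts))


def run_topology(tweets):
    """Wire spout -> bolt -> bolt, the way a topology wires its components."""
    return list(count_bolt(tokenize_bolt(tweet_spout(tweets))))
```

In Storm the same wiring is declared as a topology, and each component can run as many parallel tasks across a cluster.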
Topology
Part I
Part II
Results
Input Tweet | Stored Tweet | Similarity score
@Real_Liam_Payne i wanna be your female pal | i. wanna be your best friend so follow me | 0.385
RT @damnitstrue: Life is for living, not for stressing. | RT Life is for living, not for stressing. | 0.99
The 6.4-magnitude quake struck just after 9.20pm (CST) on Sunday in the Banda Sea northeast of East Timor. http://t.co/UhfwCS2xPp | Yay Sunday! | 0.129
Evaluation
Evaluation on speed-up metric
◦ 1381% vs single-threaded
◦ 372% vs multi-threaded (4 threads)
Having humans label tweets is hard!
Implementation tested on newswire and broadcast news
False alarms
Future work
Reduce false alarms by using threads for topics
Image similarity detection
Audio similarity?
◦ Hello Shazam!
Michael Vogiatzis
Twitter: @mvogiatzis
Code on Github
http://micvog.com
◦ Next post: “7 Lessons Learned at a London startup”