Series-O-Rama
Search & Recommend TV series with SQL
http://bit.ly/series-o-rama2012
Guillaume Cabanac
March 27th, 2012
Toulouse: A Picture is Worth a Thousand Words
Toulouse: population 437 000, students 97 000
Aberdeen: population 210 400, students ?? ???
Capbreton: 3h ride
Collioure: 2h30 ride
Ax-les-Thermes: 1h40 ride
Source: en.wikipedia.org
Telly Addicts Need Help to Find TV Series
Main Topics of Grey’s Anatomy? Text mining, Visualization
Series about ‘plane crash island’? Search engine
What should I watch next? Recommender system
amazon.com
Text Mining: Let’s Crunch Subtitles
Main Topics of Grey’s Anatomy? Text mining, Visualization
Series about ‘plane crash island’? Search engine
What should I watch next? Recommender system
Cold Case
Grey’s Anatomy
What’s in a Subtitle File?
Title – Season – Episode – Language.srt
1 episode = 1 plain text file
Synchronization start --> stop
Dialogue
We can easily extract words:
[ a, again*2, and, but, com, cuban, different, favorite, food, for*2, forum, going, great, happen*2, has, hungry, i*2, is, it, love, m, my, nice, night*2, miami, now, pork, s*2, sandwiches, something, the, to*2, tonight, town, www ]
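The word extraction can be sketched as follows. This is a minimal illustration, assuming the usual .srt layout (a cue number, a `start --> stop` line, then dialogue lines). The sample cues and the function name `extract_words` are hypothetical and do not come from the talk.

```python
import re
from collections import Counter

# Hypothetical two-cue subtitle snippet in the usual .srt layout.
SAMPLE_SRT = """1
00:00:01,000 --> 00:00:03,500
I love Cuban food.

2
00:00:04,000 --> 00:00:06,000
I'm hungry for pork sandwiches tonight.
"""

def extract_words(srt_text):
    """Return a Counter of lowercase words found in the dialogue lines only."""
    words = Counter()
    for line in srt_text.splitlines():
        line = line.strip()
        # Skip blank lines, cue numbers and synchronization lines.
        if not line or line.isdigit() or "-->" in line:
            continue
        words.update(re.findall(r"[a-z]+", line.lower()))
    return words

counts = extract_words(SAMPLE_SRT)
print(sorted(counts.items()))
```

Splitting on letters only explains entries such as `m` and `s` in the word list: they are what remains of contractions like "I'm". A dialogue line made only of digits would be skipped by this sketch.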
DB technology at Work!
[Home]
7 527 files = 337 MB
100% Java and Oracle
DB technology at Work!
[Search engine]
Ranked list of results
DB technology at Work!
[Infos]
Most popular terms
Most related series
DB technology at Work!
[Recommendations]
DB technology at Work!
[Recommendations]
I liked / I disliked
What should I watch next?
DB technology at Work!
[Recommendations]
Ranked list of recommendations
How Does this Work?
Architecture and Data Model
[Architecture diagram]
Offline: subtitles → indexing → DB
Online: GUI → searching, browsing, recommending → DB
Dict = { idT, term }
  8   plane
  27  killer
  29  crash
Posting = { idT*, idS*, nb }
  27  45  89
  8   45  3
  8   12  90
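The two relations above can be tried out with a small script. This is only a sketch: the talk states that the application runs on Oracle, and SQLite (through Python's `sqlite3` module) stands in here to keep the example self-contained. The column types, the composite key and the helper name `series_containing` are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Dict (idT INTEGER PRIMARY KEY, term TEXT);
    -- Column types and the composite key are assumptions.
    CREATE TABLE Posting (
        idT INTEGER REFERENCES Dict(idT),
        idS INTEGER,
        nb  INTEGER,
        PRIMARY KEY (idT, idS)
    );
    INSERT INTO Dict VALUES (8, 'plane'), (27, 'killer'), (29, 'crash');
    INSERT INTO Posting VALUES (27, 45, 89), (8, 45, 3), (8, 12, 90);
""")

def series_containing(term):
    """Return (idS, nb) pairs for a term, most frequent first."""
    return conn.execute(
        """SELECT p.idS, p.nb
           FROM Dict d JOIN Posting p ON p.idT = d.idT
           WHERE d.term = ?
           ORDER BY p.nb DESC""",
        (term,),
    ).fetchall()

print(series_containing("plane"))  # [(12, 90), (45, 3)]
```

Dict maps each term to an identifier once, and Posting records how many times (nb) a term occurs in a series, so looking up a term is a single join.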
Theory: Text Indexing Pipeline
Tokenization + lowercase: [the, plane, crashed, ..., planes, ..., is]
Stopwords removal: [plane, crashed, ..., planes, ...]
Stemming: [plane, crash, ..., plane, ...]
Counting: {(plane, 48), (crash, 15) ...}
Porter’s Stemmer (1980): http://qaa.ath.cx/porter_js_demo.html
Example text: In 1720, Robert Gordon retired to Aberdeen, having amassed a considerable fortune in Poland. On his death 11 years later, he willed his entire estate to build a residential school for educating young boys. In the summer of 1750, the Robert Gordon’s Hospital was born. In 1881, this was converted into a day school to be known as Robert Gordon’s College. This school also began to hold day and evening classes for boys, girls and adults in primary, secondary, mechanical and other subjects …
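The four pipeline steps can be sketched in a few lines. This is an illustration under stated assumptions: the stopword list is a tiny made-up one, and `naive_stem` is a crude suffix stripper that only stands in for Porter's stemmer (it is not Porter's algorithm).

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not the one used in the talk).
STOPWORDS = {"the", "is", "a", "an", "and", "of", "to", "in"}

def naive_stem(word):
    """Toy suffix stripper standing in for Porter's stemmer (NOT Porter's algorithm)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_text(text):
    tokens = re.findall(r"[a-z]+", text.lower())        # tokenization + lowercase
    tokens = [t for t in tokens if t not in STOPWORDS]  # stopwords removal
    stems = [naive_stem(t) for t in tokens]             # stemming
    return Counter(stems)                               # counting

print(index_text("The plane crashed. The planes crash."))
# Counter({'plane': 2, 'crash': 2})
```

Stemming is what lets "planes" and "crashed" count towards the same entries as "plane" and "crash". A real stemmer handles many more cases than this three-suffix rule.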
Theory: Similarity of Paired Series
Dice’s Coefficient (1945): based on set theory
Example: let us model a series as a set of terms
House = {hospital, doctor, crazy, psycho}
Grey’s = {doctor, care, hospital}
A big limitation: the distribution of terms among series is ignored. It makes no difference whether a term occurs 1 time or 1,000,000 times.
Vocabulary
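As a worked example, the two sets above can be compared with the standard definition of Dice's coefficient, 2 × |A ∩ B| / (|A| + |B|). The function name `dice` is only illustrative.

```python
def dice(a, b):
    """Dice's coefficient: 2 * |A ∩ B| / (|A| + |B|)."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

house = {"hospital", "doctor", "crazy", "psycho"}
greys = {"doctor", "care", "hospital"}
print(round(dice(house, greys), 3))  # 0.571
```

The two series share 2 terms out of 4 + 3, hence 4/7. Because sets only record whether a term is present, the score would be the same however often "doctor" is spoken in either series.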
Theory: Vector Space Model, Term Weighting
survive?
Raw TF: dexter > lost
Normalization, TF / max(TF): dexter < lost
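A small sketch shows how dividing by the largest count of each series can reverse a comparison. The counts below are made up for illustration, and reading max(TF) as the largest count within the same series is an assumption.

```python
# Hypothetical raw counts (nb) per series; the numbers are made up for illustration.
counts = {
    "dexter": {"survive": 40, "kill": 800},
    "lost":   {"survive": 30, "island": 300},
}

def normalized_tf(term_counts):
    """Divide every raw count by the largest count observed in the same series."""
    max_nb = max(term_counts.values())
    return {term: nb / max_nb for term, nb in term_counts.items()}

tf = {series: normalized_tf(c) for series, c in counts.items()}
print(counts["dexter"]["survive"] > counts["lost"]["survive"])  # True (raw: dexter > lost)
print(tf["dexter"]["survive"] < tf["lost"]["survive"])          # True (normalized: dexter < lost)
```

With these numbers the raw count favours the first series (40 against 30), while the normalized values are 0.05 and 0.1, so the order flips.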
Theory: Best Match Retrieval
1 TV series = 1 vector
Vector components (one per term): 1, 45, 1467, 6790, ..., n
Now, we know how to:
• Find most popular terms for a TV series
• Compute similarity between TV series
• Find TV series matching a query
Theory: More on Term Weighting
1 TV series = 1 vector
Vector components (one per term): 1, 45, 1467, 6790, ..., n
All terms are supposed to be equally representative… but ‘survive’ is way more unusual than ‘people’
‘survive’ better represents Lost than ‘people’ does
IDF: Inverse Document Frequency
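The slides do not spell out the IDF formula. The sketch below assumes one common definition, log10(N / df), where N is the number of series and df the number of series containing the term. The toy collection is hypothetical.

```python
import math

def idf(series_terms):
    """Map each term to log10(N / df), with N series and df series containing the term."""
    n = len(series_terms)
    df = {}
    for terms in series_terms.values():
        for term in terms:
            df[term] = df.get(term, 0) + 1
    return {term: math.log10(n / d) for term, d in df.items()}

# Hypothetical toy collection: 'people' occurs everywhere, 'survive' in one series only.
collection = {
    "Lost":   {"people", "survive", "island"},
    "Dexter": {"people", "killer"},
    "House":  {"people", "doctor"},
}
weights = idf(collection)
print(weights["people"], round(weights["survive"], 3))  # 0.0 0.477
```

A term found in every series gets a weight of 0, while a term found in a single series gets the highest weight, which matches the intuition that ‘survive’ says more about Lost than ‘people’ does.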
Theory: The Big Picture, TF*IDF
An important term for series S is frequent in S and globally unusual.
Theory … and Practice
Series = { idS, name, maxNb }
  12  Lost    540
  45  Dexter  125
Dict = { idT, term, idf }
  8   plane   1.25
  27  killer  2.87
  29  crash   3.07
Posting = { idT*, idS*, nb, tf }
  27  45  89  0.71
  8   45  3   0.02
  8   12  90  0.16
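The tf column can be derived from nb and maxNb with a single SQL statement. The sketch below uses SQLite through Python's `sqlite3` module as a stand-in for Oracle. The column types and the exact `UPDATE` statement are assumptions, not the author's code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Series  (idS INTEGER PRIMARY KEY, name TEXT, maxNb INTEGER);
    CREATE TABLE Posting (idT INTEGER, idS INTEGER, nb INTEGER, tf REAL);
    INSERT INTO Series  VALUES (12, 'Lost', 540), (45, 'Dexter', 125);
    INSERT INTO Posting (idT, idS, nb) VALUES (27, 45, 89), (8, 45, 3), (8, 12, 90);
    -- Normalized term frequency: nb divided by the maxNb of the same series.
    UPDATE Posting
    SET tf = 1.0 * nb / (SELECT maxNb FROM Series WHERE Series.idS = Posting.idS);
""")

for row in conn.execute("SELECT idT, idS, nb, tf FROM Posting ORDER BY idS, idT"):
    print(row)
```

The computed values (about 0.712, 0.024 and 0.167) agree with the tf column above once truncated to two decimals.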
Description of a TV Series
Lost
⋈
Many surnames need to be filtered out
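One way to describe a series is to join the three relations and rank its terms. The sketch below assumes the ranking key is tf * idf, which the slide does not confirm. SQLite stands in for Oracle and the helper name `top_terms` is hypothetical. The rows are the sample rows shown in the data model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Series  (idS INTEGER PRIMARY KEY, name TEXT, maxNb INTEGER);
    CREATE TABLE Dict    (idT INTEGER PRIMARY KEY, term TEXT, idf REAL);
    CREATE TABLE Posting (idT INTEGER, idS INTEGER, nb INTEGER, tf REAL);
    INSERT INTO Series  VALUES (12, 'Lost', 540), (45, 'Dexter', 125);
    INSERT INTO Dict    VALUES (8, 'plane', 1.25), (27, 'killer', 2.87), (29, 'crash', 3.07);
    INSERT INTO Posting VALUES (27, 45, 89, 0.71), (8, 45, 3, 0.02), (8, 12, 90, 0.16);
""")

def top_terms(series_name, limit=10):
    """Terms of one series ranked by tf * idf (assumed ranking key), highest first."""
    rows = conn.execute(
        """SELECT d.term, p.tf * d.idf AS weight
           FROM Series s
           JOIN Posting p ON p.idS = s.idS
           JOIN Dict d    ON d.idT = p.idT
           WHERE s.name = ?
           ORDER BY weight DESC
           LIMIT ?""",
        (series_name, limit),
    ).fetchall()
    return [term for term, _ in rows]

print(top_terms("Dexter"))  # ['killer', 'plane']
```

Character surnames are frequent in one series and rare elsewhere, so they score high under this kind of weighting, which is why they need to be filtered out of the description.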
Retrieval of TV Series: queries with 1 term
survive ⋈
Importance of normalization
• Stargate Atlantis: nb/maxNb = 63/1116 = 0.05645
• Blade: nb/maxNb = 9/163 = 0.05521
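The arithmetic behind these two scores can be checked directly. The snippet only reuses the numbers shown above. The variable names are illustrative.

```python
# Occurrences of 'survive' (nb) and the largest term count of each series (maxNb).
nb = {"Stargate Atlantis": 63, "Blade": 9}
max_nb = {"Stargate Atlantis": 1116, "Blade": 163}

tf = {name: nb[name] / max_nb[name] for name in nb}
for name in sorted(tf, key=tf.get, reverse=True):
    print(f"{name}: {tf[name]:.5f}")
# Stargate Atlantis: 0.05645
# Blade: 0.05521

raw_ratio = nb["Stargate Atlantis"] / nb["Blade"]  # 7.0
tf_ratio = tf["Stargate Atlantis"] / tf["Blade"]   # about 1.02
```

On raw counts Stargate Atlantis looks seven times more relevant than Blade. After normalization the two series are almost tied.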
Retrieval of TV Series: queries with n terms
survive mulder ⋈
67 | The Vampire Diaries
  term    | tf    | idf   | tf * idf
  survive | 0.028 | 0.107 | 0.028 * 0.107 = 0.003
  mulder  | 0.007 | 3.977 | 0.007 * 3.977 = 0.028
  sum = 0.031
18 | X-Files
  term    | tf    | idf   | tf * idf
  survive | 0.014 | 0.107 | 0.014 * 0.107 = 0.001
  mulder  | 1.000 | 3.977 | 1.000 * 3.977 = 3.977
  sum = 3.978
⋮
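A multi-term query can be scored in one grouped join that sums tf * idf per series. The sketch below reuses the tf and idf values shown for ‘survive’ and ‘mulder’. SQLite stands in for Oracle, the idT values 1 and 2 are placeholders (the slide does not show them), and the function name `search` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Series  (idS INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Dict    (idT INTEGER PRIMARY KEY, term TEXT, idf REAL);
    CREATE TABLE Posting (idT INTEGER, idS INTEGER, tf REAL);
    INSERT INTO Series  VALUES (67, 'The Vampire Diaries'), (18, 'X-Files');
    -- idT values 1 and 2 are placeholders; the slide does not show them.
    INSERT INTO Dict    VALUES (1, 'survive', 0.107), (2, 'mulder', 3.977);
    INSERT INTO Posting VALUES (1, 67, 0.028), (2, 67, 0.007),
                               (1, 18, 0.014), (2, 18, 1.000);
""")

def search(terms):
    """Rank series by the sum of tf * idf over the query terms."""
    placeholders = ", ".join("?" for _ in terms)
    return conn.execute(
        f"""SELECT s.name, ROUND(SUM(p.tf * d.idf), 3) AS score
            FROM Dict d
            JOIN Posting p ON p.idT = d.idT
            JOIN Series s  ON s.idS = p.idS
            WHERE d.term IN ({placeholders})
            GROUP BY s.idS, s.name
            ORDER BY score DESC""",
        list(terms),
    ).fetchall()

print(search(["survive", "mulder"]))
# [('X-Files', 3.978), ('The Vampire Diaries', 0.031)]
```

The rare term ‘mulder’ (idf 3.977) dominates the score, so X-Files ranks far above The Vampire Diaries even though both series contain both terms.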
Computing Similarities Among TV Series (1/2)
Similar to House?
⋈
First, let’s compute the numerator Σ Ai × Bi, where:
Ai = terms from House
Bi = terms from another TV series
Computing Similarities Among TV Series (2/2)
Similar to House?
⋈
⋈
⋈
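A numerator built from products of paired term weights suggests cosine similarity, although the slides do not name the measure. The sketch below assumes cosine. The self-join computes the numerator, the sums of squares give the denominator, and the final division is done in Python. The table `Vec`, its weights and the function name `similar_to` are hypothetical, and SQLite stands in for Oracle.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Vec (series TEXT, term TEXT, w REAL);
    -- Hypothetical term weights (for instance tf * idf); not taken from the slides.
    INSERT INTO Vec VALUES
        ('House', 'hospital', 2.0), ('House', 'doctor', 1.0), ('House', 'psycho', 2.0),
        ('Grey''s Anatomy', 'hospital', 2.0), ('Grey''s Anatomy', 'doctor', 1.0),
        ('Lost', 'island', 3.0), ('Lost', 'doctor', 4.0);
""")

def similar_to(ref):
    """Return (series, cosine) pairs for series sharing a term with ref, best first."""
    rows = conn.execute(
        """
        WITH sq AS (SELECT series, SUM(w * w) AS sumsq FROM Vec GROUP BY series)
        SELECT b.series, SUM(a.w * b.w) AS numerator, sa.sumsq, sb.sumsq
        FROM Vec a
        JOIN Vec b ON b.term = a.term AND b.series <> a.series
        JOIN sq sa ON sa.series = a.series
        JOIN sq sb ON sb.series = b.series
        WHERE a.series = ?
        GROUP BY b.series, sa.sumsq, sb.sumsq
        """,
        (ref,),
    ).fetchall()
    scored = [(name, num / math.sqrt(sq_a * sq_b)) for name, num, sq_a, sq_b in rows]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

for name, score in similar_to("House"):
    print(name, round(score, 3))
# Grey's Anatomy 0.745
# Lost 0.267
```

Series that share no term with the reference series never appear in the join, which amounts to a similarity of 0.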
Thank you
http://www.irit.fr/~Guillaume.Cabanac