
Series-O-Rama: Search & Recommend TV series with SQL (http://bit.ly/series-o-rama2012)

Page 1

Series-O-Rama

Search & Recommend TV series with SQL

http://bit.ly/series-o-rama2012

Guillaume Cabanac [email protected]

March 27th, 2012

Page 2

Toulouse: A Picture is Worth a Thousand Words

[Figure: pictures numbered 1 to 4]

Toulouse: population 437 000, students 97 000
Aberdeen: population 210 400, students ?? ???

Capbreton: 3h ride
Collioure: 2h30 ride
Ax-les-Thermes: 1h40 ride

Page 3

Telly Addicts Need Help to Find TV Series

Main Topics of Grey’s Anatomy? Text mining, Visualization
Series about ‘plane crash island’: Search engine
What should I watch next? Recommender system

Image sources: en.wikipedia.org, amazon.com

Page 4

Text Mining: Let’s Crunch Subtitles

Main Topics of Grey’s Anatomy? Text mining, Visualization
Series about ‘plane crash island’: Search engine
What should I watch next? Recommender system

[Figure panels: Cold Case, Grey’s Anatomy]

Page 5

What’s in a Subtitle File?

File name: Title – Season – Episode – Language.srt
1 episode = 1 plain text file

Each entry holds:
Synchronization: start --> stop
Dialogue

We can easily extract words:
[ a, again*2, and, but, com, cuban, different, favorite, food, for*2, forum, going, great, happen*2, has, hungry, i*2, is, it, love, m, my, nice, night*2, miami, now, pork, s*2, sandwiches, something, the, to*2, tonight, town, www ]
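The word extraction from a subtitle file could be sketched as follows. This is only an illustration in Python (the deck reports a Java and Oracle implementation). The sample subtitle text, the `TIMING` pattern, and the helper name `extract_words` are hypothetical, not taken from the slides.

```python
import re
from collections import Counter

# Hypothetical two-entry excerpt in SubRip (.srt) layout:
# entry number, "start --> stop" timing line, then dialogue lines.
SAMPLE_SRT = """\
1
00:00:01,000 --> 00:00:03,500
I'm hungry. I love Cuban food.

2
00:00:04,000 --> 00:00:06,000
Something is going to happen tonight.
"""

TIMING = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$")


def extract_words(srt_text: str) -> Counter:
    """Count lowercase words in dialogue lines, skipping entry numbers and timings."""
    counts = Counter()
    for line in srt_text.splitlines():
        line = line.strip()
        # Note: a dialogue line made only of digits would also be skipped here.
        if not line or line.isdigit() or TIMING.match(line):
            continue
        counts.update(re.findall(r"[a-z]+", line.lower()))
    return counts


words = extract_words(SAMPLE_SRT)
```

Splitting on letters only is why fragments such as `m` and `s` (from "I'm", "it's") show up as separate words in the slide's list.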

Page 6

DB technology at Work! [Home]

7 527 files = 337 MB

100% Java and Oracle

Page 7

DB technology at Work! [Search engine]

Ranked list of results

Page 8

DB technology at Work! [Infos]

Most popular terms

Most related series

Page 9

DB technology at Work! [Recommendations]

Page 10

DB technology at Work! [Recommendations]

I liked / I disliked

What should I watch next?

Page 11

DB technology at Work! [Recommendations]

Ranked list of recommendations

Page 12

How Does this Work?

Page 13

Architecture and Data Model

[Diagram labels: subtitles, indexing (offline); DB; searching, browsing, recommending, GUI (online)]

Dict = { idT, term }
8   plane
27  killer
29  crash

Posting = { idT*, idS*, nb }
27  45  89
8   45  3
8   12  90
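The data model above can be sketched as two SQL tables. The example below is a hedged illustration using Python's built-in `sqlite3` rather than the Oracle database named in the deck. Column types and constraints are assumptions. The starred attributes are read as key columns, and the Posting rows are read as (27, 45, 89), (8, 45, 3), (8, 12, 90).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Dict (
    idT  INTEGER PRIMARY KEY,   -- term identifier
    term TEXT NOT NULL UNIQUE
);
CREATE TABLE Posting (
    idT INTEGER NOT NULL REFERENCES Dict(idT),
    idS INTEGER NOT NULL,       -- series identifier
    nb  INTEGER NOT NULL,       -- occurrences of the term in the series
    PRIMARY KEY (idT, idS)
);
""")
conn.executemany("INSERT INTO Dict VALUES (?, ?)",
                 [(8, "plane"), (27, "killer"), (29, "crash")])
conn.executemany("INSERT INTO Posting VALUES (?, ?, ?)",
                 [(27, 45, 89), (8, 45, 3), (8, 12, 90)])

# Terms of series 45, most frequent first.
rows = conn.execute("""
    SELECT d.term, p.nb
    FROM Posting p JOIN Dict d ON d.idT = p.idT
    WHERE p.idS = ?
    ORDER BY p.nb DESC
""", (45,)).fetchall()
```

Storing one row per (term, series) pair keeps the index sparse: a term that never occurs in a series simply has no Posting row.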

Page 14

Theory: Text Indexing Pipeline

Tokenization + lowercase: [the, plane, crashed, ..., planes, ..., is]
Stopwords removal: [plane, crashed, ..., planes, ...]
Stemming: [plane, crash, ..., plane, ...]
Counting: {(plane, 48), (crash, 15) ...}

Porter’s Stemmer (1980): http://qaa.ath.cx/porter_js_demo.html

Sample text: In 1720 Robert Gordon retired to Aberdeen having amassed a considerable fortune in Poland. On his death 11 years later he willed his entire estate to build a residential school for educating young boys. In the summer of 1750 the Robert Gordon’s Hospital was born. In 1881 this was converted into a day school to be known as Robert Gordon’s College. This school also began to hold day and evening classes for boys girls and adults in primary secondary mechanical and other subjects …
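The four pipeline stages can be sketched in a few lines of Python (an illustration only; the deck's system is reported as Java and Oracle). The stopword list is a tiny hypothetical one, and `toy_stem` is a crude suffix stripper standing in for Porter's stemmer, not an implementation of it.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "is", "a", "and", "on", "of", "to", "in"}


def toy_stem(word: str) -> str:
    """Crude suffix stripping for illustration only; this is NOT Porter's algorithm."""
    for suffix in ("ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text: str) -> Counter:
    tokens = re.findall(r"[a-z]+", text.lower())       # tokenization + lowercase
    kept = [t for t in tokens if t not in STOPWORDS]    # stopwords removal
    stems = [toy_stem(t) for t in kept]                 # stemming
    return Counter(stems)                               # counting


counts = index_terms("The plane crashed on the island. The planes crashed.")
```

After stemming, "plane" and "planes" fall into the same bucket, which is what lets a query for one form match subtitles containing the other.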

Page 15

Theory: Similarity of Paired Series

Dice’s Coefficient (1945): based on set theory

Example: let us model a series as a set of terms
House = {hospital, doctor, crazy, psycho}
Grey’s = {doctor, care, hospital}

A big limitation: the distribution of terms among series is ignored.
It makes no difference whether a term occurs 1 time or 1,000,000 times.
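The slide's formula did not survive extraction. The standard definition of Dice's coefficient for two sets A and B is 2·|A ∩ B| / (|A| + |B|). Below is a small Python sketch of it applied to the two example sets; the function name `dice` and the empty-set convention are assumptions.

```python
def dice(a: set, b: set) -> float:
    """Dice's coefficient: 2 * |A & B| / (|A| + |B|); defined here as 0.0 for two empty sets."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


house = {"hospital", "doctor", "crazy", "psycho"}
greys = {"doctor", "care", "hospital"}

# Shared terms: hospital, doctor, so the score is 2 * 2 / (4 + 3) = 4/7.
score = dice(house, greys)
```

Because the inputs are sets, a term counts once however often it occurs.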

Page 16

Theory: Vector Space Model, Term Weighting

[Figure: frequency of ‘survive’ over the vocabulary for Dexter and Lost, with each series’ maximum (max) marked]

Raw TF: dexter > lost
Normalization, TF / max(TF): dexter < lost
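A short Python sketch of how normalization can reverse a ranking. All counts below are hypothetical, chosen only to reproduce the effect shown on the slide; they are not the deck's data.

```python
# Hypothetical counts, chosen only to reproduce the effect on the slide.
raw_tf = {"dexter": 30, "lost": 20}      # occurrences of 'survive' in each series
max_tf = {"dexter": 600, "lost": 100}    # count of each series' most frequent term

# Normalized TF = TF / max(TF), computed per series.
normalized = {series: raw_tf[series] / max_tf[series] for series in raw_tf}
```

With raw counts Dexter ranks first (30 > 20). After dividing by each series' maximum, Lost ranks first (0.2 > 0.05), because 'survive' takes a larger share of Lost's vocabulary use.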

Page 17

Theory: Best Match Retrieval

1 TV series = 1 vector
Vector components: 1, 45, 1467, 6790, ..., n

Now, we know how to:
Find most popular terms for a TV series
Compute similarity between TV series
Find TV series matching a query

Page 18

Theory: More on Term Weighting

1 TV series = 1 vector
Vector components: 1, 45, 1467, 6790, ..., n

All terms are supposed to be equally representative… but ‘survive’ is way more unusual than ‘people’.

‘survive’ better represents Lost than ‘people’ does.

IDF: Inverse Document Frequency
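The slides do not state which IDF formula the system uses. A common textbook form is ln(N / df), where N is the number of series and df the number of series containing the term. The Python sketch below assumes that form; the collection size and document frequencies are hypothetical.

```python
import math


def idf(n_series: int, series_with_term: int) -> float:
    """Textbook inverse document frequency: ln(N / df). An assumed form, not the deck's."""
    return math.log(n_series / series_with_term)


# Hypothetical collection of 50 series.
idf_people = idf(50, 50)    # a term found in every series carries no weight
idf_survive = idf(50, 5)    # a term found in few series weighs more: ln(10)
```

A term present in every series gets weight 0, so it cannot distinguish one series from another.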

Page 19

Theory: The Big Picture: TF*IDF

An important term for series S is frequent in S and globally unusual.

Page 20

Theory … and Practice

Series = { idS, name, maxNb }
12  Lost    540
45  Dexter  125

Dict = { idT, term, idf }
8   plane   1.25
27  killer  2.87
29  crash   3.07

Posting = { idT*, idS*, nb, tf }
27  45  89  0.71
8   45  3   0.02
8   12  90  0.16
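The `tf` column appears to be `nb / maxNb` for the owning series: 89/125 = 0.712, 3/125 = 0.024 and 90/540 ≈ 0.167, which match the slide's 0.71, 0.02 and 0.16 if values are truncated to two decimals. The sketch below computes it in SQL through Python's `sqlite3`, an assumption standing in for the deck's Oracle database. The column types and the final query are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Series  (idS INTEGER PRIMARY KEY, name TEXT, maxNb INTEGER);
CREATE TABLE Dict    (idT INTEGER PRIMARY KEY, term TEXT, idf REAL);
CREATE TABLE Posting (idT INTEGER, idS INTEGER, nb INTEGER, tf REAL,
                      PRIMARY KEY (idT, idS));
INSERT INTO Series VALUES (12, 'Lost', 540), (45, 'Dexter', 125);
INSERT INTO Dict   VALUES (8, 'plane', 1.25), (27, 'killer', 2.87), (29, 'crash', 3.07);
INSERT INTO Posting (idT, idS, nb) VALUES (27, 45, 89), (8, 45, 3), (8, 12, 90);
""")

# tf = nb / maxNb, using the maxNb of the series each posting belongs to.
conn.execute("""
    UPDATE Posting
    SET tf = CAST(nb AS REAL) /
             (SELECT maxNb FROM Series WHERE Series.idS = Posting.idS)
""")

# Describe Dexter by its terms, highest tf * idf first.
top_terms = conn.execute("""
    SELECT d.term, ROUND(p.tf * d.idf, 3) AS weight
    FROM Posting p
    JOIN Dict d   ON d.idT = p.idT
    JOIN Series s ON s.idS = p.idS
    WHERE s.name = 'Dexter'
    ORDER BY weight DESC
""").fetchall()

tf_lost_plane = conn.execute(
    "SELECT tf FROM Posting WHERE idT = 8 AND idS = 12").fetchone()[0]
```

Precomputing `tf` and `idf` at indexing time keeps the online queries down to joins and a multiplication.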

Page 21

Description of a TV Series

Lost

Many surnames need to be filtered out

Page 22

Retrieval of TV Series: queries with 1 term

survive ⋈

Importance of normalization
• Stargate Atlantis: nb/maxNb = 63/1116 = 0.05645
• Blade: nb/maxNb = 9/163 = 0.05521
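A single-term query can be read as a join of the dictionary, the postings and the series, ranked by normalized frequency. The sketch below uses Python's `sqlite3` as a stand-in for the deck's Oracle database. The `nb` and `maxNb` values come from the slide, while the identifiers and the function name `search_one_term` are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Series  (idS INTEGER PRIMARY KEY, name TEXT, maxNb INTEGER);
CREATE TABLE Dict    (idT INTEGER PRIMARY KEY, term TEXT);
CREATE TABLE Posting (idT INTEGER, idS INTEGER, nb INTEGER);
-- idS and idT values are made up; nb and maxNb come from the slide.
INSERT INTO Series  VALUES (1, 'Stargate Atlantis', 1116), (2, 'Blade', 163);
INSERT INTO Dict    VALUES (1, 'survive');
INSERT INTO Posting VALUES (1, 1, 63), (1, 2, 9);
""")


def search_one_term(term: str):
    """Rank series by the normalized frequency nb / maxNb of a single query term."""
    return conn.execute("""
        SELECT s.name, CAST(p.nb AS REAL) / s.maxNb AS tf
        FROM Dict d
        JOIN Posting p ON p.idT = d.idT
        JOIN Series s  ON s.idS = p.idS
        WHERE d.term = ?
        ORDER BY tf DESC
    """, (term,)).fetchall()


results = search_one_term("survive")
```

The raw counts differ by a factor of seven (63 against 9), yet the normalized scores are almost equal, which is the point of the slide.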

Page 23

Retrieval of TV Series: queries with n terms

survive mulder ⋈

67 | The Vampire Diaries
term     tf     idf    tf * idf
survive  0.028  0.107  0.003
mulder   0.007  3.977  0.028
                sum    0.031

18 | X-Files
term     tf     idf    tf * idf
survive  0.014  0.107  0.001
mulder   1.000  3.977  3.977
                sum    3.978

⁞
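The scoring shown above, a sum of tf × idf over the query terms, maps naturally onto a `GROUP BY` query. The sketch below uses Python's `sqlite3` in place of the deck's Oracle database. The tf, idf and idS values come from the slide, while the idT values and the function name `search` are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Series  (idS INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Dict    (idT INTEGER PRIMARY KEY, term TEXT, idf REAL);
CREATE TABLE Posting (idT INTEGER, idS INTEGER, tf REAL);
-- idT values are made up; idS, tf and idf values come from the slide.
INSERT INTO Series  VALUES (67, 'The Vampire Diaries'), (18, 'X-Files');
INSERT INTO Dict    VALUES (1, 'survive', 0.107), (2, 'mulder', 3.977);
INSERT INTO Posting VALUES (1, 67, 0.028), (2, 67, 0.007),
                           (1, 18, 0.014), (2, 18, 1.000);
""")


def search(terms):
    """Score each series as the sum of tf * idf over the query terms."""
    placeholders = ", ".join("?" for _ in terms)
    return conn.execute(f"""
        SELECT s.name, SUM(p.tf * d.idf) AS score
        FROM Dict d
        JOIN Posting p ON p.idT = d.idT
        JOIN Series s  ON s.idS = p.idS
        WHERE d.term IN ({placeholders})
        GROUP BY s.idS, s.name
        ORDER BY score DESC
    """, list(terms)).fetchall()


ranking = search(["survive", "mulder"])
```

X-Files ranks first because 'mulder' is both rare across series (high idf) and its most frequent term (tf = 1.000).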

Page 24

Computing Similarities Among TV Series (1/2)

Similar to House?

First, let’s compute the numerator, where:
Ai = Terms from House
Bi = Terms from another TV series
Ai Bi

Page 25

Computing Similarities Among TV Series (2/2)

Similar to House?
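The similarity formula itself is missing from the transcript. A numerator built from paired term weights Ai and Bi is consistent with a vector measure such as cosine similarity, so the Python sketch below assumes cosine. That choice, the function name `cosine`, and all weights are assumptions rather than the deck's confirmed method.

```python
import math


def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse term-weight vectors (term -> weight)."""
    # Numerator: sum of A_i * B_i over the terms both series share.
    numerator = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return numerator / (norm_a * norm_b)


# Hypothetical term weights.
house = {"hospital": 1.0, "doctor": 2.0, "crazy": 2.0}
greys = {"doctor": 2.0, "care": 1.0, "hospital": 2.0}

# Numerator 1*2 + 2*2 = 6; both norms are 3; similarity = 6 / 9.
sim = cosine(house, greys)
```

Unlike the set-based Dice coefficient, this measure takes term weights into account, so a term that dominates both series contributes more than one that is marginal in either.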

Page 26

Thank you

http://www.irit.fr/~Guillaume.Cabanac

