Issuu Talk on Topic Models and Recommendation Systems

Post on 31-Mar-2016


description

Issuu gave a talk at the Data Science and Machine Learning Meetup in Copenhagen, Nov. 2013.

transcript

Topic Models Recommendations

Morten Arngren, Senior Data Scientist

About

Topic Modelling

Recommendations

“…YouTube for Publications…”

Started in 2006 by 5 dudes.

2013:

15M publications (free)

7.5B page views / month

340M pages (25 km²)

83M unique visitors / month

Data Science Team (Copenhagen)

12x 2.6GHz, 96GB RAM, 2TB SSD, 2TB hard drive

Morten Arngren
Ph.D. in Machine Learning and AI (2011), M.Sc.A.M. (2007), B.Sc.E.E. (1997)
ISSUU, Data Scientist (2011 - present)
DTU & FOSS Analytical, Machine Learning in Food Quality (2008-2011)
Nokia Mobile Phones, Digital Signal Processing (2000-2007)
Alcatel Space Denmark, Building Rockets (1997-2000)

Andrius Butkus
Ph.D. in Digital Media Personalisation (2009), M.Sc.E.E. (2004), B.Sc.E.E. (2002)
ISSUU, Data Scientist (2011 - present)
DTU External Lecturer, Human Computer Interaction (2010 - present)
DTU Assistant Professor, Digital Media Engineering (2008-2010)

Amazon Web Services

ML Gadgets

[Figure: pipeline diagram with the labels Page Content, Data, DB and 40k Pubs / Day]

Layout: quantify text and image boxes

Article Extraction, OCR

Image: Cover Analysis, Explicit Detection, Doc. Type Classification

Text: Detect Language (56), Translate to English (from 24 languages), LDA Topics

Reader Activity

[Figure: timeline of reader sessions for the example user “Birdie Nam Nam”, stored in the DB]

200GB / Day

Topic Modelling

LATENT DIRICHLET ALLOCATION

150 topics (preset parameter)

Topic model based on Bag-of-Words Data

http://radimrehurek.com/gensim/
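As a rough illustration of the bag-of-words input that a topic model such as LDA works on, the sketch below turns two toy documents into (word id, count) pairs using only the Python standard library. The toy texts, the `tokenize` helper and the vocabulary layout are assumptions made for illustration; the slides point to gensim for the actual model, which is not reproduced here.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase and keep runs of letters (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(token_lists):
    """Assign each distinct word an integer id, in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def to_bag_of_words(tokens, vocab):
    """Return sorted (word_id, count) pairs; word order is discarded."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())


# Hypothetical toy documents, not data from the talk.
docs = ["Hotels on the islands", "Islands, plants and animals of the islands"]
token_lists = [tokenize(d) for d in docs]
vocab = build_vocabulary(token_lists)
bows = [to_bag_of_words(t, vocab) for t in token_lists]
```

Because only the counts survive, two documents with the same words in a different order get the same representation, which is the simplification a bag-of-words topic model accepts.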

Wikipedia Training Data: ~4.5M single articles (pure topics)

Example topics: arabic, Australia, history, business, islands, environment, hotels, poetic, food, design, arts, plants, animals

[Figure: LDA turns a document into a Topic Distribution over topics 1 to 150]

D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, January 2003.

LATENT DIRICHLET ALLOCATION

Properties: $\theta_k \in [0, 1]$ and $\sum_{k} \theta_k = 1$, where $\theta_k$ denotes the weight of topic $k$.
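The two properties can be checked mechanically. The sketch below is an illustration only: the helper names, the tolerance and the toy scores are assumptions, and the 150 comes from the preset number of topics mentioned in the slides.

```python
import math

NUM_TOPICS = 150  # preset number of topics from the slides


def is_topic_distribution(weights, tol=1e-9):
    """True if every weight lies in [0, 1] and the weights sum to 1."""
    in_range = all(0.0 <= w <= 1.0 for w in weights)
    return in_range and math.isclose(sum(weights), 1.0, abs_tol=tol)


def normalize(raw_scores):
    """Rescale non-negative scores so that they sum to 1."""
    if any(s < 0 for s in raw_scores):
        raise ValueError("scores must be non-negative")
    total = sum(raw_scores)
    if total <= 0:
        raise ValueError("need at least one positive score")
    return [s / total for s in raw_scores]


# Hypothetical toy document: mass on three topics, zero elsewhere.
raw = [0.0] * NUM_TOPICS
raw[3], raw[17], raw[42] = 5.0, 4.0, 1.0
theta = normalize(raw)
```

Here `theta` puts 0.5, 0.4 and 0.1 on three topics and 0 on the rest, so it passes `is_topic_distribution`, while a vector such as `[0.7, 0.7]` does not.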

[Figure: LDA space]

Issuu Publications

TOPIC CATEGORIES

~4.5 Mio. (Wikipedia training articles) vs. ~9 Mio.

Density distribution not the same

Empty locations in LDA space.

Topic categories (learning from Wikipedia dataset): Travel, Cocktails, Chemistry, Botanics, Drinks, Dancing

Example: 0.5 Travel, 0.4 Sports, 0.1 Botanics

Recommendation System

READER ACTIVITY

No Explicit Rating…

Extract Implicit Rating…?

[Figure: reading timeline for the example user “Birdie Nam Nam”]

Session {
  UserName: ‘Birdie-Nam-Nam’
  DocID: xxx-xxxxx
  Pages:
    1: [250, 725, 569, 134, ...]
    2: [1056, 1259, ...]
    3: [1056, 1259, ...]
    4: [102, 356, 208, 438]
    5: [102, 356, 208, 438]
    6: [5250, 3567, 809]
    7: [5250, 3567, 809]
    ...
  TimeStamp: 1378935850
  DocID: yyy-yyyyy
}

Pages: [1,2,3,6,7] ReadTime: 25789 ms. TimeStamp: 1378935850

Browsing or Reading?
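One simple way to separate browsing from reading is to threshold the time spent on each page and keep only the pages that pass. The sketch below assumes a session is available as a mapping from page number to a list of dwell times in milliseconds; the threshold `min_read_ms` and the toy timings are hypothetical and not taken from the talk, which does not state the rule actually used.

```python
def summarize_session(page_timings_ms, min_read_ms=2000):
    """Split a session into pages that look 'read' versus merely 'browsed'.

    page_timings_ms maps page number -> list of dwell times in ms.
    min_read_ms is a hypothetical threshold, not a value from the talk.
    """
    read_pages, read_time = [], 0
    for page in sorted(page_timings_ms):
        dwell = sum(page_timings_ms[page])
        if dwell >= min_read_ms:
            read_pages.append(page)
            read_time += dwell
    return {"Pages": read_pages, "ReadTime": read_time}


# Hypothetical toy session: pages 2 and 4 are only flicked through.
session = {1: [1200, 900, 400], 2: [300, 250], 3: [5000, 1500], 4: [100]}
summary = summarize_session(session)
```

With these toy numbers only pages 1 and 3 count as read, and the accumulated read time can then serve as an implicit rating for the publication.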

[Figure: Readers × Publications matrix of reading activity]

Item2Item Matrix

[Figure: Publications × Publications matrix]

Reader indexed learning
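One possible reading of reader indexed learning is that the item-to-item matrix is filled by walking through the readers one at a time and linking every pair of publications the same reader has read. The sketch below follows that reading with plain co-occurrence counts; the function name, the reader ids and the publication ids are hypothetical, and the talk may well weight the entries differently.

```python
from collections import defaultdict
from itertools import combinations


def build_item2item(reader_histories):
    """Count, for every pair of publications, how many readers read both.

    reader_histories maps reader id -> iterable of publication ids.
    """
    matrix = defaultdict(lambda: defaultdict(int))
    for items in reader_histories.values():
        # Deduplicate so a reader contributes at most once per pair.
        for a, b in combinations(sorted(set(items)), 2):
            matrix[a][b] += 1
            matrix[b][a] += 1
    return matrix


# Hypothetical reading histories, not data from the talk.
histories = {
    "reader_1": ["travel_mag", "cocktail_book", "film_zine"],
    "reader_2": ["travel_mag", "cocktail_book"],
    "reader_3": ["film_zine", "music_mag"],
}
item2item = build_item2item(histories)
```

The matrix is symmetric and sparse, which is why it is stored here as nested dictionaries instead of a dense array.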

Pages: [1,6,7,10,11] ReadTime: 11250 ms. TimeStamp: 1385437850

[Figure: read times along a Time axis in weeks (5685, 2508, 1065, 850, 1150, 9860, 3690), with a decay per week]

Decay function
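The slides do not give the shape of the decay function, so the sketch below assumes an exponential decay per week purely for illustration. The rate `decay_per_week` and the helper name are hypothetical; the two read times and timestamps are the session summaries shown in the slides.

```python
import math

SECONDS_PER_WEEK = 7 * 24 * 3600


def decayed_weight(read_time_ms, read_ts, now_ts, decay_per_week=0.1):
    """Down-weight a read time exponentially with its age in weeks.

    decay_per_week is a hypothetical rate; the exponential form is assumed.
    """
    age_weeks = max(0.0, (now_ts - read_ts) / SECONDS_PER_WEEK)
    return read_time_ms * math.exp(-decay_per_week * age_weeks)


# Two session summaries from the slides, evaluated at the time of the later one.
now = 1385437850
older = decayed_weight(25789, 1378935850, now)
newer = decayed_weight(11250, 1385437850, now)
```

The older session is roughly ten weeks old at that point, so its longer read time is scaled down, while the newer session keeps its full weight.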

RECOMMENDING

Item2Item Matrix

Item Matrix, Weight Mapping Function

Inputs shown: Read History (over Time), Likes, Stacks

RECOMMENDING

Item Matrix, Weight Mapping Function

Item Weights

Weighted Sampling

Max. Rank
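Weighted sampling can be sketched with the standard library: candidates are drawn with probability proportional to their item weight, without replacement. Treating Max. Rank as a cut-off on the candidate pool is an assumption, as are the function name and the toy weights.

```python
import random


def sample_recommendations(item_weights, k, max_rank=None, rng=random):
    """Draw up to k distinct items with probability proportional to weight.

    max_rank (an assumed interpretation) first restricts the pool to the
    top-ranked items; None keeps every item with a positive weight.
    """
    ranked = sorted(item_weights.items(), key=lambda kv: kv[1], reverse=True)
    pool = [(item, w) for item, w in ranked[:max_rank] if w > 0]
    picks = []
    while pool and len(picks) < k:
        items, weights = zip(*pool)
        choice = rng.choices(items, weights=weights, k=1)[0]
        picks.append(choice)
        # Remove the chosen item so it cannot be recommended twice.
        pool = [(i, w) for i, w in pool if i != choice]
    return picks


# Hypothetical item weights, not data from the talk.
weights = {"film_zine": 0.5, "music_mag": 0.3, "food_mag": 0.15, "pool_mag": 0.05}
recs = sample_recommendations(weights, k=2, max_rank=3, rng=random.Random(0))
```

Compared with always returning the top-weighted items, sampling gives lower-ranked publications an occasional chance to be shown, and the cut-off keeps very weak candidates out.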

Tuned Parameters

Master Student Project: Deep Belief Network Model (Lars Maaløe)

[Figure: Bag-of-Words model, Training Data; stacked layers labelled 2000, 500, 20, 2]

Master Student Project: Collaborative Filtering Using Social Media Knowledge (Kasper Johansen)

Morten Arngren, Senior Data Scientist