Topic Models Recommendations
Morten Arngren, Senior Data Scientist
About
Topic Modelling
Recommendations
“…YouTube for Publications…”
Started in 2006 by 5 dudes.
15M publications (free)
7.5B page views / month
340M pages (25 km²)
2013
83M unique visitors / month
Data Science Team (Copenhagen)
12x 2.6GHz
96GB Ram
2TB SSD
2TB HardDrive
Morten Arngren
Ph.D. in Machine Learning and AI (2011)
M.Sc.A.M. (2007)
B.Sc.E.E. (1997)
ISSUU, Data Scientist (2011 - present)
DTU & FOSS Analytical, Machine Learning in Food Quality (2008-2011)
Nokia Mobile Phones, Digital Signal Processing (2000-2007)
Alcatel Space Denmark, Building Rockets (1997-2000)

Andrius Butkus
Ph.D. in Digital Media Personalisation (2009)
M.Sc.E.E. (2004)
B.Sc.E.E. (2002)
ISSUU, Data Scientist (2011 - present)
DTU External Lecturer, Human Computer Interaction (2010 - present)
DTU Assistant Professor, Digital Media Engineering (2008-2010)

Amazon Web Services
ML Gadgets
Data
Layout (quantify text and image boxes)
Article Extraction
OCR
Image
Cover Analysis
Explicit Detection
Doc. Type Classification
Text
Detect Language (56)
Translate to English (from 24 languages)
LDA Topics
Page Content
DB
40k Pubs / Day
Reader Activity
Time
Session
DB
“Birdie Nam Nam”
200GB / Day
Topic Modelling
LATENT DIRICHLET ALLOCATION
150 topics (preset parameter)
Topic model based on Bag-of-Words Data
http://radimrehurek.com/gensim/
Wikipedia Training Data ~4.5M Single Articles
(Pure Topics)
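To make the bag-of-words input concrete, here is a minimal sketch in plain Python. The tokenizer, the toy documents and the function names are illustrative assumptions, not the pipeline used at Issuu. gensim's `Dictionary.doc2bow` produces a comparable list of (token id, count) pairs.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-letters (a deliberately naive tokenizer)."""
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]


def build_vocabulary(docs):
    """Assign an integer id to every distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in tokenize(doc):
            vocab.setdefault(token, len(vocab))
    return vocab


def doc_to_bow(text, vocab):
    """Return sorted (token_id, count) pairs; word order is discarded."""
    counts = Counter(vocab[t] for t in tokenize(text) if t in vocab)
    return sorted(counts.items())


# Toy documents (hypothetical, loosely based on the topic words on the slide).
docs = ["Hotels on the islands", "Plants and animals of the islands"]
vocab = build_vocabulary(docs)
bow = doc_to_bow("The islands, the hotels", vocab)
```

Only the counts survive: "the" appears twice in the query text, and the order of the words plays no role.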
Topics: arabic, Australia, history, business, islands, environment, hotels, poetic, food, design, arts, plants, animals
Topic Distribution (topics 1 to 150)
LDA
D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, January 2003.
LATENT DIRICHLET ALLOCATION
Properties: $\theta_k \in [0, 1] \;\wedge\; \sum_k \theta_k = 1$, where $\theta_k$ is the weight of topic $k$
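The two properties can be checked in a few lines. This is a minimal sketch: the function names and the tolerance are assumptions, and the example vector simply reuses the 0.5 / 0.4 / 0.1 split that appears later in the slides.

```python
import math


def is_topic_distribution(theta, tol=1e-9):
    """True if every weight lies in [0, 1] and the weights sum to 1."""
    in_range = all(0.0 <= w <= 1.0 for w in theta)
    return in_range and math.isclose(sum(theta), 1.0, abs_tol=tol)


def normalise(raw):
    """Rescale raw weights (assumed non-negative) so that they sum to 1."""
    total = sum(raw)
    if total <= 0:
        raise ValueError("need at least one positive weight")
    return [w / total for w in raw]


# 150 topics, only three of them active.
theta = normalise([5.0, 4.0, 1.0] + [0.0] * 147)
valid = is_topic_distribution(theta)
```

Because the weights are constrained this way, every document is a point on a simplex in the 150-dimensional LDA space.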
LDA Space
PC 4
the real
Issuu Publications
TOPIC CATEGORIES
~4.5 Mio.
Density distribution not the same
~9 Mio.
Empty locations in LDA space.
Travel
Cocktails
Chemistry
0.5 Travel 0.4 Sports 0.1
Botanics
Drinks
(Learning from Wikipedia Dataset)
Dancing
Recommendation System
READER ACTIVITY
Extract Implicit Rating….?
No Explicit Rating….
Time
“Birdie Nam Nam”
Session {
  UserName: “Birdie-Nam-Nam”
  DocID: xxx-xxxxx
  Pages:
    1: [250, 725, 569, 134, ...]
    2: [1056, 1259, ...]
    3: [1056, 1259, ...]
    4: [102, 356, 208, 438]
    5: [102, 356, 208, 438]
    6: [5250, 3567, 809]
    7: [5250, 3567, 809]
    ...
  TimeStamp: 1378935850
  DocID: yyy-yyyyy
}
Pages: [1,2,3,6,7]
ReadTime: 25789 ms.
TimeStamp: 1378935850
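One way to read the reduction from the raw session to the summary above is sketched below. It assumes the per-page lists are dwell times in milliseconds and that a page counts as read when at least one dwell reaches a threshold. The 500 ms threshold and the function name are hypothetical, and the values elided with "..." in the session are left out, so the total is not expected to match the 25789 ms on the slide.

```python
# Hypothetical threshold: shorter page views are treated as browsing.
READ_THRESHOLD_MS = 500


def summarise_session(pages, threshold_ms=READ_THRESHOLD_MS):
    """pages: {page_number: [dwell_ms, ...]} -> pages read and total read time."""
    read_pages = sorted(
        page for page, dwells in pages.items()
        if dwells and max(dwells) >= threshold_ms
    )
    read_time = sum(sum(pages[page]) for page in read_pages)
    return {"Pages": read_pages, "ReadTime": read_time}


# Dwell times copied from the session above, without the elided entries.
session_pages = {
    1: [250, 725, 569, 134],
    2: [1056, 1259],
    3: [1056, 1259],
    4: [102, 356, 208, 438],
    5: [102, 356, 208, 438],
    6: [5250, 3567, 809],
    7: [5250, 3567, 809],
}
summary = summarise_session(session_pages)
```

With this threshold, pages 4 and 5 are classified as browsing and dropped, which matches the page list in the summary.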
Browsing or Reading?
Time
Readers
Publications
Item2Item Matrix
Reader indexed learning
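Reader indexed learning can be read as: loop over the readers, and for every pair of publications the same reader has read, strengthen the corresponding Item2Item entry. The sketch below assumes that reading. The slides do not give the update rule, so the product of the two read weights is an assumption, as are all names.

```python
from collections import defaultdict
from itertools import combinations


def build_item2item(reader_histories):
    """reader_histories: {reader: {doc_id: weight}} -> symmetric nested dict."""
    item2item = defaultdict(lambda: defaultdict(float))
    for history in reader_histories.values():
        # Every pair of documents read by the same reader is linked.
        for a, b in combinations(sorted(history), 2):
            score = history[a] * history[b]  # assumed co-read strength
            item2item[a][b] += score
            item2item[b][a] += score
    return item2item


# Hypothetical readers and implicit ratings.
histories = {
    "reader-1": {"doc-A": 1.0, "doc-B": 0.5},
    "reader-2": {"doc-A": 1.0, "doc-B": 1.0, "doc-C": 0.2},
}
matrix = build_item2item(histories)
```

Indexing by reader keeps the work proportional to the length of each reading history, so the full Readers × Publications matrix never has to be materialised.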
Pages: [1,6,7,10,11]
ReadTime: 11250 ms.
TimeStamp: 1385437850
Time
5685, 2508, 1065, 850, 1150, 9860, 3690
in weeks
decay per week = 850
Decay function
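The slides name a decay function measured in weeks but not its form. A minimal sketch, assuming exponential decay: the functional form, the default rate of 0.9 per week and the function name are all assumptions.

```python
SECONDS_PER_WEEK = 7 * 24 * 3600


def decayed_weight(read_time_ms, read_ts, now_ts, decay_per_week=0.9):
    """Scale a read time by decay_per_week ** (age of the read in weeks)."""
    age_weeks = max(0.0, (now_ts - read_ts) / SECONDS_PER_WEEK)
    return read_time_ms * decay_per_week ** age_weeks


# A read from the earlier session, seen from the time of the later one
# (timestamps taken from the slides, decay rate hypothetical).
older_read = decayed_weight(25789, read_ts=1378935850, now_ts=1385437850)

# Exactly two weeks old with a halving per week: a quarter of the weight remains.
two_weeks = decayed_weight(
    25789, 1378935850, 1378935850 + 2 * SECONDS_PER_WEEK, decay_per_week=0.5
)
```

Older reads then contribute less to the Item2Item matrix than recent ones, without ever being dropped entirely.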
RECOMMENDING
Item2Item Matrix
Item Matrix Weight Mapping Function
Time
2508, 1065, 850, 1150
N
Read History
Likes
Stacks
RECOMMENDING
Item Matrix Weight Mapping Function
Item Weights
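A sketch of how item weights could be derived: each publication in the reader's history is passed through the weight mapping function and multiplied with its row of the Item2Item matrix, and the contributions are summed per candidate. The identity mapping, the exclusion of already-read items and all names are assumptions. The slides only name the components.

```python
def item_weights(history, item2item, mapping=lambda w: w):
    """history: {doc_id: weight}; item2item: {doc_id: {doc_id: similarity}}."""
    scores = {}
    for doc, weight in history.items():
        for candidate, similarity in item2item.get(doc, {}).items():
            if candidate in history:
                continue  # assumed: do not recommend what was already read
            contribution = mapping(weight) * similarity
            scores[candidate] = scores.get(candidate, 0.0) + contribution
    return scores


# Hypothetical Item2Item rows and a one-item read history.
item2item = {
    "doc-A": {"doc-B": 1.5, "doc-C": 0.2},
    "doc-B": {"doc-A": 1.5, "doc-C": 0.2},
}
scores = item_weights({"doc-A": 2.0}, item2item)
```

The same function can take likes or stacks as extra history entries, each with its own mapped weight.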
Weighted Sampling
Max. Rank
Tuned Parameters
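Weighted sampling with a maximum rank could look like the sketch below: keep only the highest-weighted candidates, then draw distinct items with probability proportional to their weight. Reading "Max. Rank" as a cut-off on the ranked candidate list is an assumption, as are the function name and parameters.

```python
import random


def weighted_sample(weights, k, max_rank=None, rng=None):
    """Draw up to k distinct items with probability proportional to weight,
    considering only the max_rank highest-weighted candidates."""
    rng = rng or random.Random()
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    pool = [(item, w) for item, w in ranked[:max_rank] if w > 0]
    picked = []
    while pool and len(picked) < k:
        candidate_weights = [w for _, w in pool]
        index = rng.choices(range(len(pool)), weights=candidate_weights, k=1)[0]
        picked.append(pool.pop(index)[0])  # remove so it cannot be drawn again
    return picked


# Hypothetical item weights.
weights = {"doc-B": 3.0, "doc-C": 0.4, "doc-D": 0.1, "doc-E": 0.0}
recs = weighted_sample(weights, k=2, max_rank=3, rng=random.Random(0))
all_recs = weighted_sample(weights, k=10, rng=random.Random(0))
```

Sampling instead of always returning the top items gives a reader different recommendations on each visit while still favouring the strongest candidates. The cut-off and k are natural candidates for the tuned parameters.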
Deep Belief Network Model
Bag-of-Words model
Training Data
2000 - 500 - 20 - 2
Lars Maaløe
Master Student Project

Collaborative Filtering Using Social Media Knowledge
Kasper Johansen
Master Student Project
Morten Arngren
Senior Data Scientist