Kddcup2011

Date post: 15-Jan-2015
Upload: liang-xiang
The Art of Lemon’s solution KDD Cup 2011 Track 2 Siwei Lai/ Rui Diao Liang Xiang
Transcript
Page 1: Kddcup2011

The Art of Lemon’s solution
KDD Cup 2011 Track 2

Siwei Lai / Rui Diao / Liang Xiang

Page 2: Kddcup2011

Outline
- Problem Introduction
- Data Analytics
- Algorithms
  - Main Models
  - Model Ensemble
  - Post Process
- Conclusion
- Future Work

Error rate by stage:
  Content          11.2175%
  Item CF           3.8222%
  BSVD+             3.5362%
  NBSVD+            3.8146%
  Model Ensemble    2.5033%
  Post Process      2.4808%

Page 3: Kddcup2011

Problem Introduction
- Two tracks
  - Track 2
- Classification problem
  - Positive samples: tracks the user voted higher than 80
  - Negative samples: popular tracks the user has not voted on
- Data set
  - User voting data
  - Taxonomy data
- Comments
  - Similar to the top-N recommendation problem
  - Negative samples are used to prevent the "Harry Potter problem"
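A minimal sketch of how such negative samples could be drawn. It assumes "popular" means sampling unrated tracks with probability proportional to the number of users who rated them; the function name `sample_negatives` and the data layout are hypothetical, not taken from the slides.

```python
import random
from collections import Counter

def sample_negatives(user_items, all_ratings, n, rng=None):
    """Draw up to n distinct tracks the user has not rated.

    Assumption: a track's sampling probability is proportional to its
    popularity, i.e. the number of users who rated it.
    all_ratings maps user -> set of rated item ids.
    """
    rng = rng or random.Random(0)
    popularity = Counter(item for items in all_ratings.values() for item in items)
    candidates = [item for item in popularity if item not in user_items]
    weights = [popularity[item] for item in candidates]
    negatives = set()
    while len(negatives) < min(n, len(candidates)):
        negatives.add(rng.choices(candidates, weights=weights, k=1)[0])
    return negatives
```

Sampling by popularity means a model cannot score well merely by ranking globally popular tracks first, which is the point of the "Harry Potter problem" remark.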

Page 4: Kddcup2011

Data Analytics
- User vote data may be ordered by time.
- Anchoring effect: users vote on artists and then vote on their tracks.
- This is the main reason we reached 2nd position.

http://justaguyinagarage.blogspot.com/2011/06/recommendation-system-competitions.html

Page 5: Kddcup2011

Data Analytics
If a user has voted on an artist or album, she is very likely to vote on the tracks of that artist or album.

  45%   58%   Artist ⇒ Artist’s tracks
  75%   75%   Album ⇒ Album’s tracks
  45%   56%   Item ⇒ Items with the same Artist
  51%   52%   Item ⇒ Items with the same Album

Page 6: Kddcup2011

Data Analytics
- User vote data may be ordered by time.
- Anchoring effect: users vote on artists and then vote on their tracks.
- If a user has voted on an artist, she is very likely to vote on the tracks of that artist.

Page 7: Kddcup2011

Algorithm: Main Models
- Content-based Model
- Item-based Collaborative Filtering Model
- Binary Latent Factor Model
- Neighborhood-based Binary SVD Model

Page 8: Kddcup2011

Content-based Model
If a user has voted on an artist or album, she is very likely to vote on the tracks of that artist or album.

Version 1. The user will vote on a track if she has voted on an item of the same artist before. (Error rate ≈ 17%)

  P(u, i) = 1 if user u has voted on tracks with the same artist/album as track i

Version 2. Use the average score of the artist/album. (Error rate ≈ 11%)

  P(u, i) = average score user u assigned to the artist/album of track i, or to tracks with the same artist/album
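A rough sketch of Version 2, assuming artists, albums and tracks share one item-id space and that the taxonomy is available as two lookups. The names `content_score`, `artist_of` and `album_of` are hypothetical, and returning 0.0 when the user has no related votes is an assumption.

```python
def content_score(user_ratings, track, artist_of, album_of):
    """Mean score the user gave to the track's artist/album, or to other
    tracks sharing that artist/album. Returns 0.0 when there is none.

    user_ratings: dict item id -> score given by this user.
    artist_of / album_of: dict track id -> artist id / album id.
    """
    artist, album = artist_of.get(track), album_of.get(track)
    related = []
    for item, score in user_ratings.items():
        if item == track:
            continue
        if item in (artist, album):
            related.append(score)  # vote on the artist or album itself
        elif artist is not None and artist_of.get(item) == artist:
            related.append(score)  # vote on a track by the same artist
        elif album is not None and album_of.get(item) == album:
            related.append(score)  # vote on a track from the same album
    return sum(related) / len(related) if related else 0.0
```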

Page 9: Kddcup2011

Item-based Collaborative Filtering: Jaccard Index

$$ w_{ij} = \frac{\sum_{u \in N(i) \cap N(j)} 1}{|N(i) \cup N(j)|} $$

Error rate ≈ 9%
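As a small illustration, the Jaccard index is the number of users who voted on both items divided by the number who voted on either. The sketch assumes N(i) is available as a Python set of user ids; the function name is hypothetical.

```python
def jaccard(raters_i, raters_j):
    """Jaccard index between two items, given the sets of users who rated each."""
    union = raters_i | raters_j
    return len(raters_i & raters_j) / len(union) if union else 0.0
```

For example, `jaccard({1, 2, 3}, {2, 3, 4})` shares two users out of four in total.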

Page 10: Kddcup2011

Item-based Collaborative Filtering: Our Similarity

$$ w_{ij} = \frac{\sum_{u \in N(i) \cap N(j)} f(u, i, j)}{|N(i) \cup N(j)|} $$

Page 11: Kddcup2011

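Item-based Collaborative Filtering Model + Temporal information

$$ f(u, i, j) = \frac{1}{|d_{ui} - d_{uj}|^{\gamma_3}} $$

Here d_{ui} appears to be the position of item i in user u's rating list.

Example: items 232699 and 3109 in three users' rating lists (first line: user id | number of ratings; then item id and score)

141|8573
...
251480 0
232699 50
132238 50
...
67376 50
3109 0
96153 30
...
(1405 items)

862|1455
...
232699 90
238869 90
271685 90
...
252580 90
3109 90
49451 90
...
(9 items)

2033|5396
...
81180 64
3109 54
26594 52
...
8830 26
232699 59
53396 57
...
(20 items)

...

$$ w_{ij} = \frac{\sum_{u \in N(i) \cap N(j)} f(u, i, j)}{|N(i) \cup N(j)|} $$

Contributions of the three users: $1/1406^{1.1}$, $1/9^{1.1}$, $1/20^{1.1}$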
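The temporally weighted similarity can be sketched as follows. This assumes d_ui is the index of item i in user u's rating list and uses γ3 = 1.1 as in the worked values; `temporal_weight`, `weighted_similarity` and the `histories` layout are hypothetical names.

```python
def temporal_weight(pos_i, pos_j, gamma3=1.1):
    """f(u, i, j) = 1 / |d_ui - d_uj| ** gamma3, for two distinct positions."""
    return 1.0 / abs(pos_i - pos_j) ** gamma3

def weighted_similarity(item_i, item_j, histories, f=temporal_weight):
    """w_ij = sum of f over users who rated both items, divided by the
    number of users who rated either item.

    histories: dict user -> list of item ids in file order.
    item_i and item_j must be different items.
    """
    raters_i = {u for u, h in histories.items() if item_i in h}
    raters_j = {u for u, h in histories.items() if item_j in h}
    union = raters_i | raters_j
    if not union:
        return 0.0
    total = 0.0
    for u in raters_i & raters_j:
        h = histories[u]
        total += f(h.index(item_i), h.index(item_j))
    return total / len(union)
```

Items rated close together in a user's list contribute almost a full co-occurrence, while items rated far apart contribute very little.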

Page 12: Kddcup2011

Item-based Collaborative Filtering + Vote information

$$ f(u, i, j) = \frac{1}{|d_{ui} - d_{uj}|^{\gamma_3}} \times (\text{vote term}) $$

In the example, the vote term is $1 / (|r_{ui} - r_{uj}| + 1)^{0.2}$, where r_{ui} is the score user u gave item i.

Example: items 232699 and 3109 again

141|8573     232699 scored 50, 3109 scored 0      $1/1406^{1.1} / 51^{0.2}$
862|1455     232699 scored 90, 3109 scored 90     $1/9^{1.1} / 1^{0.2}$
2033|5396    232699 scored 59, 3109 scored 54     $1/20^{1.1} / 6^{0.2}$
...
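The three example values on this slide can be checked numerically. The vote term used here, 1/(|r_ui − r_uj| + 1)^0.2, is inferred from the example numbers (51 = |50 − 0| + 1, 1 = |90 − 90| + 1, 6 = |59 − 54| + 1) rather than stated in the slides, and `pair_weight` is a hypothetical name.

```python
def pair_weight(pos_gap, score_i, score_j, gamma3=1.1, vote_exp=0.2):
    """Temporal term times vote term for one user and one item pair.

    Assumption: the vote term is 1 / (|r_ui - r_uj| + 1) ** vote_exp.
    """
    return 1.0 / (pos_gap ** gamma3 * (abs(score_i - score_j) + 1) ** vote_exp)

# user id -> (position gap, score of item 232699, score of item 3109)
examples = {141: (1406, 50, 0), 862: (9, 90, 90), 2033: (20, 59, 54)}
weights = {user: pair_weight(*args) for user, args in examples.items()}
```

User 862, who rated the two items close together and with identical scores, contributes the most to the similarity of this item pair.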

Page 13: Kddcup2011

Item-based Collaborative Filtering: Prediction

$$ \hat{r}_{ui} = \sum_{j \in S(N(u), k)} w_{ij} (1 + r_{uj})^{\gamma_1} $$

Page 14: Kddcup2011

Item-based Collaborative Filtering + Removing popular bias

$$ \hat{r}_{ui} = \frac{1}{\sqrt{n_i}} \sum_{j \in S(N(u), k)} w_{ij} (1 + r_{uj})^{\gamma_1} $$
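A minimal sketch of the prediction with the popularity correction. It assumes S(N(u), k) denotes the k items rated by u that are most similar to i, and that n_i is the number of users who rated i; the default values of `k` and `gamma1` are placeholders, not the authors' settings.

```python
import math

def predict(user_scores, sims_to_i, n_i, k=20, gamma1=0.5):
    """r_hat_ui = (1 / sqrt(n_i)) * sum over the k nearest rated items j
    of w_ij * (1 + r_uj) ** gamma1.

    user_scores: dict item j -> r_uj for this user.
    sims_to_i: dict item j -> w_ij for the target item i.
    """
    neighbours = sorted(
        (j for j in user_scores if sims_to_i.get(j, 0.0) > 0.0),
        key=lambda j: sims_to_i[j],
        reverse=True,
    )[:k]
    total = sum(sims_to_i[j] * (1 + user_scores[j]) ** gamma1 for j in neighbours)
    return total / math.sqrt(n_i)
```

Dividing by the square root of n_i lowers the scores of very popular items, which would otherwise accumulate similarity mass simply because many users rated them.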

Page 15: Kddcup2011

Item-based Collaborative Filtering

Factors                                Error Rate (%)
initial model (Jaccard Index + KNN)    8.9992
+ removing popular bias                5.2953
+ using temporal information           3.9283
+ using vote information               3.8222
+ using taxonomy information           3.6578

Page 16: Kddcup2011

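Binary Latent Factor Model

Sampling
- Positive samples: items in the train data.
- Negative samples: drawn in nearly the same way as the test data were sampled.
- Each user has the same number of positive and negative samples.

Prediction:

$$ \hat{r}_{ui} = p_u^T q_i $$

$$ \min \sum_{(u,i) \in \mathcal{K}^+ \cup \mathcal{K}^-} (r_{ui} - p_u^T q_i)^2 + \lambda (\|p_u\|^2 + \|q_i\|^2) $$

$$ r_{ui} = 1 \text{ for } (u, i) \in \mathcal{K}^+, \qquad r_{ui} = 0 \text{ for } (u, i) \in \mathcal{K}^- $$

Error rate ≈ 6%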
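One way to fit such a model is plain stochastic gradient descent on the regularized squared error. The slides do not say which optimizer was used, so this is a sketch under that assumption; the function names and all hyperparameter values are illustrative.

```python
import random

def train_blfm(samples, n_factors=8, lr=0.05, lam=0.01, epochs=200, seed=0):
    """Fit p_u and q_i by SGD on (r_ui - p_u . q_i)^2 + lam * (|p_u|^2 + |q_i|^2).

    samples: list of (user, item, label), label 1 for positive pairs
    and 0 for sampled negative pairs.
    """
    rng = random.Random(seed)
    p, q = {}, {}
    for u, i, _ in samples:
        p.setdefault(u, [rng.gauss(0, 0.1) for _ in range(n_factors)])
        q.setdefault(i, [rng.gauss(0, 0.1) for _ in range(n_factors)])
    samples = list(samples)
    for _ in range(epochs):
        rng.shuffle(samples)
        for u, i, r in samples:
            pu, qi = p[u], q[i]
            err = r - sum(a * b for a, b in zip(pu, qi))
            for f in range(n_factors):
                puf, qif = pu[f], qi[f]
                pu[f] += lr * (err * qif - lam * puf)
                qi[f] += lr * (err * puf - lam * qif)
    return p, q

def score(p, q, u, i):
    """Predicted r_hat_ui = p_u . q_i."""
    return sum(a * b for a, b in zip(p[u], q[i]))
```

Because the labels are only 0 and 1, the learned score acts as a ranking signal for "would vote highly" rather than as a predicted rating.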

Page 17: Kddcup2011

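Binary Latent Factor Model+

Prediction:

$$ \hat{r}_{ui} = b_u + b_i + b_{a(i)} + b_{b(i)} + b_{u,\,I(a(i) \in N(u))} + b_{u,\,I(b(i) \in N(u))} + p_u^T (q_i + x_{a(i)} + y_{b(i)}) $$

$$ \min \sum_{(u,i) \in \mathcal{K}^+ \cup \mathcal{K}^-} (r_{ui} - p_u^T q_i)^2 + \lambda (\|p_u\|^2 + \|q_i\|^2) $$

Error rate ≈ 3.5%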
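The prediction rule can be written out as a short function. This sketch assumes a(i) and b(i) are the artist and album of track i, and that the two indicator-indexed biases are per-user values keyed by whether the user has rated that artist or album; the `model` dictionary layout and all names are hypothetical.

```python
def bsvd_plus_score(u, i, model, artist_of, album_of, rated_by):
    """Biases plus p_u . (q_i + x_a(i) + y_b(i)).

    model: dict of parameter tables (see the keys used below).
    rated_by: dict user -> set of item ids the user has rated, i.e. N(u).
    """
    a, b = artist_of.get(i), album_of.get(i)
    rated = rated_by.get(u, set())
    s = model["b_user"].get(u, 0.0) + model["b_item"].get(i, 0.0)
    s += model["b_artist"].get(a, 0.0) + model["b_album"].get(b, 0.0)
    # Per-user biases selected by the indicators I(a(i) in N(u)), I(b(i) in N(u)).
    s += model["b_user_artist"].get((u, a in rated), 0.0)
    s += model["b_user_album"].get((u, b in rated), 0.0)
    p_u = model["p"][u]
    zero = [0.0] * len(p_u)
    item_vec = [
        qi + xa + yb
        for qi, xa, yb in zip(
            model["q"].get(i, zero), model["x"].get(a, zero), model["y"].get(b, zero)
        )
    ]
    return s + sum(pf * vf for pf, vf in zip(p_u, item_vec))
```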

Page 18: Kddcup2011

Neighborhood-based Binary SVD Model

Prediction:

$$ \hat{r}_{ui} = b_u + b_i + b_{a(i)} + b_{b(i)} + b_{u,\,I(a(i) \in N(u))} + b_{u,\,I(b(i) \in N(u))} + \frac{1}{\sqrt{|N(u)|}} \, q_i^T \Big( \sum_{j \in N(u)} y_j \Big) $$
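The difference from the previous model is the last term, where the user is represented by the normalized sum of factor vectors y_j of the items she rated rather than by a free vector p_u. A minimal sketch of that term, with hypothetical names and plain Python lists as vectors:

```python
import math

def neighbourhood_term(q_i, rated_items, y):
    """(1 / sqrt(|N(u)|)) * q_i . (sum of y_j over j in N(u)).

    q_i: factor vector of the target item.
    rated_items: N(u), the items rated by the user.
    y: dict item id -> implicit factor vector y_j.
    """
    if not rated_items:
        return 0.0
    k = len(q_i)
    summed = [0.0] * k
    for j in rated_items:
        for f, value in enumerate(y.get(j, [0.0] * k)):
            summed[f] += value
    return sum(a * b for a, b in zip(q_i, summed)) / math.sqrt(len(rated_items))
```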

Page 19: Kddcup2011

Features used

Features                   Content   Item CF   BSVD+   NBSVD+
Collaborative filtering    ×         √         √       √
Neighborhood info          ×         √         ×       √
Ratings                    √         √         ○       ○
Time ordering              ×         √         ×       ×
Artist/album               √         ○         √       √
Genre structure            ×         ×         ×       ×

Page 20: Kddcup2011

Model Ensemble
- Local test set
- Linear combination
- Simulated annealing
- 8-fold cross validation

Model      Error Rate (%)   Weight
Content    11.2175          0.002
Item CF     3.8222          0.438
BSVD+       3.5362          0.006
NBSVD+      3.8146          0.025

[Figure: data split, with labels Train Set, Local Train Set, Local Test Set, Test Set]
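A rough sketch of searching for linear-combination weights by simulated annealing on a local test set. The slides do not give the move proposal, cooling schedule or error definition, so all of those are assumptions here: the error rate below predicts the top-scoring half of each user's candidates as positive, and every name and default value is hypothetical.

```python
import math
import random

def error_rate(weights, scores, labels):
    """Fraction of misclassified candidates.

    scores[u][c]: list of per-model scores for candidate c of user u.
    labels[u][c]: 1 or 0.
    Assumption: the top half of each user's candidates, ranked by the
    blended score, is predicted positive.
    """
    wrong = total = 0
    for u in scores:
        blended = [sum(w * s for w, s in zip(weights, cand)) for cand in scores[u]]
        order = sorted(range(len(blended)), key=blended.__getitem__, reverse=True)
        predicted_pos = set(order[: len(order) // 2])
        for c, label in enumerate(labels[u]):
            wrong += (c in predicted_pos) != bool(label)
            total += 1
    return wrong / total

def anneal_weights(scores, labels, n_models, steps=2000, temp=0.05,
                   cooling=0.998, seed=0):
    """Perturb one weight at a time; always accept improvements and accept
    worse moves with probability exp(-increase / temperature)."""
    rng = random.Random(seed)
    current = [1.0 / n_models] * n_models
    current_err = error_rate(current, scores, labels)
    best, best_err = list(current), current_err
    for _ in range(steps):
        cand = list(current)
        m = rng.randrange(n_models)
        cand[m] = max(0.0, cand[m] + rng.gauss(0, 0.1))
        err = error_rate(cand, scores, labels)
        if err <= current_err or rng.random() < math.exp((current_err - err) / temp):
            current, current_err = cand, err
            if err < best_err:
                best, best_err = list(cand), err
        temp *= cooling
    return best, best_err
```

A gradient-free search suits this setting because the error rate is piecewise constant in the weights. Running it once per fold and averaging the weights is one way the 8-fold cross validation could be combined with it.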

Page 21: Kddcup2011

Post Process
- Some special features cannot be modeled well.
- Find special user-item pairs:
  - the most popular items;
  - the user voted high on the track’s album but low on its artist;
  - …
- Multiply the predicted scores of those pairs by a factor.

Page 22: Kddcup2011

Algorithms

Stage             Error rate   Ensemble weight
Content           11.2175%     0.002
Item CF            3.8222%     0.483
BSVD+              3.5362%     0.006
NBSVD+             3.8146%     0.025
Model Ensemble     2.5033%
Post Process       2.4808%

Page 23: Kddcup2011

Model Similarities

Page 24: Kddcup2011

Conclusion
- Data analysis is very important.
  - User behavior data is ordered by time.
  - Artist/album data can improve accuracy a lot.
- The number of team members and the number of models are very important.
- Useful algorithms:
  - Content-based
  - Neighborhood-based
  - Matrix Factorization

Page 25: Kddcup2011

Future Work
- How can temporal information be added to the Binary SVD Model?
- Apply Binary SVD in real production:
  - how to generate explanations;
  - how to make real-time online recommendations.

Page 26: Kddcup2011

Q&A
