
Chapter -11 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber

Date post: 04-Jul-2015
Page 1: Chapter -11 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber

04/18/13 Data Mining: Principles and Algorithms

Data Mining: Concepts and Techniques

— Chapter 11 —

Additional Theme: Collaborative Filtering & Data Mining

Jiawei Han and Micheline Kamber
Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj

©2006 Jiawei Han and Micheline Kamber. All rights reserved


Outline

- Motivation
- Systems in Action
- A Conceptual Framework
- User-User Methods
- Item-Item Methods
- Recent Advances and Open Problems


Motivation

User Perspective
- Lots of online products, books, movies, etc.
- Reduce my choices… please…

Manager Perspective
- "If I have 3 million customers on the web, I should have 3 million stores on the web."
  CEO of Amazon.com [SCH01]


Example: Recommendation

Customers who bought this book also bought:

• Data Preparation for Data Mining: by Dorian Pyle (Author)
• The Elements of Statistical Learning: by T. Hastie, et al.
• Data Mining: Introductory and Advanced Topics: by Margaret H. Dunham
• Mining the Web: Analysis of Hypertext and Semi Structured Data


Example: Personalization


Other Examples

- Movielens: movies
- Moviecritic: movies again
- My launch: music
- Gustos starrater: web pages
- Jester: jokes
- TV Recommender: TV shows
- Suggest 1.0: different products
- And much more…


How Does It Work?

- Each user has a profile
- Users rate items
  - Explicitly: score from 1..5
  - Implicitly: web usage mining
    - Time spent in viewing the item
    - Navigation path
    - Etc.
- The system does the rest. How? This is what we will show today.


Basic Approaches

Collaborative Filtering (CF)
- Look at users' collective behavior
- Look at the active user's history
- Combine!

Content-based Filtering
- Recommend items based on keywords
- More appropriate for information retrieval


Collaborative Filtering: A Framework

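[Figure: user-item rating matrix. Rows are the users U = {u1, u2, ..., ui, ..., um}; columns are the items I = {i1, i2, ..., ij, ..., in}. A few cells hold known ratings (e.g. 3, 1.5, 2, 1); the cell rij is unknown.]

Unknown function f: U × I → R

The task:
- Q1: Find the unknown ratings
- Q2: Which items should we recommend to this user?
- ...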


Collaborative Filtering Road Map

User-User Methods
- Identify like-minded users
- Memory-based: K-NN
- Model-based: Clustering

Item-Item Methods
- Identify buying patterns
- Correlation Analysis
- Linear Regression
- Belief Network
- Association Rule Mining


User-User Similarity: Intuition

[Figure label: Target Customer]

Q1: How to measure similarity?

Q2: How to select neighbors?

Q3: How to combine?


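How to Measure Similarity?

- Pearson correlation coefficient

$$w_p(a,i) = \frac{\sum_{j \in \text{Commonly Rated Items}} (r_{aj} - \bar{r}_a)(r_{ij} - \bar{r}_i)}{\sqrt{\sum_{j \in \text{Commonly Rated Items}} (r_{aj} - \bar{r}_a)^2 \, \sum_{j \in \text{Commonly Rated Items}} (r_{ij} - \bar{r}_i)^2}}$$

- Cosine measure: users are vectors in product-dimension space

$$w_c(a,i) = \frac{\vec{r}_a \cdot \vec{r}_i}{\lVert \vec{r}_a \rVert_2 \, \lVert \vec{r}_i \rVert_2}$$

[Figure: users ua and ui drawn as rating vectors over the items i1 ... in]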
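The two similarity measures on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the dict-based rating profiles, the function names `pearson` and `cosine`, and the sample users are assumptions made for the example. User means are taken over the commonly rated items only, which is one of several conventions in use.

```python
from math import sqrt

def pearson(ratings_a, ratings_i):
    """Pearson correlation over the items both users rated.

    Each argument is a dict mapping item -> rating. Means are computed
    over the commonly rated items (an assumption; some variants use each
    user's overall mean instead).
    """
    common = sorted(set(ratings_a) & set(ratings_i))
    if len(common) < 2:
        return 0.0  # not enough overlap to correlate
    mean_a = sum(ratings_a[j] for j in common) / len(common)
    mean_i = sum(ratings_i[j] for j in common) / len(common)
    num = sum((ratings_a[j] - mean_a) * (ratings_i[j] - mean_i) for j in common)
    den = sqrt(sum((ratings_a[j] - mean_a) ** 2 for j in common)
               * sum((ratings_i[j] - mean_i) ** 2 for j in common))
    return num / den if den else 0.0

def cosine(ratings_a, ratings_i):
    """Cosine between two rating vectors; an unrated item counts as 0."""
    dot = sum(r * ratings_i.get(j, 0.0) for j, r in ratings_a.items())
    norm_a = sqrt(sum(r * r for r in ratings_a.values()))
    norm_i = sqrt(sum(r * r for r in ratings_i.values()))
    return dot / (norm_a * norm_i) if norm_a and norm_i else 0.0

# Hypothetical rating profiles (item -> score on a 1..5 scale)
alice = {"i1": 5, "i2": 3, "i3": 4}
bob = {"i1": 4, "i2": 2, "i3": 3, "i4": 5}
w_p = pearson(alice, bob)
w_c = cosine(alice, bob)
```

Here Bob rates every common item exactly one point lower than Alice, so the Pearson weight is 1 even though the raw vectors differ; the cosine measure, which does not subtract the means, is lower.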


Nearest Neighbor Approaches [SAR00a]

- Offline phase: do nothing; just store transactions
- Online phase: identify users highly similar to the active one
  - Best K ones
  - All with a measure greater than a threshold
- Prediction:

$$r_{aj} = \bar{r}_a + \frac{\sum_i w(a,i)\,(r_{ij} - \bar{r}_i)}{\sum_i w(a,i)}$$

Here $\bar{r}_a$ is user a's neutral rating, $(r_{ij} - \bar{r}_i)$ is user i's deviation, and the fraction is user a's estimated deviation.
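The prediction rule on this slide (the active user's mean plus a similarity-weighted average of the neighbors' deviations) might be coded as follows. This is a sketch under stated assumptions: `predict_rating` and its tuple layout are invented for the example, the weights are assumed to be precomputed, and the denominator uses absolute values as a common safeguard against negative weights, which matches the slide's formula whenever all weights are positive.

```python
def predict_rating(mean_a, neighbors, k=None):
    """Predict the active user's rating for one item.

    mean_a    -- the active user's mean ("neutral") rating
    neighbors -- list of (weight, rating_on_item, neighbor_mean) tuples
    k         -- keep only the k most similar neighbors (None keeps all)
    """
    ranked = sorted(neighbors, key=lambda n: n[0], reverse=True)
    if k is not None:
        ranked = ranked[:k]
    # Assumption: normalize by |w| so negative weights cannot cancel out
    norm = sum(abs(w) for w, _, _ in ranked)
    if norm == 0:
        return mean_a  # no usable neighbors: fall back to the user's mean
    deviation = sum(w * (r - m) for w, r, m in ranked) / norm
    return mean_a + deviation

# Hypothetical neighbors: (similarity to a, rating on item j, own mean)
neighbors = [(0.9, 5, 4.0), (0.5, 2, 3.0), (0.1, 4, 4.0)]
prediction = predict_rating(3.0, neighbors, k=2)
```

With k=2 the weakest neighbor is dropped, and the estimated deviation is (0.9·1 + 0.5·(−1)) / 1.4, which is then added to the active user's mean of 3.0.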


Horting Method [AGG99]

- K-NN is not transitive
- Horting takes advantage of transitivity
- Uses a new similarity measure: predictability
- User i predicts user a if
  - They have rated sufficiently many common items
  - There is an error-bounded linear transformation from user i's ratings to user a's


How Does Horting Work?

- Offline phase: build the neighborhood graph
- Online phase: compute raj
  1. Identify users who predict ua
  2. Identify users who rated j
  3. Find shortest paths from group 1 to group 2
  4. Backward propagation and averaging

[Figure: neighborhood graph around user ua]

- Better for sparse environments
- Not well evaluated


Clustering [BRE98]

- Offline phase: build clusters (k-means, k-medoids, etc.)
- Online phase: identify the cluster nearest to the active user
- Prediction:
  - Use the center of the cluster (faster)
  - Weighted average between cluster members, with weights that depend on the active user (slower but a little more accurate)
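The online phase can be illustrated with a short sketch that covers only the faster of the two prediction options: read the rating off the nearest cluster center. The centroids are assumed to come from an offline clustering step that is not shown, and the names `nearest_cluster`, `predict_from_center`, and the sample data are hypothetical.

```python
from math import sqrt

def nearest_cluster(user, centroids):
    """Index of the centroid closest to the user.

    Distance is Euclidean over the items the user has rated; a centroid
    with no value for an item contributes 0 there (an assumption).
    """
    def dist(center):
        return sqrt(sum((r - center.get(j, 0.0)) ** 2 for j, r in user.items()))
    return min(range(len(centroids)), key=lambda idx: dist(centroids[idx]))

def predict_from_center(user, centroids, item):
    """Predicted rating = the nearest cluster center's value for the item."""
    return centroids[nearest_cluster(user, centroids)].get(item)

# Hypothetical centroids produced offline (item -> mean rating in cluster)
centroids = [
    {"i1": 4.5, "i2": 1.5, "i3": 4.0},
    {"i1": 1.0, "i2": 4.5, "i3": 2.0},
]
active_user = {"i1": 5, "i2": 1}
prediction = predict_from_center(active_user, centroids, "i3")
```

The online cost is one distance computation per cluster rather than one per user, which is where the scalability advantage over K-NN comes from.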


Clustering vs. k-NN Approaches

- K-NN using the Pearson measure is slower but more accurate
- Clustering is more scalable

[Figure labels: Active user; Bad recommendations]

- We can use soft clustering, but we will lose the computational edge


Did We Answer the Questions?

[Figure label: Target Customer]

Q1: How to measure similarity?

Q2: How to select neighbors?

Q3: How to combine?


Are We Done?

Q1: How to measure similarity?

$$w_p(a,i) = \sum_{j \in \text{Commonly Rated Items}} \cdots$$

What about sparsity?
- Not enough common items implies spurious neighbors and hence bad recommendations
- Sparsity results from the poor representation!
  - U1 rates recycled letter pads High
  - U2 rates recycled memo pads High
  - Both of them like recycled office products
  - They are similar, but the math won't work for that (example from [SAR00P])
- By working at the right level of abstraction we can eliminate sparsity

Done... Really??


The Power of Representation [UNG98]

[Figure labels: Action, Foreign, Classic]

Q1-B: How can we formalize this intuition?


How to Abstract?

Semi-manual Methods
- Use product features
- Cluster products first, then cluster users
- Works only if we have descriptive features

Automatic Methods
- Adjusted Product Taxonomy
- Latent Semantic Indexing


Adjusted Product Taxonomy [CHO04]

• Input: product taxonomy
• Output: modified taxonomy with even distribution


Adjusted Product Taxonomy (2)

[Figure: number of transactions having each category, shown in two panels: using the original taxonomy and using the adjusted taxonomy]


Latent Semantic Indexing [SAR00b]

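$$R_{m \times n} = U_{m \times r} \; S_{r \times r} \; I'_{r \times n}$$

Keeping only the k largest singular values gives the factors $U_k$ (m × k), $S_k$ (k × k), and $I_k'$ (k × n).

The reconstructed matrix $R_k = U_k \cdot S_k \cdot I_k'$ is the closest rank-k matrix to the original matrix R.

• Captures latent associations
• Reduced space is less noisy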
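The rank-k reconstruction can be tried out with NumPy's SVD routine. This is only a sketch: it assumes the rating matrix has no missing entries (in practice the unknown ratings would need to be imputed first, a step not shown here), and the matrix values and the helper name `rank_k_approximation` are made up for the illustration.

```python
import numpy as np

def rank_k_approximation(R, k):
    """Return R_k = U_k . S_k . I_k' from the truncated SVD of R."""
    U, s, It = np.linalg.svd(R, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ It[:k, :]

# Hypothetical dense user-item matrix (rows: users, columns: items)
R = np.array([[5.0, 4.0, 1.0, 1.0],
              [4.0, 5.0, 1.0, 2.0],
              [1.0, 1.0, 5.0, 4.0],
              [2.0, 1.0, 4.0, 5.0]])

R2 = rank_k_approximation(R, 2)

# The Frobenius error equals the energy in the discarded singular values
singular_values = np.linalg.svd(R, compute_uv=False)
error = np.linalg.norm(R - R2)
expected_error = np.sqrt(np.sum(singular_values[2:] ** 2))
```

The two retained dimensions roughly separate the users who prefer the first two items from those who prefer the last two, which is the kind of latent association the slide refers to.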


Are We Done? (2)

Q2: How to select neighbors?
- We don't expect to use the same neighbors for all products
- Neighbors should be product-category specific

Not adequately answered

Q2-B: How can we determine whether or not a user is relevant to a given product?


Selecting Relevant Instances [YU01]

[Figure: ratings example; the target rating is labeled "Predict this"]

- Superman and Batman are correlated
- Titanic and Batman are negatively correlated
- "Dances with Wolves" has nothing to do with Batman's rating
- Karen is not a good instance to consider

How can we formalize this? Mutual information:

MI(X;Y) = H(X) - H(X|Y)
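To make the definition MI(X;Y) = H(X) - H(X|Y) concrete, here is a small sketch that estimates it from paired samples. The binary like/dislike vectors are invented to mirror the movie example; they are not data from [YU01], and the function names are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())

def mutual_information(xs, ys):
    """MI(X;Y) = H(X) - H(X|Y), estimated from paired samples."""
    n = len(xs)
    h_x_given_y = 0.0
    for y, count in Counter(ys).items():
        xs_for_y = [x for x, yy in zip(xs, ys) if yy == y]
        h_x_given_y += count / n * entropy(xs_for_y)
    return entropy(xs) - h_x_given_y

# Hypothetical like (1) / dislike (0) votes from the same four users
batman = [1, 1, 0, 0]
superman = [1, 1, 0, 0]   # agrees with Batman
titanic = [0, 0, 1, 1]    # always the opposite of Batman
dances = [1, 0, 1, 0]     # unrelated to Batman

mi_superman = mutual_information(batman, superman)
mi_titanic = mutual_information(batman, titanic)
mi_dances = mutual_information(batman, dances)
```

Note that the negatively correlated item is just as informative as the positively correlated one (both give 1 bit here), while the unrelated item gives 0 bits; this is why mutual information, rather than raw agreement, is used to judge relevance.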


Selecting Relevant Instances (2)

- Offline phase:
  - Estimate mutual information between items
  - For each item:
    - Find users who rated it
    - Compute their strength (how many relevant items they also rated)
    - Retain a subset of them (10% works fine)
- Online phase:
  - To predict the target item's rating, run k-NN on its reduced instance space

Better results with less data… quality, not quantity, is what matters


Are We Done? (3)

Q3: How to combine?
- Weighted average
- Discover association rules in neighbors' transactions [LEE01, WAN04]
  - For every x in this group: like(x, Item1) ^ like(x, Item2) ⇒ like(x, Item3)
  - Use confidence and support to judge the quality of the prediction
  - Prediction is done on the binary level (like, dislike)
  - Costly to run online


User-User Methods Evaluation

- Achieve good quality in practice
- The more processing we push offline, the better the method scales
- However:
  - User preference is dynamic
    - High update frequency of offline-calculated information
  - No recommendation for new users
    - We don't know much about them yet


Collaborative Filtering Road Map

User-User Methods
- Identify like-minded users
- Memory-based: K-NN
- Model-based: Clustering

Item-Item Methods
- Identify buying patterns
- Correlation Analysis
- Linear Regression
- Belief Network
- Association Rule Mining


Item-Item Similarity: The Intuition

- Search for similarities among items
- All computations can be done offline
- Item-item similarity is more stable than user-user similarity
  - No need for frequent updates

First Order Models
- Correlation Analysis
- Linear Regression

Higher Order Models
- Belief Network
- Association Rule Mining


Correlation-based Methods [SAR01]

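- Same as in user-user similarity, but on item vectors
- Pearson correlation coefficient
  - Look for users who rated both items

[Figure: rating matrix with users u1 ... um as rows and items i1 ... in as columns; the columns for items ii and ij are compared]

$$s_{ij} = \frac{\sum_{u \in \text{Users Rated Both Items}} (r_{ui} - \bar{r}_i)(r_{uj} - \bar{r}_j)}{\sqrt{\sum_{u \in \text{Users Rated Both Items}} (r_{ui} - \bar{r}_i)^2 \, \sum_{u \in \text{Users Rated Both Items}} (r_{uj} - \bar{r}_j)^2}}$$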


Correlation-based Methods (2)

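- Offline phase:
  - Calculate n(n-1) similarity measures
  - For each item, determine its k most similar items
- Online phase:
  - Predict the rating for a given user-item pair as a weighted sum over the similar items that the user rated

[Figure: row of user ua's ratings (2, 3, 4) with the rating for item j marked "?"]

$$r_{aj} = \frac{\sum_{i \in \text{similar items}} s_{ij} \, r_{ai}}{\sum_{i \in \text{similar items}} s_{ij}}$$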
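The online step of the correlation-based method is a similarity-weighted average of the ratings the user gave to items similar to the target. The sketch below assumes the item similarities were already computed offline; the function name `predict_item_based`, the similarity values, and the ratings are illustrative only.

```python
def predict_item_based(user_ratings, similar_items):
    """Weighted sum over the similar items that the user has rated.

    user_ratings  -- dict item -> the active user's rating
    similar_items -- dict item i -> s_ij for the k items most similar
                     to the target item j (precomputed offline)
    """
    rated = [(s, user_ratings[i]) for i, s in similar_items.items()
             if i in user_ratings]
    norm = sum(s for s, _ in rated)
    if norm == 0:
        return None  # the user rated none of the similar items
    return sum(s * r for s, r in rated) / norm

# Hypothetical data: the user's ratings and the neighbors of target item j
user_a = {"i1": 2, "i2": 3, "i3": 4}
similar_to_j = {"i1": 0.2, "i2": 0.5, "i3": 0.3, "i9": 0.9}
prediction = predict_item_based(user_a, similar_to_j)
```

Item i9 is similar to j but unrated by this user, so it is skipped; the prediction is (0.2·2 + 0.5·3 + 0.3·4) / 1.0.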


Regression Based Methods [VUC00]

- Offline phase:
  - Fit n(n-1) linear regressions
  - fij(x) is a linear transformation of a user's rating on item i to his rating on item j
- Online phase:
  - Same as the previous method
  - The weights are inversely proportional to the regression error rates

$$r_{aj} = \frac{\sum_{i \in \text{items rated by } a} w_{ij} \, f_{ij}(r_{ai})}{\sum_{i \in \text{items rated by } a} w_{ij}}$$
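A possible reading of the regression-based method in code: fit one least-squares line per ordered item pair offline, then combine the per-item predictions with weights inversely proportional to each line's error. The choice of mean squared error as the "error rate", the small `eps` guard, and all names and data below are assumptions made for this sketch rather than details taken from [VUC00].

```python
def fit_line(xs, ys):
    """Least-squares fit y ~ slope * x + intercept; also returns the MSE."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    intercept = my - slope * mx
    mse = sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / n
    return slope, intercept, mse

def predict_regression(user_ratings, models, eps=1e-6):
    """Combine f_ij(r_ai) over the rated items i, weighting by 1 / error."""
    num = den = 0.0
    for i, (slope, intercept, mse) in models.items():
        if i in user_ratings:
            w = 1.0 / (mse + eps)  # eps avoids division by zero (assumption)
            num += w * (slope * user_ratings[i] + intercept)
            den += w
    return num / den if den else None

# Hypothetical training pairs: ratings on item i (x) and on target j (y)
models = {
    "i1": fit_line([1, 2, 3, 4], [2, 3, 4, 5]),   # perfect fit, zero error
    "i2": fit_line([1, 2, 3, 4], [4, 2, 5, 1]),   # noisy fit, large error
}
prediction = predict_regression({"i1": 4, "i2": 2}, models)
```

Because the line for i1 fits its training pairs exactly, its weight dominates and the combined prediction lands very close to f(4) = 5 from that model alone.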


Higher Order Models

- Previous approaches used the Naïve Bayes assumption
  - The effects of different items on a given item are independent
  - Not always true
- Higher order models can do better
  - Belief Network
  - Association Rule Mining


Bayesian Belief Network: introduction

- A Bayesian belief network allows a subset of the variables to be conditionally independent
- A graphical model of causal relationships
  - Represents dependency among the variables
  - Gives a specification of the joint probability distribution

[Figure: example network over the nodes X, Y, Z, P]

- Nodes: random variables
- Links: dependency
- X and Y are the parents of Z, and Y is the parent of P
- No dependency between Z and P
- Has no loops or cycles


Bayesian Belief Network: An Example

[Figure: belief network with the nodes FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, Dyspnea]

The conditional probability table (CPT) for the variable LungCancer shows the conditional probability for each possible combination of its parents:

        (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
LC        0.8       0.5        0.7        0.1
~LC       0.2       0.5        0.3        0.9

Bayesian Belief Networks: the joint probability factorizes as

$$P(z_1, \ldots, z_n) = \prod_{i=1}^{n} P(z_i \mid \text{Parents}(Z_i))$$
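As a worked instance of the factorization, the sketch below evaluates the joint distribution over FamilyHistory, Smoker, and LungCancer. The CPT for LungCancer is the one given on this slide; the prior probabilities for FamilyHistory and Smoker are not in the slides and are invented purely so the product can be computed.

```python
# Hypothetical priors (NOT from the slides), chosen only for illustration
P_FH = {True: 0.1, False: 0.9}   # P(FamilyHistory)
P_S = {True: 0.3, False: 0.7}    # P(Smoker)

# P(LungCancer = yes | FamilyHistory, Smoker), taken from the slide's CPT
P_LC = {
    (True, True): 0.8,
    (True, False): 0.5,
    (False, True): 0.7,
    (False, False): 0.1,
}

def joint(fh, s, lc):
    """P(FH=fh, S=s, LC=lc) as the product of each node given its parents."""
    p_lc = P_LC[(fh, s)]
    return P_FH[fh] * P_S[s] * (p_lc if lc else 1.0 - p_lc)

values = (True, False)
# The joint distribution sums to 1 over all assignments
total = sum(joint(fh, s, lc) for fh in values for s in values for lc in values)
# Marginal probability of lung cancer: sum out the two parents
p_lung_cancer = sum(joint(fh, s, True) for fh in values for s in values)
```

FamilyHistory and Smoker have no parents, so their factors are plain priors; only LungCancer needs a conditional table.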


Belief Network for CF [BRE98]

- Every item is a node
- Binary rating (like, dislike)
- Learn a belief network offline over the training data
- The CPT at each node is represented as a decision tree
- Use greedy algorithms to determine the best network structure
- Use probabilistic inference for online prediction


Belief Network for CF: An Example

[Figure: decision tree for the random variable “Melrose Palace” in the movie domain; figure labels: Friends, B.H, M.P, Probability, CPT]


Association Rule Mining

- Offline processing:
  - Work on the binary level (like, dislike)
  - View each user as a market basket containing the items liked by that user
  - Discover association rules between items
- Online processing:
  - Match the items that the active user likes with the rules' left-hand sides
  - Recommend the rules' consequents based on support and confidence
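The offline and online steps can be sketched end to end on a toy dataset. This brute-force enumeration stands in for a real frequent-pattern miner such as Apriori, which it is not; the function names, thresholds, and baskets are hypothetical.

```python
from itertools import combinations

def mine_rules(baskets, min_support, min_confidence, max_lhs=2):
    """Offline: enumerate rules lhs => target with one item in the head."""
    n = len(baskets)
    items = sorted(set().union(*baskets))

    def support(itemset):
        return sum(1 for b in baskets if itemset <= b) / n

    rules = []
    for size in range(1, max_lhs + 1):
        for lhs in combinations(items, size):
            lhs = frozenset(lhs)
            s_lhs = support(lhs)
            if s_lhs == 0:
                continue
            for target in items:
                if target in lhs:
                    continue
                s_rule = support(lhs | {target})
                confidence = s_rule / s_lhs
                if s_rule >= min_support and confidence >= min_confidence:
                    rules.append((lhs, target, s_rule, confidence))
    return rules

def recommend(liked, rules):
    """Online: fire rules whose left-hand side the user's likes contain."""
    scores = {}
    for lhs, target, sup, conf in rules:
        if lhs <= liked and target not in liked:
            scores[target] = max(scores.get(target, (0.0, 0.0)), (conf, sup))
    # Rank consequents by confidence, breaking ties by support
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical baskets: the set of items each user likes
baskets = [{"A", "B", "C"}, {"A", "B", "C"}, {"A", "B"}, {"B", "D"}, {"A", "C"}]
rules = mine_rules(baskets, min_support=0.4, min_confidence=0.6)
suggestions = recommend({"A", "B"}, rules)
```

For a user who likes A and B, the rules {A} => C and {A, B} => C both fire, so C is recommended; D is never suggested because no rule for it reaches the support threshold.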


Association Rule Mining : Problems

High support threshold leads to low coverage and may eliminate important, but infrequent items from consideration

Low support thresholds result in very large model sizes, computationally expensive offline pattern discovery phase and slower online matching phase

Solution: Adaptive Association Rule Mining


Adaptive Association Rule Mining [LIN01]

[Figure labels: minSupport, minConfidence, desired number of rules]

Given:
- transaction dataset
- target item
- desired range for the number of rules
- specified minimum confidence

Find: a set S of association rules for the target item such that
- the number of rules in S is in the given range
- rules in S satisfy the minimum confidence constraint
- rules in S have higher support than rules not in S that satisfy the above constraints


Adaptive Association Rule Mining (2)

- Discover rules with one item in the head
  - Like(x, item1) ^ Like(x, item2) ⇒ Like(x, target)
- The miner discovers association rules iteratively (for each target item) until the desired number of rules is extracted
- Support is adjusted per item
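The idea of adjusting the support threshold per target item until the rule count falls in the desired range can be illustrated with a simple bisection loop. This is a sketch of the general idea only, not the miner described in [LIN01]; the brute-force rule enumeration, the function names, and the sample baskets are all assumptions.

```python
from itertools import combinations

def rules_for_target(baskets, target, min_support, min_confidence, max_lhs=2):
    """All rules lhs => target that meet both thresholds (brute force)."""
    n = len(baskets)
    items = sorted(set().union(*baskets) - {target})
    rules = []
    for size in range(1, max_lhs + 1):
        for lhs in combinations(items, size):
            lhs = frozenset(lhs)
            n_lhs = sum(1 for b in baskets if lhs <= b)
            n_rule = sum(1 for b in baskets if lhs <= b and target in b)
            if (n_lhs and n_rule / n >= min_support
                    and n_rule / n_lhs >= min_confidence):
                rules.append((lhs, n_rule / n, n_rule / n_lhs))
    return rules

def adaptive_rules(baskets, target, min_confidence,
                   min_rules, max_rules, max_iter=20):
    """Bisect on the support threshold until the rule count is in range."""
    low, high = 0.0, 1.0
    min_support, rules = 1.0, []
    for _ in range(max_iter):
        min_support = (low + high) / 2
        rules = rules_for_target(baskets, target, min_support, min_confidence)
        if len(rules) < min_rules:
            high = min_support   # too few rules: lower the threshold
        elif len(rules) > max_rules:
            low = min_support    # too many rules: raise the threshold
        else:
            break
    return min_support, rules

# Hypothetical baskets: the set of items each user likes
baskets = [{"A", "B", "C"}, {"A", "B", "C"}, {"A", "B"}, {"B", "D"}, {"A", "C"}]
support, rules = adaptive_rules(baskets, "C", min_confidence=0.6,
                                min_rules=2, max_rules=2)
```

At a threshold of 0.5 only {A} => C qualifies, so the loop lowers the threshold to 0.25, where {A, B} => C also qualifies and the count reaches the requested two rules.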


Item-Item Methods: Why Do They Work?

- Like(x, Book1) ^ Like(x, Book2) ⇒ Like(x, Book3)
- Like(x, Movie1) ⇒ Like(x, Movie2)

[Figure: two user groups, the "Book gang" and the "Movie gang", each providing the support for one of the rules above; labels: Book1, Book2; Movie1]

- We use the right neighbors for each item
- Without discovering the groups themselves, thus eliminating costly online matching
- In general, better quality and better response time than user-user methods [LIN03]


Recent Work and Open Problems

- Order-based methods
  - Ordering items is more informative than rating them
  - [KAM03] developed k-o'means to work on orders
- Preference-based methods
  - Total ordering of items is not feasible
  - Work on partial orders (preferences) [COH99]
- Integrating background knowledge
  - User demographic information, item features, etc.
- Modeling time
  - Sequential patterns


References (1)

- Charu C. Aggarwal, Joel L. Wolf, Kun-Lung Wu, Philip S. Yu: Horting Hatches an Egg: A New Graph-Theoretic Approach to Collaborative Filtering. KDD 1999: 201-212
- J. Breese, D. Heckerman, C. Kadie: Empirical Analysis of Predictive Algorithms for Collaborative Filtering. In Proc. 14th Conf. Uncertainty in Artificial Intelligence, Madison, July 1998.
- Yoon Ho Cho and Jae Kyeong Kim: Application of Web usage mining and product taxonomy to collaborative recommendations in e-commerce. Expert Systems with Applications, 26(2), 2004
- William W. Cohen, Robert E. Schapire, and Yoram Singer: Learning to order things. In Advances in Neural Information Processing Systems 10, Denver, CO, 1997
- Jiawei Han, Fall 2003 online course notes available at: http://www-courses.cs.uiuc.edu/~cs397han/slides/05.ppt
- Toshihiro Kamishima: Nantonac collaborative filtering: recommendation based on order responses. KDD 2003: 583-588
- Lee, C.-H., Kim, Y.-H., Rhee, P.-K.: Web personalization expert with combining collaborative filtering and association rule mining technique. Expert Systems with Applications, v 21, n 3, October 2001, p 131-137


References (2)

- W. Lin, 2001P, online presentation available at: http://www.wiwi.hu-berlin.de/~myra/WEBKDD2000/WEBKDD2000_ARCHIVE/LinAlvarezRuiz_WebKDD2000.ppt
- Weiyang Lin, Sergio A. Alvarez, and Carolina Ruiz: Efficient adaptive-support association rule mining for recommender systems. Data Mining and Knowledge Discovery, 6:83-105, 2002
- G. Linden, B. Smith, and J. York: Amazon.com Recommendations: Item-to-item collaborative filtering. IEEE Internet Computing, Vol. 7, No. 1, pp. 76-80, Jan. 2003.
- Badrul M. Sarwar, George Karypis, Joseph A. Konstan, John Riedl: Analysis of recommendation algorithms for e-commerce. ACM Conf. Electronic Commerce 2000: 158-167
- B. Sarwar, G. Karypis, J. Konstan, and J. Riedl: Application of dimensionality reduction in recommender systems: a case study. In ACM WebKDD 2000 Web Mining for E-Commerce Workshop, 2000.
- B. M. Sarwar, G. Karypis, J. A. Konstan, and J. Riedl: Item-based collaborative filtering recommendation algorithms. WWW'01


References (3)

- B. Sarwar, 2000P, online presentation available at: http://www.wiwi.hu-berlin.de/~myra/WEBKDD2000/WEBKDD2000_ARCHIVE/badrul.ppt
- J. Ben Schafer, Joseph A. Konstan, John Riedl: E-Commerce Recommendation Applications. Data Mining and Knowledge Discovery 5(1/2): 115-153, 2001
- L.H. Ungar and D.P. Foster: Clustering Methods for Collaborative Filtering. AAAI Workshop on Recommendation Systems, 1998.
- Yi-Fan Wang, Yu-Liang Chuang, Mei-Hua Hsu and Huan-Chao Keh: A personalized recommender system for the cosmetic business. Expert Systems with Applications, v 26, n 3, April 2004, p 427-434
- S. Vucetic and Z. Obradovic: A regression-based approach for scaling-up personalized recommender systems in e-commerce. In ACM WebKDD 2000 Web Mining for E-Commerce Workshop, 2000.
- Kai Yu, Xiaowei Xu, Martin Ester, and Hans-Peter Kriegel: Selecting relevant instances for efficient accurate collaborative filtering. In Proceedings of the 10th CIKM, pages 239-246. ACM Press, 2001.
- Cheng Zhai, Spring 2003 online course notes available at: http://sifaka.cs.uiuc.edu/course/2003-497CXZ/loc/cf.ppt
