Web Usage Mining Classification Fang Yao MEMS 2002 185029 Humboldt Uni zu Berlin.

Date post: 31-Mar-2015

Page 1

Web Usage Mining

Classification

• Fang Yao

• MEMS 2002

• 185029

Humboldt Uni zu Berlin

Page 2

Contents:

• Definition and Usages

• Outputs of Classification

• Methods of Classification

• Application to EDOC

• Discussion on Incomplete Data

• Discussion Questions & Outlook

Page 3

Classification

• A major data mining operation

• Given one target attribute (e.g. play), try to predict its value for new instances by means of the other available attributes.

• Usages:
  • Behavior prediction
  • Improving Web design
  • Personalized marketing
  • …

Example rule: “People with age less than 40 and salary > 40k trade on-line”
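The quoted rule can be read as a very small classifier. A minimal sketch, assuming hypothetical field names `age` and `salary` (salary in plain currency units) that the slide does not define:

```python
def trades_online(age: float, salary: float) -> bool:
    """Apply the slide's example rule: age < 40 and salary > 40k."""
    return age < 40 and salary > 40_000


# Hypothetical visitors, invented for illustration only.
visitors = [
    {"age": 35, "salary": 52_000},
    {"age": 35, "salary": 30_000},
    {"age": 58, "salary": 90_000},
]
predictions = [trades_online(v["age"], v["salary"]) for v in visitors]
print(predictions)  # [True, False, False]
```

Here the target attribute is "trades on-line", and age and salary are the attributes used to predict it.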

Page 4

A Small Example

Weather Data

Source: Witten & Frank, table 1.2

outlook    temperature  humidity  windy  play
sunny      hot          high      false  no
sunny      hot          high      true   no
overcast   hot          high      false  yes
rainy      mild         high      false  yes
rainy      cool         normal    false  yes
rainy      cool         normal    true   no
….         ….           ….        ….     ….

Decision Tree

outlook
  overcast → yes
  rainy → windy
    false → ….
    true → ….
  sunny → humidity
    high → no
    …. → ….

Page 5

Outputs of Classification

Classification Rules

If outlook = sunny and humidity = high then play = no

If outlook = rainy and windy = true then play = no

If outlook = overcast then play = yes

If humidity = normal then play = yes

.......

Decision Tree

outlook
  overcast → yes
  rainy → windy
    false → ….
    true → ….
  sunny → humidity
    high → no
    …. → ….
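The four rules that are shown can be sketched as an ordered rule list. This assumes that the rules are tried top to bottom and the first match wins, which the slide does not state explicitly. The function name `classify` and the `default` parameter are illustrative.

```python
# Rules in the order shown on the slide; each is (conditions, predicted play).
RULES = [
    ({"outlook": "sunny", "humidity": "high"}, "no"),
    ({"outlook": "rainy", "windy": True}, "no"),
    ({"outlook": "overcast"}, "yes"),
    ({"humidity": "normal"}, "yes"),
]


def classify(instance, rules=RULES, default=None):
    """Return the prediction of the first rule whose conditions all hold."""
    for conditions, prediction in rules:
        if all(instance.get(attr) == value for attr, value in conditions.items()):
            return prediction
    return default  # the slide elides the remaining rules, so no default is assumed


day = {"outlook": "rainy", "temperature": "cool", "humidity": "normal", "windy": True}
print(classify(day))  # no
```

Order matters in this reading: the example day also satisfies the fourth rule, but the second rule fires first and gives the answer found in the weather table.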

Page 6

Methods: divide-and-conquer

Constructing decision trees

Step 1: select a splitting attribute

humidity
  high: Yes Yes Yes No No No No
  normal: Yes Yes Yes Yes Yes Yes No
Gain(humidity): 0.152 bits

windy
  true: Yes Yes Yes No No No
  false: Yes Yes Yes Yes Yes Yes No No
Gain(windy): 0.048 bits

temperature
  hot: Yes Yes No No
  mild: Yes Yes Yes Yes No No
  cool: Yes Yes Yes No
Gain(temperature): 0.029 bits

outlook
  overcast: Yes Yes Yes Yes
  rainy: Yes Yes Yes No No
  sunny: Yes Yes No No No
Gain(outlook): 0.247 bits

Gain(outlook) > Gain(humidity) > Gain(windy) > Gain(temperature)
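The four gains can be checked directly from the Yes/No counts in each branch. A minimal sketch: the helper names `info` and `gain` follow the slides' notation, and everything else is illustrative.

```python
from math import log2


def info(counts):
    """Entropy in bits of a class distribution given as counts, e.g. [9, 5]."""
    total = sum(counts)
    return sum(-c / total * log2(c / total) for c in counts if c > 0)


def gain(branches):
    """Information gain of a split; branches is a list of [yes, no] counts."""
    parent = [sum(col) for col in zip(*branches)]
    total = sum(parent)
    remainder = sum(sum(b) / total * info(b) for b in branches)
    return info(parent) - remainder


# [yes, no] counts per branch, read off the slide.
splits = {
    "humidity": [[3, 4], [6, 1]],              # high, normal
    "windy": [[3, 3], [6, 2]],                 # true, false
    "temperature": [[2, 2], [4, 2], [3, 1]],   # hot, mild, cool
    "outlook": [[4, 0], [3, 2], [2, 3]],       # overcast, rainy, sunny
}
gains = {attr: gain(b) for attr, b in splits.items()}
best = max(gains, key=gains.get)
for attr, g in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{attr:12s}{g:.3f} bits")
print("split on:", best)
```

The attribute with the largest gain, outlook, becomes the root of the tree.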

Page 7

Methods: divide-and-conquer

Constructing decision trees

Calculating information gain:

Gain(outlook) = info([9,5]) - info([4,0],[3,2],[2,3]) = 0.247 bits

This is the informational value of creating a branch on "outlook".

Where:

info([4,0],[3,2],[2,3]) = (4/14) info([4,0]) + (5/14) info([3,2]) + (5/14) info([2,3])

outlook
  overcast: Yes Yes Yes Yes
  rainy: Yes Yes Yes No No
  sunny: Yes Yes No No No
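The intermediate values are not shown on the slide. Worked out from the counts, taking info of a count pair as the base-2 entropy of its class fractions, they come to roughly:

```latex
\begin{align*}
\mathrm{info}([9,5]) &= -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.940 \text{ bits} \\
\mathrm{info}([4,0],[3,2],[2,3]) &\approx \tfrac{4}{14}\cdot 0 + \tfrac{5}{14}\cdot 0.971 + \tfrac{5}{14}\cdot 0.971 \approx 0.693 \text{ bits} \\
\mathrm{Gain}(\mathit{outlook}) &\approx 0.940 - 0.693 = 0.247 \text{ bits}
\end{align*}
```

The pure overcast branch contributes nothing, because info([4,0]) = 0.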

Page 8

Methods: divide-and-conquer

Calculating information

Formula for information value:

$$\mathrm{entropy}(p_1, p_2, \ldots, p_n) = -p_1 \log p_1 - p_2 \log p_2 - \cdots - p_n \log p_n$$

• Logarithms are expressed in base 2.

• The unit is "bits".

• The arguments p are expressed as fractions that add up to 1.

Example:

info([2,3]) = entropy(2/5, 3/5) = -(2/5) log(2/5) - (3/5) log(3/5) = 0.971 bits
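The formula translates almost line for line into code. A minimal sketch, with the function names `entropy` and `info` taken from the slide's notation and the convention 0 · log 0 = 0 added as an assumption so that pure nodes work:

```python
from math import log2


def entropy(*probabilities):
    """Entropy in bits of fractions that add up to 1; 0 * log 0 is taken as 0."""
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must add up to 1")
    return sum(-p * log2(p) for p in probabilities if p > 0)


def info(counts):
    """info([2, 3]) in the slide's notation: entropy of the count fractions."""
    total = sum(counts)
    return entropy(*(c / total for c in counts))


print(round(info([2, 3]), 3))  # 0.971
print(info([4, 0]))            # 0.0 for a pure node
print(entropy(0.5, 0.5))       # 1.0 for an even two-class split
```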

Page 9

Methods: divide-and-conquer

Calculating information

[Figure: Entropy plotted against P; horizontal axis P from 0.01 to 0.98, vertical axis Entropy from 0 to 0.6]

Page 10

Methods: divide-and-conquer

Constructing decision trees

outlook
  overcast → yes
  rainy → ?
  sunny → ?

Page 11

Methods: divide-and-conquer

Step 2: select a daughter attribute (outlook = sunny)

humidity
  high: No No No
  normal: Yes Yes
Gain(humidity): 0.971 bits

temperature
  hot: No No
  mild: Yes No
  cool: Yes
Gain(temperature): 0.571 bits

windy
  true: Yes No
  false: Yes No No
Gain(windy): 0.020 bits

Gain(humidity) > Gain(temperature) > Gain(windy)

Do this recursively!
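Steps 1 and 2 repeated recursively give an ID3-style tree builder. The following is a sketch under assumptions, not the presenter's code. It uses the full 14-row weather data set from Witten & Frank (the slides show only the first six rows), a nested-dict tree layout chosen for illustration, and two stopping conditions: a node is pure, or no attributes remain.

```python
from collections import Counter
from math import log2

ATTRS = ["outlook", "temperature", "humidity", "windy"]
# The 14-row weather data (Witten & Frank, table 1.2); the last field is play.
ROWS = [r.split() for r in """
sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no
""".strip().splitlines()]


def info(rows):
    """Entropy in bits of the class labels (last field) of rows."""
    counts = Counter(r[-1] for r in rows).values()
    return sum(-c / len(rows) * log2(c / len(rows)) for c in counts)


def partition(rows, i):
    """Group rows by the value of attribute column i."""
    parts = {}
    for r in rows:
        parts.setdefault(r[i], []).append(r)
    return parts


def gain(rows, i):
    """Information gain of splitting rows on attribute column i."""
    parts = partition(rows, i).values()
    return info(rows) - sum(len(p) / len(rows) * info(p) for p in parts)


def build(rows, attr_ids):
    """Return a class label (leaf) or {attribute name: {value: subtree}}."""
    labels = Counter(r[-1] for r in rows)
    if len(labels) == 1 or not attr_ids:  # pure node, or nothing left to split on
        return labels.most_common(1)[0][0]
    best = max(attr_ids, key=lambda i: gain(rows, i))
    rest = [i for i in attr_ids if i != best]
    return {ATTRS[best]: {v: build(p, rest) for v, p in partition(rows, best).items()}}


tree = build(ROWS, range(len(ATTRS)))
print(tree)
```

On this data the builder splits on outlook first, then on humidity under sunny and on windy under rainy, which matches the gains listed in Steps 1 and 2.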

Page 12

Methods: divide-and-conquer

Constructing decision trees

outlook
  overcast → yes
  rainy → windy
    true → no
    false → yes
  sunny → humidity
    high → no
    normal → yes

Stop rules:

• Stop when all leaf nodes are pure

• Stop when no attribute is left to split on
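Once the tree is finished, classifying a new day means walking from the root to a leaf. A minimal sketch, with the tree written as nested dicts (an illustrative representation, not one prescribed by the slides):

```python
# The finished tree as nested dicts: {attribute: {value: subtree or leaf label}}.
TREE = {
    "outlook": {
        "overcast": "yes",
        "rainy": {"windy": {"true": "no", "false": "yes"}},
        "sunny": {"humidity": {"high": "no", "normal": "yes"}},
    }
}


def predict(tree, instance):
    """Walk from the root to a leaf, following the instance's attribute values."""
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        tree = branches[instance[attribute]]
    return tree


day = {"outlook": "sunny", "temperature": "cool", "humidity": "normal", "windy": "false"}
print(predict(TREE, day))  # yes
```

Temperature never appears in the tree, so it has no influence on the prediction.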

Page 13

Methods: C4.5

Why C4.5?

• Real-world data is more complicated
  • Numeric attributes
  • Missing values

• The final solution needs more operations
  • Pruning
  • From trees to rules

Page 14

Methods: C4.5

• Numeric attributes: binary split, with numeric thresholds placed halfway between the values

• Missing values:
  -- Ignoring instances with missing values leads to losing information
  -- Alternative: split the instance into partial instances

• Pruning the decision tree:
  -- Subtree replacement
  -- Subtree raising

[Figure: tree diagrams with nodes A, B, and C]
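The numeric-attribute bullet can be illustrated with a short sketch: candidate thresholds are the midpoints between adjacent distinct values, and the binary split with the highest information gain is kept. This shows only the halfway-threshold idea from the slide, not C4.5's full procedure. The temperatures and labels are invented for illustration.

```python
from math import log2


def info(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return sum(-labels.count(c) / n * log2(labels.count(c) / n) for c in set(labels))


def candidate_thresholds(values):
    """Midpoints between adjacent distinct values of a numeric attribute."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


def best_binary_split(values, labels):
    """Return (threshold, gain) for the split 'value <= threshold' with highest gain."""
    def gain(t):
        left = [c for v, c in zip(values, labels) if v <= t]
        right = [c for v, c in zip(values, labels) if v > t]
        remainder = (len(left) * info(left) + len(right) * info(right)) / len(labels)
        return info(labels) - remainder
    return max(((t, gain(t)) for t in candidate_thresholds(values)), key=lambda tg: tg[1])


# Hypothetical numeric temperatures and play labels, invented for illustration.
temps = [64, 65, 68, 69, 70, 71]
play = ["yes", "no", "yes", "yes", "yes", "no"]
print(candidate_thresholds(temps))  # [64.5, 66.5, 68.5, 69.5, 70.5]
threshold, split_gain = best_binary_split(temps, play)
print(threshold, round(split_gain, 3))
```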

Page 15

Application in WEKA

Page 16

Application in WEKA

Data: Clickstream from the EDOC log of 30th March

Method: J4.8 algorithm

Objective: Prediction of dissertation reading

Attributes:

HOME {1,0}
AU-START {1,0}
DSS-LOOKUP {1,0}
SH-OTHER {1,0}
OTHER {1,0}
AUHINWEISE {1,0}
DSS-RVK {1,0}
AUTBERATUNG {1,0}
DSS-ABSTR {1,0}
HIST-DISS {1,0}
OT-PUB-READ {1,0}
OT-CONF {1,0}
SH-START {1,0}
SH-DOCSERV {1,0}
SH-DISS {1,0}
OT-BOOKS {1,0}
SH-START-E {1,0}
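One plausible way to get such data into WEKA is to write one row of {1,0} flags per session in ARFF, WEKA's text input format. This is a sketch under assumptions. The relation name, the function `to_arff`, and the two sessions are invented, and the slide states neither how sessions were extracted from the log nor which attribute served as the class.

```python
# Attribute names as listed on the slide; each is a {1,0} flag per session.
ATTRIBUTES = [
    "HOME", "AU-START", "DSS-LOOKUP", "SH-OTHER", "OTHER", "AUHINWEISE",
    "DSS-RVK", "AUTBERATUNG", "DSS-ABSTR", "HIST-DISS", "OT-PUB-READ",
    "OT-CONF", "SH-START", "SH-DOCSERV", "SH-DISS", "OT-BOOKS", "SH-START-E",
]


def to_arff(sessions, relation="edoc_sessions"):
    """Render sessions (iterables of visited page categories) as ARFF text."""
    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute {name} {{1,0}}" for name in ATTRIBUTES]
    lines += ["", "@data"]
    for visited in sessions:
        visited = set(visited)
        lines.append(",".join("1" if name in visited else "0" for name in ATTRIBUTES))
    return "\n".join(lines)


# Two hypothetical sessions, invented for illustration.
sessions = [["HOME", "DSS-LOOKUP", "DSS-ABSTR"], ["HOME", "SH-START"]]
arff = to_arff(sessions)
print(arff)
```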

Page 17

Application in WEKA

Result:

DSS-ABSTR

Page 18

Application in WEKA

DSS-Lookup

Page 19

Discussion on Incomplete Data

Idea: site-centric data vs. user-centric data

Results obtained from incomplete (site-centric) data are inferior to those obtained from complete (user-centric) data.

Example:

Site-centric data:

User1: Expedia1, Expedia2, Expedia3
User2: Expedia1, Expedia2, Expedia3, Expedia4

User-centric data:

User1: Cheaptickets1, Cheaptickets2, Travelocity1, Travelocity2, Expedia1, Expedia2, Travelocity3, Travelocity4, Expedia3, Cheaptickets3
User2: Expedia1, Expedia2, Expedia3, Expedia4

Source: Padmanabhan, B., Zheng, Z., and Kimbrough, S. (2001)
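The relationship between the two views can be made concrete: the site-centric data is what remains of the user-centric data when only one site's clicks are visible. A minimal sketch using the example sessions, assuming that the trailing number in each label is just a page counter (the helper names are illustrative):

```python
import re

# User-centric sessions from the slide: every page a user visited, across sites.
user_centric = {
    "User1": ["Cheaptickets1", "Cheaptickets2", "Travelocity1", "Travelocity2",
              "Expedia1", "Expedia2", "Travelocity3", "Travelocity4",
              "Expedia3", "Cheaptickets3"],
    "User2": ["Expedia1", "Expedia2", "Expedia3", "Expedia4"],
}


def site_of(page):
    """Site name of a page label such as 'Expedia3' (label minus trailing digits)."""
    return re.sub(r"\d+$", "", page)


def site_centric(sessions, site):
    """What a single site observes: only the clicks made on that site."""
    return {user: [p for p in pages if site_of(p) == site]
            for user, pages in sessions.items()}


expedia_view = site_centric(user_centric, "Expedia")
print(expedia_view)
```

From Expedia's point of view the two users look almost alike, although User1 was comparing three sites and User2 visited only Expedia.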

Page 20

Discussion on Incomplete Data

Results: Lift curve

Source: Padmanabhan, B., Zheng, Z., and Kimbrough, S. (2001), Figures 6.6-6.9
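For readers unfamiliar with lift curves, the sketch below computes one under a common definition: the positive rate among the top-ranked fraction of instances divided by the overall positive rate. The exact construction used by Padmanabhan et al. may differ, and the scores and labels are invented for illustration.

```python
def cumulative_lift(scores, labels, steps=5):
    """Lift at each of `steps` equal-sized fractions of the score-ranked list.

    Lift = positive rate among the top-ranked fraction / overall positive rate.
    """
    ranked = [y for _, y in sorted(zip(scores, labels), key=lambda sy: -sy[0])]
    base_rate = sum(ranked) / len(ranked)
    curve = []
    for k in range(1, steps + 1):
        top = ranked[: round(len(ranked) * k / steps)]
        curve.append((k / steps, (sum(top) / len(top)) / base_rate))
    return curve


# Hypothetical model scores and true outcomes (1 = positive), invented for illustration.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
curve = cumulative_lift(scores, labels)
for fraction, lift in curve:
    print(f"top {fraction:.0%}: lift {lift:.2f}")
```

A model that ranks well has lift well above 1 for small fractions, and the lift is 1 by construction once the whole list is included.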

Page 21

Discussion Questions & Outlook

• What is the proper target attribute for an analysis of a non-profit site?

• What data would we prefer to have?

• Which improvements could be made to the data?

Page 22

References:

• Witten, I.H., & Frank, E. (2000). Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. San Diego, CA: Academic Press. Sections 3.1-3.3; 4.3; 6.1.

• Padmanabhan, B., Zheng, Z., & Kimbrough, S. (2001). "Personalization from Incomplete Data: What You Don't Know Can Hurt."

• http://www.cs.cmu.edu/~awm/tutorials