Web Usage Mining
Classification
• Fang Yao
• MEMS 2002
• 185029
Humboldt Uni zu Berlin
Contents:
• Definition and Usages
• Outputs of Classification
• Methods of Classification
• Application to EDOC
• Discussion on Incomplete Data
• Discussion Questions & Outlook
Classification
• A major data mining operation: given one target attribute (e.g. "play"), predict its value for new instances using the other available attributes.
• Usages:
  • behavior prediction
  • improving Web design
  • personalized marketing
  • …

A Small Example
"People with age less than 40 and salary > 40k trade on-line."
Weather Data
Source: Witten & Frank, table 1.2

outlook    temperature  humidity  windy  play
sunny      hot          high      false  no
sunny      hot          high      true   no
overcast   hot          high      false  yes
rainy      mild         high      false  yes
rainy      cool         normal    false  yes
rainy      cool         normal    true   no
...        ...          ...       ...    ...
Decision Tree

[Figure: partial decision tree — root "outlook": overcast → yes; rainy → "windy" (false / true branches, …); sunny → "humidity" (high → no, …)]
Outputs of Classification
From the decision tree, classification rules can be read off:

If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
...
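Rules like these map directly onto executable code. A minimal sketch in Python (the function name and argument encoding are my own, not from the source):

```python
def predict_play(outlook, humidity, windy):
    """Apply the classification rules read off the decision tree.

    outlook: "sunny" | "overcast" | "rainy"; humidity: "high" | "normal";
    windy: True | False.  Returns "yes" or "no".
    """
    if outlook == "sunny" and humidity == "high":
        return "no"
    if outlook == "rainy" and windy:
        return "no"
    if outlook == "overcast":
        return "yes"
    if humidity == "normal":
        return "yes"
    # remaining case: rainy, not windy, high humidity -> yes in the tree
    return "yes"
```

For example, `predict_play("sunny", "high", False)` returns `"no"`.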
Methods _ divide-and-conquer
constructing decision trees
Step 1: select a splitting attribute

humidity:     high → 3 yes / 4 no,  normal → 6 yes / 1 no                      Gain(humidity) = 0.152 bits
windy:        true → 3 yes / 3 no,  false → 6 yes / 2 no                       Gain(windy) = 0.048 bits
temperature:  hot → 2 yes / 2 no,  mild → 4 yes / 2 no,  cool → 3 yes / 1 no   Gain(temperature) = 0.029 bits
outlook:      sunny → 2 yes / 3 no,  overcast → 4 yes / 0 no,  rainy → 3 yes / 2 no   Gain(outlook) = 0.247 bits

Gain(outlook) > Gain(humidity) > Gain(windy) > Gain(temperature) → split on outlook.
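These gains can be reproduced from the yes/no counts alone. A sketch in Python (the helper names `info` and `gain` are mine, following the book's notation):

```python
from math import log2

def info(counts):
    """Entropy in bits of a class distribution given as a list of counts, e.g. [9, 5]."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def gain(parent, children):
    """Information gain of a split: parent entropy minus the weighted entropy of the children."""
    n = sum(parent)
    return info(parent) - sum(sum(child) / n * info(child) for child in children)

# yes/no counts per branch, as on the slide
print(round(gain([9, 5], [[3, 4], [6, 1]]), 3))          # humidity    -> 0.152
print(round(gain([9, 5], [[3, 3], [6, 2]]), 3))          # windy       -> 0.048
print(round(gain([9, 5], [[2, 2], [4, 2], [3, 1]]), 3))  # temperature -> 0.029
print(round(gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))  # outlook     -> 0.247
```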
Methods _ divide-and-conquer
constructing decision trees
Calculating information gain:

Gain(outlook) = info([9,5]) − info([4,0],[3,2],[2,3]) = 0.247 bits

where:

info([4,0],[3,2],[2,3]) = (4/14)·info([4,0]) + (5/14)·info([3,2]) + (5/14)·info([2,3])

is the informational value of creating a branch on "outlook".
Methods _ divide-and-conquer
calculating information
Formula for information value:

entropy(p1, p2, …, pn) = −p1·log p1 − p2·log p2 − … − pn·log pn

• Logarithms are expressed in base 2.
• The unit is 'bits'.
• The arguments p are expressed as fractions that add up to 1.

Example:
Info([2,3]) = entropy(2/5, 3/5) = −(2/5)·log(2/5) − (3/5)·log(3/5) ≈ 0.971 bits
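The formula can be stated directly in Python (function name mine; note the base-2 logarithm):

```python
from math import log2

def entropy(*probs):
    """entropy(p1, ..., pn) = -p1*log2(p1) - ... - pn*log2(pn),
    for fractions p1..pn that add up to 1."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(round(entropy(2/5, 3/5), 3))  # Info([2,3]) -> 0.971
```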
Methods _ divide-and-conquer
calculating information
[Figure: entropy contribution −p·log2(p) plotted against p, for p between 0 and 1]
Methods _ divide-and-conquer
constructing decision trees
[Figure: tree after the first split — "outlook": overcast → yes; rainy → ?; sunny → ?]
Methods _ divide-and-conquer
Step 2: select a daughter attribute — branch outlook = sunny

humidity:     high → 0 yes / 3 no,  normal → 2 yes / 0 no                      Gain(humidity) = 0.971 bits
temperature:  hot → 0 yes / 2 no,  mild → 1 yes / 1 no,  cool → 1 yes / 0 no   Gain(temperature) = 0.571 bits
windy:        true → 1 yes / 1 no,  false → 1 yes / 2 no                       Gain(windy) = 0.020 bits

Gain(humidity) > Gain(temperature) > Gain(windy) → split on humidity.

Do this recursively!
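The whole divide-and-conquer procedure fits in a short recursive sketch. This is a plain ID3-style construction over nominal attributes (my own code, not WEKA's), run on the full weather data from Witten & Frank:

```python
from collections import Counter
from math import log2

FIELDS = ["outlook", "temperature", "humidity", "windy", "play"]
weather = [dict(zip(FIELDS, row)) for row in [
    ("sunny", "hot", "high", False, "no"),     ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"), ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"), ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"), ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"), ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),  ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"), ("rainy", "mild", "high", True, "no"),
]]

def info(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(rows, attr):
    """Information gain of splitting rows on attr, with target class 'play'."""
    g = info([r["play"] for r in rows])
    for value in {r[attr] for r in rows}:
        subset = [r["play"] for r in rows if r[attr] == value]
        g -= len(subset) / len(rows) * info(subset)
    return g

def build_tree(rows, attrs):
    """Divide and conquer: split on the highest-gain attribute, recurse on each branch."""
    labels = [r["play"] for r in rows]
    if len(set(labels)) == 1:                    # stop: the node is pure
        return labels[0]
    if not attrs:                                # stop: no attribute left to split on
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, a))
    return {best: {value: build_tree([r for r in rows if r[best] == value],
                                     [a for a in attrs if a != best])
                   for value in {r[best] for r in rows}}}

tree = build_tree(weather, ["outlook", "temperature", "humidity", "windy"])
```

On this data the sketch reproduces the tree from the slides: outlook at the root, humidity under the sunny branch, and windy under the rainy branch.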
Methods _ divide-and-conquer
constructing decision trees
[Figure: final decision tree — "outlook": overcast → yes; rainy → "windy" (true → no, false → yes); sunny → "humidity" (high → no, normal → yes)]

Stop rules:
• stop when all leaf nodes are pure
• stop when no more attributes can be split on
Methods _ C 4.5
WHY C4.5?
• Real-world data is more complicated:
  • numeric attributes
  • missing values
• The final solution needs more operations:
  • pruning
  • from trees to rules
Methods _ C 4.5
• Numeric attributes: binary splits, with numeric thresholds halfway between observed values
• Missing values:
  -- ignoring instances with missing values loses information
  -- instead, split them into partial instances
• Pruning the decision tree:
  -- subtree replacement
  -- subtree raising
[Figure: subtree replacement and subtree raising, illustrated on a small tree with nodes A, B, C]
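The numeric-attribute rule above is easy to sketch: a C4.5-style binary split only needs to consider thresholds halfway between adjacent observed values. A minimal sketch (function name mine; the temperature values are just an illustration):

```python
def candidate_thresholds(values):
    """Midpoints between adjacent distinct sorted values of a numeric attribute.
    A binary split 'attr < t' is then evaluated for each candidate t."""
    distinct = sorted(set(values))
    return [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]

print(candidate_thresholds([64, 65, 68, 69, 70]))  # [64.5, 66.5, 68.5, 69.5]
```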
Application in WEKA
Data: clickstream from the EDOC log of 30 March
Method: J4.8 algorithm (WEKA's implementation of C4.5)
Objective: prediction of dissertation reading

Attributes (all binary {1,0}):
HOME, AU-START, DSS-LOOKUP, SH-OTHER, OTHER, AUHINWEISE, DSS-RVK, AUTBERATUNG, DSS-ABSTR, HIST-DISS, OT-PUB-READ, OT-CONF, SH-START, SH-DOCSERV, SH-DISS, OT-BOOKS, SH-START-E
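In WEKA such attributes would be declared in an ARFF input file. A hypothetical fragment (the relation name and the data row are my assumptions, not from the source; only the attribute names are taken from the slide):

```
% Hypothetical ARFF header for the EDOC clickstream data
@relation edoc

@attribute HOME {1,0}
@attribute AU-START {1,0}
@attribute DSS-LOOKUP {1,0}
@attribute SH-OTHER {1,0}
@attribute OTHER {1,0}
@attribute AUHINWEISE {1,0}
@attribute DSS-RVK {1,0}
@attribute AUTBERATUNG {1,0}
@attribute DSS-ABSTR {1,0}
@attribute HIST-DISS {1,0}
@attribute OT-PUB-READ {1,0}
@attribute OT-CONF {1,0}
@attribute SH-START {1,0}
@attribute SH-DOCSERV {1,0}
@attribute SH-DISS {1,0}
@attribute OT-BOOKS {1,0}
@attribute SH-START-E {1,0}

@data
% one session (values invented for illustration)
1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0
```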
Application in WEKA
Result:

[Screenshot: the resulting J4.8 tree; visible node label: DSS-ABSTR]
Application in WEKA
[Screenshot: the resulting J4.8 tree, continued; visible node label: DSS-Lookup]
Discussion on Incomplete data
Idea: site-centric data vs. user-centric data

Models learned from incomplete (site-centric) data are inferior to those learned from complete (user-centric) data.

Example:

Site-centric data:
User1: Expedia1, Expedia2, Expedia3
User2: Expedia1, Expedia2, Expedia3, Expedia4

User-centric data:
User1: Cheaptickets1, Cheaptickets2, Travelocity1, Travelocity2, Expedia1, Expedia2, Travelocity3, Travelocity4, Expedia3, Cheaptickets3
User2: Expedia1, Expedia2, Expedia3, Expedia4
Padmanabhan, B., Zheng, Z., and Kimbrough, S. (2001)
Results: lift curves

[Figure: lift curves; source: Padmanabhan, Zheng, and Kimbrough (2001), figures 6.6-6.9]
Discussion Questions & Outlook
• What is the proper target attribute for the analysis of a non-profit site?
• What data would we prefer to have?
• Which improvements could be made to the data?
References:
• Witten, I.H., & Frank, E. (2000). Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. San Diego, CA: Academic Press. Sections 3.1-3.3, 4.3, 6.1.
• Padmanabhan, B., Zheng, Z., & Kimbrough, S. (2001). "Personalization from Incomplete Data: What You Don't Know Can Hurt."
• http://www.cs.cmu.edu/~awm/tutorials