Web Usage Mining
Classification
• Fang Yao
• MEMS 2002
• 185029
Humboldt Uni zu Berlin
Contents:
• Definition and Usages
• Outputs of Classification
• Methods of Classification
• Application to EDOC
• Discussion on Incomplete Data
• Discussion Questions & Outlook
Classification
• A major data mining operation: given one target attribute (e.g. "play"), predict its value for new instances using the other available attributes.
• Usages:
  • behavior prediction
  • improving Web design
  • personalized marketing
  • …

A Small Example
"People with age less than 40 and salary > 40k trade on-line."
Weather Data
Source: Witten & Frank, table 1.2

outlook    temperature  humidity  windy  play
sunny      hot          high      false  no
sunny      hot          high      true   no
overcast   hot          high      false  yes
rainy      mild         high      false  yes
rainy      cool         normal    false  yes
rainy      cool         normal    true   no
...        ...          ...       ...    ...
Decision Tree

[Figure: partial decision tree — root "outlook": overcast → yes; rainy → "windy" (false / true branches, …); sunny → "humidity" (high → no, …)]
Outputs of Classification
From the decision tree, classification rules can be read off:

If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
...
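Rules like these map directly onto executable code. A minimal sketch in Python (the function name and argument encoding are my own, not from the source):

```python
def predict_play(outlook, humidity, windy):
    """Apply the classification rules read off the decision tree.

    outlook: "sunny" | "overcast" | "rainy"; humidity: "high" | "normal";
    windy: True | False.  Returns "yes" or "no".
    """
    if outlook == "sunny" and humidity == "high":
        return "no"
    if outlook == "rainy" and windy:
        return "no"
    if outlook == "overcast":
        return "yes"
    if humidity == "normal":
        return "yes"
    # remaining case: rainy, not windy, high humidity -> yes in the tree
    return "yes"
```

For example, `predict_play("sunny", "high", False)` returns `"no"`.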
Methods _ divide-and-conquer
constructing decision trees
Step 1: select a splitting attribute

humidity:     high → 3 yes / 4 no,  normal → 6 yes / 1 no                      Gain(humidity) = 0.152 bits
windy:        true → 3 yes / 3 no,  false → 6 yes / 2 no                       Gain(windy) = 0.048 bits
temperature:  hot → 2 yes / 2 no,  mild → 4 yes / 2 no,  cool → 3 yes / 1 no   Gain(temperature) = 0.029 bits
outlook:      sunny → 2 yes / 3 no,  overcast → 4 yes / 0 no,  rainy → 3 yes / 2 no   Gain(outlook) = 0.247 bits

Gain(outlook) > Gain(humidity) > Gain(windy) > Gain(temperature) → split on outlook.
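These gains can be reproduced from the yes/no counts alone. A sketch in Python (the helper names `info` and `gain` are mine, following the book's notation):

```python
from math import log2

def info(counts):
    """Entropy in bits of a class distribution given as a list of counts, e.g. [9, 5]."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def gain(parent, children):
    """Information gain of a split: parent entropy minus the weighted entropy of the children."""
    n = sum(parent)
    return info(parent) - sum(sum(child) / n * info(child) for child in children)

# yes/no counts per branch, as on the slide
print(round(gain([9, 5], [[3, 4], [6, 1]]), 3))          # humidity    -> 0.152
print(round(gain([9, 5], [[3, 3], [6, 2]]), 3))          # windy       -> 0.048
print(round(gain([9, 5], [[2, 2], [4, 2], [3, 1]]), 3))  # temperature -> 0.029
print(round(gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))  # outlook     -> 0.247
```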
Methods _ divide-and-conquer
constructing decision trees
Calculating information gain:

Gain(outlook) = info([9,5]) − info([4,0],[3,2],[2,3]) = 0.247 bits

where:

info([4,0],[3,2],[2,3]) = (4/14)·info([4,0]) + (5/14)·info([3,2]) + (5/14)·info([2,3])

is the informational value of creating a branch on "outlook".
Methods _ divide-and-conquer
calculating information
Formula for information value:

entropy(p1, p2, …, pn) = −p1·log p1 − p2·log p2 − … − pn·log pn

• Logarithms are expressed in base 2.
• The unit is 'bits'.
• The arguments p are expressed as fractions that add up to 1.

Example:
Info([2,3]) = entropy(2/5, 3/5) = −(2/5)·log(2/5) − (3/5)·log(3/5) ≈ 0.971 bits
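The formula can be stated directly in Python (function name mine; note the base-2 logarithm):

```python
from math import log2

def entropy(*probs):
    """entropy(p1, ..., pn) = -p1*log2(p1) - ... - pn*log2(pn),
    for fractions p1..pn that add up to 1."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(round(entropy(2/5, 3/5), 3))  # Info([2,3]) -> 0.971
```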
Methods _ divide-and-conquer
calculating information
[Figure: entropy contribution −p·log2(p) plotted against p, for p between 0 and 1]
Methods _ divide-and-conquer
constructing decision trees
[Figure: tree after the first split — "outlook": overcast → yes; rainy → ?; sunny → ?]
Methods _ divide-and-conquer
Step 2: select a daughter attribute — branch outlook = sunny

humidity:     high → 0 yes / 3 no,  normal → 2 yes / 0 no                      Gain(humidity) = 0.971 bits
temperature:  hot → 0 yes / 2 no,  mild → 1 yes / 1 no,  cool → 1 yes / 0 no   Gain(temperature) = 0.571 bits
windy:        true → 1 yes / 1 no,  false → 1 yes / 2 no                       Gain(windy) = 0.020 bits

Gain(humidity) > Gain(temperature) > Gain(windy) → split on humidity.

Do this recursively!
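The whole divide-and-conquer procedure fits in a short recursive sketch. This is a plain ID3-style construction over nominal attributes (my own code, not WEKA's), run on the full weather data from Witten & Frank:

```python
from collections import Counter
from math import log2

FIELDS = ["outlook", "temperature", "humidity", "windy", "play"]
weather = [dict(zip(FIELDS, row)) for row in [
    ("sunny", "hot", "high", False, "no"),     ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"), ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"), ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"), ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"), ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),  ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"), ("rainy", "mild", "high", True, "no"),
]]

def info(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(rows, attr):
    """Information gain of splitting rows on attr, with target class 'play'."""
    g = info([r["play"] for r in rows])
    for value in {r[attr] for r in rows}:
        subset = [r["play"] for r in rows if r[attr] == value]
        g -= len(subset) / len(rows) * info(subset)
    return g

def build_tree(rows, attrs):
    """Divide and conquer: split on the highest-gain attribute, recurse on each branch."""
    labels = [r["play"] for r in rows]
    if len(set(labels)) == 1:                    # stop: the node is pure
        return labels[0]
    if not attrs:                                # stop: no attribute left to split on
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, a))
    return {best: {value: build_tree([r for r in rows if r[best] == value],
                                     [a for a in attrs if a != best])
                   for value in {r[best] for r in rows}}}

tree = build_tree(weather, ["outlook", "temperature", "humidity", "windy"])
```

On this data the sketch reproduces the tree from the slides: outlook at the root, humidity under the sunny branch, and windy under the rainy branch.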
Methods _ divide-and-conquer
constructing decision trees
[Figure: final decision tree — "outlook": overcast → yes; rainy → "windy" (true → no, false → yes); sunny → "humidity" (high → no, normal → yes)]

Stop rules:
• stop when all leaf nodes are pure
• stop when no more attributes can be split on
Methods _ C 4.5
WHY C4.5?
• Real-world data is more complicated:
  • numeric attributes
  • missing values
• The final solution needs more operations:
  • pruning
  • from trees to rules
Methods _ C 4.5
• Numeric attributes: binary splits, with numeric thresholds halfway between observed values
• Missing values:
  -- ignoring instances with missing values loses information
  -- instead, split them into partial instances
• Pruning the decision tree:
  -- subtree replacement
  -- subtree raising
[Figure: subtree replacement and subtree raising, illustrated on a small tree with nodes A, B, C]
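The numeric-attribute rule above is easy to sketch: a C4.5-style binary split only needs to consider thresholds halfway between adjacent observed values. A minimal sketch (function name mine; the temperature values are just an illustration):

```python
def candidate_thresholds(values):
    """Midpoints between adjacent distinct sorted values of a numeric attribute.
    A binary split 'attr < t' is then evaluated for each candidate t."""
    distinct = sorted(set(values))
    return [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]

print(candidate_thresholds([64, 65, 68, 69, 70]))  # [64.5, 66.5, 68.5, 69.5]
```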
Application in WEKA
Data: clickstream from the EDOC log of 30 March
Method: J4.8 algorithm (WEKA's implementation of C4.5)
Objective: prediction of dissertation reading

Attributes (all binary {1,0}):
HOME, AU-START, DSS-LOOKUP, SH-OTHER, OTHER, AUHINWEISE, DSS-RVK, AUTBERATUNG, DSS-ABSTR, HIST-DISS, OT-PUB-READ, OT-CONF, SH-START, SH-DOCSERV, SH-DISS, OT-BOOKS, SH-START-E
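In WEKA such attributes would be declared in an ARFF input file. A hypothetical fragment (the relation name and the data row are my assumptions, not from the source; only the attribute names are taken from the slide):

```
% Hypothetical ARFF header for the EDOC clickstream data
@relation edoc

@attribute HOME {1,0}
@attribute AU-START {1,0}
@attribute DSS-LOOKUP {1,0}
@attribute SH-OTHER {1,0}
@attribute OTHER {1,0}
@attribute AUHINWEISE {1,0}
@attribute DSS-RVK {1,0}
@attribute AUTBERATUNG {1,0}
@attribute DSS-ABSTR {1,0}
@attribute HIST-DISS {1,0}
@attribute OT-PUB-READ {1,0}
@attribute OT-CONF {1,0}
@attribute SH-START {1,0}
@attribute SH-DOCSERV {1,0}
@attribute SH-DISS {1,0}
@attribute OT-BOOKS {1,0}
@attribute SH-START-E {1,0}

@data
% one session (values invented for illustration)
1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0
```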
Application in WEKA
Result:

[Screenshot: the resulting J4.8 tree; visible node label: DSS-ABSTR]
Application in WEKA
[Screenshot: the resulting J4.8 tree, continued; visible node label: DSS-Lookup]
Discussion on Incomplete data
Idea: site-centric data vs. user-centric data

Models learned from incomplete (site-centric) data are inferior to those learned from complete (user-centric) data.

Example:

Site-centric data:
User1: Expedia1, Expedia2, Expedia3
User2: Expedia1, Expedia2, Expedia3, Expedia4

User-centric data:
User1: Cheaptickets1, Cheaptickets2, Travelocity1, Travelocity2, Expedia1, Expedia2, Travelocity3, Travelocity4, Expedia3, Cheaptickets3
User2: Expedia1, Expedia2, Expedia3, Expedia4
Padmanabhan, B., Zheng, Z., and Kimbrough, S. (2001)
Results: lift curves

[Figure: lift curves; source: Padmanabhan, Zheng, and Kimbrough (2001), figures 6.6-6.9]
Discussion Questions & Outlook
• What is the proper target attribute for the analysis of a non-profit site?
• What data would we prefer to have?
• Which improvements could be made to the data?
References:
• Witten, I.H., & Frank, E. (2000). Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. San Diego, CA: Academic Press. Sections 3.1-3.3, 4.3, 6.1.
• Padmanabhan, B., Zheng, Z., & Kimbrough, S. (2001). "Personalization from Incomplete Data: What You Don't Know Can Hurt."
• http://www.cs.cmu.edu/~awm/tutorials