Data Mining: Metode dan Algoritma
Romi Satria Wahono
romi@romisatriawahono.net
http://romisatriawahono.net
+6281586220090
Page 1:

Data Mining: Metode dan Algoritma

Romi Satria Wahono
romi@romisatriawahono.net
http://romisatriawahono.net

+6281586220090

Page 2:

SD Sompok Semarang (1987)
SMPN 8 Semarang (1990)
SMA Taruna Nusantara, Magelang (1993)
S1, S2, and S3 (on leave), Department of Computer Sciences, Saitama University, Japan (1994-2004)

Research Interests: Software Engineering and Intelligent Systems

Founder of IlmuKomputer.Com
Researcher at LIPI (2004-2007)
Founder and CEO of PT Brainmatics Cipta Informatika

Romi Satria Wahono

Page 3:

Course Outline
1. Introduction to Data Mining
2. The Data Mining Process
3. Evaluation and Validation in Data Mining
4. Data Mining Methods and Algorithms
5. Data Mining Research

Page 4:

Methods and Algorithms

Page 5:

Methods and Algorithms
1. Inferring rudimentary rules
2. Statistical modeling
3. Constructing decision trees
4. Constructing rules
5. Association rule learning
6. Linear models
7. Instance-based learning
8. Clustering

Page 6:

Simplicity first

Simple algorithms often work very well! There are many kinds of simple structure, e.g.:
• One attribute does all the work
• All attributes contribute equally and independently
• A weighted linear combination might do
• Instance-based: use a few prototypes
• Use simple logical rules

Success of method depends on the domain

Page 7:

Inferring rudimentary rules

1R: learns a 1-level decision tree
• I.e., rules that all test one particular attribute

Basic version:
• One branch for each value
• Each branch assigns the most frequent class
• Error rate: proportion of instances that don't belong to the majority class of their corresponding branch
• Choose the attribute with the lowest error rate
(assumes nominal attributes)

Page 8:

Pseudo-code for 1R:

For each attribute,
    For each value of the attribute, make a rule as follows:
        count how often each class appears
        find the most frequent class
        make the rule assign that class to this attribute-value
    Calculate the error rate of the rules
Choose the rules with the smallest error rate

Note: "missing" is treated as a separate attribute value
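The pseudo-code above can be sketched in Python (a minimal sketch; the function name and data layout are my own, not from the slides; the "missing value" handling is omitted):

```python
from collections import Counter, defaultdict

def one_r(instances, classes):
    """1R: for each attribute, build one rule per observed value that
    predicts the value's most frequent class; keep the attribute whose
    rules make the fewest errors on the training data."""
    best = None  # (error count, attribute index, rules)
    for a in range(len(instances[0])):
        counts = defaultdict(Counter)          # value -> class frequencies
        for x, c in zip(instances, classes):
            counts[x[a]][c] += 1
        # Each value predicts its majority class.
        rules = {v: cnt.most_common(1)[0][0] for v, cnt in counts.items()}
        # Errors: instances not in the majority class of their branch.
        errors = sum(sum(cnt.values()) - max(cnt.values())
                     for cnt in counts.values())
        if best is None or errors < best[0]:
            best = (errors, a, rules)
    return best

# The 14-instance weather data: (Outlook, Temp, Humidity, Windy) -> Play
weather = [
    ("Sunny", "Hot", "High", False), ("Sunny", "Hot", "High", True),
    ("Overcast", "Hot", "High", False), ("Rainy", "Mild", "High", False),
    ("Rainy", "Cool", "Normal", False), ("Rainy", "Cool", "Normal", True),
    ("Overcast", "Cool", "Normal", True), ("Sunny", "Mild", "High", False),
    ("Sunny", "Cool", "Normal", False), ("Rainy", "Mild", "Normal", False),
    ("Sunny", "Mild", "Normal", True), ("Overcast", "Mild", "High", True),
    ("Overcast", "Hot", "Normal", False), ("Rainy", "Mild", "High", True),
]
play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
        "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

errors, attr, rules = one_r(weather, play)
# 1R picks Outlook (attribute 0) with 4/14 errors, matching the
# evaluation table on the next slide.
```
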

Page 9:

Evaluating the weather attributes

Attribute    Rules             Errors   Total errors
Outlook      Sunny → No        2/5      4/14
             Overcast → Yes    0/4
             Rainy → Yes       2/5
Temp         Hot → No*         2/4      5/14
             Mild → Yes        2/6
             Cool → Yes        1/4
Humidity     High → No         3/7      4/14
             Normal → Yes      1/7
Windy        False → Yes       2/8      5/14
             True → No*        3/6

(* indicates a tie, broken arbitrarily)

The weather data:

Outlook    Temp  Humidity  Windy  Play
Sunny      Hot   High      False  No
Sunny      Hot   High      True   No
Overcast   Hot   High      False  Yes
Rainy      Mild  High      False  Yes
Rainy      Cool  Normal    False  Yes
Rainy      Cool  Normal    True   No
Overcast   Cool  Normal    True   Yes
Sunny      Mild  High      False  No
Sunny      Cool  Normal    False  Yes
Rainy      Mild  Normal    False  Yes
Sunny      Mild  Normal    True   Yes
Overcast   Mild  High      True   Yes
Overcast   Hot   Normal    False  Yes
Rainy      Mild  High      True   No

Page 10:

Dealing with numeric attributes

Discretize numeric attributes: divide each attribute's range into intervals
• Sort instances according to the attribute's values
• Place breakpoints where the class changes (majority class)
• This minimizes the total error

Example: temperature from the weather data

64 65 68 69 70 71 72 72 75 75 80 81 83 85
Yes | No | Yes Yes Yes | No No Yes | Yes Yes | No | Yes Yes | No

Outlook    Temperature  Humidity  Windy  Play
Sunny      85           85        False  No
Sunny      80           90        True   No
Overcast   83           86        False  Yes
Rainy      75           80        False  Yes
…          …            …         …      …

Page 11:

The problem of overfitting

This procedure is very sensitive to noise
• One instance with an incorrect class label will probably produce a separate interval
• Also: a time stamp attribute will have zero errors

Simple solution: enforce a minimum number of instances in the majority class per interval

Example (with min = 3):

64 65 68 69 70 71 72 72 75 75 80 81 83 85
Yes | No | Yes Yes Yes | No No Yes | Yes Yes | No | Yes Yes | No

64 65 68 69 70 71 72 72 75 75 80 81 83 85
Yes No Yes Yes Yes | No No Yes Yes Yes | No Yes Yes No
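The procedure above can be sketched in Python (a sketch under my own reading of the slide: the helper name and the exact merging details are assumptions, not from the source):

```python
from collections import Counter

def discretize_1r(values, classes, min_count=3):
    """Partition the sorted values at class changes, but only close an
    interval once its majority class has at least `min_count` instances;
    then merge adjacent intervals that predict the same class."""
    pairs = sorted(zip(values, classes))
    n, i, intervals = len(pairs), 0, []      # intervals: (end index, class)
    while i < n:
        counts, j = Counter(), i
        # Grow until some class reaches min_count (or the data runs out) ...
        while j < n and (not counts or max(counts.values()) < min_count):
            counts[pairs[j][1]] += 1
            j += 1
        majority = counts.most_common(1)[0][0]
        # ... keep absorbing instances of that same majority class ...
        while j < n and pairs[j][1] == majority:
            j += 1
        # ... and never split between identical attribute values.
        while j < n and pairs[j][0] == pairs[j - 1][0]:
            j += 1
        intervals.append((j, majority))
        i = j
    merged = []                              # merge same-class neighbours
    for end, cls in intervals:
        if merged and merged[-1][1] == cls:
            merged[-1] = (end, cls)
        else:
            merged.append((end, cls))
    # Breakpoints lie halfway between neighbouring intervals.
    cuts = [(pairs[end - 1][0] + pairs[end][0]) / 2 for end, _ in merged[:-1]]
    return cuts, [cls for _, cls in merged]

temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
play = ["Yes", "No", "Yes", "Yes", "Yes", "No", "No",
        "Yes", "Yes", "Yes", "No", "Yes", "Yes", "No"]
cuts, labels = discretize_1r(temps, play)
# One breakpoint at 77.5: temperature <= 77.5 -> Yes, > 77.5 -> No
# (the last interval is a tie, broken here by insertion order).
```
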

Page 12:

With overfitting avoidance

Resulting rule set:

Attribute     Rules                      Errors   Total errors
Temperature   ≤ 77.5 → Yes               3/10     5/14
              > 77.5 → No*               2/4
Humidity      ≤ 82.5 → Yes               1/7      3/14
              > 82.5 and ≤ 95.5 → No     2/6
              > 95.5 → Yes               0/1
Outlook       Sunny → No                 2/5      4/14
              Overcast → Yes             0/4
              Rainy → Yes                2/5
Windy         False → Yes                2/8      5/14
              True → No*                 3/6

Page 13:

Discussion of 1R

1R was described in a paper by Holte (1993):
• Contains an experimental evaluation on 16 datasets (using cross-validation so that results were representative of performance on future data)
• The minimum number of instances was set to 6 after some experimentation
• 1R's simple rules performed not much worse than much more complex decision trees

Simplicity first pays off!

Robert C. Holte, "Very Simple Classification Rules Perform Well on Most Commonly Used Datasets", Computer Science Department, University of Ottawa

Page 14:

Discussion of 1R: Hyperpipes

Another simple technique: build one rule for each class
• Each rule is a conjunction of tests, one for each attribute
• For numeric attributes: the test checks whether the instance's value lies inside an interval
  (the interval is given by the minimum and maximum observed in the training data)
• For nominal attributes: the test checks whether the value is one of a subset of attribute values
  (the subset is given by all values observed in the training data)
• The class with the most matching tests is predicted
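The bullets above translate into a few lines of Python (a sketch; the function names, the `numeric` flag list, and the toy data are my own illustrations, not from the slides):

```python
def fit_hyperpipes(instances, classes, numeric):
    """One 'pipe' per class: for each numeric attribute the observed
    min-max interval, for each nominal attribute the set of observed values."""
    pipes = {}
    for x, c in zip(instances, classes):
        pipe = pipes.setdefault(c, [None] * len(x))
        for a, v in enumerate(x):
            if numeric[a]:
                lo, hi = pipe[a] if pipe[a] else (v, v)
                pipe[a] = (min(lo, v), max(hi, v))
            else:
                pipe[a] = (pipe[a] if pipe[a] else set()) | {v}
    return pipes

def predict(pipes, numeric, x):
    """Predict the class whose pipe passes the most attribute tests."""
    def n_matches(pipe):
        return sum((pipe[a][0] <= v <= pipe[a][1]) if numeric[a]
                   else (v in pipe[a])
                   for a, v in enumerate(x))
    return max(pipes, key=lambda c: n_matches(pipes[c]))

# Toy data: one numeric and one nominal attribute (hypothetical example).
numeric = [True, False]
train = [(1.0, "a"), (2.0, "a"), (10.0, "b"), (11.0, "b")]
labels = ["low", "low", "high", "high"]
pipes = fit_hyperpipes(train, labels, numeric)
pred = predict(pipes, numeric, (1.5, "a"))
```
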

Page 15:

Statistical modeling

"Opposite" of 1R: use all the attributes

Two assumptions: attributes are
• equally important
• statistically independent (given the class value)

I.e., knowing the value of one attribute says nothing about the value of another (if the class is known)

The independence assumption is never correct! But ... this scheme works well in practice.

Page 16:

Probabilities for weather data

Outlook        Yes       No
  Sunny        2   2/9   3   3/5
  Overcast     4   4/9   0   0/5
  Rainy        3   3/9   2   2/5

Temperature    Yes       No
  Hot          2   2/9   2   2/5
  Mild         4   4/9   2   2/5
  Cool         3   3/9   1   1/5

Humidity       Yes       No
  High         3   3/9   4   4/5
  Normal       6   6/9   1   1/5

Windy          Yes       No
  False        6   6/9   2   2/5
  True         3   3/9   3   3/5

Play           Yes       No
               9   9/14  5   5/14

Counted from the weather data:

Outlook    Temp  Humidity  Windy  Play
Sunny      Hot   High      False  No
Sunny      Hot   High      True   No
Overcast   Hot   High      False  Yes
Rainy      Mild  High      False  Yes
Rainy      Cool  Normal    False  Yes
Rainy      Cool  Normal    True   No
Overcast   Cool  Normal    True   Yes
Sunny      Mild  High      False  No
Sunny      Cool  Normal    False  Yes
Rainy      Mild  Normal    False  Yes
Sunny      Mild  Normal    True   Yes
Overcast   Mild  High      True   Yes
Overcast   Hot   Normal    False  Yes
Rainy      Mild  High      True   No

Page 17:

Probabilities for weather data
(probability table repeated from the previous slide)

A new day:

Outlook  Temp.  Humidity  Windy  Play
Sunny    Cool   High      True   ?

Likelihood of the two classes:

For "yes" = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
For "no"  = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206

Conversion into a probability by normalization:

P("yes") = 0.0053 / (0.0053 + 0.0206) = 0.205
P("no")  = 0.0206 / (0.0053 + 0.0206) = 0.795
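The likelihood computation above can be reproduced with a short Python sketch (no Laplace smoothing, in line with the slide; the function name and data layout are my own):

```python
from collections import Counter

# The 14-instance weather data: (Outlook, Temp, Humidity, Windy) -> Play
weather = [
    ("Sunny", "Hot", "High", False), ("Sunny", "Hot", "High", True),
    ("Overcast", "Hot", "High", False), ("Rainy", "Mild", "High", False),
    ("Rainy", "Cool", "Normal", False), ("Rainy", "Cool", "Normal", True),
    ("Overcast", "Cool", "Normal", True), ("Sunny", "Mild", "High", False),
    ("Sunny", "Cool", "Normal", False), ("Rainy", "Mild", "Normal", False),
    ("Sunny", "Mild", "Normal", True), ("Overcast", "Mild", "High", True),
    ("Overcast", "Hot", "Normal", False), ("Rainy", "Mild", "High", True),
]
play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
        "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

def naive_bayes(instance, data, classes):
    """Score each class by prior × product of per-attribute conditional
    relative frequencies, then normalize so the scores sum to 1."""
    class_counts = Counter(classes)
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / len(classes)                       # Pr[class]
        for a, v in enumerate(instance):
            match = sum(1 for x, y in zip(data, classes)
                        if y == c and x[a] == v)
            score *= match / n_c                         # Pr[attr = v | class]
        scores[c] = score
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

p = naive_bayes(("Sunny", "Cool", "High", True), weather, play)
# p["Yes"] ≈ 0.205 and p["No"] ≈ 0.795, as computed above.
```
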

Page 18:

Bayes's rule

Probability of event H given evidence E:

    Pr[H | E] = Pr[E | H] × Pr[H] / Pr[E]

A priori probability of H, Pr[H]: probability of the event before evidence is seen
A posteriori probability of H, Pr[H | E]: probability of the event after evidence is seen

Thomas Bayes
Born: 1702 in London, England
Died: 1761 in Tunbridge Wells, Kent, England

Page 19:

Naïve Bayes for classification

Classification learning: what is the probability of the class given an instance?
• Evidence E = instance
• Event H = class value for the instance

Naïve assumption: the evidence splits into parts (i.e. attributes) that are independent given the class, so:

    Pr[H | E] = Pr[E1 | H] × Pr[E2 | H] × … × Pr[En | H] × Pr[H] / Pr[E]

Page 20:

Weather data example

Evidence E, a new day:

Outlook  Temp.  Humidity  Windy  Play
Sunny    Cool   High      True   ?

Probability of class "yes":

    Pr[yes | E] = Pr[Outlook = Sunny | yes]
                × Pr[Temperature = Cool | yes]
                × Pr[Humidity = High | yes]
                × Pr[Windy = True | yes]
                × Pr[yes] / Pr[E]
                = (2/9 × 3/9 × 3/9 × 3/9 × 9/14) / Pr[E]

Page 21:

Cognitive Assignment II

Understand and master one data mining method from the various literature

Summarize it in detail in the form of slides, covering the definition, the development of the method, the steps of the algorithm, its application to a case study, and create or find Java code (executing the example case)

Present it in class at the next lecture session, in plain language

Page 22:

Choice of Algorithm or Method

1. Neural Network
2. Support Vector Machine
3. Naive Bayes
4. K-Nearest Neighbor
5. CART
6. Linear Discriminant Analysis
7. Agglomerative Clustering
8. Support Vector Regression
9. Expectation Maximization
10. C4.5
11. K-Means
12. Self-Organizing Map
13. FP-Growth
14. Apriori
15. Logistic Regression
16. Random Forest
17. K-Medoids
18. Radial Basis Function
19. Fuzzy C-Means
20. K*
21. Support Vector Clustering
22. OneR

Page 23:

References
1. Ian H. Witten, Eibe Frank, and Mark A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, 3rd Edition, Elsevier, 2011
2. Daniel T. Larose, Discovering Knowledge in Data: An Introduction to Data Mining, John Wiley & Sons, 2005
3. Florin Gorunescu, Data Mining: Concepts, Models and Techniques, Springer, 2011
4. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 2nd Edition, Elsevier, 2006
5. Oded Maimon and Lior Rokach, Data Mining and Knowledge Discovery Handbook, 2nd Edition, Springer, 2010
6. Warren Liao and Evangelos Triantaphyllou (eds.), Recent Advances in Data Mining of Enterprise Data: Algorithms and Applications, World Scientific, 2007

