
Data Mining: Methods and Algorithms

Romi Satria Wahono
romi@romisatriawahono.net
http://romisatriawahono.net
+6281586220090

Romi Satria Wahono

SD Sompok Semarang (1987)
SMPN 8 Semarang (1990)
SMA Taruna Nusantara, Magelang (1993)
S1, S2 and S3 (on leave), Department of Computer Sciences, Saitama University, Japan (1994-2004)

Research Interests: Software Engineering and Intelligent Systems

Founder of IlmuKomputer.Com
Researcher at LIPI (2004-2007)
Founder and CEO of PT Brainmatics Cipta Informatika

Course Outline
1. Introduction to Data Mining
2. The Data Mining Process
3. Evaluation and Validation in Data Mining
4. Data Mining Methods and Algorithms
5. Data Mining Research

Methods and Algorithms

Methods and Algorithms
1. Inferring rudimentary rules
2. Statistical modeling
3. Constructing decision trees
4. Constructing rules
5. Association rule learning
6. Linear models
7. Instance-based learning
8. Clustering

Simplicity first

Simple algorithms often work very well! There are many kinds of simple structure, e.g.:
• One attribute does all the work
• All attributes contribute equally and independently
• A weighted linear combination might do
• Instance-based: use a few prototypes
• Use simple logical rules

Success of a method depends on the domain.

Inferring rudimentary rules

1R: learns a one-level decision tree, i.e., a set of rules that all test one particular attribute.

Basic version:
• One branch for each value
• Each branch assigns the most frequent class
• Error rate: proportion of instances that don't belong to the majority class of their corresponding branch
• Choose the attribute with the lowest error rate (assumes nominal attributes)

Pseudo-code for 1R (note: "missing" is treated as a separate attribute value):

For each attribute,
    For each value of the attribute, make a rule as follows:
        count how often each class appears
        find the most frequent class
        make the rule assign that class to this attribute-value
    Calculate the error rate of the rules
Choose the rules with the smallest error rate
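A compact Java sketch of this procedure (class and method names are mine, not from the slides), simplified to return only the index of the winning attribute:

    import java.util.*;

    // Minimal 1R for nominal attributes: pick the single attribute whose
    // one-level rules make the fewest errors on the training data.
    public class OneR {

        // data[i] = attribute values of instance i; labels[i] = its class.
        public static int bestAttribute(String[][] data, String[] labels) {
            int bestAttr = -1, bestErrors = Integer.MAX_VALUE;
            for (int a = 0; a < data[0].length; a++) {
                // Count how often each class appears for each value of attribute a.
                Map<String, Map<String, Integer>> counts = new HashMap<>();
                for (int i = 0; i < data.length; i++) {
                    counts.computeIfAbsent(data[i][a], v -> new HashMap<>())
                          .merge(labels[i], 1, Integer::sum);
                }
                // Each value predicts its most frequent class;
                // everything else in that branch is an error.
                int errors = 0;
                for (Map<String, Integer> classCounts : counts.values()) {
                    int total = 0, max = 0;
                    for (int c : classCounts.values()) {
                        total += c;
                        max = Math.max(max, c);
                    }
                    errors += total - max;
                }
                if (errors < bestErrors) {
                    bestErrors = errors;
                    bestAttr = a;
                }
            }
            return bestAttr;  // attribute with the lowest error rate
        }
    }

On the weather data evaluated below, Outlook and Humidity tie at 4/14 errors, so either may be returned.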

Evaluating the weather attributes

Attribute   Rules             Errors   Total errors
Outlook     Sunny → No        2/5      4/14
            Overcast → Yes    0/4
            Rainy → Yes       2/5
Temp        Hot → No*         2/4      5/14
            Mild → Yes        2/6
            Cool → Yes        1/4
Humidity    High → No         3/7      4/14
            Normal → Yes      1/7
Windy       False → Yes       2/8      5/14
            True → No*        3/6

(* a tie between Yes and No, broken arbitrarily.) Outlook and Humidity tie for the fewest total errors (4/14), so 1R picks one of them arbitrarily.

The weather data:

Outlook    Temp   Humidity   Windy   Play
Sunny      Hot    High       False   No
Sunny      Hot    High       True    No
Overcast   Hot    High       False   Yes
Rainy      Mild   High       False   Yes
Rainy      Cool   Normal     False   Yes
Rainy      Cool   Normal     True    No
Overcast   Cool   Normal     True    Yes
Sunny      Mild   High       False   No
Sunny      Cool   Normal     False   Yes
Rainy      Mild   Normal     False   Yes
Sunny      Mild   Normal     True    Yes
Overcast   Mild   High       True    Yes
Overcast   Hot    Normal     False   Yes
Rainy      Mild   High       True    No

Dealing with numeric attributes

Discretize numeric attributes: divide each attribute's range into intervals.
• Sort instances according to the attribute's values
• Place breakpoints where the class changes (the majority class)
• This minimizes the total error

Example: temperature from the weather data (a Java sketch of this step follows the table below):

64  65  68  69  70  71  72  72  75  75  80  81  83  85
Yes | No | Yes Yes Yes | No No Yes | Yes Yes | No | Yes Yes | No

Outlook    Temperature   Humidity   Windy   Play
Sunny      85            85         False   No
Sunny      80            90         True    No
Overcast   83            86         False   Yes
Rainy      75            80         False   Yes
…          …             …          …       …
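A minimal Java sketch of the breakpoint step (class and method names are mine, not from the slides). A class change that falls inside a run of equal values cannot be cut there, so the cut is deferred to the next position where the values actually differ:

    import java.util.*;

    public class Discretizer {
        // values must be sorted ascending; labels[i] is the class of values[i].
        // Returns breakpoints placed halfway between adjacent distinct values.
        static List<Double> breakpoints(double[] values, String[] labels) {
            List<Double> cuts = new ArrayList<>();
            boolean pending = false;  // class changed, cut not yet placed
            for (int i = 1; i < values.length; i++) {
                if (!labels[i].equals(labels[i - 1])) pending = true;
                // A cut is only legal between strictly different values.
                if (pending && values[i] > values[i - 1]) {
                    cuts.add((values[i - 1] + values[i]) / 2.0);
                    pending = false;
                }
            }
            return cuts;
        }
    }

Applied to the sorted temperature column, this yields cuts at 64.5, 66.5, 70.5, 73.5, 77.5, 80.5, and 84.0, reproducing the eight intervals shown above.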

The problem of overfitting

This procedure is very sensitive to noise:
• One instance with an incorrect class label will probably produce a separate interval
• Also: a time stamp attribute would have zero errors

Simple solution: enforce a minimum number of instances in the majority class per interval.

Example (with min = 3; the sketch below extends the breakpoint code with this constraint):

64  65  68  69  70  71  72  72  75  75  80  81  83  85
Yes | No | Yes Yes Yes | No No Yes | Yes Yes | No | Yes Yes | No

64  65  68  69  70  71  72  72  75  75  80  81  83  85
Yes No Yes Yes Yes | No No Yes Yes Yes | No Yes Yes No
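The same sketch with the minimum-count constraint added (again, naming is mine): an interval may only be closed once its majority class holds at least minSize instances.

    import java.util.*;

    public class DiscretizerMin {
        // Like the basic version, but an interval closes only when its
        // majority class has at least minSize instances (and values differ).
        static List<Double> breakpoints(double[] values, String[] labels, int minSize) {
            List<Double> cuts = new ArrayList<>();
            Map<String, Integer> counts = new HashMap<>();  // class counts in interval
            int majority = 0;                               // size of its majority class
            for (int i = 0; i < values.length; i++) {
                majority = Math.max(majority, counts.merge(labels[i], 1, Integer::sum));
                boolean changeAhead = i + 1 < values.length
                        && !labels[i + 1].equals(labels[i])
                        && values[i + 1] > values[i];
                if (changeAhead && majority >= minSize) {
                    cuts.add((values[i] + values[i + 1]) / 2.0);
                    counts.clear();   // start a new interval
                    majority = 0;
                }
            }
            return cuts;
        }
    }

With minSize = 3 this produces cuts at 70.5 and 77.5, matching the partition above. The rule set on the next slide additionally merges adjacent intervals whose majority class agrees, which is why only the 77.5 cut survives for Temperature.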

With overfitting avoidance, the resulting rule set is:

Attribute     Rules                      Errors   Total errors
Outlook       Sunny → No                 2/5      4/14
              Overcast → Yes             0/4
              Rainy → Yes                2/5
Temperature   ≤ 77.5 → Yes               3/10     5/14
              > 77.5 → No*               2/4
Humidity      ≤ 82.5 → Yes               1/7      3/14
              > 82.5 and ≤ 95.5 → No     2/6
              > 95.5 → Yes               0/1
Windy         False → Yes                2/8      5/14
              True → No*                 3/6

Humidity now has the fewest total errors (3/14), so 1R's final rule tests Humidity.

Discussion of 1R

1R was described in a paper by Holte (1993):
• Contains an experimental evaluation on 16 datasets (using cross-validation so that results were representative of performance on future data)
• Minimum number of instances was set to 6 after some experimentation
• 1R's simple rules performed not much worse than much more complex decision trees

Simplicity first pays off!

"Very Simple Classification Rules Perform Well on Most Commonly Used Datasets", Robert C. Holte, Computer Science Department, University of Ottawa

Discussion of 1R: Hyperpipes

Another simple technique: build one rule for each class.
• Each rule is a conjunction of tests, one for each attribute
• For numeric attributes: the test checks whether the instance's value is inside an interval; the interval is given by the minimum and maximum observed in the training data
• For nominal attributes: the test checks whether the value is one of a subset of attribute values; the subset is given by all possible values observed in the training data
• The class with the most matching tests is predicted
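A compact Java sketch of hyperpipes for nominal attributes (the numeric min/max bounds are omitted for brevity; all names are mine):

    import java.util.*;

    // One "rule" per class: for each attribute, the set of values observed
    // with that class in the training data. Prediction = class whose rule
    // matches the most attribute tests.
    public class Hyperpipes {
        private final Map<String, List<Set<String>>> bounds = new HashMap<>();

        public void train(String[][] data, String[] labels) {
            int numAttrs = data[0].length;
            for (int i = 0; i < data.length; i++) {
                List<Set<String>> rule = bounds.computeIfAbsent(labels[i], c -> {
                    List<Set<String>> r = new ArrayList<>();
                    for (int a = 0; a < numAttrs; a++) r.add(new HashSet<>());
                    return r;
                });
                for (int a = 0; a < numAttrs; a++) rule.get(a).add(data[i][a]);
            }
        }

        public String predict(String[] instance) {
            String best = null;
            int bestMatches = -1;
            for (Map.Entry<String, List<Set<String>>> e : bounds.entrySet()) {
                int matches = 0;
                for (int a = 0; a < instance.length; a++)
                    if (e.getValue().get(a).contains(instance[a])) matches++;
                if (matches > bestMatches) { bestMatches = matches; best = e.getKey(); }
            }
            return best;  // class with the most matching tests
        }
    }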

Statistical modeling

"Opposite" of 1R: use all the attributes.

Two assumptions: attributes are
• equally important
• statistically independent (given the class value)

I.e., knowing the value of one attribute says nothing about the value of another (if the class is known).

The independence assumption is never correct! But this scheme works well in practice.

Probabilities for weather data (counts on the left, relative frequencies on the right):

               Yes   No        Yes    No
Outlook
  Sunny         2     3        2/9    3/5
  Overcast      4     0        4/9    0/5
  Rainy         3     2        3/9    2/5
Temperature
  Hot           2     2        2/9    2/5
  Mild          4     2        4/9    2/5
  Cool          3     1        3/9    1/5
Humidity
  High          3     4        3/9    4/5
  Normal        6     1        6/9    1/5
Windy
  False         6     2        6/9    2/5
  True          3     3        3/9    3/5
Play
  (any day)     9     5        9/14   5/14

(The counts are tallied from the weather data table shown earlier.)



A new day, to be classified using the probabilities above:

Outlook   Temp.   Humidity   Windy   Play
Sunny     Cool    High       True    ?

Likelihood of the two classes:

For "yes" = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
For "no"  = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206

Conversion into a probability by normalization:

P("yes") = 0.0053 / (0.0053 + 0.0206) = 0.205
P("no")  = 0.0206 / (0.0053 + 0.0206) = 0.795

Bayes's rule

Probability of event H given evidence E:

$$\Pr[H \mid E] = \frac{\Pr[E \mid H]\,\Pr[H]}{\Pr[E]}$$

A priori probability of H, $\Pr[H]$: the probability of the event before evidence is seen.
A posteriori probability of H, $\Pr[H \mid E]$: the probability of the event after evidence is seen.

Thomas Bayes. Born: 1702 in London, England. Died: 1761 in Tunbridge Wells, Kent, England.

Naïve Bayes for classification

Classification learning: what's the probability of the class given an instance?
• Evidence E = instance
• Event H = class value for the instance

Naïve assumption: evidence splits into parts (i.e., attributes) that are independent:

$$\Pr[H \mid E] = \frac{\Pr[E_1 \mid H]\,\Pr[E_2 \mid H] \cdots \Pr[E_n \mid H]\,\Pr[H]}{\Pr[E]}$$

Weather data example

Evidence E:

Outlook   Temp.   Humidity   Windy   Play
Sunny     Cool    High       True    ?

Probability of class "yes":

$$\Pr[\mathit{yes} \mid E] = \Pr[\mathit{Outlook}=\mathit{Sunny} \mid \mathit{yes}] \times \Pr[\mathit{Temperature}=\mathit{Cool} \mid \mathit{yes}] \times \Pr[\mathit{Humidity}=\mathit{High} \mid \mathit{yes}] \times \Pr[\mathit{Windy}=\mathit{True} \mid \mathit{yes}] \times \frac{\Pr[\mathit{yes}]}{\Pr[E]} = \frac{\tfrac{2}{9} \times \tfrac{3}{9} \times \tfrac{3}{9} \times \tfrac{3}{9} \times \tfrac{9}{14}}{\Pr[E]}$$
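To tie the example together, here is a self-contained Java sketch (all names are mine) that takes the counts straight from the weather data and scores the new day; it reproduces the 0.205 / 0.795 result above. Like the slide's calculation, it uses raw relative frequencies without smoothing:

    import java.util.*;

    public class NaiveBayesWeather {
        public static void main(String[] args) {
            // The 14 weather instances: outlook, temperature, humidity, windy.
            String[][] data = {
                {"Sunny","Hot","High","False"},     {"Sunny","Hot","High","True"},
                {"Overcast","Hot","High","False"},  {"Rainy","Mild","High","False"},
                {"Rainy","Cool","Normal","False"},  {"Rainy","Cool","Normal","True"},
                {"Overcast","Cool","Normal","True"},{"Sunny","Mild","High","False"},
                {"Sunny","Cool","Normal","False"},  {"Rainy","Mild","Normal","False"},
                {"Sunny","Mild","Normal","True"},   {"Overcast","Mild","High","True"},
                {"Overcast","Hot","Normal","False"},{"Rainy","Mild","High","True"}};
            String[] labels = {"No","No","Yes","Yes","Yes","No","Yes",
                               "No","Yes","Yes","Yes","Yes","Yes","No"};
            String[] newDay = {"Sunny","Cool","High","True"};

            Map<String, Double> score = new HashMap<>();
            for (String cls : new String[]{"Yes", "No"}) {
                int classCount = 0;
                for (String l : labels) if (l.equals(cls)) classCount++;
                double likelihood = (double) classCount / labels.length;  // prior Pr[H]
                for (int a = 0; a < newDay.length; a++) {
                    int match = 0;  // instances of cls sharing the new day's value of attribute a
                    for (int i = 0; i < data.length; i++)
                        if (labels[i].equals(cls) && data[i][a].equals(newDay[a])) match++;
                    likelihood *= (double) match / classCount;  // Pr[E_a | H]
                }
                score.put(cls, likelihood);
            }
            double sum = score.get("Yes") + score.get("No");  // normalization
            System.out.printf("P(yes) = %.3f, P(no) = %.3f%n",
                              score.get("Yes") / sum, score.get("No") / sum);
            // Prints: P(yes) = 0.205, P(no) = 0.795
        }
    }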

Cognitive Assignment II

Understand and master one data mining method from the various literature.

Summarize it in detail in slide form, covering the definition, the development of the method, the steps of the algorithm, and its application to a case study, and create or find Java code for it (execute an example case).

Present it in front of the class at the next course meeting, in plain human language.

Choice of Algorithm or Method
1. Neural Network
2. Support Vector Machine
3. Naive Bayes
4. K-Nearest Neighbor
5. CART
6. Linear Discriminant Analysis
7. Agglomerative Clustering
8. Support Vector Regression
9. Expectation Maximization
10. C4.5
11. K-Means
12. Self-Organizing Map
13. FP-Growth
14. Apriori
15. Logistic Regression
16. Random Forest
17. K-Medoids
18. Radial Basis Function
19. Fuzzy C-Means
20. K*
21. Support Vector Clustering
22. OneR

References
1. Ian H. Witten, Eibe Frank, Mark A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, 3rd Edition, Elsevier, 2011
2. Daniel T. Larose, Discovering Knowledge in Data: An Introduction to Data Mining, John Wiley & Sons, 2005
3. Florin Gorunescu, Data Mining: Concepts, Models and Techniques, Springer, 2011
4. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 2nd Edition, Elsevier, 2006
5. Oded Maimon and Lior Rokach, Data Mining and Knowledge Discovery Handbook, 2nd Edition, Springer, 2010
6. Warren Liao and Evangelos Triantaphyllou (eds.), Recent Advances in Data Mining of Enterprise Data: Algorithms and Applications, World Scientific, 2007