Data Mining: Methods and Algorithms
Romi Satria Wahono
[email protected] | http://romisatriawahono.net
+6281586220090
Romi Satria Wahono
• SD Sompok Semarang (1987)
• SMPN 8 Semarang (1990)
• SMA Taruna Nusantara, Magelang (1993)
• S1, S2, and S3 (on leave), Department of Computer Sciences, Saitama University, Japan (1994-2004)
• Research interests: Software Engineering and Intelligent Systems
• Founder of IlmuKomputer.Com
• Researcher at LIPI (2004-2007)
• Founder and CEO of PT Brainmatics Cipta Informatika
Course Outline
1. Introduction to Data Mining
2. The Data Mining Process
3. Evaluation and Validation in Data Mining
4. Data Mining Methods and Algorithms
5. Data Mining Research
Methods and Algorithms
Methods and Algorithms
1. Inferring rudimentary rules
2. Statistical modeling
3. Constructing decision trees
4. Constructing rules
5. Association rule learning
6. Linear models
7. Instance-based learning
8. Clustering
Simplicity first
Simple algorithms often work very well! There are many kinds of simple structure, e.g.:
• One attribute does all the work
• All attributes contribute equally and independently
• A weighted linear combination might do
• Instance-based: use a few prototypes
• Use simple logical rules
Success of a method depends on the domain.
Inferring rudimentary rules
1R: learns a 1-level decision tree, i.e., rules that all test one particular attribute.
Basic version:
• One branch for each value
• Each branch assigns the most frequent class
• Error rate: proportion of instances that don't belong to the majority class of their corresponding branch
• Choose the attribute with the lowest error rate
(assumes nominal attributes)
Pseudo-code for 1R
Note: "missing" is treated as a separate attribute value

For each attribute,
    For each value of the attribute, make a rule as follows:
        count how often each class appears
        find the most frequent class
        make the rule assign that class to this attribute value
    Calculate the error rate of the rules
Choose the rules with the smallest error rate
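The pseudo-code above can be sketched in Python as follows, run on the slides' weather dataset. This is an illustrative sketch, not the exact code from the slides; the function and variable names are our own.

```python
# A minimal sketch of the 1R pseudo-code, applied to the weather dataset
# from the slides. Names (one_r, by_value, ...) are our own choices.
from collections import Counter

def one_r(rows, attributes, target):
    """Return (attribute, rules, errors) for the attribute whose
    one-level rules make the fewest errors on the training data."""
    best = None
    for attr in attributes:
        # group the class labels by this attribute's values
        by_value = {}
        for row in rows:
            by_value.setdefault(row[attr], []).append(row[target])
        rules, errors = {}, 0
        for value, classes in by_value.items():
            majority, count = Counter(classes).most_common(1)[0]
            rules[value] = majority          # each branch assigns the most frequent class
            errors += len(classes) - count   # instances outside the majority class
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best

cols = ["Outlook", "Temp", "Humidity", "Windy", "Play"]
data = [
    ("Sunny", "Hot", "High", "False", "No"),
    ("Sunny", "Hot", "High", "True", "No"),
    ("Overcast", "Hot", "High", "False", "Yes"),
    ("Rainy", "Mild", "High", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "True", "No"),
    ("Overcast", "Cool", "Normal", "True", "Yes"),
    ("Sunny", "Mild", "High", "False", "No"),
    ("Sunny", "Cool", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "Normal", "False", "Yes"),
    ("Sunny", "Mild", "Normal", "True", "Yes"),
    ("Overcast", "Mild", "High", "True", "Yes"),
    ("Overcast", "Hot", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "High", "True", "No"),
]
rows = [dict(zip(cols, r)) for r in data]

attr, rules, errors = one_r(rows, cols[:-1], "Play")
# Outlook wins with 4/14 errors: Sunny -> No, Overcast -> Yes, Rainy -> Yes
```

On this data the attribute chosen is Outlook with 4/14 errors, matching the evaluation table below.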
Evaluating the weather attributes

Attribute    Rules              Errors   Total errors
Outlook      Sunny -> No        2/5      4/14
             Overcast -> Yes    0/4
             Rainy -> Yes       2/5
Temp         Hot -> No*         2/4      5/14
             Mild -> Yes        2/6
             Cool -> Yes        1/4
Humidity     High -> No         3/7      4/14
             Normal -> Yes      1/7
Windy        False -> Yes       2/8      5/14
             True -> No*        3/6

(* indicates a tie between two equally frequent classes, broken arbitrarily)
Outlook    Temp   Humidity   Windy   Play
Sunny      Hot    High       False   No
Sunny      Hot    High       True    No
Overcast   Hot    High       False   Yes
Rainy      Mild   High       False   Yes
Rainy      Cool   Normal     False   Yes
Rainy      Cool   Normal     True    No
Overcast   Cool   Normal     True    Yes
Sunny      Mild   High       False   No
Sunny      Cool   Normal     False   Yes
Rainy      Mild   Normal     False   Yes
Sunny      Mild   Normal     True    Yes
Overcast   Mild   High       True    Yes
Overcast   Hot    Normal     False   Yes
Rainy      Mild   High       True    No
Dealing with numeric attributes
Discretize numeric attributes: divide each attribute's range into intervals.
• Sort instances according to the attribute's values
• Place breakpoints where the (majority) class changes
• This minimizes the total error

Example: temperature from the weather data

64 65 68 69 70 71 72 72 75 75 80 81 83 85
Yes | No | Yes Yes Yes | No No Yes | Yes Yes | No | Yes Yes | No
Outlook    Temperature   Humidity   Windy   Play
Sunny      85            85         False   No
Sunny      80            90         True    No
Overcast   83            86         False   Yes
Rainy      75            80         False   Yes
…
The problem of overfitting
This procedure is very sensitive to noise:
• One instance with an incorrect class label will probably produce a separate interval
• Also: a time-stamp attribute will have zero errors
Simple solution: enforce a minimum number of instances in the majority class per interval.

Example (with min = 3):

64 65 68 69 70 71 72 72 75 75 80 81 83 85
Yes | No | Yes Yes Yes | No No Yes | Yes Yes | No | Yes Yes | No

64 65 68 69 70 71 72 72 75 75 80 81 83 85
Yes No Yes Yes Yes | No No Yes Yes Yes | No Yes Yes No
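The min = 3 partition above can be sketched in code: an interval is closed once its majority class has at least `min_count` members and the next label belongs to a different class. This is a best-effort reading of the slides' procedure, not the exact algorithm from the book.

```python
# A sketch of discretization with a minimum majority-class count per
# interval (a best-effort reading of the slides, names are our own).
from collections import Counter

def partition(labels, min_count=3):
    """Split an already-sorted sequence of class labels into intervals."""
    intervals, current = [], []
    for i, label in enumerate(labels):
        current.append(label)
        majority, count = Counter(current).most_common(1)[0]
        # close the interval once the majority class has min_count members
        # and the next label differs from that majority
        if (count >= min_count and i + 1 < len(labels)
                and labels[i + 1] != majority):
            intervals.append(current)
            current = []
    if current:
        intervals.append(current)
    return intervals

labels = ["Yes", "No", "Yes", "Yes", "Yes", "No", "No",
          "Yes", "Yes", "Yes", "No", "Yes", "Yes", "No"]
# Reproduces the slide's partition:
# Yes No Yes Yes Yes | No No Yes Yes Yes | No Yes Yes No
parts = partition(labels, min_count=3)
```

Adjacent intervals with the same majority class would then be merged, which is how the two-interval Temperature rule of the resulting rule set arises.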
With overfitting avoidance
Resulting rule set:

Attribute     Rules                       Errors   Total errors
Temperature   <= 77.5 -> Yes              3/10     5/14
              > 77.5 -> No*               2/4
Humidity      <= 82.5 -> Yes              1/7      3/14
              > 82.5 and <= 95.5 -> No    2/6
              > 95.5 -> Yes               0/1
Outlook       Sunny -> No                 2/5      4/14
              Overcast -> Yes             0/4
              Rainy -> Yes                2/5
Windy         False -> Yes                2/8      5/14
              True -> No*                 3/6
Discussion of 1R
1R was described in a paper by Holte (1993):
• Contains an experimental evaluation on 16 datasets (using cross-validation so that results were representative of performance on future data)
• Minimum number of instances was set to 6 after some experimentation
• 1R's simple rules performed not much worse than much more complex decision trees
Simplicity first pays off!

"Very Simple Classification Rules Perform Well on Most Commonly Used Datasets", Robert C. Holte, Computer Science Department, University of Ottawa
Discussion of 1R: Hyperpipes
Another simple technique: build one rule for each class.
• Each rule is a conjunction of tests, one for each attribute
• For numeric attributes: the test checks whether the instance's value is inside an interval (the interval given by the minimum and maximum observed in the training data)
• For nominal attributes: the test checks whether the value is one of a subset of attribute values (the subset given by all values observed in the training data)
• The class with the most matching tests is predicted
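The per-class rules above can be sketched for nominal attributes as follows (a toy illustration with made-up helper names; numeric min/max intervals are omitted for brevity):

```python
# A minimal Hyperpipes sketch for nominal attributes: each class's rule
# records the set of attribute values seen in training; the class whose
# rule matches the most attributes of a test instance is predicted.
def train_hyperpipes(rows, attributes, target):
    rules = {}  # class -> {attribute: set of values seen in training}
    for row in rows:
        rule = rules.setdefault(row[target], {a: set() for a in attributes})
        for a in attributes:
            rule[a].add(row[a])
    return rules

def predict(rules, instance, attributes):
    def score(rule):
        # count how many attribute tests the instance passes
        return sum(instance[a] in rule[a] for a in attributes)
    return max(rules, key=lambda c: score(rules[c]))

rows = [
    {"Outlook": "Sunny", "Windy": "False", "Play": "No"},
    {"Outlook": "Sunny", "Windy": "True", "Play": "No"},
    {"Outlook": "Overcast", "Windy": "False", "Play": "Yes"},
    {"Outlook": "Rainy", "Windy": "False", "Play": "Yes"},
]
attrs = ["Outlook", "Windy"]
rules = train_hyperpipes(rows, attrs, "Play")
# Overcast/False matches "Yes" on both tests but "No" only on Windy
label = predict(rules, {"Outlook": "Overcast", "Windy": "False"}, attrs)  # "Yes"
```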
Statistical modeling
"Opposite" of 1R: use all the attributes.
Two assumptions: attributes are
• equally important
• statistically independent (given the class value)
I.e., knowing the value of one attribute says nothing about the value of another (if the class is known).
The independence assumption is never correct! But… this scheme works well in practice.
Probabilities for weather data

Attribute     Value      Yes    No
Outlook       Sunny      2/9    3/5
              Overcast   4/9    0/5
              Rainy      3/9    2/5
Temperature   Hot        2/9    2/5
              Mild       4/9    2/5
              Cool       3/9    1/5
Humidity      High       3/9    4/5
              Normal     6/9    1/5
Windy         False      6/9    2/5
              True       3/9    3/5
Play          (prior)    9/14   5/14
Probabilities for weather data
A new day:

Outlook   Temp.   Humidity   Windy   Play
Sunny     Cool    High       True    ?
Likelihood of the two classes:

For "yes" = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
For "no" = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206

Conversion into a probability by normalization:

P("yes") = 0.0053 / (0.0053 + 0.0206) = 0.205
P("no") = 0.0206 / (0.0053 + 0.0206) = 0.795
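The calculation above can be written out in a few lines of code, using the conditional probabilities and priors from the table (the dictionary layout is our own choice):

```python
# The slide's Naive Bayes calculation: multiply the per-attribute
# conditional probabilities and the class prior, then normalize.
probs = {
    "yes": {"prior": 9/14, "Outlook=Sunny": 2/9, "Temp=Cool": 3/9,
            "Humidity=High": 3/9, "Windy=True": 3/9},
    "no":  {"prior": 5/14, "Outlook=Sunny": 3/5, "Temp=Cool": 1/5,
            "Humidity=High": 4/5, "Windy=True": 3/5},
}

likelihood = {}
for cls, p in probs.items():
    value = p["prior"]
    for key, factor in p.items():
        if key != "prior":
            value *= factor  # independence: multiply per-attribute terms
    likelihood[cls] = value

total = sum(likelihood.values())
posterior = {cls: v / total for cls, v in likelihood.items()}
# likelihood: yes ~ 0.0053, no ~ 0.0206
# posterior:  yes ~ 0.205,  no ~ 0.795
```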
Bayes's rule
Probability of event H given evidence E:

  Pr[H | E] = Pr[E | H] Pr[H] / Pr[E]

A priori probability of H, Pr[H]: probability of the event before evidence is seen.
A posteriori probability of H, Pr[H | E]: probability of the event after evidence is seen.

Thomas Bayes. Born: 1702 in London, England. Died: 1761 in Tunbridge Wells, Kent, England.
Naïve Bayes for classification
Classification learning: what's the probability of the class given an instance?
• Evidence E = instance
• Event H = class value for instance
Naïve assumption: evidence splits into parts (i.e., attributes) that are independent:

  Pr[H | E] = Pr[E1 | H] Pr[E2 | H] … Pr[En | H] Pr[H] / Pr[E]
Weather data example

Evidence E (a new day):

Outlook   Temp.   Humidity   Windy   Play
Sunny     Cool    High       True    ?

Probability of class "yes":

  Pr[yes | E] = Pr[Outlook = Sunny | yes]
              × Pr[Temperature = Cool | yes]
              × Pr[Humidity = High | yes]
              × Pr[Windy = True | yes]
              × Pr[yes] / Pr[E]
              = (2/9 × 3/9 × 3/9 × 3/9 × 9/14) / Pr[E]
Cognitive Assignment II
• Understand and master one data mining method from the literature
• Summarize it in detail in slides, covering the definition, the development of the method, the steps of the algorithm, and its application to a case study; create or find Java code that runs the example case
• Present it in class in the next session, in plain language
Choice of Algorithm or Method
1. Neural Network
2. Support Vector Machine
3. Naive Bayes
4. K-Nearest Neighbor
5. CART
6. Linear Discriminant Analysis
7. Agglomerative Clustering
8. Support Vector Regression
9. Expectation Maximization
10. C4.5
11. K-Means
12. Self-Organizing Map
13. FP-Growth
14. Apriori
15. Logistic Regression
16. Random Forest
17. K-Medoids
18. Radial Basis Function
19. Fuzzy C-Means
20. K*
21. Support Vector Clustering
22. OneR
References
1. Ian H. Witten, Eibe Frank, Mark A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, 3rd Edition, Elsevier, 2011
2. Daniel T. Larose, Discovering Knowledge in Data: An Introduction to Data Mining, John Wiley & Sons, 2005
3. Florin Gorunescu, Data Mining: Concepts, Models and Techniques, Springer, 2011
4. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 2nd Edition, Elsevier, 2006
5. Oded Maimon and Lior Rokach, Data Mining and Knowledge Discovery Handbook, 2nd Edition, Springer, 2010
6. T. Warren Liao and Evangelos Triantaphyllou (eds.), Recent Advances in Data Mining of Enterprise Data: Algorithms and Applications, World Scientific, 2007