Data Mining: 4. Methods and Algorithms
Romi Satria Wahono
[email protected]
http://romisatriawahono.net/dm
WA/SMS: +6281586220090
Romi Satria Wahono
• SD Sompok Semarang (1987)
• SMPN 8 Semarang (1990)
• SMA Taruna Nusantara Magelang (1993)
• B.Eng, M.Eng and Ph.D in Software Engineering from Saitama University Japan (1994-2004) and Universiti Teknikal Malaysia Melaka (2014)
• Research Interests: Software Engineering, Machine Learning
• Founder and Coordinator of IlmuKomputer.Com
• Researcher at LIPI (2004-2007)
• Founder and CEO of PT Brainmatics Cipta Informatika
Course Outline
1. Introduction to Data Mining
2. The Data Mining Process
3. Evaluation and Validation in Data Mining
4. Data Mining Methods and Algorithms
5. Data Mining Research
4. Methods and Algorithms
1. Inferring rudimentary rules
2. Statistical modeling
3. Constructing decision trees
4. Constructing rules
5. Association rule learning
6. Linear models
7. Instance-based learning
8. Clustering
Simplicity first
• Simple algorithms often work very well!
• There are many kinds of simple structure, e.g.:
  • One attribute does all the work
  • All attributes contribute equally & independently
  • A weighted linear combination might do
  • Instance-based: use a few prototypes
  • Use simple logical rules
• Success of method depends on the domain
Inferring rudimentary rules
• 1R: learns a 1-level decision tree
  • I.e., rules that all test one particular attribute
• Basic version:
  • One branch for each value
  • Each branch assigns most frequent class
  • Error rate: proportion of instances that don't belong to the majority class of their corresponding branch
  • Choose attribute with lowest error rate (assumes nominal attributes)
Pseudo-code for 1R
• Note: “missing” is treated as a separate attribute value

For each attribute,
    For each value of the attribute, make a rule as follows:
        count how often each class appears
        find the most frequent class
        make the rule assign that class to this attribute-value
    Calculate the error rate of the rules
Choose the rules with the smallest error rate
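To make the pseudo-code concrete, here is a minimal Java sketch of the 1R attribute-selection step for nominal attributes (class and method names are ours, not from any particular library):

```java
import java.util.*;

public class OneR {
    // data[i] = nominal attribute values of instance i; labels[i] = its class.
    // Returns the index of the attribute whose 1-level rules make the fewest errors.
    public static int bestAttribute(String[][] data, String[] labels) {
        int best = -1;
        int bestErrors = Integer.MAX_VALUE;
        for (int a = 0; a < data[0].length; a++) {
            // Count how often each class appears for each value of attribute a.
            Map<String, Map<String, Integer>> counts = new HashMap<>();
            for (int i = 0; i < data.length; i++) {
                counts.computeIfAbsent(data[i][a], k -> new HashMap<>())
                      .merge(labels[i], 1, Integer::sum);
            }
            // Errors = instances that are not in the majority class of their branch.
            int errors = 0;
            for (Map<String, Integer> byClass : counts.values()) {
                int total = 0, majority = 0;
                for (int c : byClass.values()) {
                    total += c;
                    majority = Math.max(majority, c);
                }
                errors += total - majority;
            }
            if (errors < bestErrors) {
                bestErrors = errors;
                best = a;
            }
        }
        return best; // the rules then assign each branch its majority class
    }
}
```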
Evaluating the weather attributes

Attribute    Rules             Errors   Total errors
Outlook      Sunny → No        2/5      4/14
             Overcast → Yes    0/4
             Rainy → Yes       2/5
Temp         Hot → No*         2/4      5/14
             Mild → Yes        2/6
             Cool → Yes        1/4
Humidity     High → No         3/7      4/14
             Normal → Yes      1/7
Windy        False → Yes       2/8      5/14
             True → No*        3/6

(* indicates a tie: both classes are equally frequent, so one is chosen arbitrarily)
Outlook    Temp   Humidity   Windy   Play
Sunny      Hot    High       False   No
Sunny      Hot    High       True    No
Overcast   Hot    High       False   Yes
Rainy      Mild   High       False   Yes
Rainy      Cool   Normal     False   Yes
Rainy      Cool   Normal     True    No
Overcast   Cool   Normal     True    Yes
Sunny      Mild   High       False   No
Sunny      Cool   Normal     False   Yes
Rainy      Mild   Normal     False   Yes
Sunny      Mild   Normal     True    Yes
Overcast   Mild   High       True    Yes
Overcast   Hot    Normal     False   Yes
Rainy      Mild   High       True    No
Dealing with numeric attributes

• Discretize numeric attributes
• Divide each attribute's range into intervals
  • Sort instances according to attribute's values
  • Place breakpoints where class changes (majority class)
  • This minimizes the total error
• Example: temperature from weather data

64   65   68   69   70   71   72   72   75   75   80   81   83   85
Yes | No | Yes Yes Yes | No No Yes | Yes Yes | No | Yes Yes | No
Outlook    Temperature   Humidity   Windy   Play
Sunny      85            85         False   No
Sunny      80            90         True    No
Overcast   83            86         False   Yes
Rainy      75            80         False   Yes
…          …             …          …       …
The problem of overfitting

• This procedure is very sensitive to noise
  • One instance with an incorrect class label will probably produce a separate interval
• Also: a time stamp attribute will have zero errors
• Simple solution: enforce a minimum number of instances in the majority class per interval
• Example (with min = 3):

64   65   68   69   70   71   72   72   75   75   80   81   83   85
Yes | No | Yes Yes Yes | No No Yes | Yes Yes | No | Yes Yes | No

64   65   68   69   70   71   72   72   75   75   80   81   83   85
Yes No Yes Yes Yes | No No Yes Yes Yes | No Yes Yes No
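As a sketch of this procedure in Java (names are ours; an interval is closed only once its majority class has at least min instances, and the final merging of adjacent intervals with the same majority class is left out):

```java
import java.util.*;

public class Discretize1R {
    // values must be sorted ascending, labels aligned with values.
    // An interval may only be closed once its majority class has at least
    // 'min' instances, and only at a boundary where both class and value change.
    public static List<Double> breakpoints(double[] values, String[] labels, int min) {
        List<Double> cuts = new ArrayList<>();
        Map<String, Integer> interval = new HashMap<>(); // class counts in current interval
        interval.merge(labels[0], 1, Integer::sum);
        for (int i = 1; i < values.length; i++) {
            boolean classChanged = !labels[i].equals(labels[i - 1]);
            boolean valueChanged = values[i] != values[i - 1]; // never split equal values
            int majority = Collections.max(interval.values());
            if (classChanged && valueChanged && majority >= min) {
                cuts.add((values[i - 1] + values[i]) / 2.0); // midpoint breakpoint
                interval.clear();                            // start a new interval
            }
            interval.merge(labels[i], 1, Integer::sum);
        }
        return cuts;
    }

    public static void main(String[] args) {
        double[] t = {64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85};
        String[] c = {"Yes","No","Yes","Yes","Yes","No","No","Yes","Yes","Yes","No","Yes","Yes","No"};
        // Prints [70.5, 77.5]; merging the first two Yes-majority intervals
        // leaves 77.5 as the only boundary, matching the rule set below.
        System.out.println(breakpoints(t, c, 3));
    }
}
```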
With overfitting avoidance

• Resulting rule set:

Attribute     Rules                    Errors   Total errors
Outlook       Sunny → No               2/5      4/14
              Overcast → Yes           0/4
              Rainy → Yes              2/5
Temperature   ≤ 77.5 → Yes             3/10     5/14
              > 77.5 → No*             2/4
Humidity      ≤ 82.5 → Yes             1/7      3/14
              > 82.5 and ≤ 95.5 → No   2/6
              > 95.5 → Yes             0/1
Windy         False → Yes              2/8      5/14
              True → No*               3/6
Discussion of 1R
• 1R was described in a paper by Holte (1993)
  • Contains an experimental evaluation on 16 datasets (using cross-validation so that results were representative of performance on future data)
  • Minimum number of instances was set to 6 after some experimentation
  • 1R's simple rules performed not much worse than much more complex decision trees
• Simplicity first pays off!

Robert C. Holte, “Very Simple Classification Rules Perform Well on Most Commonly Used Datasets”, Computer Science Department, University of Ottawa
Discussion of 1R: Hyperpipes

• Another simple technique: build one rule for each class
  • Each rule is a conjunction of tests, one for each attribute
• For numeric attributes: test checks whether instance's value is inside an interval
  • Interval given by minimum and maximum observed in training data
• For nominal attributes: test checks whether value is one of a subset of attribute values
  • Subset given by all possible values observed in training data
• Class with most matching tests is predicted
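A minimal sketch of the hyperpipes idea, assuming numeric attributes only (an illustration under those assumptions, not Weka's HyperPipes implementation):

```java
import java.util.*;

public class HyperpipesSketch {
    // class label -> per-attribute [min, max] observed in the training data
    private final Map<String, double[][]> pipes = new HashMap<>();

    public void train(double[][] data, String[] labels) {
        for (int i = 0; i < data.length; i++) {
            final int n = data[i].length;
            double[][] pipe = pipes.computeIfAbsent(labels[i], k -> {
                double[][] p = new double[n][2];
                for (double[] r : p) { r[0] = Double.POSITIVE_INFINITY; r[1] = Double.NEGATIVE_INFINITY; }
                return p;
            });
            for (int a = 0; a < n; a++) {
                pipe[a][0] = Math.min(pipe[a][0], data[i][a]); // widen interval minimum
                pipe[a][1] = Math.max(pipe[a][1], data[i][a]); // widen interval maximum
            }
        }
    }

    // Predict the class whose intervals contain the most attribute values.
    public String predict(double[] x) {
        String best = null;
        int bestMatches = -1;
        for (Map.Entry<String, double[][]> e : pipes.entrySet()) {
            int matches = 0;
            for (int a = 0; a < x.length; a++) {
                double[] r = e.getValue()[a];
                if (x[a] >= r[0] && x[a] <= r[1]) matches++;
            }
            if (matches > bestMatches) { bestMatches = matches; best = e.getKey(); }
        }
        return best;
    }
}
```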
Statistical modeling
• “Opposite” of 1R: use all the attributes
• Two assumptions: attributes are
  • equally important
  • statistically independent (given the class value)
    • I.e., knowing the value of one attribute says nothing about the value of another (if the class is known)
• Independence assumption is never correct!
• But … this scheme works well in practice
Probabilities for weather data

Attribute     Value      Yes (count, fraction)   No (count, fraction)
Outlook       Sunny      2    2/9                3    3/5
              Overcast   4    4/9                0    0/5
              Rainy      3    3/9                2    2/5
Temperature   Hot        2    2/9                2    2/5
              Mild       4    4/9                2    2/5
              Cool       3    3/9                1    1/5
Humidity      High       3    3/9                4    4/5
              Normal     6    6/9                1    1/5
Windy         False      6    6/9                2    2/5
              True       3    3/9                3    3/5
Play          (prior)    9    9/14               5    5/14
(Counts and fractions are taken from the 14-instance weather dataset shown earlier.)
A new day:

Outlook   Temp.   Humidity   Windy   Play
Sunny     Cool    High       True    ?
Likelihood of the two classes:

For “yes” = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
For “no” = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206

Conversion into a probability by normalization:

P(“yes”) = 0.0053 / (0.0053 + 0.0206) = 0.205
P(“no”) = 0.0206 / (0.0053 + 0.0206) = 0.795
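The same calculation can be reproduced in a few lines of Java (a hard-coded illustration of this one instance, not a general implementation):

```java
public class NaiveBayesDemo {
    public static void main(String[] args) {
        // Likelihoods for Outlook=Sunny, Temp=Cool, Humidity=High, Windy=True,
        // using the fractions from the probability table above:
        double yes = (2.0/9) * (3.0/9) * (3.0/9) * (3.0/9) * (9.0/14); // ~0.0053
        double no  = (3.0/5) * (1.0/5) * (4.0/5) * (3.0/5) * (5.0/14); // ~0.0206
        // Normalize so the two posteriors sum to 1 (Pr[E] cancels out).
        System.out.printf("P(yes) = %.3f%n", yes / (yes + no)); // ~0.205
        System.out.printf("P(no)  = %.3f%n", no / (yes + no));  // ~0.795
    }
}
```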
Bayes’s rule
• Probability of event H given evidence E:

  $\Pr[H \mid E] = \dfrac{\Pr[E \mid H]\,\Pr[H]}{\Pr[E]}$

• A priori probability of H: $\Pr[H]$
  • Probability of event before evidence is seen
• A posteriori probability of H: $\Pr[H \mid E]$
  • Probability of event after evidence is seen

Thomas Bayes. Born: 1702 in London, England. Died: 1761 in Tunbridge Wells, Kent, England.
Naïve Bayes for classification

• Classification learning: what's the probability of the class given an instance?
  • Evidence E = instance
  • Event H = class value for instance
• Naïve assumption: evidence splits into parts (i.e. attributes) that are independent:

  $\Pr[H \mid E] = \dfrac{\Pr[E_1 \mid H]\,\Pr[E_2 \mid H] \cdots \Pr[E_n \mid H]\,\Pr[H]}{\Pr[E]}$
Weather data example

Evidence E (a new day):

Outlook   Temp.   Humidity   Windy   Play
Sunny     Cool    High       True    ?

Probability of class “yes”:

  $\Pr[\text{yes} \mid E] = \Pr[\text{Outlook}=\text{Sunny} \mid \text{yes}] \times \Pr[\text{Temperature}=\text{Cool} \mid \text{yes}] \times \Pr[\text{Humidity}=\text{High} \mid \text{yes}] \times \Pr[\text{Windy}=\text{True} \mid \text{yes}] \times \dfrac{\Pr[\text{yes}]}{\Pr[E]} = \dfrac{\frac{2}{9} \times \frac{3}{9} \times \frac{3}{9} \times \frac{3}{9} \times \frac{9}{14}}{\Pr[E]}$
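To connect the counts to the formula, here is a minimal Java sketch of Naïve Bayes training and scoring for nominal attributes (class and method names are ours; no Laplace correction, so unseen value/class combinations get probability zero):

```java
import java.util.*;

public class NaiveBayesNominal {
    private final Map<String, Integer> classCounts = new HashMap<>();
    // key: attrIndex + "=" + value + "|" + class -> count
    private final Map<String, Integer> condCounts = new HashMap<>();
    private int total = 0;

    public void train(String[][] data, String[] labels) {
        for (int i = 0; i < data.length; i++) {
            total++;
            classCounts.merge(labels[i], 1, Integer::sum);
            for (int a = 0; a < data[i].length; a++)
                condCounts.merge(a + "=" + data[i][a] + "|" + labels[i], 1, Integer::sum);
        }
    }

    // Unnormalized score: Pr[E1|H] * ... * Pr[En|H] * Pr[H].
    // Dividing scores by their sum over all classes gives the posteriors,
    // exactly as in the normalization step above (Pr[E] cancels out).
    public double score(String[] instance, String cls) {
        double s = classCounts.getOrDefault(cls, 0) / (double) total;
        for (int a = 0; a < instance.length; a++)
            s *= condCounts.getOrDefault(a + "=" + instance[a] + "|" + cls, 0)
                 / (double) classCounts.getOrDefault(cls, 1);
        return s;
    }
}
```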
Cognitive Assignment II
1. Understand and master one data mining algorithm from the literature
2. Summarize it in detail in slide form, using the following format:
   1. Definition
   2. Steps of the algorithm
   3. Application of the algorithm's steps to the Main Golf or Iris dataset
   4. Java code for the algorithm
3. Present it at the next meeting

Algorithm Choices
1. Neural Network
2. Logistic Regression
3. Support Vector Machine
4. K-Means
5. K-Nearest Neighbor
6. Self-Organizing Map
7. Linear Regression
8. Naïve Bayes
9. FP-Growth
10. C4.5
References
1. Ian H. Witten, Eibe Frank, and Mark A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, 3rd Edition, Elsevier, 2011
2. Daniel T. Larose, Discovering Knowledge in Data: An Introduction to Data Mining, John Wiley & Sons, 2005
3. Florin Gorunescu, Data Mining: Concepts, Models and Techniques, Springer, 2011
4. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 3rd Edition, Elsevier, 2012
5. Oded Maimon and Lior Rokach, Data Mining and Knowledge Discovery Handbook, 2nd Edition, Springer, 2010
6. Warren Liao and Evangelos Triantaphyllou (eds.), Recent Advances in Data Mining of Enterprise Data: Algorithms and Applications, World Scientific, 2007