Date posted: 27-Jan-2015
Category: Technology
Uploaded by: anjan-krishnamurthy
MaxQDPro Team
Anjan.K, Harish.R
II Sem M.Tech CSE
04/10/23 Machine learning with WEKA 1
Machine Learning with WEKA
Agenda
Introduction to WEKA
Waikato Environment for Knowledge Analysis
Weka is a collection of machine learning algorithms for data mining tasks.
Weka contains tools for data pre-processing,
classification, regression, clustering, association rules, and visualization.
Official Web Site: http://www.cs.waikato.ac.nz/ml/weka/
WEKA System Hierarchy
User Application
◦ Model: serialized objects
Weka system
◦ User interface (weka.gui): Simple CLI, Explorer, Experimenter, Knowledge Flow, ArffViewer
◦ Middle layer (algorithms, evaluation supports and UI supports): weka.classifiers, weka.estimators, weka.filters, weka.associations, weka.clusterers, weka.attributeSelection, weka.experiment, weka.datagenerator
◦ Basic support: weka.core
Data sources
◦ Database/data warehouse (via JDBC)
◦ ARFF, CSV, C4.5 documents
Weka's Role in the Big Picture
Input (raw data) -> Data mining by Weka -> Output (result)
Data mining by Weka:
◦ Pre-processing
◦ Classification
◦ Regression
◦ Clustering
◦ Association rules
◦ Visualization
KDD Process
Data -> Selection -> Preprocessing -> Transformation -> Data Mining -> Interpretation/Evaluation -> Knowledge
WEKA: the software
Machine learning/data mining software written in Java (distributed under the GNU General Public License)
Used for research, education, and applications
Complements “Data Mining” by Witten & Frank
Main features:
◦ Comprehensive set of data pre-processing tools, learning algorithms and evaluation methods
◦ Graphical user interfaces (incl. data visualization)
◦ Environment for comparing learning algorithms
History
Project funded by the NZ government since 1993
◦ Develop state-of-the-art workbench of data mining tools
◦ Explore fielded applications
◦ Develop new fundamental methods
History
July 1997 - WEKA 2.2
◦ Schemes: 1R, T2, K*, M5, M5Class, IB1-4, FOIL, PEBLS, support for C5
◦ Included a facility (based on Unix makefiles) for configuring and running large scale experiments
Early 1997 - decision was made to rewrite WEKA in Java
◦ Originated from code written by Eibe Frank for his PhD
◦ Originally codenamed JAWS (JAva Weka System)
May 1998 - WEKA 2.3
◦ Last release of the TCL/TK-based system
Mid 1999 - WEKA 3 (100% Java) released
◦ Version to complement the Data Mining book
◦ Development version (including GUI)
WEKA: versions
There are several versions of WEKA:
◦ WEKA 3.4: “book version” compatible with description in data mining book
◦ WEKA 3.5.5: “development version” with lots of improvements
This talk is based on a nightly snapshot of WEKA 3.5.5 (12-Feb-2007)
The latest releases are in the WEKA 3.6 series
java weka.gui.GUIChooser
Explorer - Preprocessing
Import from files: ARFF, CSV, C4.5, binary
Import from URL or an SQL database (using JDBC)
Preprocessing filters
◦ Adding/removing attributes
◦ Attribute value substitution
◦ Discretization (MDL, Kononenko, etc.)
◦ Time series filters (delta, shift)
◦ Sampling, randomization
◦ Missing value management
◦ Normalization and other numeric transformations
ARFF File Format
Requires declarations of @RELATION, @ATTRIBUTE and @DATA
@RELATION declaration associates a name with the dataset
◦ @RELATION <relation-name>
@RELATION iris
@ATTRIBUTE declaration specifies the name and type of an attribute
◦ @ATTRIBUTE <attribute-name> <datatype>
◦ Datatype can be numeric, nominal, string or date
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA declaration is a single line denoting the start of the data segment
◦ Missing values are represented by ?
@DATA
5.1, 3.5, 1.4, 0.2, Iris-setosa
4.9, ?, 1.4, ?, Iris-versicolor
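To make the three declarations concrete, here is a minimal Python sketch of a reader for the ARFF subset described on this slide. It is an illustration only, not WEKA's own loader: the function name parse_arff is hypothetical, quoted values and sparse rows are not handled, and the sepalwidth and petallength attributes in the sample are assumed so that the five-column data rows have matching declarations.

```python
def parse_arff(text):
    """Parse a small ARFF string into (relation, attributes, rows)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        if in_data:
            values = [v.strip() for v in line.split(",")]
            # '?' marks a missing value; store it as None
            rows.append([None if v == "?" else v for v in values])
            continue
        keyword = line.split(None, 1)[0].lower()  # declarations are case-insensitive
        if keyword == "@relation":
            relation = line.split(None, 1)[1]
        elif keyword == "@attribute":
            _, name, datatype = line.split(None, 2)
            attributes.append((name, datatype))
        elif keyword == "@data":
            in_data = True  # everything after this line is data
    return relation, attributes, rows


# Sample file; sepalwidth and petallength are assumed for illustration.
SAMPLE = """\
@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1, 3.5, 1.4, 0.2, Iris-setosa
4.9, ?, 1.4, ?, Iris-versicolor
"""

relation, attributes, rows = parse_arff(SAMPLE)
print(relation, len(attributes), rows[1])
```

Values are kept as strings here; converting numeric columns is left out to keep the sketch short.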
Explorer - Classification
Predicted attribute is categorical
Implemented methods
◦ Naïve Bayes
◦ decision trees and rules
◦ neural networks
◦ support vector machines
◦ instance-based classifiers …
Evaluation
◦ test set
◦ cross-validation ...
J48 = Decision Tree
Leaf annotation (N/M): N = # of instances under the node, M = # of those classified wrongly
petalwidth <= 0.6: Iris-setosa (50.0)
petalwidth > 0.6
|   petalwidth <= 1.7
|   |   petallength <= 4.9: Iris-versicolor (48.0/1.0)
|   |   petallength > 4.9
|   |   |   petalwidth <= 1.5: Iris-virginica (3.0)
|   |   |   petalwidth > 1.5: Iris-versicolor (3.0/1.0)
|   petalwidth > 1.7: Iris-virginica (46.0/1.0)
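Reading the tree top-down is the same as following nested if-statements. The Python sketch below transcribes the tree from this slide by hand; the function name classify_iris is hypothetical, and this is only a way to read J48's output, not how J48 builds or stores the tree.

```python
def classify_iris(petallength, petalwidth):
    """Follow the J48 tree from the slide for one instance."""
    if petalwidth <= 0.6:
        return "Iris-setosa"
    # petalwidth > 0.6 from here on
    if petalwidth <= 1.7:
        if petallength <= 4.9:
            return "Iris-versicolor"
        # petallength > 4.9
        if petalwidth <= 1.5:
            return "Iris-virginica"
        return "Iris-versicolor"
    # petalwidth > 1.7
    return "Iris-virginica"


print(classify_iris(petallength=1.4, petalwidth=0.2))
```

Only petallength and petalwidth appear in the tree, so the two sepal measurements are not needed to classify an instance.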
Cross-validation
Correctly Classified Instances 143 95.3 %
Incorrectly Classified Instances 7 4.67 %
Default 10-fold cross-validation, i.e.
◦ Split data into 10 equal-sized pieces
◦ Train on 9 pieces and test on the remainder
◦ Do for all possibilities and average
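The three steps above can be sketched in a few lines of Python. This is a plain, unstratified split written for illustration; the helper names k_fold_indices and cross_validate and the evaluate callback are hypothetical, and WEKA's own evaluation additionally stratifies the folds by class for nominal targets.

```python
import random


def k_fold_indices(n_instances, k=10, seed=1):
    """Yield (train, test) index lists, one pair per fold."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)          # randomize instance order
    folds = [indices[i::k] for i in range(k)]     # k near-equal pieces
    for i, test in enumerate(folds):
        # train on the other k-1 pieces
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(n_instances, evaluate, k=10, seed=1):
    """Average evaluate(train, test) over all k folds."""
    scores = [evaluate(train, test)
              for train, test in k_fold_indices(n_instances, k, seed)]
    return sum(scores) / len(scores)


splits = list(k_fold_indices(150))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

With 150 instances and 10 folds, each model is trained on 135 instances and tested on the remaining 15, and every instance is used for testing exactly once.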
J48 Confusion Matrix
Old data set from statistics: 50 of each class
  a  b  c   <-- classified as
 49  1  0 | a = Iris-setosa
  0 47  3 | b = Iris-versicolor
  0  3 47 | c = Iris-virginica
Precision, Recall, and Accuracy
Precision: probability of being correct given your decision.
◦ Precision of Iris-setosa is 49/49 = 100%
◦ Positive predictive value in medical literature
Recall: probability of correctly identifying a class.
◦ Recall for Iris-setosa is 49/50 = 98%
◦ Sensitivity in medical literature
Accuracy: # right/total = 143/150 ≈ 95%
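These figures can be recomputed directly from the J48 confusion matrix (rows are actual classes, columns are predicted classes). The Python sketch below uses hypothetical helper names and is only a check of the arithmetic on this slide.

```python
LABELS = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

# Rows: actual class; columns: predicted class (from the J48 slide).
MATRIX = [
    [49, 1, 0],
    [0, 47, 3],
    [0, 3, 47],
]


def precision(matrix, c):
    """Correct predictions of class c divided by all predictions of c."""
    predicted = sum(row[c] for row in matrix)     # column sum
    return matrix[c][c] / predicted if predicted else 0.0


def recall(matrix, c):
    """Correct predictions of class c divided by all actual members of c."""
    actual = sum(matrix[c])                       # row sum
    return matrix[c][c] / actual if actual else 0.0


def accuracy(matrix):
    """Diagonal (correct) count divided by the total count."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


for i, label in enumerate(LABELS):
    print(label, precision(MATRIX, i), recall(MATRIX, i))
print("accuracy", accuracy(MATRIX))
```

Precision reads down a column and recall reads across a row, which is why Iris-setosa has perfect precision (nothing else was labelled setosa) but 98% recall (one setosa was labelled versicolor).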
Explorer - Clustering
Implemented methods
◦ k-Means
◦ EM
◦ Cobweb
◦ X-means
◦ FarthestFirst …
Clusters can be visualized and compared to “true” clusters (if given)
Evaluation based on log-likelihood if clustering scheme produces a probability distribution
Explorer - Associations
WEKA contains the Apriori algorithm (among others) for learning association rules
◦ Works only with discrete data
Can identify statistical dependencies between groups of attributes:
◦ milk, butter ==> bread, eggs (with confidence 0.9 and support 2000)
Apriori can compute all rules that have a given minimum support and exceed a given confidence
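To show what "minimum support" and "minimum confidence" select, here is a brute-force Python sketch over a handful of made-up shopping baskets. It enumerates every itemset instead of using Apriori's level-wise candidate pruning, so it only illustrates the rule criteria, not the algorithm's efficiency; the function name mine_rules and the baskets are hypothetical, and support is counted in transactions.

```python
from itertools import combinations


def mine_rules(transactions, min_support, min_confidence):
    """Return (antecedent, consequent, support, confidence) tuples."""
    items = sorted(set().union(*transactions))

    def count(itemset):
        # number of transactions that contain every item of the itemset
        return sum(1 for t in transactions if itemset <= t)

    rules = []
    for size in range(2, len(items) + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            support = count(itemset)
            if support == 0 or support < min_support:
                continue                          # not a frequent itemset
            # split the frequent itemset into antecedent ==> consequent
            for r in range(1, size):
                for lhs in combinations(combo, r):
                    antecedent = frozenset(lhs)
                    confidence = support / count(antecedent)
                    if confidence >= min_confidence:
                        rules.append((antecedent, itemset - antecedent,
                                      support, confidence))
    return rules


# Hypothetical baskets, for illustration only.
TRANSACTIONS = [
    {"milk", "butter", "bread", "eggs"},
    {"milk", "butter", "bread", "eggs"},
    {"milk", "butter", "bread"},
    {"milk", "bread"},
    {"butter", "eggs"},
]

RULES = mine_rules(TRANSACTIONS, min_support=2, min_confidence=0.9)
for antecedent, consequent, support, confidence in RULES:
    print(sorted(antecedent), "==>", sorted(consequent), support, confidence)
```

In these baskets, milk, butter ==> bread, eggs has support 2 but confidence 2/3, so it is filtered out at a 0.9 confidence threshold, while bread, butter ==> milk holds in all 3 baskets that contain bread and butter.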
Multiple-Level Association Rule Mining in Weka
Concept hierarchy
Food
◦ Level 1: Milk, Bread, Fruit
◦ Level 2: 2%, Skimmed, Fat Free (Milk); Wheat, White (Bread); Apple, Banana, Orange (Fruit)
◦ Level 3: Inorganic, Organic
Sample Execution (1)
java weka.associations.Apriori -t data/weather.nominal.arff -I yes
Apriori
=======
Minimum support: 0.2
Minimum confidence: 0.9
Number of cycles performed: 17
Generated sets of large itemsets:
Size of set of large itemsets L(1): 12
Sample Execution (2)
Best rules found:
1. humidity=normal windy=FALSE 4 ==> play=yes 4 (1)
2. temperature=cool 4 ==> humidity=normal 4 (1)
3. outlook=overcast 4 ==> play=yes 4 (1)
4. temperature=cool play=yes 3 ==> humidity=normal 3 (1)
5. outlook=rainy windy=FALSE 3 ==> play=yes 3 (1)
6. outlook=rainy play=yes 3 ==> windy=FALSE 3 (1)
7. outlook=sunny humidity=high 3 ==> play=no 3 (1)
8. outlook=sunny play=no 3 ==> humidity=high 3 (1)
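Each rule line reads: antecedent, number of instances matching it, ==>, consequent, number of instances matching both, and the confidence in parentheses. The Python sketch below recounts a few of the rules. The 14 instances are the weather.nominal data as commonly distributed with WEKA, reproduced here from memory, so treat them as an assumption; the helper names are hypothetical.

```python
ATTRIBUTES = ["outlook", "temperature", "humidity", "windy", "play"]

# Assumed contents of data/weather.nominal.arff (14 instances).
ROWS = """\
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
""".strip().splitlines()

DATA = [dict(zip(ATTRIBUTES, row.split(","))) for row in ROWS]


def count(conditions):
    """Number of instances matching every attribute=value condition."""
    return sum(1 for inst in DATA
               if all(inst[a] == v for a, v in conditions.items()))


def rule_stats(antecedent, consequent):
    """Return (antecedent count, antecedent+consequent count, confidence)."""
    covered = count(antecedent)
    correct = count({**antecedent, **consequent})
    return covered, correct, correct / covered


# Rule 1: humidity=normal windy=FALSE 4 ==> play=yes 4 (1)
print(rule_stats({"humidity": "normal", "windy": "FALSE"}, {"play": "yes"}))
```

A confidence of 1 means every instance that matches the antecedent also matches the consequent.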
Regression
Predicted attribute is continuous
Implemented methods
◦ (linear regression)
◦ neural networks
◦ regression trees …
Explorer - Attribute Selection
Very flexible: arbitrary combination of search and evaluation methods
Both filtering and wrapping methods
Search methods
◦ best-first
◦ genetic
◦ ranking ...
Evaluation measures
◦ ReliefF
◦ information gain
◦ gain ratio …
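As an example of one evaluation measure, the Python sketch below computes information gain for a single nominal attribute from class counts. The counts are assumed to be those of the outlook attribute in the weather.nominal data used in the Apriori sample execution ([play=yes, play=no] per value); the function names are hypothetical and this is not WEKA's evaluator code.

```python
from math import log2


def entropy(counts):
    """Entropy in bits of a class distribution given as counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)


def information_gain(class_counts_by_value):
    """class_counts_by_value maps attribute value -> list of class counts."""
    # class totals over the whole data set
    totals = [sum(col) for col in zip(*class_counts_by_value.values())]
    n = sum(totals)
    # expected entropy after splitting on the attribute
    remainder = sum(sum(counts) / n * entropy(counts)
                    for counts in class_counts_by_value.values())
    return entropy(totals) - remainder


# Assumed counts of [play=yes, play=no] for each outlook value.
OUTLOOK = {"sunny": [2, 3], "overcast": [4, 0], "rainy": [3, 2]}

GAIN = information_gain(OUTLOOK)
print(round(GAIN, 3))
```

A ranking search would compute this score for every attribute and sort the attributes by it.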
Explorer - Data Visualization
Visualization very useful in practice: e.g. helps to determine difficulty of the learning problem
WEKA can visualize single attributes (1-d) and pairs of attributes (2-d)
◦ To do: rotating 3-d visualizations (Xgobi-style)
Color-coded class values
“Jitter” option to deal with nominal attributes (and to detect “hidden” data points)
“Zoom-in” function
Performing experiments
Experimenter makes it easy to compare the performance of different learning schemes
For classification and regression problems
Results can be written into file or database
Evaluation options: cross-validation, learning curve, hold-out
Can also iterate over different parameter settings
Significance-testing built in!
The Knowledge Flow GUI
Java-Beans-based interface for setting up and running machine learning experiments
Data sources, classifiers, etc. are beans and can be connected graphically
Data “flows” through components: e.g.,“data source” -> “filter” -> “classifier” -> “evaluator”
Layouts can be saved and loaded again later
cf. Clementine ™
Projects based on WEKA
45 projects currently (30/01/07) listed on the WekaWiki
Incorporate/wrap WEKA
◦ GRB Tool Shed - a tool to aid gamma ray burst research
◦ YALE - facility for large scale ML experiments
◦ GATE - NLP workbench with a WEKA interface
◦ Judge - document clustering and classification
◦ RWeka - an R interface to Weka
Extend/modify WEKA
◦ BioWeka - extension library for knowledge discovery in biology
◦ WekaMetal - meta learning extension to WEKA
◦ Weka-Parallel - parallel processing for WEKA
◦ Grid Weka - grid computing using WEKA
◦ Weka-CG - computational genetics tool library
WEKA and PENTAHO
Pentaho - The leader in Open Source Business Intelligence (BI)
September 2006 - Pentaho acquires the Weka project (exclusive license and SF.net page)
Weka will be used/integrated as data mining component in their BI suite
Weka will still be available as GPL open source software
Most likely to evolve 2 editions:
◦ Community edition
◦ BI oriented edition
Limitations of WEKA
Traditional algorithms need to have all data in main memory
==> big datasets are an issue
Solution:
◦ Incremental schemes
◦ Stream algorithms
MOA “Massive Online Analysis” (not only a flightless bird, but also extinct!)
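The idea behind an incremental scheme is that the model is updated one instance at a time and the instance is then discarded, so memory use does not grow with the data set. As a small stand-in for a learner, the Python sketch below keeps a running mean and variance with Welford's update; the class name RunningStats is hypothetical and this is not code from WEKA or MOA.

```python
class RunningStats:
    """Mean and sample variance updated one value at a time (Welford)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0   # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


def stream(values):
    """Stand-in for a data source too large to hold in memory."""
    for value in values:
        yield value


stats = RunningStats()
for x in stream([5.1, 4.9, 4.7, 4.6, 5.0]):
    stats.update(x)      # constant memory, a single pass over the data

print(stats.n, stats.mean, stats.variance)
```

Only three numbers are stored no matter how many values arrive, which is the property that incremental and stream algorithms rely on.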
Summary
◦ Introduction to WEKA
◦ WEKA System Hierarchy
◦ WEKA features
◦ Brief History
◦ Explorer
◦ Experimenter
◦ CLI
◦ Knowledge Flow
◦ Projects Based on WEKA
◦ Limitations of WEKA
26/Sep/2006 S.P.Vimal, CS IS Group, BITS-Pilani
References
1. Ian H. Witten and Eibe Frank (2005). "Data Mining: Practical Machine Learning Tools and Techniques", 2nd Edition, Morgan Kaufmann, San Francisco.
2. http://www.itl.nist.gov/div898/handbook/index.htm