Post on 12-Jan-2016
Department of Computer Science, University of Waikato, New Zealand
Bernhard Pfahringer (based on material by Eibe Frank, Mark Hall, and Peter Reutemann)
WEKA: A Machine Learning Toolkit
The Explorer
• Classification and Regression
• Clustering
• Association Rules
• Attribute Selection
• Data Visualization
The Experimenter
The Knowledge Flow GUI
Other Utilities
Conclusions
Machine Learning with WEKA
04/21/23 University of Waikato 2
WEKA: the bird
Copyright: Martin Kramer (mkramer@wxs.nl)
The Weka or woodhen (Gallirallus australis) is an endemic bird of New Zealand. (Source: Wikipedia)
WEKA: the software
• Machine learning/data mining software written in Java (distributed under the GNU General Public License)
• Used for research, education, and applications
• Complements “Data Mining” by Witten & Frank
• Main features:
  • Comprehensive set of data pre-processing tools, learning algorithms and evaluation methods
  • Graphical user interfaces (incl. data visualization)
  • Environment for comparing learning algorithms
History
• Project funded by the NZ government since 1993
  • Develop state-of-the-art workbench of data mining tools
  • Explore fielded applications
  • Develop new fundamental methods
History (2)
• Late 1992 - funding was applied for by Ian Witten
• 1993 - development of the interface and infrastructure
  • WEKA acronym coined by Geoff Holmes
  • WEKA’s file format “ARFF” was created by Andrew Donkin
  • ARFF was rumored to stand for Andrew’s Ridiculous File Format
• Sometime in 1994 - first internal release of WEKA
  • TCL/TK user interface + learning algorithms written mostly in C
  • Very much beta software
  • Changes for the b1 release included (among others):
    • “Ambiguous and Unsupported menu commands removed.”
    • “Crashing processes handled (in most cases :-)”
• October 1996 - first public release: WEKA 2.1
History (3)
• July 1997 - WEKA 2.2
  • Schemes: 1R, T2, K*, M5, M5Class, IB1-4, FOIL, PEBLS, support for C5
  • Included a facility (based on Unix makefiles) for configuring and running large scale experiments
• Early 1997 - decision was made to rewrite WEKA in Java
  • Originated from code written by Eibe Frank for his PhD
  • Originally codenamed JAWS (JAva Weka System)
• May 1998 - WEKA 2.3
  • Last release of the TCL/TK-based system
• Mid 1999 - WEKA 3 (100% Java) released
  • Version to complement the Data Mining book
  • Development version (including GUI)
WEKA: versions
• There are several versions of WEKA:
  • WEKA 3.4: “book version” compatible with description in data mining book
  • WEKA 3.5.5: “development version” with lots of improvements
• This talk is based on a nightly snapshot of WEKA 3.5.5 (12-Feb-2007)
@relation heart-disease-simplified

@attribute age numeric
@attribute sex { female, male}
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes}
@attribute class { present, not_present}

@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
...

WEKA only deals with “flat” files
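To make the layout of the ARFF example concrete, here is a minimal Java sketch that reads the three kinds of lines shown on the slide (@relation, @attribute, @data rows) and counts "?" entries as missing values. It is not WEKA's own loader: the class name ArffSketch and its fields are hypothetical, and the sketch assumes well-formed input without quoted values, sparse rows, or date attributes.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative ARFF reader (hypothetical class, not part of WEKA). */
final class ArffSketch {
    String relation = "";
    /** Attribute names in declaration order. */
    final List<String> attributeNames = new ArrayList<>();
    /** Attribute type strings, e.g. "numeric" or "{ no, yes}". */
    final List<String> attributeTypes = new ArrayList<>();
    /** Data rows split on commas; "?" marks a missing value. */
    final List<String[]> rows = new ArrayList<>();

    static ArffSketch parse(String text) {
        ArffSketch arff = new ArffSketch();
        boolean inData = false;
        for (String raw : text.split("\\R")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("%")) {
                continue; // skip blank lines and ARFF comments
            }
            String lower = line.toLowerCase();
            if (inData) {
                String[] cells = line.split(",");
                for (int i = 0; i < cells.length; i++) {
                    cells[i] = cells[i].trim();
                }
                arff.rows.add(cells);
            } else if (lower.startsWith("@relation")) {
                arff.relation = line.substring("@relation".length()).trim();
            } else if (lower.startsWith("@attribute")) {
                // "@attribute <name> <type>": the type may itself contain spaces
                String[] parts = line.split("\\s+", 3);
                arff.attributeNames.add(parts[1]);
                arff.attributeTypes.add(parts[2]);
            } else if (lower.startsWith("@data")) {
                inData = true;
            }
        }
        return arff;
    }

    /** Number of cells equal to "?" across all rows. */
    int countMissing() {
        int missing = 0;
        for (String[] row : rows) {
            for (String cell : row) {
                if (cell.equals("?")) {
                    missing++;
                }
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        String text = String.join("\n",
            "@relation heart-disease-simplified",
            "@attribute age numeric",
            "@attribute sex { female, male}",
            "@attribute cholesterol numeric",
            "@attribute class { present, not_present}",
            "@data",
            "63,male,233,not_present",
            "38,female,?,not_present");
        ArffSketch arff = parse(text);
        System.out.println(arff.relation + ": " + arff.attributeNames
            + ", rows=" + arff.rows.size() + ", missing=" + arff.countMissing());
    }
}
```

The "flat" file restriction shows up directly in the data structure: every row is a fixed-length list of cells, one per declared attribute.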
java -jar weka.jar
Explorer: pre-processing the data
• Data can be imported from a file in various formats: ARFF, CSV, C4.5, binary
• Data can also be read from a URL or from an SQL database (using JDBC)
• Pre-processing tools in WEKA are called “filters”
• WEKA contains filters for:
  • Discretization, normalization, resampling, attribute selection, transforming and combining attributes, …
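As a rough idea of what two of these filters compute, the following Java sketch rescales a numeric attribute to [0, 1] and then bins it into equal-width intervals. It uses only the Java standard library; the class FilterSketch and its method names are hypothetical and do not mirror WEKA's filter classes, and missing values are not handled.

```java
import java.util.Arrays;

/** Illustrative normalization and equal-width discretization (not WEKA's filters). */
final class FilterSketch {

    /** Rescales values linearly to [0, 1]; a constant column maps to all zeros. */
    static double[] normalize(double[] values) {
        double min = Arrays.stream(values).min().orElse(0.0);
        double max = Arrays.stream(values).max().orElse(0.0);
        double range = max - min;
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = range == 0.0 ? 0.0 : (values[i] - min) / range;
        }
        return out;
    }

    /** Equal-width binning: returns a bin index in [0, bins - 1] for each value. */
    static int[] discretize(double[] values, int bins) {
        double[] scaled = normalize(values);
        int[] out = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            // The maximum value scales to 1.0 and is clamped into the last bin.
            out[i] = Math.min(bins - 1, (int) (scaled[i] * bins));
        }
        return out;
    }

    public static void main(String[] args) {
        // Cholesterol values taken from the ARFF example rows.
        double[] cholesterol = {233, 286, 229};
        System.out.println(Arrays.toString(normalize(cholesterol)));
        System.out.println(Arrays.toString(discretize(cholesterol, 3)));
    }
}
```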
Explorer: building “classifiers”
• Classifiers in WEKA are models for predicting nominal or numeric quantities
• Implemented learning schemes include:
  • Decision trees and lists, instance-based classifiers, support vector machines, multi-layer perceptrons, logistic regression, Bayes’ nets, …
• “Meta”-classifiers include:
  • Bagging, boosting, stacking, error-correcting output codes, locally weighted learning, …
Explorer: clustering data
• WEKA contains “clusterers” for finding groups of similar instances in a dataset
• Some implemented schemes are:
  • k-Means, EM, Cobweb, X-means, FarthestFirst
• Clusters can be visualized and compared to “true” clusters (if given)
• Evaluation based on log-likelihood if clustering scheme produces a probability distribution
Explorer: finding associations
• WEKA contains the Apriori algorithm (among others) for learning association rules
  • Works only with discrete data
• Can identify statistical dependencies between groups of attributes:
  • milk, butter ⇒ bread, eggs (with confidence 0.9 and support 2000)
• Apriori can compute all rules that have a given minimum support and exceed a given confidence
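The support and confidence of a single rule can be computed as in the Java sketch below. This is only the measurement step, not the Apriori search itself, and it assumes support is reported as a transaction count, as in the "support 2000" figure. The class RuleSketch, its helper basket, and the five toy transactions are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative support/confidence computation for one association rule. */
final class RuleSketch {

    /** Convenience helper to build one transaction (hypothetical name). */
    static Set<String> basket(String... items) {
        return new HashSet<>(Arrays.asList(items));
    }

    /** Support as a count: number of transactions containing all given items. */
    static int support(List<Set<String>> transactions, Set<String> items) {
        int count = 0;
        for (Set<String> transaction : transactions) {
            if (transaction.containsAll(items)) {
                count++;
            }
        }
        return count;
    }

    /** Confidence of premise => consequence: support(both) / support(premise). */
    static double confidence(List<Set<String>> transactions,
                             Set<String> premise, Set<String> consequence) {
        Set<String> both = new HashSet<>(premise);
        both.addAll(consequence);
        int premiseCount = support(transactions, premise);
        return premiseCount == 0 ? 0.0
            : (double) support(transactions, both) / premiseCount;
    }

    public static void main(String[] args) {
        // Made-up toy transactions.
        List<Set<String>> transactions = new ArrayList<>();
        transactions.add(basket("milk", "butter", "bread", "eggs"));
        transactions.add(basket("milk", "butter", "bread", "eggs"));
        transactions.add(basket("milk", "butter", "bread"));
        transactions.add(basket("milk", "eggs"));
        transactions.add(basket("bread", "eggs"));

        Set<String> premise = basket("milk", "butter");
        Set<String> consequence = basket("bread", "eggs");
        System.out.println("support of premise = " + support(transactions, premise));
        System.out.println("confidence = " + confidence(transactions, premise, consequence));
    }
}
```

Apriori's contribution is to avoid evaluating every candidate rule this way: it only extends item sets whose subsets already meet the minimum support.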
Explorer: attribute selection
• Panel that can be used to investigate which (subsets of) attributes are the most predictive ones
• Attribute selection methods contain two parts:
  • A search method: best-first, forward selection, random, exhaustive, genetic algorithm, ranking
  • An evaluation method: correlation-based, wrapper, information gain, chi-squared, …
• Very flexible: WEKA allows (almost) arbitrary combinations of these two
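One such combination, information gain as the evaluation method with ranking as the search method, can be sketched in plain Java as follows. The class InfoGainSketch and the four-instance toy data are hypothetical; the sketch handles nominal attributes only and is not WEKA's implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative information-gain ranking for nominal attributes. */
final class InfoGainSketch {

    /** Entropy in bits of a distribution given as value counts. */
    static double entropy(Map<String, Integer> counts, int total) {
        double h = 0.0;
        for (int c : counts.values()) {
            if (c == 0) {
                continue;
            }
            double p = (double) c / total;
            h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }

    static Map<String, Integer> countValues(String[] values) {
        Map<String, Integer> counts = new HashMap<>();
        for (String v : values) {
            counts.merge(v, 1, Integer::sum);
        }
        return counts;
    }

    /** H(class) minus the weighted entropy of the class within each attribute value. */
    static double infoGain(String[] attribute, String[] labels) {
        int n = labels.length;
        double gain = entropy(countValues(labels), n);
        Map<String, Map<String, Integer>> byValue = new HashMap<>();
        for (int i = 0; i < n; i++) {
            byValue.computeIfAbsent(attribute[i], k -> new HashMap<>())
                   .merge(labels[i], 1, Integer::sum);
        }
        for (Map<String, Integer> classCounts : byValue.values()) {
            int subset = 0;
            for (int c : classCounts.values()) {
                subset += c;
            }
            gain -= ((double) subset / n) * entropy(classCounts, subset);
        }
        return gain;
    }

    public static void main(String[] args) {
        // Made-up toy data: four instances, two nominal attributes, one class.
        String[] labels = {"present", "present", "not_present", "not_present"};
        Map<String, String[]> attributes = new LinkedHashMap<>();
        attributes.put("sex", new String[] {"male", "female", "male", "female"});
        attributes.put("exercise_induced_angina", new String[] {"yes", "yes", "no", "no"});

        // "Ranking" search: sort attributes by decreasing information gain.
        List<String> ranked = new ArrayList<>(attributes.keySet());
        ranked.sort(Comparator.comparingDouble(
            (String name) -> -infoGain(attributes.get(name), labels)));
        for (String name : ranked) {
            System.out.printf("%s: %.3f%n", name, infoGain(attributes.get(name), labels));
        }
    }
}
```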
Explorer: data visualization
• Visualization very useful in practice: e.g. helps to determine difficulty of the learning problem
• WEKA can visualize single attributes (1-d) and pairs of attributes (2-d)
  • To do: rotating 3-d visualizations (Xgobi-style)
• Color-coded class values
• “Jitter” option to deal with nominal attributes (and to detect “hidden” data points)
• “Zoom-in” function
Performing experiments
• Experimenter makes it easy to compare the performance of different learning schemes
• For classification and regression problems
• Results can be written into file or database
• Evaluation options: cross-validation, learning curve, hold-out
• Can also iterate over different parameter settings
• Significance-testing built in!
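To illustrate the cross-validation option, the Java sketch below splits instance indices into k disjoint test folds after a seeded shuffle; each fold serves once as the test set while the remaining instances form the training set. The class CrossValidationSketch is hypothetical, and the split is a plain random one that makes no attempt at stratification or at reproducing the Experimenter's own procedure.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Illustrative k-fold split of instance indices (not the Experimenter's code). */
final class CrossValidationSketch {

    /** Returns k disjoint test folds covering indices 0..n-1, after a seeded shuffle. */
    static List<List<Integer>> folds(int n, int k, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            indices.add(i);
        }
        Collections.shuffle(indices, new Random(seed));

        List<List<Integer>> result = new ArrayList<>();
        for (int f = 0; f < k; f++) {
            result.add(new ArrayList<>());
        }
        // Deal the shuffled indices round-robin so fold sizes differ by at most one.
        for (int i = 0; i < n; i++) {
            result.get(i % k).add(indices.get(i));
        }
        return result;
    }

    public static void main(String[] args) {
        int n = 10;
        List<List<Integer>> folds = folds(n, 3, 42L);
        for (int f = 0; f < folds.size(); f++) {
            List<Integer> test = folds.get(f);
            System.out.println("fold " + f + ": test=" + test
                + ", train size=" + (n - test.size()));
        }
    }
}
```

Fixing the seed keeps the folds identical across learning schemes, which is what makes a paired comparison of their results meaningful.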
The Knowledge Flow GUI
• Java-Beans-based interface for setting up and running machine learning experiments
• Data sources, classifiers, etc. are beans and can be connected graphically
• Data “flows” through components: e.g., “data source” -> “filter” -> “classifier” -> “evaluator”
• Layouts can be saved and loaded again later
• cf. Clementine™
Sourceforge.net – Downloads
Sourceforge.net – Web Traffic
WekaWiki launched – 05/2005
WekaDoc Wiki introduced – 12/2005
Projects based on WEKA
• 45 projects currently (30/01/07) listed on the WekaWiki
• Incorporate/wrap WEKA
  • GRB Tool Shed - a tool to aid gamma ray burst research
  • YALE - facility for large scale ML experiments
  • GATE - NLP workbench with a WEKA interface
  • Judge - document clustering and classification
  • RWeka - an R interface to Weka
• Extend/modify WEKA
  • BioWeka - extension library for knowledge discovery in biology
  • WekaMetal - meta learning extension to WEKA
  • Weka-Parallel - parallel processing for WEKA
  • Grid Weka - grid computing using WEKA
  • Weka-CG - computational genetics tool library
WEKA and PENTAHO
• Pentaho – The leader in Open Source Business Intelligence (BI)
• September 2006 – Pentaho acquires the Weka project (exclusive license and SF.net page)
• Weka will be used/integrated as data mining component in their BI suite
• Weka will still be available as GPL open source software
• Most likely to evolve 2 editions:
  • Community edition
  • BI oriented edition
Limitations of WEKA
• Traditional algorithms need to have all data in main memory
• ==> big datasets are an issue
• Solution:
  • Incremental schemes
  • Stream algorithms
    • MOA “Massive Online Analysis” (not only a flightless bird, but also extinct!)
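The idea behind an incremental scheme can be shown with a running mean and variance that are updated one instance at a time, so the data never has to sit in main memory as a whole. The Java sketch below uses Welford's update; the class RunningStats is hypothetical and is not taken from WEKA or MOA.

```java
/** Illustrative incremental statistics (Welford's update), one value at a time. */
final class RunningStats {
    private long count;
    private double mean;
    private double m2; // running sum of squared deviations from the current mean

    /** Folds one new value into the statistics in constant time and memory. */
    void add(double x) {
        count++;
        double delta = x - mean;
        mean += delta / count;
        m2 += delta * (x - mean);
    }

    long count() {
        return count;
    }

    double mean() {
        return mean;
    }

    /** Sample variance; defined as 0 for fewer than two values. */
    double variance() {
        return count > 1 ? m2 / (count - 1) : 0.0;
    }

    public static void main(String[] args) {
        RunningStats stats = new RunningStats();
        // Values could equally come from a file or stream that is read only once.
        for (double cholesterol : new double[] {233, 286, 229}) {
            stats.add(cholesterol);
        }
        System.out.println("n=" + stats.count() + " mean=" + stats.mean()
            + " variance=" + stats.variance());
    }
}
```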
Conclusion: try it yourself!
• WEKA is available at http://www.cs.waikato.ac.nz/ml/weka
  • Also has a list of projects based on WEKA
• (probably incomplete list of) WEKA contributors:
  • Abdelaziz Mahoui, Alexander K. Seewald, Ashraf M. Kibriya, Bernhard Pfahringer, Brent Martin, Peter Flach, Eibe Frank, Gabi Schmidberger, Ian H. Witten, J. Lindgren, Janice Boughton, Jason Wells, Len Trigg, Lucio de Souza Coelho, Malcolm Ware, Mark Hall, Remco Bouckaert, Richard Kirkby, Shane Butler, Shane Legg, Stuart Inglis, Sylvain Roy, Tony Voyle, Xin Xu, Yong Wang, Zhihai Wang