
Weka ML Tutorial 5 6

Page 1: Weka ML Tutorial 5 6

Machine Learning Tutorial

CB, GS, REC

Section 5: API for Weka

Machine Learning Tutorial for the UKP lab, June 10, 2011

Page 2: Weka ML Tutorial 5 6

Weka API

Series of experiments are laborious in the WEKA GUI. The API is simple and easy to use for designing complex workflows,
e.g. grid search / simulated annealing over the classifier hyperparameter space.

Major concepts: arff file, Instances / Instance, Classifier

SS 2011 | Computer Science Department | UKP Lab - György Szarvas

Page 3: Weka ML Tutorial 5 6

Arff file format

Header exactly specifying the parameters of the dataset. CSV representation of instances. Sparse representation is also handled.

Example:

@relation MYDATASET

@attribute att1 real
@attribute att2 {value1,value2}
@attribute classlabel {positive,negative}

@data
0.1,value1,positive
0.0,value2,negative
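The sparse representation mentioned above stores only non-default values as index/value pairs; omitted attributes are read as 0, which for a nominal attribute means its first declared value (a known caveat of the format). A sketch of the @data section above in sparse form:

```
@data
{0 0.1, 1 value1, 2 positive}
{1 value2, 2 negative}
```

The second row omits att1 because its value is 0.0.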


Page 4: Weka ML Tutorial 5 6

Reading an arff file

into an Instances object:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.Reader;
import weka.core.Instances;

Instances trainData = null;
Reader reader = new BufferedReader(new FileReader(new File(file)));
try {
    trainData = new Instances(reader);
    trainData.setClassIndex(trainData.numAttributes() - 1);
} finally {
    reader.close();
}


Page 5: Weka ML Tutorial 5 6

Training a classifier

import weka.classifiers.Classifier;
import weka.core.Instance;
import weka.core.Instances;

Instances trainData; // from the previous slide
Instances testData;

Classifier cl = createClassifier(); // initialize a classifier, see later
cl.buildClassifier(trainData); // perform the training process

for (int i = 0; i < testData.numInstances(); i++) {
    Instance inst = testData.instance(i); // grab a single test instance

    // classification
    try {
        int value = (int) cl.classifyInstance(inst); // the offset of the nominal value
        String label = testData.classAttribute().value(value); // query the corresponding label
        int realvalue = (int) inst.classValue(); // in case the gold labels were in the arff
        // compare…
    } catch (Exception e) {
        e.printStackTrace();
    }

    // distribution
    try {
        double[] distr = cl.distributionForInstance(inst); // the class posteriors in an array
    } catch (Exception e) {
        e.printStackTrace();
    }
}
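The posterior array returned by distributionForInstance can be turned into a predicted class offset by taking the argmax. A minimal, Weka-free sketch; the distribution values below are made up for illustration:

```java
public class ArgMax {
    // Returns the index of the largest value, i.e. the predicted class offset.
    static int argMax(double[] distr) {
        int best = 0;
        for (int i = 1; i < distr.length; i++) {
            if (distr[i] > distr[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] distr = {0.2, 0.7, 0.1}; // hypothetical class posteriors
        System.out.println(argMax(distr)); // prints 1
    }
}
```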


Page 6: Weka ML Tutorial 5 6

Creating / Initializing a classifier

public AbstractClassifier createClassifier() throws Exception {
    J48 j48 = new J48();
    Configuration config = Configuration.getInstance();

    j48.setUnpruned(config.getBooleanProperty(Configuration.USE_UNPRUNED_TREE));
    if (!j48.getUnpruned()) {
        j48.setReducedErrorPruning(config.getBooleanProperty(Configuration.REDUCED_ERROR_PRUNING));
    }
    j48.setConfidenceFactor(config.getFloatProperty(Configuration.CONFIDENCE_FACTOR, j48.getConfidenceFactor()));
    j48.setMinNumObj(config.getIntProperty(Configuration.MIN_NUMOBJ, j48.getMinNumObj()));
    j48.setNumFolds(config.getIntProperty(Configuration.NUM_FOLDS, j48.getNumFolds()));
    return j48;
}

Also: setOptions(java.lang.String[] options); forName(java.lang.String classifierName, java.lang.String[] options);
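The same configuration can usually be expressed as command-line options and passed through setOptions or forName. Assuming Weka's standard J48 option flags (this mapping comes from Weka's documentation, not from the slide), the setter calls above roughly correspond to:

```
-C 0.25   confidence factor (setConfidenceFactor)
-M 2      minimum instances per leaf (setMinNumObj)
-U        use an unpruned tree (setUnpruned; excludes the pruning options)
-R -N 3   reduced-error pruning with 3 folds (setReducedErrorPruning / setNumFolds)
```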


Page 7: Weka ML Tutorial 5 6

Weka API

Very simple, self-explaining code. Clear architecture / structure:

weka.classifiers
weka.classifiers.bayes (BayesNet, Naive Bayes)
weka.classifiers.functions (Neural Net, Linear Regression, Logistic regression/Maxent, SVM, …)
weka.classifiers.lazy (Nearest Neighbor, …)
weka.classifiers.rules (JRip, …)
weka.classifiers.trees (C4.5, Random Forest, …)
weka.classifiers.meta (Boosting, Bagging, Attribute Selection, Voting, …)

weka.clusterers

Quick prototyping, testing of many algorithms
Not all state of the art
Not the most efficient (e.g. logreg is slow)
Ideal for learning / teaching / starting up


Page 8: Weka ML Tutorial 5 6

Machine Learning Tutorial

CB, GS, REC

Section 6: Machine Learning – further topics

Machine Learning Tutorial for the UKP lab, June 10, 2011

Page 9: Weka ML Tutorial 5 6

Classification

Until now: classification, with a finite set of (nominal) class labels

classification units / instances were tokens, token sequences, sentences, documents, …

Error is measured via the percentage of correct predictions (cf. accuracy, error rate, etc.)
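The accuracy mentioned here is just the fraction of predictions that match the gold labels. A minimal sketch; the label arrays below are made up for illustration:

```java
public class Accuracy {
    // Fraction of positions where the predicted label equals the gold label.
    static double accuracy(int[] predicted, int[] gold) {
        int correct = 0;
        for (int i = 0; i < gold.length; i++) {
            if (predicted[i] == gold[i]) {
                correct++;
            }
        }
        return (double) correct / gold.length;
    }

    public static void main(String[] args) {
        int[] predicted = {0, 1, 1, 0}; // hypothetical predictions
        int[] gold      = {0, 1, 0, 0}; // hypothetical gold labels
        System.out.println(accuracy(predicted, gold)); // prints 0.75
    }
}
```

The error rate is simply 1 minus this value.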

Methods worth considering / trying: maximum entropy models (in weka: logistic regression), decision trees (for easier tasks), boosted decision trees, conditional random fields, support vector machines


Page 10: Weka ML Tutorial 5 6

Regression

Regression: approximate a real-valued target variable; also called function learning. Error is measured as the difference between the predicted and the observed values. Usually based on real-valued features.
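The per-instance differences described above are typically aggregated over a test set, for example as root mean squared error. A small sketch; the prediction and target values below are made up for illustration:

```java
public class Rmse {
    // Root mean squared error between predictions and observed targets.
    static double rmse(double[] predicted, double[] observed) {
        double sum = 0.0;
        for (int i = 0; i < observed.length; i++) {
            double diff = predicted[i] - observed[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum / observed.length);
    }

    public static void main(String[] args) {
        double[] predicted = {2.5, 0.0, 2.0}; // hypothetical predictions
        double[] observed  = {3.0, -0.5, 2.0}; // hypothetical targets
        System.out.println(rmse(predicted, observed)); // about 0.41
    }
}
```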

Less typical problem setting for NLP

Methods worth considering / trying: linear regression, support vector machines


Page 11: Weka ML Tutorial 5 6

Ranking

Preference learning: instead of classification, try to predict a total order over a set of possible labels (e.g. all possible actions at a time)

research area of the KE group here

Subset ranking / Learning to rank: instead of classification, try to predict a (partial) order of a set of instances (e.g. query-document pairs); more relevant in NLP and especially IR

Error is calculated according to some ranking measure, cf. P@k, MAP, NDCG
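Of the ranking measures above, precision-at-k (P@k) is the simplest: the fraction of relevant items among the top k positions of the ranking. A sketch; the relevance list below is made up for illustration:

```java
public class PrecisionAtK {
    // relevance[i] is true if the item ranked at position i is relevant.
    static double precisionAtK(boolean[] relevance, int k) {
        int relevant = 0;
        for (int i = 0; i < k; i++) {
            if (relevance[i]) {
                relevant++;
            }
        }
        return (double) relevant / k;
    }

    public static void main(String[] args) {
        boolean[] relevance = {true, false, true, true, false}; // hypothetical ranking
        System.out.println(precisionAtK(relevance, 3)); // P@3 = 2/3
    }
}
```

MAP and NDCG additionally reward placing relevant items earlier in the ranking.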

Methods worth considering / trying: (rank)SVM, boosted decision/regression trees


Page 12: Weka ML Tutorial 5 6

Semi-supervised learning

Exploit labeled + unlabeled data to improve models (or likewise to get a similar model with less labeled data)

Examples:
in SVM, maximize the margin (distance from the decision boundary) taking into account unlabeled points
use unlabeled data to calculate feature statistics
use automatically labeled data to append the training set

Two different paradigms.
Inductive setting: learn a model that applies to new examples – use some labeled + unlabeled data and evaluate on unseen data
more general, less powerful

Transductive setting: learn a model that predicts a predefined test set accurately – use some labeled data + unlabeled test data and evaluate on the unlabeled test
more powerful, entails the need to retrain before predicting further new data
cf. Niklas's thesis


Page 13: Weka ML Tutorial 5 6

Semi-supervised learning

Self-training: train a model; predicted instances that meet a predefined selection criterion (e.g. p(+) > 0.95) are added to the training pool, and then retrain
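The selection step of self-training can be sketched without committing to any particular learner: given posterior estimates for an unlabeled pool, keep the instances whose confidence exceeds the threshold (0.95, as above). The pool and posteriors below are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

public class SelfTrainingSelection {
    // Indices of pool instances whose positive-class posterior exceeds the threshold.
    static List<Integer> selectConfident(double[] posteriors, double threshold) {
        List<Integer> selected = new ArrayList<>();
        for (int i = 0; i < posteriors.length; i++) {
            if (posteriors[i] > threshold) {
                selected.add(i);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        double[] posteriors = {0.99, 0.60, 0.97, 0.80}; // hypothetical p(+) per unlabeled instance
        System.out.println(selectConfident(posteriors, 0.95)); // prints [0, 2]
    }
}
```

The selected instances would then be moved into the training pool with their predicted labels before retraining.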

Co-training: train two different models / the same model on 2 independent representations (e.g. spam filtering based on text and on links); predicted instances that meet a predefined selection criterion are added to the training pool of the other model, and then retrain both

Active learning: train a model on a small initial set; instances that meet a predefined selection criterion (e.g. the model shows high uncertainty, p(+) ~ p(-)) are asked for human labeling, and then retrain


Page 14: Weka ML Tutorial 5 6

Semi-sup. learning (generate training data)

Bootstrapping (generate training data): start with an initial small seed set; instances that meet a predefined selection criterion (e.g. contextual similarity to the seed) are added to the training pool, and then retry

Distant supervision: start with an assumption of positive / negative membership (e.g. for pairs in a knowledge base you know the label; look for texts containing that pair); generate potential positive/negative instances based on the assumption, and then train a model

Train on errors: having labeled data for an associated task, train on its errors (which partly are due to the lack of knowledge about your current problem); e.g. disease and associated symptom codes are never added to the same document – learn D/S relationships from D and S labels/classifiers


Page 15: Weka ML Tutorial 5 6

Domain adaptation

When crossing domains, the texts (feature and/or label distributions) can change

this degrades ML performance (on the target domain with a small training set, compared to the source domain with a large training set)

try to tackle this domain impact to have OK performance in (almost) unseen domains

Pivot features that are frequent and robust across domains, e.g. „good“ is a positive sentiment word in all domains

Structural correspondence learning: align source/target-specific features through their similarities to pivot features; can exploit target-specific knowledge through correspondences to source-specific features

Easy domain adaptation: use three versions of all features (source-only, target-only, and source+target); can learn general (pivot patterns) and also target- (source-)specific knowledge
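The feature tripling described here can be sketched as a plain array transformation. The feature vector below is made up, and the layout (shared / source-only / target-only copies) is one common convention for this augmentation, not something fixed by the slide:

```java
import java.util.Arrays;

public class EasyAdapt {
    // Maps a feature vector x to three stacked copies:
    // source instances become (x, x, 0), target instances become (x, 0, x),
    // so the learner can weight shared and domain-specific versions separately.
    static double[] augment(double[] x, boolean isSource) {
        int n = x.length;
        double[] out = new double[3 * n];
        System.arraycopy(x, 0, out, 0, n);          // shared (source+target) copy
        if (isSource) {
            System.arraycopy(x, 0, out, n, n);      // source-only copy
        } else {
            System.arraycopy(x, 0, out, 2 * n, n);  // target-only copy
        }
        return out;
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.0}; // hypothetical feature vector
        System.out.println(Arrays.toString(augment(x, true)));  // prints [1.0, 2.0, 1.0, 2.0, 0.0, 0.0]
        System.out.println(Arrays.toString(augment(x, false))); // prints [1.0, 2.0, 0.0, 0.0, 1.0, 2.0]
    }
}
```

Any standard classifier can then be trained on the augmented vectors without further changes.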



