Machine Learning Tutorial
CB, GS, REC

Section 5: API for Weka

Machine Learning Tutorial for the UKP lab
June 10, 2011
Weka API
Series of experiments are laborious in the WEKA GUI
The API is simple and easy to use for designing complex workflows
    e.g. grid search / simulated annealing over the classifier hyperparameter space

Major concepts
    arff file
    Instances / Instance
    Classifier
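The grid search mentioned above is just a pair of nested loops over candidate hyperparameter values, keeping the best cross-validation score. A minimal sketch (not Weka code): evaluate() is a stand-in for a real "train + cross-validate, return accuracy" run, and the two parameters mirror J48's confidence factor and minimum objects per leaf; the toy scoring surface is illustrative only.

```java
public class GridSearchSketch {
    // Stand-in for "train + cross-validate, return accuracy".
    // Toy surface with a single optimum at (0.25, 2).
    static double evaluate(double confidenceFactor, int minNumObj) {
        return 1.0 - Math.abs(confidenceFactor - 0.25) - 0.01 * Math.abs(minNumObj - 2);
    }

    // Exhaustive search over a small grid; returns {bestCF, bestMinNumObj, bestScore}.
    public static double[] search() {
        double[] cfGrid = {0.1, 0.25, 0.5};
        int[] minGrid = {2, 5, 10};
        double bestScore = Double.NEGATIVE_INFINITY;
        double bestCf = 0;
        int bestMin = 0;
        for (double cf : cfGrid) {
            for (int m : minGrid) {
                double score = evaluate(cf, m);
                if (score > bestScore) {
                    bestScore = score;
                    bestCf = cf;
                    bestMin = m;
                }
            }
        }
        return new double[]{bestCf, bestMin, bestScore};
    }

    public static void main(String[] args) {
        double[] best = search();
        System.out.println("best CF=" + best[0] + ", minNumObj=" + (int) best[1]);
    }
}
```

In a real workflow, evaluate() would configure a classifier with the candidate values and run weka's cross-validation on the training Instances.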
SS 2011 | Computer Science Department | UKP Lab - György Szarvas
Arff file format
Header exactly specifying the parameters of the dataset
CSV representation of instances
    sparse representation is also handled
Example:

@relation MYDATASET

@attribute att1 real
@attribute att2 {value1,value2}
@attribute classlabel {positive,negative}

@data
0.1,value1,positive
0.0,value2,negative
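The sparse representation mentioned above stores each instance as {index value, ...} pairs with 0-based attribute indices; omitted attributes default to 0, which for a nominal attribute means its first declared value. The same two instances as in the dense example would look like this:

```
@data
{0 0.1, 2 positive}
{1 value2, 2 negative}
```

Note the common gotcha: in the first row, att2=value1 is omitted because value1 is the first value in the declaration, not because it is missing.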
Reading an arff file

into an Instances object:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.Reader;
import weka.core.Instances;

Instances trainData = null;
Reader reader = new BufferedReader(new FileReader(new File(file)));
try {
    trainData = new Instances(reader);
    trainData.setClassIndex(trainData.numAttributes() - 1);
}
finally {
    reader.close();
}
Training a classifier

import weka.classifiers.Classifier;
import weka.core.Instance;
import weka.core.Instances;

Instances trainData; // from the previous slide
Instances testData;

Classifier cl = createClassifier(); // initialize a classifier, see later
cl.buildClassifier(trainData);      // perform the training process

for (int i = 0; i < testData.numInstances(); i++) {
    Instance inst = testData.instance(i); // grab a single test instance

    // classification
    try {
        int value = (int) cl.classifyInstance(inst);           // offset of the predicted nominal value
        String label = testData.classAttribute().value(value); // query the corresponding label
        int realvalue = (int) inst.classValue();               // in case the gold labels were in the arff
        // compare ...
    } catch (Exception e) { /* handle */ }

    // distribution
    try {
        double[] distr = cl.distributionForInstance(inst); // class posteriors in an array
    } catch (Exception e) { /* handle */ }
}
Creating / Initializing a classifier

import weka.classifiers.AbstractClassifier;
import weka.classifiers.trees.J48;

public AbstractClassifier createClassifier() throws Exception {
    J48 j48 = new J48();
    Configuration config = Configuration.getInstance();
    j48.setUnpruned(config.getBooleanProperty(Configuration.USE_UNPRUNED_TREE));
    if (!j48.getUnpruned()) {
        j48.setReducedErrorPruning(config.getBooleanProperty(Configuration.REDUCED_ERROR_PRUNING));
    }
    j48.setConfidenceFactor(
        config.getFloatProperty(Configuration.CONFIDENCE_FACTOR, j48.getConfidenceFactor()));
    j48.setMinNumObj(config.getIntProperty(Configuration.MIN_NUMOBJ, j48.getMinNumObj()));
    j48.setNumFolds(config.getIntProperty(Configuration.NUM_FOLDS, j48.getNumFolds()));
    return j48;
}

Also: setOptions(java.lang.String[] options); forName(java.lang.String classifierName, java.lang.String[] options);
Weka API

Very simple, self-explaining code
Clear architecture / structure
    weka.classifiers
        weka.classifiers.bayes (BayesNet, Naive Bayes)
        weka.classifiers.functions (Neural Net, Linear Regression, Logistic Regression/Maxent, SVM, ...)
        weka.classifiers.lazy (Nearest Neighbor, ...)
        weka.classifiers.rules (JRip, ...)
        weka.classifiers.trees (C4.5, Random Forest, ...)
        weka.classifiers.meta (Boosting, Bagging, Attribute Selection, Voting, ...)
    weka.clusterers

Quick prototyping, testing of many algorithms
    Not all state of the art
    Not the most efficient (e.g. logreg is slow)
    Ideal for learning / teaching / starting up
Machine Learning Tutorial
CB, GS, REC

Section 6: Machine Learning - further topics

Machine Learning Tutorial for the UKP lab
June 10, 2011
Classification

Until now: classification
    finite set of (nominal) class labels
    classification units / instances were
        tokens
        token sequences
        sentences
        documents
        ...
Error is measured via the percentage of correct predictions
    cf. accuracy, error rate, etc.
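The accuracy measure above (and its complement, the error rate) reduces to counting matching predictions. A minimal sketch with hypothetical integer label arrays:

```java
public class AccuracySketch {
    // Fraction of predictions that match the gold labels.
    public static double accuracy(int[] gold, int[] predicted) {
        int correct = 0;
        for (int i = 0; i < gold.length; i++) {
            if (gold[i] == predicted[i]) correct++;
        }
        return (double) correct / gold.length;
    }

    public static void main(String[] args) {
        int[] gold = {1, 0, 1, 1};
        int[] pred = {1, 0, 0, 1};
        // accuracy 0.75, error rate 1 - 0.75 = 0.25
        System.out.println("accuracy = " + accuracy(gold, pred));
    }
}
```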
Methods worth considering / trying
    maximum entropy models (in weka: logistic regression)
    decision trees (for easier tasks)
    boosted decision trees
    conditional random fields
    support vector machines
Regression

Regression:
    approximate a real valued target variable
    also called function learning
    error is measured as the difference between the predicted and the observed values
    usually based on real valued features

Less typical problem setting for NLP

Methods worth considering / trying
    linear regression
    support vector machine
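The prediction/observation difference mentioned above is usually aggregated as mean squared error. A small sketch:

```java
public class RegressionErrorSketch {
    // Mean squared error between observed targets and predictions.
    public static double mse(double[] observed, double[] predicted) {
        double sum = 0.0;
        for (int i = 0; i < observed.length; i++) {
            double d = predicted[i] - observed[i];
            sum += d * d;
        }
        return sum / observed.length;
    }

    public static void main(String[] args) {
        // One prediction off by 0.5, one exact: MSE = (0.25 + 0) / 2 = 0.125
        System.out.println(mse(new double[]{1.0, 2.0}, new double[]{1.5, 2.0}));
    }
}
```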
Ranking

Preference learning
    instead of classification, try to predict a total order over a set of possible labels
    (e.g. all possible actions at a time)
    research area of the KE group here

Subset ranking / Learning to rank
    instead of classification, try to predict a (partial) order of a set of instances
    (e.g. query-document pairs)
    more relevant in NLP and especially IR

Error is calculated according to some ranking measure
    cf. P@k, MAP, NDCG

Methods worth considering / trying
    (rank)SVM
    boosted decision/regression trees
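Two of the ranking measures named above are easy to sketch for binary relevance: P@k counts relevant items in the top k, and DCG discounts each relevant item by the log of its rank (NDCG divides DCG by the DCG of an ideal ranking).

```java
public class RankingMetricsSketch {
    // Precision at k: fraction of the top-k ranked items that are relevant.
    public static double precisionAtK(boolean[] rankedRelevance, int k) {
        int hits = 0;
        for (int i = 0; i < k; i++) {
            if (rankedRelevance[i]) hits++;
        }
        return (double) hits / k;
    }

    // DCG with binary relevance: sum over relevant items of 1 / log2(rank + 1),
    // ranks starting at 1.
    public static double dcg(boolean[] rankedRelevance) {
        double sum = 0.0;
        for (int i = 0; i < rankedRelevance.length; i++) {
            if (rankedRelevance[i]) sum += 1.0 / (Math.log(i + 2) / Math.log(2));
        }
        return sum;
    }

    public static void main(String[] args) {
        boolean[] ranking = {true, false, true}; // relevance of the ranked list
        System.out.println("P@2 = " + precisionAtK(ranking, 2)); // 1 hit in top 2 -> 0.5
        System.out.println("DCG = " + dcg(ranking));             // 1/log2(2) + 1/log2(4) = 1.5
    }
}
```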
Semi-supervised learning

Exploit labeled + unlabeled data to improve models (or likewise, to get a similar model with less labeled data)

Examples
    in SVM, maximize the margin (distance from decision boundary) taking into account unlabeled points
    use unlabeled data to calculate feature statistics
    use automatically labeled data to extend the training set

Two different paradigms
    Inductive setting: learn a model that applies to new examples; use some labeled + unlabeled data and evaluate on unseen data
        more general
        less powerful
    Transductive setting: learn a model that predicts a predefined test set accurately; use some labeled data + unlabeled test data and evaluate on the unlabeled test set
        more powerful
        entails the need to retrain before predicting further new data
        cf. Niklas's thesis
Semi-supervised learning

Self-training
    train a model
    predicted instances that meet a predefined selection criterion (e.g. p(+) > 0.95) are added to the training pool, and then retrain

Co-training
    train two different models / the same model on 2 independent representations (e.g. spam filtering based on text and on links)
    predicted instances that meet a predefined selection criterion are added to the training pool of the other model, and then retrain both

Active learning
    train a model on a small initial set
    instances that meet a predefined selection criterion (e.g. model shows high uncertainty, p(+) ~ p(-)) are asked for human labeling, and then retrain
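The self-training loop above can be sketched end-to-end with a toy model. The 1-D nearest-class-mean "classifier" and the margin-based confidence test below are stand-ins for a real classifier and a p(+) > 0.95 style posterior threshold; the loop structure (predict, select confident instances, grow the pool, retrain) is the actual technique.

```java
import java.util.ArrayList;
import java.util.List;

public class SelfTrainingSketch {
    // Toy model: mean of feature values per class; examples are {feature, label(0/1)}.
    static double[] classMeans(List<double[]> labeled) {
        double[] sum = new double[2];
        int[] n = new int[2];
        for (double[] ex : labeled) {
            int y = (int) ex[1];
            sum[y] += ex[0];
            n[y]++;
        }
        return new double[]{sum[0] / n[0], sum[1] / n[1]};
    }

    // Repeatedly: predict unlabeled points, move the confidently predicted ones
    // into the labeled pool, retrain (recompute means). Returns how many were added.
    public static int selfTrain(List<double[]> labeled, List<Double> unlabeled, double margin) {
        int added = 0;
        boolean grew = true;
        while (grew) {
            grew = false;
            double[] means = classMeans(labeled); // "retrain"
            for (int i = unlabeled.size() - 1; i >= 0; i--) {
                double x = unlabeled.get(i);
                double d0 = Math.abs(x - means[0]);
                double d1 = Math.abs(x - means[1]);
                if (Math.abs(d0 - d1) >= margin) { // confident enough -> pseudo-label
                    labeled.add(new double[]{x, d0 < d1 ? 0 : 1});
                    unlabeled.remove(i);
                    added++;
                    grew = true;
                }
            }
        }
        return added;
    }

    public static void main(String[] args) {
        List<double[]> labeled = new ArrayList<>();
        labeled.add(new double[]{0.0, 0});
        labeled.add(new double[]{1.0, 1});
        List<Double> unlabeled = new ArrayList<>(List.of(0.1, 0.9, 0.5));
        System.out.println("added: " + selfTrain(labeled, unlabeled, 0.3));
    }
}
```

The ambiguous point 0.5 is never added: it stays below the confidence margin even after retraining, which is exactly the behavior the selection criterion is meant to enforce.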
Semi-sup. learning (generate training data)

Bootstrapping (generate training data)
    start with an initial small seed set
    instances that meet a predefined selection criterion (e.g. contextual similarity to the seed) are added to the training pool, and then retry

Distant supervision
    start with an assumption of positive / negative membership (e.g. for pairs in a knowledge base you know the label; look for texts containing that pair)
    generate potential positive/negative instances based on the assumption, and then train a model

Train on errors
    having labeled data for an associated task, train on its errors (which partly are due to the lack of knowledge about your current problem)
    e.g. disease and associated symptom codes are never added to the same document; learn D/S relationships from D and S labels/classifiers
Domain adaptation

When crossing domains, the texts (feature and/or label distributions) can change
    this degrades ML performance (on a target domain with a small training set, compared to a source domain with a large training set)
    try to tackle this domain impact to have OK performance in (almost) unseen domains

Pivot features that are frequent and robust across domains
    e.g. "good" is a positive sentiment word in all domains

Structural correspondence learning
    align source/target specific features through their similarities to pivot features
    can exploit target specific knowledge through correspondences to source specific features

Easy domain adaptation
    use three versions of all features: a general (source+target), a source-only and a target-only copy
    can learn general (pivot patterns) and also target- (source-)specific knowledge
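The feature tripling of easy domain adaptation (Daumé's "frustratingly easy" feature augmentation) is a one-line transformation per feature. A sketch with string features; the "general:"/"source:"/"target:" prefixes are illustrative naming, not a fixed convention:

```java
import java.util.ArrayList;
import java.util.List;

public class EasyAdaptSketch {
    // Every feature gets a shared "general" copy plus a domain-specific copy;
    // the learner can then put weight on whichever version generalizes.
    public static List<String> augment(List<String> features, String domain) {
        List<String> out = new ArrayList<>();
        for (String f : features) {
            out.add("general:" + f);
            out.add(domain + ":" + f);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(augment(List.of("good", "plot"), "source"));
    }
}
```

A pivot like "good" contributes through its general: copy in every domain, while domain-specific quirks are absorbed by the source:/target: copies.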