Machine Learning Tutorial
CB, GS, REC

Section 5: API for Weka

Machine Learning Tutorial for the UKP lab
June 10, 2011
Weka API
Series of experiments are laborious in the WEKA GUI
The API is simple and easy to use for designing complex workflows
    e.g. grid search / simulated annealing over the classifier hyperparameter space

Major concepts
    arff file
    Instances / Instance
    Classifier
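The grid search mentioned above is just a pair of nested loops over candidate hyperparameter values, keeping the best cross-validation score. A minimal sketch (not Weka code): evaluate() is a stand-in for a real "train + cross-validate, return accuracy" run, and the two parameters mirror J48's confidence factor and minimum objects per leaf; the toy scoring surface is illustrative only.

```java
public class GridSearchSketch {
    // Stand-in for "train + cross-validate, return accuracy".
    // Toy surface with a single optimum at (0.25, 2).
    static double evaluate(double confidenceFactor, int minNumObj) {
        return 1.0 - Math.abs(confidenceFactor - 0.25) - 0.01 * Math.abs(minNumObj - 2);
    }

    // Exhaustive search over a small grid; returns {bestCF, bestMinNumObj, bestScore}.
    public static double[] search() {
        double[] cfGrid = {0.1, 0.25, 0.5};
        int[] minGrid = {2, 5, 10};
        double bestScore = Double.NEGATIVE_INFINITY;
        double bestCf = 0;
        int bestMin = 0;
        for (double cf : cfGrid) {
            for (int m : minGrid) {
                double score = evaluate(cf, m);
                if (score > bestScore) {
                    bestScore = score;
                    bestCf = cf;
                    bestMin = m;
                }
            }
        }
        return new double[]{bestCf, bestMin, bestScore};
    }

    public static void main(String[] args) {
        double[] best = search();
        System.out.println("best CF=" + best[0] + ", minNumObj=" + (int) best[1]);
    }
}
```

In a real workflow, evaluate() would configure a classifier with the candidate values and run weka's cross-validation on the training Instances.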
SS 2011 | Computer Science Department | UKP Lab - György Szarvas
Arff file format
Header exactly specifying the parameters of the dataset
CSV representation of instances
    sparse representation is also handled
Example:

@relation MYDATASET

@attribute att1 real
@attribute att2 {value1,value2}
@attribute classlabel {positive,negative}

@data
0.1,value1,positive
0.0,value2,negative
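The sparse representation mentioned above stores each instance as {index value, ...} pairs with 0-based attribute indices; omitted attributes default to 0, which for a nominal attribute means its first declared value. The same two instances as in the dense example would look like this:

```
@data
{0 0.1, 2 positive}
{1 value2, 2 negative}
```

Note the common gotcha: in the first row, att2=value1 is omitted because value1 is the first value in the declaration, not because it is missing.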
Reading an arff file

into an Instances object:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.Reader;
import weka.core.Instances;

Instances trainData = null;
Reader reader = new BufferedReader(new FileReader(new File(file)));
try {
    trainData = new Instances(reader);
    trainData.setClassIndex(trainData.numAttributes() - 1);
}
finally {
    reader.close();
}
Training a classifier

import weka.classifiers.Classifier;
import weka.core.Instance;
import weka.core.Instances;

Instances trainData; // from the previous slide
Instances testData;

Classifier cl = createClassifier(); // initialize a classifier, see later
cl.buildClassifier(trainData);      // perform the training process

for (int i = 0; i < testData.numInstances(); i++) {
    Instance inst = testData.instance(i); // grab a single test instance

    // classification
    try {
        int value = (int) cl.classifyInstance(inst);           // offset of the predicted nominal value
        String label = testData.classAttribute().value(value); // query the corresponding label
        int realvalue = (int) inst.classValue();               // in case the gold labels were in the arff
        // compare ...
    } catch (Exception e) { /* handle */ }

    // distribution
    try {
        double[] distr = cl.distributionForInstance(inst); // class posteriors in an array
    } catch (Exception e) { /* handle */ }
}
Creating / Initializing a classifier

import weka.classifiers.AbstractClassifier;
import weka.classifiers.trees.J48;

public AbstractClassifier createClassifier() throws Exception {
    J48 j48 = new J48();
    Configuration config = Configuration.getInstance();
    j48.setUnpruned(config.getBooleanProperty(Configuration.USE_UNPRUNED_TREE));
    if (!j48.getUnpruned()) {
        j48.setReducedErrorPruning(config.getBooleanProperty(Configuration.REDUCED_ERROR_PRUNING));
    }
    j48.setConfidenceFactor(
        config.getFloatProperty(Configuration.CONFIDENCE_FACTOR, j48.getConfidenceFactor()));
    j48.setMinNumObj(config.getIntProperty(Configuration.MIN_NUMOBJ, j48.getMinNumObj()));
    j48.setNumFolds(config.getIntProperty(Configuration.NUM_FOLDS, j48.getNumFolds()));
    return j48;
}

Also: setOptions(java.lang.String[] options); forName(java.lang.String classifierName, java.lang.String[] options);
Weka API

Very simple, self-explaining code
Clear architecture / structure
    weka.classifiers
        weka.classifiers.bayes (BayesNet, Naive Bayes)
        weka.classifiers.functions (Neural Net, Linear Regression, Logistic Regression/Maxent, SVM, ...)
        weka.classifiers.lazy (Nearest Neighbor, ...)
        weka.classifiers.rules (JRip, ...)
        weka.classifiers.trees (C4.5, Random Forest, ...)
        weka.classifiers.meta (Boosting, Bagging, Attribute Selection, Voting, ...)
    weka.clusterers

Quick prototyping, testing of many algorithms
    Not all state of the art
    Not the most efficient (e.g. logreg is slow)
    Ideal for learning / teaching / starting up
Machine Learning Tutorial
CB, GS, REC

Section 6: Machine Learning - further topics

Machine Learning Tutorial for the UKP lab
June 10, 2011
Classification

Until now: classification
    finite set of (nominal) class labels
    classification units / instances were
        tokens
        token sequences
        sentences
        documents
        ...
Error is measured via the percentage of correct predictions
    cf. accuracy, error rate, etc.
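The accuracy measure above (and its complement, the error rate) reduces to counting matching predictions. A minimal sketch with hypothetical integer label arrays:

```java
public class AccuracySketch {
    // Fraction of predictions that match the gold labels.
    public static double accuracy(int[] gold, int[] predicted) {
        int correct = 0;
        for (int i = 0; i < gold.length; i++) {
            if (gold[i] == predicted[i]) correct++;
        }
        return (double) correct / gold.length;
    }

    public static void main(String[] args) {
        int[] gold = {1, 0, 1, 1};
        int[] pred = {1, 0, 0, 1};
        // accuracy 0.75, error rate 1 - 0.75 = 0.25
        System.out.println("accuracy = " + accuracy(gold, pred));
    }
}
```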
Methods worth considering / trying
    maximum entropy models (in weka: logistic regression)
    decision trees (for easier tasks)
    boosted decision trees
    conditional random fields
    support vector machines
Regression

Regression:
    approximate a real valued target variable
    also called function learning
    error is measured as the difference between the predicted and the observed values
    usually based on real valued features

Less typical problem setting for NLP

Methods worth considering / trying
    linear regression
    support vector machine
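The prediction/observation difference mentioned above is usually aggregated as mean squared error. A small sketch:

```java
public class RegressionErrorSketch {
    // Mean squared error between observed targets and predictions.
    public static double mse(double[] observed, double[] predicted) {
        double sum = 0.0;
        for (int i = 0; i < observed.length; i++) {
            double d = predicted[i] - observed[i];
            sum += d * d;
        }
        return sum / observed.length;
    }

    public static void main(String[] args) {
        // One prediction off by 0.5, one exact: MSE = (0.25 + 0) / 2 = 0.125
        System.out.println(mse(new double[]{1.0, 2.0}, new double[]{1.5, 2.0}));
    }
}
```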
Ranking

Preference learning
    instead of classification, try to predict a total order over a set of possible labels
    (e.g. all possible actions at a time)
    research area of the KE group here

Subset ranking / Learning to rank
    instead of classification, try to predict a (partial) order of a set of instances
    (e.g. query-document pairs)
    more relevant in NLP and especially IR

Error is calculated according to some ranking measure
    cf. P@k, MAP, NDCG

Methods worth considering / trying
    (rank)SVM
    boosted decision/regression trees
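Two of the ranking measures named above are easy to sketch for binary relevance: P@k counts relevant items in the top k, and DCG discounts each relevant item by the log of its rank (NDCG divides DCG by the DCG of an ideal ranking).

```java
public class RankingMetricsSketch {
    // Precision at k: fraction of the top-k ranked items that are relevant.
    public static double precisionAtK(boolean[] rankedRelevance, int k) {
        int hits = 0;
        for (int i = 0; i < k; i++) {
            if (rankedRelevance[i]) hits++;
        }
        return (double) hits / k;
    }

    // DCG with binary relevance: sum over relevant items of 1 / log2(rank + 1),
    // ranks starting at 1.
    public static double dcg(boolean[] rankedRelevance) {
        double sum = 0.0;
        for (int i = 0; i < rankedRelevance.length; i++) {
            if (rankedRelevance[i]) sum += 1.0 / (Math.log(i + 2) / Math.log(2));
        }
        return sum;
    }

    public static void main(String[] args) {
        boolean[] ranking = {true, false, true}; // relevance of the ranked list
        System.out.println("P@2 = " + precisionAtK(ranking, 2)); // 1 hit in top 2 -> 0.5
        System.out.println("DCG = " + dcg(ranking));             // 1/log2(2) + 1/log2(4) = 1.5
    }
}
```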
Semi-supervised learning

Exploit labeled + unlabeled data to improve models (or likewise, to get a similar model with less labeled data)

Examples
    in SVM, maximize the margin (distance from decision boundary) taking into account unlabeled points
    use unlabeled data to calculate feature statistics
    use automatically labeled data to extend the training set

Two different paradigms
    Inductive setting: learn a model that applies to new examples; use some labeled + unlabeled data and evaluate on unseen data
        more general
        less powerful
    Transductive setting: learn a model that predicts a predefined test set accurately; use some labeled data + unlabeled test data and evaluate on the unlabeled test set
        more powerful
        entails the need to retrain before predicting further new data
        cf. Niklas's thesis
Semi-supervised learning

Self-training
    train a model
    predicted instances that meet a predefined selection criterion (e.g. p(+) > 0.95) are added to the training pool, and then retrain

Co-training
    train two different models / the same model on 2 independent representations (e.g. spam filtering based on text and on links)
    predicted instances that meet a predefined selection criterion are added to the training pool of the other model, and then retrain both

Active learning
    train a model on a small initial set
    instances that meet a predefined selection criterion (e.g. model shows high uncertainty, p(+) ~ p(-)) are asked for human labeling, and then retrain
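The self-training loop above can be sketched end-to-end with a toy model. The 1-D nearest-class-mean "classifier" and the margin-based confidence test below are stand-ins for a real classifier and a p(+) > 0.95 style posterior threshold; the loop structure (predict, select confident instances, grow the pool, retrain) is the actual technique.

```java
import java.util.ArrayList;
import java.util.List;

public class SelfTrainingSketch {
    // Toy model: mean of feature values per class; examples are {feature, label(0/1)}.
    static double[] classMeans(List<double[]> labeled) {
        double[] sum = new double[2];
        int[] n = new int[2];
        for (double[] ex : labeled) {
            int y = (int) ex[1];
            sum[y] += ex[0];
            n[y]++;
        }
        return new double[]{sum[0] / n[0], sum[1] / n[1]};
    }

    // Repeatedly: predict unlabeled points, move the confidently predicted ones
    // into the labeled pool, retrain (recompute means). Returns how many were added.
    public static int selfTrain(List<double[]> labeled, List<Double> unlabeled, double margin) {
        int added = 0;
        boolean grew = true;
        while (grew) {
            grew = false;
            double[] means = classMeans(labeled); // "retrain"
            for (int i = unlabeled.size() - 1; i >= 0; i--) {
                double x = unlabeled.get(i);
                double d0 = Math.abs(x - means[0]);
                double d1 = Math.abs(x - means[1]);
                if (Math.abs(d0 - d1) >= margin) { // confident enough -> pseudo-label
                    labeled.add(new double[]{x, d0 < d1 ? 0 : 1});
                    unlabeled.remove(i);
                    added++;
                    grew = true;
                }
            }
        }
        return added;
    }

    public static void main(String[] args) {
        List<double[]> labeled = new ArrayList<>();
        labeled.add(new double[]{0.0, 0});
        labeled.add(new double[]{1.0, 1});
        List<Double> unlabeled = new ArrayList<>(List.of(0.1, 0.9, 0.5));
        System.out.println("added: " + selfTrain(labeled, unlabeled, 0.3));
    }
}
```

The ambiguous point 0.5 is never added: it stays below the confidence margin even after retraining, which is exactly the behavior the selection criterion is meant to enforce.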
Semi-sup. learning (generate training data)

Bootstrapping (generate training data)
    start with an initial small seed set
    instances that meet a predefined selection criterion (e.g. contextual similarity to the seed) are added to the training pool, and then retry

Distant supervision
    start with an assumption of positive / negative membership (e.g. for pairs in a knowledge base you know the label; look for texts containing that pair)
    generate potential positive/negative instances based on the assumption, and then train a model

Train on errors
    having labeled data for an associated task, train on its errors (which partly are due to the lack of knowledge about your current problem)
    e.g. disease and associated symptom codes are never added to the same document; learn D/S relationships from D and S labels/classifiers
Domain adaptation

When crossing domains, the texts (feature and/or label distributions) can change
    this degrades ML performance (on a target domain with a small training set, compared to a source domain with a large training set)
    try to tackle this domain impact to have OK performance in (almost) unseen domains

Pivot features that are frequent and robust across domains
    e.g. "good" is a positive sentiment word in all domains

Structural correspondence learning
    align source/target specific features through their similarities to pivot features
    can exploit target specific knowledge through correspondences to source specific features

Easy domain adaptation
    use three versions of all features: a general (source+target), a source-only and a target-only copy
    can learn general (pivot patterns) and also target- (source-)specific knowledge
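The feature tripling of easy domain adaptation (Daumé's "frustratingly easy" feature augmentation) is a one-line transformation per feature. A sketch with string features; the "general:"/"source:"/"target:" prefixes are illustrative naming, not a fixed convention:

```java
import java.util.ArrayList;
import java.util.List;

public class EasyAdaptSketch {
    // Every feature gets a shared "general" copy plus a domain-specific copy;
    // the learner can then put weight on whichever version generalizes.
    public static List<String> augment(List<String> features, String domain) {
        List<String> out = new ArrayList<>();
        for (String f : features) {
            out.add("general:" + f);
            out.add(domain + ":" + f);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(augment(List.of("good", "plot"), "source"));
    }
}
```

A pivot like "good" contributes through its general: copy in every domain, while domain-specific quirks are absorbed by the source:/target: copies.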