
1

SIMS 290-2: Applied Natural Language Processing

Preslav Nakov
October 6, 2004

2

Today

The 20 Newsgroups Text Collection

WEKA: Explorer

WEKA: Experimenter

Python Interface to WEKA

WEKA: Real-time Demo

3

The 20 Newsgroups Text Collection

WEKA: Explorer

WEKA: Experimenter

Python Interface to WEKA

WEKA: Real-time Demo

4

20 Newsgroups Data Set
http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/

Source: originally collected by Ken Lang

Content and structure:
approximately 20,000 newsgroup documents
– 19,997 originally
– 18,828 without duplicates
partitioned evenly across 20 different newsgroups

Some categories are strongly related (and thus hard to discriminate):

computers:
comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x

rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey

sci.crypt, sci.electronics, sci.med, sci.space

misc.forsale, talk.politics.misc, talk.politics.guns, talk.politics.mideast

talk.religion.misc, alt.atheism, soc.religion.christian

5

Sample Posting: "talk.politics.guns"

From: cdt@sw.stratus.com (C. D. Tavares)
Subject: Re: Congress to review ATF's status

In article <C5vzHF.D5K@cbnews.cb.att.com>, lvc@cbnews.cb.att.com (Larry Cipriani) writes:

> WASHINGTON (UPI) -- As part of its investigation of the deadly
> confrontation with a Texas cult, Congress will consider whether the
> Bureau of Alcohol, Tobacco and Firearms should be moved from the
> Treasury Department to the Justice Department, senators said Wednesday.
> The idea will be considered because of the violent and fatal events
> at the beginning and end of the agency's confrontation with the Branch
> Davidian cult.

Of course. When the catbox begines to smell, simply transfer its
contents into the potted plant in the foyer.

"Why Hillary! Your government smells so... FRESH!"
--
cdt@rocket.sw.stratus.com --If you believe that I speak for my company,
OR cdt@vos.stratus.com write today for my special Investors' Packet...

Callouts mark the parts of the posting: the "from" line, the "subject" line, the reply attribution ("… writes:" plus the quoted text), and the signature. These need special handling during feature extraction…
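For instance, the headers, quoted reply text, and signature can be stripped before extracting word features. A minimal sketch in Python; the function name strip_posting and its heuristics (a blank line ends the headers, ">" marks quoted text, a bare "--" line starts the signature) are common newsgroup conventions assumed here, not the course's actual preprocessing code:

def strip_posting(text):
    """Keep only the body of a posting: drop headers, quoted text, signature."""
    kept, in_body = [], False
    for line in text.splitlines():
        if not in_body:
            if line.strip() == '':          # first blank line ends the headers
                in_body = True
            continue
        if line.lstrip().startswith('>'):   # quoted text from an earlier post
            continue
        if line.strip() == '--':            # conventional signature delimiter
            break
        kept.append(line)
    return '\n'.join(kept)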

6

The 20 Newsgroups Text Collection

WEKA: Explorer

WEKA: Experimenter

Python Interface to WEKA

WEKA: Real-time Demo

7
Slide adapted from Eibe Frank's

WEKA: The Bird

Copyright: Martin Kramer (mkramer@wxs.nl), University of Waikato, New Zealand

8

WEKA: Terminology

Some synonyms/explanations for the terms used by WEKA, which may differ from what we have adopted:

Attribute: feature
Relation: collection of examples
Instance: example (WEKA's Instances class holds the collection in use)
Class: category

9
Slide adapted from Eibe Frank's

WEKA: The Software Toolkit

Machine learning/data mining software in Java
GNU License
Used for research, education and applications
Complements "Data Mining" by Witten & Frank

Main features:
data pre-processing tools
learning algorithms
evaluation methods
graphical interface (incl. data visualization)
environment for comparing learning algorithms

http://www.cs.waikato.ac.nz/ml/weka

10
Slide adapted from Eibe Frank's

WEKA GUI Chooser

Start it with: java -Xmx1000M -jar weka.jar
(-Xmx1000M raises the Java heap limit to 1000 MB, which large datasets need)

11
Slide adapted from Eibe Frank's

Our Toy Example

We demonstrate WEKA on a toy example:

3 categories from "20 Newsgroups":
– misc.forsale
– rec.sport.hockey
– comp.graphics

20 documents per category

features:
– words converted to lowercase
– frequency 2 or more required
– stopwords removed

12
Slide adapted from Eibe Frank's

Explorer: Pre-Processing The Data

WEKA can import data from:
files: ARFF, CSV, C4.5, binary
URL
SQL database (using JDBC)

Pre-processing tools (filters) are used for:
discretization, normalization, resampling, attribute selection, transforming and combining attributes, etc.

13

The Preprocessing Tab

Screenshot callouts: filter selection; manual attribute selection; statistical attribute selection; the list of attributes (the last one is the class variable); frequency and categories for the selected attribute; statistics about the values of the selected attribute; and the Preprocessing and Classification tabs.

14
Slide adapted from Eibe Frank's

Explorer: Building “Classifiers”

Classifiers in WEKA are models for:
classification (predict a nominal class)
regression (predict a numerical quantity)

Learning algorithms:
Naïve Bayes, decision trees, kNN, support vector machines, multi-layer perceptron, logistic regression, etc.

Meta-classifiers:
cannot be used alone
always combined with a learning algorithm
examples: boosting, bagging, etc.

15

The Classification Tab

Screenshot callouts:
– Choice of classifier
– The class attribute: the attribute whose value is to be predicted from the values of the remaining ones. The default is the last attribute; here (in our toy example) it is named "class".
– Cross-validation: split the data into e.g. 10 folds, then 10 times train on 9 folds and test on the remaining one (a minimal sketch follows below).
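A minimal sketch of the 10-fold procedure just described; train_and_score is a hypothetical stand-in for training a classifier and returning its accuracy on the held-out fold:

def cross_validate(examples, train_and_score, folds=10):
    """Each example is held out exactly once; return the mean accuracy."""
    scores = []
    for i in range(folds):
        test = [x for j, x in enumerate(examples) if j % folds == i]
        train = [x for j, x in enumerate(examples) if j % folds != i]
        scores.append(train_and_score(train, test))
    return sum(scores) / float(folds)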

16

Choosing a classifier

17

18

Callouts on the Naïve Bayes options dialog:
– kernel estimator: False fits a Gaussian; True uses kernels (better)
– numerical-to-nominal conversion by discretization
– an option that outputs additional information
– a button that displays the synopsis and options

19

20

21

Callouts on the output: the accuracy; a different/easy class; and the confusion matrix, from which all other numbers can be obtained.

22

Confusion Matrix

Contains information about the actual and the predicted classification.

All measures can be derived from it:
accuracy: (a+d)/(a+b+c+d)
recall: d/(c+d) => R
precision: d/(b+d) => P
F-measure: 2PR/(P+R)
false positive (FP) rate: b/(a+b)
true negative (TN) rate: a/(a+b)
false negative (FN) rate: c/(c+d)

These extend to more than 2 classes: see the previous lecture slides for details.

          predicted
            –   +
true    –   a   b
        +   c   d
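The sketch below is a direct transcription of these formulas, with a, b, c, d as in the matrix above:

def measures(a, b, c, d):
    """Derive the standard measures from a 2x2 confusion matrix."""
    accuracy  = (a + d) / float(a + b + c + d)
    recall    = d / float(c + d)            # R
    precision = d / float(b + d)            # P
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure

For example, measures(50, 10, 5, 35) gives accuracy 0.85, precision 0.78, recall 0.875, F-measure 0.82.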

23

Predictions Output

Outputs the probability distribution for each example.

24

Predictions Output

Probability distribution for a wrong example: predicted class 1 instead of class 3. Naïve Bayes makes incorrect conditional independence assumptions and is typically over-confident in its prediction, regardless of whether it is correct or not.

25

Error Visualization

26

Error Visualization

Little squares designate errors

Axes show example number

27
Slide adapted from Eibe Frank's

Explorer: Attribute Selection

Find which attributes are the most predictive ones.

Two parts:
search method:
– best-first, forward selection, random, exhaustive, genetic algorithm, ranking
evaluation method:
– information gain, chi-squared, etc.

Very flexible: WEKA allows (almost) arbitrary combinations of these two. (A sketch of the information-gain evaluation follows below.)
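A minimal sketch of the information-gain evaluation named above, for one feature over (feature_value, class_label) pairs; this is the standard IG(class; feature) = H(class) - H(class | feature), not WEKA's own implementation:

from math import log
from collections import Counter

def entropy(labels):
    n = float(len(labels))
    return -sum(c / n * log(c / n, 2) for c in Counter(labels).values())

def info_gain(pairs):
    """pairs: list of (feature_value, class_label) tuples."""
    gain = entropy([cls for _, cls in pairs])
    for v in set(val for val, _ in pairs):
        subset = [cls for val, cls in pairs if val == v]
        gain -= len(subset) / float(len(pairs)) * entropy(subset)
    return gain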

28

Individual Features Ranking

29

misc.forsale

comp.graphics

rec.sport.hockey

Individual Features Ranking

30

misc.forsale

comp.graphics

rec.sport.hockey

???
random number seed

Individual Features Ranking

31
Slide adapted from Jakulin, Bratko, Smrke, Demšar and Zupan's

Feature Interactions

2-Way Interactions

[Diagram: two features A and B and the category C; the links show the importance of feature A, the importance of feature B, and the feature correlation between A and B.]

32
Slide adapted from Jakulin, Bratko, Smrke, Demšar and Zupan's

Feature Interactions

3-Way Interaction: what is common to A, B and C together, and cannot be inferred from pairs of features.

[Diagram: the same triangle of features A and B and the category C, now highlighting the three-way overlap.]
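One common way to make this precise (the interaction-information formulation used in Jakulin and Bratko's work; stated here as background, not taken from the slide itself) is

I(A;B;C) = I(A;B|C) - I(A;B)

which is positive when the category C creates synergy between features A and B, and negative when it makes them redundant.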

33
Slide adapted from Guozhu Dong's

Feature Subset Selection

Problem illustration: the space between the full set and the empty set; enumeration.

Search:
– exhaustive/complete (enumeration / branch & bound)
– heuristic (sequential forward/backward)
– stochastic (generate/evaluate)
– individual features or subsets: generation/evaluation

(A sketch of sequential forward selection follows below.)
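A minimal sketch of the sequential forward search named above; score is a hypothetical subset-evaluation function (e.g. cross-validated accuracy), not part of the course code:

def forward_selection(attributes, score):
    """Greedily add the attribute that most improves score; stop when none helps."""
    selected, best = [], score([])
    while True:
        candidates = [(score(selected + [a]), a)
                      for a in attributes if a not in selected]
        if not candidates:
            break
        top_score, top_attr = max(candidates, key=lambda t: t[0])
        if top_score <= best:
            break
        selected.append(top_attr)
        best = top_score
    return selected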

34

Feature Subset Selection

35

misc.forsale

comp.graphics

rec.sport.hockey

17,309 subsets considered
21 attributes selected

Feature Subset Selection

36

Saving the Selected Features

All we can do from this tab is to save the buffer in a text file. Not very useful...

But we can also perform feature selection during the pre-processing step...(the following slides)

37

Feature Selection in Preprocessing

38

Feature Selection in Preprocessing

39

Feature Selection in Preprocessing

679 attributes: 678 + 1 (for the class)

40

Feature Selection in Preprocessing

Just 22 attributes remain:

21 + 1 (for the class)

41

Run Naïve Bayes With the 21 Features

higher accuracy

21 Attributes

42

different/easy class

accuracy

(AGAIN) Naïve Bayes With All Features

ALL 679 Attributes (repeated slide)

43

Some Important Algorithms

Sometimes WEKA uses weird names for some algorithms. Here is how to find the algorithms Barbara introduced:

Naïve Bayes: weka.classifiers.bayes.NaiveBayes
Perceptron: weka.classifiers.functions.VotedPerceptron
Winnow: weka.classifiers.functions.Winnow
Decision tree: weka.classifiers.trees.J48
Support vector machines: weka.classifiers.functions.SMO
k nearest neighbor: weka.classifiers.lazy.IBk

Some of these are more sophisticated versions of the classic algorithms; e.g. I cannot find the classic Naïve Bayes in WEKA (although there are 5 available implementations).

44

The 20 Newsgroups Text Collection

WEKA: Explorer

WEKA: Experimenter

Python Interface to WEKA

WEKA: Real-time Demo

45
Slide adapted from Eibe Frank's

Performing Experiments

The Experimenter makes it easy to compare the performance of different learning schemes.

Problems: classification, regression

Results: written into a file or a database

Evaluation options:
– cross-validation
– learning curve
– hold-out

Can also iterate over different parameter settings.
Significance testing built in! (A simplified sketch follows below.)
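For intuition, a simplified sketch of significance testing on paired per-run accuracies. (WEKA's Experimenter actually uses a corrected resampled t-test, which adjusts for the overlap between resampled training sets; the plain paired t-test below is only an illustration.)

from math import sqrt

def paired_t(acc_a, acc_b):
    """Plain paired t statistic over matched accuracy lists."""
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    n = len(diffs)
    mean = sum(diffs) / float(n)
    var = sum((d - mean) ** 2 for d in diffs) / float(n - 1)
    return mean / sqrt(var / n)   # compare to a t table with n-1 d.o.f.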

46

Experiments Setup

47

Experiments Setup

48

Experiments Setup

CSV file: can be opened in Excel

Callouts: datasets, algorithms

49

Experiments Setup

50

Experiments Setup

51

Experiments Setup

52

Experiments Setup

53

Experiments Setup

accuracy

SVM is the best

Decision tree is the worst

SVM is statistically better than Naïve Bayes

Decision tree is statistically worse than Naïve Bayes

54

Experiments: Excel

Results are output into a CSV file, which can be read in Excel!

55

The 20 Newsgroups Text Collection

WEKA: Explorer

WEKA: Experimenter

Python Interface to WEKA

WEKA: Real-time Demo

56
Slide adapted from Eibe Frank's

WEKA File Format: ARFF

@relation heart-disease-simplified

@attribute age numeric
@attribute sex { female, male }
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina }
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes }
@attribute class { present, not_present }

@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
...

Callouts: age is a numerical attribute; sex is a nominal attribute; the "?" marks a missing value. Other attribute types: String and Date.
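The layout is simple enough to emit directly; a minimal sketch (the helper write_arff and its arguments are illustrative, not part of WEKA or the course script):

def write_arff(f, relation, attributes, rows):
    """attributes: list of (name, type) pairs; rows: tuples of values."""
    f.write('@relation %s\n\n' % relation)
    for name, atype in attributes:
        f.write('@attribute %s %s\n' % (name, atype))
    f.write('\n@data\n')
    for row in rows:
        f.write(','.join(str(v) for v in row) + '\n')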

57

WEKA File Format: Sparse ARFF

Value 0 is not represented explicitly. Same header (i.e. @relation and @attribute tags); only the @data section is different.

Instead of

@data
0, X, 0, Y, "class A"
0, 0, W, 0, "class B"

we have

@data
{1 X, 3 Y, 4 "class A"}
{2 W, 4 "class B"}

This is especially useful for textual data (why?). But: problems with feature selection (cannot save results).
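A minimal sketch of the dense-to-sparse conversion shown above (to_sparse is an illustrative helper, not WEKA code): each non-zero value is written as an "index value" pair inside braces.

def to_sparse(values):
    pairs = ['%d %s' % (i, v) for i, v in enumerate(values) if v != 0]
    return '{' + ', '.join(pairs) + '}'

# to_sparse([0, 'X', 0, 'Y', '"class A"']) -> '{1 X, 3 Y, 4 "class A"}'

Why it matters for text: with thousands of word attributes, almost every value in a document's vector is 0, so sparse rows are dramatically smaller.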

58

Python Interface to WEKA

Works on the 20 newsgroups collection.

Extracts the features:
currently words
easy to modify, just change one or more of:
– extract_features_and_freqs()
– is_feature_good()
– build_stoplist()

Allows filtering out:
the stopwords
the infrequent features

Features are weighted by document frequency.
Produces an ARFF file to be used by WEKA.
(A hedged sketch of is_feature_good() follows below.)
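A hedged sketch of what is_feature_good() might look like; only the function name comes from the script above, and the body (minimum frequency, stoplist, alphabetic tokens) is an assumption matching the filters just listed:

def is_feature_good(word, freq, stoplist, min_freq=2):
    return (freq >= min_freq and      # drop infrequent features
            word not in stoplist and  # drop stopwords
            word.isalpha())           # keep plain alphabetic tokens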

59

Python Interface to WEKA

Allows specifying:
which subset of classes to consider
the number of documents for each class
the minimum feature frequency
a regular expression pattern a feature should match
whether to remove the stopwords
whether to convert words to lowercase
the kind of output to produce:
– sparse (i.e., feature = value)
– full vector (list of values)

60

Python Interface to WEKA: How To

Requires the "20_newsgroups" and "stopwords" corpora to be installed.
To get things working under Windows: open "__init__.py" and, in the code below, substitute "/" with "\\".

#################################################
# 20 Newsgroups
groups = [(ng, ng+'/.*') for ng in '''
    alt.atheism rec.autos sci.space
    comp.graphics rec.motorcycles soc.religion.christian
    comp.os.ms-windows.misc rec.sport.baseball talk.politics.guns
    comp.sys.ibm.pc.hardware rec.sport.hockey talk.politics.mideast
    comp.sys.mac.hardware sci.crypt talk.politics.misc
    comp.windows.x sci.electronics talk.religion.misc
    misc.forsale sci.med'''.split()]
twenty_newsgroups = SimpleCorpusReader(
    '20_newsgroups', '20_newsgroups/', '.*/.*', groups,
    description_file='../20_newsgroups.readme')
del groups  # delete temporary variable

61

Python Interface to WEKA

The Main Function

62

Python Interface to WEKA

Example Usage

Screenshot callouts: a Python dictionary; frequencies estimated over the whole set! (fine for cross-validation, not OK for a test/train split); Use 1.

63

Python Interface to WEKA
Functions You Will Probably Want To Modify

convert to lowercase
Also: stemming! Also: word+POS! Also: compounds!

64

Python Interface to WEKA
You might want to add… Stemming

Porter stemmer:

>>> from nltk.stemmer.porter import *
>>> cats = Token(TEXT='cats', POS='NN')
>>> porter = PorterStemmer()
>>> porter.stem(cats)
>>> print cats
<POS='NN', STEM='cat', TEXT='cats'>

WordNet stemmer:
morphy – morphological analyzer
you need the following packages installed:
– nltk.wordnet
– nltk-contrib.pywordnet

>>> from nltk_contrib.pywordnet.stemmer import *
>>> morphy('dogs')
'dog'

65

Python Interface to WEKA
You might want to add… TF.IDF

TF.IDF: t_ij * log(N/n_i)

TF:
– t_ij: frequency of term i in document j
– this is how features are currently weighted

IDF: log(N/n_i)
– n_i: number of documents containing term i
– N: total number of documents

Modify the function extract_features_and_freqs_forall(). (A sketch follows below.)
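A direct transcription of the formula above (the helper name tfidf and the input format, a list of {term: frequency} dictionaries, are illustrative assumptions):

from math import log

def tfidf(docs):
    """docs: one {term: frequency} dict per document."""
    N = len(docs)
    df = {}                                   # n_i: documents containing term i
    for doc in docs:
        for term in doc:
            df[term] = df.get(term, 0) + 1
    return [dict((term, tf * log(N / float(df[term])))
                 for term, tf in doc.items())
            for doc in docs]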

66

The 20 Newsgroups Text Collection

WEKA: Explorer

WEKA: Experimenter

Python Interface to WEKA

WEKA: Real-time Demo

67

Summary

The 20 Newsgroups Text Collection

WEKA: The Toolkit
Explorer
– Classification
– Feature selection
Experimenter
ARFF file format

Python Interface to WEKA
feature extraction
stemming
weighting: TF.IDF

WEKA: Real-time Demo