Text classification: In Search of a Representation
Stan Matwin
School of Information Technology and Engineering
University of Ottawa
Outline
Supervised learning = classification
ML/DM at U of O
Classical approach
Attempt at a linguistic representation
N-grams – how to get them?
Labelling and co-learning
Next steps?…
Supervised learning (classification)
Given:
a set of training instances T = {(e, t)}, where each t is a class label: one of the classes C1, …, Ck
a concept with k classes C1, …, Ck (but the definition of the concept is NOT known)
Find:
a description of each class that will perform well in determining (predicting) class membership for unseen instances
Classification
Prevalent practice:
examples are represented as vectors of values of attributes
Theoretical wisdom, confirmed empirically: the more examples, the better the predictive accuracy
ML/DM at U of O
Learning from imbalanced classes: applications in remote sensing
a relational, rather than propositional representation: learning the maintainability concept
Learning in the presence of background knowledge: Bayesian belief networks and how to get them; application to distributed databases
Why text classification?
Automatic file saving
Internet filters
Recommenders
Information extraction
…
Bag of words
Text classification: standard approach
1. Remove stop words and markings
2. Remaining words all become attributes
3. A document becomes a vector of <word, frequency> pairs
4. Train a boolean classifier for each class
5. Evaluate the results on an unseen sample
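As a concrete illustration of steps 1-5, here is a minimal sketch using scikit-learn (a stand-in for the tools actually used in this work); the toy corpus and class names are invented:

```python
# Bag-of-words pipeline sketch: stop-word removal, <word, frequency>
# vectors, and one boolean (one-vs-rest) classifier per class.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["interest rates rise again", "the wheat crop exceeds the forecast",
        "central bank holds rates steady", "corn and wheat exports grow"]
labels = ["money", "grain", "money", "grain"]

# Steps 1-3: drop stop words, keep remaining words as attributes,
# and turn each document into a vector of <word, frequency> pairs.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Step 4: train one boolean classifier per class (Naive Bayes stands in
# here; the talk uses RIPPER or Naive Bayes).
classifiers = {}
for cls in set(labels):
    y = [1 if label == cls else 0 for label in labels]
    classifiers[cls] = MultinomialNB().fit(X, y)

# Step 5: evaluate on an unseen (made-up) document.
X_new = vectorizer.transform(["rates and exports fall"])
print({cls: int(clf.predict(X_new)[0]) for cls, clf in classifiers.items()})
```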
Text classification: tools
RIPPER
A “covering” learner
Works well with large sets of binary features
Naïve Bayes
Efficient (no search)
Simple to program
Gives a “degree of belief”
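To show why Naïve Bayes is “simple to program” and how it yields a degree of belief, here is a rough hand-rolled multinomial version with Laplace smoothing (not the implementation used in this work; the toy data is invented):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """docs: list of token lists; labels: parallel list of class names."""
    class_counts = Counter(labels)                 # class priors
    word_counts = defaultdict(Counter)             # class -> word -> count
    vocab = set()
    for tokens, cls in zip(docs, labels):
        word_counts[cls].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def belief(tokens, class_counts, word_counts, vocab):
    """Unnormalized log P(class | tokens), with Laplace smoothing."""
    total_docs = sum(class_counts.values())
    scores = {}
    for cls, n_docs in class_counts.items():
        total_words = sum(word_counts[cls].values())
        logp = math.log(n_docs / total_docs)
        for w in tokens:
            logp += math.log((word_counts[cls][w] + 1) /
                             (total_words + len(vocab)))
        scores[cls] = logp
    return scores

docs = [["interest", "rates", "rise"], ["wheat", "crop", "forecast"]]
labels = ["money", "grain"]
print(belief(["rates", "fall"], *train_nb(docs, labels)))  # higher = stronger
```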
“Prior art”
Yang: best results using k-NN: 82.3% micro-averaged accuracy
Joachims’ results using Support Vector Machines + unlabelled data
SVM insensitive to high dimensionality, sparseness of examples
SVM in Text classification
SVM: maximum separation
Transductive SVM: margin for the test set
Training with 17 examples in the 10 most frequent categories gives test performance of 60% on 3000+ test cases available during training
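For comparison, a minimal sketch of a standard (inductive) linear SVM on sparse text features with scikit-learn; Joachims’ transductive variant, which also exploits the unlabelled test set, is not available there, and the data below is invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

train_docs = ["oil prices climb", "wheat harvest begins",
              "crude oil futures fall", "corn and wheat exports grow"]
train_labels = ["oil", "grain", "oil", "grain"]

vec = TfidfVectorizer()
X = vec.fit_transform(train_docs)        # high-dimensional, sparse features
clf = LinearSVC().fit(X, train_labels)   # SVMs cope well with such data

print(clf.predict(vec.transform(["oil exports fall"])))  # likely ['oil']
```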
Problem 1: aggressive feature selection
AI: “Machine”: 50%, “Learning”: 75%, “Machine Learning”: 50%
EP: “Machine”: 4%, “Learning”: 75%, “Machine Learning”: 0%
MT: “Machine”: 80%, “Learning”: 5%, “Machine Learning”: 0%
RIPPER (B.O.W.): machine & learning = AI
FLIPPER (Cohen): machine & learning & near & after = AI
RIPPER (Phrases): “machine learning” = AI
Problem 2: semantic relationships are missed
[Diagram: “weapon” as the common concept above knife, gun, dagger, sword, rifle, slingshot]
Semantically related words may be sparsely distributed through many documents
A statistical learner may be able to pick up these correlations
A rule-based learner is disadvantaged
Proposed solution (Sam Scott)
Get noun phrases and/or key phrases (Extractor) and add to the feature list
Add hypernyms
Hypernyms - WordNet
“synset” => SYNONYM
“is a” => HYPERNYM
“instance of” => HYPONYM
[Diagram: WordNet fragment: “gun” and “knife” are kinds of (“is a”) “weapon”; “pistol, revolver” form a synset under “gun”]
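A small sketch of collecting hypernyms as extra features, here via NLTK’s WordNet interface (an assumption about tooling; the original work may have queried WordNet differently). It needs the WordNet corpus, e.g. nltk.download("wordnet"):

```python
from nltk.corpus import wordnet as wn

def hypernym_features(word, depth=2):
    """Collect lemma names of hypernyms up to `depth` levels above `word`."""
    features = set()
    for synset in wn.synsets(word, pos=wn.NOUN):
        level = [synset]
        for _ in range(depth):
            level = [h for s in level for h in s.hypernyms()]
            for s in level:
                features.update(lemma.name() for lemma in s.lemmas())
    return features

# Ancestors such as 'edge_tool' or 'weapon' appear, depending on the sense.
print(hypernym_features("knife"))
```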
Evaluation (Lewis)
• Vary the “loss ratio” parameter
• For each parameter value:
• Learn a hypothesis for each class (binary classification)
• Micro-average the confusion matrices (add component-wise)
• Compute precision and recall
• Interpolate (or extrapolate) to find the point where micro-averaged precision and recall are equal
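The micro-averaging step itself is just a component-wise sum of the per-class confusion counts; a rough sketch (the loss-ratio sweep and break-even interpolation are omitted, and the counts are made up):

```python
import numpy as np

def micro_precision_recall(confusions):
    """confusions: one (tp, fp, fn) tuple per binary (one-vs-rest) class."""
    tp, fp, fn = np.sum(confusions, axis=0)      # add component-wise
    return tp / (tp + fp), tp / (tp + fn)        # precision, recall

# Three one-vs-rest classifiers with invented counts.
print(micro_precision_recall([(80, 10, 20), (40, 5, 15), (10, 2, 8)]))
```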
Results
No gain over BW in alternative representations
But…
Comprehensibility…
Micro-averaged break-even point:

        Reuters   DigiTrad
BW      .821      .359
BWS     .810      .360
NP      .827      .357
NPS     .819      .356
KP      .817      .288e
KPS     .816      .297e
H0      .741e     .283
H1      .734e     .281
NPW     .823      N/A
Combining classifiers
Comparable to best known results (Yang)
     Reuters                        DigiTrad
#    representations         b.e.   representations         b.e.
1    NP                      .827   BWS                     .360
3    BW, NP, NPS             .845   BW, BWS, NP             .404e
5    BW, NP, NPS, KP, KPS    .849   BW, BWS, NP, KPS, KP    .422e
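The combination scheme is not spelled out here; one simple possibility is a majority vote over the classifiers trained on the different representations, sketched below with invented predictions:

```python
from collections import Counter

def combine(predictions):
    """predictions: representation name -> predicted label."""
    return Counter(predictions.values()).most_common(1)[0][0]

print(combine({"BW": "grain", "NP": "money", "NPS": "grain"}))  # -> 'grain'
```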
Other possibilities
Using hypernyms with a small training set (avoids ambiguous words)
Use Bayes+Ripper in a cascade scheme (Gama)
Other representations:
Collocations
Do not need to be noun phrases, just pairs of words possibly separated by stop words
Only the well-discriminating ones are chosen
These are added to the bag of words, and RIPPER is run on the extended representation
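A rough sketch of that idea: count word pairs that co-occur within a small window (so stop words may sit between them) and keep only pairs that discriminate well between classes. The stop-word list, window size, and threshold are illustrative choices, not taken from this work:

```python
from collections import Counter

STOP = {"the", "a", "of", "and", "to", "in"}

def pairs(tokens, window=3):
    """Ordered word pairs within `window` content words of each other."""
    content = [t for t in tokens if t not in STOP]
    for i, w in enumerate(content):
        for v in content[i + 1:i + window]:
            yield (w, v)

def discriminating_pairs(docs, labels, min_ratio=3.0):
    """Keep pairs much more frequent in one class than in the others."""
    counts = {c: Counter() for c in set(labels)}
    for tokens, c in zip(docs, labels):
        counts[c].update(pairs(tokens))
    keep = set()
    for c, ctr in counts.items():
        other = sum((counts[o] for o in counts if o != c), Counter())
        for p, n in ctr.items():
            if n >= (other[p] + 1) * min_ratio:    # crude discrimination test
                keep.add(p)
    return keep
```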
N-grams
N-grams are substrings of a given length
Good results on Reuters [Mladenic, Grobelnik] with Bayes; we try RIPPER
A different task: classifying text files
Attachments
Audio/video
Coded
From n-grams to relational features
How to get good n-grams?
We use Ziv-Lempel for frequent substring detection (.gz!)
[Diagram: Ziv-Lempel substring detection illustrated on the string “abababaa”]
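A minimal sketch of an LZ78-style parse in the spirit of this idea (the actual gzip-based procedure is not reproduced here); repeated material shows up as longer phrases being built from earlier ones:

```python
def lz78_phrases(s):
    """Return the distinct phrases produced by an LZ78-style parse of s."""
    phrases = {}
    current = ""
    for ch in s:
        current += ch
        if current not in phrases:          # new phrase: emit and restart
            phrases[current] = len(phrases) + 1
            current = ""
    return list(phrases)

# -> ['a', 'b', 'ab', 'aba']; a trailing 'a' is left as an unfinished phrase
print(lz78_phrases("abababaa"))
```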
N-grams
Counting
Pruning: substring occurrence ratio < acceptance threshold
Building relations: string A almost always precedes string B
Feeding into relational learner (FOIL)
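The precedence relation can be computed directly over the training files before handing it to the relational learner; a rough sketch, with an illustrative 0.9 threshold for “almost always”:

```python
def precedes(a, b, texts, threshold=0.9):
    """True if substring a occurs before substring b in (almost) every
    text containing both."""
    both = [t for t in texts if a in t and b in t]
    if not both:
        return False
    before = sum(1 for t in both if t.find(a) < t.find(b))
    return before / len(both) >= threshold

files = ["header-body-footer", "header--footer", "body-header-footer"]
print(precedes("header", "footer", files))   # True: holds in all three files
```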
Using grammar induction (text files)
Idea: detect patterns of substrings
Patterns are regular languages
Methods of automata induction: a recognizer for each class of files
We use a modified version of RPNI2 [Dupont, Miclet]
What’s new…
Work with marked up text (Word, Web)
XML with semantic tags: mixed blessing for DM/TM
Co-learning
Text mining
Co-learning
How to use unlabelled data? Or: how to limit the number of examples that need to be labelled?
Two classifiers and two redundantly sufficient representations
Train both, run both on test set, add best predictions to training set
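A rough sketch of such a loop, assuming two feature matrices X1 and X2 for the two redundantly sufficient views, scikit-learn’s Naïve Bayes as the base learner, and illustrative choices for the number of rounds and of promoted predictions (published co-training procedures differ in details such as candidate pools):

```python
# X1, X2: feature matrices over the same documents (two views); y: label
# array with placeholder values at unlabelled positions.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_train(X1, X2, y, labelled_idx, unlabelled_idx, rounds=10, k=2):
    labelled, unlabelled = list(labelled_idx), list(unlabelled_idx)
    for _ in range(rounds):
        c1 = MultinomialNB().fit(X1[labelled], y[labelled])
        c2 = MultinomialNB().fit(X2[labelled], y[labelled])
        for clf, X in ((c1, X1), (c2, X2)):
            if not unlabelled:
                break
            proba = clf.predict_proba(X[unlabelled])
            best = np.argsort(proba.max(axis=1))[-k:]      # most confident
            for b in sorted(best, reverse=True):           # pop high first
                idx = unlabelled.pop(b)
                y[idx] = clf.classes_[proba[b].argmax()]   # self-labelled
                labelled.append(idx)
    return MultinomialNB().fit(X1[labelled], y[labelled])
```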
Co-learning
Training set grows as each learner predicts independently, due to redundant sufficiency (different representations)
Would it also work with our learners if we used Bayes?
Would work with classifying emails
Co-learning
Mitchell experimented with the task of classifying web pages (profs, students, courses, projects) – a supervised learning task
Used anchor text and page contents
Error rate halved (from 11% to 5%)
Cog-sci?
Co-learning seems to be cognitively justified
Model: students learning in groups (pairs)
What other social learning mechanisms could provide models for supervised learning?
Conclusion
A practical task that needs a solution
No satisfactory solution so far
Fruitful ground for research