Extracting Key-Substring-Group Features for Text Classification
Dell Zhang and Wee Sun Lee
KDD2006
The Context
Text Classification via Machine Learning (ML)
Training Documents (L) → Learning → Classifier → Predicting → Test Documents (U)
Text Data
to_be_or_not_to_be…
To be, or not to be
…
tobeor
betonot
…
Some Applications
Non-Topical Text Classification
Text Genre Classification: Paper? Poem? Prose?
Text Authorship Classification: Washington? Adams? Jefferson?
How to exploit sub-word/super-word information?
Some Applications
Asian-Language Text Classification
How to avoid the problem of word-segmentation?
Some Applications
Spam Filtering
How to handle non-alphabetical characters etc.?
(Pampapathi et al., 2006)
Some Applications
Desktop Text Classification
How to deal with different types of files?
Learning Algorithms
Generative: Naïve Bayes, Rocchio, …
Discriminative: Support Vector Machine (SVM), AdaBoost, …
For word-based text classification, discriminative methods are often superior to generative methods.
How about string-based text classification?
String-Based Text Classification
Generative: Markov chain models (char-level)
fixed order: n-gram, …
variable order: PST, PPM, …
Discriminative: SVM with string kernel (= taking all substrings as features implicitly through the “kernel trick”)
limitations: (1) ridge problem; (2) feature redundancy; (3) feature selection/weighting and advanced kernels.
[Figure: grid of generative vs. discriminative methods against word-based vs. string-based features, with a "?" marking the open question]
The Problem
The Difficulty
The number of substrings: O(n^2)
5 + 9 = 14 characters
15 + 45 = 60 substrings
d1: to_be
d2: not_to_be
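The counts above follow from the fact that a string of n characters has n(n+1)/2 substring occurrences. A minimal Python sketch of that arithmetic (the function name `substring_count` is only illustrative):

```python
def substring_count(text: str) -> int:
    """Number of substring occurrences (start/end pairs) in text."""
    n = len(text)
    return n * (n + 1) // 2


docs = ["to_be", "not_to_be"]
counts = [substring_count(d) for d in docs]  # one count per document
total = sum(counts)                          # substrings over the whole corpus
```

For the two example documents this reproduces the figures on the slide: 15 and 45 substrings, 60 in total, from only 14 characters.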
Our Idea
The substrings could be partitioned into statistical equivalence groups
to, to_, to_b, to_be
d1: to_be
d2: not_to_be
ot, ot_, ot_t, ot_to, ot_to_, ot_to_b, ot_to_be
……
Suffix Tree
[Figure: generalized suffix tree of d1 and d2; each leaf is labeled with the document (d1 or d2) that its suffix comes from]
a suffix tree node = a substring group
Substring-Groups
The substrings in an equivalence group have exactly the same distribution over the corpus, so such a substring-group can be taken as a whole as a single feature for a statistical machine learning algorithm for text classification.
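To make the grouping concrete, here is a brute-force Python sketch. It assumes that two substrings are equivalent when they start at exactly the same positions in the corpus, which is how substrings on one suffix tree edge coincide. The function name `substring_groups` is hypothetical, and the enumeration is only practical for tiny examples, not a replacement for the suffix tree.

```python
from collections import defaultdict


def substring_groups(docs):
    """Group substrings that start at exactly the same corpus positions."""
    starts = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for i in range(len(text)):
            for j in range(i + 1, len(text) + 1):
                starts[text[i:j]].add((doc_id, i))

    groups = defaultdict(list)
    for sub, positions in starts.items():
        groups[frozenset(positions)].append(sub)
    # Within a group, list members from shortest to longest.
    return [sorted(members, key=len) for members in groups.values()]


groups = substring_groups(["to_be", "not_to_be"])
```

On the running example, one of the resulting groups is `to, to_, to_b, to_be` and another is `ot, ot_, ot_t, ot_to, ot_to_, ot_to_b, ot_to_be`.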
Substring-Groups
The number of substring-groups: O(n)
n trivial substring-groups: leaf nodes; frequency = 1; not so useful to learning
at most n-1 non-trivial substring-groups: internal (non-root) nodes; frequency > 1; to be selected as features
Key-Substring-Groups
Select the key (salient) substring-groups by
-l the minimum frequency: freq(SG_v)
-h the maximum frequency: freq(SG_v)
-b the minimum number of branches: children_num(v)
-p the maximum parent-child conditional probability: freq(SG_v) / freq(SG_p(v))
-q the maximum suffix-link conditional probability: freq(SG_v) / freq(SG_s(v))
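The five criteria can be read as one predicate over per-node statistics. The Python sketch below is an illustration only: the `NodeStats` record and `is_key_group` function are hypothetical names, the inclusive comparisons are an assumption, and the default values are simply the -l/-h/-b/-p/-q settings reported for the Reuters experiment.

```python
from dataclasses import dataclass


@dataclass
class NodeStats:
    """Hypothetical per-node statistics; not a structure from the slides."""
    freq: int              # freq(SG_v)
    children_num: int      # children_num(v)
    parent_freq: int       # freq(SG_p(v)), frequency at the parent node
    suffix_link_freq: int  # freq(SG_s(v)), frequency at the suffix-link node


def is_key_group(v, l=80, h=8000, b=8, p=0.8, q=0.8):
    """Return True if node v passes all five selection criteria."""
    return (
        l <= v.freq <= h                            # -l and -h
        and v.children_num >= b                     # -b
        and v.freq / v.parent_freq <= p             # -p
        and v.freq / v.suffix_link_freq <= q        # -q
    )


kept = is_key_group(NodeStats(100, 10, 200, 150))
too_rare = is_key_group(NodeStats(50, 10, 200, 150))
too_close_to_parent = is_key_group(NodeStats(100, 10, 110, 150))
```

The -p and -q thresholds drop a group whose frequency is nearly the same as that of its parent or suffix-link node, since such a group adds little information beyond the related one.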
Suffix Link
v: “c1 c2 … ck” → s(v): “c2 … ck”
[Figure: suffix links leading from v to s(v) and on toward the root]
Feature Extraction Algorithm
Input: a set of documents; the parameters
Output: the key-substring-groups for each document
Time Complexity: O(n)
Trick: make use of suffix links to traverse the tree
Feature Extraction Algorithm
construct the (generalized) suffix tree T using Ukkonen’s algorithm;
count frequencies recursively;
select features recursively;
accumulate features recursively;
for each document d {
    match d to T and get to the node v;
    while v is not the root {
        output the features associated with v;
        move v to the next node via the suffix link of v;
    }
}
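The Python sketch below imitates the loop above without building a suffix tree, so it is a brute-force stand-in rather than the linear-time algorithm. It assumes that a group is the set of substrings sharing the same start positions, that a group's frequency is its number of occurrences, and that only the -l/-h criteria are applied. The function name `key_group_features` and the choice to name each group by its shortest member are illustrative.

```python
from collections import Counter, defaultdict


def key_group_features(docs, min_freq=2, max_freq=8000):
    """Per-document counts of selected substring-groups (brute force)."""
    # Stand-in for the suffix tree: substring -> set of (doc, start) positions.
    starts = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for i in range(len(text)):
            for j in range(i + 1, len(text) + 1):
                starts[text[i:j]].add((doc_id, i))

    # One group per distinct start set, named by its shortest member.
    shortest = {}
    for sub, positions in starts.items():
        key = frozenset(positions)
        if key not in shortest or len(sub) < len(shortest[key]):
            shortest[key] = sub
    selected = {k: name for k, name in shortest.items()
                if min_freq <= len(k) <= max_freq}

    # Visiting every suffix of d plays the role of following suffix links;
    # each selected group met along a suffix is emitted once.
    features = []
    for text in docs:
        counts = Counter()
        for i in range(len(text)):
            seen = set()
            for j in range(i + 1, len(text) + 1):
                key = frozenset(starts[text[i:j]])
                if key in selected and key not in seen:
                    seen.add(key)
                    counts[selected[key]] += 1
        features.append(counts)
    return features


features = key_group_features(["to_be", "not_to_be"])
```

On the running example this keeps eight groups (named `t`, `to`, `o`, `o_`, `_`, `_b`, `b`, `e`), and the count of a group in a document equals the number of times its substrings start in that document.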
Experiments
Parameter Tuning: the number of features; the cross-validation performance
Feature Weighting: TFxIDF (with l2 normalization)
Learning Algorithm: LibSVM, linear kernel
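As an illustration of the feature weighting step, the Python sketch below applies one common TFxIDF variant, raw term frequency times log(N/df), followed by l2 normalization. The slides do not say which variant was used, so the formula is an assumption, and `tfidf_l2` is a hypothetical helper name.

```python
import math
from collections import Counter


def tfidf_l2(doc_counts):
    """Weight per-document feature counts by tf * log(N / df), then l2-normalize."""
    n_docs = len(doc_counts)
    df = Counter()
    for counts in doc_counts:
        df.update(counts.keys())  # document frequency of each feature

    vectors = []
    for counts in doc_counts:
        vec = {f: tf * math.log(n_docs / df[f]) for f, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        if norm > 0:  # leave an all-zero vector unchanged
            vec = {f: w / norm for f, w in vec.items()}
        vectors.append(vec)
    return vectors


vectors = tfidf_l2([{"to": 2, "be": 1}, {"to": 1, "not": 3}, {"not": 1}])
```

Each resulting vector has unit length, so a linear-kernel SVM sees documents of different lengths on a comparable scale.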
English Text Topic Classification
Dataset: Reuters-21578 Top10 (ApteMod), the home ground of word-based text classification
Classes: (1) earn; (2) acq; (3) money-fx; (4) grain; (5) crude; (6) trade; (7) interest; (8) ship; (9) wheat; (10) corn.
Parameters: -l 80 -h 8000 -b 8 -p 0.8 -q 0.8
Features: 9×10^13 → 6,055 (extracted in < 30 seconds)
English Text Topic Classification
The distribution of substring-groups ~ Zipf’s law (power law)
English Text Topic Classification
The performance of linear kernel SVM with key-substring-group features on the Reuters-21578 top10 dataset.
English Text Topic Classification
Comparing the experimental results of our proposed approach and some representative existing approaches.
English Text Topic Classification
The influence of the feature extraction parameters on the number of features and the text classification performance.
Chinese Text Topic Classification
Dataset: TREC-5 People’s Daily News
Classes: (1) Politics, Law and Society; (2) Literature and Arts; (3) Education, Science and Culture; (4) Sports; (5) Theory and Academy; (6) Economics.
Parameters: -l 20 -h 8000 -b 8 -p 0.8 -q 0.8
Chinese Text Topic Classification
Performance (miF)
SVM + word segmentation: 82.0% (He et al., 2000; He et al., 2003)
char-level n-gram language model: 86.7% (Peng et al., 2004)
SVM with key-substring-group features: 87.3%
Greek Text Authorship Classification
Dataset: (Stamatatos et al., 2000)
Classes: (1) S. Alaxiotis; (2) G. Babiniotis; (3) G. Dertilis; (4) C. Kiosse; (5) A. Liakos; (6) D. Maronitis; (7) M. Ploritis; (8) T. Tasios; (9) K. Tsoukalas; (10) G. Vokos.
Greek Text Authorship Classification
Performance (accuracy)
deep natural language processing: 72% (Stamatatos et al., 2000)
char-level n-gram language model: 90% (Peng et al., 2004)
SVM with key-substring-group features: 92%
Greek Text Genre Classification
Dataset: (Stamatatos et al., 2000)
Classes: (1) press editorial; (2) press reportage; (3) academic prose; (4) official documents; (5) literature; (6) recipes; (7) curriculum vitae; (8) interviews; (9) planned speeches; (10) broadcast news.
Greek Text Genre Classification
Performance (accuracy)
deep natural language processing: 82% (Stamatatos et al., 2000)
char-level n-gram language model: 86% (Peng et al., 2004)
SVM with key-substring-group features: 94%
Conclusion
We propose the concept of key-substring-group features and a linear-time (suffix-tree-based) algorithm to extract them.
We show that our method works well for some text classification tasks.
Clustering etc.? Gene/protein sequence data?
?