
Extracting Key-Substring-Group Features for Text Classification Dell Zhang and Wee Sun Lee KDD2006.

Extracting Key-Substring-Group Features for Text Classification

Dell Zhang and Wee Sun Lee

KDD2006

The Context

Text Classification via Machine Learning (ML)

[Diagram: Training Documents (L) -> Learning -> Classifier -> Predicting -> Test Documents (U)]

Text Data

Raw text: To be, or not to be

As a character string: to_be_or_not_to_be…

As a bag of words: to, be, or, be, to, not

Some Applications

Non-Topical Text Classification
  Text Genre Classification: Paper? Poem? Prose?
  Text Authorship Classification: Washington? Adams? Jefferson?

How to exploit sub-word/super-word information?

Some Applications

Asian-Language Text Classification

How to avoid the problem of word-segmentation?

Some Applications

Spam Filtering

How to handle non-alphabetical characters etc.?

(Pampapathi et al., 2006)

Some Applications

Desktop Text Classification

How to deal with different types of files?

Learning Algorithms

Generative: Naïve Bayes, Rocchio, …

Discriminative: Support Vector Machine (SVM), AdaBoost, …

For word-based text classification, discriminative methods are often superior to generative methods.

How about string-based text classification?

String-Based Text Classification

Generative
  Markov Chain Models (char-level)
    fixed order: n-gram, …
    variable order: PST, PPM, …

Discriminative
  SVM with string kernel (= taking all substrings as features implicitly through the "kernel trick")
    limitations: (1) ridge problem; (2) feature redundancy; (3) hard to apply feature selection/weighting and advanced kernels.

The Problem

[Diagram: a 2x2 grid with rows "generative" and "discriminative" and columns "word-based" and "string-based"; one cell is marked "?"]

The Difficulty

The number of substrings: O(n^2)

d1: to_be

d2: not_to_be

5 + 9 = 14 characters

15 + 45 = 60 substrings
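The figures above can be checked directly: a string of length m has m(m+1)/2 substrings when occurrences at different positions are counted separately. A minimal Python sketch (the helper name `count_substrings` is only illustrative):

```python
def count_substrings(text):
    """Number of (start, end) substring occurrences in a string of length m."""
    m = len(text)
    return m * (m + 1) // 2


docs = ["to_be", "not_to_be"]
total_chars = sum(len(d) for d in docs)
total_substrings = sum(count_substrings(d) for d in docs)
print(total_chars, total_substrings)  # 14 60
```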

Our Idea

The substrings could be partitioned into statistical equivalence groups

d1: to_be

d2: not_to_be

{to, to_, to_b, to_be}

{ot, ot_, ot_t, ot_to, ot_to_, ot_to_b, ot_to_be}

……

Suffix Tree

a suffix tree node = a substring group

[Figure: generalized suffix tree of d1: to_be and d2: not_to_be, with each leaf labeled by the document (d1 or d2) its suffix comes from]

Substring-Groups

The substrings in an equivalence group have exactly identical distribution over the corpus, therefore such a substring-group could be taken in whole as a single feature to be used by a statistical machine learning algorithm for text classification.
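To make the grouping concrete, here is a brute-force sketch in Python. It assumes that "identical distribution" can be approximated by "identical set of (document, start position) occurrences", which is what sharing a suffix tree node amounts to; the function name `substring_groups` is hypothetical, and the enumeration is quadratic, so it is only meant for toy inputs.

```python
from collections import defaultdict


def substring_groups(docs):
    """Group substrings that occur at exactly the same (doc, start) positions."""
    occurrences = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for start in range(len(text)):
            for end in range(start + 1, len(text) + 1):
                occurrences[text[start:end]].add((doc_id, start))

    groups = defaultdict(list)
    for substring, positions in occurrences.items():
        groups[frozenset(positions)].append(substring)
    # Shortest member first, so each group reads like a chain of extensions.
    return [sorted(members, key=len) for members in groups.values()]


groups = substring_groups(["to_be", "not_to_be"])
for members in groups:
    if "to" in members or "ot" in members:
        print(members)
```

For the two example documents this recovers the groups {to, to_, to_b, to_be} and {ot, ot_, …, ot_to_be} shown on the earlier slide.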

Substring-Groups

The number of substring-groups: O(n)

n trivial substring-groups
  leaf nodes
  frequency = 1
  not so useful to learning

at most n-1 non-trivial substring-groups
  internal (non-root) nodes
  frequency > 1
  to be selected as features

Key-Substring-Groups

Select the key (salient) substring-groups by

-l the minimum frequency: freq(SG_v)

-h the maximum frequency: freq(SG_v)

-b the minimum number of branches: children_num(v)

-p the maximum parent-child conditional probability: freq(SG_v) / freq(SG_p(v))

-q the maximum suffix-link conditional probability: freq(SG_v) / freq(SG_s(v))
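The five criteria can be read as one predicate over per-node statistics. The sketch below is an illustration only: the `NodeStats` container and `is_key` function are hypothetical names, the slides do not say whether the thresholds are inclusive (inclusive is assumed here), and the parent and suffix-link frequencies are assumed to be positive.

```python
from dataclasses import dataclass


@dataclass
class NodeStats:
    freq: int          # freq(SG_v)
    children_num: int  # children_num(v)
    parent_freq: int   # freq(SG_p(v)), assumed > 0
    suffix_freq: int   # freq(SG_s(v)), assumed > 0


def is_key(node, l, h, b, p, q):
    """Return True if the substring-group passes all five thresholds."""
    return (
        l <= node.freq <= h
        and node.children_num >= b
        and node.freq / node.parent_freq <= p
        and node.freq / node.suffix_freq <= q
    )


# Thresholds in the style of "-l 80 -h 8000 -b 8 -p 0.8 -q 0.8".
params = dict(l=80, h=8000, b=8, p=0.8, q=0.8)
print(is_key(NodeStats(100, 10, 200, 150), **params))  # True
print(is_key(NodeStats(100, 10, 110, 150), **params))  # False
```

The second node is rejected because it occurs almost as often as its parent (100 / 110 > 0.8), so it presumably adds little information beyond the parent group.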

Suffix Link

v: "c1 c2 … ck"  ->  s(v): "c2 … ck"

[Diagram: suffix links lead from a node v to s(v), and eventually to the root]

Feature Extraction Algorithm

Input
  a set of documents
  the parameters

Output
  the key-substring-groups for each document

Time Complexity: O(n)

Trick
  make use of suffix links to traverse the tree

Feature Extraction Algorithm

construct the (generalized) suffix tree T using Ukkonen's algorithm;
count frequencies recursively;
select features recursively;
accumulate features recursively;
for each document d {
    match d to T and get to the node v;
    while v is not the root {
        output the features associated with v;
        move v to the next node via the suffix link of v;
    }
}
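The control flow of this procedure can be sketched in Python on a toy scale. This is not the authors' implementation: it swaps in an uncompressed suffix trie built in quadratic time instead of a suffix tree built with Ukkonen's algorithm, it applies only a minimum-frequency threshold (`min_freq`, standing in for -l), and it names each group by its longest member. All class and function names are hypothetical.

```python
from collections import Counter, deque


class Node:
    def __init__(self, label=""):
        self.label = label   # substring spelled by the path from the root
        self.children = {}
        self.freq = 0        # number of occurrences in the corpus
        self.link = None     # suffix link: node for label[1:]
        self.features = []   # key groups accumulated along the root path


def build_suffix_trie(docs):
    root = Node()
    for text in docs:
        for start in range(len(text)):
            node = root
            for ch in text[start:]:
                if ch not in node.children:
                    node.children[ch] = Node(node.label + ch)
                node = node.children[ch]
                node.freq += 1
    # Breadth-first pass: a parent's link is always set before its children's.
    root.link = root
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for ch, child in node.children.items():
            child.link = root if node is root else node.link.children[ch]
            queue.append(child)
    return root


def accumulate_features(root, min_freq):
    stack = [root]
    while stack:
        node = stack.pop()
        for child in node.children.values():
            # A node is the longest member of its group if no child repeats its frequency.
            is_head = all(g.freq != child.freq for g in child.children.values())
            own = [child.label] if is_head and child.freq >= min_freq else []
            child.features = node.features + own
            stack.append(child)


def extract(docs, min_freq=2):
    root = build_suffix_trie(docs)
    accumulate_features(root, min_freq)
    result = []
    for text in docs:
        v = root
        for ch in text:              # match d to the trie and get to the node v
            v = v.children[ch]
        counts = Counter()
        while v is not root:         # visit every suffix of d via suffix links
            counts.update(v.features)
            v = v.link
        result.append(counts)
    return result


features = extract(["to_be", "not_to_be"])
print(sorted(features[0]))
```

Each document ends up with a count per key group, which can then be weighted and passed to a classifier. The linear-time bound stated on the slides depends on the compressed suffix tree; this sketch does not achieve it.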

Experiments

Parameter Tuning
  the number of features
  the cross-validation performance

Feature Weighting
  TFxIDF (with l2 normalization)

Learning Algorithm
  LibSVM
  linear kernel
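As a rough illustration of the weighting step, the following Python sketch applies one common TFxIDF variant (raw term frequency times log(N / df)) followed by l2 normalization. The slides do not specify which variant was used, so the formula and the function name `tfidf_l2` are assumptions.

```python
import math
from collections import Counter


def tfidf_l2(doc_counts):
    """Turn per-document feature counts into l2-normalized TFxIDF vectors."""
    n_docs = len(doc_counts)
    df = Counter()
    for counts in doc_counts:
        df.update(counts.keys())  # document frequency of each feature

    vectors = []
    for counts in doc_counts:
        weights = {f: tf * math.log(n_docs / df[f]) for f, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            weights = {f: w / norm for f, w in weights.items()}
        vectors.append(weights)
    return vectors


vectors = tfidf_l2([{"a": 2, "b": 1}, {"a": 1, "c": 3}, {"c": 1}])
print(vectors[0])
```

The resulting sparse vectors would then be fed to a linear-kernel SVM.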

English Text Topic Classification

Dataset: Reuters-21578 Top10 (ApteMod)
  The home-ground of word-based text classification

Classes: (1) earn; (2) acq; (3) money-fx; (4) grain; (5) crude; (6) trade; (7) interest; (8) ship; (9) wheat; (10) corn.

Parameters: -l 80 -h 8000 -b 8 -p 0.8 -q 0.8

Features: 9*10^13 -> 6,055 (extracted in < 30 seconds)

English Text Topic Classification

The distribution of substring-groups ~ Zipf's law (power law)

English Text Topic Classification

The performance of linear kernel SVM with key-substring-group features on the Reuters-21578 top10 dataset.

English Text Topic Classification

Comparing the experimental results of our proposed approach and some representative existing approaches.

English Text Topic Classification

The influence of the feature extraction parameters on the number of features and the text classification performance.

Chinese Text Topic Classification

Dataset: TREC-5 People's Daily News

Classes: (1) Politics, Law and Society; (2) Literature and Arts; (3) Education, Science and Culture; (4) Sports; (5) Theory and Academy; (6) Economics.

Parameters: -l 20 -h 8000 -b 8 -p 0.8 -q 0.8

Chinese Text Topic Classification

Performance (miF)

SVM + word segmentation: 82.0% (He et al., 2000; He et al., 2003)

char-level n-gram language model: 86.7% (Peng et al., 2004)

SVM with key-substring-group features: 87.3%
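Assuming "miF" denotes the micro-averaged F1 measure, it pools true positives, false positives and false negatives over all classes before computing precision and recall. A minimal Python sketch, with the hypothetical function name `micro_f1` and labels given as sets per document:

```python
def micro_f1(true_labels, predicted_labels):
    """Micro-averaged F1 over documents whose labels are given as sets."""
    tp = fp = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        tp += len(truth & pred)
        fp += len(pred - truth)
        fn += len(truth - pred)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


truth = [{"a"}, {"b"}, {"a", "c"}]
pred = [{"a"}, {"a"}, {"c"}]
score = micro_f1(truth, pred)
print(round(score, 4))
```

When every document has exactly one true and one predicted label, this measure coincides with accuracy.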

Greek Text Authorship Classification

Dataset: (Stamatatos et al., 2000)

Classes: (1) S. Alaxiotis; (2) G. Babiniotis; (3) G. Dertilis; (4) C. Kiosse; (5) A. Liakos; (6) D. Maronitis; (7) M. Ploritis; (8) T. Tasios; (9) K. Tsoukalas; (10) G. Vokos.

Greek Text Authorship Classification

Performance (accuracy)

deep natural language processing: 72% (Stamatatos et al., 2000)

char-level n-gram language model: 90% (Peng et al., 2004)

SVM with key-substring-group features: 92%

Greek Text Genre Classification

Dataset: (Stamatatos et al., 2000)

Classes: (1) press editorial; (2) press reportage; (3) academic prose; (4) official documents; (5) literature; (6) recipes; (7) curriculum vitae; (8) interviews; (9) planned speeches; (10) broadcast news.

Greek Text Genre Classification

Performance (accuracy)

deep natural language processing: 82% (Stamatatos et al., 2000)

char-level n-gram language model: 86% (Peng et al., 2004)

SVM with key-substring-group features: 94%

Conclusion

We propose
  the concept of key-substring-group features and
  a linear-time (suffix tree based) algorithm to extract them

We show that
  our method works well for some text classification tasks

clustering etc.? gene/protein sequence data?

?
