1
Text Bundling: Statistics-Based Data Reduction
by L. Shih, J.D.M. Rennie, Y. Chang and D.R. Karger
Presented by Steve Vincent, March 4, 2004
2
Text Classification
Domingos discussed the tradeoff of speed and accuracy in the context of very large databases
The best text classification algorithms are “super-linear”: each additional training point takes more time to train than the previous point
3
Text Classification
Most highly accurate text classifiers take a disproportionately large time to handle a large number of training examples
Classifiers become impractical when faced with large data sets such as the OHSUMED data set
The OHSUMED test collection is a set of 348,566 references from MEDLINE, the on-line medical information database
It consists of titles and/or abstracts from 270 medical journals over a five-year period (1987-1991)
4
Data Reduction
Subsampling
Bagging
Feature selection
5
Subsampling
Retains a random subset of the original training data
Subsampling does not preserve the entire data set; rather, it preserves all statistics on a random subset of the data
Subsampling is fast and easy to implement
Reducing to a single point per class via subsampling yields a single sample document, which gives almost no information about the nature of the class
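Subsampling as described above can be sketched in a few lines of Python (a minimal illustration; the function name and fixed seed are my own, not from the paper):

```python
import random

def subsample(docs, labels, k, seed=0):
    """Retain a random subset of k training examples (illustrative sketch).

    Keeps the labels aligned with the sampled documents; all statistics of
    the retained subset are preserved exactly, but the rest of the data is
    discarded.
    """
    rng = random.Random(seed)
    idx = rng.sample(range(len(docs)), k)
    return [docs[i] for i in idx], [labels[i] for i in idx]
```

Sampling without replacement (`random.sample`) matches the idea of retaining a subset rather than resampling the data.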
6
Bagging
Partitions the original data set and learns a classifier on each partition; a test document is then labeled by majority vote of the classifiers
Training is fast, since each classifier trains on only a subset of the original data
Testing is slow, since multiple classifiers must be evaluated for each test example
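The partition-train-vote scheme above can be sketched as follows (an illustrative outline, not the paper's code; `train_fn` stands in for any learning algorithm):

```python
from collections import Counter

def bag_train(docs, labels, n_parts, train_fn):
    """Partition the training data and train one classifier per partition."""
    parts = [(docs[i::n_parts], labels[i::n_parts]) for i in range(n_parts)]
    return [train_fn(d, l) for d, l in parts]

def bag_predict(classifiers, x):
    """Label a test document by majority vote of the partition classifiers."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]
```

This makes the cost structure visible: `bag_train` touches each training point once, but `bag_predict` must call every classifier per test example.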
7
Feature Selection
Retains only the best features of a data set
All classifiers use some type of feature selection: if a classifier sees a feature as irrelevant, it simply ignores that feature
One type of feature selection ranks features according to |p(fi|+) − p(fi|−)|, where p(fi|c) is the empirical frequency of fi in class-c training documents
There is little empirical evidence comparing the time savings of feature selection with the resulting loss in accuracy
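The |p(fi|+) − p(fi|−)| ranking on this slide can be sketched directly (a minimal illustration; documents are represented as sets of words, and the function names are my own):

```python
def rank_features(pos_docs, neg_docs, vocab):
    """Rank features by |p(f|+) - p(f|-)|, where p(f|c) is the fraction of
    class-c training documents that contain feature f (highest score first)."""
    def freq(docs, f):
        return sum(1 for d in docs if f in d) / len(docs)

    scores = {f: abs(freq(pos_docs, f) - freq(neg_docs, f)) for f in vocab}
    return sorted(vocab, key=lambda f: -scores[f])
```

Features that appear equally often in both classes score 0 and sink to the bottom of the ranking.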
8
Text Bundling
Text bundling generates a new, smaller training set by averaging together small groups of points
This preserves certain statistics on all the data, instead of preserving just a subset of the data
This application uses one statistic (the mean), but it is possible to use multiple statistics
9
Bundling Algorithm
Tradeoff between speed and accuracy: the less raw information retained, the faster the classifier will run and the less accurate the results
Each data reduction technique operates by retaining some information and removing other information
By carefully selecting our statistics for a domain, we can optimize the information we retain
10
Bundling Algorithm
Bundling preserves a set of k user-chosen statistics, s = (s1,…,sk), where si is a function that maps a set of data to a single value
11
Global Constraint
There are many possible reduced data sets that can satisfy this constraint
But we do not only want to preserve the global statistics; we also want to preserve additional information about the distribution
To get a reduced data set that satisfies the global constraint, we could generate several random points and then choose the remaining points to preserve the statistics
This does not retain any information about our data except for the chosen statistics
12
Local Constraint
We can retain some information besides the statistics by grouping together sets of points and preserving the statistics locally
13
Local Constraint
The bundling algorithm’s local constraint is to maintain the same statistics between subsets of the training data
The focus on statistics means that the bundled data will not have any examples in common with the original data
This ensures that certain global statistics are maintained, while also maintaining a relationship between certain partitions of the data in the original and bundled training sets
14
Text Bundling
The first step in bundling is to select a statistic or statistics to preserve
For text, the mean statistic of each feature is chosen
Rocchio and Multinomial Naïve Bayes perform classification using only the mean statistics of the data
15
Rocchio Algorithm
The Rocchio classification algorithm selects a decision boundary (a plane) perpendicular to the vector connecting the two class centroids
Let {x11,…,x1l1} and {x21,…,x2l2} be the sets of training data for the positive and negative classes
Let c1 = (1/l1) Σi x1i and c2 = (1/l2) Σi x2i be the centroids of the classes
RocchioScore(x) = x · (c1 − c2)
With a threshold boundary b, an example is labeled according to the sign of the score minus the threshold value: lRocchio(x) = sign(RocchioScore(x) − b)
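The centroid, score, and labeling rule above can be sketched in Python (a minimal illustration with dense list vectors; the helper names are my own):

```python
def centroid(docs):
    """Mean vector of a list of equal-length feature vectors."""
    n = len(docs)
    return [sum(d[j] for d in docs) / n for j in range(len(docs[0]))]

def rocchio_score(x, c1, c2):
    """RocchioScore(x) = x . (c1 - c2)."""
    return sum(xj * (a - b) for xj, a, b in zip(x, c1, c2))

def rocchio_label(x, c1, c2, b=0.0):
    """Label by the sign of the score minus the threshold b."""
    return 1 if rocchio_score(x, c1, c2) - b > 0 else -1
```

The decision boundary is the set of points whose score equals b, a hyperplane perpendicular to c1 − c2.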
16
Naïve Bayes
A basic naïve Bayes classifier, where the wk are the features (words) of document x and the cj are the classes, is:

y(x) = argmax over cj of [ p(cj) ∏k p(wk | cj) ] / p(x)
17
Multinomial Naïve Bayes
Multinomial naïve Bayes has shown improvements over other naïve Bayes variants. The formula is:

y(x) = argmax over cj of [ p(cj) ∏k p(wk | cj)^n(wk, x) ] / p(x)

where n(wk, x) is the count of word wk in document x
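The multinomial rule can be sketched in log space, where the product becomes a weighted sum (an illustrative fragment with made-up class parameters; p(x) is dropped because it is constant across classes):

```python
import math

def mnb_log_score(doc_counts, prior, word_logprob):
    """log p(c) + sum over k of n(wk, x) * log p(wk | c)."""
    return math.log(prior) + sum(n * word_logprob[w]
                                 for w, n in doc_counts.items())

def mnb_classify(doc_counts, classes):
    """classes maps each class name to (prior, {word: log p(w|c)});
    return the argmax class for the word-count dictionary doc_counts."""
    return max(classes, key=lambda c: mnb_log_score(doc_counts, *classes[c]))
```

Working in log space avoids underflow when a document repeats many words, and makes the n(wk, x) exponent an ordinary multiplication.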
18
Text Bundling
Assume that there is a set of training documents for each class; apply bundling separately to each class
Let D = (d1,…,dn) be a set of documents, using the “bag of words” representation, where each word is a feature and each document is represented as a vector of word counts
19
Text Bundling
di = (di1,…,diV), where the second subscript indexes the words and V is the size of the vocabulary
Use the mean statistic for each feature as our text statistics
Define the jth statistic as sj(D) = (1/n) Σi dij, the mean value of feature j over the documents
20
Maximal Bundling
Reduce to a single point with the mean statistics
The jth component of the single point is sj(D)
Using a linear classifier on this “maximal bundling” results in a decision boundary equivalent to Rocchio’s decision boundary
21
Bundling Algorithms
Randomized bundling
Partitions points randomly
Only needs one pass over the training points
Poorly preserves data point locations in feature space
Rocchio bundling
Projects points onto a vector, partitioning points that are near one another in the projected space
Uses RocchioScore to sort documents by their score, then bundles consecutive sorted documents
Pre-processing time for Rocchio bundling is O(n log(m))
22
Procedure Randomized Bundling
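The original slide shows this procedure as pseudocode; a Python sketch under the mean statistic might look like the following (names and the fixed seed are illustrative):

```python
import random

def randomized_bundling(docs, bundle_size, seed=0):
    """Randomly partition documents and replace each bundle by its mean
    vector. With equal-sized bundles this preserves the global per-feature
    mean exactly, while the random grouping ignores point locations."""
    rng = random.Random(seed)
    idx = list(range(len(docs)))
    rng.shuffle(idx)
    bundles = [idx[i:i + bundle_size] for i in range(0, len(idx), bundle_size)]
    dim = len(docs[0])
    return [[sum(docs[i][j] for i in b) / len(b) for j in range(dim)]
            for b in bundles]
```

One shuffle and one averaging pass is all that is needed, which is why randomized bundling requires only a single pass over the training points.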
23
Procedure Rocchio Bundling
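Again the slide's pseudocode was a figure; a sketch of the sort-then-bundle idea, assuming binary ±1 labels and dense vectors (helper names are my own), might be:

```python
def rocchio_bundling(docs, labels, bundle_size):
    """Per class: sort documents by their projection onto the vector between
    the class centroids (the RocchioScore direction), then average each run
    of bundle_size consecutive documents into one bundled point."""
    dim = len(docs[0])

    def centroid(vs):
        return [sum(v[j] for v in vs) / len(vs) for j in range(dim)]

    pos = [d for d, l in zip(docs, labels) if l > 0]
    neg = [d for d, l in zip(docs, labels) if l <= 0]
    w = [a - b for a, b in zip(centroid(pos), centroid(neg))]

    def bundle(cls):
        ordered = sorted(cls, key=lambda d: sum(dj * wj
                                                for dj, wj in zip(d, w)))
        groups = [ordered[i:i + bundle_size]
                  for i in range(0, len(ordered), bundle_size)]
        return [centroid(g) for g in groups]

    return bundle(pos), bundle(neg)
```

Because consecutive documents in the sorted order are close in the projected space, each bundled mean stays near the points it replaces, which is the locality that randomized bundling lacks.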
24
Data Sets
20 Newsgroups (20 News): a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups
Industry Sector (IS): a collection of corporate web pages organized into categories based on what a company produces or does; there are 9,619 non-empty documents and 105 categories
Reuters 21578 (Reuters): consists of a set of 10,000 news stories
25
Data Sets

              20 News      IS    Reuters
Train Size     12,000   4,797      7,700
Test Size       7,982   4,822      3,019
Features       62,060  55,194     18,621
SVM time        6,464   2,268        408
Accuracy        86.5%   92.8%      88.7%
26
Experiment
Used a Support Vector Machine for classification
Used the SvmFu implementation, with the penalty for misclassification of training points set at 10
Coded the pre-processing in C++; used Rainbow to pre-process the raw documents into feature vectors
Limited runs to 8 hours per run
Compared Bagging, Feature Selection, Subsampling, Random Bundling and Rocchio Bundling
Also ran the experiment on OHSUMED, but did not get results for all tests
27
Slowest Results (time, accuracy)

                     20 News          IS         Reuters
Bagging            4051  .843   1849  .863    346  .886
Feature Selection  5870  .853   2186  .896    507  .884
Subsample          2025  .842    926  .858    173  .859
Random Bundling    2613  .862   1205  .909    390  .863
Rocchio Bundling   2657  .864   1244  .914    404  .882
28
Quickest Results (time, accuracy)

                     20 News          IS         Reuters
Bagging            2795  .812   1590  .173    295  .878
Feature Selection  4601  .649   1738  .407    167  .423
Subsample            22  .261     59  .170    173  .213
Random Bundling     117  .730    177  .9      105  .603
Rocchio Bundling    173  .730    248  .9      129  .603
29
20 News Results
30
IS Results
31
Reuters Results
32
Future Work
Extend bundling in both a theoretical and an empirical sense
It may be possible to analyze, or provide bounds on, the loss in accuracy due to bundling
Would like to construct general methods for bundling sets of statistics
Would also like to extend bundling to other machine learning domains
33
References
P. Domingos, “When and how to subsample: Report on the KDD-2001 panel”, SIGKDD Explorations
NIST Test Collections (http://trec.nist.gov/data.html)
D. Mladenic, “Feature subset selection in text-learning”, Proceedings of the Tenth European Conference on Machine Learning
Xiaodan Zhu, “Junk Email Filtering with Large Margin Perceptrons”, University of Toronto, Department of Computer Science (www.cs.toronto.edu/pub/xzhu/reports_nips2.doc)
34
Questions?