Text classification II
CE-324: Modern Information Retrieval
Sharif University of Technology
M. Soleymani
Fall 2017
Some slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)
Outline
Vector space classification
Rocchio
Linear classifiers
kNN
Standing queries
The path from IR to text classification:
You have an information need to monitor, say:
Unrest in the Niger delta region
You want to rerun an appropriate query periodically to find new news
items on this topic
You will be sent new documents that are found
I.e., it’s not ranking but classification (relevant vs. not relevant)
Such queries are called standing queries
Long used by “information professionals”
A modern mass instantiation is Google Alerts
Standing queries are (hand-written) text classifiers
Ch. 13
Recall: vector space representation
Each doc is a vector
One component for each term (= word).
Terms are axes
Usually normalize vectors to unit length.
High-dimensional vector space:
10,000+ dimensions, or even 100,000+
Docs are vectors in this space
How can we do classification in this space?
Sec.14.1
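To make this concrete, here is a minimal sketch of the representation (assuming numpy; the toy counts and vocabulary size are made up for illustration):

```python
import numpy as np

# Toy collection: 3 docs over a 4-term vocabulary (illustrative numbers)
tf = np.array([[3., 0., 1., 0.],
               [0., 2., 2., 0.],
               [1., 0., 0., 4.]])
N = tf.shape[0]
df = np.count_nonzero(tf, axis=0)        # docs containing each term
tfidf = tf * np.log10(N / df)            # tf-idf weighting
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)   # unit length
```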
Classification using vector spaces
Training set: a set of docs, each labeled with its class (e.g.,
topic)
This set corresponds to a labeled set of points (or, equivalently,
vectors) in the vector space
Premise 1: Docs in the same class form a contiguous region of space
Premise 2: Docs from different classes don’t overlap
(much)
We define surfaces to delineate classes in the space
Sec.14.1
Documents in a vector space
[Figure: docs from the Government, Science, and Arts classes as points in the vector space]
Sec.14.1
Test document of what class?
[Figure: the three class regions with an unlabeled test doc]
Sec.14.1
Test document of what class?
[Figure: the three class regions with an unlabeled test doc]
Is this similarity hypothesis true in general?
Our main topic today is how to find good separators
Sec.14.1
Relevance feedback: relation to classification
In relevance feedback, the user marks docs as
relevant/non-relevant.
Relevant/non-relevant can be viewed as classes or categories.
For each doc, the user decides which of these two classes
is correct.
Relevance feedback is a form of text classification.
Rocchio for text classification
Relevance feedback methods can be adapted for text
categorization
Relevance feedback can be viewed as 2-class classification
Use standard tf-idf weighted vectors to represent text docs
For training docs in each category, compute a prototype as
centroid of the vectors of the training docs in the category.
Prototype = centroid of members of class
Assign test docs to the category with the closest prototype
vector based on cosine similarity.
Sec.14.2
Definition of centroid
$$\vec{\mu}(c) = \frac{1}{|D_c|} \sum_{d \in D_c} \vec{d}$$
$D_c$: the set of docs that belong to class $c$
$\vec{d}$: vector space representation of $d$
Centroid will in general not be a unit vector even when
the inputs are unit vectors.
Sec.14.2
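A minimal sketch of Rocchio with this centroid definition (assuming numpy; `X` holds doc vectors as rows and `y` their class labels — both names are illustrative):

```python
import numpy as np

def train_rocchio(X, y):
    """Compute one centroid (prototype) per class."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def classify_rocchio(d, centroids):
    """Assign d to the class whose centroid is most cosine-similar."""
    def cos(a, b):
        return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(centroids, key=lambda c: cos(d, centroids[c]))
```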
Rocchio algorithm
Rocchio: example
We will see that Rocchio finds linear boundaries between classes.
[Figure: linear boundaries separating the Government, Science, and Arts regions]
Illustration of Rocchio: text classification
Sec.14.2
Rocchio properties
Forms a simple generalization of the examples in each class (a prototype).
Prototype vector does not need to be normalized.
Classification is based on similarity to class prototypes.
Does not guarantee classifications are consistent with the given training data.
Sec.14.2
Rocchio anomaly
Prototype models have problems with polymorphic
(disjunctive) categories.
Sec.14.2
Rocchio classification: summary
Rocchio forms a simple representation for each class:
Centroid/prototype
Classification is based on similarity to the prototype
It does not guarantee that classifications are consistent with the given training data
It is little used outside text classification
It has been used quite effectively for text classification
But in general worse than many other classifiers
Rocchio does not handle nonconvex, multimodal classes correctly.
Sec.14.2
Linear classifiers
Assumption: The classes are linearly separable.
Classification decision: $\sum_{i=1}^{m} w_i x_i + w_0 > 0$?
First, we only consider binary classifiers.
Geometrically, this corresponds to a line (2D), a plane (3D) or
a hyperplane (higher dimensionalities) decision boundary.
Find the parameters 𝑤0, 𝑤1, … , 𝑤𝑚 based on training set.
Methods for finding these parameters: Perceptron, Rocchio,…
Separation by hyperplanes
A simplifying assumption is linear separability:
in 2 dimensions, can separate classes by a line
in higher dimensions, need hyperplanes
Sec.14.4
Two-class Rocchio as a linear classifier
Line or hyperplane defined by:
$$w_0 + \sum_{i=1}^{M} w_i d_i = w_0 + \vec{w}^T \vec{d} \geq 0$$
For Rocchio, set:
$$\vec{w} = \vec{\mu}(c_1) - \vec{\mu}(c_2), \qquad w_0 = -\frac{1}{2}\left(|\vec{\mu}(c_1)|^2 - |\vec{\mu}(c_2)|^2\right)$$
Equivalently: assign $d$ to $c_1$ iff $\vec{w}^T\vec{d} \geq \frac{1}{2}\left(|\vec{\mu}(c_1)|^2 - |\vec{\mu}(c_2)|^2\right)$, i.e., iff $d$ is closer (in Euclidean distance) to $\vec{\mu}(c_1)$ than to $\vec{\mu}(c_2)$.
Sec.14.2
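A quick numeric check of this equivalence (a sketch assuming numpy, with random toy centroids rather than trained ones): classifying by nearest centroid under Euclidean distance agrees with the sign of the linear score.

```python
import numpy as np

rng = np.random.default_rng(0)
mu1, mu2 = rng.normal(size=5), rng.normal(size=5)   # toy centroids
w = mu1 - mu2
b = 0.5 * (mu1 @ mu1 - mu2 @ mu2)                   # threshold (= -w0)

for _ in range(1000):
    d = rng.normal(size=5)
    closer_to_mu1 = np.linalg.norm(d - mu1) <= np.linalg.norm(d - mu2)
    positive_side = (w @ d - b) >= 0
    assert closer_to_mu1 == positive_side           # always agree
```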
Linear classifier: example
Class:“interest” (as in interest rate)
Example features of a linear classifier for this class:

 w_i    t_i           w_i    t_i
 0.70   prime        −0.71   dlrs
 0.67   rate         −0.35   world
 0.63   interest     −0.33   sees
 0.60   rates        −0.25   year
 0.46   discount     −0.24   group
 0.43   bundesbank   −0.24   dlr

To classify, find the dot product of the feature vector and the weights.
Sec.14.4
Linear classifier: example
Class “interest” in Reuters-21578 (here $w_0 = 0$):
$d_1$: “rate discount dlrs world”
$d_2$: “prime dlrs”
$\vec{w}^T \vec{d}_1 = 0.07 \Rightarrow d_1$ is assigned to the “interest” class
$\vec{w}^T \vec{d}_2 = -0.01 \Rightarrow d_2$ is not assigned to this class
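These two scores can be reproduced with a sparse weight vector (plain Python; whitespace tokenization is a simplification):

```python
# Weights from the table above; terms not listed get weight 0
w = {"prime": 0.70, "rate": 0.67, "interest": 0.63, "rates": 0.60,
     "discount": 0.46, "bundesbank": 0.43, "dlrs": -0.71, "world": -0.35,
     "sees": -0.33, "year": -0.25, "group": -0.24, "dlr": -0.24}
w0 = 0.0

def score(doc):
    return w0 + sum(w.get(t, 0.0) for t in doc.split())

print(round(score("rate discount dlrs world"), 2))  # 0.07  -> "interest"
print(round(score("prime dlrs"), 2))                # -0.01 -> not "interest"
```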
Naïve Bayes as a linear classifier
Multinomial NB assigns $d$ to $C_1$ exactly when
$$P(C_1) \prod_{i=1}^{M} P(t_i \mid C_1)^{tf_{i,d}} > P(C_2) \prod_{i=1}^{M} P(t_i \mid C_2)^{tf_{i,d}}$$
Taking logs:
$$\log P(C_1) + \sum_{i=1}^{M} tf_{i,d} \log P(t_i \mid C_1) > \log P(C_2) + \sum_{i=1}^{M} tf_{i,d} \log P(t_i \mid C_2)$$
So NB is a linear classifier with
$$w_i = \log \frac{P(t_i \mid C_1)}{P(t_i \mid C_2)}, \qquad x_i = tf_{i,d}, \qquad w_0 = \log \frac{P(C_1)}{P(C_2)}$$
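A sketch of this correspondence (assuming numpy; the probabilities below are illustrative placeholders, not estimates from a corpus):

```python
import numpy as np

p_t_c1 = np.array([0.30, 0.50, 0.20])   # P(t_i | C1) over a 3-term vocabulary
p_t_c2 = np.array([0.40, 0.20, 0.40])   # P(t_i | C2)
p_c1, p_c2 = 0.6, 0.4                   # class priors

w = np.log(p_t_c1 / p_t_c2)             # w_i = log P(t_i|C1)/P(t_i|C2)
w0 = np.log(p_c1 / p_c2)                # w_0 = log P(C1)/P(C2)

tf = np.array([2, 0, 1])                # term frequencies of a test doc
assign_c1 = w0 + w @ tf > 0             # identical to comparing NB scores
```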
Linear programming / Perceptron
Find a, b, c such that
ax + by > c for red points
ax + by < c for blue points
Sec.14.4
Which hyperplane?
In general, lots of possible solutions for a, b, c.
Sec.14.4
Which hyperplane?
Lots of possible solutions for a, b, c.
Some methods find a separating hyperplane, but not the optimal one [according to some criterion of expected goodness]
Which points should influence optimality?
All points
E.g., Rocchio
Only “difficult points” close to decision boundary
E.g., SupportVector Machine (SVM)
Sec.14.4
Support Vector Machine (SVM)
[Figure: maximum-margin separating hyperplane; the support vectors lie on the margin, with a narrower-margin alternative shown for comparison]
SVMs maximize the margin around
the separating hyperplane.
A.k.a. large margin classifiers
Solving SVMs is a quadratic
programming problem
Seen by many as the most
successful current text classification
method*
*but other discriminative methods
often perform very similarly
Sec. 15.1
Linear classifiers
Many common text classifiers are linear classifiers
Classifiers more powerful than linear often don’t perform better on text problems. Why?
Despite the similarity of linear classifiers, there are noticeable performance differences between them:
For separable problems, there is an infinite number of separating hyperplanes.
Different training methods pick different hyperplanes.
Also different strategies for non-separable problems
Sec.14.4
Linear classifiers:
binary and multiclass classification
Consider two-class problems:
Deciding between two classes, perhaps government and non-government
Multi-class
How do we define (and find) the separating surface?
How do we decide which region a test doc is in?
Sec.14.4
More than two classes
One-of classification (multi-class classification)
Classes are mutually exclusive.
Each doc belongs to exactly one class
Any-of classification
Classes are not mutually exclusive.
A doc can belong to 0, 1, or >1 classes.
For simplicity, decompose into K binary problems
Quite common for docs
Sec.14.5
Set of binary classifiers: any-of
Build a separator between each class and its complementary
set (docs from all other classes).
Given test doc, evaluate it for membership in each class.
Apply decision criterion of classifiers independently
It works although considering dependencies between categories may
be more accurate
Sec.14.5
Multi-class: set of binary classifiers
Build a separator between each class and its
complementary set (docs from all other classes).
Given test doc, evaluate it for membership in each class.
Assign doc to class with:
maximum score
maximum confidence
maximum probability
Sec.14.5
k Nearest Neighbor Classification
kNN = k Nearest Neighbor
To classify a document d:
Define the k-neighborhood as the k nearest neighbors of d
Pick the majority class label in the k-neighborhood
Sec.14.3
Nearest-Neighbor (1NN) classifier
Learning phase: Just storing the representations of the training examples in D.
Does not explicitly compute category prototypes.
Testing instance x (under 1NN): compute similarity between x and all examples in D.
Assign x the category of the most similar example in D.
Rationale of kNN: contiguity hypothesis
We expect a test doc 𝑑 to have the same label as the training docs
located in the local region surrounding 𝑑.
Sec.14.3
Test Document = Science
[Figure: the test doc’s nearest neighbors all lie in the Science region]
Sec.14.1
k Nearest Neighbor (kNN) classifier
1NN: subject to errors due to
A single atypical example.
Noise (i.e., an error) in the category label of a single training example.
More robust alternative:
find the k most-similar examples
return the majority category of these k examples.
Sec.14.3
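A minimal kNN sketch along these lines (assuming numpy and unit-length doc vectors, so cosine similarity reduces to a dot product):

```python
import numpy as np
from collections import Counter

def knn_classify(d, X, y, k=3):
    """X: unit-length training vectors (rows); y: labels; d: unit test vector."""
    sims = X @ d                          # cosine similarity to every doc
    top_k = np.argsort(-sims)[:k]         # indices of the k most similar docs
    return Counter(y[i] for i in top_k).most_common(1)[0][0]  # majority vote
```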
kNN example: k=6
[Figure: a test doc and its 6 nearest neighbors among the Government, Science, and Arts docs]
P(science | test doc)?
Sec.14.3
kNN decision boundaries
[Figure: locally defined boundaries between the Government, Science, and Arts regions]
Boundaries are in principle arbitrary surfaces (polyhedral)
kNN gives locally defined decision boundaries between classes – far away points do not influence each classification decision (unlike Rocchio, etc.)
Sec.14.3
1NN: Voronoi tessellation
The decision boundaries between classes are piecewise linear.
kNN algorithm
Time complexity of kNN
kNN test time is proportional to the size of the training set!
kNN is inefficient for very large training sets.
Similarity metrics
The nearest neighbor method depends on a similarity (or distance) metric.
Euclidean distance: Simplest for continuous vector space.
Hamming distance: Simplest for binary instance space.
number of feature values that differ
For text, cosine similarity of tf-idf weighted vectors is typically most effective.
Sec.14.3
Illustration of kNN (k=3) for text vector space
Sec.14.3
3-NN vs. Rocchio
Nearest Neighbor tends to handle polymorphic
categories better than Rocchio/NB.
Nearest neighbor with inverted index
Naively, finding nearest neighbors requires a linear search
through |𝐷| docs in collection
Similar to determining the 𝑘 best retrievals using the test doc
as a query to a database of training docs.
Use standard vector space inverted index methods to
find the k nearest neighbors.
Testing time: O(B|V_t|), where V_t is the set of distinct terms in the test doc and B is the average number of training docs in which at least one term of the test doc appears
Typically B << |D| if a large list of stopwords is used
Sec.14.3
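A sketch of the candidate-restriction idea (plain Python; the index layout is illustrative): only training docs sharing at least one term with the test doc are scored, so roughly B docs are touched instead of |D|.

```python
from collections import defaultdict

def build_index(docs):                      # docs: list of term lists
    index = defaultdict(set)
    for doc_id, terms in enumerate(docs):
        for t in terms:
            index[t].add(doc_id)            # postings list per term
    return index

def candidates(test_terms, index):
    cands = set()
    for t in test_terms:                    # union of the postings lists
        cands |= index.get(t, set())
    return cands                            # score only these, then take top k
```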
A nonlinear problem
Linear classifiers do badly on this task.
kNN will do very well (assuming enough training data).
Sec.14.4
Overfitting example
kNN: summary
No training phase necessary
Actually: we always preprocess the training set, so in reality the training time of kNN is linear.
May be expensive at test time
kNN is very accurate if the training set is large; in most cases it is more accurate than linear classifiers.
Optimality result: asymptotically zero error if the Bayes rate is zero.
But kNN can be very inaccurate if the training set is small.
Scales well with a large number of classes: no need to train C classifiers for C classes
Classes can influence each other: small changes to one class can have a ripple effect
Sec.14.3
Choosing the correct model capacity
Sec.14.6
Linear classifiers for doc classification
We typically encounter high-dimensional spaces in text applications.
With increased dimensionality, the likelihood of linear separability increases rapidly.
Many of the best-known text classification algorithms are linear.
More powerful nonlinear learning methods are more sensitive to noise in the training data.
Nonlinear learning methods sometimes perform better if the training set is large, but by no means in all cases.
Which classifier do I use for a given text classification problem?
Is there a learning method that is optimal for all text
classification problems?
No, because there is a tradeoff between complexity of the
classifier and its performance on new data points.
Factors to take into account:
How much training data is available?
How simple/complex is the problem?
How noisy is the data?
How stable is the problem over time?
For an unstable problem, it’s better to use a simple and robust
classifier.
Reuters collection
Only about 10 out of 118 categories are large
Common categories (#train, #test):
Earn (2877, 1087)
Acquisitions (1650, 179)
Money-fx (538, 179)
Grain (433, 149)
Crude (389, 189)
Trade (369, 119)
Interest (347, 131)
Ship (197, 89)
Wheat (212, 71)
Corn (182, 56)
Evaluating classification
Evaluation must be done on test data that are
independent of the training data
training and test sets are disjoint.
Measures: Precision, recall, F1, accuracy
F1 allows us to trade off precision against recall (harmonic
mean of P and R).
Precision P and recall R
Precision P = tp/(tp + fp)
Recall R = tp/(tp + fn)
                                   actually in the class   actually NOT in the class
predicted to be in the class               tp                       fp
predicted not to be in the class           fn                       tn
Good practice department:
Make a confusion matrix
The $(i, j)$ entry $c_{ij}$ counts docs actually in class $i$ that the classifier put in class $j$; e.g., an entry of 53 means 53 docs of class $i$ were assigned to class $j$.
In a perfect classification, only the diagonal has non-zero entries.
Look at common confusions and how they might be addressed.
[Figure: confusion matrix with rows = actual class, columns = class assigned by classifier]
Sec. 15.2.4
Per class evaluation measures
Recall: fraction of docs in class $i$ classified correctly:
$$R_i = \frac{c_{ii}}{\sum_j c_{ij}}$$
Precision: fraction of docs assigned class $i$ that are actually about class $i$:
$$P_i = \frac{c_{ii}}{\sum_j c_{ji}}$$
Accuracy (= 1 − error rate): fraction of docs classified correctly:
$$\frac{\sum_i c_{ii}}{\sum_i \sum_j c_{ij}}$$
Sec. 15.2.4
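These formulas are easy to compute from the matrix directly (a sketch assuming numpy; the counts are illustrative):

```python
import numpy as np

C = np.array([[53,  2,  5],   # C[i, j]: docs of true class i assigned class j
              [ 4, 60,  6],   # (illustrative counts, not from a real run)
              [ 3,  1, 66]])

recall    = np.diag(C) / C.sum(axis=1)   # c_ii / sum_j c_ij  (row sums)
precision = np.diag(C) / C.sum(axis=0)   # c_ii / sum_j c_ji  (column sums)
accuracy  = np.trace(C) / C.sum()        # sum_i c_ii / sum_ij c_ij
```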
Averaging: macro vs. micro
We now have an evaluation measure (F1) for one class.
But we also want a single number that shows aggregate
performance over all classes
Micro- vs. Macro-Averaging
If we have more than one class, how do we combine multiple performance measures into one quantity?
Macroaveraging: compute performance for each class, then average.
Compute F1 for each of the C classes
Average these C numbers
Microaveraging: collect decisions for all classes, aggregate them, and then compute the measure.
Compute TP, FP, FN for each of the C classes
Sum these C numbers (e.g., all TP to get aggregate TP)
Compute F1 for aggregate TP, FP, FN
Sec. 15.2.4
Micro- vs. Macro-Averaging: Example
Class 1:
                  Truth: yes   Truth: no
Classifier: yes       10           10
Classifier: no        10          970

Class 2:
                  Truth: yes   Truth: no
Classifier: yes       90           10
Classifier: no        10          890

Micro-averaged (pooled) table:
                  Truth: yes   Truth: no
Classifier: yes      100           20
Classifier: no        20         1860

Macroaveraged precision: (0.5 + 0.9)/2 = 0.7
Microaveraged precision: 100/120 ≈ 0.83
The microaveraged score is dominated by the score on common classes.
Sec. 15.2.4
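A sketch reproducing the example’s numbers (plain Python; the per-class (tp, fp, fn) tuples are read off the tables above):

```python
counts = {"class1": (10, 10, 10), "class2": (90, 10, 10)}   # (tp, fp, fn)

def precision(tp, fp):
    return tp / (tp + fp)

# Macro: average the per-class precisions
macro_p = sum(precision(tp, fp) for tp, fp, _ in counts.values()) / len(counts)
# Micro: pool the counts first, then compute precision once
tp_all = sum(tp for tp, _, _ in counts.values())
fp_all = sum(fp for _, fp, _ in counts.values())
micro_p = precision(tp_all, fp_all)

print(round(macro_p, 3))   # 0.7   = (0.5 + 0.9) / 2
print(round(micro_p, 3))   # 0.833 = 100 / 120
```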
Evaluation measure: F1
Amount of data?
Little data → stick to less powerful classifiers (i.e., linear ones)
Naïve Bayes should do well in such circumstances (Ng and Jordan, NIPS 2002)
The practical answer is to get more labeled data as soon as you can
Reasonable amount of data → we can use all our clever classifiers
Huge amount of data → expensive methods like SVMs (train time) or kNN (test time) become quite impractical
Naïve Bayes can come back into its own again!
Or other advanced methods with linear training/test complexity
With enough data the choice of classifier may not matter much, and the best choice may be unclear
Sec. 15.3.1
How many categories?
A few (well separated ones)?
Easy!
A zillion closely related ones?
Think: Yahoo! Directory
Quickly gets difficult!
May need a hybrid automatic/manual solution
Sec. 15.3.2
Yahoo! Hierarchy
[Figure: part of the Yahoo! directory under www.yahoo.com/Science, e.g., agriculture (agronomy, forestry, crops, dairy), biology (botany, cell, evolution), physics (magnetism, relativity), CS (AI, HCI, courses), space (craft, missions), …]
How can one tweak performance?
Aim to exploit any domain-specific useful features that
give special meanings or that zone the data
Aim to collapse things that would be treated as different
but shouldn’t be.
E.g., ISBNs, part numbers, chemical formulas
Does putting in “hacks” help?
You bet!
Feature design and non-linear weighting is very important in the
performance of real-world systems
Sec. 15.3.2
Upweighting
You can get a lot of value by differentially weighting
contributions from different document zones.
That is, a word counts as two occurrences when it appears in, say, the abstract
Upweighting title words helps (Cohen & Singer 1996)
Doubling the weighting on the title words is a good rule of thumb
Upweighting the first sentence of each paragraph helps
(Murata, 1999)
Upweighting sentences that contain title words helps (Ko et al,
2002)
Sec. 15.3.2
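A sketch of this rule of thumb (plain Python; whitespace tokenization and the helper name are illustrative):

```python
from collections import Counter

def zoned_tf(title, body, title_weight=2):
    """Term frequencies with each title occurrence counted title_weight times."""
    tf = Counter(body.split())
    for t in title.split():
        tf[t] += title_weight          # title occurrences count extra
    return tf
```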
Two techniques for zones
1. Have a completely separate set of features/parameters
for different zones like the title
2. Use the same features (pooling/tying their parameters)
across zones, but upweight the contribution of different
zones
Commonly the second method is more successful:
it costs you nothing in terms of sparsifying the data, but can
give a very useful performance boost
Which is best is a contingent fact about the data
Sec. 15.3.2
Does stemming/lowercasing/… help?
As always, it’s hard to tell, and empirical evaluation is normally the gold standard.
But note that the role of tools like stemming is rather different for TextCat vs. IR:
For IR, you want to improve recall
For TextCat, with sufficient training data, stemming does no good
It only helps in compensating for data sparseness
Sec. 15.3.2
Resources
IIR, Chapter 14