©2012 Paula Matuszek
CSC 9010: Text Mining Applications:
Document-Based Techniques
Dr. Paula Matuszek
(610) 647-9789

Document Classification

Document classification
– Assign documents to pre-defined categories

Examples
– Process email into work, personal, junk
– Process documents from a newsgroup into “interesting”, “not interesting”, “spam and flames”
– Process transcripts of bugged phone calls into “relevant” and “irrelevant”

Issues
– Real-time?
– How many categories per document? Flat or hierarchical?
– Categories defined automatically or by hand?

Document Classification

Usually
– relatively few categories
– well defined; a person could do the task easily
– categories don't change quickly

Flat vs Hierarchical
– Simple classification is into mutually exclusive document collections
– Richer classification is into a hierarchy with multiple inheritance
  – broader and narrower categories
  – documents can go more than one place
  – merges into search interfaces such as PubMed

Classification: Automatic

Statistical approaches: a set of “training” documents defines the categories
– Underlying representation of each document is derived from its text
  – bag of words (BOW)
  – the features we discussed last time
– A classification model is trained using machine learning
– Individual documents are classified by applying the model

Requires relatively little effort to create categories
Accuracy is heavily dependent on the training examples
Typically limited to flat, mutually exclusive categories

Classification: Manual

Natural language/linguistic techniques: categories are defined by people
– underlying representation of a document is typically a stream of tokens
– category description contains
  – an ontology of terms and relations
  – pattern-matching rules
– individual documents are classified by pattern matching

Defining categories can be very time-consuming
Typically takes some experimentation to "get it right"
Can handle much more complex structures

Based on http://u.cs.biu.ac.il/~koppel/TextCateg2010Course.htm

Automatic Classification Framework

Documents → Preprocessing → Feature Extraction → Feature Filtering → Applying Classification Algorithms → Performance Measure

Preprocessing

• Preprocessing: transform documents into a representation suitable for the classification task
  – Remove HTML or other tags
  – Remove stop words
  – Perform word stemming (remove suffixes)

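A minimal sketch of these three steps in plain Python (the tag-stripping regex, the tiny stop word list, and the crude suffix rule are illustrative stand-ins for a real pipeline, e.g. an HTML parser plus a Porter stemmer):

    import re

    STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "was", "were"}

    def preprocess(doc):
        """Raw (possibly HTML) text -> list of stemmed tokens."""
        doc = re.sub(r"<[^>]+>", " ", doc)           # remove HTML or other tags
        tokens = re.findall(r"[a-z]+", doc.lower())  # lowercase and tokenize
        tokens = [t for t in tokens if t not in STOP_WORDS]  # remove stop words
        # crude suffix stripping as a stand-in for a real stemmer
        return [re.sub(r"(ing|ed|s)$", "", t) if len(t) > 4 else t for t in tokens]

    print(preprocess("<p>The cats were chasing mice in the garden</p>"))
    # ['cats', 'chas', 'mice', 'garden']
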
Feature Extraction

Most crucial decision you’ll make!
1. Topic
   • Words, phrases, ...?
2. Author
   • Stylistic features
3. Sentiment
   • Adjectives, ...?
4. Spam
   • Specialized vocabulary

Features must relate to the categories

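To make the contrast concrete, here is a hedged Python sketch (function names and feature choices are invented for illustration): a topic classifier might use bag-of-words counts, while an authorship classifier might use stylistic measures instead:

    import re

    def topic_features(doc):
        """Bag of words: a natural fit for topic classification."""
        counts = {}
        for word in re.findall(r"[a-z']+", doc.lower()):
            counts[word] = counts.get(word, 0) + 1
        return counts

    def style_features(doc):
        """Simple stylistic measures: more useful for authorship."""
        words = re.findall(r"[A-Za-z']+", doc)
        sentences = [s for s in re.split(r"[.!?]+", doc) if s.strip()]
        return {
            "avg_word_len": sum(map(len, words)) / max(len(words), 1),
            "words_per_sentence": len(words) / max(len(sentences), 1),
        }
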
Feature Filtering

• Feature selection: remove non-informative terms from documents
  => improves classification effectiveness
  => reduces computational complexity

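A minimal sketch of one common filtering approach, assuming scikit-learn (the library and the toy corpus are not from the slides): keep only the terms with the highest chi-squared association with the class labels.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, chi2

    docs = ["cheap pills online", "meeting moved to noon",
            "win cash now", "budget review at noon"]
    labels = [1, 0, 1, 0]                      # 1 = spam, 0 = work (toy data)

    X = CountVectorizer().fit_transform(docs)  # documents -> term-count matrix
    X_small = SelectKBest(chi2, k=5).fit_transform(X, labels)  # keep 5 best terms
    print(X.shape, "->", X_small.shape)        # (4, 13) -> (4, 5)
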
Evaluation

We need to know how well our classification system is performing
– Recall: % of documents in a class which are correctly classified as that class
  – r_i = (correctly classified as i) / (total which are actually i)
– Precision: % of documents classified into a class which are actually in that class
  – p_i = (correctly classified as i) / (total classified as i)

[Venn diagram: within the corpus, the set of documents classified into the category overlaps the set of documents actually in the category; the intersection is the correctly categorized documents.]

Combined Effectiveness

• Ideally, we want a measure that combines both precision and recall
• F1 = 2pr / (p + r)
• If we accept nothing, recall is 0, so F1 = 0
• If we accept everything, precision collapses, so F1 is near 0
• For perfect precision and recall, F1 = 1
• If either precision or recall drops, so does F1

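In code, the measure and its behavior (a trivial sketch; the sample values are arbitrary):

    def f1(p, r):
        """Harmonic mean of precision and recall; defined as 0 when both are 0."""
        return 2 * p * r / (p + r) if (p + r) else 0.0

    print(f1(1.0, 1.0))  # 1.0  -- perfect precision and recall
    print(f1(0.9, 0.1))  # 0.18 -- one poor score drags F1 down
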
Measuring Individual Features

If we have a large feature set, we may be interested in which features are actually useful.
– Informative features: the features that give us the biggest separation between two classes
– We can probably omit the least informative features without impacting performance
– Caution: correlation, not causation...

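One simple way to rank features by separation, sketched in Python (the smoothed-count ratio and the toy word lists are illustrative choices, not a standard from the slides):

    from collections import Counter

    def separation_scores(tokens_a, tokens_b):
        """Ratio of smoothed counts between two classes; values far from 1 are informative."""
        a, b = Counter(tokens_a), Counter(tokens_b)
        return {w: (a[w] + 1) / (b[w] + 1) for w in set(a) | set(b)}

    scores = separation_scores("goal match team goal win".split(),
                               "exam lecture professor exam".split())
    print(sorted(scores, key=scores.get, reverse=True)[:3])  # most class-A-leaning words
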
Choice of Evaluation Measure

For many tasks, F1 gives the best overall measure
– sorting news stories
– deciding genre or author

But it depends on your domain
– spam filters: precision matters more (don't throw away real mail)
– flagging important email: recall matters more (don't miss anything important)

Evaluation: Overfitting

Training a model = predicting the classification for our training set given the data in the set
Degrees of freedom: with 10 cases and 10 features I can always predict perfectly
The model may capture chance variations in the set
This leads to overfitting: the model is too closely matched to the exact data set it’s been given
More likely with
– a large number of features
– small training sets

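A quick demonstration of the "10 cases, 10 features" point, assuming scikit-learn and numpy (the model choice is arbitrary; any sufficiently flexible learner shows the same effect):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X_train, y_train = rng.random((10, 10)), rng.integers(0, 2, 10)   # pure noise
    X_test, y_test = rng.random((200, 10)), rng.integers(0, 2, 200)

    tree = DecisionTreeClassifier().fit(X_train, y_train)
    print(tree.score(X_train, y_train))  # 1.0 -- memorized the noise
    print(tree.score(X_test, y_test))    # ~0.5 -- chance on new data
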
Evaluation: Training and Test Sets

To avoid (or at least detect) overfitting, we always use separate training and test sets
– The model is trained on one set of examples
– Evaluation measures are calculated on a different set
– The sets should be comparable, and each should be representative of the overall corpus

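A minimal sketch of such a split, assuming scikit-learn (the toy corpus is invented); stratifying keeps the class proportions comparable across the two sets:

    from sklearn.model_selection import train_test_split

    docs = ["spam offer", "staff meeting", "free pills", "project update"] * 10
    labels = [1, 0, 1, 0] * 10

    train_docs, test_docs, train_y, test_y = train_test_split(
        docs, labels, test_size=0.25, random_state=42, stratify=labels)
    print(len(train_docs), len(test_docs))  # 30 10
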
Some classification methods

Common classification algorithms include
– nearest neighbor (KNN) methods
– decision trees
– naive Bayes classifiers
– linear classifiers (e.g., SVMs)

Based on http://www.iro.umontreal.ca/~nie/IFT6255/Classification.ppt

K-Nearest-Neighbor Algorithm

• Principle: points (documents) that are close in the space belong to the same class

K-Nearest-Neighbor Algorithm

• Measure the similarity between the test document and each neighbor
  – count of words shared
  – tf*idf variants
• Select the k nearest neighbors of the test document among the training examples
  – more than 1 neighbor, to avoid the error of a single atypical training example
  – k is typically 3 or 5
• Assign the test document to the class which contains most of the neighbors

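A hedged sketch of these steps, assuming scikit-learn (the tiny corpus and the cosine-over-tf*idf similarity are illustrative choices):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neighbors import KNeighborsClassifier

    train_docs = ["great goal by the striker", "the team won the match",
                  "new proof of the theorem", "the lemma follows easily"]
    train_labels = ["sports", "sports", "math", "math"]

    vec = TfidfVectorizer()
    X = vec.fit_transform(train_docs)                    # tf*idf vectors
    knn = KNeighborsClassifier(n_neighbors=3, metric="cosine").fit(X, train_labels)

    # majority class among the 3 nearest training documents
    print(knn.predict(vec.transform(["the striker scored in the match"])))  # ['sports']
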
Analysis of KNN Algorithm

• Advantages:
  – Effective
  – Can handle large, sparse vectors
  – “Training time” is short
  – Can be incremental
• Disadvantages:
  – Classification time is long
  – Difficult to find the optimal value of k

Decision Tree Algorithm

• Decision tree over documents:
  – The root node contains all documents
  – Each internal node is a subset of documents, separated according to one attribute
  – Each arc is labeled with a predicate which can be applied to the attribute at the parent
  – Each leaf node is labeled with a class

Example Decision Tree

[Decision tree diagram: the root tests how many times the text contains “Villanova” (0, 1, >1). Zero occurrences → irrelevant; exactly one → general article; more than one → test whether the text contains “Wildcats”: yes → sports article, no → academic article.]

Decision Trees for Text

Each node tests a single variable: not useful for a very large, very sparse vector such as BOW
Features might include
– other document characteristics, like diversity
– counts for a small subset of terms, such as
  – the most frequent ones
  – the highest by tf*idf
  – those from a domain-based ontology

Creating a Decision Tree

– At each node, choose the function which provides maximum separation
– If all examples at a new node are one class, stop for that node
– Recur on each mixed node
– Stop when no choice improves separation, or when you reach a predefined level

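A minimal sketch of this induction loop, assuming scikit-learn as the maximum-separation splitter (the two term-count features and the labels echo the Villanova example above but are invented):

    from sklearn.tree import DecisionTreeClassifier, export_text

    feature_names = ["villanova_count", "wildcats_count"]
    X = [[0, 0], [1, 0], [2, 1], [2, 0]]   # toy term counts per document
    y = ["irrelevant", "general", "sports", "academic"]

    tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
    print(export_text(tree, feature_names=feature_names))  # human-readable rules
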
Analysis of Decision Tree Algorithm

• Advantages:
  – Easy to understand
  – Easy to train
  – Classification is fast
• Disadvantages:
  – Training time is relatively expensive
  – A document is only connected with one branch
  – Once a mistake is made at a higher level, the whole subtree below it is wrong
  – Not suited for very high dimensions

Bayesian Methods

– Based on probability; used widely in probabilistic learning and classification
– Use the prior probability of each category given no information about an item
– Categorization produces a posterior probability distribution over the possible categories, given a description of the item

Naive Bayes

Bayes’ theorem says we can determine the probability of an event C given another event x based on
– the overall probability of event C
– the probability of event x given event C

P(C|x) = P(x|C) * P(C) / P(x)

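A worked toy instance of the formula (all numbers are invented): how likely is a message to be spam given that it contains the word “free”?

    p_spam = 0.4               # P(C): prior probability of spam
    p_free_given_spam = 0.30   # P(x|C)
    p_free_given_ham = 0.02    # P(x|not C)

    p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)  # P(x)
    print(p_free_given_spam * p_spam / p_free)  # P(C|x) ~ 0.909
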
Naïve Bayes Algorithm

• Estimate the probability of each class for a document:
  – Compute the posterior probability (Bayes’ rule)
  – Assume the words are independent (the “naive” assumption)

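Under the independence assumption the posterior is scored as P(C|d) ∝ P(C) · Π_i P(w_i|C). A minimal classifier sketch assuming scikit-learn (the toy corpus is invented):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    docs = ["win cash now", "cheap pills online",
            "meeting at noon", "quarterly budget review"]
    labels = ["spam", "spam", "work", "work"]

    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(docs, labels)                   # estimates P(C) and P(word|C)
    print(model.predict(["cheap cash now"]))  # ['spam']
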
Analysis of Naïve Bayes Algorithm

• Advantages:
  – Works well on numeric and textual data
  – Easy implementation and computation
  – Has been effective in practice; the typical spam filter, for instance
• Disadvantages:
  – The conditional independence assumption is in fact naive: usually violated by real-world data
  – Performs poorly when features are highly correlated

Linear Regression

– Classic linear regression: predict the value of some variable based on a weighted sum of other variables
– A very common statistical technique for prediction
– e.g., predict college GPA with a weighted sum of SAT verbal and quantitative scores, high school GPA, and a “high school quality” measure

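A toy version of the GPA example via ordinary least squares (numpy only; the data are invented, and only two of the predictors are used to keep it small):

    import numpy as np

    X = np.array([[600, 3.5], [500, 3.0], [700, 3.9], [550, 3.2]])  # SAT verbal, HS GPA
    y = np.array([3.4, 2.9, 3.8, 3.1])                              # college GPA

    A = np.hstack([X, np.ones((len(X), 1))])         # add an intercept column
    weights, *_ = np.linalg.lstsq(A, y, rcond=None)  # fit the weighted sum
    print(A @ weights)                               # predicted GPAs for the 4 students
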
Linear Scoring Methods

– A generalization of linear regression to much higher dimensionality
– The goal is binary separation of instances into 2 classes
– Best known is the SVM: support vector machine
  – the classifier is a separating hyperplane
  – the support vectors are the training examples closest to, and defining, the plane

Support Vector Machines

• Main idea of SVMs: find the linear separating hyperplane which maximizes the margin, i.e., the optimal separating hyperplane (OSH)

SVMs

Advantages:
– Handle very large dimensionality
– Empirically, have been shown to work well for text classification

Disadvantages:
– Sensitive to noise, such as mislabeled training examples
– Binary only (but can train multiple SVMs for multi-class problems)
– Implementation is complex: the variety of implementation choices (similarity measure, kernel, etc.) can require extensive tuning

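A hedged end-to-end sketch, assuming scikit-learn's LinearSVC over tf*idf features (the toy corpus is invented; for more than two classes it trains one binary SVM per class, as noted above):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    docs = ["the striker scored a goal", "the team won the cup",
            "the proof uses induction", "we prove the theorem"]
    labels = ["sports", "sports", "math", "math"]

    model = make_pipeline(TfidfVectorizer(), LinearSVC())
    model.fit(docs, labels)                           # learns a separating hyperplane
    print(model.predict(["a late goal won the cup"]))  # ['sports']
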
Summary

Document classification is a common task.
Manual rules provide outstanding results and allow complex structures, but are very expensive to implement.
Automated methods use labeled cases to train a model:
– Decision trees and decision rules are easy to understand, but require a good feature set tuned to the domain
– Nearest neighbor is simple to implement and quick to train, but slow to classify; it can handle incremental training cases
– Bayes is easy to implement and works well in some domains, but can have problems with highly correlated features
– SVMs are more complex to implement, but handle very large dimensionality well and have proven to be the best choice in many text domains