Special topics on text mining [Part I: text classification] Hugo Jair Escalante , Aurelio Lopez, Manuel Montes and Luis Villaseñor
Transcript
Page 1: Special topics on text mining [ Part I: text classification ]

Special topics on text mining[Part I: text classification]

Hugo Jair Escalante, Aurelio Lopez, Manuel Montes and Luis Villaseñor

Page 2: Special topics on text mining [ Part I: text classification ]

Representation and preprocessing of documents

Hugo Jair Escalante, Aurelio Lopez, Manuel Montes and Luis Villaseñor

Page 3: Special topics on text mining [ Part I: text classification ]

Agenda

• Recap: text classification
• Representation of documents
• Preprocessing
• Feature selection
• Instance selection
• Discussion
• Assignments

Page 4: Special topics on text mining [ Part I: text classification ]

Text classification

• Text classification is the assignment of free-text documents to one or more predefined categories based on their content

Documents (e.g., news articles) → Categories/classes (e.g., sports, religion, economy)

Manual approach?

Page 5: Special topics on text mining [ Part I: text classification ]

Manual classification

• Very accurate when the job is done by experts
– Classifying news into general categories is different from classifying biomedical papers into subcategories.
• But difficult and expensive to scale
– Classifying thousands of documents is different from classifying millions.
• Used by Yahoo!, Looksmart, about.com, ODP, Medline, etc.

Ideas for building an automatic classification system?How to define a classification function?

Page 6: Special topics on text mining [ Part I: text classification ]

Hand-coded rule based systems

• Main approach in the 80s
• Disadvantage: the knowledge-acquisition bottleneck
– too time consuming, too difficult, inconsistency issues

[Diagram: experts produce labeled documents; knowledge engineers turn them into rules (Rule 1: if … then … else; …; Rule N: if … then …) that constitute the classifier, which assigns a new document to its category.]

Page 7: Special topics on text mining [ Part I: text classification ]

Example: filtering spam email

• Rule-based classifier

[Figure: two example rule-based spam classifiers.]

Taken from Hastie et al. The Elements of Statistical Learning, 2007, Springer.

Page 8: Special topics on text mining [ Part I: text classification ]

Machine learning approach (1)

• A general inductive process builds a classifier by learning from a set of preclassified examples.
– It determines the characteristics associated with each one of the topics.

Ronen Feldman and James Sanger, The Text Mining Handbook

Page 9: Special topics on text mining [ Part I: text classification ]

Machine learning approach (2)

[Diagram: experts (do they have to be experts?) label documents to form the training set (how large does it have to be?); an inductive process (which algorithm?) learns rules, trees, probabilities, prototypes, etc., producing a classifier that assigns a new document (how to represent documents?) to its category.]

Page 10: Special topics on text mining [ Part I: text classification ]

Machine learning approach (3)

• Machine learning approach to TC: to develop automated methods able to classify documents with a certain degree of success

[Diagram: labeled training documents feed a learning machine (an algorithm); the trained machine then labels an unseen (test, query) document.]

Page 11: Special topics on text mining [ Part I: text classification ]

Text classification

• Machine learning approach to TC: Recipe

1. Gather labeled documents
2. Construction of a classifier
   A. Document representation
   B. Preprocessing
   C. Dimensionality reduction
   D. Classification methods
3. Evaluation of the TC method

Assumption: a large enough training set of labeled documents is available. Later we will study methods that allow us to relax this assumption [semi-supervised and unsupervised learning].

Page 12: Special topics on text mining [ Part I: text classification ]

Document representation

• Represent the content of digital documents in a form that can be processed by a computer

Page 13: Special topics on text mining [ Part I: text classification ]

Before representing documents: Preprocessing

• Eliminate information about style, such as HTML or XML tags.
– For some applications this information may be useful; for instance, to index only some document sections.

• Remove stop words
– Functional words such as articles, prepositions, and conjunctions are not useful (they have no meaning of their own).

• Perform stemming or lemmatization
– The goal is to reduce inflectional forms, and sometimes derivationally related forms.

am, are, is → be;  car, cars, car's → car
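The preprocessing steps above can be sketched in Python; the stop-word list and suffix rules here are illustrative stand-ins for a real list and a real stemmer (e.g., Porter's):

```python
# Minimal preprocessing sketch: tag removal, stop-word filtering, and crude
# suffix stripping (hypothetical stop words and rules, not a real stemmer).
import re

STOP_WORDS = {"the", "a", "an", "of", "on", "and", "is", "are", "am", "to"}

def preprocess(text):
    # Strip markup tags, lowercase, and tokenize on runs of letters.
    text = re.sub(r"<[^>]+>", " ", text.lower())
    tokens = re.findall(r"[a-z']+", text)
    # Remove stop words.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Crude suffix stripping as a stand-in for stemming/lemmatization.
    out = []
    for t in tokens:
        for suf in ("'s", "ies", "es", "s"):
            if t.endswith(suf) and len(t) > len(suf) + 2:
                t = t[: -len(suf)]
                break
        out.append(t)
    return out

print(preprocess("<b>The cars</b> on the road"))
```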

Page 14: Special topics on text mining [ Part I: text classification ]

Document representation

• Transform documents, which typically are strings of characters, into a representation suitable for the learning algorithm:
– Codify/represent/transform documents into a vector representation

• The most commonly used document representation is the bag of words (BOW):
– Documents are represented by the set of words they contain
– Word order is not captured by this representation
– There is no attempt to understand their content
– The vocabulary of all the different words in all the documents is the basis for the vector representation

Page 15: Special topics on text mining [ Part I: text classification ]

Document representation

[Matrix: rows d1, d2, …, dm are the documents in the corpus (one vector/row per document); columns t1, t2, …, t|V| are the terms in the vocabulary (the basic units expressing a document's content); entry wi,j is a weight indicating the contribution of word j in document i. V is the vocabulary of the collection (i.e., the set of all different words that occur in the corpus).]

Which words are good features? How to select/extract them? How to compute their weights?

Page 16: Special topics on text mining [ Part I: text classification ]

(ML Conventions)

[Figure: standard ML notation: a data matrix X = {xij} with m rows (instances xi) and n columns (features), a label vector y = {yj}, and model parameters (a, w). Slide taken from I. Guyon. Feature and Model Selection. Machine Learning Summer School, Ile de Re, France, 2008.]

Page 17: Special topics on text mining [ Part I: text classification ]

(A brief note on evaluation in TC)

• The available data is divided into three subsets:
– Training (m1): used for the construction (learning) of the classifier
– Validation (m2): used for the optimization of the parameters of the TC method
– Test (m3): used for the evaluation of the classifier

[Figure: the document-term matrix (M documents × N = |V| terms) split row-wise into the m1, m2, and m3 subsets.]

Page 18: Special topics on text mining [ Part I: text classification ]

Document representation

• Simplest BOW-based representation: each document is represented by a binary vector whose entries indicate the presence/absence of terms from the vocabulary (Boolean/binary weighting)

Document Content

Syllabus.txt Advanced topics on text mining

Evaluation.txt Homework, reports (text)

Students.txt Graduate (Advanced)

Description.txt Studying topics on text mining

Obtain the BOW representation with Boolean weighting for these documents
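A minimal Python sketch of this exercise, assuming simple whitespace tokenization with punctuation stripped:

```python
# Boolean BOW for the slide's four example documents.
docs = {
    "Syllabus.txt":    "Advanced topics on text mining",
    "Evaluation.txt":  "Homework, reports (text)",
    "Students.txt":    "Graduate (Advanced)",
    "Description.txt": "Studying topics on text mining",
}

def tokenize(s):
    # Lowercase, split on whitespace, strip surrounding punctuation.
    return [w.strip("(),.").lower() for w in s.split()]

# Vocabulary: all different words across the collection, in sorted order.
vocab = sorted({t for content in docs.values() for t in tokenize(content)})

# One binary vector per document: 1 iff the term occurs in it.
bow = {name: [1 if term in tokenize(content) else 0 for term in vocab]
       for name, content in docs.items()}

print(vocab)
print(bow["Syllabus.txt"])
```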

Page 19: Special topics on text mining [ Part I: text classification ]

Term weighting [extending the Boolean BOW]

• Two main ideas:
– The importance of a term increases proportionally to the number of times it appears in the document.
• It helps to describe the document's content.
– The general importance of a term decreases proportionally to its occurrences in the entire collection.
• Common terms are not good for discriminating between different classes.

Does the order of words matter?

Page 20: Special topics on text mining [ Part I: text classification ]

Term weighting – main approaches

• Binary weights:
– wi,j = 1 iff document di contains term tj, otherwise 0.

• Term frequency (tf):
– wi,j = tf(tj, di), the number of occurrences of tj in di

• tf x idf weighting scheme:
– wi,j = tf(tj, di) × idf(tj), where:
• tf(tj, di) indicates the occurrences of tj in document di
• idf(tj) = log [N / df(tj)], where df(tj) is the number of documents that contain the term tj.

These methods do not use the information of the classes. Why?
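The weighting schemes above can be computed in a few lines; the three-document corpus is an illustrative toy:

```python
# tf-idf weighting following the slide's definitions:
# w_ij = tf(t_j, d_i) * log(N / df(t_j)).
import math

corpus = [
    ["text", "mining", "text"],
    ["mining", "course"],
    ["text", "classification"],
]
N = len(corpus)
vocab = sorted({t for doc in corpus for t in doc})

# Document frequency: number of documents containing each term.
df = {t: sum(t in doc for doc in corpus) for t in vocab}

# tf-idf weight for every term occurring in each document.
weights = [
    {t: doc.count(t) * math.log(N / df[t]) for t in set(doc)}
    for doc in corpus
]
print(weights[0])
```

Note how the common terms ("text", "mining") receive a smaller idf factor than the rare ones.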


Page 22: Special topics on text mining [ Part I: text classification ]

Extended document representations

• Document representations that capture information not considered by the BOW formulation

• Examples:
– Based on distributional term representations
– Locally weighted bag of words

Page 23: Special topics on text mining [ Part I: text classification ]

Distributional term representations (DTRs)

• Distributional hypothesis: words that occur in the same contexts tend to have similar meanings
– A word is known by the company it keeps!
– Occurrence & co-occurrence

• Distributional term representation: each term is represented by a vector of weights that indicate how often it occurs in documents or how often it co-occurs with other terms

• DTRs for document representation: combine the DTR vectors of the terms that appear in the document

Page 24: Special topics on text mining [ Part I: text classification ]

Distributional term representations (DTRs)

• Documents are represented by the weighted sum of the DTRs of the terms appearing in the document

[Figure: for the document "Accommodation Steinhausen - Exterior View Blumenau, Brazil, October 2004", each term (Accommodation, Steinhausen, …) is mapped to its DTR vector of contextual weights; summing these term vectors yields the DTR for the document.]

Page 25: Special topics on text mining [ Part I: text classification ]

Distributional term representations (DTRs)

• Document occurrence representation (DOR): The representation of a term is given by the documents it mostly occurs in:

w^dor(tj, dk) = df(tj, dk) × log(M / T_dk)

where df(tj, dk) reflects the occurrences of tj in document dk, M is the number of documents in the collection, and the log factor discounts, idf-style, documents that contain many distinct terms (T_dk).

• Term co-occurrence representation (TCOR): The representation of a term is determined by the other terms from the vocabulary that mostly co-occur with it:

w^tcor(tj, tk) = tf(tj, tk) × log(M / N_tk)

where tf(tj, tk) counts the co-occurrences of tj with tk, and the log factor discounts terms that co-occur with many others (N_tk).

A. Lavelli, F. Sebastiani, and R. Zanoli. Distributional Term Representations: An Experimental Comparison. Proceedings of the International Conference on Information and Knowledge Management, pp. 615—624, 2004, ACM Press.
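A toy sketch of DOR and TCOR term vectors; for simplicity it uses plain occurrence and co-occurrence counts and omits the log discounts:

```python
# DOR and TCOR vectors over a toy corpus (counts only; the cited paper
# additionally applies an idf-style log discount).
corpus = [
    ["text", "mining", "text"],
    ["mining", "course"],
    ["text", "classification"],
]
vocab = sorted({t for doc in corpus for t in doc})

# DOR: a term is represented by the documents it occurs in.
dor = {t: [doc.count(t) for doc in corpus] for t in vocab}

# TCOR: a term is represented by the other terms it co-occurs with
# (here: occurring in the same document).
tcor = {t: [sum(doc.count(u) for doc in corpus if t in doc and t != u)
            for u in vocab]
        for t in vocab}

print(dor["text"])   # occurrences of 'text' in each document
print(tcor["text"])  # co-occurrences of 'text' with each vocabulary term
```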

Page 26: Special topics on text mining [ Part I: text classification ]

Document representation (DOR & TCOR)

• DOR-B:
– Unweighted sum of the DOR vectors of the terms in the document:

d_i^dor = Σ_{k=1..Ni} w_tk^dor

• DOR-TF-IDF:
– Weight each term's DOR vector by its occurrence frequency in the document and its usage across the collection:

d_i^dor-tfidf = Σ_{k=1..Ni} tf(tk, di) × log(N / N_tk) × w_tk^dor

• DOR-TF-IDF-W:
– Weight terms from different modalities:

d_i^dor-tfidf-w = a_t × d_{i,t}^dor-tfidf + a_l × d_{i,l}^dor-tfidf

Page 27: Special topics on text mining [ Part I: text classification ]

LOWBOW: The locally weighted bag-of-words framework

• Each document is represented by a set of local histograms computed across the whole document but smoothed by kernels and centered at different document locations

• LOWBOW-based document representations can preserve sequential information in documents

G. Lebanon,Y. Mao, and J. Dillon. The Locally Weighted Bag of Words Framework for Document Representation. Journal of Machine Learning Research. Vol. 8, pp. 2405—2441, 2007.

Page 28: Special topics on text mining [ Part I: text classification ]

BOW approach

• Indicates the (weighted) occurrence of terms in documents:

d_i = ⟨x_{i,1}, …, x_{i,|V|}⟩

Page 29: Special topics on text mining [ Part I: text classification ]

LOWBOW framework

• A set of histograms, each weighted according to selected positions in the document:

d_i = {dl_i^1, …, dl_i^k},   dl_i^j = K_{μj,s} ∘ d_i

(each local histogram dl_i^j is the BOW of d_i reweighted by a kernel K centered at position μj with scale s)
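A sketch of the LOWBOW idea, assuming Gaussian kernels over normalized word positions (kernel form and the three locations are illustrative choices):

```python
# LOWBOW sketch: each local histogram is a BOW whose counts are weighted
# by a Gaussian kernel centered at a position mu in [0, 1].
import math

def local_histogram(tokens, vocab, mu, sigma):
    n = len(tokens)
    hist = {t: 0.0 for t in vocab}
    for pos, tok in enumerate(tokens):
        x = pos / max(n - 1, 1)  # normalized position in [0, 1]
        hist[tok] += math.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
    total = sum(hist.values())
    return {t: v / total for t, v in hist.items()}  # normalize to a histogram

tokens = ["china", "sent", "official", "ukraine", "embassy", "taiwan"]
vocab = sorted(set(tokens))

# One local histogram per kernel location; together they retain some
# sequential information that a single BOW discards.
lowbow = [local_histogram(tokens, vocab, mu, sigma=0.2)
          for mu in (0.0, 0.5, 1.0)]
print(round(lowbow[0]["china"], 3))
```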

Page 30: Special topics on text mining [ Part I: text classification ]

BOW

China sent a senior official to attend a reception at the Ukraine embassy on Friday despite a diplomatic rift over a visit to Kiev by Taiwan's vice president Lien Chan. But an apparent guest list mix-up left both sides unsure over who would represent Beijing at the reception, held to mark Ukraine's independence day…

Benjamin Kang Lim

BOW representation: a single vector over terms 1 … |V|

Page 31: Special topics on text mining [ Part I: text classification ]

LOWBOW

China sent a senior official to attend a reception at the Ukraine embassy on Friday despite a diplomatic rift over a visit to Kiev by Taiwan's vice president Lien Chan. But an apparent guest list mix-up left both sides unsure over who would represent Beijing at the reception, held to mark Ukraine's independence day…

Benjamin Kang Lim

Identify locations in documents

Page 32: Special topics on text mining [ Part I: text classification ]

LOWBOW

China sent a senior official to attend a reception at the Ukraine embassy on Friday despite a diplomatic rift over a visit to Kiev by Taiwan's vice president Lien Chan. But an apparent guest list mix-up left both sides unsure over who would represent Beijing at the reception, held to mark Ukraine's independence day…

Benjamin Kang Lim

Weight the contribution of terms according to Gaussians at the different locations (one local histogram over terms 1 … |V| per location)


Page 36: Special topics on text mining [ Part I: text classification ]

[Figure: kernel smoothing. The document's word sequence w1, w2, w3, …, wN is covered by k kernels K_{μ1,s}(x), K_{μ2,s}(x), …, K_{μk,s}(x) centered at locations μ1, μ2, μ3, …, μk. Combining position weighting with term-frequency weighting yields one local histogram (over terms 1 … |V|) per kernel location.]

Page 37: Special topics on text mining [ Part I: text classification ]

Assignment # 1

• Read a paper on term weighting or document occurrence representations (it can be one from those available on the course page or another chosen by you)

• Prepare a presentation of at most 10 minutes in which you describe the proposed/adopted approach *different from those seen in class*. The presentation must cover the following aspects:

A. Underlying, intuitive idea of the approach
B. Formal description
C. Benefits and limitations (compared to the schemes seen in class)
D. Your idea(s) to improve the presented approach

Page 38: Special topics on text mining [ Part I: text classification ]

Suggested readings on term weighting and preprocessing

• M. Lan, C. Tan, H. Low, S. Sung. A comprehensive comparative study on term weighting schemes for text categorization with support vector machines. Proc. of WWW, pp. 1032—1033, 2005.

Page 39: Special topics on text mining [ Part I: text classification ]

Text classification

• Machine learning approach to TC: Recipe

1. Gather labeled documents
2. Construction of a classifier
   A. Document representation
   B. Preprocessing
   C. Dimensionality reduction
   D. Classification methods

3. Evaluation of a TC method

Page 40: Special topics on text mining [ Part I: text classification ]

Dimensionality issues

• What do you think is the average size of the vocabulary in a small-scale text categorization problem (~1,000 - 10,000 documents)?

• It depends on the domain and type of the corpus, although usual vocabulary sizes in text classification range from a few thousand to millions of terms

Page 41: Special topics on text mining [ Part I: text classification ]

Dimensionality issues

• A central problem in text classification is the high dimensionality of the feature space.
– There is one dimension for each unique word found in the collection; this can reach hundreds of thousands
– Processing is extremely costly in computational terms
– Most of the words (features) are irrelevant to the categorization task

How to select/extract relevant features? How to evaluate the relevance of the features?

Page 42: Special topics on text mining [ Part I: text classification ]

The curse of dimensionality

• Dimensionality is a common issue in machine learning (in general)

• The number of regions of the input space grows exponentially with the dimensionality of the problem

• We need an exponential number of training examples to cover all those regions

Image taken from: Samy Bengio and Yoshua Bengio, Taking on the Curse of Dimensionality in Joint Distributions Using Neural Networks, in: IEEE Transaction on Neural Networks, special issue on data mining and knowledge discovery, volume 11, number 3, pages 550-557, 2000.

Page 43: Special topics on text mining [ Part I: text classification ]

Dimensionality reduction: Two main approaches

• Feature selection
– Idea: removal of non-informative words according to corpus statistics
– Output: a subset of the original features
– Main techniques: document frequency, mutual information, and information gain

• Re-parameterization
– Idea: combine lower-level features (words) into higher-level orthogonal dimensions
– Output: a new set of features (not words)
– Main techniques: word clustering and latent semantic indexing (LSI)
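A re-parameterization sketch: LSI via a truncated SVD of a toy term-document matrix (the matrix values and the choice of k are illustrative):

```python
# Latent Semantic Indexing sketch: project documents onto the k strongest
# singular directions of the document-term matrix.
import numpy as np

# Rows = documents, columns = terms (a tiny tf matrix).
X = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [1.0, 0.0, 0.0, 1.0],
])
k = 2  # number of latent dimensions

U, s, Vt = np.linalg.svd(X, full_matrices=False)
# Documents in the k-dimensional latent space (not words anymore).
docs_lsi = U[:, :k] * s[:k]

print(docs_lsi.shape)
```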

Page 44: Special topics on text mining [ Part I: text classification ]

FS: Document frequency

• The document frequency of a word is the number of documents in which it occurs.

• This technique consists of removing the words whose document frequency is less than a specified threshold.

• The basic assumption is that rare words are either non-informative for category prediction or not influential in global performance.
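The document-frequency filter takes only a few lines; the threshold and corpus here are illustrative:

```python
# Feature selection by document frequency: drop terms occurring in fewer
# documents than a threshold.
corpus = [
    ["text", "mining", "text", "rare"],
    ["mining", "course"],
    ["text", "classification", "mining"],
]
min_df = 2

# Count, for each term, the number of documents containing it.
df = {}
for doc in corpus:
    for t in set(doc):
        df[t] = df.get(t, 0) + 1

# Keep only terms whose document frequency reaches the threshold.
selected = sorted(t for t, n in df.items() if n >= min_df)
print(selected)
```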

Page 45: Special topics on text mining [ Part I: text classification ]

FS: Document frequency


Page 47: Special topics on text mining [ Part I: text classification ]

Zipf's law

• The frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.

G. Kirby. Zipf’s law. UK Journal of Naval Science Volume 10, No. 3 pp 180 – 185, 1985.

Page 48: Special topics on text mining [ Part I: text classification ]

FS: Mutual information

• Measures the mutual dependence of two variables
– In TC, it measures the information that a word t and a class c share: how much knowing word t reduces our uncertainty about class c

The idea is to select words that are strongly related to one class

Page 49: Special topics on text mining [ Part I: text classification ]

FS: Mutual information

• Let:

• Then:

• To get the global MI for term t:

A: # times t and c co-occurB: # times t occurs without cC: # times c occurs without tN: # documents
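The pointwise estimate above in code, with a toy contingency table:

```python
# Pointwise mutual information of term t with class c, from contingency
# counts: MI(t, c) ~= log(A * N / ((A + C) * (A + B))).
import math

def mi(A, B, C, N):
    # A: docs where t and c co-occur; B: t without c; C: c without t;
    # N: total number of documents.
    return math.log((A * N) / ((A + C) * (A + B)))

# Toy example: 100 docs; the term occurs in 12 docs of class c and in
# 3 docs outside it; class c has 8 further docs without the term.
score = mi(A=12, B=3, C=8, N=100)
print(round(score, 3))
```

A positive score means the term and the class occur together more often than independence would predict.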

Page 50: Special topics on text mining [ Part I: text classification ]

FS: Information gain (1)

• Information gain (IG) measures how well an attribute separates the training examples according to their target classification
– Is the attribute a good classifier?

• The idea is to select the set of attributes having the greatest IG values
– Commonly, maintain attributes with IG > 0

How to measure the worth (IG) of an attribute?

Page 51: Special topics on text mining [ Part I: text classification ]

FS: Information gain (2)

• Information gain is defined in terms of entropy

• Entropy characterizes the impurity of an arbitrary collection of examples.
– It specifies the minimum number of bits of information needed to encode the classification of an arbitrary member of the dataset (S).

• For a binary problem:

Entropy(S) = −p+ log2(p+) − p− log2(p−)

– p+ = p− = 0.5: greatest uncertainty; 1 bit is needed to encode the class
– p+ = 0 or 1: no uncertainty (always positive/negative); no need to encode the class

• In general, entropy is the average information of a message that can take m values: Entropy = −Σi pi log2(pi)

Page 52: Special topics on text mining [ Part I: text classification ]

FS: Information gain (3)

• IG of an attribute measures the expected reduction in entropy caused by partitioning the examples according to this attribute.– The greatest the IG, the better the attribute for classification– IG < 0 indicates that we have a problem with greater

uncertainty than the original– The maximum value is log C; C is the number of classes.
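Entropy and information gain for a binary term-presence feature; in this toy example the term perfectly separates the two classes, so its IG is 1 bit:

```python
# Information gain of a binary feature (term present/absent), following
# IG(S, A) = Entropy(S) - sum_v (|S_v|/|S|) * Entropy(S_v).
import math

def entropy(labels):
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def info_gain(feature, labels):
    n = len(labels)
    ig = entropy(labels)
    for v in set(feature):
        # Entropy of the subset where the feature takes value v.
        subset = [y for x, y in zip(feature, labels) if x == v]
        ig -= len(subset) / n * entropy(subset)
    return ig

present = [1, 1, 0, 0]                                # term occurs in docs 1-2
classes = ["sports", "sports", "economy", "economy"]  # class labels
print(info_gain(present, classes))
```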

Page 53: Special topics on text mining [ Part I: text classification ]

Other FS methods for TC

G. Forman. An Extensive Empirical Study of Feature Selection Metrics for Text Classification. JMLR, 3:1289—1305, 2003


Page 55: Special topics on text mining [ Part I: text classification ]

Feature selection: the ML perspective

• Problem: find the subset of features most helpful for classification
– Reduce the dimensionality of the data
– Eliminate uninformative features
– Find discriminative features

For a problem with n features there are 2^n different subsets of features

I. Guyon, et al. Feature Extraction: Foundations and Applications, Springer 2006.


Page 57: Special topics on text mining [ Part I: text classification ]

Feature selection: the ML perspective

• Filters: Evaluate the importance of features using methods that are independent of the classification model

• Wrappers: Evaluate the importance of subsets of features using the classification model itself (a search strategy is adopted)

• Embedded: Take advantage of the nature of the classification model being considered

I. Guyon, et al. Feature Extraction: Foundations and Applications, Springer 2006.

Page 58: Special topics on text mining [ Part I: text classification ]

Filters vs. Wrappers

• Main goal: rank subsets of useful features.

[Diagram: filter: all features → filter → feature subset → predictor. Wrapper: all features → wrapper, which searches over multiple feature subsets with the predictor in the loop.]

• Danger of over-fitting with intensive search!

Page 59: Special topics on text mining [ Part I: text classification ]

Feature selection the ML perspective

• General diagram of a wrapper feature selection method

[Diagram: original feature set → generation → subset of features → evaluation → stopping criterion; if not met, loop back to generation; if met, the selected subset of features goes to validation.]

Generation = select a candidate feature subset.
Evaluation = compute the relevance value of the subset.
Stopping criterion = determine whether the subset is relevant.
Validation = verify the subset's validity.
From M. Dash and H. Liu. http://www.comp.nus.edu.sg/~wongszec/group10.ppt
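The generation/evaluation loop can be sketched as greedy forward selection; `evaluate()` is a hypothetical stand-in for cross-validated classifier accuracy:

```python
# Wrapper sketch: greedy forward selection.
def evaluate(subset):
    # Placeholder score: a real wrapper would train and validate a
    # classifier here. This toy score rewards two "relevant" features
    # and slightly penalizes subset size.
    relevant = {"text", "mining"}
    return len(set(subset) & relevant) - 0.01 * len(subset)

def forward_selection(features):
    selected, best_score = [], evaluate([])
    improved = True
    while improved:  # stopping criterion: no candidate improves the score
        improved = False
        for f in set(features) - set(selected):  # generation
            score = evaluate(selected + [f])     # evaluation
            if score > best_score:
                best_score, best_f, improved = score, f, True
        if improved:
            selected.append(best_f)
    return selected

print(sorted(forward_selection(["text", "mining", "noise1", "noise2"])))
```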

Page 60: Special topics on text mining [ Part I: text classification ]

Trends on feature selection for text classification

• How to combine the output of different feature selection methods?

• Hybrid feature selection method (filter(s) + wrapper) for feature subset selection

Page 61: Special topics on text mining [ Part I: text classification ]

Fusion of feature selection methods

• Different feature selection methods choose different features

• The combination of the outputs of different FS methods can result in a better feature selection process

Y. Saeys et al. Robust feature selection using ensemble feature selection techniques. Proc. of ECML/PKDD, pp. 313—325, LNAI 5112, Springer, 2008.

R. Neumayer et al. Combination of feature selection methods for text categorization. Proc. of European Conference on Information Retrieval, pp. 763—766, 2011.

Page 62: Special topics on text mining [ Part I: text classification ]

Hybrids for feature selection

• Wrapper-based feature selection often outperforms filter-based methods. However, wrappers are computationally demanding. (Most) filters are extremely efficient, although most of them are univariate.

• Idea: a hybrid approach:
– Use filters to generate a ranked list of features (possibly using several FS methods)
– Use a wrapper to select the best subset of features using the information returned by the filter

M. A. Esseghir. Effective Wrapper-Filter hybridization through GRASP Schemata. JMLR Workshop and Conference Proceedings Vol. 10: Feature Selection in Data Mining, 10:45-54, 2010.

P. Bermejo, J. A. Gamez, J. M. Puerta. A GRASP Algorithm for Fast Hybrid Feature Subset Selection in High-dimensional Datasets. Pattern Recognition Letters, 2011.

J. Pacheco, S. Casado, L. Nunez. A Variable Selection Method based on Tabu Search for Logistic Regression. European Journal of Operations Research, Vol. 199:506—511, 2009.

Page 63: Special topics on text mining [ Part I: text classification ]

Feature Extraction: Foundations and Applications. I. Guyon et al., Eds. Springer, 2006. http://clopinet.com/fextract-book

More on feature selection and extraction …

Page 64: Special topics on text mining [ Part I: text classification ]

Assignment # 2

• Read a paper about feature selection/extraction for text classification (it can be one from those available on the course page or another chosen by you)

• Prepare a presentation of at most 10 minutes in which you describe a feature selection approach *different from those seen in class*. The presentation must cover the following aspects:

A. Underlying, intuitive idea of the proposed approach
B. Formal description
C. Benefits and limitations (compared to the approaches seen in class)
D. Your idea(s) to improve the presented approach

Page 65: Special topics on text mining [ Part I: text classification ]

Suggested readings on feature selection for text classification

• G. Forman. An Extensive Empirical Study of Feature Selection Metrics for Text Classification. JMLR, 3:1289—1305, 2003

• H. Liu, H. Motoda. Computational Methods of Feature Selection. Chapman & Hall, CRC, 2008.

• Y. Yang, J. O. Pedersen. A Comparative Study on Feature Selection in Text Categorization. Proc. of the 14th International Conference on Machine Learning, pp. 412—420, 1997.

• D. Mladenic, M. Grobelnik. Feature Selection for Unbalanced Class Distribution and Naïve Bayes. Proc. of the 16th International Conference on Machine Learning, pp. 258—267, 1999.

• I. Guyon, et al. Feature Extraction Foundations and Applications, Springer, 2006.

• I. Guyon, A. Elisseeff. An Introduction to Variable and Feature Selection. Journal of Machine Learning Research, Vol. 3:1157—1182, 2003.

