Data Warehousing & Mining with Business Intelligence: Principles and Algorithms

Overview of Text Mining
Motivation

- Text mining is well motivated, because much of the world's data exists in free-text form (newspaper articles, emails, literature, etc.).
- While mining free text has the same goals as data mining in general (extracting useful knowledge, statistics, and trends), text mining must overcome a major difficulty: there is no explicit structure.
- Machines can reason over relational data well, since schemas are explicitly available. Free text, however, encodes all semantic information within natural language.
- Text mining algorithms, then, must make some sense out of this natural-language representation.
- Humans are great at doing this, but it has proved to be a problem for machines.
Sources Of dataSources Of data
LettersLetters EmailsEmails Phone Phone
recordingsrecordings ContractsContracts
Technical Technical documentsdocuments
PatentsPatents Web pagesWeb pages ArticlesArticles
33
Text Mining

- How does it relate to data mining in general?
- How does it relate to computational linguistics?
- How does it relate to information retrieval?

                    Finding Patterns            Finding "Nuggets" (Novel)   Finding "Nuggets" (Non-Novel)
Non-textual data    General data mining         Exploratory data analysis   Database queries
Textual data        Computational linguistics   Text mining                 Information retrieval
Typical Applications

- Summarizing documents
- Discovering/monitoring relations among people, places, organizations, etc.
- Customer profile analysis
- Trend analysis
- Spam identification
- Public health early warning
- Event tracking
Mining Text Data: An Introduction

Data mining / knowledge discovery applies to several kinds of data: structured data, multimedia, free text, and hypertext. The same facts can appear in each form:

- Structured data: HomeLoan(Loanee: Frank Rizzo, Lender: MWF, Agency: Lake View, Amount: $200,000, Term: 15 years)
- Free text: "Frank Rizzo bought his home from Lake View Real Estate in 1992. He paid $200,000 under a 15-year loan from MW Financial."
- Hypertext: <a href>Frank Rizzo</a> bought <a href>this home</a> from <a href>Lake View Real Estate</a> in <b>1992</b>. ... Loans($200K, [map], ...)
General NLP: Too Difficult!

- Word-level ambiguity
  - "design" can be a noun or a verb (ambiguous POS)
  - "root" has multiple meanings (ambiguous sense)
- Syntactic ambiguity
  - "natural language processing" (modification)
  - "A man saw a boy with a telescope." (PP attachment)
- Anaphora resolution
  - "John persuaded Bill to buy a TV for himself." (himself = John or Bill?)
- Presupposition
  - "He has quit smoking." implies that he smoked before.

Humans rely on context to interpret (when possible). This context may extend beyond a given document!
Text Databases and IR

Text databases (document databases):
- Large collections of documents from various sources: news articles, research papers, books, digital libraries, e-mail messages, and Web pages
- Data stored is usually semi-structured
- Traditional IR techniques become inadequate for the increasingly vast amounts of text data

Information retrieval:
- A field developed in parallel with database systems
- Information is organized into (a large number of) documents
- The information retrieval problem: locating relevant documents based on user input, such as keywords or example documents
Information Retrieval

Typical IR systems:
- Online library catalogs
- Online document management systems

Information retrieval vs. database systems:
- Some DB problems are not present in IR, e.g., updates, transaction management, complex objects
- Some IR problems are not addressed well in DBMS, e.g., unstructured documents, approximate search using keywords and relevance
Some "Basic" IR Techniques

- Stemming
- Stop words
- Weighting of terms (e.g., TF-IDF)
- Vector/unigram representation of text
- Text similarity (e.g., cosine, KL-divergence)
- Relevance/pseudo feedback
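A minimal Python sketch of the first two steps, stop-word removal and stemming; the tiny stop list and suffix rules here are illustrative stand-ins for a real stemmer such as Porter's:

```python
# Minimal preprocessing sketch: tokenize, drop stop words, strip common
# suffixes. The stop list and suffix rules are toy assumptions; production
# systems use a full stemming algorithm and a much larger stop list.
STOP_WORDS = {"a", "the", "of", "for", "to", "with", "and", "in"}

def crude_stem(word):
    # Naive suffix stripping, standing in for a real stemming algorithm.
    for suffix in ("ing", "ed", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = [w.lower() for w in text.split()]
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The walker was walking with the drugged drugs"))
# ['walk', 'was', 'walk', 'drugg', 'drug'] -- crude, but shows the idea
```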
Information Retrieval Techniques

Basic concepts:
- A document can be described by a set of representative keywords called index terms.
- Different index terms have varying relevance when used to describe document contents.
- This effect is captured through the assignment of numerical weights to each index term of a document (e.g., frequency, TF-IDF).

DBMS analogy:
- Index terms ↔ attributes
- Weights ↔ attribute values
Generality of Basic Techniques

[Figure: processing pipeline. Raw text → stemming & stop words → tokenized text → term weighting → term-document weight matrix W, where entry w_ij is the weight of term t_j in document d_i (rows d1 … dm, columns t1 … tn). From the matrix, term similarity and document similarity support CLUSTERING (via vector centroids), CATEGORIZATION, META-DATA/ANNOTATION, and sentence selection for SUMMARIZATION.]
Basic Measures for Text Retrieval

Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., "correct" responses):

    precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|

Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved:

    recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|

[Figure: Venn diagram of the Relevant and Retrieved sets within All Documents; their overlap is Relevant & Retrieved.]
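A small Python sketch of both measures over sets of document IDs; the sets themselves are made up for illustration:

```python
# Precision and recall over document-ID sets, following the set
# definitions above.
def precision(relevant, retrieved):
    return len(relevant & retrieved) / len(retrieved) if retrieved else 0.0

def recall(relevant, retrieved):
    return len(relevant & retrieved) / len(relevant) if relevant else 0.0

relevant = {1, 2, 3, 5}    # ground-truth relevant docs (made-up IDs)
retrieved = {2, 3, 4, 6}   # what the system returned
print(precision(relevant, retrieved))  # 2/4 = 0.5
print(recall(relevant, retrieved))     # 2/4 = 0.5
```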
Information Retrieval Techniques

Index term (attribute) selection:
- Stop lists
- Word stemming
- Index term weighting methods

Term-document frequency matrices

Information retrieval models:
- Boolean model
- Vector model
- Probabilistic model
Boolean Model

- Considers index terms to be either present or absent in a document
- As a result, the index term weights are assumed to be all binary
- A query is composed of index terms linked by three connectives: not, and, or (e.g., car and repair, plane or airplane)
- The Boolean model predicts that each document is either relevant or non-relevant, based on the match of the document to the query
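A toy Python sketch of Boolean retrieval, with made-up documents reduced to term sets:

```python
# Boolean-model sketch: each document is reduced to the set of index
# terms it contains, and a Boolean query is evaluated with set tests.
# Document contents are made up for illustration.
docs = {
    "d1": {"car", "repair", "shop"},
    "d2": {"plane", "engine"},
    "d3": {"car", "dealer"},
}

# Query: car AND repair
print([d for d, terms in docs.items() if "car" in terms and "repair" in terms])
# ['d1']

# Query: plane OR airplane
print([d for d, terms in docs.items() if "plane" in terms or "airplane" in terms])
# ['d2']
```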
Keyword-Based Retrieval

- A document is represented by a string, which can be identified by a set of keywords
- Queries may use expressions of keywords (e.g., car and repair shop, tea or coffee, DBMS but not Oracle)
- Queries and retrieval should consider synonyms (e.g., repair and maintenance)

Major difficulties of the model:
- Synonymy: a keyword T does not appear anywhere in the document, even though the document is closely related to T (e.g., data mining)
- Polysemy: the same keyword may mean different things in different contexts (e.g., mining)
Similarity-Based Retrieval in Text Data

- Finds similar documents based on a set of common keywords
- The answer should be based on the degree of relevance, determined by the nearness of the keywords, the relative frequency of the keywords, etc.

Basic techniques:
- Stop list
  - A set of words that are deemed "irrelevant", even though they may appear frequently
  - E.g., a, the, of, for, to, with, etc.
  - Stop lists may vary when the document set varies
Similarity-Based Retrieval in Text Data

- Word stem
  - Several words are small syntactic variants of each other, since they share a common word stem (e.g., drug, drugs, drugged)
- A term frequency table
  - Each entry frequency_table(i, j) = the number of occurrences of word t_j in document d_i
  - Usually, the ratio instead of the absolute number of occurrences is used
- Similarity metrics: measure the closeness of a document to a query (a set of keywords)
  - Relative term occurrences
  - Cosine distance:

    sim(v1, v2) = (v1 · v2) / (|v1| |v2|)
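A small Python sketch of the cosine measure over aligned term-weight vectors:

```python
from math import sqrt

# Cosine similarity between two term-weight vectors, matching the formula
# above. Vectors are plain lists aligned on the same term order.
def cosine(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = sqrt(sum(a * a for a in v1)) * sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0

print(cosine([2, 1, 0], [1, 1, 1]))  # ~0.775
```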
Feature Extraction: Task (1)

Task: extract a good subset of words/phrases to represent documents.

[Figure: document collection → all unique words/phrases → feature extraction → all good words/phrases]
Feature Extraction: Task (2)

While more and more textual information is available online, effective retrieval is difficult without good indexing of text content.

[Figure: text indexing tools reduce the sentence above to the index terms text-information-online-retrieval-index, which are the input to feature extraction.]
Feature Extraction: Indexing

[Figure: indexing pipeline over the training documents]
- Identification of all unique words
- Removal of stop words
  - Non-informative words, e.g., {the, and, when, more}
- Word stemming
  - Removal of suffixes to generate word stems, grouping words and increasing the relevance, e.g., {walker, walking} → walk
- Term weighting
  - Naive terms; importance of a term in a document
Feature Extraction: Weighting Model (tf)

tf (term frequency) weighting: w_ij = Freq_ij

Freq_ij := the number of times the jth term occurs in document D_i.

Drawback: does not reflect a term's importance for document discrimination.

Example: D1 = ABRTSAQWAXAO, D2 = RTABBAXAQSAK

     A  B  K  O  Q  R  S  T  W  X
D1   3  1  0  1  1  1  1  1  1  1
D2   3  2  1  0  1  1  1  1  0  1
Feature Extraction: Weighting Model (tf-idf)

tf-idf (inverse document frequency) weighting: w_ij = Freq_ij * log(N / DocFreq_j)

N := the number of documents in the training document collection.
DocFreq_j := the number of documents in which the jth term occurs.

Advantage: reflects a term's importance for document discrimination.
Assumption: terms with low DocFreq are better discriminators than terms with high DocFreq in the document collection.

Example (same D1, D2 as above):

     A  B  K    O    Q  R  S  T  W    X
D1   0  0  0    0.3  0  0  0  0  0.3  0
D2   0  0  0.3  0    0  0  0  0  0    0
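A short Python sketch that reproduces this tf-idf table from the raw strings; a base-10 log is assumed, since log10(2) ≈ 0.3 matches the values above:

```python
from math import log10

# TF-IDF exactly as defined above: w_ij = Freq_ij * log(N / DocFreq_j).
# Reproduces the slide's two-document example (D1, D2).
docs = {"D1": "ABRTSAQWAXAO", "D2": "RTABBAXAQSAK"}
N = len(docs)
terms = sorted(set("".join(docs.values())))

doc_freq = {t: sum(t in d for d in docs.values()) for t in terms}
for name, d in docs.items():
    weights = [round(d.count(t) * log10(N / doc_freq[t]), 1) for t in terms]
    print(name, dict(zip(terms, weights)))
# Terms occurring in both docs get IDF log(2/2) = 0; terms unique to one
# doc (K, O, W) get log(2/1) ~= 0.3, matching the table above.
```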
Indexing Techniques

Inverted index:
- Maintains two hash- or B+-tree-indexed tables:
  - document_table: a set of document records <doc_id, postings_list>
  - term_table: a set of term records <term, postings_list>
- Answers a query by finding all docs associated with one or a set of terms
- (+) easy to implement
- (-) does not handle synonymy and polysemy well, and postings lists can be too long (storage can be very large)

Signature file:
- Associates a signature with each document
- A signature is a representation of an ordered list of terms that describe the document
- Order is obtained by frequency analysis, stemming, and stop lists
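A minimal Python sketch of the term_table side of an inverted index, with plain dicts standing in for the hash/B+-tree structures mentioned above:

```python
from collections import defaultdict

# Inverted-index sketch: term_table maps each term to its postings list
# (the IDs of the documents containing it). Documents are made up.
docs = {1: ["car", "repair"], 2: ["car", "dealer"], 3: ["plane"]}

term_table = defaultdict(list)
for doc_id, terms in docs.items():
    for term in set(terms):
        term_table[term].append(doc_id)

# Conjunctive query: docs containing both "car" and "repair"
hits = set(term_table["car"]) & set(term_table["repair"])
print(sorted(hits))  # [1]
```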
Latent Semantic Indexing

- Similar documents have similar word frequencies
- Difficulty: the size of the term-frequency matrix is very large
- Use a singular value decomposition (SVD) technique to reduce the size of the frequency table
- Retain the K most significant rows of the frequency table
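A small sketch of the SVD step using numpy on a made-up term-document matrix:

```python
import numpy as np

# LSI sketch: factor the term-document matrix with SVD and keep only the
# K largest singular values. The matrix here is a tiny toy example.
A = np.array([[2.0, 0.0, 1.0],   # rows = terms, columns = documents
              [1.0, 1.0, 0.0],
              [0.0, 2.0, 1.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
K = 2
A_k = U[:, :K] @ np.diag(s[:K]) @ Vt[:K, :]   # rank-K approximation
print(np.round(A_k, 2))
```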
Probabilistic Model

- Basic assumption: given a user query, there is a set of documents which contains exactly the relevant documents and no others (the ideal answer set)
- Querying is a process of specifying the properties of an ideal answer set. Since these properties are not known at query time, an initial guess is made
- This initial guess allows the generation of a preliminary probabilistic description of the ideal answer set, which is used to retrieve the first set of documents
- An interaction with the user is then initiated, with the purpose of improving the probabilistic description of the answer set
Dimension Reduction: DocFreq Thresholding

[Figure: starting from the naive terms of the training documents D:]
1. Calculate DocFreq(w) for each word w
2. Set a threshold θ
3. Remove all words with DocFreq(w) < θ
The remaining words are the feature terms.
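A short Python sketch of DocFreq thresholding on toy documents:

```python
# DocFreq thresholding sketch: drop every term whose document frequency
# falls below a threshold. Documents and threshold are illustrative.
docs = [{"text", "mining"}, {"text", "index"}, {"text", "retrieval"}]
threshold = 2

doc_freq = {}
for d in docs:
    for w in d:
        doc_freq[w] = doc_freq.get(w, 0) + 1

feature_terms = {w for w, df in doc_freq.items() if df >= threshold}
print(feature_terms)  # {'text'}
```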
Types of Text Data Mining

- Keyword-based association analysis
- Automatic document classification
- Similarity detection
  - Cluster documents by a common author
  - Cluster documents containing information from a common source
- Link analysis: unusual correlations between entities
- Sequence analysis: predicting a recurring event
- Anomaly detection: finding information that violates usual patterns
- Hypertext analysis
  - Patterns in anchors/links
  - Anchor text correlations with linked objects
Keyword-Based Association Analysis

- Motivation: collect sets of keywords or terms that occur frequently together, and then find the association or correlation relationships among them
- Association analysis process:
  - Preprocess the text data by parsing, stemming, removing stop words, etc.
  - Invoke association mining algorithms: consider each document as a transaction, and view the set of keywords in the document as the set of items in the transaction
- Term-level association mining
  - No need for human effort in tagging documents
  - The number of meaningless results and the execution time are greatly reduced
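A toy Python sketch of the document-as-transaction view, counting keyword pairs that co-occur in at least min_support documents (pairs only, for brevity; a full frequent-itemset algorithm generalizes this):

```python
from itertools import combinations
from collections import Counter

# Each document's keyword set is one transaction; count keyword pairs
# that co-occur in enough documents. Transactions are made up.
transactions = [
    {"data", "mining", "text"},
    {"data", "mining", "warehouse"},
    {"text", "mining", "retrieval"},
]
min_support = 2

pair_counts = Counter()
for t in transactions:
    for pair in combinations(sorted(t), 2):
        pair_counts[pair] += 1

print([p for p, c in pair_counts.items() if c >= min_support])
# [('data', 'mining'), ('mining', 'text')]
```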
Text Classification

- Automatic classification for large numbers of online text documents (Web pages, e-mails, intranets, etc.)
- Classification process:
  - Data preprocessing
  - Definition of training and test sets
  - Creation of the classification model using the selected classification algorithm
  - Classification model validation
  - Classification of new/unknown text documents
- Text document classification differs from the classification of relational data: document databases are not structured according to attribute-value pairs
Text Classification (2)

Classification algorithms:
- Support vector machines
- k-NN
- Naïve Bayes
- Neural networks
- Decision trees
- Association-rule-based
- Boosting
Text Classification: An Example

Training set (learn a model/classifier from the labeled text):

Ex#   Text                                                Hooligan?
1     An English football fan …                           Yes
2     During a game in Italy …                            Yes
3     England has been beating France …                   Yes
4     Italian football fans were cheering …               No
5     An average USA salesman earns 75K                   No
6     The game in London was horrific                     Yes
7     Manchester city is likely to win the championship   Yes
8     Rome is taking the lead in the football league      Yes

Test set (apply the learned classifier):

Text                                                Hooligan?
A Danish football fan …                             ?
Turkey is playing vs. France. The Turkish fans …    ?
Document Clustering

Motivation:
- Automatically group related documents based on their contents
- No predetermined training sets or taxonomies; generate a taxonomy at runtime

Clustering process:
- Data preprocessing: remove stop words, stem, feature extraction, lexical analysis, etc.
- Hierarchical clustering: compute similarities, applying clustering algorithms
- Model-based clustering (neural network approach): clusters are represented by "exemplars" (e.g., SOM)
Document Clustering: k-means

0. Input: D := {d1, d2, …, dn}; k := the number of clusters
1. Select k document vectors as the initial centroids of the k clusters
2. Repeat
3.   Select one vector d among the remaining documents
4.   Compute the similarities between d and the k centroids
5.   Put d in the closest cluster and recompute the centroid
6. Until the centroids don't change
7. Output: k clusters of documents

Hierarchical clustering algorithms can similarly be extended to the text case.
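A compact Python sketch of the k-means loop; note it is a batch variant of the steps above (a fixed iteration count stands in for the convergence test), and the document vectors are toy two-term examples rather than real TF-IDF vectors:

```python
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def kmeans(docs, k, iters=20):
    centroids = [list(d) for d in docs[:k]]     # step 1: initial centroids
    for _ in range(iters):                      # steps 2-6 (fixed iterations)
        clusters = [[] for _ in range(k)]
        for d in docs:                          # assign to the closest centroid
            best = max(range(k), key=lambda i: cosine(d, centroids[i]))
            clusters[best].append(d)
        for i, members in enumerate(clusters):  # recompute centroids
            if members:
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return clusters

docs = [(1, 0), (0.9, 0.1), (0, 1), (0.1, 0.8)]
print(kmeans(docs, 2))
# [[(1, 0), (0.9, 0.1)], [(0, 1), (0.1, 0.8)]]
```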
Text Categorization

- Pre-given categories and labeled document examples (categories may form a hierarchy)
- Classify new documents
- A standard classification (supervised learning) problem

[Figure: documents flow into a categorization system, which routes them to categories such as Sports, Business, Education, Science, …]

Applications:
- News article classification
- Automatic email filtering
- Webpage classification
- Word sense disambiguation
- …
Categorization: Architecture

[Figure: training documents and predefined categories feed preprocessing, term weighting, and feature selection to build a classifier; a new document d is then given to the classifier, which outputs the category (or categories) assigned to d.]
Categorization Classifiers

- Centroid-based classifier
- k-nearest neighbor classifier
- Naive Bayes classifier
Model: Centroid-Based Classifier

1. Input: new document d = (w1, w2, …, wn)
2. Predefined categories: C = {c1, c2, …, cl}
3. Compute the centroid vector of each category:

       c⃗_i = ( Σ_{d′ ∈ c_i} d′ ) / |c_i|,   c_i ∈ C

4. Similarity model: the cosine function

       Simil(d_i, d_j) = cos(d_i, d_j) = (d_i · d_j) / (‖d_i‖₂ × ‖d_j‖₂)
                       = Σ_l w_il w_jl / ( √(Σ_l w_il²) × √(Σ_l w_jl²) )

5. Compute the similarity Simil(c⃗_i, d) = cos(c⃗_i, d) for each category
6. Output: assign to document d the category c_max such that Simil(c⃗_i, d) ≤ Simil(c_max, d) for all c_i ∈ C
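A small Python sketch of steps 1-6 with made-up training vectors:

```python
from math import sqrt

# Centroid-based classifier sketch: average the training vectors of each
# category, then assign a new document to the category whose centroid is
# most cosine-similar. Training data is illustrative.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

training = {                      # category -> list of document vectors
    "sports": [(2, 0, 1), (3, 1, 0)],
    "business": [(0, 2, 2), (1, 3, 1)],
}
centroids = {
    c: [sum(col) / len(vs) for col in zip(*vs)] for c, vs in training.items()
}

d = (2, 1, 0)                     # new document
print(max(centroids, key=lambda c: cosine(centroids[c], d)))  # 'sports'
```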
Model: k-Nearest Neighbor Classifier

1. Input: new document d
2. Training collection: D = {d1, d2, …, dn}
3. Predefined categories: C = {c1, c2, …, cl}
4. Compute similarities: for each d_i ∈ D: Simil(d, d_i) = cos(d, d_i)
5. Select the k nearest neighbors: construct the k-document subset D_k so that
   Simil(d, d_i) < min( Simil(d, doc) | doc ∈ D_k ) for all d_i ∈ D − D_k
6. Compute a score for each category:
   for each c_i ∈ C: score(c_i) = Σ_{doc ∈ D_k} (doc ∈ c_i ? 1 : 0)
7. Output: assign to d the category c with the highest score
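A matching Python sketch of the k-NN procedure, reusing the same toy vectors:

```python
from collections import Counter
from math import sqrt

# k-NN sketch following steps 1-7 above: rank training documents by
# cosine similarity to d, take the k closest, and vote by category.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

training = [((2, 0, 1), "sports"), ((3, 1, 0), "sports"),
            ((0, 2, 2), "business"), ((1, 3, 1), "business")]

def knn(d, k=3):
    ranked = sorted(training, key=lambda tc: cosine(d, tc[0]), reverse=True)
    votes = Counter(cat for _, cat in ranked[:k])
    return votes.most_common(1)[0][0]

print(knn((2, 1, 0)))  # 'sports'
```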
Categorization Methods

Manual: typically rule-based
- Does not scale up (labor-intensive, rule inconsistency)
- May be appropriate for special data on a particular domain

Automatic: typically exploiting machine learning techniques
- Vector space model based
  - Prototype-based (Rocchio)
  - K-nearest neighbor (KNN)
  - Decision trees (learn rules)
  - Neural networks (learn non-linear classifiers)
  - Support vector machines (SVM)
- Probabilistic or generative model based
  - Naïve Bayes classifier
Vector Space Model

- Represent a document by a term vector
  - Term: basic concept, e.g., word or phrase
  - Each term defines one dimension
  - N terms define an N-dimensional space
  - Each element of the vector corresponds to a term weight
  - E.g., d = (x1, …, xN), where x_i is the "importance" of term i
- A new document is assigned to the most likely category based on vector similarity
VS Model: Illustration

[Figure: a 3-D term space with axes Java, Microsoft, and Starbucks; category regions C1 (Category 1) and C3 (Category 3) are shown, and a new document is placed by its position in the space.]
How to Assign Weights

Two-fold heuristics based on frequency:
- TF (term frequency): more frequent within a document → more relevant to its semantics (e.g., "query" vs. "commercial")
- IDF (inverse document frequency): less frequent among documents → more discriminative (e.g., "algebra" vs. "science")
TF Weighting

Weighting: more frequent → more relevant to the topic (e.g., "query" vs. "commercial")
- Raw TF = f(t, d): how many times term t appears in document d

Normalization: document length varies, so relative frequency is preferred (e.g., maximum frequency normalization)
IDF Weighting

Idea: less frequent among documents → more discriminative

Formula (consistent with the TF-IDF weighting defined earlier): IDF(t) = log(n / k), where
- n: total number of docs
- k: number of docs in which term t appears (the DF, document frequency)
TF-IDF Weighting

- TF-IDF weighting: weight(t, d) = TF(t, d) * IDF(t)
  - Frequent within a doc → high TF → high weight
  - Selective among docs → high IDF → high weight
- Recall the VS model:
  - Each selected term represents one dimension
  - Each doc is represented by a feature vector
  - The t-term coordinate of document d is the TF-IDF weight
- This is more reasonable, but just for illustration: many complex and more effective weighting variants exist in practice
How to Measure Similarity?

Given two documents, common similarity definitions are:
- Dot product
- Normalized dot product (cosine)
Illustrative Example

To whom is newdoc more similar? (IDF values are faked; cell format: raw TF, with the TF × IDF weight in parentheses)

             text     mining   travel   map      search   engine   govern   president   congress
IDF (faked)  2.4      4.5      2.8      3.3      2.1      5.4      2.2      3.2         4.3
doc1         2 (4.8)  1 (4.5)                    1 (2.1)  1 (5.4)
doc2         1 (2.4)           2 (5.6)  1 (3.3)
doc3                                                               1 (2.2)  1 (3.2)     1 (4.3)
newdoc       1 (2.4)  1 (4.5)

Sim(newdoc, doc1) = 4.8*2.4 + 4.5*4.5 = 31.77
Sim(newdoc, doc2) = 2.4*2.4 = 5.76
Sim(newdoc, doc3) = 0
Probabilistic Model

- A category C is modeled as a probability distribution over pre-defined random events
- Random events model the process of generating documents
- Therefore, how likely a document d belongs to category C is measured through the probability for category C to generate d
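A minimal multinomial Naïve Bayes sketch of this generative view, scoring a document by the log-probability of each category generating it; the data is made up, and add-one smoothing is just one common choice:

```python
from collections import Counter, defaultdict
from math import log

# Each category is a distribution over words; a document is assigned to
# the category most likely to have generated it. Toy training data.
training = [("win game team", "sports"), ("game score win", "sports"),
            ("market stock fund", "business")]

word_counts = defaultdict(Counter)
cat_counts = Counter()
for text, cat in training:
    cat_counts[cat] += 1
    word_counts[cat].update(text.split())

vocab = {w for c in word_counts.values() for w in c}

def log_prob(doc, cat):
    total = sum(word_counts[cat].values())
    lp = log(cat_counts[cat] / sum(cat_counts.values()))   # prior P(C)
    for w in doc.split():                                  # P(d | C), add-one smoothed
        lp += log((word_counts[cat][w] + 1) / (total + len(vocab)))
    return lp

doc = "win stock game"
print(max(cat_counts, key=lambda c: log_prob(doc, c)))  # 'sports'
```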
Evaluations

Effectiveness measures:
- Precision
- Recall
Evaluation (cont'd)

Benchmarks:
- Classic: the Reuters collection, a set of newswire stories classified under categories related to economics

Effectiveness:
- Difficulties of strict comparison: different parameter settings, different "splits" (or selections) between training and testing, various optimizations, …
- However, the results are widely recognizable:
  - Best: boosting-based committee classifiers and SVM
  - Worst: Naïve Bayes classifier
- Need to consider other factors, especially efficiency
Summary: Text Categorization

- Wide application domain
- Effectiveness comparable to professionals
  - Manual TC is not 100% accurate and is unlikely to improve substantially
  - Automatic TC is improving at a steady pace
- Prospects and extensions
  - Very noisy text, such as text from OCR
  - Speech transcripts