Knowledge Discovery in Databases T14: text mining
P. Berka, 2018 1/20
Text mining
Text mining = data mining on unstructured textual documents.
Two possible approaches:
- data preprocessing + "standard" data mining algorithms
- special algorithms for text mining
Two types of tasks:
- information retrieval – the documents are considered as a whole (documents correspond to instances)
- information extraction – the content of the documents is analyzed
Document representation (preprocessing)
Free text is transformed into one row of a data table:
- lexical analysis (identify words)
- lemmatization (transform inflected words into their base form)
- removal of stop-words (words not related to the content of the document – typically connectives)
A row of the data table is a vector with as many components as there are terms (words) in the language (bag-of-words). Terms are encoded as:
- binary values – yes/no occurrence in the document,
- number of occurrences in the document,
@relation analcatdata-authorship
@attribute a INTEGER
@attribute all INTEGER
@attribute also INTEGER
@attribute an INTEGER
@attribute and INTEGER
@attribute any INTEGER
@attribute are INTEGER
@attribute as INTEGER
@attribute at INTEGER
@attribute be INTEGER
. . . . .
@attribute Author {Austen,London,Milton,Shakespeare}
@data
46,12,0,3,66,9,4,16,13,13,4,8,8,1,0,1,5,0,21,12,16,3,6,62,3,3,30,3,9,14,1,2,6,5,0,1
0,16,2,54,7,8,1,7,0,4,7,1,3,3,17,67,6,2,5,1,4,47,2,3,40,11,7,5,6,8,4,9,1,0,1,Austen
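The encoding above can be sketched in a few lines of Python; the tokenization and the small vocabulary in the example call are illustrative assumptions, not part of the ARFF data:

```python
import re
from collections import Counter

def bag_of_words(document, vocabulary):
    """Encode a free-text document as one row of the data table:
    a vector with one component per term, holding the number of
    occurrences (test `> 0` for the binary yes/no encoding)."""
    words = re.findall(r"[a-z]+", document.lower())  # crude lexical analysis
    counts = Counter(words)
    return [counts[term] for term in vocabulary]

vocab = ["a", "all", "also", "an", "and"]
row = bag_of_words("All for one and one for all, and an apple.", vocab)
# row -> [0, 2, 0, 1, 2]
```

A real preprocessing pipeline would also apply lemmatization and stop-word removal before counting.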
- TFIDF value (term frequency – inverse document frequency) – requires working with the whole collection of documents (corpus):

  TFIDF = n · log(M / m)

  n ... number of occurrences of the term in the document
  m ... number of documents in the collection containing the term
  M ... number of documents in the collection

Used to evaluate the similarity between documents based on the occurrence of the same terms.
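A minimal sketch of the formula above (the counts in the example call are invented for illustration, and the natural logarithm is assumed – a different base only rescales the weights):

```python
import math

def tfidf(n, m, M):
    """TFIDF = n * log(M / m): n occurrences of the term in the
    document, m documents of the collection containing the term,
    M documents in the collection."""
    return n * math.log(M / m)

# a term occurring 3 times in one document, present in 10 of 1000 documents
weight = tfidf(3, 10, 1000)  # 3 * log(100)
```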
Advantages:
- invariant w.r.t. the order of terms in the document
- does not require further preprocessing
Disadvantages:
- cannot express multiword phrases – can be solved by using n-grams instead of single terms,
  e.g. Mistr Jan Hus:
  bigrams: Mistr Jan, Jan Hus
  trigram: Mistr Jan Hus
- does not use the structure of documents – can be solved by weighting the terms
- very large dimensionality of the vectors (~ 10 000) – must be solved during preprocessing:
  attribute selection
  - wrapper approach = use "brute force"
  - filter approach = evaluate the relevance of terms
  attribute transformation
  - e.g. latent semantic indexing: representation of documents using a small number of concepts
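Latent semantic indexing can be sketched as a truncated SVD of the term-document matrix; the toy matrix below is an invented illustration, not data from the lecture:

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents.
# Terms 0-1 co-occur in documents 0-1, terms 2-3 in documents 2-3.
A = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # number of latent concepts kept
docs_k = (np.diag(s[:k]) @ Vt[:k]).T  # documents as k-dimensional concept vectors
# Documents with the same term pattern get identical concept coordinates,
# reducing the ~10 000-dimensional term space to k dimensions.
```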
word2vec – a newer approach to word representation, used to evaluate the similarity between words based on their appearance in documents (again requires working with the corpus – each word is represented by a vector).
It consists of 2 architectures:
- continuous bag-of-words (CBOW)
- skip-gram
A neural-network-based approach, where the inputs (and outputs) encode the co-occurrence of words and their contexts (neighboring words) in the corpus. CBOW is used to predict the probability of a word given its context, skip-gram to predict the probability of a context given a word.
Vectors "pre-trained" by Google are freely available (300-dimensional vectors for 3 million words and phrases).
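The training examples fed to the two architectures can be illustrated by sliding a window over the corpus; this sketch shows only the data-preparation step of skip-gram, not the network itself:

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center word, context word) training pairs as used by
    the skip-gram architecture; CBOW would instead group all context
    words of one position into a single (context -> center) example."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs("the cat sat".split(), window=1)
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
```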
Example: text preprocessing in SAS Text Miner – text parsing, text filtering, text topic (screenshots not shown).
Similarity of documents
For two documents x1 = {x11, x12, ..., x1m}, x2 = {x21, x22, ..., x2m}:

Cosine measure
  simC(x1, x2) = cos(x1, x2) = (x1 · x2) / (‖x1‖ ‖x2‖)

Symmetric overlap measure
  simS(x1, x2) = Σj min(x1j, x2j) / min(Σj x1j, Σj x2j)

Dice measure
  simD(x1, x2) = 2 ‖x1 ∩ x2‖ / (‖x1‖ + ‖x2‖) = 2 (x1 · x2) / (Σj x1j + Σj x2j)

Jaccard measure
  simJ(x1, x2) = ‖x1 ∩ x2‖ / ‖x1 ∪ x2‖ = (x1 · x2) / (Σj x1j + Σj x2j − x1 · x2)

where
  x1 · x2 = Σj=1..m x1j x2j
  ‖x‖ = √(x · x) = √(Σj=1..m xj²)
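The four measures can be written down directly; note that the Dice and Jaccard variants with plain sums match the set formulas only for binary vectors (the sample vectors are invented for illustration):

```python
import math

def cosine(x1, x2):
    dot = sum(a * b for a, b in zip(x1, x2))
    return dot / (math.sqrt(sum(a * a for a in x1)) * math.sqrt(sum(b * b for b in x2)))

def overlap(x1, x2):  # symmetric overlap
    return sum(min(a, b) for a, b in zip(x1, x2)) / min(sum(x1), sum(x2))

def dice(x1, x2):
    dot = sum(a * b for a, b in zip(x1, x2))
    return 2 * dot / (sum(x1) + sum(x2))

def jaccard(x1, x2):
    dot = sum(a * b for a, b in zip(x1, x2))
    return dot / (sum(x1) + sum(x2) - dot)

x1, x2 = [1, 1, 0, 1], [1, 0, 0, 1]  # binary bag-of-words vectors
# dice(x1, x2) -> 0.8, jaccard(x1, x2) -> 2/3
```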
A) Information retrieval tasks
The document is understood as a whole.
"Classic" information retrieval: find the documents that best fit a given query.
1. boolean model = query condition composed from terms using the logical connectives AND, OR and NOT
- does not allow taking into account the importance of terms in the document
- does not allow taking into account the importance of terms in the query
- offers only a rough scale (document fits / doesn't fit)
2. fuzzy extension = offers more values than TRUE and FALSE
E.g. for a query Q given using the weighted terms tj:vj and tk:vk, and a document D containing the same terms (with weights w) tj:wj and tk:wk, the relevance R(D,Q) of document D w.r.t. query Q is, when the query is the conjunction tj:vj AND tk:vk,
  R(D,Q) = min(vj·wj, vk·wk)
and when the query is the disjunction tj:vj OR tk:vk,
  R(D,Q) = max(vj·wj, vk·wk).
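A sketch of the fuzzy evaluation, with term weights stored in dictionaries; the query and document weights in the example are invented:

```python
def relevance(query, document, connective="AND"):
    """Fuzzy relevance of a document w.r.t. a weighted-term query:
    the minimum of the products v_j * w_j for a conjunction (AND),
    the maximum for a disjunction (OR). A term missing from the
    document gets weight 0."""
    products = [v * document.get(term, 0.0) for term, v in query.items()]
    return min(products) if connective == "AND" else max(products)

q = {"mining": 0.9, "text": 0.5}   # query term weights v
d = {"mining": 0.8, "text": 1.0}   # document term weights w
# relevance(q, d, "AND") -> min(0.72, 0.5) = 0.5
# relevance(q, d, "OR")  -> max(0.72, 0.5) = 0.72
```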
3. vector model = use the similarity measures mentioned above (both the query and the documents are vectors)
The results of retrieval are evaluated using precision and recall:

  Precision = TP / (TP + FP)        Recall = TP / (TP + FN)

Relation between precision and recall:
Narrow queries (AND) will result in a relatively small number of retrieved documents, most of them relevant; broad queries (OR) will result in a relatively large number of retrieved documents, mostly not relevant.
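With the retrieved and relevant documents represented as sets of IDs, both measures reduce to set operations (the example IDs are invented):

```python
def precision_recall(retrieved, relevant):
    """Precision = TP/(TP+FP), Recall = TP/(TP+FN), where TP is the
    number of retrieved documents that are actually relevant."""
    tp = len(retrieved & relevant)
    return tp / len(retrieved), tp / len(relevant)

retrieved = {1, 2, 3, 4}          # e.g. result of a narrow (AND) query
relevant = {2, 4, 5, 6, 7, 8}
p, r = precision_recall(retrieved, relevant)
# p = 2/4 = 0.5, r = 2/6 (precision and recall pull in opposite directions)
```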
Text mining on the level of documents:
- text categorization – classification of documents into several classes
- document clustering – similarity-based grouping of documents
- document filtering – classification of documents into 2 classes (interesting vs. uninteresting, spam vs. ham)
- duplication detection – looking for similar documents (detecting plagiarism)
  e.g. SAS document duplication detection
- sentiment analysis – classification of documents according to the emotions expressed by the author (usually 3 classes: positive, negative and neutral)
  e.g. SAS sentiment analysis
Systems and algorithms for information retrieval
- the SMART algorithm (System for Manipulating And Retrieving Text) – vector representation, TFIDF, cosine measure and symmetric overlap (Salton, 1971)
- naive Bayes classifier for document classification – based on the probabilities P(term_i in document | document from class c) (Lewis, 1991), (Mitchell, 1997), (Grobelnik, Mladenic, 1998)
- Kohonen neural network SOM – the geometric interpretation of the SOM is transformed into a conceptual interpretation; the closer two clusters are within the SOM, the closer the meaning of the corresponding documents
  WebSOM (Honkela, 1996), (Kohonen, 1998) – categorization of documents on the Internet
- genetic algorithms – documents represented using bit strings (chromosomes) encoding the occurrence (1) or non-occurrence (0) of a term; the fitness function corresponds to a similarity measure (e.g. Jaccard) between the document and the query, which is also represented as a bit string (Gordon, 1988)
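The naive Bayes classifier mentioned above can be sketched as follows; the Laplace smoothing and the toy spam/ham corpus are assumptions added for this illustration, not details from the cited papers:

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (token list, class). Collects class counts
    (for P(c)) and per-class term counts (for P(term | c))."""
    prior, words = Counter(), defaultdict(Counter)
    for tokens, c in docs:
        prior[c] += 1
        words[c].update(tokens)
    return prior, words

def classify(tokens, prior, words):
    """Return the class maximizing log P(c) + sum of log P(term | c),
    with Laplace (add-one) smoothing over the vocabulary."""
    vocab = {w for c in words for w in words[c]}
    def score(c):
        total = sum(words[c].values())
        s = math.log(prior[c] / sum(prior.values()))
        for w in tokens:
            s += math.log((words[c][w] + 1) / (total + len(vocab)))
        return s
    return max(prior, key=score)

docs = [("cheap pills buy".split(), "spam"),
        ("meeting notes attached".split(), "ham")]
prior, words = train(docs)
# classify("buy cheap".split(), prior, words) -> "spam"
```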
Example: SAS Text Miner (screenshots not shown).
B) Information extraction tasks
Analysis of unstructured text with the aim of finding a specific type of information.
1. text summarization:
e.g. SAS Text Summarization
Selects important sentences from the text – importance is given by user-defined concepts; the more concepts are identified in a sentence, the more important the sentence is. Concepts are defined using regular expressions and grammar rules.
Summarization options: whole document, paragraphs, sections
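The sentence-scoring idea can be sketched in a few lines; the concept patterns and the sample text are invented, and a real system like the one described would use far richer concept definitions:

```python
import re

def summarize(text, concepts, k=1):
    """Score each sentence by the number of concept patterns (regular
    expressions) it matches; return the k most important sentences in
    their original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    scored = [(sum(bool(re.search(p, s, re.I)) for p in concepts), i, s)
              for i, s in enumerate(sentences)]
    top = sorted(scored, reverse=True)[:k]
    return [s for _, i, s in sorted(top, key=lambda t: t[1])]

text = "Prices rose sharply. The cat slept. Revenue and profit grew."
summary = summarize(text, [r"price", r"revenue", r"profit"], k=1)
# -> ["Revenue and profit grew."]
```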
2. named entity recognition – identifying atomic elements like names of persons or organizations, place names, time information
e.g. (Labský, Svátek, 2007) within the project MedIEQ
3. template mining: identifying sequences of words (usually defined using regular expressions)
e.g. SAS Content Categorization:
- classification concept defined using a list of words or regular expressions
- grammar concept defined using linguistic rules
(screenshots of the definition and identification of a grammar concept not shown)
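A minimal template-mining sketch using a regular expression; the date template below is a simplified, hypothetical example of a classification-style concept, not one of the SAS concepts:

```python
import re

# A concept defined by a regular expression: dates of the form
# "6 July 1415" (a deliberately simplified template).
DATE = re.compile(r"\b\d{1,2} (?:January|February|March|April|May|June|July|"
                  r"August|September|October|November|December) \d{4}\b")

text = "Jan Hus was burned at the stake on 6 July 1415 in Constance."
spans = [m.group(0) for m in DATE.finditer(text)]
# spans -> ['6 July 1415']
```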
Identification of adjectives: precision and recall are 13/17 ≈ 0.76
4. finding associations between the occurrences of different phrases in a collection of documents: A ⇒ B, ... "if writing about A, then also writing about B"
- the system FACT (Finding Associations in Collections of Text) – news about political events (Feldman, Hirsh, 1997)
  {Iran, USA} ⇒ Reagan
- the system Document Explorer – analysis of business texts (Feldman et al., 1998)
  america online inc, bertelsmann ag ⇒ joint venture (13, 0.72)
Crucial for automated information extraction is a large body of domain knowledge: in the case of the system FACT, geopolitical knowledge and linguistic knowledge (synonyms of selected terms); in the case of the system Document Explorer, knowledge about companies and firms.
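A naive sketch of finding A ⇒ B associations between phrase occurrences; the support/confidence thresholds and the toy documents are invented, and the real systems additionally exploit the background knowledge described above:

```python
from itertools import combinations

def associations(docs, min_support=2, min_conf=0.6):
    """Find simple A => B rules between phrase occurrences:
    support = number of documents mentioning both A and B,
    confidence = support / number of documents mentioning A.
    docs is a list of sets of phrases."""
    rules = []
    phrases = {p for d in docs for p in d}
    for a, b in combinations(sorted(phrases), 2):
        for x, y in ((a, b), (b, a)):
            both = sum(1 for d in docs if x in d and y in d)
            n_x = sum(1 for d in docs if x in d)
            if both >= min_support and both / n_x >= min_conf:
                rules.append((x, y, both, both / n_x))
    return rules

docs = [{"Iran", "USA", "Reagan"}, {"Iran", "Reagan"}, {"USA"}]
# "Iran" => "Reagan" has support 2 and confidence 1.0
```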
Systems for text mining
- Intelligent Miner for Text (IBM) – http://www.software.ibm.com/
- Text Analyst (Megaputer Intelligence) – http://www.megaputer.com
- Text Miner (SAS Institute Inc.) – http://www.sas.com/technologies/analytics/datamining/textminer
After suitable text preprocessing (transforming documents into rows of a relational data table), we can also use "standard" KDD systems, e.g. Weka.