Date post: | 18-Feb-2017 |
Upload: | nathan-miller |
Topic Modeling
NATHAN MILLER
Uses
Document Summarization
Machine Translation
Named Entity Recognition
Natural Language Understanding/Generation
Optical Character Recognition
Part-of-speech Tagging
Sentiment Analysis
Topic Segmentation
Text Mining Terms
Corpus – a defined grouping of similar documents
◦ All Fairy Tales
Document – user-defined body of text
◦ Cinderella
Term – a word in a document
◦ Slipper
Term: Slipper; Document: Cinderella; Corpus: All Fairy Tales
Pre-Processing
CLEANING
TOKENIZING
STEMMING
TDM/DTM
The quick brown fox jumps over the lazy dog; then, Foxy Cow jumped over the moon.
“Well,” said the wolf, “then I'll huff and I'll puff and I'll blow your straw house in.”
Hansel left a trail of crumbs behind him to mark the way.
Figure labels: Corpus, Document
Tokenization: bag-o-words
The quick brown fox jumps over the lazy dog; then, Foxy Cow jumped over the moon.
• The • quick • brown • fox • jumps
• over • the • lazy • dog; • then,
• Foxy • Cow • jumped • over • the
• moon
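The tokenization above can be sketched in a few lines of Python. This is a minimal illustration that assumes whitespace splitting only; the function name `tokenize` is made up for the example, and punctuation stays attached to its token (so the last token comes out as "moon." rather than "moon").

```python
from collections import Counter

sentence = ("The quick brown fox jumps over the lazy dog; "
            "then, Foxy Cow jumped over the moon.")

def tokenize(text):
    """Split on whitespace only; punctuation stays attached to its token."""
    return text.split()

tokens = tokenize(sentence)

# Bag-of-words: token -> count. Word order is discarded.
bag = Counter(tokens)

print(tokens)
print(bag["over"], bag["the"], bag["The"])
```

Because no cleaning has happened yet, "The" and "the" are counted as different tokens.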
Tokenization: n-grams (here n = 2)
The quick brown fox jumps over the lazy dog; then, Foxy Cow jumped over the moon.
• The quick • quick brown • brown fox • fox jumps • jumps over
• over the • the lazy • lazy dog • dog; then, • then, Foxy
• Foxy Cow • Cow jumped • jumped over • over the • the moon.
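A sliding window over the token list is one simple way to produce n-grams. The sketch below assumes whitespace tokenization; `ngrams` is a hypothetical helper name, not a library function.

```python
def ngrams(tokens, n=2):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = ("The quick brown fox jumps over the lazy dog; "
            "then, Foxy Cow jumped over the moon.")

bigrams = ngrams(sentence.split(), n=2)
trigrams = ngrams(sentence.split(), n=3)

print(bigrams)
print(len(bigrams), len(trigrams))
```

A sentence of 16 tokens yields 15 bigrams and 14 trigrams, since each window must fit entirely inside the token list.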
Stop Word Removal: feature selection
The quick brown fox jumps over the lazy dog; then, Foxy Cow jumped over the moon.
• quick • brown • fox • jumps • lazy
• dog • Foxy • Cow • jumped • moon
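Stop word removal is a filter against a list of uninformative words. In this sketch the stop list `STOP_WORDS` is an assumption, chosen just large enough to reproduce the result above; real stop lists are much longer.

```python
import string

# Assumed stop list, only big enough for this one sentence.
STOP_WORDS = {"the", "over", "then"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Strip surrounding punctuation, then drop tokens found in the stop list."""
    tokens = [t.strip(string.punctuation) for t in text.split()]
    return [t for t in tokens if t and t.lower() not in stop_words]

sentence = ("The quick brown fox jumps over the lazy dog; "
            "then, Foxy Cow jumped over the moon.")

kept = remove_stop_words(sentence)
print(kept)
```

The comparison is done in lowercase so that "The" and "the" are both removed, while the surviving tokens keep their original case.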
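Stop Word Removal
TF-IDF: Term Frequency-Inverse Document Frequency
One method of determining stop words in a corpus
Term Frequency, tf(t,d): raw frequency, the frequency of a term t in a document d.
Document Frequency, df(t,D): number of docs d (within the corpus D) in which a term t appears.
N = |D|: number of docs d within a corpus D.
Inverse Document Frequency, in its common form:
$$\mathrm{idf}(t,D) = \log\frac{N}{\mathrm{df}(t,D)}$$
TF-IDF:
$$\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t,D)$$
A term that appears in every document has idf = log(1) = 0, which marks it as a stop word candidate.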
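The definitions above can be tried on the three example sentences from the pre-processing slides. This is a sketch under stated assumptions: raw counts for tf, the unsmoothed log(N/df) form for idf, and a simple lowercase regex tokenizer; the function names are illustrative only, and `idf` assumes the term occurs in at least one document.

```python
import math
import re

raw_docs = [
    "The quick brown fox jumps over the lazy dog; then, Foxy Cow jumped over the moon.",
    "“Well,” said the wolf, “then I'll huff and I'll puff and I'll blow your straw house in.”",
    "Hansel left a trail of crumbs behind him to mark the way.",
]

# Assumed cleaning step: lowercase, keep runs of letters and apostrophes.
corpus = [re.findall(r"[a-z']+", doc.lower()) for doc in raw_docs]

def tf(term, doc_tokens):
    """Raw frequency of term in one document."""
    return doc_tokens.count(term)

def df(term, corpus):
    """Number of documents in which term appears."""
    return sum(1 for doc_tokens in corpus if term in doc_tokens)

def idf(term, corpus):
    """log(N / df); assumes the term occurs somewhere in the corpus."""
    return math.log(len(corpus) / df(term, corpus))

def tfidf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

for term in ("the", "then", "fox"):
    print(term, tf(term, corpus[0]), df(term, corpus), tfidf(term, corpus[0], corpus))
```

"the" occurs in all three documents, so its score is zero no matter how often it is repeated, while "fox" occurs in only one document and scores highest of the three.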
Stemming
The quick brown fox jumps over the lazy dog; then, Foxy Cow jumped over the moon.
• quick • brown • fox • jump • lazy
• dog • Fox • Cow • jump • moon
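To make the idea concrete, here is a deliberately tiny suffix stripper. It is a toy written for this one sentence, not the Porter algorithm or any library stemmer: the rules ("ed" and "s" are always stripped, "y" only when the remainder is itself a known token) are assumptions chosen to reproduce the list above.

```python
def toy_stem(token, vocabulary):
    """Strip a suffix using toy rules; `vocabulary` is a set of lowercase tokens."""
    lower = token.lower()
    # Rule 1 (assumed): drop "ed" or "s" if at least 3 characters remain.
    for suffix in ("ed", "s"):
        if lower.endswith(suffix) and len(lower) - len(suffix) >= 3:
            return token[:-len(suffix)]
    # Rule 2 (assumed): drop "y" only if the remainder is a token we have seen,
    # so "Foxy" -> "Fox" but "lazy" stays "lazy".
    if lower.endswith("y") and lower[:-1] in vocabulary:
        return token[:-1]
    return token

tokens = ["quick", "brown", "fox", "jumps", "lazy",
          "dog", "Foxy", "Cow", "jumped", "moon"]
vocabulary = {t.lower() for t in tokens}

stems = [toy_stem(t, vocabulary) for t in tokens]
print(stems)
```

After stemming, "jumps" and "jumped" collapse to the same term, which is what lets them share one row in a term-document matrix.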
Document-Term Matrix / Term-Document Matrix (DTM/TDM)
The quick brown fox jumps over the lazy dog; then, Foxy Cow jumped over the moon.
Term    Doc 1   Doc 2   …   Doc n
quick   1       0       …   0
brown   1       0       …   0
fox     2       0       …   1
jump    2       1       …   0
lazy    1       0       …   0
dog     1       0       …   0
cow     1       1       …   1
…       …       …       …   …
moon    1       0       …   0
Rows are terms and columns are documents (the TDM layout); the DTM is its transpose.
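A matrix like the one above can be built by counting terms per document. In this sketch "Doc 1" is the stemmed, lowercased example sentence; "Doc 2" and "Doc 3" are hypothetical stand-ins chosen only to match the visible columns, and `build_matrices` is an illustrative name.

```python
from collections import Counter

docs = {
    "Doc 1": ["quick", "brown", "fox", "jump", "lazy",
              "dog", "fox", "cow", "jump", "moon"],
    "Doc 2": ["jump", "cow"],   # hypothetical document
    "Doc 3": ["fox", "cow"],    # hypothetical document
}

def build_matrices(docs):
    """Return (vocabulary, term-document matrix, document-term matrix)."""
    vocab = sorted({term for tokens in docs.values() for term in tokens})
    counts = {name: Counter(tokens) for name, tokens in docs.items()}
    # TDM: one row per term, one column per document.
    tdm = {term: [counts[name][term] for name in docs] for term in vocab}
    # DTM: the transpose, one row per document, one column per term.
    dtm = {name: [counts[name][term] for term in vocab] for name in docs}
    return vocab, tdm, dtm

vocab, tdm, dtm = build_matrices(docs)

for term in vocab:
    print(f"{term:<6}", tdm[term])
```

Most entries are zero even in this small example, which is why such matrices are usually stored in a sparse format once the corpus grows.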
Topic Modeling
K-MEANS
LATENT DIRICHLET ALLOCATION (LDA)
see also https://en.wikipedia.org/wiki/K-means_clustering
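K-Means
Within-Cluster Sum of Squares
Minimize this function (notation as in the Wikipedia article linked above):
$$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$$
$k$: number of clusters
$x$: an observation; a term
$\mu_i$: mean (centroid) of a cluster $S_i$
$\sum_{x \in S_i}$: every word in a cluster
$\sum_{i=1}^{k}$: iterates through k clusters
$\sum_{x \in S_i} \lVert x - \mu_i \rVert^2$ = Within-Cluster Sum of Squares: SST within a particular cluster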
K-means Learning
1. Randomly pick k points (“means”)
2. Assign each observation to nearest “mean”
3. Calculate mean (centroid) of each cluster
4. Repeat steps 2 and 3 until convergence
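The four steps map directly onto a short loop. The sketch below is a plain-Python illustration on made-up 2-D points, assuming squared Euclidean distance and a fixed random seed; the function names and the data are hypothetical, and a real project would normally use a library implementation.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    means = rng.sample(points, k)                 # Step 1: pick k starting "means"
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                          # Step 2: assign to nearest "mean"
            nearest = min(range(k), key=lambda i: sq_dist(p, means[i]))
            clusters[nearest].append(p)
        # Step 3: recompute centroids (an empty cluster keeps its old mean).
        new_means = [centroid(c) if c else means[i] for i, c in enumerate(clusters)]
        if new_means == means:                    # Step 4: stop once nothing moves
            break
        means = new_means
    return means, clusters

def wcss(means, clusters):
    """Within-cluster sum of squares, the quantity k-means tries to minimize."""
    return sum(sq_dist(p, m) for m, c in zip(means, clusters) for p in c)

# Hypothetical data: two well-separated groups of three points each.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
means, clusters = kmeans(points, k=2)
print(means)
print(clusters)
print(wcss(means, clusters))
```

On data this cleanly separated the loop settles on the two obvious groups; on real term vectors the result can depend on the random starting points, so it is common to run several seeds and keep the lowest within-cluster sum of squares.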
K-means Interactive Example http://www.naftaliharris.com/blog/visualizing-k-means-clustering/
See Topic Models by David Blei and John Lafferty: https://www.cs.princeton.edu/~blei/papers/BleiLafferty2009.pdf
LDA
“Hidden” topical structure ⇐ P(“hidden” variables | observed words)
1. Uncover the hidden topical patterns (topics)
2. Annotate the documents according to those topics
3. Use the annotations to organize, summarize and/or search the texts
Posterior Distribution
Latent Dirichlet Allocation
http://www.cs.princeton.edu/~blei/papers/Blei2012.pdf
LDA Learning
1. Iterate through all words in a document and randomly assign each word to a topic
2. Calculate (A) P(topic|document) and (B) P(word|topic)
3. Reassign each word to a topic using P(topic|document) × P(word|topic), the probability that a topic “generated” the word
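The three steps can be sketched as a small sampling loop. This is an illustration only, not the procedure from the cited papers: it assumes the steps are repeated for a fixed number of passes, that each word's own assignment is removed from the counts before it is reassigned, and that small smoothing constants (`alpha`, `beta`, hypothetical values) keep every probability above zero. The documents are made up.

```python
import random
from collections import defaultdict

def lda_sketch(docs, k, iterations=50, alpha=0.1, beta=0.01, seed=0):
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    # Step 1: randomly assign each word occurrence to a topic.
    assignments = [[rng.randrange(k) for _ in doc] for doc in docs]

    doc_topic = [[0] * k for _ in docs]                # words in doc d given topic t
    topic_word = [defaultdict(int) for _ in range(k)]  # times word w is given topic t
    topic_total = [0] * k                              # all words given topic t
    for d, doc in enumerate(docs):
        for w, t in zip(doc, assignments[d]):
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Take this word's current assignment out of the counts.
                t = assignments[d][i]
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1

                # Step 2: (A) P(topic|document), up to a constant factor,
                # and (B) P(word|topic), both smoothed.
                # Step 3: sample a new topic in proportion to A * B.
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][w] + beta) / (topic_total[j] + beta * vocab_size)
                    for j in range(k)
                ]
                t = rng.choices(range(k), weights=weights)[0]

                assignments[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1

    return assignments, doc_topic, topic_word

# Hypothetical, already pre-processed documents.
docs = [
    ["fox", "dog", "fox", "cow"],
    ["moon", "cow", "moon", "star"],
    ["fox", "dog", "dog"],
    ["moon", "star", "star"],
]
assignments, doc_topic, topic_word = lda_sketch(docs, k=2)
for d, counts in enumerate(doc_topic):
    print(f"doc {d}: topic counts {counts}")
```

The per-document topic counts are the "annotations" from the LDA slide: they say how much of each document is attributed to each topic, and the per-topic word counts show which words characterize a topic.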
Text Mining
Document Summarization
Machine Translation
Named Entity Recognition
Natural Language Understanding/Generation
Optical Character Recognition
Part-of-speech Tagging
Sentiment Analysis
Topic Segmentation
◦ K-means
◦ LDA
Further Study
David Blei’s Introduction of LDA: https://www.cs.princeton.edu/~blei/papers/BleiLafferty2009.pdf
RStudio’s Tutorial on Text Mining: https://rstudio-pubs-static.s3.amazonaws.com/31867_8236987cf0a8444e962ccd2aec46d9c3.html