Topic Modeling

NATHAN MILLER

Uses

Document Summarization

Machine Translation

Named Entity Recognition

Natural Language Understanding/ Generation

Optical Character Recognition

Part-of-speech Tagging

Sentiment Analysis

Topic Segmentation

Text Mining Terms

Corpus – a defined grouping of similar documents
◦ All Fairy Tales

Document – user-defined body of text
◦ Cinderella

Term – a word in a document
◦ Slipper

Term: Slipper; Document: Cinderella; Corpus: All Fairy Tales

Pre-Processing

CLEANING, TOKENIZING, STEMMING, TDM/DTM

The quick brown fox jumps over the lazy dog; then, Foxy Cow jumped over the moon.

“Well,” said the wolf, “then I'll huff and I'll puff and I'll blow your straw house in.”

Hansel left a trail of crumbs behind him to mark the way.

Corpus: the three sentences above, taken together

Document: each sentence on its own



Tokenization: bag-o-words

• The • quick • brown • fox • jumps • over • the • lazy • dog; • then, • Foxy • Cow • jumped • over • the • moon
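
The bag-o-words split can be sketched in a few lines of Python. This is a minimal illustration, assuming whitespace-only splitting; the function name `tokenize` is hypothetical. Punctuation stays attached to tokens (for example `dog;`) until a later cleaning step removes it.

```python
from collections import Counter

sentence = ("The quick brown fox jumps over the lazy dog; "
            "then, Foxy Cow jumped over the moon.")

def tokenize(text):
    """Split on whitespace only; punctuation stays attached to tokens."""
    return text.split()

tokens = tokenize(sentence)

# A bag of words ignores word order and keeps only the counts.
bag = Counter(tokens)

print(tokens)
print(bag["over"])  # "over" occurs twice in the sentence
```

Note that `The` and `the` are counted separately here, which is one reason cleaning (lowercasing, stripping punctuation) usually comes before counting.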



Tokenization: n-grams

• The quick • quick brown • brown fox • fox jumps • jumps over • over the • the lazy • lazy dog; • dog; then, • then, Foxy • Foxy Cow • Cow jumped • jumped over • over the • the moon
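
An n-gram is a sliding window of n consecutive tokens. The sketch below is one simple way to build them; the helper name `ngrams` is an assumption for illustration, and the example uses only the first clause of the sentence.

```python
def ngrams(tokens, n=2):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "The quick brown fox jumps over the lazy dog".split()

bigrams = ngrams(tokens, 2)   # pairs such as ("The", "quick")
trigrams = ngrams(tokens, 3)  # triples such as ("The", "quick", "brown")

print(bigrams)
```

A sequence of m tokens yields m - n + 1 n-grams, so a window longer than the sequence yields an empty list.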



Stop Word Removal: feature selection

• quick • brown • fox • jumps • lazy • dog • Foxy • Cow • jumped • moon
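
A minimal sketch of stop word removal follows. The stop list `STOP_WORDS` is a hypothetical, hand-picked set that covers only this sentence; real stop lists are much longer. Punctuation is stripped from each token before the lookup.

```python
import string

# Hypothetical stop list chosen for this one sentence (an assumption).
STOP_WORDS = {"the", "over", "then"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Strip surrounding punctuation, then drop tokens found in the stop list."""
    kept = []
    for tok in tokens:
        word = tok.strip(string.punctuation)
        if word and word.lower() not in stop_words:
            kept.append(word)
    return kept

sentence = ("The quick brown fox jumps over the lazy dog; "
            "then, Foxy Cow jumped over the moon.")
features = remove_stop_words(sentence.split())

print(features)
```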

Stop Word Removal

TF-IDF: Term Frequency-Inverse Document Frequency

One method of determining stop words in a corpus

tf(t,d): raw frequency; the frequency of a term t in a document d.

Document Frequency: the number of docs d (within the corpus D) in which a term t appears.

N = |D|: the number of docs d within a corpus D.

Inverse Document Frequency: $\mathrm{idf}(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|}$

$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)$
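
The quantities above combine as tf(t,d) × log(N / document frequency), which is the common unsmoothed definition and is assumed here. The sketch below is illustrative only: the function name `tf_idf` is hypothetical, and the three documents are shortened, lowercased stand-ins for the example sentences.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """corpus: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(corpus)                      # N = |D|
    df = Counter()                            # document frequency per term
    for doc in corpus:
        df.update(set(doc))                   # count each term once per document
    weights = []
    for doc in corpus:
        tf = Counter(doc)                     # raw term frequency
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Simplified stand-in documents (an assumption, not the slide's exact text).
corpus = [
    "the quick brown fox jumps over the lazy dog".split(),
    "said the wolf i will blow your straw house in".split(),
    "hansel left a trail of crumbs to mark the way".split(),
]
scores = tf_idf(corpus)

print(scores[0]["the"])  # appears in every document, so idf = log(1) = 0
print(scores[0]["fox"])  # appears in one document only, so it scores higher
```

A term that appears in every document gets a weight of zero, which is why a low TF-IDF score can flag stop word candidates.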



Stemming

• quick • brown • fox • jump • lazy • dog • Fox • Cow • jump • moon
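
Stemming can be illustrated with a toy suffix stripper. This is not a real stemming algorithm such as Porter's; the suffix list and the minimum stem length are assumptions made for this example. It reduces "jumps" and "jumped" to "jump" but leaves "Foxy" untouched, unlike the list above, which maps it to "Fox".

```python
# Toy rule set: an assumption for illustration, not a real stemmer.
SUFFIXES = ("ing", "ed", "s")

def toy_stem(word):
    """Strip the first matching suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        if (word.endswith(suffix)
                and not word.endswith("ss")
                and len(word) - len(suffix) >= 3):
            return word[: -len(suffix)]
    return word

words = ["quick", "brown", "fox", "jumps", "lazy", "dog",
         "Foxy", "Cow", "jumped", "moon"]
stems = [toy_stem(w) for w in words]

print(stems)
```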


DTM/TDM

Document-Term Matrix/ Term-Document Matrix

Term     Doc 1   Doc 2   …   Doc n
quick    1       0       …   0
brown    1       0       …   0
fox      2       0       …   1
jump     2       1       …   0
lazy     1       0       …   0
dog      1       0       …   0
cow      1       1       …   1
…        …       …       …   …
moon     1       0       …   0
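
A term-document matrix can be built by counting each term per document. The sketch below is one possible construction; `term_document_matrix` is a hypothetical helper, and the second document is invented so that its counts match the Doc 2 column of the table. Transposing the result gives the document-term matrix.

```python
from collections import Counter

def term_document_matrix(docs):
    """docs: list of token lists. Returns (sorted vocabulary, rows of counts)."""
    counts = [Counter(doc) for doc in docs]
    vocab = sorted(set().union(*docs))
    # One row per term, one column per document.
    matrix = [[c[term] for c in counts] for term in vocab]
    return vocab, matrix

docs = [
    # Doc 1: the stemmed, lowercased example sentence.
    ["quick", "brown", "fox", "jump", "lazy", "dog",
     "fox", "cow", "jump", "moon"],
    # Doc 2: hypothetical tokens chosen to match the table's second column.
    ["jump", "cow"],
]
vocab, tdm = term_document_matrix(docs)

# The document-term matrix is the transpose: one row per document.
dtm = [list(col) for col in zip(*tdm)]

for term, row in zip(vocab, tdm):
    print(term, row)
```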

Topic Modeling

K-MEANS, LATENT DIRICHLET ALLOCATION (LDA)

see also https://en.wikipedia.org/wiki/K-means_clustering

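K-Means

Minimize this function (Within-Cluster Sum of Squares):

$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$

$k$: number of clusters

$x$: an observation; a term

$\mu_i$: mean (centroid) of a cluster

$\sum_{x \in S_i}$: every word in a cluster

$\sum_{i=1}^{k}$: iterates through k clusters

$\sum_{x \in S_i} \lVert x - \mu_i \rVert^2$ = Within-Cluster Sum of Squares: SST within a particular cluster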

K-means Learning

1. Randomly pick k points (“means”)

2. Assign each observation to nearest “mean”

3. Calculate mean (centroid) of each cluster

4. Repeat steps 2 and 3 until convergence
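
The four steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions: the initial means are sampled from the data points, convergence means the assignments stop changing, and an empty cluster keeps its previous mean. The names `k_means` and `wcss` and the six sample points are hypothetical.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    means = rng.sample(points, k)                 # step 1: pick k starting means
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each observation to its nearest mean.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, means[j]))
            for p in points
        ]
        if new_assignment == assignment:          # step 4: stop once stable
            break
        assignment = new_assignment
        # Step 3: recompute each centroid from its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:                           # empty cluster keeps old mean
                means[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return means, assignment

def wcss(points, means, assignment):
    """Within-cluster sum of squares, the quantity k-means tries to minimize."""
    return sum(squared_distance(p, means[a]) for p, a in zip(points, assignment))

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
means, assignment = k_means(points, k=2)

print(assignment)
print(wcss(points, means, assignment))
```

K-means only finds a local minimum, so on less well-separated data the result can depend on the random starting means.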

K-means Interactive Example: http://www.naftaliharris.com/blog/visualizing-k-means-clustering/



SEE TOPIC MODELS BY DAVID BLEI AND JOHN LAFFERTY: https://www.cs.princeton.edu/~blei/papers/BleiLafferty2009.pdf

LDA

Latent Dirichlet Allocation

“Hidden” Topical Structure <= Posterior Distribution: P(“hidden” variables | observed words)

1. Uncover the hidden topical patterns (topics)

2. Annotate the documents according to those topics

3. Use the annotations to organize, summarize and/or search the texts

http://www.cs.princeton.edu/~blei/papers/Blei2012.pdf

LDA Learning

1. Iterate through all words in a document and randomly assign each word to a topic

2. Calculate A. P(topic|document) and B. P(word|topic)

3. Reassign each word to a topic using P(topic|document)*P(word|topic) = Probability that a topic “generated” a word
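
One common way to realize these steps is collapsed Gibbs sampling, sketched below. The slides do not name a specific sampler, so treat this as an assumption. The smoothing parameters `alpha` and `beta`, the iteration count, the function name `lda_gibbs`, and the four toy documents are all hypothetical choices for illustration.

```python
import random

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    n_vocab = len(vocab)

    # Step 1: randomly assign every word occurrence to a topic.
    z = [[rng.randrange(n_topics) for _ in doc] for doc in docs]

    # Count tables behind P(topic|document) and P(word|topic).
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]
    topic_total = [0] * n_topics
    for d, doc in enumerate(docs):
        for w, t in zip(doc, z[d]):
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove this word's current assignment from the counts.
                t = z[d][i]
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                # Steps 2 and 3: weight each topic by
                # P(topic|document) * P(word|topic), up to a constant factor.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + n_vocab * beta)
                    for k in range(n_topics)
                ]
                t = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1
    return z, doc_topic, topic_word

docs = [
    "fox dog fox cow".split(),
    "dog fox cow dog".split(),
    "slipper prince ball slipper".split(),
    "prince ball slipper prince".split(),
]
z, doc_topic, topic_word = lda_gibbs(docs, n_topics=2)

print(doc_topic)  # per-document topic counts after sampling
```

Because the sampler is random, the topic labels and exact counts depend on the seed; only the grouping of words that tend to co-occur is meaningful.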


Posterior Distribution


Text Mining

Document Summarization

Machine Translation

Named Entity Recognition

Natural Language Understanding/ Generation

Optical Character Recognition

Part-of-speech Tagging

Sentiment Analysis

Topic Segmentation
◦ K-means
◦ LDA

Further Study

David Blei’s Introduction to LDA: https://www.cs.princeton.edu/~blei/papers/BleiLafferty2009.pdf

RStudio’s Tutorial on Text Mining: https://rstudio-pubs-static.s3.amazonaws.com/31867_8236987cf0a8444e962ccd2aec46d9c3.html






Date post:	18-Feb-2017
Upload:	nathan-miller