Topic Modeling

Date post: 18-Feb-2017
Upload: nathan-miller
Topic Modeling NATHAN MILLER
Transcript
Page 1: Topic Modeling

Topic Modeling
NATHAN MILLER

Page 2: Topic Modeling

Uses

Document Summarization

Machine Translation

Named Entity Recognition

Natural Language Understanding/ Generation

Optical Character Recognition

Part-of-speech Tagging

Sentiment Analysis

Topic Segmentation

Page 3: Topic Modeling

Text Mining Terms

Corpus – a defined grouping of similar documents
◦ All Fairy Tales

Document – user-defined body of text
◦ Cinderella

Term – a word in a document
◦ Slipper

Term: Slipper
Document: Cinderella
Corpus: All Fairy Tales

Page 4: Topic Modeling

Pre-Processing
CLEANING
TOKENIZING
STEMMING
TDM/DTM

Page 5: Topic Modeling

The quick brown fox jumps over the lazy dog; then, Foxy Cow jumped over the moon.

“Well,” said the wolf, “then I'll huff and I'll puff and I'll blow your straw house in.”

Hansel left a trail of crumbs behind him to mark the way.

Corpus

Document

Page 6: Topic Modeling

The quick brown fox jumps over the lazy dog; then, Foxy Cow jumped over the moon.

Tokenization (bag-o-words)

• The
• quick
• brown
• fox
• jumps
• over
• the
• lazy
• dog;
• then,
• Foxy
• Cow
• jumped
• over
• the
• moon
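The token list above is what a plain whitespace split produces (punctuation stays attached, as in "dog;" and "then,"). A minimal sketch, assuming whitespace tokenization and using `collections.Counter` as one possible bag-of-words container; the slide does not say which tool was used:

```python
from collections import Counter

sentence = ("The quick brown fox jumps over the lazy dog; "
            "then, Foxy Cow jumped over the moon.")

# Whitespace tokenization: punctuation stays attached to its word.
tokens = sentence.split()

# Bag of words: keep how often each token occurs, discard the order.
bag = Counter(tokens)

print(tokens)
print(bag.most_common(3))
```

Note that "The" and "the" are counted separately here; lowercasing during cleaning would merge them.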

Page 7: Topic Modeling

The quick brown fox jumps over the lazy dog; then, Foxy Cow jumped over the moon.

Tokenization (n-grams)

• The quick
• quick brown
• brown fox
• fox jumps
• jumps over
• over the
• the lazy
• lazy dog
• dog; then,
• then, Foxy
• Foxy Cow
• Cow jumped
• jumped over
• over the
• the moon.
• moon.
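The pairs above are bigrams (n = 2). A small sketch of how such n-grams can be generated from a token list; the helper name `ngrams` is illustrative, not something the slides define:

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as one space-joined string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = ("The quick brown fox jumps over the lazy dog; "
            "then, Foxy Cow jumped over the moon.")
tokens = sentence.split()

bigrams = ngrams(tokens, 2)
print(bigrams[:3])
```

A sentence of 16 tokens yields 15 bigrams; with n = 1 the function returns the original tokens.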

Page 8: Topic Modeling

The quick brown fox jumps over the lazy dog; then, Foxy Cow jumped over the moon.

Stop Word Removal (feature selection)

• quick
• brown
• fox
• jumps
• lazy
• dog
• Foxy
• Cow
• jumped
• moon
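A minimal sketch of the filtering step. The stop list below is an assumption: it contains only the words this slide drops ("the", "over", "then"), whereas a real stop list would be much longer. Punctuation is stripped first so that "dog;" and "moon." match the cleaned terms shown above.

```python
import string

# Assumed stop list: just the words removed on this slide.
STOP_WORDS = {"the", "over", "then"}

sentence = ("The quick brown fox jumps over the lazy dog; "
            "then, Foxy Cow jumped over the moon.")

# Cleaning: strip leading/trailing punctuation from each token.
tokens = [t.strip(string.punctuation) for t in sentence.split()]

# Stop word removal: compare case-insensitively, keep the original spelling.
kept = [t for t in tokens if t.lower() not in STOP_WORDS]
print(kept)
```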

Page 9: Topic Modeling

Stop Word Removal

TF-IDF: Term Frequency-Inverse Document Frequency

One method of determining stop words in a corpus

tf(t,d): raw frequency; the frequency of a term t in a document d.

Document Frequency: number of docs d (within the corpus D) in which a term t appears.

N = |D|: number of docs d within a corpus D.

Inverse Document Frequency: derived from N and the document frequency, so a term that appears in most documents scores low.
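A short sketch of the score on the three-sentence corpus from Page 5. It assumes the common definition idf = log(N / document frequency) and raw counts for tf; the slide's own formula did not survive transcription, so that choice and the helper names are assumptions.

```python
import math
import re
from collections import Counter

corpus = [
    "The quick brown fox jumps over the lazy dog; then, Foxy Cow jumped over the moon.",
    "“Well,” said the wolf, “then I'll huff and I'll puff and I'll blow your straw house in.”",
    "Hansel left a trail of crumbs behind him to mark the way.",
]

def tokenize(text):
    """Lowercase and keep runs of letters/apostrophes (drops punctuation)."""
    return re.findall(r"[a-z']+", text.lower())

docs = [Counter(tokenize(d)) for d in corpus]         # tf(t, d) as raw counts
N = len(docs)                                         # N = |D|
df = Counter(term for doc in docs for term in doc)    # document frequency

def tf_idf(term, doc_index):
    """tf(t, d) * log(N / df(t)), for a term that occurs somewhere in the corpus."""
    return docs[doc_index][term] * math.log(N / df[term])

print(tf_idf("the", 0), tf_idf("fox", 0))
```

"the" occurs in all three documents, so its idf is log(3/3) = 0 and its score is 0 everywhere, which is why a low TF-IDF can flag stop-word candidates.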

Page 10: Topic Modeling

The quick brown fox jumps over the lazy dog; then, Foxy Cow jumped over the moon.

Stemming

• quick
• brown
• fox
• jump
• lazy
• dog
• Fox
• Cow
• jump
• moon
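A toy illustration of suffix stripping. This is not the Porter algorithm or any library's stemmer; the rule set (strip "ing", "ed" or "s" if at least three letters remain) is an assumption chosen to show the idea. It reproduces jumps → jump and jumped → jump but leaves "Foxy" unchanged, so the slide's Foxy → Fox would need a richer stemmer.

```python
def toy_stem(word):
    """Strip one common inflectional suffix (toy rules, not Porter)."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when a stem of at least three letters is left.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["quick", "brown", "fox", "jumps", "lazy", "dog",
         "Foxy", "Cow", "jumped", "moon"]
stems = [toy_stem(w) for w in words]
print(stems)
```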

Page 11: Topic Modeling

The quick brown fox jumps over the lazy dog; then, Foxy Cow jumped over the moon.

DTM/TDM: Document-Term Matrix / Term-Document Matrix

Term    Doc 1   Doc 2   …   Doc n
quick   1       0       …   0
brown   1       0       …   0
fox     2       0       …   1
jump    2       1       …   0
lazy    1       0       …   0
dog     1       0       …   0
cow     1       1       …   1
…       …       …       …   …
moon    1       0       …   0
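A sketch of building both matrices from the three sentences on Page 5, using plain lists rather than any particular text-mining library. No stemming or stop word removal is applied here, so the counts will not match the illustrative table above (whose Doc 2 … Doc n columns are not given in the slides).

```python
import re
from collections import Counter

corpus = [
    "The quick brown fox jumps over the lazy dog; then, Foxy Cow jumped over the moon.",
    "“Well,” said the wolf, “then I'll huff and I'll puff and I'll blow your straw house in.”",
    "Hansel left a trail of crumbs behind him to mark the way.",
]

def tokenize(text):
    """Lowercase and keep runs of letters/apostrophes (drops punctuation)."""
    return re.findall(r"[a-z']+", text.lower())

counts = [Counter(tokenize(d)) for d in corpus]
vocabulary = sorted(set().union(*counts))

# Document-term matrix: one row per document, one column per term.
dtm = [[c[term] for term in vocabulary] for c in counts]

# Term-document matrix: the transpose, one row per term.
tdm = [list(column) for column in zip(*dtm)]

print(len(dtm), "documents x", len(vocabulary), "terms")
```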

Page 12: Topic Modeling

Topic Modeling
K-MEANS
LATENT DIRICHLET ALLOCATION (LDA)

Page 13: Topic Modeling

K-Means

see also https://en.wikipedia.org/wiki/K-means_clustering

Minimize this function: the Within-Cluster Sum of Squares (the SST within a particular cluster), which
◦ iterates through k clusters (k = number of clusters)
◦ covers every word in a cluster (an observation; a term)
◦ measures each one against the mean (centroid) of its cluster
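The formula itself did not survive transcription. Assuming the slide showed the standard objective from the linked Wikipedia article, it would read as follows, where the symbols S_i (the set of observations in cluster i), x (an observation) and μ_i (the centroid of cluster i) are that article's notation rather than anything recoverable from the slide:

```latex
\underset{S}{\arg\min} \; \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^{2}
```

The outer sum iterates through the k clusters, the inner sum runs over every observation in a cluster, and each squared distance is taken to that cluster's mean.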

Page 14: Topic Modeling

Within-Cluster Sum of Squares

Page 15: Topic Modeling

K-means Learning

1. Randomly pick k points (“means”)
2. Assign each observation to nearest “mean”
3. Calculate mean (centroid) of each cluster
4. Repeat steps 2 and 3 until convergence
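The four steps can be sketched in plain Python as below. The function names, the `max_iter` cap, the fixed `seed`, squared Euclidean distance, and the rule of keeping the old mean when a cluster ends up empty are all assumptions made for the sketch, not details from the slides.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    means = rng.sample(points, k)                  # step 1: pick k points
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                           # step 2: nearest "mean"
            nearest = min(range(k), key=lambda i: sq_dist(p, means[i]))
            clusters[nearest].append(p)
        # Step 3: recompute centroids (an empty cluster keeps its old mean).
        new_means = [centroid(c) if c else means[i]
                     for i, c in enumerate(clusters)]
        if new_means == means:                     # step 4: stop at convergence
            break
        means = new_means
    return means, clusters

def wcss(means, clusters):
    """Within-cluster sum of squares for a finished clustering."""
    return sum(sq_dist(p, m) for m, c in zip(means, clusters) for p in c)

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
means, clusters = kmeans(points, k=2)
print(means, wcss(means, clusters))
```

On data this well separated the result does not depend on the starting points; in general k-means only finds a local minimum, so several random restarts are commonly used.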

Page 17: Topic Modeling

LDA: Latent Dirichlet Allocation

“Hidden” Topical Structure <= (“hidden” variables|observed words)
Posterior Distribution

1. Uncover the hidden topical patterns (topics)
2. Annotate the documents according to those topics
3. Use the annotations to organize, summarize and/or search the texts

See Topic Models by David Blei and John Lafferty: https://www.cs.princeton.edu/~blei/papers/BleiLafferty2009.pdf

Page 18: Topic Modeling

http://www.cs.princeton.edu/~blei/papers/Blei2012.pdf

Page 19: Topic Modeling

LDA Learning

1. Iterate through all words in a document and randomly assign each word to a topic
2. Calculate
   A. P(topic|document) and
   B. P(word|topic)
3. Reassign each word to a topic using P(topic|document)*P(word|topic) = Probability that a topic “generated” a word
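The steps above resemble collapsed Gibbs sampling, and the sketch below follows that reading. Several things in it are assumptions not stated on the slide: the smoothing constants `alpha` and `beta`, the number of `sweeps`, removing the current word from the counts before rescoring it, and sampling the new topic in proportion to the product rather than always taking the largest value. The function name `lda_sketch` is illustrative.

```python
import random
from collections import Counter

def lda_sketch(docs, k, sweeps=200, alpha=0.1, beta=0.01, seed=0):
    """docs: list of token lists. Returns assignments and the two count tables."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    # Step 1: randomly assign every word occurrence to a topic.
    z = [[rng.randrange(k) for _ in doc] for doc in docs]
    doc_topic = [Counter(zd) for zd in z]        # counts behind P(topic|document)
    topic_word = [Counter() for _ in range(k)]   # counts behind P(word|topic)
    topic_total = [0] * k
    for doc, zd in zip(docs, z):
        for w, t in zip(doc, zd):
            topic_word[t][w] += 1
            topic_total[t] += 1
    for _ in range(sweeps):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Take the current word out of the counts before rescoring it.
                t = z[d][i]
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                # Step 2: smoothed P(topic|document) * P(word|topic), up to a
                # constant factor that is the same for every topic.
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][w] + beta)
                    / (topic_total[j] + beta * vocab_size)
                    for j in range(k)
                ]
                # Step 3: reassign the word, sampling in proportion to the weights.
                t = rng.choices(range(k), weights=weights)[0]
                z[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1
    return z, doc_topic, topic_word

docs = [
    ["fox", "dog", "fox", "dog", "wolf"],
    ["moon", "cow", "moon", "cow"],
    ["fox", "wolf", "dog"],
    ["cow", "moon", "moon"],
]
z, doc_topic, topic_word = lda_sketch(docs, k=2)
for doc, zd in zip(docs, z):
    print(list(zd), doc)
```

The topic numbers are arbitrary labels, and results vary with the seed; what the sampler tends to do is give words that co-occur in the same documents the same label.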

Page 20: Topic Modeling

See Topic Models by David Blei and John Lafferty: https://www.cs.princeton.edu/~blei/papers/BleiLafferty2009.pdf

Page 21: Topic Modeling

Posterior Distribution

See Topic Models by David Blei and John Lafferty: https://www.cs.princeton.edu/~blei/papers/BleiLafferty2009.pdf

Page 22: Topic Modeling
Page 23: Topic Modeling

Text Mining

Document Summarization

Machine Translation

Named Entity Recognition

Natural Language Understanding/ Generation

Optical Character Recognition

Part-of-speech Tagging

Sentiment Analysis

Topic Segmentation
◦ K-means
◦ LDA

