Page 1: Modeling Documents

6th June 2005, Research in Algorithms for the InterNet

Modeling Documents

Amruta Joshi

Department of Computer Science

Stanford University

Page 2: Modeling Documents


Outline

Topic Models: Topic Extraction, Author Information, Modeling Topics, Modeling Authors, Author-Topic Model, Inference

Integrating Topics and Syntax: Probabilistic Models, Composite Model, Inference

Page 3: Modeling Documents


Motivation

Identifying the content of a document and its latent structure

More specifically: given a collection of documents, we want to create a model that collects information about authors, topics, and syntactic constructs

Page 4: Modeling Documents


Topics & Authors

Why model topics? To observe topic trends, see how documents relate to one another, and tag abstracts

Why model authors' interests? To identify what an author writes about, identify authors with similar interests, attribute authorship, create reviewer lists, and find unusual work by an author

Page 5: Modeling Documents


Topic Extraction: Overview

Supervised learning techniques learn from a labeled document collection

But: unlabeled documents, rapidly changing fields (Yang 1998)

Example: "In floods, the banks of a river overflow" (topic: rivers)

Page 6: Modeling Documents


Topic Extraction: Overview

Dimensionality reduction: represent documents in a vector space of terms, then map to a low-dimensional space

Non-linear dimensionality reduction: WEBSOM (Lagus et al. 1999)

Linear projection: LSI (Berry, Dumais, O'Brien 1995)

Regions in the low-dimensional space represent topics

Page 7: Modeling Documents


Topic Extraction: Overview

Cluster documents on semantic content: typically, each cluster has just one topic

Aspect model: a topic is modeled as a distribution over words; documents are generated from multiple topics

Page 8: Modeling Documents


Author Information: Overview

As doth the lion in the Capitol, A man no mightier than thyself or me …

Analyzing text using stylometry: statistical analysis of literary style, frequency of word usage, etc.

Analyzing semantics: the content of the document

Page 9: Modeling Documents


Author Information: Overview

Graph-based models:

Build an interactive ReferralWeb using citations (Kautz, Selman, Shah 1997)

Build co-author graphs and use PageRank for analysis (White & Smith)

[Figure: graph over documents D1–D4 (citation/co-author links)]

Page 10: Modeling Documents


The Big Idea

Topic model: model topics as distributions over words

Author model: model authors as distributions over words

Author-topic model: a probabilistic model for both; model topics as distributions over words and authors as distributions over topics

Page 11: Modeling Documents


Bayesian Networks

nodes = random variables; edges = direct probabilistic influence

Topology captures independence: XRay conditionally independent of Pneumonia given Infiltrates

[Figure: Bayesian network with nodes Pneumonia, Tuberculosis, Lung Infiltrates, XRay, and Sputum Smear]

Slide credit: Lisa Getoor, UMD College Park

Page 12: Modeling Documents


Bayesian Networks

Associated with each node Xi there is a conditional probability distribution P(Xi | Pai): a distribution over Xi for each assignment to its parents

If the variables are discrete, P is usually multinomial; P can also be linear Gaussian, a mixture of Gaussians, …

[Figure: the same network, with an example conditional probability table P(I | P, T) giving the probability of Lung Infiltrates for each assignment of Pneumonia and Tuberculosis]

Slide credit: Lisa Getoor, UMD College Park
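As a minimal illustration (not from the slides; the probabilities are made up), a discrete CPD such as P(I | P, T) can be stored as a lookup table keyed by the parent assignment and sampled directly:

    import random

    # Illustrative CPD P(Infiltrates | Pneumonia, Tuberculosis): for each assignment
    # of the parents we store P(I = true). The numbers are invented for the example.
    cpd_infiltrates = {
        (True, True): 0.9,
        (True, False): 0.8,
        (False, True): 0.6,
        (False, False): 0.01,
    }

    def sample_infiltrates(pneumonia, tuberculosis):
        """Sample the Infiltrates node given an assignment to its parents."""
        return random.random() < cpd_infiltrates[(pneumonia, tuberculosis)]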

Page 13: Modeling Documents


BN Learning

BN models can be learned from empirical data: parameter estimation via numerical optimization, structure learning via combinatorial search.

[Figure: an inducer takes data and produces the network structure and its parameters]

Slide credit: Lisa Getoor, UMD College Park

Page 14: Modeling Documents


Generative Model: a probabilistic generative process produces documents from topics; statistical inference goes the other way, recovering topics from documents

Bayesian approach: use priors. Mixture weights ~ Dirichlet(α); mixture components ~ Dirichlet(β)
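A small sketch (using numpy; the sizes and hyperparameter values are assumptions made for illustration) of what these priors produce: the mixture weights are a per-document distribution over topics and the mixture components are per-topic distributions over the vocabulary, each drawn from a symmetric Dirichlet.

    import numpy as np

    rng = np.random.default_rng(0)
    n_topics, vocab_size, alpha, beta = 4, 10, 0.5, 0.1   # illustrative values

    # Mixture weights: a distribution over topics for one document, theta ~ Dirichlet(alpha)
    theta = rng.dirichlet(alpha * np.ones(n_topics))

    # Mixture components: a distribution over words for each topic, phi_t ~ Dirichlet(beta)
    phi = rng.dirichlet(beta * np.ones(vocab_size), size=n_topics)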

Page 15: Modeling Documents


Bayesian Network for modeling document generation

[Figure: Bayesian network for document generation, with topic nodes T1 … TT, a topic-assignment node Z, and word nodes w1, w2, … wV]

Page 16: Modeling Documents


Topic Model: Plate Notation

[Plate diagram: for each of D documents, a document-specific distribution over topics; for each of its Nd words, a topic z is drawn from that distribution and the word w is drawn from the chosen topic's distribution over words (T topics in total)]
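As a sketch, the generative process this plate diagram encodes looks roughly as follows (the toy vocabulary, number of topics, and hyperparameters are assumptions, not values from the talk):

    import numpy as np

    rng = np.random.default_rng(1)
    vocab = ["river", "stream", "bank", "money", "loan"]   # toy vocabulary (assumed)
    T, alpha, beta = 2, 0.5, 0.1

    phi = rng.dirichlet(beta * np.ones(len(vocab)), size=T)     # topic -> word distributions

    def generate_document(n_words):
        theta = rng.dirichlet(alpha * np.ones(T))               # document-specific distribution over topics
        words = []
        for _ in range(n_words):
            z = rng.choice(T, p=theta)                          # topic (z in the diagram)
            w = rng.choice(len(vocab), p=phi[z])                # word  (w in the diagram)
            words.append(vocab[w])
        return words

    print(generate_document(8))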

Page 17: Modeling Documents


Topic Model: Geometric Representation

Page 18: Modeling Documents


Modeling Authors with words

[Plate diagram: for each of D documents with author set ad, each of the Nd words picks an author x uniformly from ad and the word w is drawn from that author's distribution over words (A authors in total)]
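A corresponding sketch of the author model's word-generation step (the author-word distributions here are randomly initialized only to make the snippet runnable): each word picks one of the document's authors uniformly at random and is then drawn from that author's distribution over words.

    import numpy as np

    rng = np.random.default_rng(2)
    vocab_size, n_authors = 10, 4
    author_word_dist = rng.dirichlet(0.1 * np.ones(vocab_size), size=n_authors)

    def generate_word(doc_authors):
        x = rng.choice(doc_authors)                            # author x, uniform over a_d
        return rng.choice(vocab_size, p=author_word_dist[x])   # word w from that author's distribution

    word = generate_word(doc_authors=[0, 2])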

Page 19: Modeling Documents


Author-Topic Model

[Plate diagram: for each of D documents with author set ad (a uniform distribution over the document's authors), each of the Nd words picks an author x from ad, a topic z from that author's distribution over topics (A authors), and the word w from that topic's distribution over words (T topics)]
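And a sketch of the author-topic generative process, combining the two previous models (sizes and hyperparameters are again illustrative): each word picks an author uniformly from the document's author set, a topic from that author's distribution over topics, and finally a word from that topic's distribution over words.

    import numpy as np

    rng = np.random.default_rng(3)
    vocab_size, n_topics, n_authors = 10, 3, 4
    alpha, beta = 0.5, 0.1

    theta = rng.dirichlet(alpha * np.ones(n_topics), size=n_authors)  # author -> topic distributions
    phi = rng.dirichlet(beta * np.ones(vocab_size), size=n_topics)    # topic  -> word distributions

    def generate_document(doc_authors, n_words):
        tokens = []
        for _ in range(n_words):
            x = rng.choice(doc_authors)             # author chosen uniformly from a_d
            z = rng.choice(n_topics, p=theta[x])    # topic drawn from that author's mixture
            w = rng.choice(vocab_size, p=phi[z])    # word drawn from the topic
            tokens.append((x, z, w))
        return tokens

    tokens = generate_document(doc_authors=[0, 2], n_words=6)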

Page 20: Modeling Documents


Inference

Expectation Maximization: but poor results (local maxima)

Gibbs Sampling: start with an initial random assignment of the parameters; update each parameter using the current values of the others; converges after n iterations (burn-in time)
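To make the procedure concrete, here is a compact sketch of a collapsed Gibbs sampler for the plain topic model (an illustration of the steps above, not the authors' code; documents are assumed to be lists of integer word ids):

    import numpy as np

    def gibbs_lda(docs, vocab_size, n_topics, alpha=0.5, beta=0.1, n_iters=200, seed=0):
        """Collapsed Gibbs sampling for the topic model (illustrative sketch)."""
        rng = np.random.default_rng(seed)
        n_wt = np.zeros((vocab_size, n_topics))                 # word-topic counts
        n_dt = np.zeros((len(docs), n_topics))                  # document-topic counts
        z = [[int(rng.integers(n_topics)) for _ in doc] for doc in docs]  # random initial assignment

        for d, doc in enumerate(docs):                          # fill the count tables
            for i, w in enumerate(doc):
                n_wt[w, z[d][i]] += 1
                n_dt[d, z[d][i]] += 1

        for _ in range(n_iters):                                # iterations include burn-in
            for d, doc in enumerate(docs):
                for i, w in enumerate(doc):
                    t = z[d][i]
                    n_wt[w, t] -= 1                             # remove the current assignment
                    n_dt[d, t] -= 1
                    p = ((n_wt[w] + beta) / (n_wt.sum(axis=0) + vocab_size * beta)
                         * (n_dt[d] + alpha))
                    t = int(rng.choice(n_topics, p=p / p.sum()))
                    z[d][i] = t                                 # resample and restore the counts
                    n_wt[w, t] += 1
                    n_dt[d, t] += 1
        return z, n_wt, n_dt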

Page 21: Modeling Documents


Inference and Learning for Documents

The sampled quantity is the probability that the ith word token is assigned to topic j, keeping all other topic assignments unchanged. It depends on two counts: how many times word m is assigned to topic j, and how many times topic j has occurred in document d.
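Written out, this is the collapsed Gibbs update of Griffiths & Steyvers (2004), with V the vocabulary size and T the number of topics:

    P(z_i = j \mid z_{-i}, \mathbf{w}) \propto
        \frac{n^{(w_i)}_{-i,j} + \beta}{n^{(\cdot)}_{-i,j} + V\beta}
        \cdot
        \frac{n^{(d_i)}_{-i,j} + \alpha}{n^{(d_i)}_{-i,\cdot} + T\alpha}

where n^{(w_i)}_{-i,j} is the number of times word w_i is assigned to topic j and n^{(d_i)}_{-i,j} is the number of times topic j occurs in document d_i, both counted with the current token i excluded.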

Page 22: Modeling Documents


Matrix Factorization

Page 23: Modeling Documents


Topic Model: Inference

[Figure: toy term-document matrix over the words River, Stream, Bank, Money, Loan for 16 documents]

Can we recover the original topics and topic mixtures from this data?

Slide credit: Padhraic Smyth, UC Irvine

Page 24: Modeling Documents


Example of Gibbs Sampling

Assign word tokens randomly to topics (topic 1 and topic 2, shown as two colors in the original figure)

[Figure: the toy term-document matrix with the random topic assignments]

Slide credit: Padhraic Smyth, UC Irvine

Page 25: Modeling Documents


After 1 iteration: apply the sampling equation to each word token

[Figure: topic assignments after one Gibbs sweep over the toy data]

Slide credit: Padhraic Smyth, UC Irvine

Page 26: Modeling Documents


After 4 iterations

[Figure: topic assignments after four Gibbs sweeps over the toy data]

Slide credit: Padhraic Smyth, UC Irvine

Page 27: Modeling Documents


After 32 iterations:

topic 1: stream .40, bank .35, river .25
topic 2: bank .39, money .32, loan .29

[Figure: topic assignments after 32 Gibbs sweeps over the toy data]

Slide credit: Padhraic Smyth, UC Irvine

Page 28: Modeling Documents


Results

Tested on scientific papers:

NIPS dataset: V = 13,649 unique words, D = 1,740 documents, K = 2,037 authors, 100 topics, 2,301,375 tokens

CiteSeer dataset: V = 30,799 unique words, D = 162,489 documents, K = 85,465 authors, 300 topics, 11,685,514 tokens

Page 29: Modeling Documents


Evaluating Predictive Power

Perplexity indicates the ability to predict words in new, unseen documents; lower is better
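For reference, the usual definition (as used in these papers) is the exponentiated negative average per-word log-likelihood of held-out documents:

    \text{perplexity}(D_{\text{test}}) = \exp\!\left( - \frac{\sum_{d=1}^{D} \log p(\mathbf{w}_d)}{\sum_{d=1}^{D} N_d} \right)

where N_d is the number of words in held-out document d.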

Page 30: Modeling Documents


Results: Perplexity

Page 31: Modeling Documents


Recap

First: Author Model, Topic Model

Then: Author-Topic Model

Next: Integrating Topics & Syntax

Page 32: Modeling Documents


Integrating topics & syntax

Probabilistic models:

Short-range dependencies: syntactic constraints, represented as distinct syntactic classes (HMMs, probabilistic CFGs)

Long-range dependencies: semantic constraints, represented as a probabilistic distribution (Bayes model, topic model)

New idea: use both

Page 33: Modeling Documents


How to integrate these?

Mixture of models: each word exhibits either short- or long-range dependencies

Product of models: each word exhibits both short- and long-range dependencies

Composite model (asymmetric): all words exhibit short-range dependencies, while only a subset of words exhibits long-range dependencies

Page 34: Modeling Documents


The Composite Model (1)

Capturing asymmetry: replace one probability distribution over words with the semantic model. The syntactic model chooses when to emit a content word; the semantic model chooses which word to emit.

Method: the syntactic component is an HMM, the semantic component is a topic model

Page 35: Modeling Documents


Generating phrases

[Figure: phrase generation by the composite model. The semantic state emits words from topics such as {network, neural, output, networks}, {image, images, object, objects}, {kernel, support, svm, vector}; the syntactic classes emit function words such as {in, with, for, on} and {used, trained, obtained, described}. Sample generated phrases: "network used for images", "image obtained with kernel", "output described with objects", "neural network trained with svm"]

Page 36: Modeling Documents


The Composite Model (2): Graphical Model

[Figure: a chain of syntactic classes c1 … c4 (HMM transitions), topic assignments z1 … z4 drawn from the document's distribution over topics, and the observed words w1 … w4]

Page 37: Modeling Documents


The Composite Model (3)

θ(d) is the document's distribution over topics; transitions between classes ci-1 and ci follow the distribution π(ci-1).

A document is generated as follows. For each word wi in document d:

Draw zi from θ(d)

Draw ci from π(ci-1)

If ci = 1, draw wi from topic zi's distribution over words; otherwise draw wi from class ci's distribution over words
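A sketch of this generative process in code (the start class, sizes, and hyperparameters are assumptions made only to keep the example runnable; class 1 plays the role of the semantic class, as on the slide):

    import numpy as np

    rng = np.random.default_rng(4)
    V, T, C = 12, 3, 4                  # vocabulary size, topics, syntactic classes (assumed sizes)
    alpha, beta = 0.5, 0.1

    theta = rng.dirichlet(alpha * np.ones(T))               # theta(d): document's distribution over topics
    phi_topic = rng.dirichlet(beta * np.ones(V), size=T)    # topic -> word distributions
    phi_class = rng.dirichlet(beta * np.ones(V), size=C)    # class -> word distributions (entry 1 unused)
    pi = rng.dirichlet(np.ones(C), size=C)                  # pi(c): class transition distributions

    def generate(n_words, start_class=0):
        words, c = [], start_class
        for _ in range(n_words):
            z = rng.choice(T, p=theta)          # draw z_i from theta(d)
            c = rng.choice(C, p=pi[c])          # draw c_i from pi(c_{i-1})
            if c == 1:                          # semantic class: the topic model emits the word
                w = rng.choice(V, p=phi_topic[z])
            else:                               # otherwise the syntactic class emits the word
                w = rng.choice(V, p=phi_class[c])
            words.append(w)
        return words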

Page 38: Modeling Documents


Results

Tested on: the Brown corpus (tagged with word types) and the concatenated Brown & TASA corpus

HMM & topic model: 20 classes (a start/end marker class + 19 classes), T = 200 topics

Page 39: Modeling Documents


Results

Identifying syntactic classes & semantic topics: a clean separation is observed

Identifying function words & content words: "control" can be a plain verb (syntax) or a semantic word

Part-of-speech tagging: identifying the syntactic class

Document classification: Brown corpus, 500 docs => 15 groups; results similar to the plain topic model

Page 40: Modeling Documents


Extensions to Topic Model

Integrating link information (Cohn & Hofmann 2001)

Learning topic hierarchies

Integrating syntax & topics

Integrating authorship information with content (author-topic model)

Grade-of-membership models

Random sentence generation

Page 41: Modeling Documents


Conclusion

Identifying a document's content and its latent structure

Document content is modeled for:

Semantic associations: topic model

Authorship: author-topic model

Syntactic constructs: HMM

Page 42: Modeling Documents


Acknowledgements

Prof. Rajeev Motwani: advice and guidance regarding topic selection

T. K. Satish Kumar: help on probabilistic models

Page 43: Modeling Documents


Thank you!

Page 44: Modeling Documents


References (Primary)

Steyvers, M., Smyth, P., Rosen-Zvi, M., & Griffiths, T. (2004). Probabilistic Author-Topic Models for Information Discovery. The Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Seattle, Washington.

Steyvers, M. & Griffiths, T. Probabilistic topic models. (http://psiexp.ss.uci.edu/research/papers/SteyversGriffithsLSABookFormatted.pdf)

Rosen-Zvi, M., Griffiths T., Steyvers, M., & Smyth, P. (2004). The Author-Topic Model for Authors and Documents. In 20th Conference on Uncertainty in Artificial Intelligence. Banff, Canada

Griffiths, T.L., Steyvers, M., Blei, D.M., & Tenenbaum, J.B. (in press). Integrating Topics and Syntax. In: Advances in Neural Information Processing Systems, 17.

Griffiths, T., & Steyvers, M. (2004). Finding Scientific Topics. Proceedings of the National Academy of Sciences, 101 (suppl. 1), 5228-5235. 

