Page 1: Modeling Documents

6th June 2005, Research in Algorithms for the InterNet

Modeling Documents

Amruta Joshi

Department of Computer Science

Stanford University

Page 2: Modeling Documents


Outline

Topic Models: Topic Extraction, Author Information, Modeling Topics, Modeling Authors, Author-Topic Model, Inference

Integrating Topics and Syntax: Probabilistic Models, Composite Model, Inference

Page 3: Modeling Documents


Motivation

Identifying the content of a document and its latent structure

More specifically: given a collection of documents, we want to create a model that collects information about authors, topics, and syntactic constructs

Page 4: Modeling Documents


Topics & Authors

Why model topics? To observe topic trends, see how documents relate to one another, and tag abstracts

Why model authors' interests? To identify what an author writes about, identify authors with similar interests, attribute authorship, create reviewer lists, and find unusual work by an author

Page 5: Modeling Documents


Topic Extraction: Overview

Supervised learning techniques learn from a labeled document collection

But: unlabeled documents, rapidly changing fields (Yang 1998)

Example: "In floods, the banks of a river overflow" (topic: rivers)

Page 6: Modeling Documents


Topic Extraction: Overview

Dimensionality reduction: represent documents in a vector space of terms, then map to a low-dimensional space

Non-linear dimensionality reduction: WEBSOM (Lagus et al. 1999)

Linear projection: LSI (Berry, Dumais, O'Brien 1995)

Regions in the low-dimensional space represent topics

Page 7: Modeling Documents


Topic Extraction: Overview

Cluster documents on semantic content: typically, each cluster has just one topic

Aspect model: a topic is modeled as a distribution over words; documents are generated from multiple topics

Page 8: Modeling Documents


Author Information: Overview

As doth the lion in the Capitol, A man no mightier than thyself or me …

Analyzing text using stylometry: statistical analysis of literary style, frequency of word usage, etc.

Analyzing semantics: the content of the document

Page 9: Modeling Documents


Author Information: Overview

Graph-based models:

Build an interactive ReferralWeb using citations (Kautz, Selman, Shah 1997)

Build co-author graphs and use PageRank for analysis (White & Smith)

[Figure: graph over documents D1–D4 (citation/co-author links)]

Page 10: Modeling Documents


The Big Idea

Topic model: model topics as distributions over words

Author model: model authors as distributions over words

Author-topic model: a probabilistic model for both; model topics as distributions over words and authors as distributions over topics

Page 11: Modeling Documents


Bayesian Networks

nodes = random variables; edges = direct probabilistic influence

Topology captures independence: XRay conditionally independent of Pneumonia given Infiltrates

[Figure: Bayesian network with nodes Pneumonia, Tuberculosis, Lung Infiltrates, XRay, and Sputum Smear]

Slide credit: Lisa Getoor, UMD College Park

Page 12: Modeling Documents


Bayesian Networks

Associated with each node Xi there is a conditional probability distribution P(Xi | Pai): a distribution over Xi for each assignment to its parents

If the variables are discrete, P is usually multinomial; P can also be linear Gaussian, a mixture of Gaussians, …

[Figure: the same network, with an example conditional probability table P(I | P, T) giving the probability of Lung Infiltrates for each assignment of Pneumonia and Tuberculosis]

Slide credit: Lisa Getoor, UMD College Park
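As a minimal illustration (not from the slides; the probabilities are made up), a discrete CPD such as P(I | P, T) can be stored as a lookup table keyed by the parent assignment and sampled directly:

    import random

    # Illustrative CPD P(Infiltrates | Pneumonia, Tuberculosis): for each assignment
    # of the parents we store P(I = true). The numbers are invented for the example.
    cpd_infiltrates = {
        (True, True): 0.9,
        (True, False): 0.8,
        (False, True): 0.6,
        (False, False): 0.01,
    }

    def sample_infiltrates(pneumonia, tuberculosis):
        """Sample the Infiltrates node given an assignment to its parents."""
        return random.random() < cpd_infiltrates[(pneumonia, tuberculosis)]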

Page 13: Modeling Documents


BN Learning

BN models can be learned from empirical data: parameter estimation via numerical optimization, structure learning via combinatorial search.

[Figure: an inducer takes data and produces the network structure and its parameters]

Slide credit: Lisa Getoor, UMD College Park

Page 14: Modeling Documents


Generative Model: a probabilistic generative process produces documents from topics; statistical inference goes the other way, recovering topics from documents

Bayesian approach: use priors. Mixture weights ~ Dirichlet(α); mixture components ~ Dirichlet(β)
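A small sketch (using numpy; the sizes and hyperparameter values are assumptions made for illustration) of what these priors produce: the mixture weights are a per-document distribution over topics and the mixture components are per-topic distributions over the vocabulary, each drawn from a symmetric Dirichlet.

    import numpy as np

    rng = np.random.default_rng(0)
    n_topics, vocab_size, alpha, beta = 4, 10, 0.5, 0.1   # illustrative values

    # Mixture weights: a distribution over topics for one document, theta ~ Dirichlet(alpha)
    theta = rng.dirichlet(alpha * np.ones(n_topics))

    # Mixture components: a distribution over words for each topic, phi_t ~ Dirichlet(beta)
    phi = rng.dirichlet(beta * np.ones(vocab_size), size=n_topics)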

Page 15: Modeling Documents


Bayesian Network for modeling document generation

[Figure: Bayesian network for document generation, with topic nodes T1 … TT, a topic-assignment node Z, and word nodes w1, w2, … wV]

Page 16: Modeling Documents


Topic Model: Plate Notation

[Plate diagram: for each of D documents, a document-specific distribution over topics; for each of its Nd words, a topic z is drawn from that distribution and the word w is drawn from the chosen topic's distribution over words (T topics in total)]
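As a sketch, the generative process this plate diagram encodes looks roughly as follows (the toy vocabulary, number of topics, and hyperparameters are assumptions, not values from the talk):

    import numpy as np

    rng = np.random.default_rng(1)
    vocab = ["river", "stream", "bank", "money", "loan"]   # toy vocabulary (assumed)
    T, alpha, beta = 2, 0.5, 0.1

    phi = rng.dirichlet(beta * np.ones(len(vocab)), size=T)     # topic -> word distributions

    def generate_document(n_words):
        theta = rng.dirichlet(alpha * np.ones(T))               # document-specific distribution over topics
        words = []
        for _ in range(n_words):
            z = rng.choice(T, p=theta)                          # topic (z in the diagram)
            w = rng.choice(len(vocab), p=phi[z])                # word  (w in the diagram)
            words.append(vocab[w])
        return words

    print(generate_document(8))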

Page 17: Modeling Documents


Topic Model: Geometric Representation

Page 18: Modeling Documents


Modeling Authors with words

[Plate diagram: for each of D documents with author set ad, each of the Nd words picks an author x uniformly from ad and the word w is drawn from that author's distribution over words (A authors in total)]
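A corresponding sketch of the author model's word-generation step (the author-word distributions here are randomly initialized only to make the snippet runnable): each word picks one of the document's authors uniformly at random and is then drawn from that author's distribution over words.

    import numpy as np

    rng = np.random.default_rng(2)
    vocab_size, n_authors = 10, 4
    author_word_dist = rng.dirichlet(0.1 * np.ones(vocab_size), size=n_authors)

    def generate_word(doc_authors):
        x = rng.choice(doc_authors)                            # author x, uniform over a_d
        return rng.choice(vocab_size, p=author_word_dist[x])   # word w from that author's distribution

    word = generate_word(doc_authors=[0, 2])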

Page 19: Modeling Documents


Author-Topic Model

[Plate diagram: for each of D documents with author set ad (a uniform distribution over the document's authors), each of the Nd words picks an author x from ad, a topic z from that author's distribution over topics (A authors), and the word w from that topic's distribution over words (T topics)]
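And a sketch of the author-topic generative process, combining the two previous models (sizes and hyperparameters are again illustrative): each word picks an author uniformly from the document's author set, a topic from that author's distribution over topics, and finally a word from that topic's distribution over words.

    import numpy as np

    rng = np.random.default_rng(3)
    vocab_size, n_topics, n_authors = 10, 3, 4
    alpha, beta = 0.5, 0.1

    theta = rng.dirichlet(alpha * np.ones(n_topics), size=n_authors)  # author -> topic distributions
    phi = rng.dirichlet(beta * np.ones(vocab_size), size=n_topics)    # topic  -> word distributions

    def generate_document(doc_authors, n_words):
        tokens = []
        for _ in range(n_words):
            x = rng.choice(doc_authors)             # author chosen uniformly from a_d
            z = rng.choice(n_topics, p=theta[x])    # topic drawn from that author's mixture
            w = rng.choice(vocab_size, p=phi[z])    # word drawn from the topic
            tokens.append((x, z, w))
        return tokens

    tokens = generate_document(doc_authors=[0, 2], n_words=6)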

Page 20: Modeling Documents


Inference

Expectation Maximization: but poor results (local maxima)

Gibbs Sampling: start with an initial random assignment of the parameters; update each parameter using the current values of the others; converges after n iterations (burn-in time)
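To make the procedure concrete, here is a compact sketch of a collapsed Gibbs sampler for the plain topic model (an illustration of the steps above, not the authors' code; documents are assumed to be lists of integer word ids):

    import numpy as np

    def gibbs_lda(docs, vocab_size, n_topics, alpha=0.5, beta=0.1, n_iters=200, seed=0):
        """Collapsed Gibbs sampling for the topic model (illustrative sketch)."""
        rng = np.random.default_rng(seed)
        n_wt = np.zeros((vocab_size, n_topics))                 # word-topic counts
        n_dt = np.zeros((len(docs), n_topics))                  # document-topic counts
        z = [[int(rng.integers(n_topics)) for _ in doc] for doc in docs]  # random initial assignment

        for d, doc in enumerate(docs):                          # fill the count tables
            for i, w in enumerate(doc):
                n_wt[w, z[d][i]] += 1
                n_dt[d, z[d][i]] += 1

        for _ in range(n_iters):                                # iterations include burn-in
            for d, doc in enumerate(docs):
                for i, w in enumerate(doc):
                    t = z[d][i]
                    n_wt[w, t] -= 1                             # remove the current assignment
                    n_dt[d, t] -= 1
                    p = ((n_wt[w] + beta) / (n_wt.sum(axis=0) + vocab_size * beta)
                         * (n_dt[d] + alpha))
                    t = int(rng.choice(n_topics, p=p / p.sum()))
                    z[d][i] = t                                 # resample and restore the counts
                    n_wt[w, t] += 1
                    n_dt[d, t] += 1
        return z, n_wt, n_dt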

Page 21: Modeling Documents


Inference and Learning for Documents

The sampled quantity is the probability that the ith word token is assigned to topic j, keeping all other topic assignments unchanged. It depends on two counts: how many times word m is assigned to topic j, and how many times topic j has occurred in document d.
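Written out, this is the collapsed Gibbs update of Griffiths & Steyvers (2004), with V the vocabulary size and T the number of topics:

    P(z_i = j \mid z_{-i}, \mathbf{w}) \propto
        \frac{n^{(w_i)}_{-i,j} + \beta}{n^{(\cdot)}_{-i,j} + V\beta}
        \cdot
        \frac{n^{(d_i)}_{-i,j} + \alpha}{n^{(d_i)}_{-i,\cdot} + T\alpha}

where n^{(w_i)}_{-i,j} is the number of times word w_i is assigned to topic j and n^{(d_i)}_{-i,j} is the number of times topic j occurs in document d_i, both counted with the current token i excluded.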

Page 22: Modeling Documents


Matrix Factorization

Page 23: Modeling Documents


Topic Model: Inference

[Figure: toy term-document matrix over the words River, Stream, Bank, Money, Loan for 16 documents]

Can we recover the original topics and topic mixtures from this data?

Slide credit: Padhraic Smyth, UC Irvine

Page 24: Modeling Documents


Example of Gibbs Sampling

Assign word tokens randomly to topics (topic 1 and topic 2, shown as two colors in the original figure)

[Figure: the toy term-document matrix with the random topic assignments]

Slide credit: Padhraic Smyth, UC Irvine

Page 25: Modeling Documents


After 1 iteration: apply the sampling equation to each word token

[Figure: topic assignments after one Gibbs sweep over the toy data]

Slide credit: Padhraic Smyth, UC Irvine

Page 26: Modeling Documents


After 4 iterations

[Figure: topic assignments after four Gibbs sweeps over the toy data]

Slide credit: Padhraic Smyth, UC Irvine

Page 27: Modeling Documents


After 32 iterations:

topic 1: stream .40, bank .35, river .25
topic 2: bank .39, money .32, loan .29

[Figure: topic assignments after 32 Gibbs sweeps over the toy data]

Slide credit: Padhraic Smyth, UC Irvine

Page 28: Modeling Documents


Results

Tested on scientific papers:

NIPS dataset: V = 13,649 unique words, D = 1,740 documents, K = 2,037 authors, 100 topics, 2,301,375 tokens

CiteSeer dataset: V = 30,799 unique words, D = 162,489 documents, K = 85,465 authors, 300 topics, 11,685,514 tokens

Page 29: Modeling Documents


Evaluating Predictive Power

Perplexity indicates the ability to predict words in new, unseen documents; lower is better
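For reference, the usual definition (as used in these papers) is the exponentiated negative average per-word log-likelihood of held-out documents:

    \text{perplexity}(D_{\text{test}}) = \exp\!\left( - \frac{\sum_{d=1}^{D} \log p(\mathbf{w}_d)}{\sum_{d=1}^{D} N_d} \right)

where N_d is the number of words in held-out document d.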

Page 30: Modeling Documents


Results: Perplexity

Page 31: Modeling Documents


Recap

First: Author Model, Topic Model

Then: Author-Topic Model

Next: Integrating Topics & Syntax

Page 32: Modeling Documents


Integrating topics & syntax

Probabilistic models:

Short-range dependencies: syntactic constraints, represented as distinct syntactic classes (HMMs, probabilistic CFGs)

Long-range dependencies: semantic constraints, represented as a probabilistic distribution (Bayes model, topic model)

New idea: use both

Page 33: Modeling Documents


How to integrate these?

Mixture of models: each word exhibits either short- or long-range dependencies

Product of models: each word exhibits both short- and long-range dependencies

Composite model (asymmetric): all words exhibit short-range dependencies, while only a subset of words exhibits long-range dependencies

Page 34: Modeling Documents


The Composite Model (1)

Capturing asymmetry: replace one probability distribution over words with the semantic model. The syntactic model chooses when to emit a content word; the semantic model chooses which word to emit.

Method: the syntactic component is an HMM, the semantic component is a topic model

Page 35: Modeling Documents


Generating phrases

[Figure: phrase generation by the composite model. The semantic state emits words from topics such as {network, neural, output, networks}, {image, images, object, objects}, {kernel, support, svm, vector}; the syntactic classes emit function words such as {in, with, for, on} and {used, trained, obtained, described}. Sample generated phrases: "network used for images", "image obtained with kernel", "output described with objects", "neural network trained with svm"]

Page 36: Modeling Documents


The Composite Model (2): Graphical Model

[Figure: a chain of syntactic classes c1 … c4 (HMM transitions), topic assignments z1 … z4 drawn from the document's distribution over topics, and the observed words w1 … w4]

Page 37: Modeling Documents


The Composite Model (3)

θ(d) is the document's distribution over topics; transitions between classes ci-1 and ci follow the distribution π(ci-1).

A document is generated as follows. For each word wi in document d:

Draw zi from θ(d)

Draw ci from π(ci-1)

If ci = 1, draw wi from topic zi's distribution over words; otherwise draw wi from class ci's distribution over words
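A sketch of this generative process in code (the start class, sizes, and hyperparameters are assumptions made only to keep the example runnable; class 1 plays the role of the semantic class, as on the slide):

    import numpy as np

    rng = np.random.default_rng(4)
    V, T, C = 12, 3, 4                  # vocabulary size, topics, syntactic classes (assumed sizes)
    alpha, beta = 0.5, 0.1

    theta = rng.dirichlet(alpha * np.ones(T))               # theta(d): document's distribution over topics
    phi_topic = rng.dirichlet(beta * np.ones(V), size=T)    # topic -> word distributions
    phi_class = rng.dirichlet(beta * np.ones(V), size=C)    # class -> word distributions (entry 1 unused)
    pi = rng.dirichlet(np.ones(C), size=C)                  # pi(c): class transition distributions

    def generate(n_words, start_class=0):
        words, c = [], start_class
        for _ in range(n_words):
            z = rng.choice(T, p=theta)          # draw z_i from theta(d)
            c = rng.choice(C, p=pi[c])          # draw c_i from pi(c_{i-1})
            if c == 1:                          # semantic class: the topic model emits the word
                w = rng.choice(V, p=phi_topic[z])
            else:                               # otherwise the syntactic class emits the word
                w = rng.choice(V, p=phi_class[c])
            words.append(w)
        return words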

Page 38: Modeling Documents


Results

Tested on: the Brown corpus (tagged with word types) and the concatenated Brown & TASA corpus

HMM & topic model: 20 classes (a start/end marker class + 19 classes), T = 200 topics

Page 39: Modeling Documents


Results

Identifying syntactic classes & semantic topics: a clean separation is observed

Identifying function words & content words: "control" can be a plain verb (syntax) or a semantic word

Part-of-speech tagging: identifying the syntactic class

Document classification: Brown corpus, 500 docs => 15 groups; results similar to the plain topic model

Page 40: Modeling Documents


Extensions to Topic Model

Integrating link information (Cohn & Hofmann 2001)

Learning topic hierarchies

Integrating syntax & topics

Integrating authorship information with content (author-topic model)

Grade-of-membership models

Random sentence generation

Page 41: Modeling Documents


Conclusion

Identifying a document's content and its latent structure

Document content is modeled for:

Semantic associations: topic model

Authorship: author-topic model

Syntactic constructs: HMM

Page 42: Modeling Documents


Acknowledgements

Prof. Rajeev Motwani: advice and guidance regarding topic selection

T. K. Satish Kumar: help on probabilistic models

Page 43: Modeling Documents


Thank you!

Page 44: Modeling Documents


References (Primary)

Steyvers, M., Smyth, P., Rosen-Zvi, M., & Griffiths, T. (2004). Probabilistic Author-Topic Models for Information Discovery. The Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Seattle, Washington.

Steyvers, M. & Griffiths, T. Probabilistic topic models. (http://psiexp.ss.uci.edu/research/papers/SteyversGriffithsLSABookFormatted.pdf)

Rosen-Zvi, M., Griffiths T., Steyvers, M., & Smyth, P. (2004). The Author-Topic Model for Authors and Documents. In 20th Conference on Uncertainty in Artificial Intelligence. Banff, Canada

Griffiths, T.L., Steyvers, M., Blei, D.M., & Tenenbaum, J.B. (in press). Integrating Topics and Syntax. In: Advances in Neural Information Processing Systems, 17.

Griffiths, T., & Steyvers, M. (2004). Finding Scientific Topics. Proceedings of the National Academy of Sciences, 101 (suppl. 1), 5228-5235. 

