Simultaneous Joint and Conditional Modeling of
Documents Tagged from Two Perspectives
Pradipto Das, Rohini Srihari and Yun Fu
SUNY Buffalo
CIKM 2011, Glasgow, Scotland
Ubiquitous Bi-Perspective Document Structure
Wikipedia
– Words indicative of important Wiki concepts
– Actual human-generated Wiki category tags: words that summarize/categorize the document
StackOverflow
– Words indicative of questions
– Words indicative of answers
– Actual tags for the forum post (even frequencies are given!)
Yahoo! Flickr
– Words indicative of document title
– Words indicative of image description
– Actual tags given by users
Understanding the Two Perspectives
News Article
What if the documents are plain text files?
It is believed US investigators have asked for, but have been so far refused access to, evidence accumulated by German prosecutors probing allegations that former GM director, Mr. Lopez, stole industrial secrets from the US group and took them with him when he joined VW last year. This investigation was launched by US President Bill Clinton and is in principle a far more simple or at least more single-minded pursuit than that of Ms. Holland. Dorothea Holland, until four months ago was the only prosecuting lawyer on the German case.
Imagine browsing over reports in a topic cluster
The “document level” perspective
What words can we remember after a first browse?
German, US, investigations, GM, Dorothea Holland, Lopez, prosecute
Important Verbs and Dependents
Named Entities
What helped us generate the document-level perspective?
The “word level” perspective: ORGANIZATION, LOCATION, MISC, PERSON, WHAT HAPPENED?
The “document level” perspective: German, US, investigations, GM, Dorothea Holland, Lopez, prosecute
What if we turn the document off? Summarization power of the perspectives
German, US, investigations, GM, Dorothea Holland, Lopez, prosecute
Hypothesis
• Documents are tagged from at least two different perspectives, either implicit or explicit, and one perspective affects the other
– Simplest example of implicit WL tagging: binned positions indicating sections
– Simplest example of implicit DL tagging: a tag cloud
[Figure: the news article split at sentence boundaries into Begin (0), Middle (1) and End (2) bins, next to a tag cloud from tagcrowd.com]
The “word level” (WL) tags are usually some category descriptions
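The two simplest implicit taggings named in the hypothesis can be sketched in a few lines. This is only an illustration, not the authors' preprocessing: the three-bin split, the tiny stop-word list and the function names are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical stop-word list, kept tiny for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "was",
              "in", "on", "by", "it", "that"}

def tokenize(text):
    """Lower-case the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def wl_position_tags(sentences, n_bins=3):
    """Pair every token with the position bin of its sentence (implicit WL tag)."""
    tagged = []
    for i, sentence in enumerate(sentences):
        # Map sentence index i onto bins 0 .. n_bins - 1 (begin, middle, end).
        bin_id = min(i * n_bins // len(sentences), n_bins - 1)
        tagged.extend((token, bin_id) for token in tokenize(sentence))
    return tagged

def dl_tag_cloud(sentences, top_n=5):
    """Return the top_n most frequent non-stop-word tokens (implicit DL tags)."""
    counts = Counter(
        token
        for sentence in sentences
        for token in tokenize(sentence)
        if token not in STOP_WORDS
    )
    return [word for word, _ in counts.most_common(top_n)]

sentences = [
    "US investigators asked German prosecutors for evidence.",
    "The investigation was launched by the US President.",
    "Holland was the only prosecuting lawyer on the German case.",
]
word_tags = wl_position_tags(sentences)      # [("us", 0), ..., ("case", 2)]
doc_tags = dl_tag_cloud(sentences, top_n=2)  # the two most frequent content words
```

Every content word ends up with one WL tag (its position bin), while the DL tags describe the document as a whole, which is the bi-perspective structure the models below assume.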
How can the bi-level perspective be exploited?
• Can we generate category labels for Wikipedia documents by looking at image captions?
• Can we use images to label latent topics?
• Can we build a topic model that incorporates both perspectives simultaneously?
– Choice of document-level tags and its impact on performance
• Can supervised and unsupervised generative models work together?
Example – A Wikipedia Article on “fog”
Labels by human editors:
Categories: Weather hazards to aircraft | Accidents involving fog | Snow or ice weather phenomena | Fog | Psychrometrics
Labels by model from title and image captions:
Categories: fog, San Francisco, visible, high, temperature, streets, Bay, lake, California, bridge, air
[Figure: the article with its sections marked by position bins 0, 1, 2]
• Take the first category label, “Weather hazards to aircraft”
– “aircraft” doesn’t occur in the document body!
– “hazard” only appears in a section label, “Visibility hazards”
– “Weather” appears only 6 out of 15 times in the main body
• However, if we look at the images, it seems that the concept of fog is related to concepts like fog over the Golden Gate bridge, fog in streets, poor visibility and quality of air
The Family of Tag-Topic Models
• TagLDA: an occurrence of a word depends on how much of it is explained by a topic k and a WL tag t
Intuitively (LDA vs. TagLDA):
– LDA’s learnt “purple” topic can generate all 4 large balls with high probability
– TagLDA learns the “purple” topic better based on a constraint: it will generate a mix of large and small balls with high probability
[Figure: LDA vs. TagLDA on a training sample of large (L) and small (S) balls]
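One way to make "explained by a topic and a WL tag" concrete is a log-linear word distribution, p(w | k, t) ∝ exp(topic_score[k][w] + tag_score[t][w]). That form is an assumption here, chosen because it matches the "Multinomial + Exponential" naming used for this model family; the slides do not show the formula, and the scores and four-word vocabulary below are made up to echo the ball example.

```python
import math

def tag_topic_word_probs(topic_scores, tag_scores):
    """Softmax over the vocabulary of (topic score + tag score) per word."""
    combined = [t + g for t, g in zip(topic_scores, tag_scores)]
    shift = max(combined)  # subtract the max for numerical stability
    weights = [math.exp(c - shift) for c in combined]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical vocabulary and scores (assumptions, not learnt parameters).
vocab = ["large_purple", "small_purple", "large_green", "small_green"]
purple_topic = [2.0, 2.0, -2.0, -2.0]   # the topic prefers purple balls
large_tag = [1.0, -1.0, 1.0, -1.0]      # the WL tag "large" prefers large balls
small_tag = [-1.0, 1.0, -1.0, 1.0]      # the WL tag "small" prefers small balls

p_large = tag_topic_word_probs(purple_topic, large_tag)
p_small = tag_topic_word_probs(purple_topic, small_tag)
```

With the same "purple" topic, switching the WL tag moves probability mass from large to small purple balls, which is the kind of constraint the intuition above describes.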
Faceted Bi-Perspective Document Organization
Topics conditioned on different section identifiers (WL tag categories)
Topic Marginals
Topics over image captions
Correspondence of DL tag words with content words
Topic Labeling
The Family of Tag-Topic Models
TagLDA, MMLDA, CorrMMLDA
METag2LDA: combines TagLDA and MMLDA
CorrMETag2LDA: combines TagLDA and CorrMMLDA
MM = Multinomial + Multinomial; ME = Multinomial + Exponential
• METag2LDA: a topic generating all DL tags in a document doesn’t necessarily mean that the same topic generates all words in the document
• CorrMETag2LDA: a topic generating *all* DL tags in a document does mean that the same topic generates all words in the document, which is a considerable strong point
[Figure: plate diagrams of METag2LDA and CorrME-Tag2LDA. Legend: topic concentration parameter; document-specific topic proportions; document content words; document-level (DL) tags; word-level (WL) tags; indicator variables; topic parameters; tag parameters]
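The contrast between the two models can be illustrated at the level of topic assignments only. The sketch below is a simplification under stated assumptions: the correspondence mechanism is borrowed from CorrLDA-style models (one side reuses, uniformly at random, a topic already assigned on the other side), and the direction chosen here (content words reuse the DL tags' topics) is an assumption, not something the slides specify. Function names are hypothetical.

```python
import random

def sample_topic(theta, rng):
    """Draw one topic index from the document's topic proportions theta."""
    return rng.choices(range(len(theta)), weights=theta, k=1)[0]

def independent_assignments(theta, n_words, n_tags, rng):
    """MM-style: content words and DL tags each draw their own topics from theta."""
    word_topics = [sample_topic(theta, rng) for _ in range(n_words)]
    tag_topics = [sample_topic(theta, rng) for _ in range(n_tags)]
    return word_topics, tag_topics

def corresponding_assignments(theta, n_words, n_tags, rng):
    """Corr-style (assumed direction): words reuse topics already used by DL tags."""
    tag_topics = [sample_topic(theta, rng) for _ in range(n_tags)]
    word_topics = [rng.choice(tag_topics) for _ in range(n_words)]
    return word_topics, tag_topics

rng = random.Random(0)
theta = [0.5, 0.3, 0.2]
ind_words, ind_tags = independent_assignments(theta, 10, 3, rng)
corr_words, corr_tags = corresponding_assignments(theta, 10, 3, rng)
```

In the corresponding variant the word topics are always a subset of the tag topics, so a single topic behind all DL tags forces the same topic onto every word; the independent variant gives no such guarantee.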
Experiments
• Wikipedia articles with images and captions, manually collected along the concepts {food, animal, countries, sport, war, transportation, nature, weapon, universe and ethnic groups}
• Tags used:
– DL tags: image caption words and the article titles
– WL tags: positions of sections binned into 5 bins
• Objective: to generate category labels for test documents
• Evaluation:
– Perplexity: to compare performance among the various TagLDA models
– WordNet-based similarity evaluation between actual category labels and model output
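For reference, held-out perplexity is conventionally the exponentiated negative log-likelihood per token. The sketch below uses that standard definition; whether the slides' numbers use exactly this normalization is not stated, and for topic models the per-document log-likelihoods are usually approximations such as a variational lower bound.

```python
import math

def held_out_perplexity(doc_log_likelihoods, doc_lengths):
    """exp(-(sum of per-document log-likelihoods) / (total number of tokens))."""
    total_tokens = sum(doc_lengths)
    if total_tokens == 0:
        raise ValueError("held-out set contains no tokens")
    return math.exp(-sum(doc_log_likelihoods) / total_tokens)

# Sanity check: a uniform model over V words should have perplexity V.
V = 1000
lengths = [120, 80]
log_liks = [n * math.log(1.0 / V) for n in lengths]
ppl = held_out_perplexity(log_liks, lengths)
```

Lower values mean the model assigns higher probability to unseen text, which is why it is used to compare the models before the task-specific evaluation.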
Evaluations – Held-out Perplexity
Selected Wikipedia Articles
• WL tag categories – section positions in the document
• DL tags – image caption words and article titles
• TagLDA perplexity is comparable to MM(METag2)LDA
– The (image caption words + article titles) and the content words are independently discriminative enough
• CorrMM(METag2)LDA performs best, since almost all image caption words and the article title for a Wikipedia document are about a specific topic and the correspondence assumption is accepted by the model with much higher confidence
[Chart: held-out perplexity (axis from 100000 to 800000) at K=20, K=50, K=100 and K=200 for MMLDA, TagLDA, corrLDA, METag2LDA and corrMETag2LDA]
Evaluations – Application End-Goals
Inverse hop distance in the WordNet ontology
• Top 5 words from the caption vocabulary are chosen
• Max Weighted Average = 5, Max Best = 1
• METag2LDA almost always wins by narrow margins
• METag2LDA reweights the vocabulary of caption words and article titles that are about a topic and hence may miss specializations relevant to the document within the top 5
– In the WordNet ontology, specializations lead to larger hop distances
• Ontology-based scoring helps explain connections from caption words to ground truths, e.g. skateboard → skate → glide → snowboard
[Chart: inverse hop distance scores (axis from 0 to 2) at K=20, K=50, K=100 and K=200. Series: METag2LDA-AverageDistance, corrMETag2LDA-AverageDistance, METag2LDA-BestDistance, corrMETag2LDA-BestDistance]
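The ontology-based score can be sketched with a breadth-first search over a small graph. Everything concrete here is an assumption: the toy edges stand in for WordNet, and the scoring function 1 / (1 + hops) is one plausible reading of "inverse hop distance" that gives identical words a score of 1 (consistent with "Max Best = 1") and lets five predictions sum to at most 5.

```python
from collections import deque

# Hypothetical undirected toy ontology; a real evaluation would query WordNet.
TOY_EDGES = [
    ("skateboard", "skate"),
    ("skate", "glide"),
    ("glide", "snowboard"),
    ("glide", "travel"),
]

def build_graph(edges):
    """Build an undirected adjacency map from a list of edges."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def hop_distance(graph, source, target):
    """Breadth-first search; returns None when no path exists."""
    if source not in graph or target not in graph:
        return None
    queue = deque([(source, 0)])
    seen = {source}
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None

def inverse_hop_score(graph, predicted, truth):
    """Assumed scoring: 1 / (1 + hops), and 0 when the words are not connected."""
    hops = hop_distance(graph, predicted, truth)
    return 0.0 if hops is None else 1.0 / (1 + hops)

def score_predictions(graph, predictions, truth):
    """Return (best, summed) inverse-hop scores over the predicted words."""
    scores = [inverse_hop_score(graph, word, truth) for word in predictions]
    return max(scores), sum(scores)

graph = build_graph(TOY_EDGES)
best, total = score_predictions(
    graph, ["skateboard", "travel", "unknown"], "snowboard"
)
```

A predicted word that is a close relative of the ground-truth label still earns partial credit, which is how a caption word like "skateboard" can be connected to a label like "snowboard".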
Evaluations – Held-out Perplexity
DUC05 Newswire Dataset (Recent Experiments with TagLDA Included)
• WL tag categories – named entities
• DL tags – abstract coherence markers like (“subj” → “obj”), e.g. “Mary/Subj taught the class. Everybody liked Mary/Obj.” [coreference resolution ignored]
• Abstract markers like (“subj” → “obj”) acting as the DL perspective are not document-discriminative markers
– Rather, they indicate a semantic perspective of coherence which is intricately linked to words
• Topics are influenced both by non-sparse document-level coherence indicators like (“subj” → “obj”, “subj” → “--”, etc.) AND by document-level co-occurrence
• Ignoring the DL perspective completely leads to a better fit by TagLDA, due to variations in word distributions only
[Chart: held-out perplexity on DUC05 (axis from 1350000 to 1650000) at K=40, 60, 80 and 100 for MMLDA, METag2LDA, corrLDA and corrMETag2LDA]
[Chart: held-out perplexity on DUC05 (axis from 0 to 1800000) at K=40, 60, 80 and 100 for MMLDA, METag2LDA, corrLDA, corrMETag2LDA and TagLDA]
Evaluations – Application End-Goals
Person Named Entity coverage (DUC05 data)
• Two PERSON NEs in the same docset, i.e. manual topic set, are related (G in total)
• A_B, A, B are treated as separate PERSON NEs
• For each docset in the DUC05 data:
– Create a set of best topics for the docset and pull out the top PER NE pairs from the PER NE facets
– Find how many matched over all documents in the docset (M in total)
• Win over baseline = M/G (averaged over all docsets)
• CorrMETag2LDA wins here because of the nature of the DL perspective (role transitions like “Subj → Obj” coherence markers)
– More topics are pulled out that group more PER NEs across documents (higher recall)
[Chart: win over baseline (axis from 0 to 4) at K=40, 60, 80 and 100 for METag2LDA and CorrMETag2LDA. Data labels as extracted: 0.35, 0.63, 0.98, 0.91, 0.96, 1.88, 3.08, 3.66]
Model Usefulness and Applications
• Applications
– Document classification using reduced dimensions
– Find faceted topics automatically through word level tags
– Learn correspondences between perspectives
– Label topics through document level multimedia
– Create recommendations based on perspectives
– Video analysis: word prediction given video features
– Tying “multilingual comparable corpora” through topics
– Multi-document summarization using coherence
– E-Textbook aided discussion forum mining:
  • Explore topics through the lens of students and teachers
  • Label topics from posts through concepts in the e-textbook
Summary
• Flexible family of topic models that integrate a partitioned space of DL tags and words with WL tag categories
– Supervised models can collaborate with unsupervised generative models, i.e. the supervised models can be improved independently
• Captioned multimedia objects like images, video and audio can provide intuitive latent space labeling: a picture is worth a thousand words
• Obtain “facets” in topics
• As always, held-out perplexity should not be the sole judge of end-task performance
Thanks!
Special thanks to Jordan Boyd-Graber for useful discussions on TagLDA parameter regularizations