HawkesTopic: A Joint Model for Network Inference and Topic Modeling from Text-Based
Cascades
Xinran He¹, Theodoros Rekatsinas², James Foulds³, Lise Getoor³ and Yan Liu¹
¹University of Southern California, ²University of Maryland, College Park, ³University of California, Santa Cruz
07/08/2015
He et al. HawkesTopic ICML 2015
Introduction
• Diffusion is an important and fundamental phenomenon: viral marketing, detection of rumors, modeling news dynamics …
• Abundant text-based cascades in a variety of social platforms
[Figure: example cascade over nodes A to G with event times t=0, t=1, t=1.5, t=2, t=3.5]
Traditional vs Text-based Cascades
[Figure: the same cascade over nodes A to G (t=0, t=1, t=1.5, t=2, t=3.5), shown once as a traditional cascade and once as a text-based cascade]
Traditional cascades:
- Temporal information
Text-based cascades:
- Temporal information
- Content information
Incorporate content information => better model of diffusion
Incorporate temporal information => better model of documents
Network Inference
Network Inference focuses on inferring a hidden diffusion network from observed cascades
Related work:
- NetInf, NetRate [Gomez et al. 11,12], MMHP [Yang and Zha 13], KernelCascades [Du et al. 12]
- TopicCascades [Du et al. 13]
[Figure: an observed text-based cascade over nodes A to G (t=0 to t=3.5) and the inferred diffusion network with weighted edges]
Topic Modeling
Topic modeling aims to discover the latent thematic topics of a document corpus
Related work:
- LDA [Blei et al. 03], CTM [Blei and Lafferty 06]
- Citation Influence model [Dietz et al. 07], TIR model [Foulds et al. 13]
[Figure: the documents of a text-based cascade over nodes A to G treated as a corpus and decomposed into Topic 1, Topic 2, Topic 3]
Our Contribution
HawkesTopic: joint model for simultaneous Network Inference and Topic Modeling from text-based cascades
[Figure: a text-based cascade over nodes A to G (t=0 to t=3.5) feeding two outputs: Topic Modeling (Topic 1, Topic 2, Topic 3) and Network Inference (diffusion network with weighted edges)]
HawkesTopic: Intuition
[Figure: posting timelines of users v1 and v2, with a document attached to each event]
Mutually exciting nature: a posting event can trigger future events
Content cascades: the content of a document should be similar to that of the document that triggered its publication
Modeling Posting Times
Mutually exciting nature captured via Multivariate Hawkes Process (MHP) [Liniger 09].
For MHP, the intensity process of node $v$ takes the form:
$\lambda_v(t) = \mu_v + \sum_{e : t_e < t} A_{v_e, v} \, f_\Delta(t - t_e)$
Rate = Base intensity + Influence from previous events
- $A_{u,v}$: influence strength from $u$ to $v$
- $f_\Delta(\cdot)$: probability density function of the delay distribution
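A minimal Python sketch of the rate described on this slide (base intensity plus influence from previous events), assuming the form λ_v(t) = μ_v + Σ A[v_e][v] · f_Δ(t − t_e) over events before t. The exponential delay density, the function names, and the two-node example are illustrative assumptions, not taken from the slides.

```python
import math

def exp_delay_pdf(dt, rate=1.0):
    """Assumed delay density f_Delta: exponential, zero for non-positive delays."""
    return rate * math.exp(-rate * dt) if dt > 0 else 0.0

def intensity(v, t, events, mu, A, delay_pdf=exp_delay_pdf):
    """Rate of node v at time t: base intensity plus influence from previous events.

    events is a list of (time, node) pairs; only events strictly before t count.
    A[u][v] is the influence strength from node u to node v.
    """
    excitation = sum(A[u][v] * delay_pdf(t - t_e) for t_e, u in events if t_e < t)
    return mu[v] + excitation

# Hypothetical two-node example: node 0 influences node 1, not the reverse.
mu = [0.1, 0.05]
A = [[0.0, 0.5],
     [0.0, 0.0]]
events = [(0.0, 0)]  # node 0 posts at t = 0

rate_node1 = intensity(1, 1.0, events, mu, A)  # 0.05 + 0.5 * exp(-1)
rate_node0 = intensity(0, 1.0, events, mu, A)  # base intensity only
```

Node 1's rate is raised by node 0's earlier post, while node 0 stays at its base intensity because nothing influences it.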
Generating Posting Times
[Figure: events on the timelines of users v1 and v2, grouped by generation: Level 0, Level 1, Level 2]
Generate events and their posting times in breadth-first order by interpreting the MHP as a clustered Poisson process [Simma 10]
This provides an explicit parent relationship for modeling the evolution of the content information
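The breadth-first generation can be sketched in Python as follows. This is a sketch under stated assumptions, not the authors' code: spontaneous (level 0) events come from a homogeneous Poisson process with rate mu[v], each event on node u triggers a Poisson(A[u][v]) number of children on node v, and delays are exponential. The names `generate_cascade`, `poisson_count`, `horizon`, and `delay_rate` are hypothetical.

```python
import random
from collections import deque

def poisson_count(mean, rng):
    """Sample a Poisson(mean) count as the number of unit-rate arrivals in [0, mean]."""
    count, clock = 0, rng.expovariate(1.0)
    while clock < mean:
        count += 1
        clock += rng.expovariate(1.0)
    return count

def generate_cascade(mu, A, horizon, delay_rate=1.0, seed=0):
    """Return events (dicts with id, node, time, parent, level) in breadth-first order."""
    rng = random.Random(seed)
    events, queue = [], deque()

    def add(node, time, parent, level):
        event = {"id": len(events), "node": node, "time": time,
                 "parent": parent, "level": level}
        events.append(event)
        queue.append(event)

    # Level 0: spontaneous events, one homogeneous Poisson process per node.
    for v, base_rate in enumerate(mu):
        if base_rate <= 0:
            continue
        t = rng.expovariate(base_rate)
        while t < horizon:
            add(v, t, None, 0)
            t += rng.expovariate(base_rate)

    # Levels 1, 2, ...: every queued event triggers children on each node.
    while queue:
        parent = queue.popleft()
        for v, strength in enumerate(A[parent["node"]]):
            for _ in range(poisson_count(strength, rng)):
                t = parent["time"] + rng.expovariate(delay_rate)
                if t < horizon:
                    add(v, t, parent["id"], parent["level"] + 1)
    return events

# Hypothetical two-node network.
mu = [0.5, 0.2]
A = [[0.0, 0.6],
     [0.3, 0.0]]
events = generate_cascade(mu, A, horizon=10.0)
```

Because every event records its parent, the content of a triggered document can later be tied to the document that triggered it.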
Modeling Documents
[Figure: users v1 and v2 with topic priors α1 and α2, their documents along each timeline, and the topics β1:K (Topic 1, Topic 2, …)]
Step 1: Generate the topics $\beta_{1:K}$
Step 2: For spontaneous events (level = 0): $\eta_e \sim N(\alpha_v, \sigma^2 I)$
Step 3: For triggered events (level > 0): $\eta_e \sim N(\eta_{\mathrm{parent}[e]}, \sigma^2 I)$
Step 4: For each word in each document: $z_{e,n} \sim \mathrm{Discrete}(\pi(\eta_e))$, $x_{e,n} \sim \mathrm{Discrete}(\beta_{z_{e,n}})$
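Steps 1 to 4 can be sketched in Python as below. Two points are assumptions rather than something the slides state: topics are drawn from a symmetric Dirichlet prior, and π(·) is taken to be the softmax map, as in the Correlated Topic Model. All function names and parameter values are hypothetical.

```python
import math
import random

def softmax(eta):
    """Assumed form of pi(eta): map real-valued topic scores to a distribution."""
    shift = max(eta)
    weights = [math.exp(x - shift) for x in eta]
    total = sum(weights)
    return [w / total for w in weights]

def sample_topics(num_topics, vocab_size, rng, concentration=0.5):
    """Step 1 (assumed prior): each topic is a symmetric-Dirichlet draw over the vocabulary."""
    topics = []
    for _ in range(num_topics):
        draws = [rng.gammavariate(concentration, 1.0) for _ in range(vocab_size)]
        total = sum(draws)
        topics.append([d / total for d in draws])
    return topics

def sample_eta(mean, sigma, rng):
    """Steps 2 and 3: eta_e ~ N(mean, sigma^2 I); mean is alpha_v or the parent's eta."""
    return [rng.gauss(m, sigma) for m in mean]

def sample_words(eta, topics, num_words, rng):
    """Step 4: draw a topic z per word from pi(eta), then a word x from beta_z."""
    pi = softmax(eta)
    vocab = range(len(topics[0]))
    words = []
    for _ in range(num_words):
        z = rng.choices(range(len(pi)), weights=pi)[0]
        words.append(rng.choices(vocab, weights=topics[z])[0])
    return words

rng = random.Random(0)
topics = sample_topics(num_topics=3, vocab_size=8, rng=rng)
alpha_v = [2.0, 0.0, -2.0]                      # hypothetical user topic prior
eta_root = sample_eta(alpha_v, 0.5, rng)        # spontaneous event (level 0)
eta_child = sample_eta(eta_root, 0.5, rng)      # triggered event (level 1)
doc_root = sample_words(eta_root, topics, 20, rng)
doc_child = sample_words(eta_child, topics, 20, rng)
```

The child's topic vector is centered on its parent's, which is how content similarity propagates along the cascade.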
Inference
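Joint variational inference based on full mean-field approximation:
$Q(\boldsymbol{\eta}, \boldsymbol{z}, \boldsymbol{P}) = \prod_{e \in E} \Big[ q(\eta_e \mid \hat{\eta}_e) \, q(P_e \mid r_e) \prod_{n=1}^{N_e} q(z_{e,n} \mid \phi_{e,n}) \Big]$
-- Laplace approximation for the non-conjugate variable: $q(\eta_e \mid \hat{\eta}_e)$
-- Other variables: $q(P_e \mid r_e)$, $q(z_{e,n} \mid \phi_{e,n})$
Update for the parent assignment $r_e$:
$r_{e,e'} \propto N(\hat{\eta}_e \mid \hat{\eta}_{e'}, \hat{\sigma}^2 I) \times A_{v_{e'}, v_e} \times f_\Delta(t_e - t_{e'})$
-- $N(\hat{\eta}_e \mid \hat{\eta}_{e'}, \hat{\sigma}^2 I)$: similarity between document topics
-- $A_{v_{e'}, v_e}$: influence between users (Hawkes process)
-- $f_\Delta(t_e - t_{e'})$: proximity of events in time (Hawkes process)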
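The parent update multiplies three factors: topic similarity, influence between users, and proximity in time. A hedged Python sketch follows, assuming an isotropic Gaussian for the similarity term and an exponential delay density. It normalizes only over the candidate parent events passed in and leaves out the spontaneous (no parent) option, so it illustrates the product rather than the complete update. All names are hypothetical.

```python
import math

def gaussian_density(x, mean, sigma):
    """Isotropic Gaussian density N(x | mean, sigma^2 I)."""
    dim = len(x)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, mean))
    norm = (2.0 * math.pi * sigma ** 2) ** (-dim / 2.0)
    return norm * math.exp(-sq_dist / (2.0 * sigma ** 2))

def parent_responsibilities(event, candidates, A, sigma, delay_rate=1.0):
    """Score each candidate parent e' of event e and normalize over the candidates."""
    scores = []
    for cand in candidates:
        dt = event["time"] - cand["time"]
        if dt <= 0:
            scores.append(0.0)  # a parent must precede its child
            continue
        topic_similarity = gaussian_density(event["eta"], cand["eta"], sigma)
        influence = A[cand["node"]][event["node"]]
        time_proximity = delay_rate * math.exp(-delay_rate * dt)  # assumed f_Delta
        scores.append(topic_similarity * influence * time_proximity)
    total = sum(scores)
    return [s / total for s in scores] if total > 0 else scores

# Hypothetical example: two earlier events on node 0, one later event on node 1.
A = [[0.0, 0.5],
     [0.2, 0.0]]
event = {"node": 1, "time": 2.0, "eta": [1.0, 0.0]}
candidates = [
    {"node": 0, "time": 1.5, "eta": [0.9, 0.1]},   # recent and topically close
    {"node": 0, "time": 0.2, "eta": [-1.0, 1.0]},  # older and topically distant
]
r = parent_responsibilities(event, candidates, A, sigma=0.5)
```

The first candidate receives almost all of the responsibility because it is both closer in time and closer in topic space.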
Experiments: setting
Evaluation metrics:
-- Topic modeling: document completion likelihood [Wallach et al. 09]
-- Network Inference: AUC against the ground truth network
Datasets:
-- EventRegistry: “Ebola” news articles over ~4 months; ~9k articles from 330 news media sites; copying information as ground truth
-- ArXiv: high-energy physics theory papers over ~12 years; top 50/100/200 researchers; citation network as ground truth
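For the network inference metric, AUC can be read as the probability that a true edge receives a higher inferred weight than a non-edge. The Python sketch below computes it by direct pairwise comparison, which is quadratic and only meant for small networks. The edge scores and ground truth are made-up illustrative values, and `edge_auc` is a hypothetical helper name.

```python
def edge_auc(scores, truth):
    """AUC of inferred edge scores against a ground-truth edge set.

    scores maps candidate edges (u, v) to inferred weights; truth is the set of
    true edges. Ties between a true edge and a non-edge count as one half.
    """
    positives = [s for edge, s in scores.items() if edge in truth]
    negatives = [s for edge, s in scores.items() if edge not in truth]
    if not positives or not negatives:
        raise ValueError("need at least one true edge and one non-edge")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up inferred weights for four candidate edges.
scores = {("A", "B"): 0.6, ("A", "C"): 0.4, ("B", "D"): 0.1, ("C", "D"): 0.3}
truth = {("A", "B"), ("C", "D")}
auc = edge_auc(scores, truth)  # 3 of the 4 (true edge, non-edge) pairs are ordered correctly
```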
Experiments: algorithms
Algorithm | Description | Topic Modeling | Network Inference
HTM | Our method, with number of topics K=50 (K=100 for ArXiv with 200 authors) | yes | yes
LDA | Latent Dirichlet Allocation with collapsed Gibbs sampling | yes | no
CTM | Correlated Topic Model with variational inference | yes | no
Hawkes | Hawkes process considering only event posting times | no | yes
Hawkes-LDA | Two-step approach that first infers topics with LDA | no | yes
Hawkes-CTM | Two-step approach that first infers topics with CTM | no | yes
Result: EventRegistry
Network Inference accuracy (AUC): 10% improvement
 | Hawkes | Hawkes-LDA | Hawkes-CTM | HTM
Component 1 | 0.622 | 0.669 | 0.673 | 0.697
Component 2 | 0.670 | 0.704 | 0.716 | 0.730
Component 3 | 0.666 | 0.665 | 0.669 | 0.700
Topic modeling accuracy (document completion log-likelihood):
 | LDA | CTM | HTM
Component 1 | -42945 | -42458 | -42325
Component 2 | -22558 | -22181 | -22164
Component 3 | -17574 | -17574 | -17571
Result: ArXiv
Network Inference accuracy (AUC): 40% improvement
 | Hawkes | Hawkes-LDA | Hawkes-CTM | HTM
Top50 | 0.594 | 0.656 | 0.645 | 0.807
Top100 | 0.588 | 0.589 | 0.614 | 0.687
Top200 | 0.618 | 0.630 | 0.629 | 0.659
Topic modeling accuracy (document completion log-likelihood):
 | LDA | CTM | HTM
Top50 | -11074 | -10769 | -10708
Top100 | -15711 | -15477 | -15252
Top200 | -27758 | -27630 | -27443
Conclusion
HawkesTopic model unifies Correlated Topic Model and Hawkes process:
=> infers hidden diffusion network
=> discovers thematic topics of documents
Joint model of temporal information and content information in text-based cascades gets the best result
Experiments on ArXiv and EventRegistry datasets
=> EventRegistry: 10% improvement in AUC
=> ArXiv: 40% improvement in AUC
Questions? Thank You