Post on 29-Dec-2015
2009 © Qiaozhu Mei University of Illinois at Urbana-Champaign
Towards Contextual Text Mining
Qiaozhu Mei
qmei2@uiuc.edu
University of Illinois at Urbana-Champaign
Knowledge Discovery from Text
Text Mining System
Overload of Text Content
Content Type               Amount / day
Published Content          3-4G
Professional web content   ~ 2G
User generated content     8-10G
Private text content       ~ 3T
- Ramakrishnan and Tomkins 2007
Challenge of Mining Text
[Figure: text sources and their volumes: ~750k /day, ~3M /day, ~150k /day; 1M, 10B, 6M, ~100B]
Where to Start? Where to Go?
Gold?
Context - “Situation of Text”
Author
Time
Source
Author’s occupation
Language
Social Network
Check Lap Kok, HK
self designer, publisher, editor …
3:53 AM Jan 28th
From Ping.fm
Location
Sentiment
Rich Context Information
102M blogs
100M users, > 1M groups
8M contributors, 100+ languages
73 years, ~400k authors, ~4k sources
~1B queries (per hour?), ~1B users
~3M msgs /day, ~5M users
5M users, 500M URLs
Text + Context = ?
Context = Guidance
I Have A Guide!
Query Log + User = Personalized Search
MSR
Modern System Research
Medical simulation
Montessori School of Raleigh
Mountain Safety Research
MSR Racing
Wikipedia definitions
Metropolis Street Racer
Molten salt reactor
Mars sample return
Magnetic Stripe Reader
How much can personalization help?
If you know me, you should give me Microsoft Research…
Common Themes   IBM                 APPLE                DELL
Battery Life    Long, 4-3 hrs       Medium, 3-2 hrs      Short, 2-1 hrs
Hard disk       Large, 80-100 GB    Small, 5-10 GB       Medium, 20-50 GB
Speed           Slow, 100-200 MHz   Very Fast, 3-4 GHz   Moderate, 1-2 GHz
IBM Laptop Reviews
APPLE Laptop Reviews
DELL Laptop Reviews
Customer Reviews + Brand = Comparative Product Summary
Can we compare Products?
Hot Topics in SIGMOD
Scientific Literature + Time = Topic Trends
What’s hot in literature?
One Week Later
Blogs + Time & Location = Spatiotemporal Topic Diffusion
How does discussion spread?
Tom Hanks, who is my favorite movie star act the leading role.
protesting... will lose your faith by watching the movie.
a good book to past time.
... so sick of people making such a big deal about a fiction book
The Da Vinci Code
Blogs + Sentiment = Faceted Opinion Summary
What is good and what is bad?
Information retrieval
Machine learning Data mining
Coauthor Network
Publications + Social Network = Topical Community
Who works together on what?
Text + Context = Contextual Text Mining
Query log + User = Personalized Search
Scientific Literature + Time = Topic Trends
Review + Brand = Comparative Opinion
Blog + Time & Location = Spatiotemporal Topic Diffusion
Blog + Sentiment = Faceted Opinion Summary
Publications + Social Network = Topical Community
…..
A General Solution for All?
Roadmap
• Generative Model of Text
• Integrating Contexts in Text Models
– Modeling Simple Context
– Modeling Implicit Context
– Modeling Complex Context
• Applications of Contextual Text Mining
Generative Model of Text
$P(\text{word} \mid \text{Model})$
Generation: the model emits text, e.g. "the .. movie .. harry .. potter is .. based .. on .. j .. k .. rowling"
Inference, Estimation: recover the model from the observed words (the, harry, potter, movie, is)
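The two directions on this slide can be sketched in a few lines. This is only an illustration for a unigram model: the vocabulary and probabilities in `model` are made up, and `generate` / `estimate` are hypothetical helper names, not anything from the talk.

```python
import random
from collections import Counter

# Hypothetical unigram model P(word | Model); the numbers are invented.
model = {"the": 0.3, "harry": 0.2, "potter": 0.2, "movie": 0.15, "is": 0.15}

def generate(model, n, seed=0):
    """Generation: draw n words independently from P(word | Model)."""
    rng = random.Random(seed)
    words = list(model)
    weights = [model[w] for w in words]
    return rng.choices(words, weights=weights, k=n)

def estimate(text):
    """Estimation: the maximum-likelihood unigram model is the relative frequency."""
    counts = Counter(text)
    total = len(text)
    return {w: c / total for w, c in counts.items()}

sample = generate(model, 10000)
estimated = estimate(sample)
```

With enough generated words, `estimated` should land close to `model`, which is the sense in which estimation inverts generation.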
Text as a Mixture of Topics
Web Search: search 0.2, engine 0.15, query 0.08, user 0.07, ranking 0.06, ……
learning 0.18, model 0.14, training 0.10, kernel 0.09, inference 0.07, ……
mining 0.21, data 0.13, pattern 0.10, clustering 0.05, network 0.04, ……
Topic (Theme) = the subject of a discourse
…
Using machine learning for web search
K topics
Probabilistic Topic Models (Hofmann ’99, Blei et al. ’03, …)
Topic 1 (Apple iPod): ipod 0.15, nano 0.08, music 0.05, download 0.02, apple 0.01
Topic 2 (Harry Potter): movie 0.10, harry 0.09, potter 0.05, actress 0.04, music 0.02
$P(w) = \sum_{i=1}^{K} P(z = i)\, P(w \mid \text{Topic}_i)$
Document: "I downloaded the music of the movie harry potter to my ipod nano"
e.g. P(ipod | Topic 1) = 0.15, P(harry | Topic 2) = 0.09
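As a small worked example of the mixture formula, the sketch below plugs in the two topic word lists from the slide. The topic weights P(z = i) are not given in the talk, so the uniform `topic_prior` is an assumption, and `word_prob` / `topic_posterior` are hypothetical helper names.

```python
# Topic word distributions copied from the slide (truncated lists).
topics = {
    "Apple iPod": {"ipod": 0.15, "nano": 0.08, "music": 0.05,
                   "download": 0.02, "apple": 0.01},
    "Harry Potter": {"movie": 0.10, "harry": 0.09, "potter": 0.05,
                     "actress": 0.04, "music": 0.02},
}
# Assumed topic weights P(z = i); the slide does not specify them.
topic_prior = {"Apple iPod": 0.5, "Harry Potter": 0.5}

def word_prob(word):
    """P(w) = sum_i P(z = i) * P(w | Topic_i)."""
    return sum(topic_prior[z] * topics[z].get(word, 0.0) for z in topics)

def topic_posterior(word):
    """P(z = i | w) by Bayes' rule; which topic most likely produced the word."""
    total = word_prob(word)
    return {z: topic_prior[z] * topics[z].get(word, 0.0) / total for z in topics}

p_music = word_prob("music")          # 0.5 * 0.05 + 0.5 * 0.02
post_music = topic_posterior("music")
```

Under these assumed weights, "music" gets contributions from both topics, while "ipod" is explained by the first topic alone.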
Parameter Estimation
• Maximum Likelihood Estimation (MLE): $\Theta^* = \arg\max_{\Theta} P(D \mid \Theta)$
• Parameter Estimation using EM algorithm
– Gibbs sampling, Variational inference, Expectation propagation
[Figure: EM on the document "I downloaded the music of the movie harry potter to my ipod nano". The topic word probabilities start unknown (?????); the algorithm alternates "Guess the affiliation" of each word and "Estimate the params" from the resulting Pseudo-Counts, arriving at Topic 1: ipod 0.15, nano 0.08, music 0.05, download 0.02, apple 0.01 and Topic 2: movie 0.10, harry 0.09, potter 0.05, actress 0.04, music 0.02]
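The guess-then-estimate loop can be sketched as a plain EM procedure for a PLSA-style mixture. This is a minimal sketch, not the talk's implementation: the toy documents, the random initialization, the iteration count, and the names `plsa_em` and `normalize` are all assumptions made for illustration.

```python
import random

def normalize(values):
    """Scale non-negative numbers to sum to 1 (left unchanged if all are zero)."""
    total = sum(values) or 1.0
    return [v / total for v in values]

def plsa_em(docs, k, iters=30, seed=0):
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    counts = [{w: doc.count(w) for w in set(doc)} for doc in docs]
    # Random starting point for P(w | topic) and P(topic | doc).
    p_w_z = [dict(zip(vocab, normalize([rng.random() + 0.1 for _ in vocab])))
             for _ in range(k)]
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(k)]) for _ in docs]
    for _ in range(iters):
        acc_wz = [{w: 0.0 for w in vocab} for _ in range(k)]
        acc_zd = [[0.0] * k for _ in docs]
        for d, doc_counts in enumerate(counts):
            for w, c in doc_counts.items():
                # E-step ("guess the affiliation"): P(topic | doc, word).
                post = normalize([p_z_d[d][z] * p_w_z[z][w] for z in range(k)])
                for z in range(k):
                    acc_wz[z][w] += c * post[z]  # expected (pseudo) counts
                    acc_zd[d][z] += c * post[z]
        # M-step ("estimate the params"): renormalize the expected counts.
        p_w_z = [dict(zip(vocab, normalize([acc_wz[z][w] for w in vocab])))
                 for z in range(k)]
        p_z_d = [normalize(row) for row in acc_zd]
    return p_w_z, p_z_d

# Toy corpus built from the slide's example words (an assumption).
docs = [
    "ipod nano music download apple ipod".split(),
    "movie harry potter actress music movie".split(),
    "i downloaded the music of the movie harry potter to my ipod nano".split(),
]
p_w_z, p_z_d = plsa_em(docs, k=2)
```

EM only finds a local optimum, so a different seed may yield different topics; Gibbs sampling or variational inference, named on the slide, would replace the loop body rather than this overall structure.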
How Context Affects Topics
• Topics in science literature: 16th century vs. 21st century
• When do a computer scientist and a gardener write about “tree, root, prune”?
• In Europe, “football” appears a lot in a soccer report. What about in the US?
Text is generated according to the context!
“Context of Situation” - B. Malinowski 1923
Existing Work
• PLSA (Hofmann ’99), LDA (Blei et al. ’03), CTM (Blei et al. ’06), PAM (Li and McCallum ’06)
– Don’t incorporate contexts
• Author: Author-topic model (Steyvers et al. ’04)
• Time: Topic-over-time (Wang et al. ’06), Dynamic Topic model (Blei et al. ’06)
Can we capture the context in a general way?
Contextualized Models
$P(\text{word} \mid \text{Model}, \text{Context})$
Generation:
• How to select contexts?
• How to model context structure?
Inference:
• How to reveal contextual patterns?
Example contexts: Year = 1998, Year = 2008, Location = US, Location = China, Source = official, Sentiment = +
[Figure: words (book, harry, potter, is) generated under a selected context; word distributions P(w|M, Year = 1998) and P(w|M, Year = 2008): book 0.15, harry 0.10, potter 0.08, rowling 0.05; movie 0.18, harry 0.09, potter 0.08, director 0.04]
Roadmap: Modeling Simple Context
Author
Time
Source
Author’s occupation
Language
Location
Simple Contexts
Simple Contextual Topic Model (Mei and Zhai KDD’06)
Topic 1
Topic 2
Context 1: 2004
Context 2: 2007
$P(w) = \sum_{j=1}^{C} \sum_{i=1}^{K} P(c_j)\, P(z = i \mid c_j)\, P(w \mid \text{Topic}_i, c_j)$
Apple iPod
Harry Potter
I downloaded
the music of
the movie
harry potter to
my iphone
Contextual Topic Patterns
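To make the double sum concrete, here is a toy instance with two contexts (2004, 2007) and two topics. Every probability below is invented for illustration, and `contextual_word_prob` is a hypothetical helper; only the structure P(c) · P(z | c) · P(w | topic, c) follows the slide.

```python
# P(c_j): assumed context weights.
p_context = {"2004": 0.5, "2007": 0.5}

# P(z = i | c_j): the contextual topic pattern (assumed values).
p_topic_given_context = {
    "2004": {"Apple iPod": 0.7, "Harry Potter": 0.3},
    "2007": {"Apple iPod": 0.6, "Harry Potter": 0.4},
}

# P(w | Topic_i, c_j): context-specific views of each topic (assumed values).
p_word = {
    ("Apple iPod", "2004"): {"ipod": 0.20, "music": 0.05},
    ("Apple iPod", "2007"): {"ipod": 0.10, "iphone": 0.12, "music": 0.05},
    ("Harry Potter", "2004"): {"harry": 0.10, "book": 0.08},
    ("Harry Potter", "2007"): {"harry": 0.09, "movie": 0.10},
}

def contextual_word_prob(word):
    """P(w) = sum_j sum_i P(c_j) * P(z = i | c_j) * P(w | Topic_i, c_j)."""
    total = 0.0
    for c, pc in p_context.items():
        for z, pz in p_topic_given_context[c].items():
            total += pc * pz * p_word[(z, c)].get(word, 0.0)
    return total

p_iphone = contextual_word_prob("iphone")  # only the 2007 view of Apple iPod contributes
p_ipod = contextual_word_prob("ipod")
```

Reading off `p_topic_given_context` across contexts is what the following slides call a contextual topic pattern, e.g. P(z | time).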
Example: Topic Life Cycles (Mei and Zhai KDD’05)
Hot Topics in SIGMOD
Context = Time
Contextual Topic Pattern P(z|time)
Example: Spatiotemporal Theme Pattern (Mei et al. WWW’06)
Topic: Government Response in Hurricane Katrina
[Figure labels: Hurricane Katrina, Hurricane Rita]
Context = Time & Location
Contextual Topic Pattern P(z|time, location)
Example: Event Impact Analysis (Mei and Zhai KDD’06)
Topic: retrieval models (SIGIR)
term 0.16, relevance 0.08, weight 0.07, feedback 0.04, model 0.03, probabilistic 0.02, document 0.02, …
Events: 1992, Starting of TREC; 1998, [Ponte and Croft 98]
vector 0.05, concept 0.03, model 0.03, space 0.02, boolean 0.02, function 0.01, …
xml 0.07, email 0.02, model 0.02, collect 0.02, judgment 0.01, rank 0.01, …
probabilist 0.08, model 0.04, logic 0.04, boolean 0.03, algebra 0.02, weight 0.01, …
model 0.17, language 0.08, estimate 0.05, parameter 0.03, distribution 0.03, smooth 0.02, likelihood 0.01, …
Labels: Traditional Models; Evaluation & Applications; Probabilistic Models; Language Models
Context = Event
Contextual Pattern P(w|z, event)
Instantiation: Personalized Search (Mei and Church WSDM’08)
Personalization with Backoff
• Ambiguous query: MSR
– Microsoft Research
– Mountain Safety Research
• Disambiguate based on user’s prior clicks
• We don’t have enough data for everyone!
– Backoff to classes of users
• Proof of Concept:
– Context = Classes of Users defined by IP address
Personalized Search as Contextual Text Mining
Text: query (click) logs (IP, Query, URL)
Context: Users (IP), groups of users
156.111.188.243
156.111.188.*
156.111.*.*
156.*.*.*
*.*.*.*
Text Model: P(URL | Query)
Contextual Model: P(URL | Query, User)
Goal: Estimate Better P(URL | Query, User)
Evaluation Metric: Entropy (H)
• Difficulty of encoding information (a distribution)
– Size of search space; difficulty of a task
• Powerful tool for sizing challenges and opportunities
– How hard is web search?
– How much does personalization help?
• Predict future: Cross Entropy H(Future|History)
$H(URL) = -\sum_{URL} p(URL) \log p(URL)$
Difficulty of Queries
• Easy queries (low H(URL|Q)):
– google, yahoo, myspace, ebay, …
• Hard queries (high H(URL|Q)):
– dictionary, yellow pages, movies, “what is may day?”
Hard Query: “MSR” – High Entropy
Clicked URLs: msrgear.com, msracing.com, research....com, msrwheels.com, msr.com, msr.org, msrdev.com, …
Click probabilities: 0.12, 0.10, 0.09, 0.08, 0.07, 0.07, 0.06, 0.05
Easy Query: “Google” – Low Entropy
Clicked URLs: google.com, google.cn, maps.google …, …
Click probabilities: 0.80, 0.10, 0.08, ~ 0, ~ 0, ~ 0, ~ 0, ~ 0
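The hard-versus-easy contrast can be checked numerically. The sketch below computes the entropy of the two click distributions listed on the slide; because those lists are truncated, renormalizing the listed values to sum to 1 is an assumption, as is the choice of base-2 logarithms. `entropy` is a hypothetical helper name.

```python
import math

def entropy(probs):
    """H = -sum p * log2(p), after renormalizing the listed values to sum to 1."""
    total = sum(probs)
    return -sum((p / total) * math.log2(p / total) for p in probs if p > 0)

# Click probabilities as listed on the slide (truncated; the "~ 0" entries are dropped).
msr_clicks = [0.12, 0.10, 0.09, 0.08, 0.07, 0.07, 0.06, 0.05]
google_clicks = [0.80, 0.10, 0.08]

h_msr = entropy(msr_clicks)        # clicks spread over many URLs
h_google = entropy(google_clicks)  # clicks concentrated on one URL
```

The flatter "MSR" distribution yields the higher entropy, matching the slide's reading that it is the harder query.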
How Hard Is Search?
• Traditional Search
– H(URL | Query)
– 2.8 (= 23.9 – 21.1)
• Personalized Search
– H(URL | Query, IP)
– 1.2 (= 27.2 – 26.0)
Variables     Entropy (H)
Query         21.1
URL           22.1
IP            22.1
Query, URL    23.9
Query, IP     26.0
IP, URL       27.1
All Three     27.2
Personalization cuts H in Half!
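The subtractions on this slide are the chain rule for entropy, H(X | Y) = H(X, Y) – H(Y). The short sketch below redoes that arithmetic from the table; the dictionary keys and the helper name `conditional_entropy` are hypothetical.

```python
# Joint entropies from the slide's table.
joint_entropy = {
    "Query": 21.1, "URL": 22.1, "IP": 22.1,
    "Query,URL": 23.9, "Query,IP": 26.0, "IP,URL": 27.1,
    "Query,IP,URL": 27.2,
}

def conditional_entropy(joint, given):
    """Chain rule: H(X | Y) = H(X, Y) - H(Y)."""
    return joint_entropy[joint] - joint_entropy[given]

# H(URL | Query) = H(Query, URL) - H(Query)
traditional = conditional_entropy("Query,URL", "Query")
# H(URL | Query, IP) = H(Query, IP, URL) - H(Query, IP)
personalized = conditional_entropy("Query,IP,URL", "Query,IP")
ratio = personalized / traditional
```

The ratio comes out a little under one half, which is the basis for the "cuts H in Half" remark.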
Context = First k bytes of IP
Full personalization: every user has a different model: sparse data!
No personalization: all users share the same model: Missed Opportunity
Personalization with backoff: smooth by similar users
$P(URL \mid User, Q) = \lambda_4 P(URL \mid IP_4, Q) + \lambda_3 P(URL \mid IP_3, Q) + \lambda_2 P(URL \mid IP_2, Q) + \lambda_1 P(URL \mid IP_1, Q) + \lambda_0 P(URL \mid IP_0, Q)$
IP_4 = 156.111.188.243
IP_3 = 156.111.188.*
IP_2 = 156.111.*.*
IP_1 = 156.*.*.*
IP_0 = *.*.*.*
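A minimal sketch of backoff over IP prefixes, assuming a simple linear interpolation of click frequencies. The weights in `lambdas`, the toy click log, and the names `prefixes` and `BackoffModel` are hypothetical; the talk does not say how the weights are set.

```python
from collections import Counter, defaultdict

def prefixes(ip):
    """Return [IP_0, ..., IP_4]: the first k bytes of the IP, wildcards elsewhere."""
    parts = ip.split(".")
    return [".".join(parts[:k] + ["*"] * (4 - k)) for k in range(5)]

class BackoffModel:
    def __init__(self, lambdas):
        self.lambdas = lambdas              # assumed weights for IP_0 .. IP_4
        self.clicks = defaultdict(Counter)  # (prefix, query) -> URL click counts

    def observe(self, ip, query, url):
        # One click updates the statistics of every class the user belongs to.
        for prefix in prefixes(ip):
            self.clicks[(prefix, query)][url] += 1

    def prob(self, url, ip, query):
        """Interpolate P(URL | IP_k, Q) over k. Unseen classes contribute zero,
        so a fuller version would renormalize the weights of the seen classes."""
        total = 0.0
        for lam, prefix in zip(self.lambdas, prefixes(ip)):
            counts = self.clicks.get((prefix, query))
            if counts:
                total += lam * counts[url] / sum(counts.values())
        return total

model = BackoffModel(lambdas=[0.1, 0.1, 0.2, 0.3, 0.3])
model.observe("156.111.188.243", "msr", "msr.org")
model.observe("156.111.188.7", "msr", "msr.org")
model.observe("8.8.8.8", "msr", "msrgear.com")

known_user = model.prob("msr.org", "156.111.188.243", "msr")
new_user = model.prob("msr.org", "156.111.9.9", "msr")  # backs off to 156.111.*.*
```

A user never seen before still gets a non-trivial estimate from the shorter prefixes, which is the point of backing off to classes of users.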
Context Market Segmentation
• Can we do better than IP address?
• Potential Context Variables
– ID, QueryType, Click, Intent, …
– Demographics (Age, Gender, Income, …)
– Time of day & Day of Week
Roadmap: Modeling Implicit Context
Sentiment
Implicit Contexts
Implicit Context of Text
Sentiments
Intents
Impact
Trust
???
Need to infer these situations/conditions from the data (with prior knowledge)
Modeling Implicit Context
Topic 1: Apple iPod
Topic 2: Harry Potter
Positive: good 0.10, like 0.05, perfect 0.02
Negative: hate 0.21, awful 0.03, disgust 0.01
Document: "I like the song of the movie on my perfect ipod but hate the accent"
??? (the implicit context of each word is not observed)
$\Theta^* = \arg\max_{\Theta} P(D \mid \Theta)\, P(\Theta)$
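One common way to realize the prior P(Θ) for a word distribution is a conjugate Dirichlet prior, which turns prior knowledge into pseudo-counts. The sketch below assumes that choice: a Dirichlet with parameters 1 + mu · prior(w) gives the MAP estimate (c(w) + mu · prior(w)) / (N + mu). The prior values, the counts, and `mu` are invented for illustration.

```python
def map_estimate(counts, prior, mu):
    """MAP word distribution under an assumed Dirichlet prior.

    counts: observed word counts; prior: a distribution summing to 1
    (e.g. seed sentiment words); mu: how many pseudo-counts the prior is worth.
    """
    vocab = set(counts) | set(prior)
    total = sum(counts.values())
    return {w: (counts.get(w, 0) + mu * prior.get(w, 0.0)) / (total + mu)
            for w in vocab}

# Hypothetical prior for a "positive" sentiment model.
positive_prior = {"good": 0.5, "like": 0.3, "perfect": 0.2}
# Hypothetical word counts currently attributed to the positive model.
observed = {"like": 2, "perfect": 1, "song": 1}

positive_model = map_estimate(observed, positive_prior, mu=4)
```

With `mu = 0` this reduces to the maximum-likelihood estimate; a larger `mu` pulls the model toward the seed words, which is how prior knowledge can steer an otherwise unobserved context such as sentiment.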
Example: Faceted Opinion Summarization (Mei et al. WWW’07)
Tom Hanks, who is my favorite movie star act the leading role.
Protesting.. you will lose your faith by watching the movie.
a good book to past time.
... so sick of people making such a big deal about a fiction book
Context = Sentiment
Topic 1: Movie
Topic 2: Book
The Da Vinci Code
Roadmap: Modeling Complex Context
Social Network
Complex Contexts
Complex Context of Text
Structures of contexts
• Find novel contextual patterns;
• Regularize contextual models;
• Alleviate data sparseness;
Modeling Complex Context
Context Structure: Context A and B are closely related
Intuitions: Model(A) and Model(B) should be similar
• users in the same building issue similar queries
• collaborating researchers work on similar things
• topics in SIGMOD are like topics in VLDB
[Figure: Topic 1 and Topic 2 under related contexts A and B]
O(D) combines Likelihood and Regularization
Graph-based Regularization
Structure of contexts: a graph
Intuition: Model(u) and Model(v) should be similar
Intuition = Regularized model = Smoothed Surfaces!
[Figure: Model(u) and Model(v) over nodes u, v of the graph; MLE for u, v; projection on a plane; smoothed surface(s) on top of the Graph]
O(D, G) combines Likelihood and Regularization
Instantiation: Topical Community Extraction (Mei et al. WWW’08)
Social Network Analysis
Generation, evolution e.g., [Leskovec 05]
Community extraction e.g., [Kleinberg 00]
Diffusion [Gruhl 04]; [Backstrom 06]
Search e.g., [Adamic 05]
Ranking e.g., [Brin and Page 98]; [Kleinberg 98]
Usually don’t model topics in text
- Kleinberg and Backstrom 2006, New York Times
- Jeong et al. 2001 Nature 411
Topical Community Analysis
physicist, physics, scientist, theory, gravitation …
writer, novel, best-sell, book, language, film…
Topics in text help community extraction
Information Retrieval + Data Mining + Machine Learning, … = Computer Science Literature
Text + Network → topical communities
Topical Community Extraction as Contextual Text Mining
Text: Scientific publications
Context: Authors
Context Structure: Social Network (coauthorship)
Text Model: Topic Model
Contextual Model: Topic Model + Author
Goal: Assign authors into topical communities using P(z|author)
- Regularize using social network
Topic Modeling with Network Regularization
Objective (to be minimized):
$O(D, G) = -(1 - \lambda) \Big( \sum_{c} \sum_{w} c(w, c) \log \sum_{j=1}^{k} p(\theta_j \mid c)\, p(w \mid \theta_j) \Big) + \lambda \Big( \frac{1}{2} \sum_{(u,v) \in E} w(u, v) \sum_{j=1}^{k} \big( p(\theta_j \mid u) - p(\theta_j \mid v) \big)^2 \Big)$
• First term: Data Likelihood (Text Model)
– Intuition 1: Know my research topics from my publications
• Second term: Graph Harmonic Regularizer (Graph Regularization), a generalization of [Zhu ’03]
– Smoothness between neighbors
– Intuition 2: I work on similar topics with my coauthors
• $\lambda$: tradeoff between MLE and smoothness
• Model parameters: $p(\theta_j \mid c)$, $p(w \mid \theta_j)$
O(D, G) combines Likelihood and Regularization
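The trade-off between fitting the text and agreeing with neighbors can be made concrete by evaluating both terms on a toy coauthor graph. This is only a sketch under assumptions: it takes the objective as -(1 - λ) · log-likelihood + λ · regularizer, to be minimized, and the authors, counts, distributions, and function names are all invented.

```python
import math

def log_likelihood(counts, p_topic, p_word):
    """sum_c sum_w c(w, c) * log sum_j p(theta_j | c) * p(w | theta_j)."""
    total = 0.0
    for ctx, word_counts in counts.items():
        for word, n in word_counts.items():
            p = sum(p_topic[ctx][j] * p_word[j].get(word, 0.0)
                    for j in range(len(p_word)))
            total += n * math.log(p)
    return total

def regularizer(p_topic, edges):
    """1/2 * sum over edges of w(u, v) * sum_j (p(theta_j | u) - p(theta_j | v))^2."""
    return 0.5 * sum(
        weight * sum((a - b) ** 2 for a, b in zip(p_topic[u], p_topic[v]))
        for u, v, weight in edges
    )

def objective(counts, p_topic, p_word, edges, lam):
    # Assumed sign convention: smaller is better.
    return (-(1 - lam) * log_likelihood(counts, p_topic, p_word)
            + lam * regularizer(p_topic, edges))

# Toy data: two coauthors "a" and "b" joined by an edge of weight 2.
counts = {"a": {"retrieval": 2, "mining": 1}, "b": {"mining": 2, "retrieval": 1}}
p_word = [{"retrieval": 0.8, "mining": 0.2}, {"retrieval": 0.2, "mining": 0.8}]
p_topic = {"a": [0.9, 0.1], "b": [0.1, 0.9]}
edges = [("a", "b", 2.0)]

penalty = regularizer(p_topic, edges)
value = objective(counts, p_topic, p_word, edges, lam=0.5)
```

Setting `lam = 0` recovers the plain topic-model likelihood, while `lam = 1` ignores the text and only asks connected authors to share topic distributions.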
Topics & Communities without Network Regularization
Four Conferences: SIGIR, KDD, NIPS, WWW
Topic 1            Topic 2          Topic 3        Topic 4
term 0.02          peer 0.02        visual 0.02    interface 0.02
question 0.02      patterns 0.01    analog 0.02    towards 0.02
protein 0.01       mining 0.01      neurons 0.02   browsing 0.02
training 0.01      clusters 0.01    vlsi 0.01      xml 0.01
weighting 0.01     stream 0.01      motion 0.01    generation 0.01
multiple 0.01      frequent 0.01    chip 0.01      design 0.01
recognition 0.01   e 0.01           natural 0.01   engine 0.01
relations 0.01     page 0.01        cortex 0.01    service 0.01
library 0.01       gene 0.01        spike 0.01     social 0.01
Fuzzy Topics
Noisy community assignment
Topics & Communities with Network Regularization
Topic 1            Topic 2            Topic 3            Topic 4
retrieval 0.13     mining 0.11        neural 0.06        web 0.05
information 0.05   data 0.06          learning 0.02      services 0.03
document 0.03      discovery 0.03     networks 0.02      semantic 0.03
query 0.03         databases 0.02     recognition 0.02   services 0.03
text 0.03          rules 0.02         analog 0.01        peer 0.02
search 0.03        association 0.02   vlsi 0.01          ontologies 0.02
evaluation 0.02    patterns 0.02      neurons 0.01       rdf 0.02
user 0.02          frequent 0.01      gaussian 0.01      management 0.01
relevance 0.02     streams 0.01       network 0.01       ontology 0.01
Communities: Information Retrieval, Data mining, Machine learning, Web
Clear Topics
Coherent community assignment
Topic Modeling and SNA Improve Each Other
Methods               Cut Edge Weights   Ratio Cut / Norm. Cut   Community Size (Community 1 / 2 / 3 / 4)
PLSA (text only)      4831               2.14 / 1.25             2280 / 2178 / 2326 / 2257
NetPLSA               662                0.29 / 0.13             2636 / 1989 / 3069 / 1347
NCut (network only)   855                0.23 / 0.12             2699 / 6323 / 8 / 11
Cut Edge Weights and Ratio Cut / Norm. Cut: the smaller the better
- NCut: spectral clustering with normalized cut (Shi et al. ’00)
Network Regularization helps extract coherent communities (ensure tight connection of authors)
Topic Modeling helps balance communities (text implicitly bridges authors)
Summary of My Talk
• Text + Context = Contextual Text Mining
– A new paradigm of text mining
• General methodology for contextual text mining
– Generative models of text (e.g., Topic Models)
– Contextualized models with simple context, implicit context, complex context;
• Applications of contextual text mining
Take Away Message
Text + Context = [figure]
A Roadmap of My Work
Information Retrieval & Web Search
Text Mining
KDD 06a Annotating frequent patterns
KDD 05
KDD 06b
WWW 06
WWW 07
WWW 08
Contextual Topic Models
KDD 07 Labeling topic models
SIGIR 07
CIKM 08
ACL 08 Impact-based summarization
Query suggestion using hitting time
Poisson language models
PSB 06
IP&M 07
KDD 08
Application to Bioinfo.
Bio. literature mining
SIGIR 08
WSDM 08
Graph-based smoothing
A Roadmap to the Future
Text Information Management
Information Retrieval & Web Search
Text Mining
Theoretical Framework
• Computational challenge;
• Structure of contexts
Task Support Systems
• Web users
• Scientists
• Business users
Applications
Integrative analysis of heterogeneous data
• web 2.0 data
• Science data
• Information networks
Interdisciplinary
• Bioinformatics
• Health informatics
• Business informatics
Predict the Future
• IP in the future might not be seen in the history
[Figure: Cross Entropy H(future | history) plotted against "At least first k bytes of IP are seen in History" (k = 4, 3, 2, 1, 0). Curves: Personalization with backoff, No personalization, Complete personalization. Annotations: "Knows at least two bytes"; "Knows every byte – enough data"]