A probabilistic model for retrospective news event detection
Zhiwei Li, Bin Wang, Mingjing Li and Wei-Ying Ma. A probabilistic model for retrospective news event detection. In the 28th Annual International ACM SIGIR Conference (SIGIR'2005), 2005.
Presenter: Suhan Yu
Introduction
• RED
  – Retrospective news event detection (RED) is defined as the discovery of previously unidentified events in a historical news corpus.
• News event definition
  – A specific thing that happens at a specific place and time.
  – Consecutively reported by many news articles over a period.
Introduction
• Observation:
  – A news article contains two kinds of information:
    • Contents (the focus of most previous research)
    • Timestamps (often ignored)
• This paper's contributions include:
  – Proposing a multi-modal RED algorithm (uses both content and time information)
  – Proposing an approach to determine the approximate number of events from the article count-time distribution.
Characteristics of news articles and events
• The Halloween topic contains many events
  – Each year's Halloween is an event.
• The figure indicates the two most important characteristics
  – Events are peaks, but in some situations several events can overlap in time.
  – The start and end times of reports on an event are very similar across different websites.
[Figure: count-time distribution of Halloween articles; label: "event"]
Multi-modal retrospective news event detection method
• Representation of news articles and news events
  – News articles are represented by four kinds of information:
    • Who (persons)
    • Where (locations)
    • What (keywords)
    • When (time): for an event, defined as the period between the first article and the last article (so time consists of two values)
  – Define a news article and an event as:
    $\text{article} = \{\text{persons}, \text{locations}, \text{keywords}, \text{time}\}$
    $\text{event} = \{\text{persons}, \text{locations}, \text{keywords}, \text{time}\}$
• The four kinds of information of a news article are independent:
    $p(\text{article}) = p(\text{persons})\, p(\text{locations})\, p(\text{keywords})\, p(\text{time})$
The generative model of news articles
• Contents
  – Unigram models are used to model contents.
  – Persons, locations and keywords are modeled by three separate models.
• Timestamps
  – A Gaussian Mixture Model (GMM) is chosen to model timestamps.
• A peak is usually modeled by a Gaussian function, where the mean is the position of the peak and the variance reflects the duration of the event.
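The generative view above can be made concrete with a small sampling sketch. It assumes a toy setting that is not in the slides: two hypothetical events, each with an invented prior, three invented unigram distributions (persons, locations, keywords) and a Gaussian over timestamps. The names `EVENTS`, `sample_words` and `sample_article` are illustrative only.

```python
import random

# Hypothetical toy parameters for two events; every value is invented for illustration.
EVENTS = [
    {"prior": 0.6,
     "persons": {"Mary": 0.7, "John": 0.3},
     "locations": {"Paris": 0.9, "Rome": 0.1},
     "keywords": {"flood": 0.5, "rescue": 0.5},
     "mean": 10.0, "std": 2.0},   # peak position and spread on the time axis
    {"prior": 0.4,
     "persons": {"Alice": 0.5, "Bob": 0.5},
     "locations": {"Tokyo": 1.0},
     "keywords": {"election": 0.8, "vote": 0.2},
     "mean": 30.0, "std": 5.0},
]


def sample_words(unigram, length, rng):
    """Draw `length` words independently from a unigram distribution."""
    words = list(unigram)
    weights = [unigram[w] for w in words]
    return rng.choices(words, weights=weights, k=length)


def sample_article(events, rng, n_words=3):
    """Pick an event by its prior, then draw contents and a timestamp from it."""
    priors = [e["prior"] for e in events]
    j = rng.choices(range(len(events)), weights=priors, k=1)[0]
    e = events[j]
    return {
        "event": j,
        "persons": sample_words(e["persons"], n_words, rng),
        "locations": sample_words(e["locations"], n_words, rng),
        "keywords": sample_words(e["keywords"], n_words, rng),
        "time": rng.gauss(e["mean"], e["std"]),  # one Gaussian component per event
    }


if __name__ == "__main__":
    print(sample_article(EVENTS, random.Random(0)))
```

Detection runs this process in reverse: given only the articles, it estimates the priors, unigram distributions and Gaussian parameters.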
The generative model of news articles
[Figure: the generative model of news articles; N = term space size]
Learning model parameters
• The model parameters can be estimated by the Maximum Likelihood method:
    $l(X; \theta) = \log p(X \mid \theta) = \log \prod_{i=1}^{M} p(x_i \mid \theta) = \sum_{i=1}^{M} \log \Big( \sum_{j=1}^{k} p(e_j)\, p(x_i \mid e_j, \theta) \Big)$
  – X represents the corpus of news articles.
  – M and k are the number of news articles and the number of events.
• Given an event j, the four kinds of information of the i-th article are conditionally independent:
    $p(x_i \mid e_j) = p(\text{time}_i \mid e_j)\, p(\text{persons}_i \mid e_j)\, p(\text{locations}_i \mid e_j)\, p(\text{keywords}_i \mid e_j)$
• The EM algorithm is generally applied to maximize the log-likelihood.
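As a rough illustration of the log-likelihood on this slide, the sketch below evaluates it for a toy corpus. The data layout is an assumption made for the example: articles and events are plain dictionaries, word counts are stored per kind of information, and every word an article uses is assumed to have a nonzero probability under every event. The inner sum over events is computed with the log-sum-exp trick, which is a numerical convenience and not something the slides prescribe.

```python
import math


def log_gaussian(t, mean, var):
    """Log density of a 1-D Gaussian at timestamp t."""
    return -0.5 * (math.log(2 * math.pi * var) + (t - mean) ** 2 / var)


def log_p_article_given_event(article, event):
    """log p(x_i | e_j) as a sum of the four conditionally independent parts."""
    lp = log_gaussian(article["time"], event["mean"], event["var"])
    for kind in ("persons", "locations", "keywords"):
        for word, count in article[kind].items():
            # Unigram model: each occurrence contributes log p(word | event).
            lp += count * math.log(event[kind][word])
    return lp


def log_likelihood(articles, events):
    """l(X) = sum_i log sum_j p(e_j) p(x_i | e_j), computed in the log domain."""
    total = 0.0
    for x in articles:
        terms = [math.log(e["prior"]) + log_p_article_given_event(x, e)
                 for e in events]
        m = max(terms)
        total += m + math.log(sum(math.exp(t - m) for t in terms))
    return total
```

Working in the log domain matters in practice because a product of many word probabilities quickly underflows.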
Maximize log-likelihood
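• E-step
    $p(e_j \mid x_i)^{(t+1)} = \frac{p(e_j)^{(t)}\, p(x_i \mid e_j)^{(t)}}{p(x_i)^{(t)}} = \frac{p(e_j)^{(t)}\, p(x_i \mid e_j)^{(t)}}{\sum_{r=1}^{k} p(e_r)^{(t)}\, p(x_i \mid e_r)^{(t)}}$
• M-step (update parameters)
    $p(w_n \mid e_j)^{(t+1)} = \frac{1 + \sum_{i=1}^{M} p(e_j \mid x_i)^{(t+1)}\, tf(i,n)}{N + \sum_{i=1}^{M} \big( p(e_j \mid x_i)^{(t+1)} \sum_{s=1}^{N} tf(i,s) \big)}$
  – $x_i$: news article i
  – $w_n$: word n, e.g. person = Mary
  – N: vocabulary size
  – M: number of all articles
  – $tf(i,n)$: count of entity $w_n$ in $x_i$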
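The two updates on this slide can be sketched in a few lines. The function names and the list-of-lists layout are assumptions made for the example: `likelihoods[i][j]` holds p(x_i | e_j), and `tf[i][n]` holds the count of word n in article i for one kind of information (for instance persons).

```python
def e_step(priors, likelihoods):
    """Return posteriors[i][j] = p(e_j | x_i) from priors p(e_j) and p(x_i | e_j)."""
    posteriors = []
    for row in likelihoods:
        joint = [p * l for p, l in zip(priors, row)]
        norm = sum(joint)  # p(x_i), the sum over all events
        posteriors.append([v / norm for v in joint])
    return posteriors


def m_step_unigram(posteriors, tf, j):
    """Return the smoothed estimates p(w_n | e_j) for every word n."""
    n_vocab = len(tf[0])
    # Denominator: N plus the posterior-weighted total word count.
    denom = n_vocab + sum(post[j] * sum(counts)
                          for post, counts in zip(posteriors, tf))
    return [
        (1 + sum(post[j] * counts[n] for post, counts in zip(posteriors, tf))) / denom
        for n in range(n_vocab)
    ]
```

The `1 +` in the numerator and `N +` in the denominator act as add-one smoothing, so a word never seen in an event still gets a nonzero probability and the estimates for each event sum to 1.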
Maximize log-likelihood
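• M-step
  – Parameters of the GMM
    $\mu_j^{(t+1)} = \frac{\sum_{i=1}^{M} p(e_j \mid x_i)^{(t+1)}\, \text{time}_i}{\sum_{i=1}^{M} p(e_j \mid x_i)^{(t+1)}}$  (mean)
    $(\sigma_j^{2})^{(t+1)} = \frac{\sum_{i=1}^{M} p(e_j \mid x_i)^{(t+1)}\, (\text{time}_i - \mu_j^{(t+1)})^2}{\sum_{i=1}^{M} p(e_j \mid x_i)^{(t+1)}}$  (variance)
  – Mixture proportions
    $p(e_j)^{(t+1)} = \frac{\sum_{i=1}^{M} p(e_j \mid x_i)^{(t+1)}}{M}$
  – M: number of all articles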
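A minimal sketch of the GMM updates on this slide, assuming the posteriors p(e_j | x_i) are already available as a list of rows (one row per article) and the timestamps are plain numbers. The function name `m_step_gmm` is illustrative.

```python
def m_step_gmm(posteriors, times, j):
    """Update mean, variance and mixture proportion of event j."""
    weights = [post[j] for post in posteriors]
    total = sum(weights)
    # Posterior-weighted mean of the timestamps.
    mean = sum(w * t for w, t in zip(weights, times)) / total
    # Posterior-weighted variance around the updated mean.
    var = sum(w * (t - mean) ** 2 for w, t in zip(weights, times)) / total
    # Share of all M articles attributed to this event.
    prior = total / len(times)
    return mean, var, prior
```

For example, with three articles at times 1, 3 and 10, of which the first two belong fully to event 0, the update for event 0 gives mean 2, variance 1 and proportion 2/3.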
How many events?
• We assume that only the salient peaks correspond to events.
  – The initial estimate of the number of events can be set to the number of peaks:
    • Use a hill-climbing approach to detect all peaks
    • Compute a salient score for each of them
    • The top 20% of peaks are defined as salient peaks.
    • Splitting/merging initial peaks
• To detect salient peaks, we define the salient score of a peak as:
    $score(peak) = left(peak) + right(peak)$
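The peak-selection steps above can be sketched as follows. Several things here are assumptions, since the slide does not spell them out: `left(peak)` and `right(peak)` are taken to be the distance to the first higher value on each side (or to the boundary if there is none), the input is a list of article counts per time bin, and flat stretches are not treated specially.

```python
def climb(counts, i):
    """Hill-climb from bin i to a local maximum by moving to the higher neighbour."""
    while True:
        best = i
        for nb in (i - 1, i + 1):
            if 0 <= nb < len(counts) and counts[nb] > counts[best]:
                best = nb
        if best == i:
            return i
        i = best


def find_peaks(counts):
    """All bins reachable as end points of hill-climbing."""
    return sorted({climb(counts, i) for i in range(len(counts))})


def salient_score(counts, peak):
    """score(peak) = left(peak) + right(peak), under the assumed definition."""
    def distance(step):
        i = peak + step
        while 0 <= i < len(counts):
            if counts[i] > counts[peak]:
                return abs(i - peak)
            i += step
        # No higher value on this side: fall back to the distance to the boundary.
        return peak if step < 0 else len(counts) - 1 - peak
    return distance(-1) + distance(+1)


def salient_peaks(counts, fraction=0.2):
    """Keep the top `fraction` of peaks by salient score (at least one)."""
    peaks = find_peaks(counts)
    ranked = sorted(peaks, key=lambda p: salient_score(counts, p), reverse=True)
    keep = max(1, round(fraction * len(ranked)))
    return sorted(ranked[:keep])
```

The number of peaks returned by `salient_peaks` would then serve as the initial number of events before any splitting or merging.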
Splitting/merging initial salient peaks
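• MDL (Minimum Description Length)
    $k = \arg\max_{k} \Big( \log p(X; \theta_k) - \frac{m_k}{2} \log(M) \Big)$
    $m_k = 3k - 1 + k(N_p - 1) + k(N_l - 1) + k(N_n - 1)$
  – The second term, $\frac{m_k}{2} \log(M)$, is the penalty.
  – M: number of all articles
  – $N_p$: person vocabulary size ($N_l$ and $N_n$ are the corresponding sizes for locations and keywords)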
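A small sketch of how the MDL criterion could be used to pick k. It assumes the model has already been fitted for several candidate values of k and that the resulting log-likelihoods are stored in a dictionary; the function names and that dictionary layout are illustrative.

```python
import math


def num_free_params(k, n_p, n_l, n_n):
    """m_k: free parameters of a k-event model with the given vocabulary sizes."""
    return 3 * k - 1 + k * (n_p - 1) + k * (n_l - 1) + k * (n_n - 1)


def mdl_score(log_lik, k, n_articles, n_p, n_l, n_n):
    """Log-likelihood minus the penalty (m_k / 2) * log(M)."""
    return log_lik - 0.5 * num_free_params(k, n_p, n_l, n_n) * math.log(n_articles)


def select_k(log_liks, n_articles, n_p, n_l, n_n):
    """log_liks maps each candidate k to its fitted log-likelihood."""
    return max(
        log_liks,
        key=lambda k: mdl_score(log_liks[k], k, n_articles, n_p, n_l, n_n),
    )
```

The count 3k - 1 covers k means, k variances and k - 1 free mixture proportions; each unigram model over a vocabulary of size N contributes N - 1 free probabilities per event.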
Event summarization
• Maximum a Posteriori (MAP)
    $y_i = \arg\max_{j} \big( p(e_j \mid x_i) \big)$
  – $y_i$ is the label of news article $x_i$.
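The MAP labeling amounts to taking, for each article, the event with the largest posterior. A minimal sketch, assuming the posteriors are stored as one row per article; `group_by_event` is a hypothetical helper that collects article indices per event, which is one way to prepare material for summarizing each event.

```python
def map_labels(posteriors):
    """y_i = argmax_j p(e_j | x_i) for every article i."""
    return [max(range(len(row)), key=row.__getitem__) for row in posteriors]


def group_by_event(posteriors):
    """Map each event index to the list of article indices labeled with it."""
    groups = {}
    for i, j in enumerate(map_labels(posteriors)):
        groups.setdefault(j, []).append(i)
    return groups
```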
Algorithm summary
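As a rough illustration of how the preceding steps fit together, the sketch below runs an EM loop and a final MAP labeling on timestamps only. It is a simplification made for this example, not the full multi-modal algorithm: the unigram models for persons, locations and keywords are left out, the number of events is fixed by the length of the initial parameter lists, and `min_var` is an added safeguard against a variance collapsing to zero.

```python
import math


def gaussian_pdf(t, mean, var):
    """Density of a 1-D Gaussian at timestamp t."""
    return math.exp(-0.5 * (t - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def em_timestamps(times, means, variances, priors, n_iter=50, min_var=1e-6):
    """Fit a 1-D Gaussian mixture to timestamps, then label articles by MAP."""
    k = len(means)
    means, variances, priors = list(means), list(variances), list(priors)
    for _ in range(n_iter):
        # E-step: posterior of each event for each article.
        posteriors = []
        for t in times:
            joint = [priors[j] * gaussian_pdf(t, means[j], variances[j])
                     for j in range(k)]
            norm = sum(joint)
            posteriors.append([v / norm for v in joint])
        # M-step: posterior-weighted mean, variance and mixture proportion.
        for j in range(k):
            total = sum(post[j] for post in posteriors)
            means[j] = sum(post[j] * t for post, t in zip(posteriors, times)) / total
            spread = sum(post[j] * (t - means[j]) ** 2
                         for post, t in zip(posteriors, times)) / total
            variances[j] = max(min_var, spread)
            priors[j] = total / len(times)
    # MAP labeling with the final posteriors.
    labels = [max(range(k), key=post.__getitem__) for post in posteriors]
    return means, variances, priors, labels
```

In a fuller version the initial means would come from the salient peaks, and the loop would be repeated for several values of k before choosing one with the MDL criterion.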
Multi-modal RED algorithm application
• HISCOVERY system
  – HISCOVERY (HIStory disCOVERY)
  – Two useful functions:
    • Photo Story
    • Chronicle
  – News articles come from 12 news sites (such as CNN, MSNBC, BBC…)
HISCOVERY system
Experimental methods
• Data
  – TDT
    • Benchmarks for event detection.
  – TDT4
    • Used to run the experiments
    • Contains 80 events annotated from 28,500 news articles.
    • The articles were collected over the period 2000/10 to 2001/1.
• Each year's reports can be regarded as an event.
• Extracting named entities
  – Extracted with the BBN NLP tool, which can extract seven types of named entities.
Experimental design
• To compare the approach with other algorithms:
  – Group Average Clustering (GAC)
    • It is the best algorithm in TDT evaluations.
    • A hierarchical clustering method
• Baseline
  – kNN algorithm
Results
• The probabilistic model achieves the best results, but the improvements are not significant.
Results
• Named entities
[Figures: result plots; panel labels "39 events" and "46 events"]
Conclusion
• Studied two characteristics of news articles and events.
• Proposed a multi-modal RED algorithm.
• Future work:
  – Use suitable dynamic models to model news events.
    • HMM
    • ICA (Independent component analysis)