New Event Detection & Tracking
ÖZGÜR BAĞLIOĞLU
SÜLEYMAN KARDAŞ
H. ÇAĞDAŞ ÖCALAN
ERKAN UYAR
Bilkent Information Retrieval Group
Computer Engineering Department
Bilkent University
22/03/07 First Event Detection & Event Tracking 2
Outline
• Introduction
  – What is a new event detection and tracking system
  – Motivation
• Related Work
  – TDT
  – Google News
  – NewsInEssence
• Proposed System
  – Test Collection Preparation (TTracker)
  – Novelty Detection & Event Tracking
  – C3M concept
  – Design Details
• Future Work
  – Named Entities with NED
• Conclusion
Introduction
• Event
  – Tied to a specific time and place
• Topic
  – A seminal event or activity
• The difference
  – “Computer virus detected at British Telecom, March 3, 1993” is an event
  – “Computer virus outbreaks” is a topic
Introduction
• New event detection is the task of detecting stories about previously unseen events in a stream of news stories.
  – Examples: airplane crashes, earthquakes, government elections, etc.
• Properties of a new event
  – When the event occurred
  – Who was involved
  – Where it took place
  – How it happened
  – Impact, significance, or consequence of the event
Introduction
• An information filtering system
  – uses a long-lived profile of a user’s request to identify relevant material in a stream of arriving documents.
  – In contrast, new event detection has no knowledge of what events will happen in the news, so it must operate without a pre-specified query.
• NEDT usage areas
  – Categorization systems
  – People who need to know the latest news, such as government analysts, financial analysts, and stock market traders
  – Distinguishing new mail from previously seen mail
Related Work
• Topic Detection and Tracking (TDT)
  – Research ongoing since 1997
  – Broadcast news: written and spoken news stories in multiple languages
  – Research areas
    • Story Segmentation - Detect changes between topically cohesive sections
    • Topic Tracking - Keep track of stories similar to a set of example stories
    • Topic Detection - Build clusters of stories that discuss the same topic
    • First Story Detection - Detect if a story is the first story of a new, unknown topic
    • Link Detection - Detect whether or not two stories are topically linked
Related Work
• Google News
  – A novel approach to news
  – Uses 4,500 English news sources worldwide
  – Groups similar stories together
  – Displays them according to each reader's personalized interests
Related Work
• NewsInEssence
  – Since 2001
  – Summarizes clusters of related news articles from multiple sources on the Web
  – Developed by the CLAIR group at the University of Michigan
  – Partially funded by the NSF under the ITR program, grant number ITR-0082884
Proposed System
• Handling of Test Data (Milliyet, TRT, Zaman, Haber 7, CNN Türk)
  – Distribution of the data among collections
  – Processing the raw data
• Test Collection Preparation (TTracker)
  – Profiles and their properties
  – Sample profiles from the collection
• Novelty Detection & Event Tracking
  – C3M concept
  – Algorithm details
• Future Work
  – Named entities
  – System evaluation
• Conclusion
Handling of Test Data
• Data is collected from 5 different sources:
  – CNN Türk (http://www.cnnturk.com)
  – Haber 7 (http://www.haber7.com)
  – Milliyet Gazetesi (http://www.milliyet.com.tr)
  – TRT (http://www.trt.net.tr)
  – Zaman Gazetesi (http://www.zaman.com.tr)
• From these sources, the news stories of 2005 that carry time stamps (date and time) were crawled.
Handling of Test Data
• Each source represents a different point of view:
  – CNN Türk: international, American style
  – TRT: governmental, more restrictive
  – Milliyet Gazetesi: modern perspective
  – Zaman Gazetesi: conservative
  – Haber 7: provides variety
• Hence, the different perspectives provide a nice challenge when tracking the news.
Handling of Test Data
Statistics about the sources:

News Source       | No. of News | % of Total News | Average News Length (no. of words)
CNN Türk          |  31,919     |  14.2           | 270.57
Haber 7           |  59,304     |  26.3           | 237.85
Milliyet Gazetesi |  72,506     |  32.1           | 218.34
TRT               |  19,102     |   8.5           | 120.75
Zaman Gazetesi    |  42,749     |  19.0           |  96.76
All               | 225,580     | 100.0           | 199.56

After crawling, the text is stripped of HTML tags using the HTMLParser library.
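As a rough illustration of the tag-cleaning step, the sketch below strips markup from a crawled page. The slides name only an “HTMLParser library”; Python's standard html.parser module is used here as a stand-in, and TagStripper and strip_html are hypothetical names that are not part of the original system.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects only the text nodes of an HTML page (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep visible text only; script and style bodies are dropped.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse the whitespace runs left behind by removed markup.
        return " ".join(" ".join(self._chunks).split())


def strip_html(page):
    """Return the visible text of an HTML string."""
    parser = TagStripper()
    parser.feed(page)
    parser.close()
    return parser.text()
```

Dropping script and style bodies is a design choice of this sketch; the slides do not say how the original cleaning handled them.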
Test Collection Preparation TTracker
• TTracker is a sub-component that collects the test and training data semi-automatically.
• It is based on an information retrieval system.
• It lets annotators define profiles and the tracking news that belong to them.
• It also provides statistical information about the profiles.
• Its success will be compared with manual tracking.
Test Collection Preparation TTracker
• Profile contents are as follows:
  – Topic Title: A one- or two-word definition.
  – Seminal Event: A definition of at most two or three sentences.
  – What: What happened during the event.
  – Who: Who was involved in the event.
  – When: When the event occurred.
  – Where: Where the event occurred.
  – Topic Size: Estimated number of tracking news.
  – Seed: Seed document of the event.
  – Event Type: Category of the event.
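The profile fields listed above map naturally onto a small record type. The sketch below is only an illustration: the field names mirror the slide, but the types, the string seed identifier, and the empty_fields helper are assumptions, and the example values are placeholders rather than data from the real collection.

```python
from dataclasses import dataclass, fields


@dataclass
class EventProfile:
    """One TTracker profile; field names follow the slide, types are assumed."""
    topic_title: str    # one- or two-word definition
    seminal_event: str  # at most two or three sentences
    what: str
    who: str
    when: str
    where: str
    topic_size: int     # estimated number of tracking news
    seed: str           # seed document, assumed here to be a string identifier
    event_type: str


def empty_fields(profile):
    """Names of text fields left blank (a hypothetical completeness check)."""
    return [
        f.name
        for f in fields(profile)
        if isinstance(getattr(profile, f.name), str)
        and not getattr(profile, f.name).strip()
    ]


# Placeholder values only; they do not come from the collection.
example = EventProfile(
    topic_title="Example topic",
    seminal_event="A short description of the seminal event.",
    what="what happened",
    who="who was involved",
    when="2005",
    where="",
    topic_size=50,
    seed="doc-0001",
    event_type="example category",
)
```

Here empty_fields(example) reports the blank "where" field, the kind of gap a quality-control pass over profiles could flag.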
Test Collection Preparation TTracker
• The tracking news are defined in five stages:
  – Stage 1: Using the seed document as a query.
  – Stage 2: Using the event profile as a query.
  – Stage 3: Using the tracking news as a query.
  – Stage 4: Creative query searching.
  – Stage 5: Quality control of the profile.
• After these stages are completed, the quality of the profiles is also checked by administrators.
[Figure: workflow diagram with the boxes Start, Create, Stage 1 to Stage 5, and Finish]
Test Collection Preparation TTracker
• In each stage, annotators can label a news story as “tracking”, “non-tracking”, “not-sure”, or “not-evaluated”.
• Annotators evaluate:
  – 200 documents in the 1st stage,
  – 300 documents in the 2nd stage,
  – 400 documents in the 3rd stage,
  – 200 documents for each query of the 4th stage.
Test Collection Preparation TTracker
• So far, we have collected nearly 60 completed profiles with the valuable contribution of our friends.
• We take extra care to avoid bias in the collection: the number of profiles per person, the event types, and the profile lengths are all kept in balance.
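Per-profile statistics (columns in extraction order: Time Spent, Not-Evaluated, Not-Sure, Non-Tracking, Tracking, Retrieved):
Avg.: 130 | 77 | 1 | 378 | 89 | 546
Max.: 825614377614541129 (digit boundaries lost in extraction)
Min.: 2000142221 (digit boundaries lost in extraction)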
Test Collection Preparation TTracker
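Example profiles and their life-time statistics (the n columns give the number of tracking news in the first n days):

Pro. No | News Title | No. of Tracking News | Life-Time (days) | n=10 | n=25 | n=50 | n=100
1 | Sahte Rakı (Counterfeit rakı) | 329 | 304 | 185 | 244 | 273 | 298
2 | Papa 2. Jean Paul, hastalığı ve ölümü (Pope John Paul II, his illness and death) | 291 | 288 | 345276287 (digit boundaries lost in extraction)
3 | Suriye’yi Lübnan’dan çıkaran suikast (The assassination that drove Syria out of Lebanon) | 318 | 330 | 6 | 58 | 166 | 221
4 | Kırgızistan’da kadifemsi “devrim” (The velvet-like “revolution” in Kyrgyzstan) | 179 | 270 | 110 | 138 | 147 | 166
5 | Live 8 konserlerinin G-8 zirvesine etkisi (The effect of the Live 8 concerts on the G-8 summit) | 110 | 241 | 0 | 1 | 36 | 94
6 | Fransa’nın AB anayasasını referanduma götürmesi (France putting the EU constitution to a referendum) | 329 | 353 | 14265299 (digit boundaries lost in extraction)
7 | Özbekistan’da kanla bastırılan isyan (The uprising bloodily suppressed in Uzbekistan) | 231 | 241 | 33 | 172 | 188 | 206
8 | 2005 Eurovision Şarkı Yarışması (2005 Eurovision Song Contest) | 94 | 279 | 1 | 16 | 32 | 53
9 | Formula 1 Türkiye Grand Prix (Formula 1 Turkish Grand Prix) | 141 | 308 | 4 | 8 | 15 | 35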
Test Collection Preparation TTracker
Distribution of news over the year for two sample profiles generated with TTracker:
[Figure: two plots of news amount against days of 2005, one for “Sahte Rakı” (y-axis 0 to 80) and one for “2005 Eurovision Şarkı Yarışması” (y-axis 0 to 8)]
Test Collection Preparation TTracker
• TTracker is built on an information retrieval system, so the collection is prepared semi-automatically.
• TTracker’s recall will be compared with the recall of the manual system (= 1).
• A t-test will be used to measure the correctness of the system.
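One plausible reading of this plan is a one-sample t-test of per-profile recall values against the reference value 1. The slides do not say which t-test variant is intended, so the sketch below is an assumption, and the recall values in it are made up for illustration.

```python
import math
from statistics import mean, stdev


def one_sample_t(samples, mu0):
    """Return (t statistic, degrees of freedom) for H0: mean(samples) == mu0."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    s = stdev(samples)  # sample standard deviation (n - 1 in the denominator)
    if s == 0:
        raise ValueError("t is undefined when all samples are identical")
    t = (mean(samples) - mu0) / (s / math.sqrt(n))
    return t, n - 1


# Made-up per-profile recall values, for illustration only.
recalls = [0.9, 0.8, 1.0, 0.9]
t_stat, dof = one_sample_t(recalls, 1.0)
```

The statistic would then be compared with a critical value of Student's t distribution at the chosen significance level and the returned degrees of freedom.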
Proposed System
Novelty Detection & Event Tracking
• Novelty detection
  – is the identification of new data that a machine learning system was not aware of during training;
  – is one of the fundamental requirements of a good classification or identification system.
Proposed System
A special case of novelty detection...
[Figure: timeline from 0 to Now, labelled with Old News, First Event, Tracking Events, and Window]
Proposed System
• Cover Coefficient Based Clustering Methodology (C3M) [Can F., Ozkarahan E. 1990]
  – Single-pass seed algorithm
  – Working principles:
    • Determining the number of clusters
    • Determining the cluster seeds
    • Assigning the other documents to the clusters initiated by the seeds
  – A two-stage probability experiment is performed
Proposed System
• C3M concept
  – Example D (document-term) and C (cover coefficient) matrices
  – $C_{ij} = \alpha_i \sum_{k=1}^{m} d_{ik}\,\beta_k\,d_{jk}$, with k running over the m terms
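A minimal sketch of the cover coefficient formula follows. The slide does not define α and β; following Can and Ozkarahan (1990), the sketch takes α_i as the reciprocal of the i-th row sum of D and β_k as the reciprocal of the k-th column sum. The matrix D below is a made-up toy example, not the one shown on the slide, and the function name is hypothetical.

```python
def cover_coefficients(D):
    """Return the document-by-document C matrix for a document-term matrix D."""
    n_docs, n_terms = len(D), len(D[0])
    row_sums = [sum(row) for row in D]
    col_sums = [sum(D[i][k] for i in range(n_docs)) for k in range(n_terms)]
    if 0 in row_sums or 0 in col_sums:
        raise ValueError("every document and every term must occur at least once")
    alpha = [1.0 / s for s in row_sums]  # one value per document
    beta = [1.0 / s for s in col_sums]   # one value per term
    return [
        [
            alpha[i] * sum(D[i][k] * beta[k] * D[j][k] for k in range(n_terms))
            for j in range(n_docs)
        ]
        for i in range(n_docs)
    ]


# Toy binary document-term matrix: 3 documents, 3 terms (made up).
D = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
]
C = cover_coefficients(D)

# In the cited paper, the sum of the diagonal entries estimates the number of clusters.
n_clusters = sum(C[i][i] for i in range(len(D)))
```

Each row of C sums to 1, which reflects the two-stage probability experiment: starting from document i, pick one of its terms, then pick a document containing that term.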
Proposed System
• NEDT using the C3M concept:
• The threshold value δW (for new event detection) depends on:
  – the window size
  – Cii of the incoming event
  – Cij of the incoming event to the other events in the window
• δG depends on:
  – the cluster centroid similarity (Cij)
  – Cii of the incoming event
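The windowed decision can be sketched as follows. This is a simplification under stated assumptions, not the system's algorithm: it uses a single fixed threshold delta in place of δW (which, per the slide, varies with the window size, Cii, and Cij), it does not model δG, and the function names and toy vectors are hypothetical.

```python
from collections import deque


def coupling_to_window(window_docs, incoming):
    """Cover coefficients C_ij from the incoming document i to each window document j.

    D is taken to be the window documents plus the incoming one, so the
    column sums (and hence beta_k) change as the window slides.
    """
    if sum(incoming) == 0:
        raise ValueError("incoming document has no terms")
    docs = list(window_docs) + [incoming]
    n_terms = len(incoming)
    col_sums = [sum(d[k] for d in docs) for k in range(n_terms)]
    alpha_i = 1.0 / sum(incoming)
    return [
        alpha_i * sum(
            incoming[k] * d[k] / col_sums[k]
            for k in range(n_terms)
            if col_sums[k] > 0
        )
        for d in window_docs
    ]


def label_stream(stream, window_size, delta):
    """Label each document 'new' or 'tracking' using a fixed threshold delta."""
    window = deque(maxlen=window_size)  # the oldest document drops out automatically
    labels = []
    for doc in stream:
        couplings = coupling_to_window(window, doc) if window else []
        is_new = not couplings or max(couplings) < delta
        labels.append("new" if is_new else "tracking")
        window.append(doc)
    return labels


# Toy term-count vectors over a four-term vocabulary (made up).
stream = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1]]
labels = label_stream(stream, window_size=2, delta=0.2)
```

A document is labelled new when it is only weakly covered by everything in the window; the first document is new by construction because the window is empty.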
Proposed System
• Two thresholds should be found:
  – In window
  – In collection
• One possible choice for the in-window threshold is complicated; it was found intuitively through experimental trials.
• Results are as follows:
[Equation garbled in extraction; surviving symbols: k, ln, W, max over j ∈ W, Cij, Cii]
Proposed System
• Some experiments will be conducted to improve the threshold using pattern recognition techniques such as:
  – Mixture of Gaussians
  – SVM
  – Decision trees
• Another problem in finding the threshold:
  – The dataset is not large enough
  – Only 2 features are available
[Figure: scatter plot with Cii on the x-axis (0 to 0.7) and Cij on the y-axis (0 to 0.35); blue dots are new events, green dots are tracking events]
Future Work
• Improving NED => using named entities
  – Topic-conditioned novelty detection (Yang, ..., 2002)
  – A new similarity measure with semantic classes (Makkonen, ..., 2002)
  – Modified similarity metrics (Kumaran and Allan, 2004)
  – Using names and topics (Kumaran and Allan, 2005)
Future Work
• Intuition behind named entities:
  – Who, where, when
  – People, organizations, places, date and time
• How to embed named entities into NED
  – A new similarity matrix
  – An additional similarity comparison with the extracted named entities
Future Work
• Evaluation of the NED
  – Judge documents
    • Select random documents from different categories
    • Annotators judge the documents
    • The same documents are processed by our system
  – Finally, evaluation is done according to precision and recall with respect to the annotators’ judgements
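The final comparison can be sketched with the standard definitions of precision and recall, treating the annotators' judgements as ground truth. The function name and the document identifiers below are hypothetical.

```python
def precision_recall(system_new, judged_new):
    """Precision and recall of the system's new-event decisions.

    system_new: ids of documents the system flagged as new events.
    judged_new: ids of documents the annotators judged to be new events.
    """
    system_new, judged_new = set(system_new), set(judged_new)
    true_positives = len(system_new & judged_new)
    precision = true_positives / len(system_new) if system_new else 0.0
    recall = true_positives / len(judged_new) if judged_new else 0.0
    return precision, recall


# Made-up document ids, for illustration only.
precision, recall = precision_recall({"d1", "d2", "d3"}, {"d2", "d3", "d4", "d5"})
```

In this toy case two of the three flagged documents are correct and two of the four judged new events are found. Returning 0.0 for an empty set is a convention chosen for the sketch.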
Future Work
• Developing an
  – effective
  – real-time
  Web application capable of
  – detecting new events
  – tracking old ones
Conclusion
• We covered:
  – New event detection and tracking concepts
  – Test collection preparation
  – Details of the designed system
• Goal:
  – Conduct leading research for Turkish
  – Make dreams in information retrieval come true
  – “Rising like a sun in the science world” (Fazli Can)
References
Can F. and Ozkarahan E. A. “Concepts and effectiveness of the cover coefficient based clustering methodology for text databases”. 1990.
Kumaran G. and Allan J. “Text classification and named entities for new event detection”. 2004.
Makkonen J., Ahonen-Myka H., and Salmenkivi M. “Applying semantic classes in event detection and tracking”. 2002.
Yang Y., Zhang J., Carbonell J., and Jin C. “Topic-conditioned novelty detection”. 2002.
Questions?
Thanks for your patience...
Any questions?