Entities, Topics and Events in Community Memories
Elena Demidova, Nicola Barbieri, Stefan Dietze, Adam Funk, Gerhard Gossen, Diana Maynard, Nikos
Papailiou, Vassilis Plachouras, Wim Peters, Thomas Risse, Yannis Stavrakas, and Nina Tahmasebi
1st International Workshop on Archiving Community Memories, 6 September 2013, Lisbon, Portugal
Architecture Overview
[Figure labels: Offline processing; ETOEs extraction; Semantic enrichment & consolidation; Cross-crawl analysis; Dynamics detection]
TEXT ANALYSIS & CONSOLIDATION
Entity & Event Extraction from Text
Development of applications that identify document sections by language, automatically select appropriate resources to process multilingual text (within as well as across documents), and handle different domains appropriately within a single pipeline
GATE applications are wrapped in the off-line module
Entity types: Person, Location, Organisation, …
Cross-document co-reference within GATE
Improved linguistic pre-processing for degraded text in tweets (joint development with TrendMiner project)
Improvements to event recognition, including use of low-scoring terms as event indicators
Adaptation to German
Entity Enrichment and Correlation
Enrichment and correlation using DBpedia & Freebase
<Enrichment>http://dbpedia.org/resource/Jean-Claude_Trichet</Enrichment>
<Enrichment>http://dbpedia.org/resource/ECB</Enrichment>
<Event>Trichet warns of systemic debt crisis</Event>
<Person>Jean Claude Trichet</Person> <Organisation>ECB</Organisation>
DBpedia Spotlight: keyword search using entity labels with confidence 0.6
Freebase: structured queries using ARCOMEM entity types
FC data: 5,800 enriched entities (DBpedia: 492; Freebase: 5,309)
RAR data: 19,429 enriched entities (DBpedia: 6,021; Freebase: 13,408)
Avg. precision 0.89 (ranging from 0.8 to 1, depending on the entity type and source)
[SDA 12]
Freebase Dataset
• Data: 22 million entities, 350 million facts
• Schema: 7,500 entity types in about 100 domains
• (June 2011)
• Wikipedia, MusicBrainz, …
Nodes: entities/events (blue), enrichments DBpedia (green), Freebase (orange)
1013 clusters of correlated entities/events in FC
ARCOMEM Entities and Enrichments - Graph
=>cluster expansion using related enrichments
Enrichment and Correlation: Clustering
Direct correlations (entities sharing the same enrichments):
E.g. {Mexico, Mexiko, MEXIKO}, {Greece, Griechenland}
#Clusters with at least 2 correlated entities: FC : 1,013 RAR : 1,381
Exploit graph analysis methods to detect closeness of the enrichments
Linking: e.g. related events with organisations and persons
Enrichment&Clustering component has been integrated in the offline processing and released.
SARA integration: Enrichments: direct links to LOD entities; Clusters: finding similar (or related) entities
Outlook: integration of indirect relationships, studying data quality aspects in LOD
[WOLE 12]
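The direct-correlation step described above (entities sharing the same enrichments end up in one cluster) can be sketched as follows. This is a minimal illustration, not the released Enrichment & Clustering component: the function name, the input layout, and the Mexico and Greece URIs are assumptions made for the example, and a union-find structure is used only as one convenient way to group entities transitively.

```python
from collections import defaultdict


def cluster_by_shared_enrichment(entity_enrichments):
    """Group entity labels that share at least one enrichment URI.

    entity_enrichments: dict mapping an entity label to a set of URIs
    (hypothetical layout, chosen for this sketch).
    Returns a list of sets, keeping only clusters with >= 2 entities.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_entity_for_uri = {}
    for entity, uris in entity_enrichments.items():
        find(entity)  # register the entity even if it has no enrichment
        for uri in uris:
            if uri in first_entity_for_uri:
                union(first_entity_for_uri[uri], entity)
            else:
                first_entity_for_uri[uri] = entity

    clusters = defaultdict(set)
    for entity in entity_enrichments:
        clusters[find(entity)].add(entity)
    return [members for members in clusters.values() if len(members) >= 2]


# Illustrative input; only the ECB URI appears in the slides.
sample = {
    "Mexico": {"http://dbpedia.org/resource/Mexico"},
    "Mexiko": {"http://dbpedia.org/resource/Mexico"},
    "MEXIKO": {"http://dbpedia.org/resource/Mexico"},
    "Greece": {"http://dbpedia.org/resource/Greece"},
    "Griechenland": {"http://dbpedia.org/resource/Greece"},
    "ECB": {"http://dbpedia.org/resource/ECB"},
}
clusters = cluster_by_shared_enrichment(sample)
```

With this input the two multi-entity clusters correspond to the {Mexico, Mexiko, MEXIKO} and {Greece, Griechenland} examples, while ECB stays unclustered because nothing else shares its enrichment.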
TOPIC DETECTION
Topic Modeling on Rock am Ring
Probabilistic topic models provide a suite of techniques to uncover the hidden semantic themes of a large collection of data
Documents may exhibit multiple topics
Each topic is described by a distribution of probability over the dictionary
Associate each topic with a list of representative documents and write them into the ARCOMEM KB
Rock Am Ring data: 32,864 documents, multilingual (English, German, etc.)
Example topics (term and probability):
Album 0.021, Metal 0.015, Songs 0.014, Band 0.013, Dj 0.007, Lyrics 0.004
Rock 0.055, Am 0.050, Ring 0.042, Festival 0.009, Tickets 0.003
Fashion 0.003, Collection 0.003, Food 0.003, Style 0.003, Color 0.002
Page 0.007, Site 0.005, Web 0.005, Click 0.004, Link 0.004
The Topic Detection module is based on Mahout's Collapsed Variational Bayes implementation, which scales to very large datasets
Task 1: Topic Detection
Task 2: Assign Documents to Topics
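Task 2 can be illustrated with a small sketch that assumes Task 1 has already produced a per-document topic distribution. It does not call Mahout; the function names, the dictionary layout, and the sample probabilities are all hypothetical and only show how documents might be assigned to topics and how representative documents per topic could be selected.

```python
def assign_dominant_topic(doc_topic):
    """Map each document to the index of its most probable topic.

    doc_topic: dict of document id -> list of topic probabilities
    (assumed output of a topic model; layout chosen for this sketch).
    """
    return {
        doc: max(range(len(probs)), key=probs.__getitem__)
        for doc, probs in doc_topic.items()
    }


def representative_documents(doc_topic, top_n=3):
    """For each topic, list the top_n documents with the highest probability."""
    if not doc_topic:
        return {}
    num_topics = len(next(iter(doc_topic.values())))
    result = {}
    for k in range(num_topics):
        ranked = sorted(doc_topic, key=lambda d: doc_topic[d][k], reverse=True)
        result[k] = ranked[:top_n]
    return result


# Made-up distributions over three topics.
doc_topic = {
    "doc_a": [0.70, 0.20, 0.10],
    "doc_b": [0.10, 0.80, 0.10],
    "doc_c": [0.40, 0.50, 0.10],
    "doc_d": [0.05, 0.05, 0.90],
}
dominant = assign_dominant_topic(doc_topic)
representatives = representative_documents(doc_topic, top_n=2)
```

Because a document may exhibit multiple topics, doc_c is listed as representative for both topic 0 and topic 1, even though its dominant topic is topic 1.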
Temporal Evolution in Topic Modeling
Several challenges: tracking the evolution of topics; early detection of emerging topics; prediction of trendy topics
Topics may evolve and emerge over time [Mantrach 13]
Trendy Topic Detection
[Figure labels: HBase; POS; Named Entity Rec.; Tokens; Trendy Tf-Idf; Ranked List]
Understanding what the trend was at a specific time in the past
Detect events/entities/words that are popular in a time frame
Compute Trendiness: The term frequency in a period is penalized
with the average term frequency over other time periods
Tokens that are popular in all time periods are down-weighted
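The trendiness idea above can be sketched in a few lines. The slides do not give the exact formula, so the ratio used here (term frequency in the target period divided by the smoothed average frequency in the other periods) is an assumption, as are the function name, the smoothing parameter, and the sample tokens.

```python
from collections import Counter


def trendiness(tokens_by_period, target, smoothing=1.0):
    """Rank the terms of one period by a simple trendiness score.

    tokens_by_period: dict of period label -> list of tokens (hypothetical layout).
    score(term) = tf in target period / (average tf in other periods + smoothing)
    The smoothing constant avoids division by zero for terms unseen elsewhere.
    """
    counts = {period: Counter(tokens) for period, tokens in tokens_by_period.items()}
    others = [period for period in counts if period != target]
    scores = {}
    for term, tf in counts[target].items():
        if others:
            avg_other = sum(counts[p][term] for p in others) / len(others)
        else:
            avg_other = 0.0
        scores[term] = tf / (avg_other + smoothing)
    # Highest score first; ties broken alphabetically for a stable ranking.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Made-up token lists for three periods.
periods = {
    "2012-05": ["rock", "ring", "festival", "tickets", "tickets"],
    "2012-06": ["rock", "ring", "festival", "lineup", "lineup", "lineup"],
    "2012-07": ["rock", "ring", "photos"],
}
ranked = trendiness(periods, "2012-06")
```

In this toy input "rock" and "ring" occur in every period and end up at the bottom of the ranking, while "lineup", which appears only in the target period, ranks first.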
DYNAMICS DETECTION
Twitter Dynamics
Motivation: being able to pose questions like:
“What are the hashtags associated with #obama at time t?”
“Find tweets that mention #cnn during the periods that #obama is associated with #romney”
“How have the hashtags associated with #obamawins evolved over time?”
“Find tweets that mention #romney during the peak periods of #obama”
Designed a model that takes the temporal aspect for associating hashtags in tweets into account (e.g. based on co-occurrence)
Implemented query operators for retrieving the tweets that satisfy complex conditions: filter, fold, jump, merge, join
Implemented a prototype system
Experiments with 25,000 tweets about the US elections
[WOSS 12]
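A co-occurrence-based association model of the kind mentioned above can be sketched as follows. This is not the prototype system and does not implement the filter, fold, jump, merge, or join operators; the function names, the (period, text) input layout, the sample tweets, and the #election2012 hashtag are assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict
from itertools import combinations

HASHTAG = re.compile(r"#\w+")


def build_associations(tweets):
    """Count hashtag pairs that co-occur in the same tweet, per time period.

    tweets: iterable of (period, text) pairs (hypothetical layout).
    Returns a dict of period -> Counter keyed by sorted hashtag pairs.
    """
    associations = defaultdict(Counter)
    for period, text in tweets:
        tags = sorted({tag.lower() for tag in HASHTAG.findall(text)})
        for pair in combinations(tags, 2):
            associations[period][pair] += 1
    return associations


def associated_hashtags(associations, tag, period):
    """Answer 'which hashtags are associated with `tag` at `period`?'."""
    tag = tag.lower()
    result = Counter()
    for (a, b), count in associations.get(period, {}).items():
        if a == tag:
            result[b] += count
        elif b == tag:
            result[a] += count
    return result.most_common()


# Made-up tweets in two time periods.
tweets = [
    ("t1", "#obama meets voters #election2012"),
    ("t1", "#Obama and #romney debate #election2012"),
    ("t2", "#obama #obamawins tonight"),
    ("t2", "#cnn calls it #obamawins #obama"),
]
associations = build_associations(tweets)
at_t1 = associated_hashtags(associations, "#obama", "t1")
at_t2 = associated_hashtags(associations, "#obama", "t2")
```

Comparing at_t1 with at_t2 shows how the hashtags associated with #obama shift between periods, which is the kind of question the temporal model is meant to answer.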
Change Period
Named Entity Evolution
Named Entities (NE): people, places, companies...
Characteristics of Named Entity Evolution (NEE)
Same thing but different terms over time
Change occurs over short periods of time
Small or no concept shift
Announced to the public repeatedly
Goal: Find method for named entity evolution recognition independent from external knowledge sources
Example name variants: Joseph Ratzinger; Joseph Aloisius Ratzinger; Cardinal Ratzinger; Cardinal Joseph Ratzinger; Pope Benedict; Pope Benedict XVI; Benedict XVI
[TPDL 12]
Named Entity Evolution Recognizer (NEER)
[Figure labels: Identifying Change Periods (Burst Detection); Extract Text; NLP Processing; Context Creation; Finding Temporal Co-references; Filtering Co-References]
Benedict XVI → Joseph Ratzinger → Cardinal Ratzinger
1. Pope Benedict XVI; 2. Pope Benedict; 3. Benedict XVI; 4. Cardinal Ratzinger; 5. Pope; 6. Benedict
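The slides name burst detection as the way to identify change periods but do not describe the algorithm. As a rough stand-in, the sketch below flags periods whose term frequency exceeds the mean by a multiple of the standard deviation. The threshold rule, the function name, the parameter k, and the yearly counts are all assumptions for illustration and should not be read as the NEER method.

```python
from statistics import mean, pstdev


def detect_bursts(counts, k=1.5):
    """Return the periods whose frequency exceeds mean + k * standard deviation.

    counts: list of (period, frequency) pairs for one term (hypothetical layout).
    This simple threshold rule stands in for whatever burst detection
    the original system uses.
    """
    values = [frequency for _, frequency in counts]
    if not values:
        return []
    threshold = mean(values) + k * pstdev(values)
    return [period for period, frequency in counts if frequency > threshold]


# Made-up yearly mention counts for a single name.
yearly_mentions = [
    (2001, 12), (2002, 10), (2003, 11), (2004, 13),
    (2005, 95), (2006, 14), (2007, 12),
]
change_periods = detect_bursts(yearly_mentions)
```

Periods flagged this way would then be the candidate change periods in which text is extracted and searched for temporal co-references.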
In his latest address to American bishops visiting Rome, Pope Benedict XVI stressed that Catholic educators should remain true to the faith -- a reminder issued just in time for another tense season of commencement addresses. No, the pope did not mention Georgetown University by name when discussing the Catholic campus culture wars.
Evaluation Results
Burst detection found 73% of all change periods in total
High recall for an unsupervised method
Machine learning boosts precision
Data set: http://www.l3s.de/neer-dataset/
Barack Obama: Senator; State Senator Barack Obama; Senator-elect Barack Obama; Senator Barack Obama; Illinois Democrat
Vladimir Putin: President-elect Vladimir V Putin; Minister Vladimir Putin; Acting President Vladimir V Putin; President Vladimir V Putin
Processing Chain [NEER Coling 12]
FOKAS – Formerly Known As Search Engine [FOKAS Coling 12]
http://www.l3s.de/fokas/
References
[SDA 12] Dietze, S., Maynard, D., Demidova, E., Risse, T., Peters, W., Doka, K., Stavrakas, Y., Entity Extraction and Consolidation for Social Web Content Preservation, 2nd SDA Workshop, Pafos, 2012.
[WOLE 12] Nunes, B. P., Kawase, R., Dietze, S., Taibi, D., Casanova, M. A., Nejdl, W., Can entities be friends?, Proc. of the WOLE 2012 Workshop at ISWC 2012, Boston, US, 2012.
[KECSM 12] Maynard, D., Dietze, S., Hare, J., Peters, W. (Eds.), Proc. of the 1st KECSM Workshop at ISWC 2012, CEUR Workshop Proceedings Vol. 895, 2012.
[TPDL 12] Risse, T., Dietze, S., Peters, W., Doka, K., Stavrakas, Y., Senellart, P., Exploiting the Social and Semantic Web for guided Web Archiving, TPDL 2012, Pafos, Cyprus, September 2012.
[ICDM 12] Barbieri, N., Bonchi, F., Manco, G., Topic-aware Social Influence Propagation Models, Proc. of ICDM 2012, Brussels, Belgium, December 2012.
[WSDM 13] Barbieri, N., Bonchi, F., Manco, G., Cascade-Based Community Detection, Proc. of WSDM 2013, Rome, Italy, February 2013.
[NEER Coling 12] Tahmasebi, N., Gossen, G., Kanhabua, N., Holzmann, H., Risse, T., NEER: An Unsupervised Method for Named Entity Evolution Recognition, Coling 2012, Mumbai.
[FOKAS Coling 12] Holzmann, H., Gossen, G., Tahmasebi, N., fokas: Formerly Known As -- A Search Engine Incorporating Named Entity Evolution, Proc. of Coling 2012, Mumbai, India.
[WOSS 12] Plachouras, V., Stavrakas, Y., Querying Term Associations and their Temporal Evolution in Social Data, Int. VLDB Workshop on Online Social Systems (WOSS 2012).
[ICMR 12] Hare, J., Samangooei, S., Dupplaw, D., Lewis, P. H., ImageTerrier: an extensible platform for scalable high-performance image retrieval, ACM ICMR'12, Hong Kong, HK.
[MTA 12] Hare, J. S., Samangooei, S., Lewis, P. H., Practical scalable image analysis and indexing using Hadoop, Multimedia Tools and Applications, 1-34, 2012.
[Mantrach 13] Mantrach, A., A Joint Past and Present NMF for Topic Detection and Transitions in Social Media, submitted 2013.