University of Amsterdam

Master Thesis

Temporal Summarization of News Streams

Author: Georgeta-Cristina Gârbacea
Supervisor: Dr. Evangelos Kanoulas

A thesis submitted in fulfillment of the requirements for the degree of Master of Science in Artificial Intelligence

April, 2016


Acknowledgements

I would like to thank my supervisor Dr. Evangelos Kanoulas for his guidance and support throughout this thesis. I would not have been able to carry out this work without the constant help, useful advice and uplifting encouragement I received from him over the past year.

I would also like to thank Prof. Dr. Maarten de Rijke for giving me the chance to join the Information and Language Processing (ILPS) group within the University of Amsterdam, and the Amsterdam Data Science Research Center as a Master student. It was a very inspiring environment to be in, where I had the opportunity to work on many interesting research problems that challenged my thinking.

I sincerely thank all present and former ILPS members and co-authors, especially Manos Tsagkias, Daan Odijk, David Graus, Isaac Sijaranamual, Zhaochun Ren, Ridho Reinada, and Nikos Voskarides. The interesting discussions we had have always been very helpful and a true source of inspiration.

Furthermore, I would also like to thank Prof. Dr. Maarten de Rijke, Prof. Dr. Arjen de Vries, and Dr. Piet Rodenburg for agreeing to be part of my thesis defence committee.

Finally, I would like to express my gratitude to my parents, brother, and friends for their endless support.


Abstract

Monitoring and analyzing the rich and continuously updated content in an online environment can yield valuable information that allows users and organizations to gain useful knowledge about ongoing events and, consequently, to take immediate action. This calls for effective ways to accurately monitor, analyze and summarize the emergent information present in an online environment. News events such as protests, accidents, or natural disasters represent a unique source of rapidly updated information. In such scenarios users who are directly affected by the event have an urgent need for up-to-date information that enables them to initiate and carry out immediate action. Temporal summarization algorithms filter large volumes of streaming documents and emit sentences that constitute salient event updates. Systems developed for this task typically combine, in an ad-hoc fashion, traditional retrieval and document summarization algorithms to filter sentences inside documents. Retrieval and summarization algorithms, however, have been developed to operate on static document collections. Therefore, a deep understanding of the limitations of these approaches when applied to a temporal summarization task is necessary. In this work we present a systematic analysis of temporal summarization methods, and demonstrate the limitations and potential of previous approaches by examining the retrievability and the centrality of event updates, as well as the existence of inherent characteristics in update versus non-update sentences. We also probe the utility of traditional information retrieval methods, event centrality and event modelling techniques at identifying salient sentence updates. Last, we employ supervised machine learning methods for event summarization. Our results show that retrieval algorithms have a theoretical upper bound that does not allow for the identification of all relevant event updates; nevertheless, we outline promising directions towards improving the performance of an efficient temporal summarization system.


Contents

1 Introduction
  1.1 Research Questions
  1.2 Contributions

2 Related Work
  2.1 TREC Temporal Summarization
    2.1.1 TREC Temporal Summarization 2013
    2.1.2 TREC Temporal Summarization 2014
    2.1.3 TREC Temporal Summarization 2015
  2.2 News tracking and summarization
    2.2.1 Event Detection
      Specified vs. unspecified event detection
      Retrospective vs. new event detection
      Supervised vs. unsupervised event detection
    2.2.2 Event Tracking
    2.2.3 Event Summarization

3 Task, Datasets and Evaluation Metrics
  3.1 TREC Temporal Summarization Task
  3.2 Datasets
  3.3 Evaluation Metrics
    3.3.1 Evaluation Metrics Example

4 Upper Bound Analysis
  4.1 Main Approaches
  4.2 Experimental Design
  4.3 Results and Analysis
    4.3.1 Retrieval Algorithms: Are event updates retrievable?
    4.3.2 Do event updates demonstrate inherent characteristics?
    4.3.3 Summarization Algorithms: Do event updates demonstrate centrality?
    4.3.4 Can entities help in multi-document summarization?
  4.4 Conclusion

5 Methods and Techniques
  5.1 Data pre-processing and indexing
  5.2 Information retrieval module
  5.3 Information processing module
    5.3.1 Methods and Techniques
  5.4 Novelty Detection
  5.5 Experimental Results
  5.6 Analysis and Conclusions
  5.7 Future Work

6 Machine Learning for Summarization
  6.1 Approach
  6.2 Experimental Setup
  6.3 Results and Analysis
  6.4 Conclusion
  6.5 Future Work

7 Conclusion

A Appendix A

B Appendix B

Bibliography


Chapter 1

Introduction

With the exponential growth of the World Wide Web, finding relevant information buried in an avalanche of data has become increasingly difficult. Automatic tools that offer people timely access to up-to-date news and are able to digest information from various sources are in high demand nowadays, since they can alleviate the information overload problem when the volume and diversity of data is overwhelming. As the amount of information increases, the interest in systems that can produce a concise and fluent overview of the important content grows larger. This calls for effective summarization techniques that can help people grasp the essential pieces of information conveyed. Recent years have witnessed the development of applications capable of summarizing news streams, scientific articles, voicemail, meeting recordings, broadcast news and videos [75]. Even though these systems are far from perfect, they have successfully shown their utility in helping users cope with vast amounts of data in a timely manner.

During unexpected crisis events such as natural disasters, mass protests or human catastrophes, people have an urgent need for information, especially if they are directly involved in the event. Crisis situations typically involve many open questions and uncertainty, and it is often necessary to make quick decisions with only limited knowledge about a developing event. Generally, at the beginning of an event there is little relevant content available and information is very scarce. As the event develops, on-going information becomes available through news agencies, social media, television, newspapers, and radio stations. Social media in particular has become the main channel for people to post situation-sensitive information, allowing affected populations and those outside the impact zone to stay informed on “what is happening right now”, and learn “first hand” news in almost real-time. This enables members of the public, humanitarian organisations and formal agencies to take immediate action in time-dependent and safety-critical situations. The role the public plays in disaster response efforts is essential when it comes to gathering and spreading critical information, fast communication, and the organization of relief efforts [44].

Time-critical news events happen unexpectedly, and information about the topic, while often voluminous, evolves rapidly. On-time and up-to-date information about the event is essential throughout the entire disaster lifecycle (preparation, impact, response and recovery) to people directly involved in the event, first responders, formal response agencies, and local, national and international crisis management organizations. However, collecting authoritative news is especially challenging during major events which involve extensive damage – the diversity of news sources disseminating the event causes many rumours and controversial information to propagate, and the high volume of data can overwhelm those trying to monitor the situation [38]. In addition, the quality of information is frequently degraded by the inclusion of unimportant, duplicate or inaccurate content, and this makes finding the right information in a timely fashion a challenging process. Emergency situations create a scenario where it is important to be able to present users with only novel and relevant facts that characterize an event as it develops. Because information is needed rapidly, people cannot wait for comprehensive reports to materialize – they have limited time to follow the event and therefore they should receive updates that include only the most pertinent information. This calls for methods that can successfully monitor, filter, analyze and organize the dynamic and overwhelming amount of data produced during the duration of a disaster or crisis event.

Automatic summarization techniques have the potential to assist during catastrophic events, by delivering relevant and salient information at regular time intervals, including in cases when human volunteers are unable to do so [49]. According to Radev et al. [87], a summary is defined as “a text that is produced from one or more texts, that conveys important information in the original text(s), and that is no longer than half of the original text(s) and usually significantly less than that”; in the given context the word “text” refers to any speech, video, multimedia document, or hypertext content. Therefore, the main goal of a summarizer system is to present the user with a summary of the main ideas in the input in a concise and coherent way, reducing the size of the input while preserving its initial degree of informativeness. Since the informative content in a document appears in bursts, the main challenge of an efficient summarization system is distinguishing between the more and the less informative segments of text.

Document summarization aims at automatically creating a representative summary or synopsis of an entire document by finding its most informative sentences. In general, there are two main approaches to document summarization: extraction and abstraction. Extractive summaries (extracts) tend to be more practical and work by selecting a subset of the most significant concepts inside the original documents (existing words, phrases, or sentences) to form a summary. In contrast, abstractive summaries (abstracts) first need to “understand” the input, i.e. build an internal semantic representation of the documents, and then paraphrase the salient concepts using natural language generation techniques. The end result is an output summary close to what a human might generate. However, due to current limitations in natural language processing technology and complexity constraints, research into abstractive methods for summarization is restricted to specific domains only, such as image collection and video summarization. In terms of the input, summarization systems can produce a summary of either one single document, in which case they perform single-document summarization, or of multiple source documents, in which case they perform multi-document summarization. Motivated by the ever-growing size of the web and increased information access, multi-document summarization is extremely useful in providing a brief digest of many documents on the same topic or discussing the same event. In addition, according to the topic of focus, summarization can be further distinguished into generic summarization and query-focused summarization. Generic summarization typically addresses a broad audience: anyone may end up reading the summary, and therefore no assumptions are made about the genre or domain of the input documents, nor the goal for generating the summary [75]. Generic summarization primarily answers the question “What is this document/collection of documents about?”. In contrast, query-focused summarization attempts to summarize the information content given a query issued by the user, and a document or set of relevant documents returned by a search engine. The summarizer takes the query into account, and finds information within the relevant document(s) that relates to the user query. Hence, query-focused summarization aims to answer the question “What does this document (or set of documents) say about the query?”.

One drawback of the approaches presented above is that they take a retrospective perspective when issuing updates, and assume that all the relevant information about an event has already been collected. This makes them unsuitable for scenarios of long-running events with a dynamic flow of information, which typically require concise on-topic updates to be issued in a timely manner. This observation introduces a new dimension to summarization – time – so that the information conveyed is also time sensitive. Update summarization takes this time dimension into account, and produces an incremental summary which contains the most salient and evolving information from a collection of input documents, starting from the assumption that the user has prior knowledge about the event and has read previous documents on the topic. The summary is expected to convey the most important developments of an event beyond what the user has already seen, i.e. only new information not covered in an initial update summary (which is the output of an Initial Summarization step).

In this thesis we focus on efficiently monitoring the information associated with an event over time, following the specifications of the TREC Temporal Summarization track (http://www.trec-ts.org) [13]. The goal of the task is to encourage the development of systems which can detect useful, novel, and timely sentence-length updates about a developing crisis event, when new information rapidly emerges in the presence of a dynamic corpus. Given as input a named crisis event with high impact, and a large-volume stream of documents concerning that event, systems are required to emit a series of sentence updates that describe the evolution of the event over time. An optimal summary covers all of the essential information about the event with no redundancy, and each new piece of information is added to the summary as soon as it becomes available. Writing a concise and fluent summary requires the capability to recognize, modify and merge information expressed in different sentences inside the input. Given this pre-defined task, our goal is to investigate methods that can help in efficiently summarizing long-running events by identifying the important content at the sentence level. More specifically, we delve into the problem of extractive query-focused multi-document update summarization, and analyze the potential and limitations of existing information retrieval and machine learning methods at identifying salient sentence updates deemed to be included in temporal summaries of news event streams. We also include a study through which we aim to obtain a deeper understanding of how and why some of the aforementioned approaches fail, and what is required for a successful temporal summarization system. We believe that such an analysis is necessary and can shed light on developing more effective algorithms in the future.


1.1 Research Questions

The main research question of this thesis is whether we can effectively identify and broadcast short, timely, relevant and novel sentence-length updates about an ongoing crisis event in an online, sequential setting. In order to address this research question, we aim at answering the following sub-questions. First, we examine the limitations of retrieval algorithms for the temporal summarization of news events (RQ1). To this end, we focus on the overlap between the language of an event query and the language of an event update in terms of their shared vocabulary, and based on that we perform an upper bound analysis of the potential and limitations of information retrieval algorithms. Furthermore, we investigate the existence of inherent characteristics in update versus non-update sentences (RQ2). We are mainly interested in finding which sentences are important to include in a summary, and whether we can identify discriminative terms that characterize these sentence updates. We are also interested in assessing whether an event update is central inside the documents that contain it (RQ3). In addition, we examine whether extracting entities and relations can help in summarization (RQ4). Finally, we investigate the performance of retrieval techniques individually, and combined with other features, when we employ supervised machine learning methods for summarization in a learning-to-rank framework (RQ5).

1.2 Contributions

The main contributions of this work are:

• a systematic analysis of the limitations and potentials of retrieval, summarization, and event update modeling algorithms;

• insights into the performance of the main methods for event update identification on the TREC Temporal Summarization dataset;

• an analysis of the efficiency of supervised machine learning techniques for the task at hand;

• an examination of what makes for a good sentence update, with suggestions for future work and further improvements.

The remainder of this thesis is organized as follows. In Chapter 2 we discuss related work. The TREC Temporal Summarization task, datasets and evaluation metrics are presented in Chapter 3. In Chapter 4 we include an upper bound analysis of the limitations of the employed methods for sentence update extraction, including the utility of entities in text summarization. In Chapter 5 we describe the experimental setup and the main methods used. We employ machine learning methods for ranking sentence updates in Chapter 6. We analyze our results, conclude, and discuss future research directions in Chapter 7.


Chapter 2

Related Work

In this chapter we present work in temporal summarization that is directly related to ours. More specifically, we first describe the design of the best-performing systems participating in previous editions of the TREC Temporal Summarization track, and afterwards we present related work in the area of temporal summarization of online news.

2.1 TREC Temporal Summarization

The Temporal Summarization (TS) task is one of the tracks in the Text REtrieval Conference (TREC), an on-going series of workshops that encourage research in information retrieval and related applications (proceedings of the 2013-2015 editions are available at http://trec.nist.gov/pubs/trec22/trec2013.html, http://trec.nist.gov/pubs/trec23/trec2014.html, and http://trec.nist.gov/pubs/trec24/trec2015.html). As we describe in Chapter 3, TREC TS provides high volume streams of news articles and blog posts crawled from the Web, for a set of pre-defined crisis events. Each article or blog post is associated with an event, and each event is represented by a topic description (a textual query briefly describing the event), and the start and end time of the period during which the event took place. Participant systems are required to return a set of sentences extracted from the corpus of relevant documents for each event; these sentences form the summary of the respective event over time. An optimal summary covers all the essential information about the event with no redundancy, and each new piece of information is added to the summary as soon as it becomes available. In addition, it is often the case that an event can be decomposed into a variety of fine-grained atomic sub-events, called nuggets. The ideal temporal summarization system would update users as soon as each of these sub-events occurs. Therefore, an optimal summary should cover all the information nuggets associated with an event in the minimum number of sentences describing the event.

To evaluate the quality of a summary, TREC TS introduced custom evaluation metrics developed by the track organizers, that look specifically at the relevance, coverage, novelty and latency of the updates. These metrics include time-sensitive versions of precision and recall, ensuring that systems are penalized when information about an event is delivered long after the event occurred. Expected Gain denotes how much information is provided by the updates, while Comprehensiveness measures how well the updates cover the event. Latency is used to measure how fast updates are issued by the system. Expected Latency Gain is similar to traditional precision, with the observation that runs are penalized for delaying the emission of relevant updates. Latency Comprehensiveness is analogous to traditional recall, and measures the coverage of relevant nuggets in a run. Runs are scored under a combined measure of the last two metrics, i.e. the harmonic mean of the normalized Expected Latency Gain and the Latency Comprehensiveness.
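
For reference, this combined score is an ordinary harmonic mean: writing nE[LG] for the normalized Expected Latency Gain and LC for the Latency Comprehensiveness of a run, it takes the form

    \[ H = \frac{2 \cdot \text{nE[LG]} \cdot \text{LC}}{\text{nE[LG]} + \text{LC}} \]

so a run has to do reasonably well on both the precision-like and the recall-like aspect in order to obtain a high combined score.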

Most TREC TS participants addressed the problem in multiple stages, first selecting relevant documents likely to concern the event, and afterwards focusing on extracting relevant and novel sentences based on a number of filters. All systems use a data preprocessing step, which consists of decrypting, decompressing and deserializing the data, and indexing it using open source tools to facilitate dealing with the large volume of documents and news articles. Most participants use query expansion techniques to improve retrieval performance, and to overcome the word mismatch problem between the language of the query and the vocabulary used in relevant updates [112, 18, 57, 62, 50, 114]. In what follows we describe systems of particular interest from the past editions of TREC TS in 2013, 2014 and 2015.

2.1.1 TREC Temporal Summarization 2013

The best performing system in TREC TS 2013 in terms of the Expected Gain metric is the information extraction system developed by team PRIS [112]. They use hierarchical Latent Dirichlet Allocation on the set of relevant event documents to infer potential event topics. The generated topic descriptions undergo a manual filter to select keywords describing the topic, and these keywords are later used in scoring sentences based on the similarity between the keyword vector of a topic and the vector of a given sentence. Sentences that are most similar to the topic are selected as output. Their system includes a post-processing step to ensure that the most up-to-date information is selected every hour, and that duplicate sentences are removed from the output. However, their approach is semi-supervised and involves human labour, which is expensive and time-consuming.

In the same TREC edition, the University of Waterloo [18] achieved the highest score in terms of the Latency Comprehensiveness metric. They first rank documents using the query likelihood model in increasing order of their time stamps, and extract relevant sentences from these documents using standard retrieval techniques. They achieve high recall on the task, however their method performs poorly in terms of precision.

The highest harmonic mean in the competition was obtained by ICTNET [57]. They first determine whether a document is relevant to the topic by checking whether the title of a news article covers all words inside the event query. Afterwards they learn a set of crisis-specific trigger words from the training data – for example, words such as “kill”, “die” and “injure”. This set of keywords is augmented with WordNet synonyms, and sentences are scored based on how many of these keywords are found inside a sentence. They check for novelty, and discard the current sentence if its similarity to past emitted sentences exceeds a certain threshold. However, their approach can fail when the type of the event is not known in advance, or when the information present is not specific to the given event type.
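
A minimal sketch of this kind of trigger-word scoring with a novelty filter is given below; the seed keyword set is a hypothetical example, and Jaccard word overlap stands in for whichever similarity measure the system actually used.

    # Illustrative trigger-word scoring with a novelty filter (a sketch, not the ICTNET code).
    TRIGGER_WORDS = {"kill", "killed", "die", "died", "injure", "injured", "evacuate"}

    def keyword_score(sentence):
        # number of crisis trigger words occurring in the sentence
        return len(set(sentence.lower().split()) & TRIGGER_WORDS)

    def jaccard(a, b):
        a, b = set(a.lower().split()), set(b.lower().split())
        return len(a & b) / max(len(a | b), 1)

    def select_updates(candidates, min_score=1, novelty_threshold=0.5):
        emitted = []
        for sentence in candidates:
            if keyword_score(sentence) < min_score:
                continue  # not on-topic enough for a crisis update
            if any(jaccard(sentence, prev) > novelty_threshold for prev in emitted):
                continue  # too similar to an already emitted update
            emitted.append(sentence)
        return emitted

    print(select_updates([
        "Officials say the blast killed at least ten people.",
        "The blast killed at least ten people, officials say.",
        "The city will hold a marathon next month.",
    ]))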

Other teams participated in the competition to investigate real-time extractive models for the summarization of crisis events from news streams. The University of Glasgow [62] proposed a core summarization framework able to process hourly document batches at the document, sentence, and temporal level. At the document level, they select the top 10 documents per hour as a function of their relatedness to the query; at the sentence level they identify sentences from these documents that are most likely to be useful for inclusion in the summary; lastly, at the temporal level they compare each candidate sentence with the current representation of the event, and emit only novel sentences that have not been selected from prior batches. To this end, they use the Maximal Marginal Relevance (MMR) algorithm designed to balance relevance and novelty [21]. They find that during periods of low prominence, off-topic and non-relevant content is likely to be included in the summary, which leads to topic drift within the final summary. Therefore, the volume of information to be summarized should be adjusted during time periods when no important sub-events occur. They conclude that adapting the sentence selection strategy over time is critical for an effective update summarization system. Finally, the HLTCOE team [110] identifies three key challenges any temporal summarization system must address: i) topicality: the appropriate selection of on-topic sentences, ii) novelty: the inclusion in the summary of sentences that contain novel content, and iii) importance: the selection of informative sentences that a human being would consider adding to the summary. They create bag-of-words representations for each topic, including unigrams, named entities and predicates, starting from the event query description and the titles of documents; each time a new sentence is selected for inclusion into the summary, these representations are updated accordingly. They conclude that dynamically updating a topic’s representation as sentences are selected for inclusion in the summary is a challenging process, and that more sophisticated models are needed for the extraction of relevant sentences, capturing simultaneously a document’s relevance to the topic and a sentence’s relevance inside a document.

2.1.2 TREC Temporal Summarization 2014

TREC TS 2014 released a pre-filtered version of the TREC KBA dataset containing documents more likely to include relevant sentences, this time for a new set of crisis events. Participants could choose between using the entire corpus of documents just like in the previous year, or running experiments on the pre-filtered subset. The best performing run in the competition in terms of the H-metric belongs to Columbia University [50, 51]. Their system combines supervised machine learning approaches with the Affinity Propagation clustering technique for sentence salience prediction. They use a diverse set of features for selecting an exemplar set of sentences from the relevant event documents. These features include surface features, query features, language model scores, and geographic and temporal relevance features. They use these features to train a machine learning classifier for predicting the centrality of a sentence within an event. Sentences which pass this filter constitute the input to the next clustering step, which ultimately ensures that the selected sentences are the most central and relevant. Finally, they check for novelty by imposing a minimum salience threshold on each emitted sentence update.
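
The exemplar-selection idea can be sketched with Affinity Propagation from scikit-learn; plain tf.idf vectors stand in here for the much richer feature set described above.

    # Cluster candidate sentences and keep each cluster's exemplar sentence.
    from sklearn.cluster import AffinityPropagation
    from sklearn.feature_extraction.text import TfidfVectorizer

    sentences = [
        "A magnitude 7.0 earthquake struck the capital on Tuesday.",
        "The earthquake measured 7.0 and hit near the capital.",
        "Rescue teams report dozens of people injured.",
        "Dozens of injuries reported by rescue teams.",
    ]
    X = TfidfVectorizer().fit_transform(sentences).toarray()
    clustering = AffinityPropagation(random_state=0).fit(X)
    exemplars = [sentences[i] for i in clustering.cluster_centers_indices_]
    print(exemplars)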

Team BJUT [114] scored best in terms of the Expected Latency Gain results. Their temporal summarization system includes a corpus pre-processing module, an information retrieval module, and an information processing module. After performing basic pre-processing steps, they index the data and cluster sentences using the k-means clustering algorithm. They choose the centers of the clusters and the top-k sentences from each cluster for inclusion into the summary, after ranking them by time and cosine similarity to the query. While attaining good precision, their system does less well in terms of recall.

The team from the University of Glasgow [63] achieved the best recall, devising a real-time filtering framework for document processing and sentence selection. To this end, they leverage TREC TS historical data from the previous year (2013) to train a machine learning classifier for predicting whether a document is relevant to a given event or not. They expand event representations based on Freebase and DBPedia, and process documents in real-time as they arrive. Each document is assessed for the extraction of salient sentences from it. To pass the selection filter, a sentence should be of medium length, well written, and contain one or more named entities. Supervised machine learning methods are employed to find well-written sentences. Finally, there is a novelty-based filtering module that ensures only sentences with low cosine similarity to already selected sentences are emitted as updates. One of the important observations of their work is that there is a high degree of vocabulary mismatch between the description of an event (i.e. the event query) and the associated information nuggets for that event. This makes the task of identifying relevant updates particularly challenging. To tackle this problem, they leverage the positional relationship of sentences inside a document to emit updates that do not exhibit any semantic overlap with the event query, but are found in close proximity to sentences estimated as relevant. They find that using a larger browsing window increases the comprehensiveness of the summary, but harms the expected latency gain scores. Their main conclusion is that using sentence proximity within documents can address the semantic gap between the event query and the relevant updates that share no common terms with the query.

Other teams applied various techniques. The BUPT PRIS team [85] achieved the best latency score. Their main focus is on keyword mining through the use of query expansion techniques. To this end, they expand query terms with words with similar meaning based on WordNet, Word2Vec, and a neural network model. Sentences are scored in batches at regular time intervals by the number of keywords contained inside the sentence. IRIT [1] proposes a generic event model, starting from the assumption that event updates contain specific crisis words independent of the event type (for example, general keywords such as “storm”, “hurricane” and “bombing”). They build a generic event model by estimating term frequencies of words inside the gold standard corpus, and use this model to score incoming sentences for inclusion into the summary. Because of the topic drift between the event query and the relevant event updates, ICTNET [24] uses the topic of a document to determine whether the respective document is related to the query or not. They infer latent semantic topics from documents using Latent Dirichlet Allocation, and use the generated list of keywords and their weights in sentence scoring. Additionally, they mine a list of discriminative words with the χ2 method, and use these words as features in training a Support Vector Machines classifier for the selection of relevant and novel sentence updates.
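
Embedding-based query expansion of the kind used by these systems can be sketched with gensim's Word2Vec; the tiny tokenized corpus and the expansion size below are purely illustrative, and in practice the model would be trained on the event document stream.

    # Expand an event query with the nearest neighbours of each query term in embedding space.
    from gensim.models import Word2Vec

    corpus = [
        ["hurricane", "makes", "landfall", "storm", "surge", "floods", "coast"],
        ["storm", "causes", "severe", "flooding", "and", "wind", "damage"],
        ["residents", "evacuate", "ahead", "of", "the", "hurricane"],
    ]
    model = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1, seed=1)

    query = ["hurricane"]
    expansion = [word for term in query for word, _ in model.wv.most_similar(term, topn=3)]
    print(query + expansion)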


2.1.3 TREC Temporal Summarization 2015

TREC TS 2015 builds upon the full/partial-filtering and summarization tasks introduced in the previous years (2013 and 2014). In addition, this is the first year in which a summarization-only task is introduced, providing participants with lower volume streams of on-topic documents for a given set of events. As we are mainly interested in this task, in what follows we describe the architecture of participant systems in the summarization-only track.

The best scoring system in the competition was developed by the University of Waterloo [90]. They observe that news articles frequently have an inverted pyramid structure: the title describes the newest and most relevant information, the first paragraph explains the new information on the event, while the remaining paragraphs only mention supportive information. Moreover, it is often the case that the first relevant documents about the event reveal important information such as date, time, location or initial estimates of damages, while the corpus later expands to provide more precise information in the form of continuous updates. Based on this line of reasoning, they develop a system that processes documents in 5 minute batches, retrieving the highest scoring sentences for each time interval using BM25. To avoid pushing redundant updates, they compute the cosine similarity between the document titles of the previously pushed updates and the document title of the proposed update. To ensure that good quality updates are added to the summary, they develop a custom metric for sentence selection by looking at documents for which the number of paragraphs is strictly higher than the number of images contained. BJUT [111] use two different clustering algorithms for summarization: non-negative matrix factorization with similarity preserving feature regularization, and affinity propagation. Compared to other clustering algorithms, affinity propagation presents the advantage that it is a fast and efficient clustering algorithm for large datasets which does not require the number of clusters to be specified in advance. They find that these two methods perform similarly, however both seem to lack stability.
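
The BM25 ranking used in that system can be sketched with a plain textbook implementation over a small in-memory corpus; the parameter values and toy sentences are illustrative only.

    # Score tokenized "documents" (here, sentences) against the event query with BM25.
    import math
    from collections import Counter

    def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
        N = len(corpus)
        avgdl = sum(len(d) for d in corpus) / N
        tf = Counter(doc_terms)
        score = 0.0
        for term in query_terms:
            df = sum(1 for d in corpus if term in d)
            if df == 0:
                continue
            idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
            freq = tf[term]
            score += idf * freq * (k1 + 1) / (freq + k1 * (1 - b + b * len(doc_terms) / avgdl))
        return score

    corpus = [s.split() for s in [
        "flood waters rise in the city centre",
        "the city declares a state of emergency after the flood",
        "sports results from the weekend",
    ]]
    query = "city flood emergency".split()
    ranked = sorted(corpus, key=lambda d: bm25_score(query, d, corpus), reverse=True)
    print(" ".join(ranked[0]))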

Starting from the assumption that “events are about entities”, the University of Glasgow [65] aims at creating summaries of events using features derived from the entities involved in the development of an event. In particular, they investigate the role of features such as entity importance and entity-entity interaction in capturing salient entities, and how these entities connect with each other. The importance of an entity is estimated as the frequency of the respective entity throughout the entire duration of the event, and the entity-entity interaction is estimated via entity co-occurrence. They use these features for scoring sentence updates for inclusion into the event summary, using two distinct corpus processing methods: summarizing the content of an event document by document (real-time scenario), and in hourly batches (near real-time task). To produce temporal summaries of events, they first score sentences by their cosine similarity to the query, after which the set of candidate sentences goes as input into a re-ranking step which makes use of the entity-focused features already mentioned. The top-k sentences which pass this filter are included in the final summary. They observe that processing the corpus hour-by-hour is more effective under the expected gain metric, while processing the corpus document-by-document performs better in terms of comprehensiveness. In both cases the feature encoding the entity-entity interaction is more effective than the feature encoding the entity importance.
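
The two entity signals can be sketched as simple counts; the toy sentences below are already reduced to the entities they mention, which in practice would come from a named entity tagger.

    # Entity importance as frequency over the event; entity-entity interaction as
    # sentence-level co-occurrence counts.
    from collections import Counter
    from itertools import combinations

    sentence_entities = [
        {"Red Cross", "Manila"},
        {"Manila", "Typhoon Haiyan"},
        {"Typhoon Haiyan", "Red Cross", "Manila"},
    ]

    importance = Counter(entity for entities in sentence_entities for entity in entities)
    cooccurrence = Counter()
    for entities in sentence_entities:
        cooccurrence.update(combinations(sorted(entities), 2))

    print(importance.most_common(2))
    print(cooccurrence.most_common(2))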


The UDEL FANG team [59] performs query expansion based on event-type information and information about event-related entities in the query, incorporating external knowledge from Wikipedia. Sentences are scored using the query likelihood method with Dirichlet smoothing. ISCASIR [108] relies on distributed word representations to compute distances between the query and the relevant sentence updates. ILPS.UvA [34] employs a set of retrieval-based and event modeling methods to build the summary of a crisis event. For more details regarding their methods and results, see Appendix B.

2.2 News tracking and summarization

Single and multi-document summarization have long been studied by the natural language processing and information retrieval communities [75, 44, 7, 9, 4, 26, 74]. Multi-document summarization is more complex, and issues such as compression, speed, redundancy and passage selection are critical in the formation of useful summaries. Ideally, multi-document summaries contain the key relevant pieces of information shared across documents without any redundancy, plus some other information unique to individual documents directly relevant to the user’s query. According to Goldstein et al. [36], multi-document summarization is particularly useful when there is a large collection of dissimilar documents available and the user wants to assess the information landscape contained in the collection, or when a query issued by the user returns a collection of topically related documents.

Events play a central role in many online news summarization systems. In general, events are real-world occurrences that unfold over space and time [6]. An event can be defined in multiple ways, either as “the occurrence of something significant associated with a specific time and location” [20], or as “an occurrence causing changes in the volume of text data that discusses the associated topic at a specific time; this occurrence is characterized by topic and time, and is often associated with entities such as people and location” [30]. Topic Detection and Tracking (TDT) [6] has focused on monitoring broadcast news stories and issuing alerts about seminal events and related sub-events in a stream of broadcast news stories at the document level. To retrieve text at different granularities, passage retrieval methods have been widely employed; see the TREC HARD track [3] and INEX ad hoc [47] initiatives for an overview. Passages are typically treated as documents, and existing language modeling techniques that take into account contextual information [22, 72, 33], the document structure [12] or the hyperlinks contained inside the document [77] are adapted for retrieval [52]. Nevertheless, passage retrieval techniques assume a static test collection and are not directly applicable to a streaming corpus scenario. Clustering [106, 105], topic modelling [8, 93, 92], and graph-based approaches [31, 67] have been proposed to quantify the salience of a sentence within a document.

McCreadie et al. [64] introduce the task of incremental update summarization, aimed at producing a series of sentence updates about an event over time to issue to an end-user tracking the event. Their approach builds upon traditional update summarization systems that produce fixed-length update summaries at regular time intervals. During times when no new pieces of information emerge, these update summaries often contain irrelevant or redundant information that needs to be filtered. They model the sentence selection from each update summary as a rank cutoff problem, and predict, based on the current update summary and previous sentences issued as updates, how many sentences from the current update summary to select for inclusion in the final summary. To this end, they use supervised machine learning approaches, and devise a set of features that capture the prevalence of the event, the novelty of the content, and the overall sentence quality across the sentences inside the input update summary. They use these features to predict the optimal rank cutoff θ at which to stop reading the current update summary, given the previously seen sentence updates. Kedzie et al. [49] combine sentence salience prediction with clustering to produce relevant summaries that track events across time. They process relevant documents in hourly batches, by first predicting the salience of each sentence in the current batch, and afterwards selecting the most salient sentences and clustering them. For the task of sentence salience prediction they use a diverse set of features, including language model scores, geographic relevance and temporal relevance, in combination with a Gaussian process regression model. In order to select sentences that are both salient and representative of the current batch, they combine the output of the salience prediction model with the affinity propagation clustering algorithm. Lastly, they select the most salient and representative sentences for the current hour after performing a sequential redundancy check, in decreasing order of sentence salience scores. However, their approach presents a number of limitations: first, because they train the regression model offline, it is difficult to include features that capture information about the incoming document stream or the current summary; second, the clustering algorithm suffers from an inevitable time lag as it needs to collect one hour’s worth of new documents before issuing updates; and third, clustering severely limits the scalability of their system. To address these shortcomings, the same authors propose in [48] a locally optimal learning-to-search algorithm that uses reinforcement learning and learning-based search to sample training data from the vast space of all update summary configurations with all sentence updates, and train a binary classifier which can predict whether or not to include a candidate sentence in the summary of an event. Each environmental state in the reinforcement learning problem corresponds to having seen the first incoming sentences in the stream and a sequence of actions, and is encoded using both static and dynamic features. They learn a policy that maps states to actions – either select or skip the current sentence.
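
Offline salience regression of the kind described above can be sketched with a Gaussian process regressor from scikit-learn; the three feature columns (query term overlap, normalized sentence length, normalized position in the document) and the annotated salience values are invented for the example.

    # Fit a salience regressor offline and use it to score an incoming candidate sentence.
    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor

    X_train = np.array([[0.8, 0.40, 0.05],
                        [0.1, 0.70, 0.60],
                        [0.6, 0.30, 0.10],
                        [0.0, 0.80, 1.00]])
    y_train = np.array([1.0, 0.1, 0.8, 0.0])  # hypothetical annotated salience scores

    model = GaussianProcessRegressor(random_state=0).fit(X_train, y_train)
    print(model.predict(np.array([[0.7, 0.35, 0.15]])))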

Vuurens et al. [105] propose a three-step approach for online news tracking and summarization consisting of the following steps: routing, identification of salient sentences, and summarization. They first build a graph in which each streaming news article is represented as a node, and assign directed edges to the top three nearest neighbours of the article based on the similarity of titles and proximity of publication times. To this end, they introduce a 3-NN streaming variant of the k-nearest neighbour clustering algorithm, with the purpose of detecting newly formed clusters of articles. From these clusters they later extract salient sentences, after applying the 3-NN heuristic one more time. Lastly, they generate a concise summary by selecting the most relevant sentences which include the most recent developments of the given topic.
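
The routing step can be sketched as follows: each arriving article is linked to its three most similar earlier articles, with cosine similarity over tf.idf title vectors standing in for the combined title/publication-time measure used in the paper.

    # Link every incoming article to its three nearest earlier neighbours (a 3-NN sketch).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    titles = [
        "Earthquake strikes coastal city",
        "Coastal city hit by strong earthquake",
        "Government announces new budget",
        "Aftershocks follow coastal earthquake",
    ]
    sim = cosine_similarity(TfidfVectorizer().fit_transform(titles))

    edges = []
    for i in range(1, len(titles)):  # stream order: only earlier articles are candidates
        neighbours = sorted(range(i), key=lambda j: sim[i, j], reverse=True)[:3]
        edges.extend((i, j) for j in neighbours)
    print(edges)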


Althoff et al. [11] generate a timeline of events and relations for entities in a knowledge base, accounting for quality criteria such as relevance, temporal diversity and content diversity. The relevance of an entity is modeled as a linear combination of the relevance of related entities and the importance of related dates. For a given entity of interest, they first generate a set of possible candidate events by searching the knowledge base. Simple events are encoded as paths of length one through the knowledge graph, while compound events are nodes connected through a path in the graph. To ensure an event can be understood by an end-user in natural language, they manually generate description templates for the most frequent paths. In order to select the most diverse and relevant set of events, they rely on submodular optimization and devise a greedy selection algorithm with optimal guarantees of diversity in the temporal spacing of events. Sipos et al. [102] re-interpret summarization as a coverage problem over words anchored in time. The components of the summary they generate for a collection of news articles can be either authors, keywords, or documents extracted from the collection; therefore, they aim to extend document summarization beyond just extractive sentence retrieval. Given a corpus of related articles, they first identify the most influential documents on the content of the corpus, then at each point in time they identify the most influential documents for that respective time period. The next step is identifying the most influential authors for the given period, and lastly the most influential key phrases in the respective time interval. An optimal summary should not only optimize the coverage of the information content, but should also reflect which documents and authors have had the highest influence in the development of the corpus. The key assumption of their approach is that coverage of words can be used as a proxy for the coverage of the information content. They model each task as a maximum coverage problem, and perform optimization via a greedy selection algorithm.
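
Greedy selection under a maximum coverage objective can be sketched as follows, with coverage of raw words standing in for the richer coverage objective described above.

    # Repeatedly pick the sentence covering the most not-yet-covered words.
    def greedy_coverage(sentences, k=2):
        covered, summary = set(), []
        remaining = list(sentences)
        for _ in range(min(k, len(remaining))):
            best = max(remaining, key=lambda s: len(set(s.lower().split()) - covered))
            summary.append(best)
            covered |= set(best.lower().split())
            remaining.remove(best)
        return summary

    print(greedy_coverage([
        "The storm made landfall near the port city",
        "Thousands were evacuated before the storm made landfall",
        "Schools in the port city remain closed",
    ]))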

Ren et al. [92] focus on contrastive themes summarization, quantifying the opposing viewpoints found in a set of opinionated documents. The main challenges of the task are the unknown number of topics and the unknown relationships among topics. To address these, they combine a nested Chinese restaurant process with a hierarchical non-parametric topic model. They extract a set of diverse and salient themes from documents, and, based on the probabilistic distributions of the themes, they generate contrastive summaries incorporating divergence, diversity and relevance.

2.2.1 Event Detection

Many systems for online news and social media processing during periods of crisis focus in their initial stages on the detection of events. Event detection from online news streams represents a vibrant research area, incorporating diverse techniques from different fields such as machine learning, natural language processing, data mining, information extraction and retrieval, and text mining. Some disasters can be predicted in advance (or forewarned) up to a certain level of accuracy, based on meteorological, geographic, demographic, or other types of data [44]. In such cases warning signals are raised before the actual event takes place, and immediate action is taken to minimize the impact of the event on the directly affected population. There are other cases, however, when even though an event cannot be fully anticipated, it can still be forecasted by social media analysis, for example strikes or mass protests [88]. Nevertheless, a large category of events, such as earthquakes or natural disasters, are unexpected events and cannot be predicted before they happen. Automatic detection methods are useful in finding critical information about disasters as quickly as it becomes available. This information is vital for populations in the affected areas, in addition to rescuers and people able to help, who have an urgent need for up-to-date information. Therefore, techniques for the automatic detection of both predicted and unexpected crisis events are in high demand nowadays to assist in making time-critical and potentially life-saving decisions.

Event detection has long been addressed in the Topic Detection and Tracking (TDT) program [5], a research initiative aimed at encouraging the development of tools for news monitoring from traditional media sources, and at keeping users updated about the latest news and developments of an event. TDT was made up of three tasks – story segmentation, topic detection and topic tracking – with the purpose of segmenting the news text into cohesive stories, detecting unforeseen events, and tracking the development of a previously reported event. Three important sub-tasks make up the event detection phase: i) data preprocessing, ii) data representation, and iii) data organization. The data preprocessing step involves stop word removal, text stemming and tokenization. Data representation for event detection can be done either through term vectors (bag of words) or named entity vectors. Term vectors contain non-zero entries for the terms which appear in documents, typically weighted using the classical tf.idf approach [99]. However, the term vector model is likely to suffer from the curse of dimensionality when the text is long. Moreover, the temporal, syntactic and semantic features of the text are lost. The named entity vector is an alternative representation, which tries to answer the 4W questions: who, what, when, and where [70]. In the vector space, similarity between events is measured using the Euclidean distance, Pearson’s correlation coefficient, the cosine similarity, or the more recently introduced Hellinger distance and clustering index.
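
A minimal sketch of the term-vector representation, with tf.idf weighting and cosine similarity between two documents (scikit-learn):

    # Represent two documents as tf.idf weighted term vectors and compare them.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "An earthquake of magnitude 6.5 hit the region on Monday",
        "Monday's magnitude 6.5 earthquake damaged several buildings",
    ]
    X = TfidfVectorizer(stop_words="english").fit_transform(docs)
    print(cosine_similarity(X[0], X[1])[0, 0])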

According to Atefeh et al. [16], methods found in the literature for event detection can be classified according to the event type (specified or unspecified), the detection task (retrospective or new event detection), and the detection method (supervised or unsupervised). We present each category in turn below.

Specified vs. unspecified event detection

Specified Event Detection. Events can be either fully or partially specified, along with related content and metadata information such as the location, time or people involved in the event. Supervised event detection techniques usually exploit this metadata information inside a wide range of machine learning, data mining and text analysis techniques. Sakaki et al. [98] detect hazards and crises such as earthquakes, typhoons and large traffic jams using spatial and temporal information. They formulate the event detection problem as a classification problem, and extract a set of statistical (the number of words), contextual (words surrounding user queries), and lexical (keywords in a tweet) features used to train a Support Vector Machines classifier for event detection. They perform this analysis on Twitter data, and manually label the training set with instances of events and non-events. Their experiments show that statistical features carry the most weight, and that combining them with the rest of the features yields only small improvements in the classification performance. Controversial events that give rise to public debate are detected using supervised gradient boosted decision trees [83]. A diverse set of linguistic, structural, bursting, sentiment, and controversy features are used in ranking controversial-event snapshots. In addition, the importance and the number of entities are found useful in determining the relative importance of a snapshot. Other features used for event detection and entity extraction include the relative positional information and tf.idf frequencies of terms, part-of-speech tags and regular expressions. Metzler et al. [66] retrieve a ranked list of historical event summaries in response to a user query based on term frequencies within the retrieved timespan. They perform query expansion with the top highest weighted terms during a specific time interval, and rank summaries using a query likelihood scoring function with Dirichlet smoothing.
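
An event/non-event classifier of this kind can be sketched with a linear SVM; the three numeric feature columns (message length, number of context words, presence of a crisis keyword) and the labels are invented for the example.

    # Train a linear SVM on toy feature vectors and classify two unseen messages.
    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[12, 3, 1], [5, 0, 0], [20, 4, 1], [7, 1, 0], [15, 2, 1], [4, 0, 0]])
    y = np.array([1, 0, 1, 0, 1, 0])  # 1 = reports an event, 0 = does not

    classifier = SVC(kernel="linear").fit(X, y)
    print(classifier.predict(np.array([[18, 3, 1], [6, 0, 0]])))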

Unspecified Event Detection. Because no information about the event is known a priori, unknown events are usually detected by exploiting temporal patterns in the incoming document streams. News events of general interest exhibit a sudden and sharp increase in the usage of specific keywords. However, techniques for unspecified event detection need to discriminate between trivial incidents and events of major importance using scalable and efficient algorithms. In [100], after employing a Naive Bayes classifier to separate news from irrelevant information, the authors employ an online clustering algorithm based on tf.idf term vectors and cosine similarity to form clusters of news articles. The authors of [82] highlight the importance of named entities in improving the overall system accuracy. They identify named entities using the Stanford Named Entity Recognizer (http://nlp.stanford.edu/software/CRF-NER.shtml) trained on conventional news corpora. Topical words, defined as words which are more popular than others with respect to an event, are extracted from news articles on the basis of their frequency and entropy, and divided into event clusters in a co-occurrence graph; this helps in tracking changes among events at different times [58]. Finally, continuous wavelet transformations, localized in the time and frequency domain [25] and therefore able to track the development of a bursty event, have been combined with Latent Dirichlet Allocation [19] topic model inference in the task of unspecified event detection.
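
Single-pass online clustering with tf.idf vectors and a cosine similarity threshold can be sketched as follows; fitting the vectorizer on the full toy corpus up front is a simplification, since in a real stream the vocabulary is not fixed in advance.

    # Each incoming article joins the most similar existing cluster or starts a new one.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    articles = [
        "Wildfire spreads across the national park",
        "Firefighters battle wildfire in national park",
        "Stock markets close higher on Friday",
    ]
    vectorizer = TfidfVectorizer().fit(articles)
    vectors = vectorizer.transform(articles)
    clusters = []  # each cluster is a list of article indices; its first article acts as centroid

    for i in range(len(articles)):
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = cosine_similarity(vectors[i], vectors[cluster[0]])[0, 0]
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim > 0.3:
            best.append(i)
        else:
            clusters.append([i])
    print(clusters)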

In the rest of this thesis we focus on developing summarization techniques for specified events, as defined by the TREC Temporal Summarization campaign. For more information about the test events, please see Chapter 3 and Appendix A.

Retrospective vs. new event detection

Retrospective Event Detection. Retrospective event detection aims to identify events from accumulated historical records, mentioned inside documents that have arrived in the past. Common methods involve the creation of clusters of documents using similar words, referring to similar groups of people, or occurring close to each other in time or space [44]. Zhao et al [113] combine clustering techniques and graph analysis to detect events. They construct graphs where each node represents an actor involved in the event and edges represent the flow of information between actors, by exploiting the social, textual and temporal characteristics of documents. Sayyadi et al [101] assume that there is a topical relationship between keywords co-occurring across documents. To this end, they build a graph over the documents based on word co-occurrence. Temporal and geographical distributions of space and time tags found in documents have been used as features in the clustering process, and in determining the type of an event. In [94], the authors extract named entities, calendar dates, part-of-speech tags and certain types of words and phrases that characterize an event. Afterwards, they proceed to classifying the event type retrospectively using a latent variable model.

New Event Detection. New event detection techniques continuously monitor the incoming data stream for signals that may indicate emerging events in real time, such as breaking news or trending events. It is mostly regarded as a query-free retrieval task – since the event information is not known in advance, it cannot be formulated as an explicit query. Clustering techniques, although they present an inherent time-lag, have been widely employed for the task [106]. In general, documents are processed sequentially, merging an incoming document into an existing cluster if the similarity measure exceeds a specific threshold, or creating a new cluster if the similarity falls below that threshold. In order to ensure that a text document contains information that has not been reported before, metrics such as the Hellinger similarity, the Kullback-Leibler divergence and the cosine similarity are used to compare the current content with previously reported facts. Online new event detection seeks to achieve low latency, in the sense that a new event has to be reported as soon as a document corresponding to the new event has been received by the event detection system. Most of the methods for new event detection rely on bursts of specific keywords that present a sharp increase in frequency as an event emerges. Bursty patterns are captured when the observed frequencies of these keywords are much higher than they used to be in fixed time interval windows from the past [84]. Other methods employed for the task rely on Locality Sensitive Hashing [81] for processing documents in a bounded time and space frame. Some detection methods use wavelet-based signals to find bursts in individual words, and compute their cross correlation to differentiate trivial events from major ones [109].

In this current work our goal is to process documents in temporal order as they arrive, and extract novel updates from news streams in a timely manner. Although we mainly focus on new event detection, in Chapter 4 we take a retrospective look at the limitations of event update identification algorithms. This is meant to offer us a deeper understanding of what makes a summarization system successful.

Supervised vs. unsupervised event detection

Unsupervised Event Detection. Most techniques for unsupervised event detection rely on clustering algorithms. Due to the fact that data is dynamically evolving and that new events arise over time, these algorithms should not require any prior knowledge of the number of clusters (techniques such as k-means, k-median, k-medoid or other approaches based on expectation maximization [2] are therefore inappropriate). Furthermore, these clustering methods need to be efficient and highly scalable. Incremental clustering approaches [23] have been used for grouping continuously generated text. When the level of similarity between incoming text and existing clusters exceeds a specific threshold, the new text is considered similar and is merged into the closest cluster; otherwise it is considered a new event for which a new cluster will be created. Features used in unsupervised event detection include tf.idf weighted term vectors computed over some time period, word frequency, word entropy and proper names. Cosine similarity is the most used metric to compute distances between term vectors and the centres of clusters. Graph-based clustering algorithms, such as hierarchical divisive clustering approaches [58], have been proposed to divide topical words into event clusters. Graph partitioning techniques [109] have been used to form events by splitting a graph into subgraphs, where each subgraph corresponds to an event. However, hierarchical clustering algorithms are known not to scale well to large datasets, as they require a full similarity matrix.
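As an illustration of the thresholded incremental clustering scheme described above, the following sketch groups incoming sentences using tf.idf vectors and cosine similarity against cluster centroids. It is a simplified single-pass example under our own assumptions (fixed vocabulary, centroid averaging), not a reproduction of any specific published system.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def incremental_clustering(texts, threshold=0.3):
    """Single-pass clustering: merge a text into the closest cluster if the cosine
    similarity to its centroid exceeds `threshold`, otherwise start a new cluster."""
    vectors = TfidfVectorizer(stop_words="english").fit_transform(texts).toarray()
    centroids, assignments = [], []
    for vec in vectors:
        best, best_sim = -1, 0.0
        if centroids:
            sims = cosine_similarity([vec], centroids)[0]
            best, best_sim = int(np.argmax(sims)), float(np.max(sims))
        if best >= 0 and best_sim >= threshold:
            members = assignments.count(best)
            centroids[best] = (centroids[best] * members + vec) / (members + 1)  # running centroid
            assignments.append(best)
        else:
            centroids.append(vec)
            assignments.append(len(centroids) - 1)
    return assignments

docs = ["earthquake hits the capital city",
        "strong earthquake shakes the capital",
        "local football team wins the championship"]
print(incremental_clustering(docs))  # e.g. [0, 0, 1]: the first two sentences merge

A truly streaming system would additionally update the vocabulary and idf statistics over time; here the vectorizer is fitted once for brevity.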

Supervised Event Detection. Manually labelling data for supervised event detection is a rather labor-intensive and time-consuming task, more feasible in the case of specified events than unspecified events. Classification algorithms that have been employed for the task are Naive Bayes, Support Vector Machines, and gradient boosted decision trees, typically trained on a small set of human labelled data. Classifiers use a vast set of linguistic, structural, burst, and sentiment features. Oftentimes relative positional information, POS tagging and named entity extraction features are also incorporated. However, the main drawback of supervised event detection approaches is that they assume a static environment: there is usually one classifier trained offline, on a small batch of manually labelled data, and used for detecting events either directly, or as a pre-processing step before performing clustering. When data is coming in streams in a continuously evolving environment, the classifier is prone to err as soon as a topic drift occurs. Incremental learning [46] and ensemble methods [53] have been used to account for unseen events in the training data and adapt to changes that may occur over time. In addition, semi-supervised learning approaches which rely on a small amount of labelled data combined with a large amount of unlabelled data have been proposed to train the event detection classifiers [116].
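A minimal sketch of the kind of supervised setup described above is shown below: a linear SVM trained on tf.idf features over a small batch of labelled sentences. The labels, feature choice and toy data are purely illustrative; real systems would add the linguistic, structural and burst features discussed in the text.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy labelled batch: 1 = event-related sentence, 0 = not event-related (illustrative only).
train_texts = ["a powerful earthquake struck the region this morning",
               "rescue teams are searching the collapsed buildings",
               "the recipe calls for two cups of flour",
               "the new phone model was reviewed yesterday"]
train_labels = [1, 1, 0, 0]

# tf.idf features feeding a linear SVM, trained offline on the labelled batch.
classifier = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
classifier.fit(train_texts, train_labels)

print(classifier.predict(["aftershocks were felt across the city"]))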

Hybrid Event Detection. Oftentimes a combination of unsupervised and supervised event detection approaches is employed [100]. Supervised classification techniques can be used as a first step in identifying relevant and important documents before clustering them. A trained classifier is expected to improve the efficiency of a summarization system by discriminating between events and non-events, and therefore reducing the amount of noisy data which goes as input into the clustering step. This approach is however sensitive to classification accuracy, and the cascading pipeline can lead to the drop of many important events before the clustering stage takes place. Conversely, there are other approaches which do the clustering first, and then classify whether each cluster contains relevant information about an event. As the clusters evolve over time, the features for the old clusters need to be periodically updated, and features for the newly formed clusters inferred.

In our approach to the task we employ supervised machine learning methods. In Chapter 7 we present a machine-learned approach able to predict whether a sentence should be included in the event summary or skipped, based on a set of retrieval, centrality and event modelling features.

2.2.2 Event Tracking.

Event tracking studies how events evolve and unfold over space and time [5]. An event typically consists of multiple stages: warning, impact, emergency and recovery. The situation is monitored during the warning phase. The disaster is ongoing during the impact phase, while the immediate post-impact period during which rescue and other emergency activities take place is the emergency phase. Recovery is the period of returning back to normal. Iyengar et al [45] extract a set of discriminative words as features and use them in training a Support Vector Machines classifier, which, combined with a Hidden Markov model, is able to predict the phase of an event for an incoming message.

Oftentimes events are made up of small-scale sub-events that take place as a crisis situation unfolds. Sub-event detection using Conditional Random Fields has been employed in [54]. Hua et al [43] group sets of words related to the event into clusters and use supervised learning to classify incoming messages. In addition, they perform geographical location estimation on the classified output. However, challenges in sub-event detection arise from the inadequate reporting of spatial and temporal information, or in cases when mundane events introduce a lot of noise in distinguishing important happenings from trivial ones.

2.2.3 Event Summarization.

Event summarization is particularly important in helping users deal with the information overload problem in times of crisis. A text-based representation of an evolving event can serve in providing a brief digest of the core topics discussed in the set of input documents [75]. Most text summarization systems operate in an incremental and temporal fashion: given a set of documents the user has read, and a new set of documents to be summarized, the objective is to present the user with a summary of only the data the user has not already read [62, 14]. Purely content-based methods, such as the incremental or hierarchical clustering of text, have been widely used in addressing the task. Alternatively, regression-based combinations of features have been employed in summarizing events [39]. However, the relative importance of different sets of features inspired primarily from batch summarization is not yet well understood. In addition, novelty detection is an integral part of many summarization algorithms that require aggressive inter-sentence similarity computation, a procedure which scales poorly for large datasets.

In the rest of this thesis we focus on tracking and summarization techniques for news events. We include an analysis of temporal summarization methods, and demonstrate the limitations and potentials of previous approaches by examining the retrievability and the centrality of event updates, and the existence of intrinsic inherent characteristics in update versus non-update sentences.

Chapter 3

Task, Datasets and Evaluation Metrics

In this chapter we present the temporal summarization task as defined by the TREC Temporal Summarization track [13, 15, 14], which aims to provide an evaluation testbed for developing methods for the temporal summarization of large volumes of news streams. More precisely, we describe the goal of the task with the main challenges involved, the datasets provided, and the evaluation framework. In addition, we describe and motivate our method of choice when it comes to evaluating summaries of news articles.

3.1 TREC Temporal Summarization Task

The TREC Temporal Summarization (TS) task facilitates research in monitoring and summarization of information associated with an event over time. It encourages the development of systems able to emit sentence updates over time, given as input a named crisis event identified through a query, a time period during which the event occurred, and a high volume stream of news articles concerning the event.

Figure 3.1: Example of a TREC topic description for the topic “2012 Buenos Aires Rail Disaster”.

According to Aslam et al [13], an event refers to a temporally acute topic which can be represented by the following set of fields: i) title - a short retrospective description of the event, specified as a string, ii) description - a retrospective free text event description, given as a URL, iii) start - the time when the system should start summarization, specified as a UNIX timestamp, iv) end - the time when the system should end summarization, also specified as a UNIX timestamp, v) query - a keyword representation of the event, mentioned as a string, and finally vi) type - the event type. In Figure 3.1 we include a typical TREC TS event description; please check Appendix A for an overview of all topics.
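To make this representation tangible, the snippet below encodes the six fields of a TREC TS topic as a small Python structure. The field names follow the description above; the values shown are illustrative placeholders, not the official topic file for this event.

# Illustrative encoding of a TREC TS event topic; values are placeholders, not official data.
event_topic = {
    "title": "2012 Buenos Aires Rail Disaster",     # short retrospective description
    "description": "https://en.wikipedia.org/wiki/2012_Buenos_Aires_rail_disaster",  # free-text description, given as a URL
    "start": 1330000000,                            # summarization start (UNIX timestamp, placeholder)
    "end": 1330900000,                              # summarization end (UNIX timestamp, placeholder)
    "query": "buenos aires train crash",            # keyword representation (placeholder)
    "type": "accident",                             # event type
}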

For a specified event, systems are required to emit a series of sentence updates that cover all the essential information regarding the event throughout its entire duration. TREC TS focuses on large events with a wide impact, such as natural catastrophes, disasters, protests or accidents. In Table 3.1 we present an overview of the distribution of TREC TS event types for the past three years of the competition. Participants are provided with very high-volume streams of news articles and blog posts crawled from the Web, but only a small portion of the stream is relevant to the event. In addition to filtering out the irrelevant content, the stream of documents from the tracking period for each event must be processed in temporal order. Therefore, the set of sentences emitted should form a chronological summary of the respective event over time.

Table 3.1: Event type statistics for the testing events inside the TREC TS 2013, 2014 and 2015 datasets.

Event Type     TREC TS 2013   TREC TS 2014   TREC TS 2015
accident       2              1              7
bombing        1              1              8
conflict       –              –              1
earthquake     1              –              2
hostage        –              1              –
impact event   –              1              –
protest        –              6              2
riot           –              1              –
storm          4              3              1
shooting       2              1              –
Total          10             15             21

An optimal summary covers all the essential information about the event with no redundancy, and each new piece of information is added to the summary as soon as it becomes available. The automatic summarization of long-running events from news streams poses a number of challenges that emerge from the dynamic nature of the corpus. First, a long-running event can contain hundreds or thousands of unique nuggets of information to be summarized, spread out across the entire lifetime of the event. Second, the information reported about the event can rapidly become outdated, and it is often highly redundant. Therefore, there is the need for online algorithms to extract significant and novel event updates in a timely manner. Third, a temporal summarization system should be able to alter the volume of content issued as updates over time with respect to the prevalence and novelty of discussions about the event. Lastly, in contrast to classic summarization and timeline generation tasks, this analysis is expected to be carried out online, as documents are flowing into the system in a streaming scenario.

Participant systems are required to emit a series of sentences extracted from the corpus of relevant documents for each event. When a sentence is emitted, the time of the underlying document must be recorded. If the emit/ignore sentence decisions are made immediately on a per-sentence basis, the timestamp will correspond to the crawl time of the document. Otherwise, if the emission of a sentence is delayed so as to collect more information about whether or not to issue an update, the timestamp recorded will reflect the additional latency. Incorporating information which is not part of the KBA corpus is allowed, as long as this external information has been created prior to the event start time, or the source of this information is perfectly time-aligned with the KBA corpus of documents (such that no information from the future is considered). Therefore, any external document created during or after the event end time is considered information from the future and is implicitly excluded from being used for training the summarization system.

3.2 Datasets

The TREC Temporal Summarization track uses documents from the TREC KBA 2014 Stream Corpus (http://trec-kba.org/kba-stream-corpus-2014.shtml). The corpus (4.5 TB) contains timestamped documents from a variety of news and social media sources, and spans the time period October 2011 - April 2013. Each document inside the KBA Streaming corpus is assigned a unique identifier and a timestamp representing the time when the respective document was crawled. A document typically includes the original HTML source of the document, and a set of zero or more sentences extracted by the organizers; each sentence inside the document is identified by the positional index of its occurrence inside the document. However, the extraction of sentences from documents was done using rudimentary heuristics, occasionally producing sentences of several hundreds of words in cases when no punctuation marks were present (for example, such sentences can include entire paragraphs, tables or navigational labels). A sentence update inside a document is specified by the distinct combination made up of the identifier of the document which contains the sentence, and an integer value designating the index of the sentence inside the document (i.e. the sentence identifier).

Organizers made available different versions of the TREC KBA Stream Corpus to participants, depending on their interests in the competition. In the first edition of the TREC Temporal Summarization task in 2013, they were provided with the entire version of the corpus containing high-volume streams of news articles and blog posts collected from various online sources. Since only a very small portion of the stream was relevant to the set of test events, the corpus had to be aggressively filtered for removing irrelevant content. Sentence extracts would then be selected from the relevant documents identified, and returned to the user as updates describing an event over time. One year later in 2014, organizers made available a pre-filtered version of the TREC KBA Stream Corpus (559 GB), containing documents from the time range of the event topics which are more likely to focus on documents that include relevant sentences. The irrelevant content would still have to be filtered, and the relevant documents processed in temporal order to select sentence updates that summarize the event. In addition to these two datasets, in 2015 the organizers released a much lower volume stream of on-topic documents for a given set of events. Accounting for the datasets from the past two years, in 2015 participants can run experiments on three distinct test sets. Each of these three datasets corresponds to a specific sub-task: Full Filtering and Summarization (based on the entire TREC KBA Stream Corpus), Partial Filtering and Summarization (based on the pre-filtered version of the TREC KBA Stream Corpus), and Summarization Only (based on the filtered version of the TREC KBA Stream Corpus).

Given the corpus of documents, participant systems are required to produce temporal summaries for distinct sets of crisis events every year. In Table 3.2 and Table 3.3 we present the test topics for TREC TS 2013, together with statistics regarding the number of updates that have been annotated by human assessors and the number of essential pieces of information describing each event (also called nuggets). Since not all documents and updates which are mentioned in the gold standard set for the TREC TS 2013 collection exist in the database of indexed documents, we also include statistics about the number of indexed documents, nuggets and updates that we can effectively retrieve from the database. Following the same pattern, in Table 3.4 and Table 3.5 we present information regarding the number of pooled, relevant and indexed updates, documents and nuggets for the TREC TS 2014 collection. Moreover, in Table 3.6 we include statistics regarding the number of nuggets, pooled and relevant updates for the events inside the TREC TS 2015 collection. Since the gold standard set for TREC TS 2015 has not been released as of the moment of writing of this thesis, in this last table we restrict ourselves to including the information reported by the TREC TS organizers [14]. For this reason, in the rest of the thesis we also exclude running experiments on the TREC TS 2015 dataset, considering that we would not be able to perform any evaluation on this data in the absence of the annotated gold standard sentence updates for inclusion into the summary.

Table 3.2: TREC Temporal Summarization 2013 topics with gold nuggets, pooled and relevant updates for event-nugget matching.

Event Id   Event Title                           Nuggets   Pooled Updates   Relevant Updates
1          2012 Buenos Aires Rail Disaster       56        N/A              426
2          2012 Pakistan garment factory fires   89        N/A              372
3          2012 Aurora shooting                  139       N/A              210
4          Wisconsin Sikh temple shooting        97        N/A              406
5          Hurricane Isaac 2012                  108       N/A              81
6          Hurricane Sandy                       418       N/A              485
7          June 2012 North American derecho      91        N/A              0
8          Typhoon Bopha                         88        N/A              172
9          2012 Guatemala earthquake             45        N/A              166
10         2012 Tel Aviv bus bombing             37        N/A              281

3.3 Evaluation Metrics

TREC Temporal Summarization runs are evaluated by taking into consideration the relevance, coverage, novelty, and latency of the updates. To account for each of these aspects, the organizers of the track have defined custom evaluation metrics. The evaluation process is centered around the concept of information nuggets, defined as fine-grained atomic pieces of content that capture all the essential information a good summary should contain. As events are composed of a variety of sub-events, a nugget typically corresponds to a sub-event that an ideal system would broadcast to the end user.

Table 3.3: TREC Temporal Summarization 2013 topics with the relevant documents, nuggets and updates. The first two columns are based on the annotations from the TREC gold standard set, while the last two columns denote the number of documents and updates released by TREC that we can effectively retrieve from the database.

Event Id   Relevant Documents   Relevant Updates   Relevant Indexed Documents   Relevant Indexed Updates   Nuggets w/ ≥1 indexed update
1          248                  426                144                          233                        39
2          266                  372                123                          168                        31
3          164                  210                35                           42                         26
4          313                  406                143                          180                        43
5          70                   81                 34                           35                         20
6          397                  485                158                          172                        64
7          –                    –                  –                            –                          –
8          123                  172                56                           76                         46
9          135                  166                62                           68                         27
10         224                  281                91                           95                         19

Table 3.4: TREC Temporal Summarization 2014 topics with gold nuggets, pooled and relevant updates for event-nugget matching.

Event Id   Event Title                                Nuggets   Pooled Updates   Relevant Updates
11         Costa Concordia disaster and recovery      226       1,008            392
12         Early 2012 European cold wave              72        654              184
13         2013 Eastern Australia Floods              68        570              316
14         Boston Marathon bombings                   76        984              401
15         Port Said Stadium riot                     45        813              315
16         2012 Afghanistan Quran burning protests    72        759              554
17         In Amenas hostage crisis                   48        768              714
18         2011-13 Russian protests                   89        916              409
19         2012 Romanian protests                     97        758              341
20         2012-13 Egyptian protests                  35        612              307
21         Chelyabinsk meteor                         124       919              798
22         2013 Bulgarian protests                    116       608              220
23         2013 Shahbag protests                      138       723              281
24         February 2013 nor'easter                   100       951              453
25         Christopher Dorner shootings and manhunt   88        701              390

Each information nugget is associated with a short textual description and a precise timestamp representing the moment when that corresponding piece of information became public knowledge. More precisely, these nuggets are very short sentences (including a single sub-event, fact, location, or date) extracted from Wikipedia event pages retrospectively, and timestamped according to the revision history of the page.

Each nugget has a corresponding set of sentence updates, chosen from all the segmented sentences inside the corpus. Each sentence update is in turn associated with a specific timestamp, and is referred to by the combination of a document identifier and a sentence identifier. A sentence update u = (u_string, u_t) is a tuple consisting of a short text string and a Unix timestamp representing the emission time of the sentence. For instance, a valid sentence update is represented by (“The hurricane was upgraded to category 4”, 1330169580).

Table 3.5: TREC Temporal Summarization 2014 topics with the relevant documents, nuggets and updates. The first two columns are based on the annotations from the TREC gold standard set, while the last two columns denote the number of documents and updates released by TREC that we can effectively retrieve from the database.

Event Id   Relevant Documents   Relevant Updates   Relevant Indexed Documents   Relevant Indexed Updates   Nuggets w/ ≥1 indexed update
11         281                  392                281                          392                        127
12         123                  184                123                          184                        29
13         236                  316                224                          313                        40
14         316                  401                275                          401                        47
15         180                  315                180                          315                        39
16         309                  554                309                          554                        46
17         396                  714                312                          648                        47
18         264                  409                264                          409                        66
19         147                  341                147                          341                        53
20         277                  307                243                          289                        23
21         480                  798                449                          798                        83
22         137                  220                124                          220                        53
23         186                  281                145                          274                        67
24         309                  453                269                          430                        49
25         282                  390                250                          376                        62

Table 3.6: TREC Temporal Summarization 2015 topics with gold nuggets, pooled and relevant updates for event-nugget matching.

Event Id   Event Title                                Nuggets   Pooled Updates   Relevant Updates
26         Vauxhall helicopter crash                  22        1,222            N/A
27         Cyclone Nilam                              24        1,191            N/A
28         2013 Savar building collapse               60        1,471            N/A
29         2013 Hyderabad blasts                      90        1,347            N/A
30         Brazzaville arms dump blasts               77        1,275            N/A
31         2012 India blackouts                       33        991              N/A
32         Reactions to Innocence of Muslims          226       1,645            N/A
33         Battle of Konna                            41        1,233            N/A
34         February 2013 Quetta bombing               26        1,137            N/A
35         15 April 2013 Iraq attacks                 20        1,233            N/A
36         19 March 2013 Iraq attacks                 48        1,373            N/A
37         2011-12 Los Angeles arson attacks          62        1,336            N/A
38         2013 Thane building collapse               29        1,390            N/A
39         2013 US embassy bombing in Ankara          10        755              N/A
40         22 December 2011 Baghdad bombings          37        1,053            N/A
41         Aleppo University bombings                 26        1,136            N/A
42         Carnival Triumph 2013 Engine Room Fire     46        1,271            N/A
43         USS Guardian January 2013 Grounding        11        769              N/A
44         2012 Indian Ocean earthquakes              65        1,223            N/A
45         2012 Haida Gwaii earthquake                57        1,129            N/A
46         2012 Catalan independence demonstration    60        1,036            N/A

The set of sentence updates pushed by a summarization system S before time τ is mathematically represented as:

S_τ = {u ∈ S : u_t < τ}    (3.1)

Two sentence updates u = (u_string, u_time) and u′ = (u′_string, u′_time) are semantically similar (u ≈ u′) if their corresponding update strings refer to the same information, i.e. u_string = u′_string. Since the two updates u and u′ can be emitted at different times, it is not necessary that their corresponding emission timestamps u_time and u′_time are identical. The sentence updates from the gold standard set have been sampled for evaluation through a depth-pooling method by human assessors who were explicitly looking at the exact match between an atomic nugget and an update; it is thus possible for an update to match multiple nuggets. All unpooled updates and all pooled updates that do not match any nugget are considered irrelevant for the evaluation metrics. Therefore, an effective summary should cover as many nuggets as possible in the minimum number of sentences, avoiding redundancy of the content.

The official TREC TS evaluation measures account for the relevance, latency, verbosity, and matching of the updates. Nuggets are ranked by their importance to the query on a scale from 0 to 3 (no importance to high importance) by human assessors. Graded relevance is normalized on an exponential scale: high importance nuggets are of “key importance to the query”, low importance nuggets are of “any importance to the query”. As opposed to that, binary relevance assigns 1 to anything considered relevant, and 0 to anything considered irrelevant. The relevance function R : N → [0, 1] is defined for each nugget n ∈ N, where N denotes the canonical set of updates (nuggets), by taking the nugget importance n_i into consideration, as follows:

R_graded(n) = e^{n_i} / e^{max_{n′ ∈ N} n′_i}    (Graded relevance)

R_binary(n) = 1 if n_i > 0, and 0 otherwise    (Binary relevance)    (3.2)

and can be discounted in time (latency discount) or size (verbosity normalization). The latency penalty function L(n_t, u_t) is a monotonically decreasing function which compares the time of an update u_t with the time n_t of the nugget matched by that update, penalizing systems for returning deprecated information long after the time when the information first became public knowledge on Wikipedia:

L(n_t, u_t) = 1 − (2/π) arctan((u_t − n_t) / α),    (Latency discount)

where α = 3600 × 6    (Latency step of 6 hours)    (3.3)

Verbosity normalization penalizes systems which output unreasonably long updates, as these introduce a significantly higher reading effort for the end-user when going over the summary. Verbosity is defined as a string penalty function monotonically increasing in the number of words of a sentence update. Practically, verbosity is computed as the fraction of the number of words in a sentence update that did not match a nugget over the average number of words in a nugget for a query:

V(u) = 1 + (|all_words_update| − |nugget_matching_words_update|) / AVG_nugget |words_nugget|    (3.4)

Therefore, if a sentence update u has all of its words in common with a nugget, then the verbosity of the respective sentence update is V(u) = 1; otherwise V(u) − 1 is an approximation of the number of extra non-matching words in the sentence.

The earliest matching function between a nugget n and a set of updates emitted by system S identifies the earliest matching update u for the given nugget according to the formula below:

M(n, S) = argmin_{u ∈ S : n ≈ u} u_t    (3.5)

Conversely, the set of nuggets for which u is the earliest matching update is given by:

M⁻¹(u, S) = {n ∈ N : M(n, S) = u}    (3.6)

Given a nugget and a matching update, the nugget relevance can be multiplied with a discounting factor to obtain a discounted gain:

g(u, n) = R(n) × discounting factor
g_F(u, n) = R(n)    (discount-free gain)
g_L(u, n) = R(n) × L(n_t, u_t)    (latency-discounted gain)    (3.7)

As an update can be the earliest to match multiple nuggets, the gain of an update with respect to a system S is defined as the sum of the latency-discounted relevance of the nuggets for which it is the earliest matching update:

G(u, S) = Σ_{n ∈ M⁻¹(u, S)} g_L(u, n)    (3.8)

Given the above definition of the gain, the Expected Gain EG(S) of a system S measures the precision of the summary with respect to the event topic based on the number of matching nuggets. It is used to reflect the degree to which the updates within the summary are on-topic and novel. The normalized Expected Gain nEG(S) of system S is computed as the average of the gain per update, and normalized with the maximum obtainable gain per topic Z:

nEG(S) = (1 / (Z |S|)) Σ_{u ∈ S} G(u, S)
       = (1 / (Z |S|)) Σ_{u ∈ S} Σ_{n ∈ M⁻¹(u, S)} g_L(u, n)
       = (1 / (Z |S|)) Σ_{n ∈ N : M(n, S) ≠ ∅} g(M(n, S), n)    (3.9)

Systems are penalized by the overall verbosity of the system (instead of normalizing by the number of system updates):

nEG_V(S) = (1 / Σ_{u ∈ S} V(u)) × (1 / Z) Σ_{n ∈ N : M(n, S) ≠ ∅} g(M(n, S), n)    (3.10)

In addition to measuring precision, we are also interested in knowing how many of the available nuggets are covered by the summarization system. Similarly to recall in traditional information retrieval evaluation, the comprehensiveness of a system quantifies the degree of completeness of a set of updates. Therefore, using the comprehensiveness metric the degree of coverage of the summary is measured by the ratio of event-related information included in the summary:

C(S) = (1 / Σ_{n ∈ N} R(n)) Σ_{u ∈ S} G(u, S)    (3.11)

The official metric for the TREC 2015 Temporal Summarization track is a combined measure which incorporates both the Expected Gain and the Comprehensiveness metrics presented above. H(S) represents the harmonic mean of the expected gain and comprehensiveness with latency included, and is used for ranking participant systems in the competition:

H(S) = (EG_V(S) × C(S)) / (EG_V(S) + C(S))    (3.12)

So far we have presented the official evaluation metrics introduced by the TREC TS organizers in assessing the quality of a summary in terms of the number of relevant nuggets of information covered, and accounting for the latency, verbosity, diversity and novelty of the issued updates. Given all the optimality constraints an efficient temporal summarization system should meet, we believe the temporal summarization problem transforms into a multi-objective optimization one. However, our main interest in the task lies in identifying what are the inherent characteristics of an update, and what makes a sentence update retrievable. Moreover, we are also interested in methods that can reliably retrieve these updates, more than in assessing factors such as the timeliness, diversity or novelty of the retrieved sentences. To this end, in this thesis we decide to evaluate summaries in terms of the traditional information retrieval metrics of precision and recall. We believe that using these metrics can offer us more clarity into the performance of a temporal summarization system, and allows for a straightforward comparison among different methods and techniques for the selection of salient sentence updates. In addition, to illustrate how the TREC TS metrics are effectively calculated, we present in turn two examples below. In what follows we demonstrate that, following a simplification in the way the normalized expected gain nEG(S) and comprehensiveness C(S) are computed, these metrics are indeed equivalent to precision and recall.
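As a complement to the worked examples that follow, the sketch below computes the simplified (discount-free, binary relevance) versions of nEG(S) and C(S) from a nugget-to-update matching. The data structures and names are our own simplifications introduced only for illustration, not the official evaluation scripts.

def earliest_matches(nugget_to_updates, emission_time):
    """For each nugget, keep the matching update with the smallest emission time."""
    return {n: min(updates, key=lambda u: emission_time[u])
            for n, updates in nugget_to_updates.items() if updates}

def neg_and_comprehensiveness(nugget_to_updates, emission_time, emitted_updates, Z=1.0):
    """Simplified nEG(S) and C(S): binary relevance, no latency or verbosity discount."""
    matches = earliest_matches(nugget_to_updates, emission_time)
    gains = {u: 0 for u in emitted_updates}
    for nugget, update in matches.items():
        if update in gains:
            gains[update] += 1          # gain of an update = nuggets it is earliest to match
    total_gain = sum(gains.values())
    neg = total_gain / (Z * len(emitted_updates))             # precision-like
    comprehensiveness = total_gain / len(nugget_to_updates)   # recall-like
    return neg, comprehensiveness

# Toy matching that loosely mirrors the second worked example below;
# the emission times are made-up illustrative values.
nugget_to_updates = {"n1": ["u3", "u7", "u8"], "n2": ["u5", "u6", "u7", "u8"],
                     "n3": [], "n4": ["u2", "u1", "u3"]}
emission_time = {"u1": 8, "u2": 2, "u3": 3, "u4": 4, "u5": 5, "u6": 6, "u7": 7, "u8": 9}
emitted = ["u1", "u2", "u3", "u4", "u5", "u6", "u7", "u8"]
print(neg_and_comprehensiveness(nugget_to_updates, emission_time, emitted))  # (0.375, 0.75)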

3.3.1 Evaluation Metrics Example

In what follows we present an example of how the standardized metrics for the TREC Temporal Summarization task are computed. We believe this helps in providing a better understanding of their purpose and significance. The examples shown below are purely illustrative, and the number of nuggets and updates, as well as the mapping between nuggets and their corresponding updates in the given order, is arbitrary.

Let us assume that N = {n_1, n_2, n_3, n_4, n_5, n_6, n_7, n_8} is the set of relevant information nuggets. The relevance function R : N → [0, 1] is defined for each nugget n ∈ N. We consider the simplest case of binary relevance for the nuggets:

R(n_1) = R(n_2) = ... = R(n_8) = 1    (3.13)

Each of these nuggets is associated with a corresponding set of sentence updates, where a sentence update u is of the form u = (u_string, u_time). We assume that the updates associated with each nugget are ordered in ascending order of their timestamps, such that:

n_1 → u_1, u_3, u_5        n_2 → u_3, u_4, u_7
n_3 → u_1, u_5, u_6        n_4 → u_2, u_1, u_3    (3.14)

n_5 → u_3, u_7, u_8        n_6 → u_2, u_5, u_6
n_7 → u_5, u_2, u_1        n_8 → u_4, u_6, u_3    (3.15)

The earliest matching update function between an information nugget n ∈ {n_1, ..., n_8} and the set of updates {u_1, ..., u_8} emitted by the system S returns the earliest matching update for each nugget:

M(n_1, S) = u_1    M(n_2, S) = u_3    M(n_3, S) = u_1    M(n_4, S) = u_2
M(n_5, S) = u_3    M(n_6, S) = u_2    M(n_7, S) = u_5    M(n_8, S) = u_4    (3.16)

Consequently, the set of nuggets M⁻¹(u, S) for which u ∈ {u_1, ..., u_8} is the earliest matching update is:

M⁻¹(u_1, S) = {n_1, n_3}    M⁻¹(u_2, S) = {n_4, n_6}    M⁻¹(u_3, S) = {n_2, n_5}
M⁻¹(u_4, S) = {n_8}    M⁻¹(u_5, S) = {n_7}    M⁻¹(u_6, S) = M⁻¹(u_7, S) = M⁻¹(u_8, S) = ∅    (3.17)

Therefore, the gain of each update with respect to the system S is computed as follows:

G(u_1, S) = Σ_{n ∈ M⁻¹(u_1, S)} R(n) = R(n_1) + R(n_3) = 1 + 1 = 2
G(u_2, S) = Σ_{n ∈ M⁻¹(u_2, S)} R(n) = R(n_4) + R(n_6) = 1 + 1 = 2
G(u_3, S) = Σ_{n ∈ M⁻¹(u_3, S)} R(n) = R(n_2) + R(n_5) = 1 + 1 = 2
G(u_4, S) = R(n_8) = 1
G(u_5, S) = R(n_7) = 1
G(u_6, S) = G(u_7, S) = G(u_8, S) = 0    (3.18)

Given equations (3.9) and (3.18), we can compute the normalized expected gain of system S (here Z = 1 and |S| = 8):

nEG(S) = (1 / (Z |S|)) Σ_{u ∈ S} G(u, S)
       = (1/8) × (G(u_1, S) + G(u_2, S) + G(u_3, S) + G(u_4, S) + G(u_5, S) + G(u_6, S) + G(u_7, S) + G(u_8, S))
       = (1/8) × (2 + 2 + 2 + 1 + 1 + 0 + 0 + 0)
       = (1/8) × 8 = 1    (3.19)

Given equations (3.11) and (3.18), we can compute the comprehensiveness of the set of updates retrieved by system S (here Σ_{n ∈ N} R(n) = 8):

C(S) = (1 / Σ_{n ∈ N} R(n)) Σ_{u ∈ S} G(u, S)
     = (1/8) × (G(u_1, S) + G(u_2, S) + G(u_3, S) + G(u_4, S) + G(u_5, S) + G(u_6, S) + G(u_7, S) + G(u_8, S))
     = (1/8) × (2 + 2 + 2 + 1 + 1 + 0 + 0 + 0)
     = (1/8) × 8 = 1    (3.20)

The example presented above constitutes the ideal case when a system would rank updates for each nugget in the correct given order. However, in practice it is unlikely that we would see this happening unless we deal with a perfect system. In what follows, we consider the arbitrary output of a system where for each nugget the updates are ranked differently compared to the ideal case. We again mention that the example we provide below is explanatory, and serves the purpose of showing how TREC evaluation metrics are computed.

We consider N = {n_1, n_2, n_3, n_4} the set of relevant information nuggets for the event, and the set of sentence updates u ∈ {u_1, ..., u_8} emitted by the system S and associated with these nuggets, based on the following mapping:

n_1 → u_3, u_7, u_8        n_2 → u_5, u_6, u_7, u_8
n_3 → ∅                    n_4 → u_2, u_1, u_3    (3.21)

In the given case, the earliest matching update function between an information nugget n ∈ {n_1, n_2, n_3, n_4} and the set of updates {u_1, ..., u_8} emitted by the system returns the earliest matching update for each nugget:

M(n_1, S) = u_3    M(n_2, S) = u_5    M(n_3, S) = ∅    M(n_4, S) = u_2    (3.22)

Consequently, the set of nuggets M⁻¹(u, S) for which u ∈ {u_1, ..., u_8} is the earliest matching update is:

M⁻¹(u_2, S) = {n_4}    M⁻¹(u_3, S) = {n_1}    M⁻¹(u_5, S) = {n_2}
M⁻¹(u_1, S) = M⁻¹(u_4, S) = M⁻¹(u_6, S) = M⁻¹(u_7, S) = M⁻¹(u_8, S) = ∅    (3.23)

The gain of each update with respect to the system S is therefore given by:

G(u_2, S) = Σ_{n ∈ M⁻¹(u_2, S)} R(n) = R(n_4) = 1
G(u_3, S) = Σ_{n ∈ M⁻¹(u_3, S)} R(n) = R(n_1) = 1
G(u_5, S) = Σ_{n ∈ M⁻¹(u_5, S)} R(n) = R(n_2) = 1
G(u_1, S) = G(u_4, S) = G(u_6, S) = G(u_7, S) = G(u_8, S) = 0    (3.24)

Given equations (3.9) and (3.24), we compute the normalized expected gain of system S as:

nEG(S) = (1 / (Z |S|)) Σ_{u ∈ S} G(u, S)
       = (1/8) × (0 + 1 + 1 + 0 + 1 + 0 + 0 + 0)
       = 3/8    (3.25)

Given equations (3.11) and (3.24), we compute the comprehensiveness of the set of updates retrieved by system S as (here Σ_{n ∈ N} R(n) = 4, since there are four relevant nuggets):

C(S) = (1 / Σ_{n ∈ N} R(n)) Σ_{u ∈ S} G(u, S)
     = (1/4) × (0 + 1 + 1 + 0 + 1 + 0 + 0 + 0)
     = 3/4    (3.26)

In both examples presented above, we observe that the normalized expected gain nEG(S) and comprehensiveness C(S) metrics yield similar results to precision and recall metrics in traditional information retrieval evaluation. While the normalized expected gain nEG(S) measures precision, accounting for the number of updates retrieved that are relevant to the query, comprehensiveness C(S) is the fraction of relevant nuggets that are covered by an earliest matching update retrieved by the system. On the binary relevance scale, we consider a nugget relevant if the relevance function associated with it is greater than 0. Given these, we decide to evaluate systems in terms of traditional notions of precision and recall in the rest of this thesis, instead of using the normalized expected gain nEG(S) and comprehensiveness C(S) metrics for evaluation. This allows us to gain a better understanding of the performance of event summarization methods, and facilitates a clear and transparent understanding of the assets and limitations of each method. However, we are aware that through this simplification of the evaluation process we do not tackle other important aspects such as the diversity and novelty of emitted sentence updates. At this stage though, we believe that these are of secondary importance. We are mainly concerned with understanding the main characteristics of a relevant update, and how it can be distinguished from other non-update sentences in a large collection of news articles.

Chapter 4

Upper Bound Analysis

In this chapter we present a systematic analysis of the limitations and potentials of the main approaches for event update identification. We first provide a general classification of the methods used in the extraction of sentence updates for inclusion in a news streams summary; afterwards we continue with an in-depth study of the suitability of these techniques when it comes to identifying relevant updates. At this stage we do not devise any new algorithm towards temporal summarization; our goal is solely to obtain a deeper understanding of how and why approaches for sentence update identification fail, and what is required for a successful temporal summarization system. We believe that such an analysis is necessary and can shed light on developing more effective algorithms in the future.

4.1 Main Approaches

Temporal summarization systems typically use a pipelined approach. In general, methods for event update identification process the stream of input documents for producing event summaries in multiple stages, as follows:

(a) filtering documents to discard those documents that are not relevant to the event query;

(b) ranking and filtering sentences to identify those sentences that contain significant information about the current event;

(c) deduplicating/removing redundant sentences that contain information that has already been emitted.

Examples of the afore-described pipeline are the systems submitted to the TREC Temporal Summarization track in past years [13, 15, 14], already described in Chapter 2. In this chapter we only focus on identifying potential update sentences by assuming that all documents received as input by our system are relevant to the event, and deliberately choosing to ignore the past history of what has already been emitted. These assumptions, which are ensured by the construction of our experiments and evaluation, provide a decomposition of the temporal summarization problem and allow a focus on the fundamental question of what constitutes a potential event update (from now on simply referred to as an update) and what does not. We leave the study of the interplay of the three components as future work.

In what follows, we present the main methods for event update identification for the task of extractive multi-document summarization. These algorithms fall under one of the categories below (or a combination of them):

1. Retrieval algorithms that consider event updates as passages to be identified given an event query (e.g. [10, 17, 73, 28]);

2. Summarization algorithms that assume update sentences are central in the documents that contain them, and hence algorithms that can aggregate sentences should be able to identify them (e.g. [31, 67, 105, 107]);

3. Event update modeling methods that assume event updates bear inherent characteristics that are not encountered in non-update sentences, and hence algorithms that model event updates should be able to predict whether a sentence is an update or a non-update sentence (e.g. [49, 40, 78]).

4.2 Experimental Design

In this section we describe the experimental design that has been used in our analysis. We consider three different approaches that have been adopted so far towards detecting event updates: retrieval algorithms, event update modeling, and summarization algorithms.

Retrieval Algorithms: The primary goal of the experiments is to identify the limitations of retrieval algorithms towards temporal summarization of events. In the designed experiments we want to be as indifferent as possible to any particular retrieval algorithm; hence we focus on the fundamental component of any such algorithm, which is the overlap between the language of an event query and the language of an event update in terms of the shared vocabulary. If an event update does not contain any query term, for instance, it is impossible for it to be retrieved by any vanilla relevance model. This can give us a theoretical upper bound on the number of event updates that are at all retrievable. Clearly, even if an event update contains the query terms (we call that event update covered) it is still likely that it will not be retrieved, if for instance the query terms are not discriminative enough to separate sentences into updates and non-updates. Hence, we further focus on terms that are discriminative. To identify such terms we compute word likelihood ratios [71, 89, 40]. Lin and Hovy [56] were the first to introduce the log-likelihood weighting scheme for summarization. The log-likelihood ratio (LLR) is an approach for identifying discriminative terms between corpora based on frequency profiling. To extract the most discriminative keywords that characterize events, we construct two corpora as follows. We consider all relevant annotated sentence updates from the gold standard as our foreground corpus, and a background corpus is assembled of all the non-update sentences from the relevant documents. Afterwards, for each term in the foreground corpus we compute its corresponding LLR score. In order to quantify which are the most discriminative terms in our collection, we rank the terms in descending order of their LLR scores and consider the top-N most discriminative in the rest of our experiments.
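The sketch below shows one common way to compute such a log-likelihood ratio score per term from a foreground (update sentences) and a background (non-update sentences) corpus, following the standard frequency-profiling formulation; the whitespace tokenization and the function names are our own simplifications rather than the exact implementation used in the experiments.

import math
from collections import Counter

def llr_scores(foreground_texts, background_texts):
    """Log-likelihood ratio of each foreground term, comparing its frequency
    profile in the foreground corpus against the background corpus."""
    fg = Counter(w for t in foreground_texts for w in t.lower().split())
    bg = Counter(w for t in background_texts for w in t.lower().split())
    fg_total, bg_total = sum(fg.values()), sum(bg.values())
    scores = {}
    for term, a in fg.items():           # a: term count in the foreground
        b = bg.get(term, 0)              # b: term count in the background
        # expected counts under the hypothesis that the term is equally likely in both corpora
        e1 = fg_total * (a + b) / (fg_total + bg_total)
        e2 = bg_total * (a + b) / (fg_total + bg_total)
        llr = 2.0 * (a * math.log(a / e1) + (b * math.log(b / e2) if b > 0 else 0.0))
        scores[term] = llr
    return scores

# Terms with the highest LLR are taken as the most discriminative for update sentences, e.g.:
# top_terms = sorted(llr_scores(update_sentences, non_update_sentences).items(),
#                    key=lambda kv: -kv[1])[:100]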

(Query Expansion with similar terms) We further want to understand the fundamental reason behind any language mismatch between query and event updates. A first hypothesis is that such a mismatch is due to different representation forms for the same semantics. Hence, in a second experiment we expand queries in two different ways: (a) we select a number of synonym terms using WordNet [69], and (b) we use a Word2Vec [68] model trained in turn on the set of relevant gold standard updates from TREC TS 2013 and TREC TS 2014; then, similar to the previous experiment, we test the limitations of such an approach by examining whether the expanded query terms are also event update discriminative terms.
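The following sketch illustrates the two expansion strategies on a single query term, one via WordNet synonyms and one via a Word2Vec model trained on a small corpus of update sentences. It assumes a recent gensim (4.x) and NLTK with the WordNet data available; the parameters and the toy corpus are placeholders chosen for illustration, not the exact settings of our experiments.

from nltk.corpus import wordnet as wn          # requires the WordNet corpus to be downloaded
from gensim.models import Word2Vec

def wordnet_synonyms(term, max_terms=5):
    """Collect lemma names of the WordNet synsets of `term` as expansion candidates."""
    synonyms = []
    for synset in wn.synsets(term):
        for lemma in synset.lemma_names():
            lemma = lemma.replace("_", " ").lower()
            if lemma != term and lemma not in synonyms:
                synonyms.append(lemma)
    return synonyms[:max_terms]

def word2vec_expansions(term, update_sentences, max_terms=5):
    """Train a small Word2Vec model on tokenized update sentences and return the
    terms closest to `term` in the induced embedding space."""
    tokenized = [s.lower().split() for s in update_sentences]
    model = Word2Vec(tokenized, vector_size=100, window=5, min_count=1, epochs=50)
    if term not in model.wv:
        return []
    return [w for w, _ in model.wv.most_similar(term, topn=max_terms)]

print(wordnet_synonyms("bombing"))   # e.g. synonym candidates for the query term "bombing"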

(Query Expansion with Relevance Feedback) A second hypothesis is that a vocabulary mismatch is due to a topical drift of the event updates. Imagine the case of the “Boston Marathon Bombing”. Early updates may contain all the query words, however when the topic drifts to the trial of the bombers or the treatment of the injured, it is expected that there will be a low overlap between the event query and the event updates due to the diverging vocabulary used. Such a vocabulary gap would be hard to fill by any synonym or related terms. However, if one were to consider how the vocabulary of the updates changes over time, one might be able to pick up new terms from past updates that could help in identifying new updates. This is a form of relevance feedback; for more details, see Figure 4.1. To assess this hypothesis, given an update, we consider all the sentence updates that have appeared in documents prior to this update. Then we examine the vocabulary overlap between this current update and discriminative terms from past updates. A high overlap would designate that one can actually gradually track topical drift.

Figure 4.1: Sketch illustrating relevance feedback based on past updates. The sentence update in red emitted by the summarization system at time t could not be covered by any terms from the previously retrieved sentence updates (in black) emitted before time t; however, the blue sentence update issued at time t+1 shares one or more terms in common with the previously returned sentence updates yielded by the system before time t+1, and can therefore be covered by the vocabulary of past updates.
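A minimal sketch of this coverage check with relevance feedback is given below: the query vocabulary grows with the terms of every update already emitted, and each new candidate sentence is marked as covered if it shares at least one term with that growing vocabulary. The streaming order, whitespace tokenization and stop word handling are simplifying assumptions for illustration.

def coverage_with_feedback(query, stream, stop_words=frozenset()):
    """`stream` is a time-ordered list of (sentence, is_update) pairs. A sentence is
    'covered' if it shares a term with the query expanded by all past update sentences."""
    vocabulary = {t for t in query.lower().split() if t not in stop_words}
    covered = []
    for sentence, is_update in stream:
        terms = {t for t in sentence.lower().split() if t not in stop_words}
        covered.append(bool(vocabulary & terms))
        if is_update:
            vocabulary |= terms   # relevance feedback: past updates enrich the vocabulary
    return covered

stream = [("two explosions near the marathon finish line", True),
          ("suspects identified after the explosions", True),
          ("trial of the surviving suspect begins", False)]
print(coverage_with_feedback("boston marathon bombing", stream,
                             stop_words={"the", "of", "a", "after", "near"}))
# [True, True, False]: the last sentence drifts beyond the accumulated vocabulary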

Event Update Centrality: Here we devise a set of experiments to test whether an event update is central in the documents that contain it. If this is the case, algorithms that can aggregate sentences should be able to identify relevant and informative updates. Graph-based ranking methods have been proposed for document summarization and keyword extraction tasks [31, 67]. Starting from PageRank [79], these methods construct either a sentence or a word network, assuming that important sentences or words are linked to many other important sentences or words. The underlying model on which these methods are based is the Random Walk [103] on weighted graphs: an imaginary walker starts walking from a node chosen arbitrarily, and from that node continues moving towards one of its neighbouring nodes with a probability proportional to the weight of the edge connecting the two nodes. Eventually, probabilities of arriving at each node of the graph are produced; these denote the popularity, centrality, or importance of each node. Graph-based ranking methods for text data differ based on whether they use words or sentences to represent nodes, and in the way the transition probability from one node to the other is defined. Most often, they rely on various text features such as word similarity and word co-occurrence.

(Within Document Centrality) In this first experiment we are interested in testing whether an event update is central within the document that contains it. This scenario would be optimal, since if this is the case, centrality algorithms running on incoming documents could emit event updates in a timely manner. To this end, we use LexRank [31], a state-of-the-art graph-based summarization algorithm, and examine the ranking of event updates within each document.
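A small LexRank-style sketch of this within-document ranking is shown below: sentences become nodes, edges are weighted by the cosine similarity of their tf.idf vectors, and a power iteration over the row-normalized similarity matrix yields a centrality score per sentence. It is a simplified continuous variant under our own default parameters, not the exact LexRank configuration of [31].

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def sentence_centrality(sentences, damping=0.85, iterations=100):
    """Rank the sentences of one document by centrality on a cosine-similarity graph."""
    sims = cosine_similarity(TfidfVectorizer().fit_transform(sentences))
    np.fill_diagonal(sims, 0.0)
    row_sums = sims.sum(axis=1, keepdims=True)
    # row-normalize; rows with no similar neighbours fall back to a uniform transition
    transition = np.divide(sims, row_sums,
                           out=np.full_like(sims, 1.0 / len(sentences)),
                           where=row_sums > 0)
    scores = np.full(len(sentences), 1.0 / len(sentences))
    for _ in range(iterations):  # power iteration with a PageRank-style damping factor
        scores = (1 - damping) / len(sentences) + damping * transition.T.dot(scores)
    return scores

doc = ["An explosion hit the train station this morning.",
       "The explosion at the station injured dozens of passengers.",
       "Officials will hold a press conference later."]
print(np.argsort(-sentence_centrality(doc)))  # sentence indices, most to least central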

(Across Documents Centrality) Here we perform a maximal information experiment in which we are interested in assessing the ranking of sentence updates across documents. If sentence updates turn out to be central across documents, it signifies that even though they appear not to be central inside single documents, they become central once more information accumulates. We are aware that devising such an algorithm would fall short of providing the user with timely updates; however, in this experiment we want to identify the upper limit of centrality-based algorithms towards event summarization. Therefore we purposefully abstract away the temporal aspect.

Event Update Modeling: In this experiment we test the hypothesis that event updates bear inherent characteristics that are not encountered in non-update sentences. If this is indeed the case, then one might be able to devise a method that uses these inherent characteristics to predict whether a sentence is an update or a non-update. We model the inherent characteristics of a general event update as the set of terms with high log-likelihood ratio, that is, the set of the most discriminative event terms. Since extracting the most discriminative terms for an event at hand from the gold standard annotations would result in a form of overfitting, as we learn from and predict on the same dataset, we devise two experiments.

(General event update modeling) In this first experiment we test the hypothesis that an event update can generally be distinguished from a non-update independent of the event particulars or the event type. We use the log-likelihood ratio test to identify the most discriminative terms in event updates vs. non-updates (in total we extract 8,471 unigrams and 1,169,276 bigrams using the log-likelihood ratio weighting scheme). Afterwards we examine the degree of overlap between the extracted discriminative terms and the annotated updates for each test event.

(Event-type update modeling) Given the fact that different event types may be expressed using a different vocabulary, we repeat the experiment described above, this time considering only events that have the same event type in common. Our goal is to learn discriminative LLR terms that are specific to a particular type of event. We use the annotated sentences from the gold standard for each event type in building our foreground corpus; the background corpus is made up of all non-update sentences from the relevant documents per event type. We discard event types for which there is not enough annotated data available.

1In total we extract 8,471 unigrams and 1,169,276 bigrams using the log-likelihood ratioweighting scheme.


(Entity-based update modeling) In this experiment we test how many of the sentence updates in our collection contain named entities. In particular, we look into the possibility that effective summaries of events are produced using features derived from the entities directly involved in the development of the event. If sentence updates do contain entities, then this implies that entity-focused modeling of events can be used for identifying the relevant sentences, and improving sentence scoring and ordering inside an event summary.

4.3 Results and Analysis

We test our methods on the set of events provided by the TREC TS organizers as part of the 2013 and 2014 collections. As described in Chapter 3 - the Evaluation section, each event update contains one or more critical units of information, called information nuggets. Two event updates may contain the same critical information. Information nuggets were extracted from update sentences by human annotators, and were used to further identify update sentences that were not in the original pool. In our experiments we use this extended set of updates. In what follows, we present in turn our findings for each of the research questions we formulated earlier.

4.3.1 Retrieval Algorithms: Are event updates retrievable?

The first question we want to answer is to what extent there is a language overlap between the query events (and query expansions) and the event updates. To get a theoretical upper bound we first examine how many event updates contain: i) at least one query term, ii) at least one query term after WordNet and Word2Vec query expansion, and iii) at least one query term after query expansion with all the terms from event updates found in documents prior to the current update (relevance feedback). We observe that on average 24.4% of event updates are guaranteed to never be retrieved by a traditional retrieval algorithm, while 22.7% of updates will never be retrieved by a query expanded with either WordNet or Word2Vec.
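The coverage numbers above amount to a simple term-overlap check, sketched below. The example updates and query terms are hypothetical, and preprocessing steps such as stemming and stopword removal are omitted for brevity.

```python
def update_coverage(updates, query_terms):
    # Fraction of gold-standard updates that share at least one term with the
    # (possibly expanded) event query; 1 - coverage is the share of updates a
    # purely query-driven retrieval algorithm can never match.
    query = {t.lower() for t in query_terms}
    covered = sum(1 for u in updates if query & set(u.lower().split()))
    return covered / len(updates) if updates else 0.0

# Hypothetical example: coverage for an original query vs. an expanded one.
updates = ["the train crashed into the buffer at the station",
           "dozens of commuters were injured in the collision"]
print(update_coverage(updates, ["buenos", "aires", "train", "crash"]))
print(update_coverage(updates, ["buenos", "aires", "train", "crash", "collision"]))
```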

We initially used a pre-trained Word2Vec model2 on external data (the Google news corpus). Surprisingly, our query expansion experiments with Word2Vec trained on this dataset did not change the coverage of the updates. On careful inspection of the expansion terms, we observe that they contain different forms of the same word, or related concepts which are not relevant to the event under consideration. As we expect a Word2Vec representation biased towards crisis and disaster events to be more effective, we train Word2Vec on the set of gold standard updates from the TREC TS 2013 and TREC TS 2014 collections. We include these expansion terms in the last column of Tables 4.1 - 4.5. We observe that biasing Word2Vec towards disaster scenarios results in adding more discriminative terms to the query, useful in the retrieval of relevant sentence updates.
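The sketch below illustrates how such a biased expansion can be obtained with the gensim implementation of Word2Vec trained on the gold standard updates; the hyperparameters and the gensim >= 4.0 parameter names (e.g. vector_size) are assumptions, not necessarily the exact configuration used in our experiments.

```python
from gensim.models import Word2Vec  # parameter names assume gensim >= 4.0

def expand_query(query_terms, update_sentences, topn=5):
    # Train Word2Vec on the tokenised gold-standard update sentences, so that
    # the learned neighbourhoods are biased towards crisis/disaster vocabulary,
    # then add the nearest neighbours of every query term to the query.
    corpus = [s.lower().split() for s in update_sentences]
    model = Word2Vec(corpus, vector_size=100, window=5, min_count=2, sg=1)
    expanded = {t.lower() for t in query_terms}
    for term in expanded.copy():
        if term in model.wv:
            expanded.update(w for w, _ in model.wv.most_similar(term, topn=topn))
    return expanded
```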

Relevance feedback when using all query terms in past event updates lowers the amount of uncovered updates to 16% on average across all event queries. Therefore, this also signifies that the upper bound performance for retrieval algorithms reaches approximately 84% update coverage on average.

2 https://code.google.com/archive/p/word2vec/


Hence, retrieval algorithms with pseudo-relevance feedback might be able to account for any vocabulary gap and topic drift in the description of sub-events. In order to be realistic though, we compute likelihood ratios for words in our corpus. We first consider annotated updates as our foreground corpus, and non-updates as our background corpus. We rank terms on the basis of their discriminative power. In Table 4.1 and Table 4.2 - Column 1, we report on the original query terms and their rankings among the most discriminative LLR terms extracted from the TREC TS 2013 collection. In Table 4.3, Table 4.4 and Table 4.5 - Column 1 we report on the TREC TS 2014 original query terms and their rankings inside the LLR terms extracted from the TREC TS 2014 collection. We observe that, in general, query terms appear to be ranked high in the list of discriminative terms; the problem, however, arises when non-update sentences also contain these terms. This would make it very difficult for any information retrieval algorithm to distinguish between the sentence updates and non-updates for any given event.

We repeat the same experiment after expanding query terms with WordNet and Word2Vec synonyms, and report on the ranks of the expanded query terms inside the list of LLR terms with high discriminative power in Tables 4.1 - 4.5, Columns 2 and 3. We observe that these query expansion terms are not very discriminative in general, however Word2Vec trained on the relevant updates performs much better than on external data. Terms that represent named entities, e.g. costa concordia for event with id 11 or boston bombings for event with id 14, are not very discriminative, and therefore it would be very hard for any retrieval algorithm to identify updates using only the query. The WordNet expansion terms do not seem to help, while the Word2Vec expansions can increase the coverage by a relatively small margin.

4.3.2 Do event updates demonstrate inherent characteristics?

Given the results of the previous experiment, a hypothesis to test is whether knowing event discriminative terms beforehand can help in retrieving event updates. Clearly, different event types may have different inherent characteristics; for instance, it is expected that an event of type accident may not share the same characteristics as an event of type protest. Hence, we perform our analysis on different slices of the data. First we create a general model of event updates by considering non-update sentences as a background corpus and update sentences as a foreground corpus. Then we compute the overlap between discriminative terms from this general model across all events and their types with the update sentences of the event under consideration. One can see in Table 4.6 and Table 4.7 (Columns 3 and 4), and from Figure 4.3 (Column 1), that discriminative terms belonging to the general model appear on average in 95% of the event updates. Note that this is not a theoretical upper bound, as in the case of the retrieval algorithms in the previous section, since these are terms with high discriminative power in general and therefore able to separate update from non-update sentences. Therefore, this is more of an expected average performance.

We repeat the same experiment, this time for each event type separately. We compute the overlap between the discriminative terms from the event type model and the annotated sentence updates, and present results for these experiments in Table 4.6 and Table 4.7 (Columns 5 and 6), and Figure 4.3


Table 4.1: Query terms ranked by their corresponding log-likelihood ratio value. If a term is not present in the set of extracted LLR terms, then the term is assigned a rank of -1. Word2Vec was trained on the TREC TS 2013 collection of relevant documents.

Event Id | Query | Query WordNet | Query Word2Vec
1 (crash, 44),

(train, 9),(bueno, 4),(air, 21)

(barge in, -1), (vent, -1),(coach, 1442), (bueno, 4),(string, 8302), (air out, -1),(caravan, -1), (prepar, 4941),(disciplin, -1), (clang, -1),(air, 21), (aim, 1825), (trail,1126), (train, 9), (doss, -1),(crash, 44), (educ, 490),(public, 894), (gear, 8045)

(bueno, 4), (trene, 975),(critic, 7237), (slam, 269),(51st, 2602), (morning.mor,-1), (plow, 444), (buffer, 143),(collis, 1898), (termin, 1163),(fatal, 6579), (commut, 38),(feb., 986), (crash, 44),(bestiv, 4635), (barrier, 907),(beat, 2227), (rail, 166),(argentin, 96), (hurt, 1019),(plane, 3099), (rush-hour,816), (retain, 649), (derail,7447), (smash, 707), (air, 21),(27-year-old, 8382), (train,9), (auditor, 7182),(motorcycl, 3451), (found,1570), (railway, 206)

2 (fire, 12),(pakistan,7), (factori,1)

(pakistan, 7), (open fir, -1),(displac, 3329), (fire, 12),(ardor, -1), (burn, 4634),(arous, -1), (fuel, 6888),(factori, 1)

(pakistan, 7), (garment, 33),(walk, 742), (vietnames,5934), (shoe, 252), (lahor,43), (bomber, 1997), (erupt,2794), (doubl, 5477), (lone,4359), (firework, 5904),(blaze, 55), (biggest, 2803),(destroy, 517), (pakistani,101), (factori, 1), (textil,2090), (perish, 1737), (fire,12), (burn, 4634), (suburb,2284), (believ, 4518), (broke,637), (fled, 6184), (karachi,5), (separ, 3211), (gunfir,3745), (act, 4711), (firefight,7495)

3 (shoot, 73),(colorado,722)

(shoot, 73), (photograph,6588), (colorado, 722), (tear,2248), (dart, -1), (inject, -1),(blast, 243), (film, 7050),(fritter, -1)

(shoot, 73), (domest, -1),(identifi, 1046), (colorado,722), (movi, 1395), (aurora,946), (gurudwara, -1),(theater, 3434), (method,3060), (react, 8259), (colo,5918), (mich., 6231), (colo.,2873), (rememb, 8392),(centuri, 6612), (vigil, 7955),(wisconsin, 78), (massacr,7114), (sikh, 39), (theatr,2177), (speak, 8696)

4 (shoot, 73),(sikh, 39),(templ, 31)

(shoot, 73), (photograph,6588), (tear, 2248), (fritter,-1), (synagogu, -1), (dart, -1),(templ, 31), (blast, 243),(sikh, 39), (film, 7050),(inject, -1)

(domest, -1), (identifi, 1046),(slain, 7325), (motiv, 3935),(sikh, 39), (speak, 8696),(gurudwara, -1), (wis., 644),(gurdwara, 7505), (worshipp,475), (brookfield, 7203),(wisconsin, 78), (adelaidenow,3312), (attend, 2730), (templ,31), (react, 8259), (vigil,7955), (nypd, 6590), (shoot,73), (shock, 2592), (mich.,6231)


Table 4.2: Query terms ranked by their corresponding log-likelihood ratio value. If a term is not present in the set of extracted LLR terms, then the term is assigned a rank of -1. Word2Vec was trained on the TREC TS 2013 collection of relevant documents (continuation of Table 4.1).

Event Id | Query | Query WordNet | Query Word2Vec
5 (hurrican,

11), (isaac,3)

(isaac, 3), (hurrican, 11) (frankenstorm, 2171), (dump,7789), (predict, 539), (inland,3504), (katrina, 26), (expect,335), (northeast, 1859),(storm, 41), (isaac, 3),(threaten, 2908), (approach,861), (hurrican, 11), (bear,7185), (warn, 135),(southeast, 1907), (batter,1216), (path, 168),(superstorm, 1093), (devast,561), (mid-atlant, 5802),(toward, 305), (southeastern,2074)

6 (hurrican,11), (sandi,82)

(hurrican, 11), (flaxen, -1),(arenac, -1), (sandi, 82)

(churn, 6130), (frankenstorm,2171), (hurrican, 11), (brace,1392), (bear, 7185), (devast,561), (near, 454), (batter,1216), (superstorm, 1093),(storm, 41), (threaten, 2908),(path, 168), (barrel, 1883),(toward, 305), (approach,861), (sandi, 82)

7 (derecho,82),(midwest,82)

(derecho, -1), (midwest, -1) (unexpectedli, 7643),(derecho, -1), (slow, 5972),(develop, 3378), (depress,4618), (deploy, 3344),(portion, 4234), (pressur,4718), (biloxi, 3512),(midwest, -1), (flood-pron,-1), (proof, 7348)

8 (typhoon,25),(bopha, 48)

(typhoon, 25), (bopha, 48) (dec., 4030), (power, 1376),(cuba, 1668), (philippin, 36),(southern, 351), (mindanao,126), (bataan, 877), (near,454), (slam, 269), (typhoon,25), (expect, 335), (lash,457), (strong, 138), (across,4460), (bopha, 48)

9 (guatemala,2),(earthquak,24)

(guatemala, 2), (earthquak,24)

(strong, 138), (hit, 64),(philippin, 36), (strongest,368), (earthquak, 24),(guatemala, 2), (caribbean,586), (quak, 16), (tremor,8303), (temblor, 5451),(struck, 201), (7.4-magnitud,112), (magnitud, 238)

10 (aviv, 28),(bu, 47),(tel, 29),(bomb, 50)

(bombard, 8405), (tel, 29),(busbar, -1), (bus topolog,-1), (aviv, 28), (bu, 47), (fail,7932), (bomb, 50)

(explod, 213), (blast, 243),(tel, 29), (rocket, 3218),(israel, 289), (aviv, 28), (bu,47), (explos, 608), (ceasefir,563), (bomb, 50), (truce,482), (isra, 963), (bomber,1997), (airstrik, 5844)


Table 4.3: Query terms ranked by their corresponding log-likelihood ratio value. If a term is not present in the set of extracted LLR terms, then the term is assigned a rank of -1. Word2Vec was trained on the TREC TS 2014 collection of relevant documents.

Event Id | Query Term Rank | Any WordNet Synonym Rank | Any Similar Word2Vec Term Rank
11 (costa,

183),(concordia,157)

(costa, 183), (concordia, 157),(rib, -1)

(costa, 183), (vaus, 7721),(shipwreck, 3636),(genoa-bas, -1), (wreck,8070), (lean, -1), (liner,4793), (concordia, 157),(ill-fat, 6856), (keel, 6252),(7:13, -1), (raze, -1), (rica, -1)

12 (cold, 23),(wave,218),(european,1471)

(curl, -1), (brandish, -1),(beckon, -1), (cold, 23),(wave, 218), (roll, 7044),(european, 1471)

(explod, 70), (feel, 1615),(colder, 7379), (wintri, 6685),(sleet, 7385), (mild, 6329),(cold, 23), (damag, 145),(freez, 487), (dissolut, -1),(poorest, 880), (fahrenheit,1542), (low, 3566),(shockwav, 1032), (boom,7753), (69-year, -1), (relat,828), (wave, 218),(temperatur, 415), (monetari,2669), (chill, 1099), (smash,3887), (alloc, -1), (deadlock,4282), (shock, 6505),(eurasian, -1), (unleash,1102), (european, 1471)

13 (queensland,16), (flood,25)

(queensland, 16), (flood, 25),(delug, 536), (flood tid, -1)

(lismor, 2139), (dunde, 1442),(inund, 544), (monsoon, -1),(grafton, 840), (crocodil,2092), (flood, 25), (tableland,-1), (sandgat, -1),(capricornia, -1), (emerald,-1), (queensland, 16), (batter,112), (tasmania, 6756),(border, 3640), (drench, -1),(cyclon, 41)

14 (boston,459),(bomb,614),(marathon,690)

(bombard, 6097), (boston,459), (fail, 7549), (bomb,614), (marathon, 690)

(marathon.mor, 6684),(schengen, -1), (bomb, 614),(siddiqui, -1), (boston, 459),(explot, -1), (england, 43),(portland, 793), (2013-04-15,-1), (fame, -1), (marathon,690), (hezbollah, 4783),(4/15/2013, -1), (snow, 5),(mujib, -1), (nail, -1), (storm,0), (117th, -1), (main, 8030),(3-feet, -1), (burga, 7376),(inch, 268), (connecticut,1299), (york, 144), (1993, -1),(pakistani, 5646), (noréast,14), (mossad, -1), (assassin,-1), (aafia, -1)

15 (riot, 20),(egyptian,35)

(belly laugh, -1), (orgi, -1),(carous, -1), (riot, 20),(egyptian, 35)

(morsi, 66), (ignit, 5054),(escal, 882), (riot, 20), (rage,327), (23/11/2012, 450),(egyptian, 35), (deadliest,1355), (accus, 1636),(denounc, 1951), (neutral,-1), (28/11/2012, 1253),(shahbagh, 478), (dictatori,3615), (cairo-bas, -1),(27/11/2012, 2056), (moham,125), (pit, 348), (kick, 5272)


Table 4.4: Query terms ranked by their corresponding log-likelihood ratio value. If a term is not present in the set of extracted LLR terms, then the term is assigned a rank of -1. Word2Vec was trained on the TREC TS 2014 collection of relevant documents (continuation of Table 4.3).

Event Id | Query Term Rank | Any WordNet Synonym Rank | Any Similar Word2Vec Term Rank
16 (quran, 15),

(protest, 1),(burn, 13)

(cauter, -1), (cut, 1571),(quran, 15), (bite, 5675),(burn, 13), (combust, -1),(electrocut, -1), (koran, 38),(protest, 1), (burn off, -1),(sunburn, -1)

(russia-rrb-, -1), (partial,3954), (incit, 4142), (ralli,186), (protest, 1), (wit, 5711),(shahbag, 235), (sverdlovsk,7761), (-lrb-mountain, -1),(quraan, -1), (inadvert, 1659),(activist, 8055), (shahbagh,478), (object, 3660), (quran,15), (burn, 13), (rain, 345),(qurán, 272), (disintegr,4413), (fall, 3396), (demonstr,132), (dhaka, 167), (desecr,217), (amid, 182), (koran, 38)

17 (hostag, 2),(amena,75), (crisi,11)

(hostag, 2), (amena, 75),(crisi, 11)

(libyan, 7627), (bait, 8323),(hostag, 2), (gradual, 6074),(bungl, -1), (sahara, 135),(tiguentourin, 1767), (crisi,11), (rebrand, -1), (seizur,4409), (-lrb-courtesi, -1),(character, -1), (econom,1109), (sonatrach, 6662),(bloodbath, 1626),(israeli-palestinian, -1),(sever, -1), ( brought, -1),(bp-oper, -1), (turmoil, 6734),(briton, 113), (algeria, 4),(bp.com-rrb-, -1), (deepen,1060), (debt, 1144),(four-day, 289), (jointli,6730), (eurozon, 4833),(tigantourin, 2838), (amena,75), (miss, -1), (export,1956), (gasfield, -1)

18 (russian,62),(protest, 1)

(russian, 62), (protest, 1) (dhaka, 167), (slavic, -1),(amid, 182), (empir, -1),(lluvia, -1), (meteorit, 50),(ralli, 186), (protest, 1),(activist, 8055), (shahbagh,478), (includ, 861), (meteor,7), (explos, 518), (russian,62), (found, 5820), (shahbag,235), (soviet, 387), (russia,10), (demonstr, 132)

19 (romanian,99),(protest, 1)

(romanian, 99), (protest, 1) (appoint, 3556), (dhaka, 167),(finmin, 547), (amid, 182),(romanian, 99), (baconschi,1125), (exil, -1), (worm, -1),(backer, -1), (insult, 1038),(mujib, -1), (teodor, 1486),(protest, 1), (deem, 4506),(shahbagh, 478), (activist,8055), (ralli, 186), (shahbag,235), (demonstr, 132)


Table 4.5: Query terms ranked by their corresponding log-likelihood ratio value. If a term is not present in the set of extracted LLR terms, then the term is assigned a rank of -1. Word2Vec was trained on the TREC TS 2014 collection of relevant documents (continuation of Table 4.3).

Event Id | Query Term Rank | Any WordNet Synonym Rank | Any Similar Word2Vec Term Rank
20 (protest, 1),

(egyptian,35)

(protest, 1), (egyptian, 35) (dhaka, 167), (moham, 125),(egyptian, 35), (23/11/2012,450), (ralli, 186), (morsi, 66),(27/11/2012, 2056), (accus,1636), (28/11/2012, 1253),(protest, 1), (activist, 8055),(shahbagh, 478), (denounc,1951), (amid, 182), (dictatori,3615), (shahbag, 235),(demonstr, 132)

21 (russia, 10),(meteor, 7)

(soviet russia, -1),(meteoroid, -1), (russia, 10),(meteor, 7), (soviet union, -1)

(explod, 70), (streak, 280),(phi, -1), (hit, 24), (incred,6066), (fragment, 493),(injur, 21), (15/02/2013,517), (10-ton, 1958), (shower,79), (firebal, 3015), (meteor,7), (meteorit, 50), (russian,62), (ural, 160), (russia, 10)

22 (bulgarian,96),(protest, 1)

(bulgarian, 96), (protest, 1) (gerb, 4517), (dhaka, 167),(bulgarian, 96), (finmin, 547),(revok, 1183), (ralli, 186),(boyko, 3284), (resign, 74),(borissov, 630), (borisov,1197), (protest, 1), (activist,8055), (shahbagh, 478),(amid, 182), (tender, 3469),(shahbag, 235), (demonstr,132), (bulgaria, 133)

23 (protest, 1),(shahbag,235)

(protest, 1), (shahbag, 235) (dhaka, 167), (throng, 6994),(ralli, 186), (movement,3466), (projonmo, 4541),(protest, 1), (28/11/2012,1253), (shahbagh, 478),(activist, 8055), (sit-in, 6680),(amid, 182), (tahrir, -1),(shahbag, 235), (demonstr,132), (anti-govern, 698)

24 (nor'east,14)

(nor'east, 14) (northeast, 32), (england,43), (winter, 31), (blizzard,6), (noréast, 14), (nemo, -1),(wintri, 6685), (snow, 5),(noreast, 65), (storm, 0),(snowstorm, 42)

25 (shoot, 51),(california,54),(southern,46)

(shoot, 51), (photograph,834), (tear, 2133), (southern,46), (dart, -1), (southerli, -1),(california, 54), (inject,8213), (blast, 275), (film,521), (fritter, -1)

(ex-lo, 304), (angel, 243),(kill, 9), (throughout, 1017),(mountain, 382), (nevada,4518), (northern, 1637),(press-enterpris, -1), (rampag,656), (thursday, 255), (offic,286), (torranc, 2317), (lapd,711), (calif., 763), (southern,46), (california, 54), (ex-lapd,1990), (ex-la, 1181), (spree,494), (former, 1492), (shoot,51), (ex-offic, 7472), (across,45), (ambush, 3650),(north-eastern, 7179),(riversid, 755)


Table 4.6: Degree of overlap of discriminative terms with the TREC TS 2013 event updates. The LLR terms have been computed by considering as a foreground corpus all relevant event updates, and as a background corpus all non-updates from the relevant documents. We limit this experiment to the top 100 extracted LLR terms, i.e. we check whether any of these discriminative LLR terms is to be found inside an annotated sentence update.

Event Id | Total Updates | Overlap (General Model) | Overlap % (General Model) | Overlap (Type Model) | Overlap % (Type Model)
1 | 233 | 228 | 97.85 | 230 | 98.71
2 | 168 | 163 | 97.02 | 164 | 97.61
3 | 42 | 31 | 73.80 | 40 | 95.23
4 | 180 | 175 | 96.68 | 180 | 100.00
5 | 35 | 35 | 100.00 | 35 | 100.00
6 | 172 | 166 | 96.51 | 141 | 81.97
7 | – | – | – | – | –
8 | 76 | 74 | 97.36 | 57 | 75.00
9 | 68 | 68 | 100.00 | 68 | 100.00
10 | 95 | 92 | 95.83 | 95 | 100.00

(Column 2). Interestingly, when mining event-specific terms the overlap generally increases for the TREC TS 2013 collection, but deteriorates for the TREC TS 2014 collection. This is against our hypothesis, as we were expecting that event-specific discriminative terms would only increase the degree of overlap with the relevant sentence updates. Trying to explain possible causes of this phenomenon, we assume it happens due to the smaller size of the event type dataset used as a foreground corpus in computing the likelihood ratios of the terms, compared to the case when we include all data irrespective of the event type in the foreground corpus. The resulting event-specific LLR terms are fewer but have a higher discriminative power, although we do not consider the discriminative power explicitly when computing the overlap between the two models. In addition to this, we are using a fixed cut-off threshold in our experiments for selecting terms from the discriminative list up until a specific rank. It could be that results would look different if we chose another threshold; however, we leave the exploration of optimal cut-offs as future work towards devising effective algorithms.

4.3.3 Summarization Algorithms: Do event updates demonstrate centrality?

Summarization methods applied at document level assume that event updates demonstrate centrality inside the documents they appear in. In the next set of experiments we test whether it is the case that event updates demonstrate centrality characteristics. Ideally, update sentences are central and salient inside the documents they are found in. This would allow a summarization algorithm to identify updates as soon as a document is streaming in. We want to assess the centrality of the updates, and for this reason we run LexRank on each incoming document. We process the LexRank output to infer rankings inside documents for the set of relevant event updates. Table 4.8 - Within Document column demonstrates how many of the event updates appear in the top-1 and top-10 ranked sentences for the TREC TS


Table 4.7: Degree of overlap of discriminative terms with the TREC TS 2014 event updates. The LLR terms have been computed by considering as a foreground corpus all relevant event updates, and as a background corpus all non-updates from the relevant documents. We limit this experiment to the top 100 extracted LLR terms, i.e. we check whether any of these discriminative LLR terms is to be found inside an annotated sentence update.

Event Id | Total Updates | Overlap (General Model) | Overlap % (General Model) | Overlap (Type Model) | Overlap % (Type Model)
11 | 392 | 130 | 33.16 | 129 | 32.90
12 | 184 | 177 | 96.19 | 87 | 47.28
13 | 313 | 300 | 95.84 | 108 | 34.50
14 | 401 | 317 | 79.05 | 189 | 47.13
15 | 315 | 296 | 93.96 | 251 | 79.68
16 | 554 | 544 | 98.19 | 371 | 66.96
17 | 648 | 584 | 90.12 | 259 | 39.96
18 | 409 | 390 | 95.35 | 335 | 81.90
19 | 341 | 304 | 89.14 | 283 | 82.99
20 | 289 | 281 | 97.23 | 221 | 76.47
21 | 798 | 782 | 97.99 | 271 | 33.95
22 | 220 | 199 | 90.45 | 196 | 89.09
23 | 274 | 231 | 84.30 | 190 | 69.34
24 | 430 | 411 | 95.58 | 195 | 45.34
25 | 376 | 352 | 93.61 | 271 | 72.07

2013 test events; similarly, in Table 4.10 we present within document centrality results for the TREC TS 2014 test events. We also present the results of the experiment as a heat map in Figure 4.2. The average precision values across the two collections can be found below the heatmap, while Table 4.12 shows the average values for each collection separately. For the TREC TS 2013 collection, we can see that it is rarely the case that event updates make it to the top of the ranking inside single documents. However, within the TREC TS 2014 dataset we observe higher precision and recall scores, signifying that the number of updates which are central in top ranked positions is much higher. From this we conclude that it would be much harder for an algorithm which relies on centrality to identify relevant updates from the documents inside the TREC TS 2013 collection compared to the TREC TS 2014 collection. Now taking a retrospective look at the centrality problem by considering scores across all relevant documents, presented in Table 4.9 and Table 4.11 - Across Relevant Documents column for the TREC TS 2013 and TREC TS 2014 datasets respectively, we observe that the same pattern holds for these two different collections. While computing centrality across documents does not change the scores for the TREC TS 2013 dataset at all, for the TREC TS 2014 collection we can see a considerable increase in both precision and recall. TREC TS 2014 annotated updates demonstrate centrality within and across documents, rendering them central in the development of the events under consideration. The summarization methods examined so far also seem to be complementary to each other. For example, for the event with id 11, while it is hard to identify event updates when relying on retrieval algorithms or event modeling techniques, we can see that sentence centrality would make for a successful approach since the updates for this event are central across all documents. Such an algorithm however is not particularly useful since it has to wait for all documents to stream in before identifying


any update sentences. One could however relax the low latency requirement and examine how many documents a summarization algorithm needs to observe before salient updates make it to the top of the ranking. We leave the construction of such an algorithm for future work.

Table 4.8: Within document centrality scores based on LexRank rankings for the TREC TS 2013 collection.

Event Id | Nuggets w/ ≥1 update | Total Updates | Ranked Nuggets | Precision@1 | Precision@10 | R-Precision | Recall@1 | Recall@10
1 | 39 | 233 | 30 | 0.0000 | 0.0524 | 0.0016 | 0.0000 | 0.2272
2 | 31 | 168 | 16 | 0.0000 | 0.0256 | 0.0000 | 0.0000 | 0.1395
3 | 26 | 42 | 11 | 0.0000 | 0.0060 | 0.0002 | 0.0000 | 0.0135
4 | 43 | 180 | 26 | 0.0062 | 0.0248 | 0.0003 | 0.0363 | 0.1272
5 | 20 | 35 | 8 | 0.0140 | 0.0281 | 0.0000 | 0.0250 | 0.0500
6 | 64 | 172 | 38 | 0.0050 | 0.0325 | 0.0002 | 0.0190 | 0.1047
7 | – | – | – | – | – | – | – | –
8 | 46 | 76 | 22 | 0.0081 | 0.0406 | 0.0007 | 0.0172 | 0.0862
9 | 27 | 68 | 13 | 0.0073 | 0.0147 | 0.0000 | 0.0344 | 0.0344
10 | 19 | 95 | 10 | 0.0000 | 0.0267 | 0.0000 | 0.0000 | 0.1851

Table 4.9: Across document centrality scores based on LexRank rankings for the TREC TS 2013 collection.

Event Id | Nuggets w/ ≥1 update | Total Updates | Ranked Nuggets | Precision@1 | Precision@10 | R-Precision | Recall@1 | Recall@10
1 | 39 | 233 | 30 | 0.0000 | 0.0524 | 0.0016 | 0.0000 | 0.2272
2 | 31 | 168 | 16 | 0.0000 | 0.0256 | 0.0000 | 0.0000 | 0.1395
3 | 26 | 42 | 11 | 0.0000 | 0.0060 | 0.0002 | 0.0000 | 0.0135
4 | 43 | 180 | 26 | 0.0062 | 0.0248 | 0.0003 | 0.0363 | 0.1272
5 | 20 | 35 | 8 | 0.0140 | 0.0281 | 0.0000 | 0.0250 | 0.0500
6 | 64 | 172 | 38 | 0.0050 | 0.0325 | 0.0002 | 0.0190 | 0.1047
7 | – | – | – | – | – | – | – | –
8 | 46 | 76 | 22 | 0.0081 | 0.0406 | 0.0007 | 0.0172 | 0.0862
9 | 27 | 68 | 13 | 0.0073 | 0.0147 | 0.0000 | 0.0344 | 0.0344
10 | 19 | 95 | 10 | 0.0000 | 0.0267 | 0.0000 | 0.0000 | 0.1851

Table 4.10: Within document centrality scores based on LexRank rankings for the TREC TS 2014 collection.

Event Id | Nuggets w/ ≥1 update | Total Updates | Ranked Nuggets | Precision@1 | Precision@10 | R-Precision | Recall@1 | Recall@10
11 | 127 | 392 | 116 | 0.0782 | 0.4341 | 0.0187 | 0.1574 | 0.5511
12 | 29 | 184 | 12 | 0.0406 | 0.2845 | 0.0087 | 0.1034 | 0.2413
13 | 40 | 313 | 18 | 0.0296 | 0.3601 | 0.0202 | 0.1000 | 0.2000
14 | 47 | 401 | 43 | 0.0569 | 0.2405 | 0.0203 | 0.1914 | 0.5106
15 | 39 | 315 | 34 | 0.0722 | 0.5222 | 0.0461 | 0.2051 | 0.6666
16 | 46 | 554 | 42 | 0.0776 | 0.5307 | 0.0395 | 0.3043 | 0.7608
17 | 47 | 648 | 43 | 0.0530 | 0.2979 | 0.0206 | 0.3617 | 0.7659
18 | 66 | 409 | 58 | 0.0681 | 0.5265 | 0.0392 | 0.1969 | 0.5757
19 | 53 | 341 | 48 | 0.1428 | 0.8911 | 0.0859 | 0.2075 | 0.5849
20 | 23 | 289 | 22 | 0.0180 | 0.2382 | 0.0140 | 0.2173 | 0.7826
21 | 83 | 798 | 66 | 0.0750 | 0.4166 | 0.0305 | 0.2650 | 0.5542
22 | 53 | 220 | 44 | 0.1386 | 0.6569 | 0.0596 | 0.2641 | 0.4716
23 | 67 | 274 | 60 | 0.0268 | 0.2741 | 0.0204 | 0.0746 | 0.3880
24 | 49 | 430 | 35 | 0.0355 | 0.2912 | 0.0152 | 0.1224 | 0.4897
25 | 62 | 376 | 58 | 0.0744 | 0.4468 | 0.0382 | 0.1612 | 0.6129

In Tables 4.8-4.9, for event with id 7 we cannot report on any results as there are no documents released by the TREC TS organizers associated with this event (and implicitly no relevant annotated sentence updates). In


Table 4.11: Across document centrality scores based on LexRank rankings for the TREC TS 2014 collection.

Event Id | Nuggets w/ ≥1 update | Total Updates | Ranked Nuggets | Precision@1 | Precision@10 | R-Precision | Recall@1 | Recall@10
11 | 127 | 392 | 116 | 1.0000 | 0.4128 | 0.2227 | 0.8188 | 0.9133
12 | 29 | 184 | 12 | 0.9593 | 0.0975 | 0.2514 | 0.3448 | 0.4137
13 | 40 | 313 | 18 | 0.9237 | 0.0762 | 0.2767 | 0.4500 | 0.4500
14 | 47 | 401 | 43 | – | – | – | – | –
15 | 39 | 315 | 34 | 1.0000 | 0.1888 | 0.3355 | 0.7948 | 0.8717
16 | 46 | 554 | 42 | 0.9870 | 0.1359 | 0.3229 | 0.8043 | 0.9130
17 | 47 | 648 | 43 | 0.7146 | 0.1085 | 0.2434 | 0.8510 | 0.9148
18 | 66 | 409 | 58 | 1.0000 | 0.2196 | 0.2973 | 0.7727 | 0.8787
19 | 53 | 341 | 48 | 1.0000 | 0.3197 | 0.3197 | 0.6603 | 0.8867
20 | 23 | 289 | 22 | 0.8158 | 0.0794 | 0.1790 | 0.8695 | 0.9565
21 | 83 | 798 | 66 | – | – | – | – | –
22 | 53 | 220 | 44 | 0.8905 | 0.3211 | 0.2202 | 0.6603 | 0.8301
23 | 67 | 274 | 60 | 0.7204 | 0.3172 | 0.2121 | 0.6268 | 0.8805
24 | 49 | 430 | 35 | – | – | – | – | –
25 | 62 | 376 | 58 | – | – | – | – | –

Table 4.11, for events with ids 14, 21, 24 and 25 we cannot report on any centrality scores across relevant documents. Due to the large size of the data, running LexRank on the set of relevant documents for events with these ids takes considerable time and causes out of memory errors on a large cluster machine.

Figure 4.2: Within (A) and across (B) document centrality scores based on LexRank rankings.

Table 4.12: Mean precision values for within and across document centrality for the 2013 and 2014 collections.

Average | P@1 (A) | P@10 (A) | P@R (A) | P@1 (B) | P@10 (B) | P@R (B)
2013 | 0.0045 | 0.0279 | 0.0003 | 0.0045 | 0.0279 | 0.0003
2014 | 0.0667 | 0.4366 | 0.0326 | 0.7151 | 0.1667 | 0.2028


4.3.4 Can entities help in multi-document summarization?

The question we want to answer is whether entities play a role in distinguishing a relevant sentence from a non-relevant one. In such a scenario, entity type models could be integrated into the summarization system and used to address the two major challenges in extractive multi-document summarization: sentence scoring, which ranks the candidate sentence extracts, and summary composition, which addresses the novelty and flow aspects of the summary.

Table 4.13: Named entity statistics for the relevant sentence updates inside the TREC TS 2013 dataset.

Event Id | Nuggets | Nuggets w/ Updates | Relevant Updates | Indexed Updates w/ entities (%) | Entities in Updates | Distinct Entities
1 | 56 | 44 | 233 | 199 (85.40) | 141 | 1,678
2 | 89 | 43 | 168 | 150 (89.28) | 161 | 1,472
3 | 139 | 74 | 42 | 32 (76.19) | 39 | 286
4 | 97 | 55 | 180 | 151 (83.88) | 230 | 882
5 | 108 | 40 | 35 | 35 (100.00) | 44 | 828
6 | 418 | 105 | 172 | 152 (88.37) | 120 | 605
7 | – | – | – | – | – | –
8 | 88 | 58 | 76 | 72 (94.73) | 87 | 949
9 | 45 | 29 | 68 | 63 (92.64) | 69 | 1,163
10 | 37 | 27 | 95 | 87 (91.57) | 84 | 1,465

We begin our analysis with an overview of the number of sentence updates for each event that contain entities, allowing us to understand the limitations of an entity type model on the given collection. To this end, we tag the annotated gold standard updates from the TREC TS 2013 and TREC TS 2014 datasets with the Stanford Named Entity Recognizer3 (NER) for the identification of entities and their types. The type of an entity can be one of the following: person, location, organization, or other. In Table 4.13 and in Table 4.14 we present results for the number of sentence updates containing entities (Column 5), and for the number of distinct entities contained inside these updates (Column 6) for the TREC TS 2013 and TREC TS 2014 collections.
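A minimal sketch of this tagging step, using the NLTK wrapper around the Stanford NER tagger, is shown below; the classifier model and jar paths are installation-specific placeholders, and the aggregation into per-event statistics is our own illustration rather than the exact script used for the tables.

```python
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize

# The classifier model and jar locations are installation-specific placeholders.
ner = StanfordNERTagger("english.all.3class.distsim.crf.ser.gz",
                        "stanford-ner.jar")

def entity_stats(update_sentences):
    # Count how many updates contain at least one named entity and collect the
    # distinct entity surface forms observed across all updates of an event.
    with_entities, distinct = 0, set()
    for sentence in update_sentences:
        tagged = ner.tag(word_tokenize(sentence))
        entities = {token for token, tag in tagged if tag != "O"}
        if entities:
            with_entities += 1
            distinct.update(entities)
    return with_entities, distinct
```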

From the tables presented above we observe that on average 89.11% of the event updates in the TREC TS 2013 collection contain named entities, while 79.96% of the relevant sentence updates in the TREC TS 2014 collection include named entities that could be identified by the Stanford NER tagger; see also Figure 4.3 (Column 3). This is promising evidence that using entities to derive event summarization features can lead to effective summaries of news events, and that entity type models have the potential to significantly improve the quality of the output. However, in spite of this encouraging evidence, if we analyze closely the last two columns of Table 4.13 and Table 4.14, regarding the number of distinct entities inside the annotated updates and the number of distinct entities inside the collection, we can easily notice that the number of distinct entities present inside these updates is non-trivial. Moreover, these entities only constitute a small subset drawn from the set of all entities present inside the entire collection. This makes the selection of relevant entities a challenging problem. As entities are equally present in both update and non-update sentences, the challenge is

3 http://nlp.stanford.edu/software/CRF-NER.shtml


Table 4.14: Named entity statistics for the relevant sentence updates inside the TREC TS 2014 dataset.

Event Id | Nuggets | Nuggets w/ Updates | Relevant Updates | Indexed Updates w/ entities (%) | Entities in Updates | Distinct Entities
11 | 226 | 127 | 392 | 284 (72.44) | 107 | 5,058
12 | 72 | 29 | 184 | 164 (89.13) | 106 | 2,680
13 | 68 | 40 | 313 | 298 (95.20) | 161 | 4,289
14 | 76 | 47 | 401 | 293 (73.06) | 178 | 8,420
15 | 45 | 39 | 315 | 241 (76.50) | 106 | 3,884
16 | 72 | 46 | 554 | 457 (82.49) | 222 | 5,417
17 | 48 | 47 | 648 | 477 (73.61) | 213 | 5,690
18 | 89 | 66 | 409 | 341 (83.37) | 166 | 4,684
19 | 97 | 53 | 341 | 216 (63.34) | 94 | 2,383
20 | 35 | 23 | 289 | 260 (89.96) | 184 | 5,248
21 | 124 | 83 | 798 | 718 (89.97) | 253 | 9,288
22 | 116 | 53 | 220 | 163 (74.09) | 114 | 2,676
23 | 138 | 67 | 274 | 221 (80.65) | 91 | 4,128
24 | 100 | 49 | 430 | 276 (64.18) | 270 | 6,617
25 | 88 | 62 | 376 | 344 (91.48) | 188 | 6,006

how to identify salient entities throughout the development of an event that can distinguish an update from a non-update. Therefore, robust methods that capture the salience of an entity inside a document and throughout the duration of an event are necessary in order to accurately identify relevant and central sentence-length updates. In addition to extracting named entities and quantifying their degree of importance, extracting relations between these entities can possibly represent a step forward towards improving the overall coherence and quality of a summary.

Figure 4.3: Figure illustrating the degree of overlap between the general LLR model and the event updates (Column 1), the degree of overlap between the event type LLR model and the event updates (Column 2), and the percentage of updates containing entities (Column 3).


4.4 Conclusion

In conclusion, in this chapter we have presented a systematic analysis of temporal summarization methods, and examined the retrievability and centrality characteristics of event updates. We were mainly interested in assessing whether there are inherent characteristics in update versus non-update sentences.

To this end, we designed and ran a set of experiments on the theoretical upper bounds where possible, and on more realistic upper bounds with the use of discriminative terms obtained through likelihood ratio calculations.

Our results suggest that retrieval algorithms with query expansion have a theoretical upper bound that does not allow for the identification of all relevant event updates. Topical drift can be partially captured by pseudo-relevance feedback, however the performance of this type of feedback is still bounded below 100% coverage. Modeling event updates through discriminative terms looks like a promising step towards improving the performance of a temporal summarization system; the problem with this approach, however, is how to select the appropriate cut-off value, which in our experiments remains an open question for future investigation. Finally, after assessing sentence centrality with the use of graph-based methods, we could see that salient sentences across documents tend to be ranked in top positions. The issue with these methods, however, is the amount of information that needs to flow into the system before such decisions can be made.


Chapter 5

Methods and Techniques

In this chapter we probe the performance of retrieval, summarization and event update modeling methods towards the creation of temporal summaries of news events. We begin by presenting the architecture of our system and the methods we employ; then, accounting for the upper bound analysis carried out in Chapter 4, we present the results we obtain on the TREC TS 2013 and TREC TS 2014 collections, followed by an analysis of these results and of why the corresponding approaches work or fail.

Figure 5.1: Sketch illustrating the main components of our temporal summarization system.

The framework we built includes a distinct set of modules: corpus pre-processing and indexing, information retrieval, and information processing. We explain below the role of each module and its importance in our experiments.


5.1 Data pre-processing and indexing

Data pre-processing. The TREC KBA Stream Corpus1 released by the organizers of the TREC Temporal Summarization track is an encrypted file (*.gpg) which cannot be used directly. Its decryption requires an authorized key that has been provided by the organizers of the challenge to each participant team. The output of the decryption step is a compressed file (*.xz), which needs to be further processed for the extraction of the content. To this end, we rely on open-source tools that apply natural language processing to large streams of text. The StreamCorpus2 toolbox provides a common data interchange format for document processing pipelines, and can easily serialize or deserialize large batches of documents into flat files called Chunks based on the Apache Thrift format. Chunks contain multiple StreamItems, where each StreamItem (*.sc) corresponds to a document in our collection of documents. Sentences inside documents have been tokenized into sentence arrays, and at this stage it becomes easy to run downstream analytics that leverage the attributes on the token stream.

Data indexing. Once we have extracted our documents, we proceed to indexing them inside ElasticSearch. We index in turn the TREC KBA 2013, 2014 and 2015 Temporal Summarization data. Given the large size of the pre-filtered TREC KBA 2014 corpus, we split the data into batches and create separate ElasticSearch indices for each month in the time period of the corpus (October 2011 - April 2013). This makes it convenient in terms of scalability, searching for documents in almost real-time, and also enhances the repeatability of our experiments. For the TREC KBA 2013 and TREC KBA 2015 we index the already filtered corpora, creating separate indices for each event in the test set.

5.2 Information retrieval module

Our system takes as input a short query defining the event to be tracked, an event category specifying the type of the event, and a stream of timestamped documents relevant for the event under consideration. For each of the events in the test set, we issue the event query specified in the description of each event to retrieve the set of relevant documents matching the query. The ElasticSearch toolkit is designed to facilitate research in language modeling and information retrieval by natively supporting the construction of basic text retrieval methods, such as tf.idf and Okapi BM25 [95]. As the result of the issued event query, ElasticSearch returns a set of documents within the start time and end time of the event. We use these documents in extracting relevant and novel sentences for the inclusion in the event summary, by means of standard information retrieval methods and techniques we present in Section 5.3.1.
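A minimal sketch of such an event query, using the elasticsearch-py client (version 7-style call), is shown below; the index name and the "body"/"timestamp" field names are assumptions about how the corpus was indexed, and the exact mapping used in our system may differ.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()  # assumes a locally running ElasticSearch instance

def retrieve_event_documents(index, event_query, start, end, size=1000):
    # Documents matching the event query, restricted to the event period and
    # sorted by time. Field and index names are placeholders for illustration.
    body = {
        "query": {
            "bool": {
                "must": {"match": {"body": event_query}},
                "filter": {"range": {"timestamp": {"gte": start, "lte": end}}},
            }
        },
        "sort": [{"timestamp": {"order": "asc"}}],
    }
    return es.search(index=index, body=body, size=size)["hits"]["hits"]
```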

Query expansion. For the purpose of the experiments in Chapter 4, we do query expansion. The query describing an event is typically very short (2-3 words in length), and this makes the retrieval of relevant sentences prone to word mismatch problems, in cases when the vocabulary of the query differs significantly from the vocabulary of an update. To prevent this, we rely on

1 http://trec-kba.org/kba-stream-corpus-2014.shtml
2 https://github.com/trec-kba/streamcorpus


query expansion techniques to augment a query word with similar terms. We use two methods: i) WordNet - for each query term we retrieve its WordNet synonyms [69], and augment the original query with these terms, and ii) Word2Vec [68] - we train our model on the relevant documents from the TREC Temporal Summarization 2013 and 2014 collections, retrieve the most similar terms to a query term, and add them to the expanded query.

5.3 Information processing module

In extracting informative sentences from the relevant documents, we employ a set of standard information retrieval techniques. At this stage our focus is on identifying effective methods for extractive sentence selection, assuming that the input stream contains relevant documents only. Therefore, we decide to carry out our experiments on the set of documents that have already been annotated as relevant, and contain at least one sentence update in the TREC TS 2013 and 2014 gold standard set of updates. Consequently, we do not do any document filtering, but rather decompose the temporal summarization problem into basic steps by deliberately removing the filtering component. Besides that, we introduce an oracle assumption consisting in knowing beforehand how many sentence updates we are expected to retrieve from each relevant document. This assumption is aimed at providing a realistic baseline of the potentials and limitations of the algorithms we test. Given this, at this point in time we decide to ignore the novelty aspect of the emitted sentence updates, as we are more interested in finding relevant, and not necessarily non-redundant, updates. In what follows we describe the methods we employ for extractive sentence selection from the relevant documents, given as input an event query, a set of relevant documents, and an oracle number of updates to emit from each relevant document.

5.3.1 Methods and Techniques

In retrieving relevant updates for inclusion in the summary, we consider different information retrieval based approaches that have been adopted in text and document summarization. We are mainly interested in the extent to which event updates are retrievable by means of the shared vocabulary between the language of an event query and the language of an event update. To this end, we probe the utility of the following well-established information retrieval methods:

Term Frequency (TF). We use raw term frequency of a query term t in a sentence s. For each query term, we count the number of its occurrences inside sentence s, compute the score for each sentence, and rank the sentences either by time, or in decreasing order of their tf scores.
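A minimal sketch of this scoring step, assuming sentences are pre-tokenized lists of lowercased terms:

```python
from collections import Counter

def tf_score(query_terms, sentence_tokens):
    # Raw term-frequency score: total number of query-term occurrences in the sentence.
    counts = Counter(sentence_tokens)
    return sum(counts[t] for t in query_terms)
```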

BM25. Okapi BM25 [96] is arguably one of the most important and widely used information retrieval functions. BM25 scores a document D with respect to query Q, containing keywords q1, . . . , qn, as follows:

score(D,Q) = \sum_{i=1}^{n} IDF(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1 \left(1 - b + b \frac{|D|}{avgdl}\right)}, where    (5.1)


f(qi, D) is the frequency of query term qi inside document D, |D| is the length of document D in number of words, avgdl represents the average document length in the text collection from which documents are drawn, k1 and b are free parameters, which in the absence of advanced optimization are usually chosen as k1 ∈ [1.2, 2.0] and b = 0.75; IDF(qi) represents the inverse document frequency of query term qi in the collection, and is usually computed as:

IDF(q_i) = \log \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}, where    (5.2)

N is the total number of documents in the collection, and n(qi) is the total number of documents in which query term qi appears.

In our experiments we use the BM25 formula to score sentences, after choosing the values of the free parameters k1 = 1.5 and b = 0.75.
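A sketch of this sentence-level application of Equations 5.1 and 5.2 is given below; treating the tokenised sentences themselves as the retrieval units for the collection statistics N and avgdl is a simplifying assumption.

```python
import math
from collections import Counter

def bm25_score(query_terms, sentence, sentences, k1=1.5, b=0.75):
    # Score one tokenised sentence against the query with Okapi BM25
    # (Eq. 5.1 and 5.2), using sentence-level collection statistics.
    N = len(sentences)
    avgdl = sum(len(s) for s in sentences) / N
    tf = Counter(sentence)
    score = 0.0
    for q in query_terms:
        n_q = sum(1 for s in sentences if q in s)      # sentences containing q
        idf = math.log((N - n_q + 0.5) / (n_q + 0.5))
        numerator = tf[q] * (k1 + 1)
        denominator = tf[q] + k1 * (1 - b + b * len(sentence) / avgdl)
        score += idf * numerator / denominator
    return score
```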

TF.ISF. Similar to the traditional term frequency - inverse document frequency (tf.idf) [61] method used for document retrieval, the vector space model for sentence retrieval uses the term frequency - inverse sentence frequency (tf.isf) [29] method. Using tf.isf, we rank sentences with the following formula:

R(s|q) = \sum_{t \in q} \log(tf_{t,q} + 1) \log(tf_{t,s} + 1) \log\left(\frac{n + 1}{0.5 + sf_t}\right), where    (5.3)

- tf_{t,q} is the number of occurrences of term t in query q;
- tf_{t,s} is the number of occurrences of term t in sentence s;
- sf_t is the number of sentences that contain term t;
- n is the number of sentences in the collection.

We apply this formula at the document level, again ranking sentences by time and in decreasing order of their tf.isf scores.
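The following sketch implements Equation 5.3 for a single sentence, assuming pre-tokenized input:

```python
import math
from collections import Counter

def tf_isf_score(query_tokens, sentence_tokens, all_sentences):
    # Rank score of a sentence for a query under Eq. 5.3 (tf.isf).
    n = len(all_sentences)
    q_tf = Counter(query_tokens)
    s_tf = Counter(sentence_tokens)
    score = 0.0
    for t in set(query_tokens):
        sf = sum(1 for s in all_sentences if t in s)   # sentence frequency of t
        score += (math.log(q_tf[t] + 1) * math.log(s_tf[t] + 1)
                  * math.log((n + 1) / (0.5 + sf)))
    return score
```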

Query Likelihood (QL). According to the query likelihood model for sentence retrieval [73], sentences are ranked by the probability that the query was generated by the same distribution of terms the sentence is from:

P(S|Q) \propto P(S) \prod_{i=1}^{|Q|} P(q_i|S)    (5.4)

where Q is the query, |Q| is the number of terms in the query, qi is the ith term in the query, and S is a sentence. Previous work on sentence retrieval techniques shows that simple query likelihood models successfully outperform word overlap and TF-IDF based measures [17].
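A sketch of this ranking function is given below; Equation 5.4 leaves the estimation of P(qi|S) open, so the Jelinek-Mercer smoothing against a collection model and the lambda value used here are our own illustrative choices.

```python
from collections import Counter

def query_likelihood(query_tokens, sentence_tokens, collection_tokens, lam=0.5):
    # P(Q|S) under a unigram sentence model, smoothed against the collection
    # model with Jelinek-Mercer interpolation (an illustrative choice).
    s_tf, c_tf = Counter(sentence_tokens), Counter(collection_tokens)
    s_len, c_len = len(sentence_tokens), len(collection_tokens)
    prob = 1.0
    for q in query_tokens:
        p_s = s_tf[q] / s_len if s_len else 0.0
        p_c = c_tf[q] / c_len if c_len else 0.0
        prob *= lam * p_s + (1 - lam) * p_c
    return prob
```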

Cosine Similarity (COS.SIM). We compute the vector representation for each query and sentence combination using tf.idf term weights. We rank sentences by time, or by the cosine of the angle between the document and the query vectors:

\cos\theta = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\| \times \|\vec{b}\|}    (5.5)

In the formula above, θ denotes the angle between the two vector representations \vec{a} and \vec{b}.


Log-Likelihood Ratio (LLR). Using the LLR method [89], we extract discriminative terms that can distinguish an update from a non-update. We model an event as the set of the most discriminative LLR terms we extract from the gold standard TREC TS 2013 and 2014 updates. We build two models: a general one and an event-type focused one. For each event type, we build a foreground corpus of all the relevant updates, and a background corpus of all the non-update sentences from the relevant documents. To assign more weight to query terms and to make the summary more focused, we follow Gupta et al. [40] and augment the list of extracted LLR terms with the content words from the user query.
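The extraction of discriminative terms can be sketched as follows, using Dunning's log-likelihood ratio statistic over a foreground (update) and a background (non-update) corpus; the whitespace tokenization and the cut-off of 100 terms are illustrative choices.

```python
import math
from collections import Counter

def _binom_ll(k, n, p):
    # Binomial log-likelihood with the convention 0 * log(0) = 0.
    return (k * math.log(p) if k else 0.0) + \
           ((n - k) * math.log(1 - p) if n - k else 0.0)

def llr_terms(update_sentences, non_update_sentences, top=100):
    # Rank terms by the log-likelihood ratio statistic, with updates as the
    # foreground corpus and non-updates as the background corpus.
    fg = Counter(t for s in update_sentences for t in s.lower().split())
    bg = Counter(t for s in non_update_sentences for t in s.lower().split())
    n1, n2 = sum(fg.values()), sum(bg.values())
    scores = {}
    for term, k1 in fg.items():
        k2 = bg[term]
        p, p1, p2 = (k1 + k2) / (n1 + n2), k1 / n1, k2 / n2
        scores[term] = 2 * (_binom_ll(k1, n1, p1) + _binom_ll(k2, n2, p2)
                            - _binom_ll(k1, n1, p) - _binom_ll(k2, n2, p))
    return sorted(scores, key=scores.get, reverse=True)[:top]
```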

Language Modeling (LM). We hypothesize that event updates share a common crisis-related vocabulary that distinguishes them from other non-update sentences. To build a language model from the set of relevant event updates, we use SRILM3, an extensible language modeling toolkit which supports the creation and evaluation of a variety of language model types based on N-gram statistics [104]. We train a unigram language model based on TREC TS historical data. Additionally, we also train a 5-gram language model with Laplace smoothing, but since our results did not change much compared to the unigram language model, we decide to drop the 5-gram language model for the rest of our experiments.
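As an illustration of how such a model scores candidate sentences, the sketch below implements a Laplace-smoothed unigram model in plain Python; it is a stand-in for the SRILM-trained model, not the toolkit itself.

```python
import math
from collections import Counter

class UnigramLM:
    # Laplace-smoothed unigram model trained on the gold-standard update
    # sentences; a plain-Python stand-in for the SRILM-trained model.
    def __init__(self, sentences):
        self.counts = Counter(t for s in sentences for t in s.lower().split())
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts) + 1          # +1 accounts for unseen terms

    def log_prob(self, sentence):
        # Higher scores indicate sentences closer to the update vocabulary.
        return sum(math.log((self.counts[t] + 1) / (self.total + self.vocab))
                   for t in sentence.lower().split())
```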

Sentence Centrality. Recently, information network analysis techniques have been used in multi-document extractive generic text summarization. LexRank [31] is one of the best-known graph-based methods for multi-document summarization based on lexical centrality. Words and sentences can be modeled as nodes linked by their co-occurrence or content similarity. The complexity of mining the word network only depends on the scale of the vocabulary used inside the documents; it is often significantly reduced after applying term filtering. LexRank employs the idea of random walk within the graph to do prestige ranking as PageRank [79] does. We rely on the MEAD summarizer [86] implementation to extract the most central sentences in a multi-document cluster.

Entities and Relations. In prior research [6], it was found that the more named entities a sentence covers, the more information it contains and the more relevant it is to the topic. In addition, if a sentence contains new named entities which did not appear in any prior extracted sentences, including this sentence into the summary would probably add new information [35]. We attempt to capture the salience of entities and how they connect with each other, and use these features in scoring sentences for inclusion in the summary of an event. In deriving entity features we follow McCreadie et al. [65], and focus on:

• entity importance (E1): captures the salience of an entity inside a document, and is estimated as the frequency of the respective entity inside that document;

• entity interaction (E2): defines how central an entity is compared to other entities, and is estimated as the number of co-occurring entities for a given entity at the sentence level.

3 http://www.speech.sri.com/projects/srilm/


We score sentences based on a linear combination of E1 and E2, taking into account the entity importance at the document level and the entity interaction at the sentence level.
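A sketch of this combination is given below; the way E1 and E2 are aggregated over the entities of a sentence and the mixing weight alpha are our own illustrative choices, and sentence_entities is assumed to come from the NER step described earlier.

```python
from collections import Counter

def entity_sentence_scores(sentence_entities, alpha=0.5):
    # sentence_entities[i] holds the entities tagged in sentence i of a document.
    # E1: document-level frequency of the entities present in the sentence.
    # E2: number of pairwise entity co-occurrences within the sentence.
    doc_freq = Counter(e for ents in sentence_entities for e in ents)
    scores = []
    for ents in sentence_entities:
        e1 = sum(doc_freq[e] for e in ents)
        e2 = len(ents) * (len(ents) - 1)
        scores.append(alpha * e1 + (1 - alpha) * e2)
    return scores
```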

5.4 Novelty Detection

After computing the ranking score, the top ranked sentences can be retrieved as output and included in the final summary. However, due to the fact that many highly ranked sentences are similar to each other and present an overlap in meaning, strategies to eliminate redundant information are frequently required. The cosine similarity metric is widely used as the similarity function for guiding sentence selection decisions. All sentences above a specific threshold, considered to be too similar to prior emitted sentences, are automatically discarded from the output. In addition, reranking techniques such as Maximal Marginal Relevance (MMR) [21] have been applied to eliminate the redundancy problem in a summary. A document/sentence has high marginal relevance if it is relevant to the query, and presents minimal similarity to all previously selected sentences/documents:

MMR = \arg\max_{D_i \in R \setminus S} \left[ \lambda\, Sim_1(D_i, Q) - (1 - \lambda) \max_{D_j \in S} Sim_2(D_i, D_j) \right]    (5.6)

where Q is the user query, R denotes the ranked list of documents/sentences retrieved by the system, and S is the subset of documents/sentences that have already been selected. Sim1 and Sim2 can be the same, or different, metrics used in document/sentence retrieval and relevance ranking between documents/sentences and the query. Assuming we retrieve the top ranked nodes in a graph one by one, once a node rj is retrieved, the ranking scores of all other nodes are discounted by r'_i = r_i − w_ij r_j, with w_ij denoting the transition probability from node i to node j. The reranking technique is aimed at reducing redundancy in the original result, however it provides no guarantees for content coverage maximization. The MMR principle is therefore more of a compromise between random walk ranking and the information redundancy problem, where the content coverage is not explicitly modelled in the random walk process.
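The greedy selection behind Equation 5.6 can be sketched as follows; the similarity function, the lambda value, and the summary length k are parameters that the formula leaves open.

```python
def mmr_select(candidates, query_vec, sim, k=10, lam=0.7):
    # Greedy MMR (Eq. 5.6): repeatedly pick the candidate that best balances
    # relevance to the query against similarity to the already selected items.
    # `candidates` are sentence vectors and `sim(a, b)` is a similarity
    # function such as cosine similarity.
    selected, remaining = [], list(range(len(candidates)))
    while remaining and len(selected) < k:
        def mmr_value(i):
            relevance = sim(candidates[i], query_vec)
            redundancy = max((sim(candidates[i], candidates[j]) for j in selected),
                             default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=mmr_value)
        selected.append(best)
        remaining.remove(best)
    return selected
```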

In our submission to TREC TS 2015 [34], we have employed the cosine similarity metric for novelty detection. Nevertheless, in our current experiments we decide to drop the novelty detection component when we construct the summary. This allows us to get a better understanding of the performance of our methods, and focus strictly on the temporal summarization problem in terms of which sentence updates can be retrieved by the aforementioned methods.

5.5 Experimental Results

In this section we present the results of our experiments, using the methods and techniques described in Section 5.3.1. We test the performance of these algorithms in turn on the TREC TS 2013 (Event Ids 1-10) and 2014 (Event Ids 11-25) test collections. We run our experiments on the set of


filtered relevant documents for the collections, making the oracle assumption that we know in advance how many sentence updates to retrieve from each relevant document. Therefore, once we hit the threshold for how many sentences to output from a relevant document, we discard the rest of the retrieved sentences and move on to processing the next timestamped document in chronological order. As mentioned previously, in our experiments we drop the novelty detection component, as we are mainly interested in assessing how well updates can be identified by retrieval algorithms and not necessarily in how novel these updates are. We apply two ranking strategies before issuing the output. First, we rank sentences across documents in order of their timestamps and consider the top-n highest scored sentences inside each document as dictated by the oracle; second, we rank sentences in decreasing order of their assigned retrieval scores across all relevant documents, preserving the oracle assumption. We evaluate the output of these methods in terms of precision and recall. We expect that ranking sentences by their corresponding retrieval scores across documents will perform better in identifying salient updates compared to the case when we first rank documents by time and issue the top-n updates from each document. In Appendix A we also include scores for the case when there is no oracle constraint on how many updates to issue from a specific document. We expect these scores to be higher than the ones in the oracle case. In what follows we report on the results we obtained for each category of the methods we employed.

Retrieval Algorithms: We present results for the traditional information retrieval methods described in Section 5.3.1 in Tables 5.1 - 5.10. For each of the probed methods (TF, BM25, TF.ISF, QL, COS.SIM), we first include the oracle results ranked by time across documents and highest score within documents, and then ranked by retrieval scores across documents. In Appendix A, Tables A.11 - A.20, we present results for the same retrieval methods across all relevant documents, but without the oracle assumption.

Event Update Centrality: We run LexRank on the set of relevantdocuments, and infer sentence rankings based on the LexRank centralityscores. We aggregate this information using the oracle baseline in Table5.14, while in Appendix A in Table A.24 we present results for the case whenthere is no oracle constraint.
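
A compact sketch of the LexRank computation is shown below (Python; the similarity threshold and damping factor are illustrative defaults, not necessarily the values used in our runs). It builds a cosine-similarity graph over sentence vectors and estimates centrality with power iteration:

    import numpy as np

    def lexrank(sent_vectors, threshold=0.1, damping=0.85, iters=50):
        """Return a centrality score per sentence (higher = more central).

        sent_vectors : 2-D numpy array, one (e.g. tf-idf) row vector per sentence
        """
        norms = np.linalg.norm(sent_vectors, axis=1, keepdims=True)
        norms[norms == 0] = 1.0
        unit = sent_vectors / norms
        sim = unit.dot(unit.T)                     # cosine similarities
        adj = (sim >= threshold).astype(float)     # thresholded adjacency
        row_sums = adj.sum(axis=1, keepdims=True)
        row_sums[row_sums == 0] = 1.0
        transition = adj / row_sums                # row-stochastic matrix
        n = len(sent_vectors)
        scores = np.full(n, 1.0 / n)
        for _ in range(iters):
            scores = (1 - damping) / n + damping * transition.T.dot(scores)
        return scores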

Event Update Modeling: We score sentences using the unigram eventlanguage model trained on the relevant sentence updates from the gold stan-dard collection. We present results with the oracle assumption in Table 5.13(and without the oracle assumption in Appendix A in Table A.23). In addi-tion, we also infer discriminative LLR terms that we use in scoring sentenceupdates for inclusion into the summary. In Tables 5.11 - 5.12 (and in Ap-pendix A in Tables A.21 - A.22) we present LLR results with and withoutthe oracle baseline.
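
The discriminative terms are obtained by contrasting the corpus of gold-standard updates with a background collection (left abstract here). A minimal sketch of this log-likelihood ratio computation, following Dunning's formulation and not necessarily the exact cut-offs we used, is:

    import math
    from collections import Counter

    def _log_l(k, n, p):
        # log-likelihood of k successes out of n under success probability p
        p = min(max(p, 1e-12), 1 - 1e-12)
        return k * math.log(p) + (n - k) * math.log(1 - p)

    def llr_terms(update_tokens, background_tokens, top_k=100):
        """Rank terms by how strongly they discriminate updates from background."""
        fg, bg = Counter(update_tokens), Counter(background_tokens)
        n1, n2 = sum(fg.values()), sum(bg.values())
        scores = {}
        for term, k1 in fg.items():
            k2 = bg.get(term, 0)
            p1 = k1 / n1
            p2 = k2 / n2 if n2 else 0.0
            p = (k1 + k2) / (n1 + n2)
            llr = 2 * (_log_l(k1, n1, p1) + _log_l(k2, n2, p2)
                       - _log_l(k1, n1, p) - _log_l(k2, n2, p))
            scores[term] = llr
        return sorted(scores, key=scores.get, reverse=True)[:top_k]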

Entities and Relations: We first score sentences by the number of co-occurring entities present at the sentence level (feature E2). In Table 5.15 we present oracle results after ranking the scored sentences by time, and in Table 5.16 we include results for ranking sentences in descending order of their scores across documents. Then, in Table 5.17 and Table 5.18, we score sentences by the combined metric which includes both the E1 and E2 features, under the oracle threshold. In Appendix A, Table A.25 and Table A.26, we include results for the entire document collection when dropping the oracle constraint.
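
A minimal sketch of this entity-based scoring is given below. It assumes that entities have already been recognised per sentence and counted per document; the reading of E1 as document-level entity frequency and the simple additive combination of E1 and E2 are assumptions for illustration, the precise definitions being those of Section 5.3.1.

    def entity_scores(sentence_entities, doc_entity_freq):
        """Score a sentence by its entity features.

        sentence_entities : list of entities recognised in the sentence
        doc_entity_freq   : dict entity -> frequency in the containing document

        Returns (e2, combined), where e2 is the number of distinct co-occurring
        entities in the sentence and combined additionally rewards entities
        that are frequent at the document level (our assumed reading of E1).
        """
        distinct = set(sentence_entities)
        e2 = len(distinct)
        e1 = sum(doc_entity_freq.get(ent, 0) for ent in distinct)
        return e2, e1 + e2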


Table 5.1: TF Oracle results ranked by time for TREC TS 2013 (Ids: 1-10) and TREC TS 2014 (Ids: 11-25) collections (MAP 2013: 0.0062; MAP 2014: 0.0779).

Id  Correct  P@10  P@20  P@100  P@R  AP  R@10  R@20  R@100

1 14/233 0.0000 0.0000 0.0700 0.0601@233 0.0041 0.0000 0.0000 0.03002 12/168 0.3000 0.1500 0.1100 0.0714@168 0.0127 0.0179 0.0179 0.06553 0/42 0.0000 0.0000 – 0.0000@42 0.0000 0.0000 0.0000 –4 11/180 0.2000 0.1500 0.1000 0.0608@181 0.0121 0.0110 0.0166 0.05525 0/35 0.0000 0.0000 – 0.0000@35 0.0000 0.0000 0.0000 –6 9/172 0.0000 0.0000 0.0700 0.0523@172 0.0038 0.0000 0.0000 0.04077 – – – – – – – – –8 8/76 0.0000 0.0500 – 0.1053@76 0.0137 0.0000 0.0132 –9 3/68 0.0000 0.0500 – 0.0441@68 0.0026 0.0000 0.0147 –

10 7/95 0.0000 0.1000 – 0.0729@96 0.0067 0.0000 0.0208 –AVG 2013 0.0556 0.0556 0.0875 0.0519 0.0062 0.0032 0.0092 0.0479

11 74/392 0.3000 0.2000 0.1600 0.1888@392 0.0334 0.0077 0.0102 0.040812 74/184 0.4000 0.4500 0.3900 0.4022@184 0.1725 0.0217 0.0489 0.212013 85/313 0.2000 0.3000 0.2500 0.2716@313 0.0750 0.0064 0.0192 0.079914 68/401 0.0000 0.0000 0.1400 0.1696@401 0.0270 0.0000 0.0000 0.034915 81/315 0.4000 0.4000 0.2500 0.2571@315 0.0697 0.0127 0.0254 0.079416 181/554 0.5000 0.4500 0.4500 0.3267@554 0.1349 0.0090 0.0162 0.081217 143/648 0.4000 0.3000 0.2600 0.2207@648 0.0539 0.0062 0.0093 0.040118 92/409 0.1000 0.1000 0.2100 0.2249@409 0.0491 0.0024 0.0049 0.051319 106/341 0.6000 0.5000 0.4200 0.3109@341 0.1280 0.0176 0.0293 0.123220 66/289 0.2000 0.1000 0.1900 0.2284@289 0.0465 0.0069 0.0069 0.065721 226/798 0.2000 0.2000 0.2800 0.2832@798 0.0732 0.0025 0.0050 0.035122 71/220 0.3000 0.3000 0.3900 0.3227@220 0.1193 0.0136 0.0273 0.177323 64/274 0.1000 0.3000 0.2600 0.2336@274 0.0633 0.0036 0.0219 0.094924 126/430 0.5000 0.4000 0.2900 0.2930@430 0.0861 0.0116 0.0186 0.067425 61/376 0.4000 0.3000 0.2200 0.1622@376 0.0362 0.0106 0.0160 0.0585

AVG 2014 0.3067 0.2867 0.2773 0.2597 0.0779 0.0088 0.0173 0.0828

Table 5.2: TF Oracle results ranked by highest score for TREC TS 2013 (Ids: 1-10) and TREC TS 2014 (Ids: 11-25) collections (MAP 2013: 0.0034; MAP 2014: 0.0898).

Id  Correct  P@10  P@20  P@100  P@R  AP  R@10  R@20  R@100

1 14/233 0.0000 0.0500 0.0200 0.0601@233 0.0032 0.0000 0.0043 0.00862 12/168 0.0000 0.0000 0.0700 0.0714@168 0.0044 0.0000 0.0000 0.04173 0/42 0.0000 0.0000 – 0.0000@42 0.0000 0.0000 0.0000 –4 11/180 0.0000 0.0000 0.0600 0.0608@181 0.0031 0.0000 0.0000 0.03315 0/35 0.0000 0.0000 – 0.0000@35 0.0000 0.0000 0.0000 –6 9/172 0.0000 0.0000 0.0500 0.0523@172 0.0025 0.0000 0.0000 0.02917 – – – – – – – – –8 8/76 0.0000 0.0500 – 0.1053@76 0.0084 0.0000 0.0132 –9 3/68 0.0000 0.0500 – 0.0441@68 0.0021 0.0000 0.0147 –

10 7/95 0.0000 0.0500 – 0.0729@96 0.0072 0.0000 0.0104 –AVG 2013 0.0000 0.0222 0.0500 0.0519 0.0034 0.0000 0.0047 0.0281

11 74/392 0.5000 0.4500 0.2300 0.1888@392 0.0539 0.0128 0.0230 0.058712 74/184 0.6000 0.4500 0.4500 0.4022@184 0.1929 0.0326 0.0489 0.244613 85/313 0.2000 0.4000 0.3600 0.2716@313 0.0908 0.0064 0.0256 0.115014 68/401 0.0000 0.0500 0.3800 0.1696@401 0.0485 0.0000 0.0025 0.094815 81/315 0.3000 0.4000 0.2900 0.2571@315 0.0760 0.0095 0.0254 0.092116 181/554 0.4000 0.4000 0.3400 0.3267@554 0.1097 0.0072 0.0144 0.061417 143/648 0.7000 0.4500 0.4300 0.2207@648 0.0820 0.0108 0.0139 0.066418 92/409 0.4000 0.3500 0.4000 0.2249@409 0.0771 0.0098 0.0171 0.097819 106/341 0.1000 0.2500 0.3300 0.3109@341 0.0899 0.0029 0.0147 0.096820 66/289 0.3000 0.4500 0.3500 0.2284@289 0.0794 0.0104 0.0311 0.121121 226/798 0.8000 0.7500 0.3200 0.2832@798 0.1012 0.0100 0.0188 0.040122 71/220 0.2000 0.5000 0.4000 0.3227@220 0.1290 0.0091 0.0455 0.181823 64/274 0.1000 0.2000 0.2900 0.2336@274 0.0599 0.0036 0.0146 0.105824 126/430 0.4000 0.4000 0.3600 0.2930@430 0.1139 0.0093 0.0186 0.083725 61/376 0.3000 0.2500 0.2900 0.1622@376 0.0427 0.0080 0.0133 0.0771

AVG 2014 0.3533 0.3833 0.3480 0.2597 0.0898 0.0095 0.0218 0.1025


Table 5.3: BM25 Oracle results ranked by time for TREC TS 2013 (Ids: 1-10) and TREC TS 2014 (Ids: 11-25) collections (MAP 2013: 0.0024; MAP 2014: 0.0754).

Id  Correct  P@10  P@20  P@100  P@R  AP  R@10  R@20  R@100

1 8/233 0.0000 0.0000 0.0400 0.0343@233 0.0013 0.0000 0.0000 0.01722 4/168 0.1000 0.0500 0.0300 0.0238@168 0.0036 0.0060 0.0060 0.01793 0/42 0.0000 0.0000 – 0.0000@42 0.0000 0.0000 0.0000 –4 6/180 0.2000 0.1000 0.0500 0.0331@181 0.0066 0.0110 0.0110 0.02765 0/35 0.0000 0.0000 – 0.0000@35 0.0000 0.0000 0.0000 –6 3/172 0.0000 0.0000 0.0200 0.0174@172 0.0005 0.0000 0.0000 0.01167 – – – – – – – – –8 4/76 0.0000 0.0000 – 0.0526@76 0.0038 0.0000 0.0000 –9 1/68 0.0000 0.0000 – 0.0147@68 0.0003 0.0000 0.0000 –10 2/95 0.1000 0.0500 – 0.0208@96 0.0056 0.0104 0.0104 –

AVG 2013 0.0444 0.0222 0.0350 0.0219 0.0024 0.0030 0.0030 0.018611 85/392 0.2000 0.2000 0.2100 0.2168@392 0.0478 0.0051 0.0102 0.053612 68/184 0.4000 0.4500 0.3400 0.3696@184 0.1505 0.0217 0.0489 0.184813 100/313 0.3000 0.3000 0.2900 0.3195@313 0.1010 0.0096 0.0192 0.092714 71/401 0.0000 0.0000 0.1700 0.1771@401 0.0303 0.0000 0.0000 0.042415 87/315 0.3000 0.3000 0.2200 0.2762@315 0.0695 0.0095 0.0190 0.069816 186/554 0.1000 0.1500 0.4100 0.3357@554 0.1299 0.0018 0.0054 0.074017 144/648 0.6000 0.4500 0.2400 0.2222@648 0.0557 0.0093 0.0139 0.037018 93/409 0.2000 0.1500 0.2100 0.2274@409 0.0505 0.0049 0.0073 0.051319 100/341 0.4000 0.3500 0.3200 0.2933@341 0.0998 0.0117 0.0205 0.093820 67/289 0.2000 0.1500 0.2100 0.2318@289 0.0518 0.0069 0.0104 0.072721 219/798 0.3000 0.2500 0.2700 0.2744@798 0.0688 0.0038 0.0063 0.033822 65/220 0.4000 0.3500 0.3600 0.2955@220 0.1050 0.0182 0.0318 0.163623 69/274 0.1000 0.3000 0.2700 0.2518@274 0.0696 0.0036 0.0219 0.098524 109/430 0.4000 0.3000 0.2300 0.2535@430 0.0599 0.0093 0.0140 0.053525 67/376 0.4000 0.3000 0.2100 0.1782@376 0.0409 0.0106 0.0160 0.0559

AVG 2014 0.2867 0.2667 0.2640 0.2615 0.0754 0.0084 0.0163 0.0785

Table 5.4: BM25 Oracle results ranked by highest score for TREC TS 2013 (Ids: 1-10) and TREC TS 2014 (Ids: 11-25) collections (MAP 2013: 0.0015; MAP 2014: 0.0900).

Id  Correct  P@10  P@20  P@100  P@R  AP  R@10  R@20  R@100

1 8/233 0.2000 0.1000 0.0500 0.0343@233 0.0032 0.0086 0.0086 0.02152 4/168 0.0000 0.0000 0.0200 0.0238@168 0.0006 0.0000 0.0000 0.01193 0/42 0.0000 0.0000 – 0.0000@42 0.0000 0.0000 0.0000 –4 6/180 0.0000 0.0000 0.0400 0.0331@181 0.0013 0.0000 0.0000 0.02215 0/35 0.0000 0.0000 – 0.0000@35 0.0000 0.0000 0.0000 –6 3/172 0.0000 0.0000 0.0100 0.0174@172 0.0004 0.0000 0.0000 0.00587 – – – – – – – – –8 4/76 0.1000 0.1500 – 0.0526@76 0.0070 0.0132 0.0395 –9 1/68 0.0000 0.0000 – 0.0147@68 0.0003 0.0000 0.0000 –10 2/95 0.0000 0.0000 – 0.0208@96 0.0005 0.0000 0.0000 –

AVG 2013 0.0333 0.0278 0.0300 0.0219 0.0015 0.0024 0.0053 0.015311 85/392 0.5000 0.5000 0.2200 0.2168@392 0.0530 0.0128 0.0255 0.056112 68/184 0.5000 0.6000 0.5400 0.3696@184 0.2127 0.0272 0.0652 0.293513 100/313 0.6000 0.6500 0.4500 0.3195@313 0.1407 0.0192 0.0415 0.143814 71/401 0.1000 0.1500 0.4000 0.1771@401 0.0520 0.0025 0.0075 0.099815 87/315 0.3000 0.2000 0.2200 0.2762@315 0.0726 0.0095 0.0127 0.069816 186/554 0.6000 0.3000 0.3700 0.3357@554 0.1138 0.0108 0.0108 0.066817 144/648 0.3000 0.3000 0.3900 0.2222@648 0.0734 0.0046 0.0093 0.060218 93/409 0.4000 0.4500 0.2800 0.2274@409 0.0713 0.0098 0.0220 0.068519 100/341 0.4000 0.3500 0.2300 0.2933@341 0.0852 0.0117 0.0205 0.067420 67/289 0.4000 0.2500 0.2000 0.2318@289 0.0627 0.0138 0.0173 0.069221 219/798 0.7000 0.7500 0.4500 0.2744@798 0.1024 0.0088 0.0188 0.056422 65/220 0.5000 0.5000 0.3000 0.2955@220 0.1111 0.0227 0.0455 0.136423 69/274 0.3000 0.2500 0.3500 0.2518@274 0.0800 0.0109 0.0182 0.127724 109/430 0.2000 0.1500 0.1100 0.2535@430 0.0615 0.0047 0.0070 0.025625 67/376 0.4000 0.3500 0.3600 0.1782@376 0.0577 0.0106 0.0186 0.0957

AVG 2014 0.4133 0.3833 0.3247 0.2615 0.0900 0.0120 0.0227 0.0958


Table 5.5: TF.ISF Oracle results ranked by time for TREC TS 2013 (Ids: 1-10) and TREC TS 2014 (Ids: 11-25) collections (MAP 2013: 0.0407; MAP 2014: 0.0756).

Id  Correct  P@10  P@20  P@100  P@R  AP  R@10  R@20  R@100

1 66/233 0.0000 0.2000 0.2800 0.2833@233 0.0733 0.0000 0.0172 0.12022 25/168 0.2000 0.2000 0.2100 0.1488@168 0.0330 0.0119 0.0238 0.12503 4/42 0.1000 0.1000 – 0.0952@42 0.0145 0.0238 0.0476 –4 31/180 0.3000 0.2000 0.1500 0.1713@181 0.0364 0.0166 0.0221 0.08295 2/35 0.0000 0.0000 – 0.0571@35 0.0026 0.0000 0.0000 –6 9/172 0.0000 0.0000 0.0500 0.0523@172 0.0034 0.0000 0.0000 0.02917 – – – – – – – – –8 15/76 0.4000 0.3000 – 0.1974@76 0.0601 0.0526 0.0789 –9 12/68 0.2000 0.3000 – 0.1765@68 0.0539 0.0294 0.0882 –

10 30/95 0.2000 0.2000 – 0.3125@96 0.0892 0.0208 0.0417 –AVG 2013 0.1556 0.1667 0.1725 0.1660 0.0407 0.0172 0.0355 0.0893

11 74/392 0.3000 0.2000 0.1600 0.1888@392 0.0334 0.0077 0.0102 0.040812 71/184 0.4000 0.4500 0.3600 0.3859@184 0.1605 0.0217 0.0489 0.195713 93/313 0.2000 0.3000 0.2600 0.2971@313 0.0886 0.0064 0.0192 0.083114 68/401 0.0000 0.0000 0.1400 0.1696@401 0.0272 0.0000 0.0000 0.034915 83/315 0.3000 0.2500 0.2200 0.2635@315 0.0651 0.0095 0.0159 0.069816 172/554 0.3000 0.2500 0.3400 0.3105@554 0.1045 0.0054 0.0090 0.061417 142/648 0.6000 0.4500 0.2100 0.2191@648 0.0527 0.0093 0.0139 0.032418 97/409 0.2000 0.1500 0.2400 0.2372@409 0.0574 0.0049 0.0073 0.058719 105/341 0.5000 0.4500 0.3400 0.3079@341 0.1129 0.0147 0.0264 0.099720 67/289 0.2000 0.1000 0.1900 0.2318@289 0.0471 0.0069 0.0069 0.065721 229/798 0.2000 0.2000 0.2800 0.2870@798 0.0743 0.0025 0.0050 0.035122 70/220 0.4000 0.3500 0.3900 0.3182@220 0.1196 0.0182 0.0318 0.177323 67/274 0.1000 0.3000 0.2600 0.2445@274 0.0670 0.0036 0.0219 0.094924 126/430 0.5000 0.4000 0.2900 0.2930@430 0.0861 0.0116 0.0186 0.067425 63/376 0.4000 0.3000 0.2100 0.1676@376 0.0380 0.0106 0.0160 0.0559

AVG 2014 0.3067 0.2767 0.2593 0.2614 0.0756 0.0089 0.0167 0.0782

Table 5.6: TF.ISF Oracle results ranked by highest score for TREC TS 2013 (Ids: 1-10) and TREC TS 2014 (Ids: 11-25) (MAP 2013: 0.0416; MAP 2014: 0.1072).

Id  Correct  P@10  P@20  P@100  P@R  AP  R@10  R@20  R@100

1 66/233 0.3000 0.4000 0.2700 0.2833@233 0.0843 0.0129 0.0343 0.11592 25/168 0.6000 0.3000 0.0800 0.1488@168 0.0488 0.0357 0.0357 0.04763 4/42 0.2000 0.1000 – 0.0952@42 0.0182 0.0476 0.0476 –4 31/180 0.0000 0.0000 0.0800 0.1713@181 0.0205 0.0000 0.0000 0.04425 2/35 0.0000 0.0500 – 0.0571@35 0.0033 0.0000 0.0286 –6 9/172 0.0000 0.0000 0.0100 0.0523@172 0.0019 0.0000 0.0000 0.00587 – – – – – – – – –8 15/76 0.2000 0.2000 – 0.1974@76 0.0536 0.0263 0.0526 –9 12/68 0.3000 0.2000 – 0.1765@68 0.0615 0.0441 0.0588 –

10 30/95 0.2000 0.1500 – 0.3125@96 0.0826 0.0208 0.0312 –AVG 2013 0.2000 0.1556 0.1100 0.1660 0.0416 0.0208 0.0321 0.0534

11 74/392 0.5000 0.5000 0.2500 0.1888@392 0.0542 0.0128 0.0255 0.063812 71/184 0.7000 0.6000 0.5400 0.3859@184 0.2102 0.0380 0.0652 0.293513 93/313 0.7000 0.6500 0.4200 0.2971@313 0.1350 0.0224 0.0415 0.134214 68/401 0.2000 0.4000 0.4600 0.1696@401 0.0679 0.0050 0.0200 0.114715 83/315 0.3000 0.4000 0.3000 0.2635@315 0.0805 0.0095 0.0254 0.095216 172/554 0.4000 0.5000 0.3600 0.3105@554 0.1085 0.0072 0.0181 0.065017 142/648 1.0000 0.8500 0.5700 0.2191@648 0.1108 0.0154 0.0262 0.088018 97/409 0.3000 0.4500 0.4600 0.2372@409 0.0992 0.0073 0.0220 0.112519 105/341 0.1000 0.3000 0.3800 0.3079@341 0.1001 0.0029 0.0176 0.111420 67/289 0.5000 0.3500 0.3700 0.2318@289 0.0831 0.0173 0.0242 0.128021 229/798 0.6000 0.7000 0.5500 0.2870@798 0.1269 0.0075 0.0175 0.068922 70/220 0.4000 0.5500 0.4500 0.3182@220 0.1510 0.0182 0.0500 0.204523 67/274 0.3000 0.2500 0.3400 0.2445@274 0.0888 0.0109 0.0182 0.124124 126/430 0.4000 0.4500 0.5100 0.2930@430 0.1339 0.0093 0.0209 0.118625 63/376 0.4000 0.4000 0.3900 0.1676@376 0.0579 0.0106 0.0213 0.1037

AVG 2014 0.4533 0.4900 0.4233 0.2614 0.1072 0.0130 0.0276 0.1217


Table 5.7: QL Oracle results ranked by time for TREC TS 2013 (Ids: 1-10) and TREC TS 2014 (Ids: 11-25) collections (MAP 2013: 0.1323; MAP 2014: 0.0401).

Id  Correct  P@10  P@20  P@100  P@R  AP  R@10  R@20  R@100

1 74/233 0.1000 0.1500 0.3600 0.3176@233 0.1019 0.0043 0.0129 0.15452 48/168 0.4000 0.3500 0.3100 0.2857@168 0.0922 0.0238 0.0417 0.18453 4/42 0.1000 0.1000 – 0.0952@42 0.0138 0.0238 0.0476 –4 55/180 0.4000 0.4000 0.2700 0.3039@181 0.1094 0.0221 0.0442 0.14925 10/35 0.1000 0.2000 – 0.2857@35 0.0673 0.0286 0.1143 –6 48/172 0.3000 0.2500 0.3100 0.2791@172 0.0893 0.0174 0.0291 0.18027 – – – – – – – – –8 31/76 0.6000 0.5500 – 0.4079@76 0.2214 0.0789 0.1447 –9 25/68 0.4000 0.4500 – 0.3676@68 0.1634 0.0588 0.1324 –10 54/95 0.6000 0.6500 – 0.5625@96 0.3316 0.0625 0.1354 –

AVG 2013 0.3333 0.3444 0.3125 0.3228 0.1323 0.0356 0.0780 0.167111 78/392 0.1000 0.1500 0.1900 0.1990@392 0.0392 0.0026 0.0077 0.048512 26/184 0.1000 0.2000 0.1800 0.1413@184 0.0287 0.0054 0.0217 0.097813 81/313 0.2000 0.3000 0.2500 0.2588@313 0.0697 0.0064 0.0192 0.079914 40/401 0.0000 0.0000 0.0500 0.0998@401 0.0076 0.0000 0.0000 0.012515 50/315 0.2000 0.2000 0.1500 0.1587@315 0.0252 0.0063 0.0127 0.047616 122/554 0.2000 0.2500 0.2600 0.2202@554 0.0575 0.0036 0.0090 0.046917 47/648 0.0000 0.0000 0.0300 0.0725@648 0.0041 0.0000 0.0000 0.004618 81/409 0.1000 0.1000 0.1800 0.1980@409 0.0400 0.0024 0.0049 0.044019 74/341 0.2000 0.2500 0.1800 0.2170@341 0.0515 0.0059 0.0147 0.052820 55/289 0.2000 0.2000 0.1400 0.1903@289 0.0338 0.0069 0.0138 0.048421 214/798 0.2000 0.2000 0.1900 0.2682@798 0.0615 0.0025 0.0050 0.023822 56/220 0.4000 0.3500 0.3100 0.2545@220 0.0828 0.0182 0.0318 0.140923 40/274 0.1000 0.2000 0.0900 0.1460@274 0.0209 0.0036 0.0146 0.032824 107/430 0.4000 0.2500 0.2100 0.2488@430 0.0559 0.0093 0.0116 0.048825 52/376 0.3000 0.1500 0.1300 0.1383@376 0.0223 0.0080 0.0080 0.0346

AVG 2014 0.1800 0.1867 0.1693 0.1874 0.0401 0.0054 0.0116 0.0509

Table 5.8: QL Oracle results ranked by highest score for TREC TS 2013 (Ids: 1-10) and TREC TS 2014 (Ids: 11-25) (MAP 2013: 0.1654; MAP 2014: 0.0453).

Id  Correct  P@10  P@20  P@100  P@R  AP  R@10  R@20  R@100

1 74/233 0.6000 0.6500 0.4000 0.3176@233 0.1500 0.0258 0.0558 0.17172 48/168 0.7000 0.8000 0.4300 0.2857@168 0.1920 0.0417 0.0952 0.25603 4/42 0.1000 0.2000 – 0.0952@42 0.0197 0.0238 0.0952 –4 55/180 0.4000 0.4000 0.4100 0.3039@181 0.1272 0.0221 0.0442 0.22655 10/35 0.1000 0.2500 – 0.2857@35 0.0896 0.0286 0.1429 –6 48/172 0.2000 0.4500 0.3400 0.2791@172 0.1016 0.0116 0.0523 0.19777 – – – – – – – – –8 31/76 0.6000 0.5500 – 0.4079@76 0.2141 0.0789 0.1447 –9 25/68 0.7000 0.4500 – 0.3676@68 0.2068 0.1029 0.1324 –10 54/95 0.6000 0.6000 – 0.5625@96 0.3879 0.0625 0.1250 –

AVG 2013 0.4444 0.4833 0.3950 0.3228 0.1654 0.0442 0.0986 0.213011 78/392 0.0000 0.0500 0.1000 0.1990@392 0.0282 0.0000 0.0026 0.025512 26/184 0.5000 0.3000 0.1700 0.1413@184 0.0398 0.0272 0.0326 0.092413 81/313 0.2000 0.3000 0.2600 0.2588@313 0.0738 0.0064 0.0192 0.083114 40/401 0.0000 0.0000 0.0000 0.0998@401 0.0062 0.0000 0.0000 0.000015 50/315 0.1000 0.1000 0.2400 0.1587@315 0.0347 0.0032 0.0063 0.076216 122/554 0.4000 0.3000 0.3400 0.2202@554 0.0692 0.0072 0.0108 0.061417 47/648 0.2000 0.3500 0.3100 0.0725@648 0.0216 0.0031 0.0108 0.047818 81/409 0.0000 0.0000 0.0600 0.1980@409 0.0326 0.0000 0.0000 0.014719 74/341 0.2000 0.1500 0.2500 0.2170@341 0.0578 0.0059 0.0088 0.073320 55/289 0.0000 0.1000 0.1700 0.1903@289 0.0343 0.0000 0.0069 0.058821 214/798 0.4000 0.5000 0.3000 0.2682@798 0.0675 0.0050 0.0125 0.037622 56/220 0.1000 0.1000 0.1500 0.2545@220 0.0535 0.0045 0.0091 0.068223 40/274 0.4000 0.3500 0.3100 0.1460@274 0.0467 0.0146 0.0255 0.113124 107/430 0.2000 0.1000 0.1400 0.2488@430 0.0587 0.0047 0.0047 0.032625 52/376 0.3000 0.3500 0.4000 0.1383@376 0.0543 0.0080 0.0186 0.1064

AVG 2014 0.2000 0.2033 0.2133 0.1874 0.0453 0.0060 0.0112 0.0594


Table 5.9: Cosine Sim. Oracle results ranked by time for TREC TS 2013 (Ids: 1-10) and TREC TS 2014 (Ids: 11-25) collections (MAP 2013: 0.1340; MAP 2014: 0.0569).

Id  Correct  P@10  P@20  P@100  P@R  AP  R@10  R@20  R@100

1 89/233 0.2000 0.3500 0.4300 0.3820@233 0.1469 0.0086 0.0300 0.18452 60/168 0.4000 0.3500 0.3700 0.3571@168 0.1331 0.0238 0.0417 0.22023 8/42 0.2000 0.1500 – 0.1905@42 0.0478 0.0476 0.0714 –4 61/180 0.5000 0.5000 0.2900 0.3370@181 0.1319 0.0276 0.0552 0.16025 8/35 0.2000 0.2000 – 0.2286@35 0.0579 0.0571 0.1143 –6 38/172 0.3000 0.2500 0.2400 0.2209@172 0.0629 0.0174 0.0291 0.13957 – – – – – – – – –8 32/76 0.6000 0.5500 – 0.4211@76 0.2308 0.0789 0.1447 –9 24/68 0.4000 0.4500 – 0.3529@68 0.1518 0.0588 0.1324 –

10 50/95 0.4000 0.4000 – 0.5208@96 0.2430 0.0417 0.0833 –AVG 2013 0.3556 0.3556 0.3325 0.3345 0.1340 0.0402 0.0780 0.1761

11 87/392 0.2000 0.1500 0.2300 0.2219@392 0.0530 0.0051 0.0077 0.058712 67/184 0.5000 0.4500 0.3700 0.3641@184 0.1546 0.0272 0.0489 0.201113 58/313 0.6000 0.4000 0.1900 0.1853@313 0.0439 0.0192 0.0256 0.060714 33/401 0.0000 0.0000 0.0600 0.0823@401 0.0056 0.0000 0.0000 0.015015 74/315 0.5000 0.3000 0.2300 0.2349@315 0.0627 0.0159 0.0190 0.073016 141/554 0.2000 0.2000 0.2300 0.2545@554 0.0704 0.0036 0.0072 0.041517 120/648 0.6000 0.4000 0.2300 0.1852@648 0.0424 0.0093 0.0123 0.035518 57/409 0.1000 0.1000 0.1000 0.1394@409 0.0185 0.0024 0.0049 0.024419 84/341 0.6000 0.5000 0.2200 0.2463@341 0.0792 0.0176 0.0293 0.064520 59/289 0.3000 0.2500 0.2200 0.2042@289 0.0488 0.0104 0.0173 0.076121 196/798 0.3000 0.2000 0.2200 0.2456@798 0.0552 0.0038 0.0050 0.027622 42/220 0.0000 0.0500 0.1900 0.1909@220 0.0344 0.0000 0.0045 0.086423 79/274 0.1000 0.3000 0.2600 0.2883@274 0.0806 0.0036 0.0219 0.094924 116/430 0.4000 0.3500 0.2700 0.2698@430 0.0703 0.0093 0.0163 0.062825 63/376 0.4000 0.2000 0.1800 0.1676@376 0.0336 0.0106 0.0106 0.0479

AVG 2014 0.3200 0.2567 0.2133 0.2187 0.0569 0.0092 0.0154 0.0647

Table 5.10: Cosine Sim. Oracle results ranked by highest score for TREC TS 2013 (Ids: 1-10) and TREC TS 2014 (Ids: 11-25) (MAP 2013: 0.1495; MAP 2014: 0.0527).

Id  Correct  P@10  P@20  P@100  P@R  AP  R@10  R@20  R@100

1 89/233 0.8000 0.6500 0.4500 0.3820@233 0.1826 0.0343 0.0558 0.19312 60/168 0.7000 0.3500 0.4200 0.3571@168 0.1649 0.0417 0.0417 0.25003 8/42 0.2000 0.2000 – 0.1905@42 0.0364 0.0476 0.0952 –4 61/180 0.2000 0.3000 0.2800 0.3370@181 0.1055 0.0110 0.0331 0.15475 8/35 0.1000 0.1500 – 0.2286@35 0.0648 0.0286 0.0857 –6 38/172 0.1000 0.1500 0.2100 0.2209@172 0.0469 0.0058 0.0174 0.12217 – – – – – – – – –8 32/76 0.6000 0.5500 – 0.4211@76 0.2300 0.0789 0.1447 –9 24/68 0.8000 0.5000 – 0.3529@68 0.2066 0.1176 0.1471 –

10 50/95 0.8000 0.6000 – 0.5208@96 0.3081 0.0833 0.1250 –AVG 2013 0.4778 0.3833 0.3400 0.3345 0.1495 0.0499 0.0829 0.1800

11 87/392 0.0000 0.0000 0.1400 0.2219@392 0.0373 0.0000 0.0000 0.035712 67/184 0.3000 0.4500 0.4800 0.3641@184 0.1701 0.0163 0.0489 0.260913 58/313 0.2000 0.1000 0.1300 0.1853@313 0.0288 0.0064 0.0064 0.041514 33/401 0.0000 0.0000 0.0000 0.0823@401 0.0044 0.0000 0.0000 0.000015 74/315 0.0000 0.1000 0.1300 0.2349@315 0.0495 0.0000 0.0063 0.041316 141/554 0.3000 0.3000 0.3600 0.2545@554 0.0751 0.0054 0.0108 0.065017 120/648 0.0000 0.1000 0.1800 0.1852@648 0.0321 0.0000 0.0031 0.027818 57/409 0.0000 0.0000 0.0500 0.1394@409 0.0151 0.0000 0.0000 0.012219 84/341 0.1000 0.1000 0.1700 0.2463@341 0.0503 0.0029 0.0059 0.049920 59/289 0.0000 0.1000 0.2700 0.2042@289 0.0444 0.0000 0.0069 0.093421 196/798 0.4000 0.3000 0.0900 0.2456@798 0.0547 0.0050 0.0075 0.011322 42/220 0.0000 0.1000 0.2200 0.1909@220 0.0351 0.0000 0.0091 0.100023 79/274 0.0000 0.2500 0.3200 0.2883@274 0.0894 0.0000 0.0182 0.116824 116/430 0.2000 0.1500 0.2300 0.2698@430 0.0737 0.0047 0.0070 0.053525 63/376 0.0000 0.0000 0.1500 0.1676@376 0.0305 0.0000 0.0000 0.0399

AVG 2014 0.1000 0.1367 0.1947 0.2187 0.0527 0.0027 0.0087 0.0633


Table 5.11: LLR Oracle results ranked by time for TREC TS 2013 (Ids: 1-10) and TREC TS 2014 (Ids: 11-25) collections (MAP 2013: 0.0304; MAP 2014: 0.0817).

Id  Correct  P@10  P@20  P@100  P@R  AP  R@10  R@20  R@100

1 70/233 0.0000 0.2000 0.2900 0.3004@233 0.0846 0.0000 0.0172 0.12452 28/168 0.2000 0.2500 0.2200 0.1667@168 0.0406 0.0119 0.0298 0.13103 4/42 0.0000 0.0500 – 0.0952@42 0.0095 0.0000 0.0238 –4 25/180 0.1000 0.0500 0.1100 0.1381@181 0.0179 0.0055 0.0055 0.06085 0/35 0.0000 0.0000 – 0.0000@35 0.0000 0.0000 0.0000 –6 12/172 0.0000 0.0500 0.0800 0.0698@172 0.0067 0.0000 0.0058 0.04657 – – – – – – – – –8 9/76 0.3000 0.2000 – 0.1184@76 0.0262 0.0395 0.0526 –9 13/68 0.1000 0.2500 – 0.1912@68 0.0526 0.0147 0.0735 –10 17/95 0.1000 0.1000 – 0.1771@96 0.0357 0.0104 0.0208 –

AVG 2013 0.0889 0.1278 0.1750 0.1397 0.0304 0.0091 0.0255 0.090711 101/392 0.4000 0.3000 0.2700 0.2577@392 0.0679 0.0102 0.0153 0.068912 81/184 0.5000 0.3500 0.4100 0.4402@184 0.1818 0.0272 0.0380 0.222813 82/313 0.4000 0.4500 0.3000 0.2620@313 0.0802 0.0128 0.0288 0.095814 68/401 0.0000 0.0000 0.1500 0.1696@401 0.0270 0.0000 0.0000 0.037415 78/315 0.4000 0.3000 0.2100 0.2476@315 0.0599 0.0127 0.0190 0.066716 147/554 0.1000 0.1500 0.2800 0.2653@554 0.0743 0.0018 0.0054 0.050517 133/648 0.2000 0.2500 0.2000 0.2052@648 0.0447 0.0031 0.0077 0.030918 114/409 0.1000 0.2000 0.2100 0.2787@409 0.0696 0.0024 0.0098 0.051319 117/341 0.6000 0.5500 0.4200 0.3431@341 0.1455 0.0176 0.0323 0.123220 84/289 0.5000 0.2500 0.2700 0.2907@289 0.0783 0.0173 0.0173 0.093421 203/798 0.2000 0.2500 0.3300 0.2544@798 0.0663 0.0025 0.0063 0.041422 71/220 0.3000 0.3000 0.3300 0.3227@220 0.1145 0.0136 0.0273 0.150023 68/274 0.1000 0.2500 0.2200 0.2482@274 0.0626 0.0036 0.0182 0.080324 131/430 0.5000 0.5500 0.3600 0.3047@430 0.1053 0.0116 0.0256 0.083725 73/376 0.3000 0.2500 0.2200 0.1941@376 0.0469 0.0080 0.0133 0.0585

AVG 2014 0.3067 0.2933 0.2787 0.2723 0.0817 0.0096 0.0176 0.0837

Table 5.12: LLR Oracle results ranked by highest score for TREC TS 2013 (Ids: 1-10) and TREC TS 2014 (Ids: 11-25) (MAP 2013: 0.0277; MAP 2014: 0.0810).

Id  Correct  P@10  P@20  P@100  P@R  AP  R@10  R@20  R@100

1 70/233 0.0000 0.2000 0.2800 0.3004@233 0.0814 0.0000 0.0172 0.12022 28/168 0.5000 0.3500 0.0900 0.1667@168 0.0396 0.0298 0.0417 0.05363 4/42 0.0000 0.0500 – 0.0952@42 0.0082 0.0000 0.0238 –4 25/180 0.3000 0.4000 0.0900 0.1381@181 0.0356 0.0166 0.0442 0.04975 0/35 0.0000 0.0000 – 0.0000@35 0.0000 0.0000 0.0000 –6 12/172 0.1000 0.0500 0.0300 0.0698@172 0.0039 0.0058 0.0058 0.01747 – – – – – – – – –8 9/76 0.1000 0.1000 – 0.1184@76 0.0184 0.0132 0.0263 –9 13/68 0.2000 0.1500 – 0.1912@68 0.0377 0.0294 0.0441 –10 17/95 0.1000 0.0500 – 0.1771@96 0.0243 0.0104 0.0104 –

AVG 2013 0.1444 0.1500 0.1225 0.1397 0.0277 0.0117 0.0237 0.060211 101/392 0.2000 0.2500 0.2100 0.2577@392 0.0643 0.0051 0.0128 0.053612 81/184 0.4000 0.4500 0.4200 0.4402@184 0.1970 0.0217 0.0489 0.228313 82/313 0.6000 0.5500 0.3200 0.2620@313 0.0901 0.0192 0.0351 0.102214 68/401 0.0000 0.0000 0.0400 0.1696@401 0.0220 0.0000 0.0000 0.010015 78/315 0.2000 0.3500 0.2000 0.2476@315 0.0634 0.0063 0.0222 0.063516 147/554 0.2000 0.2500 0.2000 0.2653@554 0.0648 0.0036 0.0090 0.036117 133/648 0.6000 0.6500 0.3200 0.2052@648 0.0680 0.0093 0.0201 0.049418 114/409 0.4000 0.4000 0.2600 0.2787@409 0.0875 0.0098 0.0196 0.063619 117/341 0.3000 0.3500 0.2600 0.3431@341 0.1094 0.0088 0.0205 0.076220 84/289 0.5000 0.4000 0.3300 0.2907@289 0.1068 0.0173 0.0277 0.114221 203/798 0.4000 0.2500 0.1000 0.2544@798 0.0536 0.0050 0.0063 0.012522 71/220 0.1000 0.2000 0.3100 0.3227@220 0.0995 0.0045 0.0182 0.140923 68/274 0.1000 0.2500 0.2700 0.2482@274 0.0642 0.0036 0.0182 0.098524 131/430 0.0000 0.0000 0.2700 0.3047@430 0.0864 0.0000 0.0000 0.062825 73/376 0.0000 0.0000 0.2200 0.1941@376 0.0382 0.0000 0.0000 0.0585

AVG 2014 0.2667 0.2900 0.2487 0.2723 0.0810 0.0076 0.0172 0.0780


Table 5.13: LM Oracle results ranked by highest score for TREC TS 2013 (Ids: 1-10) and TREC TS 2014 (Ids: 11-25) (MAP 2013: 0.0123; MAP 2014: 0.0154).

Id  Correct  P@10  P@20  P@100  P@R  AP  R@10  R@20  R@100

1 24/233 0.0000 0.0000 0.0200 0.0429@233 0.0053 0.0000 0.0000 0.00862 9/168 0.5000 0.2500 0.0600 0.0417@168 0.0219 0.0298 0.0298 0.03573 2/42 0.0000 0.0500 – 0.0476@42 0.0028 0.0000 0.0238 –4 16/180 0.3000 0.3500 0.1200 0.0718@181 0.0264 0.0166 0.0387 0.06635 0/35 0.0000 0.0000 – 0.0000@35 0.0000 0.0000 0.0000 –6 2/172 0.0000 0.0000 0.0000 0.0000@172 0.0001 0.0000 0.0000 0.00007 – – – – – – – – –8 14/76 0.2000 0.1000 0.0500 0.0526@76 0.0248 0.0263 0.0263 0.06589 7/68 0.2000 0.1000 0.0700 0.0735@68 0.0219 0.0294 0.0294 0.1029

10 4/95 0.1000 0.0500 0.0300 0.0312@96 0.0073 0.0104 0.0104 0.0312AVG 2013 0.1444 0.1000 0.0500 0.0402 0.0123 0.0125 0.0176 0.0276

11 50/392 0.0000 0.0000 0.0700 0.0485@392 0.0075 0.0000 0.0000 0.017912 14/184 0.1000 0.0500 0.0400 0.0326@184 0.0040 0.0054 0.0054 0.021713 38/313 0.1000 0.1000 0.1100 0.0543@313 0.0088 0.0032 0.0064 0.035114 79/401 0.5000 0.4000 0.1900 0.0748@401 0.0276 0.0125 0.0200 0.047415 44/315 0.2000 0.1000 0.1000 0.0603@315 0.0118 0.0063 0.0063 0.031716 82/554 0.0000 0.0500 0.0900 0.0560@554 0.0098 0.0000 0.0018 0.016217 151/648 0.1000 0.1500 0.0700 0.0556@648 0.0126 0.0015 0.0046 0.010818 67/409 0.2000 0.3000 0.1300 0.0954@409 0.0217 0.0049 0.0147 0.031819 110/341 0.0000 0.0000 0.0200 0.0616@341 0.0231 0.0000 0.0000 0.005920 53/289 0.0000 0.0000 0.0600 0.1142@289 0.0155 0.0000 0.0000 0.020821 108/798 0.0000 0.0000 0.0100 0.0564@798 0.0080 0.0000 0.0000 0.001322 85/220 0.0000 0.0000 0.2400 0.1636@220 0.0527 0.0000 0.0000 0.109123 48/274 0.0000 0.0000 0.0100 0.0255@274 0.0073 0.0000 0.0000 0.003624 61/430 0.1000 0.1000 0.1400 0.0744@430 0.0130 0.0023 0.0047 0.032625 67/376 0.0000 0.0000 0.0200 0.0266@376 0.0074 0.0000 0.0000 0.0053

AVG 2014 0.0867 0.0833 0.0867 0.0666 0.0154 0.0024 0.0043 0.0261

Table 5.14: LexRank Oracle results ranked by highest score for TREC TS 2013 (Ids: 1-10) and TREC TS 2014 (Ids: 11-25) (MAP 2013: 0.0015; MAP 2014: 0.0151).

Id  Correct  P@10  P@20  P@100  P@R  AP  R@10  R@20  R@100

1 3/233 0.0000 0.0000 0.0000 0.0000@233 0.0001 0.0000 0.0000 0.00002 2/168 0.1000 0.0500 0.0100 0.0060@168 0.0060 0.0060 0.0060 0.00603 0/42 0.0000 0.0000 – 0.0000@42 0.0000 0.0000 0.0000 –4 5/180 0.1000 0.0500 0.0300 0.0221@181 0.0064 0.0055 0.0055 0.01665 0/35 0.0000 0.0000 – 0.0000@35 0.0000 0.0000 0.0000 –6 1/172 0.0000 0.0000 0.0000 0.0000@172 0.0000 0.0000 0.0000 0.00007 – – – – – – – – –8 3/76 0.0000 0.0000 0.0200 0.0000@76 0.0006 0.0000 0.0000 0.02639 0/68 0.0000 0.0000 0.0000 0.0000@68 0.0000 0.0000 0.0000 0.0000

10 0/95 0.0000 0.0000 0.0000 0.0000@96 0.0000 0.0000 0.0000 0.0000AVG 2013 0.0222 0.0111 0.0100 0.0031 0.0015 0.0013 0.0013 0.0056

11 49/392 0.0000 0.0000 0.0600 0.0485@392 0.0077 0.0000 0.0000 0.015312 8/184 0.0000 0.0000 0.0300 0.0217@184 0.0013 0.0000 0.0000 0.016313 32/313 0.0000 0.0000 0.0400 0.0383@313 0.0039 0.0000 0.0000 0.012814 – – – – – – – – –15 50/315 0.1000 0.1000 0.0900 0.0857@315 0.0135 0.0032 0.0063 0.028616 82/554 0.0000 0.0500 0.0600 0.0505@554 0.0085 0.0000 0.0018 0.010817 127/648 0.2000 0.1500 0.0700 0.0386@648 0.0092 0.0031 0.0046 0.010818 78/409 0.0000 0.0500 0.0900 0.0856@409 0.0177 0.0000 0.0024 0.022019 145/341 0.1000 0.1500 0.1500 0.1466@341 0.0603 0.0029 0.0088 0.044020 39/289 0.1000 0.1000 0.0400 0.0242@289 0.0080 0.0035 0.0069 0.013821 – – – – – – – – –22 87/220 0.2000 0.1000 0.1900 0.1545@220 0.0568 0.0091 0.0091 0.086423 58/274 0.0000 0.0000 0.0100 0.0328@274 0.0099 0.0000 0.0000 0.003624 – – – – – – – – –25 – – – – – – – – –

AVG 2014 0.0538 0.0583 0.0755 0.0661 0.0151 0.0017 0.0033 0.0240


Table 5.15: Oracle results ranked by time for entity co-occurrence at the sentence level for TREC TS 2013 and 2014 (MAP 2013: 0.0261; MAP 2014: 0.0089).

Id  Correct  P@10  P@20  P@100  P@R  AP  R@10  R@20  R@100

1 42/233 0.2000 0.1500 0.1700 0.1803@233 0.0323 0.0086 0.0129 0.07302 32/168 0.4000 0.3000 0.2200 0.1905@168 0.0515 0.0238 0.0357 0.13103 3/42 0.0000 0.0500 – 0.0714@42 0.0055 0.0000 0.0238 –4 22/180 0.3000 0.2500 0.1700 0.1215@181 0.0389 0.0166 0.0276 0.09395 0/35 0.0000 0.0000 – 0.0000@35 0.0000 0.0000 0.0000 –6 17/172 0.0000 0.0500 0.1400 0.0988@172 0.0144 0.0000 0.0058 0.08147 – – – – – – – – –8 9/76 0.0000 0.0000 – 0.1184@76 0.0135 0.0000 0.0000 –9 12/68 0.2000 0.2500 – 0.1765@68 0.0358 0.0294 0.0735 –10 12/95 0.3000 0.2500 – 0.1250@96 0.0428 0.0312 0.0521 –

AVG 2013 0.1556 0.1444 0.1750 0.1203 0.0261 0.0122 0.0257 0.094811 20/392 0.2000 0.1000 0.0600 0.0510@392 0.0039 0.0051 0.0051 0.015312 9/184 0.0000 0.0000 0.0500 0.0489@184 0.0023 0.0000 0.0000 0.027213 32/313 0.0000 0.0000 0.1000 0.1022@313 0.0090 0.0000 0.0000 0.031914 52/401 0.0000 0.0000 0.0900 0.1297@401 0.0164 0.0000 0.0000 0.022415 25/315 0.0000 0.0000 0.0900 0.0794@315 0.0071 0.0000 0.0000 0.028616 64/554 0.0000 0.0500 0.1000 0.1155@554 0.0138 0.0000 0.0018 0.018117 45/648 0.3000 0.3000 0.0900 0.0694@648 0.0087 0.0046 0.0093 0.013918 36/409 0.0000 0.0500 0.0500 0.0880@409 0.0068 0.0000 0.0024 0.012219 22/341 0.2000 0.1500 0.0900 0.0645@341 0.0064 0.0059 0.0088 0.026420 39/289 0.1000 0.0500 0.1600 0.1349@289 0.0184 0.0035 0.0035 0.055421 97/798 0.0000 0.1000 0.1500 0.1216@798 0.0144 0.0000 0.0025 0.018822 14/220 0.0000 0.0000 0.0500 0.0636@220 0.0035 0.0000 0.0000 0.022723 11/274 0.0000 0.0000 0.0600 0.0401@274 0.0022 0.0000 0.0000 0.021924 38/430 0.1000 0.1000 0.0800 0.0884@430 0.0086 0.0023 0.0047 0.018625 34/376 0.0000 0.1500 0.1200 0.0904@376 0.0127 0.0000 0.0080 0.0319

AVG 2014 0.0600 0.0700 0.0893 0.0859 0.0089 0.0014 0.0031 0.0244

Table 5.16: Oracle results ranked by highest score for entity co-occurrence at the sentence level for TREC TS 2013 and 2014 (MAP 2013: 0.0149; MAP 2014: 0.0103).

Id  Correct  P@10  P@20  P@100  P@R  AP  R@10  R@20  R@100

1 42/233 0.2000 0.1500 0.1000 0.1803@233 0.0270 0.0086 0.0129 0.04292 32/168 0.0000 0.0000 0.1300 0.1905@168 0.0258 0.0000 0.0000 0.07743 3/42 0.0000 0.0500 – 0.0714@42 0.0049 0.0000 0.0238 –4 22/180 0.0000 0.0500 0.1300 0.1215@181 0.0158 0.0000 0.0055 0.07185 0/35 0.0000 0.0000 – 0.0000@35 0.0000 0.0000 0.0000 –6 17/172 0.0000 0.0500 0.1400 0.0988@172 0.0117 0.0000 0.0058 0.08147 – – – – – – – – –8 9/76 0.1000 0.0500 – 0.1184@76 0.0133 0.0132 0.0132 –9 12/68 0.0000 0.0000 – 0.1765@68 0.0221 0.0000 0.0000 –10 12/95 0.0000 0.0000 – 0.1250@96 0.0138 0.0000 0.0000 –

AVG 2013 0.0333 0.0389 0.1250 0.1203 0.0149 0.0024 0.0068 0.068411 20/392 0.0000 0.0000 0.0200 0.0510@392 0.0021 0.0000 0.0000 0.005112 9/184 0.0000 0.0000 0.0300 0.0489@184 0.0017 0.0000 0.0000 0.016313 32/313 0.0000 0.0000 0.1000 0.1022@313 0.0105 0.0000 0.0000 0.031914 52/401 0.0000 0.3500 0.2700 0.1297@401 0.0316 0.0000 0.0175 0.067315 25/315 0.0000 0.0500 0.0300 0.0794@315 0.0050 0.0000 0.0032 0.009516 64/554 0.2000 0.1500 0.0700 0.1155@554 0.0112 0.0036 0.0054 0.012617 45/648 0.0000 0.0500 0.1000 0.0694@648 0.0049 0.0000 0.0015 0.015418 36/409 0.1000 0.1000 0.1100 0.0880@409 0.0090 0.0024 0.0049 0.026919 22/341 0.0000 0.0000 0.0100 0.0645@341 0.0028 0.0000 0.0000 0.002920 39/289 0.0000 0.1000 0.2400 0.1349@289 0.0301 0.0000 0.0069 0.083021 97/798 0.1000 0.0500 0.0800 0.1216@798 0.0135 0.0013 0.0013 0.010022 14/220 0.0000 0.0000 0.0800 0.0636@220 0.0043 0.0000 0.0000 0.036423 11/274 0.0000 0.0000 0.0200 0.0401@274 0.0012 0.0000 0.0000 0.007324 38/430 0.4000 0.4000 0.1300 0.0884@430 0.0200 0.0093 0.0186 0.030225 34/376 0.0000 0.0000 0.0700 0.0904@376 0.0061 0.0000 0.0000 0.0186

AVG 2014 0.0533 0.0833 0.0907 0.0859 0.0103 0.0011 0.0040 0.0249


Table 5.17: Oracle results ranked by time with sentences scored by the combined entity metric for TREC TS 2013 and 2014 (MAP 2013: 0.0283; MAP 2014: 0.0147).

Id  Correct  P@10  P@20  P@100  P@R  AP  R@10  R@20  R@100

1 43/233 0.2000 0.2000 0.2000 0.1845@233 0.0367 0.0086 0.0172 0.08582 31/168 0.4000 0.3000 0.2300 0.1845@168 0.0532 0.0238 0.0357 0.13693 3/42 0.0000 0.0500 – 0.0714@42 0.0055 0.0000 0.0238 –4 21/180 0.3000 0.2500 0.1700 0.1160@181 0.0383 0.0166 0.0276 0.09395 1/35 0.0000 0.0000 – 0.0286@35 0.0008 0.0000 0.0000 –6 17/172 0.0000 0.0500 0.1400 0.0988@172 0.0144 0.0000 0.0058 0.08147 – – – – – – – – –8 11/76 0.1000 0.0500 – 0.1447@76 0.0199 0.0132 0.0132 –9 14/68 0.2000 0.2000 – 0.2059@68 0.0426 0.0294 0.0588 –

10 12/95 0.3000 0.2500 – 0.1250@96 0.0429 0.0312 0.0521 –AVG 2013 0.1667 0.1500 0.1850 0.1288 0.0283 0.0136 0.0260 0.0995

11 30/392 0.0000 0.0500 0.0700 0.0765@392 0.0059 0.0000 0.0026 0.017912 30/184 0.0000 0.1000 0.1900 0.1630@184 0.0317 0.0000 0.0109 0.103313 35/313 0.0000 0.0500 0.0700 0.1118@313 0.0103 0.0000 0.0032 0.022414 7/401 0.0000 0.0000 0.0100 0.0175@401 0.0002 0.0000 0.0000 0.002515 59/315 0.4000 0.4000 0.2500 0.1873@315 0.0510 0.0127 0.0254 0.079416 65/554 0.2000 0.1000 0.1700 0.1173@554 0.0196 0.0036 0.0036 0.030717 54/648 0.0000 0.0000 0.1000 0.0833@648 0.0080 0.0000 0.0000 0.015418 14/409 0.3000 0.2000 0.0500 0.0342@409 0.0061 0.0073 0.0098 0.012219 42/341 0.1000 0.1500 0.1200 0.1232@341 0.0190 0.0029 0.0088 0.035220 3/289 0.0000 0.0500 0.0200 0.0104@289 0.0004 0.0000 0.0035 0.006921 133/798 0.3000 0.2000 0.1800 0.1667@798 0.0283 0.0038 0.0050 0.022622 19/220 0.0000 0.0000 0.0600 0.0864@220 0.0068 0.0000 0.0000 0.027323 25/274 0.4000 0.3500 0.1500 0.0912@274 0.0224 0.0146 0.0255 0.054724 36/430 0.1000 0.0500 0.0800 0.0837@430 0.0071 0.0023 0.0023 0.018625 19/376 0.0000 0.0500 0.0600 0.0505@376 0.0031 0.0000 0.0027 0.0160

AVG 2014 0.1200 0.1167 0.1053 0.0935 0.0147 0.0032 0.0069 0.0310

Table 5.18: Oracle results ranked by highest score with sentences scored by the combined entity metric for TREC TS 2013 and 2014 (MAP 2013: 0.0184; MAP 2014: 0.0111).

Id  Correct  P@10  P@20  P@100  P@R  AP  R@10  R@20  R@100

1 43/233 0.1000 0.1000 0.1400 0.1845@233 0.0284 0.0043 0.0086 0.06012 31/168 0.1000 0.0500 0.1200 0.1845@168 0.0314 0.0060 0.0060 0.07143 3/42 0.0000 0.0500 – 0.0714@42 0.0049 0.0000 0.0238 –4 21/180 0.1000 0.0500 0.1100 0.1160@181 0.0133 0.0055 0.0055 0.06085 1/35 0.0000 0.0500 – 0.0286@35 0.0016 0.0000 0.0286 –6 17/172 0.0000 0.0500 0.1400 0.0988@172 0.0117 0.0000 0.0058 0.08147 – – – – – – – – –8 11/76 0.0000 0.1000 – 0.1447@76 0.0197 0.0000 0.0263 –9 14/68 0.2000 0.2000 – 0.2059@68 0.0409 0.0294 0.0588 –

10 12/95 0.0000 0.0000 – 0.1250@96 0.0138 0.0000 0.0000 –AVG 2013 0.0556 0.0722 0.1275 0.1288 0.0184 0.0050 0.0182 0.0684

11 30/392 0.1000 0.1000 0.0700 0.0765@392 0.0070 0.0026 0.0051 0.017912 30/184 0.1000 0.2000 0.0800 0.1630@184 0.0239 0.0054 0.0217 0.043513 35/313 0.1000 0.0500 0.1300 0.1118@313 0.0141 0.0032 0.0032 0.041514 7/401 0.0000 0.0000 0.0100 0.0175@401 0.0003 0.0000 0.0000 0.002515 59/315 0.1000 0.0500 0.1400 0.1873@315 0.0300 0.0032 0.0032 0.044416 65/554 0.0000 0.0000 0.0300 0.1173@554 0.0094 0.0000 0.0000 0.005417 54/648 0.0000 0.0000 0.1500 0.0833@648 0.0090 0.0000 0.0000 0.023118 14/409 0.0000 0.0000 0.0500 0.0342@409 0.0012 0.0000 0.0000 0.012219 42/341 0.0000 0.0000 0.1000 0.1232@341 0.0129 0.0000 0.0000 0.029320 3/289 0.0000 0.0000 0.0100 0.0104@289 0.0002 0.0000 0.0000 0.003521 133/798 0.1000 0.1500 0.0900 0.1667@798 0.0276 0.0013 0.0038 0.011322 19/220 0.4000 0.2000 0.0900 0.0864@220 0.0133 0.0182 0.0182 0.040923 25/274 0.0000 0.0000 0.0200 0.0912@274 0.0058 0.0000 0.0000 0.007324 36/430 0.3000 0.1500 0.0500 0.0837@430 0.0091 0.0070 0.0070 0.011625 19/376 0.1000 0.0500 0.0600 0.0505@376 0.0034 0.0027 0.0027 0.0160

AVG 2014 0.0867 0.0633 0.0720 0.0935 0.0111 0.0029 0.0043 0.0207


5.6 Analysis and Conclusions

In what follows we compare and contrast the performance of our sentenceupdate identification methods. We first take a look at the oracle baselineconsidering the time ranked results, and then we look at the results obtainedwhen sentences are scored by their corresponding retrieval scores. We alsocompare these with the case when there is no oracle constraint. To facilitatethis comparison, in Table 5.19 we present a summary of the oracle-basedresults when scoring sentences across documents by time and within docu-ments by the highest score, while in Table 5.20 we present the summary ofthe oracle methods when ranking by the highest scores across documents. InAppendix A, Table A.27 we present equivalent results for the case when weimpose no oracle constraint.

We begin by analyzing the performance of our methods when we use the oracle baseline and rank sentences across documents by time. From each document we select as many highest ranked sentence updates as indicated by the external oracle. From Table 5.19 we observe that, interestingly, methods which work well on the TREC TS 2013 collection do not perform so well on the TREC TS 2014 collection in terms of MAP. Methods such as TF, BM25 and TF.ISF seem to behave similarly, as opposed to the query likelihood and cosine similarity methods. We observe that for the TREC TS 2013 collection cosine similarity presents the best results (MAP 0.1340), followed closely by the query likelihood method with no smoothing (MAP 0.1323). When we score sentences by the LLR terms extracted from the general model or by the number of entities present at the sentence and document level, results improve over TF or BM25, but are still low compared to the cosine similarity results. However, for the TREC TS 2014 collection we see a change in performance. QL and COS.SIM are no longer the top performers, while TF, BM25 and TF.ISF yield some of the highest results. The LLR method seems to do better at identifying relevant sentence updates. This is encouraging evidence that discriminative terms are beneficial in reliably distinguishing a sentence update from a non-update.

After removing the time ranking and scoring sentences by the highest scores across all documents, we can see from Table 5.20 a significant increase in our scores. This illustrates that the time ordering of documents does not necessarily help precision, and that the most central sentences are not always found in documents close to the onset of an event. Furthermore, in the beginning of an event salient information is harder to identify, as less is known about the actors, the ongoing sub-events, the people and locations affected, or the event impact. Including in the summary the first breaking news about the event therefore does not provide the reader with enough relevant information to obtain a comprehensive and precise description of the event. In terms of results without the time ranking, the same pattern as in the case of results ranked by time dominates. For TREC TS 2013, query likelihood and cosine similarity are still the best performers (with query likelihood presenting the highest MAP, 0.1654). However, looking at the TREC TS 2014 collection we see that the performance of TF, BM25 and TF.ISF is very similar, with TF.ISF the clear winner (MAP 0.1072). Language modeling of event updates, centrality, and sentence scoring based on the number of entities contained inside the sentence are less effective approaches for the task. LLR presents good precision, confirming once again that we need discriminative terms in order to pick the right sentence updates for inclusion in the event summary.

We now turn to a comparison of the three categories of methods employed in finding relevant updates. In terms of event update centrality, we observe that running LexRank on the set of all relevant documents for an event is not an effective strategy, considering the low precision and recall scores obtained. LexRank is thus not able to successfully identify the most relevant updates that have been annotated by the TREC assessors in the gold standard. On the one hand, one explanation of this outcome could be the incomplete annotations in the gold standard: LexRank might identify and rank highly salient sentences which have not been considered for pooling. On the other hand, the annotations in the gold standard might be biased towards a specific method, if multiple participants submitted runs relying on that method; this would in turn bias the gold standard sentence updates towards that method, which is not necessarily focused on centrality. On top of this, it is also possible that LexRank's poor performance stems from the fact that it is doing general rather than query-focused summarization, presenting the user with the most central sentences describing an event over time, which do not necessarily contain the query terms. A query-biased LexRank version might be more effective, but we leave this experiment for future work.

Language modeling seems to perform better than LexRank on the TREC TS 2013 collection, and very close to LexRank on the TREC TS 2014 collection. However, much improvement is still needed, as the performance of the unigram language model trained on the relevant updates is lower than expected. Given the limited number of relevant updates for each event in the gold standard, we consider its coverage to be limited. One possibility to enhance its performance would be to train the language model on external data specific to our event categories, particularly focused on the language of news and disasters. In [49], the authors train a 5-gram language model with Kneser-Ney smoothing on the Associated Press section of the Gigaword corpus [37] and Wikipedia. However, one of the TREC requirements is that such external data needs to be time-aligned with the TREC TS corpus, and no information from the future about the event should leak into the processing system. Our assumption was that training the language model on the set of relevant updates would lead to a general event model that learns the characteristics of an update and picks sentences for inclusion into the summary with high accuracy (as we are overfitting). Including external information was not our purpose, as we assumed that the set of relevant gold standard updates already contains all the information we need. Since this turns out not to be the case, we believe that many non-updates share common terms with the event updates, which makes the relevant sentences particularly hard for our language model to spot. Scoring sentences by the log probabilities of the discriminative log-likelihood ratio terms yields better results than LexRank or event language modelling. These terms have been inferred from the corpus of relevant event updates. Relevant and salient sentence updates get selected more frequently, and their presence at the top of the ranking results in an increase in precision and recall on the TREC TS 2014 collection compared to the TREC TS 2013 collection.


We also notice the difference in performance between collections for the log-likelihood ratio method. While the presence of LLR terms inside sentences from the TREC TS 2013 collection is not a very strong signal of the relevance of those sentences, on the TREC TS 2014 collection the presence of such discriminative terms appears to be a more reliable indicator of sentence salience. As we have inferred discriminative terms from each collection separately, we believe that the larger amount of annotated data for TREC TS 2014 results in more discriminative terms in our LLR list for this collection, and therefore in better performance of this method on TREC TS 2014 compared to TREC TS 2013. When using the oracle and ranking sentences by time, LLR yields the best MAP value (0.0817) for TREC TS 2014, and still one of the highest values when ranking sentences by their retrieval scores (0.0810). On TREC TS 2013, the LLR MAP values are much lower than on TREC TS 2014.

Entities seem to play a role in summarization; however, more robust methods need to be developed for modelling the salience of an entity over the duration of an event. Our present methods are rather basic, relying only on statistical information about the presence of an entity inside a sentence or at the document level. To accurately track the entities inside the relevant updates, we need to be able to identify these entities among the large collection of entities present in the corpus. We leave the development of entity salience models for future work.

Looking at our information retrieval methods, we observe that their performance varies depending on the type of method used. In general, TF, Okapi BM25, and TF.ISF produce similar scores for the TREC TS 2013 and 2014 collections, which contrast with the results yielded by QL and COS.SIM. Interestingly, on the TREC TS 2013 dataset QL and COS.SIM show the best performance, while on the TREC TS 2014 dataset they present the worst performance of all the information retrieval methods we test. The QL method with no smoothing performs best on the TREC TS 2013 collection, while the TF.ISF method gives the best results on the TREC TS 2014 collection. Our hypothesis is that this might be a side effect of the length of the sentences found in the TREC TS 2013 and TREC TS 2014 collections. In order to get a better understanding of whether this is indeed the case, we compute histograms of the length of the sentences inside the documents from the TREC TS 2013 collection in Figure 5.2, and of the sentences inside the documents from the TREC TS 2014 collection in Figure 5.3. We remove stop words before plotting the sentence length histograms. We observe that sentences in the TREC TS 2013 collection are less than 80 tokens in length, with a peak around 25 tokens. Looking at the TREC TS 2014 collection, almost all sentences are less than 20 tokens in length, and the majority of sentences have a length below 10 tokens. The average sentence length in the TREC TS 2013 collection is therefore much higher than in the TREC TS 2014 collection: TREC TS 2013 displays a tendency towards longer sentences, while this trend is reversed in TREC TS 2014, which contains much shorter sentences. We believe this to be a likely cause of the differences in the effectiveness of the retrieval methods on the two collections. Information retrieval methods that perform length normalization (QL and COS.SIM) perform better on the TREC TS 2013 collection, while information retrieval methods that do not incorporate length normalization at the sentence level (TF, BM25, TF.ISF) perform better on the TREC TS 2014 collection.
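
The histograms themselves are straightforward to reproduce. A minimal sketch of the computation (Python; it assumes a stop word set and a pre-tokenised sentence collection, which is not necessarily how our pipeline tokenises the corpus) is:

    from collections import Counter

    def sentence_length_histogram(sentences, stopwords):
        """Histogram of sentence lengths (in tokens) after stop word removal.

        sentences : iterable of token lists, one per sentence
        stopwords : set of stop words to discard before counting
        """
        lengths = [sum(1 for tok in sent if tok.lower() not in stopwords)
                   for sent in sentences]
        return Counter(lengths)  # length -> number of sentences with that length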

Table 5.19: Summary of the performance of Oracle methods and techniques with output ranked by time for TREC TS 2013 (Ids: 1-10) and TREC TS 2014 (Ids: 11-25).

Method              P@10    P@20    P@100   P@R     MAP     R@10    R@20    R@100
TF AVG 2013         0.0556  0.0556  0.0875  0.0519  0.0062  0.0032  0.0092  0.0479
BM25 AVG 2013       0.0444  0.0222  0.0350  0.0219  0.0024  0.0030  0.0030  0.0186
TF.ISF AVG 2013     0.1556  0.1667  0.1725  0.1660  0.0407  0.0172  0.0355  0.0893
QL AVG 2013         0.3333  0.3444  0.3125  0.3228  0.1323  0.0356  0.0780  0.1671
COS.SIM AVG 2013    0.3556  0.3556  0.3325  0.3345  0.1340  0.0402  0.0780  0.1761
LLR AVG 2013        0.0889  0.1278  0.1750  0.1397  0.0304  0.0091  0.0255  0.0907
E2 AVG 2013         0.1556  0.1444  0.1750  0.1203  0.0261  0.0122  0.0257  0.0948
E1 + E2 AVG 2013    0.1667  0.1500  0.1850  0.1288  0.0283  0.0136  0.0260  0.0995
TF AVG 2014         0.3067  0.2867  0.2773  0.2597  0.0779  0.0088  0.0173  0.0828
BM25 AVG 2014       0.2867  0.2667  0.2640  0.2615  0.0754  0.0084  0.0163  0.0785
TF.ISF AVG 2014     0.3067  0.2767  0.2593  0.2614  0.0756  0.0089  0.0167  0.0782
QL AVG 2014         0.1800  0.1867  0.1693  0.1874  0.0401  0.0054  0.0116  0.0509
COS.SIM AVG 2014    0.3200  0.2567  0.2133  0.2187  0.0569  0.0092  0.0154  0.0647
LLR AVG 2014        0.3067  0.2933  0.2787  0.2723  0.0817  0.0096  0.0176  0.0837
E2 AVG 2014         0.0600  0.0700  0.0893  0.0859  0.0089  0.0014  0.0031  0.0244
E1 + E2 AVG 2014    0.1200  0.1167  0.1053  0.0935  0.0147  0.0032  0.0069  0.0310

Table 5.20: Summary of the performance of Oracle methods and techniques with output ranked by highest score for TREC TS 2013 (Ids: 1-10) and TREC TS 2014 (Ids: 11-25).

Method              P@10    P@20    P@100   P@R     MAP     R@10    R@20    R@100
TF AVG 2013         0.0000  0.0222  0.0500  0.0519  0.0034  0.0000  0.0047  0.0281
BM25 AVG 2013       0.0333  0.0278  0.0300  0.0219  0.0015  0.0024  0.0053  0.0153
TF.ISF AVG 2013     0.2000  0.1556  0.1100  0.1660  0.0416  0.0208  0.0321  0.0534
QL AVG 2013         0.4444  0.4833  0.3950  0.3228  0.1654  0.0442  0.0986  0.2130
COS.SIM AVG 2013    0.4778  0.3833  0.3400  0.3345  0.1495  0.0499  0.0829  0.1800
LLR AVG 2013        0.1444  0.1500  0.1225  0.1397  0.0277  0.0117  0.0237  0.0602
LM AVG 2013         0.1444  0.1000  0.0500  0.0402  0.0123  0.0125  0.0176  0.0276
LexRank AVG 2013    0.0222  0.0111  0.0100  0.0031  0.0015  0.0013  0.0013  0.0056
E2 AVG 2013         0.0333  0.0389  0.1250  0.1203  0.0149  0.0024  0.0068  0.0684
E1 + E2 AVG 2013    0.0556  0.0722  0.1275  0.1288  0.0184  0.0050  0.0182  0.0684
TF AVG 2014         0.3533  0.3833  0.3480  0.2597  0.0898  0.0095  0.0218  0.1025
BM25 AVG 2014       0.4133  0.3833  0.3247  0.2615  0.0900  0.0120  0.0227  0.0958
TF.ISF AVG 2014     0.4533  0.4900  0.4233  0.2614  0.1072  0.0130  0.0276  0.1217
QL AVG 2014         0.2000  0.2033  0.2133  0.1874  0.0453  0.0060  0.0112  0.0594
COS.SIM AVG 2014    0.1000  0.1367  0.1947  0.2187  0.0527  0.0027  0.0087  0.0633
LLR AVG 2014        0.2667  0.2900  0.2487  0.2723  0.0810  0.0076  0.0172  0.0780
LM AVG 2014         0.0867  0.0833  0.0867  0.0666  0.0154  0.0024  0.0043  0.0261
LexRank AVG 2014    0.0538  0.0583  0.0755  0.0661  0.0151  0.0017  0.0033  0.0240
E2 AVG 2014         0.0533  0.0833  0.0907  0.0859  0.0103  0.0011  0.0040  0.0249
E1 + E2 AVG 2014    0.0867  0.0633  0.0720  0.0935  0.0111  0.0029  0.0043  0.0207

Therefore, given the longer TREC TS 2013 sentences, QL and COS.SIM perform better, while given the shorter TREC TS 2014 sentences, TF, BM25 and TF.ISF yield better results. This pattern is preserved when there is no oracle baseline, as can be seen from Table A.27 in Appendix A: COS.SIM and QL are still the top performers for TREC TS 2013, while TF.ISF yields the best results on the TREC TS 2014 collection.

In general, much better precision is obtained when using the oracle and ranking sentences by highest scores across documents, than when ranking them by time first across documents and then by highest scores inside documents. The only exceptions to this case are TF, BM25 and entities (E1, E1 + E2) for the TREC TS 2013 collection, and COS.SIM, LLR and the combined entity metric (E1 + E2) for the TREC TS 2014 collection. We believe that these methods perform better when ranking the output first by time because they successfully manage to identify relevant sentences directly from the onset of an event. This also means that even though the retrieval scores assigned to the relevant sentences from the beginning of an event are much lower than the scores for the top ranked sentences identified throughout the duration of an event, these sentences are as important as other more prominent sentences, and their inclusion in the summary contributes equally to higher precision scores.

Figure 5.2: Sentence length histogram for the sentences inside the relevant documents from the TREC TS 2013 collection.

Figure 5.3: Sentence length histogram for the sentences inside the relevant documents from the TREC TS 2014 collection.

In conclusion, in this chapter we have employed information retrieval, event language modelling and sentence centrality methods for identifying relevant updates for an event. We observe that the query-focused information retrieval methods display the best performance, and that there is a correlation between the effectiveness of a method and the length of the retrieved sentences. Long sentences may contain the same terms repeatedly; as a result, term frequency factors may be large for long sentences, making a sentence more similar to the query, but not necessarily more relevant than a shorter sentence with fewer query terms. Sentence length normalization has the potential to correct for this. Accordingly, we see that methods which include a length normalization component identify sentence updates more reliably and with higher accuracy.

After assessing the performance of the approaches presented in this chap-ter on an individual basis, it becomes interesting to check whether combiningretrieval, centrality and event modeling methods as features inside a machinelearning algorithm can yield improvements over the results obtained so far.If this is the case, then these approaches are complementary, and their in-teraction can lead to more accurate modeling of the function for sentencesalience prediction. We explore this possibility in Chapter 6.

5.7 Future Work

So far in this chapter we have tested the potential of retrieving relevant sentence updates for an event through information retrieval, event modeling and event centrality methods. In what follows we describe novel ideas for future work which we believe are worth exploring, but which we could not fully investigate in this thesis.

In particular, we would like to describe our idea of selecting sentences in a temporal summary based on the entities and relations present inside that sentence. We feel that our methods for scoring sentences based on the number of co-occurring entities or entity frequency at the document level are rather basic, and not fully explored. As we could see in Chapter 4, entities and relations seem to have a lot of potential in modeling relevant sentence updates. In addition to entities, we were also aiming at extracting relations between the entities co-occurring in a sentence update for summarization purposes. To this end, we tagged each sentence inside our corpus using the OpenIE4 information extraction system [32]. The Open IE system runs over sentences in

4https://github.com/knowitall/openie


a document and creates extractions that represent relations between the entities identified inside each sentence update. The extractors focus on generic ways in which relationships are expressed in the English language, generalizing across domains. For example, given the sentence “The U.S. president Barack Obama gave his speech on Tuesday to thousands of people.”, OpenIE would produce the following n-ary relations: (Barack Obama, is the president of, the U.S.), and (Barack Obama, gave, [his speech, on Tuesday, to thousands of people]). While in the given example Barack Obama is indeed an entity, not all extracted arguments of the IE relations contain entities. One of the well-known problems of OpenIE systems, which also affects the relationships we obtained, is incoherent and uninformative extractions. For our TREC dataset in particular, this amplifies the level of noise and, given the time constraints, makes the selection of relevant sentences based on the relations between entities a challenging task to achieve and explore in this thesis.

Our motivation for extracting entities and relations between entities inside the sentences in the collection was to include these relations in a sentence scoring algorithm, as follows. We would first identify entities and extract relations using open source tools, after which we would proceed to computing the entity topicality, as well as the salience of a specific relation, in a given document. As documents are streaming in, we would continuously update the entity topicality, as well as the salience of a relation in an event. Then we would score sentences based on: i) entities only, as a function of entity topicality inside the document and entity topicality inside the event; ii) entities and relations, outputting sentences in which relationships are identified that contain topical entities; iii) an ontology, after mapping the extracted relations to existing relations in an ontology; and iv) a frontier, where, after identifying entities and relations inside the query (at this stage we would consider the entities inside the query as frontier entities), we would select sentences that contain relations which include the frontier entities, and would update the frontier with new entities and relations found inside the selected sentences. We would repeat this process throughout the entire duration of an event. A minimal sketch of the frontier-based variant is given below.
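To make the intended procedure concrete, the following is a minimal sketch of the frontier-based variant, under simplifying assumptions: relation triples are taken as already extracted (e.g., by an OpenIE-style tool) and entity strings are matched exactly. All function and variable names are hypothetical and do not correspond to an implemented system.

```python
def frontier_summary(query_entities, sentence_stream, max_updates=20):
    """Greedy frontier-based selection. Start from the entities mentioned in
    the query, add sentences whose extracted relations touch the frontier,
    and grow the frontier with the entities of every selected sentence."""
    frontier = set(query_entities)
    summary = []
    for sentence, relations in sentence_stream:
        if len(summary) >= max_updates:
            break
        # all entity arguments mentioned by the relations of this sentence
        entities = {arg for (arg1, _rel, arg2) in relations
                    for arg in (arg1, arg2)}
        if entities & frontier:           # the sentence touches the frontier
            summary.append(sentence)
            frontier |= entities          # expand the frontier
    return summary

if __name__ == "__main__":
    stream = [
        ("Hurricane Sandy made landfall in New Jersey.",
         [("Hurricane Sandy", "made landfall in", "New Jersey")]),
        ("New Jersey declared a state of emergency.",
         [("New Jersey", "declared", "state of emergency")]),
        ("A football match was played in Madrid.",
         [("football match", "was played in", "Madrid")]),
    ]
    print(frontier_summary({"Hurricane Sandy"}, stream))
```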

Given the time constraints and the noisy extracted relations we are not able to implement this idea at the moment; however, we consider it a novel approach for the current research problem, worthy of more in-depth exploration. We would like to focus on this aspect in future work on document summarization.


Chapter 6

Machine Learning for Summarization

A key problem in text summarization is finding the salience function which determines what information from the source documents should be included in the output summary. Determining whether a sentence is salient in the input text is the result of the interplay of a combination of factors. These factors interact with each other and encompass a wide range of aspects, such as the nature and genre of the source text, the desired compression rate, the user’s information needs, etc. Methods that have been employed so far in scoring sentences for salience rely on statistical measures of term prominence, sentence similarity, and the presence or absence of specific features such as proper names, geographical locations, and lexical and/or syntactic features. However, the importance of a particular feature varies with the genre of the text [60]. For example, in newswire articles, the location of a sentence inside a document is a reliable factor in assessing the information gain carried by that sentence. Due to the journalistic convention of summarizing the article in the opening paragraphs, extracting leading sentences from each paragraph is often a reliable strategy in assembling a summary of the main points covered in that news article. However, for other genres of text, like scientific or technical articles, the introduction and the conclusion sections usually contain pre-summarized material. Therefore, if we are to develop an automatic summarization system that can adapt to different genres of text, we need an automatic way in which we can infer which features are most useful for a genre, instead of selecting and combining features in an ad-hoc manner. In fact, this represents the basis of a trainable machine learning approach to summarization.

In this chapter we describe a supervised machine learning approach for the task of extractive summarization of news events. Our focus is on machine learning aspects, in particular a performance-level comparison between the results obtained when scoring sentences with the methods presented in Chapter 5, and the results we obtain when employing machine learning algorithms for sentence salience identification.

6.1 Approach

We approach the task of multi-document text summarization as a supervised learning task. We employ a set of retrieval, centrality and event modeling features in training a classifier for predicting whether a sentence is a relevant event update and should be included in the summary of the event, or otherwise discarded. It is well-known that the success of machine learning


for summarization depends on the heuristics used for extracting features; however, there are few indications on how to choose relevant and informative features. Moreover, the relative importance of different sets of features used in training classifiers is currently not very well understood [44, 76], and represents an active research area.

At this stage we are not interested in exploring the utility of other summarization features from the literature, but rather in determining the performance of our information retrieval, centrality and event modeling based features when we combine them inside a learning algorithm. We expect that through their interaction we can learn a better salience prediction function than when we use them separately in scoring sentences. Therefore, the features we use in our current machine learning approach to summarization are the following:

Retrieval Features. We use the scores assigned to each sentence by the term frequency, Okapi BM25, TF.ISF, query likelihood and cosine similarity methods.

Event Modeling Features. We use the language event model we estimate from the set of relevant sentence updates to measure the likelihood that a particular sentence is produced by this sentence update model. In addition to the language modeling feature, we compute the LLR score of each sentence by summing up the probabilities of the terms from the general LLR event model that are found inside the sentence.

Centrality Features. We rely on LexRank to compute centrality scores as a measure of sentence salience across documents. For each sentence we use the centrality score returned by this algorithm as an additional feature.

We represent each feature as a numerical value. We use the features mentioned above in training a binary classifier that can predict whether a sentence should be included in the summary of a crisis event or not. We choose to train the classifier independently of the event type, i.e. we want to detect whether a sentence is an event update regardless of its association with a specific disaster.
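For concreteness, the sketch below shows one way the per-sentence feature vectors could be assembled before training. The scoring functions here are toy stand-ins for the Chapter 5 scores; all names are hypothetical.

```python
import numpy as np

def build_feature_matrix(query, sentences, scorers):
    """Stack one row of feature values per sentence, in a fixed scorer order,
    so the matrix can be fed to a binary classifier."""
    names = sorted(scorers)
    rows = [[scorers[name](query, sentence) for name in names]
            for sentence in sentences]
    return names, np.asarray(rows, dtype=float)

if __name__ == "__main__":
    # Toy stand-ins for the retrieval / event modeling / centrality scores.
    scorers = {
        "tf":       lambda q, s: float(sum(s.count(t) for t in q)),
        "coverage": lambda q, s: len(set(q) & set(s)) / len(set(q)),
        "length":   lambda q, s: float(len(s)),
    }
    query = ["hurricane", "sandy"]
    sentences = [["hurricane", "sandy", "makes", "landfall"],
                 ["the", "weather", "report"]]
    names, X = build_feature_matrix(query, sentences, scorers)
    print(names)
    print(X)
```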

6.2 Experimental Setup

In addressing the sentence salience detection problem, we concentrate on the combination of features that can be used for extractive sentence selection for the purpose of summarization, and not on developing or optimizing machine learning techniques. In particular, we want to know the effectiveness of retrieval, event modeling and centrality features when combined together. To answer this research question we conduct classification experiments.

We choose to train a Logistic Regression (LR) [41] classifier for sentence salience prediction. We motivate this choice based on its good performance in prediction and classification tasks, and ease of training [42, 55]. To this end, we assemble our training corpus from the set of TREC TS 2013 and TREC TS 2014 annotated sentence updates per event from the gold standard, and regard these sentence updates as positive training instances. However, before we can train the classifier there are several aspects we need to consider. Our corpus so far only contains positive examples, and we need to provide the classifier with negative examples too. For this reason, we build our corpus


of negative training instances by randomly sampling from each relevant document per event an equal number of sentence non-updates as the number of sentence updates we have from that document. We end up with a balanced training set, from which we extract features for the learning algorithm. In addition, because we are learning and predicting on the same dataset, in order to avoid overfitting, we train our classifier using the leave-one-out1 cross-validation method [91]. This is a special case of k-fold cross-validation where k equals the number of instances in the data. We split the data into folds according to the number of events we have in the dataset. Therefore, for the TREC TS 2013 collection we train a Logistic Regression classifier using 9-fold cross-validation (as we have 9 test events in the collection), while for the TREC TS 2014 collection we train a Logistic Regression classifier using 15-fold cross-validation (as we have 15 test events in the collection). We perform our experiments using the Scikit-learn framework [80]. In our experiments we are interested in determining whether combining features for sentence salience prediction can yield improvements over the case when we use them separately for scoring which sentences to include in a summary.
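The per-event folding described above can be expressed with scikit-learn's LeaveOneGroupOut splitter, grouping sentences by event. The sketch below is a simplified reconstruction on toy data, not our exact experimental code; the feature matrix, labels and event identifiers are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut

def per_event_cross_val_scores(X, y, events):
    """Train on all events but one, predict salience probabilities for the
    held-out event, and repeat, so that every sentence receives an
    out-of-fold score."""
    scores = np.zeros(len(y), dtype=float)
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=events):
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X[train_idx], y[train_idx])
        scores[test_idx] = clf.predict_proba(X[test_idx])[:, 1]
    return scores

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 5))           # toy feature matrix: 60 sentences, 5 features
    y = rng.integers(0, 2, size=60)        # toy update / non-update labels
    events = np.repeat(np.arange(6), 10)   # 6 events with 10 sentences each
    print(per_event_cross_val_scores(X, y, events)[:5])
```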

6.3 Results and Analysis

In this section we present results for our machine learning experiments on the TREC TS 2013 and TREC TS 2014 collections. In Table 6.1 and Table 6.2 we present results including the oracle assumption, ranking the predicted updates by time across documents, and by the classifier confidence scores within documents, respectively. For an in-depth analysis of the performance of the classifier, in Table 6.3 we also include classification scores for the entire collection without the oracle assumption. We compare these results with the results we present in Chapter 5, in Table 5.19 (summary of the methods when using the oracle baseline and ranking by time) and Table 5.20 (summary of the methods when using the oracle baseline and ranking by highest scores). For comprehensiveness, we also compare with Table A.27 in Appendix A (summary of methods when ranking by highest scores and dropping the oracle assumption).

We begin our analysis with a comparison of the classifier performance for the TREC TS 2013 collection. From Table 6.3 and Table A.27 we observe that overall, the average performance of the classifier (MAP: 0.2203) is comparable to the performance of the best approach for sentence scoring (cosine similarity, MAP: 0.2202), and only surpasses it by a very small margin. Even though results do not improve considerably, from this outcome we can infer that the classifier learns the best ranking function for the given collection from the interaction of the input features. In this case, it is clearly the cosine similarity feature which carries the highest weight among all features. Afterwards, we turn to a comparison of the performance of the classifier when using the oracle. From Table 6.2 vs. Table 5.20, we can see that when using the oracle threshold and ranking sentences by highest scores across documents a similar pattern occurs again. If in the current case the MAP value for our classifier is 0.1655, the corresponding MAP value for the oracle when

1http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.LeaveOneOut.html


ranking sentences by the highest query likelihood scores across documents is 0.1654. In this case the classifier also selects sentences according to the best ranking strategy. Moreover, the classifier learns that the cosine similarity metric is no longer the best ranking function, and adapts to use the more effective query likelihood model. We also assess the classification performance when using the oracle and ranking documents by time. From Table 6.1 we notice the low recall of the classifier, so that very few of the relevant updates are selected. Our MAP score of 0.0138 is much lower than the best MAP when using the oracle and ranking by time (the cosine similarity MAP of 0.1340 in Table 5.19). We believe a possible cause of this phenomenon might be that relevant updates are not correctly identified in the beginning of an event, but tend to become more prominent as the event evolves over time. Moreover, when we rank sentences inside documents we rely on the prediction probabilities of the classifier, which assigns rather high confidence scores to non-relevant updates.

Table 6.1: LR Oracle results ranked by time for TREC TS 2013 (Ids: 1-10) and TREC TS 2014 (Ids: 11-25) collections (MAP 2013: 0.0138; MAP 2014: 0.0034).

Id   Correct   P@10    P@20    P@100   P@R         AP      R@10    R@20    R@100
1    6/233     0.2000  0.1000  0.0600  0.0258@233  0.0080  0.0086  0.0086  0.0258
2    6/168     0.4000  0.2000  0.0400  0.0357@168  0.0217  0.0238  0.0238  0.0238
3    1/42      0.0000  0.0000  –       0.0238@42   0.0010  0.0000  0.0000  –
4    0/180     0.0000  0.0000  0.0000  0.0000@181  0.0000  0.0000  0.0000  0.0000
5    1/35      0.1000  0.0500  –       0.0286@35   0.0143  0.0286  0.0286  –
6    2/172     0.0000  0.0500  0.0200  0.0116@172  0.0007  0.0000  0.0058  0.0116
7    –         –       –       –       –           –       –       –       –
8    4/76      0.4000  0.2000  –       0.0526@76   0.0467  0.0526  0.0526  –
9    3/68      0.1000  0.1000  –       0.0441@68   0.0184  0.0147  0.0294  –
10   2/95      0.2000  0.1000  –       0.0208@96   0.0130  0.0208  0.0208  –
AVG 2013       0.1556  0.0889  0.0300  0.0270      0.0138  0.0166  0.0189  0.0153
11   0/392     0.0000  0.0000  0.0000  0.0000@392  0.0000  0.0000  0.0000  0.0000
12   4/184     0.4000  0.2000  0.0400  0.0217@184  0.0130  0.0217  0.0217  0.0217
13   2/313     0.1000  0.0500  0.0100  0.0064@313  0.0011  0.0032  0.0032  0.0032
14   0/401     0.0000  0.0000  0.0000  0.0000@401  0.0000  0.0000  0.0000  0.0000
15   5/315     0.2000  0.1500  0.0500  0.0159@315  0.0054  0.0063  0.0095  0.0159
16   6/554     0.0000  0.0500  0.0600  0.0108@554  0.0008  0.0000  0.0018  0.0108
17   12/648    0.5000  0.3500  0.1200  0.0185@648  0.0115  0.0077  0.0108  0.0185
18   6/409     0.4000  0.2500  0.0500  0.0147@409  0.0044  0.0098  0.0122  0.0122
19   2/341     0.1000  0.1000  0.0200  0.0059@341  0.0010  0.0029  0.0059  0.0059
20   0/289     0.0000  0.0000  0.0000  0.0000@289  0.0000  0.0000  0.0000  0.0000
21   –         –       –       –       –           –       –       –       –
22   0/220     0.0000  0.0000  0.0000  0.0000@220  0.0000  0.0000  0.0000  0.0000
23   7/274     0.1000  0.0500  0.0300  0.0109@274  0.0043  0.0036  0.0036  0.0109
24   11/430    0.1000  0.0500  0.0200  0.0256@430  0.0029  0.0023  0.0023  0.0047
25   2/376     0.1000  0.1000  0.0200  0.0053@376  0.0030  0.0027  0.0053  0.0053

AVG 2014 0.1429 0.0964 0.0300 0.0097 0.0034 0.0043 0.0055 0.0078

Next, we look at the performance of the classifier on the TREC TS 2014 collection for the case when we don’t use the oracle threshold, and rank results by the highest predicted scores. From Table 6.3 we can see it yields a mean average precision of 0.1460, which compared to the best performing method in Table A.27 (TF.ISF MAP: 0.1635) is a lower precision score. However, the classifier is doing much better compared to other top performers (TF MAP: 0.1379, BM25: 0.1339), which means that, despite being imperfect, it is still learning which features carry the most weight. When we compare the performance of the classifier on the oracle baseline method with results


Table 6.2: LR Oracle results ranked by highest score for TREC TS 2013 (Ids: 1-10) and TREC TS 2014 (Ids: 11-25) (MAP 2013: 0.1655; MAP 2014: 0.1016).

Id   Correct   P@10    P@20    P@100   P@R         AP      R@10    R@20    R@100
1    81/233    0.5000  0.5500  0.3800  0.3492@233  0.1493  0.0215  0.0472  0.1631
2    51/168    0.8000  0.6500  0.3400  0.3110@168  0.1631  0.0476  0.0774  0.2024
3    5/42      0.1000  0.1000  –       0.1250@40   0.0177  0.0238  0.0476  –
4    46/180    0.0000  0.2500  0.2600  0.2555@180  0.0670  0.0000  0.0276  0.1436
5    8/35      0.2000  0.1500  –       0.2286@35   0.0612  0.0571  0.0857  –
6    42/172    0.5000  0.4500  0.3400  0.2471@172  0.1040  0.0291  0.0523  0.1977
7    –         –       –       –       –           –       –       –       –
8    30/76     0.7000  0.6500  –       0.4000@76   0.2447  0.0921  0.1711  –
9    23/68     0.8000  0.5500  –       0.3382@68   0.2238  0.1176  0.1618  –
10   57/95     1.0000  0.8000  –       0.5938@96   0.4586  0.1042  0.1667  –
AVG 2013       0.5111  0.4611  0.3300  0.3869      0.1655  0.0548  0.0930  0.1767
11   128/392   0.0000  0.0000  0.1100  0.1811@392  0.0541  0.0000  0.0000  0.0281
12   84/184    0.5000  0.5000  0.4500  0.3478@184  0.2117  0.0272  0.0543  0.2446
13   129/313   0.4000  0.5000  0.4100  0.2652@313  0.1416  0.0128  0.0319  0.1310
14   49/401    0.0000  0.0000  0.0700  0.0798@401  0.0092  0.0000  0.0000  0.0175
15   123/315   0.5000  0.4500  0.2700  0.2286@315  0.1049  0.0159  0.0286  0.0857
16   224/554   0.5000  0.4500  0.2400  0.2563@554  0.1091  0.0090  0.0162  0.0433
17   165/648   0.7000  0.5500  0.4000  0.1744@648  0.0773  0.0108  0.0170  0.0617
18   146/409   0.5000  0.5500  0.3000  0.2469@409  0.1057  0.0122  0.0269  0.0733
19   135/341   0.2000  0.1500  0.2900  0.2933@341  0.1117  0.0059  0.0088  0.0850
20   110/289   0.0000  0.2500  0.3000  0.2561@289  0.1025  0.0000  0.0173  0.1038
21   –         –       –       –       –           –       –       –       –
22   93/220    0.2000  0.2000  0.2700  0.2955@220  0.1175  0.0091  0.0182  0.1227
23   79/274    0.3000  0.4000  0.3100  0.2044@274  0.0861  0.0109  0.0292  0.1131
24   174/430   0.3000  0.5500  0.3500  0.3047@430  0.1392  0.0070  0.0256  0.0814
25   90/376    0.2000  0.1500  0.2200  0.1809@376  0.0518  0.0053  0.0080  0.0585

AVG 2014 0.3071 0.3357 0.2850 0.2368 0.1016 0.0090 0.0201 0.0893

Table 6.3: LR results ranked by highest score for TREC TS 2013 (Ids: 1-10) and TREC TS 2014 (Ids: 11-25) (MAP 2013: 0.2203; MAP 2014: 0.1410).

Id   Correct   P@10    P@20    P@100   P@R         AP      R@10    R@20    R@100
1    183/233   0.5000  0.5000  0.3500  0.2918@233  0.2409  0.0215  0.0429  0.1502
2    133/168   0.8000  0.7000  0.3400  0.2738@168  0.2489  0.0476  0.0833  0.2024
3    16/42     0.1000  0.1000  0.0900  0.0952@42   0.0409  0.0238  0.0476  0.2143
4    130/180   0.0000  0.2500  0.2300  0.2320@181  0.1377  0.0000  0.0276  0.1271
5    26/35     0.1000  0.2000  0.1400  0.2000@35   0.1258  0.0286  0.1143  0.4000
6    98/172    0.5000  0.4500  0.3000  0.2442@172  0.1485  0.0291  0.0523  0.1744
7    –         –       –       –       –           –       –       –       –
8    49/76     0.7000  0.6500  0.2500  0.3026@76   0.2582  0.0921  0.1711  0.3289
9    52/68     0.8000  0.5500  0.2500  0.3088@68   0.3012  0.1176  0.1618  0.3676
10   81/95     1.0000  0.7500  0.4800  0.4896@96   0.4805  0.1042  0.1562  0.5000
AVG 2013       0.5000  0.4611  0.3050  0.2709      0.2203  0.0516  0.0952  0.1635
11   336/392   0.0000  0.0000  0.1000  0.1582@392  0.1082  0.0000  0.0000  0.0255
12   158/184   0.5000  0.5000  0.4600  0.3587@184  0.2952  0.0272  0.0543  0.2500
13   285/313   0.3000  0.4500  0.3400  0.2364@313  0.1796  0.0096  0.0288  0.1086
14   298/401   0.0000  0.0000  0.0700  0.0574@401  0.0270  0.0000  0.0000  0.0175
15   240/315   0.5000  0.4500  0.2500  0.1841@315  0.1397  0.0159  0.0286  0.0794
16   469/554   0.5000  0.4500  0.2300  0.2274@554  0.1762  0.0090  0.0162  0.0415
17   420/648   0.7000  0.5500  0.4000  0.1620@648  0.0987  0.0108  0.0170  0.0617
18   373/409   0.5000  0.5500  0.3000  0.2421@409  0.1675  0.0122  0.0269  0.0733
19   277/341   0.2000  0.1500  0.2700  0.2610@341  0.1747  0.0059  0.0088  0.0792
20   224/289   0.0000  0.1000  0.2700  0.2353@289  0.1232  0.0000  0.0069  0.0934
21   –         –       –       –       –           –       –       –       –
22   178/220   0.0000  0.1500  0.1800  0.2455@220  0.1611  0.0000  0.0136  0.0818
23   154/274   0.3000  0.3500  0.3000  0.1423@274  0.0949  0.0109  0.0255  0.1095
24   286/430   0.2000  0.5500  0.3500  0.2674@430  0.1736  0.0047  0.0256  0.0814
25   175/376   0.2000  0.1500  0.1900  0.1410@376  0.0549  0.0053  0.0080  0.0505

AVG 2014 0.2786 0.3143 0.2650 0.2085 0.1410 0.0080 0.0186 0.0824


ranked by the classifier scores in Table 6.2 (MAP 0.0938) with the best performing method in Table 5.20 (TF.ISF MAP 0.1072), we observe that precision scores are very close to each other. Also, the classifier is doing better than BM25, the second best performing method (MAP 0.0900) for this case. As in the case of the TREC TS 2013 collection, when ranking predictions by time with the oracle threshold the scores are much lower (an MAP of 0.0034 in Table 6.1 vs. an MAP of 0.0817 in Table 5.19). This confirms once again that in the initial documents it is harder to identify the most relevant sentence updates, and that information needs to accumulate before sentences become salient within the development of an event.

6.4 Conclusion

In conclusion, based on the results of our classifier on the TREC TS 2013 and TREC TS 2014 collections, we can infer that the classifier is learning how to predict the salience function, but is limited in the predictions it makes by the best performing feature. The combination of retrieval, event modeling and centrality features does not seem to help in achieving much better results than the best performing method in each particular case. In addition, we have also examined the performance of a Support Vector Machines classifier with linear and RBF kernels, and of a Random Forest classifier, and could see that scores did not improve compared to the ones we have presented so far. It might happen, though, that with another classification algorithm particularly optimized on the TREC collection and with an extended set of features, scores will improve considerably. Due to time constraints, we leave the exploration of other machine learning techniques for future work.

Out of the three categories of features we use, the information retrieval features seem to carry the most weight in the predictions made by the classifier. Furthermore, we observe that when we use the oracle baseline and rank sentences by time across documents and by highest scores within documents, the classifier appears to make frequent mistakes and presents low recall. When we use the oracle baseline and rank by highest scores across all documents, results given by the classifier become comparable to the best performing information retrieval approaches. When dropping the oracle constraint, we can still notice the good performance of our classifier. Therefore, we consider that machine learning techniques for summarization present a lot of potential, despite the main issues associated with them: the limited amount of training data available, and the relatively unknown importance of various sets of features.

6.5 Future Work

In future work we plan to examine machine learning methods for summarization based on an extended set of features. Starting from the work of Kedzie et al. [49], we would like to incorporate in our approach statistical sentence features, query expansion features, and geographical and temporal features. On the one hand, geographic relevance is important in the disaster domain, as disasters are phenomena that affect specific parts of the world. If a sentence references a location that is in close proximity to the area of the disaster, then it is likely that the sentence contains information relevant to


the event. However, one challenge is that in the given corpus not many sentence updates contain references to a location, nor do we know precisely beforehand where the event is happening. Still, one could use clustering methods and treat the cluster centers as the event locations for the current time, or compute median distances between locations in a document. On the other hand, temporal features can capture bursty terms suddenly peaking in usage to locate novel sentences. Hourly IDF values can be computed at regular time intervals to measure how TF.IDF scores have changed within the past 24 hours, and to infer whether a given sentence contains any prominent terms. A sketch of such a time-bucketed IDF computation is given below.
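The sketch below illustrates one simple way such hourly IDF values could be computed and used to flag bursty terms. It is an illustration under our own assumptions (hour buckets as dictionary keys, pre-tokenized documents, add-one smoothing), not an existing implementation.

```python
import math
from collections import Counter

def hourly_idf(docs_by_hour):
    """Compute an IDF value per term for every hourly bucket.
    `docs_by_hour` maps an hour key to the tokenized documents of that hour."""
    idf = {}
    for hour, docs in docs_by_hour.items():
        df = Counter()
        for doc in docs:
            df.update(set(doc))
        n = len(docs)
        idf[hour] = {t: math.log((n + 1) / (df[t] + 1)) for t in df}
    return idf

def burstiness(term, hour, previous_hour, idf, default=math.log(2)):
    """A positive value means the term suddenly occurs in many more
    documents than before, i.e. its IDF dropped between the two hours."""
    before = idf.get(previous_hour, {}).get(term, default)
    after = idf.get(hour, {}).get(term, default)
    return before - after

if __name__ == "__main__":
    stream = {
        "h0": [["calm", "weather"], ["local", "news"]],
        "h1": [["hurricane", "warning"], ["hurricane", "landfall"], ["hurricane"]],
    }
    idf = hourly_idf(stream)
    print(round(burstiness("hurricane", "h1", "h0", idf), 3))   # > 0: bursty
```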

In addition to that, in future work we also plan to examine deep learning techniques for query-oriented multi-document summarization. First steps for the extraction of concepts and summary generation have been laid in [115], on the DUC 2005, 2006 and 2007 benchmark datasets. In addition, Denil et al. [27] extract topic-relevant sentences using hierarchical convolutional document models inspired by the computer vision literature. Rush et al. [97] find extractive summarization inherently limited, and use a neural attention model for abstractive summarization. We are interested in the extent to which deep learning summarization techniques are useful on the TREC TS dataset. Given the promising results presented in the mentioned papers, we consider deep learning a technique worth further exploration in the future.


Chapter 7

Conclusion

In this work we have presented a systematic analysis of temporal summarization methods, and demonstrated the limitations and potential of previous approaches by examining the retrievability and centrality of event updates. In addition, we have also looked at the existence of inherent characteristics in update versus non-update sentences, and used specific methods and techniques to assess the actual performance of identifying relevant updates on the collection. We have also employed a supervised machine learning approach for the task of predicting sentence salience when it comes to the inclusion of a specific sentence in the temporal summary of an event.

After conducting this analysis, we found that retrievability, centrality and event modeling algorithms have a theoretical upper bound that does not allow for the identification of all relevant event updates, and which limits their performance to below 100% coverage (RQ1). We observe that this language gap is not due to a lexical mismatch between the query and the updates, but rather to topic drifting. Therefore, a dynamic algorithm that can adapt the lexical representation of a query, possibly by means of relevance feedback, could bridge this gap (RQ2).

Sentence centrality, when computed within a single document, does not appear to be a strong signal that can designate whether a sentence is an update or not. However, when computed across all documents, central sentences appear to be more prominent (RQ3). Based on this evidence, we believe that a careful inspection of the minimum number of documents it takes for a summarization algorithm before salient updates make it to the top of the ranking can lead to effective summarization scenarios based on centrality rankings. In addition to that, modeling event updates bears great promise towards devising temporal summarization algorithms. Events appear to develop around entities, and modeling event updates through discriminative terms and named entities looks like a promising step towards improving the performance of a temporal summarization system (RQ4). In our machine learning approach to summarization we have combined retrievability, centrality and event modeling features inside a supervised classification algorithm. Even though our results did not improve considerably, we could see that the learning algorithm still learnt the best ranking function for the given set of features (RQ5).

So far in our analysis we have concentrated on the overlap between the language of an event query and the language of an event update in terms of shared vocabulary, ignoring the temporal aspect. In future work, we would like to focus on a temporal analysis which determines whether the language gap between the query and the relevant updates becomes more evident as the event develops.


Another possible research direction we consider worth exploring is how the query terms can be expanded through pseudo-relevance feedback to track topic drift. Topic drift occurs when the underlying intent of the query has moved away from the underlying intent of relevant documents. The content of documents in the feedback set can heavily influence the intent of the feedback model, and how closely it matches the intent of the original query.
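As a rough illustration of this direction, the sketch below expands a query with the most frequent non-query terms from a set of top-ranked feedback sentences. This is a crude relevance-feedback-style expansion under our own assumptions, not a method we evaluated in this thesis; all names are hypothetical.

```python
from collections import Counter

def expand_query(query, feedback_sentences, k=5, stopwords=frozenset()):
    """Add the k most frequent non-query, non-stopword terms found in the
    top-ranked feedback sentences to the original query."""
    counts = Counter()
    for sentence in feedback_sentences:
        counts.update(t for t in sentence
                      if t not in query and t not in stopwords)
    expansion = [term for term, _ in counts.most_common(k)]
    return list(query) + expansion

if __name__ == "__main__":
    feedback = [
        ["hurricane", "sandy", "power", "outages", "reported"],
        ["power", "outages", "and", "flooding", "across", "the", "coast"],
    ]
    print(expand_query(["hurricane", "sandy"], feedback, k=3,
                       stopwords={"and", "the", "across"}))
```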

Discriminative terms can help in the modeling of event updates. In our experiments we could see that the extracted LLR terms from the general model appear on average in 95% of the event updates. However, when we use event-specific discriminative terms, the degree of overlap of the event type model with the relevant updates drops, partly because of the limited amount of data available for inferring these terms. Event modeling with entities, and possibly relations between entities, can help in summarization. As we believe that our current entity models are rather simplistic, we would like to focus on more complex models for scoring sentences at an entity level. Moreover, in future work we would also like to explore the possibility of scoring sentences based on mapping relations extracted by OpenIE tools to relations inside ontologies such as DBpedia and Freebase.

We also probe the actual performance of a number of information retrieval, sentence centrality and event modeling methods and techniques at retrieving sentence updates. We apply these methods at the sentence level, and we see that traditional information retrieval methods seem to perform best. In addition, we notice a strong correlation between the length of sentences in a collection and the retrieval effectiveness. More precisely, retrieval algorithms that do length normalization (such as QL and COS.SIM) yield better results on the TREC TS 2013 collection compared to methods which do not include length normalization (TF, BM25, TF.ISF). Conversely, on the TREC TS 2014 collection the best performers are represented by the latter category of methods. Trying to explain this discrepancy, we could see that the length of sentences inside the two collections differs: TREC TS 2013 includes longer sentences compared to TREC TS 2014, hence the improvement introduced by length normalization on TREC TS 2013.

We have also used the three categories of features in a supervised machine learning experiment. We have combined information retrieval, sentence centrality, and event modelling scores inside a learning algorithm for sentence salience prediction, performing leave-one-out cross-validation to also test the predictive power of the three categories of features. The performance of this procedure was compared to the performance of non-trainable baseline methods. We could see that the performance of our model is limited by the best performing feature, and that the combination of features does not outperform the best scoring feature. In future work we would like to explore more complex classification algorithms, as well as introduce new features for sentence salience prediction.

In our present analysis we disregard important aspects of temporal summarization such as the time dimension (when ranking sentences by highest scores across documents, or in centrality experiments), and novelty (and implicitly the non-redundancy and diversity of content). This allows us to decompose the complex problem of temporal summarization into simpler independent steps, and analyze the actual limitations of temporal summarization methods on the TREC TS 2013 and TREC TS 2014 collections. In future work we would like to focus on the interplay of these


components, including tracking topics by time, and scoring sentences for novelty.

To conclude, summary construction is a complex task which ideally involves deep natural language processing capabilities. In order to simplify the problem, most of the research carried out in the field of text summarization, and in the present thesis, focuses on extractive summarization of the original text. While this simplification is not guaranteed to ensure good narrative coherence, it is conveniently used as an approximate representation of the text content for relevance judgement. By far the most important advantage of presenting the end user with a synopsis of the development of the most important events is the reduced reading time. In addition, a summary can either be used to point towards specific parts of the original documents (indicative manner), or to cover all the relevant nuggets of information from the original text (informative manner). Further advantages lie in the fact that the size of the summary can be pre-determined, and correlations can be made between the text present in the summary and the position of that text inside the original documents.

Extractive summarization methods are typically faced with two key problems: how to select a subset of the sentences from the input documents, and how to rank these sentences according to their relevance to the topic or query. The selection process requires systems to improve diversity and remove redundancy so that more relevant information can be covered given the summary’s limited length. Present solutions for sentence ranking are varied; they are either based on surface features, graphs, or supervised learning. Our present work explores these methods on the TREC TS 2013 and TREC TS 2014 collections. Through our analysis we have tested the potential of these methods, and their behaviour in practical settings. Given that temporal summarization of news streams is a very complex task, we believe that our work has laid the foundations of what makes a temporal summarization system successful, and that in future work, more complex methods need to be developed in order to gradually track events over time.


Appendix A

Appendix A

Table A.1: TREC TS 2013 Training and Test Topics.

EventId

Event title Start time End time

Training1 2012 East Azerbaijan

earthquakes11 Aug 201212:23:17

21 Aug 201212:23:17

Description: http://en.wikipedia.org/wiki/2012_East_Azerbaijan_earthquakesQuery: iran earthquake, Type: earthquake

Testing1 2012 Buenos Aires Rail

Disaster22 Feb 201211:33:00

03 Mar 201211:33:00

Description: http://en.wikipedia.org/wiki/2012_Buenos_Aires_rail_disasterQuery: buenos aires train crash, Type: accident

2 2012 Pakistan garmentfactory fires

11 Sep 201213:00:00

21 Sep 201213:00:00

Description: http://en.wikipedia.org/wiki/2012_Pakistan_garment_factoryDescription: _firesQuery: pakistan factory fire, Type: accident

3 2012 Aurora shooting 20 Jul 201206:38:00

30 Jul 201206:38:00

Description: http://en.wikipedia.org/wiki/2012_Aurora_shootingQuery: colorado shooting, Type: shooting

4 Wisconsin Sikh templeshooting

05 Aug 201215:25:00

15 Aug 201215:25:00

Description: http://en.wikipedia.org/wiki/Wisconsin_Sikh_temple_shootingQuery: sikh temple shooting, Type: shooting

5 Hurricane Isaac (2012) 28 Aug 201216:20:00

07 Sep 201216:20:00

Description: http://en.wikipedia.org/wiki/Hurricane_Isaac_(2012)Query: hurricane isaac, Type: storm

6 Hurricane Sandy 24 Oct 201215:00:00

03 Nov 201215:00:00

Description: http://en.wikipedia.org/wiki/Hurricane_SandyQuery: hurricane sandy, Type: storm

7 June 2012 North Ameri-can derecho

29 Jun 201215:00:00

09 Jul 201215:00:00

Description: http://en.wikipedia.org/wiki/June_2012_North_American_derechoQuery: midwest derecho, Type: storm

8 Typhoon Bopha 30 Nov 201214:45:00

10 Dec 201214:45:00

Description: http://en.wikipedia.org/wiki/Typhoon_BophaQuery: typhoon bopha, Type: storm

9 2012 Guatemala earth-quake

07 Nov 201216:35:47

17 Nov 201216:35:47

Description: http://en.wikipedia.org/wiki/2012_Guatemala_earthquakeQuery: guatemala earthquake, Type: earthquake

10 2012 Tel Aviv bus bomb-ing

21 Nov 201210:00:00

01 Dec 201210:00:00

Description: http://en.wikipedia.org/wiki/2012_Tel_Aviv_bus_bombingQuery: tel aviv bus bombing, Type: bombing


Table A.2: TREC TS 2014 Test Topics.

EventId

Event title Start time End time

Testing11 Costa Concordia disaster

and recovery13 Jan 201221:45:00

01 Feb 201200:00:00

Description: http://en.wikipedia.org/wiki/Costa_Concordia_disasterQuery: costa concordia, Type: accident

12 Early 2012 European coldwave

22 Jan 201200:00:00

18 Feb 201200:00:00

Description: http://en.wikipedia.org/wiki/Early_2012_European_cold_waveQuery: european cold wave, Type: storm

13 2013 Eastern Australiafloods

17 Jan 201300:00:00

30 Jan 201300:00:00

Description: http://en.wikipedia.org/wiki/2012_Aurora_shootingQuery: queensland floods, Type: storm

14 Boston Marathon bomb-ings

15 Apr 201318:49:00

20 Apr 201323:59:59

Description: http://en.wikipedia.org/wiki/Boston_Marathon_bombingsQuery: boston marathon bombing, Type: bombing

15 Port Said Stadium riot 01 Feb 201213:30:00

11 Feb 201213:30:00

Description: http://en.wikipedia.org/wiki/Port_Said_Stadium_riotQuery: egyptian riots, Type: riot

16 2012 Afghanistan Quranburning protests

20 Feb 201217:30:00

28 Feb 201200:00:00

Description: http://en.wikipedia.org/wiki/2012_Afghanistan_Quran_burning_protestsQuery: quran burning protests, Type: protest

17 In Amenas hostage crisis 16 Jan 201300:00:00

20 Jan 201300:00:00

Description: http://en.wikipedia.org/wiki/In_Amenas_hostage_crisisQuery: in amenas hostage crisis, Type: hostage

18 2011-13 Russian protests 04 Dec 201100:00:00

25 Dec 201100:00:00

Description: http://en.wikipedia.org/wiki/2011%E2%80%9313_Russian_protestsQuery: russian protests, Type: protest

19 2012 Romanian protests 12 Jan 201200:00:00

26 Jan 201200:00:00

Description: http://en.wikipedia.org/wiki/2012_Romanian_protestsQuery: romanian protests, Type: protest

20 2012-13 Egyptian protests 18 Nov 201200:00:00

01 Dec 201200:00:00

Description: http://en.wikipedia.org/wiki/2012%E2%80%9313_Egyptian_protestsQuery: egyptian protests, Type: protest

21 Chelyabinsk meteor 15 Feb 201303:20:00

25 Feb 201303:20:00

Description: http://en.wikipedia.org/wiki/Chelyabinsk_meteorQuery: russia meteor, Type: impact event

22 2013 Bulgarian protestsagainst the Borisov cabi-net

10 Feb 201300:00:00

20 Feb 201323:59:59

Description: http://en.wikipedia.org/wiki/2013_Bulgarian_protests_against_the_Borisov_cabinetQuery: bulgarian protests, Type: protest

23 2013 Shahbag protests 05 Feb 201300:00:00

22 Feb 201323:59:59

Description: http://en.wikipedia.org/wiki/2013_Shahbag_protestsQuery: shahbag protests, Type: protest

24 February 2013 nor’easter 07 Feb 201300:00:00

18 Feb 201323:59:59

Description: http://en.wikipedia.org/wiki/February_2013_nor%27easterQuery: nor’easter, Type: storm

25 Christopher Dornershootings and manhunt

03 Feb 201300:00:00

13 Feb 201307:59:59

Description: http://en.wikipedia.org/wiki/Christopher_Dorner_shootings_and_manhuntQuery: Southern California shooting, Type: shooting


Table A.3: TREC TS 2015 Test Topics.

EventId

Event title Start time End time

Testing26 vauxhall helicopter crash 16 Jan 2013

07:59:0031 Jan 201307:59:00

Type: accident27 cyclone nilam 27 Oct 2012

00:00:0002 Nov 201200:00:00

Type: storm28 savar building collapse 24 Apr 2013

02:45:0004 May 201302:45:00

Type: accident29 hyderabad explosion 21 Feb 2013

13:58:0003 Mar 201313:58:00

Type: bombing30 brazzaville explosion 04 Mar 2012

07:00:0014 Mar 201207:00:00

Type: accident31 india power blackouts 29 Jul 2012

21:18:0003 Aug 201221:18:00

Type: accident32 innocence of muslims

protests11 Sep 201200:00:00

30 Sep 201200:00:00

Type: protest33 konna battle 10 Jan 2013

00:00:0019 Jan 201300:00:00

Type: conflict34 quetta bombing 16 Feb 2013

00:00:0020 Feb 201300:00:00

Type: bombing35 iraq bombing 15 Apr 2013

00:00:0020 Apr 201300:00:00

Type: bombing36 iraq bombing 19 Mar 2013

00:00:0024 Mar 201300:00:00

Type: bombing37 los angeles arson 29 Dec 2011

09:00:0005 Jan 201209:00:00

Type: bombing38 thane building collapsed 04 Apr 2013

00:00:0013 Apr 201300:00:00

Type: accident39 suicide bomber ankara 01 Feb 2013

00:00:0005 Feb 201300:00:00

Type: bombing40 baghdad bomb 21 Dec 2011

21:00:0026 Dec 201121:00:00

Type: bombing41 aleppo university explo-

sion15 Jan 201300:00:00

25 Jan 201300:00:00

Type: bombing42 carnival triumph fire 10 Feb 2013

00:00:0015 Feb 201300:00:00

Type: accident43 uss guardian grounding 17 Jan 2013

00:00:0022 Jan 201300:00:00

Type: accident44 aceh earthquake 11 Apr 2012

00:00:0016 Apr 201200:00:00

Type: earthquake45 haida gwaii earthquake 28 Oct 2012

03:00:0007 Nov 201203:00:00

Type: earthquake46 catalan protest 11 Sep 2012

00:00:0016 Sep 201200:00:00

Type: protest


Table A.4: Query terms ranked by their corresponding log-likelihood ratio value. If a term is not present in the set of extracted LLR terms, then the term is assigned a rank of -1. Word2Vec was trained on external data.

Event Query Query QueryId WordNet Word2Vec1 (crash, 44),

(train, 9),(bueno, 4),(air, 21)

(barge in, -1), (vent, -1),(coach, 1442), (bueno, 4),(string, 8302), (air out, -1),(caravan, -1), (prepar, 4941),(disciplin, -1), (clang, -1),(air, 21), (aim, 1825), (trail,1126), (train, 9), (doss, -1),(crash, 44), (educ, 490),(public, 894), (gear, 8045)

(u’bueno’, 4), (rollover crash,-1), (nuestra, -1), (sopranomarina poplavskaya, -1),(salir, -1), (estaba, -1),(comme un, -1), (fill, 5215),(wreck, 4035), (collis, 1898),(rollover accid, -1), (fatal,6579), (crash, 44), (rail, 166),(collison, -1), (gar, -1), (bu,47), (commuter train, -1),(train, 9), (locomot, 7701),(railway, 206), (accid, 250),(colliss, -1), (estamo, -1), (cesoir, -1), (ahora, -1), (air, 21),(oiseau, -1), (freight train,-1), (toujour, -1)

2 (fire, 12),(pakistan,7), (factori,1)

(pakistan, 7), (open fir, -1),(displac, 3329), (fire, 12),(ardor, -1), (burn, 4634),(arous, -1), (fuel, 6888),(factori, 1)

(pakistan, 7), (manmohan,-1), (india, 5100), (grassfir,-1), (brach candi, -1), (mill,5510), (saudi arabia, -1),(bangladesh, 8119), (srilanka,-1), (manufactur, 7530),(alarm blaz, -1), (blaze, 55),(inferno, 1865), (pakistani,101), (factori, 1),(manufactori, -1), (plant,2822), (iran, 8255), (fire, 12),(textile mil, -1), (carelesslydiscarded cigarett, -1),(taliban, -1), (toyota bodinealuminum, -1), (kashmir,5620), (garment factori, -1),(flame, 4627), (sri lanka, -1),(brush fir, -1), (firefight,7495)

3 (shoot, 73),(colorado,722)

(shoot, 73), (photograph,6588), (colorado, 722), (tear,2248), (dart, -1), (inject, -1),(blast, 243), (film, 7050),(fritter, -1)

(shoot, 73), (arkansa, -1),(shot, 2775), (colorado, 722),(texa, 2921), (oklahoma,4053), (shooter, 878), (utah,3555), (minnesota, 3180),(denver, 6546), (fatal shoot,-1), (aliodor recal, -1), (jihadiattack, -1), (nevada, 2717),(michigan, 865), (tennesse,-1), (delawar, 2744)


Table A.5: Query terms ranked by their corresponding log-likelihood ratio value. If a term is not present in the set of extracted LLR terms, then the term is assigned a rank of -1. Word2Vec was trained on external data (continuation of Table A.4).

Event Query Query QueryId WordNet Word2Vec4 (shoot, 73),

(sikh, 39),(templ, 31)

(shoot, 73), (photograph,6588), (tear, 2248), (fritter,-1), (synagogu, -1), (dart, -1),(templ, 31), (blast, 243),(sikh, 39), (film, 7050),(inject, -1)

(shot, 2775), (stupa, -1), (kalitempl, -1), (maryada, -1),(buddhist templ, -1), (fatalshoot, -1), (aliodor recal, -1),(shiva templ, -1), (sikh, 39),(khalsa panth, -1), (hindu,8124), (gurudwara, -1),(nihang, -1), (siva templ, -1),(kashmiri, -1), (jihadi attack,-1), (hindu templ, -1), (templ,31), (dharam, -1), (jagannathtempl, -1), (vanvasi, -1),(gopuram, -1), (shoot, 73),(shooter, 878), (gandhi, -1)

5 (hurrican,11), (isaac,3)

(isaac, 3), (hurrican, 11) (tropical storm, -1),(hurricane charley, -1),(hurricane ik, -1), (nigel, -1),(gareth, -1), (gallagh, -1),(hurrican, 11), (hoffman, -1),(connor, -1), (hurricane ivan,-1), (bernard, -1), (hurricanerita, -1), (hurricane gustav,-1), (walsh, 5854), (storm,41), (adrian, -1), (isaac, 3),(hurricane wilma, -1), (davi,-1), (meyer, -1)

6 (hurrican,11), (sandi,82)

(hurrican, 11), (flaxen, -1),(arenac, -1), (sandi, 82)

(hurricane charley, -1),(scrubby veget, -1), (tropicalstorm, -1), (hurrican, 11),(pebbl, -1), (hurricane ik, -1),(hurricane ivan, -1), (sandi,82), (sand, 5702), (hurricanegustav, -1), (pebble strewn,-1), (grassi, -1), (storm, 41),(hurricane rita, -1), (riverturag, -1), (powdery sand,-1), (muddi, -1), (hurricanewilma, -1), (sandy soil, -1),(sand dun, -1)


Table A.6: Query terms ranked by their corresponding log-likelihood ratio value. If a term is not present in the set of extracted LLR terms, then the term is assigned a rank of -1. Word2Vec was trained on external data (continuation of Table A.4).

Event Query Query QueryId WordNet Word2Vec7 (derecho,

82),(midwest,82)

(derecho, -1), (midwest, -1) (snowstorm blanket, -1),(derecho, -1), (objeto, -1),(dakota, -1), (movimiento,-1), (cornbelt, -1),(midwestern, -1), (east coast,-1), (supercell thunderstorm,-1), (mid atlant, -1),(supercel, -1), (southeastmissouri, -1), (downburst, -1),(actualment, -1), (gustnado,-1), (upper midwest, -1),(midwest, -1), (cada uno, -1)

8 (typhoon,25),(bopha, 48)

(typhoon, 25), (bopha, 48) (typhoon megi, -1), (typhoonchanchu, -1), (fengshen, -1),(typhoon parma, -1),(typhoon xangsan, -1), (supertyphoon, -1), (typhoon, 25),(hagupit, -1), (typhoondurian, -1), (reme, -1),(bopha, 48)

9 (guatemala,2),(earthquak,24)

(guatemala, 2), (earthquak,24)

(magnitude earthquak, -1),(magnitude earthquak, -1),(temblor, 5451), (aftershock,1606), (earthquak, 24),(magnitude quak, -1),(guatemala, 2), (devastatingearthquak, -1), (quak, 16)

10 (aviv, 28),(bu, 47),(tel, 29),(bomb, 50)

(bombard, 8405), (tel, 29),(busbar, -1), (bus topolog,-1), (aviv, 28), (bu, 47), (fail,7932), (bomb, 50)

(taxi, -1), (tel, 29), (schoolbu,-1), (bomb, 50), (by thapelosakoana, -1), (bomber, 1997),(trolley, -1), (van, -1), (buss,-1), (aviv, 28), (minibu,1811), (greyhound bu, -1),(bu, 47), (suicide bomb, -1),(by lavinia mahlangu, -1),(bomb blast, -1), (buse, -1),(tel, -1)


Table A.7: Query terms ranked by their corresponding log-likelihood ratio value. If a term is not present in the set of extracted LLR terms, then the term is assigned a rank of -1. Word2Vec was trained on external data.

Event Query Term Any Wordnet Synonym Any Similar Word2Vec TermId Rank Rank Rank11 (costa,

183),(concordia,157)

(costa, 183), (concordia, 157),(rib, -1)

(duro, -1), (vs bethel univ,-1), (azul, -1), (concordia,157), (mens basketbal, -1),(sports inform, -1), (footballbetting preview, -1), (luz, -1),(diga, -1), (puerto, -1),(florenc, -1), (lutheran, 7016),(asu, -1), (game thread, -1),(minn. crookston, -1), (isla,-1), (kif, -1), (gustavusadolphus v, -1), (costa, 183),(ventura, -1), (diz, -1),(colleg, -1)

12 (cold, 23),(wave,218),(european,1471)

(curl, -1), (brandish, -1),(beckon, -1), (cold, 23),(wave, 218), (roll, 7044),(european, 1471)

(england, 43), (frosti, 2800),(franc, 1830), (german, 2774),(colder, 7379), (tide, 7784),(cold, 23), (portug, 7541),(frigid weath, -1), (europ,129), (chill, 1099), (european,1471), (surg, 6917), (germani,7966), (rising tid, -1), (warm,7803), (frigid, 2145), (crestingwav, -1), (wave, 218), (spain,5017), (tidal wav, -1), (chilli,-1), (upswel, -1), (american,1602), (multi partyism swept,-1), (dwindling ordain, -1),(bitterly cold, -1), (african,2161), (greec, 7584), (tonightclear, -1)

13 (queensland,16), (flood,25)

(queensland, 16), (flood, 25),(delug, 536), (flood tid, -1)

(adelaid, 6378), (monsoonflood, -1), (devastating flood,-1), (mitch, -1), (inund, 544),(flash flood, -1), (brisban,291), (flood, 25), (perth,3177), (torrential rain, -1),(sydney, 6787), (queensland,16), (qld, 3439), (melbourn,1884), (australian, 6614),(kevin, 6619), (nsw, 739)


Table A.8: Query terms ranked by their corresponding log-likelihood ratio value. If a term is not present in the set of extracted LLR terms, then the term is assigned a rank of -1. Word2Vec was trained on external data (continuation of Table A.7).

Event Query Term Any Wordnet Synonym Any Similar Word2Vec TermId Rank Rank Rank14 (boston,

459),(bomb,614),(marathon,690)

(bombard, 6097), (boston,459), (fail, 7549), (bomb,614), (marathon, 690)

(toronto, 6715), (pikespeak asc, -1),(triathlon, -1),(bomber, 5036), (bomb,614), (gruel, -1),(chicago, 1642),(boston, 459), (suicidebomb, -1),(ultramarathon, -1),(denver, 1863), (bombblast, -1), (baltimor,2831), (cleveland, -1),(minneapoli, 6028),(nyc, 717), (halfmarathon, -1),(oakland, 4936),(detroit, -1),(marathon, 690),(seattl, 971)

15 (riot, 20),(egyptian,35)

(belly laugh, -1), (orgi, -1),(carous, -1), (riot, 20),(egyptian, 35)

(saudi arabia, -1), (mubarak,3765), (riot, 20), (egyptian,35), (egypt, 8), (violent clash,-1), (moroccan, -1), (arab,976), (unrest, 4147), (protest,1), (ali, 5856), (syria, 462),(moham, 125), (riots erupt,-1), (rioter, -1), (ethiopia,5663)

16 (quran, 15),(protest, 1),(burn, 13)

(cauter, -1), (cut, 1571),(quran, 15), (bite, 5675),(burn, 13), (combust, -1),(electrocut, -1), (koran, 38),(protest, 1), (burn off, -1),(sunburn, -1)

(quoran, -1), (torch, 7098),(protest, 1), (holy koran, -1),(noble quran, -1), (riot, 20),(antigovernment demonstr,-1), (fire, 978), (quraan, -1),(bure, -1), (qúran, -1),(quran, 15), (smolder, -1),(burn, 13), (qurán, 272),(burnt, 2547), (demonstr,132), (holy qurán, -1), (surah,-1), (protest march, -1),(fiercest blaz, -1), (koran, 38),(antigovernment protest, -1),(quŕan, -1)

17 (hostag, 2),(amena,75), (crisi,11)

(hostag, 2), (amena, 75),(crisi, 11)

(kidnapp, 426), (turmoil,6734), (abduct, 5900),(crisis.th, -1), (crise, -1),(hostag, 2), (liquidity crunch,-1), (amena, 75), (hostagetak, -1), (hostage standoff,-1), (credit crunch, -1),(economic downturn, -1),(captiv, 1801), (recess, 6359),(downturn, -1), (meltdown,-1), (kidnap, 6427), (crisi,11), (subprime mortgagecrisi, -1)


Table A.9: Query terms ranked by their corresponding log-likelihood ratio value. If a term is not present in the set of extracted LLR terms, then the term is assigned a rank of -1. Word2Vec was trained on external data (continuation of Table A.7).

Event Query Term Any Wordnet Synonym Any Similar Word2Vec TermId Rank Rank Rank18 (russian,

62),(protest, 1)

(russian, 62), (protest, 1) (swedish, -1),(antigovernment demonstr,-1), (protest march, -1), (riot,20), (canadian, 5499),(antigovernment protest, -1),(poland, 2486), (american,1602), (british, 860), (russia,10), (ukrain, 219), (protest,1), (russian, 62), (georgian,2338), (moscow, 356),(demonstr, 132)

19 (romanian,99),(protest, 1)

(romanian, 99), (protest, 1) (swedish, -1), (uruguay, -1),(protest march, -1), (riot,20), (xslt, -1), (german,2774), (not transl, -1),(purchase autocad, -1),(serbian, 5878), (serbia,6933), (antigovernmentprotest, -1), (protest, 1),(xfffd, -1), (romanian, 99),(denmark, -1),(antigovernment demonstr,-1), (demonstr, 132)

20 (protest, 1),(egyptian,35)

(protest, 1), (egyptian, 35) (saudi arabia, -1), (protestmarch, -1), (riot, 20),(egyptian, 35), (egypt, 8),(antigovernment demonstr,-1), (ali, 5856), (protest, 1),(moroccan, -1), (arab, 976),(ethiopia, 5663),(antigovernment protest, -1),(mubarak, 3765), (syria, 462),(moham, 125), (demonstr,132)

21 (russia, 10),(meteor, 7)

(soviet russia, -1),(meteoroid, -1), (russia, 10),(meteor, 7), (soviet union, -1)

(north korea, -1), (korea,506), (germani, 7966), (bolid,5749), (iranian, 8331),(poland, 2486), (serbia,6933), (meteor show, -1),(halley comet, -1), (ukrain,219), (meteoroid, -1),(asteroid, 109), (meteor, 7),(meteorit, 50), (russian, 62),(iran, 1036), (comet, 4168),(russia, 10)


Table A.10: Query terms ranked by their corresponding log-likelihood ratio value. If a term is not present in the set of extracted LLR terms, then the term is assigned a rank of -1. Word2Vec was trained on external data (continuation of Table A.7).

Event Id   Query Term (Rank)   Any Wordnet Synonym (Rank)   Any Similar Word2Vec Term (Rank)

22 (bulgarian, 96), (protest, 1)

(bulgarian, 96), (protest, 1) (bulgarian, 96), (protestmarch, -1), (riot, 20),(antigovernment demonstr,-1), (protest, 1),(antigovernment protest, -1),(demonstr, 132)

23 (protest, 1),(shahbag,235)

(protest, 1), (shahbag, 235) (protest march, -1), (riot,20), (antigovernmentdemonstr, -1), (protest, 1),(antigovernment protest, -1),(shahbag, 235), (demonstr,132)

24 (nor'east,14)

(nor'east, 14) (northeast, 32), (tropicalstorm isabel, -1), (nor’east,14), (hurricane isabel, -1),(storm, 0), (hurricane earl,-1), (snowstorm, 42)

25 (shoot, 51),(california,54),(southern,46)

(shoot, 51), (photograph,834), (tear, 2133), (southern,46), (dart, -1), (southerli, -1),(california, 54), (inject,8213), (blast, 275), (film,521), (fritter, -1)

(shot, 5554), (northeastern,2160), (fatal shoot, -1),(aliodor recal, -1),(southwestern, 7108),(nevada, 4518), (northeast,32), (northern, 1637),(oklahoma, 5024),(sacramento, 7240),(southwest, 1886), (arizona,2898), (jihadi attack, -1),(eastern, 138), (southern, 46),(utah, 7079), (oregon, 7404),(southeast, 1768), (california,54), (western, 3417), (shoot,51), (northwestern, -1),(shooter, 8167), (florida,1640), (south carolina, -1),(ohio, -1), (alabama, 6440),(southeastern, 1138)
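The ranks shown in Tables A.7 through A.10 are positions in a list of terms ordered by decreasing log-likelihood ratio (LLR), with rank 0 for the strongest term and rank -1 for terms that never enter the extracted LLR term set. As a minimal sketch of how such a ranking can be produced, the Python snippet below scores terms with Dunning's log-likelihood ratio over foreground (event-related) and background token counts; the counts, function names and exact preprocessing here are illustrative assumptions, not the precise setup used to build these tables.

import math

def llr(a, b, fg_total, bg_total):
    # Dunning's log-likelihood ratio (G^2) for a 2x2 contingency table:
    # rows = {foreground text, background corpus}, columns = {term, other tokens}.
    n = fg_total + bg_total
    observed = [a, fg_total - a, b, bg_total - b]
    expected = [fg_total * (a + b) / n, fg_total * (n - a - b) / n,
                bg_total * (a + b) / n, bg_total * (n - a - b) / n]
    return 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)

def rank_terms(fg_counts, bg_counts):
    # Rank foreground terms by decreasing LLR; rank 0 is the strongest term.
    fg_total = sum(fg_counts.values())
    bg_total = sum(bg_counts.values())
    scores = {t: llr(c, bg_counts.get(t, 0), fg_total, bg_total)
              for t, c in fg_counts.items()}
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {t: r for r, t in enumerate(ordered)}

ranks = rank_terms({"protest": 120, "riot": 40, "the": 900},
                   {"protest": 10, "riot": 5, "the": 100000})
print(ranks.get("protest", -1))  # low rank: strongly event-specific term
print(ranks.get("quran", -1))    # -1: the term is absent from the extracted set

A query term is then reported with the rank found by such a lookup, which is why terms missing from the extracted set appear with rank -1 in the tables above.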


Table A.11: TF results ranked by time for TREC TS 2013 (Ids: 1-10) and TREC TS 2014 (Ids: 11-25) collections (MAP 2013: 0.0174; MAP 2014: 0.0109).

Id  Correct  P@10  P@20  P@100  P@R  P@1000  AP  R@10  R@20  R@100  R@1000

1 233/233 0.1000 0.0500 0.0100 0.0086@233 0.0090 0.0117 0.0043 0.0043 0.0043 0.03862 168/168 0.3000 0.1500 0.0700 0.0417@168 0.0150 0.0226 0.0179 0.0179 0.0417 0.08933 42/42 0.1000 0.0500 0.0100 0.0238@42 0.0040 0.0111 0.0238 0.0238 0.0238 0.09524 180/180 0.1000 0.0500 0.0100 0.0221@181 0.0130 0.0138 0.0055 0.0055 0.0055 0.07185 35/35 0.1000 0.0500 0.0100 0.0286@35 0.0070 0.0163 0.0286 0.0286 0.0286 0.20006 172/172 0.0000 0.0000 0.0100 0.0058@172 0.0060 0.0061 0.0000 0.0000 0.0058 0.03497 – – – – – – – – – – –8 76/76 0.2000 0.2500 0.0600 0.0789@76 0.0130 0.0359 0.0263 0.0658 0.0789 0.17119 68/68 0.1000 0.0500 0.0200 0.0147@68 0.0100 0.0236 0.0147 0.0147 0.0294 0.147110 95/95 0.1000 0.0500 0.0100 0.0104@96 0.0070 0.0153 0.0104 0.0104 0.0104 0.0729

AVG 2013 0.1222 0.0778 0.0250 0.0261 0.0093 0.0174 0.0146 0.0190 0.0143 0.102311 392/392 0.0000 0.0000 0.0100 0.0230@392 0.0290 0.0114 0.0000 0.0000 0.0026 0.074012 184/184 0.0000 0.0000 0.0100 0.0054@184 0.0160 0.0128 0.0000 0.0000 0.0054 0.087013 313/313 0.0000 0.0000 0.0100 0.0064@313 0.0030 0.0068 0.0000 0.0000 0.0032 0.009614 401/401 0.0000 0.0000 0.0100 0.0050@401 0.0060 0.0044 0.0000 0.0000 0.0025 0.015015 315/315 0.0000 0.0000 0.0100 0.0127@315 0.0050 0.0096 0.0000 0.0000 0.0032 0.015916 554/554 0.0000 0.0000 0.0200 0.0090@554 0.0100 0.0124 0.0000 0.0000 0.0036 0.018117 648/648 0.0000 0.0500 0.0100 0.0525@648 0.0350 0.0129 0.0000 0.0015 0.0015 0.054018 409/409 0.0000 0.0000 0.0200 0.0269@409 0.0200 0.0122 0.0000 0.0000 0.0049 0.048919 341/341 0.0000 0.0000 0.0100 0.0323@341 0.0310 0.0221 0.0000 0.0000 0.0029 0.090920 289/289 0.0000 0.0000 0.0000 0.0069@289 0.0040 0.0058 0.0000 0.0000 0.0000 0.013821 798/798 0.0000 0.0000 0.0100 0.0213@798 0.0190 0.0089 0.0000 0.0000 0.0013 0.023822 220/220 0.0000 0.0000 0.0200 0.0091@220 0.0070 0.0086 0.0000 0.0000 0.0091 0.031823 274/274 0.0000 0.1000 0.0600 0.0328@274 0.0150 0.0146 0.0000 0.0073 0.0219 0.054724 430/430 0.1000 0.1000 0.0400 0.0209@430 0.0110 0.0123 0.0023 0.0047 0.0093 0.025625 376/376 0.0000 0.0000 0.0100 0.0053@376 0.0070 0.0086 0.0000 0.0000 0.0027 0.0186

AVG 2014 0.0067 0.0167 0.0167 0.0180 0.0145 0.0109 0.0002 0.0009 0.0049 0.0388
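The precision (P@k, P@R), recall (R@k) and AP values reported in Tables A.11 through A.27 follow the standard ranked-retrieval definitions, and the MAP value quoted in each caption is the mean of AP over the events of a collection. The short Python sketch below is a minimal illustration of these computations for a ranked list of sentence identifiers with binary relevance judgements; the function names and toy data are our own and do not correspond to the official TREC TS evaluation scripts.

def precision_at_k(ranking, relevant, k):
    # Fraction of the top-k returned sentence ids that are relevant.
    return sum(1 for s in ranking[:k] if s in relevant) / k

def recall_at_k(ranking, relevant, k):
    # Fraction of all relevant sentence ids that appear in the top k.
    return sum(1 for s in ranking[:k] if s in relevant) / len(relevant)

def average_precision(ranking, relevant):
    # Mean of P@k over the ranks k at which a relevant sentence is returned.
    hits, total = 0, 0.0
    for k, s in enumerate(ranking, start=1):
        if s in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0

# Toy example: five ranked candidate updates, three relevant updates in total.
ranking = ["s3", "s7", "s1", "s9", "s4"]
relevant = {"s7", "s4", "s2"}
print(precision_at_k(ranking, relevant, 5))   # 0.4
print(recall_at_k(ranking, relevant, 5))      # 0.666...
print(average_precision(ranking, relevant))   # (1/2 + 2/5) / 3 = 0.3

P@R in the tables corresponds to precision_at_k with k set to the number of relevant updates for the event, which is the value shown after the @ sign.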

Table A.12: TF results ranked by highest score for TREC TS 2013 (Ids: 1-10) and TREC TS 2014 (Ids: 11-25) collections (MAP 2013: 0.0113; MAP 2014: 0.1339).

Id  Correct  P@10  P@20  P@100  P@R  P@1000  AP  R@10  R@20  R@100  R@1000

1 233/233 0.0000 0.0500 0.0200 0.0215@233 0.0270 0.0163 0.0000 0.0043 0.0086 0.11592 168/168 0.0000 0.0000 0.0300 0.0179@168 0.0230 0.0141 0.0000 0.0000 0.0179 0.13693 42/42 0.0000 0.0000 0.0100 0.0000@42 0.0070 0.0060 0.0000 0.0000 0.0238 0.16674 180/180 0.0000 0.0000 0.0100 0.0221@181 0.0150 0.0088 0.0000 0.0000 0.0055 0.08295 35/35 0.0000 0.0000 0.0000 0.0000@35 0.0090 0.0083 0.0000 0.0000 0.0000 0.25716 172/172 0.0000 0.0000 0.0100 0.0116@172 0.0120 0.0072 0.0000 0.0000 0.0058 0.06987 – – – – – – – – – – –8 76/76 0.0000 0.0000 0.0100 0.0132@76 0.0180 0.0150 0.0000 0.0000 0.0132 0.23689 68/68 0.0000 0.0000 0.0100 0.0147@68 0.0160 0.0114 0.0000 0.0000 0.0147 0.235310 95/95 0.0000 0.0000 0.0500 0.0521@96 0.0170 0.0145 0.0000 0.0000 0.0521 0.1771

AVG 2013 0.0000 0.0056 0.0175 0.0170 0.0160 0.0113 0.0000 0.0005 0.0094 0.164311 392/392 0.5000 0.4500 0.2100 0.1403@392 0.1150 0.1152 0.0128 0.0230 0.0536 0.293412 184/184 0.6000 0.4500 0.3700 0.3696@184 0.1580 0.2686 0.0326 0.0489 0.2011 0.858713 313/313 0.2000 0.3500 0.3100 0.1725@313 0.1370 0.1243 0.0064 0.0224 0.0990 0.437714 401/401 0.0000 0.0000 0.1600 0.1671@401 0.0690 0.0495 0.0000 0.0000 0.0399 0.172115 315/315 0.3000 0.2000 0.1800 0.1873@315 0.1400 0.1216 0.0095 0.0127 0.0571 0.444416 554/554 0.3000 0.3500 0.2500 0.2347@554 0.2150 0.1826 0.0054 0.0126 0.0451 0.388117 648/648 0.7000 0.4500 0.4300 0.1651@648 0.1240 0.1037 0.0108 0.0139 0.0664 0.191418 409/409 0.4000 0.3500 0.3800 0.1589@409 0.1490 0.1493 0.0098 0.0171 0.0929 0.364319 341/341 0.1000 0.1500 0.2600 0.2463@341 0.1660 0.1727 0.0029 0.0088 0.0762 0.486820 289/289 0.3000 0.4500 0.2800 0.1730@289 0.1270 0.1281 0.0104 0.0311 0.0969 0.439421 798/798 0.8000 0.7500 0.2700 0.1328@798 0.1240 0.1098 0.0100 0.0188 0.0338 0.155422 220/220 0.2000 0.5000 0.4000 0.2773@220 0.1320 0.1978 0.0091 0.0455 0.1818 0.600023 274/274 0.1000 0.2000 0.1700 0.1642@274 0.1110 0.0885 0.0036 0.0146 0.0620 0.405124 430/430 0.3000 0.2000 0.2700 0.2721@430 0.1920 0.1252 0.0070 0.0093 0.0628 0.446525 376/376 0.3000 0.3000 0.2300 0.1702@376 0.0960 0.0717 0.0080 0.0160 0.0612 0.2553

AVG 2014 0.3400 0.3433 0.2780 0.2021 0.1370 0.1339 0.0092 0.0196 0.0820 0.3959


Table A.13: BM25 results ranked by time for TREC TS 2013 (Ids: 1-10) and TREC TS 2014 (Ids: 11-25) collections (MAP 2013: 0.0174; MAP 2014: 0.0109).

Id  Correct  P@10  P@20  P@100  P@R  P@1000  AP  R@10  R@20  R@100  R@1000

1 233/233 0.1000 0.0500 0.0100 0.0086@233 0.0090 0.0117 0.0043 0.0043 0.0043 0.03862 168/168 0.3000 0.1500 0.0700 0.0417@168 0.0150 0.0226 0.0179 0.0179 0.0417 0.08933 42/42 0.1000 0.0500 0.0100 0.0238@42 0.0040 0.0111 0.0238 0.0238 0.0238 0.09524 180/180 0.1000 0.0500 0.0100 0.0221@181 0.0130 0.0138 0.0055 0.0055 0.0055 0.07185 35/35 0.1000 0.0500 0.0100 0.0286@35 0.0070 0.0163 0.0286 0.0286 0.0286 0.20006 172/172 0.0000 0.0000 0.0100 0.0058@172 0.0060 0.0061 0.0000 0.0000 0.0058 0.03497 – – – – – – – – – – –8 76/76 0.2000 0.2500 0.0600 0.0789@76 0.0130 0.0359 0.0263 0.0658 0.0789 0.17119 68/68 0.1000 0.0500 0.0200 0.0147@68 0.0100 0.0236 0.0147 0.0147 0.0294 0.147110 95/95 0.1000 0.0500 0.0100 0.0104@96 0.0070 0.0153 0.0104 0.0104 0.0104 0.0729

AVG 2013 0.1222 0.0778 0.0250 0.0261 0.0093 0.0174 0.0146 0.0190 0.0143 0.102311 392/392 0.0000 0.0000 0.0100 0.0230@392 0.0290 0.0114 0.0000 0.0000 0.0026 0.074012 184/184 0.0000 0.0000 0.0100 0.0054@184 0.0160 0.0128 0.0000 0.0000 0.0054 0.087013 313/313 0.0000 0.0000 0.0100 0.0064@313 0.0030 0.0068 0.0000 0.0000 0.0032 0.009614 401/401 0.0000 0.0000 0.0100 0.0050@401 0.0060 0.0044 0.0000 0.0000 0.0025 0.015015 315/315 0.0000 0.0000 0.0100 0.0127@315 0.0050 0.0096 0.0000 0.0000 0.0032 0.015916 554/554 0.0000 0.0000 0.0200 0.0090@554 0.0100 0.0124 0.0000 0.0000 0.0036 0.018117 648/648 0.0000 0.0500 0.0100 0.0525@648 0.0350 0.0129 0.0000 0.0015 0.0015 0.054018 409/409 0.0000 0.0000 0.0200 0.0269@409 0.0200 0.0122 0.0000 0.0000 0.0049 0.048919 341/341 0.0000 0.0000 0.0100 0.0323@341 0.0310 0.0221 0.0000 0.0000 0.0029 0.090920 289/289 0.0000 0.0000 0.0000 0.0069@289 0.0040 0.0058 0.0000 0.0000 0.0000 0.013821 798/798 0.0000 0.0000 0.0100 0.0213@798 0.0190 0.0089 0.0000 0.0000 0.0013 0.023822 220/220 0.0000 0.0000 0.0200 0.0091@220 0.0070 0.0086 0.0000 0.0000 0.0091 0.031823 274/274 0.0000 0.1000 0.0600 0.0328@274 0.0150 0.0146 0.0000 0.0073 0.0219 0.054724 430/430 0.1000 0.1000 0.0400 0.0209@430 0.0110 0.0123 0.0023 0.0047 0.0093 0.025625 376/376 0.0000 0.0000 0.0100 0.0053@376 0.0070 0.0086 0.0000 0.0000 0.0027 0.0186

AVG 2014 0.0067 0.0167 0.0167 0.0180 0.0145 0.0109 0.0002 0.0009 0.0049 0.0388

Table A.14: BM25 results ranked by highest score for TREC TS 2013 (Ids: 1-10) and TREC TS 2014 (Ids: 11-25) collections (MAP 2013: 0.0127; MAP 2014: 0.1379).

Id  Correct  P@10  P@20  P@100  P@R  P@1000  AP  R@10  R@20  R@100  R@1000

1 233/233 0.2000 0.1000 0.0400 0.0429@233 0.0250 0.0181 0.0086 0.0086 0.0172 0.10732 168/168 0.0000 0.0000 0.0100 0.0238@168 0.0320 0.0156 0.0000 0.0000 0.0060 0.19053 42/42 0.0000 0.0000 0.0000 0.0000@42 0.0070 0.0061 0.0000 0.0000 0.0000 0.16674 180/180 0.0000 0.0000 0.0200 0.0221@181 0.0140 0.0094 0.0000 0.0000 0.0110 0.07735 35/35 0.0000 0.0000 0.0000 0.0000@35 0.0090 0.0085 0.0000 0.0000 0.0000 0.25716 172/172 0.0000 0.0000 0.0100 0.0116@172 0.0140 0.0074 0.0000 0.0000 0.0058 0.08147 – – – – – – – – – – –8 76/76 0.0000 0.0500 0.0500 0.0658@76 0.0180 0.0216 0.0000 0.0132 0.0658 0.23689 68/68 0.0000 0.0500 0.0200 0.0147@68 0.0160 0.0129 0.0000 0.0147 0.0294 0.235310 95/95 0.0000 0.0000 0.0500 0.0521@96 0.0170 0.0149 0.0000 0.0000 0.0521 0.1771

AVG 2013 0.0222 0.0222 0.0200 0.0259 0.0169 0.0127 0.0010 0.0040 0.0100 0.169911 392/392 0.5000 0.4500 0.2200 0.1046@392 0.1000 0.1103 0.0128 0.0230 0.0561 0.255112 184/184 0.4000 0.5500 0.5300 0.3696@184 0.1580 0.2924 0.0217 0.0598 0.2880 0.858713 313/313 0.6000 0.6500 0.3300 0.2236@313 0.1520 0.1848 0.0192 0.0415 0.1054 0.485614 401/401 0.0000 0.0000 0.3100 0.1571@401 0.0720 0.0547 0.0000 0.0000 0.0773 0.179615 315/315 0.2000 0.2000 0.1700 0.1778@315 0.1230 0.1157 0.0063 0.0127 0.0540 0.390516 554/554 0.6000 0.3000 0.3000 0.2329@554 0.1990 0.1880 0.0108 0.0108 0.0542 0.359217 648/648 0.3000 0.3000 0.3600 0.1713@648 0.1410 0.1063 0.0046 0.0093 0.0556 0.217618 409/409 0.4000 0.4500 0.2700 0.2029@409 0.1470 0.1483 0.0098 0.0220 0.0660 0.359419 341/341 0.4000 0.3500 0.2100 0.2493@341 0.1570 0.1688 0.0117 0.0205 0.0616 0.460420 289/289 0.4000 0.2500 0.2000 0.1384@289 0.1140 0.0973 0.0138 0.0173 0.0692 0.394521 798/798 0.7000 0.7500 0.3300 0.1654@798 0.1420 0.1256 0.0088 0.0188 0.0414 0.177922 220/220 0.5000 0.5000 0.3300 0.2364@220 0.1070 0.1843 0.0227 0.0455 0.1500 0.486423 274/274 0.3000 0.3000 0.2200 0.1715@274 0.1270 0.1021 0.0109 0.0219 0.0803 0.463524 430/430 0.2000 0.1500 0.0900 0.2256@430 0.1920 0.1036 0.0047 0.0070 0.0209 0.446525 376/376 0.5000 0.4000 0.3800 0.1729@376 0.0870 0.0861 0.0133 0.0213 0.1011 0.2314

AVG 2014 0.4000 0.3733 0.2833 0.1999 0.1345 0.1379 0.0114 0.0221 0.0854 0.3844


Table A.15: TF.ISF results ranked by time for TREC TS 2013 (Ids: 1-10) and TREC TS 2014 (Ids: 11-25) collections (MAP 2013: 0.0441; MAP 2014: 0.0109).

Id  Correct  P@10  P@20  P@100  P@R  P@1000  AP  R@10  R@20  R@100  R@1000

1 233/233 0.1000 0.0500 0.0200 0.0258@233 0.0400 0.0453 0.0043 0.0043 0.0086 0.17172 168/168 0.3000 0.3500 0.1000 0.0595@168 0.0450 0.0609 0.0179 0.0417 0.0595 0.26793 42/42 0.1000 0.0500 0.0300 0.0714@42 0.0250 0.0297 0.0238 0.0238 0.0714 0.59524 176/180 0.1000 0.1000 0.0600 0.0663@181 0.0350 0.0377 0.0055 0.0110 0.0331 0.19345 35/35 0.2000 0.1000 0.0400 0.0571@35 0.0170 0.0355 0.0571 0.0571 0.1143 0.48576 172/172 0.0000 0.0000 0.0300 0.0291@172 0.0240 0.0219 0.0000 0.0000 0.0174 0.13957 – – – – – – – – – – –8 76/76 0.2000 0.2500 0.1100 0.1184@76 0.0490 0.0816 0.0263 0.0658 0.1447 0.64479 68/68 0.1000 0.0500 0.0600 0.0735@68 0.0330 0.0504 0.0147 0.0147 0.0882 0.485310 94/95 0.2000 0.1000 0.0400 0.0417@96 0.0170 0.0341 0.0208 0.0208 0.0417 0.1771

AVG 2013 0.1444 0.1167 0.0525 0.0603 0.0317 0.0441 0.0189 0.0266 0.0297 0.351211 392/392 0.0000 0.0000 0.0100 0.0230@392 0.0290 0.0114 0.0000 0.0000 0.0026 0.074012 184/184 0.0000 0.0000 0.0100 0.0054@184 0.0160 0.0128 0.0000 0.0000 0.0054 0.087013 313/313 0.0000 0.0000 0.0100 0.0064@313 0.0030 0.0068 0.0000 0.0000 0.0032 0.009614 401/401 0.0000 0.0000 0.0100 0.0050@401 0.0060 0.0044 0.0000 0.0000 0.0025 0.015015 315/315 0.0000 0.0000 0.0100 0.0127@315 0.0050 0.0096 0.0000 0.0000 0.0032 0.015916 554/554 0.0000 0.0000 0.0200 0.0090@554 0.0100 0.0124 0.0000 0.0000 0.0036 0.018117 648/648 0.0000 0.0500 0.0100 0.0525@648 0.0350 0.0129 0.0000 0.0015 0.0015 0.054018 409/409 0.0000 0.0000 0.0200 0.0269@409 0.0200 0.0122 0.0000 0.0000 0.0049 0.048919 341/341 0.0000 0.0000 0.0100 0.0323@341 0.0310 0.0221 0.0000 0.0000 0.0029 0.090920 289/289 0.0000 0.0000 0.0000 0.0069@289 0.0040 0.0058 0.0000 0.0000 0.0000 0.013821 798/798 0.0000 0.0000 0.0100 0.0213@798 0.0190 0.0089 0.0000 0.0000 0.0013 0.023822 220/220 0.0000 0.0000 0.0200 0.0091@220 0.0070 0.0086 0.0000 0.0000 0.0091 0.031823 274/274 0.0000 0.1000 0.0600 0.0328@274 0.0150 0.0146 0.0000 0.0073 0.0219 0.054724 430/430 0.1000 0.1000 0.0400 0.0209@430 0.0110 0.0123 0.0023 0.0047 0.0093 0.025625 376/376 0.0000 0.0000 0.0100 0.0053@376 0.0070 0.0086 0.0000 0.0000 0.0027 0.0186

AVG 2014 0.0067 0.0167 0.0167 0.0180 0.0145 0.0109 0.0002 0.0009 0.0049 0.0388

Table A.16: TF.ISF results ranked by highest score for TREC TS 2013 (Ids: 1-10) and TREC TS 2014 (Ids: 11-25) (MAP 2013: 0.1330; MAP 2014: 0.1635).

Id  Correct  P@10  P@20  P@100  P@R  P@1000  AP  R@10  R@20  R@100  R@1000

1 233/233 0.3000 0.4000 0.2600 0.2318@233 0.1810 0.2165 0.0129 0.0343 0.1116 0.77682 168/168 0.6000 0.3000 0.1000 0.0714@168 0.1240 0.1520 0.0357 0.0357 0.0595 0.73813 42/42 0.2000 0.1000 0.0600 0.0714@42 0.0290 0.0661 0.0476 0.0476 0.1429 0.69054 176/180 0.0000 0.0000 0.0800 0.0552@181 0.1150 0.0935 0.0000 0.0000 0.0442 0.63545 35/35 0.0000 0.0000 0.0900 0.0286@35 0.0320 0.0818 0.0000 0.0000 0.2571 0.91436 172/172 0.0000 0.0000 0.0100 0.0058@172 0.0780 0.0580 0.0000 0.0000 0.0058 0.45357 – – – – – – – – – – –8 76/76 0.2000 0.1500 0.1500 0.0789@76 0.0680 0.1670 0.0263 0.0395 0.1974 0.89479 68/68 0.3000 0.2000 0.1500 0.1912@68 0.0620 0.1717 0.0441 0.0588 0.2206 0.911810 94/95 0.2000 0.1000 0.0800 0.0833@96 0.0850 0.1907 0.0208 0.0208 0.0833 0.8854

AVG 2013 0.2000 0.1389 0.1125 0.0909 0.0860 0.1330 0.0208 0.0263 0.0553 0.766711 392/392 0.5000 0.5000 0.2200 0.1607@392 0.1090 0.1176 0.0128 0.0255 0.0561 0.278112 184/184 0.6000 0.5500 0.4500 0.3696@184 0.1580 0.2851 0.0326 0.0598 0.2446 0.858713 313/313 0.6000 0.6000 0.3800 0.2684@313 0.1500 0.1980 0.0192 0.0383 0.1214 0.479214 401/401 0.3000 0.4000 0.4200 0.1446@401 0.0660 0.0810 0.0075 0.0200 0.1047 0.164615 315/315 0.3000 0.4000 0.2800 0.1810@315 0.1240 0.1317 0.0095 0.0254 0.0889 0.393716 554/554 0.4000 0.3000 0.3200 0.2383@554 0.2200 0.2036 0.0072 0.0108 0.0578 0.397117 648/648 0.9000 0.8500 0.5400 0.1960@648 0.1480 0.1436 0.0139 0.0262 0.0833 0.228418 409/409 0.3000 0.4500 0.4200 0.2689@409 0.1530 0.1943 0.0073 0.0220 0.1027 0.374119 341/341 0.1000 0.3000 0.3500 0.2551@341 0.1870 0.1912 0.0029 0.0176 0.1026 0.548420 289/289 0.5000 0.3500 0.3100 0.2561@289 0.1290 0.1504 0.0173 0.0242 0.1073 0.446421 798/798 0.6000 0.7000 0.5000 0.1792@798 0.1660 0.1649 0.0075 0.0175 0.0627 0.208022 220/220 0.4000 0.5000 0.4000 0.3455@220 0.1180 0.2255 0.0182 0.0455 0.1818 0.536423 274/274 0.3000 0.2500 0.3500 0.1679@274 0.1190 0.1151 0.0109 0.0182 0.1277 0.434324 430/430 0.2000 0.3500 0.3800 0.3186@430 0.1920 0.1566 0.0047 0.0163 0.0884 0.446525 376/376 0.5000 0.4000 0.4000 0.1729@376 0.0890 0.0943 0.0133 0.0213 0.1064 0.2367

AVG 2014 0.4333 0.4600 0.3813 0.2348 0.1419 0.1635 0.0123 0.0259 0.1091 0.4020


Table A.17: QL results ranked by time for TREC TS 2013 (Ids: 1-10) and TREC TS 2014 (Ids: 11-25) collections (MAP 2013: 0.0441; MAP 2014: 0.0111).

Id  Correct  P@10  P@20  P@100  P@R  P@1000  AP  R@10  R@20  R@100  R@1000

1 233/233 0.1000 0.0500 0.0200 0.0258@233 0.0400 0.0453 0.0043 0.0043 0.0086 0.17172 168/168 0.3000 0.3500 0.1000 0.0595@168 0.0450 0.0609 0.0179 0.0417 0.0595 0.26793 42/42 0.1000 0.0500 0.0300 0.0714@42 0.0250 0.0297 0.0238 0.0238 0.0714 0.59524 176/180 0.1000 0.1000 0.0600 0.0663@181 0.0350 0.0377 0.0055 0.0110 0.0331 0.19345 35/35 0.2000 0.1000 0.0400 0.0571@35 0.0170 0.0355 0.0571 0.0571 0.1143 0.48576 172/172 0.0000 0.0000 0.0300 0.0291@172 0.0240 0.0219 0.0000 0.0000 0.0174 0.13957 – – – – – – – – – – –8 76/76 0.2000 0.2500 0.1100 0.1184@76 0.0490 0.0816 0.0263 0.0658 0.1447 0.64479 68/68 0.1000 0.0500 0.0600 0.0735@68 0.0330 0.0504 0.0147 0.0147 0.0882 0.485310 94/95 0.2000 0.1000 0.0400 0.0417@96 0.0170 0.0341 0.0208 0.0208 0.0417 0.1771

AVG 2013 0.1444 0.1167 0.0525 0.0603 0.0317 0.0441 0.0189 0.0266 0.0297 0.351211 392/392 0.0000 0.0000 0.0100 0.0230@392 0.0290 0.0117 0.0000 0.0000 0.0026 0.074012 184/184 0.0000 0.0000 0.0100 0.0054@184 0.0170 0.0131 0.0000 0.0000 0.0054 0.092413 313/313 0.0000 0.0000 0.0100 0.0064@313 0.0030 0.0069 0.0000 0.0000 0.0032 0.009614 401/401 0.0000 0.0000 0.0100 0.0050@401 0.0060 0.0044 0.0000 0.0000 0.0025 0.015015 315/315 0.0000 0.0000 0.0100 0.0127@315 0.0050 0.0099 0.0000 0.0000 0.0032 0.015916 554/554 0.0000 0.0000 0.0200 0.0090@554 0.0120 0.0127 0.0000 0.0000 0.0036 0.021717 647/648 0.1000 0.0500 0.0100 0.0525@648 0.0350 0.0133 0.0015 0.0015 0.0015 0.054018 409/409 0.0000 0.0000 0.0200 0.0269@409 0.0210 0.0126 0.0000 0.0000 0.0049 0.051319 341/341 0.0000 0.0000 0.0100 0.0323@341 0.0330 0.0227 0.0000 0.0000 0.0029 0.096820 289/289 0.0000 0.0000 0.0000 0.0069@289 0.0040 0.0059 0.0000 0.0000 0.0000 0.013821 798/798 0.0000 0.0000 0.0100 0.0213@798 0.0190 0.0090 0.0000 0.0000 0.0013 0.023822 220/220 0.0000 0.0000 0.0200 0.0091@220 0.0070 0.0087 0.0000 0.0000 0.0091 0.031823 272/274 0.2000 0.1500 0.0400 0.0292@274 0.0130 0.0140 0.0073 0.0109 0.0146 0.047424 430/430 0.1000 0.1000 0.0400 0.0233@430 0.0110 0.0124 0.0023 0.0047 0.0093 0.025625 376/376 0.0000 0.0000 0.0100 0.0053@376 0.0070 0.0087 0.0000 0.0000 0.0027 0.0186

AVG 2014 0.0267 0.0200 0.0153 0.0179 0.0148 0.0111 0.0007 0.0011 0.0044 0.0394

Table A.18: QL results ranked by highest score for TREC TS 2013 (Ids: 1-10) and TREC TS 2014 (Ids: 11-25) (MAP 2013: 0.2058; MAP 2014: 0.0656).

Id  Correct  P@10  P@20  P@100  P@R  P@1000  AP  R@10  R@20  R@100  R@1000

1 233/233 0.6000 0.6500 0.4000 0.3433@233 0.1120 0.2150 0.0258 0.0558 0.1717 0.48072 168/168 0.7000 0.8000 0.4400 0.2976@168 0.0790 0.2493 0.0417 0.0952 0.2619 0.47023 42/42 0.1000 0.2500 0.0700 0.1429@42 0.0180 0.0528 0.0238 0.1190 0.1667 0.42864 176/180 0.3000 0.3500 0.3600 0.2928@181 0.0990 0.1691 0.0166 0.0387 0.1989 0.54705 35/35 0.1000 0.1500 0.1300 0.1714@35 0.0280 0.1355 0.0286 0.0857 0.3714 0.80006 172/172 0.2000 0.3500 0.2600 0.2209@172 0.0890 0.1201 0.0116 0.0407 0.1512 0.51747 – – – – – – – – – – –8 76/76 0.6000 0.5500 0.3100 0.3947@76 0.0590 0.2753 0.0789 0.1447 0.4079 0.77639 68/68 0.7000 0.4000 0.2300 0.3235@68 0.0450 0.2506 0.1029 0.1176 0.3382 0.661810 94/95 0.6000 0.6000 0.5000 0.5000@96 0.0670 0.3848 0.0625 0.1250 0.5208 0.6979

AVG 2013 0.4333 0.4556 0.3650 0.2986 0.0662 0.2058 0.0436 0.0914 0.1959 0.597811 392/392 0.0000 0.0500 0.0600 0.0561@392 0.0970 0.0798 0.0000 0.0026 0.0153 0.247412 184/184 0.4000 0.2000 0.0600 0.0435@184 0.0140 0.0255 0.0217 0.0217 0.0326 0.076113 313/313 0.1000 0.1500 0.2900 0.1789@313 0.1470 0.1042 0.0032 0.0096 0.0927 0.469614 401/401 0.0000 0.0000 0.0000 0.0050@401 0.0280 0.0203 0.0000 0.0000 0.0000 0.069815 315/315 0.2000 0.1000 0.1600 0.1651@315 0.0600 0.0441 0.0063 0.0063 0.0508 0.190516 554/554 0.4000 0.3000 0.3100 0.2256@554 0.1280 0.0798 0.0072 0.0108 0.0560 0.231017 647/648 0.2000 0.3500 0.3100 0.0525@648 0.0340 0.0278 0.0031 0.0108 0.0478 0.052518 409/409 0.0000 0.0000 0.0600 0.1956@409 0.1420 0.0674 0.0000 0.0000 0.0147 0.347219 341/341 0.2000 0.1500 0.2400 0.2522@341 0.1000 0.0868 0.0059 0.0088 0.0704 0.293320 289/289 0.0000 0.0500 0.0400 0.0934@289 0.1030 0.0509 0.0000 0.0035 0.0138 0.356421 798/798 0.3000 0.5000 0.1900 0.1241@798 0.1230 0.0889 0.0038 0.0125 0.0238 0.154122 220/220 0.0000 0.0500 0.0500 0.2273@220 0.0900 0.0875 0.0000 0.0045 0.0227 0.409123 272/274 0.3000 0.2500 0.2100 0.1569@274 0.0500 0.0519 0.0109 0.0182 0.0766 0.182524 430/430 0.2000 0.1000 0.0900 0.2256@430 0.1920 0.1046 0.0047 0.0047 0.0209 0.446525 376/376 0.3000 0.4000 0.4200 0.1303@376 0.0510 0.0642 0.0080 0.0213 0.1117 0.1356

AVG 2014 0.1733 0.1767 0.1660 0.1421 0.0906 0.0656 0.0050 0.0090 0.0433 0.2441


Table A.19: Cosine Sim. results ranked by time for TREC TS 2013 (Ids: 1-10) and TREC TS 2014 (Ids: 11-25) collections (MAP 2013: 0.0441; MAP 2014: 0.0109).

Id  Correct  P@10  P@20  P@100  P@R  P@1000  AP  R@10  R@20  R@100  R@1000

1 233/233 0.1000 0.0500 0.0200 0.0258@233 0.0400 0.0453 0.0043 0.0043 0.0086 0.17172 168/168 0.3000 0.3500 0.1000 0.0595@168 0.0450 0.0609 0.0179 0.0417 0.0595 0.26793 42/42 0.1000 0.0500 0.0300 0.0714@42 0.0250 0.0297 0.0238 0.0238 0.0714 0.59524 176/180 0.1000 0.1000 0.0600 0.0663@181 0.0350 0.0377 0.0055 0.0110 0.0331 0.19345 35/35 0.2000 0.1000 0.0400 0.0571@35 0.0170 0.0355 0.0571 0.0571 0.1143 0.48576 172/172 0.0000 0.0000 0.0300 0.0291@172 0.0240 0.0219 0.0000 0.0000 0.0174 0.13957 – – – – – – – – – – –8 76/76 0.2000 0.2500 0.1100 0.1184@76 0.0490 0.0816 0.0263 0.0658 0.1447 0.64479 68/68 0.1000 0.0500 0.0600 0.0735@68 0.0330 0.0504 0.0147 0.0147 0.0882 0.485310 94/95 0.2000 0.1000 0.0400 0.0417@96 0.0170 0.0341 0.0208 0.0208 0.0417 0.1771

AVG 2013 0.1444 0.1167 0.0525 0.0603 0.0317 0.0441 0.0189 0.0266 0.0297 0.351211 392/392 0.0000 0.0000 0.0100 0.0230@392 0.0290 0.0114 0.0000 0.0000 0.0026 0.074012 184/184 0.0000 0.0000 0.0100 0.0054@184 0.0160 0.0128 0.0000 0.0000 0.0054 0.087013 313/313 0.0000 0.0000 0.0100 0.0064@313 0.0030 0.0068 0.0000 0.0000 0.0032 0.009614 401/401 0.0000 0.0000 0.0100 0.0050@401 0.0060 0.0044 0.0000 0.0000 0.0025 0.015015 315/315 0.0000 0.0000 0.0100 0.0127@315 0.0050 0.0096 0.0000 0.0000 0.0032 0.015916 554/554 0.0000 0.0000 0.0200 0.0090@554 0.0100 0.0124 0.0000 0.0000 0.0036 0.018117 648/648 0.0000 0.0500 0.0100 0.0525@648 0.0350 0.0129 0.0000 0.0015 0.0015 0.054018 409/409 0.0000 0.0000 0.0200 0.0269@409 0.0200 0.0122 0.0000 0.0000 0.0049 0.048919 341/341 0.0000 0.0000 0.0100 0.0323@341 0.0310 0.0221 0.0000 0.0000 0.0029 0.090920 289/289 0.0000 0.0000 0.0000 0.0069@289 0.0040 0.0058 0.0000 0.0000 0.0000 0.013821 798/798 0.0000 0.0000 0.0100 0.0213@798 0.0190 0.0089 0.0000 0.0000 0.0013 0.023822 220/220 0.0000 0.0000 0.0200 0.0091@220 0.0070 0.0086 0.0000 0.0000 0.0091 0.031823 274/274 0.0000 0.1000 0.0600 0.0328@274 0.0150 0.0146 0.0000 0.0073 0.0219 0.054724 430/430 0.1000 0.1000 0.0400 0.0209@430 0.0110 0.0123 0.0023 0.0047 0.0093 0.025625 376/376 0.0000 0.0000 0.0100 0.0053@376 0.0070 0.0086 0.0000 0.0000 0.0027 0.0186

AVG 2014 0.0067 0.0167 0.0167 0.0180 0.0145 0.0109 0.0002 0.0009 0.0049 0.0388

Table A.20: Cosine Sim. results ranked by highest score for TREC TS 2013 (Ids: 1-10) and TREC TS 2014 (Ids: 11-25) (MAP 2013: 0.2202; MAP 2014: 0.1053).

Id  Correct  P@10  P@20  P@100  P@R  P@1000  AP  R@10  R@20  R@100  R@1000

1 233/233 0.8000 0.6500 0.4200 0.2790@233 0.1830 0.2870 0.0343 0.0558 0.1803 0.78542 168/168 0.7000 0.3500 0.3700 0.2857@168 0.1250 0.2351 0.0417 0.0417 0.2202 0.74403 42/42 0.2000 0.1500 0.0900 0.1667@42 0.0280 0.0880 0.0476 0.0714 0.2143 0.66674 176/180 0.2000 0.2500 0.2400 0.2431@181 0.1170 0.1609 0.0110 0.0276 0.1326 0.64645 35/35 0.2000 0.1500 0.1400 0.1143@35 0.0320 0.1412 0.0571 0.0857 0.4000 0.91436 172/172 0.1000 0.1500 0.1500 0.1628@172 0.0850 0.0945 0.0058 0.0174 0.0872 0.49427 – – – – – – – – – – –8 76/76 0.6000 0.5000 0.3000 0.3553@76 0.0680 0.3004 0.0789 0.1316 0.3947 0.89479 68/68 0.8000 0.5500 0.1900 0.2794@68 0.0620 0.2764 0.1176 0.1618 0.2794 0.911810 94/95 0.8000 0.6500 0.4300 0.4271@96 0.0850 0.3979 0.0833 0.1354 0.4479 0.8854

AVG 2013 0.4889 0.3778 0.2950 0.2570 0.0872 0.2202 0.0531 0.0809 0.1551 0.771411 392/392 0.0000 0.0000 0.1400 0.1046@392 0.0980 0.0860 0.0000 0.0000 0.0357 0.250012 184/184 0.2000 0.4000 0.4300 0.3370@184 0.1590 0.2698 0.0109 0.0435 0.2337 0.864113 313/313 0.1000 0.1000 0.1100 0.0799@313 0.0860 0.0857 0.0032 0.0064 0.0351 0.274814 401/401 0.0000 0.0000 0.0100 0.0424@401 0.0170 0.0212 0.0000 0.0000 0.0025 0.042415 315/315 0.0000 0.1000 0.1200 0.1651@315 0.1540 0.0940 0.0000 0.0063 0.0381 0.488916 554/554 0.3000 0.3500 0.3500 0.2022@554 0.1920 0.1686 0.0054 0.0126 0.0632 0.346617 648/648 0.0000 0.0000 0.1600 0.1343@648 0.1230 0.0620 0.0000 0.0000 0.0247 0.189818 409/409 0.0000 0.0000 0.0300 0.1467@409 0.1290 0.1058 0.0000 0.0000 0.0073 0.315419 341/341 0.1000 0.0500 0.1700 0.1554@341 0.1920 0.1212 0.0029 0.0029 0.0499 0.563020 289/289 0.0000 0.0000 0.2200 0.1349@289 0.1070 0.0799 0.0000 0.0000 0.0761 0.370221 798/798 0.4000 0.3000 0.0900 0.1341@798 0.1200 0.0917 0.0050 0.0075 0.0113 0.150422 220/220 0.0000 0.1000 0.1700 0.1909@220 0.1380 0.1155 0.0000 0.0091 0.0773 0.627323 274/274 0.0000 0.1000 0.2900 0.2336@274 0.1240 0.1083 0.0000 0.0073 0.1058 0.452624 430/430 0.2000 0.2000 0.1400 0.2674@430 0.2020 0.1219 0.0047 0.0093 0.0326 0.469825 376/376 0.1000 0.0500 0.0600 0.1330@376 0.0910 0.0476 0.0027 0.0027 0.0160 0.2420

AVG 2014 0.0933 0.1167 0.1660 0.1641 0.1288 0.1053 0.0023 0.0072 0.0539 0.3765


Table A.21: LLR results ranked by time for TREC TS 2013 (Ids: 1-10) and TREC TS 2014 (Ids: 11-25) collections (MAP 2013: 0.0441; MAP 2014: 0.0109).

Id  Correct  P@10  P@20  P@100  P@R  P@1000  AP  R@10  R@20  R@100  R@1000

1 233/233 0.1000 0.0500 0.0200 0.0258@233 0.0400 0.0453 0.0043 0.0043 0.0086 0.17172 168/168 0.3000 0.3500 0.1000 0.0595@168 0.0450 0.0609 0.0179 0.0417 0.0595 0.26793 42/42 0.1000 0.0500 0.0300 0.0714@42 0.0250 0.0297 0.0238 0.0238 0.0714 0.59524 176/180 0.1000 0.1000 0.0600 0.0663@181 0.0350 0.0377 0.0055 0.0110 0.0331 0.19345 35/35 0.2000 0.1000 0.0400 0.0571@35 0.0170 0.0355 0.0571 0.0571 0.1143 0.48576 172/172 0.0000 0.0000 0.0300 0.0291@172 0.0240 0.0219 0.0000 0.0000 0.0174 0.13957 – – – – – – – – – – –8 76/76 0.2000 0.2500 0.1100 0.1184@76 0.0490 0.0816 0.0263 0.0658 0.1447 0.64479 68/68 0.1000 0.0500 0.0600 0.0735@68 0.0330 0.0504 0.0147 0.0147 0.0882 0.485310 94/95 0.2000 0.1000 0.0400 0.0417@96 0.0170 0.0341 0.0208 0.0208 0.0417 0.1771

AVG 2013 0.1444 0.1167 0.0525 0.0603 0.0317 0.0441 0.0189 0.0266 0.0297 0.351211 392/392 0.0000 0.0000 0.0100 0.0230@392 0.0290 0.0114 0.0000 0.0000 0.0026 0.074012 184/184 0.0000 0.0000 0.0100 0.0054@184 0.0160 0.0128 0.0000 0.0000 0.0054 0.087013 313/313 0.0000 0.0000 0.0100 0.0064@313 0.0030 0.0068 0.0000 0.0000 0.0032 0.009614 401/401 0.0000 0.0000 0.0100 0.0050@401 0.0060 0.0044 0.0000 0.0000 0.0025 0.015015 315/315 0.0000 0.0000 0.0100 0.0127@315 0.0050 0.0096 0.0000 0.0000 0.0032 0.015916 554/554 0.0000 0.0000 0.0200 0.0090@554 0.0100 0.0124 0.0000 0.0000 0.0036 0.018117 648/648 0.0000 0.0500 0.0100 0.0525@648 0.0350 0.0129 0.0000 0.0015 0.0015 0.054018 409/409 0.0000 0.0000 0.0200 0.0269@409 0.0200 0.0122 0.0000 0.0000 0.0049 0.048919 341/341 0.0000 0.0000 0.0100 0.0323@341 0.0310 0.0221 0.0000 0.0000 0.0029 0.090920 289/289 0.0000 0.0000 0.0000 0.0069@289 0.0040 0.0058 0.0000 0.0000 0.0000 0.013821 798/798 0.0000 0.0000 0.0100 0.0213@798 0.0190 0.0089 0.0000 0.0000 0.0013 0.023822 220/220 0.0000 0.0000 0.0200 0.0091@220 0.0070 0.0086 0.0000 0.0000 0.0091 0.031823 274/274 0.0000 0.1000 0.0600 0.0328@274 0.0150 0.0146 0.0000 0.0073 0.0219 0.054724 430/430 0.1000 0.1000 0.0400 0.0209@430 0.0110 0.0123 0.0023 0.0047 0.0093 0.025625 376/376 0.0000 0.0000 0.0100 0.0053@376 0.0070 0.0086 0.0000 0.0000 0.0027 0.0186

AVG 2014 0.0067 0.0167 0.0167 0.0180 0.0145 0.0109 0.0002 0.0009 0.0049 0.0388

Table A.22: LLR results ranked by highest score for TREC TS 2013 (Ids: 1-10) and TREC TS 2014 (Ids: 11-25) (MAP 2013: 0.1091; MAP 2014: 0.1467).

Id  Correct  P@10  P@20  P@100  P@R  P@1000  AP  R@10  R@20  R@100  R@1000

1 233/233 0.0000 0.2000 0.2600 0.2575@233 0.1780 0.2003 0.0000 0.0172 0.1116 0.76392 168/168 0.5000 0.3500 0.1000 0.0714@168 0.1240 0.1286 0.0298 0.0417 0.0595 0.73813 42/42 0.0000 0.0500 0.0400 0.0476@42 0.0360 0.0500 0.0000 0.0238 0.0952 0.85714 176/180 0.3000 0.4000 0.1000 0.0994@181 0.1150 0.1088 0.0166 0.0442 0.0552 0.63545 35/35 0.0000 0.0000 0.0200 0.0000@35 0.0350 0.0741 0.0000 0.0000 0.0571 1.00006 172/172 0.1000 0.0500 0.0200 0.0291@172 0.0790 0.0636 0.0058 0.0058 0.0116 0.45937 – – – – – – – – – – –8 76/76 0.1000 0.1500 0.1100 0.1053@76 0.0740 0.1188 0.0132 0.0395 0.1447 0.97379 68/68 0.2000 0.1000 0.0900 0.1029@68 0.0650 0.1051 0.0294 0.0294 0.1324 0.955910 94/95 0.1000 0.0500 0.1300 0.1354@96 0.0830 0.1328 0.0104 0.0104 0.1354 0.8646

AVG 2013 0.1444 0.1500 0.1200 0.0943 0.0877 0.1091 0.0117 0.0236 0.0595 0.805311 392/392 0.2000 0.2500 0.1600 0.2066@392 0.1400 0.1302 0.0051 0.0128 0.0408 0.357112 184/184 0.4000 0.4500 0.4000 0.4022@184 0.1580 0.3217 0.0217 0.0489 0.2174 0.858713 313/313 0.5000 0.5000 0.2900 0.2173@313 0.1630 0.1773 0.0160 0.0319 0.0927 0.520814 401/401 0.0000 0.0000 0.0400 0.0324@401 0.0620 0.0368 0.0000 0.0000 0.0100 0.154615 315/315 0.2000 0.1500 0.1700 0.1746@315 0.1590 0.1479 0.0063 0.0095 0.0540 0.504816 554/554 0.2000 0.3000 0.2000 0.2274@554 0.2360 0.1968 0.0036 0.0108 0.0361 0.426017 648/648 0.6000 0.6000 0.3000 0.1620@648 0.1320 0.1027 0.0093 0.0185 0.0463 0.203718 409/409 0.4000 0.4000 0.2100 0.2200@409 0.1490 0.1592 0.0098 0.0196 0.0513 0.364319 341/341 0.3000 0.3500 0.2500 0.2669@341 0.2060 0.2021 0.0088 0.0205 0.0733 0.604120 289/289 0.4000 0.3500 0.2000 0.2215@289 0.1320 0.1376 0.0138 0.0242 0.0692 0.456721 798/798 0.4000 0.2500 0.0800 0.0652@798 0.0640 0.0826 0.0050 0.0063 0.0100 0.080222 220/220 0.1000 0.2000 0.3000 0.3045@220 0.1590 0.1972 0.0045 0.0182 0.1364 0.722723 274/274 0.1000 0.1000 0.2000 0.1569@274 0.1280 0.1054 0.0036 0.0073 0.0730 0.467224 430/430 0.0000 0.0000 0.2600 0.2023@430 0.1650 0.1380 0.0000 0.0000 0.0605 0.383725 376/376 0.0000 0.0000 0.1400 0.1489@376 0.0980 0.0646 0.0000 0.0000 0.0372 0.2606

AVG 2014 0.2533 0.2600 0.2133 0.2006 0.1434 0.1467 0.0072 0.0152 0.0672 0.4244


Table A.23: LM results ranked by highest score for TREC TS 2013 (Ids: 1-10) and TREC TS 2014 (Ids: 11-25) (MAP 2013: 0.0428; MAP 2014: 0.0320).

Id  Correct  P@10  P@20  P@100  P@R  P@1000  AP  R@10  R@20  R@100  R@1000

1 233/233 0.0000 0.0000 0.0100 0.0300@233 0.0860 0.0718 0.0000 0.0000 0.0043 0.36912 168/168 0.5000 0.2500 0.0600 0.0417@168 0.0300 0.0543 0.0298 0.0298 0.0357 0.17863 42/42 0.0000 0.0500 0.0400 0.0714@42 0.0230 0.0270 0.0000 0.0238 0.0952 0.54764 176/180 0.3000 0.3500 0.1200 0.0663@181 0.0260 0.0497 0.0166 0.0387 0.0663 0.14365 35/35 0.0000 0.0000 0.0200 0.0000@35 0.0300 0.0287 0.0000 0.0000 0.0571 0.85716 172/172 0.0000 0.0500 0.0100 0.0058@172 0.0040 0.0186 0.0000 0.0058 0.0058 0.02337 – – – – – – – – – – –8 76/76 0.2000 0.1000 0.0600 0.0526@76 0.0480 0.0618 0.0263 0.0263 0.0789 0.63169 68/68 0.2000 0.1000 0.0300 0.0294@68 0.0280 0.0476 0.0294 0.0294 0.0441 0.411810 94/95 0.1000 0.0500 0.0300 0.0312@96 0.0130 0.0261 0.0104 0.0104 0.0312 0.1354

AVG 2013 0.1444 0.1056 0.0500 0.0365 0.0320 0.0428 0.0125 0.0182 0.0280 0.366511 392/392 0.0000 0.0000 0.0600 0.0357@392 0.0260 0.0275 0.0000 0.0000 0.0153 0.066312 184/184 0.1000 0.0500 0.0500 0.0326@184 0.0200 0.0254 0.0054 0.0054 0.0272 0.108713 313/313 0.1000 0.1000 0.1000 0.0511@313 0.0350 0.0298 0.0032 0.0064 0.0319 0.111814 401/401 0.5000 0.4000 0.2500 0.0823@401 0.0510 0.0410 0.0125 0.0200 0.0623 0.127215 315/315 0.1000 0.1000 0.0900 0.0635@315 0.0400 0.0365 0.0032 0.0063 0.0286 0.127016 554/554 0.0000 0.0000 0.1100 0.0542@554 0.0510 0.0378 0.0000 0.0000 0.0199 0.092117 648/648 0.1000 0.1500 0.0700 0.0509@648 0.0390 0.0237 0.0015 0.0046 0.0108 0.060218 409/409 0.2000 0.3000 0.1300 0.0807@409 0.0630 0.0532 0.0049 0.0147 0.0318 0.154019 341/341 0.0000 0.0000 0.0300 0.0381@341 0.0620 0.0560 0.0000 0.0000 0.0088 0.181820 289/289 0.0000 0.0000 0.0200 0.0242@289 0.0370 0.0187 0.0000 0.0000 0.0069 0.128021 798/798 0.0000 0.0000 0.0200 0.0351@798 0.0320 0.0148 0.0000 0.0000 0.0025 0.040122 220/220 0.0000 0.0000 0.1200 0.1500@220 0.0540 0.0543 0.0000 0.0000 0.0545 0.245523 274/274 0.0000 0.0000 0.0100 0.0036@274 0.0110 0.0165 0.0000 0.0000 0.0036 0.040124 430/430 0.0000 0.0500 0.1000 0.0535@430 0.0420 0.0251 0.0000 0.0023 0.0233 0.097725 376/376 0.0000 0.0000 0.0100 0.0186@376 0.0160 0.0190 0.0000 0.0000 0.0027 0.0426

AVG 2014 0.0733 0.0767 0.0780 0.0516 0.0386 0.0320 0.0020 0.0040 0.0220 0.1082

Table A.24: LexRank results ranked by highest score for TREC TS 2013 (Ids: 1-10) and TREC TS 2014 (Ids: 11-25) (MAP 2013: 0.0080; MAP 2014: 0.0382).

Id  Correct  P@10  P@20  P@100  P@R  P@1000  AP  R@10  R@20  R@100  R@1000

1 233/233 0.0000 0.0000 0.0000 0.0000@233 0.0090 0.0106 0.0000 0.0000 0.0000 0.03862 168/168 0.1000 0.0500 0.0100 0.0060@168 0.0040 0.0132 0.0060 0.0060 0.0060 0.02383 42/42 0.0000 0.0000 0.0000 0.0000@42 0.0040 0.0046 0.0000 0.0000 0.0000 0.09524 180/180 0.1000 0.0500 0.0300 0.0166@181 0.0070 0.0116 0.0055 0.0055 0.0166 0.03875 35/35 0.0000 0.0000 0.0000 0.0000@35 0.0040 0.0056 0.0000 0.0000 0.0000 0.11436 172/172 0.0000 0.0000 0.0000 0.0058@172 0.0050 0.0049 0.0000 0.0000 0.0000 0.02917 – – – – – – – – – – –8 76/76 0.0000 0.0000 0.0000 0.0000@76 0.0120 0.0106 0.0000 0.0000 0.0000 0.15799 68/68 0.0000 0.0000 0.0000 0.0000@68 0.0010 0.0063 0.0000 0.0000 0.0000 0.014710 95/95 0.0000 0.0000 0.0000 0.0000@96 0.0020 0.0042 0.0000 0.0000 0.0000 0.0208

AVG 2013 0.0222 0.0111 0.0100 0.0031 0.0053 0.0080 0.0013 0.0013 0.0056 0.059211 390/392 0.0000 0.0000 0.0600 0.0357@392 0.0440 0.0303 0.0000 0.0000 0.0153 0.112212 172/184 0.0000 0.0500 0.0400 0.0380@184 0.0180 0.0205 0.0000 0.0054 0.0217 0.097813 308/313 0.0000 0.0500 0.0300 0.0383@313 0.0290 0.0232 0.0000 0.0032 0.0096 0.092715 314/315 0.1000 0.1000 0.0800 0.0667@315 0.0490 0.0426 0.0032 0.0063 0.0254 0.155616 549/554 0.0000 0.0500 0.0500 0.0397@554 0.0440 0.0353 0.0000 0.0018 0.0090 0.079417 566/648 0.2000 0.1500 0.0500 0.0324@648 0.0320 0.0209 0.0031 0.0046 0.0077 0.049418 409/409 0.1000 0.1500 0.0900 0.0782@409 0.0620 0.0511 0.0024 0.0073 0.0220 0.151619 339/341 0.1000 0.1500 0.1500 0.1437@341 0.1000 0.0872 0.0029 0.0088 0.0440 0.293320 277/289 0.1000 0.1000 0.0200 0.0242@289 0.0220 0.0186 0.0035 0.0069 0.0069 0.076122 215/220 0.2000 0.1000 0.1900 0.1227@220 0.0700 0.0716 0.0091 0.0091 0.0864 0.318223 258/274 0.0000 0.0000 0.0000 0.0109@274 0.0140 0.0188 0.0000 0.0000 0.0000 0.0511

AVG 2014 0.0727 0.0818 0.0691 0.0573 0.0440 0.0382 0.0022 0.0049 0.0225 0.1343


Table A.25: Results when using the cosine similarity metric and the entity co-occurrence feature for TREC TS 2013 and 2014 (MAP 2013: 0.1106; MAP 2014: 0.0747).

Id  Correct  P@10  P@20  P@100  P@R  AP  R@10  R@20  R@100

1 131/233 0.3000 0.3000 0.2900 0.2403@233 0.1518 0.0129 0.0258 0.12452 81/168 0.5000 0.4000 0.2900 0.2679@168 0.1468 0.0298 0.0476 0.17263 9/42 0.0000 0.1000 0.0900 0.0952@42 0.0215 0.0000 0.0476 0.21434 94/181 0.2000 0.2500 0.1100 0.1547@181 0.0783 0.0110 0.0276 0.06085 16/35 0.0000 0.2000 0.1100 0.1143@35 0.0623 0.0000 0.1143 0.31436 67/172 0.1000 0.1000 0.1300 0.1337@172 0.0475 0.0058 0.0116 0.07567 – – – – – – – – –8 37/76 0.2000 0.2500 0.2900 0.3026@76 0.1351 0.0263 0.0658 0.38169 37/68 0.1000 0.1500 0.1200 0.1324@68 0.0792 0.0147 0.0441 0.1765

10 77/96 0.3000 0.4000 0.3500 0.3646@96 0.2725 0.0312 0.0833 0.3646AVG 2013 0.1889 0.2389 0.2050 0.2006 0.1106 0.0146 0.0520 0.1084

11 269/392 0.2000 0.1500 0.1200 0.1250@392 0.0778 0.0051 0.0077 0.030612 64/184 0.5000 0.3500 0.3500 0.3152@184 0.1272 0.0272 0.0380 0.190213 97/313 0.2000 0.1500 0.1800 0.1438@313 0.0474 0.0064 0.0096 0.057514 168/401 0.0000 0.2500 0.3000 0.1397@401 0.0492 0.0000 0.0125 0.074815 89/315 0.4000 0.3500 0.2500 0.2317@315 0.0769 0.0127 0.0222 0.079416 203/554 0.3000 0.3000 0.2200 0.2130@554 0.0840 0.0054 0.0108 0.039717 208/648 0.5000 0.5500 0.2600 0.1034@648 0.0470 0.0077 0.0170 0.040118 122/409 0.3000 0.3500 0.2600 0.1809@409 0.0659 0.0073 0.0171 0.063619 67/341 0.0000 0.0500 0.1400 0.1730@341 0.0313 0.0000 0.0029 0.041120 100/289 0.7000 0.5000 0.2700 0.1488@289 0.0912 0.0242 0.0346 0.093421 505/798 0.5000 0.4000 0.2900 0.1253@798 0.0873 0.0063 0.0100 0.036322 78/220 0.3000 0.2500 0.3300 0.2455@220 0.1023 0.0136 0.0227 0.150023 79/274 0.0000 0.2000 0.2600 0.1934@274 0.0579 0.0000 0.0146 0.094924 197/430 0.4000 0.3500 0.3400 0.3209@430 0.1506 0.0093 0.0163 0.079125 83/376 0.0000 0.0000 0.1000 0.1303@376 0.0254 0.0000 0.0000 0.0266

AVG 2014 0.2867 0.2800 0.2447 0.1860 0.0748 0.0084 0.0157 0.0732


Table A.26: Results when using the cosine similarity metric and the entity co-occurrence feature for TREC TS 2013 and 2014 (MAP 2013: 0.1195; MAP 2014: 0.0591).

Id  Correct  P@10  P@20  P@100  P@R  AP  R@10  R@20  R@100

1 131/233 0.3000 0.3000 0.2400 0.2189@233 0.1401 0.0129 0.0258 0.10302 81/168 0.5000 0.4000 0.2800 0.2619@168 0.1387 0.0298 0.0476 0.16673 9/42 0.1000 0.1000 0.0900 0.0952@42 0.0228 0.0238 0.0476 0.21434 94/181 0.1000 0.1500 0.1700 0.1657@181 0.0809 0.0055 0.0166 0.09395 16/35 0.2000 0.2000 0.1100 0.1429@35 0.0831 0.0571 0.1143 0.31436 67/172 0.1000 0.1000 0.1300 0.1279@172 0.0473 0.0058 0.0116 0.07567 – – – – – – – – –8 37/76 0.5000 0.4000 0.2800 0.3421@76 0.1831 0.0658 0.1053 0.36849 37/68 0.2000 0.1500 0.1100 0.1618@68 0.0908 0.0294 0.0441 0.161810 77/96 0.6000 0.4000 0.3400 0.3438@96 0.2883 0.0625 0.0833 0.3542

AVG 2013 0.2889 0.2444 0.2050 0.2067 0.1195 0.0325 0.0551 0.109811 269/392 0.1000 0.0500 0.0100 0.0255@392 0.0620 0.0026 0.0026 0.002612 64/184 0.7000 0.6500 0.4100 0.3098@184 0.1722 0.0380 0.0707 0.222813 97/313 0.0000 0.0000 0.0100 0.0415@313 0.0214 0.0000 0.0000 0.003214 168/401 0.0000 0.0000 0.0000 0.0025@401 0.0060 0.0000 0.0000 0.000015 89/315 0.1000 0.2000 0.3100 0.2127@315 0.0673 0.0032 0.0127 0.098416 203/554 0.2000 0.2000 0.2000 0.2310@554 0.0799 0.0036 0.0072 0.036117 208/648 0.0000 0.1000 0.0300 0.0926@648 0.0270 0.0000 0.0031 0.004618 122/409 0.1000 0.0500 0.0700 0.1174@409 0.0334 0.0024 0.0024 0.017119 67/341 0.1000 0.1500 0.1500 0.1760@341 0.0337 0.0029 0.0088 0.044020 100/289 0.0000 0.0000 0.0500 0.0830@289 0.0357 0.0000 0.0000 0.017321 505/798 0.0000 0.0500 0.0700 0.0276@798 0.0446 0.0000 0.0013 0.008822 78/220 0.3000 0.1500 0.1700 0.1591@220 0.0616 0.0136 0.0136 0.077323 79/274 0.6000 0.6500 0.2200 0.1825@274 0.0787 0.0219 0.0474 0.080324 197/430 0.2000 0.3000 0.3700 0.3000@430 0.1437 0.0047 0.0140 0.086025 83/376 0.1000 0.0500 0.0400 0.0931@376 0.0198 0.0027 0.0027 0.0106

AVG 2014 0.1667 0.1733 0.1407 0.1369 0.0591 0.0064 0.0124 0.0473

Table A.27: Summary of the performance of methods and techniques with output ranked by highest score for TREC TS 2013 (Ids: 1-10) and TREC TS 2014 (Ids: 11-25).

Method                P@10    P@20    P@100   P@R     P@1000  MAP     R@10    R@20    R@100   R@1000
TF AVG 2013           0.0000  0.0056  0.0175  0.0170  0.0160  0.0113  0.0000  0.0005  0.0094  0.1643
BM25 AVG 2013         0.0222  0.0222  0.0200  0.0259  0.0169  0.0127  0.0010  0.0040  0.0100  0.1699
TF.ISF AVG 2013       0.2000  0.1389  0.1125  0.0909  0.0860  0.1330  0.0208  0.0263  0.0553  0.7667
QL AVG 2013           0.4333  0.4556  0.3650  0.2986  0.0662  0.2058  0.0436  0.0914  0.1959  0.5978
COS.SIM AVG 2013      0.4889  0.3778  0.2950  0.2570  0.0872  0.2202  0.0531  0.0809  0.1551  0.7714
LLR AVG 2013          0.1444  0.1500  0.1200  0.0943  0.0877  0.1091  0.0117  0.0236  0.0595  0.8053
LM AVG 2013           0.1444  0.1056  0.0500  0.0365  0.0320  0.0428  0.0125  0.0182  0.0280  0.3665
LexRank AVG 2013      0.0222  0.0111  0.0100  0.0031  0.0053  0.0080  0.0013  0.0013  0.0056  0.0592
TF AVG 2014           0.3400  0.3433  0.2780  0.2021  0.1370  0.1339  0.0092  0.0196  0.0820  0.3959
BM25 AVG 2014         0.4000  0.3733  0.2833  0.1999  0.1345  0.1379  0.0114  0.0221  0.0854  0.3844
TF.ISF AVG 2014       0.4333  0.4600  0.3813  0.2348  0.1419  0.1635  0.0123  0.0259  0.1091  0.4020
QL AVG 2014           0.1733  0.1767  0.1660  0.1421  0.0906  0.0656  0.0050  0.0090  0.0433  0.2441
COS.SIM AVG 2014      0.0933  0.1167  0.1660  0.1641  0.1288  0.1053  0.0023  0.0072  0.0539  0.3765
LLR AVG 2014          0.2533  0.2600  0.2133  0.2006  0.1434  0.1467  0.0072  0.0152  0.0672  0.4244
LM AVG 2014           0.0733  0.0767  0.0780  0.0516  0.0386  0.0320  0.0020  0.0040  0.0220  0.1082
LexRank AVG 2014      0.0727  0.0818  0.0691  0.0573  0.0440  0.0382  0.0022  0.0049  0.0225  0.1343


Figure A.1: Document length histogram for the relevant documents inside the TREC TS 2013 collection.

Figure A.2: Document length histogram for the relevant documents inside the TREC TS 2014 collection.

Appendix B

In this Appendix we report on our participation in the TREC TS 2015 track, the Summarization Only task. In Table B.1 we present the results achieved by our submissions to the track using the official TREC TS evaluation metrics. However, not all of the runs shown have been pooled for the extraction of gold nuggets and updates. In Figure B.6 we present the official results for the pooled runs. We observe that the performance of our best run, cosine similarity, is very close to the mean for all submissions to the task.

Table B.1: TREC Temporal Summarization 2015 test results (average for each submitted run across all test events).

Run                                                   nE[Gain]  nE[Latency Gain]  Comprehensiveness  Latency  HM
Query likelihood (no smoothing)                       0.0200    0.0145            0.7541             0.5381   0.0277
Query likelihood (with smoothing)                     0.0798    0.0453            0.4222             0.2687   0.0618
Query likelihood (with smoothing + higher threshold)  0.0359    0.0204            0.6662             0.4664   0.0375
Cosine similarity                                     0.0428    0.0260            0.5708             0.3655   0.0471
Cosine similarity, expanded query (Word2Vec)          0.0281    0.0197            0.7325             0.5118   0.0372
Term frequency                                        0.0223    0.0160            0.8415             0.6289   0.0310
Term frequency, expanded query (Wordnet)              0.0200    0.0147            0.8326             0.6209   0.0285
Term frequency, expanded query (Word2Vec)             0.0264    0.0172            0.7992             0.5865   0.0330
TF.ISF                                                0.0234    0.0166            0.8196             0.6080   0.0321
TF.ISF, expanded query (Wordnet)                      0.0221    0.0158            0.8260             0.6169   0.0306
TF.ISF, expanded query (Word2Vec)                     0.0212    0.0153            0.8301             0.6107   0.0297
LexRank                                               0.0224    0.0157            0.7490             0.5111   0.0299
Language modeling                                     0.0195    0.0135            0.6871             0.4737   0.0258
LLR                                                   0.0173    0.0130            0.8348             0.6533   0.0248
LDA                                                   0.0222    0.0131            0.7036             0.4271   0.0250
LDAv2                                                 0.0202    0.0126            0.7423             0.4778   0.0241
TREC TS 2015 Average                                  0.0595    0.0319            0.5627             0.3603   0.0472


Figure B.1: Results for the normalized expected gain metric, i.e. the degree to which the updates within the summary are on-topic and novel.

Figure B.2: Results for the normalized expected latency gain metric, i.e. the degree to which the updates within the summary are on-topic, novel and timely.


Figure B.3: Results for the comprehensiveness metric, i.e. how many nuggets the system covers. Comprehensiveness is similar to the traditional notion of recall in information retrieval evaluation.

Figure B.4: Results for the expected latency metric, i.e. the degree to which the information contained within the updates is outdated (a high value for latency denotes timely performance).


Figure B.5: Results for HM (nE[Latency Gain], Latency Comprehensiveness) - the harmonic mean of normalized Expected Latency Gain and Latency Comprehensiveness. This is the official target metric for the TREC Temporal Summarization 2015 track.
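Written out, the harmonic mean in Figure B.5 combines the two latency-aware quantities in the usual way; denoting nE[Latency Gain] by G and Latency Comprehensiveness by C (the symbols are ours), we have

\[ \mathrm{HM}(G, C) = \frac{2\,G\,C}{G + C}. \]

A run therefore obtains a high HM only if it is both precise in a timely fashion (high G) and comprehensive in a timely fashion (high C); a very low value on either side dominates the score, which is why runs in Table B.1 with high comprehensiveness but low gain still receive small HM values.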

Figure B.6: TREC results 2015.

Page 117: Temporal Summarization of News Streams - UvA · 2020-03-24 · Temporal Summarization of News Streams Author: Georgeta-CristinaGârbacea ... niques at identifying salient sentence

111

Bibliography

[1] Rafik Abbes et al. “IRIT at TREC Temporal Summarization 2014”. In:Proceedings of The Twenty-Third Text REtrieval Conference, TREC2014, Gaithersburg, Maryland, USA, November 19-21, 2014. 2014.

[2] Charu C Aggarwal and ChengXiang Zhai. “A survey of text clusteringalgorithms”. In: Mining Text Data. Springer, 2012, pp. 77–128.

[3] James Allan. HARD track overview in TREC 2003 high accuracy re-trieval from documents. Tech. rep. DTIC Document, 2005.

[4] James Allan. “Introduction to topic detection and tracking”. In: Topicdetection and tracking. Springer, 2002, pp. 1–16.

[5] James Allan. Topic detection and tracking: event-based informationorganization. Vol. 12. Springer Science & Business Media, 2012.

[6] James Allan et al. “Topic detection and tracking pilot study finalreport”. In: (1998).

[7] James Allan, Rahul Gupta, and Vikas Khandelwal. “Temporal sum-maries of new topics”. In: Proceedings of the 24th annual internationalACM SIGIR conference on Research and development in informationretrieval. ACM. 2001, pp. 10–18.

[8] James Allan, Rahul Gupta, and Vikas Khandelwal. “Topic models forsummarizing novelty”. In: ARDA Workshop on Language Modelingand Information Retrieval. Pittsburgh, Pennsylvania. 2001.

[9] James Allan, Ron Papka, and Victor Lavrenko. “On-line new event de-tection and tracking”. In: Proceedings of the 21st annual internationalACM SIGIR conference on Research and development in informationretrieval. ACM. 1998, pp. 37–45.

[10] James Allan, Courtney Wade, and Alvaro Bolivar. “Retrieval and nov-elty detection at the sentence level”. In: Proceedings of the 26th annualinternational ACM SIGIR conference on Research and development ininformaion retrieval. ACM. 2003, pp. 314–321.

[11] Tim Althoff et al. “TimeMachine: Timeline Generation for Knowledge-Base Entities”. In: Proceedings of the 21th ACM SIGKDD Interna-tional Conference on Knowledge Discovery and Data Mining. ACM.2015, pp. 19–28.

[12] Paavo Arvola, Marko Junkkari, and Jaana Kekäläinen. “Generalizedcontextualization method for XML information retrieval”. In: Proceed-ings of the 14th ACM international conference on Information andknowledge management. ACM. 2005, pp. 20–27.

[13] Javed Aslam et al. TREC 2014 temporal summarization track overview.Tech. rep. DTIC Document, 2015.

Page 118: Temporal Summarization of News Streams - UvA · 2020-03-24 · Temporal Summarization of News Streams Author: Georgeta-CristinaGârbacea ... niques at identifying salient sentence

112 BIBLIOGRAPHY

[14] Javed A. Aslam et al. “TREC 2015 Temporal Summarization”. In:Proceedings of The Twenty-Fourth Text REtrieval Conference, TREC2015, Gaithersburg, Maryland, USA, November 19-22, 2015. 2015.

[15] Javed A. Aslam et al. “TREC 2013 Temporal Summarization”. In:Proceedings of The Twenty-Second Text REtrieval Conference, TREC2013, Gaithersburg, Maryland, USA, November 19-22, 2013. 2013.

[16] Farzindar Atefeh and Wael Khreich. “A survey of techniques for eventdetection in twitter”. In: Computational Intelligence 31.1 (2015), pp. 132–164.

[17] Niranjan Balasubramanian, James Allan, andW Bruce Croft. “A com-parison of sentence retrieval techniques”. In: Proceedings of the 30thannual international ACM SIGIR conference on Research and devel-opment in information retrieval. ACM. 2007, pp. 813–814.

[18] Gaurav Baruah et al. “University of Waterloo at the TREC 2013 Tem-poral Summarization Track.” In: Proceedings of The Twenty-SecondText REtrieval Conference, TREC 2013, Gaithersburg, Maryland, USA,November 19-22, 2013. 2013.

[19] David M Blei, Andrew Y Ng, and Michael I Jordan. “Latent dirichletallocation”. In: the Journal of machine Learning research 3 (2003),pp. 993–1022.

[20] Thorsten Brants, Francine Chen, and Ayman Farahat. “A system fornew event detection”. In: Proceedings of the 26th annual internationalACM SIGIR conference on Research and development in informaionretrieval. ACM. 2003, pp. 330–337.

[21] Jaime Carbonell and Jade Goldstein. “The use of MMR, diversity-based reranking for reordering documents and producing summaries”.In: Proceedings of the 21st annual international ACM SIGIR confer-ence on Research and development in information retrieval. ACM.1998, pp. 335–336.

[22] David Carmel, Anna Shtok, and Oren Kurland. “Position-based con-textualization for passage retrieval”. In: Proceedings of the 22nd ACMinternational conference on Conference on information & knowledgemanagement. ACM. 2013, pp. 1241–1244.

[23] Moses Charikar et al. “Incremental clustering and dynamic informa-tion retrieval”. In: Proceedings of the twenty-ninth annual ACM sym-posium on Theory of computing. ACM. 1997, pp. 626–635.

[24] Lei Chen et al. “ICTNET at Temporal Summarization Track TREC2014”. In: Proceedings of The Twenty-Third Text REtrieval Confer-ence, TREC 2014, Gaithersburg, Maryland, USA, November 19-21,2014. 2014.

[25] Mário Cordeiro. “Twitter event detection: Combining wavelet analy-sis and topic inference summarization”. In: Doctoral Symposium onInformatics Engineering, DSIE. Vol. 56. 2012.

[26] Dipanjan Das and André FT Martins. “A survey on automatic textsummarization”. In: Literature Survey for the Language and StatisticsII course at CMU 4 (2007), pp. 192–195.

Page 119: Temporal Summarization of News Streams - UvA · 2020-03-24 · Temporal Summarization of News Streams Author: Georgeta-CristinaGârbacea ... niques at identifying salient sentence

BIBLIOGRAPHY 113

[27] Misha Denil, Alban Demiraj, and Nando de Freitas. “Extraction ofsalient sentences from labelled documents”. In: arXiv preprint arXiv:1412.6815(2014).

[28] Gaël Dias, Elsa Alves, and José Gabriel Pereira Lopes. “Topic seg-mentation algorithms for text summarization and passage retrieval:An exhaustive evaluation”. In: AAAI. Vol. 7. 2007, pp. 1334–1339.

[29] Alen Doko, Maja Stula, and Darko Stipanicev. “A recursive TF-ISFBased Sentence Retrieval Method with Local Context”. In: Interna-tional Journal of Machine Learning and Computing 3.2 (2013), p. 195.

[30] Wenwen Dou et al. “Leadline: Interactive visual analysis of text datathrough event identification and exploration”. In: Visual Analytics Sci-ence and Technology (VAST), 2012 IEEE Conference on. IEEE. 2012,pp. 93–102.

[31] Gunes Erkan and Dragomir R Radev. “LexRank: Graph-based lexicalcentrality as salience in text summarization”. In: Journal of ArtificialIntelligence Research (2004), pp. 457–479.

[32] Oren Etzioni et al. “Open Information Extraction: The Second Gen-eration.” In: IJCAI. Vol. 11. 2011, pp. 3–10.

[33] Ronald T Fernández, David E Losada, and Leif A Azzopardi. “Ex-tending the language modeling framework for sentence retrieval toinclude local context”. In: Information Retrieval 14.4 (2011), pp. 355–389.

[34] Cristina Garbacea and Evangelos Kanoulas. “The University of Am-sterdam (ILPS.UvA) at TREC 2015 Temporal Summarisation Track”.In: Proceedings of The Twenty-Fouth Text REtrieval Conference, TREC2015, Gaithersburg, Maryland, USA, November 17-20, 2015. 2015.

[35] Jiayin Ge, Xuanjing Huang, and Lide Wu. “Approaches to event-focused summarization based on named entities and query words”. In:Proceedings of the 2003 Document Understanding Workshop. 2003.

[36] Jade Goldstein et al. “Multi-document summarization by sentenceextraction”. In: Proceedings of the 2000 NAACL-ANLPWorkshop onAutomatic summarization-Volume 4. Association for ComputationalLinguistics. 2000, pp. 40–48.

[37] David Graff and C Cieri. “English gigaword corpus”. In: LinguisticData Consortium (2003).

[38] Qi Guo, Fernando Diaz, and Elad Yom-Tov. “Updating Users aboutTime Critical Events”. In: Advances in Information Retrieval - 35thEuropean Conference on IR Research, ECIR 2013, Moscow, Russia,March 24-27, 2013. Proceedings. 2013, pp. 483–494.

[39] Qi Guo, Fernando Diaz, and Elad Yom-Tov. “Updating Users aboutTime Critical Events”. In: Advances in Information Retrieval - 35thEuropean Conference on IR Research, ECIR 2013, Moscow, Russia,March 24-27, 2013. Proceedings. 2013, pp. 483–494.

Page 120: Temporal Summarization of News Streams - UvA · 2020-03-24 · Temporal Summarization of News Streams Author: Georgeta-CristinaGârbacea ... niques at identifying salient sentence

114 BIBLIOGRAPHY

[40] Surabhi Gupta, Ani Nenkova, and Dan Jurafsky. “Measuring impor-tance and query relevance in topic-focused multi-document summa-rization”. In: Proceedings of the 45th Annual Meeting of the ACL onInteractive Poster and Demonstration Sessions. Association for Com-putational Linguistics. 2007, pp. 193–196.

[41] David W Hosmer Jr and Stanley Lemeshow. Applied logistic regres-sion. John Wiley & Sons, 2004.

[42] David W Hosmer Jr, Stanley Lemeshow, and Rodney X Sturdivant.Applied logistic regression. Vol. 398. John Wiley & Sons, 2013.

[43] Ting Hua et al. “STED: semi-supervised targeted-interest event de-tectionin in twitter”. In: Proceedings of the 19th ACM SIGKDD inter-national conference on Knowledge discovery and data mining. ACM.2013, pp. 1466–1469.

[44] Muhammad Imran et al. “Processing social media messages in massemergency: a survey”. In: ACM Computing Surveys (CSUR) 47.4 (2015),p. 67.

[45] Akshaya Iyengar, Tim Finin, and Anupam Joshi. “Content-based pre-diction of temporal boundaries for events in Twitter”. In: Privacy, Se-curity, Risk and Trust (PASSAT) and 2011 IEEE Third InernationalConference on Social Computing (SocialCom), 2011 IEEE Third In-ternational Conference on. IEEE. 2011, pp. 186–191.

[46] Prachi Joshi and Parag Kulkarni. “Incremental learning: areas andmethods-a survey”. In: International Journal of Data Mining andKnowledge Management Process 2.5 (2012), p. 43.

[47] Jaap Kamps et al. “INEX 2007 evaluation measures”. In: Focused ac-cess to XML documents. Springer, 2008, pp. 24–33.

[48] Chris Kedzie and Fernando Diaz. “CUNLP Temporal Summarization@TREC2015”. In: Proceedings of The Twenty-Fouth Text REtrievalConference, TREC 2015, Gaithersburg, Maryland, USA, November17-20, 2015. 2015.

[49] Chris Kedzie, Kathleen McKeown, and Fernando Diaz. “PredictingSalient Updates for Disaster Summarization”. In: Proceedings of the53rd Annual Meeting of the Association for Computational Linguis-tics and the 7th International Joint Conference on Natural LanguageProcessing of the Asian Federation of Natural Language Processing,ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers.2015, pp. 1608–1617.

[50] Chris Kedzie, Kathleen McKeown, and Fernando Diaz. “PredictingSalient Updates for Disaster Summarization”. In: Proceedings of the53rd Annual Meeting of the Association for Computational Linguis-tics and the 7th International Joint Conference on Natural LanguageProcessing of the Asian Federation of Natural Language Processing,ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers.2015, pp. 1608–1617.

[51] Chris Kedzie, Kathleen McKeown, and Fernando Diaz. “Summarizing Disasters Over Time”. In: Proc. Bloomberg Workshop on Social Good (with SIGKDD). 2014.

[52] Mostafa Keikha et al. “Retrieving Passages and Finding Answers”. In: Proceedings of the 2014 Australasian Document Computing Symposium. ACM. 2014, p. 81.

[53] Wael Khreich et al. “Adaptive ROC-based ensembles of HMMs applied to anomaly detection”. In: Pattern Recognition 45.1 (2012), pp. 208–230.

[54] Arpit Khurdiya et al. “Extraction and Compilation of Events and Sub-events from Twitter”. In: Proceedings of The 2012 IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technology - Volume 01. IEEE Computer Society. 2012, pp. 504–508.

[55] David G Kleinbaum and Mitchel Klein. Logistic regression: a self-learning text. Springer Science & Business Media, 2010.

[56] Chin-Yew Lin and Eduard Hovy. “The automated acquisition of topic signatures for text summarization”. In: Proceedings of the 18th conference on Computational linguistics - Volume 1. Association for Computational Linguistics. 2000, pp. 495–501.

[57] Qian Liu et al. “ICTNET at Temporal Summarization Track TREC 2013.” In: Proceedings of The Twenty-Second Text REtrieval Conference, TREC 2013, Gaithersburg, Maryland, USA, November 19-22, 2013. 2013.

[58] Rui Long et al. “Towards effective event detection, tracking and summarization on microblog data”. In: Web-Age Information Management. Springer, 2011, pp. 652–663.

[59] Kuang Lu and Hui Fang. “Event oriented query expansion for news event queries”. In: Proceedings of The Twenty-Fourth Text REtrieval Conference, TREC 2015, Gaithersburg, Maryland, USA, November 17-20, 2015. 2015.

[60] Inderjeet Mani and Eric Bloedorn. “Machine learning of generic and user-focused summarization”. In: AAAI/IAAI. 1998, pp. 821–826.

[61] Christopher D Manning, Prabhakar Raghavan, Hinrich Schütze, et al. Introduction to information retrieval. Vol. 1. 1. Cambridge University Press, Cambridge, 2008.

[62] Richard McCreadie et al. “University of Glasgow at TREC 2013: Experiments with Terrier in Contextual Suggestion, Temporal Summarisation and Web Tracks”. In: Proceedings of The Twenty-Second Text REtrieval Conference, TREC 2013, Gaithersburg, Maryland, USA, November 19-22, 2013. 2013.

[63] Richard McCreadie et al. “University of Glasgow at TREC 2014: Experiments with Terrier in Contextual Suggestion, Temporal Summarisation and Web Tracks”. In: Proceedings of The Twenty-Third Text REtrieval Conference, TREC 2014, Gaithersburg, Maryland, USA, November 19-21, 2014. 2014.

[64] Richard McCreadie, Craig Macdonald, and Iadh Ounis. “Incremental update summarization: Adaptive sentence selection based on prevalence and novelty”. In: Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management. ACM. 2014, pp. 301–310.

[65] Richard McCreadie et al. “BJUT at TREC 2015 Temporal Summarization Track”. In: Proceedings of The Twenty-Fourth Text REtrieval Conference, TREC 2015, Gaithersburg, Maryland, USA, November 17-20, 2015. 2015.

[66] Donald Metzler, Congxing Cai, and Eduard Hovy. “Structured event retrieval over microblog archives”. In: Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics. 2012, pp. 646–655.

[67] Rada Mihalcea and Paul Tarau. “TextRank: Bringing order into texts”. In: Proceedings of EMNLP 2004. Association for Computational Linguistics. 2004.

[68] Tomas Mikolov et al. “Distributed representations of words and phrases and their compositionality”. In: Advances in neural information processing systems. 2013, pp. 3111–3119.

[69] George A Miller. “WordNet: a lexical database for English”. In: Communications of the ACM 38.11 (1995), pp. 39–41.

[70] Masnizah Mohd. “Named entity patterns across news domains”. In: BCS IRSG Symposium: Future Directions in Information Access. 2007, pp. 30–36.

[71] Robert C Moore. “On Log-Likelihood-Ratios and the Significance of Rare Events.” In: EMNLP. 2004, pp. 333–340.

[72] Vanessa Murdock and W Bruce Croft. “A translation model for sentence retrieval”. In: Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing. Association for Computational Linguistics. 2005, pp. 684–691.

[73] Vanessa Graham Murdock. “Aspects of sentence retrieval”. PhD thesis. University of Massachusetts Amherst, 2006.

[74] Ani Nenkova. “Automatic text summarization of newswire: Lessons learned from the document understanding conference”. In: AAAI. Vol. 5. 2005, pp. 1436–1441.

[75] Ani Nenkova and Kathleen McKeown. “Automatic Summarization”. In: Foundations and Trends in Information Retrieval 5.2-3 (2011), pp. 103–233.

[76] Joel Larocca Neto, Alex A Freitas, and Celso AA Kaestner. “Automatic text summarization using a machine learning approach”. In: Advances in Artificial Intelligence. Springer, 2002, pp. 205–215.

[77] Muhammad Ali Norozi, Paavo Arvola, and Arjen P de Vries. “Contextualization using hyperlinks and internal hierarchical structure of wikipedia documents”. In: Proceedings of the 21st ACM international conference on Information and knowledge management. ACM. 2012, pp. 734–743.

[78] Alexandra Olteanu et al. “CrisisLex: A Lexicon for Collecting and Filtering Microblogged Communications in Crises.” In: ICWSM. 2014.

[79] Lawrence Page et al. “The PageRank citation ranking: bringing order to the web.” In: (1999).

[80] Fabian Pedregosa et al. “Scikit-learn: Machine learning in Python”. In: The Journal of Machine Learning Research 12 (2011), pp. 2825–2830.

[81] Saša Petrović, Miles Osborne, and Victor Lavrenko. “Streaming first story detection with application to twitter”. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics. 2010, pp. 181–189.

[82] Swit Phuvipadawat and Tsuyoshi Murata. “Breaking news detection and tracking in Twitter”. In: Web Intelligence and Intelligent Agent Technology (WI-IAT), 2010 IEEE/WIC/ACM International Conference on. Vol. 3. IEEE. 2010, pp. 120–123.

[83] Ana-Maria Popescu and Marco Pennacchiotti. “Detecting controversial events from twitter”. In: Proceedings of the 19th ACM international conference on Information and knowledge management. ACM. 2010, pp. 1873–1876.

[84] Robert Power et al. “Emergency situation awareness: Twitter case studies”. In: Information Systems for Crisis Response and Management in Mediterranean Countries. Springer, 2014, pp. 218–231.

[85] Yuanyuan Qi et al. “The Information Extraction systems of BUPT_PRIS at TREC 2014 Temporal Summarization Track”. In: Proceedings of The Twenty-Third Text REtrieval Conference, TREC 2014, Gaithersburg, Maryland, USA, November 19-21, 2014. 2014.

[86] Dragomir Radev et al. “MEAD - a platform for multidocument multilingual text summarization”. In: (2004).

[87] Dragomir R Radev, Eduard Hovy, and Kathleen McKeown. “Introduction to the special issue on summarization”. In: Computational Linguistics 28.4 (2002), pp. 399–408.

[88] Naren Ramakrishnan et al. “’Beating the news’ with EMBERS: Forecasting civil unrest using open source indicators”. In: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM. 2014, pp. 1799–1808.

[89] Paul Rayson and Roger Garside. “Comparing corpora using frequency profiling”. In: Proceedings of the workshop on Comparing Corpora. Association for Computational Linguistics. 2000, pp. 1–6.

[90] Ahsan Raza, Kevin Rotondo, and Charles Clarke. “WaterlooClarke: TREC 2015 Temporal Summarization Track”. In: Proceedings of The Twenty-Fourth Text REtrieval Conference, TREC 2015, Gaithersburg, Maryland, USA, November 17-20, 2015. 2015.

[91] Payam Refaeilzadeh, Lei Tang, and Huan Liu. “Cross-validation”. In: Encyclopedia of database systems. Springer, 2009, pp. 532–538.

[92] Zhaochun Ren et al. “Personalized time-aware tweets summarization”. In: Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval. ACM. 2013, pp. 513–522.

[93] Zhaochun Ren et al. “Summarizing web forum threads based on a latent topic propagation process”. In: Proceedings of the 20th ACM international conference on Information and knowledge management. ACM. 2011, pp. 879–884.

[94] Alan Ritter, Oren Etzioni, Sam Clark, et al. “Open domain event extraction from twitter”. In: Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM. 2012, pp. 1104–1112.

[95] Stephen Robertson and Hugo Zaragoza. “On rank-based effectiveness measures and optimization”. In: Information Retrieval 10.3 (2007), pp. 321–339.

[96] Stephen E Robertson and Steve Walker. “Some simple effective approximations to the 2-poisson model for probabilistic weighted retrieval”. In: Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval. Springer-Verlag New York, Inc. 1994, pp. 232–241.

[97] Alexander M Rush, Sumit Chopra, and Jason Weston. “A neural attention model for abstractive sentence summarization”. In: arXiv preprint arXiv:1509.00685 (2015).

[98] Takeshi Sakaki, Makoto Okazaki, and Yutaka Matsuo. “Earthquake shakes Twitter users: real-time event detection by social sensors”. In: Proceedings of the 19th international conference on World wide web. ACM. 2010, pp. 851–860.

[99] Gerard Salton. Automatic text processing: the transformation, analysis, and retrieval of information by computer. Reading, MA: Addison-Wesley, 1989.

[100] Jagan Sankaranarayanan et al. “Twitterstand: news in tweets”. In: Proceedings of the 17th ACM SIGSPATIAL international conference on advances in geographic information systems. ACM. 2009, pp. 42–51.

[101] Hassan Sayyadi, Matthew Hurst, and Alexey Maykov. “Event Detection and Tracking in Social Streams.” In: ICWSM. 2009.

[102] Ruben Sipos et al. “Temporal corpus summarization using submodular word coverage”. In: 21st ACM International Conference on Information and Knowledge Management, CIKM’12, Maui, HI, USA, October 29 - November 02, 2012. 2012, pp. 754–763.

[103] Frank Spitzer. Principles of random walk. Vol. 34. Springer Science & Business Media, 2013.

[104] Andreas Stolcke et al. “SRILM - an extensible language modeling toolkit.” In: INTERSPEECH. 2002.

[105] Jeroen B. P. Vuurens et al. “Online News Tracking for Ad-Hoc Information Needs”. In: Proceedings of the 2015 International Conference on The Theory of Information Retrieval, ICTIR 2015, Northampton, Massachusetts, USA, September 27-30, 2015. 2015, pp. 221–230.

[106] Jeroen B. P. Vuurens et al. “Online News Tracking for Ad-Hoc Queries”. In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, Santiago, Chile, August 9-13, 2015. 2015, pp. 1047–1048.

[107] Dingding Wang and Tao Li. “Document update summarization using incremental hierarchical clustering”. In: Proceedings of the 19th ACM international conference on Information and knowledge management. ACM. 2010, pp. 279–288.

[108] Peixia Wang and Wenbo Li. “ISCASIR at TREC 2015 Temporal Summarisation Track”. In: Proceedings of The Twenty-Fourth Text REtrieval Conference, TREC 2015, Gaithersburg, Maryland, USA, November 17-20, 2015. 2015.

[109] Jianshu Weng and Bu-Sung Lee. “Event Detection in Twitter.” In: ICWSM 11 (2011), pp. 401–408.

[110] Tan Xu, Douglas W Oard, and Paul McNamee. “HLTCOE at TREC 2013: Temporal Summarization.” In: Proceedings of The Twenty-Second Text REtrieval Conference, TREC 2013, Gaithersburg, Maryland, USA, November 19-22, 2013. 2013.

[111] Yingzhe Yao, Zhen Yang, and Kefeng Fan. “BJUT at TREC 2015 Temporal Summarization Track”. In: Proceedings of The Twenty-Fourth Text REtrieval Conference, TREC 2015, Gaithersburg, Maryland, USA, November 17-20, 2015. 2015.

[112] Chunyun Zhang et al. “The Information Extraction Systems of PRIS at Temporal Summarization Track.” In: Proceedings of The Twenty-Second Text REtrieval Conference, TREC 2013, Gaithersburg, Maryland, USA, November 19-22, 2013. 2013.

[113] Qiankun Zhao, Prasenjit Mitra, and Bi Chen. “Temporal and information flow based event detection from social text streams”. In: AAAI. Vol. 7. 2007, pp. 1501–1506.

[114] Yun Zhao et al. “BJUT at TREC 2014 Temporal Summarization Track”. In: Proceedings of The Twenty-Third Text REtrieval Conference, TREC 2014, Gaithersburg, Maryland, USA, November 19-21, 2014. 2014.

[115] Sheng-hua Zhong et al. “Query-oriented unsupervised multi-document summarization via deep learning model”. In: Expert Systems with Applications 42.21 (2015), pp. 8146–8155.

[116] Xiaojin Zhu. “Semi-supervised learning literature survey”. In: (2005).

