Searching the Temporal Web: Challenges and Current Approaches



Nattiya Kanhabua

IR Group, University of Glasgow

4 February 2013

Outline

• Evolution of the Web

• Temporal IR system

• Current Approaches

– Content Analysis

– Query Analysis

– Retrieval and Ranking

• Open Issues

2

Evolution of the Web

• The Web changes over time in many aspects:

– Size: web pages are added/deleted all the time

– Content: web pages are edited/modified

– Query: users’ information needs change

[Ke et al., CN 2006; Risvik et al., CN 2002] [Dumais, SIAM-SDM 2012; WebDyn 2010] 3

Web and Index Sizes

[Timeline figure, 1995 to 2012]

– 2000: First billion-URL index; the world’s largest; ≈5000 PCs in clusters

– 2004: Index grows to 4.2 billion pages

– 2008: Google counts 1 trillion unique URLs

– 2009: TBs or PBs of data/index; tens of thousands of PCs

http://www.worldwidewebsize.com/

Impacts: crawling, indexing, and caching

4

Content Dynamics

• WayBack Machine

– Web archive search by the Internet Archive

5

Content Dynamics

[Figure: archived page snapshots from 1998, 2006, and 2012]

Impacts: document representation and retrieval

6

Query Dynamics

• Search queries exhibit temporal patterns

– Spikes or seasonality

Impacts: search intent and query representation

http://www.google.com/insights/search/

7

Temporal IR System

8

Time-sensitive Queries

• Represent temporal information needs

– E.g., 2006 FIFA World Cup, Thailand tsunami

• Queries and their relevance depend on time

– Query popularities change over time

– Documents are about events at a particular time

9

Time Distribution of Qrel

[Figure: time distributions for a recency query, a time-sensitive query, and a time-insensitive query]

[Li et al., CIKM 2003]

10

Query/Document Matching

[Diagram: time-sensitive queries are matched against annotated documents from the temporal web]

• Determining search intent of a time-sensitive query

– Term: {Germany, World, Cup}

– Time: {06/2006, 07/2006}

• Semantic annotation of documents

– Term: {w1, w2, …, wn}

– Time: {PubTime(di), ContentTime(di)}

• Matching yields the retrieved results, e.g., D2006

11

Content Analysis

Two Time Aspects

Two time dimensions

1. Publication or modified time

2. Content or event time

[Figure: an example document with its content time and publication time marked]

13

Document Dating

Problem Statements

• Difficult to find the trustworthy time for web documents

– Time gap between crawling and indexing

– Decentralization and relocation of web documents

– No standard metadata for time/date

[Illustration: “I found a bible-like document, but I have no idea when it was created.” “Let me see… This document was probably written in 850 A.C. with 95% confidence.”]

“For a given document with uncertain timestamp, can the contents be used to determine the timestamp with a sufficiently high confidence?”

14

Current Approaches

1. Content-based

2. Link-based

3. Hybrid

15

Content-based Approach

Temporal Language Models

• Based on the statistical usage of words over time

• Compare each word of a non-timestamped document with a reference corpus

• Tentative timestamp: the time partition that overlaps most in word usage

Partition   Word         Freq
1999        tsunami      1
1999        Japan        1
1999        tidal wave   1
2004        tsunami      1
2004        Thailand     1
2004        earthquake   1

A non-timestamped document: tsunami, Thailand

16

[de Jong et al., AHC 2005; Kanhabua et al., ECDL 2008]


Content-based Approach

Temporal Language Models: Similarity Scores

A non-timestamped document: tsunami, Thailand

• Score(1999) = 1

• Score(2004) = 1 + 1 = 2

• Most likely timestamp is 2004

[de Jong et al., AHC 2005; Kanhabua et al., ECDL 2008]

20
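The toy example from the slides above can be reproduced with a few lines of Python. This is only a sketch of the word-overlap idea, not the authors' implementation; the names `REFERENCE`, `build_models`, and `overlap_scores` are made up for illustration, and the score is a plain sum of frequencies rather than a proper language-model probability.

```python
from collections import defaultdict

# Toy reference corpus from the slide: (time partition, word, frequency).
REFERENCE = [
    (1999, "tsunami", 1),
    (1999, "Japan", 1),
    (1999, "tidal wave", 1),
    (2004, "tsunami", 1),
    (2004, "Thailand", 1),
    (2004, "earthquake", 1),
]


def build_models(rows):
    """Group word frequencies by time partition."""
    models = defaultdict(dict)
    for partition, word, freq in rows:
        models[partition][word] = models[partition].get(word, 0) + freq
    return dict(models)


def overlap_scores(doc_words, models):
    """Score each partition by summing the frequencies of the document's words."""
    return {
        partition: sum(freqs.get(word, 0) for word in doc_words)
        for partition, freqs in models.items()
    }


models = build_models(REFERENCE)
scores = overlap_scores(["tsunami", "Thailand"], models)
best = max(scores, key=scores.get)
print(scores, best)  # {1999: 1, 2004: 2} 2004
```

The partition with the highest score is taken as the tentative timestamp, which matches Score(2004) = 2 on the slide.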

Normalized Log-likelihood Ratio

• Variant of Kullback-Leibler divergence

• Similarity of a document and time partitions

• C is the background model estimated on the corpus

• Linear interpolation smoothing to avoid the zero probability of unseen words

[Kraaij, SIGIR Forum 2005] 21
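The formula itself did not survive in this transcript. A commonly used form of the score, assumed here, is NLLR(d, p) = Σ_w P(w|d) · log(P(w|p) / P(w|C)), where p is a time partition and C the background corpus. The sketch below follows that form; the smoothing weight `lam`, the function name `nllr`, and the toy counts are illustrative assumptions, not values from the talk.

```python
import math
from collections import Counter

# Toy partitions reused from the content-based example on the earlier slides.
PARTITIONS = {
    1999: Counter({"tsunami": 1, "Japan": 1, "tidal wave": 1}),
    2004: Counter({"tsunami": 1, "Thailand": 1, "earthquake": 1}),
}
# Background model C: word counts over the whole reference corpus.
CORPUS = sum(PARTITIONS.values(), Counter())


def nllr(doc_words, partition_counts, corpus_counts, lam=0.9):
    """Assumed NLLR form: sum over w of P(w|d) * log(P(w|p) / P(w|C))."""
    doc = Counter(doc_words)
    doc_len = sum(doc.values())
    part_len = sum(partition_counts.values())
    corpus_len = sum(corpus_counts.values())
    score = 0.0
    for word, count in doc.items():
        p_w_c = corpus_counts.get(word, 0) / corpus_len
        if p_w_c == 0.0:
            continue  # word never seen in the corpus: it carries no evidence
        p_w_d = count / doc_len
        # Linear interpolation with the background model avoids log(0).
        p_w_p = lam * partition_counts.get(word, 0) / part_len + (1 - lam) * p_w_c
        score += p_w_d * math.log(p_w_p / p_w_c)
    return score


scores = {year: nllr(["tsunami", "Thailand"], counts, CORPUS)
          for year, counts in PARTITIONS.items()}
print(max(scores, key=scores.get))  # 2004
```

Words that are equally likely in a partition and in the background contribute nothing, so "tsunami" is neutral here and "Thailand" decides the outcome.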

Link-based Approach

• Dating a document using its neighbors

1. Web pages linking to the document (incoming links)

2. Web pages pointed to by the document (outgoing links)

3. Media assets associated with the document (e.g., images)

• Averaging the last-modified dates of its neighbors as the timestamp

[Nunes et al., WIDM 2007] 22
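A minimal sketch of the averaging step, assuming the last-modified dates of the neighbors have already been collected (for instance from HTTP headers). The function name `estimate_timestamp` and the sample dates are hypothetical.

```python
from datetime import datetime, timezone


def estimate_timestamp(neighbor_dates):
    """Average the known last-modified dates of neighbors; None if none is known."""
    known = [d.timestamp() for d in neighbor_dates if d is not None]
    if not known:
        return None
    return datetime.fromtimestamp(sum(known) / len(known), tz=timezone.utc)


neighbors = [
    datetime(2007, 1, 1, tzinfo=timezone.utc),  # a page linking to the document
    datetime(2007, 1, 3, tzinfo=timezone.utc),  # a page the document links to
    None,                                       # an image without a usable date
]
estimate = estimate_timestamp(neighbors)
print(estimate.date())  # 2007-01-02
```

Neighbors without a date are skipped rather than treated as zero, so a sparse neighborhood degrades the estimate gracefully instead of skewing it.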

Hybrid Approach

• Inferring timestamps using machine learning

– Exploit links and the contents of a web page and its neighbors

– Features: linguistic, position, page formats, and tags

[Chen et al., SIGIR 2010] 23

Content Time Extraction

• Three types of temporal expressions

1. Explicit: time mentions being mapped directly to a time point or interval, e.g., “July 4, 2012”

2. Implicit: imprecise time point or interval, e.g., “Independence Day 2012”

3. Relative: resolved to a time point or interval using other types or the publication date, e.g., “next month”

[Alonso et al., SIGIR Forum 2007] 24
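To make the three types concrete, here is a small sketch that handles one pattern of each kind. It is an illustration only: real temporal taggers cover far more patterns, and the `NAMED_DAYS` lookup, the regular expressions, and the `extract` function are assumptions made for this example.

```python
import re
from datetime import date

MONTHS = {name: i for i, name in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"], start=1)}

# Hypothetical lookup of named days; a real tagger would need a larger resource.
NAMED_DAYS = {"Independence Day": (7, 4)}

EXPLICIT = re.compile(r"\b(" + "|".join(MONTHS) + r") (\d{1,2}), (\d{4})\b")
IMPLICIT = re.compile(r"\b(" + "|".join(NAMED_DAYS) + r") (\d{4})\b")


def extract(text, publication_date):
    """Return (type, surface form, resolved value) triples found in the text."""
    found = []
    for m in EXPLICIT.finditer(text):
        resolved = date(int(m.group(3)), MONTHS[m.group(1)], int(m.group(2)))
        found.append(("explicit", m.group(0), resolved))
    for m in IMPLICIT.finditer(text):
        month, day = NAMED_DAYS[m.group(1)]
        found.append(("implicit", m.group(0), date(int(m.group(2)), month, day)))
    if re.search(r"\bnext month\b", text):
        # Relative: resolved against the publication date, at month granularity.
        year, month = publication_date.year, publication_date.month + 1
        if month > 12:
            year, month = year + 1, 1
        found.append(("relative", "next month", (year, month)))
    return found


text = "Fireworks on July 4, 2012, as on Independence Day 2012; more next month."
results = extract(text, date(2012, 12, 15))
for item in results:
    print(item)
```

The relative expression is the only one that needs the publication date, which is why the two time dimensions (publication time and content time) have to be kept together.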

Identifying Relevant Time

• How to determine which temporal expressions tagged in a document are relevant?

– Not all temporal expressions associated with an event are equally relevant

– E.g., “Reported by World Health Organization (WHO) on 29 July 2012 about an ongoing Ebola outbreak in Uganda since the beginning of July 2012”

25

Current Approaches

1. Ranking temporal expressions using different features

2. Identifying relevant time as a classification problem

– Two classes: (1) relevant and (2) irrelevant

– Definition: a relevant expression refers to the starting, ending, or ongoing time of the event

[Strötgen et al., TempWeb 2012; Kanhabua et al., TAIA 2012] 26

Query Analysis

Determining Time of Queries

• Two types of temporal queries:

1. Explicit: time is provided, e.g., “Presidential election 2012”

2. Implicit: time is not provided, e.g., “Germany World Cup”

– Temporal intent can be implicitly inferred

• Previous studies on temporal queries:

– 1.5% of web queries are explicit

– ~7% of web queries are implicit

[Nunes et al., ECIR 2008; Metzler et al., SIGIR 2009] 28

Current Approaches

1. Query log analysis

2. Analyzing top-k documents

29

Query Log Analysis

• Features from query logs

– Analyze query frequencies over time for identifying the relevant time of queries

• Time-series analysis

– Leverage time-series decomposition for detecting seasonal queries

[Metzler et al., SIGIR 2009; Shokouhi, SIGIR 2011] 30

Time-series Decomposition

[Figure: time-series decompositions for the queries “Easter” and “World cup”]

31
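The idea can be sketched with a classical additive decomposition of a query's frequency series into trend, seasonal, and residual parts. This is a simplified stand-in, not the method of the cited papers; the synthetic series, the function names, and the "strength" measure (share of detrended variance explained by the seasonal part) are assumptions for illustration.

```python
from statistics import pvariance


def decompose(series, period):
    """Classical additive decomposition: returns (detrended, seasonal, residual)."""
    n = len(series)
    if n < 2 * period:
        raise ValueError("need at least two full periods of data")
    half = period // 2
    detrended = {}
    for i in range(half, n - half):
        window = series[i - half:i + half + 1]
        total = sum(window)
        if period % 2 == 0:
            # Even period: the two end points get half weight (2 x period average).
            total -= 0.5 * (window[0] + window[-1])
        detrended[i] = series[i] - total / period
    # Seasonal component: mean detrended value per phase, centred around zero.
    by_phase = {}
    for i, value in detrended.items():
        by_phase.setdefault(i % period, []).append(value)
    phase_mean = {p: sum(v) / len(v) for p, v in by_phase.items()}
    offset = sum(phase_mean.values()) / len(phase_mean)
    seasonal = {p: m - offset for p, m in phase_mean.items()}
    residual = {i: v - seasonal[i % period] for i, v in detrended.items()}
    return detrended, seasonal, residual


def seasonal_strength(series, period):
    """Share of the detrended variance explained by the seasonal component."""
    detrended, _, residual = decompose(series, period)
    total = pvariance(list(detrended.values()))
    if total == 0:
        return 0.0
    return max(0.0, 1.0 - pvariance(list(residual.values())) / total)


# Synthetic monthly counts over four years (made-up numbers).
yearly_spike = [11 if month % 12 == 3 else 1 for month in range(48)]
one_off_spike = [11 if month == 20 else 1 for month in range(48)]
print(seasonal_strength(yearly_spike, 12) > seasonal_strength(one_off_spike, 12))  # True
```

A query that spikes in the same month every year yields a strength close to 1, while a single event-driven spike leaves most of the variance in the residual.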

Analyzing Top-k Documents

• Using temporal language models

– Determine time of queries when no time is given explicitly

• Exploiting time from search snippets

– Extract temporal expressions (i.e., years) from the contents of top-k retrieved web snippets for a given query

[Kanhabua et al., ECDL 2010; Campos et al., CIKM 2012] 32
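The snippet-based idea can be sketched as counting year mentions across the top-k snippets and keeping the most frequent ones. This is a simple frequency baseline under stated assumptions, not the cited methods; the snippets, the year pattern, and `candidate_years` are invented for the example.

```python
import re
from collections import Counter

# Four-digit years from 1000 to 2099; an assumption that suits web snippets.
YEAR = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")


def candidate_years(snippets, top_n=2):
    """Rank years by the number of snippets that mention them."""
    counts = Counter()
    for snippet in snippets:
        counts.update(set(YEAR.findall(snippet)))  # count each year once per snippet
    return [int(year) for year, _ in counts.most_common(top_n)]


snippets = [
    "Germany hosted the FIFA World Cup in 2006.",
    "2006 World Cup: Italy beat France in Berlin.",
    "Germany won the World Cup in 2014 and hosted it in 2006.",
]
print(candidate_years(snippets))  # [2006, 2014]
```

For the implicit query "Germany World Cup", the top-ranked year could then be attached to the query as its inferred time.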

Matching: Re-visited

[Diagram: the query/document matching pipeline revisited. The time-sensitive query (Term: {Germany, World, Cup}; Time: {06/2006, 07/2006}) is matched against annotated documents (Term: {w1, w2, …, wn}; Time: {PubTime(di), ContentTime(di)}), and the retrieved results D2006 are then turned into ranked results.]

33

Retrieval and Ranking

Searching the Past

• Searching documents created/edited over time (“temporal document collections”)

– E.g., web archives, news archives, blogs, or emails

– A journalist wants to write a timeline of a news article

– A Wikipedia contributor searches for historical information about an entity of interest

• Example: retrieve documents about Pope Benedict XVI written before 2005

• Term-based IR approaches may give unsatisfactory results

35

Challenges

• Time must be explicitly modeled in order to increase the effectiveness of ranking

– To order search results so that the most relevant ones are ranked higher

• Time uncertainty should be taken into account

– Two temporal expressions can refer to the same time period even though they are written differently

– E.g., for the query “Independence Day 2011”, a retrieval model relying on term matching only will fail to retrieve documents mentioning “July 4, 2011”

36

Query/Document Models

• A temporal query consists of:

– Query keywords

– Temporal expressions

• A document consists of:

– Terms, i.e., bag-of-words

– Publication time and temporal expressions

37

Temporal Query Examples


[Berberich et al., ECIR 2010]

38

Time-aware Ranking

• Two main approaches

1. Mixture model

– Linearly combining textual and temporal similarity

2. Probabilistic model

– Generating a query from the textual part and temporal part of a document independently

[Li et al., CIKM 2003; Baeza-Yates, SIGIR Forum 2005]

[Berberich et al., ECIR 2010; Kanhabua et al., ECDL 2010] 39

Mixture Model

• Linearly combine textual and temporal similarity

– α indicates the relative importance of the two similarity scores

• Both scores are normalized before combining

– Textual similarity can be determined using any term-based retrieval model, e.g., tf.idf or a unigram language model

40 [Li et al., CIKM 2003; Baeza-Yates, SIGIR Forum 2005]

[Kanhabua et al., ECDL 2010]

Mixture Model


How to determine temporal similarity? 41

Temporal Similarity

[Figure: similarity score (y-axis) versus time distance (x-axis), with documents d1 and d2 marked]

42
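A minimal sketch of the mixture model, assuming the combination S(q, d) = (1 − α) · S_text + α · S_time and a temporal similarity that decays exponentially with time distance. The decay form, the max-normalization, and all names and numbers are assumptions for illustration; the slides do not fix them.

```python
def temporal_similarity(query_year, doc_year, decay_rate=0.5, unit=1.0):
    """Assumed decay: 1.0 at zero distance, shrinking as the distance grows."""
    return decay_rate ** (abs(query_year - doc_year) / unit)


def normalize(scores):
    """Scale scores into [0, 1] by dividing by the maximum."""
    top = max(scores.values())
    return {doc: (score / top if top > 0 else 0.0) for doc, score in scores.items()}


def mixture_rank(text_scores, doc_years, query_year, alpha=0.5):
    """Rank documents by (1 - alpha) * textual + alpha * temporal similarity."""
    text = normalize(text_scores)
    time = normalize({doc: temporal_similarity(query_year, year)
                      for doc, year in doc_years.items()})
    combined = {doc: (1 - alpha) * text[doc] + alpha * time[doc] for doc in text}
    return sorted(combined, key=combined.get, reverse=True)


# Hypothetical textual scores (e.g., from tf.idf) and document times.
text_scores = {"d1": 2.0, "d2": 1.8}
doc_years = {"d1": 2010, "d2": 2006}
print(mixture_rank(text_scores, doc_years, query_year=2006, alpha=0.0))  # ['d1', 'd2']
print(mixture_rank(text_scores, doc_years, query_year=2006, alpha=0.5))  # ['d2', 'd1']
```

With α = 0 the ranking is purely textual; raising α lets the document closer in time to the query overtake a slightly better textual match.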

Probabilistic Model

• Assume that temporal expressions in the query are generated independently from a two-step generative model:

– P(tq|td) can be estimated as the likelihood of each time interval tq that td can refer to

– Linear interpolation smoothing is applied to eliminate zero probabilities, i.e., for a temporal expression tq unseen in d

43

[Berberich et al., ECIR 2010]
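The equation is missing from this transcript. The sketch below assumes the usual two-step reading: each query interval tq is generated by first picking a temporal expression td from the document uniformly at random and then generating tq from td, and the per-interval probabilities are multiplied. It uses the simplest exact-match estimate for P(tq|td); the collection-level smoothing, the weight `lam`, and all names are assumptions.

```python
def p_exact(tq, td):
    """Exact-match estimate of P(tq|td): 1 if the intervals are identical, else 0."""
    return 1.0 if tq == td else 0.0


def p_interval(tq, doc_times, collection_times, lam=0.25):
    """Smoothed P(tq|d): document estimate interpolated with a collection estimate."""
    def average(times):
        return sum(p_exact(tq, t) for t in times) / len(times) if times else 0.0
    return (1 - lam) * average(doc_times) + lam * average(collection_times)


def p_query_time(query_times, doc_times, collection_times, lam=0.25):
    """Query intervals are generated independently, so their probabilities multiply."""
    prob = 1.0
    for tq in query_times:
        prob *= p_interval(tq, doc_times, collection_times, lam)
    return prob


# Intervals as (begin year, end year); toy documents and query.
d1 = [(2011, 2011), (2010, 2010)]
d2 = [(1999, 1999)]
collection = d1 + d2
query = [(2011, 2011)]
p1 = p_query_time(query, d1, collection)
p2 = p_query_time(query, d2, collection)
print(p1 > p2 > 0)  # True
```

The exact-match estimate ignores uncertainty: an uncertainty-aware variant would instead give credit to document intervals that merely overlap the query interval.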

Comparison of Time-aware Ranking

• Five time-aware ranking models

– LMT [Berberich et al., ECIR 2010]

– LMTU [Berberich et al., ECIR 2010]

– TS [Kanhabua et al., ECDL 2010]

– TSU [Kanhabua et al., ECDL 2010]

– FuzzySet [Kalczynski et al., Inf. Process. 2005]

[Kanhabua et al., SIGIR 2011a]

44

Open Issues

Named Entity Evolution

Problem Statements

• Queries of named entities (people, companies, places)

– Highly dynamic in appearance, i.e., relationships between terms change over time

– E.g., changes of roles, name alterations, or semantic shift

Scenario 1

Query: “Pope Benedict XVI” and written before 2005

Documents about “Joseph Alois Ratzinger” are relevant

Scenario 2

Query: “Hillary R. Clinton” and written from 1997 to 2002

Documents about “New York Senator” and “First Lady of the United States” are relevant

46

Examples of Name Changes

47 QUEST Demo: http://research.idi.ntnu.no/wislab/quest/
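The two scenarios from the named entity evolution slide can be sketched as query expansion with time-based synonyms. The table below is hand-written for illustration, with validity years filled in from general knowledge; the approaches cited on the following slide mine such mappings automatically, and `SYNONYMS` and `expand_query` are hypothetical names.

```python
# Hypothetical synonym table: entity -> (alternative name, valid from, valid to).
# None means the period is open on that side.
SYNONYMS = {
    "Pope Benedict XVI": [("Joseph Alois Ratzinger", None, 2005)],
    "Hillary R. Clinton": [
        ("First Lady of the United States", 1993, 2001),
        ("New York Senator", 2001, 2009),
    ],
}


def expand_query(entity, start, end):
    """Return the entity plus every synonym whose validity overlaps [start, end]."""
    terms = [entity]
    for name, valid_from, valid_to in SYNONYMS.get(entity, []):
        begins_in_time = valid_from is None or valid_from <= end
        ends_in_time = valid_to is None or valid_to >= start
        if begins_in_time and ends_in_time:
            terms.append(name)
    return terms


print(expand_query("Pope Benedict XVI", 1990, 2004))
print(expand_query("Hillary R. Clinton", 1997, 2002))
```

Restricting synonyms to the queried period keeps the expansion from adding names that were not yet, or no longer, in use at that time.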

Current Approaches

• Temporal co-occurrence

• Mining Wikipedia revisions

• Temporal association rule mining

48 [Berberich et al., WebDB 2009; Kanhabua et al., JCDL 2010]

[Kaluarachchi et al., CIKM 2010; Tahmasebi et al., COLING 2012]

Query Prediction Problems

1. Performance prediction

– Predict the retrieval effectiveness with respect to a ranking model

[Diagram: given a query, predict precision = ?, recall = ?, MAP = ?]

[Cronen-Townsend et al., SIGIR 2002; Diaz et al., SIGIR 2004]

[Hauff et al., ECIR 2010; Carmel et al., 2010; Kanhabua et al. SIGIR 2011b] 49

Query Prediction Problems

1. Performance prediction

– Predict the retrieval effectiveness with respect to a ranking model

2. Retrieval model prediction

– Predict the retrieval model that is most suitable

[Diagram: given a query, predict ranking = ? that gives max(precision), max(recall), max(MAP)]

[Peng et al., ECIR 2010; Kanhabua et al., SIGIR 2012] 50

References

• [Alonso et al., SIGIR Forum 2007] Omar Alonso, Michael Gertz, Ricardo A. Baeza-Yates: On the value of temporal information in information retrieval. SIGIR Forum 41(2): 35-41 (2007)

• [Baeza-Yates, SIGIR Forum 2005] Ricardo A. Baeza-Yates: Searching the future. SIGIR workshop MF/IR 2005

• [Berberich et al., WebDB 2009] Klaus Berberich, Srikanta J. Bedathur, Mauro Sozio, Gerhard Weikum: Bridging the Terminology Gap in Web Archive Search. WebDB 2009

• [Berberich et al., ECIR 2010] Klaus Berberich, Srikanta J. Bedathur, Omar Alonso, Gerhard Weikum: A Language Modeling Approach for Temporal Information Needs. ECIR 2010: 13-25

• [Campos et al., CIKM 2012] Ricardo Campos, Gaël Dias, Alípio Jorge, Celia Nunes: GTE: A Distributional Second-Order Co-Occurrence Approach to Improve the Identification of Top Relevant Dates in Web Snippets. CIKM 2012

• [Carmel et al., 2010] David Carmel, Elad Yom-Tov: Estimating the Query Difficulty for Information Retrieval. Morgan & Claypool Publishers 2010

• [Chen et al., SIGIR 2010] Zhumin Chen, Jun Ma, Chaoran Cui, Hongxing Rui, Shaomang Huang: Web page publication time detection and its application for page rank. SIGIR 2010: 859-860

• [Cronen-Townsend et al., SIGIR 2002] Stephen Cronen-Townsend, Yun Zhou, W. Bruce Croft: Predicting query performance. SIGIR 2002: 299-306

• [Diaz et al., SIGIR 2004] Fernando Diaz, Rosie Jones: Using temporal profiles of queries for precision prediction. SIGIR 2004: 18-24

• [Dumais, SIAM-SDM 2012] Susan T. Dumais: Temporal Dynamics and Information Retrieval. SIAM-SDM 2012

51

References (cont.)

• [Hauff et al., ECIR 2010] Claudia Hauff, Leif Azzopardi, Djoerd Hiemstra, Franciska de Jong: Query Performance Prediction: Evaluation Contrasted with Effectiveness. ECIR 2010: 204-216

• [de Jong et al., AHC 2005] Franciska de Jong, Henning Rode, Djoerd Hiemstra: Temporal language models for the disclosure of historical text. AHC 2005: 161-168

• [Kaluarachchi et al., CIKM 2010] Amal Chaminda Kaluarachchi, Aparna S. Varde, Srikanta J. Bedathur, Gerhard Weikum, Jing Peng, Anna Feldman: Incorporating terminology evolution for query translation in text retrieval with association rules. CIKM 2010: 1789-1792

• [Kalczynski et al., Inf. Process. 2005] Pawel Jan Kalczynski, Amy Chou: Temporal Document Retrieval Model for business news archives. Inf. Process. Manage. 41(3): 635-650 (2005)

• [Kanhabua et al., JCDL 2010] Nattiya Kanhabua, Kjetil Nørvåg: Exploiting time-based synonyms in searching document archives. JCDL 2010: 79-88

• [Kanhabua et al., ECDL 2010] Nattiya Kanhabua, Kjetil Nørvåg: Determining Time of Queries for Re-ranking Search Results. ECDL 2010: 261-272

• [Kanhabua et al., SIGIR 2011a] Nattiya Kanhabua, Kjetil Nørvåg: A comparison of time-aware ranking methods. SIGIR 2011: 1257-1258

• [Kanhabua et al., SIGIR 2011b] Nattiya Kanhabua, Kjetil Nørvåg: Time-based query performance predictors. SIGIR 2011: 1181-1182

• [Kanhabua et al., SIGIR 2012] Nattiya Kanhabua, Klaus Berberich, Kjetil Nørvåg: Learning to select a time-aware retrieval model. SIGIR 2012

• [Kanhabua et al., TAIA 2012] Nattiya Kanhabua, Sara Romano, Avaré Stewart: Identifying Relevant Temporal Expressions for Real-World Events. Time-aware Information Access Workshop 2012

52

References (cont.)

• [Ke et al., CN 2006] Yiping Ke, Lin Deng, Wilfred Ng, Dik Lun Lee: Web dynamics and their ramifications for the development of Web search engines. Computer Networks 50(10): 1430-1447 (2006)

• [Kraaij, SIGIR Forum 2005] Wessel Kraaij: Variations on language modeling for information retrieval. SIGIR Forum 39(1): 61 (2005)

• [Li et al., CIKM 2003] Xiaoyan Li, W. Bruce Croft: Time-based language models. CIKM 2003: 469-475

• [Nunes et al., WIDM 2007] Sérgio Nunes, Cristina Ribeiro, Gabriel David: Using neighbors to date web documents. WIDM 2007: 129-136

• [Metzler et al., SIGIR 2009] Donald Metzler, Rosie Jones, Fuchun Peng, Ruiqiang Zhang: Improving search relevance for implicitly temporal queries. SIGIR 2009: 700-701

• [Mazeika et al., CIKM 2011] Arturas Mazeika, Tomasz Tylenda, Gerhard Weikum: Entity timelines: visual analytics and named entity evolution. CIKM 2011: 2585-2588

• [Peng et al., ECIR 2010] Jie Peng, Craig Macdonald, Iadh Ounis: Learning to Select a Ranking Function. ECIR 2010: 114-126

• [Risvik et al., CN 2002] Knut Magne Risvik, Rolf Michelsen: Search engines and Web dynamics. Computer Networks 39(3): 289-302 (2002)

• [Shokouhi, SIGIR 2011] Milad Shokouhi: Detecting Seasonal Queries by Time-Series Analysis. SIGIR 2011: 1171-1172

• [Strötgen et al., SemEval 2010] Jannik Strötgen, Michael Gertz: Heideltime: High quality rule-based extraction and normalization of temporal expressions. SemEval 2010: 321-324

53

References (cont.)

• [Strötgen et al., TempWeb 2012] Jannik Strötgen, Omar Alonso, Michael Gertz: Identification of top relevant temporal expressions in documents. Temporal Web Workshop 2012

• [Tahmasebi et al., COLING 2012] Nina Tahmasebi, Gerhard Gossen, Nattiya Kanhabua, Helge Holzmann, Thomas Risse: NEER: An Unsupervised Method for Named Entity Evolution Recognition. COLING 2012

• [Verhagen et al., ACL 2005] Marc Verhagen, Inderjeet Mani, Roser Sauri, Jessica Littman, Robert Knippen, Seok Bae Jang, Anna Rumshisky, John Phillips, James Pustejovsky: Automating Temporal Annotation with TARSQI. ACL 2005

• [WebDyn 2010] Web Dynamics course: http://www.mpi-inf.mpg.de/departments/d5/teaching/ss10/dyn/, Max-Planck Institute for Informatics, Saarbrücken, Germany, 2010

• [Zhang et al., EMNLP 2010] Ruiqiang Zhang, Yuki Konda, Anlei Dong, Pranam Kolari, Yi Chang, Zhaohui Zheng: Learning Recurrent Event Queries for Web Search. EMNLP 2010: 1129-1139

54

Thank you!

55