Heading-aware Snippet Generation for Web Search
Tomohiro Manabe and Keishi Tajima
Graduate School of Informatics, Kyoto Univ.
{[email protected], tajima@i}.kyoto-u.ac.jp
Transcript of tajima/papers/airs15slides.pdf

Web Search Result Snippets

• Are short summaries of web page text
• Search engine users read them and judge relevance of original pages to search intents

Example (query “jogging benefit”):
Popular exercise
... Jogging ... One benefit is to improve fitness. ... Benefit of sprint is weight loss. ... One benefit of this exercise is protection from stress.

Key Idea

• To generate snippets, search engines rank sentences
• Headings of sentences are important to rank them
  • E.g., a query “jogging”
• We propose heading-aware generation methods

[Figure: example page with the headings “Popular exercise”, “Running”, and “Jogging”, followed by the sentences “Slow running. One benefit is to improve fitness.” Callouts: a heading “Jogging”; sentences about jogging without the keyword “jogging”]


Outline of This Presentation

I. Our definition of hierarchical heading structure
II. Heading-aware snippet generation methods
III. Current evaluation result

Our definition of hierarchical heading structure
Hierarchical heading structure and its components

In Short

• Hierarchical Heading Structure is composed of
  • nested logical blocks
  • associated with headings
• Each heading describes a block topic briefly

Example page (indentation shows block nesting):
Popular exercise
  Running
    Jogging
      Slow running. One benefit is to improve fitness.
    Sprint
      Benefit of sprint is weight loss.
  Swimming
    Front Crawl
      One benefit of this exercise is protection from stress.


A Heading is

• A highly summarized topic description for a part of a web page
• Heading words are words in headings

[Figure: example page “Popular exercise” with the headings Running (Jogging, Sprint) and Swimming (Front Crawl)]


A block is

• A part of a web page
• associated with a heading
• Note that
  • An entire web page is also a block
  • There is one-to-one correspondence between headings and blocks

[Figure: example page “Popular exercise” with the blocks Running (Jogging, Sprint) and Swimming (Front Crawl)]


Hierarchical Heading Structure

• A block may include other blocks entirely
• Blocks in a page form hierarchical heading structure
  • Its root is the entire page
• Our methods focus on such structure

[Figure: example page “Popular exercise” with the blocks Running (Jogging, Sprint) and Swimming (Front Crawl)]
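The nesting described on this slide can be modelled as a tree of blocks. The following is a minimal Python sketch, not the authors' implementation: the `Block` class, its field names, and the helper functions are hypothetical, and the data is the example page used throughout the slides.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Block:
    """A part of a page associated with exactly one heading (hypothetical model)."""
    heading: str
    sentences: List[str] = field(default_factory=list)
    children: List["Block"] = field(default_factory=list)


def depth(block: Block) -> int:
    """Number of nesting levels at or below this block."""
    return 1 + max((depth(c) for c in block.children), default=0)


def count_blocks(block: Block) -> int:
    """Blocks and headings correspond one-to-one, so this also counts headings."""
    return 1 + sum(count_blocks(c) for c in block.children)


# The root block is the entire page.
page = Block("Popular exercise", children=[
    Block("Running", children=[
        Block("Jogging", ["Slow running.", "One benefit is to improve fitness."]),
        Block("Sprint", ["Benefit of sprint is weight loss."]),
    ]),
    Block("Swimming", children=[
        Block("Front Crawl", ["One benefit of this exercise is protection from stress."]),
    ]),
])
```

Keeping sentences on the block that directly contains them makes it easy to look up the headings that apply to a sentence by walking from its block up to the root.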

Hierarchical Heading Structure Extraction from Web Pages

• is NOT a trivial problem
• Throughout this presentation, the structure is given
• For evaluation, we used previously proposed method
  • Its implementation is at https://github.com/tmanabe

Manabe, T., Tajima, K.: Extracting logical hierarchical structure of HTML documents based on headings. VLDB 8(12), 1606–1617 (2015)

Heading-aware Snippet Generation Methods

Basic Snippet Generation Method

Pipeline: Input (web page) → Sentence segmentation → Sentence scoring → Sentence selection → Output (snippets)

1. Split the page into semantically coherent fragments (e.g. sentences)
2. Score the fragments
  • By using scoring functions based on query keyword occurrences
  • E.g., TFIDF and BM25
3. Select the fragments
  • In desc. order of their scores
  • Until the snippet length reaches limit
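The three steps of the basic method can be sketched end to end as follows. This is an illustrative Python sketch under stated assumptions, not the authors' code: the function names are hypothetical, segmentation is a naive punctuation split, scoring is a plain keyword count standing in for TFIDF or BM25, sentences with a zero score are skipped, and the default limit of 180 characters is borrowed from the length limit mentioned in the evaluation slides.

```python
import re
from typing import List


def segment(text: str) -> List[str]:
    """Naive segmentation: split after sentence-final punctuation (an assumption)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def score(sentence: str, query: List[str]) -> int:
    """Stand-in scoring: count query keyword occurrences (not TFIDF or BM25)."""
    words = re.findall(r"\w+", sentence.lower())
    return sum(words.count(k.lower()) for k in query)


def select(sentences: List[str], scores: List[int], limit: int) -> List[str]:
    """Scan in descending score order; take a sentence if it still fits."""
    order = sorted(range(len(sentences)), key=lambda i: -scores[i])
    chosen, used = set(), 0
    for i in order:
        if scores[i] > 0 and used + len(sentences[i]) <= limit:
            chosen.add(i)
            used += len(sentences[i])
    return [sentences[i] for i in sorted(chosen)]  # restore page order


def snippet(text: str, query: List[str], limit: int = 180) -> str:
    """Segment, score, and select, then join the chosen sentences."""
    sents = segment(text)
    scores = [score(s, query) for s in sents]
    return " ... ".join(select(sents, scores, limit))
```

Only the scoring step changes between the methods compared in these slides, so in a sketch like this `score` is the one function that would be swapped out.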


Four Methods

• All follow these steps
1. Baseline method
  • Query keywords in sentence indicate importance
2. Existing method
  • Heading words in sentence also indicate importance
3. Our method
  • Query keywords in headings also indicate importance
4. Combined method
  • All three ideas


1. Baseline Method

• Input
  • A web page

[Pipeline: Input → Segmentation → Scoring → Selection → Output]

Example input:
Popular exercise
Running
Jogging
Slow running. One benefit is to improve fitness.

1. Baseline Method

• First step
  • The method segments the page into sentences
  • NOT the main topic of our research

Example result of segmentation:
Popular exercise
Running
Jogging
Slow running.
One benefit is to improve fitness.
…

1. Baseline Method

• Second step
  • The method scores the sentences based on the number of query keywords in them
  • The main topic of our research
  • But as the baseline, we used BM25(Query keywords)

Example: the segmented sentences (“Popular exercise”, “Running”, “Jogging”, “Slow running.”, “One benefit is to improve fitness.”, …) are scored for the query “jogging benefit”

Robertson, S., Walker, S., Jones, S., Hancock-Beaulieu, M., Gatford, M.: Okapi at TREC-3. In: TREC. pp. 109-126 (1996)
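A BM25 score over sentences can be sketched as below, treating each sentence as a small document. This is a hedged illustration rather than the authors' implementation: the function name and the tokenized inputs are hypothetical, and `k1 = 1.2` and `b = 0.75` are common default values assumed here because the slides do not state the parameters.

```python
import math
from typing import List


def bm25(query: List[str], sentence: List[str], corpus: List[List[str]],
         k1: float = 1.2, b: float = 0.75) -> float:
    """BM25 score of one tokenized sentence for a tokenized query.

    corpus holds all tokenized sentences and supplies the sentence
    frequencies and the average sentence length.
    """
    n = len(corpus)
    avg_len = sum(len(s) for s in corpus) / n
    total = 0.0
    for term in set(query):
        tf = sentence.count(term)
        if tf == 0:
            continue
        sf = sum(1 for s in corpus if term in s)  # sentences containing the term
        idf = math.log((n - sf + 0.5) / (sf + 0.5))
        w = tf / (1 - b + b * len(sentence) / avg_len)  # length-normalized frequency
        total += w / (k1 + w) * idf
    return total
```

Note that this form of the inverse-frequency term turns negative when a keyword occurs in more than half of the sentences, which matters on very short pages.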

1. Baseline Method

• Third step
  • The method selects the sentences
  • It simply scans the sentences in desc. order of score
  • If there remains space to include sentence, it does so
  • NOT the main topic of our research

Example of selected sentences:
Jogging
One benefit is to improve fitness.
Benefit of sprint is weight loss.
One benefit of this exercise is protection from stress.
…

1. Baseline Method

• Output
  • Title (or URL if there is no title)
  • Snippets

Example output:
Popular exercise
... Jogging ... One benefit is to improve fitness. ... Benefit of sprint is weight loss. ... One benefit of this exercise is protection from stress.

Four Methods

1. Baseline method
  • Query keywords in sentence indicate importance
2. Existing method
  • Heading words in sentence also indicate importance
3. Our method
  • Query keywords in headings also indicate importance of sentences in their associated blocks
4. Combined method
  • All three ideas

2. Existing Method

• An existing idea of Pembe and Güngör
  • Heading words in sentences indicate importance
  • Because the heading words are expected to be important words in the block

Pembe, F.C., Güngör, T.: Structure-preserving and query-biased document summarization for web searching. Online Info. Rev. 33(4), 696-719 (2009)

2. Existing Method

• Existing method
  • counts heading words in sentences to score the sentences
  • because heading words are important for the sentences
• Heading words of a sentence
  • are words in hierarchical headings of the block including it
  • Hierarchical headings: headings of ancestor-or-self blocks

Heading Words of a Sentence

• Heading words of “Slow running.” are:
  • Popular, exercise, running, and jogging
• The sentence includes a heading word “running”
  • It is important

[Figure: example page; “Slow running.” lies in the block “Jogging”, inside “Running”, inside “Popular exercise”]
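Collecting the heading words of a sentence amounts to walking from its block up to the root. The sketch below illustrates this in Python; the `PARENT` mapping is a hypothetical encoding of the example page and the function names are not from the slides.

```python
from typing import Dict, List, Optional

# Hypothetical encoding of the example page: each heading maps to its parent heading.
PARENT: Dict[str, Optional[str]] = {
    "Popular exercise": None,
    "Running": "Popular exercise",
    "Jogging": "Running",
    "Sprint": "Running",
    "Swimming": "Popular exercise",
    "Front Crawl": "Swimming",
}


def hierarchical_headings(heading: str) -> List[str]:
    """Headings of ancestor-or-self blocks, root first."""
    chain: List[str] = []
    current: Optional[str] = heading
    while current is not None:
        chain.append(current)
        current = PARENT[current]
    return chain[::-1]


def heading_words(heading: str) -> List[str]:
    """Lower-cased words in the hierarchical headings of a block."""
    return [w.lower() for h in hierarchical_headings(heading) for w in h.split()]
```

For the block “Jogging” this yields popular, exercise, running, and jogging, so the sentence “Slow running.” shares the heading word “running”.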

2. Existing Method

• The scoring function for sentences
  • Simply, BM25(Query keywords) + BM25(Heading words)
• Problem
  • Summation of BM25 scores produces worse ranking when they count the same words
  • In other words, when heading words include some query keywords

Robertson, S., Zaragoza, H., Taylor, M.: Simple BM25 extension to multiple weighted fields. In: CIKM. pp. 42-49 (2004)

2. Existing Method

• Therefore, we split words into three types
• Sum up their scores
  • I.e., BM25(Query only) + BM25(Heading only) + BM25(Query heading)

                   Not query keywords   Query keywords
Not heading words  (Do not care)        Query-only words
Heading words      Heading-only words   Query-heading words

Three Word Types

• For “Slow running.” and a query “jogging benefit”
  • Query-heading word is jogging
  • Heading-only words are popular, exercise, running
  • Query-only word is benefit

[Figure: example page; “Slow running.” lies in the block “Jogging”, inside “Running”, inside “Popular exercise”]
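The split into three word types is a pair of set differences and an intersection. A small Python sketch, with a hypothetical function name and dictionary keys, reproduces the example on this slide:

```python
from typing import Dict, Iterable, Set


def split_word_types(query: Iterable[str], heading_words: Iterable[str]) -> Dict[str, Set[str]]:
    """Partition query keywords and heading words into the three scored types."""
    q = {w.lower() for w in query}
    h = {w.lower() for w in heading_words}
    return {
        "query_only": q - h,       # query keywords that are not heading words
        "heading_only": h - q,     # heading words that are not query keywords
        "query_heading": q & h,    # words that are both
    }


# Query "jogging benefit" and the heading words of "Slow running."
types = split_word_types(
    ["jogging", "benefit"],
    ["Popular", "exercise", "Running", "Jogging"],
)
```

Because the three sets are disjoint, summing one score per set never counts the same word twice, which is the problem the split is meant to avoid.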


2. Existing Method

• Modified output
  • Showed headings separately to improve readability
  • Headings shown iff sentences in their blocks are chosen

Example output:
Popular exercise
> Running > Sprint
Benefit of sprint is weight loss.
> Swimming > Front Crawl
One benefit of this exercise is protection from stress.

Pembe, F.C., Güngör, T.: Structure-preserving and query-biased document summarization for web searching. Online Info. Rev. 33(4), 696-719 (2009)
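The modified output format can be sketched as a small rendering function. This Python example is an assumption about how such output might be produced, not the authors' code: the `render` function and its input shape (a heading path below the page title paired with each chosen sentence) are hypothetical.

```python
from typing import List, Optional, Tuple


def render(title: str, chosen: List[Tuple[List[str], str]]) -> str:
    """Render chosen sentences with their heading paths.

    chosen holds (heading path below the title, sentence) pairs in page order.
    A heading path is printed only when a sentence from its block was chosen.
    """
    lines = [title]
    last_path: Optional[List[str]] = None
    for path, sentence in chosen:
        if path != last_path:
            if path:
                lines.append("> " + " > ".join(path))
            last_path = path
        lines.append(sentence)
    return "\n".join(lines)


output = render("Popular exercise", [
    (["Running", "Sprint"], "Benefit of sprint is weight loss."),
    (["Swimming", "Front Crawl"], "One benefit of this exercise is protection from stress."),
])
```

The heading “Jogging” does not appear in `output` because no sentence from its block is among the chosen ones.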

2. Existing Method

• Other steps are same as those of the baseline method


3. Our Method

• The existing idea
  • Heading words in sentences indicate importance of them
• Our idea
  • Query keywords in a heading indicate importance of sentences in its associated blocks

Omission of Heading Words

• Heading words are very often omitted
• Sentence “Slow running” is talking about popular exercise, running, and jogging but the heading words popular, exercise, and jogging are omitted

[Figure: example page; “Slow running.” lies in the block “Jogging”, inside “Running”, inside “Popular exercise”]

3. Our Method

• Takes the omission of heading words into account
• Assigns high scores to sentences including query keywords within either
  • Sentences themselves
  • Their hierarchical headings

3. Our Method

• Sentence scoring modified in a different way
• Each sentence comprises two fields
  • Contents of the sentence itself
  • Its hierarchical headings
• We use BM25F
  • Scoring function for documents comprising multiple fields

\[ \mathrm{BM25F} = \sum_{\kappa \in q} \frac{w(\kappa, S)}{k_1 + w(\kappa, S)} \log \frac{N - \mathrm{sf}(\kappa) + 0.5}{\mathrm{sf}(\kappa) + 0.5} \]

\[ w(\kappa, S) = \sum_{f} \frac{\mathrm{occurs}(\kappa, f, S) \cdot \mathrm{boost}_f}{(1 - b) + b \cdot \frac{\mathrm{length}(f, S)}{\mathrm{avgLength}(f)}} \]
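The BM25F formula on this slide can be turned into a short function over sentences with two fields. The Python sketch below is illustrative only: the field names `"body"` and `"headings"`, the `boost` values passed by the caller, and the defaults `k1 = 1.2` and `b = 0.75` are assumptions, and `sf` is read here as the number of sentences containing the keyword in any field, which the slides do not spell out.

```python
import math
from typing import Dict, List

Sentence = Dict[str, List[str]]  # field name -> tokens, e.g. "body" and "headings"


def bm25f(query: List[str], sent: Sentence, corpus: List[Sentence],
          boost: Dict[str, float], k1: float = 1.2, b: float = 0.75) -> float:
    """BM25F score of a multi-field sentence; assumes no field is empty everywhere."""
    n = len(corpus)
    avg_len = {f: sum(len(s[f]) for s in corpus) / n for f in boost}
    total = 0.0
    for term in set(query):
        # w(term, S): boosted, length-normalized occurrences summed over fields
        w = 0.0
        for f, weight in boost.items():
            norm = 1 - b + b * len(sent[f]) / avg_len[f]
            w += sent[f].count(term) * weight / norm
        if w == 0.0:
            continue
        # sf: sentences containing the term in any field (an interpretation)
        sf = sum(1 for s in corpus if any(term in s[f] for f in boost))
        total += w / (k1 + w) * math.log((n - sf + 0.5) / (sf + 0.5))
    return total
```

With this scoring, a sentence such as “Slow running.” can receive a positive score for the keyword “jogging” through its headings field even though the keyword is absent from the sentence itself.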

Query Keywords in Headings

• The sentence “Slow running.”
  • Does NOT include a query keyword “jogging” in the sentence itself
  • Includes “jogging” in the hierarchical headings of the sentence

[Figure: example page; “Slow running.” lies in the block “Jogging”, inside “Running”, inside “Popular exercise”]

Query Keywords in Headings

• The sentence “One benefit is…”
  • Includes a query keyword “jogging” in the hierarchical headings
  • Includes another query keyword “benefit” in the sentence itself
  • Important for a query “jogging benefit”

[Figure: example page; “One benefit is to improve fitness.” lies in the block “Jogging”, inside “Running”, inside “Popular exercise”]

3. Our Method

• Other steps are same as those of the existing method


4. Combined Method

• The two ideas are independent
• The combined method adopts both
• The scoring function is:
  BM25F(Query only) + BM25F(Heading only) + BM25F(Query heading)

4. Combined Method

• Other steps are same as those of the existing and our methods


Current Evaluation Result
On web search

Evaluation Methodology

• The most important feature of snippets: Judgeability
  • To what extent snippets help to judge relevance of pages
• In the INEX snippet retrieval track
  • Relevance judgments under different conditions are compared
    • Based on the entire documents
    • Only based on their snippets
  • If they agree, the snippets provide high judgeability and the snippet generation method is effective
• Length limit of snippets: 180 letters for a page

Trappett, M., Geva, S., Trotman, A., Scholer, F., Sanderson, M.: Overview of the INEX 2013 snippet retrieval track. In: CLEF (2013)

Data Set

• Target of INEX is XML while our target is web
• We used data set for TREC 2014 web track ad-hoc task
  • 50 keyword queries
  • ClueWeb12 document collection
  • Relevance judgement based on the entire pages
• We used only subset of the collection
  • Top-20 pages for each query (1,000 in total) from baseline
  • Generated by Indri with Waterloo spam filter

User Experiment

• To obtain snippet-based relevance judgment
• With 4 participants
• In each period, each participant is required to:
  1. Read intent description behind a query
  2. Scan titles and snippets of top-20 search result items
  3. Judge whether each page is relevant by only them
• Each snippet was judged once; each participant judged a page once and used all methods evenly

Evaluation Measures

• From INEX
• Recall
  • |Pages correctly judged as relevant on their snippets| / |Pages relevant as a whole|
• Negative recall
  • |Pages correctly judged as irrelevant on their snippets| / |Pages irrelevant as a whole|
• Geometric mean of them
  • \sqrt{\text{Recall} \cdot \text{Negative recall}}
  • Primary evaluation measure
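The three measures can be computed from paired judgments as in the following Python sketch. The function name and input shape are hypothetical, and returning 0.0 when a query has no relevant or no irrelevant pages is an assumption, since the slides do not say how that case is handled.

```python
import math
from typing import List, Tuple


def judgeability(pairs: List[Tuple[bool, bool]]) -> Tuple[float, float, float]:
    """Return (recall, negative recall, geometric mean).

    Each pair is (relevant when judged on the whole page,
                  relevant when judged on the snippet only).
    """
    relevant = [s for p, s in pairs if p]
    irrelevant = [s for p, s in pairs if not p]
    # Fraction of truly relevant pages also judged relevant from the snippet.
    recall = sum(relevant) / len(relevant) if relevant else 0.0
    # Fraction of truly irrelevant pages also judged irrelevant from the snippet.
    neg_recall = sum(not s for s in irrelevant) / len(irrelevant) if irrelevant else 0.0
    return recall, neg_recall, math.sqrt(recall * neg_recall)
```

The result tables report mean scores, so presumably the geometric mean is averaged over queries, which would explain why that column is not simply the square root of the product of the other two columns.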

General Queries

• Comparison of mean evaluation scores

Method    Recall  Negative recall  Geometric mean
Baseline  0.475   0.828            0.512
Existing  0.373   0.780            0.386
Ours      0.438   0.777            0.456
Combined  0.396   0.776            0.401


Single Queries

• TREC splits queries into several types
• Single queries have clear and focused intents

Method    Recall  Negative recall  Geometric mean
Baseline  0.431   0.837            0.488
Existing  0.336   0.804            0.392
Ours      0.375   0.816            0.443
Combined  0.491   0.795            0.530


Query Length and Geometric Mean Scores

• Users need pages including all keywords in relation
• Finding such pages gets more difficult for longer queries
• Headings may help indicating relation

Method    2 keywords  3 keywords  4+ keywords
Baseline  0.585       0.503       0.394
Existing  0.406       0.387       0.350
Ours      0.543       0.378       0.362
Combined  0.393       0.388       0.425


Conclusion

• Introduced a new idea for heading-aware snippet generation
  • Query keywords in headings of sentences indicate importance of them
• Compared baseline and 3 heading-aware generation methods
• Evaluation result indicated that heading-aware methods were:
  • Not effective for general queries
  • Effective only for queries:
    • representing their intents clearly
    • containing four or more keywords
• Additional evaluation with more queries is needed

