
Survey

Jaehui Park
2008. 07. 17.


Copyright 2008 by CEBT

Introduction

Members
– Jung-Yeon Yang, Jaehui Park, Sungchan Park, Jongheum Yeon

We are interested in
– Issues in information retrieval: crawling, indexing, searching, and ranking methods
– How to process multi-term queries in information retrieval environments
  Ex) "Today", "US Today", "Today Weather", "Paris Today Weather"

Multi-term queries express more complex information needs than single-term queries.


Main Topic

Long Queries in Keyword Search

Keywords:
– Compound query, evidence combination, phrasal query, multi-term query, multiple keyword search, multiword unit, and so on.

Issues
– Proximity or distance
– Syntactic structure (order)
– Semantics
– NLP remedies
– …


Proximity

An intuitive concept for processing multiple-term queries

Readings
– Term Proximity Scoring for Keyword-Based Retrieval Systems [ECIR 2003], Yves Rasolofo and Jacques Savoy
– Efficiency vs. Effectiveness in Terabyte-Scale Information Retrieval [TREC 2005], Stefan Büttcher and Charles L. A. Clarke
– Efficient Text Proximity Search [SPIRE 2007], Ralf Schenkel et al.
– Why Bigger Windows Are Better Than Smaller Ones [TR-UM 1997], Ron Papka and James Allan


Term Proximity Scoring for Keyword-Based Retrieval Systems

Yves Rasolofo and Jacques Savoy
European Colloquium on IR Research (ECIR) 2003, LNCS 2633

2008. 07. 17.
Presented by Jaehui Park


Introduction

Phrase, term proximity, or term distance in IR

Focus on adding a word-pair scoring module
– Okapi probabilistic model + proximity measurement

Previous work
Salton & McGill [1983]
– Generating statistical phrases based on word co-occurrence
Fagan [1987]
– Considering syntactic relations or syntactic structures
Mitra et al. [1997]
– “Once a good basic ranking scheme is used, the use of phrases do not have a major effect on precision at high ranks”
Arampatzis et al. [2000]
– The lack of success when using NLP techniques in IR
Hawking & Thistlewaite [1996]
– The use of proximity scoring within the PADRE system (Z-mode method)


Okapi

Okapi [Robertson & Sparck Jones 1976]
– A function that ranks documents by their relevance to a given search query, based on the probabilistic retrieval model
– Considers term frequency and document length

The weight for a given term t_i in document d
[equation not preserved]
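One widely used form of the Okapi (BM25) document-side term weight is sketched below. The slide's own equation is not preserved in this transcript, so the symbols here are assumptions and the paper's variant may differ in detail: tf_i is the frequency of t_i in d, dl is the length of d, avdl is the average document length in the collection, and k_1 and b are tuning constants.

```latex
w_i = \frac{(k_1 + 1)\, tf_i}{K + tf_i},
\qquad
K = k_1 \left[ (1 - b) + b \, \frac{dl}{avdl} \right]
```

The weight grows with tf_i but saturates, and K penalizes documents that are longer than average.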


Okapi

Okapi [Robertson & Sparck Jones 1976] (continued)

The weight for the term t_i within a query
[equation not preserved]

The retrieval status value (RSV) of a document for a query
[equation not preserved]
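The two quantities above can be illustrated with a minimal Python sketch of a BM25-style score. It is not the slide's formula, which is not preserved here: the function name `okapi_rsv`, the idf smoothing, and the default constants `k1`, `b`, and `k3` are all assumptions taken from the commonly cited BM25 form.

```python
import math
from collections import Counter

def okapi_rsv(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len,
              k1=1.2, b=0.75, k3=1000.0):
    """Score one document against one query with a BM25-style sum.

    doc_freq maps a term to the number of documents containing it.
    All constants are illustrative defaults, not values from the slides.
    """
    tf = Counter(doc_terms)
    qtf = Counter(query_terms)
    # Length normalisation shared by all terms of this document.
    K = k1 * ((1 - b) + b * len(doc_terms) / avg_doc_len)
    score = 0.0
    for term, q_count in qtf.items():
        if tf[term] == 0 or term not in doc_freq:
            continue  # term absent from the document or the collection stats
        df = doc_freq[term]
        # Document-side weight: saturating function of term frequency.
        w = (k1 + 1) * tf[term] / (K + tf[term])
        # Query-side weight: query term frequency factor times an idf factor.
        idf = math.log((num_docs - df + 0.5) / (df + 0.5))
        qw = (k3 + 1) * q_count / (k3 + q_count) * idf
        score += w * qw
    return score
```

The retrieval status value is simply the sum, over query terms, of the document-side weight times the query-side weight, so a document that contains none of the query terms scores zero.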


Term Proximity Weighting

Improving retrieval performance by using term proximity scoring

Assumption
– If a document contains sentences with at least two query terms in them, the probability that the document is relevant must be greater.
– The closer the query terms are, the higher the relevance probability.

Objective
– Assign more importance to keywords that have a short distance between their occurrences.


Term Proximity Weighting

1. Expand the request (query) using keyword pairs extracted from the query’s wording
2. Compute a term pair instance weight
– “information retrieval”: 1.0
– “the retrieval of medical information”: 0.11 (1/9)
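Both example values are consistent with an inverse-square weight 1/d², where d is the distance in words between the two terms (d = 1 gives 1.0, d = 3 gives 1/9). The sketch below illustrates steps 1 and 2 under that reading. The function names and the `max_dist=5` window are assumptions made for illustration, not details taken from the slides.

```python
from itertools import combinations

def query_pairs(query_terms):
    """Step 1: all unordered pairs of distinct query terms."""
    unique = list(dict.fromkeys(query_terms))  # de-duplicate, keep order
    return list(combinations(unique, 2))

def pair_instance_weights(doc_terms, term_a, term_b, max_dist=5):
    """Step 2: a weight 1/d^2 for each co-occurrence within max_dist words.

    max_dist is a hypothetical distance constraint chosen for this sketch.
    """
    pos_a = [i for i, t in enumerate(doc_terms) if t == term_a]
    pos_b = [i for i, t in enumerate(doc_terms) if t == term_b]
    weights = []
    for i in pos_a:
        for j in pos_b:
            d = abs(i - j)  # distance in word positions
            if 0 < d <= max_dist:
                weights.append(1.0 / (d * d))
    return weights

# The two examples from the slide:
adjacent = pair_instance_weights("information retrieval".split(),
                                 "information", "retrieval")
spread = pair_instance_weights("the retrieval of medical information".split(),
                               "information", "retrieval")
```

Here `adjacent` holds a single weight of 1.0 and `spread` a single weight of 1/9, matching the slide's figures.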


Term Proximity Weighting

3. Sum the instance weights of all corresponding term pairs
4. Compute the contribution of all occurring term pairs in the document
5. Compute the final retrieval status value
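Steps 3 to 5 can be sketched as follows. The slide's formulas are not preserved, so the combination rule here is an assumption modeled on the Okapi weight: the summed instance weights pass through the same saturating function used for term frequency, each pair is scaled by the smaller of its two query-side weights, and the proximity score is added to the Okapi score. All names are hypothetical.

```python
def proximity_rsv(pair_weights, query_weights, doc_len, avg_doc_len,
                  k1=1.2, b=0.75):
    """Proximity contribution of one document (assumed form, see lead-in).

    pair_weights:  {(term_a, term_b): [instance weights found in the document]}
    query_weights: {term: query-side weight of that term}
    """
    # Same length normalisation as in a BM25-style term weight.
    K = k1 * ((1 - b) + b * doc_len / avg_doc_len)
    total = 0.0
    for (term_a, term_b), instances in pair_weights.items():
        s = sum(instances)                       # step 3: sum pair instances
        if s == 0:
            continue
        w_pair = (k1 + 1) * s / (K + s)          # step 4: saturating pair weight
        total += w_pair * min(query_weights[term_a], query_weights[term_b])
    return total

def final_rsv(okapi_score, proximity_score):
    """Step 5: add the proximity contribution to the base score."""
    return okapi_score + proximity_score
```

Because the proximity score is only added on top, a document with no co-occurring query term pairs keeps its original Okapi score.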


Experiments

Test collections
– TREC-8 documents (528,155 docs): Financial Times, Federal Register, Foreign Broadcast Information Service, LA Times
– TREC-9, TREC-10 (1,692,096 docs)

Experiments

Evaluation
[Three slides of evaluation results; their tables or figures are not preserved in this transcript.]

Conclusion

The impact of a new term proximity algorithm on retrieval effectiveness for keyword-based systems was examined.
– It improves the ranking of documents that have query term pairs occurring within a given distance constraint.

The term proximity scoring approach
– Improves precision after retrieving a few documents
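"Precision after retrieving a few documents" is usually measured as precision at a small cutoff k. The sketch below shows that measure under that reading; the function name and the sample document ids are hypothetical, and no figures from the paper are reproduced.

```python
def precision_at_k(ranked_doc_ids, relevant_ids, k):
    """Fraction of the top-k ranked documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_doc_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    # Divide by k (not len(top_k)) so short rankings are not rewarded.
    return hits / k

# Hypothetical ranking and relevance judgments, for illustration only.
ranking = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d9"}
p_at_5 = precision_at_k(ranking, relevant, 5)
```

A proximity-aware ranking that moves relevant documents into the first few positions raises this value at small k even if the overall set of retrieved documents is unchanged.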
