

WBIA Project 2 – Retrieval & Evaluation

LI Geng Nov.10, 2008

Guidelines

- Information retrieval evaluation: a brief review
- Goals of this assignment
- Tools & work environment
  - Nutch-0.9
  - Lucene-2.1.0
- Assignment instructions
- Submission & grading policies

Previously in Project 1 - Crawling

- Tool: Nutch
- Target network: ccer.pku.edu.cn
- What we already have:
  - A web database that contains the web pages of CCER;
  - An inverted index of your data (you may not have noticed it yet);
  - Global PageRank results.

Previously in Project 1 (Cont.)

- What we don’t have yet for a complete IR service:
  - Interpreting the user's information need
    - Query -> Web page (at least page URLs)
  - Online retrieval service.

I. Information Retrieval Evaluation – A Brief Review

- Project 2’s focus: Query -> Web page
- What do we need to evaluate retrieval results?
  - Retrieval model implementation & optimization;
  - A standard test data set;
  - Pre-defined queries and their corresponding answer sets;
  - Evaluation with well-known metrics (MAP, P@10, etc.)
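For reference, the two metrics named above are commonly defined as follows. The notation is an assumption introduced here, not taken from the slides: rel_q(i) is 1 if the result at rank i for query q is in the answer set R_q and 0 otherwise, n is the length of the ranked list, and Q is the set of test queries.

```latex
\[
\mathrm{P@}k(q) = \frac{1}{k}\sum_{i=1}^{k} \mathrm{rel}_q(i), \qquad
\mathrm{AP}(q) = \frac{1}{|R_q|}\sum_{i=1}^{n} \mathrm{P@}i(q)\,\mathrm{rel}_q(i), \qquad
\mathrm{MAP} = \frac{1}{|Q|}\sum_{q \in Q} \mathrm{AP}(q)
\]
```

P@10 is the first formula with k = 10. Relevant pages that are never retrieved contribute zero to AP, since the sum is divided by the full answer-set size.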

II. Goals of this Assignment

- Set up an online web search engine (using Nutch)
- Understand the information retrieval evaluation process
- Refine the existing retrieval model (by improving its evaluation metric scores)

How?

- A standard web page test set (Done.)
- Pre-defined queries and their corresponding answer sets (Done.)
- Retrieval model implementation
- Evaluation with well-known metrics (MAP, P@10, etc.)

III. Tools & work environment

- Nutch’s major modules:
  - Crawling
  - Indexing
  - Retrieval
  - Web search
  - …
- Of these, the indexing and retrieval modules are built on top of Lucene.

Lucene

- A framework for document retrieval using the Vector Space Model
  - Inverted index construction
  - Query matching
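To make the two bullet points concrete, here is a minimal sketch of an inverted index with boolean AND matching. It is a toy written for illustration only: the class `InvertedIndexSketch` and its methods are hypothetical names, it does not use any Lucene API, and it does no scoring, so it shows only the matching half of what Lucene does.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

/** Toy inverted index: term -> sorted set of document ids (a posting list). */
final class InvertedIndexSketch {
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    /** Splits text on anything that is not a letter or digit. */
    private static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+");
    }

    /** Index construction: records the document id under each of its terms. */
    void add(int docId, String text) {
        for (String term : tokenize(text)) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Query matching: ids of documents containing every query term, ascending. */
    List<Integer> matchAll(String query) {
        TreeSet<Integer> result = null;
        for (String term : tokenize(query)) {
            if (term.isEmpty()) {
                continue;
            }
            TreeSet<Integer> list = postings.get(term);
            if (list == null) {
                return new ArrayList<>(); // one term matches nothing, so AND is empty
            }
            if (result == null) {
                result = new TreeSet<>(list); // copy so the index itself is not modified
            } else {
                result.retainAll(list); // intersect posting lists
            }
        }
        return result == null ? new ArrayList<>() : new ArrayList<>(result);
    }
}
```

Calling `add(1, "CCER economic research")` and then `matchAll("economic research")` returns the ids of all documents that contain both terms. A real index also stores term frequencies and positions in each posting, which is what ranking needs.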

Lucene (Cont.)

- It does not handle (from http://darksleep.com/lucene):
  - managing the process (instantiating the objects and hooking them together, both for indexing and for searching)
  - selecting the data files
  - parsing the data files (e.g. Chinese word segmentation)
  - getting the search string from the user
  - displaying the search results to the user
- A “library” rather than a stand-alone application

Lucene (Cont.)

- But a library with useful utilities as standard extensions, e.g.
  - package org.apache.lucene.analysis.standard;
  - Default document analysis (and tokenizing) utilities (i.e. they will be used if you don’t implement your own.)

Lucene in Nutch

- As a third-party library
  - Try listing the $NUTCH-HOME/lib directory

[Figure: data-flow diagram. Labels: Crawled Web Page; org.apache.lucene.analysis; org.apache.lucene.index; Inverted Index; org.apache.lucene.search; Posting Lists; Matched Documents; HitSet; Web Page.]

Lucene in Nutch (Cont.)

- Nutch implements Lucene interfaces and imports Lucene classes so as to reuse its indexing and retrieval functionality. E.g.
  - In package org.apache.nutch.analysis;
    public final class NutchDocumentTokenizer extends org.apache.lucene.analysis.Tokenizer implements NutchAnalysisConstants
- Refer to these packages for more details:
  - package org.apache.nutch.indexer; (index construction)
  - package org.apache.nutch.analysis;
  - package org.apache.nutch.searcher; (retrieval)

Towards a complete IR Application

- Nutch’s major modules:
  - Crawling
  - Indexing
    - Try listing the root directory of your WebDB: crawldb, indexes, linkdb, segments
  - Retrieval
  - Web search
  - …

IV. Assignment Instructions

- The test set and answer set:
  - Taken from one group’s previous crawl
  - Will be put online soon
- Retrieval
  - Enhance retrieval quality using your PageRank results
- Web search
  - Set up an online search engine with Nutch

Step 1 - Web Search Engine Setup

- This is the recommended first step in this assignment.
- It is relatively simple; Nutch’s online tutorial has detailed enough information on this.
  - http://wiki.apache.org/nutch/NutchTutorial
- You will get an impression of the vector space retrieval model implemented by Lucene.
- Important: To save time with Nutch configuration, refer to my instructions at http://162.105.80.59/WBIA_NutchConfigHelp.txt in addition to the Nutch online tutorial.

Step 1 - Web Search Engine Setup (Cont.)

- Your task:
  - Compute retrieval metrics as the baseline for comparison
    - MAP, P@10
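The baseline computation can be sketched as below. This is an illustrative helper, not part of Nutch or Lucene: the class `RetrievalMetrics`, its method names, and the choice of page URLs as string identifiers are all assumptions. It also assumes each ranked list contains no duplicate URLs.

```java
import java.util.List;
import java.util.Set;

/** Sketch of P@k, AP and MAP over ranked URL lists and answer sets. */
final class RetrievalMetrics {

    /** P@k: fraction of the top k results that are relevant; missing ranks count as misses. */
    static double precisionAtK(List<String> ranked, Set<String> relevant, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        int hits = 0;
        for (int i = 0; i < Math.min(k, ranked.size()); i++) {
            if (relevant.contains(ranked.get(i))) {
                hits++;
            }
        }
        return (double) hits / k;
    }

    /** AP: sum of P@i at every rank i holding a relevant result, divided by the answer-set size. */
    static double averagePrecision(List<String> ranked, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        int hits = 0;
        double sum = 0.0;
        for (int i = 0; i < ranked.size(); i++) {
            if (relevant.contains(ranked.get(i))) {
                hits++;
                sum += (double) hits / (i + 1);
            }
        }
        return sum / relevant.size();
    }

    /** MAP: arithmetic mean of AP over all queries; both lists are aligned by query index. */
    static double meanAveragePrecision(List<List<String>> rankings, List<Set<String>> answerSets) {
        if (rankings.size() != answerSets.size()) {
            throw new IllegalArgumentException("one answer set per ranking is required");
        }
        double total = 0.0;
        for (int q = 0; q < rankings.size(); q++) {
            total += averagePrecision(rankings.get(q), answerSets.get(q));
        }
        return rankings.isEmpty() ? 0.0 : total / rankings.size();
    }
}
```

Running the same code on the unmodified engine in Step 1 and again on the modified one in Step 4 keeps the two sets of numbers directly comparable.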

Step 2 – Lucene Retrieval Ranking Analysis

- Entry point:
  - class org.apache.lucene.search.IndexSearcher
- (Hint) Related classes, for reference:
  - class org.apache.lucene.search.BooleanQuery
  - class org.apache.lucene.search.BooleanQuery.BooleanWeight

Step 2 – Lucene Retrieval Ranking Analysis (Cont.)

- Your task:
  - Figure out the formula Lucene uses to compute document scores.

Step 3 – Integrate PageRank results with VSM

- Your task:
  - Figure out a solution that combines PageRank and VSM scores effectively to enhance retrieval quality.
  - Any ideas now?
- Required coding: edit package org.apache.lucene.search
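One possible starting point, offered only as an example and not as the expected solution, is a linear interpolation of the two scores after bringing them onto the same scale. Everything in the sketch is an assumption: the class `CombinedScoreSketch`, the min-max normalization, and the weight `lambda` are hypothetical, and none of it touches the Lucene API.

```java
/** Sketch: blend a VSM score with a PageRank value by linear interpolation. */
final class CombinedScoreSketch {

    /** Min-max normalizes values to [0, 1]; returns all zeros when every value is equal. */
    static double[] normalize(double[] values) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double[] out = new double[values.length];
        double range = max - min;
        if (range == 0.0) {
            return out; // Java zero-initializes the array
        }
        for (int i = 0; i < values.length; i++) {
            out[i] = (values[i] - min) / range;
        }
        return out;
    }

    /**
     * lambda = 1 ranks purely by VSM score, lambda = 0 purely by PageRank.
     * Both inputs are assumed to be normalized to [0, 1] beforehand.
     */
    static double combine(double vsmScore, double pageRank, double lambda) {
        if (lambda < 0.0 || lambda > 1.0) {
            throw new IllegalArgumentException("lambda must be in [0, 1]");
        }
        return lambda * vsmScore + (1.0 - lambda) * pageRank;
    }
}
```

The weight can be tuned by recomputing MAP and P@10 for several values of `lambda`. Other combinations, such as multiplying the VSM score by a damped PageRank, are equally worth trying.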

Step 4 – Re-evaluate and Improve

- Based on your new model and retrieval results, recompute
  - MAP, P@10
- Compare the newly computed values with the previous ones; go back to Step 3 if there is still room for improvement.

Challenge Task 1

- Edit Lucene to implement the language model (then repeat the evaluation process and compare the results with VSM + PageRank)
- Hint:
  - Find out how Lucene stores and reads the posting lists, and figure out a way to use the data in them for LM similarity computation.
  - Or, you may consider reformatting the posting list store and inserting additional useful information.
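To show which statistics a language model needs from the posting lists, here is a sketch of query-likelihood scoring with Jelinek-Mercer smoothing. The slides do not say which smoothing method to use, so that choice is an assumption, as are the class `QueryLikelihoodSketch`, its parameters, and the plain maps standing in for data that would really be read from the index.

```java
import java.util.Map;

/** Sketch: log P(q|d) under a unigram language model with Jelinek-Mercer smoothing. */
final class QueryLikelihoodSketch {

    /**
     * @param docTf            term -> frequency in the document
     * @param docLength        total number of tokens in the document
     * @param collectionTf     term -> frequency in the whole collection
     * @param collectionLength total number of tokens in the collection (must be positive)
     * @param lambda           weight of the collection model, in (0, 1)
     */
    static double logLikelihood(String[] queryTerms,
                                Map<String, Integer> docTf, int docLength,
                                Map<String, Long> collectionTf, long collectionLength,
                                double lambda) {
        double logP = 0.0;
        for (String term : queryTerms) {
            double pDoc = docLength == 0 ? 0.0 : docTf.getOrDefault(term, 0) / (double) docLength;
            double pColl = collectionTf.getOrDefault(term, 0L) / (double) collectionLength;
            double p = (1.0 - lambda) * pDoc + lambda * pColl;
            if (p == 0.0) {
                return Double.NEGATIVE_INFINITY; // term never occurs in the collection
            }
            logP += Math.log(p);
        }
        return logP;
    }
}
```

The inputs are per-document term frequencies, document lengths, and collection-wide term counts. Checking which of these the index already exposes and which would need extra storage is the question the hint raises.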

Challenge Task 2

- Implement LSI (Latent Semantic Indexing) and evaluate it
  - In this case, could Lucene’s document scoring module still be reused?
  - …

V. Submission & Grading

- Deadline: 12.3 23:59
- The Challenge tasks are optional

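What to submit

Project report document, containing the following parts:

1. Group members and division of work
2. The scoring formula Lucene uses for document matching;
3. How did you integrate the PageRank results?
   Explain the idea; do not paste program code.
4. How effective was the integration? What further improvements did you try after integrating?
   Illustrate with the two evaluation metrics
5. (Optional part) Briefly describe your approach to implementing the language model or LSI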

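What to submit (Cont.)

- Code package
  - It must include at least the lucene jar containing the document ranking algorithm that combines VSM and PageRank, with a note on which files were modified;
  - If you did a Challenge, add an extra text file to the code package describing it;
- Submission format: pack the two parts above into a zip or rar archive, named as follows:
  - (group name)_(Project leader's student ID).zip(rar)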

Grading Policy

- Base score: 100
- Challenge 1: +30 bonus
- Challenge 2: +40 bonus
- Groups that complete the work independently receive at least 75% of the score
- Depending on how well the work is completed, the Project Leader receives a bonus of 0 - 20%


Any Questions?

Online References

- http://wiki.apache.org/nutch/NutchTutorial
- http://darksleep.com/lucene
- http://lucene.apache.org/java/2_1_0/

