WBIA Project 2 – Retrieval & Evaluation
Li Geng, Nov. 10, 2008
Guidelines
- Information retrieval evaluation: a brief review
- Goals of this assignment
- Tools & work environment (Nutch 0.9, Lucene 2.1.0)
- Assignment instructions
- Submission & grading policies
Previously in Project 1 - Crawling
Tool: Nutch
Target network: ccer.pku.edu.cn
What we already have:
- A web database that contains the web pages of CCER
- An inverted index of your data (you may not have noticed it yet)
- Global PageRank results
Previously in Project 1 (Cont.)
What we don't have yet for a complete IR service:
- Interpreting the user's information need: query → web page (at least page URLs)
- An online retrieval service
I. Information Retrieval Evaluation – A Brief Review
Project 2's focus: query → web page
What do we need to evaluate retrieval results?
- Retrieval model implementation & optimization
- A standard test data set
- Pre-defined queries and their corresponding answer set
- Evaluation with well-known metrics (MAP, P@10, etc.)
II. Goals of this Assignment
Set up an online web search engine (using Nutch)
Understand the information retrieval evaluation process
Refine the existing retrieval model (by improving evaluation metric scores)
How?
- A standard web page test set (done)
- Pre-defined queries and their corresponding answer set (done)
- Retrieval model implementation
- Evaluation with well-known metrics (MAP, P@10, etc.)
III. Tools & work environment
Nutch's major modules: crawling, indexing, retrieval, web search, …
The indexing and retrieval modules are built on top of Lucene.
Lucene
A framework for document retrieval using the Vector Space Model:
- Inverted index construction
- Query matching
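To make the vector-space idea concrete: documents and the query become sparse term-weight vectors, and ranking is by cosine similarity. The sketch below is a toy illustration with names of my own choosing; Lucene's real scoring layers boosts and normalization on top of this basic idea.

```java
// Toy vector-space matching: documents and the query are sparse
// term-weight vectors, ranked by cosine similarity. Names are
// illustrative, not part of Lucene.
import java.util.Map;

public class Vsm {
    // Cosine similarity between two sparse term-weight vectors.
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) dot += e.getValue() * w; // shared terms only
            normA += e.getValue() * e.getValue();
        }
        for (double w : b.values()) normB += w * w;
        if (normA == 0.0 || normB == 0.0) return 0.0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```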
Lucene (Cont.)
What it does not handle (from http://darksleep.com/lucene):
- Managing the process (instantiating the objects and hooking them together, both for indexing and for searching)
- Selecting the data files
- Parsing the data files (e.g., Chinese word segmentation)
- Getting the search string from the user
- Displaying the search results to the user
In short, Lucene is a "library" rather than a stand-alone application.
Lucene (Cont.)
But it is a library that ships useful utilities as standard extensions, e.g.
package org.apache.lucene.analysis.standard;
Default document analysis (and tokenizing) utilities
(i.e., they are used if you don't implement your own.)
Lucene in Nutch
As a third-party library: try listing the $NUTCH_HOME/lib directory.
[Diagram: index construction feeds a crawled web page through org.apache.lucene.analysis and org.apache.lucene.index to build the inverted index (posting lists); retrieval runs a query through org.apache.lucene.search against the index to produce a HitSet of matched documents.]
Lucene in Nutch (Cont.)
Nutch implements Lucene interfaces and imports Lucene classes so as to reuse its indexing and retrieval functionality, e.g., in package org.apache.nutch.analysis:

public final class NutchDocumentTokenizer
    extends org.apache.lucene.analysis.Tokenizer
    implements NutchAnalysisConstants

Refer to these packages for more details:
- org.apache.nutch.indexer
- org.apache.nutch.analysis
- org.apache.nutch.searcher
Towards a Complete IR Application
Nutch's major modules: crawling, indexing, retrieval, web search, …
Try listing the root directory of your WebDB: crawldb, indexes, linkdb, segments
IV. Assignment Instructions
The test set and answer set:
- Taken from one group's previous crawl
- Will be put online soon
Retrieval: enhance retrieval quality using your PageRank results
Web search: set up an online search engine with Nutch
Step 1 – Web Search Engine Setup
This is the recommended first step in the assignment. It is relatively simple, and Nutch's online tutorial covers it in enough detail:
http://wiki.apache.org/nutch/NutchTutorial
It will also give you a first impression of the vector-space retrieval model implemented by Lucene.
Important: to save time on Nutch configuration, refer to my instructions in addition to the Nutch online tutorial: http://162.105.80.59/WBIA_NutchConfigHelp.txt
Step 1 – Web Search Engine Setup (Cont.)
Your task: compute retrieval metrics as the baseline for later comparison: MAP and P@10.
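As a sanity check for your evaluation script, both metrics can be computed in a few lines. The class below is an illustrative sketch (the names are mine, not part of Nutch or Lucene): P@k looks only at the top k results, while average precision rewards placing relevant documents early, and MAP averages that over all queries.

```java
// Illustrative evaluation helpers; class and method names are mine,
// not part of Nutch or Lucene.
import java.util.List;
import java.util.Set;

public class EvalMetrics {

    // P@k: fraction of the top-k ranked documents that are relevant.
    public static double precisionAtK(List<String> ranked, Set<String> relevant, int k) {
        int hits = 0;
        for (int i = 0; i < Math.min(k, ranked.size()); i++) {
            if (relevant.contains(ranked.get(i))) hits++;
        }
        return (double) hits / k;
    }

    // Average precision for one query: precision at each rank where a
    // relevant document appears, averaged over all relevant documents.
    public static double averagePrecision(List<String> ranked, Set<String> relevant) {
        int hits = 0;
        double sum = 0.0;
        for (int i = 0; i < ranked.size(); i++) {
            if (relevant.contains(ranked.get(i))) {
                hits++;
                sum += (double) hits / (i + 1);
            }
        }
        return relevant.isEmpty() ? 0.0 : sum / relevant.size();
    }

    // MAP: mean of averagePrecision over all queries.
    public static double meanAveragePrecision(List<List<String>> rankings,
                                              List<Set<String>> answerSets) {
        double sum = 0.0;
        for (int q = 0; q < rankings.size(); q++) {
            sum += averagePrecision(rankings.get(q), answerSets.get(q));
        }
        return sum / rankings.size();
    }
}
```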
Step 2 – Lucene Retrieval Ranking Analysis
Entry point: class org.apache.lucene.search.IndexSearcher
(Hint) Related classes, for reference:
- org.apache.lucene.search.BooleanQuery
- org.apache.lucene.search.BooleanQuery.BooleanWeight
Step 2 – Lucene Retrieval Ranking Analysis (Cont.)
Your task: figure out the formula Lucene uses to compute document scores.
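For reference while reading the source: Lucene's classic DefaultSimilarity documents the per-term factors that multiply into the score. The sketch below reproduces those factors as plain arithmetic so you can check the numbers you derive against it; verify the exact formula against the Similarity javadoc of your own Lucene 2.1.0 copy.

```java
// Per-term factors of Lucene's classic DefaultSimilarity, reproduced as
// plain arithmetic. Check against the Similarity javadoc of your copy.
public class LuceneScoreSketch {

    // tf factor: square root of the raw term frequency in the document.
    public static double tf(int freq) { return Math.sqrt(freq); }

    // idf factor: log(numDocs / (docFreq + 1)) + 1. It enters the final
    // score squared: once via the query weight, once via the doc weight.
    public static double idf(int docFreq, int numDocs) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    // length norm: 1 / sqrt(number of terms in the field).
    public static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // Single-term score, ignoring coord(), queryNorm() and boosts,
    // which multiply in as additional scalar factors.
    public static double termScore(int freq, int docFreq, int numDocs, int fieldLength) {
        double i = idf(docFreq, numDocs);
        return tf(freq) * i * i * lengthNorm(fieldLength);
    }
}
```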
Step 3 – Integrate PageRank Results with the VSM
Your task: work out a way to combine the PageRank and VSM scores effectively so as to enhance retrieval quality. Any ideas now?
Required coding: edit package org.apache.lucene.search
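One simple starting point for the combination is a weighted linear mix, with the PageRank value log-scaled so its heavy-tailed distribution does not drown out the text score. The alpha weight and the log scaling below are my illustrative choices, not the required solution; the assignment deliberately leaves the design to you.

```java
// A starting-point combination of the VSM text score with PageRank.
// alpha and the log scaling are illustrative choices, not the answer.
public class CombinedScore {
    public static double combine(double vsmScore, double pageRank, double alpha) {
        // log(1 + pr) compresses the heavy tail; 0 when pr == 0.
        double prSignal = Math.log1p(pageRank);
        return alpha * vsmScore + (1.0 - alpha) * prSignal;
    }
}
```

Tuning alpha against MAP and P@10 on the provided answer set is exactly the loop step 4 describes.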
Step 4 – Re-evaluate and Improve
Based on your new model and its retrieval results, recompute MAP and P@10.
Compare the newly computed values with the previous ones, and go back to step 3 if there is still room for improvement.
Challenge Task 1
Edit Lucene to implement the language model (then repeat the evaluation process and compare the results with VSM + PageRank).
Hint:
- Find out how Lucene stores and reads the posting lists, and figure out a way to use the data in them for LM similarity computation.
- Alternatively, consider reformatting the posting-list store to hold additional useful information.
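If you attempt the language model, note that the query-likelihood score needs only statistics the index already holds: term frequency from the posting list, collection frequency and collection length from the index statistics, and the document length. A sketch of the per-term log-probability with Dirichlet smoothing, with illustrative names and μ as a free smoothing parameter (Dirichlet is one common smoothing choice, not mandated by the assignment):

```java
// Query-likelihood language model with Dirichlet smoothing, per term:
//   log P(t|d) = log( (tf(t,d) + mu * cf(t)/|C|) / (|d| + mu) )
// tf comes from the posting list, cf and |C| from collection statistics,
// |d| from the stored document length. Names are illustrative.
public class LmScore {
    public static double logProb(int tf, long collectionFreq, long collectionLen,
                                 int docLen, double mu) {
        double collectionProb = (double) collectionFreq / collectionLen;
        return Math.log((tf + mu * collectionProb) / (docLen + mu));
    }
}
```

The document score for a query is the sum of this quantity over the query terms, which ranks identically to the product of the smoothed probabilities.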
Challenge Task 2
Implement LSI (Latent Semantic Indexing) and evaluate it.
In this case, could Lucene's document scoring module still be reused? …
V. Submission & Grading
Deadline: Dec. 3, 23:59. The challenge tasks are optional.
What to submit: a project report containing the following parts:
1. Group members and the division of work
2. The scoring formula Lucene uses for document matching
3. How you integrated the PageRank results (describe the approach; do not paste program code)
4. How well the integration works and what further improvements you tried, supported by the two evaluation metrics
5. (Optional) A brief description of your approach to the language model or LSI

What to submit (cont.): a code package
- At least the Lucene jar containing the document-ranking algorithm that combines VSM and PageRank, with a note of the files you modified
- If you attempted a challenge task, add an extra text file to the code package describing it

Submission format: pack the two parts above into a zip or rar archive, named:
(group name)_(project leader's student ID).zip (or .rar)
Grading Policy
- Base score: 100
- Challenge 1: +30 bonus
- Challenge 2: +40 bonus
- A group that completes the work entirely on its own receives at least 75% of the score
- Depending on how well the project is completed, the project leader receives a 0–20% bonus
Any Questions?
Online References
http://wiki.apache.org/nutch/NutchTutorial http://darksleep.com/lucene http://lucene.apache.org/java/2_1_0/