WBIA Project 2 – Retrieval & Evaluation
Li Geng, Nov. 10, 2008
Guidelines
- Information retrieval evaluation: a brief review
- Goals of this assignment
- Tools & work environment (Nutch 0.9, Lucene 2.1.0)
- Assignment instructions
- Submission & grading policies
Previously in Project 1 - Crawling
Tool: Nutch
Target network: ccer.pku.edu.cn
What we already have:
- A web database that contains the web pages of CCER
- An inverted index of your data (you may not have noticed it yet)
- Global PageRank results
Previously in Project 1 (Cont.)
What we don't have yet for a complete IR service:
- Interpreting the user's information need: query → web page (at least page URLs)
- An online retrieval service
I. Information Retrieval Evaluation – A Brief Review
Project 2's focus: query → web page
What do we need to evaluate retrieval results?
- Retrieval model implementation & optimization
- A standard test data set
- Pre-defined queries and their corresponding answer set
- Evaluation with well-known metrics (MAP, P@10, etc.)
II. Goals of this Assignment
Set up an online web search engine (using Nutch)
Understand the information retrieval evaluation process
Refine the existing retrieval model (by improving evaluation metric scores)
How?
- A standard web page test set (done)
- Pre-defined queries and their corresponding answer set (done)
- Retrieval model implementation
- Evaluation with well-known metrics (MAP, P@10, etc.)
III. Tools & work environment
Nutch's major modules: crawling, indexing, retrieval, web search, …
The indexing and retrieval modules are built on top of Lucene.
Lucene
A framework for document retrieval using the Vector Space Model:
- Inverted index construction
- Query matching
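To make the vector-space idea concrete: documents and the query become sparse term-weight vectors, and ranking is by cosine similarity. The sketch below is a toy illustration with names of my own choosing; Lucene's real scoring layers boosts and normalization on top of this basic idea.

```java
// Toy vector-space matching: documents and the query are sparse
// term-weight vectors, ranked by cosine similarity. Names are
// illustrative, not part of Lucene.
import java.util.Map;

public class Vsm {
    // Cosine similarity between two sparse term-weight vectors.
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) dot += e.getValue() * w; // shared terms only
            normA += e.getValue() * e.getValue();
        }
        for (double w : b.values()) normB += w * w;
        if (normA == 0.0 || normB == 0.0) return 0.0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```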
Lucene (Cont.)
What it does not handle (from http://darksleep.com/lucene):
- Managing the process (instantiating the objects and hooking them together, both for indexing and for searching)
- Selecting the data files
- Parsing the data files (e.g., Chinese word segmentation)
- Getting the search string from the user
- Displaying the search results to the user
In short, Lucene is a "library" rather than a stand-alone application.
Lucene (Cont.)
But it is a library that ships useful utilities as standard extensions, e.g.
package org.apache.lucene.analysis.standard;
Default document analysis (and tokenizing) utilities
(i.e., they are used if you don't implement your own.)
Lucene in Nutch
As a third-party library: try listing the $NUTCH_HOME/lib directory.
[Diagram: index construction feeds a crawled web page through org.apache.lucene.analysis and org.apache.lucene.index to build the inverted index (posting lists); retrieval runs a query through org.apache.lucene.search against the index to produce a HitSet of matched documents.]
Lucene in Nutch (Cont.)
Nutch implements Lucene interfaces and imports Lucene classes so as to reuse its indexing and retrieval functionality, e.g., in package org.apache.nutch.analysis:

public final class NutchDocumentTokenizer
    extends org.apache.lucene.analysis.Tokenizer
    implements NutchAnalysisConstants

Refer to these packages for more details:
- org.apache.nutch.indexer
- org.apache.nutch.analysis
- org.apache.nutch.searcher
Towards a Complete IR Application
Nutch's major modules: crawling, indexing, retrieval, web search, …
Try listing the root directory of your WebDB: crawldb, indexes, linkdb, segments
IV. Assignment Instructions
The test set and answer set:
- Taken from one group's previous crawl
- Will be put online soon
Retrieval: enhance retrieval quality using your PageRank results
Web search: set up an online search engine with Nutch
Step 1 – Web Search Engine Setup
This is the recommended first step in the assignment. It is relatively simple, and Nutch's online tutorial covers it in enough detail:
http://wiki.apache.org/nutch/NutchTutorial
It will also give you a first impression of the vector-space retrieval model implemented by Lucene.
Important: to save time on Nutch configuration, refer to my instructions in addition to the Nutch online tutorial: http://162.105.80.59/WBIA_NutchConfigHelp.txt
Step 1 – Web Search Engine Setup (Cont.)
Your task: compute retrieval metrics as the baseline for later comparison: MAP and P@10.
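As a sanity check for your evaluation script, both metrics can be computed in a few lines. The class below is an illustrative sketch (the names are mine, not part of Nutch or Lucene): P@k looks only at the top k results, while average precision rewards placing relevant documents early, and MAP averages that over all queries.

```java
// Illustrative evaluation helpers; class and method names are mine,
// not part of Nutch or Lucene.
import java.util.List;
import java.util.Set;

public class EvalMetrics {

    // P@k: fraction of the top-k ranked documents that are relevant.
    public static double precisionAtK(List<String> ranked, Set<String> relevant, int k) {
        int hits = 0;
        for (int i = 0; i < Math.min(k, ranked.size()); i++) {
            if (relevant.contains(ranked.get(i))) hits++;
        }
        return (double) hits / k;
    }

    // Average precision for one query: precision at each rank where a
    // relevant document appears, averaged over all relevant documents.
    public static double averagePrecision(List<String> ranked, Set<String> relevant) {
        int hits = 0;
        double sum = 0.0;
        for (int i = 0; i < ranked.size(); i++) {
            if (relevant.contains(ranked.get(i))) {
                hits++;
                sum += (double) hits / (i + 1);
            }
        }
        return relevant.isEmpty() ? 0.0 : sum / relevant.size();
    }

    // MAP: mean of averagePrecision over all queries.
    public static double meanAveragePrecision(List<List<String>> rankings,
                                              List<Set<String>> answerSets) {
        double sum = 0.0;
        for (int q = 0; q < rankings.size(); q++) {
            sum += averagePrecision(rankings.get(q), answerSets.get(q));
        }
        return sum / rankings.size();
    }
}
```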
Step 2 – Lucene Retrieval Ranking Analysis
Entry point: class org.apache.lucene.search.IndexSearcher
(Hint) Related classes, for reference:
- org.apache.lucene.search.BooleanQuery
- org.apache.lucene.search.BooleanQuery.BooleanWeight
Step 2 – Lucene Retrieval Ranking Analysis (Cont.)
Your task: figure out the formula Lucene uses to compute document scores.
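For reference while reading the source: Lucene's classic DefaultSimilarity documents the per-term factors that multiply into the score. The sketch below reproduces those factors as plain arithmetic so you can check the numbers you derive against it; verify the exact formula against the Similarity javadoc of your own Lucene 2.1.0 copy.

```java
// Per-term factors of Lucene's classic DefaultSimilarity, reproduced as
// plain arithmetic. Check against the Similarity javadoc of your copy.
public class LuceneScoreSketch {

    // tf factor: square root of the raw term frequency in the document.
    public static double tf(int freq) { return Math.sqrt(freq); }

    // idf factor: log(numDocs / (docFreq + 1)) + 1. It enters the final
    // score squared: once via the query weight, once via the doc weight.
    public static double idf(int docFreq, int numDocs) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    // length norm: 1 / sqrt(number of terms in the field).
    public static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // Single-term score, ignoring coord(), queryNorm() and boosts,
    // which multiply in as additional scalar factors.
    public static double termScore(int freq, int docFreq, int numDocs, int fieldLength) {
        double i = idf(docFreq, numDocs);
        return tf(freq) * i * i * lengthNorm(fieldLength);
    }
}
```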
Step 3 – Integrate PageRank Results with the VSM
Your task: work out a way to combine the PageRank and VSM scores effectively so as to enhance retrieval quality. Any ideas now?
Required coding: edit package org.apache.lucene.search
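One simple starting point for the combination is a weighted linear mix, with the PageRank value log-scaled so its heavy-tailed distribution does not drown out the text score. The alpha weight and the log scaling below are my illustrative choices, not the required solution; the assignment deliberately leaves the design to you.

```java
// A starting-point combination of the VSM text score with PageRank.
// alpha and the log scaling are illustrative choices, not the answer.
public class CombinedScore {
    public static double combine(double vsmScore, double pageRank, double alpha) {
        // log(1 + pr) compresses the heavy tail; 0 when pr == 0.
        double prSignal = Math.log1p(pageRank);
        return alpha * vsmScore + (1.0 - alpha) * prSignal;
    }
}
```

Tuning alpha against MAP and P@10 on the provided answer set is exactly the loop step 4 describes.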
Step 4 – Re-evaluate and Improve
Based on your new model and its retrieval results, recompute MAP and P@10.
Compare the newly computed values with the previous ones, and go back to step 3 if there is still room for improvement.
Challenge Task 1
Edit Lucene to implement the language model (then repeat the evaluation process and compare the results with VSM + PageRank).
Hint:
- Find out how Lucene stores and reads the posting lists, and figure out a way to use the data in them for LM similarity computation.
- Alternatively, consider reformatting the posting-list store to hold additional useful information.
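If you attempt the language model, note that the query-likelihood score needs only statistics the index already holds: term frequency from the posting list, collection frequency and collection length from the index statistics, and the document length. A sketch of the per-term log-probability with Dirichlet smoothing, with illustrative names and μ as a free smoothing parameter (Dirichlet is one common smoothing choice, not mandated by the assignment):

```java
// Query-likelihood language model with Dirichlet smoothing, per term:
//   log P(t|d) = log( (tf(t,d) + mu * cf(t)/|C|) / (|d| + mu) )
// tf comes from the posting list, cf and |C| from collection statistics,
// |d| from the stored document length. Names are illustrative.
public class LmScore {
    public static double logProb(int tf, long collectionFreq, long collectionLen,
                                 int docLen, double mu) {
        double collectionProb = (double) collectionFreq / collectionLen;
        return Math.log((tf + mu * collectionProb) / (docLen + mu));
    }
}
```

The document score for a query is the sum of this quantity over the query terms, which ranks identically to the product of the smoothed probabilities.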
Challenge Task 2
Implement LSI (Latent Semantic Indexing) and evaluate it.
In this case, could Lucene's document scoring module still be reused? …
V. Submission & Grading
Deadline: Dec. 3, 23:59. The challenge tasks are optional.
What to submit: a project report containing the following parts:
1. Group members and the division of work
2. The scoring formula Lucene uses for document matching
3. How you integrated the PageRank results (describe the approach; do not paste program code)
4. How well the integration works and what further improvements you tried, supported by the two evaluation metrics
5. (Optional) A brief description of your approach to the language model or LSI

What to submit (cont.): a code package
- At least the Lucene jar containing the document-ranking algorithm that combines VSM and PageRank, with a note of the files you modified
- If you attempted a challenge task, add an extra text file to the code package describing it

Submission format: pack the two parts above into a zip or rar archive, named:
(group name)_(project leader's student ID).zip (or .rar)
Grading Policy
- Base score: 100
- Challenge 1: +30 bonus
- Challenge 2: +40 bonus
- A group that completes the work entirely on its own receives at least 75% of the score
- Depending on how well the project is completed, the project leader receives a 0–20% bonus
Any Questions?
Online References
http://wiki.apache.org/nutch/NutchTutorial http://darksleep.com/lucene http://lucene.apache.org/java/2_1_0/