Lucene: An open source set of Java classes ◦ Search Engine/Document Classifier/Indexer

Date post: 14-Jan-2016
Upload: christian-barker
Transcript
Page 1

Lucene

Page 2

Lucene
An open source set of Java classes
◦ Search engine/document classifier/indexer
    http://lucene.sourceforge.net/talks/pisa/
◦ Developed by Doug Cutting
    1996
    Contributed to the Apache project
    Wrote several papers in IR

Page 3

Modules for IR
◦ Analysis
    Tokenization
    Where text is split into the tokens that get indexed
◦ Document
    Where the document ID is created
    Date of document is extracted
    Title of document is extracted
◦ Index
    Provides access to indexes
    Maintains indexes
◦ Query Parser
    Where the magic of the query happens: the query string is parsed into a query
◦ Search
    Searches across indexes
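
The Analysis and Document steps above can be sketched in a few lines. This is a minimal illustration in plain Python, not Lucene's Java API: the names `analyze`, `make_document`, and `STOP_WORDS`, the stop-word list, and the sample field values are all hypothetical.

```python
import re

# Hypothetical stop-word list for illustration; real analyzers ship their own.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "is"}

def analyze(text):
    """Split text into lowercase word tokens and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def make_document(doc_id, title, date, body):
    """Bundle the fields the Document module is described as extracting."""
    return {"id": doc_id, "title": title, "date": date, "tokens": analyze(body)}

# Placeholder values, invented for the example.
doc = make_document(1, "Lucene notes", "2004-01-01", "The magic of the Query Parser")
print(doc["tokens"])
```

Only the tokens that survive analysis would be handed to the index, which is why the choice of analyzer affects what a later query can match.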

Page 4

Modules for IR
◦ Search Spans
    Spans: terms within +/- K words of each other
    Example: Find me a document that has Rachael Ray and Alton Brown within 100 words of each other that also has the term cooking
◦ Store/Util
    Store the indexes and other housekeeping
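
The span example above can be sketched as a proximity check over token positions. This is a simplified illustration in plain Python, not Lucene's span query classes: the helper names are hypothetical, and distance is measured between the start positions of the two names, which is an assumption made to keep the sketch short.

```python
def positions(tokens, phrase):
    """Return start positions where the phrase (a list of tokens) occurs."""
    n = len(phrase)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]

def within(tokens, phrase_a, phrase_b, k):
    """True if some occurrence of phrase_a starts within k tokens of phrase_b."""
    pos_a = positions(tokens, phrase_a)
    pos_b = positions(tokens, phrase_b)
    return any(abs(a - b) <= k for a in pos_a for b in pos_b)

def matches(tokens, k=100):
    """The slide's query: both names within k words, plus the term 'cooking'."""
    return (within(tokens, ["rachael", "ray"], ["alton", "brown"], k)
            and "cooking" in tokens)

# Hypothetical one-line document, already lowercased and tokenized.
text = "rachael ray joined alton brown for a cooking show"
print(matches(text.split()))
```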

Page 5

Theory
Space Optimizations for Total Ranking
◦ Cutting et al 1996
◦ RIAO (Computer Assisted IR) 1997
◦ http://lucene.sf.net/papers/riao97.ps
Lucene lecture at Pisa
◦ Doug Cutting
◦ Slides from lecture at University of Pisa 2004

Page 6

Vector
Vectors give a mathematical measure of distance between terms
◦ Uses the cosine measure to determine how close terms/documents are (a higher score means closer)
◦ This score can then be used for WSD/clustering/IR
◦ Example:
    Bass, fishing: .6506
    Bass, guitar: .000423
    This tells us the document is about fishing, not about guitars
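
A minimal sketch of the cosine measure, in plain Python. The context-word counts below are invented for illustration and are not meant to reproduce the .6506 and .000423 figures above; the function name `cosine` and the dict-based sparse vectors are assumptions of this sketch.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical context counts, invented for illustration only.
bass_in_doc = {"lake": 3, "boat": 2, "hook": 1}
fishing = {"lake": 2, "boat": 2, "rod": 1}
guitar = {"string": 3, "amp": 2, "chord": 1}

print(cosine(bass_in_doc, fishing))  # relatively high: shared context words
print(cosine(bass_in_doc, guitar))   # 0.0: no shared context words
```

Because the score is normalized by vector length, a long document is not favored simply for containing more words.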

Page 7

Vectors-IR
“Vector-space search engines use the notion of a term space, where each document is represented as a vector in a high-dimensional space. There are as many dimensions as there are unique words in the entire collection. Because a document's position in the term space is determined by the words it contains, documents with many words in common end up close together, while documents with few shared words end up far apart.”
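
To make the term space concrete, the following sketch builds one dimension per unique word for a tiny, hypothetical collection and shows that documents sharing words overlap more. It is plain Python with invented names (`build_term_space`, `to_vector`, `shared_weight`), not anything taken from Lucene.

```python
def build_term_space(docs):
    """Map every unique word in the collection to one dimension (an index)."""
    vocabulary = sorted({word for text in docs for word in text.split()})
    return {word: i for i, word in enumerate(vocabulary)}

def to_vector(text, term_space):
    """Represent a document as a vector of word counts in the term space."""
    vector = [0] * len(term_space)
    for word in text.split():
        vector[term_space[word]] += 1
    return vector

def shared_weight(u, v):
    """Dot product: grows with the number of words two documents share."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical three-document collection.
docs = ["bass fishing lake", "lake fishing boat", "bass guitar amp"]
space = build_term_space(docs)
vectors = [to_vector(d, space) for d in docs]
print(len(space))  # number of dimensions = number of unique words
print(shared_weight(vectors[0], vectors[1]), shared_weight(vectors[1], vectors[2]))
```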

Page 8

Inverted Index
Term/Doc Id/Weight
◦ Term
    “A Token, the basic unit of indexing in Lucene, represents a single word to be indexed after any document domain transformation -- such as stop-word elimination, stemming, filtering, term normalization, or language translation -- has been applied.”
    http://www.javaworld.com/javaworld/jw-09-2000/jw-0915-lucene-p2.html

Page 9

Inverted Index
Doc Id
◦ A unique “key” that identifies each document
Weight
◦ Binary
◦ Freq count
◦ Weighting algorithm
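
The term/doc id/weight structure can be sketched as a dictionary from each term to the documents that contain it. This is an illustrative plain-Python model, not Lucene's on-disk format: the function names are hypothetical, and the weight shown is a simple frequency count, with a binary weight derived from it.

```python
from collections import Counter, defaultdict

def build_inverted_index(docs):
    """Map each term to {doc_id: term frequency} (a frequency-count weight)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, freq in Counter(text.lower().split()).items():
            index[term][doc_id] = freq
    return index

def binary_weight(index, term, doc_id):
    """Binary weight: 1 if the term occurs in the document, else 0."""
    return 1 if doc_id in index.get(term, {}) else 0

# Hypothetical two-document collection keyed by doc id.
docs = {1: "bass fishing bass", 2: "bass guitar"}
index = build_inverted_index(docs)
print(index["bass"])
```

A weighting algorithm would replace the raw count with a computed value per (term, doc id) pair while leaving the lookup structure unchanged.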

Page 10

Index Merge
Basic/Basket/Basketball
◦ Only keeps track of the differences between adjacent words (a shared prefix is stored once)
◦ Periodically merges indexes
    Allows new documents to be added easily
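
Both ideas on this slide can be sketched briefly. The first two functions show prefix ("front") coding of a sorted term list using the Basic/Basket/Basketball example; the third merges two small indexes. This is plain Python with hypothetical names and a simplified term-to-doc-ids layout, offered as an illustration rather than Lucene's actual file format or merge policy.

```python
def front_encode(sorted_terms):
    """Encode each term as (shared prefix length, remaining suffix)."""
    encoded, previous = [], ""
    for term in sorted_terms:
        shared = 0
        while (shared < min(len(term), len(previous))
               and term[shared] == previous[shared]):
            shared += 1
        encoded.append((shared, term[shared:]))
        previous = term
    return encoded

def front_decode(encoded):
    """Rebuild the full terms from (prefix length, suffix) pairs."""
    terms, previous = [], ""
    for shared, suffix in encoded:
        previous = previous[:shared] + suffix
        terms.append(previous)
    return terms

def merge_segments(a, b):
    """Merge two small indexes (term -> sorted doc ids) into one."""
    merged = {}
    for term in sorted(set(a) | set(b)):
        merged[term] = sorted(set(a.get(term, [])) | set(b.get(term, [])))
    return merged

encoded = front_encode(["basic", "basket", "basketball"])
print(encoded)
```

New documents can be written to a small index of their own and folded into the larger one later, which is what makes additions cheap.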

Page 11

Query
Boolean Search
◦ Only searches documents with at least 1 term in query
◦ “Boolean Search Engine”
Parallel Search
◦ Each term in query is searched in parallel
◦ Partial scores added to queue of docs
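
The scoring flow above can be sketched as accumulating partial scores per document. In this plain-Python illustration the query terms are processed one after another rather than in parallel, and the index layout, weights, and function name are assumptions made for the example.

```python
from collections import defaultdict

def search(index, query_terms):
    """Score only documents that contain at least one query term.

    index maps term -> {doc_id: weight}. Each query term contributes a
    partial score to every document in its postings list.
    """
    partial_scores = defaultdict(float)
    for term in query_terms:
        for doc_id, weight in index.get(term, {}).items():
            partial_scores[doc_id] += weight
    # Highest total first; ties broken by doc id for a stable order.
    return sorted(partial_scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical postings with made-up weights.
index = {
    "bass": {1: 2.0, 2: 1.0},
    "fishing": {1: 1.0, 3: 1.5},
}
print(search(index, ["bass", "fishing"]))
```

Documents that contain none of the query terms never appear in any postings list, so they are never touched.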

Page 12

Query
Threshold
◦ If a document's partial score is too low for it to make the N-best list, the document is ignored even before the search is complete
    Example
        Potential New Doc [0,0,0,0,0,0,i]
        Document ranked 14 [233,202,109,100,i]
        Potential New Doc is ignored
◦ Small loss of recall greatly increases speed of search
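
A rough sketch of threshold pruning with an N-best list, in plain Python. It assumes each candidate arrives with a partial score and an upper bound on what the remaining query terms could add; `top_n`, `full_score`, and all the numbers are hypothetical. With a true upper bound the skip is safe; a more aggressive, heuristic threshold is what would trade a little recall for speed.

```python
import heapq

def top_n(candidates, n, full_score):
    """Keep the n best documents, skipping hopeless ones early.

    candidates yields (doc_id, partial_score, max_remaining), where
    max_remaining is an assumed upper bound on what the unprocessed
    query terms could still add. full_score(doc_id) finishes the scoring.
    """
    best = []      # min-heap of (score, doc_id); best[0] is the n-th best
    skipped = []
    for doc_id, partial, max_remaining in candidates:
        if len(best) == n and partial + max_remaining <= best[0][0]:
            skipped.append(doc_id)   # cannot enter the n-best list
            continue
        score = full_score(doc_id)
        if len(best) < n:
            heapq.heappush(best, (score, doc_id))
        elif score > best[0][0]:
            heapq.heapreplace(best, (score, doc_id))
    return sorted(best, reverse=True), skipped

# Hypothetical final scores and (doc id, partial score, upper bound) triples.
final_scores = {"a": 9.0, "b": 7.0, "c": 2.0, "d": 8.0}
candidates = [("a", 5.0, 4.0), ("b", 4.0, 3.0), ("c", 1.0, 1.0), ("d", 6.0, 2.0)]
ranked, skipped = top_n(candidates, 2, final_scores.get)
print(ranked, skipped)
```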

Page 13

Evaluation of Lucene
Quantitative Evaluation of Passage Retrieval Algorithms for Question Answering
◦ Tellex et al, MIT AI Lab 2003
Compared Prise to Lucene for question and answer tasks
◦ Question & Answer
    <Who is the president?> <George W. Bush .76>

Page 14

Evaluation of Lucene
Prise
◦ An IR system developed by NIST that, according to the paper, uses “modern” search engine techniques
Findings
◦ Found Prise was better than Lucene, since “Boolean” query engines are considered old school and Prise's answers to questions were better

Page 15

Evaluation of Lucene
Lucene
◦ Found that although Prise gave better correct answers, Lucene found more documents containing relevant information
MIT used Lucene, not Prise, in their 2005 TREC submission

Page 16

Users
Lucene is used widely
◦ TREC
◦ Document retrieval enterprise systems
◦ Part of database/web engine
◦ Part of Nutch
◦ Used by academics for large projects

Page 17

Conclusions
Lucene is a good set of classes
◦ Designed to allow customization without having to “reinvent the wheel”
◦ Robust
◦ Fast
◦ Large development groups
◦ Used widely in academia and industry

