
Xebia Knowledge Exchange (mars 2010) - Lucene : From theory to real world

Date post: 11-May-2015
Upload: michael-figuiere
Page 1: Xebia Knowledge Exchange (mars 2010) - Lucene : From theory to real world

Lucene: from theory to real world

[Title slide word cloud: Information retrieval, Apache, Performance tuning, Probabilistic, Relevance, Vector, Dictionary, Inverted index, Analysis, Model, Doug Cutting, Java, Fields, IndexReader, Document, Open Source, Query, Library, Parser, Indexing, Production, Architecture, Design, Troubleshooting, Real world, Cluster, Search application, Solr, Server]


www.xebia.fr / blog.xebia.fr

Agenda

Introduction to Information Retrieval

Lucene overview

Lucene in detail

Search application design

Performance tuning



Information Retrieval


Information Retrieval

“Information Retrieval (IR) is the science of searching for documents”


Inverted Index


Boolean Model

Queries and documents are treated as sets of terms:

Q = (T1 OR T2) AND (T3 OR T4)
D1 = {T1, T3}
D2 = {T2, T3, T4}

The result set of a query is a composition of unions and intersections:

R = {D1, D2}
with union for the OR operator and intersection for the AND operator
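
The set view of the Boolean model can be sketched in a few lines of plain Java. This is only an illustration, not Lucene code: the class name BooleanModelSketch and its methods are hypothetical, and the small term-to-documents map stands in for an inverted index.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class BooleanModelSketch {

    // Inverted index: term -> sorted set of ids of the documents containing it.
    private final Map<String, Set<String>> postings = new HashMap<>();

    void add(String docId, String... terms) {
        for (String term : terms) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    // Returns a copy of the posting set so callers can combine it freely.
    Set<String> docs(String term) {
        return new TreeSet<>(postings.getOrDefault(term, new TreeSet<>()));
    }

    // OR is a set union.
    static Set<String> or(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    // AND is a set intersection.
    static Set<String> and(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    public static void main(String[] args) {
        BooleanModelSketch index = new BooleanModelSketch();
        index.add("D1", "T1", "T3");
        index.add("D2", "T2", "T3", "T4");

        // Q = (T1 OR T2) AND (T3 OR T4)
        Set<String> result = and(
                or(index.docs("T1"), index.docs("T2")),
                or(index.docs("T3"), index.docs("T4")));
        System.out.println(result); // [D1, D2]
    }
}
```

Note that the Boolean model only says whether a document matches; it gives no ranking among the matches.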


Vector Space Model

Documents and queries are represented as vectors of term weights:

d_j = (w_{1,j}, w_{2,j}, ..., w_{t,j})
q = (w_{1,q}, w_{2,q}, ..., w_{t,q})

Similarity between a document and a query can be computed from these two vectors
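
The transcript does not preserve the similarity formula from the slide. A common choice in the vector space model is the cosine of the angle between the two vectors, and the sketch below assumes that choice. The class name and the sample weights are hypothetical.

```java
final class CosineSimilaritySketch {

    // Cosine of the angle between two term-weight vectors of equal length.
    static double cosine(double[] d, double[] q) {
        if (d.length != q.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0;
        double dNormSquared = 0.0;
        double qNormSquared = 0.0;
        for (int i = 0; i < d.length; i++) {
            dot += d[i] * q[i];
            dNormSquared += d[i] * d[i];
            qNormSquared += q[i] * q[i];
        }
        if (dNormSquared == 0.0 || qNormSquared == 0.0) {
            return 0.0; // an all-zero vector shares no term with anything
        }
        return dot / (Math.sqrt(dNormSquared) * Math.sqrt(qNormSquared));
    }

    public static void main(String[] args) {
        double[] query = {1.0, 1.0, 0.0};
        double[] docSharingTerms = {2.0, 1.0, 0.0};
        double[] docWithoutSharedTerms = {0.0, 0.0, 3.0};
        System.out.println(cosine(docSharingTerms, query));
        System.out.println(cosine(docWithoutSharedTerms, query)); // 0.0
    }
}
```

Unlike the Boolean model, this yields a graded score, so matching documents can be ranked.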


Lucene


Lucene: where do we come from?

Version | Release date | Description
0.01 | March 2000 | First open source release (SourceForge)
1.0 | October 2000 |
1.01b | July 2001 | Last SourceForge release
1.2 | June 2002 | First Apache Jakarta release
1.3 | December 2003 | Compound index format, QueryParser enhancements, remote searching, extensible scoring API
1.4 | July 2004 | Sorting, span queries, term vectors
1.4.1 | August 2004 | Bug fix for sorting performance
1.4.2 | October 2004 | IndexSearcher optimization and misc. fixes
1.4.3 | 29 November 2004 | Misc. fixes
1.9.0 | 27 February 2006 | Binary stored fields, DateTools, NumberTools, RangeFilter, RegexQuery, requires Java 1.4
1.9.1 | 2 March 2006 | Bug fix in BufferedIndexOutput
2.0 | 26 May 2006 | Removed deprecated methods
2.1 | 17 February 2007 | Delete/update document in IndexWriter, QueryParser improvements, contrib/benchmark
2.2 | 19 June 2007 | Performance improvements, function queries, payloads, pre-analyzed fields, custom deletion policies
2.3.0 | 24 January 2008 | Performance improvements, custom merge policies and merge schedulers, IndexReader.reopen
2.3.1 | 23 February 2008 | Bug fixes from 2.3.0
2.3.2 | 6 May 2008 | Bug fixes from 2.3.1
2.4.0 | 8 October 2008 | Further performance improvements, transactional semantics, expungeDeletes method
2.4.1 | 9 March 2009 | Bug fixes from 2.4.0
2.9 | 25 September 2009 | New per-segment Collector API, faster search performance, near real-time search, attribute-based analysis
2.9.1 | 6 November 2009 | Bug fixes from 2.9
3.0.0 | 25 November 2009 | Removed deprecated methods, fixed some bugs
3.0.1 and 2.9.2 | 26 February 2010 | Bug fixes from previous minor versions. Both have the same bugfix level


Lucene documentation


Lucene : Simple indexing example

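import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

Directory directory = new RAMDirectory();
IndexWriter writer = new IndexWriter(directory, new WhitespaceAnalyzer(),
        IndexWriter.MaxFieldLength.UNLIMITED);

Document doc = new Document();
doc.add(new Field("company", "Xebia", Field.Store.YES, Field.Index.NOT_ANALYZED)); // searchable as a single term
doc.add(new Field("country", "France", Field.Store.YES, Field.Index.NO)); // stored only, not searchable

writer.addDocument(doc);
writer.close();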


Lucene : Simple search example

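import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

IndexSearcher searcher = new IndexSearcher(directory, true); // true = read-only

// "country" is stored with Field.Index.NO in the indexing example, so it cannot be
// searched; query the indexed "company" field instead.
Term t = new Term("company", "Xebia");
Query query = new TermQuery(t);
TopDocs docs = searcher.search(query, 10);

assertEquals(1, docs.totalHits);

searcher.close();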


Lucene - indexing


Lucene - analyzers


Lucene – Field types

Store : YES / NO

Index : NO / ANALYZED / NOT_ANALYZED / ANALYZED_NO_NORMS / NOT_ANALYZED_NO_NORMS

TermVector : NO / WITH_POSITIONS / WITH_OFFSETS / WITH_POSITIONS_OFFSETS / YES


Lucene storage - segments


Lucene storage - segments

A new segment is created each time the IndexWriter is flushed

When documents are deleted, they are only marked as deleted in the segment that contains them


Lucene storage – segments merge

Segments can be merged manually with IndexWriter.optimize()

Or they are merged automatically, depending on each segment's level: (int) (log(max(minMergeMB, size)) / log(mergeFactor))
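
To get a feel for the level formula, here is a minimal sketch that evaluates it for a few segment sizes. It is not Lucene's merge policy code: the class name, the use of megabytes for every size, and the sample values (minMergeMB = 1.6, mergeFactor = 10) are assumptions for illustration. The idea is that segments falling on the same level are candidates to be merged together.

```java
final class MergeLevelSketch {

    // Level of a segment: (int) (log(max(minMergeMB, size)) / log(mergeFactor)).
    static int level(double sizeMB, double minMergeMB, int mergeFactor) {
        double flooredSize = Math.max(minMergeMB, sizeMB); // tiny segments share the lowest level
        return (int) (Math.log(flooredSize) / Math.log(mergeFactor));
    }

    public static void main(String[] args) {
        double minMergeMB = 1.6; // hypothetical setting
        int mergeFactor = 10;    // hypothetical setting
        for (double sizeMB : new double[] {0.5, 50.0, 1500.0}) {
            System.out.println(sizeMB + " MB -> level " + level(sizeMB, minMergeMB, mergeFactor));
        }
    }
}
```

With these values, each level spans roughly one order of magnitude in segment size, which is why a larger mergeFactor means fewer but bigger merges.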


Lucene - search


Lucene - search

Programmatic API

TermQuery

PhraseQuery

WildcardQuery

RangeQuery (TermRangeQuery and NumericRangeQuery since Lucene 2.9)

FuzzyQuery

BooleanQuery


Lucene - QueryParser

QueryParser builds a Query object from a user query string

+JUNIT +ANT -MOCK +xebya~0.8 +title:"Junit in action"

Most of the time, it won't fit the application's requirements


Lucene – contrib/QueryParser

Framework that simplifies the creation of a query parser that fits your needs

3 layers:

QueryParser: transforms a query string into an Abstract Syntax Tree representation

QueryNodeProcessor: processes the nodes of the tree to move, remove or modify them

QueryBuilder: builds a Lucene BooleanQuery tree from the abstract syntax tree


Lucene – boolean queries


Lucene – PhraseQuery & SpanQuery

SpanQuery: matches documents that contain terms separated by at most n other terms (n is the 'slop')

PhraseQuery: SpanQuery with a slop value of 0

Both use position information
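
The role of positions and slop can be illustrated with a small plain-Java sketch. It is a simplification under stated assumptions: it only handles two terms in order, counts the terms sitting between them, and uses hypothetical names (SlopMatchSketch, positions, matches). Lucene's own slop handling is more general.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SlopMatchSketch {

    // Position information for one document: term -> positions where it occurs.
    static Map<String, List<Integer>> positions(String text) {
        Map<String, List<Integer>> result = new HashMap<>();
        String[] tokens = text.toLowerCase().split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            result.computeIfAbsent(tokens[i], t -> new ArrayList<>()).add(i);
        }
        return result;
    }

    // True if 'second' occurs after 'first' with at most 'slop' other terms in between.
    static boolean matches(Map<String, List<Integer>> doc, String first, String second, int slop) {
        for (int p1 : doc.getOrDefault(first, Collections.<Integer>emptyList())) {
            for (int p2 : doc.getOrDefault(second, Collections.<Integer>emptyList())) {
                int termsBetween = p2 - p1 - 1;
                if (termsBetween >= 0 && termsBetween <= slop) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> doc = positions("Junit in action");
        System.out.println(matches(doc, "junit", "in", 0));     // true: adjacent, exact phrase
        System.out.println(matches(doc, "junit", "action", 0)); // false: one term in between
        System.out.println(matches(doc, "junit", "action", 1)); // true: slop of 1 allows it
    }
}
```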


Lucene storage – approximate queries

Approximate queries (Prefix, Regex, Wildcard, Fuzzy) are rewritten into a set of TermQueries

Dictionary = { court, cours, courir }

FuzzyQuery = cour

TransformedQuery = court OR cours
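
The same rewriting idea applies to prefix queries. As a hedged sketch (plain Java, hypothetical class and method names, and a dictionary extended with two made-up terms), a sorted term dictionary lets a prefix be expanded by scanning a contiguous range of terms:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

final class PrefixExpansionSketch {

    // Returns every dictionary term starting with the prefix, in sorted order.
    static List<String> expand(SortedSet<String> dictionary, String prefix) {
        List<String> terms = new ArrayList<>();
        // Terms are sorted, so all matches are contiguous, starting at the first term >= prefix.
        for (String term : dictionary.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            terms.add(term);
        }
        return terms;
    }

    public static void main(String[] args) {
        SortedSet<String> dictionary =
                new TreeSet<>(Arrays.asList("chat", "court", "cours", "courir", "dent"));
        // The prefix query "cour*" becomes a disjunction of plain terms.
        System.out.println(String.join(" OR ", expand(dictionary, "cour")));
        // prints: courir OR cours OR court
    }
}
```

The cost of such a query grows with the number of dictionary terms it expands to, which is why broad prefixes or wildcards can be expensive.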


Inverted Index


Lucene – Levenshtein distance

FuzzyQuery uses the Levenshtein distance: the minimum number of single-character edits (insertions, deletions, substitutions) required to turn one word into another
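
A minimal sketch of the distance, using the classic dynamic-programming formulation, followed by a naive fuzzy expansion over a small dictionary. The class name and the maxEdits threshold are assumptions for illustration; Lucene's FuzzyQuery at the time of this talk is configured with a minimum similarity rather than an edit count.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class FuzzyExpansionSketch {

    // Levenshtein distance computed row by row, keeping only two rows in memory.
    static int distance(String a, String b) {
        int[] previous = new int[b.length() + 1];
        int[] current = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            current[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int substitution = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                current[j] = Math.min(
                        Math.min(current[j - 1] + 1, previous[j] + 1),
                        previous[j - 1] + substitution);
            }
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[b.length()];
    }

    // Naive expansion: compare the word against every term of the dictionary.
    static List<String> expand(List<String> dictionary, String word, int maxEdits) {
        List<String> matches = new ArrayList<>();
        for (String term : dictionary) {
            if (distance(word, term) <= maxEdits) {
                matches.add(term);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> dictionary = Arrays.asList("court", "cours", "courir");
        System.out.println(String.join(" OR ", expand(dictionary, "cour", 1)));
        // prints: court OR cours
    }
}
```

Comparing the word with every dictionary term is what makes this naive approach slow on a large index.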


Lucene - FuzzyQuery

Current implementation is not optimal

LUCENE-2089 will use a Levenshtein automaton

Prefix length | PQ size | Avg ms (old) | Avg ms (new)
0 | 1024 | 3286.0 | 7.8
0 | 64 | 3320.4 | 7.6
1 | 1024 | 316.8 | 5.6
1 | 64 | 314.3 | 5.6
2 | 1024 | 31.8 | 3.8
2 | 64 | 31.9 | 3.7


Lucene – Highlighter

Produces ready-to-use HTML snippets with the query words highlighted

Can be fully customized

By default, only the first 50 × 1024 characters of the text are analyzed

Use FastVectorHighlighter for faster results (~2.5 times faster)


Lucene – FieldCache

Lucene cache that holds the values of a single field in memory

Used internally by Sort objects

Can be used to manually load the values of a single field:

float[] weights = FieldCache.DEFAULT.getFloats(reader, "weight");
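
Why a per-document array helps sorting can be shown without Lucene. In this sketch the weights array plays the role of the cached field values, indexed by document id; the class name, method name and sample values are hypothetical.

```java
import java.util.Arrays;

final class FieldCacheSketch {

    // Sorts hit document ids by decreasing cached value, without loading any stored document.
    static Integer[] sortByWeightDescending(int[] hitDocIds, float[] weights) {
        Integer[] sorted = new Integer[hitDocIds.length];
        for (int i = 0; i < hitDocIds.length; i++) {
            sorted[i] = hitDocIds[i];
        }
        // Each comparison is a plain array lookup by document id.
        Arrays.sort(sorted, (a, b) -> Float.compare(weights[b], weights[a]));
        return sorted;
    }

    public static void main(String[] args) {
        float[] weights = {0.5f, 2.0f, 1.0f, 3.0f}; // one value per document id
        int[] hits = {0, 1, 2};                     // ids returned by some query
        System.out.println(Arrays.toString(sortByWeightDescending(hits, weights)));
        // prints: [1, 2, 0]
    }
}
```

The trade-off is memory: the array holds one entry per document in the index, whether or not the document ever appears in a result.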


Lucene – MoreLikeThis

Finds similar documents

Produces a query to be searched

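import org.apache.lucene.search.Query;
import org.apache.lucene.search.similar.MoreLikeThis; // contrib/queries

MoreLikeThis mlt = new MoreLikeThis(reader);
mlt.setFieldNames(new String[] {"title", "author"});
mlt.setMinTermFreq(1);
mlt.setMinDocFreq(1);

Query query = mlt.like(docId); // docId: Lucene id of the reference document
indexSearcher.search(query, 10);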


Lucene – Function Queries

Allows score customization

Consider using FieldCache to reduce fetching cost

FieldScoreQuery scoreQuery = new FieldScoreQuery("score", FieldScoreQuery.Type.BYTE);
CustomScoreQuery customQ = new CustomScoreQuery(q, scoreQuery) {
    @Override
    public float customScore(int doc, float subQueryScore, float valSrcScore) {
        // Dampen the text relevance score, then weight it by the "score" field value.
        return (float) (Math.sqrt(subQueryScore) * valSrcScore);
    }
};


Lucene – Luke


Lucene – Global performance tuning

Consider using SSD for low latency

Consider using RAMDirectory / InstantiatedIndex

Use the latest version of Lucene

Use NIOFSDirectory on Unix and MMapDirectory on Windows

Try turning off the compound file format (setUseCompoundFile(false))


Lucene – Indexing performance tuning

Set RAMBufferSizeMB according to your needs

Tune your merge policy with care


Lucene – Search performance tuning

Open IndexReader in read-only mode (default in Lucene 2.9+)

Warm up the FieldCache to ensure immediate access when sorting

Limit the use of TermVector

Ensure the index is optimized


Architecture with Hibernate Search


Architecture with Solr


Architecture with Infinispan


Lucene – Distributed : Katta

Shards a Lucene index and distributes it over several instances

Uses Hadoop for distribution


Lucene galaxy

Apache Nutch: Lucene + crawling and parsing

Apache Compass: search engine framework

Apache Solr: Lucene standalone search server

Apache Mahout: distributed machine learning

Hibernate Search: Hibernate + Lucene

Katta: distributed Lucene with Hadoop


Lucene - Futures

Flex Branch : making Lucene even more customizable

Apache Mahout : distributed machine learning for clustering, classification and recommendation algorithms


Questions ?

