Xebia Knowledge Exchange (March 2010) - Lucene : From theory to real world

Lucene : from theory to real world

Keywords : Information retrieval, Apache, Performance tuning, Probabilistic, Relevance, Vector, Dictionary, Inverted index, Analysis, Model, Doug Cutting, Java, Fields, IndexReader, Document, Open Source, Query, Library, Parser, Indexing, Production, Architecture, Design, Troubleshooting, Real world, Cluster, Search application, Solr, Server

www.xebia.fr / blog.xebia.fr

Agenda

Introduction to Information Retrieval

Lucene overview

Lucene in detail

Search applications design

Performance tuning


Information Retrieval


Information Retrieval

“Information Retrieval (IR) is the science of searching for documents”


Inverted Index


Boolean Model

Queries and documents are conceived as sets of terms

Q = (T1 OR T2) AND (T3 OR T4)

D1 = {T1, T3}

D2 = {T2, T3, T4}

The result set of a query is a composition of unions and intersections

R = {D1, D2}

with union for the OR operator and intersection for the AND operator
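The Boolean evaluation on this slide can be sketched with plain Java sets. This is a minimal illustration, not Lucene code: the class name BooleanModelSketch and its helper methods are hypothetical, and the documents are the D1 and D2 of the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper class: evaluates Q = (T1 OR T2) AND (T3 OR T4) with set operations.
class BooleanModelSketch {

    // Returns the names of the documents that contain the given term.
    static Set<String> docsWith(Map<String, Set<String>> docs, String term) {
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Set<String>> e : docs.entrySet()) {
            if (e.getValue().contains(term)) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    // OR is a union of result sets.
    static Set<String> or(Set<String> a, Set<String> b) {
        Set<String> r = new TreeSet<>(a);
        r.addAll(b);
        return r;
    }

    // AND is an intersection of result sets.
    static Set<String> and(Set<String> a, Set<String> b) {
        Set<String> r = new TreeSet<>(a);
        r.retainAll(b);
        return r;
    }

    static Set<String> evaluate() {
        Map<String, Set<String>> docs = new LinkedHashMap<>();
        docs.put("D1", new HashSet<>(Arrays.asList("T1", "T3")));
        docs.put("D2", new HashSet<>(Arrays.asList("T2", "T3", "T4")));
        return and(or(docsWith(docs, "T1"), docsWith(docs, "T2")),
                   or(docsWith(docs, "T3"), docsWith(docs, "T4")));
    }

    public static void main(String[] args) {
        System.out.println(evaluate()); // prints [D1, D2]
    }
}
```

Both documents match because each contains at least one term from each OR group.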


Vector Space Model

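Documents and queries are represented as vectors of term weights :

d_j = (w_{1,j}, w_{2,j}, ..., w_{t,j})

q = (w_{1,q}, w_{2,q}, ..., w_{t,q})

The similarity between a document and a query is computed from these two vectors, classically as the cosine of the angle between them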
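As an illustration, here is a minimal sketch of a similarity computation over such weight vectors. It assumes the classic cosine measure, sim(d_j, q) = (d_j · q) / (|d_j| |q|), which is a common choice for the vector space model. The class name CosineSimilaritySketch is hypothetical, and this is not Lucene's actual scoring formula.

```java
// Hypothetical helper class: cosine similarity between two term-weight vectors.
class CosineSimilaritySketch {

    static double cosine(double[] d, double[] q) {
        if (d.length != q.length) {
            throw new IllegalArgumentException("vectors must have the same dimension");
        }
        double dot = 0.0;
        double normD = 0.0;
        double normQ = 0.0;
        for (int i = 0; i < d.length; i++) {
            dot += d[i] * q[i];
            normD += d[i] * d[i];
            normQ += q[i] * q[i];
        }
        // A vector with no weight at all is similar to nothing.
        if (normD == 0.0 || normQ == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normD) * Math.sqrt(normQ));
    }

    public static void main(String[] args) {
        // Same direction: similarity close to 1.
        System.out.println(cosine(new double[] {1, 2, 0}, new double[] {2, 4, 0}));
        // No term in common: similarity 0.
        System.out.println(cosine(new double[] {1, 0}, new double[] {0, 1}));
    }
}
```

Vectors pointing in the same direction score close to 1 whatever their length, which is why a long document is not favored over a short one on length alone.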


Lucene


Lucene : where do we come from ?

| Version | Release date | Description |
|---|---|---|
| 0.01 | March 2000 | First open source release (SourceForge) |
| 1.0 | October 2000 | |
| 1.01b | July 2001 | Last SourceForge release |
| 1.2 | June 2002 | First Apache Jakarta release |
| 1.3 | December 2003 | Compound index format, QueryParser enhancements, remote searching, extensible scoring API |
| 1.4 | July 2004 | Sorting, span queries, term vectors |
| 1.4.1 | August 2004 | Bug fix for sorting performance |
| 1.4.2 | October 2004 | IndexSearcher optimization and misc. fixes |
| 1.4.3 | 29 November 2004 | Misc. fixes |
| 1.9.0 | 27 February 2006 | Binary stored fields, DateTools, NumberTools, RangeFilter, RegexQuery, requires Java 1.4 |
| 1.9.1 | 2 March 2006 | Bug fix in BufferedIndexOutput |
| 2.0 | 26 May 2006 | Removed deprecated methods |
| 2.1 | 17 February 2007 | Delete/update document in IndexWriter, QueryParser improvements, contrib/benchmark |
| 2.2 | 19 June 2007 | Performance improvements, function queries, payloads, pre-analyzed fields, custom deletion policies |
| 2.3.0 | 24 January 2008 | Performance improvements, custom merge policies and merge schedulers, IndexReader.reopen |
| 2.3.1 | 23 February 2008 | Bug fixes from 2.3.0 |
| 2.3.2 | 6 May 2008 | Bug fixes from 2.3.1 |
| 2.4.0 | 8 October 2008 | Further performance improvements, transactional semantics, expungeDeletes method |
| 2.4.1 | 9 March 2009 | Bug fixes from 2.4.0 |
| 2.9 | 25 September 2009 | New per-segment Collector API, faster search performance, near real-time search, attribute-based analysis |
| 2.9.1 | 6 November 2009 | Bug fixes from 2.9 |
| 3.0.0 | 25 November 2009 | Removed deprecated methods, fixed some bugs |
| 3.0.1 and 2.9.2 | 26 February 2010 | Bug fixes from previous minor versions. Both have the same bugfix level |


Lucene documentation


Lucene : Simple indexing example

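import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.*;

// Build a small in-memory index
Directory directory = new RAMDirectory();
IndexWriter writer = new IndexWriter(directory, new WhitespaceAnalyzer(),
        IndexWriter.MaxFieldLength.UNLIMITED);

Document doc = new Document();
// "company" is stored and indexed as a single token; "country" is stored only (not searchable)
doc.add(new Field("company", "Xebia", Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new Field("country", "France", Field.Store.YES, Field.Index.NO));

writer.addDocument(doc);
writer.close();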


Lucene : Simple search example

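import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;
import static org.junit.Assert.assertEquals;

// Open the index read-only ("directory" is the Directory used in the indexing example)
IndexSearcher searcher = new IndexSearcher(directory, true);

// Search on "company": "country" was added with Field.Index.NO and cannot be matched
Term t = new Term("company", "Xebia");
Query query = new TermQuery(t);
TopDocs docs = searcher.search(query, 10);

assertEquals(1, docs.totalHits);

searcher.close();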


Lucene - indexing


Lucene - analyzers


Lucene – Field types

Store : YES / NO

Index : NO / ANALYZED / NOT_ANALYZED / ANALYZED_NO_NORMS / NOT_ANALYZED_NO_NORMS

TermVector : NO / WITH_POSITIONS / WITH_OFFSETS / WITH_POSITIONS_OFFSETS / YES


Lucene storage - segments


Lucene storage - segments

A new segment is created each time IndexWriter is flushed

When documents are deleted, they are only marked as deleted in the segment that contains them; the space is reclaimed when segments are merged


Lucene storage – segments merge

Segments are merged manually with IndexWriter.optimize()

Or merged automatically, depending on the level of each segment : (int) log(max(minMergeMB, size)) / log(mergeFactor)
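To get a feel for the level expression above, here is a minimal sketch that evaluates it for a few segment sizes. It assumes the (int) cast applies to the whole quotient and that sizes are expressed in MB. The class name MergeLevelSketch and the sample values (minMergeMB = 1.6, mergeFactor = 10) are illustrative assumptions, and this is not the actual merge policy code.

```java
// Hypothetical helper class: evaluates the merge level expression from the slide.
class MergeLevelSketch {

    // Assumption: the cast applies to the quotient, so the level is a whole number.
    static int level(double sizeMB, double minMergeMB, int mergeFactor) {
        double clamped = Math.max(minMergeMB, sizeMB); // tiny segments share the lowest level
        return (int) (Math.log(clamped) / Math.log(mergeFactor));
    }

    public static void main(String[] args) {
        double minMergeMB = 1.6; // illustrative value
        int mergeFactor = 10;    // illustrative value
        System.out.println(level(0.5, minMergeMB, mergeFactor)); // prints 0
        System.out.println(level(50, minMergeMB, mergeFactor));  // prints 1
        System.out.println(level(500, minMergeMB, mergeFactor)); // prints 2
    }
}
```

With a merge factor of 10, each level covers segments roughly ten times larger than the previous one. A larger merge factor therefore means fewer levels and less frequent merges.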


Lucene - search


Lucene - search

Programmatic API

TermQuery

PhraseQuery

WildcardQuery

RangeQuery (TermRangeQuery / NumericRangeQuery since Lucene 2.9)

FuzzyQuery

BooleanQuery


Lucene - QueryParser

QueryParser builds a Query object from a user query string

+JUNIT +ANT -MOCK +xebya~0.8 +title:"Junit in action"

Most of the time, the default syntax will not fit the application's requirements


Lucene – contrib/QueryParser

Framework that simplifies the creation of a query parser that fits your needs

3 layers :

QueryParser : transforms a query string into an Abstract Syntax Tree representation

QueryNodeProcessor : processes the nodes of the tree to move, remove or modify them

QueryBuilder : builds a Lucene Query tree (typically a BooleanQuery) from the abstract syntax tree


Lucene – boolean queries


Lucene – PhraseQuery & SpanQuery

SpanQuery : matches documents that contain terms separated by at most n other terms (n is the ‘slop’)

PhraseQuery : by default behaves like a SpanQuery with a slop value of 0 (terms must be adjacent)

Uses position information
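The role of position information can be sketched as follows. This is a deliberately simplified, ordered, two-term version of slop matching. The class name SlopMatchSketch and its methods are hypothetical, and Lucene's real matching over indexed positions is more general.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class: ordered two-term proximity match based on token positions.
class SlopMatchSketch {

    // Collects the positions at which the term occurs in the token stream.
    static List<Integer> positions(String[] tokens, String term) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals(term)) {
                result.add(i);
            }
        }
        return result;
    }

    // True if `second` occurs after `first` with at most `slop` tokens in between.
    static boolean matches(String[] tokens, String first, String second, int slop) {
        for (int p1 : positions(tokens, first)) {
            for (int p2 : positions(tokens, second)) {
                int gap = p2 - p1 - 1; // number of tokens between the two terms
                if (gap >= 0 && gap <= slop) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] tokens = "junit in action with ant".split(" ");
        System.out.println(matches(tokens, "junit", "in", 0));     // prints true
        System.out.println(matches(tokens, "junit", "action", 0)); // prints false
        System.out.println(matches(tokens, "junit", "action", 1)); // prints true
    }
}
```

A slop of 0 corresponds to an exact phrase. Raising the slop lets other tokens sit between the two terms.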


Lucene storage – approximate queries

Approximate queries (Prefix, Regex, Wildcard, Fuzzy) are rewritten into a set of TermQueries

Dictionary = { court, cours, courir }

FuzzyQuery = cour

TransformedQuery = court OR cours


Inverted Index


Lucene – Levenshtein distance

FuzzyQuery uses the Levenshtein distance : the minimum number of single-character edits (insertions, deletions, substitutions) required to turn one word into another
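A minimal sketch of the distance computation, applied to the dictionary { court, cours, courir } and the fuzzy term "cour" used earlier in the deck. The class name FuzzyExpansionSketch and the raw maximum distance of 1 are assumptions made for the illustration. Lucene's FuzzyQuery itself works with a similarity threshold rather than a raw distance.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class: Levenshtein distance and a naive fuzzy term expansion.
class FuzzyExpansionSketch {

    // Classic dynamic-programming Levenshtein distance, keeping two rows in memory.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Keeps the dictionary terms within maxDistance edits of the query term.
    static List<String> expand(String term, List<String> dictionary, int maxDistance) {
        List<String> matches = new ArrayList<>();
        for (String candidate : dictionary) {
            if (levenshtein(term, candidate) <= maxDistance) {
                matches.add(candidate);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> dictionary = Arrays.asList("court", "cours", "courir");
        System.out.println(expand("cour", dictionary, 1)); // prints [court, cours]
    }
}
```

Every dictionary term has to be compared with the query term, which is why a naive fuzzy search gets expensive on a large dictionary.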


Lucene - FuzzyQuery

Current implementation not optimal

LUCENE-2089 will use a Levenshtein automaton

| Prefix length | PQ size | Avg ms (old) | Avg ms (new) |
|---|---|---|---|
| 0 | 1024 | 3286.0 | 7.8 |
| 0 | 64 | 3320.4 | 7.6 |
| 1 | 1024 | 316.8 | 5.6 |
| 1 | 64 | 314.3 | 5.6 |
| 2 | 1024 | 31.8 | 3.8 |
| 2 | 64 | 31.9 | 3.7 |


Lucene – Highlighter

Produces ready to use HTML snippets with highlighted words from query

Can be fully customized

By default, only the first 50 × 1024 characters of the text are analyzed

Use FastVectorHighlighter for faster results (~2.5 times faster); it requires term vectors with positions and offsets


Lucene – FieldCache

Lucene cache that keeps all the values of a single field in memory

Used internally when sorting

Can be used to manually load the values of a single field :

float[] weights = FieldCache.DEFAULT.getFloats(reader, "weight");


Lucene – MoreLikeThis

Finds similar documents

Produces a query to be searched

MoreLikeThis mlt = new MoreLikeThis(reader);
mlt.setFieldNames(new String[] {"title", "author"});
mlt.setMinTermFreq(1);
mlt.setMinDocFreq(1);

Query query = mlt.like(docId);
indexSearcher.search(query, 10);


Lucene – Function Queries

Allows score customization

Consider using FieldCache to reduce the cost of fetching field values

FieldScoreQuery scoreQuery = new FieldScoreQuery("score", FieldScoreQuery.Type.BYTE);
CustomScoreQuery customQ = new CustomScoreQuery(q, scoreQuery) {
    // Combine the score of the main query with the value of the "score" field
    @Override
    public float customScore(int doc, float subQueryScore, float valSrcScore) {
        return (float) (Math.sqrt(subQueryScore) * valSrcScore);
    }
};


Lucene – Luke


Lucene – Global performance tuning

Consider using SSD for low latency

Consider using RAMDirectory / InstantiatedIndex

Use the latest version of Lucene

Use NIOFSDirectory on Unix and MMapDirectory on Windows

Try turning off the compound file format with setUseCompoundFile(false)


Lucene – Indexing performance tuning

Set RAMBufferSizeMB according to your needs

Tune your merge policy with care


Lucene – Search performance tuning

Open IndexReader in read-only mode (default in Lucene 2.9+)

Warm up FieldCache to ensure immediate access when sorting

Limit use of TermVector

Ensure index is optimized


Architecture with Hibernate Search


Architecture with Solr


Architecture with Infinispan


Lucene – Distributed : Katta

Shards and distributes Lucene index over instances

Uses Hadoop for distribution


Lucene galaxy

Apache Nutch : Lucene + crawling and parsing

Compass : Search engine framework

Apache Solr : Lucene standalone search server

Apache Mahout : Distributed machine learning

Hibernate Search : Hibernate + Lucene

Katta : Distributed Lucene with Hadoop


Lucene - Futures

Flex Branch : making Lucene even more customizable

Apache Mahout : distributed machine learning for clustering, classification and recommendation algorithms


Questions ?

Date post:	11-May-2015
Category:	Technology
Upload:	michael-figuiere