Date post: | 11-May-2015 |
Category: |
Technology |
Upload: | michael-figuiere |
Lucene: from theory to real world
[Title slide word cloud: Information retrieval, Apache, Performance tuning, Probabilistic, Relevance, Vector, Dictionary, Inverted index, Analysis, Model, Doug Cutting, Java, Fields, IndexReader, Document, Open Source, Query, Library, Parser, Indexing, Production, Architecture, Design, Troubleshooting, Real world, Cluster, Search application, Solr, Server]
www.xebia.fr / blog.xebia.fr
Agenda
Introduction to Information Retrieval
Lucene overview
Lucene in detail
Search application design
Performance tuning
Information Retrieval
“Information Retrieval (IR) is the science of searching for documents”
Inverted Index
Boolean Model
Queries and documents are conceived as sets of terms
Q = (T1 OR T2) AND (T3 OR T4)
D1 = {T1, T3}
D2 = {T2, T3, T4}
The result set of a query is a composition of unions and intersections
R = {D1, D2}
with union for the OR operator and intersection for the AND operator
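The evaluation above can be sketched with plain Java sets. This is a minimal illustration, not Lucene code: the class BooleanModelSketch and its methods are hypothetical names, and the postings map stands in for an inverted index (term -> documents containing it).

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Lucene: a toy inverted index for the Boolean model.
final class BooleanModelSketch {
    // term -> ids of the documents that contain the term
    private final Map<String, Set<String>> postings = new HashMap<>();

    void add(String docId, String... terms) {
        for (String term : terms) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    Set<String> docs(String term) {
        return postings.getOrDefault(term, Collections.emptySet());
    }

    // OR is the union of two posting sets
    static Set<String> or(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    // AND is the intersection of two posting sets
    static Set<String> and(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    // Evaluates Q = (T1 OR T2) AND (T3 OR T4) over D1 and D2 from the slide
    static Set<String> slideExample() {
        BooleanModelSketch index = new BooleanModelSketch();
        index.add("D1", "T1", "T3");
        index.add("D2", "T2", "T3", "T4");
        return and(or(index.docs("T1"), index.docs("T2")),
                   or(index.docs("T3"), index.docs("T4")));
    }
}
```

Calling BooleanModelSketch.slideExample() yields the set containing D1 and D2, which is the result R given on the slide.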
Vector Space Model
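Documents and queries are represented as vectors of term weights:
dj = (w1,j, w2,j, ..., wt,j)
q = (w1,q, w2,q, ..., wt,q)
Similarity can be computed as the cosine of the angle between the two vectors:
sim(dj, q) = (dj · q) / (||dj|| ||q||)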
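As a rough sketch of how a similarity between a document vector and a query vector might be computed, the snippet below uses cosine similarity, the usual choice for the vector space model. The class name CosineSimilaritySketch is hypothetical, and the weights are assumed to be already computed (for example by a tf-idf scheme, which is not shown).

```java
// Hypothetical helper, not part of Lucene: cosine similarity between two weight vectors.
final class CosineSimilaritySketch {
    static double cosine(double[] d, double[] q) {
        if (d.length != q.length) {
            throw new IllegalArgumentException("vectors must have the same number of terms");
        }
        double dot = 0.0;
        double normD = 0.0;
        double normQ = 0.0;
        for (int i = 0; i < d.length; i++) {
            dot += d[i] * q[i];
            normD += d[i] * d[i];
            normQ += q[i] * q[i];
        }
        // A vector with no weights has no direction; treat it as matching nothing
        if (normD == 0.0 || normQ == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normD) * Math.sqrt(normQ));
    }
}
```

Vectors that share no terms score 0, and vectors pointing in the same direction score 1 regardless of their length, which is why a long document is not favored merely for being long.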
Lucene
Lucene : where do we come from ?
Version          Release date       Description
0.01             March 2000         First open source release (SourceForge)
1.0              October 2000
1.01b            July 2001          Last SourceForge release
1.2              June 2002          First Apache Jakarta release
1.3              December 2003      Compound index format, QueryParser enhancements, remote searching, extensible scoring API
1.4              July 2004          Sorting, span queries, term vectors
1.4.1            August 2004        Bug fix for sorting performance
1.4.2            October 2004       IndexSearcher optimization and misc. fixes
1.4.3            29 November 2004   Misc. fixes
1.9.0            27 February 2006   Binary stored fields, DateTools, NumberTools, RangeFilter, RegexQuery, requires Java 1.4
1.9.1            2 March 2006       Bug fix in BufferedIndexOutput
2.0              26 May 2006        Removed deprecated methods
2.1              17 February 2007   Delete/update document in IndexWriter, QueryParser improvements, contrib/benchmark
2.2              19 June 2007       Performance improvements, function queries, payloads, pre-analyzed fields, custom deletion policies
2.3.0            24 January 2008    Performance improvements, custom merge policies and merge schedulers, IndexReader.reopen
2.3.1            23 February 2008   Bug fixes from 2.3.0
2.3.2            6 May 2008         Bug fixes from 2.3.1
2.4.0            8 October 2008     Further performance improvements, transactional semantics, expungeDeletes method
2.4.1            9 March 2009       Bug fixes from 2.4.0
2.9              25 September 2009  New per-segment Collector API, faster search performance, near real-time search, attribute-based analysis
2.9.1            6 November 2009    Bug fixes from 2.9
3.0.0            25 November 2009   Removed deprecated methods, fixed some bugs
3.0.1 and 2.9.2  26 February 2010   Bug fixes from previous minor versions. Both have the same bugfix level
Lucene documentation
Lucene : Simple indexing example
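// Lucene 3.0 API: index one document into an in-memory directory
Directory directory = new RAMDirectory();
IndexWriter writer = new IndexWriter(directory, new WhitespaceAnalyzer(),
        IndexWriter.MaxFieldLength.UNLIMITED);
Document doc = new Document();
// Stored and indexed as a single, unanalyzed token
doc.add(new Field("company", "Xebia", Field.Store.YES, Field.Index.NOT_ANALYZED));
// Stored only: retrievable from search results, but not searchable
doc.add(new Field("country", "France", Field.Store.YES, Field.Index.NO));
writer.addDocument(doc);
writer.close();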
Lucene : Simple search example
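// Read-only searcher over the Directory the document was indexed into
IndexSearcher searcher = new IndexSearcher(directory, true);
// Query an indexed field: "country" is stored with Field.Index.NO and could never match
Term t = new Term("company", "Xebia");
Query query = new TermQuery(t);
TopDocs docs = searcher.search(query, 10);
assertEquals(1, docs.totalHits);
searcher.close();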
Lucene - indexing
Lucene - analyzers
Lucene – Field types
Store : YES / NO
Index : NO / ANALYZED / NOT_ANALYZED / ANALYZED_NO_NORMS / NOT_ANALYZED_NO_NORMS
TermVector : NO / WITH_POSITIONS / WITH_OFFSETS / WITH_POSITIONS_OFFSETS / YES
Lucene storage - segments
A new segment is created each time IndexWriter is flushed
When documents are deleted, they are only marked as deleted in the segment that contains them
Lucene storage – segments merge
Segments are merged manually with IndexWriter.optimize()
Or merged automatically, depending on the level of each segment:
(int) (log(max(minMergeMB, size)) / log(mergeFactor))
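To make the formula concrete, here is a small sketch that computes the level for a few segment sizes. It assumes the cast applies to the whole quotient, and the class name MergeLevelSketch and the parameter values used in the note below are chosen for illustration only; the actual merge policy in Lucene does more than this.

```java
// Hypothetical helper, not Lucene's merge policy: computes the level from the formula above.
final class MergeLevelSketch {
    static int level(double sizeMB, double minMergeMB, int mergeFactor) {
        // Segments smaller than minMergeMB are all treated as having size minMergeMB
        double effectiveSize = Math.max(minMergeMB, sizeMB);
        return (int) (Math.log(effectiveSize) / Math.log(mergeFactor));
    }
}
```

With minMergeMB = 1.6 and mergeFactor = 10, a 0.5 MB segment falls in level 0, a 50 MB segment in level 1 and a 500 MB segment in level 2. A larger mergeFactor puts more segments in the same level, so merges happen less often but involve more segments at once.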
Lucene - search
Programmatic API
TermQuery
PhraseQuery
WildcardQuery
RangeQuery
FuzzyQuery
BooleanQuery
Lucene - QueryParser
QueryParser builds a Query object from a user query string
+JUNIT +ANT -MOCK +xebya~0.8 +title:"Junit in action"
Most of the time, it won't fit the application's requirements
Lucene – contrib/QueryParser
Framework that simplifies the creation of a query parser that fits your needs
3 layers :
QueryParser : transforms a query string into an Abstract Syntax Tree representation
QueryNodeProcessor : processes nodes of the tree to move, remove or modify them
QueryBuilder : builds a Lucene BooleanQuery tree from the abstract syntax tree
Lucene – boolean queries
Lucene – PhraseQuery & SpanQuery
SpanQuery : matches documents that contain terms separated by at most n other terms (n is the 'slop')
PhraseQuery : behaves like a SpanQuery with a slop value of 0
Both rely on the term position information stored in the index
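The role of positions and slop can be illustrated with a toy matcher. This is a simplified sketch that follows the definition on the slide (terms in order, with at most `slop` other terms between them); it is not how Lucene implements PhraseQuery or SpanQuery, and SlopMatchSketch and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: in-order proximity matching on term positions.
final class SlopMatchSketch {
    // term -> positions of the term in the whitespace-tokenized, lowercased text
    static Map<String, List<Integer>> positions(String text) {
        Map<String, List<Integer>> result = new HashMap<>();
        String[] tokens = text.toLowerCase().split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            result.computeIfAbsent(tokens[i], k -> new ArrayList<>()).add(i);
        }
        return result;
    }

    // True if `second` occurs after `first` with at most `slop` other terms in between
    static boolean matches(String text, String first, String second, int slop) {
        Map<String, List<Integer>> pos = positions(text);
        for (int a : pos.getOrDefault(first, Collections.emptyList())) {
            for (int b : pos.getOrDefault(second, Collections.emptyList())) {
                int termsBetween = b - a - 1;
                if (termsBetween >= 0 && termsBetween <= slop) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

For the text "junit in action", the pair (junit, action) does not match with a slop of 0 because "in" sits between the two terms, but it matches with a slop of 1.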
Lucene storage – approximate queries
Approximate queries (Prefix, Regex, Wildcard, Fuzzy) are rewritten into a set of TermQueries on the matching dictionary terms
Dictionary = { court, cours, courir }
FuzzyQuery = cour
TransformedQuery = court OR cours
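The same kind of rewriting applies to prefix queries. The sketch below expands a prefix against a sorted term dictionary; it is an illustration under assumptions, not Lucene's implementation, and PrefixExpansionSketch is a hypothetical name. A sorted set is used because the matching terms are then contiguous.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;

// Hypothetical helper, not part of Lucene: expands a prefix into the matching dictionary terms.
final class PrefixExpansionSketch {
    static List<String> expand(NavigableSet<String> dictionary, String prefix) {
        List<String> terms = new ArrayList<>();
        // Start at the first term >= prefix and stop at the first term without the prefix
        for (String term : dictionary.tailSet(prefix, true)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            terms.add(term);
        }
        return terms;
    }
}
```

With the dictionary from the slide, the prefix "cour" expands to courir, cours and court, and each of these would become one TermQuery combined with OR. A short prefix on a large dictionary can therefore expand into a very large number of terms.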
Inverted Index
Lucene – Levenshtein distance
FuzzyQuery uses the Levenshtein distance : the minimum number of single-character edits (insertions, deletions, substitutions) required to turn one word into another
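A minimal sketch of the distance computation, using the classic dynamic-programming algorithm with two rows, followed by a naive expansion of a fuzzy term against a dictionary. The class LevenshteinSketch and the maxEdits threshold are assumptions made for illustration; Lucene's FuzzyQuery is configured with a minimum similarity instead.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical helper, not part of Lucene: Levenshtein distance and a naive fuzzy expansion.
final class LevenshteinSketch {
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b: j insertions
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // i deletions to reach the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Keeps the dictionary terms within maxEdits of the query term, in dictionary order
    static List<String> expand(Collection<String> dictionary, String term, int maxEdits) {
        List<String> matches = new ArrayList<>();
        for (String candidate : dictionary) {
            if (distance(term, candidate) <= maxEdits) {
                matches.add(candidate);
            }
        }
        return matches;
    }
}
```

With the dictionary { court, cours, courir } and the term "cour", court and cours are each one edit away while courir is two edits away, so a threshold of one edit keeps court and cours. Scanning every dictionary term like this is what makes a naive fuzzy search expensive on large indexes.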
Lucene - FuzzyQuery
Current implementation is not optimal
LUCENE-2089 will use a Levenshtein automaton
Prefix length  PQ size  Avg ms (old)  Avg ms (new)
0              1024     3286.0        7.8
0              64       3320.4        7.6
1              1024     316.8         5.6
1              64       314.3         5.6
2              1024     31.8          3.8
2              64       31.9          3.7
Lucene – Highlighter
Produces ready-to-use HTML snippets with the query words highlighted
Can be fully customized
By default, only the first 50 × 1024 characters of a document are analyzed
Use FastVectorHighlighter for faster results (~2.5 times faster)
Lucene – FieldCache
Lucene cache that keeps the values of a single field in memory
Used internally by Sort objects
Can be used to manually load the values of a single field :
float[] weights = FieldCache.DEFAULT.getFloats(reader, "weight");
Lucene – MoreLikeThis
Finds similar documents
Produces a query to be searched
MoreLikeThis mlt = new MoreLikeThis(reader);
mlt.setFieldNames(new String[] {"title", "author"});
mlt.setMinTermFreq(1);
mlt.setMinDocFreq(1);
Query query = mlt.like(docId);
indexSearcher.search(query, 10);
Lucene – Function Queries
Allows score customization
Consider using FieldCache to reduce the cost of fetching field values
FieldScoreQuery scoreQuery = new FieldScoreQuery("score", FieldScoreQuery.Type.BYTE);
CustomScoreQuery customQ = new CustomScoreQuery(q, scoreQuery) {
    public float customScore(int doc, float subQueryScore, float valSrcScore) {
        return (float) (Math.sqrt(subQueryScore) * valSrcScore);
    }
};
Lucene – Luke
Lucene – Global performance tuning
Consider using SSD for low latency
Consider using RAMDirectory / InstantiatedIndex
Use the latest version of Lucene
Use NIOFSDirectory on Unix and MMapDirectory on Windows
Try turning off the compound file format (setUseCompoundFile(false))
Lucene – Indexing performance tuning
Set RAMBufferSizeMB according to your needs
Tune your merge policy with care
Lucene – Search performance tuning
Open IndexReader in read-only mode (default in Lucene 2.9+)
Warm up the FieldCache to ensure immediate access when sorting
Limit use of TermVector
Ensure index is optimized
Architecture with Hibernate Search
Architecture with Solr
Architecture with Infinispan
Lucene – Distributed : Katta
Shards and distributes Lucene index over instances
Uses Hadoop for distribution
Lucene galaxy
Apache Nutch : Lucene + Crawling and parsing
Compass : Search engine framework
Apache Solr : Lucene standalone search server
Apache Mahout : Distributed machine learning
Hibernate Search : Hibernate + Lucene
Katta : Distributed Lucene with Hadoop
Lucene - Futures
Flex Branch : making Lucene even more customizable
Apache Mahout : distributed machine learning for clustering, classification and recommendation algorithms
Questions ?