7/31/2019 Luce Ne Bootcamp
Lucene Boot Camp
Grant Ingersoll
Lucid Imagination
Nov. 12, 2007
Atlanta, Georgia
Intro
My Background
Your Background
Brief History of Lucene
Goals for Tutorial
Understand Lucene core capabilities
Real examples, real code, real data
Ask Questions!!!!!
Schedule
1. 10-10:10 Introducing Lucene and Search
2. 10:10-12 Indexing, Analysis, Searching, Performance
3. 12-12:05 Break
4. 12-1 More on Indexing, Analysis, Searching, Performance
5. 1-2:30 Lunch
6. 2:30-2:40 Recap, Questions, Content
7. 2:40-4 Class Example
8. 4-4:20 Break
9. 4:20-5 Class Example
10. 5-5:20 Lucene Contributions (time permitting)
11. 5:20-5:25 Open Discussion (time permitting)
12. 5:25-5:30 Resources/Wrap Up
Lucene is
NOT a crawler
See Nutch
NOT an application
See PoweredBy on the Wiki
NOT a library for doing Google PageRank or other link analysis algorithms
See Nutch
A library for enabling text based search
A Few Words about Solr
HTTP-based Search Server
XML Configuration
XML, JSON, Ruby, PHP, Java support
Caching, Replication
Many, many nice features that Lucene users need
http://lucene.apache.org/solr
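Search Basics
Goal: Identify documents that are similar to input query
Lucene uses a modified Vector Space Model (VSM)
Boolean + VSM
TF-IDF
The words in the document and the query each define a Vector in an n-dimensional space
$\mathrm{Sim}(q_1, d_1) = \cos\theta$, where $\theta$ is the angle between the two vectors
$d_j = \langle w_{1,j}, w_{2,j}, \ldots, w_{n,j} \rangle$
$q = \langle w_{1,q}, w_{2,q}, \ldots, w_{n,q} \rangle$
w = weight assigned to term
In Lucene, boolean approach restricts what documents to score
[Figure: query vector q1 and document vector d1 separated by the angle θ]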
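The vector space idea on this slide can be sketched in plain Java. This is not Lucene's scoring formula (Lucene's Similarity applies additional factors); it is a minimal illustration that assumes raw term frequency as the weight w. The names VsmSketch, termVector and cosine are made up for this example.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy vector space model. Hypothetical names, not part of Lucene. */
final class VsmSketch {

    /** Builds a term -> weight vector, using raw term frequency as the weight (an assumption). */
    static Map<String, Double> termVector(String text) {
        Map<String, Double> vector = new HashMap<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                vector.merge(token, 1.0, Double::sum);
            }
        }
        return vector;
    }

    /** Cosine of the angle between two sparse vectors; 0 if either vector is empty. */
    static double cosine(Map<String, Double> q, Map<String, Double> d) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : q.entrySet()) {
            dot += e.getValue() * d.getOrDefault(e.getKey(), 0.0);
        }
        double norms = norm(q) * norm(d);
        return norms == 0.0 ? 0.0 : dot / norms;
    }

    private static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) {
            sum += w * w;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        Map<String, Double> query = termVector("robin hood");
        String[] titles = {"Little Red Riding Hood", "Robin Hood", "Little Women"};
        for (String title : titles) {
            System.out.println(title + " -> " + cosine(query, termVector(title)));
        }
    }
}
```

A document that shares no terms with the query gets a cosine of 0, which is one way to see why restricting scoring to documents that match the boolean part of the query loses nothing.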
Indexing
Process of preparing and adding text to Lucene
Optimized for searching
Key Point: Lucene only indexes Strings
What does this mean?
Lucene doesn't care about XML, Word, PDF, etc.
There are many good open source extractors available
It's our job to convert whatever file format we have into something Lucene can use
Indexing Classes
Analyzer
Creates tokens using a Tokenizer and filters them through zero or more TokenFilters
IndexWriter
Responsible for converting text into internal Lucene format
Indexing Classes
Directory
Where the Index is stored
RAMDirectory, FSDirectory, others
Document
A collection of Fields
Can be boosted
Field
Free text, keywords, dates, etc.
Defines attributes for storing, indexing
Can be boosted
Field Constructors and parameters
Open up Fieldable and Field in IDE
How to Index
Create IndexWriter
For each input
Create a Document
Add Fields to the Document
Add the Document to the IndexWriter
Optimize (optional)
Close the IndexWriter
Task 1.a
From the Boot Camp Files, use the basic.ReutersIndexer skeleton to start
Index the small Reuters Collection using the IndexWriter, a Directory and StandardAnalyzer
Boost every 10th document by 3
Questions to Answer:
What Fields should I define?
What attributes should each Field have?
What Fields should OMIT_NORMS?
Pick a field to boost and give a reason why you think it should be boosted
Use the Luke
Searching
Key Classes:
Searcher
Provides methods for searching
Take a moment to look at the Searcher class declaration
IndexSearcher, MultiSearcher, ParallelMultiSearcher
IndexReader
Loads a snapshot of the index into memory for searching
Hits
Storage/caching of results from searching
QueryParser
JavaCC grammar for creating Lucene Queries
http://lucene.apache.org/java/docs/queryparsersyntax.html
Query
Logical representation of program's information need
Query Parsing
Basic syntax:
title:hockey +(body:stanley AND body:cup)
OR/AND must be uppercase
Default operator is OR (can be changed)
Supports fairly advanced syntax, see the website
http://lucene.apache.org/java/docs/queryparsersyntax.html
Doesn't always play nice, so beware
Many applications construct queries programmatically or restrict syntax
Task 1.b
Using the ReutersIndexerTest.java skeleton in the boot camp files
Search your newly created index using queries you develop
Delete a Document by the doc id
Hints:
Use an IndexSearcher
Create a Query using the QueryParser
Display the results from the Hits
Questions:
What is the default field for the QueryParser?
What Analyzer to use?
Task 1 Results
Locks
Lucene maintains locks on files to prevent
index corruption
Located in same directory as index
Scores from Hits are normalized
Scores across queries are NOT comparable
Lucene 2.3 has some transactional
semantics for indexing, but is not a DB
Deletion and Updates
Deletions can be a bit confusing
Both IndexReader and IndexWriter have delete methods
Updates are always a delete and an add
Updates are always a delete and an add
Yes, that is a repeat!
Nature of data structures used in search
Analysis
Analysis is the process of creating Tokens to be indexed
Analysis is usually done to improve results overall, but it comes with a price
Lucene comes with many different Analyzers, Tokenizers and TokenFilters, each with their own goals
See contrib/analyzers
StandardAnalyzer is included with the core JAR and does a good job for most English and Latin-based tasks
Often times you want the same content analyzed in different ways
Consider a catch-all Field in addition to other Fields
Commonly Used Analyzers
StandardAnalyzer
WhitespaceAnalyzer
PerFieldAnalyzerWrapper
SimpleAnalyzer
Indexing in a Nutshell
For each Document
For each Field to be tokenized
Create the tokens using the specified Tokenizer
Tokens consist of a String, position, type and offset information
Pass the tokens through the chained TokenFilters where they can be changed or removed
Add the end result to the inverted index
Position information can be altered
Useful when removing words or to prevent phrases from matching
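Inverted Index
Documents:
0  Little Red Riding Hood
1  Robin Hood
2  Little Women
Term dictionary and postings (term -> document ids):
aardvark  (no postings shown)
hood      -> 0, 1
little    -> 0, 2
red       -> 0
riding    -> 0
robin     -> 1
women     -> 2
zoo       (no postings shown)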
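The structure in the figure can be sketched in a few lines of plain Java. This is only an illustration of the idea, not Lucene's on-disk format; the class InvertedIndexSketch and its methods are hypothetical, and tokenization here is a naive lowercase whitespace split.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy inverted index. Hypothetical names, not part of Lucene. */
final class InvertedIndexSketch {

    // Sorted term dictionary: term -> ascending list of document ids (the postings).
    private final Map<String, List<Integer>> postings = new TreeMap<>();
    private int nextDocId = 0;

    /** Adds a document and returns the id assigned to it. */
    int add(String text) {
        int docId = nextDocId++;
        for (String token : text.toLowerCase().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            List<Integer> docs = postings.computeIfAbsent(token, t -> new ArrayList<>());
            // Doc ids arrive in increasing order, so checking the tail avoids duplicates.
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
                docs.add(docId);
            }
        }
        return docId;
    }

    /** Returns the ids of the documents containing the term, or an empty list. */
    List<Integer> docsFor(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.<Integer>emptyList());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add("Little Red Riding Hood");
        index.add("Robin Hood");
        index.add("Little Women");
        System.out.println("hood -> " + index.docsFor("hood"));
        System.out.println("little -> " + index.docsFor("little"));
    }
}
```

Looking up a term is a dictionary lookup followed by reading its postings list, which is why search cost depends on the query terms rather than on scanning every document.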
Tokenization
Split words into Tokens to be processed
Tokenization is fairly straightforward for most languages that use a space for word segmentation
More difficult for some East Asian languages
See the CJK Analyzer
Modifying Tokens
TokenFilters are used to alter the token stream to be indexed
Common tasks:
Remove stopwords
Lower case
Stem/Normalize, e.g. Wi-Fi -> Wi Fi
Add Synonyms
StandardAnalyzer does things that you may not want
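The tokenizer-plus-filters chain can be sketched without any Lucene classes. The example below is an assumption-laden toy: AnalysisSketch, its stopword list and its method names are invented here, and real Analyzers stream tokens rather than building lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy analysis chain: tokenizer, then lower-case filter, then stopword filter. Not Lucene code. */
final class AnalysisSketch {

    // A tiny illustrative stopword list (an assumption, not Lucene's default set).
    private static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "of"));

    /** "Tokenizer": splits on whitespace. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** "TokenFilter" that changes tokens: lower-cases each one. */
    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    /** "TokenFilter" that removes tokens: drops stopwords. */
    static List<String> removeStopwords(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!STOPWORDS.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    /** The chain: the order matters, since the stopword list is lower case. */
    static List<String> analyze(String text) {
        return removeStopwords(lowerCase(tokenize(text)));
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Return of the King"));
    }
}
```

Swapping the order of the two filters would let "The" slip past the stopword filter, a small example of why analysis choices need to be checked against real queries.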
Custom Analyzers
Solution: write your own Analyzer
Better solution: write a configurable Analyzer so you only need one Analyzer that you can easily change for your projects
See Solr
Tokenizers and TokenFilters must be newly constructed for each input
Special Cases
Dates and numbers need special treatment to be searchable
o.a.l.document.DateTools
org.apache.solr.util.NumberUtils
Altering Position Information
Increase Position Gap between sentences to prevent
phrases from crossing sentence boundaries
Index synonyms at the same position so query can
match regardless of synonym used
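Both position tricks can be illustrated with a small plain-Java model. It is only a sketch: PositionSketch, Tok and the method names are hypothetical, and the phrase check handles just two adjacent terms. The assumption is that a phrase matches when the second term sits exactly one position after the first.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of token positions. Hypothetical names, not Lucene's token classes. */
final class PositionSketch {

    /** A term together with its absolute position in the field. */
    static final class Tok {
        final String term;
        final int position;

        Tok(String term, int position) {
            this.term = term;
            this.position = position;
        }
    }

    /** Words advance the position by 1; the first word of a later sentence advances it by 1 + gap. */
    static List<Tok> positions(List<String[]> sentences, int gap) {
        List<Tok> out = new ArrayList<>();
        int pos = -1;
        for (String[] sentence : sentences) {
            for (int i = 0; i < sentence.length; i++) {
                boolean startsLaterSentence = (i == 0 && !out.isEmpty());
                pos += startsLaterSentence ? 1 + gap : 1;
                out.add(new Tok(sentence[i], pos));
            }
        }
        return out;
    }

    /** Indexes a synonym at the same position as every occurrence of the given term. */
    static void addSynonym(List<Tok> toks, String term, String synonym) {
        List<Tok> extra = new ArrayList<>();
        for (Tok t : toks) {
            if (t.term.equals(term)) {
                extra.add(new Tok(synonym, t.position));
            }
        }
        toks.addAll(extra);
    }

    /** True if some occurrence of b sits exactly one position after some occurrence of a. */
    static boolean phraseMatches(List<Tok> toks, String a, String b) {
        for (Tok x : toks) {
            if (!x.term.equals(a)) {
                continue;
            }
            for (Tok y : toks) {
                if (y.term.equals(b) && y.position == x.position + 1) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

With a gap of 0 the phrase "cup final" matches across the sentence boundary in "stanley cup" / "final game"; with a larger gap it does not, while "stanley cup" still matches, and after addSynonym(toks, "cup", "trophy") the phrase "stanley trophy" matches as well.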
5 minute Break
Indexing Performance
Behind the Scenes
Lucene indexes Documents into memory
At certain trigger points, memory (segments) are flushed to the Directory
Segments are periodically merged
Lucene 2.3 has significant performance improvements
IndexWriter Performance Factors
maxBufferedDocs
Minimum # of docs before merge occurs and a new segment is created
Usually, Larger == faster, but more RAM
mergeFactor
How often segments are merged
Smaller == less RAM, better for incremental updates
Larger == faster, better for batch indexing
maxFieldLength
Limit the number of terms indexed per Field in a Document
Lucene 2.3 IndexWriter Changes
setRAMBufferSizeMB
New model for automagically controlling indexing factors based on the amount of memory in use
Obsoletes setMaxBufferedDocs and setMergeFactor
Takes storage and term vectors out of the merge process
Turn off auto-commit if there are stored fields and term vectors
Provides significant performance increase
Index Threading
IndexWriter and IndexReader are thread-safe and can be shared between threads without external synchronization
One open IndexWriter per Directory
Parallel Indexing
Index to separate Directory instances
Merge using IndexWriter.addIndexes
Could also distribute and collect
Benchmarking Indexing
contrib/benchmark
Try out different algorithms between Lucene 2.2 and trunk (2.3)
contrib/benchmark/conf:
indexing.alg
indexing-multithreaded.alg
Info:
Mac Pro 2 x 2GHz Dual-Core Xeon
4 GB RAM
ant run-task -Dtask.alg=./conf/indexing.alg -Dtask.mem=1024M
Benchmarking Results
Version        Records/Sec   Avg. T Mem
2.2            421           39M
Trunk          2,122         52M
Trunk-mt (4)   3,680         57M
Your results will depend on analysis, etc.
Searching
Earlier we touched on basics of search using the QueryParser
Now look at:
Searcher/IndexReader Lifecycle
Query classes
More details on the QueryParser
Filters
Sorting
Lifecycle
Recall that the IndexReader loads a snapshot of index into memory
This means updates made since loading the index will not be seen
Business rules are needed to define how often to reload the index, if at all
IndexReader.isCurrent() can help
Loading an index is an expensive operation
Do not open a Searcher/IndexReader for every search
Query Classes
TermQuery is basis for all non-span queries
BooleanQuery combines multiple Query instances as clauses
should
required
PhraseQuery finds terms occurring near each other, position-wise
slop is the edit distance between two terms
Take 2-3 minutes to explore Query implementations
Spans
Spans provide information about where matches took place
Not supported by the QueryParser
Can be used in BooleanQuery clauses
Take 2-3 minutes to explore SpanQuery classes
SpanNearQuery useful for doing phrase matching
QueryParser
MultiFieldQueryParser
Boolean operators cause confusion
Better to think in terms of required (+ operator) and not allowed (- operator)
Check JIRA for QueryParser issues
http://www.gossamer-threads.com/lists/lucene/java-user/40945
Most applications either modify QP, create their own, or restrict to a subset of the syntax
Your users may not need all the flexibility of the QP
Sorting
Lucene default sort is by score
Searcher has several methods that take in a Sort object
Sorting should be addressed during indexing
Sorting is done on Fields containing a single term that can be used for comparison
The SortField defines the different sort types available
AUTO, STRING, INT, FLOAT, CUSTOM, SCORE, DOC
Sorting II
Look at Searcher, Sort and SortField
Custom sorting is done with a SortComparatorSource
Sorting can be very expensive
Terms are cached in the FieldCache
SortFilterTest.java example
Filters
Filters restrict the search space to a subset of Documents
Use Cases
Search within a Search
Restrict by date
Rating
Security
Author
Filter Classes
QueryWrapperFilter (QueryFilter)
Restrict to subset of Documents that match a Query
RangeFilter
Restrict to Documents that fall within a range
Better alternative to RangeQuery
CachingWrapperFilter
Wrap another Filter and provide caching
SortFilterTest.java example
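One way to picture a filter is as a set of allowed document ids that is intersected with the query's matches, computed once and then reused. The sketch below models that with java.util.BitSet; FilterSketch and its methods are hypothetical names, and the cache key and predicate are stand-ins for whatever defines the filter in a real application.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntPredicate;

/** Toy filter with caching. Hypothetical names, not Lucene's Filter classes. */
final class FilterSketch {

    private final Map<String, BitSet> cache = new HashMap<>();

    // Counts how often filter bits were actually computed, to make the caching visible.
    int computations = 0;

    /** Returns the allowed doc ids for the named filter, computing them only on first use. */
    BitSet bits(String name, int maxDoc, IntPredicate accepts) {
        BitSet cached = cache.get(name);
        if (cached == null) {
            computations++;
            cached = new BitSet(maxDoc);
            for (int doc = 0; doc < maxDoc; doc++) {
                if (accepts.test(doc)) {
                    cached.set(doc);
                }
            }
            cache.put(name, cached);
        }
        // Hand out a copy so callers cannot modify the cached bits.
        return (BitSet) cached.clone();
    }

    /** Keeps only the hits that the filter allows. */
    static BitSet restrict(BitSet hits, BitSet filter) {
        BitSet result = (BitSet) hits.clone();
        result.and(filter);
        return result;
    }

    public static void main(String[] args) {
        int[] ratings = {3, 7, 9, 1, 5};
        FilterSketch filters = new FilterSketch();
        BitSet highRated = filters.bits("rating>=5", ratings.length, doc -> ratings[doc] >= 5);
        BitSet hits = new BitSet();
        hits.set(0);
        hits.set(1);
        hits.set(4);
        System.out.println(restrict(hits, highRated));
    }
}
```

A cached filter is only valid for the index snapshot it was computed against, so the cache would need to be dropped whenever the reader is reopened.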
Expert Results
Searcher has several expert methods
Hits is not always what you need due to:
Caching
Normalized Scores
Reexecutes Query repeatedly as results are accessed
HitCollector allows low-level access to all
Documents as they are scored
TopDocs represents top n docs that match
TopDocsTest in examples
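The idea of collecting every scored document but keeping only the top n can be sketched with a bounded min-heap. This is a plain-Java illustration, not Lucene's implementation; TopDocsSketch and ScoredDoc are hypothetical names, and the collect(doc, score) callback only mirrors the general shape of a hit collector.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Toy top-n collector. Hypothetical names, not Lucene's TopDocs or HitCollector. */
final class TopDocsSketch {

    static final class ScoredDoc {
        final int doc;
        final float score;

        ScoredDoc(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    private final int n;

    // Min-heap on score: the weakest of the current top n sits at the head.
    private final PriorityQueue<ScoredDoc> heap =
            new PriorityQueue<>((a, b) -> Float.compare(a.score, b.score));

    TopDocsSketch(int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        this.n = n;
    }

    /** Called once per matching document as it is scored. */
    void collect(int doc, float score) {
        if (heap.size() < n) {
            heap.add(new ScoredDoc(doc, score));
        } else if (score > heap.peek().score) {
            // The new hit beats the weakest kept hit, so replace it.
            heap.poll();
            heap.add(new ScoredDoc(doc, score));
        }
    }

    /** Returns the kept documents, best score first. */
    List<ScoredDoc> topDocs() {
        List<ScoredDoc> out = new ArrayList<>(heap);
        out.sort((a, b) -> Float.compare(b.score, a.score));
        return out;
    }
}
```

Memory stays proportional to n rather than to the number of matches, and the raw scores are kept as they were computed rather than normalized.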
Searchers
MultiSearcher
Search over multiple Searchables, including remote
MultiReader
Not a Searcher, but can be used with IndexSearcher to achieve same results for local indexes
ParallelMultiSearcher
Like MultiSearcher, but threaded
RemoteSearchable
RMI based remote searching
Look at MultiSearcherTest in example code
Search Performance
Search speed is based on a number of factors:
Query Type(s)
Query Size
Analysis
Occurrences of Query Terms
Optimize
Index Size
Index type (RAMDirectory, other)
Usual Suspects
CPU
Memory
I/O
Business Needs
Query Types
Be careful with WildcardQuery as it rewrites to a BooleanQuery containing all the terms that match the wildcards
Avoid starting a WildcardQuery with wildcard
Use ConstantScoreRangeQuery instead of RangeQuery
Be careful with range queries and dates
User mailing list and Wiki have useful tips for optimizing date handling
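Why a leading wildcard is costly can be seen with a sorted term dictionary. The sketch below is plain Java with hypothetical names (WildcardSketch, expandPrefix, expandSuffix) and covers only the two simplest patterns, prefix* and *suffix; each term it returns stands for one clause of the rewritten query.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

/** Toy wildcard expansion over a sorted term dictionary. Not Lucene's WildcardQuery. */
final class WildcardSketch {

    /** Expands prefix* by seeking into the sorted dictionary and stopping at the first non-match. */
    static List<String> expandPrefix(NavigableSet<String> terms, String prefix) {
        List<String> out = new ArrayList<>();
        for (String t : terms.tailSet(prefix, true)) {
            if (!t.startsWith(prefix)) {
                break;
            }
            out.add(t);
        }
        return out;
    }

    /** Expands *suffix; with no known prefix, every term in the dictionary has to be examined. */
    static List<String> expandSuffix(NavigableSet<String> terms, String suffix) {
        List<String> out = new ArrayList<>();
        for (String t : terms) {
            if (t.endsWith(suffix)) {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        NavigableSet<String> terms = new TreeSet<>(
                Arrays.asList("hockey", "hocus", "hood", "riding", "robin", "skating"));
        System.out.println(expandPrefix(terms, "hoc"));
        System.out.println(expandSuffix(terms, "ing"));
    }
}
```

The prefix case touches only the matching slice of the dictionary, while the suffix case walks all of it, and in both cases a short pattern over a large index can expand into a very large number of clauses.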
Query Size
Stopword removal
Search an "all" field instead of many fields with the same terms
Disambiguation
May be useful when doing synonym expansion
Difficult to automate and may be slower
Some applications may allow the user to disambiguate
Relevance Feedback/More Like This
Use most important words
"Important" can be defined in a number of ways
Usual Suspects
CPU
Profile your application
Memory
Examine your heap size, garbage collection approach
I/O
Cache your Searcher
Define business logic for refreshing based on indexing needs
Warm your Searcher before going live -- See Solr
Business Needs
Do you really need to support Wildcards?
What about date range queries down to the millisecond?
Explanations
explain(Query, int) method is useful for understanding why a Document scored the way it did
ExplainsTest in sample code
Open Luke and try some queries and then use the explain button
FieldSelector
Prior to version 2.1, Lucene always loaded all Fields in a Document
FieldSelector API addition allows Lucene to skip large Fields
Options: Load, Lazy Load, No Load, Load and Break, Load for Merge, Size, Size and Break
Makes storage of original content more viable without large cost of loading it when not used
FieldSelectorTest in example code
Scoring and Similarity
Lucene has a sophisticated scoring mechanism designed to meet most needs
Has hooks for modifying scores
Scoring is handled by the Query, Weight and Scorer classes
Affecting Relevance
FunctionQuery from Solr (variation in Lucene)
Override Similarity
Implement own Query and related classes
Payloads
HitCollector
Take 5 to examine these
Lunch
1-2:30
Recap
Indexing
Searching
Performance
Odds and Ends
Explains
FieldSelector
Relevance
Next Up
Dealing with Content
File Formats
Extraction
Large Task
Miscellaneous
Wrapping Up
File Formats
Several open source libraries, projects for extracting content to use in Lucene
PDF: PDFBox
http://www.pdfbox.org/
Word: POI, Open Office, TextMining
http://www.textmining.org/textmining.zip
XML: SAX or Pull parser
HTML: Neko, Jtidy
http://people.apache.org/~andyc/neko/doc/html/
http://jtidy.sourceforge.net/
Tika
http://incubator.apache.org/tika/
Aperture
http://aperture.sourceforge.net
Aperture Basics
Crawlers
Data Connectors
Extraction Wrappers
POI, PDFBox, HTML, XML, etc.
http://aperture.wiki.sourceforge.net/Extractors will give you info on what comes back from Aperture
LuceneApertureCallbackHandler in example code
Large Task
Using the skeleton files in the com.lucenebootcamp.training.full package:
Get some content:
Web, file system
Different file formats
Index it
Plan out your fields, boosts, field properties
Support updates and deletes
Optional:
How fast can you make it go?
Divide and conquer?
Multithreaded?
Large Task
Search Content
Allow for arbitrary user queries across multiple Fields via command line or simple web interface
How fast can you make it?
Support:
Sort
Filter
Explains
How much slower is it to retrieve an explanation?
Large Task
Document Retrieval
Display/write out one or more documents
Support FieldSelector
Large Task
Optional Tasks
Hit Highlighting using contrib/Highlighter
Multithreaded indexing and Search
Explore other Field construction options
Binary fields, term vectors
Use Lucene trunk version and try out some of the changes in indexing
Try out Solr or Nutch at http://lucene.apache.org/
What do they offer that Lucene Java doesn't that you might need?
Large Task Metadata
Pair up if you want
Ask questions
2 hours
Use Luke to check your index!
Explore other parts of Lucene that you are interested in
Be prepared to discuss/share with the class
Large Task Post-Mortem
Volunteers to share?
Term Information
TermEnum gives access to terms and how many Documents they occur in
IndexReader.terms()
IndexReader.termPositions()
TermDocs gives access to the frequency of a term in a Document
IndexReader.termDocs()
Term Vectors give access to term frequency information in a given Document
IndexReader.getTermFreqVector
TermsTest in sample code
Lucene Contributions
Many people have generously contributed code to help solve common problems
These are in contrib directory of the source
Popular:
Analyzers
Highlighter
Queries and MoreLikeThis
Snowball Stemmers
Spellchecker
Open Discussion
Multilingual Best Practices
UNICODE
One Index versus many
Advanced Analysis
Distributed Lucene
Crawling
Hadoop
Nutch
Solr
Resources
http://lucene.apache.org/
http://en.wikipedia.org/wiki/Vector_space_model
Modern Information Retrieval by Baeza-Yates and Ribeiro-Neto
Lucene In Action by Hatcher and Gospodnetić
Wiki
Mailing Lists
Discussions on how to use Lucene
Discussions on how to develop Lucene
Issue Tracking
https://issues.apache.org/jira/secure/Dashboard.jspa
We always welcome patches
Ask on the mailing list before reporting a bug
Finally
Please take the time to fill out a survey to help me improve this training
Located in base directory of source
Email it to me at [email protected]
There are several Lucene related talks on Friday
Extras
Task 2
Take 10-15 minutes, pair up, and write an Analyzer and Unit Test
Examine results in Luke
Run some searches
Ideas:
Combine existing Tokenizers and TokenFilters
Normalize abbreviations
Filter out all words beginning with the letter A
Identify/Mark sentences
Questions:
What would help improve search results?
Task 2 Results
Share what you did and why
Improving Results (in most cases)
Stemming
Ignore Case
Stopword Removal
Synonyms
Pay attention to business needs
Grab Bag
Accessing Term Information
TermEnum
TermDocs
Term Vectors
FieldSelector
Scoring and Similarity
File Formats
Task 6
Count and print all the unique terms in the index and their frequencies
Notes:
Half of the class write it using TermEnum and TermDocs
Other Half write it using Term Vectors
Time your Task
Only count the title and body content
Task 6 Results
Term Vector approach is faster on smaller collections
TermEnum approach is faster on larger collections
Task 4
Re-index your collection
Add in a rating field that randomly assigns a number between 0 and 9
Write searches to sort by
Date
Title
Rating, Date, Doc Id
A Custom Sort
Questions
How to sort the title?
How to sort multiple Fields?
Task 4 Results
Add stitle to use for sorting the title
Task 5
Create and search using Filters to:
Restrict to all docs written on Feb. 26, 1987
Restrict to all docs with the word "computer" in title
Also:
Create a Filter where the length of the body + title is greater than X
Task 5 Results
Solr has more advanced Filter
mechanisms that may be worth using
Cache filters
Task 7
Pair up if you like and take 30-40 minutes to:
Pick two file formats to work on
Identify content in that format
Can you index contents on your hard drive?
Project Gutenberg, Creative Commons, Wikipedia
Combine w/ Reuters collection
Extract the content and index it using the appropriate library
Store the content as a Field
Search the content
Load Documents with and without FieldSelector and measure performance
Task 7 (cont.)
Include score and explanation in results
Dump results to XML or HTML
Be prepared to share with class what you did
What libraries did you use?
What content did you use?
What is your Document structure?
What issues did you have?
20 Minute Break
Task 7 Results
Explain what your group did
Build a Content Handler Framework
Or help out with Tika
Task 8
Building on Task 7
Incorporate one or more contrib packages into
your solution