
Lucene Boot Camp

Grant Ingersoll

Lucid Imagination

Nov. 12, 2007

Atlanta, Georgia



Intro

My Background

Your Background

Brief History of Lucene

Goals for Tutorial

Understand Lucene core capabilities

Real examples, real code, real data

Ask Questions!!!!!



Schedule

1. 10-10:10 Introducing Lucene and Search

2. 10:10-12 Indexing, Analysis, Searching, Performance

3. 12-12:05 Break

4. 12-1 More on Indexing, Analysis, Searching, Performance

5. 1-2:30 Lunch

6. 2:30-2:40 Recap, Questions, Content

7. 2:40-4 Class Example

8. 4-4:20 Break

9. 4:20-5 Class Example

10. 5-5:20 Lucene Contributions (time permitting)

11. 5:20-5:25 Open Discussion (time permitting)

12. 5:25-5:30 Resources/Wrap Up



Lucene is

NOT a crawler

See Nutch

NOT an application

See PoweredBy on the Wiki

NOT a library for doing Google PageRank or other link analysis algorithms

See Nutch

A library for enabling text-based search



A Few Words about Solr

HTTP-based Search Server

XML Configuration

XML, JSON, Ruby, PHP, Java support

Caching, Replication

Many, many nice features that Lucene users need

http://lucene.apache.org/solr



Search Basics

Goal: Identify documents that are similar to the input query

Lucene uses a modified Vector Space Model (VSM)

Boolean + VSM

TF-IDF

The words in the document and the query each define a vector in an n-dimensional space

Sim(q1, d1) = cos θ, where θ is the angle between the query vector q1 and the document vector d1

dj = (w1,j, w2,j, ..., wn,j)

q = (w1,q, w2,q, ..., wn,q)

w = weight assigned to term

In Lucene, the boolean approach restricts which documents to score
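
The cosine similarity behind the vector space idea can be sketched in plain Java. This is only a model of the concept, not Lucene's scoring code: the class name VsmSketch is made up for illustration, and raw term frequency is assumed as the weight w (Lucene's real weights involve TF-IDF and more).

```java
import java.util.HashMap;
import java.util.Map;

/** Plain-Java model of the vector space idea; not Lucene's actual scoring code. */
final class VsmSketch {

    /** Turns terms into a term -> weight map, using raw term frequency as the weight (an assumption). */
    static Map<String, Double> termVector(String... terms) {
        Map<String, Double> weights = new HashMap<>();
        for (String term : terms) {
            weights.merge(term, 1.0, Double::sum);
        }
        return weights;
    }

    /** Cosine of the angle between two sparse vectors; 0 when either vector is empty. */
    static double cosine(Map<String, Double> q, Map<String, Double> d) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : q.entrySet()) {
            dot += e.getValue() * d.getOrDefault(e.getKey(), 0.0);
        }
        double norms = norm(q) * norm(d);
        return norms == 0.0 ? 0.0 : dot / norms;
    }

    /** Euclidean length of a sparse vector. */
    private static double norm(Map<String, Double> v) {
        double sumOfSquares = 0.0;
        for (double w : v.values()) {
            sumOfSquares += w * w;
        }
        return Math.sqrt(sumOfSquares);
    }
}
```

A query and a document with the same terms score 1, and a pair with no terms in common scores 0. Everything else falls in between.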



Indexing

Process of preparing and adding text to Lucene

Optimized for searching

Key Point: Lucene only indexes Strings

What does this mean?

Lucene doesn't care about XML, Word, PDF, etc.

There are many good open source extractors available

It's our job to convert whatever file format we have into something Lucene can use



Indexing Classes

Analyzer

Creates tokens using a Tokenizer and filters them through zero or more TokenFilters

IndexWriter

Responsible for converting text into internal Lucene format



Indexing Classes

Directory

Where the Index is stored

RAMDirectory, FSDirectory, others

Document

A collection of Fields

Can be boosted

Field

Free text, keywords, dates, etc.

Defines attributes for storing, indexing

Can be boosted

Field Constructors and parameters

Open up Fieldable and Field in IDE



How to Index

Create IndexWriter

For each input

Create a Document

Add Fields to the Document

Add the Document to the IndexWriter

Optimize (optional)

Close the IndexWriter



Task 1.a

From the Boot Camp Files, use the basic.ReutersIndexer skeleton to start

Index the small Reuters Collection using the IndexWriter, a Directory and StandardAnalyzer

Boost every 10th document by 3

Questions to Answer:

What Fields should I define?

What attributes should each Field have?

What Fields should OMIT_NORMS?

Pick a field to boost and give a reason why you think it should be boosted



Use Luke to check your index



Searching

Key Classes:

Searcher

Provides methods for searching

Take a moment to look at the Searcher class declaration

IndexSearcher, MultiSearcher, ParallelMultiSearcher

IndexReader

Loads a snapshot of the index into memory for searching

Hits

Storage/caching of results from searching

QueryParser

JavaCC grammar for creating Lucene Queries

http://lucene.apache.org/java/docs/queryparsersyntax.html

Query

Logical representation of the program's information need



Query Parsing

Basic syntax:

title:hockey +(body:stanley AND body:cup)

OR/AND must be uppercase

Default operator is OR (can be changed)

Supports fairly advanced syntax, see the website

http://lucene.apache.org/java/docs/queryparsersyntax.html

Doesn't always play nice, so beware

Many applications construct queries programmatically or restrict syntax



Task 1.b

Using the ReutersIndexerTest.java skeleton in the boot camp files

Search your newly created index using queries you develop

Delete a Document by the doc id

Hints:

Use an IndexSearcher

Create a Query using the QueryParser

Display the results from the Hits

Questions:

What is the default field for the QueryParser?

What Analyzer to use?



Task 1 Results

Locks

Lucene maintains locks on files to prevent index corruption

Located in same directory as index

Scores from Hits are normalized

Scores across queries are NOT comparable

Lucene 2.3 has some transactional semantics for indexing, but is not a DB



Deletion and Updates

Deletions can be a bit confusing

Both IndexReader and IndexWriter have delete methods

Updates are always a delete and an add

Updates are always a delete and an add

Yes, that is a repeat!

Nature of data structures used in search



Analysis

Analysis is the process of creating Tokens to be indexed

Analysis is usually done to improve results overall, but it comes with a price

Lucene comes with many different Analyzers, Tokenizers and TokenFilters, each with their own goals

See contrib/analyzers

StandardAnalyzer is included with the core JAR and does a good job for most English and Latin-based tasks

Often you want the same content analyzed in different ways

Consider a catch-all Field in addition to other Fields



Commonly Used Analyzers

StandardAnalyzer

WhitespaceAnalyzer

PerFieldAnalyzerWrapper

SimpleAnalyzer



Indexing in a Nutshell

For each Document

For each Field to be tokenized

Create the tokens using the specified Tokenizer

Tokens consist of a String, position, type and offset information

Pass the tokens through the chained TokenFilters where they can be changed or removed

Add the end result to the inverted index

Position information can be altered

Useful when removing words or to prevent phrases from matching
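
The tokenize-then-filter chain can be modeled in a few lines of plain Java. This is a sketch under assumptions, not Lucene's API: AnalysisSketch, Token and the three static methods are hypothetical names, tokens carry only text and position, and the tokenizer just splits on whitespace.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Plain-Java model of a Tokenizer plus TokenFilters; the names are hypothetical, not Lucene's. */
final class AnalysisSketch {

    /** A token: its text and the position it occupies in the field. */
    static final class Token {
        final String text;
        final int position;

        Token(String text, int position) {
            this.text = text;
            this.position = position;
        }

        @Override
        public String toString() {
            return text + "@" + position;
        }
    }

    /** "Tokenizer": split on whitespace and number the tokens 0, 1, 2, ... */
    static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        int position = 0;
        for (String piece : input.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(new Token(piece, position++));
            }
        }
        return tokens;
    }

    /** "TokenFilter" that changes tokens: lower-cases the text, keeps the position. */
    static List<Token> lowerCase(List<Token> in) {
        List<Token> out = new ArrayList<>();
        for (Token t : in) {
            out.add(new Token(t.text.toLowerCase(Locale.ROOT), t.position));
        }
        return out;
    }

    /** "TokenFilter" that removes tokens: stopwords are dropped, leaving gaps in the positions. */
    static List<Token> removeStopwords(List<Token> in, Set<String> stopwords) {
        List<Token> out = new ArrayList<>();
        for (Token t : in) {
            if (!stopwords.contains(t.text)) {
                out.add(t);
            }
        }
        return out;
    }

    /** The whole chain: tokenize, then lower-case, then drop stopwords. */
    static List<Token> analyze(String input, Set<String> stopwords) {
        return removeStopwords(lowerCase(tokenize(input)), stopwords);
    }
}
```

Analyzing "The Wizard of Oz" with the stopwords "the" and "of" leaves wizard at position 1 and oz at position 3. The gap records that a word was removed between them, which is the kind of position information a phrase match can take into account.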



Inverted Index

Documents

0    Little Red Riding Hood
1    Robin Hood
2    Little Women

Term        Doc ids

aardvark
hood        0, 1
little      0, 2
red         0
riding      0
robin       1
women       2
zoo

(aardvark and zoo do not occur in these three documents)
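
Building the term-to-documents mapping for the three titles on this slide can be sketched in plain Java. The class InvertedIndexSketch is a hypothetical in-memory model, not Lucene's index format: it lower-cases, splits on whitespace, and assigns doc ids in insertion order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Plain-Java model of an inverted index: term -> sorted set of doc ids. Not Lucene's file format. */
final class InvertedIndexSketch {
    // TreeMap keeps the term dictionary sorted, like the term list on the slide.
    private final Map<String, TreeSet<Integer>> postings = new TreeMap<>();
    private int nextDocId = 0;

    /** Adds a document and returns the doc id assigned to it. */
    int add(String text) {
        int docId = nextDocId++;
        for (String term : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    /** Doc ids containing the term, in increasing order; empty if the term is unknown. */
    List<Integer> docsFor(String term) {
        TreeSet<Integer> docs = postings.get(term.toLowerCase(Locale.ROOT));
        return docs == null ? new ArrayList<>() : new ArrayList<>(docs);
    }

    /** Number of documents the term occurs in (its document frequency). */
    int docFreq(String term) {
        return docsFor(term).size();
    }
}
```

Adding the three titles in order and asking for "hood" returns doc ids 0 and 1. A lookup is a single map access, however many documents were indexed.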



Tokenization

Split words into Tokens to be processed

Tokenization is fairly straightforward for most languages that use a space for word segmentation

More difficult for some East Asian languages

See the CJK Analyzer



Modifying Tokens

TokenFilters are used to alter the token stream to be indexed

Common tasks:

Remove stopwords

Lower case

Stem/Normalize -> Wi-Fi -> Wi Fi

Add Synonyms

StandardAnalyzer does things that you may not want



Custom Analyzers

Solution: write your own Analyzer

Better solution: write a configurable Analyzer so you only need one Analyzer that you can easily change for your projects

See Solr

Tokenizers and TokenFilters must be newly constructed for each input



Special Cases

Dates and numbers need special treatment to be searchable

o.a.l.document.DateTools

org.apache.solr.util.NumberUtils

Altering Position Information

Increase Position Gap between sentences to prevent phrases from crossing sentence boundaries

Index synonyms at the same position so query can match regardless of synonym used
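
Why numbers need special treatment comes down to indexed terms being compared as Strings. A minimal sketch, assuming non-negative ints and a fixed width of 10 digits: SortableNumberSketch is a hypothetical helper for illustration and does not reproduce what DateTools or NumberUtils actually do.

```java
import java.util.Locale;

/** Sketch of a String encoding whose lexicographic order matches numeric order. Hypothetical helper. */
final class SortableNumberSketch {
    /** Wide enough for any non-negative int (Integer.MAX_VALUE has 10 digits). */
    static final int WIDTH = 10;

    /** Zero-pads a non-negative int so that String comparison agrees with numeric comparison. */
    static String encode(int value) {
        if (value < 0) {
            throw new IllegalArgumentException("this sketch only handles non-negative values");
        }
        return String.format(Locale.ROOT, "%0" + WIDTH + "d", value);
    }

    /** Reverses encode(). */
    static int decode(String term) {
        return Integer.parseInt(term);
    }
}
```

As plain Strings, "9" sorts after "10". After padding, "0000000009" sorts before "0000000010", so range queries and sorting on the encoded terms behave numerically.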



5 minute Break



Indexing Performance

Behind the Scenes

Lucene indexes Documents into memory

At certain trigger points, memory (segments) are flushed to the Directory

Segments are periodically merged

Lucene 2.3 has significant performance improvements



IndexWriter Performance

Factors

maxBufferedDocs

Minimum # of docs before merge occurs and a new segment is created

Usually, Larger == faster, but more RAM

mergeFactor

How often segments are merged

Smaller == less RAM, better for incremental updates

Larger == faster, better for batch indexing

maxFieldLength

Limit the number of terms in a Document



Lucene 2.3 IndexWriter Changes

setRAMBufferSizeMB

New model for automagically controlling indexing factors based on the amount of memory in use

Obsoletes setMaxBufferedDocs and setMergeFactor

Takes storage and term vectors out of the merge process

Turn off auto-commit if there are stored fields and term vectors

Provides significant performance increase



Index Threading

IndexWriter and IndexReader are thread-safe and can be shared between threads without external synchronization

One open IndexWriter per Directory

Parallel Indexing

Index to separate Directory instances

Merge using IndexWriter.addIndexes

Could also distribute and collect



Benchmarking Indexing

contrib/benchmark

Try out different algorithms between Lucene 2.2 and trunk (2.3)

contrib/benchmark/conf:

indexing.alg

indexing-multithreaded.alg

Info:

Mac Pro 2 x 2GHz Dual-Core Xeon

4 GB RAM

ant run-task -Dtask.alg=./conf/indexing.alg -Dtask.mem=1024M



Benchmarking Results

                Records/Sec    Avg. T Mem

2.2                     421           39M

Trunk                 2,122           52M

Trunk-mt (4)          3,680           57M

Your results will depend on analysis, etc.



Searching

Earlier we touched on basics of search using the QueryParser

Now look at:

Searcher/IndexReader Lifecycle

Query classes

More details on the QueryParser

Filters

Sorting



Lifecycle

Recall that the IndexReader loads a snapshot of index into memory

This means updates made since loading the index will not be seen

Business rules are needed to define how often to reload the index, if at all

IndexReader.isCurrent() can help

Loading an index is an expensive operation

Do not open a Searcher/IndexReader for every search
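
The "open once, reuse, reload only when stale" pattern can be sketched generically. Everything here is hypothetical: SnapshotCache is not a Lucene class, the version supplier stands in for whatever staleness check the application uses (IndexReader.isCurrent() is one candidate), and the opener stands in for the expensive open of a reader or searcher. Checking on every call is just one possible business rule.

```java
import java.util.function.LongFunction;
import java.util.function.LongSupplier;

/** Sketch of caching an expensive snapshot and reopening it only when the source has changed. */
final class SnapshotCache<S> {
    private final LongSupplier currentVersion; // stands in for asking the index for its version
    private final LongFunction<S> opener;      // stands in for the expensive open of a reader/searcher
    private S snapshot;
    private long snapshotVersion = -1;
    private int opens = 0;

    SnapshotCache(LongSupplier currentVersion, LongFunction<S> opener) {
        this.currentVersion = currentVersion;
        this.opener = opener;
    }

    /** Returns the cached snapshot, reopening only if the version changed since it was loaded. */
    synchronized S get() {
        long version = currentVersion.getAsLong();
        if (snapshot == null || version != snapshotVersion) {
            snapshot = opener.apply(version);
            snapshotVersion = version;
            opens++;
        }
        return snapshot;
    }

    /** How many times the expensive open has run; useful for checking that searches reuse the snapshot. */
    synchronized int openCount() {
        return opens;
    }
}
```

Every search calls get(). The open count only rises when the version reported by the supplier changes.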



Query Classes

TermQuery is basis for all non-span queries

BooleanQuery combines multiple Query instances as clauses

should

required

PhraseQuery finds terms occurring near each other, position-wise

slop is the edit distance between two terms

Take 2-3 minutes to explore Query implementations



Spans

Spans provide information about where matches took place

Not supported by the QueryParser

Can be used in BooleanQuery clauses

Take 2-3 minutes to explore SpanQuery classes

SpanNearQuery useful for doing phrase matching



QueryParser

MultiFieldQueryParser

Boolean operators cause confusion

Better to think in terms of required (+ operator) and not allowed (- operator)

Check JIRA for QueryParser issues

http://www.gossamer-threads.com/lists/lucene/java-user/40945

Most applications either modify QP, create their own, or restrict to a subset of the syntax

Your users may not need all the flexibility of the QP



Sorting

Lucene default sort is by score

Searcher has several methods that take in a Sort object

Sorting should be addressed during indexing

Sorting is done on Fields containing a single term that can be used for comparison

The SortField defines the different sort types available

AUTO, STRING, INT, FLOAT, CUSTOM, SCORE, DOC



Sorting II

Look at Searcher, Sort and SortField

Custom sorting is done with a SortComparatorSource

Sorting can be very expensive

Terms are cached in the FieldCache

SortFilterTest.java example



Filters

Filters restrict the search space to a subset of Documents

Use Cases

Search within a Search

Restrict by date

Rating

Security

Author



Filter Classes

QueryWrapperFilter (QueryFilter)

Restrict to subset of Documents that match a Query

RangeFilter

Restrict to Documents that fall within a range

Better alternative to RangeQuery

CachingWrapperFilter

Wrap another Filter and provide caching

SortFilterTest.java example
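
One way to picture a filter is as a bit set over doc ids that is computed once and then reused across searches. The sketch below assumes that model. FilterSketch and its methods are hypothetical names for illustration, not Lucene's Filter classes.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.function.IntPredicate;

/** Sketch of a filter as a bit set over doc ids; hypothetical names, not Lucene's Filter API. */
final class FilterSketch {

    /** Builds the bit set once: bit i is set when document i passes the filter. */
    static BitSet bits(int maxDoc, IntPredicate accepts) {
        BitSet bits = new BitSet(maxDoc);
        for (int docId = 0; docId < maxDoc; docId++) {
            if (accepts.test(docId)) {
                bits.set(docId);
            }
        }
        return bits;
    }

    /** Keeps only the matching doc ids that the filter allows, preserving their order. */
    static List<Integer> apply(int[] matchingDocIds, BitSet filterBits) {
        List<Integer> kept = new ArrayList<>();
        for (int docId : matchingDocIds) {
            if (filterBits.get(docId)) {
                kept.add(docId);
            }
        }
        return kept;
    }
}
```

Caching pays off because building the bit set visits every document, while applying it costs one bit lookup per matching document.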



Expert Results

Searcher has several expert methods

Hits is not always what you need due to:

Caching

Normalized Scores

Re-executes Query repeatedly as results are accessed

HitCollector allows low-level access to all Documents as they are scored

TopDocs represents top n docs that match

TopDocsTest in examples
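
Collecting only the top n documents as they are scored can be sketched with a small min-heap. This is a plain-Java model under assumptions: TopNCollector and ScoredDoc are hypothetical names, and the sketch does not use the HitCollector or TopDocs classes themselves.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Sketch of keeping the n best-scoring documents while every match is visited once. */
final class TopNCollector {

    /** A doc id together with its raw (not normalized) score. */
    static final class ScoredDoc {
        final int docId;
        final float score;

        ScoredDoc(int docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    private final int n;
    // Min-heap on score: the weakest hit kept so far sits at the head and is evicted first.
    private final PriorityQueue<ScoredDoc> heap =
            new PriorityQueue<>((a, b) -> Float.compare(a.score, b.score));

    TopNCollector(int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        this.n = n;
    }

    /** Called once per matching document with its score. */
    void collect(int docId, float score) {
        if (heap.size() < n) {
            heap.add(new ScoredDoc(docId, score));
        } else if (score > heap.peek().score) {
            heap.poll();
            heap.add(new ScoredDoc(docId, score));
        }
    }

    /** The collected hits, best score first. */
    List<ScoredDoc> topDocs() {
        List<ScoredDoc> result = new ArrayList<>(heap);
        result.sort((a, b) -> Float.compare(b.score, a.score));
        return result;
    }
}
```

Memory stays bounded by n no matter how many documents match, which is the appeal of a top-n result over holding every hit.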



Searchers

MultiSearcher

Search over multiple Searchables, including remote

MultiReader

Not a Searcher, but can be used with IndexSearcher to achieve same results for local indexes

ParallelMultiSearcher

Like MultiSearcher, but threaded

RemoteSearchable

RMI based remote searching

Look at MultiSearcherTest in example code



Search Performance

Search speed is based on a number of factors:

Query Type(s)

Query Size

Analysis

Occurrences of Query Terms

Optimize

Index Size

Index type (RAMDirectory, other)

Usual Suspects

CPU

Memory

I/O

Business Needs



Query Types

Be careful with WildcardQuery as it rewrites to a BooleanQuery containing all the terms that match the wildcards

Avoid starting a WildcardQuery with wildcard

Use ConstantScoreRangeQuery instead of RangeQuery

Be careful with range queries and dates

User mailing list and Wiki have useful tips for optimizing date handling
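
The cost of wildcards is easier to see with a model of the expansion step. The sketch below is not WildcardQuery itself: WildcardSketch is a hypothetical class, the term dictionary is assumed to be a naturally ordered SortedSet, and * and ? are translated to a regular expression for matching.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.regex.Pattern;

/** Sketch of expanding a wildcard against a sorted term dictionary; hypothetical names. */
final class WildcardSketch {

    /** Returns every term matching the pattern: * matches any run of characters, ? exactly one. */
    static List<String> expand(SortedSet<String> terms, String wildcard) {
        int firstWild = firstWildcardIndex(wildcard);
        String prefix = firstWild < 0 ? wildcard : wildcard.substring(0, firstWild);
        // A literal prefix lets us jump to the slice of the sorted dictionary that shares it.
        // With a leading wildcard the prefix is empty, so every term has to be examined.
        SortedSet<String> candidates = prefix.isEmpty() ? terms : terms.tailSet(prefix);
        Pattern regex = toRegex(wildcard);
        List<String> matches = new ArrayList<>();
        for (String term : candidates) {
            if (!term.startsWith(prefix)) {
                break; // past the slice sharing the prefix
            }
            if (regex.matcher(term).matches()) {
                matches.add(term);
            }
        }
        return matches;
    }

    /** Index of the first * or ?, or -1 if the pattern has no wildcard. */
    private static int firstWildcardIndex(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '*' || c == '?') {
                return i;
            }
        }
        return -1;
    }

    /** Translates the wildcard pattern to a regex, quoting every literal character. */
    private static Pattern toRegex(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                sb.append(".*");
            } else if (c == '?') {
                sb.append('.');
            } else {
                sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(sb.toString(), Pattern.DOTALL);
    }
}
```

Each returned term would become one clause of the rewritten query, so a broad pattern over a large dictionary produces a very large query. A pattern such as "*d" also gives the expansion no prefix to narrow the scan.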



Query Size

Stopword removal

Search an "all" field instead of many fields with the same terms

Disambiguation

May be useful when doing synonym expansion

Difficult to automate and may be slower

Some applications may allow the user to disambiguate

Relevance Feedback/More Like This

Use most important words

Important can be defined in a number of ways



Usual Suspects

CPU

Profile your application

Memory

Examine your heap size, garbage collection approach

I/O

Cache your Searcher

Define business logic for refreshing based on indexing needs

Warm your Searcher before going live -- See Solr

Business Needs

Do you really need to support Wildcards?

What about date range queries down to the millisecond?



Explanations

explain(Query, int) method is useful for understanding why a Document scored the way it did

ExplainsTest in sample code

Open Luke and try some queries and then use the explain button



FieldSelector

Prior to version 2.1, Lucene always loaded all Fields in a Document

FieldSelector API addition allows Lucene to skip large Fields

Options: Load, Lazy Load, No Load, Load and Break, Load for Merge, Size, Size and Break

Makes storage of original content more viable without large cost of loading it when not used

FieldSelectorTest in example code



Scoring and Similarity

Lucene has a sophisticated scoring mechanism designed to meet most needs

Has hooks for modifying scores

Scoring is handled by the Query, Weight and Scorer classes



Affecting Relevance

FunctionQuery from Solr (variation in Lucene)

Override Similarity

Implement own Query and related classes

Payloads

HitCollector

Take 5 to examine these



Lunch

1-2:30



Recap

Indexing

Searching

Performance

Odds and Ends

Explains

FieldSelector

Relevance



Next Up

Dealing with Content

File Formats

Extraction

Large Task

Miscellaneous

Wrapping Up



File Formats

Several open source libraries, projects for extracting content to use in Lucene

PDF: PDFBox

http://www.pdfbox.org/

Word: POI, Open Office, TextMining

http://www.textmining.org/textmining.zip

XML: SAX or Pull parser

HTML: Neko, JTidy

http://people.apache.org/~andyc/neko/doc/html/

http://jtidy.sourceforge.net/

Tika

http://incubator.apache.org/tika/

Aperture

http://aperture.sourceforge.net



Aperture Basics

Crawlers

Data Connectors

Extraction Wrappers

POI, PDFBox, HTML, XML, etc.

http://aperture.wiki.sourceforge.net/Extractors will give you info on what comes back from Aperture

LuceneApertureCallbackHandler in example code



Large Task

Using the skeleton files in the com.lucenebootcamp.training.full package:

Get some content:

Web, file system

Different file formats

Index it

Plan out your fields, boosts, field properties

Support updates and deletes

Optional:

How fast can you make it go?

Divide and conquer?

Multithreaded?



Large Task

Search Content

Allow for arbitrary user queries across multiple Fields via command line or simple web interface

How fast can you make it?

Support:

Sort

Filter

Explains

How much slower is it to retrieve an explanation?



Large Task

Document Retrieval

Display/write out one or more documents

Support FieldSelector



Large Task

Optional Tasks

Hit Highlighting using contrib/Highlighter

Multithreaded indexing and Search

Explore other Field construction options

Binary fields, term vectors

Use Lucene trunk version and try out some of the changes in indexing

Try out Solr or Nutch at http://lucene.apache.org/

What do they offer that Lucene Java doesn't that you might need?



Large Task Metadata

Pair up if you want

Ask questions

2 hours

Use Luke to check your index!

Explore other parts of Lucene that you are interested in

Be prepared to discuss/share with the class



Large Task Post-Mortem

Volunteers to share?



Term Information

TermEnum gives access to terms and how many Documents they occur in

IndexReader.terms()

IndexReader.termPositions()

TermDocs gives access to the frequency of a term in a Document

IndexReader.termDocs()

Term Vectors give access to term frequency information in a given Document

IndexReader.getTermFreqVector

TermsTest in sample code



Lucene Contributions

Many people have generously contributed code to help solve common problems

These are in contrib directory of the source

Popular:

Analyzers

Highlighter

Queries and MoreLikeThis

Snowball Stemmers

Spellchecker



Open Discussion

Multilingual Best Practices

UNICODE

One Index versus many

Advanced Analysis

Distributed Lucene

Crawling

Hadoop

Nutch

Solr



Resources

http://lucene.apache.org/

http://en.wikipedia.org/wiki/Vector_space_model

Modern Information Retrieval by Baeza-Yates and Ribeiro-Neto

Lucene In Action by Hatcher and Gospodnetić

Wiki

Mailing Lists

[email protected]

Discussions on how to use Lucene

[email protected]

Discussions on how to develop Lucene

Issue Tracking

https://issues.apache.org/jira/secure/Dashboard.jspa

We always welcome patches

Ask on the mailing list before reporting a bug



Resources

[email protected]



Finally

Please take the time to fill out a survey to help me improve this training

Located in base directory of source

Email it to me at [email protected]

There are several Lucene related talks on Friday



Extras



Task 2

Take 10-15 minutes, pair up, and write an Analyzer and Unit Test

Examine results in Luke

Run some searches

Ideas:

Combine existing Tokenizers and TokenFilters

Normalize abbreviations

Filter out all words beginning with the letter A

Identify/Mark sentences

Questions:

What would help improve search results?



Task 2 Results

Share what you did and why

Improving Results (in most cases)

Stemming

Ignore Case

Stopword Removal

Synonyms

Pay attention to business needs



Grab Bag

Accessing Term Information

TermEnum

TermDocs

Term Vectors

FieldSelector

Scoring and Similarity

File Formats



Task 6

Count and print all the unique terms in the index and their frequencies

Notes:

Half of the class write it using TermEnum and TermDocs

Other half write it using Term Vectors

Time your Task

Only count the title and body content



Task 6 Results

Term Vector approach is faster on smaller collections

TermEnum approach is faster on larger collections



Task 4

Re-index your collection

Add in a rating field that randomly assigns a number between 0 and 9

Write searches to sort by

Date

Title

Rating, Date, Doc Id

A Custom Sort

Questions

How to sort the title?

How to sort multiple Fields?



Task 4 Results

Add stitle to use for sorting the title



Task 5

Create and search using Filters to:

Restrict to all docs written on Feb. 26, 1987

Restrict to all docs with the word computer in title

Also:

Create a Filter where the length of the body + title is greater than X



Task 5 Results

Solr has more advanced Filter mechanisms that may be worth using

Cache filters



Task 7

Pair up if you like and take 30-40 minutes to:

Pick two file formats to work on

Identify content in that format

Can you index contents on your hard drive?

Project Gutenberg, Creative Commons, Wikipedia

Combine w/ Reuters collection

Extract the content and index it using the appropriate library

Store the content as a Field

Search the content

Load Documents with and without FieldSelector and measure performance



Task 7 (cont.)

Include score and explanation in results

Dump results to XML or HTML

Be prepared to share with class what you did

What libraries did you use?

What content did you use?

What is your Document structure?

What issues did you have?



20 Minute Break



Task 7 Results

Explain what your group did

Build a Content Handler Framework

Or help out with Tika



Task 8

Building on Task 7

Incorporate one or more contrib packages into your solution
