Lucene Part1

Lucene

Use Case

Store data in a two-dimensional way

How do we do this?

Spreadsheet

Relational Database

Lucene

Couple of Problems

Relational Database – 1970 – Dr. E.F. Codd

Excellent way to store 2 dimensions

Null Inapplicable

Science of Data... The Study of Null

Person: Name, Address

HockeyTeam: Person_Id, Team

Lucene

Example

Person: Name = Stephen, Address = Philadelphia

HockeyTeam: Person_Id = 5, Team = Flyers

Lucene

Inverted Indexes

Person: Name = Stephen, Address = Philadelphia

HockeyTeam: Person_Id = 5, Team = Flyers

We can easily answer the question "Where does Stephen live?" But what about "Give me all the people living in Philadelphia"? With an inverted index, we can do this too.

Lucene

Inverted Indexes

Storage contains

Philadelphia -> Stephen, Chris, Tara
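
As a rough illustration of the idea on this slide, the sketch below turns person rows into a city-to-people map. It uses only the JDK; the class and method names (CityIndexSketch, Person, invert) are made up for this example and are not Lucene APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; none of these names come from Lucene.
class CityIndexSketch {
    // Forward view: each row answers "where does this person live?"
    static final class Person {
        final String name;
        final String city;

        Person(String name, String city) {
            this.name = name;
            this.city = city;
        }
    }

    // Inverted view: each city maps to everyone who lives there.
    static Map<String, List<String>> invert(List<Person> people) {
        Map<String, List<String>> byCity = new TreeMap<>();
        for (Person p : people) {
            byCity.computeIfAbsent(p.city, k -> new ArrayList<>()).add(p.name);
        }
        return byCity;
    }

    public static void main(String[] args) {
        List<Person> people = Arrays.asList(
            new Person("Stephen", "Philadelphia"),
            new Person("Chris", "Philadelphia"),
            new Person("Tara", "Philadelphia"));
        // prints [Stephen, Chris, Tara]
        System.out.println(invert(people).get("Philadelphia"));
    }
}
```

The forward rows stay cheap for "where does Stephen live", while the inverted map makes "who lives in Philadelphia" a single lookup.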

Lucene

Document Centric

Create Documents on the Fly

Add a field called Category, which plays the role of the table name

Lucene indexes everything.

Lucene

One Developer Says

Every other open source search engine I evaluated, including Swish-E, Glimpse, iSearch, and libibex, was poorly suited to Eyebrowse's requirements in some way. This would have made integration problematic and/or time-consuming. With Lucene, I added indexing and searching to Eyebrowse in little more than half a day, from initial download to fully working code! This was less than one-tenth of the development time I had budgeted, and yielded a more tightly integrated and feature-rich result than any other search tool I considered.

Lucene

How search engines work

Creating and maintaining an inverted index is the central problem when building an efficient keyword search engine. To index a document, you must first scan it to produce a list of postings. Postings describe occurrences of a word in a document; they generally include the word, a document ID, and possibly the location(s) or frequency of the word within the document.
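
The scanning step might look roughly like the sketch below, which emits one posting (word, document ID, position) per word occurrence. The tokenization rule (lowercase, split on anything that is not a letter or digit) and the names PostingsSketch, Posting, and scan are assumptions made for this example, not Lucene's behavior.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; not Lucene's posting format.
class PostingsSketch {
    static final class Posting {
        final String word;
        final int docId;
        final int position;

        Posting(String word, int docId, int position) {
            this.word = word;
            this.docId = docId;
            this.position = position;
        }

        @Override
        public String toString() {
            return "<" + word + ", " + docId + ", " + position + ">";
        }
    }

    // Scan one document and produce its postings in order of appearance.
    static List<Posting> scan(int docId, String text) {
        List<Posting> postings = new ArrayList<>();
        int position = 0;
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (word.isEmpty()) {
                continue; // split can yield a leading empty string
            }
            postings.add(new Posting(word, docId, position++));
        }
        return postings;
    }
}
```

For example, scan(7, "Free text, free search") yields four postings, the third of which is <free, 7, 2>.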

Lucene

Building A Search Index

If you think of the postings as tuples of the form <word, document-id>, a set of documents will yield a list of postings sorted by document ID. But in order to efficiently find documents that contain specific words, you should instead sort the postings by word (or by both word and document, which will make multiword searches faster). In this sense, building a search index is basically a sorting problem. The search index is a list of postings sorted by word.
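
The "indexing is sorting" point can be sketched in a few lines: postings arrive in document order and are re-sorted by word, then by document ID. The class and method names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; not Lucene code.
class PostingSortSketch {
    static final class Posting {
        final String word;
        final int docId;

        Posting(String word, int docId) {
            this.word = word;
            this.docId = docId;
        }

        @Override
        public String toString() {
            return "<" + word + ", " + docId + ">";
        }
    }

    // Returns a new list sorted by word, then by document ID.
    static List<Posting> sortByWord(List<Posting> postings) {
        List<Posting> sorted = new ArrayList<>(postings);
        sorted.sort(Comparator.comparing((Posting p) -> p.word)
                              .thenComparingInt(p -> p.docId));
        return sorted;
    }
}
```

Given <search, 1>, <engine, 1>, <search, 2>, <index, 2> in document order, the sorted list is <engine, 1>, <index, 2>, <search, 1>, <search, 2>, so all postings for one word sit next to each other.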

Lucene

An Innovative Implementation

Most search engines use B-trees to maintain the index; they are relatively stable with respect to insertion and have well-behaved I/O characteristics (lookups and insertions are O(log n) operations). Lucene takes a slightly different approach: rather than maintaining a single index, it builds multiple index segments and merges them periodically. For each new document indexed, Lucene creates a new index segment, but it quickly merges small segments with larger ones -- this keeps the total number of segments small so searches remain fast. To optimize the index for fast searching, Lucene can merge all the segments into one, which is useful for infrequently updated indexes.
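
A toy, in-memory model of the segment idea is sketched below. Every added document becomes a one-document segment, and the two newest segments are merged whenever they hold the same number of documents. That merge rule is an assumption chosen to keep the example short; it is not Lucene's actual merge policy, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration only; not Lucene's segment format or merge policy.
class SegmentMergeSketch {
    static final class Segment {
        final int docCount;
        final TreeMap<String, TreeSet<Integer>> postings; // word -> doc IDs

        Segment(int docCount, TreeMap<String, TreeSet<Integer>> postings) {
            this.docCount = docCount;
            this.postings = postings;
        }
    }

    private final List<Segment> segments = new ArrayList<>();
    private int nextDocId = 0;

    void addDocument(String text) {
        int docId = nextDocId++;
        TreeMap<String, TreeSet<Integer>> postings = new TreeMap<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!word.isEmpty()) {
                postings.computeIfAbsent(word, k -> new TreeSet<>()).add(docId);
            }
        }
        segments.add(new Segment(1, postings));
        // Assumed rule: merge the two newest segments while they are the same size.
        while (segments.size() >= 2
                && segments.get(segments.size() - 1).docCount
                   == segments.get(segments.size() - 2).docCount) {
            Segment newer = segments.remove(segments.size() - 1);
            Segment older = segments.remove(segments.size() - 1);
            segments.add(merge(older, newer));
        }
    }

    // Builds a brand-new segment; the inputs are left untouched.
    static Segment merge(Segment a, Segment b) {
        TreeMap<String, TreeSet<Integer>> merged = new TreeMap<>();
        for (Segment s : Arrays.asList(a, b)) {
            for (Map.Entry<String, TreeSet<Integer>> e : s.postings.entrySet()) {
                merged.computeIfAbsent(e.getKey(), k -> new TreeSet<>()).addAll(e.getValue());
            }
        }
        return new Segment(a.docCount + b.docCount, merged);
    }

    // A search has to consult every segment, so fewer segments means less work.
    TreeSet<Integer> search(String word) {
        TreeSet<Integer> hits = new TreeSet<>();
        for (Segment s : segments) {
            TreeSet<Integer> ids = s.postings.get(word);
            if (ids != null) {
                hits.addAll(ids);
            }
        }
        return hits;
    }

    List<Integer> segmentSizes() {
        List<Integer> sizes = new ArrayList<>();
        for (Segment s : segments) {
            sizes.add(s.docCount);
        }
        return sizes;
    }
}
```

With this rule, five added documents leave two segments holding 4 and 1 documents, and eight leave a single segment, so the segment count grows only logarithmically with the number of documents.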

Lucene

Preventing Conflicts

To prevent conflicts (or locking overhead) between index readers and writers, Lucene never modifies segments in place; it only creates new ones. When merging segments, Lucene writes a new segment and deletes the old ones, but only after any active readers have closed them. This approach scales well, offers the developer a high degree of flexibility in trading off indexing speed for searching speed, and has desirable I/O characteristics for both merging and searching.
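
One way to picture why this avoids locking is the sketch below. The writer never edits the current segment list; it publishes a new immutable list, and a reader keeps working from whatever list it picked up when it opened. Segments are represented by plain names, a single writer is assumed, and deleting old files once readers close is left out. All names are hypothetical and this is not how Lucene is implemented internally.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical illustration only; assumes a single writer thread.
class SnapshotSketch {
    // The current set of segments; replaced wholesale, never edited in place.
    private final AtomicReference<List<String>> live =
        new AtomicReference<>(Collections.<String>emptyList());

    // Writer: add a freshly written segment by publishing a new list.
    void publish(String newSegment) {
        List<String> next = new ArrayList<>(live.get());
        next.add(newSegment);
        live.set(Collections.unmodifiableList(next));
    }

    // Writer: swap several old segments for one merged segment.
    void replaceWithMerged(List<String> oldSegments, String merged) {
        List<String> next = new ArrayList<>(live.get());
        next.removeAll(oldSegments);
        next.add(merged);
        live.set(Collections.unmodifiableList(next));
    }

    // Reader: take a snapshot; later publishes do not affect it.
    List<String> openReader() {
        return live.get();
    }
}
```

A reader that opened before a merge still sees the old segment names, while a reader opened afterwards sees only the merged one.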

Lucene

Index Segment

A Lucene index segment consists of several files:

A dictionary index containing one entry for each 100 entries in the dictionary

A dictionary containing one entry for each unique word

A postings file containing an entry for each posting

Lucene

Flat Files

Since Lucene never updates segments in place, they can be stored in flat files instead of complicated B-trees. For quick retrieval, the dictionary index contains offsets into the dictionary file, and the dictionary holds offsets into the postings file. Lucene also implements a variety of tricks to compress the dictionary and posting files -- thereby reducing disk I/O -- without incurring substantial CPU overhead.
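
The two-level lookup described here can be approximated in memory as follows. A small "dictionary index" holds every Nth word; a lookup binary-searches that small array, then scans at most N entries of the full dictionary. Array positions stand in for file offsets, and the class name and the interval parameter are assumptions for this sketch (the slides mention one index entry per 100 dictionary entries).

```java
import java.util.Arrays;

// Hypothetical illustration only; array positions stand in for file offsets.
class DictionaryIndexSketch {
    private final String[] dictionary; // sorted, unique words
    private final String[] indexTerms; // every interval-th word of the dictionary
    private final int interval;

    DictionaryIndexSketch(String[] sortedWords, int interval) {
        this.dictionary = sortedWords.clone();
        this.interval = interval;
        int n = (dictionary.length + interval - 1) / interval;
        this.indexTerms = new String[n];
        for (int i = 0; i < n; i++) {
            indexTerms[i] = dictionary[i * interval];
        }
    }

    // Returns the word's position in the dictionary, or -1 if it is absent.
    int lookup(String word) {
        int slot = Arrays.binarySearch(indexTerms, word);
        if (slot < 0) {
            slot = -slot - 2; // last index entry that sorts before the word
        }
        if (slot < 0) {
            return -1; // word sorts before the first dictionary entry
        }
        int start = slot * interval;
        int end = Math.min(start + interval, dictionary.length);
        for (int i = start; i < end; i++) {
            if (dictionary[i].equals(word)) {
                return i;
            }
        }
        return -1;
    }
}
```

Only the small index needs to stay in memory; the scan touches one short, contiguous run of the dictionary, which is the access pattern flat files handle well.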

Lucene

Using Lucene - Create an Index

The simple program CreateIndex.java creates an empty index by generating an IndexWriter object and instructing it to build an empty index. In this example, the name of the directory that will store the index is specified on the command line.

Lucene

The Code

public class CreateIndex {
    // usage: CreateIndex index-directory
    public static void main(String[] args) throws Exception {
        String indexPath = args[0];
        // An index is created by opening an IndexWriter with
        // create argument set to true.
        IndexWriter writer = new IndexWriter(indexPath, null, true);
        writer.close();
    }
}

Lucene

Index Text Documents

IndexFiles.java shows how to add documents -- the files named on the command line -- to an index. For each file, IndexFiles creates a Document object, then calls IndexWriter.addDocument to add it to the index. From Lucene's point of view, a Document is a collection of fields that are name-value pairs. A Field can obtain its value from a String, for short fields, or an InputStream, for long fields. Using fields allows you to partition a document into separately searchable and indexable sections, and to associate metadata -- such as name, author, or modification date -- with a document. For example, when storing mail messages, you could put a message's subject, author, date, and body in separate fields, then build semantically richer queries like "subject contains Java AND author contains Gosling."
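
To make the field idea concrete without depending on any particular Lucene version, here is a small JDK-only sketch that indexes each field separately and answers the "subject contains Java AND author contains Gosling" query by intersecting two posting sets. The class and its methods are hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical illustration only; not Lucene's Document or Field classes.
class FieldedIndexSketch {
    // Key is "field:word"; value is the set of documents with that word in that field.
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();
    private int nextDocId = 0;

    // Convenience builder for a mail message with two fields.
    static Map<String, String> mail(String subject, String author) {
        Map<String, String> fields = new HashMap<>();
        fields.put("subject", subject);
        fields.put("author", author);
        return fields;
    }

    int addDocument(Map<String, String> fields) {
        int docId = nextDocId++;
        for (Map.Entry<String, String> f : fields.entrySet()) {
            for (String word : f.getValue().toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!word.isEmpty()) {
                    postings.computeIfAbsent(f.getKey() + ":" + word, k -> new TreeSet<>())
                            .add(docId);
                }
            }
        }
        return docId;
    }

    // Documents whose given field contains the word.
    TreeSet<Integer> contains(String field, String word) {
        TreeSet<Integer> result = new TreeSet<>();
        TreeSet<Integer> ids = postings.get(field + ":" + word.toLowerCase(Locale.ROOT));
        if (ids != null) {
            result.addAll(ids);
        }
        return result;
    }

    // AND is a set intersection.
    static TreeSet<Integer> and(TreeSet<Integer> a, TreeSet<Integer> b) {
        TreeSet<Integer> both = new TreeSet<>(a);
        both.retainAll(b);
        return both;
    }
}
```

Because "java" in a subject and "java" in an author name are stored under different keys, a query on one field never matches text from the other.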

Lucene

Indexing In Depth

In the code, we store two fields in each Document: path, to identify the original file path so it can be retrieved later, and body, for the file's contents.

Lucene

Code Example

public class IndexFiles {
    // usage: IndexFiles index-path file . . .
    public static void main(String[] args) throws Exception {
        String indexPath = args[0];
        // Open the existing index (create argument set to false).
        IndexWriter writer = new IndexWriter(indexPath, new SimpleAnalyzer(), false);
        for (int i = 1; i < args.length; i++) {
            System.out.println("Indexing file " + args[i]);
            InputStream is = new FileInputStream(args[i]);
            // We create a Document with two Fields, one which contains
            // the file path, and one the file's contents.
            Document doc = new Document();
            doc.add(Field.UnIndexed("path", args[i]));
            doc.add(Field.Text("body", new InputStreamReader(is)));
            writer.addDocument(doc);
            is.close();
        }
        writer.close();
    }
}

Lucene

Search

Search.java provides an example of how to search the index. While the com.lucene.Query package contains many classes for building sophisticated queries, here we use the built-in query parser, which handles the most common queries and is less complicated to use. We create a Searcher object, use the QueryParser to create a Query object, and call Searcher.search on the query. The search operation returns a Hits object -- a collection of Document objects, one for each document matched by the query -- and an associated relevance score for each document, sorted by score.

Lucene

Code Example

public class Search {
    // usage: Search index-path query-string
    public static void main(String[] args) throws Exception {
        String indexPath = args[0];
        String queryString = args[1];
        Searcher searcher = new IndexSearcher(indexPath);
        // Terms without a field prefix are searched in the "body" field.
        Query query = QueryParser.parse(queryString, "body", new SimpleAnalyzer());
        Hits hits = searcher.search(query);
        for (int i = 0; i < hits.length(); i++) {
            System.out.println(hits.doc(i).get("path") + "; Score: " + hits.score(i));
        }
    }
}

Lucene

Query Parsing

The built-in query parser supports most queries, but if it is insufficient, you can always fall back on the rich set of query-building constructs provided. The query parser can parse queries like these:

free AND "text search" -- Search for documents containing "free" and the phrase "text search"

+text search -- Search for documents containing "text" and preferentially containing "search"

giants -football -- Search for "giants" but omit documents containing "football"

author:gosling java -- Search for documents containing "gosling" in the author field and "java" in the body
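
The meaning of the + and - prefixes in the query examples above can be sketched with a tiny evaluator over an in-memory inverted index. This is not Lucene's query parser: it ignores phrases, AND, and field prefixes, and its scoring (count of optional terms matched) is an assumption made only to show "preferentially containing". All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; far simpler than Lucene's query parser.
class MiniQuerySketch {
    private final Map<String, Set<Integer>> index = new HashMap<>();

    void add(int docId, String text) {
        for (String word : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!word.isEmpty()) {
                index.computeIfAbsent(word, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    private Set<Integer> docs(String term) {
        return index.getOrDefault(term, Collections.<Integer>emptySet());
    }

    // "+term" is required, "-term" is prohibited, a bare term is optional.
    List<Integer> search(String query) {
        List<String> required = new ArrayList<>();
        List<String> optional = new ArrayList<>();
        List<String> prohibited = new ArrayList<>();
        for (String tok : query.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (tok.isEmpty()) {
                continue;
            } else if (tok.startsWith("+")) {
                required.add(tok.substring(1));
            } else if (tok.startsWith("-")) {
                prohibited.add(tok.substring(1));
            } else {
                optional.add(tok);
            }
        }

        Set<Integer> candidates = new TreeSet<>();
        if (!required.isEmpty()) {
            candidates.addAll(docs(required.get(0)));
            for (String t : required) {
                candidates.retainAll(docs(t)); // must contain every required term
            }
        } else {
            for (String t : optional) {
                candidates.addAll(docs(t)); // any optional term is enough
            }
        }
        for (String t : prohibited) {
            candidates.removeAll(docs(t));
        }

        // Assumed scoring: one point per optional term the document contains.
        Map<Integer, Integer> score = new HashMap<>();
        for (int d : candidates) {
            int s = 0;
            for (String t : optional) {
                if (docs(t).contains(d)) {
                    s++;
                }
            }
            score.put(d, s);
        }
        List<Integer> ranked = new ArrayList<>(candidates);
        ranked.sort((a, b) -> {
            int c = Integer.compare(score.get(b), score.get(a));
            return c != 0 ? c : Integer.compare(a, b);
        });
        return ranked;
    }
}
```

With documents 1 "text search engine" and 2 "text editor", the query "+text search" returns both but ranks document 1 first; "giants -football" drops any document that mentions football.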

Lucene

Beyond Basic Text Documents

Lucene uses three major abstractions to support building text indexes: Document, Analyzer, and Directory. The Document object represents a single document, modeled as a collection of Field objects (name-value pairs). For each document to be indexed, the application creates a Document object and adds it to the index store. The Analyzer converts the contents of each Field into a sequence of tokens.

Lucene

Token

Token, the basic unit of indexing in Lucene, represents a single word to be indexed after any document domain transformation -- such as stop-word elimination, stemming, filtering, term normalization, or language translation -- has been applied. The application filters undesired tokens, like stop words or portions of the input that do not need to be indexed, through the Analyzer class. It also modifies tokens as they are encountered in the input, to perform stemming or other term normalization. Conveniently, Lucene comes with a set of standard Analyzer objects for handling common transformations like word identification and stop-word elimination, so indexing simple text documents requires no additional work. If these aren't enough, the developer can provide more sophisticated analyzers.
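
The kind of transformation an analyzer performs can be sketched as a small pipeline: lowercase and split the text, drop stop words, then normalize what is left. The stop-word list and the suffix-stripping "stemmer" below are deliberately crude assumptions for illustration; they are not Lucene's Analyzer classes or a real stemming algorithm.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not one of Lucene's Analyzer classes.
class AnalyzerSketch {
    private static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "and", "the", "of", "to", "is"));

    // Text in, index-ready tokens out.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (word.isEmpty() || STOP_WORDS.contains(word)) {
                continue; // stop-word elimination
            }
            tokens.add(stem(word)); // term normalization
        }
        return tokens;
    }

    // Crude suffix stripping, standing in for a real stemmer.
    static String stem(String word) {
        if (word.length() > 5 && word.endsWith("ing")) {
            return word.substring(0, word.length() - 3);
        }
        if (word.length() > 3 && word.endsWith("s") && !word.endsWith("ss")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }
}
```

For example, analyze("The indexing of the Documents is fast") returns [index, document, fast]. The same analysis has to be applied to query text, otherwise a search for "Documents" would not match the stored token "document".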

Lucene

Analyzer

The application provides the document data in the form of a String or InputStream, which the Analyzer converts to a stream of tokens. Because of this, Lucene can index data from any data source, not just files. If the documents are stored in files, use FileInputStream to retrieve them, as illustrated in IndexFiles.java. If they are stored in an Oracle database, provide an InputStream class to retrieve them. If a document is not a text file but an HTML or XML file, for example, you can extract content by stripping markup such as HTML tags, document headers, or formatting instructions. This can be done with a FilterInputStream, which would convert a document stream into a stream containing only the document's content text, and connect it to the InputStream that retrieves the document. So, if we wanted to index a collection of XML documents stored in an Oracle database, the resulting code would be very similar to IndexFiles.java. But it would use an application-provided InputStream class to retrieve the document from the database (instead of FileInputStream), as well as an application-provided FilterInputStream to parse the XML and extract the desired content.
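
A minimal sketch of such a filter is shown below: a FilterInputStream subclass that drops everything between '<' and '>' and passes the remaining bytes through. It is only an illustration of the wrapping technique. It does not handle comments, entities, CDATA, or '>' inside attribute values, and the class name and the strip helper are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; a real HTML/XML extractor needs a proper parser.
class TagStrippingInputStream extends FilterInputStream {
    TagStrippingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = in.read();
        if (b != '<') {
            return b; // ordinary byte, or -1 at end of stream
        }
        // Skip everything up to and including the closing '>'.
        while (b != -1 && b != '>') {
            b = in.read();
        }
        // Emit a space in place of the tag so adjacent words stay separated.
        return b == -1 ? -1 : ' ';
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        int n = 0;
        while (n < len) {
            int b = read();
            if (b == -1) {
                break;
            }
            buf[off + n++] = (byte) b;
        }
        return n == 0 ? -1 : n;
    }

    // Convenience helper for trying the filter on a string.
    static String strip(String markup) {
        byte[] bytes = markup.getBytes(StandardCharsets.UTF_8);
        try (InputStream in = new TagStrippingInputStream(new ByteArrayInputStream(bytes))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            int b;
            while ((b = in.read()) != -1) {
                out.write(b);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the filter is itself an InputStream, it can wrap a FileInputStream or a database-backed stream and be handed to the indexing code unchanged.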

Lucene

Summary

Lucene is the most flexible and convenient open source search toolkit I've ever used.

Cutting describes his primary goal for Lucene as "simplicity without loss of power or performance," and this shines through clearly in the result.

The design seems so simple, you might suspect it is just the obvious way to design a search toolkit.

We should all be so lucky as to craft such obvious designs for our own software.