Search Me: Using Lucene.Net

Post on 07-Jul-2015

description

May 2012 JaxDUG presentation by Zachary Gramana on using the Lucene.NET library to add search functionality to .NET applications. Contains an overview of search/information retrieval concepts and highlights some common use-cases.

transcript

SEARCH ME

Using Lucene.Net In Your Apps

About Me

Zachary Johnson Gramana

Engineer at Potts Consulting Group

Proud new father of Rex

Search is...

A vague term that encompasses multiple problems.

A better term is “information retrieval” (IR) system.

Interdisciplinary, drawing from:

computer science (parsing, data structures)

psychology (query grammar, human/computer interaction)

linguistics (textual analysis)

information science (scoring/relevancy)

maths (document retrieval strategy)

Problems Solved

Information Overload

Find the information that users want, not just the information they asked for.

Transparently handle all kinds of data:

structured (hierarchical)

semi-structured (markup)

un-structured data (plain text)

Single portal to multiple data types and sources.

Do it fast!

Basic IR System Capabilities

Collection (importing, crawling):

Anonymous web page crawling (Google)

User-uploaded photographs (Flickr)

Publisher upload of .mp3 files (iTunes)

Indexing:

Analysis

Modify index data structure

Querying:

Input parsing

Query generation & execution

Collecting the results

Filtering the results (optional)

What is Lucene.Net?

Port of the Apache Software Foundation’s Lucene libraries from Java to C#.

It’s a search library.

Lucene was created by Doug Cutting.

Named after his wife’s middle name.

First released in 2000 on SourceForge.

Moved to the Apache Software Foundation in September 2001.

Used By

StackOverflow

JIRA

IBM

Akamai

Apple

Autodesk

Orchard

RavenDB

CouchDB

What Isn’t Lucene.NET

Not a complete information retrieval system. Check out Google Search Appliance instead:

http://www.google.com/enterprise/search/

Not a web crawler. Check out Arachnode instead:

http://arachnode.net

Not a query service. Check out Solr instead:

http://lucene.apache.org/solr

Not hard. Check out Windows Search SDK instead:

http://bit.ly/ImRtMk

Concept and Overview

What’s In an Index?

Stores a collection of Documents, each of which represents a source record.

Documents contain:

Metadata about the source record.

(optionally) actual data from the source record.

(optionally) derived analytical products.

Documents store a collection of token/frequency pairs (optionally with positions), plus a document identifier.
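To make the token/frequency idea concrete, here is a minimal sketch in plain Python. It is not Lucene.Net code; the `analyze_document` helper and its dictionary layout are assumptions made up for illustration, and the "analysis" is just lowercasing and splitting on whitespace.

```python
from collections import defaultdict

def analyze_document(doc_id, text):
    """Toy 'document': an identifier plus token -> frequency and positions.

    Hypothetical helper for illustration only; a real analyzer does far more
    than lowercase and split on whitespace.
    """
    positions_by_token = defaultdict(list)
    for position, token in enumerate(text.lower().split()):
        positions_by_token[token].append(position)
    return {
        "id": doc_id,
        "tokens": {
            token: {"freq": len(positions), "positions": positions}
            for token, positions in positions_by_token.items()
        },
    }

doc = analyze_document(1, "to be or not to be")
print(doc["tokens"]["to"])  # -> {'freq': 2, 'positions': [0, 4]}
```

Keeping positions is optional, but without them phrase and proximity matching cannot be answered from the index alone.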

Lucene’s Index Structure

Documents store a collection of fields.

Fields are collections of terms, plus an identifier and optional term vectors.

Terms are string key-value pairs: a field name and a string value.

Lucene provides special classes to deal with tricky data, like the NumericField class.

Term vectors are terms, along with their frequency counts and positions.

Fields can be indexed, stored, or both. Storing allows a term value to be retrieved after indexing.

Indexing adds the term value to Lucene’s inverted index.
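The difference between indexed and stored fields can be sketched with a toy index in plain Python. This is not the Lucene.Net API: the `ToyIndex` class, its tuple layout, and the sample path are all hypothetical, chosen only to show that indexing makes a value searchable while storing makes it retrievable.

```python
from collections import defaultdict

class ToyIndex:
    """Hypothetical stand-in for an index; not how Lucene stores data on disk."""

    def __init__(self):
        self.inverted = defaultdict(set)  # (field name, term text) -> doc ids
        self.stored = {}                  # doc id -> {field name: original value}

    def add(self, doc_id, fields):
        """fields: list of (name, value, indexed, stored) tuples."""
        for name, value, indexed, stored in fields:
            if indexed:
                # A term is the pair (field name, string value).
                for text in value.lower().split():
                    self.inverted[(name, text)].add(doc_id)
            if stored:
                self.stored.setdefault(doc_id, {})[name] = value

    def search(self, field, text):
        return sorted(self.inverted.get((field, text.lower()), set()))

index = ToyIndex()
index.add(1, [
    ("title", "Search Me", True, True),                       # indexed and stored
    ("body", "Using Lucene.Net in your apps", True, False),   # indexed only
    ("path", "/talks/search-me.pptx", False, True),           # stored only (made-up path)
])

print(index.search("body", "lucene.net"))  # -> [1]
print(index.stored[1])                     # body is searchable but not retrievable
```

An indexed-only field keeps the index small when the original text lives elsewhere; a stored-only field suits values that are displayed with results but never searched.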

The Inverted Index

(taken from Apple’s excellent “Search Basics” article at http://bit.ly/JO59kH )

Lucene’s Index Structure

What is an ‘inverted index’?

forward (non-inverted) index: document points to a collection of terms

inverted index: term points to a collection of documents

One or more segments:

Self-contained, independent partition of the entire index.

Stores: field names, field values, term dictionary, term frequencies, term proximities, normalization factor, term vectors, and (optional) deleted record lookup table.
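The forward-versus-inverted distinction is easy to see in a few lines of plain Python. This is a sketch under simplifying assumptions: the three sample documents are invented, and segments, frequencies, and positions are ignored.

```python
from collections import defaultdict

# Forward index: document id -> terms it contains (sample data is made up).
forward_index = {
    1: ["lucene", "is", "a", "search", "library"],
    2: ["solr", "is", "a", "search", "server"],
    3: ["lucene", "powers", "solr"],
}

def invert(forward):
    """Flip the mapping so each term points at the documents containing it."""
    inverted = defaultdict(set)
    for doc_id, terms in forward.items():
        for term in terms:
            inverted[term].add(doc_id)
    return {term: sorted(doc_ids) for term, doc_ids in inverted.items()}

inverted_index = invert(forward_index)
print(inverted_index["lucene"])  # -> [1, 3]
```

With the inverted form, answering "which documents contain this term?" is a single dictionary lookup instead of a scan over every document.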

Analysis

(taken from Apple’s excellent “Search Basics” article at http://bit.ly/JO59kH )

Tokenization

(taken from Thomas Koch’s presentation “Search Basics and Lucene” at http://bit.ly/JTOrnH )

Tokenization

Normalization: “Gramåna” > “gramana”

Stemming: “preschooling” > “school”
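The two examples above can be reproduced with a rough sketch in plain Python. The `normalize` function uses the standard `unicodedata` module; `toy_stem` is a deliberately naive, hypothetical stemmer with a hard-coded prefix and suffix list, not the Snowball algorithm, and it will mangle many real words.

```python
import unicodedata

def normalize(token):
    """Strip diacritics and lowercase: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()

def toy_stem(token):
    """Naive affix stripping (illustrative assumption, not a real stemmer)."""
    for prefix in ("pre",):
        if token.startswith(prefix) and len(token) > len(prefix) + 2:
            token = token[len(prefix):]
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token

print(normalize("Gramåna"))      # -> gramana
print(toy_stem("preschooling"))  # -> school
```

The same analysis has to run at both index time and query time; otherwise a query for “gramana” would never match a document indexed as “Gramåna”.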

Norms

(taken from Thomas Koch’s presentation “Search Basics and Lucene” at http://bit.ly/JTOrnH )

Time to Look at Some Code

Getting a Query

Two options:

Parse a search string using a QueryParser class.

Programmatically build a query.

QueryParser can build very complex queries very quickly, but requires the user to provide a query string.

Programmatic building of a query requires less overhead for simple queries.
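The two routes to a query can be compared with a toy model in plain Python. Nothing here is the Lucene.Net QueryParser: `term`, `boolean`, and `parse` are hypothetical helpers, and the parser understands only a tiny subset of the query syntax (`+` for required, `-` for prohibited, `field:word`). The occurrence names mirror Lucene's MUST, SHOULD, and MUST_NOT.

```python
def term(field, text):
    """A term query: match documents whose field contains this exact term."""
    return ("term", field, text.lower())

def boolean(*clauses):
    """A boolean query: (occur, query) pairs, occur in MUST/SHOULD/MUST_NOT."""
    return ("bool", tuple(clauses))

def parse(query_string, default_field):
    """Parse whitespace-separated chunks of the form [+|-][field:]word."""
    clauses = []
    for chunk in query_string.split():
        occur = "SHOULD"
        if chunk[0] == "+":
            occur, chunk = "MUST", chunk[1:]
        elif chunk[0] == "-":
            occur, chunk = "MUST_NOT", chunk[1:]
        field, separator, text = chunk.partition(":")
        if not separator:  # no explicit field, fall back to the default
            field, text = default_field, chunk
        clauses.append((occur, term(field, text)))
    return boolean(*clauses)

# Option 1: parse a user-supplied string.
parsed = parse("+title:lucene -solr", default_field="body")

# Option 2: build the same query by hand.
built = boolean(
    ("MUST", term("title", "lucene")),
    ("MUST_NOT", term("body", "solr")),
)

print(parsed == built)  # -> True
```

Both routes end in the same query object; parsing is convenient when the input comes from a search box, while building by hand skips the parsing step and avoids syntax errors from user input.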

General Query Types

(taken from the Wikipedia entry “Information Retrieval” at http://bitly.com/T1Qbw)

Some Lucene Query Types

TermQuery (general purpose)

BooleanQuery

MultiPhraseQuery

SpanQuery

WildcardQuery

FilteredQuery

MoreLikeThisQuery

BoostingQuery

FuzzyQuery

ConstantScoreRangeQuery
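A few of these query types can be imitated against a toy term dictionary in plain Python. This is a sketch, not the Lucene.Net classes: the postings data is invented, `wildcard_query` leans on the standard `fnmatch` module, `fuzzy_query` uses a plain edit-distance threshold as a stand-in for fuzzy matching, and scoring is ignored entirely.

```python
from fnmatch import fnmatchcase

# Term -> ids of documents containing it (made-up sample data).
postings = {
    "lucene": {1, 3},
    "lucent": {4},
    "search": {1, 2},
    "searching": {2},
    "solr": {2, 3},
}

def term_query(term):
    """Exact term lookup."""
    return set(postings.get(term, set()))

def wildcard_query(pattern):
    """Match every term in the dictionary against a * / ? pattern."""
    hits = set()
    for term, doc_ids in postings.items():
        if fnmatchcase(term, pattern):
            hits |= doc_ids
    return hits

def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(previous[j] + 1,                       # deletion
                               current[j - 1] + 1,                    # insertion
                               previous[j - 1] + (char_a != char_b)))  # substitution
        previous = current
    return previous[-1]

def fuzzy_query(term, max_edits=1):
    """Match terms within max_edits of the query term."""
    hits = set()
    for candidate, doc_ids in postings.items():
        if edit_distance(term, candidate) <= max_edits:
            hits |= doc_ids
    return hits

def boolean_and(*result_sets):
    """Boolean query where every clause is required."""
    return set.intersection(*result_sets)

print(sorted(wildcard_query("search*")))                                 # -> [1, 2]
print(sorted(fuzzy_query("lucene")))                                     # -> [1, 3, 4]
print(sorted(boolean_and(term_query("lucene"), term_query("solr"))))     # -> [3]
```

Wildcard and fuzzy queries both expand into a set of matching terms before any documents are looked up, which is why they cost more than a plain term query on a large term dictionary.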

Time to Look at More Code

Lucene.Net Contribs

Spatial (geo-spatial search)

Similarity

SimpleFacetedSearch

Highlighter

SpellChecker

WordNet (synonyms)

Snowball (stemming library)

RegEx

Thanks for your time and attention.

twitter: @zgramana

blog: http://www.excitabyte.com/

Email: zgramanaATgee mail dot com

That’s All!