
Search enabled applications with lucene.net

Date post: 14-Dec-2014
Upload: willem-meints
Page 1: Search enabled applications with lucene.net

Search enabled applications with Lucene.NET

W.Meints

d35xp

Agenda

• Introduction
• Technical bits
• Inspiration

#ISKALUCENE

This is what you often build as a developer.

Because the user wants it.

Three reasons why search sometimes sucks

• Can I even search?
  • The number one reason: sometimes it’s not there, or it is there but you cannot see that it is. Confusing stuff!
• The search form is too complicated
  • I need to be an expert to find something I don’t know is there… Good thinking!
• The search engine is too slow
  • They sometimes warn you about this (why?!)

Three reasons why search sometimes sucks

• I am not going to address all of these issues today.
• The focus of this talk is on the technical stuff, which (hopefully) solves:
  • Having to use complex search forms to find something
  • Having to wait a long time before you find something
• Usability of search engines is something I could talk about for a very long time too… but not today.

This is what we expect to see today

Implementing proper search functionality

• Simplicity is key
• Gives the right answers
• Allows me to refine

What search is today

Search is hard on the developer. It involves a lot of things:

• Linguistics
• Psychology
• Information analysis
• Computer science
• Complex math

Lucene.NET as a possible solution

• Lucene.NET is derived from its Java cousin Lucene.

• Compact search engine that offers a solution to most of your search problems.

• Best of all, it is free.

Getting started with Lucene.NET

Overview of Lucene

• Lucene provides the core things you need to build a search system

• It does not:
  • Contain a search results page.
  • Parse HTML, Word, Excel, etc.

This is what is in the box

• Text analyzer
  • Splits text into searchable terms
  • Filters out stopwords (if you want)
• QueryParser
  • Common syntax without needing to learn anything
• IndexSearcher
  • The goods, THE thing to have.
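
To make the analyzer step concrete, here is a minimal Python sketch of what "split text into terms and filter stopwords" could look like. It is not Lucene.NET code: the `analyze` function and the `STOP_WORDS` set are hypothetical names invented for this illustration, and real analyzers do considerably more.

```python
import re

# Hypothetical stop word list for illustration; real analyzers ship their own.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "is", "in"}

def analyze(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in stop_words]

terms = analyze("The quick brown fox jumps over the lazy dog")
print(terms)
```

Passing an empty set as `stop_words` keeps every token, which mirrors the "if you want" part of the slide.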

This is what is in the box

• IndexReader
  • Reads everything from the index
• IndexWriter
  • Stores documents and fields
• Directory
  • The index itself, comes in many sizes and shapes

A standard recipe for building search

1. Build an index with content you want to search through.
2. Build a query from the question the user asked.
3. Get results and present them to the user.

Step 1: Building an index

• The Lucene search index is nothing like your average database!

• Storage happens in key/value pairs (the fields of a document)

• Most of the time nothing is stored and you can still search for it
  • The engine keeps the analyzed terms of the content in the index, not the original text
  • Only when you ask it to store a field does it keep the original value
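
The difference between indexing a field and storing it can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how Lucene lays out its files: `TinyIndex`, its `add` and `search` methods, and the whitespace tokenization are all hypothetical.

```python
from collections import defaultdict

class TinyIndex:
    """Toy index: every field is searchable, only flagged fields keep their value."""

    def __init__(self):
        self.postings = defaultdict(set)  # (field, term) -> set of doc ids
        self.stored = {}                  # doc id -> {field: original value}

    def add(self, doc_id, fields):
        """fields maps a field name to a (value, store) pair."""
        self.stored[doc_id] = {}
        for name, (value, store) in fields.items():
            for term in value.lower().split():
                self.postings[(name, term)].add(doc_id)
            if store:
                self.stored[doc_id][name] = value

    def search(self, field, term):
        """Return the sorted ids of documents whose field contains the term."""
        return sorted(self.postings.get((field, term.lower()), set()))

index = TinyIndex()
index.add(1, {"title": ("Lucene.NET in action", True),
              "body": ("Search made simple", False)})
hits = index.search("body", "search")
```

The body can be found through `search`, yet only the title can be read back from `index.stored`, because the body was indexed without being stored.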

Step 1: Building an index

• Lucene indexing uses a tree-like index structure

[Diagram: Doc #1, Doc #2 and Doc #3 at the bottom; above them "Merged #1 + #2"; at the top "Full index"]

• Each document gets its own segment initially
• Segments get merged during optimization cycles
• Finally everything is merged back into one big pile.
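
The merge steps in the diagram can be imitated with plain dictionaries. The sketch below assumes a segment is nothing more than a map from term to a sorted list of document ids; `make_segment` and `merge_segments` are hypothetical helpers, and real segments hold far more than this.

```python
def make_segment(doc_id, text):
    """Build a one-document segment: term -> sorted list of doc ids."""
    return {term: [doc_id] for term in set(text.lower().split())}

def merge_segments(first, second):
    """Combine two segments into one by uniting their posting lists per term."""
    merged = {}
    for term in first.keys() | second.keys():
        ids = set(first.get(term, [])) | set(second.get(term, []))
        merged[term] = sorted(ids)
    return merged

seg1 = make_segment(1, "lucene search")
seg2 = make_segment(2, "search engine")
seg3 = make_segment(3, "lucene engine")

merged_12 = merge_segments(seg1, seg2)          # "Merged #1 + #2"
full_index = merge_segments(merged_12, seg3)    # "Full index"
```

Adding a document only touches a small new segment; the cost of combining is paid later, during the merge.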

Step 1: Building an index

• Reasons for going in this direction:
  • Segments are small and update very fast.
  • Searching many segments is slower than searching one bigger segment, which is why segments get merged.

• Overall, a merging-segments index is more scalable and easier to implement than the B-tree index that is used elsewhere.

Step 2: Building queries

• Querying Lucene.NET is done through the IndexSearcher for almost every scenario you can think of.

• There are a number of possible options for queries:
  • Hand-build a query using BooleanQuery, TermQuery or another query type
  • Let Lucene decide which would best fit by parsing the query.
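
The two options end up in the same place: a list of clauses. The Python sketch below assumes a very small subset of the query syntax, where a leading `+` marks a required term and a leading `-` a prohibited one; `parse_query` and the tuple representation of clauses are hypothetical and only illustrate the idea.

```python
def parse_query(text):
    """Turn '+required -prohibited optional' input into (occur, term) clauses."""
    clauses = []
    for token in text.lower().split():
        if token.startswith("+") and len(token) > 1:
            clauses.append(("MUST", token[1:]))
        elif token.startswith("-") and len(token) > 1:
            clauses.append(("MUST_NOT", token[1:]))
        else:
            clauses.append(("SHOULD", token))
    return clauses

# Option 1: build the clauses by hand.
hand_built = [("MUST", "lucene"), ("MUST_NOT", "java"), ("SHOULD", "search")]

# Option 2: let a parser derive them from what the user typed.
parsed = parse_query("+lucene -java search")
```

Hand-building gives full control; parsing lets the user type the query without the application knowing its shape in advance.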

Step 2: Building queries

[Diagram: “Some text” → QueryParser (using an Analyzer) → Query → IndexSearcher]

Step 2: Building queries

• There’s a standard QueryParser, but you can also use the MultiFieldQueryParser

• The MultiFieldQueryParser allows you to build a query across multiple fields at once.

Step 2: Building queries

• Using the QueryParser and analyzer to get a good query for the search engine is one way of going at it.

• Other query types include:
  • BooleanQuery: terms must, should or must not appear in the document
  • TermQuery: look for a single term
  • SpanQuery: find terms that are close together in the text

Please note: You can combine!
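
The must / should / must not semantics can be tried out on a handful of strings. This is a hedged Python sketch, not the BooleanQuery implementation: `boolean_search` is a hypothetical function that scans every document, whereas a real engine consults the index and also ranks the results.

```python
def boolean_search(docs, must=(), should=(), must_not=()):
    """Return ids of docs satisfying the clauses; docs maps id -> text."""
    # A query with only prohibited terms has nothing to match on.
    if not must and not should:
        return []
    hits = []
    for doc_id, text in docs.items():
        terms = set(text.lower().split())
        if any(term in terms for term in must_not):
            continue
        if not all(term in terms for term in must):
            continue
        # Without required terms, at least one optional term has to match.
        if not must and not any(term in terms for term in should):
            continue
        hits.append(doc_id)
    return sorted(hits)

docs = {
    1: "lucene search for dotnet",
    2: "lucene search for java",
    3: "sql server tips",
}
dotnet_only = boolean_search(docs, must=["lucene"], must_not=["java"])
```

Combining clause types in one call is the "you can combine" point from the slide.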

Step 2: Building queries

• SpanQuery is a little weird: it allows you to find terms close together in a piece of text. For example:

“The lazy fox jumps over the quick brown dog”
“The quick brown fox jumps over the lazy dog”

The second sentence is the one you want. The first one is sort of correct, but a little funky. Since when did the dog become brown and quick?
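
A proximity check on the two example sentences shows why the second one wins. The Python sketch below is only an approximation of the idea behind span queries: `within_distance` is a hypothetical helper that ignores word order and allows at most `max_gap` words between the two terms.

```python
def within_distance(text, first, second, max_gap):
    """True if the two terms occur with at most max_gap words between them."""
    words = text.lower().split()
    first_positions = [i for i, word in enumerate(words) if word == first]
    second_positions = [i for i, word in enumerate(words) if word == second]
    return any(abs(i - j) - 1 <= max_gap
               for i in first_positions
               for j in second_positions)

funky = "The lazy fox jumps over the quick brown dog"
wanted = "The quick brown fox jumps over the lazy dog"

funky_matches = within_distance(funky, "quick", "fox", max_gap=1)
wanted_matches = within_distance(wanted, "quick", "fox", max_gap=1)
```

Both sentences contain "quick" and "fox", so a plain term query cannot tell them apart; only the position information does.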

Step 3: Getting results

• With indexed content and the right query, you can get the answer to everything (which, by the way, might not be 42…)

• The IndexSearcher is used to find the answer to your query.

Step 3: Getting results

[Diagram: Query → IndexSearcher → IndexReader → Directory]

Step 3: Getting results

• Documents are matched against your query using complex math.

• A TF-IDF algorithm is used to determine how well the document matches the query.

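• You have been warned! This is complex stuff.

$$\mathrm{score}(q,d) = \mathrm{coord}(q,d) \cdot \mathrm{queryNorm}(q) \cdot \sum_{t \in q} \bigl( \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^2 \cdot \mathrm{boost}_t \cdot \mathrm{norm}(t,d) \bigr)$$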
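
To see how the factors of the formula interact, here is a simplified Python sketch. The shape follows the formula above, but the individual factor definitions are assumptions made for this example: tf as the square root of the term frequency, idf as 1 + ln(N / (df + 1)), norm as 1 / sqrt(document length), and both queryNorm and the boost fixed at 1. The `score` function itself is hypothetical.

```python
import math

def score(query_terms, doc_terms, all_docs):
    """Score one document (a list of terms) against a list of unique query terms."""
    matched = [term for term in query_terms if term in doc_terms]
    if not matched:
        return 0.0
    coord = len(matched) / len(query_terms)   # share of query terms found
    query_norm = 1.0                          # assumption: no query normalization
    norm = 1.0 / math.sqrt(len(doc_terms))    # assumption: shorter docs weigh more
    total = 0.0
    for term in matched:
        tf = math.sqrt(doc_terms.count(term))
        doc_freq = sum(1 for doc in all_docs if term in doc)
        idf = 1.0 + math.log(len(all_docs) / (doc_freq + 1))
        boost = 1.0                           # assumption: no per-term boost
        total += tf * idf ** 2 * boost * norm
    return coord * query_norm * total

docs = [
    ["lucene", "search"],
    ["lucene", "lucene", "index"],
    ["sql", "server"],
]
scores = [score(["lucene", "search"], doc, docs) for doc in docs]
```

The first document matches both query terms and therefore gets the full coord factor, which is why it outranks the second one even though "lucene" appears there twice.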

Step 3: Getting results

• In the demo I showed you the basic form of finding documents.
  • There’s more to the Search method than meets the eye!

• Depending on your needs, you may have to use a collector.
  • A collector optimizes the way you retrieve documents from the index

Step 3: Getting results

• Need to find documents in ranked order?
  • Use the default method or use a TopDocsCollector

• Need to sort the documents in a particular order?
  • Use the TopFieldCollector instead.
  • This collector is optimized for sorting on fields
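
The difference between collecting by score and collecting by field value can be sketched with Python's `heapq` module. The functions `top_by_score` and `top_by_field` are hypothetical stand-ins for the idea behind the two collectors, not their actual implementation.

```python
import heapq

def top_by_score(hits, k):
    """hits: iterable of (doc_id, score). Keep only the k highest-scoring hits."""
    return heapq.nlargest(k, hits, key=lambda hit: hit[1])

def top_by_field(hits, field_values, k):
    """Return the k doc ids that come first when ordered by a field value."""
    doc_ids = (doc_id for doc_id, _ in hits)
    return heapq.nsmallest(k, doc_ids, key=lambda doc_id: field_values[doc_id])

hits = [(1, 0.2), (2, 0.9), (3, 0.5), (4, 0.1)]
titles = {1: "Delta", 2: "Charlie", 3: "Alpha", 4: "Bravo"}

best_ranked = top_by_score(hits, 2)
first_by_title = top_by_field(hits, titles, 2)
```

In both cases only `k` results are kept while walking over the hits, which is the point of using a collector rather than sorting everything afterwards.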

Step 3: Getting results

• Don’t want documents that have nothing to do with what you asked for in the first place?
  • Use a PositiveScoresOnlyCollector
  • Matches documents with score > 0

Use this only when you have a smaller index.

A standard recipe for building search

1. Build an index with content
   • IndexWriter, Document
   • Think about Store / Index settings on your fields!

2. Build a query
   • QueryParser, MultiFieldQueryParser
   • Choose the right query type!

3. Get results
   • IndexSearcher, Collector
   • Choose the right collector for better performance!

Good to go

• Now that you know how Lucene.NET works I think it is time to show you a few other things…

Categorize content based on previous content

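[Diagram: the Body of a new, not yet labeled item (?) is sent to the IndexSearcher, which returns the labels of similar content]

Label          Occurrences
Search         180    ← Probably a good candidate!
Requirements   40
Other label    12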
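
The counting step behind the table can be sketched as a small vote among the most similar items. This Python example is an assumption-laden illustration: `suggest_label` is a hypothetical function, similarity is plain word overlap rather than a real search score, and the sample data is made up.

```python
from collections import Counter

def suggest_label(new_body, labeled_docs, top_n=3):
    """labeled_docs: list of (body, label). Count labels among the top_n matches."""
    query = set(new_body.lower().split())
    scored = []
    for body, label in labeled_docs:
        overlap = len(query & set(body.lower().split()))
        if overlap:
            scored.append((overlap, label))
    # Best matches first; ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = Counter(label for _, label in scored[:top_n])
    return votes.most_common()

labeled = [
    ("how to build a search index", "Search"),
    ("tuning search queries", "Search"),
    ("gathering requirements from users", "Requirements"),
    ("search index requirements", "Requirements"),
]
suggestions = suggest_label("search index tuning", labeled)
```

The label with the most occurrences among the matches comes first, just like "Search" in the table.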

Detecting plagiarized content

Potentially problematic document

Field   Value
Title   Lucene.NET in action
Body    Lorem ipsum stuff and more about that Lucene thingie.
Tags    Search, Lucene, .NET, C#

[Diagram: the document is sent to the IndexSearcher, which returns the candidates “Lucene in action” (?) and “Lucene in Orchard” (?)]

Spell check content

• You can spell check a document based on what others wrote.
  • Very similar to categorization, but instead of checking the highest hit for a single field, check which word matches best for the term at hand.

• Uses an n-gram structure and the Levenshtein distance algorithm (sounds good, doesn’t it?)

• Do NOT build this yourself, but download here: https://nuget.org/packages/Lucene.Net.Contrib/3.0.3
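
Purely to show what "n-grams plus Levenshtein distance" means, here is a small Python sketch. It is not the contrib spell checker and should not replace it: `ngrams`, `levenshtein` and `suggest` are hypothetical helpers, and the word list is made up.

```python
def ngrams(word, n=2):
    """Return the set of character n-grams of a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def levenshtein(a, b):
    """Number of single-character edits needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def suggest(word, dictionary):
    """Pick candidates sharing an n-gram, then choose the closest by edit distance."""
    grams = ngrams(word)
    candidates = [known for known in dictionary if grams & ngrams(known)]
    return min(candidates, key=lambda known: levenshtein(word, known), default=None)

known_words = ["lucene", "search", "index", "query"]
suggestion = suggest("serach", known_words)
```

The n-grams narrow the candidates cheaply; the edit distance then ranks the few words that remain.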

Play Jeopardy?

• The IBM Watson supercomputer uses Lucene

By the way…

• Endeavour knowNow uses Lucene.NET

• And there are more devs using it.
  • Twitter uses Lucene for realtime search
  • StackOverflow uses Lucene for searching questions
  • RavenDB uses Lucene for its indexes

• Give it a try, you might be surprised!

http://www.fizzylogic.nl/

@wmeints

