
Hacking Lucene and Solr for Fun and Profit

Date post: 26-Jan-2015
Upload: lucenerevolution
Transcript
Page 1: Hacking Lucene and Solr for Fun and Profit
Page 2: Hacking Lucene and Solr for Fun and Profit

HACKING LUCENE AND SOLR FOR FUN AND PROFIT

Grant Ingersoll

CTO, LucidWorks

[email protected], @gsingers

Page 3: Hacking Lucene and Solr for Fun and Profit

Keyword Search is so yesterday

• Search is a system building block

– Text is only a part of the story

• If the algorithms fit, use them!

• Embrace fuzziness!

• Scoring features are everywhere

Page 4: Hacking Lucene and Solr for Fun and Profit

Lucene and Solr can do…

• Classic: fast, fuzzy text matching across a large document collection

• Data quality and analysis

– Faceting, slicing and dicing of numerical/enumerated data

– Spatial

– Spell checking, record linkage, highlighting

– Stats, missing fields, etc.

• Top N problems

Page 5: Hacking Lucene and Solr for Fun and Profit

Topics

• Search Hacks

• “Trust me, I’m a mathematician”

• “I wish I had thought of that” Hack

Page 6: Hacking Lucene and Solr for Fun and Profit

Search Hacks

Page 7: Hacking Lucene and Solr for Fun and Profit

Learn IR

• SimpleTextCodec example (writes the index as human-readable text files):

// conf is an existing IndexWriterConfig
conf.setCodec(new SimpleTextCodec());
File simpleText = new File("simpletext");
directory = new SimpleFSDirectory(simpleText);
writer = new IndexWriter(directory, conf);
index(writer); // index() adds the documents

• Similarity (swap in a different scoring model):

BM25Similarity bm25Similarity = new BM25Similarity();
conf.setSimilarity(bm25Similarity);

• http://www.ibm.com/developerworks/java/library/j-solr-lucene/index.html
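To see what switching to BM25Similarity changes, it can help to compute the BM25 term weight by hand. The sketch below is plain Java with no Lucene dependency. The class and method names are made up for illustration, and the formula is the textbook BM25 form with k1 = 1.2 and b = 0.75 (the values Lucene uses by default); Lucene's own implementation differs in details such as how document length is encoded.

```java
// Illustrative sketch only: a hand-rolled BM25 term score, not Lucene's BM25Similarity code.
final class Bm25Sketch {
    static final double K1 = 1.2;  // term-frequency saturation
    static final double B = 0.75;  // strength of document-length normalization

    /** Inverse document frequency: rarer terms get a larger weight. */
    static double idf(long docCount, long docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    /** BM25 contribution of one term to one document's score. */
    static double score(double termFreq, double docLen, double avgDocLen,
                        long docCount, long docFreq) {
        double lengthNorm = 1.0 - B + B * (docLen / avgDocLen);
        double tfPart = (termFreq * (K1 + 1.0)) / (termFreq + K1 * lengthNorm);
        return idf(docCount, docFreq) * tfPart;
    }

    public static void main(String[] args) {
        // Same term frequency, but the second document is four times longer.
        System.out.println(score(3, 100, 100, 1000, 10));
        System.out.println(score(3, 400, 100, 1000, 10));
    }
}
```

The second score comes out lower than the first: with the same term frequency, BM25 penalizes the longer document, and raising the term frequency yields diminishing returns because of the k1 saturation term.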

Page 8: Hacking Lucene and Solr for Fun and Profit

http://localhost:8983/solr/answer?q=what+is+trimethylbenzene&defType=qa&qa=true&qa.qf=body

Page 9: Hacking Lucene and Solr for Fun and Profit

Simple QA Workflow

Page 10: Hacking Lucene and Solr for Fun and Profit

Analysis

• Split into sentences

– Buffer tokens – see com.tamingtext.texttamer.solr.SentenceTokenizer

• Identify names using OpenNLP

• Add entity marker tokens at the same position as the original token

– Could also be done with payloads

• Index

• https://github.com/tamingtext/book/tree/master/src/main/java/com/tamingtext/texttamer/solr

• https://github.com/tamingtext/book/blob/master/apache-solr/solr-qa/conf/schema.xml
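The "marker token at the same position" idea can be sketched without any Lucene classes. In the example below, each token carries a position increment, and an entity marker is emitted with an increment of 0 so that it shares the position of the word it annotates. The class, the `NE_` marker prefix, and the sample tags are hypothetical stand-ins for illustration, not the format used by the book's code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: plain Java stand-ins, not Lucene's TokenStream API.
final class EntityMarkerSketch {
    /** A token plus its distance from the previous token (0 = same position). */
    static final class Token {
        final String text;
        final int positionIncrement;

        Token(String text, int positionIncrement) {
            this.text = text;
            this.positionIncrement = positionIncrement;
        }
    }

    /** Emits each word, followed by a marker at the same position when the word is tagged. */
    static List<Token> addMarkers(List<String> words, List<String> tags) {
        List<Token> out = new ArrayList<Token>();
        for (int i = 0; i < words.size(); i++) {
            out.add(new Token(words.get(i), 1));
            String tag = tags.get(i);
            if (tag != null) {
                out.add(new Token("NE_" + tag, 0)); // hypothetical marker format
            }
        }
        return out;
    }

    /** Absolute position of every token, obtained by summing the increments. */
    static List<Integer> positions(List<Token> tokens) {
        List<Integer> result = new ArrayList<Integer>();
        int pos = -1;
        for (Token t : tokens) {
            pos += t.positionIncrement;
            result.add(pos);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("Louis", "XIV", "ruled", "France");
        List<String> tags = Arrays.asList("person", "person", null, "location");
        List<Token> tokens = addMarkers(words, tags);
        List<Integer> pos = positions(tokens);
        for (int i = 0; i < tokens.size(); i++) {
            System.out.println(pos.get(i) + "\t" + tokens.get(i).text);
        }
    }
}
```

Because the marker and the word share a position, a positional query can ask for "a person name near the word ruled" without knowing which name was used.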

Page 11: Hacking Lucene and Solr for Fun and Profit

Search Side

• Custom query parser takes in the user’s natural language query, classifies it to find the Answer Type and generates a Solr query

• Retrieve candidate passages that match keywords and expected answer type

• Unlike keyword search, we need to know exactly where matches occur

• https://github.com/tamingtext/book/tree/master/src/main/java/com/tamingtext/qa

Page 12: Hacking Lucene and Solr for Fun and Profit

Answer Type Classification

• Answer Type examples:

– Person (P), Location (L), Organization (O), Time Point (T), Duration (R), Money (M)

– See page 248 for more

• Train an OpenNLP classifier on a set of previously annotated questions, e.g.:

– P Which French monarch reinstated the divine right of the monarchy to France and was known as `The Sun King' because of the splendour of his reign?

Page 13: Hacking Lucene and Solr for Fun and Profit

“Trust me, I’m a mathematician”

Page 14: Hacking Lucene and Solr for Fun and Profit

Classification

Page 15: Hacking Lucene and Solr for Fun and Profit

kNN and TF/IDF Classification w/ Lucene

https://github.com/tamingtext/book/tree/master/src/main/java/com/tamingtext/classifier/mlt
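As a rough picture of what kNN classification over TF/IDF vectors involves, here is a small in-memory sketch in plain Java. It is not the book's implementation (the linked package keeps the training documents in a Lucene index). The class name, the whitespace tokenizer, the smoothed idf variant and the sample training data are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only: in-memory kNN over TF-IDF vectors, not the book's Lucene-backed code.
final class KnnTfIdfSketch {
    private final List<String[]> docs = new ArrayList<String[]>();
    private final List<String> labels = new ArrayList<String>();
    private final Map<String, Integer> docFreq = new HashMap<String, Integer>();

    private static String[] tokenize(String text) {
        return text.toLowerCase().trim().split("\\s+");
    }

    /** Stores one labeled training document and updates document frequencies. */
    void add(String label, String text) {
        String[] terms = tokenize(text);
        docs.add(terms);
        labels.add(label);
        for (String t : new HashSet<String>(Arrays.asList(terms))) {
            docFreq.merge(t, 1, Integer::sum);
        }
    }

    /** Sparse TF-IDF vector; the smoothed idf is one common variant, chosen for illustration. */
    private Map<String, Double> vector(String[] terms) {
        Map<String, Double> v = new HashMap<String, Double>();
        for (String t : terms) {
            int df = docFreq.getOrDefault(t, 0);
            double idf = Math.log(1.0 + docs.size() / (1.0 + df));
            v.merge(t, idf, Double::sum); // adding idf once per occurrence yields tf * idf
        }
        return v;
    }

    private static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) dot += e.getValue() * other;
        }
        for (double x : b.values()) normB += x * x;
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    /** Returns the label with the largest summed similarity among the k nearest documents. */
    String classify(String text, int k) {
        Map<String, Double> query = vector(tokenize(text));
        final double[] sims = new double[docs.size()];
        List<Integer> order = new ArrayList<Integer>();
        for (int i = 0; i < docs.size(); i++) {
            sims[i] = cosine(query, vector(docs.get(i)));
            order.add(i);
        }
        order.sort((x, y) -> Double.compare(sims[y], sims[x])); // most similar first
        Map<String, Double> votes = new TreeMap<String, Double>();
        for (int i = 0; i < Math.min(k, order.size()); i++) {
            int doc = order.get(i);
            votes.merge(labels.get(doc), sims[doc], Double::sum);
        }
        String best = null;
        for (Map.Entry<String, Double> e : votes.entrySet()) {
            if (best == null || e.getValue() > votes.get(best)) best = e.getKey();
        }
        return best; // null when nothing has been added
    }

    public static void main(String[] args) {
        KnnTfIdfSketch knn = new KnnTfIdfSketch();
        knn.add("sports", "the team won the football match");
        knn.add("sports", "the striker scored a late goal in the match");
        knn.add("tech", "the search engine indexes documents quickly");
        knn.add("tech", "lucene builds an inverted index for search");
        System.out.println(knn.classify("who scored the goal in the football match", 3));
    }
}
```

A search engine makes the expensive part of this cheap: finding the k most similar documents is exactly a top-N query, so the labeled documents can live in the index and the document to classify becomes the query.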

Page 16: Hacking Lucene and Solr for Fun and Profit

Lucene Classification Module

• Builds a classifier from index information

• See the org.apache.lucene.classification package

• Naïve Bayes Classifier

• kNN Classifier

• Perceptron Classifier

Page 17: Hacking Lucene and Solr for Fun and Profit

Recommenders

• Cross recommendation as search

– with search used to build cross recommendation!

• Recommend content to people who exhibit certain behaviors (clicks, query terms, other)

• (Ab)use of a search engine

– but not as a search engine for content

– more like a search engine for behavior

• See Ted Dunning’s talk from Berlin Buzzwords on Multi-modal Recommendation Algorithms

– http://berlinbuzzwords.com/sessions/multi-modal-recommendation-algorithms

• Go get Mahout/Myrrix or just do it in y(our) search engine

Page 18: Hacking Lucene and Solr for Fun and Profit

Recommendation Basics

• History:

User  Thing
1     3
2     4
3     4
2     3
3     2
1     1
2     1

Page 19: Hacking Lucene and Solr for Fun and Profit

Recommendation Basics

• History as matrix:

     t1  t2  t3  t4
u1    1   0   1   0
u2    1   0   1   1
u3    0   1   0   1

• t1+t3 co-occur 2 times; t1+t4, t2+t4 and t3+t4 once each
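The step from the history list to the matrix, and from the matrix to co-occurrence counts, is small enough to write out. The sketch below uses the (user, thing) pairs from the history slide. The class and method names are invented for the example, and dense arrays are used only because the data is tiny.

```java
import java.util.Arrays;

// Illustrative sketch only: dense arrays suit the slide's toy data, not real interaction logs.
final class CooccurrenceSketch {
    /** Builds a binary user-by-thing matrix from (user, thing) pairs with 1-based ids. */
    static int[][] toMatrix(int[][] history, int users, int things) {
        int[][] m = new int[users][things];
        for (int[] pair : history) {
            m[pair[0] - 1][pair[1] - 1] = 1;
        }
        return m;
    }

    /** For each pair of things, counts the users who interacted with both (A^T * A). */
    static int[][] cooccurrence(int[][] m) {
        int things = m[0].length;
        int[][] c = new int[things][things];
        for (int[] row : m) {
            for (int i = 0; i < things; i++) {
                for (int j = 0; j < things; j++) {
                    c[i][j] += row[i] * row[j];
                }
            }
        }
        return c;
    }

    public static void main(String[] args) {
        int[][] history = {{1, 3}, {2, 4}, {3, 4}, {2, 3}, {3, 2}, {1, 1}, {2, 1}};
        int[][] c = cooccurrence(toMatrix(history, 3, 4));
        for (int[] row : c) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

The off-diagonal cell for t1 and t3 is 2 because users 1 and 2 both touched both things. Each diagonal cell is simply the number of users who touched that thing.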

Page 20: Hacking Lucene and Solr for Fun and Profit

Recommendation Basics

• Co-occurrence

• More details at http://lucenerevolution.org/2013/Crowd-sourced-intelligence-built-into-Search-over-Hadoop

     t1  t2  t3  t4
t1    2   0   2   1
t2    0   1   0   1
t3    2   0   2   1
t4    1   1   1   2

         t3  not t3
t1        2       1
not t1    1       1
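The 2×2 table has the shape of the contingency table that a log-likelihood ratio (G²) test takes as input, which is a common way to decide whether a co-occurrence is more than chance. The slide does not name the test, so treat that pairing as an assumption. The sketch below computes the ratio from the four cell counts using the entropy formulation; the class and method names are made up for the example.

```java
// Illustrative sketch only: assumes the 2x2 table feeds a log-likelihood ratio test.
final class LlrSketch {
    private static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    /** Unnormalized Shannon entropy of a set of counts. */
    private static double entropy(long... counts) {
        long sum = 0;
        double parts = 0.0;
        for (long c : counts) {
            parts += xLogX(c);
            sum += c;
        }
        return xLogX(sum) - parts;
    }

    /**
     * k11 = both events, k12 = first only, k21 = second only, k22 = neither.
     * Returns 0 for independent events and grows as the association gets stronger.
     */
    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double colEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        // Clamp tiny negative values caused by floating-point rounding.
        return Math.max(0.0, 2.0 * (rowEntropy + colEntropy - matrixEntropy));
    }

    public static void main(String[] args) {
        // Cell counts taken from the slide's t1 / t3 table.
        System.out.println(logLikelihoodRatio(2, 1, 1, 1));
        // A strongly associated pair, for contrast (made-up counts).
        System.out.println(logLikelihoodRatio(10, 0, 0, 10));
    }
}
```

With counts this small the score for the slide's table is close to zero, meaning there is not enough evidence that the pairing is anything but chance. Raw co-occurrence counts alone cannot make that distinction.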

Page 21: Hacking Lucene and Solr for Fun and Profit

“I wish I had thought of that”

Page 22: Hacking Lucene and Solr for Fun and Profit

Time Space Continuum

• Leverage Solr’s new spatial capabilities to index non-spatial data, such as time ranges

– Useful for Open Hours, Shifts, etc.

• Key: multi-valued range data

• Query using rectangle intersections

– q = shift:"Intersects(0 19 23 365)"

• Credits to David Smiley and Hoss…

https://people.apache.org/~hossman/spatial-for-non-spatial-meetup-20130117/
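One way to read the rectangle query is to treat each stored range as a 2D point with x = start and y = end. A query range then overlaps a stored range exactly when the point falls in the rectangle "start no later than the query end, end no earlier than the query start". The sketch below is plain Java geometry rather than Solr code. The x = start, y = end mapping and the argument order (minX minY maxX maxY) are assumptions about how the slide's query is meant to be read, and the class and method names are invented.

```java
// Illustrative sketch only: plain Java geometry, not Solr's spatial field types.
final class TimeRangeSketch {
    /** A stored range [start, end], treated as the 2D point (x = start, y = end). */
    static final class Range {
        final int start;
        final int end;

        Range(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** True when the point for r lies inside the rectangle (minX, minY)-(maxX, maxY). */
    static boolean inRectangle(Range r, int minX, int minY, int maxX, int maxY) {
        return r.start >= minX && r.start <= maxX && r.end >= minY && r.end <= maxY;
    }

    /** Overlap with [qStart, qEnd], phrased as a rectangle query over (start, end) points. */
    static boolean overlaps(Range r, int qStart, int qEnd, int min, int max) {
        // The start may be anywhere up to qEnd; the end may be anywhere from qStart onwards.
        return inRectangle(r, min, qStart, qEnd, max);
    }

    public static void main(String[] args) {
        // Mirrors the slide's numbers: rectangle (0, 19)-(23, 365), i.e. query range [19, 23].
        Range day = new Range(9, 17);
        Range evening = new Range(18, 22);
        System.out.println(overlaps(day, 19, 23, 0, 365));     // false: ends before 19
        System.out.println(overlaps(evening, 19, 23, 0, 365)); // true: 18-22 overlaps 19-23
    }
}
```

Because a document can hold several such points in a multi-valued field, one document can carry many shifts or opening periods and still match with a single rectangle query.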

Page 23: Hacking Lucene and Solr for Fun and Profit

Finance Example

[Figure: plot of % change against Time, with labels AAPL, MSFT and IBM]

Page 24: Hacking Lucene and Solr for Fun and Profit

Resources

• http://www.manning.com/ingersoll

– http://github.com/tamingtext/book

• http://www.tamingtext.com

• Me:

– @gsingers

– [email protected]

