Judcon Brazil 2014 Lucene from the bottom up

Post on 18-Dec-2014


Lucene from the bottom up

Gustavo Fernandes

What is Lucene

Ultra-fast, low-memory-footprint, high-throughput, Apache-licensed search library with support for incremental indexing, written in Java, with ports to several languages (Python, .NET, C++)

What Lucene is not

• Service

• Database

• Product

Search

1. GTA V for PS3: You start by creating and developing a unique character and invest in your potential criminal by customising his or her appearance

2. DC Universe Online for PS4: Battle with or against their favourite heroes and outlaws, or your own customised character

3. Assassins Creed Black Flag for PS3: Pirates rule the Caribbean and have developed a lawless pirate republic. Among these outlaws is a fearsome young captain named Edward Kenway

4. The Last of Us for PS4: Joel, a brutal survivor, and Ellie, a brave young teenage girl must work together if they hope to survive their journey across the US.

Index

1: a and appearance by character creating criminal customising developing gta v her his in invest or potential ps3 start unique you your

2: against and battle character customised dc favourite heroes online or outlaws own ps4 their universe with your

3: a assassins among and captain caribbean creed developed edward fearsome have is kenway lawless named outlaws pirate pirates ps3 republic rule the these young

4: a across and brave brutal ellie girl hope if joel journey last must of ps4 survive survivor teenage the their they to together us work young

Inverted Index

across against among appearance battle brave brutal captain caribbean character creating criminal customised customising developed developing edward ellie favourite fearsome girl heroes hope invest joel journey kenway lawless must named outlaws own pirate pirates potential republic rule start survive survivor teenage together unique work young

[Figure: each term points to the ids (1 to 4) of the documents that contain it]
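The step from the per-document word lists to the inverted index can be sketched in plain Java, without Lucene. This is only an illustration of the idea: the class name `InvertedIndexSketch`, the method names and the regular expression used to split words are assumptions made for this example, and Lucene's real on-disk structures are far more compact.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

final class InvertedIndexSketch {

    /** The four sample descriptions from the slides, keyed by document id. */
    static Map<Integer, String> sampleDocs() {
        Map<Integer, String> docs = new LinkedHashMap<>();
        docs.put(1, "You start by creating and developing a unique character and invest in "
                + "your potential criminal by customising his or her appearance");
        docs.put(2, "Battle with or against their favourite heroes and outlaws, "
                + "or your own customised character");
        docs.put(3, "Pirates rule the Caribbean and have developed a lawless pirate republic. "
                + "Among these outlaws is a fearsome young captain named Edward Kenway");
        docs.put(4, "Joel, a brutal survivor, and Ellie, a brave young teenage girl must work "
                + "together if they hope to survive their journey across the US.");
        return docs;
    }

    /** Maps each lowercased term to the ids of the documents containing it. */
    static Map<String, List<Integer>> build(Map<Integer, String> docs) {
        Map<String, List<Integer>> index = new TreeMap<>();
        for (Map.Entry<Integer, String> doc : docs.entrySet()) {
            // Naive tokenisation (an assumption): lowercase, split on non-alphanumerics.
            for (String token : doc.getValue().toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
                if (token.isEmpty()) {
                    continue;
                }
                List<Integer> postings = index.computeIfAbsent(token, t -> new ArrayList<>());
                // All tokens of one document are processed together, so comparing
                // with the last posting is enough to avoid duplicate ids.
                if (postings.isEmpty() || !postings.get(postings.size() - 1).equals(doc.getKey())) {
                    postings.add(doc.getKey());
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> index = build(sampleDocs());
        System.out.println("outlaws -> " + index.get("outlaws")); // prints: outlaws -> [2, 3]
        System.out.println("young -> " + index.get("young"));
    }
}
```

A lookup for a term is then a single map access, which is why searching does not need to scan the documents themselves.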

Documents and Fields

Id: 1
Console: PS3
Title: GTA V
Description: You start by creating and developing a unique character and invest in your potential criminal by customising his or her appearance

Id: 2
Console: PS4
Title: DC Universe Online
Description: Battle with or against their favourite heroes and outlaws, or your own customised character

Id: 3
Console: PS3
Title: Assassins Creed
Description: Pirates rule the Caribbean and have developed a lawless pirate republic. Among these outlaws is a fearsome young captain named Edward Kenway

Id: 4
Console: PS4
Title: The Last of Us
Description: Joel, a brutal survivor, and Ellie, a brave young teenage girl must work together if they hope to survive their journey across the US.

Fields

Field: Description
across against among appearance battle brave brutal captain . . . republic rule start survive survivor teenage together unique work young

Field: Title
assassins black creed dc flag gta last of online universe the us v

Field: Console
ps3 -> 1, 3
ps4 -> 2, 4

Field: Id
1 2 3 4

[Figure: each field has its own inverted index, with every term pointing to the ids of the documents that contain it]

On Terms

• Unit of search

• Created by a process called tokenisation

• Numerous ways of doing it

• Language specific “gotchas”

Examples

Joel, a brutal survivor, and Ellie, a brave young teenage girl must work together.

Joel a brutal survivor and Ellie a brave young teenage girl must work together

Joel brutal survivor Ellie brave young teenage girl must work together

joel brutal survivor ellie brave young teenage girl must work together

joel brutal survivor (+ survive) ellie brave (+ fearless) young teenage (+ teen-age) girl must work together

Stemming: survivor -> survive
Synonyms: brave -> fearless, teenage -> teen-age
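The first steps of this example (strip punctuation, drop common words, lowercase) can be sketched in plain Java. This is a rough illustration and not Lucene's analysis chain: the class `AnalysisSketch` and its two-word stop list are assumptions chosen to fit the example sentence, and stemming and synonyms are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class AnalysisSketch {

    // Hypothetical stop-word list, just large enough for the example sentence.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("a", "and"));

    /** Turns raw text into a list of terms. */
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            // Step 1: remove punctuation (anything that is not a letter or digit).
            String token = raw.replaceAll("[^\\p{L}\\p{N}]", "");
            if (token.isEmpty()) {
                continue;
            }
            // Step 2: lowercase, so "Joel" and "joel" become the same term.
            String lower = token.toLowerCase(Locale.ROOT);
            // Step 3: drop common words that carry little meaning.
            if (STOP_WORDS.contains(lower)) {
                continue;
            }
            terms.add(lower);
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze(
                "Joel, a brutal survivor, and Ellie, a brave young teenage girl must work together."));
    }
}
```

The same function has to be applied to the query text, otherwise a search for "Joel" would never meet the indexed term "joel".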

Examples (2)

Coca-Cola improved the market share of the flagship brand Diet Coke by 0.4% to 42.4%

coca cola improved market share flagship brand diet coke 0 4 42 4

私の名前はグスタボです ("My name is Gustavo")

私 の 名 前 は グ ス タ ボ で す

Phrase

q=title:"black creed"
q=description:"young teenage"

Field: Description
republic rule start survive survivor teenage together unique work young

Field: Title
assassins black creed dc flag gta last of online universe the us v

[Figure: postings for both fields, annotated with the position of each term within the document]
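Why the two phrase queries behave differently can be illustrated with a small positional index in plain Java. This is a sketch under simplifying assumptions: `PhraseMatchSketch` and its methods are invented for this example, tokens are split on whitespace only, and a single document is handled at a time.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class PhraseMatchSketch {

    /** Records, for one document, the positions at which each lowercased term occurs. */
    static Map<String, List<Integer>> positions(String text) {
        Map<String, List<Integer>> result = new HashMap<>();
        String[] tokens = text.toLowerCase(Locale.ROOT).trim().split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            result.computeIfAbsent(tokens[i], t -> new ArrayList<>()).add(i);
        }
        return result;
    }

    /** True if the phrase terms occur at consecutive positions, in the given order. */
    static boolean matchesPhrase(Map<String, List<Integer>> positions, String... phrase) {
        if (phrase.length == 0) {
            return false;
        }
        List<Integer> starts = positions.getOrDefault(phrase[0], Collections.emptyList());
        for (int start : starts) {
            boolean allFound = true;
            for (int i = 1; i < phrase.length && allFound; i++) {
                // The i-th term of the phrase must sit exactly i positions after the first.
                allFound = positions.getOrDefault(phrase[i], Collections.emptyList())
                        .contains(start + i);
            }
            if (allFound) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> title = positions("Assassins Creed Black Flag");
        System.out.println(matchesPhrase(title, "black", "creed")); // prints: false
        System.out.println(matchesPhrase(title, "creed", "black")); // prints: true
    }
}
```

Both terms of "black creed" exist in the title, but in the wrong order, so a phrase query rejects the document where a plain two-term query would accept it.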

Autocomplete

c -> captain, caribbean, character

Field: Description
captain caribbean character criminal customised teenage together unique work young

[Figure: Description field terms with their postings and positions]

Autocomplete: Finite State Transducer

captain, caribbean, character, criminal, young, your

[Figure: finite state transducer built from the terms above; terms with a common prefix share a path]
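Prefix lookup over a sorted term dictionary can be sketched in plain Java. Note that this swaps in a simpler technique: it uses a sorted set (`TreeSet`) rather than a finite state transducer, so it shows the behaviour of autocomplete but not the memory savings an FST brings. The class `AutocompleteSketch` and its method are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

final class AutocompleteSketch {

    /** Returns up to max terms that start with the given prefix, in sorted order. */
    static List<String> suggest(NavigableSet<String> terms, String prefix, int max) {
        List<String> suggestions = new ArrayList<>();
        // tailSet jumps to the first term >= prefix; matching terms are contiguous from there.
        for (String term : terms.tailSet(prefix, true)) {
            if (!term.startsWith(prefix) || suggestions.size() >= max) {
                break;
            }
            suggestions.add(term);
        }
        return suggestions;
    }

    public static void main(String[] args) {
        NavigableSet<String> terms = new TreeSet<>(Arrays.asList(
                "captain", "caribbean", "character", "criminal", "young", "your"));
        System.out.println(suggest(terms, "c", 5));  // prints: [captain, caribbean, character, criminal]
        System.out.println(suggest(terms, "yo", 5)); // prints: [young, your]
    }
}
```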

Relevance

q=description:outlaws

Id: 2
Console: PS4
Title: DC Universe Online
Description: Battle with or against their favourite heroes and outlaws, or your own customised character

Id: 3
Console: PS3
Title: Assassins Creed
Description: Pirates rule the Caribbean and have developed a lawless pirate republic. Among these outlaws is a fearsome young captain named Edward Kenway

? DC Universe Online
? Assassins Creed

Vector

V = (2, 3)
V = 2 . d1 + 3 . d2

V = (2, 3, 4)
V = 2 . d1 + 3 . d2 + 4 . d3

Score - Vector Model

• Result documents are represented as vectors

• The query is represented as a vector

• Vector dimensions are terms

• Vector ‘quantities’ are Tf-Idf

• Score = Cosine Similarity between query vector and document vector

0.4024 Dc Universe Online

0.3219 Assassins Creed
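Cosine similarity between two sparse term vectors can be sketched in plain Java. This illustrates the textbook vector model only: the class `CosineSketch` and the sample weights are assumptions, and the sketch is not expected to reproduce the exact scores shown on the slide, since Lucene's practical scoring also applies field-length normalisation.

```java
import java.util.HashMap;
import java.util.Map;

final class CosineSketch {

    /** Euclidean length of a sparse vector stored as term -> weight. */
    static double norm(Map<String, Double> vector) {
        double sumOfSquares = 0.0;
        for (double weight : vector.values()) {
            sumOfSquares += weight * weight;
        }
        return Math.sqrt(sumOfSquares);
    }

    /** Cosine of the angle between two sparse vectors; 0 if either vector is empty. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> entry : a.entrySet()) {
            // Only terms present in both vectors contribute to the dot product.
            Double other = b.get(entry.getKey());
            if (other != null) {
                dot += entry.getValue() * other;
            }
        }
        double lengths = norm(a) * norm(b);
        return lengths == 0.0 ? 0.0 : dot / lengths;
    }

    public static void main(String[] args) {
        // Hypothetical, truncated vectors: a one-term query and a three-term document.
        Map<String, Double> query = new HashMap<>();
        query.put("outlaws", 1.287);

        Map<String, Double> document = new HashMap<>();
        document.put("among", 1.6931);
        document.put("outlaws", 1.287);
        document.put("young", 1.287);

        System.out.println(cosine(query, document));
    }
}
```

The more of a document's weight sits on the query terms, the closer the cosine gets to 1.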

Documents and Queries as vectors

Id: 2
Console: PS4
Title: DC Universe Online
Description: Battle with or against their favourite heroes and outlaws, or your own customised character

Id: 3
Console: PS3
Title: Assassins Creed
Description: Pirates rule the Caribbean and have developed a lawless pirate republic. Among these outlaws is a fearsome young captain named Edward Kenway

D2 = w21 . against + w22 . battle + … + w23 . outlaws + w2j . own

D3 = w31 . among + w32 . captain + … + w35 . outlaws + … + w3j . young

Q = wq . outlaws

Term Weights

• Term frequency (Tf): number of appearances of the term in the document

• Inverse Document Frequency (Idf): Idf = 1 + ln(nDocs / (docFreq + 1))

nDocs = 4

TERM      among   across  outlaws  young
sqrt(Tf)  1       0       1        1
docFreq   1       1       2        2
Idf       1.6931  1.6931  1.287    1.287
w         1.6931  0       1.287    1.287

D3 = 1.6931 . among + 1.6931 . captain + … + 1.287 . outlaws + … + 1.287 . young

Id: 3
Console: PS3
Title: Assassins Creed
Description: Pirates rule the Caribbean and have developed a lawless pirate republic. Among these outlaws is a fearsome young captain named Edward Kenway

Tf-Idf

The weight of a term grows:

• The more often the term appears in a document

• The rarer the term is index-wide
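The weights in the Term Weights table can be recomputed with a few lines of plain Java. The sketch assumes the classic Lucene formulas, tf = sqrt(freq) and idf = 1 + ln(nDocs / (docFreq + 1)), which agree with the table values for nDocs = 4; the class `TfIdfSketch` and its method names are made up for this example.

```java
final class TfIdfSketch {

    /** Term frequency component: square root of the raw count in the document. */
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    /** Inverse document frequency: rarer terms (low docFreq) get a higher value. */
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    /** Weight of a term in a document vector. */
    static double weight(int freq, int docFreq, int numDocs) {
        return tf(freq) * idf(docFreq, numDocs);
    }

    public static void main(String[] args) {
        int numDocs = 4;
        // Columns of the table for document 3: term, frequency in the document, docFreq.
        String[] terms = {"among", "across", "outlaws", "young"};
        int[] freqs = {1, 0, 1, 1};
        int[] docFreqs = {1, 1, 2, 2};
        for (int i = 0; i < terms.length; i++) {
            System.out.printf("%-8s idf=%.4f w=%.4f%n",
                    terms[i], idf(docFreqs[i], numDocs), weight(freqs[i], docFreqs[i], numDocs));
        }
    }
}
```

"across" does not occur in document 3, so its weight is 0 even though its Idf is high, while "outlaws" and "young" score lower than "among" because they appear in two documents instead of one.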

Lucene API

Lucene API - Documents

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.IntField;
import org.apache.lucene.document.TextField;

// One Document per game; Store.YES keeps the original value retrievable at search time
Document doc = new Document();
doc.add(new IntField("id", 1, Store.YES));
doc.add(new TextField("console", "PS3", Store.YES));
doc.add(new TextField("title", "GTA V", Store.YES));
doc.add(new TextField("description",
        "You start by creating and developing a unique character and invest in your "
        + "potential criminal by customising his or her appearance",
        Store.YES));

Lucene API - Analysis

Name            Type    Analysis
Id              Number  None
Console         String  Lowercase
Title           Text    WhiteSpace, Lowercase
Description     Text    WhiteSpace, Lowercase, Remove common words
Description_jp  Text    Japanese Tokenizer

Id: 1
Console: PS3
Title: GTA V
Description: You start by creating and developing a unique character and invest in your potential criminal by customising his or her appearance
Description_jp: かつてないほど大規模でダイナミックな多様性に富んだオープンワールドを誇る『グランド・セフト・オートV』は、ストーリーテリングとゲームプレイを新しい手法で融合。

Lucene API - Analysis

Analyzer

Pirates rule the Caribbean -> Whitespace Tokenizer -> Lowercase TokenFilter -> Stopwords TokenFilter -> pirates rule caribbean

Lucene API - Analysis

Custom Analyzer

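import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;

public class MySimpleAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        // Split on whitespace, then lowercase every token
        WhitespaceTokenizer tokenizer = new WhitespaceTokenizer(reader);
        LowerCaseFilter lcFilter = new LowerCaseFilter(tokenizer);
        return new TokenStreamComponents(tokenizer, lcFilter);
    }
}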

Lucene API - Analysis

@Override
protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer tokenizer = new JapaneseTokenizer(reader, userDict, true, mode);
    TokenStream stream = new JapaneseBaseFormFilter(tokenizer);
    stream = new JapanesePartOfSpeechStopFilter(stream, stoptags);
    stream = new CJKWidthFilter(stream);
    stream = new StopFilter(stream, stopwords);
    stream = new JapaneseKatakanaStemFilter(stream);
    stream = new LowerCaseFilter(stream);
    return new TokenStreamComponents(tokenizer, stream);
}

org.apache.lucene.analysis.ja.JapaneseAnalyzer

Lucene API - Analysis

@Override
protected TokenStreamComponents createComponents(final String fieldName, final Reader reader) {
    final StandardTokenizer src = new StandardTokenizer(getVersion(), reader);
    …
    TokenStream tok = new StandardFilter(getVersion(), src);
    tok = new LowerCaseFilter(getVersion(), tok);
    tok = new StopFilter(getVersion(), tok, stopwords);
    return new TokenStreamComponents(src, tok);
}

org.apache.lucene.analysis.standard.StandardAnalyzer

Lucene API - Indexing

// Choose an analyzer per field; fields not listed fall back to StandardAnalyzer
Map<String, Analyzer> analyzerMap = new HashMap<String, Analyzer>();
analyzerMap.put("id", new KeywordAnalyzer());
analyzerMap.put("console", new MySimpleAnalyzer());
analyzerMap.put("description", new StandardAnalyzer());
analyzerMap.put("description_jp", new JapaneseAnalyzer());

PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer(), analyzerMap);

Directory ramDirectory = new RAMDirectory();
IndexWriterConfig iwc = new IndexWriterConfig(Version.LATEST, analyzer);
IndexWriter iw = new IndexWriter(ramDirectory, iwc);
for (Document document : documents) {
    iw.addDocument(document);
}
iw.close();

Lucene API - Directory

• RAMDirectory (for tests only)

• FSDirectory
  • MMapDirectory (default for 64-bit)
  • SimpleFSDirectory (java.io.RandomAccessFile)
  • NIOFSDirectory (java.io.FileChannel)
  • WindowsDirectory (native, requires a .dll)
  • NativeUnixDirectory (experimental)

• InfinispanDirectory (3rd party)

Lucene API - Directory

IndexWriter.close()
_0.fdt _0.fdx _0.fnm _0.nvd _0.nvm _0.si _0_Lucene41_0.doc _0_Lucene41_0.pos _0_Lucene41_0.tim _0_Lucene41_0.tip

IndexWriter.close()
_1.fdt _1.fdx _1.fnm _1.nvd _1.nvm _1.si _1_Lucene41_0.doc _1_Lucene41_0.pos _1_Lucene41_0.tim _1_Lucene41_0.tip

IndexWriter.close()
_2.fdt _2.fdx _2.fnm _2.nvd _2.nvm _2.si _2_Lucene41_0.doc _2_Lucene41_0.pos _2_Lucene41_0.tim _2_Lucene41_0.tip

Lucene API - Directory

from http://blog.mikemccandless.com/

Lucene API - Autocomplete

DirectoryReader reader = DirectoryReader.open(directory);

// Build the suggester from the terms of the "description" field
AnalyzingSuggester suggester = new AnalyzingSuggester(new StandardAnalyzer());
LuceneDictionary dictionary = new LuceneDictionary(reader, "description");
suggester.build(dictionary);

List<Lookup.LookupResult> suggestions = suggester.lookup("c", false, 5);

for (Lookup.LookupResult suggestion : suggestions) {
    System.out.println(suggestion.key);
}

captain caribbean character creating criminal

Lucene API - Search

DirectoryReader reader = DirectoryReader.open(directory);

TermQuery termQuery = new TermQuery(new Term("description", "character"));
IndexSearcher indexSearcher = new IndexSearcher(reader);
TopDocs topDocs = indexSearcher.search(termQuery, 10);

for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    int internalId = scoreDoc.doc;
    Document document = reader.document(internalId);
    String title = document.get("title");
    System.out.printf("%f - %s\n", scoreDoc.score, title);
}

q = description:character

0.402401 - DC Universe Online
0.321921 - GTA V

Lucene API - Search

q=description:"young teenage"

DirectoryReader reader = DirectoryReader.open(directory);

// Terms are added in phrase order
PhraseQuery query = new PhraseQuery();
query.add(new Term("description", "young"));
query.add(new Term("description", "teenage"));

IndexSearcher indexSearcher = new IndexSearcher(reader);
TopDocs topDocs = indexSearcher.search(query, 10);

for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    int internalId = scoreDoc.doc;
    Document document = reader.document(internalId);
    String title = document.get("title");
    System.out.printf("%f - %s\n", scoreDoc.score, title);
}

0.745207 - The Last of Us

Lucene API - Search

q = console:"PS3" AND (description:"pirate" OR description:"criminal")

0.741689 - GTA V
0.741689 - Assassins Creed Black Flag

DirectoryReader reader = DirectoryReader.open(directory);

TermQuery descriptionOne = new TermQuery(new Term("description", "pirate"));
TermQuery descriptionTwo = new TermQuery(new Term("description", "criminal"));

// description:pirate OR description:criminal
BooleanQuery descriptionQuery = new BooleanQuery();
descriptionQuery.add(descriptionOne, BooleanClause.Occur.SHOULD);
descriptionQuery.add(descriptionTwo, BooleanClause.Occur.SHOULD);

TermQuery consoleQuery = new TermQuery(new Term("console", "ps3"));

// console:ps3 AND (the description clause above)
BooleanQuery query = new BooleanQuery();
query.add(consoleQuery, BooleanClause.Occur.MUST);
query.add(descriptionQuery, BooleanClause.Occur.MUST);

IndexSearcher indexSearcher = new IndexSearcher(reader);
TopDocs topDocs = indexSearcher.search(query, 10);

Lucene API - Search

Query Parser

QueryParser queryParser = new QueryParser("description", analyzer);
Query query = queryParser.parse("console:PS3 AND (description:pirate OR description:criminal)");

IndexSearcher indexSearcher = new IndexSearcher(reader);
TopDocs topDocs = indexSearcher.search(query, 10);

for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    int internalId = scoreDoc.doc;
    Document document = reader.document(internalId);
    String title = document.get("title");
    System.out.printf("%f - %s\n", scoreDoc.score, title);
}

0.741689 - GTA V
0.741689 - Assassins Creed Black Flag

Lucene API - Sort

NaN - Assassins Creed Black Flag
NaN - GTA V

QueryParser queryParser = new QueryParser("description", analyzer);
Query query = queryParser.parse("console:PS3 AND (description:pirate OR description:criminal)");

IndexSearcher indexSearcher = new IndexSearcher(reader);

// Sorting by a field skips score computation by default, hence the NaN scores
Sort sort = new Sort(new SortField("title", SortField.Type.STRING, true));
TopDocs topDocs = indexSearcher.search(query, 10, sort);

for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    int internalId = scoreDoc.doc;
    Document document = reader.document(internalId);
    String title = document.get("title");
    System.out.printf("%f - %s\n", scoreDoc.score, title);
}

Lucene API - Explain

DirectoryReader reader = DirectoryReader.open(directory);

TermQuery termQuery = new TermQuery(new Term("description", "character"));
IndexSearcher indexSearcher = new IndexSearcher(reader);
TopDocs topDocs = indexSearcher.search(termQuery, 10);

for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    int internalId = scoreDoc.doc;
    Explanation explanation = indexSearcher.explain(termQuery, internalId);
    System.out.println(explanation);
}

0.40240064 = (MATCH) weight(description:character in 2) [DefaultSimilarity], result of:
  0.40240064 = fieldWeight in 2, product of:
    1.0 = tf(freq=1.0), with freq of:
      1.0 = termFreq=1.0
    1.287682 = idf(docFreq=2, maxDocs=4)
    0.3125 = fieldNorm(doc=2)

0.3219205 = (MATCH) weight(description:character in 0) [DefaultSimilarity], result of:
  0.3219205 = fieldWeight in 0, product of:
    1.0 = tf(freq=1.0), with freq of:
      1.0 = termFreq=1.0
    1.287682 = idf(docFreq=2, maxDocs=4)
    0.25 = fieldNorm(doc=0)
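The products in the explanation can be checked by hand: 1.0 × 1.287682 × 0.3125 ≈ 0.4024 and 1.0 × 1.287682 × 0.25 ≈ 0.3219. The sketch below redoes that arithmetic in plain Java. The class `ExplainCheckSketch` is hypothetical, the idf formula 1 + ln(maxDocs / (docFreq + 1)) is assumed because it reproduces the 1.287682 shown above, and the fieldNorm values are simply copied from the output rather than derived.

```java
final class ExplainCheckSketch {

    /** idf as assumed for this check: 1 + ln(maxDocs / (docFreq + 1)). */
    static double idf(int docFreq, int maxDocs) {
        return 1.0 + Math.log((double) maxDocs / (docFreq + 1));
    }

    /** fieldWeight = tf * idf * fieldNorm, with tf = sqrt(termFreq). */
    static double fieldWeight(double termFreq, int docFreq, int maxDocs, double fieldNorm) {
        return Math.sqrt(termFreq) * idf(docFreq, maxDocs) * fieldNorm;
    }

    public static void main(String[] args) {
        // fieldNorm values taken from the explain output above.
        System.out.printf("%.6f%n", fieldWeight(1.0, 2, 4, 0.3125));
        System.out.printf("%.6f%n", fieldWeight(1.0, 2, 4, 0.25));
    }
}
```

Both documents contain "character" once and share the same idf, so the difference in score comes entirely from fieldNorm, which favours the shorter description.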

Reviews provided by ign.com