
Judcon Brazil 2014 Lucene from the bottom up

Date post: 18-Dec-2014
Upload: gustavo-fernandes
Description:
Judcon Brazil 2014
Transcript
Page 1: Judcon Brazil 2014 Lucene from the bottom up
Page 2: Judcon Brazil 2014 Lucene from the bottom up

Lucene from the bottom up!

Gustavo Fernandes

Page 3: Judcon Brazil 2014 Lucene from the bottom up

What is Lucene

An ultra-fast, high-throughput, Apache-licensed search library with a low memory footprint and support for incremental indexing. It is written in Java and has ports to several other languages, including Python, .NET and C++.

Page 4: Judcon Brazil 2014 Lucene from the bottom up

What Lucene is not

• A service

• A database

• A product

Page 5: Judcon Brazil 2014 Lucene from the bottom up

Search

Page 6: Judcon Brazil 2014 Lucene from the bottom up

Search

1. GTA V for PS3: You start by creating and developing a unique character and invest in your potential criminal by customising his or her appearance

2. DC Universe Online for PS4: Battle with or against their favourite heroes and outlaws, or your own customised character

3. Assassins Creed Black Flag for PS3: Pirates rule the Caribbean and have developed a lawless pirate republic. Among these outlaws is a fearsome young captain named Edward Kenway

4. The Last of Us for PS4: Joel, a brutal survivor, and Ellie, a brave young teenage girl must work together if they hope to survive their journey across the US.

Page 7: Judcon Brazil 2014 Lucene from the bottom up

Index

1: a and appearance by character creating criminal customising developing gta v her his in invest or potential ps3 start unique you your

2: against and battle character customised dc favourite heroes online or outlaws own ps4 their universe with your

3: a assassins among and captain caribbean creed developed edward fearsome have is kenway lawless named outlaws pirate pirates ps3 republic rule the these young

4: a across and brave brutal ellie girl hope if joel journey last must of ps4 survive survivor teenage the their they to together us work young

Page 8: Judcon Brazil 2014 Lucene from the bottom up

Inverted Index

Terms: across against among appearance battle brave brutal captain caribbean character creating criminal customised customising developed developing edward ellie favourite fearsome girl heroes hope invest joel journey kenway lawless must named outlaws own pirate pirates potential republic rule start survive survivor teenage together unique work young

[Figure: each term points to the ids of the documents that contain it. Most terms point to a single document; character points to 1 and 2, outlaws to 2 and 3, young to 3 and 4.]
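The structure on this slide can be sketched in plain Java, without Lucene. This is only an illustration: the class name `InvertedIndexSketch` and its methods are made up for this example, the tokenisation is a naive split on non-alphanumeric characters, and common words such as "a" and "and" are kept, unlike on the slide.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch class, not part of Lucene.
class InvertedIndexSketch {

    /** Maps each lower-cased term to the sorted ids of the documents that contain it. */
    static Map<String, SortedSet<Integer>> build(Map<Integer, String> docs) {
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        for (Map.Entry<Integer, String> doc : docs.entrySet()) {
            // Naive tokenisation: lower-case, then split on anything that is not a letter or digit.
            for (String term : doc.getValue().toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
                if (!term.isEmpty()) {
                    index.computeIfAbsent(term, t -> new TreeSet<>()).add(doc.getKey());
                }
            }
        }
        return index;
    }

    /** The four game descriptions used in the slides, keyed by document id. */
    static Map<Integer, String> sampleDocs() {
        Map<Integer, String> docs = new LinkedHashMap<>();
        docs.put(1, "You start by creating and developing a unique character and invest in your "
                + "potential criminal by customising his or her appearance");
        docs.put(2, "Battle with or against their favourite heroes and outlaws, or your own "
                + "customised character");
        docs.put(3, "Pirates rule the Caribbean and have developed a lawless pirate republic. "
                + "Among these outlaws is a fearsome young captain named Edward Kenway");
        docs.put(4, "Joel, a brutal survivor, and Ellie, a brave young teenage girl must work "
                + "together if they hope to survive their journey across the US.");
        return docs;
    }

    public static void main(String[] args) {
        Map<String, SortedSet<Integer>> index = build(sampleDocs());
        for (String term : new String[] {"across", "character", "outlaws", "young"}) {
            System.out.println(term + " -> " + index.get(term));
        }
    }
}
```

Looking up a term is then a single map access, which is what makes search fast: the cost depends on the number of matching documents, not on the size of the collection.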

Page 9: Judcon Brazil 2014 Lucene from the bottom up

Documents and Fields

Id: 1
Console: PS3
Title: GTA V
Description: You start by creating and developing a unique character and invest in your potential criminal by customising his or her appearance

Id: 2
Console: PS4
Title: DC Universe Online
Description: Battle with or against their favourite heroes and outlaws, or your own customised character

Id: 3
Console: PS3
Title: Assassins Creed
Description: Pirates rule the Caribbean and have developed a lawless pirate republic. Among these outlaws is a fearsome young captain named Edward Kenway

Id: 4
Console: PS4
Title: The Last of Us
Description: Joel, a brutal survivor, and Ellie, a brave young teenage girl must work together if they hope to survive their journey across the US.

Page 10: Judcon Brazil 2014 Lucene from the bottom up

Fields

Field: Description
across against among appearance battle brave brutal captain . . . republic rule start survive survivor teenage together unique work young
[each term points to the ids of the documents whose description contains it]

Field: Title
assassins → 3, black → 3, creed → 3, dc → 2, flag → 3, gta → 1, last → 4, of → 4, online → 2, universe → 2, the → 4, us → 4, v → 1

Field: Console
ps3 → 1, 3
ps4 → 2, 4

Field: Id
1 → 1, 2 → 2, 3 → 3, 4 → 4

Page 11: Judcon Brazil 2014 Lucene from the bottom up

On Terms

• Unit of search

• Created by a process called tokenisation

• Numerous ways of doing it

• Language-specific "gotchas"

Page 12: Judcon Brazil 2014 Lucene from the bottom up

Examples

Input: Joel, a brutal survivor, and Ellie, a brave young teenage girl must work together.

Tokenised: Joel a brutal survivor and Ellie a brave young teenage girl must work together

Stop words removed: Joel brutal survivor Ellie brave young teenage girl must work together

Lowercased: joel brutal survivor ellie brave young teenage girl must work together

Stemming and synonyms added:
• survivor → survive (stemming)
• brave → fearless (synonym)
• teenage → teen-age (synonym)
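The first three steps of this example (tokenise, drop stop words, lowercase) can be sketched in plain Java. This is not how Lucene implements analysis; the class `AnalysisSketch` and its two-word stop list are assumptions made for the example, and stemming and synonyms are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch class, not part of Lucene.
class AnalysisSketch {

    // Assumed stop list, just large enough for the example sentence.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("a", "and"));

    /** Splits on non-letter, non-digit characters, lower-cases, and drops stop words. */
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}\\p{Nd}]+")) {
            String term = raw.toLowerCase(Locale.ROOT);
            if (!term.isEmpty() && !STOP_WORDS.contains(term)) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        String text = "Joel, a brutal survivor, and Ellie, a brave young teenage girl must work together.";
        System.out.println(String.join(" ", analyze(text)));
    }
}
```

For the sentence on the slide this prints the lowercased term sequence shown above: joel brutal survivor ellie brave young teenage girl must work together.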

Page 13: Judcon Brazil 2014 Lucene from the bottom up

Examples (2)

Coca-Cola improved the market share of the flagship brand Diet Coke by 0.4% to 42.4%

coca cola improved market share flagship brand diet coke 0 4 42 4

私の名前はグスタボです (Japanese: "My name is Gustavo")

私 の 名 前 は グ ス タ ボ で す

Page 14: Judcon Brazil 2014 Lucene from the bottom up

Phrase

q=title:"black creed"
q=description:"young teenage"

[Figure: the Description and Title inverted indexes, with each posting annotated with the positions at which the term occurs in the document.]
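Why positions are needed for phrase queries can be shown with a small plain-Java sketch. It is an assumption-laden illustration, not Lucene's implementation: `PhraseSketch` is a made-up class, it handles a single document, and it only matches terms that are adjacent and in order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch class, not part of Lucene.
class PhraseSketch {

    /** Records, for one document, the positions at which each term occurs. */
    static Map<String, List<Integer>> positions(String text) {
        Map<String, List<Integer>> result = new HashMap<>();
        int position = 0;
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (term.isEmpty()) {
                continue;
            }
            result.computeIfAbsent(term, t -> new ArrayList<>()).add(position++);
        }
        return result;
    }

    /** True if the phrase terms occur at consecutive positions, in the given order. */
    static boolean matchesPhrase(Map<String, List<Integer>> positions, String... phrase) {
        if (phrase.length == 0) {
            return false;
        }
        List<Integer> none = Collections.emptyList();
        for (int start : positions.getOrDefault(phrase[0], none)) {
            boolean allFound = true;
            for (int i = 1; i < phrase.length && allFound; i++) {
                allFound = positions.getOrDefault(phrase[i], none).contains(start + i);
            }
            if (allFound) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> title = positions("Assassins Creed Black Flag");
        System.out.println(matchesPhrase(title, "creed", "black"));
        System.out.println(matchesPhrase(title, "black", "creed"));
    }
}
```

In this sketch "creed black" matches the title and "black creed" does not, because both terms are present but only the first pair is at consecutive positions in that order. A plain term-to-document index cannot tell the two cases apart.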

Page 15: Judcon Brazil 2014 Lucene from the bottom up

Autocomplete

Input: c
Suggestions: captain, caribbean, character

[Figure: the sorted terms of the Description field (captain, caribbean, character, criminal, customised, ... teenage, together, unique, work, young) with their postings; the prefix "c" selects the terms that start with it.]

Page 16: Judcon Brazil 2014 Lucene from the bottom up

Autocomplete: Finite State Transducer

[Figure: a finite state transducer built from the terms captain, caribbean, character, criminal, young, your.]
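As a rough illustration of prefix lookup, the sketch below uses a sorted set instead of a finite state transducer. It is a deliberately simpler stand-in: an FST shares prefixes and suffixes between terms and is far more compact, which this example does not attempt. `PrefixSuggestSketch` is a hypothetical class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical sketch class; a sorted-set range scan, not a finite state transducer.
class PrefixSuggestSketch {

    /** Returns up to 'limit' terms starting with 'prefix', in sorted order. */
    static List<String> suggest(NavigableSet<String> terms, String prefix, int limit) {
        List<String> suggestions = new ArrayList<>();
        // Jump to the first term >= prefix, then walk forward while the prefix still matches.
        for (String term : terms.tailSet(prefix, true)) {
            if (!term.startsWith(prefix) || suggestions.size() >= limit) {
                break;
            }
            suggestions.add(term);
        }
        return suggestions;
    }

    static NavigableSet<String> sampleTerms() {
        return new TreeSet<>(Arrays.asList(
                "captain", "caribbean", "character", "criminal", "young", "your"));
    }

    public static void main(String[] args) {
        System.out.println(suggest(sampleTerms(), "c", 3));
        System.out.println(suggest(sampleTerms(), "yo", 5));
    }
}
```

With the six terms from the slide, the prefix "c" limited to three results gives captain, caribbean, character, and the prefix "yo" gives young, your.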

Page 17: Judcon Brazil 2014 Lucene from the bottom up

Relevance

q=description:outlaws

Id: 2
Console: PS4
Title: DC Universe Online
Description: Battle with or against their favourite heroes and outlaws, or your own customised character

Id: 3
Console: PS3
Title: Assassins Creed
Description: Pirates rule the Caribbean and have developed a lawless pirate republic. Among these outlaws is a fearsome young captain named Edward Kenway

Which one ranks first?
? DC Universe Online
? Assassins Creed

Page 18: Judcon Brazil 2014 Lucene from the bottom up

Vector

Two dimensions (d1, d2):
V = (2, 3)
V = 2 . d1 + 3 . d2

Three dimensions (d1, d2, d3):
V = (2, 3, 4)
V = 2 . d1 + 3 . d2 + 4 . d3

Page 19: Judcon Brazil 2014 Lucene from the bottom up

Score - Vector Model

• Result documents are represented as vectors

• The query is represented as a vector

• Vector dimensions are terms

• Vector 'quantities' are Tf-Idf weights

• Score = cosine similarity between the query vector and the document vector

0.4024 DC Universe Online

0.3219 Assassins Creed
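Cosine similarity between two sparse term vectors can be sketched as follows. The class `CosineSketch` and its helper are hypothetical, and the sketch does not try to reproduce the two scores above, which come from Lucene's own scoring with its additional normalisation factors.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch class, not part of Lucene.
class CosineSketch {

    /** Builds a sparse vector from parallel arrays of terms and weights. */
    static Map<String, Double> vector(String[] terms, double... weights) {
        Map<String, Double> v = new HashMap<>();
        for (int i = 0; i < terms.length; i++) {
            v.put(terms[i], weights[i]);
        }
        return v;
    }

    /** Euclidean length of the vector. */
    static double norm(Map<String, Double> v) {
        double sumOfSquares = 0;
        for (double w : v.values()) {
            sumOfSquares += w * w;
        }
        return Math.sqrt(sumOfSquares);
    }

    /** Dot product divided by the product of the lengths; 0 if either vector is empty. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double lengths = norm(a) * norm(b);
        return lengths == 0 ? 0 : dot / lengths;
    }

    public static void main(String[] args) {
        Map<String, Double> v = vector(new String[] {"d1", "d2"}, 2, 3);
        Map<String, Double> axis = vector(new String[] {"d1"}, 1);
        System.out.println(cosine(v, v));
        System.out.println(cosine(v, axis));
    }
}
```

A vector compared with itself scores 1, and vectors that share no terms score 0, so the value can be read directly as "how closely the document points in the direction of the query".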

Page 20: Judcon Brazil 2014 Lucene from the bottom up

Documents and Queries as vectors

Id: 2
Console: PS4
Title: DC Universe Online
Description: Battle with or against their favourite heroes and outlaws, or your own customised character

Id: 3
Console: PS3
Title: Assassins Creed
Description: Pirates rule the Caribbean and have developed a lawless pirate republic. Among these outlaws is a fearsome young captain named Edward Kenway

D2 = w21 . against + w22 . battle + … + w23 . outlaws + w2j . own

D3 = w31 . among + w32 . captain + … + w35 . outlaws + … + w3j . young

Q = wq . outlaws

Page 21: Judcon Brazil 2014 Lucene from the bottom up

Term Weights

• Term frequency (Tf): number of times the term appears in the document

• Inverse Document Frequency (Idf): how rare the term is across the index; the fewer documents contain the term, the higher the value

nDocs = 4

Id: 3
Console: PS3
Title: Assassins Creed
Description: Pirates rule the Caribbean and have developed a lawless pirate republic. Among these outlaws is a fearsome young captain named Edward Kenway

TERM      among   across  outlaws  young
sqrt(Tf)  1       0       1        1
docFreq   1       1       2        2
Idf       1.6931  1.6931  1.287    1.287
w         1.6931  0       1.287    1.287

(w = sqrt(Tf) × Idf)

D3 = 1.6931 . among + 1.6931 . captain + … + 1.287 . outlaws + … + 1.287 . young
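The Idf values in the table are consistent with the formula idf = 1 + ln(nDocs / (docFreq + 1)), which is the one used by Lucene's classic Tf-Idf similarity. The formula itself did not survive in this transcript, so treat it here as an assumption that happens to reproduce the numbers. `TermWeightSketch` is a hypothetical class.

```java
// Hypothetical sketch class, not part of Lucene.
class TermWeightSketch {

    /** Assumed formula: 1 + ln(nDocs / (docFreq + 1)). */
    static double idf(int docFreq, int nDocs) {
        return 1.0 + Math.log((double) nDocs / (docFreq + 1));
    }

    /** Term weight as in the table: sqrt(Tf) multiplied by Idf. */
    static double weight(int termFreq, int docFreq, int nDocs) {
        return Math.sqrt(termFreq) * idf(docFreq, nDocs);
    }

    public static void main(String[] args) {
        int nDocs = 4;
        // Term frequencies are those of document 3; document frequencies are index-wide.
        System.out.printf("among   %.4f%n", weight(1, 1, nDocs));
        System.out.printf("across  %.4f%n", weight(0, 1, nDocs));
        System.out.printf("outlaws %.4f%n", weight(1, 2, nDocs));
        System.out.printf("young   %.4f%n", weight(1, 2, nDocs));
    }
}
```

With nDocs = 4, a term found in one document gets an Idf of about 1.6931 and a term found in two documents about 1.2877, matching the table up to rounding. The weight of "across" is 0 because the term does not occur in document 3.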

Page 22: Judcon Brazil 2014 Lucene from the bottom up

Tf-Idf

A term's weight in a document is higher:

• The more often the term appears in the document

• The rarer the term is index-wide

Page 23: Judcon Brazil 2014 Lucene from the bottom up

Lucene API

Page 24: Judcon Brazil 2014 Lucene from the bottom up

Lucene API - Documents

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.IntField;
import org.apache.lucene.document.TextField;

// One Lucene document per game. Store.YES keeps the original value so it can be read back from search results.
Document doc = new Document();
doc.add(new IntField("id", 1, Store.YES));
doc.add(new TextField("console", "PS3", Store.YES));
doc.add(new TextField("title", "GTA V", Store.YES));
doc.add(new TextField("description", "You start by creating and developing a unique character and invest in your potential criminal by customising his or her appearance", Store.YES));

Page 25: Judcon Brazil 2014 Lucene from the bottom up

Lucene API - Analysis

Name            Type    Analysis
Id              Number  None
Console         String  Lowercase
Title           Text    WhiteSpace, Lowercase
Description     Text    WhiteSpace, Lowercase, Remove common words
Description_jp  Text    Japanese Tokenizer

Id: 1
Console: PS3
Title: GTA V
Description: You start by creating and developing a unique character and invest in your potential criminal by customising his or her appearance
Description_jp: かつてないほど大規模でダイナミックな多様性に富んだオープンワールドを誇る『グランド・セフト・オートV』は、ストーリーテリングとゲームプレイを新しい手法で融合。

Page 26: Judcon Brazil 2014 Lucene from the bottom up

Lucene API - Analysis

Input: Pirates rule the Caribbean

Analyzer: Whitespace Tokenizer → Lowercase TokenFilter → Stopwords TokenFilter

Output: pirates rule caribbean

Page 27: Judcon Brazil 2014 Lucene from the bottom up

Lucene API - Analysis

Custom Analyzer

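import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;

public class MySimpleAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        // Split on whitespace, then lower-case every token
        WhitespaceTokenizer tokenizer = new WhitespaceTokenizer(reader);
        LowerCaseFilter lcFilter = new LowerCaseFilter(tokenizer);
        return new TokenStreamComponents(tokenizer, lcFilter);
    }
}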

Page 28: Judcon Brazil 2014 Lucene from the bottom up

Lucene API - Analysis

org.apache.lucene.analysis.ja.JapaneseAnalyzer

@Override
protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer tokenizer = new JapaneseTokenizer(reader, userDict, true, mode);
    TokenStream stream = new JapaneseBaseFormFilter(tokenizer);
    stream = new JapanesePartOfSpeechStopFilter(stream, stoptags);
    stream = new CJKWidthFilter(stream);
    stream = new StopFilter(stream, stopwords);
    stream = new JapaneseKatakanaStemFilter(stream);
    stream = new LowerCaseFilter(stream);
    return new TokenStreamComponents(tokenizer, stream);
}

Page 29: Judcon Brazil 2014 Lucene from the bottom up

Lucene API - Analysis

org.apache.lucene.analysis.standard.StandardAnalyzer

@Override
protected TokenStreamComponents createComponents(final String fieldName, final Reader reader) {
    final StandardTokenizer src = new StandardTokenizer(getVersion(), reader);
    …
    TokenStream tok = new StandardFilter(getVersion(), src);
    tok = new LowerCaseFilter(getVersion(), tok);
    tok = new StopFilter(getVersion(), tok, stopwords);
    return new TokenStreamComponents(src, tok);
}

Page 30: Judcon Brazil 2014 Lucene from the bottom up

Lucene API - Indexing

// One analyzer per field
Map<String, Analyzer> analyzerMap = new HashMap<String, Analyzer>();
analyzerMap.put("id", new KeywordAnalyzer());
analyzerMap.put("console", new MySimpleAnalyzer());
analyzerMap.put("description", new StandardAnalyzer());
analyzerMap.put("description_jp", new JapaneseAnalyzer());

// Fields without an entry in the map (such as "title") fall back to the default StandardAnalyzer
PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer(), analyzerMap);

Directory ramDirectory = new RAMDirectory();
IndexWriterConfig iwc = new IndexWriterConfig(Version.LATEST, analyzer);
IndexWriter iw = new IndexWriter(ramDirectory, iwc);
for (Document document : documents) {
    iw.addDocument(document);
}
iw.close();

Page 31: Judcon Brazil 2014 Lucene from the bottom up

Lucene API - Directory

• RAMDirectory (for tests only)

• FSDirectory
  • MMapDirectory (default for 64-bit)
  • SimpleFSDirectory (java.io.RandomAccessFile)
  • NIOFSDirectory (java.io.FileChannel)
  • WindowsDirectory (native, requires a .dll)
  • NativeUnixDirectory (experimental)

• InfinispanDirectory (3rd party)

Page 32: Judcon Brazil 2014 Lucene from the bottom up

Lucene API - Directory

[Figure: three successive IndexWriter.close() calls, each leaving a new segment in the directory.]

IndexWriter.close()
_0.fdt _0.fdx _0.fnm _0.nvd _0.nvm _0.si _0_Lucene41_0.doc _0_Lucene41_0.pos _0_Lucene41_0.tim _0_Lucene41_0.tip

IndexWriter.close()
_1.fdt _1.fdx _1.fnm _1.nvd _1.nvm _1.si _1_Lucene41_0.doc _1_Lucene41_0.pos _1_Lucene41_0.tim _1_Lucene41_0.tip

IndexWriter.close()
_2.fdt _2.fdx _2.fnm _2.nvd _2.nvm _2.si _2_Lucene41_0.doc _2_Lucene41_0.pos _2_Lucene41_0.tim _2_Lucene41_0.tip

Page 33: Judcon Brazil 2014 Lucene from the bottom up

Lucene API - Directory

from http://blog.mikemccandless.com/

Page 34: Judcon Brazil 2014 Lucene from the bottom up

Lucene API - Autocomplete

DirectoryReader reader = DirectoryReader.open(directory);

// Build the suggester from the terms of the "description" field
AnalyzingSuggester suggester = new AnalyzingSuggester(new StandardAnalyzer());
LuceneDictionary dictionary = new LuceneDictionary(reader, "description");
suggester.build(dictionary);

// Ask for at most 5 suggestions for the prefix "c"
List<Lookup.LookupResult> suggestions = suggester.lookup("c", false, 5);

for (Lookup.LookupResult suggestion : suggestions) {
    System.out.println(suggestion.key);
}

Output:
captain
caribbean
character
creating
criminal

Page 35: Judcon Brazil 2014 Lucene from the bottom up

Lucene API - Search

q = description:character

DirectoryReader reader = DirectoryReader.open(directory);

TermQuery termQuery = new TermQuery(new Term("description", "character"));
IndexSearcher indexSearcher = new IndexSearcher(reader);
TopDocs topDocs = indexSearcher.search(termQuery, 10);

// Print the score and the stored title of each hit
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    int internalId = scoreDoc.doc;
    Document document = reader.document(internalId);
    String title = document.get("title");
    System.out.printf("%f - %s\n", scoreDoc.score, title);
}

Output:
0.402401 - DC Universe Online
0.321921 - GTA V

Page 36: Judcon Brazil 2014 Lucene from the bottom up

Lucene API - Search

q=description:"young teenage"

DirectoryReader reader = DirectoryReader.open(directory);

// The terms must appear next to each other, in this order
PhraseQuery query = new PhraseQuery();
query.add(new Term("description", "young"));
query.add(new Term("description", "teenage"));

IndexSearcher indexSearcher = new IndexSearcher(reader);
TopDocs topDocs = indexSearcher.search(query, 10);

for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    int internalId = scoreDoc.doc;
    Document document = reader.document(internalId);
    String title = document.get("title");
    System.out.printf("%f - %s\n", scoreDoc.score, title);
}

Output:
0.745207 - The Last of Us

Page 37: Judcon Brazil 2014 Lucene from the bottom up

Lucene API - Search

q = console:"PS3" AND (description:"pirate" OR description:"criminal")

DirectoryReader reader = DirectoryReader.open(directory);

TermQuery descriptionOne = new TermQuery(new Term("description", "pirate"));
TermQuery descriptionTwo = new TermQuery(new Term("description", "criminal"));

// OR: at least one of the two description terms should match
BooleanQuery descriptionQuery = new BooleanQuery();
descriptionQuery.add(descriptionOne, BooleanClause.Occur.SHOULD);
descriptionQuery.add(descriptionTwo, BooleanClause.Occur.SHOULD);

// The console field is lower-cased at index time, so the term is "ps3"
TermQuery consoleQuery = new TermQuery(new Term("console", "ps3"));

// AND: both the console clause and the description clause must match
BooleanQuery query = new BooleanQuery();
query.add(consoleQuery, BooleanClause.Occur.MUST);
query.add(descriptionQuery, BooleanClause.Occur.MUST);

IndexSearcher indexSearcher = new IndexSearcher(reader);
TopDocs topDocs = indexSearcher.search(query, 10);

Output:
0.741689 - GTA V
0.741689 - Assassins Creed Black Flag
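In terms of which documents match, this boolean query boils down to set operations on posting lists. The plain-Java sketch below uses the posting lists of the four sample documents and ignores scoring entirely; `BooleanQuerySketch` and its helpers are hypothetical names, not Lucene API.

```java
import java.util.Arrays;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical sketch class, not part of Lucene.
class BooleanQuerySketch {

    static SortedSet<Integer> postings(Integer... docIds) {
        return new TreeSet<>(Arrays.asList(docIds));
    }

    /** OR: documents present in either posting list. */
    static SortedSet<Integer> or(SortedSet<Integer> a, SortedSet<Integer> b) {
        SortedSet<Integer> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    /** AND: documents present in both posting lists. */
    static SortedSet<Integer> and(SortedSet<Integer> a, SortedSet<Integer> b) {
        SortedSet<Integer> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    public static void main(String[] args) {
        SortedSet<Integer> ps3 = postings(1, 3);      // console:ps3
        SortedSet<Integer> pirate = postings(3);      // description:pirate
        SortedSet<Integer> criminal = postings(1);    // description:criminal
        System.out.println(and(ps3, or(pirate, criminal)));
    }
}
```

The result is documents 1 and 3, that is GTA V and Assassins Creed Black Flag, the same two documents the Lucene query returns.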

Page 38: Judcon Brazil 2014 Lucene from the bottom up

Lucene API - Search

Query Parser

// The parser builds the same boolean query from its textual form
QueryParser queryParser = new QueryParser("description", analyzer);
Query query = queryParser.parse("console:PS3 AND (description:pirate OR description:criminal)");

IndexSearcher indexSearcher = new IndexSearcher(reader);
TopDocs topDocs = indexSearcher.search(query, 10);

for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    int internalId = scoreDoc.doc;
    Document document = reader.document(internalId);
    String title = document.get("title");
    System.out.printf("%f - %s\n", scoreDoc.score, title);
}

Output:
0.741689 - GTA V
0.741689 - Assassins Creed Black Flag

Page 39: Judcon Brazil 2014 Lucene from the bottom up

Lucene API - Sort

QueryParser queryParser = new QueryParser("description", analyzer);
Query query = queryParser.parse("console:PS3 AND (description:pirate OR description:criminal)");

IndexSearcher indexSearcher = new IndexSearcher(reader);

// When sorting by a field, scores are not computed by default, hence the NaN values in the output
Sort sort = new Sort(new SortField("title", SortField.Type.STRING, true));
TopDocs topDocs = indexSearcher.search(query, 10, sort);

for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    int internalId = scoreDoc.doc;
    Document document = reader.document(internalId);
    String title = document.get("title");
    System.out.printf("%f - %s\n", scoreDoc.score, title);
}

Output:
NaN - Assassins Creed Black Flag
NaN - GTA V

Page 40: Judcon Brazil 2014 Lucene from the bottom up

Lucene API - Explain

DirectoryReader reader = DirectoryReader.open(directory);

TermQuery termQuery = new TermQuery(new Term("description", "character"));
IndexSearcher indexSearcher = new IndexSearcher(reader);
TopDocs topDocs = indexSearcher.search(termQuery, 10);

// Ask Lucene how the score of each hit was computed
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    int internalId = scoreDoc.doc;
    Explanation explanation = indexSearcher.explain(termQuery, internalId);
    System.out.println(explanation);
}

Output:
0.40240064 = (MATCH) weight(description:character in 2) [DefaultSimilarity], result of:
  0.40240064 = fieldWeight in 2, product of:
    1.0 = tf(freq=1.0), with freq of:
      1.0 = termFreq=1.0
    1.287682 = idf(docFreq=2, maxDocs=4)
    0.3125 = fieldNorm(doc=2)

0.3219205 = (MATCH) weight(description:character in 0) [DefaultSimilarity], result of:
  0.3219205 = fieldWeight in 0, product of:
    1.0 = tf(freq=1.0), with freq of:
      1.0 = termFreq=1.0
    1.287682 = idf(docFreq=2, maxDocs=4)
    0.25 = fieldNorm(doc=0)
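The explanation output says that each score is the product of three factors: tf, idf and fieldNorm. The arithmetic can be checked with a few lines of plain Java. `ExplainSketch` is a hypothetical class, the input values are copied from the output above, and tf is taken as the square root of the term frequency, as in the Term Weights slide.

```java
// Hypothetical sketch class, not part of Lucene.
class ExplainSketch {

    /** fieldWeight = tf * idf * fieldNorm, with tf = sqrt(termFreq). */
    static double fieldWeight(double termFreq, double idf, double fieldNorm) {
        return Math.sqrt(termFreq) * idf * fieldNorm;
    }

    public static void main(String[] args) {
        double idf = 1.287682; // idf(docFreq=2, maxDocs=4), from the explanation
        System.out.printf("doc 2: %.6f%n", fieldWeight(1.0, idf, 0.3125));
        System.out.printf("doc 0: %.6f%n", fieldWeight(1.0, idf, 0.25));
    }
}
```

Both documents contain the term once and share the same idf, so the only factor separating the two scores is fieldNorm, the per-document normalisation value stored for the field.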

Page 41: Judcon Brazil 2014 Lucene from the bottom up

Reviews provided by ign.com

