High Performance JSON Search and Relational Faceted Browsing with Lucene


Renaud Delbru Co-Founder, SindiceTech

Post-Doctoral Researcher, NUIG


• Lucene / Solr

– User for 7 years

– Built a web search engine – sindice.com (700M documents)

• Academia & Research

– Ph.D. in Information Retrieval and Semantic Web

– Post-doctoral researcher at the National University of Ireland, Galway

• Industry

– Technical co-founder of SindiceTech

– Management Platform for Enterprise Knowledge Graph

My Background

• Nested Data Model

• SIREn Overview & Theory

• SIREn Plugin Architecture

• Relational Faceted Browsing

• Comparison with BlockJoin

Agenda


Denormalising Relational Data

[Figure: LucidWorks with its funding rounds, Series A and Series B, and the investor Granite Ventures; in the denormalised view, Granite Ventures appears twice.]

• SQL

– Query-time join performance penalty

• NoSQL

– Denormalisation of relational data into nested data

– Convert many-to-one/many into one-to-many relationships

– Duplicate data …

– … but avoid joins

Nested Data Model: Why is it important?

• Model becoming prevalent: JSON, XML, Avro, …

– Can be arbitrarily nested and large

– No strict schema / structure enforced

• Schema-less brings

– Flexibility

– Ease of development

• Developers do not have to invest significant modelling effort upfront

Schema-Less Nested Data Model

• Lucene/Solr plugin for indexing and searching JSON

• Rich data model (JSON)

– Nested objects, nested arrays, datatypes

• Schema-agnostic

– No need to define structure (nested model)

– No need to define schema (fields)

Introducing SIREn

Overview of the SIREn API

Document

    {
      "name" : "LucidWorks",
      "category_code" : "analytics",
      "funding_rounds" : [
        {
          "round_code" : "a",
          "raised_amount" : 6000000,
          "funded_year" : 2009,
          "investments" : [
            {
              "name" : "Granite Ventures",
              "type" : "financial-org"
            },
            …
          ]
        },
        …
      ]
    }

Query

    (category_code : analytics)
    AND
    (funding_rounds : {
      round_code : seed OR a OR angel,
      raised_amount : [0 TO 12000000],
      * : {
        type : financial-org
      }
    })

• Inspired by tree-labelling schemes from XML IR

– Label each node with a hierarchical identifier (here, Dewey identifiers)

• Full-text search operators over the content of a node

• Structural search operators over the nodes of the tree

– Ancestor-Descendant, Parent-Child, Sibling, …

Theory behind SIREn
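
The structural operators listed above reduce to prefix comparisons once every node carries a Dewey identifier. The following is a minimal sketch of that idea, not SIREn's implementation; the function names `parse`, `is_ancestor`, `is_parent` and `is_sibling` are hypothetical.

```python
from typing import Tuple

Dewey = Tuple[int, ...]


def parse(label: str) -> Dewey:
    """Turn a label such as "1.2.2" into a tuple of integers."""
    return tuple(int(part) for part in label.split("."))


def is_ancestor(a: Dewey, d: Dewey) -> bool:
    """a is an ancestor of d if a is a strict prefix of d."""
    return len(a) < len(d) and d[:len(a)] == a


def is_parent(p: Dewey, c: Dewey) -> bool:
    """p is the parent of c if c is p plus exactly one more component."""
    return len(c) == len(p) + 1 and c[:-1] == p


def is_sibling(x: Dewey, y: Dewey) -> bool:
    """x and y are siblings if they differ only in their last component."""
    return x != y and len(x) == len(y) and x[:-1] == y[:-1]
```

For instance, `is_parent(parse("1.2"), parse("1.2.2"))` holds, while `is_ancestor(parse("1.2"), parse("1.1.1"))` does not.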

Theory behind SIREn: Tree-Labelling

[Figure: the JSON document below drawn as a tree with nodes name, LucidWorks, funding_rounds, round_code, a, raised_amount, 6000000, …; each node carries a Dewey identifier: 1, 1.1, 1.1.1, 1.2, 1.2.1, 1.2.2, 1.2.2.1, 1.2.2.1.1, 1.2.2.2, 1.2.2.2.1]

    {
      "name" : "LucidWorks",
      "funding_rounds" : [
        {
          "round_code" : "a",
          "raised_amount" : 6000000,
          …
        },
        …
      ]
    }
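
To make the labelling concrete, here is a minimal sketch that assigns Dewey identifiers to a JSON document in depth-first order. It assumes that objects, field names and values each become a node and that children are numbered from 1 in document order; the exact numbering used by SIREn may differ, and `label_nodes` is a hypothetical helper.

```python
from typing import Any, Iterator, Tuple


def label_nodes(value: Any, label: Tuple[int, ...] = (1,)) -> Iterator[Tuple[str, str]]:
    """Yield (dewey_id, node_text) pairs in depth-first order."""
    dewey = ".".join(str(part) for part in label)
    if isinstance(value, dict):
        yield dewey, "{}"
        for i, (name, child) in enumerate(value.items(), start=1):
            field_label = label + (i,)
            yield ".".join(str(part) for part in field_label), name
            # Array elements become direct children of the field node.
            children = child if isinstance(child, list) else [child]
            for j, item in enumerate(children, start=1):
                yield from label_nodes(item, field_label + (j,))
    elif isinstance(value, list):
        yield dewey, "[]"
        for j, item in enumerate(value, start=1):
            yield from label_nodes(item, label + (j,))
    else:
        yield dewey, str(value)


doc = {
    "name": "LucidWorks",
    "funding_rounds": [{"round_code": "a", "raised_amount": 6000000}],
}
nodes = list(label_nodes(doc))
```

With this scheme the field `name` receives `1.1` and its value `LucidWorks` receives `1.1.1`, so a parent-child test between the two is a simple prefix check.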

Theory behind SIREn: Query Processing

[Figure: step-by-step evaluation of the query name : LucidWorks against the inverted index, using the following posting lists of node identifiers]

name: 1.1, 2.2, 2.5

LucidWorks: 1.5.3, 2.2.1, 4.2.1
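
The lookup illustrated above can be sketched as a join between two posting lists. This is an illustration only: it assumes that the first component of each identifier is the document identifier and that a value matches a field when the field node is its parent. It uses a set lookup for brevity, whereas a real implementation would traverse the sorted lists in step. `match_field_value` is a hypothetical name.

```python
from typing import Dict, List, Tuple


def parse(label: str) -> Tuple[int, ...]:
    return tuple(int(part) for part in label.split("."))


def match_field_value(postings: Dict[str, List[str]], field: str, value: str) -> List[Tuple[int, str]]:
    """Return (doc_id, value_node) pairs where the value node is a child of a field node."""
    field_nodes = {parse(label) for label in postings.get(field, [])}
    hits = []
    for label in postings.get(value, []):
        node = parse(label)
        if node[:-1] in field_nodes:  # the parent of the value node is a matching field node
            hits.append((node[0], label))
    return hits


# Posting lists taken from the slide.
postings = {
    "name": ["1.1", "2.2", "2.5"],
    "LucidWorks": ["1.5.3", "2.2.1", "4.2.1"],
}
hits = match_field_value(postings, "name", "LucidWorks")
```

Only node `2.2.1` has a parent (`2.2`) in the `name` list, so under these assumptions the query matches document 2.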


SIREn Plugin Architecture - Overview

[Diagram: SIREn components plugged into the corresponding Lucene extension points. Lucene: Codec, Analysis, Flexible Query Parser, Query, Document. SIREn: Tree-Labelling Codec, JSON Analyzer, JSON Query Parser, Node Query, JSON Field.]

schema.xml sample

    <fields>
      <field name="id" type="string" indexed="true" stored="true"/>
      <field name="json" type="json" indexed="true" stored="false"/>
      …
    </fields>

    <types>
      <fieldType name="json"
                 class="org.sindice.siren.solr.schema.JsonField"
                 datatypeConfig="datatypes.xml"/>
      …
    </types>

Datatypes

datatypes.xml sample

    <datatype name="http://www.w3.org/2001/XMLSchema#String"
              class="org.sindice.siren.solr.schema.TextDatatype">
      <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
      </analyzer>
    </datatype>

    <datatype name="http://www.w3.org/2001/XMLSchema#int"
              class="org.sindice.siren.solr.schema.TrieDatatype"
              precisionStep="8"
              type="integer"/>

• Traverses the JSON tree using depth-first search

• Generates one token per JSON node

• Attaches metadata attributes (Dewey id, datatype, …) to each token

JSON Tokenizer

[Figure: tokenizer output. Tokens: name, LucidWorks, funding_rounds, round_code, …; attached attributes: 1.1 Field, 1.1.1 String, 1.2 Field, 1.2.2.1 String, …]

• Tokenize the content of a node token based on its datatype

JSON Analyzer – NodeTokenizerFilter

[Figure: node tokens before (Input) and after (Output) the filter. The String node 1.1.1 becomes "lucid works", tokenized with the String datatype analyzer; the Field node 1.2 becomes "funding rounds", tokenized with the Field datatype analyzer. Other nodes shown: 1.1 Field, 1.2.2.1 String, round_code, …]
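
A minimal sketch of datatype-driven tokenization follows. The analyzers here are hypothetical stand-ins (a word splitter and a keyword analyzer), not the Solr analyzers configured in datatypes.xml, and `node_tokenizer_filter` only mimics the role of the filter described above.

```python
import re
from typing import Callable, Dict, List, Tuple

NodeToken = Tuple[str, str, str]  # (node text, Dewey id, datatype)


def split_words(text: str) -> List[str]:
    """Split on case changes and non-alphanumerics, then lowercase."""
    return [w.lower() for w in re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", text)]


def keyword(text: str) -> List[str]:
    """Keep the whole node content as a single term."""
    return [text]


# Hypothetical datatype-to-analyzer mapping.
ANALYZERS: Dict[str, Callable[[str], List[str]]] = {
    "Field": split_words,
    "String": split_words,
    "Int": keyword,
}


def node_tokenizer_filter(nodes: List[NodeToken]) -> List[Tuple[str, str, int]]:
    """Emit (term, Dewey id, position) triples, one per term of each node."""
    out = []
    for text, dewey, datatype in nodes:
        analyzer = ANALYZERS.get(datatype, keyword)
        for position, term in enumerate(analyzer(text)):
            out.append((term, dewey, position))
    return out


terms = node_tokenizer_filter([
    ("LucidWorks", "1.1.1", "String"),
    ("funding_rounds", "1.2", "Field"),
])
```

Every term keeps the Dewey id of the node it came from, so the structural information survives the tokenization step.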

• Encode metadata attributes into a term payload

• Leverage Payload API to transfer attributes to the Codec API

JSON Analyzer – NodePayloadFilter
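
As an illustration of packing node metadata into a payload, the sketch below serialises a datatype id and a Dewey identifier with variable-byte integers. The byte layout is an assumption made for this example and is not SIREn's actual payload format; all function names are hypothetical.

```python
from typing import Tuple


def encode_vint(n: int) -> bytes:
    """Variable-byte encoding: 7 data bits per byte, high bit set on all but the last byte."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def encode_payload(dewey: Tuple[int, ...], datatype_id: int) -> bytes:
    """Assumed layout: datatype id, number of components, then each component."""
    out = bytearray(encode_vint(datatype_id))
    out += encode_vint(len(dewey))
    for component in dewey:
        out += encode_vint(component)
    return bytes(out)


def decode_payload(payload: bytes) -> Tuple[Tuple[int, ...], int]:
    """Inverse of encode_payload."""
    values, current, shift = [], 0, 0
    for byte in payload:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(current)
            current, shift = 0, 0
    datatype_id, length = values[0], values[1]
    return tuple(values[2:2 + length]), datatype_id
```

On the consuming side, a codec would decode the payload and write the node identifier into its own file structure instead of storing the raw payload bytes.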



Tree-Labelling Codec – File Structure

[Figure: block layout of the three index files]

.doc: Header | Doc identifiers | Node frequencies

.nod: Header | Node identifiers | Term frequencies

.pos: Header | Term positions

Tree-Labelling Codec – Compression

• Adaptive Frame Of Reference

– Adapt the encoding to the integer distribution

– Better tolerance against outliers

– Very effective with frequencies, node identifiers and positions (higher compression rate)

[Figure: a FOR block with a single BFS versus an AFOR block with one BFS per frame]
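
The effect of adapting frames to the integer distribution can be sketched by comparing encoded sizes. The code below is a simplified cost model, not the codec itself: the 8-bit frame header, the allowed frame lengths (8, 16, 32) and the function names are assumptions made for illustration.

```python
import math
from typing import List, Sequence

HEADER_BITS = 8  # assumed per-frame header size


def bits_needed(values: Sequence[int]) -> int:
    """Bit width required to store the largest value of a frame."""
    return max(1, max(values).bit_length())


def for_cost(block: List[int]) -> int:
    """Frame Of Reference: one frame, one bit width for the whole block."""
    return HEADER_BITS + len(block) * bits_needed(block)


def afor_cost(block: List[int], frame_lengths: Sequence[int] = (8, 16, 32)) -> float:
    """Adaptive variant: cheapest split of the block into frames of the allowed lengths."""
    n = len(block)
    best = [0] + [math.inf] * n  # best[i] = minimal cost of encoding block[:i]
    for end in range(1, n + 1):
        for length in frame_lengths:
            start = end - length
            if start >= 0 and best[start] != math.inf:
                cost = best[start] + HEADER_BITS + length * bits_needed(block[start:end])
                best[end] = min(best[end], cost)
    return best[n]


# 24 small values followed by a run of 8 outliers.
block = [1] * 24 + [1000] * 8
single_frame = for_cost(block)   # every value pays for the outliers' bit width
adaptive = afor_cost(block)      # only the frame holding the outliers pays
```

In this example the single-frame encoding needs 328 bits while the adaptive split needs 128, which illustrates the outlier tolerance mentioned above.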



• Query Processing

– Collects matching document and node identifiers

– Posting list traversal order: document ids, node ids then positions

• Adaptation of all of Lucene’s Query classes to the new file structure

– NodeTermQuery, NodeBooleanQuery, NodePhraseQuery, …

Node Query






Node Query

• TwigQuery

– Consists of a root query and one or more descendant or child queries

– Can be nested to form complex tree structures

– Can be rewritten as a pure boolean query

[Figure: example twig whose nodes are labelled Boolean (root), Phrase (MUST), Twig (NOT), Boolean (SHOULD) and Range (SHOULD)]
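
A twig can be pictured as a small recursive structure. The sketch below evaluates one directly against an in-memory tree rather than against posting lists. It supports only exact-term roots and parent-child clauses, and treats SHOULD clauses as purely optional, which simplifies Lucene's boolean semantics; the `Node`, `Twig` and `match` names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    text: str
    children: List["Node"] = field(default_factory=list)


@dataclass
class Twig:
    root: str  # term the root node must carry ("*" matches any node)
    clauses: List[Tuple[str, "Twig"]] = field(default_factory=list)  # (occur, child twig)


def match(twig: Twig, node: Node) -> Optional[int]:
    """Return a simple score if the twig matches at this node, else None."""
    if twig.root != "*" and twig.root != node.text:
        return None
    score = 1
    for occur, child in twig.clauses:
        hit = any(match(child, c) is not None for c in node.children)
        if occur == "MUST" and not hit:
            return None
        if occur == "NOT" and hit:
            return None
        if occur in ("MUST", "SHOULD") and hit:
            score += 1
    return score


tree = Node("funding_rounds", [
    Node("round_code", [Node("a")]),
    Node("raised_amount", [Node("6000000")]),
])
twig = Twig("funding_rounds", [
    ("MUST", Twig("round_code", [("MUST", Twig("a"))])),  # nested twig
    ("NOT", Twig("investments")),
    ("SHOULD", Twig("raised_amount")),
])
score = match(twig, tree)
```

Because each clause is itself a twig, nesting comes for free, and a twig without clauses degenerates into a plain node query.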

• Faceted Navigation

– Data-driven exploratory interface

– User incrementally adds constraints

– Restricted to one record collection

• Relational Faceted Navigation

– Enables navigation of interrelated record collections

– Constraints affect all record collections

– New navigation operation: Pivot

• Switch user view to a record collection

Application: Relational Faceted Navigation

Relational Faceted Navigation – Demo

HCLS Demo: http://hcls.sindice.com/pivot-browser/





• Each collection has its own data model (document)

• Lucene fields for facets

• JSON field for relationships with records from other collections

Data Model

[Figure: three record collections and their fields. Company: Country, Category, JSON. Investment: Year, Amount, JSON. Investor: Type, JSON.]

• JSON field: Tree covering all the relationships with records from other collections

• Resulting tree can be very large

JSON Model


[Figure: one JSON tree per record collection, shown over several animation steps.]

Company: category_code, country_code, funding_rounds (round_code, raised_amount, […], investments (type, […]))

Investment: round_code, raised_amount, investments (type, […]), funding_rounds -1 (category_code, country_code, […])

Investor: type, investments -1 (round_code, raised_amount, […], category_code, country_code)

Navigation Model: Drill-Down

Lucene query

    collection : Company
    AND
    country_code : irl
    AND
    category_code : software
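
A drill-down query of this shape can be assembled mechanically from the current collection and the facet constraints the user has selected. The helper below is a hypothetical sketch; it does no escaping of special query characters.

```python
from typing import Dict


def drill_down_query(collection: str, constraints: Dict[str, str]) -> str:
    """Join the collection filter and every facet constraint with AND."""
    clauses = [f"collection : {collection}"]
    clauses += [f"{name} : {value}" for name, value in constraints.items()]
    return " AND ".join(clauses)


query = drill_down_query("Company", {"country_code": "irl", "category_code": "software"})
```

Each additional facet selection simply appends one more clause to the query.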

Navigation Model: Pivot

Preceding Lucene query

    collection : Company
    AND
    country_code : irl
    AND
    category_code : software

After pivoting to the Investment collection:

Lucene query

    collection : Investment

JSON query

    funding_rounds -1 : {
      country_code : irl,
      category_code : software
    }

Query Rewriting

After pivoting again to the Investor collection:

Lucene query

    collection : Investor

JSON query

    investments -1 : {
      founded_year : 2012,
      funding_rounds -1 : {
        country_code : irl,
        category_code : software
      }
    }
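
The rewriting step can be sketched as follows, assuming that on each pivot the constraints accumulated so far are nested under the inverse relation that leads back to the previous collection. The state layout and the `pivot` function are hypothetical; the relation names are written as they appear on the slides.

```python
from typing import Any, Dict


def pivot(state: Dict[str, Any], target: str, inverse_relation: str) -> Dict[str, Any]:
    """Switch to the target collection, nesting the previous constraints under the inverse relation."""
    nested = dict(state["constraints"])  # facet constraints of the previous collection
    nested.update(state["json"])         # constraints inherited from earlier pivots
    return {"collection": target, "constraints": {}, "json": {inverse_relation: nested}}


companies = {
    "collection": "Company",
    "constraints": {"country_code": "irl", "category_code": "software"},
    "json": {},
}

investments = pivot(companies, "Investment", "funding_rounds -1")
investments["constraints"]["founded_year"] = "2012"  # drill-down on the new collection

investors = pivot(investments, "Investor", "investments -1")
```

After the second pivot the JSON query contains two nested levels, one per relation traversed, while the Lucene query only carries the constraints of the collection currently in view.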

• Lucene BlockJoin

– Introduced support for indexing and searching nested data …

– … for small and well-defined schemas

Comparison with BlockJoin

• Artificially increases the number of documents in the index

– One document per nested data record

• Cache size grows linearly with the number of nested data records

– Increased memory usage

Lucene BlockJoin - Scalability
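
To see how quickly the document count grows, the sketch below counts the nested records of one JSON document, under the assumption that each JSON object would be indexed as its own Lucene document in a block. The example data and the `count_records` helper are illustrative only.

```python
from typing import Any


def count_records(value: Any) -> int:
    """Count JSON objects, i.e. the records that would each need their own document."""
    if isinstance(value, dict):
        return 1 + sum(count_records(child) for child in value.values())
    if isinstance(value, list):
        return sum(count_records(item) for item in value)
    return 0


company = {
    "name": "LucidWorks",
    "funding_rounds": [
        {"round_code": "a", "investments": [{"name": "Granite Ventures"}]},
        {"round_code": "b", "investments": [{"name": "Granite Ventures"}]},
    ],
}
block_size = count_records(company)
```

Here a single company expands into five documents, whereas an approach that labels the nodes inside one document keeps it as one.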

• Developers must be aware of the relations between nested data records

– At indexing time to tag parent records

– At querying time to filter parent records

• Upfront effort required to design and configure the system

– Define Parent-Child relationships between record collections

– Define attributes for each record collection

• If not properly designed, risk of incorrect matches

Lucene BlockJoin - Flexibility

• BlockJoin

+ Works out of the box with all of Lucene’s features

‒ Requires upfront design effort

‒ Memory usage dependent on nested data structure

• Tree-Labelling

+ Can handle arbitrary and large nested models

+ Memory friendly

‒ Lucene’s features have to be re-thought and re-implemented

Comparison with BlockJoin

• Nested data model becomes more and more prevalent

• Searching nested data brings new challenges: performance, scalability, flexibility

• Different approaches exist, each one with pros and cons

• SIREn plugin based on tree-labelling techniques

• Enables new kinds of search applications, e.g., a relational faceted browser, with sub-second response time

• SIREn Availability

– Trial license currently available

– In negotiation with the University to open-source

Conclusion

This material is based upon works supported by the European FP7 project LOD2 (257943) and the Irish Research Council for Science, Engineering and Technology.

Acknowledgement

Date post:	11-May-2015
Category:	Technology
Upload:	lucenerevolution