High Performance JSON Search and Relational Faceted Browsing with Lucene

Renaud Delbru Co-Founder, SindiceTech

Post-Doctoral Researcher, NUIG

renaud@sindicetech.com

renaud.delbru@deri.org

• Lucene / Solr

– User for 7 years

– Built a web search engine – sindice.com (700M documents)

• Academia & Research

– Ph.D. in Information Retrieval and Semantic Web

– Post-doctoral researcher at National University of Ireland, Galway

• Industry

– Technical co-founder of SindiceTech

– Management Platform for Enterprise Knowledge Graph

My Background

• Nested Data Model

• SIREn Overview & Theory

• SIREn Plugin Architecture

• Relational Faceted Browsing

• Comparison with BlockJoin

Agenda

Denormalising Relational Data

[Figure: relational example with the entities LucidWorks, Series A, Series B and Granite Ventures]

• SQL

– Query-time join performance penalty

• NoSQL

– Denormalisation of relational data into nested data

– Convert many-to-one/many into one-to-many relationships

– Duplicate data …

– … but avoid joins

Nested Data Model: Why is it important?
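The denormalisation step described above can be sketched as follows. This is a minimal illustration, not code from the talk: the table layouts, the ids and the `denormalise` helper are assumptions, and the rows are made up for the example (only the names LucidWorks and Granite Ventures come from the slides). It shows how a many-to-many link table is folded into one nested document, at the price of copying the investor into every round.

```python
from collections import defaultdict


def denormalise(companies, funding_rounds, investors, investments):
    """Fold the joins into one nested document per company (hypothetical helper)."""
    investor_by_id = {row["id"]: row for row in investors}

    # Resolve the many-to-many link table once, at indexing time.
    investors_by_round = defaultdict(list)
    for link in investments:
        investor = investor_by_id[link["investor_id"]]
        # The investor data is duplicated into every round it took part in.
        investors_by_round[link["round_id"]].append({"name": investor["name"]})

    rounds_by_company = defaultdict(list)
    for rnd in funding_rounds:
        rounds_by_company[rnd["company_id"]].append({
            "round_code": rnd["round_code"],
            "investments": investors_by_round[rnd["id"]],
        })

    return [
        {"name": c["name"], "funding_rounds": rounds_by_company[c["id"]]}
        for c in companies
    ]


# Illustrative rows only, not real funding data.
companies = [{"id": 1, "name": "LucidWorks"}]
funding_rounds = [
    {"id": 10, "company_id": 1, "round_code": "a"},
    {"id": 11, "company_id": 1, "round_code": "b"},
]
investors = [{"id": 100, "name": "Granite Ventures"}]
investments = [
    {"round_id": 10, "investor_id": 100},
    {"round_id": 11, "investor_id": 100},
]

documents = denormalise(companies, funding_rounds, investors, investments)
```

Each resulting document can be searched without a query-time join, which is the trade-off the slide summarises as "duplicate data, but avoid joins".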

• Model becoming prevalent: JSON, XML, Avro, …

– Can be arbitrarily nested and large

– No strict schema / structure enforced

• Schema-less brings

– Flexibility

– Ease of development

• Developers do not have to invest significant modelling effort upfront

Schema-Less Nested Data Model

• Lucene/Solr plugin for indexing and searching JSON

• Rich data model (JSON)

– Nested objects, nested arrays, datatypes

• Schema-agnostic

– No need to define structure (nested model)

– No need to define schema (fields)

Introducing SIREn

Overview of the SIREn API

Document:

  {
    "name" : "LucidWorks",
    "category_code" : "analytics",
    "funding_rounds" : [
      {
        "round_code" : "a",
        "raised_amount" : 6000000,
        "funded_year" : 2009,
        "investments" : [
          {
            "name" : "Granite Ventures",
            "type" : "financial-org"
          }
        ]
      }
    ]
  }

Query:

  (category_code : analytics)
  (funding_rounds : {
    round_code : seed OR a OR angel,
    raised_amount : [0 TO 12000000],
    type : financial-org
  })

• Inspired by tree-labelling scheme techniques (XML IR)

– Label each node with a hierarchical identifier (here, Dewey identifiers)

• Full-text search operators over the content of a node

• Structural search operators over the nodes of the tree

– Ancestor-Descendant, Parent-Child, Sibling, …

Theory behind SIREn

Theory behind SIREn: Tree-Labelling

[Figure: tree representation of the JSON document (nodes shown include LucidWorks, funding_rounds, "round_code" : "a" and raised_amount 6000000), with each node labelled by a Dewey identifier: 1.2.2, 1.2.2.1, 1.2.2.1.1, 1.2.2.2, 1.2.2.2.1]

Theory behind SIREn: Query Processing

[Figure: animation of the query "name : LucidWorks" evaluated against the inverted index; the posting lists shown hold the node identifiers 1.1, 2.2, 2.5 and 1.5.3, 2.2.1, 4.2.1]
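A parent-child match over Dewey identifiers can be sketched as below. This is only an illustration under stated assumptions: it assumes the first posting list on the slide (1.1, 2.2, 2.5) belongs to the term "name" and the second (1.5.3, 2.2.1, 4.2.1) to "LucidWorks", and it uses a simple set lookup instead of the sequential posting-list traversal a real index would perform. The function names are hypothetical.

```python
from typing import List, Tuple

Dewey = Tuple[int, ...]


def parse(label: str) -> Dewey:
    """Turn a label such as '2.2.1' into the tuple (2, 2, 1)."""
    return tuple(int(part) for part in label.split("."))


def parent_child_join(parents: List[str], children: List[str]) -> List[Tuple[str, str]]:
    """Return (parent, child) pairs where the child sits directly under the parent.

    A node is the parent of another when its identifier equals the child's
    identifier with the last component removed.
    """
    parent_ids = {parse(label) for label in parents}
    matches = []
    for label in children:
        prefix = parse(label)[:-1]
        if prefix in parent_ids:
            matches.append((".".join(map(str, prefix)), label))
    return matches


# Posting lists as read from the slide (assignment to terms is an assumption).
name_postings = ["1.1", "2.2", "2.5"]
lucidworks_postings = ["1.5.3", "2.2.1", "4.2.1"]

print(parent_child_join(name_postings, lucidworks_postings))
# prints [('2.2', '2.2.1')]
```

Under these assumptions only node 2.2.1 has a matching parent, so the query matches a single node in document 2.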

SIREn Plugin Architecture - Overview

[Figure: architecture diagram. Components: Document, Analysis, JSON Analyzer, Flexible Query Parser, JSON Query Parser, Node Query, Tree-Labelling Codec. Legend: SIREn, Lucene]

JSON Field

schema.xml sample

</fields>
<types>
  <fieldType name="json"
             class="org.sindice.siren.solr.schema.JsonField"
             datatypeConfig="datatypes.xml"/>
</types>

Datatypes

datatypes.xml sample

<datatype name="http://www.w3.org/2001/XMLSchema#String"
          class="org.sindice.siren.solr.schema.TextDatatype">
  </analyzer>
  </analyzer>
</datatype>
<datatype name="http://www.w3.org/2001/XMLSchema#int"
          class="org.sindice.siren.solr.schema.TrieDatatype"
          precisionStep="8"
          type="integer"/>

• Traverses the JSON tree using depth-first search

• Generates one token per JSON node

• Attaches metadata attributes (Dewey id, datatype, …) to each token

[Figure: tokenizer output for the tokens name, LucidWorks, funding_rounds and round_code, annotated with node identifier and datatype: 1.1 Field, 1.1.1 String, 1.2 Field, 1.2.2.1 String]

JSON Tokenizer
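The traversal described above can be sketched as a small generator. This is not SIREn's tokenizer: the numbering convention (document id first, children numbered from 1, each array element taking its own level), the datatype labels and the names `NodeToken` and `tokenize` are assumptions chosen to resemble the identifiers shown on the slides.

```python
from typing import Any, Iterator, NamedTuple, Tuple


class NodeToken(NamedTuple):
    text: str      # raw content of the node
    dewey: str     # hierarchical identifier, e.g. "1.2.2.1"
    datatype: str  # illustrative label, not SIREn's real datatype names


def _datatype_of(value: Any) -> str:
    if isinstance(value, bool):
        return "Boolean"
    if isinstance(value, int):
        return "Integer"
    if isinstance(value, float):
        return "Double"
    return "String"


def _label(path: Tuple[int, ...]) -> str:
    return ".".join(str(component) for component in path)


def tokenize(node: Any, path: Tuple[int, ...]) -> Iterator[NodeToken]:
    """Depth-first traversal emitting one token per field name and per value.

    Objects and arrays only contribute a level to the identifier; they do not
    produce a token of their own in this sketch.
    """
    if isinstance(node, dict):
        for i, (key, value) in enumerate(node.items(), start=1):
            field_path = path + (i,)
            yield NodeToken(key, _label(field_path), "Field")
            # A single value is treated like a one-element array.
            elements = value if isinstance(value, list) else [value]
            for j, element in enumerate(elements, start=1):
                yield from tokenize(element, field_path + (j,))
    elif isinstance(node, list):
        for j, element in enumerate(node, start=1):
            yield from tokenize(element, path + (j,))
    else:
        yield NodeToken(str(node), _label(path), _datatype_of(node))


document = {
    "name": "LucidWorks",
    "funding_rounds": [
        {"round_code": "seed"},
        {"round_code": "a", "raised_amount": 6000000},
    ],
}

for token in tokenize(document, (1,)):  # (1,) stands for document id 1
    print(token)
```

With this convention the field `name` receives 1.1, its value 1.1.1, and `funding_rounds` 1.2, while the fields of the second array element start at 1.2.2.1.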

• Tokenize the content of a node token based on its datatype

JSON Analyzer – NodeTokenizerFilter

[Figure: filter output. The node content becomes "lucid works" (tokenized with the String datatype analyzer) and "funding rounds" (tokenized with the Field datatype analyzer); the tokens carry the node identifiers and datatypes 1.1 Field, 1.1.1 String, 1.2 Field, 1.2.2.1 String]
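Per-datatype analysis can be pictured as a lookup from datatype to analyzer. The sketch below is an assumption-laden illustration, not the NodeTokenizerFilter itself: the two regular-expression analyzers, the `ANALYZERS` table and `analyze_node` are hypothetical, chosen so that "LucidWorks" and "funding_rounds" split the way the slide shows.

```python
import re
from typing import Callable, Dict, List, Tuple


def analyze_string(text: str) -> List[str]:
    """Split on case changes and non-letters, then lowercase."""
    parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", text)
    return [part.lower() for part in parts]


def analyze_field(text: str) -> List[str]:
    """Split field names on underscores and other non-alphanumerics."""
    return [part.lower() for part in re.split(r"[^A-Za-z0-9]+", text) if part]


# Hypothetical datatype -> analyzer table; unknown datatypes keep the raw text.
ANALYZERS: Dict[str, Callable[[str], List[str]]] = {
    "String": analyze_string,
    "Field": analyze_field,
}


def analyze_node(text: str, dewey: str, datatype: str) -> List[Tuple[str, str, int]]:
    """Return (term, node identifier, position) triples for one node token."""
    analyzer = ANALYZERS.get(datatype, lambda raw: [raw])
    return [(term, dewey, position) for position, term in enumerate(analyzer(text))]


print(analyze_node("LucidWorks", "1.1.1", "String"))
# prints [('lucid', '1.1.1', 0), ('works', '1.1.1', 1)]
print(analyze_node("funding_rounds", "1.2", "Field"))
# prints [('funding', '1.2', 0), ('rounds', '1.2', 1)]
```

Every term produced from a node keeps that node's identifier, so structural operators still work after the content has been split into terms.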

• Encode metadata attributes into a term payload

• Leverage Payload API to transfer attributes to the Codec API

JSON Analyzer – NodePayloadFilter
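One way to picture the payload step is a compact byte encoding of the node identifier and datatype. The layout below (one datatype byte followed by variable-length integers) and the datatype codes are assumptions made for this sketch; the actual byte format used by SIREn is not shown on the slides.

```python
from typing import Tuple

# Hypothetical one-byte datatype codes.
DATATYPE_CODES = {"Field": 0, "String": 1, "Integer": 2}
DATATYPE_NAMES = {code: name for name, code in DATATYPE_CODES.items()}


def _write_vint(value: int, out: bytearray) -> None:
    """Append a non-negative integer using 7 bits per byte, low bits first."""
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)  # high bit set: more bytes follow
        value >>= 7
    out.append(value)


def encode_payload(dewey: Tuple[int, ...], datatype: str) -> bytes:
    out = bytearray([DATATYPE_CODES[datatype]])
    for component in dewey:
        _write_vint(component, out)
    return bytes(out)


def decode_payload(payload: bytes) -> Tuple[Tuple[int, ...], str]:
    datatype = DATATYPE_NAMES[payload[0]]
    components = []
    value = shift = 0
    for byte in payload[1:]:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7          # continuation byte, keep accumulating
        else:
            components.append(value)
            value = shift = 0   # component complete, start the next one
    return tuple(components), datatype


payload = encode_payload((1, 2, 2, 1), "String")
assert decode_payload(payload) == ((1, 2, 2, 1), "String")
```

On the consuming side, a codec would read these bytes back to recover the node identifier for each term occurrence before writing its own posting format.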

Tree-Labelling Codec – File Structure

Header Doc identifiers Node frequencies

Header Node identifiers Term frequencies

Header Term positions

Tree-Labelling Codec – Compression

• Adaptive Frame Of Reference

– Adapt the encoding to the integer distribution

– Better tolerance against outliers

– Very effective with frequencies, node identifiers and positions (higher compression rate)

[Figure: FOR with a single BFS compared with AFOR with several BFS]
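The tolerance to outliers can be illustrated with a simple cost model. This sketch only counts bits and does not pack them, and its parameters are assumptions: an 8-bit header per frame and candidate frame lengths of 8, 16 and 32 values. Plain FOR encodes the whole block with one bit width, whereas the adaptive variant picks the cheapest partition into frames, each with its own width.

```python
from typing import Sequence

HEADER_BITS = 8  # assumed per-frame header that stores the bit width


def bit_width(values: Sequence[int]) -> int:
    """Bits needed to store the largest value of a frame (at least 1)."""
    return max(1, max(value.bit_length() for value in values))


def for_cost(block: Sequence[int]) -> int:
    """Frame Of Reference: a single frame, hence one bit width for the block."""
    return HEADER_BITS + len(block) * bit_width(block)


def afor_cost(block: Sequence[int], frame_sizes=(8, 16, 32)) -> int:
    """Cheapest split of the block into frames of the allowed sizes.

    best[end] holds the minimal number of bits needed for block[:end].
    """
    n = len(block)
    infinity = float("inf")
    best = [0] + [infinity] * n
    for end in range(1, n + 1):
        for size in frame_sizes:
            start = end - size
            if start < 0 or best[start] == infinity:
                continue
            cost = best[start] + HEADER_BITS + size * bit_width(block[start:end])
            best[end] = min(best[end], cost)
    if best[n] == infinity:
        raise ValueError("block length cannot be covered by the frame sizes")
    return best[n]


# One outlier (1000) among small values: FOR pays 10 bits for every value.
block = [1000] + [3] * 31
print(for_cost(block))   # prints 328
print(afor_cost(block))  # prints 152
```

In this example the outlier only inflates the 8-value frame that contains it, which is the effect the slide describes as better tolerance against outliers.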

• Query Processing

– Collects matching document and node identifiers

– Posting list traversal order: document ids, node ids then positions

• Adaptation of all Lucene’s Query classes to the new file structure

– NodeTermQuery, NodeBooleanQuery, NodePhraseQuery, …

Node Query

• TwigQuery

– Consists of a root query and one or more descendant or child queries

– Can be nested to form a complex tree structure

– Can be rewritten as a pure boolean query

[Figure: example query tree with the clauses Boolean, Phrase (MUST), Twig (NOT), Boolean (SHOULD) and Range (SHOULD)]
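The shape of a twig query can be sketched with a toy evaluator. Everything here is hypothetical: the `Twig` class is not SIREn's TwigQuery API, the index is a plain dictionary from term to node identifiers, only descendant clauses with MUST and NOT semantics are modelled, and SHOULD clauses (which mainly affect scoring) are left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

Dewey = Tuple[int, ...]
Index = Dict[str, Set[Dewey]]  # term -> identifiers of nodes containing it


@dataclass
class Twig:
    root: str                                           # term the root node must contain
    must: List["Twig"] = field(default_factory=list)    # required descendants
    must_not: List["Twig"] = field(default_factory=list)  # forbidden descendants


def is_descendant(node: Dewey, ancestor: Dewey) -> bool:
    return len(node) > len(ancestor) and node[:len(ancestor)] == ancestor


def evaluate(twig: Twig, index: Index) -> Set[Dewey]:
    """Return the identifiers of the nodes matching the root of the twig."""
    matches = set(index.get(twig.root, ()))
    for sub in twig.must:
        found = evaluate(sub, index)  # nested twigs are evaluated recursively
        matches = {n for n in matches if any(is_descendant(m, n) for m in found)}
    for sub in twig.must_not:
        found = evaluate(sub, index)
        matches = {n for n in matches if not any(is_descendant(m, n) for m in found)}
    return matches


# Toy index over two documents (identifiers start with the document id).
index: Index = {
    "funding_rounds": {(1, 2), (2, 3)},
    "round_code": {(1, 2, 2, 1), (2, 3, 1, 1)},
    "a": {(1, 2, 2, 1, 1)},
    "b": {(2, 3, 1, 1, 1)},
}

query = Twig("funding_rounds", must=[Twig("round_code", must=[Twig("a")])])
print(evaluate(query, index))  # prints {(1, 2)}
```

Because each clause reduces to a set operation on node identifiers, the nested structure behaves like a boolean combination of simpler node queries.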

• Faceted Navigation

– Data-driven exploratory interface

– User incrementally adds constraints

– Restricted to one record collection

• Relational Faceted Navigation

– Enables navigation of interrelated record collections

– Constraints affect all record collections

– New navigation operation: Pivot

• Switch user view to a record collection

Application: Relational Faceted Navigation
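The pivot operation can be pictured with two small in-memory collections. This is a conceptual sketch only: the records are made up (apart from the names LucidWorks and Granite Ventures), the field `investors` holding related record ids is an assumption, and a real system would answer these steps with index queries rather than list comprehensions.

```python
from collections import Counter
from typing import Dict, List

Record = Dict[str, object]

# Illustrative records; "investors" lists ids of related investor records.
companies: List[Record] = [
    {"id": "c1", "name": "LucidWorks", "category_code": "analytics", "investors": ["i1"]},
    {"id": "c2", "name": "ExampleCo", "category_code": "software", "investors": ["i2"]},
]
investors: List[Record] = [
    {"id": "i1", "name": "Granite Ventures", "type": "financial-org"},
    {"id": "i2", "name": "Example Angel", "type": "person"},
]


def apply_constraint(records: List[Record], field_name: str, value: object) -> List[Record]:
    """Classic faceted navigation step: keep records matching one facet value."""
    return [record for record in records if record.get(field_name) == value]


def facet_counts(records: List[Record], field_name: str) -> Counter:
    """Count facet values over the current result set."""
    return Counter(record[field_name] for record in records if field_name in record)


def pivot(records: List[Record], relation: str, target: List[Record]) -> List[Record]:
    """Switch the view to the target collection, keeping only related records."""
    linked = {ref for record in records for ref in record.get(relation, [])}
    return [record for record in target if record["id"] in linked]


analytics = apply_constraint(companies, "category_code", "analytics")
related_investors = pivot(analytics, "investors", investors)
print([record["name"] for record in related_investors])  # prints ['Granite Ventures']
print(facet_counts(related_investors, "type"))           # prints Counter({'financial-org': 1})
```

The constraint set on companies carries over to the investor view, which is what distinguishes relational faceted navigation from faceting over a single collection.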

Relational Faceted Navigation – Demo

HCLS Demo: http://hcls.sindice.com/pivot-browser/

• Each collection has its own data model (document)

• Lucene fields for facets

• JSON field for relationships with records from other collections

Data Model

Country

Technology