Date posted: 11-May-2015
Category: Technology
Uploaded by: lucenerevolution
HIGH PERFORMANCE JSON SEARCH AND
RELATIONAL FACETED BROWSING WITH LUCENE
Renaud Delbru Co-Founder, SindiceTech
Post-Doctoral Researcher, NUIG
• Lucene / Solr
– User for 7 years
– Built a web search engine – sindice.com (700M documents)
• Academia & Research
– Ph.D. in Information Retrieval and Semantic Web
– Post-doctoral researcher at National University of Ireland, Galway
• Industry
– Technical co-founder of SindiceTech
– Management Platform for Enterprise Knowledge Graph
My Background
• Nested Data Model
• SIREn Overview & Theory
• SIREn Plugin Architecture
• Relational Faceted Browsing
• Comparison with BlockJoin
Agenda
Denormalising Relational Data
[Figure: the relational records LucidWorks, Series A, Series B and Granite Ventures are nested into a single tree; after denormalisation, Granite Ventures appears twice]
• SQL
– Query-time join performance penalty
• NoSQL
– Denormalisation of relational data into nested data
– Convert many-to-one/many into one-to-many relationships
– Duplicate data …
– … but avoid joins
Nested Data Model: Why is it important?
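The denormalisation step above can be sketched as follows. This is a minimal illustration, assuming three hypothetical relational tables (companies, funding rounds, investments); the table layouts, keys and the function name `denormalise` are assumptions made for the example, not part of SIREn.

```python
# Hypothetical relational rows; table layouts and keys are illustrative assumptions.
companies = [{"id": 1, "name": "LucidWorks"}]
funding_rounds = [
    {"id": 10, "company_id": 1, "round_code": "a"},
    {"id": 11, "company_id": 1, "round_code": "b"},
]
investments = [
    {"round_id": 10, "investor": "Granite Ventures"},
    {"round_id": 11, "investor": "Granite Ventures"},
]


def denormalise(companies, funding_rounds, investments):
    """Nest rounds under companies and investors under rounds."""
    docs = []
    for company in companies:
        rounds = []
        for rnd in funding_rounds:
            if rnd["company_id"] != company["id"]:
                continue
            # The investor record is copied into every round it belongs to.
            investors = [
                {"name": inv["investor"]}
                for inv in investments
                if inv["round_id"] == rnd["id"]
            ]
            rounds.append({"round_code": rnd["round_code"], "investments": investors})
        docs.append({"name": company["name"], "funding_rounds": rounds})
    return docs


docs = denormalise(companies, funding_rounds, investments)
```

The investor is duplicated under both rounds, which is the trade-off named in the bullets: more data, but no join at query time.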
• Model becoming prevalent: JSON, XML, Avro, …
– Can be arbitrarily nested and large
– No strict schema / structure enforced
• Schema-less brings
– Flexibility
– Ease of development
• Developers do not have to invest significant modelling effort upfront
Schema-Less Nested Data Model
• Lucene/Solr plugin for indexing and searching JSON
• Rich data model (JSON)
– Nested objects, nested arrays, datatypes
• Schema-agnostic
– No need to define structure (nested model)
– No need to define schema (fields)
Introducing SIREn
Overview of the SIREn API
Document
{
  "name" : "LucidWorks",
  "category_code" : "analytics",
  "funding_rounds" : [
    {
      "round_code" : "a",
      "raised_amount" : 6000000,
      "funded_year" : 2009,
      "investments" : [
        {
          "name" : "Granite Ventures",
          "type" : "financial-org"
        },
        …
      ]
    },
    …
  ]
}
Query
(category_code : analytics)
AND
(funding_rounds : {
  round_code : seed OR a OR angel,
  raised_amount : [0 TO 12000000],
  * : {
    type : financial-org
  }
})
• Inspired by tree-labelling scheme techniques (XML IR)
– Label each node with a hierarchical identifier (here, a Dewey identifier)
• Full-text search operators over the content of a node
• Structural search operators over the nodes of the tree
– Ancestor-Descendant, Parent-Child, Sibling, …
Theory behind SIREn
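The structural operators listed above reduce to prefix comparisons once every node carries a Dewey identifier. The following is a minimal sketch of that idea; the function names are hypothetical and are not SIREn's API.

```python
def parse(label):
    """Turn a Dewey identifier such as "1.2.2.1" into a tuple of integers."""
    return tuple(int(part) for part in label.split("."))


def is_ancestor(ancestor, descendant):
    """True if `ancestor` is a proper prefix of `descendant`."""
    return len(ancestor) < len(descendant) and descendant[: len(ancestor)] == ancestor


def is_parent(parent, child):
    """True if `child` extends `parent` by exactly one component."""
    return len(child) == len(parent) + 1 and child[:-1] == parent


def is_sibling(left, right):
    """True if both nodes are distinct and share the same parent."""
    return left != right and len(left) == len(right) and left[:-1] == right[:-1]
```

Because each check only compares integer prefixes, a structural relationship can be tested from the identifiers alone, without walking the original tree.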
Theory behind SIREn: Tree-Labelling
{
  "name" : "LucidWorks",
  "category_code" : "analytics",
  "funding_rounds" : [
    {
      "round_code" : "a",
      "raised_amount" : 6000000,
      "funded_year" : 2009,
      …
    },
    …
  ]
}
[Figure: the document tree with each node labelled by its Dewey identifier]
1 (root)
1.1 name
1.1.1 LucidWorks
1.2 funding_rounds
1.2.1, 1.2.2 (nested nodes under funding_rounds)
1.2.2.1 round_code
1.2.2.1.1 a
1.2.2.2 raised_amount
1.2.2.2.1 6000000
…
Theory behind SIREn: Query Processing
Query (tree pattern): ? / name / LucidWorks
Inverted Index:
name: 1.1, 2.2, 2.5
LucidWorks: 1.5.3, 2.2.1, 4.2.1
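The query processing illustrated above amounts to a structural join between two posting lists. The sketch below is a simplified illustration, not SIREn's implementation: it assumes the first component of each label is the document identifier, that both lists are sorted, and that a match means the value node is a direct child of the field node. The function name `parent_child_join` is hypothetical.

```python
def parse(label):
    return tuple(int(part) for part in label.split("."))


def parent_child_join(parents, children):
    """Return (parent, child) label pairs where child sits directly under parent.

    Both posting lists hold sorted labels whose first component is assumed
    to be the document identifier.
    """
    matches = []
    i = j = 0
    while i < len(parents) and j < len(children):
        doc_p, doc_c = parents[i][0], children[j][0]
        if doc_p < doc_c:
            i += 1  # no candidate child in this document: advance the parent cursor
        elif doc_c < doc_p:
            j += 1  # no candidate parent in this document: advance the child cursor
        else:
            # Same document: gather both runs, then compare node identifiers.
            i_end = i
            while i_end < len(parents) and parents[i_end][0] == doc_p:
                i_end += 1
            j_end = j
            while j_end < len(children) and children[j_end][0] == doc_p:
                j_end += 1
            parent_nodes = set(parents[i:i_end])
            for child in children[j:j_end]:
                if child[:-1] in parent_nodes:
                    matches.append((child[:-1], child))
            i, j = i_end, j_end
    return matches


# Posting lists taken from the slide.
name = [parse(x) for x in ("1.1", "2.2", "2.5")]
lucidworks = [parse(x) for x in ("1.5.3", "2.2.1", "4.2.1")]
result = parent_child_join(name, lucidworks)
```

With the posting lists from the slide, the only pair that survives is node 2.2 (name) with its child 2.2.1 (LucidWorks), so document 2 is the match.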
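SIREn Plugin Architecture - Overview
[Diagram: SIREn components built on top of Lucene modules]
• Analysis (Lucene): JSON Analyzer (SIREn)
• Codec (Lucene): Tree-Labelling Codec (SIREn)
• Query (Lucene): Node Query (SIREn)
• Flexible Query Parser (Lucene): JSON Query Parser (SIREn)
• Document (Lucene)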
JSON Field
schema.xml sample
<fields>
  <field name="id" type="string" indexed="true" stored="true"/>
  <field name="json" type="json" indexed="true" stored="false"/>
  …
</fields>
<types>
  <fieldType name="json"
             class="org.sindice.siren.solr.schema.JsonField"
             datatypeConfig="datatypes.xml"/>
  …
</types>
Datatypes
datatypes.xml sample
<datatype name="http://www.w3.org/2001/XMLSchema#String"
          class="org.sindice.siren.solr.schema.TextDatatype">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</datatype>
<datatype name="http://www.w3.org/2001/XMLSchema#int"
          class="org.sindice.siren.solr.schema.TrieDatatype"
          precisionStep="8"
          type="integer"/>
• Traverses the JSON tree using depth-first search
• Generates one token per JSON node
• Attaches metadata attributes (Dewey id, datatype, …) to each token
Tokenizer Output
JSON Tokenizer
Token            Dewey id   Datatype
name             1.1        Field
LucidWorks       1.1.1      String
funding_rounds   1.2        Field
round_code       1.2.2.1    String
…
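A rough sketch of such a tokenizer is shown below, assuming a simplified numbering in which the root is 1, each field is a child of its enclosing object, and each value or array element is a child of its field. The exact identifiers SIREn assigns may differ, and the datatype name used for non-string values is an assumption.

```python
def tokenize(node, path=(1,)):
    """Yield (text, dewey_id, datatype) triples in depth-first order."""
    if isinstance(node, dict):
        for i, (field, value) in enumerate(node.items(), start=1):
            field_path = path + (i,)
            yield field, field_path, "Field"
            # An array contributes one child per element; a single value is one child.
            children = value if isinstance(value, list) else [value]
            for k, child in enumerate(children, start=1):
                yield from tokenize(child, field_path + (k,))
    elif isinstance(node, list):
        for k, child in enumerate(node, start=1):
            yield from tokenize(child, path + (k,))
    else:
        # Assumption: non-string scalars are tagged with their Python type name.
        datatype = "String" if isinstance(node, str) else type(node).__name__
        yield str(node), path, datatype


doc = {
    "name": "LucidWorks",
    "funding_rounds": [{"round_code": "a", "raised_amount": 6000000}],
}
tokens = list(tokenize(doc))
```

In this sketch object nodes only contribute a path component and emit no token of their own; field names and scalar values are the nodes that become tokens.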
• Tokenize the content of a node token based on its datatype
JSON Analyzer – NodeTokenizerFilter
Input: name, LucidWorks, funding_rounds, …
Output:
– LucidWorks becomes "lucid works" (tokenized with String datatype analyzer)
– funding_rounds becomes "funding rounds" (tokenized with Field datatype analyzer)
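The datatype-driven step can be sketched as a lookup from datatype to analyzer. The two analyzers below are assumptions chosen only to reproduce the slide's example output ("lucid works", "funding rounds"); the real behaviour depends on the analyzers configured in datatypes.xml.

```python
import re


def string_analyzer(text):
    # Assumption: split camel-case words and digits, then lowercase.
    return [w.lower() for w in re.findall(r"[A-Z]?[a-z]+|[A-Z]+|\d+", text)]


def field_analyzer(text):
    # Assumption: split field names on underscores.
    return [part.lower() for part in text.split("_") if part]


ANALYZERS = {"String": string_analyzer, "Field": field_analyzer}


def node_tokenizer_filter(node_tokens):
    """Expand each node token into terms that keep the node's attributes."""
    for text, dewey_id, datatype in node_tokens:
        analyze = ANALYZERS.get(datatype, lambda t: [t])
        for position, term in enumerate(analyze(text)):
            yield term, dewey_id, datatype, position


node_tokens = [
    ("name", (1, 1), "Field"),
    ("LucidWorks", (1, 1, 1), "String"),
    ("funding_rounds", (1, 2), "Field"),
]
terms = list(node_tokenizer_filter(node_tokens))
```

Every term inherits the Dewey identifier of the node it came from, so "lucid" and "works" can still be located inside node 1.1.1.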
• Encode metadata attributes into a term payload
• Leverage Payload API to transfer attributes to the Codec API
JSON Analyzer – NodePayloadFilter
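One way to picture the payload step is a compact byte encoding of the node attributes. The layout below (datatype id, identifier length, then the Dewey components, all as variable-length integers) is purely an assumption for illustration; the slides do not describe SIREn's actual payload format.

```python
def encode_vint(n):
    """Encode a non-negative integer using 7 bits per byte, low bits first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_payload(dewey_id, datatype_id):
    """Hypothetical layout: datatype id, component count, then the components."""
    payload = bytearray(encode_vint(datatype_id))
    payload += encode_vint(len(dewey_id))
    for component in dewey_id:
        payload += encode_vint(component)
    return bytes(payload)


def decode_payload(payload):
    """Inverse of encode_payload; returns (dewey_id, datatype_id)."""
    values, current, shift = [], 0, 0
    for byte in payload:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(current)
            current, shift = 0, 0
    datatype_id, length, *components = values
    if length != len(components):
        raise ValueError("corrupt payload")
    return tuple(components), datatype_id
```

A codec on the receiving side would decode the same bytes to recover the node identifier and datatype before writing its own postings.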
Tree-Labelling Codec – File Structure
Block layout per file:
• .doc: Header | Doc identifiers | Node frequencies
• .nod: Header | Node identifiers | Term frequencies
• .pos: Header | Term positions
Tree-Labelling Codec – Compression
• Adaptive Frame Of Reference (AFOR)
– Adapt the encoding to the integer distribution
– Better tolerance against outliers
– Very effective with frequencies, node identifiers and positions (higher compression rate)
[Figure: block layout under FOR (one BFS) versus AFOR (four BFS)]
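The tolerance to outliers can be illustrated with a simplified cost model. In the sketch below each frame bit-packs its values using the width of its largest value; subtracting a per-frame reference value is omitted, and the 8-bit frame header and the allowed frame sizes (8, 16, 32) are assumptions made for the example rather than SIREn's actual parameters.

```python
HEADER_BITS = 8  # assumed cost of one frame header


def bits_needed(values):
    """Bit width required to store the largest value of a frame."""
    return max(1, max(values).bit_length())


def for_cost(block):
    """Fixed scheme: the whole block is a single frame."""
    return HEADER_BITS + len(block) * bits_needed(block)


def afor_cost(block, frame_sizes=(8, 16, 32)):
    """Adaptive scheme: cheapest partition of the block into allowed frames."""
    n = len(block)
    best = [0] + [float("inf")] * n  # best[k] = minimal cost of the first k values
    for end in range(1, n + 1):
        for size in frame_sizes:
            start = end - size
            if start >= 0:
                cost = best[start] + HEADER_BITS + size * bits_needed(block[start:end])
                best[end] = min(best[end], cost)
    return best[n]


# One outlier followed by 31 small values.
block = [1000] + [3] * 31
fixed_bits = for_cost(block)
adaptive_bits = afor_cost(block)
```

With a single frame the outlier forces 10 bits on every value; the adaptive partition confines that width to one short frame and packs the remaining values with 2 bits each.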
• Query Processing
– Collects matching document and node identifiers
– Posting list traversal order: document ids, node ids, then positions
• Adaptation of all Lucene’s Query classes to the new file structure
– NodeTermQuery, NodeBooleanQuery, NodePhraseQuery, …
Node Query
• TwigQuery
– Consists of a root query and one or more descendant or child queries
– Can be nested to form a complex tree structure
– Can be rewritten as a pure boolean query
[Figure: example twig combining Boolean, Phrase (MUST), Twig (NOT), Boolean (SHOULD) and Range (SHOULD) queries]
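The twig semantics can be sketched with a naive in-memory evaluator over a parsed JSON document. This is only an illustration of how root and child clauses combine; it does not use an index, the `Twig` class and `matches` function are hypothetical, and the MUST / SHOULD / NOT handling follows the usual Lucene boolean convention as an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Twig:
    root: str                                    # field name matched by the root query
    clauses: list = field(default_factory=list)  # (occur, query) pairs


def matches(query, node):
    """`query` is either a term (str) or a Twig; `node` is a parsed JSON value."""
    if isinstance(query, str):  # leaf term query against a scalar value
        return not isinstance(node, (dict, list)) and str(node) == query
    if not isinstance(node, dict) or query.root not in node:
        return False
    value = node[query.root]
    candidates = value if isinstance(value, list) else [value]
    # All clauses must be satisfied by the same child of the root node.
    return any(_clauses_match(query.clauses, child) for child in candidates)


def _clauses_match(clauses, node):
    must = [q for occur, q in clauses if occur == "MUST"]
    should = [q for occur, q in clauses if occur == "SHOULD"]
    must_not = [q for occur, q in clauses if occur == "NOT"]
    if any(matches(q, node) for q in must_not):
        return False
    if not all(matches(q, node) for q in must):
        return False
    # SHOULD clauses are only required when there is no MUST clause.
    return bool(must) or not should or any(matches(q, node) for q in should)


doc = {
    "name": "LucidWorks",
    "funding_rounds": [{"round_code": "a", "raised_amount": 6000000, "funded_year": 2009}],
}
round_is_seed_or_a = Twig("round_code", [("SHOULD", "seed"), ("SHOULD", "a")])
query = Twig("funding_rounds", [
    ("MUST", round_is_seed_or_a),
    ("NOT", Twig("funded_year", [("MUST", "2012")])),
])
hit = matches(query, doc)
```

Nesting one `Twig` inside another mirrors the slide's point that twigs can be composed into a larger tree pattern.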
• Faceted Navigation
– Data-driven exploratory interface
– User incrementally adds constraints
– Restricted to one record collection
• Relational Faceted Navigation
– Enables navigation of interrelated record collections
– Constraints affect all record collections
– New navigation operation: Pivot
• Switch user view to a record collection
Application: Relational Faceted Navigation
Relational Faceted Navigation – Demo
HCLS Demo: http://hcls.sindice.com/pivot-browser/
• Each collection has its own data model (document)
• Lucene fields for facets
• JSON field for relationships with records from other collections
Data Model
Company: Country, Category, JSON
Investment: Year, Amount, JSON
Investor: Type, JSON
• JSON field: Tree covering all the relationships with records from other collections
• Resulting tree can be very large
JSON Model
[Figure: one relationship tree per collection]
Company: category_code, country_code, funding_rounds (round_code, raised_amount, […], investments (type, […]))
Investment: round_code, raised_amount, funding_rounds -1 (category_code, country_code, […]), investments (type, […])
Investor: type, investments -1 (round_code, raised_amount, […], funding_rounds -1 (category_code, country_code, […]))
Navigation Model: Drill-Down
Lucene query:
collection : Company
AND
country_code : irl
AND
category_code : software
Navigation Model: Pivot
Query Rewriting
Preceding Lucene query:
collection : Company
AND
country_code : irl
AND
category_code : software
Lucene query:
collection : Investment
JSON query:
funding_rounds -1 : {
  country_code : irl,
  category_code : software
}
Navigation Model: Pivot
Lucene query:
collection : Investor
JSON query:
investments -1 : {
  founded_year : 2012,
  funding_rounds -1 : {
    country_code : irl,
    category_code : software
  }
}
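The rewriting performed on each pivot can be sketched as nesting the previous constraints under the inverse relationship of the target collection. The state layout and the function name `pivot` are assumptions made for illustration; the relationship keys are copied from the slides.

```python
def pivot(state, target_collection, inverse_relation):
    """Switch the view to `target_collection`, keeping earlier constraints.

    `state` is a hypothetical navigation state with three keys:
    "collection", "facets" (Lucene field constraints) and "json"
    (nested constraints inherited from earlier pivots, or None).
    """
    nested = dict(state["facets"])
    if state["json"]:
        nested.update(state["json"])
    return {
        "collection": target_collection,
        "facets": {},
        "json": {inverse_relation: nested},
    }


# Drill-down on companies, then pivot to investments.
companies = {
    "collection": "Company",
    "facets": {"country_code": "irl", "category_code": "software"},
    "json": None,
}
investments = pivot(companies, "Investment", "funding_rounds -1")

# Add a facet on investments, then pivot to investors.
investments["facets"]["founded_year"] = "2012"
investors = pivot(investments, "Investor", "investments -1")
```

Each pivot pushes the accumulated constraints one level deeper, which is why the JSON query grows with the length of the navigation path.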
• Lucene BlockJoin
– Introduced support for indexing and searching nested data …
– … for small and well-defined schema
Comparison with BlockJoin
• Artificially increases the number of documents in the index
– One document per nested data record
• Cache size grows linearly with the number of nested data records
– Increased memory usage
Lucene BlockJoin - Scalability
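The growth in document count can be seen by flattening one nested record the way a block-join index expects it: every nested record becomes its own flat document, with children placed before their parent in the block. The sketch below only models that flattening; the `kind` marker field and the function name `to_block` are hypothetical and no Lucene API is involved.

```python
def to_block(doc, kind="root"):
    """Flatten a nested record into a list of flat documents, parent last."""
    block = []
    flat = {"kind": kind}  # hypothetical marker used to tell record types apart
    for key, value in doc.items():
        items = value if isinstance(value, list) else [value]
        if items and all(isinstance(item, dict) for item in items):
            # Each nested record becomes its own document, emitted before its parent.
            for item in items:
                block.extend(to_block(item, kind=key))
        else:
            flat[key] = value
    block.append(flat)
    return block


company = {
    "name": "LucidWorks",
    "funding_rounds": [
        {
            "round_code": "a",
            "investments": [{"name": "Granite Ventures", "type": "financial-org"}],
        }
    ],
}
block = to_block(company)
```

One company with a single round and a single investor already yields three indexed documents, and the count grows with every additional nested record.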
• Developers must be aware of the relations between nested data records
– At indexing time to tag parent records
– At querying time to filter parent records
• Upfront effort required to design and configure the system
– Define Parent-Child relationships between record collections
– Define attributes for each record collection
• If not properly designed, risk of incorrect matches
Lucene BlockJoin - Flexibility
• BlockJoin
+ Works out of the box with all Lucene’s features
‒ Requires upfront design effort
‒ Memory usage depends on the nested data structure
• Tree-Labelling
+ Can handle arbitrary and large nested models
+ Memory friendly
‒ Lucene’s features have to be re-thought and re-implemented
Comparison with BlockJoin
• Nested data models are becoming increasingly prevalent
• Searching nested data brings new challenges: performance, scalability, flexibility
• Different approaches exist, each one with pros and cons
• SIREn plugin based on tree-labelling techniques
• Enables new kinds of search applications, e.g. relational faceted browsers, with sub-second response time
• SIREn Availability
– Trial license currently available
– In negotiation with the University to open-source it
Conclusion
This material is based upon works supported by the European FP7 project LOD2
(257943) and the Irish Research Council for Science, Engineering and Technology.
Acknowledgement