Date post: | 16-Apr-2017 |
Category: |
Software |
Upload: | anshum-gupta |
Working with deeply nested documents in Apache Solr
Anshum Gupta, Apache Lucene/Solr PMC member & committer
IBM Watson
Anshum Gupta
• Apache Lucene/Solr committer and PMC member
• Search guy @ IBM Watson.
• Interested in search and related stuff.
• Apache Lucene since 2006 and Solr since 2010.
Agenda
• Hierarchical Data/Nested Documents
• Indexing Nested Documents
• Querying Nested Documents
• Faceting on Nested Documents
Thanks to my fellow IBMer Alisa Zhila for working on this with me!
Hierarchical Documents
• Social media comments, Email threads, Annotated data - AI
• Relationship between documents
• Possibility to flatten
Need for nested data
EXAMPLE: Blog Post with Comments
Peter Navarro outlines the Trump economic plan
Tyler Cowen, September 27, 2016 at 3:07am
Trump proposes eliminating America’s $500 billion trade deficit through a combination of increased exports and reduced imports.
  1. Ray Lopez, September 27, 2016 at 3:21 am
     I’ll be the first to say this, but the analysis is flawed. {negative}
  2. Brian Donohue, September 27, 2016 at 9:20 am
     The math checks out. Solid. {positive}
examples from http://marginalrevolution.com
• Cannot flatten, need to retain context
• Relationship between documents
• Get all 'positive comments' to 'posts about Trump' -- IMPOSSIBLE!!!
Nested Documents
EXAMPLE: Data Flattening
Title: Peter Navarro outlines the Trump economic plan
Author: Tyler Cowen
Date: September 27, 2016 at 3:07am
Body: Trump proposes eliminating America’s $500 billion trade deficit through a combination of increased exports and reduced imports.
Comment_authors: [Ray Lopez, Brian Donohue]
Comment_dates: [September 27, 2016 at 3:21 am, September 27, 2016 at 9:20 am]
Comment_texts: ["I’ll be the first to say this, but the analysis is flawed.", "The math checks out. Solid."]
Comment_sentiments: [negative, positive]
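Why the flattened form cannot answer "positive comments that say X" can be shown with a small sketch. The two helper functions below are illustrative only (they are not Solr APIs); they mimic how a boolean query over independent multi-valued fields behaves compared with a per-comment match.

```python
# Flattened post: each comment attribute lives in its own multi-valued field,
# so the link between a comment's text and its sentiment is lost.
flat_post = {
    "Title": "Peter Navarro outlines the Trump economic plan",
    "Comment_texts": [
        "I'll be the first to say this, but the analysis is flawed.",
        "The math checks out. Solid.",
    ],
    "Comment_sentiments": ["negative", "positive"],
}

# The same comments kept as separate (nested) documents.
nested_comments = [
    {"Text": "I'll be the first to say this, but the analysis is flawed.",
     "Sentiment": "negative"},
    {"Text": "The math checks out. Solid.", "Sentiment": "positive"},
]


def flat_match(doc, word, sentiment):
    """Mimic Comment_texts:word AND Comment_sentiments:sentiment on a flat doc."""
    text_hit = any(word in text for text in doc["Comment_texts"])
    sentiment_hit = sentiment in doc["Comment_sentiments"]
    return text_hit and sentiment_hit


def nested_match(comments, word, sentiment):
    """Both conditions must hold for the same comment."""
    return any(word in c["Text"] and c["Sentiment"] == sentiment for c in comments)


# No single comment is both "flawed" and positive, yet the flat doc matches.
false_positive = flat_match(flat_post, "flawed", "positive")
correct_answer = nested_match(nested_comments, "flawed", "positive")
```

The flat document matches because the two conditions are satisfied by different comments; the nested form keeps each condition tied to one comment.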
• Cannot flatten, need to retain context
• Relationship between documents
• Get all 'positive comments' to 'posts about Trump' -- POSSIBLE!!! (stay tuned)
Nested Documents
EXAMPLE: Hierarchical Documents
Type: Post
Title: Peter Navarro outlines the Trump economic plan
Author: Tyler Cowen
Date: September 27, 2016 at 3:07am
Body: Trump proposes eliminating America’s $500 billion trade deficit through a combination of increased exports and reduced imports.
    Type: Comment
    Author: Ray Lopez
    Date: September 27, 2016 at 3:21 am
    Text: I’ll be the first to say this, but the analysis is flawed.
    Sentiment: negative
    Type: Comment
    Author: Brian Donohue
    Date: September 27, 2016 at 9:20 am
    Text: The math checks out. Solid.
    Sentiment: positive
• Blog Post Data with Comments and Replies from http://marginalrevolution.com (curated)
• 2 posts, 2-3 comments per post, 0-3 replies per comment
• Extracted keywords & sentiment data
• 4 levels of "nesting"
• Too big to show on slides
• Data + Scripts + Demo Queries:
• https://github.com/alisa-ipn/solr-revolution-2016-nested-demo
Running Example
Indexing Nested Documents
• Nested XML
• JSON Documents
• Add _childDocument_ tags for all children
• Pre-process field names to FQNs
• Lose information, or add that as meta-data during pre-processing
• JSON Document endpoint (6.x only) - /update/json/docs
• Field name mappings
• Child Document splitting - Enhanced support coming soon.
Sending Documents to Solr
solr-6.2.1$ bin/post -c demo-xml ./data/example-data.xml
Sending Documents to Solr: Nested XML
<add>
  <doc>
    <field name="type">post</field>
    <field name="author">"Alex Tabarrok"</field>
    <field name="title">"The Irony of Hillary Clinton’s Data Analytics"</field>
    <field name="body">"Barack Obama’s campaign adopted data but Hillary Clinton’s campaign has been molded by data from birth."</field>
    <field name="id">"12015-24204"</field>
    <doc>
      <field name="type">comment</field>
      <field name="author">"Todd"</field>
      <field name="text">"Clinton got out data-ed and out organized in 2008 by Obama. She seems at least to learn over time, and apply the lessons learned to the real world."</field>
      <field name="sentiment">"positive"</field>
      <field name="id">"29798-24171"</field>
      <doc>
        <field name="type">reply</field>
        <field name="author">"The Other Jim"</field>
        <field name="text">"No, she lost because (1) she is thoroughly detested person and (2) the DNC decided Obama should therefore win."</field>
        <field name="sentiment">"negative"</field>
        <field name="id">"29798-21232"</field>
      </doc>
    </doc>
  </doc>
</add>
• Add _childDocument_ tags for all children
• Pre-process field names to FQNs
• Lose information, or add that as meta-data during pre-processing
solr-6.2.1$ bin/post -c demo-solr-json ./data/small-example-data-solr.json -format solr
Sending Documents to Solr: JSON Documents
[{
  "path": "1.posts",
  "id": "28711",
  "author": "Alex Tabarrok",
  "title": "The Irony of Hillary Clinton’s Data Analytics",
  "body": "Barack Obama’s campaign adopted data but Hillary Clinton’s campaign has been molded by data from birth.",
  "_childDocuments_": [
    {
      "path": "2.posts.comments",
      "id": "28711-19237",
      "author": "Todd",
      "text": "Clinton got out data-ed and out organized in 2008 by Obama. She seems at least to learn over time, and apply the lessons learned to the real world.",
      "sentiment": "positive",
      "_childDocuments_": [
        {
          "path": "3.posts.comments.replies",
          "author": "The Other Jim",
          "id": "28711-12444",
          "sentiment": "negative",
          "text": "No, she lost because (1) she is thoroughly detested person and (2) the DNC decided Obama should therefore win."
        }
      ]
    }
  ]
}]
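The pre-processing step (adding _childDocuments_ and a path field) could look roughly like the sketch below. The raw input layout, where a key holding a list of dicts marks a child level, and the function name to_solr_doc are assumptions made for illustration; the demo repository may do this differently.

```python
def to_solr_doc(node, path_parts, depth=1):
    """Convert a raw nested record into Solr's _childDocuments_ form.

    Adds a "path" field of the form "<depth>.<dotted path>", matching the
    "1.posts" / "2.posts.comments" convention used on the slides.
    """
    doc = {"path": f"{depth}.{'.'.join(path_parts)}"}
    children = []
    for key, value in node.items():
        is_child_level = (
            isinstance(value, list)
            and value
            and all(isinstance(item, dict) for item in value)
        )
        if is_child_level:
            # Assumption: a list of dicts is a child level named after its key.
            for child in value:
                children.append(to_solr_doc(child, path_parts + [key], depth + 1))
        else:
            doc[key] = value
    if children:
        doc["_childDocuments_"] = children
    return doc


# Hypothetical raw record, shortened from the example above.
raw_post = {
    "id": "28711",
    "author": "Alex Tabarrok",
    "comments": [
        {
            "id": "28711-19237",
            "author": "Todd",
            "sentiment": "positive",
            "replies": [
                {"id": "28711-12444", "author": "The Other Jim",
                 "sentiment": "negative"}
            ],
        }
    ],
}

solr_doc = to_solr_doc(raw_post, ["posts"])
```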
• JSON Document endpoint (6.x only) - /update/json/docs
• Field name mappings
• Child Document splitting - Enhanced support coming soon.
solr-6.2.1$ curl 'http://localhost:8983/solr/gettingstarted/update/json/docs?split=/|/posts|/posts/comments|/posts/comments/replies&commit=true' --data-binary @small-example-data.json -H 'Content-type:application/json'
NOTE: All documents must contain a unique ID.
Sending Documents to Solr: JSON Endpoint
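For clients that do not shell out to curl, the same request can be assembled with Python's standard library. This is a minimal sketch: the host, port and collection name are taken from the curl example, and build_update_request is a hypothetical helper name.

```python
from urllib.parse import urlencode, urlsplit, parse_qs
from urllib.request import Request

# Split points copied from the curl example: one per nesting level.
SPLIT = "/|/posts|/posts/comments|/posts/comments/replies"
BASE = "http://localhost:8983/solr/gettingstarted/update/json/docs"


def build_update_request(payload: bytes) -> Request:
    """Build (but do not send) a POST to the JSON document endpoint."""
    query = urlencode({"split": SPLIT, "commit": "true"})
    return Request(
        f"{BASE}?{query}",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_update_request(b"{}")
decoded = parse_qs(urlsplit(request.full_url).query)
```

Passing the request to urllib.request.urlopen would send it; urlencode takes care of escaping the "/" and "|" characters in the split parameter.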
• Update Request Processors don’t work with nested documents
• Example:
• UUID update processor does not auto-add an id for a child document.
• Workaround:
• Take responsibility at the client layer to handle the computation for nested documents.
• Change the update processor in Solr to handle nested documents.
Update Processors and Nested Documents
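The client-side workaround for missing child ids can be sketched as follows. The function name ensure_ids is hypothetical, and using UUIDs is one possible choice; any scheme that yields unique ids would do.

```python
import uuid


def ensure_ids(doc):
    """Recursively give every document in a nested block an "id" if it lacks one."""
    if "id" not in doc:
        doc["id"] = str(uuid.uuid4())
    for child in doc.get("_childDocuments_", []):
        ensure_ids(child)
    return doc


block = {
    "id": "28711",
    "path": "1.posts",
    "_childDocuments_": [
        {"path": "2.posts.comments",
         "_childDocuments_": [{"path": "3.posts.comments.replies"}]},
    ],
}
ensure_ids(block)
```

Existing ids are left untouched, so the function is safe to run on partially prepared data before sending it to Solr.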
• The entire block needs reindexing
• Forgot to add a meta-data field that might be useful? Complete reindex
• Store everything in Solr IF
• it’s too expensive to reconstruct the doc from original data source
• No access to data anymore e.g. streaming data
Re-Indexing Your Documents
• Various ways to index nested documents
• Need to re-index entire block
Nested Document Indexing Summary
Let’s ask some interesting questions
{ "path":["4.posts.comments.replies.keywords"], "text":["Trump"]}, { "path":["3.posts.comments.keywords"], "text":["Trump"]}, { "path":["2.posts.keywords"], "text":["Trump"]}, { "text":["LOL. I enjoyed Trump during last night’s stand-up bit, but this is funnier."], "path":["3.posts.comments.replies"]}, { "text":["Trump proposes eliminating America’s $500 billion trade deficit through a combination of increased exports and reduced imports."], "path":["1.posts"]}, { "text":["Hillary was impressive, for sure, and Trump spent time spluttering and floundering, but he was actually able to find his feet and score some points."], "path":["2.posts.comments"]}
Easy question first
Find all documents that mention Trump
q=text:Trump
{ "text":["LOL. I enjoyed Trump during last night’s stand-up bit, but this is funnier."], "path":["3.posts.comments.replies"]}, { "text":["Hillary was impressive, for sure, and Trump spent time spluttering and floundering, but he was actually able to find his feet and score some points."], "path":["2.posts.comments"]}, { "text":["No one goes to Clinton rallies while tens of thousands line up to see Trump, data-mining leads to a fantasy view of the World."], "path":["2.posts.comments"]}
Returning certain types of documents
Find all comments and replies that mention Trump
q=(path:2.posts.comments OR path:3.posts.comments.replies) AND text:Trump
Recipe: At the data pre-processing stage, add a field that indicates the document type and its path in the hierarchy (stay tuned).
{ "path":["3.posts.comments.keywords"], "sentiment":["positive"], "text":["Hillary"]}, { "path":["4.posts.comments.replies.keywords"], "sentiment":["negative"], "text":["Hillary"]}, { "path":["2.posts.keywords"], "text":["Hillary"]}
Returning the same type from different levels
Find all keywords that are Hillary
q=path:*.keywords AND text:Hillary
Recipe: Use wild-cards in the field that stores the hierarchy path
Cross-Level Querying
Recap so far...
Find all keywords that are Hillary
q=path:*.keywords AND text:Hillary
So far, every query returns exactly the documents the search condition is applied to: the query level and the result level are the same.
Query Level 3
Result Level 3
Query Level 4
Result Level 4
Query Level 2
Result Level 2
Returning parents by querying children: Block Join Parent Query
Find all comments whose keywords detected positive sentiment towards Hillary
q={!parent which="path:2.posts.comments"}path:3.posts.comments.keywords AND text:Hillary AND sentiment:positive
Query Level 3
Result Level 2
{ "author":["Brian Donohue"], "text":["Hillary was impressive, for sure, and Trump spent time spluttering and floundering, but he was actually able to find his feet and score some points."], "path":["2.posts.comments"]}, { "author":["Todd"], "text":["Clinton got out data-ed and out organized in 2008 by Obama. She seems at least to learn over time, and apply the lessons learned to the real world."], "path":["2.posts.comments"]}
{ "sentiment":["negative"], "text":["LOL. I enjoyed Trump during last night’s stand-up bit, but this is funnier."], "path":["3.posts.comments.replies"]}, { "sentiment":["neutral"], "text":["So then I guess he will also eliminate the current account surplus? What will happen to U.S. asset values?"], "path":["3.posts.comments.replies"]}, { "sentiment":["positive"], "text":["Agreed why spend time data-mining for a fantasy view of the world , when instead you can see a fantasy in person?"], "path":["3.posts.comments.replies"]}
Returning children by querying parents: Block Join Child Query
Find replies to negative comments
q={!child of="path:2.posts.comments"}path:2.posts.comments AND sentiment:negative&fq=path:3.posts.comments.replies
Query Level 2
Result Level 3
Block Join Child Query + Filter Query
A bit counterintuitive and not symmetrical to the Block Join Parent Query (BJPQ)
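Block join queries contain braces, quotes and spaces, so they need URL-encoding when sent over HTTP. The sketch below builds select URLs for the parent and child queries shown on these slides; the base URL, the collection name demo-xml and the helper select_url are assumptions for illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def select_url(base, collection, **params):
    """Return a /select URL with all parameters URL-encoded."""
    return f"{base}/{collection}/select?{urlencode(params)}"


# Block Join Parent Query: match keyword children, return their comment parents.
parent_q = (
    '{!parent which="path:2.posts.comments"}'
    "path:3.posts.comments.keywords AND text:Hillary AND sentiment:positive"
)

# Block Join Child Query: match negative comments, return their descendants;
# the filter query keeps only the replies.
child_q = (
    '{!child of="path:2.posts.comments"}'
    "path:2.posts.comments AND sentiment:negative"
)
child_fq = "path:3.posts.comments.replies"

BASE = "http://localhost:8983/solr"
parent_url = select_url(BASE, "demo-xml", q=parent_q)
child_url = select_url(BASE, "demo-xml", q=child_q, fq=child_fq)
```

Keeping q and fq as separate parameters mirrors the slide: the block join selects descendants, and the filter query narrows them to one level.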
Returning all of a document's descendants: Block Join Child Query
Find all descendants of negative comments
q={!child of="path:2.posts.comments"}path:2.posts.comments AND sentiment:negative
Query Level 2
Results Level 3
Results Level 4
{ "path":["4.posts.comments.replies.keywords"], "id":"17413-13550", "text":["Trump"]}, { "text":["LOL. I enjoyed Trump during last night’s stand-up bit, but this is funnier."], "path":["3.posts.comments.replies"], "id":"17413-66188"}, { "path":["3.posts.comments.keywords"], "id":"12413-12487", "text":["Hillary"]}, { "text":["Agreed why spend time data-mining for a fantasy view of the world , when instead you can see a fantasy in person?"], "path":["3.posts.comments.replies"], "id":"12413-10998"}
Issue: no grouping by parent
What if we want to bring the whole sub-structure?
Find all negative comments and return them with all their descendants
q=path:2.posts.comments AND sentiment:negative&fl=*,[child parentFilter=path:2.*]
Query Level 2
Result Level 2
sub-hierarchy
Returning document with all descendants: ChildDocTransformer
{ "sentiment":["negative"], "text":["I’ll be the first to say this, but the analysis is flawed."], "path":["2.posts.comments"], "_childDocuments_":[ { "path":["4.posts.comments.replies.keywords"], "text":["Trump"]}, { "text":["LOL. I enjoyed Trump during last night’s stand-up bit, but this is funnier."], "path":["3.posts.comments.replies"]}, { "path":["4.posts.comments.replies.keywords"], "text":["U.S."]}, { "text":["So then I guess he will also eliminate the current account surplus? What will happen to U.S. asset values?"], "path":["3.posts.comments.replies"]} ] }, ...
Issue: the "sub-hierarchy" is flat
• Returns all descendant documents along with the queried document
• flattens the sub-hierarchy
• Workarounds:
• Reconstruct the document using path ("path":["3.posts.comments.replies"]) information in case you want the entire subtree (result post-processing)
• use childFilter in case you want a specific level
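The first workaround, rebuilding the subtree from the path field, might be sketched like this. It assumes that the flat _childDocuments_ list lists every document's children before the document itself, as in the sample output above, and that the number at the start of path is the nesting depth; rebuild_tree and depth_of are hypothetical helper names.

```python
def depth_of(doc):
    """Read the leading depth number from a path such as "3.posts.comments.replies"."""
    path = doc["path"]
    if isinstance(path, list):  # Solr returns multi-valued fields as lists
        path = path[0]
    return int(path.split(".", 1)[0])


def rebuild_tree(parent):
    """Turn the flat _childDocuments_ list of a result into a nested tree."""
    pending = {}  # depth -> documents still waiting for their parent
    for flat_doc in parent.get("_childDocuments_", []):
        doc = dict(flat_doc)
        depth = depth_of(doc)
        # Everything pending one level deeper belongs to this document.
        children = pending.pop(depth + 1, [])
        if children:
            doc["_childDocuments_"] = children
        pending.setdefault(depth, []).append(doc)
    rebuilt = dict(parent)
    rebuilt["_childDocuments_"] = pending.pop(depth_of(parent) + 1, [])
    return rebuilt


flat_result = {
    "path": ["2.posts.comments"],
    "sentiment": ["negative"],
    "_childDocuments_": [
        {"path": ["4.posts.comments.replies.keywords"], "text": ["Trump"]},
        {"path": ["3.posts.comments.replies"], "text": ["LOL. I enjoyed Trump ..."]},
        {"path": ["4.posts.comments.replies.keywords"], "text": ["U.S."]},
        {"path": ["3.posts.comments.replies"], "text": ["So then I guess ..."]},
    ],
}
tree = rebuild_tree(flat_result)
```

If the ordering assumption does not hold for a given response, a parent reference stored at pre-processing time would be a more robust basis for the reconstruction.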
“This transformer returns all descendant documents of each parent document matching your query in a flat list nested inside the matching parent document." (ChildDocTransformer cwiki)
Returning document with all descendants: ChildDocTransformer
Find all negative comments and return them with all replies to them
q=path:2.posts.comments AND sentiment:negative&fl=*,[child parentFilter=path:2.* childFilter=path:3.posts.comments.replies]
{ "sentiment":["negative"], "text":["I’ll be the first to say this, but the analysis is flawed."], "path":["2.posts.comments"], "_childDocuments_":[ { "text":["LOL. I enjoyed Trump during last night’s stand-up bit, but this is funnier."], "path":["3.posts.comments.replies"]}, { "text":["So then I guess he will also eliminate the current account surplus? What will happen to U.S. asset values?"], "path":["3.posts.comments.replies"]} ] }, ...
Returning document with specific descendants: ChildDocTransformer + childFilter
Query Level 2:comments
Result Level 2:comments + Level 3:replies
Find all negative comments and return them with all their descendants that mention Trump
q=path:2.posts.comments AND sentiment:negative&fl=*,[child parentFilter=path:2.* childFilter=text:Trump]
{ "sentiment":["negative"], "text":["I’ll be the first to say this, but the analysis is flawed."], "path":["2.posts.comments"], "_childDocuments_":[ { "path":["4.posts.comments.replies.keywords"], "text":["Trump"]}, { "text":["LOL. I enjoyed Trump during last night’s stand-up bit, but this is funnier."], "path":["3.posts.comments.replies"]} ] }, ...
Returning document with queried descendants: ChildDocTransformer + childFilter
Query Level 2:comments
Result Level 2:comments
+ sub-levels
Issue: cannot use boolean expressions in childFilter query
Cross-Level Querying Mechanisms:
• Block Join Parent Query
• Block Join Child Query
• ChildDocTransformer
Good points:
• overlapping & complementary features
• good capabilities of querying direct ancestors/descendants
• possible to query on siblings of different type
Drawbacks:
• need for data-preprocessing for better querying flexibility
• limited support of querying over non-directly related branches (overcome with graphs?)
• flattening nested data (additional post-processing is needed for reconstruction)
Nested Document Querying Summary
Faceting on Nested Documents
• Solr allows faceting on nested documents!
• Two mechanisms for faceting:
• Faceting with JSON Facet API (since Solr 5.3)
• Block Join Faceting (since Solr 5.5)
Faceting on Nested Documents
q=path:2.posts.comments AND sentiment:positive&
json.facet={
  most_liked_authors : {
    type: terms,
    field: author,
    domain: { blockParent : "path:1.posts" }
  }
}
Faceting on parents by descendants JSON Facet API: Parent Domain
Count authors of the posts that received positive comments
"most_liked_authors":{ "buckets":[ { "val":"Alex Tabarrok", "count":1}, { "val":"Tyler Cowen", "count":1} ] }
Query Level 2
Facet Level 1
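Writing json.facet by hand is error-prone, so a client may prefer to build it as a data structure and serialize it. The sketch below reproduces the parent-domain facet from this slide; only the Python variable names are new.

```python
import json
from urllib.parse import urlencode, parse_qs

# Facet on post authors (level 1) for the comments matched by q (level 2).
facet = {
    "most_liked_authors": {
        "type": "terms",
        "field": "author",
        "domain": {"blockParent": "path:1.posts"},
    }
}

params = {
    "q": "path:2.posts.comments AND sentiment:positive",
    "json.facet": json.dumps(facet),
}
query_string = urlencode(params)
round_trip = json.loads(parse_qs(query_string)["json.facet"][0])
```

json.dumps emits strict JSON with quoted keys, which Solr accepts alongside the relaxed syntax used on the slide.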
Faceting on descendants by ancestors JSON Facet API: Child Domain
Distribution of keywords that appear in comments and replies by the top-level posts
Query Level 1
Facet Descendant Levels
"top_keywords":{ "buckets":[{ "val":"hillary", "count":4, "counts_by_posts":2}, { "val":"trump", "count":3, "counts_by_posts":2}, { "val":"dnc", "count":1, "counts_by_posts":1}, { "val":"obama", "count":2, "counts_by_posts":1}, { "val":"u.s", "count":1, "counts_by_posts":1} ]}
q=path:1.posts&rows=0&
json.facet={
  filter_by_child_type : {
    type: query,
    q: "path:*comments*keywords",
    domain: { blockChildren : "path:1.posts" },
    facet: {
      top_keywords : {
        type: terms,
        field: text,
        sort: "counts_by_posts desc",
        facet: { counts_by_posts: "unique(_root_)" }
      }
    }
  }
}
Faceting on descendants by top-level ancestor JSON Facet API: Child Domain
Issue: only the top-ancestor gets the unique "_root_" field by default
q=path:2.posts.comments&rows=0&
json.facet={
  filter_by_child_type : {
    type: query,
    q: "path:*comments*keywords",
    domain: { blockChildren : "path:2.posts.comments" },
    facet: {
      top_keywords : {
        type: terms,
        field: text,
        sort: "counts_by_comments desc",
        facet: { counts_by_comments: "unique(2.posts.comments-id)" }
      }
    }
  }
}
Faceting on descendants by intermediate ancestors JSON Facet API: Child Domain + unique fields
Distribution of keywords that appear in comments and replies by the comments
Query Level 2
Facet Descendant Levels
At pre-processing, introduce unique fields for each level
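One way to introduce such per-level unique fields during pre-processing is sketched below. The field name "2.posts.comments-id" comes from the query on this slide; the function add_level_ids and the choice to copy the ancestor's id into every descendant are assumptions about how the field is populated.

```python
def add_level_ids(doc, level_path, inherited=None):
    """Copy the id of each document at level_path into itself and all descendants.

    The value is stored in a field named "<level_path>-id", so that facets can
    count unique ancestors at that level, much as _root_ does for the top level.
    """
    if doc.get("path") == level_path:
        inherited = doc["id"]
    if inherited is not None:
        doc[f"{level_path}-id"] = inherited
    for child in doc.get("_childDocuments_", []):
        add_level_ids(child, level_path, inherited)
    return doc


post = {
    "id": "28711", "path": "1.posts",
    "_childDocuments_": [{
        "id": "28711-19237", "path": "2.posts.comments",
        "_childDocuments_": [{
            "id": "28711-12444", "path": "3.posts.comments.replies",
            "_childDocuments_": [{
                "id": "28711-99999",  # hypothetical keyword document id
                "path": "4.posts.comments.replies.keywords",
                "text": "DNC",
            }],
        }],
    }],
}
add_level_ids(post, "2.posts.comments")
```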
"top_keywords":{ "buckets":[{ "val":"Hillary", "count":4, "counts_by_comments":3}, { "val":"Trump", "count":3, "counts_by_comments":3}, { "val":"DNC", "count":1, "counts_by_comments":1}, { "val":"Obama", "count":2, "counts_by_comments":1}, { "val":"U.S.", "count":1, "counts_by_comments":1} ]}
Now let's try the same using Block Join Faceting
• Experimental Feature
• Needs to be turned on explicitly in solrconfig.xml
More info: https://cwiki.apache.org/confluence/display/solr/BlockJoin+Faceting
Block Join Faceting
bjqfacet?q={!parent which=path:2.posts.comments}path:*.comments*keywords&rows=0&facet=true&child.facet.field=text
Faceting on descendants by ancestors #2: Block Join Faceting on Children Domain
Distribution of keywords that appear in comments and replies by the comments
"facet_fields":{ "text":[ "dnc",1, "hillary",3, "obama",1, "trump",3, "u.s",1 ] }
Query Level 2
Facet Descendant Levels
bjqfacet request handler instead of query
Output Comparison
Distribution of keywords that appear in comments and replies by the comments
Block Join Facet:
"facet_fields":{ "text":[ "dnc",1, "hillary",3, "obama",1, "trump",3, "u.s",1 ] }
JSON Facet API:
"top_keywords":{ "buckets":[{ "val":"Hillary", "count":4, "counts_by_comments":3}, { "val":"Trump", "count":3, "counts_by_comments":3}, { "val":"DNC", "count":1, "counts_by_comments":1}, { "val":"Obama", "count":2, "counts_by_comments":1}, { "val":"U.S.", "count":1, "counts_by_comments":1} ]}
Block Join Facet output is sorted in alphabetical order. This cannot be changed.
With the JSON Facet API, the order is set in the request:
facet:{ top_keywords : { ... sort: "counts_by_comments desc" }}}
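Since Block Join Facet output cannot be re-sorted on the server, a client can sort it after the fact. This is a minimal sketch over the flat [term, count, term, count, ...] list shown in the comparison; sort_facet_counts is a hypothetical helper.

```python
def sort_facet_counts(flat, limit=None):
    """Pair up a flat [term, count, ...] facet list and sort by count, descending.

    Ties are broken alphabetically so the result is deterministic.
    """
    pairs = list(zip(flat[0::2], flat[1::2]))
    pairs.sort(key=lambda pair: (-pair[1], pair[0]))
    return pairs[:limit]


text_facet = ["dnc", 1, "hillary", 3, "obama", 1, "trump", 3, "u.s", 1]
top_two = sort_facet_counts(text_facet, limit=2)
all_sorted = sort_facet_counts(text_facet)
```

This also covers the missing limit control, at the cost of transferring the full facet list to the client first.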
JSON Facet API:
• Experimental - but more mature
• More developed and established feature
• bulky JSON syntax
• faceting on children by non-top level ancestors requires introducing unique branch identifiers similar to "_root_" on each level
Block Join Facet:
• Experimental feature
• Lacks controls: sorting, limit...
• traditional query-style syntax
• proper handling of faceting on children by non-top level ancestors
Hierarchical Faceting Summary
• Returning hierarchical structure
• JSON facet rollups are in the works - SOLR-8998
• Graph query might replace a lot of functionalities of cross-level querying - No distributed support right now.
• There’s more but the community would love to have more people involved!
Community Roadmap
Thank you!
Anshum Gupta [email protected] | @anshumgupta https://github.com/alisa-ipn/solr-revolution-2016-nested-demo