The Apache Solr Smart Data Ecosystem
Trey Grainger
SVP of Engineering, Lucidworks
DFW Data Science 2017.01.09
About Me
Trey Grainger, SVP of Engineering
• Previously Director of Engineering @ CareerBuilder
• MBA, Management of Technology – Georgia Tech
• BA, Computer Science, Business, & Philosophy – Furman University
• Information Retrieval & Web Search - Stanford University
Other fun projects:
• Co-author of Solr in Action, plus numerous research papers
• Frequent conference speaker
• Founder of Celiaccess.com, the gluten-free search engine
• Lucene/Solr contributor
Agenda
• Apache Solr Overview
• Lucidworks Fusion Overview
• Search & Relevancy
  - Keyword Search
  - Text Analysis
  - Multilingual Text Analysis
• Recommendations (Demo)
• Relevancy Spectrum
• Reflected Intelligence
  - Relevancy Tuning
  - Learning to Rank (Demo)
  - Signals (Demo)
• Semantic Search
  - Entity Extraction (Demo)
  - Query Parsing (Demo)
  - Semantic Knowledge Graph (Demo)
• Streaming Expressions
• Solr / Fusion SQL (Demo)
• Solr Graph
DFW Data Science
what do you do?
Search-Driven Everything
• Customer Service
• Customer Insights
• Fraud Surveillance
• Research Portal
• Online Retail
• Digital Content
Lucidworks enables Search-Driven Everything
Data Acquisition
Indexing & Streaming
Smart Access API
Recommendations & Alerts
Analytics & Insights
Extreme Relevancy
CUSTOMER SERVICE
RESEARCH PORTAL
DIGITAL CONTENT
CUSTOMER INSIGHTS
FRAUD SURVEILLANCE
ONLINE RETAIL
•Access all your data in a number of ways from one place.
•Secure storage and processing from Solr and Spark.
•Acquire data from any source with pre-built connectors and adapters.
Machine learning and advanced analytics turn all of your apps into intelligent data-driven applications.
Apache Solr
“Solr is the popular, blazing-fast, open source enterprise
search platform built on Apache Lucene™.”
Key Solr Features:
● Multilingual Keyword search
● Relevancy Ranking of results
● Faceting & Analytics (nested / relational)
● Highlighting
● Spelling Correction
● Autocomplete/Type-ahead Prediction
● Sorting, Grouping, Deduplication
● Distributed, Fault-tolerant, Scalable
● Geospatial search
● Complex Function queries
● Recommendations (More Like This)
● Graph Queries and Traversals
● SQL Query Support
● Streaming Aggregations
● Batch and Streaming processing
● Highly Configurable / Plugins
● Learning to Rank
● Building machine-learning models
● … many more
*source: Solr in Action, chapter 2
The standard for enterprise search.
90% of Fortune 500 uses Solr.
Lucidworks Fusion
All Your Data
• Over 50 connectors to integrate all your data
• Robust parsing framework to seamlessly ingest all your document types
• Point and click Indexing configuration and iterative simulation of results for full control over your ETL process
• Your security model enforced end-to-end from ingest to search across your different datasources
Experience Management
• Relevancy tuning: Point-and-click query pipeline configuration allows fine-grained control of results.
• Machine-driven relevancy: Signals aggregations learn and automatically tune relevancy and drive recommendations out of the box.
• Powerful pipeline stages: Customize fields, stages, synonyms, boosts, facets, machine learning models, your own scripted behavior, and dozens of other powerful search stages.
• Turnkey search UI (Lucidworks View): Build a sophisticated end-to-end search application in just hours.
Operational Simplicity
[Architecture diagram, security built in. Data sources: logs, file, web, database, cloud. Core Services: NLP, Recommenders / Signals, Blob Storage, Pipelines, Scheduling, Alerting / Messaging, Connectors. Access: REST API, Admin UI, Lucidworks View. Apache Solr: shards. Apache ZooKeeper: leader election, load balancing, shared config management. Apache Spark: cluster manager, workers. HDFS (optional).]
• 75% decrease in development time
• Licensing costs cut by 50%
“With Fusion’s out-of-the-box capabilities, we skipped months in our dev cycle so we could focus our team where they would have the most impact. We cut our licensing costs by 50% and improved application usability. The Lucidworks professional services team amplified our success even further. We’re all Fusion from here on out!”
Lourduraju Pamishetty, Senior IT Application Architect
• Seamless integration of your entire search & analytics platform
• All capabilities exposed through secured APIs, so you can use our UI or build your own.
• End-to-end security policies can be applied out of the box to every aspect of your search ecosystem.
• Distributed, fault-tolerant scaling and supervision of your entire search application
Fusion powers search for the brightest companies in the world.
Lucidworks Fusion
search & relevancy
Basic Keyword Search
The beginning of a typical search journey
The inverted index

What you SEND to Lucene/Solr:

Document   Content Field
doc1       once upon a time, in a land far, far away
doc2       the cow jumped over the moon.
doc3       the quick brown fox jumped over the lazy dog.
doc4       the cat in the hat
doc5       The brown cow said “moo” once.
…          …

How the content is INDEXED into Lucene/Solr (conceptually):

Term    Documents
a       doc1 [2x]
brown   doc3 [1x], doc5 [1x]
cat     doc4 [1x]
cow     doc2 [1x], doc5 [1x]
…       …
once    doc1 [1x], doc5 [1x]
over    doc2 [1x], doc3 [1x]
the     doc2 [2x], doc3 [2x], doc4 [2x], doc5 [1x]
…       …
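The mapping from documents to the conceptual index above can be sketched in a few lines of Python. This is only an illustration of the data structure, not how Lucene stores it; the `tokenize` helper (lowercase, split on non-alphanumerics) is an assumption standing in for a real analyzer.

```python
import re
from collections import defaultdict

# The five example documents from the slide.
docs = {
    "doc1": "once upon a time, in a land far, far away",
    "doc2": "the cow jumped over the moon.",
    "doc3": "the quick brown fox jumped over the lazy dog.",
    "doc4": "the cat in the hat",
    "doc5": 'The brown cow said "moo" once.',
}


def tokenize(text):
    """Hypothetical minimal analyzer: lowercase, then split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map each term to {doc_id: number of occurrences in that document}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


index = build_inverted_index(docs)
for term in ("a", "brown", "the"):
    print(term, index[term])
```

Looking up a term is then a single dictionary access, which is why matching stays fast even as the number of documents grows.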
Matching queries to documents

/solr/select/?q=apache solr

Term     Documents
…        …
apache   doc1, doc3, doc4, doc5
…        …
hadoop   doc2, doc4, doc6
…        …
solr     doc1, doc3, doc4, doc7, doc8
…        …

[Venn diagram: "apache" only: doc5; "solr" only: doc7, doc8; "apache solr" (both terms): doc1, doc3, doc4]
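The Venn diagram corresponds to plain set operations on the postings lists. A minimal sketch, assuming the three postings lists shown in the table; the `match` function and its `operator` argument are hypothetical names for illustration only.

```python
# Postings lists copied from the slide's table.
postings = {
    "apache": {"doc1", "doc3", "doc4", "doc5"},
    "hadoop": {"doc2", "doc4", "doc6"},
    "solr": {"doc1", "doc3", "doc4", "doc7", "doc8"},
}


def match(terms, operator="OR"):
    """Return the documents matching all terms (AND) or any term (OR)."""
    sets = [postings.get(term, set()) for term in terms]
    if not sets:
        return set()
    if operator == "AND":
        return set.intersection(*sets)
    return set.union(*sets)


both = match(["apache", "solr"], operator="AND")
either = match(["apache", "solr"], operator="OR")
print(sorted(both))
print(sorted(either))
```

The AND result is the overlap in the middle of the diagram, and the OR result is the whole area covered by both circles.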
Text Analysis
Generating terms to index from raw text
Text Analysis in Solr
A text field in Lucene/Solr has an Analyzer containing:
① Zero or more CharFilters: take incoming text and “clean it up” before it is tokenized
② One Tokenizer: splits incoming text into a Token Stream containing zero or more Tokens
③ Zero or more TokenFilters: examine and optionally modify each Token in the Token Stream
*From Solr in Action, Chapter 6
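The three-stage structure can be mimicked in plain Python to see how text flows through it. This is a conceptual sketch, not the Lucene API: the `Analyzer` class and the individual filter functions are hypothetical stand-ins for the real factories.

```python
import html
import re


def html_strip_char_filter(text):
    """CharFilter stand-in: drop tags and decode entities before tokenizing."""
    return html.unescape(re.sub(r"<[^>]+>", " ", text))


def whitespace_tokenizer(text):
    """Tokenizer stand-in: split the cleaned text on whitespace."""
    return text.split()


def lowercase_filter(tokens):
    """TokenFilter stand-in: lowercase every token."""
    return [token.lower() for token in tokens]


def make_stop_filter(stopwords):
    """Build a TokenFilter stand-in that drops the given stopwords."""
    def stop_filter(tokens):
        return [token for token in tokens if token not in stopwords]
    return stop_filter


class Analyzer:
    """Zero or more char filters, exactly one tokenizer, zero or more token filters."""

    def __init__(self, char_filters, tokenizer, token_filters):
        self.char_filters = char_filters
        self.tokenizer = tokenizer
        self.token_filters = token_filters

    def analyze(self, text):
        for char_filter in self.char_filters:
            text = char_filter(text)
        tokens = self.tokenizer(text)
        for token_filter in self.token_filters:
            tokens = token_filter(tokens)
        return tokens


analyzer = Analyzer(
    char_filters=[html_strip_char_filter],
    tokenizer=whitespace_tokenizer,
    token_filters=[lowercase_filter, make_stop_filter({"the", "a"})],
)
tokens = analyzer.analyze("<b>The</b> Quick brown FOX")
print(tokens)
```

The order of token filters matters: here the stop filter runs after lowercasing, so "The" is removed even though the stopword list is lowercase.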
Multi-lingual Text Analysis
Analyzing text across multiple languages
Example English Analysis Chains
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="lang/en_protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="lang/en_synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.KStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
Per-language Analysis Chains
*Some of the 32 different language configurations in Appendix B of Solr in Action
Which Stemmer do I choose?
*From Solr in Action, Chapter 14
Common English Stemmers
*From Solr in Action, Chapter 14
When Stemming goes awry
Fixing Stemming Mistakes:
• Unfortunately, every stemmer will have problem cases that aren’t handled as you would expect
• Thankfully, stemmers can be overridden
  • KeywordMarkerFilter: protects a list of terms you specify from being stemmed
  • StemmerOverrideFilter: applies a list of custom term mappings you specify

Alternate strategy:
• Use Lemmatization (root-form analysis) instead of Stemming
• Commercial vendors help tremendously in this space
• The Hunspell stemmer enables dictionary-based support of varying quality in over 100 languages
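The two override mechanisms can be illustrated with a toy example. This is a sketch under loud assumptions: `naive_stem` is a deliberately crude suffix stripper standing in for a real stemmer, and `stem_with_overrides` is a hypothetical function that only mimics the effect of a protected-words list and a custom mapping list.

```python
def naive_stem(token):
    """Toy suffix stripper (assumption: stands in for a real stemmer)."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def stem_with_overrides(tokens, protected=frozenset(), overrides=None):
    """Apply custom mappings first, skip protected terms, stem everything else."""
    overrides = overrides or {}
    result = []
    for token in tokens:
        if token in overrides:
            result.append(overrides[token])   # custom term mapping wins
        elif token in protected:
            result.append(token)              # protected from stemming
        else:
            result.append(naive_stem(token))
    return result


stemmed = stem_with_overrides(
    ["running", "news", "geese"],
    protected={"news"},
    overrides={"geese": "goose"},
)
print(stemmed)
```

Without the protected list, this toy stemmer would reduce "news" to "new", which is the kind of mistake the protected-words file exists to prevent.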
Relevancy
Scoring the results, returning the best matches
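Classic Lucene Relevancy Algorithm (now switched to BM25):

\[
\mathrm{Score}(q, d) = \sum_{t \in q} \Big( \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^2 \cdot t.\mathrm{getBoost}() \cdot \mathrm{norm}(t, d) \Big) \cdot \mathrm{coord}(q, d) \cdot \mathrm{queryNorm}(q)
\]

Where: t = term; d = document; q = query; f = field

\[
\begin{aligned}
\mathrm{tf}(t \in d) &= \mathrm{numTermOccurrencesInDocument}^{1/2} \\
\mathrm{idf}(t) &= 1 + \log\big(\mathrm{numDocs} / (\mathrm{docFreq} + 1)\big) \\
\mathrm{coord}(q, d) &= \mathrm{numTermsInDocumentFromQuery} / \mathrm{numTermsInQuery} \\
\mathrm{queryNorm}(q) &= 1 / \mathrm{sumOfSquaredWeights}^{1/2} \\
\mathrm{sumOfSquaredWeights} &= q.\mathrm{getBoost}()^2 \cdot \sum_{t \in q} \big( \mathrm{idf}(t) \cdot t.\mathrm{getBoost}() \big)^2 \\
\mathrm{norm}(t, d) &= d.\mathrm{getBoost}() \cdot \mathrm{lengthNorm}(f) \cdot f.\mathrm{getBoost}()
\end{aligned}
\]

*Source: Solr in Action, chapter 3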
• Term Frequency: “How well does a term describe a document?”
  – Measure: how often a term occurs per document
• Inverse Document Frequency: “How important is a term overall?”
  – Measure: how rare the term is across all documents
TF * IDF
*Source: Solr in Action, chapter 3
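A small numeric sketch of the two measures, using the tf and idf definitions from the classic Lucene formula (square root of the term count, and 1 + log(numDocs / (docFreq + 1))). The function names are illustrative, and the example numbers assume the five-document toy corpus from the inverted index slide, where "the" occurs in four documents and "cat" in one.

```python
import math


def tf(term_count):
    """Term frequency: square root of the occurrences in the document."""
    return math.sqrt(term_count)


def idf(num_docs, doc_freq):
    """Inverse document frequency: rarer terms receive larger values."""
    return 1 + math.log(num_docs / (doc_freq + 1))


def tf_idf(term_count, num_docs, doc_freq):
    """Plain tf * idf product (the full classic score squares idf and adds norms)."""
    return tf(term_count) * idf(num_docs, doc_freq)


# "the" appears twice in doc4 and in 4 of 5 documents; "cat" once, in 1 of 5.
score_the = tf_idf(term_count=2, num_docs=5, doc_freq=4)
score_cat = tf_idf(term_count=1, num_docs=5, doc_freq=1)
print(score_the, score_cat)
```

Even though "the" occurs more often in doc4, the rarer term "cat" ends up with the higher weight, which is the intended effect of the idf factor.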
News Search : popularity and freshness drive relevance
Restaurant Search: geographical proximity and price range are critical
Ecommerce: likelihood of a purchase is key
Movie search: More popular titles are generally more relevant
Job search: category of job, salary range, and geographical proximity matter
TF * IDF of keywords can’t hold its own against good domain-specific relevance factors!
That’s great, but what about domain-specific knowledge?
John lives in Boston but wants to move to New York or possibly another big city. He is currently a sales manager but wants to move towards business development.
Irene is a bartender in Dublin and is only interested in jobs within 10KM of her location in the food service industry.
Irfan is a software engineer in Atlanta and is interested in software engineering jobs at a Big Data company. He is happy to move across the U.S. for the right job.
Jane is a nurse educator in Boston seeking between $40K and $60K
*Example from chapter 16 of Solr in Action
Consider what you know about users
http://localhost:8983/solr/jobs/select/?
  fl=jobtitle,city,state,salary&
  q=(
    jobtitle:"nurse educator"^25 OR jobtitle:(nurse educator)^10
  )
  AND (
    (city:"Boston" AND state:"MA")^15 OR state:"MA"
  )
  AND _val_:"map(salary, 40000, 60000, 10, 0)"
*Example from chapter 16 of Solr in Action
Query for Jane
Jane is a nurse educator in Boston seeking between $40K and $60K
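A query like the one for Jane can be generated from a user profile rather than written by hand. The sketch below is illustrative only: `build_q` and `build_url` are hypothetical helpers, the boosts (25, 10, 15) and the `map` function arguments are copied from the slide, and the host, port, and `jobs` collection are the ones shown in the example URL.

```python
from urllib.parse import urlencode


def build_q(title, city, state, min_salary, max_salary):
    """Assemble the boosted query string from a few profile attributes."""
    return (
        f'( jobtitle:"{title}"^25 OR jobtitle:({title})^10 ) '
        f'AND ( (city:"{city}" AND state:"{state}")^15 OR state:"{state}" ) '
        f'AND _val_:"map(salary, {min_salary}, {max_salary}, 10, 0)"'
    )


def build_url(q, base="http://localhost:8983/solr/jobs/select"):
    """URL-encode the query and the field list into a select request."""
    params = {"fl": "jobtitle,city,state,salary", "q": q}
    return base + "?" + urlencode(params)


q = build_q("nurse educator", "Boston", "MA", 40000, 60000)
url = build_url(q)
print(q)
print(url)
```

Swapping in a different profile (title, location, salary range) produces a personalized query with the same structure, which is what makes this pattern behave like a recommendation engine.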
Search Results for Jane

{ ...
  "response":{"numFound":22,"start":0,"docs":[
    {"jobtitle":" Clinical Educator (New England/ Boston)", "city":"Boston", "state":"MA", "salary":41503},
    {"jobtitle":"Nurse Educator", "city":"Braintree", "state":"MA", "salary":56183},
    {"jobtitle":"Nurse Educator", "city":"Brighton", "state":"MA", "salary":71359}
    …]}}

*Example documents available @ http://github.com/treygrainger/solr-in-action

You just built a recommendation engine!
Demo: Recommendations
The Relevancy Spectrum
• Traditional Keyword Search
• Recommendations
• Semantic Search
• User Intent
• Personalized Search
• Augmented Search
• Domain-aware Matching
Data-driven App Sophistication
• Basic Keyword Search (inverted index, tf-idf, bm25, query formulation, etc.)
• Taxonomies / Entity Extraction (entity recognition, ontologies, synonyms, etc.)
• Query Intent (query classification, semantic query parsing, concept expansion, rules, clustering, classification)
• Relevancy Tuning (signals, AB testing/genetic algorithms, Learning to Rank, Neural Networks)
• Self-learning
what is “reflected intelligence”?
The Three C’s
Content: Keywords and other features in your documents
Collaboration: How others have chosen to interact with your system
Context: Available information about your users and their intent

Reflected Intelligence: “Leveraging previous data and interactions to improve how new data and interactions should be interpreted”
Feedback Loops
User searches → User sees results → User takes an action
Users’ actions inform system improvements
Examples of Reflected Intelligence
● Recommendation Algorithms
● Building user profiles from past searches, clicks, and other actions
● Identifying correlations between keywords/phrases
● Building out automatically-generated ontologies from content and queries
● Determining relevancy judgements (precision, recall, nDCG, etc.) from click logs
● Learning to Rank - using relevancy judgements and machine learning to train a relevance model
● Discovering misspellings, synonyms, acronyms, and related keywords
● Disambiguation of keyword phrases with multiple meanings
● Learning what’s important in your content
Relevancy Tuning
Improving ranking algorithms through experiments and models
How to Measure Relevancy?
A = Retrieved Documents
C = Related Documents
B = documents in both sets (retrieved and related)

Precision = B/A
Recall = B/C

Problem:
Assume precision is 90% and recall is 100%, but the 10% of irrelevant documents are ranked at the top of the retrieved documents. Is that OK?
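The problem can be made concrete with a few lines of Python. In this sketch the document IDs are made up for illustration; the point is that precision and recall are set-based, so two result lists with the same documents in a different order score identically.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall; result order is ignored."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: nine relevant documents and one irrelevant one.
relevant = [f"rel{i}" for i in range(9)]
irrelevant_first = ["junk"] + relevant   # the irrelevant document ranked at the top
irrelevant_last = relevant + ["junk"]    # the irrelevant document ranked at the bottom

scores_first = precision_recall(irrelevant_first, relevant)
scores_last = precision_recall(irrelevant_last, relevant)
print(scores_first, scores_last)
```

Both orderings yield the same pair of numbers, so a position-aware metric is needed to tell a good ranking from a bad one.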
Normalized Discounted Cumulative Gain
Rank Relevancy
3 0.95
1 0.70
2 0.60
4 0.45
Rank Relevancy
1 0.95
2 0.85
3 0.80
4 0.65
Rankings: Ideal, Given
• Position is considered in quantifying relevancy.
• Labeled dataset is required.
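A minimal nDCG computation, assuming the common formulation in which each relevancy value is discounted by log2(position + 1) and the result is normalized by the DCG of the same judgements in ideal (descending) order. The `given` list is an assumption: it reads the first table above as relevancies at ranks 1 through 4.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: later positions contribute less."""
    return sum(
        rel / math.log2(position + 1)
        for position, rel in enumerate(relevances, start=1)
    )


def ndcg(relevances):
    """Normalize by the DCG of the same judgements sorted best-first."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Relevancy values at ranks 1..4, assumed from the first table on the slide.
given = [0.70, 0.60, 0.95, 0.45]
given_score = ndcg(given)
ideal_score = ndcg(sorted(given, reverse=True))
print(given_score, ideal_score)
```

Because the most relevant document sits at rank 3 in the given ordering, its nDCG falls below 1, while the ideally sorted list scores exactly 1.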
Learning to Rank
Learning to Rank (LTR)
● It applies machine learning techniques to discover the combination of features that provides the best ranking.
● It requires a labeled set of documents with relevancy scores for a given set of queries.
● Features used for ranking are usually more computationally expensive than the ones used for matching.
● It typically re-ranks a subset of the matched documents (e.g. top 1000).
Common LTR Algorithms
• RankNet* (Neural Network, boosted trees)
• LambdaMart* (set of regression trees)
• SVM Rank** (SVM classifier)
** http://research.microsoft.com/en-us/people/hangli/cao-et-al-sigir2006.pdf
* http://research.microsoft.com/pubs/132652/MSR-TR-2010-82.pdf
LambdaMart Example
Source: T. Grainger, K. AlJadda. ”Reflected Intelligence: Evolving self-learning data systems". Georgia Tech, 2016
Demo: Solr Learning to Rank
Obtaining Relevancy Judgements
Typical Methodologies
1) Hire employees, contractors, or interns
   - Pros: Accuracy
   - Cons: Expensive; not scalable (cost or manpower-wise); data becomes stale
2) Crowdsource
   - Pros: Less cost, more scalable
   - Cons: Less accurate; data still becomes stale
Source: T. Grainger, K. AlJadda. ”Reflected Intelligence: Evolving self-learning data systems". Georgia Tech, 2016
Reflected Intelligence: Possible to infer relevancy judgements?
Rank   Document ID
1      Doc1
2      Doc2
3      Doc3
4      Doc4

Click Graph (Query → documents): Doc1 = 0, Doc2 = 1, Doc3 = 1
Skip Graph (Query → documents): Doc1 = 1, Doc2 = 0, Doc3 = 0
Source: T. Grainger, K. AlJadda. ”Reflected Intelligence: Evolving self-learning data systems". Georgia Tech, 2016
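One way to turn click and skip graphs into relevancy judgements is sketched below. It is an assumption, not the method from the cited paper: a document counts as skipped when it was ranked above the last clicked result but was not clicked itself, and the judgement is simply clicks / (clicks + skips). The session format and the `infer_judgements` name are hypothetical.

```python
from collections import defaultdict


def infer_judgements(sessions):
    """sessions: iterable of (query, ranked_doc_ids, clicked_doc_ids)."""
    clicks = defaultdict(int)
    skips = defaultdict(int)
    for query, ranked, clicked in sessions:
        clicked = set(clicked)
        clicked_positions = [i for i, doc in enumerate(ranked) if doc in clicked]
        if not clicked_positions:
            continue  # no click means no evidence either way in this sketch
        last_click = max(clicked_positions)
        # Only documents the user plausibly looked at: up to the last click.
        for doc in ranked[: last_click + 1]:
            if doc in clicked:
                clicks[(query, doc)] += 1
            else:
                skips[(query, doc)] += 1
    judgements = {}
    for key in set(clicks) | set(skips):
        judgements[key] = clicks[key] / (clicks[key] + skips[key])
    return judgements


sessions = [("q", ["Doc1", "Doc2", "Doc3", "Doc4"], ["Doc2", "Doc3"])]
judgements = infer_judgements(sessions)
print(judgements)
```

With the single session above, Doc1 is skipped, Doc2 and Doc3 are clicked, and Doc4 receives no judgement because it sits below the last click.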
Automated Relevancy Benchmarking
Demo: Fusion Signals
• 200%+ increase in click-through rates
• 91% lower TCO
• Fewer support tickets
• Increased customer satisfaction
semantic search
Building a Taxonomy of Entities
Many ways to generate this:
• Topic Modelling
• Clustering of documents
• Statistical Analysis of interesting phrases
  - Word2Vec / GloVe / Dice Conceptual Search
• Buy a dictionary (often doesn’t work for domain-specific search problems)
• Generate a model of domain-specific phrases by mining query logs for commonly searched phrases within the domain*
* K. Aljadda, M. Korayem, T. Grainger, C. Russell. "Crowdsourced Query Augmentation through Semantic Discovery of Domain-specific Jargon," in IEEE Big Data 2014.
entity extraction
Demo: Solr Text Tagger
semantic query parsing
Probabilistic Query Parser
Goal: given a query, predict which combinations of keywords should be combined together as phrases

Example: senior java developer hadoop

Possible Parsings:
senior, java, developer, hadoop
"senior java", developer, hadoop
"senior java developer", hadoop
"senior java developer hadoop"
"senior java", "developer hadoop"
senior, "java developer", hadoop
senior, java, "developer hadoop"

Source: Trey Grainger, “Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disambiguation”, Bay Area Search Meetup, November 2015.
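Candidate parsings can be enumerated by deciding, for each gap between adjacent keywords, whether to break or to keep the words together as a phrase. The sketch below does that and then picks a winner with a deliberately simple scoring rule. The scoring is an assumption for illustration: `phrase_scores` holds made-up phrase likelihoods, single words get a fixed score of 0.5, and unknown multi-word phrases get 0; a real parser would derive these from corpus statistics.

```python
from itertools import product


def possible_parsings(query):
    """Enumerate every way to group adjacent keywords into phrases."""
    tokens = query.split()
    if not tokens:
        return []
    parsings = []
    for breaks in product([True, False], repeat=len(tokens) - 1):
        phrases, current = [], [tokens[0]]
        for token, is_break in zip(tokens[1:], breaks):
            if is_break:
                phrases.append(" ".join(current))
                current = [token]
            else:
                current.append(token)
        phrases.append(" ".join(current))
        parsings.append(phrases)
    return parsings


def score(parsing, phrase_scores, single_word_score=0.5):
    """Hypothetical score: mean of per-phrase scores."""
    values = [
        phrase_scores.get(phrase, 0.0) if " " in phrase else single_word_score
        for phrase in parsing
    ]
    return sum(values) / len(values)


def best_parsing(query, phrase_scores):
    candidates = possible_parsings(query)
    return max(candidates, key=lambda p: score(p, phrase_scores), default=[])


parsings = possible_parsings("senior java developer hadoop")
best = best_parsing("senior java developer hadoop", {"java developer": 0.9})
print(len(parsings), best)
```

A query of n keywords has 2^(n-1) contiguous groupings, so exhaustive enumeration is only practical for short queries.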
Demo: Probabilistic Query Parser
Semantic Query Parsing
Identification of phrases in queries using two steps:
1) Check a dictionary of known terms that is continuously built, cleaned, and refined based upon common inputs from interactions with real users of the system. The SolrTextTagger works well for this.*
2) Also invoke a probabilistic query parser to dynamically identify unknown phrases using statistics from a corpus of data (language model)
*K. Aljadda, M. Korayem, T. Grainger, C. Russell. "Crowdsourced Query Augmentation through Semantic Discovery of Domain-specific Jargon," in IEEE Big Data 2014.
query augmentation
Knowledge Graph
Semantic Data Encoded into Free Text Content
Documents

id: 1
job_title: Software Engineer
desc: software engineer at a great company
skills: .Net, C#, java

id: 2
job_title: Registered Nurse
desc: a registered nurse at hospital doing hard work
skills: oncology, phlebotomy

id: 3
job_title: Java Developer
desc: a software engineer or a java engineer doing work
skills: java, scala, hibernate

Terms-Docs Inverted Index

field      term            doc  pos
desc       a               1    4
                           2    1
                           3    1, 5
           at              1    3
                           2    4
           company         1    6
           doing           2    6
                           3    8
           engineer        1    2
                           3    3, 7
           great           1    5
           hard            2    7
           hospital        2    5
           java            3    6
           nurse           2    3
           or              3    4
           registered      2    2
           software        1    1
                           3    2
           work            2    8
                           3    9
job_title  java developer  3    1
…          …               …    …

Docs-Terms Forward Index

field      doc  term
desc       1    a, at, company, engineer, great, software
           2    a, at, doing, hard, hospital, nurse, registered, work
           3    a, doing, engineer, java, or, software, work
job_title  1    Software Engineer
…          …    …
Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith.“The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.
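The pairing of an inverted index (term to documents) with a forward index (document to terms) is what lets terms act as graph nodes connected through the documents they share. Below is a minimal in-memory sketch using the three example documents; the dictionary layout and the `related` function are illustrative assumptions, not the Solr data structures.

```python
from collections import defaultdict

# The three example job documents (multi-valued fields as lists).
docs = {
    1: {"job_title": ["Software Engineer"], "skills": [".Net", "C#", "java"]},
    2: {"job_title": ["Registered Nurse"], "skills": ["oncology", "phlebotomy"]},
    3: {"job_title": ["Java Developer"], "skills": ["java", "scala", "hibernate"]},
}

inverted = defaultdict(set)  # (field, term) -> ids of documents containing it
forward = defaultdict(set)   # (doc id, field) -> terms found in that document
for doc_id, fields in docs.items():
    for field, terms in fields.items():
        for term in terms:
            inverted[(field, term.lower())].add(doc_id)
            forward[(doc_id, field)].add(term.lower())


def related(field, term, target_field):
    """Traverse term -> documents (inverted) -> terms of target_field (forward)."""
    doc_ids = inverted.get((field, term.lower()), set())
    result = set()
    for doc_id in doc_ids:
        result |= forward.get((doc_id, target_field), set())
    if field == target_field:
        result.discard(term.lower())  # a term is not related to itself
    return result


related_skills = related("skills", "java", "skills")
related_titles = related("skills", "java", "job_title")
print(sorted(related_skills))
print(sorted(related_titles))
```

Each call is one hop in the graph: an inverted index lookup to reach the documents, followed by a forward index lookup to reach the neighbouring terms.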
Knowledge Graph
Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith.“The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.
Knowledge Graph
How the Graph Traversal Works
[Figure with three views of the same data: Set-theory View, Graph View, and Data Structure View. Skill nodes (Java, Scala, Hibernate, Oncology) are connected by has_related_skill edges through the documents (doc 1 to doc 6) that contain them.]
Knowledge Graph
Graph Model
Structure:
Single-level Traversal / Scoring:
Multi-level Traversal / Scoring:
Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith.“The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.
Knowledge Graph
Multi-level Traversal
[Figure with a Data Structure View and a Graph View. In the data structure view, skill terms (Java, Scala, Hibernate, Oncology) and job_title terms (Software Engineer, Data Scientist, Java Developer) are linked to doc 1 through doc 6 by alternating Inverted Index Lookups and Forward Index Lookups. In the graph view, the same terms appear as nodes joined by has_related_skill and has_related_job_title edges.]
Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith.“The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.
Knowledge Graph
Scoring nodes in the Graph
Foreground vs. Background Analysis
Every term is scored against its context. The more commonly the term appears within its foreground context versus its background context, the more relevant it is to the specified foreground context.

\[
z = \frac{\mathrm{countFG}(x) - \mathrm{totalDocsFG} \cdot \mathrm{probBG}(x)}{\sqrt{\mathrm{totalDocsFG} \cdot \mathrm{probBG}(x) \cdot \big(1 - \mathrm{probBG}(x)\big)}}
\]

Foreground Query: "Hadoop"

{ "type":"keywords",
  "values":[
    { "value":"hive", "relatedness": 0.9765, "popularity":369 },
    { "value":"spark", "relatedness": 0.9634, "popularity":15653 },
    { "value":".net", "relatedness": 0.5417, "popularity":17683 },
    { "value":"bogus_word", "relatedness": 0.0, "popularity":0 },
    { "value":"teaching", "relatedness": -0.1510, "popularity":9923 },
    { "value":"CPR", "relatedness": -0.4012, "popularity":27089 } ] }
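The z formula translates directly into code. In this sketch the argument names mirror the symbols in the formula (countFG, totalDocsFG, probBG); the numbers in the example calls are made up, and the result is the raw z value rather than the normalized relatedness shown in the JSON output, since the slide does not show the normalization step.

```python
import math


def z_score(count_fg, total_docs_fg, prob_bg):
    """Compare a term's foreground count with the count expected from the background."""
    expected = total_docs_fg * prob_bg
    variance = total_docs_fg * prob_bg * (1 - prob_bg)
    if variance <= 0:
        return 0.0  # term absent from (or present in all of) the background
    return (count_fg - expected) / math.sqrt(variance)


# Hypothetical numbers: 100 foreground documents, term in half of the background.
as_expected = z_score(count_fg=50, total_docs_fg=100, prob_bg=0.5)
over_represented = z_score(count_fg=80, total_docs_fg=100, prob_bg=0.5)
under_represented = z_score(count_fg=20, total_docs_fg=100, prob_bg=0.5)
print(as_expected, over_represented, under_represented)
```

A positive value means the term shows up in the foreground more often than the background predicts, a negative value means less often, and zero means no evidence of a relationship.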
Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith.“The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.
Knowledge Graph
Multi-level Graph Traversal with Scores
[Figure: traversal from the starting node "software engineer" (a materialized node) along has_related_skill edges to skill nodes such as Java, C#, and .NET, then along has_related_skill edges to further skill nodes such as Hibernate, Scala, and VB.NET, and along has_related_job_title edges to job title nodes such as .NET Developer, Java Developer, Software Engineer, and Data Scientist. Each edge carries a relatedness score; the scores shown range from 0.34 to 0.93.]
Knowledge Graph
Use Case: Document Summarization
Experiment: Pass in raw text (extracting phrases as needed), and rank their similarity to the documents using the SKG.
Additionally, can traverse the graph to “related” entities/keyword phrases NOT found in the original document
Applications: Content-based and multi-modal recommendations (no cold-start problem), data cleansing prior to clustering or other ML methods, semantic search / similarity scoring
Demo: Semantic Knowledge Graph
streaming expressions
• Perform relational operations on streams
• Stream sources: search, jdbc, facet, features, gatherNodes, shortestPath, train, model, random, stats, topic
• Stream decorators: classify, commit, complement, daemon, executor, fetch, having, leftOuterJoin, hashJoin, innerJoin, intersect, merge, null, outerHashJoin, parallel, priority, reduce, rollup, scoreNodes, select, sort, top, unique, update
Streaming Expressions
Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
• Relies on docValues (column-oriented data structure) and /export handler
• Extreme read performance (8-10x faster than queries using cursorMark)
• Facet or map/reduce style aggregation modes
• Tiered architecture
  • SQL interface tier
  • Worker tier (scale a pool of worker “nodes” independently of the data collection)
  • Data tier (Solr collection)
Streaming API: Nuts and Bolts
Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
Streaming Expressions - Examples
Shortest-path Graph Traversal
Parallel Batch Processing
Train a Logistic Regression Model
Distributed Joins
Rapid Export of all Search Results
Pull Results from External Database
Classifying Search Results
Sources: https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
http://joelsolr.blogspot.com/2016/10/solr-63-batch-jobs-parallel-etl-and.html
Solr SQL
• SQL is a ubiquitous language for analytics
• People: Less training and easier to understand
• Tools! Solr as JDBC data source (DbVisualizer, Apache Zeppelin, and SQuirreL SQL)
• Query planning / optimization can evolve iteratively
SQL is a natural extension for Solr’s parallel computing engine
Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
Mental Warm-up
Give me the top 5 action movies with rating of 4 or better

/select?q=*:*
  &fq=genre_ss:action
  &fq=rating_i:[4 TO *]
  &facet=true
  &facet.limit=5
  &facet.mincount=1
  &facet.field=title_s

{ ...
  "facet_counts":{
    "facet_fields":{
      "title_s":[
        "Star Wars (1977)",501,
        "Return of the Jedi (1983)",379,
        "Godfather, The (1972)",351,
        "Raiders of the Lost Ark (1981)",348,
        "Empire Strikes Back, The (1980)",293]},
  ...}}

SELECT title_s, COUNT(*) as cnt
FROM movielens
WHERE genre_ss='action' AND rating_i='[4 TO *]'
GROUP BY title_s
ORDER BY cnt desc
LIMIT 5

{"result-set":{"docs":[
  {"title_s":"Star Wars (1977)","cnt":501},
  {"title_s":"Return of the Jedi (1983)","cnt":379},
  {"title_s":"Godfather, The (1972)","cnt":351},
  {"title_s":"Raiders of the Lost Ark (1981)","cnt":348},
  {"title_s":"Empire Strikes Back, The (1980)","cnt":293},
  {"EOF":true,"RESPONSE_TIME":42}]}}
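For readers who want to try the SQL form from a script, the sketch below prepares an HTTP request for Solr's `/sql` handler, which takes the statement in a `stmt` parameter. The base URL, port, and `build_sql_request` helper are assumptions for illustration; the request is only constructed here, not sent, so no running Solr instance is needed to follow along.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_sql_request(base_url, collection, stmt, aggregation_mode="facet"):
    """Prepare a POST to the collection's /sql handler (not sent by this sketch)."""
    body = urlencode({"stmt": stmt, "aggregationMode": aggregation_mode})
    return Request(
        f"{base_url}/solr/{collection}/sql",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


stmt = (
    "SELECT title_s, COUNT(*) as cnt FROM movielens "
    "WHERE genre_ss='action' AND rating_i='[4 TO *]' "
    "GROUP BY title_s ORDER BY cnt desc LIMIT 5"
)
# Assumed local Solr address; adjust to your own environment.
request = build_sql_request("http://localhost:8983", "movielens", stmt)
print(request.get_method(), request.full_url)
```

Passing the prepared request to `urllib.request.urlopen` against a live collection should return a JSON result set shaped like the one above.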
Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
SQL Examples

Give me the avg rating for men and women over 30 for romance movies

SELECT gender_s, COUNT(*) as num_ratings, avg(rating_i) as avg_rating
FROM movielens
WHERE genre_ss='romance' AND age_i='[30 TO *]'
GROUP BY gender_s
ORDER BY num_ratings desc

Give me the top 5 rated movies with at least 100 ratings

SELECT title_s, genre_s, COUNT(*) as num_ratings, avg(rating_i) as avg_rating
FROM movielens
GROUP BY title_s, genre_s
HAVING num_ratings >= 100
ORDER BY avg_rating desc
LIMIT 5

Give me the set of unique users that have rated documentaries

SELECT DISTINCT(user_id_i) as user_id
FROM movielens
WHERE genre_ss='documentary'
ORDER BY user_id desc
Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
parallel(workers,
  hashJoin(
    search(movielens, q=*:*, fl="user_id_i,movie_id_i,rating_i", sort="movie_id_i asc", partitionKeys="movie_id_i"),
    hashed=search(movielens_movies, q=*:*, fl="movie_id_i,title_s,genre_s", sort="movie_id_i asc", partitionKeys="movie_id_i"),
    on="movie_id_i"
  ),
  workers="4",
  sort="movie_id_i asc"
)
Streaming Expression Example: hashJoin
The small “right” side of the join gets loaded into memory on each worker node
Each shard queried by N workers, so 4 workers x 4 shards means 16 queries (usually all replicas per shard are hit)
Workers collection isolates parallel computation nodes from data nodes
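The join strategy itself is easy to picture in plain Python: the smaller side is loaded into an in-memory hash table keyed on the join field, and the larger side is streamed past it. This is a single-process sketch of the idea only; it ignores partitioning across workers, and the sample rows and titles are made up.

```python
def hash_join(left_rows, right_rows, on):
    """Stream left_rows against an in-memory hash of right_rows (inner join)."""
    hashed = {}
    for row in right_rows:  # the small "right" side is fully loaded into memory
        hashed.setdefault(row[on], []).append(row)
    for left in left_rows:  # the large "left" side is streamed one row at a time
        for right in hashed.get(left[on], []):
            yield {**right, **left}


# Hypothetical rows shaped like the movielens example fields.
ratings = [
    {"user_id_i": 1, "movie_id_i": 10, "rating_i": 5},
    {"user_id_i": 2, "movie_id_i": 11, "rating_i": 3},
    {"user_id_i": 3, "movie_id_i": 99, "rating_i": 4},  # no matching movie
]
movies = [
    {"movie_id_i": 10, "title_s": "Movie A", "genre_s": "action"},
    {"movie_id_i": 11, "title_s": "Movie B", "genre_s": "drama"},
]

joined = list(hash_join(ratings, movies, on="movie_id_i"))
for row in joined:
    print(row)
```

Memory use is driven by the hashed side only, which is why the smaller collection belongs in the `hashed` position.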
Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
• spark-solr project uses streaming API to pull data from Solr into Spark jobs if docValues enabled, see: https://github.com/lucidworks/spark-solr
• Perform aggregations of “signals”, e.g. clicks, to compute boosts and recommendations using Spark
• Custom Scala script jobs to perform complex analysis on data in Solr, e.g. sessionize request logs
• Power rich data visualizations using Fusion’s SQL Engine powered by SparkSQL + Solr streaming aggregations
How we use Solr streaming API in Fusion
Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
Comparing SQL Capabilities

Secret Sauce
  Fusion:    SparkSQL Benefits + Solr Benefits + Enterprise Security
  Solr:      Push complex query constructs into engine (full text, spatial, relevancy, graph, functions, etc)
  Hive:      Mature SQL solution for Hadoop stack
  Drill:     Execute SQL over NoSQL data sources
  SparkSQL:  Spark core (optimized shuffle, in-memory, etc), integration of other APIs: ML, Streaming, GraphX

SQL Features
  Fusion:    Maturing
  Solr:      Evolving
  Hive:      Mature
  Drill:     Maturing
  SparkSQL:  Maturing

Scaling
  Fusion:    Linear (shards and replicas) backed by inverted index
  Solr:      Linear (shards and replicas) backed by inverted index
  Hive:      Limited by Hadoop infrastructure (table scans)
  Drill:     Good, but need to benchmark
  SparkSQL:  Memory intensive; scale out using Spark cluster, backed by RDDs

Integration w/ external systems
  Fusion:    Analytics Catalog API, JDBC Driver, ODBC Bridge
  Solr:      JDBC stream source
  Hive:      external tables / plugin API
  Drill:     many drivers available
  SparkSQL:  DataSource API, many systems supported
Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
Demo: Fusion SQL Engine
Graph
Graph Use Cases
• Anomaly detection / fraud detection
• Recommenders
• Social network analysis
• Graph Search
• Access Control
• Relationship discovery / scoring

Examples
o Find all draft blog posts about “Parallel SQL” written by a developer
o Find all tweets mentioning “Solr” by me or people I follow
o Find 3-star hotels in NYC my friends stayed in last year
Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
Solr Graph Timeline
• Some data is much more naturally represented as a graph structure
• Solr 6.0: Introduced the Graph Query Parser
• Solr 6.1: Introduced Graph Streaming expressions
…
• Solr 6.3: Current Version
• TBD: Semantic Knowledge Graph (patch available)
Graph Query Parser
• Query-time, cycle-aware graph traversal is able to rank documents based on relationships
• Provides controls for depth, filtering of results and inclusion of root and/or leaves
• Limitations: single node/shard only
Examples:
• http://localhost:8983/solr/graph/query?fl=id,score&q={!graph from=in_edge to=out_edge}id:A
• http://localhost:8983/solr/my_graph/query?fl=id&q={!graph from=in_edge to=out_edge traversalFilter='foo:[* TO 15]'}id:A
• http://localhost:8983/solr/my_graph/query?fl=id&q={!graph from=in_edge to=out_edge maxDepth=1}foo:[* TO 10]
Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
Graph Streaming Expressions
• Part of Solr’s broader Streaming Expressions capability
• Implements a powerful, breadth-first traversal
• Works across shards AND collections
• Supports aggregations
• Cycle aware

curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -d 'expr=…' "http://localhost:18984/solr/movielens/stream"
Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
All movies that user 389 watched
expr:gatherNodes(movielens,walk="389->user_id_i",gather="movie_id_i")
Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
All movies that viewers of a specific movie watched
expr:gatherNodes(movielens,
  gatherNodes(movielens, walk="161->movie_id_i", gather="user_id_i"),
  walk="node->user_id_i", gather="movie_id_i",
  trackTraversal="true")
Movie 161: “The Air Up There”
Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
Collaborative Filtering
expr=top(n="5", sort="count(*) desc",
  gatherNodes(movielens,
    top(n="30", sort="count(*) desc",
      gatherNodes(movielens,
        search(movielens, q="user_id_i:305", fl="movie_id_i", sort="movie_id_i asc", qt="/export"),
        walk="movie_id_i->movie_id_i", gather="user_id_i", maxDocFreq="10000", count(*))),
    walk="node->user_id_i", gather="movie_id_i", count(*)))
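The logic of the nested expression can be followed in a small in-memory sketch: start from one user's movies, find the users who share the most of them, then count the movies those users watched. The `recommend` function and the sample data are illustrative assumptions. Unlike the streaming expression as written, this sketch also drops the starting user and the movies they have already seen, which is a deliberate simplification for readability.

```python
from collections import Counter


def recommend(ratings, user_id, top_users=30, top_movies=5):
    """ratings: list of (user_id, movie_id) pairs."""
    seen = {movie for user, movie in ratings if user == user_id}
    # Hop 1: users who watched the same movies, ranked by overlap count.
    user_counts = Counter(
        user for user, movie in ratings if movie in seen and user != user_id
    )
    similar_users = {user for user, _ in user_counts.most_common(top_users)}
    # Hop 2: movies watched by those users, ranked by how many of them watched it.
    movie_counts = Counter(
        movie for user, movie in ratings
        if user in similar_users and movie not in seen
    )
    return [movie for movie, _ in movie_counts.most_common(top_movies)]


# Hypothetical viewing data.
ratings = [
    (1, "a"), (1, "b"),
    (2, "a"), (2, "b"), (2, "c"),
    (3, "a"), (3, "c"), (3, "d"),
    (4, "z"),
]
recommended = recommend(ratings, user_id=1)
print(recommended)
```

The two `top` limits play the same role as in the expression: they cap how many similar users and how many candidate movies survive each hop.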
Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
Comparing Graph Choices

Best Use Case
  Solr:           QParser: predef. relationships as filters; Expressions: fast, query-based, dist. graph ops
  Elastic Graph:  Limited to sequential, term relatedness exploration only
  Neo4J:          Graph ops and querying that fit on a single node
  Spark GraphX:   Large-scale, iterative graph ops

Common Graph Algorithms (e.g. Pregel, Traversal)
  Solr:           Partial
  Elastic Graph:  No
  Neo4J:          Yes
  Spark GraphX:   Yes

Scaling
  Solr:           QParser: Co-located Shards only; Expressions: Yes
  Elastic Graph:  Yes
  Neo4J:          Master/Replica
  Spark GraphX:   Yes

Commercial License Required
  Solr:           No
  Elastic Graph:  Yes
  Neo4J:          GPLv3
  Spark GraphX:   No

Visualizations
  Solr:           GraphML support (e.g. Gephi)
  Elastic Graph:  Kibana
  Neo4J:          Neo4j browser
  Spark GraphX:   3rd party
Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
Additional References:
Contact Info
Trey Grainger
[email protected] @treygrainger
http://solrinaction.com
Meetup discount (39% off): 39grainger
Other presentations: http://www.treygrainger.com