Date post: | 07-Jan-2017 |
Category: | Technology |
Upload: | trey-grainger |
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine
Trey Grainger Director of Engineering, Search & Recommendations
2015.10.15
About Me
• Joined CareerBuilder in 2007 as a Software Engineer
• MBA, Management of Technology – Georgia Tech
• BA, Computer Science, Business, & Philosophy – Furman University
• Mining Massive Datasets (in progress) - Stanford University
Fun outside of CB:
• Co-author of Solr in Action, plus a handful of research papers
• Frequent conference speaker
• Founder of Celiaccess.com, the gluten-free search engine
• Lucene/Solr contributor
Agenda
• Introduction
• Defining the problem – the need for Semantic Search
• Building an Intent Engine
  - Type-ahead prediction
  - Spelling Correction
  - Entity / Entity-type Resolution
  - Semantic Query Parsing
  - Query Augmentation
  - The Knowledge Graph
• Conclusion
Knowledge Graph
At CareerBuilder, Solr Powers...
Search by the Numbers
Powering 50+ search experiences, including:
• 100 million+ searches per day
• 30+ software developers, data scientists + analysts
• 500+ search servers
• 1.5 billion+ documents indexed and searchable
• 1 global search technology platform
...and many more
What’s the problem we’re trying to solve today?
User’s Query: machine learning research and development Portland, OR software engineer AND hadoop, java
Traditional Query Parsing: (machine AND learning AND research AND development AND portland) OR (software AND engineer AND hadoop AND java)
Semantic Query Parsing: "machine learning" AND "research and development" AND "Portland, OR" AND "software engineer" AND hadoop AND java
Semantically Expanded Query: ("machine learning"^10 OR "data scientist" OR "data mining" OR "artificial intelligence") AND ("research and development"^10 OR "r&d") AND ("Portland, OR"^10 OR "Portland, Oregon" OR {!geofilt pt=45.512,-122.676 d=50 sfield=geo}) AND ("software engineer"^10 OR "software developer") AND (hadoop^10 OR "big data" OR hbase OR hive) AND (java^10 OR j2ee)
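To make the expansion step concrete, here is a minimal Python sketch of how a list of recognized phrases could be turned into a boosted query string like the one above. The EXPANSIONS table and the expand() and quote() helpers are hypothetical names for illustration; the slides do not show how the real system builds this string, and the geo filter is left out.

```python
from typing import Dict, List

# Hypothetical expansion table; the entries mirror the example on the slide.
EXPANSIONS: Dict[str, List[str]] = {
    "machine learning": ["data scientist", "data mining", "artificial intelligence"],
    "software engineer": ["software developer"],
    "hadoop": ["big data", "hbase", "hive"],
    "java": ["j2ee"],
}


def quote(term: str) -> str:
    """Wrap multi-word terms in quotes so they are searched as phrases."""
    return f'"{term}"' if " " in term else term


def expand(phrases: List[str], boost: int = 10) -> str:
    """Build one OR-group per phrase, boosting the phrase the user typed."""
    clauses = []
    for phrase in phrases:
        alternatives = [f"{quote(phrase)}^{boost}"]
        alternatives += [quote(term) for term in EXPANSIONS.get(phrase, [])]
        clauses.append("(" + " OR ".join(alternatives) + ")")
    return " AND ".join(clauses)


if __name__ == "__main__":
    print(expand(["machine learning", "software engineer", "hadoop", "java"]))
```

Boosting the original phrase keeps documents that match what the user actually typed ahead of documents that only match a related term.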
But we also really want “things”, not “strings”…
[Figure labels: Job Level, Job Title, Company; Job Title, Company, School + Degree]
Knowledge Graph and Intent Engine
[Architecture diagram. Components shown: Search Box; Intent Engine (Type-ahead Prediction, Spelling Correction, Entity / Entity Type Resolution, Semantic Query Parsing, Query Augmentation, Knowledge Graph); Relevancy Engine (“re-expressing intent”); Query Re-writing; Machine-learned Ranking; Search Results; User Feedback (Clarifying Intent)]
Type-ahead Predictions
Semantic Autocomplete
• Shows top terms for any search
• Breaks out job titles, skills, companies, related keywords, and other categories
• Understands abbreviations, alternate forms, misspellings
• Supports full Boolean syntax and multi-term autocomplete
• Enables fielded search on entities, not just keywords
Spelling Correction*
*Google “Solr Spell Check Component”
Entity / Entity-type Resolution
Differentiating related terms
Synonyms:
  cpa => certified public accountant
  rn => registered nurse
  r.n. => registered nurse
Ambiguous Terms*:
  driver => driver (trucking) ~80% likelihood
  driver => driver (software) ~20% likelihood
Related Terms:
  r.n. => nursing, bsn
  hadoop => mapreduce, hive, pig
*differentiated based upon user and query context
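A small Python sketch of how these three kinds of mappings might be applied at query time. The synonym entries and the 80/20 likelihoods come from the slide; the CONTEXT_HINTS table and the word-overlap rule are assumptions made for illustration, since the slide only says that ambiguous terms are differentiated by user and query context.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Synonyms from the slide: each variant maps to one canonical form.
SYNONYMS: Dict[str, str] = {
    "cpa": "certified public accountant",
    "rn": "registered nurse",
    "r.n.": "registered nurse",
}

# Ambiguous terms from the slide: candidate senses with prior likelihoods.
SENSES: Dict[str, List[Tuple[str, float]]] = {
    "driver": [("driver (trucking)", 0.8), ("driver (software)", 0.2)],
}

# Hypothetical context words per sense (an assumption, not from the talk).
CONTEXT_HINTS: Dict[str, Set[str]] = {
    "driver (trucking)": {"cdl", "truck", "delivery"},
    "driver (software)": {"kernel", "linux", "firmware"},
}


def resolve(term: str, context: Iterable[str] = ()) -> str:
    """Return the canonical entity for a term, using context to pick a sense."""
    term = term.lower()
    if term in SYNONYMS:
        return SYNONYMS[term]
    if term in SENSES:
        ctx = {word.lower() for word in context}

        def rank(sense_and_prior: Tuple[str, float]) -> Tuple[int, float]:
            sense, prior = sense_and_prior
            overlap = len(ctx & CONTEXT_HINTS.get(sense, set()))
            # Context overlap wins; the prior likelihood breaks ties.
            return (overlap, prior)

        return max(SENSES[term], key=rank)[0]
    return term


if __name__ == "__main__":
    print(resolve("r.n."))
    print(resolve("driver"))
    print(resolve("driver", ["linux", "kernel"]))
```

With no context the most likely sense is returned, so "driver" alone resolves to the trucking sense.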
Building a Taxonomy of Entities
Many ways to generate this:
• Topic Modelling
• Clustering of documents
• Statistical Analysis of interesting phrases
• Buy a dictionary (often doesn’t work for domain-specific search problems)
• …
Our strategy: Generate a model of domain-specific phrases by mining query logs for commonly searched phrases within the domain [1]
[1] K. Aljadda, M. Korayem, T. Grainger, C. Russell. "Crowdsourced Query Augmentation through Semantic Discovery of Domain-specific Jargon," in IEEE Big Data 2014.
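As a rough illustration of mining query logs for commonly searched phrases, the Python sketch below counts how many distinct queries contain each word sequence and keeps the ones that recur. This plain frequency count is a simplified stand-in, not the method described in [1]; the function names and the min_count threshold are hypothetical.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def ngrams(tokens: List[str], max_len: int = 3) -> Iterator[str]:
    """Yield every contiguous word sequence of up to max_len tokens."""
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def common_phrases(
    queries: Iterable[str], min_count: int = 2, max_len: int = 3
) -> List[Tuple[str, int]]:
    """Return (phrase, query_count) pairs seen in at least min_count queries."""
    counts: Counter = Counter()
    for query in queries:
        # Use a set so a phrase is counted once per query.
        counts.update(set(ngrams(query.lower().split(), max_len)))
    return [(p, c) for p, c in counts.most_common() if c >= min_count]


if __name__ == "__main__":
    log = [
        "java developer",
        "senior java developer",
        "registered nurse",
        "registered nurse icu",
        "java",
    ]
    for phrase, count in common_phrases(log):
        print(count, phrase)
```

A real log would also need normalization, spam filtering, and a way to separate genuine phrases from words that merely co-occur, which is where the statistical approach in [1] comes in.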
Entity-type Recognition
Build classifiers trained on external data sources (Wikipedia, DBPedia, WordNet, etc.), as well as from our own domain.
The subject for a future talk / research paper…
[Figure: example phrases (java developer, registered nurse, emergency room, director, Portland, OR, part-time) mapped to entity types (job title, skill, job level, location, work type)]
Semantic Query Parsing
Query Parsing: The whole is greater than the sum of the parts
project manager vs. "project" AND "manager"
building architect vs. "building" AND "architect"
software architect vs. "software" AND "architect"
Consider:
  a "software architect" designs and builds software
  a "building architect" uses software to design architecture
User’s Query: machine learning research and development Portland, OR software engineer AND hadoop java
Traditional Query Parsing: (machine AND learning AND research AND development AND portland) OR (software AND engineer AND hadoop AND java)
≠
Identifying the correct phrase (not just the parts) is crucial here!
Probabilistic Query Parser
Goal: given a query, predict which combinations of keywords should be combined together as phrases
Example: senior java developer hadoop
Possible Parsings:
senior, java, developer, hadoop
"senior java", developer, hadoop
"senior java developer", hadoop
"senior java developer hadoop"
"senior java", "developer hadoop"
senior, "java developer", hadoop
senior, java, "developer hadoop"
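The candidate parsings can be enumerated mechanically: each gap between adjacent keywords is either a phrase boundary or not, which gives 2^(n-1) candidates for n keywords. The Python sketch below only generates the candidates; how the probabilistic parser scores them is not shown on the slides, and possible_parsings() is a hypothetical name.

```python
from itertools import product
from typing import List


def possible_parsings(query: str) -> List[List[str]]:
    """Return every way to group adjacent keywords into contiguous phrases."""
    tokens = query.split()
    if not tokens:
        return []
    parsings = []
    # One True/False decision per gap: True means "start a new phrase here".
    for breaks in product([True, False], repeat=len(tokens) - 1):
        parsing, current = [], [tokens[0]]
        for token, is_break in zip(tokens[1:], breaks):
            if is_break:
                parsing.append(" ".join(current))
                current = [token]
            else:
                current.append(token)
        parsing.append(" ".join(current))
        parsings.append(parsing)
    return parsings


if __name__ == "__main__":
    for parsing in possible_parsings("senior java developer hadoop"):
        print(", ".join(f'"{p}"' if " " in p else p for p in parsing))
```

For the four-keyword example this yields eight candidates: the seven listed above plus senior, "java developer hadoop".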
Input: senior hadoop developer java ruby on rails perl
Semantic Search Architecture – Query Parsing
1) Generate the previously discussed taxonomy of domain-specific phrases
   • You can mine query logs or actual text of documents for significant phrases within your domain [1]
2) Feed these phrases to SolrTextTagger [2] (uses Lucene FST for high-throughput term lookups)
3) Use SolrTextTagger to perform entity extraction on incoming queries (tagging documents is also possible)
4) Also invoke probabilistic parser to dynamically identify unknown phrases from a corpus of data (language model)
5) Shown on next slides: Pass extracted entities to a Query Augmentation phase to rewrite the query with enhanced semantic understanding
[1] K. Aljadda, M. Korayem, T. Grainger, C. Russell. "Crowdsourced Query Augmentation through Semantic Discovery of Domain-specific Jargon," in IEEE Big Data 2014.
[2] https://github.com/OpenSextant/SolrTextTagger
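The entity extraction in step 3 can be pictured as a longest-match dictionary lookup over the query. The Python sketch below uses a plain set of phrases as a stand-in; it does not call SolrTextTagger or use its API, and it has none of the FST's throughput. KNOWN_PHRASES and tag() are hypothetical names.

```python
from typing import List, Set, Tuple

# Hypothetical dictionary of known domain phrases (step 1's output).
KNOWN_PHRASES: Set[str] = {
    "java developer",
    "machine learning",
    "registered nurse",
    "software engineer",
    "hadoop",
}


def tag(query: str, phrases: Set[str]) -> List[Tuple[str, bool]]:
    """Split a query into (text, is_known_phrase) pairs, longest match first."""
    tokens = query.lower().split()
    max_len = max((len(p.split()) for p in phrases), default=1)
    tagged: List[Tuple[str, bool]] = []
    i = 0
    while i < len(tokens):
        # Try the longest window first so "software engineer" beats "software".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in phrases:
                tagged.append((candidate, True))
                i += n
                break
        else:
            # No known phrase starts here; leave the token for the
            # probabilistic parser (step 4) to handle.
            tagged.append((tokens[i], False))
            i += 1
    return tagged


if __name__ == "__main__":
    print(tag("machine learning software engineer hadoop java", KNOWN_PHRASES))
```

Tokens that come back untagged are the ones step 4 would pass to the probabilistic parser.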
Query Augmentation
Semantic Search Architecture – Query Augmentation

Keywords: machine learning

Signals feeding Semantic Query Augmentation:
• Search Behavior, Application Behavior, etc.
• Job Title Classifier, Skills Extractor, Job Level Classifier, etc.

Related Phrases
machine learning: { data mining .9, matlab .8, data scientist .75, artificial intelligence .7, neural networks .55 }

Common Job Titles
machine learning: { software engineer .65, data manager .3, data scientist .25, hadoop engineer .2 }

Related Occupations
machine learning: {
  15-1031.00 .58 Computer Software Engineers, Applications
  15-1011.00 .55 Computer and Information Scientists, Research
  15-1032.00 .52 Computer Software Engineers, Systems Software }

Known keyword phrases
java developer
machine learning
registered nurse

Modified Query:
keywords:((machine learning)^10 OR { AT_LEAST_2: ("data mining"^0.9, matlab^0.8, "data scientist"^0.75, "artificial intelligence"^0.7, "neural networks"^0.55) }) { BOOST_TO_TOP: ( job_title:("software engineer" OR "data manager" OR "data scientist" OR "hadoop engineer")) }

[Other diagram labels: FST; Knowledge Graph in; Query Enrichment; Document Enrichment]
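One possible reading of the AT_LEAST_2 clause in the modified query is "match a document that contains the original phrase, or at least two of its related phrases". The Python sketch below implements that reading with simple substring checks. This interpretation, the matches() helper, and the substring matching are all assumptions; the slides do not define the clause's semantics or how the weights are used.

```python
from typing import Dict

# Related phrases and weights for "machine learning", as listed on the slide.
RELATED: Dict[str, float] = {
    "data mining": 0.9,
    "matlab": 0.8,
    "data scientist": 0.75,
    "artificial intelligence": 0.7,
    "neural networks": 0.55,
}


def matches(text: str, phrase: str, related: Dict[str, float], at_least: int = 2) -> bool:
    """True if text contains the phrase, or at least `at_least` related phrases."""
    text = text.lower()
    if phrase in text:
        return True
    # Crude substring test; a search engine would match on analyzed tokens.
    hits = [term for term in related if term in text]
    return len(hits) >= at_least


if __name__ == "__main__":
    print(matches("MATLAB and neural networks experience", "machine learning", RELATED))
    print(matches("MATLAB tutor", "machine learning", RELATED))
```

Requiring two related phrases rather than one keeps a single loosely related term, such as matlab on its own, from pulling in off-topic documents.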
Knowledge Graph
Serves as a “data science toolkit” API that allows dynamically navigating and pivoting through multiple levels of relationships between items in our domain. Compare the relationships of skills to keywords, job titles to skills to keywords, skills to government occupation codes, skills to experience level, etc.
Knowledge Graph API
Core similarity engine, exposed via API
Any product can leverage our core relationship scoring engine to score any list of entities against any other list.
Full domain support
Keywords, job titles, skills, companies, job levels, locations, and all other taxonomies.
Intersections, overlaps, & relationship scoring, many levels deep
Users can either provide a list of items to score, or else have the system dynamically discover the most related items (or both).
Knowledge Graph
So how does it work?
Foreground vs. Background Analysis
Every term is scored against its context. The more commonly the term appears within its foreground context versus its background context, the more relevant it is to the specified foreground context.
$$ z = \frac{\mathrm{countFG}(x) - \mathrm{totalDocsFG} \cdot \mathrm{probBG}(x)}{\sqrt{\mathrm{totalDocsFG} \cdot \mathrm{probBG}(x) \cdot (1 - \mathrm{probBG}(x))}} $$
{ "type":"keywords", "values":[
  { "value":"hive", "relatedness":0.9773, "popularity":369 },
  { "value":"java", "relatedness":0.9236, "popularity":15653 },
  { "value":".net", "relatedness":0.5294, "popularity":17683 },
  { "value":"bee", "relatedness":0.0, "popularity":0 },
  { "value":"teacher", "relatedness":-0.2380, "popularity":9923 },
  { "value":"registered nurse", "relatedness":-0.3802, "popularity":27089 } ] }
We are essentially boosting terms which are more related to some known feature (and ignoring terms which are equally likely to appear in the background corpus)
Foreground Query: "Hadoop"
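The z-score formula above translates directly into a few lines of Python. The function and parameter names are illustrative, and the sample counts in the usage lines are made up. The relatedness values in the example output fall between -1 and 1, which suggests a normalization step the slide does not show; this sketch returns the raw z-score only.

```python
import math


def z_score(count_fg: int, total_docs_fg: int, prob_bg: float) -> float:
    """Score how over- or under-represented a term is in the foreground.

    count_fg:      foreground documents containing the term, countFG(x)
    total_docs_fg: number of documents in the foreground set, totalDocsFG
    prob_bg:       probability of the term in the background set, probBG(x)
    """
    if total_docs_fg <= 0 or not 0.0 < prob_bg < 1.0:
        raise ValueError("need total_docs_fg > 0 and 0 < prob_bg < 1")
    expected = total_docs_fg * prob_bg
    std_dev = math.sqrt(total_docs_fg * prob_bg * (1.0 - prob_bg))
    return (count_fg - expected) / std_dev


if __name__ == "__main__":
    # Made-up numbers: 1,000 foreground docs, term in 1% of the background.
    print(z_score(100, 1000, 0.01))  # far more common than expected: positive
    print(z_score(10, 1000, 0.01))   # exactly as common as expected: about zero
    print(z_score(2, 1000, 0.01))    # rarer than expected: negative
```

A term that is exactly as frequent in the foreground as in the background scores near zero, which matches the idea of ignoring terms that are equally likely to appear in the background corpus.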
Knowledge Graph – Potential Use Cases
Cross-walk between Types
• Have an ID field, but want to enable free text search on the most associated entity with that ID?
• Have a “state” (geo) search box, but want to accept any free-text location and map it to the right state?
• Have an old classification taxonomy and want to know how the values from the old system now map into the new values?
Build User Profiles from Search Logs
• If someone searches for “Java”, and then “JQuery”, and then “CSS”, and then “JSP”, what do those have in common?
• What if they search for “Java”, and then “C++”, and then “Assembly”?
Discover Relationships Between Anything
• If I want to become a data scientist and know Python, what libraries should I learn?
• If my last job was mid-level software engineer and my current job is Engineering Lead, what are my most likely next roles?
Traverse arbitrarily deep, Sort on anything
• Build an instant co-occurrence matrix, sort the top values by their relatedness, and then add in any number of additional dimensions (RAM permitting).
Data Cleansing
• Have dirty taxonomies and need to figure out which items don’t belong?
• Need to understand the conceptual cohesion of a document (vs spammy or off-topic content)?
2014 - 2015 Publications & Presentations
Books:
● Solr in Action - A comprehensive guide to implementing scalable search using Apache Solr
Research papers:
● Crowdsourced Query Augmentation through Semantic Discovery of Domain-specific Jargon - 2014
● Towards a Job Title Classification System - 2014
● Augmenting Recommendation Systems Using a Model of Semantically-related Terms Extracted from User Behavior - 2014
● sCooL: A system for academic institution name normalization - 2014
● PGMHD: A Scalable Probabilistic Graphical Model for Massive Hierarchical Data Problems - 2014
● SKILL: A System for Skill Identification and Normalization - 2015
● Carotene: A Job Title Classification System for the Online Recruitment Domain - 2015
● WebScalding: A Framework for Big Data Web Services - 2015
● A Pipeline for Extracting and Deduplicating Domain-Specific Knowledge Bases - 2015
● Macau: Large-Scale Skill Sense Disambiguation in the Online Recruitment Domain - 2015
● Improving the Quality of Semantic Relationships Extracted from Massive User Behavioral Data - 2015
● Query Sense Disambiguation Leveraging Large Scale User Behavioral Data - 2015
Speaking Engagements:
● Over a dozen in the last year: Lucene/Solr Revolution 2014, WSDM 2014, Atlanta Solr Meetup, Atlanta Big Data Meetup, Second International Symposium on Big Data and Data Analytics, RecSys 2014, IEEE Big Data Conference 2014 (x2), AAAI/IAAI 2015, IEEE Big Data 2015 (x6), Lucene/Solr Revolution 2015
So What’s Next?
[The Semantic Search Architecture – Query Augmentation diagram is shown again.]
This Piece: How do you construct the best possible queries?
The answer… Learning to Rank (Machine-learned Ranking)
That can be a topic for next time…
Additional References:
Contact Info
Yes, WE ARE HIRING @ . Come talk with me if you are interested…
Trey Grainger [email protected] @treygrainger
http://solrinaction.com
Conference discount (43% off): lusorevcftw
Other presentations: http://www.treygrainger.com