
Self-learned Relevancy with Apache Solr

Date post: 29-Jan-2018
Upload: trey-grainger
Self-learned relevancy with Apache Solr Trey Grainger SVP of Engineering, Lucidworks NYC Lucene/Solr 2017.03.30
Transcript
Page 1: Self-learned Relevancy with Apache Solr

Self-learned relevancy with Apache Solr

Trey Grainger

SVP of Engineering, Lucidworks

NYC Lucene/Solr 2017.03.30

Page 2: Self-learned Relevancy with Apache Solr

Trey Grainger

SVP of Engineering

• Previously Director of Engineering @ CareerBuilder

• MBA, Management of Technology – Georgia Tech

• BA, Computer Science, Business, & Philosophy – Furman University

• Information Retrieval & Web Search - Stanford University

Other fun projects:

• Co-author of Solr in Action, plus numerous research papers

• Frequent conference speaker

• Founder of Celiaccess.com, the gluten-free search engine

• Lucene/Solr contributor

About Me

Page 3: Self-learned Relevancy with Apache Solr

• Apache Solr Overview

• Lucidworks Fusion Overview

• Core Search / Relevancy

- Keyword Search

- Multi-lingual Text Analysis

- Relevancy

• Reflected Intelligence

- Signals (Demo)

- Recommendations (Demo)

- Relevancy Tuning

- Learning to Rank (Demo)

Agenda…

• Semantic Search

- Entity Extraction (Demo)

- Query Parsing (Demo)

- Semantic Knowledge Graph (Demo)

• Streaming Expressions


Page 4: Self-learned Relevancy with Apache Solr

Basic Keyword Search (inverted index, tf-idf, bm25, multilingual text analysis, query formulation, etc.)

Taxonomies / Entity Extraction (entity recognition, ontologies, synonyms, etc.)

Query Intent (query classification, semantic query parsing, concept expansion, rules, clustering, classification)

Relevancy Tuning (signals, AB testing/genetic algorithms, Learning to Rank, Neural Networks)

Self-learning

Data-driven App Sophistication

Page 5: Self-learned Relevancy with Apache Solr

what do you do?

Page 6: Self-learned Relevancy with Apache Solr
Page 7: Self-learned Relevancy with Apache Solr

Search-Driven Everything

Customer Service

Customer Insights

Fraud Surveillance

Research Portal

Online Retail

Digital Content

Page 8: Self-learned Relevancy with Apache Solr

Lucidworks enables Search-Driven Everything

Data Acquisition

Indexing & Streaming

Smart Access API

Recommendations & Alerts

Analytics & Insights

Extreme Relevancy

CUSTOMER SERVICE

RESEARCH PORTAL

DIGITAL CONTENT

CUSTOMER INSIGHTS

FRAUD SURVEILLANCE

ONLINE RETAIL

• Access all your data in a number of ways from one place.

• Secure storage and processing from Solr and Spark.

• Acquire data from any source with pre-built connectors and adapters.

• Machine learning and advanced analytics turn all of your apps into intelligent data-driven applications.

Page 9: Self-learned Relevancy with Apache Solr

Apache Solr

Page 10: Self-learned Relevancy with Apache Solr

“Solr is the popular, blazing-fast, open source enterprise search platform built on Apache Lucene™.”

Page 11: Self-learned Relevancy with Apache Solr

Key Solr Features:

● Multilingual Keyword search

● Relevancy Ranking of results

● Faceting & Analytics (nested / relational)

● Highlighting

● Spelling Correction

● Autocomplete/Type-ahead Prediction

● Sorting, Grouping, Deduplication

● Distributed, Fault-tolerant, Scalable

● Geospatial search

● Complex Function queries

● Recommendations (More Like This)

● Graph Queries and Traversals

● SQL Query Support

● Streaming Aggregations

● Batch and Streaming processing

● Highly Configurable / Plugins

● Learning to Rank

● Building machine-learning models

● … many more

*Source: Solr in Action, chapter 2

Page 12: Self-learned Relevancy with Apache Solr

The standard for enterprise search.

90% of Fortune 500 uses Solr.

Page 13: Self-learned Relevancy with Apache Solr

Lucidworks Fusion

Page 14: Self-learned Relevancy with Apache Solr

DFW Data Science

Page 15: Self-learned Relevancy with Apache Solr
Page 16: Self-learned Relevancy with Apache Solr

All Your Data

Page 17: Self-learned Relevancy with Apache Solr

• Over 50 connectors to integrate all your data

• Robust parsing framework to seamlessly ingest all your document types

• Point-and-click indexing configuration and iterative simulation of results for full control over your ETL process

• Your security model enforced end-to-end from ingest to search across your different datasources

Page 18: Self-learned Relevancy with Apache Solr

Experience Management

Page 19: Self-learned Relevancy with Apache Solr

• Relevancy tuning: Point-and-click query pipeline configuration allows fine-grained control of results.

• Machine-driven relevancy: Signals aggregation learns and automatically tunes relevancy and drives recommendations out of the box.

• Powerful pipeline stages: Customize fields, stages, synonyms, boosts, facets, machine learning models, your own scripted behavior, and dozens of other powerful search stages.

• Turnkey search UI (Lucidworks View): Build a sophisticated end-to-end search application in just hours.

Page 20: Self-learned Relevancy with Apache Solr

Operational Simplicity

Page 21: Self-learned Relevancy with Apache Solr

SECURITY BUILT-IN

[Architecture diagram. Components shown:

Data sources: LOGS, FILE, WEB, DATABASE, CLOUD

Core Services: NLP, Recommenders / Signals, Blob Storage, Pipelines, Scheduling, Alerting / Messaging, Connectors

REST API, Admin UI, Lucidworks View

Apache Solr: Shards

Apache Zookeeper (ZK 1): Leader Election, Load Balancing, Shared Config Management

Apache Spark: Cluster Manager, Workers

HDFS (Optional)]

Page 22: Self-learned Relevancy with Apache Solr

• 75% decrease in development time

• Licensing costs cut by 50%

“With Fusion’s out-of-the-box capabilities, we skipped months in our dev cycle so we could focus our team where they would have the most impact. We cut our licensing costs by 50% and improved application usability. The Lucidworks professional services team amplified our success even further. We’re all Fusion from here on out!”

Lourduraju Pamishetty, Senior IT Application Architect

Page 23: Self-learned Relevancy with Apache Solr

• Seamless integration of your entire search & analytics platform

• All capabilities exposed through secured APIs, so you can use our UI or build your own.

• End-to-end security policies can be applied out of the box to every aspect of your search ecosystem.

• Distributed, fault-tolerant scaling and supervision of your entire search application

Page 24: Self-learned Relevancy with Apache Solr


Page 25: Self-learned Relevancy with Apache Solr

Lucidworks Fusion

Page 26: Self-learned Relevancy with Apache Solr

Fusion powers search for the brightest companies in the world.

Page 27: Self-learned Relevancy with Apache Solr

search & relevancy

Page 28: Self-learned Relevancy with Apache Solr

Basic Keyword Search

The beginning of a typical search journey

Page 29: Self-learned Relevancy with Apache Solr

What you SEND to Lucene/Solr:

Document   Content Field
doc1       once upon a time, in a land far, far away
doc2       the cow jumped over the moon.
doc3       the quick brown fox jumped over the lazy dog.
doc4       the cat in the hat
doc5       The brown cow said “moo” once.
…          …

How the content is INDEXED into Lucene/Solr (conceptually):

Term    Documents
a       doc1 [2x]
brown   doc3 [1x], doc5 [1x]
cat     doc4 [1x]
cow     doc2 [1x], doc5 [1x]
…       …
once    doc1 [1x], doc5 [1x]
over    doc2 [1x], doc3 [1x]
the     doc2 [2x], doc3 [2x], doc4 [2x], doc5 [1x]
…       …

The inverted index
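A minimal sketch of how the term-to-document mapping above could be built, assuming a naive lowercase/letters-only tokenizer in place of a real Lucene analyzer. The names `tokenize` and `build_inverted_index` are illustrative, not Lucene APIs.

```python
import re
from collections import Counter, defaultdict

# The five example documents from the slide.
DOCS = {
    "doc1": "once upon a time, in a land far, far away",
    "doc2": "the cow jumped over the moon.",
    "doc3": "the quick brown fox jumped over the lazy dog.",
    "doc4": "the cat in the hat",
    "doc5": 'The brown cow said "moo" once.',
}


def tokenize(text):
    """Lowercase and keep runs of letters (a stand-in for a real analyzer)."""
    return re.findall(r"[a-z]+", text.lower())


def build_inverted_index(docs):
    """Map each term to {doc_id: number of occurrences in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][doc_id] = count
    return dict(index)


INDEX = build_inverted_index(DOCS)
for term in ("a", "brown", "the"):
    print(term, INDEX[term])
```

Looking up a term is then a single dictionary access, which is why matching stays fast as the number of documents grows.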


Page 30: Self-learned Relevancy with Apache Solr

/solr/select/?q=apache solr

Term     Documents
…        …
apache   doc1, doc3, doc4, doc5
hadoop   doc2, doc4, doc6
…        …
solr     doc1, doc3, doc4, doc7, doc8
…        …

[Venn diagram: doc5 matches only "apache"; doc7 and doc8 match only "solr"; doc1, doc3, and doc4 fall in the overlap, matching "apache solr".]

Matching queries to documents
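The matching step can be sketched as set operations over the postings lists shown on this slide. This is an illustration only; whether a multi-term query is treated as AND or OR in Solr depends on the query parser and its default operator, so both are shown.

```python
# Postings taken from the slide's example index.
POSTINGS = {
    "apache": {"doc1", "doc3", "doc4", "doc5"},
    "hadoop": {"doc2", "doc4", "doc6"},
    "solr": {"doc1", "doc3", "doc4", "doc7", "doc8"},
}


def match_all(terms, postings):
    """AND semantics: documents that contain every term."""
    sets = [postings.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()


def match_any(terms, postings):
    """OR semantics: documents that contain at least one term."""
    return set().union(*(postings.get(t, set()) for t in terms))


both = match_all(["apache", "solr"], POSTINGS)
either = match_any(["apache", "solr"], POSTINGS)
print(sorted(both))
print(sorted(either))
```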


Page 31: Self-learned Relevancy with Apache Solr

Text Analysis

Generating terms to index from raw text

Page 32: Self-learned Relevancy with Apache Solr

Text Analysis in Solr

A text field in Lucene/Solr has an Analyzer containing:

① Zero or more CharFilters: takes incoming text and “cleans it up” before it is tokenized

② One Tokenizer: splits incoming text into a Token Stream containing zero or more Tokens

③ Zero or more TokenFilters: examines and optionally modifies each Token in the Token Stream

*From Solr in Action, Chapter 6
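The three stages can be sketched as plain functions chained together. This is a conceptual stand-in, not Solr's analysis API: `strip_html`, `whitespace_tokenizer`, `lowercase_filter`, `stop_filter`, and the tiny stopword set are all hypothetical names chosen for illustration.

```python
import html
import re


def strip_html(text):
    """CharFilter stand-in: drop tags and decode entities before tokenizing."""
    return html.unescape(re.sub(r"<[^>]+>", " ", text))


def whitespace_tokenizer(text):
    """Tokenizer stand-in: split the cleaned text into a list of tokens."""
    return text.split()


def lowercase_filter(tokens):
    """TokenFilter stand-in: modify each token."""
    return [t.lower() for t in tokens]


def stop_filter(tokens, stopwords=frozenset({"a", "the", "of"})):
    """TokenFilter stand-in: remove tokens found in a stopword set."""
    return [t for t in tokens if t not in stopwords]


def analyze(text, char_filters, tokenizer, token_filters):
    """Run char filters, then exactly one tokenizer, then token filters, in order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


tokens = analyze(
    "<b>The</b> Cat in the Hat",
    [strip_html],
    whitespace_tokenizer,
    [lowercase_filter, stop_filter],
)
print(tokens)
```

Order matters: here the stop filter runs after lowercasing, so "The" is removed as well as "the".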


Page 33: Self-learned Relevancy with Apache Solr


Page 34: Self-learned Relevancy with Apache Solr


Page 35: Self-learned Relevancy with Apache Solr


Page 36: Self-learned Relevancy with Apache Solr

Multi-lingual Text Analysis

Analyzing text across multiple languages

Page 37: Self-learned Relevancy with Apache Solr

Example English Analysis Chains

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="lang/en_protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="lang/en_synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.KStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

Page 38: Self-learned Relevancy with Apache Solr

Per-language Analysis Chains

*Some of the 32 different language configurations in Appendix B of Solr in Action

Page 39: Self-learned Relevancy with Apache Solr


Page 40: Self-learned Relevancy with Apache Solr

Which Stemmer do I choose?

*From Solr in Action, Chapter 14


Page 41: Self-learned Relevancy with Apache Solr

Common English Stemmers

*From Solr in Action, Chapter 14


Page 42: Self-learned Relevancy with Apache Solr

When Stemming goes awry

Fixing Stemming Mistakes:

• Unfortunately, every stemmer will have problem cases that aren’t handled as you would expect

• Thankfully, stemmers can be overridden

• KeywordMarkerFilter: protects a list of terms you specify from being stemmed

• StemmerOverrideFilter: applies a list of custom term mappings you specify

Alternate strategy:

• Use Lemmatization (root-form analysis) instead of Stemming

• Commercial vendors help tremendously in this space

• The Hunspell stemmer enables dictionary-based support of varying quality in over 100 languages


Page 43: Self-learned Relevancy with Apache Solr

Relevancy

Scoring the results, returning the best matches

Page 44: Self-learned Relevancy with Apache Solr

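Classic Lucene Relevancy Algorithm (now switched to BM25):

*Source: Solr in Action, chapter 3

\[ \mathrm{Score}(q, d) = \sum_{t \in q} \Big( \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^2 \cdot t.\mathrm{getBoost}() \cdot \mathrm{norm}(t, d) \Big) \cdot \mathrm{coord}(q, d) \cdot \mathrm{queryNorm}(q) \]

Where: t = term; d = document; q = query; f = field

\[ \mathrm{tf}(t \in d) = \mathrm{numTermOccurrencesInDocument}^{1/2} \]

\[ \mathrm{idf}(t) = 1 + \log\big(\mathrm{numDocs} / (\mathrm{docFreq} + 1)\big) \]

\[ \mathrm{coord}(q, d) = \mathrm{numTermsInDocumentFromQuery} / \mathrm{numTermsInQuery} \]

\[ \mathrm{queryNorm}(q) = 1 / \mathrm{sumOfSquaredWeights}^{1/2} \]

\[ \mathrm{sumOfSquaredWeights} = q.\mathrm{getBoost}()^2 \cdot \sum_{t \in q} \big( \mathrm{idf}(t) \cdot t.\mathrm{getBoost}() \big)^2 \]

\[ \mathrm{norm}(t, d) = d.\mathrm{getBoost}() \cdot \mathrm{lengthNorm}(f) \cdot f.\mathrm{getBoost}() \]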
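A worked sketch of the classic formula, under stated assumptions: every boost is 1.0, and lengthNorm(f) is taken to be 1/sqrt(number of terms in the field), which the slide does not define. The function names and the document-frequency numbers are illustrative; this is not Lucene's implementation.

```python
import math


def tf(term_count):
    """tf(t in d) = sqrt(number of occurrences of t in d)."""
    return math.sqrt(term_count)


def idf(num_docs, doc_freq):
    """idf(t) = 1 + log(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def classic_score(query_terms, doc_terms, doc_freqs, num_docs):
    """Score one document against a query; all boosts assumed to be 1.0."""
    present = [t for t in query_terms if t in doc_terms]
    if not present:
        return 0.0
    idfs = {t: idf(num_docs, doc_freqs.get(t, 0)) for t in query_terms}
    sum_of_squared_weights = sum(idfs[t] ** 2 for t in query_terms)
    query_norm = 1.0 / math.sqrt(sum_of_squared_weights)
    coord = len(present) / len(query_terms)
    length_norm = 1.0 / math.sqrt(len(doc_terms))  # assumed lengthNorm(f)
    total = sum(
        tf(doc_terms.count(t)) * idfs[t] ** 2 * length_norm for t in present
    )
    return total * coord * query_norm


# Tokens of doc3 and doc5 from the inverted-index slide; both terms occur in 2 of 5 docs.
doc3 = "the quick brown fox jumped over the lazy dog".split()
doc5 = "the brown cow said moo once".split()
doc_freqs = {"brown": 2, "cow": 2}
query = ["brown", "cow"]

score3 = classic_score(query, doc3, doc_freqs, num_docs=5)
score5 = classic_score(query, doc5, doc_freqs, num_docs=5)
print(score3, score5)
```

doc5 scores higher than doc3 because it contains both query terms (coord = 1 rather than 0.5) and is shorter (larger lengthNorm).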


Page 45: Self-learned Relevancy with Apache Solr

• Term Frequency: “How well a term describes a document?”

– Measure: how often a term occurs per document

• Inverse Document Frequency: “How important is a term overall?”

– Measure: how rare the term is across all documents

TF * IDF

*Source: Solr in Action, chapter 3


Page 46: Self-learned Relevancy with Apache Solr

News search: popularity and freshness drive relevance

Restaurant search: geographical proximity and price range are critical

Ecommerce: likelihood of a purchase is key

Movie search: more popular titles are generally more relevant

Job search: category of job, salary range, and geographical proximity matter

TF * IDF of keywords can’t hold its own against good domain-specific relevance factors!

That’s great, but what about domain-specific knowledge?

Page 47: Self-learned Relevancy with Apache Solr

what is “reflected intelligence”?

Page 48: Self-learned Relevancy with Apache Solr

The Three C’s

Content: Keywords and other features in your documents

Collaboration: How others have chosen to interact with your system

Context: Available information about your users and their intent

Reflected Intelligence: “Leveraging previous data and interactions to improve how new data and interactions should be interpreted”

Page 49: Self-learned Relevancy with Apache Solr

Feedback Loops

User searches

User sees results

User takes an action

Users’ actions inform system improvements

Page 50: Self-learned Relevancy with Apache Solr

● Recommendation Algorithms

● Building user profiles from past searches, clicks, and other actions

● Identifying correlations between keywords/phrases

● Building out automatically-generated ontologies from content and queries

● Determining relevancy judgements (precision, recall, nDCG, etc.) from click logs

● Learning to Rank - using relevancy judgements and machine learning to train a relevance model

● Discovering misspellings, synonyms, acronyms, and related keywords

● Disambiguation of keyword phrases with multiple meanings

● Learning what’s important in your content

Examples of Reflected Intelligence


Page 51: Self-learned Relevancy with Apache Solr

John lives in Boston but wants to move to New York or possibly another big city. He is currently a sales manager but wants to move towards business development.

Irene is a bartender in Dublin and is only interested in jobs within 10 km of her location in the food service industry.

Irfan is a software engineer in Atlanta and is interested in software engineering jobs at a Big Data company. He is happy to move across the U.S. for the right job.

Jane is a nurse educator in Boston seeking between $40K and $60K

*Example from chapter 16 of Solr in Action

Consider what you know about users


Page 52: Self-learned Relevancy with Apache Solr

http://localhost:8983/solr/jobs/select/?
  fl=jobtitle,city,state,salary&
  q=(
    jobtitle:"nurse educator"^25 OR jobtitle:(nurse educator)^10
  )
  AND (
    (city:"Boston" AND state:"MA")^15
    OR state:"MA"
  )
  AND _val_:"map(salary, 40000, 60000, 10, 0)"

*Example from chapter 16 of Solr in Action

Query for Jane

Jane is a nurse educator in Boston seeking between $40K and $60K
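A sketch of assembling that request programmatically, so the quotes, carets, and spaces in `q` are URL-encoded correctly. The helper name `build_jane_query` is hypothetical; the host, collection, and field names are the ones shown in the query above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_jane_query(base_url="http://localhost:8983/solr/jobs/select"):
    """Build the profile-driven query for Jane as a properly encoded URL."""
    q = (
        '(jobtitle:"nurse educator"^25 OR jobtitle:(nurse educator)^10) '
        'AND ((city:"Boston" AND state:"MA")^15 OR state:"MA") '
        'AND _val_:"map(salary, 40000, 60000, 10, 0)"'
    )
    params = {"fl": "jobtitle,city,state,salary", "q": q}
    return base_url + "?" + urlencode(params)


url = build_jane_query()
decoded = parse_qs(urlsplit(url).query)
print(url)
print(decoded["fl"][0])
```

The boosts encode the profile: an exact title phrase counts most, the city adds to the state match, and the `map` function adds a fixed bonus for salaries inside the requested range.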


Page 53: Self-learned Relevancy with Apache Solr

Search Results for Jane

{ ...
  "response":{"numFound":22,"start":0,"docs":[
    {"jobtitle":"Clinical Educator (New England/ Boston)",
     "city":"Boston",
     "state":"MA",
     "salary":41503},
    …]}}

    {"jobtitle":"Nurse Educator",
     "city":"Braintree",
     "state":"MA",
     "salary":56183},
    {"jobtitle":"Nurse Educator",
     "city":"Brighton",
     "state":"MA",
     "salary":71359}

*Example documents available @ http://github.com/treygrainger/solr-in-action

Page 54: Self-learned Relevancy with Apache Solr

You just built a

recommendation engine!

Page 55: Self-learned Relevancy with Apache Solr

Can also integrate user behavior (ships with Fusion 3.1):

Page 56: Self-learned Relevancy with Apache Solr

Demo:

Signals & Recommendations

Page 57: Self-learned Relevancy with Apache Solr
Page 58: Self-learned Relevancy with Apache Solr

• 200%+ increase in click-through rates

• 91% lower TCO

• Fewer support tickets

• Increased customer satisfaction

Page 59: Self-learned Relevancy with Apache Solr

Relevancy Tuning

Improving ranking algorithms through experiments and models

Page 60: Self-learned Relevancy with Apache Solr

How to Measure Relevancy?

[Venn diagram: A = retrieved documents, C = related documents, B = their overlap]

Precision = B/A

Recall = B/C

Problem: Assume precision is 90% and recall is 100%, but the 10% of retrieved documents that are irrelevant are ranked at the top of the results. Is that OK?
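The problem can be made concrete with a small sketch. The document IDs are hypothetical; the point is that precision and recall come out identical whether the irrelevant result is ranked first or last.

```python
def precision_recall(retrieved, relevant):
    """Precision = |retrieved ∩ relevant| / |retrieved|; recall = same hits / |relevant|."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = retrieved_set & relevant_set
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical data: 9 relevant documents, 10 retrieved, 1 of them irrelevant.
relevant = [f"doc{i}" for i in range(1, 10)]
ranking_bad_first = ["junk"] + relevant   # irrelevant result ranked at the top
ranking_bad_last = relevant + ["junk"]    # irrelevant result ranked at the bottom

p1, r1 = precision_recall(ranking_bad_first, relevant)
p2, r2 = precision_recall(ranking_bad_last, relevant)
print(p1, r1)
print(p2, r2)
```

Both rankings give the same pair of numbers, which is why a position-aware measure is needed.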


Page 61: Self-learned Relevancy with Apache Solr

Normalized Discounted Cumulative Gain

Rank   Relevancy
3      0.95
1      0.70
2      0.60
4      0.45

Rank   Relevancy
1      0.95
2      0.85
3      0.80
4      0.65

Ranking: Ideal vs. Given

• Position is considered in quantifying relevancy.

• Labeled dataset is required.
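A sketch of the computation, assuming the common DCG form rel_i / log2(i + 1) (other variants exist) and taking the ideal ordering to be the same relevancy values sorted in descending order. The input list is the first table on the slide rearranged by rank position.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: each relevancy is discounted by log2(rank + 1)."""
    return sum(
        rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1)
    )


def ndcg(relevances):
    """DCG of the given order divided by DCG of the ideal (descending) order."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Relevancy at ranks 1..4: the 0.95 document was placed third.
given = [0.70, 0.60, 0.95, 0.45]
score = ndcg(given)
print(score)
```

A ranking that already lists documents in descending relevancy scores exactly 1.0; pushing the best document down the list lowers the score.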


Page 62: Self-learned Relevancy with Apache Solr

Learning to Rank

Page 63: Self-learned Relevancy with Apache Solr

Learning to Rank (LTR)

● It applies machine learning techniques to discover the combination of features that provides the best ranking.

● It requires a labeled set of documents with relevancy scores for a given set of queries.

● Features used for ranking are usually more computationally expensive than the ones used for matching.

● It typically re-ranks a subset of the matched documents (e.g. top 1000).
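The re-ranking step in the last bullet can be sketched as follows. This is not Solr's LTR plugin: the linear model, its weights, and the `match_score` / `click_rate` features are hypothetical stand-ins for a trained model and real features.

```python
def rerank(matches, feature_fn, weights, top_n=1000):
    """Re-score only the first top_n matches with a linear model; keep the tail as-is."""
    head, tail = matches[:top_n], matches[top_n:]

    def model_score(doc):
        return sum(w * x for w, x in zip(weights, feature_fn(doc)))

    return sorted(head, key=model_score, reverse=True) + tail


# Hypothetical matches, already ordered by the cheap matching score.
matches = [
    {"id": "a", "match_score": 3.0, "click_rate": 0.01},
    {"id": "b", "match_score": 2.5, "click_rate": 0.30},
    {"id": "c", "match_score": 2.0, "click_rate": 0.05},
    {"id": "d", "match_score": 1.0, "click_rate": 0.90},
]


def features(doc):
    return (doc["match_score"], doc["click_rate"])


reranked = rerank(matches, features, weights=(1.0, 5.0), top_n=3)
print([doc["id"] for doc in reranked])
```

Document "d" would score highest under the model but stays last because it falls outside the re-ranked window, which is the trade-off of limiting the expensive features to the top results.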


Page 64: Self-learned Relevancy with Apache Solr


Page 65: Self-learned Relevancy with Apache Solr

Common LTR Algorithms

• RankNet* (neural network, boosted trees)

• LambdaMART* (set of regression trees)

• SVM Rank** (SVM classifier)

* http://research.microsoft.com/pubs/132652/MSR-TR-2010-82.pdf

** http://research.microsoft.com/en-us/people/hangli/cao-et-al-sigir2006.pdf

Page 66: Self-learned Relevancy with Apache Solr

LambdaMART Example

Source: T. Grainger, K. AlJadda. “Reflected Intelligence: Evolving self-learning data systems”. Georgia Tech, 2016

Page 67: Self-learned Relevancy with Apache Solr

Demo: Learning to Rank

Page 68: Self-learned Relevancy with Apache Solr

Obtaining Relevancy Judgements

Typical Methodologies

1) Hire employees, contractors, or interns

- Pros: Accuracy

- Cons: Expensive; not scalable (cost or manpower-wise); data becomes stale

2) Crowdsource

- Pros: Less cost, more scalable

- Cons: Less accurate; data still becomes stale

Source: T. Grainger, K. AlJadda. “Reflected Intelligence: Evolving self-learning data systems”. Georgia Tech, 2016

Page 69: Self-learned Relevancy with Apache Solr

Reflected Intelligence: Possible to infer relevancy judgements?

Rank   Document ID
1      Doc1
2      Doc2
3      Doc3
4      Doc4

[Diagram: queries linked to Doc1, Doc2, and Doc3 with binary values; one panel reads 0, 1, 1 and the other reads 1, 0, 0.]

Source: T. Grainger, K. AlJadda. “Reflected Intelligence: Evolving self-learning data systems”. Georgia Tech, 2016

Page 70: Self-learned Relevancy with Apache Solr

Automated Relevancy Benchmarking

[Line chart: daily relevancy score per algorithm from 10/1/16 to 10/10/16. Series: DefaultAlgorithm, Algorithm1, Algorithm2, Algorithm3, Algorithm4, Algorithm5. Plotted values range from roughly 0.30 to 0.81.]

Page 71: Self-learned Relevancy with Apache Solr

The Relevancy Spectrum

[Diagram labels: Traditional Keyword Search, Recommendations, Semantic Search, User Intent, Personalized Search, Augmented Search, Domain-aware Matching]

Page 72: Self-learned Relevancy with Apache Solr

semantic search

Page 73: Self-learned Relevancy with Apache Solr


Page 74: Self-learned Relevancy with Apache Solr

Building a Taxonomy of Entities

Many ways to generate this:

• Topic Modelling

• Clustering of documents

• Statistical Analysis of interesting phrases

- Word2Vec / GloVe / Dice Conceptual Search

• Buy a dictionary (often doesn’t work for domain-specific search problems)

• Generate a model of domain-specific phrases by mining query logs for commonly searched phrases within the domain*

* K. Aljadda, M. Korayem, T. Grainger, C. Russell. "Crowdsourced Query Augmentation through Semantic Discovery of Domain-specific Jargon," in IEEE Big Data 2014.

Page 75: Self-learned Relevancy with Apache Solr


Page 76: Self-learned Relevancy with Apache Solr


Page 77: Self-learned Relevancy with Apache Solr

entity extraction

Page 78: Self-learned Relevancy with Apache Solr


Page 79: Self-learned Relevancy with Apache Solr

Demo: Solr Text Tagger

Page 80: Self-learned Relevancy with Apache Solr

semantic query parsing

Page 81: Self-learned Relevancy with Apache Solr


Page 82: Self-learned Relevancy with Apache Solr

Probabilistic Query Parser

Goal: given a query, predict which combinations of keywords should be combined together as phrases

Example:

senior java developer hadoop

Possible Parsings:

senior, java, developer, hadoop

"senior java", developer, hadoop

"senior java developer", hadoop

"senior java developer hadoop"

"senior java", "developer hadoop"

senior, "java developer", hadoop

senior, java, "developer hadoop"

Source: Trey Grainger, “Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disambiguation”, Bay Area Search Meetup, November 2015.
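The candidate parsings can be enumerated mechanically: each gap between adjacent words is either a phrase boundary or not, which gives 2^(n-1) groupings for n words. The sketch below only generates candidates; scoring them with corpus statistics, as the probabilistic parser does, is not shown. The function name `segmentations` is illustrative.

```python
from itertools import product


def segmentations(query):
    """Yield every way to group adjacent words of the query into phrases."""
    words = query.split()
    if not words:
        return
    # One True/False decision per gap between neighbouring words.
    for breaks in product([True, False], repeat=len(words) - 1):
        phrases, current = [], [words[0]]
        for word, is_break in zip(words[1:], breaks):
            if is_break:
                phrases.append(" ".join(current))
                current = [word]
            else:
                current.append(word)
        phrases.append(" ".join(current))
        yield phrases


parsings = list(segmentations("senior java developer hadoop"))
for parsing in parsings:
    print(parsing)
```

For the four-word example this yields eight groupings: the seven listed above plus senior, "java developer hadoop".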


Page 83: Self-learned Relevancy with Apache Solr

Demo: Probabilistic Query Parser

Page 84: Self-learned Relevancy with Apache Solr

Semantic Query Parsing

Identification of phrases in queries using two steps:

1) Check a dictionary of known terms that is continuously built, cleaned, and refined based upon common inputs from interactions with real users of the system. The SolrTextTagger works well for this.*

2) Also invoke a probabilistic query parser to dynamically identify unknown phrases using statistics from a corpus of data (language model)

*K. Aljadda, M. Korayem, T. Grainger, C. Russell. "Crowdsourced Query Augmentation through Semantic Discovery of Domain-specific Jargon," in IEEE Big Data 2014.

Page 85: Self-learned Relevancy with Apache Solr

query augmentation

Page 86: Self-learned Relevancy with Apache Solr


Page 87: Self-learned Relevancy with Apache Solr

Knowledge Graph

Semantic Data Encoded into Free Text Content

NodeType: CharacterSequence: e, en, eng, engi, engineer, engineers

NodeType: Term: engineer, engineers, engineering, software

NodeType: TermSequence: software engineer, software engineers, electrical engineering

NodeType: Document:

id: 1
text: looking for a software engineer with degree in computer science or electrical engineering

id: 2
text: apply to be a software engineer and work with other great software engineers

id: 3
text: start a great career in electrical engineering

Page 88: Self-learned Relevancy with Apache Solr

Documents

id: 1
job_title: Software Engineer
desc: software engineer at a great company
skills: .Net, C#, java

id: 2
job_title: Registered Nurse
desc: a registered nurse at hospital doing hard work
skills: oncology, phlebotomy

id: 3
job_title: Java Developer
desc: a software engineer or a java engineer doing work
skills: java, scala, hibernate

Terms-Docs Inverted Index

field       term             doc   pos
desc        a                1     4
                             2     1
                             3     1, 5
            at               1     3
                             2     4
            company          1     6
            doing            2     6
                             3     8
            engineer         1     2
                             3     3, 7
            great            1     5
            hard             2     7
            hospital         2     5
            java             3     6
            nurse            2     3
            or               3     4
            registered       2     2
            software         1     1
                             3     2
            work             2     8
                             3     9
job_title   java developer   3     1
…           …                …     …

Docs-Terms Forward Index

field       doc   terms
desc        1     a, at, company, engineer, great, software
            2     a, at, doing, hard, hospital, nurse, registered, work
            3     a, doing, engineer, java, or, software, work
job_title   1     Software Engineer
…           …     …

Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith. “The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.

Knowledge Graph
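A sketch of building both structures for the desc field and then hopping term → documents → terms, which is the basic traversal step the graph relies on. Whitespace tokenization and the function names are assumptions made for illustration; relatedness scoring is not included.

```python
from collections import defaultdict

# The desc field of the three example documents.
DESCS = {
    1: "software engineer at a great company",
    2: "a registered nurse at hospital doing hard work",
    3: "a software engineer or a java engineer doing work",
}


def build_indexes(docs):
    """Return (inverted, forward): term -> doc -> positions, and doc -> sorted terms."""
    inverted = defaultdict(lambda: defaultdict(list))
    forward = {}
    for doc_id, text in docs.items():
        terms = text.split()
        for pos, term in enumerate(terms, start=1):  # positions are 1-based
            inverted[term][doc_id].append(pos)
        forward[doc_id] = sorted(set(terms))
    return inverted, forward


def related_terms(term, inverted, forward):
    """Inverted lookup (term -> docs) followed by forward lookup (docs -> terms)."""
    related = set()
    for doc_id in inverted.get(term, {}):
        related.update(forward[doc_id])
    related.discard(term)
    return related


inverted, forward = build_indexes(DESCS)
print(dict(inverted["engineer"]))
print(forward[1])
print(sorted(related_terms("nurse", inverted, forward)))
```

No explicit edges are stored: the relationship between two terms is materialized on demand from the documents they share.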


Page 89: Self-learned Relevancy with Apache Solr

Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith. “The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.

Knowledge Graph

How the Graph Traversal Works

[Diagram with three views of the same data. Set-theory View and Graph View: skill nodes (Java, Scala, Hibernate, Oncology) connected to doc 1 through doc 6. Data Structure View lists Java (docs 1, 2, 6), Scala and Hibernate (docs 3, 4), and Oncology (doc 5).]

Page 90: Self-learned Relevancy with Apache Solr

Knowledge Graph

Graph Model

Structure:

Single-level Traversal / Scoring:

Multi-level Traversal / Scoring:

Page 91: Self-learned Relevancy with Apache Solr

Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith. “The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.

Knowledge Graph

Multi-level Traversal

[Diagram with two views. Data Structure View: skill nodes (Java, Scala, Hibernate, Oncology), doc 1 through doc 6, and job_title nodes (Software Engineer, Data Scientist, Java Developer, …), with arrows labeled Inverted Index Lookup and Forward Index Lookup. Graph View: Java, Hibernate, and Scala connected to Java Developer, Software Engineer, and Data Scientist by has_related_job_title edges.]

Page 92: Self-learned Relevancy with Apache Solr

Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith.“The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.

Knowledge Graph

Scoring nodes in the Graph

Foreground vs. Background Analysis

Every term is scored against its context. The more commonly the term appears within its foreground context versus its background context, the more relevant it is to the specified foreground context.

\[ z = \frac{\mathrm{countFG}(x) - \mathrm{totalDocsFG} \cdot \mathrm{probBG}(x)}{\sqrt{\mathrm{totalDocsFG} \cdot \mathrm{probBG}(x) \cdot (1 - \mathrm{probBG}(x))}} \]

{ "type":"keywords", "values":[
  { "value":"hive", "relatedness": 0.9765, "popularity":369 },
  { "value":"spark", "relatedness": 0.9634, "popularity":15653 },
  { "value":".net", "relatedness": 0.5417, "popularity":17683 },
  { "value":"bogus_word", "relatedness": 0.0, "popularity":0 },
  { "value":"teaching", "relatedness": -0.1510, "popularity":9923 },
  { "value":"CPR", "relatedness": -0.4012, "popularity":27089 } ] }

Foreground Query: "Hadoop"
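The z formula can be evaluated directly. The counts below are hypothetical, and the sketch returns the raw z value only: the relatedness figures in the JSON above lie between -1 and 1, so they appear to be a normalized form whose normalization is not shown on the slide and is not reproduced here.

```python
import math


def z_score(count_fg, total_docs_fg, prob_bg):
    """How far a term's foreground count sits from what the background rate predicts."""
    expected = total_docs_fg * prob_bg
    variance = total_docs_fg * prob_bg * (1.0 - prob_bg)
    if variance == 0:
        return 0.0  # term never (or always) occurs in the background
    return (count_fg - expected) / math.sqrt(variance)


# Hypothetical: 1000 documents match the foreground query, and each term
# occurs in 1% of all documents in the background.
over_represented = z_score(count_fg=300, total_docs_fg=1000, prob_bg=0.01)
as_expected = z_score(count_fg=10, total_docs_fg=1000, prob_bg=0.01)
under_represented = z_score(count_fg=2, total_docs_fg=1000, prob_bg=0.01)
print(over_represented, as_expected, under_represented)
```

A positive value means the term is more common in the foreground than chance predicts, zero means no signal, and a negative value means the term is less likely to occur with the foreground query.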


Page 93: Self-learned Relevancy with Apache Solr

Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith. “The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.

Knowledge Graph

Multi-level Graph Traversal with Scores

[Diagram: starting node "software engineer" (*materialized node); skill nodes Java, C#, .NET, Hibernate, Scala, VB.NET; job title nodes .NET Developer, Java Developer, Software Engineer, Data Scientist. Edge types: has_related_skill and has_related_job_title. Edge scores range from 0.34 to 0.93.]

Page 94: Self-learned Relevancy with Apache Solr

Knowledge Graph

Use Case: Document Summarization

Experiment: Pass in raw text (extracting phrases as needed), and rank their similarity to the documents using the SKG.

Additionally, can traverse the graph to “related” entities/keyword phrases NOT found in the original document

Applications: Content-based and multi-modal recommendations (no cold-start problem), data cleansing prior to clustering or other ML methods, semantic search / similarity scoring

Page 95: Self-learned Relevancy with Apache Solr

Demo: Semantic Knowledge Graph

Page 96: Self-learned Relevancy with Apache Solr

Knowledge Graph


Page 97: Self-learned Relevancy with Apache Solr

Knowledge Graph


Page 98: Self-learned Relevancy with Apache Solr


Page 99: Self-learned Relevancy with Apache Solr

streaming expressions

Page 100: Self-learned Relevancy with Apache Solr

• Perform relational operations on streams

• Stream sources: search, jdbc, facets, features, gatherNodes, shortestPath, train, model, random, stats, topic

• Stream decorators: classify, commit, complement, daemon, executor, fetch, having, leftOuterJoin, hashJoin, innerJoin, intersect, merge, null, outerHashJoin, parallel, priority, reduce, rollup, scoreNodes, select, sort, top, unique, update

Streaming Expressions

Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
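A sketch of how a decorator wraps a source, and how the resulting expression could be sent to a collection's /stream handler via the `expr` parameter. The `jobs` collection, its field names, and the helper `stream_url` are hypothetical; the example also assumes the listed fields are configured so that the /export handler can sort and return them. No request is actually sent here.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def stream_url(collection, expr, host="http://localhost:8983"):
    """Build the URL that submits a streaming expression to a collection."""
    return f"{host}/solr/{collection}/stream?" + urlencode({"expr": expr})


# The `top` decorator wraps a `search` source and keeps the 5 highest salaries.
EXPR = (
    "top(n=5, "
    'search(jobs, q="jobtitle:nurse", fl="id,jobtitle,salary", '
    'sort="salary desc", qt="/export"), '
    'sort="salary desc")'
)

url = stream_url("jobs", EXPR)
round_trip = parse_qs(urlsplit(url).query)["expr"][0]
print(url)
```

Decorators nest, so the same pattern extends to joins or rollups by wrapping one or more inner streams in another function call.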


Page 101: Self-learned Relevancy with Apache Solr

Streaming Expressions - Examples

Shortest-path Graph Traversal

Parallel Batch Processing

Train a Logistic Regression Model

Distributed Joins

Rapid Export of all Search Results

Pull Results from External Database

Classifying Search Results

Sources:

https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions

http://joelsolr.blogspot.com/2016/10/solr-63-batch-jobs-parallel-etl-and.html

Page 102: Self-learned Relevancy with Apache Solr

Additional References:

Southern Data Science

Page 103: Self-learned Relevancy with Apache Solr

Contact Info

Trey Grainger

@treygrainger

http://solrinaction.com

Meetup discount (39% off): 39grainger

Other presentations: http://www.treygrainger.com

