Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by Simon Hughes, Dice.com

Date post: 07-Jan-2017
Upload: lucidworks
OCTOBER 13-16, 2016 AUSTIN, TX
Transcript

Implementing Conceptual Search in Solr

Simon Hughes, Chief Data Scientist, Dice.com

Who Am I?

•  Chief Data Scientist at Dice.com, under Yuri Bykov

•  Key Projects involving Solr:

    •  Recommender Systems – more jobs like this, more seekers like this (uses custom Solr index)

    •  Custom Dice Solr MLT handler (real-time recommendations)

    •  “Did you mean” functionality

    •  Title, skills and company type-ahead

    •  Relevancy improvements in Dice jobs search

Other Projects

•  Supply Demand Analysis

•  Dice Skills pages – http://www.dice.com/skills

PhD

•  PhD candidate at DePaul University, studying natural language processing and machine learning

Q. What is the Most Common Relevancy Tuning Mistake?

A. Ignoring the importance of RECALL

Relevancy Tuning

•  Key performance metrics to measure:

    •  Precision

    •  Recall

    •  F1 Measure = 2*(P*R)/(P+R)

•  Precision is easier – correct mistakes in the top search results

•  Recall - need to know which relevant documents don’t come back

•  Hard to accurately measure

•  Need to know all the relevant documents present in the index
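
The three metrics above can be sketched for a single query as follows. This is a minimal illustration, not code from the talk: the function name and the document ids are hypothetical, and it assumes you already hold a judged list of every relevant document in the index, which is exactly the part the slide calls hard to obtain.

```python
def precision_recall_f1(retrieved, relevant):
    """Compute precision, recall and F1 for one query.

    retrieved: document ids returned by the search engine
    relevant:  ALL document ids judged relevant in the index
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that came back
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical judgments: 4 documents returned, 3 of them relevant,
# while 6 relevant documents exist in the index.
p, r, f1 = precision_recall_f1(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d2", "d3", "d7", "d8", "d9"],
)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Precision only needs judgments on what was returned, whereas recall needs the full `relevant` set, which is why it is the metric most often left unmeasured.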

What is Conceptual Search?

•  A.K.A. Semantic Search

•  Two key challenges with keyword matching:

    •  Polysemy: Words have more than one meaning, e.g. engineer – mechanical? programmer? automation engineer?

    •  Synonymy: Many different words have the same meaning, e.g. QA, quality assurance, tester; VB, Visual Basic, VB.Net

•  Other related challenges: typos, spelling errors, idioms

•  Conceptual search attempts to solve these problems by learning concepts

Why Conceptual Search?

•  We will attempt to improve recall without diminishing precision

•  Can match relevant documents containing none of the query terms

Concepts

•  Conceptual search allows us to retrieve documents by how similar the concepts in the query are to the concepts in a document

•  Concepts represent important high-level ideas in a given domain (e.g. java technologies, big data jobs, helpdesk support, etc.)

•  Concepts are automatically learned from documents using machine learning

•  Words can belong to multiple concepts, with varying strengths of association with each concept

Traditional Techniques

•  Many algorithms have been used for concept learning, including LSA (Latent Semantic Analysis), LDA (Latent Dirichlet Allocation) and Word2Vec

•  All involve mapping a document to a low-dimensional dense vector (an array of numbers)

•  Each element of the vector is a number representing how well the document represents a given concept

•  E.g. LSA powers the similar skills found in Dice’s skills pages

Traditional Techniques Don’t Scale

•  LSA/LSI, LDA and related techniques rely on factorization of very large term-document matrices – very slow and computationally intensive

•  They require embedding a machine learning model within the search engine to map new queries to the concept space (latent or topic space)

•  Query performance is very poor – the inverted index cannot be utilized because every document has a value for every concept

•  What we want is a way to map words, not documents, to concepts. Then we can embed this in Solr via synonym filters and custom query parsers

Word2Vec and ‘Word Math’

•  Word2Vec was developed by Google around 2013 for learning vector representations for words, building on earlier work from Rumelhart, Hinton and Williams in 1986 (see paper below for citation of this work)

•  Word2Vec Paper: Efficient Estimation of Word Representations in Vector Space

•  It works by training a machine learning model to predict the words surrounding a word in a sentence

•  Similar words get similar vector representations
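
To make "predict the words surrounding a word" concrete, the sketch below builds the (target, context) training pairs used by the skip-gram variant of Word2Vec. It only shows how the training examples are formed, not the model itself; the function name, the window size and the example sentence are illustrative assumptions.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every word within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


# Hypothetical job-posting fragment.
sentence = "senior java developer with spring experience".split()
for target, context in skipgram_pairs(sentence, window=1):
    print(target, "->", context)
```

Words that occur in similar contexts produce similar training pairs, which is why they end up with similar vectors.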

“Word Math” Example

•  Using basic vector arithmetic, you get some interesting patterns

•  This illustrates how it represents relationships between words

•  E.g. king – man + woman ≈ queen
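
A toy version of this arithmetic is sketched below. The vectors are hand-made three-dimensional values chosen purely for illustration (roughly "royalty", "male", "female"); they are not learned embeddings, and the helper names are assumptions. With a trained model the result is only the nearest word to the computed vector, not an exact match.

```python
import math

# Hand-made toy vectors (royalty, male, female); NOT learned by Word2Vec.
toy = {
    "king":   [0.9, 0.9, 0.1],
    "queen":  [0.9, 0.1, 0.9],
    "man":    [0.1, 0.9, 0.1],
    "woman":  [0.1, 0.1, 0.9],
    "prince": [0.8, 0.9, 0.1],
    "apple":  [0.0, 0.1, 0.1],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


print(analogy("king", "man", "woman", toy))
```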

The algorithm learns to represent different types of relationships between words in vector form

At this point you may be thinking…

Why Do I Care? This is a Search Conference…

•  This algorithm can be used to represent documents as vectors of concepts

•  We can then use these representations to do conceptual search

•  This will surface many relevant documents missed by keyword matching

•  This boosts recall

•  This technique can also be used to automatically learn synonyms

A Quick Demo

Using our Dice.com active jobs index, some examples of common user queries:

•  Data Scientist

•  Big Data

•  Information Retrieval

•  C#

•  Web Developer

•  CTO

•  Project Manager

How?

GitHub - DiceTechJobs/ConceptualSearch:

1.  Pre-process documents – parse HTML, strip noise characters, tokenize words

2.  Define important keywords for your domain, or use my code to auto-extract top terms and phrases

3.  Train a Word2Vec model on the documents

4.  Using this model, either:

    1.  Vectors: Use the raw vectors and embed them in Solr using synonyms + payloads

    2.  Top N Similar: Or extract the top n similar terms with similarities and embed these as weighted synonyms using my custom queryboost parser and tokenizer

    3.  Clusters: Cluster these vectors by similarity, and map terms to clusters in a synonym file
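
Step 1 of the list above could look roughly like the sketch below. It is a standard-library approximation, not the code from the ConceptualSearch repository: the regular expressions, the set of characters kept (so that terms such as `c#` and `.net` survive), and the function name are all assumptions, and a real HTML parser would be more robust than a tag regex.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")            # crude HTML tag matcher
NOISE_RE = re.compile(r"[^a-z0-9#+.\s]")   # keep characters common in tech terms


def preprocess(raw_html):
    """Strip tags, decode entities, lowercase, drop noise characters, tokenize."""
    text = TAG_RE.sub(" ", raw_html)
    text = html.unescape(text).lower()
    text = NOISE_RE.sub(" ", text)
    # Split on whitespace and trim sentence-final periods from each token.
    tokens = [t.rstrip(".") for t in text.split()]
    return [t for t in tokens if t]


# Hypothetical job-posting snippet.
print(preprocess("<p>Senior C# &amp; Java developer, .NET a plus!</p>"))
```

The resulting token lists are the kind of input a Word2Vec trainer expects in step 3, one list per document or sentence.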

Define Top Domain Specific Keywords

•  If you have a set of documents belonging to a specific domain, it is important to define the main keywords for your domain:

    •  Use top few thousand search keywords

    •  Or use my fast keyword and phrase extraction tool (in GitHub)

    •  Or use Solr/Lucene shingle filter to extract top 1-4 word sequences by document frequency

•  Important to map common phrases to single tokens, e.g. data scientist => data_scientist, java developer => java_developer
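
The two ideas on this slide, ranking word sequences by document frequency and collapsing known phrases into single tokens, can be sketched in plain Python as below. This stands in for the Solr/Lucene shingle filter and the author's extraction tool rather than reproducing them; the function names and the greedy longest-match strategy are assumptions.

```python
from collections import Counter


def top_shingles(docs, max_len=4, top_k=10):
    """Rank 1..max_len word sequences by the number of documents containing them."""
    df = Counter()
    for tokens in docs:
        seen = set()
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                seen.add(" ".join(tokens[i:i + n]))
        df.update(seen)  # each shingle counts once per document
    return df.most_common(top_k)


def merge_phrases(tokens, phrases, max_len=4):
    """Greedy longest-match replacement of known phrases with single tokens."""
    out, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in phrases:
                out.append(candidate.replace(" ", "_"))
                i += n
                break
        else:  # no multi-word phrase starts here; keep the single token
            out.append(tokens[i])
            i += 1
    return out


phrases = {"data scientist", "java developer"}
print(merge_phrases("senior data scientist and java developer".split(), phrases))
```

Merging phrases before training means the model learns one vector for `data_scientist` instead of separate vectors for `data` and `scientist`.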

Do It Yourself

•  All code for this talk is now publicly available on GitHub:

    •  https://github.com/DiceTechJobs/SolrPlugins - Solr plugins to work with conceptual search, and other Dice plugins, such as a custom MLT handler

    •  https://github.com/DiceTechJobs/SolrConfigExamples - Examples of Solr configuration file entries to enable conceptual search and other Dice plugins

    •  https://github.com/DiceTechJobs/ConceptualSearch - Python code to compute the Word2Vec word vectors, and generate Solr synonym files

Some Solr Tricks to Make this Happen

1.  Keyword Extraction: Use the synonym filter to extract key words from your documents

2.  Synonym Expansion using Payloads:

•  Use the synonym filter to expand a keyword to multiple tokens

•  Each token has an associated payload – used to adjust relevancy scores at index or query time

Synonym File Examples – Vector Method

•  Each keyword maps to a set of tokens via a synonym file

•  Vector synonym file entry (5 element vector, usually 100+ elements):

    •  java developer=>001|0.4 002|0.1 003|0.5 005|.9

•  Uses a custom token filter that averages these vectors over the entire document (see GitHub - DiceTechJobs/SolrPlugins)

•  Relatively fast at index time but some additional indexing overhead

•  Very slow to query
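
The entry format shown above can be generated and read back with a few lines of Python. This is a sketch only: the zero-padded dimension ids are inferred from the slide's example, the function names are hypothetical, and the averaging function merely illustrates what the custom token filter is described as doing; the real filter lives in the SolrPlugins repository and may differ.

```python
def vector_synonym_entry(term, vector):
    """Format a word vector as `term=>001|v1 002|v2 ...` (dimension|payload tokens)."""
    parts = [f"{i:03d}|{v:g}" for i, v in enumerate(vector, start=1)]
    return f"{term}=>{' '.join(parts)}"


def parse_entry(line):
    """Parse a synonym line back into (term, {dimension: weight})."""
    term, rhs = line.split("=>")
    vec = {}
    for token in rhs.split():
        dim, payload = token.split("|")
        vec[dim] = float(payload)
    return term, vec


def average_vectors(vectors):
    """Average {dimension: weight} vectors, e.g. over all keywords in a document."""
    totals = {}
    for vec in vectors:
        for dim, weight in vec.items():
            totals[dim] = totals.get(dim, 0.0) + weight
    n = len(vectors)
    return {dim: total / n for dim, total in totals.items()} if n else {}


print(vector_synonym_entry("java developer", [0.4, 0.1, 0.5]))
term, vec = parse_entry("java developer=>001|0.4 002|0.1 003|0.5 005|.9")
print(term, vec)
```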

Synonym File Examples – Top N Method

•  Each keyword maps to a set of most similar keywords via a synonym file

•  Top N Synonym file entry (top 5):

    •  java_developer=>java_j2ee_developer|0.907526 java_architect|0.889903 lead_java_developer|0.867594 j2ee_developer|0.864028 java_engineer|0.861407

•  Can configure this at index time with payloads, a payload-aware query parser and a payload similarity function

•  Or you can configure this at query time with a special token filter that converts payloads into term boosts, along with a special parser (see GitHub - DiceTechJobs/SolrPlugins)

•  Fast at index and query time if N is reasonable (10-30)
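
A Top N entry of this shape can be produced from any set of word vectors by ranking the other terms by cosine similarity. The sketch below uses small hand-made vectors rather than a trained model, and the function names are assumptions; cosine similarity is the usual measure for Word2Vec vectors, but the slides do not state which measure produced the numbers above.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_n_entry(term, vectors, n=5):
    """Build `term=>similar_1|score similar_2|score ...` for the n nearest terms."""
    sims = [(other, cosine(vectors[term], vec))
            for other, vec in vectors.items() if other != term]
    sims.sort(key=lambda pair: pair[1], reverse=True)
    rhs = " ".join(f"{other}|{sim:.6f}" for other, sim in sims[:n])
    return f"{term}=>{rhs}"


# Hand-made toy vectors; a real run would use the trained Word2Vec vectors.
toy_vectors = {
    "java_developer":  [0.9, 0.8, 0.1],
    "java_engineer":   [0.7, 0.9, 0.3],
    "j2ee_developer":  [0.9, 0.6, 0.1],
    "project_manager": [0.1, 0.2, 0.9],
}
entry = top_n_entry("java_developer", toy_vectors, n=2)
print(entry)
```

Keeping N small bounds the number of extra terms each keyword adds to the query, which is consistent with the slide's note that performance stays good for N around 10 to 30.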

Searching over Clustered Terms

•  After we have learned word vectors, we can use a clustering algorithm to cluster terms by their vectors to give clusters of related words

•  Can learn several different sizes of cluster, such as 500, 1000, 5000 clusters, and map each of these to a separate field

•  Apply stronger boosts to the fields containing smaller clusters (e.g. the 5000 cluster field) using the edismax qf parameter - tighter clusters get more weight

•  Code for clustering vectors in GitHub - DiceTechJobs/ConceptualSearch
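
As a rough illustration of the clustering step, the sketch below runs plain k-means over a handful of toy vectors and then builds an edismax `qf` string that boosts the finer-grained cluster fields more. The slides do not say which clustering algorithm the repository uses, so k-means is an assumption here, as are the toy vectors, the field names (`clusters_500` and so on) and the boost values.

```python
import random


def kmeans(vectors, k, iterations=20, seed=0):
    """Plain k-means over {term: vector}; returns {term: cluster_index}."""
    rng = random.Random(seed)
    terms = sorted(vectors)
    centroids = [list(vectors[t]) for t in rng.sample(terms, k)]
    assignment = {}
    for _ in range(iterations):
        # Assign each term to its nearest centroid (squared Euclidean distance).
        for t in terms:
            dists = [sum((x - c) ** 2 for x, c in zip(vectors[t], centroid))
                     for centroid in centroids]
            assignment[t] = dists.index(min(dists))
        # Move each centroid to the mean of its current members.
        for idx in range(k):
            members = [vectors[t] for t in terms if assignment[t] == idx]
            if members:
                centroids[idx] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def qf_param(field_boosts):
    """Build an edismax qf value such as 'field_a^1 field_b^2'."""
    return " ".join(f"{field}^{boost}" for field, boost in field_boosts.items())


# Hand-made 2-d toy vectors; real vectors come from the trained model.
toy = {
    "java": [0.9, 0.1], "java coding": [0.8, 0.2], "java design": [0.85, 0.15],
    "solr": [0.1, 0.9], "lucene": [0.2, 0.8], "elasticsearch": [0.15, 0.85],
}
clusters = kmeans(toy, k=2)
print(clusters)
# Hypothetical field names: one field per clustering granularity.
print(qf_param({"clusters_500": 1, "clusters_1000": 2, "clusters_5000": 3}))
```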

Synonym File Examples – Clustering Method

•  Each keyword in a cluster maps to the same artificial token for that cluster

•  Cluster Synonym file entries:

•  java=>cluster_171

•  java applications=>cluster_171

•  java coding=>cluster_171

•  java design=>cluster_171

•  Doesn’t use payloads so does not require any special plugins

•  No noticeable impact on query or indexing performance
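
Given a term-to-cluster assignment, the synonym file entries shown above are straightforward to generate. A minimal sketch, assuming the assignment is held in a dictionary; the function name is hypothetical and the cluster id simply reuses the example from the slide.

```python
def cluster_synonym_lines(assignment):
    """Turn {term: cluster_id} into lines such as 'java=>cluster_171'."""
    return [f"{term}=>cluster_{cluster_id}"
            for term, cluster_id in sorted(assignment.items())]


# Hypothetical assignment reusing the cluster id from the slide.
assignment = {
    "java": 171,
    "java applications": 171,
    "java coding": 171,
    "java design": 171,
}
for line in cluster_synonym_lines(assignment):
    print(line)
```

Because every term in a cluster is rewritten to the same token, a query for one term matches documents containing any other term in that cluster through ordinary inverted-index lookup.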

Example Clusters Learned from Dice Job Postings

•  Note: Cluster labels (the text before each colon) are manually assigned for interpretability:

•  Natural Languages: bi lingual, bilingual, chinese, fluent, french, german, japanese, korean, lingual, localized, portuguese, russian, spanish, speak, speaker

•  Apple Programming Languages: cocoa, swift

•  Search Engine Technologies: apache solr, elasticsearch, lucene, lucene solr, search, search engines, search technologies, solr, solr lucene

•  Microsoft .Net Technologies: c# wcf, microsoft c#, microsoft.net, mvc web, wcf web services, web forms, webforms, windows forms, winforms, wpf wcf

•  Attention/Attitude: attention, attentive, close attention, compromising, conscientious, conscious, customer oriented, customer service focus, customer service oriented, deliver results, delivering results, demonstrated commitment, dependability, dependable, detailed oriented, diligence, diligent, do attitude, ethic, excellent follow, extremely detail oriented, good attention, meticulous, meticulous attention, organized, orientated, outgoing, outstanding customer service, pay attention, personality, pleasant, positive attitude, professional appearance, professional attitude, professional demeanor, punctual, punctuality, self motivated, self motivation, superb, superior, thoroughness

Summary

•  It’s easy to overlook recall when performing relevancy tuning

•  Conceptual search improves recall while maintaining high precision by matching documents on concepts or ideas

•  In reality this involves learning which terms are related to one another

•  Word2Vec is a scalable algorithm for learning related words from a set of documents, which gives state-of-the-art results in word analogy tasks

•  We can train a Word2Vec model offline, and embed its output into Solr by using the built-in synonym filter and payload functionality, combined with some custom plugins

