Enhancing relevancy through personalization & semantic search

Date post: 15-Jan-2015
Upload: trey-grainger
Description:
Matching keywords is just step one in the effort to maximize the relevancy of your search platform. In this talk, you'll learn how to implement advanced relevancy techniques which enable your search platform to "learn" from your content and users' behavior. Topics will include automatic synonym discovery, latent semantic indexing, payload scoring, document-to-document searching, foreground vs. background corpus analysis for interesting term extraction, collaborative filtering, and mining user behavior to drive geographically and conceptually personalized search results. You'll learn how CareerBuilder has enhanced Solr (also utilizing Hadoop) to dynamically discover relationships between data and behavior, and how you can implement similar techniques to greatly enhance the relevancy of your search platform.
ENHANCING RELEVANCY THROUGH PERSONALIZATION & SEMANTIC SEARCH Trey Grainger Search Technology Development Manager Dublin, IE 2013.11.07 @
Transcript
Page 1: Enhancing relevancy through personalization & semantic search

ENHANCING RELEVANCY THROUGH PERSONALIZATION & SEMANTIC SEARCH Trey Grainger

Search Technology Development Manager

Dublin, IE 2013.11.07


Page 2: Enhancing relevancy through personalization & semantic search

My Background

Trey Grainger
Search Technology Development Manager @CareerBuilder.com

Relevant Background

•  Search & Recommendations
•  High-volume, Distributed Systems
•  NLP, Relevancy Tuning, User Group Testing, & Machine Learning

Other Projects

•  Co-author: Solr in Action
•  Founder and Chief Engineer @ .com

Page 3: Enhancing relevancy through personalization & semantic search

Roadmap

•  I. How we use Solr @ CareerBuilder
•  II. Traditional Relevancy Scoring
•  III. Advanced Relevancy through functions
   –  Factors as a linear function
   –  Context-aware relevancy parameter weighting
•  IV. Personalization & Recommendations
   –  Profile and Behavior-based
   –  Solr as a recommendation engine
   –  Collaborative Filtering
•  V. Semantic Search
   –  Mining user-behavior for synonyms
   –  Uncovering meaning through clustering
   –  Latent Semantic Indexing overview
   –  Document-based searching
   –  Foreground vs. Background analysis

Page 4: Enhancing relevancy through personalization & semantic search

How we use Solr @ CareerBuilder

Page 5: Enhancing relevancy through personalization & semantic search

•  Over 2.5 million new jobs each month
•  Over 60 million actively searchable resumes
•  ~300 globally distributed search servers
•  Thousands of unique, dynamically generated indexes
•  Over 1 Billion actively searchable documents
•  Over 1 million searches an hour

Search Scale @

Page 6: Enhancing relevancy through personalization & semantic search

Data Analytics

Page 7: Enhancing relevancy through personalization & semantic search

Data Analytics

Page 8: Enhancing relevancy through personalization & semantic search

Data Analytics (market supply)

Page 9: Enhancing relevancy through personalization & semantic search

Data Analytics (market demand)

Page 10: Enhancing relevancy through personalization & semantic search

Data Analytics (labor pressure: supply/demand)

Page 11: Enhancing relevancy through personalization & semantic search

Data Analytics (hiring comparison per market)

Page 12: Enhancing relevancy through personalization & semantic search

Traditional Search

Page 13: Enhancing relevancy through personalization & semantic search

Recommendations

Page 14: Enhancing relevancy through personalization & semantic search

Traditional Relevancy Scoring

Page 15: Enhancing relevancy through personalization & semantic search

Default Lucene Relevancy Algorithm (DefaultSimilarity)

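*Source: Solr in Action, chapter 3

\[
\mathrm{Score}(q, d) = \sum_{t \in q} \Big( \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^2 \cdot t.\mathrm{getBoost}() \cdot \mathrm{norm}(t, d) \Big) \cdot \mathrm{coord}(q, d) \cdot \mathrm{queryNorm}(q)
\]

Where:

t = term; d = document; q = query; f = field

\[
\begin{aligned}
\mathrm{tf}(t \in d) &= \mathrm{numTermOccurrencesInDocument}^{1/2} \\
\mathrm{idf}(t) &= 1 + \log\big(\mathrm{numDocs} / (\mathrm{docFreq} + 1)\big) \\
\mathrm{coord}(q, d) &= \mathrm{numTermsInDocumentFromQuery} / \mathrm{numTermsInQuery} \\
\mathrm{queryNorm}(q) &= 1 / \mathrm{sumOfSquaredWeights}^{1/2} \\
\mathrm{sumOfSquaredWeights} &= q.\mathrm{getBoost}()^2 \cdot \sum_{t \in q} \big( \mathrm{idf}(t) \cdot t.\mathrm{getBoost}() \big)^2 \\
\mathrm{norm}(t, d) &= d.\mathrm{getBoost}() \cdot \mathrm{lengthNorm}(f) \cdot f.\mathrm{getBoost}()
\end{aligned}
\]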
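The formula above can be traced by hand on a toy corpus. The sketch below is a simplified, single-field reading of it, not Lucene's actual code: it assumes every boost is 1, takes lengthNorm(f) to be 1/sqrt(number of terms in the field), and uses a made-up three-document corpus. The function names are illustrative only.

```python
import math
from collections import Counter

# Toy single-field corpus (hypothetical data, not from the talk).
DOCS = {
    "doc1": "java developer java",
    "doc2": "registered nurse",
    "doc3": "senior java architect",
}


def tokenize(text):
    return text.lower().split()


def idf(term, docs):
    # idf(t) = 1 + log(numDocs / (docFreq + 1))
    doc_freq = sum(1 for text in docs.values() if term in tokenize(text))
    return 1.0 + math.log(len(docs) / (doc_freq + 1))


def score(query, doc_id, docs):
    q_terms = tokenize(query)
    d_terms = tokenize(docs[doc_id])
    counts = Counter(d_terms)

    # queryNorm(q) = 1 / sqrt(sumOfSquaredWeights), with all boosts assumed to be 1
    query_norm = 1.0 / math.sqrt(sum(idf(t, docs) ** 2 for t in q_terms))

    # norm(t, d): only the length norm remains when all boosts are 1 (assumption)
    length_norm = 1.0 / math.sqrt(len(d_terms))

    matched = [t for t in q_terms if counts[t] > 0]
    # tf(t in d) = sqrt(term occurrences); each term contributes tf * idf^2 * norm
    total = sum(math.sqrt(counts[t]) * idf(t, docs) ** 2 * length_norm for t in matched)

    # coord(q, d) = fraction of query terms found in the document
    coord = len(matched) / len(q_terms)
    return total * coord * query_norm


ranked = sorted(DOCS, key=lambda d: score("java developer", d, DOCS), reverse=True)
```

With the query "java developer", doc1 matches both terms and ranks first, doc3 matches only "java", and doc2 scores zero.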

Page 16: Enhancing relevancy through personalization & semantic search

•  Term Frequency: “How well a term describes a document?” –  Measure: how often a term occurs per document

•  Inverse Document Frequency: “How important is a term overall?” –  Measure: how rare the term is across all documents

TF * IDF

Page 17: Enhancing relevancy through personalization & semantic search

Boosting documents and fields

•  Certain fields may be more important than other fields: –  The Job Title and Skills may be more relevant than other aspects of the job: /select?qf=jobtitle^10 skills^5 jobrequirements^2 jobdescription^1

•  It’s possible to boost documents and fields at both index time and query time

•  If you need more fine-grained control (such as per-term index-time boosting), you can make use of payloads
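As a rough illustration of query-time field boosting, the sketch below assembles the qf parameter from a dictionary of field weights. It assumes the eDisMax query parser is used (defType=edismax, which the slide does not show) and the helper name is hypothetical.

```python
from urllib.parse import urlencode


def boosted_fields_query(keywords, field_boosts):
    """Build a /select request path with per-field query-time boosts."""
    qf = " ".join(f"{field}^{boost}" for field, boost in field_boosts.items())
    # defType=edismax is an assumption; qf is only honored by the (e)dismax parsers.
    return "/select?" + urlencode({"defType": "edismax", "q": keywords, "qf": qf})


url = boosted_fields_query(
    "java developer",
    {"jobtitle": 10, "skills": 5, "jobrequirements": 2, "jobdescription": 1},
)
```

Keeping the weights in a dictionary makes them easy to vary per request, which matters for the A/B testing mentioned later in the talk.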

Page 18: Enhancing relevancy through personalization & semantic search

Custom scoring with Payloads

•  In addition to boosting search terms and fields, content within Fields can also be boosted differently using Payloads (requires a custom scoring implementation):

design [1] / engineer [1] / really [ ] / great [ ] / job [ ] / ten[3] / years[3] / experience[3] / careerbuilder [2] / design [2], …

jobtitle: bucket=[1] boost=10; company: bucket=[2] boost=4; jobdescription: bucket=[ ] weight=1; experience: bucket=[3] weight=1.5

We can pass in a parameter to Solr at query time specifying the boost to apply to each bucket, i.e. …&bucketWeights=1:10;2:4;3:1.5;default:1;

•  This allows us to map many relevancy buckets to search terms at index time and adjust the weighting at query time without having to search across hundreds of fields.

•  By making all scoring parameters overridable at query time, we are able to do A / B testing to consistently improve our relevancy model
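The bucket idea can be sketched outside of Solr. The code below is not the custom Lucene scoring implementation the slide refers to; it is a hypothetical stand-in that only shows how a bucketWeights string could be parsed and applied to terms tagged with bucket ids at index time.

```python
def parse_bucket_weights(param):
    """Parse a string such as '1:10;2:4;3:1.5;default:1;' into {bucket: weight}."""
    weights = {}
    for pair in param.split(";"):
        if not pair:
            continue  # tolerate the trailing ';'
        bucket, weight = pair.split(":")
        weights[bucket] = float(weight)
    return weights


def payload_score(matched_terms, weights):
    """Sum the weight of each matched term's bucket (None means no payload)."""
    default = weights.get("default", 1.0)
    return sum(weights.get(bucket, default) for _term, bucket in matched_terms)


weights = parse_bucket_weights("1:10;2:4;3:1.5;default:1;")
# Terms tagged with the bucket ids from the slide's example document.
total = payload_score([("design", "1"), ("great", None), ("experience", "3")], weights)
```

Because the weights arrive with the query rather than being baked into the index, two variants of the relevancy model can be compared without reindexing.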

Page 19: Enhancing relevancy through personalization & semantic search

•  News search: popularity and freshness drive relevance •  Restaurant search: geographical proximity and price range are critical •  Ecommerce: likelihood of a purchase is key •  Movie search: More popular titles are generally more relevant •  Job search: category of job, salary range, and geographical proximity matter

TF * IDF of keywords can’t hold its own against good domain-specific relevance factors!

That’s great, but what about domain-specific knowledge?

Page 20: Enhancing relevancy through personalization & semantic search

Advanced Relevancy through Functions

Page 21: Enhancing relevancy through personalization & semantic search

Example of domain-specific relevancy calculation

News website:

/select? fq=$myQuery& q=_query_:"{!func}scale(query($myQuery),0,100)" AND _query_:"{!func}div(100,map(geodist(),0,1,1))" AND _query_:"{!func}recip(rord(publicationDate),0,100,100)" AND _query_:"{!func}scale(popularity,0,100)"& myQuery="street festival"& sfield=location& pt=33.748,-84.391

[Slide annotation: each of the four function clauses in the query is labeled 25%.]

*Example from chapter 16 of Solr in Action

Page 22: Enhancing relevancy through personalization & semantic search

Fancy boosting functions

•  Separating “relevancy” and “filtering” from the query: q=_val_:"$keywords"&fq={!cache=false v=$keywords}&keywords=solr

•  Keywords (50%) + distance (25%) + category (25%)

q=_val_:"scale(mul(query($keywords),1),0,50)" AND
  _val_:"scale(sum($radiusInKm,mul(query($distance),-1)),0,25)" AND
  _val_:"scale(mul(query($category),1),0,25)"
&keywords=solr
&radiusInKm=48.28
&distance=_val_:"geodist(latitudelongitude.latlon_is,33.77402,-84.29659)"
&category=jobtitle:"java developer"
&fq={!cache=false v=$keywords}
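To see what the 50/25/25 split does to a result set, here is a small sketch that mimics the query above in plain Python. It assumes scale() behaves as a min-max rescaling over all matching documents and that the three clauses simply add up; the input numbers are invented, and the handling of a constant factor is a choice made for this sketch only.

```python
def scale(values, lo, hi):
    """Min-max scale values into [lo, hi] across the whole result set."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        # Assumption for this sketch: a constant factor contributes the minimum.
        return [float(lo)] * len(values)
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]


def combined_scores(keyword, distance_km, category, radius_km=48.28):
    # Mirrors sum($radiusInKm, mul(distance, -1)): closer documents get larger values.
    closeness = [radius_km - d for d in distance_km]
    parts = zip(scale(keyword, 0, 50), scale(closeness, 0, 25), scale(category, 0, 25))
    return [sum(p) for p in parts]


# Three hypothetical documents: raw keyword score, distance in km, category score.
scores = combined_scores([2.0, 1.0, 0.5], [40.0, 5.0, 20.0], [0.0, 1.0, 1.0])
```

In this toy data the second document wins: it has a weaker keyword match than the first, but it is the closest and matches the category.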

Page 23: Enhancing relevancy through personalization & semantic search

Context aware relevancy

Example: Willingness to relocate for a job

[Chart: willingness to relocate, plotted by percentile (1% to 95%) on a 0 to 2,500 axis, for software engineers vs. food service workers]

Page 24: Enhancing relevancy through personalization & semantic search

Willingness to relocate

Software engineers in Chicago want jobs in these locations:

Page 25: Enhancing relevancy through personalization & semantic search

Willingness to relocate

Food service workers in Chicago want jobs in these locations:

Page 26: Enhancing relevancy through personalization & semantic search

Personalization & Recommendations

Page 27: Enhancing relevancy through personalization & semantic search

•  John lives in Boston but wants to move to New York or possibly another big city. He is currently a sales manager but wants to move towards business development.

•  Irene is a bartender in Dublin and is only interested in jobs within 10KM of her location in the food service industry.

•  Irfan is a software engineer in Atlanta and is interested in software engineering jobs at a Big Data company. He is happy to move across the U.S. for the right job.

•  Jane is a nurse educator in Boston seeking between $40K and $60K working in the healthcare industry

Beyond domain knowledge… consider per-user knowledge

Page 28: Enhancing relevancy through personalization & semantic search

http://localhost:8983/solr/jobs/select/?
  fl=jobtitle,city,state,salary&
  q=( jobtitle:"nurse educator"^25 OR jobtitle:(nurse educator)^10 )
    AND ( (city:"Boston" AND state:"MA")^15 OR state:"MA" )
    AND _val_:"map(salary, 40000, 60000,10, 0)"

*Example from chapter 16 of Solr in Action

Query for Jane

Jane is a nurse educator in Boston seeking between $40K and $60K working in the healthcare industry
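A query like the one for Jane can be generated from a stored profile. The sketch below is one possible way to do it; the profile keys and the function name are hypothetical, while the boosts (25, 10, 15) and the salary map() values are copied from the example query.

```python
def profile_query(profile):
    """Turn a user profile into an attribute-based recommendation query."""
    title = profile["jobtitle"]
    city = profile["city"]
    state = profile["state"]
    low = profile["min_salary"]
    high = profile["max_salary"]
    return (
        # Exact title phrase scores highest, individual title words less so.
        f'(jobtitle:"{title}"^25 OR jobtitle:({title})^10) '
        # Same city is preferred, same state is still acceptable.
        f'AND ((city:"{city}" AND state:"{state}")^15 OR state:"{state}") '
        # map() gives 10 to salaries inside the range and 0 otherwise.
        f'AND _val_:"map(salary,{low},{high},10,0)"'
    )


jane = {
    "jobtitle": "nurse educator",
    "city": "Boston",
    "state": "MA",
    "min_salary": 40000,
    "max_salary": 60000,
}
q = profile_query(jane)
```

The resulting string would be sent as the q parameter (URL-encoded) together with fl=jobtitle,city,state,salary.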

Page 29: Enhancing relevancy through personalization & semantic search

Search Results for Jane

{ ...
  "response":{"numFound":22,"start":0,"docs":[
    {"jobtitle":"Clinical Educator (New England/ Boston)", "city":"Boston", "state":"MA", "salary":41503},
    {"jobtitle":"Nurse Educator", "city":"Braintree", "state":"MA", "salary":56183},
    {"jobtitle":"Nurse Educator", "city":"Brighton", "state":"MA", "salary":71359},
    …]}}

*Example documents available @ http://github.com/treygrainger/solr-in-action/

Page 30: Enhancing relevancy through personalization & semantic search

What did we just do?

•  We built a recommendation engine!

•  What is a recommendation engine?
   –  A system that uses known information (or information derived from that known information) to automatically suggest relevant content

•  Our example was just an attribute-based recommendation… we’ll see that behavior-based recommendation (i.e. collaborative filtering) is also possible.

Page 31: Enhancing relevancy through personalization & semantic search

Redefining “Search Engine”

•  “Lucene is a high-performance, full-featured text search engine library…”

Yes, but really…

•  Lucene is a high-performance, fully-featured token matching and scoring library… which can perform full-text searching.

Page 32: Enhancing relevancy through personalization & semantic search

Redefining “Search Engine”

or, in machine learning speak:

•  A Lucene index is a multi-dimensional sparse matrix… with very fast and powerful lookup capabilities.

•  Think of each field as a matrix containing each term mapped to each document

Page 33: Enhancing relevancy through personalization & semantic search

The Lucene Inverted Index (traditional text example)

What you SEND to Lucene/Solr:

Document   Content Field
doc1       once upon a time, in a land far, far away
doc2       the cow jumped over the moon.
doc3       the quick brown fox jumped over the lazy dog.
doc4       the cat in the hat
doc5       The brown cow said “moo” once.
…          …

How the content is INDEXED into Lucene/Solr (conceptually):

Term       Documents
a          doc1 [2x]
brown      doc3 [1x], doc5 [1x]
cat        doc4 [1x]
cow        doc2 [1x], doc5 [1x]
…          ...
once       doc1 [1x], doc5 [1x]
over       doc2 [1x], doc3 [1x]
the        doc2 [2x], doc3 [2x], doc4 [2x], doc5 [1x]
…          …
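The conceptual mapping above is easy to reproduce. The sketch below builds a term-to-documents map with occurrence counts from the five example documents; it uses a naive lowercase/letters-only tokenizer as a stand-in for a real Lucene analyzer, and it ignores positions and everything else a real index stores.

```python
import re
from collections import defaultdict

DOCS = {
    "doc1": "once upon a time, in a land far, far away",
    "doc2": "the cow jumped over the moon.",
    "doc3": "the quick brown fox jumped over the lazy dog.",
    "doc4": "the cat in the hat",
    "doc5": 'The brown cow said "moo" once.',
}


def build_inverted_index(docs):
    """Map each term to {doc_id: number of occurrences in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Naive analysis: lowercase, keep runs of letters only.
        for token in re.findall(r"[a-z]+", text.lower()):
            index[token][doc_id] = index[token].get(doc_id, 0) + 1
    return index


index = build_inverted_index(DOCS)
```

Looking up index["the"] gives the same postings as the table: doc2, doc3 and doc4 twice each, doc5 once.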

Page 34: Enhancing relevancy through personalization & semantic search

Matching text queries to text fields

/solr/select/?q=jobcontent:"software engineer"

Job Content Field   Documents
…                   …
engineer            doc1, doc3, doc4, doc5
…                   …
mechanical          doc2, doc4, doc6
…                   …
software            doc1, doc3, doc4, doc7, doc8
…                   …

[Venn diagram: engineer only: doc5; software only: doc7, doc8; software engineer (both): doc1, doc3, doc4]
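The overlap in the diagram is a set intersection over the postings lists. The sketch below shows only that step, using the postings from the table; an actual phrase query would additionally check that the terms appear next to each other, which is left out here.

```python
# Postings taken from the table above (term -> documents containing it).
POSTINGS = {
    "engineer": {"doc1", "doc3", "doc4", "doc5"},
    "mechanical": {"doc2", "doc4", "doc6"},
    "software": {"doc1", "doc3", "doc4", "doc7", "doc8"},
}


def match_all(terms, postings):
    """Return the documents that contain every one of the given terms."""
    sets = [postings.get(term, set()) for term in terms]
    return sorted(set.intersection(*sets)) if sets else []


hits = match_all(["software", "engineer"], POSTINGS)
```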

Page 35: Enhancing relevancy through personalization & semantic search

Beyond Text Searching

•  Lucene/Solr is a search matching engine

•  When Lucene/Solr search text, they are matching tokens in the query with tokens in the index

•  Anything that can be searched upon can form the basis of matching and scoring:
   –  text, attributes, locations, results of functions, user behavior, classifications, etc.

Page 36: Enhancing relevancy through personalization & semantic search

Approaches to Recommendations

•  Content-based
   –  Attribute based
      i.e. income level, hobbies, location, experience
   –  Hierarchical
      i.e. “medical//nursing//oncology”, “animal//dog//terrier”
   –  Textual Similarity
      i.e. Solr’s MoreLikeThis Request Handler & Search Handler
   –  Concept Based
      i.e. Solr => “software engineer”, “java”, “search”, “open source”

•  Collaborative Filtering
   “Users who liked that also liked this…”

•  Hybrid Approaches

Page 37: Enhancing relevancy through personalization & semantic search

Collaborative Filtering

What you SEND to Lucene/Solr:

Document   “Users who bought this product” field
doc1       user1, user4, user5
doc2       user2, user3
doc3       user4
doc4       user4, user5
doc5       user4, user1
…          …

How the content is INDEXED into Lucene/Solr (conceptually):

Term       Documents
user1      doc1, doc5
user2      doc2
user3      doc2
user4      doc1, doc3, doc4, doc5
user5      doc1, doc4
…          …

Page 38: Enhancing relevancy through personalization & semantic search

Step 1: Find similar users who like the same documents

Document   “Users who bought this product” field
doc1       user1, user4, user5
doc2       user2, user3
doc3       user4
doc4       user4, user5
doc5       user4, user1
…          …

q=documentid:("doc1" OR "doc4")

[Diagram: doc1 matches user1, user4, user5; doc4 matches user4, user5]

Top-scoring results (most similar users):
1) user4 (2 shared likes)
2) user5 (2 shared likes)
3) user1 (1 shared like)

*Source: Solr in Action, chapter 16

Page 39: Enhancing relevancy through personalization & semantic search

Step 2: Search for docs “liked” by those similar users

Term       Documents
user1      doc1, doc5
user2      doc2
user3      doc2
user4      doc1, doc3, doc4, doc5
user5      doc1, doc4
…          …

Most similar users:
1) user4 (2 shared likes)
2) user5 (2 shared likes)
3) user1 (1 shared like)

/solr/select/?q=userlikes:("user4"^2 OR "user5"^2 OR "user1"^1)

Top recommended documents:
1) doc1 (matches user4, user5, user1)
2) doc4 (matches user4, user5)
3) doc5 (matches user4, user1)
4) doc3 (matches user4)
// doc2 does not match

*Source: Solr in Action, chapter 16
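Both steps can be followed end to end on the example data. The sketch below replaces the two Solr queries with plain dictionary lookups: step 1 counts shared likes per user, step 2 sums those counts as boosts per document. It is a simplification, since real Solr scores would also include tf, idf and normalization factors.

```python
from collections import Counter

# document -> users who bought it (the example data from the slides)
PURCHASES = {
    "doc1": ["user1", "user4", "user5"],
    "doc2": ["user2", "user3"],
    "doc3": ["user4"],
    "doc4": ["user4", "user5"],
    "doc5": ["user4", "user1"],
}


def similar_users(liked_docs, purchases):
    """Step 1: count how many of the liked documents each user shares."""
    counts = Counter()
    for doc in liked_docs:
        counts.update(purchases[doc])
    return counts


def recommend(user_weights, purchases):
    """Step 2: score each document by the summed weights of matching users."""
    scores = Counter()
    for doc, users in purchases.items():
        for user in users:
            if user in user_weights:
                scores[doc] += user_weights[user]
    return scores.most_common()


users = similar_users(["doc1", "doc4"], PURCHASES)
recommendations = recommend(users, PURCHASES)
```

As on the slide, the documents the person already liked (doc1, doc4) come back at the top; a production system would probably filter those out before showing recommendations.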

Page 40: Enhancing relevancy through personalization & semantic search

Building up to personalization

•  Use what you have:
   –  User’s keywords, IP address, searches, clicks, “likes” (purchases, job applications, comments, etc.)
   –  Build up a dossier of information on your users
   –  If a user gives you a profile (resume, social profile, etc), even better.

Page 41: Enhancing relevancy through personalization & semantic search

For full coverage of building a recommendation engine in Solr…

•  See my talk from Lucene Revolution 2012 (Boston):

Page 42: Enhancing relevancy through personalization & semantic search

Personalized Search

•  Why limit yourself to JUST explicit search or JUST automated recommendations?

•  By augmenting your user’s explicit queries with information you know about them, you can personalize their search results.

•  Examples:
   –  A known software engineer runs a blank job search in New York…
      •  Why not show software engineering higher in the results?
   –  A new user runs a keyword-only search for nurse
      •  Why not use the user’s IP address to boost documents geographically closer?
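One way to augment an explicit query with profile knowledge is an additive boost query. The sketch below is only an illustration: it assumes the eDisMax parser and its bq parameter, a jobtitle field, and an arbitrary boost of 10, none of which are specified in the talk.

```python
def personalized_params(keywords, profile):
    """Build request parameters that keep the user's query but boost by profile."""
    params = {
        "defType": "edismax",      # assumption: eDisMax is the query parser
        "q": keywords or "*:*",    # a blank search matches everything
    }
    title = profile.get("jobtitle")
    if title:
        # bq adds to the score of matching documents without filtering others out.
        params["bq"] = f'jobtitle:"{title}"^10'
    return params


known_user = personalized_params("", {"jobtitle": "software engineer"})
new_user = personalized_params("nurse", {})
```

The boost only reorders results: documents that do not match the profile still appear, just lower down.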

Page 43: Enhancing relevancy through personalization & semantic search

Semantic Search

Page 44: Enhancing relevancy through personalization & semantic search

Not going to talk about…

•  Using the SynonymFilter
•  Automatic language detection
•  Stemming/lemmatization/multi-lingual search
•  Stopwords
(For all of the above, see the Solr Wiki, Reference Guide, or read Solr in Action)

•  Instead, we’re going to cover:
   –  Mining user behavior to discover synonyms/related queries
   –  Discovering related concepts using document clustering in Solr
   –  Future work: Latent Semantic Indexing
   –  Document to Document searching using More Like This
   –  Foreground/Background corpus analysis

Page 45: Enhancing relevancy through personalization & semantic search

Automatic Synonym Discovery

•  Our primary approach: Search Co-occurrences
•  Strategy: Map/Reduce job which finds related searches by computing which searches are run by the same users

John searched for “java developer” and “j2ee”.
Jane searched for “registered nurse” and “r.n.” and “prn”.
Zeke searched for “java developer” and “scala” and “jvm”.

•  By mining tens of millions of search terms per day, we get a list of top searches, with the corresponding top co-occurring searches.
•  We also tie each search term to the top category of jobs (i.e. java developer, truck driver, etc.), so that we know in what context people search for each term.
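The co-occurrence counting can be sketched in a few lines. This is an in-memory stand-in for the Map/Reduce job, using only the three users from the example; the real job's input format and any thresholds are not described in the talk.

```python
from collections import Counter, defaultdict
from itertools import permutations

# user -> searches they ran (the example from the slide)
SEARCH_LOG = {
    "John": ["java developer", "j2ee"],
    "Jane": ["registered nurse", "r.n.", "prn"],
    "Zeke": ["java developer", "scala", "jvm"],
}


def co_occurrences(search_log):
    """For each search term, count the other terms searched by the same users."""
    related = defaultdict(Counter)
    for searches in search_log.values():
        # Every ordered pair of distinct searches by one user counts once.
        for a, b in permutations(set(searches), 2):
            related[a][b] += 1
    return related


related = co_occurrences(SEARCH_LOG)
```

With more users, related["java developer"].most_common() would surface the strongest candidates for synonyms or related searches.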

Page 46: Enhancing relevancy through personalization & semantic search

Example of “related search terms”

Example: “accounting”
accountant 8880, accounts payable 5235, finance 3675, accounting clerk 3651, bookkeeper 3225, controller 2898, staff accountant 2866, accounts receivable 2842

Example: “RN”
registered nurse 6588, rn registered nurse 4300, nurse 2492, nursing 912, lpn 707, healthcare 453, rn case manager 446, registered nurse rn 404, director of nursing 321, case manager 292

Page 47: Enhancing relevancy through personalization & semantic search

Latent Semantic Indexing

Future work on building conceptual links

•  Concept: Build a matrix of all terms, perform singular value decomposition on that matrix to reduce the number of dimensions, and index the meaningful (i.e. blurred) terms on each document.

•  Why this matters: if done correctly, the search engine can automatically collapse terms by meaning, remove the useless and redundant ones, and form its own conceptual model of your domain space. This can be used to infuse more meaning into a document than just a keyword.

•  See blog posts and presentations by John Berryman and Doug Turnbull about their work on this. They’re leading the way on this right now (in the open-source community).

•  http://www.opensourceconnections.com/2013/08/25/semantic-search-with-solr-and-python-numpy
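For reference, the dimensionality reduction described above is usually written as a truncated singular value decomposition. The notation below is the standard textbook form, not something taken from the slides: A is the term-document matrix with m terms and n documents, and k is the number of concepts kept.

```latex
A \approx A_k = U_k \Sigma_k V_k^{\top},
\qquad A \in \mathbb{R}^{m \times n},\quad
U_k \in \mathbb{R}^{m \times k},\quad
\Sigma_k \in \mathbb{R}^{k \times k},\quad
V_k \in \mathbb{R}^{n \times k},\quad
k \ll \min(m, n)

\hat{q} = \Sigma_k^{-1} U_k^{\top} q
```

Each row of V_k gives one document's coordinates in the k-dimensional concept space, and a query vector q over the original terms is projected into the same space as \hat{q}. Multiplying A_k back out yields the "blurred" term weights per document mentioned above.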

Page 48: Enhancing relevancy through personalization & semantic search

Using Clustering to find semantic links

Page 49: Enhancing relevancy through personalization & semantic search

Setting up Clustering in solrconfig.xml

<searchComponent name="clustering" enable="true" class="solr.clustering.ClusteringComponent">
  <lst name="engine">
    <str name="name">default</str>
    <str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
    <str name="MultilingualClustering.defaultLanguage">ENGLISH</str>
  </lst>
</searchComponent>

<requestHandler name="/clustering" enable="true" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="clustering.engine">default</str>
    <bool name="clustering.results">true</bool>
    <str name="fl">*,score</str>
  </lst>
  <arr name="last-components">
    <str>clustering</str>
  </arr>
</requestHandler>

Page 50: Enhancing relevancy through personalization & semantic search

Clustering Query

/solr/clustering/?q=(solr or lucene)
  &rows=100
  &carrot.title=titlefield
  &carrot.snippet=titlefield
  &LingoClusteringAlgorithm.desiredClusterCountBase=25
  //clustering & grouping don’t currently play nicely

Allows you to dynamically identify “concepts” and their prevalence within a user’s top search results

Page 51: Enhancing relevancy through personalization & semantic search

Clustering Results

Stage 1: Identify Concepts

Original Query: q=(solr or lucene)
  // can be a user’s search, their job title, a list of skills,
  // or any other keyword rich data source

Clusters Identified:
  Developer (22)
  Java Developer (13)
  Software (10)
  Senior Java Developer (9)
  Architect (6)
  Software Engineer (6)
  Web Developer (5)
  Search (3)
  Software Developer (3)
  Systems (3)
  Administrator (2)
  Hadoop Engineer (2)
  Java J2EE (2)
  Search Development (2)
  Software Architect (2)
  Solutions Architect (2)

Page 52: Enhancing relevancy through personalization & semantic search

Stage 2: Use Semantic Links in your relevancy calculation

q=content:("Developer"^22 or "Java Developer"^13 or "Software"^10 or "Senior Java Developer"^9 or "Architect"^6 or "Software Engineer"^6 or "Web Developer"^5 or "Search"^3 or "Software Developer"^3 or "Systems"^3 or "Administrator"^2 or "Hadoop Engineer"^2 or "Java J2EE"^2 or "Search Development"^2 or "Software Architect"^2 or "Solutions Architect"^2)

// You can also add the user’s location or the original keywords to the
// recommendations search if it helps results quality for your use-case.
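Turning the cluster labels into the boosted query is a mechanical step. The sketch below shows one way to do it; the function name is hypothetical, cluster sizes are used directly as boosts as on the slide, and the clauses are joined with an uppercase OR, which is the operator form the standard Lucene query syntax recognizes.

```python
def clusters_to_query(field, clusters):
    """Build a boosted OR query from (cluster label, document count) pairs."""
    clauses = [f'"{label}"^{count}' for label, count in clusters]
    return f"{field}:(" + " OR ".join(clauses) + ")"


clusters = [("Developer", 22), ("Java Developer", 13), ("Software", 10)]
q = clusters_to_query("content", clusters)
```

Labels containing double quotes would need escaping before being placed in the query; the example labels do not, so that step is left out.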

Page 53: Enhancing relevancy through personalization & semantic search

Goal: use an entire document as your Solr Query, recommending other related documents.

Standard approach: More Like This Handler Alternative Approach: Foreground vs. Background corpus analysis

Document to Document Searching

Page 54: Enhancing relevancy through personalization & semantic search

solrconfig.xml: <requestHandler name="/mlt" class="solr.MoreLikeThisHandler" />

Query:
/solr/jobs/mlt/?df=jobdescription&
  fl=id,jobtitle&
  rows=3&
  q=J2EE&                              // recommendations based on top scoring doc
  mlt.fl=jobtitle,jobdescription&      // inspect these fields for interesting terms
  mlt.interestingTerms=details&        // return the interesting terms
  mlt.boost=true

More Like This (Query)

*Example from chapter 16 of Solr in Action

Page 55: Enhancing relevancy through personalization & semantic search

More Like This (Results)

{"match":{"numFound":122,"start":0,"docs":[
    {"id":"fc57931d42a7ccce3552c04f3db40af8dabc99dc",
     "jobtitle":"Senior Java / J2EE Developer"}]
 },
 "response":{"numFound":2225,"start":0,"docs":[
    {"id":"0e953179408d710679e5ddbd15ab0dfae52ffa6c",
     "jobtitle":"Sr Core Java Developer"},
    {"id":"5ce796c758ee30ed1b3da1fc52b0595c023de2db",
     "jobtitle":"Applications Developer"},
    {"id":"1e46dd6be1750fc50c18578b7791ad2378b90bdd",
     "jobtitle":"Java Architect/ Lead Java Developer - WJAV Java - Java in Pittsburgh PA"}]},
 "interestingTerms":[
    "jobdescription:j2ee",1.0,
    "jobdescription:java",0.68131137,
    "jobdescription:senior",0.52161527,
    "jobtitle:developer",0.44706684,
    "jobdescription:source",0.2417754,
    "jobdescription:code",0.17976432,
    "jobdescription:is",0.17765637,
    "jobdescription:client",0.17331646,
    "jobdescription:our",0.11985878,
    "jobdescription:for",0.07928475,
    "jobdescription:a",0.07875194,
    "jobdescription:to",0.07741922,
    "jobdescription:and",0.07479082]}

Page 56: Enhancing relevancy through personalization & semantic search

More Like This (passing in external document)

/solr/jobs/mlt/? df=jobdescription& fl=id,jobtitle& mlt.fl=jobtitle,jobdescription& mlt.interestingTerms=details& mlt.boost=true&

stream.body=Solr is an open source enterprise search platform from the Apache Lucene project. Its major features include full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling. Providing distributed search and index replication, Solr is highly scalable. Solr is the most popular enterprise search engine. Solr 4 adds NoSQL features.
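When the document text is passed in the URL it has to be encoded. The sketch below builds such a request with the parameters from the slide; the host and core (localhost:8983/solr/jobs) are taken from the earlier example, the function name is hypothetical, and whether stream.body is accepted depends on the Solr version and configuration, so treat this as an illustration only.

```python
from urllib.parse import urlencode


def mlt_url(text, base="http://localhost:8983/solr/jobs/mlt"):
    """Build a More Like This request that uses external text as the source document."""
    params = {
        "df": "jobdescription",
        "fl": "id,jobtitle",
        "mlt.fl": "jobtitle,jobdescription",
        "mlt.interestingTerms": "details",
        "mlt.boost": "true",
        "stream.body": text,  # the external document, URL-encoded by urlencode
    }
    return base + "?" + urlencode(params)


url = mlt_url("Solr is an open source enterprise search platform from the Apache Lucene project.")
```

For long documents a POST body is the more practical choice, since URLs have length limits.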

Page 57: Enhancing relevancy through personalization & semantic search

More Like This (Results)

{"response":{"numFound":2221,"start":0,"docs":[
    {"id":"eff5ac098d056a7ea6b1306986c3ae511f2d0d89",
     "jobtitle":"Enterprise Search Architect…"},
    {"id":"37abb52b6fe63d601e5457641d2cf5ae83fdc799",
     "jobtitle":"Sr. Java Developer"},
    {"id":"349091293478dfd3319472e920cf65657276bda4",
     "jobtitle":"Java Lucene Software Engineer"}]},
 "interestingTerms":[
    "jobdescription:search",1.0,
    "jobdescription:solr",0.9155779,
    "jobdescription:features",0.36472517,
    "jobdescription:enterprise",0.30173126,
    "jobdescription:is",0.17626463,
    "jobdescription:the",0.102924034,
    "jobdescription:and",0.098939896]}

Page 58: Enhancing relevancy through personalization & semantic search

CareerBuilder’s Alternative approach (“enhanced” More Like This)

I. Send document as content stream to Solr
II. Perform Language Identification on the content
III. Do language-specific parts of speech detection
   •  Keep nouns, remove other parts of speech (removes noise)
IV. Do analysis of additional terms for statistical significance:
   tf * idf OR foreground vs. background corpus comparison OR Both

   Preferred statistical significance measure:

   \[
   z = \frac{\mathrm{countFG}(x) - \mathrm{totalCountFG} \cdot \mathrm{probBG}(x)}{\sqrt{\mathrm{totalCountFG} \cdot \mathrm{probBG}(x) \cdot \big(1 - \mathrm{probBG}(x)\big)}}
   \]
V. Return top scoring terms
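The significance measure can be tried out on toy counts. The sketch below implements the z formula as written and ranks terms by it; the term counts are invented, probBG(x) is estimated as the term's share of all background term occurrences (an assumption), and terms missing from the background are skipped because the formula is undefined when probBG(x) is 0.

```python
import math


def z_score(count_fg, total_count_fg, prob_bg):
    """z = (countFG - totalCountFG * probBG) / sqrt(totalCountFG * probBG * (1 - probBG))"""
    expected = total_count_fg * prob_bg
    variance = total_count_fg * prob_bg * (1.0 - prob_bg)
    return (count_fg - expected) / math.sqrt(variance)


def top_terms(fg_counts, bg_counts, limit=3):
    """Rank foreground terms by how much more often they occur than expected."""
    total_fg = sum(fg_counts.values())
    total_bg = sum(bg_counts.values())
    scored = []
    for term, count in fg_counts.items():
        prob_bg = bg_counts.get(term, 0) / total_bg
        if 0.0 < prob_bg < 1.0:  # skip terms the background cannot score
            scored.append((z_score(count, total_fg, prob_bg), term))
    return [term for _score, term in sorted(scored, reverse=True)[:limit]]


# Hypothetical term counts for a foreground and a background corpus.
foreground = {"java": 30, "and": 50, "nurse": 1}
background = {"java": 20, "and": 500, "nurse": 80, "the": 400}
best = top_terms(foreground, background, limit=1)
```

Here "java" stands out strongly, "and" is only slightly above its expected count, and "nurse" appears less often than the background predicts, so it gets a negative score.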

Page 59: Enhancing relevancy through personalization & semantic search

Foreground vs. Background Corpus Comparison

/solr/doc2doc?
  fg=category:"software engineer"&
  bg=*:*&
  stream.body=java nurse and is are was were ruby php solr oncology part-time … other text in a really long document

Terms statistically more likely to appear in foreground query than background query:

java ruby php

document

Note: This method requires you to pre-classify your documents (which we do)… it doesn’t work with a document that hasn’t already been classified.

We are essentially boosting terms which are more related to some known feature (and ignoring terms which are equally likely to appear in the background corpus)

Page 60: Enhancing relevancy through personalization & semantic search

Pulling it all together

[Diagram: Traditional Search + Recommendations + Semantic Search + Personalized Search = Profit!]

Page 61: Enhancing relevancy through personalization & semantic search

Take-aways

•  Lucene’s inverted index is a sparse matrix useful for traditional search (keywords, locations, etc.), recommendations, and discovering links between terms/tokens

•  Traditional tf * idf keyword search is a good starting point, but the best relevancy lies in combining your domain knowledge (knowledge of users in aggregate) and user-specific knowledge into your own relevancy factors.

•  The ability to understand user queries (semantic search) further enhances the search experience, and you already have many tools at your fingertips for this.

Page 62: Enhancing relevancy through personalization & semantic search

Questions?

Yes, we are hiring @CareerBuilder. Come talk with me if you are interested…

§  Trey Grainger [email protected] @treygrainger

Other presentations: http://www.treygrainger.com

http://solrinaction.com

