BM25 is so Yesterday: Modern Techniques for Better Search Relevance in Solr
Grant Ingersoll, CTO, Lucidworks
Lucene/Solr/Mahout Committer
Index Time:
if (doc.name.contains("Vikings")) { doc.boost = 100 }
OR
Query Time:
q: (MAIN QUERY) OR (name:Vikings)^Y
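The index-time and query-time options can be compared side by side. This is a minimal sketch, assuming a document is a plain dict and using two hypothetical helpers, `index_time_boost` and `query_time_boost`; the boost values are illustrative only.

```python
def index_time_boost(doc, term="Vikings", boost=100.0):
    """Index-time option: stamp a boost onto a copy of the document before indexing."""
    boosted = dict(doc)
    if term in boosted.get("name", ""):
        boosted["boost"] = boost
    return boosted


def query_time_boost(main_query, field="name", term="Vikings", weight=10):
    """Query-time option: OR a weighted clause onto the main query string."""
    return f"({main_query}) OR ({field}:{term})^{weight}"
```

The practical difference is that an index-time boost is baked into the document until it is re-indexed, while a query-time weight can be changed on every request.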
• Term Frequency: "How well a term describes a document?"
  • Measure: how often a term occurs per document
• Inverse Document Frequency: "How important is a term overall?"
  • Measure: how rare the term is across all documents
TF*IDF
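A small worked sketch of TF*IDF, assuming the classic Lucene-style definitions tf = sqrt(term count) and idf = 1 + ln(numDocs / (docFreq + 1)), and assuming documents are already tokenized into lists of strings. The function names are illustrative.

```python
import math


def tf(term, doc_tokens):
    """Term frequency: square root of how often the term occurs in the document."""
    return math.sqrt(doc_tokens.count(term))


def idf(term, corpus):
    """Inverse document frequency: rarer terms across the corpus score higher."""
    doc_freq = sum(1 for doc in corpus if term in doc)
    return 1.0 + math.log(len(corpus) / (doc_freq + 1))


def tf_idf(term, doc_tokens, corpus):
    """Combined weight of one term for one document."""
    return tf(term, doc_tokens) * idf(term, corpus)
```

With this idf, a term that appears in most documents contributes a weight close to 1, while a rare term is weighted more heavily.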
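$$\mathrm{Score}(q, d) = \sum_{t \in q} \mathrm{idf}(t) \cdot \frac{\mathrm{tf}(t \in d) \cdot (k + 1)}{\mathrm{tf}(t \in d) + k \cdot \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)}$$

Where:
t = term; d = document; q = query; i = index
$\mathrm{tf}(t \in d) = \mathrm{numTermOccurrencesInDocument}^{1/2}$
$\mathrm{idf}(t) = 1 + \log\left(\mathrm{numDocs} / (\mathrm{docFreq} + 1)\right)$
$|d| = \sum_{t \in d} 1$
$\mathrm{avgdl} = \left(\sum_{d \in i} |d|\right) / \left(\sum_{d \in i} 1\right)$
k = Free parameter. Usually ~1.2 to 2.0. Increases term frequency saturation point.
b = Free parameter. Usually ~0.75. Increases impact of document normalization.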
BM25 (aka Okapi)
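To see how k and b behave, here is a minimal BM25 sketch in Python. It assumes pre-tokenized documents and uses tf = sqrt(term count) and idf = 1 + ln(numDocs / (docFreq + 1)); note that textbook BM25 usually takes the raw term count for tf. The function name `bm25_score` is illustrative.

```python
import math


def bm25_score(query_terms, doc_tokens, corpus, k=1.2, b=0.75):
    """Score one document against a query over a small in-memory corpus."""
    avgdl = sum(len(doc) for doc in corpus) / len(corpus)
    # Length normalization: documents longer than average are penalized when b > 0.
    norm = 1.0 - b + b * len(doc_tokens) / avgdl
    score = 0.0
    for term in query_terms:
        count = doc_tokens.count(term)
        if count == 0:
            continue
        term_freq = math.sqrt(count)
        doc_freq = sum(1 for doc in corpus if term in doc)
        idf = 1.0 + math.log(len(corpus) / (doc_freq + 1))
        # The fraction saturates: it can never exceed k + 1, however large term_freq gets.
        score += idf * (term_freq * (k + 1)) / (term_freq + k * norm)
    return score
```

Setting b=0 switches length normalization off entirely, and raising k lets repeated occurrences keep adding to the score for longer before it flattens out.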
• Capture and log pretty much everything
  • Searches, Time on page/1st click, What was not chosen, etc.
• Precision: Of those shown, what's relevant?
• Recall: Of all that's relevant, what was found?
• NDCG: Account for position
Measure, Measure, Measure
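The three metrics can be sketched in a few lines of Python. This assumes relevance judgments are available as a set of relevant document ids (for precision and recall) and as graded gains listed in ranked order (for NDCG); here NDCG is normalized against the ideal ordering of the shown results only. Function names are illustrative.

```python
import math


def precision(shown, relevant):
    """Of those shown, what fraction is relevant?"""
    if not shown:
        return 0.0
    return sum(1 for doc in shown if doc in relevant) / len(shown)


def recall(shown, relevant):
    """Of all that is relevant, what fraction was found?"""
    if not relevant:
        return 0.0
    return sum(1 for doc in shown if doc in relevant) / len(relevant)


def dcg(gains):
    """Discounted cumulative gain: gains lower in the ranking count for less."""
    return sum(gain / math.log2(rank + 2) for rank, gain in enumerate(gains))


def ndcg(gains):
    """DCG divided by the DCG of the best possible ordering of the same gains."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0
```

For example, `ndcg([3, 2, 1])` is a perfect ordering, while `ndcg([1, 2, 3])` returns the same documents in the wrong order and scores lower.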
Magic
Guessing
Core Information Theory (aka Lucene/Solr)
Search Aids (Facets, Did You Mean, Highlighting)
Machine Learning (Clicks, Recs, Personalization, User feedback)
Rules, Domain Specific Knowledge
fuhgeddaboudit
Content Collaboration Context
Core Solr capabilities: text matching, faceting, spell checking, highlighting
Business Rules for content: landing pages, boost/block, promotions, etc.
Leverage collective intelligence to predict what users will do based on historical, aggregated data
Recommenders, Popularity, Search Paths
Who are you? Where are you? What have you done previously?
User/Market Segmentation, Roles, Security, Personalization
Next Generation Relevance
But What About the Real World? Indexing Edition
NER, Topic Detection, Clustering
Word2Vec, etc.
Domain Rules: Synonyms, Regexes, Lexical Resources
Extraction
Load Into Spark
Build W2V, PageRank, Topic, Clustering Models
Offline
Content
Models
But What About the Real World? Query Edition
Query Intent: Strategic, Tactical, Semantic
😊
iPad case
Head/Tail/Clickstream enhancement
User Factors: Segmentation, Location, History, Profile, Security
Parse
Domain Specific Rules
Transform Results
…
Cascading Rerankers
Learn To Rank (multi-model), Bias corrections
But What About the Real World? Signals Edition
Load Into Spark
Clickstream Models
Signals
Query Analysis Jobs
Recommenders/Personalization
😊
iPad case
Query Edition
Raw
Models
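One simple way raw signals can become a model is to aggregate clicks per query and keep the most-clicked documents as boost candidates. This is only a sketch: it assumes signals arrive as (query, doc_id) click pairs, and the function name `click_boosts` and the lowercase/strip normalization are illustrative choices, not the pipeline used in the talk.

```python
from collections import Counter, defaultdict


def click_boosts(signals, top_n=3):
    """Aggregate (query, doc_id) click pairs into the top clicked docs per query."""
    counts = defaultdict(Counter)
    for query, doc_id in signals:
        # Light normalization so "iPad case" and "ipad case " share one bucket.
        counts[query.strip().lower()][doc_id] += 1
    return {query: clicked.most_common(top_n) for query, clicked in counts.items()}
```

The resulting per-query counts could then feed query-time boosts or serve as features for a learned ranking model.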
(Exact/Original Match)^X (Sloppy Phrase)~M^Y
(AND Q)^Z (OR Q)^XX
(Expansions/Click/Head/Tail Boosts)^YY (Personalization Biases)^ZZ
({!ltr …})
Filters + Options: security, rules, hard preferences, categories
The Perfect(?!?) Query* YMMV!
Precision
Recall
Caveat Emptor!
*Note: there are a lot of variations on this. edismax handles most
Learn to Rank
X > Y > Z > XX. All weights can be learned
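One possible reading of this query shape as Solr request parameters is sketched below. `defType`, `qf`, `pf`, `ps`, `mm`, `bq`, `fq` and `rq` are standard Solr parameters, and `{!ltr}` is the rerank parser from the Learning to Rank contrib. The field names (`name_exact`, `name`, `description`), the weights, the `mm` value and the helper names are assumptions made up for illustration.

```python
from urllib.parse import urlencode


def build_query_params(user_query, click_boosts=None, filters=None,
                       ltr_model=None, rerank_docs=100):
    """Assemble edismax parameters as a list of (name, value) pairs."""
    params = [
        ("defType", "edismax"),
        ("q", user_query),
        # Exact match weighted above analyzed fields (hypothetical fields and weights).
        ("qf", "name_exact^10 name^4 description^1"),
        # Sloppy phrase boost: phrase field plus allowed slop.
        ("pf", "name^6"),
        ("ps", "2"),
        # Minimum-should-match sits between a strict AND and a loose OR.
        ("mm", "2<75%"),
    ]
    # Click or personalization boosts become additive boost queries.
    for doc_id, weight in (click_boosts or {}).items():
        params.append(("bq", f"id:{doc_id}^{weight}"))
    # Security, rules and hard preferences go in filters, which do not affect scoring.
    for filter_query in filters or []:
        params.append(("fq", filter_query))
    if ltr_model:
        params.append(("rq", f"{{!ltr model={ltr_model} reRankDocs={rerank_docs}}}"))
    return params


def to_query_string(params):
    """URL-encode the parameters for a request to a /select handler."""
    return urlencode(params)
```

Keeping the weights as plain parameters makes it easy to vary them per experiment, or to replace hand-picked values with learned ones later.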
• Don't take my word for it, experiment!
• Good primer:
  • http://www.slideshare.net/InfoQ/online-controlled-experiments-introduction-insights-scaling-and-humbling-statistics
• Rules are fine, as long as they are contained, have a lifespan, and are measured for effectiveness
Experimentation, Not Editorialization
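A common first check when comparing two ranking variants is whether their click-through rates differ by more than chance. The sketch below uses a pooled two-proportion z-test, which is one standard choice and not necessarily what the linked primer recommends; the function name and the counts in the usage note are hypothetical.

```python
import math


def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Return (z, two-sided p-value) for the difference in click-through rate, B minus A."""
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    if std_err == 0:
        return 0.0, 1.0
    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability of a standard normal, via the complementary error function.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

For instance, `two_proportion_z_test(100, 1000, 150, 1000)` compares a 10% and a 15% click-through rate on 1,000 views each; a small p-value suggests the difference is unlikely to be noise.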
• But Wait, There’s More!
Fusion Architecture
SECURITY BUILT-IN
Shards Shards
Apache Solr
Apache Zookeeper
ZK 1
Leader Election
Load Balancing
ZK N
Shared Config Management
Worker Worker
Apache Spark
Cluster Manager
REST
API
Admin UI
Twigkit
LOGS FILE WEB DATABASE CLOUD
HDFS (Optional)
Core Services
Connectors
• • •
ETL and Query Pipelines
Recommenders/Signals/Rules
NLP
Machine Learning
Alerting and Messaging
Security
Scheduling
Key Features
• Solr:
  • Extensive Text Ranking Features
    • Similarity Models
    • Function Queries
    • Boost/Block
  • Pluggable Reranker
  • Learn to Rank contrib
  • Multi-tenant
• Spark
  • SparkML (Random Forests, Regression, etc.)
  • Large scale, distributed compute
Demo Details
• Best Buy Kaggle Competition Data Set
- Product Catalog: ~1.3M
- Signals: 1 month of query, document logs
• Fusion 3.1 Preview + Recommenders (sampled dataset) + Rules (open source add-on module) + Solr LTR contrib
• Twigkit UI (http://twigkit.com)
• http://lucidworks.com
• http://lucene.apache.org/solr
• http://spark.apache.org/
• https://github.com/lucidworks/spark-solr
• https://cwiki.apache.org/confluence/display/solr/Learning+To+Rank
• Bloomberg talk on LTR: https://www.youtube.com/watch?v=M7BKwJoh96s
Resources