Date post: | 16-Jul-2018 |
BM25 is so Yesterday: Modern Techniques for Better Search Relevance in Solr
Grant Ingersoll, CTO, Lucidworks
Lucene/Solr/Mahout Committer
iPad case
"ipad accessory"~3 OR "ipad case"~5
So, what do you do?
Index Time:
if (doc.name.contains("Vikings")) {
    doc.boost = 100;
}
OR
Query Time:
q: (MAIN QUERY) OR (name:Vikings)^Y
Term Frequency: How well does a term describe a document? Measure: how often a term occurs per document
Inverse Document Frequency: How important is a term overall? Measure: how rare the term is across all documents
TF*IDF
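\[
\mathrm{Score}(q,d) = \sum_{t \in q} \mathrm{idf}(t) \cdot \frac{\mathrm{tf}(t \in d)\,(k+1)}{\mathrm{tf}(t \in d) + k\,\bigl(1 - b + b\,|d| / \mathrm{avgdl}\bigr)}
\]
Where: t = term; d = document; q = query; i = index
\[
\mathrm{tf}(t \in d) = \mathrm{numTermOccurrencesInDocument}
\]
\[
\mathrm{idf}(t) = 1 + \log\bigl(\mathrm{numDocs} / (\mathrm{docFreq} + 1)\bigr)
\]
\[
|d| = \sum_{t \in d} 1, \qquad \mathrm{avgdl} = \frac{\sum_{d \in i} |d|}{\sum_{d \in i} 1}
\]
k = Free parameter. Usually ~1.2 to 2.0. Increases term frequency saturation point.
b = Free parameter. Usually ~0.75. Increases impact of document normalization.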
BM25 (aka Okapi)
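To make the formula concrete, here is a minimal Python sketch that scores a toy corpus with the same tf, idf, k, and b definitions as above. The function name `bm25_scores`, the whitespace tokenization, and the three sample documents are assumptions for illustration only; this is not how Lucene implements its similarity internally.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query_terms`.

    Uses the slide's definitions: idf(t) = 1 + log(numDocs / (docFreq + 1)),
    |d| = number of tokens in d, avgdl = mean |d| over the index.
    """
    num_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / num_docs
    # docFreq: number of documents that contain each term at least once
    doc_freq = Counter(t for d in docs for t in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        # Length normalization term: k * (1 - b + b * |d| / avgdl)
        norm = k * (1.0 - b + b * len(d) / avgdl)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue  # term absent from this document contributes nothing
            idf = 1.0 + math.log(num_docs / (doc_freq[t] + 1))
            score += idf * tf[t] * (k + 1) / (tf[t] + norm)
        scores.append(score)
    return scores


# Hypothetical toy index, tokenized on whitespace
docs = [
    "ipad case".split(),
    "ipad accessory leather case for ipad".split(),
    "vikings jersey".split(),
]
scores = bm25_scores(["ipad", "case"], docs)
```

In this toy index the short, exact document outscores the longer one even though the longer one repeats "ipad": term frequency saturates (k) and the longer document is penalized by length normalization (b).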
Lather, Rinse, Repeat
WWGD?
Capture and log pretty much everything: Searches, Time on page/1st click, What was not chosen, etc.
Precision: Of those shown, what's relevant?
Recall: Of all that's relevant, what was found?
NDCG: Account for position
Measure, Measure, Measure
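The three measures above can be sketched in a few lines of Python. This is a minimal illustration, assuming binary relevance for precision and recall and graded gains with a log2 position discount for NDCG (one common formulation); the function names, document ids, and gain values are hypothetical.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Of those shown (top k), what fraction is relevant?"""
    shown = ranked[:k]
    if not shown:
        return 0.0
    return sum(1 for doc in shown if doc in relevant) / len(shown)


def recall_at_k(ranked, relevant, k):
    """Of all that is relevant, what fraction was found in the top k?"""
    if not relevant:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / len(relevant)


def ndcg_at_k(ranked, gains, k):
    """Position-aware score: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(gains.get(doc, 0) / math.log2(i + 2)
              for i, doc in enumerate(ranked[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical judgments: graded gain per relevant document id
gains = {"d1": 3, "d2": 2, "d5": 1}
relevant = set(gains)
ranked = ["d3", "d1", "d7", "d2"]  # what the engine returned, in order

p = precision_at_k(ranked, relevant, 4)
r = recall_at_k(ranked, relevant, 4)
n = ndcg_at_k(ranked, gains, 4)
```

Precision and recall ignore order within the top k; NDCG drops when relevant documents are pushed down the list, which is why it is the one that "accounts for position".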
Magic
Guessing
Core Information Theory (aka Lucene/Solr)
Search Aids (Facets, Did You Mean, Highlighting)
Machine Learning (Clicks, Recs, Personalization, User feedback)
Rules, Domain Specific Knowledge
fuhgeddaboudit
Content Collaboration Context
Core Solr capabilities: text matching, faceting, spell checking, highlighting
Business Rules for content: landing pages, boost/block, promotions, etc.
Leverage collective intelligence to predict what users will do based on historical,
aggregated data
Recommenders, Popularity, Search Paths
Who are you? Where are you? What have you done previously?
User/Market Segmentation, Roles, Security, Personalization
Next Generation Relevance
But What About the Real World? Indexing Edition
NER, Topic Detection, Clustering
Word2Vec, etc.
Domain Rules: Synonyms, Regexes, Lexical Resources
Extraction
Load Into Spark: Build W2V, PageRank, Topic, Clustering Models
Offline
Content
Models
But What About the Real World? Query Edition
Query Intent: Strategic, Tactical, Semantic
iPad case
Head/Tail/Clickstream enhancement
User Factors: Segmentation, Location, History, Profile, Security
Parse
Domain Specific Rules
Transform Results
Cascading Rerankers: Learn To Rank (multi-model), Bias corrections
But What About the Real World? Signals Edition
Load Into Spark, Clickstream Models, Signals
Query Analysis Jobs, Recommenders/Personalization
iPad case
Query Edition
Raw
Models
(Exact/Original Match)^X (Sloppy Phrase)~M^Y
(AND Q)^Z (OR Q)^XX
(Expansions/Click/Head/Tail Boosts)^YY (Personalization Biases)^ZZ
({!ltr})
Filters + Options: security, rules, hard preferences, categories
The Perfect(?!?) Query* YMMV!
Precision, Recall
Caveat Emptor!
*Note: there are a lot of variations on this. edismax handles most
Learn to Rank
X > Y > Z > XX. All weights can be learned
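As a rough sketch of the layered query above, the following Python builds the first four tiers as a Lucene-syntax query string, with boosts ordered X > Y > Z > XX. The helper name `build_layered_query`, the `name` field, the numeric weights, the `category:accessories` filter, and the model name `my_model` are all placeholder assumptions; the expansion and personalization tiers are omitted, and no escaping of special characters is attempted.

```python
def build_layered_query(terms, field="name", slop=3,
                        exact=20.0, sloppy=10.0, all_terms=5.0, any_term=1.0):
    """Combine exact, sloppy-phrase, AND, and OR clauses with descending boosts.

    The default weights are arbitrary placeholders; in practice they would be
    tuned by experiment or learned. Terms are assumed to need no escaping.
    """
    phrase = " ".join(terms)
    and_q = " AND ".join(terms)
    or_q = " OR ".join(terms)
    clauses = [
        f'{field}:"{phrase}"^{exact}',           # (Exact/Original Match)^X
        f'{field}:"{phrase}"~{slop}^{sloppy}',   # (Sloppy Phrase)~M^Y
        f"{field}:({and_q})^{all_terms}",        # (AND Q)^Z
        f"{field}:({or_q})^{any_term}",          # (OR Q)^XX
    ]
    return " OR ".join(f"({c})" for c in clauses)


# Request parameters as they might be sent to a Solr select handler.
params = {
    "q": build_layered_query(["ipad", "case"]),
    "fq": ["category:accessories"],               # hypothetical hard filter
    "rq": "{!ltr model=my_model reRankDocs=100}",  # LTR rerank; model name assumed
}
```

The precise clauses carry the largest boosts so they dominate when they match, while the broad OR clause keeps recall from collapsing when they do not.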
Don't take my word for it, experiment! Good primer:
http://www.slideshare.net/InfoQ/online-controlled-experiments-introduction-insights-scaling-and-humbling-statistics
Rules are fine, as long as they are contained, have a lifespan and are measured for effectiveness
Experimentation, Not Editorialization
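One simple way to check whether a relevance change actually helped is to compare click-through rates between a control and a variant. The sketch below uses a two-proportion z-test, which is one common choice and not something the slides prescribe; the function name and the click counts are made up for illustration.

```python
import math


def two_proportion_z_test(clicks_a, n_a, clicks_b, n_b):
    """Return (z, two-sided p-value) for the difference in click-through rate.

    A = control, B = variant. Assumes independent searches and samples large
    enough for the normal approximation to hold.
    """
    p_a = clicks_a / n_a
    p_b = clicks_b / n_b
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical experiment: 10,000 searches per arm
z, p_value = two_proportion_z_test(1000, 10000, 1100, 10000)
```

The same check applies to a hand-written rule: if its variant does not beat the control, that is a signal to retire the rule rather than keep it indefinitely.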
Show Us Already, Will You!
But Wait, There's More!
Fusion Architecture
[Architecture diagram; component labels:]
Security built in
Apache Solr: Shards
Apache Zookeeper (ZK 1 to ZK N): Leader Election, Load Balancing, Shared Config Management
Apache Spark: Cluster Manager, Workers
REST API, Admin UI, Twigkit
Data sources: Logs, File, Web, Database, Cloud, HDFS (Optional)
Core Services: Connectors, ETL and Query Pipelines, Recommenders/Signals/Rules, NLP, Machine Learning, Alerting and Messaging, Security, Scheduling
Key Features
Solr: Extensive Text Ranking Features
- Similarity Models
- Function Queries
- Boost/Block
- Pluggable Reranker
- Learn to Rank contrib
- Multi-tenant
Spark
- Spark ML (Random Forests, Regression, etc.)
- Large scale, distributed compute
Demo Details
Best Buy Kaggle Competition Data Set
- Product Catalog: ~1.3M
- Signals: 1 month of query, document logs
Fusion 3.1 Preview + Recommenders (sampled dataset) + Rules (open source add-on module) + Solr LTR contrib
Twigkit UI (http://twigkit.com)
http://lucidworks.com
http://lucene.apache.org/solr
http://spark.apache.org/
https://github.com/lucidworks/spark-solr
https://cwiki.apache.org/confluence/display/solr/Learning+To+Rank
Bloomberg talk on LTR: https://www.youtube.com/watch?v=M7BKwJoh96s
Resources