BM25 is so Yesterday - is so Yesterday ... Relevance in Solr Grant Ingersoll ... ,...

Date post:16-Jul-2018
Transcript:
  • BM25 is so Yesterday: Modern Techniques for Better Search Relevance in Solr

    Grant Ingersoll, CTO, Lucidworks

    Lucene/Solr/Mahout Committer

  • iPad case

  • iPad case

    "ipad accessory"~3 OR "ipad case"~5

  • 1.

    15.

  • So, what do you do?

  • Index Time:

    if (doc.name.contains("Vikings")) { doc.boost = 100 }

    OR

    Query Time:

    q: (MAIN QUERY) OR (name:Vikings)^Y
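    The two hand-tuned boosting options on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names, the `boost` key on a plain dict, and the sample query are hypothetical and are not Solr APIs. The query-time helper simply assembles a Lucene-syntax string of the shape shown on the slide.

```python
def apply_index_time_boost(doc, needle="Vikings", boost=100.0):
    """Return a copy of doc with a hypothetical 'boost' key set when the name matches."""
    boosted = dict(doc)
    if needle in boosted.get("name", ""):
        boosted["boost"] = boost
    return boosted


def boosted_query(main_query, field, value, boost):
    """Build '(MAIN QUERY) OR (field:value)^boost' as a query string."""
    return f"({main_query}) OR ({field}:{value})^{boost:g}"


if __name__ == "__main__":
    print(apply_index_time_boost({"name": "Minnesota Vikings jersey"}))
    print(boosted_query("football jersey", "name", "Vikings", 10))
```

    Either way the weight is a guess made by hand, which is the problem the rest of the deck addresses.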

  • Term Frequency: How well a term describes a document? Measure: how often a term occurs per document

    Inverse Document Frequency: How important is a term overall? Measure: how rare the term is across all documents

    TF*IDF
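    As a rough illustration of TF*IDF, the sketch below multiplies a raw term count by an inverse document frequency. Assumptions: documents are already tokenized into lists, the function name and toy corpus are hypothetical, and the idf form `1 + log(numDocs / (docFreq + 1))` is borrowed from the BM25 slide because this slide does not give one.

```python
import math
from collections import Counter


def tf_idf(term, doc_tokens, corpus):
    """Score one term for one document: raw term count times idf."""
    tf = Counter(doc_tokens)[term]                      # how often the term occurs in the doc
    doc_freq = sum(1 for d in corpus if term in d)      # how many docs contain the term
    idf = 1.0 + math.log(len(corpus) / (doc_freq + 1))  # rarer terms get a larger idf
    return tf * idf


if __name__ == "__main__":
    corpus = [["ipad", "case"], ["ipad", "charger"], ["phone", "case"], ["laptop", "bag"]]
    print(tf_idf("ipad", corpus[0], corpus))    # common term
    print(tf_idf("laptop", corpus[3], corpus))  # rarer term scores higher
```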

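  • $$\mathrm{Score}(q,d) = \sum_{t \in q} \mathrm{idf}(t) \cdot \frac{\mathrm{tf}(t \in d)\,(k+1)}{\mathrm{tf}(t \in d) + k\,\left(1 - b + b\,|d| / \mathrm{avgdl}\right)}$$

    Where: t = term; d = document; q = query; i = index

    $\mathrm{tf}(t \in d)$ = numTermOccurrencesInDocument

    $\mathrm{idf}(t) = 1 + \log\left(\mathrm{numDocs} / (\mathrm{docFreq} + 1)\right)$

    $|d| = \sum_{t \in d} 1$

    $\mathrm{avgdl} = \left(\sum_{d \in i} |d|\right) / \left(\sum_{d \in i} 1\right)$

    k = Free parameter. Usually ~1.2 to 2.0. Increases term frequency saturation point.

    b = Free parameter. Usually ~0.75. Increases impact of document normalization.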

    BM25 (aka Okapi)
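    The BM25 formula on this slide can be written out as a minimal Python sketch. It follows the slide's definitions, including its idf form, which is not necessarily the exact idf used by a given Lucene release. The function name, the tokenized-list document representation, and the toy corpus are assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc, corpus, k=1.2, b=0.75):
    """Sum the per-term BM25 contributions of query_terms for one tokenized doc."""
    avgdl = sum(len(d) for d in corpus) / len(corpus)  # average document length
    tf = Counter(doc)
    norm = k * (1 - b + b * len(doc) / avgdl)          # length-normalized saturation term
    score = 0.0
    for t in query_terms:
        doc_freq = sum(1 for d in corpus if t in d)
        idf = 1.0 + math.log(len(corpus) / (doc_freq + 1))
        score += idf * tf[t] * (k + 1) / (tf[t] + norm)
    return score


if __name__ == "__main__":
    corpus = [
        ["ipad", "case"],
        ["ipad", "case", "leather", "black", "slim", "cover"],
        ["phone", "charger"],
    ]
    for doc in corpus:
        print(doc, round(bm25_score(["ipad"], doc, corpus), 3))
```

    With the same term count, the shorter document scores higher because `b` penalizes length, and repeating a term yields diminishing returns because the per-term contribution stays below `idf * (k + 1)`.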

  • Lather, Rinse, Repeat

  • WWGD?

  • Capture and log pretty much everything: searches, time on page/1st click, what was not chosen, etc.

    Precision: Of those shown, what's relevant? Recall: Of all that's relevant, what was found? NDCG: Account for position

    Measure, Measure, Measure
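    The three measures named on this slide can be computed from logged results roughly as follows. This is a sketch under stated assumptions: relevance judgments are given as a set of document ids (or as graded gains for NDCG), the function names are hypothetical, and the NDCG variant normalizes against the best ordering of the returned list only, which is a simplification.

```python
import math


def precision(shown, relevant):
    """Of those shown, what fraction is relevant?"""
    if not shown:
        return 0.0
    return sum(1 for d in shown if d in relevant) / len(shown)


def recall(shown, relevant):
    """Of all that is relevant, what fraction was found?"""
    if not relevant:
        return 0.0
    return sum(1 for d in set(shown) if d in relevant) / len(relevant)


def dcg(gains):
    """Discounted cumulative gain: gains lower in the list count for less."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg(ranked_gains):
    """DCG divided by the DCG of the ideal (descending) ordering of the same gains."""
    ideal = dcg(sorted(ranked_gains, reverse=True))
    return dcg(ranked_gains) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    shown, relevant = ["a", "b", "c", "d"], {"a", "c", "e"}
    print(precision(shown, relevant), recall(shown, relevant))
    print(ndcg([3, 2, 1]), ndcg([1, 2, 3]))
```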

  • Magic

    Guessing

    Core Information Theory (aka Lucene/Solr)

    Search Aids (Facets, Did You Mean, Highlighting)

    Machine Learning (Clicks, Recs, Personalization, User feedback)

    Rules, Domain Specific Knowledge

    fuhgeddaboudit

  • Content Collaboration Context

    Core Solr capabilities: text matching, faceting, spell checking, highlighting

    Business Rules for content: landing pages, boost/block, promotions, etc.

    Leverage collective intelligence to predict what users will do based on historical, aggregated data

    Recommenders, Popularity, Search Paths

    Who are you? Where are you? What have you done previously?

    User/Market Segmentation, Roles, Security, Personalization

    Next Generation Relevance

  • But What About the Real World? Indexing Edition

    NER, Topic Detection, Clustering

    Word2Vec, etc.

    Domain Rules: Synonyms, Regexes, Lexical Resources

    Extraction

    Load Into Spark Build W2V, PageRank, Topic, Clustering Models

    Offline

    Content

    Models

  • But What About the Real World? Query Edition

    Query Intent: Strategic, Tactical, Semantic

    iPad case

    Head/Tail/Clickstream enhancement

    User Factors: Segmentation, Location, History, Profile, Security

    Parse

    Domain Specific Rules Transform Results

    Cascading Rerankers Learn To Rank (multi-model), Bias corrections

  • But What About the Real World? Signals Edition

    Load Into Spark Clickstream Models Signals

    Query Analysis Jobs Recommenders/Personalization

    iPad case

    Query Edition

    Raw

    Models

  • (Exact/Original Match)^X (Sloppy Phrase)~M^Y

    (AND Q)^Z (OR Q)^XX

    (Expansions/Click/Head/Tail Boosts)^YY (Personalization Biases)^ZZ

    ({!ltr})

    Filters + Options: security, rules, hard preferences, categories

    The Perfect(?!?) Query* YMMV!

    Precision Recall

    Caveat Emptor!

    *Note: there are a lot of variations on this. edismax handles most

    Learn to Rank

    X > Y > Z > XX. All weights can be learned
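    The first four clauses of this layered query can be assembled as a Lucene-syntax string, as sketched below. Everything here is an assumption for illustration: the function name, the `name` field, and the default weights are hypothetical, user input is not escaped, and the expansion, personalization, `{!ltr}` and filter layers are left out. In practice edismax parameters would cover most of this.

```python
def layered_query(user_query, field="name", slop=3, x=10, y=5, z=2, xx=1):
    """Combine exact, sloppy-phrase, AND and OR clauses with decreasing weights."""
    if not (x > y > z > xx):
        raise ValueError("weights must satisfy X > Y > Z > XX")
    terms = user_query.split()
    exact = f'{field}:"{user_query}"^{x:g}'              # most precise clause
    sloppy = f'{field}:"{user_query}"~{slop}^{y:g}'      # phrase with slop M
    and_q = f'{field}:({" AND ".join(terms)})^{z:g}'     # all terms required
    or_q = f'{field}:({" OR ".join(terms)})^{xx:g}'      # any term, best recall
    return " OR ".join(f"({clause})" for clause in (exact, sloppy, and_q, or_q))


if __name__ == "__main__":
    print(layered_query("ipad case"))
```

    The ordering check mirrors the slide's X > Y > Z > XX constraint: the more precise a clause is, the more it should contribute to the score.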

  • Don't take my word for it, experiment! Good primer:

    http://www.slideshare.net/InfoQ/online-controlled-experiments-introduction-insights-scaling-and-humbling-statistics

    Rules are fine, as long as they are contained, have a lifespan and are measured for effectiveness

    Experimentation, Not Editorialization
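    One common way to check whether a relevance change helped is a two-proportion z-test on click-through rate between a control and a variant. The sketch below is an assumption about how such an experiment might be analyzed, not something prescribed by the slides. The function name and the sample counts are hypothetical.

```python
import math


def ctr_z_test(clicks_a, views_a, clicks_b, views_b):
    """Return (z, two-sided p-value) for the difference in click-through rate, B minus A."""
    p_a = clicks_a / views_a
    p_b = clicks_b / views_b
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    if se == 0:
        raise ValueError("no variance: every view was a click, or none were")
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return z, p_value


if __name__ == "__main__":
    # Hypothetical counts: control ranking (A) versus new ranking (B).
    z, p = ctr_z_test(100, 1000, 150, 1000)
    print(f"z={z:.2f}, p={p:.4f}")
```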

  • Show Us Already, Will You!

  • But Wait, There's More!

    Fusion Architecture

    SECURITY BUILT-IN

    Shards Shards

    Apache Solr

    Apache Zookeeper

    ZK 1

    Leader Election Load Balancing

    ZK N

    Shared Config Management

    Worker Worker

    Apache Spark

    Cluster Manager

    REST

    API

    Admin UI

    Twigkit

    LOGS FILE WEB DATABASE CLOUD

    HDFS (Optional)

    Core Services

    Connectors

    ETL and Query Pipelines

    Recommenders/Signals/Rules

    NLP

    Machine Learning

    Alerting and Messaging

    Security

    Scheduling

  • Key Features

    Shards Shards

    Apache Solr

    Worker Worker

    Apache Spark

    Cluster Manager

    Solr: Extensive Text Ranking Features

    Similarity Models, Function Queries, Boost/Block

    Pluggable Reranker, Learn to Rank contrib, Multi-tenant

    Spark: SparkML (Random Forests, Regression, etc.), Large scale, distributed compute

  • Demo Details

    Best Buy Kaggle Competition Data Set

    - Product Catalog: ~1.3M

    - Signals: 1 month of query, document logs

    Fusion 3.1 Preview + Recommenders (sampled dataset) + Rules (open source add-on module) + Solr LTR contrib

    Twigkit UI (http://twigkit.com)

  • http://lucidworks.com http://lucene.apache.org/solr http://spark.apache.org/ https://github.com/lucidworks/spark-solr https://cwiki.apache.org/confluence/display/solr/Learning+To+Rank

    Bloomberg talk on LTR: https://www.youtube.com/watch?v=M7BKwJoh96s

    Resources
