Transcript
Page 1: Language Models

Language Models
Naama Kraus

Slides are based on Introduction to Information Retrieval Book by Manning, Raghavan and Schütze

Page 2: Language Models

IR approaches

• Boolean retrieval
  – Boolean constraints on term occurrences in documents – no ranking

• Vector space model
  – Queries and documents are represented as vectors in a high-dimensional space
  – Notions of similarity (cosine similarity) imply ranking

• Probabilistic model
  – Rank documents by the probability P(R|d,q)
  – Estimate P(R|d,q) using the relevance feedback technique

• Language Models – today’s class

Page 3: Language Models

Intuition

• Users, when thinking of a good query, think of words that are likely to appear in relevant documents

• Language model approach: a document is a good match to a query if the document model is likely to generate the query
  – which happens if the document contains the query words often

Page 4: Language Models

Illustration

[Figure: a language model is derived from the document and generates the query]

Page 5: Language Models

Traditional language model

• Finite automata
• Generative model

Example: an automaton that can generate the strings
  I wish
  I wish I wish
  I wish I wish I wish
  ……

The language of the automaton: the full set of strings that it can generate

Page 6: Language Models

Probabilistic language model

• Each node has a probability distribution over generating different terms

• A language model is a function that puts a probability measure over strings drawn from some vocabulary

Page 7: Language Models

Language model example

A single-state automaton s with state emission probabilities (partial):

  the    0.2
  a      0.1
  frog   0.01
  toad   0.01
  said   0.03
  likes  0.02
  that   0.04
  …
  STOP   0.2

This is a unigram language model.

Probability that some text (e.g. a query) was generated by the model:

P(frog said that toad likes frog) = 0.01 x 0.03 x 0.04 x 0.01 x 0.02 x 0.01

(We ignore continue/stop probabilities, assuming they are fixed for all queries)
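Written out in general terms (the formula itself does not appear in the transcript; this is the standard unigram form, again ignoring STOP/continue probabilities):

P(t1 t2 … tn | M) = P(t1|M) x P(t2|M) x … x P(tn|M)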

Page 8: Language Models

Query likelihood

String s:  frog    said  that  toad    likes  that  dog

M1:        0.01    0.03  0.04  0.01    0.02   0.04  0.005
M2:        0.0002  0.03  0.04  0.0001  0.04   0.04  0.01

q = frog likes toad

P(q | M1) = 0.01 x 0.02 x 0.01
P(q | M2) = 0.0002 x 0.04 x 0.0001

P(q|M1) > P(q|M2)

=> M1 is more likely to generate query q
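A minimal Python sketch of this comparison, hard-coding the per-term probabilities from the table above (all names are illustrative):

# per-term emission probabilities taken from the table above
M1 = {"frog": 0.01, "said": 0.03, "that": 0.04, "toad": 0.01,
      "likes": 0.02, "dog": 0.005}
M2 = {"frog": 0.0002, "said": 0.03, "that": 0.04, "toad": 0.0001,
      "likes": 0.04, "dog": 0.01}

def query_likelihood(query, model):
    """Unigram query likelihood: the product of per-term probabilities."""
    p = 1.0
    for term in query.split():
        p *= model[term]
    return p

q = "frog likes toad"
print(query_likelihood(q, M1))  # 0.01 * 0.02 * 0.01     = 2e-06
print(query_likelihood(q, M2))  # 0.0002 * 0.04 * 0.0001 = 8e-10
# P(q|M1) > P(q|M2), so M1 is more likely to have generated q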

Page 9: Language Models

Types of language models

How do we build probabilities over sequences of terms?

P(t1 t2 t3 t4) = P(t1) x P(t2|t1) x P(t3|t1 t2) x P(t4|t1 t2 t3)

Unigram language model – the simplest; no conditioning context

P(t1 t2 t3 t4) = P(t1) x P(t2) x P(t3) x P(t4)

Bigram language model – condition on previous term

P(t1 t2 t3 t4) = P(t1) x P(t2|t1) x P(t3|t2) x P(t4|t3)

Trigram language model …

Unigram model is the most common in IR
• Often sufficient to judge the topic of a document
• Data sparseness issues when using richer models
• Simple and efficient implementation
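As a concrete illustration, a short Python sketch contrasting the unigram and bigram factorizations with MLE estimates from a toy word sequence (the text and all names are illustrative):

from collections import Counter

tokens = "i wish i wish i wish".split()
N = len(tokens)

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def p_unigram(t):
    # P(t) = count(t) / N
    return unigram_counts[t] / N

def p_bigram(prev, t):
    # P(t | prev) = count(prev t) / count(prev)
    return bigram_counts[(prev, t)] / unigram_counts[prev]

# P("i wish") under each factorization
print(p_unigram("i") * p_unigram("wish"))      # unigram: P(i) x P(wish)   = 0.25
print(p_unigram("i") * p_bigram("i", "wish"))  # bigram:  P(i) x P(wish|i) = 0.5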

Page 10: Language Models

The query likelihood model

• Goal: rank documents by P(d|q)
  – The probability that a user querying q had document d in mind
• Bayes Rule: P(d|q) = P(q|d)P(d)/P(q)
• P(q) – the same for all documents, so it is ignored
• P(d) – often treated as uniform across documents, so it is ignored
  – Could be a non-uniform prior based on criteria like authority, length, genre, newness …

• Rank by P(q|d)

Page 11: Language Models

The query likelihood model (2)

• P(q|d) - the probability that a query q was generated by a language model derived from document d
  – The probability that the query would be observed as a random sample from the respective document model

• Algorithm:
  1. Infer an LM Md for each document d
  2. Estimate P(q|Md)
  3. Rank the documents according to these probabilities
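A minimal Python sketch of these three steps, using unsmoothed MLE document models for now (smoothing is introduced in later slides; all names are illustrative):

from collections import Counter

def infer_document_model(text):
    """Step 1: a unigram LM Md, i.e. MLE term probabilities from the document."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    return {t: c / len(tokens) for t, c in counts.items()}

def query_likelihood(query, model):
    """Step 2: P(q|Md) as a product of per-term probabilities (0 if a term is unseen)."""
    p = 1.0
    for t in query.lower().split():
        p *= model.get(t, 0.0)
    return p

def rank(docs, query):
    """Step 3: order documents by their query likelihood, highest first."""
    models = {name: infer_document_model(text) for name, text in docs.items()}
    return sorted(docs, key=lambda name: query_likelihood(query, models[name]), reverse=True)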

Page 12: Language Models

Illustration

[Figure: language models Md1, Md2, Md3 are derived from documents d1, d2, d3; the query is scored against each model, giving P(q|Md1), P(q|Md2), P(q|Md3)]

E.g., P(q|Md3) > P(q|Md1) > P(q|Md2)  =>  d3 is ranked first, d1 second, d2 third

Page 13: Language Models

Estimating P(q|Md)

Use Maximum Likelihood Estimation - MLE

Assume a unigram language model (terms occur independently)

unigram MLE
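The estimation formula is not written out in the transcript; the standard unigram MLE form (with tf(t,d) the number of occurrences of term t in document d, and Ld the length of d) is:

P(q|Md) = Pmle(t1|Md) x Pmle(t2|Md) x …  (one factor per query term t)
Pmle(t|Md) = tf(t,d) / Ld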

Page 14: Language Models

Sparse data problem

• Documents are sparse
  – Some words don’t appear in the document
  – In particular, some of the query terms

• P(q|d) = 0 ; zero probability problem
  – Conjunctive semantics

• Occurring words are poorly estimated
  – A single document is a small training set
  – Occurring words are over-estimated
    • Their occurrence was partly by chance

Page 15: Language Models

Solution: smoothing

• Smooth probabilities in LMs
  – overcome zero probabilities
  – give some probability mass to unseen words

• The probability of a non-occurring term should be close to its probability of occurring in the collection

P(t|Mc) = cf(t)/T
• cf(t) = #occurrences of term t in the collection
• T – length of the collection = sum of all document lengths

Page 16: Language Models

Smoothing methods

Linear Interpolation

Bayesian smoothing

Summary, with linear interpolation. In practice, the log is taken of both sides of the equation to avoid multiplying many small numbers.
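The formulas themselves are not reproduced in the transcript; the standard forms of the two methods named above are (with Pmle the document MLE model, Mc the collection model, tf(t,d) the term frequency, Ld the document length, and lambda and mu the smoothing parameters):

Linear interpolation:
P(t|d) = lambda x Pmle(t|Md) + (1 - lambda) x P(t|Mc)

Bayesian (Dirichlet prior) smoothing:
P(t|d) = ( tf(t,d) + mu x P(t|Mc) ) / ( Ld + mu )

Summary, with linear interpolation:
P(q|d) is proportional to the product over query terms t of [ lambda x Pmle(t|Md) + (1 - lambda) x P(t|Mc) ]
and, taking logs of both sides:
log P(q|d) = sum over query terms t of log[ lambda x Pmle(t|Md) + (1 - lambda) x P(t|Mc) ]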

Page 17: Language Models

Exercise

Given a collection of two documents D1 , D2

D1: Xyzzy reports a profit but revenue is down
D2: Quorus narrows quarter loss but revenue decreases further

A user submitted the query: “revenue down”

Rank D1 and D2.
Use an MLE unigram model and linear interpolation smoothing with lambda parameter 0.5
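One way to check the exercise: a minimal Python sketch, assuming whitespace tokenization and lowercasing (function and variable names are illustrative):

from collections import Counter

def rank_by_query_likelihood(docs, query, lam=0.5):
    """Rank documents by smoothed query likelihood (linear interpolation)."""
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    collection = [t for toks in tokenized.values() for t in toks]
    cf = Counter(collection)          # collection frequencies
    T = len(collection)               # total collection length

    scores = {}
    for name, toks in tokenized.items():
        tf = Counter(toks)
        Ld = len(toks)
        p = 1.0
        for t in query.lower().split():
            p_mle = tf[t] / Ld                    # document model (MLE)
            p_col = cf[t] / T                     # collection model
            p *= lam * p_mle + (1 - lam) * p_col  # linear interpolation
        scores[name] = p
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

docs = {
    "D1": "Xyzzy reports a profit but revenue is down",
    "D2": "Quorus narrows quarter loss but revenue decreases further",
}
print(rank_by_query_likelihood(docs, "revenue down"))
# D1: 0.5*(1/8) + 0.5*(2/16) = 0.125 for 'revenue', 0.5*(1/8) + 0.5*(1/16) ~ 0.094 for 'down' -> ~0.0117
# D2: 0.125 for 'revenue',          0.5*0 + 0.5*(1/16) ~ 0.031 for 'down'                    -> ~0.0039
# => D1 is ranked above D2

With lambda = 0.5 both documents score the same for “revenue”, so the ranking is decided by “down”, which occurs only in D1.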

Page 18: Language Models

Extended LM approaches

[Figure: a query model P(t|query) is derived from the query and a document model P(t|document) from the document; the connections between them correspond to query likelihood, document likelihood, and model comparison]

• Query likelihood P(q|d) – the probability of the document LM generating the query (seen in the previous slides)
• Document likelihood P(d|q) – the probability of the query LM generating the document (in the next slides)
• Model comparison R(d;q) – compare the document and query models (in the next slides)

Page 19: Language Models

Document likelihood model

• P(d|q) – the probability of the query LM generating the document

• Problem: queries are short => bad model estimation

• [Zhai and Lafferty 2001]
  – Expand the query with terms taken from relevant documents in the usual way, and hence update the language model

Page 20: Language Models

KL divergence

• Kullback–Leibler (KL) divergence
  – An asymmetric divergence measure from information theory
  – Measures the difference between two probability distributions P, Q
  – Typically Q is an estimation of P

• Properties
  – Non-negative
  – Equals 0 iff P equals Q
  – May have an infinite value
  – Asymmetric, thus not a metric

• Jensen–Shannon (JS) divergence
  – Based on KL divergence (D)
  – Always finite
  – 0 <= JSD <= 1
  – Symmetric
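The definitions are not written out in the transcript; for discrete distributions over terms t, the standard forms are:

D(P || Q) = sum over t of P(t) x log( P(t) / Q(t) )

JSD(P || Q) = 1/2 D(P || M) + 1/2 D(Q || M), where M = 1/2 (P + Q)
(the bound 0 <= JSD <= 1 holds when base-2 logarithms are used)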

Page 21: Language Models

Model comparison

Build an LM from both the query and the document

Measure how different these LMs are from each other

Use KL divergence

Rank by KL divergence – the closer to 0, the higher the rank
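A minimal Python sketch of this ranking procedure, assuming linear-interpolation smoothing for the document model and that every query term occurs somewhere in the collection (all names are illustrative):

import math
from collections import Counter

def kl_divergence(p, q):
    """D(P || Q) = sum_t P(t) * log(P(t)/Q(t)); assumes q(t) > 0 wherever p(t) > 0."""
    return sum(pt * math.log(pt / q[t]) for t, pt in p.items() if pt > 0)

def rank_by_kl(docs, query, lam=0.5):
    """Rank documents by KL divergence between the query LM and a smoothed document LM."""
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    collection = [t for toks in tokenized.values() for t in toks]
    cf, T = Counter(collection), len(collection)

    q_toks = query.lower().split()
    q_lm = {t: c / len(q_toks) for t, c in Counter(q_toks).items()}  # MLE query model

    scores = {}
    for name, toks in tokenized.items():
        tf, Ld = Counter(toks), len(toks)
        # smoothed document model, restricted to the query terms (other terms contribute 0 to the KL sum)
        d_lm = {t: lam * tf[t] / Ld + (1 - lam) * cf[t] / T for t in q_lm}
        scores[name] = kl_divergence(q_lm, d_lm)
    return sorted(scores.items(), key=lambda kv: kv[1])  # smaller divergence ranks higher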

Page 22: Language Models

Language models - summary

• Probabilistic model
  – Mathematically precise
• Intuitive, simple concept
• Achieves very good retrieval results
  – Still, no evidence that it exceeds the traditional vector space model

• Relation to the Vector Space Model
  – Both use term frequency
  – Smoothing with the collection generation probability is a little like idf
    • Terms rare in the general collection but common in some documents will have a greater influence on the document’s ranking
  – Probabilistic vs. geometric
  – Mathematical model vs. heuristic model

