
Language Models

Page 1: Language Models

Language Models
Naama Kraus

Slides are based on Introduction to Information Retrieval Book by Manning, Raghavan and Schütze

Page 2: Language Models

IR approaches
• Boolean retrieval
– Boolean constraints on term occurrences in documents – no ranking

• Vector space model
– Queries and documents are represented as vectors in a high-dimensional space
– A notion of similarity (cosine similarity) implies a ranking

• Probabilistic model
– Rank documents by the probability P(R|d,q)
– Estimate P(R|d,q) using relevance feedback techniques

• Language models – today’s class

Page 3: Language Models

Intuition

• Users who try to think of a good query think of words that are likely to appear in relevant documents

• Language model approach: a document is a good match to a query if the document model is likely to generate the query
– i.e., if the document contains the query words often

Page 4: Language Models

Illustration

(Diagram: a document induces a language model, which in turn generates the query.)

Page 5: Language Models

Traditional language model

• Finite automata
• Generative model

Example: a one-state automaton that loops over the words “I wish” can generate
I wish
I wish I wish
I wish I wish I wish
…

The language of the automaton: the full set of strings that it can generate
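A minimal Python sketch (not from the slides) of such a generative automaton: a single state emits “I wish” repeatedly and stops with a fixed probability. The stop probability of 0.2 is an illustrative assumption.

import random

def generate(stop_prob=0.2):
    # Generate one string from the toy "I wish" automaton.
    words = []
    while True:
        words.extend(["I", "wish"])
        if random.random() < stop_prob:   # take the STOP transition
            return " ".join(words)

for _ in range(3):
    print(generate())   # e.g. "I wish", "I wish I wish I wish", ...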

Page 6: Language Models

Probabilistic language model

• Each node has a probability distribution over generating different terms

• A language model is a function that puts a probability measure over strings drawn from some vocabulary

Page 7: Language Models

Language model example

State emission probabilities (partial) of a one-state unigram language model:

the     0.2
a       0.1
frog    0.01
toad    0.01
said    0.03
likes   0.02
that    0.04
…
STOP    0.2

The probability that some text (e.g. a query) was generated by the model is the product of the emission probabilities of its terms:

P(frog said that toad likes frog) = 0.01 x 0.03 x 0.04 x 0.01 x 0.02 x 0.01

(We ignore continue/stop probabilities, assuming they are fixed for all queries)
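A minimal Python sketch (not from the slides) of this computation, using the partial emission table above; terms not in the table simply get probability 0 here.

M = {"the": 0.2, "a": 0.1, "frog": 0.01, "toad": 0.01,
     "said": 0.03, "likes": 0.02, "that": 0.04}

def unigram_prob(text, model):
    # P(text | model) under a unigram LM, ignoring continue/STOP probabilities.
    p = 1.0
    for term in text.split():
        p *= model.get(term, 0.0)   # unseen terms get probability 0 (no smoothing yet)
    return p

print(unigram_prob("frog said that toad likes frog", M))  # 0.01*0.03*0.04*0.01*0.02*0.01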

Page 8: Language Models

Query likelihood

term   frog     said   that   toad     likes  that   dog

M1     0.01     0.03   0.04   0.01     0.02   0.04   0.005

M2     0.0002   0.03   0.04   0.0001   0.04   0.04   0.01

q = frog likes toad

P(q | M1) = 0.01 x 0.02 x 0.01
P(q | M2) = 0.0002 x 0.04 x 0.0001

P(q|M1) > P(q|M2)

=> M1 is more likely to generate query q

Page 9: Language Models

Types of language models

How do we build probabilities over sequences of terms? By the chain rule:

P(t1 t2 t3 t4) = P(t1) x P(t2|t1) x P(t3|t1 t2) x P(t4|t1 t2 t3)

Unigram language model – the simplest; no conditioning context (see the sketch at the end of this slide)

P(t1 t2 t3 t4) = P(t1) x P(t2) x P(t3) x P(t4)

Bigram language model – condition on previous term

P(t1 t2 t3 t4) = P(t1) x P(t2|t1) x P(t3|t2) x P(t4|t3)

Trigram language model …

The unigram model is the most common in IR
• Often sufficient to judge the topic of a document
• Richer models suffer from data sparseness
• Simple and efficient implementation
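A minimal Python sketch (not from the slides) contrasting the unigram and bigram factorizations; the toy probability tables are made up for illustration.

unigram = {"frog": 0.01, "likes": 0.02, "toad": 0.01}
bigram = {("frog", "likes"): 0.05, ("likes", "toad"): 0.03}

def unigram_seq_prob(terms, p1):
    # P(t1 t2 ... tn) = P(t1) x P(t2) x ... x P(tn)
    p = 1.0
    for t in terms:
        p *= p1.get(t, 0.0)
    return p

def bigram_seq_prob(terms, p1, p2):
    # P(t1 t2 ... tn) = P(t1) x P(t2|t1) x ... x P(tn|tn-1)
    p = p1.get(terms[0], 0.0)
    for prev, cur in zip(terms, terms[1:]):
        p *= p2.get((prev, cur), 0.0)
    return p

q = ["frog", "likes", "toad"]
print(unigram_seq_prob(q, unigram))          # 0.01 x 0.02 x 0.01
print(bigram_seq_prob(q, unigram, bigram))   # 0.01 x 0.05 x 0.03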

Page 10: Language Models

The query likelihood model

• Goal: rank documents by P(d|q)
– The probability that a user querying q had the document d in mind
• Bayes rule: P(d|q) = P(q|d)P(d)/P(q)
• P(q) – the same for all documents, so it is ignored
• P(d) – often treated as uniform across documents, so it is ignored as well
– Could be a non-uniform prior based on criteria like authority, length, genre, newness …

• Rank by P(q|d)

Page 11: Language Models

The query likelihood model (2)

• P(q|d) – the probability that the query q was generated by a language model derived from document d
– The probability that the query would be observed as a random sample from the respective document model

• Algorithm:
1. Infer a LM Md for each document d
2. Estimate P(q|Md)
3. Rank the documents according to these probabilities
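A minimal Python sketch (not from the slides) of this algorithm, using the MLE unigram estimate described later and no smoothing, so a document missing any query term scores 0. The toy documents are illustrative.

from collections import Counter

def doc_model(doc_tokens):
    # MLE unigram model: P(t|Md) = tf(t,d) / |d|
    counts = Counter(doc_tokens)
    length = len(doc_tokens)
    return {t: c / length for t, c in counts.items()}

def query_likelihood(query_tokens, model):
    p = 1.0
    for t in query_tokens:
        p *= model.get(t, 0.0)
    return p

docs = {
    "d1": "frog said that toad likes frog".split(),
    "d2": "toad said that dog likes cats".split(),
}
q = "frog likes toad".split()
scores = {d: query_likelihood(q, doc_model(toks)) for d, toks in docs.items()}
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))  # ranked list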

Page 12: Language Models

Illustration

(Diagram: documents d1, d2, d3 induce models Md1, Md2, Md3; the query is scored against each model, yielding P(q|Md1), P(q|Md2), P(q|Md3).)

E.g., if P(q|Md3) > P(q|Md1) > P(q|Md2), then d3 is ranked first, d1 second, and d2 third

Page 13: Language Models

Estimating P(q|Md)

Use Maximum Likelihood Estimation (MLE)

Assume a unigram language model (terms occur independently):

P(q|Md) = Π t∈q Pmle(t|Md)

where the unigram MLE is Pmle(t|Md) = tf(t,d) / Ld
(tf(t,d) = number of occurrences of t in d; Ld = length of d in tokens)

Page 14: Language Models

Sparse data problem

• Documents are sparse
– Some words don’t appear in the document
– In particular, some of the query terms

• P(q|d) = 0 – the zero probability problem
– Implies conjunctive semantics: only documents containing every query term score above zero

• The words that do occur are poorly estimated
– A single document is a small training set
– Occurring words are over-estimated: their occurrence was partly by chance

Page 15: Language Models

Solution: smoothing

• Smooth the probabilities in the LMs
– overcome zero probabilities
– give some probability mass to unseen words

• The probability of a non-occurring term should be close to its probability of occurring in the collection:

P(t|Mc) = cf(t)/T
• cf(t) = number of occurrences of term t in the collection
• T = length of the collection = sum of all document lengths

Page 16: Language Models

Smoothing methods

Linear interpolation (Jelinek–Mercer) smoothing:
P(t|d) = λ Pmle(t|Md) + (1 − λ) Pmle(t|Mc), with 0 < λ < 1

Bayesian (Dirichlet) smoothing:
P(t|d) = ( tf(t,d) + α P(t|Mc) ) / (Ld + α)

Summary, with linear interpolation:
P(q|d) ∝ Π t∈q ( λ Pmle(t|Md) + (1 − λ) Pmle(t|Mc) )

In practice, the log is taken of both sides of the equation to avoid multiplying many small numbers
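A minimal Python sketch (not from the slides) of query likelihood scoring with linear interpolation smoothing, working in log space as suggested above. The whitespace tokenization and the default lambda of 0.5 are assumptions.

import math
from collections import Counter

def lm_score(query_tokens, doc_tokens, collection_tokens, lam=0.5):
    # log P(q|d) with P(t|d) = lam * Pmle(t|Md) + (1 - lam) * Pmle(t|Mc)
    doc_tf, doc_len = Counter(doc_tokens), len(doc_tokens)
    col_cf, col_len = Counter(collection_tokens), len(collection_tokens)
    score = 0.0
    for t in query_tokens:
        p_doc = doc_tf[t] / doc_len
        p_col = col_cf[t] / col_len
        p = lam * p_doc + (1 - lam) * p_col
        if p == 0.0:                 # term absent from the entire collection
            return float("-inf")
        score += math.log(p)
    return score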

Page 17: Language Models

Exercise

Given a collection of two documents, D1 and D2:

D1: Xyzzy reports a profit but revenue is down
D2: Quorus narrows quarter loss but revenue decreases further

A user submitted the query: “revenue down”

Rank D1 and D2 – use an MLE unigram model and linear interpolation smoothing with lambda parameter 0.5
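A self-contained Python sketch (not from the slides) that can be used to check the exercise; lowercasing and whitespace tokenization are assumptions.

from collections import Counter

docs = {
    "D1": "xyzzy reports a profit but revenue is down".split(),
    "D2": "quorus narrows quarter loss but revenue decreases further".split(),
}
collection = [t for toks in docs.values() for t in toks]
col_cf, col_len = Counter(collection), len(collection)
query = "revenue down".split()
lam = 0.5

for name, toks in docs.items():
    tf, length = Counter(toks), len(toks)
    p = 1.0
    for t in query:
        # smoothed term probability: lam * Pmle(t|Md) + (1 - lam) * Pmle(t|Mc)
        p *= lam * tf[t] / length + (1 - lam) * col_cf[t] / col_len
    print(name, p)   # the document with the larger value ranks first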

Page 18: Language Models

Extended LM approaches

(Diagram: a query induces a query model, P(t|query); a document induces a document model, P(t|document); the two sides are connected by query likelihood, document likelihood, and model comparison.)

Query likelihood P(q|d) – the probability of the document LM generating the query (seen in the previous slides)
Document likelihood P(d|q) – the probability of the query LM generating the document (next slides)
Model comparison R(d;q) – compare the document and query models (next slides)

Page 19: Language Models

Document likelihood model

• P(d|q) – the probability of the query LM generating the document

• Problem: queries are short, leading to bad model estimation

• [Zhai and Lafferty 2001]
– Expand the query with terms taken from relevant documents in the usual way and hence update the language model

Page 20: Language Models

KL divergence
• Kullback–Leibler (KL) divergence
• An asymmetric divergence measure from information theory
• Measures the difference between two probability distributions P, Q:
D(P||Q) = Σx P(x) log( P(x) / Q(x) )
• Typically Q is an estimate of P

Properties
• Non-negative
• Equals 0 iff P equals Q
• May have an infinite value
• Asymmetric, thus not a metric

• Jensen–Shannon (JS) divergence
• Based on KL divergence (D)
• Always finite
• 0 <= JSD <= 1
• Symmetric

Page 21: Language Models

Model comparison

Build an LM from both the query and the document

Measure how different these LMs are from each other

Use the KL divergence between the two models

Rank by the KL divergence – the closer to 0, the higher the rank
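A minimal Python sketch (not from the slides) of ranking by model comparison. The direction of the divergence, D(Mq||Md), and the add-one smoothing (needed so the divergence stays finite) are illustrative choices, not prescribed by the slides.

import math
from collections import Counter

def smoothed_model(tokens, vocab):
    # add-one smoothing over a shared vocabulary, so no probability is zero
    counts = Counter(tokens)
    total = len(tokens) + len(vocab)
    return {t: (counts[t] + 1) / total for t in vocab}

def kl_divergence(p, q, vocab):
    # D(P||Q) = sum over terms of P(t) * log(P(t) / Q(t))
    return sum(p[t] * math.log(p[t] / q[t]) for t in vocab)

docs = {"d1": "frog said that toad likes frog".split(),
        "d2": "toad said that dog likes cats".split()}
query = "frog likes toad".split()
vocab = set(query) | {t for toks in docs.values() for t in toks}

mq = smoothed_model(query, vocab)
ranking = sorted(docs, key=lambda d: kl_divergence(mq, smoothed_model(docs[d], vocab), vocab))
print(ranking)   # smaller divergence = higher rank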

Page 22: Language Models

Language models – summary
• Probabilistic model
– mathematically precise
• Intuitive, simple concept
• Achieves very good retrieval results
– Still, no evidence that it exceeds the traditional vector space model

• Relation to the vector space model
– Both use term frequency
– Smoothing with the collection generation probability is a little like idf
• Terms that are rare in the general collection but common in some documents will have a greater influence on those documents’ rankings
– Probabilistic vs. geometric
– Mathematical model vs. heuristic model

