Hypothesis testing, MLE, language models
Kira Radinsky
Based on some slides of Ilan Gronau,
Ydo Wexler, Dan Geiger & Nir Friedman
Hypothesis Testing
• Find the best explanation for the observed data
• Helps predict behavior of similar data sets
An example: Binomial experiments
• Model: the unknown parameter θ = P(H), with P(T) = 1-θ
• Data set: a series of experiment results, e.g. D = H H T H T T T H H …
• Main assumption: each experiment is independent of the others
Parameter Estimation Using Likelihood Functions
• The likelihood of a given value for θ: L_D(θ) = p(D | θ)
• Maximum Likelihood Estimation (MLE): we wish to find a value for θ which maximizes the likelihood
• For example, the likelihood of ‘HTTHH’ is: L_HTTHH(θ) = p(HTTHH | θ) = θ(1-θ)(1-θ)θθ = θ^3(1-θ)^2
• We only need to know N(H) (number of Heads) and N(T) (number of Tails).
• These are sufficient statistics: L_D(θ) = θ^N(H) (1-θ)^N(T)
Sufficient Statistics
• A sufficient statistic is a function of the data that summarizes the relevant information for the likelihood.
• s(D) is a sufficient statistic if for any two datasets D and D’:
s(D) = s(D’) ⇒ L_D(θ) = L_D’(θ)
• The likelihood can therefore be computed from the sufficient statistic alone.
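A quick illustrative sketch (not in the original slides) of this definition in Python: two different coin-flip datasets with the same sufficient statistic s(D) = (N(H), N(T)) produce identical likelihoods at every θ.

```python
def likelihood(D, theta):
    """L_D(theta), computed flip by flip."""
    p = 1.0
    for x in D:
        p *= theta if x == "H" else (1 - theta)
    return p

# Two different datasets with the same sufficient statistic
# (N(H), N(T)) = (3, 2) have identical likelihood functions:
for theta in (0.3, 0.5, 0.7):
    print(likelihood("HHTHT", theta), likelihood("TTHHH", theta))
    # both equal theta**3 * (1 - theta)**2
```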
Maximum Likelihood Estimation
• Goal: Maximize the likelihood (or log-likelihood)
• In our example:
– Likelihood:
• L_D(θ) = θ^N(H) (1-θ)^N(T)
– Log-likelihood:
• l_D(θ) = log(L_D(θ)) = N(H)·log(θ) + N(T)·log(1-θ)
– Maximization of the log-likelihood:
• Setting l_D'(θ) = 0: N(H)/θ - N(T)/(1-θ) = 0, which gives θ̂ = N(H) / (N(H) + N(T))
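To make the derivation concrete, here is a minimal Python sketch (the flip sequence is a made-up example) that compares the closed-form MLE N(H)/(N(H)+N(T)) against a brute-force grid search over the log-likelihood.

```python
import math

D = "HHTHTTTHH"          # made-up coin-flip data
n_h, n_t = D.count("H"), D.count("T")

def log_likelihood(theta):
    """l_D(theta) = N(H)*log(theta) + N(T)*log(1 - theta)"""
    return n_h * math.log(theta) + n_t * math.log(1 - theta)

# Closed-form MLE from setting the derivative to zero:
theta_mle = n_h / (n_h + n_t)

# Brute-force check: no grid point should beat the closed form
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=log_likelihood)
print(f"closed-form MLE = {theta_mle:.3f}, best grid point = {best:.3f}")
```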
MLE with multiple parameters
• What if we have several parameters θ_1, θ_2, …, θ_K that we wish to learn?
• Examples:
– die toss (K=6)
– Grades (K=100)
• Sufficient statistics [assumption: a series of independent experiments]:
– N_1, N_2, …, N_K - the number of times each outcome was observed
• Likelihood: L_D(θ_1, …, θ_K) = θ_1^N_1 · θ_2^N_2 · … · θ_K^N_K
• MLE: θ̂_k = N_k / (N_1 + N_2 + … + N_K)
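A hedged sketch of the multi-parameter case for the die toss (the toss data is invented for illustration): the MLE for each θ_k is just the relative frequency N_k / N.

```python
from collections import Counter

tosses = [1, 3, 3, 6, 2, 3, 5, 1, 6, 6, 4, 3]   # invented die tosses

counts = Counter(tosses)            # sufficient statistics N_1..N_6
total = sum(counts.values())

# MLE: theta_k = N_k / (N_1 + ... + N_K)
theta_hat = {face: counts[face] / total for face in range(1, 7)}
print(theta_hat)                    # e.g. theta_3 = 4/12
```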
From MLE to Bayesian Inference
• MLE goal: maximize p(D | θ)
• Our goal now: maximize p(θ | D)
• Following Bayes’ rule: p(θ | D) = p(D | θ) · p(θ) / p(D), where p(θ | D) is the posterior probability, p(D | θ) is the likelihood, and p(θ) is the prior probability.
• Intuitively, the prior probability captures our prior knowledge (prejudice) of the model parameters.
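To make the posterior concrete, here is a small sketch under an assumption the slides do not commit to: a Beta(α, β) prior on the coin parameter θ. Beta is conjugate to the binomial likelihood, so the posterior is simply Beta(α + N(H), β + N(T)).

```python
# ASSUMED Beta(alpha, beta) prior -- the slides do not fix a prior.
alpha, beta = 2.0, 2.0              # made-up prior pseudo-counts
n_h, n_t = 5, 4                     # sufficient statistics from the data

# Conjugacy: Beta prior x binomial likelihood -> Beta posterior
post_alpha, post_beta = alpha + n_h, beta + n_t

# The prior pulls the estimate from the MLE toward its own mean (1/2)
posterior_mean = post_alpha / (post_alpha + post_beta)
mle = n_h / (n_h + n_t)
print(f"MLE = {mle:.3f}, posterior mean = {posterior_mean:.3f}")
```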
MLE in Natural Language Processing (NLP)
• Goal: Evaluate the probability of the next word based on the words prior to it:
P(w_i | w_1, …, w_{i-1})
• Importance: speech recognition, handwritten word recognition, part-of-speech tagging, language identification, spam detection, etc.
• Markov Assumption: the probability of a word w_i in a sequence of words depends only on the n-1 words prior to it in the sequence, where n is a constant.
N-Gram Model
• P(w_i | w_1, …, w_{i-1}) = P(w_i | w_{i-n+1}, …, w_{i-1})
• Types of n-grams:
– Uni-gram
• P(w_i | w_1, …, w_{i-1}) = P(w_i)
– Bi-gram
• P(w_i | w_1, …, w_{i-1}) = P(w_i | w_{i-1})
– Tri-gram
• P(w_i | w_1, …, w_{i-1}) = P(w_i | w_{i-2}, w_{i-1})
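As a small illustration of the bi-gram case (all probabilities below are invented placeholders): the chain rule plus the Markov assumption factor a sentence probability into a product of bigram terms.

```python
# Invented placeholder probabilities for a tiny bigram model
p_uni = {"john": 0.2}
p_bi = {("john", "drank"): 0.5, ("drank", "milk"): 0.4}

# P(w_1..w_m) = P(w_1) * product of P(w_i | w_{i-1})
sentence = ["john", "drank", "milk"]
prob = p_uni[sentence[0]]
for prev, w in zip(sentence, sentence[1:]):
    prob *= p_bi[(prev, w)]
print(prob)   # 0.2 * 0.5 * 0.4 = 0.04
```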
MLE in NLP
• Problem: how do we estimate P(w_i), P(w_i | w_{i-1}), P(w_i | w_{i-2}, w_{i-1})?
• Proposal: MLE, i.e. relative frequencies of counts in a corpus, e.g. P(w_i | w_{i-1}) = C(w_{i-1} w_i) / C(w_{i-1})
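A minimal sketch of these MLE estimates computed from corpus counts; the three-sentence corpus is made up, and the estimate follows the relative-frequency formula above.

```python
from collections import Counter

corpus = ["john drank milk", "mary drank milk", "john drank tea"]

unigrams, bigrams = Counter(), Counter()
for line in corpus:
    words = line.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def p_mle(w, prev):
    """MLE: P(w | prev) = C(prev w) / C(prev)"""
    return bigrams[(prev, w)] / unigrams[prev]

print(p_mle("milk", "drank"))   # C(drank milk)/C(drank) = 2/3
```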
Problems with MLE
• Many sequences of length n never appear in the dataset (but do appear in the real world).
• Example:
– Task: speech recognition. We heard a word in a sentence, and wish to decide between two words: “Milk” and “Silk”
– P(Milk | John drank) >? P(Silk | John drank)
– The word “John” never appeared in the dataset, therefore we cannot decide
• Church and Gale (1991):
– Dataset: 44 million words from newspapers
– Vocabulary: 400,653 different words
– Therefore, 1.6 × 10^11 possible bigrams
– Very few of them appeared in the dataset…
• Solutions: most solutions are based on some sort of smoothing:
– Laplace
– Good-Turing
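For example, here is a hedged sketch of Laplace (add-one) smoothing on top of the bigram counts, assuming a vocabulary of size V: every count is inflated by one, so unseen bigrams receive a small nonzero probability instead of zero.

```python
from collections import Counter

corpus = ["john drank milk", "mary drank milk", "john drank tea"]
unigrams, bigrams = Counter(), Counter()
for line in corpus:
    words = line.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

V = len(unigrams)   # vocabulary size (5 in this toy corpus)

def p_laplace(w, prev):
    # Add one to every bigram count; grow the denominator by V
    # so the conditional distribution still sums to 1.
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

print(p_laplace("milk", "drank"))   # seen bigram: (2+1)/(3+5)
print(p_laplace("milk", "john"))    # unseen bigram: (0+1)/(2+5), not 0
```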
Evaluation
• The null hypothesis, denoted by H_0
• The alternative hypothesis, denoted by H_1
• Should we reject the null hypothesis in favor of the alternative?
Input:
– a value from a certain distribution
– we don't know what the parameter of that distribution is.
Test:
– How likely is it that the value we were given could have come from the distribution with this predicted parameter?
– If it's not very likely, we reject the null hypothesis in favor of the alternative.
• Critical Region
– But what exactly is "not very likely"?
– We choose a region known as the critical region. If the result of our test lies in this region, then we reject the null hypothesis in favor of the alternative.
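Tying this back to the coin model, a sketch with invented numbers: test H_0: θ = 0.5 against H_1: θ ≠ 0.5 using an exact two-sided binomial test, rejecting H_0 when the p-value falls below a critical-region size of α = 0.05.

```python
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, heads = 100, 62       # invented experiment: 62 heads in 100 flips
p0 = 0.5                 # H_0: the coin is fair

# Two-sided p-value: total probability, under H_0, of all outcomes
# at least as unlikely as the one observed.
p_obs = binom_pmf(heads, n, p0)
p_value = sum(binom_pmf(k, n, p0) for k in range(n + 1)
              if binom_pmf(k, n, p0) <= p_obs)

alpha = 0.05             # size of the critical region
print(f"p-value = {p_value:.4f}")
print("reject H_0" if p_value < alpha else "fail to reject H_0")
```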
Empirical Evaluation Methods
• Divide into train and test
– Leave-one-out
• Cross-validation
– 10-fold cross-validation
– 5×2 cross-validation
• Never (never never!) perform evaluation on the training data
Never!
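A minimal sketch of k-fold cross-validation (data, train_fn and eval_fn are hypothetical placeholders): each item is held out exactly once, so evaluation never touches training data.

```python
def k_fold_indices(n_items, k=10):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.
    Fold boundaries are rounded so remainders are spread evenly."""
    bounds = [round(i * n_items / k) for i in range(k + 1)]
    indices = list(range(n_items))
    for fold in range(k):
        test_idx = indices[bounds[fold]:bounds[fold + 1]]
        train_idx = indices[:bounds[fold]] + indices[bounds[fold + 1]:]
        yield train_idx, test_idx

# Usage sketch -- data, train_fn, eval_fn are hypothetical placeholders:
# scores = []
# for train_idx, test_idx in k_fold_indices(len(data), k=10):
#     model = train_fn([data[i] for i in train_idx])
#     scores.append(eval_fn(model, [data[i] for i in test_idx]))
# print(sum(scores) / len(scores))   # never scored on training data
```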