
Corpora and statistical methods


Albert Gatt

In this lecture
Overview of the rules of probability:
- multiplication rule
- subtraction rule

Probability based on prior knowledge:
- conditional probability
- Bayes' theorem

Part 1: Conditional probability and independence

Prior knowledge
Sometimes, our estimate of the probability of something is affected by what is already known; cf. the many linguistic examples in Jurafsky (2003).

Example: Part-of-speech tagging
Task: assign a label indicating the grammatical category to every word in a corpus of running text. This is one of the classic tasks in statistical NLP.

Part-of-speech tagging example
Statistical POS taggers are first trained on data that has been previously annotated. This yields a language model.

Language models vary based on the n-gram window:
unigrams: probabilities based on single tokens (a lexicon)
E.g. input = the_DET tall_ADJ man_NN
The model represents the probability that the word man is a noun (NB: it could also be a verb).

bigrams: probabilities across a span of 2 words
input = the_DET tall_ADJ man_NN
The model represents the probability that a DET is followed by an adjective, that an adjective is followed by a noun, etc.

Trigrams, quadrigrams, etc. can also be used.

POS tagging continued
Suppose we've trained a tagger on annotated data. It has:
- a lexicon of unigrams: P(the = DET), P(man = NN), etc.
- a bigram model: P(DET is followed by ADJ), etc.
Assume we've trained it on a large input sample. We now feed it a new phrase: the audacious alien. Our tagger knows that the word the is a DET, but it has never seen the other words. It can:
- make a wild guess (not very useful!)
- estimate the probability that the is followed by an ADJ, and that an ADJ is followed by a NOUN

Prior knowledge revisited
Given that I know that the is a DET, what's the probability that the following word audacious is an ADJ?

This is very different from asking what the probability is that audacious is an ADJ out of context.

We have prior knowledge that a DET has occurred. This can significantly change our estimate of the probability that audacious is an ADJ.

We therefore distinguish:
- prior probability: a naïve estimate based on long-run frequency
- posterior probability: a probability estimate based on prior knowledge

Conditional probability
In our example, we were estimating:
P(ADJ|DET) = probability of ADJ given DET
P(NN|DET) = probability of NN given DET
etc.
In general, the conditional probability P(A|B) is the probability that A occurs, given that we know that B has occurred.

Example continued
If I've just seen a DET, what's the probability that my next word is an ADJ? We need to take into account:
- occurrences of ADJ in our training data: VV+ADJ (was beautiful), PP+ADJ (with great concern), DET+ADJ, etc.
- occurrences of DET in our training corpus: DET+N (the man), DET+V (the loving husband), DET+ADJ (the tall man)

[Venn diagram of the bigram training data: one set holds the cases where w is an ADJ not preceded by a DET (is+tall, in+terrible, were+nice), the other holds the cases where w is a DET not followed by an ADJ (the+man, the+woman, a+road), and their intersection holds the cases where a DET is followed by an ADJ (the+tall, a+simple, an+excellent).]

Estimation of conditional probability
Intuition: P(A|B) is the ratio of the chances that both A and B happen to the chances of B happening alone:
P(A|B) = P(A & B) / P(B)

P(ADJ|DET) = P(DET+ADJ) / P(DET)
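As a minimal illustration of this estimate, here is a short Python sketch that computes P(ADJ|DET) from bigram and unigram counts; the toy tagged sentence and tag names are invented for the example, not taken from the slides.

from collections import Counter

# Hypothetical toy corpus of (word, tag) pairs in running order.
tagged = [("the", "DET"), ("tall", "ADJ"), ("man", "NN"),
          ("a", "DET"), ("simple", "ADJ"), ("idea", "NN"),
          ("the", "DET"), ("man", "NN"), ("is", "VV"), ("tall", "ADJ")]

tags = [tag for _, tag in tagged]
unigram_counts = Counter(tags)                 # counts of individual tags
bigram_counts = Counter(zip(tags, tags[1:]))   # counts of adjacent tag pairs

# P(ADJ|DET) estimated as count(DET followed by ADJ) / count(DET)
p_adj_given_det = bigram_counts[("DET", "ADJ")] / unigram_counts["DET"]
print(p_adj_given_det)   # 2/3 on this toy sample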

Another example
If we throw a die, what's the probability that the number we get is even, given that the number we get is larger than 4? This works out as the probability of getting the number 6:
P(even | >4) = P(even & >4) / P(>4) = (1/6) / (2/6) = 0.5
Note the difference from the simple, prior probability: using frequency alone, P(6) = 1/6.

Mind the fallacies!
When we speak of prior and posterior, we don't necessarily mean prior in time (e.g. the die example). Monte Carlo fallacy: if 20 turns of the roulette wheel have fallen on black, what are the chances that the next turn will fall on red? In reality, prior experience here makes no difference at all: every turn of the wheel is independent of every other.
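As a quick sanity check of the die example above, a short Python enumeration (purely illustrative, not part of the original slides):

# The six equally likely outcomes of one die throw.
outcomes = range(1, 7)

larger_than_4 = [n for n in outcomes if n > 4]                      # {5, 6}
even_and_larger_than_4 = [n for n in larger_than_4 if n % 2 == 0]   # {6}

# P(even | >4) = P(even & >4) / P(>4)
print(len(even_and_larger_than_4) / len(larger_than_4))   # 0.5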

The multiplication rule

Multiplying probabilities
Often, we're interested in switching the conditional probability estimate around. Suppose we know P(A|B) or P(B|A), and we want to calculate P(A AND B). For both A and B to occur, they must occur in some sequence (first A occurs, then B).

Estimating P(A AND B)
P(A AND B) = P(A) × P(B|A)
i.e. the probability that both A and B occur is the probability of A happening overall, multiplied by the probability of B happening given that A has happened.

Multiplication rule: example 1
We have a standard deck of 52 cards. What's the probability of pulling out two aces in a row? (NB: a standard deck has 4 aces.) Let A1 stand for an ace on the first pick, and A2 for an ace on the second pick. We're interested in P(A1 AND A2).

Example 1 continued
P(A1 AND A2) = P(A1) P(A2|A1)
P(A1) = 4/52 (since there are 4 aces in a 52-card pack)
If we do pick an ace on the first pick, then we diminish the odds of picking a second ace (there are now 3 aces left in a 51-card pack), so P(A2|A1) = 3/51.
Overall: P(A1 AND A2) = (4/52)(3/51) ≈ .0045

Example 2
We randomly pick two words, w1 and w2, out of a tagged corpus. What are the chances that both words are adjectives? Let ADJ be the set of all adjectives in the corpus (tokens, not types), so that |ADJ| is the total number of adjective tokens. Let A1 be the event of picking out an ADJ on the first try, and A2 the event of picking out an ADJ on the second try. P(A1 AND A2) is estimated in the same way as in the previous example: in the event of A1, the chances of A2 are diminished, and the multiplication rule takes this into account.
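The two computations above, written as a short Python sketch; the corpus counts in the second part are hypothetical, chosen only to show the pattern.

from fractions import Fraction

# Example 1: two aces in a row from a 52-card deck, without replacement.
p_a1 = Fraction(4, 52)            # P(A1): 4 aces among 52 cards
p_a2_given_a1 = Fraction(3, 51)   # P(A2|A1): 3 aces left among 51 cards
p_both = p_a1 * p_a2_given_a1     # multiplication rule
print(p_both, float(p_both))      # 1/221, roughly 0.0045

# Example 2: two adjectives drawn from a tagged corpus, same pattern,
# with made-up counts: n_adj adjective tokens out of n tokens in total.
n_adj, n = 5000, 100000
p_two_adj = Fraction(n_adj, n) * Fraction(n_adj - 1, n - 1)
print(float(p_two_adj))           # close to (n_adj / n) ** 2 when n is large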

Some observations
In these examples, the two events are not independent of each other: the occurrence of one affects the likelihood of the other. E.g. drawing an ace first diminishes the likelihood of drawing a second ace. This is sampling without replacement. If we put the ace back into the pack after we've drawn it, then we have sampling with replacement; in this case, the probability of one event doesn't affect the probability of the other.

Extending the multiplication rule
The logic of the A AND B rule is:
- both conditions, A and B, have to be met
- A is met a fraction of the time
- B is met a fraction of the times that A is met
This can be extended indefinitely. E.g. the chances of drawing 4 straight aces from a pack:
P(A1 & A2 & A3 & A4) = P(A1) P(A2|A1) P(A3|A1 & A2) P(A4|A1 & A2 & A3)
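The same chain, spelled out as a small Python loop (illustrative only):

from fractions import Fraction

# P(A1 & A2 & A3 & A4): drawing 4 straight aces without replacement.
p = Fraction(1)
aces, cards = 4, 52
for _ in range(4):
    p *= Fraction(aces, cards)   # P(next pick is an ace | all previous picks were aces)
    aces -= 1
    cards -= 1
print(p, float(p))               # 1/270725, about 3.7e-06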

The subtraction rule

Extending the addition rule
It's easy to extend the multiplication rule, but extending the addition rule isn't so easy: we need to correct for double-counting events.

Example: P(A OR B OR C)
[Venn diagram of three overlapping sets A, B and C.]
Once we've discounted the two-way intersections of A and B, A and C, and B and C, we need to re-count the three-way intersection:
P(A OR B OR C) = P(A) + P(B) + P(C) - P(A & B) - P(A & C) - P(B & C) + P(A & B & C)

Subtraction rule
Fundamental underlying observation: P(E) = 1 - P(not E)

E.g. the probability of getting at least one head in 3 flips of a coin (a three-set addition problem) can be estimated using the observation that:
P(at least one head out of 3 flips) = 1 - P(no heads) = 1 - P(3 tails) = 1 - (1/2)^3 = 7/8
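A quick Python check (not from the slides) that the subtraction-rule shortcut agrees with direct enumeration of the three flips:

from itertools import product

# All 8 equally likely outcomes of 3 coin flips.
flips = list(product("HT", repeat=3))

# Direct count of outcomes containing at least one head.
p_direct = sum("H" in outcome for outcome in flips) / len(flips)

# Subtraction rule: 1 - P(no heads) = 1 - P(3 tails).
p_subtraction = 1 - (1 / 2) ** 3

print(p_direct, p_subtraction)   # both 0.875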

Part 4: Bayes' theorem

Switching conditional probabilities
Problem 1: We know the probability that a test will give a positive result in case a person has a disease. We want to know the probability that there is indeed a disease, given that the test says positive. This is useful for finding false positives.

Problem 2: We know the probability P(ADJ|DET) that some word w2 is an ADJ, given that the previous word w1 is a DET. We find a new word w; we don't know its category. It might be a DET, and we do know that the following word is an ADJ. We would therefore like to know the reverse, i.e. P(DET|ADJ).

Deriving Bayes' rule from the multiplication rule
Given the symmetry of intersection (A & B = B & A), the multiplication rule can be written in two ways:
P(A & B) = P(A|B) P(B)
P(A & B) = P(B|A) P(A)

Bayes' rule involves substituting one equation into the other, to replace P(A & B):
P(B|A) P(A) = P(A|B) P(B)
so that
P(B|A) = P(A|B) P(B) / P(A)

Deriving P(A)
Often, it's not clear where P(A) should come from, since we start out from conditional probabilities! Given that we have two sets of outcomes of interest, A and B, P(A) can be derived from the following observation:
A = (A & not-B) OR (A & B)
i.e. the events in A are made up of those which are only in A (but not in B) and those which are in both A and B.

Finding P(A) -- I

[Venn diagram of the two sets A and B, with A split into the part that overlaps B and the part that does not.]

Any event in A must be in one or the other (or both), since A is composed of these two sets.

Finding P(A) -- II

Step 1: Applying the addition rule (and the multiplication rule to each part):
P(A) = P(A & B) + P(A & not-B) = P(A|B) P(B) + P(A|not-B) P(not-B)

Step 2: Substituting into Bayes' equation to replace P(A):
P(B|A) = P(A|B) P(B) / ( P(A|B) P(B) + P(A|not-B) P(not-B) )
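To make Problem 1 above concrete, here is a small Python sketch applying the formula just derived; the prevalence, sensitivity and false-positive rate are invented numbers, used only for illustration.

# Hypothetical figures: 1% prevalence, 95% sensitivity, 10% false-positive rate.
p_disease = 0.01
p_pos_given_disease = 0.95       # P(positive | disease)
p_pos_given_healthy = 0.10       # P(positive | no disease)

# P(positive), splitting A (positive) over B and not-B (disease / no disease).
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' theorem: P(disease | positive).
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))   # about 0.088: most positive results are false positives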

Summary
This ends our first foray into the rules of probability:
- addition rule
- subtraction & multiplication rules
- conditional probability
- Bayes' theorem

Next up
Probability distributions

Random variables

Basic information theory

