Page 1

Computational Statistics with Application to Bioinformatics

Prof. William H. Press
Spring Term, 2008

The University of Texas at Austin

Unit 1: Probability Theory and Bayesian Inference

Page 2

Unit 1: Probability Theory and Bayesian Inference (Summary)

• Prove informally some basic theorems in probability
  – Additivity or “Law of Or-ing”
  – Law of Exhaustion
  – Multiplicative Rule or “Law of And-ing”; conditional probabilities
  – Definition of “independence”
  – Law of Total Probability or “Law of de-And-ing”; Humpty-Dumpty
  – Bayes Theorem

• Elementary probability examples
  – dice
  – fish in barrel

• What comprises a “calculus of inference”, and how to talk like a Bayesian?
  – calculating probabilities of hypotheses as if they were events
  – should there be an International Talk Like a Bayesian Day?

• Simple examples of Bayesian inference
  – Trolls-under-the-bridge
    • data changes probabilities
  – Monty Hall (“Let’s Make a Deal”) example
    • use relabeling symmetry to save some work
    • data changes probabilities, really!
    • posterior vs. prior
    • differing background information makes probabilities subjective

• Evidence is commutative/associative
  – get same answers independent of the order in which data is seen
  – supports the notion of a “calculus of inference”
  – but requires EME hypotheses (give goodness-of-fit as a contrary example)

continues

Page 3

• Example: “the white shoe is a red herring”
  – folk wisdom “an instance of a hypothesis adds support to it” not always true
  – meaning of data depends on exact nature of hypothesis space
  – Bayes says “data supports the hypothesis in which it is most likely”
  – evidence factors can be overcome by strong priors

• Example (Jailer’s Tip) with marginalization over an unknown parameter
  – statistical model with an unknown parameter
  – non-informative priors, “insensitive to the prior”

• Estimate a posterior probability distribution from Bernoulli trials data
  – Bayesians use the exact data seen, not equivalent data not seen
  – Beta probability distribution
  – get normalization by symbolic integration
  – find the mode, mean, and standard error of the Beta distribution
    • mode is the “naïve” NB/N
    • mean encodes a uniform prior (“pseudocounts”)
  – the utility of choosing a conjugate prior

Page 4

“There is this thing called probability. It obeys the laws of an axiomatic system. When identified with the real world, it gives (partial) information about the future.”

• What axiomatic system?
• How to identify it with the real world?

– Bayesian or frequentist viewpoints are somewhat different “mappings” from axiomatic probability theory to the real world

– yet both are useful

“And, it gives a consistent and complete calculus of inference.”

• This is only a Bayesian viewpoint– It’s sort of true and sort of not true, as we will see!

• R.T. Cox (1946) showed that reasonable assumptions about “degree of belief” uniquely imply the axioms of probability (and Bayes)

  – ordered (transitive) A > B > C
  – “composable” (belief in AB depends only on A and B|A)
  – [what about, e.g., “fuzzy logic”? what assumption is violated?]

Laws of Probability

Page 5

Axioms:

I. P(A) ≥ 0 for an event A
II. P(Ω) = 1, where Ω is the set of all possible outcomes
III. If A ∩ B = ∅, then P(A ∪ B) = P(A) + P(B)

Example of a theorem:
Theorem: P(∅) = 0.
Proof: A ∩ ∅ = ∅ and A ∪ ∅ = A, so P(A) = P(A ∪ ∅) = P(A) + P(∅), q.e.d.

Basically this is a theory of measure on Venn diagrams, so we can (informally) cheat and prove theorems by inspection of the appropriate diagrams, as we now do.

Page 6

Additivity or “Law of Or-ing”

P(A ∪ B) = P(A) + P(B) − P(AB)
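As a quick numerical check (added here, not part of the original slide), a few lines of Matlab verify the Law of Or-ing by Monte Carlo; the two events defined on a die roll are purely illustrative.

% Monte Carlo check of P(A or B) = P(A) + P(B) - P(AB) on a fair die
N = 1e6;
r = randi(6, N, 1);        % N rolls of a fair die
A = (r <= 4);              % event A: roll is 1..4
B = (mod(r, 2) == 0);      % event B: roll is even
lhs = mean(A | B);
rhs = mean(A) + mean(B) - mean(A & B);
fprintf('P(A u B) = %.4f, P(A)+P(B)-P(AB) = %.4f\n', lhs, rhs)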

Page 7

“Law of Exhaustion”

If the Ri are exhaustive and mutually exclusive (EME), then

∑i P(Ri) = 1

Page 8

Multiplicative Rule or “Law of And-ing”

(same picture as before)

P(AB) = P(A) P(B|A) = P(B) P(A|B)        (the “|” is read “given”)

P(B|A) = P(AB) / P(A)        (“conditional probability”: “renormalize the outcome space”)

Page 9

Similarly, for multiple And-ing:

P(ABC) = P(A) P(B|A) P(C|AB)

Independence:

Events A and B are independent if
P(A|B) = P(A)
so P(AB) = P(B) P(A|B) = P(A) P(B)

Page 10

A symmetric die has P(1) = P(2) = … = P(6) = 1/6.
Why? Because ∑i P(i) = 1 and P(i) = P(j).

Not because of “frequency of occurrence in N trials”. That comes later!

The sum of faces of two dice (red and green) is > 8. What is the probability that the red face is 4?

P(R4 | >8) = P(R4 ∩ >8) / P(>8) = (2/36) / (10/36) = 0.2
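A one-line enumeration of all 36 outcomes (added here as a check, not from the slide) reproduces the 0.2:

% Enumerate the two dice and condition on sum > 8
[red, green] = meshgrid(1:6, 1:6);
S = red + green;
p = sum(red(:) == 4 & S(:) > 8) / sum(S(:) > 8)   % = 2/10 = 0.2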

Page 11

Law of Total Probability or “Law of de-Anding”

H’s are exhaustive and mutually exclusive (EME)

P(B) = P(BH1) + P(BH2) + … = ∑i P(BHi)

P(B) = ∑i P(B|Hi) P(Hi)

“How to put Humpty-Dumpty back together again.”

Page 12

Example: A barrel has 3 minnows and 2 trout, with equal probability of being caught. Minnows must be thrown back. Trout we keep.

What is the probability that the 2nd fish caught is a trout?

H1 ≡ 1st caught is minnow, leaving 3 + 2
H2 ≡ 1st caught is trout, leaving 3 + 1
B ≡ 2nd caught is a trout

P(B) = P(B|H1) P(H1) + P(B|H2) P(H2)
     = (2/5)(3/5) + (1/4)(2/5) = 0.34
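The same Law of Total Probability evaluates directly in Matlab (a check added here, not from the slide):

% Law of Total Probability for the barrel example
pH = [3/5, 2/5];            % P(H1) = 1st is minnow, P(H2) = 1st is trout
pB_given_H = [2/5, 1/4];    % P(2nd is trout | H1), P(2nd is trout | H2)
pB = sum(pB_given_H .* pH)  % = 0.34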

Page 13

Bayes Theorem

(same picture as before)

P(Hi|B) = P(HiB) / P(B)                                [Law of And-ing]
        = P(B|Hi) P(Hi) / ∑j P(B|Hj) P(Hj)             [Law of de-Anding]

We usually write this as

P(Hi|B) ∝ P(B|Hi) P(Hi)

where the proportionality means, “compute the normalization by using the completeness of the Hi’s”.

Thomas Bayes, 1702–1761

Page 14

• As a theorem relating probabilities, Bayes is unassailable

• But we will also use it in inference, where the H’s are hypotheses, while B is the data
  – “what is the probability of an hypothesis, given the data?”
  – some (defined as frequentists) consider this dodgy
  – others (Bayesians like us) consider this fantastically powerful and useful
  – in real life, the war between Bayesians and frequentists is long since over, and most statisticians adopt a mixture of techniques appropriate to the problem

• Note that you generally have to know a complete set of EME hypotheses to use Bayes for inference
  – perhaps its principal weakness

Page 15

Example: Trolls Under the Bridge

Trolls are bad. Gnomes are benign.
Every bridge has 5 creatures under it:

20% have TTGGG (H1)
20% have TGGGG (H2)
60% have GGGGG (benign) (H3)

Let’s work a couple of examples using Bayes Law:

Before crossing a bridge, a knight captures one of the 5 creatures at random. It is a troll. “I now have an 80% chance of crossing safely,” he reasons, “since only the case that had TTGGG (H1), and now has TGGG, is still a threat.”

Page 16

P(Hi|T) ∝ P(T|Hi) P(Hi), so

P(H1|T) = (2/5 · 1/5) / (2/5 · 1/5 + 1/5 · 1/5 + 0 · 3/5) = 2/3

The knight’s chance of crossing safely is actually only 33.3%. Before he captured a troll (“saw the data”) it was 60%. Capturing a troll actually made things worse! [well… discuss]

(80% was never the right answer!)
Data changes probabilities!
Probabilities after assimilating data are called posterior probabilities.
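The same update is three lines of Matlab (a check added here, not in the original slide); the posterior on H1 comes out 2/3, so the safe-crossing chance is 1/3.

prior = [0.2, 0.2, 0.6];       % P(H1), P(H2), P(H3)
like  = [2/5, 1/5, 0];         % P(captured creature is a troll | Hi)
post  = prior .* like / sum(prior .* like)   % = [2/3, 1/3, 0]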

Page 17

Example: The Monty Hall or Let’s Make a Deal Problem

• Three doors
• Car (prize) behind one door
• You pick a door, but don’t open it yet
• Monty then opens one of the other doors, always revealing no car (he knows where it is)
• You now get to switch doors if you want
• Should you?
• Most people reason: Two remaining doors were equiprobable before, and nothing has changed. So it doesn’t matter whether you switch or not.

• Marilyn vos Savant (“highest IQ person in the world”) famously thought otherwise (Parade magazine, 1990)

• No one seems to care what Monty Hall thought!

Page 18

Hi = car behind door i, i = 1, 2, 3
Wlog, you pick door 2 (relabeling).
Wlog, Monty opens door 3 (relabeling).

P(Hi|O3) ∝ P(O3|Hi) P(Hi)

(“Wlog” = “Without loss of generality…”)

P(H1|O3) ∝ 1 · 1/3 = 2/6
P(H2|O3) ∝ 1/2 · 1/3 = 1/6        (ignorance of Monty’s preference between 1 and 3, so take P(O3|H2) = 1/2)
P(H3|O3) ∝ 0 · 1/3 = 0

So you should always switch: doubles your chances!
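For an empirical check (added here, not part of the slide), a short Monte Carlo simulation of the game, with Monty choosing at random when he has a choice, reproduces the 1/3 vs. 2/3 split:

% Monte Carlo Monty Hall: you always pick door 2, then consider switching
N = 1e5; stickWins = 0; switchWins = 0;
for k = 1:N
    car = randi(3);                       % door hiding the car
    pick = 2;                             % wlog, you pick door 2
    options = setdiff(1:3, [pick, car]);  % doors Monty may open (not yours, no car)
    open = options(randi(numel(options)));
    other = setdiff(1:3, [pick, open]);   % the remaining unopened door
    stickWins  = stickWins  + (car == pick);
    switchWins = switchWins + (car == other);
end
fprintf('stick: %.3f   switch: %.3f\n', stickWins/N, switchWins/N)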

Page 19

Exegesis on Monty Hall

Bayesian viewpoint:
Probabilities are modified by data. This makes them intrinsically subjective, because different observers have access to different amounts of data (including their “background information” or “background knowledge”).

• Very important example! Master it.
• P(Hi) = 1/3 is the “prior probability” or “prior”
• P(Hi|O3) is the “posterior probability” or “posterior”
• P(O3|Hi) is the “evidence factor” or “evidence”
• Bayes says posterior ∝ prior × evidence

Page 20

Commutativity/Associativity of Evidence

We want P(Hi|D1D2).

We see D1:
P(Hi|D1) ∝ P(D1|Hi) P(Hi)

Then we see D2 (P(Hi|D1) is now a prior!):
P(Hi|D1D2) ∝ P(D2|HiD1) P(Hi|D1)

But,
P(D2|HiD1) P(Hi|D1) ∝ P(D2|HiD1) P(D1|Hi) P(Hi) = P(D1D2|Hi) P(Hi)

This being symmetrical in D1 and D2 shows that we would get the same answer regardless of the order of seeing the data.

All priors P(Hi) are actually P(Hi|D), conditioned on previously seen data! We often write this as P(Hi|I), where I denotes the background information.
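A tiny numeric illustration of order independence (added here, not from the slides), reusing the troll hypotheses with two captures as D1 and D2; purely for illustration, assume each captured creature is put back, so the captures are conditionally independent given Hi.

prior = [0.2, 0.2, 0.6];
L_troll = [2/5, 1/5, 0];     % P(capture a troll | Hi), with replacement
L_gnome = [3/5, 4/5, 1];     % P(capture a gnome | Hi)
bayes = @(p, L) p .* L / sum(p .* L);
postA = bayes(bayes(prior, L_troll), L_gnome);   % see troll, then gnome
postB = bayes(bayes(prior, L_gnome), L_troll);   % see gnome, then troll
disp([postA; postB])                             % identical rows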

Page 21

Bayes Law is a “calculus of inference”, often better (and certainly more self-consistent) than folk wisdom.

Example: Hempel’s Paradox

Folk wisdom: A case of a hypothesis adds support to that hypothesis.

Example: “All crows are black” is supported by each new observation of a black crow.

“All crows are black” is logically equivalent (by contraposition) to “All non-black things are non-crows”.

But this is supported by the observation of a white shoe.

So the observation of a white shoe is evidence that all crows are black!

Page 22

We observe one bird, and it is a black crow.
a) Which world are we in?
b) Are all crows black?

P(H1|D) / P(H2|D) = [P(D|H1) P(H1)] / [P(D|H2) P(H2)]
                  = (0.0001 / 0.1) · P(H1) / P(H2)
                  = 0.001 · P(H1) / P(H2)

So the observation strongly supports H2 and the existence of white crows.

Hempel’s folk wisdom premise is not true.

Data supports the hypotheses under which it is more likely, compared with the other hypotheses.

We must have some kind of background information about the universe of hypotheses, otherwise data has no meaning at all.

I.J. Good: “The White Shoe is a Red Herring” (1966)

Page 23

Example: Jailer’s Tip (our first “estimation of parameters”)

• Of 3 prisoners (A, B, C), 2 will be released tomorrow.
• A asks jailer for name of one of the lucky – but not himself.
• Jailer says, truthfully, “B”.
• “Darn,” thinks A, “now my chances are only ½, C or me”.


Is this like Monty Hall? Did the data (“B”) change the probabilities?

Page 24

Suppose (unlike Monty Hall) the jailer is not indifferentabout responding “B” versus “C”.

P(SB|BC) = x,  (0 ≤ x ≤ 1)        (SB ≡ the jailer “says B”)

P(A|SB) = P(AB|SB) + P(AC|SB)
        = P(SB|AB) P(AB) / [P(SB|AB) P(AB) + P(SB|BC) P(BC) + P(SB|CA) P(CA)]
        = (1/3) / (1 · 1/3 + x · 1/3 + 0)
        = 1 / (1 + x)

(The AC term drops out because P(SB|CA) = 0: if A and C are the ones released, the truthful jailer cannot say “B”.)


So if A knows the value x, he can calculate his chances.
If x = 1/2 (like Monty Hall), his chances are 2/3, same as before; so (unlike Monty Hall) he got no new information.
If x ≠ 1/2, he does get new info – his chances change.
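For concreteness (an evaluation added here, not from the slide), P(A|SB) = 1/(1+x) at a few values of x:

PA = @(x) 1./(1+x);
[PA(0.5), PA(0), PA(1)]   % = [2/3, 1, 1/2]: x = 1/2 leaves his chances at 2/3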

But what if he doesn’t know x at all?

Page 25

“Marginalization” (this is important!)

(e.g., Jailer’s Tip):

P(A|SB I) = ∫ P(A|SB x I) p(x|I) dx        (law of de-Anding)
          = ∫ [1/(1+x)] p(x|I) dx

• When a model has unknown, or uninteresting, parameters we “integrate them out”

• Multiplying by whatever knowledge we have of their distribution
  – At worst, just a prior informed by background information
  – At best, a narrower distribution based on data

• This is not any new assumption about the world
  – it’s just the Law of de-Anding

Page 26

P(A|SB I) = ∫ P(A|SB x I) p(x|I) dx = ∫ [1/(1+x)] p(x|I) dx

p(x) ≡ p(x|I)

first time we’ve seen a continuous probability distribution, but we’ll skip the obvious repetition of all the previous laws

∑i Pi = 1   ↔   ∑i p(xi) dxi = 1   ↔   ∫ p(x) dx = 1

(Notice that p(x) is a probability of a probability! That is fairly common in Bayesian inference.)

Page 27

P(A|SB I) = ∫ P(A|SB x I) p(x|I) dx = ∫ [1/(1+x)] p(x|I) dx

What should Prisoner A take for p(x)? Maybe the “uniform prior”?

p(x) = 1,  (0 ≤ x ≤ 1):
P(A|SB I) = ∫₀¹ [1/(1+x)] dx = ln 2 ≈ 0.693

Not the same as the “massed prior at x = 1/2” (a Dirac delta function):

p(x) = δ(x − 1/2),  (0 ≤ x ≤ 1)
P(A|SB I) = 1/(1 + 1/2) = 2/3        (substitute the value and remove the integral)

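Numerically (a check added here, not from the slide), the uniform-prior marginalization is a one-line quadrature; the delta-function prior just evaluates the integrand at x = 1/2. This uses integral, available in recent Matlab; older releases would use quad instead.

integral(@(x) 1./(1+x), 0, 1)   % = 0.6931, i.e. ln 2  (uniform prior)
1/(1 + 1/2)                     % = 0.6667             (prior massed at x = 1/2)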

Page 28

Where are we? We are trying to estimate a probability

x = P(SB|BC),  (0 ≤ x ≤ 1)

The form of our estimate is a (Bayesian) probability distribution

This is a sterile exercise if it is just a debate about priors. What we need is data! Data might be a previous history of choices by the jailer in identical circumstances.

BCBCCBCCCBBCBCBCCCCBBCBCCCBCBCBBCCB

N = 35, NB = 15, NC = 20

We hypothesize (might later try to check) that these are i.i.d. (“independent and identically distributed”) “Bernoulli trials” and therefore informative about x.

As good Bayesians, we now need P (data|x)

P(A|SB I) = ∫ P(A|SB x I) p(x|I) dx = ∫ [1/(1+x)] p(x|I) dx

(What’s wrong with: x=15/35=0.43? Hold on…)

Page 29

P(data|x) can mean different things in frequentist vs. Bayesian contexts, so this is a good time to understand the differences (we’ll use both ideas as appropriate).

frequentist considers the universe of what might have been, imagining repeated trials, even if they weren’t actually tried:

Since the trials are i.i.d., only the N’s can matter:

P(data|x) = (N choose NB) · x^NB (1−x)^NC

where x^NB (1−x)^NC is the probability of the exact sequence seen, and the binomial coefficient

(n choose k) = n! / (k!(n−k)!)

counts the number of equivalent arrangements.

Bayesian considers only the exact data seen:

P(x|data) ∝ x^NB (1−x)^NC p(x|I)

No binomial coefficient, since it is independent of x and absorbed in the proportionality. Use only the data you see, not “equivalent arrangements” that you didn’t see. (The prior p(x|I) is still with us.) This issue is one we’ll return to, not always entirely sympathetically to Bayesians (e.g., goodness-of-fit).

Page 30

Use Matlab to calculate the posterior probability of x (note we are doing this symbolically to get the general result):

syms nn nb x
num = x^nb * (1-x)^(nn-nb)
num =
x^nb*(1-x)^(nn-nb)
denom = int(num, 0, 1)
denom =
gamma(nn-nb+1)*gamma(nb+1)/gamma(nn+2)
p = num / denom
p =
x^nb*(1-x)^(nn-nb)/gamma(nn-nb+1)/gamma(nb+1)*gamma(nn+2)
ezplot(subs(p,[nn,nb],[35,15]),[0,1])

so this is what the data tells us about p(x)
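If the Symbolic Math Toolbox (syms, int) isn’t available, the same posterior can be normalized and plotted numerically; this is a sketch added here, not part of the original slides.

nn = 35; nb = 15;
x = linspace(0, 1, 1001);
post = x.^nb .* (1-x).^(nn-nb);
post = post / trapz(x, post);     % normalize numerically instead of symbolically
plot(x, post); xlabel('x'); ylabel('p(x | data)')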

Page 31

In[7]:= num = x^nb (1 − x)^(nn − nb)
Out[7]= (1 − x)^(−nb + nn) x^nb

In[8]:= denom = Integrate[num, {x, 0, 1}, GenerateConditions → False]
Out[8]= (Gamma[1 + nb] Gamma[1 − nb + nn]) / Gamma[2 + nn]

In[9]:= p[x_] = num / denom
Out[9]= ((1 − x)^(−nb + nn) x^nb Gamma[2 + nn]) / (Gamma[1 + nb] Gamma[1 − nb + nn])

In[12]:= Plot[p[x] /. {nn → 35, nb → 15}, {x, 0, 1}, PlotRange → All, Frame → True]

[plot of the posterior p(x) for nn = 35, nb = 15]

Out[12]= -Graphics-

Same calculation in Mathematica:

so this is what the data tells us about p(x)

Page 32

Dp = diff(p);

simplify(Dp)

ans =

-x^(nb-1)*(1-x)^(nn-nb-1)*(-nb+x*nn)*gamma(nn+2)/gamma(nb+1)/gamma(nn-nb+1)

solve(Dp)

ans =

nb/nn

mean = int(x*p, 0, 1)

mean =

(nb+1)/(nn+2)

sigma = simplify(sqrt(int(x^2 * p, 0, 1)-mean^2))

sigma =

(-(nb+1)*(-nn+nb-1)/(nn+2)^2/(nn+3))^(1/2)

pretty(sigma)

   /   (nb + 1) (-nn + nb - 1) \ 1/2
   | - ------------------------ |
   |               2            |
   \    (nn + 2)   (nn + 3)     /

Find the mean, standard error, and mode of our estimate for x

derivative has this simple factor

“maximum likelihood” answer is to estimate x as exactly the fraction seen

mean is the 1st moment

standard error involves the 2nd moment, as shown

they call this “pretty”! This shows how p(x) gets narrower as the amount of data increases.
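Plugging in the observed counts nn = 35, nb = 15 (a numeric evaluation added here, not from the slides):

nn = 35; nb = 15;
mode_x  = nb/nn                                        % 0.4286  ("naive" fraction seen)
mean_x  = (nb+1)/(nn+2)                                % 0.4324  (uniform-prior pseudocounts)
sigma_x = sqrt((nb+1)*(nn-nb+1)/((nn+2)^2*(nn+3)))     % 0.0804  (standard error)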

Page 33

In[20]:= Simplify[D[p[x], x]]
Out[20]= −((1 − x)^(−1 − nb + nn) x^(−1 + nb) (−nb + nn x) Gamma[2 + nn]) / (Gamma[1 + nb] Gamma[1 − nb + nn])

In[21]:= Solve[Simplify[D[p[x], x]] == 0, x]
Out[21]= {{x → nb/nn}}

In[23]:= mean = Integrate[x p[x], {x, 0, 1}, GenerateConditions → False]
Out[23]= (1 + nb)/(2 + nn)

In[27]:= sigma = Sqrt[FullSimplify[Integrate[x^2 p[x], {x, 0, 1}, GenerateConditions → False] − mean^2]]
Out[27]= Sqrt[((1 + nb) (1 − nb + nn)) / ((2 + nn)^2 (3 + nn))]

Or in Mathematica:

Editorial: Mathematica has prettier output for symbolic calculations, but Matlab is significantly better at handling numerical data (which we will be doing more of).

Page 34

Priors that preserve the analytic form of p(x) are called “conjugate priors”. There is nothing special about them except mathematical convenience.

Assimilating new data:

If we got a new batch of data, say with new counts NN and NB, we could use our p(x) based on nn and nb as the prior, then redo the calculations above.

However, you should be able to immediately write down the answers for the mean, standard error, and mode, above.

x^α (1−x)^β is a conjugate prior for our beta distribution x^NB (1−x)^(N−NB)
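A minimal sketch of the conjugate update (added here; the new counts are hypothetical, purely for illustration): with a Beta-form prior x^a (1−x)^b, new data simply add to the exponents, so the prior acts like pseudocounts.

% Conjugate (Beta-form) updating: exponents just accumulate
a = 15; b = 20;          % exponents after the first batch (old nb and nn-nb)
NB = 7; NC = 5;          % hypothetical new counts of B and C responses
a = a + NB;  b = b + NC; % posterior is again of the form x^a (1-x)^b
mean_x = (a+1)/(a+b+2)   % same mean formula as before, with updated counts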

