Mayo 2-19-13seminar WhatIsThePhilosophyOfStatistics


What is the Philosophy of Statistics?

At one level of analysis at least, statisticians and philosophers of science ask many of the same questions:

What should be observed and what may justifiably be inferred from the resulting data?
How well do data confirm or fit a model?
What is a good test?
Must predictions be "novel" in some sense? (selection effects, double counting, data mining)
How can spurious relationships be distinguished from genuine regularities? From causal regularities?
How can we infer more accurate and reliable observations from less accurate ones?
When does a fitted model account for regularities in the data?

That these very general questions are entwined with long-standing debates in philosophy of science helps to explain why the field of statistics tends to cross over so often into philosophical territory.


Statistics → Philosophy

3 ways statistical accounts are used in philosophy of science:

(1) Model Scientific Inference: to capture either the actual or rational ways to arrive at evidence and inference.

(2) Resolve Philosophical Problems about scientific inference, observation, experiment (problem of induction, objectivity of observation, reliable evidence, Duhem's problem, underdetermination).

(3) Perform a Metamethodological Critique: scrutinize methodological rules, e.g., accord special weight to "novel" facts, avoid ad hoc hypotheses, avoid "data mining", require randomization.

Philosophy → Statistics

Central job: to help resolve the conceptual, logical, and methodological discomforts of scientists as to how to make reliable inferences despite uncertainties and errors.

Philosophy of statistics and the goal of a philosophy of science relevant for philosophical problems in scientific practice.


Fresh methodological problems arise in practice surrounding a panoply of methods and models relied on to learn from incomplete, and often non-experimental, data. Examples abound:

Disputes over hypothesis-testing in psychology (e.g., the recently proposed "significance test ban");
Disputes over the proper uses of regression in applied statistics;
Disputes over dose-response curves in estimating risks;
Disputes about the use of computer simulations in observational sciences;
Disputes about "external validity" in experimental economics; and,
Across the huge landscape of fields using the latest, high-powered, computer methods, there are disputes about data-mining, algorithmic searches, and model validation.

Equally important are the methodological presuppositions that are not, but perhaps ought to be, disputed, debated, or at least laid out in the open, often, ironically, in the very fields in which philosophers of science immerse themselves.


I used to teach a course in this department: philosophy of science and economic methodology.

We read how many economic methodologists questioned the value of philosophy of science:

"If philosophers and others within science theory can't agree about the constitution of the scientific method (or even whether asking about a 'scientific method' makes any sense), doesn't it seem a little dubious for economists to continue blithely taking things off the shelf and attempting to apply them to economics?" (Hands, 2001, p. 6).

Deciding that it is, methodologists of economics increasingly look to sociology of science, rhetoric, evolutionary psychology.

The problem is not merely how this cuts philosophers of science out of being engaged in methodological practice; equally serious is how it encourages practitioners to assume there are no deep epistemological problems with the ways they collect and base inferences on data.


"Professional agreement on statistical philosophy is not on the immediate horizon, but this should not stop us from agreeing on methodology", as if what is correct methodologically does not depend on what is correct philosophically (Berger, 2003, p. 2).

In addition to the resurgence of the age-old controversies (significance tests vs. confidence intervals, frequentist vs. Bayesian measures), the latest statistical modeling techniques have introduced brand new methodological issues.

High-powered computer science packages offer a welter of algorithms for "automatically" selecting among this explosion of models, but as each boasts different, and incompatible, selection criteria, we are thrown back to the basic question of inductive inference: what is required to severely discriminate among well-fitting models such that, when a claim (or hypothesis or model) survives a test, the resulting data count as good evidence for the claim's correctness or dependability or adequacy?


A romp through 4 "waves in philosophy of statistics"

History and philosophy of statistics is a huge territory marked by 70 years of debates widely known for reaching unusual heights both of passion and of technical complexity.

Wave I ~1930-1955/60
Wave II ~1955/60-1980
Wave III ~1980-2005 & beyond
Wave IV ~2006 and beyond


A core question: What is the nature and role of probabilistic concepts, methods, and models in making inferences in the face of limited data, uncertainty and error?

1. Two Roles For Probability: Degrees of Confirmation and Degrees of Well-Testedness

a. To provide a post-data assignment of degree of probability, confirmation, support or belief in a hypothesis;
b. To assess the probativeness, reliability, trustworthiness, or severity of a test or inference procedure.

These two contrasting philosophies of the role of probability in statistical inference are very much at the heart of the central points of controversy in the three waves of philosophy of statistics.


Having conceded loss in the battle for justifying induction, philosophers appeal to logic to capture scientific method.

Inductive Logics (Carnap, C(H,e))
"Confirmation Theory": rules to assign degrees of probability or confirmation to hypotheses given evidence e.
Inductive Logicians: we can build and try to justify "inductive logics".
Straight rule: assign degrees of confirmation/credibility.
Statistical affinity: Bayesian (and likelihoodist) accounts.

Logic of falsification (Popper)
Methodological falsification: rules to decide when to "prefer" or accept hypotheses.
Deductive Testers: we can reject induction and uphold the "rationality" of preferring or accepting H if it is "well tested".
Statistical affinity: Fisherian, Neyman-Pearson methods: probability enters to ensure reliability and severity of tests with these methods.


I. Philosophy of Statistics: "The First Wave"

WAVE I: circa 1930-1955: Fisher, Neyman, Pearson, Savage, and Jeffreys.

Statistical inference tools use data x0 to probe aspects of the data generating source.

In statistical testing, these aspects are in terms of statistical hypotheses about parameters governing a statistical distribution.

H tells us the "probability of x under H", written P(x; H) (probabilistic assignments under a model).

Important to avoid confusion with conditional probabilities in Bayes's theorem, P(x|H).

Testing model assumptions is extremely important, though I will not discuss it.


Modern Statistics Begins with Fisher: "Simple" Significance Tests

Example. Let the sample X = (X1, …, Xn) be IID from a Normal distribution (NIID) with σ = 1.

1. A null hypothesis H0: μ = 0
e.g., 0 mean concentration of lead, no difference in mean survival in a given group, in mean risk, mean deflection of light.

2. A function of the sample, d(X), the test statistic, which reflects the difference between the data x0 = (x1, …, xn) and H0. The larger d(x0), the further the outcome is from what is expected under H0, with respect to the particular question being asked.

3. The p-value is the probability of a difference larger than d(x0), under the assumption that H0 is true:

p(x0) = P(d(X) > d(x0); H0)


The observed significance level (p-value) with observed X̄ = .1:

p(x0) = P(d(X) > d(x0); H0).

The relevant test statistic d(X) is:

d(X) = (X̄ - μ0)/σx = (Observed - Expected (under H0))/σx,

where X̄ is the sample mean with standard deviation σx = σ/√n.

Since σx = σ/√n = 1/5 = .2 (i.e., n = 25), d(X) = (.1 - 0) in units of σx yields

d(x0) = .1/.2 = .5

Under the null, d(X) is distributed as standard Normal, denoted by d(X) ~ N(0,1).

(Area to the right of .5) ≈ .3, i.e., not very significant.
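The arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch, not part of the original slides: it assumes n = 25 (implied by σx = 1/5), and the function name `simple_significance_test` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def simple_significance_test(xbar, mu0=0.0, sigma=1.0, n=25):
    """Return (d(x0), p-value) for the one-sided Normal test of H0: mu = mu0.

    Assumes NIID data with known sigma; n = 25 is an assumption inferred
    from sigma_x = 1/5 in the example.
    """
    sigma_x = sigma / sqrt(n)                 # standard deviation of the sample mean
    d_obs = (xbar - mu0) / sigma_x            # (observed - expected under H0) / sigma_x
    p_value = 1.0 - NormalDist().cdf(d_obs)   # P(d(X) > d(x0); H0)
    return d_obs, p_value


d_obs, p_value = simple_significance_test(xbar=0.1)
print(f"d(x0) = {d_obs:.2f}, p-value = {p_value:.3f}")
```

With the observed mean of .1 this gives d(x0) = .5 and a p-value of roughly .31, in line with the "about .3" read off the standard Normal curve.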


Logic of Simple Significance Tests: Statistical Modus Tollens

"Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis" (Fisher, 1956, p. 160).

Statistical analogy to the deductively valid pattern modus tollens:

If the hypothesis H0 is correct then, with high probability, 1 - p, the data would not be statistically significant at level p.
x0 is statistically significant at level p.
____________________________
Thus, x0 is evidence against H0, or x0 indicates the falsity of H0.

Fisher described the significance test as a procedure for rejecting the null hypothesis and inferring that the phenomenon has been "experimentally demonstrated" once one is able to generate "at will" a statistically significant effect (Fisher, 1935a, p. 14).


The Alternative or "Non-Null" Hypothesis

Evidence against H0 seems to indicate evidence for some alternative.

Fisherian significance tests strictly consider only the H0.

Neyman and Pearson (N-P) tests introduce an alternative H1 (even if only to serve as a direction of departure).

Example. X = (X1, …, Xn), NIID with σ = 1:

H0: μ = 0 vs. H1: μ > 0

Despite the bitter disputes with Fisher that were to erupt soon after ~1935, Neyman and Pearson at first saw their work as merely placing Fisherian tests on firmer logical footing.

Much of Fisher's hostility toward N-P methods reflects professional and personality conflicts more than philosophical differences.


Neyman-Pearson (N-P) Tests

N-P hypothesis test: maps each outcome x = (x1, …, xn) into either the null hypothesis H0 or an alternative hypothesis H1 (where the two exhaust the parameter space) to ensure the probabilities of erroneous rejections (type I errors) and erroneous acceptances (type II errors) are controlled at prespecified values, e.g., 0.05 or 0.01, the significance level of the test.

Test T(α): X = (X1, …, Xn), NIID with σ = 1, H0: μ = μ0 vs. H1: μ > μ0

if d(x0) > cα, "reject" H0 (or declare the result statistically significant at the α level);
if d(x0) ≤ cα, "accept" H0,

e.g., cα = 1.96 for α = .025.

"Accept/Reject" are uninterpreted parts of the mathematical apparatus.

Type I error probability = P(d(X) > cα; H0) ≤ α.

The Type II error probability:
P(Test T(α) does not reject H0; μ = μ1) = P(d(X) ≤ cα; μ = μ1) = β(μ1), for any μ1 > μ0.


The "best" test at level α at the same time minimizes the value of β for all μ1 > μ0, or equivalently, maximizes the power:

POW(T(α); μ1) = P(d(X) > cα; μ1)

T(α) is a Uniformly Most Powerful (UMP) level α test.
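To make the error probabilities of test T(α) concrete, here is a minimal sketch for the one-sided Normal test with σ = 1. The sample size n = 25 and the alternative μ1 = .2 are assumptions borrowed from examples elsewhere in the talk, and the helper names (`cutoff`, `power`, `type_ii_error`) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard Normal distribution of d(X) under H0


def cutoff(alpha):
    """c_alpha such that P(d(X) > c_alpha; H0) = alpha."""
    return Z.inv_cdf(1.0 - alpha)


def power(mu1, mu0=0.0, sigma=1.0, n=25, alpha=0.025):
    """POW(T(alpha); mu1) = P(d(X) > c_alpha; mu = mu1)."""
    shift = (mu1 - mu0) / (sigma / sqrt(n))   # mean of d(X) when mu = mu1
    return 1.0 - Z.cdf(cutoff(alpha) - shift)


def type_ii_error(mu1, **kwargs):
    """beta(mu1) = P(d(X) <= c_alpha; mu = mu1) = 1 - power."""
    return 1.0 - power(mu1, **kwargs)


type_i = power(0.0)        # power evaluated at mu0 is the type I error probability
pow_02 = power(0.2)        # assumed alternative mu1 = .2
beta_02 = type_ii_error(0.2)
print(f"c = {cutoff(0.025):.2f}, alpha = {type_i:.3f}, "
      f"POW(.2) = {pow_02:.3f}, beta(.2) = {beta_02:.3f}")
```

Evaluating the power function at μ0 recovers α, which is one way to see that the type I and type II error probabilities are two readings of the same curve.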


Inductive Behavior Philosophy

Philosophical issues and debates arise once one begins to consider the interpretations of the formal apparatus.

"Accept/Reject" are identified with deciding to take specific actions, e.g., publishing a result, announcing a new effect.

The justification for optimal tests is that

"it may often be proved that if we behave according to such a rule ... we shall reject H when it is true not more, say, than once in a hundred times, and in addition we may have evidence that we shall reject H sufficiently often when it is false."

Neyman: Tests are not rules of inductive inference but rules of behavior:

The goal is not to adjust our beliefs but rather to adjust our behavior to limited amounts of data.

Is he just drawing a stark contrast between N-P tests and Fisherian as well as Bayesian methods? Or is the behavioral interpretation essential to the tests?


The "Inductive behavior" vs. "Inductive inference" battle commingles philosophical, statistical and personality clashes.

Fisher (1955) denounced the way that Neyman and Pearson transformed "his" significance tests into "acceptance procedures".

They've turned my tests into mechanical rules or "recipes" for "deciding" to accept or reject statistical hypothesis H0.

The concern has more to do with speeding up production or making money than in learning about phenomena.


N-P followers are like:

"Russians (who) are made familiar with the ideal that research in pure science can and should be geared to technological performance, in the comprehensive organized effort of a five-year plan for the nation." (1955, 70)

"In the U.S. also the great importance of organized technology has I think made it easy to confuse the process appropriate for drawing correct conclusions, with those aimed rather at speeding production, or saving money."


Pearson distanced himself from Neyman's "inductive behavior" jargon, calling it "Professor Neyman's field rather than mine".

But the most impressive mathematical results were in the decision-theoretic framework of Neyman-Pearson-Wald.

Many of the qualifications by Neyman and Pearson in the first wave are overlooked in the philosophy of statistics literature.

Admittedly, these "evidential" practices were not made explicit*. (Had they been, the subsequent waves of philosophy of statistics might have looked very different.)

*Mayo's goal in ~1978


The Second Wave: ~1955/60-1980

"Post-data criticisms of N-P methods":

Ian Hacking (1965) framed the main lines of criticism by philosophers: "Neyman-Pearson tests as suitable for before-trial betting, but not for after-trial evaluation" (p. 99).

Battles: "initial precision vs. final precision", "before-data vs. after-data".

After the data, he claimed, the relevant measure of support is the (relative) likelihood.

Two data sets x and y may afford the same "support" to H, yet warrant different inferences [on significance test reasoning] because x and y arose from tests with different error probabilities.

o This is just what error statisticians want!


o But (at least early on) Hacking (1965) held to the "Law of Likelihood": x0 supports hypothesis H1 more than H2 if

P(x0; H1) > P(x0; H2).

Yet, as Barnard notes, "there always is such a rival hypothesis: That things just had to turn out the way they actually did".

Since such a maximally likely alternative H2 can always be constructed, H1 may always be found less well supported, even if H1 is true: no error control.

Hacking soon rejected the likelihood approach on such grounds; likelihoodist accounts are advocated by others.


Perhaps THE key issue of controversy in the philosophy of statistics battles:

The (strong) likelihood principle: likelihoods suffice to convey "all that the data have to say".

"According to Bayes's theorem, P(x|μ) ... constitutes the entire evidence of the experiment, that is, it tells all that the experiment has to tell. More fully and more precisely, if y is the datum of some other experiment, and if it happens that P(x|μ) and P(y|μ) are proportional functions of μ (that is, constant multiples of each other), then each of the two data x and y have exactly the same thing to say about the values of μ" (Savage 1962, p. 17).

But the error probabilist needs to consider, in addition, the sampling distribution of the likelihoods.

Significance levels and other error probabilities all violate the likelihood principle (Savage 1962).
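A standard textbook illustration (not taken from these slides) may help show what "proportional likelihoods" and "violating the LP" amount to. In the sketch below, the data (3 successes in 12 Bernoulli trials), the parameter name `theta`, and all function names are hypothetical choices. The same data arise either from fixing the number of trials at 12 (binomial) or from sampling until the 3rd success (negative binomial); the likelihoods are constant multiples of each other, yet the p-values for H0: θ = .5 against θ < .5 differ.

```python
from math import comb


def binomial_likelihood(theta, successes=3, trials=12):
    """Likelihood when the number of trials is fixed in advance."""
    return comb(trials, successes) * theta**successes * (1 - theta)**(trials - successes)


def negative_binomial_likelihood(theta, successes=3, trials=12):
    """Likelihood when sampling continues until the `successes`-th success."""
    return comb(trials - 1, successes - 1) * theta**successes * (1 - theta)**(trials - successes)


def binomial_p_value(successes=3, trials=12, theta0=0.5):
    """P(X <= successes; theta0): as few or fewer successes in a fixed number of trials."""
    return sum(comb(trials, k) * theta0**k * (1 - theta0)**(trials - k)
               for k in range(successes + 1))


def negative_binomial_p_value(successes=3, trials=12, theta0=0.5):
    """P(N >= trials; theta0): fewer than `successes` successes in the first trials - 1."""
    m = trials - 1
    return sum(comb(m, k) * theta0**k * (1 - theta0)**(m - k)
               for k in range(successes))


ratios = [binomial_likelihood(t) / negative_binomial_likelihood(t)
          for t in (0.1, 0.3, 0.5, 0.7)]
p_binomial = binomial_p_value()
p_negative_binomial = negative_binomial_p_value()
print("likelihood ratios:", [round(r, 6) for r in ratios])
print(f"p-values: binomial {p_binomial:.4f}, negative binomial {p_negative_binomial:.4f}")
```

The likelihood ratio is the same constant at every θ, so an LP adherent treats the two experiments as evidentially equivalent, while the two sampling distributions yield different significance levels.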


Paradox of Optional Stopping

Instead of fixing the sample size n in advance, in some tests n is determined by a stopping rule:

In Normal testing, 2-sided H0: μ = 0 vs. H1: μ ≠ 0:
Keep sampling until H0 is rejected at the .05 level (i.e., keep sampling until |X̄| ≥ 1.96 σ/√n).

Nominal vs. actual significance levels: with n fixed, the type I error probability is .05. With this stopping rule the actual significance level differs from, and will be greater than, .05.

By contrast, since likelihoods are unaffected by the stopping rule, the LP follower denies there really is an evidential difference between the two cases (i.e., n fixed and n determined by the stopping rule).

Should it matter if I decided to toss the coin 100 times and happened to get 60% heads, or if I decided to keep tossing until I could reject at the .05 level (2-sided) and this happened to occur on trial 100?

Should it matter if I kept going until I found statistical significance?

Error statistical principles: Yes! Penalty for perseverance! The LP says NO!
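The gap between nominal and actual significance levels can be seen in a small simulation. This is a rough sketch under stated assumptions, not a calculation from the slides: data are generated with H0 true (μ = 0, σ = 1), sampling is capped at a hypothetical maximum of 100 observations, and the function names, trial count, and seed are arbitrary.

```python
import random
from math import sqrt


def rejects_under_optional_stopping(rng, max_n, z_cut=1.96):
    """Draw N(0, 1) observations one at a time (so H0: mu = 0 is true) and stop
    as soon as |mean| >= z_cut / sqrt(n); give up after max_n observations."""
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(0.0, 1.0)
        if abs(total / n) >= z_cut / sqrt(n):
            return True          # a nominally .05-significant result was reached
    return False


def rejects_with_fixed_n(rng, n, z_cut=1.96):
    """Ordinary two-sided test with the sample size fixed in advance."""
    total = sum(rng.gauss(0.0, 1.0) for _ in range(n))
    return abs(total / n) >= z_cut / sqrt(n)


def rejection_rate(rule, trials=2000, seed=0, **kwargs):
    """Fraction of simulated experiments in which `rule` rejects H0."""
    rng = random.Random(seed)
    return sum(rule(rng, **kwargs) for _ in range(trials)) / trials


fixed_rate = rejection_rate(rejects_with_fixed_n, n=100)
optional_rate = rejection_rate(rejects_under_optional_stopping, max_n=100)
print(f"fixed n = 100: {fixed_rate:.3f}; try-and-try-again up to n = 100: {optional_rate:.3f}")
```

With the sample size fixed, the simulated rejection rate stays near the nominal .05; with the stopping rule, a true null is rejected far more often, and the rate keeps growing as the cap on n is raised.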


Savage Forum 1959: Savage audaciously declares that the lesson to draw from the optional stopping effect is that "optional stopping is no sin", so the problem must lie with the use of significance levels. But why accept the likelihood principle (LP)? (simplicity and freedom?)

"The likelihood principle emphasized in Bayesian statistics implies that the rules governing when data collection stops are irrelevant to data interpretation. It is entirely appropriate to collect data until a point has been proved or disproved" (p. 193). "This irrelevance of stopping rules to statistical inference restores a simplicity and freedom to experimental design that had been lost by classical emphasis on significance levels (in the sense of Neyman and Pearson)" (Edwards, Lindman, Savage 1963, p. 239).

For frequentists this only underscores the point raised years before by Pearson and Neyman:

"A likelihood ratio (LR) may be a criterion of relative fit but it is still necessary to determine its sampling distribution in order to control the error involved in rejecting a true hypothesis, because a knowledge of [LR] alone is not adequate to insure control of this error" (Pearson and Neyman, 1930, p. 106).


The key difference: likelihood fixes the actual outcome, i.e., just d(x), while error statistics considers outcomes other than the one observed in order to assess the error properties.

LP → irrelevance of, and no control over, error probabilities.

("why you cannot be just a little bit Bayesian", EGEK 1996)

Update: A famous argument (Birnbaum, 1962) purports to show that plausible error statistical principles entail the LP!

"Radical!" "Breakthrough!" (since the LP entails the irrelevance of error probabilities!)

But the "proof" is flawed! (Mayo 2010; see blog).


The Statistical Significance Test Controversy

(Morrison and Henkel, 1970): contributors chastise social scientists for slavish use of significance tests.
o Focus on simple Fisherian significance tests
o Philosophers direct criticisms mostly to N-P tests.

Fallacies of Rejection: Statistical vs. Substantive Significance

(i) Take statistical significance as evidence of substantive theory that explains the effect.
(ii) Infer a discrepancy from the null beyond what the test warrants.

(i) Paul Meehl: It is fallacious to go from a statistically significant result, e.g., at the .001 level, to infer that "one's substantive theory T, which entails the [statistical] alternative H1, has received ... quantitative support of magnitude around .999".

A statistically significant difference (e.g., in child rearing) is not automatically evidence for a Freudian theory.

T is subjected to only "a feeble risk", violating Popper.


Fallacies of rejection:

(i) Take statistical significance as evidence of substantive theory that explains the effect.
(ii) Infer a discrepancy from the null beyond what the test warrants.

Finding a statistically significant effect, d(x0) > cα (cut-off for rejection), need not be indicative of large or meaningful effect sizes: the test is too sensitive.

Large n Problem: an α-significant rejection of H0 can be very probable, even with a substantively trivial discrepancy from H0.

This is often taken as a criticism because it is assumed that statistical significance at a given level is more evidence against the null the larger the sample size (n): a fallacy!

"The thesis implicit in the [N-P] approach [is] that a hypothesis may be rejected with increasing confidence or reasonableness as the power of the test increases" (Howson and Urbach 1989 and later editions).

In fact, it is indicative of less of a discrepancy from the null than if it resulted from a smaller sample size.


(Analogy with a smoke detector: an alarm from one that often goes off from merely burnt toast (overly powerful or sensitive) vs. an alarm from one that rarely goes off unless the house is ablaze.)

Comes also in the form of the "Jeffreys-Good-Lindley" paradox.

Even a highly statistically significant result can, with n sufficiently large, correspond to a high posterior probability for a null hypothesis.


Fallacy of Non-Statistically Significant Results

Test T(α) fails to reject the null when the test statistic fails to reach the cut-off point for rejection, i.e., d(x0) ≤ cα.

A classic fallacy is to construe such a negative result as evidence FOR the correctness of the null hypothesis (common in risk assessment contexts).

"No evidence against" is not "evidence for".

Merely surviving the statistical test is too easy, occurs too frequently, even when the null is false.

This results from tests lacking sufficient sensitivity or power.

The Power Analytic Movement of the 60s in psychology

Jacob Cohen: By considering ahead of time the power of the test, select a test capable of detecting discrepancies of interest.

Pre-data use of power (for planning).


A multitude of tables were supplied (Cohen, 1988), but until his death he bemoaned their all-too-rare use.

(Power is a feature of N-P tests, but apparently the prevalence of Fisherian tests in the social sciences, coupled, perhaps, with the difficulty in calculating power, resulted in ignoring power. There was also the fact that they were not able to get decent power in psychology; they turned to meta-analysis.)


Post-data use of power to avoid fallacies of insensitive tests

If there's a low probability of a statistically significant result, even if a non-trivial discrepancy is present (low power against the non-trivial discrepancy), then a non-significant difference is not good evidence that a non-trivial discrepancy is absent.

Still too coarse: power is always calculated relative to the cut-off point cα for rejecting H0.

Consider test T(α), σ = 1, n = 25, and let the non-trivial discrepancy be .2.

No matter what the non-significant outcome, power to detect the non-trivial discrepancy is only .16!

So we'd have to deny the data were good evidence that μ < .2.

This suggested to me (in writing my dissertation around 1978) that rather than calculating

(1) P(d(X) > cα; μ = .2)   Power

one should calculate

(2) P(d(X) > d(x0); μ = .2).   observed power (severity)

Even if (1) is low, (2) may be high. We return to this in the developments of Wave III.


III. The Third Wave: Relativism, Reformulations, Reconciliations ~1980-2005+

(skip) Rational Reconstruction and Relativism in Philosophy of Science

Fighting Kuhnian battles to the very idea of a unified method of scientific inference, statistical inference is less prominent in philosophy; it is largely used in rational reconstructions of scientific episodes, in appraising methodological rules, and in classic philosophical problems, e.g., Duhem's problem: reconstruct a given assignment of blame so as to be warranted by Bayesian probability assignments.

No normative force.

With the recognition that science involves subjective judgments and values, reconstructions often appeal to a subjective Bayesian account (Salmon's "Tom Kuhn Meets Tom Bayes").

(Kuhn thought this was confused: no reason to suppose an algorithm remains through theory change.)

Naturalisms, HPS


Wave III in Scientific Practice

Statisticians turn to eclecticism.

Non-statistician practitioners (e.g., in psychology, ecology, medicine) bemoan "unholy hybrids":

"a mixture of ideas from N-P methods, Fisherian tests, and Bayesian accounts that is inconsistent from both perspectives and burdened with conceptual confusion" (Gigerenzer, 1993, p. 323).

Faced with foundational questions, non-statistician practitioners raise anew the questions from the first and second waves.

Finding the automaticity and fallacies still rampant, most, if they are not calling for an outright ban on significance tests in research, insist on reforms and reformulations of statistical tests.

Task Force to consider Test Ban in Psychology: 1990s


Reforms and Reinterpretations Within Error Probability Statistics

Any adequate reformulation must:

(i) Show how to avoid classic fallacies (of acceptance and of rejection) on principled grounds,
(ii) Show that it provides an account of inductive inference.


Avoiding Fallacies

To quickly note my own recommendation (for test T(α)): Move away from the coarse accept/reject rule; use the specific result (significant or insignificant) to infer those discrepancies from the null that are well ruled-out, and those which are not.

e.g., Interpretation of Non-Significant results:

If d(x) is not statistically significant, and the test had a very high probability of a more statistically significant difference if μ > μ0 + γ, then d(x) is good grounds for inferring μ ≤ μ0 + γ. Use the specific outcome to infer an upper bound μ* (values beyond are ruled out by given severity).

If d(x) is not statistically significant, but the test had a very low probability of a more statistically significant difference if μ > μ0 + γ, then d(x) is poor evidence for inferring μ ≤ μ0 + γ. The test had too little probative power to have detected such discrepancies even if they existed!


Takes us back to the post-data version of power:

Rather than construe "a miss as good as a mile", parity of logic suggests that the post-data power assessment should replace the usual calculation of power against μ1:

POW(T(α), μ1) = P(d(X) > cα; μ = μ1),

with what might be called the power actually attained or, to have a distinct term, the severity (SEV):

SEV(T(α), μ1) = P(d(X) > d(x0); μ = μ1),

where d(x0) is the observed (non-statistically significant) result.


Figure 1 compares power and severity for different outcomes.

Figure 1. POW(T(α), μ1 = .2) = .168, irrespective of the value of d(x0); solid curve: the severity evaluations are data-specific.

The severity for the inference μ < .2:
Both X̄ = .39 and X̄ = -.2 fail to reject H0, but
with X̄ = .39, SEV(μ < .2) is low (.17);
with X̄ = -.2, SEV(μ < .2) is high (.97).
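The values quoted for Figure 1 can be reproduced directly from the definitions of power and severity. The sketch below is illustrative only; it assumes the setup used in the running example (σ = 1, n = 25, μ0 = 0, cα = 1.96), and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power(mu1, mu0=0.0, sigma=1.0, n=25, c_alpha=1.96):
    """POW(T(alpha), mu1) = P(d(X) > c_alpha; mu = mu1)."""
    sigma_x = sigma / sqrt(n)
    cutoff_mean = mu0 + c_alpha * sigma_x          # sample mean at the rejection cut-off
    return 1.0 - NormalDist(mu=mu1, sigma=sigma_x).cdf(cutoff_mean)


def severity_upper_bound(xbar, mu1, sigma=1.0, n=25):
    """SEV(mu < mu1) = P(d(X) > d(x0); mu = mu1) for a non-significant result."""
    sigma_x = sigma / sqrt(n)
    return 1.0 - NormalDist(mu=mu1, sigma=sigma_x).cdf(xbar)


pow_02 = power(0.2)
sev = {xbar: severity_upper_bound(xbar, mu1=0.2) for xbar in (0.39, -0.2)}
print(f"POW(mu1 = .2) = {pow_02:.3f}")
for xbar, value in sev.items():
    print(f"observed mean {xbar:+.2f}: SEV(mu < .2) = {value:.3f}")
```

Power depends only on the cut-off, so it is the same for both outcomes; severity uses the observed mean, which is why the two non-significant results warrant such different inferences about μ < .2.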


Fallacies of Rejection: The Large n-Problem

While with a nonsignificant result the concern is erroneously inferring that a discrepancy from μ0 is absent, with a significant result x0 the concern is erroneously inferring that it is present.

Utilizing the severity assessment, an α-significant difference with n1 passes μ > μ1 less severely than with n2, where n1 > n2.

Figure 2 compares test T(α) with three different sample sizes: n = 25, n = 100, n = 400, denoted by T(α, n), where in each case d(x0) = 1.96 (reject at the cut-off point).

In this way we solve the problems of tests "too sensitive" or "not sensitive enough", but there's one more thing ... showing how it supplies an account of inductive inference.

Many argue in wave III that error statistical methods cannot supply an account of inductive inference because error probabilities conflict with posterior probabilities.


Figure 2. In test T(α) (H0: μ ≤ 0 against H1: μ > 0, and σ = 1), α = .025, cα = 1.96 and d(x0) = 1.96.

The severity for the inference μ > .1:
n = 25, SEV(μ > .1) is .93
n = 100, SEV(μ > .1) is .83
n = 400, SEV(μ > .1) is .5
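The three severity values for Figure 2 follow from one formula once the observed mean is expressed through d(x0) = 1.96. The sketch below assumes the inference under assessment is μ > .1 (the claim for which the listed values .93, .83, and .5 come out), with σ = 1 and μ0 = 0; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def severity_lower_bound(mu1, n, d_obs=1.96, mu0=0.0, sigma=1.0):
    """SEV(mu > mu1) = P(d(X) <= d(x0); mu = mu1) for a significant result."""
    sigma_x = sigma / sqrt(n)
    xbar = mu0 + d_obs * sigma_x      # observed mean that sits exactly at the cut-off
    return NormalDist(mu=mu1, sigma=sigma_x).cdf(xbar)


sev_by_n = {n: severity_lower_bound(0.1, n) for n in (25, 100, 400)}
for n, value in sev_by_n.items():
    print(f"n = {n:3d}: observed mean = {1.96 / sqrt(n):.3f}, SEV(mu > .1) = {value:.3f}")
```

The same d(x0) = 1.96 corresponds to an ever smaller observed mean as n grows, so the claim μ > .1 passes with less and less severity.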


P-values vs. Bayesian Posteriors

A statistically significant difference from H0 can correspond to large posteriors in H0. From the Bayesian perspective, it follows that p-values come up short as a measure of inductive evidence;

the significance testers balk at the recommended priors resulting in highly significant results being construed as no evidence against the null, or even evidence for it!

The conflict often considers the two-sided T(2α) test

H0: μ = μ0 vs. H1: μ ≠ μ0.

(The differences between p-values and posteriors are far less marked with one-sided tests.)

Assuming a prior of .5 to H0, with n = 50 one can classically "reject H0 at significance level p = .05", although P(H0|x) = .52 (which would actually indicate that the evidence favors H0).

This is taken as a criticism of p-values only because it is assumed the .52 posterior is the appropriate measure of the beliefworthiness.


As the sample size increases, the conflict becomes more noteworthy.

If n = 1000, a result statistically significant at the .05 level leads to a posterior to the null of .82!

SEV(H1) = .95 while the corresponding posterior has gone from .5 to .82. What warrants such a prior?

Posterior probability P(H0|x), by p-value, t, and sample size n:

p      t      n=10   n=20   n=50   n=100  n=1000
.10    1.645  .47    .56    .65    .72    .89
.05    1.960  .37    .42    .52    .60    .82
.01    2.576  .14    .16    .22    .27    .53
.001   3.291  .024   .026   .034   .045   .124
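For readers who want to see where posteriors of this kind come from, here is a hedged sketch. The slides do not state how the remaining .5 of prior probability is spread over the alternative; the code assumes a Normal prior centered at μ0 with variance σ² under H1 (a common textbook construction), and the function name `posterior_null` is hypothetical. Under that assumption the p = .05 row is reproduced to about two decimals.

```python
from math import exp, sqrt


def posterior_null(t, n, prior_null=0.5):
    """P(H0 | x) for the point null H0: mu = mu0 given the test statistic t.

    Assumption (not stated in the source): under H1, mu ~ N(mu0, sigma^2), so the
    sample mean has variance sigma^2/n under H0 and sigma^2 (1 + 1/n) under H1.
    """
    bayes_factor_01 = sqrt(1 + n) * exp(-0.5 * t**2 * n / (1 + n))  # evidence for H0 over H1
    prior_odds = prior_null / (1 - prior_null)
    posterior_odds = prior_odds * bayes_factor_01
    return posterior_odds / (1 + posterior_odds)


row_05 = {n: posterior_null(1.960, n) for n in (10, 20, 50, 100, 1000)}
for n, value in row_05.items():
    print(f"t = 1.960, n = {n:4d}: P(H0 | x) = {value:.2f}")
```

The Bayes factor grows like the square root of n for a fixed t, which is why a result that stays "just significant" yields an ever larger posterior for the null as the sample size increases.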

(1) Some claim the prior of .5 is a warranted frequentist assignment:

H0 was randomly selected from an urn in which 50% are true.

(*) Therefore P(H0) = p


H0 may be 0 change in extinction rates, 0 lead concentration, etc.

What should go in the urn of hypotheses?

For the frequentist: either H0 is true or false; the probability in (*) is fallacious and results from an unsound instantiation.

We are very interested in how false it might be, which is what we can assess by means of a severity assessment.

(2) Subjective degree of belief assignments will not ensure the error probability, and thus the severity, assessments we need.

(3) Some suggest an "impartial" or "uninformative" Bayesian prior gives .5 to H0, the remaining .5 probability being spread out over the alternative parameter space (Jeffreys).

This "spiked concentration of belief in the null" is at odds with the prevailing view "we know all nulls are false".

The Bayesian recently co-opts 'error probability' to describe a posterior, but it is not a frequentist error probability, which is measuring something very different.


Fisher: The Function of the p-Value Is Not Capable of Finding Expression

Faced with conflicts between error probabilities and Bayesian posterior probabilities, the error probabilist would conclude that the flaw lies with the latter measure.

Fisher: Discussing a test of the hypothesis that the stars are distributed at random, Fisher takes the low p-value (about 1 in 33,000) to "exclude at a high level of significance any theory involving a random distribution" (Fisher, 1956, page 42).

Even if one were to imagine that H0 had an extremely high prior probability, Fisher continues (never minding "what such a statement of probability a priori could possibly mean"), the resulting high posteriori probability to H0, he thinks, would only show that "reluctance to accept a hypothesis strongly contradicted by a test of significance" (ibid, page 44) . . . "is not capable of finding expression in any calculation of probability a posteriori" (ibid, page 43).


Wave IV? 2006+: The Reference Bayesians Abandon Coherence, the LP, and strive to match frequentist error probabilities!

Contemporary "Impersonal" Bayesianism

Because of the difficulty of eliciting subjective priors, and because of the reluctance among scientists to allow subjective beliefs to be conflated with the information provided by data, much current Bayesian work in practice favors conventional "default", "uninformative", or "reference" priors.

1. What do reference posteriors measure?

A classic conundrum: there is no unique "noninformative" prior. (Supposing there is one leads to inconsistencies in calculating posterior marginal probabilities.)

Any representation of ignorance or lack of information that succeeds for one parameterization will, under a different parameterization, entail having knowledge.

Contemporary "reference" Bayesians seek priors that are simply conventions to serve as weights for reference posteriors:


not to be considered expressions of uncertainty, ignorance, or degree of belief;
may not even be probabilities; flat priors may not sum to one (improper prior). If priors are not probabilities, what then is the interpretation of a posterior? (a serious problem I would like to see Bayesian philosophers tackle).

2. Priors for the same hypothesis change according to what experiment is to be done! Bayesian incoherence

If the prior is to represent information, why should it be influenced by the sample space of a contemplated experiment?

Violates the likelihood principle, the cornerstone of Bayesian coherency.

Reference Bayesians: it is "the price of objectivity".

This seems to wreak havoc with basic Bayesian foundations, but without the payoff of an objective, interpretable output; even subjective Bayesians object.


3. Reference posteriors with good frequentist properties

Reference priors are touted as having some good frequentist properties, at least in one-dimensional problems.

They are deliberately designed to match frequentist error probabilities.

If you want error probabilities, why not use techniques that provide them directly?

Note: using conditional probability, which is part and parcel of probability theory, as in "Bayes nets", does not make one a Bayesian: no priors to hypotheses.
