Date post: | 14-Apr-2018 |
Upload: | ahmed22gouda22 |
7/27/2019 Mayo 2-19-13seminar WhatIsThePhilosophyOfStatistics
What is the Philosophy of Statistics?
At one level of analysis at least, statisticians and philosophers of science ask many of the same questions:
What should be observed and what may justifiably be
inferred from the resulting data?
How well do data confirm or fit a model?
What is a good test?
Must predictions be novel in some sense? (selection
effects, double counting, data mining)
How can spurious relationships be distinguished from
genuine regularities? from causal regularities?
How can we infer more accurate and reliable
observations from less accurate ones?
When does a fitted model account for regularities in the
data?
That these very general questions are entwined with longstanding debates in philosophy of science helps to explain why the field of statistics tends to cross over so often into philosophical territory.
Statistics → Philosophy
3 ways statistical accounts are used in philosophy of science
(1) Model Scientific Inference: to capture either the actual or rational ways to arrive at evidence and inference
(2) Resolve Philosophical Problems about scientific inference, observation, experiment
(problem of induction, objectivity of observation, reliable evidence, Duhem's problem, underdetermination).
(3) Perform a Metamethodological Critique: scrutinize methodological rules, e.g., accord special weight to "novel" facts, avoid ad hoc hypotheses, avoid "data mining", require randomization.
Philosophy → Statistics
Central job: to help resolve the conceptual, logical, and methodological discomforts of scientists as to how to make reliable inferences despite uncertainties and errors.
Philosophy of statistics and the goal of a philosophy of science relevant for philosophical problems in scientific practice
Fresh methodological problems arise in practice surrounding a panoply of methods and models relied on
to learn from incomplete, and often non-experimental,
data. Examples abound:
Disputes over hypothesis-testing in psychology (e.g., the
recently proposed significance test ban);
Disputes over the proper uses of regression in applied
statistics;
Disputes over dose-response curves in estimating risks;
Disputes about the use of computer simulations in
observational sciences;
Disputes about external validity in experimental
economics; and,
Across the huge landscape of fields using the latest, high-powered, computer methods, there are disputes about
data-mining, algorithmic searches, and model validation.
Equally important are the methodological
presuppositions that are not, but perhaps ought to be,
disputed, debated, or at least laid out in the open,
often, ironically, in the very fields in which philosophers
of science immerse themselves.
I used to teach a course in this department: philosophy of
science and economic methodology.
We read how many economic methodologists questioned the value of philosophy of science:
"If philosophers and others within science theory can't
agree about the constitution of the scientific method (or
even whether asking about a 'scientific method' makes
any sense), doesn't it seem a little dubious for economists to continue blithely taking things off the shelf
and attempting to apply them to economics?" (Hands,
2001, p. 6).
Deciding that it is, methodologists of economics
increasingly look to sociology of science, rhetoric,
evolutionary psychology.
The problem is not merely how this cuts philosophers of
science out of being engaged in methodological practice;
equally serious, is how it encourages practitioners to
assume there are no deep epistemological problems with
the ways they collect and base inferences on data.
"Professional agreement on statistical philosophy is not
on the immediate horizon, but this should not stop us
from agreeing on methodology," as if what is correct
methodologically does not depend on what is correct
philosophically (Berger, 2003, p. 2).
In addition to the resurgence of the age-old
controversies (significance tests vs. confidence
intervals, frequentist vs. Bayesian measures), the
latest statistical modeling techniques have introduced brand new methodological issues.
High-powered computer science packages offer a
welter of algorithms for automatically selecting among
this explosion of models, but as each boasts different,
and incompatible, selection criteria, we are thrown back
to the basic question of inductive inference: what is required to severely discriminate among well-fitting
models such that, when a claim (or hypothesis or model)
survives a test, the resulting data count as good evidence
for the claim's correctness or dependability or adequacy?
A romp through 4 "waves in philosophy of statistics"
History and philosophy of statistics is a huge territory
marked by 70 years of debates widely known for reaching
unusual heights both of passion and of technical
complexity.
Wave I ~ 1930-1955/60
Wave II ~ 1955/60-1980
Wave III ~ 1980-2005 & beyond
Wave IV ~ 2006 and beyond
A core question: What is the nature and role of
probabilistic concepts, methods, and models in making
inferences in the face of limited data, uncertainty and error?
1. Two Roles For Probability: Degrees of Confirmation and Degrees of Well-Testedness
a. To provide a post-data assignment of degree of
probability, confirmation, support or belief in a
hypothesis;
b. To assess the probativeness, reliability,
trustworthiness, or severity of a test or inference
procedure.
These two contrasting philosophies of the role of
probability in statistical inference are very much at the
heart of the central points of controversy in the three
waves of philosophy of statistics
Having conceded loss in the battle for justifying induction,
philosophers appeal to logic to capture scientific method
Inductive Logics (Carnap, C(H,e))
Confirmation Theory: rules to assign degrees of probability or confirmation to hypotheses given evidence e.
Inductive Logicians: we can build and try to justify "inductive logics" (straight rule: assign degrees of confirmation/credibility).
Statistical affinity: Bayesian (and likelihoodist) accounts.

Logic of falsification (Popper)
Methodological falsification: rules to decide when to "prefer" or accept hypotheses.
Deductive Testers: we can reject induction and uphold the rationality of preferring or accepting H if it is "well tested".
Statistical affinity: Fisherian, Neyman-Pearson methods: probability enters to ensure reliability and severity of tests with these methods.
I. Philosophy of Statistics: The First Wave
WAVE I: circa 1930-1955: Fisher, Neyman, Pearson, Savage, and Jeffreys.
Statistical inference tools use data x0 to probe aspects of the
data generating source:
In statistical testing, these aspects are in terms of statistical
hypotheses about parameters governing a statistical distribution.
H tells us the "probability of x under H", written P(x;H)
(probabilistic assignments under a model).
Important to avoid confusion with conditional probabilities in
Bayes's theorem, P(x|H).
Testing model assumptions is extremely important, though we will
not discuss it here.
Modern Statistics Begins with Fisher:
Simple Significance Tests
Example. Let the sample X = (X1, …, Xn) be IID from a
Normal distribution (NIID) with σ = 1.
1. A null hypothesis H0: μ = 0
e.g., 0 mean concentration of lead, no difference in mean survival in a given group, in mean risk, mean deflection of
light.
2. A function of the sample, d(X), the test statistic, which
reflects the difference between the data x0 = (x1, …, xn) and H0.
The larger d(x0), the further the outcome is from what is
expected under H0, with respect to the particular question being
asked.
3. The p-value is the probability of a difference larger than
d(x0), under the assumption that H0 is true:
p(x0) = P(d(X) > d(x0); H0)
The observed significance level (p-value) with observed X̄ = .1:
p(x0) = P(d(X) > d(x0); H0).
The relevant test statistic d(X) is:
d(X) = (X̄ − μ0)/σ_X̄ = (Observed − Expected (under H0))/σ_X̄,
where X̄ is the sample mean with standard deviation σ_X̄ = σ/√n.
Since σ_X̄ = σ/√n = 1/5 = .2, d(X) = .1 − 0 in units of σ_X̄
yields
d(x0) = .1/.2 = .5
Under the null, d(X) is distributed as standard Normal,
denoted by d(X) ~ N(0,1).
(Area to the right of .5) ≈ .3, i.e., not very significant.
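The arithmetic on this slide can be checked with a short sketch. It assumes n = 25 (inferred from σ/√n = 1/5 with σ = 1); the function name `p_value` is illustrative and not part of the lecture.

```python
from math import sqrt
from statistics import NormalDist


def p_value(xbar, mu0, sigma, n):
    """Return d(x0) and the one-sided p-value P(d(X) > d(x0); H0)
    for a Normal mean with known sigma."""
    se = sigma / sqrt(n)           # standard deviation of the sample mean
    d_obs = (xbar - mu0) / se      # observed minus expected, in standard units
    return d_obs, 1.0 - NormalDist().cdf(d_obs)


# Values from the slide; n = 25 is an inferred assumption.
d_obs, p = p_value(xbar=0.1, mu0=0.0, sigma=1.0, n=25)
print(round(d_obs, 2), round(p, 2))
```

The result agrees with the slide's d(x0) = .5 and an upper-tail area of roughly .3.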
Logic of Simple Significance Tests: Statistical Modus Tollens
"Every experiment may be said to exist only in order to
give the facts a chance of disproving the null hypothesis"
(Fisher, 1956, p. 160).
Statistical analogy to the deductively valid pattern modus
tollens:
If the hypothesis H0 is correct then, with high
probability, 1 − p, the data would not be statistically
significant at level p.
x0 is statistically significant at level p.
____________________________
Thus, x0 is evidence against H0, or x0 indicates the falsity of H0.
Fisher described the significance test as a procedure for rejecting the null hypothesis and inferring that the
phenomenon has been "experimentally demonstrated"
once one is able to generate "at will" a statistically
significant effect (Fisher, 1935a, p. 14).
The Alternative or Non-Null Hypothesis
Evidence against H0 seems to indicate evidence for some alternative.
Fisherian significance tests strictly consider only the
H0.
Neyman and Pearson (N-P) tests introduce an
alternative H1 (even if only to serve as a direction of departure).
Example. X = (X1, …, Xn), NIID with σ = 1:
H0: μ = 0 vs. H1: μ > 0
Despite the bitter disputes with Fisher that were to
erupt soon after ~1935, Neyman and Pearson at first saw
their work as merely placing Fisherian tests on firmer
logical footing.
Much of Fisher's hostility toward N-P methods
reflects professional and personality conflicts more than
philosophical differences.
Neyman-Pearson (N-P) Tests
N-P hypothesis test: maps each outcome x = (x1, …, xn) into either the null hypothesis H0, or an alternative hypothesis H1 (where the two exhaust the parameter space) to ensure the probabilities of erroneous rejections (type I errors) and erroneous acceptances (type II errors) are controlled at prespecified values, e.g., 0.05 or 0.01, the significance level of the test.
Test T(α): X = (X1, …, Xn), NIID with σ = 1, H0: μ = μ0 vs. H1: μ > μ0
if d(x0) > cα, "reject" H0 (or declare the result statistically significant at the α level);
if d(x0) < cα, "accept" H0,
e.g., cα = 1.96 for α = .025.
"Accept/Reject" are uninterpreted parts of the mathematical apparatus.
Type I error probability = P(d(X) > cα; H0) ≤ α
The Type II error probability:
P(Test T(α) does not reject H0; μ = μ1) = P(d(X) < cα; μ = μ1) = β(μ1), for any μ1 > μ0.
The "best" test at level α at the same time minimizes the value of β for all μ1 > μ0, or equivalently, maximizes the power:
POW(T(α); μ1) = P(d(X) > cα; μ1)
T(α) is a Uniformly Most Powerful (UMP) level α test
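The error probabilities of the one-sided Normal test can be sketched numerically. This is an illustration under stated assumptions: μ0 = 0, σ = 1, cα = 1.96 follow the slides, while n = 25 and the alternative μ1 = .2 are borrowed from a later slide; the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard Normal


def type_i_error(c_alpha):
    """P(d(X) > c_alpha; H0): probability of erroneously rejecting H0."""
    return 1.0 - Z.cdf(c_alpha)


def power(mu1, mu0=0.0, sigma=1.0, n=25, c_alpha=1.96):
    """POW(T(alpha); mu1) = P(d(X) > c_alpha; mu = mu1)."""
    shift = (mu1 - mu0) * sqrt(n) / sigma  # mean of d(X) when mu = mu1
    return 1.0 - Z.cdf(c_alpha - shift)


def type_ii_error(mu1, **kwargs):
    """beta(mu1) = P(d(X) < c_alpha; mu = mu1) = 1 - power."""
    return 1.0 - power(mu1, **kwargs)


print(round(type_i_error(1.96), 3))   # alpha
print(round(power(0.2), 3))           # power against mu1 = .2
print(round(type_ii_error(0.2), 3))   # beta(.2)
```

Power increases with the discrepancy μ1 − μ0 and with n, which is why it is always stated relative to a specific alternative.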
Inductive Behavior Philosophy
Philosophical issues and debates arise once one begins to
consider the interpretations of the formal apparatus
"Accept/Reject" are identified with deciding to take
specific actions, e.g., publishing a result, announcing a
new effect.
The justification for optimal tests is that
"it may often be proved that if we behave according to
such a rule ... we shall reject H when it is true not more,
say, than once in a hundred times, and in addition we may
have evidence that we shall reject H sufficiently often
when it is false."
Neyman: Tests are not rules of inductive inference but rules of
behavior:
The goal is not to adjust our beliefs but rather to adjust our
behavior to limited amounts of data.
Is he just drawing a stark contrast between N-P tests and Fisherian as well as Bayesian methods? Or is the behavioral
interpretation essential to the tests?
The "inductive behavior" vs. "inductive inference" battle
commingles philosophical, statistical and personality clashes.
Fisher (1955) denounced the way that Neyman and
Pearson transformed "his" significance tests into
"acceptance procedures":
They've turned my tests into mechanical rules or "recipes" for "deciding" to accept or reject statistical
hypothesis H0.
The concern has more to do with speeding up
production or making money than with learning about
phenomena.
N-P followers are like:
"Russians (who) are made familiar with the ideal
that research in pure science can and should be geared to technological performance, in the comprehensive
organized effort of a five-year plan for the nation."
(1955, 70)
"In the U.S. also the great importance of
organized technology has I think made it easy to confuse the process appropriate for drawing correct
conclusions, with those aimed rather at speeding
production, or saving money."
Pearson distanced himself from Neyman's
"inductive behavior" jargon, calling it "Professor
Neyman's field rather than mine."
But the most impressive mathematical results were in
the decision-theoretic framework of Neyman-Pearson-
Wald.
Many of the qualifications by Neyman and Pearson
in the first wave are overlooked in the philosophy of
statistics literature.
Admittedly, these evidential practices were not
made explicit *. (Had they been, the subsequent waves of
philosophy of statistics might have looked very different).
*Mayo's goal in ~1978
The Second Wave: ~1955/60 -1980
Post-data criticisms of N-P methods:
Ian Hacking (1965) framed the main lines of criticism by philosophers: Neyman-Pearson tests are "suitable for before-trial betting, but not for after-trial evaluation" (p. 99).
Battles: "initial precision vs. final precision",
"before-data vs. after-data"
After the data, he claimed, the relevant measure of support is
the (relative) likelihood.
Two data sets x and y may afford the same "support" to H, yet warrant different inferences [on significance test reasoning] because x and y arose from tests with different error probabilities.
o This is just what error statisticians want!
o But (at least early on) Hacking (1965) held to the
Law of Likelihood: x0 supports hypothesis H1 more than H2 if
P(x0;H1) > P(x0;H2).
Yet, as Barnard notes, "there always is such a rival hypothesis: That things just had to turn out the way they actually did".
Since such a maximally likely alternative H2 can always be constructed, H1 may always be found less well supported, even if H1 is true: no error control.
Hacking soon rejected the likelihood approach on such grounds, but likelihoodist accounts are advocated by others.
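Barnard's point can be made concrete with a toy sketch. The coin-tossing setup, the sequence `x0`, and both function names are illustrative assumptions rather than anything from the lecture: H1 says the coin is fair, while the rival H2 says the observed sequence was bound to occur.

```python
def likelihood_fair_coin(outcomes):
    """P(x0; H1): probability of this exact sequence if the coin is fair."""
    return 0.5 ** len(outcomes)


def likelihood_had_to_happen(outcomes, predicted):
    """P(x0; H2): H2 asserts that `predicted` was bound to occur,
    so it assigns probability 1 to that sequence and 0 to any other."""
    return 1.0 if outcomes == predicted else 0.0


x0 = "HTTHHTHT"                              # hypothetical observed tosses
l1 = likelihood_fair_coin(x0)
l2 = likelihood_had_to_happen(x0, predicted=x0)  # H2 built after seeing x0
print(l1, l2, l2 > l1)
```

Because H2 is constructed from the data, it beats H1 on likelihood for every possible outcome, including when H1 is true. Likelihood comparison alone does not control the probability of this kind of error.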
Perhaps THE key issue of controversy in the
philosophy of statistics battles
The (strong) likelihood principle: likelihoods suffice to convey "all that the data have to say".
"According to Bayes's theorem, P(x|θ) ... constitutes the entire evidence of the experiment, that is, it tells all that the experiment has to tell. More fully and more precisely, if y is the datum of some other experiment, and
if it happens that P(x|θ) and P(y|θ) are proportional functions of θ (that is, constant multiples of each other), then each of the two data x and y have exactly the same thing to say about the values of θ" (Savage 1962, p. 17).
The error probabilist needs to consider, in addition, the
sampling distribution of the likelihoods.
Significance levels and other error probabilities all
violate the likelihood principle (Savage 1962).
Paradox of Optional Stopping
Instead of fixing the sample size n in advance, in some tests n is
determined by a stopping rule:
In Normal testing, 2-sided H0: μ = 0 vs. H1: μ ≠ 0
Keep sampling until H0 is rejected at the .05 level
(i.e., keep sampling until |X̄| ≥ 1.96 σ/√n).
Nominal vs. actual significance levels: with n fixed the type I error probability is .05.
With this stopping rule the actual significance level differs from, and will be greater than, .05.
By contrast, since likelihoods are unaffected by the stopping rule, the LP follower denies there really is an evidential
difference between the two cases (i.e., n fixed and n determined by the stopping rule).
Should it matter if I decided to toss the coin 100 times and happened to get 60% heads, or if I decided to keep tossing until I could reject at the .05 level (2-sided) and this happened to occur on trial 100?
Should it matter if I kept going until I found statistical significance?
Error statistical principles: Yes! (penalty for perseverance!)
The LP says NO!
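The gap between nominal and actual significance levels can be seen in a small simulation. This is a sketch under assumptions not in the slides: data are generated with H0 true (μ = 0, σ = 1), sampling is capped at 100 observations, and the trial count and seed are arbitrary; the function names are illustrative.

```python
import random
from math import sqrt


def rejects(rng, n_max, optional_stopping, sigma=1.0, cutoff=1.96):
    """Simulate one experiment under H0 (mu = 0).
    With optional stopping, test after every observation and stop at the
    first nominally significant result; otherwise test only at n = n_max."""
    total = 0.0
    significant = False
    for n in range(1, n_max + 1):
        total += rng.gauss(0.0, sigma)
        significant = abs(total / n) >= cutoff * sigma / sqrt(n)
        if optional_stopping and significant:
            return True
    return significant


def rejection_rate(n_max, optional_stopping, trials=2000, seed=1):
    """Relative frequency of rejecting a true H0 over repeated experiments."""
    rng = random.Random(seed)
    hits = sum(rejects(rng, n_max, optional_stopping) for _ in range(trials))
    return hits / trials


fixed = rejection_rate(n_max=100, optional_stopping=False)
stopped = rejection_rate(n_max=100, optional_stopping=True)
print(fixed, stopped)
```

With n fixed the rejection rate stays near the nominal .05, while trying and trying again inflates it several-fold. The likelihood function for any given data set is the same in both designs.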
Savage Forum 1959: Savage audaciously declares that the lesson to draw from the optional stopping effect is that "optional stopping is no sin", so the problem must lie with
the use of significance levels. But why accept the likelihood principle (LP)? (simplicity and freedom?)
"The likelihood principle emphasized in Bayesian statistics implies, that the rules governing when data collection stops are irrelevant to data interpretation. It is entirely appropriate to collect data until a point has been proved or disproved" (p. 193).
"This irrelevance of stopping rules to statistical inference restores a simplicity and freedom to experimental design that had been lost by classical emphasis on significance levels (in the sense of Neyman and Pearson)" (Edwards, Lindman, Savage 1963, p. 239).
For frequentists this only underscores the point raised years before by Pearson and Neyman:
A likelihood ratio (LR) may be a criterion of relative fit but it "is still necessary to determine its sampling distribution in order to control the error involved in rejecting a true hypothesis, because a knowledge of [LR] alone is not adequate to insure control of this error" (Pearson and Neyman, 1930, p. 106).
The key difference: likelihood fixes the actual outcome,
i.e., just d(x), while error statistics considers outcomes other than the one observed in order to assess the error properties.
LP → irrelevance of, and no control over, error probabilities.
("why you cannot be just a little bit Bayesian", EGEK 1996)
Update: A famous argument (Birnbaum, 1962) purports to show that plausible error statistical principles
entail the LP!
"Radical!" "Breakthrough!" (since the LP entails the
irrelevance of error probabilities!)
But the "proof" is flawed! (Mayo 2010; see blog).
The Statistical Significance Test Controversy
(Morrison and Henkel, 1970): contributors chastise social
scientists for slavish use of significance tests
o Focus on simple Fisherian significance tests
o Philosophers direct criticisms mostly to N-P tests.
Fallacies of Rejection: Statistical vs. Substantive Significance
(i) take statistical significance as evidence of
substantive theory that explains the effect
(ii) Infer a discrepancy from the null beyond what the test
warrants
(i) Paul Meehl: It is fallacious to go from a statistically
significant result, e.g., at the .001 level, to infer that "one's
substantive theory T, which entails the [statistical] alternative
H1, has received ... quantitative support of magnitude around .999".
A statistically significant difference (e.g., in child rearing) is
not automatically evidence for a Freudian theory.
T is subjected to only a feeble risk, violating Popper.
Fallacies of rejection:
(i) Take statistical significance as evidence of substantive theory that explains the effect
(ii) Infer a discrepancy from the null beyond what the
test warrants.
Finding a statistically significant effect, d(x0) > cα (cut-off for rejection), need not be indicative of large or
meaningful effect sizes: the test may be too sensitive.
Large n Problem: an α-significant rejection of H0 can be very probable, even with a substantively trivial discrepancy from H0.
This is often taken as a criticism because it is assumed that
statistical significance at a given level is more evidence
against the null the larger the sample size (n): a fallacy!
"The thesis implicit in the [NP] approach [is] that a hypothesis
may be rejected with increasing confidence or reasonableness as the power of the test increases" (Howson and Urbach 1989
and later editions)
In fact, it is indicative of less of a discrepancy from the null
than if it resulted from a smaller sample size.
(analogy with smoke detector: an alarm from one that often
goes off from merely burnt toast (overly powerful or sensitive), vs. an alarm from one that rarely goes off unless the house is ablaze)
Comes also in the form of the Jeffreys-Good-Lindley paradox:
Even a highly statistically significant result can, with n sufficiently large, correspond to a high posterior probability for a null hypothesis.
Fallacy of Non-Statistically Significant Results
Test T(α) fails to reject the null when the test statistic fails to reach the cut-off point for rejection, i.e., d(x0) ≤ cα.
A classic fallacy is to construe such a negative result as
evidence FOR the correctness of the null hypothesis (common
in risk assessment contexts).
"No evidence against" is not "evidence for".
Merely surviving the statistical test is too easy, occurs too frequently, even when the null is false.
This results from tests lacking sufficient sensitivity or
power.
The Power Analytic Movement of the 60s in psychology
Jacob Cohen: By considering ahead of time the power of the test, select a test capable of detecting discrepancies of
interest.
pre-data use of power (for planning).
A multitude of tables were supplied (Cohen, 1988), but
until his death he bemoaned their all-too-rare use.
(Power is a feature of N-P tests, but apparently the
prevalence of Fisherian tests in the social sciences, coupled,
perhaps, with the difficulty in calculating power, resulted in
ignoring power. There was also the fact that they were not able
to get decent power in psychology; they turned to meta-
analysis)
Post-data use of power to avoid fallacies of insensitive tests
If there's a low probability of a statistically significant
result, even if a non-trivial discrepancy μ_non-trivial is present (low
power against μ_non-trivial), then a non-significant difference is not
good evidence that a non-trivial discrepancy is absent.
Still too coarse: power is always calculated relative to the cut-
off point cα for rejecting H0.
Consider test T(α), σ = 1, n = 25, and let
μ_non-trivial = .2
No matter what the non-significant outcome, power to detect
μ_non-trivial is only .168!
So we'd have to deny the data were good evidence that μ < .2
This suggested to me (in writing my dissertation around
1978) that rather than calculating
(1) P(d(X) > cα; μ = .2)  Power
one should calculate
(2) P(d(X) > d(x0); μ = .2)  "observed power" (severity)
Even if (1) is low, (2) may be high. We return to this in
the developments of Wave III.
III. The Third Wave: Relativism, Reformulations,
Reconciliations ~1980-2005+
(skip) Rational Reconstruction and Relativism in
Philosophy of Science
With philosophers fighting Kuhnian battles over the very idea of a unified method of
scientific inference, statistical inference became less prominent in
philosophy:
largely used in rational reconstructions of scientific episodes,
in appraising methodological rules,
in classic philosophical problems, e.g., Duhem's
problem: reconstruct a given assignment of blame so as to
be warranted by Bayesian probability assignments.
no normative force.
With the recognition that science involves subjective judgments and
values, reconstructions often appeal to a subjective Bayesian
account (Salmon's "Tom Kuhn Meets Tom Bayes").
(Kuhn thought this was confused: no reason to suppose an
algorithm remains through theory change)
Naturalisms, HPS
Wave III in Scientific Practice
Statisticians turn to eclecticism.
Non-statistician practitioners (e.g., in psychology,
ecology, medicine) bemoan "unholy hybrids":
"a mixture of ideas from N-P methods, Fisherian tests, and
Bayesian accounts that is inconsistent from both perspectives
and burdened with conceptual confusion" (Gigerenzer, 1993,
p. 323).
Faced with foundational questions, non-statistician
practitioners raise anew the questions from the first and
second waves.
Finding the automaticity and fallacies still rampant, most,
if they are not calling for an outright ban on significance tests in research, insist on reforms and reformulations of
statistical tests.
Task Force to consider Test Ban in Psychology: 1990s
Reforms and Reinterpretations Within Error Probability
Statistics
Any adequate reformulation must:
(i) Show how to avoid classic fallacies (of acceptance
and of rejection) on principled grounds,
(ii) Show that it provides an account of inductive
inference
Avoiding Fallacies
To quickly note my own recommendation (for test T(α)): Move away from the coarse accept/reject rule; use the specific result
(significant or insignificant) to infer those discrepancies from
the null that are well ruled-out, and those which are not.
e.g., Interpretation of Non-Significant results:
If d(x) is not statistically significant, and the
test had a very high probability of a more statistically significant difference if μ > μ0 + γ, then d(x) is good grounds for inferring μ ≤ μ0 + γ.
Use the specific outcome to infer an upper bound
μ ≤ μ* (values beyond μ* are ruled out by given
severity).
If d(x) is not statistically significant, but the test
had a very low probability of a more
statistically significant difference if μ > μ0 + γ,
then d(x) is poor evidence for inferring μ ≤ μ0 + γ.
The test had too little probative power to have
detected such discrepancies even if they existed!
Takes us back to the post-data version of power:
Rather than construe "a miss as good as a mile", parity of logic suggests that the post-data power assessment should
replace the usual calculation of power against μ1:
POW(T(α), μ1) = P(d(X) > cα; μ = μ1),
with what might be called the power actually attained or, to have a distinct term, the severity (SEV):
SEV(T(α), μ1) = P(d(X) > d(x0); μ = μ1),
where d(x0) is the observed (non-statistically significant)
result.
Figure 1 compares power and severity for different
outcomes
Figure 1. POW(T(α), μ1 = .2) = .168, irrespective of the value
of d(x0); solid curve: the severity evaluations are data-specific.
The severity for the inference μ < .2: both X̄ = .39 and X̄ = −.2 fail to reject H0, but
with X̄ = .39, SEV(μ < .2) is low (.17);
with X̄ = −.2, SEV(μ < .2) is high (.97).
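The contrast between power and severity for these two non-significant outcomes can be reproduced with a short sketch. It assumes the setup used on the earlier slide (μ0 = 0, σ = 1, n = 25, cα = 1.96) and the inference μ < .2; the function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()
MU0, SIGMA, N, C_ALPHA = 0.0, 1.0, 25, 1.96   # assumed test T(alpha) setup
SE = SIGMA / sqrt(N)                           # standard deviation of the mean


def power(mu1):
    """POW(T(alpha), mu1) = P(d(X) > c_alpha; mu = mu1); ignores the data."""
    return 1.0 - Z.cdf(C_ALPHA - (mu1 - MU0) / SE)


def sev_less_than(xbar, mu1):
    """SEV(mu < mu1) = P(d(X) > d(x0); mu = mu1), which depends on the
    observed mean xbar."""
    return 1.0 - Z.cdf((xbar - mu1) / SE)


print(round(power(0.2), 3))                # same for every outcome
print(round(sev_less_than(0.39, 0.2), 2))  # barely non-significant result
print(round(sev_less_than(-0.2, 0.2), 3))  # result well below the null value
```

The power is about .168 regardless of the data, while the severity for μ < .2 is about .17 for X̄ = .39 and about .977 for X̄ = −.2, in line with the figure's .17 and .97.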
Fallacies of Rejection: The Large n-Problem
While with a nonsignificant result, the concern is erroneously
inferring that a discrepancy from μ0 is absent;
with a significant result x0, the concern is erroneously inferring
that it is present.
Utilizing the severity assessment, an α-significant difference with n1 passes μ > μ1 less severely than with n2, where
n1 > n2.
Figure 2 compares test T(α) with three different sample
sizes:
n = 25, n = 100, n = 400, denoted by T(α; n), where in each case d(x0) = 1.96 (reject at the cut-off
point).
In this way we solve the problems of tests too sensitive or not
sensitive enough, but there's one more thing ... showing how it
supplies an account of inductive inference
Many argue in wave III that error statistical methods cannot
supply an account of inductive inference because error
probabilities conflict with posterior probabilities.
Figure 2. In test T(α) (H0: μ ≤ 0 against H1: μ > 0, and σ = 1), α = .025, cα = 1.96 and d(x0) = 1.96.
The severity for the inference μ > .1:
n = 25, SEV(μ > .1) is .93
n = 100, SEV(μ > .1) is .83
n = 400, SEV(μ > .1) is .5
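The three severity values can be recovered from the Normal model. In this sketch the inferred claim is taken to be μ > .1, an assumption chosen because it reproduces the quoted .93, .83 and .5; μ0 = 0, σ = 1 and d(x0) = 1.96 follow the figure caption, and the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()


def sev_greater_than(d_obs, mu1, n, mu0=0.0, sigma=1.0):
    """SEV(mu > mu1) = P(d(X) <= d(x0); mu = mu1) for a significant result."""
    shift = (mu1 - mu0) * sqrt(n) / sigma   # mean of d(X) when mu = mu1
    return Z.cdf(d_obs - shift)


# Same observed d(x0) = 1.96 in each test; only the sample size changes.
for n in (25, 100, 400):
    print(n, round(sev_greater_than(1.96, mu1=0.1, n=n), 2))
```

The same just-significant result warrants μ > .1 with less severity as n grows, which is the error statistical answer to the large n problem.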
P-values vs. Bayesian Posteriors
A statistically significant difference from H0 can correspond to large posteriors in H0. From the Bayesian perspective, it
follows that p-values come up short as a measure of inductive
evidence;
the significance testers balk at the recommended priors
resulting in highly significant results being construed as no
evidence against the null, or even evidence for it!
The conflict often considers the two-sided T(2α) test
H0: μ = μ0 vs. H1: μ ≠ μ0.
(The differences between p-values and posteriors are far
less marked with one-sided tests.)
Assuming a prior of .5 to H0, with n = 50 one can classically
"reject H0 at significance level p = .05," although P(H0|x) = .52
(which would actually indicate that the evidence favors H0).
This is taken as a criticism of p-values only because it is
assumed the .52 posterior is the appropriate measure of the
beliefworthiness.
As the sample size increases, the conflict becomes
more noteworthy.
If n = 1000, a result statistically significant at the
.05 level leads to a posterior to the null of .82!
SEV(H1) = .95 while the corresponding posterior has gone
from .5 to .82. What warrants such a prior?
Posterior probability P(H0|x), by n (sample size)
p       t       n=10    n=20    n=50    n=100   n=1000
.10     1.645   .47     .56     .65     .72     .89
.05     1.960   .37     .42     .52     .60     .82
.01     2.576   .14     .16     .22     .27     .53
.001    3.291   .024    .026    .034    .045    .124
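Posteriors of this kind can be computed from a standard spiked-prior setup. The sketch below assumes a prior of .5 on H0: μ = μ0 with the remaining .5 spread as a Normal N(μ0, σ²) over the alternative. The slides do not state the alternative prior, so that choice is an assumption, and `posterior_null` is an illustrative name. Under this assumption the function reproduces, for example, the .52 and .82 entries; other cells may differ slightly in the second decimal.

```python
from math import exp, sqrt


def posterior_null(t, n, prior_null=0.5):
    """P(H0 | x) for a point null, given the observed standardized
    difference t = sqrt(n) * |xbar - mu0| / sigma.
    Assumes mu ~ N(mu0, sigma^2) under the alternative."""
    # Bayes factor in favor of H0: ratio of the density of xbar under
    # N(mu0, sigma^2/n) to its marginal density N(mu0, sigma^2 * (1 + 1/n)).
    bayes_factor_01 = sqrt(1 + n) * exp(-0.5 * t * t * n / (1 + n))
    prior_odds = prior_null / (1.0 - prior_null)
    posterior_odds = prior_odds * bayes_factor_01
    return posterior_odds / (1.0 + posterior_odds)


for n in (50, 1000):
    print(n, round(posterior_null(1.96, n), 2))
```

Holding t fixed at 1.96 (two-sided p = .05), the posterior on the null climbs with n, which is the conflict the table displays.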
(1) Some claim the prior of .5 is a warranted frequentist
assignment:
H0 was randomly selected from an urn in which 50% are
true
(*) Therefore P(H0) = p
H0 may be 0 change in extinction rates, 0 lead
concentration, etc.
What should go in the urn of hypotheses?
For the frequentist: either H0 is true or false; the
probability in (*) is fallacious and results from an
unsound instantiation.
We are very interested in how false it might be, which is what we can assess by means of a severity assessment.
(2) Subjective degree of belief assignments will not ensure
the error probability, and thus the severity assessments we
need.
(3) Some suggest an "impartial" or "uninformative" Bayesian
prior gives .5 to H0, the remaining .5 probability being spread
out over the alternative parameter space (Jeffreys).
This "spiked concentration of belief in the null" is at odds with
the prevailing view "we know all nulls are false".
The Bayesian recently co-opts 'error probability' to describe a
posterior, but it is not a frequentist error probability, which is
measuring something very different.
Fisher: The Function of the p-Value Is Not Capable of
Finding Expression
Faced with conflicts between error probabilities and Bayesian
posterior probabilities, the error probabilist would conclude
that the flaw lies with the latter measure.
Fisher: Discussing a test of the hypothesis that the stars
are distributed at random, Fisher takes the low p-value (about 1
in 33,000) to "exclude at a high level of significance any theory involving a random distribution" (Fisher, 1956, page 42).
Even if one were to imagine that H0 had an extremely high
prior probability, Fisher continues (never minding "what
such a statement of probability a priori could possibly mean"),
the resulting high posteriori probability to H0, he thinks, would
only show that "reluctance to accept a hypothesis strongly
contradicted by a test of significance" (ibid, page 44) . . . "is
not capable of finding expression in any calculation of
probability a posteriori" (ibid, page 43).
Wave IV? 2006+ The Reference Bayesians Abandon
Coherence, the LP, and strive to match frequentist error
probabilities!
Contemporary "Impersonal" Bayesianism
Because of the difficulty of eliciting subjective priors, and
because of the reluctance among scientists to allow
subjective beliefs to be conflated with the information
provided by data, much current Bayesian work in practice
favors conventional "default", "uninformative", or
"reference" priors.
1. What do reference posteriors measure?
A classic conundrum: there is no unique
"noninformative" prior. (Supposing there is one leads to inconsistencies in calculating posterior
marginal probabilities.)
Any representation of ignorance or lack of
information that succeeds for one parameterization
will, under a different parameterization, entail having
knowledge.
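A toy calculation illustrates why ignorance in one parameterization becomes knowledge in another. The setup is an assumption for illustration only: a parameter θ in (0, 1) and the reparameterization φ = θ²; neither appears in the lecture, and the function names are hypothetical.

```python
def prob_theta_below_flat_theta(a):
    """P(theta < a) when theta has a flat prior on (0, 1)."""
    return a


def prob_theta_below_flat_phi(a):
    """P(theta < a) when phi = theta**2 has a flat prior on (0, 1):
    theta < a exactly when phi < a**2, and P(phi < a**2) = a**2."""
    return a * a


a = 0.5
print(prob_theta_below_flat_theta(a))  # flat in theta
print(prob_theta_below_flat_phi(a))    # flat in phi: favors larger theta
```

Both priors claim to express a lack of information, yet they disagree about the probability that θ < .5 (.5 versus .25), so at most one of them can be uninformative about θ.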
Contemporary reference Bayesians seek priors that are
simply conventions to serve as weights for reference
posteriors.
not to be considered expressions of uncertainty,
ignorance, or degree of belief.
may not even be probabilities; flat priors may not
sum to one (improper prior). If priors are not probabilities, what then is the interpretation of a
posterior? (a serious problem I would like to see
Bayesian philosophers tackle).
2. Priors for the same hypothesis change according to
what experiment is to be done! Bayesian incoherence
If the prior is to represent information why should it be
influenced by the sample space of a contemplated
experiment?
Violates the likelihood principle, the cornerstone of
Bayesian coherency.
Reference Bayesians: it is the price of objectivity.
Seems to wreak havoc with basic Bayesian
foundations, but without the payoff of an objective,
interpretable output; even subjective Bayesians object.
3. Reference posteriors with good frequentist
properties
Reference priors are touted as having some good
frequentist properties, at least in one-dimensional problems.
They are deliberately designed to match frequentist error
probabilities.
If you want error probabilities, why not use techniques
that provide them directly?
Note: using conditional probability, which is part and
parcel of probability theory (as in "Bayes nets"), does not
make one a Bayesian:
no priors to hypotheses.