Post on 30-Mar-2015
Comparing Classical and Bayesian Approaches to Hypothesis Testing
James O. Berger
Institute of Statistics and Decision Sciences
Duke University
www.stat.duke.edu
Outline
• The apparent overuse of hypothesis testing
• When is point null testing needed?
• The misleading nature of P-values
• Bayesian and conditional frequentist testing of plausible hypotheses
• Advantages of Bayesian testing
• Conclusions
I. The apparent overuse of hypothesis testing
• Tests are often performed when they are irrelevant.
• Rejection by an irrelevant test is sometimes viewed as “license” to forget statistics in further analysis.
Prototypical example
Habitat Type   Rank   Observed Usage
A              1      3.8
B              2      3.6
C              3      2.8
D              4      1.8
E              5      1.5
F              6      0.7
Hypothesis: H0: "mean usage is equal for all habitats". Rejected (P<.025).
Statistical mistakes in the example
• The hypothesis is not plausible; testing serves no purpose.
• The observed usage levels are given without confidence sets.
• The rankings are based only on observed means, and are given without uncertainties. (For instance, perhaps Pr (A>B)=0.6 only.)
Note that, while $H_0: \theta = \theta_0$ is typically not plausible, it is a good approximation to $H_0: |\theta - \theta_0| < \epsilon$ as long as $\epsilon < \sigma/(4\sqrt{n})$ (assuming Gaussian observations with standard deviation $\sigma$).
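The rule of thumb in the note above (the half-width of the interval null should be smaller than the standard deviation divided by four times the square root of the sample size) can be turned into a quick check. This is a minimal sketch, assuming Gaussian observations with known standard deviation; the function name and the example numbers are hypothetical and not from the slides.

```python
import math


def point_null_is_good_approximation(epsilon, sigma, n):
    """Return True if the interval null |theta - theta0| < epsilon is
    narrow enough to be approximated by the point null theta = theta0,
    using the rule epsilon < sigma / (4 * sqrt(n))."""
    return epsilon < sigma / (4.0 * math.sqrt(n))


# Hypothetical numbers: sigma = 2.0 and n = 25 give a threshold of 0.1.
print(point_null_is_good_approximation(0.05, sigma=2.0, n=25))  # True
print(point_null_is_good_approximation(0.50, sigma=2.0, n=25))  # False
```

Because the threshold shrinks as n grows, a point null that is a fine approximation in a small study may stop being one in a very large study.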
II. When is testing of a point null hypothesis needed?
Answer: When the hypothesis is plausible, to some degree.
Examples of hypotheses that are not realistically plausible:
• H0: small mammals are as abundant on livestock grazing land as on non-grazing land
• H0: survival rates of brood mates are independent
• H0: bird abundance does not depend on the type of forest habitat they occupy
• H0: cottontail choice of habitat does not depend on the season
Examples of hypotheses that may be plausible, to at least some degree:
• H0: Males and females of a species are the same in terms of characteristic A.
• H0: Proximity to logging roads does not affect ground nest predation.
• H0: Pollutant A does not affect Species B.
Example: Experimental drugs D1, D2, D3, . . . are to be tested.
Each Test: H0: Di has negligible effect
H1: Di is effective
Typical Bayesian Answer: The probability that H0 is true is 0.06.
Classical Answer (P-value): If H0 were true, the probability of observing hypothetical data as or more "extreme" than the actual data is 0.06.
III. For plausible hypotheses, P-values are misleading as measures of evidence
DRUG D1 D2 D3 D4 D5 D6
P-VALUE 0.41 0.04 0.32 0.94 0.01 0.28
DRUG D7 D8 D9 D10 D11 D12
P-VALUE 0.11 0.05 0.65 0.009 0.09 0.66
Question: How strongly do we believe that Drug i has a nonnegligible effect when
(i) the P-value is approximately 0.05?
(ii) the P-value is approximately 0.01?
A Surprising Fact: Suppose it is known that, a priori, about 50% of the Drugs will have negligible effect. Then,
(i) of the Drugs for which the P-value is approximately 0.05, at least 25% (and typically over 50%) will have negligible effect;
(ii) of the Drugs for which the P-value is approximately 0.01, at least 7% (and typically over 15%) will have negligible effect.
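The claim about drugs with P-values near 0.05 or 0.01 can be explored with a small simulation. The sketch below is illustrative only and rests on assumptions that are not in the slides: each drug yields a single z-statistic, half the drugs have exactly zero effect, and the effects of the remaining drugs are drawn from a normal distribution with standard deviation `tau` (in z units). The function name, `tau`, and the window `width` are hypothetical choices, and the resulting fractions depend strongly on them.

```python
import random
from statistics import NormalDist


def fraction_null_near(p_target, width=0.2, tau=2.0, prior_null=0.5,
                       n_drugs=200_000, seed=0):
    """Simulate many drugs and return the fraction with negligible (zero)
    effect among those whose two-sided P-value falls within
    p_target * (1 +/- width)."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    lo, hi = p_target * (1 - width), p_target * (1 + width)
    null_hits = total_hits = 0
    for _ in range(n_drugs):
        is_null = rng.random() < prior_null
        # Assumed effect-size model: zero under H0, N(0, tau^2) under H1.
        effect = 0.0 if is_null else rng.gauss(0.0, tau)
        z = rng.gauss(effect, 1.0)
        p_value = 2.0 * (1.0 - std_normal.cdf(abs(z)))
        if lo <= p_value <= hi:
            total_hits += 1
            null_hits += is_null
    return null_hits / total_hits


print("P near 0.05:", fraction_null_near(0.05))
print("P near 0.01:", fraction_null_near(0.01))
```

Under these assumptions the fraction of negligible-effect drugs among those with a P-value near 0.05 is far larger than 5%, which is the point of the slide; larger values of `tau` push the fraction higher.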
IV. Bayesian testing of point hypotheses
Data and Model: $X$ has density $f(x \mid \theta)$
Example: $X$ = # of eggs hatched out of $n$ eggs in a recently polluted area (so $f$ is binomial, and $\theta$ is the true proportion that would hatch).
To Test: $H_0: \theta = \theta_0$ versus $H_1: \theta \neq \theta_0$
Example: $\theta_0$ is the historically known proportion of eggs that hatch in the area.
The prior distribution
Let $P_0$ and $P_1$ be the prior probabilities of $H_0$ and $H_1$. (The usual default choice is $P_0 = P_1 = 0.5$.)
Under $H_1$, let $\pi(\theta)$ be the density representing information concerning the location of $\theta$. (The usual default choice for the binomial problem is $\pi(\theta) = 1$.)
Note: There are two schools of Bayesian statistics, the subjective school, where the prior distribution reflects real extraneous information, and the objective school, where the prior is chosen in a default fashion.
Posterior probability that H0 is true, given the data (from Bayes theorem):
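$$\Pr(H_0 \mid \text{data } x) = \frac{P_0\, f(x \mid \theta_0)}{P_0\, f(x \mid \theta_0) + P_1 \int f(x \mid \theta)\, \pi(\theta)\, d\theta}$$
$$= \left\{ 1 + \frac{\mathrm{Beta}(x+1,\; n-x+1)}{\theta_0^{\,x} (1-\theta_0)^{\,n-x}} \right\}^{-1}$$
(for the binomial testing problem, with the default choices $P_0 = P_1 = 0.5$ and $\pi(\theta) = 1$)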
Note: Some prefer to use the Bayes Factor (or weighted likelihood ratio) of $H_0$ to $H_1$,
$$B = \frac{f(x \mid \theta_0)}{\int f(x \mid \theta)\, \pi(\theta)\, d\theta} = \frac{\text{likelihood of data under } H_0}{\text{"average" likelihood of data under } H_1},$$
since this does not involve prior probabilities of the $H_i$.
Example: Suppose $x = 40$ eggs hatched out of $n = 100$. Then $\Pr(H_0 \mid \text{data } x) = 0.52$ and $B = 0.92$. (Here a classical test would yield P-value = 0.05.)
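The egg-hatching example can be worked numerically. The following is a minimal sketch, assuming the default choices described earlier (equal prior probabilities for the two hypotheses and a uniform prior on the hatching proportion under the alternative) and assuming a historical proportion of 0.5, which the slides do not state; that value is used here only because it reproduces the reported posterior probability of about 0.52. Function names are illustrative.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function, computed via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def bayes_factor_h0_to_h1(x, n, theta0):
    """Likelihood under H0 divided by the average likelihood under H1,
    with a uniform prior on theta under H1. The binomial coefficient
    appears in both and cancels, so it is omitted."""
    log_lik_h0 = x * math.log(theta0) + (n - x) * math.log(1.0 - theta0)
    log_avg_lik_h1 = log_beta(x + 1, n - x + 1)
    return math.exp(log_lik_h0 - log_avg_lik_h1)


def posterior_prob_h0(x, n, theta0, prior_h0=0.5):
    """Posterior probability of H0 from prior odds times the Bayes factor."""
    posterior_odds = prior_h0 / (1.0 - prior_h0) * bayes_factor_h0_to_h1(x, n, theta0)
    return posterior_odds / (1.0 + posterior_odds)


# theta0 = 0.5 is an assumption, not a value given in the slides.
b = bayes_factor_h0_to_h1(40, 100, 0.5)
print("Bayes factor (H0 to H1):", round(b, 3))
print("Pr(H0 | data):", round(posterior_prob_h0(40, 100, 0.5), 3))
```

Under these assumptions the Bayes factor of H0 to H1 comes out near 1.10, and its reciprocal is near 0.91, so the reported value of B may be oriented as H1 to H0.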
Conditional frequentist interpretation of the posterior probability of H0
$\Pr(H_0 \mid \text{data } x)$ is also the frequentist type I error probability, conditional on observing data of the same "strength of evidence" as the actual data $x$. (The classical type I error probability makes the mistake of reporting the error averaged over data of very different strengths.)
V. Advantages of Bayesian testing
• Pr (H0 | data x) reflects real expected error rates: P-values do not.
• A default formula exists for all situations:
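$$\Pr(H_0 \mid \text{data } x) = \left\{ 1 + \int\!\!\int \frac{f(x \mid \theta)\, f(x^* \mid \theta_0)\, f(x^* \mid \theta)}{f(x \mid \theta_0) \int f(x^* \mid \theta')\, d\theta'}\; dx^*\, d\theta \right\}^{-1},$$
where $x^*$ is independent (unobserved) data of the smallest size such that the above integrals exist.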
• Posterior probabilities allow for incorporation of personal opinion, if desired. Indeed, if the published default posterior probability of H0 is P*, and your prior probability of H0 is P0, then your posterior probability of H0 is
$$\Pr(H_0 \mid \text{data } x) = \left\{ 1 + \left( \frac{1}{P_0} - 1 \right) \left( \frac{1}{P^*} - 1 \right) \right\}^{-1}.$$
Example: In the binomial example, recall $P^* = 0.52$.
A "skeptic" has $P_0 = 0.1$; hence $\Pr(H_0 \mid \text{data } x) = 0.11$.
A "believer" has $P_0 = 0.9$; hence $\Pr(H_0 \mid \text{data } x) = 0.91$.
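The personal-opinion adjustment amounts to multiplying your own prior odds of H0 by the odds implied by the published default posterior probability. The sketch below checks the skeptic and believer figures; the function name is illustrative, and it assumes the published default value was computed with equal prior probabilities for the two hypotheses.

```python
def personal_posterior(default_posterior, prior_h0):
    """Posterior probability of H0 for a reader whose prior probability of
    H0 is prior_h0, given a published default posterior probability that
    was computed with equal prior probabilities."""
    return 1.0 / (1.0 + (1.0 / prior_h0 - 1.0) * (1.0 / default_posterior - 1.0))


for label, p0 in [("skeptic", 0.1), ("believer", 0.9)]:
    print(label, round(personal_posterior(0.52, p0), 2))
# skeptic 0.11
# believer 0.91
```

A reader whose prior probability of H0 is 0.5 recovers the published value unchanged.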
• Posterior probabilities are not affected by the reason for stopping experimentation, and hence do not require rigid experimental designs (as do classical testing measures).
• Posterior probabilities can be used for multiple models or hypotheses.
Example:
H0: pollutant A has no effect on species B
H1: pollutant A decreases abundance of species B
H2: pollutant A increases abundance of species B
Pr(H0 | data) = .30, Pr(H1 | data) = .68, Pr(H2 | data) = .02
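The earlier point that posterior probabilities do not depend on the reason for stopping can be illustrated with a familiar comparison that is not taken from the slides: the same 40 hatched eggs out of 100 could arise from a design that fixes the number of eggs in advance (binomial sampling) or from one that continues until 40 eggs have hatched (negative binomial sampling). The sketch below assumes a uniform prior under the alternative and a hypothetical null proportion of 0.5; all names are illustrative.

```python
import math


def beta_fn(a, b):
    """Beta function computed via log-gamma for numerical stability."""
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))


def bayes_factor(x, n, theta0, design_constant):
    """Bayes factor of H0 to H1 with a uniform prior under H1.
    design_constant is the combinatorial factor of the sampling design;
    it multiplies both likelihoods and therefore cancels."""
    lik_h0 = design_constant * theta0**x * (1.0 - theta0) ** (n - x)
    avg_lik_h1 = design_constant * beta_fn(x + 1, n - x + 1)
    return lik_h0 / avg_lik_h1


x, n, theta0 = 40, 100, 0.5  # theta0 is an assumed value
b_fixed_n = bayes_factor(x, n, theta0, math.comb(n, x))          # stop after n eggs
b_fixed_x = bayes_factor(x, n, theta0, math.comb(n - 1, x - 1))  # stop at the x-th hatch
print(math.isclose(b_fixed_n, b_fixed_x, rel_tol=1e-9))  # True
```

The two designs give different P-values in general, because the set of "more extreme" hypothetical outcomes differs, but the Bayes factor and hence the posterior probability are the same.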
An aside: integrating science and statistics via the Bayesian paradigm
• Any scientific question can be asked (e.g., What is the probability that switching to management plan A will increase species abundance by 20% more than will plan B?)
• Models can be built that simultaneously incorporate known science and statistics.
• If desired, expert opinion can be built into the analysis.
Conclusions
• Hypothesis testing is overutilized while (Bayesian) statistics is underutilized.
• Hypothesis testing is needed only when testing a “plausible” hypothesis (and this may be a rare occurrence in wildlife studies).
• The Bayesian approach to hypothesis testing has considerable advantages in terms of interpretability (actual error rates), general applicability, and flexible experimentation.