Mayo 2-19-13seminar WhatIsThePhilosophyOfStatistics

Transcript

    1

    What is the Philosophy of Statistics?

At one level of analysis at least, statisticians and philosophers of science ask many of the same questions:

What should be observed, and what may justifiably be inferred from the resulting data?

How well do data confirm or fit a model?

What is a good test?

Must predictions be novel in some sense? (selection effects, double counting, data mining)

How can spurious relationships be distinguished from genuine regularities? From causal regularities?

How can we infer more accurate and reliable observations from less accurate ones?

When does a fitted model account for regularities in the data?

That these very general questions are entwined with longstanding debates in philosophy of science helps to explain why the field of statistics tends to cross over so often into philosophical territory.


    2

Statistics → Philosophy

3 ways statistical accounts are used in philosophy of science:

(1) Model Scientific Inference: to capture either the actual or rational ways to arrive at evidence and inference.

(2) Resolve Philosophical Problems about scientific inference, observation, experiment

(problem of induction, objectivity of observation, reliable evidence, Duhem's problem, underdetermination).

(3) Perform a Metamethodological Critique: scrutinize methodological rules, e.g., accord special weight to "novel" facts, avoid ad hoc hypotheses, avoid "data mining", require randomization.

Philosophy → Statistics

Central job: to help resolve the conceptual, logical, and methodological discomforts of scientists as to how to make reliable inferences despite uncertainties and errors.

Philosophy of statistics and the goal of a philosophy of science relevant for philosophical problems in scientific practice.


    3

Fresh methodological problems arise in practice surrounding a panoply of methods and models relied on to learn from incomplete, and often non-experimental, data. Examples abound:

Disputes over hypothesis-testing in psychology (e.g., the recently proposed significance test ban);

Disputes over the proper uses of regression in applied statistics;

Disputes over dose-response curves in estimating risks;

Disputes about the use of computer simulations in observational sciences;

Disputes about external validity in experimental economics; and,

Across the huge landscape of fields using the latest high-powered computer methods, there are disputes about data mining, algorithmic searches, and model validation.

Equally important are the methodological presuppositions that are not, but perhaps ought to be, disputed, debated, or at least laid out in the open, often, ironically, in the very fields in which philosophers of science immerse themselves.


    4

I used to teach a course in this department: philosophy of science and economic methodology.

We read how many economic methodologists questioned the value of philosophy of science:

"If philosophers and others within science theory can't agree about the constitution of the scientific method (or even whether asking about a scientific method makes any sense), doesn't it seem a little dubious for economists to continue blithely taking things off the shelf and attempting to apply them to economics?" (Hands, 2001, p. 6).

Deciding that it is, methodologists of economics increasingly look to sociology of science, rhetoric, evolutionary psychology.

The problem is not merely how this cuts philosophers of science out of being engaged in methodological practice; equally serious is how it encourages practitioners to assume there are no deep epistemological problems with the ways they collect and base inferences on data.


    5

"Professional agreement on statistical philosophy is not on the immediate horizon, but this should not stop us from agreeing on methodology, as if what is correct methodologically does not depend on what is correct philosophically" (Berger, 2003, p. 2).

In addition to the resurgence of the age-old controversies (significance tests vs. confidence intervals, frequentist vs. Bayesian measures), the latest statistical modeling techniques have introduced brand new methodological issues.

High-powered computer science packages offer a welter of algorithms for automatically selecting among this explosion of models, but as each boasts different, and incompatible, selection criteria, we are thrown back to the basic question of inductive inference: what is required to severely discriminate among well-fitting models such that, when a claim (or hypothesis or model) survives a test, the resulting data count as good evidence for the claim's correctness or dependability or adequacy?


    6

    A romp through 4 "waves in philosophy of statistics"

History and philosophy of statistics is a huge territory marked by 70 years of debates widely known for reaching unusual heights both of passion and of technical complexity.

Wave I: ~1930-1955/60

Wave II: ~1955/60-1980

Wave III: ~1980-2005 & beyond

Wave IV: ~2006 and beyond


    7

A core question: What is the nature and role of probabilistic concepts, methods, and models in making inferences in the face of limited data, uncertainty and error?

1. Two Roles for Probability: Degrees of Confirmation and Degrees of Well-Testedness

a. To provide a post-data assignment of degree of probability, confirmation, support or belief in a hypothesis;

b. To assess the probativeness, reliability, trustworthiness, or severity of a test or inference procedure.

These two contrasting philosophies of the role of probability in statistical inference are very much at the heart of the central points of controversy in the three waves of philosophy of statistics.


    8

Having conceded loss in the battle for justifying induction, philosophers appeal to logic to capture scientific method:

Inductive Logics (Carnap, C(H, e)):
Confirmation Theory: rules to assign degrees of probability or confirmation to hypotheses given evidence e.
Inductive Logicians: we can build and try to justify inductive logics; the straight rule: assign degrees of confirmation/credibility.
Statistical affinity: Bayesian (and likelihoodist) accounts.

Logic of Falsification (Popper):
Methodological falsification: rules to decide when to prefer or accept hypotheses.
Deductive Testers: we can reject induction and uphold the rationality of preferring or accepting H if it is well tested.
Statistical affinity: Fisherian, Neyman-Pearson methods; probability enters to ensure the reliability and severity of tests with these methods.


    9

    I. Philosophy of Statistics: The First Wave

WAVE I: circa 1930-1955: Fisher, Neyman, Pearson, Savage, and Jeffreys.

Statistical inference tools use data x0 to probe aspects of the data generating source:

In statistical testing, these aspects are framed as statistical hypotheses about parameters governing a statistical distribution.

H tells us the probability of x under H, written P(x; H) (probabilistic assignments under a model).

Important to avoid confusion with conditional probabilities in Bayes's theorem, P(x|H).

Testing model assumptions is extremely important, though I will not discuss it here.


    10

    Modern Statistics Begins with Fisher:

    Simple Significance Tests

Example. Let the sample X = (X1, …, Xn) be IID from a Normal distribution (NIID) with σ = 1.

1. A null hypothesis H0: μ = 0

e.g., 0 mean concentration of lead, no difference in mean survival in a given group, in mean risk, in mean deflection of light.

2. A function of the sample, d(X), the test statistic, which reflects the difference between the data x0 = (x1, …, xn) and H0;

the larger d(x0), the further the outcome is from what is expected under H0, with respect to the particular question being asked.

3. The p-value is the probability of a difference larger than d(x0), under the assumption that H0 is true:

p(x0) = P(d(X) > d(x0); H0)


    11

The observed significance level (p-value) with observed X̄ = .1:

p(x0) = P(d(X) > d(x0); H0).

The relevant test statistic d(X) is:

d(X) = (X̄ − μ0)/σx   [Observed minus Expected (under H0), in σx units],

where X̄ is the sample mean, with standard deviation σx = σ/√n.

Since σx = σ/√n = 1/5 = .2, d(X) = (.1 − 0) in units of σx yields

d(x0) = .1/.2 = .5

Under the null, d(X) is distributed as standard Normal, denoted by d(X) ~ N(0, 1). (Area to the right of .5) ≈ .3, i.e., not very significant.
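A short sketch (my own illustration, not from the slides) makes the arithmetic on this slide concrete; the values σ = 1, n = 25, and X̄ = .1 are the running example's:

```python
# Illustrative sketch of the p-value computation, assuming the running
# example: NIID Normal data, sigma = 1, n = 25, observed mean 0.1, H0: mu = 0.
from statistics import NormalDist

sigma, n = 1.0, 25
x_bar = 0.1
sigma_x = sigma / n ** 0.5          # SD of the sample mean: 1/5 = 0.2

d = (x_bar - 0.0) / sigma_x         # test statistic d(x0)
p_value = 1 - NormalDist().cdf(d)   # area to the right of d(x0) under N(0,1)

print(round(d, 2))                  # 0.5
print(round(p_value, 2))            # 0.31, i.e. not very significant
```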


    12

Logic of Simple Significance Tests: Statistical Modus Tollens

"Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis" (Fisher, 1956, p. 160).

Statistical analogy to the deductively valid pattern modus tollens:

If the hypothesis H0 is correct then, with high probability, 1 − p, the data would not be statistically significant at level p.

x0 is statistically significant at level p.
____________________________
Thus, x0 is evidence against H0, or x0 indicates the falsity of H0.

Fisher described the significance test as a procedure for rejecting the null hypothesis and inferring that the phenomenon has been "experimentally demonstrated" once one is able to generate at will a statistically significant effect (Fisher, 1935a, p. 14).


    13

    The Alternative or Non-Null Hypothesis

Evidence against H0 seems to indicate evidence for some alternative.

Fisherian significance tests strictly consider only the H0.

Neyman and Pearson (N-P) tests introduce an alternative H1 (even if only to serve as a direction of departure).

Example. X = (X1, …, Xn), NIID with σ = 1:

H0: μ = 0 vs. H1: μ > 0

Despite the bitter disputes with Fisher that were to erupt soon after ~1935, Neyman and Pearson at first saw their work as merely placing Fisherian tests on firmer logical footing.

Much of Fisher's hostility toward N-P methods reflects professional and personality conflicts more than philosophical differences.


    14

    Neyman-Pearson (N-P) Tests

N-P hypothesis test: maps each outcome x = (x1, …, xn) into either the null hypothesis H0 or an alternative hypothesis H1 (where the two exhaust the parameter space), so as to ensure the probabilities of erroneous rejections (Type I errors) and erroneous acceptances (Type II errors) are controlled at prespecified values, e.g., 0.05 or 0.01, the significance level α of the test.

Test T(α): X = (X1, …, Xn), NIID with σ = 1, H0: μ = 0 vs. H1: μ > 0:

if d(x0) > cα, "reject" H0 (or declare the result statistically significant at the α level);

if d(x0) ≤ cα, "accept" H0,

e.g., cα = 1.96 for α = .025.

Accept/Reject are uninterpreted parts of the mathematical apparatus.

Type I error probability = P(d(X) > cα; H0) ≤ α.

The Type II error probability:
P(Test T(α) does not reject H0; μ = μ1) = P(d(X) ≤ cα; μ = μ1) = β(μ1), for any μ1 > 0.


    15

The "best" test at level α at the same time minimizes the value of β(μ1) for all μ1 > 0, or equivalently, maximizes the power:

POW(T(α); μ1) = P(d(X) > cα; μ = μ1)

T(α) is a Uniformly Most Powerful (UMP) level-α test.
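The error probabilities above can be computed directly; this sketch is my own, assuming the running example's values (σ = 1, n = 25, α = .025, cα = 1.96):

```python
# Sketch of the power function for test T(alpha), assuming the running
# example: sigma = 1, n = 25, alpha = .025, c_alpha = 1.96.
from statistics import NormalDist

Z = NormalDist()
sigma_x = 1.0 / 25 ** 0.5           # 0.2
c_alpha = 1.96

def power(mu1):
    # POW(T(alpha); mu1) = P(d(X) > c_alpha; mu = mu1).
    # Under mu = mu1, d(X) = X_bar / sigma_x ~ N(mu1 / sigma_x, 1).
    return 1 - Z.cdf(c_alpha - mu1 / sigma_x)

print(round(power(0.2), 3))         # 0.169: low power against mu = 0.2
print(round(power(1.0), 3))         # 0.999: power grows with the discrepancy
```

The monotonicity of this power function in μ1 is what the UMP property trades on in the one-sided Normal case.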


    16

    Inductive Behavior Philosophy

Philosophical issues and debates arise once one begins to consider the interpretations of the formal apparatus.

Accept/Reject are identified with deciding to take specific actions, e.g., publishing a result, announcing a new effect.

The justification for optimal tests is that

"it may often be proved that if we behave according to such a rule ... we shall reject H when it is true not more, say, than once in a hundred times, and in addition we may have evidence that we shall reject H sufficiently often when it is false."

Neyman: Tests are not rules of inductive inference but rules of behavior:

the goal is not to adjust our beliefs but rather to adjust our behavior to limited amounts of data.

Is he just drawing a stark contrast between N-P tests and Fisherian as well as Bayesian methods? Or is the behavioral interpretation essential to the tests?


    17

Inductive behavior vs. inductive inference: the battle commingles philosophical, statistical and personality clashes.

Fisher (1955) denounced the way that Neyman and Pearson transformed his significance tests into acceptance procedures:

They've turned my tests into mechanical rules or "recipes" for deciding to accept or reject statistical hypothesis H0;

the concern has more to do with speeding up production or making money than with learning about phenomena.


    18

N-P followers are like:

"Russians [who] are made familiar with the ideal that research in pure science can and should be geared to technological performance, in the comprehensive organized effort of a five-year plan for the nation." (1955, p. 70)

"In the U.S. also the great importance of organized technology has I think made it easy to confuse the process appropriate for drawing correct conclusions, with those aimed rather at speeding production, or saving money."


    19

Pearson distanced himself from Neyman's "inductive behavior" jargon, calling it "Professor Neyman's field rather than mine."

But the most impressive mathematical results were in the decision-theoretic framework of Neyman-Pearson-Wald.

Many of the qualifications by Neyman and Pearson in the first wave are overlooked in the philosophy of statistics literature.

Admittedly, these evidential practices were not made explicit.* (Had they been, the subsequent waves of philosophy of statistics might have looked very different.)

*Mayo's goal in ~1978


    20

The Second Wave: ~1955/60-1980

Post-data criticisms of N-P methods:

Ian Hacking (1965) framed the main lines of criticism by philosophers: Neyman-Pearson tests are suitable for before-trial betting, but not for after-trial evaluation (p. 99).

Battles: initial precision vs. final precision, before-data vs. after-data.

After the data, he claimed, the relevant measure of support is the (relative) likelihood:

Two data sets x and y may afford the same "support" to H, yet warrant different inferences [on significance test reasoning] because x and y arose from tests with different error probabilities.

o This is just what error statisticians want!


    21

o But (at least early on) Hacking (1965) held to the Law of Likelihood: x0 supports hypothesis H1 more than H2 if

P(x0; H1) > P(x0; H2).

Yet, as Barnard notes, there always is such a rival hypothesis: that things just had to turn out the way they actually did.

Since such a maximally likely alternative H2 can always be constructed, H1 may always be found less well supported, even if H1 is true: no error control.

Hacking soon rejected the likelihood approach on such grounds; likelihoodist accounts are advocated by others.
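Barnard's point can be made concrete with a toy binomial example (my own construction, not from the slides; the 60-heads-in-100-tosses data and both hypotheses are hypothetical):

```python
# Toy illustration: for any data, a rigged, maximally likely rival can be
# constructed that "wins" under the Law of Likelihood.
from math import comb

k, n = 60, 100                      # say: 60 heads in 100 tosses

def binom_lik(p):
    # Binomial likelihood of k heads in n tosses with heads-probability p
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

lik_H1 = binom_lik(0.5)             # H1: a fair coin
lik_H2 = binom_lik(k / n)           # H2: p set to whatever was observed

ratio = lik_H2 / lik_H1
print(round(ratio, 2))              # about 7.5: H2 "wins" no matter what
                                    # the data were, hence no error control
```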


    22

Perhaps THE key issue of controversy in the philosophy of statistics battles:

The (strong) likelihood principle: likelihoods suffice to convey all that the data have to say.

"According to Bayes's theorem, P(x|θ) … constitutes the entire evidence of the experiment, that is, it tells all that the experiment has to tell. More fully and more precisely, if y is the datum of some other experiment, and if it happens that P(x|θ) and P(y|θ) are proportional functions of θ (that is, constant multiples of each other), then each of the two data x and y have exactly the same thing to say about the values of θ" (Savage 1962, p. 17).

The error probabilist needs to consider, in addition, the sampling distribution of the likelihoods.

Significance levels and other error probabilities all violate the likelihood principle (Savage 1962).


    23

    Paradox of Optional Stopping

Instead of fixing the sample size n in advance, in some tests n is determined by a stopping rule. In Normal testing, 2-sided H0: μ = 0 vs. H1: μ ≠ 0:

Keep sampling until H0 is rejected at the .05 level

(i.e., keep sampling until |X̄| ≥ 1.96 σ/√n).

Nominal vs. actual significance levels: with n fixed, the Type I error probability is .05. With this stopping rule the actual significance level differs from, and will be greater than, .05.

By contrast, since likelihoods are unaffected by the stopping rule, the LP follower denies there really is an evidential difference between the two cases (i.e., n fixed and n determined by the stopping rule).

Should it matter if I decided to toss the coin 100 times and happened to get 60% heads, or if I decided to keep tossing until I could reject at the .05 level (2-sided) and this happened to occur on trial 100?

Should it matter if I kept going until I found statistical significance?

Error statistical principles: Yes! Penalty for perseverance! The LP says NO!
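A quick Monte Carlo sketch (my own construction, not from the slides; the cap of 500 observations and the 2000 trials are arbitrary choices) shows the inflation:

```python
# Optional-stopping effect: H0 is true, yet checking for significance after
# every observation inflates the rejection rate above the nominal .05.
import random

random.seed(1)

def rejects_under_stopping(max_n=500):
    total = 0.0
    for n in range(1, max_n + 1):
        total += random.gauss(0, 1)              # NIID N(0,1) data: H0 true
        if abs(total / n) >= 1.96 / n ** 0.5:    # |X_bar| >= 1.96/sqrt(n)
            return True                          # nominally significant at .05
    return False

trials = 2000
actual = sum(rejects_under_stopping() for _ in range(trials)) / trials
print(round(actual, 2))   # well above the nominal .05: the actual level inflates
```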


    24

Savage Forum 1959: Savage audaciously declares that the lesson to draw from the optional stopping effect is that optional stopping is no sin, so the problem must lie with the use of significance levels. But why accept the likelihood principle (LP)? (Simplicity and freedom?)

"The likelihood principle emphasized in Bayesian statistics implies … that the rules governing when data collection stops are irrelevant to data interpretation. It is entirely appropriate to collect data until a point has been proved or disproved" (p. 193). "This irrelevance of stopping rules to statistical inference restores a simplicity and freedom to experimental design that had been lost by classical emphasis on significance levels (in the sense of Neyman and Pearson)" (Edwards, Lindman, Savage 1963, p. 239).

For frequentists this only underscores the point raised years before by Pearson and Neyman:

"A likelihood ratio (LR) may be a criterion of relative fit but … it is still necessary to determine its sampling distribution in order to control the error involved in rejecting a true hypothesis, because a knowledge of [LR] alone is not adequate to insure control of this error" (Pearson and Neyman, 1930, p. 106).


    25

The key difference: likelihood fixes the actual outcome, i.e., just d(x0), while error statistics considers outcomes other than the one observed, in order to assess the error properties.

The LP entails the irrelevance of, and no control over, error probabilities.

("Why you cannot be just a little bit Bayesian", EGEK 1996)

Update: A famous argument (1962, Birnbaum) purports to show that plausible error statistical principles entail the LP!

"Radical!" "Breakthrough!" (since the LP entails the irrelevance of error probabilities!)

But the "proof" is flawed! (Mayo 2010; see blog).


    26

The Statistical Significance Test Controversy (Morrison and Henkel, 1970): contributors chastise social scientists for slavish use of significance tests.

o Focus is on simple Fisherian significance tests.

o Philosophers direct criticisms mostly to N-P tests.

Fallacies of Rejection: Statistical vs. Substantive Significance

(i) Take statistical significance as evidence of a substantive theory that explains the effect;

(ii) Infer a discrepancy from the null beyond what the test warrants.

(i) Paul Meehl: It is fallacious to go from a statistically significant result, e.g., at the .001 level, to infer that one's substantive theory T, which entails the [statistical] alternative H1, has received … quantitative support of magnitude around .999.

A statistically significant difference (e.g., in child rearing) is not automatically evidence for a Freudian theory.

T is subjected to only a feeble risk, violating Popper.


    27

Fallacies of rejection:

(i) Take statistical significance as evidence of a substantive theory that explains the effect;

(ii) Infer a discrepancy from the null beyond what the test warrants.

Finding a statistically significant effect, d(x0) > cα (the cut-off for rejection), need not be indicative of large or meaningful effect sizes: the test may be too sensitive.

Large n problem: an α-significant rejection of H0 can be very probable, even with a substantively trivial discrepancy from H0.

This is often taken as a criticism because it is assumed that statistical significance at a given level is more evidence against the null the larger the sample size (n): fallacy!

"The thesis implicit in the [N-P] approach [is] that a hypothesis may be rejected with increasing confidence or reasonableness as the power of the test increases" (Howson and Urbach 1989 and later editions).

In fact, it is indicative of less of a discrepancy from the null than if it had resulted from a smaller sample size.


    28

(Analogy with a smoke detector: an alarm from one that often goes off from merely burnt toast (overly powerful or sensitive), vs. an alarm from one that rarely goes off unless the house is ablaze.)

This comes also in the form of the Jeffreys-Good-Lindley paradox:

Even a highly statistically significant result can, with n sufficiently large, correspond to a high posterior probability on a null hypothesis.


    29

    Fallacy of Non-Statistically Significant Results

Test T(α) fails to reject the null when the test statistic fails to reach the cut-off point for rejection, i.e., d(x0) ≤ cα.

A classic fallacy is to construe such a negative result as evidence FOR the correctness of the null hypothesis (common in risk assessment contexts).

No evidence against is not evidence for:

merely surviving the statistical test is too easy, occurs too frequently, even when the null is false;

it results from tests lacking sufficient sensitivity or power.

The Power Analytic Movement of the 60s in psychology

Jacob Cohen: by considering ahead of time the power of the test, select a test capable of detecting discrepancies of interest.

A pre-data use of power (for planning).


    30

A multitude of tables were supplied (Cohen, 1988), but until his death he bemoaned their all-too-rare use.

(Power is a feature of N-P tests, but apparently the prevalence of Fisherian tests in the social sciences, coupled, perhaps, with the difficulty of calculating power, resulted in ignoring power. There was also the fact that they were not able to get decent power in psychology; they turned to meta-analysis.)


    31

    Post-data use of power to avoid fallacies of insensitive tests

If there is a low probability of a statistically significant result even when a non-trivial discrepancy δ is present (low power against δ), then a non-significant difference is not good evidence that a non-trivial discrepancy is absent.

Still too coarse: power is always calculated relative to the cut-off point cα for rejecting H0.

Consider test T(α), with σ = 1, n = 25, and let the non-trivial discrepancy be δ = .2.

No matter what the non-significant outcome, the power to detect δ = .2 is only .16!

So we'd have to deny the data were good evidence that μ < .2.

This suggested to me (in writing my dissertation around 1978) that rather than calculating

(1) P(d(X) > cα; μ = .2)   (power)

one should calculate

(2) P(d(X) > d(x0); μ = .2)   (observed power, or severity).

Even if (1) is low, (2) may be high. We return to this in the developments of Wave III.
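The contrast between (1) and (2) can be computed directly; a sketch of my own, using the running example's values (σ = 1, n = 25, cα = 1.96) and, for (2), a markedly non-significant outcome X̄ = −.2:

```python
# Sketch contrasting (1) ordinary power with (2) observed power (severity),
# assuming sigma = 1, n = 25, c = 1.96 as in the running example.
from statistics import NormalDist

Z = NormalDist()
sigma_x = 1.0 / 25 ** 0.5            # 0.2
c = 1.96

def pow_at(mu1):                     # (1) P(d(X) > c; mu = mu1)
    return 1 - Z.cdf(c - mu1 / sigma_x)

def sev_at(mu1, x_bar):              # (2) P(d(X) > d(x0); mu = mu1)
    d_obs = x_bar / sigma_x
    return 1 - Z.cdf(d_obs - mu1 / sigma_x)

print(round(pow_at(0.2), 2))         # 0.17: (1) is low, as on the slide
print(round(sev_at(0.2, -0.2), 2))   # 0.98: yet (2) can be high for an
                                     # actual non-significant outcome
```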


    32

III. The Third Wave: Relativism, Reformulations, Reconciliations ~1980-2005+

(skip) Rational Reconstruction and Relativism in Philosophy of Science

Fighting Kuhnian battles over the very idea of a unified method of scientific inference, statistical inference became less prominent in philosophy;

it was largely used in rational reconstructions of scientific episodes, in appraising methodological rules, and in classic philosophical problems, e.g., Duhem's problem: reconstruct a given assignment of blame so as to be warranted by Bayesian probability assignments.

No normative force.

With the recognition that science involves subjective judgments and values, reconstructions often appeal to a subjective Bayesian account (Salmon's "Tom Kuhn Meets Tom Bayes").

(Kuhn thought this was confused: no reason to suppose an algorithm remains through theory change.)

    Naturalisms, HPS


    33

    Wave III in Scientific Practice

    Statisticians turn to eclecticism.

Non-statistician practitioners (e.g., in psychology, ecology, medicine) bemoan unholy hybrids:

"a mixture of ideas from N-P methods, Fisherian tests, and Bayesian accounts that is inconsistent from both perspectives and burdened with conceptual confusion" (Gigerenzer, 1993, p. 323).

Faced with foundational questions, non-statistician practitioners raise anew the questions from the first and second waves.

Finding the automaticity and fallacies still rampant, most, if they are not calling for an outright ban on significance tests in research, insist on reforms and reformulations of statistical tests.

    Task Force to consider Test Ban in Psychology: 1990s


    34

Reforms and Reinterpretations Within Error Probability Statistics

Any adequate reformulation must:

(i) show how to avoid classic fallacies (of acceptance and of rejection) on principled grounds;

(ii) show that it provides an account of inductive inference.


    35

    Avoiding Fallacies

To quickly note my own recommendation (for test T(α)): move away from the coarse accept/reject rule; use the specific result (significant or insignificant) to infer those discrepancies from the null that are well ruled out, and those which are not.

e.g., Interpretation of Non-Significant Results:

If d(x0) is not statistically significant, and the test had a very high probability of a more statistically significant difference if μ > μ0 + γ, then d(x0) is good grounds for inferring μ ≤ μ0 + γ.

Use the specific outcome to infer an upper bound μ* (values beyond μ* are ruled out with given severity).

If d(x0) is not statistically significant, but the test had a very low probability of a more statistically significant difference if μ > μ0 + γ, then d(x0) is poor evidence for inferring μ ≤ μ0 + γ.

The test had too little probative power to have detected such discrepancies even if they existed!


    36

This takes us back to the post-data version of power. Rather than construe a miss as good as a mile, parity of logic suggests that the post-data power assessment should replace the usual calculation of power against μ1,

POW(T(α), μ1) = P(d(X) > cα; μ = μ1),

with what might be called the power actually attained or, to have a distinct term, the severity (SEV):

SEV(T(α), μ1) = P(d(X) > d(x0); μ = μ1),

where d(x0) is the observed (non-statistically significant) result.


    37

Figure 1 compares power and severity for different outcomes.

Figure 1. POW(T(α), μ = .2) = .168, irrespective of the value of d(x0) (the solid curve); the severity evaluations are data-specific. For the inference μ < .2: both X̄ = .39 and X̄ = −.2 fail to reject H0, but

with X̄ = .39, SEV(μ < .2) is low (.17);

with X̄ = −.2, SEV(μ < .2) is high (.97).


    38

Fallacies of Rejection: The Large-n Problem

While with a non-significant result the concern is erroneously inferring that a discrepancy from μ0 is absent,

with a significant result x0 the concern is erroneously inferring that it is present.

Utilizing the severity assessment: an α-significant difference with n1 passes μ > μ1 less severely than with n2, where n1 > n2.

Figure 2 compares test T(α) with three different sample sizes, n = 25, n = 100, n = 400, denoted by T(α; n), where in each case d(x0) = 1.96, a rejection right at the cut-off point.

In this way we solve the problems of tests too sensitive or not sensitive enough, but there's one more thing: showing how this supplies an account of inductive inference.

Many argue in Wave III that error statistical methods cannot supply an account of inductive inference because error probabilities conflict with posterior probabilities.


    39

    Figure 2 compares test T(α) with three different sample sizes, n = 25, n = 100, n = 400, denoted by T(α; n); in each case d(x0) = 1.96 just rejects at the cut-off point.

    Figure 2. In test T(α) (H0: μ ≤ 0 against H1: μ > 0, with σ = 1), c_α = 1.96 and d(x0) = 1.96.

    The severity for the inference μ > .1:

    n = 25: SEV(μ > .1) = .93
    n = 100: SEV(μ > .1) = .83
    n = 400: SEV(μ > .1) = .5
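    The three severities follow from the definition. A sketch with σ = 1 as in the caption; the discrepancy threshold μ1 = .1 is my reconstruction from the quoted values, not stated on the slide:

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal N(0, 1)

def sev_greater(n, mu1, sigma=1.0, d_obs=1.96):
    """Severity for inferring mu > mu1 when d(x0) = d_obs just rejects:
    SEV(mu > mu1) = P(d(X) <= d(x0); mu = mu1)."""
    se = sigma / sqrt(n)
    xbar = d_obs * se  # observed mean sitting exactly at the cut-off
    return Z.cdf((xbar - mu1) / se)

for n in (25, 100, 400):
    print(n, round(sev_greater(n, 0.1), 2))  # 25 -> 0.93, 100 -> 0.83, 400 -> 0.48 (~.5)
```

    The same just-significant d(x0) warrants a smaller discrepancy as n grows: exactly the large-n fallacy the severity assessment is meant to block.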


    P-values vs. Bayesian Posteriors

    A statistically significant difference from H0 can correspond to a large posterior probability in H0. From the Bayesian perspective, it follows that p-values come up short as a measure of inductive evidence;

    the significance testers balk at the recommended priors resulting in highly significant results being construed as no evidence against the null, or even evidence for it!

    The conflict often considers the two-sided T(2α) test:

    H0: μ = μ0 vs. H1: μ ≠ μ0.

    (The differences between p-values and posteriors are far less marked with one-sided tests.)

    Assuming a prior of .5 on H0, with n = 50 one can classically reject H0 at significance level p = .05, although P(H0|x) = .52 (which would actually indicate that the evidence favors H0).

    This is taken as a criticism of p-values only because it is assumed the .52 posterior is the appropriate measure of belief-worthiness.


    As the sample size increases, the conflict becomes more noteworthy.

    If n = 1000, a result statistically significant at the .05 level leads to a posterior probability on the null of .82!

    SEV(H1) = .95, while the corresponding posterior has gone from .5 to .82. What warrants such a prior?

    Posterior probability P(H0|x), with prior P(H0) = .5, as a function of sample size n:

      p      t       n=10   n=20   n=50   n=100  n=1000
     .10    1.645    .47    .56    .65    .72    .89
     .05    1.960    .37    .42    .52    .60    .82
     .01    2.576    .14    .16    .22    .27    .53
     .001   3.291    .024   .026   .034   .045   .124
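    A sketch reproducing the table's posteriors, assuming the standard Jeffreys-type setup behind such tables: a .5 spike on H0 and, under H1, μ ~ N(μ0, σ²). That prior form is an assumption on my part; the slide does not spell it out:

```python
from math import exp, sqrt

def posterior_null(t, n, pi0=0.5):
    """P(H0 | x) in the two-sided test H0: mu = mu0 vs H1: mu != mu0,
    with prior P(H0) = pi0 and mu ~ N(mu0, sigma^2) under H1 (assumed form).
    t is the observed statistic sqrt(n)*(xbar - mu0)/sigma; under this prior
    the Bayes factor for H0 reduces to sqrt(n+1) * exp(-t^2 * n / (2*(n+1)))."""
    b01 = sqrt(n + 1) * exp(-t * t * n / (2 * (n + 1)))
    return pi0 * b01 / (pi0 * b01 + 1 - pi0)

print(round(posterior_null(1.960, 50), 2))    # 0.52: rejected at .05, yet the posterior "favors" H0
print(round(posterior_null(1.960, 1000), 2))  # 0.82
print(round(posterior_null(3.291, 1000), 3))  # 0.124
```

    The fixed spike plus a fixed p-value, with n growing, mechanically drives the posterior toward the null, which is the conflict the table displays.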

    (1) Some claim the prior of .5 is a warranted frequentist assignment:

    H0 was randomly selected from an urn in which 50% are true;

    (*) therefore P(H0) = .5.


    H0 may be 0 change in extinction rates, 0 lead concentration, etc.

    What should go in the urn of hypotheses?

    For the frequentist, either H0 is true or false; the probability in (*) is fallacious and results from an unsound instantiation.

    We are very interested in how false H0 might be, which is just what a severity assessment provides.

    (2) Subjective degree-of-belief assignments will not ensure the error probabilities, and thus the severity assessments, we need.

    (3) Some suggest an impartial or uninformative Bayesian prior gives .5 to H0, the remaining .5 probability being spread out over the alternative parameter space (Jeffreys).

    This spiked concentration of belief in the null is at odds with the prevailing view that we know all nulls are false.

    Bayesians have recently co-opted 'error probability' to describe a posterior, but it is not a frequentist error probability; it measures something very different.


    Fisher: The Function of the p-Value Is Not Capable of Finding Expression

    Faced with conflicts between error probabilities and Bayesian posterior probabilities, the error probabilist would conclude that the flaw lies with the latter measure.

    Discussing a test of the hypothesis that the stars are distributed at random, Fisher takes the low p-value (about 1 in 33,000) to "exclude at a high level of significance any theory involving a random distribution" (Fisher, 1956, p. 42).

    Even if one were to imagine that H0 had an extremely high prior probability, Fisher continues, never minding "what such a statement of probability a priori could possibly mean", the resulting high posterior probability to H0, he thinks, would only show that "reluctance to accept a hypothesis strongly contradicted by a test of significance" (ibid., p. 44) "... is not capable of finding expression in any calculation of probability a posteriori" (ibid., p. 43).


    Wave IV? 2006+: The Reference Bayesians Abandon Coherence and the LP, and Strive to Match Frequentist Error Probabilities!

    Contemporary Impersonal Bayesianism

    Because of the difficulty of eliciting subjective priors, and because of the reluctance among scientists to allow subjective beliefs to be conflated with the information provided by data, much current Bayesian work in practice favors conventional "default", "uninformative", or "reference" priors.

    1. What do reference posteriors measure?

    A classic conundrum: there is no unique noninformative prior. (Supposing there is one leads to inconsistencies in calculating posterior marginal probabilities.)

    Any representation of ignorance or lack of information that succeeds for one parameterization will, under a different parameterization, entail having knowledge.
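    A two-line illustration of the reparameterization point (the squaring transformation is my own example, not from the text): a prior flat on a parameter p in (0, 1) induces a decidedly non-flat prior on φ = p².

```python
from math import sqrt

def induced_density(phi):
    """Density of phi = p**2 when p ~ Uniform(0, 1).
    Change of variables: f(phi) = d/d(phi) P(p**2 <= phi) = 1 / (2*sqrt(phi))."""
    return 1.0 / (2.0 * sqrt(phi))

print(induced_density(0.01))  # 5.0: the "ignorance" prior piles weight near phi = 0
print(induced_density(0.81))  # ~0.556: flatness in p amounts to information about p**2
```

    Whatever flatness is supposed to express about p, it expresses something quite different about p².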

    Contemporary reference Bayesians seek priors that are simply conventions to serve as weights for reference posteriors.


    Reference priors are not to be considered expressions of uncertainty, ignorance, or degree of belief.

    They may not even be probabilities; flat priors may not integrate to one (improper priors). If priors are not probabilities, what then is the interpretation of a posterior? (A serious problem I would like to see Bayesian philosophers tackle.)

    2. Priors for the same hypothesis change according to what experiment is to be done! Bayesian incoherence.

    If the prior is to represent information, why should it be influenced by the sample space of a contemplated experiment?

    This violates the likelihood principle, the cornerstone of Bayesian coherency.

    Reference Bayesians: it is the price of objectivity.

    It seems to wreak havoc with basic Bayesian foundations, but without the payoff of an objective, interpretable output; even subjective Bayesians object.


    3. Reference posteriors with good frequentist properties

    Reference priors are touted as having some good frequentist properties, at least in one-dimensional problems.

    They are deliberately designed to match frequentist error probabilities.

    If you want error probabilities, why not use techniques that provide them directly?

    Note: using conditional probability, which is part and parcel of probability theory, as in Bayes nets, does not make one a Bayesian; there are no priors to hypotheses.

