  • Basic statistics: Descriptive statistics and ANOVA

    Thomas Alexander Gerds

    Department of Biostatistics, University of Copenhagen

    Contents

    - Data are variable
    - Statistical uncertainty
    - Summary and display of data
    - Confidence intervals
    - ANOVA

    Data are variable

    A statistician is used to receiving a value, such as

    3.17 %,

    together with an explanation, such as

    "this is the expression of 1-B6.DBA-GTM in mouse 12".

    The value from the next mouse in the list is 4.88% . . .

    The measurement is difficult

  • Data processing is done by humans

    Two mice have different genes

    They are exposed . . . and treated differently

  • Decomposing variance

    Variability of data is usually a composite of

    - Measurement error, sampling scheme
    - Random variation
    - Genotype
    - Exposure, life style, environment
    - Treatment

    Statistical conclusions can often be obtained by explaining the sources of variation in the data.

    Example 1

    In the yeast experiment of Smith and Kruglyak (2008)¹, transcript levels were profiled in 6 replicates of the same strain, called 'RM', in glucose under controlled conditions.

    ¹ The article is available at http://biology.plosjournals.org

    Example 1

    Figure:

    Sources of the variation of these 6 values

    - Measurement error
    - Random variation

  • Example 1

    In the same yeast experiment, Smith and Kruglyak (2008) also profiled 6 replicates of a different strain, called 'By', in glucose. The order in which the 12 samples were processed was randomized to minimize systematic experimental effects.

    Example 1

    Figure:

    Sources of the variation of these 12 values

    - Measurement error
    - Study design/experimental environment
    - Genotype

    Example 1

    Furthermore, Smith and Kruglyak (2008) cultured 6 'RM' and 6 'By' replicates in ethanol. The order in which the 24 samples were processed was randomized to minimize systematic experimental effects.

  • Sources of variation

    Figure:

    Sources of variation

    - Measurement error
    - Experimental environment
    - Genes
    - Exposure, environmental factors

    Example 2

    Festing and Weigler in the Handbook of Laboratory Animal Science. . .

    . . . consider the results of an experiment using a completely randomized design . . .

    . . . in which adult C57BL/6 mice were randomly allocated to one of four dose levels of a hormone compound.

    The uterus weight was measured after an appropriate time interval.

    Example 2

    Figure: uterus weight by hormone dose

    Example 2

    Conclusions from the figures

    - The uterus weight depends on the dose
    - The variation of the data increases with increasing dose

    Question: Why could these first conclusions be wrong?

    Descriptive statistics

  • Descriptive statistics (summarizing data)

    Categorical variables: count (%).

    Continuous variables:
    - raw values (if n is small)
    - range (min, max)
    - location: median (IQR = interquartile range)
    - location: mean (SD)

    Sample: Table 1²

    ² Quality of life (QOL), supportive care, and spirituality in hematopoietic stem cell transplant (HSCT) patients. Sirilla & Overcash. Supportive Care in Cancer, October 2012.

    R excursion: calculating descriptive statistics in groups

    library(Publish)
    library(data.table)
    data(Diabetes)
    setDT(Diabetes) ## make data.table
    Diabetes[,.(mean.age=mean(age),
                sd.age=sd(age),
                median.chol=median(chol,na.rm=TRUE)),
             by=location]

         location mean.age   sd.age median.chol
    1: Buckingham 47.07500 16.74849         202
    2:     Louisa 46.63054 15.90929         206

  • R excursion: making table one

    library(Publish)
    data(Diabetes)
    tab1 ## (truncated in the source; see the sketch below)
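
    A minimal sketch of how tab1 might have been built with Publish's univariateTable(); the grouping variable and covariates below are illustrative choices, not taken from the slide:

    library(Publish)
    data(Diabetes)
    ## "table one": summary statistics of selected variables by location
    tab1 <- univariateTable(location ~ age + gender + chol, data = Diabetes)
    summary(tab1)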

  • Exercise

    - Read and discuss the documentation of why dynamite plots are not good:

    http://biostat.mc.vanderbilt.edu/wiki/Main/DynamitePlots

    Dot plots are appreciated when n is small

    Figure: Dot plots of the measurement scale for group A (n=3), group B (n=3, one replicate), and group C (n=4)

    Box plots are appreciated when n is large

    Figure: Box plots of the measurement scale for group A (n=300), group B (n=400), and group C (n=400)

    Making boxplots with ggplot2

    library(ggplot2)
    bp ## (truncated in the source; see the sketch below)
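
    A sketch of how bp could be defined to produce the box plot on the next slide (cholesterol by location, coloured by location); the aesthetic choices are assumptions, not taken from the slide:

    library(Publish)  ## provides the Diabetes data
    library(ggplot2)
    data(Diabetes)
    ## box plots of cholesterol by location, filled by location
    bp <- ggplot(Diabetes, aes(x = location, y = chol, fill = location)) +
      geom_boxplot()
    bp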

  • Making boxplots with ggplot2

    Figure: Box plots of chol by location (Buckingham, Louisa), filled by location

    Making boxplots with ggplot2

    bp + facet_grid(. ~ gender)

    Figure: Box plots of chol by location, faceted by gender (female, male)

    Making dotplots with ggplot2

    dp ## (truncated in the source; see the sketch below)
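
    A sketch of how dp could be defined; geom_dotplot() and its settings are assumptions, not taken from the slide:

    library(Publish)  ## provides the Diabetes data
    library(ggplot2)
    data(Diabetes)
    ## dot plot of cholesterol by location
    dp <- ggplot(Diabetes, aes(x = location, y = chol)) +
      geom_dotplot(binaxis = "y", stackdir = "center", dotsize = 0.5)
    dp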

  • Quantifying variability

    A sample of data X_1, ..., X_N has a standard deviation (SD); it is defined by

    SD = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (X_i - \bar{X})^2}, \qquad \bar{X} = \frac{1}{N} \sum_{i=1}^{N} X_i

    SD measures the variability of the measurements in the sample.

    The variance of the sample is defined as SD². The term 'standard deviation' relates to the normal distribution.
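
    The definition can be checked against R's built-in sd(), for example with the four uterus weights used later in these slides:

    x <- c(0.012, 0.0088, 0.0069, 0.009)
    sqrt(sum((x - mean(x))^2) / (length(x) - 1))  ## 0.002108, by the formula above
    sd(x)                                         ## the same value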

    Normal distribution

    What is so special about the normal distribution?

    - It is symmetric around the mean, thus the mean is equal to the median.

    - The mean is the most likely value. Mean and standard deviation describe the full distribution.

    - The distribution of measurements, like height, distance, and volume, is often normal.

    - The distribution of statistics, like the mean, proportion, mean difference, etc., is very often approximately normal.

    Quantifying statistical uncertainty

    For statistical inference and conclusion making, via p-values and confidence intervals, it is crucial to quantify the variability of the statistic (mean, proportion, mean difference, risk ratio, etc.):

    The standard error is the standard deviation of the statistic.

    The standard error is a measure of the statistical uncertainty.

  • Illustration

    Figure: a population and samples drawn from it, with means 3.81, 2.13, and 4.01


  • Quantifying statistical uncertainty

    Example: We want to estimate the unknown mean uterus weight for untreated mice. The standard error of the mean is defined as

    SE = SD/√N where N is the sample size.

    Based on N = 4 values, 0.012, 0.0088, 0.0069, 0.009:

    - mean: β̂ = 0.0091
    - standard deviation: SD = 0.002108
    - empirical variance: var = 0.0000044
    - standard error: SE = 0.002108/2 = 0.001054
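
    These numbers can be reproduced in base R, for example:

    x <- c(0.012, 0.0088, 0.0069, 0.009)
    mean(x)                ## 0.009175 (reported as 0.0091 above)
    sd(x)                  ## 0.002108
    var(x)                 ## 4.4e-06
    sd(x)/sqrt(length(x))  ## 0.001054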

    The standard error is the standard deviation of the mean

    Figure: mean uterus weight (g) in our study and in hypothetical studies 1, ..., 47, ..., 100, together with the unknown true average uterus weight

    The (hypothetical) mean values are approximately normally distributed, even if the data are not normally distributed!

    Variance vs statistical uncertainty

    "’The terms standard error and standard deviation are oftenconfused. The contrast between these two terms reflects theimportant distinction between data description and inference, onethat all researchers should appreciate."’ 6

    Rules:I The higher the unexplained variability of the data, the higher

    the statistical uncertainty.I The higher the sample size, the lower the statistical

    uncertainty.

    6Altman & Bland, Statistics Notes, BMJ, 2005, Nagele P, Br J Anaesthesiol2003;90: 514-6

    Confidence intervals

  • Constructing confidence limits

    A 95% confidence interval for the parameter β is

    [β̂ − 1.96 · SE ; β̂ + 1.96 · SE]

    Example: a 95% confidence interval for the mean uterus weight of untreated mice is given by

    95% CI = [0.0091 − 1.96 · 0.001054 ; 0.0091 + 1.96 · 0.001054] = [0.007 ; 0.011].

    The standard error SE measures the variability of the mean β̂ around the (unknown) population value β, under the assumption that the model is correctly specified.
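
    In R, this interval can be computed directly from the four measurements above:

    x  <- c(0.012, 0.0088, 0.0069, 0.009)
    m  <- mean(x)
    se <- sd(x)/sqrt(length(x))
    c(lower = m - 1.96*se, upper = m + 1.96*se)  ## approximately [0.007; 0.011]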

    The idea of a 95% confidence interval

    Figure: 95% confidence intervals for the mean uterus weight (g) in our study and in hypothetical studies 1, ..., 47, ..., 100, together with the unknown true average uterus weight

    By construction, we expect about 5 of the 100 confidence intervals not to cover (include) the true value.

    Confidence limits for the mean uterus weights (long code)

    library(Publish)
    cidat ## (truncated in the source; see the sketch below)
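
    A minimal sketch of per-dose confidence limits with data.table; the data set uterus and its columns weight and dose are assumed names, not taken from the slide:

    library(data.table)
    setDT(uterus)  ## assumed data set with one row per mouse
    cidat <- uterus[, {
      m  <- mean(weight)
      se <- sd(weight)/sqrt(.N)
      .(mean  = m,
        lower = m - qt(0.975, .N - 1)*se,
        upper = m + qt(0.975, .N - 1)*se)
    }, by = dose]
    cidat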

  • Confidence limits for the geometric mean uterus weights (short code)

    library(Publish)
    gcidat ## (truncated in the source; see the sketch below)
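
    A sketch of geometric-mean confidence limits computed on the log scale, continuing with the assumed uterus data from the previous sketch (a plain-R alternative, not necessarily the Publish call used on the slide):

    gcidat <- uterus[, {
      lw <- log(weight)
      m  <- mean(lw)
      se <- sd(lw)/sqrt(.N)
      .(geomean = exp(m),
        lower   = exp(m - qt(0.975, .N - 1)*se),
        upper   = exp(m + qt(0.975, .N - 1)*se))
    }, by = dose]
    gcidat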

  • Publish: Plot of means with confidence intervals (result)

    Hormone dose    Mean (CI95)
    0               0.01 (0.01–0.01)
    1               0.02 (0.02–0.03)
    2.5             0.05 (0.04–0.06)
    7.5             0.09 (0.07–0.11)
    50              0.09 (0.05–0.12)

    (x-axis of the plot: uterus weight (g), 0.00–0.15)

    Parameters

    It is generally difficult to interpret a p-value without further quantification of the parameter of interest.

    Parameters are interpretable characteristics that have to be estimated based on data.

    Examples that we will study during the course:

    - Means
    - Mean differences
    - Probabilities
    - Risk ratios, odds ratios, hazard ratios
    - Association parameters, regression coefficients

    Juonala et al. (part I)

    Aims: The objective was to produce reference values and to analyse the associations of age and sex with carotid intima-media thickness (IMT), carotid compliance (CAC), and brachial flow-mediated dilatation (FMD) in young healthy adults.

    Methods and results: We measured IMT, CAC, and FMD with ultrasound in 2265 subjects aged 24–39 years. The mean values (mean ± SD) in men and women were 0.592 ± 0.10 vs. 0.572 ± 0.08 mm (P < 0.0001) for IMT, 2.00 ± 0.66 vs. 2.31 ± 0.77 %/10 mmHg (P < 0.0001) for CAC, and 6.95 ± 4.00 vs. 8.83 ± 4.56% (P < 0.0001) for FMD.

    The sex differences in IMT (95% confidence interval = [−0.013; 0.004] mm, P = 0.37) and CAC (95% CI = [−0.01; 0.18] %/10 mmHg, P = 0.09) became non-significant after adjustment for risk factors and carotid diameter.

    Confidence intervals

    A confidence interval is a range of values which covers the unknown true population parameter with high probability. Roughly, the probability is (100 − α)%, where α is the level of significance.

    For example, −0.013 to 0.004

    is a 95% confidence interval for the unknown average difference in IMT between men and women.

    Confidence intervals have the advantage over p-values that their absolute value has a direct interpretation.⁷

    ⁷ Confidence intervals rather than P values: estimation rather than hypothesis testing. Statistics with Confidence, Altman et al.

  • Relation between confidence intervals and p-values

    If we estimate the parameter β, e.g.

    β = mean(IMT men) − mean(IMT women)

    and have computed a 95% confidence interval for this parameter,

    [lower95, upper95]

    then the null hypothesis

    β = 0 "There is no difference"

    can be rejected at the 5% significance level if the value 0 is not included in the interval: 0 ∉ [lower95, upper95].
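
    A small illustration with simulated data (the means and SDs are borrowed from the IMT example only for flavour; this is not the Juonala data): the 95% confidence interval of the mean difference excludes 0 exactly when the p-value of the same t-test is below 0.05.

    set.seed(1)
    men   <- rnorm(1000, mean = 0.592, sd = 0.10)
    women <- rnorm(1000, mean = 0.572, sd = 0.08)
    tt <- t.test(men, women)
    tt$conf.int  ## 95% CI for the mean difference
    tt$p.value   ## below 0.05 exactly when the CI excludes 0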

    ANOVA

    Example (DGA p.208)

    22 cardiac bypass operation patients were randomized to 3 types of ventilation.

    Outcome: Red cell folate level (µg/l)

    Group   Ventilation                         N   Mean    SD
    I       50% N2O, 50% O2 in 24 hours         8   316.6   58.7
    II      50% N2O, 50% O2 during operation    9   256.4   37.1
    III     30–50% O2 (no N2O) in 24 hours      5   278.0   33.8

    ANOVA

    ## R code
    anova(lm(cell ~ group, data = RedCellData))
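
    The data object RedCellData is not shown on the slide. The same 22 observations are available as red.cell.folate in the ISwR package (the package and column names here are an assumption, not taken from the slide), so the ANOVA can be reproduced along these lines:

    library(ISwR)
    data(red.cell.folate)
    ## one-way ANOVA of folate level on ventilation group
    anova(lm(folate ~ ventilation, data = red.cell.folate))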

  • ANOVA table for red cell folate levels

    Source of variation   Degrees of freedom   Sum of squares   Mean squares   F      P
    Between groups        2                    15515.88         7757.9         3.71   0.04
    Within groups         19                   39716.09         2090.3
    Total                 21                   55231.97
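
    As a quick check, the sums of squares and the degrees of freedom add up, and each mean square is the corresponding sum of squares divided by its degrees of freedom:

    15515.88 + 39716.09  ## = 55231.97 (total sum of squares)
    2 + 19               ## = 21 (total degrees of freedom)
    15515.88 / 2         ## = 7757.9 (between-groups mean square)
    39716.09 / 19        ## = 2090.3 (within-groups mean square)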

    What are sum of squares and degrees of freedom?

    Recall the definition of the variance for a sample of N values X_1, ..., X_N with mean \bar{X}:

    Var = \frac{1}{N-1} \left\{ (X_1 - \bar{X})^2 + \cdots + (X_N - \bar{X})^2 \right\}

    Var = \frac{1}{\underbrace{N-1}_{\text{degrees of freedom}}} \underbrace{\left\{ (X_1 - \bar{X})^2 + \cdots + (X_N - \bar{X})^2 \right\}}_{\text{sum of squares}}

    In ANOVA terminology the variance is referred to as a mean square, which is short for: mean squared deviation from the mean.

  • ANOVA methods

    - Independent observations
      - t-test for two groups
      - One-way ANOVA for more groups
      - More-way ANOVA for more grouping variables

    - Dependent observations
      - Repeated measures ANOVA
      - Mixed effects models

    - Rank statistics (non-parametric ANOVA tests)
      - Non-parametric ANOVA (Kruskal-Wallis test)

    - Mixture of discrete and continuous factors
      - ANCOVA

    - Model comparison and model selection . . .

    Nice methods, but what is the question?

    Typical F-test hypotheses

    H0 (null hypothesis):        The red cell folate does not depend on the treatment
    H1 (alternative hypothesis): The red cell folate does depend on the treatment

    This means

    H0 : Mean group I = Mean group II = Mean group III

    H1 : Mean group I ≠ Mean group II
         or Mean group III ≠ Mean group II
         or Mean group I ≠ Mean group III

    Usually we want to know which treatment yields the best response.

    F-test statistic

    Central idea: The deviation of a subject's response from the grand mean of all responses is attributable to a deviation of that value from its group mean plus the deviation of that group mean from the grand mean.

    F = \frac{\text{between-group variability}}{\text{within-group variability}} = \frac{\text{variance of the mean response values between groups}}{\text{variance of the values within the groups}}

    If the between-group variability is large relative to the within-group variability, then the grouping factor contributes to the systematic part of the variability of the response values.
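
    With the numbers from the ANOVA table above, the F statistic and its p-value can be reproduced by hand in R:

    ms_between <- 15515.88 / 2    ## 7757.9
    ms_within  <- 39716.09 / 19   ## 2090.3
    ms_between / ms_within        ## F = 3.71
    pf(ms_between / ms_within, df1 = 2, df2 = 19, lower.tail = FALSE)  ## P ≈ 0.04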

  • Conclusions from the ANOVA table

    Source of variation   Degrees of freedom   Sum of squares   Mean squares   F      P
    Between groups        2                    15515.88         7757.9         3.71   0.04
    Within groups         19                   39716.09         2090.3
    Total                 21                   55231.97

    Conclusion: The red cell folate depends significantly on the treatment.

    Take home messages

    - The variation of data can be decomposed into a systematic and a random part.
    - The standard deviation quantifies the variability of the data.
    - The standard error quantifies the uncertainty of statistical conclusions.
    - ANOVA is an old and general statistical technique with many different applications.

