Page 1: Biostatistics and Experimental Design - …genome.tugraz.at/biostatistics/biostat2009.pdf · Biostatistics and Experimental Design Hubert Hackl Institute for Genomics and Bioinformatics

Biostatistics and Experimental Design

Hubert Hackl

Institute for Genomics and Bioinformatics
Graz University of Technology

2009/2010

Outline

Aims of this course

Introduction

Descriptive statistics

Diagnostic tests and method comparison

Theoretical distributions and probability

Parameter estimation and confidence interval

Hypothesis testing

Comparing groups

Correlation and regression

Relation between several variables

Analysis of survival times

Experimental design

Study design and clinical trials

Discussion of medical literature


Aims

- Learn to understand statistical results (understand statistics in medical publications)
- To analyze data and apply appropriate statistical methods
- Learn to design experiments for research and clinical studies
- To judge statistical results from a critical point of view
- Learn to use R, a free software environment for statistical computing and graphics


Simpson’s paradox

all

Drug   Recovery: yes   Recovery: no   Sum   Recovery rate
new    20              20             40    50%
old    16              24             40    40%

female

Drug   Recovery: yes   Recovery: no   Sum   Recovery rate
new    18              12             30    60%
old    7               3              10    70%

male

Drug   Recovery: yes   Recovery: no   Sum   Recovery rate
new    2               8              10    20%
old    9               21             30    30%

Simpson’s paradox

Confounding variables
In the pooled data the new drug has the higher recovery rate, yet within each sex the old drug does better: sex acts as a confounding variable.
There are a number of examples, e.g. kidney stone treatment, sex bias, education, ...
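
The reversal in the drug tables can be checked with a few lines of code. The following is a minimal sketch that only uses the counts from the slides; the names `counts`, `recovery_rate`, and `pooled` are illustrative choices and not part of the course material.

```python
# Recovery counts (yes, no) copied from the female and male tables.
counts = {
    "female": {"new": (18, 12), "old": (7, 3)},
    "male": {"new": (2, 8), "old": (9, 21)},
}


def recovery_rate(yes, no):
    """Fraction of patients who recovered."""
    return yes / (yes + no)


def pooled(drug):
    """Add up the (yes, no) counts of one drug over both sexes."""
    yes = sum(counts[sex][drug][0] for sex in counts)
    no = sum(counts[sex][drug][1] for sex in counts)
    return yes, no


# Within each sex the old drug has the higher recovery rate ...
for sex in counts:
    for drug in ("new", "old"):
        print(sex, drug, f"{recovery_rate(*counts[sex][drug]):.0%}")

# ... but after pooling the new drug looks better.
for drug in ("new", "old"):
    print("all", drug, f"{recovery_rate(*pooled(drug)):.0%}")
```

The pooled comparison flips because most women (who recover more often) received the new drug, while most men received the old one.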


Breadth and length of skulls (Pearson 1896)

Car/goat problem

One of three doors hides a car (all three equally likely) and the other two hide goats. You choose Door A. The host, who knows where the car is, then opens one of the other two doors to reveal a goat, and asks whether you wish to switch your choice. Say he opens Door C; should you stick with your original choice, Door A, or switch to Door B?

Car/goat problem

Naive approach
Regardless of the initial situation, there are now only two doors from which I could choose.

\[ p(\text{car is behind A}) = p(\text{car is not behind A}) = \frac{1}{2} \]

⇒ It is not an advantage to switch the door.

Bayes' theorem

\[ p(A \mid \text{open C}) = \frac{p(\text{open C} \mid A) \times p(A)}{p(\text{open C})} = \frac{\frac{1}{2} \times \frac{1}{3}}{\frac{1}{2}} = \frac{1}{3} \]

\[ p(B \mid \text{open C}) = \frac{p(\text{open C} \mid B) \times p(B)}{p(\text{open C})} = \frac{1 \times \frac{1}{3}}{\frac{1}{2}} = \frac{2}{3} \]

⇒ The probability of winning the car is higher if you switch doors.
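
The Bayes calculation can be reproduced with exact fractions. This is a small sketch, assuming (as is usual for this puzzle, though not stated explicitly on the slide) that the host picks Door B or Door C with equal probability when the car is behind Door A; the variable names are illustrative.

```python
from fractions import Fraction

# Prior: the car is equally likely to be behind each door.
prior = {"A": Fraction(1, 3), "B": Fraction(1, 3), "C": Fraction(1, 3)}

# Probability that the host opens Door C, given the car's position and
# that the player chose Door A. The host never opens the player's door
# and never reveals the car. The 1/2 for "A" is the assumption above.
p_open_c = {"A": Fraction(1, 2), "B": Fraction(1), "C": Fraction(0)}

# Denominator of Bayes' theorem: total probability that Door C is opened.
evidence = sum(p_open_c[door] * prior[door] for door in prior)

# Posterior probability of the car's position after Door C was opened.
posterior = {door: p_open_c[door] * prior[door] / evidence for door in prior}

print("p(open C) =", evidence)
print("p(A | open C) =", posterior["A"])
print("p(B | open C) =", posterior["B"])
```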

Diagnosis study

1 in 1000 persons suffers from a disease. There is a test which gives wrong results with a probability of 5% (the false-positive rate is 5%).

What is the probability that a person with a positive test result has this disease?

The naive answer would be 95%.

Taking the prevalence of the disease into account, the probability of having the disease when the test is positive is < 2%.
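
The figure of less than 2% follows from Bayes' theorem. The sketch below assumes a sensitivity of 95% (the slide only states the false-positive rate explicitly, so this value is an assumption); the function name is illustrative.

```python
def positive_predictive_value(prevalence, sensitivity, false_positive_rate):
    """P(disease | positive test) via Bayes' theorem."""
    true_positives = sensitivity * prevalence
    false_positives = false_positive_rate * (1 - prevalence)
    return true_positives / (true_positives + false_positives)


ppv = positive_predictive_value(
    prevalence=0.001,          # 1 in 1000 persons
    sensitivity=0.95,          # assumption: the test misses 5% of the diseased
    false_positive_rate=0.05,  # from the slide
)
print(f"P(disease | positive test) = {ppv:.2%}")
```

Even with a perfect sensitivity of 100% the result stays just below 2%, because the 5% false positives among the 999 healthy persons far outnumber the single true case.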

Biostatistics

Biostatistics
Application of statistics in biology, medicine, and related research.
Guidelines to conduct and interpret medical studies.
Helps to make the evaluation of medical data objective.

Descriptive statistics
The aim is to describe data concisely by characteristic values and by visualization with graphical procedures.
Data are presented without a measure of significance.

Inferential statistics
Used to draw inferences about a population from a sample.
Hypothesis testing
Quantifying the uncertainty of a decision
Parameter estimation

Key concept

Population
Collection of all objects, events, or individuals (people) about whom you would like to ask a research question.

Sample
To study a population, the researcher typically selects a small group, called a sample, from the population.
The sample size is the number of individuals in the sample (not the number of measurements you make on each person!). The sample should be representative and random.

Random sample
Sample chosen from a population in a fashion that ensures every object, event, item, or individual has an equal chance of being drawn. The selection of any one entity can in no way influence or affect the selection of any other (independence).

Individuals
Objects, events, persons, individuals (observation unit)

What statistical calculations can do

Statistical estimation
An example is calculating the mean of a sample. This is only an estimate of the population mean and is called a point estimate. You also want to know how good this estimate is, so you give a range of values (confidence interval).

Statistical hypothesis testing
Statistical hypothesis testing helps to decide whether an observed difference is likely to be caused by chance, and provides a measure called the p-value.

Statistical modeling
Statistical modeling tests how well experimental data fit a mathematical model constructed from e.g. physical principles. An example of this is linear regression.

Examples

Sample size and population
Aristotle maintained that women have fewer teeth than men; although he was twice married, it never occurred to him to verify this statement by examining his wives' mouths.
Sir Bertrand Russell, The Impact of Science on Society, 1952.

Test whether a drug is effective in treating patients with HIV

- The population you really care about is more diverse than the population from which your data were sampled
- Collection of data from a "convenience sample" rather than a random sample
- The measured variable (CD4 lymphocytes) is a proxy for another variable you really care about (survival time)
- Measurements may be made or recorded incorrectly (quality of antibody!)
- Combination of different kinds of measurements to reach an overall conclusion.

Applications in Medicine

- Epidemiology
- Biometry
- In vitro and animal experiments
- Clinical trials (Phase I to IV)
- Approval for drugs and medical devices
- Evaluation of new measurement and diagnostic techniques
- Meta analysis
- Evidence based medicine


Research projects

PLANNING

DESIGN

EXECUTION data collection

DATA PROCESSING

DATA ANALYSIS

PRESENTATION

INTERPRETATION

PUBLICATION

Classification of statistical methods

Univariate methods
Each variable is considered individually

Bivariate methods
The relation between 2 variables is studied

Multivariate methods
The relation between >2 variables is studied


Measurement

Observation unit
The unit upon which measurements are made
Examples: blood samples, animals, test persons, patients ...

Variable
Observable or measurable properties of the observation unit which can take different values.
Should address the question and meet the criteria of objectivity, reliability, and validity.
Examples: diagnosis, tumor stage, cholesterol levels ...

Value
A realized measurement; feature characteristic
Examples: type of surgery, 3 mol/ml, female ...

Types of data

Categorical data (qualitative)
Nominal data (sex: male, female; blood group: 0, A, B, AB)
Ordinal data (cancer stage: I, II, III, IV)

Numerical data (quantitative)
Discrete data (number of children: 0, 1, 2, 3, 4, 5+)
Continuous data (blood pressure; height in cm)

Other types of data
Ranks, percentages, rates and ratios, scores, visual analog scale, censored data

Note: It is important to know the data type, since representation and analysis depend on it.

Types of scales

Nominal scale
Equal or not equal (a = b, a ≠ b)

Ordinal scale
Ranking is possible (a < b, a = b, a > b)

Interval scale
Not only the rank but also the difference of values is meaningful (c = a − b)
0 is set arbitrarily (e.g. 2007 AD, temperature scale, diopter)

Ratio scale
Not only differences but also ratios are meaningful (c = a/b)
0 is represented naturally in empirical data (e.g. age of a person, absolute zero)

Frequencies

Absolute frequency
Number k of observations that bear the same value or fall within a given class, out of the total number n of observations:

\[ f_{abs} = k \]

Relative frequency
Estimate of the probability of a single event for discrete data:

\[ f_{rel} = \frac{k}{n}, \qquad 0 \le f_{rel} \le 1 \]

Relative frequency in percent:

\[ f_{rel\%} = f_{rel} \times 100\% \]

Presentation of categorical (discrete) data

Frequency table

Blood group   Frequency   Relative frequency   Relative frequency %
A             867         0.421                42.1%
AB            134         0.065                6.5%
B             363         0.176                17.6%
0             696         0.338                33.8%
Total         2060        1.000                100%

The sample size should always be given together with relative frequencies.

Example: 1 man and 6 women are 14.286% and 85.714%.

⇒ If the sample size is small, use absolute frequencies and avoid relative frequencies
⇒ Percentages with many decimal places suggest a large sample size
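
The frequency table can be rebuilt from the absolute counts alone. This is a minimal sketch using the blood group counts from the slide; the dictionary and variable names are illustrative.

```python
# Absolute frequencies k per class, taken from the blood group table.
blood_groups = {"A": 867, "AB": 134, "B": 363, "0": 696}

n = sum(blood_groups.values())  # total number of observations

# Relative frequency f_rel = k / n for every class.
relative = {group: k / n for group, k in blood_groups.items()}

print(f"{'Group':<6}{'k':>6}{'f_rel':>8}{'f_rel %':>9}")
for group, k in blood_groups.items():
    print(f"{group:<6}{k:>6}{relative[group]:>8.3f}{relative[group]:>9.1%}")
print(f"{'Total':<6}{n:>6}{sum(relative.values()):>8.3f}")
```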

Presentation of discrete data

In bar charts, bars should always start from 0.

Prefer bar charts to pie charts, since the eye is good at judging linear measures and bad at judging relative areas.

3-dimensional pie charts can show misleading proportions due to the change of perspective.


Presentation of data course

Consider relation between x- and y-scale.

Diagrams should start from 0.

Presentation of continuous data

A simple graphical way of depicting a complete set of observations is the histogram, in which the number (or frequency) of observations is plotted for different values or groups of values.

Example
Serum cholesterol levels (mmol/l) of a sample of 86 stroke patients (Markus et al. 1995)

3.7  3.8  3.8  4.4  4.5  4.5  4.5  4.7  4.7  4.8  4.8  4.9  4.9
4.9  5.0  5.1  5.1  5.2  5.3  5.3  5.4  5.4  5.5  5.5  5.5  5.6
5.6  5.6  5.6  5.6  5.6  5.6  5.7  5.7  5.7  5.8  5.8  5.9  6.0
6.1  6.1  6.1  6.1  6.2  6.3  6.3  6.4  6.4  6.4  6.4  6.4  6.5
6.5  6.6  6.7  6.7  6.8  6.8  7.0  7.0  7.0  7.0  7.1  7.1  7.2
7.3  7.4  7.4  7.5  7.5  7.6  7.6  7.6  7.7  7.8  7.8  7.8  8.2
8.3  8.6  8.7  8.9  9.3  9.5  10.2 10.4

Histogram

Partition into classes

The following aspects should be considered:

- The partition comprises all values
- Values have to be assigned to the classes unequivocally
- The class width should be the same for all classes
- The mid-point of a class represents all values within the class
- The smaller the number of classes, the greater the class width and the greater the loss of information.
- The higher the number of classes, the more of the uninteresting random effects are apparent.

Empirical formula:

\[ k \approx \sqrt{n}, \qquad k \approx 5 \times \log_{10}(n) \]

where k is the number of classes and n the number of values.

Histogram

Partition into classes (Example)

Range: min = 3.7, max = 10.4
Span width: max − min = 10.4 − 3.7 = 6.7
k ≈ √86 = 9.27

Class width = 1.0 and k = 8 ⇒

Interval      Tally                            Frequency   Relative frequency
3.00-3.99     ///                              3           3.5%
4.00-4.99     ///// ///// /                    11          12.8%
5.00-5.99     ///// ///// ///// ///// ////     24          27.9%
6.00-6.99     ///// ///// ///// /////          20          23.3%
7.00-7.99     ///// ///// ///// ////           19          22.1%
8.00-8.99     /////                            5           5.8%
9.00-9.99     //                               2           2.3%
10.00-10.99   //                               2           2.3%
Total                                          86          100.0%
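
The class partition can be reproduced programmatically. The sketch below uses the 86 cholesterol values from the example and a class width of 1.0; the variable names and the output layout are illustrative choices.

```python
import math
from collections import Counter

# Serum cholesterol levels (mmol/l) of the 86 stroke patients.
cholesterol = [
    3.7, 3.8, 3.8, 4.4, 4.5, 4.5, 4.5, 4.7, 4.7, 4.8, 4.8, 4.9, 4.9,
    4.9, 5.0, 5.1, 5.1, 5.2, 5.3, 5.3, 5.4, 5.4, 5.5, 5.5, 5.5, 5.6,
    5.6, 5.6, 5.6, 5.6, 5.6, 5.6, 5.7, 5.7, 5.7, 5.8, 5.8, 5.9, 6.0,
    6.1, 6.1, 6.1, 6.1, 6.2, 6.3, 6.3, 6.4, 6.4, 6.4, 6.4, 6.4, 6.5,
    6.5, 6.6, 6.7, 6.7, 6.8, 6.8, 7.0, 7.0, 7.0, 7.0, 7.1, 7.1, 7.2,
    7.3, 7.4, 7.4, 7.5, 7.5, 7.6, 7.6, 7.6, 7.7, 7.8, 7.8, 7.8, 8.2,
    8.3, 8.6, 8.7, 8.9, 9.3, 9.5, 10.2, 10.4,
]

n = len(cholesterol)

# Empirical rules for the number of classes.
k_sqrt = math.sqrt(n)
k_log = 5 * math.log10(n)
print(f"n = {n}, sqrt(n) = {k_sqrt:.2f}, 5*log10(n) = {k_log:.2f}")

# Assign every value to exactly one class of width 1.0, starting at 3.0.
width = 1.0
lower = math.floor(min(cholesterol))
counts = Counter(int((x - lower) // width) for x in cholesterol)

for i in range(max(counts) + 1):
    start = lower + i * width
    k = counts.get(i, 0)
    print(f"{start:5.2f}-{start + width - 0.01:5.2f}  {k:3d}  {k / n:6.1%}")
```

Both empirical rules suggest about 9 to 10 classes; the slide settles on 8 classes because a class width of 1.0 gives convenient class limits.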


Histogram

Histograms of cholesterol levels from stroke patients

Histogram

Histograms with different numbers of classes

Histograms have to be area-accurate: if f_rel or f_abs is plotted, the class width has to be constant.

If the class widths differ, the frequency density should be plotted instead.

Frequency density histogram

Age group   Relative frequency (%)   Relative frequency (%) per year
0-4         25.3                     5.06
5-14        18.9                     1.89
15-44       30.3                     1.01
45-64       13.6                     0.68
65+         11.7                     0.33
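
The last column of the table is the relative frequency divided by the class width in years. The sketch below recomputes it; the upper limit of 99 years for the open class 65+ is an assumption, inferred only from the fact that 11.7 / 0.33 is roughly 35 years.

```python
# (label, lowest age, highest age, relative frequency in %)
age_classes = [
    ("0-4", 0, 4, 25.3),
    ("5-14", 5, 14, 18.9),
    ("15-44", 15, 44, 30.3),
    ("45-64", 45, 64, 13.6),
    ("65+", 65, 99, 11.7),  # assumption: open class closed at 99 years
]


def density_per_year(lowest, highest, rel_freq):
    """Relative frequency per year of class width (ages are inclusive)."""
    width = highest - lowest + 1
    return rel_freq / width


densities = {
    label: density_per_year(lowest, highest, rel_freq)
    for label, lowest, highest, rel_freq in age_classes
}

for label, density in densities.items():
    print(f"{label:<6}{density:6.2f} % per year")
```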

Stem-and-leaf diagram

The stem-and-leaf diagram provides a good summary of the data structure and is not as prone to errors as tally lists.

Frequency polygon

Frequency polygons are useful for comparisons.

Cumulative frequency histogram and empirical cumulative distribution function

Measures of central tendency

Arithmetic mean

\[ \bar{x} = \frac{1}{n}(x_1 + x_2 + \ldots + x_n) = \frac{1}{n}\sum_{i=1}^{n} x_i \]

where $n$ is the number of observations (degree of freedom) and $x_1, x_2, \ldots, x_n$ is the sample (observations)

Median
For ranked data $x_1 \le x_2 \le \ldots \le x_n$ the median $\tilde{x}$ is

\[ \tilde{x} = \begin{cases} x_{(n+1)/2} & \text{for odd } n \\ \frac{1}{2}\left(x_{n/2} + x_{n/2+1}\right) & \text{for even } n \end{cases} \]

Mode
The mode $x_{mod}$ is the most frequent observation.

It is the only measure of central tendency for nominal data. For continuous data it is represented by the center of the class with the most frequent observations within the histogram, and it can be used for bimodal data.
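
The three measures can be written down directly from their definitions. This is a minimal sketch with a small illustrative sample (the values are made up for the example); the hand-written median follows the odd/even case distinction above and is compared with Python's `statistics.median`.

```python
import statistics


def mean(values):
    """Arithmetic mean: sum of the observations divided by n."""
    return sum(values) / len(values)


def median(values):
    """Median of the ranked data, using the odd/even case distinction."""
    x = sorted(values)
    n = len(x)
    if n % 2 == 1:
        return x[(n + 1) // 2 - 1]            # 1-based index (n+1)/2
    return 0.5 * (x[n // 2 - 1] + x[n // 2])  # mean of x_(n/2) and x_(n/2+1)


# Illustrative sample, not taken from the slides.
sample = [4.9, 5.6, 3.7, 5.6, 7.0, 6.1, 5.6, 10.4]

print("mean  :", round(mean(sample), 4))
print("median:", median(sample))
print("mode  :", statistics.mode(sample))
```

The single large value 10.4 pulls the mean above the median, which illustrates why the median is preferred for skewed data.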

Measures of variability

Rank, rank list
The sample $x_1, x_2, \ldots, x_n$ sorted by the size of the values is $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$ and is called the rank list, where the indices $(1), \ldots, (n)$ are the ranks $R(x_i)$ of the values.

Range
Span width (range):

\[ r = x_{max} - x_{min} = x_{(n)} - x_{(1)} \]

Percentiles
The p% percentile ($Q_p$) means that p% of the values are smaller than or equal to the p% percentile. With p expressed as a fraction:

\[ Q_p = \begin{cases} x_{(k)} & n \times p \text{ is not an integer } (k = \mathrm{int}(n \times p) + 1) \\ \frac{1}{2}\left(x_{(k)} + x_{(k+1)}\right) & n \times p \text{ is an integer } (k = n \times p) \end{cases} \]

Measures of variability

Quartiles
1st quartile = $Q1 = Q_{25}$
2nd quartile = $Q2 = Q_{50}$ = median
3rd quartile = $Q3 = Q_{75}$

Interquartile range
$IQR = Q3 - Q1 = Q_{75} - Q_{25}$

Outlier detection
$x_i \ge Q_{75} + 1.5 \times IQR$ or $x_i \le Q_{25} - 1.5 \times IQR$ . . . mild outlier
$x_i \ge Q_{75} + 3.0 \times IQR$ or $x_i \le Q_{25} - 3.0 \times IQR$ . . . extreme outlier

This approach can be misleading for a small number of observations. There are also other methods for outlier detection and for the determination of quartiles, e.g.:

\[ Q_p = (1 - j) \times x_{(k+1)} + j \times x_{(k+2)}, \qquad k = \mathrm{int}((n-1) \times p), \quad j = (n-1) \times p - k \]
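
The percentile definition with the integer/non-integer case distinction and the IQR rule for outliers can be sketched as follows. The sample is illustrative, and the function names are not from the slides. Note that testing whether n × p is an integer with floating point numbers is only reliable for simple fractions such as 0.25, 0.5, and 0.75.

```python
import statistics


def percentile(values, p):
    """Q_p for a fraction p (0 < p < 1), using the two-case definition."""
    x = sorted(values)
    n = len(x)
    position = n * p
    if position != int(position):
        k = int(position) + 1          # n*p is not an integer
        return x[k - 1]                # x_(k) with 1-based rank k
    k = int(position)                  # n*p is an integer
    return 0.5 * (x[k - 1] + x[k])     # mean of x_(k) and x_(k+1)


def classify_outliers(values):
    """Label values outside the 1.5*IQR and 3.0*IQR fences."""
    q25, q75 = percentile(values, 0.25), percentile(values, 0.75)
    iqr = q75 - q25
    labels = {}
    for v in values:
        if v >= q75 + 3.0 * iqr or v <= q25 - 3.0 * iqr:
            labels[v] = "extreme"
        elif v >= q75 + 1.5 * iqr or v <= q25 - 1.5 * iqr:
            labels[v] = "mild"
    return labels


# Illustrative sample, not taken from the slides.
sample = [3.7, 4.9, 5.6, 5.6, 5.6, 6.1, 7.0, 10.4]

q25, q50, q75 = (percentile(sample, p) for p in (0.25, 0.5, 0.75))
print("Q25, Q50, Q75:", q25, q50, q75)
print("IQR:", round(q75 - q25, 2))
print("outliers:", classify_outliers(sample))
```

Statistical software often uses the interpolating definition given last on the slide (or a variant of it), so quartiles from different programs can differ slightly for small samples.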


Box-and-whiskers plot

Measures of variability

Variance

\[ s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 \]

Standard deviation

\[ s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2} \]

where $n$ is the number of observations and $n - 1$ corresponds to the degrees of freedom

Coefficient of variation

\[ CV = s / \bar{x} \]

indicates whether the variability is high or not (CV < 10% means low and CV > 25% means high variability).

Measures of variability

Standard error of the mean

\[ SE(\bar{x}) = s / \sqrt{n} \]

describes not the data, but the accuracy of the estimate.

The SE is sometimes used misleadingly.
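
The measures of variability above can be computed step by step. This is a minimal sketch on an illustrative sample (the values are made up); the function name `describe` is a hypothetical helper, and the result is cross-checked against Python's `statistics.variance`, which also divides by n − 1.

```python
import math
import statistics


def describe(values):
    """Variance, SD, CV, and SE of the mean, using n - 1 in the variance."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / (n - 1)
    sd = math.sqrt(variance)
    return {
        "mean": mean,
        "variance": variance,
        "sd": sd,
        "cv": sd / mean,           # coefficient of variation
        "se": sd / math.sqrt(n),   # standard error of the mean
    }


# Illustrative sample, not taken from the slides.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
summary = describe(sample)

for name, value in summary.items():
    print(f"{name:<9}{value:8.4f}")
```

The SD describes the spread of the observations and does not shrink with growing n, whereas the SE becomes smaller as the sample size increases.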

q-q-plot

Comparison of the sample quantiles with the quantiles of the normal distribution.

Normally distributed observations lie along a straight line.

Measures of shape

Skewness

\[ g_1 = \frac{m_3}{\sqrt{m_2^3}} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^3}{\sqrt{\left(\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^3}} \]

$g_1 = 0$ means the distribution is symmetrical, $g_1 > 0$ right-skewed, and $g_1 < 0$ left-skewed; $m_i$ is the i-th central moment.

Kurtosis

\[ g_2 = \frac{m_4}{m_2^2} - 3 = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^4}{\left(\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^2} - 3 \]

For the normal distribution $g_2 = 0$. If $g_2 > 0$ ($g_2 < 0$), more (fewer) values lie within the center of the distribution than for the normal distribution.
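
Both shape measures are ratios of central moments and can be sketched in a few lines. The samples below are illustrative and the function names are not from the slides.

```python
def central_moment(values, order):
    """m_order = (1/n) * sum((x_i - mean)^order)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** order for x in values) / n


def skewness(values):
    """g1 = m3 / sqrt(m2^3)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """g2 = m4 / m2^2 - 3 (zero for a normal distribution)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


# Illustrative samples, not taken from the slides.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 10]
left_skewed = [-x for x in right_skewed]

print("symmetric   :", skewness(symmetric), excess_kurtosis(symmetric))
print("right skewed:", skewness(right_skewed))
print("left skewed :", skewness(left_skewed))
```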

Transformations

Motivation
Most (parametric) statistical methods for analyzing continuous data assume a normal distribution.
To test for a normal distribution, the Shapiro-Wilk test and the q-q-plot can be used.
Another important assumption is that different groups of observations have the same standard deviation (or CV).
Transformations also reduce the influence of outlying values.

Transformations
Log (the most common transformation)
Square root
Reciprocal
Box-Cox (find the best transformation)
Rank

Log transformations

⇒ asymmetric confidence interval:

\[ CI = b^{\,\overline{\log x} \,\pm\, t \times \frac{s_{\log x}}{\sqrt{n}}} \]

where $b$ is the base of the logarithm, and $\overline{\log x}$ and $s_{\log x}$ are the mean and standard deviation of the log-transformed values.

Box-Cox-transformations

Define a function to find the best transformation:

\[ x' = \begin{cases} \dfrac{x^{\lambda} - 1}{\lambda} & \text{for } \lambda \neq 0 \\ \log(x) & \text{for } \lambda = 0 \end{cases} \]

For the logarithmic transformation $\lambda = 0$, square root $\lambda = \frac{1}{2}$, cube root $\lambda = \frac{1}{3}$, and reciprocal $\lambda = -1$.

The optimal $\lambda$ can be calculated from the likelihood function $L(\lambda)$.
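
The transformation itself is a one-line function per case. The sketch below only implements the definition; it does not estimate the optimal λ from the likelihood, and the function name `box_cox` is an illustrative choice. The natural logarithm is assumed for the case λ = 0.

```python
import math


def box_cox(x, lam):
    """Box-Cox transform of a single positive value x for a given lambda."""
    if x <= 0:
        raise ValueError("the Box-Cox transformation requires x > 0")
    if lam == 0:
        return math.log(x)       # assumption: natural logarithm
    return (x ** lam - 1) / lam


x = 4.0
for lam in (1.0, 0.5, 1e-6, 0.0, -1.0):
    print(f"lambda = {lam:>8}: x' = {box_cox(x, lam):.6f}")
```

For λ close to 0 the power branch approaches log(x), which is why the logarithm is the natural choice for the case λ = 0.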

Standardization

Standardization
For the analysis of multivariate data, a standardization is often desired. This is a normalization in which the mean becomes 0 and the standard deviation becomes 1.

\[ x'_i = \frac{x_i - \bar{x}}{s} \]

$x'_i$ is also called the z-score. The data are centered and the area under the normal distribution becomes 1. This is helpful for comparisons.

Ranging

\[ x'_i = \frac{x_i - x_{min}}{x_{max} - x_{min}} \]

E.g. for the construction of diagrams and figures
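
Both rescalings are simple element-wise operations. This is a minimal sketch on an illustrative sample; `standardize` and `ranging` are hypothetical helper names, and the sample standard deviation (with n − 1) is used for the z-scores.

```python
import statistics


def standardize(values):
    """z-scores: subtract the mean, divide by the sample SD."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # uses n - 1
    return [(x - mean) / sd for x in values]


def ranging(values):
    """Rescale the values linearly to the interval [0, 1]."""
    lowest, highest = min(values), max(values)
    return [(x - lowest) / (highest - lowest) for x in values]


# Illustrative sample, not taken from the slides.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

z = standardize(sample)
r = ranging(sample)
print("z-scores:", [round(v, 2) for v in z])
print("ranged  :", [round(v, 2) for v in r])
```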

Bivariate descriptive methods

Contingency table
nominal versus nominal (ordinal) scaled variable

         Light   Regular   Dark   Total
Male     20      40        50     110
Female   50      20        20     90
Total    70      60        70     200

Barplots

Bivariate descriptive methods

Boxplots

nominal versus metric scaled variable


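Diagnostic tests

\[ \text{Sensitivity (SN)} = \frac{TP}{TP + FN} = TPR \quad \text{(Recall)} \]

\[ \text{Specificity (SP)} = \frac{TN}{FP + TN} = 1 - FPR \]

\[ \text{Positive predictive value (PPV)} = \frac{TP}{FP + TP} \quad \text{(Precision)} \]

\[ \text{Negative predictive value (NPV)} = \frac{TN}{TN + FN} \]

\[ \text{Prevalence (observed in this study)} = \frac{TP + FN}{n} \]

\[ \text{Accuracy} = \frac{TP + TN}{n} \]

where TP, FP, TN, and FN are the numbers of true positives, false positives, true negatives, and false negatives, and $n = TP + FP + TN + FN$.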
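
All six measures follow from the four cells of the 2×2 table. The sketch below uses hypothetical counts (they are not from the slides) and an illustrative function name.

```python
def diagnostic_measures(tp, fp, fn, tn):
    """Standard measures of a diagnostic test from the 2x2 table counts."""
    n = tp + fp + fn + tn
    return {
        "sensitivity": tp / (tp + fn),   # true positive rate, recall
        "specificity": tn / (fp + tn),   # 1 - false positive rate
        "ppv": tp / (fp + tp),           # precision
        "npv": tn / (tn + fn),
        "prevalence": (tp + fn) / n,     # as observed in this sample
        "accuracy": (tp + tn) / n,
    }


# Hypothetical study: 100 diseased and 900 healthy persons.
measures = diagnostic_measures(tp=90, fp=45, fn=10, tn=855)

for name, value in measures.items():
    print(f"{name:<12}{value:6.3f}")
```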

Diagnostic tests

Consider the predictive ability of the test for the general population or for groups with a different prevalence of disease:

\[ P(D^+ \mid T^+) = \frac{P(T^+ \mid D^+) \times P(D^+)}{P(T^+)} = \frac{P(T^+ \mid D^+) \times P(D^+)}{P(T^+ \mid D^+) \times P(D^+) + P(T^+ \mid D^-) \times P(D^-)} \]

$P(D^+)$ = Prevalence (PREV)
$P(D^+ \mid T^+)$ = PPV
$P(T^+ \mid D^+)$ = SN
$P(T^+ \mid D^-)$ = 1 − SP

\[ PPV = \frac{SN \times PREV}{SN \times PREV + (1 - SP) \times (1 - PREV)} \]

\[ NPV = \frac{SP \times (1 - PREV)}{(1 - SN) \times PREV + SP \times (1 - PREV)} \]

Diagnostic tests

Likelihood ratio:

\[ LR^+ = \frac{P(T^+ \mid D^+)}{P(T^+ \mid D^-)} = \frac{SN}{1 - SP} \]

\[ LR^- = \frac{P(T^- \mid D^+)}{P(T^- \mid D^-)} = \frac{1 - SN}{SP} \]

post-test odds = pre-test odds × LR

\[ \frac{PPV}{1 - PPV} = \frac{PREV}{1 - PREV} \times \frac{SN}{1 - SP} \]

Solving for PPV gives

\[ PPV = \frac{SN \times PREV}{SN \times PREV + (1 - SP) \times (1 - PREV)} \]
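
The odds form and the Bayes form of the PPV are equivalent, which can be verified numerically. The sketch below uses hypothetical values for SN and SP and shows how strongly the PPV depends on the prevalence; all function names are illustrative.

```python
def ppv_bayes(sn, sp, prev):
    """PPV from sensitivity, specificity, and prevalence (Bayes' theorem)."""
    return sn * prev / (sn * prev + (1 - sp) * (1 - prev))


def npv_bayes(sn, sp, prev):
    """NPV from sensitivity, specificity, and prevalence (Bayes' theorem)."""
    return sp * (1 - prev) / ((1 - sn) * prev + sp * (1 - prev))


def ppv_from_odds(sn, sp, prev):
    """PPV via post-test odds = pre-test odds * LR+."""
    lr_positive = sn / (1 - sp)
    post_test_odds = prev / (1 - prev) * lr_positive
    return post_test_odds / (1 + post_test_odds)  # convert odds to probability


# Hypothetical test characteristics.
SN, SP = 0.90, 0.95

print(f"{'PREV':>6}{'PPV':>8}{'NPV':>8}")
for prev in (0.001, 0.01, 0.1, 0.5):
    print(f"{prev:>6}{ppv_bayes(SN, SP, prev):>8.3f}{npv_bayes(SN, SP, prev):>8.3f}")
```

With the same test, the PPV rises from below 2% at a prevalence of 0.1% to about 95% at a prevalence of 50%, while the NPV moves in the opposite direction.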

Receiver operating characteristics (ROC) curve

Discussion

Screening tests
Testing a healthy population for early signs of a rare, serious disease

High sensitivity and NPV

False negatives should be avoided; a moderate number of false positives is accepted

Diagnostic tests
E.g. testing high-risk individuals

High specificity and PPV

A false-positive diagnosis would have major consequences for the patient (e.g. HIV+)

Predictive values are strongly dependent on prevalence

The choice of the cut-off is not a statistical decision

Results should be repeatable, with minimal inter-observer variation

Method comparison for categorical data

Also applies to the agreement of categorical assessments by different observers

Columns: Observer A; rows: Observer B

          Normal   Benign   Suspect   Cancer   Total
Normal    21       12       0         0        33
Benign    4        17       1         0        22
Suspect   3        9        15        2        29
Cancer    0        0        0         1        1
Total     28       38       16        3        85

Observed agreement of frequencies

\[ p_o = \sum_{i=1}^{k} f_{ii} / n = (21 + 17 + 15 + 1)/85 = 0.64 \; (64\%) \]

Expected agreement of frequencies (by chance)

\[ p_e = \sum_{i=1}^{k} r_i c_i / n^2 = (33 \times 28 + 22 \times 38 + 29 \times 16 + 1 \times 3)/85^2 = 0.31 \; (31\%) \]

where $r_i$ and $c_i$ are the row and column totals.

Method comparison for categorical data

Measure of agreement κ

\[ \kappa = \frac{p_o - p_e}{1 - p_e} = \frac{0.64 - 0.31}{1 - 0.31} = 0.47 \]

(0.47 results from the unrounded values of $p_o$ and $p_e$.)

Guidelines to interpret κ

Value of κ   Strength of agreement
<0.20        Poor
0.21-0.40    Fair
0.41-0.60    Moderate
0.61-0.80    Good
0.81-1.00    Very good

The kappa statistic takes no account of the degree of disagreement

⇒ Weighted κ, with weights applied to the frequencies in each cell according to their distance from the diagonal:

\[ w_{ij} = 1 - \frac{|i - j|}{k - 1} \]
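
The kappa calculation for the 4×4 observer table can be sketched as follows. The unweighted case uses weight 1 on the diagonal and 0 elsewhere; the weighted case uses the linear weights from the formula above. Applying the weights to both the observed and the expected frequencies is the usual construction of weighted kappa and is an assumption here, since the slide only gives the weights; the function name is illustrative.

```python
# Agreement table of the two observers
# (categories: Normal, Benign, Suspect, Cancer).
table = [
    [21, 12, 0, 0],
    [4, 17, 1, 0],
    [3, 9, 15, 2],
    [0, 0, 0, 1],
]


def kappa(table, weighted=False):
    """Cohen's kappa; with weighted=True linear distance weights are used."""
    k = len(table)
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]

    def weight(i, j):
        if weighted:
            return 1 - abs(i - j) / (k - 1)
        return 1.0 if i == j else 0.0

    cells = [(i, j) for i in range(k) for j in range(k)]
    p_o = sum(weight(i, j) * table[i][j] for i, j in cells) / n
    p_e = sum(weight(i, j) * row_totals[i] * col_totals[j] for i, j in cells) / n**2
    return (p_o - p_e) / (1 - p_e)


print(f"kappa          = {kappa(table):.2f}")
print(f"weighted kappa = {kappa(table, weighted=True):.2f}")
```

For this table the weighted kappa comes out higher than the unweighted one, because most disagreements are between neighbouring categories.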


Kappa statistics for gene grouping

Method comparison studies

The aim is to see whether 2 (or more) methods (devices) agree well enough that they can be interchanged (e.g. quicker or cheaper methods).

The best approach is to analyze the differences between the measurements of the 2 methods on each subject (Bland JM, Altman DG. The Lancet, 1986)

Page 62: Biostatistics and Experimental Design - …genome.tugraz.at/biostatistics/biostat2009.pdf · Biostatistics and Experimental Design Hubert Hackl Institute for Genomics and Bioinformatics

Method comparison studies

About 95% of the observed differences are expected to lie within the range mean ± 2SD.

This range of values defines the 95% limits of agreement.

In case of variable agreement (wider scatter as the average increases) ⇒ log-transform
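The limits of agreement can be sketched in a few lines. The paired measurements below are hypothetical and serve only to illustrate the calculation; the limits are taken as mean ± 2SD of the differences, as stated above.

```python
import statistics

# Hypothetical paired measurements of the same subjects by two methods.
method_a = [10, 12, 14, 16, 18]
method_b = [11, 11, 15, 15, 19]

diffs = [a - b for a, b in zip(method_a, method_b)]
mean_diff = statistics.mean(diffs)   # systematic difference (bias) between the methods
sd_diff = statistics.stdev(diffs)    # sample SD of the differences (divides by n - 1)

# 95% limits of agreement as mean +/- 2 SD.
lower = mean_diff - 2 * sd_diff
upper = mean_diff + 2 * sd_diff
print(mean_diff, round(sd_diff, 3), round(lower, 2), round(upper, 2))
```

Whether limits of this width are acceptable is a clinical judgment, not a statistical one.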

Inappropriate use of correlation coefficient r and significance testing:

1. r measures the strength of a relation between 2 variables, not the agreement between them (perfect correlation if the points lie along any straight line).

2. A change in the scale of measurement does not affect the correlation, but it does affect the agreement.

3. Correlation depends on the range of the true quantity in the sample.

4. The test of significance will usually show that the two methods are related, which says nothing about how well they agree.

5. Data which seem to be in poor agreement can produce quite high correlations.

Method comparison studies

Repeatability

The repeatability of a method can be assessed by comparing repeated measurements made with one single method on a series of subjects.

Since the same method is used for the repeated measurements, the mean difference should be zero. The coefficient of repeatability (CR) is:

$$CR = 1.96 \times \sqrt{\frac{\sum_{i=1}^{n} (d_{2i} - d_{1i})^2}{n - 1}}$$

If more than 2 measurements ⇒ ANOVA

Measuring agreement using repeated measurements:

Take the difference of the means from each method. The SD has to be corrected:

$$SD_c = \sqrt{SD^2 + \left(\frac{SD_1}{2}\right)^2 + \left(\frac{SD_2}{2}\right)^2}$$

Error grid analysis

Comparison of blood glucose meters with the standard method (Beckman analyzer)


Example


Combinatorics

Permutations

For n different elements there are n! permutations.

For example, n = 3: ABC, ACB, BAC, BCA, CAB, CBA ⇒ 3! = 6 permutations.

For n objects in k groups, not distinguishable within a group, there are

$$\frac{n!}{n_1! \times n_2! \times \ldots \times n_k!}$$

permutations.

For example, 2 red balls, 3 green balls, and 7 blue balls ⇒

$$\frac{12!}{2! \times 3! \times 7!} = 7920$$

permutations.
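Both counts can be checked with the Python standard library. This is a small illustrative sketch; the helper name `multiset_permutations` is hypothetical.

```python
import math
from itertools import permutations

# n distinct elements: n! orderings, enumerated explicitly for "ABC".
orderings = ["".join(p) for p in permutations("ABC")]

def multiset_permutations(*group_sizes):
    """n! / (n1! * n2! * ... * nk!) for groups of indistinguishable objects."""
    total = math.factorial(sum(group_sizes))
    for size in group_sizes:
        total //= math.factorial(size)
    return total

n_balls = multiset_permutations(2, 3, 7)   # 2 red, 3 green, 7 blue balls
print(len(orderings), n_balls)
```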


Combinations

Combinations

A selection of k elements drawn from n elements (rather than all n, as for permutations) is called a combination.

Binomial coefficient

$$\binom{n}{k} = \frac{n!}{k!\,(n - k)!} = \frac{n \times (n - 1) \times (n - 2) \times \ldots \times (n - k + 1)}{1 \times 2 \times 3 \times \ldots \times k}$$

Without repetitions
Order does not matter: $\binom{n}{k}$
Order matters: $k! \times \binom{n}{k}$

With repetitions
Order does not matter: $\binom{n+k-1}{k}$
Order matters: $n^k$
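The four counting formulas can be compared against brute-force enumeration. The sketch below uses arbitrary illustrative values n = 5 and k = 3.

```python
import math
from itertools import (combinations, combinations_with_replacement,
                       permutations, product)

n, k = 5, 3          # illustrative values only
items = range(n)

# Each entry: (value from the formula, number of selections actually enumerated).
counts = {
    "no repetition, unordered": (math.comb(n, k),
                                 len(list(combinations(items, k)))),
    "no repetition, ordered": (math.factorial(k) * math.comb(n, k),
                               len(list(permutations(items, k)))),
    "repetition, unordered": (math.comb(n + k - 1, k),
                              len(list(combinations_with_replacement(items, k)))),
    "repetition, ordered": (n**k,
                            len(list(product(items, repeat=k)))),
}
for name, (formula, enumerated) in counts.items():
    print(name, formula, enumerated)
```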


Random experiments

All outcomes of the experiment are known in advance

However, it is not known a priori which outcome each performance of the experiment will produce:

- Systematic and random errors
- Complex processes, result of many combined processes

The experiment can be repeated under identical conditions.

Examples are tossing a coin, throwing a die, or the lifetime of a light bulb.

Sample space and event

Sample space
Collection of possible elementary outcomes from a random experiment.

Throwing a die: $\Omega = \{1, 2, 3, 4, 5, 6\}$
Lifetime of a bulb: $\Omega = [0, \infty)$
Diagnosis: $\Omega = \{\text{diseased}, \text{healthy}\}$
Body height: $\Omega = \mathbb{R}^+$

Event
A set of outcomes of the experiment.

$A = \{6\}$, $A = \{\text{tail}\}$, $A = \{\text{diseased}\}$, $A = \{\text{height} > 180\,\text{cm}\}$
$A = \Omega$ ... certain event
$A = \emptyset$ ... impossible event

Sigma-field S

A σ-field (σ-algebra) S is a non-empty collection of subsets of Ω that satisfies

1. $\emptyset \in S$
2. $A \in S \Rightarrow A^c \in S$
3. $A_i$ is a countable sequence of sets in S $\Rightarrow \bigcup_i A_i \in S$

Probability measure

The pair $(\Omega, S)$ is considered as the sample space associated with a statistical experiment.

A set function P defined on S is called a probability measure (or probability) if it satisfies the following conditions:

1. $P(A) \ge 0$ for all $A \in S$.
2. $P(\Omega) = 1$.
3. If $A_i \in S$ is a disjoint sequence of sets ($A_j \cap A_k = \emptyset$ for $j \ne k$), then

$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$$

P(A) is called the probability of event A.

The triple $(\Omega, S, P)$ is called a probability space.

Probability

For an experiment with k possible, equally probable outcomes:

$$P(A_1) = P(A_2) = \ldots = P(A_k) = \frac{1}{k}, \qquad \sum_{i=1}^{k} P(A_i) = 1.$$

If events are mutually exclusive, then the probability that any one of them occurs is the sum of the probabilities of the individual events:

$$P(A_1 \cup A_2 \cup \ldots \cup A_k) = P(A_1) + P(A_2) + \ldots + P(A_k) = \sum_{i=1}^{k} P(A_i).$$

If events are independent, then the probability that all of them occur is the product of the probabilities of the individual events:

$$P(A_1 \cap A_2 \cap \ldots \cap A_k) = P(A_1) \times P(A_2) \times \ldots \times P(A_k) = \prod_{i=1}^{k} P(A_i).$$

Conditional probability

For 2 arbitrary events A and B:

$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$

$$P(A^c) = 1 - P(A)$$

What is the probability of event A given B?

$$P(A|B) = P(A \cap B)/P(B)$$

What is the probability of event B given A?

$$P(B|A) = P(A \cap B)/P(A)$$

Bayes Theorem

$$P(A|B) = \frac{P(B|A) \times P(A)}{P(B)}$$

Example 1
Are women promoted less often than men?

Out of 200 promotions, only 4 went to women (2%). In total, 40 women and 3270 men applied. (P: promoted, F: female, M: male.)

$$P(P|F) = \frac{P(F|P) \times P(P)}{P(F)} = \frac{0.02 \times \frac{200}{3270+40}}{\frac{40}{3270+40}} = 0.1 = 10\%$$

$$P(P|M) = \frac{P(M|P) \times P(P)}{P(M)} = \frac{0.98 \times \frac{200}{3270+40}}{\frac{3270}{3270+40}} = 0.0599 = 6\%$$
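The arithmetic of Example 1 can be retraced step by step. The variable names in this sketch are illustrative; the numbers are those given in the example.

```python
promotions, promoted_women = 200, 4
women, men = 40, 3270
applicants = women + men

p_promoted = promotions / applicants                # P(P)
p_f_given_promoted = promoted_women / promotions    # P(F|P) = 0.02
p_female = women / applicants                       # P(F)

# Bayes' theorem for each group.
p_promoted_given_f = p_f_given_promoted * p_promoted / p_female
p_promoted_given_m = (1 - p_f_given_promoted) * p_promoted / (1 - p_female)
print(round(p_promoted_given_f, 4), round(p_promoted_given_m, 4))
```

Although only 2% of the promotions went to women, the probability of promotion given the applicant's sex is higher for women, because so few women applied.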


Bayes Theorem

$$P(B) = P(A \cap B) + P(A^c \cap B) = P(B|A) \times P(A) + P(B|A^c) \times P(A^c)$$

$$P(A|B) = \frac{P(B|A) \times P(A)}{P(B|A) \times P(A) + P(B|A^c) \times P(A^c)} = \frac{P(B|A) \times P(A)}{\sum_{i=1}^{n} P(B|A_i) \times P(A_i)}$$

Example 2
A Briton was jailed in 1990 for 16 years based on a random DNA match with a probability of 1 in $3 \times 10^6$, according to experts.

Suppose there are 10000 people in the DNA database. Then the probability that the suspect is innocent given a DNA match (that is what we want to know) can be calculated using Bayes' theorem:

$$P(I|M) = \frac{P(M|I) \times P(I)}{P(M)} = \frac{\frac{1}{3 \times 10^6} \times \frac{9999}{10000}}{\frac{1}{3 \times 10^6} \times \frac{9999}{10000} + \frac{1}{10000}} = 0.0033$$

$$P(M|I) = \frac{1}{3000000}, \quad \text{whereas} \quad P(I|M) \approx \frac{3}{1000}$$
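Example 2 can be sketched in the same way. The denominator of the slide's formula implies two assumptions that are made explicit here: exactly one person in the database is the true source, and the true source matches with probability 1.

```python
p_match_given_innocent = 1 / 3e6   # random match probability quoted by the experts
database_size = 10000

p_guilty = 1 / database_size       # assumption: one true source among 10000 people
p_innocent = 1 - p_guilty
p_match_given_guilty = 1.0         # assumption: the true source always matches

# Law of total probability for the denominator, then Bayes' theorem.
p_match = (p_match_given_innocent * p_innocent
           + p_match_given_guilty * p_guilty)
p_innocent_given_match = p_match_given_innocent * p_innocent / p_match
print(round(p_innocent_given_match, 4))
```

The probability of innocence given a match is about a thousand times larger than the random match probability, which is why the two must not be confused.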


Likelihood function

$$\underbrace{P(B|A)}_{\text{posterior (probability)}} = \frac{\overbrace{P(A|B)}^{\text{likelihood}} \times \overbrace{P(B)}^{\text{prior (probability)}}}{\underbrace{P(A)}_{\text{evidence, normalizing factor}}} \propto \underbrace{L(B|A)}_{\text{likelihood of B given fixed A}} \times \underbrace{P(B)}_{\text{prior}}$$

Consider a model which gives the probability density function of an observable random variable vector X as a function of a parameter θ (in general a parameter vector). Then for specific values $x_1, ..., x_n$ of X (a given realization), the function

$$L(\theta|x_1, ..., x_n) = f(x_1, ..., x_n|\theta)$$

is a likelihood function of θ. The likelihood function has the same functional form as a probability density function. However, the emphasis is changed from the x to the θ: the pdf is a function of the x's while holding the parameters θ constant, whereas L is a function of the parameters θ while holding the x's constant.

Likelihood ratio

Bayes theorem
The Bayes theorem can also be written in terms of a likelihood ratio and odds:

$$O(A|B) = O(A) \times \Lambda(A|B)$$

where $\Lambda(A|B)$ is the likelihood ratio,

$$O(A|B) = \frac{P(A|B)}{P(A^c|B)}$$

are the odds of A given B, and

$$O(A) = \frac{P(A)}{P(A^c)}$$

are the odds of A.

Likelihood ratio

$$\Lambda(A|B) = \frac{L(A|B)}{L(A^c|B)} = \frac{P(B|A)}{P(B|A^c)}$$

Maximum likelihood estimation

An estimator $\hat\theta(X)$ for θ that maximizes $L(\theta|x_1, ..., x_n)$ and therefore satisfies

$$L(\hat\theta|x_1, ..., x_n) = \sup_{\theta \in \Theta} L(\theta|x_1, ..., x_n)$$

is called a maximum likelihood estimator (MLE).

Since products of probabilities are very small, it is convenient to work with the logarithm of the likelihood function. The logarithm is a monotone function, therefore

$$\log L(\hat\theta|x_1, ..., x_n) = \sup_{\theta \in \Theta} \log L(\theta|x_1, ..., x_n).$$

If $\hat\theta$ exists, it must satisfy the likelihood equations

$$\frac{\partial \log L(\hat\theta|x_1, ..., x_n)}{\partial \theta_j} = 0, \quad j = 1, 2, ..., k, \quad \theta = (\theta_1, ..., \theta_k).$$

Maximum likelihood

If $X_1, X_2, ..., X_n$ are independent and identically distributed (i.i.d.) with probability density function (PDF) or probability mass function (PMF) f, the likelihood function can be calculated as:

$$L(\theta|x_1, ..., x_n) = \prod_{i=1}^{n} f(x_i|\theta)$$

and the log-likelihood function as:

$$\log L(\theta|x_1, ..., x_n) = \log\left(\prod_{i=1}^{n} f(x_i|\theta)\right) = \sum_{i=1}^{n} \log f(x_i|\theta)$$

For example, linear regression:

$$\log L(y = ax + b \mid x) = \sum_{i=1}^{n} \log f(ax + b)$$

Random variable

The probability measure P is a set function and hence difficult to workwith.

Let $(\Omega, S)$ be a sample space. A random variable X is defined as a finite, single-valued function that maps Ω into $\mathbb{R}$ such that the inverse images under X of all Borel sets in $\mathbb{R}$ are events, that is,

$$X: \Omega \to \mathbb{R}, \qquad X^{-1}(B) = \{\omega : X(\omega) \in B\} \in S \quad \text{for all Borel sets } B \subset \mathbb{R}$$

In short, a random variable (r.v.) is a function that assigns a real number to the outcome of a random experiment.

The resulting value (X = x) is called a realization of the random variable X.

Discrete random variable

A discrete random variable can take a countable number of predetermined values.

Examples
Tossing a coin, throwing a die, or the number of cars crossing a line during a certain time interval.

Probability mass function (PMF)
For discrete random variables the mass function determines the probability of each element of the sample space.

$$f(x_i) = P[X = x_i]$$

Continuous random variable

Continuous random variables can take any real value.

Probability density function (PDF)
A probability density function is a function f(x) that describes the probability density in terms of the input variable x and satisfies

1. $P[a \le X \le b] = \int_a^b f(x)\,dx$,

2. $f(x) \ge 0, \ \forall x \in \mathbb{R}$,

3. $\int_{-\infty}^{\infty} f(x)\,dx = 1$.

The histogram is an estimator for the probability density function.

Distribution function

$$F(x) = P(X \le x) = \begin{cases} \int_a^x f(u)\,du & \text{continuous r.v.} \\ \sum_{x_i = a}^{x} P(X = x_i) & \text{discrete r.v.} \end{cases}$$

where a is the smallest value that the r.v. can take.

Distribution function

Properties of the distribution function

1. $\lim_{x \to -\infty} F(x) = 0$; $\lim_{x \to +\infty} F(x) = 1$

2. $x < y \Rightarrow F(x) \le F(y)$

3. F(x) is continuous from the right: $F(x + h) \to F(x)$ as $h \to 0^+$

Probability and distribution function

$$P(X > x) = 1 - F(x)$$

$$P(x < X \le y) = F(y) - F(x)$$

Measures for the distribution function and r.v.

Expectation

$$\mu = E[X] = \int_{-\infty}^{\infty} x f(x)\,dx$$

Variance

$$\sigma^2 = E[(X - E[X])^2] = E[X^2] - (E[X])^2 = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx$$

Standard deviation

$$\sigma = \sqrt{E[(X - E[X])^2]}$$

Covariance

$$Cov(X, Y) = E[(X - E[X])(Y - E[Y])]$$

Correlation

$$\rho = \frac{Cov(X, Y)}{\sigma_x \sigma_y}$$

Normal distribution

The PDF of the normal distribution with parameters µ and σ (N(µ, σ); also called Gaussian distribution) is:

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$

Normal distribution

Effects of different σ and µ on the PDF of the normal distribution:


Standard normal distribution

A variable that has a normal distribution with mean µ = 0 and variance σ² = 1 is called the standard normal variate and is commonly designated by the letter Z.

$$Z = \frac{X - \mu}{\sigma} \sim N(0; 1)$$

Standard normal distribution

The cumulative distribution function can be calculated as follows:

$$F(x) = \int_{-\infty}^{x} f(u)\,du = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{x} e^{-\frac{(u - \mu)^2}{2\sigma^2}}\,du$$

For the standard normal distribution with µ = 0 and σ² = 1:

$$\Phi(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-\frac{u^2}{2}}\,du = \frac{1}{2}\left[1 + \mathrm{erf}\left(\frac{z}{\sqrt{2}}\right)\right]$$

Standard normal distribution and probability

Since the area under the standard normal density is 1, a probability corresponds to the area under the curve within the given range of z:

$$P(Z \le z) = \Phi(z)$$

$$P(-0.56 \le Z \le 2.00) = \Phi(2.00) - (1 - \Phi(0.56)) = 0.6895$$

$$P(-2.00 \le Z \le 2.00) = 2 \times \Phi(2.00) - 1 = 0.9545$$
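The two probabilities can be reproduced from the error-function form of Φ(z). This is a minimal sketch using only `math.erf` from the standard library; the function name `phi` is illustrative.

```python
import math

def phi(z):
    """Standard normal CDF, Phi(z) = 0.5 * (1 + erf(z / sqrt(2)))."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

p1 = phi(2.00) - (1 - phi(0.56))   # P(-0.56 <= Z <= 2.00)
p2 = 2 * phi(2.00) - 1             # P(-2.00 <= Z <= 2.00)
print(round(p1, 4), round(p2, 4))
```

The first expression uses the symmetry Φ(−z) = 1 − Φ(z) to avoid evaluating Φ at a negative argument.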


Central limit theorem

Let $X_1, X_2, ..., X_n$ be independent and identically distributed (i.i.d.) with mean $\mu_i = \mu$ and variance $\sigma_i^2 = \sigma^2$.

Then

$$X_{norm} = \frac{\sum_{i=1}^{n} X_i - \sum_{i=1}^{n} \mu_i}{\sqrt{\sum_{i=1}^{n} \sigma_i^2}}$$

has a cumulative distribution function which approaches that of the standard normal distribution (∼ N(0; 1)) for large n.

⇒ importance of the normal distribution
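A small simulation illustrates the theorem. The sketch below standardizes sums of uniform random numbers, which are clearly not normally distributed themselves; the choices of n = 30 summands, 5000 repetitions and the seed are arbitrary assumptions.

```python
import math
import random

rng = random.Random(1)
n, reps = 30, 5000
mu, var = 0.5, 1.0 / 12.0      # mean and variance of the Uniform(0, 1) distribution

def standardized_sum():
    """(sum of X_i - n * mu) / sqrt(n * var) for n uniform draws."""
    total = sum(rng.random() for _ in range(n))
    return (total - n * mu) / math.sqrt(n * var)

z_values = [standardized_sum() for _ in range(reps)]

# For a standard normal variate, about 95% of values lie within +/- 1.96.
inside = sum(abs(z) <= 1.96 for z in z_values) / reps
print(inside)
```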


Binomial distribution

The simplest probability distribution for discrete data arises when there are only 2 possibilities. The probability of being in blood group B is 0.08, so the probability of being in group O, A, or AB is 0.92.

For 2 unrelated people, the probability that both are in blood group B is 0.08 × 0.08 = 0.0064.

Person 1   Person 2   Number in B   Probability
B          B          2             0.08 × 0.08 = 0.0064
¬B         B          1             0.92 × 0.08 = 0.0736
B          ¬B         1             0.08 × 0.92 = 0.0736
¬B         ¬B         0             0.92 × 0.92 = 0.8464

$$f(k; n, p) = \binom{n}{k} p^k (1 - p)^{n-k}, \qquad \mu = np, \qquad \sigma = \sqrt{np(1 - p)}$$
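The table can be reproduced from the binomial formula with n = 2 and p = 0.08. In this sketch the function name `binomial_pmf` is illustrative; note that the two rows with one person in group B add up to a single probability for k = 1.

```python
import math

def binomial_pmf(k, n, p):
    """f(k; n, p) = C(n, k) * p^k * (1 - p)^(n - k)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

p_b = 0.08   # probability of blood group B, from the text
probs = [binomial_pmf(k, 2, p_b) for k in range(3)]   # k = 0, 1, 2 people in group B
print([round(q, 4) for q in probs])
```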


Binomial distribution


Binomial versus hypergeometric distribution

Binomial distribution
Probability distribution of the number of successes in a sequence of n independent yes/no experiments (with replacement), each of which yields success with probability p.

$$f(k; n, p) = \binom{n}{k} p^k (1 - p)^{n-k}$$

For n = 1 it is identical to the Bernoulli distribution.

Hypergeometric distribution
Probability distribution that describes the number of successes k in a sequence of n draws from a finite population N without replacement.

$$f(k; N, m, n) = \frac{\binom{m}{k}\binom{N-m}{n-k}}{\binom{N}{n}}$$

In a drawing experiment, the finite population N consists, e.g., of m white marbles and N − m black marbles.

Example

If a coin is tossed twice in succession (heads or tails), then (T = true, F = false)

a) the expected number of tails is 1.5
Binomial distribution: n = 2; p = 0.5; E = n × p = 2 × 0.5 = 1 ⇒ F

b) the probability of two tails is 0.25
n = 2; p = 0.5; k = 2: $P = \binom{2}{2} \times 0.5^2 \times (1 - 0.5)^{2-2} = 0.25$ ⇒ T

c) the number of tails follows a binomial distribution ⇒ T

d) the number of heads follows a hypergeometric distribution ⇒ F (see c)

e) the probability of at least one tail is 0.5
$P = \binom{2}{2} \times 0.5^2 \times (1 - 0.5)^{2-2} + \binom{2}{1} \times 0.5^1 \times (1 - 0.5)^{2-1} = 0.75$ ⇒ F

f) the distribution of the number of heads is symmetric.
P(0) = 0.25; P(1) = 0.5; P(2) = 0.25 ⇒ T

Over-representation analysis


Poisson distribution

Another discrete probability distribution is the Poisson distribution. It describes events that occur over time (or space) at a fixed average rate, where each event occurs independently and at random.
For example, the daily number of new registrations of cancer may be 2.2 on average, but on any given day there may be no cases or there may be several.

$$f(k; \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}, \qquad \mu = \lambda, \qquad \sigma = \sqrt{\lambda}$$

Examples are:

- The number of phone calls at a call center per minute.
- The number of mutations in a given stretch of DNA after a certain amount of radiation.
- The number of light bulbs that burn out in a certain time interval.
- The number of cars that pass through a certain point on a road (distant from traffic lights) during a given period of time.
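For the cancer registration example with λ = 2.2, the probability of a day without cases and of a day with several cases can be sketched as follows. The cut-off of five or more cases is an arbitrary illustration, and `poisson_pmf` is a hypothetical helper name.

```python
import math

def poisson_pmf(k, lam):
    """f(k; lambda) = lambda^k * exp(-lambda) / k!"""
    return lam**k * math.exp(-lam) / math.factorial(k)

lam = 2.2   # average daily number of new registrations, from the example above
p_none = poisson_pmf(0, lam)
p_five_or_more = 1 - sum(poisson_pmf(k, lam) for k in range(5))
print(round(p_none, 4), round(p_five_or_more, 4))
```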


Poisson distribution


Other distributions

Test distributions
χ², t, F, ...

Mathematically derived distributions
Exponential, Gamma, Beta, Cauchy, log-normal, logistic, uniform, Weibull, ...

Extended binomial distributions
Bernoulli, negative binomial, geometric, hypergeometric, multinomial, ...

Multinomial, Beta, and Dirichlet distribution


Parameter estimation and confidence interval

Aims

- Estimation of parameters of the relevant population by the statistics of the sample distribution.
- Measures of uncertainty and quality of these estimations and specification of a confidence interval.

To be valid, the sample must be representative of the population. To quantify the strength of the evidence or its uncertainty, the characteristics of the sampling distributions are useful (e.g. properties of the distribution of the means of random samples).

Sampling distributions

The variability of sample means of many random samples of a given size from the population

- is less among the means of large samples than small samples
- is less than the variability of the individual observations in the population
- increases with greater variability (standard deviation) among the individual values

The distribution of sample means will be nearly normal whatever the distribution of the variable in the population, as long as the samples are large enough.

Distribution of means from random sampling


Standard error of sample mean

The standard deviation of many sample means will be:

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$

where σ is the standard deviation of the variable in the population and n is the size of each sample.

We can estimate the standard error of the (population) mean (SEM) from a single sample using the observed standard deviation s in that sample:

$$SEM = \frac{s}{\sqrt{n}}$$

The standard error of the mean is often abbreviated to standard error (SE). The standard error measures the quality of the estimate of the population mean. The SE can be used to construct a confidence interval.

Standard error

Standard error of the difference between two sample means

$$SE(\bar{x}_1 - \bar{x}_2) = \sqrt{[SE(\bar{x}_1)]^2 + [SE(\bar{x}_2)]^2} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

Standard error of a sample proportion
From the binomial distribution we know the standard deviation of the count; dividing by n gives the standard error of the proportion:

$$s = \sqrt{np(1 - p)} \Rightarrow SE = \frac{s}{n} = \sqrt{\frac{p(1 - p)}{n}}$$

This will be true only for large samples (np > 5 and n(1 − p) > 5).

Standard error of the difference between two proportions

$$SE(p_1 - p_2) = \sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}$$

Confidence interval

A (1 − α) confidence interval $[\theta_l, \theta_u]$ is a random interval which includes the unknown, true value θ with a probability of at least 1 − α.

$$P[\theta_l \le \theta \le \theta_u] \ge 1 - \alpha$$

By convention α = 0.05, but other values can be chosen.

$$\bar{x} - t_{1-\alpha/2} \frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{1-\alpha/2} \frac{s}{\sqrt{n}}$$

for a normally distributed population, where $t_{1-\alpha/2}$ is the quantile of the t-distribution with n − 1 degrees of freedom.

Student's t-distribution

If $X_1, ..., X_n$ are independent and N(0,1) distributed, then

$$t = \frac{\bar{x}}{s/\sqrt{n}}$$

is t-distributed with n − 1 degrees of freedom.

With few degrees of freedom (small n) the t-distribution differs considerably from the normal distribution.
If the number of degrees of freedom is high, the t-distribution approximates the standard normal distribution.

Confidence interval


Confidence interval for relative frequencies

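If $X_1, ..., X_n$ are independent binary variables (0, 1) with parameter $p = P[x_i = 1]$, then $k = \sum_{i=1}^{n} x_i$ is binomially distributed, and p is estimated by $\hat{p} = \frac{k}{n}$.

The (1 − α) confidence interval is $[p_l, p_u]$ with

$$p_l = \frac{k}{k + (n - k + 1) F^*_{1-\alpha/2}}, \qquad p_u = \frac{(k + 1) F_{1-\alpha/2}}{n - k + (k + 1) F_{1-\alpha/2}}$$

where $F^*_{1-\alpha/2}$ and $F_{1-\alpha/2}$ are quantiles of F-distributions.

For large n,

$$z = \frac{k - np}{\sqrt{np(1 - p)}}$$

is approximately N(0,1) distributed and the confidence interval is:

$$\hat{p} - z_{1-\alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} \le p \le \hat{p} + z_{1-\alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$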
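The large-sample interval can be sketched with the standard library. As an illustration, the counts reuse the observed agreement from the kappa example (k = 54 agreements in n = 85 cases); the exact F-based interval is not computed here because it needs F quantiles, which the standard library does not provide.

```python
import math
from statistics import NormalDist

k, n = 54, 85      # agreements out of 85 cases, reused from the kappa example
alpha = 0.05

p_hat = k / n
z = NormalDist().inv_cdf(1 - alpha / 2)               # z_{1 - alpha/2}, about 1.96
half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)   # z times the standard error
lower, upper = p_hat - half_width, p_hat + half_width
print(round(lower, 3), round(upper, 3))
```

The normal approximation is reasonable here because both n·p̂ = 54 and n·(1 − p̂) = 31 are well above 5.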


Parameter estimation

We want to have good estimators for different parameters of the population distribution (θ = µ, θ = σ²).

An estimator $\hat\theta$ of a parameter θ should

- for large n approach θ and
- for large n follow a normal distribution (central limit theorem).

These properties are satisfied most of the time, and we want quantitative criteria. The estimation error $(\hat\theta - \theta)$ should be minimal:

1. Unbiasedness: $E[\hat\theta - \theta] = 0$ (this expectation is the bias)
2. Minimal variance: $Var[\hat\theta] \to$ minimal
3. Consistency: $\lim_{n \to \infty} P(|\hat\theta_n - \theta| < \varepsilon) = 1$ ($\varepsilon > 0$)
4. Robustness: not unduly affected by outliers

Maximum likelihood estimation

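Parameter estimation for normal distribution

$$L(\theta|x) = \prod_{i=1}^{n} f(x_i|\theta) = \frac{1}{(\sigma\sqrt{2\pi})^n} e^{-\frac{1}{2\sigma^2} \sum (x_i - \mu)^2}$$

$$\log L = -n \log \sigma - \frac{n}{2} \log 2\pi - \frac{1}{2\sigma^2} \sum (x_i - \mu)^2$$

$$\frac{d(\log L)}{d\mu} = 0 = \frac{1}{\sigma^2} \sum (x_i - \mu) = \frac{1}{\sigma^2}\left(\sum x_i - n\mu\right) \Leftrightarrow \hat\mu = \frac{\sum x_i}{n}$$

$$\frac{d(\log L)}{d\sigma} = 0 = -\frac{n}{\sigma} + \frac{\sum (x_i - \mu)^2}{\sigma^3} \Leftrightarrow \hat\sigma^2 = \frac{\sum (x_i - \mu)^2}{n}$$

Ordinary least squares (OLS) is a special case of the maximum likelihood method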
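The closed-form estimates can be checked numerically. The sketch below uses a hypothetical sample, evaluates the log-likelihood from the expression above, and confirms that no nearby parameter pair on a small grid does better than the closed-form solution.

```python
import math
import statistics

# Hypothetical sample, used only to illustrate the formulas.
x = [4.9, 5.6, 5.1, 6.2, 5.8, 5.0, 5.4]
n = len(x)

def log_likelihood(mu, sigma):
    """log L = -n log(sigma) - n/2 log(2 pi) - sum((x_i - mu)^2) / (2 sigma^2)."""
    ss = sum((xi - mu) ** 2 for xi in x)
    return -n * math.log(sigma) - n / 2 * math.log(2 * math.pi) - ss / (2 * sigma**2)

# Closed-form maximum likelihood estimates (note the division by n, not n - 1).
mu_hat = sum(x) / n
sigma_hat = math.sqrt(sum((xi - mu_hat) ** 2 for xi in x) / n)

best = log_likelihood(mu_hat, sigma_hat)
nearby = [log_likelihood(mu_hat + dm, sigma_hat + ds)
          for dm in (-0.05, 0.0, 0.05) for ds in (-0.05, 0.0, 0.05)]
print(mu_hat, sigma_hat, best >= max(nearby))
```

Because the ML estimate of σ² divides by n, it is slightly smaller than the usual sample variance, which divides by n − 1.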


Thumbnail example

$X = \{x_1, ..., x_n\}$, where $x_t \in \{0, 1\}$

Binomial distribution: $P(X|\Theta) = \binom{n}{k} \Theta^k (1 - \Theta)^{n-k}$ ... Likelihood

Maximum likelihood estimation

$$P(X|\Theta) = \binom{n}{k} \Theta^k (1 - \Theta)^{n-k}$$

$$\log P(X|\Theta) = k \log \Theta + (n - k) \log(1 - \Theta) + C$$

$$\frac{d}{d\Theta} \log P(X|\Theta) = \frac{k}{\Theta} - \frac{n - k}{1 - \Theta} = 0 \Rightarrow \hat\Theta = \frac{k}{n}$$

Since the data X are usually subject to random fluctuations and intrinsic uncertainty, repeating the whole process of data collection and parameter estimation under identical conditions will mostly lead to slightly different results.

⇒ If we are able to repeat the data-generating process several times, we will get a distribution of parameter estimates $\hat\Theta$, from which we can infer the intrinsic uncertainty of the estimation process.

Distribution of parameter estimate Θ

The probability of k observations of heads in a sample of size n is given by

$$P(k) = \binom{n}{k} \Theta^k (1 - \Theta)^{n-k}$$

$$k = n\hat\Theta \Rightarrow P(k) = C \binom{n}{n\hat\Theta} \Theta^{n\hat\Theta} (1 - \Theta)^{n(1 - \hat\Theta)}$$

In more complicated situations analytical solutions are usually not available ⇒ computational procedure called bootstrapping.

Frequentist versus Bayesian paradigm


Bayesian approach

$$\underbrace{P(\Theta|X)}_{\text{posterior probability}} \propto \underbrace{P(X|\Theta)}_{\text{likelihood}} \, \underbrace{P(\Theta)}_{\text{prior probability}}$$

We want to compute the posterior probability from the likelihood and the prior probability.

It is mathematically convenient to choose a functional form that is invariant with respect to the transformation (see above), that is, for which the prior and the posterior probability are in the same function family (conjugate).

The conjugate prior of the binomial distribution is the beta distribution:

$$P(\Theta|X) \propto \Theta^{k+\alpha-1} (1 - \Theta)^{N-k+\beta-1}$$

$$P(\Theta|X) = B(\Theta|k + \alpha, N - k + \beta)$$

Comparison of frequentist and Bayesian approach

Maximum a posteriori (MAP) estimate: $\Theta_{MAP} = \arg\max_\Theta P(\Theta|X)$

Maximum likelihood (ML) estimate: $\Theta_{ML} = \arg\max_\Theta P(X|\Theta)$

$N \to \infty \Rightarrow \Theta_{MAP} \to \Theta_{ML}$

Suppose you are only allowed to toss the thumbnail a few times.

You can use prior knowledge, e.g. the torque acting on the falling thumbnail, as derived by theoretical physicists.

If you are allowed to toss the thumbnail arbitrarily often, the data will "speak for itself", and including any prior knowledge no longer makes any difference to the prediction.
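The convergence of the MAP estimate to the ML estimate can be sketched for a beta prior. With a Beta(k + α, N − k + β) posterior, the MAP estimate is the posterior mode, (k + α − 1)/(N + α + β − 2), valid when both posterior parameters exceed 1. The prior parameters and the observed head frequency below are hypothetical.

```python
def theta_ml(k, n):
    """Maximum likelihood estimate k / n."""
    return k / n

def theta_map(k, n, alpha, beta):
    """Mode of the Beta(k + alpha, n - k + beta) posterior."""
    return (k + alpha - 1) / (n + alpha + beta - 2)

alpha, beta = 5.0, 5.0     # hypothetical prior, centred on 0.5
head_rate = 0.3            # hypothetical observed frequency of heads

results = []
for n in (10, 100, 10000):
    k = round(head_rate * n)
    results.append((n, theta_ml(k, n), theta_map(k, n, alpha, beta)))
    print(n, results[-1][1], round(results[-1][2], 4))
```

With 10 tosses the prior pulls the estimate noticeably towards 0.5; with 10000 tosses the two estimates are practically identical.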


Comparison of frequentist and Bayesian approach

The main difference between the frequentist and the Bayesian approach is the interpretation of Θ:

The frequentist statistician interprets Θ as a fixed parameter and aims to estimate it with a point estimate, typically adopting the maximum likelihood approach.

The Bayesian statistician interprets Θ as a random variable and tries to infer its whole posterior distribution, P(Θ|X).

For the derivation of P(Θ|X) in complex inference problems, a powerful computational approach called Markov chain Monte Carlo (MCMC) can be used (the Bayesian counterpart to the frequentist's bootstrap approach).


Parameter-free estimation

Parameter-free means that no assumptions are made about the form of the population distribution; instead, the data (sample) and its distribution are used.

We are not interested in the parameters per se, but want to test a hypothesis or to know the quality of a prediction based on the data.

In both cases, re-sampling methods allow us to quantify the performance of the estimation.


Bootstrap

The idea of the bootstrap is to randomly sample n times with replacement from the n original data points (that is, from the empirical distribution of the original data) and to compute the statistic of interest, e.g. the median, for each such bootstrap sample.

If this procedure is repeated often (e.g. 1000 times), the distribution of the medians should approximate a normal distribution, and the mean and variance of the medians can be calculated.

The 95% confidence interval can be derived from the sorted bootstrap estimates (with 1000 repetitions, the 25th and 975th values).
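A minimal sketch of a percentile bootstrap interval for the median, using only the Python standard library. The function name, the fixed seed and the example data are assumptions made for illustration.

```python
import random
import statistics


def bootstrap_median_ci(data, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the median."""
    rng = random.Random(seed)
    n = len(data)
    # Resample n points with replacement and record the median each time.
    medians = sorted(
        statistics.median(rng.choices(data, k=n)) for _ in range(n_boot)
    )
    # With 1000 resamples and level 0.95 these are the 25th and 975th values.
    lo_rank = max(1, round(n_boot * (1 - level) / 2))
    hi_rank = round(n_boot * (1 + level) / 2)
    return medians[lo_rank - 1], medians[hi_rank - 1]


data = list(range(1, 21))  # hypothetical sample
print(bootstrap_median_ci(data))
```

The same function works for any other statistic if `statistics.median` is replaced accordingly.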


Permutation test

The permutation test (randomization test) is similar to the bootstrap, except that the re-sampling is done without replacement.

As an example, consider the question whether genes that are active in a specific condition tend to be adjacent within the genome. For this purpose the positions within the genome were permuted 10000 times and the number of adjacent active genes was counted each time.

As a measure for the test, the z-score or the p-value (that is, the fraction of the rearrangements whose counts are as extreme as or more extreme than the count actually observed) can be provided.
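The gene adjacency example could be sketched as follows. The representation of the genome as a list of active/inactive flags, the one-sided p-value (counts at least as large as observed) and the toy data are assumptions for illustration, not the original analysis.

```python
import random


def count_adjacent_active(flags):
    """Number of neighbouring gene pairs in which both genes are active."""
    return sum(1 for a, b in zip(flags, flags[1:]) if a and b)


def permutation_p_value(flags, n_perm=10000, seed=0):
    """Fraction of random rearrangements with at least the observed count."""
    rng = random.Random(seed)
    observed = count_adjacent_active(flags)
    shuffled = list(flags)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # re-sampling without replacement
        if count_adjacent_active(shuffled) >= observed:
            hits += 1
    return observed, hits / n_perm


# Hypothetical genome of 30 genes with one cluster of 6 active genes.
genome = [False] * 12 + [True] * 6 + [False] * 12
print(permutation_p_value(genome))
```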


Jackknife

The jackknife measures the performance of an estimator ($\theta^*$) by systematically recomputing the estimate, leaving out one observation at a time from the sample.

Each time the estimator ($\theta^*_{-i}$) is calculated again. Finally the jackknife-corrected estimator ($\theta_{jack}$) can be calculated:

$$\theta_{jack} = n\theta^* - \frac{n-1}{n}\sum_{i=1}^{n}\theta^*_{-i}$$

For example, estimating the mean:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \quad\text{and}\quad \bar{x}_{-j} = \frac{\sum_{i=1}^{n} x_i - x_j}{n-1} \;\Rightarrow\; x_j = n\bar{x} - (n-1)\bar{x}_{-j}$$

and analogously for general estimators:

$$\theta^*_j = n\theta^* - (n-1)\theta^*_{-j} \quad\text{with}\quad \theta_{jack} = \frac{\sum_{j=1}^{n}\theta^*_j}{n}$$
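The pseudo-value formulation can be tried out with a short sketch. The function name and the example data are hypothetical; the estimator is passed in as an ordinary Python function.

```python
import statistics


def jackknife(data, estimator):
    """Jackknife-corrected estimate and the pseudo-values it is averaged from."""
    n = len(data)
    theta_full = estimator(data)
    # Recompute the estimator with observation i left out.
    leave_one_out = [estimator(data[:i] + data[i + 1:]) for i in range(n)]
    pseudo_values = [n * theta_full - (n - 1) * t for t in leave_one_out]
    return sum(pseudo_values) / n, pseudo_values


data = [2.0, 4.0, 4.0, 5.0, 7.0]  # hypothetical sample
print(jackknife(data, statistics.mean))
print(jackknife(data, statistics.pvariance)[0], statistics.variance(data))
```

For the mean, the pseudo-values are the observations themselves. For the biased plug-in variance (`statistics.pvariance`), the jackknife correction reproduces the usual unbiased sample variance.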


Hypothesis test

In medicine, comparisons between treatments or procedures, or between groups of subjects, are often conducted. More generally, a research question is addressed and tested with an experiment.

The numerical value corresponding to the comparison of interest is called the effect.

A null hypothesis H0 can be stated that this effect of interest is zero, as well as an alternative hypothesis H1 that the effect is not zero.

The null hypothesis is often the negation of the research hypothesis that generated the data.

We calculate the probability that we could have observed the data (or data that were more extreme) if the null hypothesis is true. This probability is called the p-value, and the smaller it is, the more untenable is the null hypothesis.


Test statistic

For most problems, the probability can be evaluated by calculating a test statistic, a value which we can compare with the known distribution of what we expect when the null hypothesis is true:

$$\text{test statistic} = \frac{\text{observed value} - \text{hypothesized value}}{\text{standard error of the observed value}}$$

In many cases the hypothesized value is zero, so that the test statistic becomes the ratio of the observed quantity of interest to its standard error.


Error types

If H0 is rejected with high probability, then based on this evidence you can accept the research hypothesis. In general there are 2 possible decisions:

- reject H0 and accept H1, or
- do not reject H0 and consider H1 as not confirmed

As apparent in the following table, there are 2 ways to decide correctly and 2 ways to make an error:

Decision   | H0 is really true             | H0 is really false
Accept H0  | correct                       | Type II error (probability β)
Reject H0  | Type I error (probability α)  | correct


Significance

α, the (maximal) probability of a Type I error, is the level of significance. By reducing the risk of an error of the first kind we increase the risk of an error of the second kind.

- The conventional compromise is to choose α = 0.05 as the level of significance.
- If p ≤ α, H0 is rejected (the research hypothesis is accepted) and the test is called statistically significant.
- Sometimes, if α = 0.001 is chosen and p ≤ α, the test is called 'highly' significant.

These are reasonable guidelines, but not an absolute demarcation. There is no great difference between p = 0.06 and p = 0.04; they indicate similar strength of evidence. Therefore the p-values themselves should be provided, and not only the statement that the test is significant.


Two-sided tests versus one-sided tests

$$H_0: \mu = \mu_0$$
$$H_1: \mu = \mu_1 \neq \mu_0 \quad\text{(two-sided alternative hypothesis)}$$
$$H_1: \mu = \mu_1 > \mu_0 \;(\text{or } \mu_1 < \mu_0) \quad\text{(one-sided alternative hypothesis)}$$

One-sided tests are rarely appropriate, and in most cases two-sided tests are used. Even when there are strong prior expectations, for example that a new treatment cannot be worse than the old one, you cannot be sure (otherwise you would not need an experiment).


Power of a test

The statistical power of a test is defined as 1 − β. This is the probability that a new therapy or theory is shown to be better if it really is better.

The power depends on the sample size n and the effect size δ, which refers to the magnitude of the effect under the alternative hypothesis.

When the means of 2 normally distributed samples are compared, the effect size is:

$$\delta = \frac{\mu_1 - \mu_0}{\sigma_0}$$

- Optimal tests are defined such that the power is maximal at a given α.
- The power decreases if α decreases.
- The power increases if the variability decreases.
- The power is higher for one-sided tests.


Power analysis

Since there is a relation between α, the power (1 − β), the sample size n, and the effect size δ, the optimal sample size can be calculated from the other parameters. This procedure is called power analysis.

1. Estimate the effect size (e.g. from the literature)
2. Define α and β
3. Calculate the optimal sample size n


Calculation of sample size

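Determination of the difference of the mean from a given µ0, for known standard deviation σ0 and independent, normally distributed data x1, ..., xn:

$$z = \sqrt{n}\,\frac{\bar{x} - \mu_0}{\sigma_0}$$

Under H0: µ = µ0, z follows a standard normal distribution, and H0 is rejected if $|z| > z_{1-\alpha/2}$.

$$\text{If } \mu = \mu_1 > \mu_0 \;\Rightarrow\; z = \sqrt{n}\,\frac{\bar{x} - \mu_1}{\sigma_0} + \sqrt{n}\,\frac{\mu_1 - \mu_0}{\sigma_0}$$

$$z_{1-\alpha/2} = z_{\beta} + \sqrt{n}\,\delta \;\Rightarrow\; n = \frac{(z_{1-\alpha/2} + z_{1-\beta})^2}{\delta^2}$$

For example: α = 0.05, β = 0.20, δ = (38 − 35)/6 = 0.5:

$$n = \frac{(z_{0.975} + z_{0.80})^2}{0.5^2} = \frac{(1.96 + 0.84)^2}{0.25} \approx 31$$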
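The sample size formula can be evaluated directly with the normal quantile function from the Python standard library. The function name is an assumption; the numbers are those of the example above.

```python
from statistics import NormalDist


def sample_size(alpha, beta, delta):
    """n = (z_{1-alpha/2} + z_{1-beta})^2 / delta^2 for a two-sided z test."""
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)
    z_beta = standard_normal.inv_cdf(1 - beta)
    return (z_alpha + z_beta) ** 2 / delta ** 2


n = sample_size(alpha=0.05, beta=0.20, delta=(38 - 35) / 6)
print(n)
```

Because the quantiles are not rounded to two decimals here, the result is slightly above 31; in practice the value is rounded up to the next integer.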


Estimation versus hypothesis testing

- There is a close relation between confidence intervals and hypothesis testing:
  p < 0.05 (i.e. significant) ⇔ the 95% interval does not include the value specified in H0.
  The reason for this relation is that both methods are based on similar aspects of the theoretical distribution of the test statistic.
- The confidence interval shows the uncertainty, or lack of precision, in the estimate of interest, and thus conveys more useful information than the p-value.
- The use of a new treatment depends not only on the significance but also on the size of the effect. A single number (the p-value) cannot convey the necessary information.


Non-parametric tests

Parametric methods
- Make assumptions about the sampling distributions
- Are based on theoretical distributions which are described by parameters (mean, standard deviation)
- Provide confidence intervals and hypothesis tests

Non-parametric (distribution-free) methods
- Often used to analyze data which are not normally distributed (e.g. skewed data)
- Mostly based on ranks or on comparing sums of ranks
- Tend to be more suited to hypothesis testing than to estimation
- In some cases the calculation of confidence intervals is possible (e.g. for the median)


Multiple testing

Problem
If several hypothesis tests are done in parallel, the probability of drawing a wrong conclusion increases.

Example - Microarrays
Thousands of genes are tested for significant differential expression.

- With 1000 tests at a type I error level of 0.05, 50 false positives (tests declared significant although H0 is true) are expected if all null hypotheses are true.
- For k independent tests, the probability that at least one p < α is 1 − (1 − α)^k, which converges towards 1 for large k.
- Multiple testing corrections adjust the p-values (or the significance level α) derived from multiple statistical tests to correct for the occurrence of these false positives.
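The two quantities mentioned above are easy to tabulate. A small sketch, with hypothetical function names, assuming independent tests and that every null hypothesis is true:

```python
def family_wise_error(alpha, k):
    """Probability of at least one false positive among k independent tests."""
    return 1 - (1 - alpha) ** k


def expected_false_positives(alpha, k):
    """Expected number of false positives among k true null hypotheses."""
    return alpha * k


for k in (1, 10, 100, 1000):
    print(k, expected_false_positives(0.05, k), family_wise_error(0.05, k))
```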


Type I error

          | accepted H         | rejected H        | total
true H0   | U                  | V (Type I error)  | G0
false H0  | T (Type II error)  | S                 | G1
total     | G − R              | R                 | G

Per-family and per-comparison error rate
$$PFER = E(V), \quad PCER = E(V)/G$$

Family-wise error rate (FWER)
$$FWER = P(V > 0)$$

False discovery rate (FDR)
$$FDR = E(Q), \quad Q = \begin{cases} V/R & R > 0 \\ 0 & R = 0 \end{cases}$$


Methods for multiple testing corrections

Method                          | Error control | Stringency
Bonferroni                      | FWER          | most stringent
Bonferroni step-down (Holm)     | FWER          | ..
Westfall and Young permutation  | FWER          | ..
Benjamini and Hochberg FDR      | FDR           | less stringent

Family-wise error control allows very few occurrences of false positives.

False discovery rate control allows a percentage of the called genes to be false positives.


Multiple testing corrections

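p       | Bonf.       | Holm                | BH (FDR)
p(1)    | p(1) * n    | p(1) * n            | p(1) * n
p(2)    | p(2) * n    | p(2) * (n − 1)      | p(2) * n/2
:       | :           | :                   | :
p(i)    | p(i) * n    | p(i) * (n − i + 1)  | p(i) * n/i
:       | :           | :                   | :
p(n−1)  | p(n−1) * n  | p(n−1) * 2          | p(n−1) * n/(n − 1)
p(n)    | p(n) * n    | p(n)                | p(n)

Here p(1) ≤ p(2) ≤ ... ≤ p(n) are the ordered p-values, and each adjusted value is capped at 1: padj = min(1, p)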
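The three adjustments in the table can be sketched as one function. Note one assumption that goes beyond the table: as is usual in practice, the Holm and Benjamini-Hochberg values are additionally made monotone in the rank (running maximum and running minimum respectively). Function and variable names are hypothetical.

```python
def adjust_p_values(p_values, method="bonferroni"):
    """Adjusted p-values, returned in the original order of p_values."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * n
    if method == "bonferroni":
        for i in range(n):
            adjusted[i] = min(1.0, p_values[i] * n)
    elif method == "holm":
        running_max = 0.0
        for rank, idx in enumerate(order, start=1):
            # p(i) * (n - i + 1), never smaller than the previous adjusted value
            running_max = max(running_max, p_values[idx] * (n - rank + 1))
            adjusted[idx] = min(1.0, running_max)
    elif method == "bh":
        running_min = 1.0
        for rank in range(n, 0, -1):
            idx = order[rank - 1]
            # p(i) * n / i, never larger than the next adjusted value
            running_min = min(running_min, p_values[idx] * n / rank)
            adjusted[idx] = running_min
    else:
        raise ValueError("unknown method: " + method)
    return adjusted


p = [0.01, 0.04, 0.03, 0.005]  # hypothetical raw p-values
for m in ("bonferroni", "holm", "bh"):
    print(m, adjust_p_values(p, m))
```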


Westfall and Young permutation

1. Compute the t statistic for each row in the original dataset.

2. Order them: $|t_{(1)}| \ge |t_{(2)}| \ge |t_{(3)}| \ge \dots \ge |t_{(k)}|$

3. Permute the columns of the data matrix.

4. Compute the t statistics for all rows of the permuted dataset: $t^{(b)}_1, \dots, t^{(b)}_k$

5. Compute $u^{(b)}_k = |t^{(b)}_{(k)}|$ and $u^{(b)}_j = \max(u^{(b)}_{j+1}, |t^{(b)}_{(j)}|)$, $1 \le j \le k-1$

6. Repeat steps 3-5 N times and calculate the adjusted p-values:

$$p_{(j)} = \frac{\sum_{b=1}^{N} I\left(u^{(b)}_j \ge |t_{(j)}|\right)}{N},$$

where I(•) is the indicator function, which is 1 if the condition in parentheses is true and 0 if false.


Choosing an appropriate method

There are several aspects of the data to be considered when choosing an appropriate method of analysis:

- The number of groups of observations
- Independent or dependent groups of observations
- The type of the data
- The distribution of the data
- The objective of the analysis


Comparing groups of continuous data

One group of observations
Comparing the mean of a single group of observations with a specific value (k).

Confidence interval for the mean

Is k within the (1 − α) CI?

$$\left[\bar{x} - t_{1-\alpha/2}\frac{s}{\sqrt{n}},\; \bar{x} + t_{1-\alpha/2}\frac{s}{\sqrt{n}}\right]$$

One sample t-test

$$t = \frac{\bar{x} - k}{s/\sqrt{n}}$$

Confidence interval for the median
From the ranked data, the CI limits are the values at the ranks nearest to

$$\left[np - 1.96\sqrt{np(1-p)},\; np + 1.96\sqrt{np(1-p)}\right] \quad\text{with } p = \tfrac{1}{2}$$


One group of observations

Sign test

$$z = \frac{r - np}{\sqrt{np(1-p)}}$$

where r is the number of observations > k and p = 1/2

Sign test with continuity correction

$$z = \frac{|r - np| - \tfrac{1}{2}}{\sqrt{np(1-p)}}$$

Wilcoxon signed rank sum test
1. Calculate the differences: $x_i - k$
2. Rank them in order of magnitude $|x_i - k|$
3. Calculate the sum of all positive (negative) ranks, corresponding to the observations above (below) k

⇒ get the p-value for the sum from the tabulated test statistic.
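For a single group of observations, the one sample t statistic and the sign test z statistic can be computed as in the following sketch. The function names and the data are hypothetical, and it is assumed that no observation is exactly equal to k. Converting the statistics to p-values requires the t and normal distributions and is left out here.

```python
import math
import statistics


def one_sample_t(data, k):
    """t = (mean - k) / (s / sqrt(n))"""
    n = len(data)
    return (statistics.mean(data) - k) / (statistics.stdev(data) / math.sqrt(n))


def sign_test_z(data, k, continuity=False):
    """Normal approximation of the sign test with p = 1/2."""
    n = len(data)
    r = sum(1 for x in data if x > k)  # observations above k
    p = 0.5
    diff = r - n * p
    if continuity:
        diff = abs(diff) - 0.5
    return diff / math.sqrt(n * p * (1 - p))


data = [1, 2, 3, 4, 5]  # hypothetical sample
print(one_sample_t(data, k=2.5))
print(sign_test_z(data, k=2.5), sign_test_z(data, k=2.5, continuity=True))
```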


Two groups of paired observations

Confidence interval for the differences between means

$$(1-\alpha)\ \text{CI}: \left[\bar{d} - t_{1-\alpha/2}\,SE(\bar{d}),\; \bar{d} + t_{1-\alpha/2}\,SE(\bar{d})\right]$$

Paired t-test
The one sample t test can also be used for the comparison of means, using the mean difference ($\bar{d}$):

$$t = \frac{\bar{d} - k}{SE(\bar{d})} \quad (\text{e.g. } k = 0)$$

Non-parametric methods
The one sample sign test and the Wilcoxon signed rank sum test can also be applied to the differences between the paired data (Wilcoxon matched pairs signed rank sum test).


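Two independent groups of observations

Confidence interval for the differences between means

Pooled variance:

$$s^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$

Standard error:

$$SE(\bar{x}_1 - \bar{x}_2) = s\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$

$$(1-\alpha)\ \text{CI}: \bar{x}_1 - \bar{x}_2 \pm t_{1-\alpha/2}\,SE(\bar{x}_1 - \bar{x}_2)$$

Two sample t test

$$t = \frac{\bar{x}_1 - \bar{x}_2}{SE(\bar{x}_1 - \bar{x}_2)}$$

Mann-Whitney U test
Rank all observations (as if they were from a single sample). The sums run over the ranks $r_i$ of the observations in group 1 and in group 2, respectively:

$$U_1 = n_1 n_2 + \frac{n_1(n_1+1)}{2} - \sum_{i=1}^{n_1} r_i \qquad U_2 = n_1 n_2 + \frac{n_2(n_2+1)}{2} - \sum_{i=1}^{n_2} r_i$$

U = min(U1, U2) ⇒ if U < U(α; n1; n2) from the tabulated statistic, then the test is significant.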


Mann-Whitney U test
Two groups: A = 7, 4, 9, 17 and B = 11, 6, 21, 14

Is there any evidence that A and B are drawn from populations with different levels of the variable? The null hypothesis is that there is no tendency for members of one population to exceed members of the other.

Ranked observations:
A  B  A  A  B   B   A   B
4  6  7  9  11  14  17  21

For each A (B), count how many Bs (As) precede it:
U = 0 + 1 + 1 + 3 = 5 and U′ = 1 + 3 + 3 + 4 = 11
U + U′ = n1 * n2 ⇔ 5 + 11 = 4 * 4

$$U = n_1 n_2 + \frac{n_1(n_1+1)}{2} - \sum_{i=1}^{n_1} r_i$$

With the ranks of A (1 + 3 + 4 + 7 = 15) this gives 16 + 10 − 15 = 11; with the ranks of B (2 + 5 + 6 + 8 = 21) it gives 16 + 10 − 21 = 5.

There are 70 possible arrangements (8!/(4!4!)), and each has an equal probability of 1/70 under the null hypothesis.
E.g. U = 2 arises from exactly two arrangements, AAABBABB and AABAABBB ⇒ P(U = 2) = 2/70 = 0.029.
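The counting argument can be checked by brute force. The sketch below recomputes U and U′ for the example and enumerates all arrangements of 4 As among 8 ranks to obtain the exact null distribution. The function names and the two-sided p-value convention (arrangements whose smaller U is at most the observed one) are assumptions; ties are not handled.

```python
from itertools import combinations


def u_by_counting(a, b):
    """For each value in a, count how many values in b precede it."""
    return sum(1 for x in a for y in b if y < x)


def u_distribution(n1, n2):
    """Exact null distribution of U as a dict {U value: number of arrangements}."""
    counts = {}
    positions = range(n1 + n2)
    for a_pos in combinations(positions, n1):
        a_set = set(a_pos)
        b_pos = [p for p in positions if p not in a_set]
        u = sum(1 for x in a_pos for y in b_pos if y < x)
        counts[u] = counts.get(u, 0) + 1
    return counts


A = [7, 4, 9, 17]
B = [11, 6, 21, 14]
u = u_by_counting(A, B)        # Bs preceding each A
u_prime = u_by_counting(B, A)  # As preceding each B
dist = u_distribution(len(A), len(B))
total = sum(dist.values())
u_min = min(u, u_prime)
p_two_sided = sum(
    c for v, c in dist.items() if min(v, len(A) * len(B) - v) <= u_min
) / total
print(u, u_prime, total, dist[2], p_two_sided)
```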


Comparing two variances by the F-test

We can test the null hypothesis that two population variances are equal using the F distribution.

If the data are normally distributed, the ratio of two independent estimates of the same variance follows an F distribution.

$$F(\nu_1, \nu_2) = \frac{\chi^2_1/\nu_1}{\chi^2_2/\nu_2}$$

where ν1 and ν2 are the degrees of freedom.

Calculate (s1/s2)² with s1 > s2 and look it up with the degrees of freedom (ν1 = n1 − 1; ν2 = n2 − 1) in the tabulated F statistic.


Chi-square distribution

The chi-square distribution results when independent variables with standard normal distributions are squared and summed:

$$X = (Z_1 + c_1)^2 + (Z_2 + c_2)^2 + \dots + (Z_\nu + c_\nu)^2$$

has a χ² distribution with ν degrees of freedom and non-centrality parameter

$$\delta^2 = \sum_{i=1}^{\nu} c_i^2.$$


Welch test

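The Welch test is a modification of the t test for the case of unequal variances.

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

with degrees of freedom

$$df = \frac{\left(s_1^2/n_1 + s_2^2/n_2\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}$$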
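A direct transcription of the two Welch formulas into Python, as a sketch. The function name is an assumption, and the two small groups are reused from the Mann-Whitney example purely as illustrative input.

```python
import math
import statistics


def welch_t(x1, x2):
    """Welch t statistic and its approximate degrees of freedom."""
    n1, n2 = len(x1), len(x2)
    v1 = statistics.variance(x1) / n1  # s1^2 / n1
    v2 = statistics.variance(x2) / n2  # s2^2 / n2
    t = (statistics.mean(x1) - statistics.mean(x2)) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


print(welch_t([7, 4, 9, 17], [11, 6, 21, 14]))
```

The degrees of freedom are generally not an integer; the p-value is then taken from the t distribution with this df.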


More independent groups of observations

One way analysis of variance (ANOVA)
The main objective is to identify the sources of variation that influence the data. The following model is suggested for the data, where just one factor is supposed to affect the population:

$$x_{ij} = \mu + \alpha_i + \varepsilon_{ij}, \quad i = 1, \dots, k, \quad j = 1, \dots, N_i, \quad \sum_{i=1}^{k} N_i = N$$

The idea behind it is to test whether the data $x_{ij}$ can be explained as the response to different treatments (groups i = 1, ..., k) of a given factor.

$\alpha_i$ is the treatment effect and can be characterized by the sample mean of every subgroup:

$$\alpha_i = \bar{x}_i - \bar{x} = \frac{1}{N_i}\sum_{j=1}^{N_i} x_{ij} - \frac{1}{N}\sum_{i=1}^{k}\sum_{j=1}^{N_i} x_{ij}$$

and

$$\varepsilon_{ij} = x_{ij} - \bar{x}_i$$


ANOVA

$$H_0: \mu_1 = \mu_2 = \dots = \mu_k$$

The test of H0 is based on estimating σ². A general estimator of the variance is based on the variance within groups; for k groups of equal size n (so that N = kn):

$$MSE = \frac{s_1^2 + s_2^2 + \dots + s_k^2}{k} = \frac{1}{k(n-1)}\sum_{i=1}^{k}\sum_{j=1}^{N_i}(x_{ij} - \bar{x}_i)^2$$

The second estimator of the variance is based on the variance between groups:

$$MSA = n\,s_{\bar{x}}^2 = \frac{1}{k-1}\sum_{i=1}^{k} N_i(\bar{x}_i - \bar{x})^2$$

If H0 is true, both variances will be very similar, and if MSA >> MSE then H0 is rejected. This can be formulated by the F statistic:

$$F = \frac{MSA}{MSE}$$


ANOVA

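All the information can be summarized in an ANOVA table:

Variation   | df    | Sum of squares                                                   | Mean square        | F-value
Treatments  | k − 1 | $SSA = \sum_{i=1}^{k} N_i(\bar{x}_i - \bar{x})^2$                | MSA = SSA/(k − 1)  | MSA/MSE
Error       | N − k | $SSE = \sum_{i=1}^{k}\sum_{j=1}^{N_i}(x_{ij} - \bar{x}_i)^2$     | MSE = SSE/(N − k)  |
Total       | N − 1 | $SST = \sum_{i=1}^{k}\sum_{j=1}^{N_i}(x_{ij} - \bar{x})^2$       |                    |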
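The entries of the ANOVA table can be computed with a few sums. The following is a sketch with a hypothetical function name and made-up data; the p-value would additionally require the F distribution with (k − 1, N − k) degrees of freedom.

```python
def one_way_anova(groups):
    """Sums of squares, mean squares and F statistic for a one way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ssa = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    sse = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    msa = ssa / (k - 1)
    mse = sse / (n_total - k)
    return {"SSA": ssa, "SSE": sse, "MSA": msa, "MSE": mse, "F": msa / mse}


# Hypothetical data: three treatment groups with three observations each.
result = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(result)
```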


Kruskal-Wallis test

As ANOVA is a more general form of the t test, the Kruskal-Wallis test is a more general form of the non-parametric Mann-Whitney test.

$$H = \frac{12}{N(N+1)}\sum_{i=1}^{k} N_i(\bar{R}_i - \bar{R})^2$$

where $\bar{R}$ is the average of all ranks ($\bar{R} = (N+1)/2$), $R_i$ is the rank sum of the $N_i$ observations in the i-th group, and $\bar{R}_i$ is the average rank in each group ($\bar{R}_i = R_i/N_i$).

The H statistic can also be formulated equivalently:

$$H = \frac{12}{N(N+1)}\sum_{i=1}^{k}\frac{R_i^2}{N_i} - 3(N+1)$$

H is χ² distributed with k − 1 degrees of freedom.

If there are ties, H has to be corrected by

$$C = 1 - \frac{\sum_{i}(t_i^3 - t_i)}{N^3 - N} \quad\text{and}\quad H' = \frac{H}{C}$$

where the sum runs over the sets of tied values and $t_i$ is the number of observations tied at the i-th value.
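That the two forms of H agree can be checked numerically. The sketch below assumes there are no tied values (so no correction C is needed); the function name and the data are hypothetical.

```python
def kruskal_wallis_h(groups):
    """Both forms of the H statistic, assuming no tied values."""
    pooled = sorted(x for g in groups for x in g)
    rank = {x: i + 1 for i, x in enumerate(pooled)}  # valid only without ties
    n_total = len(pooled)
    rank_sums = [sum(rank[x] for x in g) for g in groups]
    mean_rank = (n_total + 1) / 2
    factor = 12 / (n_total * (n_total + 1))
    h_deviation = factor * sum(
        len(g) * (r / len(g) - mean_rank) ** 2 for g, r in zip(groups, rank_sums)
    )
    h_shortcut = factor * sum(
        r ** 2 / len(g) for g, r in zip(groups, rank_sums)
    ) - 3 * (n_total + 1)
    return h_deviation, h_shortcut


print(kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
```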


Comparing groups of categorical data

- Categorical data are very common in medical research, when individuals are categorized into one of two or more mutually exclusive groups. The number falling into a particular group is called the frequency.
- The data are often shown in the form of frequency tables.
- They can also be summarized as the proportion of the total number of individuals in one of the categories.


One proportion
Confidence interval

$$p = \frac{r}{n} \quad\text{and}\quad SE(p) = \sqrt{\frac{p(1-p)}{n}}$$

Based on the normal distribution when np > 5 and n(1 − p) > 5 ⇒ r > 5 and (n − r) > 5

$$95\%\ \text{CI}: \left[p - 1.96\sqrt{\frac{p(1-p)}{n}},\; p + 1.96\sqrt{\frac{p(1-p)}{n}}\right]$$

Hypothesis test
Test the null hypothesis that the population proportion is some pre-specified value k:

$$z = \frac{p - k}{SE(p)} \quad\text{with}\quad SE(p) = \sqrt{\frac{k(1-k)}{n}}$$

and with continuity correction:

$$z = \frac{|p - k| - \frac{1}{2n}}{SE(p)}$$
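A sketch of the interval and the test for one proportion. Function names and the example (30 events in 100 individuals, tested against k = 0.5) are made up for illustration. Note that the interval uses the observed p in the standard error, while the test uses the hypothesized value k.

```python
import math


def proportion_ci(r, n, z=1.96):
    """Normal-approximation confidence interval for a proportion r/n."""
    p = r / n
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se


def proportion_z(r, n, k, continuity=False):
    """z statistic for H0: population proportion equals k."""
    p = r / n
    se0 = math.sqrt(k * (1 - k) / n)  # standard error under H0
    if continuity:
        return (abs(p - k) - 1 / (2 * n)) / se0
    return (p - k) / se0


print(proportion_ci(30, 100))
print(proportion_z(30, 100, 0.5), proportion_z(30, 100, 0.5, continuity=True))
```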


Proportions in two independent groups

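Confidence interval

$$SE(p_1 - p_2) = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$$

$$95\%\ \text{CI}: \left[p_1 - p_2 - 1.96\,SE(p_1 - p_2),\; p_1 - p_2 + 1.96\,SE(p_1 - p_2)\right]$$

Hypothesis test

$$p = \frac{r_1 + r_2}{n_1 + n_2}$$

$$SE(p_1 - p_2) = \sqrt{\frac{p(1-p)}{n_1} + \frac{p(1-p)}{n_2}} = \sqrt{p(1-p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$

$$z = \frac{p_1 - p_2}{SE(p_1 - p_2)} \quad\text{and}\quad z_c = \frac{|p_1 - p_2| - \frac{1}{2}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}{SE(p_1 - p_2)}$$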
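The test for two independent proportions with the pooled standard error could be sketched like this. The function name and the counts (30/100 versus 45/100) are hypothetical.

```python
import math


def two_proportion_z(r1, n1, r2, n2, continuity=False):
    """z statistic for H0: both groups have the same population proportion."""
    p1, p2 = r1 / n1, r2 / n2
    p = (r1 + r2) / (n1 + n2)  # pooled proportion under H0
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    if continuity:
        return (abs(p1 - p2) - 0.5 * (1 / n1 + 1 / n2)) / se
    return (p1 - p2) / se


print(two_proportion_z(30, 100, 45, 100))
print(two_proportion_z(30, 100, 45, 100, continuity=True))
```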


Two paired proportions

Example - Sleep difficulties

Marijuana group | Control group | Number of pairs
yes             | yes           | a = 4
yes             | no            | b = 3
no              | yes           | c = 9
no              | no            | d = 16
total           |               | n = 32

$$p_1 - p_2 = \frac{a + b}{n} - \frac{a + c}{n} = \frac{b - c}{n}$$


Two paired proportions

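Confidence interval

$$SE(p_1 - p_2) = \frac{1}{n}\sqrt{b + c - \frac{(b - c)^2}{n}}$$

$$95\%\ \text{CI}: \left[p_1 - p_2 - 1.96\,SE(p_1 - p_2),\; p_1 - p_2 + 1.96\,SE(p_1 - p_2)\right]$$

Hypothesis test
Replace both b and c by (b + c)/2:

$$SE(p_1 - p_2) = \frac{1}{n}\sqrt{\frac{b + c}{2} + \frac{b + c}{2}} = \frac{1}{n}\sqrt{b + c}$$

$$z = \frac{p_1 - p_2}{SE(p_1 - p_2)} = \frac{b - c}{\sqrt{b + c}}$$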
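Applied to the sleep difficulties example (a = 4, b = 3, c = 9, d = 16), the paired formulas can be evaluated as in the following sketch; the function name is an assumption.

```python
import math


def paired_proportions(a, b, c, d):
    """Difference p1 - p2, its standard error for the CI, and the test z."""
    n = a + b + c + d
    diff = (b - c) / n
    se_ci = math.sqrt(b + c - (b - c) ** 2 / n) / n
    z = (b - c) / math.sqrt(b + c)
    return diff, se_ci, z


diff, se_ci, z = paired_proportions(a=4, b=3, c=9, d=16)
print(diff, (diff - 1.96 * se_ci, diff + 1.96 * se_ci), z)
```

Only the discordant pairs b and c enter the test statistic; the concordant pairs a and d affect the result only through n.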


Analysis of frequency tables

Chi squared test for an r × c table
The null hypothesis is that the two classifications (columns and rows) are unrelated in the relevant population.

Compare the observed frequencies with what we would expect if the null hypothesis is true.

$$X^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

with observed frequencies $O_{ij}$ and expected frequencies $E_{ij}$

The expected frequency in each cell is the product of the relevant row and column totals divided by the sum of all the observed frequencies in the table (i.e. the sample size).

X² is χ² distributed with (r − 1)(c − 1) degrees of freedom.


2x2 frequency tables

      | C1  | C2  | total
R1    | a   | b   | a+b
R2    | c   | d   | c+d
total | a+c | b+d | N

There are two common tests for 2 × 2 frequency tables:
- Chi squared test (if all $E_{ij} > 5$)
- Fisher's exact test


Chi squared test

For the first cell:

$$\frac{(O_{11} - E_{11})^2}{E_{11}} = \frac{\left(a - \frac{(a+b)(a+c)}{N}\right)^2}{\frac{(a+b)(a+c)}{N}}$$

and for the sum of all 4 cells:

$$X^2 = \frac{N(ad - bc)^2}{(a + b)(a + c)(b + d)(c + d)}$$

Continuity correction (also known as Yates' correction):

$$X^2_Y = \frac{N\left(|ad - bc| - \frac{N}{2}\right)^2}{(a + b)(a + c)(b + d)(c + d)}$$

The Chi squared test is equivalent to the comparison of proportions.
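The shortcut formula can be compared with the general sum over observed and expected frequencies. A sketch with hypothetical function names and a made-up 2 × 2 table:

```python
def chi2_general(a, b, c, d):
    """Sum of (O - E)^2 / E over the four cells of a 2x2 table."""
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    observed = ((a, b), (c, d))
    return sum(
        (observed[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
        for i in range(2)
        for j in range(2)
    )


def chi2_shortcut(a, b, c, d, yates=False):
    """Closed form for a 2x2 table, optionally with Yates' correction."""
    n = a + b + c + d
    numerator = abs(a * d - b * c)
    if yates:
        numerator -= n / 2
    return n * numerator ** 2 / ((a + b) * (a + c) * (b + d) * (c + d))


print(chi2_general(10, 20, 30, 40), chi2_shortcut(10, 20, 30, 40))
print(chi2_shortcut(10, 20, 30, 40, yates=True))
```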


Fisher’s exact test

The method consists of evaluating the probability associated with all possible 2x2 tables which have the same row and column totals, under the assumption that the null hypothesis is true.

$$p(a,b,c,d) = \frac{(a+b)!\,(a+c)!\,(b+d)!\,(c+d)!}{N!\,a!\,b!\,c!\,d!}$$

In order to calculate the significance of the observed data, i.e. the total probability of observing data as extreme or more extreme if the null hypothesis is true, there are 2 possibilities:

1) Add the probabilities in the 'tail' of the distribution in which the observed data fall, and double the value to get a two-tailed test.

2) Add up the probabilities of all tables with p ≤ p(a, b, c, d), i.e. all tables that are no more probable than the observed one.
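The second possibility can be sketched with exact rational arithmetic, which avoids rounding problems when comparing table probabilities. The function names and the example table are assumptions; the observed table itself is included in the sum.

```python
from fractions import Fraction
from math import factorial


def table_probability(a, b, c, d):
    """Probability of one 2x2 table given its row and column totals."""
    n = a + b + c + d
    numerator = (factorial(a + b) * factorial(a + c)
                 * factorial(b + d) * factorial(c + d))
    denominator = (factorial(n) * factorial(a) * factorial(b)
                   * factorial(c) * factorial(d))
    return Fraction(numerator, denominator)


def fisher_two_sided(a, b, c, d):
    """Sum the probabilities of all tables no more probable than the observed one."""
    p_observed = table_probability(a, b, c, d)
    row1, col1, n = a + b, a + c, a + b + c + d
    total = Fraction(0)
    # x runs over every value of the top-left cell compatible with the margins.
    for x in range(max(0, row1 + col1 - n), min(row1, col1) + 1):
        p = table_probability(x, row1 - x, col1 - x, n - row1 - col1 + x)
        if p <= p_observed:
            total += p
    return total


print(table_probability(3, 1, 1, 3), float(fisher_two_sided(3, 1, 1, 3)))
```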


Coefficients of association


McNemar’s test for paired samples

          | Cases + | Cases − | total
Control + | a       | b       | a+b
Control − | c       | d       | c+d
total     | a+c     | b+d     | N

$$X^2 = \frac{(|b - c| - 1)^2}{b + c}$$


Ordered 2 x k contingency table

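score/categories | x1 = 1 | ... | xk = k | total
frequency        | r1     | ... | rk     | $R = \sum_{i=1}^{k} r_i$
total            | n1     | ... | nk     | $N = \sum_{i=1}^{k} n_i$

From the regression approach we get:

$$X^2_{trend} = \frac{\left(\sum_{i=1}^{k} r_i x_i - R\bar{x}\right)^2}{p(1-p)\left(\sum_{i=1}^{k} n_i x_i^2 - N\bar{x}^2\right)}, \quad df = 1, \quad p = \frac{R}{N}, \quad \bar{x} = \sum_{i=1}^{k}\frac{n_i x_i}{N}$$

An alternative approach is based on Kendall's rank correlation (τ):

$$X^2 = \left(\frac{\tau}{SE(\tau)}\right)^2$$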


Comparing risks
Relative risk and odds ratio

There is another way of analyzing 2 × 2 tables, which involves the comparison of two groups with respect to the risk of some event.

The methods were developed in epidemiology, especially for the analysis of case-control studies.

The parameters of interest are the relative risk (RR) and the odds ratio (OR).


Relative risk

In a prospective study, groups of subjects with different characteristics are followed up to see whether an outcome of interest occurs.

The risks in the two groups (exposed and non-exposed) are a/(a + b) and c/(c + d).

The relative risk is

$$RR = \frac{a/(a + b)}{c/(c + d)}$$

Under the null hypothesis the expected value of RR is 1.

$$SE(\log RR) = \sqrt{\frac{1}{a} - \frac{1}{a + b} + \frac{1}{c} - \frac{1}{c + d}}$$

$$(1-\alpha)\ \text{CI}: \left[\log RR - z_{1-\alpha/2}\,SE(\log RR),\; \log RR + z_{1-\alpha/2}\,SE(\log RR)\right]$$


Odds ratio

In retrospective case-control studies the selection of subjects is based on the outcome. In this case the relative risk is not a valid estimate.

We can use the odds (a/b) of the outcome in the first group (cases) and compare them to the odds (c/d) in the second group (controls) to get the odds ratio

$$OR = \frac{ad}{bc}$$

For case-control studies the outcome of interest is usually rare, so the odds ratio offers a method of getting an approximate relative risk.

$$SE(\log OR) = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}$$

$$(1-\alpha)\ \text{CI}: \left[\log OR - z_{1-\alpha/2}\,SE(\log OR),\; \log OR + z_{1-\alpha/2}\,SE(\log OR)\right]$$
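Relative risk and odds ratio with their confidence intervals can be sketched as follows. The intervals are computed on the log scale, as in the formulas, and then transformed back with the exponential function, which is a common convention but an addition to the formulas shown. Function names and the counts are hypothetical.

```python
import math


def relative_risk(a, b, c, d, z=1.96):
    """RR and its confidence interval, back-transformed from the log scale."""
    rr = (a / (a + b)) / (c / (c + d))
    se = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    return rr, (math.exp(math.log(rr) - z * se), math.exp(math.log(rr) + z * se))


def odds_ratio(a, b, c, d, z=1.96):
    """OR and its confidence interval, back-transformed from the log scale."""
    oratio = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return oratio, (math.exp(math.log(oratio) - z * se),
                    math.exp(math.log(oratio) + z * se))


# Hypothetical table: 20 of 100 exposed and 10 of 100 non-exposed with the outcome.
print(relative_risk(20, 80, 10, 90))
print(odds_ratio(20, 80, 10, 90))
```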


Goodness-of-fit

- Q-Q plot
- Chi-square goodness-of-fit test
- Kolmogorov-Smirnov test (KS test)
- Shapiro-Wilk test
- Anderson-Darling test

Correlation and regression

The aim is to find associations between two or more variables (bivariate or multivariate data).

Possible questions are:

1. Is there a relation between variables?

2. How strong is this relation?

3. What shape does this relation have?

4. Can a variable of interest be predicted by observing other variables?

Correlation

Correlation analysis measures the strength of the linear relation between x and y, where x and y are paired observations on the same observation unit (bivariate data). The (Pearson) correlation coefficient r is used as the measure.

Variance

$$s_x^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2 \qquad s_y^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i-\bar{y})^2$$

Covariance

$$\mathrm{Cov}(x,y) = s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})$$

Correlation

$$r = \frac{s_{xy}}{s_x s_y} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\,\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$
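The Pearson coefficient can be computed directly from the sums of squares and cross-products, since the factors $1/(n-1)$ cancel. The following is a small sketch; `pearson_r` and the data are illustrative and not part of the course material.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
```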

Correlation

Test for linear relation

$H_0$: true correlation $\rho = 0$

For bivariate normally distributed (x, y) the test statistic

$$T = r\sqrt{\frac{n-2}{1-r^2}}$$

is t-distributed with $n-2$ degrees of freedom.

With the following (Fisher) transformation the transformed correlation is approximately normally distributed:

$$z' = 0.5\,(\ln(1+r) - \ln(1-r)) \quad\text{and}\quad SE = \frac{1}{\sqrt{n-3}}$$

$(1-\alpha)$ CI: $\left[\dfrac{e^{2z_l}-1}{e^{2z_l}+1},\ \dfrac{e^{2z_u}-1}{e^{2z_u}+1}\right]$ with

$$z_l = z' - z_{1-\alpha/2}\frac{1}{\sqrt{n-3}} \quad\text{and}\quad z_u = z' + z_{1-\alpha/2}\frac{1}{\sqrt{n-3}}$$
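As a sketch, the test statistic and the confidence interval based on Fisher's transformation can be computed as follows. The function names and the values r = 0.5 and n = 28 are chosen only for illustration. The back-transformation uses $(e^{2z}-1)/(e^{2z}+1)$, which is the hyperbolic tangent of z.

```python
import math
from statistics import NormalDist

def correlation_test_statistic(r, n):
    """T = r * sqrt((n - 2) / (1 - r^2)); compare with t, n - 2 df."""
    return r * math.sqrt((n - 2) / (1 - r * r))

def fisher_ci(r, n, alpha=0.05):
    """(1 - alpha) CI for a correlation via Fisher's z transformation."""
    z_prime = 0.5 * (math.log(1 + r) - math.log(1 - r))
    se = 1 / math.sqrt(n - 3)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    z_l, z_u = z_prime - z_crit * se, z_prime + z_crit * se

    def back(z):
        # Inverse of the transformation (equals tanh(z)).
        return (math.exp(2 * z) - 1) / (math.exp(2 * z) + 1)

    return back(z_l), back(z_u)

t_stat = correlation_test_statistic(0.5, 28)
ci_lower, ci_upper = fisher_ci(0.5, 28)
```

Note that the interval is not symmetric around r, because the back-transformation is non-linear.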

Spearman’s rank correlation

Spearman's rank correlation coefficient $r_s$ is obtained by ranking the values of the two variables separately and calculating Pearson's correlation on the ranks of the data. For ties the average rank is used.

When there are no ties, Spearman's rank correlation can be calculated more simply:

$$r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n^3 - n}$$

where $d_i$ are the differences in the ranks.
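Both routes to $r_s$ can be compared in a short sketch: Pearson's correlation on average ranks, and the shortcut formula, which is only valid without ties. All names and data below are hypothetical.

```python
import math

def average_ranks(values):
    """Ranks starting at 1; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman(x, y):
    """General definition: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def spearman_no_ties(x, y):
    """Shortcut formula; assumes there are no ties in x or y."""
    n = len(x)
    d2 = sum((rx - ry) ** 2
             for rx, ry in zip(average_ranks(x), average_ranks(y)))
    return 1 - 6 * d2 / (n ** 3 - n)

x = [10, 20, 30, 40, 50]
y = [1, 3, 2, 5, 4]
rs_general = spearman(x, y)
rs_shortcut = spearman_no_ties(x, y)
```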

Kendall’s τ

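Kendall's rank correlation coefficient $\tau$ is the proportion of concordant pairs (ordered the same way) minus the proportion of discordant pairs (ordered in the opposite way).

$$\tau = \frac{n_c - n_d}{\tfrac{1}{2}n(n-1)} = \frac{S}{\tfrac{1}{2}n(n-1)}$$

where $S = n_c - n_d$. When there are no ties, $n_c + n_d = n(n-1)/2$.

To allow for perfect correlation when there are ties between subjects in both variables, there is a different version, in which t and u are the numbers of tied observations in each group of ties of the first and second variable, respectively:

$$\tau_b = \frac{S}{\sqrt{\left(n(n-1)/2 - \sum t(t-1)/2\right)\left(n(n-1)/2 - \sum u(u-1)/2\right)}}$$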

Considerations for calculation of correlation

1. If many variables are tested there are many correlations. As with multiple testing, the number of significant correlations is overestimated.

2. Spurious correlations for trends over time (divorce rate vs. price of gasoline)

3. Correlation by heterogeneity (voice frequency vs. body height: correlation based on gender)

4. Trivial correlation

5. Confounding variables (number of storks vs. birth-rate; Simpson’sparadox)

6. Non-linear relations

7. Extreme data points

Regression

We want to describe the relation between a set of data on two continuous variables and predict the value of one variable for an individual when we only know the other variable.

The effect of one variable on the other variable is also of interest. Therefore the relation is directed and the variables are categorized:

X .. independent, predictor variable (plotted on the horizontal x-axis)

Y .. dependent, response or outcome variable (plotted on the vertical y-axis)

Whereas correlation provides the strength and sign of a relation, regression gives a quantitative model of how the dependent variable is related to the independent variable.

Linear regression

Define a statistical model of regression:

$$y_i = f(x_i) + \varepsilon_i, \qquad i = 1, \dots, n$$

where f is the regression function and $\varepsilon_i$ is random noise (error) with $E[\varepsilon_i] = 0$ and variance $\sigma^2$.

For linear regression the regression function is the linear function:

$$f(x) = \beta_0 + \beta_1 x$$

where $\beta_0$ is the intercept and $\beta_1$ the slope of the linear function.

Estimation of parameter

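Least squares method: the sum of squared residuals is minimized by setting its partial derivatives to zero.

$$\frac{\partial}{\partial\beta_0}\sum_{i=1}^{n}(y_i-\beta_0-\beta_1x_i)^2 = 2\sum_{i=1}^{n}(\beta_1x_i+\beta_0-y_i) = 0$$

$$\frac{\partial}{\partial\beta_1}\sum_{i=1}^{n}(y_i-\beta_0-\beta_1x_i)^2 = 2\sum_{i=1}^{n}x_i(\beta_1x_i+\beta_0-y_i) = 0$$

$$\hat\beta_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2} = r\,\frac{s_y}{s_x}$$

$$\hat\beta_0 = \bar{y} - \hat\beta_1\bar{x}$$

$$\hat\varepsilon_i = y_i - \hat{y}_i = y_i - \hat\beta_0 - \hat\beta_1x_i = y_i - \bar{y} - \hat\beta_1(x_i-\bar{x})$$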
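The closed-form estimates translate directly into code. This is a minimal sketch with a hypothetical function name and made-up data. It also illustrates that the residuals of a least squares fit with intercept sum to zero.

```python
def fit_line(x, y):
    """Least squares estimates (intercept, slope) for y = b0 + b1 * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = fit_line(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
```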

Residual variance

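$$s_{res}^2 = \frac{\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}{n-2} = \frac{\sum_{i=1}^{n}(y_i-\bar{y}-\hat\beta_1(x_i-\bar{x}))^2}{n-2} = \frac{n-1}{n-2}\,(1-r^2)\,s_y^2 \approx (1-r^2)\,s_y^2$$

The variance can be divided into the residual (unexplained) variance $s_{res}^2$ and the variance explained by the regression ($s_{reg}^2$):

$$\underbrace{s_y^2}_{\text{total}} = \underbrace{s_{reg}^2}_{\text{explained}} + \underbrace{s_{res}^2}_{\text{unexplained}} = r^2s_y^2 + (1-r^2)\,s_y^2$$

This decomposition is exact for the corresponding sums of squares; for the variances it holds approximately for large n.

⇒ $r^2$ is a measure for the quality of the regression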

Confidence interval

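Slope

$$SE(\hat\beta_1) = \frac{s_{res}}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}}$$

$(1-\alpha)$ CI: $\hat\beta_1 \pm t_{1-\alpha/2}\,SE(\hat\beta_1)$

Estimated y for a given x

$$SE(\hat{y}) = s_{res}\sqrt{\frac{1}{n} + \frac{(x-\bar{x})^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}}$$

$(1-\alpha)$ CI: $\hat{y} \pm t_{1-\alpha/2}\,SE(\hat{y})$

Hypothesis test

$H_0$: There is no relation ⇔ $\beta_1 = 0$

The ratio $\hat\beta_1 / SE(\hat\beta_1)$ is compared with the t-distribution with $df = n-2$.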

Prediction interval

$$s_{pred} = s_{res}\sqrt{1 + \frac{1}{n} + \frac{(x-\bar{x})^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}}$$

$(1-\alpha)$ prediction interval: $\hat{y} \pm t_{1-\alpha/2}\,s_{pred}$

Here the estimated standard deviation of the individual values $y - \hat{y}$ at the value x is used and not the standard error.

Note that the prediction interval is much wider than the confidence interval.

The confidence interval and the prediction interval can be added to the scatter plot around the regression line.

Causality

Correlation and regression are based on a similar mathematical background but are distinct methods with different purposes.

Correlation and regression only give information about association; a causal relation cannot be directly inferred. This applies regardless of the strength of the observed association.

One of the strongest ways to make causal inferences is to conduct an experiment (i.e., systematically manipulate a variable to study its effect on another).

Causal inference

Problem

- Confounding variables (see Simpson's paradox)

Methods

- Pearl's do-operator
- Control by selection (stratification): no variation in the confounding variable
- Statistical control: partial correlation, multiple regression model
- Directionality and time

Pearl’s do operator

from Judea Pearl: Causality - Models, Reasoning, and Inference (Cambridge University Press, 2000)

Partial correlation

$$r_{YZ.X} = \frac{r_{ZY} - r_{ZX}\,r_{XY}}{\sqrt{1-r_{ZX}^2}\,\sqrt{1-r_{XY}^2}}$$
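A small sketch of the partial correlation of Y and Z with X held constant, computed from the three pairwise correlations. The function name and the input values are hypothetical.

```python
import math

def partial_correlation(r_zy, r_zx, r_xy):
    """Correlation between Y and Z after removing the linear effect of X."""
    return (r_zy - r_zx * r_xy) / (
        math.sqrt(1 - r_zx ** 2) * math.sqrt(1 - r_xy ** 2)
    )

# Y and Z correlate with 0.5, but both also correlate strongly with X.
r_partial = partial_correlation(r_zy=0.5, r_zx=0.6, r_xy=0.7)
```

In this made-up case most of the correlation between Y and Z disappears once X is controlled for, which is the typical pattern for a confounding variable.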

Scatter plots

Multiple regression

In observational studies we are interested in the way one variable is influenced by several variables.

$X_1, \dots, X_k$ ... predictor variables, explanatory variables

Y ... dependent, response or outcome variable, expressed as a combination of the explanatory variables

It is not necessary for the explanatory variables to be continuous.

Statistical model:

$$y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \dots + \beta_kx_{ik} + \varepsilon_i$$

where $\beta_0, \dots, \beta_k$ are the regression coefficients.

Multiple regression

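$$Y = X\beta + \varepsilon$$

with $\beta = (\beta_0, \beta_1, \dots, \beta_k)^T$ and

$$Y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \quad X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1k} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{nk} \end{pmatrix}, \quad \varepsilon = \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}$$

Least squares estimator:

$$\sum_{i=1}^{n}\varepsilon_i^2 = (Y - X\beta)^T(Y - X\beta) \to \min! \;\Rightarrow$$

$$\frac{\partial}{\partial\beta}(Y - X\beta)^T(Y - X\beta) = -2X^T(Y - X\beta) = 0$$

$$\hat\beta = (X^TX)^{-1}X^TY$$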
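The estimator can be sketched without external libraries by solving the normal equations $(X^TX)\beta = X^TY$ with Gaussian elimination instead of forming the inverse explicitly. This is an illustrative implementation under the assumption that $X^TX$ is non-singular. The function name and data are made up, and in practice a numerical library would be used.

```python
def least_squares(X, y):
    """Solve (X^T X) beta = X^T y by Gaussian elimination with pivoting."""
    n, p = len(X), len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [
        [sum(X[r][i] * X[r][j] for r in range(n)) for j in range(p)]
        + [sum(X[r][i] * y[r] for r in range(n))]
        for i in range(p)
    ]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, p):
            factor = A[r][col] / A[col][col]
            for c in range(col, p + 1):
                A[r][c] -= factor * A[col][c]
    # Back substitution.
    beta = [0.0] * p
    for i in reversed(range(p)):
        s = sum(A[i][j] * beta[j] for j in range(i + 1, p))
        beta[i] = (A[i][p] - s) / A[i][i]
    return beta

# Design matrix with a leading column of ones for the intercept.
X = [[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1], [1, 2, 1]]
# Noise-free responses generated from y = 1 + 2*x1 + 3*x2.
y = [1, 3, 4, 6, 8]
beta = least_squares(X, y)
```

Because the example data contain no noise, the estimates reproduce the generating coefficients up to rounding error.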

Global F test

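$H_0: \beta_1 = \dots = \beta_k = 0$

Source       df           Sum of squares                                      MSS                                  F-value
Regression   $k$          $SS_{reg} = \sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2$    $MS_{reg} = SS_{reg}/k$              $F = MS_{reg}/MS_{res}$
Residuals    $n-k-1$      $SS_{res} = \sum_{i=1}^{n}(y_i-\hat{y}_i)^2$        $MS_{res} = SS_{res}/(n-k-1)$
Total        $n-1$        $SS_y = \sum_{i=1}^{n}(y_i-\bar{y})^2$

$$SS_y = SS_{reg} + SS_{res}$$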

Goodness-of-fit

$$R^2 = 1 - \frac{SS_{res}}{SS_y} = \frac{SS_{reg}}{SS_y} = \frac{\sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2}{\sum_{i=1}^{n}(y_i-\bar{y})^2}$$

$R^2 \cdot 100\%$ is the percentage of the variability around the overall mean that can be explained by the regression.

The expected value of $R^2$ increases as more variables are added to the model, independent of the influence of each variable ⇒

$$\text{Adjusted } R^2 = 1 - \frac{MS_{res}}{MS_y} = 1 - \frac{n-1}{n-k-1}\,(1-R^2)$$

For simple linear regression $R^2 = r^2$. For multiple regression models R is called the multiple correlation coefficient; however, it must not be interpreted in the same way.
The F-test is the only way to assess whether a model explains a significant proportion of the variability.
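The quantities of the regression ANOVA can be sketched as follows, given observed values, fitted values, and the number k of explanatory variables. The function name and the numbers are illustrative; the fitted values are assumed to come from a least squares line $\hat{y} = 2.2 + 0.6x$ for x = 1, ..., 5.

```python
def fit_summary(y, y_hat, k):
    """R^2, adjusted R^2 and global F value for a fitted regression."""
    n = len(y)
    mean_y = sum(y) / n
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    ss_reg = sum((fi - mean_y) ** 2 for fi in y_hat)
    r2 = 1 - ss_res / ss_y
    adj_r2 = 1 - (n - 1) / (n - k - 1) * (1 - r2)
    f_value = (ss_reg / k) / (ss_res / (n - k - 1))
    return r2, adj_r2, f_value

y = [2, 4, 5, 4, 5]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]  # assumed fitted values
r2, adj_r2, f_value = fit_summary(y, y_hat, k=1)
```

The F value would then be compared with the F-distribution with k and n − k − 1 degrees of freedom, which requires a table or a statistics library.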

Variable selection

A problem arises when the number of variables is high compared to the number of observations.

⇒ Selection of variables:

- Only select those variables which are significant or most significant in pairwise comparison.
- In case of many strongly correlated variables, include only one of them in the model.
- Include variables with already known influence.
- Exclude correlated variables where the influence is not plausible.

Forward selection

- Start with the null model or take only those variables which have to be in the model.
- Add stepwise the variable which leads to the largest reduction of $SS_{res}$.
- Stop the procedure when $SS_{res}$ cannot be reduced (or when the changes are very small) by adding a new variable.

Backward selection

- Start with a model containing all variables.
- Remove variables one by one, each time the variable whose removal leads to the smallest increase of $SS_{res}$.
- Stop the procedure when removing any of the remaining variables would increase $SS_{res}$ substantially.

All subsets regression

One way of selecting the best model is to examine every possible model:

- There are $2^k - 1$ non-empty subsets $\{i_1, \dots, i_p\} \subseteq \{1, 2, \dots, k\}$.
- Calculate for each subset a multiple regression with the variables $X_{i_1}, \dots, X_{i_p}$.
- Choose the model with the smallest p and an acceptable $SS_{res}$.
- Assess goodness-of-fit with the $C_p$ statistic.

Goodness-of-fit measures

$$\text{Adjusted } R^2 = 1 - \frac{MS_{res}}{MS_y} = 1 - \frac{n-1}{n-p-1}\,(1-R^2)$$

F-test: Comparison of a model with $k-1$ variables with a model including an additional variable:

$$F = \frac{SS_{res}(k-1) - SS_{res}(k)}{SS_{res}(k)/(n-k-1)}$$

Mallows' $C_p$:

$$C_p = \frac{SS_{res}(p)}{MS_{res}(k)} - n + 2(p+1)$$

Akaike information criterion (AIC):

$$AIC = n\log\left(\frac{SS_{res}(p)}{n}\right) + 2(p+1) + n$$

Model assumptions

- Linearity: The expected value of Y depends linearly on the explanatory variables.
- Homoscedasticity: The variance of the residuals is homogeneous and independent of the explanatory variables.
- Normality: The residuals are assumed to be normally distributed.

Methods to test assumptions

Two way analysis of variance

In one-way ANOVA the means across only one factor (treatment groups) are compared, whereas in two-way ANOVA the means across two factors are compared.

There are 2 common application cases:

1. Two-way cross classifications (e.g. randomized complete block design, RCBD)

2. Repeated measurements

Two-way cross classifications

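Rows are the levels of factor A, columns are the levels of factor B; each cell contains n observations.

A \ B      Level 1                   Level 2                   ...   Level j                   ...   Level m                   Total
Level 1    $x_{111}, ..., x_{11n}$   $x_{121}, ..., x_{12n}$   ...   $x_{1j1}, ..., x_{1jn}$   ...   $x_{1m1}, ..., x_{1mn}$   $T_{1..}$
...
Level i    $x_{i11}, ..., x_{i1n}$   $x_{i21}, ..., x_{i2n}$   ...   $x_{ij1}, ..., x_{ijn}$   ...   $x_{im1}, ..., x_{imn}$   $T_{i..}$
...
Level k    $x_{k11}, ..., x_{k1n}$   $x_{k21}, ..., x_{k2n}$   ...   $x_{kj1}, ..., x_{kjn}$   ...   $x_{km1}, ..., x_{kmn}$   $T_{k..}$
Total      $T_{.1.}$                 $T_{.2.}$                 ...   $T_{.j.}$                 ...   $T_{.m.}$                 $T_{...}$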

Two way cross classification model

Statistical model:

$$x_{ijl} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \varepsilon_{ijl}$$

where the groups of A are $i = 1, \dots, k$, the groups of B are $j = 1, \dots, m$, and the repeated measurements are $l = 1, \dots, n$.

The model describes whether the data $x_{ijl}$ can be explained by the overall mean $\mu$, the effects $\alpha_i$ of the treatments of factor A, the effects $\beta_j$ of the treatments of factor B, and the interdependency (interaction) $\gamma_{ij}$ between A and B.

This is called the interdependency model; if the last of these effects, $\gamma_{ij}$, is omitted, it reduces to an additive model.

Partitioning the variation

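$$SST = SSA + SSB + SSAB + SSE$$

With $N = kmn$:

$$SST = \sum_{i=1}^{k}\sum_{j=1}^{m}\sum_{l=1}^{n}x_{ijl}^2 - \frac{T_{...}^2}{N}$$

$$SSA = \sum_{i=1}^{k}\frac{T_{i..}^2}{mn} - \frac{T_{...}^2}{N}$$

$$SSB = \sum_{j=1}^{m}\frac{T_{.j.}^2}{kn} - \frac{T_{...}^2}{N}$$

$$SSAB = \sum_{i=1}^{k}\sum_{j=1}^{m}\frac{T_{ij.}^2}{n} - \frac{T_{...}^2}{N} - SSA - SSB$$

$$SSE = SST - SSA - SSB - SSAB$$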
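The partitioning can be sketched for a balanced design in which every cell holds n replicates. The function name, the nested-list layout `cells[i][j]`, and the data are assumptions made for this example.

```python
def two_way_sums_of_squares(cells):
    """Sums of squares for a balanced two-way layout.

    cells[i][j] is the list of n replicates for level i of factor A
    and level j of factor B.
    """
    k, m, n = len(cells), len(cells[0]), len(cells[0][0])
    N = k * m * n
    values = [x for row in cells for cell in row for x in cell]
    correction = sum(values) ** 2 / N  # T...^2 / N
    sst = sum(x * x for x in values) - correction
    ssa = sum(sum(sum(cell) for cell in row) ** 2 for row in cells) / (m * n) - correction
    ssb = sum(
        sum(sum(cells[i][j]) for i in range(k)) ** 2 for j in range(m)
    ) / (k * n) - correction
    ssab = sum(sum(cell) ** 2 for row in cells for cell in row) / n - correction - ssa - ssb
    sse = sst - ssa - ssb - ssab
    return {"SST": sst, "SSA": ssa, "SSB": ssb, "SSAB": ssab, "SSE": sse}

# Made-up data: 2 levels of A, 2 levels of B, 2 replicates per cell.
cells = [[[1, 3], [2, 4]],
         [[5, 7], [10, 12]]]
ss = two_way_sums_of_squares(cells)
```

In this example each cell deviates from its own mean by ±1, so the error sum of squares equals 8, which agrees with the value obtained by subtraction.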

ANOVA table

Variation     df               SSQ     MSS
Factor A      $k-1$            SSA     $MSA = SSA/(k-1)$
Factor B      $m-1$            SSB     $MSB = SSB/(m-1)$
Interaction   $(k-1)(m-1)$     SSAB    $MSAB = SSAB/((k-1)(m-1))$
Error         $N-km$           SSE     $MSE = SSE/(N-km)$
Total         $N-1$            SST

F-values from ANOVA for different effects

Effects     Fixed             Random            Mixed             Mixed
A is        Fixed             Random            Fixed             Random
B is        Fixed             Random            Random            Fixed
Factor A    F = MSA/MSE       F = MSA/MSAB      F = MSA/MSAB      F = MSA/MSE
Factor B    F = MSB/MSE       F = MSB/MSAB      F = MSB/MSE       F = MSB/MSAB
A × B       F = MSAB/MSE      F = MSAB/MSE      F = MSAB/MSE      F = MSAB/MSE

Repeated measurements

This analysis is considered to be an extension of the paired t test, since the measurements are made on the same subject and therefore comprise paired data.

An example for this type of analysis is studying short-term effects of a drug on the heart rate:

           Time (min)
Subject      0     30     60    120
1           96     92     86     92
2          110    106    108    114
3           89     86     85     83
4           95     78     78     83
5          128    124    118    118
6          100     98    100     94
7           72     68     67     71
8           79     75     74     74
9          100    106    104    102

Statistical model for repeated measurements

Statistical model:

$$x_{ij} = \mu + \alpha_i + \beta_j(t_j) + \varepsilon_{ij}$$

where $t_j$ are the time points (or, in general, measuring points) and $\beta_j(t_j)$ is the individual effect of subject j at time point $t_j$.

The question to address is whether the time course is constant ($\alpha_i = 0$) or is changing over the subjects ($\alpha_i \neq 0$).

The valid analysis methods differ in their assumptions about the individual variations:

1. Multivariate one-way model (MANOVA)

2. Univariate model of analysis of variance with repeated measurements

MANOVA

Multivariate analysis of variance (MANOVA) is used when there are 2 or more dependent variables (DV).

MANOVA uses a linear combination z of the response variables which maximizes the ratio of the between-group and within-group variances of z:

$$z_{ik} = c_0 + c_1x_{i1} + \dots + c_kx_{ik}$$

Let H denote the hypothesis sums of squares and cross-products matrix and E the error sums of squares and cross-products matrix. Then $A = HE^{-1}$, and $\lambda_i$ denotes the i-th eigenvalue of A (corresponding to the factors $c_i$ in the linear combination).

MANOVA

The test statistics are:

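$$\text{Pillai's trace} = \mathrm{trace}[H(H+E)^{-1}] = \sum_{i=1}^{k}\frac{\lambda_i}{1+\lambda_i}$$

$$\text{Hotelling-Lawley trace} = \mathrm{trace}(A) = \sum_{i=1}^{k}\lambda_i$$

$$\text{Wilks' } \Lambda = \frac{|E|}{|H+E|} = \prod_{i=1}^{k}\frac{1}{1+\lambda_i}$$

$$\text{Roy's largest root} = \max(\lambda_i)$$

These statistics are translated into F statistics in order to test the null hypothesis.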
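Once the eigenvalues of $A = HE^{-1}$ are known, the four statistics follow directly from them. The sketch below assumes the eigenvalues have already been obtained; the function name and the eigenvalues 1.0 and 0.25 are invented for illustration, and the conversion to F statistics is not shown.

```python
import math

def manova_statistics(eigenvalues):
    """The four classical MANOVA statistics from the eigenvalues of H E^-1."""
    return {
        "pillai": sum(lam / (1 + lam) for lam in eigenvalues),
        "hotelling_lawley": sum(eigenvalues),
        "wilks": math.prod(1 / (1 + lam) for lam in eigenvalues),
        "roy": max(eigenvalues),
    }

stats = manova_statistics([1.0, 0.25])
```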

Univariate model of analysis of variance with repeated measurements

The within-subjects design requires homogeneity of the treatment difference variances. One can create a new set of variables, composed of all possible pairwise differences, and the variances of these differences must all be equal in the population. This is called the sphericity assumption.

The compound symmetry assumption, a special case of the sphericity assumption, is met if all the covariances (the off-diagonal elements of the covariance matrix) are equal and all the variances are equal in the populations being sampled.

Since these assumptions often do not hold for more than 2 time points, there are corrections that account for this, namely the Greenhouse-Geisser and the Huynh-Feldt corrections.

Logistic regression

In many studies the outcome variable of interest is the presence or absence of some condition, or in general a binary variable.

For such data multiple linear regression cannot be used; a similar approach called multiple linear logistic regression, or simply logistic regression, is used instead.

Because a probability is modelled, a linear model for the outcome itself does not work. Instead, the explanatory variables are used to predict a transformation of the dependent variable.

The transformation is called logit:

$$\mathrm{logit}(p) = \log\left(\frac{p}{1-p}\right) \quad\text{where}\quad \frac{p}{1-p} \text{ is the odds}$$

and p is the proportion of individuals with the characteristic. The regression model can be formulated as:

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1x_1 + \beta_2x_2 + \dots + \beta_kx_k$$

Logistic regression

$$p(x) = \frac{e^{\beta_0+\beta_1x}}{1 + e^{\beta_0+\beta_1x}}$$

p(x) is the logistic distribution function (from which the name is derived) and models the probability that y = 1.

If you want to compare predictions for subjects with or without a particular characteristic (explanatory variable) you have:

$$\log\left(\frac{p_1}{1-p_1}\right) - \log\left(\frac{p_2}{1-p_2}\right) = \log\frac{p_1(1-p_2)}{p_2(1-p_1)} = \log(OR)$$

With the logit transformation there is now a linear relation between the explanatory variables and the outcome.

Estimation and tests in logistic regression

Estimation of the regression coefficients $\beta_i$ and their standard errors $SE(\hat\beta_i)$ is done by the maximum likelihood method.

To test whether the influence of $x_i$ on $P(y=1|x_i)$ is significant, the null hypothesis is $H_0: \beta_i = 0$ and the two-sided alternative hypothesis is $\beta_i \neq 0$.

The test statistic is called the Wald statistic:

$$W = \frac{\hat\beta_i}{SE(\hat\beta_i)}$$

which can be approximated by a normal distribution.
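A brief sketch of the Wald test using the standard normal approximation. The function name and the coefficient and standard error in the example are hypothetical, chosen so that W = 1.96.

```python
from statistics import NormalDist

def wald_test(beta_hat, se):
    """Wald statistic and two-sided p-value (standard normal approximation)."""
    w = beta_hat / se
    p_value = 2 * (1 - NormalDist().cdf(abs(w)))
    return w, p_value

w, p_value = wald_test(beta_hat=0.98, se=0.5)
```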

Interpretation of coefficients

Linear model

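$$g(x) = \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1x$$

Binary variable x

For x = 0 and x = 1 ⇒

$$g(0) = \beta_0 \quad\text{and}\quad g(1) = \beta_0 + \beta_1$$

$$\beta_1 = g(1) - g(0) = \log(OR) \quad\text{and}\quad OR = e^{\beta_1}$$

Continuous variable x

If x changes by k units, the log odds change by

$$\Delta g = k\beta_1$$

so that the odds ratio for a change of k units is

$$e^{k\beta_1} = (e^{\beta_1})^k = OR^k$$

where $OR = e^{\beta_1}$ is the odds ratio for a change of one unit ⇒

OR is multiplicative.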

Computation

One issue to consider is that for y = 0 or y = 1 the logit(p) is $-\infty$ or $\infty$.

The method of analysis uses an iterative procedure whereby the answer is obtained by several repeated cycles of calculation using the maximum likelihood approach.

The k + 1 non-linear equations can sometimes lead to numerical problems. It is recommended that data from at least 20 events and 20 non-events are available for each explanatory variable.

Due to the computational complexity, logistic regression can only be found in large statistical packages.
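To illustrate the iterative procedure, the following sketch fits a logistic model with a single explanatory variable by Newton-Raphson steps on the log-likelihood. This is one common way of carrying out the maximum likelihood estimation and is not necessarily the algorithm a given package uses. The function name, the fixed number of iterations, and the data are assumptions.

```python
import math

def fit_logistic(x, y, iterations=25):
    """Maximum likelihood fit of logit(p) = b0 + b1 * x by Newton-Raphson."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        p = [1 / (1 + math.exp(-(b0 + b1 * xi))) for xi in x]
        # Gradient of the log-likelihood.
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum(xi * (yi - pi) for xi, yi, pi in zip(x, y, p))
        # Information matrix (negative Hessian), a symmetric 2x2 matrix.
        w = [pi * (1 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: beta += H^-1 * gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    se_b1 = math.sqrt(h00 / det)  # from the inverse information matrix
    return b0, b1, se_b1

# Binary predictor: 1 of 4 events for x = 0 and 3 of 4 events for x = 1.
x = [0, 0, 0, 0, 1, 1, 1, 1]
y = [0, 0, 0, 1, 0, 1, 1, 1]
b0, b1, se_b1 = fit_logistic(x, y)
odds_ratio = math.exp(b1)
```

For a binary predictor the fitted odds ratio agrees with the value ad/(bc) from the 2x2 table (here 9), and the standard error of the slope agrees with $\sqrt{1/a + 1/b + 1/c + 1/d}$. With such a small sample the estimates are very imprecise; the example only illustrates the mechanics.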

Quality of the prediction

Information from different significant influence factors (explanatory variables) can be combined in the prognostic index (PI).

$$PI = \beta_1x_1 + \beta_2x_2 + \dots + \beta_kx_k$$

As for diagnostic tests, PI can be divided at different cut-points and the quality of the prediction can be studied by a receiver operating characteristic (ROC) curve.

Here, for every cut-point c of the prognostic index, it is studied how well the outcome is predicted by the binary variable PI > c.

The area under the curve (AUC) is a measure for the quality, which can also be compared with that of each univariate predictor (explanatory variable).
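The AUC can be computed without drawing the ROC curve, because it equals the proportion of (event, non-event) pairs in which the subject with the event has the higher prognostic index, with ties counted as one half. The sketch below uses this pairwise definition; the function name and the PI values are invented.

```python
def auc(pi_events, pi_non_events):
    """Area under the ROC curve from prognostic index values of both groups."""
    wins = 0.0
    for pe in pi_events:
        for pn in pi_non_events:
            if pe > pn:
                wins += 1.0
            elif pe == pn:
                wins += 0.5
    return wins / (len(pi_events) * len(pi_non_events))

# Hypothetical prognostic index values.
pi_events = [0.9, 0.8, 0.4]
pi_non_events = [0.7, 0.3, 0.2]
area = auc(pi_events, pi_non_events)
```

A value of 0.5 corresponds to a prediction no better than chance, and 1.0 to perfect separation of the two groups.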

Discriminant analysis

We wish to find some combination of variables that classifies a large proportion of subjects into the correct group, so that we have a good chance of allocating (diagnosing) new subjects correctly.

The basic idea of discriminant analysis is to find the combination of variables that maximizes the separation between the groups, as with logistic regression.

With more than two groups, the groups can be further separated by constructing a second combination of the same variables. These combinations are called canonical variates or discriminant functions.

Discriminant analysis

Subject    $x_1$   $x_2$   ...    $x_k$
A            96      92     86     92
A            79      75     74     74
A            89      86     85     83
A            95      78     78     83
B           128     124    118    118
B           100      98    100     94
B           110     106    108    114
B            93      87     91     89

The discriminant function can be defined as:

$$y = \beta_0 + \beta_1x_1 + \dots + \beta_kx_k$$

The parameters $\beta_i$ are estimated in such a way that the ratio of the between-groups variance to the within-groups variance is maximal.

Discriminant analysis

Discriminant (function) analysis (DA) is mathematically identical to a single factor MANOVA: DA is multivariate analysis of variance (MANOVA) reversed. In MANOVA, the independent variables are the groups and the dependent variables are the predictors. In DA, the independent variables are the predictors and the dependent variables are the groups.

Factor analysis and ordination techniques

Explorative methods to find an elementary explanation model for mutual relations.

Overview of common ordination techniques

            indirect                                         direct
linear      Principal component analysis (PCA)               Redundancy analysis (RDA)
unimodal    (Detrended) correspondence analysis ((D)CA)      Canonical CA (CCA)

Another common method in this context is multidimensional scaling (MDS).

Principal component analysis(PCA)

Variables are summarized by linear combinations, the principal components.

The origin of the coordinate system is moved to the center of the data (mean centering).

The coordinate system is rotated so that the variance along the first axis is maximal ⇒ The first principal component (PC) is in the direction of the maximum variance from the origin; subsequent PCs are orthogonal to the first PC and describe the maximum residual variance.

This method can be approached by a singular value decomposition of the (m × n) data matrix X.

Principal component analysis (PCA)

$$X = UWV^T \quad\text{with}\quad UU^T = V^TV = VV^T = I$$

For mean centered data the covariance matrix C can be calculated by $XX^T$ (up to a scaling factor).

The columns of U are eigenvectors of $XX^T$; the corresponding eigenvalues, defined by the characteristic equation $|C - \lambda I| = 0$, are the squares of the singular values in the diagonal of W.

Transformation of the input vectors into the principal component space can be described by $Y = XU$, where the projection of sample i along the axis defined by the j-th principal component is:

$$y_{ij} = \sum_{t=1}^{m}x_{it}u_{tj}$$

PCA for gene expression data

Correspondence analysis (CA)

CA is an extension of the analysis of contingency tables. In this case, the states of some descriptors (objects in rows) are compared with those of other descriptors (variables in columns).

The aim of CA is to reduce the contingency table to a few summarizing variables, showing a lack of independence between rows and columns.

The approach is a combination of the $\chi^2$ statistic and a singular value decomposition similar to that for principal component analysis.

We start with an r × c contingency table, where $T_i$ is the row total of row i and $T_j$ is the column total of column j.

The total number is N and the number of observations in row i and column j is $n_{ij}$.

Page 227: Biostatistics and Experimental Design - …genome.tugraz.at/biostatistics/biostat2009.pdf · Biostatistics and Experimental Design Hubert Hackl Institute for Genomics and Bioinformatics

Correspondence analysis (CA)

$$\chi^2 = \sum \frac{(O - E)^2}{E} \Rightarrow$$

A matrix S with elements $s_{ij}$ can be constructed where

$$s_{ij}^2 = \frac{\left(\frac{n_{ij}}{N} - \frac{T_i T_j}{N^2}\right)^2}{\frac{T_i T_j}{N^2}}$$

The matrix S can be factorized by singular value decomposition:

$S = U W V^T$, with the singular values $\lambda_k$ on the diagonal of W.

W is a diagonal matrix, and its diagonal elements are referred to as the singular values of S. We think of them as sorted from the largest to the smallest and denote them by $\lambda_k$.

Correspondence analysis (CA)

The coordinates for sample i in the new space are then given by $a_{ik} = \lambda_k u_{ik} / \sqrt{T_i/N}$ for k = 1, ..., J, and the variables are viewed in the same space, with variable j given the coordinates $b_{jk} = \lambda_k v_{jk} / \sqrt{T_j/N}$ for k = 1, ..., J.

These coordinates are called principal coordinates.
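
The CA steps above can be sketched with NumPy as follows. The function name and the small table of counts are hypothetical, and the signed root $s_{ij} = (n_{ij}/N - T_i T_j/N^2) / \sqrt{T_i T_j/N^2}$ is assumed for the elements of S.

```python
import numpy as np

def correspondence_analysis(table):
    """Return S, singular values and principal coordinates of rows (a) and columns (b)."""
    n = np.asarray(table, dtype=float)
    N = n.sum()
    row = n.sum(axis=1) / N                 # T_i / N
    col = n.sum(axis=0) / N                 # T_j / N
    expected = np.outer(row, col)           # T_i T_j / N^2
    S = (n / N - expected) / np.sqrt(expected)
    U, lam, Vt = np.linalg.svd(S, full_matrices=False)
    a = U * lam / np.sqrt(row)[:, None]     # a_ik = lam_k u_ik / sqrt(T_i / N)
    b = Vt.T * lam / np.sqrt(col)[:, None]  # b_jk = lam_k v_jk / sqrt(T_j / N)
    return S, lam, a, b

table = [[20, 10, 5], [10, 15, 10], [5, 10, 25]]   # made-up counts
S, lam, a, b = correspondence_analysis(table)
chi2 = np.asarray(table).sum() * (S ** 2).sum()    # N times the sum of s_ij^2
print(chi2, lam)
```

Under these assumptions N times the sum of all $s_{ij}^2$ reproduces the usual $\chi^2$ statistic of the table, which links the decomposition back to the test of independence.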


Survival analysis

Survival analysis involves the modelling of time-to-event data. In biostatistics the event is death or another event (relapse, re-hospitalization).

In other disciplines this type of analysis is also known as reliability analysis (engineering) or duration analysis (economics).

The aim is to describe survival times statistically and to compare the survival times of several groups (the longer the survival times, the better the therapy).

It is also sometimes important to find relations between survival times and other explanatory variables (age, type of therapy, severity of disease, ...).

Censored data

Censored data (incomplete follow-up) arise when a study is finished before all patients have died (withdrawn alive).

Another case is when patients have to be excluded from the study for other reasons (emigration, accidental death).

In general, patients are recruited to the study at different time points (e.g. time point of surgery, indicated here as time = 0).

Survival function

If T is a continuous random variable of survival times with cumulative distribution function F(t) and density f(t), the survival function is defined as:

$$S(t) = P(T > t) = \int_t^{\infty} f(u)\,du = 1 - F(t)$$

The survival function S(t) gives the proportion of patients (probability) surviving beyond time t.

The survival function often follows a Weibull ($e^{-(t/\lambda)^k}$) or exponential ($e^{-(t/\lambda)}$) distribution.

Kaplan-Meier survival curves

The survival function can be estimated by Kaplan-Meier curves (Kaplan-Meier estimator).

Each event (death) is indicated by a step in the curve and censored data are indicated by (+).

Kaplan-Meier estimator

The Kaplan-Meier estimator is calculated using conditional probabilities.

The probability of surviving 100 days would then be the product of the daily conditional survival probabilities, $p = p_1 \times p_2 \times ... \times p_{100}$.

In general, for the cumulative survival proportion $p_k$ up to day k:

$$p_k = p_{k-1} \times \frac{r_k - f_k}{r_k} \Rightarrow S(t) = \prod_{t_k \le t} \left(1 - \frac{f_k}{r_k}\right)$$

where $r_k$ is the number of subjects still at risk (still being followed up) immediately before the k-th day, and $f_k$ is the number of observed events on day k.

The standard error of the survival proportion (not for small and very large sample size) can be calculated:

$SE(p_k) = p_k \sqrt{(1 - p_k)/r_k}$ and 95% CI: $p_k \pm 1.96\,SE(p_k)$
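
A minimal sketch of the product-limit calculation and the standard error formula above. The function name and the follow-up data are hypothetical; an event indicator of 0 marks a censored observation.

```python
import math

def kaplan_meier(times, events):
    """Return (t_k, S(t_k), SE) at each event time; events: 1 = event, 0 = censored."""
    data = sorted(zip(times, events))
    at_risk = len(data)          # r_k: subjects at risk just before t_k
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        tied = [e for u, e in data if u == t]   # all subjects leaving at time t
        f = sum(tied)                           # f_k: observed events at t
        if f > 0:
            surv *= 1 - f / at_risk
            se = surv * math.sqrt((1 - surv) / at_risk)
            curve.append((t, surv, se))
        at_risk -= len(tied)
        i += len(tied)
    return curve

# Hypothetical follow-up times in days.
curve = kaplan_meier([6, 7, 10, 15, 19, 25], [1, 0, 1, 1, 0, 1])
for t, s, se in curve:
    print(f"t={t:>3}  S(t)={s:.3f}  95% CI: {s - 1.96 * se:.3f} to {s + 1.96 * se:.3f}")
```

Censored subjects never produce a step, but they shrink the risk set for all later event times, which is how incomplete follow-up still contributes information.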

Life table analysis

Life tables describe data where the results are grouped into time intervals, often of equal length. This method is often described as actuarial.

The method of calculation is similar to the Kaplan-Meier method, but differences arise because the times are recorded less precisely.

However, in general the Kaplan-Meier analysis is recommended.

Comparison of Kaplan-Meier curves


Logrank test

The most common (non-parametric) method of comparing independent groups of survival times is the logrank test.

The null hypothesis here is that the groups come from the same population.

The survival times of both groups are ranked together and time intervals are defined between the survival times, with the time of one (or more) event(s) as the upper limit of each interval.

For each time interval we have a 2 × 2 table:

            group 1    group 2    total
events      f1         f2         f
no events   r1 − f1    r2 − f2    r − f
total       r1         r2         r

Logrank test

For the number of observed and expected events we get

$$O_i = \sum_{j=1}^{k} f_i^{j} \quad \text{and} \quad E_i = \sum_{j=1}^{k} \frac{r_i^{j} f^{j}}{r^{j}} \quad \text{for } k \text{ time intervals}$$

(the superscript j indexes the time interval and is not a power), and the test statistic

$$X^2 = \sum_{i=1}^{m} \frac{(O_i - E_i)^2}{E_i} \quad \text{for } m \text{ groups.}$$

Under the null hypothesis the statistic $X^2$ has a $\chi^2$ distribution with df = m − 1.

A different approach for two groups of observations is to calculate the variance of $f_1 - r_1 f / r$ at each time,

$$v = \frac{r_1 r_2 f (r - f)}{r^2 (r - 1)},$$

and to sum it over the k time intervals, $V = \sum_{j=1}^{k} v_j$, which gives the test statistic

$$X^2 = \frac{(O_1 - E_1)^2}{V}.$$
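
The variance-based version of the logrank statistic for two groups can be sketched as below. The function name and the toy data are assumptions for illustration; group labels are taken to be 1 and 2, and each distinct event time defines one 2 × 2 table.

```python
def logrank_two_groups(times, events, groups):
    """Return O1, E1, V and X^2 = (O1 - E1)^2 / V for groups labelled 1 and 2."""
    O1 = E1 = V = 0.0
    rows = list(zip(times, events, groups))
    for t in sorted({u for u, e, _ in rows if e == 1}):      # distinct event times
        r1 = sum(1 for u, _, g in rows if u >= t and g == 1)  # at risk in group 1
        r2 = sum(1 for u, _, g in rows if u >= t and g == 2)  # at risk in group 2
        f1 = sum(1 for u, e, g in rows if u == t and e == 1 and g == 1)
        f2 = sum(1 for u, e, g in rows if u == t and e == 1 and g == 2)
        r, f = r1 + r2, f1 + f2
        O1 += f1
        E1 += r1 * f / r
        if r > 1:                                            # variance undefined for r = 1
            V += r1 * r2 * f * (r - f) / (r ** 2 * (r - 1))
    return O1, E1, V, (O1 - E1) ** 2 / V

# Toy data: six uncensored subjects, three per group.
O1, E1, V, X2 = logrank_two_groups([1, 2, 3, 4, 5, 6], [1] * 6, [1, 1, 1, 2, 2, 2])
print(O1, E1, V, X2)
```

The resulting X² would be compared with a $\chi^2$ distribution with df = 1; with so few subjects the approximation is of course only illustrative.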

Mantel-Haenszel method

The Mantel-Haenszel method is used when several 2 × 2 frequency tables are combined.

There are 2 applications for this:

1.) Overview or meta-analysis of many clinical trials
2.) Control for a confounder

The Mantel-Haenszel method is based on forming a 2 × 2 table at each level of the confounder (or for each study).

                  Outcome
                  +      −      total
Exposure   +      a      b      r1
           −      c      d      r2
total             c1     c2     n

Mantel-Haenszel method

Under the null hypothesis the frequency a has mean and variance:

$$E_0(a) = \frac{r_1 c_1}{n} \quad \text{and} \quad Var_0(a) = \frac{r_1 r_2 c_1 c_2}{n^2 (n - 1)}$$

The Mantel-Haenszel test is based on the z-statistic or, equivalently, on the square of the z-score, which is compared with a chi-square distribution with df = 1.

$$z = \frac{\sum a - \sum (r_1 c_1 / n)}{\sqrt{\sum \left(r_1 r_2 c_1 c_2 / (n^2 (n - 1))\right)}}$$

where the summation is across the levels of the confounder.

The odds ratio is pooled across levels of the confounder:

$$OR_{MH} = \frac{\sum (ad / n)}{\sum (bc / n)}$$
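
As a small sketch of these formulas, the function below pools a list of 2 × 2 tables, one per stratum. The function name and the two example tables are invented for illustration.

```python
import math

def mantel_haenszel(tables):
    """tables: list of (a, b, c, d), one 2x2 table per level of the confounder."""
    sum_a = sum_e = sum_v = num = den = 0.0
    for a, b, c, d in tables:
        r1, r2 = a + b, c + d          # row totals
        c1, c2 = a + c, b + d          # column totals
        n = r1 + r2
        sum_a += a
        sum_e += r1 * c1 / n                               # E0(a)
        sum_v += r1 * r2 * c1 * c2 / (n ** 2 * (n - 1))    # Var0(a)
        num += a * d / n
        den += b * c / n
    z = (sum_a - sum_e) / math.sqrt(sum_v)
    return z, num / den

# Two hypothetical strata (e.g. two age groups).
z, or_mh = mantel_haenszel([(10, 20, 5, 25), (15, 15, 10, 20)])
print(f"z = {z:.3f}, z^2 = {z * z:.3f}, pooled OR = {or_mh:.2f}")
```

Pooling the numerators and denominators separately, rather than averaging per-stratum odds ratios, keeps strata with small counts from dominating the estimate.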

Stratified logrank test

If we are interested in comparing 2 groups given different treatments, we may wish to stratify by age or another prognostic variable.

When there are 2 groups of subjects, calculate $O_1$, $E_1$, $O_2$, and $E_2$ for each subgroup (stratum) of interest. These are then summed over all strata and the logrank statistic is calculated as

$$X^2 = \frac{(\sum O_1 - \sum E_1)^2}{\sum E_1} + \frac{(\sum O_2 - \sum E_2)^2}{\sum E_2}$$

Logrank test for trend

For three or more ordered groups (e.g. cancer stages) there is a more appropriate test that considers a trend in survival across the groups.

Assign a code to each group (e.g. $h_1 = -1$, $h_2 = 0$, $h_3 = 1$).

For each group calculate the following quantities

$$D_i = h_i (O_i - E_i), \quad F_i = h_i E_i, \quad \text{and} \quad H_i = h_i^2 E_i$$

and sum over groups

$$D = \sum_{i=1}^{m} D_i, \quad F = \sum_{i=1}^{m} F_i, \quad H = \sum_{i=1}^{m} H_i$$

The test statistic is $X^2 = D^2 / V$ with df = 1 and $V = H - F^2 / E$, where $E = \sum_{i=1}^{m} E_i$.
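
The trend statistic is simple enough to compute by hand, but a short sketch makes the bookkeeping explicit. The function name and the observed and expected counts are made up, and E is assumed to be the total of the expected events over all groups.

```python
def logrank_trend(codes, observed, expected):
    """X^2 for trend from group codes h_i, observed events O_i and expected events E_i."""
    D = sum(h * (o - e) for h, o, e in zip(codes, observed, expected))
    F = sum(h * e for h, e in zip(codes, expected))
    H = sum(h * h * e for h, e in zip(codes, expected))
    E = sum(expected)                  # assumption: E is the sum of all E_i
    V = H - F ** 2 / E
    return D ** 2 / V

# Hypothetical three ordered groups with codes -1, 0, 1.
x2_trend = logrank_trend([-1, 0, 1], [5, 10, 15], [10, 10, 10])
print(x2_trend)   # compared with a chi-square distribution with df = 1
```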

Hazard ratio

The logrank test is solely a hypothesis test, comparing survival in two or more groups.

Relative survival in two groups can be measured by comparing the observed numbers of events with the expected numbers.

The hazard ratio is defined as

$$R = \frac{O_1 / E_1}{O_2 / E_2}$$

and gives an estimate of the relative event rates in the two groups.

$K = \frac{O_1 - E_1}{V}$ is an estimate of the log hazard ratio (ln R).

$SE \approx \frac{1}{\sqrt{V}}$ and 95% CI: $K \pm 1.96 / \sqrt{V}$.
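
A brief sketch of both hazard ratio estimates from the logrank quantities. The function name and the input numbers are hypothetical; exponentiating the limits of the interval for ln R to obtain an interval for R itself is a step added here, not one stated in the slides.

```python
import math

def hazard_ratio_summary(O1, E1, O2, E2, V):
    """Return R, K (estimate of ln R) and a 95% CI for R obtained by exponentiating K +/- 1.96/sqrt(V)."""
    R = (O1 / E1) / (O2 / E2)
    K = (O1 - E1) / V
    half_width = 1.96 / math.sqrt(V)
    ci = (math.exp(K - half_width), math.exp(K + half_width))
    return R, K, ci

# Made-up logrank results for two groups.
R, K, ci = hazard_ratio_summary(O1=30, E1=20, O2=20, E2=30, V=12)
print(f"R = {R:.2f}, exp(K) = {math.exp(K):.2f}, 95% CI {ci[0]:.2f} to {ci[1]:.2f}")
```

The two point estimates R and exp(K) are generally close but not identical, since they are built from the same quantities in different ways.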

Hazard function

The hazard function (failure rate, force of mortality) is closely related to the survival curve and represents the risk of dying in a very short time interval after a given time, assuming survival so far.

$$h(t) = \lim_{\Delta t \to 0} \frac{S(t) - S(t + \Delta t)}{\Delta t \times S(t)} = \frac{F'(t)}{S(t)}$$
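
As a short worked example, the definition can be applied to the exponential and Weibull survival functions, taken here as $S(t) = e^{-t/\lambda}$ and $S(t) = e^{-(t/\lambda)^k}$. The derivation only uses $F'(t) = -S'(t)$.

```latex
\begin{align*}
h(t) &= \frac{F'(t)}{S(t)} = -\frac{S'(t)}{S(t)} = -\frac{d}{dt} \ln S(t) \\
\text{Exponential: } S(t) = e^{-t/\lambda} &\;\Rightarrow\; h(t) = \frac{1}{\lambda} \quad \text{(constant hazard)} \\
\text{Weibull: } S(t) = e^{-(t/\lambda)^k} &\;\Rightarrow\; h(t) = \frac{k}{\lambda} \left(\frac{t}{\lambda}\right)^{k-1}
\end{align*}
```

For k = 1 the Weibull hazard reduces to the constant exponential hazard; for k > 1 the hazard increases with time and for k < 1 it decreases.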

Relative risk and proportional hazards model

When a population is divided into 2 subpopulations, exposed (E) and non-exposed ($\bar{E}$), by the presence or absence of a certain characteristic (an exposure such as smoking), each subpopulation has its own hazard function and the relative risk is defined as

$$RR(t) = \frac{h(t, E)}{h(t, \bar{E})}$$

If RR(t) = c is constant over time, we have a proportional hazards model:

$$h(t, E) = c \times h(t, \bar{E})$$

Cox regression

Since we have a multiplicative model (exposure raises the risk by a multiplicative constant) it can also be expressed as

$h(t) = h_0(t) e^{\beta x}$ with $h_0(t) = h(t, \bar{E})$ and the covariate x = 1 for the exposed and x = 0 for the unexposed population.

The Cox regression model considers several independent variables of interest ($X_1, ..., X_p$):

$$h(t) = h_0(t) e^{\beta_1 x_1 + ... + \beta_p x_p}$$

Adding all the hazards up to time t to get the risk of dying between time 0 and time t gives the cumulative hazard

$$H(t) = H_0(t) e^{\beta_1 x_1 + ... + \beta_p x_p}$$

Cox regression

The survival probability can be estimated for any individual with specific values of the variables in the model

$$S(t) = e^{-H(t)}$$

A positive sign of the regression coefficient means that the hazard is higher, and thus the prognosis worse, for subjects with higher values of this variable.

An individual regression coefficient is interpreted through the hazard ratio for two different values of the covariate x:

$$\frac{h_1(t)}{h_2(t)} = \frac{h_0(t) e^{\beta x_1}}{h_0(t) e^{\beta x_2}} = e^{\beta x_1 - \beta x_2} = e^{\beta (x_1 - x_2)}$$

In the special case of a binary variable the hazard ratio is $e^{\beta}$.

A prognostic index can be defined as previously:

$$PI = \beta_1 x_1 + ... + \beta_p x_p \Rightarrow S(t) = e^{-H_0(t) e^{PI}} = S_0(t)^{e^{PI}}$$
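
To illustrate how the prognostic index turns a baseline survival probability into an individual one, here is a minimal sketch. The coefficient, the covariate values and the baseline survival of 0.8 are invented, and the baseline survival is assumed to be $S_0(t) = e^{-H_0(t)}$.

```python
import math

def cox_survival(s0_t, beta, x):
    """Individual survival S(t) = S0(t) ** exp(PI) with PI = sum of beta_i * x_i."""
    pi = sum(b * v for b, v in zip(beta, x))   # prognostic index
    return s0_t ** math.exp(pi)

beta = [0.5]                                   # hypothetical coefficient of a binary covariate
s_unexposed = cox_survival(0.8, beta, [0])     # PI = 0 gives the baseline survival
s_exposed = cox_survival(0.8, beta, [1])       # hazard ratio exp(0.5) lowers the survival
print(s_unexposed, s_exposed, math.exp(beta[0]))
```

Because the exponent $e^{PI}$ is larger than 1 for a positive PI and $S_0(t)$ lies between 0 and 1, a higher prognostic index always gives a lower survival probability.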

Partial likelihood function for the Cox model

The partial likelihood function is used to estimate the β coefficients in the proportional hazards model and is constructed by comparing failing subjects to those not failing at each time t.

$$L = \prod_{i=1}^{k} \frac{P(\text{subject with } x_i \text{ fails at } t)}{P(\text{some subject in risk set failed at } t)}$$

$$L = \prod_{i=1}^{k} \frac{h_i(t)}{\sum_{j \in R(t_i)} h_j(t)} = \prod_{i=1}^{k} \left( \frac{e^{\beta^T x_i}}{\sum_{j \in R(t_i)} e^{\beta^T x_j}} \right)^{\delta_i}$$

with $R(t_i) = \{j : t_j \ge t_i\}$, $\beta = (\beta_1, ..., \beta_p)^T$, $h_0(t)$ denoting the baseline hazard function, $\delta_i = 0$ for censored data and $\delta_i = 1$ otherwise.

The regression coefficients can be estimated ($\hat{\beta}$) by maximizing the partial likelihood function.
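
The logarithm of the partial likelihood is what is usually maximized in practice. The sketch below evaluates it for a given β on a tiny invented data set; the function name is hypothetical, tied event times are not treated specially, and no optimizer is included.

```python
import math

def log_partial_likelihood(beta, times, events, X):
    """Sum over events of beta^T x_i - log(sum over risk set of exp(beta^T x_j))."""
    lp = [sum(b * v for b, v in zip(beta, xi)) for xi in X]   # beta^T x_i per subject
    total = 0.0
    for i, (t_i, delta_i) in enumerate(zip(times, events)):
        if delta_i == 1:                                      # censored subjects add no factor
            risk = sum(math.exp(lp[j]) for j, t_j in enumerate(times) if t_j >= t_i)
            total += lp[i] - math.log(risk)
    return total

times, events = [1, 2, 3], [1, 0, 1]      # hypothetical data, second subject censored
X = [[1], [0], [1]]                       # one binary covariate
print(log_partial_likelihood([0.0], times, events, X))
print(log_partial_likelihood([1.0], times, events, X))
```

Censored subjects contribute no factor of their own but remain in the risk sets of all earlier event times, and the baseline hazard never needs to be specified because it cancels from each ratio.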


Experimental design

Basic principles

- Replication
- Independence and pseudo-replication
- Controls
- Randomization
- Interspersion (blocking, stratification)
- Design types
- Power analysis

Replication

- Reduce the effect of uncontrolled variation (i.e., increase precision)
- Quantify uncertainty
- Increase power of the significance test (power analysis)

Pseudo replicates

- "Incorrect" replication when replicating samples, not treatments
- Replicates are not independent
- Type I error (α) approaches 1 with increasing number of samples per unit.

Controls

- Any treatment against which one or more other treatments are compared
- It may be an "untreated" treatment, a "procedural" treatment, or simply a different treatment
- Controls must undergo the same experimental procedure as the treated units (e.g. injection of a saline solution)
- This allows the effects of different aspects of the experimental procedure to be separated

Randomization

- Random sampling from clearly defined populations
- Experimental subjects ("units") should be assigned to treatment groups at random (which does not mean haphazardly)
- One needs to randomize explicitly using a computer, dice, ...
- Avoids bias
- Ensures that statistical inferences are reliable

Interspersion

- Interspersion is necessary to avoid unbalanced effects of unforeseen events (e.g. weather or other defects) between treatment and control.
- Even with randomization simple segregation can occur (with 3-fold replication the chances are 10%!).
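
One way to arrive at the 10% figure is to enumerate all layouts, assuming two treatments with 3 replicates each arranged in a single row of six units (an assumption about the intended setting). A layout counts as segregated when all treated units sit together at one end of the row.

```python
from itertools import combinations

units = 6                                              # 3 treated + 3 control units in a row
layouts = list(combinations(range(units), 3))          # possible positions of the treated units
segregated = [p for p in layouts if p in ((0, 1, 2), (3, 4, 5))]
fraction = len(segregated) / len(layouts)
print(len(layouts), len(segregated), fraction)
```

Two of the twenty equally likely layouts are fully segregated, which is why randomization alone is sometimes supplemented by blocking.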

Common design types

- Factorial designs
- Completely randomized design
- Randomized complete block design
- Latin square design
- Crossover designs
- Nested design
- Split-plot design
- Repeated measurements

Factorial design

One-factorial experiment
The aim is to study the effect of one single factor (with several levels).

For example, the only factor of interest is drug treatment; all other factors (age, weight, sex, ...) are ignored (but should be kept constant).

Multi-factorial experiment
The design incorporates two or more factors that are crossed with each other. The term crossed indicates that all combinations of the factors are included and that every level (group) of each factor occurs in combination with every level of the other factors.

A multi-factorial design allows the study of interactions between factors.

A two-factorial design is analysed with a two-way ANOVA.

Randomized complete block design (RCBD)

- Treatments are assigned at random within blocks of adjacent subjects, each treatment once per block.
- The number of blocks is the number of replications.
- Any treatment can be adjacent to any other treatment, but not to the same treatment within the block.
- Used to control variation in an experiment by accounting for spatial effects.

Sample layout with 4 treatments (A-D) and 4 blocks (I-IV):

Block I     A B C D
Block II    D A B C
Block III   B D C A
Block IV    C A B D

Latin square design (LSD)

- Treatments are assigned at random within rows and columns, with each treatment once per row and once per column.
- There are equal numbers of rows, columns, and treatments.
- Useful where the experimenter wants to control variation in two different directions.

Sample layout with 4 treatments (A-D) assigned to 4 rows (I-IV) and 4 columns (1-4):

           Column
           1 2 3 4
Row I      A B C D
Row II     C D A B
Row III    D C B A
Row IV     B A D C
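
The defining property of a Latin square (each treatment once per row and once per column) is easy to check programmatically. The sketch below validates the sample layout and draws a randomized square by permuting the rows and columns of a cyclic square. Both function names are hypothetical, and this simple shuffle does not sample uniformly from all possible Latin squares.

```python
import random

def is_latin_square(square):
    """True if every symbol occurs exactly once in each row and each column."""
    n = len(square)
    symbols = set(square[0])
    rows_ok = all(len(row) == n and set(row) == symbols for row in square)
    cols_ok = all({row[j] for row in square} == symbols for j in range(n))
    return len(symbols) == n and rows_ok and cols_ok

def random_latin_square(treatments, rng):
    """Shuffle rows and columns of a cyclic square (one simple randomization scheme)."""
    n = len(treatments)
    base = [[treatments[(i + j) % n] for j in range(n)] for i in range(n)]
    rng.shuffle(base)                          # permute rows
    cols = list(range(n))
    rng.shuffle(cols)                          # permute columns
    return [[row[j] for j in cols] for row in base]

slide_layout = ["ABCD", "CDAB", "DCBA", "BADC"]   # rows I-IV of the sample layout
print(is_latin_square(slide_layout))
for row in random_latin_square(list("ABCD"), random.Random(1)):
    print(" ".join(row))
```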

Crossover design

An experimental design that combines attributes of Latin squares and repeated measures designs is the crossover design, often used in experiments that apply multiple treatments to individual organisms.

In its simplest form, the crossover design can be considered as a Latin square where subjects are one blocking factor (e.g. rows), time periods are a second blocking factor (e.g. columns), and treatments are applied to each combination of subject and period using one of the Latin square randomizations.

Carryover effects are a problem in this type of design.

Nested design

Multi-factorial experimental designs where a factor (B) is crossed with one factor (C) but nested within another (A).

A second factor (or set of factors) is then applied to whole blocks, with replicate blocks for each level of this factor.

Split-plot design

Split-plot designs were originally used in agricultural experiments and represent a randomized complete block design, with one or more factors applied to the experimental units within each block.

A second factor (or set of factors) is then applied to whole blocks, with replicate blocks for each level of this factor.

The units of replication differ between factors.

Repeated measure designs

Factor A: units of replication termed "subjects"
Factor B (subjects) nested within A
Factor C: repeated recordings on each subject

Completely randomized design (2-factor design (2 × 8) with 10 replicates) ⇒ 160 subjects needed

Power analysis

There is a relation between the 4 parameters of a significance test: sample size n, significance level α (commonly 0.05), power 1 − β (commonly 80%), and effect size δ = ∆/σ (standardized difference of means).

1. Clearly define the null hypothesis and the alternative hypothesis
2. Identify the statistical model to be applied to the data, the desired power and the significance level
3. Identify the assumptions of the statistical procedure
4. Obtain some pilot estimate of variation
5. Specify the effect size (e.g. from other studies of the same biological system)
6. Calculate the sample size

Power analysis

Cohen (1988) suggested values for small, medium and large standardized differences (δ = 0.2, 0.5, 0.8).

A more useful approach may be to plot detectable effect size versus sample size or the power versus the effect size.

If there are constraints on the size of the experiment or sampling program, an estimate of σ, the chosen values for α and β, and the number of observations possible can be used to determine the minimum detectable effect size (MDES).

Experimental design for cDNA microarrays


Types of study design

1. Retrospective studies (of past events) including case-control studies
2. Prospective studies (of future events)
3. Cohort studies or epidemiological design (of ongoing or future events)
4. Clinical trials

Basic structure for different designs


Types of studies

Therapy study
Effectiveness of a drug, new surgery or alternative methods
Design: RCT

Diagnosis study
Validity and reliability of new diagnostic tests
Design: Cross-sectional

Screening study
Investigation of test results
Design: Cross-sectional

Prognosis study
Progress of an early diagnosed disease
Design: Cohort

Causal study
Association between dangerous substances and a disease
Design: Cohort, Case control

Hierarchy of medical studies


Clinical trials

Clinical studies form a class of all scientific approaches to evaluating medical disease prevention, diagnostic techniques, and treatments. Within this class, trials, often called clinical trials, form the subset of clinical studies that evaluate investigational drugs.

- Phase I trials focus on the safety of a new investigational medicine. These are the first human trials after successful animal trials.
- Phase II trials are small trials to evaluate efficacy and focus more on a safety profile.
- Phase III trials are well-controlled trials, the most rigorous demonstration of a drug's efficacy prior to federal regulatory approval.
- Phase IV trials are often conducted after a medicine is marketed to provide additional details about the medicine's efficacy and a more complete safety profile.

Clinical trials

The goal in a phase I trial is to identify a maximum tolerated dose (MTD), a dose that has reasonable efficacy (i.e. is toxic enough to kill cancer cells) but with tolerable toxicity (i.e. not toxic enough to kill the patient).

Phase I trials enrol patients for whom standard treatment has failed and who are at high risk of death in the short term.

In phase II trials the optimal dose (MTD) is applied to a small group of patients meeting predefined inclusion criteria (there are also exclusion criteria), and the response rate, i.e. the proportion or percentage of patients who respond, is studied.

A second type of phase II trial consists of small comparative trials in which we want to establish the efficacy of a new drug against a control or standard regimen.

Clinical trials

Phase III/IV trials are larger studies, and the standard is a randomized double-blind controlled trial ("gold standard").

Controlled: The drug is tested against a control group receiving a placebo or the standard treatment. The size, shape and procedure should be very similar in order to control for psychological and emotional effects.

Randomized: Whether a patient gets the drug or the placebo is assigned randomly.

Stratified randomization: If confounding variables (e.g. age) are expected, patients are stratified and treatment is randomly assigned within each stratum.

Minimization: A non-random treatment allocation for smaller trials. The allocation is based on the balance of several parameters, so that the treatment of patient n + 1 is assigned based on the sum of the numbers within the stratified variables (e.g. age ≤ 50 or age > 50).

Clinical trials

Double-blind: Blind to the patient and blind to the investigator (triple-blind means that regulatory officers/statisticians are also "blinded").

Selection of subjects: Based on inclusion/exclusion criteria

Alternative designs

- Crossover design
- Within group (paired) comparisons
- Sequential design
- Factorial design
- Adaptive design
- Zelen's design

Sample size

Sample size for Phase II trials and surveys:

$$n = \frac{(z_{1-\alpha})^2\, p (1 - p)}{d^2} \quad \text{(response rate)}$$

Sample size for other Phase II trials:

$$n = \frac{(z_{1-\alpha})^2 s^2}{d^2} \quad \text{(continuous endpoint)}$$

$$n = \frac{(z_{1-\alpha} + z_{1-\beta})^2}{\left(\frac{1}{2} \ln \frac{1+r}{1-r}\right)^2} + 3 \quad \text{(correlation endpoint)}$$

Sample size

Phase II designs for selection:

$$N = \frac{4 (z_{1-\alpha})^2 s^2}{d^2} \quad \text{(continuous endpoint)}$$

$$N = \frac{4 (z_{1-\alpha})^2\, p (1 - p)}{(p_2 - p_1)^2} \quad \text{(binary endpoint)}$$

Phase III trials:

$$N = \frac{4 (z_{1-\alpha} + z_{1-\beta})^2 \sigma^2}{d^2} \quad \text{(comparison of 2 means)}$$

$$N = \frac{4 (z_{1-\alpha} + z_{1-\beta})^2\, p (1 - p)}{(p_2 - p_1)^2} \quad \text{(comparison of 2 proportions)}$$
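
The two Phase III formulas can be evaluated with the standard library alone. In this sketch the function names are hypothetical, $z_{1-\alpha}$ is used exactly as written (for a two-sided test one would use $z_{1-\alpha/2}$ instead), p is assumed to be the average of $p_1$ and $p_2$, and N is read as the total over both arms.

```python
import math
from statistics import NormalDist

def z(p):
    """Quantile of the standard normal distribution."""
    return NormalDist().inv_cdf(p)

def n_two_means(d, sigma, alpha=0.05, power=0.80):
    """N = 4 (z_{1-alpha} + z_{1-beta})^2 sigma^2 / d^2 with power = 1 - beta."""
    return 4 * (z(1 - alpha) + z(power)) ** 2 * sigma ** 2 / d ** 2

def n_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """N = 4 (z_{1-alpha} + z_{1-beta})^2 p (1 - p) / (p2 - p1)^2."""
    p = (p1 + p2) / 2        # assumption: p is the average of the two proportions
    return 4 * (z(1 - alpha) + z(power)) ** 2 * p * (1 - p) / (p2 - p1) ** 2

total_means = math.ceil(n_two_means(d=1.0, sigma=1.0))   # difference of one standard deviation
total_props = math.ceil(n_two_proportions(0.3, 0.5))     # hypothetical response rates
print(total_means, total_props)
```

Rounding up is the usual convention, since a fractional subject cannot be recruited and rounding down would fall short of the desired power.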

Number-needed-to-treat

Experimental event rate: $EER = \frac{a}{a + b}$

Control event rate: $CER = \frac{c}{c + d}$

Relative risk: $RR = \frac{EER}{CER}$

Relative risk reduction: $RRR = \frac{EER - CER}{CER}$

Absolute risk reduction: $ARR = EER - CER$

Number-needed-to-treat: $NNT = \frac{1}{ARR} = \frac{1}{EER - CER}$
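
A small sketch of these measures for a single 2 × 2 table, with a and b taken as the numbers of treated subjects with and without the event and c and d as the corresponding numbers of controls (an assumption about the table layout). The function name and the counts are invented, and the sign convention ARR = EER − CER of the formulas above is kept, so a beneficial treatment yields negative ARR and NNT values.

```python
def risk_measures(a, b, c, d):
    """EER, CER, RR, RRR, ARR and NNT with the sign convention ARR = EER - CER."""
    eer = a / (a + b)
    cer = c / (c + d)
    arr = eer - cer
    return {"EER": eer, "CER": cer, "RR": eer / cer,
            "RRR": arr / cer, "ARR": arr, "NNT": 1 / arr}

# Hypothetical trial: 10 of 100 treated and 20 of 100 control subjects have the event.
m = risk_measures(10, 90, 20, 80)
print(m)
print("patients to treat to prevent one event:", abs(m["NNT"]))
```

In practice the NNT is usually reported as a positive number, which is why the absolute value is printed at the end.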

Study protocol

- International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use (ICH) guidelines for Good Clinical Practice (GCP).
- Formal document outlining the proposed procedures (it basically contains all information, from patient selection criteria to responsibilities)
- For protocol violations (e.g. patients didn't take their treatments) the only safe way is to keep those patients in the analysis as intended (intention-to-treat).

Safety

Sponsor
Informing the local site investigators of the true historical safety record of the drug; monitoring the results of the study (Data Monitoring Committee (DMC), also known as Data Safety Monitoring Board); collecting adverse event reports; writing the site-specific informed consent

Local site investigator
Conducting the study according to the study protocol; obtaining truly informed consent (risks, potential benefits)

Institutional review board (IRB) or Ethics Committee
Scrutinizes the study for both medical safety and protection of the patients

Regulatory agencies (FDA, EMEA)
Review all study data before allowing the drug to proceed to the next phase; audit the local site investigators

Discussion of medical literature

Medical journals and sites

How to choose a statistical test

Bayesians vs. Frequentists

Frequentists
The population value is seen as fixed (but unknown); confidence intervals and hypothesis tests are calculated for it. All information comes from the data.

Bayesians
The population mean follows a distribution (prior probability). The data are used to update the prior probability distribution, which gives the posterior probability distribution. From this a 95% credible interval can be constructed, which is often narrower than the confidence interval when the prior is informative. Difficulties can arise in choosing the prior distribution (prior), and some Bayesian methods may lead to intractable computational problems.
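
A small numerical comparison may make the two views concrete. The sketch below uses the simplest textbook case, chosen here as an assumption and not taken from the slides: normally distributed data with known standard deviation sigma and a normal prior on the population mean. In this particular model the posterior is again normal, so no heavy computation is needed. All function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)  # about 1.96


def frequentist_ci(xbar, sigma, n):
    """95% confidence interval for the mean, sigma assumed known."""
    half = Z95 * sigma / sqrt(n)
    return xbar - half, xbar + half


def bayesian_credible_interval(xbar, sigma, n, prior_mean, prior_sd):
    """95% credible interval under a normal prior (conjugate normal model)."""
    prior_precision = 1 / prior_sd ** 2
    data_precision = n / sigma ** 2
    post_precision = prior_precision + data_precision
    # Posterior mean is a precision-weighted average of prior mean and sample mean
    post_mean = (prior_precision * prior_mean + data_precision * xbar) / post_precision
    post_sd = sqrt(1 / post_precision)
    half = Z95 * post_sd
    return post_mean - half, post_mean + half


if __name__ == "__main__":
    # Hypothetical numbers: sample mean 1.0 from n = 4 observations, sigma = 1,
    # prior centred at 0 with standard deviation 1
    print(frequentist_ci(1.0, 1.0, 4))
    print(bayesian_credible_interval(1.0, 1.0, 4, 0.0, 1.0))
```

In this model the credible interval is narrower than the confidence interval and is pulled towards the prior mean; a very wide prior makes the two intervals nearly coincide.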

Dos and Don’ts

- Don't carry out a significance test, get a large P value, and then interpret this as meaning that there is no difference.

- A confidence interval for the mean difference would be much better than significance tests. A non-significant difference in 10 subjects cannot be interpreted.

- Quote your p values correctly to one significant figure (e.g. p = 0.007); do not use p < 0.013, p < 0.01, p > 0.05, p = NS.

- "Significant" should not be used if you mean "important".

- Don't do direct comparison of p-values. It is not correct to compare two groups by testing changes in each one separately. Significance does not depend only on magnitude, but also on variability and sample size. A two-sample t method should be used to compare the log ratios in the two groups.
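
The recommendation to report a confidence interval for the mean difference, and to compare the two groups directly, can be sketched as follows. Note the assumptions: the function name is hypothetical, and the interval uses the normal quantile as a large-sample approximation because the Python standard library has no t quantile. For small samples, such as the 10 subjects mentioned above, a two-sample t method would give a wider and more appropriate interval.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def diff_of_means_ci(group_a, group_b, level=0.95):
    """Large-sample CI for mean(group_a) - mean(group_b).

    Uses unpooled sample variances and a normal quantile; this is an
    approximation, not the two-sample t method.
    """
    diff = mean(group_a) - mean(group_b)
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    zq = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return diff, diff - zq * se, diff + zq * se


if __name__ == "__main__":
    # Hypothetical measurements in two groups
    a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.4]
    b = [4.2, 4.8, 4.5, 5.0, 4.4, 4.7]
    diff, lower, upper = diff_of_means_ci(a, b)
    print(f"difference = {diff:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

The interval shows both the size of the difference and its uncertainty, which a pair of separate p-values for the two groups cannot do.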

Dos and Don’ts

- Always state if you are using SD, SE or CI. Avoid the ± notation.

- Give confidence intervals (or SEs) for the comparisons between groups, rather than for the individual group means.

- Don't use three-dimensional effects in graphs.

- Tests of significance at baseline should not be done. If the subjects are randomized, they come from the same population and the null hypothesis is true. There is no reason to test it.

- Don't analyze the data as if they all came from the same population, ignoring, for example, the fact that 21 groups of subjects come from 9 different trials.

- Don't use chi-square tests to analyze ordered categorical data.

Guidelines

1. Read the journal's instructions to authors. If they do not cover statistics, use those of one of the major general medical journals.

2. Never, ever, conclude that there is no difference or relationship because it is not significant.

3. Give confidence intervals where you can.

4. Give exact P values where possible, not P < 0.05 or P = NS, though only one significant figure is necessary.

5. Be clear what your main hypothesis and outcome variable are. Avoid multiple testing.

Guidelines

6. Get the design right, be clear about blinding and randomization, do a sample size calculation if you can.

7. Be clear whether you are quoting standard deviations or standard errors, avoid ± notation.

8. Avoid bar charts with error bars.

9. Check the assumptions of your statistical methods.

10. Give clear descriptions of your statistical methods.

11. Decide in advance which baseline characteristics you should adjust for, then do it.
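
The advice to give exact P values to one significant figure can be followed mechanically when preparing tables. A minimal sketch, with a hypothetical helper name, using Python's general number format:

```python
def format_p(p):
    """Format an exact P value to one significant figure, e.g. 0.0071 -> 'P = 0.007'."""
    if not 0 <= p <= 1:
        raise ValueError("a P value must lie between 0 and 1")
    # '.1g' keeps one significant figure; very small values appear in
    # scientific notation (e.g. 1e-05), which may need journal-specific handling.
    return f"P = {p:.1g}"


if __name__ == "__main__":
    for value in (0.0071, 0.0449, 0.3):
        print(format_p(value))
```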

