Post on 15-Jul-2019
De nihilo nihil
Statistical Modelling
Causal Relationships
response: blood pressure
explanatory: caffeine intake
disturbing factor: body weight
disturbing factor: cigarette smoke
causation association
Statistical Modelling
... entails the analysis of the functional relationship between response (or dependent) variables and explanatory (or independent) variables, including adjustment for uncontrollable disturbing factors.
Statistical Modelling
Basic Approaches
Experimental Modelling: experimental evaluation of the effects of given explanatory variables upon a response variable, involving either randomisation or matching (or 'control') for known disturbing factors (e.g. temperature and humidity as determinants of the adhesion of dental prostheses)
Observational Modelling: observation-based analysis of the relationship between a response variable and several explanatory variables (e.g. birth weight and gestational age)
Linear Models
Y = a + b1x1 + b2x2 + ... + bkxk + Ε
Y: response variable
X1,...,Xk: explanatory variables
Ε: random error
Ε is generally assumed to be N(0,σ²) with unknown σ
Use of multiple linear (and other) models allows the regression coefficients bi to be estimated while taking the influence of disturbing factors into account ('adjustment').
Linear Models
[Figure: E(Y) = ypred = a+b1x1+...+bkxk; Ε is the deviation of Y from E(Y)]
Miss America Body Features 1984 - 2002
[Figure: scatter plot of body weight (pounds, 90 to 150) against body height (inches, 62 to 72) with fitted regression line]
y: body weight (pounds), x1: body height (inches)
ypred=-111.29+3.44⋅x1
Statistical Modelling Procedure
1. Data Exploration: isolated assessment of the possible relevance of each explanatory variable
2. Model Formulation: mathematical modelling of the multifaceted relationship between explanatory and response variables, invoking scientific plausibility
3. Model Selection: parameter estimation ('regression'), hypothesis testing (e.g. likelihood ratio, p value, coefficient of determination)
4. Model Checking: comparison between model predictions and observations ('residual diagnostics')
Prediction of Body Fat Percentage
Body fat percentage can be determined by dual energy X-ray absorptiometry (DXA), a fairly accurate but time-consuming and expensive technique. Measurements of triceps skin fold thickness, thigh circumference and mid arm circumference, on the other hand, may not be as accurate as DXA, but are quicker and cheaper to perform.
from: J. Neter, W. Wasserman, M.H.Kutner (1997) Applied Linear Statistical Models
Prediction of Body Fat Percentage
response: body fat
explanatory: skin fold
explanatory: thigh
explanatory: mid arm
Prediction of Body Fat Percentage
Variables Y, X1,...,X3 were measured simultaneously in 20 individuals.
body fat (%)   skin fold (mm)   thigh (cm)   mid arm (cm)
Y              X1               X2           X3
11.9           19.5             43.1         29.1
22.8           24.7             49.8         28.2
18.7           30.7             51.9         37.0
20.1           29.8             54.3         31.1
12.9           19.1             42.2         30.9
21.7           25.6             53.9         23.7
27.1           31.4             58.5         27.6
...
Multiple Linear Regression
Data Exploration
pair-wise Pearson correlation coefficients r (upper right half) and two-sided p values for r=0 (lower left half)
      Y        X1       X2       X3
Y              0.843    0.878    0.142
X1    <0.001            0.924    0.458
X2    <0.001   <0.001            0.085
X3    0.549    0.042    0.723
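A pair-wise correlation matrix of this kind can be sketched with NumPy. The sketch below uses only the seven rows printed in the transcript, not the full set of 20 individuals, so its coefficients are not expected to reproduce the values in the table; the variable names are chosen here for illustration.

```python
import numpy as np

# Seven rows as printed in the transcript (the full data set has 20 individuals).
# Columns: Y body fat (%), X1 skin fold (mm), X2 thigh (cm), X3 mid arm (cm)
data = np.array([
    [11.9, 19.5, 43.1, 29.1],
    [22.8, 24.7, 49.8, 28.2],
    [18.7, 30.7, 51.9, 37.0],
    [20.1, 29.8, 54.3, 31.1],
    [12.9, 19.1, 42.2, 30.9],
    [21.7, 25.6, 53.9, 23.7],
    [27.1, 31.4, 58.5, 27.6],
])

# rowvar=False: each column is a variable, each row an observation
r = np.corrcoef(data, rowvar=False)

labels = ["Y", "X1", "X2", "X3"]
for label, row in zip(labels, r):
    print(label, np.round(row, 3))
```

The matrix is symmetric with ones on the diagonal, which is why the slide can show r in one half and the p values in the other.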
Multiple Linear Regression
Data Exploration
[Figure: scatter plot of body fat (%) against skin fold thickness (mm) with fitted regression line]
y: body fat (%), x1: skin fold thickness (mm)
ypred=-1.496+0.857⋅x1
R²=0.711
Multiple Linear Regression
Data Exploration
[Figure: scatter plot of body fat (%) against thigh circumference (cm) with fitted regression line]
y: body fat (%), x2: thigh circumference (cm)
ypred=-23.634+0.857⋅x2
R²=0.771
Multiple Linear Regression
Data Exploration
[Figure: scatter plot of body fat (%) against mid arm circumference (cm) with fitted regression line]
y: body fat (%), x3: mid arm circumference (cm)
ypred=14.687+0.199⋅x3
R²=0.020
Multiple Linear Regression
Model Formulation
Y = a + b1x1 + b2x2 + b3x3 + Ε
linear model with normal error Ε
Model Selection
Backward Selection: stepwise reduction of the number of explanatory variables, starting with the "full" model
Forward Selection: stepwise inclusion of explanatory variables, starting with the best variable (e.g. that with the smallest p value)
Multiple Linear Regression
Model (Backward) Selection
Parameter estimation from model equations using maximum likelihood or least squares methods
(xj,i: value of Xj in individual i)
y1 = a + b1⋅x1,1 + b2⋅x2,1 + b3⋅x3,1 + ε1
y2 = a + b1⋅x1,2 + b2⋅x2,2 + b3⋅x3,2 + ε2
...
y20 = a + b1⋅x1,20 + b2⋅x2,20 + b3⋅x3,20 + ε20
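Least squares estimation of a, b1, b2 and b3 from such a system of equations might look as follows. This is a minimal sketch using `numpy.linalg.lstsq` on the seven rows printed in the transcript only, so the estimates will differ from those reported for all 20 individuals; the variable names are illustrative.

```python
import numpy as np

# Seven rows as printed in the transcript (the full data set has 20 individuals)
y = np.array([11.9, 22.8, 18.7, 20.1, 12.9, 21.7, 27.1])  # body fat (%)
X = np.array([
    [19.5, 43.1, 29.1],   # columns: skin fold (mm), thigh (cm), mid arm (cm)
    [24.7, 49.8, 28.2],
    [30.7, 51.9, 37.0],
    [29.8, 54.3, 31.1],
    [19.1, 42.2, 30.9],
    [25.6, 53.9, 23.7],
    [31.4, 58.5, 27.6],
])

# Design matrix: a column of ones for the intercept a, then x1, x2, x3
A = np.column_stack([np.ones(len(y)), X])

# Least squares solution of y = A @ coef + residuals
coef, _, rank, _ = np.linalg.lstsq(A, y, rcond=None)
a, b1, b2, b3 = coef
residuals = y - A @ coef

print(f"a={a:.3f}, b1={b1:.3f}, b2={b2:.3f}, b3={b3:.3f}")
```

With a normal error Ε, the least squares and maximum likelihood estimates of the coefficients coincide.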
Multiple Linear Regression
Full Model
term             estimate   s.e.
a (intercept)    117.085    99.782
b1 (skin fold)   4.334      3.016
b2 (thigh)       -2.857     2.582
b3 (mid arm)     -2.186     1.595
ypred=117.085+4.334⋅x1-2.857⋅x2-2.186⋅x3
R²=0.895
Multiple Linear Regression
Model (Backward) Selection
For each regression coefficient bi, test the null hypothesis Hi,0: bi=0 against the alternative Hi,A: bi≠0 using, for example, a Wald test:
Wi = b̂i / s.e.(b̂i)
Since Wi∼N(0,1) under Hi,0, reject Hi,0 if |Wi| > z1-α/2.
Multiple Linear Regression
Model (Backward) Selection
term             W        p
a (intercept)    1.173    0.258
b1 (skin fold)   1.437    0.170
b2 (thigh)       -1.106   0.285
b3 (mid arm)     -1.370   0.190
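The Wald statistics of the full model can be recomputed from the tabulated estimates and standard errors. In this sketch the significance level α=0.05 (critical value 1.96) is an assumption, since the slides do not state one, and only W is computed because the tabulated p values may rest on a different reference distribution.

```python
# Full-model estimates and standard errors as tabulated in the transcript
full_model = {
    "a (intercept)": (117.085, 99.782),
    "b1 (skin fold)": (4.334, 3.016),
    "b2 (thigh)": (-2.857, 2.582),
    "b3 (mid arm)": (-2.186, 1.595),
}

Z_CRIT = 1.96  # z_{1-alpha/2} for alpha = 0.05 (assumed, not given in the slides)

# Wald statistic: estimate divided by its standard error
wald = {term: est / se for term, (est, se) in full_model.items()}

for term, w in wald.items():
    verdict = "reject H0" if abs(w) > Z_CRIT else "do not reject H0"
    print(f"{term}: W = {w:.3f} -> {verdict}")
```

Small differences in the last digit against the table are to be expected, because the inputs here are already rounded.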
Multiple Linear Regression
Final Model
term             estimate   s.e.
a (intercept)    6.792      4.488
b1 (skin fold)   1.001      0.128
b3 (mid arm)     -0.431     0.177
ypred=6.792+1.001⋅x1-0.431⋅x3
R²=0.887
Multiple Linear Regression
Final Model
term             W        p
a (intercept)    1.513    0.149
b1 (skin fold)   7.803    <0.001
b3 (mid arm)     -2.442   0.026
Multiple Linear Regression
Model Checking
verification whether (random) error Ε is N(0,σ²)
'standardized residuals': εi = (yi - ypred,i) / s, where s is the standard deviation of the residuals y - ypred
[Figure: standardized residuals (-2.0 to 2.0) plotted against body fat (%) (10 to 30)]
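Standardized residuals for the final model can be sketched as below. The coefficients are those tabulated for the final model, the data are the seven rows printed in the transcript only, and the use of the sample standard deviation (`ddof=1`) for s is an assumption.

```python
import numpy as np

# Seven rows as printed in the transcript (the full data set has 20 individuals)
y = np.array([11.9, 22.8, 18.7, 20.1, 12.9, 21.7, 27.1])   # body fat (%)
x1 = np.array([19.5, 24.7, 30.7, 29.8, 19.1, 25.6, 31.4])  # skin fold (mm)
x3 = np.array([29.1, 28.2, 37.0, 31.1, 30.9, 23.7, 27.6])  # mid arm (cm)

# Final model from the transcript: ypred = 6.792 + 1.001*x1 - 0.431*x3
y_pred = 6.792 + 1.001 * x1 - 0.431 * x3

residuals = y - y_pred
s = residuals.std(ddof=1)        # assumption: sample standard deviation
standardized = residuals / s     # residuals in units of their standard deviation

for fat, e in zip(y, standardized):
    print(f"body fat {fat:5.1f} %  standardized residual {e:6.2f}")
```

If Ε is indeed N(0,σ²), most standardized residuals should lie between -2 and 2 without any visible pattern.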
Multiple Linear Regression
Model Checking
[Figure: four panels (a), (b), (c) and (d), each plotting the residual (around 0) against the response variable]
Other (Normal) Linear Models
Analysis of Variance (ANOVA): explanatory variables are either qualitative or quantitative, but discrete
Analysis of Covariance (ANCOVA): some explanatory variables are continuous, some are discrete (multiple regression)
Linear Models
Y: response variable
X1,...,Xk: explanatory variables
Ε: N(0,σ²) with unknown σ
Y = a + b1x1 + b2x2 + ... + bkxk + Ε
E(Y) = a + b1x1 + b2x2 + ... + bkxk + E(Ε)
E(Y) = a + b1x1 + b2x2 + ... + bkxk
Generalised Linear Models
Y: response variable
X1,...,Xk: explanatory variables
G: link function
G[E(Y)] = a + b1x1 + b2x2 + ... + bkxk
Logistic Regression
for a dichotomous response variable Y: E(Y) = 0⋅P(Y=0)+1⋅P(Y=1) = P(Y=1) = π
logit(π) = a + b1x1 + b2x2 + ... + bkxk
Generalised Linear Model with the 'logit' as the link function
Logistic Regression
logit(x) = ln(x/(1-x))
[Figure: logit(x) plotted for x between 0.0 and 1.0, with values from -6 to 6]
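The logit link is easy to evaluate directly. A minimal sketch, with the function name `logit` chosen here for illustration:

```python
import math


def logit(x: float) -> float:
    """Log odds ln(x / (1 - x)) of a probability x with 0 < x < 1."""
    if not 0.0 < x < 1.0:
        raise ValueError("x must lie strictly between 0 and 1")
    return math.log(x / (1.0 - x))


for p in (0.1, 0.5, 0.9):
    print(p, round(logit(p), 3))
```

The function maps probabilities in (0,1) onto the whole real line, which is what allows an unbounded linear predictor on the right-hand side of the model.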
Adjusted Odds Ratio
Let X1 be a dichotomous explanatory variable (e.g. 1:"exposed", 0:"not exposed")
logit(πe) = a + b1⋅1 + b2x2 + ... + bkxk
logit(πn) = a + b1⋅0 + b2x2 + ... + bkxk
b1 = logit(πe) - logit(πn) = ln(πe/(1-πe)) - ln(πn/(1-πn)) = ln[(πe/(1-πe)) / (πn/(1-πn))] = ln(OR)
OR = exp(b1)
The Evans County Heart Study
In 1960, the entire population of Evans County, Georgia, aged 40 and over were given a complete cardiovascular examination. Some 609 white males were followed for 9 years to determine their coronary heart disease (CHD) status.
Hames C (1971) Arch Intern Med 128: 883-886.
The Evans County Heart Study
Y: CHD status (dichotomous) 0:"no", 1:"yes"
x1: catecholamine level (CAT; dichotomous) 0:"low", 1:"high"
x2: age (years)
x3: cholesterol (CHL; mg/dL)
x4: smoking status (dichotomous) 0:"never smoker", 1:"ever smoker"
x5: hypertension (dichotomous) 0:"no", 1:"yes"
x6: ECG abnormalities (dichotomous) 0:"no", 1:"yes"
from: Kleinbaum DG (1994) Logistic Regression - A Self-Learning Text. Springer, New York
Logistic Regression
Data Exploration
                       CHD
explanatory variable   no (n=538)   yes (n=71)   p
CAT (%)                95 (18%)     27 (38%)     <0.001
age                    53 ± 9       57 ± 10      0.002
CHL                    210 ± 39     222 ± 39     0.021
smoking (%)            333 (62%)    54 (76%)     0.025
hypertension (%)       212 (39%)    43 (60%)     <0.001
ECG (%)                137 (26%)    29 (41%)     0.010
number and percentage, or mean±s.e., with p values from χ²-test or t-test, respectively
The Evans County Heart Study
Unadjusted Odds Ratios

CAT            CHD   no CHD
high           27    95
low            44    443
ORCAT = (27⋅443)/(95⋅44) = 2.86

Smoking        CHD   no CHD
yes            54    333
no             17    205
ORsmoke = (54⋅205)/(333⋅17) = 1.96

Hypertension   CHD   no CHD
yes            43    212
no             28    326
ORhyp = (43⋅326)/(212⋅28) = 2.36

ECG abnormality   CHD   no CHD
yes               29    137
no                42    401
ORECG = (29⋅401)/(137⋅42) = 2.02
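The unadjusted odds ratios follow directly from the cross-product of each 2x2 table. A short sketch, with the helper `odds_ratio` and its argument names introduced here for illustration:

```python
def odds_ratio(exposed_chd, exposed_no_chd, unexposed_chd, unexposed_no_chd):
    """Cross-product odds ratio of a 2x2 table."""
    return (exposed_chd * unexposed_no_chd) / (exposed_no_chd * unexposed_chd)


# Cell counts as given in the transcript:
# (exposed with CHD, exposed without CHD, unexposed with CHD, unexposed without CHD)
tables = {
    "CAT": (27, 95, 44, 443),
    "smoking": (54, 333, 17, 205),
    "hypertension": (43, 212, 28, 326),
    "ECG": (29, 137, 42, 401),
}

unadjusted = {name: odds_ratio(*cells) for name, cells in tables.items()}

for name, value in unadjusted.items():
    print(f"OR {name}: {value:.2f}")
```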
Logistic Regression
Model Formulation
logit(π) = a + b1x1 + b2x2 + ... + b6x6
logistic model with π=E(Y) equal to the 9-year incidence proportion (or "risk") of CHD
Logistic Regression
The Full Model
term                estimate   s.e.
a (intercept)       -6.772     1.140
b1 (CAT)            0.598      0.352
b2 (age)            0.032      0.015
b3 (CHL)            0.009      0.003
b4 (smoking)        0.834      0.305
b5 (hypertension)   0.439      0.291
b6 (ECG)            0.369      0.294
The Evans County Heart Study
Adjusted versus Unadjusted Odds Ratios
                               odds ratio
term                estimate   adjusted   unadjusted
b1 (CAT)            0.598      1.82       2.86
b4 (smoking)        0.834      2.30       1.96
b5 (hypertension)   0.439      1.55       2.36
b6 (ECG)            0.369      1.49       2.02
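Since OR = exp(b) for a dichotomous explanatory variable, adjusted odds ratios can be obtained by exponentiating the full-model estimates. A minimal sketch based on the rounded estimates printed in the transcript, so the last digit may deviate from the tabulated values:

```python
import math

# Full-model estimates of the dichotomous explanatory variables (from the transcript)
estimates = {
    "CAT": 0.598,
    "smoking": 0.834,
    "hypertension": 0.439,
    "ECG": 0.369,
}

# Adjusted odds ratio: OR = exp(b)
adjusted_or = {term: math.exp(b) for term, b in estimates.items()}

for term, value in adjusted_or.items():
    print(f"adjusted OR {term}: {value:.2f}")
```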
Logistic Regression
Model (Backward) Selection
term                W        p
a (intercept)       -5.940   <0.001
b1 (CAT)            1.698    0.089
b2 (age)            2.123    0.034
b3 (CHL)            2.680    0.007
b4 (smoking)        2.734    0.006
b5 (hypertension)   1.509    0.131
b6 (ECG)            1.258    0.208
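Two-sided p values for Wald statistics that are N(0,1) under the null hypothesis can be sketched with the complementary error function from the standard library. The helper name `two_sided_p` and the 0.05 level are assumptions made for this illustration; the inputs are the rounded W values from the table, so results may differ from the tabulated p values in the last digit.

```python
import math


def two_sided_p(w: float) -> float:
    """Two-sided p value 2*(1 - Phi(|w|)) for W ~ N(0,1) under H0."""
    return math.erfc(abs(w) / math.sqrt(2.0))


# Wald statistics of the full logistic model (from the transcript)
wald = {
    "a (intercept)": -5.940,
    "b1 (CAT)": 1.698,
    "b2 (age)": 2.123,
    "b3 (CHL)": 2.680,
    "b4 (smoking)": 2.734,
    "b5 (hypertension)": 1.509,
    "b6 (ECG)": 1.258,
}

p_values = {term: two_sided_p(w) for term, w in wald.items()}

ALPHA = 0.05  # assumed significance level, not stated in the slides
for term, p in p_values.items():
    flag = "significant" if p < ALPHA else "not significant"
    print(f"{term}: p = {p:.3f} ({flag})")
```

Backward selection proper removes one variable at a time and refits the model after each removal; this sketch only screens the full model once.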
Logistic Regression
The Final Model
term             estimate   s.e.
a (intercept)    -7.027     1.107
b2 (age)         0.051      0.014
b3 (CHL)         0.007      0.003
b4 (smoking)     0.851      0.301
logit(π) = -7.027+0.051⋅x2+0.007⋅x3+0.851⋅x4
ORsmoke unadjusted: 1.96, adjusted: 2.34
Logistic Function (logit⁻¹)
logit⁻¹(x) = 1/(1+exp(-x))
π = 1/(1+exp(-a-b1x1-b2x2-...-bkxk))
[Figure: logit⁻¹(x) plotted for x between -10 and 10, rising from 0.0 to 1.0]
The Evans County Heart Study
What is the 9-year CHD risk of a 45-year-old ever-smoker with a cholesterol level of 260 mg/dL?
x2=45, x3=260, x4=1
π = 1/(1+exp(7.027-0.051⋅45-0.007⋅260-0.851⋅1)) = 1/(1+exp(2.061)) = 0.113
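The same risk calculation can be sketched in a few lines, using the final-model coefficients from the transcript; the function and variable names are illustrative.

```python
import math


def inverse_logit(x: float) -> float:
    """Logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


# Final-model coefficients from the transcript
a, b_age, b_chl, b_smoke = -7.027, 0.051, 0.007, 0.851

# 45-year-old ever-smoker with a cholesterol level of 260 mg/dL
age, chl, smoker = 45, 260, 1

linear_predictor = a + b_age * age + b_chl * chl + b_smoke * smoker
risk = inverse_logit(linear_predictor)

print(f"logit(pi) = {linear_predictor:.3f}, 9-year CHD risk = {risk:.3f}")
```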
Logistic Regression
Screening Test
The comparison of the individual risk, π, with a given threshold, ρ, provides a "screening" test for the disease.
π > ρ: test positive
π ≤ ρ: test negative
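Logistic Regression
Screening Test (ROC Curve)
[Figure: ROC curve with axes labelled 1-sensitivity and specificity, each from 0 to 1; marked point at specificity 0.61 and 1-sensitivity 0.32]
ρ: 0.11
sensitivity: 0.68
specificity: 0.61
Youden's index: 0.29
baseline risk: 71/(71+538)=0.12
PPV: 0.19
NPV: 0.93
AUC: 0.68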
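Youden's index and the predictive values can be derived from sensitivity, specificity and baseline risk. The sketch below applies Bayes' theorem to the rounded figures in the transcript; whether the slide values were computed this way or from the underlying counts is not stated, so agreement is only expected to within rounding.

```python
sensitivity = 0.68
specificity = 0.61
prevalence = 71 / (71 + 538)   # baseline risk of CHD in the cohort

# Youden's index: sensitivity + specificity - 1
youden = sensitivity + specificity - 1

# Positive predictive value: P(disease | test positive)
true_pos = sensitivity * prevalence
false_pos = (1 - specificity) * (1 - prevalence)
ppv = true_pos / (true_pos + false_pos)

# Negative predictive value: P(no disease | test negative)
true_neg = specificity * (1 - prevalence)
false_neg = (1 - sensitivity) * prevalence
npv = true_neg / (true_neg + false_neg)

print(f"baseline risk {prevalence:.2f}, Youden {youden:.2f}, "
      f"PPV {ppv:.2f}, NPV {npv:.2f}")
```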
"Triple Test" for Down Syndrome
The triple test is done between the 16th and 18th weeks of pregnancy. The test measures three substances, or markers, that are passed from the fetus and the placenta into the mother's bloodstream - AFP, human chorionic gonadotropin and unconjugated estriol. [...] A method was found to combine results of the three tests with a mother's age to identify women at increased risk for having a baby with Down's syndrome. Since that time, a number of studies have shown that the triple test can detect 60 to 70 percent of Down's syndrome cases. Because it is a screening test, the triple test identifies pregnancies that are at increased risk, or "screen-positive" for Down's syndrome. A positive result does not necessarily mean the baby is affected, but is only a signal for further testing.
American Society of Clinical Pathology (www.ascp.org)
Summary
- Statistical modelling entails the analysis of the functional relationship between response and explanatory variables.
- Experimental modelling is based upon prospective trials, addressing controlled explanatory variables. Observational modelling makes use of uncontrolled, observational data.
- Statistical modelling proceeds in multiple steps, including data exploration, followed by model formulation, selection and checking.
- The most commonly used class of statistical models is that of generalised linear models, encompassing (multiple) linear regression, analysis of variance and logistic regression.
- Multiple models 'adjust' the effect of explanatory variables for any bias introduced by disturbing factors.