EPI293: Design and analysis of gene association studies
Winter Term 2006
Lecture 3: Statistical review, single-locus association tests
Peter [email protected]
Bldg 2 Rm 2062-4271
Outline
• Statistical review
• One-locus tests
• Multiple single-locus tests
Outline
• Statistical review
  – Pearson's chi-square
  – Likelihood theory
  – Measures of model fit: AIC, BIC
  – Bayesian data analysis
• One-locus tests
• Multiple single-locus tests
Pearson’s chi-squared
• Do categorical data have a hypothesized dist'n?
  – Are outcome and exposure independent (k×l tables)?
  – Do genotypes follow Hardy-Weinberg proportions?
• i indexes the I categories; O_i = observed count, E_i = expected count under the null
• Test statistic: T = Σ_i (O_i − E_i)² / E_i
• T ~ χ²_d under the null
• d = no. parms under alternative − no. parms under null
Example: 2×3 table
• Let n00, n01 and n02 be counts of controls with genotypes aa, Aa, and AA, respectively
• Let n10, n11 and n12 be the same for cases
• n0. and n1. are total no.s of controls, cases
• n.0 is the total no. of aa genotypes, etc.
• T = Σ_ij (n_ij − E_ij)² / E_ij ~ χ²_2
aa Aa AA TOTAL
Control n00 n01 n02 n0.
Case n10 n11 n12 n1.
TOTAL n.0 n.1 n.2 n..
E_ij = n_i. n_.j / n..

2 d.f. from 4 − 2 = 2, or standard formula: (k−1)(l−1) = 2
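The statistic above is easy to compute directly. Below is a minimal Python sketch (not part of the course materials), applied to the genotype counts that appear on the Examples slide later in this lecture:

```python
# Pearson chi-squared test of independence for an r x c table.
def pearson_chi2(table):
    """table[i][j] = count in row i, column j; returns (T, df)."""
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    n = sum(row)
    T = 0.0
    for i, r in enumerate(table):
        for j, o in enumerate(r):
            e = row[i] * col[j] / n          # E_ij = n_i. n_.j / n..
            T += (o - e) ** 2 / e
    df = (len(table) - 1) * (len(table[0]) - 1)
    return T, df

# Rows: controls, cases; columns: aa, Aa, AA (counts from the Examples slide)
T, df = pearson_chi2([[128, 64, 8], [116, 72, 12]])
print(round(T, 2), df)   # 1.86 2
```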
Example: test for departure from HWE
aa Aa AA TOTAL
n0 n1 n2 n.

• T = (n0 − n. q̂²)² / (n. q̂²) + (n1 − 2 n. p̂ q̂)² / (2 n. p̂ q̂) + (n2 − n. p̂²)² / (n. p̂²),
  where p̂ = (2 n2 + n1) / (2 n.) and q̂ = 1 − p̂
• Under null T is a chi-square with 1 d.f.– Two parameters under alternative minus one under null
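A small sketch of the test above in Python (the genotype counts are made up for illustration):

```python
# Chi-squared test for departure from Hardy-Weinberg equilibrium (1 d.f.).
def hwe_chi2(n0, n1, n2):
    """n0, n1, n2 = counts of aa, Aa, AA genotypes."""
    n = n0 + n1 + n2
    p = (2 * n2 + n1) / (2 * n)   # MLE of the A allele frequency
    q = 1 - p
    exp = [n * q * q, 2 * n * p * q, n * p * p]   # HWE expected counts
    obs = [n0, n1, n2]
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))

# Counts exactly at HWE proportions (p = 0.1) give a statistic of zero
print(round(hwe_chi2(81, 18, 1), 6))   # 0.0
```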
Likelihood theory
• Likelihood is a function of the model parameters, given a probabilistic model and data
  – Probability of the observed data for given parameter values
• Assume observations (indexed by i) are independent
• Let X_i be the data for observation i
• β = parameters of interest; η = "nuisance" parameters
  – Maximize L to estimate β and η (MLEs)
• Equivalent to maximizing log L
• Usually requires computers

L(θ) = L(β, η) = Π_i Pr(X_i; β, η)
Example: MLE for allele frequency
• Multinomial likelihood:
  L(p) = [n! / (n0! n1! n2!)] [(1 − p)²]^n0 [2p(1 − p)]^n1 [p²]^n2
• log L(p) = const + 2 n0 log(1 − p) + n1 log p + n1 log(1 − p) + 2 n2 log p
• Maximum at 0, 1, or where the "score" U(p) = ∂ log L / ∂p = 0:
  U(p) = (n1 + 2 n2)/p − (2 n0 + n1)/(1 − p)
• … so MLE of p is p̂ = (2 n2 + n1) / (2n)
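As a numerical check, the closed-form MLE can be compared against a grid search over the log-likelihood; this sketch uses the control genotype counts from the SAS example later in the deck:

```python
# Verify numerically that p_hat = (2*n2 + n1)/(2n) maximizes the
# multinomial log-likelihood (constant term dropped).
import math

def loglik(p, n0, n1, n2):
    return (2 * n0) * math.log(1 - p) + n1 * math.log(p * (1 - p)) \
           + (2 * n2) * math.log(p)

n0, n1, n2 = 655, 310, 37   # aa, Aa, AA counts (SAS example controls)
n = n0 + n1 + n2
p_hat = (2 * n2 + n1) / (2 * n)

# The best point on a fine grid should sit next to the closed-form MLE
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=lambda p: loglik(p, n0, n1, n2))
print(round(p_hat, 3), round(best, 3))
```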
Example: unconditional logistic regression
• J exposures of interest, K “nuisance” parameters
• No closed-form solution for parameter estimates
  – Need a computer: SAS PROC LOGISTIC, R glm(), etc.
Pr(D = 1 | Z_i, X_i) = exp[β'Z_i + η'X_i] / (1 + exp[β'Z_i + η'X_i]) = expit[β'Z_i + η'X_i]

where β'Z_i = β1 Z_i1 + … + βJ Z_iJ and η'X_i = η0 + η1 X_i1 + … + ηK X_iK

L(β, η) = Π_{cases i} expit[β'Z_i + η'X_i] × Π_{controls i} (1 − expit[β'Z_i + η'X_i])
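The case-control log-likelihood can be evaluated directly. This is a minimal sketch with made-up data, not a fitting routine (use PROC LOGISTIC or R glm() for that):

```python
# Evaluate the unconditional logistic log-likelihood for one exposure z.
import math

def expit(x):
    return 1 / (1 + math.exp(-x))

def loglik(beta0, beta1, z, d):
    """log L for the model Pr(D=1 | z) = expit(beta0 + beta1*z)."""
    ll = 0.0
    for zi, di in zip(z, d):
        p = expit(beta0 + beta1 * zi)
        ll += math.log(p if di == 1 else 1 - p)
    return ll

z = [0, 1, 2, 0, 1, 2]   # genotype coding for six hypothetical subjects
d = [0, 0, 1, 0, 1, 1]   # case-control status
# At beta0 = beta1 = 0 every subject contributes log(1/2)
print(round(loglik(0.0, 0.0, z, d), 4))   # -4.1589
```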
Tests based on likelihood theory
• Score test
  – U(θ0) ~ N(0, Var(U))
  – If observations are independent, Var(U) ≈ I = −∂²/∂θ² log L
  – U'I⁻¹U ~ χ² with dim(θ0) d.f.
  – Often has convenient formula (e.g. McNemar's test)
• Wald test
  – For large enough samples: θ̂ ~ N(θ0, I⁻¹ Var(U) I⁻¹)
  – If observations are independent: θ̂ ~ N(θ0, I⁻¹)
  – Leads to usual test: β̂_k / s.e.(β̂_k) ~ N(0,1)
  – "Easy" to robustify if observations are not independent
    • Sandwich or Huber-White estimate: var̂(θ̂) = I⁻¹ var̂(U) I⁻¹
• Likelihood ratio test
  – Intuitive test of hypotheses that constrain multiple parms
Likelihood ratio test
LR = max_{θ ∈ Θ1} L(θ) / max_{θ ∈ Θ0} L(θ)

Θ1 indexes the alternative model, Θ0 the null model; Θ0 ⊂ Θ1, i.e. the models are "nested"

LRT = 2 log LR ~ χ²_d under null, d = dim(Θ1) − dim(Θ0)

E.g. Θ1 = {β1, β2 : β1, β2 ∈ (−∞, ∞)}, Θ0 = {β1 = β2 = 0}, d = 2.
Likelihood ratio test: example
• Case-control study of CHD
• Z is BMI, coded in tertiles
  – I.e. Zi' = (Zi1, Zi2)
    • Zi1 = 1 if i in middle tertile, 0 otherwise
    • Zi2 = 1 if i in top tertile, 0 otherwise
• X includes intercept, age (as a linear term)
• Null: Pr(D=1|Z,X) = expit[η'X] – (two parameters)
• Alternative: Pr(D=1|Z,X) = expit[β'Z + η'X] – (four parameters)
Likelihood ratio test has 4-2=2 degrees of freedom
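Converting an LRT statistic to a p-value needs the chi-squared survival function; for small integer d.f. it has a closed form via the standard recurrence, sketched here in stdlib-only Python:

```python
# Survival function (upper tail) of the chi-squared distribution.
import math

def chi2_sf(x, df):
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df == 2:
        return math.exp(-x / 2)
    # Recurrence: SF(x; k) = SF(x; k-2) + (x/2)^(k/2-1) e^(-x/2) / Gamma(k/2)
    return chi2_sf(x, df - 2) + \
        (x / 2) ** (df / 2 - 1) * math.exp(-x / 2) / math.gamma(df / 2)

# An LRT of 5.99 on 2 d.f. sits right at the conventional 0.05 threshold
print(round(chi2_sf(5.99, 2), 3))   # 0.05
```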
Measures of model fit
• Not all models are nested within each other
  – Dominant, recessive models for a given risk allele
  – Locus A versus locus B
• Interested in model fit per se
  – Which model(s) best describe(s) the data?
• Akaike Information Criterion
  – AIC = −2 log L + 2 K
• Bayes Information Criterion
  – BIC = −2 log L + log(n) K
• Smaller is better (but read the software manual)
• More parameters = more flexibility = smaller −2 log L; the second term is a "penalty for 'overfitting'"
• AIC is an estimate of "in-sample error" using a log-likelihood loss function; BIC is a rough estimate of −2 log Pr(Model|Data)
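The two criteria above are one-liners; this sketch shows how BIC's log(n) penalty grows past AIC's fixed penalty of 2 per parameter:

```python
# AIC and BIC for a fitted model, as defined above.
import math

def aic(loglik, k):
    return -2 * loglik + 2 * k

def bic(loglik, k, n):
    return -2 * loglik + math.log(n) * k

# With n = 100 observations, BIC penalizes each parameter by
# log(100) = 4.61 versus AIC's 2, so BIC favors smaller models.
print(aic(-50.0, 3), round(bic(-50.0, 3, 100), 1))   # 106.0 113.8
```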
Bayesian data analysis
• Frequentists assume there is a true model with true parameter values, which we estimate given the data
  – Pearson's chi-square, likelihood theory: all frequentist
• Bayesians assume the parameters (including perhaps the model form) are random variables, and calculate the posterior distribution given the data
  – Advantages
    • Can account for "prior information" about distribution of parms
    • Quite complicated models are mathematically tractable
  – Disadvantages
    • Requires assumptions about "prior information"
f(θ | X) = L(X | θ) π(θ) / ∫ L(X | θ̃) π(θ̃) dθ̃    (Bayes' Theorem)

π(θ) is the prior distribution of θ

"Fully Bayes" = assumes the prior is completely known; "empirical Bayes" = assumes the prior depends on "hyperparameters" (e.g. mean and variance) which are estimated from the data
“Fully Bayes” example
• Say we collect n standardized continuous measurements
  – X_i ~ N(μ, 1)
• Say that a priori μ ~ N(0, σ0²)
• Then the posterior distribution of μ has mean…
  n X̄ / (n + 1/σ0²)
  …and variance
  (n + 1/σ0²)⁻¹
• What does this mean? (a) For n large relative to 1/σ0², "the data swamp the prior"; (b) for n small relative to 1/σ0², the prior swamps the data; (c) different priors lead to different results
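The shrinkage behavior above is easy to see numerically; a minimal sketch of the normal-normal posterior:

```python
# Posterior mean and variance for X_i ~ N(mu, 1) with prior mu ~ N(0, s0sq).
def posterior(xbar, n, s0sq):
    prec = n + 1 / s0sq           # posterior precision
    return n * xbar / prec, 1 / prec

# Vague prior (huge s0sq): posterior ~ N(xbar, 1/n); the data swamp the prior
mean, var = posterior(xbar=2.0, n=100, s0sq=1e6)
print(round(mean, 3), round(var, 4))   # 2.0 0.01

# Tight prior (tiny s0sq): the posterior mean shrinks toward the prior mean 0
mean, var = posterior(xbar=2.0, n=100, s0sq=0.001)
print(round(mean, 3))   # 0.182
```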
Empirical Bayes example: hierarchical modeling
• Say Z1,…,Z5 measure consumption of five food types
• First-stage model:
  – Pr(D=1|Z) = expit[β0 + β1 Z1 + … + β5 Z5]
• Second-stage model (prior):
  – β1 = π0 + π1 X1 + ε1; β2 = π0 + π1 X2 + ε2; etc. …
  – …where Xi is the amount of the nutrient of interest in food i
  – "regressing the effect of Z on X"
  – Prior depends on three parameters: π0, π1, and var(ε)
  – π0, π1 estimated from data
• var(ε) can be estimated from data or treated as fixed
  – Or chosen to minimize prediction error
• Advantages
  – Reduce parameter variance
  – Allow high-dimensional models to be fit
• Disadvantages
  – Must make assumptions in the second-stage model
    • For us: what is the at-risk allele; which loci are "exchangeable"
Outline
• Statistical review
• One-locus tests
  – Diallelic
  – Multiallelic
• Multiple single-locus tests
Simple three-by-two tables
• Advantages
  – Simplicity, completeness
  – Robust to true dominance pattern
• Disadvantage
  – Statistic unreliable when there are few homozygote variants (AA)
Control Case OR (95% CI)
aa n00 n01 1 (ref.)
Aa n10 n11
AA n20 n21
TOTAL n.0 n.1
T = Σ_ij (n_ij − E_ij)² / E_ij, with E_ij = n_i. n_.j / n..; T has 2 d.f. under null
Simple two-by-two tables
• Test statistic now has 1 d.f. under null
• Advantages
  – Simplicity
• Disadvantages
  – Lose some information in presentation
  – Not robust to true dominance pattern
Control Case OR (95%CI)
aa n00 n01 1 (ref.)
Aa or AA n10 n11
Control Case OR (95%CI)
Aa or aa n00 n01 1 (ref.)
AA n10 n11
Dominant model
Recessive model
Simple trend test
• Armitage's Trend Test
  – Tests for a linear trend in log(OR) with the no. of A alleles
T = n.. [ n..(n11 + 2 n12) − n1.(n.1 + 2 n.2) ]² / { n0. n1. [ n..(n.1 + 4 n.2) − (n.1 + 2 n.2)² ] }

Notation from slide 18
• Test statistic still has 1 d.f. under null
• Advantages
  – Simplicity; retains information in presentation (2×3 table)
  – More robust than dominant, recessive tests
• Disadvantage
  – Not as robust as the 2 d.f. test
Allelic test
• For all the previous tests, the unit of observation was the subject (genotype)
  – Total number of observations = n.. = number of subjects
• Can also treat alleles as the unit of observation
  – Now the total number of observations is 2 n..
  – Great! I've doubled my sample size! But…
  – …my Type I error could be inflated if the locus is out of HWE…
  – …and ORall requires careful interpretation
Control Case OR (95%CI)
a m00 m01 1 (ref.)
A m10 m11 ORall (CI)
Sasieni, P.D., From genotypes to genes: doubling the sample size. Biometrics, 1997. 53(4): p. 1253-61
Examples
Control Case Total OR
aa 128 116 244 1 (ref.)
Aa 64 72 136 1.2
AA 8 12 20 1.7
TOTAL 200 200 400
Pearson’s chi-square: 1.86 on 2 d.f., p=.39
Codominant test
Control Case Total OR
a 320 304 624 1 (ref.)
A 80 96 176 1.3
TOTAL 400 400 800
Pearson’s chi-square: 1.62 on 1 d.f., p=.20
Allelic test
“Truth:” RRAa = 1.25, RRAA = 1.5
2x3 (etc.) tables via logistic regression
• Trick: create genotype coding variable Z
• One d.f. tests
  – Dominant: Z=1 if genotype is AA or Aa, 0 otherwise
  – Recessive: Z=1 if genotype is AA, 0 otherwise
  – Trend (a.k.a. linear or additive): Z = # A alleles
    • If genotype is AA then Z=2, if Aa then Z=1, etc.
    • Score test from this model = Armitage's trend test
• Two d.f. test
  – Create two "dummy" variables
  – Z1 = 1 if genotype is Aa, 0 otherwise
  – Z2 = 1 if genotype is AA, 0 otherwise
  – Perform likelihood ratio test
• Advantages of logistic regression
  – Adjust for other variables, test several loci simultaneously
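The codings described above can be generated mechanically from the allele count; a minimal sketch (the dict layout is my own, not from the slides):

```python
# Build the genotype coding variables for a diallelic locus.
def code_genotype(n_a_alleles):
    """n_a_alleles = # copies of the A allele (0, 1, or 2)."""
    return {
        "dominant": 1 if n_a_alleles >= 1 else 0,
        "recessive": 1 if n_a_alleles == 2 else 0,
        "trend": n_a_alleles,
        "dummy": (1 if n_a_alleles == 1 else 0,   # Z1: heterozygote Aa
                  1 if n_a_alleles == 2 else 0),  # Z2: homozygote AA
    }

print(code_genotype(1))
# {'dominant': 1, 'recessive': 0, 'trend': 1, 'dummy': (1, 0)}
```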
How to fit using logistic regression (in SAS)

data example;
  input caco z n;
  cards;
0 0 655
0 1 310
0 2 37
1 0 535
1 1 401
1 2 67
;
run;

* Additive;
proc logistic descending;
  model caco=z;
  weight n;
run;

* Co-dominant;
proc logistic descending;
  class z (ref=first);
  model caco=z;
  weight n;
run;
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio    32.3558    1    <.0001
Score               32.1106    1    <.0001
Wald                31.6368    1    <.0001
Analysis of Maximum Likelihood Estimates
                         Standard       Wald
Parameter  DF  Estimate  Error      Chi-Square  Pr > ChiSq
Intercept   1  -0.1963   0.0568     11.9363     0.0006
z           1   0.4331   0.0770     31.6368     <.0001
Additive model
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio    32.5780    2    <.0001
Score               32.4012    2    <.0001
Wald                32.0451    2    <.0001
Co-dominant model
Analysis of Maximum Likelihood Estimates
                         Standard       Wald
Parameter  DF  Estimate  Error      Chi-Square  Pr > ChiSq
Intercept   1   0.2163   0.0753      8.2417     0.0041
z 1         1   0.0411   0.0871      0.2232     0.6366
z 2         1   0.3775   0.1402      7.2486     0.0071
Adjusting for covariates
data example;
  input caco z x n;
  cards;
0 0 0 655
0 1 0 310
0 2 0 37
1 0 0 535
1 1 0 401
1 2 0 67
0 0 1 642
0 1 1 311
0 2 1 31
1 0 1 542
1 1 1 391
1 2 1 59
;
run;

* A);
proc logistic descending;
  model caco=x;
  weight n;

* B);
proc logistic descending;
  class z (ref=first);
  model caco=z x;
  weight n;
run;

A) Model Fit Statistics

             Intercept    Intercept and
Criterion    Only         Covariates
AIC          5520.818     5522.805
SC           5521.302     5523.775
-2 Log L     5518.818     5518.805

B) Model Fit Statistics

             Intercept    Intercept and
Criterion    Only         Covariates
AIC          5520.818     5468.033
SC           5521.302     5469.972
-2 Log L     5518.818     5460.033

T = 5518.8 − 5460.0 = 58.8 on 2 d.f.
Which test to use?
• No fishing expeditions (without paying a price)!
Which test to use? (cont)
Different colors = different true models. Points are comparisons of power for different models. Codominant offers a gain in power under the true recessive model, for little cost under other true models.
PK's soapbox
• For complex diseases, the "mode of inheritance" (dominant, recessive, et cetera) is an antiquated and potentially dangerous concept
  – "Mode of inheritance" was developed for simple Mendelian diseases with near-complete penetrance
  – Complex diseases involve multiple loci and have high phenocopy rates
  – A marker that is in LD with a causal gene will "look co-dominant," even if the causal gene is actually recessive or dominant
  – Few can resist the temptation to "go fishing"
    • Reporting results for the "most significant" coding leads to inflated Type I error, narrow CIs
    • Very difficult--if not impossible--to choose between competing models
• Suggestion: emphasize results from co-dominant coding
  – Co-dominant coding is "model free" (the model is saturated)
  – ORs convey more information
  – Generally retains power relative to other codings (even though the test has 2 d.f. instead of 1)
  – Switch to dominant coding only if homozygous carriers are very rare

(As yet) unpublished simulation study by Jean Yen supports these hypotheses
If the causal variant is dominant (recessive, codominant), what does the marker-trait correlation pattern look like?
Next few slides borrowed from Bruce Weir
Quantitative Traits
Two-allele Models
Trait Mean and Variance
Marker and Trait Values
Two-allele Situation
Trait Values for Marker Loci
BINARY TRAITS
Multi-allelic tests
• General test has KC2 + K − 1 d.f. (one fewer than the number of possible genotypes)
  – Number of d.f. gets large quickly
  – Even bigger problem with sparse cells
  – Can we use information about dominance pattern?
Control Case OR (95% CI)
AA n00 n01 1 (ref.)
AB n10 n11
AC n20 n21
BB n30 n31
BC n40 n41
CC n50 n51
TOTAL n.0 n.1
Multi-allelic dominance
• E.g. ABO blood group
  – "A is dominant to O"; "B is dominant to O"
  – ZA = 1 if G ∈ {AO, AA}, 0 otherwise
  – ZB = 1 if G ∈ {BO, BB}, 0 otherwise
  – ZO = 1 if G = OO, 0 otherwise
  – ZAB = 1 if G = AB, 0 otherwise
• Have to leave one of these vars out as the reference group
• We do not generally know multiallelic dominance relations…
  – Maybe one allele carries risk? Maybe two have the same risk profile? Maybe something odd like the ABO alleles?
• …and the general test quickly becomes problematic
• Compromise: additive model
  – ZA = # of A alleles, ZB = # of B alleles, etc.
  – Advantages:
    • Number of parameters does not explode with the number of alleles
    • Test is insensitive to choice of baseline
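The additive coding above just counts allele copies per genotype; a small sketch, assuming genotypes written as two-character strings as on the next slide:

```python
# Additive coding for a multi-allelic marker: Z_k = # copies of allele k.
def additive_code(genotype, alleles=("0", "1", "2", "3")):
    """genotype: two-character string such as '02'; returns (Z0, Z1, Z2, Z3)."""
    return tuple(genotype.count(a) for a in alleles)

print(additive_code("02"))   # (1, 0, 1, 0)
print(additive_code("22"))   # (0, 0, 2, 0)
```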
Example: four-allele marker
Genotype
Controls Cases TOTAL
00 71 52 123
01 18 11 29
02 86 72 158
03 38 32 70
11 0 2 2
12 15 12 27
13 8 3 11
22 24 9 33
23 32 22 54
33 8 7 15
TOTAL 300 222 522
Additive coding & "global test"

Allele  OR     LowCI  UpCI
Z1      1.045  0.633  1.725
Z2      1.138  0.842  1.536
Z3      1.018  0.720  1.440

Chi-square = 0.7292 on 3 d.f., p = 0.8663
Contrast      OR      LowCI   UpCI
Z 01 vs 00    1.198   0.522   2.751
Z 02 vs 00    0.875   0.544   1.407
Z 03 vs 00    0.870   0.482   1.57
Z 11 vs 00    <0.001  <0.001  >999.999
Z 12 vs 00    0.915   0.396   2.119
Z 13 vs 00    1.953   0.494   7.719
Z 22 vs 00    1.953   0.839   4.549
Z 23 vs 00    1.065   0.556   2.041
Z 33 vs 00    0.837   0.285   2.454
Chi-square=9.1803 on 9 d.f. p=0.4208
Full genotype analysis
Could also perform four tests: allele 0 versus all others, allele 1 versus all others, etc.—but this requires adjustment for multiple
(correlated) tests
Outline
• Statistical review
• One-locus tests
• Multiple single-locus tests
Motivation for testing multiple loci
• Want to test as many candidates as possible
  – Increase the odds that we can detect at least one causal gene
  – Motivation behind genome-wide association scans
  – Analytic issue: multiple testing
• Want to boost power…
  – …by better predicting untyped variants…
    • E.g. haplotypes
  – …or by capturing gene-gene interactions
  – Analytic issues: multiple testing, model selection
Family-wise error rate
• Let T1,…,TK be K independent test statistics of null hypotheses H1,…,HK
• FWER = probability of falsely rejecting even one Hi
• If all Hs are true and we test each at level α*…
  – …then the prob. of at least one false positive is 1 − (1 − α*)^K
  – E.g. if K = 20 and α* = .05, then FWER = 64%
  – Expected number of false positives = K α*
• Strong control of FWER
  – Ensures FWER < α
  – Bonferroni: test each H at α* = α/K
  – Sidak: test each H at α* = 1 − (1 − α)^(1/K)
  – These tests are conservative
    • Very conservative if test statistics are correlated
    • Conservative in principle
– Penalizes Type II error to reduce Type I error
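The thresholds above are one-liners; this sketch also reproduces the K = 20 FWER figure from the slide:

```python
# Per-test significance thresholds that control the FWER at level alpha.
def bonferroni(alpha, K):
    return alpha / K

def sidak(alpha, K):
    return 1 - (1 - alpha) ** (1 / K)

# Testing K = 20 true nulls each at .05 gives FWER = 1 - .95**20, about 64%
fwer = 1 - (1 - 0.05) ** 20
print(round(fwer, 2))                 # 0.64
print(round(bonferroni(0.05, 20), 4), round(sidak(0.05, 20), 4))
```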
Family-wise error rate (cont.)
– Simes test
  • Test of global null: H1 & H2 & … & HK
  • Only works if tests are independent or "positively dependent"
  • Rank p-values p(1) ≤ p(2) ≤ … ≤ p(K)
  • p = min(K p(1), (K/2) p(2), …, (K/K) p(K))
  • Hommel's method yields adjusted p-values for individual tests
– Fisher's product test
  • Again assumes all tests are independent
  • T = Σ_{k=1…K} −2 log(p_k) ~ χ²_2K
  • Modification: rank truncated product
    – Pick number L
    – T = Σ_{k=1…L} −log(p_(k))
False discovery rate
• Let S = number of hypotheses rejected
• S1= number of hypotheses falsely rejected
• False discovery rate (FDR) = E [S1/S]
• Control with Benjamini-Hochberg "step up" procedure
  – Rank p-values p(1) ≤ p(2) ≤ … ≤ p(K)
  – Find max j s.t. (K/j) p(j) < α
  – Reject all H corresponding to the j smallest p-values
  – Works if tests are independent or "positively dependent"
  – To control FDR at α under general dependency, perform the BH procedure with α* = α / Σ_{i=1…K} i⁻¹
    • E.g. for α = .05 and K = 5, α* ≈ .022
    • For α = .05 and K = 1000, α* ≈ .0067
SNP Raw p-value BH “p-value” BY “p-value”
10 0.0003
4 0.0004
7 0.0011
3 0.0028
5 0.0039
9 0.0089
6 0.0105
8 0.0197
1 0.0384
2 0.1679
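The BH step-up rule is short enough to sketch directly; applied here to the ten raw p-values in the table above (indices are 0-based, so index 1 is SNP 2):

```python
# Benjamini-Hochberg step-up procedure controlling the FDR at alpha.
def bh_reject(pvals, alpha=0.05):
    """Returns (0-based) indices of the hypotheses rejected."""
    order = sorted(range(len(pvals)), key=lambda i: pvals[i])
    K = len(pvals)
    jmax = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / K * alpha:   # p(j) <= (j/K) alpha
            jmax = rank
    return sorted(order[:jmax])

# Raw p-values for SNPs 1..10 from the table above
pvals = [0.0384, 0.1679, 0.0028, 0.0004, 0.0039,
         0.0105, 0.0011, 0.0197, 0.0089, 0.0003]
print(bh_reject(pvals))   # [0, 2, 3, 4, 5, 6, 7, 8, 9] -- all but SNP 2
```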
proc multtest pdata=temp(where=(Raw_P ne .))
  sidak hommel fdr out=adiol_peas;
proc print data=adiol_peas(obs=10) noobs;
  var BPC3_NAME Raw_P sid_p hom_p fdr_p;
run;
Read in data set with pre-computed p-values(here: PROC GLM)
Example: Association between 168 SNPs in 20 genes and log10 plasma androstenediol levels
Example: Association between 170 SNPs in 20 genes and log10 plasma IGFBP3 levels
Permutation tests
• Do not assume tests are independent
  – Estimates the distribution of the test statistics under the null, given the correlation among covariates
  – Can result in considerable power gains
    • Correlation means fewer "effective tests"
    • "Can adjust for Keff rather than K tests"
• How does it work?
  – Permute outcome within "exchangeable" sets
    • I.e. subjects with the same outcome distribution under the null
  – If no stratifying variables, permute outcome in the entire data set
  – Else permute within strata
    » Tested variables (here: genetic loci) should be independent of known risk factors not included in the stratifying variables
    » As many known risk factors as practically possible should be included in the strata
Permutation tests (cont.)
– For each replicate (permuted data set)…
  • …record T* and p* for each test, and min(p*)…
– Adjusted global p-value (all null Hs are true)
  • (# T* more extreme than Tobs)/nperm, or
  • (# p* < pobs)/nperm
– Adjusted p-values for each H
  • "Step down procedure"
  • Rank p-values p(1) ≤ p(2) ≤ … ≤ p(K)
  • Adjusted p(1) = global p
  • Adjusted p(2) = global p for tests (2),…,(K)
  • Adjusted p(3) = global p for tests (3),…,(K)
  • Etc…
Permutation tests in practice
• Can use SAS PROC MULTTEST
  – Restricted to binary exposures (dominant, codominant models; can't weight or adjust for covariates)
• Can use %wandel SAS Macro
%macro wandel(
  in=,        /* Input data set */
  vars=,      /* List of variables to be tested */
  nrep=,      /* Number of permutations */
  k=1,        /* Consider k smallest p-values */
  caco=caco); /* Outcome variable name */
%wandel performs permutation adjustments for multiple testing, where each test is the likelihood ratio test from a univariate, unconditional logistic regression.
%wandel compares the sum of the k smallest observed log p-values to the permutation distribution of that sum. (All observations are assumed exchangeable under the null.) For k=1 this is the standard permutation adjustment for the smallest observed p-value. For k>1 this is the Rank Truncated Product Test [Dudbridge and Koeleman, Genet Epidemiol 25(4):360-6].
What wandel does
1. Compute nsnps p-values using observed data
   • LRTs comparing the intercept-only model to single-SNP additive models
   • Calculate Tobs = Σ_{i=1…k} log p_(i)
2. Set counter C to 0
3. For nreps replications:
   • Permute case-control indicators
   • Compute nsnps p-values using permuted data
   • Calculate T = Σ_{i=1…k} log p_(i)
   • If T < Tobs then C = C + 1
4. P-value is C/nreps
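The comparison step of the algorithm can be sketched on its own; here the per-permutation p-value vectors are simply supplied as hypothetical inputs, since generating them would need the full regression machinery:

```python
# Rank-truncated-product permutation adjustment, as in steps 1-4 above.
import math

def rank_truncated_product(pvals, k):
    """Log of the rank truncated product: sum of the k smallest log p-values."""
    return sum(math.log(p) for p in sorted(pvals)[:k])

def perm_adjusted_p(obs_pvals, perm_pvals, k=1):
    """perm_pvals: one vector of per-test p-values per permuted data set."""
    t_obs = rank_truncated_product(obs_pvals, k)
    c = sum(1 for rep in perm_pvals
            if rank_truncated_product(rep, k) <= t_obs)
    return c / len(perm_pvals)

obs = [0.01, 0.2, 0.5]                       # observed per-SNP p-values
perms = [[0.3, 0.4, 0.5], [0.05, 0.6, 0.7],  # hypothetical permutation results
         [0.001, 0.02, 0.9], [0.8, 0.9, 0.95]]
print(perm_adjusted_p(obs, perms, k=1))   # 0.25
```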
How big should nreps be?
If we want to estimate a p-value assumed to be π with precision ε at a confidence level of 1 − α,

N_reps ≥ [ z_{1−α/2} √(π(1 − π)) / ε ]²

Or, as a conservative bound, since π(1 − π) is maximized at π = .5:

N_reps ≥ [ z_{1−α/2} / (2ε) ]²
So if we want to estimate p ca. .05 with precision =.01 with 99% confidence,
Nreps ≥ [2.576×.218/(.01)]2 = 3,154 (1st formula), orNreps ≥ [2.576/(.02)]2 = 16,590 (2nd)
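A small helper reproducing the bounds above (the first value comes out slightly below the slide's 3,154, which rounds √(π(1−π)) to .218; the conservative bound matches 16,590):

```python
# Number of permutations needed to estimate a p-value pi to within
# epsilon with confidence 1 - alpha.
import math

def nreps(pi, eps, z):
    """Exact binomial-variance bound."""
    return math.ceil((z * math.sqrt(pi * (1 - pi)) / eps) ** 2)

def nreps_conservative(eps, z):
    """Worst case, using pi*(1 - pi) <= 1/4."""
    return math.ceil((z / (2 * eps)) ** 2)

z99 = 2.576   # z_{1-alpha/2} for 99% confidence
print(nreps(0.05, 0.01, z99), nreps_conservative(0.01, z99))   # 3152 16590
```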
%wandel(in=UseMe,
        vars=zsnp1 zsnp2 zsnp3 zsnp4 zsnp5
             zsnp6 zsnp7 zsnp8 zsnp9 zsnp10,
        nrep=1000, k=3, caco=d);
Observed Rank Truncated Product and Permutation P-value (nreps=1000)
                Perm
ObsProdP        PValue
1.5573E-10      .001
Example
Cheng et al. (2006) “Common Genetic Variation in IGF1 and Prostate Cancer Risk in the Multiethnic Cohort” JNCI 98:123-134
• 64 SNPs (1/2.4 kb) spanning IGF1 used to characterize LD
• Found four blocks spanning 59 SNPs
• Chose 29 haplotype-tagging SNPs
• Genotyped the htSNPs in 2320 cases and 2290 controls
• Analyzed haplotypes within blocks
• Analyzed 62 SNPs univariately (inferred untyped SNPs from haplotypes)
These two SNPs are perfectly correlated; one SNP was observed in the case-control sample, the other was not.
Permutation corrected p-value (test of global null hypothesis that none of the 29 tag SNPs are associated with prostate cancer) = 0.056.
BH p-value for SNP4=0.058; BY p-value=0.229.