EPI293: Design and analysis of gene association studies
Winter Term 2006
Lecture 3: Statistical review, single-locus association tests
Peter [email protected]
Bldg 2 Rm 2062-4271
Outline
• Statistical review
• One-locus tests
• Multiple single-locus tests
Outline
• Statistical review
  – Pearson's chi-square
  – Likelihood theory
  – Measures of model fit: AIC, BIC
  – Bayesian data analysis
• One-locus tests
• Multiple single-locus tests
Pearson’s chi-squared
• Do categorical data have a hypothesized dist'n?
  – Are outcome and exposure independent (k×l tables)?
  – Do genotypes follow Hardy-Weinberg proportions?
• i indexes the I categories; O_i = observed count, E_i = expected count under the null
• Test statistic: T = Σ_i (O_i − E_i)² / E_i
• T ~ χ²_d under the null
• d = no. parms under alternative − no. parms under null
Example: 2×3 table
• Let n00, n01 and n02 be counts of controls with genotypes aa, Aa, and AA, respectively
• Let n10, n11 and n12 be the same for cases
• n0. and n1. are total no.s of controls, cases
• n.0 is the total no. of aa genotypes, etc.
• T = Σ_ij (n_ij − E_ij)² / E_ij ~ χ²_2
aa Aa AA TOTAL
Control n00 n01 n02 n0.
Case n10 n11 n12 n1.
TOTAL n.0 n.1 n.2 n..
E_ij = n_i. n_.j / n..

2 d.f. from 4 − 2 = 2, or standard formula: (k−1)(l−1) = 2
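The statistic above is easy to compute directly. Below is a minimal Python sketch (not part of the course materials), applied to the genotype counts that appear on the Examples slide later in this lecture:

```python
# Pearson chi-squared test of independence for an r x c table.
def pearson_chi2(table):
    """table[i][j] = count in row i, column j; returns (T, df)."""
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    n = sum(row)
    T = 0.0
    for i, r in enumerate(table):
        for j, o in enumerate(r):
            e = row[i] * col[j] / n          # E_ij = n_i. n_.j / n..
            T += (o - e) ** 2 / e
    df = (len(table) - 1) * (len(table[0]) - 1)
    return T, df

# Rows: controls, cases; columns: aa, Aa, AA (counts from the Examples slide)
T, df = pearson_chi2([[128, 64, 8], [116, 72, 12]])
print(round(T, 2), df)   # 1.86 2
```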
Example: test for departure from HWE
aa Aa AA TOTAL
n0 n1 n2 n.

• T = (n0 − n. q̂²)² / (n. q̂²) + (n1 − 2 n. p̂ q̂)² / (2 n. p̂ q̂) + (n2 − n. p̂²)² / (n. p̂²),
  where p̂ = (2 n2 + n1) / (2 n.) and q̂ = 1 − p̂
• Under null T is a chi-square with 1 d.f.– Two parameters under alternative minus one under null
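A small sketch of the test above in Python (the genotype counts are made up for illustration):

```python
# Chi-squared test for departure from Hardy-Weinberg equilibrium (1 d.f.).
def hwe_chi2(n0, n1, n2):
    """n0, n1, n2 = counts of aa, Aa, AA genotypes."""
    n = n0 + n1 + n2
    p = (2 * n2 + n1) / (2 * n)   # MLE of the A allele frequency
    q = 1 - p
    exp = [n * q * q, 2 * n * p * q, n * p * p]   # HWE expected counts
    obs = [n0, n1, n2]
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))

# Counts exactly at HWE proportions (p = 0.1) give a statistic of zero
print(round(hwe_chi2(81, 18, 1), 6))   # 0.0
```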
Likelihood theory
• Likelihood is a function of the model parameters, given a probabilistic model and data
  – Probability of the observed data for given parameter values
• Assume observations (indexed by i) are independent
• Let X_i be the data for observation i
• β = parameters of interest; η = "nuisance" parameters
  – Maximize L to estimate β and η (MLEs)
• Equivalent to maximizing log L
• Usually requires computers

L(θ) = L(β, η) = Π_i Pr(X_i; β, η)
Example: MLE for allele frequency
• Multinomial likelihood:
  L(p) = [n! / (n0! n1! n2!)] [(1 − p)²]^n0 [2p(1 − p)]^n1 [p²]^n2
• log L(p) = const + 2 n0 log(1 − p) + n1 log p + n1 log(1 − p) + 2 n2 log p
• Maximum at 0, 1, or where the "score" U(p) = ∂ log L / ∂p = 0:
  U(p) = (n1 + 2 n2)/p − (2 n0 + n1)/(1 − p)
• … so MLE of p is p̂ = (2 n2 + n1) / (2n)
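As a numerical check, the closed-form MLE can be compared against a grid search over the log-likelihood; this sketch uses the control genotype counts from the SAS example later in the deck:

```python
# Verify numerically that p_hat = (2*n2 + n1)/(2n) maximizes the
# multinomial log-likelihood (constant term dropped).
import math

def loglik(p, n0, n1, n2):
    return (2 * n0) * math.log(1 - p) + n1 * math.log(p * (1 - p)) \
           + (2 * n2) * math.log(p)

n0, n1, n2 = 655, 310, 37   # aa, Aa, AA counts (SAS example controls)
n = n0 + n1 + n2
p_hat = (2 * n2 + n1) / (2 * n)

# The best point on a fine grid should sit next to the closed-form MLE
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=lambda p: loglik(p, n0, n1, n2))
print(round(p_hat, 3), round(best, 3))
```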
Example: unconditional logistic regression
• J exposures of interest, K “nuisance” parameters
• No closed-form solution for parameter estimates
  – Need a computer: SAS PROC LOGISTIC, R glm(), etc.
Pr(D = 1 | Z_i, X_i) = exp[β'Z_i + η'X_i] / (1 + exp[β'Z_i + η'X_i]) = expit[β'Z_i + η'X_i]

where β'Z_i = β1 Z_i1 + … + βJ Z_iJ and η'X_i = η0 + η1 X_i1 + … + ηK X_iK

L(β, η) = Π_{cases i} expit[β'Z_i + η'X_i] × Π_{controls i} (1 − expit[β'Z_i + η'X_i])
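The case-control log-likelihood can be evaluated directly. This is a minimal sketch with made-up data, not a fitting routine (use PROC LOGISTIC or R glm() for that):

```python
# Evaluate the unconditional logistic log-likelihood for one exposure z.
import math

def expit(x):
    return 1 / (1 + math.exp(-x))

def loglik(beta0, beta1, z, d):
    """log L for the model Pr(D=1 | z) = expit(beta0 + beta1*z)."""
    ll = 0.0
    for zi, di in zip(z, d):
        p = expit(beta0 + beta1 * zi)
        ll += math.log(p if di == 1 else 1 - p)
    return ll

z = [0, 1, 2, 0, 1, 2]   # genotype coding for six hypothetical subjects
d = [0, 0, 1, 0, 1, 1]   # case-control status
# At beta0 = beta1 = 0 every subject contributes log(1/2)
print(round(loglik(0.0, 0.0, z, d), 4))   # -4.1589
```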
Tests based on likelihood theory
• Score test
  – U(θ0) ~ N(0, Var(U))
  – If observations are independent, Var(U) ≈ I = −∂²/∂θ² log L
  – U'I⁻¹U ~ χ² with dim(θ0) d.f.
  – Often has convenient formula (e.g. McNemar's test)
• Wald test
  – For large enough samples: θ̂ ~ N(θ0, I⁻¹ Var(U) I⁻¹)
  – If observations are independent: θ̂ ~ N(θ0, I⁻¹)
  – Leads to usual test: β̂_k / s.e.(β̂_k) ~ N(0,1)
  – "Easy" to robustify if observations are not independent
    • Sandwich or Huber-White estimate: var̂(θ̂) = I⁻¹ var̂(U) I⁻¹
• Likelihood ratio test
  – Intuitive test of hypotheses that constrain multiple parms
Likelihood ratio test
LR = max_{θ ∈ Θ1} L(θ) / max_{θ ∈ Θ0} L(θ)

Θ1 indexes the alternative model, Θ0 the null model; Θ0 ⊂ Θ1, i.e. the models are "nested"

LRT = 2 log LR ~ χ²_d under null, d = dim(Θ1) − dim(Θ0)

E.g. Θ1 = {β1, β2 : β1, β2 ∈ (−∞, ∞)}, Θ0 = {β1 = β2 = 0}, d = 2.
Likelihood ratio test: example
• Case-control study of CHD
• Z is BMI, coded in tertiles
  – I.e. Zi' = (Zi1, Zi2)
    • Zi1 = 1 if i in middle tertile, 0 otherwise
    • Zi2 = 1 if i in top tertile, 0 otherwise
• X includes intercept, age (as a linear term)
• Null: Pr(D=1|Z,X) = expit[η'X] – (two parameters)
• Alternative: Pr(D=1|Z,X) = expit[β'Z + η'X] – (four parameters)
Likelihood ratio test has 4-2=2 degrees of freedom
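Converting an LRT statistic to a p-value needs the chi-squared survival function; for small integer d.f. it has a closed form via the standard recurrence, sketched here in stdlib-only Python:

```python
# Survival function (upper tail) of the chi-squared distribution.
import math

def chi2_sf(x, df):
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df == 2:
        return math.exp(-x / 2)
    # Recurrence: SF(x; k) = SF(x; k-2) + (x/2)^(k/2-1) e^(-x/2) / Gamma(k/2)
    return chi2_sf(x, df - 2) + \
        (x / 2) ** (df / 2 - 1) * math.exp(-x / 2) / math.gamma(df / 2)

# An LRT of 5.99 on 2 d.f. sits right at the conventional 0.05 threshold
print(round(chi2_sf(5.99, 2), 3))   # 0.05
```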
Measures of model fit
• Not all models are nested within each other
  – Dominant, recessive models for a given risk allele
  – Locus A versus locus B
• Interested in model fit per se
  – Which model(s) best describe(s) the data?
• Akaike Information Criterion
  – AIC = −2 log L + 2 K
• Bayes Information Criterion
  – BIC = −2 log L + log(n) K
• Smaller is better (but read the software manual)
• More parameters = more flexibility = smaller −2 log L; the second term is a "penalty for 'overfitting'"
• AIC is an estimate of "in-sample error" using a log-likelihood loss function; BIC is a rough estimate of −2 log Pr(Model|Data)
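The two criteria above are one-liners; this sketch shows how BIC's log(n) penalty grows past AIC's fixed penalty of 2 per parameter:

```python
# AIC and BIC for a fitted model, as defined above.
import math

def aic(loglik, k):
    return -2 * loglik + 2 * k

def bic(loglik, k, n):
    return -2 * loglik + math.log(n) * k

# With n = 100 observations, BIC penalizes each parameter by
# log(100) = 4.61 versus AIC's 2, so BIC favors smaller models.
print(aic(-50.0, 3), round(bic(-50.0, 3, 100), 1))   # 106.0 113.8
```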
Bayesian data analysis
• Frequentists assume there is a true model with true parameter values, which we estimate given the data
  – Pearson's chi-square, likelihood theory: all frequentist
• Bayesians assume the parameters (including perhaps the model form) are random variables, and calculate the posterior distribution given the data
  – Advantages
    • Can account for "prior information" about distribution of parms
    • Quite complicated models are mathematically tractable
  – Disadvantages
    • Requires assumptions about "prior information"
f(θ | X) = L(X | θ) π(θ) / ∫ L(X | θ̃) π(θ̃) dθ̃    (Bayes' Theorem)

π(θ) is the prior distribution of θ

"Fully Bayes" = assumes the prior is completely known; "empirical Bayes" = assumes the prior depends on "hyperparameters" (e.g. mean and variance) which are estimated from the data
“Fully Bayes” example
• Say we collect n standardized continuous measurements
  – X_i ~ N(μ, 1)
• Say that a priori μ ~ N(0, σ0²)
• Then the posterior distribution of μ has mean…
  n X̄ / (n + 1/σ0²)
  …and variance
  (n + 1/σ0²)⁻¹
• What does this mean? (a) For n large relative to 1/σ0², "the data swamp the prior"; (b) for n small relative to 1/σ0², the prior swamps the data; (c) different priors lead to different results
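The shrinkage behavior above is easy to see numerically; a minimal sketch of the normal-normal posterior:

```python
# Posterior mean and variance for X_i ~ N(mu, 1) with prior mu ~ N(0, s0sq).
def posterior(xbar, n, s0sq):
    prec = n + 1 / s0sq           # posterior precision
    return n * xbar / prec, 1 / prec

# Vague prior (huge s0sq): posterior ~ N(xbar, 1/n); the data swamp the prior
mean, var = posterior(xbar=2.0, n=100, s0sq=1e6)
print(round(mean, 3), round(var, 4))   # 2.0 0.01

# Tight prior (tiny s0sq): the posterior mean shrinks toward the prior mean 0
mean, var = posterior(xbar=2.0, n=100, s0sq=0.001)
print(round(mean, 3))   # 0.182
```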
Empirical Bayes example: hierarchical modeling
• Say Z1,…,Z5 measure consumption of five food types
• First-stage model:
  – Pr(D=1|Z) = expit[β0 + β1 Z1 + … + β5 Z5]
• Second-stage model (prior):
  – β1 = π0 + π1 X1 + ε1; β2 = π0 + π1 X2 + ε2; etc. …
  – …where Xi is the amount of the nutrient of interest in food i
  – "regressing the effect of Z on X"
  – Prior depends on three parameters: π0, π1, and var(ε)
  – π0, π1 estimated from data
• var(ε) can be estimated from data or treated as fixed
  – Or chosen to minimize prediction error
• Advantages
  – Reduce parameter variance
  – Allow high-dimensional models to be fit
• Disadvantages
  – Must make assumptions in the second-stage model
    • For us: what is the at-risk allele; which loci are "exchangeable"
Outline
• Statistical review
• One-locus tests
  – Diallelic
  – Multiallelic
• Multiple single-locus tests
Simple three-by-two tables
• Advantages
  – Simplicity, completeness
  – Robust to true dominance pattern
• Disadvantage
  – Statistic unreliable when there are few homozygote variants (AA)
Control Case OR (95% CI)
aa n00 n01 1 (ref.)
Aa n10 n11
AA n20 n21
TOTAL n.0 n.1
T = Σ_ij (n_ij − E_ij)² / E_ij, with E_ij = n_i. n_.j / n..; T has 2 d.f. under null
Simple two-by-two tables
• Test statistic now has 1 d.f. under null
• Advantages
  – Simplicity
• Disadvantages
  – Lose some information in presentation
  – Not robust to true dominance pattern
Control Case OR (95%CI)
aa n00 n01 1 (ref.)
Aa or AA n10 n11
Control Case OR (95%CI)
Aa or aa n00 n01 1 (ref.)
AA n10 n11
Dominant model
Recessive model
Simple trend test
• Armitage's Trend Test
  – Tests for a linear trend in log(OR) with the no. of A alleles
T = n.. [ n..(n11 + 2 n12) − n1.(n.1 + 2 n.2) ]² / { n0. n1. [ n..(n.1 + 4 n.2) − (n.1 + 2 n.2)² ] }

Notation from slide 18
• Test statistic still has 1 d.f. under null
• Advantages
  – Simplicity; retains information in presentation (2×3 table)
  – More robust than dominant, recessive tests
• Disadvantage
  – Not as robust as the 2 d.f. test
Allelic test
• For all the previous tests, the unit of observation was the subject (genotype)
  – Total number of observations = n.. = number of subjects
• Can also treat alleles as the unit of observation
  – Now the total number of observations is 2 n..
  – Great! I've doubled my sample size! But…
  – …my Type I error could be inflated if the locus is out of HWE…
  – …and ORall requires careful interpretation
Control Case OR (95%CI)
a m00 m01 1 (ref.)
A m10 m11 ORall (CI)
Sasieni, P.D., From genotypes to genes: doubling the sample size. Biometrics, 1997. 53(4): p. 1253-61
Examples
Control Case Total OR
aa 128 116 244 1 (ref.)
Aa 64 72 136 1.2
AA 8 12 20 1.7
TOTAL 200 200 400
Pearson’s chi-square: 1.86 on 2 d.f., p=.39
Codominant test
Control Case Total OR
a 320 304 624 1 (ref.)
A 80 96 176 1.3
TOTAL 400 400 800
Pearson’s chi-square: 1.62 on 1 d.f., p=.20
Allelic test
“Truth:” RRAa = 1.25, RRAA = 1.5
2x3 (etc.) tables via logistic regression
• Trick: create genotype coding variable Z
• One d.f. tests
  – Dominant: Z=1 if genotype is AA or Aa, 0 otherwise
  – Recessive: Z=1 if genotype is AA, 0 otherwise
  – Trend (a.k.a. linear or additive): Z = # A alleles
    • If genotype is AA then Z=2, if Aa then Z=1, etc.
    • Score test from this model = Armitage's trend test
• Two d.f. test
  – Create two "dummy" variables
  – Z1 = 1 if genotype is Aa, 0 otherwise
  – Z2 = 1 if genotype is AA, 0 otherwise
  – Perform likelihood ratio test
• Advantages of logistic regression
  – Adjust for other variables, test several loci simultaneously
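The codings described above can be generated mechanically from the allele count; a minimal sketch (the dict layout is my own, not from the slides):

```python
# Build the genotype coding variables for a diallelic locus.
def code_genotype(n_a_alleles):
    """n_a_alleles = # copies of the A allele (0, 1, or 2)."""
    return {
        "dominant": 1 if n_a_alleles >= 1 else 0,
        "recessive": 1 if n_a_alleles == 2 else 0,
        "trend": n_a_alleles,
        "dummy": (1 if n_a_alleles == 1 else 0,   # Z1: heterozygote Aa
                  1 if n_a_alleles == 2 else 0),  # Z2: homozygote AA
    }

print(code_genotype(1))
# {'dominant': 1, 'recessive': 0, 'trend': 1, 'dummy': (1, 0)}
```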
How to fit using logistic regression (in SAS)

data example;
  input caco z n;
  cards;
0 0 655
0 1 310
0 2 37
1 0 535
1 1 401
1 2 67
;
run;

* Additive;
proc logistic descending;
  model caco=z;
  weight n;
run;

* Co-dominant;
proc logistic descending;
  class z (ref=first);
  model caco=z;
  weight n;
run;
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio    32.3558    1    <.0001
Score               32.1106    1    <.0001
Wald                31.6368    1    <.0001
Analysis of Maximum Likelihood Estimates
                         Standard       Wald
Parameter  DF  Estimate  Error      Chi-Square  Pr > ChiSq
Intercept   1  -0.1963   0.0568     11.9363     0.0006
z           1   0.4331   0.0770     31.6368     <.0001
Additive model
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio    32.5780    2    <.0001
Score               32.4012    2    <.0001
Wald                32.0451    2    <.0001
Co-dominant model
Analysis of Maximum Likelihood Estimates
                         Standard       Wald
Parameter  DF  Estimate  Error      Chi-Square  Pr > ChiSq
Intercept   1   0.2163   0.0753      8.2417     0.0041
z 1         1   0.0411   0.0871      0.2232     0.6366
z 2         1   0.3775   0.1402      7.2486     0.0071
Adjusting for covariates
data example;
  input caco z x n;
  cards;
0 0 0 655
0 1 0 310
0 2 0 37
1 0 0 535
1 1 0 401
1 2 0 67
0 0 1 642
0 1 1 311
0 2 1 31
1 0 1 542
1 1 1 391
1 2 1 59
;
run;

* A);
proc logistic descending;
  model caco=x;
  weight n;

* B);
proc logistic descending;
  class z (ref=first);
  model caco=z x;
  weight n;
run;

A) Model Fit Statistics

             Intercept    Intercept and
Criterion    Only         Covariates
AIC          5520.818     5522.805
SC           5521.302     5523.775
-2 Log L     5518.818     5518.805

B) Model Fit Statistics

             Intercept    Intercept and
Criterion    Only         Covariates
AIC          5520.818     5468.033
SC           5521.302     5469.972
-2 Log L     5518.818     5460.033

T = 5518.8 − 5460.0 = 58.8 on 2 d.f.
Which test to use?
• No fishing expeditions (without paying a price)!
Which test to use? (cont)
Different colors = different true models. Points are comparisons of power for different models. Codominant offers a gain in power under the true recessive model, for little cost under other true models.
PK's soapbox
• For complex diseases, the "mode of inheritance" (dominant, recessive, et cetera) is an antiquated and potentially dangerous concept
  – "Mode of inheritance" was developed for simple Mendelian diseases with near-complete penetrance
  – Complex diseases involve multiple loci and have high phenocopy rates
  – A marker that is in LD with a causal gene will "look co-dominant," even if the causal gene is actually recessive or dominant
  – Few can resist the temptation to "go fishing"
    • Reporting results for the "most significant" coding leads to inflated Type I error, narrow CIs
    • Very difficult--if not impossible--to choose between competing models
• Suggestion: emphasize results from co-dominant coding
  – Co-dominant coding is "model free" (the model is saturated)
  – ORs convey more information
  – Generally retains power relative to other codings (even though the test has 2 d.f. instead of 1)
  – Switch to dominant coding only if homozygous carriers are very rare

(As yet) unpublished simulation study by Jean Yen supports these hypotheses
If the causal variant is dominant (recessive, codominant), what does the marker-trait correlation pattern look like?
Next few slides borrowed from Bruce Weir
Quantitative Traits
Two-allele Models
Trait Mean and Variance
Marker and Trait Values
Two-allele Situation
Trait Values for Marker Loci
BINARY TRAITS
Multi-allelic tests
• General test has KC2 + K − 1 d.f. (one fewer than the number of possible genotypes)
  – Number of d.f. gets large quickly
  – Even bigger problem with sparse cells
  – Can we use information about dominance pattern?
Control Case OR (95% CI)
AA n00 n01 1 (ref.)
AB n10 n11
AC n20 n21
BB n30 n31
BC n40 n41
CC n50 n51
TOTAL n.0 n.1
Multi-allelic dominance
• E.g. ABO blood group
  – "A is dominant to O"; "B is dominant to O"
  – ZA = 1 if G ∈ {AO, AA}, 0 otherwise
  – ZB = 1 if G ∈ {BO, BB}, 0 otherwise
  – ZO = 1 if G = OO, 0 otherwise
  – ZAB = 1 if G = AB, 0 otherwise
• Have to leave one of these vars out as the reference group
• We do not generally know multiallelic dominance relations…
  – Maybe one allele carries risk? Maybe two have the same risk profile? Maybe something odd like the ABO alleles?
• …and the general test quickly becomes problematic
• Compromise: additive model
  – ZA = # of A alleles, ZB = # of B alleles, etc.
  – Advantages:
    • Number of parameters does not explode with the number of alleles
    • Test is insensitive to choice of baseline
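The additive coding above just counts allele copies per genotype; a small sketch, assuming genotypes written as two-character strings as on the next slide:

```python
# Additive coding for a multi-allelic marker: Z_k = # copies of allele k.
def additive_code(genotype, alleles=("0", "1", "2", "3")):
    """genotype: two-character string such as '02'; returns (Z0, Z1, Z2, Z3)."""
    return tuple(genotype.count(a) for a in alleles)

print(additive_code("02"))   # (1, 0, 1, 0)
print(additive_code("22"))   # (0, 0, 2, 0)
```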
Example: four-allele marker
Genotype
Controls Cases TOTAL
00 71 52 123
01 18 11 29
02 86 72 158
03 38 32 70
11 0 2 2
12 15 12 27
13 8 3 11
22 24 9 33
23 32 22 54
33 8 7 15
TOTAL 300 222 522
Additive coding & "global test"

Allele  OR     LowCI  UpCI
Z1      1.045  0.633  1.725
Z2      1.138  0.842  1.536
Z3      1.018  0.720  1.440

Chi-square = 0.7292 on 3 d.f., p = 0.8663
Contrast      OR      LowCI   UpCI
Z 01 vs 00    1.198   0.522   2.751
Z 02 vs 00    0.875   0.544   1.407
Z 03 vs 00    0.870   0.482   1.57
Z 11 vs 00    <0.001  <0.001  >999.999
Z 12 vs 00    0.915   0.396   2.119
Z 13 vs 00    1.953   0.494   7.719
Z 22 vs 00    1.953   0.839   4.549
Z 23 vs 00    1.065   0.556   2.041
Z 33 vs 00    0.837   0.285   2.454
Chi-square=9.1803 on 9 d.f. p=0.4208
Full genotype analysis
Could also perform four tests: allele 0 versus all others, allele 1 versus all others, etc.—but this requires adjustment for multiple
(correlated) tests
Outline
• Statistical review
• One-locus tests
• Multiple single-locus tests
Motivation for testing multiple loci
• Want to test as many candidates as possible
  – Increase the odds that we can detect at least one causal gene
  – Motivation behind genome-wide association scans
  – Analytic issue: multiple testing
• Want to boost power…
  – …by better predicting untyped variants…
    • E.g. haplotypes
  – …or by capturing gene-gene interactions
  – Analytic issues: multiple testing, model selection
Family-wise error rate
• Let T1,…,TK be K independent test statistics of null hypotheses H1,…,HK
• FWER = probability of falsely rejecting even one Hi
• If all Hs are true and we test each at level α*…
  – …then the prob. of at least one false positive is 1 − (1 − α*)^K
  – E.g. if K = 20 and α* = .05, then FWER = 64%
  – Expected number of false positives = K α*
• Strong control of FWER
  – Ensures FWER < α
  – Bonferroni: test each H at α* = α/K
  – Sidak: test each H at α* = 1 − (1 − α)^(1/K)
  – These tests are conservative
    • Very conservative if test statistics are correlated
    • Conservative in principle
– Penalizes Type II error to reduce Type I error
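The thresholds above are one-liners; this sketch also reproduces the K = 20 FWER figure from the slide:

```python
# Per-test significance thresholds that control the FWER at level alpha.
def bonferroni(alpha, K):
    return alpha / K

def sidak(alpha, K):
    return 1 - (1 - alpha) ** (1 / K)

# Testing K = 20 true nulls each at .05 gives FWER = 1 - .95**20, about 64%
fwer = 1 - (1 - 0.05) ** 20
print(round(fwer, 2))                 # 0.64
print(round(bonferroni(0.05, 20), 4), round(sidak(0.05, 20), 4))
```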
Family-wise error rate (cont.)
– Simes test
  • Test of global null: H1 & H2 & … & HK
  • Only works if tests are independent or "positively dependent"
  • Rank p-values p(1) ≤ p(2) ≤ … ≤ p(K)
  • p = min(K p(1), (K/2) p(2), …, (K/K) p(K))
  • Hommel's method yields adjusted p-values for individual tests
– Fisher's product test
  • Again assumes all tests are independent
  • T = Σ_{k=1…K} −2 log(p_k) ~ χ²_2K
  • Modification: rank truncated product
    – Pick number L
    – T = Σ_{k=1…L} −log(p_(k))
False discovery rate
• Let S = number of hypotheses rejected
• S1= number of hypotheses falsely rejected
• False discovery rate (FDR) = E [S1/S]
• Control with Benjamini-Hochberg "step up" procedure
  – Rank p-values p(1) ≤ p(2) ≤ … ≤ p(K)
  – Find max j s.t. (K/j) p(j) < α
  – Reject all H corresponding to the j smallest p-values
  – Works if tests are independent or "positively dependent"
  – To control FDR at α under general dependency, perform the BH procedure with α* = α / Σ_{i=1…K} i⁻¹
    • E.g. for α = .05 and K = 5, α* ≈ .022
    • For α = .05 and K = 1000, α* ≈ .0067
SNP Raw p-value BH “p-value” BY “p-value”
10 0.0003
4 0.0004
7 0.0011
3 0.0028
5 0.0039
9 0.0089
6 0.0105
8 0.0197
1 0.0384
2 0.1679
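The BH step-up rule is short enough to sketch directly; applied here to the ten raw p-values in the table above (indices are 0-based, so index 1 is SNP 2):

```python
# Benjamini-Hochberg step-up procedure controlling the FDR at alpha.
def bh_reject(pvals, alpha=0.05):
    """Returns (0-based) indices of the hypotheses rejected."""
    order = sorted(range(len(pvals)), key=lambda i: pvals[i])
    K = len(pvals)
    jmax = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / K * alpha:   # p(j) <= (j/K) alpha
            jmax = rank
    return sorted(order[:jmax])

# Raw p-values for SNPs 1..10 from the table above
pvals = [0.0384, 0.1679, 0.0028, 0.0004, 0.0039,
         0.0105, 0.0011, 0.0197, 0.0089, 0.0003]
print(bh_reject(pvals))   # [0, 2, 3, 4, 5, 6, 7, 8, 9] -- all but SNP 2
```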
proc multtest pdata=temp(where=(Raw_P ne .))
  sidak hommel fdr out=adiol_peas;
proc print data=adiol_peas(obs=10) noobs;
  var BPC3_NAME Raw_P sid_p hom_p fdr_p;
run;
Read in data set with pre-computed p-values(here: PROC GLM)
Example: Association between 168 SNPs in 20 genes and log10 plasma androstenediol levels
Example: Association between 170 SNPs in 20 genes and log10 plasma IGFBP3 levels
Permutation tests
• Do not assume tests are independent
  – Estimates the distribution of the test statistics under the null, given the correlation among covariates
  – Can result in considerable power gains
    • Correlation means fewer "effective tests"
    • "Can adjust for Keff rather than K tests"
• How does it work?
  – Permute outcome within "exchangeable" sets
    • I.e. subjects with the same outcome distribution under the null
  – If no stratifying variables, permute outcome in the entire data set
  – Else permute within strata
    » Tested variables (here: genetic loci) should be independent of known risk factors not included in the stratifying variables
    » As many known risk factors as practically possible should be included in the strata
Permutation tests (cont.)
– For each replicate (permuted data set)…
  • …record T* and p* for each test, and min(p*)…
– Adjusted global p-value (all null Hs are true)
  • (# T* more extreme than Tobs)/nperm, or
  • (# p* < pobs)/nperm
– Adjusted p-values for each H
  • "Step down procedure"
  • Rank p-values p(1) ≤ p(2) ≤ … ≤ p(K)
  • Adjusted p(1) = global p
  • Adjusted p(2) = global p for tests (2),…,(K)
  • Adjusted p(3) = global p for tests (3),…,(K)
  • Etc…
Permutation tests in practice
• Can use SAS PROC MULTTEST
  – Restricted to binary exposures (dominant, codominant models; can't weight or adjust for covariates)
• Can use %wandel SAS Macro
%macro wandel(
  in=,        /* Input data set */
  vars=,      /* List of variables to be tested */
  nrep=,      /* Number of permutations */
  k=1,        /* Consider k smallest p-values */
  caco=caco); /* Outcome variable name */
%wandel performs permutation adjustments for multiple testing, where each test is the likelihood ratio test from a univariate, unconditional logistic regression.
%wandel compares the sum of the k smallest observed log p-values to the permutation distribution of that sum. (All observations are assumed exchangeable under the null.) For k=1 this is the standard permutation adjustment for the smallest observed p-value. For k>1 this is the Rank Truncated Product Test [Dudbridge and Koeleman, Genet Epidemiol 25(4):360-6].
What wandel does
1. Compute nsnps p-values using observed data
   • LRTs comparing the intercept-only model to single-SNP additive models
   • Calculate Tobs = Σ_{i=1…k} log p_(i)
2. Set counter C to 0
3. For nreps replications:
   • Permute case-control indicators
   • Compute nsnps p-values using permuted data
   • Calculate T = Σ_{i=1…k} log p_(i)
   • If T < Tobs then C = C + 1
4. P-value is C/nreps
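The comparison step of the algorithm can be sketched on its own; here the per-permutation p-value vectors are simply supplied as hypothetical inputs, since generating them would need the full regression machinery:

```python
# Rank-truncated-product permutation adjustment, as in steps 1-4 above.
import math

def rank_truncated_product(pvals, k):
    """Log of the rank truncated product: sum of the k smallest log p-values."""
    return sum(math.log(p) for p in sorted(pvals)[:k])

def perm_adjusted_p(obs_pvals, perm_pvals, k=1):
    """perm_pvals: one vector of per-test p-values per permuted data set."""
    t_obs = rank_truncated_product(obs_pvals, k)
    c = sum(1 for rep in perm_pvals
            if rank_truncated_product(rep, k) <= t_obs)
    return c / len(perm_pvals)

obs = [0.01, 0.2, 0.5]                       # observed per-SNP p-values
perms = [[0.3, 0.4, 0.5], [0.05, 0.6, 0.7],  # hypothetical permutation results
         [0.001, 0.02, 0.9], [0.8, 0.9, 0.95]]
print(perm_adjusted_p(obs, perms, k=1))   # 0.25
```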
How big should nreps be?
If we want to estimate a p-value assumed to be π with precision ε at a confidence level of 1 − α,

N_reps ≥ [ z_{1−α/2} √(π(1 − π)) / ε ]²

Or, as a conservative bound, since π(1 − π) is maximized at π = .5:

N_reps ≥ [ z_{1−α/2} / (2ε) ]²
So if we want to estimate p ca. .05 with precision =.01 with 99% confidence,
Nreps ≥ [2.576×.218/(.01)]2 = 3,154 (1st formula), orNreps ≥ [2.576/(.02)]2 = 16,590 (2nd)
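A small helper reproducing the bounds above (the first value comes out slightly below the slide's 3,154, which rounds √(π(1−π)) to .218; the conservative bound matches 16,590):

```python
# Number of permutations needed to estimate a p-value pi to within
# epsilon with confidence 1 - alpha.
import math

def nreps(pi, eps, z):
    """Exact binomial-variance bound."""
    return math.ceil((z * math.sqrt(pi * (1 - pi)) / eps) ** 2)

def nreps_conservative(eps, z):
    """Worst case, using pi*(1 - pi) <= 1/4."""
    return math.ceil((z / (2 * eps)) ** 2)

z99 = 2.576   # z_{1-alpha/2} for 99% confidence
print(nreps(0.05, 0.01, z99), nreps_conservative(0.01, z99))   # 3152 16590
```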
%wandel(in=UseMe,
        vars=zsnp1 zsnp2 zsnp3 zsnp4 zsnp5
             zsnp6 zsnp7 zsnp8 zsnp9 zsnp10,
        nrep=1000, k=3, caco=d);
Observed Rank Truncated Product and Permutation P-value (nreps=1000)
                Perm
ObsProdP        PValue
1.5573E-10      .001
Example
Cheng et al. (2006) “Common Genetic Variation in IGF1 and Prostate Cancer Risk in the Multiethnic Cohort” JNCI 98:123-134
• 64 SNPs (1/2.4 kb) spanning IGF1 used to characterize LD
• Found four blocks spanning 59 SNPs
• Chose 29 haplotype-tagging SNPs
• Genotyped the htSNPs in 2320 cases and 2290 controls
• Analyzed haplotypes within blocks
• Analyzed 62 SNPs univariately (inferred untyped SNPs from haplotypes)
These two SNPs are perfectly correlated; one SNP was observed in the case-control sample, the other was not.
Permutation corrected p-value (test of global null hypothesis that none of the 29 tag SNPs are associated with prostate cancer) = 0.056.
BH p-value for SNP4=0.058; BY p-value=0.229.