Using Biological Knowledge To Discover Higher Order Interactions
In Genetic Association Studies
Gary K. Chen, Duncan C. Thomas
Department of Preventive Medicine, USC
May 19, 2010
Outline
1. Motivation
2. The algorithm: Incorporating biological priors into an MCMC sampler
3. Simulation 1: Performance of the method
4. Simulation 2: Detecting interactions in a known pathway
5. Application to data from a GWAS
6. Future Extensions
Common diseases have complex etiology
- GWAS have had great success in searching for genetic variants for common diseases
- Recent successes: AMD, BMI/obesity, type 2 diabetes, breast cancer, prostate cancer
- Marginal effects from single-SNP analyses do not explain all heritability. Can we move beyond the low-hanging fruit (e.g. CNVs, rare variants, epistatic interactions)?
- Ideally we would fit a model for all SNPs (and interactions too)
Analyzing all SNPs simultaneously
- Difficult for GWAS: predictors far exceed observations
- Shrinkage methods: LASSO, ridge regression, elastic net, ...
- LASSO method (Tibshirani, J. Royal Stat. Soc. 1996)
  - Penalizes the likelihood based on a tuning parameter λ
  - Produces sparse (interpretable) models
- In GWAS settings:
  - Double exponential (Laplace) prior on β (Wu and Lange, Bioinf. 2009)
  - Normal exponential gamma prior on β (Hoggart et al., PLoS Genet 2008)
  - Fast! Provides the maximum a posteriori (MAP) estimates
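To make the sparsity of the LASSO concrete, here is a minimal sketch of coordinate descent with soft-thresholding. It assumes a squared-error loss, (1/2n)·||y − Xβ||² + λ·||β||₁, rather than the logistic likelihood a case-control GWAS would use, and the function names and toy data are illustrative only.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimize (1/2n)*||y - X beta||^2 + lam*||beta||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0   # covariance of column j with the partial residual
            norm = 0.0  # squared norm of column j
            for i in range(n):
                pred_others = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_others)
                norm += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (norm / n) if norm > 0 else 0.0
    return beta


# Toy data (hypothetical): y depends on the first column only.
X = [[1, 0.0], [2, 0.1], [3, -0.1], [4, 0.0]]
y = [2, 4, 6, 8]
beta = lasso_coordinate_descent(X, y, lam=0.1)
```

The irrelevant second coefficient is set exactly to zero, while the first is shrunk slightly below its least-squares value; this is the "sparse (interpretable)" behavior the slide refers to.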
Fully Bayesian methods for variable selection
- Bayesian model averaging assesses uncertainty
  - Probabilistically proposes sub-models from a posterior distribution
  - Summarize statistics of parameters averaged across all proposed models
  - Controls for multiple comparisons
- Disadvantage: Computationally expensive
  - P(β) has normal distribution for conjugacy
  - "Spike and slab" ensures parsimony
  - Example: Stochastic Search Variable Selection via Gibbs sampling (George and McCulloch, JASA 93)
    - $\beta_j \mid \gamma_j \sim (1 - \gamma_j)\,N(0, \tau_j^2) + \gamma_j\,N(0, c_j^2 \tau_j^2)$
    - e.g., $f(\gamma) = \prod_j p_j^{\gamma_j} (1 - p_j)^{(1 - \gamma_j)}$
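The spike-and-slab mixture can be illustrated by sampling from it. The sketch below draws an inclusion indicator γ_j and then a coefficient from either the narrow "spike" N(0, τ_j²) or the wide "slab" N(0, c_j²τ_j²), and evaluates the log of the independent Bernoulli prior f(γ). The numeric settings are hypothetical and chosen only to make the two components visibly different.

```python
import math
import random


def draw_spike_slab(p_j, tau_j, c_j, rng):
    """Draw (gamma_j, beta_j): gamma_j ~ Bernoulli(p_j), then beta_j from spike or slab."""
    gamma = 1 if rng.random() < p_j else 0
    sd = c_j * tau_j if gamma else tau_j  # standard deviation, i.e. sqrt of the variance
    return gamma, rng.gauss(0.0, sd)


def log_prior_gamma(gamma, p):
    """log f(gamma) = sum_j [gamma_j*log(p_j) + (1 - gamma_j)*log(1 - p_j)], with 0 < p_j < 1."""
    return sum(g * math.log(pj) + (1 - g) * math.log(1 - pj) for g, pj in zip(gamma, p))


rng = random.Random(2010)
# Hypothetical settings: spike sd 0.01, slab sd 1.0, prior inclusion probability 0.5.
draws = [draw_spike_slab(0.5, 0.01, 100.0, rng) for _ in range(2000)]
```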
Searching for interactions
- SSVS via Gibbs sampling
  - For 1000 SNPs, length of γ: $500{,}500 = 1000 + \frac{(1000)(999)}{2}$
  - Iterating through each parameter is slow
- Reversible jump MCMC
  - In contrast to SSVS, the "model" is $M = \{j : \gamma_j \neq 0\}$
  - Model size changes at each iteration (similar to stepwise regression)
- Informative priors
  - Incorporating biological information at the level of each variable
  - These priors can be used towards a proposal function in a Metropolis-Hastings algorithm
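The length of γ quoted above is just the number of main effects plus the number of unordered SNP pairs. A short check, with a hypothetical helper name:

```python
def n_candidate_terms(n_snps):
    """Main effects plus all pairwise interactions: n + n(n-1)/2."""
    return n_snps + n_snps * (n_snps - 1) // 2


total = n_candidate_terms(1000)
```

Because this count grows quadratically, a Gibbs sweep over every element of γ becomes costly quickly, which motivates proposing changes to a small model set instead.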
2. The algorithm: Incorporating biological priors into an MCMC sampler
Posterior density as a two-level hierarchical model
- Posterior density:
  - $L(Y \mid \beta, X, M)\,P(\beta \mid \pi, \tau, \sigma, M, Z, A)$
- First level as likelihood: a GLM at the subject level
  - $\mathrm{logit}(P(Y = 1 \mid \beta, X)) \sim \beta_0 + \sum_{k=1}^{K} \beta_k X_k$
  - X can be G, E, GxG, GxE, etc.
- Second level as prior: $\beta_k$ as mixed model
  - $\beta_k \sim \pi^T Z_k + \phi_k + \theta_k$
Prior mean on variable in Z
Table: The Z matrix
Intercept  Conservation  Missense  eQTL
1          20            0         5
1          10            1         0.01
1          5             0         1
1          10            1         4.1
1          5             0         1.4
- $\beta_k \sim \pi^T Z_k + \phi_k + \theta_k$
- $\hat{\pi}$: regress $\hat{\beta}$ on $Z$, $\pi \sim N(\hat{\pi}, \Sigma_\pi)$
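As a small illustration of the prior mean term, the sketch below evaluates π^T Z_k for each row of the Z table. The values in `pi_hat` are hypothetical placeholders; in the method they would come from regressing the estimated coefficients on Z.

```python
# Rows of the Z matrix: intercept, conservation, missense, eQTL.
Z = [
    [1, 20, 0, 5],
    [1, 10, 1, 0.01],
    [1, 5, 0, 1],
    [1, 10, 1, 4.1],
    [1, 5, 0, 1.4],
]

# Hypothetical second-level coefficients, one per column of Z.
pi_hat = [0.0, 0.01, 0.1, 0.02]


def prior_mean(pi, z_row):
    """Linear predictor pi^T Z_k for one variable."""
    return sum(p * z for p, z in zip(pi, z_row))


prior_means = [prior_mean(pi_hat, row) for row in Z]
```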
Variable connectivity in A matrix
Table: Example A matrix for SNP variables
Variable  1  2  3
1         0  1  0
2         1  0  1
3         0  1  0
One approach for populating the A matrix
Table: The Z matrix
     Intercept  Conservation  Missense  eQTL
  →  1          20            0         5
     1          10            1         0.01
  →  1          5             0         1
     1          10            1         4.1
     1          5             0         1.4
- Define entry $A_{1,3}$ as $\mathrm{corr}(Z_{1,-}, Z_{3,-})$, dichotomize A
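A minimal sketch of this construction: compute the Pearson correlation between each pair of rows of Z and set the adjacency entry to 1 when it clears a cutoff. The cutoff of 0.9 is an assumption for illustration; the slides do not state the threshold used.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (both must have nonzero variance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def adjacency_from_rows(Z, threshold):
    """Symmetric 0/1 matrix: 1 where the row correlation is at least `threshold`."""
    m = len(Z)
    A = [[0] * m for _ in range(m)]
    for j in range(m):
        for k in range(j + 1, m):
            if pearson(Z[j], Z[k]) >= threshold:
                A[j][k] = A[k][j] = 1
    return A


Z = [
    [1, 20, 0, 5],
    [1, 10, 1, 0.01],
    [1, 5, 0, 1],
    [1, 10, 1, 4.1],
    [1, 5, 0, 1.4],
]
A = adjacency_from_rows(Z, threshold=0.9)  # threshold is a hypothetical choice
```

The diagonal is left at zero so that a variable is never counted as its own neighbor.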
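$\phi_k$ as mean across k's neighbors
Table: Example A matrix for SNP variables
Variable  1  2  3
1         0  1  0
2         1  0  1
3         0  1  0
- $\beta_k \sim \pi^T Z_k + \phi_k + \theta_k$
- $\phi_k \sim N\!\left(\bar{\phi}_{-k}, \frac{\tau^2}{\nu_k}\right)$
- $\bar{\phi}_{-k} = \frac{\sum_{j=1}^{m} \phi_j A_{jk}}{\sum_{j=1}^{m} A_{jk}}$, where $\nu_k$ is the number of neighbors of variable k
- We set $\phi_j = \hat{\beta}_j$
- Example: If $\hat{\beta} = (0.2, 0.5, 0.4)$, variable 2 has neighbors 1 and 3, so $\phi_2 = (0.2 + 0.4)/2 = 0.3$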
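The worked example above can be reproduced directly from the example A matrix. This is a small sketch with a hypothetical function name; indices are zero-based, so variable 2 is index 1.

```python
A = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
beta_hat = [0.2, 0.5, 0.4]


def neighbor_mean(k, phi, A):
    """Return (mean of phi over the neighbors of k, number of neighbors nu_k)."""
    m = len(phi)
    nu_k = sum(A[j][k] for j in range(m))
    if nu_k == 0:
        raise ValueError("variable has no neighbors; the mean is undefined")
    total = sum(phi[j] * A[j][k] for j in range(m))
    return total / nu_k, nu_k


phi_bar_2, nu_2 = neighbor_mean(1, beta_hat, A)  # variable 2 in the slide's numbering
```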
How the parameters fit together
- $L(Y \mid \beta, X, M)\,P(\beta \mid Z, \pi, A, \tau, \sigma, M)$
A reversible jump MCMC algorithm
- Propose a swap, addition or deletion of a variable
- Perform reversible jump Metropolis-Hastings step comparing posterior probabilities
  - $r = \frac{L(Y \mid \beta', X, M')\,P(\beta' \mid Z, \pi, A, \tau, \sigma, M')\,P(M \to M')}{L(Y \mid \beta, X, M)\,P(\beta \mid Z, \pi, A, \tau, \sigma, M)\,P(M' \to M)}$
  - Accept move with probability min(1, r)
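The two steps above can be sketched as follows. This is an illustration under simplifying assumptions, not the authors' sampler: the move type and the variables involved are chosen uniformly at random here (the slides use prior-informed proposals), and the posterior and proposal terms of r are passed in on the log scale by the caller.

```python
import math
import random


def propose_move(model, n_vars, rng):
    """Propose an addition, deletion or swap on a model stored as a set of indices."""
    outside = [j for j in range(n_vars) if j not in model]
    moves = []
    if outside:
        moves.append("add")
    if model:
        moves.append("delete")
    if model and outside:
        moves.append("swap")
    move = rng.choice(moves)
    new_model = set(model)
    if move in ("delete", "swap"):
        new_model.remove(rng.choice(sorted(model)))
    if move in ("add", "swap"):
        new_model.add(rng.choice(outside))
    return move, new_model


def accept_move(log_post_new, log_post_old, log_proposal_ratio, rng):
    """Accept with probability min(1, r), where log r is the sum of the log-scale terms."""
    log_r = log_post_new - log_post_old + log_proposal_ratio
    if log_r >= 0:
        return True
    return rng.random() < math.exp(log_r)


rng = random.Random(0)
move, candidate = propose_move({0, 3}, n_vars=6, rng=rng)
```

Working on the log scale avoids numerical underflow when likelihoods from thousands of subjects are multiplied together.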
Model transition proposal density
- Suppose model M′ has 1 newly proposed variable:
  - $P(M \to M') = \Phi^{-1}(z_k)$
  - $z_k \sim N(\mu_k - \mu_{baseline}, 1)$
- The variable-specific tuning parameter $\mu_k$
  - A function of the components of β's prior standardized by their residual variances
  - $\mu_k = \frac{|\pi^T Z_k + \bar{\phi}_{-k}|}{\sigma^2 + \frac{\tau^2}{\nu_k}}$
- Weak empirical support for priors leads to small numerator, large denominator
Model transition proposal density
- Suppose model M′ has 1 newly proposed variable:
  - $P(M \to M') = \Phi^{-1}(z_k)$
  - $z_k \sim N(\mu_k - \mu_{baseline}, 1)$
- The global penalty tuning parameter
  - Emulate the BIC
  - $\mathrm{BIC}(M') - \mathrm{BIC}(M) = \chi_1(\ln(n))$
  - Probability of accepting M′ is $F_{\chi}^{-1}(\ln(n))$
  - $\mu_{baseline} = \Phi(F_{\chi}^{-1}(\ln(n)))$
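One way to read the BIC emulation, offered as an assumption rather than the authors' exact definition: adding one variable lowers the BIC only when the 1-df likelihood ratio statistic exceeds ln(n), so the corresponding acceptance probability is the upper tail of a 1-df chi-square distribution at ln(n). The sketch below computes that tail using the identity P(X > x) = erfc(sqrt(x/2)) for X with 1 degree of freedom.

```python
import math


def chi2_1df_upper_tail(x):
    """P(X > x) for X ~ chi-square with 1 degree of freedom."""
    if x <= 0:
        return 1.0
    return math.erfc(math.sqrt(x / 2.0))


def bic_style_acceptance(n):
    """Chance that a null variable's 1-df statistic exceeds the BIC penalty ln(n)."""
    return chi2_1df_upper_tail(math.log(n))


# Sample size of the case-control set used in the simulations.
p_accept = bic_style_acceptance(8000)
```

Under this reading the penalty tightens as n grows, so larger studies admit fewer variables without supporting evidence.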
3. Simulation 1: Performance of the method
Using external information to enhance power and specificity
- Disease model: 4 GxG interactions jointly cause disease through 4 endophenotypes
  - Genotypes simulated for 14 independent SNPs
  - $y_{ik} = (1 - b)\,N(s_{ia} \cdot s_{ib}, 1) + b\,U(0, 1)$
  - $b \sim \mathrm{Bernoulli}(p)$, p is proportion of noise
  - 24 endophenotypes y used only in the prior
- Disease status determined using a logistic model
  - $\mathrm{logit}(P(Y_i = 1)) = \beta_0 + \beta_1 y_{i01} + \beta_2 y_{i02} + \beta_3 y_{i34} + \beta_4 y_{i35}$
- First 8000 persons reserved as case-control dataset, remaining 2000 for constructing priors
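The endophenotype model above can be sketched in a few lines. The genotype coding (0, 1 or 2 minor alleles drawn under Hardy-Weinberg proportions) and the minor allele frequency of 0.3 are assumptions for illustration; the slides do not specify them.

```python
import random


def simulate_genotype(maf, rng):
    """Count of minor alleles (0, 1 or 2), assuming two independent allele draws."""
    return sum(rng.random() < maf for _ in range(2))


def simulate_endophenotype(s_a, s_b, p_noise, rng):
    """y = (1 - b)*N(s_a*s_b, 1) + b*U(0, 1) with b ~ Bernoulli(p_noise)."""
    b = rng.random() < p_noise
    if b:
        return rng.uniform(0.0, 1.0)   # pure noise, unrelated to genotype
    return rng.gauss(s_a * s_b, 1.0)   # signal centered on the GxG product


rng = random.Random(14)
people = []
for _ in range(1000):
    s_a = simulate_genotype(0.3, rng)  # hypothetical minor allele frequency
    s_b = simulate_genotype(0.3, rng)
    people.append((s_a, s_b, simulate_endophenotype(s_a, s_b, 0.2, rng)))
```

Raising `p_noise` toward 1 weakens the link between the SNP pair and the endophenotype, which is how the less informative priors in this simulation are produced.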
Constructing the Z and the A matrices
- Z matrix
  - Measures correlation between a model variable and each endophenotype among the 2000 individuals in the prior dataset
  - $Z_{kq} = \mathrm{corr}(g_k, y_q)$
- A matrix
  - Measures similarity between two variables by comparing correlation profiles in Z
  - $A_{jk} = \mathrm{corr}(Z_{jq}, Z_{kq})$
Question 1: How do the priors affect power and specificity?
- The A matrix contains information across all 24 endophenotypes
- Set up 3 variants of the original Z matrix
  - 4 causal endophenotypes only (noise parameter p = 0)
  - 4 intermediate endophenotypes only (noise parameter p = 0.2)
  - 4 weakly correlated endophenotypes only (noise parameter p = 0.8)
- Models tested: both A and Z, no A or Z, A only, Z only (with 3 variants)
- At RR=1.5, all prior models perform very well
- At RR=1.4, prior models with A, Z, or both outperform others
- At RR=1.3, prior models with A, Z, or both have > 5% power
- At RR=1.2, fully informative prior still retains 80% power
- At RR=1.1, all prior models perform poorly (∼ 55% power)
Question 2: How do the priors affect posterior estimates (shrinkage)?
- Posterior estimates of β vs MLE
- Posterior estimates of SE of β vs MLE
Question 3: How do the priors improve rankings?
- 6,441 interactions tested. 4 causal.
- 513,591 interactions tested. 4 causal.
Summary of simulation
- Sensitivity analysis
  - All methods perform well at high RRs
  - Informative priors improve power at lower RRs but not at extremely low RRs
- Like LASSO, shrinkage improves interpretability
- Model averaging can improve robustness of rankings
4. Simulation 2: Detecting interactions in a known pathway
Discovering interactions in a known pathway: Folate
Simulated data set
- 14 genes, 2 environmental variables
- 8000 individuals in case-control data, remaining 2000 for constructing priors
- Used a pathway simulation program to generate steady-state concentrations
  - Reed et al., J Nutr. 2006 Oct;136(10):2653-61
  - Enzyme kinetics parameters (Km, Vmax) genotype specific
- 3 mechanisms believed to be related to disease etiology
  - Homocysteine concentration
  - Pyrimidine synthesis
  - Purine synthesis
Estimates of π
- Construct Z and A in same manner as previous simulation:
  - Z stores genotype-metabolite correlations
  - A stores dichotomized correlations between rows of Z
- True log relative risk: 0.18 (RR=1.2)

Simulated      Second-level coefficients π
mechanism      homocysteine   pyrimidine     purine
homocysteine   0.18(0.13)     -0.09(0.536)   0.002(0.38)
pyrimidine     -0.04(0.22)    0.22(0.066)    -0.01(0.06)
purine         -0.01(0.36)    0.16(0.327)    0.19(0.07)
Comparison of BMA results to stepwise regression
                    Pyrimidine synthesis
     Interaction    BF      MLE p-value
     FTD*MAT-II     15      0.038
     FTD*MTHFR      20      0.046
     MTCH*MS        534     0.006
     PGT*MS         14      0.018
  →  SHMT*CBS       1254    0.133
  →  SHMT*Fol       2324    0.036
     TS*MTHFR       227     0.022
  →  TS*SHMT        1091    N/S
Pyrimidine synthesis
- SHMT*CBS, SHMT*Fol, SHMT*TS
Comparison of BMA results to stepwise regression
                    Purine synthesis
     Interaction    BF      MLE p-value
  →  MTCH*MS        1130    0.008
  →  MTCH*PGT       1416    0.026
  →  PGT*CBS        1022    0.069
  →  PGT*MS         2851    0.007
  →  SHMT*Fol       1398    0.022
     SHMT*MAT-II    646     0.012
     TS*MTHFR       57      0.024
Purine synthesis
- MTCH*MS, MTCH*PGT, PGT*CBS, PGT*MS, SHMT*Fol
Comparison of BMA results to stepwise regression
                    Homocysteine
     Interaction    BF      MLE p-value
     CBS*MAT-II     77      0.045
  →  CBS*Met        1072    N/S
     FTD*MAT-II     38      0.045
     FTD*MTHFR      213     0.015
  →  MS*Met         1129    N/S
     MTCH*MS        978     0.006
     PGT*MS         75      0.044
     TS*MTHFR       41      0.022
Homocysteine levels
- CBS*Met, MS*Met
Summary of folate pathway simulation
- Pathway knowledge can inform model search
- Simulated three plausible disease mechanisms
- Effect of causal metabolite on disease revealed in corresponding element of π
- Revealed plausible interactions not found through a stepwise regression
5. Application to data from a GWAS
Using gene annotations to inform a search for interactions
- Proof of concept: GWAS of breast cancer
- Publicly available data from NCI (https://caintegrator.nci.nih.gov/cgems/)
- 1,145 cases and 1,142 controls of European ancestry
- The 22 Gene Ontology terms from Biological Process used to define priors in A and Z
- Included 6,078 SNPs, where each SNP had a GO annotation and had the lowest p-value in its gene
Top 10 interactions found
                 Non-inf prior         Inf prior
Interaction      β(SE)          BF     β(SE)          BF
PARK2*SORCS1     0.22(0.06)     1e4    0.27(0.06)     5e4
AK5*ARHGAP26     0.16(0.05)     427    0.17(0.05)     903
FGFR2*MAML2      -0.11(0.04)    1      -0.16(0.05)    686
SHC3*KIF13B      N/A            N/A    0.17(0.05)     621
PCLO*ME3         N/A            N/A    0.18(0.05)     528
CNGA3*CNN1       -0.16(0.05)    41     -0.17(0.05)    462
FGFR2*CDT1       N/A            N/A    -0.16(0.05)    445
SHC3*CXCL16      N/A            N/A    -0.18(0.05)    403
FGFR2*ABCA1      -0.1(0.05)     158    -0.11(0.05)    268
CYP2J2*SORCS1    -0.11(0.05)    74     -0.14(0.05)    266
FGFR2*SCG5       N/A            N/A    0.21(0.05)     235
Enrichment analysis
- Are the top interactions (BF > 100) enriched for certain GO terms?
- Compute empiric p-value for enrichment
  - For each iteration, permute within bins representative of non-independence in observed interactions
  - Pool bins, compute frequency of a GO term in the pool
  - p-value: number of iterations in which the permuted frequency exceeded the observed frequency, divided by 1 million
- Enriched terms: biological regulation (p=.008), growth (p=1e−6), metabolic process (p=.008), and regulation of biological process (p=.003).
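The counting rule for the empirical p-value can be sketched as below. The helper names and the data layout (a dictionary mapping a bin identifier to a list of labels) are hypothetical, and how the bins are defined is left to the caller because the slides do not spell it out.

```python
import random


def shuffle_within_bins(labels_by_bin, rng):
    """Return a copy in which labels are permuted inside each bin, never across bins."""
    return {b: rng.sample(labels, len(labels)) for b, labels in labels_by_bin.items()}


def empirical_p_value(observed_freq, permuted_freqs):
    """Share of permutation iterations whose frequency exceeded the observed frequency."""
    exceed = sum(1 for f in permuted_freqs if f > observed_freq)
    return exceed / len(permuted_freqs)


rng = random.Random(22)
bins = {"bin1": ["growth", "other", "other"], "bin2": ["growth", "growth", "other"]}
permuted = shuffle_within_bins(bins, rng)
p = empirical_p_value(0.3, [0.1, 0.4, 0.2, 0.5])
```

With 1 million iterations the smallest nonzero value this estimator can report is 1e−6.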
6. Future Extensions
Incorporate gene-expression data into GWAS analyses
- Developing priors
  - Should be more informative (e.g. empirical) and granular (e.g. SNP level) than GO
  - Obtain genotype-expression paired data: HapMap?
  - Apply WGCNA to infer pathway modules
  - Genotype-module correlations used in Z matrix
- Incorporate more advanced MCMC techniques
  - Evolutionary Monte Carlo
  - Multiple-try Metropolis
  - Brute-force search for MAP. Use MAP for initial values?
Acknowledgements
- James Baurley
- David Conti
- Angela Presson (thanks in advance!)
- Funding: R01 ES016813 and R01 ES015090.