Heritability estimation in high dimensional mixed models
Anna Bonnet
Supervisors: Elisabeth Gassiat and Céline Lévy-Leduc
Journées MAS Grenoble
29 August 2016
Goals and Definitions
Heritability
Heritability of a biological trait: proportion of phenotypic variance explained by genetic factors.
Estimation of heritability in human genetics: better understanding of complex diseases, further research into genetic causes...
Estimation of heritability in animal and plant genetics: determination of optimal genotypes to produce a valuable resource.
Anna Bonnet (AgroParisTech) Estimation of heritability 29 août 2016 1 / 20
Examples of data sets - Quantitative traits
- Vector of observations:
\[ Y = \begin{pmatrix} 162 \\ 181 \\ \vdots \\ 175 \end{pmatrix} \]
- Predictors:
\[ X = \begin{pmatrix} 17 \\ 32 \\ \vdots \\ 25 \end{pmatrix} \]
- Matrix of SNPs:
\[ W = \begin{pmatrix} 0 & 1 & \dots & 1 \\ 0 & 2 & \dots & 0 \\ \dots & \dots & \dots & \dots \\ 1 & 1 & \dots & 2 \end{pmatrix} \]
Framework of genetic studies: $n \sim 2000$ individuals, $N \sim 500000$ SNPs.
Model
Sparse Linear Mixed Model
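\[ Y = X\beta + Zu + e \]
where
- $Y$ is an $n \times 1$ vector of observations
- $X\beta$ are the fixed effects
- $Z$ is an $n \times N$ random matrix, the centered and normalized version of $W$
- $u$ and $e$ are the random effects:
\[ u_i \overset{\text{i.i.d.}}{\sim} (1-q)\,\delta_0 + q\,\mathcal{N}(0, \sigma_u^{\star 2}) \ \text{ for all } i, \qquad e \sim \mathcal{N}\left(0, \sigma_e^{\star 2}\,\mathrm{Id}_{\mathbb{R}^n}\right) \]
▸ Estimation of
\[ \eta^\star = \frac{N q \sigma_u^{\star 2}}{N q \sigma_u^{\star 2} + \sigma_e^{\star 2}}. \]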
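To make the roles of $q$, $\sigma_u^{\star 2}$ and $\sigma_e^{\star 2}$ concrete, here is a minimal Python sketch that evaluates the heritability formula and draws the random effects from the spike-and-slab mixture. The function names and the numerical values are illustrative assumptions, not part of the talk or of any package.

```python
import random

def heritability(N, q, sigma_u2, sigma_e2):
    """eta* = N q sigma_u^2 / (N q sigma_u^2 + sigma_e^2)."""
    genetic = N * q * sigma_u2
    return genetic / (genetic + sigma_e2)

def draw_random_effects(N, q, sigma_u2, rng):
    """Draw u_1..u_N i.i.d. from (1 - q) * delta_0 + q * N(0, sigma_u^2)."""
    sd = sigma_u2 ** 0.5
    # With probability q the component is Gaussian, otherwise it is exactly 0.
    return [rng.gauss(0.0, sd) if rng.random() < q else 0.0 for _ in range(N)]

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, only for reproducibility of the demo
    u = draw_random_effects(N=1000, q=0.1, sigma_u2=0.01, rng=rng)
    print("non-null components:", sum(1 for v in u if v != 0.0))
    print("eta* =", heritability(N=1000, q=0.1, sigma_u2=0.01, sigma_e2=1.0))
```

With these illustrative values the genetic variance $N q \sigma_u^{\star 2}$ equals the noise variance, so the heritability is one half.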
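Heritability estimator
Up to projecting $Y$ onto $(\operatorname{Im} X)^{\perp}$, we focus on the model
\[ Y = Zu + e. \]
In the case $q = 1$ (no sparsity),
\[ Y \mid Z \sim \mathcal{N}\left(0,\ \eta^\star \sigma^{\star 2}\,\frac{ZZ'}{N} + (1-\eta^\star)\,\sigma^{\star 2}\,\mathrm{Id}_{\mathbb{R}^n}\right), \]
where $\sigma^{\star 2} = N\sigma_u^{\star 2} + \sigma_e^{\star 2}$ is the total variance.
$\hat\eta$ is defined as the maximizer of the log-likelihood conditionally on $Z$:
\[ L_n(\eta) = -\log\left(\frac{1}{n}\sum_{i=1}^{n} \frac{\widetilde{Y}_i^2}{\eta(\lambda_i - 1) + 1}\right) - \frac{1}{n}\sum_{i=1}^{n} \log\big(\eta(\lambda_i - 1) + 1\big), \]
where $\widetilde{Y} = U'Y$ and $U\,\frac{ZZ'}{N}\,U' = \operatorname{diag}(\lambda_1, \dots, \lambda_n)$.
Method implemented in the R package HiLMM.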
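The criterion $L_n(\eta)$ only involves the rotated observations and the eigenvalues $\lambda_i$, so it can be evaluated in a few lines once these are available. The following Python sketch is an illustration under stated assumptions: it takes `y_tilde` and `lam` as already computed inputs, and it maximizes over a regular grid on $(0,1)$, which is a simplification chosen here and not necessarily what HiLMM does.

```python
import math

def profile_log_likelihood(eta, y_tilde, lam):
    """L_n(eta) as defined above, for rotated data y_tilde and eigenvalues lam."""
    n = len(y_tilde)
    # eta * (lambda_i - 1) + 1 is positive for eta in (0, 1) and lambda_i >= 0.
    denom = [eta * (l - 1.0) + 1.0 for l in lam]
    quad = sum(y * y / d for y, d in zip(y_tilde, denom)) / n
    log_det = sum(math.log(d) for d in denom) / n
    return -math.log(quad) - log_det

def estimate_eta(y_tilde, lam, grid_size=999):
    """Grid maximizer of L_n over (0, 1); the grid search is an assumption."""
    grid = [(k + 1) / (grid_size + 1) for k in range(grid_size)]
    return max(grid, key=lambda eta: profile_log_likelihood(eta, y_tilde, lam))

if __name__ == "__main__":
    # Toy input: squared rotated data equal to their variance profile at eta = 0.5.
    lam = [0.2, 0.7, 1.5, 2.4]
    y_tilde = [(0.5 * (l - 1.0) + 1.0) ** 0.5 for l in lam]
    print(estimate_eta(y_tilde, lam))
```

In the toy input the squared data match the variance profile at $\eta = 0.5$ exactly, and by Jensen's inequality $L_n$ is then maximal at that value, which gives a quick sanity check of the implementation.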
Theoretical results
Theoretical result
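Theorem
Let $Y = (Y_1, \dots, Y_n)'$ satisfy the sparse LMM with $\eta^\star > 0$ and assume that the random variables $Z_{i,j}$ are i.i.d. $\mathcal{N}(0,1)$.
Then for any $q \in (0,1]$, as $n, N \to \infty$ such that $n/N \to a > 0$,
\[ \sqrt{n}\,(\hat\eta - \eta^\star) \]
converges in distribution to a centered Gaussian random variable with variance
\[ \tau^2(a, \eta^\star, q) = \frac{2}{\widetilde\sigma^2(a, \eta^\star)} + 3\,\frac{a^2 \eta^{\star 2}}{\widetilde\sigma^4(a, \eta^\star)} \left(\frac{1}{q} - 1\right) S(a, \eta^\star), \]
where $\widetilde\sigma^2(a, \eta^\star)$ and $S(a, \eta^\star)$ are positive functions for which closed-form expressions are available.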
Numerical experiments
Simulation study
[Figure: boxplots of $\hat\eta$ (vertical axis from 0 to 1) in two panels. Left panel, "Influence of sparsity q": q = 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1. Right panel, "Influence of a = n/N": a = 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1.]
Figure: Boxplots of $\hat\eta$ for different values of $q$ when $a = 0.01$ (left) and for different values of $a = n/N$ when $q = 1$ (right).
▸ When $a$ decreases, that is $N \gg n$, the variance of our heritability estimator increases.
▸ The presence of null components ($q < 1$) does not influence the estimations.
Variable selection
Variable selection steps
Step 1: Empirical correlation computation (SIS, Fan & Lv (2008)). It consists in reducing the number of relevant columns of $Z$ by trying to remove those associated with null components in the vector $u$. The matrix reduced to the most significant columns is denoted $Z_{red}$.
Step 2: The LASSO criterion. It consists in minimizing with respect to $u$ the following criterion:
\[ \mathrm{Crit}_\lambda(u) = \|Y - Z_{red}\,u\|_2^2 + \lambda \|u\|_1 \]
The choice of $\lambda$ is made according to the stability selection method (Meinshausen & Bühlmann, 2010).
▸ R package EstHer: variable selection + heritability estimation + computation of standard errors
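The two steps can be sketched in plain Python as follows. This is an illustration under assumptions, not the EstHer implementation: the screening keeps a fixed number `keep` of columns ranked by absolute inner product with $Y$ (proportional to the empirical correlation when the columns of $Z$ are centered and normalized), the LASSO is solved by cyclic coordinate descent for a fixed $\lambda$, and the stability selection step that chooses $\lambda$ is not reproduced. All function names are hypothetical.

```python
import math

def sis_screen(Z, y, keep):
    """Step 1: indices of the `keep` columns of Z with largest |<Z_j, y>|."""
    n, N = len(Z), len(Z[0])
    scores = [abs(sum(Z[i][j] * y[i] for i in range(n))) for j in range(N)]
    ranked = sorted(range(N), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:keep])

def soft_threshold(x, t):
    """sign(x) * max(|x| - t, 0)."""
    return math.copysign(max(abs(x) - t, 0.0), x)

def lasso_criterion(Z, y, u, lam):
    """Crit_lambda(u) = ||y - Z u||_2^2 + lam * ||u||_1."""
    resid = [y[i] - sum(Z[i][j] * u[j] for j in range(len(u)))
             for i in range(len(y))]
    return sum(r * r for r in resid) + lam * sum(abs(c) for c in u)

def lasso_cd(Z, y, lam, n_iter=100):
    """Step 2: minimize Crit_lambda by cyclic coordinate descent."""
    n, p = len(Z), len(Z[0])
    u = [0.0] * p
    col_sq = [sum(Z[i][j] ** 2 for i in range(n)) for j in range(p)]
    resid = list(y)  # residual y - Z u, kept up to date
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Inner product of column j with the residual that excludes u_j.
            rho = sum(Z[i][j] * resid[i] for i in range(n)) + col_sq[j] * u[j]
            # No 1/2 factor in the criterion, hence the threshold lam / 2.
            new = soft_threshold(rho, lam / 2.0) / col_sq[j]
            delta = new - u[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= Z[i][j] * delta
                u[j] = new
    return u

if __name__ == "__main__":
    Z = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    y = [3.0, 0.2]
    cols = sis_screen(Z, y, keep=2)
    Z_red = [[row[j] for j in cols] for row in Z]
    u_hat = lasso_cd(Z_red, y, lam=1.0)
    print(cols, u_hat, lasso_criterion(Z_red, y, u_hat, 1.0))
```

Pure Python lists keep the sketch dependency-free; at the scale of the talk ($n \sim 2000$, $N \sim 500000$) a vectorized or compiled implementation would be needed.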
Choice of the threshold in the stability selection step
▸ Each choice of threshold gives a set of selected variables, and then an estimated value of the heritability.
[Figure: two panels, "100 causal SNPs" (left) and "10000 causal SNPs" (right). Horizontal axis: threshold, from 0.60 to 0.90. Vertical axis: $|\eta^\star - \hat\eta|$. Legend entries: 0.4 to 0.8 (left panel) and 0.2 to 0.8 (right panel).]
Figure: Absolute difference $|\eta^\star - \hat\eta|$ for thresholds from 0.6 to 0.9 and for 100 (left) and 10000 (right) causal SNPs.
▸ For 100 causal SNPs, there is a range of thresholds between 0.7 and 0.85 which provide a good estimation of the heritability, with 0.78 as the optimal threshold.
▸ For 10000 causal SNPs, no such threshold exists.
First results of the variable selection method
[Figure: boxplots of $\hat\eta$ (vertical axis from 0 to 1) for "selection" and "no selection", in two panels: "100 causal SNPs" and "10000 causal SNPs".]
Figure: Estimation of $\eta^\star$ using our variable selection method with threshold 0.78 and using no variable selection.
▸ For 100 causal SNPs, selecting variables substantially reduces the variance.
▸ For 10000 causal SNPs, selecting variables introduces a substantial bias.
Results for different thresholds
[Figure: grid of six panels with $n = 2000$ and $N = 10^5$. Columns: $\eta^\star = 0.4$, $\eta^\star = 0.5$, $\eta^\star = 0.6$. Rows: 100 causal SNPs and 10000 causal SNPs. Horizontal axis: "without" (no selection), then thresholds 0.71, 0.73, 0.75, 0.77, 0.79, 0.81, 0.83, 0.85. Vertical axis from 0 to 1.]
Figure: Estimation of the heritability with 95% confidence intervals, obtained without selection and with selection for thresholds between 0.7 and 0.85.
▸ 100 causal SNPs: two close thresholds provide similar estimations.
▸ 10000 causal SNPs: a small change in the threshold causes substantial differences in the estimations.
A criterion to decide whether to apply the variable selection or not
Table: Mean value of the number (and proportion) of overlapping confidence intervals for 16 thresholds from 0.7 to 0.85.
η*     100 causal SNPs    1000 causal SNPs    10000 causal SNPs
0.4    12.2 (0.76)        6.6 (0.41)          6.9 (0.43)
0.5    14.9 (0.93)        6.6 (0.41)          6.3 (0.39)
0.6    16 (1)             7.8 (0.48)          7.2 (0.45)
▸ Criterion: if the mean proportion of overlapping confidence intervals is greater than 0.6, we perform variable selection with threshold 0.78; otherwise we estimate the heritability directly.
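The decision rule can be sketched in Python as below. The exact way overlaps are counted is not spelled out on the slide, so this sketch makes an explicit assumption: for each threshold it counts how many of the confidence intervals (its own included, so that the maximum is 16 for 16 thresholds, as in the table) intersect its interval, then averages the resulting proportions. The function names are hypothetical; the cutoff 0.6 is the one stated in the criterion.

```python
def mean_overlap_proportion(intervals):
    """Mean, over thresholds, of the proportion of intervals overlapping each one.

    `intervals` is a list of (lower, upper) confidence bounds, one per threshold.
    Assumption: an interval counts as overlapping itself.
    """
    m = len(intervals)
    counts = [
        sum(1 for (lo2, hi2) in intervals if lo1 <= hi2 and lo2 <= hi1)
        for (lo1, hi1) in intervals
    ]
    return sum(counts) / (m * m)

def use_variable_selection(intervals, cutoff=0.6):
    """True if selection should be applied, False to estimate directly."""
    return mean_overlap_proportion(intervals) > cutoff

if __name__ == "__main__":
    stable = [(0.30, 0.50), (0.40, 0.60), (0.45, 0.70)]
    unstable = [(0.00, 0.10), (0.20, 0.30), (0.40, 0.50)]
    print(use_variable_selection(stable), use_variable_selection(unstable))
```

Intervals that all intersect give a proportion of 1, while pairwise disjoint intervals give 1/m, which mirrors the contrast between the sparse and non-sparse columns of the table.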
Application of the criterion
[Figure: boxplots of the heritability estimates (vertical axis from 0 to 1) for the three methods "criterion", "selection" and "no selection", in three panels: 100 causal SNPs, 1000 causal SNPs and 10000 causal SNPs.]
Figure: Comparison of our method with the criterion against the methods with and without selection.
▸ Introducing the criterion allows our estimator to have a smaller variance than the estimator without selection when the number of causal SNPs is small, and to have the same behavior when the number of causal SNPs is high.
Application to brain volume data
Data from the Imagen project: volumes of the different regions of the brain from about 2000 adolescents in Europe.
[Figure: heritability estimates (vertical axis from 0 to 1) for the regions labelled bv, icv, th, pu, pa, hip, amy, acc and ca.]
Figure: Different regions of the brain (Toro et al., 2014) and the estimated heritability of the volumes of these regions.
Binary data
Extension to binary data
How to define heritability for binary traits?
Liability model (Falconer, 1965)
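\[ Y_i = \mathbf{1}_{\{L_i > t\}} \]
where
\[ L = Zu + e, \]
with $L = (L_1, \dots, L_n)$, $u \sim \mathcal{N}(0, \sigma_u^{\star 2} I_N)$ and $e \sim \mathcal{N}(0, \sigma_e^{\star 2} I_n)$.
The heritability is defined "at the liability scale", that is
\[ \eta^\star = \frac{N\sigma_u^{\star 2}}{N\sigma_u^{\star 2} + \sigma_e^{\star 2}}. \]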
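As an illustration of the thresholding step, the sketch below converts liabilities into binary traits and computes a threshold from a target prevalence. It relies on an assumption that is not stated on the slide: the liabilities are taken to be standard Gaussian (total variance normalized to 1), so that a prevalence $K$ corresponds to $t = \Phi^{-1}(1 - K)$. Function names are hypothetical; `statistics.NormalDist` requires Python 3.8 or later.

```python
from statistics import NormalDist

def liability_threshold(prevalence):
    """Threshold t with P(L > t) = prevalence, assuming L ~ N(0, 1)."""
    return NormalDist().inv_cdf(1.0 - prevalence)

def binary_trait(liabilities, t):
    """Y_i = 1 if L_i > t, else 0."""
    return [1 if l > t else 0 for l in liabilities]

if __name__ == "__main__":
    t = liability_threshold(0.01)  # a disease affecting 1% of the population
    print(t, binary_trait([-0.3, 1.2, 2.9], t))
```

For a rare disease the threshold sits far in the upper tail, which is why cases have to be oversampled in practice.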
Case-control studies
Specificity of case-control studies: the cases are highly oversampled. The numbers of patients and controls are similar even for rare diseases.
Least squares method (Golan, 2014), which takes into account this oversampling of the cases:
\[ \hat\eta = \operatorname*{argmin}_{\eta \in (0,1)} \sum_{i \neq j} \big(p_i p_j - \mathbb{E}[p_i p_j \mid Z, S = 1]\big)^2 \]
- $p_i = \dfrac{Y_i - P}{\sqrt{P(1-P)}}$
- $P$: prevalence in the study
- $\{S = 1\}$: individuals $i$ and $j$ are in the study.
→ Approximation of $\mathbb{E}[p_i p_j \mid Z, S = 1]$.
Approach
\[ \mathbb{E}(p_i p_j \mid Z, S = 1) = \frac{1-P}{P}\,P(Y_i = Y_j = 1 \mid Z, S = 1) - P(Y_i \neq Y_j \mid Z, S = 1) + \frac{P}{1-P}\,P(Y_i = Y_j = 0 \mid Z, S = 1). \]
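The three coefficients in this identity come from enumerating the values taken by $p_i p_j$, with $p_i = (Y_i - P)/\sqrt{P(1-P)}$ as defined for the least squares criterion. A short derivation, given as a reading aid:

```latex
\begin{align*}
Y_i = Y_j = 1 &: \quad p_i p_j = \frac{(1-P)^2}{P(1-P)} = \frac{1-P}{P}, \\
Y_i = Y_j = 0 &: \quad p_i p_j = \frac{(-P)^2}{P(1-P)} = \frac{P}{1-P}, \\
Y_i \neq Y_j  &: \quad p_i p_j = \frac{(1-P)(-P)}{P(1-P)} = -1.
\end{align*}
```

Weighting each value by its conditional probability given $Z$ and $S = 1$ yields the three terms of the expectation.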
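Approximation of $P(Y_i = Y_j = 1 \mid Z)$, $P(Y_i = Y_j = 0 \mid Z)$ and $P(Y_i \neq Y_j \mid Z)$:
\[ P(Y_i = Y_j = 1 \mid Z) = \int_t^{\infty} \int_t^{\infty} f(x, y)\,dx\,dy, \]
where
\[ f(x, y) = \frac{1}{2\pi}\,\big|\Sigma^{(N)}\big|^{-1/2} \exp\left\{-\frac{(x, y)\,\big(\Sigma^{(N)}\big)^{-1}\,(x, y)^t}{2}\right\}, \]
with
\[ \Sigma^{(N)} = \begin{pmatrix} 1 + \eta^\star \dfrac{B_i}{\sqrt{N}} & \eta^\star \dfrac{C_{i,j}}{\sqrt{N}} \\ \eta^\star \dfrac{C_{i,j}}{\sqrt{N}} & 1 + \eta^\star \dfrac{B_j}{\sqrt{N}} \end{pmatrix}, \]
where $B_i = O_p(1)$, $B_j = O_p(1)$ and $C_{i,j} = O_p(1)$.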
Approximation and corresponding estimator
First order approximation:
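\[ \mathbb{E}(p_i p_j \mid Z, S = 1) = c\,G_{i,j}\,\eta^\star \]
where
- $G_{i,j} = \dfrac{1}{N} \sum_{k=1}^{N} Z_{i,k} Z_{j,k}$
- $c$ is a constant which depends on the prevalence $K$ in the population, the prevalence $P$ in the study and the threshold $t$.
The heritability estimator has an explicit form:
\[ \hat\eta = \frac{\sum_{i \neq j} p_i p_j\,G_{i,j}}{\sum_{i \neq j} G_{i,j}^2} \]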
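The explicit estimator is a ratio of two sums over pairs of individuals, which the following Python sketch computes directly. Two points are assumptions of this sketch and not statements from the talk: the study prevalence $P$ defaults to the observed proportion of cases, and the constant $c$ is exposed as a parameter. Minimizing $\sum_{i \neq j} (p_i p_j - c\,G_{i,j}\,\eta)^2$ over $\eta$ gives $\sum_{i \neq j} p_i p_j G_{i,j} / (c \sum_{i \neq j} G_{i,j}^2)$, so the default `c=1.0` reproduces the formula as displayed. Function names are hypothetical.

```python
import math

def standardized_phenotypes(y, P=None):
    """p_i = (Y_i - P) / sqrt(P (1 - P)); P defaults to the proportion of cases."""
    if P is None:
        P = sum(y) / len(y)
    scale = math.sqrt(P * (1.0 - P))  # requires both cases and controls
    return [(yi - P) / scale for yi in y]

def relatedness(Z, i, j):
    """G_{i,j} = (1/N) * sum_k Z_{i,k} Z_{j,k}."""
    N = len(Z[i])
    return sum(Z[i][k] * Z[j][k] for k in range(N)) / N

def heritability_least_squares(Z, y, c=1.0, P=None):
    """sum_{i != j} p_i p_j G_ij / (c * sum_{i != j} G_ij^2)."""
    p = standardized_phenotypes(y, P)
    n = len(y)
    num = 0.0
    den = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                g = relatedness(Z, i, j)
                num += p[i] * p[j] * g
                den += g * g
    return num / (c * den)

if __name__ == "__main__":
    # Toy data: two pairs of genetically identical individuals, one pair of
    # cases and one pair of controls (illustrative values only).
    Z = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
    y = [1, 1, 0, 0]
    print(heritability_least_squares(Z, y, c=4.0))
```

The double loop costs $O(n^2 N)$ operations; it is written this way for readability, and a matrix product would be preferable on real data.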
Consistency of the heritability estimator
Theorem (Consistency)
$\hat\eta$ is a consistent estimator of $\eta^\star$, that is
\[ \hat\eta \xrightarrow{P} \eta^\star \]
when $n \to +\infty$, $N \to +\infty$ and $n/N \to a > 0$, under mild assumptions on the matrix $Z$.
Numerical results
Comparison of the estimators $\hat\eta^{(1)}$ and $\hat\eta^{(2)}$ obtained respectively with the first and second order approximations of $\mathbb{E}[p_i p_j \mid Z, S_i = S_j = 1]$.
[Figure: boxplots of the two estimators, labelled "first approx" and "second approx" (vertical axis from 0 to 1), in three panels.]
Figure: Performance of $\hat\eta^{(1)}$ and $\hat\eta^{(2)}$ for $n = 100$, $N = 10000$ and different values of $k$: 0.1 (left), 0.01 (middle) and 0.005 (right).
▸ The numerical results obtained with the two approximations are similar.
Perspectives
Conclusions and perspectives
Conclusions
- Quantitative traits: we proposed a hybrid estimator which includes a selection step in very sparse scenarios and behaves like the maximum likelihood estimator otherwise.
- Binary traits: we showed the consistency of the heritability estimator proposed by Golan et al. (2014).
Perspectives
- Quantitative traits: study the biological pathways between the lists of selected SNPs.
- Binary traits: consider sparsity, build accurate confidence intervals.
References
[1] Anna Bonnet, Elisabeth Gassiat, and Céline Lévy-Leduc. Heritability estimation in high-dimensional sparse linear mixed models. Electronic Journal of Statistics, 9(2):2099–2129, 2015.
[2] Anna Bonnet, Elisabeth Gassiat, Céline Lévy-Leduc, Roberto Toro, and Thomas Bourgeron. Improving heritability estimation by a variable selection approach in sparse high dimensional linear mixed models, 2016. Submitted.
[3] Jianqing Fan and Jinchi Lv. Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(5):849–911, 2008.
[4] David Golan, Eric S. Lander, and Saharon Rosset. Measuring missing heritability: Inferring the contribution of common variants. Proceedings of the National Academy of Sciences, 111(49):E5272–E5281, 2014.
[5] Nicolai Meinshausen and Peter Bühlmann. Stability selection. Journal of the Royal Statistical Society, pages 417–473, 2010.