Modelling Complex Longitudinal
Phenotypes over Childhood in Genetic
Association Studies
Nicole Warrington Bachelor of Science (Honours)
This thesis is presented for the degree of
Doctor of Philosophy
of The University of Western Australia
School of Women’s and Infants’ Health.
November, 2013
Declaration
This thesis was completed during the course of enrolment in this degree at the University of
Western Australia and has not previously been accepted for a degree at this or another
institution.
This thesis is the author's own composition. This thesis contains published work and/or work
prepared for publication, some of which has been co-authored. The bibliographical details of
the work and where it appears in the thesis are outlined below. The permission of all
co-authors has been obtained to include the work in this thesis.
Publication 1/Chapter 2:
Warrington NM, Wu YY, Pennell CE, Marsh JA, Beilin LJ, et al. (2013) Modelling BMI
Trajectories in Children for Genetic Association Studies. PLoS One 8: e53897.
The candidate carried out the literature review, conducted the statistical analyses, interpreted
the results and drafted the manuscript. Co-authors provided guidance on data analysis and
interpretation of the results, in addition to critically reviewing all aspects of the study design
and manuscript.
Publication 2/Chapter 4:
Warrington NM, Tilling K, Howe LD, Paternoster L, Pennell CE, Wu YY, Briollais L. Robustness of
the linear mixed effects model to error distribution assumptions and the consequences for
genome-wide association studies. (Statistical Applications in Genetics and Molecular Biology.
Accepted on 22 July 2014)
The candidate carried out the literature review, conducted the simulations and additional
statistical analyses and drafted the manuscript. Dr Paternoster conducted the chromosome 16
analysis in ALSPAC. Dr Wu wrote the R code for the calculation of the robust standard errors.
Dr Howe, Professor Tilling and Associate Professor Briollais critically reviewed all aspects of the
study design and manuscript and provided constructive feedback throughout the analysis.
Chapter 5:
This work was conducted in collaboration with Dr Laura Howe and Dr Lavinia Paternoster at
the University of Bristol and Dr Marika Kaakinen and Dr Sauli Herrala from the University of
Oulu. The candidate carried out the literature review, conducted the genome-wide analysis in
the Raine Study, wrote the R scripts for conducting the replication analysis in ALSPAC and
NFBC66 and conducted the meta-analysis. Dr Paternoster and Dr Howe conducted the
replication analysis in ALSPAC and Dr Kaakinen and Dr Herrala conducted the replication
analysis in NFBC66.
Publication 3/Chapter 6:
Warrington NM*, Howe LD* (*joint first authorship), Wu YY, Timpson NJ, Tilling K, Pennell CE,
Newnham J, Davey-Smith G, Palmer LJ, Beilin LJ, Lye SJ, Lawlor DA, Briollais L. Association of a
Body Mass Index Genetic Risk Score with Growth throughout Childhood and Adolescence. PLoS
One. 2013; 8(11):e79547.
Planning of the paper was jointly undertaken by Dr Howe and the candidate. The candidate
carried out the literature review, selected the SNPs of interest, conducted all the statistical
analyses and drafted the manuscript, while Dr Howe and other co-authors on the manuscript
critically reviewed all aspects of the study design and manuscript.
Signed ………………………………………………………………………………
Associate Professor Laurent Briollais, Supervisor, Lunenfeld-Tanenbaum Research
Institute and the Dalla Lana School of Public Health, University of Toronto
Signed ………………………………………………………………………………
Nicole Warrington, PhD candidate, School of Women’s and Infants’ Health, The University of
Western Australia
Abstract
Genome-wide association studies (GWASs) are a hypothesis-free approach to investigating
genetic factors that influence health and disease. Whilst they have been relatively successful in
uncovering novel genetic variants associated with complex human diseases, it is largely only
the ‘low hanging fruit’ that have been described to date, leaving much of the heritability of any
given trait unexplained. Geneticists are beginning to perform more complex analyses to
improve our understanding of genetic determinants of disease, including the investigation of
how genes play a role in the development of a trait over time in longitudinal studies.
Compared to cross-sectional analyses, longitudinal studies are advantageous for investigating
genetic associations as they: 1) allow information to be repeated among individuals across
various time points; 2) facilitate the detection of genetic variants that influence trajectories
rather than simple differences in phenotypes; and 3) allow the detection of genes that are
associated with age of onset of a trait. Improving analytic techniques for conducting
longitudinal GWASs offers the opportunity to advance our understanding of the aetiology of
health and disease.
The core aim of this thesis was to develop an appropriate modelling framework to conduct
GWASs of complex traits in longitudinal study designs. Body mass index (BMI) trajectories
throughout childhood were chosen for this research for several reasons. Firstly, obesity
(defined by high BMI) is a complex disorder with increasing incidence, particularly during the
first decades of life, and it is important to gain an understanding of the developmental
processes that precede the obesity diagnosis. Secondly, obesity is linked to increased risk of
many other diseases including type-two diabetes, the metabolic syndrome, mental health
disorders, respiratory problems and some cancers. The principles underlying life course
epidemiology suggest that the link between these diseases begins in early life. Thirdly, the
genetic determinants of BMI remain largely unknown. Finally, BMI trajectories over childhood
are difficult to model statistically due to the complexities in the shape of the growth curve and
differences between individuals' rates of growth within the population.
To address the aim of this thesis, five projects were conducted. The first project describes the
application of four longitudinal modelling frameworks to the BMI data from the Western
Australian Pregnancy Cohort (Raine) Study. This research demonstrated that a semi-parametric
linear mixed effects model (SPLMM) provides the best fit to the data while allowing for the
detection of small genetic effects in a computationally efficient manner.
It has been suggested that a two-step approach is the most efficient for longitudinal GWASs,
firstly modelling the trait of interest and then using summary statistics from the model for the
genetic analysis. The second project compared the SPLMM to this two-step approach using a
simulation study and also the genotype data from chromosome 16 in the Raine Study. The
results demonstrated that a two-step approach is not appropriate for childhood BMI
trajectories.
Given the complex nature of data collection in large, longitudinal cohort studies, the
distributional assumptions for the error term of the linear mixed effects model are sometimes
not met. Through analysis of both a simulation study and chromosome 16 data from the UK
Avon Longitudinal Study of Parents and Children (ALSPAC), the third project in this thesis
showed that the power, bias and coverage rate of the fixed effects estimates are relatively
unaffected by the misspecification; however, a robust standard error is required to protect
against inflation of type 1 error for fixed effects estimates when the fixed effects covariates
interact with time.
The fourth project in this thesis used the statistical methods developed in the previous
chapters to conduct a GWAS of BMI over childhood in the Raine Study, with replication in
ALSPAC and the Northern Finland Birth Cohort of 1966 (NFBC66). Results suggest that genetic variants
in the KCNJ15 gene are associated with both increased average BMI and the rate of growth
over childhood. Variants in this gene have previously been reported to be associated with
increased risk of type-two diabetes, increased levels of insulin and insulin resistance, indicating
that this gene may be biologically important.
There are currently 32 SNPs known to be associated with BMI in adulthood. In the fifth project,
analyses of these SNPs were conducted in ALSPAC and the Raine Study to illustrate that the
association with BMI during childhood is mediated by both changes in adiposity and skeletal
growth; effects that are detectable from one year of age.
Through the use of both advanced statistical techniques and the breadth of longitudinal data
that is available in large cohort studies, it is anticipated that geneticists will be able to uncover
more of the genetic determinants of complex diseases. The statistical methods investigated
and developed in this thesis provide a modelling framework that can be applied to numerous
complex disease traits and enable gene discovery to occur on a wider scale. Longer term, this
may lead to the development and implementation of more targeted interventions, at a
younger age, before the onset of disabling diseases like obesity in those at risk.
Publications
Publications arising directly from this thesis
Warrington NM, Wu YY, Pennell CE, Marsh JA, Beilin LJ, Palmer LJ, Lye SJ, Briollais L. Modelling
BMI trajectories in children for genetic association studies. PLoS One. 2013;8(1):e53897.
(Chapter 2)
Warrington NM*, Howe LD* (*joint first authorship), Wu YY, Timpson NJ, Tilling K, Pennell CE,
Newnham J, Davey-Smith G, Palmer LJ, Beilin LJ, Lye SJ, Lawlor DA, Briollais L. Association of a
Body Mass Index Genetic Risk Score with Growth throughout Childhood and Adolescence. PLoS
One. 2013; 8(11):e79547 (Chapter 6)
Warrington NM, Tilling K, Howe LD, Paternoster L, Pennell CE, Wu YY, Briollais L. Robustness of
the linear mixed effects model to error distribution assumptions and the consequences for
genome-wide association studies. (Statistical Applications in Genetics and Molecular Biology.
Accepted on 22 July 2014; Chapter 4)
Warrington NM, Marsh JA, Wu YY, Newnham JP, Beilin LJ, Lye SJ, Palmer LJ, Briollais L, Pennell
CE. Genetic Variants in Adult Obesity Genes are Associated with Childhood Growth. Journal of
Developmental Origins of Health and Disease. 2011; 2(S1): p S144. (ABSTRACT) (Chapter 2)
Publications arising indirectly from this thesis
Marsh JA, White SQ, Warrington NM, Lye SJ, Davey-Smith G, Newnham JP, Palmer LJ, Pennell
CE. Feeding the Epidemic of Childhood Obesity. Journal of Developmental Origins of Health
and Disease. 2011; 2(S1): p S92. (ABSTRACT)
Bradfield JP, Taal HR, Timpson NJ, Scherag A, Lecoeur C, Warrington NM, Hypponen E, Holst C,
Valcarcel B, Thiering E, Salem RM, Schumacher FR, Cousminer DL, Sleiman PM, Zhao J,
Berkowitz RI, Vimaleswaran KS, Jarick I, Pennell CE, Evans DM, St Pourcain B, Berry DJ,
Mook-Kanamori DO, Hofman A, Rivadeneira F, Uitterlinden AG, van Duijn CM, van der Valk RJ, de
Jongste JC, Postma DS, Boomsma DI, Gauderman WJ, Hassanein MT, Lindgren CM, Mägi R,
Boreham CA, Neville CE, Moreno LA, Elliott P, Pouta A, Hartikainen AL, Li M, Raitakari O,
Lehtimäki T, Eriksson JG, Palotie A, Dallongeville J, Das S, Deloukas P, McMahon G, Ring SM,
Kemp JP, Buxton JL, Blakemore AI, Bustamante M, Guxens M, Hirschhorn JN, Gillman MW,
Kreiner-Møller E, Bisgaard H, Gilliland FD, Heinrich J, Wheeler E, Barroso I, O'Rahilly S,
Meirhaeghe A, Sørensen TI, Power C, Palmer LJ, Hinney A, Widen E, Farooqi IS, McCarthy MI,
Froguel P, Meyre D, Hebebrand J, Jarvelin MR, Jaddoe VW, Smith GD, Hakonarson H, Grant SF;
Early Growth Genetics Consortium. A genome-wide association meta-analysis identifies new
childhood obesity loci. Nat Genet. 2012 May;44(5):526-31.
Sovio U, Mook-Kanamori DO, Warrington NM, Lawrence R, Briollais L, Palmer CN, Cecil J,
Sandling JK, Syvänen AC, Kaakinen M, Beilin LJ, Millwood IY, Bennett AJ, Laitinen J, Pouta A,
Molitor J, Davey Smith G, Ben-Shlomo Y, Jaddoe VW, Palmer LJ, Pennell CE, Cole TJ, McCarthy
MI, Järvelin MR, Timpson NJ; Early Growth Genetics Consortium. Association between
Common Variation at the FTO Locus and Changes in Body Mass Index from Infancy to Late
Childhood: The Complex Nature of Genetic Association through Growth and Development.
PLoS Genet. 2011 Feb;7(2):e1001307.
"Essentially, all models are wrong, but some are useful"
George E. P. Box
Contents
DECLARATION
ABSTRACT
PUBLICATIONS
CONTENTS
LIST OF TABLES
LIST OF FIGURES
GLOSSARY
ABBREVIATIONS
ACKNOWLEDGEMENTS
CHAPTER 1: INTRODUCTION
1.1 General Introduction
1.2 Introduction to Life Course Epidemiology
1.3 Introduction to Genetic Epidemiology
1.3.1 Genetics for Genetic Epidemiology
1.3.2 Hardy-Weinberg Equilibrium Principle
1.3.3 Linkage Disequilibrium
1.3.4 Haplotype Inference
1.3.5 Evolution of Genetic Epidemiology Studies
1.3.5.1 Linkage
1.3.5.2 Association
1.4 Introduction to Genome-Wide Association Studies (GWASs)
1.4.1 Definition
1.4.2 Imputation
1.4.3 Association Analysis
1.4.4 Replication
1.4.5 GWASs of Longitudinal Quantitative Traits
1.5 Obesity and Body Mass Index
1.5.1 Life Course Approach to Obesity
1.5.1.1 Infancy Growth and the Adiposity Peak
1.5.1.2 Adiposity Rebound
1.5.1.3 Puberty
1.5.2 Genetics of BMI
1.6 Birth Cohorts used in this Thesis
1.6.1 The Western Australian Pregnancy Cohort (Raine) Study
1.6.1.1 Subjects
1.6.1.2 Measurements
1.6.1.3 Genotyping
1.6.2 Avon Longitudinal Study of Parents and Children (ALSPAC)
1.6.2.1 Subjects
1.6.2.2 Measurements
1.6.2.3 Genotyping
1.6.3 The Northern Finland Birth Cohort of 1966 (NFBC66)
1.6.3.1 Subjects
1.6.3.2 Measurements
1.6.3.3 Genotyping
1.7 Aims
1.8 Outline of Thesis
CHAPTER 2: LONGITUDINAL STATISTICAL MODELS FOR BODY MASS INDEX TRAJECTORIES THROUGHOUT CHILDHOOD USING THE WESTERN AUSTRALIAN PREGNANCY COHORT (RAINE) STUDY
2.1 Introduction
2.2 Background
2.2.1 Aims
2.3 Subjects and Materials
2.4 Statistical Methods and Model Fit
2.4.1 Linear Mixed Effects Model (LMM)
2.4.1.1 Method Description
2.4.1.2 Model Fit
2.4.1.3 Computational Time
2.4.2 Skew-t Model Linear Mixed Effects Model (STLMM)
2.4.2.1 Method Description
2.4.2.2 Model Fit
2.4.2.3 Computational Time
2.4.3 Semi-Parametric Mixed Model (SPLMM) using Smoothing Splines
2.4.3.1 Method Description
2.4.3.2 Model Fit
2.4.3.3 Computational Time
2.4.4 Non-Linear Mixed Effects Model (NLMM); also known as the SuperImposition by Translation And Rotation (SITAR) Model
2.4.4.1 Method Description
2.4.4.2 Model Fit
2.4.4.3 Computational Time
2.5 Genetic Associations
2.5.1 SNP Selection
2.5.2 Cross-Sectional Analyses
2.5.3 Longitudinal Analyses
2.5.4 Obesity-Risk Allele Score
2.5.5 Characterising Genetic Associations in SPLMM Model
2.6 Comparison of Models
2.6.1 Model Fit
2.6.2 Computation Time
2.6.3 Ability to Detect Genetic Associations with Known Adult BMI/Obesity SNPs
2.7 Discussion
2.8 Conclusion
CHAPTER 3: COMPARING THE SEMI-PARAMETRIC LINEAR MIXED MODEL TO A TWO-STEP APPROACH FOR GENOME-WIDE ASSOCIATION STUDIES
3.1 Introduction
3.2 Background
3.2.1 Aims
3.3 Methods
3.3.1 Statistical Methods
3.3.2 Simulation Study
3.3.3 Chromosome 16 Analysis in the Raine Study
3.4 Results
3.4.1 Simulation Study Results
3.4.2 Chromosome 16 SNPs in the Raine Study
3.5 Discussion
3.6 Conclusions
CHAPTER 4: ROBUSTNESS OF THE LINEAR MIXED EFFECTS MODEL TO DISTRIBUTIONAL ASSUMPTIONS AND CONSEQUENCES FOR GENOME-WIDE ASSOCIATION STUDIES
4.1 Introduction
4.2 Background
4.2.1 Aims
4.3 Motivating Example
4.4 Simulation Study
4.4.1 Sampling Designs
4.4.2 Models for Data Generation
4.4.2.1 Standard Linear Mixed Model
4.4.2.2 Non Gaussian Error
4.4.2.3 Heteroscedastic Error
4.4.3 Data Generation
4.4.4 Calculating Robust Standard Errors and Global Wald Tests
4.5 Results for Simulated Data
4.5.1 Coverage Probabilities
4.5.2 Bias
4.5.3 Power
4.5.4 Type 1 Error
4.5.5 Type 1 Error in Unbalanced Designs Versus Complete Designs
4.5.6 Power Using the Robust Standard Error
4.6 Analysis of Chromosome-Wide BMI Data
4.6.1 Comparison Between the Classical and Robust Tests
4.7 Discussion
4.8 Conclusion
CHAPTER 5: GENOME-WIDE ASSOCIATION STUDY OF BMI TRAJECTORIES ACROSS CHILDHOOD
5.1 Introduction
5.2 Background
5.2.1 Aims
5.3 Statistical Methods
5.3.1 Study Populations
5.3.1.1 Raine Study
5.3.1.2 ALSPAC
5.3.1.3 NFBC66
5.3.2 Data Cleaning
5.3.3 Longitudinal Modelling
5.3.4 Statistical Analysis
5.3.5 Additional Analysis for Characterizing Significant Findings
5.3.6 Pathway Analysis
5.4 Results
5.4.1 Comparison of Cohorts
5.4.2 Results from the Raine Study GWAS
5.4.2.1 Summary of GWAS
5.4.2.2 Regions of Interest
5.4.3 Characterising the Findings of the KCNJ15 Gene
5.4.4 Results from Replication and Meta-Analysis
5.4.5 Results from Pathway Analysis
5.5 Discussion
5.5.1 Role of KCNJ15 Gene and Nearby Genes on Chromosome 21
5.6 Challenges and Future Research
5.7 Conclusion
CHAPTER 6: ASSOCIATION OF A GENETIC RISK SCORE WITH LONGITUDINAL BMI IN CHILDREN
6.1 Introduction
6.2 Background
6.2.1 Aims
6.3 Subjects and Materials
6.3.1 Study Populations
6.3.2 SNP Selection and Allelic Score
6.4 Statistical Analysis
6.4.1 Longitudinal Modelling and Derivation of Growth Parameters
6.4.2 Statistical Analysis
6.5 Results
6.5.1 Association Between the Allelic Score and Growth Trajectories
6.5.2 Associations Between the Allelic Score and Birth Measures, Adiposity Peak and Adiposity Rebound
6.5.3 Variance Explained by the Allelic Score
6.5.4 Sex Interactions Between the 32 Individual BMI SNPs and BMI Trajectories
6.5.5 Adjustment for FTO Effect
6.5.6 Comparison with Weighted Allelic Score
6.6 Discussion
6.7 Conclusion
CHAPTER 7: CONCLUSIONS, LIMITATIONS AND FUTURE DIRECTIONS
7.1 Main Findings ...................................................................................................................... 230
7.1.1 Longitudinal Statistical Models for Body Mass Index Growth Trajectories
throughout Childhood using the Western Australian Pregnancy Cohort (Raine) Study ... 231
7.1.2 Comparing SPLMM to Two-Step Approach for GWASs .......................................... 231
7.1.3 Robustness of the Linear Mixed Effects Model to Distribution Assumptions and
Consequences for Genome-Wide Association Studies ........................................................... 232
7.1.4 Genome-Wide Association Study of BMI Trajectories Across Childhood ........... 232
7.1.5 Association of a Genetic Risk Score with Longitudinal BMI in Children ............. 233
7.2 Limitations .......................................................................................................................... 233
7.2.1 Computational Intensity ........................................................................................... 233
7.2.2 Gene Discovery ........................................................................................................... 234
7.3 Future Directions ............................................................................................................... 234
7.3.1 Reducing Remaining Type 1 Error .......................................................................... 234
7.3.2 Longitudinal Family Studies ..................................................................................... 235
7.3.3 Adjusting for Environmental Covariates ................................................................ 235
7.3.4 Gene-Environment and Gene-Gene Interactions ................................................... 236
7.3.5 Fine Mapping .............................................................................................................. 237
7.3.6 Rare Variants ............................................................................................................. 238
7.4 Conclusion .......................................................................................................................... 238
REFERENCES ...................................................................................................................... 240
APPENDIX A: PUBLICATION ARISING FROM THE RESEARCH IN CHAPTER TWO
APPENDIX B: ADDITIONAL DETAILS OF THE LINEAR MIXED MODEL IN CHAPTER
TWO
APPENDIX C: PUBLICATION ARISING FROM THE RESEARCH IN CHAPTER FOUR
APPENDIX D: ADDITIONAL RESULTS FROM SIMULATION ANALYSIS IN CHAPTER
FOUR
APPENDIX E: PUBLICATION ARISING FROM THE RESEARCH IN CHAPTER SIX
APPENDIX F: ADDITIONAL RESULTS FROM ALLELIC SCORE ANALYSIS IN CHAPTER
SIX
APPENDIX G: R CODE FOR THE MODELS USED IN THE ANALYSIS OF EACH CHAPTER
List of Tables
Table 1.1: Probability of observing a given haplotype when the loci are in disequilibrium ................... 7
Table 1.2: The genotypic information an individual possesses across two loci determines their possible
diplotype ...................................................................................................................................... 8
Table 1.3: Study designs for genetic association studies [18,33,34] ......................................................11
Table 2.1: Number of follow-ups with BMI measured for each of the participants in the sample ........47
Table 2.2: The phenotypic characteristics at each follow-up year for the 1,506 individuals in the study
sample. Continuous variables are expressed as means (standard deviation); binary variables as
percentage (number). ..................................................................................................................49
Table 2.3: The correlation structure of the repeated observations of BMI ...........................................50
Table 2.4: Model fit statistics for covariance models tested using the LMM method; -2 log likelihood,
Bayesian Information Criterion (BIC) and Akaike Information Criterion (AIC). All models assumed
an independent correlation structure for the random effects and no specific correlation structure
for the error. ................................................................................................................................56
Table 2.5: Details of LMM model in females (N=733, n=4377) ..............................................................59
Table 2.6: Details of LMM model in males (N=773, n=4609) .................................................................61
Table 2.7: Details of STLMM model in females (N=733, n=4377) ..........................................................65
Table 2.8: Details of STLMM model in males (N=773, n=4609) .............................................................67
Table 2.9: Details of SPLMM model in females (N=733, n=4377). Spline 1 is the change in slope
between two and eight years, Spline 2 is the change in slope after 12 years and Spline 3 is the
change in slope before two years. ...............................................................................................71
Table 2.10: Details of SPLMM model in males (N=773, n=4,609). Spline 1 is the change in slope
between two and eight years, Spline 2 is the change in slope after 12 years and Spline 3 is the
change in slope before two years. ...............................................................................................73
Table 2.11: Details of NLMM model in females (N=733, n=4,377) ........................................................77
Table 2.12: Details of NLMM model in males (N=773, n=4,609) ...........................................................79
Table 2.13: Results from association analysis between the estimates of the three parameters from the
NLMM model and markers of obesity at age 17 years. ................................................................81
Table 2.14: Characteristics from the Raine Study sample of the 17 SNPs investigated in each of the
statistical methods ......................................................................................................................83
Table 2.15: Summary of cross-sectional results for the 17 SNPs. Significant P-Values are in bold. .......86
Table 2.16: Summary of longitudinal analyses, using the four methods, for each of the 17 SNPs in
females. Significant P-Values are in bold. ....................................................................................91
Table 2.17: Summary of longitudinal analyses, using the four methods, for each of the 17 SNPs in
males. Significant P-Values are in bold. ....................................................................................... 95
Table 2.18: Results from association analysis of the obesity-risk allele score with BMI trajectories using
the four methods, adjusted for the first five principal components .......................................... 101
Table 2.19: Statistical measures used to compare model fit of the four methods. ............................. 108
Table 2.20: Computation time for the four methods adjusting for the FTO genotype (median [IQR]) 110
Table 2.21: The number of significant SNPs for each method, using a likelihood ratio test................ 111
Table 3.1: Parameter estimates from the Raine Study SPLMM model (Model 1) used to generate the
data in the simulation study ...................................................................................................... 121
Table 3.2: Results from the 1,000 simulations. SDdiff is the standard deviation of the difference
between -log10(pF) [P-Value for testing the β coefficients in the one step method] and -log10(pT)
[P-Value for testing the β coefficients in the two-step method] for the 1,000 simulations. r2 is the
Pearson correlation coefficient for the ratio of the beta coefficient to the standard error. ....... 124
Table 3.3: Results of the 1,000 simulations in the additional scenarios. SDdiff is the standard deviation
of the difference between -log10(pF) [P-Value for testing the β coefficients in the one step
method] and -log10(pT) [P-Value for testing the β coefficients in the two-step method]. r2 is the
Pearson correlation coefficient for the ratio of the beta coefficient to the standard error. ....... 130
Table 4.1: Parameter estimates from the ALSPAC non-genetic model used to generate the data in the
simulation study........................................................................................................................ 142
Table 4.2: Coverage rates of the 95% confidence intervals of the fixed effects; bold and underlined
cells are those that are significantly different from the nominal 95% based on 4,000 simulations
under each design (1,000 simulations for each MAF combined into one summary statistic). .... 148
Table 4.3: Bias and 95% confidence interval for the complete designs; bold and underlined cells are
those whose confidence interval does not cover zero based on 4,000 simulations under each
design (1,000 simulations for each MAF combined into one summary statistic)........................ 149
Table 4.4: Bias and 95% confidence interval for the unbalanced designs; bold and underlined cells are
those whose confidence interval does not cover zero based on 4,000 simulations under each
design (1,000 simulations for each MAF combined into one summary statistic)........................ 150
Table 4.5: Type 1 error for complete designs; bold and underlined cells are those that are significantly
different from the nominal α=0.05 based on 20,000 simulations under each design (5,000
simulations for each MAF combined into one summary statistic). ............................................ 155
Table 4.6: Type 1 error for unbalanced designs; bold and underlined cells are those that are
significantly different from the nominal α=0.05 based on 20,000 simulations under each design
(5,000 simulations for each MAF combined into one summary statistic). .................................. 156
Table 5.1: Mean (SD) age and BMI at the adiposity rebound in the three cohorts, in addition to the
correlation between the two measures. ................................................................................... 186
Table 5.2: Results from the pathway analysis using SNPs not in LD .................................................... 199
Table 6.1: Phenotypic characteristics of the two birth cohorts used for analysis ................................ 211
Table 6.2: Descriptive statistics of the single nucleotide polymorphisms included in the allelic score.
.................................................................................................................................................. 213
Table 6.3: Results of the allelic score with each of the trajectory outcomes (BMI, weight and height) in
both cohorts and the combined meta-analysis. Significant findings are in bold; Spline 1 is the
change in slope between two and eight years, Spline 2 is the change in slope after 12 years and
Spline 3 is the change in slope before two years. ....................................................................... 215
Table 6.4: Cross-sectional association analysis results for birth measures, BMI and age at adiposity
peak (AP) and BMI and age at adiposity rebound (AR) in ALSPAC and the Raine Study. ............ 219
Table 6.5: Results of the allelic score, after adjustment for the FTO locus, with each of the trajectory
outcomes (BMI, weight and height) in both cohorts and the combined meta-analysis. Significant
findings are in bold; Spline 1 is the change in slope between two and eight years, Spline 2 is the
change in slope after 12 years and Spline 3 is the change in slope before two years. ................ 223
Table 6.6: Comparison of the unweighted and weighted allelic scores for the three trajectory
outcomes. Spline 1 is the change in slope between two and eight years, Spline 2 is the change in
slope after 12 years and Spline 3 is the change in slope before two years. ................................ 225
List of Figures
Figure 1.1: Architecture of disease based on genetic determinants (Image adapted from McCarthy et
al [42], Manolio et al [43], Bush and Moore [44])........................................................................ 15
Figure 1.2: A schematic of the process by which genetic data is imputed using haplotype inference .. 17
Figure 1.3: Schema of BMI trajectory over childhood and adolescence. The green arrow indicates the
period around the adiposity peak (9 months of age); the red arrow indicates the period around
the adiposity rebound (5-6 years of age). .................................................................................... 26
Figure 1.4: The Raine Study schedule of assessments and broad measurements collected.................. 33
Figure 1.5: Principal components for population stratification in the Raine Study with the HapMap
populations superimposed, showing that the Raine Study individuals are predominantly of
European descent. ...................................................................................................................... 35
Figure 1.6: Principal components for population stratification for the 1,494 participants with genome-
wide data in the Raine Study ....................................................................................................... 36
Figure 2.1: Boxplots of BMI at each follow-up year, with BMI displayed from 10-30kg/m2 for years 1-6
and 10-50kg/m2 for years 8-17. ................................................................................................... 51
Figure 2.2: Individual BMI profiles of 20 individuals from the Raine Study .......................................... 52
Figure 2.3: Observed BMI measures for the 1,506 individuals with a lowess curve to visualise the
curvature in BMI over childhood ................................................................................................. 52
Figure 2.4: Model diagnostic plots from LMM model fit to the data from females in the Raine Study . 60
Figure 2.5: Model diagnostic plots from LMM model fit to the data from males in the Raine Study. ... 62
Figure 2.6: Model diagnostic plots from STLMM model fit to the data from females in the Raine Study
.................................................................................................................................................... 66
Figure 2.7: Model diagnostic plots from STLMM model fit to the data from males in the Raine Study. 68
Figure 2.8: Model diagnostic plots from SPLMM model fit to the data from females in the Raine Study.
.................................................................................................................................................... 72
Figure 2.9: Model diagnostic plots from SPLMM model fit to the data from males in the Raine Study 74
Figure 2.10: Model diagnostic plots from NLMM model fit to the data from females in the Raine Study.
.................................................................................................................................................... 78
Figure 2.11: Model diagnostic plots from NLMM model fit to the data from males in the Raine Study.
.................................................................................................................................................... 80
Figure 2.12: Distribution of obesity-risk allele score, with error bars for mean BMI at age 14 years. The
obesity-risk-allele score incorporates genotypes from 17 loci (FTO, MC4R, TMEM18, GNPDA2,
KCTD15, NEGR1, BDNF, ETV5, SEC16B, LYPLAL1, TFAP2B, MTCH2, BCDIN3D, NRXN3, SH2B1, and
MSRA) in the 1,219 individuals from the Raine Study with complete genetic data. The error bars
display the mean (95% CI) BMI at age 14 years (the largest follow-up in adolescence) for each
risk-allele score. ......................................................................................................................... 100
Figure 2.13: Population average curves for each of the significantly associated SNPs from the SPLMM
method in females (panel A) and males (panel B) ...................................................................... 103
Figure 2.14: Population average curves from the SPLMM method in females and males ................... 105
Figure 2.15: Associations between the risk-allele score and BMI at each follow-up in females and
males. Regression coefficients (95% CI) presented on ln(BMI) scale from the Semi-Parametric
Linear Mixed Model (SPLMM) longitudinal model, derived at each of the average ages of follow-
up. For example, a male with 17 obesity-risk-alleles is likely to have a ln(BMI) 0.005 units higher
at age 6 than a male with 16 alleles, and by age 14 this difference will have increased to 0.010 units. .. 106
Figure 2.16: Q-Q plot of residuals for each of the methods for females (top four) and males (bottom
four) .......................................................................................................................................... 109
Figure 3.1: Comparison of the one and two-step approaches for the SNP main effect from the 1,000
simulated data sets with different effect sizes for the SNP main effect and SNP*age interaction
effect; on the x-axis is the –log10(PF) and on the y-axis is the –log10(PT). ................................... 123
Figure 3.2: Comparison of the one and two-step approaches for the SNP*age interaction effect from
the 1,000 simulated data sets with different effect sizes for the SNP main effect and SNP*age
interaction effect; on the x-axis is the –log10(PF) and on the y-axis is the –log10(PT). ................. 125
Figure 3.3: Comparison of the β and SE estimates using the one and two-step approaches for the SNP
main effect and SNP*age interaction effect from the 1,000 simulated data sets where both the
SNP main effect and SNP*age interaction effect were significant; on the x-axis are the estimates
(β and SE(β)) from the SPLMM and on the y-axis are the estimates from the two-step approach.
.................................................................................................................................................. 127
Figure 3.4: Comparison of the one and two-step approaches for each of the SNP effects from analysis
of the chromosome 16 data in the Raine Study; on the x-axis is the –log10(PF) and on the y-axis is
the –log10(PT). ........................................................................................................................... 131
Figure 3.5: Comparison of the β and SE estimates using the one and two-step approaches for each of
the SNP effects from the chromosome 16 analysis in the Raine Study; on the x-axis are the
estimates (β and SE(β)) from the SPLMM and on the y-axis are the estimates from the two-step
approach. .................................................................................................................................. 132
Figure 4.1: Individual BMI trajectories for 20 females from ALSPAC ................................................... 137
Figure 4.2: BMI measurements over time, by measurement source, in ALSPAC ................................. 138
Figure 4.3: Residual plots, by measurement source, for the LMM model fit to the ALSPAC data........ 140
Figure 4.4: Simulated power of the SNP main effect and SNP*age interaction terms for complete
designs. The two plots on the left are for the Sparse Complete design, while the two plots on the
right are from the intense complete design. .............................................................................. 152
Figure 4.5: Simulated power of the SNP main effect and SNP*age interaction terms for unbalanced
designs, where “Equal” is the simulations from the Equal Unbalanced design, “Over” are the
simulations from the unbalanced design with fewer samples around the adiposity rebound and
“Under” are the simulations from the unbalanced design with more samples around the
adiposity rebound. .................................................................................................................... 153
Figure 4.6: Results from comparison between missing data or variable measurement time under the
sparse design.............................................................................................................................. 159
Figure 4.7: Results from comparison between missing data or variable measurement time under the
intense design ............................................................................................................................ 160
Figure 4.8: Difference in power based on a normal standard error versus a robust standard error for
the complete designs. A positive value indicates the power using the normal standard error is
greater than the power using the robust standard error. The two plots on the left are for the
Sparse Complete design, while the two plots on the right are from the intense complete design.
.................................................................................................................................................. 163
Figure 4.9: Difference in power based on a normal standard error versus a robust standard error for
the unbalanced designs. A positive value indicates the power using the normal standard error is
greater than the power using the robust standard error. Here, “Equal” is the simulations from
the Equal Unbalanced design, “Over” are the simulations from the unbalanced design with fewer
samples around the adiposity rebound and “Under” are the simulations from the unbalanced
design with more samples around the adiposity rebound. ........................................................ 164
Figure 4.10: Q-Q plot of the chromosome 16 analysis in ALSPAC for the overall Wald test and the
SNP*linear age interaction test. Plots A and B include 88 SNPs in the FTO gene, Plots C and D
exclude SNPs in the FTO gene. .................................................................................................. 167
Figure 4.11: Q-Q plot of the chromosome 16 analysis in ALSPAC for all parameters, excluding 88 SNPs
from the FTO gene. ................................................................................................................... 168
Figure 4.12: Comparison of the classical and robust tests for each of the parameters of interest from
the chromosome 16 analysis in ALSPAC .................................................................................... 170
Figure 4.13: Comparison of the classical and robust tests for the SNP main effect by minor allele
frequency (MAF) from the chromosome 16 analysis in ALSPAC................................................. 171
Figure 5.1: Schematic describing the relationships between genetic variants, environmental exposures
and modification to disease risk in adulthood. Image adapted from Newnham et al [117]. ...... 177
Figure 5.2: Population average BMI trajectories in females (A) and males (B) for each of the three
cohorts; the Raine Study (red), ALSPAC (green) and NFBC66 (blue). .......................................... 186
Figure 5.3: Q-Q plot for each of the four tests of interest in the Raine Study GWAS .......................... 188
Figure 5.4: Plot of standard versus robust test P-Values .................................................................... 188
Figure 5.5: Manhattan plot of the P-Values from the global SNP effect (Wald test) for BMI trajectory in
the Raine Study. The red line indicates the genome-wide significance level. The most significant
genetic variant is in the KCNJ15 gene on chromosome 21. ........................................................ 191
Figure 5.6: Manhattan plot of the P-Values from the global SNP by age effect (Wald test) for BMI
trajectory in the Raine Study. The red line indicates the genome-wide significance level. The
most significant genetic variant is in an intergenic region on chromosome 2. ........................... 191
Figure 5.7: Manhattan plot of the P-Values from the SNP main effect for BMI trajectory in the Raine
Study. The red line indicates the genome-wide significance level. The most significant genetic
variant is in an intergenic region on chromosome 14................................................................. 192
Figure 5.8: Manhattan plot of the P-Values from the SNP by linear age effect for BMI trajectory in the
Raine Study. The red line indicates the genome-wide significance level. The most significant
genetic variant is upstream from the GRM7 gene on chromosome 3......................................... 192
Figure 5.9: BMI trajectories for females (left) and males (right) for each of the KCNJ15, rs2008580
alleles. ....................................................................................................................................... 195
Figure 5.10: Regional plot of (A) global Wald P-Values for the overall SNP effect and (B) Wald P-Values
for the SNP by age effect as a function of genomic position (NCBI Build 36) from the meta-
analysis of ALSPAC and NFBC66 for KCNJ15 gene region. In each plot, the meta-analysis P-Value
for rs2836241 is denoted by a purple diamond; all other analysed SNPs are represented by a
circle. Local LD structure is reflected by the plotted estimated recombination rates (taken from
HapMap). The colour scheme of the circles respects LD patterns (HapMap CEU pairwise r2
correlation coefficients) between rs2836241 and surrounding variants. Gene annotations were
taken from the University of California Santa Cruz genome browser. ........................................ 197
Figure 5.11: Regional plot of (A) P-Values for the SNP main effect at age eight and (B) P-Values for the
SNP by linear age effect as a function of genomic position (NCBI Build 36) from the meta-analysis
of ALSPAC and NFBC66 for KCNJ15 gene region. In each plot, the meta-analysis P-Value for
rs2836241 is denoted by a purple diamond; all other analysed SNPs are represented by a circle.
Local LD structure is reflected by the plotted estimated recombination rates (taken from
HapMap). The colour scheme of the circles respects LD patterns (HapMap CEU pairwise r2
correlation coefficients) between rs2836241 and surrounding variants. Gene annotations were
taken from the University of California Santa Cruz genome browser. ........................................ 198
Figure 6.1: Population average curves for individuals from ALSPAC with 27, 29 or 31 BMI risk alleles in
females (A, C and E) and males (B, D and F). Predicted population average BMI (A and B), weight
(C and D) and height (E and F) trajectories from 1 – 16 years for individuals with 27 (lower
quartile), 29 (median), and 31 (upper quartile) BMI risk alleles in the allelic score. ................... 217
Figure 6.2: Associations between the allelic score and BMI (A and B), weight (C and D) and height (E
and F) at each follow-up in females and males from ALSPAC. Regression coefficients (95% CI)
derived from the longitudinal model at each year of follow-up between 1 and 16 years. ......... 218
Figure 6.3: Estimates from the longitudinal models of the proportion of BMI variation explained (R2) at
each time point in females and males from ALSPAC. R2 derived from the longitudinal model at
each year of follow-up between 1 and 16 years. Of note, there are increases in the proportion of
BMI variation explained by the allelic score around the landmarks of growth including adiposity
peak and puberty. ..................................................................................................................... 221
Glossary
Allele: a viable DNA (deoxyribonucleic acid) coding that occupies a specific location on a
chromosome. An individual inherits one of these DNA codes from each parent.
Autosome: Any chromosome that is not a sex chromosome. There are 22 autosomes in the
human genome.
Association Study: A study of the statistical association between an allele and a trait in the
population.
Chromosome: an organized structure of deoxyribonucleic acid (DNA) that is found in cells. It is
a single piece of coiled DNA containing many genes, regulatory elements and other nucleotide
sequences.
Complex disease: a disease that involves multiple genetic and environmental factors and their
interactions.
Coverage probability: the proportion of the time that a confidence interval contains the true
value of interest.
Critical period: a period of time in which an exposure can have an adverse (or protective)
effect on the development of a disease outcome; outside this time window, there is no
additional risk of disease associated with the exposure.
Deoxyribonucleic Acid (DNA): a molecule that encodes the genetic instructions used in the
development and functioning of all known living organisms. Most DNA molecules are double-
stranded helices consisting of four different types of subunits called nucleotides: guanine (G),
adenine (A), thymine (T) and cytosine (C).
Exon: the part of a gene that codes for a protein. Exons are the parts of the DNA that are
converted into mature messenger RNA (mRNA).
Gene: a segment of inherited DNA which contains the information necessary to produce a
functional product through transcription to RNA and translation to proteins.
Genome: the total hereditary information of an individual that is encoded in the DNA.
Genomic Inflation: the presence of excess false-positive results, measured by quantifying the
ratio of the median of the empirically observed distribution of the test statistic to the expected
median.
Genotype: the combination of alleles an individual possesses at a particular locus. As
individuals have two copies of each autosome, a genotype is usually expressed as two alleles.
Haplotype: a group of alleles on a single chromosome that are closely enough linked to be
inherited as a unit.
Hardy-Weinberg Equilibrium: the principle which describes the distribution of genotypes at a
locus in terms of its allele frequencies in a population.
Heritability: proportion of observable variation in a trait between individuals within a
population that is due to genetic variation.
Heterozygote: an individual that has two different alleles at a particular locus on a
chromosome.
Homozygote: an individual that has the same two alleles at a particular locus on a
chromosome.
Identical-By-Descent: a segment of DNA that two or more individuals have inherited from a
common ancestor without recombination and therefore the segment has the same ancestral
origin in these individuals.
Intron: segments of a gene that are not transcribed into messenger RNA and that are found
between exons.
Linkage Disequilibrium: non-random association of alleles at two or more closely linked loci.
Locus: the fixed, unique physical position of a gene or one of its alleles on a chromosome.
Phenotype: specific manifestation of a trait or behaviour that varies between individuals.
These are any qualitative or quantitative observable characteristics of an individual and often
referred to as traits.
Polymorphism: a locus that is polymorphic has at least two alternative alleles. This implies that
the given locus varies genetically between individuals.
Population stratification: different disease rates and allele frequencies co-occurring within
population subgroups, which can lead to spurious associations at the population level.
Power: the probability of rejecting the null hypothesis when the null hypothesis is false.
Recombination: the exchange of genetic information between two chromosomes during cell
division, resulting in new genetic variation.
Sensitive period: similar to a critical period, it is a period of rapid change in an individual;
however, a sensitive period allows the risk of disease to be modified or even reversed outside the
time window.
Single Nucleotide Polymorphism: a DNA sequence variation occurring when a single
nucleotide – A, T, C or G – in the genome differs between members of a species.
Type I error: the incorrect rejection of a true null hypothesis. That is, a parameter is
falsely declared significant.
Type II error: the failure to reject a false null hypothesis. That is, a parameter is not declared
significant when it should be.
Abbreviations
AIC Akaike information criterion
ALSPAC Avon Longitudinal Study of Parents and Children
BCDIN3D Gene: BCDIN3 domain containing
BDNF Gene: Brain-Derived Neurotrophic Factor
BIC Bayesian information criterion
BLUE Best Linear Unbiased Estimator
BLUP Best Linear Unbiased Predictor
BMI Body mass index
CADM2 Gene: Cell Adhesion Molecule 2
cdf Cumulative Distribution Function
CEU Samples of European descent (from residents of the United States of America
of northern and western European ancestry) from HapMap
CHB Samples of Chinese descent (from Beijing) from HapMap
CI Confidence Interval
CNV Copy Number Variant
dbGaP Database of Genotypes and Phenotypes
df Degrees of Freedom
bp Base pairs
DOHaD Developmental Origins of Health and Disease
DNA Deoxyribonucleic Acid
EGG Early Growth Genetics Consortium
EM Expectation Maximization
ETV5 Gene: ETS Variant 5
FAIM2 Gene: Fas Apoptotic Inhibitory Molecule 2
FANCL Gene: Fanconi Anemia, Complementation Group L
FDR False Discovery Rate
FLJ35779 Gene: POC5 centriolar protein homolog
FTO Gene: Fat mass and obesity associated
GAW Genetic Analysis Workshop
GIANT Genetic Investigation of Anthropometric Traits Consortium
GNPDA2 Gene: Glucosamine-6-Phosphate Deaminase 2
GPRC5B Gene: G Protein-Coupled Receptor, Family C, Group 5, Member B
GWAS Genome-Wide Association Study
HOXB5 Gene: Homeobox B5
HWE Hardy-Weinberg Equilibrium
IBD Identical by Descent
IQR Interquartile Range
JPT Samples of Japanese descent (from Tokyo) from HapMap
KCTD15 Gene: Potassium Channel Tetramerisation Domain Containing 15
LD Linkage Disequilibrium
LGR4 Gene: Leucine-Rich Repeat containing G Protein-Coupled Receptor 4
LMM Linear Mixed Effects Model
LMX1B Gene: LIM Homeobox Transcription Factor 1, beta
ln Natural Logarithm
LRP1B Gene: Low Density Lipoprotein Receptor-Related Protein 1B
LRRN6C Gene: Leucine Rich Repeat Neuronal 6C
LRT Likelihood Ratio Test
MACH Markov Chain Haplotyping software
MAF Minor Allele Frequency
MAGIC Meta-Analysis of Glucose and Insulin-related traits Consortium
MAP2K5 Gene: Mitogen-Activated Protein Kinase 5
MC4R Gene: Melanocortin 4 Receptor
MCE Monte Carlo Error
MDS Multidimensional Scaling
ML Maximum Likelihood
MLE Maximum Likelihood Estimate
MTCH2 Gene: Mitochondrial Carrier 2
MTIF3 Gene: Mitochondrial Translational Initiation Factor 3
NFBC66 Northern Finnish Birth Cohort of 1966
NEGR1 Gene: Neuronal Growth Regulator 1
NLMM Non-Linear Mixed Effects Model
NRXN3 Gene: Neurexin 3
OLFM4 Gene: Olfactomedin 4
PC Principal Component
PCA Principal Components Analysis
pdf Probability Density Function
PI Ponderal Index
PRKD1 Gene: Protein Kinase D1
PTBP2 Gene: Polypyrimidine Tract Binding Protein 2
QC Quality Control
QPCTL Gene: Glutaminyl-Peptide Cyclotransferase-Like
Raine Western Australian Pregnancy Cohort Study
RBJ Gene: DnaJ (Hsp40) homolog, subfamily C, member 27
REML Restricted Maximum Likelihood
RPL27A Gene: Ribosomal Protein L27a
SD Standard Deviation
SE Standard Error
SEC16B Gene: SEC16 Homolog B
SFRS10 Gene: Splicing Factor, Arginine/Serine-Rich 10
SH2B1 Gene: SH2B Adaptor Protein 1
SITAR SuperImposition by Translation And Rotation
SLC39A8 Gene: Solute Carrier Family 39 (zinc transporter), Member 8
SNI Skew-Normal/Independent distribution
SNP Single Nucleotide Polymorphism
SPLMM Semi-Parametric Linear Mixed Effects Model
STLMM Skew-t Linear Mixed Effects Model
TFAP2B Gene: Transcription Factor AP-2 Beta (activating enhancer binding protein 2
beta)
TFBS Transcription Factor Binding Site
TMEM18 Gene: Transmembrane Protein 18
TMEM160 Gene: Transmembrane Protein 160
TNNI3K Gene: TNNI3 interacting kinase
UMVUE Uniformly Minimum Variance Unbiased Estimator
YRI Samples of African descent (Yoruba people of Ibadan, Nigeria) from HapMap
ZNF608 Gene: Zinc Finger Protein 608
Acknowledgements
I would firstly like to thank my supervisors. Thank you Associate Professor Craig Pennell for
giving me the freedom to conduct my research in Toronto; I know it was an additional
challenge at times, but I appreciate your dedication to making it work. Thank you Associate
Professor Laurent Briollais for allowing me to join your research group for the past two years; I
have learnt so much more by being in Toronto, and I will be ever grateful for the opportunity
you have provided. Your patience, advice and direction have allowed this thesis to become a
reality. Professor Stephen Lye, your excitement over my findings that fit the DOHaD story has
kept me going, thank you.
To Professor Lyle Palmer, thank you for beginning my career in genetic statistics. Your
guidance and enthusiasm is what has kept me going over the past few years. You have always
believed in my abilities and encouraged me to aspire to a level I never thought was possible.
I am grateful to have spent time working with an amazingly talented group of people
throughout the course of my thesis. Yan Yan, thank you for helping broaden my knowledge of
linear mixed effects models and simulation studies. Your ability to describe complex methods
makes the statistics much more palatable. Laura Howe, thank you for your continuous
enthusiasm and willingness to contribute to my research. Even from so far away, your emails
would always put a smile on my face. To Kate, Lavinia and Debbie, thank you for your advice
and help in interpreting results from various aspects of my research.
This thesis wouldn’t be possible without the foresight of Professor John Newnham who
established the Raine Study, thank you. Thank you also to the participants and scientists
involved in the Raine, ALSPAC and NFBC66 studies; I understand how much these cohort
studies take to maintain and know that they only exist because of the dedication and
generosity of all those involved. Thank you to the Raine Study for providing me with additional
funding to make my research possible and allow me to travel between Perth and Toronto. I
also gratefully acknowledge the financial support received from the Australian Government
Department of Innovation, Industry, Science and Research (Australian Postgraduate Award).
To my friends who are one step ahead of me; Laura, Katie, Gemma and Sarah. Thank you for
giving me the inspiration to achieve this goal and sharing stories of your PhD journeys with me.
I have enjoyed our statistical and epidemiological discussions in the office, over brunch or even
over a glass of wine (or beer for the Canadians!). Laura, thank you for making me experience
everything Canadian while I was in Toronto; you have been an amazing tour guide and I have
seen so much. Sarah, all of your advice over the final few months of this journey was greatly
appreciated.
A special thank you to my Canadian family: Lucie, Peter, Helen and Jane. When it all got too
much, you were always there to listen, have a glass of wine and entertain me with the latest
misdemeanours. You gave me a home away from home. Our memoirs from the last two years
will provide great entertainment!
Finally, I would like to say a big thank you to my amazing family. I’m sure you all thought that I was
crazy to do this, but you continued to support me in every way. Mum and Dad, thank you for
being there and listening through all the ups and downs of this journey. Your support and
willingness to understand what I was working on is admired. Mon, thank you for providing a
bed for my many trips back to Perth, being an amazing right arm when mine was broken, being
an English language teacher when I needed it most and just being the best big sister a girl
could ever want. I would be lost without you. Nanna, thank you for your continued love and
support, and amazing baking when I make it home! I hope I have made you all proud.
Chapter 1: Introduction
1.1 General Introduction
Incidence rates of the most common chronic diseases are increasing in developed and, more
recently, developing countries, affecting millions of people globally. These conditions are
influenced by multiple interacting factors including the environment, behaviour and genetics.
Relatively little is known about what is driving the rise in the prevalence of many chronic
diseases, which limits our ability to constrain the prevalence of these conditions that result in
enormous economic and social burden. The ability to study environmental factors influencing
disease risk has resulted in some important discoveries, such as the relationship between
smoking and increased cancer risk. Despite the common knowledge that most chronic diseases
are heritable, our ability to discover the combination of genes that increase an individual’s
disease risk has, to date, had a very small impact.
Over the past decade there has been explosive growth in the technical capacity to generate
and store enormous genomic datasets across the whole genome. The analysis of these
datasets was initially tempered by failures to find genes for complex phenotypes using any
analytic strategy. More recently, primarily through the increased sample sizes available using
multiple studies in large consortia, many genetic regions associated with hundreds of
phenotypes have been identified. Notwithstanding these successes, there is still a great need
to discover, characterise and translate genetic and environmental determinants of disease.
The statistical methods to analyse genetic data still lag far behind our ability to produce
enormous genetic datasets. Many of the traits investigated to date have been analysed using
cross-sectional designs utilising relatively simplistic statistical methods. This has resulted in the
identification of many regions of interest in the genome; however the genetic variants
discovered to date only explain a small proportion of the expected variability in a given trait
due to genetics (heritability). To advance our knowledge of the role of genetics in complex
disease, sophisticated statistical methods and optimal study designs now need to be
developed that better capture the disease process.
This thesis aims to extend the current statistical methods used in genetic association studies to
allow investigation into how genetic variants are associated with quantitative phenotypes over
time. By applying advanced statistical techniques to the breadth of longitudinal data that is
available in large population based studies, it is anticipated that geneticists will be able to
uncover more of the genetic determinants of complex diseases. Longer term, this information
may lead to the development and implementation of more targeted interventions at a younger
age before the onset of disabling chronic disease in those at risk, ultimately reducing the cost
of healthcare.
1.2 Introduction to Life Course Epidemiology
Late last century, there were two schools of thought regarding how individuals developed
disease in adulthood. The first hypothesis was the “adult lifestyle” approach, whereby an
individual’s behaviour in adulthood, including smoking, diet, exercise and alcohol
consumption, affected the onset and progression of disease. Its main focus was on identifying
factors that were associated with the timing of disease onset and speed of degeneration. The
second approach was the “biological programming” hypothesis, whereby environmental
exposures, such as under-nutrition during critical periods of growth and development in utero,
programmed an individual in such a way that their risk of adult chronic disease was increased
[1]. This approach was originally known as the foetal origins of adult disease, where birth size,
as a marker of antenatal growth, influenced later disease risk. It has more recently been
termed the developmental origins of health and disease (DOHaD) which broadens the
timeframe of development to beyond that of just antenatal growth leading to birth size [2].
The hypotheses underlying DOHaD indicate that we are predisposed to disease through our
early life exposures. In the late 1990s, a third approach was introduced to bridge the
increasing gap between the existing biological programming and adult lifestyle approaches.
This new hypothesis was termed the life course approach, which developed into a
multidisciplinary framework for research on health, human development and aging [3,4]. Life
course epidemiology is defined as “the study of long term effects on later health or disease risk
of physical or social exposures during gestation, childhood, adolescence, young adulthood and
later adult life” [3,4]. Life course epidemiology investigates the contribution of early life risk
factors, such as biological (including genetics), environmental and social exposures, in
conjunction with later-life factors, to identify processes that may account for inequalities in
adult health and mortality. It aims to understand the relevance of different exposures
occurring at different times in the life course on later health, and allows key periods to be
identified for potential targeted interventions. A significant aspect of life course epidemiology
is the idea of critical or sensitive periods; limited windows of time where an exposure has an
effect, either protective or adverse, on the development of a disease outcome. Specifically, a
critical period occurs where outside this window of time there is no excess disease risk
associated with the exposure, whereas a sensitive period allows for modification or reversal of
the disease risk outside of this particular window [5]. If researchers are able to promote
awareness of these early and life course approaches to disease, it will reinforce the importance
of a healthy lifestyle from a young age, which will ideally have an impact on reducing the
incidence of disease.
1.3 Introduction to Genetic Epidemiology
Genetic epidemiology has been defined as “a discipline closely allied to traditional
epidemiology that focuses on the familial, and in particular genetic, determinants of disease
and the joint effects of genes and non-genetic determinants” [6]. This section presents an
overview of some of the fundamental genetic concepts and terminology used throughout this
thesis.
1.3.1 Genetics for Genetic Epidemiology
The number of cells in the human body is estimated to be 100 trillion ($1 \times 10^{14}$). In the nucleus
of each cell are 23 pairs of chromosomes; one of each pair is inherited from each parent. These
23 pairs are made up of 22 pairs of autosomes and one pair of sex chromosomes: XX in females and XY in
males. Each chromosome is made up of deoxyribonucleic acid (DNA), the carrier of the genetic
code. The genome is the total amount of hereditary information encoded in the DNA across all
chromosomes. The genome is made up of smaller units, also known as genes, which are
sequences of DNA located together and are the basic unit of genetic information. Each
individual has approximately 25,000 genes, each of which consists of coding and non-coding
regions, known as exons and introns respectively. Exons are DNA sequences that code for
particular proteins, whereas introns do not code for proteins but are thought to play an important
role in regulating the manufacture of proteins. DNA consists of two strands; each strand
contains chemical building blocks called nucleotides. Nucleotides differ in their nitrogen-
containing base. There are four bases in DNA; adenine (A), cytosine (C), thymine (T) and
guanine (G). The sequence of these bases within a strand determines the genetic information
stored in the strand. As these four bases occur in pairs along the chromosome, the genetic
distance along a chromosome is measured in base pairs (bp). A genetic locus is a particular
position along a chromosome. An allele is any one of the four possible DNA bases occupying a
given genetic locus on a chromosome. Typically, alleles are used as a representation of a gene
that incorporates them, as they account for variations in an inherited characteristic. An
individual will inherit one allele from each parent and will therefore have two alleles at each
genetic locus. An individual’s genotype depicts the alleles that they possess at a locus and
therefore the specific genetic makeup of an individual at a particular locus.
The most commonly used genetic variant in studies of the genome is a biallelic variant. If we
let A denote the common allele and a the rare allele at a locus, then there are three possible
combinations of alleles that make up an individual’s genotype, namely AA, Aa or aa. If an
individual has a genotype with identical alleles (AA or aa) then they are referred to as a
homozygote. If an individual has two different alleles at a locus (Aa) then they are said to be a
heterozygote.
An observable characteristic in an individual, either a biological or physical trait, is called a
phenotype. A phenotype is any characteristic that varies between individuals, for example
height or eye colour, and is influenced by an interaction between an individual’s genotype and
their environment.
A polymorphism is a variation in the genetic material that may change the function of a gene. A
single nucleotide polymorphism (SNP) occurs when a single nucleotide (either A, T, C or G) at a
specific genetic locus in the genome is substituted with another nucleotide. More specifically,
suppose the majority of individuals are homozygous for the C allele. A SNP occurs if one or
both of these alleles are replaced with another allele, say T. In this mutation, the T allele would
be referred to as the variant (or rare) allele. Each SNP has a unique identification number,
known as its rs number. Of the three billion bases in the human genome, there are
approximately 10 million common SNPs with frequency greater than 1% [7].
Many SNPs have no effect on cell function and go unnoticed. However, some SNPs can affect
the development and progression of disease in individuals and how they respond to
treatments used in prevention and disease management. SNPs are often used as markers in
genetic studies. As they remain relatively constant from generation to generation, they are
useful for studying the associations between genetic variations and observed phenotypes in an
individual. Although there are several different types of mutations, including copy number
variations (CNVs) and insertions/deletions (Indels), this thesis focuses on SNPs only.
SNPs can affect phenotypic expression in four ways, exerting dominant, recessive, additive or
co-dominant genetic effects. A dominant pattern is where a phenotype is expressed in
individuals who have at least one copy of the variant allele; a recessive pattern is where the
individual must have two copies of the variant allele to have the phenotype; an additive
pattern is where the expression of a phenotype increases or decreases linearly as the number
of variant alleles increases; and a co-dominant pattern occurs when the expression of a
phenotype increases or decreases, not necessarily in a linear pattern, as the number of variant
alleles increases.
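As an illustration of the four patterns described above, the following minimal Python sketch shows how a genotype, expressed as the number of copies of the variant allele (0, 1 or 2), might be coded as a covariate under each genetic model. The function name `encode_genotype` and the use of two indicator variables for the co-dominant model are illustrative assumptions rather than code used in this thesis.

```python
def encode_genotype(n_variant):
    """Code a genotype (0, 1 or 2 copies of the variant allele) under four models."""
    if n_variant not in (0, 1, 2):
        raise ValueError("n_variant must be 0, 1 or 2")
    return {
        # Dominant: at least one copy of the variant allele
        "dominant": int(n_variant >= 1),
        # Recessive: two copies of the variant allele are required
        "recessive": int(n_variant == 2),
        # Additive: linear in the number of variant alleles
        "additive": n_variant,
        # Co-dominant: separate indicators for heterozygotes and variant
        # homozygotes, so the effect need not be linear in the allele count
        "codominant": (int(n_variant == 1), int(n_variant == 2)),
    }


# Hypothetical usage with the A/a labelling used in the text (a = variant allele)
for genotype, n in (("AA", 0), ("Aa", 1), ("aa", 2)):
    print(genotype, encode_genotype(n))
```

Under the additive coding a single regression coefficient describes the change in phenotype per variant allele, whereas the co-dominant coding estimates a separate effect for each non-reference genotype.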
Recombination is the “breaking down of one maternal and one paternal chromosome, the
exchange of corresponding sections of DNA, and the re-joining of the chromosomes” [8]. As a
result of recombination, each chromosome contains a mixture of alleles from each parent. A
haplotype is a series of alleles along a chromosome which have not taken part in
recombination, and have therefore been inherited as a unit. Haplotype inference is
introduced in Section 1.3.4, and the relevance of haplotypes to genome-wide association
studies is discussed in Section 1.4.2.
1.3.2 Hardy-Weinberg Equilibrium Principle
The Hardy-Weinberg Equilibrium (HWE) principle serves as the foundation for population
genetics, as it explains the consistency of genotype frequencies across generations in a
population. The principle was derived independently by an English mathematician, Godfrey
Hardy [9], and a German physician, Wilhelm Weinberg [10]. It states that in a large,
randomly mating population, the allele and genotype frequencies at a single genetic locus will
remain at an equilibrium value from generation to generation. The principle only holds if the
following conditions are satisfied:
1. The population is very large
2. The population is isolated from other populations (i.e. there is no migration)
3. There are no mutations
4. Mating within the population occurs at random
5. There is no natural selection (i.e. every individual has an equal chance of survival)
Let us consider the following example consisting of a bi-allelic locus with two alleles, A and a.
Let the relative frequencies of these alleles in a given population be p and q=1-p respectively.
As stated in the previous section, there are three possible genotypes for this locus, AA, Aa and
aa. Under HWE, the frequencies of these genotypes will be of the proportions $p^2$, $2pq$ and $q^2$
respectively. If the above conditions hold then these frequencies will remain unchanged from
generation to generation. If the genotype frequencies in the observed population are
significantly different from the expected frequencies then it can be concluded that the sample
is not in HWE. Deviations from HWE can indicate genotyping errors or violation of one of the
five assumptions, both of which could lead to bias in subsequent genetic analysis [11].
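The comparison of observed and expected genotype frequencies described above can be sketched in a few lines of Python. This is a minimal illustration that assumes a one degree of freedom Pearson chi-square test is used; exact tests are a common alternative, and the text does not specify which test is applied. The function name `hwe_chi_square` is hypothetical.

```python
import math


def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Chi-square test of Hardy-Weinberg proportions at a biallelic locus."""
    n = n_AA + n_Aa + n_aa
    # Estimate the frequency of allele A by counting alleles
    p = (2 * n_AA + n_Aa) / (2 * n)
    q = 1 - p
    # Expected genotype counts under HWE: n*p^2, n*2pq, n*q^2
    expected = (n * p**2, 2 * n * p * q, n * q**2)
    observed = (n_AA, n_Aa, n_aa)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper tail of a chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(stat / 2))
    return p, expected, stat, p_value


# Hypothetical genotype counts for AA, Aa and aa
print(hwe_chi_square(360, 480, 160))
print(hwe_chi_square(400, 400, 200))
```

A small p-value suggests that the observed genotype counts are inconsistent with Hardy-Weinberg proportions, which in practice would prompt a check for genotyping error before further analysis.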
1.3.3 Linkage Disequilibrium
Under the assumption of random mating, the alleles of a locus within a population are
combined into genotypes randomly, such that genotype frequencies are consistent with
Hardy-Weinberg proportions. Consider an example with two loci, M1 and M2, where the
possible genotypes at M1 are AA, Aa and aa and the possible genotypes at M2 are BB, Bb and
bb. Although the genotypes of both of these loci may be consistent with Hardy-Weinberg
proportions, the alleles of M1 may not be independent of the alleles of M2. This dependence
between two genetic loci is termed linkage disequilibrium (LD). Alternatively, if allele A is
independent of allele B then the two loci are said to be in linkage equilibrium.
Several measures to quantify the linkage disequilibrium between two SNPs have been
proposed [12,13]. To continue the example above, let the probabilities of the alleles at the two
loci in the sample be denoted as pA, pa, pB, and pb respectively. When alleles at the two loci
occur independently of each other, that is they are in linkage equilibrium, then the probability
of having an A allele at M1 and B at M2 is pAB = pApB. The sum of the probabilities for each
combination is one; i.e. pAB + paB + pAb + pab = 1. Additionally, due to the independence of the
two loci, the probability of having an A allele given that an individual already has a B allele is
pA|B = pA. When the alleles are in linkage disequilibrium, the observed probabilities of the
alleles differ from the expected probabilities and the strength of this deviation can be
quantified by the linkage disequilibrium coefficient, D. Table 1.1 illustrates how D modifies the
expected probabilities.
Table 1.1: Probability of observing a given haplotype when the loci are in disequilibrium
Alleles    B                   b
A          pAB = pApB + D      pAb = pApb – D
a          paB = papB – D      pab = papb + D
Table 1.1 shows that, when the population is in LD, the probability of each combination of alleles
differs from its expected value under independence by the linkage disequilibrium parameter D. If D is zero then there is linkage
equilibrium. The value of D is dependent on the allele frequencies at M1 and M2 such that the
smallest possible value of D (Dmin) and the largest possible value (Dmax) are:
Dmin = the larger of –pApB and – papb
Dmax = the smaller of pApb and papB
Because D is dependent on allele frequencies at both the loci, this parameter is comparable
between two pairs of loci only if their allele frequencies are similar. To standardise this
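measure, Lewontin [12] introduced the now commonly used measure D’, obtained by dividing
D by its largest attainable magnitude given the allele frequencies (Dmax when D is positive, or the
absolute value of Dmin when D is negative). The absolute value of D’ will then lie between 0 and 1, with a larger value indicating a
stronger correlation between the two loci and therefore stronger LD. If the two loci are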
completely correlated, D’ will equal 1 and the alleles at one locus will always predict the alleles
at the second locus. Another common LD coefficient, r [13], is the correlation coefficient
between two loci, defined as:
$$r = \frac{D}{\sqrt{p_{A+}\, p_{a+}\, p_{B+}\, p_{b+}}}$$
where pA+ indicates the relative frequency of the A allele (pAB + pAb), pa+ indicates the relative
frequency of the a allele (paB + pab), and the same applies for the B allele. This term is
commonly squared to remove the sign that can be introduced in this calculation depending on
how the loci are labelled, giving the more familiar measure $r^2$. Even when two loci are in
complete disequilibrium based on Lewontin’s D’ (i.e. D’=1), the pairwise $r^2$ value can vary
widely because it is related to the allele frequencies of the two loci and the position of the
corresponding mutations in the genealogy.
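The three LD measures discussed in this section can be computed directly from the four haplotype probabilities in Table 1.1. The Python sketch below is illustrative only: it assumes the haplotype probabilities are known and that both loci are polymorphic, and it reports the absolute value of D’, normalising a negative D by Dmin, which is a common convention assumed here. The function name `ld_measures` is hypothetical.

```python
import math


def ld_measures(p_AB, p_Ab, p_aB, p_ab):
    """Return D, |D'| and r^2 from the four haplotype probabilities."""
    # Marginal allele frequencies at the two loci
    p_A, p_a = p_AB + p_Ab, p_aB + p_ab
    p_B, p_b = p_AB + p_aB, p_Ab + p_ab
    # D is the deviation of the AB haplotype probability from independence
    D = p_AB - p_A * p_B
    d_max = min(p_A * p_b, p_a * p_B)    # largest possible value of D
    d_min = max(-p_A * p_B, -p_a * p_b)  # smallest possible value of D
    if D > 0:
        d_prime = D / d_max
    elif D < 0:
        d_prime = D / d_min  # both negative, so the ratio is positive
    else:
        d_prime = 0.0
    r = D / math.sqrt(p_A * p_a * p_B * p_b)
    return D, d_prime, r**2


# Hypothetical haplotype probabilities in which the Ab haplotype is absent
D, d_prime, r2 = ld_measures(0.5, 0.0, 0.3, 0.2)
print(D, d_prime, r2)
```

In this hypothetical example one of the four haplotypes is never observed, so D’ equals 1, yet $r^2$ is well below 1 because the allele frequencies at the two loci differ.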
1.3.4 Haplotype Inference
Chromosomes from each parent exchange segments of DNA during cell formation, a process
referred to as recombination. Alleles that have not been subjected to the recombination
process are inherited as a unit, henceforth referred to as a haplotype. Several methods can be
used to construct haplotypes with certainty, including the use of family data or laboratory
techniques to isolate specific haplotypes; however these procedures are often costly and time
consuming [14]. Therefore, haplotypes are often derived using analytic techniques from
genotypes, although an individual’s genotype may not uniquely define their haplotypes. For
example, consider two loci such that the first has genotype AA, Aa or aa and the second BB, Bb
or bb. Pairing one allele from a genotype at each locus forms a haplotype. Table 1.2 illustrates
the possible diplotypes (haplotype pairs) an individual can have given their genotypic
information.
Table 1.2: The genotypic information an individual possesses across two loci determines their
possible diplotypes
Genotype    BB       Bb                 bb
AA          AB AB    AB Ab              Ab Ab
Aa          AB aB    AB ab or Ab aB     Ab ab
aa          aB aB    aB ab              ab ab
Table 1.2 shows that most genotype combinations result in a unique diplotype. For example, if
an individual has the genotype AA at one locus and BB at the other, then they have the
diplotype AB AB. However, if an individual is heterozygous at the two loci, that is the genotype
Aa at one locus and Bb at the other, then the diplotype is uncertain. This is also referred to as
a phase-ambiguous haplotype pair. Appropriate statistical methods are necessary to establish
the likelihood of an individual possessing each diplotype based on their available genotype
data. There is a growing body of literature on appropriate haplotype inference methods,
including Clark’s algorithm [15], a pseudo-Bayesian algorithm by Stephens et al. [16] and an
Expectation-Maximization (EM) algorithm by Excoffier and Slatkin [14].
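To make the idea of resolving phase-ambiguous diplotypes concrete, the following Python sketch applies an EM iteration to the two-locus setting of Table 1.2, where only double heterozygotes (Aa, Bb) are ambiguous. It is a simplified illustration in the spirit of the EM approach, not a reproduction of any published implementation; it assumes random mating, two biallelic loci and no missing genotypes. The function name and the input format (a dictionary keyed by the number of a and b alleles carried) are hypothetical.

```python
def em_haplotype_frequencies(genotype_counts, max_iter=200, tol=1e-10):
    """Estimate AB, Ab, aB and ab haplotype frequencies by EM.

    genotype_counts maps (number of a alleles, number of b alleles) to the
    number of individuals with that two-locus genotype.
    """
    alleles1 = {0: "AA", 1: "Aa", 2: "aa"}
    alleles2 = {0: "BB", 1: "Bb", 2: "bb"}
    fixed = {"AB": 0.0, "Ab": 0.0, "aB": 0.0, "ab": 0.0}
    n_double_het = 0
    for (g1, g2), n in genotype_counts.items():
        if g1 == 1 and g2 == 1:
            n_double_het += n  # phase ambiguous: AB/ab or Ab/aB
            continue
        # With at most one heterozygous locus the diplotype is unique
        for a, b in zip(alleles1[g1], alleles2[g2]):
            fixed[a + b] += n
    total = 2 * sum(genotype_counts.values())
    freq = {h: 0.25 for h in fixed}  # starting values
    for _ in range(max_iter):
        # E-step: probability that a double heterozygote carries AB/ab
        cis = freq["AB"] * freq["ab"]
        trans = freq["Ab"] * freq["aB"]
        w = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M-step: update frequencies from expected haplotype counts
        counts = dict(fixed)
        counts["AB"] += n_double_het * w
        counts["ab"] += n_double_het * w
        counts["Ab"] += n_double_het * (1 - w)
        counts["aB"] += n_double_het * (1 - w)
        new_freq = {h: c / total for h, c in counts.items()}
        change = max(abs(new_freq[h] - freq[h]) for h in freq)
        freq = new_freq
        if change < tol:
            break
    return freq


# Hypothetical sample: 10 AA/BB, 10 aa/bb and 10 double heterozygotes
print(em_haplotype_frequencies({(0, 0): 10, (2, 2): 10, (1, 1): 10}))
```

Because the unambiguous individuals in this hypothetical sample carry only AB and ab haplotypes, the iteration assigns the double heterozygotes almost entirely to the AB/ab diplotype.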
1.3.5 Evolution of Genetic Epidemiology Studies
There are various approaches available to investigate the relationship between genetic
variants and specific phenotypes including linkage studies [17], association studies [18] and
most recently whole genome sequencing [19], each of which has its own strengths and
limitations [20]. This section describes these key analytic approaches which are used to
identify the genetic factors of human disease.
1.3.5.1 Linkage
Linkage studies were previously used as the first stage in the genetic investigation of a trait, as
they identify broad genetic regions that might contain a disease gene. Two genetic loci are
linked if recombination between them occurs with a probability of less than 50%; that is, they
are more likely to be transmitted together from parent to offspring than expected under independent
inheritance [17].
In parametric (or model-based) linkage analysis, genetic markers that are evenly distributed
throughout the genome are genotyped in pedigrees and their co-segregation is investigated.
Parametric linkage examines the probability of recombination between two loci, quantified by
the recombination fraction θ, and is usually reported as a logarithm of the odds (LOD) score
[21]. LOD score analysis is equivalent to likelihood ratio testing, but uses logs to the base 10
instead of natural logarithms. Under the null hypothesis, there is no linkage between the
disease and marker loci (θ=0.5), while the alternative hypothesis assumes linkage exists
(θ<0.5). The LOD score function, the log10 of the ratio of the likelihood at a given θ to the likelihood under the null
hypothesis (θ=0.5), is then maximised with respect to θ. A LOD score greater than approximately 3 to
3.3 is deemed significant evidence of linkage [22,23].
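The LOD score calculation can be illustrated for the simplest textbook case. The Python sketch below assumes phase-known, fully informative meioses, so that the likelihood depends only on the number of recombinant and non-recombinant meioses; this assumption and the function names are introduced for illustration and are not taken from the analyses in this thesis.

```python
import math


def lod_score(theta, n_recombinant, n_meioses):
    """LOD score at recombination fraction theta for phase-known meioses."""
    k, n = n_recombinant, n_meioses
    # Binomial likelihood kernel; the binomial coefficient cancels in the ratio
    likelihood = theta**k * (1 - theta) ** (n - k)
    null_likelihood = 0.5**n  # no linkage: theta = 0.5
    if likelihood == 0:
        return float("-inf")
    return math.log10(likelihood / null_likelihood)


def max_lod(n_recombinant, n_meioses):
    """Maximise the LOD score over theta in [0, 0.5]."""
    # The maximum likelihood estimate is the observed recombinant
    # proportion, truncated at 0.5
    theta_hat = min(n_recombinant / n_meioses, 0.5)
    return theta_hat, lod_score(theta_hat, n_recombinant, n_meioses)


# Hypothetical data: 1 recombinant observed among 10 informative meioses
print(max_lod(1, 10))
```

With no recombinants the maximised LOD score grows by log10(2), approximately 0.3, per informative meiosis, so under these assumptions about ten such meioses are needed to reach a score of three.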
Non-parametric (or model-free) linkage analysis uses the expectation that in a region
containing a disease-causing gene there would be an excess of identical by descent (IBD)
haplotype sharing. The simplest approach is to study affected sibling pairs. Under the null
hypothesis of no linkage, the number of IBD alleles shared by the siblings is none with
probability 0.25, one with probability 0.5 and two with probability 0.25. By genotyping genetic
markers across the genome, the observed proportions of sharing none, one or two alleles IBD
at candidate loci in affected sib pairs can be compared with the expected proportions under
the null hypothesis. Linkage would be suggested if the affected sib pairs share significantly
more alleles IBD than expected by chance. The most powerful test for detecting linkage in
affected sib pairs is the mean test, whereby the mean number of alleles shared IBD is
compared to the expected value of one [24]. This approach has been modified for other types
of relatives and larger family study designs [25,26]. Methods have also been developed for
linkage analysis of quantitative traits, using the assumption that two siblings who share more
alleles IBD would be expected to have more similar trait values if the marker is linked to a gene
influencing the trait [27,28,29].
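As an illustration of the mean test for affected sib pairs, the sketch below compares the observed mean number of alleles shared IBD with the null expectation of one. It assumes that IBD status is known without error for every pair and uses a normal approximation with the null variance of 0.5 implied by the probabilities 0.25, 0.5 and 0.25; the function name and the example counts are hypothetical.

```python
import math


def asp_mean_test(n0, n1, n2):
    """Mean test for affected sib pairs sharing 0, 1 or 2 alleles IBD.

    Under no linkage the number shared has mean 1 and variance 0.5.
    Returns the mean sharing, a z statistic and a one-sided p-value.
    """
    n = n0 + n1 + n2
    mean_ibd = (n1 + 2 * n2) / n
    z = (mean_ibd - 1.0) / math.sqrt(0.5 / n)
    # One-sided test: linkage is suggested only by excess sharing.
    p_one_sided = 0.5 * math.erfc(z / math.sqrt(2.0))
    return mean_ibd, z, p_one_sided


# Hypothetical counts of sib pairs sharing 0, 1 and 2 alleles IBD.
mean_ibd, z, p = asp_mean_test(10, 50, 40)
```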
Linkage studies have been very useful for identifying genomic regions for single-gene,
‘Mendelian’ diseases, which has resulted in the mapping of over 2,000 genes [30]. However,
they have only had limited success for common, complex diseases [31] due to several factors
including the relatively small proportion of variance explained by individual loci of the complex
disease, low resolution to localise disease alleles to a particular location, imprecise phenotype
definitions and inadequately powered study designs [32].
1.3.5.2 Association
Association studies investigate whether a genetic variant (for example a genotype, allele or
haplotype) is consistently associated with an observed disease or phenotypic trait in a study
population as a whole. They typically use large cross-sectional (either case/control samples or
population based samples for quantitative phenotypes) or cohort designs of unrelated
individuals; however, family designs can also be utilized. Table 1.3, adapted from Cordell and
Clayton [18], illustrates the possible study designs for genetic association studies.
Table 1.3: Study designs for genetic association studies [18,33,34]
Study design: Case-control. Sample affected (case) and unaffected (control) individuals.
Cases are often obtained from family practitioners or disease registries; controls can be
obtained from a random population sample.
Advantages: The sample is relatively easy to collect, in comparison to other study designs,
and there is no need for follow-up of the individuals. This study design is used to provide an
estimate of exposure effects and is the preferred design for rare diseases.
Disadvantages: Sampling requires careful selection of controls. There is potential for
confounding (e.g. population stratification) using this study design and it can generally only
be used to investigate one outcome.
Study design: Case-only. Sample only affected individuals. Cases can be obtained from an
initial cross-sectional, cohort or disease based sample.
Advantages: This design is the most powerful design for detection of interaction effects.
Disadvantages: This design can only estimate interaction effects and is very sensitive to
population stratification.
Study design: Case-parent triads. Sample affected individuals and both of their parents.
Affected individuals can be obtained from an initial cross-sectional, cohort or disease based
sample.
Advantages: This design is robust to population stratification and can be used to estimate
parent of origin and imprinting effects.
Disadvantages: This study design is less powerful than the case-control design.
Study design: Case-parent-grandparent. Sample affected individuals, both of their parents and
all of their grandparents. Affected individuals can be obtained from an initial cross-sectional,
cohort or disease based sample.
Advantages: This design is robust to population stratification and can be used to estimate
parent of origin and imprinting effects.
Disadvantages: Grandparents are rarely available for sampling.
Study design: General pedigrees. A random sample or disease based sample of families from
the general population.
Advantages: Power is often greater in studies of large pedigrees than in other family designs.
The sample may already exist from previous linkage studies. This study design can be used to
investigate a disease or quantitative trait.
Disadvantages: This study design can be expensive to genotype and there are generally many
missing individuals within the families.
Study design: Cohort. A subsection of the population which is used to follow the disease
incidence over a specified time period.
Advantages: This design measures events in temporal sequence and therefore can be used to
distinguish between causes and effects.
Disadvantages: This design is expensive to follow up. Sample selection and loss to follow-up
are potential causes of bias.
Study design: Cross-sectional. A random sample from the population which is used to study the
prevalence of a disease.
Advantages: This sample is inexpensive to collect and it can be used to investigate multiple
diseases or quantitative traits.
Disadvantages: This study design is not ideal for rare diseases as there are few affected
individuals.
Study design: Extreme values. A sample of individuals with extreme (high or low) values of a
quantitative trait. These individuals are often obtained from an established cross-sectional or
cohort sample.
Advantages: This study design genotypes only the most informative individuals and hence saves
on genotyping costs.
Disadvantages: This study design does not allow for an estimate of true genetic effect sizes.
Study design: DNA-pooling. Applies to a variety of the above designs, but genotyping is of
pooled DNA from anywhere between two and 100 individuals (rather than on an individual basis).
Advantages: This study design is potentially inexpensive compared with individual genotyping.
Disadvantages: It is hard to estimate different experimental sources of variance using this
study design.
There are two main types of association studies: candidate gene studies and genome-wide
association studies (GWASs). For candidate gene studies, a gene that is likely to be associated
with an outcome of interest is selected based on prior knowledge, such as biological
plausibility, studies of animal models or prior genetic association studies [35]. Particular
genetic variants in that gene, based on the LD structure, are then genotyped and association
analyses between these variants and the outcome are performed. Candidate gene studies
have been unsuccessful in detecting replicable genetic loci for some disease outcomes as there
are no obvious plausible candidates to test, the power of the studies is low because of the
small sample sizes, the effect size of each genetic locus is small and, prior to HapMap, there
was inadequate coverage of common variation [32]. More recently, genetic association studies
have been performed over the entire genome, which removes the requirement of prior
knowledge. These GWAS analyses make no assumptions about the identity of the causal gene
and are therefore a ‘hypothesis free’ approach to genetic association analysis. The specific
details of GWASs will be provided in Section 1.4.
There are three reasons why an association between a genetic polymorphism and a trait might
exist [36]:
1. Direct association: the polymorphism is functional and causes the change in trait
value.
2. Indirect association: the polymorphism does not cause the change in trait value
but is in linkage disequilibrium with the causal variant.
3. Confounded association: the association is due to underlying population
stratification or admixture. In genetic epidemiological studies, population
stratification can be accounted for to reduce the chance of a confounded
association. It can be accounted for by [37]:
a. Matching by family membership so that comparisons are performed
between members of the same family [38].
b. Estimating the population substructure using either ancestry informative
markers (a set of loci that exhibit different allele frequencies between
populations from different geographical regions) or principal components
analysis on a large set of genotype data. These estimates can then be used
to remove individuals from the study who are of a different substructure
or to make statistical adjustments for the substructure in the analysis [39].
c. Increasing the threshold required to declare statistical significance.
Population substructure increases the type 1 error rate (false positives),
so a more stringent threshold also controls for the substructure. This
method is referred to as ‘Genomic control’; however, it does not attempt to
control for the false negative rate, which is also an issue when population
substructure is present [40].
These points need to be kept in mind when interpreting the results of a genetic association
study.
1.4 Introduction to Genome-Wide Association Studies (GWASs)
1.4.1 Definition
As linkage and candidate gene approaches have, in general, failed to discover replicable
genetic associations for many diseases and traits to date, the genetics community are turning
to GWASs to identify more genetic variants that explain the heritability of these traits. A GWAS
investigates the association between a particular phenotype and each SNP on a genome-wide
scale. Although ‘genome-wide’ implies that all 10 million common SNPs are investigated,
which would be a complete coverage of the common SNPs within the genome, often only
~500,000 SNPs are genotyped on a ‘whole-genome’ panel (panels with greater than one
million SNPs are available). However, most custom whole-genome panels provide substantial
coverage of common variation in non-African populations through LD patterns [41].
As displayed in Figure 1.1, GWASs identify SNPs that are common in the population but have
low disease penetrance (purple circle). In contrast, linkage analyses often identify genetic
variants that are low in frequency but have high disease penetrance (blue circle) and candidate
gene association studies capture a mix of all combinations depending on the disease and gene
of interest. The shift in genetic epidemiology to using GWASs has been accompanied by an
increasing methodological focus on optimal approaches to the design, analysis, meta-analysis
and reporting of genetic studies, including how to define a SNP as ‘significant’ given the large
number of tests conducted.
Figure 1.1: Architecture of disease based on genetic determinants (Image adapted from
McCarthy et al [42], Manolio et al [43], Bush and Moore [44])
The first GWAS, published in 2005, investigated age-related macular degeneration in a small
sample of individuals on a relatively sparse panel of markers [45]. However, the ‘landmark’
GWAS was not published until 2007, when the Wellcome Trust Case Control Consortium
performed a GWAS of seven common diseases [46]. There is now a large database describing
the trait/disease associated SNPs discovered using GWASs [47], showing that over 1,500
GWASs have been published on a wide range of diseases and trait phenotypes [48]. Most of
the common variants at loci found to date have only demonstrated a modest effect, often with
odds ratios of less than 1.2 and explaining less than 1% of the variance of a phenotypic trait
[49]. Therefore, for the majority of common diseases and traits there is still a large portion of
the genetic architecture of disease unexplained [50,51].
1.4.2 Imputation
Imputation, the process by which missing data is replaced by a probable value based on
additional information, has been used in statistics for many decades. Genotype imputation
uses information from genotyped SNPs observed in a given individual and known LD patterns
from a more densely genotyped reference panel to determine the probable genotypes at an
untyped locus. Imputation is used for loci that were not genotyped or where a genotype was
not determined for a particular individual.
A particular stretch of chromosome in one individual provides information about the
genotypes of many other individuals who inherit that same stretch of chromosome IBD. In
related individuals, the stretch of chromosome, or haplotype block, will be larger as there have
been fewer generations for recombination to occur. In unrelated individuals, the shared
haplotype blocks will be much shorter as the common ancestors will be much more distant
and therefore the haplotypes are harder to identify with confidence. Genotype imputation for
missing genotypes at observed SNPs uses these haplotype blocks in a Hidden Markov Model to
determine the probability of the missing genotype given the observed genotypes and
haplotype combinations. It takes into account recurrent mutation at the SNP and potential
recombination in the region [52].
Genotype imputation for missing SNPs identifies haplotype blocks throughout the genome for
a particular ethnicity from a reference panel, which is an existing database with detailed
genotype information on a large number of markers in a few individuals (50 to several
hundred). The haplotype blocks for each individual in a study sample are then compared to
those in the reference panel. The schema in Figure 1.2 outlines this process. The top left panel
shows the loci of 15 common variants that were genotyped in the reference panel, of which 6
were genotyped in the study sample. The grey question marks indicate loci that were not
genotyped; these genotypes need to be imputed. The first stage of imputation is to phase the
haplotypes (top right panel of Figure 1.2); or in other words, to determine which haplotypes
the individual is likely to have, as described in Section 1.3.4. If the phase is ambiguous, this
stage may produce several possible haplotypes for a given individual. The second stage is to
compare the phased haplotypes for each individual to the reference panel (bottom left panel
of Figure 1.2). Finally, when a match is made to the reference panel, the remaining genotypes
in the haplotype block are imputed for the given individual (bottom right panel of Figure 1.2).
This is then repeated for all individuals in the study sample. Due to possible mutations and
recombination throughout generations, this process is often more complex than illustrated in
Figure 1.2; however, this describes the basic idea being implemented.
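The matching-and-copying idea can be sketched with a deliberately simplified toy example. It assumes that the study haplotype has already been phased, that alleles are coded 0/1, and that the best reference haplotype is simply the one with the fewest mismatches at the typed loci. This is not the Hidden Markov Model used by programmes such as MaCH or IMPUTE; the function name and data are hypothetical.

```python
def impute_haplotype(study_hap, reference_haps):
    """Toy nearest-haplotype imputation (illustration only).

    study_hap: list of alleles (0/1) at typed loci, None at untyped loci.
    reference_haps: list of fully typed reference haplotypes.
    """
    def mismatches(ref):
        # Count disagreements at loci that were typed in the study sample.
        return sum(1 for s, r in zip(study_hap, ref)
                   if s is not None and s != r)

    best = min(reference_haps, key=mismatches)
    # Keep observed alleles; copy the reference allele where untyped.
    return [r if s is None else s for s, r in zip(study_hap, best)]


reference = [
    [0, 0, 1, 1, 0],
    [1, 1, 0, 0, 1],
    [0, 1, 1, 0, 0],
]
study = [1, None, 0, None, 1]   # loci 2 and 4 were not genotyped
imputed = impute_haplotype(study, reference)
```

A real implementation would allow the study haplotype to be a mosaic of several reference haplotypes and would return genotype probabilities rather than a single copy.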
There are several different programmes available that will perform the imputation including,
but not limited to, Markov Chain Haplotyping software (MaCH) [53], IMPUTE [52], fastPHASE
[54], PLINK [55] and Beagle [56]. Each programme uses slightly different procedures to search
for shared haplotype blocks, and hence their computational efficiency differs. For example,
IMPUTE relies on recombination rates generated by the HapMap Consortium and assumes a
uniform mutation/error rate for all markers, whereas MaCH estimates recombination rates
within each dataset and allows mutation rates to vary. Li et al [53] illustrated that MaCH and
IMPUTE perform similarly, while Biernacka et al [57] and Pei et al [58] showed that these two
programmes both outperform fastPHASE, PLINK and Beagle. Both MaCH and IMPUTE are
based on a Hidden Markov Model and implement variants of the ‘product of approximate
conditionals’ model [59].
Figure 1.2: A schematic of the process by which genetic data is imputed using haplotype
inference
The success of genotype imputation is dependent on the following two steps:
Step 1: Pre-processing genotype data before imputation
Before imputing any genotype data, it is imperative that the genotyped data is cleaned;
otherwise there will be inaccuracies in the haplotype estimation. Typical quality control (QC)
measures are applied first to the individuals and then to the SNPs. Individuals are often filtered
out based on heterozygosity, call rate (both may be due to poor quality DNA), cryptic
relatedness (unknown relationship with another individual in the study sample) or population
structure (removing individuals who appear to have descended from a population of different
ethnic background than the rest of the study sample). QC on the SNPs includes call rate, Hardy-
Weinberg outliers (i.e. SNPs not in HWE) and minor allele frequency (MAF). Each study applies
thresholds that it deems appropriate; however, it is also common to use the thresholds
detailed in the original Wellcome Trust Case Control Consortium GWAS [46], whereby SNPs are
excluded if:
- The call rate is less than 95%
- The HWE P-Value is less than 5.7 × 10⁻⁷
- The MAF is less than 0.01 (1%)
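The three SNP exclusion criteria above can be sketched as a simple filter on genotype counts. The sketch assumes a one degree of freedom chi-square goodness-of-fit test for HWE, which is a simplification adopted here for brevity (an exact test is often preferred in practice); the function names and the input format are hypothetical.

```python
import math


def hwe_chisq_pvalue(n_aa, n_ab, n_bb):
    """Chi-square (1 df) goodness-of-fit p-value for Hardy-Weinberg."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chisq = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 df.
    return math.erfc(math.sqrt(chisq / 2.0))


def passes_snp_qc(n_aa, n_ab, n_bb, n_missing,
                  min_call_rate=0.95, min_hwe_p=5.7e-7, min_maf=0.01):
    """Return True if the SNP survives the call rate, HWE and MAF filters."""
    n_called = n_aa + n_ab + n_bb
    if n_called == 0:
        return False
    call_rate = n_called / (n_called + n_missing)
    freq_a = (2 * n_aa + n_ab) / (2 * n_called)
    maf = min(freq_a, 1.0 - freq_a)
    if call_rate < min_call_rate or maf < min_maf:
        return False
    return hwe_chisq_pvalue(n_aa, n_ab, n_bb) >= min_hwe_p
```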
The final data cleaning step is to ensure that the annotation of the SNPs on the chip is the
same as that of the chosen reference panel and that the SNPs are all on the same strand. It is common
to align SNPs to build 36 of the human genome and to the forward (+) strand. The alignment
step to a particular strand is crucially important to avoid imputation errors, especially for SNPs
with complement alleles (i.e. A/T and C/G SNPs), and to allow ease of meta-analysis with other
studies.
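The reason that A/T and C/G SNPs need particular care can be shown with a short sketch: for these SNPs, the alleles read the same on either strand, so a strand flip cannot be detected from the allele labels alone. The helper names below are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_strand_ambiguous(allele1, allele2):
    """True for A/T and C/G SNPs, whose alleles are complements."""
    return COMPLEMENT[allele1.upper()] == allele2.upper()


def flip_strand(allele1, allele2):
    """Report a SNP's alleles on the opposite strand."""
    return COMPLEMENT[allele1.upper()], COMPLEMENT[allele2.upper()]


# An A/G SNP becomes T/C when flipped, so a mismatch with the reference
# panel is visible; an A/T SNP becomes T/A, which looks unchanged.
flipped = flip_strand("A", "G")
ambiguous = is_strand_ambiguous("A", "T")
```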
Step 2: Choosing an appropriate reference panel
To date, the HapMap Consortium database [60] has typically been used as the reference panel.
However, greater numbers of individuals and SNPs are being genotyped in a wider range of
ethnic populations for other databases, such as the 1000 Genomes Project
(http://www.1000genomes.org/home), and these are becoming increasingly popular
reference panels.
The International HapMap project was developed to describe the common patterns of
sequence variation in the human genome; in other words, to develop a haplotype map (hence
HapMap) of the human genome. The data from Phase 2 of the project is freely available in the
public domain and includes variants at 3.1 million loci in multiple ethnic groups, including
African (YRI; samples from 30 trios from the Yoruba people of Ibadan, Nigeria), Asian (CHB; 45
samples from Beijing and JPT; 45 samples from the Tokyo area) and European (CEU; samples
from 30 trios from residents of the United States of America with northern and western
European ancestry) [61]. Two larger databases are currently available:
1. HapMap Phase 3 [62] with 1.6 million common SNPs in 1,184 individuals from 11
global populations: YRI, CHB, JPT and CEU samples from the previous release in
addition to samples from individuals of African ancestry in the south-western USA
[ASW], Chinese in metropolitan Denver, Colorado, USA [CHD], Gujarati Indians in
Houston, Texas, USA [GIH], Luhya in Webuye, Kenya [LWK], Maasai in Kinyawa, Kenya
[MKK], Mexican ancestry in Los Angeles, California, USA [MXL] and Tuscans in Italy
[TSI].
2. 1000 Genomes Project has pilot data of approximately 15 million SNPs in 742
individuals [63] and a main phase 1 with approximately 28 million SNPs in 1,092
individuals. Both datasets include individuals from 14 locations of five major ethnic
origins including European, East Asian, West African, Americas and South Asian.
For studies of European ancestry, in the majority of cases it is clear that the HapMap CEU
samples are an appropriate reference panel. However, for populations of mixed ancestry or
from areas not included in the HapMap Consortium, such as the Middle East, it is a little more
difficult to decide which panel to use as a reference. Several authors have suggested that
‘masking’ a set of genotypes and imputing them using the different reference panels will
identify a strategy that provides the most accurate genotype imputation [64,65]. Other
methods include using principal components analysis to determine which reference panel is
closest to the individuals in the study [66] or using a ‘cosmopolitan’ panel combining all
reference panels together [53,65,67,68]. For all reference panels, it is recommended that
release 22 is used for imputation [49], as the National Centre for Biotechnology Information
(NCBI) build 36 has become the standard for the human genome assembly.
Genotype imputation is commonly used in association analyses on the genome-wide scale as it
increases the power of a GWAS [52,53,68,69] and furthers the discovery of novel genotype-
phenotype associations through the meta-analysis of multiple studies genotyped with different
sets of genetic variants [49,70]. Commercial genotyping platforms differ in the SNPs that are
included and they are continually being updated as new SNPs are being discovered. Therefore
imputation serves as a crucial bridge when merging distinct studies genotyped on different
platforms, combining different versions of the same platform or adopting a new platform
during the course of a study. By imputing GWAS data, each study has a common set of SNPs to
contribute to a meta-analysis. Such meta-analyses are required to achieve the large sample
sizes needed to discover modest genetic effects.
Given the computational time already required to conduct the complex analyses in this thesis,
the HapMap Phase 2 CEU data will be used as a reference panel, with approximately 2.5
million SNPs. The CEU population was chosen as all of the cohorts investigated in this thesis
were of European descent and principal components analysis showed that the studies closely
clustered with the CEU population.
1.4.3 Association Analysis
Once the study dataset has been imputed, an association analysis similar to that outlined in
Section 1.3.5.2 is conducted. For each imputed SNP in a given sample, three values are
calculated:
1. Posterior probability for each of the three possible genotypes (AA, Aa and aa),
2. Allelic dosage, which is the expected number of copies of a given allele, ranging from 0
to 2,
3. ‘Best guess’ genotype, which is the genotype with the highest posterior probability.
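The relationship between these three values can be made concrete with a short sketch. It assumes that the dosage counts copies of the a allele, so that dosage = P(Aa) + 2 × P(aa); the function name is hypothetical and the choice of counted allele is an assumption.

```python
def dosage_and_best_guess(p_AA, p_Aa, p_aa):
    """Derive the allelic dosage and 'best guess' genotype for one SNP.

    The dosage is the expected number of copies of allele a (0 to 2).
    """
    total = p_AA + p_Aa + p_aa
    if abs(total - 1.0) > 1e-6:
        raise ValueError("posterior probabilities must sum to one")
    dosage = p_Aa + 2.0 * p_aa
    probabilities = {"AA": p_AA, "Aa": p_Aa, "aa": p_aa}
    best_guess = max(probabilities, key=probabilities.get)
    return dosage, best_guess


# Hypothetical posterior probabilities for one individual at one SNP.
dosage, best_guess = dosage_and_best_guess(0.1, 0.7, 0.2)
```

In this hypothetical example the best guess genotype is Aa, while the dosage of about 1.1 retains the information that aa is more probable than AA.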
The majority of association analyses use the dosages from the imputed data, which are treated
as a continuous variable in the regression model. Software has been developed specifically to
conduct large scale association analyses in a time efficient manner, such as PLINK [55], MaCH
[53,64] and SNPTEST [71]. These commonly used programmes conduct linear (for a trait
outcome) or logistic (for a disease outcome) regression models, allowing for adjustment of any
number of covariates.
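A minimal sketch of the per-SNP linear regression of a quantitative trait on allelic dosage is given below. For brevity it assumes a single predictor with no covariate adjustment and uses the closed-form ordinary least squares estimates; it is not the implementation used by any of the programmes named above, and the function name is hypothetical.

```python
import math


def dosage_regression(dosages, trait):
    """Simple linear regression of a trait on allelic dosage.

    Returns the estimated per-allele effect (slope) and its standard error.
    """
    n = len(dosages)
    mean_x = sum(dosages) / n
    mean_y = sum(trait) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, trait))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    rss = sum((y - alpha - beta * x) ** 2 for x, y in zip(dosages, trait))
    se = math.sqrt(rss / (n - 2) / sxx)
    return beta, se


# Hypothetical dosages and trait values for five individuals.
beta, se = dosage_regression([0.0, 1.0, 2.0, 0.5, 1.5],
                             [2.1, 2.4, 3.1, 2.2, 2.8])
```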
One important aspect of association analysis using imputed data is to take into account the
uncertainty of each imputed SNP. There is no consensus on the most appropriate way of
accounting for the uncertainty; however, many studies simply remove SNPs that have poor
imputation quality. One measure of imputation quality is the ratio of the empirically observed
variance of the allele dosage to the expected binomial variance 2p(1 - p) at Hardy-Weinberg
Equilibrium, where p is the observed allele frequency from HapMap. This is the metric that
both MaCH (RSQR_HAT) [53] and PLINK (INFO) [55] generate. When a SNP has been imputed
accurately, this ratio will be close to one, indicating that the observed variance is close to the
expected variance. However, as the observed variance decreases, this ratio tends towards
zero, indicating more uncertainty in the imputation. A commonly used threshold in meta-
analyses of common diseases/traits is to exclude SNPs with a MaCH RSQR_HAT < 0.3 [72,73].
SNPTEST calculates a similar measure that reflects the effective sample size (or power) for the
genetic effect being estimated, PROPER_INFO, with values < 0.4 commonly excluded from
meta-analyses. Barrett [74] found that different genotyping platforms have different success
rates in terms of accurately imputing additional loci; the Illumina HumanHap550-Duo chip had
better accuracy (87% of imputed loci had r² > 0.9) than the Affymetrix 500K chip (60% of
imputed loci had r² > 0.9).
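The variance-ratio measure of imputation quality can be sketched as follows. The sketch assumes that dosages are on the 0 to 2 scale, so that the expected variance at Hardy-Weinberg Equilibrium is 2p(1 - p), and that p is estimated from the mean dosage in the study sample rather than taken from a reference panel. Both are assumptions of this illustration, and the exact formulas used by individual programmes may differ.

```python
def imputation_quality_ratio(dosages):
    """Ratio of observed dosage variance to its expectation under HWE."""
    n = len(dosages)
    mean_dosage = sum(dosages) / n
    p = mean_dosage / 2.0                      # estimated allele frequency
    expected_var = 2.0 * p * (1.0 - p)
    if expected_var == 0.0:
        return 0.0                             # monomorphic: no information
    observed_var = sum((d - mean_dosage) ** 2 for d in dosages) / n
    return observed_var / expected_var


# Well-imputed SNP: dosages close to 0, 1 or 2.
ratio_good = imputation_quality_ratio([0.0, 1.0, 1.0, 2.0])
# Poorly imputed SNP: dosages shrunk towards the mean.
ratio_poor = imputation_quality_ratio([0.5, 1.0, 1.0, 1.5])
```

A SNP such as the second, with a ratio of 0.25, would fall below the threshold of 0.3 and be excluded.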
Another important adjustment in the association analysis is to account for any population
substructure in the study sample. Principal component analysis, which can be conducted
simply in EIGENSTRAT [39], is a powerful tool to adjust for population stratification. The
eigenvectors generated through EIGENSTRAT can be used either to exclude outliers or
as covariates in the association analysis to describe the variation along the first few
axes.
The final step of the association analysis is to check the distribution of the test statistic. This
can be done in two ways:
1. Calculating a genomic inflation factor, λ [40]. This is the ratio of the median of the
empirically observed distribution of the test statistic to the expected median. The λ
quantifies the extent of the excess false positive rate, with values close to one
indicating no inflation and values greater than one indicating increasing levels of false
positives.
2. Plotting a quantile-quantile (Q-Q) plot. This is a useful visual tool to mark deviations of
the observed distribution from the expected null distribution. It is advised that Q-Q
plots are derived for genotyped SNPs separately from imputed SNPs as they have
different distributional properties [49].
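The genomic inflation factor can be computed in a few lines. The sketch assumes that each SNP's test statistic follows a chi-square distribution with one degree of freedom under the null hypothesis, for which the expected median is approximately 0.4549; the function names are hypothetical.

```python
import statistics

# Median of a chi-square distribution with one degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation_factor(chisq_stats):
    """Lambda: observed median test statistic over the expected median."""
    return statistics.median(chisq_stats) / CHI2_1DF_MEDIAN


def genomic_control(chisq_stats):
    """Divide each statistic by lambda; no correction when lambda < 1."""
    lam = max(1.0, genomic_inflation_factor(chisq_stats))
    return [s / lam for s in chisq_stats]


# Hypothetical test statistics with a median twice the expected value.
stats = [0.2, 0.9098728, 5.0]
lam = genomic_inflation_factor(stats)
corrected = genomic_control(stats)
```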
These measures of inflation may point to undetected sample duplications, unknown familial
relationships, a poorly calibrated test statistic, systematic technical bias or uncorrected
population stratification. If inflation is detected, it should be dealt with prior to presenting
results or meta-analysis with any additional studies.
1.4.4 Replication
It is important that any findings from a GWAS are confirmed in an independent study. Two
commonly used words to describe this confirmation are “replication” and “validation”,
which are often used interchangeably in genetic association studies. Igl et al [75] acknowledge that
these are two different concepts and have defined both terms as follows:
1. “Replication: both original and confirmation sample are drawn from the same
population, and systematic differences are reduced to a minimum.”
2. “Validation: the confirmation sample stems from a population which is different than
that from which the original sample was drawn.”
Chanock et al have also defined criteria for establishing “replication” and “validation” [76].
Their specific criteria for establishing replication included 1) utilizing a similar population to the
discovery population, and 2) using the same study design and analysis techniques. In contrast
to Igl et al [75], Chanock et al believe that “validation” of a finding should be in a population
which may be from a different ethnic background, have a different phenotype definition,
recruitment or sampling strategy or the time point under investigation may be different from
the discovery population [76]. Therefore, if a genetic association discovered through a GWAS is
‘validated’, then it is more generalizable than if the association is only ‘replicated’. Most
GWASs to date have included replication of their findings; however, only a few extend their
findings to additional ethnic backgrounds or to other populations and therefore rely on future
studies to validate their findings. This is partly due to the fact that the discovery study of a
GWAS often has the aim of identifying regions of the genome that are of interest, rather than
pin-pointing the causal locus and estimating its effect [77].
1.4.5 GWASs of Longitudinal Quantitative Traits
Statistical geneticists are beginning to develop methods for analysis of longitudinal data in
genetic studies. The Genetic Analysis Workshop (GAW; http://www.gaworkshop.org/) is
an example of such an initiative. According to their website, GAW is “a collaborative effort
among researchers worldwide to evaluate and compare statistical genetic methods and
relevant current analytical problems in genetic epidemiology and statistical genetics”. Three
of their 18 workshops were dedicated to longitudinal data analysis: 1) GAW 13 focused on
longitudinal analysis for microsatellite genome-wide data [78,79], 2) GAW 16, problems 2
and 3 used real longitudinal data from the Framingham Heart Study and simulated data
respectively [80], and 3) GAW 18 looked at longitudinal data for sequencing studies, although
none of this research from GAW 18 is published yet.
The methods developed in GAW 13 dealt with genetic linkage analysis, although they would be
applicable to genetic association studies in longitudinal family study designs [81]. The methods
that would be the most computationally efficient for genome-wide association studies used a
two-stage approach to conduct the genome-wide linkage analysis in the Framingham Heart
Study; in the first stage, the longitudinal data was reduced to an intercept and slope estimate
for each individual, which were used as outcomes for the genetic analysis [82,83,84]. In
addition to these methods from GAW 13, there were several methods described in Kerner et al
for the groups at GAW 16 that used the longitudinal data [85]. The approaches included
growth mixture modelling [86,87], linear mixed effects modelling [88], multivariate linear
growth modelling [89] and multivariate adaptive splines for the analysis of longitudinal data
[90]. Although these methods are all relevant for genetic association studies, they were only
applied to subsets of the genome-wide genetic data provided at GAW 16. The two-stage
approaches are promising as they are computationally efficient; however, they have only been
investigated using phenotypes that have a linear trajectory over time.
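The first stage of a two-stage approach can be sketched as follows. The sketch assumes that each individual's repeated measures are summarised by a separate ordinary least squares line, which is one simple way of obtaining an intercept and slope; the cited studies may have derived these estimates differently (for example, from a mixed model), and the function names and data are hypothetical.

```python
def individual_intercept_slope(ages, values):
    """Least squares intercept and slope for one individual's measures."""
    n = len(ages)
    mean_t = sum(ages) / n
    mean_y = sum(values) / n
    stt = sum((t - mean_t) ** 2 for t in ages)
    slope = sum((t - mean_t) * (y - mean_y)
                for t, y in zip(ages, values)) / stt
    intercept = mean_y - slope * mean_t
    return intercept, slope


def stage_one(measurements):
    """Reduce longitudinal data to one (intercept, slope) per individual.

    measurements: dict mapping an individual's ID to (age, value) pairs.
    """
    return {pid: individual_intercept_slope(*zip(*obs))
            for pid, obs in measurements.items()}


# Hypothetical repeated measures for two individuals.
summaries = stage_one({
    "id1": [(1, 16.0), (2, 17.0), (3, 18.0)],
    "id2": [(1, 15.0), (3, 15.5), (5, 16.5), (7, 16.8)],
})
```

In the second stage, the intercepts and slopes would each be used as the outcome in a standard cross-sectional association analysis. As noted above, a straight-line summary is only appropriate for phenotypes with an approximately linear trajectory.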
Recently, Furlotte et al described a mixed effects model for GWAS analysis and calculated the
genetic, environmental, and residual error contributions to the phenotype [91]. Their method
(for which they have written a software package) incorporates a random effect to adjust for
any population stratification or cryptic relatedness in the sample. Their simulation study
includes an average genetic effect in their model; however, the genetic effect over time was
not investigated. Fan et al present a non-parametric mixed effects model using splines, and
show it has lower type 1 error, higher power and lower bias in comparison to a parametric
model [92]. However, they do not investigate the effect of missing data, which is common in
large, longitudinal cohort studies.
Although there have now been many GWASs of cross-sectional phenotypes in childhood and
geneticists are beginning to investigate GWASs of longitudinal traits, to my knowledge, no
GWAS of a longitudinal childhood trait has been published in the literature. In addition, the
longitudinal methods described for GWASs to date make relatively simplistic assumptions for
the trait studied, ignoring complex mechanisms such as unbalanced study designs, non-linear
trajectories, high correlation between the intercept and trajectory terms and non-normal,
correlated residuals. The research in this thesis aims to address some of these current
limitations by proposing and evaluating a modelling framework for analysing complex,
longitudinal childhood phenotypes in unrelated individuals, particularly in the context of GWAS
analysis.
1.5 Obesity and Body Mass Index
Obesity is a medical condition, whereby excess body fat has accumulated to the extent that it
may have an adverse effect on health, leading to reduced life expectancy and/or increased
health problems [93]. Obesity is a major global public health problem. In 2010, there were at
least 42 million overweight children under the age of five years and one billion overweight
adults globally [94]. The World Health Organisation considers Australia to have one of the
world’s highest rates of obesity, with 25% of children aged 5-17 years and 62% of adults
classified as overweight or obese in 2007-8 [95]. Childhood obesity is associated with poor
mental [96,97,98,99] and physical health [100,101] and is one of the strongest predictors of
adult obesity [102,103]. In turn, adult obesity increases the risk of many diseases including
coronary heart disease, the metabolic syndrome, some cancers, stroke, liver and gallbladder
disease, sleep apnoea and respiratory problems, osteoarthritis and gynaecological problems
[94]. Since the 1990s, the prevalence of obesity has trebled [104] which has led to an earlier
onset of related adverse health outcomes. There is recent evidence that the rate of obesity
among Australian children is plateauing after a dramatic acceleration over the preceding
decade [105]. The observed plateau, if real, could be a result of increased physical activity and
nutrition initiatives that have been introduced into Australian schools and communities in
recent years. The plateauing in the prevalence of obesity in children has been observed in
other developed, high-income countries, such as New Zealand [106], the US [107], Sweden [108]
and France [109]; however, the prevalence in the developing world is still increasing. Although
the prevalence of obesity appears to be stabilizing, the incidence of obesity is still higher than
desirable, particularly in Australia.
Obesity is a multifactorial condition with many biological, genetic, social and environmental
influences affecting its development [93]. There are monogenic forms (stemming from a single
dysfunctional gene and very rare in the general population) [110] and syndromic forms of
obesity (distinguished by co-occurrence with mental retardation, dysmorphic features or
organ specific developmental abnormalities) [111]. The focus of this thesis is on the common,
multifactorial form of obesity.
Body mass index (BMI) is a measure relating weight to height that is associated with
body fat, nutritional status and health risk. It is calculated as weight, measured in kilograms,
divided by the square of height, measured in metres. In epidemiology, BMI is the most commonly
used quantitative measure of adiposity [112]. Cut-off points have been defined for both child
[113] and adult [114] obesity; they are defined by the point where an increased risk of disease
is observable due to high BMI.
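The calculation described above can be written as a formula. The symbols w (weight in kilograms) and h (height in metres) are introduced here purely for illustration:

```latex
\[
\mathrm{BMI} = \frac{w}{h^{2}} \quad \left[\mathrm{kg/m^{2}}\right]
\]
```

For example, a hypothetical child weighing 20 kg with a height of 1.10 m would have a BMI of 20 / 1.21, or approximately 16.5 kg/m².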
1.5.1 Life Course Approach to Obesity
Obesity does not develop instantaneously; it is a developmental process by which an
individual’s BMI increases over a period of time. Therefore, to understand more about the
increasing incidence of obesity, it is important to understand the developmental process that
precedes the diagnosis of the condition. By utilizing the comprehensive data collected as part
of longitudinal birth cohorts in a growth trajectory modelling framework, we can begin to
understand this developmental process.
BMI growth trajectories are difficult to model statistically due to the complexities of growth
over childhood (Figure 1.3). Children tend to have rapidly increasing BMI from birth to
approximately nine months of age when they reach their adiposity peak. BMI then tends to
decrease until about the age of five or six years at the adiposity rebound and then it steadily
increases again until just after puberty when it tends to plateau through adulthood.
Figure 1.3: Schema of BMI trajectory over childhood and adolescence. The green arrow
indicates the period around the adiposity peak (9 months of age); the red arrow indicates the
period around the adiposity rebound (5-6 years of age).
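The two turning points of the trajectory described above can be located numerically once a smooth BMI curve is available. The following is a minimal sketch only, assuming the curve has already been evaluated on a fine, increasing grid of ages (for example, fitted values from a growth model); the function name and its return format are hypothetical and are not taken from the methods used in this thesis.

```python
import numpy as np


def find_adiposity_peak_and_rebound(ages, bmi):
    """Locate the first local maximum (adiposity peak) and the next local
    minimum (adiposity rebound) of a smooth BMI curve.

    `ages` must be sorted in increasing order and `bmi` holds the curve
    evaluated at those ages (for example, fitted values on a fine grid).
    Returns ((peak_age, peak_bmi), (rebound_age, rebound_bmi)), or None if
    either turning point is absent from the supplied age range.
    """
    ages = np.asarray(ages, dtype=float)
    bmi = np.asarray(bmi, dtype=float)

    # Sign of the slope between consecutive grid points (+1 rising, -1 falling).
    slope_sign = np.sign(np.diff(bmi))
    # A drop in slope sign marks a local maximum; a rise marks a local minimum.
    turning = np.diff(slope_sign)
    maxima = np.where(turning < 0)[0] + 1
    minima = np.where(turning > 0)[0] + 1

    if maxima.size == 0:
        return None
    peak = maxima[0]

    # The rebound is taken as the first local minimum after the peak.
    later_minima = minima[minima > peak]
    if later_minima.size == 0:
        return None
    rebound = later_minima[0]

    return (
        (float(ages[peak]), float(bmi[peak])),
        (float(ages[rebound]), float(bmi[rebound])),
    )
```

This sketch works on a smoothed or fitted curve rather than raw measurements, because measurement noise in observed BMI would otherwise create spurious turning points.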
The World Health Organization recently conducted research into the statistical methods
previously used to construct growth curves over childhood. As part of their process of defining
international growth standards for pre-school children, they examined as many as 30
previously published methods [115]. Most of these methods were designed for cross-sectional
cohorts where each child is measured once between certain predefined ages. Only eight
methods allowed for the partitioning of variance into between and within subject variability
required for longitudinally designed studies, where each child is measured multiple times
across childhood.
Intervention programs have traditionally targeted the individual at the onset of disease-
precursors; however, the principles underlying life course epidemiology and DOHaD suggest
that there may be sensitive periods earlier in the developmental continuum, including
pregnancy, infancy and childhood, that may offer greater opportunities for obesity prevention
[116,117,118]. By extending our current statistical methodologies for growth trajectory
modelling to enable the detection of small differences between individuals due to genetic,
environmental or lifestyle determinants, we will potentially be able to identify individuals at
higher risk of developing obesity early in the process. Intervention programmes can be
developed to specifically target these individuals when the programmes are likely to be most
cost effective and beneficial. To have the greatest impact through intervention programmes,
we need to determine which indicators are associated with different patterns of growth and
then assess how those growth patterns may predict different disease risks. Several indicators
of critical periods for adiposity development have been identified, in addition to the overall
growth trajectory from infancy to adolescence. These include the age and BMI at the adiposity
peak, the age and BMI at the adiposity rebound and various markers of puberty. Current
knowledge regarding each of these indicators is outlined below.
1.5.1.1 Infancy Growth and the Adiposity Peak
Infancy is a period of rapid growth due to the underlying biological changes that are occurring
and therefore has been suggested as a sensitive period for the development of obesity
[118,119]. Two large reviews concluded that there is an association between rapid infant
weight gain and increased risk of adult obesity [120,121]. However, the effects of rapid infant
weight gain need to be reviewed with caution, as they are often difficult to untangle from
intrauterine growth and post-infancy growth [122].
There are several factors thought to influence adult obesity through weight gain during
infancy, including breastfeeding, feeding frequency and amount of physical activity.
Breastfeeding is thought to reduce the risk of obesity in adulthood, and a systematic review by
Owen and colleagues found a mean difference of 0.04 BMI units between subjects who were
breastfed versus formula-fed across a wide range of ages [123]. Of note, the difference was
halved after adjustment for maternal BMI and smoking, and the association did not remain
statistically significant after adjustment for maternal socio-economic status. Horta et al [124]
studied the association between breastfeeding and adult obesity, and found a 22% reduction
in the odds of obesity for those who were breastfed. In addition to breastfeeding, there have
been a number of studies that focused on the frequency and amount of feeding over infancy
and found that many infants are overfed [125,126]. Parenting behaviours not only influence
feeding but also the amount of physical activity undertaken by the infant. For example, a study
by Zimmerman and colleagues [127] showed that by the age of two years, 90% of children
were watching television regularly with an average exposure of 1.5 hours per day. Overeating
and lack of physical exercise can be induced by parents in the first few years of life; once
established, these behaviours can program the infant for life.
The adiposity peak, at around nine months of age, may be a good marker of infant growth.
Silverwood et al [128] and Sovio et al [129] have published data that demonstrate that a
delayed adiposity peak is associated with increased BMI in late childhood and adulthood
respectively. However, beyond these studies, relatively little is known about the association of
the timing of the adiposity peak with disease later in life.
1.5.1.2 Adiposity Rebound
Research has identified the adiposity rebound as a sensitive period for the development of
adiposity that persists into adult life [118,119,130,131,132,133]. The adiposity rebound is a
period of rapid growth in body fat, due to the increase in both size and number of adipocytes
(cells that specialize in storing energy as fat) [131]. It has been shown that an early age of
adiposity rebound is not only associated with adult obesity but also with a greater risk of
diabetes [134,135] and hypertension [136].
There is a growing body of literature showing that the timing of the adiposity rebound,
specifically an early rebound, is associated with increased risk for later obesity
[129,131,132,133,137,138]. Rolland-Cachera and Péneau [137] illustrate that although an early
adiposity rebound is associated with obesity in later life, it is not associated with obesity in
early life. It is therefore thought that this may help to distinguish between two growth
patterns: 1) those individuals who have a high BMI at all ages, which reflects a high lean and
fat body mass, versus 2) those individuals who have a normal BMI followed by an early
adiposity rebound and consequently a higher BMI, reflecting increased fat rather than lean
mass [132,139,140]. Individuals following the first pattern of ‘consistently high BMI’ appear to
have relatively normal metabolic profiles, whereas individuals following the second pattern
seem to be at higher risk for coronary heart disease and insulin resistance [137]. In addition to
increased risk of adult obesity, Williams and Goulding [140] identified that the timing of the
adiposity rebound was also associated with both skeletal and physical maturation. It has also
been shown that BMI at the time of the adiposity rebound positively correlates with BMI in
adulthood [129].
Several factors have been reported to be associated with an early age of adiposity rebound
and a high BMI at the time of the rebound. These factors include parental obesity [141], low
levels of activity and high levels of television viewing [142], and low weight gain in the first year of
life and therefore low weight at year one [135]. Consistent with the increasing prevalence of
obesity, the age of adiposity rebound is decreasing over time [143].
As promising as the adiposity rebound may sound as a marker of later obesity, it is not without
its critics. Cole [144] uses five hypothetical individuals to show that the timing of the adiposity
rebound is determined by the individual’s BMI centile and their rate of centile crossing. He goes
on to show that a high centile and upward centile crossing are independently associated with
an early adiposity rebound and hence an increase in later obesity risk. As an example of the
“horse racing effect” [145], Cole provides evidence that centile crossing at any age is an
indicator of obesity risk, rather than the timing of the adiposity rebound.
In summary, like most complex traits, the adiposity rebound appears to be influenced by both
genetic and behavioural factors. It is therefore a useful marker when investigating the early life
determinants of obesity.
1.5.1.3 Puberty
The timing of menarche in girls has been shown to be associated with obesity in adulthood
[146,147,148]. Further, being overweight during childhood is also associated with early
menarche [149,150,151]. Taken together these data have generated great debate about the
direction and mechanism underpinning the association. Mumby et al [152] have recently used
a Mendelian randomization study to try to untangle the causal relationship and found that
childhood obesity was causing early menarche.
In males, the height growth spurt is often used as a marker of puberty. Peak height velocity, a
measure of the growth spurt, has been shown to be associated with the rate of genital
development in boys [153]. It has also been shown to be highly correlated (ρ=0.84) with the
age of menarche in girls, with estimates of menarche onset occurring about one year after the
onset of the growth spurt [154,155]. It is therefore common to use the growth spurt as a
marker of pubertal status, in both males and females.
Both age of menarche and the pubertal growth spurt are influenced by genetics, as well as
biological and environmental factors [156,157]. Rapid weight gain during adolescence is
expected due to the normal physiologic changes that are occurring during this period. There
are three key hormones that play an important role in the timing of puberty and are also
related to obesity risk: leptin, insulin and estrogen [158]. Levels of all three
hormones need to increase before puberty begins, accompanied by a relative
increase in fat mass. However, in overweight or obese adolescents, these
hormones are already elevated, which leads to the earlier onset of puberty.
While these biological changes are occurring, adolescents are also changing their social
behaviours, often to assert their independence. For some individuals, these can include more
time spent in sedentary activities and less physical activity [159], developing a ‘calorie-rich,
nutrient poor’ diet [160], unhealthy eating behaviours such as skipping meals [161], binge-
eating or using laxatives/diuretics [162] and increased levels of stress leading to depression or
anxiety [163,164]. As outlined, puberty is another sensitive period where the development of
obesity may begin or continue to progress due to the complex interactions between genetics,
the environment and the biological process underlying the period. Therefore the presence of
high levels of pubertal hormones and the development of negative social behaviours may
contribute to increased obesity in adolescents.
1.5.2 Genetics of BMI
Heritability is the proportion of variability in a quantitative phenotype that is due to genetic
factors. It is a commonly used measure to identify whether a trait has a genetic component
and therefore should be used in genetic studies. Twin and family studies have indicated that
the heritability of obesity and body weight/BMI is estimated to be between 40 and 80%
[165,166,167]; however, heritability estimates appear to be age dependent, with higher
estimates in younger individuals [168]. A recent meta-analysis of 88 twin
studies showed that heritability of BMI in children was on average 0.07 higher than in adults
[169]. Until recently, the genes identified to regulate weight have largely been rare mutations
that cause severe monogenic forms of obesity [170]. The latest human obesity gene map,
published in 2005, noted that 127 candidate genes had been identified through association
analyses for obesity related phenotypes; however, only 22 of those had been replicated in
more than five studies [171]. This, along with several other review articles [172], indicates that
although linkage and candidate gene association studies had identified a few genes that are
implicated in obesity risk or increased BMI, we were still a long way from uncovering the
genetic profile of an obese individual.
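The definition of heritability given at the start of this section can be expressed as a ratio of variances. The notation below is introduced here for illustration and assumes that the phenotypic variance decomposes additively into a genetic and an environmental component, with no covariance or interaction between them:

```latex
\[
H^{2} = \frac{\sigma^{2}_{G}}{\sigma^{2}_{P}},
\qquad
\sigma^{2}_{P} = \sigma^{2}_{G} + \sigma^{2}_{E}
\]
```

Here \sigma^{2}_{G} denotes the variance due to genetic factors, \sigma^{2}_{E} the variance due to all other factors and \sigma^{2}_{P} the total phenotypic variance. Under this decomposition, a heritability of 0.6 would mean that 60% of the variability in the phenotype is attributed to genetic factors.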
Since the advent of GWASs, common variants within several genes have been found to be
associated with adult obesity and population variation in BMI. These include the well-
replicated fat-mass and obesity associated (FTO) and the melanocortin 4 receptor (MC4R)
genes. Common variants within these two genes are associated with modest effects on BMI
(0.2-0.4 kg/m² per allele), which translates to odds ratios for obesity of 1.1-1.3 in adults
[173,174,175,176,177]. In 2009, two genome-wide studies on BMI and measures of obesity
were published by large consortia which discovered further genetic variants that were
associated with BMI and risk of obesity [178,179]. An additional 10 genomic regions have been
identified through these analyses; however, they still require further consistent replication in
alternate cohorts. These regions are spread across the genome and include TMEM18, GNPDA2,
SH2B1, MTCH2, KCTD15, NEGR1, SEC16B, and near SFRS10, LGR4 and BCDIN3D. There are
other studies that replicate these findings in additional genome-wide analyses [180,181,182].
The largest genome-wide meta-analysis of BMI published to date included 249,796 individuals
from the Genetic Investigation of Anthropometric Traits (GIANT) Consortium, which confirmed
14 previously-reported loci and identified 18 novel loci for BMI [72].
Many of the genes that have been identified to date are not only involved in the regulation of
body weight but also in determining an individual’s response to environmental factors, such as
diet and exercise. Of the common variants identified, many act in the central nervous system
and are believed to influence eating behaviour and feeding regulation, which in turn affects
the amount of fat within the body [178,179]. This highlights the fact that many of these genes
are influencing the regulation of food intake rather than controlling metabolism as first
thought. However, the genes identified, including both common and rare variants, only
account for a small fraction of the population variability in BMI and weight, leaving much of
the estimated heritability unexplained. Although at least 32 loci have been replicated to date,
these regions only account for approximately 1.45% of the variability in adult BMI [72]. In
order to uncover more of the heritability of obesity, current statistical methodology will need
to be extended to enable the discovery of additional genetic factors, including investigating
growth rates over childhood (rather than cross-sectional population variation in size at a
particular age, often in adulthood).
To date, there has only been one study investigating the association between genetic variants
and childhood obesity on a genome-wide scale [183] and none focusing on BMI as a
continuous trait. The childhood obesity study identified two new genetic regions, one near the
olfactomedin 4 (OLFM4) gene and the other in the homeobox B5 (HOXB5) gene, which were
shown to have the same direction of effect as the meta-analyses of adult BMI. Given the
heritability estimates for BMI are higher in children [168], childhood is an important period for
genetic association analyses with the potential of uncovering additional genetic loci of interest.
1.6 Birth Cohorts used in this Thesis
Data from three cohorts are used to investigate the genes associated with childhood growth
trajectories in this thesis. The three cohorts were from Western Australia, Bristol (United
Kingdom) and Northern Finland; each of the cohorts is described in detail below. Subsets of
the cohorts are used for different chapters of this thesis; these subsets are defined in the
relevant chapters.
1.6.1 The Western Australian Pregnancy Cohort (Raine) Study
1.6.1.1 Subjects
The Western Australian Pregnancy Cohort [184,185,186] (http://www.rainestudy.org.au/) is a
prospective pregnancy cohort where 2,900 mothers were recruited prior to 18-weeks’
gestation between 1989 and 1991. The study began as a randomised controlled trial to
investigate the effects of multiple ultrasounds on birth outcomes and foetal growth. Although
differences due to increased exposure to ultrasound during pregnancy were detected in foetal
growth [184], these differences did not influence postnatal growth. After birth, the
longitudinal aspect of the Study aimed to focus on the DOHaD hypothesis, where data was
collected to gain a better understanding of how events during pregnancy, infancy and
childhood affect later health and development. The Study is now referred to as ‘The Raine
Study’, as an acknowledgement of the generous funding from the Raine Medical Research
Foundation to create and continuously support the study. The Raine Medical Research
Foundation was set up by The University of Western Australia when Mrs Mary Raine, a
prominent figure in Perth in the early part of the twentieth century, bequeathed her property empire to
the University in 1957 to fund medical research into the early origins of health and disease. Of
the 2,969 potential births (some of which were multiple births) from the 2,900 mothers, 14
withdrew from the study at birth, 19 had miscarriages (nine in first trimester, 10 in second
trimester), 10 had in-utero growth restriction, 23 were lost at delivery, two were
lost-to-follow-up, 26 were stillbirths and seven pregnancies were terminated, leaving 2,868 children
for follow-up. Recruitment predominantly took place at Western Australia’s major perinatal
centre, King Edward Memorial Hospital, and nearby private practices. Participants have been
followed for the past 23 years, with physical examinations and questionnaires collected at
average ages of 1, 2, 3, 6, 8, 10, 14, 17, 18 and most recently 20 years. Figure 1.4 illustrates the
range of variables that were collected including anthropometric measurements, medical
histories, socio-demographic indicators, health, lifestyle and environmental outcomes.
The study was conducted with appropriate institutional ethics approval from the King Edward
Memorial Hospital and Princess Margaret Hospital for Children ethics boards, and written
informed consent was obtained from all mothers. The cohort has been shown to be
representative of the population presenting to the antenatal tertiary referral centre in
Western Australia [186].
Figure 1.4: The Raine Study schedule of assessments and broad measurements collected.
1.6.1.2 Measurements
The measurements used throughout this thesis are described below. Additional explanatory
variables are described in the relevant chapters.
Birth length was measured between 24 and 72 hours post birth using a Harpenden
Neonatometer to the nearest 0.1cm. Birth weight was measured in the hospital at birth.
Gestational age was based on the date of the last menstrual period unless there was
discordance with ultrasound biometry at the dating scan by greater than seven days; if there
was a discordance the gestational age was based on the dating scan [186].
Weight and height were measured at each follow-up by trained members of the research team
[187]; weight was measured using a Wedderburn Digital Chair Scale to the nearest 100 grams
with children dressed in their underclothes and height was measured to the nearest 0.1cm
with a Holtain Stadiometer. BMI was calculated from the weight and height measurements.
1.6.1.3 Genotyping
A DNA sample was collected at the 14 and 17 year follow-ups. The genome-wide data was
genotyped in two separate batches using the Illumina Human660W Quad Array at the Centre
for Applied Genomics, Toronto, Ontario, Canada. The first batch of genotyping was completed
on 1,259 Raine Study children (including 63 replicates and a plate control on each plate) and
the second on 334 children (including 18 replicates and a plate control on each plate). The
Illumina Human660W Quad Array includes 657,366 genetic variants, comprising approximately
560,000 SNPs and approximately 95,000 CNVs.
QC checks were performed for individuals and SNPs using PLINK [55]. The initial data cleaning
step was to ensure that there were no ‘batch effects’ between the two rounds of genotyping.
No clear difference was detected between the two batches of genotyping so the participants in
each batch were merged together for QC and imputation. Replicated samples with the lower
genotyping success rate and plate controls were excluded. Individuals were removed if they
had a gender mismatch between reported gender and that determined on the basis of X
chromosome data (N = 7), had a genotyping success rate lower than 97% (N = 16), had a low
level of heterozygosity (i.e. h<0.3; N = 4) or were related to other individuals at the level of
half-siblings or first cousins by IBD sharing (i.e. π > 0.1875; N = 68). In total, 1,494 individuals
passed QC and were available for analysis. SNPs were excluded if they deviated from HWE (i.e.
HWE P-Value < 5.7×10⁻⁷; 919 markers), their genotype call rate was less than 95% (97,718
markers; includes all the CNVs as they are not called with the SNP data) or their MAF was less
than 1% (119,246 markers). A total of 535,632 SNPs passed QC checks and were available for
analysis.
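The three SNP-level exclusion criteria described above can be sketched as a simple filter. This is an illustrative sketch only and is not the PLINK procedure used for the Raine Study data; the class and function names are hypothetical, the HWE P-Value is assumed to have been computed elsewhere (for example, by an exact test), and only the three thresholds are taken from the text.

```python
from dataclasses import dataclass


@dataclass
class SnpSummary:
    """Genotype counts for one biallelic SNP (hypothetical container)."""
    n_aa: int          # individuals homozygous for allele A
    n_ab: int          # heterozygous individuals
    n_bb: int          # individuals homozygous for allele B
    n_missing: int     # individuals with no genotype call
    hwe_p: float       # HWE P-Value, assumed to be supplied by an external test


def call_rate(snp: SnpSummary) -> float:
    """Proportion of individuals with a called genotype."""
    called = snp.n_aa + snp.n_ab + snp.n_bb
    total = called + snp.n_missing
    return called / total if total else 0.0


def minor_allele_frequency(snp: SnpSummary) -> float:
    """Frequency of the less common allele among called genotypes."""
    called = snp.n_aa + snp.n_ab + snp.n_bb
    if called == 0:
        return 0.0
    freq_b = (2 * snp.n_bb + snp.n_ab) / (2 * called)
    return min(freq_b, 1.0 - freq_b)


def passes_qc(snp: SnpSummary,
              min_call_rate: float = 0.95,
              min_maf: float = 0.01,
              min_hwe_p: float = 5.7e-7) -> bool:
    """Apply the three thresholds quoted in the text; a SNP must meet all of them."""
    return (call_rate(snp) >= min_call_rate
            and minor_allele_frequency(snp) >= min_maf
            and snp.hwe_p >= min_hwe_p)
```

A SNP failing any one criterion is excluded, so a single marker may be counted under more than one of the exclusion reasons.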
Imputation of un-typed or missing genotypes was performed using MaCH v1.0.16 [53,64] for
the 22 autosomes with the CEU samples from HapMap Phase 2 (Build 36, release 22) used as a
reference panel. After imputation, 2,543,887 SNPs (535,632 genotyped and 2,008,255
imputed) were available for analysis.
Some population stratification was detected among the Raine Study individuals due to
the sampling criterion for genotyping of having at least one parent of European descent, so a
principal components analysis was carried out in EIGENSTRAT [39] using a subset of 42,888 SNPs that
are not in LD with each other. Figure 1.5 and Figure 1.6 display the first two principal
components for the 1,494 individuals. These principal components were included in the
genetic association analysis throughout this thesis and will be discussed in further detail in the
following chapters.
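The idea behind using principal components to summarise population structure can be sketched as follows. This is a minimal illustration using a singular value decomposition in NumPy, not the EIGENSTRAT implementation used in this thesis; it assumes a complete genotype matrix coded as 0, 1 or 2 copies of an allele, restricted to SNPs that have already been pruned for LD, and the function name is hypothetical.

```python
import numpy as np


def top_principal_components(genotypes, n_components=2):
    """Return per-individual scores on the leading principal components.

    `genotypes` is an (individuals x SNPs) array of allele counts (0, 1, 2)
    with no missing values. Each SNP is centred on its mean and scaled by
    sqrt(p * (1 - p)), where p is the estimated allele frequency.
    """
    g = np.asarray(genotypes, dtype=float)

    # Estimated allele frequency per SNP; monomorphic SNPs carry no
    # information about structure and would cause division by zero.
    p = g.mean(axis=0) / 2.0
    keep = (p > 0.0) & (p < 1.0)
    g, p = g[:, keep], p[keep]

    # Centre and scale each SNP column.
    z = (g - 2.0 * p) / np.sqrt(p * (1.0 - p))

    # Left singular vectors scaled by singular values give the PC scores.
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    return u[:, :n_components] * s[:n_components]
```

The resulting scores could then be entered as covariates in an association model, which is the role the principal components play in the analyses described in this thesis.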
Figure 1.5: Principal components for population stratification in the Raine Study with the
HapMap populations superimposed, showing that the Raine Study individuals are
predominantly of European descent.
Figure 1.6: Principal components for population stratification for the 1,494 participants with
genome-wide data in the Raine Study
1.6.2 Avon Longitudinal Study of Parents and Children (ALSPAC)
1.6.2.1 Subjects
The Avon Longitudinal Study of Parents and Children (ALSPAC) is a prospective cohort study
from Bristol in the United Kingdom (UK) [188] (www.bristol.ac.uk/alspac). The study is known
to its participants as the “Children of the 90’s” study. Pregnant women resident in one of three
Bristol-based health districts with an expected delivery date between 1 April 1991 and 31
December 1992 were invited to participate. ALSPAC was part of the European Longitudinal
Study of Pregnancy and Childhood, which had initial seed funding from the World Health
Organisation Europe to pilot common methodology and questionnaires in the UK, Russia and
Greece. Subsequently, funding for ALSPAC was obtained from various other sources. From
birth to five years, information on the children was extracted from health visitor records.
These records form part of standard child care in the UK and there are up to four
measurements taken on average at six weeks, 10, 21, and 48 months of age. A random 10% of
the children were recruited into a “Children in Focus” subset of the cohort which involved
eight research clinic visits, held between the ages of four months and five years of age. At age
seven, additional eligible children in the Avon district were invited to participate, and all
ALSPAC children were subsequently followed up with annual research clinic visits that are ongoing.
Questionnaires were completed by the parents throughout infancy, childhood and
adolescence, gathering information on the child and both parents.
Ethical approval for the study was obtained from the ALSPAC Law and Ethics Committee and
the Local Research Ethics Committees.
1.6.2.2 Measurements
The measurements used throughout this thesis are described below. Additional explanatory
variables are described in the relevant chapters.
Birth length (crown-heel) was measured by ALSPAC staff who visited new-borns soon after
birth (median one day, range 1-14 days), using a Harpenden Neonatometer (Holtain Ltd). Birth
weight was extracted from medical records. Gestational age was obtained from obstetric
medical records, as recorded by health care professionals, who used data from the woman’s
reported last menstrual period, paediatric assessment at birth, obstetric assessment during the
antenatal period and ultrasound assessment. At the time this cohort was established, routine
early pregnancy dating scans were not conducted and it is likely that only a minority had
gestational age determined by ultrasound scan. From birth to five years, length and weight
measurements were extracted from the four health visitor records. For the “Children in Focus”
subset, length/height measurements are available from the research clinic visits. At these
clinics, crown-heel length for children aged 4 to 25 months was measured using a Harpenden
Neonatometer and from 25 months onwards standing height was measured using a Leicester
Height Measure; weight was measured using a Fereday 100kg combined scale (four month
clinic), a Soehnle scale or Seca scale model 724 (eight month clinic), Seca 724 or Seca 835 (12
month clinic), Seca 835 (18 months onwards). From age seven years upwards, all children were
invited to annual clinics, at which standing height was measured to the last complete
millimetre using the Harpenden Stadiometer and weight was measured to the nearest 0.1kg
using the Tanita Body Fat Analyser (Model TBF 305) [189]. In addition, parent-reported child
height and weight were also available from the questionnaires (27% of measures). BMI was
calculated from the weight and height measurements.
As outlined above, the growth data was collected using three measurement sources in
ALSPAC: clinic visits, measurements made during routine health care visits, and parental
reports in questionnaires. Whilst the measurements from routine health care visits have
previously been shown to be accurate in this cohort [190], parental reports tend to
overestimate children’s height and underestimate their weight [191]. Therefore, the
variability of BMI is greater in the questionnaire measures, which potentially has implications
for the genetic association analysis; this will be investigated in detail in Chapter Four of this
thesis.
1.6.2.3 Genotyping
ALSPAC children were genotyped using the Illumina HumanHap550 quad genome-wide SNP
genotyping platform by 23andMe, subcontracting the Wellcome Trust Sanger Institute,
Cambridge, UK and the Laboratory Corporation of America, Burlington, North Carolina, United
States of America.
Standard QC methods were performed in each sample separately; similar to the Raine Study
QC, SNPs were removed if the MAF was < 1%, the call rate was < 95% or the P-Value from an
exact test of HWE was < 5.7×10⁻⁷ [192,193]. Individual samples were excluded on the
basis of incorrect gender assignment, minimal or excessive heterozygosity, high levels of
missingness and cryptic relatedness (16% of genotyped individuals). Genotypic data were
subsequently imputed using MaCH v1.0.16 [64] for all 22 autosomes with the CEU samples
from HapMap Phase2 (Build 36, release 22) used as a reference panel.
No substantial population stratification was detected in ALSPAC based on the principal
components generated in the EIGENSTRAT software [39].
1.6.3 The Northern Finland Birth Cohort of 1966 (NFBC66)
1.6.3.1 Subjects
The Northern Finland Birth Cohort of 1966 is a prospective birth cohort from the region
covering the Provinces of Lapland and Oulu in Finland [194] (http://kelo.oulu.fi/NFBC/).
Mothers were invited to participate in the cohort if they had estimated delivery dates falling
between January 1st and December 31st 1966 and were followed from the 24th week of
gestation. The cohort has changed names over the years from the “North Finland premature
birth study” and “Development study of children in North Finland” to “The mother-child cohort
study of morbidity and mortality during childhood with the special purpose of preventing
mental and physical handicap” and “Cohort-66 Study”. The study included 12,231 live born and
stillborn infants with birth weight of 600 grams or more. Data collection began during
pregnancy, with additional data collected at birth, 0-1 years, 14 and 31 years in dedicated
research clinics. Register data on morbidity, mortality and socioeconomic factors are collected
from hospital records and official registers over the life course.
Informed consent for the use of the data including DNA was obtained from all subjects. The
study was approved by ethics committees in Oulu (Finland) and Oxford (UK) universities in
accordance with the Declaration of Helsinki.
1.6.3.2 Measurements
Height and weight measures were collected from communal child health clinics as part of
routine clinical care in Finland. Standardised national maternity and child health care systems
have been operating in Finland since the 1940s; therefore staff were trained to record birth
and later growth measurements with great accuracy. In infancy, weight was measured to the
nearest 10 grams and in childhood to the nearest 100 grams with the child dressed in
underclothes. Height was measured to the nearest millimetre using standard procedures.
1.6.3.3 Genotyping
Genome-wide data was genotyped using the Illumina HumanCNV-370DUO Analysis BeadChip
at the Broad Institute Biological Sample Repository (BSP), Boston, Massachusetts, United
States of America.
Standard QC methods were performed; SNPs were removed if the MAF was < 1%, the call rate
was < 95% or the HWE P-Value was < 1x10^-4 [73]. Individual samples were excluded if they had
> 5% genotype data missing, incorrect gender assignment, or were related to other individuals
at the level of half-siblings by IBD sharing (i.e. π > 0.1875; the individual with more missing
data in each pair was removed). Genotypic data were subsequently imputed using IMPUTE software
[52,195] for all 22 autosomes with the CEU samples from HapMap Phase 2 (Build 36,
release 22) used as a reference panel.
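The SNP-level filters described above can be sketched as a simple predicate. This is a minimal illustration only: the thresholds come from the text, but the function names, the dictionary layout and the example SNPs are hypothetical and do not reflect the software actually used for QC.

```python
# Thresholds as stated in the text; everything else here is illustrative.
MAF_MIN = 0.01        # remove SNPs with minor allele frequency < 1%
CALL_RATE_MIN = 0.95  # remove SNPs with call rate < 95%
HWE_P_MIN = 1e-4      # remove SNPs with HWE P-Value < 1x10^-4


def snp_passes_qc(maf, call_rate, hwe_p):
    """Return True if a SNP survives all three filters."""
    return maf >= MAF_MIN and call_rate >= CALL_RATE_MIN and hwe_p >= HWE_P_MIN


def filter_snps(snps):
    """Keep the identifiers of the SNPs that pass QC."""
    return [s["id"] for s in snps
            if snp_passes_qc(s["maf"], s["call_rate"], s["hwe_p"])]


# Hypothetical SNPs, one failing each criterion and one passing all of them.
example_snps = [
    {"id": "snp_ok", "maf": 0.20, "call_rate": 0.99, "hwe_p": 0.40},
    {"id": "snp_rare", "maf": 0.005, "call_rate": 0.99, "hwe_p": 0.40},
    {"id": "snp_low_call", "maf": 0.20, "call_rate": 0.90, "hwe_p": 0.40},
    {"id": "snp_hwe", "maf": 0.20, "call_rate": 0.99, "hwe_p": 1e-6},
]
kept = filter_snps(example_snps)
print(kept)
```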
There is some population stratification among the NFBC66 individuals due to the
different linguistic/geographical groups of participants [196]. The population structure was
assessed using classical multidimensional scaling (MDS) on the matrix of identity-by-state of all
pairs of individuals in the program PLINK [55]. Similar to the Raine Study, variables from the
MDS analysis were included in the genetic association analysis; details are described in
subsequent chapters.
1.7 Aims
This thesis investigates the association between BMI growth trajectories across
childhood/adolescence and genetic variants on a genome-wide scale. Performing this analysis
accurately requires appropriate statistical models that are able to detect small genetic
effects. The aims of this thesis are:
1. To develop appropriate longitudinal statistical models for BMI growth trajectories
throughout childhood using the Western Australian Pregnancy Cohort (Raine) Study.
The potential strengths and weaknesses of each model for genetic association studies
will then be explored to identify the most appropriate model for both candidate gene
and GWAS of longitudinal BMI.
2. To investigate whether a two-step approach is a valid option for the GWAS, whereby
BMI trajectories are first modelled and the random effects for each individual are
extracted for the genetic analysis.
3. To ensure the models developed in aim one extend to additional birth cohorts of
European descent. Any model misspecifications will be tested in an extensive
simulation study to ensure they will have limited impact on a large scale genetic
association study. Appropriate tests will be employed and evaluated for whether they
reduce the impact of the model misspecifications.
4. To conduct a GWAS to identify the underlying genetic determinants of BMI trajectory
in early life across the three birth cohorts as proof-of-principle for the statistical
models developed.
5. To investigate how all known genetic variants associated with adult BMI influence
growth over childhood and adolescence (including BMI, height and weight) and related
growth parameters (including age and BMI at both the adiposity peak and rebound).
1.8 Outline of Thesis
Each chapter in this thesis outlines a particular research project and is an extended form of the
publications that were written for the project. Therefore, although the chapters form a logical
order, there may be some overlap between them in terms of analyses conducted and
background to methods discussed.
Chapter 2 contains a literature review of the statistical models that have previously been used
for BMI trajectory analysis throughout childhood, and describes the Raine Study BMI data in
more detail. The most appropriate methods for genetic association studies are identified and
the fit of each method to the Raine Study data from one to 17 years is described. Previously
reported adult BMI loci are incorporated into each model and the methods are compared to
determine the most appropriate modelling framework for detecting genetic effects of BMI
growth across childhood and adolescence. The chapter concludes with a summary and
recommendation detailing which modelling framework is considered the most appropriate for
future, and potentially larger scale, genetic and epidemiological studies of BMI growth.
Various methods have been suggested for conducting longitudinal GWASs, one of which is a
two-step approach whereby one models the phenotype of interest and extracts summary
measures to conduct the genetic analysis. Chapter 3 explores the application of the two-step
approach to the complex phenotype of BMI over childhood and explains why this approach is
not appropriate for this phenotype.
In Chapter 4, the recommended model from Chapter 2 is applied to ALSPAC data. A simulation
study is described and carried out to determine how model misspecifications may influence
the genome-wide results. Analysis of the genetic variants on chromosome 16 in the ALSPAC
data is also conducted to ensure the real data shows the same results as the simulation study.
Conclusions are made regarding the appropriate methods to utilise to perform GWAS of
longitudinal BMI data from the ALSPAC cohort.
The genome-wide association study for BMI trajectory over childhood is outlined in Chapter 5.
This includes the methodology used, the results found and the suggested future direction for
research involving more detailed analyses that include additional birth cohort studies.
Chapter 6 uses an alternate approach to identify genes associated with BMI trajectories in
early life – it begins by identifying gene loci that are associated with adult BMI (that have been
replicated in independent populations) and explores how SNPs in these loci can affect different
growth patterns over childhood by affecting weight, height and/or BMI growth. The specific
methods and data used are described in detail in the chapter.
The thesis is concluded in Chapter 7 with a summary of the key findings and their implications
for future genetic association studies of childhood growth and, more broadly, of longitudinal
phenotypes.
Chapter 2: Longitudinal Statistical Models For Body Mass Index Growth Trajectories Throughout Childhood Using The Western Australian Pregnancy Cohort (Raine) Study
2.1 Introduction
The results presented in this chapter have been published [197]; the manuscript is included as
an appendix (Appendix A).
Before investigating the genetic basis underlying childhood growth trajectories, it is important
to assess the statistical properties of various modelling approaches that are able to capture all
the intricate details of the trajectory as described in Chapter 1. This chapter outlines the
process by which the most efficient model for detecting genetic effects for childhood BMI
growth trajectories was chosen.
2.2 Background
Obesity is a major global public health problem. The World Health Organisation estimated that
in 2010 there were at least 42 million overweight children under the age of five years and one
billion overweight adults globally [94]. An individual’s susceptibility to obesity is thought to
result from a combination of their genetics, behaviours and environment. The heritability of
obesity is estimated from family and twin studies to be between 40 and 80% [165,166,167]
which appears to be age dependent, with younger individuals having higher heritability
estimates [168]. Genetic factors have an important role in childhood obesity, but these factors
may differ from those that operate in adulthood. Since the advent of GWASs, common
variants within 35 genes have been discovered to be associated with adult obesity
[198,199,200,201,202]. A further 48 genes have been associated with population variation in
body mass index (BMI) and weight [72,174,175,178,179,180,182] in individuals of European
descent. To date, not all of these genes have been validated, and no studies have
investigated the association between BMI in childhood and genetic variants in a GWAS.
Relatively few studies have investigated the relationship between known adult BMI associated
variants and childhood BMI [178,203,204,205,206]. Zhao et al [203] investigated the
association between childhood BMI and 13 genomic loci reported to be associated with adult
obesity, and found that nine of the loci contributed to paediatric BMI between birth and 18
years of age. Subsequently, several authors have investigated the association between adult
BMI loci and changes in growth over childhood. Hardy and colleagues [205] took variants from
the two most commonly reported obesity genes, FTO and MC4R, to see whether they were
associated with life course body size. They found the association with BMI in both genes
strengthened during childhood up until 20 years of age before weakening throughout
adulthood. In 2010, Elks et al [206] used eight variants that showed individual associations
with childhood BMI to create an obesity-risk-allele-score. This allele-score was strongly
associated with early infant weight gain but also with weight gain throughout childhood. Den
Hoed et al [204] looked at BMI in childhood and adolescence against a larger subset of
replicated SNPs representing the 16 BMI loci from the six GWASs in adults of white European
descent [174,175,178,179,207,208]. They found that the cumulative effect of all 16 variants on
BMI in childhood was similar to that in adulthood; however the association with some variants
differed by age. Finally, Belsky et al [209] investigated the largest number of adult BMI
associated SNPs to determine their influence on the development of obesity. They concluded that
individuals with more risk alleles at the 32 loci had an increased likelihood of being obese in
adulthood, and that this genetic risk manifested as rapid early childhood growth. Together,
these studies begin to provide evidence that genetic loci associated with BMI in adulthood may
start to have an effect on BMI in childhood and even infancy.
Every disease, including obesity, develops over a period of time, and hence investigating the
genetic determinants of this developmental process may provide insights into the mechanisms
of the genetic associations. Sophisticated longitudinal analyses allow hypotheses to be tested
that cannot be determined from single time point analyses. These hypotheses include
assessing the patterns and duration of genetic effects over a given time period and the
differences in means and rates of change of a trait. It is therefore important to investigate the
genetic component of BMI trajectory in order to better understand some of the underlying
biology of growth. The analysis of longitudinal growth curves allows one to identify specific
time periods in which genes play a central role.
A child’s growth profile contains important information regarding their genetic make-up and
environmental exposures. However, BMI trajectories are difficult to model due to the
complexities of growth over childhood; children tend to have rapidly increasing BMI from birth
to approximately nine months of age where they reach their adiposity peak, then BMI
decreases until about the age of 5.5 years at adiposity rebound and then steadily increases
again until after puberty where it tends to plateau through adulthood (see figure 1.3, Chapter
1). These patterns of growth tend to be different between males and females where females
often reach each of the ‘landmarks’ (adiposity rebound, puberty and plateau at adult BMI) at
an earlier age than males. These changes over time within each individual, as well as the
increasing variability over time of BMI between individuals, are often difficult to capture in a
statistical model, particularly with the aim of detecting small genetic effects. The World Health
Organization recently conducted research into statistical methods used to construct growth
curves over childhood. They examined as many as 30 previously published methods, of which
only seven handled multiple measurements per child [115]. Historically, growth (height and
weight) models were non-linear, parametric curves over a small age range, for example
adolescence, which were subsequently concatenated to cover the whole age range [210]. They
were parametric in that they modelled the age range of interest with a small number of
parameters. However, they had several drawbacks: 1) they did not allow for
enough individual variation from the non-linear curve and therefore often missed interesting
local variations [211] and 2) they were unable to account for variation in growth due to other
characteristics that are measured at each time point, such as diet and exercise. Later, these
models were extended to non-parametric but still non-linear functions where the shape of the
curve was determined locally and a curve was estimated for each subject over a small range of
ages [211,212]. These non-parametric methods used spline functions (for example, cubic
smoothing splines [213] or variable knot cubic splines [214]) and kernel estimation techniques
[211], where at any age the nearby measurements contribute to the shape of the curve. Spline
functions are sufficiently smooth polynomial functions that are defined using piecewise
functions between chosen knot points; whereas, kernel estimation applies weights to the
growth measurements and averages the weighted measures over appropriate age windows.
Although they solved some of the drawbacks of the parametric methods, the non-parametric
methods were still unable to describe the relationships with other covariates at each time
point and the growth estimates were highly dependent on the size of the selected smoothing
parameter; if the smoothing parameter is too small, the curve will follow random variations,
over-fitting the data so that the estimates will not be reproducible, whereas if the smoothing
parameter is too large the curve will be over-smoothed and interesting local patterns will be
missed. As the estimation is
based on nearby measurements, each individual must have a minimum number of
observations, and ideally the same number of observations for all subjects. The next major
development in the growth modelling literature was the introduction of linear mixed-effects
models for longitudinal normally distributed data [215,216]. These models use powers of age
as explanatory variables, can easily incorporate further explanatory variables measured at
each time point and can model growth over a wide age range. These models can be extended
to account for increasing heteroscedasticity over the time period of interest based on a
multivariate t-distribution or to account for a curve shape that differs from the polynomial
function of age by using smoothing splines. Although the range of available methods
previously used for growth modelling is large, not all are appropriate for genetic association
analyses.
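The kernel estimation idea described above, in which nearby measurements contribute most to the curve at any given age, can be illustrated with a Nadaraya-Watson smoother. This is a sketch under stated assumptions rather than the method used in the cited work: the Gaussian kernel, the function name and the two bandwidths are illustrative choices, and the input values are the overall mean ages and mean BMI values reported in Table 2.2.

```python
import math

# Overall mean age (years) and mean BMI (kg/m^2) at the eight follow-ups,
# taken from Table 2.2 and used here only as illustrative input.
ages = [1.16, 2.18, 3.11, 5.92, 8.10, 10.60, 14.07, 17.05]
bmis = [17.11, 15.97, 16.15, 15.86, 16.88, 18.69, 21.45, 23.02]


def kernel_smooth(x, y, grid, bandwidth):
    """Nadaraya-Watson estimate: a kernel-weighted average of y around each grid point."""
    fitted = []
    for g in grid:
        # Gaussian weights: observations close to g in age receive the most weight.
        weights = [math.exp(-0.5 * ((xi - g) / bandwidth) ** 2) for xi in x]
        fitted.append(sum(w * yi for w, yi in zip(weights, y)) / sum(weights))
    return fitted


# A small bandwidth tracks each observation closely;
# a large bandwidth averages over most of the age range.
rough = kernel_smooth(ages, bmis, ages, bandwidth=0.3)
smooth = kernel_smooth(ages, bmis, ages, bandwidth=10.0)

for age, r, s in zip(ages, rough, smooth):
    print(f"age {age:5.2f}: bandwidth 0.3 -> {r:5.2f}, bandwidth 10 -> {s:5.2f}")
```

Comparing the two fitted columns shows how strongly the estimated curve depends on the chosen smoothing parameter.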
2.2.1 Aims
The aims of this chapter were to:
• Fit each method that is appropriate for modelling BMI trajectories throughout
childhood to the Raine Study data from years 1 to 17.
• Check residuals and compare model fit between methods to determine the most
appropriate model for BMI growth throughout infancy, childhood and adolescence.
• Incorporate known adult BMI loci into each model and compare estimates to
determine most appropriate model for detecting genetic effects of BMI growth.
Methods that will be explored include:
o Linear mixed effects model with up to a cubic function in fixed and/or random effects
o Linear mixed effects model with an extension to allow for non-normally distributed
random effects and error terms (Skew-normal/Skew-t) with up to a cubic function in
fixed and/or random effects
o Semi-parametric linear mixed effects model with smoothing splines
o SuperImposition by Translation And Rotation (non-linear mixed effects model)
2.3 Subjects and Materials
The Western Australian Pregnancy Cohort (Raine) Study is described in detail in Section 1.6.1.
A subset of 1,506 individuals were used for this analysis based on the following criteria: at least
one parent of European descent, live birth, unrelated to anyone else in sample (one individual
of every related pair, including multiple births, was selected at random), no significant
congenital anomalies, genetic data available, and at least one measure of BMI throughout
childhood available. The individuals excluded from these analyses consisted of 369 of non-
Caucasian descent, a further 59 individuals from multiple births (55 twins, two triplets and one
twin who died <18 weeks gestation, one twin withdrew from the study at birth - one of each
multiple remained in the analysis), one individual of each of 66 siblings (not including multiple
births), 10 congenital anomalies, 853 without genetic data available and five without a
measure of BMI in childhood. Many studies investigating BMI trajectories through childhood
and adolescence begin with BMI measured at birth, however, this complicates the modelling
further for two main reasons; firstly, BMI is not considered a meaningful measure in infants and
ponderal index (PI = weight/length^3) is generally used as a measure of growth at this age, and
secondly, there is often a period of weight loss after birth that would not be accurately
captured in the modelling
due to the measurement times available. For these reasons, BMI at birth was excluded from all
models and modelling began at the one year assessment. BMI was calculated from the weight
and height measurements (Table 2.1; median six measures per person, interquartile range
[IQR]: 5-7), with a total of 8,986 BMI measures.
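For reference, the two indices mentioned above differ only in the power of height (or length) in the denominator. The sketch below is illustrative; the function names and the example child are hypothetical, with weight in kilograms and height or length in metres.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight divided by height squared (kg/m^2)."""
    return weight_kg / height_m ** 2


def ponderal_index(weight_kg, length_m):
    """Ponderal index: weight divided by length cubed (kg/m^3)."""
    return weight_kg / length_m ** 3


# Hypothetical one-year-old weighing 10.3 kg with a height of 0.78 m.
print(round(bmi(10.3, 0.78), 2))
print(round(ponderal_index(10.3, 0.78), 2))
```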
Table 2.1: Number of follow-ups with BMI measured for each of the participants in the
sample

Number of follow-ups attended    1    2    3    4    5    6    7    8    Average (SD)
Number of individuals            17   41   66   128  192  348  583  131  5.97 (1.52)
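The summary figures quoted with Table 2.1 can be recovered directly from the tabulated counts. The short check below uses only the counts in the table; the variable names are illustrative, and the use of the sample (n - 1) standard deviation is an assumption.

```python
import math

# Number of individuals attending 1, 2, ..., 8 follow-ups (Table 2.1).
counts = {1: 17, 2: 41, 3: 66, 4: 128, 5: 192, 6: 348, 7: 583, 8: 131}

n_individuals = sum(counts.values())
n_measures = sum(k * n for k, n in counts.items())
mean_followups = n_measures / n_individuals

# Sample standard deviation of the number of follow-ups per individual.
sum_sq = sum(n * (k - mean_followups) ** 2 for k, n in counts.items())
sd_followups = math.sqrt(sum_sq / (n_individuals - 1))

print(n_individuals, n_measures)
print(round(mean_followups, 2), round(sd_followups, 2))
```

The totals agree with the 1,506 individuals and 8,986 BMI measures reported in the text, and the mean and standard deviation agree with the final column of the table.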
Of the 1,506 individuals in the analysis, there are 773 males and 733 females (51% male). At
birth, these individuals were similar to the Western Australian population of births with an
average birth weight of 3.35kg (SD=0.59kg) and gestational age of 39.35 weeks (SD=2.11
weeks), 25.21% of them were born to mothers who smoked throughout pregnancy and 8.77%
born preterm. The mothers on average gained 8.79kg (SD=3.78) from 18-34 weeks of
pregnancy and breast fed their infant for an average of six months (IQR=2-12 months). On
average, the infants gained 6.98kg (SD=1.17kg) in the first year of life. Table 2.2 presents the
average age, weight, height and BMI at each follow-up of the study.
Table 2.3 displays the correlation structure for the repeated observations of BMI, which
indicates that there is a degree of tracking over time. A typical pattern for growth data is
observed, whereby the strength of correlation decreases with increasing time. This suggests
that an autoregressive or unstructured correlation structure may be the most appropriate;
however this will be investigated further in Section 2.4.
Figure 2.1 displays the distributions of BMI for eight scheduled follow-up windows: 1-, 2-, 3-,
6-, 8-, 10-, 14- and 17 years. These can be considered as independent observations as each
individual is only measured once at each scheduled follow-up. It appears that BMI is fairly
normally distributed and the variability between individuals is fairly small, until age six where
the distribution becomes increasingly skewed as age increases.
To get a sense of the time trends, it is important to look at plots of the individual trajectories
over time. Figure 2.2 provides an example of the trajectories for a sample of individuals with
two or more time points over the follow-up period. Figure 2.3 displays a smooth curve through
the observed BMI measurements by age for all 1,506 individuals. Both figures indicate that
there is some curvature in the BMI measurements over time that needs to be accounted for in
the models. No outliers were removed from the data as it was of interest to see whether they
were appropriately accounted for in the chosen methods.
Table 2.2: The phenotypic characteristics at each follow-up year for the 1,506 individuals in
the study sample. Continuous variables are expressed as means (standard deviation); binary
variables as percentage (number).

Measure       Follow-up             All             Male            Female          P-Value
                                    (n=1,506)       (n=773)         (n=733)

Summary of birth measures
Birth Weight (kg)                   3.35 (0.59)     3.41 (0.59)     3.28 (0.58)     3.85x10^-5
Gestational Age (weeks)             39.35 (2.11)    39.37 (2.05)    39.32 (2.17)    0.66
Preterm birth                       8.77% (132)     8.03% (62)      9.55% (70)      0.34
Maternal smoking during pregnancy   25.22% (379)    22.77% (176)    27.81% (203)    0.03

Summary of measures by follow-up year
Age (yr)      Year 1 (n=1,375)      1.16 (0.10)     1.15 (0.10)     1.16 (0.10)     0.22
              Year 2 (n=402)        2.18 (0.14)     2.19 (0.14)     2.16 (0.14)     0.05
              Year 3 (n=994)        3.11 (0.12)     3.12 (0.13)     3.11 (0.10)     0.71
              Year 6 (n=1,324)      5.92 (0.18)     5.91 (0.19)     5.92 (0.18)     0.30
              Year 8 (n=1,320)      8.10 (0.35)     8.12 (0.34)     8.09 (0.36)     0.17
              Year 10 (n=1,274)     10.60 (0.18)    10.60 (0.19)    10.59 (0.17)    0.16
              Year 14 (n=1,276)     14.07 (0.20)    14.07 (0.20)    14.07 (0.19)    0.55
              Year 17 (n=1,021)     17.05 (0.25)    17.03 (0.24)    17.06 (0.25)    0.06
BMI (kg/m^2)  Year 1 (n=1,375)      17.11 (1.40)    17.38 (1.38)    16.82 (1.37)    4.63x10^-14
              Year 2 (n=402)        15.97 (1.29)    16.19 (1.28)    15.72 (1.25)    2.00x10^-4
              Year 3 (n=994)        16.15 (1.27)    16.29 (1.21)    16.00 (1.31)    2.00x10^-4
              Year 6 (n=1,324)      15.86 (1.76)    15.88 (1.70)    15.84 (1.82)    0.64
              Year 8 (n=1,320)      16.88 (2.54)    16.79 (2.47)    16.97 (2.62)    0.29
              Year 10 (n=1,274)     18.69 (3.41)    18.58 (3.38)    18.80 (3.45)    0.25
              Year 14 (n=1,276)     21.45 (4.23)    21.21 (4.24)    21.71 (4.20)    0.03
              Year 17 (n=1,021)     23.02 (4.38)    22.83 (4.34)    23.23 (4.42)    0.15
Height (m)    Year 1 (n=1,375)      0.78 (0.03)     0.78 (0.03)     0.77 (0.03)     1.04x10^-14
              Year 2 (n=402)        0.90 (0.03)     0.91 (0.03)     0.90 (0.03)     3.00x10^-4
              Year 3 (n=994)        0.96 (0.04)     0.97 (0.04)     0.96 (0.04)     1.06x10^-9
              Year 6 (n=1,324)      1.16 (0.05)     1.17 (0.05)     1.15 (0.04)     6.05x10^-7
              Year 8 (n=1,320)      1.29 (0.06)     1.30 (0.06)     1.29 (0.06)     4.37x10^-6
              Year 10 (n=1,274)     1.44 (0.06)     1.44 (0.07)     1.44 (0.06)     0.97
              Year 14 (n=1,276)     1.65 (0.08)     1.67 (0.09)     1.62 (0.06)     4.94x10^-26
              Year 17 (n=1,021)     1.73 (0.09)     1.79 (0.07)     1.66 (0.06)     1.94x10^-143
Weight (kg)   Year 1 (n=1,375)      10.34 (1.24)    10.67 (1.24)    9.99 (1.15)     5.03x10^-25
              Year 2 (n=402)        13.03 (1.49)    13.39 (1.48)    12.65 (1.40)    3.37x10^-7
              Year 3 (n=994)        15.06 (1.84)    15.42 (1.83)    14.69 (1.78)    3.99x10^-10
              Year 6 (n=1,324)      21.48 (3.37)    21.75 (3.42)    21.20 (3.30)    2.91x10^-3
              Year 8 (n=1,320)      28.42 (5.68)    28.58 (5.65)    28.24 (5.72)    0.28
              Year 10 (n=1,274)     39.01 (9.02)    38.80 (9.09)    39.23 (8.95)    0.40
              Year 14 (n=1,276)     58.49 (13.44)   59.50 (14.49)   57.39 (12.11)   4.81x10^-3
              Year 17 (n=1,021)     68.69 (14.59)   73.15 (14.91)   64.12 (12.74)   3.91x10^-24
Table 2.3: The correlation structure of the repeated observations of BMI

Year    1       2       3       6       8       10      14      17
1       1
2       0.712   1
3       0.689   0.761   1
6       0.497   0.619   0.729   1
8       0.388   0.461   0.595   0.878   1
10      0.314   0.351   0.503   0.778   0.899   1
14      0.246   0.326   0.423   0.689   0.794   0.861   1
17      0.213   0.272   0.44    0.611   0.698   0.754   0.853   1
Figure 2.1: Boxplots of BMI at each follow-up year, with BMI displayed from 10-30kg/m2 for
years 1-6 and 10-50kg/m2 for years 8-17.
Figure 2.2: Individual BMI profiles of 20 individuals from the Raine Study
Figure 2.3: Observed BMI measures for the 1,506 individuals with a lowess curve to
visualise the curvature in BMI over childhood
2.4 Statistical Methods and Model Fit
Four methods were compared to assess the accuracy of estimation for the BMI growth
trajectories and the ability to detect genetic effects influencing these trajectories. These
methods included: Linear Mixed Effects Model (LMM) [215], Skew-t Linear Mixed Effects
Model (STLMM) [217,218,219], Semi-Parametric Linear Mixed Effect Model (SPLMM) and a
Non-Linear Mixed Model (NLMM), also known as SuperImposition by Translation and Rotation
(SITAR) [220]. Although there are many possible statistical methods that could be utilized in
this context, these methods were chosen as they allow for adjustment of potential
confounders, appropriately account for the correlation between the repeated measures,
obtain valid inference, allow for incomplete data on the assumption that data are missing at
random, and are computationally feasible. Once the best fitting model was defined for each
method, the model fit for each of the methods was compared.
A small simulation study was also conducted using re-sampling techniques: 1,000
non-parametric bootstrap data sets were drawn with replacement [221] from the Raine Study
data and an R^2 statistic was calculated for each method fitted to these simulated data sets.
These bootstrap resamples provide an estimate of the variance of the R^2 statistic for each
method.
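The bootstrap procedure described above can be sketched as follows. Several details are assumptions made for illustration and are not taken from the analysis itself: individuals (rather than single observations) are treated as the resampling unit, a simple least-squares line stands in for the four mixed-model methods, and the small data set is invented.

```python
import random
import statistics

# Invented toy data: individual id -> list of (age, BMI) observations.
data = {
    "a": [(8.0, 16.1), (10.0, 17.9), (14.0, 20.8)],
    "b": [(8.1, 17.5), (10.2, 19.4), (14.1, 23.0)],
    "c": [(7.9, 15.2), (10.1, 16.0), (13.9, 18.7)],
    "d": [(8.0, 18.3), (9.9, 21.1), (14.0, 25.6)],
    "e": [(8.2, 16.6), (10.0, 17.2), (14.2, 19.9)],
}


def r_squared(xs, ys):
    """R^2 of a least-squares straight line (stand-in for a fitted growth model)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def bootstrap_r2(data, n_boot, seed=1):
    """Resample individuals with replacement and recompute R^2 each time."""
    rng = random.Random(seed)
    ids = list(data)
    results = []
    for _ in range(n_boot):
        sampled = [rng.choice(ids) for _ in ids]
        xs = [age for i in sampled for age, _ in data[i]]
        ys = [value for i in sampled for _, value in data[i]]
        results.append(r_squared(xs, ys))
    return results


r2_values = bootstrap_r2(data, n_boot=1000)
print("mean R^2:", round(statistics.mean(r2_values), 3))
print("variance of R^2:", round(statistics.variance(r2_values), 5))
```

Resampling whole individuals keeps each person's repeated measures together, which seems the natural choice for longitudinal data, although the resampling unit is not stated in the text.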
Sex stratified models were used for all analyses, 1) to account for the differing growth curves
between males and females, particularly around puberty, and 2) because different genes may
influence the timing of growth spurts in males and females.
All analyses were conducted in R version 2.12.1 [222]; the spida library was used for the
SPLMM models and the sitarlib library was used for the NLMM models. Maximum likelihood
estimation was used for all mixed models to enable comparison between each of the four
methods.
2.4.1 Linear Mixed Effects Model (LMM)
LMMs [215] include both fixed effects, which are parameters that are associated with an entire
population, and random effects, which are parameters that are associated with individual units
drawn from the population at random. An LMM with a polynomial function for the time
component is a common tool for growth curve analysis with continuous repeated measures.
For a set of time points varying from 1,..,t, the time trend in the sample can be described by a
(q-1)st degree polynomial function, with q ≤ t.
2.4.1.1 Method Description
In comparison to the linear model, with one random effect term for the error (often referred
to as the residual error and denoted ε_i), the LMM includes additional random-effects terms
which are appropriate for representing clustered and therefore dependent data, such as
observations taken on related individuals or data collected at several time points. The LMM
has the following general form:
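\[
\mathbf{y}_i = \mathbf{X}_i \boldsymbol{\beta} + \mathbf{Z}_i \mathbf{b}_i + \boldsymbol{\varepsilon}_i, \qquad
\mathbf{b}_i \sim N_q(\mathbf{0}, \mathbf{D}), \qquad
\boldsymbol{\varepsilon}_i \sim N_{n_i}(\mathbf{0}, \mathbf{R}_i)
\]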
Where:
• y_i is the n_i x 1 response vector for observations in the ith group.
• X_i is the n_i x p design matrix for the fixed effects for observations in group i.
• β is the p x 1 vector of fixed-effect or population-averaged regression coefficients
(unknown population parameters).
• Z_i is the n_i x q design matrix for the random effects for observations in group i.
• b_i is the q x 1 vector of random-effect coefficients for group i.
• ε_i is the n_i x 1 vector of errors for observations in group i.
• D is the q x q covariance matrix for the random effects.
• R_i is the n_i x n_i positive-definite covariance matrix for the errors in group i.
• b_i and ε_i are assumed to be independent.
The random effects, b_i, are defined to be normally distributed with a mean of zero.
There are several advantages to this form of model including 1) it can be used in an
unbalanced design often seen in longitudinal studies, where either the number of measures or
the timing of the measurements differs between individuals, 2) it explicitly estimates the
between and within individual variation and 3) due to its computational efficiency, it facilitates
exploratory association analysis where multiple covariates are of interest.
Inferences about the fixed effects are generally referred to as estimates, whereas inferences
about the random effects are referred to as predictors [223]. The best linear unbiased
estimator (BLUE) of β is:
\[
\hat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{V}^{-1} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{V}^{-1} \mathbf{y}
\]
Where V is the covariance matrix for the vector of observations, y, such that V = ZDZ^T + R. The
best linear unbiased predictor (BLUP) of b is [224]:
\[
\hat{\mathbf{b}} = \mathbf{D} \mathbf{Z}^T \mathbf{V}^{-1} (\mathbf{y} - \mathbf{X} \hat{\boldsymbol{\beta}})
\]
They are ‘best’ in that they minimize the sampling variance, linear in the sense that they are
linear functions of the observations y and unbiased such that:
\[
E(\hat{\boldsymbol{\beta}}) = \boldsymbol{\beta}, \qquad E(\hat{\mathbf{b}}) = E(\mathbf{b})
\]
The BLUEs are estimated to ensure the random effects are distributed N_q(0, D). Marginally,
the y_i are independently normally distributed as N(X_i β, R_i + Z_i D Z_i^T).
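To make the marginal covariance R_i + Z_i D Z_i^T concrete, the sketch below builds it for one individual with a random intercept and a random slope on centred age. All numerical values (the ages, D and R_i) are hypothetical and chosen only for illustration; the helper functions are written out so that the example needs no external libraries.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def marginal_cov(Z, D, R):
    """Marginal covariance of y_i: Z_i D Z_i^T + R_i."""
    zdzt = matmul(matmul(Z, D), transpose(Z))
    return [[zdzt[i][j] + R[i][j] for j in range(len(R))] for i in range(len(R))]


# Hypothetical individual measured at ages 6, 8 and 10, centred at 8 years.
ages = [6.0, 8.0, 10.0]
Z = [[1.0, age - 8.0] for age in ages]   # random intercept and random slope
D = [[1.0, 0.1],
     [0.1, 0.05]]                        # hypothetical random-effects covariance
R = [[0.5, 0.0, 0.0],
     [0.0, 0.5, 0.0],
     [0.0, 0.0, 0.5]]                    # hypothetical independent errors

V = marginal_cov(Z, D, R)
for row in V:
    print([round(v, 2) for v in row])
```

With these values the diagonal of V increases with age, showing how a random slope allows the between-individual variability to grow over time even when the error variance is constant.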
Laird and Ware [215] note that maximum likelihood (ML) estimation tends to give biased
estimates of the covariance structure, whereas restricted ML (REML) is able to give an unbiased
result. In REML, the maximum likelihood estimation is not carried out on all the information,
but instead uses a transformed set of data so that the nuisance parameters do not influence
the likelihood. That is, the likelihood is based on any full-rank set of error contrasts
µ^T y such that E(µ^T y) = 0, which is equivalent to µ^T X = 0. It is therefore recommended to
use REML when interpreting the covariance estimates as it produces unbiased estimates of these
parameters.
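The difference between the two estimators is easiest to see in the ordinary linear model, which is the special case of the LMM with no random effects. The following worked equations are a standard textbook illustration rather than a result from this thesis; n denotes the number of observations and p the rank of X.

```latex
% Ordinary linear model: y = X\beta + \varepsilon, \varepsilon \sim N(0, \sigma^2 I_n), rank(X) = p.
\[
\hat{\sigma}^2_{\mathrm{ML}}
  = \frac{1}{n} (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})^T (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}),
\qquad
E(\hat{\sigma}^2_{\mathrm{ML}}) = \frac{n - p}{n}\, \sigma^2
\]
\[
\hat{\sigma}^2_{\mathrm{REML}}
  = \frac{1}{n - p} (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})^T (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}),
\qquad
E(\hat{\sigma}^2_{\mathrm{REML}}) = \sigma^2
\]
```

The ML estimator is biased downwards because it ignores the p degrees of freedom used to estimate the fixed effects; basing the likelihood on error contrasts removes this dependence.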
2.4.1.2 Model Fit
The growth curve LMM for the jth individual and tth time point and with the time scale
measured by age is as follows:
\[
\mathrm{BMI}_{jt} = \beta_0 + \sum_{i} \beta_i (\mathrm{Age}_{jt} - \overline{\mathrm{Age}})^i + b_{0j} + \sum_{k} b_{kj} (\mathrm{Age}_{jt} - \overline{\mathrm{Age}})^k + \varepsilon_{jt}, \qquad k \le i
\]
Where \overline{Age} is the mean age over the t time points in the sample (i.e. eight years), β_i are
the parameters for the fixed effects, b_kj are the parameter estimates for the random effects
assumed to be multivariate normal and the ε_jt are the error terms assumed to be normally
distributed N(0, Σ), where Σ is the within-individual correlation matrix. Both age and the
natural log (ln) transformation of age were considered as the time component to identify the
optimal underlying scale. Both fixed (i) and random (k) effects up to polynomial of degree 3
were tested. Several within-individual correlation structures were considered, including
autoregressive (i.e. constant variance across occasions, σ^2, and Corr(Y_ij, Y_i,j+k) = ρ^k), continuous
autoregressive, exchangeable (compound symmetric; i.e. constant variance across occasions,
σ^2, and Corr(Y_ij, Y_ik) = ρ) and unstructured (i.e. no assumption is made about the variances or
covariances).
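The structured correlation matrices listed above can be written down directly. The sketch below is illustrative: the function names and the value of ρ are hypothetical, and the continuous autoregressive form shown, in which the correlation is ρ raised to the absolute time difference between two measurements, is the usual definition and is assumed here.

```python
def ar1_corr(n_occasions, rho):
    """Autoregressive: correlation rho**k between occasions k steps apart."""
    return [[rho ** abs(i - j) for j in range(n_occasions)]
            for i in range(n_occasions)]


def car1_corr(times, rho):
    """Continuous autoregressive: correlation rho**|s - t| for measurement times s and t."""
    return [[rho ** abs(s - t) for t in times] for s in times]


def exchangeable_corr(n_occasions, rho):
    """Exchangeable (compound symmetric): the same correlation rho for every pair."""
    return [[1.0 if i == j else rho for j in range(n_occasions)]
            for i in range(n_occasions)]


rho = 0.9  # hypothetical value
followup_ages = [1.16, 2.18, 3.11, 5.92]  # first four mean follow-up ages (Table 2.2)

print(ar1_corr(3, rho))
print([[round(c, 3) for c in row] for row in car1_corr(followup_ages, rho)])
print(exchangeable_corr(3, rho))
```

Because the continuous version depends on the actual time between measurements, it accommodates unequally spaced follow-ups, whereas the discrete version treats consecutive occasions as equally spaced.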
Output from R version 2.12.1 [222] for the model fitting procedure outlined below is provided
in Appendix B. Following the guidelines outlined in Cheng et al [225], the initial saturated
model included a cubic function of age for both the fixed and random effects and BMI on the
natural log scale. Initially, likelihood ratio tests (LRT) were used to assess the degree
of polynomial function required in the random effects to fit the data accurately, while keeping
the fixed effects the same and specifying an independence correlation matrix for the random
effects. Table 2.4 provides statistics for the covariance models that were tested.
Table 2.4: Model fit statistics for covariance models tested using the LMM method; -2 log
likelihood, Bayesian Information Criterion (BIC) and Akaike Information Criterion (AIC). All
models assumed an independent correlation structure for the random effects and no specific
correlation structure for the error.

          Model   Random Effects          -2LL       BIC         AIC
Female    1       Intercept               3647.45    -7244.60    -7282.90
          2       Intercept, age          4294.20    -8529.72    -8574.40
          3       Intercept, age, age^2   4392.85    -8718.63    -8769.70
Male      1       Intercept               3812.15    -7573.70    -7612.31
          2       Intercept, age          4521.14    -8983.23    -9028.27
          3       Intercept, age, age^2   4626.61    -9185.74    -9237.22
The model with a polynomial of degree 3 in the random effects did not converge for either
males or females. Therefore, according to the AIC and other criteria presented in Table 2.4, the
model with a quadratic function for both males and females was the most appropriate (LRT P-
Value < 0.0001 when comparing Model 2 to 3 in both females and males).
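The comparison of Model 2 with Model 3 can be reproduced approximately from Table 2.4. The sketch below rests on assumptions that are not stated in the text: the likelihood column is read as the log-likelihood (which is what the tabulated AIC values imply if Model 1 has six parameters), and a chi-square reference distribution with one degree of freedom is used for the single additional variance parameter. Because a variance is being tested on the boundary of its parameter space, such a p-value is, if anything, conservative.

```python
import math


def lrt_statistic(loglik_reduced, loglik_full):
    """Likelihood ratio statistic: twice the gain in log-likelihood."""
    return 2.0 * (loglik_full - loglik_reduced)


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def aic(loglik, n_params):
    """Akaike Information Criterion."""
    return -2.0 * loglik + 2.0 * n_params


# Values from Table 2.4, read as log-likelihoods (assumption).
female_stat = lrt_statistic(4294.20, 4392.85)
male_stat = lrt_statistic(4521.14, 4626.61)

print("females:", round(female_stat, 2), chi2_sf_1df(female_stat))
print("males:  ", round(male_stat, 2), chi2_sf_1df(male_stat))

# Consistency check: an assumed six parameters reproduces the AIC of female Model 1.
print(round(aic(3647.45, 6), 2))
```

Under these assumptions both p-values fall far below 0.0001, in line with the LRT result reported above.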
Independence between the random effects was assumed in the initial model fit, which may
not be necessary, so a LRT was conducted to see whether this assumption could be relaxed.
The LRT P-Value for both males and females was <0.0001 indicating that it is necessary to
allow a correlation between the random effects.
A similar approach was used to investigate whether a within-individual correlation structure
was required in addition to the random effects. LRTs suggest that a correlation structure using
the continuous autoregressive of order one method is necessary for both males and females.
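A continuous autoregressive structure of order one is commonly written as Corr(εjt, εjs) = φ^|t - s|, so that the correlation between two residuals from the same individual decays with the time between the measurements. The sketch below builds this matrix for a handful of ages. The value of φ is the female estimate reported in Table 2.5; the ages are hypothetical and chosen only to show unequal spacing.

```python
def car1_correlation(ages, phi):
    """Correlation matrix with entries phi ** |age_i - age_j|."""
    return [[phi ** abs(a - b) for b in ages] for a in ages]

phi = 0.37671                    # estimate reported for females in Table 2.5
ages = [8.0, 9.0, 10.5, 14.0]    # hypothetical measurement ages in years
corr = car1_correlation(ages, phi)
```

Measurements one year apart have correlation φ, and the correlation shrinks towards zero as the gap widens, which is why this structure suits irregularly spaced follow-ups.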
Finally, models with untransformed BMI and both untransformed and natural log transformed
age were compared using model fit criteria including fitted versus observed values, fitted
versus residual values and the distribution of both random effects and error terms. These criteria
suggest that natural log transformed BMI and untransformed age provided the best fit to the
data for both males and females.
To summarise, the optimal LMM model for both males and females was based on ln(BMI) and
untransformed age, with a quadratic polynomial of age in the random effects, a cubic
polynomial of age in the fixed effects and a continuous autoregressive correlation structure of
order one. Hence, the final model for both females and males was
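$$\ln(\mathrm{BMI}_{jt}) = \beta_0 + \beta_1(\mathrm{Age}_{jt} - 8) + \beta_2(\mathrm{Age}_{jt} - 8)^2 + \beta_3(\mathrm{Age}_{jt} - 8)^3 + b_0 + b_1(\mathrm{Age}_{jt} - 8) + b_2(\mathrm{Age}_{jt} - 8)^2 + \varepsilon_{jt}$$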
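To make the fixed-effects part of this model concrete, the sketch below evaluates the population curve at a few ages and back-transforms it to the BMI scale. It uses the rounded female estimates printed in Table 2.5, so the results are approximate. The back-transformed value is the exponential of the mean of ln(BMI) for an individual whose random effects are zero, which is not the same as the arithmetic mean BMI.

```python
import math

# Fixed-effect estimates for females as printed in Table 2.5 (rounded values).
beta = (2.8170, 0.0343, 0.0030, -0.0003)

def predicted_bmi(age, beta=beta, centre=8.0):
    """Back-transformed population curve: exp of the cubic in (age - centre)."""
    x = age - centre
    log_bmi = beta[0] + beta[1] * x + beta[2] * x ** 2 + beta[3] * x ** 3
    return math.exp(log_bmi)

curve = {age: round(predicted_bmi(age), 2) for age in (2, 5, 8, 11, 14, 17)}
```

At age eight all polynomial terms vanish, so the prediction is simply the exponential of the intercept.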
The model output and diagnostics for females (Table 2.5 and Figure 2.4) and males (Table 2.6
and Figure 2.5) are below. The plots in Figure 2.4 show a variety of diagnostics of the LMM
model fit in females. The plot of residuals versus age (Plot A) shows that the residuals are
relatively constant over time; however, there are a few outliers within each time window. The
predicted values from the model correspond fairly well to the observed BMI values, as seen in
Plot B of Figure 2.4. The model tends to underestimate the high values of BMI, as seen by the points
above the x=y line at the top of Plot B. Plot C illustrates that there is reasonably constant
variability in the standardised residuals across the range of predicted values. Focusing now on
Plots D-F, the autocorrelation is within the expected bounds from lag 4 and, although the
residuals appear to fit the assumption of normality, the random intercept deviates from
normality at both ends. For males, the plots in Figure 2.5 show a similar pattern to that
discussed for the females. However, there are fewer outliers in the males than the females and
both the random intercept and residuals appear to follow a normal distribution more closely.
2.4.1.3 Computational Time
Running 100 basic models as described above in R-64-bit version 2.12.1 on a 64-bit operating
system with an Intel Core i7 CPU Processor (L 640 @ 2.13GHz), it takes on average 12.14
seconds (12.10-12.18 seconds) for each female model and 14.25 seconds (13.98-14.59
seconds) for each male model.
Table 2.5: Details of LMM model in females (N=733, n=4377)
Goodness of fit criteria
AIC -9622.34
BIC -9545.74
Log Likelihood 4823.17
Random Effects
SD Correlation
Intercept Age
Intercept 0.1247
Age 0.0110 0.8310
Age2 0.0010 -0.7230 -0.4550
Residual 0.0633
Correlation Structure: Continuous AR(1)
Phi 0.37671
Fixed effects
Value SE DF t-value P-Value
Intercept 2.8170 0.0050 3641 564.28 <0.001
Age 0.0343 0.0007 3641 52.28 <0.001
Age2 0.0030 0.0001 3641 47.04 <0.001
Age3 -0.0003 0.00001 3641 -33.71 <0.001
Correlation
Intercept Age Age2
Age 0.45
Age2 -0.61 0.06
Age3 0.08 -0.70 -0.41
Standardized Within-Group Residuals
Min 1st quartile Median 3rd quartile Max
-4.7245 -0.4761 -0.0304 0.4404 5.8136
Figure 2.4: Model diagnostic plots from LMM model fit to the data from females in the
Raine Study
Table 2.6: Details of LMM model in males (N=773, n=4609)
Goodness of fit criteria
AIC -10119.56
BIC -10042.34
Log Likelihood 5071.78
Random Effects
SD Correlation
Intercept Age
Intercept 0.1194
Age 0.0113 0.8120
Age2 0.0010 -0.7090 -0.3640
Residual 0.0637
Correlation Structure: Continuous AR(1)
Phi 0.3763
Fixed effects
Value SE DF t-value P-Value
Intercept 2.8087 0.0047 3833 597.65 <0.001
Age 0.0288 0.0006 3833 44.51 <0.001
Age2 0.0032 0.0001 3833 52.47 <0.001
Age3 -0.0003 0.00001 3833 -31.16 <0.001
Correlation
Intercept Age Age2
Age 0.44
Age2 -0.61 0.09
Age3 0.08 -0.69 -0.40
Standardized Within-Group Residuals
Min 1st quartile Median 3rd quartile Max
-4.0948 -0.4423 -0.0223 0.4149 3.5616
Figure 2.5: Model diagnostic plots from LMM model fit to the data from males in the Raine
Study.
2.4.2 Skew-t Linear Mixed Effects Model (STLMM)
The assumption of multivariate normal random effects and within-subject errors is often
violated, particularly when modelling childhood growth. This is the case in the Raine Study BMI
data, particularly in the females, as seen in the random effects and residual error plots from
the LMM models (Plots E and F of Figure 2.4 and Figure 2.5). The normality assumption makes the
model easy to apply in widely used software; however, its accuracy is difficult to check and the
routine use of normality has been questioned as it often lacks robustness against departures
from the normal. Misspecification of the distribution may lead to biased estimation of fixed
effects and their standard errors, and thus incorrect statistical inference, in particular for the
genetic parameters. A common approach to achieve normality is to transform the response
variable; however, there is no unique transformation that can be used in every scenario
and the results of the analyses might depend on the transformation used. To avoid
transforming the response and to maintain valid inference under a non-normal distribution for
the response, an extension to the LMM was utilised assuming a multivariate t-distribution
for the error terms, εjt, and a multivariate skew-normal distribution for the
random effects. The resulting model for the response over the t time points is multivariate
skew-t with specific parameters that account for the asymmetry (skewness parameters) and
the long tails (degrees of freedom of the t-distribution) of the response distribution [218].
2.4.2.1 Method Description
There has been considerable work undertaken on extending the LMM to situations
in which the residuals do not follow a normal distribution. Pinheiro, Liu and Wu [226] proposed
a multivariate t linear mixed model and later several articles were published on a skew-normal
linear mixed model [227,228,229]. This mixed model was based on the skew-normal
distribution introduced by Azzalini [230], who also developed an expectation-maximization
(EM) type algorithm for maximum likelihood estimation. Recently, Lachos et al [217] extended
these methods to allow for skew-normal/independent (SNI) distributions including the
skew-normal, skew-t, skew-slash and skew-contaminated distributions.
A p x 1 random vector Y follows a skew-normal (SN) distribution with location vector µ,
dispersion matrix Σ (a p x p positive definite matrix) and p x 1 skewness vector λ, if its
probability density function (pdf) is given by:
$$f(\mathbf{y}) = 2\,\phi_p(\mathbf{y} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})\,\Phi_1\!\left(\boldsymbol{\lambda}^{T}\boldsymbol{\Sigma}^{-1/2}(\mathbf{y} - \boldsymbol{\mu})\right), \quad \mathbf{y} \in \mathbb{R}^{p} \qquad (2.2)$$
where φp(.|μ,Σ) stands for the pdf of the p-variate normal distribution with mean vector µ
and covariance matrix Σ and Φ1(.) is the cumulative distribution function (cdf) of the standard
univariate normal distribution. Note that for λ=0, Equation 2.2 reduces to the symmetric Np(µ, Σ).
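The univariate case (p = 1) of this density is easy to evaluate directly and may help in seeing how λ controls the asymmetry. The sketch below is an illustration only; the location, scale and skewness values are hypothetical and are not estimates from the Raine Study. It also checks numerically that the density integrates to approximately one and that it reduces to the normal density when λ = 0.

```python
import math

def normal_pdf(y, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def skew_normal_pdf(y, mu=0.0, sigma=1.0, lam=0.0):
    """Univariate case: 2 * phi(y | mu, sigma^2) * Phi(lam * (y - mu) / sigma)."""
    return 2.0 * normal_pdf(y, mu, sigma) * normal_cdf(lam * (y - mu) / sigma)

def integrate(f, lo, hi, n=20000):
    """Trapezoidal rule, used only to check that the density integrates to about 1."""
    h = (hi - lo) / n
    total = 0.5 * (f(lo) + f(hi)) + sum(f(lo + i * h) for i in range(1, n))
    return total * h

# Hypothetical parameter values chosen for illustration.
area = integrate(lambda y: skew_normal_pdf(y, mu=16.0, sigma=2.0, lam=3.0), 0.0, 40.0)
```

With a positive λ, more of the probability mass lies above the location parameter than below it, which is the pattern seen in right-skewed BMI data.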
If aZ ~ SNp(0, a²Σ, λ) for all a > 0 and Z = Y - µ, the SNI distribution can be defined as follows:
$$\mathbf{Y} = \boldsymbol{\mu} + U^{-1/2}\mathbf{Z}$$
where U is a positive random variable with cdf H(u; v) and pdf h(u; v), independent of
the SNp(0, Σ, λ) random vector Z. The skew-t distribution with v degrees of freedom,
STp(µ, Σ, λ, v), is a special case of the SNI distribution in which U ~ Gamma(v/2, v/2), v > 0.
The pdf of Y is:
$$f(\mathbf{y}) = 2\,t_p(\mathbf{y}; \boldsymbol{\mu}, \boldsymbol{\Sigma}, v)\,T\!\left(\sqrt{\frac{v + p}{v + d}}\,A;\; v + p\right), \quad \mathbf{y} \in \mathbb{R}^{p}$$
where tp(.; μ, Σ, v) is the pdf of the p-variate Student-t distribution and T(.; v) is the cdf of the
standard univariate t-distribution, d = (y - µ)^T Σ^{-1} (y - µ) is the Mahalanobis distance and
A = λ^T y0, with y0 = Σ^{-1/2}(y - µ).
The skew-normal distribution is the limiting case (as v ↑ ∞).
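The definition Y = μ + U^{-1/2}Z also gives a direct way to simulate from the skew-t distribution. The sketch below does this for the univariate case with unit scale. It relies on a standard stochastic representation of the skew-normal variable, Z = δ|T0| + sqrt(1 - δ²)T1 with δ = λ/sqrt(1 + λ²) and T0, T1 independent standard normal variables. The parameter values and the seed are hypothetical and serve only to illustrate the construction.

```python
import math
import random

def sample_skew_normal(lam, rng):
    """Draw Z with density 2*phi(z)*Phi(lam*z) via Z = delta*|T0| + sqrt(1 - delta^2)*T1."""
    delta = lam / math.sqrt(1.0 + lam * lam)
    t0 = rng.gauss(0.0, 1.0)
    t1 = rng.gauss(0.0, 1.0)
    return delta * abs(t0) + math.sqrt(1.0 - delta * delta) * t1

def sample_skew_t(mu, lam, v, rng):
    """Draw Y = mu + U**(-1/2) * Z with U ~ Gamma(shape v/2, rate v/2)."""
    u = rng.gammavariate(v / 2.0, 2.0 / v)  # second argument is the scale, i.e. 1/rate
    return mu + sample_skew_normal(lam, rng) / math.sqrt(u)

rng = random.Random(2013)  # fixed seed so the illustration is reproducible
draws = [sample_skew_t(mu=0.0, lam=3.0, v=5.0, rng=rng) for _ in range(20000)]
sample_mean = sum(draws) / len(draws)
sample_median = sorted(draws)[len(draws) // 2]
```

With a positive skewness parameter the sample mean lies above the sample median, reflecting the heavier right tail, and smaller values of v make that tail heavier still.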
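Therefore, the SNI LMM is defined by the following:
$$\mathbf{y}_i = \mathbf{X}_i\boldsymbol{\beta} + \mathbf{Z}_i\mathbf{b}_i + \boldsymbol{\varepsilon}_i, \qquad \begin{pmatrix}\mathbf{b}_i \\ \boldsymbol{\varepsilon}_i\end{pmatrix} \sim \mathrm{SNI}_{q+n_i}\!\left(\begin{pmatrix}\mathbf{0} \\ \mathbf{0}\end{pmatrix}, \begin{pmatrix}\mathbf{D} & \mathbf{0} \\ \mathbf{0} & \boldsymbol{\Sigma}_i\end{pmatrix}, \begin{pmatrix}\boldsymbol{\lambda} \\ \mathbf{0}\end{pmatrix}, H\right), \quad i = 1, \ldots, n$$
The matrices D = D(α) and Σi = Σi(γ), i = 1, …, n, are dispersion matrices corresponding to the
between and within subject variability, and depend on unknown and reduced parameters α
and γ, respectively. Finally, H = H(.;v) is the cdf-generator that determines the specific SNI
model assumed.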
The model has a closed form for the marginal distribution function, which allows inference
to be implemented in a straightforward manner with standard optimization routines. There is no
explicit solution for maximizing the likelihood function, so it has to be maximized numerically
using an expectation maximization (EM) algorithm (Lachos et al [217]). An important step is
choosing the starting values, which are often taken to be the corresponding estimates under
a normal assumption, with the starting values for the asymmetry parameters set to 0.
2.4.2.2 Model Fit
The specification in terms of the fixed and random effects was identical to the LMM. No
transformations were applied to either BMI or age as the skewness in the data was accounted
for by the model structure. However, the model would not converge with both linear and
quadratic age components in the random effects so this was reduced to only linear age. Hence,
the final model for both females and males was
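$$\mathrm{BMI}_{jt} = \beta_0 + \beta_1(\mathrm{Age}_{jt} - 8) + \beta_2(\mathrm{Age}_{jt} - 8)^2 + \beta_3(\mathrm{Age}_{jt} - 8)^3 + u_0 + u_1(\mathrm{Age}_{jt} - 8) + \varepsilon_{jt}$$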
The model output and diagnostics for females are presented in Table 2.7 and Figure 2.6.
Table 2.7: Details of STLMM model in females (N=733, n=4377)
Goodness of fit criteria
Log Likelihood -7952.42
Kurtosis 4.15
Random Effects
SD Correlation
Intercept
Intercept 5.4443
Age 0.0997 0.6814
Residual 0.7235
Fixed effects
Value SE t-value P-Value
Intercept 14.6367 0.0994 147.32 <0.001
Age 0.2891 0.0123 23.52 <0.001
Age2 0.0557 0.0007 79.10 <0.001
Age3 -0.0044 0.0001 -31.03 <0.001
Skewness
Intercept 4.5791
Age 2.2336
Figure 2.6: Model diagnostic plots from STLMM model fit to the data from females in the
Raine Study
The plots in Figure 2.6 display the diagnostics for the STLMM method in females, similar to
those plotted for the LMM model. The residuals versus age in Plot A show that the
residuals are relatively constant over time; however, there are a few large outliers at the final
follow-up time. Plot B shows that the predicted values from the model correspond fairly well to
the observed BMI values; the underestimation of high BMI values that was seen with the LMM
method is not observed here. Once again, there is reasonably constant variability in the
standardised residuals across the range of predicted values (Plot C); however, this method
provides a better fit than the LMM for the residuals using the t-distributional assumption (Plot
E) and the random intercept under the skew-normal distribution (Plot D).
Similar output is presented in Table 2.8 and Figure 2.7 for the males.
Table 2.8: Details of STLMM model in males (N=773, n=4609)
Goodness of fit criteria
Log Likelihood -8264.53
Kurtosis 3.05
Random Effects
SD Correlation
Intercept
Intercept 3.3500
Age 0.0673 0.4127
Residual 0.5918
Fixed effects
Value SE t-value P-Value
Intercept 14.8247 0.1002 147.89 <0.001
Age 0.19703 0.0128 15.40 <0.001
Age2 0.05738 0.0006 93.10 <0.001
Age3 -0.00357 0.0001 -29.03 <0.001
Skewness
Intercept 2.8590
Age 1.6628
The plots in Figure 2.7 show a similar pattern to that discussed for the STLMM method in
females. However, the residuals show that the fit is not as good for the males (Plot E),
potentially because there is less variability in the male BMI measures to account for, and hence
this method may be overfitting the data.
Figure 2.7: Model diagnostic plots from STLMM model fit to the data from males in the
Raine Study.
2.4.2.3 Computational Time
This was the most computationally intensive method to fit as it uses the EM algorithm, and
therefore took the longest time to converge. To run 100 basic models as described above, it
takes on average 67.02 minutes (IQR: 66.80-69.58 minutes) per male skew-t model in R-64-bit
version 2.12.1 on a 64-bit operating system with an Intel Core i7 CPU Processor (L 640 @
2.13GHz), and 76.67 minutes (IQR: 76.55-78.48 minutes) for the female skew-t models.
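To put these times in context, the short calculation below compares them with the LMM times reported in Section 2.4.1.3. It is a back-of-the-envelope sketch that simply divides the reported averages, after converting minutes to seconds, and assumes both sets of timings refer to a single model.

```python
# Average run times per model reported in Sections 2.4.1.3 and 2.4.2.3.
lmm_seconds = {"female": 12.14, "male": 14.25}
stlmm_minutes = {"female": 76.67, "male": 67.02}

def slowdown(sex):
    """How many times longer the STLMM takes than the LMM for one model."""
    return stlmm_minutes[sex] * 60.0 / lmm_seconds[sex]

ratios = {sex: round(slowdown(sex)) for sex in lmm_seconds}
```

On these figures the STLMM takes roughly 280 to 380 times as long as the LMM per model, which matters when a model has to be refitted for every genetic variant.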
2.4.3 Semi-Parametric Mixed Model (SPLMM) using Smoothing Splines
SPLMMs make use of smoothing splines, which yield a smoother growth curve estimate than
the polynomial function in the LMM when fitting non-linear relationships. By fitting smoothing
splines in a mixed model framework, inference can easily be performed with established ML
and best prediction methods. An overarching benefit of these models is their ability to be
easily extended; they allow complexities in the data to be incorporated in a straightforward
manner.
2.4.3.1 Method Description
Splines are semi-parametric techniques for fitting smooth curves to data, made up of a series
of piecewise polynomials with smooth joins. Spline interpolation uses low-degree polynomials
in each of the intervals and chooses the polynomial pieces such that they fit smoothly
together. Mathematically, the shape of the curve is fixed by n+1 predefined points (“knots”; (xi,
yi), i = 0, 1, … n) and is fitted by interpolating between all the pairs of knots, (xi-1, yi-1) and (xi, yi),
with polynomials y = qi(x), i = 1, 2, … n. The curve will take a shape that minimizes the amount
of curvature under the constraint of passing through all knots, and the first and second
derivatives of y will be continuous everywhere, including at the knots.
A spline function is defined by generating a matrix of regressors for a spline, which includes the
knot points, the degree of the polynomial in each interval and the degree of smoothness at each
knot (continuous first derivative in this case). The spline function is then used in a model equation.
In this case, the function is used in an LMM framework where the fixed and random effects
time components (age over childhood) are the spline function.
2.4.3.2 Model Fit
The basic model for the jth individual is as follows:
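$$\mathrm{BMI}_{jt} = \beta_0 + \sum_i \beta_i(\mathrm{Age}_{jt} - \overline{\mathrm{Age}})^i + \sum_k \gamma_k\big((\mathrm{Age}_{jt} - \overline{\mathrm{Age}}) - \kappa_k\big)_+^i + u_{0j} + \sum_i b_{ij}(\mathrm{Age}_{jt} - \overline{\mathrm{Age}})^i + \sum_k \eta_{kj}\big((\mathrm{Age}_{jt} - \overline{\mathrm{Age}}) - \kappa_k\big)_+^i + \varepsilon_{jt}$$
where κk is the kth knot and (t - κk)+ = 0 if t ≤ κk and (t - κk) if t > κk, which is known as the
truncated power basis that ensures smooth continuity between the time windows. Various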
numbers and positions of knots and the degree of polynomial between knots were tested to
find the best fit to the data. Knot points were initially estimated visually from both individual
profiles and the population average curve in males and females. To optimise the number and
placement of the knot points, a series of models was fitted with the knot points placed at
6-month intervals around the visually estimated placement, and additional knot points
were incorporated to see if they improved the model fit. The model with the lowest Akaike
Information Criterion (AIC) was selected as the final model. Finally, the degree of polynomial
required for each spline was investigated, up to the third degree, once again selecting the
model with the lowest AIC.
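The truncated power basis described above can be written out explicitly for a single observation. The sketch below builds one row of the regressor matrix: an intercept, the polynomial terms in centred age, and one truncated term per knot. It is a generic illustration and not necessarily the parameterisation used to fit the models; the function names and the centring value of eight years are hypothetical.

```python
def truncated_power(x, knot, degree):
    """(x - knot)_+^degree: zero up to the knot, a polynomial term after it."""
    return (x - knot) ** degree if x > knot else 0.0

def spline_design_row(age, mean_age, knots, degree=3):
    """Intercept, polynomial terms and one truncated term per knot for a single age."""
    x = age - mean_age
    polynomial = [x ** i for i in range(degree + 1)]
    truncated = [truncated_power(x, k, degree) for k in knots]
    return polynomial + truncated

# Knots at two, eight and 12 years, expressed on the centred age scale.
mean_age = 8.0  # hypothetical centring value
knots = [k - mean_age for k in (2.0, 8.0, 12.0)]
row = spline_design_row(age=10.0, mean_age=mean_age, knots=knots)
```

For a child aged 10, the terms for the knots at two and eight years are active while the term for the knot at 12 years is still zero, so each knot only changes the shape of the curve after it has been passed.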
For females and males, the optimal model had three knot points placed at two, eight and
12 years with a cubic slope for each spline. The full spline function was used in both females
and males for the fixed effects, but only the first two parameters corresponding to the
intercept and linear time over the whole period were used for the random effects. The model
output and diagnostics for females and males are presented in Table 2.9 and Table 2.10
respectively.
Similar to the LMM plots, Figure 2.8 shows the residuals for females; the residuals versus age
in Plot A show that the residuals are relatively constant over time, with a few outliers. The
predicted values from the model fit fairly well to the observed BMI values (Plot B), with a
tighter fit to the x=y line than the LMM method and not as much underestimation of the large
BMI values. The variability in the standardised residuals is more constant across the range of
predicted values than in the LMM method (Plot C). Plots D to F are similar to the LMM method,
with not much improvement by adding the smoothing splines to the fixed and random effects
rather than the polynomial function.
Table 2.9: Details of SPLMM model in females (N=733, n=4377). Spline 1 is the change in
slope between two and eight years, Spline 2 is the change in slope after 12 years and Spline 3
is the change in slope before two years.
Goodness of fit criteria
AIC -9515.20
BIC -9425.80
Log Likelihood 4771.60
Random Effects
SD Correlation
Intercept Age
Intercept 0.1297
Age 0.0121 0.7710
Age2 0.0024 -0.6900 -0.4380
Residual 0.0503
Fixed effects
Value SE DF t-value P-Value
Intercept 2.8125 0.0051 3638 546.32 <0.001
Age 0.0339 0.0008 3638 40.66 <0.001
Age2 0.0096 0.0011 3638 8.71 <0.001
Age3 0.00003 0.0006 3638 0.05 0.96
Spline 1 -0.0036 0.0011 3638 -3.22 0.001
Spline 2 0.0029 0.0014 3638 2.14 0.03
Spline 3 0.1437 0.0643 3638 2.23 0.03
Correlation
Intercept Age Age2 Age3 Spline 1 Spline 2
Age 0.38
Age2 -0.31 0.06
Age3 -0.23 -0.13 0.95
Spline 1 0.24 -0.08 -0.99 -0.96
Spline 2 -0.19 0.37 0.86 0.75 -0.89
Spline 3 -0.13 -0.26 0.65 0.81 -0.70 0.45
Standardized Within-Group Residuals
Min 1st quartile Median 3rd quartile Max
-5.5373 -0.4908 0.0039 0.4499 6.8332
Figure 2.8: Model diagnostic plots from SPLMM model fit to the data from females in the
Raine Study.
Table 2.10: Details of SPLMM model in males (N=773, n=4,609). Spline 1 is the change in
slope between two and eight years, Spline 2 is the change in slope after 12 years and Spline 3
is the change in slope before two years.
Goodness of fit criteria
AIC -10036.18
BIC -9946.10
Log Likelihood 5032.09
Random Effects
SD Correlation
Intercept Age
Intercept 0.1246
Age 0.0125 0.7560
Age2 0.0024 -0.6820 -0.3640
Residual 0.0505
Fixed effects
Value SE DF t-value P-Value
Intercept 2.8036 0.0049 3830 576.85 <0.001
Age 0.0308 0.0008 3830 37.79 <0.001
Age2 0.0119 0.0011 3830 10.96 <0.001
Age3 0.0008 0.0005 3830 1.42 0.16
Spline 1 -0.0059 0.0011 3830 -5.35 <0.001
Spline 2 0.0082 0.0014 3830 6.06 <0.001
Spline 3 0.2095 0.0629 3830 3.33 0.001
Correlation
Intercept Age Age2 Age3 Spline 1 Spline 2
Age 0.38
Age2 -0.33 0.06
Age3 -0.25 -0.13 0.95
Spline 1 0.25 -0.07 -0.99 -0.97
Spline 2 -0.20 0.36 0.86 0.75 -0.89
Spline 3 -0.14 -0.26 0.65 0.81 -0.70 0.46
Standardized Within-Group Residuals
Min 1st quartile Median 3rd quartile Max
-4.4810 -0.4565 -0.0076 0.4522 4.2693
Figure 2.9: Model diagnostic plots from SPLMM model fit to the data from males in the
Raine Study
The plots in Figure 2.9 for males are similar to those from the LMM method. It appears that
the smoothing splines allow for better prediction of BMI values than the LMM, as shown in
Plots B and C.
2.4.3.3 Computational Time
On average, each basic model, as described above, takes 17.31 seconds (17.27-17.34 seconds)
for the males in R-64-bit version 2.12.1 on a 64-bit operating system with an Intel Core i7 CPU
Processor (L 640 @ 2.13GHz) and 17.97 seconds (17.94-18.00 seconds) for the females.
2.4.4 Non-Linear Mixed Effects Model (NLMM); also known as the SuperImposition
by Translation And Rotation (SITAR) Model
The minimum point in the BMI trajectory, which occurs at about 5-7 years of age for most
individuals, is commonly known as the adiposity rebound. This rebound can differ between
individuals in three distinct ways: 1) size, where some individuals' minimum BMI is lower than
others'; 2) timing, where some individuals reach their minimum BMI earlier than
others; and 3) duration, where some individuals stay at their minimum for
longer than others. The adiposity rebound is an important marker for later adult disease, as
outlined in Chapter 1, Section 1.5.1.2. All three aspects of the adiposity rebound are targeted
by the SITAR model, which is another extension to the LMM whereby the response is not
assumed to be Gaussian but may come from some other exponential distribution.
2.4.4.1 Method Description
The SITAR model [220] was recently defined to summarize height growth in puberty (in
particular peak height velocity) and estimate subject-specific parameters that can be used to
investigate relationships with earlier exposures and later outcomes. The SITAR method
(referred to here as NLMM) has a single fitted curve at the population level and individual level
estimates of mean differences in size (shifting up or down of the BMI curve), growth tempo
(left-right shift of the curve on the age scale) and velocity (shrinking or stretching of the age
scale).
The basic model for the growth curve is:
$$y_{it} = \alpha_i + h\!\left(\frac{t - \beta_i}{e^{-\gamma_i}}\right) + \varepsilon_{it}$$
Where:
yit = growth of subject i at age t
h(t) = natural cubic spline curve of growth versus age
αi = random growth intercept that adjusts for differences in mean BMI (size)
βi = random growth intercept to adjust for differences in timing (tempo)
γi = random age scaling adjusting for the duration of the growth spurt (velocity)
εit = within subject residual error
This model was fitted using the nlme function in R with the h(t) function estimated as a fixed-
effect natural cubic spline.
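The roles of the three subject-specific parameters can be seen by applying them to a simple curve. In the sketch below, h(t) is a hypothetical quadratic with its minimum at six years, standing in for the fitted natural cubic spline; the parameter values are likewise illustrative. The function assumes the usual SITAR form, in which size shifts the curve vertically, tempo shifts it along the age axis and velocity rescales the age axis.

```python
import math

def mean_curve(age):
    """Hypothetical population curve h(t) with a minimum at 6 years (stand-in for the spline)."""
    return 15.5 + 0.05 * (age - 6.0) ** 2

def sitar_curve(age, size=0.0, tempo=0.0, velocity=0.0, h=mean_curve):
    """Subject-specific curve: size + h((age - tempo) / exp(-velocity))."""
    return size + h((age - tempo) / math.exp(-velocity))

def age_at_minimum(curve, ages):
    """Age on the grid at which the curve is lowest."""
    return min(ages, key=curve)

ages = [a / 10.0 for a in range(20, 121)]  # 2.0 to 12.0 years in steps of 0.1
average = age_at_minimum(lambda a: sitar_curve(a), ages)
later = age_at_minimum(lambda a: sitar_curve(a, tempo=1.0), ages)
bigger = sitar_curve(6.0, size=1.5) - sitar_curve(6.0)
```

A tempo of one year moves the minimum from six to seven years, and a size of 1.5 raises the whole curve by 1.5 units without changing its shape.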
2.4.4.2 Model Fit
This model was fitted with the three parameters (size, tempo and velocity) as random effects, size
and velocity as fixed effects, and h(t) a natural cubic spline curve with 3 to 8 degrees of
freedom (df) fitted as fixed effects. BMI and age were fitted both untransformed and natural log
transformed to identify the best fit to the data. Model fits were compared using AIC,
deviance and residual standard deviation.
The optimal model for females had a natural cubic spline curve with three df and both BMI and
age on the natural log transformed scale. Similarly, the optimal model for males was with BMI
and age on the natural log transformed scale but with four df for the natural cubic spline
curve. The model output from each of the models is presented in Table 2.11 and Figure 2.10
for females and Table 2.12 and Figure 2.11 for males.
Table 2.11: Details of NLMM model in females (N=733, n=4,377)
Goodness of fit criteria
AIC -9575.00
BIC -9498.39
Log Likelihood 4799.50
Random Effects
SD Correlation
Size Tempo
Size 0.0739
Tempo 0.2774 -0.4960
Velocity 0.2294 -0.3690 0.4940
Residual 0.0513
Fixed effects
Value SE DF t-value P-Value
Knot 1 0.0220 0.0053 3640 4.17 <0.001
Knot 2 0.0753 0.0058 3640 13.05 <0.001
Knot 3 0.2408 0.0128 3640 18.78 <0.001
Size 2.7790 0.0037 3640 742.08 <0.001
Velocity 0.5571 0.0392 3640 14.20 <0.001
Correlation
Knot 1 Knot 2 Knot 3 Size
Knot 2 0.80
Knot 3 0.48 0.69
Size -0.02 0.21 0.58
Velocity -0.14 -0.43 -0.91 -0.68
Standardized Within-Group Residuals
Min 1st quartile Median 3rd quartile Max
-5.6123 -0.4751 -0.0144 0.4474 6.3412
Figure 2.10: Model diagnostic plots from NLMM model fit to the data from females in the
Raine Study.
Table 2.12: Details of NLMM model in males (N=773, n=4,609)
Goodness of fit criteria
AIC -10099.41
BIC -10015.75
Log Likelihood 5062.71
Random Effects
SD Correlation
Size Tempo
Size 0.0745
Tempo 0.2819 -0.5540
Velocity 0.2001 -0.5340 0.5630
Residual 0.0521
Fixed effects
Value SE DF t-value P-Value
Knot 1 -0.1792 0.0242 3831 -7.42 <0.001
Knot 2 0.1314 0.0256 3831 5.14 <0.001
Knot 3 0.2390 0.0453 3831 5.28 <0.001
Knot 4 0.4917 0.0658 3831 7.47 <0.001
Size 2.8841 0.0150 3831 192.88 <0.001
Velocity -0.0372 0.0838 3831 -0.44 0.6576
Correlation
Knot 1 Knot 2 Knot 3 Knot 4 Size
Knot 2 -0.92
Knot 3 -0.94 0.98
Knot 4 -0.96 0.97 0.99
Size -0.97 0.92 0.93 0.96
Velocity 0.98 -0.96 -0.98 -0.99 -0.97
Standardized Within-Group Residuals
Min 1st quartile Median 3rd quartile Max
-5.9288 -0.4479 -0.0150 0.4276 4.7460
Figure 2.11: Model diagnostic plots from NLMM model fit to the data from males in the
Raine Study.
The estimates for the three parameters (size, tempo and velocity) were extracted for each
individual from the best fitting NLMM model and used for genetic analyses. Firstly, these
parameters were investigated against later outcomes, including BMI, waist circumference,
waist-hip ratio and suprailiac skinfold at 17 years of age (Table 2.13). It was evident that all
three parameters in females were highly predictive of these outcomes, whereas only size and
tempo were predictive of outcome in males. A later age of adiposity rebound is
associated with decreased markers of obesity in both males and females. Additionally, a
shorter rebound period is associated with increased markers of obesity, in females only. Males
with a longer rebound period had lower suprailiac skinfolds than males with short rebound
periods.
Table 2.13: Results from association analysis between the estimates of the three parameters
from the NLMM model and markers of obesity at age 17 years.
Sex     Parameter  Log(BMI)              Log(Waist Circumference)  Waist-hip ratio       Suprailiac skinfold (log for males)
                   Effect (SE)    P      Effect (SE)    P          Effect (SE)    P      Effect (SE)     P
Male    Size       1.29 (0.09)    <0.01  0.80 (0.07)    <0.01      0.17 (0.04)    <0.01  3.40 (0.35)     <0.01
Male    Tempo      -0.48 (0.02)   <0.01  -0.31 (0.02)   <0.01      -0.08 (0.01)   <0.01  -1.41 (0.08)    <0.01
Male    Velocity   -0.04 (0.05)   0.36   -0.04 (0.04)   0.34       0.02 (0.02)    0.15   -0.62 (0.17)    <0.01
Female  Size       1.41 (0.10)    <0.01  0.75 (0.09)    <0.01      0.14 (0.04)    <0.01  60.54 (6.95)    <0.01
Female  Tempo      -0.46 (0.02)   <0.01  -0.28 (0.02)   <0.01      -0.07 (0.01)   <0.01  -22.98 (1.76)   <0.01
Female  Velocity   0.26 (0.04)    <0.01  0.18 (0.03)    <0.01      0.07 (0.02)    <0.01  15.26 (2.72)    <0.01
2.4.4.3 Computational Time
There are two different ways genetic variables could be added into these models. Firstly, they
could be included in the model itself; running 100 basic models as described above, it takes on
average 36.68 seconds (36.62-36.74 seconds) per male model in R-64-bit version 2.12.1 on a 64-bit
operating system with an Intel Core i7 CPU Processor (L 640 @ 2.13GHz) and 27.00 seconds
(26.85-27.09 seconds) per female model. Alternatively, the predictions (estimates of size,
tempo and velocity for each individual) could be extracted and treated as the dependent
variables in a linear regression model, which would dramatically reduce the time for each
genetic locus (time not shown).
2.5 Genetic Associations
To investigate whether these chosen methods are appropriate for detecting genetic markers
that have an effect on childhood BMI and the change in BMI over childhood, 17 genetic
variants were selected to include in the best model from each method.
2.5.1 SNP Selection
The 17 genetic variants published in den Hoed et al [204] were selected to investigate the
association with childhood BMI, and more importantly the change in BMI over childhood.
These SNPs were initially discovered to be associated with adult BMI and were subsequently
replicated in at least one study of BMI and change in BMI over childhood. At the time of
selecting SNPs for this study, they were the largest set of SNPs shown to be associated with
BMI over childhood and adolescence. Loci illustrated to be associated with only obesity risk
and not BMI were excluded. Subsets of these 17 SNPs (either the same SNPs or a SNP in high
LD [r2>0.8]) were also presented by Elks et al [206] and Hardy et al [205], who showed
associations with changes in growth over childhood. Genotype information on these 17
published genetic variants was available for individuals in the Raine Study sample (genotype
data are described in Section 1.6.1.3), either as directly genotyped SNPs (rs925946 (BDNF),
rs10913469 (SEC16B), rs2605100 (LYPLAL1), rs987237 (TFAP2B), rs10838738 (MTCH2),
rs7138803 (BCDIN3D) and rs10146997 (NRXN3)) or from the best guess genotype data
imputed against HapMap release 22 (rs2815752 (NEGR1), rs6548238 (TMEM18), rs7647305
(ETV5), rs10938397 (GNPDA2), rs613080 (MSRA), rs1488830 (BDNF), rs8055138 (SH2B1),
rs1121980 (FTO), rs17782313 (MC4R) and rs11084753 (KCTD15)). Two variants in the BDNF
gene, previously shown to be independently associated with obesity [179], were
investigated (r² = 0.11). The 17 SNPs are described in Table 2.14, including the available sample
size with complete data for each SNP. These 17 SNPs were used to investigate the sensitivity of
each method to detect genetic effects in terms of point estimates and standard errors across
various time points. Each SNP was incorporated into the model independently assuming an
additive genetic effect for the BMI increasing allele. In addition, an ‘obesity-risk allele score’
was created on the subset of individuals with complete genetic data by summing the number
of risk alleles an individual had (n=1,219) [231]. The alleles were not weighted by their effect
size, as this has previously been shown to have only limited benefit [232].
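The construction of the unweighted allele score can be sketched in a few lines. The genotypes below are hypothetical and only three of the 17 SNPs are shown; each value is the number of BMI-increasing alleles (0, 1 or 2) carried at that SNP. Returning no score when any genotype is missing mirrors the restriction to individuals with complete genetic data.

```python
# Hypothetical genotypes for two individuals: number of BMI-increasing alleles (0, 1 or 2)
# at each SNP. Only three of the 17 SNPs are shown to keep the example short.
genotypes = {
    "person_1": {"rs1121980": 2, "rs17782313": 1, "rs10913469": 0},
    "person_2": {"rs1121980": 1, "rs17782313": None, "rs10913469": 1},
}

def allele_score(risk_allele_counts):
    """Unweighted score: sum of risk-allele counts, or None if any genotype is missing."""
    counts = list(risk_allele_counts.values())
    if any(c is None for c in counts):
        return None  # mirrors restricting the score to individuals with complete data
    return sum(counts)

scores = {person: allele_score(g) for person, g in genotypes.items()}
```

With 17 SNPs the score can range from 0 to 34, and it enters the model as a single additive covariate in place of an individual SNP.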
Due to the population stratification detected in the Raine Study (see Chapter 1, Section
1.6.1.3), analysis was conducted adjusting for the first five principal components generated in
the EIGENSTRAT software [39]. No adjustment for multiple testing has been made as the aim
was to estimate a combined effect of SNPs that have already been validated in previous
studies and shown to be significantly associated with childhood BMI and growth. Therefore,
genetic loci were considered associated with BMI if the global LRT was significant at the
α=0.05 level.
Table 2.14: Characteristics from the Raine Study sample of the 17 SNPs investigated in each of the statistical methods
SNP         Gene     Chr.   Obesity Risk Allele  MAF   HWE-P  Major Homozygote  Heterozygote  Minor Homozygote  N female  N male
rs2815752   NEGR1    1p31   A                    0.38  0.47   583               687           219               724       765
rs10913469  SEC16B   1q25   G                    0.20  0.57   953               471           64                723       765
rs2605100   LYPLAL1  1q41   G                    0.30  0.71   726               623           140               724       765
rs6548238   TMEM18   2p25   C                    0.15  0.48   1029              370           38                698       739
rs7647305   ETV5     3q28   C                    0.21  0.11   928               508           53                724       765
rs10938397  GNPDA2   4p12   G                    0.44  0.02   492               678           301               716       755
rs987237    TFAP2B   6p12   G                    0.19  0.24   970               473           46                724       765
rs613080    MSRA     8      G                    0.14  0.19   1072              368           41                720       761
rs10838738  MTCH2    11p11  G                    0.36  0.96   604               690           195               724       765
rs1488830   BDNF     11p13  T                    0.21  0.38   938               472           68                717       761
rs925946    BDNF     11p13  T                    0.30  0.11   745               597           146               723       765
rs7138803   BCDIN3D  12q13  A                    0.37  0.11   612               662           215               724       765
rs10146997  NRXN3    14q31  G                    0.22  0.15   919               487           80                722       764
rs8055138   SH2B1    16p11  T                    0.38  0.87   573               691           213               718       759
rs1121980   FTO      16q12  T                    0.41  0.83   509               713           243               712       753
rs17782313  MC4R     18q22  C                    0.23  0.83   881               531           77                724       765
rs11084753  KCTD15   19q13  G                    0.35  0.43   574               591           168               645       688
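The minor allele frequency (MAF) and a Hardy-Weinberg equilibrium (HWE) P-value can be derived from the three genotype counts in Table 2.14. The sketch below uses the counts for rs2815752 and a one degree of freedom chi-square goodness-of-fit test. The choice of test is an assumption, since the test used to produce the tabulated HWE P-values is not stated here; an exact test would give slightly different values.

```python
import math

def minor_allele_frequency(n_major_hom, n_het, n_minor_hom):
    """Proportion of minor alleles among all alleles in the sample."""
    n = n_major_hom + n_het + n_minor_hom
    return (n_het + 2 * n_minor_hom) / (2 * n)

def hwe_chi_square_p(n_major_hom, n_het, n_minor_hom):
    """P-value of the 1 df chi-square test comparing observed and HWE-expected counts."""
    n = n_major_hom + n_het + n_minor_hom
    q = minor_allele_frequency(n_major_hom, n_het, n_minor_hom)
    p = 1.0 - q
    expected = (n * p * p, 2.0 * n * p * q, n * q * q)
    observed = (n_major_hom, n_het, n_minor_hom)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return math.erfc(math.sqrt(stat / 2.0))

# Genotype counts for rs2815752 (NEGR1) from Table 2.14.
counts = (583, 687, 219)
maf = minor_allele_frequency(*counts)
hwe_p = hwe_chi_square_p(*counts)
```

For this SNP the calculation gives a MAF that rounds to the tabulated 0.38 and a P-value close to the tabulated 0.47.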
2.5.2 Cross-Sectional Analyses
Cross-sectional analyses at each scheduled follow-up time were conducted to characterize the
genetic associations of each locus. Sex stratified linear models were used, adjusting for mean
centred age (Age ) and the SNP under an additive genetic model. In both males and females,
BMI (and the residuals) were approximately normally distributed from years one to three;
however, the distribution became increasingly skewed from year six, and hence the natural
logarithm of BMI was used from year six onwards. The results are presented in Table 2.15.
In females, a consistently significant association between BMI and the BMI increasing allele of
the SEC16B rs10913469 SNP was detected from age 6 to 17, with an increasing effect size over
time (β=0.0171 at age 6 and β=0.0372 by age 17). The BMI increasing allele of the BCDIN3D
rs7138803 SNP was associated with BMI at age 14, which may indicate that this SNP has an effect
on BMI after puberty. Finally, there was a significant association between BMI and the
BMI increasing allele of the KCTD15 rs11084753 SNP at ages 8 and 14.
The results in males are more complex than those in females, with associations between a number
of SNPs and BMI at multiple ages. The BMI increasing alleles of the loci in SEC16B, GNPDA2, MRSA,
MTCH2, BDNF (rs1488830), BCDIN3D, NRXN3 and SH2B1 all had significant associations at a
single time point. The BMI increasing allele of the TMEM18 rs6548238 SNP was associated
with BMI at ages 10 and 14, with a slight increase in effect size over time (β=0.0229 at age 10
and β=0.0249 by age 14). Similarly, there was an association between MC4R and BMI at ages
14 and 17 years and between the BMI increasing allele of TFAP2B rs987237 and BMI at ages 3
to 10. Finally, the BMI increasing allele of FTO rs1121980 showed strong associations with BMI
from age 6 to 17.
Each of these associations is based on a significance level of α=0.05, which is not an
appropriate threshold given the number of tests conducted. Using a Bonferroni adjustment for
the number of SNPs and number of time points analysed in each gender (17 SNPs × 8 time points
= 136 tests; α=0.05/136≈0.00037), only the FTO association at age eight in males would remain
significantly associated with BMI.
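The Bonferroni threshold quoted above can be checked with a few lines of arithmetic. This is a minimal sketch; the function name is hypothetical, and the count of eight time points is taken from the follow-ups listed in Table 2.15.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


N_SNPS = 17        # SNPs listed in Table 2.14
N_TIME_POINTS = 8  # follow-ups at years 1, 2, 3, 6, 8, 10, 14 and 17

threshold = bonferroni_threshold(0.05, N_SNPS * N_TIME_POINTS)
print(f"{threshold:.5f}")  # 0.00037

# P-values for FTO rs1121980 in males at years 8 and 10 (Table 2.15).
print(9.01e-6 < threshold, 4.18e-4 < threshold)  # True False
```

Only the year 8 P-value falls below the corrected threshold; the year 10 value (4.18×10⁻⁴) narrowly misses it.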
Table 2.15: Summary of cross-sectional results for the 17 SNPs. Significant P-Values are in bold.

| SNP (Gene) | Follow-up | Females Beta (SE) | Females P | Males Beta (SE) | Males P |
|---|---|---|---|---|---|
| rs2815752 (NEGR1) | Year 1 | 0.0504 (0.0784) | 0.52 | 0.0163 (0.0749) | 0.83 |
| | Year 2 | -0.1269 (0.1306) | 0.33 | 0.2049 (0.1282) | 0.11 |
| | Year 3 | 0.0685 (0.0884) | 0.44 | 0.1018 (0.0777) | 0.19 |
| | Year 6 | 0.0094 (0.0062) | 0.13 | 0.0069 (0.0055) | 0.21 |
| | Year 8 | 0.0106 (0.0082) | 0.20 | 0.0092 (0.0076) | 0.23 |
| | Year 10 | 0.0135 (0.0101) | 0.18 | 0.0102 (0.0095) | 0.29 |
| | Year 14 | 0.0074 (0.0108) | 0.49 | 0.0151 (0.0104) | 0.15 |
| | Year 17 | -0.0040 (0.0116) | 0.73 | 0.0110 (0.0113) | 0.33 |
| rs10913469 (SEC16B) | Year 1 | -0.0138 (0.0974) | 0.89 | -0.0497 (0.0872) | 0.57 |
| | Year 2 | 0.0974 (0.1714) | 0.57 | 0.3207 (0.1573) | 0.04 |
| | Year 3 | 0.1972 (0.1098) | 0.07 | -0.0311 (0.0910) | 0.73 |
| | Year 6 | 0.0171 (0.0077) | 0.03 | 0.0022 (0.0064) | 0.73 |
| | Year 8 | 0.0306 (0.0102) | 0.003 | -0.0028 (0.0088) | 0.75 |
| | Year 10 | 0.0296 (0.0125) | 0.02 | -0.0007 (0.0110) | 0.95 |
| | Year 14 | 0.0424 (0.0132) | 0.001 | 0.0134 (0.0122) | 0.27 |
| | Year 17 | 0.0372 (0.0144) | 0.01 | -0.0175 (0.0132) | 0.19 |
| rs2605100 (LYPLAL1) | Year 1 | -0.1300 (0.0810) | 0.11 | -0.1091 (0.0774) | 0.16 |
| | Year 2 | 0.0080 (0.1390) | 0.95 | -0.0257 (0.1346) | 0.85 |
| | Year 3 | -0.0910 (0.0928) | 0.33 | -0.0846 (0.0806) | 0.29 |
| | Year 6 | -0.0010 (0.0066) | 0.88 | -0.0073 (0.0058) | 0.21 |
| | Year 8 | 0.0011 (0.0087) | 0.90 | -0.0031 (0.0079) | 0.69 |
| | Year 10 | -0.0027 (0.0108) | 0.80 | -0.0016 (0.0099) | 0.87 |
| | Year 14 | -0.0033 (0.0111) | 0.77 | -0.0085 (0.0109) | 0.44 |
| | Year 17 | -0.0118 (0.0122) | 0.33 | -0.0052 (0.0117) | 0.66 |
| rs6548238 (TMEM18) | Year 1 | 0.0138 (0.0874) | 0.87 | 0.0842 (0.0892) | 0.35 |
| | Year 2 | -0.0778 (0.1479) | 0.60 | 0.0463 (0.1591) | 0.77 |
| | Year 3 | 0.1032 (0.0970) | 0.29 | 0.0133 (0.0921) | 0.89 |
| | Year 6 | 0.0048 (0.0071) | 0.50 | 0.0040 (0.0068) | 0.56 |
| | Year 8 | 0.0036 (0.0093) | 0.70 | 0.0138 (0.0093) | 0.14 |
| | Year 10 | 0.0029 (0.0115) | 0.80 | 0.0229 (0.0114) | 0.04 |
| | Year 14 | -0.0002 (0.0119) | 0.99 | 0.0249 (0.0126) | 0.05 |
| | Year 17 | 0.0104 (0.0132) | 0.43 | 0.0260 (0.0138) | 0.06 |
| rs7647305 (ETV5) | Year 1 | 0.0919 (0.0966) | 0.34 | 0.0973 (0.0901) | 0.28 |
| | Year 2 | 0.0972 (0.1576) | 0.54 | 0.1157 (0.1473) | 0.43 |
| | Year 3 | 0.0453 (0.1065) | 0.67 | -0.0002 (0.0961) | 1.00 |
| | Year 6 | -0.0042 (0.0078) | 0.59 | 0.0035 (0.0067) | 0.60 |
| | Year 8 | -0.0021 (0.0101) | 0.84 | 0.0054 (0.0093) | 0.56 |
| | Year 10 | -0.0006 (0.0126) | 0.96 | 0.0112 (0.0117) | 0.34 |
| | Year 14 | 0.0021 (0.0131) | 0.87 | 0.0089 (0.0126) | 0.48 |
| | Year 17 | -0.0055 (0.0141) | 0.70 | 0.0207 (0.0132) | 0.12 |
| rs10938397 (GNPDA2) | Year 1 | -0.0453 (0.0714) | 0.53 | 0.0539 (0.0724) | 0.46 |
| | Year 2 | 0.0224 (0.1196) | 0.85 | 0.1623 (0.1380) | 0.24 |
| | Year 3 | 0.0500 (0.0800) | 0.53 | 0.0863 (0.0755) | 0.25 |
| | Year 6 | 0.0014 (0.0058) | 0.81 | 0.0067 (0.0054) | 0.22 |
| | Year 8 | 0.0023 (0.0077) | 0.77 | 0.0054 (0.0074) | 0.47 |
| | Year 10 | 0.0088 (0.0094) | 0.35 | 0.0214 (0.0093) | 0.02 |
| | Year 14 | 0.0108 (0.0097) | 0.27 | 0.0102 (0.0102) | 0.32 |
| | Year 17 | 0.0191 (0.0106) | 0.07 | 0.0127 (0.0110) | 0.25 |
| rs987237 (TFAP2B) | Year 1 | 0.1815 (0.0991) | 0.07 | 0.0098 (0.0929) | 0.92 |
| | Year 2 | 0.0395 (0.1742) | 0.82 | 0.0055 (0.1789) | 0.98 |
| | Year 3 | 0.0501 (0.1109) | 0.65 | 0.2436 (0.0937) | 0.01 |
| | Year 6 | 0.0086 (0.0079) | 0.28 | 0.0188 (0.0068) | 0.01 |
| | Year 8 | 0.0067 (0.0106) | 0.53 | 0.0219 (0.0093) | 0.02 |
| | Year 10 | 0.0056 (0.0129) | 0.67 | 0.0227 (0.0118) | 0.05 |
| | Year 14 | 0.0150 (0.0136) | 0.27 | 0.0181 (0.0127) | 0.16 |
| | Year 17 | -0.0080 (0.0148) | 0.59 | 0.0232 (0.0137) | 0.10 |
| rs613080 (MRSA) | Year 1 | -0.0486 (0.1028) | 0.64 | 0.0368 (0.1002) | 0.71 |
| | Year 2 | -0.1139 (0.1927) | 0.56 | -0.4622 (0.1985) | 0.02 |
| | Year 3 | 0.1386 (0.1114) | 0.21 | 0.0190 (0.1017) | 0.85 |
| | Year 6 | -0.0015 (0.0084) | 0.86 | 0.0069 (0.0074) | 0.35 |
| | Year 8 | -0.0108 (0.0109) | 0.32 | 0.0103 (0.0101) | 0.31 |
| | Year 10 | -0.0198 (0.0134) | 0.14 | -0.0041 (0.0128) | 0.75 |
| | Year 14 | 0.0077 (0.0138) | 0.58 | 0.0102 (0.0139) | 0.46 |
| | Year 17 | 0.0181 (0.0150) | 0.23 | 0.0084 (0.0147) | 0.57 |
| rs10838738 (MTCH2) | Year 1 | -0.0561 (0.0763) | 0.46 | -0.0248 (0.0768) | 0.75 |
| | Year 2 | 0.0275 (0.1251) | 0.83 | -0.0678 (0.1393) | 0.63 |
| | Year 3 | -0.0326 (0.0840) | 0.70 | 0.1353 (0.0781) | 0.08 |
| | Year 6 | -0.0047 (0.0061) | 0.45 | 0.0115 (0.0057) | 0.04 |
| | Year 8 | -0.0018 (0.0082) | 0.83 | 0.0143 (0.0078) | 0.07 |
| | Year 10 | -0.0026 (0.0101) | 0.79 | 0.0135 (0.0097) | 0.17 |
| | Year 14 | 0.0015 (0.0104) | 0.89 | 0.0121 (0.0107) | 0.26 |
| | Year 17 | -0.0063 (0.0114) | 0.58 | 0.0102 (0.0114) | 0.37 |
| rs1488830 (BDNF) | Year 1 | -0.0381 (0.0916) | 0.68 | 0.1481 (0.0868) | 0.09 |
| | Year 2 | 0.1963 (0.1577) | 0.22 | 0.2860 (0.1548) | 0.07 |
| | Year 3 | -0.0225 (0.1020) | 0.83 | 0.0411 (0.0941) | 0.66 |
| | Year 6 | 0.0041 (0.0074) | 0.58 | 0.0092 (0.0065) | 0.16 |
| | Year 8 | 0.0098 (0.0097) | 0.31 | 0.0169 (0.0089) | 0.06 |
| | Year 10 | 0.0176 (0.0122) | 0.15 | 0.0137 (0.0110) | 0.21 |
| | Year 14 | 0.0010 (0.0124) | 0.94 | 0.0270 (0.0124) | 0.03 |
| | Year 17 | 0.0055 (0.0132) | 0.68 | 0.0224 (0.0131) | 0.09 |
| rs925946 (BDNF) | Year 1 | -0.0960 (0.0812) | 0.40 | 0.1370 (0.0771) | 0.08 |
| | Year 2 | 0.0969 (0.1493) | 0.52 | 0.1488 (0.1377) | 0.28 |
| | Year 3 | -0.1415 (0.0885) | 0.11 | 0.0274 (0.0818) | 0.74 |
| | Year 6 | -0.0066 (0.0065) | 0.31 | 0.0062 (0.0058) | 0.29 |
| | Year 8 | -0.0072 (0.0085) | 0.40 | 0.0047 (0.0079) | 0.55 |
| | Year 10 | -0.0019 (0.0105) | 0.86 | 0.0063 (0.0100) | 0.53 |
| | Year 14 | -0.0039 (0.0111) | 0.72 | 0.0100 (0.0108) | 0.36 |
| | Year 17 | -0.0001 (0.0119) | 0.99 | 0.0087 (0.0114) | 0.45 |
| rs7138803 (BCDIN3D) | Year 1 | -0.0185 (0.0778) | 0.81 | -0.0191 (0.0731) | 0.79 |
| | Year 2 | 0.0826 (0.1308) | 0.53 | -0.0309 (0.1277) | 0.81 |
| | Year 3 | 0.0974 (0.0847) | 0.25 | 0.0385 (0.0762) | 0.61 |
| | Year 6 | 0.0096 (0.0063) | 0.13 | 0.0083 (0.0054) | 0.12 |
| | Year 8 | 0.0152 (0.0082) | 0.06 | 0.0140 (0.0073) | 0.06 |
| | Year 10 | 0.0121 (0.0104) | 0.25 | 0.0158 (0.0092) | 0.09 |
| | Year 14 | 0.0229 (0.0106) | 0.03 | 0.0223 (0.0101) | 0.03 |
| | Year 17 | 0.0182 (0.0116) | 0.12 | 0.0197 (0.0111) | 0.08 |
| rs10146997 (NRXN3) | Year 1 | 0.0554 (0.0890) | 0.53 | -0.0020 (0.0866) | 0.98 |
| | Year 2 | 0.1023 (0.1382) | 0.46 | -0.2526 (0.1515) | 0.10 |
| | Year 3 | 0.0055 (0.0987) | 0.96 | 0.1056 (0.0868) | 0.23 |
| | Year 6 | 0.0024 (0.0072) | 0.74 | -0.0027 (0.0065) | 0.68 |
| | Year 8 | -0.0005 (0.0095) | 0.96 | -0.0053 (0.0089) | 0.55 |
| | Year 10 | -0.0074 (0.0117) | 0.53 | -0.0085 (0.0109) | 0.44 |
| | Year 14 | 0.0102 (0.0123) | 0.41 | -0.0226 (0.0120) | 0.06 |
| | Year 17 | 0.0027 (0.0133) | 0.84 | -0.0409 (0.0130) | 0.002 |
| rs8055138 (SH2B1) | Year 1 | -0.0480 (0.0761) | 0.53 | 0.0838 (0.0766) | 0.27 |
| | Year 2 | 0.0145 (0.1315) | 0.91 | 0.0116 (0.1357) | 0.92 |
| | Year 3 | -0.0212 (0.0837) | 0.80 | 0.1382 (0.0782) | 0.08 |
| | Year 6 | 0.0010 (0.0061) | 0.87 | 0.0120 (0.0056) | 0.03 |
| | Year 8 | 0.0027 (0.0080) | 0.73 | 0.0154 (0.0077) | 0.05 |
| | Year 10 | -0.0006 (0.0101) | 0.95 | 0.0140 (0.0096) | 0.14 |
| | Year 14 | 0.0106 (0.0104) | 0.31 | 0.0148 (0.0105) | 0.16 |
| | Year 17 | 0.0082 (0.0114) | 0.47 | 0.0073 (0.0113) | 0.52 |
| rs1121980 (FTO) | Year 1 | 0.0281 (0.0774) | 0.72 | 0.0235 (0.0752) | 0.75 |
| | Year 2 | 0.1126 (0.1337) | 0.40 | 0.0106 (0.1374) | 0.94 |
| | Year 3 | 0.0546 (0.0866) | 0.53 | 0.0580 (0.0776) | 0.46 |
| | Year 6 | 0.0092 (0.0063) | 0.14 | 0.0159 (0.0056) | 0.004 |
| | Year 8 | 0.0144 (0.0082) | 0.08 | 0.0339 (0.0076) | 9.01×10⁻⁶ |
| | Year 10 | 0.0202 (0.0101) | 0.05 | 0.0336 (0.0095) | 4.18×10⁻⁴ |
| | Year 14 | 0.0159 (0.0106) | 0.14 | 0.0297 (0.0103) | 0.004 |
| | Year 17 | 0.0079 (0.0116) | 0.49 | 0.0326 (0.0112) | 0.004 |
| rs17782313 (MC4R) | Year 1 | -0.0462 (0.0884) | 0.60 | -0.0203 (0.0892) | 0.82 |
| | Year 2 | -0.0721 (0.1445) | 0.62 | 0.0823 (0.1571) | 0.60 |
| | Year 3 | 0.0344 (0.0989) | 0.73 | -0.0068 (0.0898) | 0.94 |
| | Year 6 | -0.0042 (0.0071) | 0.56 | 0.0071 (0.0065) | 0.28 |
| | Year 8 | 0.0027 (0.0093) | 0.77 | 0.01411 (0.0089) | 0.11 |
| | Year 10 | 0.0102 (0.0116) | 0.38 | 0.0124 (0.0112) | 0.27 |
| | Year 14 | -0.0023 (0.0120) | 0.85 | 0.0265 (0.0122) | 0.03 |
| | Year 17 | 0.0002 (0.0132) | 0.99 | 0.0276 (0.0131) | 0.04 |
| rs11084753 (KCTD15) | Year 1 | -0.0049 (0.0697) | 0.94 | 0.0758 (0.0679) | 0.26 |
| | Year 2 | -0.0830 (0.1164) | 0.48 | -0.0448 (0.1175) | 0.70 |
| | Year 3 | 0.0357 (0.0774) | 0.65 | 0.0880 (0.0734) | 0.23 |
| | Year 6 | 0.0057 (0.0056) | 0.30 | -0.0022 (0.0051) | 0.67 |
| | Year 8 | 0.0177 (0.0073) | 0.02 | -0.0006 (0.0069) | 0.93 |
| | Year 10 | 0.0139 (0.0090) | 0.12 | -0.0123 (0.0086) | 0.15 |
| | Year 14 | 0.0241 (0.0094) | 0.01 | -0.0012 (0.0094) | 0.90 |
| | Year 17 | 0.0081 (0.0101) | 0.42 | 0.0013 (0.0102) | 0.90 |
| Allele score | Year 1 | -0.0147 (0.0203) | 0.47 | 0.0233 (0.0183) | 0.20 |
| | Year 2 | 0.0201 (0.0346) | 0.56 | 0.0361 (0.0332) | 0.28 |
| | Year 3 | 0.0231 (0.0220) | 0.29 | 0.0520 (0.0187) | 0.01 |
| | Year 6 | 0.0027 (0.0016) | 0.09 | 0.0053 (0.0013) | 9.27×10⁻⁵ |
| | Year 8 | 0.0058 (0.0021) | 0.01 | 0.0086 (0.0018) | 3.62×10⁻⁶ |
| | Year 10 | 0.0061 (0.0027) | 0.02 | 0.0088 (0.0023) | 1.79×10⁻⁴ |
| | Year 14 | 0.0094 (0.0028) | 0.001 | 0.0101 (0.0025) | 5.87×10⁻⁵ |
| | Year 17 | 0.0057 (0.0030) | 0.06 | 0.0089 (0.0028) | 0.001 |
2.5.3 Longitudinal Analyses
Several genetic associations were detected between longitudinal BMI and the previously
reported adult obesity loci, after adjustment for the first five principal components; however,
the detected associations differed by the statistical method used. An LRT indicated that the LMM
method detected one significant association in females and three in males at the 5% level
of significance. The STLMM method detected additional genetic associations, with three in
females and four in males. The SPLMM detected the most significant associations, with five in
females and four in males. Finally, the NLMM method detected no significant SNPs in either
females or males for the size parameter, but two significant SNPs for the tempo parameter in
females and four in males, in addition to one significant SNP for velocity in males. Results
for all 17 SNPs can be found in Table 2.16 (females) and Table 2.17 (males).
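As a rough illustration of how a global LRT P-value of this kind is obtained, the sketch below compares the log-likelihoods of two nested models. It rests on assumptions not stated in this section: that the global test for the polynomial models has four degrees of freedom (the SNP main effect plus its interactions with Age, Age² and Age³), that both models are fitted by maximum likelihood, and that the log-likelihood values shown are hypothetical. The function names are also hypothetical.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable with even df (closed form)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


def likelihood_ratio_test(loglik_null, loglik_full, df):
    """LRT statistic and P-value for two nested models fitted by maximum likelihood."""
    statistic = 2.0 * (loglik_full - loglik_null)
    return statistic, chi2_sf_even_df(max(statistic, 0.0), df)


# Hypothetical log-likelihoods, for illustration only (not values from this study).
stat, p_value = likelihood_ratio_test(loglik_null=-1234.5, loglik_full=-1228.0, df=4)
print(round(stat, 1), round(p_value, 3))
```

The closed-form tail probability is used only to keep the example free of external libraries; a statistics package would normally supply the chi-square distribution directly.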
With the exception of the STLMM method, all methods detected the significant association in
females between BMI and the BMI increasing allele of rs10913469 (SEC16B), observed in the
cross-sectional analysis. However, the STLMM method identified two significant associations
with loci that were not detected in the cross-sectional analyses, rs987237 (TFAP2B) and
rs1121980 (FTO). In addition, the SPLMM method detected four associations with loci that were
not observed in the cross-sectional analysis: TFAP2B, MRSA, BDNF (rs1488830) and
NRXN3. None of the longitudinal methods detected the association observed in the
cross-sectional analysis with the BMI increasing allele of rs7138803 (BCDIN3D); this was only a
marginal association (P-Value=0.03) at age 14, which did not reach the Bonferroni threshold
and therefore may have been a false positive.
For males, all three SNPs associated with BMI using the LMM method were detected in the
cross-sectional analysis; however, nine potentially significant loci were not detected. The
STLMM method failed to detect the FTO locus, the only locus that reached the Bonferroni
threshold in the cross-sectional analysis. It did, however, detect an association with the
rs17782313 (MC4R) locus; this locus was statistically significant in the cross-sectional analysis but
none of the other longitudinal methods detected it. The SPLMM method detected four of the
nine loci observed to be significantly associated with BMI at one or more time points in the
cross-sectional analysis. Finally, the NLMM method detected four of the loci that were
significantly associated with BMI in the cross-sectional analyses, one of which was only
detected by this longitudinal method (BCDIN3D).
Table 2.16: Summary of longitudinal analyses, using the four methods, for each of the 17 SNPs in females. Significant P-Values are in bold.

| Term | LMM Beta | LMM P | LMM LRT P | STLMM Beta | STLMM P | STLMM LRT P | SPLMM Beta | SPLMM P | SPLMM LRT P | NLMM parameter | NLMM Beta | NLMM P |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NEGR1 rs2815752 | 0.012 | 0.10 | 0.16 | 0.154 | 0.07 | 0.19 | 0.011 | 0.14 | 0.07 | Size | 0.001 | 0.62 |
| Age:rs2815752 | 0.001 | 0.32 | | 0.015 | 0.23 | | 0.001 | 0.31 | | Tempo | -0.006 | 0.54 |
| Age²:rs2815752 | -1×10⁻⁴ | 0.17 | | -0.001 | 0.40 | | 0.001 | 0.45 | | Velocity | -0.003 | 0.70 |
| Age³:rs2815752 | -1×10⁻⁵ | 0.48 | | -2×10⁻⁴ | 0.35 | | 0.001 | 0.31 | | | | |
| SEC16B rs10913469 | 0.029 | 1×10⁻³ | 0.02 | 0.217 | 0.04 | 0.31 | 0.030 | 2×10⁻³ | 0.04 | Size | 0.002 | 0.56 |
| Age:rs10913469 | 0.003 | 0.03 | | 0.032 | 0.04 | | 0.003 | 0.03 | | Tempo | -0.030 | 0.02 |
| Age²:rs10913469 | -2×10⁻⁴ | 0.06 | | -0.001 | 0.46 | | -0.001 | 0.60 | | Velocity | 0.007 | 0.51 |
| Age³:rs10913469 | -2×10⁻⁷ | 0.99 | | -2×10⁻⁴ | 0.47 | | -0.001 | 0.61 | | | | |
| LYPLAL1 rs2605100 | 0.002 | 0.82 | 0.12 | 0.036 | 0.71 | 0.34 | -2×10⁻⁴ | 0.98 | 0.16 | Size | -0.003 | 0.32 |
| Age:rs2605100 | 0.001 | 0.40 | | 0.001 | 0.97 | | 0.001 | 0.51 | | Tempo | -0.005 | 0.63 |
| Age²:rs2605100 | -1×10⁻⁴ | 0.19 | | -0.002 | 0.09 | | 0.001 | 0.42 | | Velocity | -0.014 | 0.12 |
| Age³:rs2605100 | -1×10⁻⁵ | 0.40 | | -1×10⁻⁴ | 0.76 | | 0.001 | 0.43 | | | | |
| TMEM18 rs6548238 | 0.017 | 0.07 | 0.19 | 0.074 | 0.55 | 0.71 | 0.022 | 0.02 | 0.18 | Size | 0.005 | 0.22 |
| Age:rs6548238 | 0.001 | 0.36 | | 0.018 | 0.32 | | 0.001 | 0.61 | | Tempo | -0.008 | 0.56 |
| Age²:rs6548238 | -9×10⁻⁵ | 0.48 | | 0.001 | 0.33 | | -0.003 | 0.13 | | Velocity | 0.002 | 0.87 |
| Age³:rs6548238 | -2×10⁻⁶ | 0.90 | | -3×10⁻⁴ | 0.20 | | -0.001 | 0.19 | | | | |
| ETV5 rs7647305 | -0.001 | 0.92 | 0.38 | -0.079 | 0.44 | 0.42 | 0.001 | 0.89 | 0.31 | Size | 0.002 | 0.55 |
| Age:rs7647305 | -3×10⁻⁴ | 0.83 | | -0.026 | 0.09 | | -1×10⁻⁴ | 0.94 | | Tempo | 0.008 | 0.55 |
| Age²:rs7647305 | 8×10⁻⁵ | 0.50 | | 0.001 | 0.41 | | -0.002 | 0.24 | | Velocity | 0.008 | 0.43 |
| Age³:rs7647305 | 4×10⁻⁶ | 0.80 | | 2×10⁻⁴ | 0.50 | | -0.001 | 0.15 | | | | |
| GNPDA2 rs10938397 | 0.004 | 0.52 | 0.36 | 0.056 | 0.48 | 0.74 | 0.002 | 0.80 | 0.32 | Size | -4×10⁻⁴ | 0.88 |
| Age:rs10938397 | 0.001 | 0.22 | | 0.012 | 0.28 | | 0.002 | 0.15 | | Tempo | -0.009 | 0.35 |
| Age²:rs10938397 | 1×10⁻⁵ | 0.89 | | 0.001 | 0.30 | | 0.002 | 0.22 | | Velocity | 0.007 | 0.33 |
| Age³:rs10938397 | -1×10⁻⁵ | 0.68 | | -1×10⁻⁴ | 0.54 | | 0.001 | 0.34 | | | | |
| TFAP2B rs987237 | 0.008 | 0.42 | 0.07 | 0.163 | 0.16 | 0.02 | 0.008 | 0.44 | 0.04 | Size | 0.004 | 0.30 |
| Age:rs987237 | 0.002 | 0.06 | | 0.040 | 0.01 | | -0.001 | 0.56 | | Tempo | -3×10⁻⁴ | 0.98 |
| Age²:rs987237 | 1×10⁻⁴ | 0.38 | | 0.003 | 0.01 | | -0.001 | 0.79 | | Velocity | 0.003 | 0.76 |
| Age³:rs987237 | -4×10⁻⁵ | 0.02 | | -4×10⁻⁴ | 0.12 | | -4×10⁻⁷ | 1.00 | | | | |
| MRSA rs613080 | -0.008 | 0.43 | 0.14 | 0.049 | 0.64 | 0.30 | -0.012 | 0.25 | 0.02 | Size | 0.001 | 0.85 |
| Age:rs613080 | -3×10⁻⁴ | 0.79 | | 0.009 | 0.60 | | -0.003 | 0.10 | | Tempo | 0.003 | 0.80 |
| Age²:rs613080 | 2×10⁻⁴ | 0.15 | | 0.001 | 0.35 | | 0.001 | 0.52 | | Velocity | 0.017 | 0.12 |
| Age³:rs613080 | 1×10⁻⁵ | 0.55 | | 2×10⁻⁴ | 0.61 | | 0.001 | 0.47 | | | | |
| MTCH2 rs10838738 | -0.006 | 0.44 | 0.28 | 0.031 | 0.70 | 0.28 | -0.005 | 0.54 | 0.34 | Size | -0.002 | 0.44 |
| Age:rs10838738 | 0.001 | 0.53 | | 0.031 | 0.01 | | 0.001 | 0.66 | | Tempo | 0.001 | 0.90 |
| Age²:rs10838738 | 4×10⁻⁵ | 0.66 | | 4×10⁻⁴ | 0.69 | | -0.001 | 0.53 | | Velocity | -0.009 | 0.27 |
| Age³:rs10838738 | -2×10⁻⁵ | 0.22 | | -3×10⁻³ | 0.10 | | -0.001 | 0.45 | | | | |
| BDNF rs1488830 | 0.013 | 0.13 | 0.09 | 0.201 | 0.06 | 0.04 | 0.011 | 0.22 | 0.01 | Size | 1×10⁻⁴ | 0.97 |
| Age:rs1488830 | 0.003 | 0.03 | | 0.048 | 2×10⁻³ | | 0.005 | 4×10⁻⁴ | | Tempo | -0.018 | 0.16 |
| Age²:rs1488830 | -6×10⁻⁵ | 0.59 | | -2×10⁻⁴ | 0.85 | | 0.002 | 0.45 | | Velocity | -0.001 | 0.96 |
| Age³:rs1488830 | -3×10⁻⁵ | 0.08 | | -0.001 | 0.02 | | -1×10⁻⁴ | 0.89 | | | | |
| BDNF rs925946 | -0.007 | 0.37 | 0.35 | -0.052 | 0.55 | 0.32 | -0.008 | 0.31 | 0.32 | Size | -0.003 | 0.29 |
| Age:rs925946 | 3×10⁻⁴ | 0.76 | | 0.013 | 0.32 | | 0.001 | 0.28 | | Tempo | -0.005 | 0.62 |
| Age²:rs925946 | 6×10⁻⁵ | 0.51 | | 0.001 | 0.26 | | 0.001 | 0.45 | | Velocity | -0.001 | 0.93 |
| Age³:rs925946 | -1×10⁻⁵ | 0.46 | | -4×10⁻⁵ | 0.86 | | 2×10⁻⁴ | 0.81 | | | | |
| BCDIN3D rs7138803 | 0.012 | 0.11 | 0.09 | 0.098 | 0.25 | 0.82 | 0.010 | 0.17 | 0.10 | Size | 4×10⁻⁴ | 0.90 |
| Age:rs7138803 | 0.002 | 0.01 | | 0.018 | 0.13 | | 0.002 | 0.06 | | Tempo | -0.017 | 0.11 |
| Age²:rs7138803 | -2×10⁻⁵ | 0.80 | | 3×10⁻⁵ | 0.98 | | 4×10⁻⁴ | 0.79 | | Velocity | 0.008 | 0.35 |
| Age³:rs7138803 | -2×10⁻⁵ | 0.19 | | -2×10⁻⁴ | 0.49 | | 3×10⁻⁵ | 0.97 | | | | |
| NRXN3 rs10146997 | -0.003 | 0.69 | 0.30 | -0.021 | 0.84 | 0.64 | -1×10⁻⁵ | 1.00 | 0.01 | Size | 0.001 | 0.83 |
| Age:rs10146997 | 7×10⁻⁵ | 0.95 | | 0.016 | 0.31 | | -0.003 | 0.05 | | Tempo | -0.001 | 0.93 |
| Age²:rs10146997 | 1×10⁻⁴ | 0.23 | | 0.002 | 0.07 | | -0.004 | 0.03 | | Velocity | 0.015 | 0.11 |
| Age³:rs10146997 | 4×10⁻⁷ | 0.98 | | -3×10⁻⁴ | 0.21 | | -0.002 | 0.07 | | | | |
| SH2B1 rs8055138 | 0.004 | 0.60 | 0.34 | 0.052 | 0.49 | 0.94 | 0.004 | 0.59 | 0.36 | Size | 5×10⁻⁴ | 0.87 |
| Age:rs8055138 | 5×10⁻⁴ | 0.60 | | 0.002 | 0.85 | | 3×10⁻⁴ | 0.82 | | Tempo | -0.005 | 0.61 |
| Age²:rs8055138 | -2×10⁻⁶ | 0.98 | | -0.001 | 0.61 | | -0.001 | 0.68 | | Velocity | 0.005 | 0.50 |
| Age³:rs8055138 | 5×10⁻⁶ | 0.70 | | 1×10⁻⁴ | 0.62 | | -4×10⁻⁴ | 0.62 | | | | |
| FTO rs1121980 | 0.013 | 0.08 | 0.31 | 0.244 | 4×10⁻³ | 0.02 | 0.012 | 0.13 | 0.38 | Size | 0.001 | 0.77 |
| Age:rs1121980 | 0.001 | 0.30 | | 0.027 | 0.03 | | 0.002 | 0.14 | | Tempo | -0.013 | 0.22 |
| Age²:rs1121980 | -1×10⁻⁴ | 0.23 | | -0.002 | 0.10 | | 0.001 | 0.55 | | Velocity | -0.006 | 0.46 |
| Age³:rs1121980 | 1×10⁻⁵ | 0.67 | | 1×10⁻⁴ | 0.77 | | 4×10⁻⁴ | 0.59 | | | | |
| MC4R rs17782313 | 2×10⁻⁴ | 0.98 | 0.31 | -0.032 | 0.76 | 0.70 | -0.001 | 0.91 | 0.10 | Size | -0.004 | 0.22 |
| Age:rs17782313 | 0.001 | 0.25 | | 0.006 | 0.68 | | 0.003 | 0.02 | | Tempo | -0.017 | 0.15 |
| Age²:rs17782313 | -3×10⁻⁷ | 1.00 | | -2×10⁻⁴ | 0.84 | | 0.002 | 0.40 | | Velocity | 0.008 | 0.41 |
| Age³:rs17782313 | -2×10⁻⁵ | 0.32 | | -3×10⁻⁴ | 0.26 | | 3×10⁻⁴ | 0.71 | | | | |
| KCTD15 rs11084753 | 0.011 | 0.17 | 0.08 | 0.008 | 0.92 | 0.26 | 0.013 | 0.11 | 0.11 | Size | -0.004 | 0.24 |
| Age:rs11084753 | 0.002 | 0.02 | | 0.017 | 0.18 | | 0.003 | 0.03 | | Tempo | -0.028 | 0.01 |
| Age²:rs11084753 | -7×10⁻⁵ | 0.50 | | 0.001 | 0.63 | | -0.002 | 0.25 | | Velocity | 0.007 | 0.40 |
| Age³:rs11084753 | -2×10⁻⁵ | 0.26 | | 3×10⁻⁵ | 0.88 | | -0.001 | 0.19 | | | | |
Table 2.17: Summary of longitudinal analyses, using the four methods, for each of the 17 SNPs in males. Significant P-Values are in bold.

| Term | LMM Beta | LMM P | LMM LRT P | STLMM Beta | STLMM P | STLMM LRT P | SPLMM Beta | SPLMM P | SPLMM LRT P | NLMM parameter | NLMM Beta | NLMM P |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NEGR1 rs2815752 | 0.011 | 0.12 | 0.57 | 0.027 | 0.72 | 0.96 | 0.008 | 0.25 | 0.70 | Size | 0.002 | 0.47 |
| Age:rs2815752 | 0.001 | 0.38 | | 0.004 | 0.74 | | 0.001 | 0.63 | | Tempo | -0.006 | 0.54 |
| Age²:rs2815752 | -6×10⁻⁵ | 0.49 | | 4×10⁻⁴ | 0.63 | | 0.002 | 0.23 | | Velocity | 0.002 | 0.77 |
| Age³:rs2815752 | 4×10⁻⁶ | 0.78 | | -1×10⁻⁴ | 0.55 | | 0.001 | 0.16 | | | | |
| SEC16B rs10913469 | 0.001 | 0.90 | 0.28 | 0.058 | 0.52 | 0.33 | -0.001 | 0.92 | 0.18 | Size | -0.002 | 0.53 |
| Age:rs10913469 | 0.002 | 0.10 | | 0.026 | 0.05 | | -2×10⁻⁴ | 0.86 | | Tempo | -0.004 | 0.70 |
| Age²:rs10913469 | -1×10⁻⁵ | 0.94 | | -0.001 | 0.46 | | -1×10⁻⁴ | 0.95 | | Velocity | -0.004 | 0.56 |
| Age³:rs10913469 | -3×10⁻⁵ | 0.05 | | -2×10⁻⁴ | 0.22 | | 1×10⁻⁴ | 0.93 | | | | |
| LYPLAL1 rs2605100 | -0.005 | 0.53 | 0.49 | 0.015 | 0.85 | 0.65 | -0.007 | 0.37 | 0.79 | Size | -0.003 | 0.24 |
| Age:rs2605100 | 2×10⁻⁴ | 0.87 | | 0.016 | 0.23 | | -0.001 | 0.68 | | Tempo | 0.001 | 0.96 |
| Age²:rs2605100 | -5×10⁻⁵ | 0.60 | | 4×10⁻⁴ | 0.70 | | 0.001 | 0.42 | | Velocity | -0.004 | 0.48 |
| Age³:rs2605100 | -1×10⁻⁵ | 0.57 | | -3×10⁻⁴ | 0.08 | | 0.001 | 0.33 | | | | |
| TMEM18 rs6548238 | 0.015 | 0.13 | 0.02 | 0.098 | 0.31 | 0.01 | 0.014 | 0.16 | 0.02 | Size | 0.002 | 0.55 |
| Age:rs6548238 | 0.004 | 1×10⁻³ | | 0.066 | 1×10⁻⁴ | | 0.003 | 0.06 | | Tempo | -0.022 | 0.10 |
| Age²:rs6548238 | 1×10⁻⁴ | 0.35 | | 0.002 | 0.06 | | 0.001 | 0.76 | | Velocity | 0.013 | 0.12 |
| Age³:rs6548238 | -5×10⁻⁵ | 0.01 | | -0.001 | 4×10⁻⁴ | | 3×10⁻⁴ | 0.77 | | | | |
| ETV5 rs7647305 | 0.006 | 0.46 | 0.68 | 0.062 | 0.53 | 0.44 | 0.006 | 0.51 | 0.37 | Size | 0.003 | 0.35 |
| Age:rs7647305 | 5×10⁻⁴ | 0.67 | | 0.026 | 0.06 | | 0.002 | 0.24 | | Tempo | -0.003 | 0.79 |
| Age²:rs7647305 | 5×10⁻⁵ | 0.63 | | 0.001 | 0.15 | | 0.002 | 0.37 | | Velocity | 0.003 | 0.66 |
| Age³:rs7647305 | -3×10⁻⁶ | 0.84 | | -3×10⁻⁴ | 0.20 | | 0.001 | 0.42 | | | | |
| GNPDA2 rs10938397 | 0.009 | 0.17 | 0.52 | 0.095 | 0.17 | 0.32 | 0.006 | 0.41 | 0.05 | Size | 0.003 | 0.20 |
| Age:rs10938397 | 0.001 | 0.23 | | 0.009 | 0.43 | | 0.003 | 4×10⁻³ | | Tempo | -4×10⁻⁴ | 0.97 |
| Age²:rs10938397 | -1×10⁻⁵ | 0.95 | | -2×10⁻⁴ | 0.83 | | 0.003 | 0.04 | | Velocity | 0.003 | 0.64 |
| Age³:rs10938397 | -1×10⁻⁵ | 0.41 | | -2×10⁻⁴ | 0.19 | | 0.001 | 0.17 | | | | |
| TFAP2B rs987237 | 0.023 | 0.01 | 0.09 | 0.123 | 0.20 | 0.27 | 0.023 | 0.01 | 0.24 | Size | 0.003 | 0.41 |
| Age:rs987237 | 0.001 | 0.41 | | -0.011 | 0.48 | | 0.001 | 0.45 | | Tempo | -0.011 | 0.36 |
| Age²:rs987237 | -3×10⁻⁴ | 0.02 | | -0.002 | 0.03 | | -0.001 | 0.75 | | Velocity | -0.005 | 0.53 |
| Age³:rs987237 | 1×10⁻⁵ | 0.57 | | 4×10⁻⁴ | 0.06 | | -1×10⁻⁴ | 0.95 | | | | |
| MRSA rs613080 | 0.002 | 0.84 | 0.68 | -0.110 | 0.29 | 0.17 | 0.005 | 0.63 | 0.11 | Size | 3×10⁻⁴ | 0.93 |
| Age:rs613080 | -4×10⁻⁴ | 0.74 | | -0.037 | 0.02 | | -0.003 | 0.06 | | Tempo | -0.006 | 0.64 |
| Age²:rs613080 | -1×10⁻⁵ | 0.96 | | -0.002 | 0.19 | | -0.002 | 0.38 | | Velocity | 0.005 | 0.51 |
| Age³:rs613080 | 2×10⁻⁵ | 0.24 | | 0.001 | 0.01 | | 3×10⁻⁶ | 1.00 | | | | |
| MTCH2 rs10838738 | 0.012 | 0.10 | 0.48 | 0.085 | 0.28 | 0.17 | 0.010 | 0.16 | 0.85 | Size | -3×10⁻⁴ | 0.90 |
| Age:rs10838738 | 0.001 | 0.53 | | -0.010 | 0.38 | | 0.001 | 0.61 | | Tempo | -0.012 | 0.25 |
| Age²:rs10838738 | -2×10⁻⁴ | 0.11 | | -0.002 | 0.05 | | 4×10⁻⁴ | 0.82 | | Velocity | -0.003 | 0.68 |
| Age³:rs10838738 | 1×10⁻⁵ | 0.66 | | 4×10⁻⁴ | 0.02 | | 4×10⁻⁴ | 0.67 | | | | |
| BDNF rs1488830 | 0.011 | 0.18 | 0.22 | 0.038 | 0.67 | 0.27 | 0.014 | 0.09 | 0.20 | Size | 0.003 | 0.29 |
| Age:rs1488830 | 0.002 | 0.06 | | 0.018 | 0.19 | | 0.002 | 0.27 | | Tempo | -0.015 | 0.21 |
| Age²:rs1488830 | 7×10⁻⁵ | 0.50 | | 0.002 | 0.03 | | -0.003 | 0.08 | | Velocity | 0.013 | 0.07 |
| Age³:rs1488830 | -2×10⁻⁵ | 0.23 | | -2×10⁻⁴ | 0.36 | | -0.002 | 0.05 | | | | |
| BDNF rs925946 | 0.005 | 0.47 | 0.22 | -0.005 | 0.94 | 0.03 | 0.004 | 0.60 | 0.36 | Size | 0.004 | 0.16 |
| Age:rs925946 | 0.001 | 0.35 | | 0.011 | 0.36 | | 1×10⁻⁴ | 0.94 | | Tempo | 0.002 | 0.83 |
| Age²:rs925946 | 7×10⁻⁵ | 0.45 | | 0.003 | 1×10⁻³ | | 0.001 | 0.64 | | Velocity | 0.005 | 0.43 |
| Age³:rs925946 | -2×10⁻⁵ | 0.12 | | -4×10⁻⁴ | 0.03 | | 3×10⁻⁴ | 0.71 | | | | |
| BCDIN3D rs7138803 | 0.013 | 0.05 | 0.10 | 0.034 | 0.62 | 0.92 | 0.013 | 0.05 | 0.44 | Size | -1×10⁻⁴ | 0.96 |
| Age:rs7138803 | 0.002 | 0.06 | | 0.006 | 0.57 | | 0.002 | 0.14 | | Tempo | -0.020 | 0.04 |
| Age²:rs7138803 | -7×10⁻⁵ | 0.43 | | 5×10⁻⁶ | 1.00 | | -0.001 | 0.65 | | Velocity | 0.006 | 0.28 |
| Age³:rs7138803 | 2×10⁻⁶ | 0.86 | | 5×10⁻⁶ | 0.77 | | -3×10⁻⁴ | 0.74 | | | | |
| NRXN3 rs10146997 | -0.004 | 0.57 | 1×10⁻³ | 0.143 | 0.11 | 1×10⁻³ | -0.006 | 0.50 | 0.01 | Size | -0.002 | 0.58 |
| Age:rs10146997 | -0.002 | 0.06 | | 0.013 | 0.34 | | -0.002 | 0.14 | | Tempo | 0.031 | 0.01 |
| Age²:rs10146997 | -2×10⁻⁴ | 0.03 | | -0.003 | 0.00 | | -4×10⁻⁵ | 0.98 | | Velocity | -0.032 | 4×10⁻⁶ |
| Age³:rs10146997 | -2×10⁻⁶ | 0.88 | | -2×10⁻⁴ | 0.23 | | 1×10⁻⁴ | 0.89 | | | | |
| SH2B1 rs8055138 | 0.011 | 0.11 | 0.48 | 0.054 | 0.51 | 0.80 | 0.013 | 0.07 | 0.70 | Size | 0.002 | 0.48 |
| Age:rs8055138 | 0.001 | 0.49 | | -0.007 | 0.57 | | 5×10⁻⁴ | 0.69 | | Tempo | -0.005 | 0.60 |
| Age²:rs8055138 | -8×10⁻⁵ | 0.38 | | -4×10⁻⁴ | 0.69 | | -0.001 | 0.54 | | Velocity | -0.003 | 0.57 |
| Age³:rs8055138 | -1×10⁻⁵ | 0.70 | | 2×10⁻⁴ | 0.23 | | -3×10⁻⁴ | 0.73 | | | | |
| FTO rs1121980 | 0.026 | 2×10⁻⁴ | 4×10⁻⁴ | 0.138 | 0.08 | 0.21 | 0.029 | 3×10⁻⁵ | 3×10⁻⁴ | Size | -4×10⁻⁴ | 0.87 |
| Age:rs1121980 | 0.004 | 1×10⁻⁴ | | 0.027 | 0.02 | | 0.005 | 9×10⁻⁵ | | Tempo | -0.034 | 1×10⁻³ |
| Age²:rs1121980 | -2×10⁻⁴ | 0.08 | | -0.001 | 0.56 | | -0.003 | 0.07 | | Velocity | 0.003 | 0.64 |
| Age³:rs1121980 | -2×10⁻⁵ | 0.14 | | -2×10⁻⁴ | 0.26 | | -0.002 | 0.06 | | | | |
| MC4R rs17782313 | 0.013 | 0.10 | 0.10 | 0.124 | 0.16 | 0.02 | 0.012 | 0.15 | 0.19 | Size | -0.001 | 0.74 |
| Age:rs17782313 | 0.003 | 0.01 | | 0.037 | 0.01 | | 0.002 | 0.20 | | Tempo | -0.029 | 0.01 |
| Age²:rs17782313 | -1×10⁻⁵ | 0.95 | | 0.002 | 0.11 | | 0.001 | 0.59 | | Velocity | 0.008 | 0.27 |
| Age³:rs17782313 | -1×10⁻⁵ | 0.47 | | -1×10⁻⁴ | 0.73 | | 0.001 | 0.42 | | | | |
| KCTD15 rs11084753 | 0.001 | 0.94 | 0.34 | -0.041 | 0.63 | 0.54 | 0.002 | 0.75 | 0.40 | Size | 0.004 | 0.17 |
| Age:rs11084753 | -0.001 | 0.35 | | -0.023 | 0.07 | | -0.002 | 0.12 | | Tempo | 0.007 | 0.49 |
| Age²:rs11084753 | 7×10⁻⁵ | 0.46 | | -1×10⁻⁴ | 0.95 | | -0.002 | 0.33 | | Velocity | 0.008 | 0.21 |
| Age³:rs11084753 | 2×10⁻⁵ | 0.27 | | 2×10⁻⁴ | 0.32 | | -0.001 | 0.50 | | | | |
2.5.4 Obesity-Risk Allele Score
The obesity-risk allele score based on the genotypes at each of the 17 loci was normally
distributed and showed an approximately linear association with BMI across childhood, based
on the mean BMI (95% confidence interval) for each score at each age (Figure 2.12).
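A simple way to picture the score is as a count of risk alleles across the loci. The sketch below assumes an unweighted sum of 0, 1 or 2 risk alleles per locus, which is consistent with the per-allele interpretation used in this chapter but is not spelled out in this section; the function name and the genotype values are hypothetical.

```python
def risk_allele_score(risk_allele_counts):
    """Sum the number of obesity-risk alleles (0, 1 or 2 per locus) for one person."""
    if any(count not in (0, 1, 2) for count in risk_allele_counts):
        raise ValueError("each locus must carry 0, 1 or 2 risk alleles")
    return sum(risk_allele_counts)


# Hypothetical individual genotyped at all 17 loci (illustrative values only).
genotypes = [1, 0, 2, 1, 1, 0, 1, 2, 1, 1, 0, 1, 2, 1, 1, 1, 1]
score = risk_allele_score(genotypes)
print(len(genotypes), score)
```

With 17 biallelic loci the score can range from 0 to 34, so a score treated as a continuous covariate gives a per-allele effect estimate.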
When the obesity-risk allele score was incorporated into the four longitudinal models, it was
associated with increasing BMI in females using all four methods; however, only three methods
detected an association in males (Table 2.18). For the females, the LMM, STLMM and SPLMM
methods all detected an increase in BMI per allele increase in the obesity-risk allele score
(LMM β=0.0754, P-Value=0.02; STLMM β=0.0566, P-Value=0.02; SPLMM β=0.0793,
P-Value=0.01), in addition to an increase in linear trajectory over time (LMM β=0.0181,
P-Value=0.00002; STLMM β=0.0152, P-Value=0.00003; SPLMM β=0.0184, P-Value=0.0006). No
significant associations in the LMM, STLMM or SPLMM methods were detected for the
quadratic interactions with the obesity-risk allele score; however, the cubic interaction was
significant in the LMM (β=-0.0002, P-Value=0.01) and STLMM (β=-0.0001, P-Value=0.02).
According to the LMM and STLMM methods, this indicates that females with higher allele scores
plateau to adult BMI at an earlier age. In contrast, the NLMM method in both females and
males was unable to detect a significant association with an increase in size or velocity,
but did detect a decrease in tempo (assumed to be the adiposity rebound) for each
increase in the number of risk alleles. In the males, the LMM and SPLMM methods also
detected an increase in BMI (LMM β=0.1045, P-Value=0.0001; SPLMM β=0.1022,
P-Value=0.0002), and the LMM, STLMM and SPLMM methods detected an increase in BMI/year per
allele increase (LMM β=0.0145, P-Value=0.0001; STLMM β=0.0083, P-Value=0.007; SPLMM
β=0.0123, P-Value=0.007). No significant associations in the
LMM, STLMM or SPLMM methods were detected for the quadratic and cubic interactions with
the obesity-risk allele score. This indicates that the shape of the curve is consistent across the
score categories.
Figure 2.12: Distribution of obesity-risk allele score, with error bars for mean BMI at age 14
years. The obesity-risk-allele score incorporates genotypes from 17 loci (FTO, MC4R, TMEM18,
GNPDA2, KCTD15, NEGR1, BDNF, ETV5, SEC16B, LYPLAL1, TFAP2B, MTCH2, BCDIN3D, NRXN3,
SH2B1, and MRSA) in the 1,219 individuals from the Raine Study with complete genetic data.
The error bars display the mean (95% CI) BMI at age 14 years (the largest follow-up in
adolescence) for each risk-allele score.
Table 2.18: Results from association analysis of the obesity-risk allele score with BMI trajectories using the four methods, adjusted for the first five principal components

| Sex | Term | LMM Beta | LMM 95% CI | LMM P | STLMM Beta | STLMM 95% CI | STLMM P | SPLMM Beta | SPLMM 95% CI | SPLMM P | NLMM parameter | NLMM Beta | NLMM SE | NLMM P |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Female | Score | 0.075 | 0.014, 0.137 | 0.02 | 0.057 | 0.008, 0.105 | 0.02 | 0.079 | 0.017, 0.142 | 0.01 | Size | -2×10⁻⁴ | 0.001 | 0.85 |
| Female | Score*Age | 0.018 | 0.010, 0.026 | 2×10⁻⁵ | 0.015 | 0.008, 0.023 | 3×10⁻⁵ | 0.018 | 0.008, 0.029 | 6×10⁻⁴ | Tempo | -0.009 | 0.003 | 5×10⁻³ |
| Female | Score*Age² | -7×10⁻⁶ | -0.001, 0.001 | 0.99 | 0.001 | 9×10⁻⁵, 0.001 | 0.07 | -0.008 | -0.021, 0.006 | 0.28 | Velocity | 0.003 | 0.002 | 0.14 |
| Female | Score*Age³ | -2×10⁻⁴ | -3×10⁻⁴, -4×10⁻⁵ | 0.01 | -1×10⁻⁴ | -3×10⁻⁴, 1×10⁻⁴ | 0.02 | -0.006 | -0.013, 0.001 | 0.11 | | | | |
| Male | Score | 0.105 | 0.052, 0.157 | 1×10⁻⁴ | 0.039 | -0.004, 0.081 | 0.07 | 0.102 | 0.048, 0.156 | 2×10⁻⁴ | Size | 3×10⁻⁴ | 0.001 | 0.69 |
| Male | Score*Age | 0.015 | 0.007, 0.022 | 1×10⁻⁴ | 0.008 | 0.002, 0.014 | 0.01 | 0.012 | 0.003, 0.021 | 0.02 | Tempo | -0.007 | 0.003 | 4×10⁻³ |
| Male | Score*Age² | -6×10⁻⁴ | -0.001, 1×10⁻⁴ | 0.10 | -1×10⁻⁵ | -4×10⁻⁴, 4×10⁻⁴ | 0.96 | -3×10⁻⁴ | -0.012, 0.011 | 0.96 | Velocity | 4×10⁻⁴ | 0.002 | 0.79 |
| Male | Score*Age³ | -1×10⁻⁴ | -2×10⁻⁴, 2×10⁻⁶ | 0.06 | -1×10⁻⁴ | -2×10⁻⁴, -1×10⁻⁵ | 0.20 | 7×10⁻⁴ | -0.005, 0.007 | 0.83 | | | | |
2.5.5 Characterising Genetic Associations in SPLMM Model
The SPLMM model yielded the best fit to these data, and therefore further analysis of the
genetic data was undertaken using this model. As seen in the previous sections, loci in
different genes were associated with BMI trajectory in males and females; this indicates that
there are potentially different genetic pathways leading to growth rate in males and females.
In females, SNPs in the SEC16B, TFAP2B, MRSA, BDNF and NRXN3 genes were significantly
associated with BMI trajectory, whereas in males TMEM18, GNPDA2, NRXN3 and FTO were
significant. Figure 2.13 shows the population average curves for females (A) and males (B) with
zero, one or two copies of the minor allele for each of the significantly associated loci using the
SPLMM model.
FTO in males and SEC16B in females have similar trajectory patterns, whereby the genetic
effect is observed early in life and persists throughout adolescence with each additional copy
of the risk allele having a greater increase in BMI over time. In addition, the risk alleles of the
NRXN3 and MRSA loci in females are associated with lower levels of BMI from the adiposity
rebound to post-puberty (approximately 6-13 years of age).
Figure 2.14 displays the population average curves for individuals with 15, 17 or 18 (25th, 50th
and 75th percentile) obesity-risk alleles. The growth curves in each of the genders show
different patterns; females begin their trajectory smaller than males, they have an earlier
rebound, and by the age of 18 years they are beginning to plateau at their potential adult BMI.
In contrast, males go through puberty at a slightly later age resulting in their BMI continuing to
increase at the age of 18 years. The genetic effect is shown to begin later for females, at
around seven and a half years (P-Value=0.03), than for males at four years (P-Value=0.02)
(Figure 2.15).
Figure 2.13: Population average curves for each of the significantly associated SNPs from
the SPLMM method in females (panel A) and males (panel B)
Figure 2.14: Population average curves from the SPLMM method in females and males.
Predicted population average BMI trajectory from 1 to 18 years, after adjusting for birth weight
and maternal smoking during pregnancy, for individuals with 15 (lower quartile), 17 (median),
and 18 (upper quartile) risk alleles.
Figure 2.15: Associations between the risk-allele score and BMI at each follow-up in females
and males. Regression coefficients (95% CI) presented on the ln(BMI) scale from the
Semi-Parametric Linear Mixed Model (SPLMM) longitudinal model, derived at each of the average
ages of follow-up. For example, a male with 17 obesity-risk alleles is likely to have a ln(BMI)
0.005 units higher at age 6 than a male with 16 alleles, and by age 14 this difference will have
increased to 0.010 units.
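Because these coefficients are on the ln(BMI) scale, they are easier to interpret once converted to a percentage difference in BMI. The short sketch below applies the standard back-transformation exp(δ) − 1 to the two per-allele differences quoted in the caption; the function name is hypothetical.

```python
import math


def percent_difference_from_log_units(delta_ln_bmi):
    """Convert a difference on the ln(BMI) scale to a percentage difference in BMI."""
    return (math.exp(delta_ln_bmi) - 1.0) * 100.0


# Per-allele differences quoted in the Figure 2.15 caption for males.
for age, delta in ((6, 0.005), (14, 0.010)):
    print(age, round(percent_difference_from_log_units(delta), 2))
```

For differences this small the percentage is close to 100 times the log-scale coefficient, so 0.005 and 0.010 ln(BMI) units correspond to roughly 0.5% and 1% higher BMI per additional risk allele.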
2.6 Comparison of Models
The modelling methods were compared using the following:
1. Model fit
2. Computation time
3. Ability to detect genetic associations with known adult BMI SNPs
2.6.1 Model Fit
To compare model fit, several model parameters were compared:
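1. R²: measure of the variance explained by the model, calculated as [233]:
$$R^2 = 1 - \frac{\operatorname{var}(Y \mid \mathbf{x}, \mathbf{b})}{\operatorname{var}(Y \mid b_0)} = 1 - \frac{\sigma^2}{\sigma_0^2}$$
where, using the notation from 2.1 above:
$$\sigma_0^2 = \operatorname{var}\{(\mathbf{x}_{ij} - \boldsymbol{\mu}_x)'\boldsymbol{\beta}_1 + (\mathbf{z}_{ij} - \boldsymbol{\mu}_z)'\mathbf{b}_{1i}\} + \sigma^2, \qquad \sigma^2 = \operatorname{var}(\varepsilon)$$
2. Difference between observed and fitted values: calculated by $\sum (Y - Y_{est})^2$, which gives
an indication of how well the model fits the observed data
3. Visual inspection of the residual and mean plots.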
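The first two measures in the list above reduce to simple arithmetic once the variance components and fitted values are available. The sketch below is a minimal illustration under stated assumptions: it reads the R² formula as one minus the ratio of the residual variance (σ²) to the null-model variance (σ₀²), and all numerical inputs and function names are hypothetical rather than taken from the fitted models.

```python
def r_squared(residual_variance, null_residual_variance):
    """R^2 read as 1 - sigma^2 / sigma_0^2."""
    return 1.0 - residual_variance / null_residual_variance


def squared_differences(observed, fitted):
    """Per-observation squared differences between observed and fitted values."""
    return [(y - y_hat) ** 2 for y, y_hat in zip(observed, fitted)]


# Hypothetical variance estimates and BMI values (illustrative only).
print(round(r_squared(residual_variance=0.8, null_residual_variance=6.4), 3))

observed_bmi = [16.1, 15.8, 17.4, 20.2]
fitted_bmi = [16.0, 16.0, 17.0, 20.5]
print(round(sum(squared_differences(observed_bmi, fitted_bmi)), 2))
```

Keeping the squared differences as a list, rather than only their sum, makes it straightforward to summarise them by a median and interquartile range.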
Table 2.19 displays the measures of fit used to compare methods: R², R² from 1,000 simulated
datasets (see Section 2.4) and the squared differences between observed and fitted values. The R², in
conjunction with the interquartile range of variation of R² estimated through simulations, clearly
favours the SPLMM as the best model fit for the females. The R² estimates from the simulations
indicate that although the STLMM method has a higher R² for both females and males, the
interquartile range is much larger for the STLMM method. The model fit is therefore more data
dependent in the STLMM method than in the other methods, which is not desirable when applying
these methods to other cohorts. The R² in the males favours the STLMM method; however, this
method has a considerably longer computational time and a larger deviation between the fitted
values and the observed values (as seen in the following two sections), indicating that it is not
appropriate for large scale genetic studies.
Table 2.19: Statistical measures used to compare model fit of the four methods.

| Sex | Method | R² | R² from 1,000 simulated datasets [mean (95% CI)] | (Observed-fitted values)² [median (IQR)] |
|---|---|---|---|---|
| Female | LMM | 83.59% | 83.58% (83.50, 83.66) | 0.2705 (0.0579, 0.8755) |
| Female | STLMM | 88.78% | 90.36% (89.07, 91.66) | 0.2728 (0.0613, 0.9007) |
| Female | SPLMM | 89.42% | 89.45% (89.34, 89.56) | 0.1720 (0.0374, 0.5871) |
| Female | NLMM | 85.98% | 85.95% (85.76, 86.14) | 0.1678 (0.0350, 0.5752) |
| Male | LMM | 80.67% | 80.65% (80.35, 81.94) | 0.2390 (0.0470, 0.8187) |
| Male | STLMM | 88.72% | 91.44% (90.40, 92.48) | 0.2248 (0.0479, 0.8453) |
| Male | SPLMM | 87.59% | 87.61% (87.50, 87.73) | 0.1656 (0.0329, 0.5501) |
| Male | NLMM | 85.10% | 85.07% (84.86, 85.28) | 0.1604 (0.0333, 0.5713) |
Figure 2.16 displays the residuals from all four methods in both males and females. The female
residual plots indicate the LMM, STLMM and SPLMM methods all have residuals distributed
close to the expected distribution (normal for the LMM and SPLMM and skew-t for the
STLMM). Several within-subject outliers (at the tails of the distribution) were not captured in
all methods. However, the NLMM in particular had additional outliers not present with the
other methods. The LMM and SPLMM methods both have some deviation from the normal
distribution at the top end of the curve, signifying that they underestimate the high BMI
values. In contrast, there was an excess of extreme residual values at both ends when using
the NLMM method, indicating a poor fit to the data. It overestimates low BMI values and
underestimates high values, thus underestimating within-individual variability and potentially
leading to conservative inference about genetic associations. The male residuals displayed a
similar pattern to females, although there were fewer obvious outliers. In addition, as there
was less skewness in the males, the STLMM method deviated from the expected t-distribution
but in the opposite direction to that of the females, whereby the low values of BMI are
underestimated. Based on model fit, all four methods were adequate in modelling childhood
growth curves; however, the SPLMM was slightly better than the other methods at accounting
for outliers and had the best model fit.
Figure 2.16: Q-Q plot of residuals for each of the methods for females (top four) and males
(bottom four)
2.6.2 Computation Time
Table 2.20 shows the median (IQR) computation time over 100 models, each adjusting for the main
effect of the FTO SNP and its interaction with time, in each of the four methods. The models
were run in 64-bit R version 2.12.1 on a 64-bit operating system with an Intel Core i7 CPU
Processor (L 640 @ 2.13GHz). The STLMM method is clearly the most computationally
intensive, taking a median of approximately 75 minutes for one model in females and 66
minutes in males. The other three methods are all relatively quick and would allow scalability
to a genome-wide analysis.
Table 2.20: Computation time for the four methods adjusting for the FTO genotype (median
[IQR])
Method   Females                    Males
LMM      13.59sec (13.41, 14.40)    15.84sec (15.66, 16.55)
STLMM    4505sec (4490, 4784)       3962sec (3895, 3970)
SPLMM    23.49sec (23.41, 23.92)    24.07sec (23.78, 24.52)
NLMM     0.01sec (0.00, 0.02)       0.00sec (0.00, 0.02)
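A timing summary of the kind reported in Table 2.20 could be produced along the following lines. This is a sketch in Python, not the R code that generated the table; `fit_model` is a hypothetical placeholder for whatever function fits one model, and the helper names are illustrative only.

```python
import statistics
import time


def time_fits(fit_model, n_runs=100):
    """Wall-clock time in seconds for each of n_runs calls to fit_model()."""
    times = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fit_model()
        times.append(time.perf_counter() - start)
    return times


def summarise(times):
    """Median and quartiles of a list of run times."""
    q1, med, q3 = statistics.quantiles(times, n=4, method="inclusive")
    return {"median": med, "q1": q1, "q3": q3}


def seconds_to_minutes(seconds):
    """Convert seconds to minutes, e.g. to express a median run time per model."""
    return seconds / 60.0
```

Reporting the median and IQR, rather than the mean, keeps the summary robust to the occasional slow run caused by other processes on the machine.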
2.6.3 Ability to Detect Genetic Associations with Known Adult BMI/Obesity SNPs
As mentioned in Section 2.5, genetic loci from 17 genes previously shown to be associated
with adult BMI and subsequently with childhood growth were tested for association with BMI
trajectory using each of the four methods. Table 2.21 displays the number of significant SNPs
detected by each of the methods in females and males. The SPLMM method was able to
detect a higher proportion of associations with childhood growth in both males and females. In
addition, the results also reflected those seen in the cross-sectional analyses where no
complex modelling was necessary. The NLMM method was unable to detect many associations
in either males (only five significant SNPs) or females (two significant SNPs), which follows
from it being a slightly more conservative method than the other three methods. The STLMM
also had the ability to detect a number of genetic effects; however it detected SNPs that were
not associated with BMI in any other method or in the cross-sectional analyses (i.e. TFAP2B in
females and MC4R in males). In addition, it is a more computationally intensive method, which
would prove difficult in larger scale genetic studies such as a GWAS. It is also not as flexible as
the other methods in terms of extensions to look at gene-environment or gene-gene
interactions. The current study provides evidence that the SPLMM method is the most
effective method to detect genetic associations and allows the flexibility for extensions into
larger scale or more complex genetic analyses.
Table 2.21: The number of significant SNPs for each method, using a likelihood ratio test.
Method   Female                           Male
LMM      1 of 17                          3 of 17
STLMM    3 of 17                          4 of 17
SPLMM    5 of 17                          4 of 17
NLMM     2 of 51 (three tests per SNP)    5 of 51 (three tests per SNP)
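The counts in Table 2.21 rest on likelihood ratio tests. A minimal sketch of such a test is given below in Python, assuming the maximised log-likelihoods of the nested models (without and with the SNP terms) are already available. The function names are hypothetical, the degrees of freedom depend on how many SNP terms are tested and are not specified here, and only 1, 2 or 3 degrees of freedom are handled so that no external library is needed.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with df = 1, 2 or 3."""
    if x < 0:
        raise ValueError("x must be non-negative")
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    if df == 3:
        return (math.erfc(math.sqrt(x / 2.0))
                + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))
    raise ValueError("only df = 1, 2 or 3 are implemented in this sketch")


def likelihood_ratio_test(loglik_null, loglik_full, df):
    """Return (test statistic, P-value) for two nested models fitted by ML."""
    stat = max(2.0 * (loglik_full - loglik_null), 0.0)
    return stat, chi2_sf(stat, df)
```

Both models must be fitted by maximum likelihood, not restricted maximum likelihood, for the comparison of fixed effects to be valid.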
2.7 Discussion
The current study has shown that of the four statistical methods evaluated, the SPLMM
method was the most efficient for modelling childhood growth to detect modest genetic
effects in the longitudinal pregnancy cohort study investigated. In addition, it has been shown
that there are potentially different genetic pathways leading to increased growth rate in males
and females and that each additional adult BMI allele increases both average BMI and rate of
growth throughout childhood.
There are several different statistical methods that can be used to model childhood growth.
The four methods were selected as they would allow for adjustment of potential confounders,
appropriately account for the correlation between the repeated measures, allow for
incomplete data, and were computationally feasible in the context of candidate gene studies
and GWASs. Results indicate that the SPLMM method accounts better for the variation in BMI
growth than the LMM method, as it has a smaller residual standard
deviation. The STLMM method used a different scale for BMI and was therefore unable to be
compared using standard measures of model fit such as the AIC, deviance or residual standard
deviation; however, the observed versus fitted values could be compared across all four
methods, as it is scale independent. The SPLMM and NLMM methods produce similar
differences between observed and fitted values. However, there is a larger range in values
from the LMM and STLMM methods, which indicates that these methods are less accurate in
predicting BMI for each individual over time, as they tend to overestimate low BMI and
underestimate high BMI. Although the residual plots indicate the STLMM method has the best
fit to the data, the method does not produce the most accurate predictions as seen by the IQR
for the fitted versus observed values. Furthermore, the estimates of skewness from the
STLMM model were relatively large (Females: intercept=4.5791 [SE=1.0957], slope=2.2336
[SE=0.6269]; Males: intercept=2.8590 [SE=0.5943], slope=1.6628 [SE=0.4155]), which could be
influenced by outliers and result in inaccurate predictions. Based on model fit, all four methods
are adequate in modelling childhood growth curves; however the SPLMM produces the most
accurate fitted values and can account for the majority of the outlying BMI measurements.
Of the 17 genetic variants associated with adult BMI and obesity risk that were investigated,
the SPLMM method was able to detect a higher proportion of associations with childhood
growth in both males and females than the other methods. As expected, the more
conservative NLMM method performed poorly in both males (five significant tests of 51) and
females (two significant tests of 51). The STLMM method detected a number of genetic
effects; however it was a more computationally intensive method and less flexible than others,
which would prove difficult in larger scale genetic studies such as GWASs or gene-gene/gene-
environment interaction studies. The current study provides evidence that the SPLMM method
is the most effective method to detect genetic associations and allows for the flexibility for
extensions into large scale and more complex genetic analyses.
Single genetic loci typically have small effects on complex diseases or explain only a small
proportion of the variability in a quantitative trait; therefore, major increases in disease risk
are expected from simultaneous exposure to multiple genetic risk variants. A post hoc power
calculation using 1,000 non-parametric bootstrap simulations based on the Raine Study data
indicated that this study had 97% power to detect the FTO locus rs1121980 with MAF=0.41,
which has one of the larger effect sizes on BMI, but still had 83% power to detect a more
realistic smaller effect size like the BDNF SNP rs1488830 association in females with MAF=0.21.
In contrast, the power to detect the allele score, combining all risk alleles, was 95% in both
males and females separately. The current study is the first to investigate an association
between 17 published obesity-risk loci as an allele score and BMI trajectory throughout
childhood and adolescence, separately in males and females. Hoed et al [204] used a similar
approach with a 17-loci allele-score but focused on two cross-sectional association analyses in
pre-/early pubertal children and adolescents. By utilizing a longitudinal design, the current
study reduced the number of genetic association tests conducted from eight in a
cross-sectional setting to one per gender, which reduces the need to adjust for multiple testing
and, in turn, the risk of overlooking important genetic loci. A second study by Elks et al [206] evaluated the
association between adult obesity risk genes and growth throughout childhood using a smaller
subset of obesity susceptibility loci and with analyses only up to age 11 years. Both studies
conducted analysis adjusting for gender; however, this does not allow each gender to have
different growth trajectories or the investigation of different timing of the genetic effects.
Substantial differences were found between males and females in the timing of the adiposity
rebound and plateauing towards adulthood. Additionally, the genetic effects had different
timing and magnitude in each gender. By combining males and females into one analysis,
these genetic differences may have been averaged out and the biology underlying the
differences may remain undetected.
A recent longitudinal study investigating the life-course effects of variants in the FTO gene and
near the MC4R gene demonstrated that the effects strengthen throughout childhood and peak
at age 20 before weakening during adulthood [205]. A similar pattern was detected with the
obesity-risk allele score throughout childhood, where the effect begins around four years in
males and seven years of age in females, and increases in size each year. One limitation of the
current study is that the cohort currently only has data available up to 17 years of age. It will be of
interest to follow the cohort to investigate how the combined effect of these SNPs changes as
the cohort progresses into adulthood. Further, it would be valuable to confirm that the SPLMM
method is the most appropriate statistical method in other cohorts investigating the genetic
determinants of childhood growth and the patterns of association across the life course.
Further studies are now required to assess the validity of these findings and also extend them
to perhaps focus on interactions between genes and the environment. Interactions, both gene-
gene and gene-environment, are an important area of research that is critical for
understanding the mechanisms underlying obesity. A small simulation study was performed
using re-sampling techniques based on 1,000 non-parametric bootstrap data sets with
replacement from the Raine Study data and calculating the power to detect a gene-gene
interaction. Two SNP combinations were investigated to gather an understanding of the range
of power in the current study; these included the two most commonly reported BMI
associated loci, FTO rs1121980 (MAF=0.41) by MC4R rs17782313 (MAF=0.23) as well as two
loci with large minor allele frequency, FTO rs1121980 by NEGR1 rs2815752 (MAF=0.38). Based
on these simulations, the current study had 58.0% power to detect an interaction between
two SNPs with larger minor allele frequencies (FTO*NEGR1) and effect sizes (FTO 0.019kg/m2;
NEGR1 0.011kg/m2), while assuming a multiplicative model for the interaction. However, the
power decreases rapidly, to 4.6%, with smaller minor allele frequency (FTO*MC4R) and effect
sizes (FTO 0.004kg/m2; MC4R 0.002kg/m2). This study was therefore not appropriately designed
to detect gene-gene or gene-environment interactions; instead, these results suggest that
meta-analyses of multiple cohorts might be a better way to tackle this problem.
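The general shape of a non-parametric bootstrap power calculation is sketched below in Python. It is deliberately simplified: individuals are resampled with replacement and the test is a simple linear regression slope with a normal approximation, not the interaction term of the mixed model used in this chapter. All function names are hypothetical.

```python
import math
import random


def slope_p_value(x, y):
    """Two-sided P-value for the slope of y on x (normal approximation)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    if sxx == 0:
        return 1.0  # no variation in x, so the slope cannot be tested
    beta = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    alpha = my - beta * mx
    rss = sum((b - alpha - beta * a) ** 2 for a, b in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    if se == 0:
        return 0.0 if beta != 0 else 1.0
    return math.erfc(abs(beta / se) / math.sqrt(2.0))


def bootstrap_power(x, y, n_boot=1000, alpha=0.05, seed=1):
    """Proportion of bootstrap resamples in which the slope test is significant."""
    rng = random.Random(seed)
    n = len(x)
    hits = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        p = slope_p_value([x[i] for i in idx], [y[i] for i in idx])
        hits += p < alpha
    return hits / n_boot
```

For longitudinal data the resampling unit should be the individual, so that all of a person's repeated measures are drawn together and the within-person correlation is preserved.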
2.8 Conclusion
In conclusion, it has been shown that, although all four statistical methods investigated
were appropriate for modelling growth curves in childhood, the
SPLMM method was the most efficient in these data in terms of predicted values and
detection of genetic effects. Further, it was shown that there is some evidence that genetic
variations in established adult obesity-associated genes are associated with childhood growth;
however these effects differ by gender and timing of effect. This study provides further
evidence of genetic effects that may identify individuals early in life that are more likely to
rapidly increase their BMI through childhood, which provides some insight into the biology of
childhood growth.
Chapter 3: Comparing The Semi-Parametric Linear Mixed Model To A Two-Step Approach For Genome-Wide Association Studies
3.1 Introduction
Some authors are currently suggesting the use of a two-step approach to investigate the
genetic associations with longitudinal traits in a genome-wide setting. Using this approach, one
models the phenotype in a mixed effects model framework and then takes summary measures
from this model to analyse against the ~2.5 million genome-wide SNPs. After showing in
Chapter 2 that the Semi-Parametric Linear Mixed Effects Model (SPLMM) was the best fit to
the longitudinal BMI data in the Raine Study, it was necessary to investigate whether a two-
step approach using this model could be implemented in this data for the GWAS.
3.2 Background
There has been a plethora of discussion regarding the ‘missing heritability’ of most traits and
diseases [43,50,234]. Some researchers are sceptical about this idea and state that the
heritability may not be ‘missing’ but rather overestimated by the quantitative genetic studies
to begin with [235] or the fact it is missing is not important for clinical practice [236]. This
‘missing heritability’ comes from the fact that the current estimates of variation explained for
most common complex diseases by the known genes to date are much lower than the original
heritability estimates. For example, the variability explained by the 32 known genetic variants
for adult BMI is 1.45% [72], whereas the heritability estimates are as high as 80% from family
studies [169]. Therefore, to be able to explain more about the genetic associations with a given
disease or trait, we need to develop new statistical methodologies and/or investigate a wider
variety of genetic markers. Progress has begun on both aspects of development, especially
with the advent of next generation sequencing and 1,000 genomes imputation [63]. For
example, it is now possible to look at rare genetic variants through new gene based association
analyses [237,238,239,240,241,242,243]. There are also new methods which use functional
115 Chapter 3: Two-Step Approach
annotation to refine GWAS signals [47,244], and sophisticated gene-gene and gene-
environment interaction tests.
As previously outlined in Chapters 1 and 2, geneticists are also beginning to investigate
longitudinal traits in GWASs. However, the models required for longitudinal traits are
considerably more computationally intensive to conduct than those for cross-sectional traits
(see Chapter 2). Therefore, different methods are being suggested to reduce this
computational burden in GWASs [85,245,246,247]. Some of these methods require a data
reduction method for the phenotypic data, with the summary measures subsequently being
used in the genetic association analysis.
Kerner et al provide a summary of the longitudinal genetic association analyses applied to the
Framingham Heart Study [85]. These analyses either involved:
1. Methods whereby the SNP was added to the longitudinal model, such as those
presented in Chapter 2. However, most of the studies using this method only analysed
a subset of the genome-wide genetic variants [88,90,248];
2. A two-step approach to the analysis whereby the phenotypic data was reduced prior
to the genetic association analysis [87,89,249].
Two of the two-step approaches looked at how genetic variants influenced the trajectory of a
quantitative phenotype over time; the third study reported in Kerner et al that used a two-step
approach was investigating a longitudinal case-control study [249]. Kerner and Muthén used
latent class modelling to define three groups of individuals and then looked at whether SNPs
on chromosome 8 were associated with class memberships; this study therefore investigated
the heterogeneity between individuals due to genetics [87]. In contrast, Roslin et al used
multivariate linear latent growth models to estimate an intercept and slope value for each
individual and then used these parameters as independent scalar outcomes for the genetic
association analyses [89]. These latent models are very similar to the LMM and SPLMM
methods discussed in Chapter 2; however they allow investigation of the relationships
between multiple dependent variables in a multivariate framework. In addition, the traits
presented in Roslin et al had linear trajectories over time, rather than the complex curve
required for childhood BMI. Each of the analyses presented in Kerner et al [85] targeted a
slightly different scientific question; the only publication they presented investigating the
genetic association with both intercept and trajectory of a quantitative trait over time, similar
to this thesis, was Roslin et al [89]. Similar two-step approaches have been used previously in
genetic linkage studies [82,84].
Sikorska et al conducted a simulation study investigating three different two-step approaches
[245]: 1) “slope as outcome” whereby a linear trajectory is estimated for each individual in the
first step and is then regressed against the SNPs in the second step; 2) “two-step” whereby a
linear mixed effects model (LMM) is used in the first step to estimate best linear unbiased
predictors (BLUPs [224]) for the random effects slope parameter in each individual which is
subsequently regressed against the SNPs in the second step; and 3) a “conditional two-step”,
which is the same as the two-step approach except that a conditional linear mixed effects model
is used in the first step [250]. Their two-step approach was the same as that in Roslin et al [89].
Sikorska et al showed that although the conditional two-step method is the most desirable in
terms of both accuracy and computational time, the two-step method is a reasonable
alternative for most scenarios. They also concluded that the two-step approaches were 170
times faster than the one-step LMM for a full GWAS.
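The first of these approaches, “slope as outcome”, is simple enough to sketch without a mixed model library. The Python code below is an illustration only and is not the implementation used by Sikorska et al: the function names and the data layout (dictionaries keyed by individual) are assumptions made for the example.

```python
def ols(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("x has no variation")
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return my - slope * mx, slope


def slope_as_outcome(records, genotypes):
    """Step 1: fit a line per individual. Step 2: regress the slopes on the SNP.

    records:   dict mapping individual id -> list of (time, value) pairs
    genotypes: dict mapping individual id -> genotype coded 0, 1 or 2
    Returns the (intercept, slope) of the second-step regression.
    """
    # Individuals need at least two distinct time points to have a slope.
    ids = [i for i in records if len({t for t, _ in records[i]}) >= 2]
    slopes = [ols([t for t, _ in records[i]], [v for _, v in records[i]])[1]
              for i in ids]
    return ols([genotypes[i] for i in ids], slopes)
```

The second and third approaches replace the per-individual line in step one with BLUPs from an LMM or a conditional LMM, which borrows information across individuals when estimating each slope.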
The conditional LMM used by Sikorska et al [245], initially introduced by Verbeke et al [250],
separates the time stationary (cross-sectional effects) from the time varying (longitudinal
effects) covariates in both the fixed and random effects. It then uses conditional inference
where the random intercept is treated as a nuisance parameter and estimation is performed conditional on
sufficient statistics for the nuisance parameters. This removes both the random intercepts and
the cross-sectional fixed effects from the model. This model was appropriate for Sikorska et al
[245] as they were only interested in the SNP effect over time; however, in this thesis, both the
average SNP effect in childhood and the SNP effect over time are of interest as, to my
knowledge, there has been no GWAS of childhood BMI.
These two-step approaches are used as a ‘screening tool’: the summary statistics are
used to perform the full genome-wide scan quickly, and the significant loci from
the GWAS are then analysed in a full linear mixed model. Although these data
reduction methods have been shown to be successful for reducing the computational burden
while still producing accurate results for longitudinal genetic association studies to date, they
have not been investigated for complex phenotypic traits; they have been used where there is
a linear relationship between the outcome and time, the correlation between the intercept
and slope is low (for example, the correlation between the intercept and slope in Sikorska et al
is -0.14 [245]) and there are normal, independent errors. Of particular interest for this thesis is
the investigation of both the SNP and SNP*age effects on a complex trait which has a
trajectory that varies with time, high correlation between the intercept and slope terms and
non-normal, correlated (continuous auto-regressive) errors; this data therefore needs a
complex model to account for the intricate details in the data.
3.2.1 Aims
The aim of this chapter is to compare the SNP and SNP*age interaction effects using the two-
step approach outlined in Sikorska et al [245] and the SPLMM model presented in Chapter 2.
I hypothesize that the two-step approach will provide different results to the SPLMM model
for the SNP and SNP*age interaction effects. The conditional two-step method presented in
Sikorska et al removes the correlation between the intercept and slope terms by only
modelling the slope, which increases the correlation between the one- and two-step
approaches. As seen in Chapter 2, the correlation between the intercept and slope terms in
the SPLMM is high, which may influence the genetic results when using the two-step
approach.
3.3 Methods
3.3.1 Statistical Methods
Two models were used for the genetic analyses, the SPLMM and the two-step method. The
SPLMM model is a one-step approach whereby the SNPs were added to the fixed effects of the
model. The general SPLMM model is presented in Section 2.4.3.2; the specific model run in this
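analysis including the SNPs and mean centred age (centred at eight years) was:
$$
\begin{aligned}
\log(\text{BMI}_{ij}) ={}& \beta_0 + \beta_1(\text{Age}_{ij} - \overline{\text{Age}}) + \beta_2\frac{(\text{Age}_{ij} - \overline{\text{Age}})^2}{2!} + \beta_3\frac{(\text{Age}_{ij} - \overline{\text{Age}})^3}{3!} \\
&+ \beta_4\frac{(\text{Age}_{ij} - \overline{\text{Age}} - \kappa_1)_+^3}{3!} + \beta_5\frac{(\text{Age}_{ij} - \overline{\text{Age}} - \kappa_2)_+^3}{3!} + \beta_6\frac{(\text{Age}_{ij} - \overline{\text{Age}} - \kappa_3)_+^3}{3!} \\
&+ \beta_7\text{SNP}_i + \beta_8\text{SNP}_i(\text{Age}_{ij} - \overline{\text{Age}}) + \beta_9\text{SNP}_i\frac{(\text{Age}_{ij} - \overline{\text{Age}})^2}{2!} + \beta_{10}\text{SNP}_i\frac{(\text{Age}_{ij} - \overline{\text{Age}})^3}{3!} \\
&+ \beta_{11}\text{SNP}_i\frac{(\text{Age}_{ij} - \overline{\text{Age}} - \kappa_1)_+^3}{3!} + \beta_{12}\text{SNP}_i\frac{(\text{Age}_{ij} - \overline{\text{Age}} - \kappa_2)_+^3}{3!} + \beta_{13}\text{SNP}_i\frac{(\text{Age}_{ij} - \overline{\text{Age}} - \kappa_3)_+^3}{3!} \\
&+ b_{0i} + b_{1i}(\text{Age}_{ij} - \overline{\text{Age}}) + b_{2i}\frac{(\text{Age}_{ij} - \overline{\text{Age}})^2}{2!} + \varepsilon_{ij}
\end{aligned}
\tag{1}
$$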
where κk is the kth knot and (t - κk)+ = 0 if t ≤ κk and (t - κk) if t > κk, which is known as the
truncated power basis and ensures smooth continuity between the time windows. A Taylor
series, which is a representation of a function as an infinite sum of terms that are calculated
from the values of the function's derivatives at a single point, is used in the spline function to
allow for easier model convergence. The three knot points are placed at κ1 = 2, κ2 = 8 and
κ3 = 12 years, with a cubic slope for each spline. SNPi indicates the genotype for individual i,
SNPi ∈ {0, 1, 2}, and Ageij is the age of individual i at time j. The null hypothesis is H0:
β7=β8=β9=0. In other words, the test investigates whether there is a statistically significant effect of
the SNP on average BMI at age 8 and on BMI over time. The other coefficients for the SNP are not
tested as these do not have a comparable estimate using the BLUPs from the random effects.
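To make the structure of the fixed effects in Model 1 concrete, the following Python sketch builds one row of the design matrix for a given age and genotype. It is an illustration under stated assumptions, not the code used for the analysis: the function names are hypothetical, and the knots are assumed to be expressed on the centred age scale, so that knots at 2, 8 and 12 years with centring at 8 years become -6, 0 and 4.

```python
def truncated_cubic(t, knot):
    """Truncated power basis term (t - knot)_+^3 / 3!."""
    d = t - knot
    return d ** 3 / 6.0 if d > 0 else 0.0


def age_basis(age, centre=8.0, knots=(-6.0, 0.0, 4.0)):
    """Terms multiplying beta_0..beta_6: Taylor-scaled polynomial plus splines.

    ASSUMPTION: knots are given relative to the centring age.
    """
    t = age - centre
    row = [1.0, t, t ** 2 / 2.0, t ** 3 / 6.0]
    row += [truncated_cubic(t, k) for k in knots]
    return row


def snp_design_row(age, snp, centre=8.0, knots=(-6.0, 0.0, 4.0)):
    """Terms multiplying beta_0..beta_13: the age basis and SNP times that basis."""
    base = age_basis(age, centre, knots)
    return base + [snp * v for v in base]
```

Position 7 of the returned row is the SNP itself (β7), positions 8 and 9 are the SNP multiplied by the linear and quadratic age terms (β8, β9), and the remaining positions carry the SNP by cubic and spline terms that are not tested.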
The second model, the two-step approach, is where the SNP effects are omitted from the
SPLMM model, such that the following model is fitted to the data only once:
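$$
\begin{aligned}
\log(\text{BMI}_{ij}) ={}& \beta_0 + \beta_1(\text{Age}_{ij} - \overline{\text{Age}}) + \beta_2\frac{(\text{Age}_{ij} - \overline{\text{Age}})^2}{2} + \beta_3\frac{(\text{Age}_{ij} - \overline{\text{Age}})^3}{6} \\
&+ \beta_4\frac{(\text{Age}_{ij} - \overline{\text{Age}} - \kappa_1)_+^3}{6} + \beta_5\frac{(\text{Age}_{ij} - \overline{\text{Age}} - \kappa_2)_+^3}{6} + \beta_6\frac{(\text{Age}_{ij} - \overline{\text{Age}} - \kappa_3)_+^3}{6} \\
&+ b_{0i} + b_{1i}(\text{Age}_{ij} - \overline{\text{Age}}) + b_{2i}\frac{(\text{Age}_{ij} - \overline{\text{Age}})^2}{2} + \varepsilon_{ij}
\end{aligned}
\tag{2}
$$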
Then the second step is to regress the BLUPs of b0i, b1i and b2i on the SNPs for each individual i
with a simple linear regression model:
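$$
\begin{aligned}
b_{0i} &= \beta_0^{*} + \beta_1^{*}\,\text{SNP}_i + \varepsilon_i^{*} \\
b_{1i} &= \beta_0^{**} + \beta_1^{**}\,\text{SNP}_i + \varepsilon_i^{**} \\
b_{2i} &= \beta_0^{***} + \beta_1^{***}\,\text{SNP}_i + \varepsilon_i^{***}
\end{aligned}
\tag{3}
$$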
These models were applied to each of the data sets described below. If the one- and two-step
methods are indeed similar, then testing $\beta_1^{*} = 0$ in Model 3 should give a similar P-Value to
testing $\beta_7 = 0$ in Model 1; likewise, testing $\beta_1^{**} = 0$ in Model 3 should give a similar P-Value to
testing $\beta_8 = 0$ in Model 1, and testing $\beta_1^{***} = 0$ in Model 3 should give a similar P-Value to
testing $\beta_9 = 0$ in Model 1. $\beta_7$ and $\beta_1^{*}$ will be referred to as the SNP main effect, but are also
known as the cross-sectional effect. $\beta_8$ and $\beta_1^{**}$ will be referred to as the SNP*age interaction
effect, and $\beta_9$ and $\beta_1^{***}$ will be referred to as the SNP*age2 interaction effect. To compare the
methods, the difference between the log10(pF) and log10(pT) is summarized for each of the
data sets, where pF is the P-Value for testing the β coefficients in the one-step method (full
model) and pT is the P-Value for testing the β coefficients in the two-step method; specifically,
the standard deviation of the difference, SDdiff, will be investigated so that comparisons can be
made with the results presented in Sikorska et al [245]. In addition, the Pearson correlation
between the two methods for the ratio of the beta coefficient to its standard error will be presented. The
parameters are estimated using maximum likelihood; however, the conclusions were the same
with restricted maximum likelihood estimation.
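The two comparison measures described above can be sketched as follows. This is a minimal Python illustration with hypothetical function names; the inputs are assumed to be lists of P-Values, β estimates and standard errors collected from the one-step and two-step analyses of the same data sets.

```python
import math
import statistics


def sd_diff(p_full, p_two_step):
    """Standard deviation of the difference in -log10 P-values (full minus two-step)."""
    diffs = [-math.log10(pf) + math.log10(pt)
             for pf, pt in zip(p_full, p_two_step)]
    return statistics.stdev(diffs)


def z_ratios(betas, standard_errors):
    """Ratio of each beta coefficient to its standard error."""
    return [b / se for b, se in zip(betas, standard_errors)]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```

Because the standard deviation is unchanged when every difference changes sign, it makes no difference to SDdiff whether log10 or -log10 P-Values are differenced.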
3.3.2 Simulation Study
A simulation study was conducted using 1,000 parametric bootstrap data sets [221]. Each
dataset was generated to resemble the Raine Study data as much as possible. Data was
simulated according to Model 1 for 1,000 individuals measured on up to eight occasions (tij =
1, 2, 3, 6, 8, 10, 14, 17). The actual age of measurement was set to vary between individuals by
up to a year (i.e. individuals had measurements taken up to six months before or after a
birthday), which is representative of longitudinal studies. The fixed effects and variance-
covariance matrix were set to be similar to those from the SPLMM in Model 1 for males in the
Raine Study, excluding all terms containing the SNP (Table 3.1). Measurements were set to
missing at random so that 25% of the BMI measurements were missing, which is equivalent to
the proportion of missing data in the Raine Study under the assumption that all individuals
could have been measured yearly. The SNPs were generated by sampling genotypes from a
multinomial distribution (a value of 0, 1 or 2), with a minor allele frequency of 0.3. Four
combinations of effect sizes were investigated: the SNP main effect with two levels (β7=0 or
β7=0.01) and the SNP*age interaction with two levels (β8=0 or β8=0.001). These effect sizes were
chosen to be similar to that from the allelic score effect in Chapter 2 (i.e. a relatively large
genetic effect).
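The design of the simulated data (genotypes, measurement ages and missingness) can be sketched as below. This Python sketch covers only the design, not the generation of BMI values from Model 1, and it rests on several assumptions that are not stated in the text: genotype probabilities follow Hardy-Weinberg proportions, the age offsets are uniform within six months of each target age, and each scheduled measurement is dropped independently with probability 0.25. All names are hypothetical.

```python
import random

VISIT_AGES = (1, 2, 3, 6, 8, 10, 14, 17)  # target ages in years


def simulate_genotype(rng, maf=0.3):
    """Draw 0, 1 or 2 minor alleles (ASSUMPTION: Hardy-Weinberg proportions)."""
    probs = [(1 - maf) ** 2, 2 * maf * (1 - maf), maf ** 2]
    return rng.choices([0, 1, 2], weights=probs)[0]


def simulate_ages(rng, missing_rate=0.25, jitter=0.5):
    """Ages at which one individual is measured, after jitter and missingness."""
    ages = []
    for target in VISIT_AGES:
        if rng.random() < missing_rate:
            continue  # this measurement is missing
        ages.append(target + rng.uniform(-jitter, jitter))
    return ages


def simulate_cohort(n=1000, seed=2013):
    """List of individuals, each with a genotype and a list of measurement ages."""
    rng = random.Random(seed)
    return [{"snp": simulate_genotype(rng), "ages": simulate_ages(rng)}
            for _ in range(n)]
```

Fixing the seed makes each simulated data set reproducible, which is convenient when the same data set has to be analysed by both the one-step and two-step methods.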
Table 3.1: Parameter estimates from the Raine Study SPLMM model (Model 1) used to
generate the data in the simulation study
Effect        Parameter in Model 2   Value     Effect                  Parameter in Model 2   Value
Intercept     β0                     2.795     SD(b0)                  σ0                     0.118
Age           β1                     0.030     SD(b1)                  σ1                     0.011
Age2          β2                     0.010     SD(b2)                  σ2                     0.002
Age3          β3                     -0.001    Cor(b0, b1)             ρ0                     0.813
Age3:knot 1   β4                     0.083     Cor(b0, b2)             ρ1                     -0.704
Age3:knot 2   β5                     -0.003    Cor(b1, b2)             ρ2                     -0.352
Age3:knot 3   β6                     0.005     SD(ε)                   σ                      0.064
                                               Correlation structure   ρ                      0.383
3.3.3 Chromosome 16 Analysis in the Raine Study
A subset of 1,461 individuals from the Raine Study were used in the analysis (see Chapter 1,
Section 1.6.1 for further details on the Raine Study); 753 males and 708 females, based on the
following inclusion criteria: at least one parent of European descent, unrelated to anyone else
in the sample (one of every related pair, including multiple births, was selected at random), no
significant congenital anomalies, genome-wide genetic data, and at least one measure of BMI
throughout childhood. BMI was calculated from the weight and height measurements, with a
total of 8,670 BMI measures (median six measures per person, IQR: 5-7). A chromosome wide
analysis was conducted using the dosages from the imputed data on chromosome 16 to
investigate how well the two-step method works on the Raine Study data. Chromosome 16
was chosen because it contains the fat mass and obesity gene (FTO), the most replicated
gene to date for BMI in both adults and children [174,251]. It was therefore hypothesised
that some significant loci would be detected, specifically around the FTO gene, as well as many
non-associated SNPs. Models 1 and 2 for the chromosome 16 analysis included the first five
principal components for population stratification (see Section 1.6.1.3 for details) and a
sex*age interaction (where age is the spline function for age); given these would be included in
the GWAS analysis, they were included in this analysis also to ensure an accurate
representation of the GWAS results using the two approaches was achieved. Each SNP was
incorporated to the model assuming an additive genetic effect.
3.4 Results
3.4.1 Simulation Study Results
The simulation study showed that the two-step approach may be appropriate to select the
most significant SNP for further follow-up for the SNP main effect parameter; however, for the
SNP*age interaction term the results are largely different between the two methods. Figure
3.1 shows that when there is no effect of the SNP, i.e. β7=0, the two-step approach produces
fairly similar P-Values to the SPLMM for the SNP main effect; however, when there is a SNP
effect (β7=0.01) the two-step approach gives slightly smaller P-Values than the SPLMM (i.e.
more significant). This is ideal as although some of the most significant loci will be found to be
false positives when investigated in the full SPLMM, all SNPs (or the vast majority) that are
indeed significant are identified. In contrast, the SNP*age interaction effect produced very
different results using the two approaches. The concordance between the two P-Values is very
low, which can be seen by the large standard deviations for the difference and low correlations
presented in Table 3.2. Figure 3.2 shows that the P-Values for the two-step approach tend to
be smaller than the SPLMM when there is a SNP*age effect (i.e. β8=0.001), as seen with the
SNP main effect; however, there is a greater amount of variability between the two P-Values.
Figure 3.1: Comparison of the one and two-step approaches for the SNP main effect from
the 1,000 simulated data sets with different effect sizes for the SNP main effect and SNP*age
interaction effect; on the x-axis is the –log10(PF) and on the y-axis is the –log10(PT).
Table 3.2: Results from the 1,000 simulations. SDdiff is the standard deviation of the
difference between -log10(pF) [P-Value for testing the β coefficients in the one step method]
and -log10(pT) [P-Value for testing the β coefficients in the two-step method] for the 1,000
simulations. r2 is the Pearson correlation coefficient for the ratio of the beta coefficient to
the standard error.
                      β7=0                           β7=0.01
                      β8=0          β8=0.001         β8=0          β8=0.001
                      SDdiff  r2    SDdiff  r2       SDdiff  r2    SDdiff  r2
SNP main effect       0.24    0.93  0.24    0.94     0.36    0.94  0.38    0.93
SNP*age interaction   0.60    0.41  0.79    0.41     0.68    0.39  1.01    0.40
Figure 3.2: Comparison of the one and two-step approaches for the SNP*age interaction
effect from the 1,000 simulated data sets with different effect sizes for the SNP main effect
and SNP*age interaction effect; on the x-axis is the –log10(PF) and on the y-axis is the
–log10(PT).
Interestingly, the SNP main effect appears to be unaffected by a significant SNP*age
interaction effect (i.e. SDdiff=0.24 for both β8 values in Table 3.2), whereas the SNP*age
interaction effect is affected by a significant SNP main effect (i.e. SDdiff=0.60 when β7=0 but
SDdiff=0.68 when β7=0.01). This is consistent with Verbeke et al [250] and Sikorska et al [245]
who both discuss that the longitudinal effect, or the SNP*age interaction effect, is affected by
the cross-sectional component of an LMM and hence propose the conditional linear mixed
model which estimates the longitudinal effect independent of the cross-sectional effect.
Figure 3.3 displays the β estimates and standard errors from the simulations with significant
SNP main effect and SNP*age interaction effect. It illustrates that the standard error is much
smaller in magnitude using the two-step approach for both the SNP main effect and the
SNP*age interaction. The β and standard errors in Figure 3.3 show little variability for the SNP
main effect term; however both estimates differ greatly for the SNP*age interaction effect
term, which leads to the differences in P-Values as seen in Figure 3.2. This is consistent for all
combinations of effect sizes for β7 and β8, hence just β7=0.01 and β8=0.001 are presented
here. Figure 3.3 illustrates why random effects are referred to as ‘shrinkage estimates’,
particularly for the SNP*age interaction effect, as the estimates are shrunk towards the
population average and hence the SNP effect is biased towards zero [252].
Figure 3.3: Comparison of the β and SE estimates using the one and two-step approaches for
the SNP main effect and SNP*age interaction effect from the 1,000 simulated data sets where
both the SNP main effect and SNP*age interaction effect were significant; on the x-axis are
the estimates (β and SE(β)) from the SPLMM and on the y-axis are the estimates from the
two-step approach.
As mentioned previously, there are several differences between the BMI data presented in this
thesis and the bone mineral density data presented by Sikorska et al [245]. These differences
include a complex function of age to model the BMI trajectory over time, high correlation
between the intercept and slope terms, and continuous auto-regressive errors. Therefore,
additional simulations were conducted to investigate the differences between these results
and those from Sikorska et al [245]. The following additional scenarios were simulated:
1. Low correlation between intercept and slope terms: the correlations between the
intercept and slope terms in the BMI model were relatively high (ρ0, ρ1 and ρ2 in Table
3.1), whereas the correlation between the intercept and linear trajectory in the
Sikorska et al [245] publication was only -0.140. For these additional simulations,
ρ0=0.1, ρ1=-0.1 and ρ2=-0.1.
2. Linear random effects: Sikorska et al [245] were able to fit a simple linear trajectory to their
bone mineral density data for both the fixed and random effects, whereas the random effects
for the BMI data included a quadratic curve. Therefore, the age2 parameter was removed from
the random effects to simulate the data (i.e. the same random effects structure as in Sikorska et al),
with SD(b0) and SD(b1) the same as in Table 3.1. Adjustments to the models were made to
incorporate this change; Model 1 became:
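$$
\begin{aligned}
\log(\mathrm{BMI}_{ij}) ={}& \beta_0 + \beta_1(\mathrm{Age}_{ij}-\overline{\mathrm{Age}}) + \beta_2\frac{(\mathrm{Age}_{ij}-\overline{\mathrm{Age}})^2}{2} + \beta_3\frac{(\mathrm{Age}_{ij}-\overline{\mathrm{Age}})^3}{6} \\
&+ \beta_4\frac{(\mathrm{Age}_{ij}-\overline{\mathrm{Age}}-\kappa_1)^3}{6} + \beta_5\frac{(\mathrm{Age}_{ij}-\overline{\mathrm{Age}}-\kappa_2)^3}{6} + \beta_6\frac{(\mathrm{Age}_{ij}-\overline{\mathrm{Age}}-\kappa_3)^3}{6} \\
&+ \beta_7\mathrm{SNP}_i + \beta_8\mathrm{SNP}_i(\mathrm{Age}_{ij}-\overline{\mathrm{Age}}) + \beta_9\mathrm{SNP}_i\frac{(\mathrm{Age}_{ij}-\overline{\mathrm{Age}})^2}{2} + \beta_{10}\mathrm{SNP}_i\frac{(\mathrm{Age}_{ij}-\overline{\mathrm{Age}})^3}{6} \\
&+ \beta_{11}\mathrm{SNP}_i\frac{(\mathrm{Age}_{ij}-\overline{\mathrm{Age}}-\kappa_1)^3}{6} + \beta_{12}\mathrm{SNP}_i\frac{(\mathrm{Age}_{ij}-\overline{\mathrm{Age}}-\kappa_2)^3}{6} + \beta_{13}\mathrm{SNP}_i\frac{(\mathrm{Age}_{ij}-\overline{\mathrm{Age}}-\kappa_3)^3}{6} \\
&+ b_{0i} + b_{1i}(\mathrm{Age}_{ij}-\overline{\mathrm{Age}}) + \varepsilon_{ij}
\end{aligned}
$$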
Model 2 was the subset of Model 1 without the SNP effects and Model 3 became:
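$$
\begin{aligned}
b_{0i} &= \beta^{*}_{0} + \beta^{*}_{1}\,\mathrm{SNP}_i + \varepsilon^{*}_{i} \\
b_{1i} &= \beta^{**}_{0} + \beta^{**}_{1}\,\mathrm{SNP}_i + \varepsilon^{**}_{i}
\end{aligned}
$$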
3. Cubic fixed effects: Given that the spline function in the SPLMM is complex, further
simulations were conducted with a cubic polynomial function among the fixed effects
instead. The random effects remained as a quadratic function. Model 1 was
therefore:
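$$
\begin{aligned}
\log(\mathrm{BMI}_{ij}) ={}& \beta_0 + \beta_1(\mathrm{Age}_{ij}-\overline{\mathrm{Age}}) + \beta_2(\mathrm{Age}_{ij}-\overline{\mathrm{Age}})^2 + \beta_3(\mathrm{Age}_{ij}-\overline{\mathrm{Age}})^3 \\
&+ \beta_4\mathrm{SNP}_i + \beta_5\mathrm{SNP}_i(\mathrm{Age}_{ij}-\overline{\mathrm{Age}}) + \beta_6\mathrm{SNP}_i(\mathrm{Age}_{ij}-\overline{\mathrm{Age}})^2 + \beta_7\mathrm{SNP}_i(\mathrm{Age}_{ij}-\overline{\mathrm{Age}})^3 \\
&+ b_{0i} + b_{1i}(\mathrm{Age}_{ij}-\overline{\mathrm{Age}}) + b_{2i}(\mathrm{Age}_{ij}-\overline{\mathrm{Age}})^2 + \varepsilon_{ij}
\end{aligned}
$$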
The random effects were extracted from the non-genetic model the same as
previously and used for Model 3.
4. Linear fixed and random effects: Similar to the random effects, a linear trajectory over
time was simulated with the intercept and age interacting with the SNP; β0 and β1
were the same as in Table 3.1. Given the random effects need to be a subset of the
fixed effects [253], a linear trajectory was also used in the random effects.
Adjustments to Model 1 were made to incorporate this change:
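$$
\begin{aligned}
\log(\mathrm{BMI}_{ij}) ={}& \beta_0 + \beta_1(\mathrm{Age}_{ij}-\overline{\mathrm{Age}}) + \beta_2\mathrm{SNP}_i + \beta_3\mathrm{SNP}_i(\mathrm{Age}_{ij}-\overline{\mathrm{Age}}) \\
&+ b_{0i} + b_{1i}(\mathrm{Age}_{ij}-\overline{\mathrm{Age}}) + \varepsilon_{ij}
\end{aligned}
$$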
The random effects were extracted from the non-genetic model the same as
previously and used for Model 3.
5. Independent correlation structure: Data were simulated assuming the within-subject
errors were independently distributed; the errors were therefore sampled from a normal
distribution with specified variance σ2=0.0642.
All of the other parameters were held identical to those of the previous simulations (Table
3.1).
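The second stage of the two-step approach used throughout these scenarios (Model 3) is a simple linear regression of each extracted random effect on the SNP. The following is a minimal sketch of that stage only, assuming the random effects have already been extracted from the non-genetic model and that the SNP is coded additively as 0, 1 or 2; the function name `ols_snp` is hypothetical.

```python
import math


def ols_snp(blups, genotypes):
    """Regress one extracted random effect (e.g. b0i or b1i) on genotype.

    Returns the slope (the SNP effect estimate) and its standard error.
    """
    n = len(blups)
    mean_g = sum(genotypes) / n
    mean_b = sum(blups) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    sxy = sum((g - mean_g) * (b - mean_b) for g, b in zip(genotypes, blups))
    beta = sxy / sxx
    intercept = mean_b - beta * mean_g
    # Residual sum of squares with n - 2 degrees of freedom
    rss = sum((b - intercept - beta * g) ** 2 for g, b in zip(genotypes, blups))
    se = math.sqrt(rss / (n - 2) / sxx)
    return beta, se


# Illustrative (made-up) values for six individuals
beta, se = ols_snp([0.1, 0.3, 0.5, -0.1, 0.3, 0.7], [0, 1, 2, 0, 1, 2])
```

The ratio of `beta` to `se` is the quantity compared between the one-step and two-step approaches in Tables 3.2 and 3.3.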
The standard deviations of the difference in P-Values between the one and two-step
approaches (SDdiff) from these additional simulations, along with the Pearson correlation of
the ratio between the beta coefficient and standard error, are presented in Table 3.3.
Comparing these results to those presented in Table 3.2, it is apparent that the simulations
with a linear trajectory among the fixed effects (similar to the model presented in Sikorska et
al [245]), rather than the complex spline function, reduced the difference in P-Values between
the two methods. When there was no SNP or SNP by age effect, for example, the SDdiff
reduced from 0.60 in the model with the full spline function to 0.48 with the cubic function
and 0.13 for the linear term. The model with a linear trajectory in the fixed and random effects
produced a similar SDdiff to that reported in Sikorska et al (i.e. 0.17) [245]. A reduction in the
SDdiff was also observed for the simulations with a linear trajectory in the random effects, but
the reduction was small. When the correlation between the intercept and slope parameters in
the random effects is low, the test of the SNP*age interaction effect is no longer affected by
the presence of a SNP main effect in the model (i.e. SDdiff=0.60 when β7=0 and SDdiff=0.58
when β7=0.01). In addition, the correlation structure for the within individual errors does not
seem to influence the difference in P-Values between the two methods.
Table 3.3: Results of the 1,000 simulations in the additional scenarios. SDdiff is the standard
deviation of the difference between -log10(pF) [P-Value for testing the β coefficients in the
one step method] and -log10(pT) [P-Value for testing the β coefficients in the two-step
method]. r2 is the Pearson correlation coefficient for the ratio of the beta coefficient to the
standard error.
                                                    β7=0                            β7=0.01
                                                    β8=0           β8=0.001         β8=0           β8=0.001
                                                    SDdiff  r2     SDdiff  r2       SDdiff  r2     SDdiff  r2
Low correlation between intercept and slope terms   0.60    0.34   0.95    0.42     0.58    0.40   0.90    0.39
Linear random effects                               0.58    0.39   0.81    0.37     0.64    0.40   0.99    0.42
Cubic fixed effects                                 0.48    0.67   0.70    0.65     0.63    0.63   0.83    0.66
Linear fixed and random effects                     0.13    0.98   0.27    0.98     0.32    0.98   0.22    0.98
Independent correlation structure                   0.56    0.34   0.79    0.38     0.72    0.40   0.98    0.38
3.4.2 Chromosome 16 SNPs in the Raine Study
There were 68,690 imputed SNPs on chromosome 16 with a MAF greater than 1% and
imputation quality (R2) greater than 0.3. The results from the chromosome 16 analysis were
consistent with the simulations. Figure 3.4 displays the P-Values for each of the SNP effects; it
can be concluded from these plots that the two-step approach is not appropriate for the
SNP*age and SNP*age2 interaction effects, as the P-Values are both under and overestimated.
The standard deviations for the difference also support this; SDdiff for the SNP main effect is
0.19, and 0.56 for both the SNP*age interactions. However, even the SDdiff for the SNP main
effect is larger than that from the real data example in Sikorska et al (SDdiff=0.117) [245].
Figure 3.5 compares the β and standard errors between the two approaches for each of the
SNP terms. When the β estimates using the SPLMM are large (i.e. either a strong negative or
positive effect), the two-step approach produces β estimates closer to zero. This indicates that
the random effects may be poor surrogates for the intercept and slope of a linear mixed model
and thus the genetic effects are biased towards zero. As observed in the simulation study, the
standard errors are consistently smaller using the two-step approach, and seem to be
increasingly underestimated as the standard error from the SPLMM increases. This pattern can
be seen for all the SNP effects; however, it appears to be far worse for the age interaction terms.
Figure 3.4: Comparison of the one and two-step approaches for each of the SNP effects
from analysis of the chromosome 16 data in the Raine Study; on the x-axis is the –log10(PF)
and on the y-axis is the –log10(PT).
Figure 3.5: Comparison of the β and SE estimates using the one and two-step approaches for
each of the SNP effects from the chromosome 16 analysis in the Raine Study; on the x-axis
are the estimates (β and SE(β)) from the SPLMM and on the y-axis are the estimates from the
two-step approach.
Focusing on the FTO SNP from Chapter 2, rs1121980, the P-Values for the SNP main effect are
the same (P-Value(β7) = P-Value(β1*) = 0.0001), whereas for the age interaction terms the
P-Values from the two-step approach are much smaller than those from the one-step approach
(P-Value(β8)=0.0021, P-Value(β1**)=0.0002, P-Value(β9)=0.2411, P-Value(β1***)=0.0005). The
results from the SPLMM are more consistent with the results seen in Chapter 2 with the other
longitudinal methods, whereby the SNP*age2 effect in particular failed to reach nominal levels
of significance.
3.5 Discussion
In this Chapter it has been shown that the two-step approach, which has been suggested by
several authors, is not precise for detecting SNP*age interactions for complex longitudinal
phenotypes, such as BMI over childhood. As previously outlined, BMI in childhood is complex as
it has a non-linear trajectory over time, high correlation between the intercept and slope
terms, and non-normal, correlated (continuous auto-regressive) errors. Although this approach
has been successful in detecting genome-wide significant associations for both the SNP main
effect and SNP*time interaction effect in traits that have a linear trajectory [89], the results
shown here indicate that when the phenotype has a non-linear trajectory over time, the two-
step approach produces inaccurate SNP associations. This is perhaps due to the subset of
parameters in the random effects not being able to accurately summarize the full trajectory in
the fixed effects. In addition, when the correlation between the intercept and slope
parameters in the random effects is high, the SNP*age interaction effect using the two-step
approach is affected by a significant SNP main effect, whereby it is less accurate at detecting
an association. These results are consistent with Kerner et al who alluded to the fact that the
two-step approach is far from ideal and could potentially lead to biased results [85]. Therefore,
although this two-step method substantially reduces the computational time of a GWAS, for
complex phenotypes it may be detrimental to detecting genetic associations and is not
recommended.
3.6 Conclusions
It would be ideal to utilize an approach that is computationally efficient for the full GWAS as a
screening tool to select a subset of SNPs for further follow-up in a more complex model.
However, the results presented in this Chapter show that the two-step approach reported to
date does not accurately select SNPs with a significant SNP*age interaction when the phenotype
has a complex trajectory over time. Therefore, it is important to compare results from a one-
and two-step approach for a number of SNPs to confirm that the two-step approach is
accurate in detecting SNPs of interest for a particular phenotype, before conducting a full
GWAS. It is concluded that the two-step approach is not valid for BMI trajectories over childhood
and therefore the one-step approach will continue to be the focus for the remainder of this
thesis.
Chapter 4: Robustness Of The Linear Mixed Effects Model To Distribution Assumptions And Consequences For Genome-Wide Association Studies
4.1 Introduction
The results presented in this chapter have been submitted for review at Statistical Applications
in Genetics and Molecular Biology; the manuscript is included as an appendix (Appendix C).
As concluded in Chapter 2, the SPLMM is the most efficient for modelling childhood growth to
detect modest genetic effects. All four methods discussed in Chapter 2 were applied to the
ALSPAC data (outlined in Chapter 1, Section 1.6.2) to ensure this model was generalizable to
other cohorts. None of the model assumptions from the four methods were satisfied, partly
due to the two different sources of measurement used in this study. This chapter outlines the
issues with the model misspecification and a potential solution when conducting a GWAS.
4.2 Background
Over recent years, the study of population genetics has progressed from candidate gene and
linkage studies over relatively small regions of the genome, to whole genome association
analyses. These GWASs are designed to search the entire genome for SNPs that are associated
with a disease or trait of interest. If SNPs are found to be associated, they are then considered
to mark a region of the genome that influences the risk of disease or affects the levels of a
trait. In general, very small effects are expected and hence large sample sizes are required.
This advance in the scale of genetic analyses has transformed the field from hypothesis driven
research to a hypothesis-free approach, which has required additional statistical methods to
be developed to ensure there is a balance between acceptable levels of power and the chance
of inflating the type 1 error. Given the cost of conducting these studies, in terms of both
monetary costs for genotyping samples and computational costs for the analysis, it is
important that appropriate analyses are conducted from the outset.
134 Chapter 4: Simulation Study
To date, most of the GWASs have focused on case/control studies of particular diseases or
cross-sectional measurements of phenotypic traits. These study designs typically use relatively
simple statistical techniques, such as chi-square tests or linear (or logistic) regression models,
to investigate the association between a trait and each of the ~2.5 million SNPs. There are
now over 1,500 published studies focusing on 250 traits using analyses of this kind [48].
However, researchers are beginning to focus on more complex analyses to uncover additional
genetic loci and reduce the currently unexplained heritability of these traits. One area
requiring extension is to use longitudinal studies, with repeated measures on each individual in
the study, to understand how SNPs affect changes over time of a particular phenotype
[85,245,246]. Several established statistical methods are commonly used for repeated
measures data to take into account the non-independence of measurements within an
individual. For continuous traits, the most popular statistical method is the LMM by Laird and
Ware [215]. This method can be computationally intensive as the model can account for linear
or non-linear trajectories for the outcome of interest over time, correlation between measures
at the starting point (intercept) and change over time (slope, or non-linear trajectory) within
an individual, and adjustment for both time-independent and time-dependent covariates.
Several methods are available to reduce this computational burden in GWASs [85,245,246],
most of which suggest a two-step approach whereby one models the phenotype in the mixed
effects model framework and then takes summary measures (e.g. the Best Linear Unbiased
Predictors [BLUPs] for the intercept and slope parameters of a linear, random-intercept
random-slope model) from this to analyse against the ~2.5 million SNPs. However, as
illustrated in Chapter 3, these data reduction methods are not appropriate for complex traits
that have non-linear trajectories over time; therefore these data reduction techniques will not
be discussed further.
In LMMs, the usual assumptions made about the random effects and error distributions
include: the random effects and error terms are normally distributed, the random effects are
independent of the error term, and the error term has homoscedastic variance [215]. In
studies that utilize this method to assess the association of a SNP with the trajectory, the fixed
effect estimates are often of most interest; the random effects and correlation structure at the
individual level are necessary to provide an accurate fit of the model to the data, in addition to
providing appropriate test statistics, but are treated as nuisance parameters and are often
difficult to interpret. There have been a number of studies investigating whether violations of
the assumptions about the random effects and error terms affect the maximum likelihood
inference of the fixed effect parameters and their variance estimates; several manuscripts
have shown that the fixed effects estimates are robust to non-Gaussian random effects
distribution [254,255], non-Gaussian or heteroscedastic error distribution [256] and that the
population fixed effects are robust to misspecified covariance structure [257], but the
individual level predictions are not [258]. Jacqmin-Gadda et al [256] showed the fixed effects
estimates are not robust to error variance that is dependent on a covariate in the model that
interacts with time. Liang and Zeger [259] demonstrated that a robust sandwich estimator
[260] can correct for biased variance estimates of the fixed effects when the covariance
structure is not correctly specified. To my knowledge, there has not been any investigation
into how any of these model misspecifications affect the power and type 1 error in high
dimensional studies, for example when running an LMM on a genome-wide scale, and what
the value of the robust variance estimator is in this context.
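For reference, the robust (sandwich) variance estimator referred to above is commonly written in the following form. This is the standard textbook expression rather than a formula taken from this thesis; the notation is an assumption: Yi is the response vector, Xi the fixed effects design matrix, Vi the working marginal covariance matrix of Yi implied by the fitted model, and N the number of individuals.

```latex
\widehat{\mathrm{Var}}(\hat{\beta}) =
\left( \sum_{i=1}^{N} X_i^{\top} \hat{V}_i^{-1} X_i \right)^{-1}
\left( \sum_{i=1}^{N} X_i^{\top} \hat{V}_i^{-1} \hat{r}_i \hat{r}_i^{\top} \hat{V}_i^{-1} X_i \right)
\left( \sum_{i=1}^{N} X_i^{\top} \hat{V}_i^{-1} X_i \right)^{-1},
\qquad \hat{r}_i = Y_i - X_i \hat{\beta}
```

When the working covariance is correctly specified, the middle term is expected to cancel with one of the outer terms, leaving the usual model-based variance; when it is misspecified, the empirical residual cross-products in the middle term provide the correction.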
4.2.1 Aims
The aim of this study is to assess by simulations whether misspecification of the error term,
with either non-Gaussian error distributions or non-constant error variance, in a complex
longitudinal model with non-linear trajectories will affect: 1) the coverage rates of the 95%
confidence interval of the fixed effects parameter estimates; 2) the bias of the fixed effects
parameter estimates; 3) the statistical power to detect association; or 4) the type 1 error of
SNP detection in a GWAS. Differences in the conclusions due to MAF for the SNPs or sample
size of the investigated cohort were also examined.
4.3 Motivating Example
The Avon Longitudinal Study of Parents and Children (ALSPAC) [261] is a birth cohort study;
14,541 pregnant women in the former county of Avon, UK, were recruited into the study if
they had an expected delivery date between 1st April 1991 and 31st December 1992 (described
in detail in Chapter 1, Section 1.6.2). A subset of 7,916 participants were used for analysis
based on the following inclusion criteria: at least one parent of European descent, singleton
birth, unrelated to anyone in the sample, genome-wide genotype data, and at least one
measure of BMI throughout childhood. Participants have a median of nine BMI measurements
between 1 and 15 years of age (interquartile range 5-12, range 1-29 measurements). Figure 4.1
shows the BMI trajectory for 20 randomly selected individuals. There is a large amount of
variability between individuals for both intercept and slope, with slight curvature in the
trajectory and a nadir at approximately five to seven years of age.
Figure 4.1: Individual BMI trajectories for 20 females from ALSPAC
Figure 4.2 graphically shows how each of the BMI measurements in ALSPAC was taken. From
birth to five years, length and weight measurements were extracted from health visitor
records, with up to four measurements taken on average at six weeks and 10, 21, and 48
months of age. For a random 10% of the cohort, length and height measurements were taken
in eight research clinic visits, held between the ages of four months and five years of age. From
age seven years upwards, all children were invited to research clinics, held annually from ages seven
to 11 and biannually thereafter. Details of the measuring equipment used in the
clinics are described elsewhere [189]. In addition, parent-reported child heights and weights were
also available from questionnaires (27% of measurements); only questionnaire data was
available around the nadir of the BMI trajectory, also known as the adiposity rebound, which
occurs between five and seven years of age in most children. Whilst the measurements from
routine health care have previously been shown to be accurate in this cohort [190], parental
report of children’s height tends to be over-estimated while weight tends to be under-
estimated [191].
Figure 4.2: BMI measurements over time, by measurement source, in ALSPAC
As seen in Figure 4.2, the variability of BMI increases over time, with more individuals reaching
BMI values of 30 or greater (the obese cut-off point in adults) at the later time points. This
increase of individuals with BMI>30 also induces a skewed distribution with a long right-hand
tail.
The primary research question is to identify SNPs that are associated with average BMI and
change in BMI over childhood and adolescence in the ALSPAC data. A LMM was used to
appropriately model the longitudinal trajectory over childhood, to account for the large
correlation between each of the random effects parameters, to adjust for additional covariates
such as the source of the height/weight measurements (clinic or questionnaire) and to allow
data to be missing at random across childhood. The general form of the model is as follows:
$$Y_i = X_i\beta + Z_i b_i + \varepsilon_i \qquad (4.1)$$
where Yi is the response vector for the ith individual, β is the vector of fixed effects,
bi ~ N(0, Σ) is the vector of subject-specific random effects, Xi and Zi are the fixed effect and
random effect regressor matrices respectively and εi ~ N(0, σ²) is the within-subject error
vector. When applying this model to the ALSPAC data, the best model fit included a cubic
polynomial of mean centred age (centred at eight years) in the fixed effects, a quadratic
polynomial of mean centred age in the random effects and a continuous autoregressive
correlation structure of order one for the covariance of the within-subject errors. Hence, the
final model for both females and males was:
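$$
\begin{aligned}
\mathrm{BMI}_{ij} ={}& \beta_0 + \beta_1 t_{ij} + \beta_2 t_{ij}^2 + \beta_3 t_{ij}^3 + \beta_4 \mathrm{MS}_{ij} + \beta_5 \mathrm{SNP}_i + \beta_6 t_{ij}\mathrm{SNP}_i \\
&+ \beta_7 t_{ij}^2 \mathrm{SNP}_i + \beta_8 t_{ij}^3 \mathrm{SNP}_i + b_{i0} + b_{i1} t_{ij} + b_{i2} t_{ij}^2 + \varepsilon_{ij}
\end{aligned}
\qquad (4.2)
$$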
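where MSij is the measurement source (i.e. clinical visit or questionnaire) of individual i at time
j, tij is the age (centred at eight years) and εij ~ N(0, Ri) such that:
$$
R_i = \sigma^2
\begin{pmatrix}
1 & \rho & \cdots & \rho^{j-1} \\
\rho & 1 & \cdots & \rho^{j-2} \\
\vdots & \vdots & \ddots & \vdots \\
\rho^{j-1} & \rho^{j-2} & \cdots & 1
\end{pmatrix}
$$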
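As an illustration of the within-subject error covariance, the sketch below builds Ri for one individual under a continuous autoregressive structure of order one. It assumes the usual continuous-time form in which the correlation between two measurements decays as ρ raised to the absolute difference in age; the function name is hypothetical, and the values of ρ and σ are the estimates reported in Table 4.1.

```python
def car1_covariance(ages, rho, sigma):
    """Covariance matrix with entries sigma^2 * rho^|t_j - t_k| (assumed CAR(1) form)."""
    return [
        [sigma ** 2 * rho ** abs(t_j - t_k) for t_k in ages]
        for t_j in ages
    ]


# Three measurements at unequally spaced ages (illustrative ages only)
R_i = car1_covariance([1.0, 2.0, 3.5], rho=0.394, sigma=1.063)
```

With equally spaced yearly measurements this reduces to the familiar discrete pattern with 1 on the diagonal and successive powers of ρ off the diagonal.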
Therefore β0 is the population intercept (i.e. mean BMI at age 8), β1, β2 and β3 are the fixed
effects for the cubic function of age, β4 is the effect of the measurement source, β5 is the change in the
mean BMI at eight years of age for each additional copy of the minor allele, β6 is the SNP by
linear age effect, and β7 and β8 are the SNP by quadratic and cubic age effects respectively.
Although this was not the optimal model selected in Chapter 2 for BMI growth trajectory
modelling, this simpler function to model the curvature over time was chosen so that the
effects from the simulation study would be relatively interpretable.
The residuals from this model fitted to the ALSPAC data are displayed in Figure 4.3. Figure 4.3A
shows that the residuals have fairly constant variance for the clinic measures (which includes
the Children in Focus and children’s health records), but those in Figure 4.3B show there is
greater variability for the questionnaire measures particularly around the adiposity rebound. It
is also evident that the model is unable to estimate the BMI values well for the questionnaire
measures (Figure 4.3D) and the residuals deviate further from the Gaussian distribution
assumption (Figure 4.3F).
Figure 4.3: Residual plots, by measurement source, for the LMM model fit to the ALSPAC
data
Due to the nature of the data collection, which is often intricate in large cohort based studies
such as ALSPAC, the model assumptions were not met due to the following:
1. The questionnaire measures have previously been shown to have greater variability
than clinic measured height and weight [191]; therefore the variability was dependent
on a covariate in the model.
2. There were only questionnaire measures available around the nadir of the trajectory
(also known as the adiposity rebound), which meant there was greater variability
around the rebound.
3. The variability within individuals changes over time, particularly with increased
variability around puberty and into adolescence.
4. BMI also has a non-Gaussian error distribution. This is in part due to the increasing
variability between individuals over time, with some individuals having rapidly
increasing BMI while others remain relatively consistent.
In the following simulation study, the robustness of the maximum likelihood inference for the
fixed effects is investigated, along with the power and the type 1 error for detecting an
association with the SNP when the error distribution is misspecified due to the above
intricacies of the data.
4.4 Simulation Study
Extensive simulations were carried out to investigate the effects on the LMM when the error
term (also called the level 1 residual, or the occasion-level residual) in the model was non-
Gaussian or had a non-constant variance. In each of the simulation scenarios, we set the non-
genetic fixed effects parameters (β0-β4 from model (4.2)) and the variance-covariance matrix
similar to those coming from the fitted model for BMI adjusting for the FTO rs1121980 SNP in
the ALSPAC study (Table 4.1). The measurement source, which is a fixed effect in the LMM and
used in the heteroscedastic error simulations, was a randomly generated binary variable for
each individual at each time point with distribution throughout the ages similar to the
distribution in ALSPAC (percent questionnaire measurements per follow-up year: year 1=40%,
year 2=20%, year 3=40%, year 4=10%, year 5=60%, year 6=99%, year 7=10%, year 10=10%,
year 13=30%; the remaining years had 0% questionnaire measurements).
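The generation of the measurement source can be sketched as follows, assuming independent Bernoulli draws at each follow-up year with the questionnaire percentages listed above. The coding (1 for questionnaire, 0 for clinic), the dictionary name and the function name are assumptions made for illustration.

```python
import random

# Probability that a measurement comes from a questionnaire, by follow-up year;
# years not listed had 0% questionnaire measurements.
QUESTIONNAIRE_PROB = {
    1: 0.40, 2: 0.20, 3: 0.40, 4: 0.10, 5: 0.60,
    6: 0.99, 7: 0.10, 10: 0.10, 13: 0.30,
}


def simulate_source(follow_up_years, rng):
    """Return 1 (questionnaire) or 0 (clinic) for each follow-up year."""
    return [
        1 if rng.random() < QUESTIONNAIRE_PROB.get(year, 0.0) else 0
        for year in follow_up_years
    ]


rng = random.Random(2013)  # fixed seed so the sketch is reproducible
source = simulate_source(range(1, 16), rng)
```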
The fixed effect estimation for various sample sizes, minor allele frequencies of the SNP and
the SNP effect sizes were also investigated:
1. Sample size: two levels; N=1,000 and N=3,000
2. MAF: four levels; 0.1, 0.2, 0.3 and 0.4
3. Effect sizes: two combinations; β5=0.6, β6=0.15, β7 = -0.000752 and β8 = -0.000380
(alternative hypothesis) or β5 = β6= β7 = β8 = 0 (null hypothesis). The alternative
hypothesis effect sizes for β5 and β6 were chosen to have 80% power to detect with
the larger sample size (N=3,000); the effect sizes for β7 and β8 were similar to those
coming from the fitted model for BMI adjusting for the FTO rs1121980 SNP in the
ALSPAC study.
Table 4.1: Parameter estimates from the ALSPAC non-genetic model used to generate the
data in the simulation study
Effect                  Parameter   Value
Intercept               β0          16.534
Age                     β1          0.400
Age2                    β2          0.056
Age3                    β3          -0.003
Source                  β4          -0.153
SD(b0)                  σ0          2.092
SD(b1)                  σ1          0.269
SD(b2)                  σ2          0.0235
Cor(b0, b1)             ρ0          0.820
Cor(b0, b2)             ρ1          -0.389
Cor(b1, b2)             ρ2          -0.092
SD(ε)                   σ           1.063
Correlation structure   ρ           0.394
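To show how the values in Table 4.1 translate into simulation inputs, the sketch below computes the population mean BMI at a given age and assembles the random effects covariance matrix from the standard deviations and correlations. The coding of the measurement source (1 for questionnaire, 0 for clinic) and the function names are assumptions, not details taken from the thesis.

```python
def mean_bmi(age, source=0, centre=8.0):
    """Population mean BMI from the fixed effects in Table 4.1 (no SNP effect)."""
    t = age - centre  # age centred at eight years
    return 16.534 + 0.400 * t + 0.056 * t ** 2 - 0.003 * t ** 3 - 0.153 * source


def random_effects_covariance():
    """Covariance of (b0, b1, b2): product of the two SDs and their correlation."""
    sd = [2.092, 0.269, 0.0235]
    cor = [
        [1.0, 0.820, -0.389],
        [0.820, 1.0, -0.092],
        [-0.389, -0.092, 1.0],
    ]
    return [[sd[j] * sd[k] * cor[j][k] for k in range(3)] for j in range(3)]
```

For example, the mean at age 15 (t = 7) is 16.534 + 0.400(7) + 0.056(49) - 0.003(343) = 21.049 for a clinic measurement.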
4.4.1 Sampling Designs
Many longitudinal cohorts have different sampling designs, some with variable amounts of
missing time points and missing observations at each time point, and hence five different
sampling designs were investigated:
1. Sparse complete: ni=8 measures per person with few measures around the adiposity
rebound; times of measures are 1, 2, 3, 5, 8, 10, 13, 15
2. Intense complete: ni=14 measures per person with multiple measures around the
adiposity rebound; times of measures are 1, 2, 3, 3.5, 4, 4.5, 5, 5.5, 6, 7, 9, 11, 13, 15
3. Equal unbalanced: ni=1 to 15 measures per person between 1 and 15 years with a
mean of nine measures (proportion of missingness = 0.4 across whole age range)
4. Unbalanced with more samples around the adiposity rebound: ni=1 to 15 measures
per person between 1 and 15 years with a mean of nine measures; proportion of
missingness around adiposity rebound of 0.2 and 0.45 outside the five to seven year
age range (average proportion of missingness over whole age range is 0.4)
5. Unbalanced with fewer samples around the adiposity rebound: ni=1 to 15 measures
per person between 1 and 15 years with a mean of nine measures; proportion of
missingness around adiposity rebound of 0.6 and 0.35 outside the five to seven year
age range (average proportion of missingness over whole age range is 0.4)
The first two designs with complete data at each follow-up assume that every individual had
the exact same age at follow-up (i.e. came into clinic on their birthday), whereas the other
three designs are more representative of longitudinal studies, where the actual age of
measurement varies between individuals by up to a year (i.e. came into clinic either six months
before or after a birthday). Data is assumed to be missing completely at random, or in other
words the probability that an observation is missing for a given individual is independent of all
other observed data. The proportion of missingness simulated across the whole range (i.e. 0.4)
was equivalent to the amount of missing data observed in ALSPAC under the assumption that
all individuals could have been measured yearly. A fully factorial design combining
the three data characteristics (sample size, MAF, effect size) and the five sampling designs
was used.
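The missingness proportions in the three unbalanced designs can be checked with a short sketch. It assumes 15 yearly time points (ages 1 to 15), treats ages five, six and seven as the adiposity rebound window, and draws missingness independently at each time point; the design labels and function names are hypothetical. The sketch omits the variation of up to a year in the actual age at measurement and does not enforce the minimum of one measure per person.

```python
import random


def missing_prob(age, design):
    """Probability that the measurement at a given age is missing."""
    in_rebound = 5 <= age <= 7
    if design == "equal":
        return 0.40
    if design == "more_at_rebound":
        return 0.20 if in_rebound else 0.45
    if design == "fewer_at_rebound":
        return 0.60 if in_rebound else 0.35
    raise ValueError(f"unknown design: {design}")


def average_missingness(design, ages=range(1, 16)):
    """Average missingness over the whole age range."""
    ages = list(ages)
    return sum(missing_prob(a, design) for a in ages) / len(ages)


def observed_ages(design, rng, ages=range(1, 16)):
    """Ages retained for one individual under missing completely at random."""
    return [a for a in ages if rng.random() >= missing_prob(a, design)]
```

Under these assumptions each design averages 0.4 missingness, for example (3 x 0.2 + 12 x 0.45) / 15 = 0.4, which corresponds to a mean of nine observed measures out of 15.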
4.4.2 Models for Data Generation
4.4.2.1 Standard Linear Mixed Model
Data were generated with Gaussian random effects and error distribution to validate the
estimation method.
4.4.2.2 Non-Gaussian Error
Three error structures were investigated:
1. t-distribution: t with 5 degrees of freedom
2. skew-normal distribution: SN(1.063², 40)
3. Asymmetric mixture of two Gaussian distributions: 0.3N(-0.67, 1²) + 0.7N(0.5, 0.3²)
4.4.2.3 Heteroscedastic Error
Three cases were studied:
1. Variance dependent on a covariate: $\mathrm{Var}(e_{ij}) = \sigma_e^2\, a^{X_{ij}}$
where $\sigma_e^2$=1.131, a=1.5 and Xij=1 if the measure was from a questionnaire and 0 if the
measure was from a follow-up clinic
2. Variance greater at the adiposity rebound: $\mathrm{Var}(e_{ij}) = \sigma_e^2\, a^{X_{ij}}$
where $\sigma_e^2$=1.131, a=1.5 and Xij=1 if the measure was taken between five and seven years
and 0 if not
3. Variance increasing over time: $\mathrm{Var}(e_{ij}) = \sigma_e^2\, a^{t_{ij}}$
where $\sigma_e^2$=1.131 and a=1.15
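The three heteroscedastic cases share one variance function, which can be sketched as below. This assumes the variance takes the multiplicative form of the baseline variance times a raised to the power of the covariate (a binary indicator in the first two cases, age centred at eight years in the third); the function name is hypothetical.

```python
SIGMA2_E = 1.131  # baseline error variance used in the simulations


def error_variance(exponent, a):
    """Error variance SIGMA2_E * a**exponent (assumed form of the variance function)."""
    return SIGMA2_E * a ** exponent


# Cases 1 and 2: binary indicator, a = 1.5
var_reference = error_variance(0, a=1.5)   # clinic measure, or outside the rebound
var_inflated = error_variance(1, a=1.5)    # questionnaire measure, or at the rebound

# Case 3: age centred at eight years, a = 1.15
var_age_15 = error_variance(15 - 8, a=1.15)
```

Under this reading, the error variance is 50% larger when the indicator equals one, and increases by 15% for each additional year of age in the third case.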
4.4.3 Data Generation
Performance estimates included coverage probability, power and type 1 error. Coverage
probability is defined as the proportion of simulations where the 95% confidence interval
around the parameter estimate contained the simulated parameter. Coverage probabilities
can indicate whether the confidence interval of the parameter(s) of interest is conservative
(i.e. the coverage probability is larger than the nominal confidence level) or liberal (i.e. the
coverage probability is smaller than the nominal confidence level). Power is defined as
the proportion of tests under the alternative hypothesis that reach genome-wide significance
of P-Value < 5×10⁻⁸. Type 1 error is defined as the proportion of tests under the null hypothesis
that reach significance of P-Value < 0.05. It would have been ideal to look at the type 1 error
rate at a genome-wide level also; however, this would have required a null dataset to be
simulated such that no allele was associated with the outcome and a genome-wide scan would
then be performed on this dataset, with these simulation and association steps repeated 5,000
times. This would require a large amount of computing time and was therefore deemed
infeasible. Investigating a P-Value of less than 0.05, in conjunction with a real data example,
should provide sufficient information regarding the effect of the model misspecification on the
type 1 error.
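The three performance estimates defined above can be sketched as simple proportions over the simulation replicates. The following is illustrative only: it assumes Wald-type intervals (estimate ± 1.96 standard errors), the function names are hypothetical, and the inputs are toy values rather than results from this chapter.

```python
def coverage(estimates, std_errors, true_value, z=1.96):
    """Proportion of Wald-type 95% intervals that contain the true value."""
    hits = sum(
        est - z * se <= true_value <= est + z * se
        for est, se in zip(estimates, std_errors)
    )
    return hits / len(estimates)

def rejection_rate(p_values, threshold):
    """Proportion of tests with a P-value below the threshold."""
    return sum(p < threshold for p in p_values) / len(p_values)

# Toy inputs, not thesis results.
cov = coverage([0.58, 0.66, 0.61, 0.90], [0.05] * 4, true_value=0.6)
power = rejection_rate([1e-9, 3e-8, 1e-3], threshold=5e-8)      # alternative
type1 = rejection_rate([0.20, 0.03, 0.70, 0.51], threshold=0.05)  # null
print(cov, power, type1)
```

The same `rejection_rate` function gives power when applied to P-values simulated under the alternative hypothesis and type 1 error when applied to P-values simulated under the null hypothesis; only the threshold differs.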
It is important to report the uncertainty in any estimates from simulation based studies [262].
Therefore, Monte Carlo error (MCE) was calculated for coverage probabilities, bias, power and
type 1 error using the following confidence interval [263]:
$$P \pm 1.96\sqrt{\frac{P(1-P)}{S}}$$
where P is the nominal level (for example, P is 0.95 for coverage estimates and 0.05 for type 1
error) and S is the number of simulations. Each result from the simulations was then assessed as
to whether it fell within this confidence interval.
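As a worked example of this interval, the sketch below computes P ± 1.96·sqrt(P(1−P)/S) for the coverage estimates (P = 0.95 with S = 4,000 combined simulations) and for the type 1 error (P = 0.05 with S = 20,000 combined simulations). The function name is hypothetical; the coverage limits it produces agree with the 94.32% and 95.68% limits quoted in Section 4.5.1.

```python
import math

def mce_interval(p, n_sims, z=1.96):
    """Monte Carlo error interval p +/- z * sqrt(p * (1 - p) / n_sims)."""
    half_width = z * math.sqrt(p * (1 - p) / n_sims)
    return p - half_width, p + half_width

cov_lo, cov_hi = mce_interval(0.95, 4000)
print(f"coverage, S=4,000: {100 * cov_lo:.2f}% to {100 * cov_hi:.2f}%")

t1_lo, t1_hi = mce_interval(0.05, 20000)
print(f"type 1 error, S=20,000: {t1_lo:.4f} to {t1_hi:.4f}")
```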
We simulated 1,000 datasets under the alternative hypothesis (β5=0.6 and β6=0.15) to look at
coverage probabilities, bias and power and 5,000 datasets under the null hypothesis (β5=0 and
β6=0) to look at type 1 error at α=0.05. The number of simulations for each hypothesis was
determined so that the MCE was appropriate. Each SNP (coded as 0, 1, and 2) was
incorporated into the model assuming an additive genetic model, whereby each additional
minor allele increases BMI by an equal amount. The primary interest was in estimating the SNP
main effect, β5, which represents the increase in mean BMI at eight years of age for each
additional copy of the minor allele, and the SNP by age effect, β6, which represents the effect
on the mean linear increase in BMI (slope) for each additional minor allele. All analyses were
conducted in R version 2.12.1 [222] using the nlme package.
4.4.4 Calculating Robust Standard Errors and Global Wald Tests
As mentioned in Section 4.1, the robust sandwich estimate [260] can correct the biased
variance estimates of the fixed effects when the covariance structure is not correctly specified.
Therefore, a robust standard error was calculated for each fixed effect parameter and
corresponding P-Value. The following formula was used:
$$(X'V^{-1}X)^{-1}\left(\sum_{i=1}^{S} X_i'V_i^{-1}\varepsilon_i\varepsilon_i'V_i^{-1}X_i\right)(X'V^{-1}X)^{-1}$$
Where:
X is the fixed effect regressor matrix from equation 4.1
V is the variance of Y from equation 4.1
$\varepsilon_i = y_i - X_i\hat{\beta}$
S is the number of subjects and i is the ith subject
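The robust sandwich estimate can be sketched with NumPy as follows. This is not the R code used for the analyses in this chapter; it is a minimal illustration that assumes the per-subject design matrices, model-based covariance matrices and residuals have already been extracted from a fitted model, and that V is block diagonal across subjects so that X'V⁻¹X is the sum of the per-subject terms. The function and argument names are hypothetical.

```python
import numpy as np

def robust_covariance(X_blocks, V_blocks, resid_blocks):
    """Sandwich covariance of the fixed effects.

    X_blocks[i]     : (n_i, p) design matrix for subject i
    V_blocks[i]     : (n_i, n_i) model-based covariance of y_i
    resid_blocks[i] : (n_i,) residuals y_i - X_i @ beta_hat
    """
    p = X_blocks[0].shape[1]
    bread = np.zeros((p, p))
    meat = np.zeros((p, p))
    for X, V, e in zip(X_blocks, V_blocks, resid_blocks):
        Vinv_X = np.linalg.solve(V, X)   # V_i^{-1} X_i
        bread += X.T @ Vinv_X            # X_i' V_i^{-1} X_i
        score = Vinv_X.T @ e             # X_i' V_i^{-1} e_i (V_i symmetric)
        meat += np.outer(score, score)   # X_i' V_i^{-1} e_i e_i' V_i^{-1} X_i
    bread_inv = np.linalg.inv(bread)
    return bread_inv @ meat @ bread_inv

# Toy check: one observation per subject, intercept-only model, V_i = 1.
residuals = [1.0, -1.0, 2.0, -2.0]
X_blocks = [np.ones((1, 1)) for _ in residuals]
V_blocks = [np.eye(1) for _ in residuals]
resid_blocks = [np.array([r]) for r in residuals]

cov_robust = robust_covariance(X_blocks, V_blocks, resid_blocks)
robust_se = np.sqrt(np.diag(cov_robust))
print(cov_robust, robust_se)
```

In the toy check the estimate reduces to the sum of squared residuals divided by the squared number of subjects (10/16), which makes the arithmetic easy to confirm by hand.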
In addition to the fixed effects parameters with robust standard errors, a Wald test was
conducted to assess whether the overall SNP effect was affected by the misspecification. The
Wald test was estimated using the General Linear Hypothesis approach [264]. This approach is
based on the normal approximation for maximum likelihood estimators using the estimated
variance-covariance matrix. The hypothesis can be specified through a constant matrix L to be
matched with the fixed effects of the model such that H0: Lβ = m where the m are the
hypothesized values. The estimates of the fixed effects, $\hat{\beta}$, asymptotically follow a multivariate
normal distribution, $\hat{\beta} \sim N(\beta, \mathrm{cov}(\hat{\beta}))$, by the Central Limit Theorem, such that the linear
form also asymptotically follows a multivariate normal distribution:
$$L\hat{\beta} \sim N(L\beta,\; L\,\mathrm{cov}(\hat{\beta})\,L')$$
Thus the 95% confidence interval and corresponding P-Value for the hypothesized value can be
obtained accordingly. Testing to evaluate whether the parameters for the SNP were
simultaneously equal to zero was then conducted. It is computationally intensive to calculate a
robust estimate for the Wald test; for example, the robust standard error for the fixed effects
takes approximately 7 minutes for the rs1121980 FTO SNP in the ALSPAC data whereas the
robust standard error for the global Wald test takes approximately an additional 3 minutes.
These computational times decrease exponentially as the sample size and the number of repeated
measures per individual decrease; however, they may not be scalable to a GWAS. To
investigate whether a robust standard error would be beneficial for the global Wald test, we
selected the scenario where the inflation was greatest and calculated the robust
estimates for all the simulations in this scenario.
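A global Wald test of this kind can be sketched as below for the case of two simultaneous constraints (the SNP main effect and the SNP*age interaction both equal to zero). This is an illustrative sketch rather than the implementation used here: the function name and the toy numbers are hypothetical, and it relies on the fact that the upper tail probability of a chi-square distribution with 2 degrees of freedom is exp(−W/2). Either the classical or a robust covariance matrix of the fixed effects can be supplied.

```python
import math
import numpy as np

def global_wald_2df(beta_hat, cov_beta, L, m=None):
    """Wald test of H0: L @ beta = m for a hypothesis matrix L with two rows."""
    L = np.asarray(L, dtype=float)
    if L.shape[0] != 2:
        raise ValueError("this sketch only handles two constraints")
    m = np.zeros(2) if m is None else np.asarray(m, dtype=float)
    diff = L @ np.asarray(beta_hat, dtype=float) - m
    cov_L = L @ np.asarray(cov_beta, dtype=float) @ L.T   # L cov(beta) L'
    stat = float(diff @ np.linalg.solve(cov_L, diff))
    p_value = math.exp(-stat / 2.0)   # chi-square tail probability, 2 df
    return stat, p_value

# Toy example: fixed effects ordered as (intercept, SNP, SNP*age).
beta_hat = [16.0, 0.2, 0.1]
cov_beta = np.diag([1.0, 0.01, 0.0025])
L = [[0, 1, 0],
     [0, 0, 1]]
stat, p_value = global_wald_2df(beta_hat, cov_beta, L)
print(stat, p_value)
```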
4.5 Results for Simulated Data
4.5.1 Coverage Probabilities
Coverage probabilities for the 95% confidence interval of the fixed effects parameter estimates
from each of the simulations are presented in Table 4.2. No consistent differences were seen
across the range of MAFs, so the results from each of the simulated datasets were combined
for ease of presentation; however the coverage probabilities for each of the MAFs are
presented in Appendix D, Tables 1-5.
The coverage probabilities of the SNP main effects parameter for all simulations appear to be
unaffected by the error misspecifications; only nine of 70 coverage probabilities were
significantly different from 95%, that is less than 94.32% or greater than 95.68%, five of which
were from the simulations where the error variance increases over time.
Thirty-one of the 70 coverage probabilities (44%) for the SNP*age interaction parameter were
significantly different from 95%, with both the non-Gaussian and heteroscedastic error
distributions being affected. When the error followed a t-distribution, the coverage
probabilities for the confidence interval of the SNP*age interaction parameter were less than
95% in all designs except the sparse complete scenario. Similarly, the SNP*age interaction
parameter had coverage probabilities less than 95% when the error followed a skew-normal
distribution, although only in the unbalanced designs with missing data. The coverage
probabilities were also less than 95% when the error variance was dependent on a covariate and
when it increased over time, in both the complete and unbalanced designs. All the coverage
probabilities that significantly differ from 95% for the SNP*age interaction parameter have
underestimated variance estimates and thus confidence intervals that were too narrow, which
could lead to test statistics that are too liberal.
4.5.2 Bias
Bias for the fixed effect parameter estimates is presented in Table 4.3 (complete scenarios)
and Table 4.4 (unbalanced scenarios). No consistent differences were seen in the bias
estimates across the range of minor allele frequencies; however, the 95% confidence intervals
for the difference between the estimated parameter and the true (simulated) parameter were tighter as
the sample size and minor allele frequency increased. The bias for each of the minor allele
frequencies is presented in Appendix D, Tables 6-10.
The SNP main effect and the SNP*age interaction parameters are unbiased in the majority of
the simulations, indicating that the misspecifications in the error distribution do not affect the
estimates of the β’s. Only nine of 140 95% confidence intervals did not cover zero; these nine
confidence intervals were across the range of error distributions and designs, indicating that
no one scenario was particularly biased.
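The bias summaries in Tables 4.3 and 4.4 can be thought of as the mean difference between the estimates and the simulated value, with a confidence interval for that mean. The sketch below assumes a normal-approximation interval based on the standard error of the mean difference across replicates; the function name is hypothetical and the inputs are toy values, not thesis results.

```python
import math
import statistics

def bias_with_ci(estimates, true_value, z=1.96):
    """Mean bias and a normal-approximation 95% confidence interval."""
    diffs = [est - true_value for est in estimates]
    bias = statistics.mean(diffs)
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return bias, (bias - z * se, bias + z * se)

# Toy replicate estimates of a parameter simulated at 0.6.
bias, (ci_lo, ci_hi) = bias_with_ci([0.5, 0.7, 0.6, 0.6], true_value=0.6)
unbiased = ci_lo <= 0.0 <= ci_hi   # the interval covers zero
print(bias, (ci_lo, ci_hi), unbiased)
```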
Table 4.2: Coverage rates of the 95% confidence intervals of the fixed effects; bold and
underlined cells are those that are significantly different from the nominal 95% based on
4,000 simulations under each design (1,000 simulations for each MAF combined into one
summary statistic).
Sampling Design Sparse Complete Intense Complete Equal Unbalanced Unbalanced with more samples around the adiposity rebound Unbalanced with less samples around the adiposity rebound
Sample Size 1,000 3,000 1,000 3,000 1,000 3,000 1,000 3,000 1,000 3,000
Gaussian Distribution
SNP 95.43 95.03 95.08 95.45 94.83 95.23 95.08 94.70 95.40 94.73
SNP*age 95.00 95.23 94.58 95.13 94.35 94.63 94.30 93.90 94.53 94.35
t-distribution
SNP 95.45 95.35 95.90 94.55 95.13 94.85 94.65 94.48 95.48 94.95
SNP*age 95.30 94.80 94.05 94.13 94.45 94.10 93.70 94.00 93.33 94.03
Skew-normal Distribution
SNP 94.90 95.03 95.18 95.10 95.05 94.25 95.43 94.95 94.85 94.75
SNP*age 95.68 95.18 94.63 94.65 93.88 93.73 94.73 94.13 93.90 93.55
Mixture of 2 Gaussian Distributions
SNP 94.85 94.98 94.83 95.65 95.00 95.08 94.53 95.40 94.33 94.48
SNP*age 95.05 94.78 95.03 94.60 95.20 94.08 94.58 95.20 94.80 94.10
Variance dependent on a covariate
SNP 94.93 95.05 95.83 95.35 94.43 94.75 94.70 94.93 94.98 94.93
SNP*age 94.93 95.03 94.95 94.15 94.03 94.10 93.75 94.53 93.95 93.93
Variance greater at adiposity rebound
SNP 94.75 95.23 95.15 95.08 94.25 95.20 95.48 95.35 95.23 94.43
SNP*age 94.05 94.45 95.38 95.38 94.13 94.43 94.00 94.75 94.73 94.60
Variance increasing over time
SNP 94.20 95.00 94.08 94.38 94.80 94.33 94.30 94.03 95.98 95.48
SNP*age 94.10 94.88 91.78 92.38 94.70 94.23 93.28 93.48 95.65 95.25
Table 4.3: Bias and 95% confidence interval for the complete designs; bold and underlined cells are those whose confidence interval does not cover zero
based on 4,000 simulations under each design (1,000 simulations for each MAF combined into one summary statistic).
Sampling Design Sparse Complete Intense Complete
Sample Size N=1,000 N=3,000 N=1,000 N=3,000
Gaussian Distribution
SNP 0.0012 (-0.0026,0.0051) 0.0012 (-0.0010,0.0035) -0.0044 (-0.0083,-0.0005) 0.0022 (0.0000,0.0044)
SNP*age 0.0002 (-0.0005,0.0008) 0.0002 (-0.0002,0.0006) -0.0004 (-0.0010,0.0002) -0.0001 (-0.0004,0.0003)
t-distribution
SNP 0.0002 (-0.0038,0.0042) -0.0007 (-0.0030,0.0016) -0.0008 (-0.0048,0.0032) 0.0005 (-0.0019,0.0028)
SNP*age -0.0001 (-0.0008,0.0007) -0.0002 (-0.0006,0.0002) 0 (-0.0007,0.0007) 0 (-0.0005,0.0004)
Skew-normal Distribution
SNP -0.0027 (-0.0065,0.0012) -0.0002 (-0.0025,0.0020) 0.0009 (-0.0030,0.0048) 0.0008 (-0.0014,0.0030)
SNP*age 0.0002 (-0.0005,0.0008) 0 (-0.0004,0.0004) 0.0003 (-0.0003,0.0009) 0.0001 (-0.0003,0.0004)
Mixture of 2 Gaussian Distributions
SNP -0.0002 (-0.0042,0.0037) -0.0016 (-0.0039,0.0006) -0.0005 (-0.0045,0.0034) -0.0007 (-0.0029,0.0015)
SNP*age 0 (-0.0005,0.0005) -0.0003 (-0.0006,0.0000) -0.0003 (-0.0008,0.0003) 0 (-0.0003,0.0003)
Variance dependent on a covariate
SNP 0.0004 (-0.0036,0.0044) -0.0002 (-0.0024,0.0021) -0.0041 (-0.008,-0.0002) 0.0023 (0.0000,0.0045)
SNP*age -0.0001 (-0.0008,0.0006) -0.0001 (-0.0005,0.0003) -0.0002 (-0.0008,0.0005) 0.0001 (-0.0002,0.0005)
Variance greater at adiposity rebound
SNP 0.0002 (-0.0037,0.0042) -0.0004 (-0.0027,0.0019) -0.0001 (-0.0041,0.0039) -0.0025 (-0.0048,-0.0003)
SNP*age 0.0002 (-0.0005,0.0009) -0.0001 (-0.0006,0.0003) 0.0004 (-0.0003,0.0010) -0.0003 (-0.0007,0.0000)
Variance increasing over time
SNP -0.0009 (-0.0049,0.0031) -0.0045 (-0.0067,-0.0022) -0.0014 (-0.0055,0.0027) 0.0011 (-0.0013,0.0034)
SNP*age 0.0003 (-0.0004,0.0011) -0.0004 (-0.0008,0.0000) -0.0003 (-0.0010,0.0004) 0.0002 (-0.0002,0.0006)
Table 4.4: Bias and 95% confidence interval for the unbalanced designs; bold and underlined cells are those whose confidence interval does not cover zero
based on 4,000 simulations under each design (1,000 simulations for each MAF combined into one summary statistic).
Sampling Design Equal Unbalanced Unbalanced with more samples around the adiposity rebound Unbalanced with less samples around the adiposity rebound
Sample Size N=1,000 N=3,000 N=1,000 N=3,000 N=1,000 N=3,000
Gaussian Distribution
SNP -0.0013 (-0.0052,0.0026) -0.0017 (-0.0040,0.0005) 0.0006 (-0.0033,0.0046) 0.0008 (-0.0015,0.0031) 0.0020 (-0.0019,0.0059) -0.0001 (-0.0024,0.0022)
SNP*age -0.0002 (-0.0009,0.0004) -0.0003 (-0.0007,0.0000) -0.0001 (-0.0008,0.0005) 0 (-0.0004,0.0003) -0.0003 (-0.0009,0.0004) 0.0004 (0.0000,0.0008)
t-distribution
SNP 0.0009 (-0.0031,0.0049) 0.0021 (-0.0003,0.0044) -0.0003 (-0.0044,0.0037) 0.0004 (-0.0019,0.0027) 0.0037 (-0.0003,0.0078) 0.0036 (0.0012,0.0059)
SNP*age 0.0002 (-0.0005,0.0010) 0.0001 (-0.0003,0.0006) 0.0002 (-0.0006,0.0009) 0.0005 (0.0000,0.0009) 0.0008 (0.0000,0.0015) 0.0004 (0.0000,0.0009)
Skew-normal Distribution
SNP 0.0009 (-0.0030,0.0047) -0.0006 (-0.0029,0.0017) -0.0003 (-0.0042,0.0035) 0.0029 (0.0006,0.0051) -0.0021 (-0.0060,0.0018) -0.0017 (-0.0040,0.0006)
SNP*age 0.0004 (-0.0002,0.0011) -0.0003 (-0.0007,0.0001) 0 (-0.0007,0.0006) 0.0005 (0.0001,0.0009) -0.0004 (-0.0010,0.0003) -0.0002 (-0.0006,0.0001)
Mixture of 2 Gaussian Distributions
SNP -0.0036 (-0.0075,0.0003) 0.0014 (-0.0008,0.0037) 0 (-0.0040,0.0040) -0.0016 (-0.0038,0.0006) -0.0017 (-0.0057,0.0022) -0.0022 (-0.0044,0.0001)
SNP*age -0.0007 (-0.0012,-0.0001) 0.0003 (0.0000,0.0006) -0.0001 (-0.0006,0.0005) -0.0002 (-0.0005,0.0001) 0.0001 (-0.0005,0.0006) -0.0001 (-0.0004,0.0003)
Variance dependent on a covariate
SNP 0.0005 (-0.0034,0.0045) -0.0006 (-0.0029,0.0018) -0.0008 (-0.0048,0.0032) 0.0027 (0.0005,0.0050) -0.0022 (-0.0062,0.0017) -0.0002 (-0.0025,0.0021)
SNP*age 0 (-0.0007,0.0007) 0.0001 (-0.0003,0.0005) -0.0004 (-0.0011,0.0003) 0.0003 (-0.0001,0.0007) -0.0005 (-0.0012,0.0002) -0.0001 (-0.0005,0.0003)
Variance greater at adiposity rebound
SNP 0.0014 (-0.0027,0.0056) 0.0015 (-0.0008,0.0038) 0.0009 (-0.0031,0.0049) -0.0011 (-0.0034,0.0012) 0.0017 (-0.0022,0.0056) 0.0008 (-0.0015,0.0031)
SNP*age 0 (-0.0007,0.0007) -0.0001 (-0.0005,0.0003) 0.0006 (-0.0001,0.0013) -0.0001 (-0.0005,0.0003) 0.0004 (-0.0003,0.0010) -0.0002 (-0.0005,0.0002)
Variance increasing over time
SNP -0.0006 (-0.0046,0.0034) -0.0002 (-0.0025,0.0022) 0.0009 (-0.0031,0.0049) -0.0012 (-0.0035,0.0012) 0.0018 (-0.0022,0.0057) 0.0002 (-0.0021,0.0025)
SNP*age -0.0001 (-0.0009,0.0006) -0.0002 (-0.0006,0.0002) -0.0002 (-0.0010,0.0006) -0.0003 (-0.0007,0.0002) 0.0004 (-0.0003,0.0011) -0.0001 (-0.0005,0.0003)
4.5.3 Power
Effect sizes for the alternative hypothesis (β5=0.6 and β6=0.15) were chosen to have 80%
power with a MAF of 0.4 and sample size of 1,000 when the error from the fitted LMM follows
a Gaussian distribution with constant variance. Therefore, the power for all error distributions
and MAFs in the simulations with sample size of 3,000 was greater than 80%; hence this
section will only discuss power for the simulations with a sample size of 1,000. Power for the
SNP main effect and SNP*age interaction parameters is displayed in Figure 4.4 (complete
designs) and Figure 4.5 (unbalanced designs).
As expected, the power increases with the MAF. Interestingly, the simulations where the error
distribution was assumed to have a t-distribution had lower power for both the SNP main
effect and the SNP*age interaction parameters than the simulations assuming a Gaussian error
distribution. This pattern was consistent across all the sampling designs; however it appears
that the power is slightly closer to that of the error with the Gaussian distribution when there
is more data around the adiposity rebound (i.e. the intense complete and unbalanced with
more samples around the adiposity rebound). In addition, for simulations where the error
distribution follows a skew-normal distribution, the power for both the SNP and SNP*age
interaction parameters was slightly higher than those with the Gaussian error.
When investigating the different error variance structures, the power for the SNP main effect
parameter across all MAFs was slightly lower than the power when the constant variance
assumption was met. Likewise, for the SNP*age interaction parameter, all of the error variance
structures led to lower power than when the constant variance assumption was met.
However, simulations under the unbalanced designs where the variance increased over time
were the most affected and had notably reduced power until a MAF of approximately 0.3.
Figure 4.4: Simulated power of the SNP main effect and SNP*age interaction terms for
complete designs. The two plots on the left are for the Sparse Complete design, while the
two plots on the right are from the intense complete design.
Figure 4.5: Simulated power of the SNP main effect and SNP*age interaction terms for
unbalanced designs, where “Equal” is the simulations from the Equal Unbalanced design,
“Over” are the simulations from the unbalanced design with less samples around the
adiposity rebound and “Under” are the simulations from the unbalanced design with more
samples around the adiposity rebound.
4.5.4 Type 1 Error
As observed with the coverage probabilities, no consistent differences in type 1 error were
evident across the MAF range, and hence the results from each of the simulated datasets were
combined for ease of presentation; however the type 1 error for each of the MAFs tested are
given in Appendix D, Tables 11-15.
As seen in Table 4.5, the type 1 error for the complete designs remained within acceptable
limits of the nominal alpha level. Inflation for the SNP by age interaction parameter was
observed in several cases, but this inflation was reduced to nominal levels by using a robust
standard error.
Table 4.6 shows that the type 1 error for the SNP by age interaction was often inflated under
the unbalanced designs. However, by using a robust standard error, the inflation can be
reduced to nominal levels in the majority of cases; approximately 75% of the inflated effects
were reduced. The exception was the scenario where the error variance increased over time, for
which the robust standard error had little effect; only 20% of the estimates were reduced to
nominal levels under this scenario. Interestingly, the robust standard error did not appear to
affect the type 1 error for the scenarios that were not originally inflated.
To declare significance in a GWAS, several thresholds are commonly used: suggestive
association, significant association and highly significant association. Duggal et al. define
suggestive associations as SNPs that reach a P-Value threshold under the assumption that one
false positive association is expected per GWAS [265]; SNPs reaching this threshold are often
taken forward to a replication stage. In the context of the simulation study, this definition
would equate to a P-Value of 0.00005 (1/20,000; where 20,000 is the number of simulations
per design and error assumption). The scenario with the highest type 1 error inflation using the
classical standard error was for the SNP*age interaction under the intense design where the
error variance increased over time (0.0746 for both N=1,000 and 3,000). In this scenario, six
SNPs would falsely reach the definition of ‘suggestive association’ for the SNP*age interaction
parameter when using the classical standard error with a sample size of 1,000 individuals. In
contrast, when the model assumptions are met, that is when the error distribution follows a
Gaussian distribution with constant variance, only two SNPs met the ‘suggestive association’
threshold, indicating an inflation in the type 1 error for the simulations where the variance
increased over time due to the misspecification of the error term. When using the robust
standard error under the increasing variance over time design, one SNP would meet the
criteria, showing not only a reduction in the type 1 error from the seven SNPs seen with the
classical standard error, but also a reduction in power in comparison to the model where the
assumptions were met.
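The arithmetic behind the 'suggestive association' threshold used above can be sketched as follows. The helper name and the toy P-values are hypothetical; only the 20,000 tests per design and error assumption come from the text.

```python
import math

N_TESTS = 20_000                      # simulations per design and error assumption
suggestive_threshold = 1 / N_TESTS    # one false positive expected per "scan"

def count_suggestive(p_values, threshold=suggestive_threshold):
    """Number of tests whose P-value falls below the suggestive threshold."""
    return sum(p < threshold for p in p_values)

# Under a well-calibrated null, about one of the 20,000 tests is expected
# to fall below the threshold.
expected_false_positives = N_TESTS * suggestive_threshold

# Toy P-values, not thesis results.
n_hits = count_suggestive([0.2, 3e-5, 0.04, 1e-6, 6e-5])
print(suggestive_threshold, expected_false_positives, n_hits)
```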
Table 4.5: Type 1 error for complete designs; bold and underlined cells are those that are
significantly different from the nominal α=0.05 based on 20,000 simulations under each
design (5,000 simulations for each MAF combined into one summary statistic).
Sampling Design Sparse Complete Intense Complete
Sample Size N=1,000 N=3,000 N=1,000 N=3,000
Standard Robust Standard Robust Standard Robust Standard Robust
Gaussian Distribution
SNP 0.0514 0.0528 0.0509 0.0513 0.0502 0.0521 0.0500 0.0510
SNP*age 0.0483 0.0504 0.0483 0.0491 0.0549 0.0486 0.0539 0.0467
Global Wald test 0.0497 0.0478 0.0605 0.0620
t-distribution
SNP 0.0495 0.0498 0.0489 0.0496 0.0479 0.0510 0.0483 0.0502
SNP*age 0.0521 0.0534 0.0487 0.0492 0.0581 0.0490 0.0563 0.0465
Global Wald test 0.0531 0.0508 0.0624 0.0629
Skew-normal Distribution
SNP 0.0502 0.0517 0.0524 0.0524 0.0509 0.0526 0.0525 0.0532
SNP*age 0.0503 0.0519 0.0461 0.0474 0.0541 0.0508 0.0529 0.0486
Global Wald test 0.0493 0.0488 0.0621 0.0579
Mixture of 2 Gaussian Distributions
SNP 0.0498 0.0504 0.0479 0.0479 0.0485 0.0499 0.0510 0.0508
SNP*age 0.0502 0.0510 0.0492 0.0488 0.0528 0.0506 0.0529 0.0495
Global Wald test 0.0498 0.0508 0.0615 0.0586
Variance dependent on a covariate
SNP 0.0523 0.0527 0.0488 0.0490 0.0485 0.0511 0.0459 0.0485
SNP*age 0.0546 0.0527 0.0531 0.0514 0.0520 0.0493 0.0524 0.0481
Global Wald test 0.0515 0.0525 0.0556 0.0546
Variance greater at adiposity rebound
SNP 0.0472 0.0478 0.0511 0.0519 0.0477 0.0493 0.0471 0.0490
SNP*age 0.0528 0.0497 0.0570 0.0528 0.0513 0.0513 0.0491 0.0487
Global Wald test 0.0527 0.0540 0.0502 0.0478
Variance increasing over time
SNP 0.0523 0.0536 0.0471 0.0473 0.0543 0.0513 0.0561 0.0522
SNP*age 0.0564 0.0538 0.0522 0.0491 0.0746 0.0528 0.0746 0.0530
Global Wald test 0.0875 0.0549 0.0875 0.0497 0.1667 0.0506 0.1685 0.0506
Table 4.6: Type 1 error for unbalanced designs; bold and underlined cells are those that are significantly different from the nominal α=0.05 based on 20,000
simulations under each design (5,000 simulations for each MAF combined into one summary statistic).
Sampling Design Equal Unbalanced Unbalanced with more samples around the adiposity rebound Unbalanced with less samples around the adiposity rebound
Sample Size N=1,000 N=3,000 N=1,000 N=3,000 N=1,000 N=3,000
Standard Robust Standard Robust Standard Robust Standard Robust Standard Robust Standard Robust
Gaussian Distribution
SNP 0.0518 0.0532 0.0500 0.0508 0.0503 0.0521 0.0478 0.0490 0.0529 0.0550 0.0540 0.0542
SNP*age 0.0581 0.0526 0.0592 0.0531 0.0566 0.0514 0.0556 0.0496 0.0560 0.0511 0.0575 0.0509
Global Wald test 0.0646 0.0598 0.0601 0.0615 0.0621 0.0609
t-distribution
SNP 0.0510 0.0522 0.0491 0.0497 0.0485 0.0500 0.0495 0.0505 0.0487 0.0499 0.0516 0.0523
SNP*age 0.0571 0.0487 0.0629 0.0539 0.0596 0.0508 0.0571 0.0475 0.0563 0.0487 0.0577 0.0489
Global Wald test 0.0607 0.0621 0.0620 0.0583 0.0587 0.0605
Skew-normal Distribution
SNP 0.0493 0.0508 0.0495 0.0501 0.0498 0.0517 0.0473 0.0481 0.0512 0.0519 0.0482 0.0484
SNP*age 0.0548 0.0492 0.0589 0.0526 0.0580 0.0512 0.0571 0.0498 0.0593 0.0532 0.0547 0.0490
Global Wald test 0.0618 0.0582 0.0616 0.0583 0.0632 0.0575
Mixture of 2 Gaussian Distributions
SNP 0.0519 0.0527 0.0490 0.0490 0.0505 0.0510 0.0519 0.0517 0.0510 0.0522 0.0487 0.0483
SNP*age 0.0534 0.0517 0.0487 0.0459 0.0510 0.0494 0.0538 0.0518 0.0543 0.0511 0.0551 0.0517
Global Wald test 0.0579 0.0581 0.0605 0.0603 0.0589 0.0568
Variance dependent on a covariate
SNP 0.0495 0.0515 0.0482 0.0491 0.0498 0.0513 0.0506 0.0509 0.0512 0.0518 0.0528 0.0502
SNP*age 0.0586 0.0499 0.0607 0.0514 0.0576 0.0505 0.0588 0.0497 0.0605 0.0507 0.0604 0.0507
Global Wald test 0.0589 0.0611 0.0597 0.0567 0.0620 0.0583
Variance greater at adiposity rebound
SNP 0.0493 0.0504 0.0492 0.0495 0.0486 0.0498 0.0516 0.0526 0.0506 0.0514 0.0496 0.0531
SNP*age 0.0570 0.0491 0.0563 0.0483 0.0546 0.0482 0.0563 0.0503 0.0561 0.0483 0.0600 0.0505
Global Wald test 0.0572 0.0559 0.0568 0.0541 0.0588 0.0569
Variance increasing over time
SNP 0.0533 0.0545 0.0500 0.0502 0.0564 0.0563 0.0530 0.0520 0.0491 0.0523 0.0500 0.0526
SNP*age 0.0554 0.0536 0.0571 0.0540 0.0643 0.0570 0.0610 0.0527 0.0497 0.0534 0.0497 0.0513
Global Wald test 0.0911 0.0576 0.0929 0.0529 0.1031 0.0578 0.1011 0.0548 0.0850 0.0559 0.0801 0.0520
The global Wald test, which assesses whether there is any genetic effect on the whole BMI
growth trajectory, was inflated above the acceptable limits under all error variance
misspecifications and even under the Gaussian/constant variance assumption, except under
the sparse complete design. The scenario where the error variance increased over time
showed the largest inflation; however, when the robust estimates were used for the Wald test
under this scenario, the type 1 error was reduced to nominal levels in most designs, and where it
was not reduced to nominal levels it was dramatically lower than with the classical test (Table 4.6).
4.5.5 Type 1 Error in Unbalanced Designs Versus Complete Designs
Inflation in the type 1 error of the SNP*age interaction was observed in the simulations where
the error term followed a Gaussian, constant variance distribution, primarily under the
unbalanced designs rather than the complete designs. This could be due to the missing data in
the unbalanced designs or the variability in timing of the measurements (i.e. the samples were
measured at any time throughout a year rather than at an exact time). Therefore, additional
simulations under the Gaussian, constant variance distribution were conducted using two of
the sampling designs:
1. Sparse: ni=8 measures per person with few measures around the adiposity rebound;
times of measures are 1, 2, 3, 5, 8, 10, 13, 15
2. Intense: ni=14 measures per person with multiple measures around the adiposity
rebound; times of measures are 1, 2, 3, 3.5, 4, 4.5, 5, 5.5, 6, 7, 9, 11, 13, 15
However, they were not simulated with complete data as in the previous simulations; the
following combinations were used instead:
1. Complete in all individuals and they were all measured at the same time (complete,
same age)
2. Complete in all individuals but they were measured at different times within each year
period (complete, different age)
3. Each individual had 40% missing data over the time period, but were all measured at
the same time (missing, same age)
4. Each individual had 40% missing data over the time period and were measured at
different times within each year period (missing, different age)
The results from these simulations can be found in Figure 4.6 for the sparse design and Figure
4.7 for the intense design. These simulations provide evidence that the inflation was attributable
mainly to the presence of missing data rather than to the different measurement times
between individuals, with greater inflation seen in the larger sample size but no obvious
differences between MAFs.
Figure 4.6: Results from comparison between missing data or variable measurement time
under the sparse design
Figure 4.7: Results from comparison between missing data or variable measurement time
under the intense design
Since the LME is known to be robust to missing data under the missing at random and missing
completely at random assumptions, we simulated additional data varying the polynomial
function of age in the fixed and random effects. These simulations showed the type 1 error
was reduced to nominal levels when the fixed and random effects had the same function of
age, i.e. cubic function in both the fixed and random effects (Table 16 in Appendix D).
To determine whether there is remaining inflation in the type 1 error after modelling the same
function of age in the fixed and random effects when the error distribution is misspecified we
simulated additional data using the equal unbalanced sampling design (see Figure 1 of
Appendix D for outline of additional simulations). These simulations showed that the type 1
error was again reduced to nominal levels when the fixed and random effects had the same
function of age regardless of the misspecification in the error distribution (Table 17 in
Appendix D).
It is often difficult to estimate higher order terms in the random effects when using real data
due to computational and convergence issues. In this case, it is often only possible to fit a
lower-order polynomial function in the random effects than the fixed effects. We simulated
additional data where the fixed and random effects included a quadratic function for age but
we analysed the data with a quadratic function in the fixed effects and a linear function in the
random effects. In addition, we also simulated data where the fixed effects included a
quadratic function for age and the random effects included only a linear function but analysed
the data with a quadratic function in both the fixed and random effects. These simulations
showed that the type 1 error was inflated when the analysis model had a lower-order
polynomial function of age in the random effects than in the fixed effects (Table 18 in
Appendix D).
These additional simulations also showed that having the same structure of fixed and random
terms for the age polynomial function would yield nominal type 1 errors for the global Wald
test.
In summary, it is recommended that one includes the same polynomial function for age in the
fixed and random effects to avoid inflation in the type 1 error; however, if this is not possible
due to non-convergence of the model then a robust standard error is required to reach
nominal levels of type 1 error.
Given that many researchers investigating GWAS of longitudinal traits are interested in only
the SNP main effect and not the SNP*age interaction [91], we conducted some additional
simulations without the SNP*age interaction. Once again, we used the scenario where the
error variance increased over time and where there was equal unbalance in the data structure.
We found that the type 1 error was within the nominal range for the SNP main effect for both
sample sizes (N=1,000: 0.0506; N=3,000: 0.0515), where previously we saw inflation for the
sample size of 1,000 (0.0533 from Table 4.6). We have no reason to believe that any of the
other scenarios would be affected by the misspecifications when the SNP*age interactions are
not modelled.
4.5.6 Power Using the Robust Standard Error
We have shown that using the robust standard error does not affect those situations where the
type 1 error was not initially inflated. However, before adopting the robust standard error for a
GWAS analysis, it is important to determine whether using the robust standard error would
decrease the power to detect a statistically significant association.
The power for the SNP main effect parameter remains almost unchanged when using the
robust standard error rather than the normal standard error in all scenarios and under all
model misspecifications (Figure 4.8 for complete designs and Figure 4.9 for unbalanced
designs). The only scenario where the power decreased for the SNP main effect parameter by
using the robust standard error was where there was increasing variance over time under the
intense complete scenario. Given that the type 1 error was not inflated using either standard
error estimate, there appears to be no harm in using a robust standard error for estimation
even when not required.
The power for the SNP*age interaction parameter, particularly for low MAF, is considerably
more variable. Under the sparse complete design, where there was no inflation in the type 1
error, the power remains about the same using either the classical or robust standard error.
For the other designs, the power for the SNP*age interaction parameter decreases using the
robust standard error, but only by 5% or less for most error misspecifications, when the MAF
was 0.2 or greater. The simulations which assumed a t-distribution for the error had a 5-10%
reduction in power using the robust standard error when the MAF was 0.1 or 0.2; this might be due
to the substantial reduction in type 1 error. The power also decreases by more than 5%
when the variance is greater at the adiposity rebound and the variance is dependent on a
covariate, for values of MAF around 0.1 in the scenarios presented.
Figure 4.8: Difference in power based on a normal standard error versus a robust standard
error for the complete designs. A positive value indicates the power using the normal
standard error is greater than the power using the robust standard error. The two plots on
the left are for the Sparse Complete design, while the two plots on the right are for the
Intense Complete design.
Figure 4.9: Difference in power based on a normal standard error versus a robust standard
error for the unbalanced designs. A positive value indicates the power using the normal
standard error is greater than the power using the robust standard error. Here, “Equal” is
the simulations from the Equal Unbalanced design, “Over” are the simulations from the
unbalanced design with fewer samples around the adiposity rebound and “Under” are the
simulations from the unbalanced design with more samples around the adiposity rebound.
4.6 Analysis of Chromosome-Wide BMI Data
While the simulated data provided a useful platform for testing the effect of error
misspecification in LMMs in a controlled setting, it is also important to investigate how this
relates to real data. Given the simulation results, in particular the need for a robust
standard error to ensure accurate inference for the SNP*age interaction where the type 1
error is inflated, the impact of violating the distributional assumptions was investigated in a real
data application using the ALSPAC data.
Each SNP in the ALSPAC data takes approximately 30 minutes to run the LMM in addition to
the robust tests for the fixed effects and the global test. Therefore, to conduct a genome-wide
analysis on the imputed ALSPAC data using this model would take approximately 2.5 years.
This was deemed to be too large a computational burden, so the genotyped data on one
specific chromosome was used instead. The fat mass and obesity gene (FTO) on chromosome
16 is the most replicated gene to date for association with BMI in both adults and children
[174]. In addition to the cross-sectional associations with BMI, it has also been shown to be
associated with childhood growth in ALSPAC and other birth cohorts [251]. This chromosome
was therefore selected for the analysis and it was hypothesised that some significant loci
would be detected, specifically around the FTO gene, as well as many non-associated SNPs.
We used the same LMM as in equation 4.2, with the inclusion of an age*sex interaction
in the fixed effects for all the age components (i.e. $\beta_9 \mathrm{sex}_i + \beta_{10} t_{ij} \mathrm{sex}_i + \beta_{11} t_{ij}^2 \mathrm{sex}_i + \beta_{12} t_{ij}^3 \mathrm{sex}_i$)
to account for the differences in growth between males and females [197]. There were 14,875
SNPs genotyped on chromosome 16, all of which had a MAF greater than 1%; GWASs are
designed to look at common SNPs, so it is a common strategy to exclude SNPs with MAF less
than 1%. Each SNP was incorporated into the model assuming an additive genetic model.
As expected, SNPs in the FTO gene were highly significant for the global tests as well as the
SNP main effect and SNP*age interactions. It is common to display P-Values from a GWAS
analysis as a QQ plot of the observed –log10(P-Value) against the expected –log10(P-Value) under
the null distribution. Figure 4.10 displays a QQ plot from the chromosome 16 analysis in
ALSPAC for each of the parameters which displayed inflated levels of type 1 error in the
simulation study. As the 88 SNPs within the FTO region are believed to be true positives, the
QQ plots are also displayed excluding SNPs from this region (Figure 4.10, C and D). In addition,
Figure 4.11 displays the QQ plot for each of the parameters involving the SNPs from the
chromosome 16 analysis in ALSPAC, excluding the SNPs in the FTO gene. These QQ plots clearly
show that where the parameters have inflated type 1 error using the classical test, including
the global SNP test, the SNP*age and SNP*age³ P-Values, the robust test reduces this to
nominal levels. These plots also indicate that if the classical test is not inflated it may be
dangerous to use the robust test as it artificially induces inflation. When using the robust tests,
associations with SNPs in the FTO gene were still detected, both for the SNP main effect and
the SNP*age interactions.
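As an illustration of how the points on such a QQ plot are obtained, the following is a minimal sketch in Python (the analyses in this thesis were conducted in R; Python is used here for illustration only). The function name qq_points and the plotting position (i - 0.5)/n are assumptions made for this sketch and are not taken from the analysis code.

```python
import math


def qq_points(p_values):
    """Return (expected, observed) -log10(P) pairs for a QQ plot.

    Assumes every P-value lies strictly between 0 and 1.
    """
    n = len(p_values)
    # Observed -log10(P), largest (most significant) first
    observed = sorted((-math.log10(p) for p in p_values), reverse=True)
    # Under the null hypothesis P-values are uniform on (0, 1), so the
    # i-th smallest of n P-values is expected near (i - 0.5) / n
    expected = [-math.log10((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(expected, observed))


points = qq_points([0.5, 0.05, 0.005])
```

Points lying above the line of equality indicate smaller P-Values than expected under the null distribution, which is how inflation appears on these plots.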
In the chromosome wide analysis, the P-Value to declare ‘suggestive significance’ would be
0.000067 (1/14,875). Using this threshold, 57 SNPs would reach suggestive significance for the
SNP by age interaction using the classical standard error in comparison to only 16 SNPs using
the robust standard error. Six of these 16 SNPs were in the FTO gene, four of which would
reach the significance threshold.
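The suggestive threshold quoted above is simply one divided by the number of tests. A minimal sketch of the arithmetic is given below; the function names are hypothetical and the P-Values passed to count_hits are made-up values used only to show the call.

```python
def suggestive_threshold(n_tests):
    """Threshold at which one false positive is expected per n_tests tests."""
    return 1.0 / n_tests


def count_hits(p_values, threshold):
    """Number of P-values falling below the threshold."""
    return sum(1 for p in p_values if p < threshold)


threshold = suggestive_threshold(14875)
print(f"{threshold:.6f}")  # prints 0.000067

# Illustrative (made-up) P-values, not results from the ALSPAC analysis
example_hits = count_hits([1e-6, 5e-5, 1e-4, 0.03], threshold)
```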
There is still some remaining inflation in the global Wald test and SNP*age interaction,
however it is suspected this might be due to additional regions of chromosome 16 being
associated with BMI trajectory. There are two regions, in addition to FTO, that have been
shown to be associated with adult BMI which may show some association in these analyses. It
would require conducting a full GWAS analysis to get an accurate estimate of the inflation in
ALSPAC, which will be discussed further in Chapter 5.
Figure 4.10: Q-Q plot of the chromosome 16 analysis in ALSPAC for the overall Wald test and
the SNP*linear age interaction test. Plots A and B include 88 SNPs in the FTO gene, Plots C
and D exclude SNPs in the FTO gene.
Figure 4.11: Q-Q plot of the chromosome 16 analysis in ALSPAC for all parameters, excluding
88 SNPs from the FTO gene.
4.6.1 Comparison Between the Classical and Robust Tests
For the SNP*age and the SNP*age³ interactions, where the greatest inflation in the type 1
error was seen, the majority of the P-Values for the robust tests were larger than the P-Values
using the classical test. This can be seen by the bow shaped curve in Figure 4.12. This was not
the case for the global Wald test, where 36% of the robust P-Values were less than the
classical P-Values; the Wald test also displayed greater variability between the two tests
(Figure 4.12). The difference between the classical and robust P-Values for the SNP main effect
appears to fall within two groups; those that are fairly consistent between the two P-Values
and those where the classical P-Value is larger than the robust P-Value. To investigate these
groups further, the P-Values were examined by MAF. Figure 4.13 shows that for the low MAFs,
for some SNPs the robust estimates deviate from the classical estimates and may not be
accurate. In contrast, at the higher MAFs the robust P-Values were almost identical to the
classical P-Values. Although it is often the case that the robust P-Value is larger than the
classical P-Value, it is not a necessity and there have been several publications indicating that
the robust test is more ‘consistent’ than the classical test [267,268].
Focusing on the ten most significant SNPs using the classical Wald test and the robust Wald
test, only two SNPs were in common. The fixed effects terms showed more consistency
between the two tests, with six of the top 10 being significant using both the classical and
robust tests for the SNP*age interaction.
Figure 4.12: Comparison of the classical and robust tests for each of the parameters of
interest from the chromosome 16 analysis in ALSPAC
Figure 4.13: Comparison of the classical and robust tests for the SNP main effect by minor
allele frequency (MAF) from the chromosome 16 analysis in ALSPAC
4.7 Discussion
In this Chapter, longitudinal data was simulated that mimicked childhood BMI to explore the
coverage probability, bias, power and type 1 error for association with a SNP when the linear
mixed effects model is misspecified with either a non-Gaussian error distribution or
heteroskedastic error. We have shown that the type 1 error for the SNP*age interaction terms
in a genetic association study has no inflation if the same function of age is included in both
the fixed and random effects. However, type 1 error is inflated, regardless of the model
misspecification, if the age function in the fixed and random effects differs. In situations where
the model is too complex and will not converge with a high order polynomial function in the
random effects, an appropriate way to deflate the type 1 error to nominal levels is to use a
robust standard error for the fixed effects parameters. Although robust standard errors have
previously been used in a wide range of statistical applications, LMMs are only just beginning
to be utilized in GWASs and therefore guidance on their application was warranted. Given that
QQ plots in GWASs are an important diagnostic to rule out the possibility of population
stratification, it is essential to generate standard errors that perform well under the null
hypothesis so that any remaining inflation is not due to the model fitting. Consistent with the
conclusions of Gurka et al [269] and Verbeke and Molenberghs [270], the sandwich estimator
is a valid alternative when the model assumptions are violated; however, it is less efficient
than using the correct covariance model.
Similar to Jacqmin-Gadda et al [256], results in this Chapter have shown that estimates of
differences in slope by the number of copies of minor allele are sensitive to heterogeneous
error variance, particularly when the error variance depends on a covariate or increases over
time. The variance of the estimates is underestimated and therefore the confidence interval is
too narrow; this is consistent with the inflated type 1 error under these misspecified model
assumptions.
Of all the misspecifications investigated, the situation where the error variance increases over
time and is not accounted for in the modelling has poor parameter estimates, low power and
the most inflation of the type 1 error, particularly for the SNP*age interaction terms. It also
appears that by using the robust standard error, the inflation in the type 1 error is reduced to
the nominal level in only some of the scenarios. It is therefore imperative that some
adjustment is made in the modelling to account for this increasing variance over time. In the
ALSPAC BMI data, the variance stays relatively constant until around the age of four years,
when it rapidly increases until around 11 years of age before plateauing again. This is due to
the different growth rates between individuals through the adiposity rebound and puberty.
Increasing variability over time is observed with many other phenotypes both in childhood and
adulthood; for example, lung function in an elderly population can decrease due to the rate at
which individuals are diagnosed with diseases such as chronic obstructive pulmonary disease,
while other individuals remain healthy. Variance functions for modelling heteroscedasticity in
mixed effects models have been studied in detail by Davidian and Giltinan [271] and can be
implemented using the varFunc classes in the nlme package in R [252]. There are also
equivalent functions in alternative statistical packages such as MLwiN [272]. The use of these
variance functions could be recommended in the context of GWASs, if there is remaining
heteroscedasticity in the residuals after appropriately modelling the fixed and random effects;
however further studies are needed to assess their properties in this context.
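For illustration, two commonly used forms of variance function are written out below, in which the residual variance changes with age. The notation ($\sigma^2$ for the baseline residual variance, $\delta$ for the variance parameter and $t_{ij}$ for the age of individual i at measurement j) is assumed for this sketch. The exponential and power forms correspond to the varExp and varPower classes in nlme respectively; neither was fitted in this Chapter.

```latex
% Exponential variance function (as in the varExp class)
\operatorname{Var}(\varepsilon_{ij}) = \sigma^2 \exp\left(2 \delta\, t_{ij}\right)

% Power variance function (as in the varPower class)
\operatorname{Var}(\varepsilon_{ij}) = \sigma^2 \left| t_{ij} \right|^{2 \delta}
```

In both forms, $\delta = 0$ recovers the constant variance assumed by the standard LMM, and $\delta > 0$ gives a residual variance that increases with age.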
When looking at SNPs with low MAFs, it seems that by using the robust standard error the
power is reduced by approximately 5%. To counteract this reduction, studies can increase the
sample size through the use of meta-analysis of multiple cohort studies, as is commonly done in
GWAS analyses. However, several manuscripts have previously discussed the extended
computational time for longitudinal GWAS in comparison to GWAS of cross-sectional
phenotypes, so it is recommended that large computing clusters are available to those cohort
studies conducting analyses. The longitudinal GWAS of cardiovascular risk factors presented in
Smith et al [246] took approximately 3 hours on 64 processors of a compute cluster for
600,000 tests in 525 individuals. Sikorska et al [245] illustrated that the analysis of 2.5 million
SNPs using the lme function in the nlme package of R would take 3,500 hours for a sample size
of 3,000 individuals on a desktop computer (Intel(R) Core(TM) 2 Duo CPU, 3.00 GHz). These
times are consistent with those in this study; the chromosome 16 analysis of 14,875 SNPs in
the 7,916 ALSPAC individuals took approximately 125 hours on 32 processors of a compute
cluster (BlueCrystal Phase 2 cluster with each node having four 2x2.8 GHz core processors and
8 GB of RAM).
It has been suggested that the genome-wide significance threshold should be set at
5×10⁻⁸ [273,274]. In addition, Duggal et al [265] established an appropriate P-Value threshold
based on the number of independent SNP tests in a GWAS. If study data is imputed against the
HapMap CEU population, they suggest a threshold of P-Value < 6.09×10⁻⁶ be used to select
SNPs with suggestive evidence for follow-up. Many cross-sectional GWASs use thresholds
around this, generally ranging from P-Value < 5×10⁻⁶ [72] to P-Value < 1×10⁻⁵ [179], to select
SNPs for replication. In longitudinal genetic association studies, particularly those with
complex, non-linear trajectories, controlling the type 1 error of the many parameters involving
SNP effects, can be quite challenging. This would be the case, for example, when using
smoothed splines functions and those functions could interact with the SNP effects. Providing
robust standard errors in this context can be difficult. As an alternative, it may be plausible to
use genomic control procedures to reduce a possible inflation in the type 1 error for the
parameters involving the SNP effects [40,275]. Genomic control is typically used in genetic
association studies to account for the potential confounding due to cryptic relatedness. It
makes the assumption that the inflation in type 1 error is constant across all markers in the
genome; this assumption is plausible in the context of cryptic relatedness as the inflation is
due to the kinship coefficients which are unrelated to the individual loci. In the context of
LMMs one would need to show that the inflation was uniform across the genome or genetic
region of interest. Benke et al [247] suggested using a joint test of all SNP effects, similar to the
global Wald test used in the current study, as an optimal way to control the type 1 error and
increase power. However, caution needs to be applied when utilizing this method for complex
traits, such as BMI trajectories over childhood, and a genome-wide significance threshold
should only be used if there is no inflation detected in the type 1 error. Benke et al [247] used
a trait with a linear decrease over time and low correlation between the intercept and slope
parameters; in contrast, in this study there is a complex trajectory over time with high
correlation between the intercept and slope parameters, which indicated that the joint test
has inflated type 1 error and can only be reduced using a robust estimate in some scenarios.
Caution needs to be taken when using the robust test for the global test as the analysis of
chromosome 16 in ALSPAC showed large variability between the classical and robust global
tests, which also led to different ‘top hits’ depending on which test was used.
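To make the genomic control procedure mentioned above concrete, the following is a minimal sketch in Python of the usual calculation: the inflation factor is the median of the observed 1 degree of freedom chi-square statistics divided by the median expected under the null (approximately 0.455), and each statistic is divided by this factor. The function names are hypothetical, the P-Values are assumed to come from two-sided 1 degree of freedom tests, and this procedure was not applied in this Chapter.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()
# Median of a chi-square distribution with 1 degree of freedom (about 0.455)
CHI2_1DF_MEDIAN = _STD_NORMAL.inv_cdf(0.75) ** 2


def p_to_chi2(p):
    """Convert a two-sided P-value (0 < p <= 1) to a 1-df chi-square statistic."""
    z = _STD_NORMAL.inv_cdf(1.0 - p / 2.0)
    return z * z


def genomic_inflation(p_values):
    """Inflation factor: median observed statistic over the null median."""
    return median(p_to_chi2(p) for p in p_values) / CHI2_1DF_MEDIAN


def genomic_control_adjust(p_values):
    """Divide each statistic by the inflation factor and recompute P-values."""
    lam = max(genomic_inflation(p_values), 1.0)  # statistics are never inflated
    adjusted = []
    for p in p_values:
        z = (p_to_chi2(p) / lam) ** 0.5
        adjusted.append(2.0 * (1.0 - _STD_NORMAL.cdf(z)))
    return lam, adjusted
```

Because a single factor is applied to every marker, the adjustment is only appropriate if the inflation is uniform, which is the condition discussed above.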
4.8 Conclusion
Based on the simulation results in this Chapter, it is strongly suggested that one fits the same
function of age in the fixed and random effects to avoid inflation of the type 1 error of the
SNP*age interaction terms. If this is not possible due to convergence issues, then it is
recommended that one uses a robust standard error for the SNP by age interaction terms to
reduce the type 1 error inflation in GWASs, regardless of whether or not the error term of the
model correctly follows the model assumptions. If no inflation in the type 1 error is detected
for a particular parameter of interest, then the classical standard error should be used; for
example, for the SNP main effect parameter in this study.
Chapter 5: Genome-Wide Association Study Of BMI Trajectories Across Childhood
5.1 Introduction
Now that the most efficient model has been defined and extensively studied in Chapter 2 and
the effect of any possible model misspecification has been investigated in Chapter 4, the next
stage is to conduct the genetic analyses. This chapter outlines the genome-wide association
analyses that were conducted, the results found and the future publication that is planned.
5.2 Background
There is growing evidence that genetic variants, particularly SNPs, within genes influence an
individual’s risk of many common diseases. As mentioned in Chapter 1, Section 1.2, there is
also a growing body of literature from observational and animal studies demonstrating the
influence of antenatal and early life factors on disease risk in later life. One such study showed
that poor infant and child growth led to type 2 diabetes in adulthood [276]. Some of the
identified genetic variants have been found to either act on both the early life factors and
disease outcome or they appear to modify the relationship between them. For example,
Freathy et al found that a SNP in the ADCY5 gene had pleiotropic effects on birth weight,
glucose regulation and type 2 diabetes in adulthood [73]. In contrast, polymorphisms in the
PPAR-γ2 gene modify the relationship between size at birth and hypertension [277], obesity
[278] and insulin sensitivity [279]. Because of these and other studies, it is now widely
accepted that genetic variants play an important role in the DOHaD and life course approaches
to adult disease described in Chapter 1, Section 1.2. Newnham et al [117] described these
relationships with the diagram in Figure 5.1. The inherited genetic variants and antenatal
environmental exposures, along with the postnatal environmental exposures, predispose
individuals to a range of diseases in adulthood. These exposures can work additively, whereby
each exposure independently increases disease risk, or multiplicatively, whereby each
exposure modifies the disease risk imposed by other exposures.
Figure 5.1: Schematic describing the relationships between genetic variants, environmental
exposures and modification to disease risk in adulthood. Image adapted from Newnham et
al [117].
There are several pathways by which genetic variants could affect adult disease risk. These
include but are not limited to:
1. The variant could be directly associated with the disease (green arrow in Figure 5.1).
For example, there are at least 18 SNPs associated with type 2 diabetes [280].
2. The variant could be associated with a mediator of disease, such as BMI (orange arrow
in Figure 5.1). As many as 32 genetic regions have been found to date to be associated
with BMI in adulthood [72]. At least one of these genes, the FTO gene, is also
associated with increased risk of type 2 diabetes but only through its influence on BMI
[174].
3. The variant could be associated with an adverse antenatal environment, for which
birth weight is often used as a surrogate, as in Freathy et al described above [73] (blue
arrows in Figure 5.1).
4. The variant could be associated with an adverse postnatal environment, where poor
growth during childhood is one of many different markers (red arrows in Figure 5.1).
To date, there is one GWAS investigating childhood obesity in populations of European
descent from the Early Growth Genetics (EGG) Consortium (http://egg-consortium.org/) [183].
Bradfield et al defined cases as children reaching ≥95th percentile for their age and sex at least
once before the age of 18 years, and controls were consistently below the 50th percentile
throughout childhood [183]. Using this definition, they identified two novel genetic variants
associated with childhood obesity; these variants also show evidence for association with BMI
in adulthood, although not at a genome-wide level of significance [183]. The gene near their
first variant, olfactomedin 4 (OLFM4), had not previously been implicated with obesity,
however it had been studied in the context of several cancers, particularly gastrointestinal
cancers [281]. The other variant was in the homeobox B5 (HOXB5) gene, which is involved in
gut and lung development. They concluded that both genes may impact obesity risk through
their influence on gut function. There are no genome-wide studies looking at BMI as a
continuous trait, or BMI trajectories over childhood. In a recent review article of the field of
obesity genetics, Day and Loos highlight the importance of conducting GWASs in children and
adolescents to identify additional loci that may have important effects early in life rather than
adulthood [282].
5.2.1 Aims
The aim of this study is to assess the genetic basis of BMI growth trajectories across childhood
and adolescence. Any genes found to be associated with growth trajectories will be
characterised in terms of their timing and effect on growth.
5.3 Statistical Methods
5.3.1 Study Populations
All three cohorts are described in detail in Chapter 1, Section 1.6. A full GWAS was conducted
in the Raine Study, while the other two cohorts were used for replication. The subsets used in
this analysis are described below. The data from each cohort was cleaned as outlined in
Section 5.3.2; the subsets described include only the cleaned data.
5.3.1.1 Raine Study
A subset of 1,461 individuals was used for analysis in this study using the following inclusion
criteria: at least one parent of European descent, live singleton birth, unrelated to anyone in
the sample (one of every related pair was selected at random), no major congenital anomalies,
genotype data and at least one measure of BMI between ages 1 and 17 years. BMI was
calculated from the weight and height measurements (median six measures per person, IQR:
5-7, range 1-8 measurements), with a total of 8,670 BMI measures.
5.3.1.2 ALSPAC
A subset of 7,868 individuals was used for analysis in this study using the same criteria as in the
Raine Study. BMI was calculated from the weight and height measurements (median nine
measures per person, IQR: 5-12, range 1-29 measurements), with a total of 68,862 BMI
measures.
5.3.1.3 NFBC66
A subset of 3,918 individuals was used for analysis in this study using the same criteria as in the
Raine Study. BMI was calculated from the weight and height measurements (median 12
measures per person, IQR: 9-15, range 1-28 measurements), with a total of 48,530 BMI
measures.
5.3.2 Data Cleaning
Several steps were conducted independently in each of the cohorts to ensure that the
phenotypic data was clean before beginning the genetic association analysis. These steps
included:
1. Removing missing BMI records or data outside our age range of interest (1-17 years).
2. Due to the data collection methods using medical databases in ALSPAC and NFBC66,
there were potentially multiple measures of height and weight at each age. The
SPLMM will not run unless there are unique values for each of the variables.
Therefore, all multiple measures were removed except the final measure.
3. Similarly, some height and weight values were repeated at different ages; for example, if the
child had a health care visit one day, the parent might record those same values
the following day on the cohort questionnaire, so that the height/weight values are the
same but the age is different. All but the final measure were removed.
4. Some of the height measures decreased or remained constant over time. The
following step-wise process was used to identify which record would be removed:
   - If height at time j was higher than height at time j+1 and j+2, then it was
     removed. E.g. height at 1 year=80cm, height at 1.5 years=75cm, height at
     1.7 years=77cm; height at 1 year was removed.
   - If height at time j was higher than height at time j+1, but height at time j-1 was
     also higher than height at j+1, then height at j+1 was removed. E.g. height at 1
     year=74cm, height at 1.5 years=75cm, height at 1.7 years=72cm; height at
     1.7 years was removed.
   - Finally, if the above two steps didn’t apply, then the second height measure
     was removed (i.e. j+1).
5. Finally, any height or weight measures that were beyond ±4SD of the mean of each year
group were investigated in males and females separately. For height, all such measures
were removed; however, for weight a two stage process was undertaken:
a. If the measure identified was the only measure ±4SD for that individual then it
was removed. If there was more than one measure for an individual that was
±4SD then they were retained. The reason for this was that weight becomes
increasingly skewed as the children get older, and it was therefore expected
that a number of correctly measured ‘outliers’ would be present.
b. If the measure identified was the only measure ±4SD for an individual and it
was their final measure, then their previous measure was looked at. If their
previous measure was ±3SD then their final measure was kept in, but
otherwise removed. The rationale is that the individual is perhaps on an
increasing trend in weight and if there were further measures they are likely to
be ±4SD.
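The step-wise rules for non-increasing heights in step 4 above can be sketched in Python as follows. This is an illustration only, not the code used to clean the cohort data: the function name, the (age, height) tuple representation and the choice to repeat the rules until the heights are strictly increasing are assumptions made for this sketch.

```python
def clean_heights(records):
    """Remove height records until heights strictly increase with age.

    records: list of (age_in_years, height_in_cm) tuples sorted by age.
    """
    recs = list(records)
    while True:
        heights = [h for _, h in recs]
        # First position j where height fails to increase at j+1
        j = next((i for i in range(len(heights) - 1)
                  if heights[i] >= heights[i + 1]), None)
        if j is None:
            return recs
        if (j + 2 < len(heights) and heights[j] > heights[j + 1]
                and heights[j] > heights[j + 2]):
            # Rule 1: j is higher than both following measures, so drop j
            del recs[j]
        elif (j >= 1 and heights[j] > heights[j + 1]
                and heights[j - 1] > heights[j + 1]):
            # Rule 2: j+1 is also lower than the measure before j, so drop j+1
            del recs[j + 1]
        else:
            # Rule 3: otherwise drop the second measure (j+1)
            del recs[j + 1]


# The two worked examples from the text
example_1 = clean_heights([(1.0, 80), (1.5, 75), (1.7, 77)])
example_2 = clean_heights([(1.0, 74), (1.5, 75), (1.7, 72)])
```

In the first example the 1 year record is removed, and in the second the 1.7 year record is removed, matching the examples given in step 4.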
5.3.3 Longitudinal Modelling
As shown in Chapter 2, the best fitting model for the GWAS was the SPLMM [197]. Therefore,
the final model for the jth individual and at the tth time-point is as follows:
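\[
\begin{aligned}
\mathrm{BMI}_{jt} = {} & \beta_0 + \Big( \sum_i \beta_i (\mathrm{Age}_{jt} - \overline{\mathrm{Age}})^i + \sum_k \gamma_k \big( (\mathrm{Age}_{jt} - \overline{\mathrm{Age}}) - \kappa_k \big)_+^i \Big) * \mathrm{sex} + \sum_l \beta_l \mathrm{Covariate}_l \\
& + u_{0j} + \sum_i u_{ij} (\mathrm{Age}_{jt} - \overline{\mathrm{Age}})^i + \sum_k \eta_{kj} \big( (\mathrm{Age}_{jt} - \overline{\mathrm{Age}}) - \kappa_k \big)_+^i + \varepsilon_{jt}
\end{aligned}
\]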
where $\overline{\mathrm{Age}}$ is the mean age over the t time points in the sample (i.e. eight years), $\kappa_k$ is the kth
knot and $(t - \kappa_k)_+ = 0$ if $t \le \kappa_k$ and $(t - \kappa_k)$ if $t > \kappa_k$; this is known as the truncated power basis,
which ensures smooth continuity between the time windows. Covariate includes the first five
principal components in the Raine Study and NFBC66 studies and the measurement source
variable in ALSPAC only. The knot points used in each study are defined below. All models
assumed a continuous autoregressive of order 1 correlation structure. Genetic differences in
the trajectories were estimated by including an interaction between the spline function for age
and the imputed value for each genetic variant (i.e. an additive genetic model).
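As an illustration of the truncated power basis used in this model, the following minimal Python sketch builds the age terms for a single measurement. The function name is hypothetical, the knots are supplied in years and shifted to the centred age scale, and a cubic is assumed for every spline segment; the fitted models were estimated in R and are not reproduced here.

```python
def spline_basis(age, mean_age, knots, degree=3):
    """Polynomial and truncated power terms for one age measurement.

    Returns [c, c^2, ..., c^degree, (c - k_1)_+^degree, ...], where c is the
    centred age and each k is a knot expressed on the centred age scale.
    """
    centred = age - mean_age
    polynomial = [centred ** i for i in range(1, degree + 1)]
    truncated = [max(centred - (knot - mean_age), 0.0) ** degree
                 for knot in knots]
    return polynomial + truncated


# Knots at two, eight and twelve years with a mean age of eight years
basis = spline_basis(10.0, 8.0, [2.0, 8.0, 12.0])
# basis == [2.0, 4.0, 8.0, 512.0, 8.0, 0.0]
```

The truncated term for the knot at twelve years is zero for a ten year old, so that segment only contributes to the curve after its knot has been passed.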
In Chapter 2, it was shown that the BMI growth trajectories over childhood differ between
males and females, and different genetic variants are associated with growth in males and
females. However, a recent GWAS meta-analysis from the GIANT consortium has shown that
there are no genome-wide significant differences between males and females for BMI in adults
[283]. Given the results of the meta-analysis and the computational time involved in
conducting a GWAS using these longitudinal models, the sexes were combined into one model
for the full GWAS analysis, with the inclusion of an age by sex interaction; however, the
significant findings will be further characterised by looking at sex stratified models.
In the Raine Study and ALSPAC, the ideal knot point placement in males and females was at
ages two, eight and twelve years. Given the optimal knot points in the males and females in
these two cohorts were the same, these were used for the combined model in the GWAS
analysis. However, for NFBC66, different knot points were optimal in males (two, ten and
sixteen) and females (three, ten and twelve), so the data was combined and the modelling was
conducted again to determine the optimal placements of the knots (using the same criteria as
in Chapter 2, Section 2.4.3). The final knots chosen for the NFBC66 combined data were two,
ten and twelve years. These knot points were used for the cohort specific analyses,
with a cubic slope for each spline.
5.3.4 Statistical Analysis
A full GWAS analysis was conducted in the Raine Study. A discussion regarding the GWAS
analysis of the ALSPAC and NFBC66 cohorts will be presented in Section 5.6.
A robust standard error was calculated for each fixed effect parameter and corresponding P-
Value, using the same calculation as in Chapter 4, Section 4.4.4. A robust test of the overall
SNP effect using a Wald test was not conducted as these were computationally intensive, and
still displayed inflation in the type 1 error in the simulations presented in Chapter 4, Section
4.5.4.
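To illustrate how the choice of standard error changes the P-Value for a single fixed effect, the sketch below computes a two-sided P-Value from an estimate and its standard error. It assumes a normal reference distribution for the test statistic (a simplification), and the function name and the numerical values are hypothetical rather than results from this study.

```python
from statistics import NormalDist


def wald_p_value(estimate, standard_error):
    """Two-sided P-value for estimate / standard_error under a standard normal."""
    z = abs(estimate / standard_error)
    return 2.0 * (1.0 - NormalDist().cdf(z))


# Hypothetical SNP*age estimate with a classical and a larger robust standard error
p_classical = wald_p_value(0.030, 0.010)
p_robust = wald_p_value(0.030, 0.013)
```

With the same estimate, the larger robust standard error gives a larger P-Value, which is the pattern seen for most SNP*age tests in Chapter 4.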
Given the complex nature of the curve being fit by these longitudinal models, four test
statistics were chosen to be of interest from the GWAS including:
1. Global test (Wald test): this will determine which SNPs affect overall BMI and BMI
growth over childhood. This is calculated by conducting a Wald test for a model with
and without the SNP.
2. SNP by age interaction (Wald test): this will determine which SNPs affect BMI growth
over childhood. Given the spline function has multiple parameters (i.e. it is a non-
linear function of time), it is necessary to use a global test to summarize the effect of
each SNP on BMI growth. This is calculated by conducting a Wald test for a model with
just the SNP main effect and a second model with the SNP by spline function
interaction.
3. SNP main effect: this will determine which SNPs affect average childhood BMI.
4. SNP by linear age interaction: this will determine which SNPs affect the initial increase
in growth.
These tests were deemed the most appropriate to identify the genetic associations with BMI
growth trajectories as they look at both the change in BMI over time as well as the overall shift
(up or down) of the curve.
All analyses were conducted in R [222] using the nlme package. The usual genome-wide
significance threshold is 5×10⁻⁸ [273,274]; however Duggal et al [265] suggest a threshold of
suggestive evidence (P-Value < 6.09×10⁻⁶) based on the number of independent SNPs when the
study data is imputed against the HapMap CEU population. This suggestive evidence P-Value
threshold was adopted for the current study, as it is consistent with many of the cross-sectional
GWASs conducted to date whose P-Values to select SNPs for follow-up generally
range from P-Value < 5×10⁻⁶ [72] to P-Value < 1×10⁻⁵ [179].
5.3.5 Additional Analysis for Characterizing Significant Findings
Regions of interest were determined by at least one SNP in the region reaching the suggestive
level of significance across two or more of the parameters of interest. For regions that were
deemed to be significant, several additional analyses were conducted on the most significant
SNP in the region to determine how it was affecting growth. This included: 1) analysing both
the height and weight trajectories over the same time period to determine whether the SNP
had a larger effect on skeletal or adiposity change; 2) investigating whether the SNP can be
detected from birth; 3) exploring how the SNP influences several aspects of the growth
trajectory, including the age and BMI at the adiposity rebound. These analyses were conducted
in the Raine Study only; the details of the analyses are described below. In addition, replication
of the region of interest was attempted in the ALSPAC and NFBC66 cohorts.
The same SPLMM framework was used to model weight and height trajectories over childhood
and adolescence. While the optimal height model was the same as the BMI model, the weight
model had the same placement for the knot points but had a linear spline from 1-2 years, cubic
slope for 2-8 years and 8-12 years and finally a quadratic slope for over 12 years. The height
and weight models also assumed a continuous autoregressive of order 1 correlation structure.
Genetic differences in the trajectories were estimated by including an interaction between the
spline function for age and the imputed value for the genetic variant.
The association between the imputed genetic variant and weight and length at birth was
analysed using linear regression, adjusting for gestational age at birth.
Age and BMI at adiposity rebound were derived by setting the first derivative of the fixed and
random effects from the BMI model between two and eight years of age for each individual to
zero (i.e. the minimum point in the curve). Linear regression, adjusting for sex, was used to
investigate the associations of the imputed genetic variant with age and BMI at the adiposity
rebound.
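The derivation of the adiposity rebound can be sketched as follows, under the simplifying assumption that an individual's predicted BMI between two and eight years is a single cubic polynomial in age. The function name and coefficients are hypothetical, and the fitted spline in the actual model may have a different form.

```python
import math

def adiposity_rebound(a, b, c, d, lo=2.0, hi=8.0):
    """Locate the minimum of BMI(t) = a + b*t + c*t**2 + d*t**3 on [lo, hi].

    Returns (age, bmi) at the interior minimum, or None if the curve has no
    stationary minimum inside the interval.
    """
    def bmi(t):
        return a + b * t + c * t ** 2 + d * t ** 3

    # Roots of the first derivative b + 2*c*t + 3*d*t**2 = 0.
    if d == 0:
        roots = [-b / (2 * c)] if c != 0 else []
    else:
        disc = (2 * c) ** 2 - 4 * (3 * d) * b
        if disc < 0:
            roots = []
        else:
            r = math.sqrt(disc)
            roots = [(-2 * c - r) / (6 * d), (-2 * c + r) / (6 * d)]

    # Keep roots inside the interval where the second derivative is positive.
    minima = [t for t in roots if lo <= t <= hi and 2 * c + 6 * d * t > 0]
    if not minima:
        return None
    t = min(minima, key=bmi)
    return t, bmi(t)

# Hypothetical individual-level coefficients (fixed plus random effects).
print(adiposity_rebound(18.0, -0.75, 0.0, 0.01))
```

Individuals for whom no minimum falls inside the interval return None and would need separate handling.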
Replication analysis in the ALSPAC and NFBC66 cohorts included SNPs 250kb up and
downstream from the most significant SNP in the region of interest (a total of 500kb around
the most significant SNP). Given the Raine Study is relatively small in comparison to the two
replication cohorts, the whole region of interest was used for the replication stage, rather than
just the most significant SNP, to ensure that the analysis from the Raine Study had highlighted
the correct gene for association with BMI trajectory and not a nearby gene. Association
analysis between these SNPs and longitudinal BMI was conducted using the SPLMM models
described in Section 5.3.3. Results from the two cohorts were meta-analysed. SNPs were
excluded if they had a MAF less than 5%, an R2 value of less than 0.3 in ALSPAC (i.e. imputation
quality score from MACH [53]), or an info value less than 0.4 in NFBC66 (i.e. imputation
quality score from IMPUTE [52]). For the SNP main effect and SNP by linear age interaction
terms, a random-effects inverse-variance weighted meta-analysis was conducted in METAL
[284] using the beta coefficients and standard errors from the two studies; the robust standard
errors were used for the SNP by linear age interaction term. Stouffer’s Z-score method [285],
weighting by the number of individuals in each study, was used to meta-analyse the global
Wald tests for the overall SNP effect and the SNP by age interaction.
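The two meta-analysis approaches can be sketched as below. This is a simplified illustration rather than the METAL implementation: the inverse-variance function uses the fixed-effect form of the weights (the random-effects version adds a between-study variance component), and the Stouffer function takes the study weights as an argument. Using the square root of the sample size as the weight in the example is an assumption, as are the sample sizes themselves.

```python
import math
from statistics import NormalDist

_N = NormalDist()

def inverse_variance_meta(betas, ses):
    """Fixed-effect inverse-variance pooling of per-study betas and standard errors."""
    weights = [1.0 / se ** 2 for se in ses]
    beta = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    z = beta / se
    p = 2.0 * (1.0 - _N.cdf(abs(z)))  # two-sided P-Value
    return beta, se, p

def stouffer(p_values, weights):
    """Weighted Stouffer Z-score combination of one-sided P-Values.

    Each P-Value must lie strictly between 0 and 1.
    """
    zs = [_N.inv_cdf(1.0 - p) for p in p_values]
    z = sum(w * zi for w, zi in zip(weights, zs)) / math.sqrt(sum(w * w for w in weights))
    return z, 1.0 - _N.cdf(z)

# Hypothetical per-study results for two cohorts.
print(inverse_variance_meta([0.1, 0.3], [0.1, 0.1]))
print(stouffer([0.05, 0.05], [math.sqrt(7000), math.sqrt(3000)]))
```

The Wald tests have no direction of effect, so their P-Values are converted to Z-scores through the upper tail of the normal distribution before being combined.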
5.3.6 Pathway Analysis
A common extension to a GWAS analysis is to look at the biological process underlying the
GWAS signals using a gene/pathway analysis [286]. These analyses, also known as ‘gene set
enrichment analyses’, use prior biological knowledge on gene function to group SNPs along a
particular pathway together for a more powerful analysis than the single SNP approaches.
There are various tools and databases designed specifically for these analyses, including
MAGENTA [287], i-GSEA4GWAS [288] and GSA-SNP [289]. i-GSEA4GWAS was used as it did not
require any additional software licences, used test statistics from the GWAS results rather than
importing the individual level genotype data and used a competitive test as the null hypothesis
whereby the statistics for genes in a given pathway were compared to statistics for other
genes, which adjusts for the genomic inflation of the test statistic. It has previously been
shown that the results of these analyses, particularly in i-GSEA4GWAS, are strongly affected by
imputation [290], as LD patterns are not taken into account; Zhang et al discuss this in their
manuscript and suggest that SNP sets are pruned, removing SNPs in LD at R2>0.2, before being used in analysis [288].
Therefore, a list of SNPs was generated from the genotyped data in the Raine Study which
included SNPs not in LD; these SNPs were selected from the GWAS results files for analysis in i-
GSEA4GWAS. Only gene sets having 20-200 gene members were used, with a 20kb padding
added to the end of each gene. The canonical pathways database was used to define the genes
in each pathway, which contain the pathways integrated and curated from a variety of online
resources, as outlined in MSigDB v2.5 (http://www.broadinstitute.org/gsea/msigdb/index.jsp).
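The selection of SNPs not in LD can be illustrated with a simple greedy pruning sketch, assuming pairwise r2 values are already available. The SNP identifiers, the r2 values and the function name are hypothetical, and dedicated tools handle this step more efficiently using sliding windows.

```python
def prune_by_ld(snps, r2, threshold=0.2):
    """Greedily keep SNPs whose pairwise r2 with every kept SNP is <= threshold.

    snps: SNP identifiers in the order they should be considered.
    r2:   dict mapping frozenset({snp_a, snp_b}) to pairwise r2; missing
          pairs are treated as being in linkage equilibrium (r2 = 0).
    """
    kept = []
    for snp in snps:
        if all(r2.get(frozenset((snp, k)), 0.0) <= threshold for k in kept):
            kept.append(snp)
    return kept

# Hypothetical SNPs: rsA and rsB are in strong LD, rsC is independent of both.
pairwise_r2 = {
    frozenset(("rsA", "rsB")): 0.9,
    frozenset(("rsB", "rsC")): 0.1,
}
print(prune_by_ld(["rsA", "rsB", "rsC"], pairwise_r2))
```

The result depends on the order in which SNPs are considered, so the ordering (for example by genomic position) should be fixed in advance.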
5.4 Results
5.4.1 Comparison of Cohorts
The growth trajectories of the three cohorts were compared to investigate the potential
between-study heterogeneity. It is important to know the heterogeneity of the growth
trajectories between the studies so any attempt to replicate genetic associations can be
interpreted in light of these differences. Figure 5.2 displays the growth trajectories in females
and males in the three cohorts. BMI at age one is almost identical in the two European
cohorts, ALSPAC and NFBC66, whereas it is lower in the Raine Study. The NFBC66 has a
considerably lower BMI at the adiposity rebound on average than the Raine Study and ALSPAC,
whereas the Raine Study has an earlier average adiposity rebound than the other cohorts. This
leads to a lower BMI through adolescence and into adulthood for the NFBC66, whereas both
ALSPAC and the Raine Study have similar trajectories with the Raine Study crossing over
ALSPAC for the first time just before the onset of puberty. These profiles are similar in both
males and females.
Table 5.1 also highlights the differences in the timing and magnitude of the adiposity rebound,
with the Raine Study, particularly the females, having a much earlier rebound than the other
two cohorts. The negative correlation between age and BMI at the adiposity rebound in all
cohorts shows that an earlier rebound is associated with a higher BMI at the rebound; it is also
common for BMI to track throughout childhood, therefore a high BMI at the adiposity rebound
often leads to a high BMI in later life. Interestingly, the Raine Study has the strongest correlation
between the age and BMI at the rebound. The lower BMI at the rebound in the NFBC66 may
highlight the generational differences between these cohorts, with the NFBC66 being recruited
more than 23 years earlier than the Raine Study and ALSPAC. Over that time, the
environmental determinants of obesity changed dramatically with the increase in fast food
consumption and decrease in physical activity.
Figure 5.2: Population average BMI trajectories in females (A) and males (B) for each of the
three cohorts; the Raine Study (red), ALSPAC (green) and NFBC66 (blue).
Table 5.1: Mean (SD) age and BMI at the adiposity rebound in the three cohorts, in addition
to the correlation between the two measures.
                                  Raine Study                 ALSPAC                      NFBC66
                                  Female        Male          Female        Male          Female        Male
Age at Adiposity Rebound          4.63 (1.08)   5.30 (1.04)   5.57 (1.17)   6.10 (1.01)   5.53 (0.95)   5.63 (0.84)
BMI at Adiposity Rebound          15.40 (0.93)  15.53 (0.95)  15.72 (1.07)  15.80 (1.06)  15.27 (1.20)  15.41 (1.06)
Correlation between Age and       -0.85         -0.84         -0.64         -0.67         -0.57         -0.53
BMI at Adiposity Rebound
5.4.2 Results from the Raine Study GWAS
5.4.2.1 Summary of GWAS
As outlined in Chapter 1, Section 1.4.3, it is important to check the distribution assumptions of
the test statistics to ensure that they have the correct asymptotic behaviour. Figure 5.3
displays the Q-Q plots for each of the four tests of interest from the GWAS. As observed, the
test statistics from the two Wald tests are inflated, which was anticipated given the simulation
study conducted in Chapter 4. The test for the SNP main effect showed no inflation; however
when using the robust standard error this measure became slightly inflated, therefore the
usual standard error will be used from here on for this parameter. Finally, the test for the SNP
by linear age interaction effect showed inflation; however the inflation factor was reduced to 1.06 using the
robust test, which will be referred to for the remainder of this Chapter. Given the robust test
will be used for the test of the SNP by linear age parameter but not the SNP main effect, it is
important to see how it compares to the classical test for the two parameters. Figure 5.4
shows that for the SNP by age fixed effect, the robust P-Values are often the same as or larger
than the classical P-Values, whereas the robust P-Values for the SNP main effect are not
consistently smaller or larger than the classical P-Values.
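Inflation of single-parameter test statistics is often summarised by the genomic inflation factor, the ratio of the median observed chi-square statistic to its expected value under the null. The sketch below assumes two-sided P-Values from tests with one degree of freedom; the function name is hypothetical, and whether the inflation figures reported in this Chapter were computed in exactly this way is an assumption.

```python
from statistics import NormalDist, median

def genomic_inflation(p_values):
    """Genomic inflation factor for two-sided P-Values from 1-df tests.

    Each P-Value must be greater than 0 and at most 1.
    """
    nd = NormalDist()
    # Convert each P-Value back to a 1-df chi-square statistic.
    chi2 = [nd.inv_cdf(1.0 - p / 2.0) ** 2 for p in p_values]
    # Median of a 1-df chi-square distribution (approximately 0.455).
    expected_median = nd.inv_cdf(0.75) ** 2
    return median(chi2) / expected_median

# P-Values spread uniformly over (0, 1) should give a factor close to 1.
uniform_p = [(i + 0.5) / 1000 for i in range(1000)]
print(genomic_inflation(uniform_p))
```

For the global Wald tests, which have more than one degree of freedom, the expected median would need to be replaced by the median of the corresponding chi-square distribution.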
Figure 5.3: Q-Q plot for each of the four tests of interest in the Raine Study GWAS
Figure 5.4: Plot of standard versus robust test P-Values
5.4.2.2 Regions of Interest
A Manhattan plot is commonly used to summarize the results from a GWAS, whereby the
negative log-transformed P-Values from the association analysis are plotted against the
chromosomal position. The negative log-transformed P-Values are used so that the smallest
P-Values have the largest transformed values and can be easily seen. It is termed a Manhattan
plot as it resembles the Manhattan skyline, where skyscrapers (genome-wide significant loci)
tower over smaller level buildings. Figure 5.5 to Figure 5.8 show the Manhattan plots for each
of the parameters of interest from the GWAS.
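As a small worked example of the transformation used on the vertical axis of a Manhattan plot, the sketch below converts the two thresholds used in this Chapter to the negative log10 scale. The helper name is hypothetical.

```python
import math

def neg_log10(p):
    """Negative base-10 logarithm of a P-Value, as plotted on a Manhattan plot."""
    return -math.log10(p)

GENOME_WIDE = 5e-8    # usual genome-wide significance threshold
SUGGESTIVE = 6.09e-6  # suggestive threshold adopted in this study

print(round(neg_log10(GENOME_WIDE), 2))
print(round(neg_log10(SUGGESTIVE), 2))
```

A SNP therefore appears above the genome-wide significance line when its transformed value exceeds roughly 7.3.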
Figure 5.5 is the Manhattan plot for the global SNP effect using the Wald test. As seen in Figure
5.3, this test has inflated type 1 error, so the significance of the association needs to be treated
with caution. Due to this inflation, SNPs were investigated further if they had a genome-wide
significant P-Value of less than 5x10-8. There are six SNPs that reached this threshold; five
are in the KCNJ15 gene on chromosome 21 and one is on chromosome 2. Although it
does not meet the strict genome-wide significance threshold, there is another region of interest
on chromosome 13; a group of SNPs near the MTIF3 gene, with the P-Value of the most
significant SNP being 9.81x10-8, an area that has been shown to be associated with adult BMI
[72].
Similar results are seen for the global SNP by age interaction term (Figure 5.6). There are two
SNPs that reach genome-wide significance, one in KCNJ15 on chromosome 21 and one on
chromosome 2. There are an additional 11 SNPs that have a P-Value between 5x10-8 and 1x10-7;
these include an additional six SNPs in KCNJ15, three SNPs in MTIF3 and two SNPs in the CADM
gene on chromosome 3. Interestingly, none of the SNP or SNP by age fixed effects were
significant for the two CADM SNPs, and therefore these were not investigated any further.
No SNPs reached genome-wide significance for the fixed effect estimate for the SNP main
effect (the SNP by spline function is in the model but not taken into account here) (Figure 5.7).
This was expected as there was no inflation detected and the sample size is relatively small to
detect small genetic effects. Therefore, regions with suggestive evidence as outlined by Duggal
et al [265] with a SNP with P-Value < 6.09x10-6 are discussed; these included two SNPs in the
KDR gene on chromosome 4, four SNPs on chromosome 5 in an intergenic region, one SNP in
the CPLX2 gene on chromosome 5, one SNP in an intergenic region on chromosome 14 and
one SNP in the CACNG3 gene on chromosome 16.
Finally, Figure 5.8 shows the Manhattan plot for the SNP by linear age effect. Due to the slight
inflation remaining with the robust test, regions were selected that had at least two SNPs with
suggestive evidence for association at P-Value < 6.09x10-6; this included two SNPs in the
HS1BP3 gene on chromosome 2, five SNPs upstream from the GRM7 gene on chromosome 3,
three SNPs in the ACPL2 gene on chromosome 3, seven SNPs crossing multiple genes on
chromosome 4 including the USP53 and C4orf3 genes, six SNPs in the MIR4500HG gene on
chromosome 13, two SNPs in an intergenic region of chromosome 14, three SNPs on
chromosome 15, 16 SNPs in the KCNJ15 gene on chromosome 21 and 10 SNPs in the
TBC1D22A gene on chromosome 22.
The KCNJ15 locus, rs2008580, reached the significance thresholds for three out of four of the
parameters of interest. This locus is located in an intron of the potassium inwardly-rectifying
channel, subfamily J, member 15 gene. The gene has previously been shown to be associated
with increased risk of type 2 diabetes in a Japanese population [291], however the association
was found with a different primary SNP. Okamoto et al found that the association with this
locus was more prominent in lean, rather than obese, individuals [291]. Therefore, it could be
the growth pattern during childhood and adolescence that is influencing these individuals' type
2 diabetes risk, rather than their final BMI, as is more common in European populations.
This therefore provides additional evidence that this gene could be a likely candidate for
growth trajectories and was hence taken forward for replication in the ALSPAC and NFBC66
cohorts.
Figure 5.5: Manhattan plot of the P-Values from the global SNP effect (Wald test) for BMI
trajectory in the Raine Study. The red line indicates the genome-wide significance level. The
most significant genetic variant is in the KCNJ15 gene on chromosome 21.
Figure 5.6: Manhattan plot of the P-Values from the global SNP by age effect (Wald test) for
BMI trajectory in the Raine Study. The red line indicates the genome-wide significance level.
The most significant genetic variant is in an intergenic region on chromosome 2.
Figure 5.7: Manhattan plot of the P-Values from the SNP main effect for BMI trajectory in
the Raine Study. The red line indicates the genome-wide significance level. The most
significant genetic variant is in an intergenic region on chromosome 14.
Figure 5.8: Manhattan plot of the P-Values from the SNP by linear age effect for BMI
trajectory in the Raine Study. The red line indicates the genome-wide significance level. The
most significant genetic variant is upstream from the GRM7 gene on chromosome 3.
5.4.3 Characterising the Findings of the KCNJ15 Gene
Iwasaki et al identified several regions through linkage analysis that were associated with type
2 diabetes in a Japanese population; the largest LOD score was 1.92 for a region on
chromosome 21q [292]. Okamoto et al conducted an association study on this region to
attempt to localize the causal locus [291]. The first stage of their analysis included a case-
control association analysis using pooled DNA to narrow down the 9Mbp region to several loci;
these few loci were subsequently sequenced using individual level genotyping. This association
analysis revealed that the G allele (minor allele) in rs743296 was associated with a decreased
risk of type 2 diabetes. They went on to sequence the exons and promoter region of the gene
in healthy, unrelated subjects to find that both rs743296 and a SNP on exon 4, rs3746876,
were associated with type 2 diabetes risk; the minor allele of rs3746876 increasing the risk of
disease. Unfortunately, their strongest association, rs3746876, has a MAF of less than 1% in
European populations and is not in the HapMap data or on any of the common GWAS chips.
However, rs743296 was imputed in the Raine Study GWAS and showed suggestive evidence
for association with BMI trajectory using the global Wald test (P-Value=3.8x10-6) and the Wald
test for SNP by age interaction (P-Value=2.3x10-6). Okamoto et al conclude that the rs3746876 SNP
affects type 2 diabetes risk in lean individuals; however, the ‘lean’ cases have a lower BMI than
their control group, which might indicate that this SNP is also acting on type 2 diabetes risk
through BMI. Finally, they conducted functional analysis on the protein level of Kcnj15 and
found that overexpression of Kcnj15 decreased insulin secretion at high levels of glucose,
but did not change insulin secretion under normoglycemic conditions [293].
The evidence produced by the above studies [291,292,293] indicates that the KCNJ15 gene may
have pleiotropic effects, influencing multiple outcome measures related to insulin levels. The
strongest signal from the Raine Study GWAS, rs2008580, was therefore investigated in consortium
meta-analyses of other genome-wide studies and in other phenotypes within the Raine Study. The
rs2008580 T allele, which was associated with increased average BMI and increased rate of
BMI growth, is the minor allele with frequency 0.24 in the Raine Study. It was not associated
with birth weight (P-Value=0.12) or birth length (P-Value=0.11) in the Raine Study, which was
consistent with the EGG Consortium meta-analysis of birth weight (P-Value=0.28) [294].
Interestingly, the T allele of the rs2008580 SNP was associated with increased birth weight
(β=57.89g, P-Value=0.03) and birth length (β=0.27cm, P-Value=0.02) in males but not females
(Pbirth weight=0.79, Pbirth length=0.57). The genetic effect is detectable from about 5.5 years of age,
and the T allele is associated with increased BMI at the adiposity rebound (β=0.11, P-Value=
0.02) but not age at the adiposity rebound (P-Value=0.06). The SNP was not associated with
age of menarche in the girls in the Raine Study (P-Value=0.16).
The association between the rs2008580 locus and BMI growth was driven by an association
with change in weight (Wald P-Value=2.9x10-6), rather than a change in height (Wald P-
Value=0.11). The lack of association with height was also seen in the GIANT consortium meta-
analysis of height in adults (P-Value=0.27) [295], whereas the meta-analysis in the same
consortium for BMI showed rs2008580 approaching significance (P-Value=0.06) [72]. The T
allele was associated with higher average weight at age eight years (β=0.04, P-Value=9.5x10-6)
and change in weight over childhood and adolescence (Wald P-Value=0.0003). The BMI growth
trajectories for each of the three genotypes in rs2008580 in females and males are displayed in
Figure 5.9. The association between BMI growth and rs2008580 was stronger in females (Wald
P-Value=1.3x10-5) than males (Wald P-Value=0.002). The association in females seemed to be
driven by the change in BMI (PSNP=0.01, Wald PSNP*age=5.2x10-6) whereas the male association
was driven by the average BMI at age eight (PSNP=4.3x10-4, Wald PSNP*age=0.01). A similar
pattern of association was observed between weight growth and rs2008580; the overall
association was similar in females (Wald P-Value=0.002) and males (Wald P-Value=0.0006),
however the female association seemed to be driven by the change in weight (PSNP=0.03, Wald
PSNP*age=0.001) whereas the male association was driven by the average weight at age eight
(PSNP=7.8x10-5, Wald PSNP*age=0.03).
Given the previous associations observed by Okamoto et al [291] with type 2 diabetes, fasting
insulin and glucose levels were also analysed using linear regression. Although the rs2008580 T
allele was not associated with fasting glucose levels in either the Raine Study at age 17 (P-
Value=0.35) or in the Meta-Analysis of Glucose and Insulin-related traits Consortium (MAGIC)
meta-analysis (P-Value=0.16) [296], it was associated with increased fasting insulin (β=0.011,
P-Value=0.01), HOMA-β (β=0.0086, P-Value=0.03) and HOMA-IR (β=0.011, P-Value=0.01) levels
in the MAGIC consortium meta-analysis [296]. In addition, SNPs in high LD with rs2008580 in
the KCNJ15 gene were significantly associated with increased risk of type 2 diabetes in the
DIAGRAM consortium (rs6517456, [LD statistics with rs2008580: r2= 0.96, D’=1] OR=1.04, P-
Value=0.048) [297].
Figure 5.9: BMI trajectories for females (left) and males (right) for each of the KCNJ15,
rs2008580 alleles.
According to the SNPInfo database [298], the rs2008580 locus is a transcription factor binding
site (TFBS). SNPs in TFBS regions have been shown in experimental studies to lead to
differences in transcription factor binding between individuals [299], indicating that SNPs
within TFBS regions are more likely to play a biological role than other SNPs in the associated
region without evidence of overlap with any functional data [300]. This SNP is therefore
believed to have a functional role within the KCNJ15 gene.
5.4.4 Results from Replication and Meta-Analysis
The most significant SNP from the GWAS analysis in the Raine Study, rs2008580, was not
significantly associated with BMI or BMI trajectory in either ALSPAC (P-ValueWald=0.9476, P-
ValueWald(SNP*Age)=0.9044, P-ValueSNP=0.7919, P-ValueSNP*Age=0.4106) or NFBC66 (P-
ValueWald=0.8725, P-ValueWald(SNP*Age)=0.8817, P-ValueSNP=0.9175, P-ValueSNP*Age=0.8419).
Four hundred and thirty three SNPs were included in the meta-analysis; as mentioned in
section 5.3.5, SNPs were included in the meta-analysis if they were within 250kb of the
rs2008580 variant, had a MAF greater than or equal to 5% and an R2 greater than or equal to
0.3 or info greater than or equal to 0.4. Although none of the SNPs met the genome-wide
significance threshold of < 5x10-8, 26 SNPs reached a Bonferroni corrected P-Value of 0.0001
for the region (i.e. 0.05/433) in the meta-analysis of the global Wald test; no SNPs reached a
Bonferroni corrected P-Value for the Wald test for the SNP by age interaction. The most
significant SNP from the meta-analysis for the global Wald test for the overall SNP effect was
rs2836241. As can be seen in Figure 5.10, the SNPs in high LD with rs2836241 spread across
the Down syndrome critical region genes 4, 8 and 10 (DSCR4, DSCR8 and DSCR10). This region
across the DSCR genes appears to have a greater effect on the average BMI effect at age eight
than the linear trajectory over childhood (Figure 5.11); although, the most significant SNPs for
the SNP main effect (rs2836444 P-Value=0.0220) and SNP by linear age interaction (rs2836335
P-Value=0.0043) did not reach the Bonferroni corrected P-Value.
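The regional Bonferroni threshold quoted above follows directly from dividing the nominal significance level by the number of SNPs tested in the region. The short sketch below reproduces the arithmetic and compares it with the two smallest P-Values reported for the fixed effects; the function name is hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold controlling the family-wise error rate."""
    return alpha / n_tests

threshold = bonferroni_threshold(0.05, 433)  # 433 SNPs in the meta-analysis
print(threshold)

# Smallest reported P-Values for the SNP main effect and SNP by linear age terms.
for snp, p in [("rs2836444", 0.0220), ("rs2836335", 0.0043)]:
    print(snp, p < threshold)
```

Both comparisons are False, in line with neither SNP reaching the corrected threshold.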
Figure 5.10: Regional plot of (A) global Wald P-Values for the overall SNP effect and (B) Wald
P-Values for the SNP by age effect as a function of genomic position (NCBI Build 36) from the
meta-analysis of ALSPAC and NFBC66 for KCNJ15 gene region. In each plot, the meta-analysis
P-Value for rs2836241 is denoted by a purple diamond; all other analysed SNPs are
represented by a circle. Local LD structure is reflected by the plotted estimated recombination
rates (taken from HapMap). The colour scheme of the circles respects LD patterns (HapMap
CEU pairwise r2 correlation coefficients) between rs2836241 and surrounding variants. Gene
annotations were taken from the University of California Santa Cruz genome browser.
Figure 5.11: Regional plot of (A) P-Values for the SNP main effect at age eight and (B)
P-Values for the SNP by linear age effect as a function of genomic position (NCBI Build 36)
from the meta-analysis of ALSPAC and NFBC66 for KCNJ15 gene region. In each plot, the
meta-analysis P-Value for rs2836241 is denoted by a purple diamond; all other analysed
SNPs are represented by a circle. Local LD structure is reflected by the plotted estimated
recombination rates (taken from HapMap). The colour scheme of the circles respects LD
patterns (HapMap CEU pairwise r2 correlation coefficients) between rs2836241 and
surrounding variants. Gene annotations were taken from the University of California Santa
Cruz genome browser.
5.4.5 Results from Pathway Analysis
Pathway analysis in i-GSEA4GWAS uses a false discovery rate (FDR) adjustment for multiple
testing; therefore, significant pathways are determined by a q-value<0.05. There were no
significant pathways using SNPs that were independent (SNPs with LD>0.2 were excluded);
that is, all pathways analysed had an FDR q-value greater than 0.05. For the SNP main effect,
one pathway reached significance using the classical P-Value but not after adjusting for
multiple testing; it is presented in Table 5.2 as an exploratory finding.
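To illustrate how a pathway can be significant on its nominal P-Value yet not after FDR adjustment, the sketch below applies the Benjamini-Hochberg procedure to a set of hypothetical P-Values. This is a stand-in for illustration only: i-GSEA4GWAS computes its own FDR, and its method is not assumed to be Benjamini-Hochberg.

```python
def bh_q_values(p_values):
    """Benjamini-Hochberg adjusted P-Values (q-values), in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-Value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q

# Hypothetical P-Values for three pathways.
print(bh_q_values([0.001, 0.5, 0.9]))
```

With a few hundred pathways tested, a nominal P-Value of 0.001 can readily correspond to an adjusted value well above 0.05.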
Table 5.2: Results from the pathway analysis using SNPs not in LD
Pathway: HSA00260 GLYCINE SERINE AND THREONINE METABOLISM
Description: Genes involved in glycine, serine and threonine metabolism
P-Value: 0.001
FDR q-value: 0.24
Significant genes/Selected genes/All genes: 12/28/45
Parameter: SNP main effect
5.5 Discussion
In this study, a GWAS for BMI trajectory across childhood and adolescence was conducted in
the Raine Study, with replication of the most significant region in the ALSPAC and NFBC66
cohorts. The results show that the KCNJ15 gene is associated with BMI trajectory over this
time period. The association of the most significant SNP in the Raine Study, rs2008580, is
driven by a faster rate of change in BMI over time in females but an increase in average BMI in
males. In addition, this association appears to be driven by a change in weight, rather than a change in
height, therefore indicating an influence on adiposity rather than skeletal growth. The SNP
does not affect birth weight, which is consistent with other studies of established BMI loci
[301,302]. The rs2008580 SNP is intronic, which, according to Hindorff et al, is consistent with
45% of the loci discovered in GWASs [47]; it is also at a transcription factor binding site. An
association was also found between the KCNJ15 locus and increased fasting insulin levels but not
with glucose, which is consistent with the conclusion made by Okamoto [293] that inactivation
of Kcnj15 leads to increased insulin secretion. Unfortunately, this specific SNP did not replicate
in the ALSPAC and NFBC66 cohorts.
There are several possible reasons explaining why the rs2008580 SNP association with BMI
trajectory in the Raine Study was not replicated in the ALSPAC and NFBC66 cohorts, including:
1) the association observed in the Raine Study was spurious; 2) the differences in the growth
patterns between the three cohorts seen in Section 5.4.1 are in part due to different genetic
profiles; 3) there are different environmental stimuli in each of the cohorts interacting with the
genetic variants; 4) differences in LD patterns between the cohorts due to ethnic origin.
Palmer and Cardon [20] state that “For most complex human diseases, the reality of multiple
disease-predisposing genes of modest individual effect, gene-gene interactions, gene-
environment interactions, heterogeneity of both genetic and environmental determinants of
disease and low statistical power mean that both initial detection and replication will likely
remain difficult.” Although the rs2008580 SNP did not replicate in the two European cohorts, it
appears that the region on chromosome 21 containing the KCNJ15 gene may still be of interest
for association with BMI trajectory across childhood; there are several SNPs in high LD in the
KCNJ15 gene, and nearby genes, which were highlighted in the global Wald tests.
As outlined in Section 5.4.3, the KCNJ15 gene influences insulin secretion, therefore potentially
having effects on both risk of type 2 diabetes and increased BMI. It is hypothesised that the
KCNJ15 gene shows similarities with the discovery of the FTO gene but for childhood growth.
The FTO gene was first discovered in a GWAS of type 2 diabetes. When additional analysis was
conducted, it was found that the association between FTO and diabetes risk was purely due to
an association between FTO and BMI [174]. To a lesser extent, the KCNJ15 gene might be
similar to this, but rather than acting on diabetes risk through final BMI it influences both
childhood BMI (and weight) growth and diabetes risk through its effect on insulin. A study by
Bhargava et al shows that adults with impaired glucose tolerance or type 2 diabetes are not
only obese in adulthood but also have increased rate of growth from age 2 years onwards
[134]. It is therefore plausible that a genetic variant influencing rate of BMI growth could
ultimately influence risk of type 2 diabetes. The same pattern of growth was shown by Barker
et al for increased risk of cardiovascular events [303]. If true, this would nicely support the
early life approaches to adult disease, whereby the combination of inherited genetic variants
and adverse early life exposures lead to an increased risk of diabetes in later life. The early life
exposure could be the less optimal growth trajectory itself, or some exposure(s) that influence
the growth trajectory. Unfortunately, the individuals in the Raine Study are too young to fully
test this hypothesis at present.
SNPs in the MTIF3 gene had P-Values just below the suggestive levels of significance for three
out of four of the parameters of interest; however it is worth mentioning as it has previously
been shown to be associated with adult BMI [72]. The most significant SNP from this gene in
the current study showed a stronger association with average BMI at age eight than BMI
trajectory, consistent with knowledge that BMI tracks throughout life. It was promising to see
that the GWAS in a relatively small cohort was able to detect both novel and previously known
genetic regions.
Some of the other regions that reached significance thresholds for only one of the test
statistics also have plausible mechanisms for influencing BMI growth trajectory. For example,
the locus on chromosome 2 that was significant for the two global tests is in a region of the
chromosome that is also associated with waist-hip ratio in adults in the National Institutes of
Health (NIH) database of Genotypes and Phenotypes (dbGaP). The HS1BP3 gene, which
showed suggestive evidence for association with the SNP by age parameter, encodes a protein
that may be involved in lymphocyte activation, and the SNP with strongest association in the
Raine Study was approaching significance in the GIANT consortium meta-analysis of BMI (P-
Value=0.075) [72]. A gene downstream from the significant finding in the ACPL2 gene on
chromosome 3 was previously shown to be associated with height in both a European
[304] and Korean study [305]. Therefore, this region might be a marker of height growth
throughout childhood, which influences BMI through skeletal growth rather than changes in
adiposity. Ten SNPs in the TBC1D22A gene reached the significance threshold, a gene which
has previously been associated with longevity [306] and metabolism in dbGaP. As outlined,
there are several interesting genes and genetic regions that have arisen from this GWAS of
BMI trajectory; however, until the loci have been replicated in independent studies, false
positives cannot be ruled out with certainty.
Given the FTO gene is the most commonly replicated gene for BMI in adulthood, and also
associated with childhood BMI growth [251], it was surprising that it was not one of the top
regions on any of the parameters in this study. However, analysis in Chapter 2 indicated that
the FTO locus, rs1121980, is significant in males but not females. The GWAS analysis presented
in this Chapter shows that the rs1121980 FTO SNP is significant when using a significance
threshold of P-Value<0.05, which does not adjust for multiple testing (P-Valuewald=0.005,
P-Valuewald_int=0.003, P-ValueSNP=0.0001, P-ValueSNP*age=0.006), but due to the lack of association
in females this did not reach the genome-wide significance threshold. This is one of the
limitations of including an age by sex interaction rather than stratifying by sex.
5.5.1 Role of KCNJ15 Gene and Nearby Genes on Chromosome 21
It is unknown how the KCNJ15 gene, or nearby genes on chromosome 21, affect BMI growth
throughout childhood and adolescence. Three plausible pathways in which the KCNJ15 gene
may affect BMI growth have been identified in a literature review, which will be outlined
below.
As previously mentioned, the KCNJ15 gene is a potassium inwardly-rectifying channel that is
expressed in the pancreas. The potassium channel sits on the surface of the β-cells of the islets
of Langerhans in the pancreas and regulates depolarisation of the cell membrane, which in
turn regulates how much insulin is produced by the cell. Okamoto et al suggest that the
KCNJ15 protein, similar to KCNJ11, inhibits depolarization of the pancreatic β-cell membrane
by maintaining the resting membrane potential and therefore negatively regulating insulin
secretion [291]. This was also seen in their functional studies whereby “increased plasma
glucose induced KCNJ15 expression at a transcriptional level and inhibited insulin secretion
through the KCNJ15 channel” [293]. This functional information coupled with the results from
the MAGIC consortium indicates that the rs2008580 SNP, or a SNP in LD with it, plays a role both in
releasing an abundance of insulin from the pancreas and in making other cells resistant to
the uptake of insulin, leaving the levels of fasting plasma insulin elevated.
Insulin is central in regulating carbohydrate and fat metabolism in the body. When insulin is
released from the pancreas it signals for the normal release of fatty acids from the adipose
tissue to be shut down, in conjunction with increasing the uptake of fatty acids. Insulin
prioritises the processing of carbohydrates and proteins that enter the body during a meal
over stored fats, and when these are processed it returns to burning stored fats. Adipocytes,
the cells that primarily compose adipose tissue, are responsible for the production of the
hormone leptin. Leptin, discovered in 1994, is important in the regulation of
appetite and acts as a satiety factor [307]. Lustig showed that “insulin and leptin share a
common central signalling pathway, and it seems that insulin functions as an endogenous
leptin antagonist” [308]. Therefore, it is biologically plausible that a gene regulating the release
of insulin influences BMI and weight growth.
Several of the type 2 diabetes associated genes are in or close to genes which are expressed in
the pancreas, with many shown to be associated with reduced beta-cell function; whereas,
genes associated with obesity or BMI are expressed in the brain, particularly the
hypothalamus, with some genes in the leptin signalling pathway [309,310]. Given the previous
association between KCNJ15 and type 2 diabetes, it may be that its effect on growth is highlighting
the onset of diabetes risk; in contrast, the KCNJ15 gene may be indicating another pathway, in
addition to that through the brain, to the development of obesity.
Along with the KCNJ15 gene’s influence on insulin levels, it is also located in the Down
syndrome critical region 1 of chromosome 21 [311]. Down syndrome is characterized by
various complex phenotypes, including dysmorphic features, hypotonia, developmental
abnormalities, deficiencies of the immune system and mental retardation [312]. These
phenotypes vary greatly between individuals with Down syndrome, indicating that multiple
genes are involved in its pathogenesis. Individuals who are trisomic for only part of
chromosome 21 and who share the same subset of phenotypes, have been used to define
Down syndrome critical regions [313]. An overdose of genes in these regions appears to
constitute the complex phenotypes of Down syndrome. This critical region contains not only
the KCNJ15 gene, but also the DSCR4 and DSCR8 genes [314] which were significantly
associated with BMI trajectory in the meta-analysis. Gosset et al indicate that the KCNJ15 gene
is expressed in the kidney and the brain [311], which is consistent with Okamoto et al, who also
report KCNJ15 expression in the kidney [293], and with other BMI associated genes which are
reported to be expressed in the brain [309,310]. Due to the various levels of expression during
foetal development and in adulthood, Gosset et al acknowledge that overexpression of the
KCNJ15 gene may have pleiotropic effects on organ function and Down syndrome phenotypes
[311].
Thirdly, according to the Kyoto Encyclopaedia of Genes and Genomes (KEGG) database, the
KCNJ15 gene belongs to the gastric acid secretion pathway in the Organismal
Systems/Digestive system class. Gastric acid is a key factor in normal upper gastrointestinal
functions, including protein digestion and absorption of both calcium and iron. It also provides
some protection against bacterial infections.
Further functional work is required to determine which of these possible pathways, if any, is
responsible for the observed association between the KCNJ15 gene and BMI growth over
childhood and adolescence.
5.6 Challenges and Future Research
As observed with the chromosome-wide analysis in ALSPAC in Chapter 4, the λ for the SNP by
age term remains slightly inflated. One option to reduce the remaining inflation is to use a
genomic control adjustment [275]. Genomic control assumes that the inflation factor is
constant for all markers across the genome; this works well for inflation due to cryptic
relatedness, which is what genomic control was designed for, as the inflation is based on the
kinship matrix, which is independent of any particular marker. The robust standard errors are not
inflated by the same amount for each SNP, indicating that the initial inflation
is probably not constant. It may be possible that any remaining inflation after using the robust
standard error is constant; although this would need to be shown before the approach is
adopted for GWAS analysis.
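The constant-inflation assumption behind genomic control can be made concrete with a minimal sketch. It assumes 1-df chi-square test statistics and the usual median-based estimate of λ; the function names and the example statistics are hypothetical and are not taken from the analysis in this thesis.

```python
from statistics import median

# Median of a chi-square distribution with 1 degree of freedom
CHI2_1DF_MEDIAN = 0.4549364

def genomic_inflation(chi2_stats):
    """Estimate lambda as the observed median statistic over the expected median."""
    return median(chi2_stats) / CHI2_1DF_MEDIAN

def genomic_control(chi2_stats):
    """Divide every statistic by one shared lambda (never deflating when lambda < 1)."""
    lam = max(genomic_inflation(chi2_stats), 1.0)
    return [x / lam for x in chi2_stats], lam

# Hypothetical 1-df test statistics for the SNP by age term
stats = [0.02, 0.15, 0.50, 0.60, 1.10, 2.30, 4.10]
adjusted, lam = genomic_control(stats)
print(round(lam, 3))
```

Because every statistic is divided by the same λ, the adjustment is only appropriate if the inflation is the same at every marker, which is the condition that would need to be shown for the robust standard errors.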
Although computational time was taken into account in Chapter 2 when
selecting the optimal model, the SPLMM for each SNP is relatively computationally
intensive, much more so than the cross-sectional models commonly used for GWASs. In
the larger cohorts, the model for one SNP can take several minutes to run due to the
complexities of the correlation structure and spline function being fit. The original aim of this
chapter was to include the meta-analysis of GWAS results for all three cohorts: the Raine
Study, ALSPAC and NFBC66. The Raine Study GWAS took approximately 2 months to complete for
all 22 autosomes using up to 145 nodes on a supercomputer (http://www.ivec.org/). The
ALSPAC GWAS was estimated to take two and a half years for the imputed data as each of
their models takes approximately 10 minutes and they have approximately 20 nodes to use.
Finally, the NFBC66 GWAS was estimated to take 10-20 years for the imputed data as each
model takes 3-4 minutes to run and they only have one node to use. These timelines are clearly not
feasible for inclusion in this thesis, or for a manuscript. It was therefore
decided that the full GWAS would be conducted in the Raine Study only for this thesis and the
other two cohorts would be used for replication. In the meantime, the R code used for
calculating the robust standard error has been rewritten in C++, which is more time
efficient and reduces the computational time for the robust standard error calculation in
NFBC66 from several minutes to less than a second. The analysts for this project at ALSPAC and
NFBC66 have also had several discussions with the high performance computing teams at their
universities to determine a computationally efficient procedure for running the required
scripts. This chapter has shown that although the most computationally feasible model was
chosen, which could incorporate all the complexities in the data, GWAS analysis in the
timeframe of this thesis was still not possible in large cohort studies; however, these small
changes to the analytic scripts and how they are run on the computing facilities at the
universities will allow the analysis to be conducted for a future publication. Once the meta-
analysis of these three cohorts has been conducted, replication analysis of the most significant
findings will be undertaken in additional cohorts in the EGG Consortium. The project has been
approved by this Consortium, and the current analytic plan has been circulated to the relevant
cohorts.
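The run-time estimates quoted in this section follow from simple arithmetic, sketched below. The per-model times and node counts are those given in the text; the number of imputed SNPs is not stated here, so the value of 2.5 million is an assumption made only for illustration, as is the idea that models are spread evenly over the nodes.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60

def gwas_years(n_snps, minutes_per_model, n_nodes):
    """Wall-clock years to fit one model per SNP, spread evenly over the nodes."""
    return n_snps * minutes_per_model / n_nodes / MINUTES_PER_YEAR

N_SNPS = 2_500_000  # assumed number of imputed SNPs; not stated in the text

alspac = gwas_years(N_SNPS, 10, 20)    # 10 minutes per model, 20 nodes
nfbc_low = gwas_years(N_SNPS, 3, 1)    # 3 minutes per model, one node
nfbc_high = gwas_years(N_SNPS, 4, 1)   # 4 minutes per model, one node

print(round(alspac, 1), round(nfbc_low, 1), round(nfbc_high, 1))
```

Under this assumed SNP count the figures fall close to the two and a half years and 10-20 years estimated for ALSPAC and NFBC66 respectively.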
GWASs are designed to identify regions of interest for further follow-up; they generally do not
have the ability to identify the causal locus, or in some cases the causal gene, of a given
genetic signal. This can be seen in the results presented in this Chapter, whereby the KCNJ15
gene and several genes upstream (DSCR4, DSCR8 and DSCR10) were found to be associated
with BMI trajectory across childhood using the two global Wald tests. Therefore, sequencing
this region would be required to determine the causal locus.
5.7 Conclusion
In conclusion, a genome-wide association study of BMI trajectory across childhood was
conducted in the Raine Study with replication in two large European cohorts and found a
significant association with the KCNJ15 gene. The KCNJ15 gene has previously been linked to
increased risk of type 2 diabetes, increased levels of insulin and insulin resistance. The
association with the rs2008580 SNP appears to be driven by a change in weight, rather than a change in height,
hence indicating that it is influencing adiposity rather than skeletal growth, which is consistent
with increased levels of circulating insulin. Therefore, through the development of appropriate
longitudinal models for BMI trajectories, a novel biologically plausible gene for BMI trajectory
over childhood has been discovered.
Chapter 6: Association Of A Genetic Risk Score With Longitudinal BMI In Children
6.1 Introduction
The results presented in this chapter have been published [315]; the manuscript is included as
an appendix (Appendix E).
The final area of interest was to investigate how genetic variants known to be associated with
adult BMI influence growth over childhood and adolescence (including BMI, height and weight)
and related growth parameters (including age and BMI at both the adiposity peak and
rebound). Using the statistical methods described in the previous chapters, I analyse the
association between 32 adult BMI associated genetic variants and growth over childhood and
adolescence.
6.2 Background
Recent GWASs have begun to uncover genetic loci contributing to increases in BMI in
adulthood [72,174,175,178,179,180,182]. The largest genome-wide meta-analysis of BMI
published to date included 249,796 individuals from the GIANT Consortium, and confirmed
14 previously-reported loci and identified 18 novel loci for BMI [72]. There has been one GWAS
to date that has focused on a dichotomous indicator of childhood obesity [183], but none
looking at BMI on a continuous scale in childhood.
Once adult height is attained, changes in BMI are largely driven by changes in weight. In
contrast, during childhood and adolescence, changes in BMI are influenced by both changes in
height and weight. Therefore, genetic variants that affect adult BMI may influence change in
weight, height or both during childhood. Previous studies of adult BMI SNPs in relation to
infant and child change in growth (BMI, weight and height) have shown little evidence of an
association with birth weight [206,209,316], but have shown evidence that these loci are
associated with more rapid height and weight gain in infancy [206,209], and higher BMI and
odds of obesity at multiple ages across the life course [205,206,209,251,316], with the
magnitude of associations not being constant across all ages. Rates of change in height, weight
and BMI and other features of a child’s growth are influenced by a combination of genetic and
environmental factors. These act interactively to shape developmental milestones, including
the adiposity peak at approximately nine months of age [129], adiposity rebound at around
the age of 5-6 years, and puberty between 10 and 13 years of age [156,251,317,318]. Early age
at adiposity rebound is associated with greater risk of diabetes [134,135], hypertension [136]
and obesity [132,133] in adulthood. Sovio et al [251] and Belsky et al [209] have recently
shown that SNPs associated with adult BMI are also associated with earlier age and higher BMI
at adiposity rebound. However, relatively little is known about the association of the timing of
the adiposity peak with disease later in life; Silverwood et al [128] and Sovio et al [129] have
both shown that a delayed adiposity peak is associated with increased BMI later in life.
Understanding how genetic loci are associated with BMI and other anthropometric measures
differentially across the life course may shed light on the biological pathways involved, as well
as providing insight into the development of obesity that may be useful in the design of
interventions.
6.2.1 Aims
To date, there has been no comprehensive study of how the full set of genetic variants currently
known to be associated with adult BMI influences growth over childhood and adolescence and growth
parameters (age and BMI at the adiposity peak and rebound). One of the limitations of
previous studies is that they have not stratified by sex, despite some evidence that sex-specific
differences in body composition may be partly due to genetics [319,320], with different
(possibly partially overlapping) genes contributing to variation in body shape in men and
women. Therefore, the two aims for this Chapter are:
1) To examine the association of alleles at 32 loci identified in the recent GWAS of BMI in
European adults [72], in combination as an allelic score, with BMI, weight and height
growth trajectories from age 1 to 17 in two birth cohorts; in addition, to investigate
how early in life a genetic effect can be detected, and to explore genetic influences on
several aspects of the growth trajectories, i.e. age and BMI at adiposity peak in infancy
and adiposity rebound in childhood.
2) To assess whether there are differences in the associations between BMI and the 32
individual genetic loci between males and females.
6.3 Subjects and Materials
6.3.1 Study Populations
Both cohorts are described in detail in Chapter 1, Section 1.6. The subsets used in this analysis
are described below.
ALSPAC: A subset of 7,868 individuals was used for analysis in this study using the following
inclusion criteria: at least one parent of European descent, live singleton birth, unrelated to
anyone in the sample (one of every related pair was selected at random), no major congenital
anomalies, genotype data and at least one measure of BMI throughout childhood. BMI was
calculated from the weight and height measurements (median nine measures per person, IQR:
5-12, range 1-29 measurements), with a total of 68,862 BMI measures.
Raine Study: A subset of 1,460 individuals was used for analysis in this study using the same
criteria as in ALSPAC. BMI was calculated from the weight and height measurements (median
six measures per person, IQR: 5-7, range 1-8 measurements), with a total of 8,687 BMI
measures.
6.3.2 SNP Selection and Allelic Score
Speliotes et al [72] reported 32 variants to be associated with BMI. In addition, Belsky et al
[209] selected a tag SNP from each LD block that had previously been shown to be associated
with BMI-related traits. The 32 SNPs from either of these two manuscripts were selected; SNPs
reported in these two manuscripts that were within the genes of interest were all in high LD
(r²>0.75). All 32 SNPs were well imputed (all R² for imputation quality > 0.7, with average of
0.981); therefore, the dosages were extracted from the imputed data (i.e. the estimated
number of BMI-increasing alleles as described in Chapter 1, Section 1.4.3) for these 32 SNPs in
both ALSPAC and the Raine Study. Each SNP was incorporated into the BMI trajectory model
independently assuming an additive genetic effect for the BMI-increasing allele. In addition, an
‘allelic score’ was created by summing the dosages for the BMI-increasing alleles across all 32
SNPs [231]. Finally, a sensitivity analysis was conducted whereby the alleles were weighted by
the published effect size for adult BMI (Table 6.2). Both the weighted and unweighted allelic
scores were standardized to have a mean of zero and standard deviation of one to allow for
comparison.
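The construction of the allelic score described above can be sketched as follows. This is an illustration only, not the code used for the analysis: the function names and the dosages are hypothetical, and only three SNPs are shown, using the published effect sizes listed for NEGR1, TMEM18 and FTO in Table 6.2 as weights.

```python
from statistics import mean, stdev

def allelic_score(dosages, weights=None):
    """Sum the BMI-increasing allele dosages for one individual, optionally weighted."""
    if weights is None:
        weights = [1.0] * len(dosages)
    return sum(w * d for w, d in zip(weights, dosages))

def standardise(scores):
    """Rescale scores to mean zero and standard deviation one."""
    m, s = mean(scores), stdev(scores)
    return [(x - m) / s for x in scores]

# Hypothetical imputed dosages (0-2) for three SNPs in four individuals
dosages = [
    [0.0, 1.0, 2.0],
    [1.0, 1.0, 1.98],
    [2.0, 0.02, 1.0],
    [0.0, 0.0, 1.0],
]
weights = [0.13, 0.31, 0.39]  # effect sizes for NEGR1, TMEM18 and FTO (Table 6.2)

unweighted = standardise([allelic_score(d) for d in dosages])
weighted = standardise([allelic_score(d, weights) for d in dosages])
print([round(x, 2) for x in unweighted])
```

Standardising both versions puts the weighted and unweighted scores on the same scale, so that their regression coefficients can be compared directly.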
6.4 Statistical Analysis
6.4.1 Longitudinal Modelling and Derivation of Growth Parameters
A SPLMM using smoothing splines to yield a smooth growth curve estimate, as described in
Chapter 2, Chapter 4 and Warrington et al [197], was fit to the BMI, weight and height
measures. The basic model for the jth individual and at the tth time-point is as follows:
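$$
\begin{aligned}
\mathrm{Growth}_{jt} ={}& \beta_0 + \sum_i \beta_i (\mathrm{Age}_{jt} - \overline{\mathrm{Age}})^i + \sum_k \gamma_k \big((\mathrm{Age}_{jt} - \overline{\mathrm{Age}}) - \kappa_k\big)_+^i + \sum_l \beta_l \mathrm{Covariate}_l \\
&+ u_{0j} + \sum_i u_{ij} (\mathrm{Age}_{jt} - \overline{\mathrm{Age}})^i + \sum_k \eta_{kj} \big((\mathrm{Age}_{jt} - \overline{\mathrm{Age}}) - \kappa_k\big)_+^i + \varepsilon_{jt}
\end{aligned}
$$
where Growth is BMI, weight or height, $\overline{\mathrm{Age}}$ is the mean age over the t time points in the
sample (i.e. eight years), $\kappa_k$ is the kth knot and $(t - \kappa_k)_+ = 0$ if $t \le \kappa_k$ and $(t - \kappa_k)$ if $t > \kappa_k$, which is
known as the truncated power basis that ensures smooth continuity between the time
windows, and Covariate are the study specific (time independent) covariates. Three knot points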
were used, placed at two, eight and 12 years, with a cubic slope for each spline in the BMI and
height models [197]. The weight model had the same placement for the knot points but had a
linear spline from 1-2 years, cubic slope for 2-8 years and 8-12 years and finally a quadratic
slope for over 12 years. All models assumed a continuous autoregressive correlation
structure of order 1.
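The truncated power basis used in the model can be illustrated with a minimal sketch of one row of the fixed-effects design matrix for the BMI and height models (cubic terms, knots at two, eight and 12 years, age centred at eight years). The function names are hypothetical, and the sketch assumes the knots are expressed on the same scale as age, so that the centring cancels inside each truncated term; it is not the spida implementation used for the analysis, and the weight model, with a different degree in each window, is not covered.

```python
def truncated_power(x, knot, degree):
    """(x - knot)_+^degree: zero up to the knot, a power of the excess beyond it."""
    excess = x - knot
    return excess ** degree if excess > 0 else 0.0

def spline_row(age, mean_age=8.0, knots=(2.0, 8.0, 12.0), degree=3):
    """Intercept, centred polynomial terms, then one truncated term per knot."""
    centred = age - mean_age
    poly = [centred ** i for i in range(1, degree + 1)]
    trunc = [truncated_power(age, k, degree) for k in knots]
    return [1.0] + poly + trunc

# A measurement at age 10: past the knots at 2 and 8, before the knot at 12
print(spline_row(10.0))
```

Each truncated term is zero before its knot and grows smoothly afterwards, which is what allows the slope to change at the knot without a break in the curve.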
Age and BMI at adiposity rebound were derived by setting the first derivative of the fixed and
random effects from the BMI model between two and eight years of age for each individual to
zero (i.e. the minimum point in the curve). In addition, a second model was fit in ALSPAC only,
between birth and five years to derive the adiposity peak, with each individual required to
have greater than two measures throughout this period, similar to Silverwood et al [128]. BMI
measurements after five years were excluded to avoid the period of adiposity rebound. This
model had knot points at ages one and 2.5 years and a cubic slope between each. Adiposity
peak was derived by setting the first derivative of the fixed and random effects between birth
and 2.5 years to zero. The Raine Study does not have enough repeated measurements
between birth and one year of age to justify calculating the adiposity peak.
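Setting the first derivative of a growth curve to zero to locate the adiposity rebound can be sketched numerically as below. The curve here is a hypothetical quadratic standing in for one individual's predicted BMI trajectory (fixed plus random effects); the function names, the finite-difference derivative and the bisection search are illustrative choices rather than the procedure used in the thesis.

```python
def derivative(f, x, h=1e-5):
    """Central finite-difference approximation to the first derivative."""
    return (f(x + h) - f(x - h)) / (2 * h)

def rebound(f, lo=2.0, hi=8.0, tol=1e-8):
    """Find the age in [lo, hi] where the slope of f crosses zero from below.

    Returns (age, BMI), or None if the slope is not negative at lo and
    positive at hi, i.e. no minimum is bracketed in the window.
    """
    if not (derivative(f, lo) < 0 < derivative(f, hi)):
        return None
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if derivative(f, mid) < 0:
            lo = mid
        else:
            hi = mid
    age = (lo + hi) / 2
    return age, f(age)

# Hypothetical individual curve with its minimum at 5.5 years
def bmi_curve(age):
    return 15.5 + 0.1 * (age - 5.5) ** 2

age_ar, bmi_ar = rebound(bmi_curve)
print(round(age_ar, 2), round(bmi_ar, 2))
```

The same search over birth to 2.5 years, looking for a maximum rather than a minimum, would correspond to the adiposity peak.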
6.4.2 Statistical Analysis
Implausible height, weight and BMI measurements (> 4SD from the mean for sex and age
specific category) were considered as outliers and were recoded to missing.
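A minimal sketch of the outlier rule for a single sex and age specific category is given below. The function name and the BMI values are hypothetical, and the mean and standard deviation are assumed to be computed from the measurements in that category, including the candidate outlier.

```python
from statistics import mean, stdev

def recode_outliers(values, n_sd=4.0):
    """Replace measurements more than n_sd standard deviations from the mean with None."""
    m, s = mean(values), stdev(values)
    return [None if abs(v - m) > n_sd * s else v for v in values]

# Hypothetical BMI values for one category, with one implausible entry
bmi = [15.5, 16.0, 16.5] * 10 + [40.0]
cleaned = recode_outliers(bmi)
print(cleaned.count(None))
```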
Genetic differences in the trajectories were estimated by including an interaction between the
spline function for age and the genetic variants (each SNP [BMI only] and the allelic score). The
association between the genetic variants as an allelic score and weight and length at birth was
analysed using linear regression, adjusting for gestational age at birth. An interaction between
gestational age (as a continuous variable and dichotomised as preterm [<37 weeks] and full
term) and the allelic score was also tested to determine whether associations between BMI-
associated variants and growth differ by gestational age. Linear regression was used to
investigate the associations of the allelic score with age and BMI at both adiposity peak
(ALSPAC only) and adiposity rebound.
As discussed in Chapter 3, the growth data were collected using three measurement sources in
ALSPAC: clinic visits, measurements made during routine health care visits, and parental
reports in questionnaires. Whilst the measurements from routine health care visits have
previously been shown to be accurate in this cohort [190], parent-reported height
tends to be overestimated while weight tends to be underestimated [191], so all analyses of
the trajectories adjusted for a binary indicator of measurement source (parent reports versus
clinic/health care measurements) as a fixed effect in ALSPAC.
Section 1.6 in Chapter 1 showed evidence for population stratification in the Raine Study so all
analyses in the current study are adjusted for the first five principal components. This was not
the case for ALSPAC so no adjustment was made.
Given that growth curves differ greatly between males and females, particularly around
puberty, and because different genes may influence the timing of growth spurts in males and
females, the effect of sex over the time period was investigated. Preliminary analysis showed
evidence for a sex-age interaction in both cohorts (likelihood-ratio test [LRT] P-Values < 0.0001
in both cohorts); therefore, all analyses were stratified by sex.
Given that FTO is the most replicated locus for BMI, with the largest effect size of the
BMI-associated SNPs to date, and has been shown previously to affect childhood growth [205,251],
there is some concern that the results for the allelic score reflect the associations
seen with the FTO locus only. To determine whether this was the case, all of the analyses of
the allelic score were repeated including an adjustment for the FTO locus.
The results from the two cohorts for each of the analyses were meta-analysed. A meta-analysis
was conducted rather than pooling the data as different covariates were necessary for each of
the cohorts. For the allelic score analyses, a fixed-effects inverse-variance weighted meta-
analysis was conducted using the beta coefficients and standard errors from the two studies.
No statistically significant heterogeneity was detected between the cohorts (all P-Values >
0.05), so a random-effects meta-analysis was not warranted. The allelic
score was considered to be statistically significantly associated with the growth parameter if
the P-Value for the meta-analysis was less than 0.05. For the analyses of the individual SNPs
with BMI, a P-Value meta-analysis was conducted on the LRT P-Values from the two studies,
without weighting, and a Bonferroni significance threshold of 0.0016 (0.05/32) was used to
declare a statistically significant association. All analyses were conducted in R version 2.12.1
[222], using the spida library to estimate the spline functions, the rmeta library for the effect-
size meta-analysis and the MADAM library for the P-Value meta-analysis.
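The two meta-analysis approaches can be sketched as follows. The fixed-effects inverse-variance calculation is standard; the inputs are the rounded ALSPAC and Raine Study coefficients for the allelic score on height in females reported in Table 6.3. The combination of P-Values is shown using Fisher's method, which is an assumption about what the unweighted P-Value meta-analysis computes, and the function names are hypothetical rather than those of the rmeta or MADAM libraries.

```python
import math

def inverse_variance_meta(betas, ses):
    """Fixed-effects estimate: weight each study by 1 / SE^2."""
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    return beta, se, (beta - 1.96 * se, beta + 1.96 * se)

def fisher_two_pvalues(p1, p2):
    """Fisher's unweighted combination of two P-Values (assumed method).

    -2*ln(p1*p2) is chi-square with 4 df, whose upper tail at x is
    exp(-x/2) * (1 + x/2), which simplifies to p1*p2 * (1 - ln(p1*p2)).
    """
    prod = p1 * p2
    return prod * (1.0 - math.log(prod))

BONFERRONI = 0.05 / 32  # threshold for the 32 individual SNPs

# Allelic score on height in females: ALSPAC and Raine Study (Table 6.3)
beta, se, ci = inverse_variance_meta([0.088, 0.124], [0.028, 0.059])
print(round(beta, 3), round(ci[0], 3), round(ci[1], 3))
```

The combined estimate agrees with the value reported in Table 6.3 to within the rounding of the inputs.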
6.5 Results
The characteristics of the cohorts are described in Table 6.1. Fifty-one percent of both cohorts
were male. The ALSPAC participants had more BMI measures throughout childhood than the
Raine Study participants, with a median of nine (IQR: 5-12) and six (IQR: 5-7), respectively. The
genetic loci are described in Table 6.2. All of the following results are reported from the meta-
analysis of the two cohorts, unless otherwise specified.
Table 6.1: Phenotypic characteristics of the two birth cohorts used for analysis

| Characteristic | Sex | ALSPAC (n=7,868) N | ALSPAC Mean (SD) | Raine Study (n=1,460) N | Raine Study Mean (SD) |
|---|---|---|---|---|---|
| Sex [% male (N)] | | 7,868 | 51.25% (4,032) | 1,460 | 51.58% (753) |
| BMI measures per person | | -- | 8.75 (4.58) | -- | 5.94 (1.52) |
| Birth Weight (kg) | Males | 3,001 | 3.52 (0.53) | 752 | 3.42 (0.57) |
| | Females | 2,855 | 3.40 (0.47) | 707 | 3.31 (0.55) |
| Birth Length (cm) | Males | 3,001 | 51.13 (2.40) | 675 | 50.12 (2.34) |
| | Females | 2,855 | 50.41 (2.28) | 616 | 49.31 (2.28) |
| Gestational Age (wks.) | Males | 3,001 | 39.52 (1.64) | 753 | 39.42 (1.99) |
| | Females | 2,855 | 39.65 (1.58) | 707 | 39.42 (2.06) |
| Adiposity Peak BMI (kg/m²) | Males | 4,030 | 18.03 (0.76) | -- | -- |
| | Females | 3,792 | 17.45 (0.69) | -- | -- |
| Adiposity Peak Age (months) | Males | 4,030 | 8.90 (0.33) | -- | -- |
| | Females | 3,792 | 9.36 (0.49) | -- | -- |
| Adiposity Rebound BMI (kg/m²) | Males | 3,642 | 15.62 (1.04) | 697 | 15.53 (0.93) |
| | Females | 3,225 | 15.53 (1.06) | 647 | 15.42 (0.95) |
| Adiposity Rebound Age (years) | Males | 3,642 | 6.07 (1.02) | 697 | 5.30 (1.05) |
| | Females | 3,225 | 5.61 (1.16) | 647 | 4.64 (1.10) |

| Characteristic | Age Stratum | ALSPAC N | ALSPAC Mean (SD) | Raine Study N | Raine Study Mean (SD) |
|---|---|---|---|---|---|
| Age (years) | 1-1.49 | 2,832 | 1.18 (0.18) | 1,326 | 1.15 (0.09) |
| | 1.5-2.49 | 7,113 | 1.76 (0.25) | 387 | 2.14 (0.13) |
| | 2.5-3.49 | 2,537 | 2.95 (0.28) | 956 | 3.09 (0.09) |
| | 3.5-4.49 | 6,915 | 3.77 (0.23) | 20 | 3.69 (0.17) |
| | 4.5-5.49 | 1,843 | 5.05 (0.33) | 3 | 5.28 (0.14) |
| | 5.5-6.49 | 3,848 | 5.90 (0.24) | 1,269 | 5.91 (0.17) |
| | 6.5-7.49 | 2,861 | 7.31 (0.30) | 42 | 7.25 (0.38) |
| | 7.5-8.49 | 3,975 | 7.74 (0.33) | 1,040 | 8.02 (0.27) |
| | 8.5-9.49 | 4,443 | 8.71 (0.22) | 204 | 8.60 (0.12) |
| | 9.5-10.49 | 6,777 | 9.94 (0.29) | 303 | 10.44 (0.08) |
| | 10.5-11.49 | 4,917 | 10.75 (0.23) | 926 | 10.64 (0.15) |
| | 11.5-12.49 | 5,240 | 11.82 (0.21) | 4 | 11.91 (0.36) |
| | 12.5-13.49 | 6,797 | 12.97 (0.22) | 9 | 13.28 (0.17) |
| | 13.5-14.49 | 4,690 | 13.89 (0.17) | 1,196 | 14.06 (0.17) |
| | 14.5-15.49 | 2,339 | 15.32 (0.15) | 24 | 14.69 (0.17) |
| | 15.5-16.49 | 1,645 | 15.72 (0.22) | 2 | 16.16 (0.19) |
| | >16.5 | 90 | 16.83 (0.24) | 976 | 17.05 (0.24) |
| BMI (kg/m²) | 1-1.49 | 2,832 | 17.42 (1.51) | 1,326 | 17.11 (1.39) |
| | 1.5-2.49 | 7,113 | 16.82 (1.49) | 387 | 15.97 (1.19) |
| | 2.5-3.49 | 2,537 | 16.48 (1.40) | 956 | 16.14 (1.23) |
| | 3.5-4.49 | 6,915 | 16.25 (1.39) | 20 | 15.92 (1.41) |
| | 4.5-5.49 | 1,843 | 16.02 (1.70) | 3 | 15.94 (1.43) |
| | 5.5-6.49 | 3,848 | 15.71 (1.87) | 1,269 | 15.82 (1.62) |
| | 6.5-7.49 | 2,861 | 16.10 (1.98) | 42 | 16.41 (2.43) |
| | 7.5-8.49 | 3,975 | 16.31 (2.01) | 1,040 | 16.83 (2.38) |
| | 8.5-9.49 | 4,443 | 17.15 (2.40) | 204 | 16.90 (2.44) |
| | 9.5-10.49 | 6,777 | 17.67 (2.81) | 303 | 18.91 (3.34) |
| | 10.5-11.49 | 4,917 | 18.25 (3.10) | 926 | 18.55 (3.16) |
| | 11.5-12.49 | 5,240 | 19.04 (3.35) | 4 | 16.78 (2.64) |
| | 12.5-13.49 | 6,797 | 19.64 (3.35) | 9 | 21.11 (3.75) |
| | 13.5-14.49 | 4,690 | 20.31 (3.45) | 1,196 | 21.39 (4.02) |
| | 14.5-15.49 | 2,339 | 21.28 (3.48) | 24 | 21.66 (4.23) |
| | 15.5-16.49 | 1,645 | 21.41 (3.51) | 2 | 20.14 (3.26) |
| | >16.5 | 90 | 22.47 (3.40) | 976 | 23.01 (4.28) |
Table 6.2: Descriptive statistics of the single nucleotide polymorphisms included in the allelic score.

| Chr | Nearest Gene | SNP | Alleles (Effect Allele / Non-effect Allele) | GWAS Effect Size for BMI | Effect Allele Frequency: ALSPAC | Effect Allele Frequency: Raine Study | R²: ALSPAC | R²: Raine Study | HWE: ALSPAC | HWE: Raine Study |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | NEGR1 | rs2568958 | A/G | 0.13 | 0.60 | 0.62 | 1.00 | 1.00 | 0.59 | 0.58 |
| | TNNI3K | rs1514175 | A/G | 0.07 | 0.42 | 0.44 | 1.00 | 1.00 | 0.96 | 0.87 |
| | PTBP2 | rs1555543 | C/A | 0.06 | 0.59 | 0.59 | 1.00 | 1.00 | 0.83 | 0.96 |
| | SEC16B | rs543874 | G/A | 0.22 | 0.21 | 0.20 | 1.00 | 0.99 | 0.50 | 0.09 |
| 2 | TMEM18 | rs2867125 | C/T | 0.31 | 0.83 | 0.83 | 1.00 | 1.00 | 0.34 | 0.27 |
| | RBJ, ADCY3, POMC | rs713586 | C/T | 0.14 | 0.49 | 0.48 | 1.00 | 1.00 | 0.27 | 1.00 |
| | FANCL | rs887912 | T/C | 0.1 | 0.29 | 0.29 | 1.00 | 1.00 | 0.28 | 0.57 |
| | LRP1B | rs2890652 | C/T | 0.09 | 0.17 | 0.16 | 0.99 | 0.98 | 0.04 | 0.92 |
| 3 | CADM2 | rs13078807 | G/A | 0.1 | 0.20 | 0.21 | 1.00 | 1.00 | 0.12 | 0.88 |
| | ETV5, DGKG, SFRS10 | rs7647305 | C/T | 0.14 | 0.79 | 0.79 | 0.97 | 1.00 | 0.38 | 0.13 |
| 4 | SLC39A8 | rs13107325 | T/C | 0.19 | 0.08 | 0.07 | 1.00 | 1.00 | 0.87 | 0.57 |
| | GNPDA2 | rs10938397 | G/A | 0.18 | 0.43 | 0.44 | 0.99 | 0.99 | 0.11 | 0.02 |
| 5 | FLJ35779, HMGCR | rs2112347 | T/G | 0.1 | 0.64 | 0.63 | 0.99 | 0.99 | 0.48 | 0.32 |
| | ZNF608 | rs4836133 | A/C | 0.07 | 0.49 | 0.49 | 0.94 | 0.93 | 0.69 | 0.47 |
| 6 | TFAP2B | rs987237 | G/A | 0.13 | 0.18 | 0.19 | 1.00 | 1.00 | 0.49 | 0.27 |
| 9 | LRRN6C | rs10968576 | G/A | 0.11 | 0.32 | 0.31 | 1.00 | 1.00 | 0.35 | 0.58 |
| | LMX1B | rs867559 | G/A | 0.24 | 0.20 | 0.20 | 1.00 | 1.00 | 0.67 | 1.00 |
| 11 | RPL27A, TUB | rs4929949 | C/T | 0.06 | 0.54 | 0.52 | 0.97 | 0.97 | 0.48 | 0.36 |
| | BDNF | rs6265 | C/T | 0.19 | 0.81 | 0.81 | 1.00 | 1.00 | 0.002 | 0.24 |
| | MTCH2, NDUFS3, CUGBP1 | rs3817334 | T/C | 0.06 | 0.40 | 0.42 | 1.00 | 1.00 | 0.98 | 0.71 |
| 12 | FAIM2 | rs7138803 | A/G | 0.12 | 0.36 | 0.37 | 1.00 | 1.00 | 0.73 | 0.09 |
| 13 | MTIF3, GTF3A | rs4771122 | G/A | 0.09 | 0.23 | 0.22 | 0.93 | 0.93 | 0.07 | 0.41 |
| 14 | PRKD1 | rs11847697 | T/C | 0.17 | 0.05 | 0.04 | 0.97 | 0.93 | 0.71 | 0.26 |
| | NRXN3 | rs10150332 | C/T | 0.13 | 0.21 | 0.22 | 1.00 | 1.00 | 0.69 | 0.17 |
| 15 | MAP2K5, LBXCOR1 | rs2241423 | G/A | 0.13 | 0.79 | 0.77 | 1.00 | 1.00 | 0.87 | 0.21 |
| 16 | GPRC5B, IQCK | rs12444979 | C/T | 0.17 | 0.86 | 0.85 | 1.00 | 0.98 | 0.93 | 0.92 |
| | SH2B1, ATXN2L, TUFM, ATP2A1 | rs7359397 | T/C | 0.15 | 0.42 | 0.38 | 1.00 | 1.00 | 0.69 | 0.87 |
| | FTO | rs9939609 | A/T | 0.39 | 0.39 | 0.38 | 1.00 | 1.00 | 0.69 | 0.70 |
| 18 | MC4R | rs12970134 | A/G | 0.23 | 0.27 | 0.25 | 1.00 | 1.00 | 0.51 | 0.63 |
| 19 | KCTD15 | rs29941 | G/A | 0.06 | 0.68 | 0.66 | 1.00 | 1.00 | 0.72 | 0.56 |
| | TMEM160, ZC3H4 | rs3810291 | A/G | 0.09 | 0.69 | 0.64 | 0.77 | 0.71 | 0.01 | 4.81x10^-5 |
| | QPCTL, GIPR | rs2287019 | C/T | 0.15 | 0.81 | 0.81 | 1.00 | 1.00 | 0.85 | 0.87 |
6.5.1 Association Between the Allelic Score and Growth Trajectories
The allelic score was associated with higher mean levels of BMI at age eight (Female:
β=0.0061, P-Value < 0.001; Male: β=0.0044, P-Value < 0.001) and faster BMI growth over
childhood in both sexes (Table 6.3). Due to the increasing rate of growth over time, the
trajectories of individuals with high and low allelic scores begin together at age one but
separate throughout childhood (Figure 6.1).
To investigate whether the association of these loci with BMI growth over childhood was due
to skeletal growth or adiposity, associations between the allelic score and both weight and
height measurements were tested. The allelic score was associated with higher weight
(Females: β=0.0073, P-Value < 0.001; Males β=0.0056, P-Value < 0.001) and faster rates of
weight gain over childhood in both males and females (Figure 6.1 and Table 6.3). As for
associations with BMI, the association was seen earlier in males (by one year of age in ALSPAC)
than females (around two years of age in ALSPAC). The allelic score was associated with
increased height in females (β=0.0949, P-Value < 0.001) and males (β=0.0838, P-Value < 0.001)
and also displayed evidence for an interaction with age (Figure 6.1 and Table 6.3).
Interestingly, the effect size of the allelic score increases over childhood until around 10 years
of age in females and slightly later in males and then decreases until it becomes statistically
non-significant (Figure 6.2).
Table 6.3: Results of the allelic score with each of the trajectory outcomes (BMI, weight and
height) in both cohorts and the combined meta-analysis. Significant findings are in bold;
Spline 1 is the change in slope between two and eight years, Spline 2 is the change in slope
after 12 years and Spline 3 is the change in slope before two years.
ALSPAC Raine Study Combined
Effect Beta SE Beta SE Beta 95% CI P
ln(B
MI)
Fem
ale
Score 0.006 0.001 0.007 0.001 0.006 (0.005, 0.007) <0.01 age:score 0.001 1.0x10-4 0.001 3.0x10-4 0.001 (0.001, 0.001) <0.01 age2:score -3.7x10-5 1.0x10-4 -3.0x10-4 3.0x10-4 -6.2x10-5 (-3.0x10-4,
1.0x10-4) 0.54
age3:score -1.0x10-4 4.7x10-5 -2.0x10-4 2.0x10-4 -9.1x10-5 (-2.0x10-4, -2.7x10-6)
0.04
Spline 1 -5.0x10-5 1.0x10-4 2.0x10-4 3.0x10-4 -2.5x10-5 (-2.0x10-4, 2.0x10-4)
0.80
Spline 2 0.001 2.0x10-4 1.4x10-6 4.0x10-4 3.0x10-4 (-5.0x10-5, 0.001)
0.09
Spline 3 -0.008 0.006 -0.012 0.018 -0.009 (-0.020, 0.002) 0.12
215 Chapter 6: Genetic risk score
Table 6.3 continued
             ALSPAC                 Raine Study            Combined
Effect       Beta        SE         Beta        SE         Beta        95% CI                    P
ln(BMI), Male
Score        0.004       0.001      0.007       0.001      0.004       (0.003, 0.005)            <0.01
age:score    0.001       1.0x10^-4  0.001       2.0x10^-4  0.001       (0.001, 0.001)            <0.01
age^2:score  3.0x10^-4   1.0x10^-4  1.0x10^-4   3.0x10^-4  2.0x10^-4   (5.2x10^-5, 4.0x10^-4)    0.01
age^3:score  3.5x10^-5   4.5x10^-5  -1.4x10^-5  1.0x10^-4  3.1x10^-5   (-5.3x10^-5, 1.0x10^-4)   0.47
Spline 1     -3.0x10^-4  1.0x10^-4  -1.0x10^-4  3.0x10^-4  -3.0x10^-4  (-0.001, -8.3x10^-5)      0.01
Spline 2     0.001       2.0x10^-4  3.0x10^-4   4.0x10^-4  0.001       (3.0x10^-4, 0.001)        <0.01
Spline 3     0.003       0.005      -0.008      0.016      0.002       (-0.008, 0.012)           0.70
ln(Weight), Female
Score        0.007       0.001      0.009       0.002      0.007       (0.006, 0.009)            <0.01
age:score    0.001       1.0x10^-4  0.001       3.0x10^-4  0.001       (0.001, 0.001)            <0.01
age^2:score  1.6x10^-5   1.0x10^-4  1.0x10^-4   2.0x10^-4  2.2x10^-5   (-8.2x10^-5, 1.0x10^-4)   0.68
Spline 1     -1.0x10^-4  3.8x10^-5  -3.0x10^-4  2.0x10^-4  -1.0x10^-4  (-2.0x10^-4, -5.9x10^-5)  <0.01
Spline 2     -           -          0.001       4.0x10^-4  -           -                         -
ln(Weight), Male
Score        0.005       0.001      0.008       0.002      0.006       (0.004, 0.007)            <0.01
age:score    0.001       1.0x10^-4  0.001       3.0x10^-4  0.001       (0.001, 0.001)            <0.01
age^2:score  2.0x10^-4   5.0x10^-5  -1.0x10^-4  1.0x10^-4  2.0x10^-4   (10.0x10^-5, 3.0x10^-4)   <0.01
Spline 1     -2.0x10^-4  3.5x10^-5  -1.0x10^-4  1.0x10^-4  -2.0x10^-4  (-3.0x10^-4, -1.0x10^-4)  <0.01
Spline 2     -           -          2.0x10^-4   3.0x10^-4  -           -                         -
Height, Female
Score        0.088       0.028      0.124       0.059      0.095       (0.045, 0.145)            <0.01
age:score    0.012       0.004      0.024       0.008      0.014       (0.007, 0.020)            <0.01
age^2:score  0.002       0.004      -0.006      0.010      0.001       (-0.006, 0.008)           0.76
age^3:score  0.001       0.002      -0.004      0.005      0.001       (-0.002, 0.004)           0.64
Spline 1     -0.007      0.004      1.0x10^-4   0.010      -0.006      (-0.012, 0.001)           0.07
Spline 2     0.024       0.008      0.020       0.012      0.023       (0.010, 0.035)            <0.01
Spline 3     0.109       0.183      -0.393      0.581      0.063       (-0.279, 0.406)           0.72
Height, Male
Score        0.077       0.027      0.116       0.062      0.084       (0.035, 0.133)            <0.01
age:score    0.011       0.004      0.009       0.009      0.011       (0.003, 0.018)            0.04
age^2:score  0.001       0.004      -0.012      0.013      -2.0x10^-4  (-0.007, 0.007)           0.96
age^3:score  1.0x10^-4   0.002      -0.004      0.006      -1.0x10^-4  (-0.003, 0.003)           0.92
Spline 1     4.0x10^-4   0.004      0.007       0.013      0.001       (-0.006, 0.008)           0.79
Spline 2     -0.016      0.009      -0.008      0.016      -0.014      (-0.029, 0.001)           0.06
Spline 3     -0.082      0.184      0.0708      0.698      -0.072      (-0.419, 0.276)           0.69
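The combined column of Table 6.3 can be checked approximately from the two cohort-specific estimates. The sketch below assumes a fixed-effect, inverse-variance weighted meta-analysis; this section does not state the pooling method, so the function name and the choice of weighting are illustrative assumptions rather than the analysis code used in this study.

```python
import math

def inverse_variance_meta(betas, ses):
    """Fixed-effect inverse-variance pooled estimate with a 95% CI (assumed method)."""
    weights = [1.0 / se ** 2 for se in ses]  # precision of each cohort estimate
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    ci = (pooled_beta - 1.96 * pooled_se, pooled_beta + 1.96 * pooled_se)
    return pooled_beta, pooled_se, ci

# Female height "Score" row of Table 6.3: ALSPAC first, then the Raine Study.
pooled_beta, pooled_se, ci = inverse_variance_meta([0.088, 0.124], [0.028, 0.059])
print(round(pooled_beta, 3))  # prints 0.095
```

Because the tabulated betas and standard errors are rounded, the pooled confidence limits from this sketch agree with the table only to within rounding.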
Figure 6.1: Population average curves for individuals from ALSPAC with 27, 29 or 31 BMI
risk alleles in females (A, C and E) and males (B, D and F). Predicted population average BMI
(A and B), weight (C and D) and height (E and F) trajectories from 1 – 16 years for individuals
with 27 (lower quartile), 29 (median), and 31 (upper quartile) BMI risk alleles in the allelic
score.
Figure 6.2: Associations between the allelic score and BMI (A and B), weight (C and D) and
height (E and F) at each follow-up in females and males from ALSPAC. Regression coefficients
(95% CI) derived from the longitudinal model at each year of follow-up between 1 and 16
years.
6.5.2 Associations Between the Allelic Score and Birth Measures, Adiposity Peak
and Adiposity Rebound
As expected, females were both lighter and shorter than males at birth (Table 6.1). The allelic
score was not associated with the birth measures in either sex (Table 6.4). In addition, there
was no interaction between the allelic score and gestational age for either weight or length at
birth (data not shown).
In ALSPAC the mean age of the adiposity peak was slightly later in females at 9.03 months
(SD=0.76) than males at 8.43 months (SD=0.55), with males also having a higher BMI at the
peak than females (Table 6.1). The estimated age and the BMI at the peak were weakly
correlated in females (ρ=0.08) and males (ρ=-0.30). Greater age at adiposity peak was
associated with higher BMI at age 15-17 in females (β=0.6257 kg/m^2, P-Value < 1.65x10^-8) but
not associated with later BMI in males (β=-0.0408 kg/m^2, P-Value=0.78). In addition, higher BMI
at adiposity peak was associated with higher BMI at age 15-17 years (Females: β=1.3682 kg/m^2,
P-Value=3.11x10^-44; Males: β=1.0578 kg/m^2, P-Value=9.4463x10^-30). The allelic score was not
statistically significantly associated with age of reaching the adiposity peak in females or males
(Table 6.4). However, the allelic score was associated with a higher BMI at the peak (Table 6.4),
explaining less than 0.5% of the variation in BMI at the peak in both females (0.42%) and males
(0.22%). Adjustment for age at the adiposity peak did not substantively alter the magnitude of
the association of the allelic score with BMI at adiposity peak (Females: β=0.0157 kg/m^2,
P-Value=0.0003; Males: β=0.0135 kg/m^2, P-Value=0.0007).
Table 6.4: Cross-sectional association analysis results for birth measures, BMI and age at
adiposity peak (AP) and BMI and age at adiposity rebound (AR) in ALSPAC and the Raine
Study.
              Females                                Males
Measure       Beta (95% CI)                P-Value   Beta (95% CI)                P-Value
Birth weight  -0.0004 (-0.0043, 0.0035)    0.83      0.0026 (-0.0017, 0.0069)     0.23
Birth length  -0.0158 (-0.0352, 0.0036)    0.11      -0.0002 (-0.0190, 0.0186)    0.98
BMI at AP     0.0163 (0.0079, 0.0248)      <0.001    0.0123 (0.0041, 0.0204)      <0.001
Age at AP     0.0074 (-0.0002, 0.0151)     0.06      0.0028 (-0.0025, 0.0080)     0.30
BMI at AR     0.0332 (0.0237, 0.0427)      <0.001    0.0364 (0.0277, 0.0451)      <0.001
Age at AR     -0.0362 (-0.0467, -0.0257)   <0.001    -0.0362 (-0.0450, -0.0274)   <0.001
The ALSPAC participants had a later adiposity rebound than the Raine Study participants, with a
mean of 6.1 years (SD=1.02) versus 5.3 years (SD=1.05) in boys and 5.6 years (SD=1.16) versus
4.6 years (SD=1.10) in girls (Table 6.1). Earlier age at the adiposity rebound and
higher BMI at the adiposity rebound were both associated with higher BMI at age 15-17 years.
The allelic score was associated with an earlier age at the adiposity rebound in females and
males (Table 6.4); both associations remained after adjusting for BMI at the rebound,
although the effect sizes attenuated (Females: β=-0.0122 years, P-Value=0.002; Males:
β=-0.0096 years, P-Value=0.002). The allelic score was also associated with higher BMI at the
rebound in females and males (Table 6.4); both associations remained statistically significant
after adjusting for age at the rebound, although the effect sizes attenuated (Females:
β=0.0094 kg/m^2, P-Value=0.01; Males: β=0.0109 kg/m^2, P-Value < 0.001). The allelic score
accounts for 1-2% of the variation in age and BMI at the adiposity rebound in the two cohorts,
which is twice as much of the variation in BMI as was accounted for at the time of the adiposity
peak or in the overall trajectory.
There is a strong positive correlation between BMI at the adiposity peak and the adiposity
rebound (Female ρ=0.65; Male ρ=0.59). The BMI at the adiposity rebound explains more of the
variation in BMI at age 15-17 than the BMI at the adiposity peak, with estimates of around 10%
for the adiposity peak and 45% for the adiposity rebound. Nevertheless, the allelic score
remains associated with BMI at the adiposity rebound after adjusting for the BMI at the
adiposity peak in both females (β=0.0171 kg/m^2, P-Value < 0.001) and males (β=0.0269 kg/m^2,
P-Value < 0.001).
6.5.3 Variance Explained by the Allelic Score
We calculated the percentage of variation in BMI explained by the allele score at each time
point in ALSPAC using the residual sums of squares from the longitudinal BMI growth model.
This was not calculated in the Raine Study as the sample size is too small for accurate
estimates. The allelic score explains 0.58% of the variation in BMI across childhood in females
and slightly less in males (0.44%). This is approximately a third of the variation in adult BMI
explained by these SNPs in the study that identified them [72]. Figure 6.3 displays the
estimates over childhood in females and males.
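One way to read "percentage of variation explained from the residual sums of squares" is as the proportional reduction in the residual sum of squares (RSS) when the allelic score terms are added to the growth model. The sketch below assumes that definition; the function name and the RSS values are hypothetical and are not taken from the fitted models.

```python
def variance_explained(rss_without_score, rss_with_score):
    """Proportion of residual variation removed by adding the allelic score.

    Assumed definition: (RSS_reduced - RSS_full) / RSS_reduced, where the
    reduced model omits the score terms and the full model includes them.
    """
    if rss_without_score <= 0:
        raise ValueError("RSS of the reduced model must be positive")
    return (rss_without_score - rss_with_score) / rss_without_score

# Hypothetical RSS values chosen only to illustrate a value near 0.58%.
r2 = variance_explained(1000.0, 994.2)
print(f"{100 * r2:.2f}%")
```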
Figure 6.3: Estimates from the longitudinal models of the proportion of BMI variation
explained (R2) at each time point in females and males from ALSPAC. R2 derived from the
longitudinal model at each year of follow-up between 1 and 16 years. Of note, there are
increases in the proportion of BMI variation explained by the allelic score around the
landmarks of growth including adiposity peak and puberty.
The allelic score accounted for a similar percentage of the variation in BMI at the adiposity
peak in both females (0.42%) and males (0.22%). However, for the measures at the adiposity
rebound, the allelic score accounts for up to 1-2% of the variation in the two cohorts (Age:
0.87% in ALSPAC females, 2.70% in the Raine Study females, 1.46% in ALSPAC males and 0.89% in
the Raine Study males; BMI: 1.01% in ALSPAC females, 1.87% in the Raine Study females, 1.46% in
ALSPAC males and 1.14% in the Raine Study males). This is twice as much of the variation in
BMI as was accounted for at the time of the adiposity peak or in the overall trajectory.
6.5.4 Sex Interactions Between the 32 Individual BMI SNPs and BMI Trajectories
In females, differences in BMI trajectories due to the allelic score were detectable from just
after one year of age in ALSPAC and approximately 2.5 years of age in the Raine Study. The LRT
for five of the 32 adult BMI loci reached the Bonferroni significance threshold of 0.0016 in the
meta-analysis; these were loci from the RBJ, FTO, MC4R, CADM2 and MTCH2 regions (Appendix
F; Table 1).
In males, differences in BMI trajectories due to the allelic score were detectable from the
beginning of the curve at one year in ALSPAC and slightly later at one and a half years in the
Raine Study. In the LRT, four of the 32 adult BMI loci were significantly associated with BMI
trajectory at the Bonferroni significance threshold in the meta-analysis; these were loci from the
SEC16B, TMEM18, MC4R and FTO regions (Appendix F; Table 2).
Given that different genes were associated with childhood BMI growth in males and females, BMI
trajectory models with males and females combined were also fitted to investigate any sex by
genetic locus interactions. None of the sex differences for the 32 loci would be significant under
Bonferroni correction; however, the following result is reported here as an exploratory finding. The
meta-analysis of LRT P-Values for the interaction terms between sex and the NRXN3 locus,
rs10150332 (including interaction with the spline function), had a P-Value of 0.0039.
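The Bonferroni threshold quoted in this section is consistent with dividing a family-wise significance level by the 32 loci tested. The short sketch below assumes a family-wise level of 0.05, which is implied by the reported threshold of 0.0016 but not stated explicitly here.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests

# Assumption: family-wise alpha of 0.05 across the 32 adult BMI loci.
threshold = bonferroni_threshold(0.05, 32)
nrxn3_p = 0.0039  # sex-by-locus interaction P-Value reported for rs10150332

print(f"threshold = {threshold:.4f}")
print("NRXN3 interaction significant:", nrxn3_p < threshold)
```

Under this assumption the NRXN3 interaction P-Value of 0.0039 lies above the threshold, which matches its description as an exploratory finding.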
6.5.5 Adjustment for FTO Effect
When adjusting for the FTO locus, rs9939609, in the trajectory and adiposity peak/rebound
analyses, the results remained statistically significant, with no attenuation of the effect size.
This indicates that the associations between growth and the allelic score were independent of
the FTO effect (Table 6.5).
Table 6.5: Results of the allelic score, after adjustment for the FTO locus, with each of the
trajectory outcomes (BMI, weight and height) in both cohorts and the combined meta-
analysis. Significant findings are in bold; Spline 1 is the change in slope between two and eight
years, Spline 2 is the change in slope after 12 years and Spline 3 is the change in slope before
two years.
             ALSPAC                 Raine Study            Combined
Effect       Beta        SE         Beta        SE         Beta        95% CI                    P      P (Het)
ln(BMI), Female
Score        0.006       0.001      0.007       0.001      0.006       (0.005, 0.007)            <0.01  0.3
age:score    0.001       1.1x10^-4  0.001       2.7x10^-4  0.001       (0.001, 0.001)            <0.01  0.7
age^2:score  -5.1x10^-5  1.1x10^-4  0.000       3.1x10^-4  -8.3x10^-5  (-3x10^-4, 1x10^-4)       0.43   0.4
age^3:score  -8.7x10^-5  4.8x10^-5  0.000       1.6x10^-4  -9.7x10^-5  (-2x10^-4, -6x10^-6)      0.04   0.5
Spline 1     -2.6x10^-5  1.1x10^-4  0.000       3.1x10^-4  5.7x10^-6   (-2x10^-4, 2x10^-4)       0.96   0.4
Spline 2     4.0x10^-4   2.4x10^-4  0.000       3.8x10^-4  3.0x10^-4   (-1x10^-4, 6x10^-4)       0.20   0.2
Spline 3     -0.008      0.006      -0.015      0.019      -0.009      (-0.020, 0.003)           0.13   0.7
ln(BMI), Male
Score        0.004       0.001      0.006       0.001      0.004       (0.003, 0.005)            <0.01  0.2
age:score    0.001       1.0x10^-4  0.001       2.4x10^-4  0.001       (0.001, 0.001)            <0.01  0.5
age^2:score  2.4x10^-4   1.1x10^-4  1.7x10^-4   3.0x10^-4  2.0x10^-4   (4x10^-5, 4x10^-4)        0.02   0.8
age^3:score  2.9x10^-5   4.5x10^-5  4.2x10^-5   1.5x10^-4  3.1x10^-5   (-5x10^-5, 1x10^-4)       0.48   0.9
Spline 1     -2.6x10^-4  1.1x10^-4  -2.1x10^-4  2.9x10^-4  -2.0x10^-4  (-4x10^-4, -6x10^-5)      0.01   0.9
Spline 2     0.001       2.4x10^-4  3.9x10^-4   3.6x10^-4  0.001       (2x10^-4, 0.001)          0.00   0.6
Spline 3     0.003       0.005      -0.004      0.016      0.003       (-0.007, 0.013)           0.60   0.7
ln(Weight), Female
Score        0.007       0.001      0.009       0.002      0.007       (0.006, 0.009)            <0.01  0.4
age:score    0.001       1.1x10^-4  0.001       3.3x10^-4  0.001       (0.001, 0.001)            <0.01  0.2
age^2:score  6.5x10^-6   5.7x10^-5  9.8x10^-5   1.7x10^-4  1.6x10^-5   (-9x10^-5, 1x10^-4)       0.77   0.6
Spline 1     -1.1x10^-4  3.9x10^-5  -2.7x10^-4  1.6x10^-4  -1.0x10^-4  (-2x10^-4, -4x10^-5)      <0.01  0.3
Spline 2     -           -          0.001       3.7x10^-4  -           -                         -      -
ln(Weight), Male
Score        0.005       0.001      0.007       0.002      0.005       (0.004, 0.007)            <0.01  0.2
age:score    0.001       1.1x10^-4  0.001       3.0x10^-4  0.001       (0.001, 0.001)            <0.01  0.7
age^2:score  2.x10^-4    5.1x10^-5  -4.4x10^-5  1.5x10^-4  2.0x10^-4   (9x10^-5, 3x10^-4)        <0.01  0.1
Spline 1     -2.1x10^-4  3.6x10^-5  -7.9x10^-5  1.4x10^-4  -2.0x10^-4  (-3x10^-4, -1x10^-4)      <0.01  0.4
Spline 2     -           -          2.0x10^-4   3.3x10^-4  -           -                         -      -
Height, Female
Score        0.088       0.029      0.147       0.061      0.099       (0.048, 0.150)            <0.01  0.4
age:score    0.011       0.004      0.024       0.008      0.013       (0.007, 0.020)            <0.01  0.2
age^2:score  0.001       0.004      -0.006      0.010      0.001       (-0.006, 0.007)           0.88   0.5
age^3:score  0.001       0.002      -0.004      0.005      0.001       (-0.002, 0.004)           0.71   0.4
Spline 1     -0.006      0.004      -0.001      0.010      -0.006      (-0.012, 0.001)           0.10   0.6
Spline 2     0.024       0.008      0.022       0.012      0.023       (0.010, 0.036)            0.00   0.9
Spline 3     0.154       0.187      -0.352      0.602      0.109       (-0.241, 0.459)           0.54   0.4
Height, Male
Score        0.069       0.028      0.104       0.063      0.075       (0.025, 0.125)            0.00   0.6
age:score    0.009       0.004      0.007       0.009      0.009       (0.002, 0.017)            0.01   0.8
age^2:score  0.001       0.004      -0.012      0.013      -2.0x10^-4  (-0.007, 0.007)           0.95   0.3
age^3:score  1.5x10^-4   0.002      -0.004      0.006      -1.0x10^-4  (-0.003, 0.003)           0.94   0.5
Spline 1     0.001       0.004      0.009       0.013      0.001       (-0.006, 0.008)           0.73   0.6
Spline 2     -0.018      0.009      -0.012      0.016      -0.016      (-0.031, -0.001)          0.03   0.8
Spline 3     -0.116      0.186      0.039       0.708      -0.106      (-0.458, 0.246)           0.56   0.8
Adiposity Peak, Female
Age          0.009       0.004      -           -          -           -                         -      -
BMI          0.019       0.004      -           -          -           -                         -      -
Adiposity Peak, Male
Age          0.003       0.003      -           -          -           -                         -      -
BMI          0.013       0.004      -           -          -           -                         -      -
Adiposity Rebound, Female
Age          -0.028      0.006      -0.051      0.013      -0.032      (-0.043, -0.021)          <0.01  0.1
BMI          0.032       0.006      0.037       0.011      0.033       (0.023, 0.042)            <0.01  0.7
Adiposity Rebound, Male
Age          -0.032      0.005      -0.031      0.011      -0.032      (-0.041, -0.023)          <0.01  0.9
BMI          0.035       0.005      0.032       0.010      0.035       (0.026, 0.044)            <0.01  0.8
6.5.6 Comparison with Weighted Allelic Score
The mean of the weighted allelic score was 4.09 (SD=0.54) in ALSPAC and 4.06 (SD=0.55) in the
Raine Study while the mean for the unweighted allelic score was higher at 28.82 (SD=3.45) in
ALSPAC and 28.65 (SD=3.54) in the Raine Study. After standardizing both of these scores, the
results using the weighted allele score in the trajectory models displayed the same
associations as the unweighted results (Table 6.6).
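The difference between the two scores, and the standardisation that makes their effect sizes comparable, can be sketched as follows. The genotype coding (0, 1 or 2 BMI-increasing alleles per SNP), the toy data with three SNPs, and the per-allele weights are all hypothetical; in this study the score used 32 SNPs, and the weights used for the weighted score are not restated in this section.

```python
from statistics import mean, stdev

def allelic_scores(genotypes, weights):
    """Return (unweighted, weighted) allelic scores for each individual.

    genotypes: one list per individual of risk-allele counts (0, 1 or 2) per SNP.
    weights:   one per-allele effect size per SNP (hypothetical values here).
    """
    unweighted = [sum(person) for person in genotypes]
    weighted = [sum(w * g for w, g in zip(weights, person)) for person in genotypes]
    return unweighted, weighted

def standardise(values):
    """Centre on the sample mean and scale by the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Toy data: four individuals, three SNPs.
genotypes = [[0, 1, 2], [1, 1, 1], [2, 0, 1], [2, 2, 2]]
weights = [0.10, 0.20, 0.05]

unweighted, weighted = allelic_scores(genotypes, weights)
z_unweighted = standardise(unweighted)
z_weighted = standardise(weighted)
print(unweighted)  # prints [3, 3, 3, 6]
```

After standardisation both scores have mean zero and unit standard deviation, so a regression coefficient for either score is expressed per standard deviation of the score rather than per risk allele.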
Table 6.6: Comparison of the unweighted and weighted allelic scores for the three trajectory outcomes. Spline 1 is the change in slope between two and
eight years, Spline 2 is the change in slope after 12 years and Spline 3 is the change in slope before two years.
             Unweighted                                    Weighted
Effect       Beta        95% CI                    P       Beta         95% CI                    P
ln(BMI), Female
Score        0.0209      (0.017, 0.0249)           <0.001  0.0207       (0.0168, 0.0247)          <0.001
age:score    0.0039      (0.00317, 0.0046)         <0.001  0.0041       (0.00343, 0.00486)        <0.001
age^2:score  -0.0002     (-0.0009, 0.0005)         0.59    -4.10x10^-5  (-0.0008, 0.0007)         0.91
age^3:score  -0.0003     (-0.0007, 0.00004)        0.08    -0.0003      (-0.0007, 0.00006)        0.10
Spline 1     -0.0001     (-0.0008, 0.0006)         0.79    -0.0003      (-0.0010, 0.0005)         0.49
Spline 2     0.0012      (-0.0002, 0.0025)         0.09    0.0015       (0.0001, 0.0028)          0.03
Spline 3     -0.0298     (-0.0674, 0.0079)         0.12    -0.0306      (-0.0678, 0.0067)         0.11
ln(BMI), Male
Score        0.0151      (0.0115, 0.0187)          <0.001  0.0181       (0.0145, 0.0217)          <0.001
age:score    0.0035      (0.0028, 0.0042)          <0.001  0.0045       (0.0040, 0.0051)          <0.001
age^2:score  0.0008      (0.00008, 0.0015)         0.03    0.0009       (0.0002, 0.0016)          0.02
age^3:score  8.00x10^-5  (-0.0003, 0.0004)         0.67    0.0001       (-0.0002, 0.0005)         0.49
Spline 1     -0.0009     (-0.0016, -0.0002)        0.01    -0.0011      (-0.0018, -0.0004)        0.003
Spline 2     0.0022      (0.0009, 0.0035)          0.001   0.0026       (0.0013, 0.0040)          <0.001
Spline 3     0.0070      (-0.0275, 0.0415)         0.69    -0.0003      (-0.035, 0.0345)          0.99
ln(Weight), Female
Score        0.0251      (0.0199, 0.0303)          <0.001  0.0248       (0.0196, 0.0300)          <0.001
age:score    0.0034      (0.0027, 0.0042)          <0.001  0.0037       (0.0030, 0.0045)          <0.001
age^2:score  0.0001      (-0.0003, 0.0005)         0.53    0.0002       (-0.0001, 0.0006)         0.22
Spline 1     -0.0004     (-0.0006, -0.0002)        <0.001  -0.0005      (-0.0007, -0.0003)        <0.001
ln(Weight), Male
Score        0.0193      (0.0144, 0.0242)          <0.001  0.0237       (0.0188, 0.0286)          <0.001
age:score    0.0034      (0.0027, 0.0041)          <0.001  0.0041       (0.0034, 0.0048)          <0.001
age^2:score  0.0006      (0.0002, 0.0009)          0.002   0.0005       (0.0002, 0.0009)          0.004
Spline 1     -0.0006     (-0.0008, -0.0004)        <0.001  -0.0007      (-0.0009, -0.0005)        <0.001
Height, Female
Score        0.3280      (0.1560, 0.5010)          0.0002  0.3140       (0.1420, 0.4860)          <0.001
age:score    0.0470      (0.0245, 0.0695)          <0.001  0.0414       (0.0190, 0.0638)          <0.001
age^2:score  0.0037      (-0.0194, 0.0268)         0.75    0.0030       (-0.0201, 0.0260)         0.80
age^3:score  0.0024      (-0.0076, 0.0123)         0.64    0.0024       (-0.0076, 0.0123)         0.64
Spline 1     -0.0206     (-0.0430, 0.0018)         0.07    -0.0179      (-0.0402, 0.0044)         0.12
Spline 2     0.0790      (0.0352, 0.1230)          <0.001  0.0629       (0.0191, 0.1070)          0.004
Spline 3     0.2230      (-0.9630, 1.4100)         0.71    0.0523       (-1.1200, 1.2300)         0.93
Height, Male
Score        0.2900      (0.1210, 0.4590)          0.001   0.3900       (0.2200, 0.5590)          <0.001
age:score    0.0366      (0.0115, 0.0617)          0.004   0.0464       (0.0213, 0.0715)          <0.002
age^2:score  -0.0005     (-0.0250, 0.0240)         0.97    -0.0028      (-0.0273, 0.0217)         0.83
age^3:score  -0.0005     (-0.0108, 0.0098)         0.93    -0.0009      (-0.0114, 0.0096)         0.87
Spline 1     0.0033      (-0.0209, 0.0274)         0.79    0.0006       (-0.0235, 0.0248)         0.96
Spline 2     -0.0495     (-0.1010, 0.0019)         0.06    -0.0278      (-0.0789, 0.0234)         0.29
Spline 3     -0.2480     (-1.4500, 0.9550)         0.69    0.5020       (-0.7090, 1.7100)         0.42
6.6 Discussion
In this study, the association between variants in genes shown to be associated with increased BMI
in adulthood and longitudinal growth measures over childhood was investigated in two birth
cohorts of 7,868 and 1,460 individuals. Similar to previous studies, it was shown that an allelic
score derived from a set of known adult BMI-associated SNPs is not associated with birth
measures but is associated with BMI growth throughout childhood and adolescence. In
addition, a statistically significant association was detected between the allelic score and
weight gain over childhood and adolescence and also an association with height growth,
though the effect size for height was smaller than that for BMI and weight. It appears that the
association of the allelic score with height growth stops after the age that was considered to
reflect puberty, whereas it continues to be associated with weight, and therefore BMI. This
indicates that the variants that are associated with adult BMI are related to childhood BMI
growth by affecting both how much fat is accumulated and how tall the child grows until
puberty. Although the individual SNPs are associated with adiposity, they might not all be
associated with height, or height growth, hence the weaker association detected between the
allelic score and height. Of the SNPs included in the allelic score, the GIANT consortium found a
SNP 30,000bp upstream from the RBJ locus and a SNP in the MC4R gene to be associated with
adult height [295]. An extension to the current study would be to investigate whether any of
the individual SNPs in the allelic score largely influence child height growth rather than weight;
however a larger sample size would be required to consider this.
This study is an extension to the work conducted by Elks et al [206] who investigated the
association of an eight SNP allelic score with growth trajectories from birth to 11 years of age
in ALSPAC. The analysis in this Chapter has extended their work using both ALSPAC and the
Raine Study, by increasing the age period over which the trajectories are examined and also
the number of SNPs investigated. By extending the time period under study, it was shown that
although the weight gain in early childhood is small for each additional risk allele, this gain
increases through late childhood and adolescence. Belsky et al [209] are the only other
investigators to look at an allelic score using the same set of SNPs; conclusions in this Chapter
are similar to theirs in terms of the growth trajectories throughout childhood; however, the
results in this Chapter were also able to show the early timing of effect and some exploratory
findings regarding sex-specific genes affecting BMI growth.
Previous studies investigating the association between adult BMI associated SNPs and
childhood growth adjusted their analyses for sex [205,206,209,251,316]; only Hardy et al [205]
tested for a sex interaction and found it to be non-significant. In this study, sex-specific models
were fitted because a statistically significant sex interaction was detected; this showed that the
allelic score begins to be associated with BMI and weight earlier in males than females, but
around the same age for height. Furthermore, other than the FTO and MC4R SNPs, different
genes appear to be associated with childhood BMI growth in males and females. However,
these differences could not be replicated in the formal interaction analysis and therefore
further investigation in larger sample sizes is required to confirm this observation.
Nevertheless, this is the first study in childhood that has observed some evidence for sex
differences for genetic associations of BMI, which provides additional evidence that there may
be different, but partially overlapping, genes that contribute to the body shape of men and
women, even in childhood.
For the first time, it has been shown that the BMI-increasing alleles have a
detectable effect on childhood growth as early as one year of age in males and slightly later in
females. In addition, a statistically significant association was detected between the allelic
score and higher BMI at the adiposity peak, but there was only weak evidence of an association
between the allelic score and age at adiposity peak. This contrasts with the findings for the association
between the FTO gene and adiposity peak shown in the NFBC66 [129], where the age at the
adiposity peak was associated with the FTO variant but not BMI at the peak. Similar to
previous studies investigating the genetic determinants of the adiposity rebound, the analysis
in this Chapter found that for each additional BMI risk allele an individual’s age at the adiposity
rebound decreases by approximately 11 days and also their BMI at the rebound increases by
0.035 kg/m^2.
Further studies are now required to assess the validity of these findings, particularly regarding
the onset of the genetic association. Both of the cohorts investigated had limited data
available in the first few years of life, and although the statistical modelling framework allowed
for the estimation of the timing of the genetic association and the parameters around the
adiposity peak, it is important that this is replicated in cohorts with more regular
measurements throughout this time period. Likewise, although over 4,500 males and females
were included in the sex stratified analyses, it is important that the observed sex differences in
this study are replicated.
6.7 Conclusion
In conclusion, an association analysis has been conducted in a large childhood population to
investigate the effect of known adult genetic determinants of BMI on childhood growth. It has
been shown that the genetic effect begins very early in life and that there are potentially sex
differences in the genetic effects of BMI throughout childhood. These results are consistent
with both the DOHaD and life course epidemiology hypotheses – the determinants of adult
susceptibility to obesity begin in early childhood and develop over the life course.
Chapter 7: Conclusions, Limitations And Future Directions
7.1 Main Findings
This chapter provides a summary of the research and the conclusions presented throughout
this thesis. The limitations of this research are also presented and discussed. This chapter
concludes with a discussion of the potential areas of extension for future research.
In 2009, at the beginning of the research in this thesis, Kerner et al indicated that the groups in
the Genetic Analysis Workshop that focused on longitudinal analysis of GWAS data were
unable to establish a clear analytic strategy for dealing with complex longitudinal data
structures in a time efficient manner [85]. The primary aim of this thesis, as detailed in Chapter
1, was therefore to investigate the association between BMI growth trajectories across
childhood and adolescence and genetic variants on a genome-wide scale. Compared to cross-
sectional analyses, longitudinal studies are advantageous for investigating genetic associations
as they: 1) allow information to be shared between individuals across time improving precision
over analysis of a single time point; 2) facilitate the detection of genes that influence
trajectories rather than simple differences in phenotypes; and 3) allow the detection of genes
that are associated with age of onset of a trait [85]. There are several traits, including BMI as a
measure of adiposity, where the age of onset provides insight to the pathophysiological
heterogeneity between individuals. For example, individuals who have a high BMI at all ages,
reflecting a high lean and fat body mass, appear to have relatively normal metabolic profiles;
individuals who have normal BMI followed by an early adiposity rebound and consequently a
higher BMI, reflecting increased fat rather than lean mass, are at higher risk for coronary heart
disease and insulin resistance [132,137,139,140]. Therefore, attention to the timing and
trajectory of a phenotype can help to clarify the underlying pathophysiological process. Kerner
et al conclude that “Future work should focus on the development of analytical methods and
computer software that can handle these longitudinal data in the context of other
complexities that are often found in cohort studies.” [85]. The research presented in this
thesis, a series of separate but consecutive studies, was conducted to investigate
statistical methods that could be used for analysing longitudinal data in large scale genetic
studies. The first three studies investigated the statistical aspects of various modelling
approaches, while the final two studies focused on the genetic association analyses and clinical
implications; these studies are summarized below.
7.1.1 Longitudinal Statistical Models for Body Mass Index Growth Trajectories
throughout Childhood using the Western Australian Pregnancy Cohort
(Raine) Study [197]
A systematic review of the literature found no statistical method described for detecting
small genetic effects in a high dimensional, longitudinal study, such as modelling BMI
trajectories across childhood and adolescence. Therefore, in the first study of this thesis [197],
the major methodological challenges of analysing these data were investigated and the
SPLMM was shown to be the most appropriate model for genetic association studies of BMI
trajectories. This model has the ability to be used for high dimensional studies such as GWASs
or gene-gene/gene-environment interaction studies. This study contributes to a more
complete understanding of the advantages and limitations of each of the statistical methods
evaluated and provides a basis for furthering the exploration of genetic associations with BMI
trajectories – the focus of this thesis. In later studies, this model was applied to the ALSPAC
and NFBC66 cohorts, indicating that it is a flexible modelling framework that is generalizable to
other cohorts and potentially to other complex longitudinal phenotypes.
7.1.2 Comparing SPLMM to Two-Step Approach for GWASs
There have recently been several publications investigating a two-step approach for
conducting longitudinal GWASs that greatly reduces the computational time for traits that
have a linear trajectory over time [85,89,245]. The second study of this thesis explored how
this two-step approach could be applied to the BMI data over childhood and adolescence,
which has a non-linear trajectory over time, a high correlation between the intercept and slope
terms and non-normal, correlated errors. This study showed that the two-step approach
produced results that were unreliable for complex phenotypes, particularly due to the data
having a non-linear trajectory over time. It is recommended that, before conducting a GWAS of
a longitudinal phenotype, the one-step and two-step approaches are compared for reliability.
Therefore, although it is far more computationally intensive, the full SPLMM was used for the
GWAS analyses in this thesis.
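The idea behind a two-step analysis can be illustrated with a deliberately simplified sketch: step one summarises each individual's trajectory without any genetic data, and step two tests the summaries against genotype. The version below uses a separate straight-line fit per individual, which is an assumption made for brevity; the published two-step approaches cited above are based on mixed models, and all function names and data here are hypothetical.

```python
from statistics import mean

def ols_slope(x, y):
    """Least-squares slope of y regressed on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx

def two_step_slope_association(ages_by_person, bmi_by_person, genotypes):
    """Simplified two-step analysis of a linear trajectory.

    Step 1: estimate one linear growth slope per individual (no genotype used).
    Step 2: regress those slopes on genotype (0, 1 or 2 copies of an allele).
    """
    slopes = [ols_slope(a, b) for a, b in zip(ages_by_person, bmi_by_person)]
    return ols_slope(genotypes, slopes)

# Toy data: three individuals whose BMI rises by 0.1, 0.2 and 0.3 units per year.
ages = [[1, 2, 3], [1, 2, 3], [1, 2, 3]]
bmi = [[15 + 0.1 * a for a in ages[0]],
       [15 + 0.2 * a for a in ages[1]],
       [15 + 0.3 * a for a in ages[2]]]
per_allele_effect = two_step_slope_association(ages, bmi, [0, 1, 2])
```

A single straight-line slope cannot represent a trajectory with an adiposity peak and rebound, which is one intuitive reason why a summary of this kind can be unreliable for childhood BMI.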
7.1.3 Robustness of the Linear Mixed Effects Model to Distribution Assumptions
and Consequences for Genome-Wide Association Studies
When the four methods from the first study were applied to the ALSPAC data to ensure that
the SPLMM was generalizable, the residuals were non-normal and heteroscedastic, thus not
meeting the assumptions of the model. The third study presented in this thesis explored,
through simulation and a real data example, the effect of these error misspecifications on the
genetic results in a chromosome-wide study. It is shown that the type 1 error for the SNP by
age interaction terms in a genetic association study is inflated, regardless of the type of
model misspecification. To address this issue, results are presented that describe the use of
robust standard errors for the fixed effects parameters as an appropriate way to deflate the
type 1 error, in most scenarios to nominal levels. Given mixed models have only recently
begun being used in GWASs, this study provided practical guidance to genetics researchers
investigating longitudinal traits regarding the use of appropriate statistical methods when
model assumptions cannot be met.
7.1.4 Genome-Wide Association Study of BMI Trajectories Across Childhood
Once the methodological aspects of the thesis had been completed, the results from the first
three methodological studies were applied to longitudinal BMI data from three large human
cohorts. The fourth study of this thesis was the GWAS analysis for BMI trajectory across
childhood and adolescence in the Raine Study, with replication of the most significant region in
the ALSPAC and NFBC66 cohorts. Variants in the KCNJ15 gene were shown to be associated
with BMI trajectory over this time period in both males and females. The most significant SNP
in the Raine Study, rs2008580, is a transcription factor binding site within an intronic region of
the gene. The KCNJ15 gene has previously been reported to be associated with an increased
risk of type 2 diabetes, increased levels of insulin and insulin resistance [291,293]. The
association with the rs2008580 SNP appears to be driven by a change in weight rather than a change in height,
indicating that it is influencing adiposity rather than skeletal growth, which is consistent with
increased levels of circulating insulin. Therefore, through the development of appropriate
longitudinal models for BMI trajectories, a novel biologically plausible gene for BMI trajectory
over childhood was discovered.
7.1.5 Association of a Genetic Risk Score with Longitudinal BMI in Children
The final study presented in this thesis evaluated the impact of all known and replicated
genetic variants associated with adult BMI on growth over childhood and adolescence
(including BMI, height and weight) and related growth parameters (including age and BMI at
both the adiposity peak and rebound). The results of this study indicate that these variants
appear to be related to childhood BMI growth by affecting both how much fat is accumulated
and how tall the child grows until puberty. The genetic effect begins very early in life, around
one year, and there are potentially sex differences in the impact these variants in the
established obesity genes have on BMI growth throughout childhood. These results provide
further evidence for the life course epidemiology hypotheses and provide insight into the
development of obesity, based on the genetic predisposition, which may be useful in the
design of intervention studies.
7.2 Limitations
Although the SPLMM method was shown to be the most efficient and flexible approach for
modelling complex longitudinal phenotypes in large-scale genetics studies, it remained a
computationally intensive method for GWAS analyses particularly in the studies with larger
sample size and more repeated measurements.
7.2.1 Computational Intensity
As discussed in Chapter 5, I was only able to conduct a full GWAS analysis in the Raine Study,
primarily due to the computational burden that the GWAS would have imposed in the other studies.
The larger cohorts 1) had a longer computation time for the analysis of each SNP, due to their
greater sample size and additional measures per person, and 2) had limited computational
facilities available at the time of the study. To attempt to
reduce the computational burden, the code for calculating the robust standard errors and
associated P-values was rewritten in C++, which reduced the time from several minutes to
seconds per SNP in the ALSPAC and NFBC66 cohorts. To make it feasible to conduct analysis of
the full 2.5 million SNPs, without adjustment to how the scripts were run on the high
performance computers, the lme function from R would also have to be rewritten in C++,
which would require a considerable amount of programming. Instead, the analysts for this
project at ALSPAC and NFBC66 are currently in discussion with the high performance
computing teams at their universities to determine a computationally efficient procedure for
running the required scripts. If there is no alternative to reduce the computation time using
the SPLMM method with robust standard errors, either the analytic plan will need to be
redefined or the lme function reprogrammed in C++.
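As a rough illustration of why the per-SNP computation time determines whether a full GWAS is feasible, the total run time can be approximated as the number of SNPs multiplied by the time per SNP, divided by the number of jobs run in parallel. The sketch below uses the 2.5 million SNPs mentioned above; the per-SNP times and the number of parallel jobs are hypothetical values chosen only for illustration and are not measurements from any of the cohorts.

```python
def total_hours(n_snps, seconds_per_snp, n_parallel_jobs=1):
    """Approximate wall-clock hours for a GWAS, assuming every SNP takes the
    same time and the work divides evenly across parallel jobs."""
    return n_snps * seconds_per_snp / n_parallel_jobs / 3600.0


if __name__ == "__main__":
    n_snps = 2_500_000  # number of SNPs stated in the text
    # Hypothetical per-SNP times (seconds) and job counts, for illustration only
    for seconds_per_snp in (5, 60):
        for jobs in (1, 100):
            hours = total_hours(n_snps, seconds_per_snp, jobs)
            print(f"{seconds_per_snp:>3} s/SNP, {jobs:>3} jobs: {hours:,.1f} hours")
```

Under these assumptions, the run time scales linearly with both the per-SNP time and the number of SNPs, which is why either faster model fitting or a different way of distributing the scripts across the high performance computers would be needed.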
7.2.2 Gene Discovery
Because the Raine Study is relatively small with only 1,461 individuals and a maximum of 8
measurements per individual, the GWAS in this study only detected one genetic region of
interest. To find reliable genetic associations with BMI trajectories, it will be necessary to have
a larger sample size in the discovery GWAS analysis. To be able to increase the sample size, it
would be necessary to reduce the computational time for each SNP in the larger cohorts, as
discussed above.
7.3 Future Directions
To improve the clinical care provided by medical practitioners, additional knowledge about the
complex interplay between genetics, the environment and behaviours underlying disease
is required. To understand more about the mechanisms of diseases such as obesity, complex
statistical methods, such as those described in this thesis, will need to be utilized. The use of
longitudinal data in genetic association studies has only recently begun; hence, there are
numerous opportunities for building further on the studies described in this thesis. Below are a
number of potential extensions; however, this list is not exhaustive.
7.3.1 Reducing Remaining Type 1 Error
The chromosome-wide analysis in Chapter 4 and the GWAS in Chapter 5 present results that
demonstrate there is some remaining inflation in the λ value for the SNP*age effect. This may
be because there are many associated SNPs in the analysis; alternatively, and more likely,
there is some residual type 1 error inflation for that parameter. As discussed in Chapter 5, it
may be possible to use genomic control adjustment to reduce this to nominal levels; however,
it would need to be determined that the remaining inflation, after using the robust standard
errors, is constant across the genome, as genomic control scales all test statistics by the
same factor.
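The genomic control adjustment referred to above can be sketched as follows. This is a minimal illustration and not the code used in this thesis: it assumes each SNP*age test has been summarised as a 1 degree of freedom chi-square statistic, estimates λ as the ratio of the observed median statistic to the median expected under the null hypothesis (approximately 0.4549), and then divides every statistic by that single value. The function names and the example statistics are hypothetical.

```python
from statistics import median

# Median of a chi-square distribution with 1 degree of freedom
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation(chi2_stats):
    """Estimate lambda as observed median statistic / expected null median."""
    return median(chi2_stats) / CHI2_1DF_MEDIAN


def genomic_control(chi2_stats):
    """Divide every statistic by lambda; no adjustment is made if lambda <= 1."""
    lam = genomic_inflation(chi2_stats)
    scale = max(lam, 1.0)
    return [x / scale for x in chi2_stats], lam


if __name__ == "__main__":
    # Hypothetical test statistics, for illustration only
    stats = [0.1, 0.5, 0.9098728, 2.0, 5.0]
    adjusted, lam = genomic_control(stats)
    print("lambda:", lam)
    print("adjusted statistics:", adjusted)
```

Because one λ is applied to every SNP, the adjustment is only appropriate if the residual inflation is the same across the genome, which is the condition noted above.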
7.3.2 Longitudinal Family Studies
There are many large cohort studies that combine both a longitudinal and family based study
design, for example the Framingham Heart Study [321] and the Busselton Health Study [322].
Generally, when using these cohorts, either a cross-sectional study of the families is conducted
or a longitudinal study of unrelated individuals, depending on the research question of
interest. The modelling framework presented in this thesis can easily be extended to account
for the within-family correlation by incorporating additional random effects, in addition to
accounting for the within-individual correlation discussed here. This could be very powerful for
gene discovery by including all genotyped individuals in the analysis to increase the sample
size, as opposed to selecting an unrelated subset of individuals (a design that has previously
been used [322]).
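One way such an extension might be written is sketched below. The notation is an assumption introduced here for illustration and does not reproduce the models fitted in this thesis: $y_{fij}$ denotes the phenotype of individual $i$ in family $f$ at measurement occasion $j$, $t_{fij}$ the age at that occasion, $g_{fi}$ the genotype, $\mathbf{x}_{fij}$ the remaining fixed-effect terms (including any spline terms), $c_f$ a family-level random intercept, and $b_{0fi}$ and $b_{1fi}$ the individual-level random intercept and slope.

```latex
\begin{align*}
y_{fij} &= \mathbf{x}_{fij}^{\top}\boldsymbol{\beta}
          + \beta_{g}\, g_{fi}
          + \beta_{gt}\, g_{fi}\, t_{fij}
          + c_{f}
          + b_{0fi} + b_{1fi}\, t_{fij}
          + \varepsilon_{fij}, \\
c_{f} &\sim N(0, \sigma_{c}^{2}), \qquad
(b_{0fi}, b_{1fi})^{\top} \sim N(\mathbf{0}, \mathbf{D}), \qquad
\varepsilon_{fij} \sim N(0, \sigma^{2}).
\end{align*}
```

A single shared family intercept is the simplest choice and treats all members of a family as equally correlated; a random effect with covariance proportional to the kinship matrix would be an alternative for extended pedigrees.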
7.3.3 Adjusting for Environmental Covariates
As discussed in Section 1.5, there are several important covariates that have previously been
shown to be associated with BMI throughout childhood and adolescence, including duration of
breast feeding [123,124] and nutrition over childhood [125,126], amount of physical activity
[127,142] and timing of puberty [147,148]. Although the modelling frameworks presented
here allow for covariates and it would be beneficial to adjust for them to decrease the total
residual variation and potentially increase power to detect small genetic effects, it is not
without its challenges. Two types of covariates can be incorporated into the
models presented in this thesis: time-invariant and time-varying covariates. Ideally, time-invariant
covariates would be those which occur before the start of the trajectory and remain
consistent throughout the time window being modelled. Examples of such covariates would
include sex, maternal smoking during pregnancy and birth weight. Covariates of this nature
could be incorporated into the SPLMM model and investigation into the changes to the genetic
effect could be conducted. Time-invariant covariates that occur within the time window being
modelled, such as the timing of puberty, are more difficult to incorporate. However, one
example of incorporating such covariates would be to include a knot point in the SPLMM at
the time of the covariate to allow the curve to differ before and after the onset of the covariate.
Finally, time-varying covariates can also be included in the modelling frameworks presented
here, although some thought regarding how they are incorporated is often required. Often in
cohort studies, such as the Raine Study, ALSPAC and NFBC66, the collection of variables such
as nutrition and exercise changes over the follow-up years. This is often necessary as the
nutrition and exercise patterns of a two-year-old are quite different to those of a 13-year-old.
For example, questions regarding breast feeding and baby food are asked in the two-year
follow-up of the Raine Study, whereas a food frequency questionnaire is administered at the 13-year
follow-up. Therefore, to incorporate a time-varying covariate for ‘nutritional status’ over the
full time window, these different questions would need to be classified into an interpretable
variable. This classification was beyond the scope of this thesis; however, the modelling
frameworks presented will allow for such covariates.
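The suggestion made earlier in this section, of placing a knot point at the time a covariate such as puberty occurs, can be illustrated with a small sketch. It assumes a simplified linear truncated-power spline basis rather than the spline function used in the SPLMM, and the function names, fixed knot ages and puberty ages are hypothetical. The point shown is only that the extra knot can differ between individuals, so that each curve is allowed to change slope at that individual's own age at onset.

```python
def linear_spline_terms(age, knots):
    """Return [age, (age - k1)+, (age - k2)+, ...] for the given knot ages,
    where (x)+ is x if x > 0 and 0 otherwise."""
    return [age] + [max(age - k, 0.0) for k in knots]


def terms_with_onset_knot(age, fixed_knots, onset_age):
    """Append an individual-specific knot at the age of onset of the covariate
    (for example, puberty) to the knots shared by all individuals."""
    return linear_spline_terms(age, list(fixed_knots) + [onset_age])


if __name__ == "__main__":
    fixed_knots = [2.0, 8.0]  # hypothetical knot ages shared by everyone
    # Two hypothetical individuals measured at age 13 with different onset ages
    print(terms_with_onset_knot(13.0, fixed_knots, onset_age=11.5))
    print(terms_with_onset_knot(13.0, fixed_knots, onset_age=13.5))
```

The final term is zero until the individual reaches the onset age, so its coefficient measures the change in slope after onset; a SNP interaction with this term would test whether the genetic effect differs before and after onset.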
Although adjustment for covariates is an important part of genetic and epidemiological
research, covariates have not been incorporated in this thesis for two main reasons: firstly, the
covariates that need to be included when modelling BMI trajectories are complex to
incorporate, as outlined above, and secondly, the three cohorts used in the
analysis in this thesis collected different covariates at different follow-up times. To include the
required covariates in the BMI models, further research into the best way to incorporate them
and their generalizability to all three cohorts would be required; this was beyond the scope of
this thesis.
7.3.4 Gene-Environment and Gene-Gene Interactions
The environment plays an important role in BMI and the development of obesity, particularly
diet and exercise. As discussed in Chapter 2, it is difficult to investigate the genetics of BMI
without investigating the environmental determinants simultaneously. In addition, it is unlikely
that a single gene regulates growth; rather, multiple genes are likely to act collectively. An important
future research direction would be to investigate both gene-gene and gene-environment
interactions in association with childhood BMI trajectories. The research in this thesis has
provided a framework that will allow extensions to studies of these interactions; however,
given the statistical power and computational issues, a candidate gene study may initially be
more appropriate than a genome-wide study.
Researchers are beginning to investigate the interactions between known BMI genes and
environmental factors in cross-sectional study designs. For example, previous studies have
shown that the FTO effect was attenuated by approximately 30% in individuals who were more
physically active [323,324,325,326,327]. In addition, studies that examined the interaction of
FTO with dietary intake showed that the effect of FTO on BMI was less pronounced in
individuals with a low calorie intake [328], with healthier diets [329] or who were breast fed for
more than two months [330]. These findings indicate that the FTO effect is sensitive to a healthy
lifestyle in general, but that a healthy lifestyle does not remove the genetic effect altogether. Another study investigated the
association between BMI and 12 BMI/obesity associated variants as a genetic risk score; this
study showed that the effect of the score was 40% less pronounced in individuals who were
physically active [331]. Therefore, it is thought that the effect of these lifestyle modifications
may extend beyond just the FTO genotype. An extension to these studies in a longitudinal setting could be
to study how the timing of onset of a healthy diet or increased physical activity interacts with
specific genes to influence BMI trajectory.
Diet and exercise, as environmental predictors of BMI throughout childhood, were not
investigated in this thesis, as it is difficult to get a consistent measure across the age range of
interest. Although at most of the follow-ups in the Raine Study, diet and physical activity were
measured in some way, it is difficult to combine the measures into one (or a small number of)
time-varying covariates that could be used in a mixed model framework. In addition, the other
cohorts that were used for replication utilized different measures of diet and exercise from the
Raine Study. As gene-environment interactions were not the primary interest of this thesis,
these covariates were not included; however, the framework described in this thesis allows
the inclusion of both time-varying and time-invariant covariates in future studies.
7.3.5 Fine Mapping
For the majority of loci identified to date, including the 32 adult BMI associated loci discussed
in Chapter 6, there is no clear variant or gene that is on the causal pathway to obesity. Some
loci are in regions with multiple genes, whereas others are intergenic. Therefore, work remains
to be done to fine-map these regions and identify the causal genes and loci, such that they can
be followed up in experimental research [282]. Longitudinal methods may be beneficial when
fine mapping a region as they can provide additional information regarding the timing of
onset, which in turn can be used in developing individualized interventions based on an
individual’s genetic profile. In addition, if the association is real, the KCNJ15 region would need
to be sequenced to determine the causal locus affecting childhood growth, and functional data
would need to be investigated. Although the rs2008580 variant was the most significant SNP in the Raine
Study, this was not replicated in the ALSPAC or NFBC66 cohorts and therefore is unlikely to be
the causal locus in the region; however, several SNPs in the region spanning from the DSCR4 to
KCNJ15 genes were significantly associated with BMI trajectory in the meta-analysis of the
three cohorts, indicating there is potentially a genetic variant in the region that influences
childhood growth.
7.3.6 Rare Variants
Imputing against the 1000 Genomes reference panel and conducting gene-based tests, rather than
testing individual common variants, is an inexpensive approach that may highlight additional genomic regions of
interest. Although the SPLMM method may not be suitable for analysis of all 28 million
variants that are imputed using the 1000 Genomes reference panel due to the computational burden, it may be
useful for the gene-based tests used for rare variant analysis
[237,238,239,240,241,242,243]. Using these methods, either the number of variants in the
gene is counted and used as the explanatory variable in the association analysis (known as
burden tests) or mixed effects models are used to test the variance of the random effects. In
addition, the SPLMM could be used for the investigation of focused regions of the genome
from next generation sequence data. Given the SPLMM method is flexible and allows for the
inclusion of covariates, this method would allow the investigation of which rare variants play
an important role in an individual’s pattern of growth.
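The burden-test idea described above, in which the rare variants an individual carries within a gene are counted and the count is used as the explanatory variable, can be sketched as follows. This is an illustrative simplification: the function name, the minor allele frequency threshold of 0.01 and the example genotypes are assumptions made here, and the cited methods differ in how they weight and select variants.

```python
def burden_scores(genotypes, mafs, maf_threshold=0.01):
    """Count minor alleles at rare variants within one gene for each individual.

    genotypes: one list per individual of minor-allele counts (0, 1 or 2),
               with one entry per variant in the gene.
    mafs:      minor allele frequency of each variant, in the same order.
    """
    rare = [j for j, maf in enumerate(mafs) if maf < maf_threshold]
    return [sum(person[j] for j in rare) for person in genotypes]


if __name__ == "__main__":
    # Hypothetical gene with three variants; the second variant is common
    mafs = [0.005, 0.2, 0.008]
    genotypes = [
        [0, 1, 0],
        [1, 0, 2],
        [0, 0, 0],
    ]
    print(burden_scores(genotypes, mafs))
```

The resulting score could then take the place of the single SNP genotype in the longitudinal model, so that one test is carried out per gene rather than per variant.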
7.4 Conclusion
The aim of this thesis was to develop a modelling framework that allowed the detection of
small genetic effects in analysis of complex longitudinal phenotypes, utilizing BMI trajectories
across childhood to develop the framework. Although the methods discussed were applied to
BMI trajectories throughout childhood and adolescence, the SPLMM framework and the use of
robust standard errors are flexible so that they can be translated to the genetic analysis of any
longitudinal phenotype. In addition, the research in this thesis describes a GWAS of childhood
BMI trajectories. One region on chromosome 21, in the KCNJ15 gene, was associated with
higher BMI and a faster rate of growth throughout childhood in males and females. This gene
has not previously been shown to be associated with BMI or risk of obesity; however, it has been
shown to be associated with type 2 diabetes, increased levels of insulin and insulin resistance
and is therefore a biologically plausible gene for BMI growth.
Obesity cost Australian society $8.3 billion in 2008, including direct costs such as loss of
productivity, health system costs and carer costs, and indirect costs such as absenteeism and
taxation revenue forgone [332,333]. These costs include those for subsequent diseases caused
by being obese, such as type 2 diabetes or coronary heart disease. The discovery of genetic
variants associated with growth patterns throughout childhood has the potential of identifying
individuals very early in life who are at risk of becoming obese. Although interventions for
preventing obesity in children were initially shown to be ineffective [334], the most recent
Cochrane review showed that some interventions may have an impact on reducing BMI [335].
It has been shown that the most successful interventions start before age three, where a
successful intervention can have personal benefits, social benefits and government savings
[336]. The interventions that have been successful have impacted many aspects of the
children’s lives, including the environment both at home and at school [335]. Although none of
the SNPs found to date will have clinical utility on its own, they have pointed the way
towards new pathways of obesity development that could be used for pharmacological
manipulation and ultimately therapeutic benefit. Therefore, by incorporating genetic
information when developing an intervention programme, we may be able to offer more
targeted interventions for those high-risk individuals while they are still learning about the
importance of healthy lifestyle factors such as a good diet and physical activity [337]. By
reducing the incidence of obesity through interventions targeting those genetically at risk, the
impact of this disease and other related diseases would also be reduced in the community.
References
1. Barker DJ (1990) The fetal and infant origins of adult disease. BMJ 301: 1111.
2. Gluckman PD, Hanson MA (2006) Developmental Origins of Health and Disease: Springer US.
3. Ben-Shlomo Y, Kuh D (2002) A life course approach to chronic disease epidemiology: conceptual models, empirical challenges and interdisciplinary perspectives. Int J Epidemiol 31: 285-293.
4. Kuh D, Ben-Shlomo Y (2004) A life course approach to chronic disease epidemiology; tracing the origins of ill-health from early to adult life: Oxford: Oxford University Press.
5. Kuh D, Ben-Shlomo Y, Lynch J, Hallqvist J, Power C (2003) Life course epidemiology. J Epidemiol Community Health 57: 778-783.
6. Burton PR, Tobin MD, Hopper JL (2005) Key concepts in genetic epidemiology. Lancet 366: 941-951.
7. Kruglyak L, Nickerson DA (2001) Variation is the spice of life. Nat Genet 27: 234-236.
8. Palmer L, Smith GD, Burton PR (2011) An Introduction to Genetic Epidemiology: Policy Press.
9. Hardy GH (1908) Mendelian Proportions in a Mixed Population. Science 28: 49-50.
10. Weinberg W (1908) Über den Nachweis der Vererbung beim Menschen. Jahresh. Ver. Vaterl. Naturkd. Württemb 64: 369-382.
11. Mitchell AA, Cutler DJ, Chakravarti A (2003) Undetected genotyping errors cause apparent overtransmission of common alleles in the transmission/disequilibrium test. Am J Hum Genet 72: 598-610.
12. Lewontin RC (1964) The Interaction of Selection and Linkage. I. General Considerations; Heterotic Models. Genetics 49: 49-67.
13. Hill WG, Weir BS (1994) Maximum-likelihood estimation of gene location by linkage disequilibrium. Am J Hum Genet 54: 705-714.
14. Excoffier L, Slatkin M (1995) Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Mol Biol Evol 12: 921-927.
15. Clark AG (1990) Inference of haplotypes from PCR-amplified samples of diploid populations. Mol Biol Evol 7: 111-122.
16. Stephens JC, Schneider JA, Tanguay DA, Choi J, Acharya T, et al. (2001) Haplotype variation and linkage disequilibrium in 313 human genes. Science 293: 489-493.
17. Dawn Teare M, Barrett JH (2005) Genetic linkage studies. Lancet 366: 1036-1044.
18. Cordell HJ, Clayton DG (2005) Genetic association studies. Lancet 366: 1121-1131.
19. Metzker ML (2010) Sequencing technologies - the next generation. Nat Rev Genet 11: 31-46.
20. Palmer LJ, Cardon LR (2005) Shaking the tree: mapping complex disease genes with linkage disequilibrium. Lancet 366: 1223-1234.
21. Morton NE (1955) Sequential tests for the detection of linkage. Am J Hum Genet 7: 277-318.
22. Chotai J (1984) On the lod score method in linkage analysis. Ann Hum Genet 48: 359-378.
23. Lander E, Kruglyak L (1995) Genetic dissection of complex traits: guidelines for interpreting and reporting linkage results. Nat Genet 11: 241-247.
24. Blackwelder WC, Elston RC (1985) A comparison of sib-pair linkage tests for disease susceptibility loci. Genet Epidemiol 2: 85-97.
25. Kong A, Cox NJ (1997) Allele-sharing models: LOD scores and accurate linkage tests. Am J Hum Genet 61: 1179-1188.
26. Whittemore AS, Halpern J (1994) A class of tests for linkage using affected pedigree members. Biometrics 50: 118-127.
27. Haseman JK, Elston RC (1972) The investigation of linkage between a quantitative trait and a marker locus. Behav Genet 2: 3-19.
28. Elston RC, Buxbaum S, Jacobs KB, Olson JM (2000) Haseman and Elston revisited. Genet Epidemiol 19: 1-17.
29. Almasy L, Blangero J (1998) Multipoint quantitative-trait linkage analysis in general pedigrees. Am J Hum Genet 62: 1198-1211.
30. George RA, Smith TD, Callaghan S, Hardman L, Pierides C, et al. (2008) General mutation databases: analysis and review. J Med Genet 45: 65-70.
31. Altmuller J, Palmer LJ, Fischer G, Scherb H, Wjst M (2001) Genomewide scans of complex human diseases: true linkage is hard to find. Am J Hum Genet 69: 936-950.
32. Hirschhorn JN, Daly MJ (2005) Genome-wide association studies for common diseases and complex traits. Nat Rev Genet 6: 95-108.
33. Mann CJ (2003) Observational research methods. Research design II: cohort, cross sectional, and case-control studies. Emerg Med J 20: 54-60.
34. Bochud M (2012) Genetics for clinicians: from candidate genes to whole genome scans (technological advances). Best Pract Res Clin Endocrinol Metab 26: 119-132.
35. Hattersley AT, McCarthy MI (2005) What makes a good genetic association study? Lancet 366: 1315-1323.
36. Silverman EK, Palmer LJ (2000) Case-control association studies for the genetics of complex respiratory diseases. Am J Respir Cell Mol Biol 22: 645-648.
37. Price AL, Zaitlen NA, Reich D, Patterson N (2010) New approaches to population stratification in genome-wide association studies. Nat Rev Genet 11: 459-463.
38. Abecasis GR, Cardon LR, Cookson WO (2000) A general test of association for quantitative traits in nuclear families. Am J Hum Genet 66: 279-292.
39. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, et al. (2006) Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38: 904-909.
40. Devlin B, Roeder K (1999) Genomic control for association studies. Biometrics 55: 997-1004.
41. Barrett JC, Cardon LR (2006) Evaluating coverage of genome-wide association studies. Nat Genet 38: 659-662.
42. McCarthy MI, Abecasis GR, Cardon LR, Goldstein DB, Little J, et al. (2008) Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat Rev Genet 9: 356-369.
43. Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, et al. (2009) Finding the missing heritability of complex diseases. Nature 461: 747-753.
44. Bush WS, Moore JH (2012) Chapter 11: Genome-wide association studies. PLoS Comput Biol 8: e1002822.
45. Klein RJ, Zeiss C, Chew EY, Tsai JY, Sackler RS, et al. (2005) Complement factor H polymorphism in age-related macular degeneration. Science 308: 385-389.
46. Wellcome Trust Case Control Consortium (2007) Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447: 661-678.
47. Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, et al. (2009) Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci U S A 106: 9362-9367.
48. Hindorff LA, MacArthur J, Wise A, Junkins HA, Hall PN, et al. (2010) A Catalog of Published Genome-Wide Association Studies. National Human Genome Research Institute.
49. de Bakker PI, Ferreira MA, Jia X, Neale BM, Raychaudhuri S, et al. (2008) Practical aspects of imputation-driven meta-analysis of genome-wide association studies. Hum Mol Genet 17: R122-128.
50. Maher B (2008) Personal genomes: The case of the missing heritability. Nature 456: 18-21.
51. Visscher PM, Brown MA, McCarthy MI, Yang J (2012) Five years of GWAS discovery. Am J Hum Genet 90: 7-24.
52. Marchini J, Howie B, Myers S, McVean G, Donnelly P (2007) A new multipoint method for genome-wide association studies by imputation of genotypes. Nat Genet 39: 906-913.
53. Li Y, Willer CJ, Ding J, Scheet P, Abecasis GR (2010) MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet Epidemiol 34: 816-834.
54. Scheet P, Stephens M (2006) A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am J Hum Genet 78: 629-644.
55. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, et al. (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 81: 559-575.
56. Browning SR, Browning BL (2007) Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am J Hum Genet 81: 1084-1097.
57. Biernacka JM, Tang R, Li J, McDonnell SK, Rabe KG, et al. (2009) Assessment of genotype imputation methods. BMC Proc 3 Suppl 7: S5.
58. Pei YF, Li J, Zhang L, Papasian CJ, Deng HW (2008) Analyses and comparison of accuracy of different genotype imputation methods. PLoS One 3: e3551.
59. Li N, Stephens M (2003) Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics 165: 2213-2233.
60. The International HapMap Consortium (2003) The International HapMap Project. Nature 426: 789-796.
61. Frazer KA, Ballinger DG, Cox DR, Hinds DA, Stuve LL, et al. (2007) A second generation human haplotype map of over 3.1 million SNPs. Nature 449: 851-861.
62. Altshuler DM, Gibbs RA, Peltonen L, Dermitzakis E, Schaffner SF, et al. (2010) Integrating common and rare genetic variation in diverse human populations. Nature 467: 52-58.
63. Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM, et al. (2010) A map of human genome variation from population-scale sequencing. Nature 467: 1061-1073.
64. Li Y, Willer C, Sanna S, Abecasis G (2009) Genotype imputation. Annu Rev Genomics Hum Genet 10: 387-406.
65. Huang L, Li Y, Singleton AB, Hardy JA, Abecasis G, et al. (2009) Genotype-imputation accuracy across worldwide human populations. Am J Hum Genet 84: 235-250.
66. Jostins L, Morley KI, Barrett JC (2011) Imputation of low-frequency variants using the HapMap3 benefits from large, diverse reference sets. Eur J Hum Genet 19: 662-666.
67. Howie B, Marchini J, Stephens M (2011) Genotype imputation with thousands of genomes. G3 (Bethesda) 1: 457-470.
68. Guan Y, Stephens M (2008) Practical issues in imputation-based association mapping. PLoS Genet 4: e1000279.
69. Servin B, Stephens M (2007) Imputation-based analysis of association studies: candidate regions and quantitative traits. PLoS Genet 3: e114.
70. Zeggini E, Ioannidis JP (2009) Meta-analysis in genome-wide association studies. Pharmacogenomics 10: 191-201.
71. Marchini J, Howie B (2010) Genotype imputation for genome-wide association studies. Nat Rev Genet 11: 499-511.
72. Speliotes EK, Willer CJ, Berndt SI, Monda KL, Thorleifsson G, et al. (2010) Association analyses of 249,796 individuals reveal 18 new loci associated with body mass index. Nat Genet 42: 937-948.
73. Freathy RM, Mook-Kanamori DO, Sovio U, Prokopenko I, Timpson NJ, et al. (2010) Variants in ADCY5 and near CCNL1 are associated with fetal growth and birth weight. Nat Genet 42: 430-435.
74. Barrett JC (2010) Genotype Imputation Enables Powerful Combined Analyses of Genome-Wide Association Studies. Illumina (http://www.illumina.com/Documents/products/appnotes/appnote_imputation.pdf).
75. Igl BW, Konig IR, Ziegler A (2009) What do we mean by 'replication' and 'validation' in genome-wide association studies? Hum Hered 67: 66-68.
76. Chanock SJ, Manolio T, Boehnke M, Boerwinkle E, Hunter DJ, et al. (2007) Replicating genotype-phenotype associations. Nature 447: 655-660.
77. Kraft P (2008) Curses--winner's and otherwise--in genetic epidemiology. Epidemiology 19: 649-651; discussion 657-648.
78. Almasy L, Amos CI, Bailey-Wilson JE, Cantor RM, Jaquish CE, et al. (2003) Proceedings of the Genetic Analysis Workshop 13: analysis of longitudinal family data for complex diseases and related risk factors. November 11-14, 2002. New Orleans, Louisiana, USA. BMC Genet 4 Suppl 1: S1-106.
79. Almasy L, Cupples LA, Daw EW, Levy D, Thomas D, et al. (2003) Proceedings of the Genetic Analysis Workshop 13: Summarizing analysis of longitudinal family data for complex diseases and related risk factors. New Orleans, Louisiana, USA. November 11-14, 2002. Genet Epidemiol 25 Suppl 1: S1-97.
80. MacCluer JW, Amos CI, Gregersen PK, Heard-Costa N, Lee M, et al. (2009) Genetic Analysis Workshop 16: introduction to workshop summaries. Genet Epidemiol 33 Suppl 1: S1-7.
81. Gauderman WJ, Macgregor S, Briollais L, Scurrah K, Tobin M, et al. (2003) Longitudinal data analysis in pedigree studies. Genetic Epidemiology 25: S18-S28.
82. Briollais L, Tzontcheva A, Bull S (2003) Multilevel modeling for the analysis of longitudinal blood pressure data in the Framingham Heart Study pedigrees. BMC Genet 4 Suppl 1: S19.
83. Gee C, Morrison JL, Thomas DC, Gauderman WJ (2003) Segregation and linkage analysis for longitudinal measurements of a quantitative trait. BMC Genet 4 Suppl 1: S21.
84. Palmer LJ, Scurrah KJ, Tobin M, Patel SR, Celedon JC, et al. (2003) Genome-wide linkage analysis of longitudinal phenotypes using sigma2A random effects (SSARs) fitted by Gibbs sampling. BMC Genet 4 Suppl 1: S12.
85. Kerner B, North KE, Fallin MD (2009) Use of longitudinal data in genetic studies in the genome-wide association studies era: summary of Group 14. Genet Epidemiol 33 Suppl 1: S93-98.
86. Chang SW, Choi SH, Li K, Fleur RS, Huang C, et al. (2009) Growth mixture modeling as an exploratory analysis tool in longitudinal quantitative trait loci analysis. BMC Proc 3 Suppl 7: S112.
87. Kerner B, Muthen BO (2009) Growth mixture modelling in families of the Framingham Heart Study. BMC Proc 3 Suppl 7: S114.
88. Luan J, Kerner B, Zhao JH, Loos RJ, Sharp SJ, et al. (2009) A multilevel linear mixed model of the association between candidate genes and weight and body mass index using the Framingham longitudinal family data. BMC Proc 3 Suppl 7: S115.
89. Roslin NM, Hamid JS, Paterson AD, Beyene J (2009) Genome-wide association analysis of cardiovascular-related quantitative traits in the Framingham Heart Study. BMC Proc 3 Suppl 7: S117.
90. Zhu W, Cho K, Chen X, Zhang M, Wang M, et al. (2009) A genome-wide association analysis of Framingham Heart Study longitudinal data using multivariate adaptive splines. BMC Proc 3 Suppl 7: S119.
91. Furlotte NA, Eskin E, Eyheramendy S (2012) Genome-wide association mapping with longitudinal data. Genet Epidemiol 36: 463-471.
92. Fan R, Zhang Y, Albert PS, Liu A, Wang Y, et al. (2012) Longitudinal Association Analysis of Quantitative Traits. Genet Epidemiol.
93. Haslam DW, James WP (2005) Obesity. Lancet 366: 1197-1209.
94. World Health Organization (2006) Obesity and Overweight Fact Sheet.
95. Australian Bureau of Statistics (2007-2008) National Health Survey: Summary of Results.
96. Griffiths LJ, Parsons TJ, Hill AJ (2010) Self-esteem and quality of life in obese children and adolescents: a systematic review. Int J Pediatr Obes 5: 282-304.
97. Tsiros MD, Olds T, Buckley JD, Grimshaw P, Brennan L, et al. (2009) Health-related quality of life in obese children and adolescents. Int J Obes (Lond) 33: 387-400.
98. Lawlor DA, Mamun AA, O'Callaghan MJ, Bor W, Williams GM, et al. (2005) Is being overweight associated with behavioural problems in childhood and adolescence? Findings from the Mater-University study of pregnancy and its outcomes. Arch Dis Child 90: 692-697.
99. Sawyer MG, Miller-Lewis L, Guy S, Wake M, Canterford L, et al. (2006) Is there a relationship between overweight and obesity and mental health problems in 4- to 5-year-old Australian children? Ambul Pediatr 6: 306-311.
100. Srinivasan SR, Myers L, Berenson GS (2006) Changes in metabolic syndrome variables since childhood in prehypertensive and hypertensive subjects: the Bogalusa Heart Study. Hypertension 48: 33-39.
101. Bradford NF (2009) Overweight and obesity in children and adolescents. Prim Care 36: 319-339.
102. Kindblom JM, Lorentzon M, Hellqvist A, Lonn L, Brandberg J, et al. (2009) BMI changes during childhood and adolescence as predictors of amount of adult subcutaneous and visceral adipose tissue in men: the GOOD Study. Diabetes 58: 867-874.
103. Serdula MK, Ivery D, Coates RJ, Freedman DS, Williamson DF, et al. (1993) Do obese children become obese adults? A review of the literature. Prev Med 22: 167-177.
104. Booth ML, Chey T, Wake M, Norton K, Hesketh K, et al. (2003) Change in the prevalence of overweight and obesity among young Australians, 1969-1997. Am J Clin Nutr 77: 29-36.
105. Olds TS, Tomkinson GR, Ferrar KE, Maher CA (2010) Trends in the prevalence of childhood overweight and obesity in Australia between 1985 and 2008. Int J Obes (Lond) 34: 57-66.
106. Gerritsen S, Stefanogiannis N, Galloway Y, Devlin M, Templeton R, et al. (2008) A Portrait of Health - Key Results of the 2006/07 New Zealand Health Survey. In: Ministry of Health: Wellington NZ, editor.
107. Ogden CL, Carroll MD, Flegal KM (2008) High body mass index for age among US children and adolescents, 2003-2006. JAMA 299: 2401-2405.
108. Sjoberg A, Lissner L, Albertsson-Wikland K, Marild S (2008) Recent anthropometric trends among Swedish school children: evidence for decreasing prevalence of overweight in girls. Acta Paediatr 97: 118-123.
109. Peneau S, Salanave B, Maillard-Teyssier L, Rolland-Cachera MF, Vergnaud AC, et al. (2009) Prevalence of overweight in 6- to 15-year-old children in central/western France from 1996 to 2006: trends toward stabilization. Int J Obes (Lond) 33: 401-407.
110. Farooqi IS, O'Rahilly S (2005) Monogenic obesity in humans. Annu Rev Med 56: 443-458.
111. Delrue MA, Michaud JL (2004) Fat chance: genetic syndromes with obesity. Clin Genet 66: 83-93.
112. Hall DM, Cole TJ (2006) What use is the BMI? Arch Dis Child 91: 283-286.
113. Cole TJ, Bellizzi MC, Flegal KM, Dietz WH (2000) Establishing a standard definition for child overweight and obesity worldwide: international survey. BMJ 320: 1240-1243.
114. WHO (2000) Obesity: preventing and managing the global epidemic. Report of a WHO Consultation. WHO Technical Report Series 894. Geneva: World Health Organization, 2000.
115. Borghi E, de Onis M, Garza C, Van den Broeck J, Frongillo EA, et al. (2006) Construction of the World Health Organization child growth standards: selection of methods for attained growth curves. Stat Med 25: 247-265.
116. Field CJ (2009) Early risk determinants and later health outcomes: implications for research prioritization and the food supply. Summary of the workshop. Am J Clin Nutr 89: 1533S-1539S.
117. Newnham JP, Pennell CE, Lye SJ, Rampono J, Challis JR (2009) Early life origins of obesity. Obstet Gynecol Clin North Am 36: 227-244, xii.
118. Dietz WH (1994) Critical periods in childhood for the development of obesity. Am J Clin Nutr 59: 955-959.
119. Adair LS (2008) Child and adolescent obesity: epidemiology and developmental perspectives. Physiol Behav 94: 8-16.
120. Monteiro PO, Victora CG (2005) Rapid growth in infancy and childhood and obesity in later life--a systematic review. Obes Rev 6: 143-154.
121. Baird J, Fisher D, Lucas P, Kleijnen J, Roberts H, et al. (2005) Being big or growing fast: systematic review of size and growth in infancy and later obesity. BMJ 331: 929.
122. Mook-Kanamori DO, Durmus B, Sovio U, Hofman A, Raat H, et al. (2011) Fetal and infant growth and the risk of obesity during early childhood: the Generation R Study. Eur J Endocrinol 165: 623-630.
123. Owen CG, Martin RM, Whincup PH, Davey-Smith G, Gillman MW, et al. (2005) The effect of breastfeeding on mean body mass index throughout life: a quantitative review of published and unpublished observational evidence. Am J Clin Nutr 82: 1298-1307.
124. Horta BL, Bahl R, Martines JC, Victora CG (2007) Evidence on the long-term effects of breastfeeding: systematic review and meta-analysis. Geneva, Switzerland: World Health Organization.
125. Briefel R, Ziegler P, Novak T, Ponza M (2006) Feeding Infants and Toddlers Study: characteristics and usual nutrient intake of Hispanic and non-Hispanic infants and toddlers. J Am Diet Assoc 106: S84-95.
126. Mihrshahi S, Battistutta D, Magarey A, Daniels LA (2011) Determinants of rapid weight gain during infancy: baseline results from the NOURISH randomised controlled trial. BMC Pediatr 11: 99.
127. Zimmerman FJ, Christakis DA, Meltzoff AN (2007) Television and DVD/video viewing in children younger than 2 years. Arch Pediatr Adolesc Med 161: 473-479.
128. Silverwood RJ, De Stavola BL, Cole TJ, Leon DA (2009) BMI peak in infancy as a predictor for later BMI in the Uppsala Family Study. Int J Obes (Lond) 33: 929-937.
129. Sovio U, Timpson NJ, Warrington NM, Briollais L, Mook-Kanamori D, et al. (2009) Association Between FTO Polymorphism, Adiposity Peak and Adiposity Rebound in The Northern Finland Birth Cohort 1966. Atherosclerosis 207: e4-e5.
130. He Q, Karlberg J (2002) Probability of adult overweight and risk change during the BMI rebound period. Obes Res 10: 135-140.
131. Rolland-Cachera MF, Deheeger M, Bellisle F, Sempe M, Guilloud-Bataille M, et al. (1984) Adiposity rebound in children: a simple indicator for predicting obesity. Am J Clin Nutr 39: 129-135.
132. Rolland-Cachera MF, Deheeger M, Maillot M, Bellisle F (2006) Early adiposity rebound: causes and consequences for obesity in children and adults. Int J Obes (Lond) 30 Suppl 4: S11-17.
133. Whitaker RC, Pepe MS, Wright JA, Seidel KD, Dietz WH (1998) Early adiposity rebound and the risk of adult obesity. Pediatrics 101: E5.
134. Bhargava SK, Sachdev HS, Fall CH, Osmond C, Lakshmy R, et al. (2004) Relation of serial changes in childhood body-mass index to impaired glucose tolerance in young adulthood. N Engl J Med 350: 865-875.
135. Eriksson JG, Forsen T, Tuomilehto J, Osmond C, Barker DJ (2003) Early adiposity rebound in childhood and risk of Type 2 diabetes in adult life. Diabetologia 46: 190-194.
136. Taylor RW, Grant AM, Goulding A, Williams SM (2005) Early adiposity rebound: review of papers linking this to subsequent obesity in children and adults. Curr Opin Clin Nutr Metab Care 8: 607-612.
137. Rolland-Cachera MF, Peneau S (2013) Growth trajectories associated with adult obesity. World Rev Nutr Diet 106: 127-134.
138. Freedman DS, Kettel Khan L, Serdula MK, Srinivasan SR, Berenson GS (2001) BMI rebound, childhood height and obesity among adults: the Bogalusa Heart Study. Int J Obes Relat Metab Disord 25: 543-549.
139. Taylor RW, Goulding A, Lewis-Barned NJ, Williams SM (2004) Rate of fat gain is faster in girls undergoing early adiposity rebound. Obes Res 12: 1228-1230.
140. Williams SM, Goulding A (2009) Patterns of growth associated with the timing of adiposity rebound. Obesity (Silver Spring) 17: 335-341.
141. Dorosty AR, Emmett PM, Cowin S, Reilly JJ (2000) Factors associated with early adiposity rebound. ALSPAC Study Team. Pediatrics 105: 1115-1118.
142. Janz KF, Levy SM, Burns TL, Torner JC, Willing MC, et al. (2002) Fatness, physical activity, and television viewing in children during the adiposity rebound period: the Iowa Bone Development Study. Prev Med 35: 563-571.
143. Deheeger M, Rolland-Cachera MF (2004) [Longitudinal study of anthropometric measurements in Parisian children aged ten months to 18 years]. Arch Pediatr 11: 1139-1144.
144. Cole TJ (2004) Children grow and horses race: is the adiposity rebound a critical period for later obesity? BMC Pediatr 4: 6.
145. Peto R (1981) The horse-racing effect. Lancet 2: 467-468.
146. Ong KK, Northstone K, Wells JC, Rubin C, Ness AR, et al. (2007) Earlier mother's age at menarche predicts rapid infancy growth and childhood obesity. PLoS Med 4: e132.
147. van Lenthe FJ, Kemper CG, van Mechelen W (1996) Rapid maturation in adolescence results in greater obesity in adulthood: the Amsterdam Growth and Health Study. Am J Clin Nutr 64: 18-24.
148. Garn SM, LaVelle M, Rosenberg KR, Hawthorne VM (1986) Maturational timing as a factor in female fatness and obesity. Am J Clin Nutr 43: 879-883.
149. Freedman DS, Khan LK, Serdula MK, Dietz WH, Srinivasan SR, et al. (2003) The relation of menarcheal age to obesity in childhood and adulthood: the Bogalusa heart study. BMC Pediatr 3: 3.
150. Must A, Naumova EN, Phillips SM, Blum M, Dawson-Hughes B, et al. (2005) Childhood overweight and maturational timing in the development of adult overweight and fatness: the Newton Girls Study and its follow-up. Pediatrics 116: 620-627.
151. dos Santos Silva I, De Stavola BL, Mann V, Kuh D, Hardy R, et al. (2002) Prenatal factors, childhood growth trajectories and age at menarche. Int J Epidemiol 31: 405-412.
152. Mumby HS, Elks CE, Li S, Sharp SJ, Khaw KT, et al. (2011) Mendelian Randomisation Study of Childhood BMI and Early Menarche. J Obes 2011: 180729.
153. Tanner JM, Whitehouse RH, Marubini E, Resele LF (1976) The adolescent growth spurt of boys and girls of the Harpenden growth study. Ann Hum Biol 3: 109-126.
154. Zacharias L, Rand WM (1983) Adolescent growth in height and its relation to menarche in contemporary American girls. Ann Hum Biol 10: 209-222.
155. Tanner JM, Davies PS (1985) Clinical longitudinal standards for height and height velocity for North American children. J Pediatr 107: 317-329.
156. Elks CE, Perry JR, Sulem P, Chasman DI, Franceschini N, et al. (2010) Thirty new loci for age at menarche identified by a meta-analysis of genome-wide association studies. Nat Genet 42: 1077-1085.
157. Cousminer DL, Berry DJ, Timpson NJ, Ang W, Thiering E, et al. (2013) Genome-wide association and longitudinal analyses reveal genetic loci linking pubertal height growth, pubertal timing and childhood adiposity. Hum Mol Genet.
158. Jasik CB, Lustig RH (2008) Adolescent obesity and puberty: the "perfect storm". Ann N Y Acad Sci 1135: 265-279.
159. Crespo CJ, Smit E, Troiano RP, Bartlett SJ, Macera CA, et al. (2001) Television watching, energy intake, and obesity in US children: results from the third National Health and Nutrition Examination Survey, 1988-1994. Arch Pediatr Adolesc Med 155: 360-365.
160. Berkey CS, Rockett HR, Field AE, Gillman MW, Frazier AL, et al. (2000) Activity, dietary intake, and weight changes in a longitudinal study of preadolescent and adolescent boys and girls. Pediatrics 105: E56.
161. Berkey CS, Rockett HR, Gillman MW, Field AE, Colditz GA (2003) Longitudinal study of skipping breakfast and weight change in adolescents. Int J Obes Relat Metab Disord 27: 1258-1266.
162. Eaton DK, Kann L, Kinchen S, Ross J, Hawkins J, et al. (2006) Youth risk behavior surveillance--United States, 2005. MMWR Surveill Summ 55: 1-108.
163. Richardson LP, Garrison MM, Drangsholt M, Mancl L, LeResche L (2006) Associations between depressive symptoms and obesity during puberty. Gen Hosp Psychiatry 28: 313-320.
164. Goodman E, Whitaker RC (2002) A prospective study of the role of depression in the development and persistence of adolescent obesity. Pediatrics 110: 497-504.
165. Maes HH, Neale MC, Eaves LJ (1997) Genetic and environmental factors in relative body weight and human adiposity. Behav Genet 27: 325-351.
166. Haworth CM, Carnell S, Meaburn EL, Davis OS, Plomin R, et al. (2008) Increasing heritability of BMI and stronger associations with the FTO gene over childhood. Obesity (Silver Spring) 16: 2663-2668.
167. Wardle J, Carnell S, Haworth CM, Plomin R (2008) Evidence for a strong genetic influence on childhood adiposity despite the force of the obesogenic environment. Am J Clin Nutr 87: 398-404.
168. Parsons TJ, Power C, Logan S, Summerbell CD (1999) Childhood predictors of adult obesity: a systematic review. Int J Obes Relat Metab Disord 23 Suppl 8: S1-107.
169. Elks CE, den Hoed M, Zhao JH, Sharp SJ, Wareham NJ, et al. (2012) Variability in the heritability of body mass index: a systematic review and meta-regression. Front Endocrinol (Lausanne) 3: 29.
170. Farooqi IS (2005) Genetic and hereditary aspects of childhood obesity. Best Pract Res Clin Endocrinol Metab 19: 359-374.
171. Rankinen T, Zuberi A, Chagnon YC, Weisnagel SJ, Argyropoulos G, et al. (2006) The human obesity gene map: the 2005 update. Obesity (Silver Spring) 14: 529-644.
172. Ichihara S, Yamada Y (2008) Genetic factors for human obesity. Cell Mol Life Sci 65: 1086-1098.
173. Dina C, Meyre D, Gallina S, Durand E, Korner A, et al. (2007) Variation in FTO contributes to childhood obesity and severe adult obesity. Nat Genet 39: 724-726.
174. Frayling TM, Timpson NJ, Weedon MN, Zeggini E, Freathy RM, et al. (2007) A common variant in the FTO gene is associated with body mass index and predisposes to childhood and adult obesity. Science 316: 889-894.
175. Loos RJ, Lindgren CM, Li S, Wheeler E, Zhao JH, et al. (2008) Common variants near MC4R are associated with fat mass, weight and risk of obesity. Nat Genet 40: 768-775.
176. Scuteri A, Sanna S, Chen WM, Uda M, Albai G, et al. (2007) Genome-wide association scan shows genetic variants in the FTO gene are associated with obesity-related traits. PLoS Genet 3: e115.
177. Chambers JC, Elliott P, Zabaneh D, Zhang W, Li Y, et al. (2008) Common genetic variation near MC4R is associated with waist circumference and insulin resistance. Nat Genet 40: 716-718.
178. Willer CJ, Speliotes EK, Loos RJ, Li S, Lindgren CM, et al. (2009) Six new loci associated with body mass index highlight a neuronal influence on body weight regulation. Nat Genet 41: 25-34.
179. Thorleifsson G, Walters GB, Gudbjartsson DF, Steinthorsdottir V, Sulem P, et al. (2009) Genome-wide association yields new sequence variants at seven loci that associate with measures of obesity. Nat Genet 41: 18-24.
180. Liu JZ, Medland SE, Wright MJ, Henders AK, Heath AC, et al. (2010) Genome-wide association study of height and body mass index in Australian twin families. Twin Res Hum Genet 13: 179-193.
181. Wang KS, Liu X, Zheng S, Zeng M, Pan Y, et al. (2012) A novel locus for body mass index on 5p15.2: a meta-analysis of two genome-wide association studies. Gene 500: 80-84.
182. Fox CS, Heard-Costa N, Cupples LA, Dupuis J, Vasan RS, et al. (2007) Genome-wide association to body mass index and waist circumference: the Framingham Heart Study 100K project. BMC Med Genet 8 Suppl 1: S18.
183. Bradfield JP, Taal HR, Timpson NJ, Scherag A, Lecoeur C, et al. (2012) A genome-wide association meta-analysis identifies new childhood obesity loci. Nat Genet 44: 526-531.
184. Newnham JP, Evans SF, Michael CA, Stanley FJ, Landau LI (1993) Effects of frequent ultrasound during pregnancy: a randomised controlled trial. Lancet 342: 887-891.
185. Williams LA, Evans SF, Newnham JP (1997) Prospective cohort study of factors influencing the relative weights of the placenta and the newborn infant. BMJ 314: 1864-1868.
186. Evans S, Newnham J, MacDonald W, Hall C (1996) Characterisation of the possible effect on birthweight following frequent prenatal ultrasound examinations. Early Hum Dev 45: 203-214.
187. Huang RC, Burke V, Newnham JP, Stanley FJ, Kendall GE, et al. (2007) Perinatal and childhood origins of cardiovascular disease. Int J Obes (Lond) 31: 236-244.
188. Boyd A, Golding J, Macleod J, Lawlor DA, Fraser A, et al. (2013) Cohort Profile: the 'children of the 90s'--the index offspring of the Avon Longitudinal Study of Parents and Children. Int J Epidemiol 42: 111-127.
189. Howe LD, Tilling K, Benfield L, Logue J, Sattar N, et al. (2010) Changes in ponderal index and body mass index across childhood and their associations with fat mass and cardiovascular risk factors at age 15. PLoS One 5: e15186.
190. Howe LD, Tilling K, Lawlor DA (2009) Accuracy of height and weight data from child health records. Arch Dis Child 94: 950-954.
191. Dubois L, Girard M (2007) Accuracy of maternal reports of pre-schoolers' weights and heights as estimates of BMI values. Int J Epidemiol 36: 132-138.
192. Paternoster L, Zhurov AI, Toma AM, Kemp JP, St Pourcain B, et al. (2012) Genome-wide association study of three-dimensional facial morphology identifies a variant in PAX3 associated with nasion position. Am J Hum Genet 90: 478-485.
193. Taal HR, St Pourcain B, Thiering E, Das S, Mook-Kanamori DO, et al. (2012) Common variants at 12q15 and 12q24 are associated with infant head circumference. Nat Genet 44: 532-538.
194. Rantakallio P (1988) The longitudinal study of the northern Finland birth cohort of 1966. Paediatr Perinat Epidemiol 2: 59-88.
195. Howie BN, Donnelly P, Marchini J (2009) A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet 5: e1000529.
196. Sabatti C, Service SK, Hartikainen AL, Pouta A, Ripatti S, et al. (2009) Genome-wide association analysis of metabolic traits in a birth cohort from a founder population. Nat Genet 41: 35-46.
197. Warrington NM, Wu YY, Pennell CE, Marsh JA, Beilin LJ, et al. (2013) Modelling BMI Trajectories in Children for Genetic Association Studies. PLoS One 8: e53897.
198. Jiao H, Arner P, Hoffstedt J, Brodin D, Dubern B, et al. (2011) Genome wide association study identifies KCNMA1 contributing to human obesity. BMC Med Genomics 4: 51.
199. Wang K, Li WD, Zhang CK, Wang Z, Glessner JT, et al. (2011) A genome-wide association study on obesity and obesity-related traits. PLoS One 6: e18939.
200. Meyre D, Delplanque J, Chevre JC, Lecoeur C, Lobbens S, et al. (2009) Genome-wide association study for early-onset and morbid adult obesity identifies three new risk loci in European populations. Nat Genet 41: 157-159.
201. Paternoster L, Evans DM, Aagaard Nohr E, Holst C, Gaborieau V, et al. (2011) Genome-Wide Population-Based Association Study of Extremely Overweight Young Adults - The GOYA Study. PLoS One 6: e24303.
202. Cotsapas C, Speliotes EK, Hatoum IJ, Greenawalt DM, Dobrin R, et al. (2009) Common body mass index-associated variants confer risk of extreme obesity. Hum Mol Genet 18: 3502-3507.
203. Zhao J, Bradfield JP, Li M, Wang K, Zhang H, et al. (2009) The role of obesity-associated loci identified in genome-wide association studies in the determination of pediatric BMI. Obesity (Silver Spring) 17: 2254-2257.
204. den Hoed M, Ekelund U, Brage S, Grontved A, Zhao JH, et al. (2010) Genetic susceptibility to obesity and related traits in childhood and adolescence: influence of loci identified by genome-wide association studies. Diabetes 59: 2980-2988.
205. Hardy R, Wills AK, Wong A, Elks CE, Wareham NJ, et al. (2010) Life course variations in the associations between FTO and MC4R gene variants and body size. Hum Mol Genet 19: 545-552.
206. Elks CE, Loos RJ, Sharp SJ, Langenberg C, Ring SM, et al. (2010) Genetic markers of adult obesity risk are associated with greater early infancy weight gain and growth. PLoS Med 7: e1000284.
207. Heard-Costa NL, Zillikens MC, Monda KL, Johansson A, Harris TB, et al. (2009) NRXN3 is a novel locus for waist circumference: a genome-wide association study from the CHARGE Consortium. PLoS Genet 5: e1000539.
208. Lindgren CM, Heid IM, Randall JC, Lamina C, Steinthorsdottir V, et al. (2009) Genome-wide association scan meta-analysis identifies three Loci influencing adiposity and fat distribution. PLoS Genet 5: e1000508.
209. Belsky DW, Moffitt TE, Houts R, Bennett GG, Biddle AK, et al. (2012) Polygenic risk, rapid childhood growth, and the development of obesity: evidence from a 4-decade longitudinal study. Arch Pediatr Adolesc Med 166: 515-521.
210. Preece MA, Baines MJ (1978) A new family of mathematical models describing the human growth curve. Ann Hum Biol 5: 1-24.
211. Gasser T, Kohler W, Muller HG, Kneip A, Largo R, et al. (1984) Velocity and acceleration of height growth using kernel estimation. Ann Hum Biol 11: 397-411.
212. Stutzle W, Gasser T, Molinari L, Largo RH, Prader A, et al. (1980) Shape-invariant modelling of human growth. Ann Hum Biol 7: 507-528.
213. Largo RH, Gasser T, Prader A, Stuetzle W, Huber PJ (1978) Analysis of the adolescent growth spurt using smoothing spline functions. Ann Hum Biol 5: 421-434.
214. Berkey CS, Reed RB, Valadian I (1983) Midgrowth spurt in height of Boston children. Ann Hum Biol 10: 25-30.
215. Laird NM, Ware JH (1982) Random-effects models for longitudinal data. Biometrics 38: 963-974.
216. Milani S, Bossi A, Marubini E (1989) Individual growth curves and longitudinal growth charts between 0 and 3 years. Acta Paediatr Scand Suppl 350: 95-104.
217. Lachos VH, Ghosh P, Arellano-Valle RB (2010) Likelihood based inference for skew-normal independent linear mixed model. Statistica Sinica 20: 303-322.
218. Azzalini A, Capitanio A (1999) Statistical applications of the multivariate skew normal distribution. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 61: 579-602.
219. Song PXK, Zhang PQA (2007) Maximum likelihood inference in robust linear mixed-effect models using multivariate t distributions. Statistica Sinica 17: 929-943.
220. Cole TJ, Donaldson MD, Ben-Shlomo Y (2010) SITAR--a useful instrument for growth curve analysis. Int J Epidemiol 39: 1558-1566.
221. Efron B, Tibshirani RJ (1994) An Introduction to the Bootstrap: Taylor & Francis.
222. Ihaka R, Gentleman R (1996) R: A Language for Data Analysis and Graphics. Journal of Computational and Graphical Statistics 5: 299-314.
223. Littell RC, Milliken GA, Stroup WW, Wolfinger RD, Schabenberger O (2006) SAS for Mixed Models: SAS Institute.
224. Henderson CR (1975) Best linear unbiased estimation and prediction under a selection model. Biometrics 31: 423-447.
225. Cheng J, Edwards LJ, Maldonado-Molina MM, Komro KA, Muller KE (2010) Real longitudinal data analysis for real people: building a good enough mixed model. Stat Med 29: 504-520.
226. Pinheiro JC, Liu C, Wu YN (2001) Efficient Algorithms for Robust Estimation in Linear Mixed-Effects Models Using the Multivariate t-Distribution. Journal of Computational and Graphical Statistics 10: 249-276.
227. Arellano-Valle RB, Bolfarine H, Lachos VH (2005) Skew-normal Linear Mixed Models. Journal of Data Science 3: 415-438.
228. Lin TI, Lee JC (2008) Estimation and prediction in linear mixed models with skew-normal random effects for longitudinal data. Statistics in Medicine 27: 1490-1507.
229. Lachos VH, Bolfarine H, Arellano-Valle RB, Montenegro LC (2007) Likelihood based Inference for Multivariate Skew-Normal Regression Models. Communications in Statistics - Theory and Methods 36: 1769-1786.
230. Azzalini A, Dalla-Valle A (1996) The multivariate skew-normal distribution. Biometrika 83: 715-726.
231. Janssens AC, Aulchenko YS, Elefante S, Borsboom GJ, Steyerberg EW, et al. (2006) Predictive testing for complex diseases using multiple genes: fact or fiction? Genet Med 8: 395-400.
232. Janssens AC, Moonesinghe R, Yang Q, Steyerberg EW, van Duijn CM, et al. (2007) The impact of genotype frequencies on the clinical validity of genomic profiling for predicting common chronic diseases. Genet Med 9: 528-535.
233. Xu R (2003) Measuring explained variation in linear mixed effects models. Stat Med 22: 3527-3541.
234. Brookfield JF (2013) Quantitative genetics: heritability is not always missing. Curr Biol 23: R276-278.
235. Llewellyn CH, Trzaskowski M, Plomin R, Wardle J (2013) Finding the missing heritability in pediatric obesity: the contribution of genome-wide complex trait analysis. Int J Obes (Lond).
236. Chaufan C, Joseph J (2013) The 'missing heritability' of common disorders: should health researchers care? Int J Health Serv 43: 281-303.
237. Morris AP, Zeggini E (2010) An evaluation of statistical approaches to rare variant analysis in genetic association studies. Genet Epidemiol 34: 188-193.
238. Li B, Leal SM (2008) Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet 83: 311-321.
239. Madsen BE, Browning SR (2009) A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet 5: e1000384.
240. Liu DJ, Leal SM (2010) A novel adaptive method for the analysis of next-generation sequencing data to detect complex trait associations with rare variants due to gene main effects and interactions. PLoS Genet 6: e1001156.
241. Bhatia G, Bansal V, Harismendy O, Schork NJ, Topol EJ, et al. (2010) A covering method for detecting genetic associations between rare variants and common phenotypes. PLoS Comput Biol 6: e1000954.
242. Neale BM, Rivas MA, Voight BF, Altshuler D, Devlin B, et al. (2011) Testing for an unusual distribution of rare variants. PLoS Genet 7: e1001322.
243. Wu MC, Lee S, Cai T, Li Y, Boehnke M, et al. (2011) Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet 89: 82-93.
244. Lee PH, Shatkay H (2009) An integrative scoring system for ranking SNPs by their potential deleterious effects. Bioinformatics 25: 1048-1055.
245. Sikorska K, Rivadeneira F, Groenen PJ, Hofman A, Uitterlinden AG, et al. (2013) Fast linear mixed model computations for genome-wide association studies with longitudinal data. Stat Med 32: 165-180.
246. Smith EN, Chen W, Kahonen M, Kettunen J, Lehtimaki T, et al. (2010) Longitudinal genome-wide association of cardiovascular disease risk factors in the Bogalusa heart study. PLoS Genet 6.
247. Benke KS, Wu Y, Fallin DM, Maher B, Palmer LJ (2013) Strategy to control type I error increases power to identify genetic variation using the full biological trajectory. Genet Epidemiol 37: 419-430.
248. Park YM, Province MA, Gao X, Feitosa M, Wu J, et al. (2009) Longitudinal trends in the association of metabolic syndrome with 550 k single-nucleotide polymorphisms in the Framingham Heart Study. BMC Proc 3 Suppl 7: S116.
249. Fradin DD, Fallin MD (2009) Influence of control selection in genome-wide association studies: the example of diabetes in the Framingham Heart Study. BMC Proc 3 Suppl 7: S113.
250. Verbeke G, Spiessens B, Lesaffre E (2001) Conditional Linear Mixed Models. The American Statistician 55: 25-34.
251. Sovio U, Mook-Kanamori DO, Warrington NM, Lawrence R, Briollais L, et al. (2011) Association between common variation at the FTO locus and changes in body mass index from infancy to late childhood: the complex nature of genetic association through growth and development. PLoS Genet 7: e1001307.
252. Pinheiro J, Bates D (2000) Mixed Effects Models in S and S-Plus: Springer.
253. Fitzmaurice GM, Laird NM, Ware JH (2004) Applied Longitudinal Analysis: Wiley.
254. Zhang D, Davidian M (2001) Linear mixed models with flexible distributions of random effects for longitudinal data. Biometrics 57: 795-802.
255. Verbeke G, Lesaffre E (1997) The effect of misspecifying the random-effects distribution in linear mixed models for longitudinal data. Computational Statistics & Data Analysis 23: 541-556.
256. Jacqmin-Gadda H, Sibillot S, Proust C, Molina JM, Thiébaut R (2007) Robustness of the linear mixed model to misspecified error distribution. Computational Statistics & Data Analysis 51: 5142-5154.
257. Taylor JMG, Cumberland WG, Sy JP (1994) A Stochastic Model for Analysis of Longitudinal AIDS Data. Journal of the American Statistical Association 89: 727-736.
258. Taylor JM, Law N (1998) Does the covariance structure matter in longitudinal modelling for the prediction of future CD4 counts? Stat Med 17: 2381-2394.
259. Liang KY, Zeger SL (1986) Longitudinal data analysis using generalized linear models. Biometrika 73: 13-22.
260. Royall RM (1986) Model Robust Confidence Intervals Using Maximum Likelihood Estimators. International Statistical Review / Revue Internationale de Statistique 54: 221-226.
261. Golding J, Pembrey M, Jones R (2001) ALSPAC--the Avon Longitudinal Study of Parents and Children. I. Study methodology. Paediatr Perinat Epidemiol 15: 74-87.
262. Koehler E, Brown E, Haneuse SJ (2009) On the Assessment of Monte Carlo Error in Simulation-Based Statistical Analyses. Am Stat 63: 155-162.
263. White I (2010) simsum: Analysis of simulation studies including Monte Carlo error. The Stata Journal 10: 369-385.
264. McDonald L (1975) Tests for the General Linear Hypothesis Under the Multiple Design Multivariate Linear Model. The Annals of Statistics 3: 461-466.
265. Duggal P, Gillanders EM, Holmes TN, Bailey-Wilson JE (2008) Establishing an adjusted p-value threshold to control the family-wide type 1 error in genome wide association studies. BMC Genomics 9: 516.
266. Aulchenko YS, Ripke S, Isaacs A, van Duijn CM (2007) GenABEL: an R library for genome-wide association analysis. Bioinformatics 23: 1294-1296.
267. Zeger SL, Liang KY, Albert PS (1988) Models for longitudinal data: a generalized estimating equation approach. Biometrics 44: 1049-1060.
268. Zeger SL, Liang KY (1986) Longitudinal data analysis for discrete and continuous outcomes. Biometrics 42: 121-130.
269. Gurka MJ, Edwards LJ, Muller KE (2011) Avoiding bias in mixed model inference for fixed effects. Stat Med 30: 2696-2707.
270. Verbeke G, Molenberghs G (2000) Linear mixed models for longitudinal data: Springer Series in Statistics, Springer-Verlag, New York. 568 p.
271. Davidian M, Giltinan DM (1995) Nonlinear models for repeated measurement data. London: Chapman & Hall.
272. Rasbash J, Steele F, Browne WJ, Goldstein H (2012) A User’s Guide to MLwiN, v2.26. Centre for Multilevel Modelling, University of Bristol.
273. Dudbridge F, Gusnanto A (2008) Estimation of significance thresholds for genomewide association scans. Genet Epidemiol 32: 227-234.
274. Risch N, Merikangas K (1996) The future of genetic studies of complex human diseases. Science 273: 1516-1517.
275. Dadd T, Weale ME, Lewis CM (2009) A critical evaluation of genomic control methods for genetic association studies. Genet Epidemiol 33: 290-298.
276. Eriksson JG, Forsen TJ, Osmond C, Barker DJ (2003) Pathways of infant and childhood growth that lead to type 2 diabetes. Diabetes Care 26: 3006-3010.
277. Yliharsila H, Eriksson JG, Forsen T, Laakso M, Uusitupa M, et al. (2004) Interactions between peroxisome proliferator-activated receptor-gamma 2 gene polymorphisms and size at birth on blood pressure and the use of antihypertensive medication. J Hypertens 22: 1283-1287.
278. Pihlajamaki J, Vanhala M, Vanhala P, Laakso M (2004) The Pro12Ala polymorphism of the PPAR gamma 2 gene regulates weight from birth to adulthood. Obes Res 12: 187-190.
279. Eriksson JG, Lindi V, Uusitupa M, Forsen TJ, Laakso M, et al. (2002) The effects of the Pro12Ala polymorphism of the peroxisome proliferator-activated receptor-gamma2 gene on insulin sensitivity and insulin metabolism interact with size at birth. Diabetes 51: 2321-2324.
280. Meigs JB, Shrader P, Sullivan LM, McAteer JB, Fox CS, et al. (2008) Genotype score in addition to common risk factors for prediction of type 2 diabetes. N Engl J Med 359: 2208-2219.
281. Oue N, Aung PP, Mitani Y, Kuniyasu H, Nakayama H, et al. (2005) Genes involved in invasion and metastasis of gastric cancer identified by array-based hybridization and serial analysis of gene expression. Oncology 69 Suppl 1: 17-22.
282. Day FR, Loos RJ (2011) Developments in obesity genetics in the era of genome-wide association studies. J Nutrigenet Nutrigenomics 4: 222-238.
283. Randall JC, Winkler TW, Kutalik Z, Berndt SI, Jackson AU, et al. (2013) Sex-stratified Genome-wide Association Studies Including 270,000 Individuals Show Sexual Dimorphism in Genetic Loci for Anthropometric Traits. PLoS Genet 9: e1003500.
284. Willer CJ, Li Y, Abecasis GR (2010) METAL: fast and efficient meta-analysis of genomewide association scans. Bioinformatics 26: 2190-2191.
285. Stouffer S, DeVinney L, Suchman E (1949) The American soldier: Adjustment during army life. Princeton University Press Princeton, US Volume 1.
286. Wang K, Li M, Hakonarson H (2010) Analysing biological pathways in genome-wide association studies. Nat Rev Genet 11: 843-854.
287. Segre AV, Groop L, Mootha VK, Daly MJ, Altshuler D (2010) Common inherited variation in mitochondrial genes is not enriched for associations with type 2 diabetes or related glycemic traits. PLoS Genet 6.
288. Zhang K, Cui S, Chang S, Zhang L, Wang J (2010) i-GSEA4GWAS: a web server for identification of pathways/gene sets associated with traits by applying an improved gene set enrichment analysis to genome-wide association study. Nucleic Acids Res 38: W90-95.
289. Nam D, Kim J, Kim SY, Kim S (2010) GSA-SNP: a general approach for gene set analysis of polymorphisms. Nucleic Acids Res 38: W749-754.
290. Kwon JS, Kim J, Nam D, Kim S (2012) Performance Comparison of Two Gene Set Analysis Methods for Genome-wide Association Study Results: GSA-SNP vs i-GSEA4GWAS. Genomics Inform 10: 123-127.
291. Okamoto K, Iwasaki N, Nishimura C, Doi K, Noiri E, et al. (2010) Identification of KCNJ15 as a susceptibility gene in Asian patients with type 2 diabetes mellitus. Am J Hum Genet 86: 54-64.
292. Iwasaki N, Cox NJ, Wang YQ, Schwarz PE, Bell GI, et al. (2003) Mapping genes influencing type 2 diabetes risk and BMI in Japanese subjects. Diabetes 52: 209-213.
293. Okamoto K, Iwasaki N, Doi K, Noiri E, Iwamoto Y, et al. (2012) Inhibition of glucose-stimulated insulin secretion by KCNJ15, a newly identified susceptibility gene for type 2 diabetes. Diabetes 61: 1734-1741.
294. Horikoshi M, Yaghootkar H, Mook-Kanamori DO, Sovio U, Taal HR, et al. (2013) New loci associated with birth weight identify genetic links between intrauterine growth and adult height and metabolism. Nat Genet 45: 76-82.
295. Lango Allen H, Estrada K, Lettre G, Berndt SI, Weedon MN, et al. (2010) Hundreds of variants clustered in genomic loci and biological pathways affect human height. Nature 467: 832-838.
296. Dupuis J, Langenberg C, Prokopenko I, Saxena R, Soranzo N, et al. (2010) New genetic loci implicated in fasting glucose homeostasis and their impact on type 2 diabetes risk. Nat Genet 42: 105-116.
297. Morris AP, Voight BF, Teslovich TM, Ferreira T, Segre AV, et al. (2012) Large-scale association analysis provides insights into the genetic architecture and pathophysiology of type 2 diabetes. Nat Genet 44: 981-990.
298. Xu Z, Taylor JA (2009) SNPinfo: integrating GWAS and candidate gene information into functional SNP selection for genetic association studies. Nucleic Acids Res 37: W600-605.
299. Kasowski M, Grubert F, Heffelfinger C, Hariharan M, Asabere A, et al. (2010) Variation in transcription factor binding among humans. Science 328: 232-235.
300. Schaub MA, Boyle AP, Kundaje A, Batzoglou S, Snyder M (2012) Linking disease associations with regulatory information in the human genome. Genome Res 22: 1748-1759.
301. Andersson EA, Pilgaard K, Pisinger C, Harder MN, Grarup N, et al. (2010) Do gene variants influencing adult adiposity affect birth weight? A population-based study of 24 loci in 4,744 Danish individuals. PLoS One 5: e14190.
302. Kilpelainen TO, den Hoed M, Ong KK, Grontved A, Brage S, et al. (2011) Obesity-susceptibility loci have a limited influence on birth weight: a meta-analysis of up to 28,219 individuals. Am J Clin Nutr 93: 851-860.
303. Barker DJ, Osmond C, Forsen TJ, Kajantie E, Eriksson JG (2005) Trajectories of growth among children who have coronary events as adults. N Engl J Med 353: 1802-1809.
304. Weedon MN, Lango H, Lindgren CM, Wallace C, Evans DM, et al. (2008) Genome-wide association analysis identifies 20 loci that influence adult height. Nat Genet 40: 575-583.
305. Kim JJ, Lee HI, Park T, Kim K, Lee JE, et al. (2010) Identification of 15 loci influencing height in a Korean population. J Hum Genet 55: 27-31.
306. Yashin AI, Wu D, Arbeev KG, Ukraintseva SV (2010) Joint influence of small-effect genetic variants on human longevity. Aging (Albany NY) 2: 612-620.
307. Zhang Y, Proenca R, Maffei M, Barone M, Leopold L, et al. (1994) Positional cloning of the mouse obese gene and its human homologue. Nature 372: 425-432.
308. Lustig RH (2006) Childhood obesity: behavioral aberration or biochemical drive? Reinterpreting the First Law of Thermodynamics. Nat Clin Pract Endocrinol Metab 2: 447-458.
309. O'Rahilly S (2009) Human genetics illuminates the paths to metabolic disease. Nature 462: 307-314.
310. Hofker M, Wijmenga C (2009) A supersized list of obesity genes. Nat Genet 41: 139-140.
311. Gosset P, Ghezala GA, Korn B, Yaspo ML, Poutska A, et al. (1997) A new inward rectifier potassium channel gene (KCNJ15) localized on chromosome 21 in the Down syndrome chromosome region 1 (DCR1). Genomics 44: 237-241.
312. Epstein CJ, Korenberg JR, Anneren G, Antonarakis SE, Ayme S, et al. (1991) Protocols to establish genotype-phenotype correlations in Down syndrome. Am J Hum Genet 49: 207-235.
313. Delabar JM, Theophile D, Rahmani Z, Chettouh Z, Blouin JL, et al. (1993) Molecular mapping of twenty-four features of Down syndrome on chromosome 21. Eur J Hum Genet 1: 114-124.
314. Toyoda A, Noguchi H, Taylor TD, Ito T, Pletcher MT, et al. (2002) Comparative genomic sequence analysis of the human chromosome 21 Down syndrome critical region. Genome Res 12: 1323-1332.
315. Warrington NM, Howe LD, Wu YY, Timpson NJ, Tilling K, et al. (2013) Association of a Body Mass Index Genetic Risk Score with Growth throughout Childhood and Adolescence. PLoS One 8: e79547.
316. Mei H, Chen W, Jiang F, He J, Srinivasan S, et al. (2012) Longitudinal replication studies of GWAS risk SNPs influencing body mass index over the course of childhood and adulthood. PLoS One 7: e31470.
317. Dvornyk V, Waqar ul H (2012) Genetics of age at menarche: a systematic review. Hum Reprod Update 18: 198-210.
318. Wen X, Kleinman K, Gillman MW, Rifas-Shiman SL, Taveras EM (2012) Childhood body mass index trajectories: modeling, characterizing, pairwise correlations and socio-demographic predictors of trajectory characteristics. BMC Med Res Methodol 12: 38.
319. Zillikens MC, Yazdanpanah M, Pardo LM, Rivadeneira F, Aulchenko YS, et al. (2008) Sex-specific genetic effects influence variation in body composition. Diabetologia 51: 2233-2241.
320. Comuzzie AG, Blangero J, Mahaney MC, Mitchell BD, Stern MP, et al. (1993) Quantitative genetics of sexual dimorphism in body fat measurements. American Journal of Human Biology 5: 725-734.
321. Atwood LD, Heard-Costa NL, Cupples LA, Jaquish CE, Wilson PW, et al. (2002) Genomewide linkage analysis of body mass index across 28 years of the Framingham Heart Study. Am J Hum Genet 71: 1044-1050.
322. Webster RJ, Warrington NM, Weedon MN, Hattersley AT, McCaskie PA, et al. (2009) The association of common genetic variants in the APOA5, LPL and GCK genes with longitudinal changes in metabolic and cardiovascular traits. Diabetologia 52: 106-114.
323. Andreasen CH, Stender-Petersen KL, Mogensen MS, Torekov SS, Wegner L, et al. (2008) Low physical activity accentuates the effect of the FTO rs9939609 polymorphism on body fat accumulation. Diabetes 57: 95-101.
324. Vimaleswaran KS, Li S, Zhao JH, Luan J, Bingham SA, et al. (2009) Physical activity attenuates the body mass index-increasing influence of genetic variation in the FTO gene. Am J Clin Nutr 90: 425-428.
325. Cauchi S, Stutzmann F, Cavalcanti-Proenca C, Durand E, Pouta A, et al. (2009) Combined effects of MC4R and FTO common genetic variants on obesity in European general populations. J Mol Med (Berl) 87: 537-546.
326. Rampersaud E, Mitchell BD, Pollin TI, Fu M, Shen H, et al. (2008) Physical activity and the association of common FTO gene variants with body mass index and obesity. Arch Intern Med 168: 1791-1797.
327. Kilpeläinen TO, Qi L, Brage S, Sharp SJ, Sonestedt E, et al. (2011) Physical Activity Attenuates the Influence of FTO Variants on Obesity Risk: A Meta-Analysis of 218,166 Adults and 19,268 Children. PLoS Med 8: e1001116.
328. Ahmad T, Lee IM, Pare G, Chasman DI, Rose L, et al. (2011) Lifestyle interaction with fat mass and obesity-associated (FTO) genotype and risk of obesity in apparently healthy U.S. women. Diabetes Care 34: 675-680.
329. Sonestedt E, Roos C, Gullberg B, Ericson U, Wirfalt E, et al. (2009) Fat and carbohydrate intake modify the association between genetic variation in the FTO genotype and obesity. Am J Clin Nutr 90: 1418-1425.
330. Abarin T, Yan Wu Y, Warrington N, Lye S, Pennell C, et al. (2012) The impact of breastfeeding on FTO-related BMI growth trajectories: an application to the Raine pregnancy cohort study. Int J Epidemiol 41: 1650-1660.
331. Li S, Zhao JH, Luan J, Ekelund U, Luben RN, et al. (2010) Physical activity attenuates the genetic predisposition to obesity in 20,000 men and women from EPIC-Norfolk prospective population study. PLoS Med 7.
332. Access Economics (2008) The growing cost of obesity in 2008: three years on. Canberra: Diabetes Australia.
333. Colagiuri S, Lee CMY, Colagiuri R, Magliano D, Shaw JE, et al. (2010) The cost of overweight and obesity in Australia. Medical Journal of Australia 192: 260-264.
334. Summerbell CD, Waters E, Edmunds LD, Kelly S, Brown T, et al. (2005) Interventions for preventing obesity in children. Cochrane Database Syst Rev: CD001871.
335. Waters E, de Silva-Sanigorski A, Hall BJ, Brown T, Campbell KJ, et al. (2011) Interventions for preventing obesity in children. Cochrane Database Syst Rev: CD001871.
336. Doyle O, Harmon CP, Heckman JJ, Tremblay RE (2009) Investing in early human development: timing and economic efficiency. Econ Hum Biol 7: 1-6.
337. Crowle J, Turner E (2010) Childhood Obesity: An Economic Perspective. Productivity Commission Staff Working Paper, Melbourne.
Appendix A: Publication Arising from the Research in Chapter Two
Modelling BMI Trajectories in Children for Genetic Association Studies
Nicole M. Warrington1,2, Yan Yan Wu2, Craig E. Pennell1, Julie A. Marsh1, Lawrence J. Beilin3, Lyle J. Palmer2,4, Stephen J. Lye2, Laurent Briollais2*
1 School of Women’s and Infants’ Health, The University of Western Australia, Perth, Western Australia, Australia
2 Samuel Lunenfeld Research Institute, Mount Sinai Hospital, Toronto, Ontario, Canada
3 School of Medicine and Pharmacology, The University of Western Australia, Perth, Western Australia, Australia
4 Ontario Institute for Cancer Research, Toronto, Ontario, Canada
Abstract
Background: The timing of associations between common genetic variants and changes in growth patterns over childhood may provide insight into the development of obesity in later life. To address this question, it is important to define appropriate statistical models to allow for the detection of genetic effects influencing longitudinal childhood growth.
Methods and Results: Children from The Western Australian Pregnancy Cohort (Raine; n = 1,506) Study were genotyped at 17 genetic loci shown to be associated with childhood obesity (FTO, MC4R, TMEM18, GNPDA2, KCTD15, NEGR1, BDNF, ETV5, SEC16B, LYPLAL1, TFAP2B, MTCH2, BCDIN3D, NRXN3, SH2B1, MRSA) and an obesity-risk-allele-score was calculated as the total number of ‘risk alleles’ possessed by each individual. To determine the statistical method that fits these data and has the ability to detect genetic differences in BMI growth profile, four methods were investigated: linear mixed effects model, linear mixed effects model with skew-t random errors, semi-parametric linear mixed models and a non-linear mixed effects model. Of the four methods, the semi-parametric linear mixed model method was the most efficient for modelling childhood growth to detect modest genetic effects in this cohort. Using this method, three of the 17 loci were significantly associated with BMI intercept or trajectory in females and four in males. Additionally, the obesity-risk-allele score was associated with increased average BMI (female: β = 0.0049, P = 0.0181; male: β = 0.0071, P = 0.0001) and rate of growth (female: β = 0.0012, P = 0.0006; male: β = 0.0008, P = 0.0068) throughout childhood.
Conclusions: Using statistical models appropriate to detect genetic variants, variations in adult obesity genes were associated with childhood growth. There were also differences between males and females. This study provides evidence of genetic effects that may identify individuals early in life that are more likely to rapidly increase their BMI through childhood, which provides some insight into the biology of childhood growth.
Citation: Warrington NM, Wu YY, Pennell CE, Marsh JA, Beilin LJ, et al. (2013) Modelling BMI Trajectories in Children for Genetic Association Studies. PLoS ONE 8(1): e53897. doi:10.1371/journal.pone.0053897
Editor: Guoying Wang, Johns Hopkins Bloomberg School of Public Health, United States of America
Received August 17, 2012; Accepted December 4, 2012; Published January 17, 2013
Copyright: © 2013 Warrington et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: The following institutions provide funding for Core Management of the Raine Study: The University of Western Australia (UWA), Raine Medical Research Foundation, UWA Faculty of Medicine, Dentistry and Health Sciences, The Telethon Institute for Child Health Research, Curtin University and Women and Infants Research Foundation. This study was supported by project grants from the National Health and Medical Research Council of Australia (Grant ID 403981 and ID 003209; http://www.nhmrc.gov.au/) and the Canadian Institutes of Health Research (Grant ID MOP-82893; http://www.cihr-irsc.gc.ca/e/193.html). Ms. Warrington is funded by an Australian Postgraduate Award from the Australian Government of Innovation, Industry, Science and Research and a Raine Study PhD Top-Up Scholarship. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: [email protected]
Introduction
Obesity is a major global public health problem. The World
Health Organisation estimated in 2010 there were at least 42
million overweight children under the age of 5 years and one
billion overweight adults globally [1]. Childhood obesity is
associated with poor mental [2,3,4,5] and physical health [6,7]
and is one of the strongest predictors of adult obesity [8,9]. Adult
obesity, in turn, increases the risk of many diseases including
coronary heart disease, metabolic syndrome, some cancers, stroke,
liver and gallbladder disease, sleep apnoea and respiratory
problems, osteoarthritis and gynaecological problems [1]. It has
been proposed that there are critical periods early in an
individual’s life for the development of obesity including gestation
and early infancy, adiposity rebound and adolescence [10].
An individual’s susceptibility to obesity is thought to result from
a combination of their genetics, behaviours and environment. The
heritability of obesity is estimated from family and twin studies to
be between 40 and 80% [11,12,13], which appears to be age
dependent with younger individuals having higher heritability
estimates [14]. Genetic factors have an important role in
childhood obesity, but their role may be different to those that
operate in adulthood. Since the advent of genome-wide association
studies (GWAS), common variants within 35 genes have been
discovered to be associated with adult obesity [15,16,17,18,19]
and a further 48 genes associated with population variation in
body mass index (BMI) and weight [20,21,22,23,24,25,26] in
individuals of European descent. In particular, common variants
within the fat-mass and obesity associated (FTO) and melanocortin 4
receptor (MC4R) genes are associated with modest effects on
BMI (0.2–0.4 kg/m² per allele) which translate into increased odds
of obesity of 1.1–1.3 in adults [24,26,27,28,29]. However, the
genomic regions discovered to date to be associated with BMI
account for less than 1% of the total variance in the BMI [30],
leaving much of the estimated heritability unexplained. In
addition, relatively few studies have investigated the association
between the adult BMI associated variants and childhood BMI
[23,31,32,33,34]. Zhao et al [31] investigated the association
between childhood BMI and 13 genomic loci reported to be
associated with adult obesity to find that nine of the loci contribute
to paediatric BMI between birth and 18 years of age.
Subsequently, several authors have investigated the association
between adult BMI loci and changes in growth over childhood.
Hardy and colleagues [33] took variants from the two most
commonly reported obesity genes, FTO and MC4R, to see if they
were associated with life course body size. They found the
association with BMI in both genes strengthened during childhood
up until 20 years of age before weakening throughout adulthood.
In 2010, Elks et al [34] used eight variants that showed individual
associations with childhood BMI to create an obesity-risk-allele-
score. This allele-score was strongly associated with early infant
weight gain but also with weight gain over childhood. Finally, den
Hoed et al [32] looked at BMI in childhood and adolescence
against a larger subset of replicated SNPs representing the 16 BMI
loci from the six genome-wide association studies in adults of white
European descent [22,23,24,26,35,36]. Together, these studies
begin to provide evidence that genetic loci associated with BMI in
adulthood start having an effect in childhood and even infancy.
Obesity develops over a period of time so investigating the
genetic determinants underlying this developmental process may
provide insights into mechanisms of the genetic associations.
Sophisticated longitudinal analyses allow questions to be addressed
that cannot be determined from cross-sectional analyses. These
longitudinal models assess patterns and duration of genetic effect
at baseline and over a time period and the differences in means
and rates of change of a trait. It is therefore important to
investigate the genetic component of BMI trajectory in order to
better understand some of the underlying biology of growth. The
analysis of longitudinal growth curves allows one to identify
specific stages in which genes play a central role.
A child’s growth rate profile often contains important information
regarding their genetic make-up and environmental exposures;
however, BMI trajectories are difficult to model statistically
due to the various changes in growth rate over childhood.
Children tend to have rapidly increasing BMI from birth to
approximately 9 months of age where they reach their adiposity
peak; BMI then decreases until around the age of 5–6 years at
adiposity rebound and then steadily increases again until after
puberty where it tends to plateau through adulthood. These
patterns of growth tend to be different in males and females where
females often reach each of the ‘landmarks’ (adiposity rebound,
puberty and plateau at adult BMI) at an earlier age than males.
These changes over time within each individual, as well as the
increasing variability over time of BMI between individuals, are
often difficult to capture accurately in a statistical model. This is
particularly the case when the aim is to detect modest genetic
effects. The World Health Organization recently conducted
research into statistical methods used to estimate growth curves
over childhood and examined 30 previously published methods, of
which only 7 could handle multiple measurements per child [37].
These methods range from non-linear, parametric curves [38] to
non-linear, non-parametric methods where the form of the curve
was allowed to differ for each subject [39,40] and from linear
mixed-effects models for longitudinal normally distributed data
[41,42] to a more general multilevel model, some with non-
parametric components [43,44,45]. Although many methods have
been previously used for growth modelling, not all are appropriate
for genetic association analyses or modelling growth profiles in
longitudinal birth cohorts.
We aim to compare various modelling approaches to assess the
genetic effects of BMI growth through infancy, childhood and
adolescence. To investigate the sensitivity of these different
modelling frameworks to detect genetic effects, we will use the
previously published adult obesity and BMI associated SNPs that
have been shown to be associated with childhood BMI and
explore their associations with childhood growth.
Methods
Subjects
The Western Australian Pregnancy Cohort (Raine) Study
[46,47,48] is a prospective pregnancy cohort where 2,900 mothers
were recruited prior to 18-weeks’ gestation between 1989 and
1991. Recruitment took place at Western Australia’s major
perinatal centre, King Edward Memorial Hospital, and nearby
private practices. The mothers completed questionnaires regarding
the children and the children had physical examinations at
average ages of 1, 2, 3, 6, 8, 10, 14 and 17 years. A DNA sample
was collected at the 14 and 17 year follow-ups. A subset of 1,506
individuals were used for analysis in this study using the following
inclusion criteria: at least one parent of European descent, live
birth, unrelated to anyone in the sample (one of every related pair,
including multiple births, was selected at random to exclude), no
significant congenital anomalies, a DNA sample and at least one
measure of body mass index (BMI) throughout childhood. Weight
and height were measured at each follow-up by trained members
of the research team [49]; weight was measured using a
Wedderburn Digital Chair Scale to the nearest 100 g with
children dressed in running shorts and a singlet top and height
was measured to the nearest 0.1 cm with a Holtain Stadiometer.
BMI was calculated from the weight and height measurements
(median 6 measures per person, interquartile range 5–7, range 1–8
measurements), with a total of 8,986 BMI measures. The study
was conducted with appropriate institutional ethics approval from
the King Edward Memorial Hospital and Princess Margaret
Hospital for Children ethics boards, and written informed consent
was obtained from all mothers. The cohort has been shown to be
representative of the population presenting to the antenatal
tertiary referral centre in Western Australia [48].
Genes
We wanted to investigate markers that have an effect on
childhood BMI, and more importantly, change in BMI over
childhood so selected the 17 genetic variants published in den
Hoed et al [32]. These SNPs were first discovered to be associated
with adult BMI and replicated in at least one study against
childhood BMI and change in BMI growth over childhood. At the
time of selecting SNPs for this study, they were the largest set of
SNPs shown to be associated with BMI over childhood and
adolescence. We did not include loci that have been shown to be
associated with only obesity risk but not BMI. Subsets of these 17
SNPs (either the same SNPs or a SNP in high LD [r² > 0.8]) were
also presented by Elks et al [34] and Hardy et al [33], who showed
associations with changes in growth over childhood. Genetic
information on these 17 published genetic variants was available
for individuals in our sample, either directly genotyped SNPs
(rs925946 (BDNF), rs10913469 (SEC16B), rs2605100 (LYPLAL1),
rs987237 (TFAP2B), rs10838738 (MTCH2), rs7138803
(BCDIN3D) and rs10146997 (NRXN3)) or from the best guess
genotype data imputed against HapMap release 22 (rs2815752
(NEGR1), rs6548238 (TMEM18), rs7647305 (ETV5), rs10938397
(GNPDA2), rs613080 (MRSA), rs1488830 (BDNF), rs8055138
(SH2B1), rs1121980 (FTO), rs17782313 (MC4R) and rs11084753
(KCTD15)). Genotyping and quality control has been described
elsewhere [50]. Briefly, our sample was genotyped using the
genome-wide Illumina 660 Quad Array. Genotyping was
performed on the Illumina BeadArray Reader at the Centre for
Applied Genomics, Toronto, Canada using 250 nanograms of
DNA. The genotype data was cleaned using standard thresholds
(HWE p-value > 5.7×10⁻⁷, call rate > 95% and minor allele
frequency > 1%). Individual level genotype data was extracted for
those SNPs of interest that were directly genotyped by the chip
and passed QC measures. Imputation of un-typed or missing
genotypes was also performed using MACH v1.0.16 for all 22
autosomes with the CEU samples from HapMap Phase2 (Build 36,
release 22) used as a reference panel. Two variants in the BDNF
gene were investigated as they have previously been shown to be
independently associated with obesity [22] (r² = 0.11). The 17
SNPs are described in Table S1, including the available sample
size with complete data for each SNP. These 17 SNPs were used to
investigate the sensitivity of each method to detect genetic variants
in terms of point estimates and standard errors (SEs) across various
time points (for those methods that could be compared). Each SNP
was incorporated into the model independently assuming an
additive genetic effect for the obesity risk allele. In addition, an
‘obesity-risk-allele score’ was created on the subset of individuals
with complete genetic data by summing the number of risk alleles
an individual had (n = 1,219) [51]. The alleles were not weighted
by their effect size as this has previously been shown to only have
limited benefit [52].
Table 1. The phenotypic characteristics of the Raine sample.
| Measure | Follow-up | All (n = 1,506) | Male (n = 773) | Female (n = 733) | P-Value |
|---|---|---|---|---|---|
| Age (yr) | Year 1 (n = 1,375) | 1.16 (0.10) | 1.15 (0.10) | 1.16 (0.10) | 0.22 |
| | Year 2 (n = 402) | 2.18 (0.14) | 2.19 (0.14) | 2.16 (0.14) | 0.05 |
| | Year 3 (n = 994) | 3.11 (0.12) | 3.12 (0.13) | 3.11 (0.10) | 0.71 |
| | Year 5 (n = 1,324) | 5.92 (0.18) | 5.91 (0.19) | 5.92 (0.18) | 0.30 |
| | Year 8 (n = 1,320) | 8.10 (0.35) | 8.12 (0.34) | 8.09 (0.36) | 0.17 |
| | Year 10 (n = 1,274) | 10.60 (0.18) | 10.60 (0.19) | 10.59 (0.17) | 0.16 |
| | Year 13/14 (n = 1,276) | 14.07 (0.20) | 14.07 (0.20) | 14.07 (0.19) | 0.55 |
| | Year 16/17 (n = 1,021) | 17.05 (0.25) | 17.03 (0.24) | 17.06 (0.25) | 0.06 |
| BMI (kg/m²) | Year 1 (n = 1,375) | 17.11 (1.40) | 17.38 (1.38) | 16.82 (1.37) | 4.63E-14 |
| | Year 2 (n = 402) | 15.97 (1.29) | 16.19 (1.28) | 15.72 (1.25) | 2.00E-04 |
| | Year 3 (n = 994) | 16.15 (1.27) | 16.29 (1.21) | 16.00 (1.31) | 2.00E-04 |
| | Year 5 (n = 1,324) | 15.86 (1.76) | 15.88 (1.70) | 15.84 (1.82) | 0.64 |
| | Year 8 (n = 1,320) | 16.88 (2.54) | 16.79 (2.47) | 16.97 (2.62) | 0.29 |
| | Year 10 (n = 1,274) | 18.69 (3.41) | 18.58 (3.38) | 18.80 (3.45) | 0.25 |
| | Year 13/14 (n = 1,276) | 21.45 (4.23) | 21.21 (4.24) | 21.71 (4.20) | 0.03 |
| | Year 16/17 (n = 1,021) | 23.02 (4.38) | 22.83 (4.34) | 23.23 (4.42) | 0.15 |
| Height (m) | Year 1 (n = 1,375) | 0.78 (0.03) | 0.78 (0.03) | 0.77 (0.03) | 1.04E-14 |
| | Year 2 (n = 402) | 0.90 (0.03) | 0.91 (0.03) | 0.90 (0.03) | 3.00E-04 |
| | Year 3 (n = 994) | 0.96 (0.04) | 0.97 (0.04) | 0.96 (0.04) | 1.06E-09 |
| | Year 5 (n = 1,324) | 1.16 (0.05) | 1.17 (0.05) | 1.15 (0.04) | 6.05E-07 |
| | Year 8 (n = 1,320) | 1.29 (0.06) | 1.30 (0.06) | 1.29 (0.06) | 4.37E-06 |
| | Year 10 (n = 1,274) | 1.44 (0.06) | 1.44 (0.07) | 1.44 (0.06) | 0.97 |
| | Year 13/14 (n = 1,276) | 1.65 (0.08) | 1.67 (0.09) | 1.62 (0.06) | 4.94E-26 |
| | Year 16/17 (n = 1,021) | 1.73 (0.09) | 1.79 (0.07) | 1.66 (0.06) | 1.94E-143 |
| Weight (kg) | Year 1 (n = 1,375) | 10.34 (1.24) | 10.67 (1.24) | 9.99 (1.15) | 5.03E-25 |
| | Year 2 (n = 402) | 13.03 (1.49) | 13.39 (1.48) | 12.65 (1.40) | 3.37E-07 |
| | Year 3 (n = 994) | 15.06 (1.84) | 15.42 (1.83) | 14.69 (1.78) | 3.99E-10 |
| | Year 5 (n = 1,324) | 21.48 (3.37) | 21.75 (3.42) | 21.20 (3.30) | 2.91E-03 |
| | Year 8 (n = 1,320) | 28.42 (5.68) | 28.58 (5.65) | 28.24 (5.72) | 0.28 |
| | Year 10 (n = 1,274) | 39.01 (9.02) | 38.80 (9.09) | 39.23 (8.95) | 0.40 |
| | Year 13/14 (n = 1,276) | 58.49 (13.44) | 59.50 (14.49) | 57.39 (12.11) | 4.81E-03 |
| | Year 16/17 (n = 1,021) | 68.69 (14.59) | 73.15 (14.91) | 64.12 (12.74) | 3.91E-24 |
| Number of follow-ups per person | | 5.97 (1.52) | 5.96 (1.52) | 5.97 (1.53) | 0.91 |
| Birth Weight (kg) | | 3.35 (0.59) | 3.41 (0.59) | 3.28 (0.58) | 3.85E-05 |
| Gestational Age (wks) | | 39.35 (2.11) | 39.37 (2.05) | 39.32 (2.17) | 0.66 |
| Preterm [% (N)] | | 8.77% (132) | 8.03% (62) | 9.55% (70) | 0.34 |
| Maternal smoking during pregnancy [% (N)] | | 25.22% (379) | 22.77% (176) | 27.81% (203) | 0.03 |
Continuous variables are expressed as means (SD); binary variables as percentage (number). doi:10.1371/journal.pone.0053897.t001
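As a minimal sketch of the unweighted score described above, the Python function below sums risk-allele counts across SNPs and returns no score when any genotype is missing, mirroring the restriction to individuals with complete genetic data. The function name, the 0/1/2 genotype coding and the example genotypes are assumptions made for illustration; they are not taken from the study's analysis code or data.

```python
from typing import Optional, Sequence


def risk_allele_score(genotypes: Sequence[Optional[int]]) -> Optional[int]:
    """Return the unweighted count of risk alleles across SNPs.

    Each entry is assumed to be the number of risk alleles (0, 1 or 2)
    carried at one SNP; None marks a missing genotype. Individuals with
    any missing genotype get no score (complete-case rule).
    """
    if any(g is None for g in genotypes):
        return None
    if any(g not in (0, 1, 2) for g in genotypes):
        raise ValueError("genotypes must be coded as 0, 1 or 2 risk alleles")
    return sum(genotypes)


# Hypothetical individual typed at 17 SNPs (illustrative values only)
example = [1, 0, 2, 1, 1, 0, 0, 2, 1, 1, 0, 1, 2, 0, 1, 1, 0]
score = risk_allele_score(example)
```

Because every allele contributes equally, the score for 17 SNPs ranges from 0 to 34; a weighted version would multiply each count by a per-SNP effect size before summing.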
Statistical Analysis
Four popular methods were compared to assess the accuracy of
estimation of BMI growth trajectories and the ability to detect
genetic effects influencing these trajectories. These methods
included: Linear Mixed Effects Model (LMM) [41], the Skew-t
Linear Mixed Effects Model (STLMM) [53,54,55], Semi-Parametric
Linear Mixed Models (SPLMM) and a Non-Linear Mixed
Model (NLMM), also known as SuperImposition by Translation
and Rotation (SITAR) [40]. Although there are many possible
statistical methods that could be utilized in this context, these
methods were chosen as they allow for adjustment of potential
confounders, appropriately account for the complex correlation
structure between the repeated measures, allow for incomplete
data on the assumption that data are missing at random, and are
computationally feasible in the context of candidate gene and
genome-wide association studies. Once the best fitting model was
defined for each method, the model fit for each of the methods was
compared. A small simulation study was also conducted using re-
sampling techniques based on 1,000 non-parametric bootstrap
data sets with replacement [56] from the Raine data and
calculating an R² statistic for each method fit to these simulated
datasets.
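The resampling step can be sketched in Python as follows. Two points are assumptions rather than details given in the text: that whole individuals (with all of their repeated measures) are the resampling unit, and that R² is computed as one minus the ratio of residual to total sum of squares. The identifiers and the seed are hypothetical.

```python
import random
from typing import List, Sequence


def bootstrap_individuals(ids: Sequence[str], rng: random.Random) -> List[str]:
    """Draw len(ids) individuals with replacement (one bootstrap sample)."""
    return [rng.choice(ids) for _ in ids]


def r_squared(observed: Sequence[float], fitted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


rng = random.Random(2013)  # fixed seed so the sketch is reproducible
ids = ["id1", "id2", "id3", "id4", "id5"]  # hypothetical individuals
samples = [bootstrap_individuals(ids, rng) for _ in range(1000)]
```

In a full analysis each of the 1,000 samples would be expanded to the repeated BMI measures of the selected individuals, each model refitted, and `r_squared` applied to the observed and fitted values.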
LMM. The LMM with a polynomial function is a common
tool for growth curve analysis with continuous repeated measures.
For a set of time points varying from 1, …, t, the time trend in the
sample can be described by a (q−1)st-degree polynomial function,
with q ≤ t. The growth curve LMM for the jth individual and tth
time point and with the time scale measured by age is as follows:
$$\mathrm{BMI}_{jt}\ (\mathrm{kg/m^2}) = \beta_0 + \sum_i \beta_i\,(\mathrm{Age}_{jt})^i + u_{0j} + \sum_k u_{kj}\,(\mathrm{Age}_{jt} - \overline{\mathrm{Age}})^k + e_{jt}, \qquad k \le i$$
where $\overline{\mathrm{Age}}$ is the mean age over the t time points in the sample
(i.e. 8 years), $\beta_i$ are the parameter estimates for the fixed effects, $u_{kj}$
are the parameter estimates for the random effects assumed
multivariate normal and the $e_{jt}$'s are the error terms assumed
normally distributed $N(0, \Sigma)$, where $\Sigma$ is the within-individual
correlation matrix. Both age and the natural log transformation of
age were considered as the time component to identify the optimal
underlying scale. Both fixed (i) and random (k) effects up to
polynomial of degree 3 were tested for significance. Several within-
individual correlation structures were considered, including
autoregressive, continuous autoregressive, exchangeable (compound
symmetric) and unstructured.
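To make the structure of the growth curve LMM concrete, the sketch below evaluates the model's expected value for one individual at a given age, leaving out the error term. It assumes, as in the equation above, that fixed effects multiply powers of age and random effects multiply powers of age centred at the sample mean of 8 years. The function name and all coefficient values are hypothetical, not estimates from the study.

```python
from typing import Sequence


def lmm_expected_bmi(age: float,
                     beta: Sequence[float],
                     u: Sequence[float],
                     mean_age: float = 8.0) -> float:
    """Evaluate the polynomial growth-curve LMM without the error term.

    beta[i] multiplies age**i (fixed effects; beta[0] is the intercept).
    u[k] multiplies (age - mean_age)**k (random effects for one
    individual; u[0] is the random intercept).
    """
    if len(u) > len(beta):
        raise ValueError("random-effect degree must not exceed fixed-effect degree")
    fixed = sum(b * age ** i for i, b in enumerate(beta))
    random_part = sum(uk * (age - mean_age) ** k for k, uk in enumerate(u))
    return fixed + random_part


# Hypothetical linear example: fixed intercept and slope, random intercept and slope
value = lmm_expected_bmi(10.0, beta=[16.0, 0.5], u=[1.0, 0.1])
```

Setting every element of `u` to zero gives the population-average curve, while non-zero values shift and tilt the curve for a single child.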
Following the guidelines outlined in Cheng et al [57], the initial
saturated model, which included a cubic function of age for
both the fixed and random effects and BMI on the natural log
scale, was used to compare covariance (random effects) matrices.
Initially, likelihood ratio tests (LRT) were used to assess the
required degree of polynomial function for the random effects to
fit the data accurately, while keeping the fixed effects the same and
specifying an independence correlation matrix for the random
effects. Next, a similar approach was used to investigate within-
individual correlation structures in addition to the random effects.
Finally, models with both untransformed and natural log
transformed age were compared using diagnostic plots such as
fitted versus observed values, fitted versus residual values and
distribution of both random effects and error terms.
STLMM. The assumption of multivariate normal random
effects and within-subject errors is often violated, particularly when
modelling the childhood growth curve. This may lead to biased
estimation of fixed effects and their SEs and thus to wrong
statistical inference, in particular of the genetic association-related
parameters. A common approach to achieve normality is to
transform the response variable but generally there is not a unique
transformation that could be used and the results of the analyses
might depend on the transformation used. To avoid transforming
the response and still obtain a valid inference under a non-normal
distribution assumption for the response, we utilised an extension
of the LMM model assuming a multivariate t distribution for the
error terms, $e_{jt}$'s, and a multivariate skew-normal distribution for
the random effects. The resulting model for the response over the t
time points is multivariate skew-t with specific parameters that
account for the asymmetry (skewness parameters) and long-tail
(degree of freedom of the t distribution) of the response distribution
[54]. The specification in terms of fixed and random effects was
identical to the LMM. No transformations were applied to either
BMI or age as the skewness in the data was accounted for by the
model structure.
Table 2. Characteristics of the best model for each method.
| Sex | Method | Scale of response | Fixed effect parameters | Random effect parameters | Within-individual correlation matrix |
|---|---|---|---|---|---|
| Female | LMM | ln(BMI) | 1 + age + age² + age³ | 1 + age + age² | corCAR1 |
| | STLMM | BMI | 1 + age + age² + age³ | 1 + age | None |
| | SPLMM | ln(BMI) | piecewise cubic spline function of age with knots at 2, 8 and 12 years | 1 + age + 0.5*age² | None |
| | NLMM | ln(BMI) | size and a natural cubic spline function of ln(age) for velocity with 3 df | size and a natural cubic spline function of ln(age) for tempo and velocity parameters with 3 df | corCAR1 |
| Male | LMM | ln(BMI) | 1 + age + age² + age³ | 1 + age + age² | corCAR1 |
| | STLMM | BMI | 1 + age + age² + age³ | 1 + age | None |
| | SPLMM | ln(BMI) | piecewise cubic spline function of age with knots at 2, 8 and 12 years | 1 + age + 0.5*age² | None |
| | NLMM | ln(BMI) | size and a natural cubic spline function of ln(age) for velocity with 4 df | size and a natural cubic spline function of ln(age) for tempo and velocity parameters with 4 df | corCAR1 |
doi:10.1371/journal.pone.0053897.t002
SPLMM. Semi-parametric linear mixed models make use of
smoothing splines, which yield a smoother growth curve estimate
than the polynomial function in the LMM when fitting non-linear
relationships. The basic model for the jth individual and time-point
t is as follows:
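$$
\mathrm{BMI}_{jt}\ (\mathrm{kg/m^2}) = \beta_0 + \sum_i \beta_i\,(\mathrm{Age}_{jt})^i + \sum_k \gamma_k\,\big((\mathrm{Age}_{jt} - \overline{\mathrm{Age}}) - \kappa_k\big)_+^i + u_{0j} + \sum_i u_{ij}\,(\mathrm{Age}_{jt})^i + \sum_k \eta_{kj}\,\big((\mathrm{Age}_{jt} - \overline{\mathrm{Age}}) - \kappa_k\big)_+^i + \varepsilon_{jt}
$$
where $\kappa_k$ is the k-th knot and $(t - \kappa_k)_+ = 0$ if $t \le \kappa_k$ and $(t - \kappa_k)$ if
$t > \kappa_k$, which is known as the truncated power basis that ensures
smooth continuity between the time windows.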
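The truncated power basis can be illustrated with a minimal Python sketch. The function names, the use of raw (uncentred) age and the example age of 10 years are assumptions made for illustration only; the knots at 2, 8 and 12 years are those reported for the final SPLMM.

```python
def truncated_power(t, knot, degree=3):
    """Return (t - knot)_+ ** degree, which is zero at or below the knot."""
    diff = t - knot
    return diff ** degree if diff > 0 else 0.0


def spline_row(age, knots, degree=3):
    """One design-matrix row: polynomial terms, then one truncated term per knot."""
    poly = [age ** i for i in range(degree + 1)]  # 1, age, ..., age^degree
    trunc = [truncated_power(age, k, degree) for k in knots]
    return poly + trunc


# Illustrative row for a measurement at age 10 with knots at 2, 8 and 12 years.
row = spline_row(10.0, knots=[2.0, 8.0, 12.0])
```

Each knot contributes a term that is zero up to the knot and grows smoothly after it, so the curve can change shape in each time window without a jump at the knot.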
Various numbers and positions of knots and the degree of
polynomial between knots were compared to find the best fit to the
data. Knot points were initially estimated visually from both
individual profiles and the population average curve in males and
females separately. To optimise the number and placement of the
knot points, we fit a series of models with the knot points placed at
6-month intervals around the estimated knot points and incorpo-
rated additional knot points to see if they improved the model fit.
The model with the lowest Akaike Information Criterion (AIC)
was selected as the final model. Finally, we investigated the degree
of polynomial, up to the third degree, required for each spline,
once again selecting the best model with the lowest AIC.
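As a rough sketch of this selection step, the Python snippet below computes the AIC from a log-likelihood and a parameter count and picks the candidate with the lowest value. The candidate labels, log-likelihoods and parameter counts are hypothetical and are not values from the study.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


def select_by_aic(candidates):
    """candidates maps a label to (log_likelihood, n_params); return the lowest-AIC label."""
    return min(candidates, key=lambda name: aic(*candidates[name]))


# Hypothetical fits for three knot placements (illustrative numbers only).
candidates = {
    "knots at 2, 8, 12": (-1510.2, 12),
    "knots at 2.5, 8, 12": (-1512.9, 12),
    "knots at 2, 8, 12, 15": (-1509.8, 14),
}
best = select_by_aic(candidates)
```

In this toy comparison the extra knot improves the log-likelihood slightly, but not by enough to offset the penalty for its additional parameters.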
NLMM. The SITAR method [40] was recently defined to
summarize height growth in puberty (in particular peak height
velocity) and estimate subject-specific parameters that can be used
to investigate relationships with earlier exposures and later
outcomes. The SITAR model (referred to here as NLMM)
has a single fitted curve at the population level and
individual level estimates of mean differences in size (shifting up or
down of the BMI curve), growth tempo (left-right shift of the curve
on the age scale) and velocity (shrinking or stretching of the age
scale).
The basic model for the growth curves is:
$$
y_{it} = \alpha_i + h\!\left(\frac{t - \beta_i}{e^{-\gamma_i}}\right)
$$
where:
$y_{it}$ = growth of subject i at age t.
$h(t)$ = natural cubic spline curve of growth vs. age.
$\alpha_i$ = random growth intercept that adjusts for differences in
mean height (size).
$\beta_i$ = random growth intercept to adjust for difference in timing
(tempo).
$\gamma_i$ = random age scaling adjusting for the duration of the growth
spurt (velocity).
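The role of the three subject-specific parameters can be seen in a small Python sketch. The population curve used here (`toy_curve`) is a made-up quadratic standing in for the fitted natural cubic spline h(t), and the parameter values are hypothetical.

```python
import math


def sitar_predict(t, h, size, tempo, velocity):
    """Evaluate size + h((t - tempo) / exp(-velocity)) for one subject."""
    scaled_age = (t - tempo) / math.exp(-velocity)
    return size + h(scaled_age)


def toy_curve(age):
    """Stand-in for the fitted population spline h(t); purely illustrative."""
    return 16.0 + 0.1 * (age - 5.0) ** 2


# A subject on the population curve, and one shifted up by 1 unit and later by 1 year.
average = sitar_predict(10.0, toy_curve, size=0.0, tempo=0.0, velocity=0.0)
shifted = sitar_predict(10.0, toy_curve, size=1.0, tempo=1.0, velocity=0.0)
```

With all three parameters at zero the subject follows the population curve. A non-zero size moves the curve up or down, a non-zero tempo moves it along the age axis, and a non-zero velocity stretches or shrinks the age scale.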
This model was fit with the three parameters (size, tempo and
velocity) as random effects, size and velocity as fixed effects, and
h(t) a natural cubic spline curve with 3 to 8 degrees of freedom (df)
fitted as fixed effects. BMI and age were fitted both untransformed
and natural log transformed, to identify the best fit to the data.
Model fits were compared using AIC, deviance and
residual standard deviation. The estimates for the three param-
eters (size, tempo and velocity) were extracted for each individual
and used for genetic analyses.
Given that growth curves differ greatly between males and
females, particularly around puberty, and because different genes
may influence the timing of growth spurts in males and females,
sex stratified models were used for all analyses. Age was mean
centred prior to analysis. Due to the possibility of population
stratification in our sample given our sampling criteria of at least
one parent of European descent, a sensitivity analysis was
conducted adjusting the genetic analyses for the first five principal
components generated in the EIGENSTRAT software [58]. No
adjustment for multiple testing was made as our goal was to
estimate a combined effect of SNPs that have already been
validated in previous studies and shown to be significantly
associated with childhood BMI and growth. All analyses were
conducted in R version 2.12.1 [59]; the spida library was used for
the SPLMM models and the sitarlib library was used for the
NLMM models. To enable comparison between the four methods,
maximum likelihood estimation was used for all mixed models.
Genetic loci were considered associated with BMI if the global
likelihood ratio test was significant at the α < 0.05 level.
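A likelihood ratio test of this kind compares the maximised log-likelihoods of two nested models fitted by maximum likelihood. The Python sketch below is illustrative only: the log-likelihood values are hypothetical, and the choice of four degrees of freedom assumes that the genetic terms are a SNP main effect plus three SNP-by-age interactions, which is an assumption rather than a detail confirmed above.

```python
import math


def chi2_sf(x, df):
    """Upper-tail chi-square probability for df = 1 or any even df."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df % 2 == 0:
        half = x / 2.0
        terms = sum(half ** k / math.factorial(k) for k in range(df // 2))
        return math.exp(-half) * terms
    raise ValueError("this sketch only handles df = 1 or even df")


def likelihood_ratio_test(loglik_null, loglik_full, df):
    """Return the LRT statistic and its p-value for two nested ML fits."""
    stat = 2.0 * (loglik_full - loglik_null)
    return stat, chi2_sf(stat, df)


# Hypothetical log-likelihoods without and with the genetic terms.
stat, p_value = likelihood_ratio_test(-1520.0, -1514.0, df=4)
significant = p_value < 0.05
```

The closed-form tail probability is used here only to keep the sketch free of external libraries; a statistics package would normally supply the chi-square distribution for any degrees of freedom.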
Results
Population Characteristics
Of the 1,506 children in the analysis, 773 were males (51%)
and 733 were females. Table 1 gives the characteristics of the Raine
sample used in the analysis. At birth, these babies were similar to
the Western Australian population of births, with an average birth
weight of 3.35 kg (SD = 0.59 kg) and gestational age of 39.35
weeks (SD = 2.11 weeks); 25.21% of them were born to mothers
who smoked throughout pregnancy and 8.77% were born preterm. The
mothers on average gained 8.79 kg (SD = 3.78) throughout
pregnancy and breast fed their infant for an average of 6 months
(IQR = 2–12 months). On average, the infants gained 6.98 kg
(SD = 1.17 kg) in the first year of life.
Model Fitting and Comparisons
The optimal model for each method was defined before any
cross-method comparisons were conducted. The selected models
for each method are summarized in Table 2.
LMM. The optimal LMM model for both males and females
was based on ln(BMI) and untransformed age, with a cubic
polynomial of age in the fixed effects, a quadratic polynomial of
age in the random effects and a continuous autoregressive
correlation structure of order one. Hence, the final model for
both females and males was
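$$
\ln\big(\mathrm{BMI}\ [\mathrm{kg/m^2}]\big) = \beta_0 + \beta_1(\mathrm{Age} - 8) + \beta_2(\mathrm{Age} - 8)^2 + \beta_3(\mathrm{Age} - 8)^3 + u_0 + u_1(\mathrm{Age} - 8) + u_2(\mathrm{Age} - 8)^2 + \varepsilon
$$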
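To make the structure of this model concrete, the Python sketch below evaluates the cubic fixed part and the quadratic subject-specific part at a given age and back-transforms to the BMI scale. The coefficient values are hypothetical, chosen only for readability, and are not estimates from the study.

```python
import math


def lmm_ln_bmi(age, beta, u=(0.0, 0.0, 0.0)):
    """ln(BMI) from the cubic fixed part plus the quadratic subject-specific part."""
    c = age - 8.0  # age centred at 8 years, as in the model
    fixed = beta[0] + beta[1] * c + beta[2] * c ** 2 + beta[3] * c ** 3
    random_part = u[0] + u[1] * c + u[2] * c ** 2
    return fixed + random_part


# Hypothetical coefficients (illustrative only).
beta = (2.80, 0.03, 0.002, -0.0001)
population_bmi_at_8 = math.exp(lmm_ln_bmi(8.0, beta))
subject_bmi_at_8 = math.exp(lmm_ln_bmi(8.0, beta, u=(0.10, 0.0, 0.0)))
```

Because age is centred at 8 years, the intercept is the population average ln(BMI) at age 8, and a subject's random intercept shifts that value on the log scale.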
STLMM. The LMM model defined previously was used for
this method; however BMI was modelled on the untransformed
scale as the method accounts for the skewness and kurtosis of the
BMI distribution. The model would not converge with both linear
and quadratic age components in the random effects so this was
reduced to only linear age. This was the most computationally
intensive method to fit as it uses an expectation-maximization
(EM) algorithm for parameter estimation, and hence took the
longest time to converge.
SPLMM. For females, the optimal model had three knot
points placed at two, eight and 12 years with a cubic slope for each
spline. The males displayed a similar curve to the females, also
with three knots at two, eight and 12 years and a cubic slope
between each knot.
NLMM. The optimal model for females had a natural cubic
spline curve with three degrees of freedom and both BMI and age
on the natural log transformed scale. Similarly, the optimal model
for males was with BMI and age on the natural log transformed
scale but with four degrees of freedom for the natural cubic spline
curve.
Comparisons. Table 3 displays the measures of fit used to
compare methods: R2, R2 from 1,000 simulated datasets,
observed-fitted values, number of SNPs detected and computa-
tional time. The R2, in conjunction with interquartile range of
variation of R2 estimated through simulations, clearly favour the
SPLMM as the best model fit for the females. The R2 estimates
from the simulations indicate that although the STLMM method
has higher R2 for both females and males, the interquartile range
is much larger for STLMM method, indicating the model fit is
more data dependent than the other methods, which is not
desirable for generalization to other cohorts. The conclusion for
the males is less clear-cut: the R2 is largest for the STLMM,
Figure 1. Q-Q plot of residuals for each of the methods by females (top four) and males (bottom four).
doi:10.1371/journal.pone.0053897.g001
however the considerably longer computational time and the
larger deviation of the fitted values from the observed values
indicate that this model might not be appropriate for large scale
genetic studies. Figure 1 displays the residuals from all four
methods in both males and females. The female residual plots
indicate the LMM, STLMM and SPLMM methods all have
residuals distributed close to the expected distribution (normal for
the LMM and SPLMM and skew-t for the STLMM). Several
within-subject outliers (at the tails of the distribution) were not
captured in all methods. However, the NLMM in particular had
additional outliers not present with the other methods. The LMM
and SPLMM methods both have some deviation from the normal
distribution at the top end of the curve, signifying that they
underestimate the high BMI values. In contrast, there was an excess of
extreme residual values at both ends when using the NLMM
method, indicating a poor fit to the data. It overestimates low
BMI values and underestimates high values, thus underestimating
within-individual variability and potentially leading to conserva-
Table 3. Statistical measures used to compare model fit of the four methods.
Sex | Method | R^2 | R^2 from 1,000 simulated datasets [median (IQR)] | (Observed - fitted values)^2 [median (IQR)] | Number of SNPs detected | Average run time for genetic model† (median [IQR])
Female | LMM | 83.59% | 83.60% (82.70, 84.44) | 0.2705 (0.0579, 0.8755) | 1 of 17 | 13.59 sec (13.41, 14.40)
Female | STLMM | 88.78% | 91.80% (86.30, 95.54) | 0.2728 (0.0613, 0.9007) | 3 of 17 | 4505 sec (4490, 4784)
Female | SPLMM | 89.42% | 89.47% (89.06, 89.84) | 0.1720 (0.0374, 0.5871) | 3 of 17 | 23.49 sec (23.41, 23.92)
Female | NLMM | 85.98% | 85.97% (85.32, 86.65) | 0.1678 (0.0350, 0.5752) | 2 of 51 (three tests per SNP) | 0.01 sec (0.00, 0.02)
Male | LMM | 80.67% | 80.71% (79.64, 81.71) | 0.2390 (0.0470, 0.8187) | 3 of 17 | 15.84 sec (15.66, 16.55)
Male | STLMM | 88.72% | 91.99% (87.88, 95.74) | 0.2248 (0.0479, 0.8453) | 4 of 17 | 3962 sec (3895, 3970)
Male | SPLMM | 87.59% | 87.62% (87.24, 88.03) | 0.1656 (0.0329, 0.5501) | 4 of 17 | 24.07 sec (23.78, 24.52)
Male | NLMM | 85.10% | 85.07% (84.41, 85.82) | 0.1604 (0.0333, 0.5713) | 5 of 51 (three tests per SNP) | 0.00 sec (0.00, 0.02)
†Median (IQR) of 100 models with the FTO SNP in R-64-bit version 2.12.1 on a 64-bit operating system with an Intel Core i7 CPU Processor (L 640 @ 2.13 GHz).
doi:10.1371/journal.pone.0053897.t003
Figure 2. Distribution of obesity-risk allele score, with error bars for mean BMI at age 14 years. The obesity-risk-allele score incorporates genotypes from 17 loci (FTO, MC4R, TMEM18, GNPDA2, KCTD15, NEGR1, BDNF, ETV5, SEC16B, LYPLAL1, TFAP2B, MTCH2, BCDIN3D, NRXN3, SH2B1, and MSRA) in the 1,219 individuals from the Raine study with complete genetic data. The error bars display the mean (95% CI) BMI at age 14 years (the largest follow-up in adolescence) for each risk-allele score.
doi:10.1371/journal.pone.0053897.g002
tive inference about genetic associations. The male residuals
displayed a similar pattern to females, although there were fewer
obvious outliers. In addition, as there was less skewness in the
males, the STLMM method deviated from the expected t
distribution but in the opposite direction to that of the females,
whereby the low values of BMI are underestimated. Based on
model fit, all four methods were adequate in modelling childhood
growth curves; however, the SPLMM was slightly better than the
other methods at accounting for outliers and had the best model
fit.
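Two of the fit measures reported in Table 3 can be sketched in Python as follows. The definition of R2 as one minus the ratio of residual to total sum of squares is an assumption, since the exact formula is not restated here, and the observed and fitted values are invented for illustration.

```python
import statistics


def r_squared(observed, fitted):
    """1 - SS_residual / SS_total (an assumed definition of R2)."""
    mean_obs = statistics.fmean(observed)
    ss_res = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def squared_deviation_summary(observed, fitted):
    """Median and quartiles of (observed - fitted)^2."""
    sq = [(o - f) ** 2 for o, f in zip(observed, fitted)]
    q1, _, q3 = statistics.quantiles(sq, n=4)
    return statistics.median(sq), (q1, q3)


# Invented BMI values, for illustration only.
observed = [15.0, 16.0, 18.0, 21.0, 25.0]
fitted = [15.5, 16.0, 17.0, 21.5, 24.0]
r2 = r_squared(observed, fitted)
median_sq, iqr_sq = squared_deviation_summary(observed, fitted)
```

A higher R2 and a smaller median squared deviation both point to fitted values that track the observed BMI more closely.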
Genetic Results
Of the 17 SNPs, a likelihood ratio test indicated that the LMM
method detected one significant association in females and
three in males at the 5% level of significance, the STLMM method
detected three in females and four in males, the SPLMM detected
three in females and four in males, and the NLMM method
detected no significant SNPs in either females or males for the size
parameter but two significant SNPs for the velocity parameter in
males. Results of all 17 SNPs can be found in Tables S2 (females)
and S3 (males). The first five principal components for population
stratification were not significantly associated with BMI in any of
the four methods and the genetic results of the 17 SNPs remained
consistent when adjusting for them (data not shown).
The obesity-risk allele score based on the genotypes at each of
the 17 loci was normally distributed and showed an approximately
linear association with BMI across childhood, based on the mean
BMI (95% confidence interval) for each score at each age
(Figure 2). When the risk-allele score was incorporated into the four
longitudinal models, it was associated with increasing BMI in
females using all four methods; however, only three methods
detected an association in males (Table 4). For the females, the
LMM, STLMM and SPLMM methods all detected an increase in
BMI per allele increase in the obesity-risk-allele-score (LMM
β = 0.0046, P = 0.0216; STLMM β = 0.0492, P = 0.0410; SPLMM
β = 0.0049, P = 0.0181), in addition to an increase in linear slope
over time (LMM β = 0.0012, P = 0.00002; STLMM β = 0.0153,
P = 0.00003; SPLMM β = 0.0012, P = 0.0006). No significant
associations in the LMM, STLMM or SPLMM methods were
detected for the quadratic interactions with the risk-allele score;
however, the cubic interaction was significant in the LMM
(β = −0.00001, P = 0.0067) and STLMM (β = −0.0001,
P = 0.0236). This indicates that, according to the LMM and
STLMM methods, females with higher allele scores plateau to
adult BMI at an earlier age. In contrast, the NLMM method in
both females and males was unable to detect a significant
association with an increase in size or velocity, but did detect a
decrease in tempo (assumed to be adiposity rebound) for each
increase in risk allele. In the males, the LMM, STLMM and
SPLMM methods also detected an increase in BMI (LMM
β = 0.0073, P = 0.0001; STLMM β = 0.0423, P = 0.0481; SPLMM
β = 0.0071, P = 0.0001) and BMI/year per allele increase (LMM
β = 0.0010, P = 0.0001; STLMM β = 0.0083, P = 0.0070; SPLMM
β = 0.0008, P = 0.0068). No significant associations in the LMM,
STLMM or SPLMM methods were detected for the quadratic
and cubic interactions with the risk-allele score, indicating that the
shape of the curve is consistent across the score categories.
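An allele score of this kind can be sketched in Python as a simple count of risk alleles across loci. The sketch assumes an unweighted score, which is consistent with the integer scores reported, and treats anyone with a missing genotype as having no score, in line with the use of individuals with complete genetic data. The genotypes shown are hypothetical.

```python
def risk_allele_score(genotypes):
    """Sum risk-allele counts (0, 1 or 2 per locus); None if any locus is missing."""
    if any(g is None for g in genotypes.values()):
        return None
    if any(g not in (0, 1, 2) for g in genotypes.values()):
        raise ValueError("risk-allele counts must be 0, 1 or 2")
    return sum(genotypes.values())


# Hypothetical genotypes for three of the loci; a real score would cover all of them.
person = {"FTO": 2, "MC4R": 1, "TMEM18": 2}
score = risk_allele_score(person)
incomplete = risk_allele_score({"FTO": 2, "MC4R": None})
```

The score then enters each longitudinal model as a single covariate, together with its interactions with the age terms.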
Further analysis focused on the SPLMM model, as this method
was shown to give the best fit to these data. There are potentially
different genetic pathways leading to increased growth rate in
males and females as SNPs from different genes are associated
with BMI trajectory; in females, SNPs in the NRXN3, BDNF and
MSRA genes were significantly associated with BMI trajectory
whereas in males FTO, NRXN3, GNPDA2 and TMEM18 were
significant. Figure 3 displays the population average curves for
individuals with 15, 17 or 18 (25th, 50th and 75th percentile)
obesity-risk alleles. The growth curves in each of the genders show
different patterns; females begin their trajectory smaller than
males, they have an earlier rebound, and by the age of 18 years
they are beginning to plateau at their potential adult BMI. In
contrast, males go through puberty at a slightly later age resulting
in their BMI continuing to increase at the age of 18 years. It is
apparent that the genetic effect begins later for females, around
seven and a half years (P = 0.03), than males at four years
(P = 0.02) (Figure 4).
Table 4. Results from association analysis of the obesity-risk allele score with BMI trajectory using the four methods.
Sex | Term | LMM Beta | LMM 95% CI | LMM P-Value | STLMM Beta | STLMM 95% CI | STLMM P-Value | SPLMM Beta | SPLMM 95% CI | SPLMM P-Value | NLMM parameter | NLMM Beta | NLMM SE | NLMM P-Value
Female | Score | 0.0720 | 0.0107, 0.1335 | 0.0216 | 0.0492 | 0.0020, 0.0964 | 0.0410 | 0.0758 | 0.0131, 0.1388 | 0.0181 | Size | −0.0003 | 0.0008 | 0.6910
Female | Score*Age | 0.0182 | 0.0099, 0.02645 | 1.68E-05 | 0.0153 | 0.0082, 0.0225 | 2.84E-05 | 0.0185 | 0.0080, 0.0290 | 0.0006 | Tempo | −0.0090 | 0.0030 | 0.0023
Female | Score*Age^2 | −0.00001 | −0.0008, 0.0008 | 0.9848 | 0.0005 | −0.00004, 0.0011 | 0.0685 | −0.0077 | −0.0214, 0.0061 | 0.2763 | Velocity | 0.0045 | 0.0024 | 0.0562
Female | Score*Age^3 | −0.0002 | −0.0003, −0.00004 | 0.0067 | −0.0001 | −0.0002, −0.00002 | 0.0236 | −0.0058 | −0.0128, 0.0013 | 0.1077 | | | |
Male | Score | 0.1073 | 0.0553, 0.1595 | 0.0001 | 0.0423 | 0.0004, 0.0843 | 0.0481 | 0.1053 | 0.0516, 0.1591 | 0.0001 | Size | 0.0005 | 0.0007 | 0.4850
Male | Score*Age | 0.0144 | 0.0074, 0.0215 | 0.0001 | 0.0083 | 0.0023, 0.0144 | 0.0070 | 0.0122 | 0.0034, 0.0210 | 0.0068 | Tempo | −0.0072 | 0.0026 | 0.0053
Male | Score*Age^2 | −0.0006 | −0.0012, 0.0001 | 0.1043 | −0.00001 | −0.0005, 0.0004 | 0.9586 | −0.0003 | −0.0120, 0.0114 | 0.9573 | Velocity | 0.0009 | 0.0016 | 0.5820
Male | Score*Age^3 | −0.0001 | −0.0002, 0.000002 | 0.0550 | −0.0001 | −0.0001, 0.00003 | 0.1940 | 0.0007 | −0.0052, 0.0065 | 0.8270 | | | |
doi:10.1371/journal.pone.0053897.t004
Discussion
The current study has shown that of the four statistical methods
evaluated, the semi-parametric linear mixed model (SPLMM)
method was the most efficient for modelling childhood growth to
detect modest genetic effects in the longitudinal pregnancy cohort
study investigated. In addition, we have shown that there are
potentially different genetic pathways leading to increased growth
rate in males and females and that the obesity-risk-allele score
increases both average BMI and rate of growth throughout
childhood.
There are several different statistical methods that can be used
to model childhood growth. We selected four methods that would
allow for adjustment of potential confounders, appropriately
account for the correlation between the repeated measures, allow
for incomplete data, and were computationally feasible in the
context of candidate gene studies and GWAS. The evidence
suggested that the SPLMM method does a better job at
accounting for the variation in BMI growth than the LMM as it
had a smaller residual standard deviation. The SPLMM and
NLMM methods produce similar differences between observed
and fitted values. The LMM and STLMM methods have a larger
range, which indicates that the prediction of BMI for each individual
over time is worse using both of these methods, introducing bias
whereby they overestimate low BMI values and underestimate
high BMI values. As seen in the residual plots, there are a small
number of outliers in this dataset, which are highly influential for
both the LMM and STLMM and will affect their ability to predict
accurately. Furthermore, the estimates of skewness from
the STLMM model were relatively large (intercept = 4.5791
[SE = 1.0957] and slope = 2.2336 [SE = 0.6269] for females and
intercept = 2.8590 [SE = 0.5943] and slope = 1.6628
[SE = 0.4155] for males), which could be influenced by outliers
and result in inaccurate predictions. Although residual plots
indicate the STLMM method has the best fit to the data, it does
not produce the most accurate predictions. Based on model fit, all
four methods are adequate in modelling childhood growth curves;
however the SPLMM produces the most accurate fitted values and
can account for outliers.
Of the 17 genetic variants associated with adult BMI and
obesity risk that we investigated, the SPLMM method was able to
detect a higher proportion of associations with childhood growth
in both males and females than the other methods. The NLMM
method performed poorly in both males (five significant tests of 51)
and females (two significant tests of 51) consistent with it being
Figure 3. Population average curves from the SPLMM method in females and males. Predicted population average BMI trajectories from 1–18 years for individuals with 15 (lower quartile), 17 (median), and 18 (upper quartile) risk alleles in the allele score.
doi:10.1371/journal.pone.0053897.g003
more conservative than the other three methods. The STLMM
method detected a number of genetic effects, however it was a
more computationally intensive method, which would prove
difficult in larger scale genetic studies such as genome-wide
association studies. Moreover, it is not as flexible as the other
methods in terms of extensions to evaluate gene-environment or
gene-gene interactions. The current study provides evidence that
the SPLMM method is the most effective method to detect genetic
associations and allows the flexibility for extensions into large scale
and more complex genetic analyses.
Single genetic loci typically have small effects on complex
diseases or explain only a small proportion of the variability in a
quantitative trait; therefore, major increases in disease risk are
expected from simultaneous exposure to multiple genetic risk
variants. A post hoc power calculation using 1,000 non-parametric
bootstrap simulations based on the Raine data indicated that this
study had 97% power to detect the FTO locus rs1121980 with
MAF = 0.41, which has one of the larger effect sizes on BMI, but
still had 83% power to detect a more realistic smaller effect size
like the BDNF SNP rs1488830 association in females with
MAF = 0.21. In contrast, the power to detect the allele score,
combining all risk alleles, was 95% in both males and females
separately. The current study is the first to investigate, separately
in males and females, an association between 17 published obesity-
risk loci as an allele score and BMI trajectory throughout
childhood and adolescence. Den Hoed et al. [32] used a similar
approach with a 17-loci allele-score but focused on two
cross-sectional association analyses in pre-/early pubertal children and
adolescents. By utilizing a longitudinal design, the current study
reduced the number of genetic association tests conducted from
eight in a cross-sectional setting to one per gender, reducing the
necessity of adjusting for multiple testing and potentially missing
important genetic loci. A second study by Elks et al [34] evaluated
the association between adult obesity risk genes and growth
throughout childhood using a smaller subset of obesity suscepti-
bility loci and with analyses only up to age 11 years. Both studies
conducted analysis adjusting for gender; however, this does not
allow each gender to have different growth trajectories or the
investigation of different timing of the genetic effects. We found
substantial differences between males and females in the timing of
the adiposity rebound and plateauing towards adulthood. Additionally,
we detected that the genetic effects differed in timing and size
between the genders. By combining males and females into one analysis,
these genetic differences may have been averaged out and the
biology underlying the differences may remain undetected.
A recent longitudinal study investigating the life-course effects of
variants in the FTO gene and near the MC4R gene demonstrated
that the effects strengthen throughout childhood and peak at age
20 before weakening during adulthood [33]. We detected a similar
pattern with the obesity-risk allele score throughout childhood,
where the effect begins around four years in males and seven years
of age in females and increases in size each year. One limitation of
the current study is that the cohort currently only has data
available up to 18-years. It will be of interest to follow the cohort in
Figure 4. Associations between the risk-allele score and BMI at each follow-up in females and males. Regression coefficients (95% CI) presented on ln(BMI) scale from the Semi-Parametric Linear Mixed Model (SPLMM) longitudinal model, derived at each of the average ages of follow-up. For example, a male with 17 obesity-risk-alleles is likely to have an ln(BMI) 0.005 units higher at age 6 than a male with 16 risk-alleles and by age 14 this difference will be increased to 0.010 units.
doi:10.1371/journal.pone.0053897.g004
order to investigate how the combined effect of these SNPs
changes as the cohort progresses into adulthood. Further, it would
be valuable to confirm that the SPLMM method is the most
appropriate statistical method in other cohorts investigating the
genetic determinants of childhood growth and the patterns of
association across the life course.
Further studies are now required to assess the validity of these
findings and also extend them to perhaps focus on interactions
between genes and the environment. Interactions, both gene-gene
and gene-environment, are an important area of research that is
critical for understanding the mechanisms underlying obesity. We
performed a small simulation study, drawing 1,000 non-parametric
bootstrap data sets with replacement from the Raine data and
calculating the power to detect a
gene-gene interaction. Two SNP combinations were investigated
to gather an understanding of the range of power in our study;
these included the two most commonly reported BMI associated
loci, FTO rs1121980 (MAF = 0.41) by MC4R rs17782313
(MAF = 0.23) as well as two loci with large minor allele frequency,
FTO rs1121980 by NEGR1 rs2815752 (MAF = 0.38). Based on
these simulations, our study had 58.0% power to detect an
interaction between two SNPs with larger minor allele frequencies
(FTO*NEGR1) and effect sizes (FTO 0.019 kg/m2; NEGR1
0.011 kg/m2), while assuming a multiplicative model for the
interaction. However, the power decreased rapidly, to 4.6%, for a
pair with a smaller minor allele frequency (FTO*MC4R) and smaller
effect sizes (FTO 0.0044 kg/m2; MC4R 0.0020 kg/m2). We therefore believe that our
study was not appropriately designed to detect gene-gene or gene-
environment interactions but instead think that meta-analyses of
multiple cohorts might be a better way to tackle this problem.
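The bootstrap power calculation described above can be sketched generically in Python. This is an outline under stated assumptions, not the procedure used in the study: individuals are resampled with replacement, a user-supplied function (here the hypothetical `always_significant`, standing in for refitting the mixed model and testing the interaction) returns a p-value for each resample, and power is the share of resamples with p below the threshold.

```python
import random


def bootstrap_power(subjects, p_value_fn, n_boot=1000, alpha=0.05, seed=1):
    """Share of bootstrap resamples (drawn with replacement) whose test gives p < alpha."""
    rng = random.Random(seed)
    n = len(subjects)
    hits = 0
    for _ in range(n_boot):
        # Resample whole individuals so each subject's repeated measures stay together.
        resample = [subjects[rng.randrange(n)] for _ in range(n)]
        if p_value_fn(resample) < alpha:
            hits += 1
    return hits / n_boot


def always_significant(resample):
    """Placeholder for the model refit; a real function would return the test's p-value."""
    return 0.01


power = bootstrap_power(list(range(50)), always_significant, n_boot=200)
```

Resampling at the level of the individual, rather than the single measurement, preserves the within-subject correlation that the longitudinal models rely on.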
In conclusion, we have shown that although all four statistical
methods investigated for modelling childhood growth were
appropriate to model growth curves in childhood, the SPLMM
method was the most efficient in these data in terms of predicted
values and detection of genetic effects. Further, we have shown
that there is some evidence that genetic variations in established
adult obesity-associated genes are associated with childhood
growth; however these effects differ by gender and timing of
effect. This study provides further evidence of genetic effects that
may identify individuals early in life that are more likely to rapidly
increase their BMI through childhood, which provides some
insight into the biology of childhood growth.
Supporting Information
Table S1 Details of the 17 SNPs used in genetic association analyses.
(XLSX)
Table S2 Results of genetic association analysis in females for all 17 SNPs in each of the four statistical methods.
(XLSX)
Table S3 Results of genetic association analysis in males for all 17 SNPs in each of the four statistical methods.
(XLSX)
Acknowledgments
The authors are grateful to the Raine Study participants, their families, and
to the Raine Study research staff for cohort coordination and data
collection. The authors gratefully acknowledge the assistance of the
Western Australian DNA Bank (National Health and Medical Research
Council of Australia National Enabling Facility).
Author Contributions
Conceived and designed the experiments: NMW YYW LB. Analyzed the
data: NMW. Wrote the paper: NMW YYW CEP JAM LJB LJP SJL LB.
References
1. World Health Organization (2006) Obesity and Overweight Fact Sheet.
2. Griffiths LJ, Parsons TJ, Hill AJ (2010) Self-esteem and quality of life in obese children and adolescents: a systematic review. Int J Pediatr Obes 5: 282–304.
3. Tsiros MD, Olds T, Buckley JD, Grimshaw P, Brennan L, et al. (2009) Health-related quality of life in obese children and adolescents. Int J Obes (Lond) 33: 387–400.
4. Lawlor DA, Mamun AA, O’Callaghan MJ, Bor W, Williams GM, et al. (2005) Is being overweight associated with behavioural problems in childhood and adolescence? Findings from the Mater-University study of pregnancy and its outcomes. Arch Dis Child 90: 692–697.
5. Sawyer MG, Miller-Lewis L, Guy S, Wake M, Canterford L, et al. (2006) Is there a relationship between overweight and obesity and mental health problems in 4- to 5-year-old Australian children? Ambul Pediatr 6: 306–311.
6. Srinivasan SR, Myers L, Berenson GS (2006) Changes in metabolic syndrome variables since childhood in prehypertensive and hypertensive subjects: the Bogalusa Heart Study. Hypertension 48: 33–39.
7. Bradford NF (2009) Overweight and obesity in children and adolescents. Prim Care 36: 319–339.
8. Kindblom JM, Lorentzon M, Hellqvist A, Lonn L, Brandberg J, et al. (2009) BMI changes during childhood and adolescence as predictors of amount of adult subcutaneous and visceral adipose tissue in men: the GOOD Study. Diabetes 58: 867–874.
9. Serdula MK, Ivery D, Coates RJ, Freedman DS, Williamson DF, et al. (1993) Do obese children become obese adults? A review of the literature. Prev Med 22: 167–177.
10. Dietz WH (1994) Critical periods in childhood for the development of obesity. Am J Clin Nutr 59: 955–959.
11. Maes HH, Neale MC, Eaves LJ (1997) Genetic and environmental factors in relative body weight and human adiposity. Behav Genet 27: 325–351.
12. Haworth CM, Carnell S, Meaburn EL, Davis OS, Plomin R, et al. (2008) Increasing heritability of BMI and stronger associations with the FTO gene over childhood. Obesity (Silver Spring) 16: 2663–2668.
13. Wardle J, Carnell S, Haworth CM, Plomin R (2008) Evidence for a strong genetic influence on childhood adiposity despite the force of the obesogenic environment. Am J Clin Nutr 87: 398–404.
14. Parsons TJ, Power C, Logan S, Summerbell CD (1999) Childhood predictors of adult obesity: a systematic review. Int J Obes Relat Metab Disord 23 Suppl 8: S1–107.
15. Jiao H, Arner P, Hoffstedt J, Brodin D, Dubern B, et al. (2011) Genome wide association study identifies KCNMA1 contributing to human obesity. BMC Med Genomics 4: 51.
16. Wang K, Li WD, Zhang CK, Wang Z, Glessner JT, et al. (2011) A genome-wide association study on obesity and obesity-related traits. PLoS One 6: e18939.
17. Meyre D, Delplanque J, Chevre JC, Lecoeur C, Lobbens S, et al. (2009) Genome-wide association study for early-onset and morbid adult obesity identifies three new risk loci in European populations. Nat Genet 41: 157–159.
18. Paternoster L, Evans DM, Aagaard Nohr E, Holst C, Gaborieau V, et al. (2011) Genome-Wide Population-Based Association Study of Extremely Overweight Young Adults - The GOYA Study. PLoS One 6: e24303.
19. Cotsapas C, Speliotes EK, Hatoum IJ, Greenawalt DM, Dobrin R, et al. (2009) Common body mass index-associated variants confer risk of extreme obesity. Hum Mol Genet 18: 3502–3507.
20. Speliotes EK, Willer CJ, Berndt SI, Monda KL, Thorleifsson G, et al. (2010) Association analyses of 249,796 individuals reveal 18 new loci associated with body mass index. Nat Genet 42: 937–948.
21. Liu JZ, Medland SE, Wright MJ, Henders AK, Heath AC, et al. (2010) Genome-wide association study of height and body mass index in Australian twin families. Twin Res Hum Genet 13: 179–193.
22. Thorleifsson G, Walters GB, Gudbjartsson DF, Steinthorsdottir V, Sulem P, et al. (2009) Genome-wide association yields new sequence variants at seven loci that associate with measures of obesity. Nat Genet 41: 18–24.
23. Willer CJ, Speliotes EK, Loos RJ, Li S, Lindgren CM, et al. (2009) Six new loci associated with body mass index highlight a neuronal influence on body weight regulation. Nat Genet 41: 25–34.
24. Loos RJ, Lindgren CM, Li S, Wheeler E, Zhao JH, et al. (2008) Common variants near MC4R are associated with fat mass, weight and risk of obesity. Nat Genet 40: 768–775.
25. Fox CS, Heard-Costa N, Cupples LA, Dupuis J, Vasan RS, et al. (2007) Genome-wide association to body mass index and waist circumference: the Framingham Heart Study 100K project. BMC Med Genet 8 Suppl 1: S18.
26. Frayling TM, Timpson NJ, Weedon MN, Zeggini E, Freathy RM, et al. (2007) A common variant in the FTO gene is associated with body mass index and predisposes to childhood and adult obesity. Science 316: 889–894.
27. Dina C, Meyre D, Gallina S, Durand E, Korner A, et al. (2007) Variation in FTO contributes to childhood obesity and severe adult obesity. Nat Genet 39: 724–726.
28. Scuteri A, Sanna S, Chen WM, Uda M, Albai G, et al. (2007) Genome-wide association scan shows genetic variants in the FTO gene are associated with obesity-related traits. PLoS Genet 3: e115.
29. Chambers JC, Elliott P, Zabaneh D, Zhang W, Li Y, et al. (2008) Common genetic variation near MC4R is associated with waist circumference and insulin resistance. Nat Genet 40: 716–718.
30. Hinney A, Hebebrand J (2009) Three at one swoop! Obes Facts 2: 3–8.
31. Zhao J, Bradfield JP, Li M, Wang K, Zhang H, et al. (2009) The role of obesity-associated loci identified in genome-wide association studies in the determination of pediatric BMI. Obesity (Silver Spring) 17: 2254–2257.
32. den Hoed M, Ekelund U, Brage S, Grontved A, Zhao JH, et al. (2010) Genetic susceptibility to obesity and related traits in childhood and adolescence: influence of loci identified by genome-wide association studies. Diabetes 59: 2980–2988.
33. Hardy R, Wills AK, Wong A, Elks CE, Wareham NJ, et al. (2010) Life course variations in the associations between FTO and MC4R gene variants and body size. Hum Mol Genet 19: 545–552.
34. Elks CE, Loos RJ, Sharp SJ, Langenberg C, Ring SM, et al. (2010) Genetic markers of adult obesity risk are associated with greater early infancy weight gain and growth. PLoS Med 7: e1000284.
35. Heard-Costa NL, Zillikens MC, Monda KL, Johansson A, Harris TB, et al. (2009) NRXN3 is a novel locus for waist circumference: a genome-wide association study from the CHARGE Consortium. PLoS Genet 5: e1000539.
36. Lindgren CM, Heid IM, Randall JC, Lamina C, Steinthorsdottir V, et al. (2009) Genome-wide association scan meta-analysis identifies three Loci influencing adiposity and fat distribution. PLoS Genet 5: e1000508.
37. Borghi E, de Onis M, Garza C, Van den Broeck J, Frongillo EA, et al. (2006)
Construction of the World Health Organization child growth standards:selection of methods for attained growth curves. Stat Med 25: 247–265.
38. Preece MA, Baines MJ (1978) A new family of mathematical models describingthe human growth curve. Ann Hum Biol 5: 1–24.
39. Gasser T, Kohler W, Muller HG, Kneip A, Largo R, et al. (1984) Velocity andacceleration of height growth using kernel estimation. Ann Hum Biol 11: 397–
411.
40. Cole TJ, Donaldson MD, Ben-Shlomo Y (2010) SITAR–a useful instrument forgrowth curve analysis. Int J Epidemiol 39: 1558–1566.
41. Laird NM, Ware JH (1982) Random-effects models for longitudinal data.Biometrics 38: 963–974.
42. Milani S, Bossi A, Marubini E (1989) Individual growth curves and longitudinal
growth charts between 0 and 3 years. Acta Paediatr Scand Suppl 350: 95–104.
43. Goldstein H (1986) Efficient statistical modelling of longitudinal data. Ann Hum
Biol 13: 129–141.44. Rice JA, Silverman BW (1991) Estimating the Mean and Covariance Structure
Nonparametrically when the Data are Curves. Journal of the Royal Statistical
Society, Series B 53: 233–243.45. Donnelly CA, Laird NM, Ware JH (1995) Prediction and Creation of Smooth
Curves for Temporally Correlated Longitudinal Data. Journal of the AmericanStatistical Association 90: 984–989.
46. Newnham JP, Evans SF, Michael CA, Stanley FJ, Landau LL (1993) Effects of
frequent ultrasound during pregnancy: a randomised controlled trial. Lancet342: 887–891.
47. Williams LA, Evans SF, Newnham JP (1997) Prospective cohort study of factorsinfluencing the relative weights of the placenta and the newborn infant. British
Medical Journal 314: 1864–1868.48. Evans S, Newnham J, MacDonald W, Hall C (1996) Characterisation of the
possible effect on birthweight following frequent prenatal ultrasound examina-
tions. Early Human Development 45: 203–214.49. Huang RC, Burke V, Newnham JP, Stanley FJ, Kendall GE, et al. (2006)
Perinatal and childhood origins of cardiovascular disease. Int J Obes Res.50. Taal HR, St Pourcain B, Thiering E, Das S, Mook-Kanamori DO, et al. (2012)
Common variants at 12q15 and 12q24 are associated with infant head
circumference. Nat Genet 44: 532–538.51. Janssens AC, Aulchenko YS, Elefante S, Borsboom GJ, Steyerberg EW, et al.
(2006) Predictive testing for complex diseases using multiple genes: fact orfiction? Genet Med 8: 395–400.
52. Janssens AC, Moonesinghe R, Yang Q, Steyerberg EW, van Duijn CM, et al.(2007) The impact of genotype frequencies on the clinical validity of genomic
profiling for predicting common chronic diseases. Genet Med 9: 528–535.
53. Lachos VH, Ghosh P, Arellano-Valle RB (2010) Likelihood based inference forskew-normal independent linear mixed model. Statistica Sinica 20: 303–322.
54. Azzalini A, Capitanio A (1999) Statistical applications of the multivariate skewnormal distribution. Journal of the Royal Statistical Society: Series B (Statistical
Methodology) 61: 579–602.
55. Song PXK, Zhang PQA (2007) Maximum likelihood inference in robust linearmixed-effect models using multivariate t distributions. Statistica Sinica 17: 929–
943.56. Efron B, Tibshirani RJ (1994) An Introduction to the Bootstrap: Taylor &
Francis.57. Cheng J, Edwards LJ, Maldonado-Molina MM, Komro KA, Muller KE (2010)
Real longitudinal data analysis for real people: building a good enough mixed
model. Stat Med 29: 504–520.58. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, et al. (2006)
Principal components analysis corrects for stratification in genome-wideassociation studies. Nat Genet 38: 904–909.
59. Ihaka R, Gentleman R (1996) R: a language for data analysis and graphics.
Journal of Computational and Graphical Statistics 5: 299–314.
Statistical Methods for BMI in Genetic Studies
PLOS ONE | www.plosone.org 12 January 2013 | Volume 8 | Issue 1 | e53897
Appendix B: Additional Details of the Linear Mixed Model in Chapter Two
Saturated Model:

Females
> library(nlme)  # provides lme()
> # Cubic fixed effects in age (centred at 8 years); random intercept, linear and quadratic terms
> test.f <- lme(log(bmi) ~ I(age-8) + I((age-8)^2) + I((age-8)^3),
    data=data.f, method="ML", random = ~ I(age-8) + I((age-8)^2)| ID,
    na.action=na.omit)
> summary(test.f)
Linear mixed-effects model fit by maximum likelihood
 Data: data.f
        AIC       BIC   logLik
  -9594.429 -9524.204 4808.214

Random effects:
 Formula: ~I(age - 8) + I((age - 8)^2) | ID
 Structure: General positive-definite, Log-Cholesky parametrization
               StdDev      Corr
(Intercept)    0.129605923 (Intr) I(g-8)
I(age - 8)     0.012108440  0.772
I((age - 8)^2) 0.001200233 -0.690 -0.437
Residual       0.050449166

Fixed effects: log(bmi) ~ I(age - 8) + I((age - 8)^2) + I((age - 8)^3)
                    Value   Std.Error   DF  t-value p-value
(Intercept)     2.8173907 0.004965848 3641 567.3534       0
I(age - 8)      0.0344273 0.000619246 3641  55.5956       0
I((age - 8)^2)  0.0029519 0.000060840 3641  48.5194       0
I((age - 8)^3) -0.0003112 0.000008402 3641 -37.0404       0
 Correlation:
               (Intr) I(g-8) I((-8)^2
I(age - 8)      0.524
I((age - 8)^2) -0.610 -0.041
I((age - 8)^3)  0.052 -0.625 -0.347

Standardized Within-Group Residuals:
         Min           Q1          Med           Q3          Max
-5.488724150 -0.487264811  0.004883059  0.456330949  6.729748021

Number of Observations: 4377
Number of Groups: 733

Males
> test.m <- lme(log(bmi) ~ I(age-8) + I((age-8)^2) + I((age-8)^3),
    data=data.m, method="ML", random = ~ I(age-8) + I((age-8)^2)| ID,
    na.action=na.omit)
> summary(test.m)
Linear mixed-effects model fit by maximum likelihood
 Data: data.m
        AIC       BIC  logLik
  -10091.98 -10021.19 5056.99

Random effects:
 Formula: ~I(age - 8) + I((age - 8)^2) | ID
 Structure: General positive-definite, Log-Cholesky parametrization
               StdDev      Corr
(Intercept)    0.124336118 (Intr) I(g-8)
I(age - 8)     0.012422263  0.758
I((age - 8)^2) 0.001185971 -0.684 -0.367
Residual       0.050843578

Fixed effects: log(bmi) ~ I(age - 8) + I((age - 8)^2) + I((age - 8)^3)
                    Value   Std.Error   DF  t-value p-value
(Intercept)     2.8088458 0.004665463 3833 602.0507       0
I(age - 8)      0.0290900 0.000612958 3833  47.4584       0
I((age - 8)^2)  0.0031807 0.000059063 3833  53.8518       0
I((age - 8)^3) -0.0002872 0.000008322 3833 -34.5110       0
 Correlation:
               (Intr) I(g-8) I((-8)^2
I(age - 8)      0.516
I((age - 8)^2) -0.611 -0.011
I((age - 8)^3)  0.056 -0.618 -0.339

Standardized Within-Group Residuals:
        Min          Q1         Med          Q3         Max
-4.54398186 -0.44495072 -0.01160751  0.44978391  4.32696447

Number of Observations: 4609
Number of Groups: 773

Testing the random effects (using method=REML):

> # Model 3: uncorrelated random intercept, linear and quadratic age effects
> base.mod.3.f = lme(log(bmi) ~ I(age-8) + I((age-8)^2) + I((age-8)^3),
    data=data.f, method="REML",
    random = list(ID=pdDiag(~ I(age-8) + I((age-8)^2))),
    na.action=na.omit)
> # Model 2: uncorrelated random intercept and linear age effect
> base.mod.2.f = lme(log(bmi) ~ I(age-8) + I((age-8)^2) + I((age-8)^3),
    data=data.f, method="REML",
    random = list(ID=pdDiag(~ I(age-8))),
    na.action=na.omit)
> # Model 1: random intercept only
> base.mod.1.f = lme(log(bmi) ~ I(age-8) + I((age-8)^2) + I((age-8)^3),
    data=data.f, method="REML",
    random = list(ID=pdDiag(~ 1)),
    na.action=na.omit)
> sapply(list(base.mod.1.f,base.mod.2.f,base.mod.3.f),AIC)
> sapply(list(base.mod.1.f,base.mod.2.f,base.mod.3.f),BIC)
> sapply(list(base.mod.1.f,base.mod.2.f,base.mod.3.f),logLik)
Below is a table providing statistics for the models that were tested (the same code as
above was used for males):

Model  Random Effects   ρ Random Effect  ρ error  logLik     BIC        AIC
Female
1      Intercept        Independent      Null     3647.447  -7244.596  -7282.895
2      Intercept, age   Independent      Null     4294.200  -8529.718  -8574.400
3      Int, age, age^2  Independent      Null     4392.848  -8718.630  -8769.695
Male
1      Intercept        Independent      Null     3812.153  -7573.697  -7612.306
2      Intercept, age   Independent      Null     4521.136  -8983.229  -9028.273
3      Int, age, age^2  Independent      Null     4626.609  -9185.739  -9237.218

It appears that the random effect for age^2 is necessary for both males and females.
To calculate the LRT comparing model.3 and model.2, for example:

Females
> anova(base.mod.3.f, base.mod.2.f)
             Model df       AIC       BIC   logLik   Test  L.Ratio p-value
base.mod.3.f     1  8 -8769.695 -8718.630 4392.848
base.mod.2.f     2  7 -8574.400 -8529.718 4294.200 1 vs 2 197.2952  <.0001

Males
> anova(base.mod.3.m, base.mod.2.m)
             Model df       AIC       BIC   logLik   Test  L.Ratio p-value
base.mod.3.m     1  8 -9237.218 -9185.739 4626.609
base.mod.2.m     2  7 -9028.273 -8983.229 4521.136 1 vs 2 210.9452  <.0001

Testing the correlation structure of the error terms:

Females
> base.mod.3.f.ar1 = update(base.mod.3.f, correlation=corAR1())
> # base.mod.3b.f (correlated random effects) is defined in the next section
> base.mod.3b.f.cs = update(base.mod.3b.f,
    correlation=corCompSymm(form=~1|ID))
> anova(base.mod.3.f, base.mod.3.f.ar1)
                 Model df       AIC       BIC   logLik   Test  L.Ratio p-value
base.mod.3.f         1  8 -8769.695 -8718.630 4392.848
base.mod.3.f.ar1     2  9 -9060.715 -9003.267 4539.358 1 vs 2 293.0203  <.0001
> anova(base.mod.3b.f, base.mod.3b.f.cs)
                 Model df       AIC       BIC   logLik   Test      L.Ratio p-value
base.mod.3b.f        1 11 -9531.723 -9461.508 4776.862
base.mod.3b.f.cs     2 12 -9529.723 -9453.125 4776.862 1 vs 2 5.989739e-06   0.998

Males
> base.mod.3.m.ar1 = update(base.mod.3.m, correlation=corAR1())
> base.mod.3.m.cs = update(base.mod.3.m,
    correlation=corCompSymm(form=~1|ID))
> anova(base.mod.3.m, base.mod.3.m.ar1)
                 Model df       AIC       BIC   logLik   Test  L.Ratio p-value
base.mod.3.m         1  8 -9237.218 -9185.739 4626.609
base.mod.3.m.ar1     2  9 -9556.990 -9499.076 4787.495 1 vs 2 321.7723  <.0001
> anova(base.mod.3.m, base.mod.3.m.cs)
                Model df       AIC       BIC   logLik   Test      L.Ratio p-value
base.mod.3.m        1  8 -9237.218 -9185.739 4626.609
base.mod.3.m.cs     2  9 -9235.218 -9177.304 4626.609 1 vs 2 6.006303e-09  0.9999

It appears that the corAR1() correlation structure is needed for both males and females.

Testing the correlation between random effects:

Females
> base.mod.3b.f = lme(log(bmi) ~ I(age-8) + I((age-8)^2) + I((age-8)^3),
    data=data.f, method="REML", random = ~ I(age-8) + I((age-8)^2)|ID,
    na.action=na.omit)
> anova(base.mod.3.f, base.mod.3b.f)
              Model df       AIC       BIC   logLik   Test  L.Ratio p-value
base.mod.3.f      1  8 -8769.695 -8718.630 4392.848
base.mod.3b.f     2 11 -9531.723 -9461.508 4776.862 1 vs 2 768.0278  <.0001

Males
> base.mod.3b.m = lme(log(bmi) ~ I(age-8) + I((age-8)^2) + I((age-8)^3),
    data=data.m, method="REML", random = ~ I(age-8) + I((age-8)^2)|ID,
    correlation=corAR1(), na.action=na.omit)
> anova(base.mod.3.m.ar1, base.mod.3b.m)
                 Model df       AIC        BIC   logLik   Test  L.Ratio p-value
base.mod.3.m.ar1     1  9  -9556.99  -9499.076 4787.495
base.mod.3b.m        2 12 -10119.56 -10042.340 5071.779 1 vs 2 568.5681  <.0001

The LRT suggests that a correlation between the random intercepts and slopes is
necessary. Thus, the chosen models are base.mod.3b.f.ar1 for females and base.mod.3b.m
for males.
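
The likelihood ratio statistics and AIC values reported by anova() in this appendix can be checked by hand from the log-likelihoods and parameter counts. The sketch below does this for the first female comparison (base.mod.3.f against base.mod.2.f); the variable names are illustrative, and the naive chi-square reference distribution is an assumption that is usually regarded as conservative when the tested variance lies on the boundary of its parameter space.

```r
# Log-likelihoods (REML) and parameter counts copied from the anova() output
ll3 <- 4392.848; df3 <- 8   # base.mod.3.f: random intercept, age and age^2
ll2 <- 4294.200; df2 <- 7   # base.mod.2.f: random intercept and age

lrt  <- 2 * (ll3 - ll2)                                  # likelihood ratio statistic
pval <- pchisq(lrt, df = df3 - df2, lower.tail = FALSE)  # naive chi-square p-value
aic3 <- -2 * ll3 + 2 * df3                               # AIC = -2 logLik + 2 * parameters

round(lrt, 3)    # 197.296, matching the reported 197.2952 up to rounding
round(aic3, 3)   # -8769.696, matching the reported -8769.695 up to rounding
pval < 1e-4      # TRUE, consistent with the reported "<.0001"
```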
Appendix C: Publication Arising from the Research in Chapter Four
Title:
Robustness of the linear mixed effects model to error distribution assumptions and the
consequences for genome-wide association studies.

Authors:
Nicole M Warrington1,2*, Kate Tilling3,4, Laura D Howe3,4, Lavinia Paternoster3,4, Craig E
Pennell1, Yan Yan Wu5, Laurent Briollais5

Affiliations:
1 School of Women’s and Infants’ Health, The University of Western Australia, Perth,
Western Australia, Australia
2 University of Queensland Diamantina Institute, Translational Research Institute, Brisbane,
Queensland, Australia
3 School of Social and Community Medicine, University of Bristol, Bristol, UK
4 MRC Integrative Epidemiology Unit at the University of Bristol, Bristol, UK
5 Lunenfeld-Tanenbaum Research Institute, Mount Sinai Hospital, Toronto, Ontario, Canada
*Corresponding Author: Nicole Warrington, University of Queensland Diamantina Institute,
Translational Research Institute, 37 Kent St, Woolloongabba, Brisbane, Queensland,
Australia; Phone: +61 7 3443 7044; Fax: +61 7 344 36966; Email: [email protected]

Word count of abstract: 244
Word count of body (excluding tables, figures and references): 7,612
Number of tables: 4
Number of figures: 5
Short title: Robustness of the LMM to distribution assumptions in GWAS

Abstract:
Genome-wide association studies have been successful in uncovering novel genetic variants
that are associated with disease status or cross-sectional phenotypic traits. Researchers are
beginning to investigate how genes play a role in the development of a trait over time.
Linear mixed effects models (LMM) are commonly used to model longitudinal data;
however, it is unclear if the failure to meet the model's distributional assumptions will affect
the conclusions when conducting a genome-wide association study. In an extensive
simulation study, we compare coverage probabilities, bias, type 1 error rates and statistical
power when the error of the LMM is either heteroscedastic or has a non-Gaussian
distribution. We conclude that the model is robust to misspecification if the same function
of age is included in the fixed and random effects. However, type 1 error of the genetic
effect over time is inflated, regardless of the model misspecification, if the polynomial
function for age in the fixed and random effects differs. In situations where the model will
not converge with a high order polynomial function in the random effects, a reduced
function can be used but a robust standard error needs to be calculated to avoid inflation of
the type 1 error. As an illustration, an LMM was applied to longitudinal body mass index
(BMI) data over childhood in the ALSPAC cohort; the results emphasised the need for the
robust standard error to ensure correct inference of associations of longitudinal BMI with
chromosome 16 single nucleotide polymorphisms.

Key words: mixed model; robustness; misspecification; genome-wide association;
longitudinal studies; ALSPAC

1. Introduction:
Over recent years, the study of population genetics has progressed from candidate gene and
linkage studies over relatively small regions of the genome to whole genome association
analyses. These genome-wide association studies (GWAS) are designed to search the entire
genome for single nucleotide polymorphisms (SNPs) that are associated with a disease or
trait of interest. If SNPs are found to be associated, they are then considered to mark a
region of the genome that influences the risk of disease or affects the levels of a trait. In
general, very small effects are detected so large sample sizes are required. This advance in
the scale of genetic analyses has transformed the field from hypothesis-driven research to a
hypothesis-free approach, which has required additional statistical methods to be
developed to ensure there is a balance between acceptable levels of power and the chance
of inflating the type 1 error. Given the cost of conducting these studies, in terms of both
monetary costs for genotyping samples and computational costs for the analysis, it is
important that appropriate analyses are conducted from the outset.

To date, most GWAS have focused on case/control studies of particular diseases or
cross-sectional measurements of phenotypic traits. These study designs typically use
relatively simple statistical techniques, such as chi-square tests or linear (or logistic)
regression models, to look at the association between a trait and each of the ~2.5 million
SNPs. There are now over 1,500 published studies focusing on 250 traits using analyses of
this kind (Hindorff et al. 2010). However, researchers are beginning to focus on more
complex analyses to uncover additional genetic loci and reduce the currently unexplained
heritability of these traits. One area of extension is to use longitudinal studies, with
repeated measures on each individual in the study, to understand how SNPs affect changes
over time of a particular phenotype (Kerner, North, and Fallin 2009; Sikorska et al. 2013;
Smith et al. 2010). Several well-established statistical methods are commonly used for
repeated measures data to take into account the non-independence of measurements
within an individual. For continuous traits, the most popular statistical method is the linear
mixed effects model (LMM) of Laird and Ware (1982). This method can be
computationally intensive as the model can account for linear or non-linear trajectories for
the outcome of interest over time, correlation between measures at the starting point
(intercept) and change over time (slope, or non-linear trajectory) within an individual and
adjustment for both time-independent and time-dependent covariates.

In LMMs, the usual assumptions made about the random effects and error distributions
include: the random effects and error terms are normally distributed, the random effects
are independent of the error term and the error term has homoscedastic variance (Laird
and Ware 1982). In studies utilizing this method to assess the association of a SNP with the
trajectory, the fixed effect estimates are often of most interest; the random effects and
correlation structure at the individual level are necessary to provide an accurate fit of the
model to the data, in addition to providing appropriate test statistics, but are treated as
nuisance parameters and are often difficult to interpret. There have been a number of
studies investigating whether violations of the assumptions about the random effects and
error terms affect the maximum likelihood inference of the fixed effect parameters and
their variance estimates; several manuscripts have shown that the fixed effects estimates
are robust to non-Gaussian random effects distribution (Zhang and Davidian 2001; Verbeke
and Lesaffre 1997), non-Gaussian or heteroscedastic error distribution (Jacqmin-Gadda et al.
2007) and that the population fixed effects are robust to misspecified covariance structure
(Taylor, Cumberland, and Sy 1994), but the individual level predictions are not (Taylor and
Law 1998). Jacqmin-Gadda et al. (2007) show the fixed effects
estimates are not robust to error variance that is dependent on a covariate in the model
that interacts with time. Liang and Zeger (1986) demonstrated that a robust
sandwich estimator (Royall 1986) can correct for biased variance estimates of the fixed
effects when the covariance structure is not correctly specified. There has not been any
investigation, to our knowledge, into how any of these model misspecifications affect the
power and type 1 error in high dimensional studies, for example when running an LMM on a
genome-wide scale, and what the value of the robust variance estimator is in this context.

The aim of this study is to assess by simulations whether misspecification of the error term,
with either non-Gaussian error distributions or non-constant error variance, in a complex
longitudinal model with non-linear trajectories will affect: 1) the coverage probabilities of
the 95% confidence interval of the fixed effects parameter estimates; 2) the bias of the fixed
effects parameter estimates; 3) the type 1 error of SNP detection in a GWAS; or 4) the
statistical power to detect association. We also examined whether our conclusions differ
according to minor allele frequency (MAF) for the SNPs or sample size of the investigated
cohort.

2. Motivating example:
The World Health Organization defines obesity as “abnormal or excessive fat accumulation
that presents a risk to health” (World Health Organization 2012). Obesity is a medical
condition which increases an individual’s risk of health problems such as cardiovascular
disease, type 2 diabetes and some cancers and therefore reduces life expectancy (Haslam
and James 2005). The prevalence of obesity has been increasing in recent decades in
developed countries, particularly in children. Body mass index (BMI; calculated as weight
(kg)/height$^2$ (m$^2$)) is commonly used to define overweight and obesity, with appropriate
cut-offs defined for both children (Cole et al. 2000) and adults (WHO 2000). Childhood obesity
is one of the strongest predictors of adult obesity (Kindblom et al. 2009; Serdula et al. 1993).
Although the growing prevalence of obesity is most likely to be due to the increasing energy
intake and decreasing energy expenditure, twin and adoption studies have provided
evidence that BMI is heritable (Maes, Neale, and Eaves 1997; Haworth et al. 2008; Wardle et
al. 2008; Parsons et al. 1999). Recent GWAS have begun to uncover some plausible genetic
loci contributing to higher BMI (Speliotes et al. 2010; Liu et al. 2010; Thorleifsson et al. 2009;
Willer et al. 2009; Loos et al. 2008; Fox et al. 2007; Frayling et al. 2007) and obesity in
children (Bradfield et al. 2012), with 34 new loci identified. However, none of the studies to
date provide information regarding the genetic determinants of the rate of BMI growth over
childhood, which leads to obesity.

The Avon Longitudinal Study of Parents and Children (ALSPAC) (Boyd et al. 2013; Fraser et
al. 2013) is a birth cohort study; 14,541 pregnant women in the former county of Avon, UK,
were recruited into the study if they had an expected delivery date between 1st April 1991
and 31st December 1992. From birth to five years, length and weight measurements were
extracted from health visitor records, with up to four measurements taken on average at six
weeks, 10, 21, and 48 months of age. For a random 10% of the cohort, length and height
measurements were taken in eight research clinic visits held between the ages of four
months and five years of age. From age seven years, all children were invited to
annual research clinics until age 11 and biannual research clinics thereafter. Details of
the measuring equipment used in the clinics are described elsewhere (Howe et al. 2010). In
addition, parent-reported child height and weight were also available from questionnaires
(27% of measurements). Whilst the measurements from routine health care have previously
been shown to be accurate in this cohort (Howe, Tilling, and Lawlor 2009), parental report
of children’s height tends to be overestimated while weight tends to be underestimated
(Dubois and Girad 2007). Ethical approval for the study was obtained from the ALSPAC
Ethics and Law Committee and the Local Research Ethics Committees. Please note that the
study website contains details of all the data that is available through a fully searchable data
dictionary (http://www.bris.ac.uk/alspac/researchers/data-access/data-dictionary/).

A subset of 7,916 participants were used for analysis based on the following inclusion
criteria: at least one parent of European descent, singleton birth, unrelated to anyone in the
sample, genome-wide genotype data, and at least one measure of BMI throughout
childhood. Participants have a median of 9 BMI measurements between 1 and 15 years of
age (interquartile range 5-12, range 1-29 measurements). Children tend to have rapidly
increasing BMI from birth to approximately 9 months of age, when they reach their
adiposity peak; BMI then decreases until around the age of 5-7 years at adiposity rebound
and then steadily increases again until after puberty, when it tends to plateau through
adulthood. There is a large amount of variability between individuals for both intercept and
slope.

The primary research question is to identify SNPs that are associated with average BMI and
change in BMI over childhood and adolescence in the ALSPAC data. An LMM was used to
appropriately model the longitudinal trajectory over childhood, to account for the large
correlation between each of the random effects parameters, to adjust for additional
covariates such as the source of the height/weight measurements (clinic or questionnaire)
and to allow data to be missing at random across childhood. The general form of the model
is as follows:

\[ Y_i = X_i \beta + Z_i b_i + \varepsilon_i \tag{1} \]
where $Y_i$ is the response vector for the $i$th individual, $\beta$ is the vector of fixed effects and
$b_i \sim N(0, \Sigma)$ is the vector of subject specific random effects, $X_i$ and $Z_i$ are the fixed effect and
random effect regressor matrices respectively and $\varepsilon_i \sim N(0, \sigma^2)$ is the within subject error
vector. When applying this model to the ALSPAC data, the best model fit included a cubic
polynomial of mean centred age (centred at age 8 years) in the fixed effects, a quadratic
polynomial of mean centred age in the random effects and a continuous autoregressive
correlation structure of order one for the covariance of the within-subject errors. Hence, the
final model for both females and males was:
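
\[
\begin{aligned}
\mathrm{BMI}_{ij} = {} & \beta_0 + \beta_1 t_{ij} + \beta_2 t_{ij}^2 + \beta_3 t_{ij}^3 + \beta_4 \mathrm{MS}_{ij} + \beta_5 \mathrm{SNP}_i + \beta_6 t_{ij} \mathrm{SNP}_i \\
& + \beta_7 t_{ij}^2 \mathrm{SNP}_i + \beta_8 t_{ij}^3 \mathrm{SNP}_i + b_{i0} + b_{i1} t_{ij} + b_{i2} t_{ij}^2 + \varepsilon_{ij}
\end{aligned}
\tag{2}
\]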

where $\mathrm{MS}_{ij}$ is the measurement source (i.e. clinical visit or questionnaire) of individual $i$ at
time $j$ and $t_{ij}$ is the age (centred at 8). Therefore $\beta_0$ is the population intercept (i.e. mean
BMI at age 8), $\beta_1$-$\beta_3$ are the fixed effects for the cubic function of age, $\beta_4$ is the
effect of the measurement source, $\beta_5$ is the change in the mean BMI at 8 years of age for each
additional copy of the minor allele, $\beta_6$ is the SNP by linear age effect, and $\beta_7$ and $\beta_8$ are
the SNP by quadratic and cubic age effects respectively.
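
As a rough illustration of how model (2) might be specified in R, the sketch below uses lme() from the nlme package. It is not the authors' code: the data frame `dat`, its column names (bmi, age, ms, snp, id), the helper `fit_model2` and the derived variable `tc` are all hypothetical, and corCAR1 is assumed here to correspond to the continuous autoregressive structure of order one described above.

```r
library(nlme)

# Fixed part of model (2): cubic in centred age (tc), measurement source (ms),
# SNP main effect and SNP-by-age interactions (beta_0 to beta_8).
fixed_part <- bmi ~ (tc + I(tc^2) + I(tc^3)) * snp + ms

# Hypothetical wrapper; 'dat' needs columns bmi, age, ms, snp (0/1/2) and id.
fit_model2 <- function(dat) {
  dat$tc <- dat$age - 8                          # centre age at 8 years
  lme(fixed_part,
      random      = ~ tc + I(tc^2) | id,         # b_i0, b_i1, b_i2 (correlated)
      correlation = corCAR1(form = ~ tc | id),   # continuous-time AR(1) errors
      data = dat, method = "ML", na.action = na.omit)
}

# The eight non-intercept fixed-effect terms, in the order R builds them
attr(terms(fixed_part), "term.labels")
```

Keeping the formula as a separate object makes it easy to check that the fixed part contains exactly the terms of model (2) before any data are fitted.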

Due to the nature of the data collection, which is often complex in large cohort based
studies, we found that the model assumptions were not met for the following reasons:
1. The questionnaire measures have previously been shown to have greater variability
than clinic measured height and weight (Dubois and Girad 2007); therefore we had
variability that was dependent on a covariate in the model.
2. There were only questionnaire measures available around the nadir of the trajectory
(also known as the adiposity rebound), which meant we had greater variability
around the rebound.
3. The variability within individuals changes over time; particularly with increased
variability around puberty and into adolescence.
4. BMI also has a non-Gaussian error distribution. This is in part due to the increasing
variability between individuals over time, with some individuals having rapidly
increasing BMI while others remain relatively consistent.
In the following, we investigate the robustness of the maximum likelihood inference for the
fixed effects, the type 1 error and the power for detecting an association with the SNP when
the error distribution is misspecified due to the above intricacies of the data.

3. Simulation study:
We carried out extensive simulations to investigate the effects on the LMM when the error
term (also called the level-1 residual, or the occasion-level residual) in the model was
non-Gaussian or had a non-constant variance. In each of the simulation scenarios, we set the
non-genetic fixed effects parameters ($\beta_0$-$\beta_4$ from model (2)) and the variance-covariance
matrix similar to those coming from the fitted model for BMI adjusting for the FTO
rs1121980 SNP in the ALSPAC study; these can be found in Table 1. The measurement source,
which is a fixed effect in the LMM and used in the heteroscedastic error simulations, was a
randomly generated binary variable for each individual at each time point with distribution
throughout the ages similar to the distribution in the ALSPAC cohort (percent questionnaire
measurements per follow-up year: year 1 = 40%, year 2 = 20%, year 3 = 40%, year 4 = 10%,
year 5 = 60%, year 6 = 99%, year 7 = 10%, year 8 = 0%, year 9 = 0%, year 10 = 10%, year 11 =
0%, year 12 = 0%, year 13 = 30%, year 14 = 0%, year 15 = 0%).
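
A minimal sketch of how such a measurement-source indicator could be generated is given below. The per-year percentages are those listed above; the helper name `simulate_source`, the matrix layout (one row per individual, one column per follow-up year) and the seed are assumptions made for illustration.

```r
# Per-year probability that a measurement comes from a questionnaire
# (values copied from the text; years 1 to 15)
p_quest <- c(0.40, 0.20, 0.40, 0.10, 0.60, 0.99, 0.10, 0.00,
             0.00, 0.10, 0.00, 0.00, 0.30, 0.00, 0.00)

# Hypothetical helper: n individuals x 15 yearly occasions, 1 = questionnaire
simulate_source <- function(n, p = p_quest) {
  # matrix() fills column by column, so the first n draws use p[1], and so on
  ms <- matrix(rbinom(n * length(p), size = 1, prob = rep(p, each = n)),
               nrow = n, ncol = length(p))
  colnames(ms) <- paste0("year", seq_along(p))
  ms
}

set.seed(1)                 # arbitrary seed, for reproducibility only
ms <- simulate_source(1000)
round(colMeans(ms), 2)      # empirical proportions, close to p_quest
```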

We also investigated the fixed effect estimation for various sample sizes, minor allele
frequencies of the SNP and the SNP effect sizes:
1. Sample size: two levels; N=1,000 and N=3,000
2. Minor allele frequency: four levels; 0.1, 0.2, 0.3 and 0.4
3. Effect sizes: two combinations; $\beta_5 = 0.6$, $\beta_6 = 0.15$, $\beta_7 = -0.000752$ and
$\beta_8 = -0.000380$ (alternative hypothesis) or $\beta_5 = \beta_6 = \beta_7 = \beta_8 = 0$ (null hypothesis). The
alternative hypothesis effect sizes for $\beta_5$ and $\beta_6$ were chosen to give 80% power of
detection with the larger sample size; the effect sizes for $\beta_7$ and $\beta_8$ were similar to
those coming from the fitted model for BMI adjusting for the FTO rs1121980 SNP in
the ALSPAC study.

3.1 Sampling Designs:
As many longitudinal cohorts have different sampling designs, some with variable amounts
of missing time points and missing observations at each time point, we investigated five
different sampling designs:
1. Sparse complete: $n_i = 8$ measures per person with few measures around the
adiposity rebound; times of measures are 1, 2, 3, 5, 8, 10, 13, 15
2. Intense complete: $n_i = 14$ measures per person with multiple measures around the
adiposity rebound; times of measures are 1, 2, 3, 3.5, 4, 4.5, 5, 5.5, 6, 7, 9, 11, 13, 15
3. Equal unbalanced: $n_i = 1$ to 15 measures per person between 1 and 15 years with a
mean of 9 measures (proportion of missingness = 0.4 across whole age range)
4. Unbalanced with more samples around the adiposity rebound: $n_i = 1$ to 15 measures
per person between 1 and 15 years with a mean of 9 measures; proportion of
missingness around adiposity rebound of 0.2 and 0.45 outside the 5 to 7 year age
range (average proportion of missingness over whole age range is 0.4)
5. Unbalanced with fewer samples around the adiposity rebound: $n_i = 1$ to 15 measures
per person between 1 and 15 years with a mean of 9 measures; proportion of
missingness around adiposity rebound of 0.6 and 0.35 outside the 5 to 7 year age
range (average proportion of missingness over whole age range is 0.4)
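
The missingness proportions quoted for designs 3 to 5 can be checked with a short calculation, and the same probabilities can drive a simple missing-completely-at-random mechanism. The sketch below assumes yearly occasions at ages 1 to 15 and treats ages 5, 6 and 7 as the adiposity rebound window; the helper `simulate_observed` and the seed are hypothetical. It does not handle individuals who end up with no observations at all, who would need to be redrawn to respect the stated minimum of one measure.

```r
ages <- 1:15
in_rebound <- ages >= 5 & ages <= 7        # the 5 to 7 year window (assumed inclusive)

# Per-year probability that a measurement is missing under designs 3 to 5
p_miss <- list(
  equal        = rep(0.40, length(ages)),
  more_rebound = ifelse(in_rebound, 0.20, 0.45),
  less_rebound = ifelse(in_rebound, 0.60, 0.35)
)
sapply(p_miss, mean)                       # each averages 0.4 over ages 1 to 15

# Hypothetical helper: TRUE where an individual is observed at a given age
simulate_observed <- function(n, p) {
  matrix(runif(n * length(p)) >= rep(p, each = n), nrow = n)
}

set.seed(2)                                # arbitrary seed
obs <- simulate_observed(3000, p_miss$more_rebound)
mean(rowSums(obs))                         # about 9 measures per person
```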
The first two designs with complete data at each follow-up assume that every individual had 253
the exact same age at follow-up (i.e. came into clinic on their birthday), whereas the other 254
three designs are more representative of longitudinal studies where the actual age of 255
measurement varies between individuals by up to a year (i.e. came into clinic either 6 256
months before or after a birthday). We assume data is missing completely at random, that is 257
that the probability that an observation is missing for a given individual is independent of all 258
other observed data. The proportion of missingness simulated across the whole range (i.e. 259
0.4) was equivalent to the amount of missing data observed in the ALSPAC cohort under the 260
assumption that all individuals could have been measured yearly. We used a fully factorial 261
design for the simulations with the 3 data characteristics and the 5 sampling designs. 262
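
The missingness proportions quoted for designs 4 and 5 can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes one scheduled measure per year at ages 1 to 15 and treats ages 5, 6 and 7 as the adiposity rebound window; the function name and arguments are hypothetical and are not taken from the original simulation code.

```python
def mean_missingness(p_inside, p_outside, ages=range(1, 16), window=(5, 7)):
    """Average per-visit missingness over a yearly schedule (assumed ages 1..15).

    p_inside  : probability a measure is missing inside the rebound window
    p_outside : probability a measure is missing outside the window
    """
    ages = list(ages)
    probs = [p_inside if window[0] <= age <= window[1] else p_outside
             for age in ages]
    return sum(probs) / len(probs)


# Design 4: more samples (less missingness) around the adiposity rebound.
design4 = mean_missingness(0.20, 0.45)
# Design 5: fewer samples (more missingness) around the adiposity rebound.
design5 = mean_missingness(0.60, 0.35)
# Expected number of measures per person when 40% of 15 yearly visits are missing.
expected_measures = 15 * (1 - 0.4)
```

Under these assumptions both designs average 0.4 missingness over the whole age range, and the expected number of measures per person is 9, which agrees with the means stated in the design descriptions.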

3.2 Models for data generation:
Standard linear mixed model:
Data were generated with Gaussian random effects and a Gaussian error distribution to
validate the estimation method.
Non-Gaussian error:
Three error distributions were investigated:
1. t-distribution: t with 5 degrees of freedom
2. Skew-normal distribution: SN(1.0632, 40)
3. Asymmetric mixture of two Gaussian distributions: 0.3N(-0.67, 12) + 0.7N(0.5, 0.32)
Heteroscedastic error:
Three cases were studied:
1. Variance dependent on a covariate: $\mathrm{Var}(e_{ij}) = \sigma_e^2 a^{X_{ij}}$,
where $\sigma_e^2$ = 1.131, a = 1.500 and $X_{ij}$ = 1 if the measure was from a questionnaire
and 0 if the measure was from a follow-up clinic
2. Variance greater at the adiposity rebound: $\mathrm{Var}(e_{ij}) = \sigma_e^2 a^{X_{ij}}$,
where $\sigma_e^2$ = 1.131, a = 1.500 and $X_{ij}$ = 1 if the measure was between 5 and 7 years
and 0 if not
3. Variance increasing over time: $\mathrm{Var}(e_{ij}) = \sigma_e^2 a^{t_{ij}}$,
where $\sigma_e^2$ = 1.131 and a = 1.150
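
To make the three heteroscedastic cases concrete, the following sketch evaluates the error variance and draws a single Gaussian error with that variance. It assumes the variance takes the exponential form sigma2_e * a ** x, with x being either the 0/1 indicator or age in years; the function names are hypothetical and this is not the code used in the original simulations.

```python
import random
from math import sqrt

SIGMA2_E = 1.131  # baseline error variance quoted in the text


def error_variance(x, a, sigma2_e=SIGMA2_E):
    """Error variance sigma2_e * a**x (assumed form of the variance function)."""
    return sigma2_e * a ** x


def draw_error(x, a, rng=random):
    """Draw one Gaussian error whose variance depends on x."""
    return rng.gauss(0.0, sqrt(error_variance(x, a)))


# Cases 1 and 2: binary indicator (questionnaire measure, or age between 5 and 7).
var_indicator_off = error_variance(0, 1.5)   # equals the baseline variance
var_indicator_on = error_variance(1, 1.5)    # baseline variance multiplied by 1.5

# Case 3: variance grows multiplicatively with each year of age.
var_age_1 = error_variance(1, 1.15)
var_age_15 = error_variance(15, 1.15)
```

With these values the indicator raises the error variance from 1.131 to about 1.70, whereas under the third case the variance at 15 years is roughly seven times the variance at 1 year, which gives a sense of why that scenario is the most demanding one in the results.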

3.3 Data generation:
We simulated 1,000 datasets under the alternative hypothesis (β5 = 0.6 and β6 = 0.15) to
assess coverage probabilities, bias and power, and 5,000 datasets under the null hypothesis
(β5 = 0 and β6 = 0) to assess type 1 error at α = 0.05. Each SNP (coded as 0, 1, 2) was
incorporated into the model assuming an additive genetic model, whereby each additional
minor allele increases BMI by an equal amount. We were primarily interested in estimating
the SNP main effect, β5, which represents the increase in mean BMI at 8 years of age
for each additional copy of the minor allele, and the SNP by age effect, β6, which represents
the effect on the mean linear increase of BMI (slope) for each additional minor allele. We
calculated a robust standard error, and corresponding p-value, for each fixed effect parameter
from the following robust (sandwich) covariance estimator:

$$(X'V^{-1}X)^{-1}\left(\sum_{i=1}^{S} X_i'V_i^{-1}\varepsilon_i\varepsilon_i'V_i^{-1}X_i\right)(X'V^{-1}X)^{-1}$$

Where:
X is the fixed effect regressor matrix from equation (1)
V is the variance of Y from equation (1)
$\varepsilon_i = y_i - X_i\hat{\beta}$
S is the number of subjects and i is the ith subject
In addition to the individual fixed effects parameters, we conducted a Wald test to assess
whether the overall SNP effect was affected by the misspecification. The Wald test was
performed using the General Linear Hypothesis approach (McDonald 1975). This approach is
based on the normal approximation for maximum likelihood estimators using the estimated
variance-covariance matrix. The hypothesis is specified through a constant matrix L applied to
the fixed effects of the model such that H0: Lβ = m, where m contains the hypothesized values.
The estimates of the fixed effects, $\hat{\beta}$, asymptotically follow a multivariate normal
distribution, $\hat{\beta} \sim N(\beta, \mathrm{cov}(\hat{\beta}))$, by the Central Limit
Theorem, such that the linear form also asymptotically follows a multivariate normal distribution:

$$L\hat{\beta} \sim N\left(L\beta,\ L\,\mathrm{cov}(\hat{\beta})\,L'\right) \qquad (3)$$

Thus the 95% confidence interval and corresponding P-value for the hypothesized value can
be obtained accordingly. We tested whether the parameters for the SNP were
simultaneously equal to zero. It is computationally intensive to calculate a robust estimate
for the Wald test; for example, the robust standard error for the fixed effects takes
approximately 7 minutes for the rs1121980 FTO SNP in the ALSPAC data, and the robust
standard error for the global Wald test takes approximately an additional 3 minutes. These
computational times decrease exponentially as the sample size and the number of repeated
measures per individual decrease; however, they may not be scalable to a GWAS. To
investigate whether a robust standard error would be beneficial for the global Wald test, we
selected the scenario where the inflation was greatest and calculated the robust estimates
for all the simulations in this scenario. All analyses were conducted in R version 2.12.1 (Ihaka
1996) using the nlme package.
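
As a sketch of how the global test follows from equation (3), the corresponding Wald statistic can be written in the standard general linear hypothesis form below. This is the textbook form rather than a formula quoted from the original analysis; the symbols W and r, and the reading that L simply selects the four SNP coefficients β5 to β8 with m = 0, are assumptions made for illustration.

```latex
W = (L\hat{\beta} - m)'\,\bigl[\,L\,\widehat{\mathrm{cov}}(\hat{\beta})\,L'\,\bigr]^{-1}(L\hat{\beta} - m)
\;\overset{a}{\sim}\; \chi^{2}_{r}, \qquad r = \operatorname{rank}(L)
```

Under this reading, a robust version of the global test would presumably replace the model-based covariance of the fixed effects with the sandwich estimator used for the robust standard errors, leaving the rest of the statistic unchanged.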

As it is important to report the uncertainty in any estimates from simulation-based studies
(Koehler, Brown, and Haneuse 2009), Monte Carlo error (MCE) was calculated using the
joint performance method of β and si outlined by White (White 2010). A confidence interval
for coverage probabilities, bias, type 1 error and power was calculated using the following:

$$P \pm 1.96\sqrt{\frac{P(1-P)}{S}} \qquad (4)$$

where P is the nominal level (for example, P is 0.95 for coverage estimates and 0.05 for type 1
error) and S is the number of simulations (for example, either 1,000 or 5,000). Each estimate
from the simulations was then assessed as to whether it fell within this confidence interval.
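
Equation (4) is simple to evaluate directly, and doing so shows where the acceptance limits quoted in the results come from. In the sketch below the function names are hypothetical. The choice of S = 4,000 for coverage is an inference: it is the value that reproduces the 94.32% to 95.68% limits quoted in the results, and would arise if 1,000 datasets at each of four minor allele frequencies were pooled. S = 20,000 for type 1 error is the number of simulations per design and error assumption stated in the text.

```python
from math import sqrt


def monte_carlo_interval(p, n_sims, z=1.96):
    """Return (lower, upper) for p +/- z * sqrt(p * (1 - p) / n_sims), as in equation (4)."""
    half_width = z * sqrt(p * (1.0 - p) / n_sims)
    return p - half_width, p + half_width


def within_interval(estimate, p, n_sims):
    """True if a simulation estimate lies inside the Monte Carlo interval around p."""
    lower, upper = monte_carlo_interval(p, n_sims)
    return lower <= estimate <= upper


# Coverage: nominal 0.95, assuming 4,000 pooled simulations.
coverage_lower, coverage_upper = monte_carlo_interval(0.95, 4000)

# Type 1 error: nominal 0.05 with 20,000 pooled simulations.
type1_lower, type1_upper = monte_carlo_interval(0.05, 20000)
```

With S = 20,000 the upper limit for type 1 error is just above 0.0530, so an estimate of 0.0533 is flagged as inflated while 0.0506 and 0.0515 are not, in line with the values reported for the SNP main effect later in the results.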

3.4 Results:
3.4.1 Coverage probabilities:
Coverage probability indicates whether the confidence interval of the parameter(s) of
interest is conservative (i.e. the coverage probability is larger than the nominal confidence
level) or liberal (i.e. the coverage probability is smaller than the nominal confidence
level).

Coverage probabilities for the 95% confidence interval of the fixed effects parameter
estimates from each of the simulations are presented in Table 2. No consistent differences
were seen across the range of minor allele frequencies, so the results were combined across
minor allele frequencies for ease of presentation; the coverage probabilities for each of the
minor allele frequencies are presented in Supplementary Table 1.

The coverage probabilities of the SNP main effect parameter appear to be largely unaffected
by the error misspecifications; only nine of 70 coverage probabilities were
significantly different from 95% (that is, less than 94.32% or greater than 95.68%), five of
which were from the simulations where the error variance increased over time.

Thirty-one of the 70 coverage probabilities (44%) for the SNP*age interaction parameter
were significantly different from 95%, with both the non-Gaussian and heteroscedastic error
distributions being affected. When the error followed a t-distribution, the coverage
probabilities for the confidence interval of the SNP*age interaction parameter were less than
95% in all designs except the sparse complete design. Similarly, the SNP*age interaction
parameter had coverage probabilities less than 95% when the error followed a
skew-normal distribution, but only in the unbalanced designs, where there are missing
data. The coverage probabilities were also less than 95%, in both the complete and unbalanced
designs, when the error variance depended on a covariate and when it increased over time.
All the coverage probabilities that differed significantly from 95% for the SNP*age
interaction parameter had underestimated variance estimates, and thus confidence
intervals that were too narrow, which could lead to test statistics that are too liberal.

3.4.2 Bias:
The SNP main effect and the SNP*age interaction parameters were unbiased in the majority
of the simulations, indicating that the misspecifications in the error distribution do not
affect the estimates of the β's (Supplementary Tables 2 and 3). Only nine of the 140 95%
confidence intervals for the bias did not cover zero; these nine confidence intervals were
spread across the range of error distributions and designs, showing that no one scenario was
particularly biased.

No consistent differences were seen in the bias estimates across the range of minor allele
frequencies; however, the 95% confidence intervals for the difference between the
simulated parameter and the true parameter were tighter as the sample size and minor
allele frequency increased (Supplementary Table 4).

3.4.3 Type 1 error:
As with the coverage probabilities, no consistent differences in type 1 error were
evident across the minor allele frequency range, so the results were combined across minor
allele frequencies for ease of presentation (Table 3 and Table 4); the type 1
error for each of the minor allele frequencies tested is given in Supplementary Table 5.

As seen in Table 3, the type 1 error for the complete designs generally remained within
acceptable limits of the nominal alpha level. We observed inflation for the SNP by age
interaction parameter in several cases, but this inflation was reduced to nominal levels by
using a robust standard error.

Table 4 shows that the type 1 error for the SNP by age interaction was often inflated under
the unbalanced designs. However, by using a robust standard error, the inflation can
be reduced to nominal levels in the majority of cases; approximately 75% of the inflated
effects were reduced. The scenario in which the robust standard error had little
effect was where the error variance increased over time; only 20% of the estimates were
reduced to nominal levels under this scenario. Interestingly, the robust standard error did not
appear to affect the type 1 error for the scenarios that were not originally inflated.

To declare significance in a GWAS, several thresholds are commonly used: suggestive
association, significant association and highly significant association. Duggal et al define
suggestive associations as SNPs that reach a P-value threshold under the assumption that
one false positive association is expected per GWAS (Duggal et al. 2008); SNPs reaching this
threshold are taken forward for replication. In the context of our simulation study, this
definition would equate to a P-value of 0.00005 (1/20,000; where 20,000 is the number of
simulations per design and error assumption). The scenario with the highest type 1 error
inflation using the classical standard error was for the SNP*age interaction under the
intense design where the error variance increased over time (0.0746 for both N=1,000 and
3,000). In this scenario, 6 SNPs would falsely reach the definition of 'suggestive association'
for the SNP*age interaction parameter when using the classical standard error with a
sample size of 1,000 individuals. In contrast, when the model assumptions are met, that is
when the error follows a Gaussian distribution with constant variance, only 2
SNPs met the 'suggestive association' threshold, indicating an inflation in the type 1 error
for the simulations where the variance increased over time due to the misspecification of
the error term. When using the robust standard error under the increasing variance over
time scenario, 1 SNP would meet the criteria, showing not only a reduction in the type 1 error
from the 7 SNPs seen with the classical standard error, but also a reduction in power in
comparison to the model where the assumptions were met.

These results show that there is greater inflation in the type 1 error for the SNP*age
interaction in the unbalanced designs than in the complete designs. As outlined in
Supplementary Figure 1, we simulated additional data to investigate which aspects of the
unbalanced design contributed to the inflation. Briefly, the results from these additional
simulations are as follows:
1. The unbalanced designs differed from the complete designs by including missing
data and altering the measurement times so they fell within a range around the
scheduled times, both of which are inherent in cohort studies. The additional
simulations showed that the inflation was driven more by the presence of missing data
than by the different measurement times between individuals
(Supplementary Figure 2).
2. Since the LME is known to be robust to missing data under the missing at random
and missing completely at random assumptions, we simulated additional data
varying the polynomial function of age in the fixed and random effects. These
simulations showed that the type 1 error was reduced to nominal levels when the fixed
and random effects had the same function of age, i.e. a cubic function in both the
fixed and random effects (Supplementary Table 6).
3. To determine whether there is remaining inflation in the type 1 error after modelling
the same function of age in the fixed and random effects when the error distribution
is misspecified, we simulated additional data using the equal unbalanced sampling
design. These simulations showed that the type 1 error was again reduced to
nominal levels when the fixed and random effects had the same function of age,
regardless of the misspecification in the error distribution (Supplementary Table 7).
4. It is often difficult to estimate higher order terms in the random effects when using
real data due to computational and convergence issues. In this case, it is often only
possible to fit a lower-order polynomial function in the random effects than in the
fixed effects. We simulated additional data where the fixed and random effects
included a quadratic function for age but we analyzed the data with a quadratic
function in the fixed effects and a linear function in the random effects. In addition,
we also simulated data where the fixed effects included a quadratic function for age
and the random effects included only a linear function but analyzed the data with a
quadratic function in both the fixed and random effects. These simulations showed
that the type 1 error was inflated when the analysis model had a lower-order
polynomial function in the random effects than in the fixed effects
(Supplementary Table 8).
In summary, it is recommended that one includes the same polynomial function for age in
the fixed and random effects to avoid inflation in the type 1 error; however, if this is not
possible due to non-convergence of the model, then a robust standard error is required to
reach nominal levels of type 1 error.

The type 1 error of the global Wald test, which assesses whether there is any genetic effect
on the whole BMI growth trajectory, was inflated above the acceptable limits under all error
misspecifications, and even under the Gaussian/constant variance assumption, except under
the sparse complete design. The scenario where the error variance increased over time
showed the largest inflation; however, using the robust estimates for the Wald test under
this scenario reduced the type 1 error to nominal levels in most designs, and where it was not
reduced to nominal levels it was dramatically lower than with the classical test (Table 4).
Again, having the same structure of fixed and random terms for the age polynomial function
would yield nominal type 1 errors.

Given that many researchers investigating GWAS of longitudinal traits are interested in only
the SNP main effect and not the SNP*age interaction (Furlotte, Eskin, and Eyheramendy
2012), we conducted some additional simulations without the SNP*age interaction. Once
again, we used the scenario where the error variance increased over time with the equal
unbalanced sampling design. We found that the type 1 error was within the
nominal range for the SNP main effect for both sample sizes (N=1,000: 0.0506; N=3,000:
0.0515), where previously we saw inflation for the sample size of 1,000 (0.0533 from Table
4). We have no reason to believe that any of the other scenarios would be affected by the
misspecifications when the SNP*age interactions are not modelled.

3.4.4 Power:
Effect sizes for the alternative hypothesis (β5 = 0.6 and β6 = 0.15) were chosen to have 80%
power with a MAF of 0.4 and sample size of 1,000 when the error from the fitted LMM
follows a Gaussian distribution with constant variance. As a result, the power for all error
distributions and MAFs in the simulations with a sample size of 3,000 was greater than 80%,
so this section will only discuss power for the simulations with a sample size of 1,000. Power
for the SNP main effect and SNP*age interaction parameters is displayed in Figures 1
(complete designs) and 2 (unbalanced designs).

As expected, the power increases with the MAF. Interestingly, when the error followed a
t-distribution, power was lower for both the SNP main effect and the
SNP*age interaction parameters than when the error followed a Gaussian distribution. This
pattern was consistent across all of the sampling designs; however, it appears that the power
is slightly closer to that under the Gaussian error when there are more data
around the adiposity rebound (i.e. the intense complete design and the unbalanced design with
more samples around the adiposity rebound). In addition, the simulations where the error
followed a skew-normal distribution led to slightly higher power for both the SNP main effect
and SNP*age interaction parameters than with the Gaussian error.

When investigating the different error variance structures, the power for the SNP main
effect parameter across all MAFs was slightly lower than the power when the constant
variance assumption was met. Likewise, for the SNP*age interaction parameter, all of the
error variance structures led to lower power than when the constant variance assumption
was met. Simulations under the unbalanced designs where the variance increased
over time suffered the most and had notably reduced power until a MAF of approximately
0.3.

3.4.5 Power under the robust standard error:
We have shown that using the robust standard error does not affect those situations where
the type 1 error was not initially inflated; however, before adopting the robust standard error
for a GWAS analysis we also wanted to determine whether using the robust standard error
would decrease our power to detect a statistically significant association.

The power for the SNP main effect parameter remains almost unchanged when using the
robust standard error rather than the classical standard error in all scenarios and under all
model misspecifications (Figures 3 and 4). The only scenario where the power decreased for
the SNP main effect parameter by using the robust standard error was where there was
increasing variance over time under the intense complete design. Given that the type 1
error was not inflated using either standard error estimate, there appears to be no harm in
using a robust standard error for estimation even when not required.

The power for the SNP*age interaction parameter, particularly for low MAF, is much more
variable. Under the sparse complete design, where there was no inflation in the type 1
error, the power remains about the same using either the classical or robust standard error.
For the other designs, the power for the SNP*age interaction parameter decreases using the
robust standard error, but only by 5% or less for most error misspecifications when the
MAF was 0.2 or greater. When the error followed a t-distribution, power decreased by about
5-10% using the robust standard error when the MAF was 0.1 or 0.2; this might be due to
the substantial reduction in type 1 error. The power also decreases by more than 5%
when the variance is greater at the adiposity rebound and when the variance is dependent on a
covariate, for values of MAF around 0.1 in our scenarios.

4. Analysis of chromosome-wide body-mass-index data:
Given our simulation results, in particular the need for a robust standard error to ensure
accurate inference for the SNP*age interaction where the type 1 error is inflated, we
wanted to investigate the impact of the distributional assumptions in a real data
application. GWAS analysis of multiple cohorts would be ideal to observe the effect of the
different error term misspecifications; however, this would require a large amount of
computing time and was thus determined to be prohibitive. Instead, we chose to conduct the
analysis using chromosome 16 in the ALSPAC data, as the most replicated gene for BMI to
date, the fat mass and obesity gene (FTO), is located on this chromosome; we therefore
hypothesised that we would detect some significant loci on this chromosome as well as
many non-associated SNPs. We used the same LMM as in equation 2, with the
inclusion of an age*sex interaction in the fixed effects for all the age components (i.e.
$\beta_9 sex_i + \beta_{10} t_{ij} sex_i + \beta_{11} t_{ij}^2 sex_i + \beta_{12} t_{ij}^3 sex_i$)
to account for the differences in growth between males
and females (Warrington et al. 2013). There were 14,875 SNPs genotyped on chromosome
16, all of which had a MAF greater than 1%; GWAS are designed to look at common SNPs, so
it is a common strategy to exclude SNPs with a MAF less than 1%. Each SNP was incorporated
into the model assuming an additive genetic model.

As expected, SNPs in the FTO gene were highly significant for the global tests as well as the
SNP main effect and SNP*age interactions. It is common to display GWAS results as a QQ
plot of the observed -log10(P) against the expected -log10(P) under the null distribution. Figure
5 displays a QQ plot from the chromosome 16 analysis in ALSPAC for each of the parameters
which displayed inflated levels of type 1 error in the simulation study. As we believe SNPs
within the FTO region to be true positives, we also display the QQ plots excluding SNPs from
this region (Figure 5, C and D). Lambda (λ) values, defined as the ratio of the median of the
empirically observed distribution of the test statistic to the expected median, are also
commonly calculated for GWAS analyses. The λ quantifies the extent of the excess false positive
rate, with values close to 1 indicating no inflation and values further above 1 indicating
increasing levels of false positives. The lambda values corresponding to each QQ plot were
calculated using the estlambda option in the GenABEL software (Aulchenko et al. 2007).
These QQ plots and lambda statistics clearly show that where the parameters have lambda
values greater than one using the classical test, the robust test reduces this to nominal
levels. When using the robust tests, we were still able to detect an association with SNPs in
the FTO gene.
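
The median-ratio definition of λ given above can be sketched in a few lines. This is an illustration only and is not the GenABEL implementation: it assumes every P-value comes from a test statistic that is chi-square with 1 degree of freedom under the null (which would not hold for a multi-parameter global test), and the function names are hypothetical.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()


def chi2_1df_from_p(p):
    """Convert a two-sided P-value (0 < p <= 1) to a 1-df chi-square statistic."""
    z = _STD_NORMAL.inv_cdf(p / 2.0)  # lower-tail quantile, avoids rounding near 1
    return z * z


def genomic_lambda(p_values):
    """Ratio of the observed median statistic to its expected value under the null."""
    observed_median = median(chi2_1df_from_p(p) for p in p_values)
    expected_median = chi2_1df_from_p(0.5)  # median of a 1-df chi-square, about 0.455
    return observed_median / expected_median


# Evenly spaced P-values mimic a well-calibrated test: lambda is close to 1.
uniform_p = [(i + 0.5) / 1000 for i in range(1000)]
lambda_null = genomic_lambda(uniform_p)

# Shifting the P-values towards zero mimics an anti-conservative test: lambda > 1.
inflated_p = [p ** 1.5 for p in uniform_p]
lambda_inflated = genomic_lambda(inflated_p)
```

Because only the median is used, a small number of strongly associated SNPs, such as those in the FTO region, has little influence on λ, which is one reason it is a convenient summary alongside the QQ plots.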

In the chromosome-wide analysis, the P-value to declare 'suggestive significance' would be
0.000067 (1/14,875). Using this threshold, 57 SNPs would reach suggestive significance for
the SNP by age interaction using the classical standard error in comparison to only 16 SNPs
using the robust standard error. Six of these 16 SNPs were in the FTO gene, four of which
would reach the significant threshold.
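
The 'suggestive' thresholds used in the simulation study and in the chromosome-wide analysis both follow from allowing one expected false positive across all tests. A minimal sketch of that arithmetic is given below; the function name is hypothetical, and the calculation treats the tests as independent, which is a simplification.

```python
def suggestive_threshold(n_tests):
    """P-value threshold at which one false positive is expected among n_tests null tests."""
    return 1.0 / n_tests


def expected_false_positives(n_tests, threshold):
    """Expected number of null tests with a P-value below the threshold."""
    return n_tests * threshold


# Simulation study: 20,000 simulations per design and error assumption.
simulation_threshold = suggestive_threshold(20_000)

# Chromosome 16 analysis: 14,875 SNPs.
chromosome_threshold = suggestive_threshold(14_875)
```

Under a well-calibrated test roughly one SNP would be expected to pass either threshold by chance, which is the yardstick against which the counts of SNPs reaching suggestive significance can be read.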

5. Discussion:
In this article, we simulated longitudinal data that mimicked childhood BMI to explore the
coverage probability, bias, type 1 error and power for association with a SNP when the
linear mixed effects model is misspecified with either a non-Gaussian error distribution or
heteroscedastic error. We have shown that the type 1 error for the SNP*age interaction
terms in a genetic association study has no inflation if the same function of age is included
in both the fixed and random effects. However, the type 1 error is inflated, regardless of the
model misspecification, if the age function in the fixed and random effects differs. In
situations where the model is too complex and will not converge with a high order
polynomial function in the random effects, an appropriate way to reduce the type 1 error to
nominal levels is to use a robust standard error for the fixed effects parameters. Although
robust standard errors have previously been used in a wide range of statistical applications,
LMMs are only just beginning to be utilized in GWAS and therefore guidance on their
application was warranted. Given that QQ plots in GWAS are an important diagnostic to rule
out the possibility of population stratification, it is essential to generate standard errors that
perform well under the null hypothesis so that any remaining inflation is not due to the
model fitting. Similar to the conclusions by Gurka et al (Gurka, Edwards, and Muller 2011)
and Verbeke and Molenberghs (Verbeke and Molenberghs 2000) using other applications,
the sandwich estimator is a valid alternative in GWAS when the model is
misspecified; however, it is less efficient than using the correct covariance model.

Similar to Jacqmin-Gadda et al (Jacqmin-Gadda et al. 2007), we have shown that estimates
of differences in slope by the number of copies of the minor allele are sensitive to
heterogeneous error variance, particularly when the error variance depends on a covariate
or increases over time. The variance of the estimates is underestimated and therefore the
confidence interval is too narrow; this is consistent with the inflated type 1 error under
these misspecified model assumptions.

Of all the misspecifications investigated, the situation where the error variance increases
over time and is not accounted for in the modelling has poor parameter estimates, low
power and the most inflation of the type 1 error, particularly for the SNP*age interaction
terms. It also appears that by using the robust standard error, the inflation in the type 1
error is reduced to the nominal level in only some of the scenarios. It is therefore imperative
that some adjustment is made in the modelling to account for this increasing variance over
time. In the ALSPAC BMI data, the variance stays relatively constant until around the age of
four years, after which it rapidly increases until around 11 years of age, where it plateaus again.
This is due to the different growth rates between individuals through the adiposity rebound
and puberty. Increasing variability over time can be seen with many other phenotypes both
in childhood and adulthood; for example, lung function in an elderly population can decrease
at different rates as some individuals are diagnosed with diseases such as chronic obstructive
pulmonary disease, while other individuals remain healthy. Variance functions for modelling
heteroscedasticity in mixed effects models have been studied in detail by Davidian and
Giltinan (Davidian and Giltinan 1995) and can be implemented using the varFunc classes in
the nlme package in R (Pinheiro and Bates 2000). There are also equivalent functions in
alternative statistical packages such as MLwiN (Rasbash et al. 2012). The use of these
variance functions could be recommended in the context of GWAS, if there is remaining
heteroscedasticity in the residuals after appropriately modelling the fixed and random
effects; however, further studies are needed to assess their properties in this context.

When looking at SNPs with low minor allele frequencies, we have seen that by using the
robust standard error we reduce our power by approximately 5%. To counteract this
reduction, we can increase the sample size through the use of meta-analysis of multiple
cohort studies, as is commonly done in GWAS analyses. However, several manuscripts have
previously discussed the extended computational time for longitudinal GWAS in comparison
to GWAS of cross-sectional phenotypes, so it is recommended that large computing clusters
are available to those cohort studies conducting analyses. The longitudinal GWAS of
cardiovascular risk factors presented in Smith et al (Smith et al. 2010) took approximately 3
hours on 64 processors of a compute cluster for 600,000 tests in 525 individuals. Sikorska et
al (Sikorska et al. 2013) illustrated that the analysis of 2.5 million SNPs using the lme
function in the nlme package of R would take 3,500 hours for a sample size of 3,000
individuals on a desktop computer (Intel(R) Core(TM) 2 Duo CPU, 3.00 GHz). These times are
consistent with those in this study; the chromosome 16 analysis of 14,875 SNPs in the 7,916
ALSPAC individuals took approximately 125 hours on 32 processors of a compute cluster
(BlueCrystal Phase 2 cluster with each node having four 2x2.8 GHz core processors and 8 GB
of RAM).
636
It has been suggested that the genome-wide significance threshold be set at 5 x 10-8 637
(Dudbridge and Gusnanto 2008; Risch and Merikangas 1996). In addition, Duggal et al 638
(Duggal et al. 2008) established an appropriate p-value threshold based on the number of 639
independent SNP tests in a GWAS. If study data is imputed against the HapMap CEU 640
population, they suggest a threshold of p<6.09 x 10-6 be used to select SNPs with suggestive 641
evidence for follow-up. Many cross-sectional GWAS use thresholds around this,
generally ranging from p < 5 × 10⁻⁶ (Speliotes et al. 2010) to p < 10⁻⁵ (Thorleifsson et al. 2009),
to select SNPs for replication. In longitudinal genetic association studies, particularly those
with complex, non-linear trajectories, controlling the type 1 error of the many parameters
involving SNP effects can be quite challenging. This would be the case when using, for
example, smoothing spline functions that could interact with the SNP
effects. Providing robust standard errors in this context can be difficult. As an alternative, it
may be plausible to use genomic control procedures to reduce a possible inflation in the
type 1 error for the parameters involving the SNP effects (Devlin and Roeder 1999; Dadd,
Weale, and Lewis 2009). Genomic control is typically used in genetic association studies to
account for the potential confounding due to cryptic relatedness, and makes the
assumption that the inflation in type 1 error is constant across all markers in the genome.
This is plausible in the context of cryptic relatedness, as the inflation is due to the kinship
coefficients, which are unrelated to the individual loci; however, in the context of LMMs one
would need to show that the inflation was uniform across the genome or genetic region of
interest. Benke et al. (2013) suggested using a joint test of all SNP effects, similar
to the global Wald test used in the current study, as an optimal way to control the type 1
error and increase power. However, caution needs to be applied when utilizing this method
for complex traits, such as BMI trajectories over childhood, and a genome-wide significance
threshold should only be used if there is no inflation detected in the type 1 error. Benke et
al. (2013) used a trait with a linear decrease over time and low correlation
between the intercept and slope parameters; in contrast, in this study we have a complex
trajectory over time with high correlation between the intercept and slope parameters,
for which the joint test had inflated type 1 error that could be reduced using a
robust estimate only in some scenarios.
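The genomic control adjustment discussed above can be sketched in a few lines. The following is a minimal illustration in Python, not the procedure used in this study. It assumes that each test is a 1 degree of freedom chi-square test, estimates the inflation factor λ as the ratio of the median observed statistic to the null median (approximately 0.455), and rescales every statistic by that single factor, which is only appropriate under the uniform-inflation assumption noted above. All function names are hypothetical.

```python
from math import erfc, sqrt
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()
# Median of a 1-df chi-square distribution (about 0.455).
CHI2_1DF_MEDIAN = _STD_NORMAL.inv_cdf(0.75) ** 2


def p_to_chi2(p):
    """1-df chi-square statistic whose upper-tail probability is p (0 < p <= 1)."""
    return _STD_NORMAL.inv_cdf(p / 2.0) ** 2


def chi2_to_p(x):
    """Upper-tail probability of a 1-df chi-square statistic x."""
    return erfc(sqrt(x / 2.0))


def genomic_control(p_values):
    """Return (lambda, adjusted p-values) for a list of 1-df test p-values.

    Assumption: inflation is constant across all markers, so one scaling
    factor applies to every test statistic.
    """
    chi2_stats = [p_to_chi2(p) for p in p_values]
    lam = median(chi2_stats) / CHI2_1DF_MEDIAN
    scale = max(lam, 1.0)  # convention: do not adjust when lambda < 1
    adjusted = [chi2_to_p(x / scale) for x in chi2_stats]
    return lam, adjusted
```

Applied to the p-values of a single parameter (for example the SNP*age interaction) across all SNPs in a region, a value of `lam` near 1 would suggest no systematic inflation for that parameter.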

In summary, based on our simulation results, we strongly suggest fitting the same function
of age in the fixed and random effects to avoid inflation of the type 1 error of the SNP*age
interaction terms. If this is not possible due to convergence issues, then we suggest using a
robust standard error for the SNP*age interaction terms to reduce the type 1 error
inflation in GWAS, regardless of whether the error term of the model correctly follows the
model assumptions or not. If no inflation in the type 1 error is detected for a particular
parameter of interest, then the classical standard error should be used; for example, for the
SNP main effect parameter in this study.
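To make the distinction between the classical and robust standard errors concrete, the sketch below computes both for the fixed effects of a marginal model with a given working covariance. It is a minimal Python/NumPy illustration of a sandwich estimator in the spirit of Royall (1986) and Liang and Zeger (1986), not the R code used in this study. The function names are hypothetical, and the working covariance matrices V_i are assumed to be supplied (in practice they would come from the fitted LMM).

```python
import numpy as np


def gls_with_sandwich(X_list, y_list, V_list):
    """Fixed-effect estimates with model-based and sandwich covariance.

    X_list[i]: (n_i, p) design matrix for subject i
    y_list[i]: (n_i,) responses for subject i
    V_list[i]: (n_i, n_i) working covariance of y_list[i] (assumed known here)
    """
    p = X_list[0].shape[1]
    bread = np.zeros((p, p))
    rhs = np.zeros(p)
    for X, y, V in zip(X_list, y_list, V_list):
        Vinv_X = np.linalg.solve(V, X)   # V^-1 X
        bread += X.T @ Vinv_X            # sum of X' V^-1 X
        rhs += Vinv_X.T @ y              # sum of X' V^-1 y
    beta = np.linalg.solve(bread, rhs)

    meat = np.zeros((p, p))
    for X, y, V in zip(X_list, y_list, V_list):
        resid = y - X @ beta
        score = np.linalg.solve(V, X).T @ resid   # X' V^-1 r for subject i
        meat += np.outer(score, score)

    bread_inv = np.linalg.inv(bread)
    cov_model = bread_inv                       # valid only if V is correct
    cov_robust = bread_inv @ meat @ bread_inv   # sandwich estimate
    return beta, cov_model, cov_robust


def wald_statistic(beta, cov, index):
    """Joint Wald statistic for the coefficients listed in `index`; refer it
    to a chi-square distribution with len(index) degrees of freedom."""
    b = np.asarray(beta)[index]
    C = np.asarray(cov)[np.ix_(index, index)]
    return float(b @ np.linalg.solve(C, b))


# Toy data: two subjects, intercept-only model, working independence.
X_list = [np.ones((2, 1)), np.ones((2, 1))]
y_list = [np.array([1.0, 3.0]), np.array([2.0, 6.0])]
V_list = [np.eye(2), np.eye(2)]
beta, cov_model, cov_robust = gls_with_sandwich(X_list, y_list, V_list)
print(beta, np.sqrt(np.diag(cov_model)), np.sqrt(np.diag(cov_robust)))
```

Passing `cov_robust` rather than `cov_model` to `wald_statistic` gives a robust version of a joint test; whether that removes inflation in a given design is an empirical question of the kind examined in the simulations.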

Acknowledgements:
We are extremely grateful to all the families who took part in the ALSPAC study, the
midwives for their help in recruiting them, and the whole ALSPAC team, which includes
interviewers, computer and laboratory technicians, clerical workers, research scientists,
volunteers, managers, receptionists and nurses. The UK Medical Research Council and the
Wellcome Trust (Grant ref: 092731) and the University of Bristol provide core support for
ALSPAC. NM Warrington is funded by an Australian Postgraduate Award from the Australian
Government Department of Innovation, Industry, Science and Research and a Raine Study PhD Top-Up
Scholarship. LD Howe is funded by a UK Medical Research Council Population Health
Scientist fellowship (G1002375). L Paternoster is funded by a UK Medical Research Council
Population Health Scientist fellowship (MR/J012165/1). K Tilling, LD Howe and L Paternoster
work in a Unit that receives core funding from the University of Bristol and the UK Medical
Research Council (Grant ref: MC_UU_12013/9). The UK Medical Research Council also
supports K Tilling's research (G1000726/1).

Figure Legends:
Figure 1: Simulated power of the SNP main effect and SNP*age interaction terms for
complete designs. The two plots on the left are for the sparse complete design, while the
two plots on the right are for the intense complete design. The solid black line for the
Gaussian Distribution is the situation where the model is correctly specified.

Figure 2: Simulated power of the SNP main effect and SNP*age interaction terms for
unbalanced designs. “Equal” is the simulations from the equal unbalanced design, “Over”
are the simulations from the unbalanced design with fewer samples around the adiposity
rebound and “Under” are the simulations from the unbalanced design with more samples
around the adiposity rebound. The solid black line for the Gaussian Distribution is the
situation where the model is correctly specified.

Figure 3: Difference in power based on a normal standard error vs. a robust standard error
for the complete designs. A positive value indicates the power using the normal standard
error is greater than the power using the robust standard error. The two plots on the left
are for the sparse complete design, while the two plots on the right are for the intense
complete design. The solid black line for the Gaussian Distribution is the situation where the
model is correctly specified.

Figure 4: Difference in power based on a normal standard error vs. a robust standard error
for the unbalanced designs. A positive value indicates the power using the normal standard
error is greater than the power using the robust standard error. Here, “Equal” is the
simulations from the equal unbalanced design, “Over” are the simulations from the
unbalanced design with fewer samples around the adiposity rebound and “Under” are the
simulations from the unbalanced design with more samples around the adiposity rebound.
The solid black line for the Gaussian Distribution is the situation where the model is
correctly specified.

Figure 5: QQ plots of the chromosome 16 analysis in the ALSPAC cohort. These plots show
the observed -log10(P) against the expected -log10(P) under the null hypothesis for each
SNP on chromosome 16. P-values deviating from the red x=y line indicate significant
findings, whether they be false (i.e. inflation in type 1 error) or true.

References:
Aulchenko, Y. S., S. Ripke, A. Isaacs, and C. M. van Duijn. 2007. GenABEL: an R library for genome-wide association analysis. Bioinformatics 23 (10):1294-6.
Benke, K. S., Y. Wu, D. M. Fallin, B. Maher, and L. J. Palmer. 2013. Strategy to control type I error increases power to identify genetic variation using the full biological trajectory. Genet Epidemiol 37 (5):419-30.
Boyd, A., J. Golding, J. Macleod, D. A. Lawlor, A. Fraser, J. Henderson, L. Molloy, A. Ness, S. Ring, and G. Davey Smith. 2013. Cohort Profile: the 'children of the 90s'--the index offspring of the Avon Longitudinal Study of Parents and Children. Int J Epidemiol 42 (1):111-27.
Bradfield, J. P., H. R. Taal, N. J. Timpson, A. Scherag, C. Lecoeur, N. M. Warrington, E. Hypponen, C. Holst, B. Valcarcel, E. Thiering, R. M. Salem, F. R. Schumacher, D. L. Cousminer, P. M. Sleiman, J. Zhao, R. I. Berkowitz, K. S. Vimaleswaran, I. Jarick, C. E. Pennell, D. M. Evans, B. St Pourcain, D. J. Berry, D. O. Mook-Kanamori, A. Hofman, F. Rivadeneira, A. G. Uitterlinden, C. M. van Duijn, R. J. van der Valk, J. C. de Jongste, D. S. Postma, D. I. Boomsma, W. J. Gauderman, M. T. Hassanein, C. M. Lindgren, R. Magi, C. A. Boreham, C. E. Neville, L. A. Moreno, P. Elliott, A. Pouta, A. L. Hartikainen, M. Li, O. Raitakari, T. Lehtimaki, J. G. Eriksson, A. Palotie, J. Dallongeville, S. Das, P. Deloukas, G. McMahon, S. M. Ring, J. P. Kemp, J. L. Buxton, A. I. Blakemore, M. Bustamante, M. Guxens, J. N. Hirschhorn, M. W. Gillman, E. Kreiner-Moller, H. Bisgaard, F. D. Gilliland, J. Heinrich, E. Wheeler, I. Barroso, S. O'Rahilly, A. Meirhaeghe, T. I. Sorensen, C. Power, L. J. Palmer, A. Hinney, E. Widen, I. S. Farooqi, M. I. McCarthy, P. Froguel, D. Meyre, J. Hebebrand, M. R. Jarvelin, V. W. Jaddoe, G. D. Smith, H. Hakonarson, and S. F. Grant. 2012. A genome-wide association meta-analysis identifies new childhood obesity loci. Nat Genet 44 (5):526-31.
Cole, T. J., M. C. Bellizzi, K. M. Flegal, and W. H. Dietz. 2000. Establishing a standard definition for child overweight and obesity worldwide: international survey. BMJ 320 (7244):1240-3.
Dadd, T., M. E. Weale, and C. M. Lewis. 2009. A critical evaluation of genomic control methods for genetic association studies. Genet Epidemiol 33 (4):290-8.
Davidian, M., and D. M. Giltinan. 1995. Nonlinear models for repeated measurement data, Monographs on statistics and applied probability;62. London: Chapman & Hall.
Devlin, B., and K. Roeder. 1999. Genomic control for association studies. Biometrics 55 (4):997-1004.
Dubois, L., and M. Girad. 2007. Accuracy of maternal reports of pre-schoolers' weights and heights as estimates of BMI values. Int J Epidemiol 36 (1):132-8.
Dudbridge, F., and A. Gusnanto. 2008. Estimation of significance thresholds for genomewide association scans. Genet Epidemiol 32 (3):227-34.
Duggal, P., E. M. Gillanders, T. N. Holmes, and J. E. Bailey-Wilson. 2008. Establishing an adjusted p-value threshold to control the family-wide type 1 error in genome wide association studies. BMC Genomics 9:516.
Fox, C. S., N. Heard-Costa, L. A. Cupples, J. Dupuis, R. S. Vasan, and L. D. Atwood. 2007. Genome-wide association to body mass index and waist circumference: the Framingham Heart Study 100K project. BMC Med Genet 8 Suppl 1:S18.
Fraser, A., C. Macdonald-Wallis, K. Tilling, A. Boyd, J. Golding, G. Davey Smith, J. Henderson, J. Macleod, L. Molloy, A. Ness, S. Ring, S. M. Nelson, and D. A. Lawlor. 2013. Cohort Profile: the Avon Longitudinal Study of Parents and Children: ALSPAC mothers cohort. Int J Epidemiol 42 (1):97-110.
Frayling, T. M., N. J. Timpson, M. N. Weedon, E. Zeggini, R. M. Freathy, C. M. Lindgren, J. R. Perry, K. S. Elliott, H. Lango, N. W. Rayner, B. Shields, L. W. Harries, J. C. Barrett, S. Ellard, C. J. Groves, B. Knight, A. M. Patch, A. R. Ness, S. Ebrahim, D. A. Lawlor, S. M. Ring, Y. Ben-Shlomo, M. R. Jarvelin, U. Sovio, A. J. Bennett, D. Melzer, L. Ferrucci, R. J. Loos, I. Barroso, N. J. Wareham, F. Karpe, K. R. Owen, L. R. Cardon, M. Walker, G. A. Hitman, C. N. Palmer, A. S. Doney, A. D. Morris, G. D. Smith, A. T. Hattersley, and M. I. McCarthy. 2007. A common variant in the FTO gene is associated with body mass index and predisposes to childhood and adult obesity. Science 316 (5826):889-94.
Furlotte, N. A., E. Eskin, and S. Eyheramendy. 2012. Genome-wide association mapping with longitudinal data. Genet Epidemiol 36 (5):463-71.
Gurka, M. J., L. J. Edwards, and K. E. Muller. 2011. Avoiding bias in mixed model inference for fixed effects. Stat Med 30 (22):2696-707.
Haslam, D. W., and W. P. James. 2005. Obesity. Lancet 366 (9492):1197-209.
Haworth, C. M., S. Carnell, E. L. Meaburn, O. S. Davis, R. Plomin, and J. Wardle. 2008. Increasing heritability of BMI and stronger associations with the FTO gene over childhood. Obesity (Silver Spring) 16 (12):2663-8.
Hindorff, L. A., J. MacArthur, A. Wise, H. A. Junkins, P. N. Hall, A. K. Klemm, and T. A. Manolio. 2010. A Catalog of Published Genome-Wide Association Studies. National Human Genome Research Institute.
Howe, L. D., K. Tilling, L. Benfield, J. Logue, N. Sattar, A. R. Ness, G. D. Smith, and D. A. Lawlor. 2010. Changes in ponderal index and body mass index across childhood and their associations with fat mass and cardiovascular risk factors at age 15. PLoS One 5 (12):e15186.
Howe, L. D., K. Tilling, and D. A. Lawlor. 2009. Accuracy of height and weight data from child health records. Arch Dis Child 94 (12):950-4.
Ihaka, R., and R. Gentleman. 1996. R: a language for data analysis and graphics. Journal of Computational and Graphical Statistics 5 (3):299-314.
Jacqmin-Gadda, H., S. Sibillot, C. Proust, J. M. Molina, and R. Thiébaut. 2007. Robustness of the linear mixed model to misspecified error distribution. Computational Statistics & Data Analysis 51 (10):5142-5154.
Kerner, B., K. E. North, and M. D. Fallin. 2009. Use of longitudinal data in genetic studies in the genome-wide association studies era: summary of Group 14. Genet Epidemiol 33 Suppl 1:S93-8.
Kindblom, J. M., M. Lorentzon, A. Hellqvist, L. Lonn, J. Brandberg, S. Nilsson, E. Norjavaara, and C. Ohlsson. 2009. BMI changes during childhood and adolescence as predictors of amount of adult subcutaneous and visceral adipose tissue in men: the GOOD Study. Diabetes 58 (4):867-74.
Koehler, E., E. Brown, and S. J. Haneuse. 2009. On the Assessment of Monte Carlo Error in Simulation-Based Statistical Analyses. Am Stat 63 (2):155-162.
Laird, N. M., and J. H. Ware. 1982. Random-effects models for longitudinal data. Biometrics 38 (4):963-74.
Liang, K. Y., and S. L. Zeger. 1986. Longitudinal data analysis using generalized linear models. Biometrika 73 (1):13-22.
Liu, J. Z., S. E. Medland, M. J. Wright, A. K. Henders, A. C. Heath, P. A. Madden, A. Duncan, G. W. Montgomery, N. G. Martin, and A. F. McRae. 2010. Genome-wide association study of height and body mass index in Australian twin families. Twin Res Hum Genet 13 (2):179-93.
Loos, R. J., C. M. Lindgren, S. Li, E. Wheeler, J. H. Zhao, I. Prokopenko, M. Inouye, R. M. Freathy, A. P. Attwood, J. S. Beckmann, S. I. Berndt, K. B. Jacobs, S. J. Chanock, R. B. Hayes, S. Bergmann, A. J. Bennett, S. A. Bingham, M. Bochud, M. Brown, S. Cauchi, J. M. Connell, C. Cooper, G. D. Smith, I. Day, C. Dina, S. De, E. T. Dermitzakis, A. S. Doney, K. S. Elliott, P. Elliott, D. M. Evans, I. Sadaf Farooqi, P. Froguel, J. Ghori, C. J. Groves, R. Gwilliam, D. Hadley, A. S. Hall, A. T. Hattersley, J. Hebebrand, I. M. Heid, C. Lamina, C. Gieger, T. Illig, T. Meitinger, H. E. Wichmann, B. Herrera, A. Hinney, S. E. Hunt, M. R. Jarvelin, T. Johnson, J. D. Jolley, F. Karpe, A. Keniry, K. T. Khaw, R. N. Luben, M. Mangino, J. Marchini, W. L. McArdle, R. McGinnis, D. Meyre, P. B. Munroe, A. D. Morris, A. R. Ness, M. J. Neville, A. C. Nica, K. K. Ong, S. O'Rahilly, K. R. Owen, C. N. Palmer, K. Papadakis, S. Potter, A. Pouta, L. Qi, J. C. Randall, N. W. Rayner, S. M. Ring, M. S. Sandhu, A. Scherag, M. A. Sims, K. Song, N. Soranzo, E. K. Speliotes, H. E. Syddall, S. A. Teichmann, N. J. Timpson, J. H. Tobias, M. Uda, C. I. Vogel, C. Wallace, D. M. Waterworth, M. N. Weedon, C. J. Willer, Wraight, X. Yuan, E. Zeggini, J. N. Hirschhorn, D. P. Strachan, W. H. Ouwehand, M. J. Caulfield, N. J. Samani, T. M. Frayling, P. Vollenweider, G. Waeber, V. Mooser, P. Deloukas, M. I. McCarthy, N. J. Wareham, I. Barroso, P. Kraft, S. E. Hankinson, D. J. Hunter, F. B. Hu, H. N. Lyon, B. F. Voight, M. Ridderstrale, L. Groop, P. Scheet, S. Sanna, G. R. Abecasis, G. Albai, R. Nagaraja, D. Schlessinger, A. U. Jackson, J. Tuomilehto, F. S. Collins, M. Boehnke, and K. L. Mohlke. 2008. Common variants near MC4R are associated with fat mass, weight and risk of obesity. Nat Genet 40 (6):768-75.
Maes, H. H., M. C. Neale, and L. J. Eaves. 1997. Genetic and environmental factors in relative body weight and human adiposity. Behav Genet 27 (4):325-51.
McDonald, L. 1975. Tests for the General Linear Hypothesis Under the Multiple Design Multivariate Linear Model. The Annals of Statistics 3 (2):461-466.
Parsons, T. J., C. Power, S. Logan, and C. D. Summerbell. 1999. Childhood predictors of adult obesity: a systematic review. Int J Obes Relat Metab Disord 23 Suppl 8:S1-107.
Pinheiro, J., and D. Bates. 2000. Mixed Effects Models in S and S-Plus: Springer.
Rasbash, J., F. Steele, W. J. Browne, and H. Goldstein. 2012. A User’s Guide to MLwiN, v2.26. Centre for Multilevel Modelling, University of Bristol.
Risch, N., and K. Merikangas. 1996. The future of genetic studies of complex human diseases. Science 273 (5281):1516-7.
Royall, R. M. 1986. Model Robust Confidence Intervals Using Maximum Likelihood Estimators. International Statistical Review / Revue Internationale de Statistique 54 (2):221-226.
Serdula, M. K., D. Ivery, R. J. Coates, D. S. Freedman, D. F. Williamson, and T. Byers. 1993. Do obese children become obese adults? A review of the literature. Prev Med 22 (2):167-77.
Sikorska, K., F. Rivadeneira, P. J. Groenen, A. Hofman, A. G. Uitterlinden, P. H. Eilers, and E. Lesaffre. 2013. Fast linear mixed model computations for genome-wide association studies with longitudinal data. Stat Med 32 (1):165-80.
Smith, E. N., W. Chen, M. Kahonen, J. Kettunen, T. Lehtimaki, L. Peltonen, O. T. Raitakari, R. M. Salem, N. J. Schork, M. Shaw, S. R. Srinivasan, E. J. Topol, J. S. Viikari, G. S. Berenson, and S. S. Murray. 2010. Longitudinal genome-wide association of cardiovascular disease risk factors in the Bogalusa heart study. PLoS Genet 6 (9).
Speliotes, E. K., C. J. Willer, S. I. Berndt, K. L. Monda, G. Thorleifsson, A. U. Jackson, H. L. Allen, C. M. Lindgren, J. Luan, R. Magi, J. C. Randall, S. Vedantam, T. W. Winkler, L. Qi, T. Workalemahu, I. M. Heid, V. Steinthorsdottir, H. M. Stringham, M. N. Weedon, E. Wheeler, A. R. Wood, T. Ferreira, R. J. Weyant, A. V. Segre, K. Estrada, L. Liang, J. Nemesh, J. H. Park, S. Gustafsson, T. O. Kilpelainen, J. Yang, N. Bouatia-Naji, T. Esko, M. F. Feitosa, Z. Kutalik, M. Mangino, S. Raychaudhuri, A. Scherag, A. V. Smith, R. Welch, J. H. Zhao, K. K. Aben, D. M. Absher, N. Amin, A. L. Dixon, E. Fisher, N. L. Glazer, M. E. Goddard, N. L. Heard-Costa, V. Hoesel, J. J. Hottenga, A. Johansson, T. Johnson, S. Ketkar, C. Lamina, S. Li, M. F. Moffatt, R. H. Myers, N. Narisu, J. R. Perry, M. J. Peters, M. Preuss, S. Ripatti, F. Rivadeneira, C. Sandholt, L. J. Scott, N. J. Timpson, J. P. Tyrer, S. van Wingerden, R. M. Watanabe, C. C. White, F. Wiklund, C. Barlassina, D. I. Chasman, M. N. Cooper, J. O. Jansson, R. W. Lawrence, N. Pellikka, I. Prokopenko, J. Shi, E. Thiering, H. Alavere, M. T. Alibrandi, P. Almgren, A. M. Arnold, T. Aspelund, L. D. Atwood, B. Balkau, A. J. Balmforth, A. J. Bennett, Y. Ben-Shlomo, R. N. Bergman, S. Bergmann, H. Biebermann, A. I. Blakemore, T. Boes, L. L. Bonnycastle, S. R. Bornstein, M. J. Brown, T. A. Buchanan, F. Busonero, H. Campbell, F. P. Cappuccio, C. Cavalcanti-Proenca, Y. D. Chen, C. M. Chen, P. S. Chines, R. Clarke, L. Coin, J. Connell, I. N. Day, M. Heijer, J. Duan, S. Ebrahim, P. Elliott, R. Elosua, G. Eiriksdottir, M. R. Erdos, J. G. Eriksson, M. F. Facheris, S. B. Felix, P. Fischer-Posovszky, A. R. Folsom, N. Friedrich, N. B. Freimer, M. Fu, S. Gaget, P. V. Gejman, E. J. Geus, C. Gieger, A. P. Gjesing, A. Goel, P. Goyette, H. Grallert, J. Grassler, D. M. Greenawalt, C. J. Groves, V. Gudnason, C. Guiducci, A. L. Hartikainen, N. Hassanali, A. S. Hall, A. S. Havulinna, C. Hayward, A. C. Heath, C. Hengstenberg, A. A. Hicks, A. Hinney, A. Hofman, G. Homuth, J. Hui, W. Igl, C. Iribarren, B. Isomaa, K. B. Jacobs, I. Jarick, E. Jewell, U. John, T. Jorgensen, P. Jousilahti, A. Jula, M. Kaakinen, E. Kajantie, L. M. Kaplan, S. Kathiresan, J. Kettunen, L. Kinnunen, J. W. Knowles, I. Kolcic, I. R. Konig, S. Koskinen, P. Kovacs, J. Kuusisto, P. Kraft, K. Kvaloy, J. Laitinen, O. Lantieri, C. Lanzani, L. J. Launer, C. Lecoeur, T. Lehtimaki, G. Lettre, J. Liu, M. L. Lokki, M. Lorentzon, R. N. Luben, B. Ludwig, P. Manunta, D. Marek, M. Marre, N. G. Martin, W. L. McArdle, A. McCarthy, B. McKnight, T. Meitinger, O. Melander, D. Meyre, K. Midthjell, G. W. Montgomery, M. A. Morken, A. P. Morris, R. Mulic, J. S. Ngwa, M. Nelis, M. J. Neville, D. R. Nyholt, C. J. O'Donnell, S. O'Rahilly, K. K. Ong, B. Oostra, G. Pare, A. N. Parker, M. Perola, I. Pichler, K. H. Pietilainen, C. G. Platou, O. Polasek, A. Pouta, S. Rafelt, O. Raitakari, N. W. Rayner, M. Ridderstrale, W. Rief, A. Ruokonen, N. R. Robertson, P. Rzehak, V. Salomaa, A. R. Sanders, M. S. Sandhu, S. Sanna, J. Saramies, M. J. Savolainen, S. Scherag, S. Schipf, S. Schreiber, H. Schunkert, K. Silander, J. Sinisalo, D. S. Siscovick, J. H. Smit, N. Soranzo, U. Sovio, J. Stephens, I. Surakka, A. J. Swift, M. L. Tammesoo, J. C. Tardif, M. Teder-Laving, T. M. Teslovich, J. R. Thompson, B. Thomson, A. Tonjes, T. Tuomi, J. B. van Meurs, G. J. van Ommen, V. Vatin, J. Viikari, S. Visvikis-Siest, V. Vitart, C. I. Vogel, B. F. Voight, L. L. Waite, H. Wallaschofski, G. B. Walters, E. Widen, S. Wiegand, S. H. Wild, G. Willemsen, D. R. Witte, J. C. Witteman, J. Xu, Q. Zhang, L. Zgaga, A. Ziegler, P. Zitting, J. P. Beilby, I. S. Farooqi, J. Hebebrand, H. V. Huikuri, A. L. James, M. Kahonen, D. F. Levinson, F. Macciardi, M. S. Nieminen, C. Ohlsson, L. J. Palmer, P. M. Ridker, M. Stumvoll, J. S. Beckmann, H. Boeing, E. Boerwinkle, D. I. Boomsma, M. J. Caulfield, S. J. Chanock, F. S. Collins, L. A. Cupples, G. D. Smith, J. Erdmann, P. Froguel, H. Gronberg, U. Gyllensten, P. Hall, T. Hansen, T. B. Harris, A. T. Hattersley, R. B. Hayes, J. Heinrich, F. B. Hu, K. Hveem, T. Illig, M. R. Jarvelin, J. Kaprio, F. Karpe, K. T. Khaw, L. A. Kiemeney, H. Krude, M. Laakso, D. A. Lawlor, A. Metspalu, P. B. Munroe, W. H. Ouwehand, O. Pedersen, B. W. Penninx, A. Peters, P. P. Pramstaller, T. Quertermous, T. Reinehr, A. Rissanen, I. Rudan, N. J. Samani, P. E. Schwarz, A. R. Shuldiner, T. D. Spector, J. Tuomilehto, M. Uda, A. Uitterlinden, T. T. Valle, M. Wabitsch, G. Waeber, N. J. Wareham, H. Watkins, J. F. Wilson, A. F. Wright, M. C. Zillikens, N. Chatterjee, S. A. McCarroll, S. Purcell, E. E. Schadt, P. M. Visscher, T. L. Assimes, I. B. Borecki, P. Deloukas, C. S. Fox, L. C. Groop, T. Haritunians, D. J. Hunter, R. C. Kaplan, K. L. Mohlke, J. R. O'Connell, L. Peltonen, D. Schlessinger, D. P. Strachan, C. M. van Duijn, H. E. Wichmann, T. M. Frayling, U. Thorsteinsdottir, G. R. Abecasis, I. Barroso, M. Boehnke, K. Stefansson, K. E. North, M. I. McCarthy, J. N. Hirschhorn, E. Ingelsson, and R. J. Loos. 2010. Association analyses of 249,796 individuals reveal 18 new loci associated with body mass index. Nat Genet 42 (11):937-48.
Taylor, J. M. G., W. G. Cumberland, and J. P. Sy. 1994. A Stochastic Model for Analysis of Longitudinal AIDS Data. Journal of the American Statistical Association 89 (427):727-736.
Taylor, J. M., and N. Law. 1998. Does the covariance structure matter in longitudinal modelling for the prediction of future CD4 counts? Stat Med 17 (20):2381-94.
Thorleifsson, G., G. B. Walters, D. F. Gudbjartsson, V. Steinthorsdottir, P. Sulem, A. Helgadottir, U. Styrkarsdottir, S. Gretarsdottir, S. Thorlacius, I. Jonsdottir, T. Jonsdottir, E. J. Olafsdottir, G. H. Olafsdottir, T. Jonsson, F. Jonsson, K. Borch-Johnsen, T. Hansen, G. Andersen, T. Jorgensen, T. Lauritzen, K. K. Aben, A. L. Verbeek, N. Roeleveld, E. Kampman, L. R. Yanek, L. C. Becker, L. Tryggvadottir, T. Rafnar, D. M. Becker, J. Gulcher, L. A. Kiemeney, O. Pedersen, A. Kong, U. Thorsteinsdottir, and K. Stefansson. 2009. Genome-wide association yields new sequence variants at seven loci that associate with measures of obesity. Nat Genet 41 (1):18-24.
Verbeke, G., and G. Molenberghs. 2000. Linear mixed models for longitudinal data: Springer Series in Statistics, Springer-Verlag, New York.
Verbeke, G., and E. Lesaffre. 1997. The effect of misspecifying the random-effects distribution in linear mixed models for longitudinal data. Computational Statistics & Data Analysis 23 (4):541-556.
Wardle, J., S. Carnell, C. M. Haworth, and R. Plomin. 2008. Evidence for a strong genetic influence on childhood adiposity despite the force of the obesogenic environment. Am J Clin Nutr 87 (2):398-404.
Warrington, N. M., Y. Y. Wu, C. E. Pennell, J. A. Marsh, L. J. Beilin, L. J. Palmer, S. J. Lye, and L. Briollais. 2013. Modelling BMI Trajectories in Children for Genetic Association Studies. PLoS One 8 (1):e53897.
White, I. 2010. simsum: Analysis of simulation studies including Monte Carlo error. The Stata Journal 10 (3):369-385.
WHO. 2000. Obesity: preventing and managing the global epidemic. Report of a WHO Consultation. WHO Technical Report Series 894. Geneva: World Health Organization, 2000.
Willer, C. J., E. K. Speliotes, R. J. Loos, S. Li, C. M. Lindgren, I. M. Heid, S. I. Berndt, A. L. Elliott, A. U. Jackson, C. Lamina, G. Lettre, N. Lim, H. N. Lyon, S. A. McCarroll, K. Papadakis, L. Qi, J. C. Randall, R. M. Roccasecca, S. Sanna, P. Scheet, M. N. Weedon, E. Wheeler, J. H. Zhao, L. C. Jacobs, I. Prokopenko, N. Soranzo, T. Tanaka, N. J. Timpson, P. Almgren, A. Bennett, R. N. Bergman, S. A. Bingham, L. L. Bonnycastle, M. Brown, N. P. Burtt, P. Chines, L. Coin, F. S. Collins, J. M. Connell, C. Cooper, G. D. Smith, E. M. Dennison, P. Deodhar, P. Elliott, M. R. Erdos, K. Estrada, D. M. Evans, L. Gianniny, C. Gieger, C. J. Gillson, C. Guiducci, R. Hackett, D. Hadley, A. S. Hall, A. S. Havulinna, J. Hebebrand, A. Hofman, B. Isomaa, K. B. Jacobs, T. Johnson, P. Jousilahti, Z. Jovanovic, K. T. Khaw, P. Kraft, M. Kuokkanen, J. Kuusisto, J. Laitinen, E. G. Lakatta, J. Luan, R. N. Luben, M. Mangino, W. L. McArdle, T. Meitinger, A. Mulas, P. B. Munroe, N. Narisu, A. R. Ness, K. Northstone, S. O'Rahilly, C. Purmann, M. G. Rees, M. Ridderstrale, S. M. Ring, F. Rivadeneira, A. Ruokonen, M. S. Sandhu, J. Saramies, L. J. Scott, A. Scuteri, K. Silander, M. A. Sims, K. Song, J. Stephens, S. Stevens, H. M. Stringham, Y. C. Tung, T. T. Valle, C. M. Van Duijn, K. S. Vimaleswaran, P. Vollenweider, G. Waeber, C. Wallace, R. M. Watanabe, D. M. Waterworth, N. Watkins, J. C. Witteman, E. Zeggini, G. Zhai, M. C. Zillikens, D. Altshuler, M. J. Caulfield, S. J. Chanock, I. S. Farooqi, L. Ferrucci, J. M. Guralnik, A. T. Hattersley, F. B. Hu, M. R. Jarvelin, M. Laakso, V. Mooser, K. K. Ong, W. H. Ouwehand, V. Salomaa, N. J. Samani, T. D. Spector, T. Tuomi, J. Tuomilehto, M. Uda, A. G. Uitterlinden, N. J. Wareham, P. Deloukas, T. M. Frayling, L. C. Groop, R. B. Hayes, D. J. Hunter, K. L. Mohlke, L. Peltonen, D. Schlessinger, D. P. Strachan, H. E. Wichmann, M. I. McCarthy, M. Boehnke, I. Barroso, G. R. Abecasis, and J. N. Hirschhorn. 2009. Six new loci associated with body mass index highlight a neuronal influence on body weight regulation. Nat Genet 41 (1):25-34.
World Health Organization. 2012. Obesity and Overweight Fact Sheet (No 311), May 2012 [cited 4 September 2012]. Available from http://www.who.int/mediacentre/factsheets/fs311/en/index.html.
Zhang, D., and M. Davidian. 2001. Linear mixed models with flexible distributions of random effects for longitudinal data. Biometrics 57 (3):795-802.
Table 1: Parameter estimates from the ALSPAC non-genetic model used to generate the data in the simulation study:
Effect Parameter Value
Intercept β0 16.534
Age β1 0.400
Age² β2 0.056
Age³ β3 -0.003
Source β4 -0.153
SD(b0) σ0 2.092
SD(b1) σ1 0.269
SD(b2) σ2 0.0235
Cor(b0, b1) ρ0 0.820
Cor(b0, b2) ρ1 -0.389
Cor(b1, b2) ρ2 -0.092
SD(ε) σ 1.063
Correlation structure ρ 0.394
Table 2: Coverage rates of the 95% confidence intervals of the fixed effects; bold and underlined cells are those that are significantly different from the nominal 95% based on 4,000 simulations under each design (1,000 simulations for each MAF combined into one summary statistic).
Sampling Design (two columns each, in order): Sparse Complete; Intense Complete; Equal Unbalanced; Unbalanced with more samples around the adiposity rebound; Unbalanced with fewer samples around the adiposity rebound
Sample Size N=1,000 N=3,000 N=1,000 N=3,000 N=1,000 N=3,000 N=1,000 N=3,000 N=1,000 N=3,000
Gaussian Distribution
SNP 95.43 95.03 95.08 95.45 94.83 95.23 95.08 94.70 95.40 94.73
SNP*age 95.00 95.23 94.58 95.13 94.35 94.63 94.30 93.90 94.53 94.35
t-distribution
SNP 95.45 95.35 95.90 94.55 95.13 94.85 94.65 94.48 95.48 94.95
SNP*age 95.30 94.80 94.05 94.13 94.45 94.10 93.70 94.00 93.33 94.03
Skew-normal Distribution
SNP 94.90 95.03 95.18 95.10 95.05 94.25 95.43 94.95 94.85 94.75
SNP*age 95.68 95.18 94.63 94.65 93.88 93.73 94.73 94.13 93.90 93.55
Mixture of 2 Gaussian Distributions
SNP 94.85 94.98 94.83 95.65 95.00 95.08 94.53 95.40 94.33 94.48
SNP*age 95.05 94.78 95.03 94.60 95.20 94.08 94.58 95.20 94.80 94.10
Variance dependent on a covariate
SNP 94.93 95.05 95.83 95.35 94.43 94.75 94.70 94.93 94.98 94.93
SNP*age 94.93 95.03 94.95 94.15 94.03 94.10 93.75 94.53 93.95 93.93
Variance greater at adiposity rebound
SNP 94.75 95.23 95.15 95.08 94.25 95.20 95.48 95.35 95.23 94.43
SNP*age 94.05 94.45 95.38 95.38 94.13 94.43 94.00 94.75 94.73 94.60
Variance increasing over time
SNP 94.20 95.00 94.08 94.38 94.80 94.33 94.30 94.03 95.98 95.48
SNP*age 94.10 94.88 91.78 92.38 94.70 94.23 93.28 93.48 95.65 95.25
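As a rough guide to reading the flagged cells in Tables 2 to 4, the sketch below computes the range of simulated rates that would be compatible with a nominal rate for a given number of simulations. It assumes a normal approximation to the binomial Monte Carlo error; the exact criterion used to flag cells is not restated here, so this is illustrative only, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def monte_carlo_interval(nominal, n_sims, level=0.95):
    """Range in which a simulated proportion is expected to fall, with
    probability `level`, if the true rate equals `nominal`
    (normal approximation to the binomial)."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * sqrt(nominal * (1.0 - nominal) / n_sims)
    return nominal - half_width, nominal + half_width


# Coverage of 95% confidence intervals from 4,000 simulations (Table 2).
print(monte_carlo_interval(0.95, 4000))
# Type 1 error at alpha = 0.05 from 20,000 simulations (Tables 3 and 4).
print(monte_carlo_interval(0.05, 20000))
```

Under these assumptions, coverage between roughly 94.3% and 95.7%, and type 1 error between roughly 0.047 and 0.053, would be consistent with the nominal values.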
Table 3: Type 1 error for the complete designs; bold and underlined cells are those that are significantly different from the nominal α=0.05 based on 20,000 simulations under each design (5,000 simulations for each MAF combined into one summary statistic).
Sampling Design (four columns each, in order): Sparse Complete; Intense Complete
Sample Size (two columns each) N=1,000 N=3,000 N=1,000 N=3,000
Standard Robust Standard Robust Standard Robust Standard Robust
Gaussian Distribution
SNP 0.0514 0.0528 0.0509 0.0513 0.0502 0.0521 0.0500 0.0510
SNP*age 0.0483 0.0504 0.0483 0.0491 0.0549 0.0486 0.0539 0.0467
Global wald test 0.0497 0.0478 0.0605 0.0620
t-distribution
SNP 0.0495 0.0498 0.0489 0.0496 0.0479 0.0510 0.0483 0.0502
SNP*age 0.0521 0.0534 0.0487 0.0492 0.0581 0.0490 0.0563 0.0465
Global wald test 0.0531 0.0508 0.0624 0.0629
Skew-normal Distribution
SNP 0.0502 0.0517 0.0524 0.0524 0.0509 0.0526 0.0525 0.0532
SNP*age 0.0503 0.0519 0.0461 0.0474 0.0541 0.0508 0.0529 0.0486
Global wald test 0.0493 0.0488 0.0621 0.0579
Mixture of 2 Gaussian Distributions
SNP 0.0498 0.0504 0.0479 0.0479 0.0485 0.0499 0.0510 0.0508
SNP*age 0.0502 0.0510 0.0492 0.0488 0.0528 0.0506 0.0529 0.0495
Global wald test 0.0498 0.0508 0.0615 0.0586
Variance dependent on a covariate
SNP 0.0523 0.0527 0.0488 0.0490 0.0485 0.0511 0.0459 0.0485
SNP*age 0.0546 0.0527 0.0531 0.0514 0.0520 0.0493 0.0524 0.0481
Global wald test 0.0515 0.0525 0.0556 0.0546
Variance greater at adiposity rebound
SNP 0.0472 0.0478 0.0511 0.0519 0.0477 0.0493 0.0471 0.0490
SNP*age 0.0528 0.0497 0.0570 0.0528 0.0513 0.0513 0.0491 0.0487
Global wald test 0.0527 0.0540 0.0502 0.0478
Variance increasing over time
SNP 0.0523 0.0536 0.0471 0.0473 0.0543 0.0513 0.0561 0.0522
SNP*age 0.0564 0.0538 0.0522 0.0491 0.0746 0.0528 0.0746 0.0530
Global wald test 0.0875 0.0549 0.0875 0.0497 0.1667 0.0506 0.1685 0.0506
Table 4: Type 1 error for the unbalanced designs; bold and underlined cells are those that are significantly different from the nominal α=0.05 based on 20,000 simulations under each design (5,000 simulations for each MAF combined into one summary statistic).
Sampling Design (four columns each, in order): Equal Unbalanced; Unbalanced with more samples around the adiposity rebound; Unbalanced with fewer samples around the adiposity rebound
Sample Size (two columns each) N=1,000 N=3,000 N=1,000 N=3,000 N=1,000 N=3,000
Standard Robust Standard Robust Standard Robust Standard Robust Standard Robust Standard Robust
Gaussian Distribution
SNP 0.0518 0.0532 0.0500 0.0508 0.0503 0.0521 0.0478 0.0490 0.0529 0.0550 0.0540 0.0542
SNP*age 0.0581 0.0526 0.0592 0.0531 0.0566 0.0514 0.0556 0.0496 0.0560 0.0511 0.0575 0.0509
Global wald test 0.0646 0.0598 0.0601 0.0615 0.0621 0.0609
t-distribution
SNP 0.0510 0.0522 0.0491 0.0497 0.0485 0.0500 0.0495 0.0505 0.0487 0.0499 0.0516 0.0523
SNP*age 0.0571 0.0487 0.0629 0.0539 0.0596 0.0508 0.0571 0.0475 0.0563 0.0487 0.0577 0.0489
Global wald test 0.0607 0.0621 0.0620 0.0583 0.0587 0.0605
Skew-normal Distribution
SNP 0.0493 0.0508 0.0495 0.0501 0.0498 0.0517 0.0473 0.0481 0.0512 0.0519 0.0482 0.0484
SNP*age 0.0548 0.0492 0.0589 0.0526 0.0580 0.0512 0.0571 0.0498 0.0593 0.0532 0.0547 0.0490
Global wald test 0.0618 0.0582 0.0616 0.0583 0.0632 0.0575
Mixture of 2 Gaussian Distributions
SNP 0.0519 0.0527 0.0490 0.0490 0.0505 0.0510 0.0519 0.0517 0.0510 0.0522 0.0487 0.0483
SNP*age 0.0534 0.0517 0.0487 0.0459 0.0510 0.0494 0.0538 0.0518 0.0543 0.0511 0.0551 0.0517
Global wald test 0.0579 0.0581 0.0605 0.0603 0.0589 0.0568
Variance dependent on a covariate
SNP 0.0495 0.0515 0.0482 0.0491 0.0498 0.0513 0.0506 0.0509 0.0512 0.0518 0.0528 0.0502
SNP*age 0.0586 0.0499 0.0607 0.0514 0.0576 0.0505 0.0588 0.0497 0.0605 0.0507 0.0604 0.0507
Global wald test 0.0589 0.0611 0.0597 0.0567 0.0620 0.0583
Variance greater at adiposity rebound
SNP 0.0493 0.0504 0.0492 0.0495 0.0486 0.0498 0.0516 0.0526 0.0506 0.0514 0.0496 0.0531
SNP*age 0.0570 0.0491 0.0563 0.0483 0.0546 0.0482 0.0563 0.0503 0.0561 0.0483 0.0600 0.0505
Global wald test 0.0572 0.0559 0.0568 0.0541 0.0588 0.0569
Variance increasing over time
SNP 0.0533 0.0545 0.0500 0.0502 0.0564 0.0563 0.0530 0.0520 0.0491 0.0523 0.0500 0.0526
SNP*age 0.0554 0.0536 0.0571 0.0540 0.0643 0.0570 0.0610 0.0527 0.0497 0.0534 0.0497 0.0513
Global wald test 0.0911 0.0576 0.0929 0.0529 0.1031 0.0578 0.1011 0.0548 0.0850 0.0559 0.0801 0.0520
Figure 1: Simulated power of the SNP main effect and SNP*age interaction terms for
complete designs. The two plots on the left are for the sparse complete design, while the
two plots on the right are from the intense complete design. The solid black line for the
Gaussian Distribution is the situation where the model is correctly specified.
Figure 2: Simulated power of the SNP main effect and SNP*age interaction terms for
unbalanced designs. “Equal” denotes the simulations from the equal unbalanced design, “Over”
denotes the simulations from the unbalanced design with fewer samples around the adiposity
rebound and “Under” denotes the simulations from the unbalanced design with more samples
around the adiposity rebound. The solid black line for the Gaussian Distribution is the
situation where the model is correctly specified.
Figure 3: Difference in power based on a normal standard error vs. a robust standard error
for the complete designs. A positive value indicates the power using the normal standard
error is greater than the power using the robust standard error. The two plots on the left
are for the sparse complete design, while the two plots on the right are from the intense
complete design. The solid black line for the Gaussian Distribution is the situation where the
model is correctly specified.
Figure 4: Difference in power based on a normal standard error vs. a robust standard error
for the unbalanced designs. A positive value indicates the power using the normal standard
error is greater than the power using the robust standard error. Here, “Equal” denotes the
simulations from the equal unbalanced design, “Over” denotes the simulations from the
unbalanced design with fewer samples around the adiposity rebound and “Under” denotes the
simulations from the unbalanced design with more samples around the adiposity rebound.
The solid black line for the Gaussian Distribution is the situation where the model is
correctly specified.
Figure 5: QQ plots of the chromosome 16 analysis in the ALSPAC cohort. These plots show
the observed -log10(P) against the expected -log10(P) under the null hypothesis for each
SNP on chromosome 16. P-values deviating from the red x=y line indicate significant
findings, whether they be false (i.e. inflation in type 1 error) or true.
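The coordinates of a QQ plot of this kind can be sketched as follows. This is a minimal illustration only: the plotting position (i - 0.5)/n for the expected P-values is an assumed convention, and the function name `qq_points` is hypothetical rather than taken from the analysis code used for the ALSPAC results.

```python
import math

def qq_points(p_values):
    """Return (expected, observed) -log10(P) pairs for a QQ plot.

    Assumes every P-value lies in (0, 1]. Under the null hypothesis the
    P-values are uniform, so the i-th smallest of n is expected to be
    about (i - 0.5) / n (an assumed plotting position).
    """
    n = len(p_values)
    # Smallest P-value first, i.e. largest -log10(P) first.
    observed = sorted((-math.log10(p) for p in p_values), reverse=True)
    expected = [-math.log10((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(expected, observed))

# Illustrative P-values only (not taken from the chromosome 16 analysis).
for exp_val, obs_val in qq_points([0.5, 0.01, 0.2, 0.9]):
    print(f"expected={exp_val:.3f} observed={obs_val:.3f}")
```

Points lying above the x=y line correspond to P-values smaller than expected under the null hypothesis.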
Appendix D: Additional Results from Simulation Analysis in Chapter Four
Table 1: Coverage rates of the 95% confidence intervals of the fixed effects under the sparse complete
design; bold and underlined cells are those that are significantly different from the nominal 95%
based on 1,000 simulations.
MAF 0.1 0.2 0.3 0.4
Sample Size N=1,000 N=3,000 N=1,000 N=3,000 N=1,000 N=3,000 N=1,000 N=3,000
Gaussian Distribution
SNP 95.2 94.7 94.6 94.2 96.4 95.5 95.5 95.7
SNP*age 95.4 94.6 95.3 94.9 95.6 95.2 93.7 96.2
SNP*age2 95.2 95.7 95.1 94.3 94.4 95.7 95.4 95.0
SNP*age3 94.6 95.6 95.7 93.7 95.9 95.1 95.4 94.0
t-distribution
SNP 96.0 95.3 95.3 94.9 94.7 96.4 95.8 94.8
SNP*age 95.9 93.9 94.5 94.6 95.2 95.5 95.6 95.2
SNP*age2 96.2 94.4 95.1 94.9 94.7 95.9 95.4 94.9
SNP*age3 95.5 95.5 95.0 94.1 94.8 96.8 94.5 94.4
Skew-normal Distribution
SNP 95.7 95.9 93.9 94.5 94.5 94.8 95.5 94.9
SNP*age 95.3 94.6 95.6 95.4 96.1 94.8 95.7 95.9
SNP*age2 95.5 96.2 95.8 95.9 94.7 95.0 95.1 94.5
SNP*age3 93.4 94.4 95.6 94.7 95.6 95.1 95.3 94.6
Mixture of 2 Gaussian Distributions
SNP 94.9 95.7 95.0 94.3 94.0 95.5 95.5 94.4
SNP*age 95.5 95.5 95.7 94.7 94.2 93.7 94.8 95.2
SNP*age2 95.6 94.2 95.0 94.4 94.0 95.5 94.4 93.9
SNP*age3 96.2 95.2 95.2 95.9 95.8 94.5 95.1 95.2
Variance dependent on a covariate
SNP 94.0 95.7 93.8 95.2 95.7 94.4 96.2 94.9
SNP*age 93.4 95.7 95.6 95.9 94.9 93.6 95.8 94.9
SNP*age2 94.6 95.1 94.5 94.4 94.7 95.2 94.4 95.0
SNP*age3 94.0 95.9 95.4 95.3 94.1 95.1 94.5 95.0
Variance greater at adiposity rebound
SNP 95.9 94.9 95.0 94.9 94.7 96.0 93.4 95.1
SNP*age 94.5 94.7 94.4 94.5 93.1 94.2 94.2 94.4
SNP*age2 95.0 95.9 95.6 95.5 95.4 94.2 94.5 94.6
SNP*age3 95.6 93.4 93.6 92.8 94.4 95.1 96.3 93.4
Variance increasing over time
SNP 95.5 94.6 93.8 94.0 93.1 95.7 94.4 95.7
SNP*age 95.0 94.1 94.3 95.3 92.2 94.8 94.9 95.3
SNP*age2 94.1 95.9 96.7 95.3 95.5 95.6 95.0 95.1
SNP*age3 88.9 88.3 90.1 89.5 88.6 88.7 88.7 89.3
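One way to judge whether an observed coverage rate differs from the nominal 95% with 1,000 simulations is a normal approximation to the binomial distribution. The sketch below is an assumption about how such a comparison could be made, not the documented test behind the highlighted cells; the function names are hypothetical.

```python
import math

def coverage_bounds(nominal=0.95, n_sims=1000, z=1.96):
    """Range of observed coverage compatible with the nominal rate.

    Uses the normal approximation to the binomial (an assumed method).
    """
    se = math.sqrt(nominal * (1.0 - nominal) / n_sims)
    return nominal - z * se, nominal + z * se

def differs_from_nominal(coverage_pct, nominal=0.95, n_sims=1000, z=1.96):
    """True if an observed coverage (in percent) falls outside the bounds."""
    lower, upper = coverage_bounds(nominal, n_sims, z)
    p = coverage_pct / 100.0
    return p < lower or p > upper

lower, upper = coverage_bounds()
print(f"Compatible range: {100 * lower:.2f}% to {100 * upper:.2f}%")
# SNP*age3, variance increasing over time, MAF 0.1, N=1,000 (sparse complete design)
print(differs_from_nominal(88.9))
# SNP, Gaussian distribution, MAF 0.1, N=1,000 (sparse complete design)
print(differs_from_nominal(95.2))
```

Under this approximation, coverage rates between roughly 93.6% and 96.4% are compatible with the nominal level, so the SNP*age3 coverage under variance increasing over time stands out clearly.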
Table 2: Coverage rates of the 95% confidence intervals of the fixed effects under the intense
complete design; bold and underlined cells are those that are significantly different from the nominal
95% based on 1,000 simulations.
MAF 0.1 0.2 0.3 0.4
Sample Size N=1,000 N=3,000 N=1,000 N=3,000 N=1,000 N=3,000 N=1,000 N=3,000
Gaussian Distribution
SNP 93.6 95.3 95.4 95.3 95.6 95.6 95.7 95.6
SNP*age 93.9 95.5 94.6 95.5 95.3 94.4 94.5 95.1
SNP*age2 94.7 94.4 94.7 96.0 94.8 95.4 94.9 94.3
SNP*age3 92.7 92.9 91.1 93.7 92.6 92.7 93.3 93.1
t-distribution
SNP 95.3 93.8 95.3 95.5 96.0 94.2 97.0 94.7
SNP*age 93.5 94.4 93.8 95.2 94.4 93.1 94.5 93.8
SNP*age2 94.7 95.7 95.5 95.3 95.2 94.9 95.1 95.0
SNP*age3 93.1 92.1 93.1 92.6 94.0 90.3 92.7 93.2
Skew-normal Distribution
SNP 95.1 94.9 95.1 95.5 95.1 95.3 95.4 94.7
SNP*age 94.7 93.9 95.0 95.7 94.4 94.8 94.4 94.2
SNP*age2 94.4 95.6 95.3 96.8 95.5 95.3 95.7 94.7
SNP*age3 93.3 91.5 93.1 93.6 92.6 94.1 92.2 93.2
Mixture of 2 Gaussian Distributions
SNP 94.8 96.3 94.6 95.6 95.6 95.7 94.3 95.0
SNP*age 95.5 94.3 94.6 94.8 95.0 94.7 95.0 94.6
SNP*age2 95.1 94.8 95.3 95.5 95.1 95.7 94.8 93.2
SNP*age3 93.4 92.4 93.6 91.5 91.9 93.3 92.9 92.9
Variance dependent on a covariate
SNP 95.6 94.5 96.4 95.5 95.2 95.3 96.1 96.1
SNP*age 95.4 92.9 95.0 95.3 93.6 93.6 95.8 94.8
SNP*age2 96.2 94.5 94.4 95.2 95.7 95.4 94.3 94.4
SNP*age3 94.8 93.3 94.4 93.8 92.5 93.5 92.9 94.3
Variance greater at adiposity rebound
SNP 95.4 95.1 95.8 95.1 96.3 94.6 93.1 95.5
SNP*age 94.7 95.7 96.5 95.1 95.1 96.0 95.2 94.7
SNP*age2 95.5 96.2 95.8 95.5 95.4 94.4 94.6 95.3
SNP*age3 94.8 96.1 96.7 95.6 94.6 95.8 96.8 95.6
Variance increasing over time
SNP 94.2 94.5 94.8 95.0 93.8 93.9 93.5 94.1
SNP*age 91.8 92.5 92.3 93.9 92.9 92.0 90.1 91.1
SNP*age2 93.4 94.2 91.3 92.8 93.4 94.6 93.0 93.9
SNP*age3 75.3 77.3 75.0 75.7 76.8 74.9 74.4 77.4
Table 3: Coverage rates of the 95% confidence intervals of the fixed effects under the equal
unbalanced design; bold and underlined cells are those that are significantly different from the
nominal 95% based on 1,000 simulations.
MAF 0.1 0.2 0.3 0.4
Sample Size N=1,000 N=3,000 N=1,000 N=3,000 N=1,000 N=3,000 N=1,000 N=3,000
Gaussian Distribution
SNP 95.6 95.0 93.8 95.3 94.5 95.3 95.4 95.3
SNP*age 94.2 95.3 93.6 95.0 93.7 92.7 95.9 95.5
SNP*age2 94.7 96.3 95.5 95.2 95.0 95.0 94.9 94.4
SNP*age3 91.0 93.9 92.3 92.2 90.9 92.6 93.6 93.6
t-distribution
SNP 95.0 94.7 94.5 95.0 95.0 94.1 96.0 95.6
SNP*age 94.6 93.8 94.6 94.5 94.1 93.4 94.5 94.7
SNP*age2 95.2 94.3 94.7 94.2 94.6 94.8 94.9 95.0
SNP*age3 92.6 92.6 93.8 93.3 93.1 91.6 93.6 94.2
Skew-normal Distribution
SNP 95.4 94.1 95.2 94.1 95.1 94.1 94.5 94.7
SNP*age 94.7 93.5 92.9 94.3 94.5 93.4 93.4 93.7
SNP*age2 93.1 95.2 95.2 93.9 94.4 94.7 93.8 93.9
SNP*age3 93.4 92.5 93.1 93.6 93.4 92.6 92.9 92.7
Mixture of 2 Gaussian Distributions
SNP 94.6 94.9 95.0 95.2 95.9 94.1 94.5 96.1
SNP*age 94.9 93.8 94.7 95.0 95.4 93.5 95.8 94.0
SNP*age2 93.5 95.1 95.4 94.7 96.2 95.6 94.3 95.3
SNP*age3 92.7 92.6 92.4 94.1 94.2 93.3 94.2 92.3
Variance dependent on a covariate
SNP 95.5 94.3 94.9 95.7 94.6 94.3 92.7 94.7
SNP*age 93.9 94.1 93.7 94.1 95.0 93.8 93.5 94.4
SNP*age2 94.9 94.9 95.0 93.8 95.6 96.6 95.4 94.4
SNP*age3 92.3 91.8 93.3 92.7 93.5 92.9 93.6 92.7
Variance greater at adiposity rebound
SNP 94.6 96.0 94.2 95.1 93.8 94.7 94.4 95.0
SNP*age 95.3 93.5 95.2 94.3 92.8 95.3 93.2 94.6
SNP*age2 95.5 95.3 95.2 95.6 95.1 94.5 95.4 94.7
SNP*age3 94.2 92.7 93.7 93.9 92.6 94.1 94.8 94.1
Variance increasing over time
SNP 94.5 94.0 95.6 94.3 93.9 94.5 95.2 94.5
SNP*age 96.0 94.6 94.5 94.0 94.9 94.3 93.4 94.0
SNP*age2 95.8 93.2 94.1 93.9 93.4 94.2 94.0 94.5
SNP*age3 88.4 88.5 88.1 89.3 87.5 88.0 87.5 87.7
Table 4: Coverage rates of the 95% confidence intervals of the fixed effects under the unbalanced
design with more samples around the adiposity rebound; bold and underlined cells are those that are
significantly different from the nominal 95% based on 1,000 simulations.
MAF 0.1 0.2 0.3 0.4
Sample Size N=1,000 N=3,000 N=1,000 N=3,000 N=1,000 N=3,000 N=1,000 N=3,000
Gaussian Distribution
SNP 93.6 94.7 95.7 94.1 95.3 93.9 95.7 93.9
SNP*age 94.6 93.9 94.2 94.5 93.5 94.3 94.9 93.9
SNP*age2 94.7 94.1 94.8 95.2 95.3 94.6 93.8 94.8
SNP*age3 91.8 92.9 92.3 92.3 92.3 92.6 92.5 92.6
t-distribution
SNP 95.0 94.9 94.3 93.9 95.0 94.2 94.3 94.9
SNP*age 94.3 94.2 93.2 94.2 92.8 93.5 94.5 94.1
SNP*age2 95.9 96.0 96.1 95.2 94.9 94.5 95.7 93.9
SNP*age3 93.7 93.2 93.1 92.3 92.3 93.5 92.4 91.9
Skew-normal Distribution
SNP 95.3 94.0 94.1 94.6 97.0 95.0 95.3 96.2
SNP*age 94.6 93.1 94.4 94.6 94.2 94.0 95.7 94.8
SNP*age2 95.5 94.2 94.0 95.7 95.6 96.1 95.1 94.4
SNP*age3 93.4 90.9 92.9 92.9 93.0 93.0 91.9 92.7
Mixture of 2 Gaussian Distributions
SNP 95.3 95.5 93.7 95.8 94.8 95.1 94.3 95.2
SNP*age 95.0 95.3 94.3 96.2 93.9 94.4 95.1 94.9
SNP*age2 94.4 94.6 94.6 95.0 94.3 95.5 95.1 94.1
SNP*age3 91.8 92.8 92.0 94.6 93.6 92.7 92.7 93.1
Variance dependent on a covariate
SNP 94.5 95.2 94.7 94.1 95.4 95.1 94.2 95.3
SNP*age 93.6 95.1 95.4 94.6 94.2 93.6 91.8 94.8
SNP*age2 95.3 95.5 94.5 94.4 95.8 94.9 94.9 94.0
SNP*age3 95.0 94.0 94.0 92.9 92.5 92.3 92.7 93.8
Variance greater at adiposity rebound
SNP 95.4 96.4 94.7 93.3 95.3 96.6 96.5 95.1
SNP*age 93.9 94.7 94.4 95.5 94.7 93.5 93.0 95.3
SNP*age2 94.9 94.2 94.5 94.7 96.1 96.2 95.8 96.0
SNP*age3 93.3 94.0 94.4 94.2 93.1 94.3 93.5 94.7
Variance increasing over time
SNP 95.3 93.6 93.6 95.0 93.9 93.5 94.4 94.0
SNP*age 93.3 93.7 94.5 93.6 92.1 93.3 93.2 93.3
SNP*age2 93.6 92.9 92.7 93.9 95.1 95.0 93.1 92.1
SNP*age3 85.7 86.4 86.5 85.3 85.8 87.2 84.9 85.8
Table 5: Coverage rates of the 95% confidence intervals of the fixed effects under the unbalanced
design with fewer samples around the adiposity rebound; bold and underlined cells are those that are
significantly different from the nominal 95% based on 1,000 simulations.
MAF 0.1 0.2 0.3 0.4
Sample Size N=1,000 N=3,000 N=1,000 N=3,000 N=1,000 N=3,000 N=1,000 N=3,000
Gaussian Distribution
SNP 95.1 94.9 95.1 94.6 96.2 95.3 95.2 94.1
SNP*age 94.1 95.1 94.8 93.6 95.4 94.4 93.8 94.3
SNP*age2 94.4 94.2 94.8 95.7 96.0 94.3 96.2 96.1
SNP*age3 91.7 94.1 92.9 92.2 93.9 94.3 93.6 91.9
t-distribution
SNP 94.6 95.3 95.5 94.4 96.4 95.3 95.4 94.8
SNP*age 92.9 93.5 94.5 92.1 92.3 94.9 93.6 95.6
SNP*age2 96.0 94.9 95.7 93.8 95.1 94.3 95.1 95.3
SNP*age3 92.8 92.3 92.8 93.4 91.5 91.9 92.1 92.3
Skew-normal Distribution
SNP 95.6 94.6 94.9 95.7 93.6 95.0 95.3 93.7
SNP*age 93.3 95.3 93.8 93.0 94.0 92.9 94.5 93.0
SNP*age2 95.8 94.9 94.7 95.0 95.6 94.7 95.6 95.3
SNP*age3 93.3 94.1 92.3 92.7 93.9 93.7 92.3 92.4
Mixture of 2 Gaussian Distributions
SNP 94.6 94.8 94.3 94.8 94.2 93.6 94.2 94.7
SNP*age 95.9 94.3 95.7 93.3 94.5 94.5 93.1 94.3
SNP*age2 95.9 93.3 95.8 95.4 94.5 94.6 94.3 94.7
SNP*age3 93.5 93.3 92.4 92.2 92.8 95.0 92.7 92.2
Variance dependent on a covariate
SNP 95.3 94.9 94.3 94.4 94.6 96.2 95.7 94.2
SNP*age 93.8 93.9 93.5 93.8 93.7 94.4 94.8 93.6
SNP*age2 95.6 94.7 95.3 94.4 94.4 94.1 94.3 95.4
SNP*age3 91.8 93.3 91.2 92.4 93.4 91.8 91.5 91.4
Variance greater at adiposity rebound
SNP 94.9 95.3 95.8 94.8 95.0 92.9 95.2 94.7
SNP*age 93.6 94.6 95.2 94.3 95.6 94.9 94.5 94.6
SNP*age2 96.3 94.1 95.1 94.8 94.2 94.5 94.8 93.6
SNP*age3 91.9 94.2 94.4 92.9 94.0 93.1 92.9 94.1
Variance increasing over time
SNP 96.3 95.1 96.4 95.2 95.3 95.7 95.9 95.9
SNP*age 95.0 95.9 95.7 94.8 96.0 94.2 95.9 96.1
SNP*age2 96.1 94.2 94.9 95.9 96.0 95.2 93.7 94.8
SNP*age3 90.4 89.4 89.1 89.7 89.9 89.9 89.8 90.0
Table 6: Bias and 95% confidence interval for the sparse complete design; bold and underlined cells are those whose confidence interval does not cover zero based on 1,000
simulations.
MAF 0.1 0.2 0.3 0.4
Sample Size N=1,000 N=3,000 N=1,000 N=3,000 N=1,000 N=3,000 N=1,000 N=3,000
Gaussian Distribution
SNP -0.001 (-0.0109,0.009)
0.0046 (-0.0014,0.0107)
0.0011 (-0.0065,0.0086)
-0.0019 (-0.0063,0.0026)
0.0029 (-0.0036,0.0094)
0.0029 (-0.0009,0.0067)
0.002 (-0.0042,0.0082)
-0.0007 (-0.0041,0.0028)
SNP*age -0.0005 (-0.0022,0.0012)
0.0007 (-0.0002,0.0017)
0.0001 (-0.0011,0.0013)
0.0003 (-0.0004,0.001)
0.0006 (-0.0005,0.0016)
0.0001 (-0.0005,0.0007)
0.0005 (-0.0006,0.0015)
-0.0003 (-0.0009,0.0003)
SNP*age2 0.00009 (-0.00006,0.00024)
-0.00002 (-0.00011,0.00006)
-0.00002 (-0.00013,0.00009)
-0.00001 (-0.00008,0.00005)
-0.00004 (-0.00014,0.00006)
-0.00005 (-0.00011,0)
-0.00001 (-0.0001,0.00008)
-0.00001 (-0.00006,0.00005)
SNP*age3 0.000004 (-0.00002,0.00003)
-0.000001 (-0.00002,0.00001)
-0.000002 (-0.00002,0.00002)
-0.00001 (-0.00002,0.000003)
0.000001 (-0.00002,0.00002)
-0.000001 (-0.00001,0.00001)
-0.000003 (-0.00002,0.00001)
0.000004 (-0.00001,0.00001)
t-distribution
SNP -0.008 (-0.0182,0.0021)
-0.0006 (-0.0067,0.0054)
0.0072 (-0.0006,0.015)
-0.0041 (-0.0088,0.0006)
0.0002 (-0.0067,0.0071)
0.0004 (-0.0034,0.0041)
0.0013 (-0.0051,0.0078)
0.0015 (-0.0022,0.0052)
SNP*age -0.0011 (-0.003,0.0007)
-0.0004 (-0.0015,0.0007)
0.001 (-0.0004,0.0024)
-0.0003 (-0.0012,0.0005)
-0.0005 (-0.0018,0.0007)
0 (-0.0007,0.0007)
0.0004 (-0.0008,0.0016)
-0.0001 (-0.0008,0.0006)
SNP*age2 -0.00002 (-0.00019,0.00014)
0.00006 (-0.00004,0.00016)
-0.00003 (-0.00016,0.00009)
0.00004 (-0.00004,0.00011)
0 (-0.00012,0.00011)
0.00003 (-0.00003,0.0001)
0.00003 (-0.00007,0.00014)
-0.00001 (-0.00007,0.00005)
SNP*age3 -0.000002 (-0.00004,0.00003)
0.000011 (-0.00001,0.00003)
-0.000005 (-0.00003,0.00002)
-0.000001 (-0.00002,0.00002)
0.00001 (-0.00001,0.00003)
0.000005 (-0.00001,0.00002)
0 (-0.00002,0.00002)
0.000007 (-0.00001,0.00002)
Skew-normal Distribution
SNP 0.0004 (-0.0095,0.0103)
-0.0015 (-0.0072,0.0042)
-0.0093 (-0.017,-0.0017)
0.0026 (-0.0018,0.007)
-0.0002 (-0.0068,0.0063)
-0.0014 (-0.0052,0.0024)
-0.0015 (-0.0075,0.0045)
-0.0007 (-0.0042,0.0029)
SNP*age 0 (-0.0017,0.0017)
0.0002 (-0.0008,0.0012)
-0.0002 (-0.0014,0.0011)
0.0004 (-0.0003,0.0011)
0.0012 (0.0001,0.0022)
-0.0002 (-0.0008,0.0005)
-0.0003 (-0.0013,0.0007)
-0.0004 (-0.0009,0.0002)
SNP*age2 0 (-0.00015,0.00015)
0.00004 (-0.00004,0.0001)
0.00008 (-0.00003,0.00019)
-0.00001 (-0.00008,0.00005)
0.0001 (0,0.0002)
0.00006 (0,0.00011)
0.00005 (-0.00004,0.00015)
-0.00003 (-0.00009,0.00002)
SNP*age3 -0.000006 (-0.00003,0.00002)
-0.000004 (-0.00002,0.00001)
-0.000016 (-0.00004,0.000003)
-0.000002 (-0.00001,0.00001)
-0.000025 (-0.00004,-0.00001)
0.000006 (-0.000004,0.00002)
0.000001 (-0.00002,0.00002)
0.000005 (-0.00001,0.00001)
Mixture of 2 Gaussian Distributions
SNP -0.005 (-0.0152,0.0052)
-0.0025 (-0.0082,0.0033)
0.0024 (-0.0053,0.0101)
-0.0027 (-0.0073,0.0019)
-0.0017 (-0.0086,0.0051)
0.0013 (-0.0025,0.0052)
0.0034 (-0.0027,0.0096)
-0.0027 (-0.0064,0.0009)
SNP*age 0 (-0.0014,0.0013)
-0.0003 (-0.0011,0.0005)
0 (-0.0011,0.001)
-0.0006 (-0.0012,0)
-0.0001 (-0.001,0.0009)
0.0003 (-0.0003,0.0009)
0.0002 (-0.0007,0.001)
-0.0006 (-0.0011,-0.0001)
SNP*age2 0.00011 (-0.00002,0.00023)
-0.00001 (-0.00008,0.00006)
-0.00001 (-0.00011,0.00008)
0.00003 (-0.00002,0.00009)
0.00003 (-0.00005,0.00012)
0.00002 (-0.00003,0.00007)
0 (-0.00008,0.00008)
0.00002 (-0.00003,0.00006)
SNP*age3 0.000002 (-0.00001,0.00002)
0.000005 (-0.00001,0.00001)
0.000003 (-0.00001,0.00001)
0.000005 (-0.000001,0.00001)
0 (-0.00001,0.000009)
-0.000003 (-0.00001,0.000002)
-0.000003 (-0.00001,0.00001)
0.000002 (-0.00001,0.00001)
Variance dependent on a covariate
SNP 0.0032 (-0.0074,0.0138)
-0.0016 (-0.0074,0.0041)
0.0006 (-0.0073,0.0085)
0.0027 (-0.0018,0.0073)
-0.0021 (-0.0088,0.0046)
-0.0018 (-0.0057,0.0022)
-0.0001 (-0.0063,0.0061)
0 (-0.0036,0.0037)
SNP*age 0 (-0.0019,0.0019)
-0.0004 (-0.0014,0.0006)
-0.0009 (-0.0023,0.0005)
-0.0002 (-0.0009,0.0006)
0.0003 (-0.0009,0.0015)
0.0001 (-0.0006,0.0008)
0.0001 (-0.001,0.0012)
0.0001 (-0.0005,0.0008)
SNP*age2 0.00001 (-0.00016,0.00017)
-0.00002 (-0.00011,0.00007)
0 (-0.00012,0.00012)
-0.00005 (-0.00012,0.00002)
0.00003 (-0.00007,0.00014)
0.00004 (-0.00002,0.0001)
-0.00001 (-0.00011,0.00008)
0.00001 (-0.00004,0.00007)
SNP*age3 0.000018 (-0.000014,0.000051)
0.000005 (-0.00001,0.00002)
0.00002 (-0.000003,0.00004)
0.000007 (-0.00001,0.00002)
-0.000004 (-0.00003,0.00002)
-0.000004 (-0.00002,0.00001)
-0.000005 (-0.00002,0.00002)
0.000001 (-0.00001,0.00001)
Variance greater at adiposity rebound
SNP -0.002 (-0.0121,0.0082)
0.0008 (-0.0051,0.0067)
-0.0014 (-0.0091,0.0062)
0.0001 (-0.0043,0.0046)
0.0016 (-0.0051,0.0084)
-0.0012 (-0.005,0.0026)
0.0027 (-0.0038,0.0092)
-0.0013 (-0.005,0.0024)
SNP*age -0.0007 (-0.0025,0.001)
-0.0004 (-0.0015,0.0006)
0.0007 (-0.0006,0.0021)
0.0001 (-0.0007,0.0009)
0.0008 (-0.0004,0.002)
0 (-0.0006,0.0007)
0 (-0.0011,0.0011)
-0.0003 (-0.0009,0.0003)
SNP*age2 0.00003 (-0.00012,0.00019)
-0.00001 (-0.0001,0.00008)
0.00004 (-0.00007,0.00015)
0.00004 (-0.00003,0.00011)
0.00003 (-0.00007,0.00013)
0.00001 (-0.00005,0.00007)
-0.00002 (-0.00011,0.00008)
-0.00001 (-0.00006,0.00005)
SNP*age3 0.00001 (-0.000019,0.000039)
0.00001 (-0.00001,0.00003)
-0.000016 (-0.00004,0.00001)
-0.000002 (-0.00002,0.00001)
-0.000012 (-0.00003,0.00001)
0.000001 (-0.00001,0.00001)
0.00001 (-0.00001,0.00003)
0 (-0.00001,0.00001)
Variance increasing over time
SNP 0.0028 (-0.0075,0.0131)
-0.0082 (-0.0141,-0.0022)
-0.0055 (-0.0132,0.0023)
-0.0038 (-0.0083,0.0007)
-0.0008 (-0.0081,0.0065)
-0.0023 (-0.0061,0.0015)
-0.0002 (-0.0065,0.0061)
-0.0036 (-0.0071,0)
SNP*age 0.0017 (-0.0002,0.0036)
-0.0011 (-0.0022,0)
-0.0005 (-0.0019,0.0009)
-0.0003 (-0.001,0.0005)
0.0002 (-0.0011,0.0015)
-0.0004 (-0.0011,0.0003)
-0.0001 (-0.0012,0.001)
0 (-0.0006,0.0007)
SNP*age2 0.00006 (-0.00014,0.00026)
0.00004 (-0.00007,0.00014)
0.00003 (-0.0001,0.00017)
-0.00005 (-0.00013,0.00004)
0.00008 (-0.00005,0.00021)
-0.00002 (-0.00009,0.00005)
-0.00003 (-0.00015,0.00009)
0.00005 (-0.00002,0.00012)
SNP*age3 -0.000014 (-0.00005,0.00002)
0.000009 (-0.00001,0.00003)
-0.000003 (-0.00003,0.00003)
-0.000002 (-0.00002,0.00001)
-0.000001 (-0.00003,0.00002)
-0.000001 (-0.00002,0.00001)
0.000007 (-0.00002,0.00003)
0.000001 (-0.00001,0.00001)
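Entries of the form "bias (lower,upper)" can be produced from a set of simulated estimates with a standard Monte Carlo summary. The sketch below assumes the bias is the mean estimate minus the true value and that the interval uses the Monte Carlo standard error with a normal quantile; this is an illustrative assumption rather than the documented calculation, and all names and simulated values are hypothetical.

```python
import math
import random
import statistics

def bias_with_ci(estimates, true_value, z=1.96):
    """Bias of simulated estimates with a Monte Carlo confidence interval."""
    n = len(estimates)
    bias = statistics.fmean(estimates) - true_value
    # Monte Carlo standard error of the mean estimate.
    mc_se = statistics.stdev(estimates) / math.sqrt(n)
    return bias, bias - z * mc_se, bias + z * mc_se

def ci_excludes_zero(lower, upper):
    """True if the interval lies strictly on one side of zero."""
    return lower > 0 or upper < 0

random.seed(1)
true_beta = 0.05  # hypothetical true SNP effect
# Hypothetical estimates from 1,000 simulated data sets.
estimates = [random.gauss(true_beta, 0.1) for _ in range(1000)]
bias, lower, upper = bias_with_ci(estimates, true_beta)
print(f"{bias:.4f} ({lower:.4f},{upper:.4f})", ci_excludes_zero(lower, upper))
```

Because the tabulated limits are rounded, an interval printed with a limit of exactly 0 is treated here as covering zero.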
Table 7: Bias and 95% confidence interval for the intense complete design; bold and underlined cells are those whose confidence interval does not cover zero based on
1,000 simulations.
MAF 0.1 0.2 0.3 0.4
Sample Size N=1,000 N=3,000 N=1,000 N=3,000 N=1,000 N=3,000 N=1,000 N=3,000
Gaussian Distribution
SNP -0.0073 (-0.0178,0.0032)
0.0002 (-0.0056,0.006)
-0.0052 (-0.0127,0.0022)
0.0041 (-0.0001,0.0083)
-0.0007 (-0.007,0.0056)
0.0013 (-0.0024,0.005)
-0.0043 (-0.0104,0.0019)
0.0032 (-0.0003,0.0067)
SNP*age -0.0007 (-0.0024,0.0009)
-0.0008 (-0.0017,0.0001)
-0.0008 (-0.002,0.0004)
-0.0001 (-0.0007,0.0006)
0 (-0.001,0.001)
0.0001 (-0.0005,0.0007)
-0.0001 (-0.0011,0.0008)
0.0005 (-0.0001,0.001)
SNP*age2 0.0001 (-0.00005,0.00025)
-0.00004 (-0.00013,0.00004)
0.00008 (-0.00003,0.00019)
-0.00002 (-0.00008,0.00004)
0.00001 (-0.00009,0.0001)
-0.00002 (-0.00007,0.00003)
0.00005 (-0.00004,0.00014)
0 (-0.00006,0.00005)
SNP*age3 0.000002 (-0.00002,0.00003)
0.000006 (-0.00001,0.00002)
0.000009 (-0.00001,0.00003)
0.000007 (-0.000003,0.00002)
0.000005 (-0.00001,0.00002)
0.000003 (-0.00001,0.00001)
-0.000001 (-0.00002,0.00001)
0.000002 (-0.00001,0.00001)
t-distribution
SNP 0.0007 (-0.0098,0.0113)
0.0011 (-0.0051,0.0074)
-0.0042 (-0.012,0.0036)
0.0007 (-0.0037,0.0051)
0.0025 (-0.0041,0.0091)
-0.0007 (-0.0048,0.0034)
-0.0022 (-0.0084,0.0039)
0.0008 (-0.0029,0.0045)
SNP*age 0.0012 (-0.0006,0.003)
-0.0001 (-0.0011,0.001)
-0.0004 (-0.0017,0.001)
0.0002 (-0.0006,0.001)
0.0004 (-0.0007,0.0016)
-0.0006 (-0.0013,0.0001)
-0.0011 (-0.0022,-0.0001)
0.0002 (-0.0004,0.0009)
SNP*age2 0.00004 (-0.00014,0.00021)
0 (-0.0001,0.0001)
0.00005 (-0.00007,0.00018)
-0.00003 (-0.0001,0.00004)
0 (-0.0001,0.00011)
-0.00001 (-0.00008,0.00005)
0.00009 (-0.00001,0.00019)
-0.00002 (-0.00008,0.00004)
SNP*age3 -0.00001 (-0.00004,0.00002)
0.000006 (-0.00001,0.00003)
0.000004 (-0.00002,0.00003)
-0.000003 (-0.00002,0.00001)
0.000001 (-0.00002,0.00002)
0.000006 (-0.00001,0.00002)
0.000011 (-0.00001,0.00003)
-0.000003 (-0.00001,0.00001)
Skew-normal Distribution
SNP 0.0033 (-0.0069,0.0136)
0.0006 (-0.0051,0.0064)
0.0027 (-0.0048,0.0101)
0.0026 (-0.0017,0.0068)
-0.0015 (-0.0081,0.0051)
0.0006 (-0.0031,0.0043)
-0.0009 (-0.007,0.0052)
-0.0004 (-0.0039,0.0031)
SNP*age 0.0007 (-0.0008,0.0023)
0 (-0.001,0.0009)
0.0009 (-0.0003,0.002)
0.0001 (-0.0006,0.0007)
-0.0004 (-0.0014,0.0006)
0.0003 (-0.0003,0.0009)
-0.0001 (-0.0011,0.0009)
-0.0001 (-0.0007,0.0005)
SNP*age2 -0.00014 (-0.00029,0.00001)
0.00002 (-0.00006,0.0001)
0 (-0.00011,0.00011)
-0.00003 (-0.00009,0.00003)
-0.00002 (-0.00011,0.00008)
-0.00003 (-0.00009,0.00002)
0.00001 (-0.00008,0.0001)
-0.00001 (-0.00006,0.00004)
SNP*age3 -0.00001 (-0.00003,0.00001)
0.000008 (-0.00001,0.00002)
-0.000006 (-0.00002,0.00001)
0.000004 (-0.00001,0.00001)
0 (-0.00002,0.00002)
-0.000007 (-0.00002,0.000002)
0.000007 (-0.00001,0.00002)
-0.000001 (-0.00001,0.000007)
Mixture of 2 Gaussian Distributions
SNP -0.0028 (-0.0131,0.0075)
-0.0004 (-0.0061,0.0054)
0.0026 (-0.0051,0.0103)
-0.0007 (-0.005,0.0037)
0.0044 (-0.0021,0.0108)
0.0014 (-0.0023,0.0051)
-0.0063 (-0.0127,0.0001)
-0.0033 (-0.0069,0.0003)
SNP*age -0.001 (-0.0023,0.0003)
0 (-0.0008,0.0008)
-0.0001 (-0.0012,0.0009)
0.0002 (-0.0004,0.0008)
0.0003 (-0.0006,0.0012)
0.0001 (-0.0004,0.0006)
-0.0002 (-0.0011,0.0006)
-0.0004 (-0.0009,0)
SNP*age2 -0.00001 (-0.00014,0.00011)
-0.00002 (-0.0001,0.00005)
-0.00005 (-0.00014,0.00004)
0 (-0.00005,0.00005)
-0.00003 (-0.00011,0.00006)
-0.00006 (-0.0001,-0.00002)
0.00003 (-0.00005,0.00011)
0.00001 (-0.00004,0.00005)
SNP*age3 0.000014 (0.000001,0.000028)
-0.000001 (-0.00001,0.00001)
0.000003 (-0.00001,0.00001)
0.000001 (-0.00001,0.00001)
0.000004 (-0.00001,0.00001)
0.000001 (-0.000004,0.00001)
0.000004 (-0.00001,0.00001)
0.000005 (0,0.00001)
Variance dependent on a covariate
SNP -0.0052 (-0.0153,0.005)
0.0039 (-0.0021,0.01)
-0.0051 (-0.0125,0.0024)
0.0018 (-0.0024,0.0061)
0.0031 (-0.0037,0.0098)
0.0001 (-0.0037,0.0039)
-0.0092 (-0.0155,-0.0028)
0.0032 (-0.0003,0.0067)
SNP*age -0.0003 (-0.0019,0.0014)
0 (-0.001,0.001)
0.0004 (-0.0009,0.0016)
0.0001 (-0.0006,0.0009)
0.0005 (-0.0006,0.0017)
0 (-0.0007,0.0007)
-0.0014 (-0.0024,-0.0003)
0.0004 (-0.0002,0.001)
SNP*age2 0.00009 (-0.00006,0.00025)
-0.00004 (-0.00013,0.00005)
0.00005 (-0.00007,0.00017)
0.00004 (-0.00002,0.00011)
-0.00001 (-0.00012,0.00009)
0.00001 (-0.00005,0.00007)
0.00005 (-0.00004,0.00015)
0.00001 (-0.00005,0.00006)
SNP*age3 -0.000002 (-0.00003,0.00003)
0.000005 (-0.00001,0.00002)
-0.000002 (-0.00002,0.00002)
0.000005 (-0.00001,0.00002)
-0.000003 (-0.00002,0.00002)
-0.000002 (-0.00001,0.00001)
0.00001 (-0.00001,0.00003)
-0.000003 (-0.00001,0.00001)
Variance greater at adiposity rebound
SNP -0.004 (-0.0144,0.0065)
-0.0018 (-0.0077,0.004)
-0.0014 (-0.0091,0.0062)
-0.0044 (-0.0088,0.0001)
0.0074 (0.0009,0.014)
-0.0032 (-0.0071,0.0007)
-0.0024 (-0.0091,0.0042)
-0.0008 (-0.0044,0.0028)
SNP*age -0.0001 (-0.0018,0.0016)
0 (-0.0009,0.001)
0.0008 (-0.0003,0.002)
-0.0006 (-0.0013,0.0002)
0.0008 (-0.0003,0.0019)
-0.0004 (-0.001,0.0002)
-0.0001 (-0.0011,0.0009)
-0.0003 (-0.0009,0.0003)
SNP*age2 0.00006 (-0.00009,0.00022)
0.00002 (-0.00007,0.0001)
0.00006 (-0.00005,0.00017)
0.00003 (-0.00004,0.00009)
-0.00004 (-0.00014,0.00006)
0.00001 (-0.00005,0.00007)
0.00009 (-0.00001,0.00019)
-0.00006 (-0.00011,0)
SNP*age3 -0.000009 (-0.00004,0.00002)
-0.000004 (-0.00002,0.00001)
-0.000022 (-0.00004,-0.000004)
0.000002 (-0.00001,0.00001)
-0.000008 (-0.00003,0.00001)
0.000001 (-0.00001,0.00001)
0.000002 (-0.00001,0.00002)
0.000007 (-0.000002,0.00002)
Variance increasing over time
SNP 0.002 (-0.0087,0.0127)
-0.0022 (-0.0084,0.0039)
-0.0036 (-0.0114,0.0042)
0.001 (-0.0034,0.0054)
-0.0011 (-0.0079,0.0058)
0.0009 (-0.0031,0.0048)
-0.003 (-0.0095,0.0035)
0.0047 (0.001,0.0084)
SNP*age 0 (-0.0017,0.0018)
-0.0004 (-0.0014,0.0006)
-0.0007 (-0.002,0.0006)
0.0003 (-0.0005,0.001)
0 (-0.0012,0.0011)
0.0001 (-0.0006,0.0008)
-0.0005 (-0.0016,0.0006)
0.0009 (0.0002,0.0015)
SNP*age2 -0.00018 (-0.00037,0.00001)
0 (-0.00011,0.00011)
-0.00004 (-0.00019,0.0001)
-0.00002 (-0.0001,0.00006)
-0.00008 (-0.0002,0.00004)
-0.00001 (-0.00008,0.00006)
-0.00006 (-0.00018,0.00006)
-0.00005 (-0.00012,0.00002)
SNP*age3 -0.000015 (-0.000049,0.00002)
0.000005 (-0.00002,0.00003)
0 (-0.00003,0.00003)
-0.000008 (-0.00002,0.00001)
-0.000015 (-0.00004,0.00001)
0 (-0.00001,0.00001)
-0.000003 (-0.00003,0.00002)
-0.000012 (-0.000024,0)
Table 8: Bias and 95% confidence interval for the equal unbalanced design; bold and underlined cells are those whose confidence interval does not cover zero based on
1,000 simulations.
MAF 0.1 0.2 0.3 0.4
Sample Size N=1,000 N=3,000 N=1,000 N=3,000 N=1,000 N=3,000 N=1,000 N=3,000
Gaussian Distribution
SNP -0.005 (-0.015,0.0051)
0.0008 (-0.005,0.0067)
0.0036 (-0.0041,0.0114)
-0.0052 (-0.0097,-0.0008)
0.0026 (-0.0041,0.0093)
-0.0024 (-0.0062,0.0015)
-0.0065 (-0.0126,-0.0003)
-0.0002 (-0.0037,0.0033)
SNP*age -0.0003 (-0.0021,0.0015)
0 (-0.001,0.0009)
0.0003 (-0.001,0.0015)
-0.0008 (-0.0015,-0.0001)
0 (-0.0011,0.0011)
-0.0003 (-0.001,0.0004)
-0.0008 (-0.0018,0.0002)
-0.0002 (-0.0008,0.0004)
SNP*age2 0.00012 (-0.00005,0.00028)
-0.00004 (-0.00013,0.00005)
-0.00003 (-0.00015,0.00009)
0.00006 (-0.00001,0.00013)
-0.00003 (-0.00014,0.00008)
0.00002 (-0.00005,0.00008)
0.00002 (-0.00008,0.00012)
-0.00003 (-0.00009,0.00003)
SNP*age3 -0.000008 (-0.00004,0.000023)
0.000009 (-0.00001,0.00003)
0.000011 (-0.00001,0.00003)
0.000001 (-0.00001,0.00001)
0.000009 (-0.00001,0.00003)
0.000004 (-0.00001,0.00002)
0.000001 (-0.00002,0.00002)
-0.000004 (-0.00001,0.00001)
t-distribution
SNP 0.0045 (-0.0058,0.0148)
0.0071 (0.001,0.0133)
0.0018 (-0.0063,0.0099)
0.0004 (-0.0042,0.005)
-0.0051 (-0.012,0.0017)
0.0026 (-0.0014,0.0066)
0.0024 (-0.0039,0.0086)
-0.0019 (-0.0056,0.0018)
SNP*age 0.0029 (0.001,0.0048)
0.0005 (-0.0006,0.0017)
-0.0009 (-0.0024,0.0005)
0.0001 (-0.0007,0.001)
-0.0008 (-0.002,0.0005)
0.0001 (-0.0007,0.0008)
-0.0003 (-0.0015,0.0009)
-0.0002 (-0.0009,0.0005)
SNP*age2 -0.00002 (-0.00021,0.00017)
-0.00006 (-0.00017,0.00005)
-0.00003 (-0.00018,0.00012)
-0.00006 (-0.00015,0.00002)
0.00011 (-0.00001,0.00024)
-0.0001 (-0.00017,-0.00003)
0.00002 (-0.0001,0.00013)
0.00001 (-0.00005,0.00008)
SNP*age3 -0.000047 (-0.00009,-0.00001)
-0.000005 (-0.00003,0.00002)
0.000029 (0,0.000059)
-0.000002 (-0.00002,0.00001)
0.000003 (-0.00002,0.00003)
-0.000004 (-0.00002,0.00001)
0.000021 (-0.000002,0.00005)
-0.000005 (-0.00002,0.00001)
Skew-normal Distribution
SNP 0.0063 (-0.0038,0.0165)
-0.0029 (-0.009,0.0033)
-0.0006 (-0.008,0.0068)
0.0012 (-0.0033,0.0057)
0.0013 (-0.0051,0.0076)
-0.0013 (-0.0051,0.0025)
-0.0035 (-0.0097,0.0027)
0.0005 (-0.0031,0.0041)
SNP*age 0.0014 (-0.0003,0.0031)
-0.0006 (-0.0016,0.0003)
0.0004 (-0.0009,0.0017)
-0.0004 (-0.0011,0.0003)
0.0002 (-0.0009,0.0013)
-0.0003 (-0.001,0.0004)
-0.0002 (-0.0013,0.0009)
0.0001 (-0.0005,0.0007)
SNP*age2 0.00002 (-0.00015,0.00019)
-0.00003 (-0.00013,0.00006)
-0.00001 (-0.00013,0.00012)
-0.00006 (-0.00013,0.00001)
-0.00004 (-0.00015,0.00007)
-0.00001 (-0.00007,0.00006)
0.00001 (-0.00009,0.00011)
0.00004 (-0.00002,0.0001)
SNP*age3 -0.000001 (-0.00003,0.00003)
0.000008 (-0.00001,0.00003)
-0.00001 (-0.00003,0.00001)
0.000012 (-0.000001,0.00003)
-0.000006 (-0.00003,0.00001)
-0.000001 (-0.00001,0.00001)
0.000005 (-0.00001,0.00002)
0.000002 (-0.00001,0.00001)
Mixture of 2 Gaussian Distributions
SNP -0.0073 (-0.0177,0.0031)
0.0054 (-0.0006,0.0113)
0.0021 (-0.0055,0.0098)
-0.0012 (-0.0056,0.0032)
-0.0065 (-0.0129,0)
-0.0006 (-0.0044,0.0033)
-0.0028 (-0.009,0.0034)
0.0021 (-0.0015,0.0056)
SNP*age -0.001 (-0.0024,0.0004)
0.0007 (-0.0001,0.0016)
-0.0003 (-0.0013,0.0008)
0 (-0.0006,0.0006)
-0.0007 (-0.0016,0.0002)
-0.0001 (-0.0007,0.0004)
-0.0006 (-0.0015,0.0002)
0.0006 (0.0001,0.0011)
SNP*age2 -0.00003 (-0.00016,0.0001)
-0.00006 (-0.00013,0.00002)
-0.00005 (-0.00015,0.00005)
0 (-0.00005,0.00006)
0.00005 (-0.00004,0.00013)
0 (-0.00005,0.00005)
0.00004 (-0.00004,0.00012)
0.00003 (-0.00002,0.00007)
SNP*age3 0.000008 (-0.00001,0.00003)
-0.000007 (-0.00002,0.000004)
0.000015 (0.000001,0.000028)
-0.000004 (-0.00001,0.000004)
0.000007 (-0.000004,0.00002)
0.000005 (-0.000002,0.00001)
0.000005 (-0.00001,0.00002)
-0.000007 (-0.00001,-0.000001)
Variance dependent on a covariate
SNP 0.0053 (-0.0049,0.0154)
0.0005 (-0.0056,0.0066)
0.0033 (-0.0043,0.011)
-0.0038 (-0.0081,0.0005)
-0.0018 (-0.0086,0.0051)
0.0014 (-0.0025,0.0054)
-0.0047 (-0.0115,0.002)
-0.0003 (-0.0041,0.0034)
SNP*age 0.0002 (-0.0017,0.002)
0.0005 (-0.0006,0.0015)
0.0006 (-0.0008,0.002)
0 (-0.0008,0.0007)
-0.0006 (-0.0017,0.0006)
0.0004 (-0.0003,0.0011)
-0.0004 (-0.0016,0.0007)
-0.0006 (-0.0012,0.0001)
SNP*age2 0.0001 (-0.00008,0.00027)
0.00001 (-0.00009,0.00011)
-0.00001 (-0.00014,0.00012)
0.00004 (-0.00004,0.00012)
0.00008 (-0.00003,0.0002)
0.00001 (-0.00005,0.00007)
0.00004 (-0.00006,0.00015)
-0.00003 (-0.0001,0.00003)
SNP*age3 0.000025 (-0.00001,0.00006)
0.000002 (-0.00002,0.00002)
-0.000007 (-0.00003,0.00002)
-0.000011 (-0.00003,0.000003)
0.000009 (-0.00001,0.00003)
-0.000002 (-0.00002,0.00001)
0.00001 (-0.00001,0.00003)
0.000013 (0.000001,0.000026)
Variance greater at adiposity rebound
SNP -0.0007 (-0.0114,0.0099)
0.0105 (0.0046,0.0163)
0.0069 (-0.0012,0.0149)
-0.0031 (-0.0076,0.0014)
0.0011 (-0.0061,0.0082)
-0.0002 (-0.0041,0.0038)
-0.0015 (-0.008,0.0051)
-0.0013 (-0.0049,0.0024)
SNP*age -0.0003 (-0.0021,0.0014)
0.0009 (-0.0002,0.0019)
0.0008 (-0.0005,0.0021)
-0.0008 (-0.0016,-0.0001)
-0.0002 (-0.0014,0.001)
0 (-0.0006,0.0007)
-0.0002 (-0.0014,0.0009)
-0.0003 (-0.001,0.0003)
SNP*age2 0.0001 (-0.00007,0.00027)
-0.00002 (-0.00012,0.00008)
-0.00002 (-0.00015,0.0001)
-0.00004 (-0.00011,0.00003)
-0.00013 (-0.00024,-0.00001)
-0.00002 (-0.00008,0.00005)
0.00007 (-0.00003,0.00017)
-0.00004 (-0.0001,0.00002)
SNP*age3 -0.000002 (-0.00004,0.00003)
-0.000001 (-0.00002,0.000018)
-0.000006 (-0.00003,0.00002)
0.000007 (-0.00001,0.00002)
0.000002 (-0.00002,0.00002)
-0.00001 (-0.00002,0.000002)
0.000005 (-0.00002,0.00003)
0.000003 (-0.000008,0.000015)
Variance increasing over time
SNP -0.0006 (-0.011,0.0098)
0.0013 (-0.0049,0.0075)
0.0023 (-0.0056,0.0102)
0.0007 (-0.0038,0.0053)
-0.0015 (-0.0086,0.0056)
-0.0023 (-0.0063,0.0016)
-0.0026 (-0.0088,0.0037)
-0.0004 (-0.0041,0.0033)
SNP*age 0.0004 (-0.0015,0.0023)
-0.0001 (-0.0012,0.001)
0 (-0.0015,0.0014)
0.0001 (-0.0008,0.0009)
-0.0008 (-0.0021,0.0005)
-0.0002 (-0.001,0.0005)
-0.0001 (-0.0013,0.0011)
-0.0006 (-0.0013,0.0001)
SNP*age2 -0.00007 (-0.00028,0.00014)
0 (-0.00013,0.00013)
0.00012 (-0.00005,0.00028)
-0.00005 (-0.00015,0.00004)
-0.00004 (-0.00019,0.00011)
0.00003 (-0.00005,0.00012)
-0.00012 (-0.00025,0.00002)
-0.00002 (-0.0001,0.00006)
SNP*age3 -0.000005 (-0.00005,0.00004)
0.000002 (-0.00002,0.00003)
0.000037 (0.000004,0.000071)
-0.000004 (-0.00002,0.00002)
0.000011 (-0.00002,0.00004)
0.000005 (-0.00001,0.00002)
-0.000012 (-0.00004,0.00002)
0.000006 (-0.00001,0.00002)
Table 9: Bias and 95% confidence interval for the unbalanced design with more samples around the adiposity rebound; bold and underlined cells are those whose
confidence interval does not cover zero based on 1,000 simulations.
MAF 0.1 0.2 0.3 0.4
Sample Size N=1,000 N=3,000 N=1,000 N=3,000 N=1,000 N=3,000 N=1,000 N=3,000
Gaussian Distribution
SNP 0.0019 (-0.0088,0.0126)
0.0009 (-0.0051,0.0069)
-0.0067 (-0.0143,0.0009)
-0.0031 (-0.0077,0.0014)
0.0039 (-0.0027,0.0105)
0.0041 (0,0.0082)
0.0035 (-0.0026,0.0095)
0.0012 (-0.0024,0.0049)
SNP*age -0.0012 (-0.003,0.0005)
-0.0004 (-0.0014,0.0006)
-0.0008 (-0.0021,0.0005)
-0.0007 (-0.0015,0)
0.0006 (-0.0005,0.0017)
0.0006 (-0.0001,0.0012)
0.001 (0,0.002)
0.0004 (-0.0002,0.001)
SNP*age2 0 (-0.00016,0.00017)
-0.00009 (-0.00019,0.00001)
-0.00001 (-0.00013,0.00012)
0.00002 (-0.00005,0.00009)
-0.00005 (-0.00015,0.00006)
-0.00003 (-0.0001,0.00003)
0.00003 (-0.00008,0.00013)
-0.00001 (-0.00007,0.00005)
SNP*age3 0.000031 (-0.000002,0.00006)
0.000006 (-0.00001,0.00002)
0.000008 (-0.00002,0.00003)
0.000003 (-0.00001,0.00002)
-0.000004 (-0.00003,0.00002)
-0.000007 (-0.00002,0.00001)
-0.000008 (-0.00003,0.00001)
0.000001 (-0.00001,0.000012)
t-distribution
SNP -0.0062 (-0.0165,0.0041)
-0.0024 (-0.0083,0.0036)
-0.0023 (-0.0104,0.0058)
0.0014 (-0.0032,0.006)
0.0024 (-0.0045,0.0093)
0.0007 (-0.0034,0.0047)
0.0047 (-0.0017,0.0111) 0.0018 (-0.002,0.0056)
SNP*age 0.0004 (-0.0016,0.0023) 0.0011 (0,0.0022)
0.0001 (-0.0015,0.0016)
0.0003 (-0.0006,0.0011)
-0.0001 (-0.0014,0.0012)
0.0002 (-0.0006,0.0009)
0.0003 (-0.0009,0.0016) 0.0003 (-0.0004,0.001)
SNP*age2 0.00004 (-0.00015,0.00023)
0.00009 (-0.00002,0.0002)
-0.0001 (-0.00024,0.00004)
0.00002 (-0.00006,0.0001)
0.00006 (-0.00006,0.00019)
0.00002 (-0.00006,0.00009)
-0.00003 (-0.00015,0.00008)
0.00002 (-0.00005,0.00009)
SNP*age3 -0.000016 (-0.00006,0.00002)
-0.000021 (-0.00004,0.000002)
-0.000017 (-0.00005,0.00001)
-0.000003 (-0.00002,0.00002)
0.000012 (-0.00001,0.00004)
-0.000007 (-0.00002,0.00001)
-0.000002 (-0.00003,0.00002) 0 (-0.00002,0.00002)
Skew-normal Distribution
SNP 0.0076 (-0.0026,0.0177)
0.0056 (-0.0004,0.0116)
-0.0027 (-0.0101,0.0048)
0.0038 (-0.0005,0.0082)
-0.0059 (-0.0124,0.0006)
0.0028 (-0.001,0.0066)
-0.0002 (-0.0063,0.0059)
-0.0007 (-0.0042,0.0027)
SNP*age 0.0009 (-0.0008,0.0026)
0.0009 (-0.0001,0.002) 0 (-0.0012,0.0013)
0.0003 (-0.0004,0.001)
-0.0011 (-0.0022,0.0001)
0.0005 (-0.0002,0.0012)
0.0001 (-0.0009,0.0011)
0.0002 (-0.0004,0.0007)
SNP*age2 -0.00011 (-0.00027,0.00006)
-0.00001 (-0.00011,0.00009)
-0.00002 (-0.00014,0.00011)
0.00001 (-0.00006,0.00008)
-0.00001 (-0.00012,0.00009)
-0.00005 (-0.00011,0.00001)
0.00009 (-0.00002,0.00019)
-0.00001 (-0.00007,0.00005)
SNP*age3 0 (-0.00003,0.00003)
-0.000011 (-0.00003,0.00001)
0.00001 (-0.00001,0.00003)
0.000008 (-0.00001,0.00002)
0.000001 (-0.00002,0.00002)
-0.000011 (-0.00002,0)
0.000008 (-0.00001,0.00003)
0.000002 (-0.00001,0.00001)
Mixture of 2 Gaussian Distributions
SNP -0.0037 (-0.0142,0.0067)
-0.002 (-0.0079,0.0038)
0.0053 (-0.0023,0.013)
-0.0021 (-0.0064,0.0022)
-0.0001 (-0.0068,0.0066)
-0.0037 (-0.0075,0.0002)
-0.0015 (-0.0078,0.0048) 0.0014 (-0.0022,0.005)
SNP*age -0.0004 (-0.0018,0.0011)
-0.0002 (-0.001,0.0006)
0.0007 (-0.0004,0.0018)
-0.0005 (-0.0011,0.0001)
0.0003 (-0.0007,0.0012)
-0.0003 (-0.0009,0.0002)
-0.0008 (-0.0017,0)
0.0003 (-0.0002,0.0008)
SNP*age2 0.00008 (-0.00005,0.00022)
0.00001 (-0.00007,0.00009)
0.00001 (-0.00009,0.00011)
0.00002 (-0.00004,0.00008)
-0.00006 (-0.00014,0.00003)
0.00003 (-0.00002,0.00008) 0.00008 (0,0.00016)
-0.00003 (-0.00008,0.00002)
SNP*age3 0.000011 (-0.00001,0.00003)
0.000002 (-0.00001,0.00001)
-0.000003 (-0.00002,0.00001)
0.000005 (-0.000002,0.00001)
-0.000012 (-0.00002,0.000001)
0.000005 (-0.000002,0.00001)
0.000007 (-0.00001,0.00002)
0.000002 (-0.00001,0.00001)
Variance dependent on a covariate
SNP -0.0019 (-0.0124,0.0087)
0.0063 (0.0003,0.0123)
-0.0015 (-0.0093,0.0063)
0.0037 (-0.0007,0.0082)
-0.0006 (-0.0072,0.006)
-0.0016 (-0.0054,0.0023)
0.0009 (-0.0054,0.0072)
0.0025 (-0.0011,0.0061)
SNP*age -0.0006 (-0.0025,0.0013)
0.0004 (-0.0007,0.0014)
-0.0007 (-0.0021,0.0006)
0.0001 (-0.0006,0.0009)
0.0007 (-0.0006,0.0019)
0.0001 (-0.0006,0.0007)
-0.0008 (-0.002,0.0004)
0.0004 (-0.0002,0.0011)
SNP*age2 0.00021 (0.00004,0.00039) -0.0001 (-0.00021,0)
0.00016 (0.00002,0.00029)
-0.00007 (-0.00015,0.00001)
0.00002 (-0.0001,0.00013)
0.00001 (-0.00005,0.00008)
0.00003 (-0.00007,0.00014)
-0.00002 (-0.00008,0.00005)
SNP*age3 0.000014 (-0.00002,0.00005)
0.000002 (-0.00002,0.00002)
0.000007 (-0.00002,0.00003)
0 (-0.00002,0.00002)
-0.000003 (-0.00003,0.00002)
-0.000001 (-0.00002,0.00001)
0.000015 (-0.00002,0.00004)
-0.000008 (-0.00002,0.000004)
Variance greater at adiposity rebound
SNP -0.0024 (-0.0128,0.008)
-0.0024 (-0.0083,0.0035)
0.0025 (-0.0055,0.0105)
-0.0007 (-0.0054,0.004) 0.0033 (-0.0034,0.01)
-0.0009 (-0.0045,0.0028)
0.0003 (-0.0057,0.0063) -0.0005 (-0.004,0.003)
SNP*age 0.0009 (-0.001,0.0027)
-0.0005 (-0.0016,0.0005)
0.0004 (-0.001,0.0018)
0.0002 (-0.0006,0.0009)
0.0007 (-0.0005,0.0019)
-0.0005 (-0.0012,0.0001)
0.0005 (-0.0006,0.0016)
0.0005 (-0.0002,0.0011)
SNP*age2 -0.00003 (-0.00021,0.00014)
0.00008 (-0.00003,0.00018)
-0.00002 (-0.00015,0.0001)
0.00005 (-0.00003,0.00013)
-0.00009 (-0.0002,0.00002)
0.00001 (-0.00005,0.00007)
0.00005 (-0.00005,0.00016) 0 (-0.00006,0.00006)
SNP*age3 -0.000032 (-0.00007,0.000002)
0.000016 (-0.000004,0.00004)
0.000001 (-0.00003,0.00003)
-0.000003 (-0.00002,0.00001)
-0.000016 (-0.00004,0.00001)
0.000007 (-0.000005,0.00002)
0 (-0.00002,0.00002)
-0.00001 (-0.00002,0.000002)
Variance increasing over time
SNP -0.0019 (-0.0121,0.0083)
-0.0037 (-0.0099,0.0026)
0.0013 (-0.0066,0.0093)
0.0001 (-0.0043,0.0046)
0.0056 (-0.0013,0.0125)
0.0004 (-0.0036,0.0044)
-0.0016 (-0.0081,0.0049)
-0.0015 (-0.0053,0.0022)
SNP*age -0.0002 (-0.0022,0.0018) -0.0012 (-0.0023,0)
-0.0002 (-0.0017,0.0013)
0.0006 (-0.0002,0.0015)
0.0001 (-0.0012,0.0015)
-0.0002 (-0.001,0.0005)
-0.0004 (-0.0017,0.0008)
-0.0003 (-0.001,0.0004)
SNP*age2 -0.00001 (-0.00024,0.00021)
0.00003 (-0.0001,0.00016)
-0.00003 (-0.0002,0.00015)
0.00003 (-0.00006,0.00013)
-0.00002 (-0.00016,0.00012)
0.00006 (-0.00002,0.00014)
-0.00001 (-0.00015,0.00013)
-0.00001 (-0.00009,0.00007)
SNP*age3 -0.000001 (-0.00005,0.00005)
0.000018 (-0.00001,0.00005)
-0.000004 (-0.00004,0.00003)
-0.000007 (-0.00003,0.00001)
0.000008 (-0.00002,0.00004)
0.000005 (-0.00001,0.00002)
0.000003 (-0.00002,0.00003)
-0.000002 (-0.00002,0.00001)
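As an illustration of how a bias estimate and its 95% confidence interval, of the kind reported in these tables, could be obtained from a set of simulation replicates, the following Python sketch may be useful. It assumes that bias is the mean of the estimates minus the true value and that the interval is a normal approximation based on the Monte Carlo standard error; this appendix does not state the exact calculation, so both are assumptions. The function name `bias_with_ci` and the simulated estimates are hypothetical.

```python
import math
import random
import statistics


def bias_with_ci(estimates, true_value, z=1.96):
    """Return (bias, lower, upper) using a normal-approximation interval.

    Assumption: bias = mean(estimates) - true_value, with the interval
    built from the Monte Carlo standard error of the mean.
    """
    n = len(estimates)
    bias = statistics.fmean(estimates) - true_value
    se = statistics.stdev(estimates) / math.sqrt(n)
    return bias, bias - z * se, bias + z * se


# Hypothetical example: 1,000 simulated estimates of a SNP effect
# whose true value is taken to be 0.6 (illustrative values only).
rng = random.Random(2013)
true_beta = 0.6
estimates = [rng.gauss(true_beta, 0.1) for _ in range(1000)]

bias, lower, upper = bias_with_ci(estimates, true_beta)
covers_zero = lower <= 0.0 <= upper
print(f"bias={bias:.4f} CI=({lower:.4f},{upper:.4f}) covers zero: {covers_zero}")
```

Under this reading, a cell would be flagged when `covers_zero` is false. With 56 cells per table, a few intervals excluding zero would be expected by chance alone at the 95% level.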
Table 10: Bias and 95% confidence interval for the unbalanced design with fewer samples around the adiposity rebound; bold and underlined cells are those whose
confidence interval does not cover zero based on 1,000 simulations.
MAF 0.1 0.2 0.3 0.4
Sample Size N=1,000 N=3,000 N=1,000 N=3,000 N=1,000 N=3,000 N=1,000 N=3,000
Gaussian Distribution
SNP 0.0042 (-0.006,0.0144)
0.0027 (-0.0033,0.0087)
-0.0006 (-0.0083,0.007) 0 (-0.0046,0.0045) 0.0034 (-0.0032,0.01)
-0.003 (-0.0067,0.0008)
0.001 (-0.0052,0.0073) 0 (-0.0036,0.0037)
SNP*age -0.0008 (-0.0025,0.001)
0.0011 (0.0001,0.0021)
0.0001 (-0.0011,0.0014)
0.0001 (-0.0006,0.0009)
-0.0003 (-0.0014,0.0008)
-0.0001 (-0.0007,0.0005)
-0.0001 (-0.0011,0.0009)
0.0005 (-0.0001,0.0012)
SNP*age2 -0.00015 (-0.00032,0.00001) 0 (-0.00009,0.00009)
-0.00001 (-0.00013,0.00011)
0.00004 (-0.00003,0.00011)
-0.00009 (-0.00019,0.00001) 0 (-0.00006,0.00006) 0 (-0.0001,0.0001)
0.00002 (-0.00004,0.00007)
SNP*age3 0.000023 (-0.00001,0.00005)
-0.00002 (-0.00004,-0.000003)
-0.000007 (-0.00003,0.00002)
0.000008 (-0.00001,0.00002)
0.000009 (-0.00001,0.00003)
-0.000005 (-0.00002,0.00001)
-0.000009 (-0.00003,0.00001)
-0.000006 (-0.00002,0.00001)
t-distribution
SNP 0.0103 (-0.0004,0.0209)
0.0094 (0.0033,0.0156)
0.0026 (-0.0054,0.0105)
0.0007 (-0.004,0.0055)
0.003 (-0.0036,0.0096)
0.0025 (-0.0015,0.0065)
-0.0009 (-0.0073,0.0055)
0.0016 (-0.0021,0.0054)
SNP*age 0.0021 (0.0001,0.0041)
0.0006 (-0.0006,0.0017)
0.001 (-0.0005,0.0024)
0.0001 (-0.0007,0.001)
0.0007 (-0.0006,0.002)
0.0008 (0.0001,0.0016)
-0.0006 (-0.0018,0.0005)
0.0002 (-0.0005,0.0008)
SNP*age2 0.00007 (-0.00011,0.00025)
-0.00014 (-0.00025,-0.00003)
-0.00006 (-0.00019,0.00008)
0.00002 (-0.00007,0.0001)
-0.00004 (-0.00016,0.00008)
-0.00004 (-0.00011,0.00004)
-0.00014 (-0.00026,-0.00002)
0.00002 (-0.00004,0.00009)
SNP*age3 -0.000006 (-0.00005,0.00003)
-0.000003 (-0.00003,0.00002)
-0.000013 (-0.00004,0.00002)
-0.000004 (-0.00002,0.00001)
-0.000016 (-0.00004,0.00001)
-0.000006 (-0.00002,0.00001)
0.000007 (-0.00002,0.00003)
0.000002 (-0.00001,0.00002)
Skew-normal Distribution
SNP -0.0021 (-0.0122,0.008)
-0.0018 (-0.0077,0.0041)
-0.0073 (-0.0148,0.0003)
-0.0053 (-0.0097,-0.001)
0.0048 (-0.0018,0.0114)
-0.0009 (-0.0047,0.003)
-0.0038 (-0.01,0.0024)
0.0012 (-0.0026,0.0049)
SNP*age -0.001 (-0.0027,0.0007)
-0.0005 (-0.0015,0.0004)
-0.0002 (-0.0015,0.0011)
-0.0003 (-0.001,0.0005)
0.0002 (-0.0009,0.0013)
-0.0001 (-0.0008,0.0006)
-0.0005 (-0.0016,0.0005) 0 (-0.0006,0.0006)
SNP*age2 -0.00013 (-0.00029,0.00003)
-0.00002 (-0.00012,0.00007)
0.00012 (-0.00001,0.00024) 0.00007 (0,0.00014)
-0.00003 (-0.00013,0.00008)
-0.00002 (-0.00008,0.00004)
0.00011 (0.00001,0.00021)
-0.00002 (-0.00007,0.00004)
SNP*age3 0.000004 (-0.00003,0.00004)
-0.000001 (-0.00002,0.00002)
-0.00002 (-0.00004,0.000003)
0 (-0.00001,0.00001)
0.000012 (-0.00001,0.00003)
-0.000002 (-0.000013,0.00001)
0.000003 (-0.00002,0.00002)
0.000003 (-0.00001,0.00001)
Mixture of 2 Gaussian Distributions
SNP 0.0034 (-0.0068,0.0136)
-0.0073 (-0.0132,-0.0014)
-0.0089 (-0.0165,-0.0012)
-0.0007 (-0.0051,0.0038)
-0.0027 (-0.0095,0.0041)
-0.0006 (-0.0046,0.0035)
0.0013 (-0.0052,0.0078)
-0.0001 (-0.0037,0.0036)
SNP*age 0.0014 (0,0.0027)
-0.0008 (-0.0016,0.0001)
-0.0006 (-0.0017,0.0004)
0 (-0.0006,0.0007)
-0.0005 (-0.0015,0.0004)
0.0003 (-0.0002,0.0009)
0 (-0.0009,0.0009)
0.0001 (-0.0003,0.0006)
SNP*age2 -0.00003 (-0.00015,0.0001)
0.00006 (-0.00002,0.00014)
-0.00002 (-0.00011,0.00008)
0.00002 (-0.00003,0.00008)
-0.00003 (-0.00012,0.00005) 0.00005 (0,0.00009) 0 (-0.00008,0.00008) 0 (-0.00005,0.00005)
SNP*age3 -0.000022 (-0.00004,-0.000005)
-0.000005 (-0.00002,0.00001)
0.000004 (-0.00001,0.00002)
-0.000003 (-0.00001,0.00001)
-0.000002 (-0.00001,0.00001)
0.000001 (-0.000005,0.000008)
0.000007 (-0.00001,0.00002)
-0.000001 (-0.00001,0.00001)
Variance dependent on a covariate
SNP 0.0013 (-0.009,0.0115)
-0.0026 (-0.0087,0.0035)
-0.0009 (-0.0089,0.0071)
0.0022 (-0.0024,0.0067)
-0.0056 (-0.0123,0.0011)
-0.0007 (-0.0045,0.0031)
-0.0037 (-0.0099,0.0025)
0.0003 (-0.0034,0.004)
SNP*age -0.0008 (-0.0026,0.001)
-0.0004 (-0.0014,0.0007)
-0.0011 (-0.0025,0.0003)
-0.0003 (-0.0011,0.0005)
0.0001 (-0.001,0.0013)
0.0002 (-0.0005,0.0009)
-0.0003 (-0.0014,0.0008) 0 (-0.0007,0.0006)
SNP*age2 -0.00003 (-0.0002,0.00014)
-0.00002 (-0.00012,0.00008)
-0.00004 (-0.00017,0.00009)
-0.00005 (-0.00013,0.00002)
0.00006 (-0.00006,0.00017)
0.00003 (-0.00004,0.00009)
-0.00003 (-0.00014,0.00007)
0.00002 (-0.00004,0.00008)
SNP*age3 0.000017 (-0.00002,0.00005)
-0.000003 (-0.00002,0.00002)
0.000024 (-0.000002,0.00005)
0.00001 (-0.00001,0.00002)
-0.00001 (-0.00003,0.00001)
-0.000006 (-0.00002,0.000007)
-0.000009 (-0.00003,0.00001)
0.000004 (-0.00001,0.00002)
Variance greater at adiposity rebound
SNP 0.0002 (-0.0101,0.0105)
-0.0001 (-0.006,0.0057)
0.0066 (-0.0007,0.0139)
-0.0007 (-0.0052,0.0039)
-0.0008 (-0.0076,0.0059)
0.0019 (-0.0022,0.006)
0.0008 (-0.0054,0.007)
0.0022 (-0.0014,0.0059)
SNP*age 0.0006 (-0.0011,0.0024)
-0.0008 (-0.0018,0.0002)
0.0007 (-0.0006,0.0019)
0.0003 (-0.0005,0.001)
-0.0003 (-0.0014,0.0008)
0.0002 (-0.0005,0.0008)
0.0005 (-0.0006,0.0016)
-0.0003 (-0.0009,0.0003)
SNP*age2 -0.00002 (-0.00018,0.00014)
-0.00006 (-0.00016,0.00004)
-0.00002 (-0.00015,0.00011) 0 (-0.00007,0.00007)
-0.00003 (-0.00014,0.00009)
-0.00001 (-0.00008,0.00005) 0 (-0.00011,0.0001) 0 (-0.00006,0.00007)
SNP*age3 -0.000024 (-0.00006,0.00001)
0.00001 (-0.00001,0.00003)
-0.000006 (-0.00003,0.00002)
-0.000005 (-0.00002,0.00001)
0.00001 (-0.000011,0.00003)
0.000008 (-0.000003,0.00002)
-0.000008 (-0.00003,0.00001)
0.000009 (-0.000002,0.00002)
Variance increasing over time
SNP 0.0016 (-0.0087,0.0119)
-0.0013 (-0.0073,0.0047)
0.0041 (-0.0034,0.0116)
0.002 (-0.0024,0.0065)
-0.0024 (-0.0092,0.0044)
-0.0011 (-0.0051,0.0028) 0.0037 (-0.0026,0.01)
0.0011 (-0.0025,0.0047)
SNP*age 0.0008 (-0.0011,0.0028)
-0.0002 (-0.0013,0.0009)
0.0005 (-0.0009,0.0019)
0.0003 (-0.0006,0.0011) 0 (-0.0012,0.0013)
-0.0004 (-0.0011,0.0004)
0.0003 (-0.0009,0.0014) 0 (-0.0007,0.0007)
SNP*age2 0.00012 (-0.00009,0.00032)
-0.00002 (-0.00014,0.00011)
0.00002 (-0.00015,0.00018)
-0.00003 (-0.00012,0.00006)
-0.00013 (-0.00026,0.00001)
-0.00004 (-0.00012,0.00004)
0.00008 (-0.00005,0.00022)
-0.00005 (-0.00013,0.00002)
SNP*age3 0.000007 (-0.00004,0.00005)
-0.000005 (-0.00003,0.00002)
0.000006 (-0.00003,0.00004)
-0.000001 (-0.00002,0.00002)
-0.000038 (-0.00007,-0.00001)
0.000007 (-0.00001,0.00002)
0.00002 (-0.00001,0.00005)
-0.000009 (-0.00002,0.00001)
Table 11: Type one error for sparse complete design; bold and underlined cells are those that are
significantly different from the nominal α=0.05 based on 5,000 simulations.
MAF 0.1000 0.2000 0.3000 0.4000 Sample Size N=1,000 N=3,000 N=1,000 N=3,000 N=1,000 N=3,000 N=1,000 N=3,000
Gaussian Distribution SNP 0.0526 0.0494 0.0530 0.0466 0.0492 0.0530 0.0508 0.0544 SNP*age 0.0494 0.0458 0.0498 0.0516 0.0486 0.0466 0.0452 0.0490 SNP*age2 0.0554 0.0482 0.0510 0.0424 0.0498 0.0562 0.0526 0.0480 SNP*age3 0.0522 0.0470 0.0514 0.0478 0.0516 0.0478 0.0498 0.0516
t-distribution SNP 0.0464 0.0520 0.0478 0.0474 0.0482 0.0494 0.0554 0.0468 SNP*age 0.0500 0.0450 0.0450 0.0520 0.0566 0.0480 0.0568 0.0498 SNP*age2 0.0518 0.0492 0.0498 0.0554 0.0492 0.0566 0.0548 0.0486 SNP*age3 0.0502 0.0464 0.0492 0.0524 0.0540 0.0424 0.0562 0.0500
Skew-normal Distribution SNP 0.0496 0.0532 0.0464 0.0520 0.0500 0.0566 0.0548 0.0478 SNP*age 0.0508 0.0522 0.0460 0.0428 0.0496 0.0488 0.0546 0.0404 SNP*age2 0.0408 0.0462 0.0540 0.0438 0.0478 0.0452 0.0478 0.0474 SNP*age3 0.0478 0.0502 0.0470 0.0486 0.0508 0.0486 0.0498 0.0508
Mixture of 2 Gaussian Distributions SNP 0.0512 0.0472 0.0508 0.0474 0.0450 0.0480 0.0522 0.0490 SNP*age 0.0510 0.0478 0.0516 0.0498 0.0476 0.0528 0.0504 0.0464 SNP*age2 0.0498 0.0546 0.0482 0.0526 0.0520 0.0476 0.0494 0.0494 SNP*age3 0.0500 0.0552 0.0508 0.0524 0.0458 0.0514 0.0504 0.0490
Variance dependent on a covariate SNP 0.0524 0.0508 0.0490 0.0492 0.0522 0.0506 0.0554 0.0446 SNP*age 0.0528 0.0502 0.0504 0.0552 0.0588 0.0526 0.0562 0.0544 SNP*age2 0.0528 0.0442 0.0554 0.0510 0.0498 0.0488 0.0522 0.0488 SNP*age3 0.0502 0.0456 0.0454 0.0538 0.0498 0.0446 0.0544 0.0540
Variance greater at adiposity rebound SNP 0.0498 0.0510 0.0496 0.0478 0.0442 0.0502 0.0450 0.0554 SNP*age 0.0530 0.0576 0.0554 0.0572 0.0494 0.0546 0.0532 0.0584 SNP*age2 0.0504 0.0468 0.0466 0.0514 0.0546 0.0484 0.0524 0.0558 SNP*age3 0.0548 0.0558 0.0536 0.0524 0.0586 0.0552 0.0536 0.0496
Variance increasing over time SNP 0.0502 0.0466 0.0502 0.0490 0.0542 0.0472 0.0546 0.0456 SNP*age 0.0590 0.0514 0.0554 0.0524 0.0562 0.0536 0.0548 0.0512 SNP*age2 0.0524 0.0428 0.0464 0.0446 0.0448 0.0468 0.0358 0.0524 SNP*age3 0.1138 0.1106 0.1084 0.1072 0.1118 0.1124 0.1096 0.1098
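To give a sense of how far an empirical type 1 error rate can drift from the nominal α=0.05 through Monte Carlo error alone, the sketch below computes an approximate 95% band around α for 5,000 simulations. It assumes a normal approximation to the binomial proportion; the test actually used to flag cells in these tables is not stated in this appendix, so the band is only indicative. The function names are hypothetical.

```python
import math


def monte_carlo_bounds(alpha=0.05, n_sims=5000, z=1.96):
    """Approximate 95% band for an empirical rejection rate under the null.

    Assumption: normal approximation to a binomial proportion.
    """
    se = math.sqrt(alpha * (1 - alpha) / n_sims)
    return alpha - z * se, alpha + z * se


def outside_band(rate, alpha=0.05, n_sims=5000):
    """True when an observed rate falls outside the approximate band."""
    lower, upper = monte_carlo_bounds(alpha, n_sims)
    return rate < lower or rate > upper


lower, upper = monte_carlo_bounds()
print(f"band for 5,000 simulations: ({lower:.4f}, {upper:.4f})")

# Two values taken from Table 11 for illustration.
for rate in (0.0526, 0.1138):
    print(rate, "outside band:", outside_band(rate))
```

With 5,000 simulations the band is roughly 0.044 to 0.056, so the SNP*age3 rates of about 0.11 in the "Variance increasing over time" row sit far outside it.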
Table 12: Type one error for intense complete design; bold and underlined cells are those that are
significantly different from the nominal α=0.05 based on 5,000 simulations.
MAF 0.1000 0.2000 0.3000 0.4000 Sample Size N=1,000 N=3,000 N=1,000 N=3,000 N=1,000 N=3,000 N=1,000 N=3,000
Gaussian Distribution SNP 0.0498 0.0522 0.0522 0.0522 0.0472 0.0496 0.0514 0.0472 SNP*age 0.0552 0.0538 0.0552 0.0538 0.0572 0.0516 0.0520 0.0566 SNP*age2 0.0504 0.0518 0.0524 0.0518 0.0548 0.0464 0.0538 0.0506 SNP*age3 0.0748 0.0756 0.0750 0.0756 0.0722 0.0676 0.0708 0.0738
t-distribution SNP 0.0460 0.0502 0.0530 0.0504 0.0484 0.0476 0.0440 0.0448 SNP*age 0.0584 0.0542 0.0572 0.0590 0.0584 0.0610 0.0584 0.0508 SNP*age2 0.0486 0.0444 0.0488 0.0510 0.0458 0.0520 0.0484 0.0526 SNP*age3 0.0720 0.0712 0.0740 0.0704 0.0726 0.0764 0.0698 0.0720
Skew-normal Distribution SNP 0.0488 0.0516 0.0532 0.0528 0.0552 0.0502 0.0462 0.0554 SNP*age 0.0540 0.0534 0.0564 0.0478 0.0528 0.0538 0.0532 0.0564 SNP*age2 0.0530 0.0458 0.0498 0.0498 0.0512 0.0566 0.0488 0.0478 SNP*age3 0.0748 0.0738 0.0714 0.0684 0.0710 0.0668 0.0708 0.0680
Mixture of 2 Gaussian Distributions SNP 0.0540 0.0532 0.0426 0.0494 0.0500 0.0472 0.0474 0.0542 SNP*age 0.0590 0.0544 0.0476 0.0528 0.0514 0.0484 0.0532 0.0558 SNP*age2 0.0486 0.0504 0.0452 0.0510 0.0528 0.0476 0.0540 0.0468 SNP*age3 0.0678 0.0824 0.0792 0.0676 0.0672 0.0704 0.0748 0.0716
Variance dependent on a covariate SNP 0.0500 0.0440 0.0428 0.0492 0.0510 0.0446 0.0500 0.0456 SNP*age 0.0508 0.0488 0.0496 0.0532 0.0516 0.0568 0.0558 0.0508 SNP*age2 0.0462 0.0460 0.0438 0.0514 0.0500 0.0432 0.0482 0.0480 SNP*age3 0.0564 0.0640 0.0624 0.0626 0.0630 0.0546 0.0598 0.0670
Variance greater at adiposity rebound SNP 0.0486 0.0440 0.0452 0.0504 0.0494 0.0516 0.0474 0.0424 SNP*age 0.0550 0.0486 0.0546 0.0524 0.0454 0.0494 0.0502 0.0458 SNP*age2 0.0550 0.0460 0.0496 0.0488 0.0486 0.0482 0.0448 0.0480 SNP*age3 0.0468 0.0478 0.0476 0.0490 0.0436 0.0480 0.0514 0.0454
Variance increasing over time SNP 0.0526 0.0530 0.0510 0.0592 0.0548 0.0568 0.0586 0.0552 SNP*age 0.0742 0.0740 0.0742 0.0822 0.0730 0.0702 0.0768 0.0720 SNP*age2 0.0656 0.0728 0.0628 0.0740 0.0670 0.0626 0.7020 0.0652 SNP*age3 0.2322 0.2460 0.2396 0.2516 0.2366 0.2384 0.2470 0.2338
Table 13: Type one error for equal unbalanced design; bold and underlined cells are those that are
significantly different from the nominal α=0.05 based on 5,000 simulations.
MAF 0.1000 0.2000 0.3000 0.4000 Sample Size N=1,000 N=3,000 N=1,000 N=3,000 N=1,000 N=3,000 N=1,000 N=3,000
Gaussian Distribution SNP 0.0536 0.0476 0.0518 0.0536 0.0486 0.0518 0.0530 0.0470 SNP*age 0.0568 0.0638 0.0580 0.0610 0.0610 0.0572 0.0564 0.0548 SNP*age2 0.0518 0.0484 0.0516 0.0462 0.0496 0.0506 0.0524 0.0470 SNP*age3 0.0724 0.0732 0.0684 0.0738 0.0768 0.0666 0.0722 0.0726
t-distribution SNP 0.0514 0.0482 0.0500 0.0472 0.0470 0.0550 0.0556 0.0458 SNP*age 0.0548 0.0600 0.0552 0.0666 0.0588 0.0642 0.0594 0.0606 SNP*age2 0.0504 0.0506 0.0522 0.0492 0.0530 0.0504 0.0490 0.0486 SNP*age3 0.0684 0.0784 0.0680 0.0696 0.0742 0.0676 0.0736 0.0704
Skew-normal Distribution SNP 0.0510 0.0528 0.0494 0.0486 0.0492 0.0508 0.0476 0.0456 SNP*age 0.0552 0.0612 0.0550 0.0570 0.0556 0.0594 0.0534 0.0580 SNP*age2 0.0480 0.0484 0.0454 0.0496 0.0528 0.0450 0.0502 0.0532 SNP*age3 0.0812 0.0750 0.0710 0.0740 0.0696 0.0704 0.0764 0.0712
Mixture of 2 Gaussian Distributions SNP 0.0488 0.0484 0.0512 0.0476 0.0556 0.0500 0.0520 0.0500 SNP*age 0.0544 0.0490 0.0534 0.0478 0.0534 0.0490 0.0522 0.0488 SNP*age2 0.0472 0.0524 0.0500 0.0478 0.0514 0.0474 0.0486 0.0496 SNP*age3 0.0704 0.0674 0.0682 0.0712 0.0692 0.0664 0.0732 0.0668
Variance dependent on a covariate SNP 0.0494 0.0506 0.0480 0.0496 0.0496 0.0456 0.0508 0.0470 SNP*age 0.0570 0.0592 0.0580 0.0574 0.0558 0.0624 0.0636 0.0636 SNP*age2 0.0490 0.0552 0.0496 0.0484 0.0488 0.0488 0.0462 0.0542 SNP*age3 0.0646 0.0728 0.0740 0.0702 0.0704 0.0702 0.0706 0.0708
Variance greater at adiposity rebound SNP 0.0502 0.0502 0.0514 0.0504 0.0458 0.0510 0.0496 0.0452 SNP*age 0.0554 0.0580 0.0602 0.0580 0.0604 0.0552 0.0520 0.0538 SNP*age2 0.0516 0.0522 0.0448 0.0488 0.0554 0.0516 0.0538 0.0542 SNP*age3 0.0666 0.0622 0.0670 0.0588 0.0632 0.0626 0.0578 0.0732
Variance increasing over time SNP 0.0546 0.0494 0.0524 0.0492 0.0568 0.0472 0.0492 0.0540 SNP*age 0.0570 0.0550 0.0536 0.0574 0.0558 0.0542 0.0550 0.0616 SNP*age2 0.0526 0.0524 0.0628 0.0616 0.0578 0.0580 0.0584 0.0582 SNP*age3 0.1152 0.1278 0.1186 0.1194 0.1224 0.1124 0.1212 0.1238
Table 14: Type one error for the design with more samples around the adiposity rebound; bold and
underlined cells are those that are significantly different from the nominal α=0.05 based on 5,000
simulations.
MAF 0.1 0.2 0.3 0.4 Sample Size N=1,000 N=3,000 N=1,000 N=3,000 N=1,000 N=3,000 N=1,000 N=3,000
Gaussian Distribution SNP 0.0504 0.0472 0.0486 0.0442 0.0520 0.0480 0.0500 0.0516 SNP*age 0.0574 0.0520 0.0564 0.0536 0.0568 0.0594 0.0558 0.0574 SNP*age2 0.0502 0.0544 0.0498 0.0508 0.0472 0.0516 0.0526 0.0454 SNP*age3 0.0702 0.0734 0.0730 0.0748 0.0740 0.0716 0.0706 0.0684
t-distribution SNP 0.0502 0.0514 0.0444 0.0482 0.0492 0.0474 0.0500 0.0508 SNP*age 0.0582 0.0620 0.0560 0.0564 0.0598 0.0586 0.0642 0.0514 SNP*age2 0.0510 0.0468 0.0512 0.0484 0.0514 0.0464 0.0514 0.0528 SNP*age3 0.0726 0.0730 0.0690 0.0754 0.0794 0.0706 0.0736 0.0678
Skew-normal Distribution SNP 0.0474 0.0490 0.0512 0.0426 0.0404 0.0512 0.0602 0.0464 SNP*age 0.0594 0.0584 0.0596 0.0546 0.0558 0.0568 0.0570 0.0586 SNP*age2 0.0574 0.0526 0.0520 0.0466 0.0458 0.0520 0.0548 0.0474 SNP*age3 0.0720 0.0698 0.0704 0.0692 0.0736 0.0766 0.0718 0.0712
Mixture of 2 Gaussian Distributions SNP 0.0522 0.0538 0.0504 0.0542 0.0508 0.0488 0.0486 0.0506 SNP*age 0.0502 0.0572 0.0516 0.0524 0.0538 0.0504 0.0482 0.0552 SNP*age2 0.0536 0.0486 0.0480 0.0478 0.0442 0.0482 0.0570 0.0498 SNP*age3 0.0668 0.0696 0.0642 0.0724 0.0688 0.0650 0.0754 0.0706
Variance dependent on a covariate SNP 0.0530 0.0516 0.0540 0.0512 0.0466 0.0480 0.0456 0.0514 SNP*age 0.0582 0.0658 0.0616 0.0540 0.0578 0.0548 0.0528 0.0604 SNP*age2 0.0500 0.0486 0.0488 0.0476 0.0470 0.0472 0.0476 0.0438 SNP*age3 0.0694 0.0676 0.0662 0.0704 0.0646 0.0702 0.0690 0.0708
Variance greater at adiposity rebound SNP 0.0482 0.0498 0.0502 0.0502 0.0490 0.0562 0.0468 0.0500 SNP*age 0.0506 0.0574 0.0548 0.0580 0.0558 0.0578 0.0570 0.0520 SNP*age2 0.0464 0.0456 0.0524 0.0464 0.0530 0.0466 0.0498 0.0466 SNP*age3 0.0598 0.0604 0.0612 0.0664 0.0608 0.0588 0.0632 0.0542
Variance increasing over time SNP 0.0538 0.0538 0.0594 0.0518 0.0590 0.0572 0.0534 0.0492 SNP*age 0.0640 0.0568 0.0616 0.0614 0.0656 0.0646 0.0660 0.0612 SNP*age2 0.0602 0.0662 0.0658 0.0582 0.0710 0.0692 0.0658 0.0624 SNP*age3 0.1360 0.1366 0.1398 0.1322 0.1402 0.1430 0.1396 0.1368
Table 15: Type one error for the design with fewer samples around the adiposity rebound; bold and
underlined cells are those that are significantly different from the nominal α=0.05 based on 5,000
simulations.
MAF 0.1 0.2 0.3 0.4 Sample Size N=1,000 N=3,000 N=1,000 N=3,000 N=1,000 N=3,000 N=1,000 N=3,000
Gaussian Distribution SNP 0.0580 0.0542 0.0442 0.0578 0.0560 0.0506 0.0534 0.0534 SNP*age 0.0556 0.0552 0.0548 0.0664 0.0562 0.0530 0.0574 0.0552 SNP*age2 0.0524 0.0492 0.0434 0.0506 0.0568 0.0542 0.0502 0.0516 SNP*age3 0.0690 0.0734 0.0702 0.0690 0.0786 0.0642 0.0706 0.0772
t-distribution SNP 0.0498 0.0532 0.0432 0.0524 0.0518 0.0498 0.0498 0.0510 SNP*age 0.0586 0.0626 0.0588 0.0560 0.0538 0.0532 0.0540 0.0590 SNP*age2 0.0496 0.0536 0.0544 0.0598 0.0486 0.0532 0.0562 0.0484 SNP*age3 0.0686 0.0764 0.0708 0.0668 0.0674 0.0666 0.0712 0.0696
Skew-normal Distribution SNP 0.0508 0.0498 0.0528 0.0474 0.0502 0.0462 0.0510 0.0494 SNP*age 0.0612 0.0550 0.0556 0.0522 0.0608 0.0558 0.0596 0.0556 SNP*age2 0.0480 0.0444 0.0558 0.0492 0.0482 0.0524 0.0522 0.0474 SNP*age3 0.0768 0.0734 0.0728 0.0732 0.0684 0.0686 0.0738 0.0738
Mixture of 2 Gaussian Distributions SNP 0.0480 0.0474 0.0590 0.0476 0.0492 0.0492 0.0476 0.0504 SNP*age 0.0570 0.0528 0.0544 0.0556 0.0530 0.0550 0.0528 0.0570 SNP*age2 0.0506 0.0534 0.0492 0.0460 0.0518 0.0472 0.0506 0.0502 SNP*age3 0.0730 0.0690 0.0706 0.0650 0.0768 0.0712 0.0678 0.0718
Variance dependent on a covariate SNP 0.0566 0.0484 0.0488 0.0516 0.0514 0.0534 0.0478 0.0576 SNP*age 0.0642 0.0554 0.0574 0.0616 0.0630 0.0606 0.0572 0.0638 SNP*age2 0.0524 0.0548 0.0512 0.0516 0.0508 0.0508 0.0556 0.0534 SNP*age3 0.0734 0.0714 0.0764 0.0772 0.0812 0.0784 0.0738 0.0796
Variance greater at adiposity rebound SNP 0.0510 0.0528 0.0518 0.0480 0.0532 0.0540 0.0462 0.0434 SNP*age 0.0604 0.0632 0.0576 0.0544 0.0530 0.0622 0.0532 0.0600 SNP*age2 0.0548 0.0488 0.0522 0.0524 0.0554 0.0516 0.0514 0.0430 SNP*age3 0.0726 0.0612 0.0630 0.0718 0.0656 0.0700 0.0702 0.0676
Variance increasing over time SNP 0.0528 0.0504 0.0472 0.0492 0.0508 0.0468 0.0456 0.0536 SNP*age 0.0496 0.0492 0.0506 0.0488 0.0484 0.0536 0.0502 0.0470 SNP*age2 0.0526 0.0538 0.0486 0.0496 0.0562 0.0548 0.0498 0.0544 SNP*age3 0.0954 0.1002 0.1054 0.0908 0.1098 0.1002 0.1056 0.0980
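Table 16: Results from additional simulations for comparison between missing data or variable measurement time under the intense design. Data was simulated with the error term following a Gaussian distribution using the missing/different age design so that:
1) there was a cubic function of age in both the fixed and random effects (i.e. $\mathrm{BMI}_{ij} = \beta_0 + \beta_1 t_{ij} + \beta_2 t_{ij}^2 + \beta_3 t_{ij}^3 + \beta_4 \mathrm{MS}_{ij} + \beta_5 \mathrm{SNP}_i + \beta_6 t_{ij}\mathrm{SNP}_i + \beta_7 t_{ij}^2 \mathrm{SNP}_i + \beta_8 t_{ij}^3 \mathrm{SNP}_i + b_{i0} + b_{i1} t_{ij} + b_{i2} t_{ij}^2 + b_{i3} t_{ij}^3 + \varepsilon_{ij}$);
2) there was a quadratic function of age in both the fixed and random effects (i.e. $\mathrm{BMI}_{ij} = \beta_0 + \beta_1 t_{ij} + \beta_2 t_{ij}^2 + \beta_3 \mathrm{MS}_{ij} + \beta_4 \mathrm{SNP}_i + \beta_5 t_{ij}\mathrm{SNP}_i + \beta_6 t_{ij}^2 \mathrm{SNP}_i + b_{i0} + b_{i1} t_{ij} + b_{i2} t_{ij}^2 + \varepsilon_{ij}$);
3) there was a linear function of age in both the fixed and random effects (i.e. $\mathrm{BMI}_{ij} = \beta_0 + \beta_1 t_{ij} + \beta_2 \mathrm{MS}_{ij} + \beta_3 \mathrm{SNP}_i + \beta_4 t_{ij}\mathrm{SNP}_i + b_{i0} + b_{i1} t_{ij} + \varepsilon_{ij}$);
4) there was a cubic function of age in the fixed effects and a quadratic function of age in the random effects (i.e. $\mathrm{BMI}_{ij} = \beta_0 + \beta_1 t_{ij} + \beta_2 t_{ij}^2 + \beta_3 t_{ij}^3 + \beta_4 \mathrm{MS}_{ij} + \beta_5 \mathrm{SNP}_i + \beta_6 t_{ij}\mathrm{SNP}_i + \beta_7 t_{ij}^2 \mathrm{SNP}_i + \beta_8 t_{ij}^3 \mathrm{SNP}_i + b_{i0} + b_{i1} t_{ij} + b_{i2} t_{ij}^2 + \varepsilon_{ij}$);
5) there was a quadratic function of age in the fixed effects and a linear function of age in the random effects (i.e. $\mathrm{BMI}_{ij} = \beta_0 + \beta_1 t_{ij} + \beta_2 t_{ij}^2 + \beta_3 \mathrm{MS}_{ij} + \beta_4 \mathrm{SNP}_i + \beta_5 t_{ij}\mathrm{SNP}_i + \beta_6 t_{ij}^2 \mathrm{SNP}_i + b_{i0} + b_{i1} t_{ij} + \varepsilon_{ij}$).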
N=1,000 N=3,000
SNP SNP*age SNP*age2 SNP*age3 SNP SNP*age SNP*age2 SNP*age3
Cubic fixed, cubic random 0.0499 0.0509 0.0501 0.0533 0.0472 0.0490 0.0496 0.0489
Quadratic fixed, quadratic random 0.0485 0.0487 0.0519 -- 0.0514 0.0531 0.0499 --
Linear fixed, linear random 0.0513 0.0543 -- -- 0.0517 0.0486 -- --
Cubic fixed, quadratic random 0.0487 0.0648 0.0490 0.0902 0.0501 0.0655 0.0498 0.0861
Quadratic fixed, linear random 0.0505 0.0502 0.0689 -- 0.0533 0.0519 0.0679 --
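The data-generating step for the simplest of these models (linear fixed and linear random effects) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the simulation code used in the thesis: all parameter values are hypothetical, the covariate MS is replaced by a placeholder binary variable, the random intercept and slope are drawn independently, and genotypes are drawn under Hardy-Weinberg equilibrium with additive coding.

```python
import random


def simulate_cohort(n_individuals, maf, ages, beta,
                    sd_intercept, sd_slope, sd_error, rng):
    """Simulate long-format records (id, age, ms, snp, bmi).

    Model: BMI_ij = b0 + b1*t + b2*MS + b3*SNP + b4*t*SNP
                    + u0_i + u1_i*t + eps_ij
    """
    b0, b1, b2, b3, b4 = beta
    records = []
    for i in range(n_individuals):
        # Number of minor alleles (0, 1 or 2) under Hardy-Weinberg equilibrium.
        snp = sum(rng.random() < maf for _ in range(2))
        u0 = rng.gauss(0.0, sd_intercept)  # random intercept
        u1 = rng.gauss(0.0, sd_slope)      # random slope (assumed independent of u0)
        for t in ages:
            ms = rng.randint(0, 1)  # placeholder binary covariate standing in for MS
            eps = rng.gauss(0.0, sd_error)  # Gaussian error term
            bmi = b0 + b1 * t + b2 * ms + b3 * snp + b4 * t * snp + u0 + u1 * t + eps
            records.append((i, t, ms, snp, bmi))
    return records


# Null hypothesis: no SNP main effect and no SNP-by-age interaction.
rng = random.Random(42)
null_beta = (16.0, 0.5, 0.2, 0.0, 0.0)  # hypothetical coefficients
data = simulate_cohort(1000, 0.3, range(1, 15), null_beta, 1.0, 0.1, 0.8, rng)
print(len(data), "records simulated")
```

Fitting an analysis model to many such data sets and counting how often the SNP terms are declared significant gives an empirical type 1 error of the kind tabulated above.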
Table 17: Type 1 error when the fixed and random effects both include a cubic function for age. Data
was simulated under the equal unbalanced scenario with a sample size of 3,000; 1,000 simulations for
each MAF were conducted. Columns 2 and 3 (under the heading “Quadratic”) are the same as in Table
4.6 of Chapter 4 and are included here for comparison purposes. Bold and underlined cells are those
that are significantly different from the nominal α=0.05 under each design
Random effects: Quadratic (20,000 simulations – 5,000 each MAF); Cubic (4,000 simulations – 1,000 each MAF)
Columns: Quadratic Standard, Quadratic Robust, Cubic Standard, Cubic Robust
Gaussian Distribution
SNP 0.0500 0.0508 0.0470 0.0480
SNP*age 0.0592 0.0531 0.0438 0.0440
Global Wald test 0.0598 0.0525
t-distribution
SNP 0.0491 0.0497 0.0448 0.0443
SNP*age 0.0629 0.0539 0.0483 0.0498
Global Wald test 0.0621 0.0535
Skew-normal Distribution
SNP 0.0495 0.0501 0.0558 0.0560
SNP*age 0.0589 0.0526 0.0500 0.0490
Global Wald test 0.0582 0.0513
Mixture of 2 Gaussian Distributions
SNP 0.0490 0.0490 0.0505 0.0498
SNP*age 0.0487 0.0459 0.0533 0.0550
Global Wald test 0.0581 0.0478
Variance dependent on a covariate
SNP 0.0482 0.0491 0.0555 0.0578
SNP*age 0.0607 0.0514 0.0460 0.0473
Global Wald test 0.0611 0.0473
Variance greater at adiposity rebound
SNP 0.0492 0.0495 0.0518 0.0528
SNP*age 0.0563 0.0483 0.0555 0.0550
Global Wald test 0.0559 0.0518
Variance increasing over time
SNP 0.0500 0.0502 0.0488 0.0490
SNP*age 0.0571 0.0540 0.0473 0.0473
Global Wald test 0.0929 0.0529 0.0495
Table 18: Type 1 error when the simulation and analysis models are different. Data was simulated with
a sample size of 3,000 under the missing/unbalanced scenario (each individual had 40% missing data
over the time period and was measured at different times within each year period) and analysed with a
model that differed from the simulation model; 5,000 simulations for each MAF were conducted. Bold and
underlined cells are those that are significantly different from the nominal α=0.05 under each design.
Simulated model | Analysis model | SNP | SNP*age | SNP*age2
Quadratic fixed, linear random | Quadratic fixed, quadratic random | 0.0508 | 0.0503 | 0.0484
Quadratic fixed, quadratic random | Quadratic fixed, linear random | 0.0624 | 0.0521 | 0.1393
Figure 1: Overview of the simulations conducted in this study. Initial simulations were conducted to determine whether misspecification of the model affected coverage probability, bias, power or type 1 error. Upon discovering inflation in the type 1 error in the unbalanced sampling designs, we conducted steps 2 to 5 attempting to determine the source of the inflation.
STEP 1: Investigate whether different error distributions affect coverage probability, bias, power, type 1 error
Analysis of:
• 5 sampling designs
• 4 MAFs
• 2 sample sizes
• 7 models
Alternative hypothesis (β5 = 0.6 and β6 = 0.15)
Null hypothesis (β5 = 0 and β6 = 0)
• Derive coverage probabilities (1,000 simulations per scenario)
• Derive power (1,000 simulations per scenario)
• Derive bias (1,000 simulations per scenario)
• Derive type 1 error (5,000 simulations per scenario)

STEP 2: Investigate why type 1 error is increased in unbalanced designs
Analysis of:
• 4 NEW sampling designs
• 4 MAFs
• 2 sample sizes
• Gaussian random effects and error distribution
Null hypothesis (β5 = 0 and β6 = 0)
• Complete in all individuals and they were all measured at the same time. Derive type 1 error (5,000 simulations)
• Complete in all individuals but they were measured at different times within each year period. Derive type 1 error (5,000 simulations)
• Each individual had 40% missing data over the time period, but were all measured at the same time. Derive type 1 error (5,000 simulations)
• Each individual had 40% missing data over the time period and were measured at different times within each year period. Derive type 1 error (5,000 simulations)

STEP 3: Investigate why the type 1 error inflation is magnified in the presence of missing data
Analysis of:
• Missing/different age design used in STEP 2
• 4 MAFs
• 2 sample sizes
• Gaussian random effects and error distribution
Null hypothesis (β5 = 0 and β6 = 0)
• Cubic function for age for the fixed and random effects. Derive type 1 error (5,000 simulations)
• Cubic function for age in the fixed effects and quadratic function for age in the random effects. Derive type 1 error (5,000 simulations)
• Linear function for age for the fixed and random effects. Derive type 1 error (5,000 simulations)
• Quadratic function for age for the fixed and random effects. Derive type 1 error (5,000 simulations)
• Quadratic function for age in the fixed effects and linear function for age in the random effects. Derive type 1 error (5,000 simulations)

STEP 4: Investigate whether other six models have nominal type 1 error when the fixed and random effects are the same
Analysis of:
• Equal unbalanced sampling design
• 4 MAFs
• sample size of 3,000
• 7 models
Null hypothesis (β5 = 0 and β6 = 0). Derive type 1 error (1,000 simulations)

STEP 5: Investigate whether type 1 error is inflated when the analysis model is different to the model the data is simulated under
Analysis of:
• Equal unbalanced sampling design
• 4 MAFs
• sample size of 3,000
• Gaussian random effects and error distribution
Null hypothesis (β5 = 0 and β6 = 0). Derive type 1 error (1,000 simulations per scenario)
Appendix E: Publication Arising from the Research in Chapter Six
Association of a Body Mass Index Genetic Risk Score with Growth throughout Childhood and Adolescence
Nicole M. Warrington1,2, Laura D. Howe3, Yan Yan Wu2, Nicholas J. Timpson3, Kate Tilling4,
Craig E. Pennell1, John Newnham1, George Davey-Smith3, Lyle J. Palmer2,5, Lawrence J. Beilin6,
Stephen J. Lye2, Debbie A. Lawlor3, Laurent Briollais2*
1 School of Women’s and Infants’ Health, The University of Western Australia, Perth, Western Australia, Australia, 2 Samuel Lunenfeld Research Institute, University of
Toronto, Toronto, Ontario, Canada, 3 MRC Centre for Causal Analyses in Translational Epidemiology, School of Social and Community Medicine, University of Bristol,
Bristol, United Kingdom, 4 School of Social and Community Medicine, University of Bristol, Bristol, United Kingdom, 5 Ontario Institute for Cancer Research, University of
Toronto, Toronto, Ontario, Canada, 6 School of Medicine and Pharmacology, The University of Western Australia, Perth, Western Australia, Australia
Abstract
Background: While the number of established genetic variants associated with adult body mass index (BMI) is growing, the relationships between these variants and growth during childhood are yet to be fully characterised. We examined the association between validated adult BMI associated single nucleotide polymorphisms (SNPs) and growth trajectories across childhood. We investigated the timing of onset of the genetic effect and whether it was sex specific.
Methods: Children from the ALSPAC and Raine birth cohorts were used for analysis (n = 9,328). Genotype data from 32 adult BMI associated SNPs were investigated individually and as an allelic score. Linear mixed effects models with smoothing splines were used for longitudinal modelling of the growth parameters and measures of adiposity peak and rebound were derived.
Results: The allelic score was associated with BMI growth throughout childhood, explaining 0.58% of the total variance in BMI in females and 0.44% in males. The allelic score was associated with higher BMI at the adiposity peak (females = 0.0163 kg/m2 per allele, males = 0.0123 kg/m2 per allele) and earlier age (-0.0362 years per allele in males and females) and higher BMI (0.0332 kg/m2 per allele in females and 0.0364 kg/m2 per allele in males) at the adiposity rebound. No gene:sex interactions were detected for BMI growth.
Conclusions: This study suggests that known adult genetic determinants of BMI have observable effects on growth from early childhood, and is consistent with the hypothesis that genetic determinants of adult susceptibility to obesity act from early childhood and develop over the life course.
Citation: Warrington NM, Howe LD, Wu YY, Timpson NJ, Tilling K, et al. (2013) Association of a Body Mass Index Genetic Risk Score with Growth throughout Childhood and Adolescence. PLoS ONE 8(11): e79547. doi:10.1371/journal.pone.0079547
Editor: Kristel Sleegers, University of Antwerp, Belgium
Received July 12, 2013; Accepted September 23, 2013; Published November 11, 2013
Copyright: © 2013 Warrington et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This work was supported by the following funding bodies and institutions. The UK Medical Research Council and the Wellcome Trust (Grant ref: 092731) and the University of Bristol provide core support for ALSPAC. The following Institutions provide funding for Core Management of the Raine Study: The University of Western Australia (UWA), Raine Medical Research Foundation, UWA Faculty of Medicine, Dentistry and Health Sciences, The Telethon Institute for Child Health Research, Curtin University and Women and Infants Research Foundation. This study was supported by project grants from the National Health and Medical Research Council of Australia (Grant ID 403981 and ID 003209) and the Canadian Institutes of Health Research (Grant ID MOP-82893). NM Warrington is funded by an Australian Postgraduate Award from the Australian Government of Innovation, Industry, Science and Research and a Raine Study PhD Top-Up Scholarship. LD Howe is funded by a UK Medical Research Council Population Health Scientist fellowship (G1002375). LD Howe, NJ Timpson, K Tilling, G Davey-Smith and DA Lawlor all work in a Centre that receives core funding from the University of Bristol and the UK Medical Research Council (Grant ref: G0600705). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: [email protected]
. These authors contributed equally to this work.
Introduction
Twin and family studies have provided evidence that body mass
index (BMI) is strongly heritable [1,2,3,4]. Recent genome-wide
association studies (GWAS) have begun to uncover genetic loci
contributing to increases in BMI in adulthood [5,6,7,8,9,10,11].
The largest genome-wide meta-analysis of BMI published to-date
included 249,796 individuals from the Genetic Investigation of
Anthropometric Traits (GIANT) Consortium, which confirmed 14
previously-reported loci and identified 18 novel loci for BMI [5].
There has been one GWAS to date that has focused on a
dichotomous indicator of childhood obesity [12], but none looking
at BMI on a continuous scale in childhood.
Once adult height is attained, changes in BMI are largely driven
by changes in weight. In contrast, during childhood and
adolescence, changes in BMI are influenced by both changes in
height and weight. Therefore, genetic variants that affect adult
BMI may influence change in weight, height or both during
childhood. Previous studies of adult BMI single nucleotide
polymorphisms (SNPs) in relation to infant and child change in
PLOS ONE | www.plosone.org 1 November 2013 | Volume 8 | Issue 11 | e79547
growth have shown little evidence of an association with birth
weight [13,14,15], but have shown evidence that these loci are
associated with more rapid height and weight gain in infancy
[13,15], and higher BMI and odds of obesity at multiple ages
across the life course [13,14,15,16,17].
BMI growth over childhood and adolescence is complex;
children tend to have rapidly increasing BMI from birth to
approximately 9 months of age where they reach their adiposity
peak, BMI then decreases until about the age of 5-6 years at
adiposity rebound and then steadily increases again until just after
puberty where it tends to plateau through adulthood. The BMI
and timing at the adiposity peak [18] and adiposity rebound
[19,20,21,22,23,24] have been shown to be associated with later
BMI. Genetic variants could also affect features of the growth
trajectory and shape key developmental milestones, including the
adiposity peak [25], adiposity rebound, and onset of puberty
between 10 and 13 years [17,26,27,28]. Sovio et al [17] and Belsky
et al [15] have recently shown that SNPs associated with adult
BMI are also associated with earlier age and higher BMI at
adiposity rebound. Genetic influences on the adiposity peak
remain poorly understood. Understanding whether and how
genetic loci are associated with BMI and other anthropometric
measures differentially across the life course may shed light on the
biological pathways involved, as well as provide insights into the
development of obesity to inform the design of interventions.
To date, there has been no comprehensive study of how all
known genetic variants of adult BMI influence growth over
childhood and adolescence (BMI, height and weight) and related
growth parameters (age and BMI at the adiposity peak and
rebound). One of the limitations of previous studies is they have
not stratified by sex, despite some evidence that sex-specific
differences in body composition may be partly due to genetics
[29,30]. Therefore, in the current study we:
1. Examine the association between an allelic score of 32 adult
BMI associated alleles and BMI, weight and height growth
trajectories from birth to age 17 in two birth cohorts.
2. Assess whether the association between BMI trajectories and
the 32 individual genetic loci are sex specific.
Materials and Methods
Study Populations
ALSPAC. The Avon Longitudinal Study of Parents and
Children (ALSPAC) is a prospective cohort study. The full study
methodology is published elsewhere [31] (www.bristol.ac.uk/alspac).
Pregnant women resident in one of three Bristol-based
health districts with an expected delivery date between 1 April
1991 and 31 December 1992 were invited to participate.
Invitation cards indicated that study consent was ‘opt out’, i.e.
women not actively declining participation would be included in
future data collection follow-up. Follow-up included parent and
child completed questionnaires, links to routine health care data,
and clinic attendance. 7,868 individuals were included in this
study based on the following criteria: at least one parent of
European descent, live singleton birth, unrelated to anyone in the
sample, no major congenital anomalies, genotype data, and at
least one measure of BMI throughout childhood. Ethical approval
for the study was obtained from the ALSPAC Law and Ethics
Committee and the Local Research Ethics Committees.
Raine. The Western Australian Pregnancy Cohort (Raine)
Study [32,33,34] is a prospective pregnancy cohort where 2,900
mothers were recruited prior to 18-weeks’ gestation between 1989
and 1991 (http://www.rainestudy.org.au/). 1,460 individuals were
included in this study using the same criteria as in the ALSPAC
cohort. The study was conducted with appropriate institutional
ethics approval from the King Edward Memorial Hospital and
Princess Margaret Hospital for Children ethics boards, and written
informed consent was obtained from all mothers.
BMI was calculated from weight and height measurements in
both cohorts. Additional information on the measurements in each
cohort is provided in the supplementary material (see Methods
S1). Access to data and associated protocols from the two cohorts
needs to follow the cohort guidelines outlined on their respective
websites.
Genotyping and allelic score
Imputed genotypic data used in both cohorts has been
previously described [35,36] (details in supplementary material;
see Methods S1). Speliotes et al [5] reported 32 variants to be
associated with BMI, while Belsky et al [15] selected a tag SNP
from each LD block that had previously been shown to be
associated with BMI-related traits. We selected 32 SNPs that were
from either of these two manuscripts; SNPs reported in these two
manuscripts that were within the genes of interest were all in high
LD (r2 > 0.75), so one locus was selected for inclusion. All SNPs
imputed well (all R2 for imputation quality > 0.7, mean = 0.981);
therefore, dosages from the imputed data were used (i.e. the
estimated number of BMI-increasing alleles). An ‘allelic score’ was
created by summing the dosages for the BMI-increasing alleles
across all 32 SNPs [37]. A sensitivity analysis was conducted
whereby the alleles were weighted by the published effect size for
adult BMI. The weighted score gave the same conclusions as the
unweighted score; therefore only the unweighted score is
presented.
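The score construction described above can be sketched in a few lines. This is a minimal illustration only, assuming per-SNP dosages between 0 and 2 for one individual; the function name `allelic_score` and the example dosages are hypothetical, and the three example weights are the GWAS effect sizes listed for FTO, NEGR1 and TNNI3K in Table 2.

```python
from typing import Optional, Sequence


def allelic_score(dosages: Sequence[float],
                  weights: Optional[Sequence[float]] = None) -> float:
    """Sum BMI-increasing allele dosages across SNPs for one individual.

    dosages: estimated number (0 to 2) of BMI-increasing alleles per SNP.
    weights: optional per-SNP effect sizes, used for a weighted score.
    """
    if any(d < 0.0 or d > 2.0 for d in dosages):
        raise ValueError("each dosage must lie between 0 and 2")
    if weights is None:
        # Unweighted score: a simple count of risk alleles.
        return sum(dosages)
    if len(weights) != len(dosages):
        raise ValueError("one weight is needed per SNP")
    # Weighted score: each dosage scaled by its published effect size.
    return sum(w * d for w, d in zip(weights, dosages))


# Hypothetical individual with three SNPs (illustrative dosages only).
example_dosages = [1.0, 0.02, 1.97]
example_weights = [0.39, 0.13, 0.07]  # FTO, NEGR1, TNNI3K effect sizes
unweighted = allelic_score(example_dosages)
weighted = allelic_score(example_dosages, example_weights)
```

In practice the sum would run over all 32 SNPs, so the unweighted score ranges from 0 to 64.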
Longitudinal Modelling and derivation of growth parameters
Modelling BMI longitudinally from birth throughout childhood
is complex due to the two inflection points, adiposity peak in
infancy and adiposity rebound in childhood, and the increasing
variance in BMI throughout childhood. For this reason the
longitudinal models focused on data between 1 (when most
individuals will be post adiposity peak) and 17 years of age. A semi-
parametric linear mixed model, using smoothing splines to yield a
smooth growth curve estimate, was fitted to the BMI, weight and
height measures [38]. The basic model for the jth individual and at
the tth time-point is as follows:
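$$
\begin{aligned}
\mathrm{Growth}_{jt} ={}& \beta_0 + \sum_{i} \beta_i \left(\mathrm{Age}_{jt} - \overline{\mathrm{Age}}\right)^i + \sum_{k} \gamma_k \left(\left(\mathrm{Age}_{jt} - \overline{\mathrm{Age}}\right) - \kappa_k\right)^i_{+} + \sum_{l} \beta_l \,\mathrm{Covariate}_l \\
&+ u_{0j} + \sum_{i} u_{ij} \left(\mathrm{Age}_{jt} - \overline{\mathrm{Age}}\right)^i + \sum_{k} \eta_{kj} \left(\left(\mathrm{Age}_{jt} - \overline{\mathrm{Age}}\right) - \kappa_k\right)^i_{+} + \varepsilon_{jt}
\end{aligned}
$$
Where Growth is BMI, weight or height, $\overline{\mathrm{Age}}$ is the mean age
over the t time points in the sample (i.e. 8 years), $\kappa_k$ is the k-th knot
and $(t - \kappa_k)_{+} = 0$ if $t \le \kappa_k$ and $(t - \kappa_k)$ if $t > \kappa_k$, which is known
as the truncated power basis that ensures smooth continuity
between the time windows, and Covariate are the study specific
(time independent) covariates. Three knot points were used, placed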
at two, eight and 12 years, with a cubic slope for each spline in the
BMI and height models; this model provided the best fit of the
data compared to other approaches [38]. The weight model had
BMI Allelic Score Associated with Childhood Growth
the same placement for the knots, but a linear spline from 1–2
years, cubic slopes for 2–8 years and 8–12 years, and finally a
quadratic slope for over 12 years provided a better fit to the data,
based on having the lowest Akaike Information Criterion (AIC). All
models assumed a continuous autoregressive correlation structure
of order 1.
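To make the truncated power basis concrete, the sketch below builds the fixed-effect design row for a single age under the cubic specification used for BMI and height. It is an illustration rather than the authors' code (the analyses were run in R): the function names are hypothetical, and it assumes the knots are specified in years and centred at the mean age in the same way as age itself.

```python
from typing import List, Sequence

MEAN_AGE = 8.0            # centring age reported in the text (years)
KNOTS = (2.0, 8.0, 12.0)  # knot ages reported in the text (years)


def truncated_power(x: float, knot: float, degree: int) -> float:
    """Return (x - knot)_+ ** degree, which is zero at or below the knot."""
    return (x - knot) ** degree if x > knot else 0.0


def spline_row(age: float, degree: int = 3,
               knots: Sequence[float] = KNOTS,
               mean_age: float = MEAN_AGE) -> List[float]:
    """Design row: intercept, centred-age powers, one truncated term per knot."""
    centred = age - mean_age
    row = [1.0]
    row += [centred ** i for i in range(1, degree + 1)]
    # Assumption: knots are given in years and centred like age.
    row += [truncated_power(centred, k - mean_age, degree) for k in knots]
    return row


row_at_10_years = spline_row(10.0)
```

Each truncated term switches on only after its knot, which is what lets the fitted curve change shape between age windows while remaining smooth at the knots.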
Age and BMI at adiposity rebound were derived by setting the
first derivative of the fixed and random effects from the BMI
model between 2 and 8 years of age for each individual to zero (i.e.
the minimum point in the curve). In addition, a second model was
fit in the ALSPAC cohort only, between birth and 5 years to derive
the adiposity peak; individuals with greater than 2 measures
throughout this period were included [18], with 93% of included
individuals having at least one measure of BMI between six and 12
months. Adiposity peak was derived by setting the first derivative
of the fixed and random effects between birth and 2.5 years to
zero.
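The derivation of the rebound as the point where the first derivative equals zero can be sketched for a simplified case. The example assumes an individual's predicted BMI between 2 and 8 years is a single cubic in age; the coefficients are hypothetical and chosen only so that the minimum is easy to verify, and the function name is not from the paper.

```python
import math
from typing import Optional, Tuple


def adiposity_rebound(c0: float, c1: float, c2: float, c3: float,
                      lower: float = 2.0,
                      upper: float = 8.0) -> Optional[Tuple[float, float]]:
    """Find the minimum of BMI(a) = c0 + c1*a + c2*a**2 + c3*a**3 on [lower, upper].

    Returns (age, BMI) at the minimum, or None if no interior minimum exists.
    """
    # First derivative: c1 + 2*c2*a + 3*c3*a**2 = 0
    qa, qb, qc = 3.0 * c3, 2.0 * c2, c1
    if qa == 0.0:
        roots = [] if qb == 0.0 else [-qc / qb]
    else:
        disc = qb * qb - 4.0 * qa * qc
        if disc < 0.0:
            return None
        root = math.sqrt(disc)
        roots = [(-qb - root) / (2.0 * qa), (-qb + root) / (2.0 * qa)]
    for age in roots:
        # A positive second derivative identifies a minimum, not a maximum.
        second = 2.0 * c2 + 6.0 * c3 * age
        if lower <= age <= upper and second > 0.0:
            bmi = c0 + c1 * age + c2 * age ** 2 + c3 * age ** 3
            return age, bmi
    return None


# Hypothetical coefficients giving a minimum at 5.5 years.
rebound = adiposity_rebound(17.5, -0.66, -0.0225, 0.01)
```

The adiposity peak would follow the same logic with the sign of the second-derivative check reversed and the window set to birth to 2.5 years.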
Statistical Analysis
Implausible height, weight and BMI measurements (> 4 SD
from the mean for sex and age specific category) were considered
as outliers and were recoded to missing. Genetic differences in the
trajectories were estimated by including an interaction between all
components of the spline function for age and the genetic variants.
The association between the allelic score and birth measures was
analysed using linear regression, adjusting for gestational age at
birth. Linear regression was used to investigate the associations of
the allelic score with age and BMI at adiposity peak and adiposity
rebound. In addition, we used the data from the final follow-up in
each of the cohorts (15–17 years) to investigate, with linear
regression, the association between the adiposity peak and
adiposity rebound parameters with final BMI.
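The outlier rule at the start of this section (values more than 4 SD from the mean of their sex and age category recoded to missing) could be applied along the following lines. This is a sketch under stated assumptions: the data are hypothetical, missing values are represented by `None`, and the function would be called once per sex and age category.

```python
import statistics
from typing import List, Optional, Sequence


def recode_outliers(values: Sequence[float],
                    n_sd: float = 4.0) -> List[Optional[float]]:
    """Replace values more than n_sd standard deviations from the mean with None."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v if abs(v - mean) <= n_sd * sd else None for v in values]


# Hypothetical BMI values for one sex and age category, with one implausible entry.
bmi_stratum = [16.0 + 0.1 * (i % 5) for i in range(29)] + [40.0]
cleaned = recode_outliers(bmi_stratum)
```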
The growth data were collected using three measurement
sources in the ALSPAC cohort; clinic visits, routine health care
visits, and parental reports in questionnaires. Trajectory analyses
in ALSPAC adjusted for a binary indicator of measurement source
(parent reports versus clinic/health care measurements) as a fixed
effect to allow for differential measurement error. To assess
population stratification, principal components were generated in the
EIGENSTRAT software [39]. These components revealed no
obvious population stratification and genome-wide analyses with
other phenotypes indicate a low lambda in the ALSPAC cohort;
however in the Raine cohort there was evidence of stratification so
all analyses were adjusted for the first five principal components.
FTO is the most replicated SNP for BMI, with the largest effect
size of the BMI-associated SNPs found to date, and has been
shown previously to affect childhood growth [16,17]. We therefore
repeated the analysis adjusting for the FTO locus. All results
remained unchanged indicating that the associations between
growth and the allelic score were not driven exclusively by the
FTO effect (data not shown).
We calculated the percentage of variation in BMI explained by
the allelic score at each time point in the ALSPAC cohort using
the residual sums of squares from the longitudinal BMI growth
model [40]. We did not calculate this in the Raine cohort as the
sample size was too small for accurate estimates.
Results from the two cohorts were meta-analysed. For the allelic
score analyses, a fixed-effects inverse-variance weighted meta-
analysis was conducted using the beta coefficients and standard
errors from the two studies. No heterogeneity using Cochran’s Q
was detected between the cohorts (all P > 0.05). The allelic score
was considered statistically associated with the growth parameter if
the P-value for the meta-analysis was less than 0.05. For the
analyses of the individual SNPs with BMI, a P-Value meta-analysis
was conducted on the likelihood-ratio test (LRT) P-Values
from the two studies, without weighting, and a Bonferroni
significance threshold of 0.0016 was used to declare a statistically
significant association. All analyses were conducted in R version
2.12.1 [41], using the Spida library to estimate the spline
functions, the rmeta library for the effect-size meta-analysis and
the MADAM library for the P-Value meta-analysis.
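For readers unfamiliar with the pooling step, the sketch below shows a fixed-effects inverse-variance weighted meta-analysis with Cochran's Q, together with the arithmetic behind the Bonferroni threshold (0.05 divided by 32 SNPs). It is not the rmeta implementation used in the paper; the function name and the per-cohort estimates are hypothetical.

```python
import math
from typing import Sequence, Tuple


def fixed_effect_meta(betas: Sequence[float],
                      ses: Sequence[float]) -> Tuple[float, float, float]:
    """Return the pooled estimate, its standard error, and Cochran's Q."""
    # Each study is weighted by the inverse of its variance.
    weights = [1.0 / (se * se) for se in ses]
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, betas)) / total
    pooled_se = math.sqrt(1.0 / total)
    # Q sums the weighted squared deviations from the pooled estimate;
    # with two cohorts it has one degree of freedom.
    q_stat = sum(w * (b - pooled) ** 2 for w, b in zip(weights, betas))
    return pooled, pooled_se, q_stat


# Hypothetical per-cohort estimates (not taken from the paper).
pooled, pooled_se, q_stat = fixed_effect_meta([0.02, 0.04], [0.01, 0.01])

# Bonferroni threshold for 32 SNPs at a family-wise level of 0.05.
bonferroni_threshold = 0.05 / 32  # 0.0015625, reported as 0.0016
```

Because the weights are inverse variances, the larger ALSPAC cohort would dominate the pooled estimate whenever its standard errors are smaller than those from Raine.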
Results
ALSPAC children had more BMI measures throughout
childhood than the Raine children with a median of 9
(interquartile range 5–12) and 6 (interquartile range 5–7)
measures, respectively (Table 1). The minor allele frequency
(MAF) for the 32 SNPs ranged from 0.04 to 0.49 (Table 2). The
FTO locus had the largest effect on adult BMI, with an effect size of
0.39, while the effect size on adult BMI for the majority of the
remaining loci ranged from 0.06 to 0.2. All of the following results
are reported from the meta-analysis of the two cohorts, unless
otherwise specified.
Associations between the allelic score and growth trajectories
The allelic score was associated with higher mean levels of BMI
at the intercept of 8 years (Female: β = 0.0061 units, P < 0.0001;
Male: β = 0.0044 units, P < 0.0001; Table S1) and faster BMI
growth over childhood in both sexes (all age by score interaction P
< 0.001). Due to the increasing rate of growth over time, the
trajectories of individuals with high and low allelic scores begin
together at age one but separate throughout childhood (Figure 1A
and 1B). In females, differences in BMI trajectories associated with
the allelic score were detectable from just after one year in the
ALSPAC cohort and approximately 2.5 years in the Raine cohort;
a difference was detected earlier in males, at 1 year in ALSPAC
and at 18 months in the Raine cohort.
To investigate whether the association of the allelic score with
BMI growth over childhood was due to skeletal growth or
adiposity, we tested associations between the allelic score and both
weight and height measurements. The allelic score was associated
with higher weight (Females: β = 0.0073 units, P < 0.0001; Males:
β = 0.0056 units, P < 0.0001; Table S1) and faster rates of weight
gain over childhood in both males and females (all age by score
interaction P < 0.001; Figure 1C and 1D). The association with
weight was seen earlier in males (by 1 year of age in ALSPAC)
than females (around 2 years of age in ALSPAC). The allelic score
was associated with increased height in females (β = 0.0949m,
P = 0.0002) and males (β = 0.0838m, P = 0.0008) (Table S1) and
also displayed evidence for an interaction with age (P < 0.001 in
ALSPAC, P = 0.001 in Raine females and P = 0.015 in Raine
males; Figure 1E and 1F). The effect size of the allelic score on
height growth increased over childhood until around 10 years of
age in females and slightly later in males and then decreased until
it became statistically non-significant (Figure 2C). These results
suggest that the association of the allelic score with BMI growth
over childhood was due to both skeletal growth and adiposity.
Associations between the allelic score and birth measures, adiposity peak and adiposity rebound
As expected, females were both lighter and shorter than males
at birth (Table 1). The allelic score was not associated with the
birth measures in either sex (Table 3). In addition, there was no
interaction between the allelic score and gestational age for either
weight or length at birth (data not shown).
Table 1. Phenotypic characteristics of the two birth cohorts used for analysis.

| Characteristic | Age stratum (years) or sex | ALSPAC (n = 7,868): N | ALSPAC: Mean (SD) | Raine (n = 1,460): N | Raine: Mean (SD) |
|---|---|---|---|---|---|
| Sex [% male (N)] | | 7,868 | 51.25% (4,032) | 1,460 | 51.58% (753) |
| Number of BMI measures per person | | -- | 8.75 (4.58) | -- | 5.94 (1.52) |
| Age (years) | 1–1.49 | 2,832 | 1.18 (0.18) | 1,326 | 1.15 (0.09) |
| | 1.5–2.49 | 7,113 | 1.76 (0.25) | 387 | 2.14 (0.13) |
| | 2.5–3.49 | 2,537 | 2.95 (0.28) | 956 | 3.09 (0.09) |
| | 3.5–4.49 | 6,915 | 3.77 (0.23) | 20 | 3.69 (0.17) |
| | 4.5–5.49 | 1,843 | 5.05 (0.33) | 3 | 5.28 (0.14) |
| | 5.5–6.49 | 3,848 | 5.90 (0.24) | 1,269 | 5.91 (0.17) |
| | 6.5–7.49 | 2,861 | 7.31 (0.30) | 42 | 7.25 (0.38) |
| | 7.5–8.49 | 3,975 | 7.74 (0.33) | 1,040 | 8.02 (0.27) |
| | 8.5–9.49 | 4,443 | 8.71 (0.22) | 204 | 8.60 (0.12) |
| | 9.5–10.49 | 6,777 | 9.94 (0.29) | 303 | 10.44 (0.08) |
| | 10.5–11.49 | 4,917 | 10.75 (0.23) | 926 | 10.64 (0.15) |
| | 11.5–12.49 | 5,240 | 11.82 (0.21) | 4 | 11.91 (0.36) |
| | 12.5–13.49 | 6,797 | 12.97 (0.22) | 9 | 13.28 (0.17) |
| | 13.5–14.49 | 4,690 | 13.89 (0.17) | 1,196 | 14.06 (0.17) |
| | 14.5–15.49 | 2,339 | 15.32 (0.15) | 24 | 14.69 (0.17) |
| | 15.5–16.49 | 1,645 | 15.72 (0.22) | 2 | 16.16 (0.19) |
| | >16.5 | 90 | 16.83 (0.24) | 976 | 17.05 (0.24) |
| BMI (kg/m2) | 1–1.49 | 2,832 | 17.42 (1.51) | 1,326 | 17.11 (1.39) |
| | 1.5–2.49 | 7,113 | 16.82 (1.49) | 387 | 15.97 (1.19) |
| | 2.5–3.49 | 2,537 | 16.48 (1.40) | 956 | 16.14 (1.23) |
| | 3.5–4.49 | 6,915 | 16.25 (1.39) | 20 | 15.92 (1.41) |
| | 4.5–5.49 | 1,843 | 16.02 (1.70) | 3 | 15.94 (1.43) |
| | 5.5–6.49 | 3,848 | 15.71 (1.87) | 1,269 | 15.82 (1.62) |
| | 6.5–7.49 | 2,861 | 16.10 (1.98) | 42 | 16.41 (2.43) |
| | 7.5–8.49 | 3,975 | 16.31 (2.01) | 1,040 | 16.83 (2.38) |
| | 8.5–9.49 | 4,443 | 17.15 (2.40) | 204 | 16.90 (2.44) |
| | 9.5–10.49 | 6,777 | 17.67 (2.81) | 303 | 18.91 (3.34) |
| | 10.5–11.49 | 4,917 | 18.25 (3.10) | 926 | 18.55 (3.16) |
| | 11.5–12.49 | 5,240 | 19.04 (3.35) | 4 | 16.78 (2.64) |
| | 12.5–13.49 | 6,797 | 19.64 (3.35) | 9 | 21.11 (3.75) |
| | 13.5–14.49 | 4,690 | 20.31 (3.45) | 1,196 | 21.39 (4.02) |
| | 14.5–15.49 | 2,339 | 21.28 (3.48) | 24 | 21.66 (4.23) |
| | 15.5–16.49 | 1,645 | 21.41 (3.51) | 2 | 20.14 (3.26) |
| | >16.5 | 90 | 22.47 (3.40) | 976 | 23.01 (4.28) |
| Birth Weight (kg) | Males | 3,001 | 3.52 (0.53) | 752 | 3.42 (0.57) |
| | Females | 2,855 | 3.40 (0.47) | 707 | 3.31 (0.55) |
| Birth Length (cm) | Males | 3,001 | 51.13 (2.40) | 675 | 50.12 (2.34) |
| | Females | 2,855 | 50.41 (2.28) | 616 | 49.31 (2.28) |
| Gestational Age (wks) | Males | 3,001 | 39.52 (1.64) | 753 | 39.42 (1.99) |
| | Females | 2,855 | 39.65 (1.58) | 707 | 39.42 (2.06) |
| BMI at Adiposity Peak (kg/m2) | Males | 4,030 | 18.03 (0.76) | -- | -- |
| | Females | 3,792 | 17.45 (0.69) | -- | -- |
| Age at Adiposity Peak (months) | Males | 4,030 | 8.90 (0.33) | -- | -- |
| | Females | 3,792 | 9.36 (0.49) | -- | -- |
| BMI at Adiposity Rebound (kg/m2) | Males | 3,642 | 15.62 (1.04) | 697 | 15.53 (0.93) |
The estimated age and BMI at the peak were weakly correlated
in females (r = 0.08) and males (r = -0.30). Later age at adiposity
peak was associated with higher BMI at age 15–17 in females but
not males. In addition, higher BMI at adiposity peak was
associated with higher BMI at age 15–17 years in both sexes.
The allelic score was not associated with age of adiposity peak in
females or males (Table 3). However, the allelic score was
associated with a higher BMI at the peak (Females: β = 0.0163 kg/m2,
P = 0.0002; Males: β = 0.0123 kg/m2, P = 0.0033). Adjustment
for age at the peak did not substantively alter the magnitude
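Table 1. Cont.

| Characteristic | Sex | ALSPAC: N | ALSPAC: Mean (SD) | Raine: N | Raine: Mean (SD) |
|---|---|---|---|---|---|
| BMI at Adiposity Rebound (kg/m2) | Females | 3,225 | 15.53 (1.06) | 647 | 15.42 (0.95) |
| Age at Adiposity Rebound (years) | Males | 3,642 | 6.07 (1.02) | 697 | 5.30 (1.05) |
| | Females | 3,225 | 5.61 (1.16) | 647 | 4.64 (1.10) |

doi:10.1371/journal.pone.0079547.t001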
Table 2. Descriptive statistics of the single nucleotide polymorphisms included in the allelic score.

| Chr | Nearest Gene | SNP | Alleles (Effect Allele / Non-effect Allele) | GWAS Effect Size for BMI | Effect Allele Frequency: ALSPAC | Effect Allele Frequency: Raine |
|---|---|---|---|---|---|---|
| 1 | NEGR1 | rs2568958 | A/G | 0.13 | 0.5956 | 0.6218 |
| | TNNI3K | rs1514175 | A/G | 0.07 | 0.4249 | 0.4360 |
| | PTBP2 | rs1555543 | C/A | 0.06 | 0.5905 | 0.5942 |
| | SEC16B | rs543874 | G/A | 0.22 | 0.2075 | 0.2021 |
| 2 | TMEM18 | rs2867125 | C/T | 0.31 | 0.8325 | 0.8303 |
| | RBJ, ADCY3, POMC | rs713586 | C/T | 0.14 | 0.4888 | 0.4841 |
| | FANCL | rs887912 | T/C | 0.1 | 0.2904 | 0.2929 |
| | LRP1B | rs2890652 | C/T | 0.09 | 0.1669 | 0.1627 |
| 3 | CADM2 | rs13078807 | G/A | 0.1 | 0.2025 | 0.2089 |
| | ETV5, DGKG, SFRS10 | rs7647305 | C/T | 0.14 | 0.7924 | 0.7934 |
| 4 | SLC39A8 | rs13107325 | T/C | 0.19 | 0.0764 | 0.0723 |
| | GNPDA2 | rs10938397 | G/A | 0.18 | 0.4342 | 0.4359 |
| 5 | FLJ35779, HMGCR | rs2112347 | T/G | 0.1 | 0.6401 | 0.6347 |
| | ZNF608 | rs4836133 | A/C | 0.07 | 0.4949 | 0.4920 |
| 6 | TFAP2B | rs987237 | G/A | 0.13 | 0.1770 | 0.1897 |
| 9 | LRRN6C | rs10968576 | G/A | 0.11 | 0.3167 | 0.3062 |
| | LMX1B | rs867559 | G/A | 0.24 | 0.1983 | 0.1968 |
| 11 | RPL27A, TUB | rs4929949 | C/T | 0.06 | 0.5390 | 0.5210 |
| | BDNF | rs6265 | C/T | 0.19 | 0.8122 | 0.8119 |
| | MTCH2, NDUFS3, CUGBP1 | rs3817334 | T/C | 0.06 | 0.4000 | 0.4213 |
| 12 | FAIM2 | rs7138803 | A/G | 0.12 | 0.3592 | 0.3675 |
| 13 | MTIF3, GTF3A | rs4771122 | G/A | 0.09 | 0.2304 | 0.2154 |
| 14 | PRKD1 | rs11847697 | T/C | 0.17 | 0.0467 | 0.0414 |
| | NRXN3 | rs10150332 | C/T | 0.13 | 0.2112 | 0.2183 |
| 15 | MAP2K5, LBXCOR1 | rs2241423 | G/A | 0.13 | 0.7850 | 0.7699 |
| 16 | GPRC5B, IQCK | rs12444979 | C/T | 0.17 | 0.8620 | 0.8541 |
| | SH2B1, ATXN2L, TUFM, ATP2A1 | rs7359397 | T/C | 0.15 | 0.4166 | 0.3791 |
| | FTO | rs9939609 | A/T | 0.39 | 0.3933 | 0.3835 |
| 18 | MC4R | rs12970134 | A/G | 0.23 | 0.2680 | 0.2547 |
| 19 | KCTD15 | rs29941 | G/A | 0.06 | 0.6848 | 0.6606 |
| | TMEM160, ZC3H4 | rs3810291 | A/G | 0.09 | 0.6941 | 0.6438 |
| | QPCTL, GIPR | rs2287019 | C/T | 0.15 | 0.8123 | 0.8127 |

doi:10.1371/journal.pone.0079547.t002
of the association of the allelic score with BMI at the peak
(Females: β = 0.0157 kg/m2, P = 0.0003; Males: β = 0.0135 kg/m2,
P = 0.0007).
Earlier age and higher BMI at the adiposity rebound were both
associated with higher BMI at age 15–17 years. The allelic score
was associated with an earlier age at the adiposity rebound for
females (β = -0.0362 years, P < 0.0001) and males
(β = -0.0362 years, P < 0.0001) (Table 3). The effect size was
attenuated after adjusting for BMI at the rebound (Females:
β = -0.0122 years, P = 0.0018; Males: β = -0.0096 years,
P = 0.0022). The allelic score was also associated with higher
BMI at the rebound in females (β = 0.0332 kg/m2, P < 0.0001) and
males (β = 0.0364 kg/m2, P < 0.0001). Again, the effect size
attenuated when adjusting for age at the rebound (Females:
β = 0.0094 kg/m2, P = 0.0078; Males: β = 0.0109 kg/m2,
P = 0.0004).
There was a strong positive correlation between BMI at the
adiposity peak and the adiposity rebound (Female r = 0.65,
p < 0.0001; Male r = 0.59, p < 0.0001). BMI at the adiposity
rebound explains more of the variation in BMI at age 15–17
(45%) than the BMI at the adiposity peak (10%). Nevertheless, the
allelic score remains associated with BMI at the adiposity rebound
after adjusting for the BMI at the adiposity peak in both females
(β = 0.0171 kg/m2, P < 0.0001) and males (β = 0.0269 kg/m2,
P < 0.0001).
Variance explained by the allelic score
We calculated the percentage of variation in BMI explained by
the allelic score at each time point in the ALSPAC cohort using
the residual sums of squares from the longitudinal BMI growth
model [40]. We did not calculate this in the Raine cohort as the
sample size was too small for accurate estimates. The allelic score
explained 0.58% of the variance in BMI across childhood overall
in females and slightly less in males (0.44%) in ALSPAC, but this
percentage varied with age (Figure 3). This is approximately a
third of the variance in adult BMI explained by these SNPs in the
Figure 1. Population average curves for individuals with 27, 29 or 31 BMI risk alleles in females (A, C and E) and males (B, D and F) from the ALSPAC cohort. Predicted population average BMI (A and B), weight (C and D) and height (E and F) trajectories from 1 – 16 years for individuals with 27 (lower quartile), 29 (median), and 31 (upper quartile) BMI risk alleles in the allelic score.
doi:10.1371/journal.pone.0079547.g001
Figure 2. Associations between the allelic score and BMI, weight and height at each follow-up in females (A, C and E) and males (B, D and F) from the ALSPAC cohort. Regression coefficients (95% CI) derived from the longitudinal model at each year of follow-up between 1 and 16 years.
doi:10.1371/journal.pone.0079547.g002
Table 3. Cross-sectional association analysis results for birth measures, BMI and age at adiposity peak (AP) and BMI and age at adiposity rebound (AR) in the ALSPAC and Raine cohorts.

| Measure | Females: Beta (95% CI) | Females: P-Value | Males: Beta (95% CI) | Males: P-Value |
|---|---|---|---|---|
| Birth weight (kg) | -0.0004 (-0.0043, 0.0035) | 0.8283 | 0.0026 (-0.0017, 0.0069) | 0.2334 |
| Birth length (cm) | -0.0158 (-0.0352, 0.0036) | 0.1111 | -0.0002 (-0.0190, 0.0186) | 0.9840 |
| BMI at AP (kg/m2) | 0.0163 (0.0079, 0.0248) | 0.0002 | 0.0123 (0.0041, 0.0204) | 0.0033 |
| Age at AP (months) | 0.0074 (-0.0002, 0.0151) | 0.0566 | 0.0028 (-0.0025, 0.0080) | 0.3020 |
| BMI at AR (kg/m2) | 0.0332 (0.0237, 0.0427) | <0.0001 | 0.0364 (0.0277, 0.0451) | <0.0001 |
| Age at AR (years) | -0.0362 (-0.0467, -0.0257) | <0.0001 | -0.0362 (-0.0450, -0.0274) | <0.0001 |

doi:10.1371/journal.pone.0079547.t003
study that identified them [5]. Figure 3 displays the estimates over
childhood in females and males.
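As a rough guide to how a proportion of variance can be obtained from residual sums of squares, one common formulation compares a model fitted without the allelic score with the same model including it. The notation below is an assumption made for illustration ($\mathrm{RSS}_0$ for the model without the score, $\mathrm{RSS}_1$ for the model with it); the exact calculation described in [40] may differ.

$$
R^2_{\text{score}} = \frac{\mathrm{RSS}_0 - \mathrm{RSS}_1}{\mathrm{RSS}_0}
$$

Under this formulation, a reported value of 0.58% would correspond to the residual sum of squares falling by 0.58% when the score is added.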
The allelic score accounted for a similar percentage of the variation in BMI at
the adiposity peak in both females (0.42%) and males (0.22%).
However, for the measures at the adiposity rebound, the allelic
score accounts for up to 1–2% of the variation in the two cohorts
(Age: 0.87% in ALSPAC females, 2.70% in Raine females, 1.46%
in ALSPAC males and 0.89% in Raine males; BMI: 1.01% in
ALSPAC females, 1.87% in Raine females, 1.46% in ALSPAC
males and 1.14% in Raine males). This is twice as much of the
variation in BMI as could be accounted for at the time of
the adiposity peak or in the overall trajectory.
Single SNP analyses
In females, five of the 32 individual loci (RBJ, FTO, MC4R,
CADM2 and MTCH2) reached a Bonferroni significance threshold
of 0.0016 in the meta-analysis (Table S2). In males, four of the 32
individual loci (SEC16B, TMEM18, MC4R and FTO) were
associated with BMI trajectory at the Bonferroni significance
threshold (Table S3). Only FTO and MC4R reached statistical
significance in both males and females.
Sex differences
In analyses combining males and females, there was no evidence
for sex interactions for any of the 32 loci after Bonferroni
correction; however we report the following result here as an
exploratory finding. The sex interaction for the NRXN3 locus,
rs10150332 (including interaction with the spline function), had a
P-Value of 0.0039.
Discussion
We investigated the association of variants in genes known to be
associated with increased BMI in adulthood with growth measures
over childhood from two extensively characterized longitudinal
birth cohorts. Similar to previous studies [13,14,15,16,17], we
have shown that an allelic score of known adult BMI-associated
SNPs is not associated with birth measures but is associated with
BMI growth throughout childhood and adolescence, weight
changes, and also height changes (though with weaker associa-
tions). Previous work by Elks et al [13] in the ALSPAC cohort
investigated the association of an 8 SNP allelic score with growth
trajectories from birth to 11 years of age. We have extended their
work by including an additional cohort, and by increasing the age
period over which the trajectories are examined and the number
of SNPs investigated. By extending the age range, we have shown
that the association between the allelic score and weight changes
increases in magnitude with age, whereas the association of the
allelic score with height growth stops after the onset of puberty.
Belsky et al [15] are the only other investigators to look at an allelic
score using the same set of SNPs; our conclusions are similar to
theirs in terms of the growth trajectories throughout childhood,
but we extend their work by i) having more detailed early growth
measurements, enabling us to show that the allelic score starts to
be associated with growth trajectories at an early age and to assess
Figure 3. A smooth curve of the estimates from the longitudinal models of the proportion of BMI variation explained (R2) at each time point in females and males from the ALSPAC cohort. R2 derived from the longitudinal model at each year of follow-up between 1 and 16 years.
doi:10.1371/journal.pone.0079547.g003
associations between the allelic score and the adiposity peak in
infancy, and ii) reporting some exploratory findings regarding sex specific
genes affecting BMI growth. The GIANT consortium found a
SNP 30,000 bp upstream from the RBJ locus and a SNP in the
MC4R gene to be associated with adult height [42], but the full
functional relevance of the 32 loci, and which of them affect
height, fat accumulation or both, is not yet understood, and our
study does not have sufficient power to address this. A useful
extension to the current study would be to investigate whether any
of the individual SNPs in the allelic score largely influence child
height growth rather than weight; however a larger sample size
would be required to consider this.
Although the effect sizes presented appear relatively small, they
are consistent with those previously reported in the adult studies.
At age 15, an increase of one BMI risk allele increases BMI by
approximately 0.15 kg/m2, which is equivalent to some of the
mid-range effect sizes from adult GWAS as reported in
Table 2. The genetic basis of obesity is still largely unknown,
with the genetic variants described to date explaining only 1.45%
of the variation in BMI [5]; however, this study sheds more
light on the mechanisms behind how these genetic variants
influence childhood growth, rather than describing particularly
large effect sizes from any individual SNP.
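As a rough illustration of the per-allele effect quoted above, the following sketch computes an unweighted allelic score (a simple count of risk alleles) and the BMI difference it would imply at age 15. The function names, the unweighted scoring and the assumption that the effect is linear across the score are illustrative choices, not taken from the study's analysis code.

```python
def allelic_score(risk_allele_counts):
    """Unweighted allelic score: sum of risk-allele counts (0, 1 or 2 per SNP).

    Assumption: an unweighted count; a score could also be weighted.
    """
    if any(c not in (0, 1, 2) for c in risk_allele_counts):
        raise ValueError("each genotype must be coded as 0, 1 or 2 risk alleles")
    return sum(risk_allele_counts)


def expected_bmi_difference(score_a, score_b, per_allele_effect=0.15):
    """Expected BMI difference (kg/m2) at age 15 between two individuals.

    per_allele_effect defaults to the approximate 0.15 kg/m2 per risk
    allele reported in the text; linearity across the score is assumed.
    """
    return (score_a - score_b) * per_allele_effect


# Hypothetical example: two children whose scores differ by four risk alleles.
difference = expected_bmi_difference(30, 26)
```

Under these assumptions, a difference of four risk alleles corresponds to roughly 0.6 kg/m2 at age 15, which shows why the effect of the score as a whole can matter even though each individual allele contributes little.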
Our results suggest that known adult BMI increasing alleles
have a detectable effect on childhood growth as early as one year.
In addition, we investigated the association between the allelic
score and features of the growth curve thought to be associated
with later obesity and cardiovascular health [23,24,43,44,45]; the
allelic score was positively associated with higher BMI at the
adiposity peak, but only weakly associated with age at adiposity
peak. This contrasts with the findings for the association between the
FTO gene and adiposity peak shown in the Northern Finland Birth
Cohort 1966 [25], where the age but not BMI at adiposity
peak was associated with the FTO variant; however, subsequent
analysis in this cohort as part of a meta-analysis showed the
association was not statistically significant [17]. The explanation
for these differences is unclear; both of the cohorts investigated
had limited data available in the first few years of life, and
although data availability was greater than in previous studies and
we were able to estimate the emergence of the genetic association
and the parameters around the adiposity peak, it would be
beneficial to replicate this finding in cohorts with more regular
measurements in early infancy. Likewise, we saw differences in the
timing of the adiposity rebound between the ALSPAC and Raine
cohorts, with an earlier rebound being found in the Raine cohort.
This could be due to the lack of data between three and five and a
half years where the rebound is expected to occur. In contrast, the
ALSPAC cohort had an adequate number of measurements
throughout the adiposity rebound period although a portion of
them came from parental report questionnaires which have been
shown to be less accurate than the clinic measures [46]. Therefore,
the precision of the estimate for the BMI and age at the adiposity
rebound is very similar between the two cohorts, as seen by the
standard deviations in Table 1. In addition, we do not believe this
has influenced the genetic results as the effect sizes of the allelic
score were similar between the ALSPAC and Raine cohorts for
both the age and BMI at the adiposity rebound (data not shown).
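To make the growth landmarks discussed above concrete, the sketch below locates the adiposity peak (the first local maximum of BMI in infancy) and the adiposity rebound (the subsequent local minimum) on a grid of predicted BMI values. The function name and the example trajectory are hypothetical; the study derived these phenotypes from its fitted longitudinal models, as described in Methods S1, and this grid search is only a simplified stand-in.

```python
def growth_landmarks(ages, bmi):
    """Return ((age, BMI) at adiposity peak, (age, BMI) at adiposity rebound).

    ages and bmi are equal-length sequences of predicted values ordered by age.
    Either landmark is None if it cannot be found on the grid.
    """
    peak = None
    rebound = None
    for i in range(1, len(bmi) - 1):
        if peak is None:
            # First point that is higher than its predecessor and not lower
            # than its successor: the adiposity peak.
            if bmi[i] > bmi[i - 1] and bmi[i] >= bmi[i + 1]:
                peak = (ages[i], bmi[i])
        elif rebound is None:
            # First point after the peak that is lower than its predecessor
            # and not higher than its successor: the adiposity rebound.
            if bmi[i] < bmi[i - 1] and bmi[i] <= bmi[i + 1]:
                rebound = (ages[i], bmi[i])
    return peak, rebound


# Hypothetical predicted trajectory for one child (ages in years, BMI in kg/m2).
example_ages = [0.25, 0.5, 0.75, 1, 2, 3, 4, 5, 6, 7, 8]
example_bmi = [15.0, 17.0, 17.5, 17.2, 16.5, 16.0, 15.7, 15.5, 15.6, 15.9, 16.3]
example_peak, example_rebound = growth_landmarks(example_ages, example_bmi)
```

The sketch also illustrates the data-availability issue raised in the text: if no measurements or predictions fall between three and five and a half years, the grid cannot resolve where in that interval the minimum occurs.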
Previous studies investigating the association between adult
BMI-associated SNPs and childhood growth adjusted their analyses for
sex [13,14,15,16,17]; only Hardy et al [16] tested for a sex
interaction and found it to be non-significant. We detected a
statistically significant sex interaction for the allelic score, so
conducted sex-specific analyses. We found that the allelic score
begins to be associated with BMI and weight earlier in males than
females, but around the same age for height. Furthermore, other
than the FTO and MC4R SNPs, we found different genes
associated with childhood BMI trajectory in males and females.
However, these differences could not be replicated in the formal
interaction analysis and therefore further investigation in larger
sample sizes is required to confirm this observation. Our findings
provide additional evidence that there may be different, but
partially overlapping, genes that contribute to the body shape of
males and females from early childhood.
In conclusion, we have conducted an association analysis in a
large childhood population to investigate the effect of known adult
genetic determinants of BMI on childhood growth trajectory. We
have shown that the genetic effect begins very early in life, which is
consistent with the life course epidemiology hypotheses – the
determinants of adult susceptibility to obesity begin in early
childhood and develop over the life course.
Supporting Information
Table S1 Longitudinal allelic score association analysis results
for BMI, weight and height in ALSPAC and Raine, in addition to
the meta-analysis summary
(XLSX)
Table S2 Longitudinal association analysis results for each of the
32 BMI SNPs against BMI, weight and height in females from
ALSPAC and Raine, in addition to the meta-analysis summary
(XLSX)
Table S3 Longitudinal association analysis results for each of the
32 BMI SNPs against BMI, weight and height in males from
ALSPAC and Raine, in addition to the meta-analysis summary
(XLSX)
Methods S1 Additional information regarding the collection of
phenotypic measurements and genotyping methods in the
ALSPAC and Raine cohorts. Furthermore, additional details
regarding the longitudinal modelling and derivation of growth
phenotypes are provided.
(DOC)
Acknowledgments
ALSPAC: We are extremely grateful to all the families who took part in this
study, the midwives for their help in recruiting them, and the whole
ALSPAC team, which includes interviewers, computer and laboratory
technicians, clerical workers, research scientists, volunteers, managers,
receptionists and nurses.
Raine: The authors are grateful to the Raine Study participants, their
families, and to the Raine Study research staff for cohort coordination and
data collection. The authors gratefully acknowledge the assistance of the
Western Australian DNA Bank (National Health and Medical Research
Council of Australia National Enabling Facility).
Author Contributions
Conceived and designed the experiments: NMW LDH KT LJP DAL LB.
Analyzed the data: NMW. Contributed reagents/materials/analysis tools:
LDH YYW. Wrote the paper: NMW LDH DAL LB. Acquired data: CEP
JN GDS LJP LJB SJL DAL. Interpreted results and reviewed manuscript:
YYW NJT KT CEP JN GDS LJP LJB SJL. Approved manuscript for
submission: NMW LDH YYW NJT KT CEP JN GDS LJP LJB SJL DAL
LB.
References
1. Maes HH, Neale MC, Eaves LJ (1997) Genetic and environmental factors in relative body weight and human adiposity. Behav Genet 27: 325–351.
2. Haworth CM, Carnell S, Meaburn EL, Davis OS, Plomin R, et al. (2008) Increasing heritability of BMI and stronger associations with the FTO gene over childhood. Obesity (Silver Spring) 16: 2663–2668.
3. Wardle J, Carnell S, Haworth CM, Plomin R (2008) Evidence for a strong genetic influence on childhood adiposity despite the force of the obesogenic environment. Am J Clin Nutr 87: 398–404.
4. Parsons TJ, Power C, Logan S, Summerbell CD (1999) Childhood predictors of adult obesity: a systematic review. Int J Obes Relat Metab Disord 23 Suppl 8: S1–107.
5. Speliotes EK, Willer CJ, Berndt SI, Monda KL, Thorleifsson G, et al. (2010) Association analyses of 249,796 individuals reveal 18 new loci associated with body mass index. Nat Genet 42: 937–948.
6. Liu JZ, Medland SE, Wright MJ, Henders AK, Heath AC, et al. (2010) Genome-wide association study of height and body mass index in Australian twin families. Twin Res Hum Genet 13: 179–193.
7. Thorleifsson G, Walters GB, Gudbjartsson DF, Steinthorsdottir V, Sulem P, et al. (2009) Genome-wide association yields new sequence variants at seven loci that associate with measures of obesity. Nat Genet 41: 18–24.
8. Willer CJ, Speliotes EK, Loos RJ, Li S, Lindgren CM, et al. (2009) Six new loci associated with body mass index highlight a neuronal influence on body weight regulation. Nat Genet 41: 25–34.
9. Loos RJ, Lindgren CM, Li S, Wheeler E, Zhao JH, et al. (2008) Common variants near MC4R are associated with fat mass, weight and risk of obesity. Nat Genet 40: 768–775.
10. Fox CS, Heard-Costa N, Cupples LA, Dupuis J, Vasan RS, et al. (2007) Genome-wide association to body mass index and waist circumference: the Framingham Heart Study 100K project. BMC Med Genet 8 Suppl 1: S18.
11. Frayling TM, Timpson NJ, Weedon MN, Zeggini E, Freathy RM, et al. (2007) A common variant in the FTO gene is associated with body mass index and predisposes to childhood and adult obesity. Science 316: 889–894.
12. Bradfield JP, Taal HR, Timpson NJ, Scherag A, Lecoeur C, et al. (2012) A genome-wide association meta-analysis identifies new childhood obesity loci. Nat Genet 44: 526–531.
13. Elks CE, Loos RJ, Sharp SJ, Langenberg C, Ring SM, et al. (2010) Genetic markers of adult obesity risk are associated with greater early infancy weight gain and growth. PLoS Med 7: e1000284.
14. Mei H, Chen W, Jiang F, He J, Srinivasan S, et al. (2012) Longitudinal replication studies of GWAS risk SNPs influencing body mass index over the course of childhood and adulthood. PLoS One 7: e31470.
15. Belsky DW, Moffitt TE, Houts R, Bennett GG, Biddle AK, et al. (2012) Polygenic risk, rapid childhood growth, and the development of obesity: evidence from a 4-decade longitudinal study. Arch Pediatr Adolesc Med 166: 515–521.
16. Hardy R, Wills AK, Wong A, Elks CE, Wareham NJ, et al. (2010) Life course variations in the associations between FTO and MC4R gene variants and body size. Hum Mol Genet 19: 545–552.
17. Sovio U, Mook-Kanamori DO, Warrington NM, Lawrence R, Briollais L, et al. (2011) Association between common variation at the FTO locus and changes in body mass index from infancy to late childhood: the complex nature of genetic association through growth and development. PLoS Genet 7: e1001307.
18. Silverwood RJ, De Stavola BL, Cole TJ, Leon DA (2009) BMI peak in infancy as a predictor for later BMI in the Uppsala Family Study. Int J Obes (Lond) 33: 929–937.
19. Adair LS (2008) Child and adolescent obesity: epidemiology and developmental perspectives. Physiol Behav 94: 8–16.
20. Dietz WH (1994) Critical periods in childhood for the development of obesity. Am J Clin Nutr 59: 955–959.
21. He Q, Karlberg J (2002) Probability of adult overweight and risk change during the BMI rebound period. Obes Res 10: 135–140.
22. Rolland-Cachera MF, Deheeger M, Bellisle F, Sempe M, Guilloud-Bataille M, et al. (1984) Adiposity rebound in children: a simple indicator for predicting obesity. Am J Clin Nutr 39: 129–135.
23. Rolland-Cachera MF, Deheeger M, Maillot M, Bellisle F (2006) Early adiposity rebound: causes and consequences for obesity in children and adults. Int J Obes (Lond) 30 Suppl 4: S11–17.
24. Whitaker RC, Pepe MS, Wright JA, Seidel KD, Dietz WH (1998) Early adiposity rebound and the risk of adult obesity. Pediatrics 101: E5.
25. Sovio U, Timpson NJ, Warrington NM, Briollais L, Mook-Kanamori D, et al. (2009) Association Between FTO Polymorphism, Adiposity Peak and Adiposity Rebound in The Northern Finland Birth Cohort 1966. Atherosclerosis 207: e4–e5.
26. Elks CE, Perry JR, Sulem P, Chasman DI, Franceschini N, et al. (2010) Thirty new loci for age at menarche identified by a meta-analysis of genome-wide association studies. Nat Genet 42: 1077–1085.
27. Dvornyk V, Waqar ul H (2012) Genetics of age at menarche: a systematic review. Hum Reprod Update 18: 198–210.
28. Wen X, Kleinman K, Gillman MW, Rifas-Shiman SL, Taveras EM (2012) Childhood body mass index trajectories: modeling, characterizing, pairwise correlations and socio-demographic predictors of trajectory characteristics. BMC Med Res Methodol 12: 38.
29. Zillikens MC, Yazdanpanah M, Pardo LM, Rivadeneira F, Aulchenko YS, et al. (2008) Sex-specific genetic effects influence variation in body composition. Diabetologia 51: 2233–2241.
30. Comuzzie AG, Blangero J, Mahaney MC, Mitchell BD, Stern MP, et al. (1993) Quantitative genetics of sexual dimorphism in body fat measurements. American Journal of Human Biology 5: 725–734.
31. Boyd A, Golding J, Macleod J, Lawlor DA, Fraser A, et al. (2012) Cohort Profile: The ’Children of the 90s’--the index offspring of the Avon Longitudinal Study of Parents and Children. Int J Epidemiol.
32. Newnham JP, Evans SF, Michael CA, Stanley FJ, Landau LI (1993) Effects of frequent ultrasound during pregnancy: a randomised controlled trial. Lancet 342: 887–891.
33. Williams LA, Evans SF, Newnham JP (1997) Prospective cohort study of factors influencing the relative weights of the placenta and the newborn infant. BMJ 314: 1864–1868.
34. Evans S, Newnham J, MacDonald W, Hall C (1996) Characterisation of the possible effect on birthweight following frequent prenatal ultrasound examinations. Early Hum Dev 45: 203–214.
35. Paternoster L, Zhurov AI, Toma AM, Kemp JP, St Pourcain B, et al. (2012) Genome-wide association study of three-dimensional facial morphology identifies a variant in PAX3 associated with nasion position. Am J Hum Genet 90: 478–485.
36. Taal HR, St Pourcain B, Thiering E, Das S, Mook-Kanamori DO, et al. (2012) Common variants at 12q15 and 12q24 are associated with infant head circumference. Nat Genet 44: 532–538.
37. Janssens AC, Aulchenko YS, Elefante S, Borsboom GJ, Steyerberg EW, et al. (2006) Predictive testing for complex diseases using multiple genes: fact or fiction? Genet Med 8: 395–400.
38. Warrington NM, Wu YY, Pennell CE, Marsh JA, Beilin LJ, et al. (2013) Modelling BMI Trajectories in Children for Genetic Association Studies. PLoS One 8: e53897.
39. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, et al. (2006) Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38: 904–909.
40. Xu R (2003) Measuring explained variation in linear mixed effects models. Stat Med 22: 3527–3541.
41. Ihaka R, Gentleman R (1996) R: a language for data analysis and graphics. Journal of Computational and Graphical Statistics 5: 299–314.
42. Lango Allen H, Estrada K, Lettre G, Berndt SI, Weedon MN, et al. (2010) Hundreds of variants clustered in genomic loci and biological pathways affect human height. Nature 467: 832–838.
43. Bhargava SK, Sachdev HS, Fall CH, Osmond C, Lakshmy R, et al. (2004) Relation of serial changes in childhood body-mass index to impaired glucose tolerance in young adulthood. N Engl J Med 350: 865–875.
44. Eriksson JG, Forsen T, Tuomilehto J, Osmond C, Barker DJ (2003) Early adiposity rebound in childhood and risk of Type 2 diabetes in adult life. Diabetologia 46: 190–194.
45. Taylor RW, Grant AM, Goulding A, Williams SM (2005) Early adiposity rebound: review of papers linking this to subsequent obesity in children and adults. Curr Opin Clin Nutr Metab Care 8: 607–612.
46. Dubois L, Girad M (2007) Accuracy of maternal reports of pre-schoolers’ weights and heights as estimates of BMI values. Int J Epidemiol 36: 132–138.
Appendix F: Additional Results from Allelic Score Analysis in Chapter Six
Table 1: Results for females of the 32 individual adult BMI associated SNPs with BMI trajectories in both cohorts and the combined meta-analysis; k1, k2 and k3 represent
knot point 1, 2 and 3 respectively.
| Gene | Term | ALSPAC Beta | ALSPAC SE | ALSPAC P | ALSPAC LRT P | Raine Beta | Raine SE | Raine P | Raine LRT P | Combined P-value | Combined Bonferroni-corrected P-value |
|---|---|---|---|---|---|---|---|---|---|---|---|
| NEGR1 | rs2568958 | 0.0042 | 0.0032 | 0.19 | 0.23 | 0.0090 | 0.0072 | 0.21 | 0.17 | 0.16 | 1.00 |
| | (Age-8):rs2568958 | 0.0005 | 0.0005 | 0.31 | | 0.0013 | 0.0012 | 0.29 | | | |
| | (Age-8)^2:rs2568958 | -0.0006 | 0.0005 | 0.27 | | 0.0019 | 0.0016 | 0.24 | | | |
| | (Age-8)^3:rs2568958 | -0.0002 | 0.0002 | 0.35 | | 0.0011 | 0.0008 | 0.18 | | | |
| | (Age-8)^3:k2:rs2568958 | 0.0005 | 0.0005 | 0.39 | | -0.0023 | 0.0016 | 0.14 | | | |
| | (Age-8)^3:k3:rs2568958 | -0.0004 | 0.0012 | 0.72 | | 0.0027 | 0.0020 | 0.17 | | | |
| | (Age-8)^3:k1:rs2568958 | 0.0500 | 0.0282 | 0.08 | | 0.1762 | 0.0924 | 0.06 | | | |
| TNNI3K | rs1514175 | 0.0109 | 0.0032 | 0.00 | 4.8x10^-3 | 0.0165 | 0.0071 | 0.02 | 0.47 | 0.02 | 0.52 |
| | (Age-8):rs1514175 | 0.0019 | 0.0005 | 0.00 | | 0.0011 | 0.0012 | 0.34 | | | |
| | (Age-8)^2:rs1514175 | 1.35x10^-5 | 0.0005 | 0.98 | | -0.0002 | 0.0015 | 0.89 | | | |
| | (Age-8)^3:rs1514175 | -0.0001 | 0.0002 | 0.81 | | 0.0000 | 0.0008 | 0.97 | | | |
| | (Age-8)^3:k2:rs1514175 | -0.0001 | 0.0005 | 0.79 | | 0.0001 | 0.0015 | 0.97 | | | |
| | (Age-8)^3:k3:rs1514175 | 0.0008 | 0.0011 | 0.50 | | -0.0002 | 0.0019 | 0.91 | | | |
| | (Age-8)^3:k1:rs1514175 | -0.0165 | 0.0285 | 0.56 | | -0.0114 | 0.0888 | 0.90 | | | |
| PTBP2 | rs1555543 | 0.0011 | 0.0032 | 0.73 | 0.86 | 0.0134 | 0.0073 | 0.07 | 0.15 | 0.38 | 1.00 |
| | (Age-8):rs1555543 | 0.0003 | 0.0005 | 0.57 | | 0.0031 | 0.0012 | 0.01 | | | |
| | (Age-8)^2:rs1555543 | 0.0001 | 0.0005 | 0.92 | | 0.0012 | 0.0016 | 0.46 | | | |
| | (Age-8)^3:rs1555543 | 0.0001 | 0.0002 | 0.72 | | 0.0004 | 0.0008 | 0.58 | | | |
| | (Age-8)^3:k2:rs1555543 | -0.0002 | 0.0005 | 0.67 | | -0.0015 | 0.0016 | 0.36 | | | |
| | (Age-8)^3:k3:rs1555543 | 0.0008 | 0.0011 | 0.49 | | 0.0029 | 0.0020 | 0.16 | | | |
| | (Age-8)^3:k1:rs1555543 | 0.0180 | 0.0289 | 0.53 | | 0.0876 | 0.0929 | 0.35 | | | |
| SEC16B | rs543874 | 0.0074 | 0.0038 | 0.05 | 0.03 | 0.0260 | 0.0087 | 0.00 | 0.09 | 0.02 | 0.60 |
| | (Age-8):rs543874 | 0.0016 | 0.0006 | 0.01 | | 0.0033 | 0.0014 | 0.02 | | | |
| | (Age-8)^2:rs543874 | -0.0007 | 0.0006 | 0.31 | | 0.0001 | 0.0019 | 0.95 | | | |
| | (Age-8)^3:rs543874 | -0.0003 | 0.0003 | 0.28 | | 0.0000 | 0.0010 | 0.97 | | | |
| | (Age-8)^3:k2:rs543874 | 0.0006 | 0.0006 | 0.33 | | -0.0005 | 0.0019 | 0.81 | | | |
| | (Age-8)^3:k3:rs543874 | -0.0013 | 0.0014 | 0.36 | | 0.0014 | 0.0024 | 0.55 | | | |
| | (Age-8)^3:k1:rs543874 | -0.0123 | 0.0350 | 0.72 | | -0.0357 | 0.1148 | 0.76 | | | |
| TMEM18 | rs2867125 | 0.0065 | 0.0041 | 0.12 | 0.01 | 0.0124 | 0.0090 | 0.17 | 0.85 | 0.05 | 1.00 |
| | (Age-8):rs2867125 | 0.0011 | 0.0007 | 0.11 | | 0.0002 | 0.0015 | 0.91 | | | |
| | (Age-8)^2:rs2867125 | 0.0006 | 0.0007 | 0.41 | | -0.0021 | 0.0020 | 0.29 | | | |
| | (Age-8)^3:rs2867125 | 4.25x10^-5 | 0.0003 | 0.89 | | -0.0009 | 0.0010 | 0.35 | | | |
| | (Age-8)^3:k2:rs2867125 | -0.0003 | 0.0007 | 0.68 | | 0.0021 | 0.0020 | 0.29 | | | |
| | (Age-8)^3:k3:rs2867125 | -0.0002 | 0.0015 | 0.90 | | -0.0024 | 0.0024 | 0.33 | | | |
| | (Age-8)^3:k1:rs2867125 | -0.0244 | 0.0362 | 0.50 | | -0.0841 | 0.1153 | 0.47 | | | |
| RBJ, ADCY3, POMC | rs713586 | 0.0152 | 0.0031 | 0.00 | 4.8x10^-10 | 0.0154 | 0.0068 | 0.02 | 0.18 | 2.1x10^-9 | 6.73x10^-8 |
| | (Age-8):rs713586 | 0.0015 | 0.0005 | 0.00 | | 0.0012 | 0.0011 | 0.29 | | | |
| | (Age-8)^2:rs713586 | 0.0002 | 0.0005 | 0.73 | | -0.0006 | 0.0015 | 0.69 | | | |
| | (Age-8)^3:rs713586 | -0.0001 | 0.0002 | 0.67 | | -0.0003 | 0.0008 | 0.68 | | | |
| | (Age-8)^3:k2:rs713586 | -0.0002 | 0.0005 | 0.70 | | 0.0002 | 0.0015 | 0.90 | | | |
| | (Age-8)^3:k3:rs713586 | 0.0012 | 0.0011 | 0.26 | | 0.0009 | 0.0019 | 0.63 | | | |
| | (Age-8)^3:k1:rs713586 | -0.0206 | 0.0276 | 0.45 | | -0.0222 | 0.0896 | 0.80 | | | |
| FANCL | rs887912 | 0.0023 | 0.0035 | 0.51 | 0.80 | -0.0002 | 0.0076 | 0.97 | 0.32 | 0.60 | 1.00 |
| | (Age-8):rs887912 | 0.0004 | 0.0006 | 0.50 | | 0.0031 | 0.0013 | 0.02 | | | |
| | (Age-8)^2:rs887912 | -0.0004 | 0.0006 | 0.51 | | 0.0015 | 0.0017 | 0.39 | | | |
| | (Age-8)^3:rs887912 | -0.0002 | 0.0003 | 0.37 | | 0.0003 | 0.0009 | 0.70 | | | |
| | (Age-8)^3:k2:rs887912 | 0.0005 | 0.0006 | 0.41 | | -0.0015 | 0.0017 | 0.38 | | | |
| | (Age-8)^3:k3:rs887912 | -0.0012 | 0.0012 | 0.31 | | 0.0026 | 0.0021 | 0.21 | | | |
| | (Age-8)^3:k1:rs887912 | -0.0384 | 0.0329 | 0.24 | | 0.0145 | 0.1034 | 0.89 | | | |
| CADM2 | rs13078807 | 0.0142 | 0.0039 | 0.00 | 1.5x10^-3 | 0.0121 | 0.0090 | 0.18 | 0.09 | 1.4x10^-3 | 0.04 |
| | (Age-8):rs13078807 | 0.0020 | 0.0006 | 0.00 | | -0.0009 | 0.0015 | 0.55 | | | |
| | (Age-8)^2:rs13078807 | -0.0006 | 0.0007 | 0.33 | | -0.0001 | 0.0020 | 0.94 | | | |
| | (Age-8)^3:rs13078807 | -0.0002 | 0.0003 | 0.46 | | 0.0004 | 0.0010 | 0.69 | | | |
| | (Age-8)^3:k2:rs13078807 | 0.0002 | 0.0007 | 0.78 | | 0.0003 | 0.0020 | 0.90 | | | |
| | (Age-8)^3:k3:rs13078807 | 0.0004 | 0.0014 | 0.77 | | -0.0021 | 0.0025 | 0.41 | | | |
| | (Age-8)^3:k1:rs13078807 | 0.0470 | 0.0348 | 0.18 | | 0.1199 | 0.1147 | 0.30 | | | |
| ETV5, DGKG, SFRS10 | rs7647305 | 0.0104 | 0.0040 | 0.01 | 0.02 | 0.0027 | 0.0090 | 0.76 | 0.62 | 0.08 | 1.00 |
| | (Age-8):rs7647305 | 0.0016 | 0.0007 | 0.01 | | 0.0006 | 0.0015 | 0.71 | | | |
| | (Age-8)^2:rs7647305 | 0.0002 | 0.0007 | 0.80 | | -0.0019 | 0.0020 | 0.33 | | | |
| | (Age-8)^3:rs7647305 | 3.56x10^-5 | 0.0003 | 0.90 | | -0.0013 | 0.0010 | 0.17 | | | |
| | (Age-8)^3:k2:rs7647305 | -0.0004 | 0.0007 | 0.56 | | 0.0023 | 0.0020 | 0.24 | | | |
| | (Age-8)^3:k3:rs7647305 | 0.0009 | 0.0014 | 0.54 | | -0.0021 | 0.0025 | 0.40 | | | |
| | (Age-8)^3:k1:rs7647305 | 0.0528 | 0.0364 | 0.15 | | -0.1877 | 0.1118 | 0.09 | | | |
| SLC39A8 | rs13107325 | 0.0167 | 0.0060 | 0.01 | 4.9x10^-4 | 0.0138 | 0.0132 | 0.30 | 0.88 | 3.8x10^-3 | 0.12 |
| | (Age-8):rs13107325 | 0.0033 | 0.0010 | 0.00 | | -0.0014 | 0.0022 | 0.53 | | | |
| | (Age-8)^2:rs13107325 | 0.0009 | 0.0010 | 0.36 | | -0.0014 | 0.0029 | 0.62 | | | |
| | (Age-8)^3:rs13107325 | 3.85x10^-5 | 0.0004 | 0.93 | | -0.0004 | 0.0015 | 0.77 | | | |
| | (Age-8)^3:k2:rs13107325 | -0.0010 | 0.0010 | 0.35 | | 0.0014 | 0.0029 | 0.63 | | | |
| | (Age-8)^3:k3:rs13107325 | 0.0036 | 0.0023 | 0.11 | | -0.0026 | 0.0036 | 0.47 | | | |
| | (Age-8)^3:k1:rs13107325 | 0.0080 | 0.0539 | 0.88 | | -0.0111 | 0.1777 | 0.95 | | | |
| FLJ35779, HMGCR | rs2112347 | 0.0029 | 0.0033 | 0.38 | 0.16 | 0.0090 | 0.0075 | 0.23 | 0.77 | 0.38 | 1.00 |
| | (Age-8):rs2112347 | 0.0011 | 0.0005 | 0.03 | | 0.0012 | 0.0012 | 0.34 | | | |
| | (Age-8)^2:rs2112347 | -0.0003 | 0.0006 | 0.65 | | -0.0021 | 0.0016 | 0.19 | | | |
| | (Age-8)^3:rs2112347 | -0.0003 | 0.0002 | 0.16 | | -0.0011 | 0.0008 | 0.20 | | | |
| | (Age-8)^3:k2:rs2112347 | 0.0002 | 0.0005 | 0.65 | | 0.0021 | 0.0016 | 0.21 | | | |
| | (Age-8)^3:k3:rs2112347 | 0.0007 | 0.0012 | 0.57 | | -0.0022 | 0.0020 | 0.27 | | | |
| | (Age-8)^3:k1:rs2112347 | -0.0533 | 0.0297 | 0.07 | | -0.0972 | 0.0968 | 0.32 | | | |
| ZNF608 | rs4836133 | 0.0033 | 0.0032 | 0.31 | 0.68 | 0.0022 | 0.0072 | 0.76 | 0.15 | 0.33 | 1.00 |
| | (Age-8):rs4836133 | -0.0002 | 0.0005 | 0.77 | | -0.0024 | 0.0012 | 0.05 | | | |
| | (Age-8)^2:rs4836133 | -0.0001 | 0.0005 | 0.88 | | -0.0039 | 0.0016 | 0.01 | | | |
| | (Age-8)^3:rs4836133 | 3.55x10^-5 | 0.0002 | 0.88 | | -0.0014 | 0.0008 | 0.08 | | | |
| | (Age-8)^3:k2:rs4836133 | -0.0001 | 0.0005 | 0.85 | | 0.0037 | 0.0016 | 0.02 | | | |
| | (Age-8)^3:k3:rs4836133 | 0.0009 | 0.0012 | 0.46 | | -0.0051 | 0.0020 | 0.01 | | | |
| | (Age-8)^3:k1:rs4836133 | -0.0056 | 0.0293 | 0.85 | | -0.0687 | 0.0946 | 0.47 | | | |
| TFAP2B | rs987237 | 0.0118 | 0.0041 | 0.00 | 0.02 | 0.0091 | 0.0094 | 0.33 | 0.05 | 0.01 | 0.23 |
| | (Age-8):rs987237 | 0.0014 | 0.0007 | 0.05 | | -0.0009 | 0.0015 | 0.55 | | | |
| | (Age-8)^2:rs987237 | 0.0004 | 0.0007 | 0.59 | | -0.0009 | 0.0020 | 0.67 | | | |
| | (Age-8)^3:rs987237 | 0.0002 | 0.0003 | 0.45 | | -0.0001 | 0.0010 | 0.93 | | | |
| | (Age-8)^3:k2:rs987237 | -0.0009 | 0.0007 | 0.22 | | 0.0011 | 0.0021 | 0.58 | | | |
| | (Age-8)^3:k3:rs987237 | 0.0031 | 0.0015 | 0.04 | | -0.0041 | 0.0025 | 0.10 | | | |
| | (Age-8)^3:k1:rs987237 | 0.0518 | 0.0368 | 0.16 | | 0.0998 | 0.1203 | 0.41 | | | |
| LRRN6C | rs10968576 | -0.0003 | 0.0034 | 0.93 | 0.08 | -0.0075 | 0.0078 | 0.33 | 0.77 | 0.23 | 1.00 |
| | (Age-8):rs10968576 | -0.0006 | 0.0006 | 0.30 | | -0.0013 | 0.0013 | 0.32 | | | |
| | (Age-8)^2:rs10968576 | -0.0017 | 0.0006 | 0.00 | | 0.0001 | 0.0017 | 0.94 | | | |
| | (Age-8)^3:rs10968576 | -0.0008 | 0.0003 | 0.00 | | 0.0002 | 0.0009 | 0.84 | | | |
| | (Age-8)^3:k2:rs10968576 | 0.0017 | 0.0006 | 0.00 | | 0.0001 | 0.0017 | 0.94 | | | |
| | (Age-8)^3:k3:rs10968576 | -0.0026 | 0.0012 | 0.04 | | -0.0015 | 0.0021 | 0.48 | | | |
| | (Age-8)^3:k1:rs10968576 | -0.0315 | 0.0309 | 0.31 | | 0.0564 | 0.1051 | 0.59 | | | |
| LMX1B | rs867559 | 0.0002 | 0.0039 | 0.95 | 0.32 | 0.0071 | 0.0092 | 0.44 | 0.15 | 0.19 | 1.00 |
| | (Age-8):rs867559 | 0.0014 | 0.0006 | 0.03 | | 0.0036 | 0.0015 | 0.02 | | | |
| | (Age-8)^2:rs867559 | 0.0003 | 0.0007 | 0.60 | | 0.0019 | 0.0020 | 0.34 | | | |
| | (Age-8)^3:rs867559 | 0.0001 | 0.0003 | 0.78 | | 0.0002 | 0.0010 | 0.87 | | | |
| | (Age-8)^3:k2:rs867559 | -0.0005 | 0.0006 | 0.40 | | -0.0018 | 0.0020 | 0.36 | | | |
| | (Age-8)^3:k3:rs867559 | 0.0018 | 0.0014 | 0.21 | | 0.0043 | 0.0025 | 0.09 | | | |
| | (Age-8)^3:k1:rs867559 | 0.0272 | 0.0356 | 0.44 | | -0.1023 | 0.1211 | 0.40 | | | |
| RPL27A, TUB | rs4929949 | 0.0058 | 0.0032 | 0.07 | 0.03 | 0.0050 | 0.0072 | 0.49 | 0.40 | 0.06 | 1.00 |
| | (Age-8):rs4929949 | 0.0002 | 0.0005 | 0.76 | | 0.0015 | 0.0012 | 0.22 | | | |
| | (Age-8)^2:rs4929949 | -0.0014 | 0.0005 | 0.01 | | 0.0009 | 0.0016 | 0.56 | | | |
| | (Age-8)^3:rs4929949 | -0.0007 | 0.0002 | 0.00 | | 0.0005 | 0.0008 | 0.56 | | | |
| | (Age-8)^3:k2:rs4929949 | 0.0014 | 0.0005 | 0.01 | | -0.0014 | 0.0016 | 0.38 | | | |
| | (Age-8)^3:k3:rs4929949 | -0.0013 | 0.0012 | 0.25 | | 0.0027 | 0.0020 | 0.18 | | | |
| | (Age-8)^3:k1:rs4929949 | -0.0500 | 0.0298 | 0.09 | | 0.0980 | 0.0923 | 0.29 | | | |
| BDNF | rs6265 | 0.0105 | 0.0040 | 0.01 | 0.07 | 0.0088 | 0.0090 | 0.33 | 0.03 | 0.01 | 0.43 |
| | (Age-8):rs6265 | 0.0013 | 0.0007 | 0.04 | | 0.0050 | 0.0015 | 0.00 | | | |
| | (Age-8)^2:rs6265 | 0.0001 | 0.0007 | 0.87 | | 0.0016 | 0.0020 | 0.43 | | | |
| | (Age-8)^3:rs6265 | 0.0001 | 0.0003 | 0.76 | | -0.0002 | 0.0010 | 0.86 | | | |
| | (Age-8)^3:k2:rs6265 | -0.0003 | 0.0007 | 0.70 | | -0.0015 | 0.0020 | 0.47 | | | |
| | (Age-8)^3:k3:rs6265 | 0.0001 | 0.0015 | 0.93 | | 0.0043 | 0.0025 | 0.09 | | | |
| | (Age-8)^3:k1:rs6265 | -0.0450 | 0.0387 | 0.24 | | -0.1909 | 0.1164 | 0.10 | | | |
| MTCH2, NDUFS3, CUGBP1 | rs3817334 | 0.0071 | 0.0032 | 0.02 | 1.7x10^-4 | -0.0062 | 0.0070 | 0.38 | 0.91 | 1.5x10^-3 | 0.05 |
| | (Age-8):rs3817334 | 0.0027 | 0.0005 | 0.00 | | -0.0007 | 0.0012 | 0.55 | | | |
| | (Age-8)^2:rs3817334 | 0.0005 | 0.0005 | 0.37 | | -0.0011 | 0.0015 | 0.47 | | | |
| | (Age-8)^3:rs3817334 | -6.92x10^-6 | 0.0002 | 0.98 | | -0.0005 | 0.0008 | 0.47 | | | |
| | (Age-8)^3:k2:rs3817334 | -0.0007 | 0.0005 | 0.17 | | 0.0013 | 0.0015 | 0.40 | | | |
| | (Age-8)^3:k3:rs3817334 | 0.0028 | 0.0011 | 0.01 | | -0.0019 | 0.0019 | 0.33 | | | |
| | (Age-8)^3:k1:rs3817334 | -0.0310 | 0.0286 | 0.28 | | -0.0462 | 0.0881 | 0.60 | | | |
| FAIM2 | rs7138803 | 0.0093 | 0.0032 | 0.00 | 0.09 | 0.0108 | 0.0072 | 0.14 | 0.19 | 0.09 | 1.00 |
| | (Age-8):rs7138803 | 0.0008 | 0.0005 | 0.14 | | 0.0023 | 0.0012 | 0.06 | | | |
| | (Age-8)^2:rs7138803 | -0.0003 | 0.0005 | 0.58 | | 0.0002 | 0.0016 | 0.90 | | | |
| | (Age-8)^3:rs7138803 | -0.0001 | 0.0002 | 0.55 | | -0.0001 | 0.0008 | 0.90 | | | |
| | (Age-8)^3:k2:rs7138803 | 0.0002 | 0.0005 | 0.76 | | 0.0000 | 0.0016 | 0.99 | | | |
| | (Age-8)^3:k3:rs7138803 | 0.0006 | 0.0011 | 0.61 | | -0.0006 | 0.0020 | 0.78 | | | |
| | (Age-8)^3:k1:rs7138803 | -0.0372 | 0.0298 | 0.21 | | -0.0624 | 0.0939 | 0.51 | | | |
| MTIF3, GTF3A | rs4771122 | -0.0019 | 0.0038 | 0.61 | 0.30 | 0.0212 | 0.0088 | 0.02 | 0.15 | 0.19 | 1.00 |
| | (Age-8):rs4771122 | 0.0010 | 0.0006 | 0.10 | | 0.0033 | 0.0015 | 0.03 | | | |
| | (Age-8)^2:rs4771122 | 0.0014 | 0.0006 | 0.03 | | -0.0011 | 0.0019 | 0.58 | | | |
| | (Age-8)^3:rs4771122 | 0.0006 | 0.0003 | 0.04 | | -0.0004 | 0.0010 | 0.67 | | | |
| | (Age-8)^3:k2:rs4771122 | -0.0015 | 0.0006 | 0.02 | | 0.0005 | 0.0019 | 0.79 | | | |
| | (Age-8)^3:k3:rs4771122 | 0.0028 | 0.0014 | 0.04 | | 0.0002 | 0.0024 | 0.93 | | | |
| | (Age-8)^3:k1:rs4771122 | 0.0447 | 0.0343 | 0.19 | | 0.0439 | 0.1140 | 0.70 | | | |
| PRKD1 | rs11847697 | 0.0023 | 0.0077 | 0.76 | 0.43 | -0.0043 | 0.0185 | 0.81 | 0.59 | 0.60 | 1.00 |
| | (Age-8):rs11847697 | -7.63x10^-6 | 0.0013 | 1.00 | | -0.0029 | 0.0031 | 0.34 | | | |
| | (Age-8)^2:rs11847697 | -0.0014 | 0.0013 | 0.27 | | -0.0041 | 0.0040 | 0.31 | | | |
| | (Age-8)^3:rs11847697 | -0.0008 | 0.0006 | 0.18 | | -0.0023 | 0.0021 | 0.26 | | | |
| | (Age-8)^3:k2:rs11847697 | 0.0015 | 0.0013 | 0.24 | | 0.0051 | 0.0041 | 0.21 | | | |
| | (Age-8)^3:k3:rs11847697 | -0.0029 | 0.0029 | 0.31 | | -0.0071 | 0.0050 | 0.15 | | | |
| | (Age-8)^3:k1:rs11847697 | 0.0513 | 0.0709 | 0.47 | | -0.3977 | 0.2618 | 0.13 | | | |
| NRXN3 | rs10150332 | 0.0001 | 0.0039 | 0.97 | 0.05 | -0.0016 | 0.0084 | 0.85 | 0.01 | 3.3x10^-3 | 0.11 |
| | (Age-8):rs10150332 | 0.0019 | 0.0006 | 0.00 | | -0.0032 | 0.0014 | 0.02 | | | |
| | (Age-8)^2:rs10150332 | 0.0006 | 0.0006 | 0.34 | | -0.0033 | 0.0018 | 0.07 | | | |
| | (Age-8)^3:rs10150332 | 1.96x10^-5 | 0.0003 | 0.94 | | -0.0013 | 0.0009 | 0.14 | | | |
| | (Age-8)^3:k2:rs10150332 | -0.0006 | 0.0006 | 0.37 | | 0.0040 | 0.0018 | 0.03 | | | |
| | (Age-8)^3:k3:rs10150332 | 0.0022 | 0.0014 | 0.11 | | -0.0074 | 0.0023 | 0.00 | | | |
| | (Age-8)^3:k1:rs10150332 | 0.0177 | 0.0344 | 0.61 | | -0.1183 | 0.1037 | 0.25 | | | |
| MAP2K5, LBXCOR1 | rs2241423 | 0.0057 | 0.0038 | 0.13 | 0.13 | 0.0113 | 0.0086 | 0.19 | 0.21 | 0.13 | 1.00 |
| | (Age-8):rs2241423 | 0.0013 | 0.0006 | 0.03 | | 0.0022 | 0.0014 | 0.13 | | | |
| | (Age-8)^2:rs2241423 | -0.0007 | 0.0006 | 0.28 | | -0.0014 | 0.0019 | 0.46 | | | |
| | (Age-8)^3:rs2241423 | -0.0005 | 0.0003 | 0.10 | | -0.0008 | 0.0009 | 0.38 | | | |
| | (Age-8)^3:k2:rs2241423 | 0.0008 | 0.0006 | 0.22 | | 0.0011 | 0.0019 | 0.56 | | | |
| | (Age-8)^3:k3:rs2241423 | -0.0011 | 0.0013 | 0.38 | | -0.0005 | 0.0024 | 0.82 | | | |
| | (Age-8)^3:k1:rs2241423 | -0.0168 | 0.0345 | 0.63 | | 0.0046 | 0.1060 | 0.97 | | | |
| GPRC5B, IQCK | rs12444979 | 0.0050 | 0.0046 | 0.27 | 0.01 | 0.0151 | 0.0097 | 0.12 | 0.33 | 0.02 | 0.65 |
| | (Age-8):rs12444979 | -0.0003 | 0.0007 | 0.70 | | 0.0040 | 0.0016 | 0.01 | | | |
| | (Age-8)^2:rs12444979 | 0.0018 | 0.0008 | 0.02 | | -0.0005 | 0.0021 | 0.80 | | | |
| | (Age-8)^3:rs12444979 | 0.0010 | 0.0003 | 0.00 | | -0.0005 | 0.0011 | 0.61 | | | |
| | (Age-8)^3:k2:rs12444979 | -0.0019 | 0.0008 | 0.01 | | 0.0002 | 0.0021 | 0.91 | | | |
| | (Age-8)^3:k3:rs12444979 | 0.0009 | 0.0016 | 0.58 | | 0.0015 | 0.0027 | 0.58 | | | |
| | (Age-8)^3:k1:rs12444979 | 0.0574 | 0.0423 | 0.17 | | -0.0929 | 0.1194 | 0.44 | | | |
| SH2B1, ATXN2L, TUFM, ATP2A1 | rs7359397 | -0.0007 | 0.0032 | 0.83 | 0.33 | 0.0041 | 0.0071 | 0.57 | 0.67 | 0.55 | 1.00 |
| | (Age-8):rs7359397 | 0.0004 | 0.0005 | 0.41 | | 0.0003 | 0.0012 | 0.83 | | | |
| | (Age-8)^2:rs7359397 | -3.05x10^-5 | 0.0005 | 0.95 | | -0.0005 | 0.0016 | 0.76 | | | |
| | (Age-8)^3:rs7359397 | -0.0001 | 0.0002 | 0.67 | | -0.0003 | 0.0008 | 0.69 | | | |
| | (Age-8)^3:k2:rs7359397 | 0.0001 | 0.0005 | 0.78 | | 0.0007 | 0.0016 | 0.66 | | | |
| | (Age-8)^3:k3:rs7359397 | -0.0011 | 0.0011 | 0.33 | | -0.0009 | 0.0019 | 0.63 | | | |
| | (Age-8)^3:k1:rs7359397 | -0.0140 | 0.0285 | 0.62 | | -0.1020 | 0.0937 | 0.28 | | | |
| FTO | rs9939609 | 0.0108 | 0.0032 | 0.00 | 3.8x10^-10 | 0.0122 | 0.0073 | 0.09 | 0.26 | 2.4x10^-9 | 7.7x10^-8 |
| | (Age-8):rs9939609 | 0.0036 | 0.0005 | 0.00 | | 0.0007 | 0.0012 | 0.57 | | | |
| | (Age-8)^2:rs9939609 | 0.0003 | 0.0005 | 0.63 | | 0.0010 | 0.0016 | 0.51 | | | |
| | (Age-8)^3:rs9939609 | -3.70x10^-5 | 0.0002 | 0.87 | | 0.0007 | 0.0008 | 0.40 | | | |
| | (Age-8)^3:k2:rs9939609 | -0.0006 | 0.0005 | 0.29 | | -0.0014 | 0.0016 | 0.38 | | | |
| | (Age-8)^3:k3:rs9939609 | 0.0017 | 0.0011 | 0.12 | | 0.0021 | 0.0020 | 0.30 | | | |
| | (Age-8)^3:k1:rs9939609 | -0.0156 | 0.0289 | 0.59 | | 0.0201 | 0.0957 | 0.83 | | | |
| MC4R | rs12970134 | 0.0154 | 0.0036 | 0.00 | 1.0x10^-6 | 0.0018 | 0.0080 | 0.82 | 0.34 | 5.6x10^-6 | 1.8x10^-4 |
| | (Age-8):rs12970134 | 0.0028 | 0.0006 | 0.00 | | 0.0032 | 0.0013 | 0.02 | | | |
| | (Age-8)^2:rs12970134 | -0.0011 | 0.0006 | 0.07 | | 0.0010 | 0.0018 | 0.57 | | | |
| | (Age-8)^3:rs12970134 | -0.0007 | 0.0003 | 0.01 | | 0.0001 | 0.0009 | 0.88 | | | |
| | (Age-8)^3:k2:rs12970134 | 0.0007 | 0.0006 | 0.24 | | -0.0013 | 0.0018 | 0.45 | | | |
| | (Age-8)^3:k3:rs12970134 | 0.0003 | 0.0013 | 0.84 | | 0.0035 | 0.0022 | 0.11 | | | |
| | (Age-8)^3:k1:rs12970134 | -0.0213 | 0.0317 | 0.50 | | 0.0240 | 0.0984 | 0.81 | | | |
| KCTD15 | rs29941 | 0.0028 | 0.0034 | 0.40 | 0.58 | 0.0095 | 0.0073 | 0.19 | 0.73 | 0.79 | 1.00 |
| | (Age-8):rs29941 | -0.0003 | 0.0006 | 0.59 | | 0.0018 | 0.0012 | 0.13 | | | |
| | (Age-8)^2:rs29941 | 0.0003 | 0.0006 | 0.60 | | -0.0009 | 0.0016 | 0.58 | | | |
| | (Age-8)^3:rs29941 | 0.0003 | 0.0002 | 0.25 | | -0.0004 | 0.0008 | 0.60 | | | |
| | (Age-8)^3:k2:rs29941 | -0.0003 | 0.0006 | 0.63 | | 0.0007 | 0.0016 | 0.68 | | | |
| | (Age-8)^3:k3:rs29941 | -0.0008 | 0.0012 | 0.53 | | -0.0004 | 0.0020 | 0.86 | | | |
| | (Age-8)^3:k1:rs29941 | 0.0252 | 0.0307 | 0.41 | | 0.0044 | 0.0942 | 0.96 | | | |
| TMEM160, ZC3H4 | rs3810291 | 0.0039 | 0.0038 | 0.31 | 0.03 | 0.0137 | 0.0084 | 0.10 | 0.58 | 0.09 | 1.00 |
| | (Age-8):rs3810291 | 0.0014 | 0.0006 | 0.02 | | -0.0002 | 0.0014 | 0.87 | | | |
| | (Age-8)^2:rs3810291 | 0.0007 | 0.0006 | 0.27 | | 0.0002 | 0.0018 | 0.89 | | | |
| | (Age-8)^3:rs3810291 | 0.0002 | 0.0003 | 0.56 | | 0.0004 | 0.0009 | 0.69 | | | |
| | (Age-8)^3:k2:rs3810291 | -0.0005 | 0.0006 | 0.46 | | -0.0004 | 0.0018 | 0.85 | | | |
| | (Age-8)^3:k3:rs3810291 | 0.0001 | 0.0013 | 0.92 | | -0.0002 | 0.0023 | 0.93 | | | |
| | (Age-8)^3:k1:rs3810291 | -0.0002 | 0.0342 | 1.00 | | 0.0015 | 0.1046 | 0.99 | | | |
| QPCTL, GIPR | rs2287019 | 0.0013 | 0.0040 | 0.74 | 0.05 | -0.0078 | 0.0088 | 0.38 | 0.76 | 0.16 | 1.00 |
| | (Age-8):rs2287019 | -0.0003 | 0.0007 | 0.66 | | 0.0003 | 0.0015 | 0.84 | | | |
| | (Age-8)^2:rs2287019 | -0.0007 | 0.0007 | 0.32 | | 0.0007 | 0.0019 | 0.73 | | | |
| | (Age-8)^3:rs2287019 | -0.0004 | 0.0003 | 0.17 | | 0.0001 | 0.0010 | 0.93 | | | |
| | (Age-8)^3:k2:rs2287019 | 0.0007 | 0.0007 | 0.32 | | -0.0007 | 0.0019 | 0.71 | | | |
| | (Age-8)^3:k3:rs2287019 | -0.0008 | 0.0014 | 0.57 | | 0.0017 | 0.0024 | 0.48 | | | |
| | (Age-8)^3:k1:rs2287019 | -0.1035 | 0.0355 | 0.00 | | 0.0088 | 0.1109 | 0.94 | | | |
| GNPDA2 | rs10938397 | 0.0045 | 0.0032 | 0.15 | 0.24 | 0.0055 | 0.0068 | 0.42 | 0.69 | 0.46 | 1.00 |
| | (Age-8):rs10938397 | 0.0010 | 0.0005 | 0.07 | | 0.0018 | 0.0011 | 0.10 | | | |
| | (Age-8)^2:rs10938397 | 0.0005 | 0.0005 | 0.31 | | 0.0017 | 0.0015 | 0.26 | | | |
| | (Age-8)^3:rs10938397 | 0.0002 | 0.0002 | 0.42 | | 0.0007 | 0.0007 | 0.37 | | | |
| | (Age-8)^3:k2:rs10938397 | -0.0007 | 0.0005 | 0.16 | | -0.0017 | 0.0015 | 0.27 | | | |
| | (Age-8)^3:k3:rs10938397 | 0.0021 | 0.0011 | 0.07 | | 0.0021 | 0.0018 | 0.24 | | | |
| | (Age-8)^3:k1:rs10938397 | -0.0277 | 0.0293 | 0.34 | | 0.0206 | 0.0858 | 0.81 | | | |
| LRP1B | rs2890652 | -0.0002 | 0.0042 | 0.96 | 0.09 | -0.0079 | 0.0096 | 0.41 | 0.22 | 0.10 | 1.00 |
| | (Age-8):rs2890652 | 0.0014 | 0.0007 | 0.05 | | -0.0007 | 0.0016 | 0.64 | | | |
| | (Age-8)^2:rs2890652 | 0.0008 | 0.0007 | 0.25 | | 0.0032 | 0.0021 | 0.13 | | | |
| | (Age-8)^3:rs2890652 | 0.0002 | 0.0003 | 0.55 | | 0.0018 | 0.0010 | 0.08 | | | |
| | (Age-8)^3:k2:rs2890652 | -0.0006 | 0.0007 | 0.35 | | -0.0030 | 0.0021 | 0.15 | | | |
| | (Age-8)^3:k3:rs2890652 | 0.0003 | 0.0015 | 0.86 | | 0.0022 | 0.0027 | 0.41 | | | |
| | (Age-8)^3:k1:rs2890652 | -0.0500 | 0.0371 | 0.18 | | 0.1159 | 0.1200 | 0.33 | | | |
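The table above does not state how the combined and Bonferroni-corrected P-values were obtained. The sketch below assumes that the two cohort LRT P-values were combined with Fisher's method and that the result was multiplied by the 32 SNPs tested, capped at 1. This is an assumption made for illustration, not the documented procedure, although it is consistent with the reported FTO values to within rounding of the inputs. The function names are hypothetical.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent P-values with Fisher's method.

    Under the null, X = -2 * sum(ln p) follows a chi-square distribution
    with 2k degrees of freedom (k = number of P-values). For even degrees
    of freedom the upper-tail probability has the closed form
    exp(-X/2) * sum_{i<k} (X/2)^i / i!.
    """
    k = len(p_values)
    half_x = -sum(math.log(p) for p in p_values)  # X / 2
    tail_terms = sum(half_x ** i / math.factorial(i) for i in range(k))
    return math.exp(-half_x) * tail_terms


def bonferroni(p, n_tests):
    """Bonferroni correction: multiply by the number of tests, cap at 1."""
    return min(1.0, p * n_tests)


# LRT P-values for FTO (rs9939609) in females: ALSPAC 3.8x10^-10, Raine 0.26.
fto_combined = fisher_combined_p([3.8e-10, 0.26])
fto_corrected = bonferroni(fto_combined, 32)
```

With these inputs the combined value is about 2.4x10^-9, matching the table. Corrected values computed from the rounded P-values shown can differ slightly from those reported, presumably because the original calculation used unrounded values.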
Table 2: Results for males of the 32 individual adult BMI associated SNPs with BMI trajectories in both cohorts and the combined meta-analysis; k1, k2 and k3 represent
knot point 1, 2 and 3 respectively.
| Gene | Term | ALSPAC Beta | ALSPAC SE | ALSPAC P | ALSPAC LRT P | Raine Beta | Raine SE | Raine P | Raine LRT P | Combined P-value | Combined Bonferroni-corrected P-value |
|---|---|---|---|---|---|---|---|---|---|---|---|
| NEGR1 | rs2568958 | 0.0041 | 0.0029 | 0.15 | 0.44 | 0.0052 | 0.0068 | 0.44 | 0.49 | 0.55 | 1.00 |
| | (Age-8):rs2568958 | -2.79x10^-5 | 0.0005 | 0.96 | | 0.0002 | 0.0012 | 0.87 | | | |
| | (Age-8)^2:rs2568958 | -0.0002 | 0.0005 | 0.74 | | 0.0022 | 0.0015 | 0.15 | | | |
| | (Age-8)^3:rs2568958 | 1.63x10^-5 | 0.0002 | 0.94 | | 0.0013 | 0.0008 | 0.08 | | | |
| | (Age-8)^3:k2:rs2568958 | 0.0001 | 0.0005 | 0.91 | | -0.0023 | 0.0015 | 0.13 | | | |
| | (Age-8)^3:k3:rs2568958 | -0.0005 | 0.0011 | 0.64 | | 0.0020 | 0.0019 | 0.29 | | | |
| | (Age-8)^3:k1:rs2568958 | 0.0422 | 0.0271 | 0.12 | | 0.1354 | 0.0877 | 0.12 | | | |
| TNNI3K | rs1514175 | 0.0040 | 0.0028 | 0.16 | 0.07 | 0.0104 | 0.0067 | 0.12 | 0.42 | 0.13 | 1.00 |
| | (Age-8):rs1514175 | 0.0007 | 0.0005 | 0.19 | | 0.0026 | 0.0011 | 0.02 | | | |
| | (Age-8)^2:rs1514175 | 0.0005 | 0.0005 | 0.32 | | 0.0008 | 0.0015 | 0.60 | | | |
| | (Age-8)^3:rs1514175 | 0.0001 | 0.0002 | 0.65 | | 0.0001 | 0.0008 | 0.86 | | | |
| | (Age-8)^3:k2:rs1514175 | -0.0004 | 0.0005 | 0.45 | | -0.0009 | 0.0015 | 0.58 | | | |
| | (Age-8)^3:k3:rs1514175 | 0.0006 | 0.0011 | 0.62 | | 0.0020 | 0.0019 | 0.29 | | | |
| | (Age-8)^3:k1:rs1514175 | -0.0171 | 0.0266 | 0.52 | | -0.0499 | 0.0890 | 0.58 | | | |
| PTBP2 | rs1555543 | -0.0039 | 0.0029 | 0.18 | 0.58 | 0.0133 | 0.0068 | 0.05 | 0.11 | 0.25 | 1.00 |
| | (Age-8):rs1555543 | 0.0005 | 0.0005 | 0.32 | | 0.0028 | 0.0012 | 0.02 | | | |
| | (Age-8)^2:rs1555543 | 0.0004 | 0.0005 | 0.41 | | 0.0008 | 0.0015 | 0.61 | | | |
| | (Age-8)^3:rs1555543 | 1.75x10^-5 | 0.0002 | 0.94 | | 0.0004 | 0.0008 | 0.65 | | | |
| | (Age-8)^3:k2:rs1555543 | -0.0003 | 0.0005 | 0.51 | | -0.0009 | 0.0015 | 0.55 | | | |
| | (Age-8)^3:k3:rs1555543 | 0.0014 | 0.0012 | 0.23 | | 0.0013 | 0.0019 | 0.49 | | | |
| | (Age-8)^3:k1:rs1555543 | -0.0072 | 0.0264 | 0.78 | | 0.0498 | 0.0902 | 0.58 | | | |
| SEC16B | rs543874 | 0.0138 | 0.0035 | 0.00 | 5.8x10^-7 | -0.0015 | 0.0081 | 0.85 | 0.02 | 2.4x10^-7 | 7.8x10^-6 |
| | (Age-8):rs543874 | 0.0023 | 0.0006 | 0.00 | | -0.0009 | 0.0014 | 0.52 | | | |
| | (Age-8)^2:rs543874 | -0.0007 | 0.0006 | 0.24 | | -0.0005 | 0.0018 | 0.78 | | | |
| | (Age-8)^3:rs543874 | -0.0005 | 0.0003 | 0.05 | | 0.0000 | 0.0009 | 0.96 | | | |
| | (Age-8)^3:k2:rs543874 | 0.0009 | 0.0006 | 0.14 | | 0.0009 | 0.0019 | 0.65 | | | |
| | (Age-8)^3:k3:rs543874 | -0.0017 | 0.0014 | 0.22 | | -0.0041 | 0.0023 | 0.07 | | | |
| | (Age-8)^3:k1:rs543874 | -0.0215 | 0.0320 | 0.50 | | 0.0380 | 0.1068 | 0.72 | | | |
| TMEM18 | rs2867125 | 0.0113 | 0.0038 | 0.00 | 3.5x10^-6 | 0.0127 | 0.0091 | 0.16 | 0.04 | 2.5x10^-6 | 1.0x10^-4 |
| | (Age-8):rs2867125 | 0.0037 | 0.0007 | 0.00 | | 0.0040 | 0.0016 | 0.01 | | | |
| | (Age-8)^2:rs2867125 | 0.0007 | 0.0007 | 0.31 | | 0.0013 | 0.0021 | 0.53 | | | |
| | (Age-8)^3:rs2867125 | -0.0001 | 0.0003 | 0.79 | | 0.0005 | 0.0011 | 0.62 | | | |
| | (Age-8)^3:k2:rs2867125 | -0.0007 | 0.0007 | 0.28 | | -0.0013 | 0.0021 | 0.53 | | | |
| | (Age-8)^3:k3:rs2867125 | 0.0025 | 0.0015 | 0.10 | | 0.0012 | 0.0026 | 0.63 | | | |
| | (Age-8)^3:k1:rs2867125 | -0.0389 | 0.0342 | 0.26 | | 0.1612 | 0.1248 | 0.20 | | | |
| RBJ, ADCY3, POMC | rs713586 | 0.0063 | 0.0028 | 0.02 | 0.22 | 0.0117 | 0.0069 | 0.09 | 0.25 | 0.22 | 1.00 |
| | (Age-8):rs713586 | 0.0008 | 0.0005 | 0.11 | | 0.0013 | 0.0012 | 0.27 | | | |
| | (Age-8)^2:rs713586 | 0.0001 | 0.0005 | 0.81 | | -0.0003 | 0.0015 | 0.85 | | | |
| | (Age-8)^3:rs713586 | 1.12x10^-6 | 0.0002 | 1.00 | | -0.0003 | 0.0008 | 0.69 | | | |
| | (Age-8)^3:k2:rs713586 | -0.0002 | 0.0005 | 0.70 | | 0.0004 | 0.0015 | 0.81 | | | |
| | (Age-8)^3:k3:rs713586 | 0.0009 | 0.0011 | 0.42 | | -0.0006 | 0.0019 | 0.74 | | | |
| | (Age-8)^3:k1:rs713586 | 0.0058 | 0.0257 | 0.82 | | -0.0390 | 0.0894 | 0.66 | | | |
| FANCL | rs887912 | -0.0032 | 0.0031 | 0.30 | 0.21 | -0.0049 | 0.0074 | 0.50 | 0.55 | 0.36 | 1.00 |
| | (Age-8):rs887912 | 0.0002 | 0.0005 | 0.66 | | -0.0003 | 0.0012 | 0.79 | | | |
| | (Age-8)^2:rs887912 | 0.0014 | 0.0006 | 0.01 | | 0.0006 | 0.0017 | 0.70 | | | |
| | (Age-8)^3:rs887912 | 0.0006 | 0.0002 | 0.01 | | 0.0002 | 0.0009 | 0.77 | | | |
| | (Age-8)^3:k2:rs887912 | -0.0014 | 0.0005 | 0.01 | | -0.0008 | 0.0017 | 0.64 | | | |
| | (Age-8)^3:k3:rs887912 | 0.0019 | 0.0012 | 0.13 | | 0.0018 | 0.0021 | 0.39 | | | |
| | (Age-8)^3:k1:rs887912 | 0.0412 | 0.0279 | 0.14 | | 0.0563 | 0.0994 | 0.57 | | | |
| CADM2 | rs13078807 | 0.0030 | 0.0035 | 0.40 | 0.98 | -0.0063 | 0.0079 | 0.42 | 0.01 | 0.05 | 1.00 |
| | (Age-8):rs13078807 | 0.0003 | 0.0006 | 0.60 | | 0.0012 | 0.0013 | 0.39 | | | |
| | (Age-8)^2:rs13078807 | -0.0005 | 0.0006 | 0.46 | | 0.0002 | 0.0018 | 0.93 | | | |
| | (Age-8)^3:rs13078807 | -0.0002 | 0.0003 | 0.39 | | -0.0005 | 0.0009 | 0.59 | | | |
| | (Age-8)^3:k2:rs13078807 | 0.0004 | 0.0006 | 0.49 | | 0.0009 | 0.0018 | 0.63 | | | |
| | (Age-8)^3:k3:rs13078807 | -0.0004 | 0.0014 | 0.78 | | -0.0024 | 0.0022 | 0.28 | | | |
| | (Age-8)^3:k1:rs13078807 | 0.0000 | 0.0326 | 1.00 | | -0.1535 | 0.1026 | 0.13 | | | |
| ETV5, DGKG, SFRS10 | rs7647305 | 0.0008 | 0.0035 | 0.82 | 0.48 | 0.0041 | 0.0084 | 0.62 | 0.35 | 0.47 | 1.00 |
| | (Age-8):rs7647305 | 0.0012 | 0.0006 | 0.05 | | 0.0012 | 0.0014 | 0.39 | | | |
| | (Age-8)^2:rs7647305 | 0.0008 | 0.0006 | 0.20 | | 0.0018 | 0.0019 | 0.35 | | | |
| | (Age-8)^3:rs7647305 | 0.0002 | 0.0003 | 0.51 | | 0.0009 | 0.0010 | 0.36 | | | |
| | (Age-8)^3:k2:rs7647305 | -0.0007 | 0.0006 | 0.26 | | -0.0020 | 0.0019 | 0.28 | | | |
| | (Age-8)^3:k3:rs7647305 | 0.0013 | 0.0014 | 0.38 | | 0.0033 | 0.0024 | 0.17 | | | |
| | (Age-8)^3:k1:rs7647305 | -0.0230 | 0.0328 | 0.48 | | 0.1532 | 0.1043 | 0.14 | | | |
| SLC39A8 | rs13107325 | 0.0075 | 0.0053 | 0.16 | 0.64 | 0.0047 | 0.0129 | 0.71 | 0.32 | 0.52 | 1.00 |
| | (Age-8):rs13107325 | -0.0003 | 0.0009 | 0.79 | | 0.0007 | 0.0022 | 0.76 | | | |
| | (Age-8)^2:rs13107325 | -0.0013 | 0.0010 | 0.16 | | 0.0016 | 0.0029 | 0.59 | | | |
| | (Age-8)^3:rs13107325 | -0.0005 | 0.0004 | 0.22 | | 0.0006 | 0.0014 | 0.69 | | | |
| | (Age-8)^3:k2:rs13107325 | 0.0014 | 0.0009 | 0.14 | | -0.0019 | 0.0029 | 0.51 | | | |
| | (Age-8)^3:k3:rs13107325 | -0.0030 | 0.0021 | 0.15 | | 0.0049 | 0.0036 | 0.18 | | | |
| | (Age-8)^3:k1:rs13107325 | -0.0678 | 0.0480 | 0.16 | | 0.0385 | 0.1711 | 0.82 | | | |
| FLJ35779, HMGCR | rs2112347 | 0.0020 | 0.0030 | 0.50 | 0.01 | -0.0009 | 0.0070 | 0.90 | 0.31 | 0.03 | 0.84 |
| | (Age-8):rs2112347 | 0.0018 | 0.0005 | 0.00 | | -0.0018 | 0.0012 | 0.13 | | | |
| | (Age-8)^2:rs2112347 | 0.0004 | 0.0005 | 0.44 | | -0.0001 | 0.0016 | 0.94 | | | |
| | (Age-8)^3:rs2112347 | -0.0001 | 0.0002 | 0.83 | | 0.0000 | 0.0008 | 0.99 | | | |
| | (Age-8)^3:k2:rs2112347 | -0.0004 | 0.0005 | 0.47 | | 0.0003 | 0.0016 | 0.87 | | | |
| | (Age-8)^3:k3:rs2112347 | 0.0014 | 0.0012 | 0.23 | | -0.0002 | 0.0020 | 0.90 | | | |
| | (Age-8)^3:k1:rs2112347 | 0.0209 | 0.0277 | 0.45 | | -0.0706 | 0.0929 | 0.45 | | | |
ZNF608
rs4836133 0.0010 0.0029 0.74
0.06
0.0068 0.0071 0.34
0.56 0.15 1.00
(Age-8):rs4836133 -0.0013 0.0005 0.01 0.0018 0.0012 0.14 (Age-8)2:rs4836133 -0.0003 0.0005 0.52 -0.0019 0.0016 0.23 (Age-8)3:rs4836133 2.62x10-7 0.0002 1.00 -0.0013 0.0008 0.09 (Age-8)3:k2:rs4836133 0.0003 0.0005 0.52 0.0021 0.0016 0.19 (Age-8)3:k3:rs4836133 -0.0006 0.0011 0.58 -0.0015 0.0020 0.45 (Age-8)3:k1:rs4836133 0.0282 0.0264 0.28 -0.1404 0.0922 0.13
TFAP2B
rs987237 0.0066 0.0036 0.07
3.0x10-3
0.0204 0.0085 0.02
0.42 0.01 0.32
(Age-8):rs987237 0.0024 0.0006 0.00 0.0009 0.0015 0.56 (Age-8)2:rs987237 0.0012 0.0006 0.07 -0.0004 0.0019 0.82 (Age-8)3:rs987237 0.0004 0.0003 0.20 0.0000 0.0010 0.99 (Age-8)3:k2:rs987237 -0.0014 0.0006 0.04 0.0001 0.0019 0.98 (Age-8)3:k3:rs987237 0.0025 0.0015 0.08 0.0002 0.0024 0.95 (Age-8)3:k1:rs987237 0.0235 0.0329 0.48 -0.0332 0.1150 0.77
LRRN6C
rs10968576 0.0034 0.0030 0.26
0.02
0.0040 0.0073 0.58
0.22 0.02 0.75
(Age-8):rs10968576 0.0006 0.0005 0.28 -0.0005 0.0012 0.71 (Age-8)2:rs10968576 -0.0009 0.0005 0.08 0.0010 0.0017 0.56 (Age-8)3:rs10968576 -0.0004 0.0002 0.08 0.0007 0.0008 0.40 (Age-8)3:k2:rs10968576 0.0010 0.0005 0.07 -0.0012 0.0017 0.46 (Age-8)3:k3:rs10968576 -0.0024 0.0012 0.05 0.0020 0.0021 0.34 (Age-8)3:k1:rs10968576 0.0436 0.0281 0.12 0.0787 0.0953 0.41
LMX1B rs867559 0.0011 0.0036 0.76
0.07 0.0020 0.0082 0.81
0.86 0.22 1.00 (Age-8):rs867559 0.0009 0.0006 0.15 0.0005 0.0014 0.73
(Age-8)2:rs867559 0.0017 0.0006 0.01 -0.0005 0.0018 0.78 (Age-8)3:rs867559 0.0006 0.0003 0.02 -0.0003 0.0009 0.78 (Age-8)3:k2:rs867559 -0.0016 0.0006 0.01 0.0002 0.0018 0.92 (Age-8)3:k3:rs867559 0.0025 0.0014 0.08 0.0011 0.0023 0.62 (Age-8)3:k1:rs867559 -0.0228 0.0324 0.48 -0.0063 0.1036 0.95
RPL27A, TUB
rs4929949 0.0033 0.0029 0.25
0.58
0.0055 0.0068 0.42
0.98 0.89 1.00
(Age-8):rs4929949 -0.0001 0.0005 0.85 0.0008 0.0012 0.50 (Age-8)2:rs4929949 -0.0007 0.0005 0.19 -0.0007 0.0015 0.64 (Age-8)3:rs4929949 -0.0003 0.0002 0.21 -0.0005 0.0008 0.54 (Age-8)3:k2:rs4929949 0.0007 0.0005 0.18 0.0007 0.0016 0.67 (Age-8)3:k3:rs4929949 -0.0013 0.0012 0.26 -0.0002 0.0019 0.92 (Age-8)3:k1:rs4929949 0.0145 0.0266 0.59 -0.0763 0.0888 0.39
BDNF
rs6265 0.0015 0.0035 0.68
0.10
0.0121 0.0083 0.14
0.32 0.15 1.00
(Age-8):rs6265 0.0014 0.0006 0.03 0.0013 0.0014 0.35 (Age-8)2:rs6265 0.0018 0.0006 0.01 -0.0021 0.0018 0.26 (Age-8)3:rs6265 0.0007 0.0003 0.02 -0.0012 0.0009 0.19 (Age-8)3:k2:rs6265 -0.0018 0.0006 0.00 0.0024 0.0018 0.20 (Age-8)3:k3:rs6265 0.0036 0.0014 0.01 -0.0026 0.0023 0.25 (Age-8)3:k1:rs6265 0.0271 0.0333 0.42 -0.1286 0.1049 0.22
MTCH2, NDUFS3, CUGBP1
rs3817334 -0.0005 0.0029 0.86
0.47
0.0068 0.0070 0.33
0.89 0.78 1.00
(Age-8):rs3817334 0.0005 0.0005 0.37 0.0002 0.0012 0.86 (Age-8)2:rs3817334 0.0003 0.0005 0.62 0.0007 0.0016 0.63 (Age-8)3:rs3817334 -4.89x10-5 0.0002 0.83 0.0005 0.0008 0.55 (Age-8)3:k2:rs3817334 -0.0001 0.0005 0.87 -0.0008 0.0016 0.62 (Age-8)3:k3:rs3817334 0.0004 0.0012 0.75 0.0004 0.0020 0.82 (Age-8)3:k1:rs3817334 -0.0354 0.0267 0.19 0.0180 0.0917 0.84
FAIM2 rs7138803 0.0063 0.0030 0.03 0.62 0.0133 0.0067 0.05 0.44 0.63 1.00
(Age-8):rs7138803 0.0005 0.0005 0.39 0.0014 0.0011 0.21 (Age-8)2:rs7138803 -0.0005 0.0005 0.37 -0.0010 0.0015 0.52 (Age-8)3:rs7138803 -0.0001 0.0002 0.56 -0.0004 0.0008 0.61 (Age-8)3:k2:rs7138803 0.0004 0.0005 0.48 0.0009 0.0015 0.56 (Age-8)3:k3:rs7138803 -0.0008 0.0012 0.49 -0.0011 0.0019 0.55 (Age-8)3:k1:rs7138803 -0.0007 0.0272 0.98 -0.0209 0.0884 0.81
MTIF3, GTF3A
rs4771122 0.0014 0.0035 0.70
0.51
0.0293 0.0087 0.00
3.8x10-3 0.01 0.44
(Age-8):rs4771122 0.0002 0.0006 0.71 0.0032 0.0015 0.03 (Age-8)2:rs4771122 0.0011 0.0006 0.07 0.0002 0.0020 0.94 (Age-8)3:rs4771122 0.0005 0.0003 0.07 0.0002 0.0010 0.87 (Age-8)3:k2:rs4771122 -0.0012 0.0006 0.06 -0.0010 0.0020 0.61 (Age-8)3:k3:rs4771122 0.0024 0.0014 0.08 0.0026 0.0025 0.29 (Age-8)3:k1:rs4771122 0.0392 0.0321 0.22 0.0927 0.1166 0.43
PRKD1
rs11847697 0.0207 0.0067 0.00
1.5x10-3
0.0031 0.0179 0.86
0.98 0.01 0.35
(Age-8):rs11847697 0.0046 0.0012 0.00 0.0017 0.0030 0.57 (Age-8)2:rs11847697 0.0015 0.0012 0.22 0.0029 0.0040 0.46 (Age-8)3:rs11847697 0.0004 0.0005 0.45 0.0012 0.0020 0.55 (Age-8)3:k2:rs11847697 -0.0017 0.0012 0.15 -0.0028 0.0040 0.48 (Age-8)3:k3:rs11847697 0.0037 0.0028 0.18 0.0032 0.0050 0.52 (Age-8)3:k1:rs11847697 0.0160 0.0624 0.80 0.0691 0.2353 0.77
NRXN3
rs10150332 0.0040 0.0035 0.25
0.06
-0.0024 0.0080 0.76
0.01 3.1x10-3 0.10
(Age-8):rs10150332 0.0020 0.0006 0.00 -0.0016 0.0013 0.23 (Age-8)2:rs10150332 -0.0001 0.0006 0.92 -0.0001 0.0018 0.98 (Age-8)3:rs10150332 -0.0002 0.0003 0.56 0.0000 0.0009 0.96 (Age-8)3:k2:rs10150332 -0.0002 0.0006 0.78 -0.0003 0.0018 0.87 (Age-8)3:k3:rs10150332 0.0019 0.0014 0.17 0.0005 0.0022 0.84 (Age-8)3:k1:rs10150332 0.0139 0.0333 0.68 -0.0279 0.1041 0.79
MAP2K5, LBXCOR1
rs2241423 0.0039 0.0035 0.26
0.08
0.0065 0.0076 0.39
0.71 0.21 1.00
(Age-8):rs2241423 0.0016 0.0006 0.01 0.0020 0.0013 0.13 (Age-8)2:rs2241423 -0.0006 0.0006 0.30 0.0014 0.0017 0.42 (Age-8)3:rs2241423 -0.0005 0.0003 0.05 0.0005 0.0009 0.57 (Age-8)3:k2:rs2241423 0.0005 0.0006 0.42 -0.0013 0.0017 0.44 (Age-8)3:k3:rs2241423 0.0010 0.0014 0.48 0.0015 0.0021 0.47 (Age-8)3:k1:rs2241423 -0.0495 0.0314 0.12 0.0109 0.1034 0.92
GPRC5B, IQCK
rs12444979 0.0077 0.0041 0.06
0.07
0.0109 0.0098 0.27
0.51 0.15 1.00
(Age-8):rs12444979 0.0010 0.0007 0.15 0.0011 0.0017 0.52 (Age-8)2:rs12444979 0.0009 0.0007 0.23 -0.0029 0.0022 0.18 (Age-8)3:rs12444979 0.0003 0.0003 0.32 -0.0016 0.0011 0.16 (Age-8)3:k2:rs12444979 -0.0009 0.0007 0.22 0.0025 0.0022 0.25 (Age-8)3:k3:rs12444979 0.0009 0.0016 0.57 -0.0012 0.0027 0.67 (Age-8)3:k1:rs12444979 0.0151 0.0386 0.70 -0.0914 0.1273 0.47
SH2B1, ATXN2L, TUFM, ATP2A1
rs7359397 0.0032 0.0029 0.27
0.23
0.0126 0.0069 0.07
0.66 0.44 1.00
(Age-8):rs7359397 0.0011 0.0005 0.03 0.0004 0.0012 0.70 (Age-8)2:rs7359397 -0.0005 0.0005 0.28 -0.0012 0.0016 0.43 (Age-8)3:rs7359397 -0.0003 0.0002 0.15 -0.0005 0.0008 0.56 (Age-8)3:k2:rs7359397 0.0005 0.0005 0.30 0.0011 0.0016 0.50 (Age-8)3:k3:rs7359397 -0.0007 0.0011 0.51 -0.0015 0.0019 0.44 (Age-8)3:k1:rs7359397 -0.0320 0.0266 0.23 0.0044 0.0907 0.96
FTO
rs9939609 0.0106 0.0029 0.00
8.4x10-12
0.0310 0.0069 0.00
1.4x10-4 4.0x10-14 1.36x10-12
(Age-8):rs9939609 0.0041 0.0005 0.00 0.0047 0.0012 0.00 (Age-8)2:rs9939609 0.0009 0.0005 0.07 -0.0031 0.0016 0.05 (Age-8)3:rs9939609 0.0002 0.0002 0.46 -0.0017 0.0008 0.03 (Age-8)3:k2:rs9939609 -0.0013 0.0005 0.01 0.0027 0.0016 0.08 (Age-8)3:k3:rs9939609 0.0044 0.0012 0.00 -0.0018 0.0020 0.37
(Age-8)3:k1:rs9939609 -0.0114 0.0278 0.68 -0.1378 0.0912 0.13
MC4R
rs12970134 0.0107 0.0032 0.00
1.6x10-4
0.0031 0.0078 0.69
0.79 1.2x10-3 0.04
(Age-8):rs12970134 0.0027 0.0006 0.00 -0.0001 0.0013 0.96 (Age-8)2:rs12970134 0.0007 0.0006 0.20 0.0011 0.0017 0.51 (Age-8)3:rs12970134 0.0001 0.0002 0.54 0.0008 0.0009 0.36 (Age-8)3:k2:rs12970134 -0.0009 0.0006 0.12 -0.0012 0.0018 0.51 (Age-8)3:k3:rs12970134 0.0026 0.0013 0.04 0.0004 0.0022 0.85 (Age-8)3:k1:rs12970134 0.0011 0.0292 0.97 0.1289 0.1028 0.21
KCTD15
rs29941 0.0044 0.0030 0.15
0.63
-0.0010 0.0071 0.89
0.26 0.46 1.00
(Age-8):rs29941 0.0005 0.0005 0.31 -0.0025 0.0012 0.04 (Age-8)2:rs29941 0.0004 0.0005 0.50 -0.0015 0.0016 0.36 (Age-8)3:rs29941 0.0002 0.0002 0.41 -0.0005 0.0008 0.56 (Age-8)3:k2:rs29941 -0.0004 0.0005 0.41 0.0017 0.0016 0.29 (Age-8)3:k3:rs29941 0.0003 0.0012 0.78 -0.0030 0.0020 0.14 (Age-8)3:k1:rs29941 -0.0006 0.0283 0.98 -0.0487 0.0913 0.59
TMEM160, ZC3H4
rs3810291 0.0007 0.0034 0.83
0.24
0.0042 0.0084 0.62
0.98 0.57 1.00
(Age-8):rs3810291 0.0015 0.0006 0.01 0.0011 0.0014 0.42 (Age-8)2:rs3810291 0.0001 0.0006 0.87 -0.0011 0.0019 0.57 (Age-8)3:rs3810291 -0.0002 0.0003 0.49 -0.0007 0.0010 0.49 (Age-8)3:k2:rs3810291 0.0000 0.0006 1.00 0.0009 0.0019 0.64 (Age-8)3:k3:rs3810291 0.0007 0.0014 0.59 0.0003 0.0024 0.91 (Age-8)3:k1:rs3810291 -0.0319 0.0320 0.32 -0.0438 0.1069 0.68
QPCTL, GIPR
rs2287019 0.0032 0.0036 0.37
0.91
0.0033 0.0088 0.71
0.54 0.84 1.00 (Age-8):rs2287019 0.0002 0.0006 0.75 0.0006 0.0015 0.69 (Age-8)2:rs2287019 0.0003 0.0006 0.63 -0.0003 0.0020 0.90 (Age-8)3:rs2287019 0.0002 0.0003 0.56 -0.0004 0.0010 0.68 (Age-8)3:k2:rs2287019 -0.0004 0.0006 0.55 0.0006 0.0020 0.75
(Age-8)3:k3:rs2287019 0.0007 0.0014 0.63 -0.0013 0.0025 0.60 (Age-8)3:k1:rs2287019 0.0297 0.0327 0.36 -0.0765 0.1176 0.52
GNPDA2
rs10938397 0.0076 0.0029 0.01
0.02
0.0075 0.0068 0.27
0.05 0.01 0.30
(Age-8):rs10938397 0.0008 0.0005 0.12 0.0034 0.0011 0.00 (Age-8)2:rs10938397 0.0002 0.0005 0.66 0.0031 0.0015 0.04 (Age-8)3:rs10938397 0.0002 0.0002 0.36 0.0011 0.0008 0.16 (Age-8)3:k2:rs10938397 -0.0004 0.0005 0.45 -0.0033 0.0015 0.03 (Age-8)3:k3:rs10938397 5.46x10-6 0.0012 1.00 0.0057 0.0019 0.00 (Age-8)3:k1:rs10938397 0.0579 0.0262 0.03 0.0408 0.0897 0.65
LRP1B
rs2890652 -0.0039 0.0038 0.30
0.26
0.0021 0.0093 0.82
0.30 0.28 1.00
(Age-8):rs2890652 0.0010 0.0007 0.13 0.0028 0.0016 0.07 (Age-8)2:rs2890652 0.0014 0.0007 0.04 0.0007 0.0021 0.73 (Age-8)3:rs2890652 0.0004 0.0003 0.17 -0.0002 0.0010 0.85 (Age-8)3:k2:rs2890652 -0.0012 0.0007 0.08 -0.0008 0.0021 0.70 (Age-8)3:k3:rs2890652 0.0023 0.0015 0.13 0.0031 0.0026 0.23 (Age-8)3:k1:rs2890652 -0.0129 0.0341 0.71 -0.1640 0.1162 0.16
Appendix G: R Code For The Models Used In The Analysis Of Each Chapter
G.1 Chapter Two
G.1.1 Linear Mixed Model (LMM)
######################
### Load libraries ###
######################
library(nlme)
##################
### Female LMM ###
##################
female.lmm <- lme(log(bmi) ~ I(age-8) + I((age-8)^2) + I((age-8)^3),
data=data.f, method="ML", random = ~ I(age-8) + I((age-8)^2)|ID,
na.action=na.omit, correlation=corCAR1(form = ~ 1 |ID))
summary(female.lmm)
female.lmm.genetic <- lme(log(bmi) ~ (I(age-8) + I((age-8)^2) +
I((age-8)^3)) * score, data=data.f, method="ML",
random = ~ I(age-8) + I((age-8)^2)|ID, na.action=na.omit,
correlation=corCAR1(form = ~ 1 |ID))
summary(female.lmm.genetic)
################
### Male LMM ###
################
male.lmm <- lme(log(bmi) ~ I(age-8) + I((age-8)^2) + I((age-8)^3),
data=data.m, method="ML", random = ~ I(age-8) + I((age-8)^2)|ID,
na.action=na.omit, correlation=corCAR1(form = ~ 1 |ID))
summary(male.lmm)
male.lmm.genetic <- lme(log(bmi) ~ (I(age-8) + I((age-8)^2) +
I((age-8)^3)) * score, data=data.m, method="ML",
random = ~ I(age-8) + I((age-8)^2)|ID, na.action=na.omit,
correlation=corCAR1(form = ~ 1 |ID))
summary(male.lmm.genetic)
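As an illustration of how nested fits such as those above can be compared, the following minimal sketch applies a likelihood ratio test to two models fitted with method="ML". It uses simulated data and a random intercept only. The object names (sim, base.fit, gen.fit, lrt.p) and all simulated values are hypothetical and are not part of the thesis analyses.

```r
library(nlme)

# Simulate a small longitudinal data set (all values are illustrative)
set.seed(1)
n_id  <- 60
n_obs <- 5
sim <- data.frame(
  ID    = factor(rep(seq_len(n_id), each = n_obs)),
  age   = rep(c(2, 5, 8, 11, 14), times = n_id),
  score = rep(rbinom(n_id, size = 2, prob = 0.3), each = n_obs)
)
b0 <- rep(rnorm(n_id, sd = 0.10), each = n_obs)  # random intercepts
sim$logbmi <- 2.8 + 0.02 * (sim$age - 8) + 0.03 * sim$score + b0 +
  rnorm(nrow(sim), sd = 0.05)

# Base model and model including the score and its interaction with age
base.fit <- lme(logbmi ~ I(age - 8), data = sim,
                random = ~ 1 | ID, method = "ML")
gen.fit  <- lme(logbmi ~ I(age - 8) * score, data = sim,
                random = ~ 1 | ID, method = "ML")

# Likelihood ratio test of the two nested models
lrt   <- anova(base.fit, gen.fit)
lrt.p <- lrt[2, "p-value"]
print(lrt)
```

Both models are fitted by maximum likelihood rather than REML so that their log-likelihoods remain comparable when the fixed effects differ.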
G.1.2 Skew-t Linear Mixed Model (STLMM)
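###############################
### Load libraries and code ###
###############################
library(nlme)    # lme(), used to obtain starting values
library(lattice) # histogram(), used to count observations per ID
source("http://www.ime.unicamp.br/~hlachos/RprogramSNI.r")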
####################
### Female STLMM ###
####################
attach(data.f)
# Define lme model to get starting values for skew-t
female.lme.skew <- lme(bmi ~ I(age-8) + I((age-8)^2) + I((age-8)^3),
data=data.f, method="REML",
random = ~ 1 + I(age-8) + I((age-8)^2) |ID, na.action=na.omit,
correlation=corCAR1())
# Define data vectors and matrices
y <- as.vector(bmi)
X.mat <- as.matrix(cbind(1, data.f$age-8, (data.f$age-8)^2,
(data.f$age-8)^3 ) )
Z.mat <- as.matrix(cbind(1, (data.f$age-8)))
X.mat.complete <- X.mat[apply(is.na(X.mat),1,sum)==0 & !is.na(bmi),]
Z.mat.complete <- Z.mat[apply(is.na(X.mat),1,sum)==0 & !is.na(bmi),]
BMI.complete <- bmi[apply(is.na(X.mat),1,sum)==0 & !is.na(bmi)]
ID.complete <- ID[apply(is.na(X.mat),1,sum)==0 & !is.na(bmi)]
# Calculate number of observations per group
histo <- histogram(~BMI.complete | ID.complete)
nj <- as.vector(summary(histo)[[2]])
# Starting values for EM algorithm
beta1 <- fixed.effects(female.lme.skew)
sigmae <- female.lme.skew$sigma
D1 <- diag(1,dim(Z.mat)[2])
D1[1,1] <- as.numeric(VarCorr(female.lme.skew)[,"Variance"][1])
D1[2,2] <- as.numeric(VarCorr(female.lme.skew)[,"Variance"][2])
D1[1,2] <- as.numeric(VarCorr(female.lme.skew)[,"Corr"][2])
D1[2,1] <- as.numeric(VarCorr(female.lme.skew)[,"Corr"][2])
lambda <- rep(1,dim(Z.mat)[2])
nu <- 10
# Skew-t mixed Model
female.skewt <- EM.Skew(nj=nj, y=BMI.complete, x=X.mat.complete,
z=Z.mat.complete, beta1=beta1, sigmae=sigmae, D1=D1,
lambda=lambda, nu=nu, Ind=2, lb=-Inf, lu=Inf, precisão=0.001,
loglik=T, informa=T, calcbi=T)
bi.skewt.f <-read.table("bi.txt")
# Extract and derive parameters for use
bi.mat.st.f <- NULL
for (i in 1:nrow(bi.skewt.f)) {
bi.mat.st.f <- rbind(bi.mat.st.f,cbind(rep(bi.skewt.f[i,1],nj[i]),
rep(bi.skewt.f[i,2],nj[i])))
}
fixed.effects.st.f <- female.skewt$theta[1:4]
sigmae.st.f <- female.skewt$theta[5]
variance.intercept.st.f <- female.skewt$theta[6]
variance.slope.st.f <- female.skewt$theta[8]
corr.slope.intercept.st.f <- female.skewt$theta[7]
skewness.intercept.st.f <- female.skewt$theta[9]
skewness.slope.st.f <- female.skewt$theta[10]
kurtosis.st.f <- female.skewt$theta[11]
marginal.res.st.f <- BMI.complete - X.mat.complete %*%
(matrix(fixed.effects.st.f, nrow=length(fixed.effects.st.f)))
conditional.res.st.f <- BMI.complete - (X.mat.complete %*%
(matrix(fixed.effects.st.f, nrow=length(fixed.effects.st.f))))-
apply(Z.mat.complete*bi.mat.st.f, 1, sum)
bmi.fitted.skewt.f <- (X.mat.complete %*%
(matrix(fixed.effects.st.f, nrow=length(fixed.effects.st.f)))) +
apply(Z.mat.complete*bi.mat.st.f, 1, sum)
pt(abs(female.skewt$theta[1:10]/female.skewt$desvios[1:10]),
df=4000,lower.tail=F)*2
# Skew-t mixed model including genetics
female.lme.skew.score <- lme(bmi ~ (I(age-8) + I((age-8)^2) +
I((age-8)^3)) * score, data=data.f, method="REML",
random = ~ 1 + I(age-8)+ I((age-8)^2) |ID, na.action=na.omit,
correlation=corCAR1())
y <- as.vector(bmi)
X.mat <- as.matrix(cbind(1, data.f$age-8, (data.f$age-8)^2,
(data.f$age-8)^3, data.f$score, (data.f$age-8) * score,
(data.f$age-8)^2 * score, (data.f$age-8)^3 * score))
Z.mat <- as.matrix(cbind(1, (data.f$age-8)))
X.mat.complete <- X.mat[apply(is.na(X.mat),1,sum)==0 & !is.na(bmi),]
Z.mat.complete <- Z.mat[apply(is.na(X.mat),1,sum)==0 & !is.na(bmi),]
BMI.complete <- bmi[apply(is.na(X.mat),1,sum)==0 & !is.na(bmi)]
ID.complete <- ID[apply(is.na(X.mat),1,sum)==0 & !is.na(bmi)]
histo <- histogram(~BMI.complete | ID.complete)
nj <- as.vector(summary(histo)[[2]])
beta1 <- fixed.effects(female.lme.skew.score)
sigmae <- female.lme.skew.score$sigma
D1 <- diag(1,dim(Z.mat)[2])
D1[1,1] <- 5; D1[2,2] <- 0.07; D1[1,2]<-0.2; D1[2,1]<-0.2
lambda <- rep(1,dim(Z.mat)[2])
nu<-10
female.skewt.score <- EM.Skew(nj=nj, y=BMI.complete, x=X.mat.complete,
z=Z.mat.complete, beta1=beta1, sigmae=sigmae, D1=D1,
lambda=lambda, nu=nu, Ind=2, lb=-Inf, lu=Inf, precisão=0.001,
loglik=T, informa=T, calcbi=T)
bi.skewt.f.score <-read.table("bi.txt")
bi.mat.st.f.score <- NULL
for (i in 1:nrow(bi.skewt.f.score)) {
bi.mat.st.f.score <- rbind(bi.mat.st.f.score,
cbind(rep(bi.skewt.f.score[i,1],nj[i]),
rep(bi.skewt.f.score[i,2],nj[i])))
}
fixed.effects.st.f.score <- female.skewt.score$theta[1:8]
sigmae.st.f.score <- female.skewt.score$theta[9]
variance.intercept.st.f.score <- female.skewt.score$theta[10]
variance.slope.st.f.score <- female.skewt.score$theta[12]
corr.slope.intercept.st.f.score <- female.skewt.score$theta[11]
skewness.intercept.st.f.score <- female.skewt.score$theta[13]
skewness.slope.st.f.score <- female.skewt.score$theta[14]
kurtosis.st.f.score <- female.skewt.score$theta[15]
marginal.res.st.f.score <- BMI.complete - X.mat.complete %*%
(matrix(fixed.effects.st.f.score,
nrow=length(fixed.effects.st.f.score)))
conditional.res.st.f.score <- BMI.complete - (X.mat.complete %*%
(matrix(fixed.effects.st.f.score,
nrow=length(fixed.effects.st.f.score)))) -
apply(Z.mat.complete*bi.mat.st.f.score,1,sum)
bmi.fitted.skewt.f.score <- (X.mat.complete %*%
(matrix(fixed.effects.st.f.score,
nrow=length(fixed.effects.st.f.score)))) +
apply(Z.mat.complete*bi.mat.st.f.score,1,sum)
detach(data.f)
##################
### Male STLMM ###
##################
attach(data.m)
# Define lme model to get starting values for skew-t
male.lme.skew <- lme(bmi ~ I(age-8) + I((age-8)^2) + I((age-8)^3),
method = "REML", random = ~ 1 + I(age-8)+ I((age-8)^2) | ID,
data=data.m, na.action=na.omit, correlation=corCAR1(form=~1|ID))
summary(male.lme.skew)
# Define data vectors and matrices
y <- as.vector(bmi)
X.mat <- as.matrix(cbind(1, data.m$age-8, (data.m$age-8)^2,
(data.m$age-8)^3))
Z.mat <- as.matrix(cbind(1, (data.m$age-8)))
X.mat.complete <- X.mat[apply(is.na(X.mat),1,sum)==0 & !is.na(bmi),]
Z.mat.complete <- Z.mat[apply(is.na(X.mat),1,sum)==0 & !is.na(bmi),]
BMI.complete <- bmi[apply(is.na(X.mat),1,sum)==0 & !is.na(bmi)]
ID.complete <- ID[apply(is.na(X.mat),1,sum)==0 & !is.na(bmi)]
# Calculate number of observations per group
histo <- histogram(~BMI.complete | ID.complete)
nj <- as.vector(summary(histo)[[2]])
# Starting values for EM algorithm
beta1 <- fixed.effects(male.lme.skew)
sigmae <- male.lme.skew$sigma
D1 <- diag(1,dim(Z.mat)[2])
D1[1,1]<-as.numeric(VarCorr(male.lme.skew)[,"Variance"][1])
D1[2,2]<-as.numeric(VarCorr(male.lme.skew)[,"Variance"][2])
D1[1,2]<-as.numeric(VarCorr(male.lme.skew)[,"Corr"][2])
D1[2,1]<-as.numeric(VarCorr(male.lme.skew)[,"Corr"][2])
lambda <- rep(1,dim(Z.mat)[2])
nu <- 10
# Skew-t mixed model
male.skewt<-EM.Skew(nj=nj, y=BMI.complete, x=X.mat.complete,
z=Z.mat.complete, beta1=beta1, sigmae=sigmae, D1=D1,
lambda=lambda, nu=nu, Ind=2, lb=-Inf, lu=Inf, precisão=0.001,
loglik=T, informa=T, calcbi=T)
bi.skewt.m <- read.table("bi.txt")
# Extract and derive parameters for use
bi.mat.st.m <- NULL
for (i in 1:nrow(bi.skewt.m)) {
bi.mat.st.m <- rbind(bi.mat.st.m,cbind(rep(bi.skewt.m[i,1],nj[i]),
rep(bi.skewt.m[i,2],nj[i])))
}
fixed.effects.st.m <- male.skewt$theta[1:4]
sigmae.st.m <- male.skewt$theta[5]
variance.intercept.st.m <- male.skewt$theta[6]
variance.slope.st.m <- male.skewt$theta[8]
corr.slope.intercept.st.m <- male.skewt$theta[7]
skewness.intercept.st.m <- male.skewt$theta[9]
skewness.slope.st.m <- male.skewt$theta[10]
kurtosis.st.m <- male.skewt$theta[11]
marginal.res.st.m <- BMI.complete-X.mat.complete %*%
(matrix(fixed.effects.st.m, nrow=length(fixed.effects.st.m)))
conditional.res.st.m <- BMI.complete-(X.mat.complete %*%
(matrix(fixed.effects.st.m, nrow=length(fixed.effects.st.m))))-
apply(Z.mat.complete*bi.mat.st.m, 1, sum)
bmi.fitted.skewt.m <-(X.mat.complete %*%
(matrix(fixed.effects.st.m, nrow=length(fixed.effects.st.m)))) +
apply(Z.mat.complete*bi.mat.st.m, 1, sum)
pt(abs(male.skewt$theta[1:10]/male.skewt$desvios[1:10]), df=4000,
lower.tail=F)*2
# Skew-t mixed model including genetics
male.lme.skew.score <- lme(bmi ~ (I(age-8) + I((age-8)^2) +
I((age-8)^3)) * score, data=data.m, method="REML",
random = ~ 1 + I(age-8)+ I((age-8)^2) |ID, na.action=na.omit,
correlation=corCAR1())
y <- as.vector(bmi)
X.mat <- as.matrix(cbind(1, data.m$age-8, (data.m$age-8)^2,
(data.m$age-8)^3, data.m$score, (data.m$age-8) * data.m$score,
(data.m$age-8)^2 * data.m$score, (data.m$age-8)^3 *
data.m$score) )
Z.mat <- as.matrix(cbind(1, (data.m$age-8)))
X.mat.complete <- X.mat[apply(is.na(X.mat),1,sum)==0 & !is.na(bmi),]
Z.mat.complete <- Z.mat[apply(is.na(X.mat),1,sum)==0 & !is.na(bmi),]
BMI.complete <- bmi[apply(is.na(X.mat),1,sum)==0 & !is.na(bmi)]
ID.complete <- ID[apply(is.na(X.mat),1,sum)==0 & !is.na(bmi)]
histo <- histogram(~BMI.complete | ID.complete)
nj <- as.vector(summary(histo)[[2]])
beta1 <- fixed.effects(male.lme.skew.score)
sigmae <- male.lme.skew.score$sigma
D1<-diag(1,dim(Z.mat)[2])
D1[1,1] <- 5; D1[2,2] <- 0.07; D1[1,2]<-0.2; D1[2,1]<-0.2
lambda <- rep(1,dim(Z.mat)[2])
nu <- 10
male.skewt.score <- EM.Skew(nj=nj, y=BMI.complete, x=X.mat.complete,
z=Z.mat.complete, beta1=beta1, sigmae=sigmae, D1=D1,
lambda=lambda, nu=nu, Ind=2, lb=-Inf, lu=Inf, precisão=0.001,
loglik=T, informa=T, calcbi=T)
bi.skewt.m.score <-read.table("bi.txt")
bi.mat.st.m.score <- NULL
for (i in 1:nrow(bi.skewt.m.score)) {
bi.mat.st.m.score <- rbind(bi.mat.st.m.score,
cbind(rep(bi.skewt.m.score[i,1],nj[i]),
rep(bi.skewt.m.score[i,2],nj[i])))
}
fixed.effects.st.m.score <- male.skewt.score$theta[1:8]
sigmae.st.m.score <- male.skewt.score$theta[9]
variance.intercept.st.m.score <- male.skewt.score$theta[10]
variance.slope.st.m.score <- male.skewt.score$theta[12]
corr.slope.intercept.st.m.score <- male.skewt.score$theta[11]
skewness.intercept.st.m.score <- male.skewt.score$theta[13]
skewness.slope.st.m.score <- male.skewt.score$theta[14]
kurtosis.st.m.score <- male.skewt.score$theta[15]
marginal.res.st.m.score <- BMI.complete - X.mat.complete %*%
(matrix(fixed.effects.st.m.score,
nrow=length(fixed.effects.st.m.score)))
conditional.res.st.m.score <- BMI.complete - (X.mat.complete %*%
(matrix(fixed.effects.st.m.score,
nrow=length(fixed.effects.st.m.score))))-
apply(Z.mat.complete*bi.mat.st.m.score,1,sum)
bmi.fitted.skewt.m.score <-(X.mat.complete %*%
(matrix(fixed.effects.st.m.score,
nrow=length(fixed.effects.st.m.score)))) +
apply(Z.mat.complete*bi.mat.st.m.score,1,sum)
detach(data.m)
G.1.3 Semi-Parametric Linear Mixed Model (SPLMM)
######################
### Load libraries ###
######################
library(spida)
####################
### Female SPLMM ###
####################
sp.f <- function(x) gsp( x, knots = c((2-8), (8-8),(12-8)),
degree = c(3,3,3,3), smooth = c(2,2,2))
female.spline <- lme(log(bmi) ~ sp.f(age-8), data=data.f,
random=~sp.f(age-8)[,1:2]|ID, na.action=na.omit, method="ML")
summary(female.spline)
female.spline.score <- lme(log(bmi) ~ sp.f(age-8)*score, data=data.f,
random=~sp.f(age-8)[,1:2]|ID, na.action=na.omit, method="ML")
summary(female.spline.score)
##################
### Male SPLMM ###
##################
sp.m <- function(x) gsp( x, knots = c((2-8),(8-8),(12-8)),
degree = c(3,3,3,3), smooth = c(2,2,2))
male.spline <- lme(log(bmi) ~ sp.m(age-8), data=data.m,
random=~sp.m(age-8)[,1:2]|ID, na.action=na.omit, method="ML")
summary(male.spline)
male.spline.score <- lme(log(bmi) ~ sp.m(age-8)*score, data=data.m,
random=~sp.m(age-8)[,1:2]|ID, na.action=na.omit, method="ML")
summary(male.spline.score)
G.1.4 Non-linear mixed model (NLMM)
## sitarlib (the R library for SITAR) was provided by the author,
## Professor Tim Cole (Cole, T. J., M. D. Donaldson, et al. (2010).
## "SITAR--a useful instrument for growth curve analysis." Int J
## Epidemiol 39(6): 1558-1566.)
##############################
### Set control parameters ###
##############################
con.nlme <- nlmeControl(maxIter=10, pnlsMaxIter=10, msMaxIter=10,
returnObject=TRUE, msVerbose=FALSE)
###################
### Female NLMM ###
###################
female.sitar <- sitar(x=log(age), y=log(bmi), id=ID, data=data.f,
nk=3, random="a+b+c", a.formula=~1, b.formula=~-1, c.formula=~1,
d.formula=~-1, correlation=corCAR1(), control=con.nlme)
female.sitar.para <- merge(female.sitar.para, gen[,c("SUBJECTID",
"score")], by.x="ID", by.y="SUBJECTID", all.x=T)
female.sitar.score.size <- lm(BMI.size ~ score + BMI.tempo +
BMI.velocity, data=female.sitar.para)
summary(female.sitar.score.size)
female.sitar.score.temp <- lm(BMI.tempo ~ score + BMI.size +
BMI.velocity, data=female.sitar.para)
summary(female.sitar.score.temp)
female.sitar.score.vel <- lm(BMI.velocity ~ score + BMI.size +
BMI.tempo, data=female.sitar.para)
summary(female.sitar.score.vel)
#################
### Male NLMM ###
#################
male.sitar <- sitar(x=log(age), y=log(bmi), id=ID, data=data.m, nk=4,
random="a+b+c", a.formula=~1, b.formula=~-1, c.formula=~1,
d.formula=~-1, correlation=corCAR1(), control=con.nlme)
male.sitar.para <- merge(male.sitar.para, gen[,c("SUBJECTID",
"score")], by.x="ID", by.y="SUBJECTID", all.x=T)
male.sitar.score.size <- lm(BMI.size ~ score + BMI.tempo +
BMI.velocity, data=male.sitar.para)
summary(male.sitar.score.size)
male.sitar.score.temp <- lm(BMI.tempo ~ score + BMI.size +
BMI.velocity, data=male.sitar.para)
summary(male.sitar.score.temp)
male.sitar.score.vel <- lm(BMI.velocity ~ score + BMI.size +
BMI.tempo, data=male.sitar.para)
summary(male.sitar.score.vel)
G.2 Chapter Three
########################################
### SPLMM Model – including genetics ###
########################################
nfitF <- lme((bmi) ~ sp(age8)*snp, data=ndd,
random=~sp(age8)[,1:2]|ID, na.action=na.omit, method="ML",
correlation=corCAR1())
####################################
### SPLMM Model without genetics ###
####################################
nfitF_base <- tryCatch(lme((bmi) ~ sp(age8), data=ndd,
random=~sp(age8)[,1:2]|ID, na.action=na.omit, method="ML",
correlation=corCAR1()))
##############################
### Extract random effects ###
##############################
ranefs <- ranef(nfitF_base)
ranefs <- cbind(rownames(ranefs), ranefs)
names(ranefs) <- c("ID", "Intercept", "Slope", "Slope2")
ranefs <- merge(ranefs, SNP, by="ID")
#############################################
### Model genetics against random effects ###
#############################################
ranef_i <- lm(Intercept ~ snp, data=ranefs)
ranef_s <- lm(Slope ~ snp, data=ranefs)
G.3 Chapter Four
## Code from the simulations assuming normally distributed random
## errors and residuals with constant variance under the equal
## unbalanced sampling design
library(nlme)    # lme(), getVarCov(), VarCorr()
library(mvtnorm) # rmvnorm()
library(MASS)    # ginv()
#####################################
### Model using FTO SNP in ALSPAC ###
#####################################
rfitF <- lme((bmi) ~ (age8 + I((age8)^2) + I((age8)^3)) *
rs1121980_add + source.r, na.action=na.exclude, data=data,
random = ~ (age8) + I((age8)^2)|cid_724a, method="ML",
correlation=corCAR1())
summary(rfitF)
############################
### Base model in ALSPAC ###
############################
rfitF_base <- lme((bmi) ~ (age8 + I((age8)^2) + I((age8)^3)) +
source.r, data=data, method="ML", na.action=na.exclude,
random = ~ (age8) + I((age8)^2)|cid_724a, correlation=corCAR1())
summary(rfitF_base)
###############################################################
### Extract parameters from model fit to use in simulations ###
###############################################################
Phi = 0.3935524
bta <- as.numeric(fixef(rfitF))
varRan <- matrix(as.numeric(getVarCov(rfitF)),ncol=3)
varE <-as.numeric(VarCorr(rfitF)[4,1])
bta[5] <- 0
bta[7] <- 0
bta[8] <- 0
bta[9] <- 0
jr.begin=1
jr.finis=1000
maf=0.3
###############################################
### Creates design matrix for fixed effects ###
###############################################
N <- 1000
times <- 1:15
prob <- maf
ID <- as.factor(sort(rep(1:N, length(times))))
prob_s <- as.vector(c(0.4, 0.2, 0.4, 0.1, 0.6, 0.99, 0.1, 0.0, 0.0,
0.1, 0.0, 0.0, 0.3, 0.0, 0.0))
yr <- 1:(N*length(times))
ar <- sort(c((0:N*length(times))+5, (0:N*length(times))+6,
(0:N*length(times))+7))
#####################
### Generate data ###
#####################
beta_s <- data.frame()
se_s <- data.frame()
rse_s <- data.frame()
pval_s <- data.frame()
rpval_s <- data.frame()
lrt_s <- data.frame()
wald_s <- data.frame()
sample_s <- data.frame()
for(i in jr.begin:jr.finis){
repeat{
age <- as.data.frame(rep(times, N))
for(k in 1:nrow(age)){
age$age[k] <- runif(1, min=age[k,1]-0.5, max=age[k,1]+0.5)
age$age8[k] <- age$age[k]-8
age$source.r[k] <- rbinom(1, size=1,
prob=prob_s[round(age$age[k],0)])
}
snp <- sort(rep(rbinom(N, size=2, prob=prob), length(times)))
mdlMtx <- as.data.frame(cbind(1, age$age8, age$age8^2,
age$age8^3, snp, age$source.r, age$age8*snp,
(age$age8^2)*snp, (age$age8^3)*snp, ID))
names(mdlMtx) <- c("Intercept", "age8", "age8.2", "age8.3",
"rs1121980_add", "source.r", "age8.rs1121980_add",
"age8.2.rs1121980_add", "age8.3.rs1121980_add", "ID")
# delete these samples from between 5-7yrs (proportion = 0.4*(n-3))
u1 <- sample(ar, 0.4*(N*3), replace=F)
# delete this proportion of samples outside 5-7yrs
u2 <- sample(yr[!yr%in%ar], 0.4*(N*12), replace=F)
u <- c(u1,u2)
mdlMtx <- mdlMtx[-u,]
Zmtx <- mdlMtx[,1:3]
Zmtx$ID <- mdlMtx$ID
id <- data.frame(id=unique(mdlMtx$ID))
mdlMtx$age8 <- mdlMtx[,2]
# Normally distributed random effects and error
for (j in 1:nrow(id)) { #random effect, random error and correlation
Zj <- as.matrix(subset(Zmtx,Zmtx$ID == id$id[j])[, 1:3])
n <- nrow(Zj)
if(n==1){
Yj <- Zj%*% t(rmvnorm(1,sigma=varRan))+rnorm(n, mean=0, sd=sqrt(varE))
}
else{
Yj <- Zj%*% t(rmvnorm(1,sigma=varRan)) + t(rmvnorm(1,
sigma=CARmatrix(rho=Phi, Zj[,2])*varE))
}
if (j == 1 ){Y = Yj} else {Y =c(Y,Yj)}
}
# Add back in the population average
Y <- Y + as.matrix(mdlMtx[,1:9]) %*% as.matrix(bta)
# Create a data frame
ndd <- data.frame(mdlMtx$ID)
ndd$bmi <- Y
dx <-mdlMtx[,c("age8", "rs1121980_add", "source.r")]
ndd <-cbind(ndd,dx)
names(ndd)[1] <- "cid_724a"
# Update model with simulated data
nfitF <- tryCatch(update(rfitF,data=ndd), error=function(e) NA)
nfitF_base <- tryCatch(update(rfitF_base,data=ndd),
error=function(e) NA)
# Keep this replicate only if both models were fitted successfully
if(inherits(nfitF, "lme") && inherits(nfitF_base, "lme")){break}
}
# Generate robust standard errors
G <- matrix(as.numeric(getVarCov(nfitF)),ncol=3)
E <- cbind(mdlMtx$ID, nfitF$residuals[,1])
S <- matrix()
for(j in 1:nrow(id)){
Zj <- as.matrix(subset(Zmtx,Zmtx$ID == id$id[j])[, 1:3])
Xj <- as.matrix(subset(mdlMtx,mdlMtx$ID == id$id[j])[, 1:9])
Rj <- as.numeric(VarCorr(nfitF)[4,1]) * diag(nrow(Zj))
Sj <- t(Xj) %*% ginv((Zj %*% G %*% t(Zj)) + Rj) %*% Xj
if (j == 1 ){S = Sj} else {S = S + Sj}
}
M <- matrix()
for(j in 1:nrow(id)){
Zj <- as.matrix(subset(Zmtx,Zmtx$ID == id$id[j])[, 1:3])
Xj <- as.matrix(subset(mdlMtx,mdlMtx$ID == id$id[j])[, 1:9])
Ej <- as.matrix(subset(E,E[,1] == id$id[j])[,2])
Rj <- as.numeric(VarCorr(nfitF)[4,1]) * diag(nrow(Zj))
Mj <- t(Xj) %*% ginv((Zj %*% G %*% t(Zj)) + Rj) %*% Ej %*% t(Ej) %*%
ginv((Zj %*% G %*% t(Zj)) + Rj) %*% Xj
if (j == 1 ){M = Mj} else {M = M + Mj}
}
rVarCov <- ginv(S) %*% M %*% ginv(S)
# Create output dataframes
beta_s <- rbind(beta_s, c(i, as.numeric(fixef(nfitF))))
se_s <- rbind(se_s, c(i, as.numeric(summary(nfitF)$tTable[,2])))
rse_s <- rbind(rse_s, c(i, sqrt(abs(diag(rVarCov)))))
pval_s <- rbind(pval_s, c(i, as.numeric(summary(nfitF)$tTable[,5])))
rpval_s <- rbind(rpval_s, c(i,
pt(abs(summary(nfitF)$tTable[,1]/sqrt(abs(diag(rVarCov)))),
df=4000,lower.tail=F)*2))
lrt_s <- rbind(lrt_s, c(i, as.numeric(anova(nfitF_base, nfitF)[2,9])))
wald_s <- rbind(wald_s, c(i,
wald(nfitF, 'rs1121980_add')[[1]]$anova[[4]][1,1]))
sample_s <- rbind(sample_s, c(i, length(unique(ndd$cid_724a)),
nrow(ndd), mean(table(ndd$cid_724a)), sd(table(ndd$cid_724a))))
}
# Add names to output dataframes
names(beta_s) <- c("beta_i", "beta_intercept", "beta_age",
"beta_age2", "beta_age3", "beta_snp", "beta_source.r",
"beta_snp_age", "beta_snp_age2", "beta_snp_age3")
names(se_s) <- c("se_i", "se_intercept", "se_age", "se_age2",
"se_age3", "se_snp", "se_source.r", "se_snp_age", "se_snp_age2",
"se_snp_age3")
names(rse_s) <- c("rse_i", "rse_intercept", "rse_age", "rse_age2",
"rse_age3", "rse_snp", "rse_source.r", "rse_snp_age",
"rse_snp_age2", "rse_snp_age3")
names(pval_s) <- c("pval_i", "pval_intercept", "pval_age",
"pval_age2", "pval_age3", "pval_snp", "pval_source.r",
"pval_snp_age", "pval_snp_age2", "pval_snp_age3")
names(rpval_s) <- c("rpval_i", "rpval_intercept", "rpval_age",
"rpval_age2", "rpval_age3", "rpval_snp", "rpval_source.r",
"rpval_snp_age", "rpval_snp_age2", "rpval_snp_age3")
names(lrt_s) <- c("lrt_i", "lrt")
names(wald_s) <- c("wald_i", "wald")
names(sample_s) <- c("sample_i", "N_ID", "N_obs", "Mean_obs_ID",
"SD_obs_ID")
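The robust standard error loops above compute, for each individual j, Vj = Zj G Zj' + sigma^2 I, then accumulate S = sum of Xj' Vj^-1 Xj and M = sum of Xj' Vj^-1 ej ej' Vj^-1 Xj, and return S^-1 M S^-1. The following is a minimal sketch of the same calculation in a single loop on toy data. The function name robust_vcov and all input values are hypothetical and chosen only for illustration, and solve() is used in place of ginv() on the assumption that the matrices are invertible.

```r
# Hypothetical toy inputs: 3 individuals with 4 observations each
set.seed(1)
id <- rep(1:3, each = 4)
age <- rep(c(-2, -1, 0, 1), times = 3)
X <- cbind(1, age)                     # fixed-effects design (intercept, age)
Z <- cbind(1, age)                     # random-effects design (intercept, slope)
G <- matrix(c(0.5, 0.1, 0.1, 0.2), 2)  # assumed random-effects covariance
sigma2 <- 0.3                          # assumed residual variance
e <- rnorm(length(id))                 # stand-in for population-level residuals

robust_vcov <- function(X, Z, id, G, sigma2, e) {
  p <- ncol(X)
  S <- matrix(0, p, p)  # sum of Xj' Vj^-1 Xj
  M <- matrix(0, p, p)  # sum of Xj' Vj^-1 ej ej' Vj^-1 Xj
  for (j in unique(id)) {
    rows <- id == j
    Xj <- X[rows, , drop = FALSE]
    Zj <- Z[rows, , drop = FALSE]
    ej <- e[rows]
    # Marginal covariance of individual j's observations
    Vj <- Zj %*% G %*% t(Zj) + sigma2 * diag(nrow(Zj))
    Vinv <- solve(Vj)
    S <- S + t(Xj) %*% Vinv %*% Xj
    M <- M + t(Xj) %*% Vinv %*% ej %*% t(ej) %*% Vinv %*% Xj
  }
  solve(S) %*% M %*% solve(S)
}

rV <- robust_vcov(X, Z, id, G, sigma2, e)
robust_se <- sqrt(diag(rV))
```

Computing S and M in one pass avoids rebuilding Zj, Xj and the inverse of Vj twice per individual, which matters when the calculation is repeated for every simulation replicate or SNP.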
G.4 Chapter Five
######################
### Load libraries ###
######################
library(lattice)
library(nlme)
library(foreign)
source("wald.R")
library(MASS)
######################################
### Read in command line arguments ###
######################################
options(scipen=30)
args = commandArgs(TRUE)
print(args)
jr.begin=as.numeric(args[1])
jr.finis=as.numeric(args[2])
chr=as.numeric(args[3])
print(jr.begin)
print(jr.finis)
print(chr)
rm(args)
##########################
### Read in data files ###
##########################
data <- read.csv("RAINE_cleaned_GWAS.csv", na.strings=c("", " "))
inf.fname = paste("Chr",chr,"/step2.mlinfo", sep="")
info = read.table(file=inf.fname, header=T)
myinfo = info[(jr.begin-2):(jr.finis-2),]
rm(info)
dose.fpath = paste("Chr",chr,sep="")
setwd(dose.fpath)
unix.cmd = paste( "cut -d' ' -f 1,", jr.begin, "-", jr.finis,
" step2.mldose" ," ", sep="")
dose=read.table(pipe(unix.cmd))
#####################################
### Create ID column in dose file ###
#####################################
dose <- as.data.frame(dose)
dose$ID = substr(dose[,1], 1, (nchar(as.character(dose[,1]))/2)-1)
dose <- dose[,c(ncol(dose), 2:(ncol(dose)-1))]
dim(dose)
dose <- as.matrix(dose)
head(dose)[,1:5]
##################################
### Create function for models ###
##################################
sp <- function(x) gsp( x, knots = c((2-8), (8-8),(12-8)), degree =
c(3,3,3,3), smooth = c(2,2,2))
lme.fun =function(snp) {
geno = dose[,c(1,as.numeric(snp))]
geno <- as.data.frame(geno)
colnames(geno) <- c("ID", "g")
data <- merge(data, geno, by="ID")
data <- subset(data, !is.na(data$bmi) & !is.na(data$sex))
dim(data)
fit.snp = tryCatch( lme( log(bmi) ~ as.numeric(as.character(g))
* sp(age-8) + sex * sp(age-8) + PC1 + PC2 + PC3 + PC4 +
PC5, data=data, method="ML", na.action=na.exclude,
random = ~ sp(age-8)[,1:2]| ID,
correlation=corCAR1()),error=function(e) NA)
model <- summary(fit.snp)
mdlMtx <- as.data.frame(cbind(1,
as.numeric(as.character(data$g)), sp(data$age-8),
data$sex, data$PC1, data$PC2, data$PC3, data$PC4,
data$PC5, as.numeric(as.character(data$g))*sp(data$age-8),
data$sex*sp(data$age-8), data$ID))
Zmtx <- as.data.frame(cbind(1, sp(data$age-8)[,1:2], data$ID))
id <- as.data.frame(unique(data$ID))
data <- data[,-ncol(data)] # remove SNP from end column of data
if (inherits(fit.snp, "lme")) { # model was fitted successfully
G <- matrix(as.numeric(getVarCov(fit.snp)),ncol=3)
E <- cbind(data$ID, fit.snp$residuals[,1])
S <- matrix()
for(j in 1:nrow(id)){
Zj <- as.matrix(subset(Zmtx,Zmtx[,4] == id[j,1])[, 1:3])
Xj <- as.matrix(subset(mdlMtx, mdlMtx[,ncol(mdlMtx)] ==
id[j,1])[, 1:(ncol(mdlMtx)-1)])
Rj <- as.numeric(VarCorr(fit.snp)[4,1]) * diag(nrow(Zj))
Sj <- t(Xj) %*% ginv((Zj %*% G %*% t(Zj)) + Rj) %*% Xj
if (j == 1 ){S = Sj} else {S = S + Sj}
}
M <- matrix()
for(j in 1:nrow(id)){
Zj <- as.matrix(subset(Zmtx,Zmtx[,4] == id[j,1])[, 1:3])
Xj <- as.matrix(subset(mdlMtx, mdlMtx[,ncol(mdlMtx)] ==
id[j,1])[, 1:(ncol(mdlMtx)-1)])
Ej <- as.matrix(subset(E,E[,1] == id[j,1])[,2])
Rj <- as.numeric(VarCorr(fit.snp)[4,1]) * diag(nrow(Zj))
Mj <- t(Xj) %*% ginv((Zj %*% G %*% t(Zj)) + Rj) %*% Ej %*%
t(Ej) %*% ginv((Zj %*% G %*% t(Zj)) + Rj) %*% Xj
if (j == 1 ){M = Mj} else {M = M + Mj}
}
rVarCov <- ginv(S) %*% M %*% ginv(S)
wald_t <- try(wald(fit.snp, c(2, 15:20)), silent=TRUE)
if(class(wald_t)[1] != 'try-error'){
wald_p <- wald_t[[1]]$anova[[4]][1,1]
}else{
wald_p <- NA
}
wald_t_int <- try(wald(fit.snp, c(15:20)), silent=TRUE)
if(class(wald_t_int)[1] != 'try-error'){
wald_p_int <- wald_t_int[[1]]$anova[[4]][1,1]
}else{
wald_p_int <- NA
}
snp.out <- c(as.numeric(fixef(fit.snp))[grep("as.numeric",
names(fixef(fit.snp)))],
as.numeric(model$tTable[grep("as.numeric",
rownames(model$tTable)),2]),
as.numeric(model$tTable[grep("as.numeric",
rownames(model$tTable)),5]),
sqrt(abs(diag(rVarCov)))[c(2,15:20)],
(pt(abs(summary(fit.snp)$tTable[,1]/sqrt(abs(diag(rVarCov)
))),df=4000,lower.tail=F)*2)[c(2,15:20)],
wald_p, wald_p_int)
}
if (!inherits(fit.snp, "lme")) { # model fit failed
snp.out <- rep(NA,37)
}
return(snp.out)
}
G.5 Chapter Six
## ALSPAC code (similar code was used for Raine)
#######################
### Spline function ###
#######################
sp <- function(x) gsp( x, knots = c((2-8), (8-8),(12-8)), degree =
c(3,3,3,3), smooth = c(2,2,2))
######################################
### SPLMM Model for BMI in females ###
######################################
female <- lme(log(bmi) ~ sp(age_yr-8) + source.r, data=data.f,
random=~sp(age_yr-8)[,1:2]|cid_724a, na.action=na.omit,
method="ML", correlation=corCAR1(form = ~ 1 |cid_724a))
summary(female)
female.score <- lme(log(bmi) ~ sp(age_yr-8)*score + source.r,
data=data.f, random=~sp(age_yr-8)[,1:2]|cid_724a, method="ML",
na.action=na.omit, correlation=corCAR1(form = ~ 1 |cid_724a))
summary(female.score)
anova(female, female.score)$"p-value"
wald(female.score, 10:15)
####################################
### SPLMM Model for BMI in males ###
####################################
male <- lme(log(bmi) ~ sp(age_yr-8) + source.r, data=data.m,
random=~sp(age_yr-8)[,1:2]|cid_724a, na.action=na.omit,
method="ML", correlation=corCAR1(form = ~ 1 |cid_724a))
summary(male)
male.score <- lme(log(bmi) ~ sp(age_yr-8)*score + source.r,
data=data.m, random=~sp(age_yr-8)[,1:2]|cid_724a, method="ML",
na.action=na.omit, correlation=corCAR1(form = ~ 1 |cid_724a))
summary(male.score)
anova(male, male.score)$"p-value"
wald(male.score, 10:15)
##############################
### Weight spline function ###
##############################
sp <- function(x) gsp( x, knots = c((2-8), (8-8),(12-8)), degree =
c(1,3,3,2), smooth = c(2,2,2))
###################################
### SPLMM for weight in females ###
###################################
female.wt <- lme(log(weight) ~ sp(age_yr-8) + source.r, data=data.f,
random=~sp(age_yr-8)[,1:2]|cid_724a, na.action=na.omit,
method="ML", correlation=corCAR1(form = ~ 1 |cid_724a))
summary(female.wt)
female.score.wt <- lme(log(weight) ~ sp(age_yr-8)*score + source.r,
data=data.f, random=~sp(age_yr-8)[,1:2]|cid_724a, method="ML",
na.action=na.omit, correlation=corCAR1(form = ~ 1 |cid_724a))
summary(female.score.wt)
anova(female.wt, female.score.wt)$"p-value"
wald(female.score.wt, 7:9)
#################################
### SPLMM for weight in males ###
#################################
male.wt <- lme(log(weight) ~ sp(age_yr-8) + source.r, data=data.m,
random=~sp(age_yr-8)[,1:2]|cid_724a, na.action=na.omit,
method="ML", correlation=corCAR1(form = ~ 1 |cid_724a))
summary(male.wt)
male.score.wt <- lme(log(weight) ~ sp(age_yr-8)*score + source.r,
data=data.m, random=~sp(age_yr-8)[,1:2]|cid_724a, method="ML",
na.action=na.omit, correlation=corCAR1(form = ~ 1 |cid_724a))
summary(male.score.wt)
anova(male.wt, male.score.wt)$"p-value"
wald(male.score.wt, 7:9)
L <- c(0,0,0,0,1,0,sp(1.75-8)*1) # Test when effect occurs
wald(female.score.wt, L)
L <- c(0,0,0,0,1,0,sp(1-8)*1)
wald(male.score.wt, L)
##############################
### Height spline function ###
##############################
sp <- function(x) gsp( x, knots = c((2-8), (8-8),(12-8)), degree =
c(3,3,3,3), smooth = c(2,2,2))
###################################
### SPLMM for height in females ###
###################################
female.ht <- lme(height ~ sp(age_yr-8) + source.r, data=data.f,
random=~sp(age_yr-8)[,1:2]|cid_724a, na.action=na.omit,
method="ML", correlation=corCAR1(form = ~ 1 |cid_724a))
summary(female.ht)
female.score.ht <- lme(height ~ sp(age_yr-8)*score + source.r,
data=data.f, random=~sp(age_yr-8)[,1:2]|cid_724a, method="ML",
na.action=na.omit, correlation=corCAR1(form = ~ 1 |cid_724a))
summary(female.score.ht)
anova(female.ht, female.score.ht)$"p-value"
wald(female.score.ht, 10:15)
#################################
### SPLMM for height in males ###
#################################
male.ht <- lme(height ~ sp(age_yr-8) + source.r, data=data.m,
random=~sp(age_yr-8)[,1:2]|cid_724a, na.action=na.omit,
method="ML", correlation=corCAR1(form = ~ 1 |cid_724a))
summary(male.ht)
male.score.ht <- lme(height ~ sp(age_yr-8)*score + source.r,
data=data.m, random=~sp(age_yr-8)[,1:2]|cid_724a, method="ML",
na.action=na.omit, correlation=corCAR1(form = ~ 1 |cid_724a))
summary(male.score.ht)
anova(male.ht, male.score.ht)$"p-value"
wald(male.score.ht, 10:15)
L <- c(0,0,0,0,0,0,0,1,0,sp(3.5-8)*1)
wald(female.score.ht, L)
L <- c(0,0,0,0,0,0,0,1,0,sp(3.5-8)*1)
wald(male.score.ht, L)
#####################################
### Calculating adiposity rebound ###
#####################################
sp <- function(x) gsp( x, knots = c((2), (8),(12)), degree =
c(3,3,3,3), smooth = c(2,2,2))
female2.ar <- lme(log(bmi) ~ sp(age_yr) + source.r, data=data.f,
random=~sp(age_yr)[,1:2]|cid_724a, na.action=na.omit,
method="ML", correlation=corCAR1(form = ~ 1 |cid_724a))
male2.ar <- lme(log(bmi) ~ sp(age_yr) + source.r, data=data.m,
random=~sp(age_yr)[,1:2]|cid_724a, na.action=na.omit,
method="ML", correlation=corCAR1(form = ~ 1 |cid_724a))
k1 <- 2
k2 <- 8
k3 <- 12
AR <- data.frame()
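ar_alspac <- function(fit, dd1){
  # Fixed effects without the intercept; used for the slope of log(BMI)
  fxef <- as.numeric(fixef(fit)[2:8])
  # All fixed effects including the intercept; used for the predicted BMI
  fxef2 <- as.numeric(fixef(fit))
  for(i in 1:nrow(dd1)){
    # Add individual i's random slopes to the first two spline coefficients
    coeff <- fxef + c(as.numeric(ranef(fit)[i,2:3]), rep(0,5))
    # Age at adiposity rebound: root of the first derivative of the
    # log(BMI) curve between the knots at 2 and 8 years
    y <- try(uniroot(function(x) coeff[1] + coeff[2]*x +
      coeff[3]*x^2/2 + coeff[4]*(x-k1)^2/2, lower=k1, upper=k2)$root, TRUE)
    AR[i,1] <- ifelse(class(y) != "try-error", y, NA)
    # Add individual i's random intercept and slopes
    coeff2 <- fxef2 + c(as.numeric(ranef(fit)[i,]), rep(0,5))
    # BMI at adiposity rebound, back-transformed from the log scale
    AR[i,2] <- ifelse(class(y) != "try-error", exp(coeff2[1] +
      coeff2[2]*y + (coeff2[3]*(y^2))/2 + (coeff2[4]*(y^3))/6 +
      (coeff2[5]*((y-k1)^3))/6), NA)
  }
  out <- data.frame(AR)
  out$cid_724a <- dd1$cid_724a
  out
}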
dd1 <- as.data.frame(unique(data.f$cid_724a))
names(dd1)[1] <- "cid_724a"
AR_girls2 <- ar_alspac(female2.ar, dd1)
names(AR_girls2) <- c("Age2", "BMI2", "cid_724a")
dd1 <- as.data.frame(unique(data.m$cid_724a))
names(dd1)[1] <- "cid_724a"
AR_boys2 <- ar_alspac(male2.ar, dd1)
names(AR_boys2) <- c("Age2", "BMI2", "cid_724a")
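The adiposity rebound calculation above finds the age between the first two knots at which the slope of the log(BMI) curve is zero, then back-transforms the predicted log(BMI) at that age. The following is a minimal sketch of that step for a single child. The coefficient values are hypothetical and chosen only so that the slope changes sign between 2 and 8 years; they are not estimates from ALSPAC or Raine.

```r
# Hypothetical coefficients for one child, for illustration only
k1 <- 2  # first knot (years)
k2 <- 8  # second knot (years)
# Intercept, linear, quadratic, cubic, and cubic term after the knot at k1
b <- c(3.0, -0.12, 0.03, -0.004, 0.006)

# log(BMI) curve on the segment between k1 and k2
log_bmi <- function(x) {
  b[1] + b[2]*x + b[3]*x^2/2 + b[4]*x^3/6 + b[5]*(x - k1)^3/6
}

# First derivative of log_bmi with respect to age
slope <- function(x) {
  b[2] + b[3]*x + b[4]*x^2/2 + b[5]*(x - k1)^2/2
}

# Age at adiposity rebound: where the slope crosses zero
ar_age <- uniroot(slope, lower = k1, upper = k2)$root
# BMI at adiposity rebound, back-transformed from the log scale
ar_bmi <- exp(log_bmi(ar_age))
```

uniroot() requires the slope to have opposite signs at the two knots, which is why the function above wraps the call in try() and records NA for children whose fitted curve has no turning point in that interval.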