
Article (Resources)

24th Feb, 2020

CoalQC - Quality control while inferring demographic histories from genomic data: Application to forest tree genomes

Ajinkya Bharatraj Patil1, Sagar Sharad Shinde1, Raghavendra S2, Satish B.N3, Kushalappa C.G3, Nagarjun Vijay1

1Computational Evolutionary Genomics Lab, Department of Biological Sciences, IISER Bhopal, Bhauri, Madhya Pradesh
2College of Agriculture Hassan, UAS Bangalore
3College of Forestry, Ponnampet, Kodagu

*Corresponding authors: [email protected] & [email protected]

Running head: quality control of demographic inference.

Keywords: demographic history inference, Mesua ferrea, whole-genome assembly, PSMC, repeat sequences, forest plants.

bioRxiv preprint doi: https://doi.org/10.1101/2020.03.03.962365; this version posted March 5, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. It is made available under a CC-BY 4.0 International license.


Abstract

Estimating demographic histories using genomic datasets has proven to be useful in addressing diverse evolutionary questions. Despite improvements in inference methods and availability of large genomic datasets, the quality control steps to be performed prior to the use of sequentially Markovian coalescent (SMC) based methods remain understudied. While various filtering and masking steps have been used by previous studies, the rationale for such filtering and its consequences have not been assessed systematically. In this study, we have developed a reusable pipeline called "CoalQC" to investigate potential sources of bias (such as repeat regions, heterogeneous coverage, and callability). First, we demonstrate that genome assembly quality can affect the estimation of demographic history using the genomes of several species. We then use the CoalQC pipeline to evaluate how different repeat classes affect the inference of demographic history in the plant species Populus trichocarpa. Next, we assemble a draft genome by generating whole-genome sequencing data for Mesua ferrea (sampled from Western Ghats, India), a multipurpose forest plant distributed across tropical south-east Asia, and use it as an example to evaluate several technical (sequencing technology, PSMC parameter settings) and biological aspects that need to be considered while comparing demographic histories. Finally, we collate the genomic datasets of 14 additional forest tree species to compare the temporal dynamics of Ne and find evidence of a strong bottleneck in all tropical forest plants during Mid-Pleistocene glaciations. Our findings suggest that quality control prior to the use of SMC based methods is important and needs to be standardised.


Introduction

Coalescent theory continues to be the fundamental tool for the study of gene genealogies in a population genetic framework. Changes in coalescence time and the Watterson estimator of genetic diversity (θw) along the genome serve as a record of the population history of a species through time. Increasing availability of whole genomic datasets for a multitude of species has made it possible to analyse demographic histories to answer a suite of questions such as host-parasite co-evolution (Hecht et al. 2018), effect of climate change on population dynamics (Bai et al. 2018), hybridization (Vijay et al. 2016), speciation events and split times between species (Cahill et al. 2016), history of inbreeding (Prado-Martinez et al. 2013), mutational meltdown (Rogers and Slatkin 2017), detecting population decline and addressing threats of extinction (Mays et al. 2018). Such widespread use of genomic datasets for coalescent inferences was made possible by the introduction of the PSMC method (Li and Durbin 2011) that requires only one diploid genome sequencing dataset. Technical advances in the use of genomic datasets for making demographic inferences and prevalence of multi-individual datasets facilitated by reductions in sequencing cost now allow integration of information across an increasing number of individuals (Schiffels and Durbin 2014; Terhorst et al. 2016; Palamara et al. 2018).

Despite the widespread use of demographic history inference methods like PSMC, many potential sources of bias due to data quality have been identified and efforts to reduce such effects are considered important. Earlier studies have shown that low coverage regions, ascertainment bias, hyperdiverse sequences, the fraction of usable data available as well as population structure will affect the estimation and interpretation of demographic histories (Li and Durbin 2011; Mazet et al. 2015; Nadachowska-Brzyska et al. 2016). Notably, some of the PSMC parameters or options such as mutation rate and generation time are known to drastically change the scaling of the curve, while the trajectory remains unchanged (Nadachowska-Brzyska et al. 2015). Detailed guidelines for the use of PSMC and MSMC are described elsewhere (Mather et al. 2020). Although the effect of genome assembly quality on demographic inferences has not been systematically assessed, it has been noted that genome quality could bias the results (Tiley et al. 2018). Intriguingly, a recent paper investigated the effect of genome quality and concluded that contemporary demographic inference methods are robust to the quality of the reference genome used (Patton et al. 2019).


In general practice, repetitive elements and low coverage regions of the genome are masked before the analyses to overcome biased inferences (Foote et al. 2016). It has been suggested that at least 75% of the genome should be retained after masking for robust inference of demographic history. However, organisms that have high (>30%) repeat content would violate either the masking criteria or the fraction of the genome to be retained. While several such quality control steps have been applied prior to the use of PSMC on genomic data, the rationale for performing specific filtering steps and the adverse consequences of skipping such quality control are understudied. A standardised quality control pipeline would be able to alleviate some of these challenges.

In this study, we have created a reusable pipeline (CoalQC) for evaluating the quality of genome-wide coalescence inferences and demonstrate the utility of the tool using a newly generated ~180X coverage whole genome sequencing dataset of Mesua ferrea (a tropical tree species distributed across south-east Asia) as well as public re-sequencing datasets of several species. We investigate the relevance of various filtering practices and specifically answer the following questions:

1. Does genome assembly quality affect the inference of coalescence histories?
2. How does the coalescence history differ between repeat classes?
3. Which biological (change in genome size) and technical (sequencing platform used, PSMC parameter settings) factors influence demographic inference?
4. What can the comparison of demographic histories of forest plants reveal?

Our exploration of how several technical aspects can affect the inference of coalescence histories is relevant not just for use of the PSMC program, but also for numerous other tools that make coalescent inferences using genomic datasets. We also apply our pipeline to compare the demographic history of forest trees to evaluate whether ecologically relevant hypotheses can be robustly tested using demographic inference methods.

New Approaches

CoalQC

We have implemented a re-usable pipeline to perform quality control prior to the use of genome-wide coalescent methods. Separate modules to evaluate the effects of repeat regions, coverage and callability have been implemented to allow extensive quality control. The repeat module estimates independent demographic histories using genomic regions of one repeat class at a time along with non-repeat regions of the genome. Informative graphs that (a) compare these independent estimates of Ne, (b) quantify the relative abundance of each repeat class in various atomic intervals, (c) assess the robustness of the inferred results using bootstrap replicates, and (d) visualise trends of change in heterozygosity and Ts/Tv ratio are generated by this module. We believe this module will be a valuable quality check prior to masking specific repeat regions of the genome.

The module for coverage is specifically designed to evaluate the robustness of the results to different coverage thresholds. Genomic regions are divided into several cumulative coverage classes based on the local read depth. These coverage classes are then used to independently estimate demographic histories and generate a comparative graph that can be used to understand the robustness of the results to coverage constraints. Similar to the coverage module, the callability module divides the genome into several callability classes to identify regions of the genome that need to be excluded from the analysis by masking. Detailed instructions and example commands for the use of the pipeline are provided on the GitHub repository of the CoalQC program (https://github.com/ceglab/coalqc).
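The idea of cumulative coverage classes can be illustrated with a small sketch. This is not the CoalQC implementation: the function name, the window depths and the thresholds are hypothetical, and "cumulative" is assumed here to mean that each class contains every window at or below a given depth bound, so that classes are nested.

```python
def cumulative_coverage_classes(window_depths, upper_bounds):
    """Return {bound: [window indices]} where each class holds every window
    whose mean depth is at or below that bound (so classes are nested)."""
    classes = {}
    for bound in sorted(upper_bounds):
        classes[bound] = [i for i, depth in enumerate(window_depths) if depth <= bound]
    return classes


# Hypothetical mean read depths for five genomic windows.
depths = [12.0, 35.5, 48.0, 90.2, 150.0]
classes = cumulative_coverage_classes(depths, upper_bounds=[40, 100, 200])
for bound, members in classes.items():
    print(f"depth <= {bound}: windows {members}")
```

Under this assumption, each class would be used for a separate demographic inference, and the resulting Ne trajectories compared across classes.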

Results

Does genome quality affect demographic inference?

Genome assembly quality encompasses multiple factors such as sequence contiguity (generally quantified as N50), number and length of gaps, the fraction of genes assembled (quantified using BUSCOs) and the fraction of the genome assembled (quantified based on the percent of reads mapping to the genome assembly). To assess the effect of genome quality on demographic inference, we compared Ne trajectories estimated using a single human individual (NA12878) mapped to five different versions (hg4, hg10, hg15, hg19, and hg38) of human genome assemblies with varying levels of quality. We found that all the measures of genome quality used by us showed an improvement in recent versions of the human genome (see Table S1). The estimated effective population size (Ne) showed greater variability between genome assembly versions during the ancient (i.e., 1-7 MYA; mean of standard deviations in Ne of each atomic interval from 43 to 64 = 0.84) and recent (i.e., 0-15 KYA; mean of standard deviations in Ne of each atomic interval from 0 to 6 = 0.82) time periods compared to the mid-time period (i.e., 100-400 KYA; mean of standard deviations in Ne of each atomic interval from 18 to 32 = 0.29) (see Fig. 1). PSMC trajectories of earlier versions of the human genome (with poorer assembly quality metrics) showed higher estimates of Ne during the ancient (~1-7 MYA) and recent (~0-15 KYA) times and lower estimates of Ne during the mid-time period (~100-400 KYA) compared to the recent versions of the human genome.

To evaluate the effect of assembly quality on the robustness of results, we performed 100 bootstrap runs using each of the human genome versions considered. The heterogeneity between bootstrap replicates within each version was quantified as the coefficient of variation (CV) across the 100 replicates. While the CV of the earliest version of the human genome assembly (mean of the CV of Ne across bootstrap replicates of the hg4 assembly = 0.054) was higher than that of the more recent assemblies (hg10 = 0.041, hg15 = 0.042, hg19 = 0.045, and hg38 = 0.043) considered by us, the CVs of all the assemblies are very similar (see Fig. 1b). Comparable estimates of Ne across bootstrap runs suggest that these estimates are being robustly inferred for each specific genome assembly.
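As a rough illustration of this robustness measure, the sketch below computes a per-interval CV across bootstrap replicates and then averages it over atomic intervals. The function name and the toy values are hypothetical, and the use of the sample standard deviation is an assumption; the text does not state which estimator was used.

```python
from statistics import mean, stdev


def mean_cv_across_intervals(bootstrap_ne):
    """bootstrap_ne: list of replicates, each a list of Ne values per atomic interval.
    Returns the mean over intervals of (sample SD / mean) across replicates."""
    n_intervals = len(bootstrap_ne[0])
    cvs = []
    for k in range(n_intervals):
        values = [replicate[k] for replicate in bootstrap_ne]
        cvs.append(stdev(values) / mean(values))
    return mean(cvs)


# Toy example: three replicates, two atomic intervals (values are made up).
replicates = [[1.0, 10.0], [1.0, 12.0], [1.0, 14.0]]
mean_cv = mean_cv_across_intervals(replicates)
print(mean_cv)
```

A low mean CV indicates that the replicates agree with each other for a given assembly; it does not by itself indicate that the assembly yields an accurate trajectory.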

To ensure that the effect of genome assembly quality is not limited to just the human genome, we compared the demographic histories inferred from the initial and recent versions of the Tribolium castaneum and Danio rerio genomes (see Table S1). We find that similar to the differences seen between different versions of the human genome assembly, the estimates of Ne inferred from different versions of the genome show distinct trends (see Fig. S1). Our results from the human, red flour beetle (Tribolium castaneum) and zebrafish (Danio rerio) genomes suggest that genome quality does have a noticeable effect on demographic inference.

How do repeat regions affect demographic inference?

Prior to performing demographic inference, repeat regions of the genome are generally masked and excluded from the analysis. Masking of repeat regions is justified by the high risk of assembly errors, collapsed segmental duplications and mis-mapping of short reads in repeat regions. Plant genomes with a high fraction of repetitive content are more prone to be affected by repeats. Hence, we decided to use the high-quality genome of the plant Populus trichocarpa to compare the Ne trajectories inferred using masked and unmasked genomes to understand the magnitude of the change introduced by masking of repeat regions. The estimates of Ne from the masked compared to the unmasked genome were lower during the ancient time period, i.e., after ~1 MYA (mean difference in Ne across atomic intervals 48 to 64 = 0.63 x 10^4), and higher during recent times, i.e., 20 KYA to 100 KYA (mean difference in Ne across atomic intervals 5 to 17 = 0.45 x 10^4, see Fig. 2a).

To evaluate how specific repeat classes affect the estimates of Ne, each repeat family was unmasked while keeping other repeats masked (see Fig. 2b). Estimates of Ne after inclusion of LTR-Gypsy were intermediate between masked and unmasked genome-based inferences during the ancient past, i.e., after 1 MYA (mean difference in Ne compared to the masked genome across atomic intervals 48 to 64 = 0.29 x 10^4), whereas they showed a similar trend as the unmasked genome during recent times, i.e., 20 KYA to 100 KYA (mean difference in Ne compared to the masked genome across atomic intervals 5 to 17 = 0.44 x 10^4; see Fig. 2c). Other repeat classes did not influence the estimates as much as LTRs and were closer to the masked inference (see Fig. 2b). The robustness of the Ne estimates was assessed based on the variability (quantified as CV) between bootstrap replicates using the non-repeat fraction of the genome along with each individual repeat class. The CV was heterogeneous between repeat classes and was relatively higher in recent time intervals (see Fig. 2c). Robustness of the estimated values of Ne was comparable between the unmasked (mean of the CV across atomic intervals = 0.04687) and masked (mean of the CV across atomic intervals = 0.047) genomes.

We found that the fraction of repeat content in a particular atomic interval was positively correlated (τ = 0.346, p-value = 0.0003, see Fig. S2) with the absolute difference between masked and unmasked genome-based estimates of effective population size (Ne). Having established that greater repeat abundance would more strongly affect estimates of Ne, we quantified repeat family-wise abundance in genomic regions corresponding to each atomic interval. In all atomic intervals, the non-repeat fraction was found to be the most abundant (see Fig. 2d). Among the repeat classes, LTR-Gypsy had the highest abundance in most of the atomic intervals. The extremely high abundance of LTR-Gypsy repeats in the first few atomic intervals could have led to the drastic change in the Ne trajectory during recent times (i.e., 20 KYA to 100 KYA) after inclusion of LTR-Gypsy repeats. LTRs and RC-Helitron have high abundance at the genome-wide level (see Fig. 2e) and have a greater influence on the estimates of Ne (see Fig. 2b).
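The rank correlation reported above can be illustrated with a minimal sketch. It assumes that τ denotes Kendall's tau and uses the tau-b variant, which handles ties; the text does not specify the variant, and no p-value is computed here. The function name and the input values are hypothetical.

```python
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b computed by direct pair counting (O(n^2))."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: excluded from every term
            if dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = sqrt((concordant + discordant + ties_x) * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom if denom else float("nan")


# Made-up per-interval values: repeat fraction vs. |Ne(masked) - Ne(unmasked)|.
repeat_fraction = [0.10, 0.15, 0.20, 0.35]
abs_ne_difference = [0.2, 0.1, 0.4, 0.6]
tau = kendall_tau_b(repeat_fraction, abs_ne_difference)
print(tau)
```

A positive value indicates that atomic intervals with more repeat content tend to show larger differences between masked and unmasked estimates.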

Genomic regions are assigned to a specific atomic interval based on the TMRCA of that region. This leads to a trend of increasing levels of heterozygosity from atomic intervals that correspond to recent time points to those that correspond to older time points. Each of the repeat classes independently shows this trend of increase in heterozygosity similar to non-repeat regions (see Fig. 3). Comparable estimates of heterozygosity between repeat and non-repeat regions suggest that the heterozygous sites identified in repeat regions are not merely variant calling artefacts. The ratio (Ts/Tv) of the number of transitions (Ts) to the number of transversions (Tv) has been used to evaluate the accuracy of variant call sets (Wang et al. 2015). Regions of the genome with artefactual variant calls would have a Ts/Tv ratio very different from the genomic average. Hence, as an additional validation of the variants identified within repeat regions, we calculated the Ts/Tv ratio for each repeat class by atomic interval. While the estimates of heterozygosity showed an increasing trend towards older atomic intervals, we found that the Ts/Tv ratio did not show any discernible trend (see Fig. 3). Similar estimates of the Ts/Tv ratio in repeat and non-repeat regions suggest that the heterozygous sites identified in repeat regions are truly polymorphic.
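For readers unfamiliar with the Ts/Tv check, the following sketch shows how the ratio can be computed from a list of single-nucleotide allele pairs. The function name and the example sites are hypothetical; in practice the allele pairs would come from the variant calls falling within one repeat class and one atomic interval.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def ts_tv_ratio(allele_pairs):
    """Count transitions (purine<->purine or pyrimidine<->pyrimidine) and
    transversions (purine<->pyrimidine) and return their ratio."""
    transitions = transversions = 0
    for first, second in allele_pairs:
        first, second = first.upper(), second.upper()
        valid = PURINES | PYRIMIDINES
        if first == second or first not in valid or second not in valid:
            continue  # skip non-variant or ambiguous sites
        same_class = ({first, second} <= PURINES) or ({first, second} <= PYRIMIDINES)
        if same_class:
            transitions += 1
        else:
            transversions += 1
    return transitions / transversions if transversions else float("nan")


# Made-up heterozygous sites: three transitions and one transversion.
sites = [("A", "G"), ("C", "T"), ("T", "C"), ("A", "T")]
ratio = ts_tv_ratio(sites)
print(ratio)
```

Comparing this ratio between repeat and non-repeat regions, interval by interval, is the kind of check described above.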

Repeat regions of the genome have very high coverage due to the mapping of reads from multiple copies and low callability due to a large number of mismatches across the reads mapped to the same genomic region. Hence, some studies tend to mask genomic regions using criteria based on coverage or callability instead of the presence of repeats. We separately evaluated the effect of masking genomic regions based on coverage or callability classes (see Fig. S3 and S4) and find that masking based on these criteria needs to be treated independently of repeat region-based masking.

Which biological and technical factors influence demographic inference?

Several technical factors such as the optimal PSMC parameter settings, the sequencing platform used, the prevalence of cross-contamination from closely related species, and misleading or incomplete metadata in public datasets are important considerations during the comparative interpretation of demographic history. Similarly, biological factors such as the prevalence of whole-genome duplications, changes in the karyotype or genome size and high intraspecific variation in genetic diversity also need to be considered. We generated whole-genome sequencing data for a tropical plant species (Mesua ferrea) and use it as an example to understand how biological and technical factors influence demographic inference. For any newly sequenced genome, the PSMC parameters -r (initial theta/rho ratio), -p (pattern of parameters specifying the distribution of free intervals and atomic intervals) and -t (the maximum time to TMRCA) need to be optimised so that all the atomic intervals have a sufficient number of recombination events.

The maximum time to TMRCA

For the first PSMC run of the Mesua ferrea genome, the options -t 5 -r 5 -p "4+25*2+4+6" were used. However, the resultant PSMC output did not have enough recombination events in some of the atomic intervals. Therefore, the -p parameter was optimized until all the atomic intervals had a sufficient number of recombination events. A mutation rate of 2.5e-09 per site per year, i.e., 3.75e-08 per site per generation assuming a generation time of 15 years, was used for scaling the results. The resultant trajectory did not extend to older (i.e., beyond 150 KYA; see Fig. 4a) time points. So, we decided to optimize all the parameters so that the trajectory would give meaningful results beyond 150 KYA. Hence, the maximum time to TMRCA, i.e., the -t parameter, was increased so that the trajectory extended to older (i.e., 150 to 400 KYA; see Fig. 4a) time points. The -p parameter was optimized along with -t, as increasing -t gave fewer recombination events in some of the atomic intervals. While maintaining 64 atomic intervals, we were able to get a reliable demographic trajectory with the options -t 65 -r 5 -p "3+2*17+15*1+1*12", i.e., 64 atomic intervals distributed across 19 free intervals (1+2+15+1). To determine how far the trajectory might be extended back in time by increasing -t, we used -t 500 for one run, which however did not have a sufficient number of recombination events in most of the atomic intervals.
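The interval counts quoted above can be checked with a short sketch that expands a -p pattern string. It relies on the usual reading of the PSMC pattern syntax, in which a term "k" is one free parameter spanning k atomic intervals and a term "n*k" is n free parameters each spanning k atomic intervals; the helper names are hypothetical and not part of PSMC or CoalQC. The last lines reproduce the per-generation mutation rate from the per-year rate and the assumed generation time.

```python
def parse_psmc_pattern(pattern):
    """Expand a PSMC -p pattern into a list with one entry per free parameter,
    each entry giving the number of atomic intervals that parameter spans."""
    spans = []
    for term in pattern.split("+"):
        if "*" in term:
            count, width = term.split("*")
            spans.extend([int(width)] * int(count))
        else:
            spans.append(int(term))
    return spans


def summarize_pattern(pattern):
    """Return (number of atomic intervals, number of free parameters)."""
    spans = parse_psmc_pattern(pattern)
    return sum(spans), len(spans)


for pattern in ("4+25*2+4+6", "3+2*17+15*1+1*12"):
    atomic, free = summarize_pattern(pattern)
    print(f"{pattern}: {atomic} atomic intervals, {free} free parameters")

# Per-generation mutation rate from the per-year rate and generation time.
mutation_rate_per_year = 2.5e-09
generation_time_years = 15
mutation_rate_per_generation = mutation_rate_per_year * generation_time_years
print(mutation_rate_per_generation)
```

Both patterns expand to 64 atomic intervals, with 28 free parameters for the initial pattern and 19 for the optimised one, matching the counts given in the text.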

To understand how increasing the maximum time to TMRCA altered the Ne trajectory, we considered some of the longest scaffolds and visualised the assignment of specific genomic regions to various atomic intervals. Comparing the atomic intervals assigned to the same genomic region at different values of -t, we found that genomic regions which were assigned to older atomic intervals for smaller values of -t were assigned to relatively recent atomic intervals with an increase in the -t parameter (see Fig. 4b). This redistribution of regions with increasing values of -t can be better understood by looking at changes in the distribution of lengths of genomic regions assigned to each atomic interval (see Fig. S5). For instance, in the case of Mesua ferrea, the length of older atomic intervals tends to decrease with increasing values of -t. Genomic regions contributing to the older atomic intervals at higher values of the -t parameter become shorter and highly heterozygous. We ensured that such short, highly heterozygous regions are not merely variant calling artefacts by visualising the atomic intervals assigned to genomic regions along scaffolds together with the associated heterozygosity and callability at these regions (see Fig. 4b).

The initial theta/rho ratio

The ratio of heterozygosity to recombination rate generally does not have an impact on the trajectory inferred using PSMC, but an increase in the value of -r does increase the number of recombination events and achieves convergence faster. We used the same -p and -t values (i.e., -t 65 -p "3+2*17+15*1+1*12") with different values of -r. With decreasing values of -r, the Ne trajectory of Mesua ferrea extended further back in time (i.e., beyond 400 KYA; see Fig. S6a). However, the atomic intervals corresponding to the extended trajectory did not have a sufficient number of recombination events for -r values less than 5. When the -r values were set at 5 or more, convergence was achieved faster (see Fig. S6b).

Does the sequencing platform affect the result?

We have previously shown that the trajectory of Ne can show extremely contrasting patterns between different populations of the same species (Vijay et al. 2018). To understand the variability in the demographic histories of different populations of Mesua ferrea, we wanted to sample additional populations. Upon searching the European Nucleotide Archive (ENA), we found that a re-sequencing dataset labeled as Mesua ferrea sampled from Yunnan, China (see Table S2) was available for download. Surprisingly, the demographic trajectory inferred using this dataset extended back in time with -t of 5 (optimised with -r 5 -p "4+25*2+5*2") and gave a different inference (see Fig. S7). However, we found that the sequencing for the sample from Yunnan had been performed using the BGISEQ-500 platform. In order to rule out the possibility of sequencing platform-specific technical issues, we compared the demographic trajectory of the human individual NA12878 sequenced using BGISEQ-500 (see Table S2 for dataset details) with the trajectory obtained for the same individual when the sequencing was performed using the Illumina platform (see Fig. S8). We did not find any differences in the Ne trajectories estimated using the BGISEQ-500 and Illumina platforms.

Are the differences in Ne trajectories due to biological differences?

Having ruled out the possibility of sequencing platform-specific technical factors, we considered the possibility of biological differences between the two samples of Mesua ferrea. Biological reasons for different trajectories can involve (a) different demographic histories of specific populations, (b) changes in the karyotype altering the recombination landscape, or (c) changes in genome size due to segmental or whole-genome duplication events. Since an earlier study has documented the prevalence of ecotype-specific differences in the genome size of Mesua ferrea (Das et al. 2018), we first decided to compare the approximate in-silico genome size estimates of the two individuals under consideration. However, genome size estimates from raw read data are sensitive to the coverage depth, sequencing error rate, and polymorphism rate. Nonetheless, we see that the estimated genome size of the sample from China (approximately 478 Mbp; see Fig. S9 and S10) is ~20 Mbp less than the genome size estimated for the sample from India. We also observed that the heterozygosity of the sample from China was ~25-fold higher than that of the sample from India. These differences in heterozygosity could be attributed to multiple factors such as contamination of reads from other species, an independent whole-genome duplication (WGD) in the Chinese sample, or incorrect metadata regarding species identity in the public sequencing data repository. However, the WGD program (Zwaenepoel and Van De Peer 2019) used to detect whole-genome duplication events from genomic data did not find any evidence to support a WGD event specific to the Chinese sample based on the distribution of synonymous substitution rates (Ds) (see Fig. S11). Despite ruling out the possibility of bias from sequencing technology, we are not able to conclusively establish the reasons for the difference in the Ne trajectories; doing so is beyond the scope of this study.

Comparative demography using PSMC on forest plant genomes

Estimation of the demographic history of multiple species of forest plants can provide useful information about the overall evolution of forests and the role of ecological processes or climatic events. We collated publicly available forest plant genomes that had sufficient data and compared the demographic histories inferred using PSMC. Demographic trajectories of the tropical species showed a considerable decline in Ne between 300 KYA and 1 MYA (see Fig. 5). This decline in Ne corresponds to a common event irrespective of the species-specific population dynamics. The bottleneck seen in all these species during this period might be attributed to the environmental conditions of the time. During this period, two major glaciations and extensive de-glaciation events have been recorded, which were considered to be longer and harsher than normal (van der Hammen 1974; Verbitsky et al. 2018). Following the Mid-Pleistocene transition (MPT), the duration of glacial cycles changed from 41 KY to 100 KY, translating into longer dry and cold conditions (Pisias and Moore 1981; Clark et al. 2006). These dry environments also affected precipitation in the tropics, leading to a decrease in CO2 concentration and less rainfall in these regions (Hewitt 2000; Dupont et al. 2001; Clark et al. 2006; Cabanne et al. 2016). In contrast, the silver birch (Betula pendula) showed an increase in Ne during this period, which could be explained by its adaptation to dry, cold and high-altitude conditions (Salojärvi et al. 2017). Based on our analysis of 15 forest plants, we could infer that late Pleistocene glaciations had a great impact on most of the forest plant species, leading to reduced Ne during this period.

Discussion

Using genomes and re-sequencing datasets of several species, we explore how genome quality (contiguity, prevalence of gaps, assembly completeness), repeat abundance, technological heterogeneity (sequencing platform used, parameter settings optimised) and biological factors (changes in genome size) can affect the inference of demographic history. The scripts used for our analysis are implemented in the form of a re-usable pipeline with detailed documentation. We are thus confident that these quality control strategies and the associated pipeline will prove useful while comparing demographic trajectories between species to obtain insights into the underlying processes. The following paragraphs highlight our major findings and their relevance with respect to previous studies.

Genome quality

We demonstrate that the estimates of Ne for the same sequencing dataset can be drastically different when the quality of the reference genome assembly changes. A recent study by Patton et al. (2019) investigated the robustness of several demographic inference methods to genome assembly quality and found that, in comparison to other methods, PSMC robustly estimates Ne except in recent time periods. In contrast to our results, Patton et al. (2019) concluded that demographic inference methods are robust to the quality of the genome assembly. However, Patton et al. simulated differing amounts of genome fragmentation by manipulating the variant call file and thereby overlooked the possibility of genome quality-related biases introduced during the read mapping and variant calling steps. Moreover, these simulations assumed that random fragmentation of the genome would capture the complexity of differences in the qualities of real genomes. The ends of contigs or scaffolds in genomes are regions that are difficult to assemble, such as repeat-rich or hypervariable regions, and are not randomly distributed in the genome. We consistently see differences in the demographic history estimated from different genome assembly versions of the same species. Yet, the demographic histories estimated from the 2012-devil and 2019-devil assemblies in Patton et al. are very similar. The negligible improvement (maximum difference in the percent of reads mapped is 0.01 considering 12 individuals) in the percentage of reads mapping to the recent version of the devil genome compared to the older version might explain why genome quality does not appear to influence the results in the case of the Tasmanian devil. The magnitude of change in estimates of Ne will depend on the complexity of the genome architecture, the extent of improvement in genome assembly quality, and which aspects of genome quality are improved. Hence, we urge caution while making generalisations regarding the effect of genome quality on the estimation of Ne.

Repeat regions

Our results demonstrate that the inclusion of repeat regions does affect demographic inference, but not all types of repeats have the same effect. The effect that a particular repeat class has on the inference seems to depend on the abundance and genomic distribution of that repeat class. Hence, lineage-specific repeat classes can potentially affect the comparative analysis of demographic histories of closely related species. For instance, the LTR content might differ between closely related species (Zhang et al. 2020) and can heavily influence the results. We implement a quality control strategy that involves the comparison of demographic histories inferred from each repeat class separately. A better understanding of repeat class-specific mutation rates might allow each repeat type to be scaled with an appropriate mutation rate and thereby resolve this heterogeneity in the trajectories. In order to evaluate the effect of diverse repeat classes on the estimation of Ne, our pipeline relies on the existence of a reasonably good quality repeat annotation in the focal species. While this is a caveat, it is a compromise made so that the program finishes in a timely manner. Users can also choose the number of repeat classes by combining repeat classes or separating them into sub-classes. A larger number of repeat classes leads to an increase in the runtime. We urge users to perform their own repeat annotation, identification and classification to overcome this limitation.

PSMC parameter settings

Using the newly generated genome sequencing dataset of the tropical plant Mesua ferrea, we demonstrate the effect of changing the three main parameter settings of the PSMC program. Our results highlight the importance of appropriately choosing the -t parameter (i.e., the maximum time to TMRCA) and provide an intuitive understanding of how the distribution of genomic regions into specific atomic intervals changes as the value of -t is changed. The comparison of demographic histories across species requires that the PSMC parameter settings are properly optimised to identify relevant differences in their Ne trajectories. By comparing the trajectories of Mesua ferrea individuals from two different populations, we show the importance of the -t parameter.

Certain historical time periods might be of greater importance due to specific climatic or geological events that have occurred in that time frame. Hence, it can be desirable to have a greater resolution while estimating the effective population size histories during these time periods. While increasing the number of free intervals in a particular time period will increase the resolution, it can also lead to certain atomic intervals having too few recombination events. Hence, the atomic intervals need to be distributed such that each atomic interval has more than 10 recombination events after the 20th iteration. We hope that our results provide some clarity regarding the strategies to be used while choosing the -p parameter.
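The check described above can be sketched in a few lines. This is only an illustration: it assumes the inferred number of recombination events per atomic interval (after the 20th iteration) has already been extracted from the PSMC output into a plain list, and the function name `flag_sparse_intervals` and the example counts are hypothetical.

```python
def flag_sparse_intervals(recomb_counts, min_events=10):
    """Return the indices of atomic intervals with fewer than `min_events`
    inferred recombination events (counts are assumed to be pre-extracted)."""
    return [k for k, n in enumerate(recomb_counts) if n < min_events]


# Hypothetical counts for five atomic intervals.
counts = [152.3, 48.0, 9.4, 12.1, 3.7]
print(flag_sparse_intervals(counts))  # intervals 2 and 4 fall below the threshold
```

Any interval reported here would be a candidate for merging with its neighbours by adjusting the -p pattern.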

The output of PSMC is insightful when it is scaled to time in years based on the mutation rate and generation time of the species under consideration. While changes in the scaling parameters have been shown to result in similarly shaped trajectories, they do change the absolute values of the estimates (Nadachowska-Brzyska et al. 2015). However, accurate estimates of mutation rates are missing or unreliable in the case of many species. Moreover, the estimates of mutation rates obtained by different methods can differ drastically. Generation time can also be difficult to estimate, especially for long-lived plant species. Hence, recent studies have resorted to scaling the results using multiple combinations of mutation rates and generation times to ensure the robustness of their observations.
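To make the effect of the scaling parameters concrete, the sketch below applies the scaling relations commonly used with PSMC output: N0 = θ0 / (4μs), time in years = 2·N0·t_k·g, and Ne = N0·λ_k, where μ is the per-generation mutation rate, s the bin size and g the generation time. The function name `scale_psmc` and all numerical inputs are illustrative assumptions, not values from this study.

```python
def scale_psmc(theta0, t_k, lambda_k, mu_per_gen, gen_time_years, bin_size=100):
    """Scale PSMC time points and relative population sizes to years and Ne."""
    n0 = theta0 / (4.0 * mu_per_gen) / bin_size
    years = [2.0 * n0 * t * gen_time_years for t in t_k]
    ne = [n0 * lam for lam in lambda_k]
    return years, ne


# Hypothetical unscaled output and two alternative (mutation rate, generation time) pairs.
t_k = [0.1, 0.5, 1.0]
lambda_k = [1.2, 2.0, 0.8]
for mu, g in [(2.5e-8, 10), (5.0e-8, 20)]:
    years, ne = scale_psmc(0.0002, t_k, lambda_k, mu, g)
    print(mu, g, years, ne)
```

Both parameter pairs give the same relative shape (the ratios between successive Ne values are unchanged), but the absolute time and Ne axes shift, which is why scaling with several combinations is a useful robustness check.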

Conclusion

In summary, our study systematically investigates multiple sources of bias that can affect the inference of demographic history from whole-genome datasets. By comparing the demographic inferences obtained using different versions of the human, red flour beetle and zebrafish genomes, we establish that genome quality does have a considerable impact on the estimation of effective population size (Ne). Instead of simply masking repeat regions of the genome, we investigate the consequences of including each repeat class using the genome of the plant Populus trichocarpa. Interestingly, we find that most repeat classes are able to provide inferences consistent with those obtained from non-repeat regions and can be a viable source of information on demographic history. Our analysis of repeat regions is of special relevance as the quality of genome assemblies continues to improve, with long-read sequencing technologies being able to correctly assemble repeat regions. Demographic histories inferred for the same human individual using BGI-seq and Illumina datasets show highly concordant results. Future studies that plan to compare datasets originating from these technologies are likely to benefit from our comparative analysis. Parameter settings used for running PSMC have also been investigated in considerable detail, with guidelines for optimal use of parameters. Having investigated various sources of technical variability and parameter settings, we consider the genomic datasets of 15 forest trees and compare their demographic histories as a case study. We not only provide guidelines for performing filtering of genomic datasets but also develop the CoalQC pipeline, which we hope will become a standard part of quality control prior to demographic analysis using whole-genome datasets.

Material and Methods

Genome quality assessment

Assembly and annotation

Leaves of the plant Mesua ferrea were collected from a tree located near the College of Forestry, Ponnampet (GPS coordinates 12°08'56.5"N 75°54'32.5"E; altitude: 829-850 m above sea level). Genomic DNA was extracted from the leaves using the standard CTAB protocol. The quality and integrity of DNA were assessed using 1x gel electrophoresis, Nanodrop, and Qubit. Illumina short-read (150 bp) libraries were prepared with an insert size of 450 ± 50 bp to generate ~110 Gbp of paired-end short-read data. The quality of the sequencing read data was assessed using the program FASTQC (Andrews et al. 2015). Genome size was estimated using Jellyfish (Marçais and Kingsford 2011) followed by GenomeScope (Vurture et al. 2017) with a k-mer size of 21. GenomeScope estimated a genome size of approximately 496.72 Mbp. Despite the observation of variability in genome size between ecotypes of Mesua ferrea, flow cytometry-based estimates have consistently predicted a genome size of approximately 684.5 Mbp (Das et al. 2018).

For assembling the genome, MaSuRCA (Maryland Super Read Cabog Assembler) (Zimin et al. 2017) was used with the parameters USE_LINKING_MATES=1, CLOSE_GAPS=1, NUM_THREADS=32, JF_SIZE=12000000000 and SOAP_ASSEMBLY=1. Our genome assembly length of 614.35 Mbp is slightly less than but comparable to previous estimates based on flow cytometry. To assess the quality of the assembled genome, we employed multiple quality assessment methods. While N50, N75, etc. are accepted metrics of genome contiguity, the number of genes assembled in the assembly serves as a metric of assembly completeness. The amount of repeats assembled is also a relevant metric, which informs about the quality of the assembly in low-complexity regions. The Quast (Mikheenko et al. 2018) program was used to calculate assembly statistics, i.e., N50, N75, number of N's per 100 kbp, etc. (see Table S3). BUSCO (Benchmarking Universal Single-Copy Orthologues) (Simão et al. 2015) was used to assess the completeness of the assembly, using the eudicotyledons_odb10 (see Tables S4 and S12) and embryophyta_odb10 (see Table S5) datasets together with previously sequenced genomes of the order Malpighiales. The LAI module of LTR_retriever was used to determine assembly quality based on the LAI (LTR Assembly Index) score, which assesses the repeat content assembled (see Table S6).
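For readers unfamiliar with the contiguity metrics named above, the following is a minimal sketch of how N50 and N75 are defined: the length of the shortest contig in the smallest set of longest contigs that together cover at least 50% (or 75%) of the assembly. It is an illustration of the definition, not the Quast implementation; the function name `nx` and the toy contig lengths are assumptions.

```python
def nx(lengths, x=50):
    """Return the Nx statistic (e.g. N50 for x=50) for a list of contig lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer arithmetic avoids floating-point issues at the threshold.
        if running * 100 >= total * x:
            return length
    return 0  # empty assembly


# Toy assembly of nine contigs (total length 54).
contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(nx(contigs, 50), nx(contigs, 75))
```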

Annotation was carried out using MAKER-P (Campbell et al. 2014) version 2 with MPI. Repeat libraries obtained from RepeatModeler (Smit, AFA, Hubley 2015) and LTR_retriever (Ou and Jiang 2018) were concatenated and used to mask repeat regions of the genome. The published CDS dataset of Populus trichocarpa and a concatenated multi-fasta of all available Malpighiales proteins were used as homology evidence for the first round of de novo annotation. The results of the first round of annotation were then used for training SNAP (Korf 2004) and AUGUSTUS (Stanke et al. 2008) as implemented in BUSCO. These predictions were used for the second round of annotation in MAKER-P. Annotation was repeated iteratively for 5 rounds, until no further improvement was observed as assessed by the AED (Annotation Edit Distance) values (see Table S7).

The raw sequencing read data was used to separately assemble the chloroplast genome using the NOVOPlasty (Dierckxsens et al. 2017) program. The Maturase K gene sequence of Mesua ferrea was used as a seed sequence and the Garcinia mangostana chloroplast genome was used as a reference. The assembler uses the seed sequence to find reads that cover it and then starts overlap-based sequence assembly. The assembled chloroplast genome had two sets of contigs. The orientation of the contigs was determined by dot-plot analyses (see Fig. S13) with Garcinia mangostana and other Malpighiales chloroplast genome sequences. The assembled chloroplast genome was 161.4 kbp long. It was then annotated using GeSeq and visualised using OGDRAW (see Fig. S14) implemented in CHLOROBOX (Greiner et al. 2019). For assembling the mitochondrial genome, the matR gene sequence of Mesua ferrea was used as a seed. The assembled chloroplast sequence was used for comparison and WGS raw reads were used in NOVOPlasty. The total assembled mitochondrial sequence was 20084 bp long (see Fig. S15).

Datasets used

To demonstrate the utility of our quality control pipeline and the generality of our observations across diverse taxa, we used species from different taxonomic groups such as nematodes, plants and vertebrates. The published genome assemblies used as reference genomes were downloaded from the NCBI/UCSC genome browser (details provided in Table S8). We searched the European Nucleotide Archive (ENA) for genomic sequencing datasets and downloaded those datasets that had >20X coverage (see Table S2). The raw read datasets were mapped to the corresponding unmasked genomes using the short-read aligner BWA-MEM (Li 2013) with default settings.

Genome assembly quality comparison

The latest version of the human genome assembly, i.e., hg38, was downloaded from Ensembl, whereas previous assemblies, i.e., hg19, hg15, hg10 and hg4, were downloaded from the UCSC genome browser (see Table S8). The genome assembly statistics, i.e., N50 and number of N's per 100 kbp, were calculated using Quast. The SAMTOOLS flagstat module was used to get mapping percentages for each assembly from the corresponding alignments. To get assembly completeness statistics, BUSCO was used with the mammalia_odb9 dataset on each of the assemblies. Genome assembly quality was also assessed for the red flour beetle (Tribolium castaneum) and zebrafish (Danio rerio) genomes. These comparative assembly quality statistics are available in Table S1.

Inference of demographic history using PSMC

Parameter settings

Variant calling was performed using SAMTOOLS and BCFTOOLS, with the depth parameters for the vcf2fq command of vcfutils.pl decided based on the mean coverage of the reads. The resultant fastq file of heterozygous sites was converted into psmcfa format using the fq2psmcfa program for bin sizes of 20, 50, and 100. For the first run of PSMC, options were set as -t 5, -r 5 and -p "4+25*2+4+6". The output was evaluated to see whether a sufficient number of recombination events had occurred in each atomic interval. If some atomic intervals did not have at least ten recombination events after the 20th iteration, the -p parameter was modified. For example, the -p setting "4+25*2+4+6" has 64 atomic intervals distributed across 28 free intervals, i.e., (1+25+1+1). Changing the distribution of atomic intervals across free intervals, e.g., to "8+25*2+2+4", would be the first step to see if enough recombination events are obtained. If not, the number of free intervals was changed, e.g., to "3*2+1*10+15*2+1*14+1*4" (written as free intervals * atomic intervals), to get sufficient recombination events. The -p parameter was finalised only after a sufficient number of recombination events was obtained. PSMC was run with the -d (decode) option to identify genomic regions that contributed to each atomic interval. After obtaining the PSMC output, an appropriate mutation rate and generation time were used to generate the scaled plot using the psmc_plot.pl script. For bootstrapping analyses, the psmcfa file was first split into segments of equal length (5 Mb) and then used for 100 runs of PSMC.
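The interval arithmetic for the -p pattern can be checked with a short helper. The sketch below counts atomic and free intervals for a pattern string, reading each term `n*w` as n free intervals of w atomic intervals each and a bare term `w` as one free interval of w atomic intervals, as described above. The function name `parse_psmc_pattern` is a hypothetical helper, not part of PSMC.

```python
def parse_psmc_pattern(pattern):
    """Return (atomic_intervals, free_intervals) for a PSMC -p pattern string."""
    atomic = 0
    free = 0
    for term in pattern.split("+"):
        if "*" in term:
            n_free, width = (int(v) for v in term.split("*"))
        else:
            n_free, width = 1, int(term)
        free += n_free
        atomic += n_free * width
    return atomic, free


for p in ["4+25*2+4+6", "8+25*2+2+4", "3*2+1*10+15*2+1*14+1*4"]:
    print(p, parse_psmc_pattern(p))
```

All three patterns keep 64 atomic intervals; the first two have 28 free intervals, while the third reduces this to 21, which is how it gathers more recombination events per free interval.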

Effect of repeat regions

The unmasked genomes were analysed to identify and annotate repetitive regions. For genome-wide identification of LTRs, the program LTR_retriever was run using repeat libraries made by concatenating the output of LTRharvest (Ellinghaus et al. 2008) and LTR_finder v1.0.6 (Xu and Wang 2007). The RepeatModeler program was used for the de novo identification of repeats. The genome-wide LTR_retriever and RepeatModeler repeat libraries were concatenated and given as input to the RepeatMasker program. The tabulated output file of RepeatMasker was converted to bed format and used for further analyses.
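The conversion from RepeatMasker's tabulated output to bed format can be sketched as follows. This is a minimal illustration that assumes the standard whitespace-delimited `.out` layout (score, three divergence columns, query name, begin, end, remaining, strand, repeat name, class/family, ...); the function name `repeatmasker_line_to_bed` is hypothetical and this is not the script used in the pipeline. RepeatMasker coordinates are 1-based and inclusive, whereas bed intervals are 0-based and half-open, so only the start is shifted.

```python
def repeatmasker_line_to_bed(line):
    """Convert one RepeatMasker .out record to a bed-style tuple, or None for
    header and blank lines."""
    fields = line.split()
    if len(fields) < 11 or not fields[0].isdigit():
        return None
    chrom = fields[4]
    start = int(fields[5]) - 1          # 1-based inclusive -> 0-based start
    end = int(fields[6])                # end is unchanged in half-open form
    strand = "-" if fields[8] == "C" else "+"
    repeat_class = fields[10]           # e.g. "LTR/Gypsy"
    return (chrom, start, end, repeat_class, int(fields[0]), strand)


record = "1320 15.6 6.2 0.0 HSU08988 6563 6781 (22462) C MER7A DNA/MER2_type (0) 336 103"
print(repeatmasker_line_to_bed(record))
```

Keeping the class/family string in the name column makes it straightforward to split the bed file by repeat class for the per-class analyses.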

Separate runs of variant calling were carried out using the unmasked and masked genomes, followed by PSMC analyses. PSMC was run with the -d option for both unmasked and masked datasets, and outputs were produced for three bin sizes (specified using the -s flag), i.e., 20, 50, and 100. For each run of PSMC, the decode2bed.pl script was used to obtain details of the atomic interval assigned to specific genomic regions. The prevalence of repeat regions in each atomic interval was assessed by intersecting the positions of the repeats with the positions of atomic intervals using BEDTools (Quinlan and Hall 2010). The genomic coordinates of heterozygous sites and the ratio of transitions (Ts) to transversions (Tv) were obtained using the listhet command of seqtk (Li 2015). Subsequently, repeat class-specific heterozygosity and Ts/Tv ratio in each atomic interval were calculated using the positions of repeat regions, decoded atomic intervals and the genome-wide list of heterozygous sites as arguments to BEDTools. To evaluate the effect each individual repeat class would have on PSMC, one repeat class was unmasked at a time, keeping all the other repeat classes masked, prior to PSMC analyses.
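As an illustration of the Ts/Tv statistic used above, the sketch below classifies heterozygous sites by their two alleles: A/G and C/T pairs are transitions, and every other pair of distinct bases is a transversion. The function name `ts_tv_ratio` and the example allele pairs are assumptions for illustration; the pipeline itself obtains these values from seqtk and BEDTools.

```python
VALID_BASES = set("ACGT")
TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def ts_tv_ratio(allele_pairs):
    """Return the transition/transversion ratio for (allele1, allele2) pairs.
    Pairs with identical or non-ACGT alleles are ignored."""
    ts = tv = 0
    for a, b in allele_pairs:
        a, b = a.upper(), b.upper()
        if a == b or a not in VALID_BASES or b not in VALID_BASES:
            continue
        if frozenset((a, b)) in TRANSITIONS:
            ts += 1
        else:
            tv += 1
    return ts / tv if tv else float("nan")


# Hypothetical heterozygous sites within one atomic interval and repeat class.
sites = [("A", "G"), ("C", "T"), ("A", "C"), ("G", "A")]
print(ts_tv_ratio(sites))
```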

Effect of coverage and callability

Genome-wide per-base depths were calculated using the SAMTOOLS depth command. The read-depth information was used to assign cumulative coverage classes, i.e., bases having >0 read depth, >10 read depth, >20 read depth and so on. The genome could thus be divided into regions that correspond to specific cumulative coverage classes. GATK version 3.8.1 (McKenna et al. 2010) was used to get callability information using the CallableLoci command. The callable status obtained from CallableLoci was used to divide the genome into regions of a specific callability. The effect these coverage-based and callability-based classes would have on PSMC inference was assessed by performing PSMC analyses separately on each of the classes after masking all genomic regions that were outside the class under consideration.
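The cumulative coverage classes can be illustrated with a small sketch that reads the three-column output of samtools depth (sequence name, position, depth) and counts the bases exceeding each depth threshold. The function name `cumulative_coverage_classes`, the thresholds and the example records are illustrative assumptions rather than the pipeline's own code.

```python
def cumulative_coverage_classes(depth_lines, thresholds=(0, 10, 20, 30)):
    """Count bases with read depth strictly greater than each threshold.
    `depth_lines` is an iterable of 'chrom<TAB>pos<TAB>depth' strings."""
    counts = {t: 0 for t in thresholds}
    for line in depth_lines:
        if not line.strip():
            continue
        _chrom, _pos, depth = line.split()[:3]
        depth = int(depth)
        for t in thresholds:
            if depth > t:
                counts[t] += 1
    return counts


# Hypothetical depth records for five positions.
records = ["chr1\t1\t0", "chr1\t2\t5", "chr1\t3\t12", "chr1\t4\t25", "chr1\t5\t40"]
print(cumulative_coverage_classes(records))
```

Because the classes are cumulative, a base with depth 25 is counted in the >0, >10 and >20 classes, so the counts decrease monotonically with the threshold.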

Comparative PSMC of forest plant genomes

Plant genome assemblies published up to November 2019 and their details were obtained from the plant genome database (available at https://www.plabipd.de/timeline_view.ep). From this list of published plant genomes, forest plant species (i.e., excluding annual plants) with >20X coverage were selected. The genomes and corresponding short-read data were downloaded from public repositories (see Tables S2 and S8 for details). PSMC analysis was performed on each of these species with default parameters, i.e., -t5 -r5 -p "4+25*2+4+6". A mutation rate estimate of 2.5e-09 per site per year, which has been used for Populus trichocarpa in an earlier study (Bai et al. 2018), was used for all the species. Generation time for each species was obtained through a literature search. For each species, per-generation mutation rates were obtained using the corresponding generation times (Table S9).
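The conversion from the per-year rate to per-generation rates is a simple product of the rate and the generation time, sketched below. The per-year rate of 2.5e-09 is the value stated above; the function name `per_generation_rate`, the species labels and the generation times are placeholders and do not reproduce Table S9.

```python
MU_PER_YEAR = 2.5e-9  # per site per year, as used for all species in the text


def per_generation_rate(generation_time_years, mu_per_year=MU_PER_YEAR):
    """Per-site, per-generation mutation rate for a given generation time."""
    return mu_per_year * generation_time_years


# Placeholder generation times in years (not the values from Table S9).
generation_times = {"species_A": 15, "species_B": 40}
for name, g in generation_times.items():
    print(name, per_generation_rate(g))
```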

Acknowledgments

We thank the Ministry of Human Resource Development for a fellowship to ABP and the Council of Scientific & Industrial Research for a fellowship to SSS. NV has been awarded the Innovative Young Biotechnologist Award 2018 from the Department of Biotechnology and the Early Career Research Award from the Department of Science and Technology (both Government of India). The computational analyses were performed on the Har Gobind Khorana Computational Biology cluster, established and maintained by combining funds from IISER Bhopal under Grant # INST/BIO/2017/019, IYBA 2018 from DBT and ECRA from DST.

Author contributions

NV conceived of the study based on inspiration from CGK and designed the study along with ABP & CGK. ABP conducted computational analysis and compiled the results with assistance from NV. SBN and RS collected the samples required for primary data generation. SSS extracted DNA from the leaf samples. ABP, NV, and CGK wrote the manuscript along with SBN, RS, and SSS. All authors approved the final draft.

References

Andrews S, Krueger F, Segonds-Pichon A, Biggins F, Wingett S. 2015. FastQC. A quality control tool for high throughput sequence data. Babraham Bioinformatics. Babraham Inst.

Bai WN, Yan PC, Zhang BW, Woeste KE, Lin K, Zhang DY. 2018. Demographically idiosyncratic responses to climate change and rapid Pleistocene diversification of the walnut genus Juglans (Juglandaceae) revealed by whole-genome sequences. New Phytol.

Cabanne GS, Calderón L, Trujillo Arias N, Flores P, Pessoa R, d’Horta FM, Miyaki CY. 2016. Effects of Pleistocene climate changes on species ranges and evolutionary processes in the Neotropical Atlantic Forest. Biol. J. Linn. Soc.

Cahill JA, Soares AER, Green RE, Shapiro B. 2016. Inferring species divergence times using pairwise sequential markovian coalescent modelling and low-coverage genomic data. Philos. Trans. R. Soc. B Biol. Sci.

Campbell MS, Law MY, Holt C, Stein JC, Moghe GD, Hufnagel DE, Lei J, Achawanantakun R, Jiao D, Lawrence CJ, et al. 2014. MAKER-P: A Tool kit for the rapid creation, management, and quality control of plant genome annotations. Plant Physiol.

Clark PU, Archer D, Pollard D, Blum JD, Rial JA, Brovkin V, Mix AC, Pisias NG, Roy M. 2006. The middle Pleistocene transition: characteristics, mechanisms, and implications for long-term changes in atmospheric pCO2. Quat. Sci. Rev.

Das R, Shelke RG, Rangan L, Mitra S. 2018. Estimation of nuclear genome size and characterization of Ty1-copia like LTR retrotransposon in Mesua ferrea L. J. Plant Biochem. Biotechnol.

Dierckxsens N, Mardulyn P, Smits G. 2017. NOVOPlasty: De novo assembly of organelle genomes from whole genome data. Nucleic Acids Res.

Dupont LM, Donner B, Schneider R, Wefer G. 2001. Mid-Pleistocene environmental change in tropical Africa began as early as 1.05 Ma. Geology.

Ellinghaus D, Kurtz S, Willhoeft U. 2008. LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformatics.

Foote AD, Vijay N, Ávila-Arcos MC, Baird RW, Durban JW, Fumagalli M, Gibbs RA, Hanson MB, Korneliussen TS, Martin MD, et al. 2016. Genome-culture coevolution promotes rapid divergence of killer whale ecotypes. Nat. Commun. 7:ncomms11693.

Greiner S, Lehwark P, Bock R. 2019. OrganellarGenomeDRAW (OGDRAW) version 1.3.1: expanded toolkit for the graphical visualization of organellar genomes. Nucleic Acids Res.

.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint

Page 21: CoalQC - Quality control while inferring demographic histories … · 2 38. Abstract. 39 Estimating demographic histories using genomic datasets has proven to be useful in 40 . addressing

21

van der Hammen T. 1974. The Pleistocene Changes of Vegetation and Climate in Tropical South America. J. Biogeogr.

Hecht LBB, Thompson PC, Rosenthal BM. 2018. Comparative demography elucidates the longevity of parasitic and symbiotic relationships. In: Proceedings of the Royal Society B: Biological Sciences.

Hewitt G. 2000. The genetic legacy of the quaternary ice ages. Nature.

Korf I. 2004. Gene finding in novel genomes. BMC Bioinformatics.

Li H. 2013. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Available from: http://arxiv.org/abs/1303.3997

Li H. 2015. Seqtk: Toolkit for processing sequences in FASTA/Q formats. GitHub.

Li H, Durbin R. 2011. Inference of human population history from individual whole-genome sequences. Nature 475:493–496.

Marçais G, Kingsford C. 2011. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics.

Mather N, Traves SM, Ho SYW. 2020. A practical introduction to sequentially Markovian coalescent methods for estimating demographic history from genomic data. Ecol. Evol.

Mays HL, Hung CM, Shaner PJ, Denvir J, Justice M, Yang SF, Roth TL, Oehler DA, Fan J, Rekulapally S, et al. 2018. Genomic Analysis of Demographic History and Ecological Niche Modeling in the Endangered Sumatran Rhinoceros Dicerorhinus sumatrensis. Curr. Biol.

Mazet O, Rodríguez W, Chikhi L. 2015. Demographic inference using genetic data from a single individual: Separating population size variation from population structure. Theor. Popul. Biol.

McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, et al. 2010. The genome analysis toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res.

Mikheenko A, Prjibelski A, Saveliev V, Antipov D, Gurevich A. 2018. Versatile genome assembly evaluation with QUAST-LG. In: Bioinformatics.

Nadachowska-Brzyska K, Burri R, Smeds L, Ellegren H. 2016. PSMC analysis of effective population sizes in molecular ecology and its application to black-and-white Ficedula flycatchers. Mol. Ecol. 25:1058–1072.

Nadachowska-Brzyska K, Li C, Smeds L, Zhang G, Ellegren H. 2015. Temporal dynamics of avian populations during pleistocene revealed by whole-genome sequences. Curr. Biol. 25:1375–1380.

Ou S, Jiang N. 2018. LTR_retriever: A highly accurate and sensitive program for identification of long terminal repeat retrotransposons. Plant Physiol.

Palamara PF, Terhorst J, Song YS, Price AL. 2018. High-throughput inference of pairwise coalescence times identifies signals of selection and enriched disease heritability. Nat. Genet.

Patton AH, Margres MJ, Stahlke AR, Hendricks S, Lewallen K, Hamede RK, Ruiz-Aravena M, Ryder O, McCallum HI, Jones ME, et al. 2019. Contemporary Demographic Reconstruction Methods Are Robust to Genome Assembly Quality: A Case Study in Tasmanian Devils. Mol. Biol. Evol.

Pisias NG, Moore TC. 1981. The evolution of Pleistocene climate: A time series approach. Earth Planet. Sci. Lett.

Prado-Martinez J, Sudmant PH, Kidd JM, Li H, Kelley JL, Lorente-Galdos B, Veeramah KR, Woerner AE, O’Connor TD, Santpere G, et al. 2013. Great ape genetic diversity and population history. Nature 499:471–475.

Quinlan AR, Hall IM. 2010. BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics 26:841–842.

Rogers RL, Slatkin M. 2017. Excess of genomic defects in a woolly mammoth on Wrangel island. PLoS Genet.

Salojärvi J, Smolander OP, Nieminen K, Rajaraman S, Safronov O, Safdari P, Lamminmäki A, Immanen J, Lan T, Tanskanen J, et al. 2017. Genome sequencing and population genomic analyses provide insights into the adaptive landscape of silver birch. Nat. Genet.

Schiffels S, Durbin R. 2014. Inferring human population size and separation history from multiple genome sequences. Nat. Genet. 46:919–925.

Simão FA, Waterhouse RM, Ioannidis P, Kriventseva E V., Zdobnov EM. 2015. BUSCO: Assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31:3210–3212.

Smit AFA, Hubley R. 2015. RepeatModeler Open-1.0. Available from: http://www.repeatmasker.org

Stanke M, Diekhans M, Baertsch R, Haussler D. 2008. Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics.

Terhorst J, Kamm JA, Song YS. 2016. Robust and scalable inference of population history from hundreds of unphased whole genomes. Nat. Genet. 49:303–309.

Tiley GP, Kimball RT, Braun EL, Burleigh JG. 2018. Comparison of the Chinese bamboo partridge and red Junglefowl genome sequences highlights the importance of demography in genome evolution. BMC Genomics.

Verbitsky MY, Crucifix M, Volobuev DM. 2018. A theory of Pleistocene glacial rhythmicity. Earth Syst. Dyn.

Vijay N, Bossu CM, Poelstra JW, Weissensteiner MH, Suh A, Kryukov AP, Wolf JBW. 2016. Evolution of heterogeneous genome differentiation across multiple contact zones in a crow species complex. Nat. Commun. 7:ncomms13195.

Vijay N, Park C, Oh J, Jin S, Kern E, Kim HW, Zhang J, Park J-K. 2018. Population Genomic Analysis Reveals Contrasting Demographic Changes of Two Closely Related Dolphin Species in the Last Glacial. Satta Y, editor. Mol. Biol. Evol. [Internet] 35:2026–2033. Available from: https://academic.oup.com/mbe/article/35/8/2026/5017252

Vurture GW, Sedlazeck FJ, Nattestad M, Underwood CJ, Fang H, Gurtowski J, Schatz MC. 2017. GenomeScope: Fast reference-free genome profiling from short reads. In: Bioinformatics.

Wang J, Raskin L, Samuels DC, Shyr Y, Guo Y. 2015. Genome measures used for quality control are dependent on gene function and ancestry. Bioinformatics.

Xu Z, Wang H. 2007. LTR-FINDER: An efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Res.

Zhang Z, Chen Y, Zhang J, Ma X, Li Y, Li M, Wang D, Kang M, Wu H, Yang Y, et al. 2020. Improved genome assembly provides new insights into genome evolution in a desert poplar (Populus euphratica). Mol. Ecol. Resour.:1755-0998.13142.

Zimin A V., Puiu D, Luo MC, Zhu T, Koren S, Marçais G, Yorke JA, Dvořák J, Salzberg SL. 2017. Hybrid assembly of the large and highly repetitive genome of Aegilops tauschii, a progenitor of bread wheat, with the MaSuRCA mega-reads algorithm. Genome Res.

Zwaenepoel A, Van De Peer Y. 2019. Wgd-simple command line tools for the analysis of ancient whole-genome duplications. Bioinformatics.



Figure Legends:

Figure 1a: Change in effective population size (Ne) with change in the genome quality of human assemblies.

PSMC curves with bootstrap replicates for Human-NA12878 (see Table S2) mapped to human assembly version 4 (hg4) shown in red, to hg10 shown in brown, to hg15 shown in green, to hg19 shown in purple, and to hg38 shown in blue; the corresponding mapping percentages are given in parentheses. The poor-quality assembly (hg4) overestimated Ne during recent (~1 KYA) and ancient (~1-5 MYA) times, whereas Ne was underestimated during the mid-period (~100-400 KYA) compared to the better assemblies.

Figure 1b: Extent of variation in PSMC trajectories inferred from human assemblies.

Estimates of Ne inferred using each assembly showed heterogeneity across time points. The black line indicates the standard deviation (SD) in each atomic interval (AI) across the estimates from all assemblies. The colored lines show the coefficient of variation (CV) within the 100 bootstrap replicates of each assembly. The SD curve shows that estimates in the atomic intervals contributing to recent times (AI 1-6) and ancient times (AI 43-64) varied the most across assemblies. The CV of the poorest assembly (hg4) is the highest across bootstrap estimates in recent and ancient times, suggesting relatively low robustness compared to the others, but there was little difference for the mid-period.
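The two summary statistics in this legend can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the input layout (one list of Ne values per assembly or per bootstrap replicate, ordered by atomic interval) and the use of the sample standard deviation are all assumptions made for the example.

```python
from statistics import mean, stdev

def sd_across_assemblies(ne_by_assembly):
    """Per-interval SD of Ne across assemblies.

    ne_by_assembly: dict mapping a (hypothetical) assembly name to a list of
    Ne estimates, one per atomic interval, all lists of equal length.
    """
    # zip(*...) turns per-assembly rows into per-interval columns
    return [stdev(column) for column in zip(*ne_by_assembly.values())]

def cv_within_bootstraps(bootstrap_ne):
    """Per-interval CV (SD / mean) across bootstrap replicates of one assembly.

    bootstrap_ne: list of replicates, each a list of Ne estimates per interval.
    """
    return [stdev(column) / mean(column) for column in zip(*bootstrap_ne)]

# Toy numbers, not data from the study
toy_assemblies = {"asmA": [10.0, 4.0], "asmB": [6.0, 4.0]}
toy_bootstraps = [[2.0, 5.0], [4.0, 5.0]]
sd_values = sd_across_assemblies(toy_assemblies)
cv_values = cv_within_bootstraps(toy_bootstraps)
```

An interval where all assemblies agree gives an SD of zero, and an interval where all bootstrap replicates agree gives a CV of zero, so large values flag the intervals that are sensitive to assembly quality or to resampling.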

Figure 2a: Change in effective population size (Ne) due to masking of repeat regions in Populus trichocarpa PSMC.

PSMC curves for Populus trichocarpa after masking all repeat regions in the genome (blue line) and without masking (orange-red line). The unmasked trajectory has dots indicating the fraction of repeats in each atomic interval; the larger the dot, the higher the repeat content of that interval. Ne estimates during recent (20-100 KYA) and ancient (1-5 MYA) times differ considerably between the two curves, showing the effect of excluding or including repeat sequences.

Figure 2b: Change in effective population size (Ne) after the inclusion of each repeat class separately in Populus trichocarpa PSMC.

PSMC curves for Populus trichocarpa using the masked (blue) and unmasked (orange-red) genomes. The change in trajectory due to adding each class of repeat to the masked genome is shown. Including one repeat class while masking the other repeat classes shows the changes specific to that class. The inclusion of LTR-Gypsy produces a distinctly different trajectory, similar to that of the unmasked genome during ~20-100 KYA, which shows that the inclusion of LTR-Gypsy influences the trajectory in recent times.

Figure 2c: Bootstrapped PSMC results after masking of repeats in Populus trichocarpa.

PSMC curves for Populus trichocarpa showing the robustness of the changes due to masking of repeats. The masked (blue) and unmasked (orange-red) genomes show completely distinct trajectories, and unmasking only the LTR-Gypsy repeat class (pink) also shows a marked difference. The second y-axis (red) shows the coefficient of variation (CV) across the bootstraps for all the repeat classes. This indicates that the changes in Ne due to repeats are robust to bootstrap replication.

Figure 2d: Abundance of different repeat classes across all atomic intervals in Populus trichocarpa PSMC.

The contribution of various repeat classes to each atomic interval is shown. Non-repeat regions (light green) are generally the most abundant across all intervals, whereas some repeat classes such as LTRs have considerable abundance in some of the atomic intervals. Repeat families such as LTR-Gypsy, LTR-Copia and RC-Helitron showed higher abundance than other repeat classes.

Figure 2e: Fraction of the genome contributed to each atomic interval of Populus trichocarpa PSMC by the repeat classes.

The contribution of various repeat classes (as a percentage of the whole genome length) to each atomic interval is shown. LTR-Gypsy contributes around 2% in the atomic intervals spanning recent and ancient times, which might be one of the factors contributing to the change in the Ne trajectory.

Figure 3: Comparison of heterozygosity and Ts/Tv ratio across atomic intervals of Populus trichocarpa PSMC.

Change in heterozygosity and the corresponding transition/transversion (Ts/Tv) ratio across atomic intervals in different repeat classes of Populus trichocarpa PSMC. Heterozygosity (left y-axis) increases with atomic interval (x-axis), whereas the Ts/Tv ratio (right y-axis) does not follow this trend for most of the repeat classes.
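For readers unfamiliar with the two quantities plotted here, the sketch below shows one way they could be computed for the sites falling in a single atomic interval. It is an illustration under assumptions, not the authors' code: the function names and the input format (a list of reference/alternate base pairs, plus counts of heterozygous and callable sites) are hypothetical.

```python
# A<->G and C<->T substitutions are transitions; all other base changes are transversions
TRANSITIONS = {frozenset("AG"), frozenset("CT")}

def ts_tv_ratio(snps):
    """Ts/Tv ratio for a list of (ref, alt) single-base substitutions."""
    ts = tv = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        if ref == alt:
            continue  # not a substitution, skip it
        if frozenset((ref, alt)) in TRANSITIONS:
            ts += 1
        else:
            tv += 1
    # The ratio is undefined when there are no transversions
    return ts / tv if tv else float("nan")

def heterozygosity(n_het_sites, n_callable_sites):
    """Fraction of callable sites that are heterozygous."""
    return n_het_sites / n_callable_sites

# Toy input, not data from the study
toy_snps = [("A", "G"), ("C", "T"), ("A", "C"), ("G", "A")]
toy_ratio = ts_tv_ratio(toy_snps)
toy_het = heterozygosity(5, 1000)
```

Computing both values per atomic interval and per repeat class would give the two series compared in the figure.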

Figure 4a: Demographic inference of Mesua ferrea by PSMC and effect of different values of maximum TMRCA.
PSMC-inferred trajectories with the same -p parameter (3*2+1*10+15*2+14+4) but for several values of the maximum TMRCA parameter (-t). Colours indicate -t of 35 (cyan), 45 (blue), 55 (green) and 65 (brown). For -t 500 (red), the -p pattern "4+25*2+4+6" was used, but some of the last atomic intervals did not have a sufficient number of recombination events. The demographic scenario shows a steep decline in Ne after the MPT (Mid-Pleistocene transition), i.e. ~700 KYA, followed by a second bottleneck during the LGM (Last Glacial Maximum) of the last glacial period, i.e. ~30 KYA.
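Both -p patterns quoted in the Figure 4a legend describe the same number of atomic intervals. The following is a minimal sketch for checking this, assuming the pattern syntax documented for PSMC (terms joined by "+", where "n*m" stands for n free parameters each spanning m atomic intervals, and a bare "k" stands for one parameter spanning k intervals). The function name parse_psmc_pattern is hypothetical and is not part of PSMC.

```python
def parse_psmc_pattern(pattern):
    """Return (atomic_intervals, free_parameters) for a PSMC -p pattern string."""
    intervals = 0
    parameters = 0
    for term in pattern.split("+"):
        if "*" in term:
            # "n*m": n parameters, each covering m atomic intervals
            count, width = (int(x) for x in term.split("*"))
        else:
            # bare "k": one parameter covering k atomic intervals
            count, width = 1, int(term)
        intervals += count * width
        parameters += count
    return intervals, parameters


if __name__ == "__main__":
    for p in ("3*2+1*10+15*2+14+4", "4+25*2+4+6"):
        print(p, parse_psmc_pattern(p))
```

Under this reading, both patterns yield 64 atomic intervals and differ only in how those intervals are grouped into free parameters.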

Figure 4b: Distribution of atomic intervals across scaffold1281 for different maximum TMRCA values for Mesua ferrea PSMC.
For each PSMC run with a different -t value, the decode-based genomic regions along this scaffold and their corresponding atomic intervals are shown. The atomic intervals that spanned scaffold1281 are shown with their respective colours. Callability of bases in these regions is shown to highlight the quality of the variants identified; heterozygosity is shown to demarcate hypervariable regions. The same genomic coordinates are reassigned from older AIs to more recent atomic intervals, which hints at a redistribution of positions among atomic intervals as the maximum TMRCA parameter value changes.

Figure 5: Comparative PSMC for forest plant genomes.
PSMC-inferred trajectories with bootstrap replicates of 15 forest plant species. The top rectangles show the respective time periods with important predicted glaciation events. Betula pendula shows a completely discordant trajectory compared to all other species, whereas tropically distributed species share a common trend of decline during and after the Mid-Pleistocene glaciations. Some species, such as Faidherbia albida, were able to recover from these bottlenecks, which reflects their adaptation to drier environments, but most of the other plants were not able to recover.

Supplementary figure legends:

Figure S1a: Change in effective population size (Ne) with change in genome quality. Bootstrapped PSMC curves for Tribolium castaneum mapped to the Tcas1.0 and Tcas5.2 genome assemblies, which differ in genome quality. The mutation rate used was 2.9e-09 per site per generation, with a generation time of 0.3, i.e. 12 weeks for one generation.

Figure S1b: Change in effective population size (Ne) with change in genome quality. Bootstrapped PSMC curves for Danio rerio mapped to the danRer1 and danRer11 genome assemblies, which differ in genome quality. The mutation rate used was 1.9e-09 per site per generation, with a generation time of 1 year.
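The mutation rate and generation time quoted in the legends of Figures S1a and S1b are the quantities that convert scaled PSMC output into Ne and years. The following is a minimal sketch of that conversion, assuming the scaling used by the psmc_plot.pl utility distributed with PSMC (N0 = θ0 / (4μs), Ne = λ·N0, time in years = 2·N0·t·g, with bin size s = 100). The function name scale_psmc and the θ0, t and λ values in the example are hypothetical and are not taken from this study.

```python
def scale_psmc(theta0, t_k, lambda_k, mu, gen_time, bin_size=100):
    """Convert scaled PSMC output to (years, Ne).

    theta0   : scaled mutation rate per bin reported by PSMC
    t_k      : scaled time of an interval boundary (units of 2*N0 generations)
    lambda_k : relative population size in that interval
    mu       : mutation rate per site per generation
    gen_time : generation time in years
    bin_size : number of bases per psmcfa bin (assumed 100)
    """
    n0 = theta0 / (4.0 * mu * bin_size)
    years = 2.0 * n0 * t_k * gen_time
    ne = lambda_k * n0
    return years, ne


if __name__ == "__main__":
    # Hypothetical theta0, t_k and lambda_k; mu and gen_time as in the Figure S1a legend.
    print(scale_psmc(theta0=0.00116, t_k=0.5, lambda_k=2.0, mu=2.9e-9, gen_time=0.3))
```

Because both time and Ne scale with 1/μ, a different assumed mutation rate shifts the whole curve without changing its shape.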

Figure S2: Correlation between repeat abundance and change in effective population size (Ne). Repeat content across atomic intervals in Populus trichocarpa PSMC showed a positive correlation with the absolute change in Ne estimated from masked vs. unmasked genomes. Kendall's correlation coefficient was tau (τ) = 0.346, with a p-value of 0.0004.
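For readers unfamiliar with the statistic reported for Figure S2, the following is a minimal sketch of Kendall's tau computed from concordant and discordant pairs. It implements the simple tau-a variant without tie correction; which variant and software were used in the study is not stated here, so this is an illustration only, and the function name kendall_tau_a is hypothetical.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
        # tied pairs (sign == 0) count towards neither group
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


if __name__ == "__main__":
    # Hypothetical repeat content and |change in Ne| for three atomic intervals.
    print(kendall_tau_a([1, 2, 3], [1, 3, 2]))
```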

Figure S3: Effect of callability of bases on Populus trichocarpa PSMC.
The CallableLoci module of GATK v3.8 assigned genomic regions to several classes: Callable (good-quality bases of the reference genome), N-ref (bases that are N's or gaps in the reference genome), No-coverage (bases not supported by any read), Low-coverage (bases with low read support compared to the mean) and Poorly-mapped (bases covered by reads with poor mapping quality). Each of these classes was masked one at a time and the effect of each non-callable group was evaluated. All the non-callable groups were then merged into a single non-callable class and masked, followed by another run in which the callable sites were masked. For each run, the percentage of the genome masked is given in parentheses. Masking the callable sites gave completely different results, whereas masking individual non-callable classes did not cause any large change in the inferred trajectories. The merged non-callable trajectory showed some difference, but not enough to change the inferences about the demography.
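The percentage of the genome masked per class, as given in parentheses in Figure S3, can be derived from an interval file. The following is a minimal sketch, assuming a four-column BED-style input (contig, start, end, state) with half-open intervals; the state names in the example follow GATK's CallableLoci naming as an assumption, and the function name percent_per_state is hypothetical.

```python
from collections import defaultdict


def percent_per_state(bed_lines):
    """Return {state: percent of total interval length} from BED-style lines."""
    lengths = defaultdict(int)
    for line in bed_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        _contig, start, end, state = line.split()[:4]
        lengths[state] += int(end) - int(start)
    total = sum(lengths.values())
    if total == 0:
        return {}
    return {state: 100.0 * n / total for state, n in lengths.items()}


if __name__ == "__main__":
    # Hypothetical intervals, not data from the study.
    example = [
        "chr1 0 600 CALLABLE",
        "chr1 600 700 REF_N",
        "chr1 700 900 LOW_COVERAGE",
        "chr1 900 1000 NO_COVERAGE",
    ]
    print(percent_per_state(example))
```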

Figure S4: Effect of masking of different cumulative coverage classes on Populus trichocarpa PSMC. Genome-wide per-base depth was calculated using the SAMTOOLS depth command and used to define cumulative coverage classes based on the coverage distribution: bases with < 10x coverage in one class, bases with < 20x coverage in another class, and so on. For each coverage class, genomic coordinates were obtained and used to mask these regions, followed by PSMC analyses. The fraction of the genome masked by each class is shown in parentheses. PSMC trajectories did not differ when masking classes up to < 20x coverage, whereas from the < 30x class up to the < 60x class the trajectories started to differ. Masking higher coverage classes, i.e. < 70x and above, did not produce a psmcfa file during the analyses. Genomic regions with higher coverage mostly contributed to the older atomic intervals, as masking up to the < 40x coverage class showed a difference in recent times and only a small difference in older times.
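The cumulative coverage classes described for Figure S4 amount to collecting, for each threshold, the coordinates of all bases below that depth. The following is a minimal sketch of that step, assuming three-column samtools depth output (contig, 1-based position, depth) and producing BED-style half-open intervals. The function name low_coverage_intervals is hypothetical, and positions absent from the input (as can happen when samtools depth is run without -a) are simply not reported.

```python
def low_coverage_intervals(depth_lines, threshold):
    """Merge consecutive positions with depth < threshold into BED-style intervals."""
    intervals = []
    current = None  # [contig, start (0-based), end (exclusive)]
    for line in depth_lines:
        fields = line.split()
        if len(fields) < 3:
            continue
        contig, pos, depth = fields[0], int(fields[1]), int(fields[2])
        if depth >= threshold:
            continue
        if current and current[0] == contig and current[2] == pos - 1:
            current[2] = pos  # position is adjacent: extend the open interval
        else:
            if current:
                intervals.append(tuple(current))
            current = [contig, pos - 1, pos]
    if current:
        intervals.append(tuple(current))
    return intervals


if __name__ == "__main__":
    # Hypothetical depth records, not data from the study.
    depth = ["chr1\t1\t5", "chr1\t2\t8", "chr1\t3\t25", "chr1\t4\t9", "chr2\t1\t3"]
    for cutoff in (10, 20, 30):
        print(cutoff, low_coverage_intervals(depth, cutoff))
```

Each class contains the previous one, which is why the classes are cumulative: raising the threshold can only add masked bases.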

Figure S5: Change in length distribution of atomic intervals due to change in maximum TMRCA values in Mesua ferrea PSMC.
Lengths of sequences in each atomic interval are compared for each maximum TMRCA value to evaluate whether lengths are redistributed. For -t 65 (purple box), atomic intervals contributing to older times (see AI 53 to 63) are represented more than for all other values, and where other values are also present (see AI 53, 54, 55 and 63), -t 65 has smaller lengths than the others in those time periods. This shows that increasing the maximum TMRCA allows shorter genomic regions to contribute to older times.

Figure S6a: Effect of changing the θ/ρ (-r flag) value in PSMC on demographic inference of Mesua ferrea.
PSMC estimates were inferred for Mesua ferrea with different values of the -r flag. Smaller values allowed the trajectory to extend further (see black and red lines), but the atomic intervals contributing to these time points did not have enough recombination events. The other -r values, i.e. 5, 10 and 25, showed convergence in terms of recombination events but did not show any change in trajectory (see green, blue and cyan lines).

Figure S6b: Effect of changing the θ/ρ (-r flag) value in PSMC on the number of recombination events across atomic intervals 50 to 64 of Mesua ferrea PSMC.
With smaller -r values, i.e. 0.1 and 1, the trajectories extended further, but these atomic intervals did not show convergence in terms of recombination events. For the value of 0.1, enough recombination events occurred up to the 56th AI, whereas for the value of 1 there were fewer than 10 recombination events in the last three atomic intervals.

Figure S7: Demographic inference of Mesua ferrea for the North-eastern (China) population (red) and the South-western (India) population (sky-blue).
The trajectory of the Chinese sample (red) extends back in time to ~5 MYA, whereas that of the Indian sample (sky-blue) reaches only ~400 KYA. The time at which the population decline begins is similar in both, and the trajectories are similar from ~100 KYA to recent times, as both show a second decline during the last glacial period, i.e. ~20 KYA.

Figure S8: Demographic inference of Human-NA12878 sequenced using Illumina (blue) and BGISEQ (red). The inferred PSMC trajectories showed identical Ne estimates, indicating that there are no sequencing-platform-based differences.

Figure S9: K-mer distribution of 21-mers of the sample from the Indian population of Mesua ferrea. GenomeScope results for the Indian sample predict low heterozygosity (0.03%), with a single peak at 150x coverage. The predicted genome size is approx. 497 Mbp, which is an underestimate because high-coverage organellar and repeat sequences are neglected.

Figure S10: K-mer distribution of 21-mers of the sample from the Chinese population of Mesua ferrea. GenomeScope results for the Chinese sample predict high heterozygosity (0.85%) compared to the Indian sample, a ~25-fold difference between the two populations. The two peaks are due to the high heterozygosity, but the predicted genome size, i.e. 479 Mbp, is similar to that of the other sample.

Figure S11: Ks distribution plot.
Distribution of synonymous substitutions (Ks) across paralogs of Mesua ferrea (green) and homologs of Mesua ferrea with Manihot esculenta (red) and Populus trichocarpa (blue). The blue and red peaks show the WGD event common across Malpighiales, at a Ks of around 1.1-1.2. There is a possibility of an independent WGD event in clusioids or Mesua ferrea, indicated by the peak at 0.2. This independent WGD event could be species-specific or clade-specific, but this cannot be determined because no other genomic datasets from clusioids are available.

Figure S12: Assembly quality comparison of Malpighiales using the eudicotyledons_odb10 dataset. Comparison of genome completeness based on BUSCO scores of Malpighiales genome assemblies. The Mesua ferrea assembly is relatively complete in comparison with the other species.

Figure S13a: Dot-plot of Garcinia mangostana (top) and Mesua ferrea (left) plastome contig set 1.

Figure S13b: Dot-plot of Garcinia mangostana (top) and Mesua ferrea (left) plastome contig set 2.

Figure S13c: Dot-plot of Jatropha curcas (top) and Mesua ferrea (left) plastome contig set 1.

Figure S13d: Dot-plot of Jatropha curcas (top) and Mesua ferrea (left) plastome contig set 2.

Figure S13e: Dot-plot of Byrsonima coccolobifolia (top) and Mesua ferrea (left) plastome contig set 1.

Figure S13f: Dot-plot of Byrsonima coccolobifolia (top) and Mesua ferrea (left) plastome contig set 2.

Figure S14: Circular plot of Mesua ferrea plastome.
Assembled chloroplast genome length is 161.422 Kbp. Annotated genes are shown with colours indicating their associated pathways.

Figure S15: Circular plot of the Mesua ferrea mitochondrial genome.

Figures

Figure 1a

Figure 1b

Figure 2a

Figure 2b

Figure 2c

Figure 2d

Figure 2e

Figure 3

Figure 4a

Figure 4b

Figure 5

Supplementary figures:

Figure S1a

Figure S1b

Figure S2

Figure S3

Figure S4

Figure S5

Figure S6a

Figure S6b

Figure S7

Figure S8

Figure S9

Figure S10

Figure S11

Figure S12

Figure S13a

Figure S13b

Figure S13c

Figure S13d

Figure S13e

Figure S13f

Figure S14

Figure S15

Supplementary Tables

Supplementary Table S1: Assembly quality comparison.

Assembly   Mapping Percent   N50 (Length in MB)   N's per 100 KB   Percentage of complete BUSCOs
hg4        92.86             6.2                  11725.66         72.4
hg10       97.44             1.98                 206.89           89.5
hg15       99.84             25.44                7.29             94.6
hg19       100               146.36               7645.47          94.9
hg38       100               145.14               4964.97          94.9
Tcas1.0    98.2              0.24                 2291.2           94.2
Tcas5.2    99.01             15.27                8144.72          99.4
danRer1    95.53             45.41                12179.91         NA
danRer11   98.04             52.19                279.51           NA

Supplementary Table S2: SRA reads used in this study.

Sr. No.   SRA Accession   Sample Details
1         SRR9091899      Homo sapiens NA12878 Illumina-Hiseq 4000
2         SRR7121482      Homo sapiens NA12878 BGISEQ-500
3         SRR7906163      Populus trichocarpa
4         SRR9007075      Populus euphratica
5         SRR3045849      Populus nigra
6         SRR2745904      Populus tremula
7         SRR2751102      Populus tremuloides
8         SRR7121482      Mesua ferrea BGISEQ-500
9         ERR2026087      Betula pendula
10        SRR6058604      Durio zibethinus
11        SRR10339638     Eucalyptus grandis
12        SRR7072804      Faidherbia albida
13        SRR5265130      Tectona grandis
14        SRR3860174      Quercus robur
15        SRR5678803      Ficus carica
16        DRR142810       Citrus unshiu
17        SRR5674478      Trema orientalis
18        SRR8731963      Castanea mollissima
19        ERR1346607      Olea europaea
20        SRR5150443      Santalum album
21        SRR5019373      Populus pruinosa
22        ERR3491152      Trochodendron aralioides
23        SRR5992151      Tribolium castaneum
24        SRR6687445      Danio rerio
25        This study      Mesua ferrea

2131

2132

2133

.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint

Page 59: CoalQC - Quality control while inferring demographic histories … · 2 38. Abstract. 39 Estimating demographic histories using genomic datasets has proven to be useful in 40 . addressing

59

2134

2135

2136

Supplementary Table S3: Genome assembly statistics of the Mesua ferrea genome.

| Sr. No. | Parameter | Contigs | Scaffolds |
|---|---|---|---|
| 1 | Total assembled genome length (Mb) | 609.19 | 614.35 |
| 2 | Number of contigs (a) | 531964 | 503130 |
| 3 | Largest contig (a) (kb) | 2231.61 | 4132.84 |
| 4 | N50 (a) (kb) | 251.71 | 392.76 |
| 5 | L50 (a) | 607 | 379 |
| 6 | L75 (a) | 2612 | 1584 |
| 7 | GC % | 38.38 | 38.38 |

(a) Statistics are based on contigs > 100 bp.
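For readers unfamiliar with the N50, L50 and L75 rows, the following is a minimal sketch of how such contiguity statistics are conventionally computed from a list of sequence lengths. It is not the assembler's or the authors' own code; the function name `nx_lx` and its parameters are hypothetical, and the `min_length=100` filter in the usage example mirrors the table footnote (contigs > 100 bp).

```python
def nx_lx(lengths, fraction=0.5, min_length=0):
    """Return (Nx, Lx) for a collection of sequence lengths.

    Nx is the length of the shortest sequence in the smallest set of
    longest sequences that together cover `fraction` of the total length;
    Lx is the number of sequences in that set.
    Sequences of length <= min_length are ignored (hypothetical filter).
    """
    kept = sorted((n for n in lengths if n > min_length), reverse=True)
    if not kept:
        raise ValueError("no sequences pass the length filter")
    target = fraction * sum(kept)
    running = 0
    for count, length in enumerate(kept, start=1):
        running += length
        if running >= target:
            return length, count


# Toy example with made-up lengths (not data from this study).
toy = [80, 70, 50, 40, 30, 20, 10]
n50, l50 = nx_lx(toy, fraction=0.5)
n75, l75 = nx_lx(toy, fraction=0.75)
filtered = nx_lx([50, 150, 200], fraction=0.5, min_length=100)
```

Calling the function with `fraction=0.75` gives the N75/L75 pair, so the L75 row corresponds to the second element of that result.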

Supplementary Table S4: BUSCO score comparison across previously published genomes from Malpighiales using the eudicotyledons_odb10 dataset.

| Sr. No. | Species | Complete and single-copy BUSCOs | Complete and duplicated BUSCOs | Fragmented BUSCOs | Missing BUSCOs | Complete BUSCOs (N=2121) | Complete BUSCOs (%) |
|---|---|---|---|---|---|---|---|
| 1 | Caryocar brasiliense | 1528 | 101 | 246 | 246 | 1629 | 76.8 |
| 2 | Euphorbia esula | 348 | 43 | 476 | 1254 | 391 | 18.43 |
| 3 | Hevea brasiliensis | 1646 | 412 | 24 | 39 | 2058 | 97.03 |
| 4 | Jatropha curcas | 2040 | 33 | 14 | 34 | 2073 | 97.74 |
| 5 | Linum usitatissimum | 745 | 1245 | 36 | 95 | 1990 | 93.82 |
| 6 | Mesua ferrea | 1633 | 364 | 31 | 93 | 1997 | 94.15 |
| 7 | Manihot esculenta | 1853 | 198 | 20 | 50 | 2051 | 96.7 |
| 8 | Passiflora edulis | 827 | 20 | 638 | 636 | 847 | 39.93 |
| 9 | Populus alba | 1640 | 426 | 18 | 37 | 2066 | 97.41 |
| 10 | Populus euphratica | 1592 | 479 | 13 | 37 | 2071 | 97.64 |
| 11 | Populus simonii | 1623 | 443 | 16 | 39 | 2066 | 97.41 |
| 12 | Populus trichocarpa | 1629 | 436 | 14 | 42 | 2065 | 97.36 |
| 13 | Ricinus communis | 2009 | 22 | 46 | 44 | 2031 | 95.76 |
| 14 | Salix brachista | 1757 | 288 | 19 | 57 | 2045 | 96.42 |
| 15 | Viola pubescens | 1471 | 297 | 188 | 165 | 1768 | 83.36 |
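The last two columns of the BUSCO tables follow from the first four: complete BUSCOs are the sum of the single-copy and duplicated counts, and the percentage is taken over all four categories. The sketch below illustrates that arithmetic under this assumption; the function name `busco_summary` is hypothetical and not part of BUSCO itself.

```python
def busco_summary(single, duplicated, fragmented, missing):
    """Return (complete, total, percent_complete) from BUSCO category counts."""
    total = single + duplicated + fragmented + missing
    complete = single + duplicated
    percent_complete = 100.0 * complete / total
    return complete, total, percent_complete


# Mesua ferrea row of Supplementary Table S4 (eudicotyledons_odb10).
complete, total, percent = busco_summary(1633, 364, 31, 93)
print(complete, total, round(percent, 2))
```

The same check can be applied row by row to spot transcription slips, since the four category counts in each row should sum to the dataset size N.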


Supplementary Table S5: BUSCO score comparison across previously published genomes from Malpighiales using the embryophyta_odb10 dataset.

| Sr. No. | Species | Complete and single-copy BUSCOs | Complete and duplicated BUSCOs | Fragmented BUSCOs | Missing BUSCOs | Complete BUSCOs (N=1375) | Complete BUSCOs (%) |
|---|---|---|---|---|---|---|---|
| 1 | Caryocar brasiliense | 1017 | 43 | 183 | 132 | 1060 | 77.1 |
| 2 | Euphorbia esula | 265 | 42 | 424 | 644 | 307 | 22.4 |
| 3 | Hevea brasiliensis | 1088 | 243 | 16 | 28 | 1331 | 96.8 |
| 4 | Jatropha curcas | 1331 | 18 | 4 | 22 | 1349 | 98.1 |
| 5 | Linum usitatissimum | 467 | 859 | 10 | 39 | 1326 | 96.5 |
| 6 | Mesua ferrea | 1093 | 199 | 18 | 65 | 1292 | 94 |
| 7 | Manihot esculenta | 1248 | 80 | 15 | 32 | 1328 | 96.6 |
| 8 | Passiflora edulis | 598 | 9 | 460 | 308 | 607 | 44.2 |
| 9 | Populus alba | 1074 | 270 | 10 | 21 | 1344 | 97.7 |
| 10 | Populus euphratica | 1063 | 284 | 6 | 22 | 1347 | 98 |
| 11 | Populus simonii | 1076 | 270 | 8 | 21 | 1346 | 97.9 |
| 12 | Populus trichocarpa | 1084 | 258 | 9 | 24 | 1342 | 97.6 |
| 13 | Ricinus communis | 1317 | 7 | 19 | 32 | 1324 | 96.3 |
| 14 | Salix brachista | 1200 | 146 | 5 | 24 | 1346 | 97.9 |
| 15 | Viola pubescens | 973 | 186 | 124 | 92 | 1159 | 84.3 |

Supplementary Table S6: LTR-retriever LAI scores for Malpighiales genome assemblies.

| Sr. No. | Species | Raw LAI | LAI |
|---|---|---|---|
| 1 | Hevea brasiliensis | 1.75 | 0.75 |
| 2 | Jatropha curcas | 2.37 | 1.22 |
| 3 | Linum usitatissimum | 5.55 | 8.76 |
| 4 | Manihot esculenta | 3.79 | 3.5 |
| 5 | Mesua ferrea | 2.1 | 7.73 |
| 6 | Populus alba | 11.32 | 11.74 |
| 7 | Populus euphratica | 1.25 | 3 |
| 8 | Populus simonii | 8.96 | 12.15 |
| 9 | Populus trichocarpa | 6.54 | 11.57 |
| 10 | Rhizophora apiculata | 7.48 | 13.03 |
| 11 | Ricinus communis | 4.37 | 5.43 |
| 12 | Salix brachista | 17.54 | 17.1 |


Supplementary Table S7: Annotation statistics for the iterative MAKER-P annotation of the Mesua ferrea genome assembly.

| Parameter | Round 1 | Round 2 | Round 3 | Round 4 | Round 5 |
|---|---|---|---|---|---|
| No. of protein-coding genes | 35557 | 38971 | 38965 | 39011 | 46540 |
| Gene density per kb | 0.07 | 0.06 | 0.08 | 0.06 | 0.08 |
| Average gene length (bp) | 2631.77 | 2722.22 | 2768.24 | 2761.98 | 2477.54 |
| Average exons per mRNA | 4.73 | 4.96 | 5.1 | 5.09 | 4.55 |
| Average exon length (bp) | 206.42 | 217.37 | 222.09 | 222.14 | 222.24 |
| Average intron length (bp) | 401.68 | 378.83 | 359.1 | 360.17 | 365.12 |
| Cumulative fraction of genes with AED < 0.5 | 0.99 | 0.92 | 0.92 | 0.92 | 0.93 |
| Percentage of complete BUSCOs | 89.4 | 93.3 | 92.3 | 92.1 | 94.7 |


Supplementary Table S8: Genome assemblies used in this study.

| Sr. No. | Species | Assembly | Database |
|---|---|---|---|
| 1 | Homo sapiens | GRCh38.p13 | ENSEMBL |
| 2 | Homo sapiens | hg4 | UCSC Genome Browser |
| 3 | Homo sapiens | hg10 | UCSC Genome Browser |
| 4 | Homo sapiens | hg15 | UCSC Genome Browser |
| 5 | Homo sapiens | hg19 | UCSC Genome Browser |
| 6 | Caryocar brasiliense | Cbr_v1.0 | NCBI |
| 7 | Euphorbia esula | ASM291907v1 | NCBI |
| 8 | Hevea brasiliensis | ASM165405v1 | NCBI |
| 9 | Jatropha curcas | JatCur_1.0 | NCBI |
| 10 | Linum usitatissimum | ASM22429v2 | NCBI |
| 11 | Mesua ferrea | This study | |
| 12 | Manihot esculenta | Manihot esculenta v6 | NCBI |
| 13 | Passiflora edulis | ASM215610v1 | NCBI |
| 14 | Populus alba | ASM523922v1 | NCBI |
| 15 | Populus euphratica | PopEup_1.0 | NCBI |
| 16 | Populus simonii | Populus_simonii_2.0 | NCBI |
| 17 | Populus trichocarpa | Pop_tri_v3 | NCBI |
| 18 | Ricinus communis | JCVI_RCG_1.1 | NCBI |
| 19 | Salix brachista | ASM907833v1 | NCBI |
| 20 | Viola pubescens | GCA_002752925.1 | NCBI |
| 21 | Betula pendula | Bpev01 | NCBI |
| 22 | Durio zibethinus | Duzib1.0 | NCBI |
| 23 | Eucalyptus grandis | Egrandis1_0 | NCBI |
| 24 | Faidherbia albida | http://gigadb.org/dataset/101054 | GIGADB |
| 25 | Tectona grandis | https://biit.cs.ut.ee/supplementary/WGSteak/ | |
| 26 | Quercus robur | Q_robur_v1 | NCBI |
| 27 | Ficus carica | UNIPI_FiCari_1.0 | NCBI |
| 28 | Citrus unshiu | http://www.citrusgenome.jp/ | DDBJ |
| 29 | Trema orientalis | TorRG33x02_asm01 | NCBI |
| 30 | Castanea mollissima | ASM76360v1 | NCBI |
| 31 | Olea europaea | O_europea_v1 | NCBI |
| 32 | Santalum album | ASM291163v1 | NCBI |
| 33 | Populus pruinosa | http://gigadb.org/dataset/100319 | GIGADB |
| 34 | Trochodendron aralioides | http://gigadb.org/dataset/100657 | GIGADB |
| 35 | Tribolium castaneum | ftp://ftp.hgsc.bcm.edu/Tcastaneum/Tcas1.0/ | HGSC, BCM |
| 36 | Tribolium castaneum | Tcas5.2 | NCBI |
| 37 | Danio rerio | danRer1 | UCSC Genome Browser |
| 38 | Danio rerio | danRer11 | UCSC Genome Browser |

Supplementary Table S9: Details of the mutation rates and generation times used. The median estimate of the mutation rate, i.e. 2.5e-09 per site per year, was used for all and converted to per-generation mutation rates according to their generation times.

| Sr. No. | Species | Generation time used (years) | Mutation rate per generation |
|---|---|---|---|
| 1 | Betula pendula | 20 | 5e-08 |
| 2 | Durio zibethinus | 7 | 1.75e-08 |
| 3 | Eucalyptus grandis | 7 | 1.75e-08 |
| 4 | Faidherbia albida | 7 | 1.75e-08 |
| 5 | Tectona grandis | 7 | 1.75e-08 |
| 6 | Quercus robur | 15 | 3.75e-08 |
| 7 | Ficus carica | 7 | 1.75e-08 |
| 8 | Citrus unshiu | 7 | 1.75e-08 |
| 9 | Trema orientalis | 7 | 1.75e-08 |
| 10 | Castanea mollissima | 7 | 1.75e-08 |
| 11 | Olea europaea | 7 | 1.75e-08 |
| 12 | Santalum album | 7 | 1.75e-08 |
| 13 | Populus pruinosa | 15 | 3.75e-08 |
| 14 | Trochodendron aralioides | 15 | 3.75e-08 |
| 15 | Mesua ferrea | 15 | 3.75e-08 |
| 16 | Tribolium castaneum | 0.3 (12 weeks) | 2.7e-09 |
| 17 | Danio rerio | 1 | 1e-09 |
| 18 | Homo sapiens | 25 | 2.5e-08 |
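The conversion described in the caption of Supplementary Table S9 amounts to multiplying the per-year rate by the generation time in years. The sketch below shows that arithmetic; the names `MU_PER_YEAR` and `per_generation_rate` are hypothetical. It reproduces the tree rows of the table (generation times of 7, 15 and 20 years). The values listed for Tribolium castaneum, Danio rerio and Homo sapiens do not equal this product, so they presumably rest on species-specific estimates and are not covered here.

```python
import math

# Median per-site, per-year mutation rate quoted in the table caption.
MU_PER_YEAR = 2.5e-09


def per_generation_rate(generation_time_years, mu_per_year=MU_PER_YEAR):
    """Convert a per-year mutation rate to a per-generation rate."""
    if generation_time_years <= 0:
        raise ValueError("generation time must be positive")
    return mu_per_year * generation_time_years


# Generation times (years) taken from the tree rows of the table.
rates = {g: per_generation_rate(g) for g in (7, 15, 20)}
for g, mu in rates.items():
    print(f"generation time {g} y -> {mu:.3g} per site per generation")
```

A per-generation rate and a generation time are the two scaling parameters that coalescent tools such as PSMC need to turn scaled time and population size into years and effective population size, which is why both columns are reported together.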
