Post on 01-Jul-2018
Supplementary information, Data S1 Supplemental methods
Genome assembly
Three intermediate assembly versions for the quinoa genome were generated
using Illumina reads (v0.1) and PacBio reads (v0.2 and v0.3) separately. Then the
three assemblies were merged together using the HABOT2 software (1gene Corp.,
Hangzhou, China; https://github.com/asarum/HABOT2) and a final round of
scaffolding and gap filling was performed using Illumina reads to obtain Cq assembly
v1.0. Detailed protocols are described below.
Filtered Illumina reads from two PCR-free libraries, with average insert sizes of
~380 bp and ~450 bp respectively, were used to construct contigs using the
DISCOVAR de novo software [1, 2]. Reads from the two medium-size mate pair
libraries (Supplementary Table 1) were then used for scaffolding with SSPACE 3.0
[3]. Gaps inside the constructed scaffolds were filled using the GapCloser v1.12 module
from SOAPdenovo2 [4]. The resulting assembly v0.1 has a scaffold N50 of 49.5 kb
and a contig N50 of 26.1 kb (Supplementary Table 3). The parameters used for each
program are listed below:
Discovar:
DiscovarDeNovo READS=./data/*fastq.gz OUT_DIR=Assembly
SSPACE:
perl SSPACE_Standard_v3.0.pl -l lib.cfg -s a.lines.fasta -T 8
GapCloser:
GapCloser -o closegap.fa -b lib.cfg -a closeGap.fa -t 16
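The scaffold and contig N50 values quoted above follow the usual definition: the length L such that sequences of length at least L make up half of the total assembly size. The snippet below is only an illustrative sketch of that definition; the function name n_stat and the toy lengths are assumptions made for this example and are not part of the pipeline used in this study.

```python
def n_stat(lengths, fraction=0.5):
    """Return the Nx length (N50 when fraction=0.5, N90 when fraction=0.9).

    Sequences are sorted from longest to shortest and summed until the
    running total reaches `fraction` of the total assembly size; the length
    of the sequence that crosses this threshold is returned.
    """
    if not lengths:
        raise ValueError("no sequence lengths given")
    ordered = sorted(lengths, reverse=True)
    threshold = fraction * sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running >= threshold:
            return length


# Made-up scaffold lengths (bp), for illustration only.
toy_lengths = [100, 80, 60, 40, 20]
print("N50:", n_stat(toy_lengths, 0.5))
print("N90:", n_stat(toy_lengths, 0.9))
```

The same function covers the N60 to N90 rows reported in the assembly summary table by changing the fraction argument.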
Falcon v0.3 was used to assemble Cq assembly v0.2. The config file for
running Falcon v0.3:
Cq.cfg:
input_fofn = input.fofn
input_type = raw
length_cutoff = 8000
length_cutoff_pr = 5000
pa_HPCdaligner_option = -v -dal128 -M32 -e.70 -l1000 -s1000
pa_DBsplit_option = -x500 -s400
falcon_sense_option = --output_multi --min_idt 0.70 --min_cov 4 --local_match_count_threshold 2 --max_n_read 200 --n_core 10
overlap_filtering_setting = --max_diff 500 --max_cov 500 --min_cov 2 --bestn 10 --n_core 6
We then used Canu to obtain Cq assembly v0.3, which has a contig N50 of ~51 kb and
an assembly size of ~1.43 Gb (Supplementary Table 3).
canu -p Cq_assem -d Cq_assem_34X genomeSize=1480m useGrid=remote -pacbio-raw ./Data/Cq.Pcabio.fa
Then HABOT2 was used to combine contigs over 1 kb in length from the
above three assemblies and to construct a new contig set. HABOT2 contains 4
modules:
a. Graph module. This module counts k-mer frequencies and extracts the unique k-
mers from Illumina reads. A unique k-mer is theoretically defined as a k-bp sequence
that occurs just once in a haploid genome and is calculated following a Poisson model
as described previously [5]. Using unique k-mers, instead of all the k-mers, for graph
construction minimizes the effects of error-prone repeats and increases computation
speed.
b. Alignment module. This module is used for all-to-all alignment between PacBio
contigs and Illumina contigs. By using unique k-mers, it performs the alignment much
faster than blasr [6] and with high accuracy.
c. Duplication removal module. When the fraction of unique k-mers shared by two
sequences exceeds a cutoff (default 0.5), the shorter sequence is removed.
d. De novo module. This module calls the above 3 modules and performs hybrid
assembly.
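The ideas behind the graph and duplication removal modules can be illustrated with a small sketch. This is not HABOT2 code. The names kmers, unique_kmers and is_redundant are hypothetical, the simple count window stands in for the Poisson model of [5], and the choice of denominator (the unique k-mers found in the shorter sequence) is an assumption, since the text above gives only the cutoff.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every k-bp substring of seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def unique_kmers(reads, k, min_count, max_count):
    """Return k-mers whose read count lies in [min_count, max_count].

    Assumption: a fixed count window around the main depth peak is used
    here in place of the Poisson model described in the text.
    """
    counts = Counter(km for read in reads for km in kmers(read, k))
    return {km for km, c in counts.items() if min_count <= c <= max_count}


def is_redundant(short_seq, long_seq, k, unique, cutoff=0.5):
    """Return True if the shorter sequence should be removed.

    Assumption: the shared fraction is computed relative to the unique
    k-mers present in the shorter sequence.
    """
    short_set = {km for km in kmers(short_seq, k) if km in unique}
    if not short_set:
        return False
    long_set = set(kmers(long_seq, k))
    shared = sum(1 for km in short_set if km in long_set)
    return shared / len(short_set) > cutoff


# Toy data with k = 3, for illustration only.
toy_unique = unique_kmers(["ACGT", "ACGA"], 3, min_count=2, max_count=2)
print(toy_unique)
print(is_redundant("ACGTA", "TTACGTTT", 3, {"ACG", "CGT", "GTA", "TAC"}))
```

Restricting the comparison to unique k-mers is what keeps repeats from inflating the shared fraction, which matches the motivation given for the graph module.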
For the Cq assembly, we extracted unique 17-mers from the PCR-free Illumina
libraries. Overlaps among contigs from different intermediate assembly versions were
identified using the alignment module. Then the OLC (Overlap Layout Consensus)
graph was built for the overlapping contigs. The connection from contig A to contig B
is dropped when all of the following conditions hold:
(1) Contig A’s best connection is contig B
(2) Contig B’s best connection is contig C
(3) A does not overlap with C
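The three conditions above can be expressed as a short filter over the overlap graph. The sketch below is a hypothetical illustration rather than HABOT2 code: the function name edges_to_drop, the dictionary of best connections and the set of overlapping pairs are all assumed data structures, and treating a mutual best pair (C equal to A) as exempt is an added assumption.

```python
def edges_to_drop(best, overlaps):
    """Return the set of (A, B) connections to remove.

    best     -- dict mapping each contig to its best-connected contig
    overlaps -- set of frozenset({X, Y}) pairs for contigs that overlap
    """
    dropped = set()
    for a, b in best.items():
        c = best.get(b)
        # Assumption: skip when B has no best connection or points back to A.
        if c is None or c == a:
            continue
        # Conditions (1) and (2) hold by construction; check condition (3).
        if frozenset((a, c)) not in overlaps:
            dropped.add((a, b))
    return dropped


# Toy graph: A's best is B, B's best is C, and A does not overlap C.
toy_best = {"A": "B", "B": "C"}
toy_overlaps = {frozenset(("A", "B")), frozenset(("B", "C"))}
print(edges_to_drop(toy_best, toy_overlaps))
```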
After the overlapping contigs were merged, duplicated regions in the contig set were
removed. Finally, scaffolding and gap closure were performed on the new contig set
using Illumina mate pair reads with SSPACE v3.0 and GapCloser v1.12 (both with
default parameters).
Next, we used the PE250 reads from the two PCR-free libraries to correct
assembly errors. Filtered reads were aligned to the assembly using bwa mem with
default parameters, and the GATK package [7] was then used to identify variants. The
commands were run in the following order, with the parameters shown:
IndelRealigner;
UnifiedGenotyper --read_filter BadCigar -glm BOTH -stand_call_conf 30.0 -stand_emit_conf 0;
VariantFiltration --filterExpression "QD < 20.0 || ReadPosRankSum < -8.0 || FS > 10.0 || QUAL < 30" --filterName LowQualFilter;
BaseRecalibrator;
PrintReads;
UnifiedGenotyper --read_filter BadCigar -glm BOTH -stand_call_conf 30.0 -stand_emit_conf 0;
VariantFiltration --filterExpression "QD < 20.0 || ReadPosRankSum < -8.0 || FS > 10.0 || QUAL < 30" --filterName LowQualFilter.
An in-house Perl script was then used to correct the high-confidence errors
identified. This process was repeated five times in total to generate assembly v1.0.
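The filter expression passed to VariantFiltration is a logical OR of four thresholds, so a call is flagged as LowQualFilter as soon as any one of them is met. The sketch below restates that logic in Python for readers unfamiliar with the expression syntax. It is not part of GATK or of the in-house script; the function names, the dictionary layout, and the assumption that all four annotations are present for every call are made for illustration only.

```python
def matches_low_qual_filter(ann):
    """Return True if a call satisfies the LowQualFilter expression.

    ann -- dict with the keys QD, ReadPosRankSum, FS and QUAL
           (assumed to be present for every call in this sketch).
    """
    return (
        ann["QD"] < 20.0
        or ann["ReadPosRankSum"] < -8.0
        or ann["FS"] > 10.0
        or ann["QUAL"] < 30
    )


def split_calls(calls):
    """Separate calls into (kept, flagged) lists."""
    kept, flagged = [], []
    for call in calls:
        (flagged if matches_low_qual_filter(call) else kept).append(call)
    return kept, flagged


# Made-up annotation values, for illustration only.
toy_calls = [
    {"QD": 25.0, "ReadPosRankSum": 0.1, "FS": 2.0, "QUAL": 80},
    {"QD": 25.0, "ReadPosRankSum": 0.1, "FS": 12.5, "QUAL": 80},
]
kept, flagged = split_calls(toy_calls)
print(len(kept), "kept;", len(flagged), "flagged")
```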
Specific HABOT2 commands are listed below:
a) Extract unique 17-mers from Illumina reads:
perl Graph.pl pipe -i fq.lst -m 2 -k 17 -s 1,3 -d Kmer_17
b) Config file for hybrid assembly using the de novo module:
# the input file list, in fasta format
[pb_lst] file.lst
# Data type:
# 1:the uncorrected data
# 2:the corrected data
[pb_type] 2
# filter read length
[filt_len] 1000
# genome size in Megabase
[genome_size] 1500
# unique kmer, kmer size
[unique_kmer] kmer_17.bit 17
# Align parameters
# Col.1,kmer size
# Col.2:the scope for find anchor,-1 is for all length
# Col.3:align mode
# 1:for uncorrected reads
# 2:for corrected reads
# Col.4:filter score below this value
[strategy] 17 -1 2 0.8
[strategy] 17 -1 1 0.9
[strategy] 35 1000 1 0.9
[strategy] 35 -1 1 0.9
# project name
[pro_name] CQ
# qsub parameters
[queue] dna.q,rna.q,reseq.q
[Project] og
[max_job] 50
[thread] 8
Then run the command:
perl Denovo.pl input.cfg &
Assessment of the assembly
Alignment of PE250 reads (from the two PCR-free libraries) to Cq_real_v1.0 was
performed using bwa mem with default parameters. Then samtools and bcftools were
used for SNP calling and summarization.
A 40-kb fosmid library for quinoa was constructed using the CopyControl Fosmid
Library Production Kit (Epicentre). Then 10 single colonies were picked and cultured
in 100 mL LB medium. The 10 fosmids were then extracted using a Plasmid Midi Kit
(Qiagen), mixed in equal quantity and used for a 20-kb PacBio library preparation.
Library preparation and sequencing were performed at Tianjin Biochip Corporation.
Falcon v0.3.0 was used for the de novo assembly of fosmid sequences with default
parameters. After removing plasmid sequences, the contigs were then aligned to
Cq_real_v1.0 using blastn with the parameter -task blastn-short. The results were
summarized manually.
Chloroplast Genome Annotation
Annotation of the chloroplast genome was performed separately using DOGMA [8]
and CpGAVAS [9] with the following parameters: BLAST E-value cutoff, 1e-10;
maximum target hit number, 10; and maximum length of tRNA intron and variable
region, 116 bp. The outputs from the two programs were then integrated by retaining the
longer open reading frame (ORF) using an in-house Perl script. The predicted
start/stop codons and the exon-intron boundaries of intron-containing genes were
manually examined and curated. The map of the chloroplast genome was generated
using GenomeVx [10], followed by manual adjustment.
Annotation of Repeats
Both homology-based and de novo approaches were used for repeat annotation. Three
complementary software programs, LTR_FINDER [11], PILER [12] and RepeatModeler
[13], were used with default parameters to generate a de novo repeat library for
quinoa. These programs use complementary methods to predict repeats.
LTR_FINDER retrieves full-length LTR retrotransposons; PILER searches for
repeats in the genome by aligning the genome sequence to itself; RepeatModeler uses
two complementary repeat prediction programs, RECON and RepeatScout, to identify
repeat element boundaries and family relationships. This de novo repeat library was
then used together with Repbase [14] for homology search of repeats using
RepeatMasker [15].
Gene Prediction and Annotation
Three independent approaches, namely homology search, ab initio prediction and
reference-guided transcriptome assembly, were used for gene prediction in the
repeat-masked genome. Evidence from the three approaches was then merged using
GLEAN to generate the final gene set.
Homology-based gene prediction. Putative open reading frames (ORFs) in the quinoa
genome were identified by aligning the protein sequences of A. thaliana,
Thellungiella salsuginea, Beta vulgaris, Spinacia oleracea, Amaranthus
hypochondriacus, and Fagopyrum esculentum (Supplementary Table 15) to
Cq_real_v1.0 using TBLASTN with an E-value cutoff of 1e-5. We next extracted
those ORF regions containing introns from the genome, including 2,000-bp
extensions at both ends, and again aligned protein sequences from other species to
these DNA fragments using GeneWise [16] with parameters: -trev -sum -genesf.
Ab initio gene prediction. AUGUSTUS v2.5.5 [17] and SNAP (version 2006-07-28)
[18] with default parameters were used for de novo gene prediction, with gene
model parameters trained on the A. thaliana TAIR10 genome. Short coding sequences
that were less than 150 bp in length were discarded.
Transcriptome-assisted gene prediction. We used TopHat [19] to map filtered mRNA-seq
reads to Cq_real_v1.0 to identify exonic regions and intron-exon boundaries with
the following parameters: -p 4 --max-intron-length 20000 -m 1 -r 20 --mate-std-dev 20.
Cufflinks [19] was then used to assemble the alignments into transcripts with the
parameters: -I 20000 -p 4.
Results derived from the above three approaches were integrated to generate a
consensus gene set using GLEAN with default parameters [20]. The probabilistic
confidence score generated by GLEAN was used to reflect the consistency among
different sources of evidence.
Assessment of gene models
The completeness of gene prediction was assessed by comparing the quinoa protein
sequences to the 1,440 embryophyta single-copy orthologs in BUSCO v2
(Benchmarking Universal Single-Copy Orthologs) [21] with a BLAST E-value cutoff
of 1e-5.
The predicted gene models were further assessed using Eval v2.2.8 [22]. Two sets of
“gold standard” genes were used: the 56 mRNA sequences of C. quinoa retrieved
from the NCBI nucleotide database and high-expression transcripts based on the
mRNA-seq data. For high-expression transcripts, Trinity was used for de novo
transcript assembly in the genome guided mode [23] with combined mRNA-seq data
from 8 types of quinoa tissue and the transcripts with a sequencing coverage higher
than 100 were retained. Maker v2.31.9 was used to map the assembled transcripts to
Cq_real_v1.0 to generate the gene models in GFF format.
Function annotation of genes
To assign gene functions, the predicted quinoa protein sequences were searched
against five protein/function databases: InterPro, GO, KEGG, Swiss-Prot and
TrEMBL. The InterPro database search was performed using InterProScan with
parameters: -f TSV -dp -goterms -iprlookup -pa. For the other four databases, BLAST
searches using the quinoa protein sequences as queries were performed with an E-value
cutoff of 1e-5. Results from the five database searches were then concatenated.
For GO term enrichment analysis, Fisher's exact test was performed and the p-value
was adjusted for multiple testing using the Benjamini-Hochberg (BH) method.
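As an illustration of the enrichment statistics, the sketch below computes a Fisher's exact test p-value as a hypergeometric upper tail and applies the BH step-up adjustment. It is not the script used in this study. The function names and toy numbers are hypothetical, and the choice of a one-sided (over-representation) test is an assumption, since the sidedness of the test is not stated here.

```python
from math import comb


def fisher_enrichment_p(k, n, K, N):
    """One-sided Fisher's exact test for over-representation.

    k -- genes in the test set carrying the GO term
    n -- size of the test set
    K -- genes in the background carrying the GO term
    N -- size of the background
    Returns P(X >= k) for X drawn from the hypergeometric distribution.
    """
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


def bh_adjust(pvalues):
    """Return BH-adjusted p-values (q values) in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values, for illustration only.
print(bh_adjust([0.01, 0.04, 0.03, 0.005]))
print(fisher_enrichment_p(k=1, n=1, K=1, N=2))
```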
References
1. Weisenfeld NI, Yin S, Sharpe T, et al. Comprehensive variation discovery in single human genomes. Nat Genet 2014; 46:1350-1355.
2. Love RR, Weisenfeld NI, Jaffe DB, Besansky NJ, Neafsey DE. Evaluation of DISCOVAR de novo using a mosquito sample for cost-effective short-read genome assembly. BMC Genomics 2016; 17:187.
3. Boetzer M, Henkel CV, Jansen HJ, Butler D, Pirovano W. Scaffolding pre-assembled contigs using SSPACE. Bioinformatics 2011; 27:578-579.
4. Luo R, Liu B, Xie Y, et al. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. Gigascience 2012; 1:18.
5. You M, Yue Z, He W, et al. A heterozygous moth genome provides insights into herbivory and detoxification. Nat Genet 2013; 45:220-225.
6. Chaisson MJ, Tesler G. Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC Bioinformatics 2012; 13:238.
7. Van der Auwera GA, Carneiro MO, Hartl C, et al. From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Curr Protoc Bioinformatics 2013; 43:11.10.1-11.10.33.
8. Wyman SK, Jansen RK, Boore JL. Automatic annotation of organellar genomes with DOGMA. Bioinformatics 2004; 20:3252-3255.
9. Liu C, Shi L, Zhu Y, et al. CpGAVAS, an integrated web server for the annotation, visualization, analysis, and GenBank submission of completely sequenced chloroplast genome sequences. BMC Genomics 2012; 13:715.
10. Conant GC, Wolfe KH. GenomeVx: simple web-based creation of editable circular chromosome maps. Bioinformatics 2008; 24:861-862.
11. Xu Z, Wang H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Res 2007; 35:W265-268.
12. Edgar RC, Myers EW. PILER: identification and classification of genomic repeats. Bioinformatics 2005; 21 Suppl 1:i152-158.
13. Price AL, Jones NC, Pevzner PA. De novo identification of repeat families in large genomes. Bioinformatics 2005; 21 Suppl 1:i351-358.
14. Bao W, Kojima KK, Kohany O. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mob DNA 2015; 6:11.
15. Tarailo-Graovac M, Chen N. Using RepeatMasker to identify repetitive elements in genomic sequences. Curr Protoc Bioinformatics 2009; Chapter 4:Unit 4.10.
16. Birney E, Clamp M, Durbin R. GeneWise and Genomewise. Genome Res 2004; 14:988-995.
17. Stanke M, Keller O, Gunduz I, Hayes A, Waack S, Morgenstern B. AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Res 2006; 34:W435-439.
18. Korf I. Gene finding in novel genomes. BMC Bioinformatics 2004; 5:59.
19. Trapnell C, Williams BA, Pertea G, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol 2010; 28:511-515.
20. Elsik CG, Mackey AJ, Reese JT, Milshina NV, Roos DS, Weinstock GM. Creating a honey bee consensus gene set. Genome Biol 2007; 8:R13.
21. Simao FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 2015; 31:3210-3212.
22. Keibler E, Brent MR. Eval: a software package for analysis of genome annotations. BMC Bioinformatics 2003; 4:50.
23. Grabherr MG, Haas BJ, Yassour M, et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol 2011; 29:644-652.
Supplementary information, Figure S1 The distribution of 17-mer frequencies of
C. quinoa.
The frequency of 17-mers peaks at a depth of 67. The genome size is estimated using the
formula: Genome size = K-mer number/peak depth. A total of 344,467,594 reads from
the two PCR-free libraries were used, which resulted in 99,297,839,479 K-mers.
K-mer  K-mer Number    Peak Depth  Genome Size (nt)  Used Bases      Used Reads   Depth
17     99,297,839,479  67          1,482,057,305     86,461,366,094  344,467,594  58.4
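The genome size in the table follows directly from the formula in the caption. The short calculation below reproduces it from the reported K-mer number and peak depth; the variable names are chosen for this example only, and integer division is used as an assumption about how the value was rounded.

```python
# Values reported in the Figure S1 table.
kmer_number = 99_297_839_479
peak_depth = 67

# Genome size = K-mer number / peak depth (rounded down to whole nucleotides).
genome_size = kmer_number // peak_depth
print(f"Estimated genome size: {genome_size:,} nt")
```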
Supplementary information, Figure S2 The chloroplast genome of C. quinoa.
The bold inner circle indicates the 4 conserved regions of the chloroplast genome: two
inverted repeats (IRA and IRB) that separate a short single copy (SSC) region and a
long single copy (LSC) region. The inner dark grey bars show the coverage of the
chloroplast genome in non-overlapping 10-bp windows. Genes located on the (+) and
(-) strand are indicated on the outside and inside of the circle respectively.
Supplementary information, Figure S3 Examples of Cq_real_v1.0 scaffolds that are
anchored to a published genetic map of quinoa (Maughan et al. 2012).
Names of the SNP markers and their positions on different linkage groups (LG) are
indicated on the left and the corresponding scaffolds with the position (bp) of SNP
markers are indicated on the right.
Supplementary information, Figure S4 Assessment of gene prediction in
Cq_real_v1.0.
(A) Cumulative frequency of the probabilistic confidence score of predicted genes in
Cq_real_v1.0 given by GLEAN. (B) Pie charts showing the percentage of gene sets
that have conserved protein domains present or absent in the protein sequence.
Conserved domains were identified by BLAST against the Pfam database using an
E-value cutoff of 1E-4. (C) Distribution of RPKM values of all the annotated genes.
The mRNA-seq data from 5 different types of quinoa tissue were used.
Supplementary information, Figure S5 Scatterplot showing the RPKM values of
the 10,554 genes that are predicted in Cq_real_v1.0 but not in ASM168347v1.
The 8 different types of quinoa tissue used for mRNA-seq are indicated on the x-axis
and the log-transformed RPKM values are shown on the y-axis.
Supplementary information, Figure S6 Venn diagram showing the overlap of
orthologous groups among quinoa and 4 other plant species.
Supplementary information, Figure S7 Histogram of Ks (synonymous substitution)
values for paralogous quinoa gene pairs.
The peaks at Ks=0.12 and Ks=1.6 indicate a recent and an ancient genome
duplication respectively.
Supplementary information, Figure S8 Composition of 19 amino acid residues
(except Lys) in albumin proteins in five different crops.
AA, Aegilops tauschii; CC, Chenopodium quinoa; Gl, Glycine max; GR, Zea mays;
Os, Oryza sativa.
Supplementary information, Figure S9 Composition of 19 amino acid residues
(except Lys) in globulin proteins in five different crops.
AA, Aegilops tauschii; CC, Chenopodium quinoa; Gl, Glycine max; GR, Zea mays;
Os, Oryza sativa.
Supplementary information, Figure S10 Composition of 19 amino acid residues
(except Lys) in LEA proteins in five different crops.
AA, Aegilops tauschii; CC, Chenopodium quinoa; Gl, Glycine max; GR, Zea mays;
Os, Oryza sativa.
Supplementary information, Figure S11 Unrooted phylogenetic tree of NCED
genes.
The consensus tree was generated based on a multiple alignment of the CDS (coding
sequences) of NCED genes from quinoa and 7 other plant species as in Figure 3B. The
codon alignment was generated by Muscle in MEGA7.0 and the tree was constructed
using the neighbor-joining method with 1000 bootstrap replicates. Numbers at each
branching point indicate the local bootstrap values per 100 replicates. The origin
of genes can be distinguished based on the first few letters of the gene name (A.
hypochondriacus - AHYPO; A. thaliana - AT; S. oleracea - SOVF; B. vulgaris -
BVRB; V. vinifera - GSVIVG; S. lycopersicum - Solyc; S. tuberosum - PGSC; C.
quinoa - CCG).
Supplementary information, Figure S12 Unrooted phylogenetic tree of PYL genes.
The consensus tree was generated based on a multiple alignment of the CDS (coding
sequences) of PYL genes from quinoa and 7 other plant species as in Figure 3B. The
codon alignment was generated by Muscle in MEGA7.0 and the tree was constructed
using the neighbor-joining method with 1000 bootstrap replicates. Numbers at each
branching point indicate the local bootstrap values per 100 replicates. The origin
of genes can be distinguished based on the first few letters of the gene name (A.
hypochondriacus - AHYPO; A. thaliana - AT; S. oleracea - SOVF; B. vulgaris -
BVRB; V. vinifera - GSVIVG; S. lycopersicum - Solyc; S. tuberosum - PGSC; C.
quinoa - CCG).
Supplementary information, Figure S13 Matrix of Pearson’s correlation
coefficients based on RPKM values of all annotated genes.
R1, R2, R3 and R4 represent the biological replicates of each indicated tissue type.
The size of the dot and the depth of blue color positively correlate with the coefficient
on a scale from 0 to 1.
Supplementary information, Figure S14 Venn diagram showing the overlap of
DEGs (differentially expressed genes) among the four indicated comparisons.
Supplementary information, Figure S15 Heatmap showing the RPKM value of the
transporter genes for ions, monosaccharides and ABA in leaf without bladders and
bladder cells.
The category of genes and the gene ID are indicated to the left of the heatmap. The
legend indicates the correlation between colors and RPKM values.
Supplementary information, Figure S16 Unrooted phylogenetic tree of hemoglobin
genes.
The consensus tree was generated based on a multiple alignment of the CDS (coding
sequences) from quinoa and Arabidopsis thaliana. The codon alignment was generated
by Muscle in MEGA7.0 and the tree was constructed using the neighbor-joining
method with 1000 bootstrap replicates. Numbers at each branching point indicate
the local bootstrap values per 100 replicates.
Supplementary Table 1. Summary of sequencing data used for de novo assembly

Raw
Library No. (Illumina)  Insert Size  Amount of Data (Gb)  Read Length (bp)  Sequence Depth (x)  Physical Depth (x)
1                       380 bp       27.97                250_250           18.87               14.34
2                       380 bp       90.58                250_250           61.15               46.45
3                       5 kb         20.75                250_250           14.00               140.01
                                     24.62                125_125           16.61               332.25
4                       8 kb         38.24                125_125           25.80               825.70
                                     18.37                250_250           12.40               198.33
5 (PacBio)              20 kb        50.3                 7,521 (average)   33.94               33.94
Total                                213.97                                 182.77              1591.02

Filtered
Library No. (Illumina)  Insert Size  Amount of Data (Gb)  Read Length (bp)  Sequence Depth (x)  Physical Depth (x)
1                       380 bp       23.7                 250_250           16.01               9.37
2                       380 bp       62.76                250_250           42.41               31.44
3                       5 kb         2.99                 50_50             2.02                100.88
4                       8 kb         4.79                 50_50             3.23                258.57
5 (PacBio)              20 kb        50.3                 7,521 (average)   33.94               33.94
Total                                153.41                                 97.61               434.20
Table S2. Summary of different Cq assembly versions

v0.1                   Scaffold                  Contig
                       Size (bp)      Number     Size (bp)      Number
N90                    284            444,927    269            482,414
N80                    736            92,438     667            126,648
N70                    13,363         18,162     6,374          34,478
N60                    32,172         11,290     16,613         21,015
N50                    49,570         7,592      26,127         14,095
Longest                499,594        ----       281,357        ----
Total Size             1,490,886,271  ----       1,450,125,437  ----
Total Number (≥100bp)  ----           1,142,478  ----           1,165,224
Total Number (≥2kb)    ----           33,071     ----           52,781

v0.2                   Contig
                       Size (bp)      Number
N90                    14,775         14,393
N80                    23,994         10,451
N70                    33,267         7,801
N60                    43,470         5,827
N50                    53,964         4,281
Longest                365,916        ----
Total Size             750,878,283    ----
Total Number (≥100bp)  ----           25,024
Total Number (≥2kb)    ----           23,874

v0.3                   Scaffold                  Contig
                       Size (bp)      Number     Size (bp)      Number
N90                    9,609          37,102     9,606          37,107
N80                    13,389         24,304     13,388         24,306
N70                    22,513         15,803     22,509         15,804
N60                    36,585         10,814     36,582         10,815
N50                    51,182         7,493      51,182         7,493
Longest                586,601        ----       586,601        ----
Total Size             1,439,881,532  ----       1,439,878,528  ----
Total Number (≥100bp)  ----           59,594     ----           59,618
Total Number (≥2kb)    ----           58,870     ----           58,881

v1.0                   Scaffold                  Contig
                       Size (bp)      Number     Size (bp)      Number
N90                    422,582        1,087      75,678         5,001
N80                    629,431        833        125,667        3,657
N70                    810,545        647        172,908        2,761
N60                    985,512        498        220,272        2,082
N50                    1,161,829      373        268,320        1,536
Longest                5,397,643      ----       1,594,973      ----
Total Size             1,337,226,356  ----       1,325,521,993  ----
Total Number (≥100bp)  ----           3,184      ----           10,795
Total Number (≥2kb)    ----           2,827      ----           9,981
Table S3. Summary of gene space assessment using de novo assembled EST sequences

                                                                with >90% sequence in one scaffold  with >50% sequence in one scaffold
         Number   Total length (bp)  Covered by assembly (%)   Number   Percent (%)                 Number   Percent (%)
All      234,311  264,398,310        98.938                    223,625  95.439                      234,149  99.931
>200bp   234,311  264,398,310        98.938                    223,625  95.439                      234,149  99.931
>500bp   151,776  237,088,802        98.903                    144,416  95.151                      151,672  99.931
>1000bp  98,860   198,932,903        98.970                    94,075   95.160                      98,798   99.937
Table S4. Summary of fosmid alignment with published quinoa assemblies

Cq_real_v1.0 (This study)
Fosmid ID  length  Target scaffold  Target  match score  mismatch  gap
fosmid1    31691   scaffold_0031    31856   31681        4         171
fosmid2    31421   scaffold_0094    31877   31277        102       498
fosmid3    33944   scaffold_0440    34642   33796        83        763
fosmid4    33082   scaffold_0742    33267   33026        8         233
fosmid5    38339   scaffold_0103    38545   38280        29        236
fosmid6    30450   scaffold_0165    30604   30333        57        214
fosmid7    33655   scaffold_0122    33858   33607        14        237
fosmid8    36137   scaffold_0270    36334   35922        82        330
fosmid9    30327   scaffold_0351    30527   30193        85        249
fosmid10   32749   scaffold_0949    32848   32722        1         125

ASM168347v1 (Jarvis et al. 2017)
Fosmid ID  length  Target scaffold         Target  match score  mismatch  gap
fosmid1    31691   C_Quinoa_Scaffold_3525  31864   31668        9         187
fosmid2    31421   C_Quinoa_Scaffold_1319  31732   31260        123       349
fosmid3    33944   C_Quinoa_Scaffold_3901  34221   33656        111       454
fosmid4    33082   C_Quinoa_Scaffold_4480  33359   33022        30        307
fosmid5    38339   C_Quinoa_Scaffold_3144  38614   38249        62        303
fosmid6    30450   C_Quinoa_Scaffold_3820  30615   30125        90        400
fosmid7    33655   C_Quinoa_Scaffold_3298  33898   33605        22        271
fosmid8    36137   C_Quinoa_Scaffold_1465  36772   35800        171       801
fosmid9    30327   C_Quinoa_Scaffold_3388  28062   27735        84        243
fosmid10   32749   C_Quinoa_Scaffold_3107  33245   32664        27        554
Table S5. Summary of transposable elements (TEs) in Cq_real_v1.0

Superfamily of TEs        Coverage of TEs (bp)  Proportion of the assembly (%)
Class I
  LTR retrotransposons
    Gypsy                 448,883,024           33.58
    Copia                 156,347,966           11.69
    ERV                   1,312,713             0.10
    Caulimovirus          1,463,548             0.11
    Unclassified          756,480               0.06
  LINE
    R2                    1,206,051             0.09
    RTE                   1,692,267             0.13
    Jockey                742,469               0.06
    L1                    21,604,281            1.62
    CRE                   3,298,455             0.25
    Unclassified          1,234,992             0.09
  SINE                    15,186                0.00
Class II
    CMC                   48,617,566            3.64
    hAT                   15,146,731            1.13
    MuDR                  12,738,187            0.95
    TcMar                 8,147,037             0.61
    En                    7,900,599             0.59
    PIF                   3,363,523             0.25
    Harbinger             1,468,810             0.11
    Maverick              1,266,066             0.09
    Unclassified          2,019,151             0.15
Total                     73,940,212            55.30
Table S6. Summary of SSRs (simple sequence repeats) identified in quinoa

Motif            Counts   Average length (bp)  Average mismatches (bp)  Counts/Mbp
Mononucleotide   121,810  22.8                 0.4                      91.1
Dinucleotide     44,872   42.3                 1                        33.6
Trinucleotide    76,221   39.5                 1.4                      57
Tetranucleotide  46,847   20.2                 0.3                      35.1
Pentanucleotide  77,059   20.6                 0.4                      57.7
Hexanucleotide   25,955   26.6                 0.7                      19.4
Total            392,764  28.7                 0.7                      49
Table S7. A list of top SSRs identified in quinoa

Motif   Counts   Average length  Average mismatches  Counts per Mb
A       119,416  22.87           0.37                89.35
AAT     42,557   45.84           1.88                31.84
AT      27,048   53.00           1.33                20.24
AAAT    23,566   20.09           0.32                17.63
AAAAT   18,284   20.27           0.37                13.68
AACTG   14,320   28.06           1.00                10.71
AG      11,520   26.60           0.46                8.62
AATT    9,619    20.65           0.24                7.20
AAC     8,170    30.20           0.92                6.11
AAG     7,513    46.89           0.83                5.62
AAAAG   7,168    18.32           0.23                5.36
AAATT   6,609    18.12           0.11                4.94
AC      6,280    25.27           0.52                4.70
ATC     5,542    21.75           0.52                4.15
ACC     5,067    20.62           0.54                3.79
AAAAAT  4,298    23.58           0.45                3.22
AAAG    3,216    18.88           0.28                2.41
AGCCT   2,552    15.15           0.00                1.91
AGG     2,547    22.14           0.71                1.91
AGC     2,525    20.67           0.47                1.89
C       2,394    20.45           0.21                1.79
AATC    2,069    18.29           0.20                1.55
ACT     1,880    61.79           1.92                1.41
AATAAC  1,836    36.24           1.86                1.37
AAACC   1,810    17.35           0.12                1.35
AAAAC   1,760    19.16           0.27                1.32
ACGGC   1,660    16.48           0.03                1.24
AAAC    1,601    18.57           0.25                1.20
AAACT   1,568    20.86           0.32                1.17
AGAGC   1,370    21.20           0.51                1.03
AAAAAG  1,332    23.24           0.45                1.00
ATCAG   1,172    20.07           0.29                0.88
AAGAG   1,093    21.65           0.84                0.82
ATAC    1,050    28.92           0.63                0.79
AGAGG   1,034    19.67           0.27                0.77
AATCAT  1,018    25.20           0.58                0.76
ACCCC   1,003    19.17           0.22                0.75
Others  39,238
Table S8. Statistics of predicted quinoa gene models at each step of the annotation pipeline

                                   Average Length
Method              Gene number  Gene     CDS     Exon   Intron  Exon per Gene
Homology
A. hypochondriacus  75,952       2595.4   914.2   267.4  695.1   3.4
A. thaliana         68,880       2320.5   892.7   274.8  635.2   3.2
T. salsuginea       73,610       2191.7   842.4   272.9  646.7   3.1
S. oleracea         135,497      1944.7   817.0   288.9  616.8   2.8
B. vulgaris         160,859      1869.5   791.3   303.8  671.9   2.6
ab initio
Fgenesh             48,208       3955.5   1197.5  203.8  565.8   5.9
Augustus            63,690       2622.8   1045.8  218.9  417.5   4.8
Genescan            54,919       13792.9  1219.3  205.7  2551.4  5.9
GlimmerHMM          66,627       2280.4   879.1   226.7  487.1   3.9
SNAP                101,308      1296.2   636.7   203.6  310.0   3.1
mRNA-seq            51,172       3408.1   1104.1  231.2  534.1   4.6
GLEAN               54,438       3547.6   1096.3  232.0  560.7   4.8
Table S9. Statistics of noncoding RNAs (ncRNA) identified in the quinoa genome

Type        Copy Number  Average length (bp)  Total length (bp)  Percent genome
miRNA       192          192                  24,160             <0.01
tRNA        2934         2934                 215,855            0.02
rRNA
  18S       75           75                   40,787             <0.01
  28S       139          139                  16,186             <0.01
  5.8S      25           25                   3,720              <0.01
  5S        1071         1071                 108,263            0.01
snRNA       5922
  CD-box    5565         5565                 588,244            0.04
  HACA-box  85           85                   11,254             <0.01
  splicing  272          272                  39,465             <0.01
Total       10,358                            1,047,934          0.09
Table S10. Summary of quinoa genes with assigned (predicted) functions

                       Number  Percent (%)
Total predicted genes  54,459  100.0
Annotated
  InterPro             41,060  75.4
  GO                   46,264  85.0
  KEGG                 27,486  50.4
  Swissprot            35,708  65.6
  TrEMBL               47,862  87.9
  Total                52,058  95.6
Table S11. Summary of Cq genes identified in BUSCO v2 embryophyta gene set

                                     This study        (Jarvis et al. 2017)
Type                                 Number  Percent   Number  Percent
Complete BUSCOs (C)                  1344    93.3      1318    91.6
Complete and single-copy BUSCOs (S)  392     27.2      361     25.1
Complete and duplicated BUSCOs (D)   952     66.1      957     66.5
Fragmented BUSCOs (F)                23      1.6       25      1.7
Missing BUSCOs (M)                   73      5.1       97      6.7
Total BUSCO groups* searched         1440    ---       1440    ---
Table S12. Estimated accuracy of gene prediction

            mRNA (NCBI)               High-expression transcripts
            Sensitivity  Specificity  Sensitivity  Specificity
Gene        69.58%       78.39%       21.67%       31.43%
Transcript  42.25%       75.74%       21.67%       31.43%
Exon        71.92%       92.62%       69.74%       43.09%
Nucleotide  95.27%       98.44%       94.18%       55.27%
Table S13. Comparison of Cq_real_v1.0 with two published quinoa assemblies
Cells give Number / Size where both are reported.

                            Cq_real_v1.0 (This study)  Cqu_r1.0 (Yasui et al. 2016)  ASM168347v1 (Jarvis et al. 2017)
Assembly features
Total Scaffolds (≥100bp)    3,184 / 1,337 Mb           24,845 / 1,087 Mb             3,486 / 1,385 Mb
Scaffold N50                373 / 1,162 Kb             86 Kb                         105 / 3,847 Kb
Scaffold N90                1,087 / 423 Kb             ----                          439 / 250 Kb
Longest Scaffold            5,398 Kb                   641 Kb                        23,816 Kb
Average Scaffold length     420 Kb                     43 Kb                         397 Kb
GC content                  37.00%                     36.90%                        36.90%
Undetermined bases          0.87% / 11,704 Kb          2.61% / 28,385 Kb             4.56% / 63,176 Kb
Genome annotation
Total repetitive sequences  64.50% / 863 Mb            61.50% / 668 Mb               64.02% / 879 Mb
Gene models                 54,438 / 61.9 Mbp*         226,647 / 190.5 Mbp*          44,776 / 57.1 Mbp*
* The total length of coding sequences
Table S14. Enriched GO terms in quinoa-specific orthologous groups

GO ID  p Value  q Value  Term  Classification
GO:0010329  5.30E-10  2.80E-07  auxin efflux transmembrane transporter activity  MF
GO:0005515  8.86E-07  1.41E-04  protein binding  MF
GO:0003677  1.54E-06  2.22E-04  DNA binding  MF
GO:0003723  8.04E-06  9.12E-04  RNA binding  MF
GO:0017016  2.26E-05  1.91E-03  Ras GTPase binding  MF
GO:0016709  5.55E-05  3.54E-03  oxidoreductase activity, acting on paired donors, with incorporation or reduction of molecular oxygen, NAD(P)H as one donor, and incorporation of one atom of oxygen  MF
GO:0046872  6.24E-05  3.54E-03  metal ion binding  MF
GO:0004252  6.35E-05  3.54E-03  serine-type endopeptidase activity  MF
GO:0042626  6.97E-05  3.69E-03  ATPase activity, coupled to transmembrane movement of substances  MF
GO:0016740  1.28E-04  6.54E-03  transferase activity  MF
GO:0047134  1.77E-04  7.40E-03  protein-disulfide reductase activity  MF
GO:0001727  2.63E-04  9.72E-03  lipid kinase activity  MF

GO:0009840  5.59E-11  4.43E-08  chloroplastic endopeptidase Clp complex  CC
GO:0005739  4.28E-09  1.36E-06  mitochondrion  CC
GO:0009507  9.61E-08  2.18E-05  chloroplast  CC
GO:0009532  1.38E-07  2.73E-05  plastid stroma  CC
GO:0005634  4.74E-07  8.36E-05  nucleus  CC
GO:0005829  2.12E-06  2.81E-04  cytosol  CC
GO:0005737  6.04E-06  7.38E-04  cytoplasm  CC
GO:0005622  1.82E-05  1.81E-03  intracellular  CC
GO:0032045  1.77E-04  7.40E-03  guanyl-nucleotide exchange factor complex  CC
GO:0044599  2.63E-04  9.72E-03  AP-5 adaptor complex  CC

GO:0002238  2.64E-09  1.05E-06  response to molecule of fungal origin  BP
GO:0006351  1.01E-05  1.07E-03  transcription, DNA-templated  BP
GO:0044550  2.31E-05  1.91E-03  secondary metabolite biosynthetic process  BP
GO:0006355  2.40E-05  1.91E-03  regulation of transcription, DNA-templated  BP
GO:0090602  2.41E-05  1.91E-03  sieve element enucleation  BP
GO:0006624  2.52E-05  1.91E-03  vacuolar protein processing  BP
GO:0090603  3.24E-05  2.34E-03  sieve element differentiation  BP
GO:0035957  6.47E-05  3.54E-03  positive regulation of starch catabolic process by positive regulation of transcription from RNA polymerase II promoter  BP
GO:1900524  6.47E-05  3.54E-03  positive regulation of flocculation via cell wall protein-carbohydrate interaction by positive regulation of transcription from RNA polymerase II promoter  BP
GO:1900461  6.47E-05  3.54E-03  positive regulation of pseudohyphal growth by positive regulation of transcription from RNA polymerase II promoter  BP
GO:0036095  6.47E-05  3.54E-03  positive regulation of invasive growth in response to glucose limitation by positive regulation of transcription from RNA polymerase II promoter  BP
GO:0048235  1.35E-04  6.68E-03  pollen sperm cell differentiation  BP
GO:0006810  1.61E-04  7.40E-03  transport  BP
GO:0051966  1.77E-04  7.40E-03  regulation of synaptic transmission, glutamatergic  BP
GO:0060079  1.77E-04  7.40E-03  excitatory postsynaptic potential  BP
GO:1900452  1.77E-04  7.40E-03  regulation of long term synaptic depression  BP
GO:0010540  1.96E-04  7.97E-03  basipetal auxin transport  BP
GO:0010981  2.45E-04  9.72E-03  regulation of cell wall macromolecule metabolic process  BP
GO:0045892  2.63E-04  9.72E-03  negative regulation of transcription, DNA-templated  BP
Table S15. List of other genomes used in this study

Species             Version        Source
A. hypochondriacus  Phytozome 11   JGI (https://phytozome.jgi.doe.gov)
A. thaliana         Phytozome 11   JGI (https://phytozome.jgi.doe.gov)
F. esculentum       Version 1.0    BGDB (http://buckwheat.kazusa.or.jp)
T. salsuginea       Phytozome 11   JGI (https://phytozome.jgi.doe.gov)
S. oleracea         ASM200726v1    GenBank (https://www.ncbi.nlm.nih.gov)
G. max              Phytozome 11   JGI (https://phytozome.jgi.doe.gov)
Z. mays             Phytozome 11   JGI (https://phytozome.jgi.doe.gov)
O. sativa           Phytozome 11   JGI (https://phytozome.jgi.doe.gov)
Aegilops tauschii   Aet_MR_1.0     GenBank (https://www.ncbi.nlm.nih.gov)
B. vulgaris         RefBeet-1.2.2  GenBank (https://www.ncbi.nlm.nih.gov)