Supplementary information, Data S1 Supplemental methods

Genome assembly

Three intermediate assembly versions for the quinoa genome were generated

using Illumina reads (v0.1) and PacBio reads (v0.2 and v0.3) separately. Then the

three assemblies were merged together using the HABOT2 software (1gene Corp.,

Hangzhou, China; https://github.com/asarum/HABOT2) and a final round of

scaffolding and gap filling was performed using Illumina reads to obtain Cq assembly

v1.0. Detailed protocols are described below.

Filtered Illumina reads from two PCR-free libraries, with average insert sizes of

~380 bp and ~450 bp respectively, were used to construct contigs using the

DISCOVAR de novo software [1, 2]. Reads from the two medium-size mate pair

libraries (Supplementary Table 1) were then used for scaffolding with SSPACE 3.0

[3]. Gaps inside the constructed scaffolds were filled using the GapCloser v1.12 module

from SOAPdenovo2 [4]. The resulting assembly v0.1 has a scaffold N50 of 49.5 kb

and a contig N50 of 26.1 kb (Supplementary Table 3). The parameters used for each

program are listed below:

Discovar:

DiscovarDeNovo READS=./data/*fastq.gz OUT_DIR=Assembly

SSPACE:

perl SSPACE_Standard_v3.0.pl -l lib.cfg -s a.lines.fasta -T 8

GapCloser:

GapCloser -o closegap.fa -b lib.cfg -a closeGap.fa -t 16
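The scaffold and contig N50 values quoted for each assembly version follow the usual definition: the length L such that sequences of length L or longer together contain at least half of the assembled bases. The short Python sketch below illustrates that calculation only. The function name `nxx` and the toy lengths are assumptions made for the example and are not part of the original pipeline.

```python
def nxx(lengths, fraction=0.5):
    """Return (length, count) for an Nxx statistic; fraction=0.5 gives N50."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    target = sum(lengths) * fraction
    running = 0
    # Walk from the longest sequence down until the target is reached.
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= target:
            return length, count
    raise AssertionError("unreachable for 0 < fraction <= 1")


# Toy scaffold lengths (hypothetical, not from the quinoa assembly).
toy_lengths = [100, 80, 60, 40, 20]
n50_length, n50_count = nxx(toy_lengths)
```

The second value returned corresponds to the "Number" column reported next to each Nxx size in the assembly summary table.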

Falcon v0.3 was used to generate Cq assembly v0.2. The config file used to

run Falcon v0.3 was:

Cq.cfg:

input_fofn = input.fofn

input_type = raw

length_cutoff = 8000

length_cutoff_pr = 5000

pa_HPCdaligner_option = -v -dal128 -M32 -e.70 -l1000 -s1000

pa_DBsplit_option = -x500 -s400

falcon_sense_option = --output_multi --min_idt 0.70 --min_cov 4 --local_match_count_threshold 2 --max_n_read 200 --n_core 10

overlap_filtering_setting = --max_diff 500 --max_cov 500 --min_cov 2 --bestn 10 --n_core 6

We then used Canu to obtain Cq assembly v0.3, which has a contig N50 of ~51 kb and

an assembly size of ~1.43 Gb (Supplementary Table 3).

canu -p Cq_assem -d Cq_assem_34X genomeSize=1480m useGrid=remote -pacbio-raw ./Data/Cq.Pcabio.fa

Then HABOT2 was used to combine contigs over 1 kb in length from the

above three assemblies and to construct a new contig set. HABOT2 contains 4

modules:

a. Graph module. This module counts k-mer frequencies and extracts the unique k-mers

from Illumina reads. A unique k-mer is theoretically defined as a k-bp sequence

that occurs just once in a haploid genome and is identified following a Poisson model

as described previously [5]. Using unique k-mers, instead of all k-mers, for graph

construction minimizes the effects of error-prone repeats and increases computation

speed.

b. Alignment module. This module is used for an all-to-all alignment between PacBio

contigs and Illumina contigs. By using unique k-mers, it performs the alignment much

faster than BLASR [6] with high accuracy.

c. Duplication removal module. When the fraction of unique k-mers shared by two sequences

exceeds a cutoff (default 0.5), the shorter sequence is removed.

d. De novo module. This module calls the above three modules and performs hybrid

assembly.
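The duplication removal rule in module c can be illustrated with a small Python sketch. This is not HABOT2 code. It assumes that the shared fraction is measured against the unique k-mers of the shorter sequence, ignores reverse-complement matches, and uses hypothetical function names and toy sequences throughout.

```python
def kmers(seq, k):
    """All k-mers of a sequence as a set."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_fraction(seq_a, seq_b, unique_kmers, k):
    """Fraction of the shorter sequence's unique k-mers found in the longer one.

    Assumption: the cutoff is applied relative to the shorter sequence.
    """
    shorter, longer = sorted((seq_a, seq_b), key=len)
    short_set = kmers(shorter, k) & unique_kmers
    if not short_set:
        return 0.0
    return len(short_set & kmers(longer, k)) / len(short_set)


def remove_duplicates(seqs, unique_kmers, k, cutoff=0.5):
    """Visit sequences from longest to shortest and drop any sequence whose
    shared fraction with an already kept sequence exceeds the cutoff."""
    kept = []
    for seq in sorted(seqs, key=len, reverse=True):
        if all(shared_fraction(seq, other, unique_kmers, k) <= cutoff for other in kept):
            kept.append(seq)
    return kept


# Toy contigs; here every k-mer is treated as unique (an assumption).
contigs = ["ACGTACGGTCA", "ACGTACGG", "TTTTGGGCCC"]
unique_4mers = set().union(*(kmers(c, 4) for c in contigs))
non_redundant = remove_duplicates(contigs, unique_4mers, 4)
```

In the toy input the second contig is fully contained in the first, so only the first and third contigs are retained.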

For the Cq assembly, we extracted unique 17-mers from the PCR-free Illumina

libraries. Overlaps among contigs from different intermediate assembly versions were

identified using the alignment module. Then an OLC (overlap-layout-consensus)

graph was built for the overlapping contigs. The connection from contig A to contig B

is dropped when all of the following conditions hold:

(1) Contig A’s best connection is contig B

(2) Contig B’s best connection is contig C

(3) A does not overlap with C
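The three conditions above can be written as a small filter over best connections. The Python sketch below is an illustration under stated assumptions, not the HABOT2 implementation: `best` maps each contig to its best-connected partner, `overlaps` holds unordered overlapping pairs, and a mutual best pair (B's best connection is A itself) is assumed to be kept.

```python
def prune_connections(best, overlaps):
    """Return the (A, B) best connections that survive the rule.

    best:     dict mapping a contig to its best-connected contig
    overlaps: set of frozenset({x, y}) for every overlapping contig pair
    """
    kept = set()
    for a, b in best.items():
        c = best.get(b)  # B's best connection, if any
        drop = (
            c is not None
            and c != a                             # assumption: mutual best pairs are kept
            and frozenset((a, c)) not in overlaps  # A does not overlap with C
        )
        if not drop:
            kept.add((a, b))
    return kept


# Toy graph: A prefers B, B prefers C, and A does not overlap C.
best = {"A": "B", "B": "C", "C": "B"}
overlaps = {frozenset(("A", "B")), frozenset(("B", "C"))}
surviving = prune_connections(best, overlaps)
```

In this toy graph the A to B connection is removed, while the mutual B and C connections remain.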

After merging the overlapping contigs, duplicated regions in the contig set were

removed. Finally, scaffolding and gap closure were performed on the new contig set

using Illumina mate pair reads with SSPACE v3.0 and GapCloser v1.12 (both with

default parameters).

Next we used the PE250 reads from the two PCR-free libraries to correct

assembly errors. Filtered reads were aligned to the assembly using bwa mem with

default parameters. The GATK package [7] was then used to identify variants. The

order of commands and the corresponding parameters were as follows:

IndelRealigner

UnifiedGenotyper --read_filter BadCigar -glm BOTH -stand_call_conf 30.0 -stand_emit_conf 0

VariantFiltration --filterExpression "QD < 20.0 || ReadPosRankSum < -8.0 || FS > 10.0 || QUAL < 30" --filterName LowQualFilter

BaseRecalibrator

PrintReads

UnifiedGenotyper --read_filter BadCigar -glm BOTH -stand_call_conf 30.0 -stand_emit_conf 0

VariantFiltration --filterExpression "QD < 20.0 || ReadPosRankSum < -8.0 || FS > 10.0 || QUAL < 30" --filterName LowQualFilter

An in-house Perl script was then used to correct the high-confidence errors

identified. This process was repeated five times in total to generate assembly v1.0.
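The logic of the filter expression passed to VariantFiltration can be read as a simple predicate: a variant is tagged LowQualFilter if any one of the four conditions is true. The Python sketch below mirrors that logic on plain dictionaries. The record layout and the function name are assumptions for illustration (all four annotations are assumed to be present), and this is not the in-house correction script.

```python
def is_low_quality(record):
    """True if the variant would be tagged LowQualFilter by the expression
    QD < 20.0 || ReadPosRankSum < -8.0 || FS > 10.0 || QUAL < 30."""
    return (
        record["QD"] < 20.0
        or record["ReadPosRankSum"] < -8.0
        or record["FS"] > 10.0
        or record["QUAL"] < 30
    )


# Hypothetical variant records.
confident = {"QD": 25.0, "ReadPosRankSum": 0.3, "FS": 1.2, "QUAL": 60}
strand_biased = {"QD": 25.0, "ReadPosRankSum": 0.3, "FS": 12.5, "QUAL": 60}

flags = [is_low_quality(confident), is_low_quality(strand_biased)]
```

Only variants for which the predicate is false would be candidates for correction in a scheme of this kind.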

Specific HABOT2 commands are listed below:

a) Extract unique 17-mers from Illumina reads:

perl Graph.pl pipe -i fq.lst -m 2 -k 17 -s 1,3 -d Kmer_17

b) Config file for hybrid assembly using the de novo module:

# the input file list, in fasta format

[pb_lst] file.lst

# Data type:

# 1: the uncorrected data

# 2: the corrected data

[pb_type] 2

# filter read length

[filt_len] 1000

# genome size in megabases

[genome_size] 1500

# unique kmer, kmer size

[unique_kmer] kmer_17.bit 17

# Align parameters

# Col.1: kmer size

# Col.2: the scope for finding anchors; -1 means the full length

# Col.3: align mode

# 1: for uncorrected reads

# 2: for corrected reads

# Col.4: filter out scores below this value

[strategy] 17 -1 2 0.8

[strategy] 17 -1 1 0.9

[strategy] 35 1000 1 0.9

[strategy] 35 -1 1 0.9

# project name

[pro_name] CQ

# qsub parameters

[queue] dna.q,rna.q,reseq.q

[Project] og

[max_job] 50

[thread] 8

Then run the command:

perl Denovo.pl input.cfg &

Assessment of the assembly

Alignment of PE250 reads (from the two PCR-free libraries) to Cq_real_v1.0 was

performed using bwa mem with default parameters. Then samtools and bcftools were

used for SNP calling and summarization.

A 40-kb fosmid library for quinoa was constructed using the CopyControl Fosmid

Library Production Kit (Epicenter). Then 10 single colonies were picked and cultured

in 100 mL LB medium. The 10 fosmids were then extracted using a Plasmid Midi Kit

(Qiagen), mixed in equal quantities and used for 20-kb PacBio library preparation.

Library preparation and sequencing were performed at Tianjin Biochip Corporation.

Falcon v0.3.0 was used for the de novo assembly of fosmid sequences with default

parameters. After removing plasmid sequences, the contigs were aligned to

Cq_real_v1.0 using blastn with the parameter -task blastn-short. The results were

summarized manually.

Chloroplast Genome Annotation

Annotation of the chloroplast genome was performed separately using DOGMA [8]

and CpGAVAS [9] with the following parameters: BLAST E-value cutoff, 1e-10;

maximum target hit number, 10; and maximum length of tRNA intron and variable

region, 116 bp. Outputs from the two programs were then integrated by retaining the

longer open reading frame (ORF) with an in-house Perl script. The predicted

start/stop codons and the exon-intron boundaries for intron-containing genes were

manually examined and curated. The map of the chloroplast genome was generated

using GenomeVx [10] followed by manual adjustment.

Annotation of Repeats

Both homology-based and de novo approaches were used for repeat annotation. Three

complementary software programs, LTR_FINDER [11], PILER [12] and RepeatModeler

[13], were used with default parameters to generate a de novo repeat library for

quinoa. These programs use complementary methods to predict repeats.

LTR_FINDER retrieves full-length LTR retrotransposons; PILER searches for

repeats in the genome by aligning the genome sequence to itself; RepeatModeler uses

two complementary repeat prediction programs, RECON and RepeatScout, to identify

repeat element boundaries and family relationships. This de novo repeat library was

then used together with Repbase [14] for homology search of repeats using

RepeatMasker [15].

Gene Prediction and Annotation

Three independent approaches, namely homology search, ab initio prediction and

reference-guided transcriptome assembly, were used for gene prediction in a repeat-masked

genome. Evidence from the three approaches was then merged using

GLEAN to generate the final gene set.

Homology-based gene prediction. Putative open reading frames (ORFs) in the quinoa

genome were identified by aligning the protein sequences of A. thaliana,

Thellungiella salsuginea, Beta vulgaris, Spinacia oleracea, Amaranthus

hypochondriacus, and Fagopyrum esculentum (Supplementary Table 15) to

Cq_real_v1.0 using TBLASTN with an E-value cutoff of 1e-5. We next extracted

those ORF regions containing introns from the genome, including 2,000-bp

extensions at both ends, and again aligned protein sequences from other species to

these DNA fragments using GeneWise [16] with parameters: -trev -sum -genesf.

Ab initio gene prediction. AUGUSTUS v2.5.5 [17] and SNAP (version 2006-07-28)

[18] were used with default parameters for ab initio gene prediction, with gene

model parameters trained on the A. thaliana TAIR10 genome. Coding sequences

shorter than 150 bp were discarded.

Transcriptome-assisted gene prediction. We used TopHat [19] to map filtered mRNA-seq

reads to Cq_real_v1.0 to identify exonic regions and intron-exon boundaries with

the following parameters: -p 4 --max-intron-length 20000 -m 1 -r 20 --mate-std-dev 20.

Cufflinks [19] was then used to assemble the alignments into transcripts with the

parameters: -I 20000 -p 4.

Results derived from the above three approaches were integrated to generate a

consensus gene set using GLEAN with default parameters [20]. The probabilistic

confidence score generated by GLEAN was used to reflect the consistency among

different sources of evidence.

Assessment of gene models

The completeness of gene prediction was assessed by comparing the quinoa protein

sequences to the 1,440 embryophyta single-copy orthologs in BUSCO v2

(Benchmarking Universal Single-Copy Orthologs) [21] with a BLAST E-value cutoff

of 1e-5.

The predicted gene models were further assessed using Eval v2.2.8 [22]. Two sets of

“gold standard” genes were used: the 56 mRNA sequences of C. quinoa retrieved

from the NCBI nucleotide database and high-expression transcripts based on the

mRNA-seq data. For high-expression transcripts, Trinity was used for de novo

transcript assembly in the genome-guided mode [23] with combined mRNA-seq data

from 8 types of quinoa tissue, and transcripts with a sequencing coverage higher

than 100 were retained. Maker v2.31.9 was used to map the assembled transcripts to

Cq_real_v1.0 to generate the gene models in GFF format.

Functional annotation of genes

To assign gene functions, the predicted quinoa protein sequences were searched

against five protein/function databases: InterPro, GO, KEGG, Swiss-Prot and

TrEMBL. The InterPro database search was performed using InterProScan with

parameters: -f TSV -dp -goterms -iprlookup -pa. For the other four databases, BLAST searches

using the quinoa protein sequences as queries were performed with an E-value cutoff of

1e-05. Results from the five database searches were then concatenated.

For GO term enrichment analysis, Fisher's exact test was performed and the p-value

was adjusted for multiple testing using the BH method.
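For readers unfamiliar with the BH (Benjamini-Hochberg) adjustment, the following Python sketch shows how adjusted p-values (q-values) are obtained from a list of raw p-values. It is a generic illustration with hypothetical input values. The Fisher's exact test itself is not reproduced here, and the software actually used for the enrichment analysis is not stated in the text.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values (q-values) in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four GO terms.
raw_p = [0.01, 0.04, 0.03, 0.005]
q_values = benjamini_hochberg(raw_p)
```

Each q-value is the smallest false discovery rate at which the corresponding term would still be called significant.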

References

1. Weisenfeld NI, Yin S, Sharpe T, et al. Comprehensive variation discovery in single human genomes. Nat Genet 2014; 46:1350-1355.

2. Love RR, Weisenfeld NI, Jaffe DB, Besansky NJ, Neafsey DE. Evaluation of DISCOVAR de novo using a mosquito sample for cost-effective short-read genome assembly. BMC Genomics 2016; 17:187.

3. Boetzer M, Henkel CV, Jansen HJ, Butler D, Pirovano W. Scaffolding pre-assembled contigs using SSPACE. Bioinformatics 2011; 27:578-579.

4. Luo R, Liu B, Xie Y, et al. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. Gigascience 2012; 1:18.

5. You M, Yue Z, He W, et al. A heterozygous moth genome provides insights into herbivory and detoxification. Nat Genet 2013; 45:220-225.

6. Chaisson MJ, Tesler G. Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC Bioinformatics 2012; 13:238.

7. Van der Auwera GA, Carneiro MO, Hartl C, et al. From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Curr Protoc Bioinformatics 2013; 43:11.10.1-11.10.33.

8. Wyman SK, Jansen RK, Boore JL. Automatic annotation of organellar genomes with DOGMA. Bioinformatics 2004; 20:3252-3255.

9. Liu C, Shi L, Zhu Y, et al. CpGAVAS, an integrated web server for the annotation, visualization, analysis, and GenBank submission of completely sequenced chloroplast genome sequences. BMC Genomics 2012; 13:715.

10. Conant GC, Wolfe KH. GenomeVx: simple web-based creation of editable circular chromosome maps. Bioinformatics 2008; 24:861-862.

11. Xu Z, Wang H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Res 2007; 35:W265-268.

12. Edgar RC, Myers EW. PILER: identification and classification of genomic repeats. Bioinformatics 2005; 21 Suppl 1:i152-158.

13. Price AL, Jones NC, Pevzner PA. De novo identification of repeat families in large genomes. Bioinformatics 2005; 21 Suppl 1:i351-358.

14. Bao W, Kojima KK, Kohany O. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mob DNA 2015; 6:11.

15. Tarailo-Graovac M, Chen N. Using RepeatMasker to identify repetitive elements in genomic sequences. Curr Protoc Bioinformatics 2009; Chapter 4:Unit 4.10.

16. Birney E, Clamp M, Durbin R. GeneWise and Genomewise. Genome Res 2004; 14:988-995.

17. Stanke M, Keller O, Gunduz I, Hayes A, Waack S, Morgenstern B. AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Res 2006; 34:W435-439.

18. Korf I. Gene finding in novel genomes. BMC Bioinformatics 2004; 5:59.

19. Trapnell C, Williams BA, Pertea G, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol 2010; 28:511-515.

20. Elsik CG, Mackey AJ, Reese JT, Milshina NV, Roos DS, Weinstock GM. Creating a honey bee consensus gene set. Genome Biol 2007; 8:R13.

21. Simao FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 2015; 31:3210-3212.

22. Keibler E, Brent MR. Eval: a software package for analysis of genome annotations. BMC Bioinformatics 2003; 4:50.

23. Grabherr MG, Haas BJ, Yassour M, et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol 2011; 29:644-652.

Supplementary information, Figure S1 The distribution of 17-mer frequencies of

C. quinoa.

The frequency of 17-mers peaks at 67. The genome size is estimated using the

formula: Genome size = K-mer number/peak depth. A total of 344,467,594 reads from

the two PCR-free libraries were used, which resulted in 99,297,839,479 K-mers.

K-mer  K-mer Number    Peak Depth  Genome Size (nt)  Used Bases      Used Reads   Depth
17     99,297,839,479  67          1,482,057,305     86,461,366,094  344,467,594  58.4
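As a worked check of the formula Genome size = K-mer number/peak depth, the values given in the legend can be plugged in directly. The Python snippet below is only an arithmetic illustration; the variable names are chosen for the example.

```python
kmer_number = 99_297_839_479  # total number of 17-mers (from the legend)
peak_depth = 67               # peak of the 17-mer frequency distribution

# Integer division gives the estimated genome size in nucleotides.
genome_size = kmer_number // peak_depth
print(f"{genome_size:,}")     # prints 1,482,057,305
```

The result agrees with the genome size listed in the table, about 1.48 Gb.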

Supplementary information, Figure S2 The chloroplast genome of C. quinoa.

The bold inner circle indicates the 4 conserved regions of the chloroplast genome: two

inverted repeats (IRA and IRB) that separate a short single copy (SSC) region and a

long single copy (LSC) region. The inner dark grey bars show the coverage of the

chloroplast genome in non-overlapping 10-bp windows. Genes located on the (+) and

(-) strand are indicated on the outside and inside of the circle respectively.

Supplementary information, Figure S3 Examples of Cq_real_v1.0 scaffolds that are

anchored to a published genetic map of quinoa (Maughan et al. 2012).

Names of the SNP markers and their positions on different linkage groups (LG) are

indicated on the left and the corresponding scaffolds with the position (bp) of SNP

markers are indicated on the right.

Supplementary information, Figure S4 Assessment of gene prediction in

Cq_real_v1.0.

(A) Cumulative frequency of the probabilistic confidence score of predicted genes in

Cq_real_v1.0 given by GLEAN. (B) Pie charts showing the percentage of gene sets

that have conserved protein domains present or absent in the protein sequence.

Conserved domains were identified by BLAST against the Pfam database using an

E-value cutoff of 1E-4. (C) Distribution of RPKM values of all the annotated genes.

The mRNA-seq data from 5 different types of quinoa tissue were used.

Supplementary information, Figure S5 Scatterplot showing the RPKM values of

the 10,554 genes that are predicted in Cq_real_v1.0 but not in ASM168347v1.

The 8 different types of quinoa tissue used for mRNA-seq are indicated on the x-axis

and the log-transformed RPKM values are on the y-axis.

Supplementary information, Figure S6 Venn diagram showing the overlap of

orthologous groups among quinoa and 4 other plant species.

Supplementary information, Figure S7 Histogram of Ks (synonymous substitution)

values for paralogous quinoa gene pairs.

The peaks at Ks=0.12 and Ks=1.6 indicate a recent and an ancient genome

duplication respectively.

Supplementary information, Figure S8 Composition of 19 amino acid residues

(except Lys) in albumin proteins in five different crops.

AA, Aegilops tauschii; CC, Chenopodium quinoa; Gl, Glycine max; GR, Zea mays;

Os, Oryza sativa.

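Supplementary information, Figure S9 Composition of 19 amino acid residues

(except Lys) in globulin proteins in five different crops.

AA, Aegilops tauschii; CC, Chenopodium quinoa; Gl, Glycine max; GR, Zea mays;

Os, Oryza sativa.

Supplementary information, Figure S10 Composition of 19 amino acid residues

(except Lys) in LEA proteins in five different crops.

AA, Aegilops tauschii; CC, Chenopodium quinoa; Gl, Glycine max; GR, Zea mays;

Os, Oryza sativa.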

Supplementary information, Figure S11 Unrooted phylogenetic tree of NCED

genes.

The consensus tree was generated based on a multiple alignment of the CDS (coding

sequence) of NCED genes from quinoa and 7 other plant species as in Figure 3B. The

codon alignment was generated by Muscle in MEGA7.0 and the tree was constructed

using the neighbor-joining method with 1000 bootstrap replicates. Numbers at each

branching point indicate the local bootstrap values per 100 replicates. The origin

of genes can be distinguished based on the first few letters of the gene name (A.

hypochondriacus, AHYPO; A. thaliana, AT; S. oleracea, SOVF; B. vulgaris,

BVRB; V. vinifera, GSVIVG; S. lycopersicum, Solyc; S. tuberosum, PGSC; and C.

quinoa, CCG).

Supplementary information, Figure S12 Unrooted phylogenetic tree of PYL genes.

The consensus tree was generated based on a multiple alignment of the CDS (coding

sequence) of PYL genes from quinoa and 7 other plant species as in Figure 3B. The

codon alignment was generated by Muscle in MEGA7.0 and the tree was constructed

using the neighbor-joining method with 1000 bootstrap replicates. Numbers at each

branching point indicate the local bootstrap values per 100 replicates. The origin

of genes can be distinguished based on the first few letters of the gene name (A.

hypochondriacus, AHYPO; A. thaliana, AT; S. oleracea, SOVF; B. vulgaris,

BVRB; V. vinifera, GSVIVG; S. lycopersicum, Solyc; S. tuberosum, PGSC; and C.

quinoa, CCG).

Supplementary information, Figure S13 Matrix of Pearson’s correlation

coefficients based on RPKM values of all annotated genes.

R1, R2, R3 and R4 represent the biological replicates of each indicated tissue type.

The size of the dot and the depth of blue color positively correlate with the coefficient

on a scale from 0 to 1.

Supplementary information, Figure S14 Venn diagram showing the overlap of

DEGs (differentially expressed genes) among the four indicated comparisons.

Supplementary information, Figure S15 Heatmap showing the RPKM value of the

transporter genes for ions, monosaccharides and ABA in leaf without bladders and

bladder cells.

The category of genes and the gene ID are indicated to the left of the heatmap. The

legend indicates the correlation between colors and RPKM values.

Supplementary information, Figure S16 Unrooted phylogenetic tree of hemoglobin

genes.

The consensus tree was generated based on a multiple alignment of the CDS (coding

sequence) from quinoa and Arabidopsis thaliana. The codon alignment was generated

by Muscle in MEGA7.0 and the tree was constructed using the neighbor-joining

method with 1000 bootstrap replicates. Numbers at each branching point indicate

the local bootstrap values per 100 replicates.

Supplementary Table 1. Summary of sequencing data used for de novo assembly

Raw
Library No. (Illumina)  Insert Size  Amount of Data (Gb)  Read Length (bp)  Sequence Depth (x)  Physical Depth (x)
1                       380 bp       27.97                250_250           18.87               14.34
2                       380 bp       90.58                250_250           61.15               46.45
3                       5 kb         20.75                250_250           14.00               140.01
                                     24.62                125_125           16.61               332.25
4                       8 kb         38.24                125_125           25.80               825.70
                                     18.37                250_250           12.40               198.33
5 (PacBio)              20 kb        50.3                 7,521 (average)   33.94               33.94
Total                                213.97                                 182.77              1591.02

Filtered
Library No. (Illumina)  Insert Size  Amount of Data (Gb)  Read Length (bp)  Sequence Depth (x)  Physical Depth (x)
1                       380 bp       23.7                 250_250           16.01               9.37
2                       380 bp       62.76                250_250           42.41               31.44
3                       5 kb         2.99                 50_50             2.02                100.88
4                       8 kb         4.79                 50_50             3.23                258.57
5 (PacBio)              20 kb        50.3                 7,521 (average)   33.94               33.94
Total                                153.41                                 97.61               434.20

Table S2. Summary of different Cq assembly versions

v0.1                   Scaffold                  Contig
                       Size (bp)      Number     Size (bp)      Number
N90                    284            444,927    269            482,414
N80                    736            92,438     667            126,648
N70                    13,363         18,162     6,374          34,478
N60                    32,172         11,290     16,613         21,015
N50                    49,570         7,592      26,127         14,095
Longest                499,594        ----       281,357        ----
Total Size             1,490,886,271  ----       1,450,125,437  ----
Total Number (≥100bp)                 1,142,478  ----           1,165,224
Total Number (≥2kb)                   33,071     ----           52,781

v0.2                   Contig
                       Size (bp)      Number
N90                    14,775         14,393
N80                    23,994         10,451
N70                    33,267         7,801
N60                    43,470         5,827
N50                    53,964         4,281
Longest                365,916        ----
Total Size             750,878,283    ----
Total Number (≥100bp)  ----           25,024
Total Number (≥2kb)    ----           23,874

v0.3                   Scaffold                  Contig
                       Size (bp)      Number     Size (bp)      Number
N90                    9,609          37,102     9,606          37,107
N80                    13,389         24,304     13,388         24,306
N70                    22,513         15,803     22,509         15,804
N60                    36,585         10,814     36,582         10,815
N50                    51,182         7,493      51,182         7,493
Longest                586,601        ----       586,601        ----
Total Size             1,439,881,532  ----       1,439,878,528  ----
Total Number (≥100bp)                 59,594     ----           59,618
Total Number (≥2kb)                   58,870     ----           58,881

v1.0                   Scaffold                  Contig
                       Size (bp)      Number     Size (bp)      Number
N90                    422,582        1,087      75,678         5,001
N80                    629,431        833        125,667        3,657
N70                    810,545        647        172,908        2,761
N60                    985,512        498        220,272        2,082
N50                    1,161,829      373        268,320        1,536
Longest                5,397,643      ----       1,594,973      ----
Total Size             1,337,226,356  ----       1,325,521,993  ----
Total Number (≥100bp)                 3,184                     10,795
Total Number (≥2kb)                   2,827                     9,981

Table S3. Summary of gene space assessment using de novo assembled EST sequences

                                                               with >90% sequence in one scaffold  with >50% sequence in one scaffold
          Number   Total length (bp)  Covered by assembly (%)  Number   Percent (%)                Number   Percent (%)
All       234,311  264,398,310        98.938                   223,625  95.439                     234,149  99.931
>200bp    234,311  264,398,310        98.938                   223,625  95.439                     234,149  99.931
>500bp    151,776  237,088,802        98.903                   144,416  95.151                     151,672  99.931
>1000bp   98,860   198,932,903        98.970                   94,075   95.160                     98,798   99.937

Table S4. Summary of fosmid alignment with published quinoa assemblies

Fosmids            Cq_real_v1.0 (This study)                            ASM168347v1 (Jarvis et al. 2017)
Fosmid ID  length  Target scaffold  Target  match score  mismatch  gap  Target scaffold         Target  match score  mismatch  gap
fosmid1    31691   scaffold_0031    31856   31681        4         171  C_Quinoa_Scaffold_3525  31864   31668        9         187
fosmid2    31421   scaffold_0094    31877   31277        102       498  C_Quinoa_Scaffold_1319  31732   31260        123       349
fosmid3    33944   scaffold_0440    34642   33796        83        763  C_Quinoa_Scaffold_3901  34221   33656        111       454
fosmid4    33082   scaffold_0742    33267   33026        8         233  C_Quinoa_Scaffold_4480  33359   33022        30        307
fosmid5    38339   scaffold_0103    38545   38280        29        236  C_Quinoa_Scaffold_3144  38614   38249        62        303
fosmid6    30450   scaffold_0165    30604   30333        57        214  C_Quinoa_Scaffold_3820  30615   30125        90        400
fosmid7    33655   scaffold_0122    33858   33607        14        237  C_Quinoa_Scaffold_3298  33898   33605        22        271
fosmid8    36137   scaffold_0270    36334   35922        82        330  C_Quinoa_Scaffold_1465  36772   35800        171       801
fosmid9    30327   scaffold_0351    30527   30193        85        249  C_Quinoa_Scaffold_3388  28062   27735        84        243
fosmid10   32749   scaffold_0949    32848   32722        1         125  C_Quinoa_Scaffold_3107  33245   32664        27        554

Table S5. Summary of transposable elements (TEs) in Cq_real_v1.0

Superfamily of TEs      Coverage of TEs (bp)  Proportion of the assembly (%)
Class I
 LTR retrotransposons
  Gypsy                 448,883,024           33.58
  Copia                 156,347,966           11.69
  ERV                   1,312,713             0.10
  Caulimovirus          1,463,548             0.11
  Unclassified          756,480               0.06
 LINE
  R2                    1,206,051             0.09
  RTE                   1,692,267             0.13
  Jockey                742,469               0.06
  L1                    21,604,281            1.62
  CRE                   3,298,455             0.25
  Unclassified          1,234,992             0.09
 SINE                   15,186                0.00
Class II
  CMC                   48,617,566            3.64
  hAT                   15,146,731            1.13
  MuDR                  12,738,187            0.95
  TcMar                 8,147,037             0.61
  En                    7,900,599             0.59
  PIF                   3,363,523             0.25
  Harbinger             1,468,810             0.11
  Maverick              1,266,066             0.09
  Unclassified          2,019,151             0.15
Total                   73,940,212            55.30

Table S6. Summary of SSRs (simple sequence repeats) identified in quinoa

Motif            Counts   Average length (bp)  Average mismatches (bp)  Counts/Mbp
Mononucleotide   121,810  22.8                 0.4                      91.1
Dinucleotide     44,872   42.3                 1                        33.6
Trinucleotide    76,221   39.5                 1.4                      57
Tetranucleotide  46,847   20.2                 0.3                      35.1
Pentanucleotide  77,059   20.6                 0.4                      57.7
Hexanucleotide   25,955   26.6                 0.7                      19.4
Total            392,764  28.7                 0.7                      49

Table S7. A list of top SSRs identified in quinoa

Motif   Counts   Average length  Average mismatches  Counts per Mb
A       119,416  22.87           0.37                89.35
AAT     42,557   45.84           1.88                31.84
AT      27,048   53.00           1.33                20.24
AAAT    23,566   20.09           0.32                17.63
AAAAT   18,284   20.27           0.37                13.68
AACTG   14,320   28.06           1.00                10.71
AG      11,520   26.60           0.46                8.62
AATT    9,619    20.65           0.24                7.20
AAC     8,170    30.20           0.92                6.11
AAG     7,513    46.89           0.83                5.62
AAAAG   7,168    18.32           0.23                5.36
AAATT   6,609    18.12           0.11                4.94
AC      6,280    25.27           0.52                4.70
ATC     5,542    21.75           0.52                4.15
ACC     5,067    20.62           0.54                3.79
AAAAAT  4,298    23.58           0.45                3.22
AAAG    3,216    18.88           0.28                2.41
AGCCT   2,552    15.15           0.00                1.91
AGG     2,547    22.14           0.71                1.91
AGC     2,525    20.67           0.47                1.89
C       2,394    20.45           0.21                1.79
AATC    2,069    18.29           0.20                1.55
ACT     1,880    61.79           1.92                1.41
AATAAC  1,836    36.24           1.86                1.37
AAACC   1,810    17.35           0.12                1.35
AAAAC   1,760    19.16           0.27                1.32
ACGGC   1,660    16.48           0.03                1.24
AAAC    1,601    18.57           0.25                1.20
AAACT   1,568    20.86           0.32                1.17
AGAGC   1,370    21.20           0.51                1.03
AAAAAG  1,332    23.24           0.45                1.00
ATCAG   1,172    20.07           0.29                0.88
AAGAG   1,093    21.65           0.84                0.82
ATAC    1,050    28.92           0.63                0.79
AGAGG   1,034    19.67           0.27                0.77
AATCAT  1,018    25.20           0.58                0.76
ACCCC   1,003    19.17           0.22                0.75
Others  39,238

Table S8. Statistics of predicted quinoa gene models at each step of the annotation pipeline

                                     Average Length
Method              Gene number  Gene     CDS     Exon   Intron  Exon per Gene
Homology
 A. hypochondriacus 75,952       2595.4   914.2   267.4  695.1   3.4
 A. thaliana        68,880       2320.5   892.7   274.8  635.2   3.2
 T. salsuginea      73,610       2191.7   842.4   272.9  646.7   3.1
 S. oleracea        135,497      1944.7   817.0   288.9  616.8   2.8
 B. vulgaris        160,859      1869.5   791.3   303.8  671.9   2.6
ab initio
 Fgenesh            48,208       3955.5   1197.5  203.8  565.8   5.9
 Augustus           63,690       2622.8   1045.8  218.9  417.5   4.8
 Genescan           54,919       13792.9  1219.3  205.7  2551.4  5.9
 GlimmerHMM         66,627       2280.4   879.1   226.7  487.1   3.9
 SNAP               101,308      1296.2   636.7   203.6  310.0   3.1
mRNA-seq            51,172       3408.1   1104.1  231.2  534.1   4.6
GLEAN               54,438       3547.6   1096.3  232.0  560.7   4.8

Table S9. Statistics of noncoding RNAs (ncRNA) identified in the quinoa genome

Type        Copy Number  Average length (bp)  Total length (bp)  Percent genome
miRNA       192          192                  24,160             <0.01
tRNA        2934         2934                 215,855            0.02
rRNA
 18S        75           75                   40,787             <0.01
 28S        139          139                  16,186             <0.01
 5.8S       25           25                   3,720              <0.01
 5S         1071         1071                 108,263            0.01
snRNA       5922
 CD-box     5565         5565                 588,244            0.04
 HACA-box   85           85                   11,254             <0.01
 splicing   272          272                  39,465             <0.01
Total       10,358                            1,047,934          0.09

Table S10. Summary of quinoa genes with assigned (predicted) functions

                       Number  Percent (%)
Total predicted genes  54,459  100.0
Annotated
 InterPro              41,060  75.4
 GO                    46,264  85.0
 KEGG                  27,486  50.4
 Swissprot             35,708  65.6
 TrEMBL                47,862  87.9
 Total                 52,058  95.6

Table S11. Summary of Cq genes identified in BUSCO v2 embryophyta gene set

                                     This study       (Jarvis et al. 2017)
Type                                 Number  Percent  Number  Percent
Complete BUSCOs (C)                  1344    93.3     1318    91.6
Complete and single-copy BUSCOs (S)  392     27.2     361     25.1
Complete and duplicated BUSCOs (D)   952     66.1     957     66.5
Fragmented BUSCOs (F)                23      1.6      25      1.7
Missing BUSCOs (M)                   73      5.1      97      6.7
Total BUSCO groups* searched         1440    ---      1440    ---

Table S12. Estimated accuracy of gene prediction

            mRNA (NCBI)               High-expression transcripts
            Sensitivity  Specificity  Sensitivity  Specificity
Gene        69.58%       78.39%       21.67%       31.43%
Transcript  42.25%       75.74%       21.67%       31.43%
Exon        71.92%       92.62%       69.74%       43.09%
Nucleotide  95.27%       98.44%       94.18%       55.27%

Table S13. Comparison of Cq_real_v1.0 with two published quinoa assemblies

                            Cq_real_v1.0 (This study)  Cqu_r1.0 (Yasui et al. 2016)  ASM168347v1 (Jarvis et al. 2017)
                            Number   Size              Number   Size                 Number   Size
Assembly features
 Total Scaffolds (≥100bp)   3,184    1,337 Mb          24,845   1,087 Mb             3,486    1,385 Mb
 Scaffold N50               373      1,162 Kb                   86 Kb                105      3,847 Kb
 Scaffold N90               1,087    423 Kb                                          439      250 Kb
 Longest Scaffold                    5,398 Kb                   641 Kb                        23,816 Kb
 Average Scaffold length             420 Kb                     43 Kb                         397 Kb
 GC content                          37.00%                     36.90%                        36.90%
 Undetermined bases         0.87%    11,704 Kb         2.61%    28,385 Kb            4.56%    63,176 Kb
Genome annotation
 Total repetitive sequences 64.50%   863 Mb            61.50%   668 Mb               64.02%   879 Mb
 Gene models                54,438   61.9 Mbp*         226,647  190.5 Mbp*           44,776   57.1 Mbp*

* The total length of coding sequences

Table S14. Enriched GO terms in quinoa-specific orthologous groups

GO ID  p Value  q Value  Term  Classification
GO:0010329  5.30E-10  2.80E-07  auxin efflux transmembrane transporter activity  MF
GO:0005515  8.86E-07  1.41E-04  protein binding  MF
GO:0003677  1.54E-06  2.22E-04  DNA binding  MF
GO:0003723  8.04E-06  9.12E-04  RNA binding  MF
GO:0017016  2.26E-05  1.91E-03  Ras GTPase binding  MF
GO:0016709  5.55E-05  3.54E-03  oxidoreductase activity, acting on paired donors, with incorporation or reduction of molecular oxygen, NAD(P)H as one donor, and incorporation of one atom of oxygen  MF
GO:0046872  6.24E-05  3.54E-03  metal ion binding  MF
GO:0004252  6.35E-05  3.54E-03  serine-type endopeptidase activity  MF
GO:0042626  6.97E-05  3.69E-03  ATPase activity, coupled to transmembrane movement of substances  MF
GO:0016740  1.28E-04  6.54E-03  transferase activity  MF
GO:0047134  1.77E-04  7.40E-03  protein-disulfide reductase activity  MF
GO:0001727  2.63E-04  9.72E-03  lipid kinase activity  MF

GO:0009840  5.59E-11  4.43E-08  chloroplastic endopeptidase Clp complex  CC
GO:0005739  4.28E-09  1.36E-06  mitochondrion  CC
GO:0009507  9.61E-08  2.18E-05  chloroplast  CC
GO:0009532  1.38E-07  2.73E-05  plastid stroma  CC
GO:0005634  4.74E-07  8.36E-05  nucleus  CC
GO:0005829  2.12E-06  2.81E-04  cytosol  CC
GO:0005737  6.04E-06  7.38E-04  cytoplasm  CC
GO:0005622  1.82E-05  1.81E-03  intracellular  CC
GO:0032045  1.77E-04  7.40E-03  guanyl-nucleotide exchange factor complex  CC
GO:0044599  2.63E-04  9.72E-03  AP-5 adaptor complex  CC

GO:0002238  2.64E-09  1.05E-06  response to molecule of fungal origin  BP
GO:0006351  1.01E-05  1.07E-03  transcription, DNA-templated  BP
GO:0044550  2.31E-05  1.91E-03  secondary metabolite biosynthetic process  BP
GO:0006355  2.40E-05  1.91E-03  regulation of transcription, DNA-templated  BP
GO:0090602  2.41E-05  1.91E-03  sieve element enucleation  BP
GO:0006624  2.52E-05  1.91E-03  vacuolar protein processing  BP
GO:0090603  3.24E-05  2.34E-03  sieve element differentiation  BP
GO:0035957  6.47E-05  3.54E-03  positive regulation of starch catabolic process by positive regulation of transcription from RNA polymerase II promoter  BP
GO:1900524  6.47E-05  3.54E-03  positive regulation of flocculation via cell wall protein-carbohydrate interaction by positive regulation of transcription from RNA polymerase II promoter  BP
GO:1900461  6.47E-05  3.54E-03  positive regulation of pseudohyphal growth by positive regulation of transcription from RNA polymerase II promoter  BP
GO:0036095  6.47E-05  3.54E-03  positive regulation of invasive growth in response to glucose limitation by positive regulation of transcription from RNA polymerase II promoter  BP
GO:0048235  1.35E-04  6.68E-03  pollen sperm cell differentiation  BP
GO:0006810  1.61E-04  7.40E-03  transport  BP
GO:0051966  1.77E-04  7.40E-03  regulation of synaptic transmission, glutamatergic  BP
GO:0060079  1.77E-04  7.40E-03  excitatory postsynaptic potential  BP
GO:1900452  1.77E-04  7.40E-03  regulation of long term synaptic depression  BP
GO:0010540  1.96E-04  7.97E-03  basipetal auxin transport  BP
GO:0010981  2.45E-04  9.72E-03  regulation of cell wall macromolecule metabolic process  BP
GO:0045892  2.63E-04  9.72E-03  negative regulation of transcription, DNA-templated  BP

Table S15. List of other genomes used in this study

Species             Version        Source
A. hypochondriacus  Phytozome 11   JGI (https://phytozome.jgi.doe.gov)
A. thaliana         Phytozome 11   JGI (https://phytozome.jgi.doe.gov)
F. esculentum       Version 1.0    BGDB (http://buckwheat.kazusa.or.jp)
T. salsuginea       Phytozome 11   JGI (https://phytozome.jgi.doe.gov)
S. oleracea         ASM200726v1    GenBank (https://www.ncbi.nlm.nih.gov)
G. max              Phytozome 11   JGI (https://phytozome.jgi.doe.gov)
Z. mays             Phytozome 11   JGI (https://phytozome.jgi.doe.gov)
O. sativa           Phytozome 11   JGI (https://phytozome.jgi.doe.gov)
Aegilops tauschii   Aet_MR_1.0     GenBank (https://www.ncbi.nlm.nih.gov)
B. vulgaris         RefBeet-1.2.2  GenBank (https://www.ncbi.nlm.nih.gov)
