
Genome Assembly at JGI

Alicia Clum, Genomic Technologies Workshop, JGI User Meeting, March 22, 2016

Outline

• Overview
• Improving assemblies with long read technology
• Future improvements


Genome assembly review

Genomic DNA → fragmentation → library creation → sequencing → assemble reads

Overview of assembly at JGI

Program | Size (MB) | Libraries | Assembler | Target assemblies / year
Microbe | 5 | 1 | SPAdes / HGAP | 1,330
Fungi | 10's | 1 | ALLPATHS-LG / Falcon | 160
Plant | 100-10,000 | 3+ | Arachne / ALLPATHS-LG / Falcon | 20
Metagenome | 10-10,000 | 1 | MEGAHIT | 825

Challenges in genome assembly

• Repeat content
• Genome size
• GC content
• DNA quality and quantity
• Ploidy

[Figure: Fungal Repeat Content vs Genome Size (MB). x-axis: Genome Size (MB); y-axis: Repeat Content]

• 37 MB median genome size
• 9% median repeat content

Making assemblies better

Improving assemblies with long read technology

Microbial drafts: number of contigs by data type

[Figure: Number of contigs by data type. x-axis: Data Type; y-axis: Number of contigs]

• Illumina fragment: Median=43, N=1203
• PacBio 10kb: Median=2, N=216

Overview of Assembly at JGI

Program | Size (MB) | Libraries | Assembler | Target genomes / year
Microbe | 5 | 1 | SPAdes / HGAP | 1,330
Fungi | 10's | 1 | ALLPATHS-LG / Falcon | 160
Plant | 100-10,000 | 3+ | Arachne / ALLPATHS-LG / Falcon | 20
Metagenome | 10-10,000 | 1 | MEGAHIT | 825

Timeline - PacBio for fungal genomes

[Timeline figure with year axis 2012, 2013, 2014, 2015, 2016; events listed in the order they appear on the slide]

• Feb.: First Illumina/PacBio hybrid release (APLG)
• May: First PacBio only release (HBAR-DTK)
• July: Falcon development begins
• Summer: JGI Falcon testing begins, first good diploid assemblies
• July: daligner work begins
• Jan.: Falcon incorporates daligner
• Oct.: First Falcon assembly to annotation
• Summer: Validated switch to PacBio for fungal assemblies for FY 2016

Can a single PacBio library approach produce better fungal assemblies?

4 fungal genomes (~5 ug DNA each)

Genome | Genome Size (MB) | Repeat Content (%) | Ploidy
Clavicorona pyxidata | 43 | 14 | diploid
Byssothecium circinans | 48 | 15 | haploid
Clathrospora elynae | 45 | 47 | haploid
Lindgomyces ingoldianus | 66 | 20 | diploid

• Illumina: 1 fragment library and 1 4kb mate-pair library, assembled with ALLPATHS-LG
• PacBio: 10 kb AMPure library, assembled with Falcon

Image Credit: Laszlo Nagy, Manfred Binder, Pedro Crous, David Culley

PacBio assemblies have fewer contigs

[Figure: Number of Contigs. y-axis: Contigs (N), 0 to 2500; x-axis: Genome (Clavicorona pyxidata, Byssothecium circinans, Clathrospora elynae, Lindgomyces ingoldianus); series: PacBio, Illumina]

PacBio assemblies produce longer contigs

[Figure: Contig L50. y-axis: Contig L50 (kb), 0 to 800; x-axis: Genome (Clavicorona pyxidata, Byssothecium circinans, Clathrospora elynae, Lindgomyces ingoldianus); series: PacBio, Illumina]
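The Contig L50 axis above is reported in kb, so the metric is a length. A minimal Python sketch of how such a value could be computed follows, assuming L50 here means the contig length at which the running total, taken from longest to shortest, first reaches half of the assembled bases (many tools call this N50). The function name and the contig lengths are hypothetical and are not taken from the slides.

```python
def contig_l50(lengths):
    """Return the length of the contig at which the cumulative sum,
    taken from longest to shortest, first reaches half the assembly size.

    Assumption: this matches the 'Contig L50' convention on the slide
    (a length, often called N50 elsewhere).
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no contigs given")


# Hypothetical contig lengths in bp, for illustration only.
example = [800_000, 500_000, 300_000, 200_000, 100_000, 100_000]
print(contig_l50(example) / 1000, "kb")
```

Fewer, longer contigs push this value up, which is why it is usually read alongside the contig count.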

PacBio assemblies are larger

•  Larger assembled genome sizes reflect additional assembled repeat content

[Figure: Assembled Genome Size. y-axis: Assembled Size (MB), 0 to 80; x-axis: Genome (Clavicorona pyxidata, Byssothecium circinans, Clathrospora elynae, Lindgomyces ingoldianus); series: PacBio, Illumina]

PacBio assembles more repeat content

[Figure: Percent of Assembled Genome Repeat Masked. y-axis: Masked Sequence (%), 0 to 60; x-axis: Genome (Basme, Boled, Hesve, Lacbi, Lizem, Pirfi); series: PacBio, Illumina]

Median difference of 7% in the amount of sequence masked between Illumina and PacBio assemblies

Data courtesy of the fungal annotation team
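As a rough illustration of the "percent of assembled genome repeat masked" metric, the Python sketch below counts masked bases across a set of sequences. It assumes soft masking, where repeat-masked bases are written in lowercase; the slides do not describe how the fungal annotation team computes the figure, and the function name and sequences are hypothetical.

```python
def percent_masked(sequences):
    """Percent of bases that are soft-masked (lowercase) across sequences.

    Assumption: repeats are soft-masked as lowercase letters; this is a
    common convention, not a method confirmed by the slides.
    """
    total = sum(len(seq) for seq in sequences)
    if total == 0:
        raise ValueError("no sequence data given")
    masked = sum(1 for seq in sequences for base in seq if base.islower())
    return 100.0 * masked / total


# Hypothetical toy contigs, for illustration only.
contigs = ["ACGTacgtAC", "ggggTTTTTT"]
print(percent_masked(contigs))  # 8 masked bases out of 20
```

Running this on the Illumina and PacBio assemblies of the same genome and taking the difference gives the kind of per-genome comparison plotted above.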

PacBio only assembly now implemented for fungal assembly

[Workflow diagram comparing Illumina and PacBio]

Genomic DNA → random fragmentation → library creation → sequencing → assemble reads

• Illumina: short insert fragment (270bp); paired-end short insert reads (millions)
• PacBio: long fragment (10kb); long reads (~100,000)

Future improvements

Courtesy: Jason Chin


(Clavicorona pyxidata HHB10654)

Managed to phase >50% of the genome. JGI data with current Falcon is at < 25%.

Conclusions

•  Assembly pipelines vary by program and input data

•  Long read technology and assembly algorithm development have improved assembly results

•  Continued efforts for further improvements

Acknowledgments

JGI
• Alex Copeland
• Igor Grigoriev & Fungal Annotation Group
• Chris Daum & Sequencing Technologies Group
• Genome Assembly & QA/QC Groups

Pacific Biosciences
• Jason Chin
• Paul Peluso
• David Rank
• Kristi Spittle

Supplement


Long Reads Span Common Repetitive Elements


[Figure: Example for the input data: length distribution of the pre-assembled reads for assembly (Acc. > 99%), annotated with common repeat element lengths: transposons, 45S rDNAs, retrotransposons]

Methods for pre-assembly consensus: S. Koren, et al., Genome Biology 2013, 14:R101; C.-S. Chin, et al., Nature Methods 10, 563–569 (2013).

PacBio Read Length Distribution

>10kb AMPure Subread Lengths

L50 subread lengths range from 3.3 kb-6.5 kb

Evaluating Assemblers
