
Large Plant Genome Assemblies using Phusion2 Zemin Ning The Wellcome Trust Sanger Institute.

Date post: 16-Dec-2015
Page 1:

Large Plant Genome Assemblies using Phusion2

Zemin Ning
The Wellcome Trust Sanger Institute

Page 2:

Phusion2 Assembly Pipeline

NGS Data Assembly

Input data:
Pair End Reads: 170-800 bp
Mate Pair Reads: 2k-40k

Pipeline stages and tools:
Filtering: Unikalow
Clustering: Phusion2
Contig Generation (assemblers shown: SOAPdenovo, Fermi, ABySS)
Contig Merge
Scaffolding: Spinner
Consensus Bases: Smalt & Gap5

Page 3:

ftp://ftp.sanger.ac.uk/pub/badger/aw7/icas_v061.tar.bz2

iCAS – an Illumina Clone Assembly System

Page 4:

Unikalow: ftp://ftp.sanger.ac.uk/pub/zn1/unikalow/

Data filtering using Unikalow

Page 5:

Assembly Method

Sequencing reads:

1 A C C T G A T C
2 C T G A T C A A
3 T G A T C A A T
4 A G C G A T C A
5 C G A T C A A T
6 G A T C A A T G
7 T C A A T G T G
8 C A A T G T G A

Graph representations:

1. Overlap graph
2. de Bruijn graph
3. String graph
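As a rough illustration of the first two graph types, the sketch below computes suffix-prefix overlaps between the eight example reads and builds a de Bruijn graph from their k-mers. The function names, the choice of k = 4 and the minimum overlap of 3 are assumptions made for this example; none of them comes from the slides or from Phusion2 itself.

```python
from collections import defaultdict

# The eight example reads from the slide, written without spaces.
READS = [
    "ACCTGATC", "CTGATCAA", "TGATCAAT", "AGCGATCA",
    "CGATCAAT", "GATCAATG", "TCAATGTG", "CAATGTGA",
]


def overlap(a, b, min_len=3):
    """Longest suffix of a that is also a prefix of b (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)) - 1, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def overlap_graph(reads, min_len=3):
    """Edges (i, j, length) for every ordered read pair with a sufficient overlap."""
    edges = []
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                n = overlap(a, b, min_len)
                if n:
                    edges.append((i + 1, j + 1, n))  # 1-based read numbers
    return edges


def de_bruijn(reads, k=4):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


if __name__ == "__main__":
    for edge in overlap_graph(READS):
        print("overlap", edge)
    for node, successors in sorted(de_bruijn(READS).items()):
        print(node, "->", sorted(successors))
```

In this toy data set the reads containing CTGATC and those containing GCGATC converge on the shared sequence GATC, which appears in the de Bruijn graph as node GAT having two predecessors (TGA and CGA).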

Page 6:

Scaffold Merge and Contig Merge

[Figure: merge diagrams with tracks labelled Ref, Base, Sup and Ctg]

ftp://ftp.sanger.ac.uk/pub/users/zn1/merge/

Page 7:

Contig Consensus using Gap5

Page 8:
Page 9:

Can we really trust Single Molecule Sequencing?

PacBio
Capillary
Illumina

Page 10:

Clone Assemblies vs Assemblers: 5 BAC clones and 3 fosmids

Clone      Length   SOAP N50*  SOAP Sub|Ind  ABySS N50*  ABySS Sub|Ind  iCAS N50*  iCAS Sub|Ind   Uncov
bE217O4    186945   59863      11|10         109235      0|2            109235     0|2 (2)**      12
bT237K12   130462   13717      57|32         23386       8|4            47205      8|4 (19)**     626
bE352A13   153875   31247      41|23         93010       8|15           132592     8|14 (65)**    23
bE367M14   154288   105083     40|9          31405       1|1            107394     0|1 (20)**     1487
bE378K21   207850   173047     11|10         54240       23|5           187396     0|1 (10)**     741
fSS328I2   42036    42087      3|5           12628       1|0            42047      0|0            0
fSS404B14  32829    19543      0|3           29098       3|1            32832      0|0            0
fSY5K10    41286    41352      0|3           41296       0|0            41296      0|0            0

Clone coverage: 99.7%; Base quality: Q39
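The N50 columns summarise contiguity. The slides do not say what the asterisk on N50* denotes, so the sketch below assumes the common definition: the length of the contig at which the running total of contig lengths, taken from longest to shortest, first reaches half of the total assembled bases. The contig lengths in the example are hypothetical.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # reached at least half of all bases
            return length
    return 0


if __name__ == "__main__":
    # Hypothetical contig lengths for one assembled clone.
    contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
    print("N50:", n50(contigs))
```

Because N50 is driven by the longest sequences, a single merged contig spanning most of a clone raises it sharply, which is the effect visible in the iCAS column.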

Page 11:

Spinner – a scaffolding tool

Spinner uses mate pair data to scaffold contigs. Contigs (nodes) and pairs of contigs connected by mate pairs (edges) define a bi-directional graph.

Using the expected insert size, an estimate of the gap size can be given for each pair of linked contigs.

ftp://ftp.sanger.ac.uk/pub/users/zn1/spinner/
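A minimal sketch of how a gap estimate could follow from the expected insert size. This is not Spinner's actual method, which the slides do not describe; the function name, the coordinate convention and the use of a median are all assumptions. It supposes that read 1 maps forward on contig A starting at a_start, and read 2 maps in reverse on contig B with its far end at b_end bases from the start of B, so that insert = (len_a - a_start) + gap + b_end.

```python
from statistics import median


def gap_estimate(insert_size, len_a, links):
    """Estimate the gap between contigs A and B from mate-pair links.

    links is a list of (a_start, b_end) tuples, one per mate pair:
      a_start - start of read 1 on contig A (forward strand)
      b_end   - end of read 2 measured from the start of contig B
    Each pair gives gap = insert_size - (len_a - a_start) - b_end;
    the median damps the effect of outlier pairs.
    """
    estimates = [insert_size - (len_a - a_start) - b_end
                 for a_start, b_end in links]
    return median(estimates)


if __name__ == "__main__":
    # Hypothetical 3 kb library linking a 10 kb contig to its neighbour.
    links = [(9000, 1500), (8500, 1000), (9500, 2100)]
    print("estimated gap:", gap_estimate(3000, 10000, links))
```

A negative estimate under this convention would suggest that the two contigs overlap rather than being separated by a gap.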

Page 12:

Spinner – walks through a loop

These techniques alone produce useful results. Further stages will be used to resolve repeats, using pairs that “jump over” repeats and graph flow concepts.

Page 13:

Spinner vs SSPACE

                 Genome size   SSPACE N50   SSPACE Average   Spinner N50   Spinner Average
Assemblathon 1   119 Mb        608 Kb       86.8 Kb          11 Mb         450 Kb
Grass Carp (F)   900 Mb        2.3 Mb       14.4             5.85 Mb       17.1 Kb
Grass Carp (M)   1000 Mb       0.34 Mb      11.2 Kb          2.27 Mb       8.2 Kb
Bamboo           2.0 Gb        322 Kb       7404             488 Kb        7689
Parrot           1.23 Gb       906 Kb       4675             1.32 Mb       6969

Page 14:

Grass Phylogeny

Page 15:

Bamboo Genome: Size Estimation

Gs = (Kn – Ks)/D = 1.97 x 10^9

Kn = 80.5 x 10^9 – Total number of kmer words
Ks = 9.5 x 10^9 – Number of single copy kmer words
D = 36 – Depth of kmer occurrence
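The estimate can be reproduced directly from the three quoted values. The function name below is only illustrative; the formula and the numbers are those on the slide.

```python
def genome_size(kn, ks, depth):
    """Gs = (Kn - Ks) / D, as given on the slide."""
    return (kn - ks) / depth


# Bamboo values from the slide.
bamboo = genome_size(kn=80.5e9, ks=9.5e9, depth=36)

if __name__ == "__main__":
    print(f"Bamboo genome size estimate: {bamboo:.3e} bases")
```

Subtracting Ks removes the k-mer words seen only once before the remaining count is divided by the k-mer depth; the result, about 1.97 x 10^9, agrees with the 2.0 Gb genome size quoted for bamboo elsewhere in the slides.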

Page 16:

Bamboo Genome Assembly

Solexa reads:
Number of read pairs: 877 Million
Finished genome size: 2.0 Gb
Read length: 2x100bp
Estimated read coverage: ~90X
Insert size: 500/50-600 bp
Mate pair data: 3k, 5k, 7k, 8k, 10k, 20k
Number of reads clustered: 757 Million

Assembly features (stats):
                             Contigs     Scaffolds
Total number of contigs:     744,286     277,278
Total bases of contigs:      1.86 Gb     2.05 Gb
N50 contig size:             11,622      328,698
Largest contig:              188,163     4,869,017
Averaged contig size:        2,500       7,400
Contig coverage on genome:   ~90%        >95%
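The rounded figures above can be cross-checked against each other. The sketch below assumes that read coverage is total sequenced bases divided by genome size and that the averaged size is total bases divided by the number of sequences; the slide does not state how either value was derived.

```python
# Values quoted on the slide.
read_pairs = 877e6
read_length = 100          # 2x100bp, so two reads per pair
genome_size = 2.0e9

contig_count, contig_bases = 744_286, 1.86e9
scaffold_count, scaffold_bases = 277_278, 2.05e9

# Coverage = sequenced bases / genome size (assumed definition).
coverage = read_pairs * 2 * read_length / genome_size

# Average size = total bases / number of sequences (assumed definition).
avg_contig = contig_bases / contig_count
avg_scaffold = scaffold_bases / scaffold_count

if __name__ == "__main__":
    print(f"read coverage:    {coverage:.1f}X")
    print(f"average contig:   {avg_contig:.0f} bp")
    print(f"average scaffold: {avg_scaffold:.0f} bp")
```

Under these assumptions the coverage comes to roughly 88X and the averages to roughly 2,500 and 7,400 bases, consistent with the ~90X, 2,500 and 7,400 shown on the slide.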

Page 17:

Bamboo Genome Assembly QC using Finished BACs

                                            Assemblies by pure SOAPdenovo   Assemblies by SOAPdenovo & ABySS
Rate of single-base difference (# per Kb)   2.28                            0.43
Rate of insertion and deletion (# per Kb)   0.82                            0.19
Coverage by initial contigs                 0.76                            0.85
Coverage by supercontigs                    0.91                            0.94

Page 18:
Page 19:
Page 20:

Evolution of the Wheat Genome

Page 21:

Size of the Wheat Genome: 17Gb

Page 22:

International Wheat Genome Sequencing Consortium

Page 23:

Sequencing of D Genome: Libraries & Insert Sizes

Library           Insert size
WHEjyyDADDBAAPE   167
WHEjjzDADDCBAPE   199
WHEjjzDADDCCAPE   223
WHEjjzDADDCABPE   230
WHEjyyDAEDDAAPE   250
WHEjyyDAEDDABPE   250
WHEjyyDAEDDBAPE   250
WHEjyyDAEDDBBPE   250
WHEjyyDAEDDCAPE   250
WHEjyyDAEDDCBPE   250
WHEjyyDAEDDDAPE   250
WHEjjzDADDCACPE   254
WHEjyyDAEDIAAPE   500
WHEjyyDAEDIBAPE   500
WHEjyyDADDIAAPE   502
WHEjyyDADDIDAPE   510
WHEjyyDADDICAPE   527
WHEjyyDADDIBAPE   532
WHEjyyDADDIBBPE   551
WHEjyyDADDKAAPE   682
WHEjyyDADDMBAPE   706
WHEjyyDADDKCAPE   725
WHEjyyDADDMAAPE   764

WHEjyyDAADWAAPE   2000
WHEjyyDAADWBAPE   2000
WHEjyyDAADWCAPE   2000
WHEjyyDAADWDAPE   2000
WHEjyyDACDWAAPE   2002
WHEjyyDAEDWAAPE   2008
WHEjyyDACDWBBPE   2500
WHEjyyDAADLAAPE   5000
WHEjyyDAADLBAPE   5000
WHEjyyDAADLBBPE   5000
WHEjyyDAEDLAAPE   5004
WHEjjzDADLBBPE    8300
WHEjyyDAADTAAPE   10000
WHEjyyDABDTAAPE   10000
WHEjyyDADDTAAPE   10000
WHEjyyDADDTBBPE   10000
WHEjyyDAIDUAAPE   20000

Page 24:

D Genome: Size Estimation

Gs = (Kn – Ks)/D = 4.2 x 10^9

Kn = 59.8 x 10^9 – Total number of kmer words
Ks = 4.3 x 10^9 – Number of single copy kmer words
D = 13 – Depth of kmer occurrence
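The slides do not show how Kn, Ks and D are obtained from the reads. The sketch below is one plausible reading, with every choice an assumption: Kn is the total number of k-mer occurrences, Ks is the number of k-mers seen exactly once, and D is the most common multiplicity among the remaining k-mers. The toy reads and k = 4 are hypothetical, and a real data set of this size would need a dedicated k-mer counter rather than an in-memory dictionary.

```python
from collections import Counter


def kmer_stats(reads, k):
    """Return (Kn, Ks, D) under the assumptions stated above.

    Kn - total number of k-mer occurrences across all reads
    Ks - number of k-mers that occur exactly once
    D  - modal multiplicity among k-mers seen more than once (None if absent)
    """
    counts = Counter(read[i:i + k]
                     for read in reads
                     for i in range(len(read) - k + 1))
    kn = sum(counts.values())
    ks = sum(1 for c in counts.values() if c == 1)
    repeated = Counter(c for c in counts.values() if c > 1)
    depth = max(repeated, key=repeated.get) if repeated else None
    return kn, ks, depth


if __name__ == "__main__":
    # Hypothetical reads: one sequence sampled four times plus one stray read.
    reads = ["ACGTAC"] * 4 + ["GGGTTA"]
    kn, ks, depth = kmer_stats(reads, k=4)
    print("Kn, Ks, D:", kn, ks, depth)
    print("Gs estimate:", (kn - ks) / depth)
```

In the toy example the estimate recovers the three distinct k-mers of the repeatedly sampled sequence. For the D genome values on the slide, (59.8 - 4.3) / 13 evaluates to about 4.27, which the slide reports as 4.2 x 10^9.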

Page 25:

Wheat D Genome Assembly

Solexa reads:
Number of read pairs: 805 Million
Estimated genome size: 4.2 Gb
Read length: 45-95bp
Estimated read coverage: ~40X
Insert size: 167-800 bp
Mate pair data: 2k - 20k
Number of reads clustered: 558 Million

Assembly features (stats):
                             Contigs
Total number of contigs:     3,228,623
Total bases of contigs:      3.34 Gb
N50 contig size:             3,084
Largest contig:              86,064
Averaged contig size:        1,035
Contig coverage on genome:   ~80%

Page 26:

Grass carp (F & M)

55,277     130,221
0.88 Gb    0.97 Gb
40,353     18,252
5.89 Mb    2.27 Mb

Miscanthus
Wild rice

Page 27:

Acknowledgements:
Joe Henson
German Tischler
Andrew Whitwham

Chinese Academy of Agricultural Sciences
Jizeng Jia
Guangyue Zhao

National Gene Research Centre, Chinese Academy of Sciences
Han Bin
Hengyun Lu

