Large Plant Genome Assemblies using Phusion2
Zemin Ning
The Wellcome Trust Sanger Institute
Phusion2 Assembly Pipeline
NGS Data Assembly
Input data:
Pair End Reads: 170-800bp
Mate Pair Reads: 2k-40k
Stages:
Filtering: Unikalow
Clustering: Phusion2
Contig Generation: SOAPdenovo, Fermi, ABySS
Contig Merge
Scaffolding: Spinner
Consensus Bases: Smalt & Gap5
iCAS – an Illumina Clone Assembly System
ftp://ftp.sanger.ac.uk/pub/badger/aw7/icas_v061.tar.bz2
Data filtering using Unikalow
Unikalow: ftp://ftp.sanger.ac.uk/pub/zn1/unikalow/
Assembly Method
Sequencing reads:
1 A C C T G A T C
2 C T G A T C A A
3 T G A T C A A T
4 A G C G A T C A
5 C G A T C A A T
6 G A T C A A T G
7 T C A A T G T G
8 C A A T G T G A
1. Overlap graph
2. de Bruijn graph
3. String graph
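As a minimal sketch of the second method, the eight example reads can be turned into a de Bruijn graph. The value k = 4 and the names `de_bruijn` and `predecessors` are illustrative assumptions; they are not taken from the slides or from any of the assemblers named here.

```python
from collections import defaultdict

# The eight example reads listed on the slide.
READS = [
    "ACCTGATC", "CTGATCAA", "TGATCAAT", "AGCGATCA",
    "CGATCAAT", "GATCAATG", "TCAATGTG", "CAATGTGA",
]

def de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in a read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer is an edge from its prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def predecessors(graph, node):
    """Return every node that has an edge into `node`."""
    return {src for src, dsts in graph.items() if node in dsts}

graph = de_bruijn(READS, k=4)  # k = 4 is an arbitrary illustrative choice
for src in sorted(graph):
    print(src, "->", ", ".join(sorted(graph[src])))
```

With these reads the node GAT has two predecessors, TGA and CGA: this is where reads 1-3 and reads 4-5 converge onto the shared sequence GATCAAT.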
Scaffold Merge and Contig Merge
[Figure: tracks labelled Ref, Base, Sup and Ctg]
ftp://ftp.sanger.ac.uk/pub/users/zn1/merge/
Contig Consensus using Gap5
[Figure: reads labelled PacBio, Capillary and Illumina]
Can we really trust Single Molecule Sequencing?
Clone Assemblies vs Assemblers: 5 BAC clones and 3 fosmids
Clone      Length   SOAP              ABySS             iCAS
                    N50*     Sub|Ind  N50*     Sub|Ind  N50*     Sub|Ind        Uncov
bE217O4    186945   59863    11|10    109235   0|2      109235   0|2 (2)**      12
bT237K12   130462   13717    57|32    23386    8|4      47205    8|4 (19)**     626
bE352A13   153875   31247    41|23    93010    8|15     132592   8|14 (65)**    23
bE367M14   154288   105083   40|9     31405    1|1      107394   0|1 (20)**     1487
bE378K21   207850   173047   11|10    54240    23|5     187396   0|1 (10)**     741
fSS328I2   42036    42087    3|5      12628    1|0      42047    0|0            0
fSS404B14  32829    19543    0|3      29098    3|1      32832    0|0            0
fSY5K10    41286    41352    0|3      41296    0|0      41296    0|0            0
Clone coverage: 99.7%; Base quality: Q39
Spinner – a scaffolding tool
Spinner uses mate pair data to scaffold contigs. Contigs, together with the pairs of contigs connected by mate pairs, define a bi-directional graph.
Using the expected insert size, an estimate of the gap size can be given for each pair of linked contigs.
ftp://ftp.sanger.ac.uk/pub/users/zn1/spinner/
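The gap estimate from the expected insert size can be sketched as follows. This is a generic illustration and not Spinner's actual implementation: the function name `estimate_gap`, the `(tail_a, head_b)` link representation, the use of the median, and the example numbers are all assumptions.

```python
from statistics import median

def estimate_gap(links, insert_size):
    """Estimate the gap between two contigs joined by mate pairs.

    links: one (tail_a, head_b) tuple per mate pair, where tail_a is the
    number of bases from the outer end of the read on contig A to the end
    of A that faces the gap, and head_b is the same quantity for the mate
    on contig B. Whatever part of the insert is not accounted for by these
    two stretches must lie in the gap.
    """
    gaps = [insert_size - tail_a - head_b for tail_a, head_b in links]
    # The median damps the effect of a few mis-mapped pairs.
    return median(gaps)

# Hypothetical links from a library with a 3 kb expected insert size.
links = [(1200, 800), (1500, 450), (900, 1150)]
gap = estimate_gap(links, insert_size=3000)
print(gap)
```

A negative estimate would suggest that the two contigs overlap rather than being separated by a gap.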
Spinner – walks through a loop
These techniques alone produce useful results. Further stages will be used to resolve repeats, using pairs that “jump over” repeats and graph flow concepts.
Spinner vs SSPACE
                               SSPACE               SPINNER
Genome           Genome size   N50       Average    N50       Average
Assemblathon 1   119 Mb        608 Kb    86.8 Kb    11 Mb     450 Kb
Grass Carp (F)   900 Mb        2.3 Mb    14.4       5.85 Mb   17.1 Kb
Grass Carp (M)   1000 Mb       0.34 Mb   11.2 Kb    2.27 Mb   8.2 Kb
Bamboo           2.0 Gb        322 Kb    7404       488 Kb    7689
Parrot           1.23 Gb       906 Kb    4675       1.32 Mb   6969
Grass Phylogeny
Bamboo Genome: Size Estimation
$G_s = (K_n - K_s)/D = 1.97 \times 10^{9}$
$K_n = 80.5 \times 10^{9}$: total number of kmer words
$K_s = 9.5 \times 10^{9}$: number of single copy kmer words
$D = 36$: depth of kmer occurrence
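The size formula can be checked directly with the bamboo values from the slide. The function and argument names below are illustrative only.

```python
def genome_size(total_kmers, single_copy_kmers, depth):
    """Gs = (Kn - Ks) / D, as defined on the slide."""
    return (total_kmers - single_copy_kmers) / depth

# Bamboo values from the slide: Kn = 80.5e9, Ks = 9.5e9, D = 36.
gs = genome_size(80.5e9, 9.5e9, 36)
print(f"{gs / 1e9:.2f} Gb")
```

This reproduces the quoted estimate of about 1.97 x 10^9 bases.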
Bamboo Genome Assembly
Solexa reads:
Number of read pairs: 877 Million
Finished genome size: 2.0 Gb
Read length: 2x100bp
Estimated read coverage: ~90X
Insert size: 500/50-600 bp
Mate pair data: 3k, 5k, 7k, 8k, 10k, 20k
Number of reads clustered: 757 Million

Assembly features (stats):
                             Contigs     Scaffolds
Total number of contigs:     744,286     277,278
Total bases of contigs:      1.86 Gb     2.05 Gb
N50 contig size:             11,622      328,698
Largest contig:              188,163     4,869,017
Averaged contig size:        2,500       7,400
Contig coverage on genome:   ~90%        >95%
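Two of the figures on this slide follow from simple arithmetic, sketched below. The read coverage uses the slide's own numbers (877 million pairs, 2x100bp reads, 2.0 Gb genome). The N50 function follows the usual definition (the length at which the running total of contig lengths, sorted longest first, reaches half of the total bases), and the toy contig lengths are made up for illustration. Function names are assumptions.

```python
def read_coverage(read_pairs, read_length, genome_size):
    """Average depth: total sequenced bases divided by genome size."""
    return 2 * read_pairs * read_length / genome_size

def n50(lengths):
    """Length of the contig at which the cumulative sum of lengths,
    taken from longest to shortest, first reaches half the total."""
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    raise ValueError("no contig lengths given")

# Slide values: 877 million read pairs, 100bp reads, 2.0 Gb genome.
coverage = read_coverage(877e6, 100, 2.0e9)
print(f"{coverage:.1f}X")

# Made-up contig lengths, only to show how N50 is computed.
toy_n50 = n50([10, 8, 6, 4, 2])
print(toy_n50)
```

The computed depth of roughly 88X is consistent with the ~90X quoted on the slide.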
Bamboo Genome Assembly QC using Finished BACs
                                             Assemblies by      Assemblies by
                                             pure SOAPdenovo    SOAPdenovo & ABySS
Rate of single-base difference (# per Kb)    2.28               0.43
Rate of insertion and deletion (# per Kb)    0.82               0.19
Coverage by initial contigs                  0.76               0.85
Coverage by supercontigs                     0.91               0.94
Evolution of the Wheat Genome
Size of the Wheat Genome: 17Gb
International Wheat Genome Sequencing Consortium
Sequencing of D Genome: Libraries & Insert Sizes
Library            Insert size (bp)
WHEjyyDADDBAAPE    167
WHEjjzDADDCBAPE    199
WHEjjzDADDCCAPE    223
WHEjjzDADDCABPE    230
WHEjyyDAEDDAAPE    250
WHEjyyDAEDDABPE    250
WHEjyyDAEDDBAPE    250
WHEjyyDAEDDBBPE    250
WHEjyyDAEDDCAPE    250
WHEjyyDAEDDCBPE    250
WHEjyyDAEDDDAPE    250
WHEjjzDADDCACPE    254
WHEjyyDAEDIAAPE    500
WHEjyyDAEDIBAPE    500
WHEjyyDADDIAAPE    502
WHEjyyDADDIDAPE    510
WHEjyyDADDICAPE    527
WHEjyyDADDIBAPE    532
WHEjyyDADDIBBPE    551
WHEjyyDADDKAAPE    682
WHEjyyDADDMBAPE    706
WHEjyyDADDKCAPE    725
WHEjyyDADDMAAPE    764
WHEjyyDAADWAAPE    2000
WHEjyyDAADWBAPE    2000
WHEjyyDAADWCAPE    2000
WHEjyyDAADWDAPE    2000
WHEjyyDACDWAAPE    2002
WHEjyyDAEDWAAPE    2008
WHEjyyDACDWBBPE    2500
WHEjyyDAADLAAPE    5000
WHEjyyDAADLBAPE    5000
WHEjyyDAADLBBPE    5000
WHEjyyDAEDLAAPE    5004
WHEjjzDADLBBPE     8300
WHEjyyDAADTAAPE    10000
WHEjyyDABDTAAPE    10000
WHEjyyDADDTAAPE    10000
WHEjyyDADDTBBPE    10000
WHEjyyDAIDUAAPE    20000
D Genome: Size Estimation
$G_s = (K_n - K_s)/D = 4.2 \times 10^{9}$
$K_n = 59.8 \times 10^{9}$: total number of kmer words
$K_s = 4.3 \times 10^{9}$: number of single copy kmer words
$D = 13$: depth of kmer occurrence
Wheat D Genome Assembly
Solexa reads:
Number of read pairs: 805 Million
Estimated genome size: 4.2 Gb
Read length: 45-95bp
Estimated read coverage: ~40X
Insert size: 167-800 bp
Mate pair data: 2k - 20k
Number of reads clustered: 558 Million

Assembly features (stats), contigs:
Total number of contigs: 3,228,623
Total bases of contigs: 3.34 Gb
N50 contig size: 3,084
Largest contig: 86,064
Averaged contig size: 1,035
Contig coverage on genome: ~80%
55,277     130,221
0.88 Gb    0.97 Gb
40,353     18,252
5.89 Mb    2.27 Mb
Grass carp (F&M)
Miscanthus
Wild rice
Acknowledgements:
Joe Henson, German Tischler, Andrew Whitwham
Chinese Academy of Agricultural Sciences: Jizeng Jia, Guangyue Zhao
National Gene Research Centre, Chinese Academy of Sciences: Han Bin, Hengyun Lu