1. Data production 2. General outline of assembly strategy

Date post: 17-Jan-2016
Upload: donna-stevens

1. Data production

2. General outline of assembly strategy

Original plan

► 454

► SOLiD

► WGP (Keygene): new sequence-based physical map

| Due date: July 15

Developments

► US to join 454 data production

► Spain to prepare and run a 4-5 kb mate-pair library

► Titanium throughput lower than spec (500 Mb/run):

| 350-400 Mb/run

► Effect of clonality/redundancy apparent in 454 data:

| ~11% in shotgun library

| ~13% in 3 kb library

| ~30% in 20 kb library

► Roche/454 offered to prepare additional paired-end libraries

► New recommendations for coverage given by Roche/454:

New recommendations Roche/454

► Libraries per genome size

| 3 kb: 1 library every 250 Mb of your genome

| 8 kb: 1 library every 100 Mb (or 250 Mb) of your genome

| 20 kb: 1 library every 100 Mb of your genome

► Sequencing per library

| 3 kb: 2 Titanium runs per library, 3X coverage

| 8 kb: 1 Titanium run per library, 2X coverage

| 20 kb: 0.5 Titanium runs per library, 1.5-2X coverage

| 15X shotgun reads
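
As a rough worked example of these recommendations, the sketch below turns the per-genome-size rules into library and run counts. The genome size of 950 Mb is an assumption (it is not stated on this slide; it is roughly what the later figure "5.9 Gb = 6.2X" implies), the 8 kb rule uses the 100 Mb option, and the function and variable names are illustrative only.

```python
import math

# Roche/454 recommendations from the slide:
# insert size -> (Mb of genome per library, Titanium runs per library)
RECOMMENDATIONS = {
    "3 kb": (250, 2.0),
    "8 kb": (100, 1.0),   # slide says "100 Mb (or 250 Mb)"; 100 Mb assumed here
    "20 kb": (100, 0.5),
}


def plan_libraries(genome_size_mb):
    """Return {insert size: (number of libraries, total Titanium runs)}."""
    plan = {}
    for insert, (mb_per_library, runs_per_library) in RECOMMENDATIONS.items():
        # Round up: a partial library still has to be prepared as a whole one
        n_libraries = math.ceil(genome_size_mb / mb_per_library)
        plan[insert] = (n_libraries, n_libraries * runs_per_library)
    return plan


GENOME_SIZE_MB = 950  # assumption, not given on the slide

for insert, (libraries, runs) in plan_libraries(GENOME_SIZE_MB).items():
    print(f"{insert}: {libraries} libraries, {runs} Titanium runs")
```

With the assumed 950 Mb this gives 4 libraries (8 runs) for 3 kb, 10 libraries (10 runs) for 8 kb and 10 libraries (5 runs) for 20 kb.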

Paired-end library production by Roche/454

► Q2 2009

| 3 kb libraries: 4

| 20 kb libraries: 4

► Q3 2009 (currently being produced; ready around the beginning of August)

| 8 kb libraries: 10 (US)

| 20 kb libraries: 6 (Italy & France)

| 40 kb libraries: 4 (US)

NL sequencing of Q2 2009 libraries

► shotgun libraries (home made):

| total: 19 runs

► 3 kb:

| lib1: 4.0 runs; lib2: 1.625 runs; lib3: 0.25 runs; lib4: 0.25 runs

| total: 6.125 runs

| libraries also shipped to Italy and France

► 20 kb:

| lib1: 5.75 runs; lib2: 1 run; lib3: 0.25 runs; lib4: 0.25 runs

| total: 7.25 runs

| libraries also shipped to Italy and France

Typical output 454 Ti run

Basecalling software bug

New basecaller version may yield ~25 additional bases per read!

Calculations (1)

low end specs

corrections for clonality/redundancy

Calculations (2)

NL sequencing of Q2 2009 libraries

► shotgun libraries (home made):

| total: 19 runs = 5.9 Gb = 6.2X

| recommended = 15X

► 3 kb:

| total: 6.125 runs = 1.7 Gb (nonredundant) * 50% = 0.9X paired ends

| recommended = 3X

► 20 kb:

| total: 7.25 runs = 1.5 Gb (nonredundant) * 50% = 0.8X paired ends

| recommended = 1.5-2X
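
The arithmetic behind these figures can be sketched as follows. This is a reconstruction under stated assumptions, not the presenters' own calculation: the genome size is back-calculated from "5.9 Gb = 6.2X", the shotgun yield uses the low-end 350 Mb/run with the ~11% redundancy correction, the paired-end nonredundant yields (1.7 and 1.5 Gb) are taken from the slide as given, and `additional_runs` is a hypothetical helper for the open question of how much more sequencing is needed.

```python
import math

# Figures from the slides
SHOTGUN_RUNS = 19
MB_PER_RUN = 350            # low-end throughput (350-400 Mb/run observed)
SHOTGUN_REDUNDANCY = 0.11   # ~11% clonality/redundancy in the shotgun library
PAIRED_FRACTION = 0.5       # the "* 50%" factor applied to paired-end libraries

# Assumption: genome size back-calculated from "5.9 Gb = 6.2X" (~0.95 Gb)
GENOME_GB = 5.9 / 6.2


def nonredundant_gb(runs, mb_per_run=MB_PER_RUN, redundancy=0.0):
    """Yield in Gb after discounting redundant reads."""
    return runs * mb_per_run * (1.0 - redundancy) / 1000.0


def coverage(gb, fraction=1.0, genome_gb=GENOME_GB):
    """Fold coverage for a given yield in Gb."""
    return gb * fraction / genome_gb


def additional_runs(target_x, runs_done, redundancy, genome_gb=GENOME_GB):
    """Whole runs still needed to reach target_x coverage (shotgun only)."""
    gb_per_run = nonredundant_gb(1, redundancy=redundancy)
    runs_needed = target_x * genome_gb / gb_per_run
    return max(0, math.ceil(runs_needed - runs_done))


shotgun_gb = nonredundant_gb(SHOTGUN_RUNS, redundancy=SHOTGUN_REDUNDANCY)
print(f"shotgun: {shotgun_gb:.1f} Gb = {coverage(shotgun_gb):.1f}X")
# Paired-end nonredundant yields are taken from the slide, not recomputed
print(f"3 kb:  {coverage(1.7, PAIRED_FRACTION):.1f}X paired ends")
print(f"20 kb: {coverage(1.5, PAIRED_FRACTION):.1f}X paired ends")
extra = additional_runs(15, SHOTGUN_RUNS, SHOTGUN_REDUNDANCY)
print(f"extra shotgun runs for 15X: {extra}")
```

Under these assumptions the sketch reproduces the slide values (5.9 Gb = 6.2X, 0.9X and 0.8X) and suggests roughly 27 further shotgun runs to reach the recommended 15X; a higher per-run yield, for instance from the new basecaller, would lower that number.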

To be calculated today!

► Who has to do how much additional sequencing from which libraries?

SOLiD data production

► NL / Applied Biosystems

SOLiD data production

► Applied Biosystems offered to prepare additional 10 kb mate-pair library

| currently running in Italy

► Spain produces 4-5 kb mate-pair library

► Discussion:

| do we need additional 7 kb mate-pair library, to be prepared by UK?

Additional data

► ~4 Million shotgun Sanger reads from Selected BAC Mixture (SBM-data, Kazusa)

| currently being copied to a hard disk, which will be shipped to the Netherlands this week

► 400,000 BAC ends (200,000 pairs)

► 200,000 fosmid ends (100,000 pairs)

| additional 200K reads will be produced (?)

► ~36% euchromatic sequence (70 Mb)

► WGP: sequence-based physical map

1. Data production

2. General outline of assembly strategy

Strategy overview

1. Create assembly-validation set

2. Filter raw data

3. De novo assembly of 454 & SBM data

4. Consolidate 454/SBM assemblies

5. Integrate SOLiD data into 454/SBM assembly

6. Scaffold using BAC and fosmid ends

7. Map scaffolds to physical map

Strategy overview

► Release of assembly to SOL Sequencing Consortium: November

| Annotation by iTAG

► Public release of data (under ENCODE guidelines) December 2009

Strategy in detail

1: Create assembly-validation set

► Input: Sanger BAC contigs from SGN

Output: selected high-quality subset of large Sanger BAC contigs

Discussion:

► We might be able to use the same pipeline for BAC selection as is being developed for potato (by Erwin Datema)

► Coordinator/specific tasks/division of labor: single location, single person: NL

► Deadline: August 1

2: Filter raw data (1)

► Input: raw sequence data

Output: clean reads, ready for assembly

► Discussion: Should the input data be filtered in advance? If so, what criteria should be used? Should all countries use the same filtering, or can everyone experiment with different settings and filters and contribute their best data set?

► Possible filter criteria: repeats, contamination (human, vectors, local sources of contamination, mitochondrion/chloroplast), duplicates (redundancy & clonality)

► How exactly will the high repeat content influence the assembly? Can we include repeats in the assembly from the start, or should we remove them to reduce complexity (and will this influence the final assembly quality)?
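
One of the filter criteria above, duplicate removal, can be illustrated with a minimal sketch. It is only one possible heuristic and not an agreed project filter: reads that share the same leading bases are treated as duplicates and only the first is kept. The function name and the default prefix length of 50 bases are arbitrary assumptions.

```python
def remove_duplicate_reads(reads, prefix_len=50):
    """Keep the first read seen for each distinct sequence prefix.

    reads: iterable of (name, sequence) tuples.
    prefix_len: number of leading bases compared (hypothetical default).
    """
    seen = set()
    kept = []
    for name, seq in reads:
        key = seq[:prefix_len].upper()  # case-insensitive comparison
        if key in seen:
            continue  # same start as an earlier read: treat as duplicate
        seen.add(key)
        kept.append((name, seq))
    return kept


example = [("r1", "ACGTACGT"), ("r2", "ACGTACGA"), ("r3", "acgtTTTT")]
print(remove_duplicate_reads(example, prefix_len=4))
```

A short prefix collapses more reads (and risks discarding genuine repeats), while a long prefix only removes near-identical reads, which is exactly the trade-off raised in the discussion points.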

2: Filter raw data (2)

► Coordinator/specific tasks/division of labor: single location (filtering for local sources of contamination probably has to be done locally, because not everyone may be willing or allowed to share 'local' sequences)

► Deadline: September 5

3: De novo assembly of 454 & SBM data (1)

► Input: (filtered) 454 and SBM reads

Output: 5-10 different assemblies

Discussion:

► Explore different assembly methods, parameter settings, etc. 

| Newbler, CABOG, other?

► Should these assemblies already be validated against the validation set or will this happen during the next step?

► Which criteria should an assembly meet, and how should the quality of the assemblies be assessed? Should we define these, e.g. statistics such as the number of contigs/scaffolds, N50 size, etc.?
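
The statistics mentioned in the last point are simple to compute from a list of contig or scaffold lengths. The sketch below uses the common definition of N50 (the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembled bases); the function names are illustrative and not part of any assembler.

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half the bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


def assembly_stats(lengths):
    """Basic summary statistics for comparing assemblies."""
    return {
        "contigs": len(lengths),
        "total_bases": sum(lengths),
        "largest": max(lengths, default=0),
        "n50": n50(lengths),
    }


print(assembly_stats([100, 200, 300, 400]))
```

These numbers describe contiguity only; they say nothing about correctness, which is why validation against the BAC contig set remains a separate question.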

3: De novo assembly of 454 & SBM data (2)

Discussion:

► How should unassembled reads be treated? These would include repetitive reads, singleton reads (and very small contigs?), erroneous reads, etc.

► Should all data (assembled or not) be available in the end for possible usage downstream?

► Do we want to do a de novo assembly of the SOLiD data? If so, should we assemble it standalone or in a hybrid fashion with 454 & SBM?

► Coordinator/specific tasks/division of labor: Assembly in one location or distributed over countries? In the latter case, how should the labor be divided? In our opinion, multiple people could contribute to this step.

► Deadline:

4: Consolidate multiple 454/SBM assemblies into a single best product (1)

► Input: 5-10 assembled data sets

Output: single best, validated assembly of 454 and SBM data

Discussion:

► Reconcile and merge various assemblies (from step 3) into a single best assembly

► The assembly must be validated against the validation set (from step 1): all BAC contigs must be present in the assembly.

► Compare and validate assemblies (e.g. amosvalidate) and assess error rates among different assemblies

4: Consolidate multiple 454/SBM assemblies into a single best product (2)

Discussion:

► What are the quality criteria? Which data makes it into the best assembly? How should conflicts between the assemblies be resolved?

► Can we already use the physical map for some quality assessment?

► Coordinator/specific tasks/division of labor: Consolidation should happen in a single location

► Deadline:

5: Add SOLiD data to 454/SBM assembly (1)

► Input: SOLiD reads and single best 454/SBM assembly (from step 4)

Output: single best 454/SBM/SOLiD assembly

Discussion:

► De novo assembly of SOLiD data?

► Use SOLiD reads to fix possible base errors and homopolymer-tract errors in the 454/SBM assembly.

► Gap filling and extension using unassembled SOLiD/454/SBM reads and read-pairs

5: Add SOLiD data to 454/SBM assembly (2)

Discussion:

► Coordinator/specific tasks/division of labor: De novo assembly can possibly be done by multiple people

► Consolidation and/or mapping (incl. gap filling) on 454/SBM assembly should happen at a single location

► Deadline:

6: Scaffold using BAC and fosmid ends

► Input: clone ends and single best 454/SBM/SOLiD assembly

Output: single best 454/SBM/SOLiD/clone-end assembly

Discussion:

► Strict selection of clone ends: keep only non-duplicated reads whose paired-end mate is also available

► Newbler can handle paired fosmid ends but not BAC ends (limit on spacing of paired ends)

► Coordinator/specific tasks/division of labor: Single location?

► Deadline:
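
The strict clone-end selection discussed above could look roughly like the sketch below. It is one possible reading of "non-duplicated reads that have a paired-end read", not an agreed procedure: a clone is kept only if it has exactly one forward and one reverse read, and neither sequence occurs anywhere else in the set. The input layout `(clone_id, end, sequence)` and the function name are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def select_clone_end_pairs(reads):
    """Return {clone_id: (forward_seq, reverse_seq)} for strictly selected clones.

    reads: iterable of (clone_id, end, sequence) with end "F" or "R".
    """
    reads = list(reads)
    # How often each sequence occurs across all clone ends
    seq_counts = Counter(seq.upper() for _, _, seq in reads)

    by_clone = defaultdict(dict)
    ambiguous = set()
    for clone_id, end, seq in reads:
        if end in by_clone[clone_id]:
            ambiguous.add(clone_id)  # same end sequenced more than once
        by_clone[clone_id][end] = seq

    pairs = {}
    for clone_id, ends in by_clone.items():
        if clone_id in ambiguous or set(ends) != {"F", "R"}:
            continue  # unpaired or ambiguous clone
        if any(seq_counts[seq.upper()] > 1 for seq in ends.values()):
            continue  # at least one end is a duplicated sequence
        pairs[clone_id] = (ends["F"], ends["R"])
    return pairs


example = [
    ("c1", "F", "AAAA"), ("c1", "R", "CCCC"),   # clean pair
    ("c2", "F", "GGGG"),                        # mate missing
    ("c3", "F", "TTTT"), ("c3", "R", "TTTT"),   # duplicated sequence
]
print(select_clone_end_pairs(example))
```

Exact sequence identity is a deliberately strict and simple duplicate test; a real pipeline would more likely compare alignments or clone coordinates.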

7: Map scaffolds to physical map

► Input: physical map and single best 454/SBM/SOLiD/CE assembly

Output: draft of tomato genome

Discussion: Should this be done incrementally with the mapping of the clone ends? How should contradictions between steps 6 and 7 be handled?

► Coordinator/specific tasks/division of labor: Coordinated by and executed in NL (Wageningen)

► Deadline:

To be settled today

► Time frame

| July - October 2009

| Timing of deliverables

► Practical issues:

| Division of labor

► Share all 454 data with assembly team from 454 Life Sciences (Jim Knight)?

Strategy overview

Task                                           Partners        Due date
1. Create assembly-validation set              NL              01.08
2. Filter raw data                             NL, Fr, It      05.09
3. De novo assembly of 454 & SBM data          NL, US, Fr, It
4. Consolidate 454/SBM assemblies              NL
5. Integrate SOLiD data into 454/SBM assembly  It, Sp
6. Scaffold using BAC and fosmid ends          NL
7. Map scaffolds to physical map               NL
