Date posted: 27-Aug-2014
Category: Software
Uploaded by: adrian-baez-ortega
Author
Adrián Báez Ortega
Supervisors
Marcos Colebrook Santamaría
José Luis Roda García
Date
17/07/2014
IonGAP
Contents
1. Introduction
2. Objective of the project
3. State of the art
4. The genome assembler
5. A genome assembly and analysis pipeline
6. IonGAP Web service
7. Parallel assembly of large genomes
8. Conclusions
IonGAP 1
[Word cloud: DNA, Genomics, Genome, Proteins, Genes, Double helix, Biomedicine, Life]
Introduction
Genome sequencing
Genome de novo assembly
Adapted from: http://en.wikipedia.org/wiki/Genomic_library#mediaviewer/File:Whole_genome_shotgun_sequencing_versus_Hierarchical_shotgun_sequencing.png
Genomics: Instituto Universitario de Enfermedades Tropicales y Salud Pública de Canarias
Computer Science: Escuela Técnica Superior de Ingeniería Informática
Bioinformatics
Objective of the project
To develop an easy-to-use, integrated software platform that offers
optimally configured processing and de novo assembly of genomic data
obtained by Ion Torrent sequencing, complemented with several stages
of result analysis.
Genome sequencing: genome fragments → FASTQ file
• Most sequencing technologies: paired-end short reads (two reads of 25-250 bp each, separated by a gap)
• IUETSPC’s sequencing technology: single-end long reads (200-400 bp)
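As a rough illustration of what the FASTQ file on this slide contains, the sketch below reads four-line FASTQ records and lists the read lengths. The function name `read_fastq` and the two sample records are hypothetical and used only for illustration. It assumes one sequence line per record, so real data may call for a dedicated parser.

```python
from io import StringIO


def read_fastq(handle):
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file (or a blank line) ends the stream
        sequence = handle.readline().strip()
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().strip()
        yield header[1:], sequence, quality


# Hypothetical sample data: two short single-end reads.
SAMPLE = "@read1\nACGTACGTAC\n+\nIIIIIIIIII\n@read2\nACGT\n+\nII5I\n"

records = list(read_fastq(StringIO(SAMPLE)))
lengths = [len(sequence) for _, sequence, _ in records]
print(lengths)
```

The same generator could be pointed at an open file handle instead of the in-memory sample.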
State of the art
Source: http://gcat.davidson.edu/phast/img/contig.png
Genome assembly
• Genome assembler
– Overlap-layout-consensus (OLC) assemblers
– De Bruijn graph (DBG) assemblers
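The two assembler families differ in the basic operation applied to the reads. The sketch below is a simplified illustration, not the code of any assembler named in these slides: `suffix_prefix_overlap` shows the read-to-read overlap test at the heart of OLC, and `de_bruijn_edges` shows how DBG methods break reads into k-mers. All function names and the toy reads are hypothetical.

```python
def suffix_prefix_overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 when the best overlap is shorter than min_len.
    """
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def kmers(read, k):
    """All substrings of length k, in order of appearance."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def de_bruijn_edges(reads, k):
    """Edges between (k-1)-mers, one edge per distinct k-mer in the reads."""
    edges = set()
    for read in reads:
        for kmer in kmers(read, k):
            edges.add((kmer[:-1], kmer[1:]))
    return edges


# OLC view: the two reads overlap by 4 bases ("GTAC").
print(suffix_prefix_overlap("ACGTAC", "GTACCA", min_len=3))

# DBG view: the read is reduced to its k-mers, losing long-range information.
print(sorted(de_bruijn_edges(["ACGT"], k=3)))
```

Because the DBG view reduces each read to k-mers, much of the information carried by a long read is discarded, which is one common explanation for why DBG assemblers fit short reads better.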
Adapted from: http://gcat.davidson.edu/phast
Source: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2874646
Data preprocessing
• Removing adapters
• Quality control
Genome finishing
• Scaffolding
• Correction of assembly errors
– Discrepancies with reads or reference genome
– Repeat correction
The genome assembler
Data preprocessing
Genome assembly
Genome finishing
Genome analysis
Data set: Streptococcus agalactiae (686,800 reads)
Source: http://ngm.nationalgeographic.com/wallpaper/img/2013/01/08-streptococcus_1600.jpg
Comparative study of assemblers
• OLC assemblers
– MIRA
– Celera Assembler
– SGA
• DBG assemblers
– ABySS
– Ray
– Velvet
– SparseAssembler
– Minia
Results
• Number of contigs ≥ 500 bp
• N50 length
Conclusions
• MIRA is the most suitable assembler
• DBG assemblers are not suitable for long-read assembly
N50: 50% of the genome is in contigs of length N50 or longer
Source: http://schatzlab.cshl.edu/teaching/2012/CSHL.Sequencing/Whole%20Genome%20Assembly%20and%20Alignment.pdf
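The two metrics used in the comparison can be computed directly from a list of contig lengths. The sketch below is a minimal illustration with made-up lengths, not the results of the study. The function names are hypothetical, and N50 is computed relative to the total assembly length rather than a known genome size.

```python
def n50(lengths):
    """Largest length L such that contigs of length >= L hold at least
    half of the total assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


def count_large_contigs(lengths, min_len=500):
    """Number of contigs at least min_len bases long."""
    return sum(1 for length in lengths if length >= min_len)


# Hypothetical contig lengths in bp (total 3,100 bp).
example = [100, 200, 300, 400, 500, 600, 1000]
print(n50(example))                  # 1000 + 600 = 1600 >= 1550, so N50 is 600
print(count_large_contigs(example))  # 500, 600 and 1000 qualify
```

A higher N50 together with a lower number of large contigs generally points to a less fragmented assembly.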
MIRA assembler
[Figure: MIRA workflow, with the stages Data preprocessing, Fast read comparison, Smith-Waterman alignment, Contig assembly, Automatic editing and Finished project]
Assembly parameter optimization
• Number of assembly iterations
• Uniform read distribution
• Separation of long repeats in different contigs
• Maximum number of times a contig can be rebuilt during an iteration
• Minimum number of reads per contig
• Minimum size for a contig to be considered "large"
• Minimum read length
• Minimum repeat length
• Minimum overlap length
• Minimum overlap score
Conclusion
By default, the assembler is already set to its optimal configuration
Minimum size for a contig to be considered "large"
A genome assembly and analysis pipeline
Data preprocessing
Genome assembly
Genome finishing
Genome analysis
aagttttggaaccattcgaaacagcacagctctaaaacttaccgattagaacatcatcta
aggtaatcgttttggaaccattcgaaacagcacagctctaaaactatcgctcaagcattc
gtatttgttttggagttttggaaccattcgaaacagcacagctctaaaacaacatttaac
tcataactatcatttagagtgttttggaaccattcgaaacagcacagctctaaaactaag
taacaagacagacttgaaactgttaagttttggaaccattcgaaacagcacagctctaaa
acttaccgattagaacatcatctaaggtaatcgttttggaaccattcgaaacagcacagc
tctaaaactatcgctcaagcattcgtatttgttttggagttttggaaccattcgaaacag
cacagctctaaaacatttccagtaagttcaaatttaacaaatgtgttttggaaccattcg
aaacagcacagctctaaaacagttttaacattaaatcacgtcttaaataagttttggaac
cattcgaaacagcacagctctaaaactaccgcaataagatcaccaatgttgtttgagttt
tggaaccattcgaaacagcacagctctaaaacgctattagtggaaacttttgaacgttat
gtgttttggaaccattcgaaacagcacagctctaaaacgaacaagatgtagatatgaaat
taacatttgttttggaaccattcgaaacagcacagctctaaaacctccaagtgctttaaa
gtcatttattttttgttttggaaccattcgaaacagcacagctctaaaacccatcatcaa
cctgaatgactccacatttcgttttggaaccattcgaaacagcacagctctaaaacgacc
cttatcaaacccaagcagaagtaactgttttggaaccattcgaaacagcacagctctaaa
acgatggtcgagcacttagaaaaccaataaaagttttggaaccattcgaaacagcacagc
tctaaaacgcttgtttcgctgtcgctcttgtttgacgggttttggaaccattcgaaacag
cacagctctaaaacaagcacaagaagcaactgttagaagacatagttttggaaccattcg
aaacagcacagctctaaaacacagctgaagagttagaaaaggctaatgttgttttggaac
cattcgaaacagcacagctctaaaacacatgacctgctgaacctgtccaccatatcgttt
tggaaccattcgaaacagcacagctctaaaactctgagatgagaacatatacttattctt
ttgttttggaaccattcgaaacagcacagctctaaaactctgagatgagaacatatactt
attcttttgttttggaaccattcgaaacagcacagctctaaaacctcgtagaaaattttc
ttttgagctttcgtaatcgcgccattcgtctcagcaggacttcagtttcgatgattcctt
gttattactgtgcttttactaatattataccatattttcgcctatcaagaaataatcctt
atcaataacatattgcggtaaatcatagagtcttctaggttctagaaagagtactgactt
ttgcattaaattgatgtattcacataattttataacttcatctttggtaagataagctcc
gctattaacaaaaaccaagagattctttttcgttaaataatggtaaacttgtataatttc
aaaacatttttcaaagatagtgtcgctctgtgtctcaattttgactcccagtgccttaat
gagttctaaaatcgtaatttcatcgtattctaaatcaagctcattctctagacactcaaa
tgcgataagttctgtaatagtagctgctaatttttctaccattgatttcacttctggctt
gene cas2
inference ab initio prediction:Prodigal:2.60
inference similar to AA sequence:UniProtKB:G3ECR3
locus_tag Sagalactiae_00003
product CRISPR-associated endoribonuclease Cas2
protein_id gnl|Prokka|Sagalactiae_00003
Contig name | Subject name | Score | % Identity
Sagalactiae_c8 | Streptococcus agalactiae 2603V/R strain 2603V/R 16S ribosomal RNA, complete sequence | 2846 | 100.00
Sagalactiae_c8 | Streptococcus agalactiae ATCC 13813 strain JCM 5671 16S ribosomal RNA, complete sequence | 2772 | 100.00
Sagalactiae_c10 | Streptococcus agalactiae 2603V/R strain 2603V/R 16S ribosomal RNA, complete sequence | 2846 | 100.00
Sagalactiae_c10 | Streptococcus agalactiae ATCC 13813 strain JCM 5671 16S ribosomal RNA, complete sequence | 2772 | 100.00
Data preprocessing
• Comparative study of trimmers (PRINSEQ, ERNE-filter, Trimmomatic)
– Removing adapters → 5’ trimming
– Discarding useless reads → Minimum length
– Removing low-quality regions → Maximum length, sliding window trimming
• Internal quality control of MIRA
– Sliding window trimming → Window length, quality threshold
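Sliding window trimming is governed by the two parameters named above, window length and quality threshold. The sketch below shows one common variant: scan from the 5’ end and cut the read at the first window whose mean quality drops below the threshold. It is an illustrative assumption, not the internal routine of MIRA or of any trimmer listed here. The function names, the default values and the Phred+33 encoding are also assumptions.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality]


def sliding_window_trim(seq, quality, window=4, threshold=20):
    """Cut the read at the start of the first window whose mean quality
    is below threshold; return the read unchanged if no window fails."""
    scores = phred_scores(quality)
    if len(scores) < window:
        return seq, quality  # too short to evaluate a full window; left as is
    for start in range(len(scores) - window + 1):
        mean_quality = sum(scores[start:start + window]) / window
        if mean_quality < threshold:
            return seq[:start], quality[:start]
    return seq, quality


# Hypothetical read: six high-quality bases ('I' = Q40), then four poor ones ('#' = Q2).
print(sliding_window_trim("ACGTACGTAC", "IIIIII####"))
```

A minimum-length filter, as in the "Discarding useless reads" step, would then be applied to whatever remains of each read.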
Mauve Assembly Metrics
Conclusion
Read preprocessing has negative effects on the assembly
• An extensive evaluation of read trimming effects on Illumina NGS data analysis
(Del Fabbro C, Scalabrin S, Morgante M, Giorgi FM. PLoS ONE 2013):
"For high quality values, trimmed datasets produce slightly more fragmented assemblies, probably due to a more stringent trimming that reflects also on lower computational needs."
• MIRA user manual (Chevreux B):
"For heavens' sake: do NOT try to clip or trim by quality yourself. Do NOT try to remove standard sequencing adaptors yourself. Just leave the data alone!"
Genome finishing
• Scaffolding
– Impossible: no mate-pair reads
• Correction of assembly errors
– Simplifier: selective elimination of redundant sequences
Simplifier
• Only eliminates completely redundant contigs
• Time-consuming
• Natural repeats in the genome → Risky
Conclusion
It is better to leave postprocessing in the user's hands
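To make the trade-offs of redundancy removal concrete, the sketch below drops every contig whose sequence, or its reverse complement, is fully contained in a longer contig that has been kept. It is a hypothetical illustration of the general idea, not Simplifier's actual algorithm. The function names and the toy contigs are made up.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def drop_contained_contigs(contigs):
    """Keep only contigs not fully contained, on either strand,
    in a longer (or equal-length, earlier) kept contig.

    contigs: dict mapping contig name to sequence.
    Returns a dict of kept contigs with upper-cased sequences.
    """
    kept = {}
    by_length = sorted(contigs.items(), key=lambda item: len(item[1]), reverse=True)
    for name, seq in by_length:
        forward = seq.upper()
        reverse = reverse_complement(forward)
        if any(forward in other or reverse in other for other in kept.values()):
            continue  # completely redundant: discard
        kept[name] = forward
    return kept


# Hypothetical contigs: c2 lies inside c1, c3 lies inside c1 on the other strand.
toy = {"c1": "ACGTTGCAAG", "c2": "GTTGC", "c3": "CTTGC", "c4": "GGGGG"}
print(sorted(drop_contained_contigs(toy)))
```

Each contig is compared against every kept contig, so the cost grows quickly with the number of contigs. A genuine repeat that happens to match part of a longer contig would also be discarded, which is the risk noted on the slide.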
Genome analysis
• Quality analysis of reads and contigs (FastQC)
• Taxonomic classification (BLAST)
• Genome annotation (Prokka)
If reference sequence provided:
• Genome alignment and coverage analysis (MUMmer, Circos, BLAST, Circoletto, Mauve, genoPlotR)
• Contig reordering (Mauve)
UGENE genome viewer
Generated by Circos, BLAST and Circoletto
Mauve genome viewer
Functioning and implementation
• Web user interface
• Input Web form
• Two independent modules (daemons)
– Assembly module
– Analysis module
• User notification via email
IonGAP Web service
Functioning and implementation
• Hosting: ETSII’s Computing Center
– Virtual machine (Ubuntu 12.04)
– Dual-core 64-bit processor
– 17 GB RAM
Web service demo
IonGAP | an integrated Genome Assembly Platform for Ion Torrent data
(http://193.145.101.223/)
Genome assembly with IonGAP
Trypanosoma cruzi
• Extremely repetitive genome
• Data explosion
• Data filtering: 900 MB = 1,500,000 reads
Parallel assembly of large genomes
Parallel genome assembly
• Parallel computing: Computer cluster
• Contrail
– Parallel assembly on Hadoop
• ETSII’s Computing Center
– Cluster of 108 computers
– Hadoop installation
Parallel assembly with Contrail
Conclusions
• Good performance
– Parallel computing is the future of assembly
• Bad results
– Contrail uses DBG → Not suitable for long reads
• IonGAP meets IUETSPC's need for an automated tool for the assembly and preliminary analysis of Ion Torrent data
• Making it available to the scientific community is intended to stimulate low-cost genome research and the development of other customized solutions
• The S. agalactiae genome has been successfully assembled, and a manuscript is being prepared for publication in a scientific journal
Conclusions
Future work
• New options and features
• Cloud assembly with Amazon Web Services
• Parallel OLC assembly on Hadoop
• High performance computing
– ITER’s Teide HPC – September 2014
Multidisciplinary work is the way to tackle the new science of the 21st century
Many thanks for your attention