Post on 11-May-2015
GATC Biotech confidential VII © 2007-2011
Eagle Genomics Symposium
"Provisioning bioinformatics - are we prepared?"
Ulrike Schoeck
GATC Biotech
April 5th, 2011
Agenda
I. Introduction to GATC Biotech, providing sequencing service
II. Presentation of in-house sequencing technologies
III. Bioinformatics - definition and history
IV. Evolution of sequencing
V. Sequence analysis - what do we have to face everyday?
VI. Sequencing applications - what is possible?
VII. Conclusions - are we prepared?
GATC Biotech - where we are
GATC Biotech
• leading European commercial sequencing service provider
• over 20 years of experience and know-how
• ISO-certified since 1997
• 100% privately owned, self-financed & independent
• more than 125 employees in 5 subsidiaries, 22 sales offices
• 3-shift sequencing labs in Germany (Konstanz & Duesseldorf) and UK
• over 10,000 customers all over the world (industry & academia)
• Illumina Certified Service Provider
Complete and integrated sequencing & bioinformatic solutions:
from single sample to ultra high throughput
Sequencing technologies in house
Applied Biosystems ABI 3730xl (since 1996)
Roche / 454 Genome Sequencer FLX (since 2006)
Illumina / Solexa HiSeq 2000 (since 2006)
Pacific Biosciences PacBio RS (May 2011)
Sequencing capacity
[Figure: yearly sequencing capacity in Tb (axis 0 to 70), from "till 2006" to July 11 in half-year steps; instruments shown: Applied Biosystems ABI 3730xl, GS FLX, GA, HiSeq, PacBio RS]
System comparison
system | GS FLX | HiSeq 2000 | PacBio RS
available since | 2005 (GS 20 by 454 Life Science) | 2006 (Genetic Analyzer by Solexa) | 2010
device | PicoTiterPlate w/ wells | flowcells w/ channels | SMRT cells w/ zero-mode waveguides
library | DNA fragmentation, adapter ligation (all three systems)
amplification | emulsion PCR | bridging PCR | none
sequencing | sequencing by synthesis: pyrosequencing | sequencing by synthesis: cyclic reversible termination | sequencing by synthesis: single molecule, real-time
Comparison
system | GS FLX | HiSeq 2000 | PacBio RS
Read length | Ø 400 bases | 50 bases, 100 bases | > 1,000 bases
Mate pairs / paired end | averaging 140-200+ bases, insert sizes ~ 3 kb & higher | 2 x 50 or 2 x 100 bases, insert sizes 300 b, ~ 3 kb | strobe reads
# of reads / run | > 1 million | > 800,000,000 (single reads), > 1,600,000,000 (paired end) | 75,000 ZMWs / SMRT cell
Base incorporation | same bases in one cycle (homopolymers) | base after base, cycle per cycle | base after base, continuously
Bioinformatics
Definition:
• Science explaining biology by using information technologies (computational biology)
• Providing algorithms, databases, user interfaces and statistical applications for specifying potential scientific significance
Bioinformatics
Main object:
• Representation of macromolecules as linear chains of defined components or as sequences of symbols
• Main application in bioinformatics: comparison of sequences for detecting homology (function, structure)
GCGTCCTCGGGCTTGGCGA
ACTGGGCGGCGGCGGTGGC
GGGCAGCAGCATGGGGGCG
GCA...
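To make "comparison of sequences" concrete, here is a minimal sketch of a global alignment score in the style of Needleman-Wunsch. The slide names no particular algorithm, so the choice of method, the scoring values (match, mismatch, gap) and the variant sequence are assumptions for illustration only; the reference sequence is the first line shown on the slide.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the best global alignment score of strings a and b.

    Scoring values are illustrative assumptions, not taken from the slides.
    """
    # previous[j] is the best score for aligning the processed prefix of a with b[:j].
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            pair_score = match if a[i - 1] == b[j - 1] else mismatch
            current.append(max(
                previous[j - 1] + pair_score,  # align a[i-1] with b[j-1]
                previous[j] + gap,             # gap in b
                current[j - 1] + gap,          # gap in a
            ))
        previous = current
    return previous[-1]


reference = "GCGTCCTCGGGCTTGGCGA"   # first sequence line from the slide
variant = "GCGTCCTCGGGATTGGCGA"     # hypothetical copy with one substitution
print(global_alignment_score(reference, variant))
```

A high score relative to the sequence length hints at homology; real tools add substitution matrices, affine gap costs and statistics on top of this basic recurrence.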
History
• Manual analyses of sequence homologies using standard word processing programmes
• Lasting change in molecular biology by introducing efficient computer algorithms
• Sequence alignment
• Phylogenetics
• Pattern matching
• Web-based database searches
• …
Evolution of Sequencing
• Sanger sequencing (1 read per sequencing run)
• Roche (1,000,000 reads per sequencing run)
• Illumina (1,600,000,000 reads per sequencing run)
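The jump in reads per run can be turned into a rough estimate of bases per run. The sketch below combines the read counts listed here with the read lengths from the comparison table (about 400 bases for GS FLX, 100 bases per read in a 2 x 100 HiSeq 2000 run). It assumes that "Roche" refers to the GS FLX, "Illumina" to the HiSeq 2000, and that the 1,600,000,000 figure counts individual reads rather than read pairs; the function names are made up for this example.

```python
# Reads per sequencing run as listed on the slide.
READS_PER_RUN = {
    "Sanger": 1,
    "Roche": 1_000_000,
    "Illumina": 1_600_000_000,
}

# Read lengths in bases, taken from the comparison table (assumed mapping).
READ_LENGTH = {
    "Roche": 400,     # GS FLX, average read length
    "Illumina": 100,  # HiSeq 2000, one read of a 2 x 100 run
}


def bases_per_run(platform):
    """Rough number of bases produced in one run."""
    return READS_PER_RUN[platform] * READ_LENGTH[platform]


def fold_increase(newer, older):
    """How many times more reads per run the newer platform yields."""
    return READS_PER_RUN[newer] / READS_PER_RUN[older]


for platform in ("Roche", "Illumina"):
    print(platform, bases_per_run(platform) / 1e9, "Gb per run")
print("Illumina vs Roche reads per run:", fold_increase("Illumina", "Roche"))
```

Under these assumptions a single run grows from hundreds of megabases to well over a hundred gigabases, which is the scale the storage and hardware challenges on the following slides refer to.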
[Figure: yearly sequencing capacity in Tb (axis 0 to 70), from "till 2006" to July 11 in half-year steps]
Sequence analysis - today
• Massively produced sequence data using next generation sequencing technologies
• Advantages
  • Applications, applications, applications…
  • Runtime
  • Costs
• Challenges
  • Data analysis and interpretation
  • Hardware infrastructure
  • Data storage
  • Software development
  • Error rates
Example: de novo sequencing
@HWI-ST143_0345:7:1:1200:2150#CGATGT/1
TTCTTCTGATGCCGGCATCCCTGCTTGCAGGTGTGAAG
+
HHHHHHHHHHHHHHHHFHHHHHHHHHHHHHF=FBF:CF
@HWI-ST143_0345:7:1:1310:2072#CGATGT/1
CGTTTCTAAAGCACCCACTATGGATGNNCAGCAGGACA
+
GFFGEFEFFFGDGGCEBEE=EEEEE9##55++;(1@A:
...
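The two records above are in FASTQ format: a header line starting with "@", the bases, a "+" separator, and one quality character per base. The following sketch parses them and decodes the quality string. It assumes a Phred+33 offset, because the second record contains characters such as "#" and "(" whose ASCII codes lie below 64; the class and function names are invented for this example.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class FastqRecord:
    name: str
    sequence: str
    quality: str

    def phred_scores(self, offset: int = 33) -> List[int]:
        """Decode the quality string (offset 33 is an assumption)."""
        return [ord(char) - offset for char in self.quality]


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from four-line FASTQ entries, skipping blank lines."""
    cleaned = [line.rstrip("\n") for line in lines if line.strip()]
    for start in range(0, len(cleaned) - 3, 4):
        header, sequence, separator, quality = cleaned[start:start + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near line {start + 1}")
        yield FastqRecord(header[1:], sequence, quality)


EXAMPLE = """\
@HWI-ST143_0345:7:1:1200:2150#CGATGT/1
TTCTTCTGATGCCGGCATCCCTGCTTGCAGGTGTGAAG
+
HHHHHHHHHHHHHHHHFHHHHHHHHHHHHHF=FBF:CF
@HWI-ST143_0345:7:1:1310:2072#CGATGT/1
CGTTTCTAAAGCACCCACTATGGATGNNCAGCAGGACA
+
GFFGEFEFFFGDGGCEBEE=EEEEE9##55++;(1@A:
"""

records = list(parse_fastq(EXAMPLE.splitlines()))
for record in records:
    scores = record.phred_scores()
    print(record.name, "N count:", record.sequence.count("N"),
          "min quality:", min(scores), "max quality:", max(scores))
```

In the second read the two uncalled bases (N) coincide with the lowest quality characters, which is the kind of signal a cleaning step would act on.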
Bioinformatics: Assembly, Scaffolding, Annotation, Finishing
Example: Quantitative transcriptomics
Bioinformatics: Alignment, Quantification, Comparison
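As an illustration of the quantification and comparison steps, the sketch below normalises read counts per gene and compares two samples. The slides do not say which measure is used; RPKM (reads per kilobase of transcript per million mapped reads) and a log2 fold change with a pseudocount are swapped in here as common choices, and all names and numbers in the usage lines are hypothetical.

```python
import math


def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


def log2_fold_change(value_a, value_b, pseudocount=1.0):
    """log2 ratio of sample B over sample A; the pseudocount avoids division by zero."""
    return math.log2((value_b + pseudocount) / (value_a + pseudocount))


# Hypothetical counts for one gene of 2,000 bp in two samples.
sample_a = rpkm(read_count=500, gene_length_bp=2000, total_mapped_reads=20_000_000)
sample_b = rpkm(read_count=1500, gene_length_bp=2000, total_mapped_reads=25_000_000)
print(sample_a, sample_b, log2_fold_change(sample_a, sample_b))
```

Normalising by gene length and sequencing depth is what makes counts from different genes and runs comparable before any statistical test is applied.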
[Flowchart: data -> preanalysis (short quality reads, low complexity regions, cDNA adapters, sequencing primers) -> cleaned sequences; option 1: clustering -> cluster representatives -> BLAST analysis -> cluster hits; option 2: assembly -> de novo contigs -> assembly validation, BLAST analysis -> contig hits]
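The preanalysis step can be pictured as a small filter over the raw reads. The sketch below is a deliberately simple stand-in under stated assumptions: adapters are removed by exact string match, low complexity means that one base makes up more than 80% of the read, and the minimum length is 30 bases. None of these criteria or thresholds come from the slides, and the adapter sequence in the usage lines is a placeholder.

```python
from collections import Counter


def trim_adapter(sequence, adapter):
    """Cut the read at the first exact occurrence of the adapter, if any."""
    position = sequence.find(adapter)
    return sequence if position == -1 else sequence[:position]


def is_low_complexity(sequence, max_fraction=0.8):
    """Flag reads dominated by a single base (a crude, assumed criterion)."""
    if not sequence:
        return True
    most_common_count = Counter(sequence).most_common(1)[0][1]
    return most_common_count / len(sequence) > max_fraction


def clean_reads(reads, adapter, min_length=30):
    """Trim adapters, then drop reads that are too short or low in complexity."""
    cleaned = []
    for read in reads:
        trimmed = trim_adapter(read, adapter)
        if len(trimmed) < min_length:
            continue
        if is_low_complexity(trimmed):
            continue
        cleaned.append(trimmed)
    return cleaned


raw_reads = [
    "ACGTACGTAC" + "AGATCGGAAG",  # read followed by the placeholder adapter
    "AAAAAAAAAAAA",               # low complexity
    "ACG",                        # too short
]
print(clean_reads(raw_reads, adapter="AGATCGGAAG", min_length=8))
```

Production pipelines would also trim by base quality and tolerate mismatches in the adapter; the point here is only the order of operations before clustering or assembly.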
Query QLength %HitLength HitLength %Identity e-value GeneID GeneLength
GD3X8YD02G3UTK 473 105.07 497 88 1.00E-138 59783566 778
GD3X8YD02HAS2D 504 103.57 522 92 0 112201467 867
GD3X8YD01EQVR9 438 103.42 453 89 1.00E-129 82985781 904
GD3X8YD02F9LP1 372 103.23 384 92 1.00E-140 194673237 7376
GD3X8YD01BBAL3 435 103.22 449 91 1.00E-165 112362035 3276
GD3X8YD02IRTW2 413 103.15 426 87 1.00E-104 56145323 770
GD3X8YD01C9534 418 103.11 431 93 1.00E-167 157279321 3170
GD3X8YD01EJJ53 461 103.04 475 89 1.00E-137 73976208 2891
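The %HitLength column is not defined on the slide, but the listed values are consistent with HitLength divided by QLength, times 100 (for example 497 / 473 gives 105.07). The sketch below recomputes the column for the first three rows and applies a filter; the formula is therefore an inference from the data, and the filter thresholds and dictionary keys are chosen only for this example.

```python
HIT_TABLE = """\
GD3X8YD02G3UTK 473 105.07 497 88 1.00E-138 59783566 778
GD3X8YD02HAS2D 504 103.57 522 92 0 112201467 867
GD3X8YD01EQVR9 438 103.42 453 89 1.00E-129 82985781 904
"""


def parse_hits(text):
    """Parse whitespace-separated rows in the column order of the table above."""
    hits = []
    for line in text.strip().splitlines():
        query, q_len, pct_hit, hit_len, identity, e_value, gene_id, gene_len = line.split()
        hits.append({
            "query": query,
            "query_length": int(q_len),
            "listed_pct_hit_length": float(pct_hit),
            "hit_length": int(hit_len),
            "identity": float(identity),
            "e_value": float(e_value),
            "gene_id": gene_id,
            "gene_length": int(gene_len),
        })
    return hits


def percent_hit_length(hit):
    """Hit length as a percentage of query length (inferred definition)."""
    return round(100.0 * hit["hit_length"] / hit["query_length"], 2)


hits = parse_hits(HIT_TABLE)
for hit in hits:
    print(hit["query"], percent_hit_length(hit), hit["listed_pct_hit_length"])

# Illustrative filter: keep strong hits only (thresholds are assumptions).
strong = [h["query"] for h in hits if h["e_value"] <= 1e-130 and h["identity"] >= 90]
print(strong)
```

If the inferred definition holds, values above 100 mean the matched region on the database sequence is longer than the query read.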
ClusterID Length(bp) %HitLength e-value UniGeneID GeneLength ContigID ContigLength(bp) HitStart HitEnd
GD3X8YD02G3UTK 473 105.07 1.00E-138 59783566 778 contig20575 2718 300 770
GD3X8YD02HAS2D 504 103.57 0 112201467 867 contig01324 2816 1230 1804
GD3X8YD01EQVR9 438 103.42 1.00E-129 82985781 904 contig01325 2825 67 513
GD3X8YD02F9LP1 372 103.23 1.00E-140 194673237 7376 contig01323 2903 2005 2375
GD3X8YD01BBAL3 435 103.22 1.00E-165 112362035 3276 contig01321 2980 400 830
GD3X8YD02IRTW2 413 103.15 1.00E-104 56145323 770 contig01320 2977 2300 2710
GD3X8YD01C9534 418 103.11 1.00E-167 157279321 3170 contig01318 2894 34 460
GD3X8YD01EJJ53 461 103.04 1.00E-137 73976208 2891 contig01315 2971 56 510
GD3X8YD02F06H3 427 103.04 0 194673243 2026 contig01314 2968 124 552
GD3X8YD02JQKXX 463 103.02 1.00E-180 146186547 4283 contig01322 2808 2287 2750
GD3X8YD01C2RA0 438 102.97 0 167693932 567 contig01319 2886 1456 1890
GD3X8YD01BMILD 439 102.96 1.00E-130 112362035 3276 contig01317 2960 1098 1530
GD3X8YD02IA76G 478 102.93 1.00E-155 160333384 4068 contig01316 2963 1004 1484
GD3X8YD02IJUER 443 102.93 0 114451341 864 contig05489 2657 95 1438
GD3X8YD01AUA6W 484 102.89 1.00E-119 59782054 869 contig22645 3358 3009 3486
GD3X8YD01CKSI4 451 102.88 1.00E-140 74268339 2212 contig20113 2734 1765 2109
GD3X8YD02JPD23 492 102.85 1.00E-167 151556820 3665 contig11912 2558 53 547
GD3X8YD02GG9F3 390 102.82 1.00E-105 219283151 3782 contig01371 2299 1453 1843
Sequencing applications
DNA
• single reads in tubes & plates (PCR, plasmids)
• whole (meta)genome de novo sequencing
• whole genome re-sequencing
• targeted re-sequencing (enrichment, amplicons, exons)
• methylome / epigenome studies
• ChIP-Seq
RNA
• eukaryotic / prokaryotic cDNA de novo sequencing
• eukaryotic / prokaryotic cDNA re-sequencing (3’ UTR / 5’ UTR)
• smallRNA / microRNA
Sequence analysis - today
• Massively produced sequence data using next generation sequencing technologies
• Advantages
  • Applications, applications, applications…
  • Turnover
  • Costs
• Challenges
  • Data analysis and interpretation
  • Hardware infrastructure
  • Data storage
  • Software development
  • Error rates
Conclusion
Provisioning bioinformatics: Are we prepared?
GAP between
• advancements in sequencing technologies (data quantity and application complexity)
and
• advancements in information technologies (hardware and software)
SOLUTION: cloud computing, GPU usage, software development, parallelization...
Thanks for your kind attention.
Open questions?
www.gatc-biotech.com