Dr. Stefan Czemmel, Quantitative Biology Center (QBiC)
Lecture 3: Data sources ("Next-generation" technologies)
Data Management for Quantitative Biology
Overview
• Coevolution of genomic achievements and sequencing technologies
• Next Generation Sequencing (NGS) technologies as data sources
- Introduction to Illumina and PacBio Sequencing technologies
• Applications
• Summary and Outlook
Preface: The central dogma – classical view
The classical flow of information is not always followed: e.g. RNA is increasingly recognized as more than just an information carrier (mRNA): miRNAs and other small non-coding RNAs, ribozymes and riboswitches
The enormous complexity of the transcriptome and proteome is not reflected in the genome
Figure labels: miRNAs, ubiquitin, histone modification, DNA methylation
Alberts, Molecular Biology of the Cell. 4th edit.
Coevolution of genomic achievements and sequencing technologies
1865 | Gregor Mendel presents his research on pea hybridization, showing that what we now call “genes” determine the inheritance of traits
1952 | Rosalind Franklin creates Photograph 51, showing a distinctive pattern that indicates the helical shape of DNA
1953 | James Watson and Francis Crick discover the double helix structure of DNA
1973 | First sequence of 24 bp published
1977 | Frederick Sanger develops rapid DNA sequencing technique
1982 | GenBank started
1983 | First genetic disease mapped, Huntington’s Disease
1983 | Invention of polymerase chain reaction (PCR) technology for amplifying DNA
Partially adapted from: https://unlockinglifescode.org/timeline
1987 | ABI Prism 373 (1st automated sequencing machine)
1990 | The Human Genome Project begins
1995 | “Shotgun” sequencing helps to sequence the first bacterial genome: Haemophilus influenzae
1996 | Capillary sequencer: ABI 310
2000 | Genome sequence of model organism fruit fly reported
2001 | First draft of the human genome released
2002 | Mouse becomes first mammalian research organism with decoded genome
2005 | 1st 454 Life Sciences NGS system: GS 20 System
2006 | 1st Solexa NGS sequencer: Genome Analyzer
2007 | 1st ABI NGS sequencer: SOLiD
2009 | 1st Helicos single molecule sequencer: Helicos Genetic Analyser
2011 | 1st Ion Torrent sequencer: PGM
2011 | 1st Pacific Biosciences single molecule sequencer: PacBio RS
2012 | Oxford Nanopore Technologies demonstrates ultra long single molecule reads
ED Green et al. Nature (2010)
Developments in sequencing allowed many genomes to be sequenced…
Homo sapiens, ~3259 Mb: February 15, 2001
Oryza sativa, ~426 Mb: April 5, 2002
Populus trichocarpa, ~417 Mb: September 15, 2006
Developments in sequencing revolutionized human genome sequencing efforts
Illumina estimates that the number of sequenced human genomes will reach ~1.6 million by 2017 (Francis de Souza, President of Illumina, at MIT Technology Review’s EmTech conference in Cambridge, Massachusetts).
Overview platform providers
Illumina: 71
Roche: 10
Life Tech: 16
PacBio: 3
Source: Mizuho Securities and GenomeWeb survey: No. of respondents: 103
Some of the sequencing technologies in Tuebingen
http://www.illumina.com/systems/sequencing.html
HiSeq2500
HiSeq2000
MiSeq
HiSeq3000
The principle at the heart of all these technologies is the same:
Sanger Sequencing
Frederick Sanger (13 August 1918 – 19 November 2013)
Nobel Prizes in Chemistry 1958 and 1980
Prize motivation 1958: "for his work on the structure of proteins, especially that of insulin".
Prize motivation 1980, for him and Walter Gilbert (co-laureate Paul Berg was honoured separately for his work on recombinant DNA): "for their contributions concerning the determination of base sequences in nucleic acids".
http://www.nobelprize.org/nobel_prizes/chemistry/
Sanger sequencing
http://en.wikipedia.org/wiki/Sanger_sequencing
Sanger versus second-generation sequencing
Jay Shendure & Hanlee Ji, Nature Biotechnology 26, 1135 - 1145 (2008)
Next Generation Sequencing (NGS) technologies as data sources
PacBio sequencing workflow
1. Sample/library preparation
2. Annealing of Seq Primer to SMRTbell Templates (hairpins)
3. Bind Polymerase (immobilization) to SMRTbell Templates (hairpins)
4. Sequencing
5. Data analysis
PacBio sequencing workflow (3rd generation seq): single molecule approach
Metzker, Nature Reviews Genetics (2010)
Illumina sequencing workflow (2nd generation seq): PCR-based approach
1. Sample/library preparation
2. Cluster generation
3. Sequencing and Imaging
4. Downstream data analysis for a typical RNA-Seq experiment
1. Sample preparation: nucleic acid extraction
http://www.brown.edu/Research/CGP/download/illumina-public/R%20Sequerra%20-%20trouble%20shooting%20library%20preps.pdf
1. Sample preparation: fragmentation and adapter ligation
http://support.illumina.com/content/dam/illumina-marketing/documents/clinical/trusight-one-cardio/primer-ngs-cardiology.pdf
1. Sample preparation: fragmentation
Expected library traces (analyzed using Bioanalyzer)
http://www.brown.edu/Research/CGP/download/illumina-public/R%20Sequerra%20-%20trouble%20shooting%20library%20preps.pdf
1. Sample preparation: fragmentation
Problematic library traces (analyzed using Bioanalyzer):
- Tailing
- Uneven shearing
- Increased size range
http://www.mbl.edu/jbpc/files/2014/05/Bioanalyzer_for_NGS_slideshow.pdf
2. Bridge amplification and cluster generation
Attach library fragments to surface of flow cell
http://support.illumina.com/content/dam/illumina-marketing/documents/clinical/trusight-one-cardio/primer-ngs-cardiology.pdf
Flow cells
http://research.stowers-institute.org/microscopy/external/PowerpointPresentations/ppt/Methods_Technology/KSH_Tech&Methods_012808Final.pdf
3. Sequencing and Imaging: Clonal Single Molecule Array
Flat files (*.bcl …)
Fastq files:
@DJB77P1:476:H15H9ADXX:1:1101:1745:1986 1:N:0:AGTTCC
NCATCTGCTTCCAATTTGTTGGCCATCTTGGTAGCCGCATGGCATATCTC
+
#4=DFFFFHHHHHJJJJJJJJJJJIJJJJJJHIGIIIIJJJIIIJJJGIJ
@DJB77P1:476:H15H9ADXX:1:1101:1807:1994 1:N:0:AGTTCC
NGGACCTGGAATTACATCACCAATAGCATAGACACCTGAAACATTTGTAG
+
#4=DFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJJJJJJJJIJJJJJGHH
@DJB77P1:476:H15H9ADXX:1:1101:2215:1967 1:N:0:AGTTCC
NTTTACTGCATCCCTGTGTTGGGTTGAGATTTTGGGTACTCTGAGATAAA
+
#4=DDFFFHHHHHJJJGHIIJJJGIJGJJJJJJJJJDHIJJIJIJJIIJJ
partially adapted from: http://support.illumina.com/content/dam/illumina-marketing/documents/clinical/trusight-one-cardio/primer-ngs-cardiology.pdf
4. Downstream data analysis for a typical RNA-Seq experiment
4.1 Quality control (FastQC)
(http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
4.2 Alignment to genome (using Tophat2/STAR/BWA …)
(http://ccb.jhu.edu/software/tophat/index.shtml)
(http://bioinformatics.oxfordjournals.org/content/29/1/15)
(http://bio-bwa.sourceforge.net)
4.2.1 Manipulation of SAM/BAM files with Samtools
(http://samtools.sourceforge.net)
4.3 Read counting (HTSeq)
(http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html)
4.4 Statistical analysis for differential expression in R
(http://www.bioconductor.org/packages/release/bioc/html/DESeq2.html)
(http://www.bioconductor.org/packages/release/bioc/html/edgeR.html)
(http://www.bioconductor.org/packages/release/bioc/html/limma.html)
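The command-line part of these steps (4.1 to 4.3) can be sketched as a list of tool invocations. This is only an illustration under stated assumptions: STAR is picked as the aligner out of the three named above, the library is treated as unstranded (`-s no`), and the function name `rnaseq_commands` as well as all file and directory names are hypothetical. The block only builds and prints the commands; it does not run them. Step 4.4 (DESeq2/edgeR/limma) happens in R and is not covered.

```python
import shlex
from pathlib import Path
from typing import List


def rnaseq_commands(fastq: str, genome_dir: str, annotation_gtf: str,
                    out_dir: str, threads: int = 4) -> List[List[str]]:
    """Build (but do not run) one command per step for a single-end sample."""
    # STAR appends fixed suffixes to the prefix given via --outFileNamePrefix.
    prefix = f"{out_dir}/{Path(fastq).stem}."
    bam = f"{prefix}Aligned.sortedByCoord.out.bam"
    return [
        # 4.1 Quality control; the output directory must already exist.
        ["fastqc", "-o", out_dir, fastq],
        # 4.2 Alignment; genome_dir is assumed to hold a prebuilt STAR index.
        ["STAR", "--runThreadN", str(threads),
         "--genomeDir", genome_dir,
         "--readFilesIn", fastq,
         "--outSAMtype", "BAM", "SortedByCoordinate",
         "--outFileNamePrefix", prefix],
        # 4.2.1 Index the coordinate-sorted BAM file.
        ["samtools", "index", bam],
        # 4.3 Read counting; htseq-count writes the count table to stdout.
        # "-s no" is an assumption (unstranded library).
        ["htseq-count", "-f", "bam", "-r", "pos", "-s", "no", bam, annotation_gtf],
    ]


commands = rnaseq_commands("sample1.fastq", "star_index", "genes.gtf", "results")
for command in commands:
    print(shlex.join(command))
```

Keeping the commands as argument lists makes it easy to hand them to `subprocess.run` later without shell quoting problems.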
Fastq format
@DJB77P1:476:H15H9ADXX:1:1101:1745:1986 1:N:0:AGTTCC
NCATCTGCTTCCAATTTGTTGGCCATCTTGGTAGCCGCATGGCATATCTC
+
#4=DFFFFHHHHHJJJJJJJJJJJIJJJJJJHIGIIIIJJJIIIJJJGIJ
@DJB77P1:476:H15H9ADXX:1:1101:1807:1994 1:N:0:AGTTCC
NGGACCTGGAATTACATCACCAATAGCATAGACACCTGAAACATTTGTAG
+
#4=DFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJJJJJJJJIJJJJJGHH
@DJB77P1:476:H15H9ADXX:1:1101:2215:1967 1:N:0:AGTTCC
NTTTACTGCATCCCTGTGTTGGGTTGAGATTTTGGGTACTCTGAGATAAA
+
#4=DDFFFHHHHHJJJGHIIJJJGIJGJJJJJJJJJDHIJJIJIJJIIJJ
Single sequence (here 50bp long)
Line 1: starts with '@', contains sequence identifier and an optional description
Line 2: raw sequence letters
Line 3: starts with '+', optionally followed by same info as in Line 1
Line 4: encodes the quality values for the sequence in Line 2 (ASCII)
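The four-line record layout described above can be read with a few lines of Python. This is a minimal sketch, not a replacement for an established parser: the names `FastqRecord` and `read_fastq` are hypothetical, the two records are toy data, and the sketch assumes that each sequence and quality string sits on exactly one line.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str  # line 1 without the leading '@'
    sequence: str    # line 2
    quality: str     # line 4, ASCII-encoded quality values


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream with four lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:], sequence, quality)


# Toy input (not real sequencing data).
toy_fastq = "@read1 first toy read\nACGTN\n+\nIIII#\n@read2\nGGCCA\n+read2\nJJJJJ\n"
records = list(read_fastq(io.StringIO(toy_fastq)))
for record in records:
    print(record.identifier, record.sequence, record.quality)
```

Checking that lines 2 and 4 have the same length catches truncated files early, since every base must have exactly one quality character.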
5.1 Raw data inspection: Fastq format
@DJB77P1:476:H15H9ADXX:1:1101:1745:1986 1:N:0:AGTTCC
NCATCTGCTTCCAATTTGTTGGCCATCTTGGTAGCCGCATGGCATATCTC
+
#4=DFFFFHHHHHJJJJJJJJJJJIJJJJJJHIGIIIIJJJIIIJJJGIJ
Single sequence (here 50bp long)
@DJB77P1:476:H15H9ADXX:1:1101:1745:1986 1:N:0:AGTTCC
DJB77P1: unique instrument name
476: run ID
H15H9ADXX: flow cell ID
1: flow cell lane
1101: tile number within the flow cell lane
1745:1986: ‘x’ and ‘y’ coordinates of the cluster within the tile
1: the member of a pair, 1 or 2
N: Y if the read is filtered, N otherwise
0: 0 when none of the control bits are on, otherwise it is an even number
AGTTCC: index sequence
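The identifier fields listed above can be split programmatically. The following is a small sketch that assumes the layout shown on this slide (seven colon-separated fields, a space, then four colon-separated fields); the names `IlluminaHeader` and `parse_illumina_header` are hypothetical, and headers written by other instruments or software versions may differ.

```python
from typing import NamedTuple


class IlluminaHeader(NamedTuple):
    instrument: str
    run_id: int
    flowcell_id: str
    lane: int
    tile: int
    x: int
    y: int
    pair_member: int
    is_filtered: bool
    control_bits: int
    index_sequence: str


def parse_illumina_header(line: str) -> IlluminaHeader:
    """Split '@instrument:run:flowcell:lane:tile:x:y pair:filtered:control:index'."""
    location, description = line.lstrip("@").split(" ", 1)
    instrument, run_id, flowcell_id, lane, tile, x, y = location.split(":")
    pair_member, filtered, control_bits, index_sequence = description.split(":")
    return IlluminaHeader(
        instrument, int(run_id), flowcell_id, int(lane), int(tile),
        int(x), int(y), int(pair_member), filtered == "Y",
        int(control_bits), index_sequence,
    )


header = parse_illumina_header("@DJB77P1:476:H15H9ADXX:1:1101:1745:1986 1:N:0:AGTTCC")
print(header)
```

A named tuple keeps the fields readable (for example `header.lane`) while still behaving like a plain tuple.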
5.1 Raw data inspection: FastQC reports
[Figure: FastQC per-base quality plots ("Quality scores across all bases"; x-axis: position in read (bp), y-axis: Phred score (q)). Left: good Illumina 65 bp raw data (in-house data). Right: bad Illumina 40 bp raw data (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html)]
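What a per-base quality plot summarises can be sketched in a few lines: decode each quality character to a Phred score and average the scores per read position. This is a simplified illustration, not FastQC's implementation (FastQC draws box plots, not just means). It assumes the Phred+33 ASCII offset used by Sanger-style and Illumina 1.8+ FASTQ files; the function names and the toy quality strings are hypothetical.

```python
from statistics import mean
from typing import List, Sequence

PHRED_OFFSET = 33  # assumption: Phred+33 encoding (Sanger / Illumina 1.8+)


def decode_quality(quality: str, offset: int = PHRED_OFFSET) -> List[int]:
    """Convert an ASCII quality string into a list of Phred scores."""
    return [ord(character) - offset for character in quality]


def mean_quality_per_position(qualities: Sequence[str]) -> List[float]:
    """Average the Phred scores of all reads at each read position."""
    decoded = [decode_quality(quality) for quality in qualities]
    longest = max(len(scores) for scores in decoded)
    return [
        mean(scores[i] for scores in decoded if i < len(scores))
        for i in range(longest)
    ]


# Toy quality strings (first five characters of typical Illumina reads).
toy_qualities = ["#4=DF", "#4=DD"]
per_position = mean_quality_per_position(toy_qualities)
print(per_position)
```

Reads shorter than the longest read simply do not contribute to the later positions, which mirrors how trimmed reads of mixed length are usually handled.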
Calculation Phred Quality Score
$q = -10 \log_{10}(p)$
• p = error probability for the base
• if p = 0.01 (1% chance of error), then q = 20
• if p = 0.001 (0.1% chance of error), then q = 30
• Phred quality values are rounded to the nearest integer
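The Phred score definition (q = -10 · log10(p)) and its inverse can be checked with a short sketch. The function names are hypothetical; the rounding follows the statement that Phred values are rounded to the nearest integer.

```python
import math


def phred_from_error_probability(p: float) -> int:
    """Phred quality q = -10 * log10(p), rounded to the nearest integer."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in the interval (0, 1]")
    return round(-10 * math.log10(p))


def error_probability_from_phred(q: float) -> float:
    """Inverse relation: p = 10 ** (-q / 10)."""
    return 10 ** (-q / 10)


q20 = phred_from_error_probability(0.01)
q30 = phred_from_error_probability(0.001)
p_q40 = error_probability_from_phred(40)
print(q20, q30, p_q40)
```

Each step of 10 in q corresponds to a tenfold lower error probability, so q = 40 means one expected error in 10,000 base calls.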
5.2 Alignment/mapping to the genome: TopHat2
Kim et al., Genome Biology (2013)
http://ccb.jhu.edu/software/tophat/downloads/
5.2. Alignment/mapping: Visualization with e.g. IGV
Sox17 gene, mm10 genome
SNPs
https://www.broadinstitute.org/software/igv/download
5.3. Read counting (HTSeq)
Anders et al., Bioinformatics (2014)
http://www-huber.embl.de/users/anders/HTSeq/doc/count.html
5.4 Differential expression (DE) analysis in R
Czemmel et al., PLoS One (2014): using DESeq package, Anders and Huber, Genome Biology (2010)
http://www.statsci.org/smyth/pubs/edgeRChapterPreprint.pdf: using edgeR package, Robinson et al., Bioinformatics (2010)
Broderick et al., BMC Plant Biol. (2014): using DESeq2 package, Love et al., Genome Biology (2014)
The choice of statistical test depends on the experimental design at hand, see e.g.:
Luo et al., Genome Biology 2014
Counting (RNA-Seq) and coverage (WGS)
RNA-Seq: 13 read counts for Gene A
WGS: 5x coverage at position X in Gene A
WGS: 2x coverage at position Y in Gene A
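The difference between counting (reads per gene) and coverage (reads per position) can be made concrete with a toy sketch. Reads and the gene are modelled as half-open intervals [start, end) on one chromosome; all coordinates are invented for illustration, and the simple overlap rule is an assumption (tools such as HTSeq offer several modes for reads that only partially overlap a feature).

```python
from typing import Sequence, Tuple

Interval = Tuple[int, int]  # half-open interval [start, end)


def count_reads_in_gene(reads: Sequence[Interval], gene: Interval) -> int:
    """Number of reads overlapping the gene by at least one base."""
    gene_start, gene_end = gene
    return sum(1 for start, end in reads if start < gene_end and end > gene_start)


def coverage_at(reads: Sequence[Interval], position: int) -> int:
    """Number of reads covering a single position."""
    return sum(1 for start, end in reads if start <= position < end)


# Toy data: gene A and eight aligned reads (coordinates are made up).
gene_a = (100, 200)
reads = [(90, 140), (100, 150), (110, 160), (120, 170),
         (130, 180), (170, 220), (175, 225), (250, 300)]

read_count = count_reads_in_gene(reads, gene_a)
coverage_x = coverage_at(reads, 135)  # position X
coverage_y = coverage_at(reads, 190)  # position Y
print(read_count, coverage_x, coverage_y)
```

In this toy data set the gene receives one count per overlapping read, while coverage varies along the gene (5x at X, 2x at Y), which is why the two quantities drive experimental design differently.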
Experimental design based on coverage/counting
WGS project: the aim is to sequence the whole human genome of three patients at 50x coverage on a HiSeq2500 High Output with 100 bp paired-end (PE) reads. How many lanes/flow cells are needed?
http://www.illumina.com/documents/products/technotes/technote_coverage_calculation.pdf
Human genome: ~3.9 Gb = 3900 Mb
HiSeq2500 High Output: ≥ ~180e+06 PE reads per lane
bp per lane = 180e+06 * 200 = 3.6e+10 bp
Coverage per lane per multiplexed genome = 3.6e+10 / (3900e+06 * 3) ≈ 3x
Lanes needed: 50x / 3x, i.e. ≥ 16 lanes ≈ 2 flow cells (with 8 lanes each)
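The WGS calculation above can be reproduced without intermediate rounding. The sketch uses the slide's figures (3.9 Gb genome size, 180e+06 read pairs per lane, 2 x 100 bp per pair, three genomes multiplexed per lane); the function names are hypothetical. Note that the unrounded result is 16.25 lanes, so strict rounding up gives 17 lanes, whereas the 16 lanes (two flow cells) quoted above deliver roughly 49x per genome.

```python
import math


def coverage_per_lane(read_pairs_per_lane: int, bases_per_pair: int,
                      genome_size_bp: int, genomes_per_lane: int) -> float:
    """Coverage each multiplexed genome receives from one lane."""
    return read_pairs_per_lane * bases_per_pair / (genome_size_bp * genomes_per_lane)


def lanes_for_target(target_coverage: float, per_lane: float) -> int:
    """Smallest whole number of lanes that reaches the target coverage."""
    return math.ceil(target_coverage / per_lane)


per_lane = coverage_per_lane(
    read_pairs_per_lane=180_000_000,  # HiSeq2500 High Output, figure from the slide
    bases_per_pair=200,               # 2 x 100 bp paired-end reads
    genome_size_bp=3_900_000_000,     # genome size used on the slide
    genomes_per_lane=3,               # three patients multiplexed per lane
)
lanes = lanes_for_target(50, per_lane)
coverage_with_16_lanes = 16 * per_lane
print(round(per_lane, 2), lanes, round(coverage_with_16_lanes, 1))
```

Whether 49x is close enough to the 50x target is a project decision; the per-lane yield is itself only an approximate lower bound.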
RNA-Seq project: the aim is to sequence the transcriptome of three Arabidopsis plants treated with reagent A and three others treated with control reagent B on a HiSeq2500 Rapid Run v2 with 50 bp single-end (SE) reads. To reach a good scientific standard, 20M reads per sample are needed. How many lanes/flow cells are needed?
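Ath (Arabidopsis thaliana) genome: ~126 Mb
HiSeq2500 Rapid Run v2 output: ≥ ~150e+06 SE reads per lane
Reads per sample on one lane = 150e+06 / 6 = 25M reads per sample, which exceeds the required 20M, so a single lane is sufficient
Multiplexing: 3 samples per lane (WGS example), 6 samples per lane (RNA-Seq example)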
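The RNA-Seq multiplexing arithmetic can be written as a small sketch. It uses the figures given above (150e+06 single-end reads per lane, six samples, 20M reads required per sample); the function names are hypothetical, and reads are assumed to be distributed evenly across samples, which real runs only approximate.

```python
import math


def reads_per_sample(reads_per_lane: int, samples_on_lane: int) -> int:
    """Reads each sample receives if the lane is shared evenly."""
    return reads_per_lane // samples_on_lane


def lanes_needed(n_samples: int, reads_per_lane: int, min_reads_per_sample: int) -> int:
    """Lanes required so that every sample reaches the minimum read depth."""
    max_samples_per_lane = reads_per_lane // min_reads_per_sample
    if max_samples_per_lane == 0:
        raise ValueError("one lane cannot supply the required reads for a single sample")
    return math.ceil(n_samples / max_samples_per_lane)


per_sample = reads_per_sample(150_000_000, 6)
lanes = lanes_needed(n_samples=6, reads_per_lane=150_000_000,
                     min_reads_per_sample=20_000_000)
print(per_sample, lanes)
```

With these figures up to seven samples would fit on one lane, so the six samples of this design leave a small safety margin.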
Applications of NGS technologies
Instruments: HiSeq2500, MiSeq
Genomics:
- Whole genome (re-)sequencing (WGS)
- Targeted (re-)sequencing, e.g. whole exome sequencing (WES)
- De novo sequencing
Transcriptomics:
- (Total) RNA-Seq
- mRNA-Seq
- Targeted RNA-Seq
- Small non-coding RNA sequencing (miRNAs …)
- Ribosome profiling
Epigenomics:
- Chromatin Immunoprecipitation Sequencing (ChIP-Seq)
- Bisulfite Sequencing (DNA methylation)
For many more see: http://www.illumina.com/content/dam/illumina-marketing/documents/products/research_reviews/sequencing-methods-review.pdf
Transcriptomics: Most widely used technology before NGS: microarrays
Top left: http://www.mun.ca/biology/scarr/cDNA_microarray_Assay_of_Gene_Expression.html
Genomics: De Novo Sequencing
www.illumina.com
www.illumina.com
Genomics: Targeted Seq e.g. Whole exome Seq (WES)
Epigenomics
http://www.illumina.com/applications/epigenetics.html
Applications
Jay Shendure & Hanlee Ji, Nature Biotechnology 26, 1135 - 1145 (2008)
Applications of NGS technologies
Rabbani et al., Journal of Human Genetics (2014)
Applications of WES in cancer research
Applications of de novo sequencing: De novo assembly of the Haitian cholera outbreak strain
Bashir et al., Nature (2012)
Yant et al., The Plant Cell (2010)
Applications of ChIP-Seq: Identification of APETALA2 TF binding sites in the Arabidopsis genome
Li et al., The Plant Cell (2014)
Combination of WGS and sequence-capture bisulfite sequencing: Identification of genetic perturbations of the maize methylome
Application of small non coding RNA-Seq: Identification of novel miRNA biomarkers of muscle disease
Guess et al., PLoS One (2015)
Summary: Highlighted advantages/disadvantages of NGS technologies
Method: Sanger
Advantages:
- Lowest error rate
- Long read length (~750 bp)
- Low costs for small study
Disadvantages:
- High cost per base/for large studies
- Long time to generate data
- Need for cloning
- Low amount of data per run
Method: Illumina
Advantages:
- Low error rate
- Lowest cost per base
- Can support de novo seq approaches performed with PacBio via high output yield
Disadvantages:
- Shorter read length than e.g. PacBio
- High startup costs
- De novo assembly difficult
More details: http://www.molecularecologist.com/next-gen-fieldguide-2014/
Summary: Highlighted advantages/disadvantages of NGS technologies continued
Method: Ion Torrent
Advantages:
- Low startup costs
- Medium/low cost per base
- Low error rate
- Fast runs
Disadvantages:
- Costs higher than e.g. Illumina
- Read length between Illumina and PacBio
- Higher error rate than Illumina
Method: PacBio
Advantages:
- Single molecule as template
- Long reads, often used in conjunction with Illumina for de novo seq approaches
Disadvantages:
- Still high error rate
- Low total no. of reads
- Medium/high costs per base
- High startup costs
Method: Oxford Nanopore
Advantages:
- MinION is a USB device
- Extremely low-cost
- Extremely long reads feasible
Disadvantages:
- Unknown error rate
More details: http://www.molecularecologist.com/next-gen-fieldguide-2014/
Outlook
Is there a translation of NGS technologies into clinical diagnostics soon?
NGS technologies are promising in molecular diagnostics, in which high sensitivity and specificity are required, as they:
+ are able to provide single-nucleotide resolution
+ constantly improve, with more simplified and automated sample preparation
- the per-base-position error rate (0.5–2%) is still too high for most diagnostic tools
- the combination of various errors and variability arising from DNA fragmentation, sequencing library preparation, sequencing-by-synthesis and short read alignment/assembly could incur a significant false-positive rate
Su et al., Expert Rev Mol Diagn. 2011;11(3):333-343.
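To get a feeling for what a per-base error rate of 0.5–2% means at a single genomic position, the following sketch computes the expected number of erroneous base calls and the probability of seeing at least one. It is a back-of-the-envelope illustration only: the function names are hypothetical, the depth of 50 is borrowed from the WGS example earlier in the lecture, and errors are assumed to be independent between reads, which real sequencing errors are not.

```python
def expected_erroneous_calls(per_base_error: float, depth: int) -> float:
    """Expected number of wrong base calls among `depth` reads at one position."""
    return per_base_error * depth


def prob_at_least_one_error(per_base_error: float, depth: int) -> float:
    """Probability that at least one of `depth` independent calls is wrong."""
    return 1 - (1 - per_base_error) ** depth


depth = 50  # assumption: 50x coverage, as in the WGS design example
summary = {
    rate: (expected_erroneous_calls(rate, depth), prob_at_least_one_error(rate, depth))
    for rate in (0.005, 0.02)
}
for rate, (expected, probability) in summary.items():
    print(f"error rate {rate:.1%}: expected {expected:.2f} wrong calls, "
          f"P(at least one) = {probability:.2f}")
```

Under these assumptions a noticeable fraction of positions carries at least one erroneous call, which is why variant callers require several supporting reads before reporting a difference from the reference.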
Contact:
Quantitative Biology Center (QBiC)
Auf der Morgenstelle 10
72076 Tübingen · Germany
Thanks for listening – See you next week