Dr. Stefan Czemmel, Quantitative Biology Center (QBiC)

Lecture 3: Data sources ("Next-generation" technologies)

Data Management for Quantitative Biology

• Coevolution of genomic achievements and sequencing


• Next Generation Sequencing (NGS) technologies as data sources

- Introduction to Illumina and PacBio Sequencing technologies

• Applications

• Summary and Outlook


The classical flow of information is not always followed: e.g. RNA is increasingly recognized as more than just an information carrier (mRNA): miRNAs and other small non-coding RNAs, ribozymes and riboswitches

enormous complexity of transcriptome and proteome not reflected in the genome

Alberts, Molecular Biology of the Cell. 4th edit.

Histone modification, DNA methylation

Preface: The central dogma – classical view

Coevolution of genomic achievements and sequencing technologies


1865 | Gregor Mendel presents his research on pea hybridization, showing that what we now call “genes” determine the inheritance of traits

1952 | Rosalind Franklin creates Photograph 51, showing a distinctive pattern that indicates the helical shape of DNA

1953 | James Watson and Francis Crick discover the double helix structure of DNA

1973 | First sequence of 24 bp published

1977 | Frederick Sanger develops rapid DNA sequencing technique

1982 | GenBank started

1983 | First genetic disease mapped, Huntington’s disease

1983 | Invention of polymerase chain reaction (PCR) technology for amplifying DNA

Partially adapted from: https://unlockinglifescode.org/timeline

Coevolution of genomic achievements and sequencing technologies

1987 | ABI Prism 373 (1st automated sequencing machine)

1990 | The Human Genome Project begins

1995 | “Shotgun” sequencing helped to sequence the first bacterial genome: Haemophilus influenzae

1996 | Capillary sequencer: ABI 310

2000 | Genome sequence of model organism fruit fly reported

2001 | First draft of the human genome released

2002 | Mouse becomes first mammalian research organism with decoded genome

2005 | 1st 454 Life Sciences NGS system: GS 20 System

2006 | 1st Solexa NGS sequencer: Genome Analyzer

2007 | 1st ABI NGS sequencer: SOLiD

2009 | 1st Helicos single-molecule sequencer: Helicos Genetic Analyser

2011 | 1st Ion Torrent sequencer: PGM

2011 | 1st Pacific Biosciences single-molecule sequencer: PacBio RS

2012 | Oxford Nanopore Technologies demonstrates ultra-long single-molecule reads

ED Green et al. Nature (2010)

Coevolution of genomic achievements and sequencing technologies

Developments in sequencing allowed many genomes to be sequenced…

Homo sapiens (~3259 Mb): February 15, 2001

Oryza sativa (~426 Mb): April 5, 2002

Populus trichocarpa (~417 Mb): September 15, 2006

Developments in sequencing revolutionized human genome sequencing efforts


Illumina estimates that the number of sequenced human genomes will reach ~1.6 million by 2017 (Francis de Souza, President of Illumina, at MIT Technology Review’s EmTech conference in Cambridge, Massachusetts).

Overview platform providers








Life Tech


Source: Mizuho Securities and GenomeWeb survey: No. of respondents: 103

Some of the sequencing technologies in Tuebingen




MiSeq HiSeq3000

The principle at the heart of all these technologies is the same:

Sanger Sequencing


Frederick Sanger (13 August 1918 – 19 November 2013)

Nobel Prizes in Chemistry 1958 and 1980

Prize motivation 1958: "for his work on the structure of proteins, especially that of insulin".

Prize motivation 1980, for him and Walter Gilbert (Paul Berg received the other half of the prize): "for their contributions concerning the determination of base sequences in nucleic acids".


Sanger sequencing



Page 13: Data Management for Quantitative Biology - Data sources (Next generation technologies), Apr. 30, 2015, Dr. Stefan Czemmel

Sanger versus second-generation sequencing

Jay Shendure & Hanlee Ji, Nature Biotechnology 26, 1135 - 1145 (2008)


Next Generation Sequencing (NGS) technologies as data sources

PacBio sequencing workflow

1. Sample/library preparation

2. Annealing of Seq Primer to SMRTbell Templates (hairpins)

3. Bind Polymerase (immobilization) to SMRTbell Templates (hairpins)

4. Sequencing

5. Data analysis

PacBio sequencing workflow (3rd generation seq): single molecule approach

Metzker, Nature Reviews Genetics (2010)

Illumina sequencing workflow (2nd generation seq): PCR-based approach

1. Sample/library preparation

2. Cluster generation

3. Sequencing and Imaging

4. Downstream data analysis for a typical RNA-Seq experiment

1. Sample preparation: nucleic acid extraction


1. Sample preparation: fragmentation and adapter ligation


1. Sample preparation: fragmentation

Expected library traces (analyzed using Bioanalyzer)


Problematic library traces (analyzed using Bioanalyzer)


1. Sample preparation: fragmentation


Uneven shearing

Increased size range

Attach library fragments to surface of flow cell


2. Bridge amplification and cluster generation

Flow cells


3. Sequencing and Imaging: Clonal Single Molecule Array


Flat files (*.bcl …)

Fastq files

Partially adapted from: http://support.illumina.com/content/dam/illumina-marketing/documents/clinical/trusight-one-cardio/primer-ngs-cardiology.pdf

4. Downstream Data analysis for a typical RNA-Seq experiment

4.1 Quality control (FastQC) (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

4.2 Alignment to genome (using Tophat2/STAR/BWA …) (http://ccb.jhu.edu/software/tophat/index.shtml) (http://bioinformatics.oxfordjournals.org/content/29/1/15) (http://bio-bwa.sourceforge.net)

4.2.1 Manipulation of SAM/BAM files with Samtools (http://samtools.sourceforge.net)

4.3 Read counting (HTSeq) (http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html)

4.4 Statistical analysis for differential expression in R (http://www.bioconductor.org/packages/release/bioc/html/DESeq2.html) (http://www.bioconductor.org/packages/release/bioc/html/edgeR.html) (http://www.bioconductor.org/packages/release/bioc/html/limma.html)

Fastq format


Single sequence (here 50bp long)

Line 1: starts with '@', contains the sequence identifier and an optional description
Line 2: raw sequence letters
Line 3: starts with '+', optionally followed by the same information as in Line 1
Line 4: encodes the quality values for the sequence in Line 2 (ASCII)
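The four-line layout described above can be read with a few lines of Python. This is a minimal sketch, not a replacement for an established parser: the names `FastqRecord` and `read_fastq` are hypothetical, the example record is made up, and the sketch assumes that each record occupies exactly four lines (no wrapped sequence lines).

```python
import io
from typing import Iterator, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str  # Line 1 without the leading '@'
    sequence: str    # Line 2: raw sequence letters
    quality: str     # Line 4: one ASCII character per base


def read_fastq(handle) -> Iterator[FastqRecord]:
    """Yield one record per group of four lines (assumes unwrapped records)."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input (an empty line also stops the sketch)
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ: " + header)
        yield FastqRecord(header[1:], sequence, quality)


# Made-up example record, not taken from the lecture data
example = io.StringIO("@read1 demo\nACGTACGTAC\n+\nIIIIIHHHGG\n")
records = list(read_fastq(example))
print(records[0].identifier, len(records[0].sequence))
```

Checking that the sequence and quality lines have the same length catches truncated files early, which is a common problem with interrupted transfers.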

5.1 Raw data inspection: Fastq format


Single sequence (here 50bp long)

@DJB77P1:476:H15H9ADXX:1:1101:1745:1986 1:N:0:AGTTCC

DJB77P1: unique instrument name

476: run ID

H15H9ADXX: flow cell ID

1: flow cell lane

1101: tile number within the flow cell lane

1745:1986: ‘x’ and ‘y’ coordinates of the cluster within the tile

1: the member of a pair, 1 or 2

N: Y if the read is filtered, N otherwise

0: 0 when none of the control bits are on, otherwise it is an even number

AGTTCC: index sequence
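The identifier fields annotated above can be split programmatically. The following is a sketch under the assumption that the header follows exactly the colon-separated layout of the example line (seven fields, a space, then four fields); `IlluminaReadId` and `parse_illumina_header` are hypothetical names chosen for illustration.

```python
from typing import NamedTuple


class IlluminaReadId(NamedTuple):
    instrument: str
    run_id: int
    flowcell_id: str
    lane: int
    tile: int
    x: int
    y: int
    pair_member: int
    is_filtered: bool
    control_bits: int
    index_sequence: str


def parse_illumina_header(line: str) -> IlluminaReadId:
    """Split '@instrument:run:flowcell:lane:tile:x:y pair:filtered:control:index'."""
    if not line.startswith("@"):
        raise ValueError("header must start with '@'")
    location, description = line[1:].split(" ", 1)
    instrument, run_id, flowcell_id, lane, tile, x, y = location.split(":")
    pair_member, filtered, control_bits, index_sequence = description.split(":")
    return IlluminaReadId(
        instrument=instrument,
        run_id=int(run_id),
        flowcell_id=flowcell_id,
        lane=int(lane),
        tile=int(tile),
        x=int(x),
        y=int(y),
        pair_member=int(pair_member),
        is_filtered=(filtered == "Y"),  # 'Y' means the read is filtered
        control_bits=int(control_bits),
        index_sequence=index_sequence,
    )


read_id = parse_illumina_header("@DJB77P1:476:H15H9ADXX:1:1101:1745:1986 1:N:0:AGTTCC")
print(read_id)
```

Headers written by other instruments or software versions may use a different layout, in which case the tuple unpacking raises a `ValueError` rather than silently mislabeling fields.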

5.1 Raw data inspection: FastQC reports

Figure: FastQC per-base quality plots ("Quality scores across all bases"; x-axis: position in read (bp)). Left: good Illumina raw data, 65 bp reads (in-house data). Right: bad Illumina raw data, 40 bp reads (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html).

Calculation of the Phred quality score

$q = -10 \log_{10}(p)$

• p = error probability for the base
• if p = 0.01 (1% chance of error), then q = 20
• if p = 0.001 (0.1% chance of error), then q = 30
• Phred quality values are rounded to the nearest integer
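The relation between error probability p and Phred score q can be checked numerically. The sketch below also decodes the ASCII quality characters of a FASTQ record; it assumes the Phred+33 offset, which should be confirmed for the data at hand since older pipelines used other offsets. All function names are hypothetical.

```python
import math
from typing import List


def phred_from_error_probability(p: float) -> int:
    """q = -10 * log10(p), rounded to the nearest integer."""
    return round(-10 * math.log10(p))


def error_probability_from_phred(q: int) -> float:
    """Inverse relation: p = 10^(-q/10)."""
    return 10 ** (-q / 10)


def decode_quality_string(quality: str, offset: int = 33) -> List[int]:
    """ASSUMPTION: Phred+33 encoding, i.e. q = ASCII code - 33."""
    return [ord(character) - offset for character in quality]


print(phred_from_error_probability(0.01))   # 1% chance of error
print(phred_from_error_probability(0.001))  # 0.1% chance of error
print(decode_quality_string("I5?!"))
```

Under the Phred+33 assumption, the characters `I`, `5`, `?` and `!` decode to 40, 20, 30 and 0.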

5.2. Alignment/mapping: Visualization with e.g. IGV

Sox17 gene, mm10 genome



5.3. Read counting (HTSeq)

Anders et al., Bioinformatics (2014), http://www-huber.embl.de/users/anders/HTSeq/doc/count.html

5.4 Differential expression (DE) analysis in R

Czemmel et al., PLoS One (2014): using DESeq package, Anders and Huber, Genome Biology (2010)

http://www.statsci.org/smyth/pubs/edgeRChapterPreprint.pdf: using edgeR package, Robinson et al., Bioinformatics (2010)

Broderick et al., BMC Plant Biol. (2014): using DESeq2 package, Love et al., Genome Biology (2014)

For choosing a statistical test that fits the experimental design at hand, see e.g.:

Luo et al., Genome Biology 2014

Counting (RNA-Seq) and coverage (WGS)

RNA-Seq: 13 read counts for Gene A

WGS: 5x coverage at position X in Gene A, 2x coverage at position Y in Gene A


Gene A
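The difference between counting and coverage can be made concrete with a toy example. This is a simplified sketch with made-up coordinates: reads and the gene are half-open intervals on a single chromosome, and strand, splicing and ambiguous overlaps (which real counters such as HTSeq have to handle) are ignored. The function names are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on a single chromosome


def count_reads_in_gene(reads: List[Interval], gene: Interval) -> int:
    """RNA-Seq style: number of reads that overlap the gene at all."""
    gene_start, gene_end = gene
    return sum(1 for start, end in reads if start < gene_end and end > gene_start)


def coverage_at(reads: List[Interval], position: int) -> int:
    """WGS style: number of reads that cover one specific position."""
    return sum(1 for start, end in reads if start <= position < end)


# Made-up toy data, not the values shown on the slide
gene_a = (100, 200)
reads = [(90, 140), (110, 160), (120, 170), (150, 200), (180, 230), (300, 350)]

print(count_reads_in_gene(reads, gene_a))  # one number per gene
print(coverage_at(reads, 130))             # one number per position
print(coverage_at(reads, 190))
```

Counting yields one value per gene, whereas coverage varies from position to position within the same gene.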

Experimental design based on coverage/counting

WGS project: the aim is to sequence the whole human genome of three patients at 50x coverage on a HiSeq2500 High Output with 100bp paired-end reads (PE). How many lanes/flow cells are needed?


human genome ~3.9 Gb = 3900 Mb
HiSeq2500 High Output = ~>=180e+06 PE reads per lane
bp you get per lane = 180e+06 * 200 = 3.6e+10 bp
coverage per lane per multiplexed genome = 3.6e+10 / (3900e+06 * 3) ≈ 3x
lanes needed = 50x / 3x ≈ 16 to 17 lanes, i.e. about 2 flow cells (with 8 lanes each)
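The lane estimate above can be reproduced with a short script. This is a sketch using the numbers from the slide as inputs (genome size, lower-bound yield per lane, three multiplexed genomes); the function name and parameter names are hypothetical.

```python
import math


def coverage_per_lane(read_pairs_per_lane: float, bases_per_pair: int,
                      genome_size_bp: float, samples_per_lane: int) -> float:
    """Coverage each multiplexed genome receives from one lane."""
    bases_per_lane = read_pairs_per_lane * bases_per_pair
    return bases_per_lane / (genome_size_bp * samples_per_lane)


# Input values as used on the slide
per_lane = coverage_per_lane(
    read_pairs_per_lane=180e6,  # lower bound for HiSeq2500 High Output
    bases_per_pair=200,         # 2 x 100 bp paired-end reads
    genome_size_bp=3900e6,      # genome size as assumed on the slide
    samples_per_lane=3,         # three patients multiplexed
)
target_coverage = 50
lanes_exact = target_coverage / per_lane

print(round(per_lane, 2), round(lanes_exact, 2), math.ceil(lanes_exact))
```

With the unrounded per-lane coverage (about 3.08x) the ratio is 16.25 lanes, so strictly rounding up at the lower-bound yield gives 17 lanes. The slide works with the rounded 3x and a yield of at least 180e+06 read pairs per lane, and settles on roughly 16 lanes, i.e. 2 flow cells.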

RNA-Seq project: the aim is to sequence the transcriptome of three Arabidopsis plants treated with reagent A and three others treated with control reagent B on a HiSeq2500 Rapid Run v2 with 50bp single-end reads (SE). To meet good scientific standards, 20M reads per sample are needed. How many lanes/flow cells are needed?

Ath genome ~126 Mb
HiSeq2500 Rapid Run v2 output = ~>=150e+06 SE reads per lane
Reads per sample on each lane = 150e+06 / 6 = 25M reads per sample

Multiplex 3 samples per lane

Multiplex 6 samples per lane
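The two multiplexing options can be compared against the 20M reads-per-sample requirement with a small calculation. This is a sketch that uses the lower-bound lane yield from the slide and assumes reads are distributed evenly across samples, which real runs only approximate; the function names are hypothetical.

```python
import math


def reads_per_sample(reads_per_lane: float, samples_per_lane: int) -> float:
    """Reads each sample receives, assuming an even split across the lane."""
    return reads_per_lane / samples_per_lane


def lanes_needed(n_samples: int, reads_per_lane: float, required_reads: float) -> int:
    """Smallest number of lanes that gives every sample the required reads."""
    max_samples_per_lane = int(reads_per_lane // required_reads)
    if max_samples_per_lane == 0:
        raise ValueError("one lane cannot deliver the required reads for a single sample")
    return math.ceil(n_samples / max_samples_per_lane)


lane_yield = 150e6  # lower bound for HiSeq2500 Rapid Run v2, SE reads
required = 20e6     # reads per sample needed

print(reads_per_sample(lane_yield, 3))  # multiplex 3 samples per lane
print(reads_per_sample(lane_yield, 6))  # multiplex 6 samples per lane
print(lanes_needed(6, lane_yield, required))
```

Both options exceed 20M reads per sample (50M and 25M), so under these assumptions all six samples fit on a single lane.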

Applications of NGS technologies




Whole genome (re-)sequencing (WGS)

Small non coding RNA sequencing (miRNAs …)

Chromatin Immunoprecipitation Sequencing (ChIP-Seq)

(Total) RNA-Seq

Bisulfite Sequencing (DNA methylation)

Targeted (re-)sequencing, e.g.

Whole exome sequencing (WES)

For many more see: http://www.illumina.com/content/dam/illumina-marketing/documents/products/research_reviews/sequencing-methods-review.pdf


Targeted RNA-Seq

De Novo sequencing

Ribosome profiling




Transcriptomics: Most widely used technology before NGS: microarrays

Top left: http://www.mun.ca/biology/scarr/cDNA_microarray_Assay_of_Gene_Expression.html

Genomics: De Novo Sequencing






Genomics: Targeted Seq e.g. Whole exome Seq (WES)








Jay Shendure & Hanlee Ji, Nature Biotechnology 26, 1135 - 1145 (2008)


Applications of NGS technologies

Rabbani et al., Journal of Human Genetics (2014)


Applications of WES in cancer research

Applications of de novo sequencing: De novo assembly of the Haitian cholera outbreak strain

Bashir et al., Nature (2012)


Yant et al., The Plant Cell (2010)


Applications of ChIP-Seq: Identification of APETALA2 TF binding sites in the Arabidopsis genome

Li et al., The Plant Cell (2014)


Combination of WGS and sequence-capture bisulfite sequencing: Identification of genetic perturbations of the maize methylome

Application of small non coding RNA-Seq: Identification of novel miRNA biomarkers of muscle disease

Guess et al., PLoS One (2015)


Method | Advantages | Disadvantages

Sanger | Lowest error rate; long read length (~750 bp); low costs for small study | High cost per base/for large studies; long time to generate data; need for cloning; low amount of data per run

Illumina | Low error rate; lowest cost per base; can support de novo seq approaches performed with PacBio via high output yield | Shorter read length than e.g. PacBio; high startup costs; de novo assembly difficult

Summary: Highlighted advantages/disadvantages of NGS technologies

More details: http://www.molecularecologist.com/next-gen-fieldguide-2014/

Summary: Highlighted advantages/disadvantages of NGS technologies continued

More details: http://www.molecularecologist.com/next-gen-fieldguide-2014/

Method | Advantages | Disadvantages

Ion Torrent | Low startup costs; medium/low cost per base; low error rate; fast runs | Costs higher than e.g. Illumina; read length between Illumina and PacBio; higher error rate than Illumina

PacBio | Single molecule as template; long reads, often used in conjunction with Illumina for de novo seq approaches | Still high error rate; low total no. of reads; medium/high costs per base; high startup costs

Oxford Nanopore | MinION is a USB device; extremely low-cost; extremely long reads feasible | Unknown error rate

Will NGS technologies be translated into clinical diagnostics soon?


NGS technologies are promising for molecular diagnostics, where high sensitivity and specificity are required:

+ they are able to provide single-nucleotide resolution
+ they constantly improve, with more simplified and automated sample preparation

- the per-base-position error rate is still too high for most diagnostic tools (0.5–2%)
- the combination of various errors and variability arising from DNA fragmentation, sequencing library preparation, sequencing-by-synthesis and short-read alignment/assembly could incur a significant false-positive rate

Su et al., Expert Rev Mol Diagn. 2011;11(3):333-343.

Thanks for listening

