Data Management for Quantitative Biology - Data sources (Next generation technologies), Apr. 30,...

Dr. Stefan Czemmel, Quantitative Biology Center (QBiC)

Lecture 3: Data sources ("Next-generation" technologies)

Data Management for Quantitative Biology

Overview

• Coevolution of genomic achievements and sequencing technologies

• Next Generation Sequencing (NGS) technologies as data sources

- Introduction to Illumina and PacBio Sequencing technologies

• Applications

• Summary and Outlook


miRNAs

ubiquitin

The classical flow of information is not always followed: e.g. RNA is increasingly recognized as more than just an information carrier (mRNA): miRNAs and other small non-coding RNAs, ribozymes and riboswitches

The enormous complexity of the transcriptome and proteome is not reflected in the genome

Alberts, Molecular Biology of the Cell. 4th edit.

Histone modification

DNA methylation

Preface: The central dogma – classical view

Coevolution of genomic achievements and sequencing technologies


1865 | Gregor Mendel presents his research on pea hybridization, showing that what we now call “genes” determine the inheritance of traits

1952 | Rosalind Franklin creates Photograph 51, showing a distinctive pattern that indicates the helical shape of DNA

1953 | James Watson and Francis Crick discover the double helix structure of DNA

1973 | First sequence of 24 bp published

1977 | Frederick Sanger develops rapid DNA sequencing technique

1982 | GenBank started

1983 | First genetic disease mapped, Huntington’s disease

1983 | Invention of polymerase chain reaction (PCR) technology for amplifying DNA

Partially adapted from: https://unlockinglifescode.org/timeline



1987 | ABI Prism 373 (1st automated sequencing machine)

1990 | The Human Genome Project begins

1995 | “Shotgun” sequencing helped to sequence the first bacterial genome: Haemophilus influenzae

1996 | Capillary sequencer: ABI 310

2000 | Genome sequence of model organism fruit fly reported

2001 | First draft of the human genome released

2002 | Mouse becomes first mammalian research organism with decoded genome

2005 | 1st 454 Life Sciences NGS system: GS 20 System

2006 | 1st Solexa NGS sequencer: Genome Analyzer

2007 | 1st ABI NGS sequencer: SOLiD

2009 | 1st Helicos single molecule sequencer: Helicos Genetic Analyser

2011 | 1st Ion Torrent sequencer: PGM

2011 | 1st Pacific Biosciences single molecule sequencer: PacBio RS

2012 | Oxford Nanopore Technologies demonstrates ultra long single molecule reads

ED Green et al. Nature (2010)


Developments in sequencing allowed many genomes to be sequenced…

Homo sapiens, ~3259 Mb (February 15, 2001)

Oryza sativa, ~426 Mb (April 5, 2002)

Populus trichocarpa, ~417 Mb (September 15, 2006)

Developments in sequencing revolutionized human genome sequencing efforts

Illumina estimates that the number of sequenced human genomes will reach ~1.6 million by 2017 (Francis deSouza, President of Illumina, at MIT Technology Review’s EmTech conference in Cambridge, Massachusetts).

Overview platform providers

[Pie chart of platform providers. Labels: Illumina, Roche, Life Tech, PacBio. Values as extracted: 71, 10, 16, 3]

Source: Mizuho Securities and GenomeWeb survey; no. of respondents: 103

Some of the sequencing technologies in Tuebingen

http://www.illumina.com/systems/sequencing.html

HiSeq2500

HiSeq2000

MiSeq HiSeq3000

The principle at the heart of all these technologies is the same:

Sanger Sequencing

Frederick Sanger

13 August 1918 – 19 November 2013

Nobel Prizes in Chemistry 1958 and 1980

Prize motivation 1958: "for his work on the structure of proteins, especially that of insulin".

Prize motivation 1980, awarded jointly to him and Walter Gilbert (Paul Berg received the other half of that year's prize): "for their contributions concerning the determination of base sequences in nucleic acids".

http://www.nobelprize.org/nobel_prizes/chemistry/

Sanger sequencing

http://en.wikipedia.org/wiki/Sanger_sequencing


Sanger versus second-generation sequencing

Jay Shendure & Hanlee Ji, Nature Biotechnology 26, 1135 - 1145 (2008)


Next Generation Sequencing (NGS) technologies as data sources


PacBio sequencing workflow

1. Sample/library preparation

2. Annealing of Seq Primer to SMRTbell Templates (hairpins)

3. Bind Polymerase (immobilization) to SMRTbell Templates (hairpins)

4. Sequencing

5. Data analysis


PacBio sequencing workflow (3rd generation seq): single molecule approach

Metzker, Nature Reviews Genetics (2010)


Illumina sequencing workflow (2nd generation seq): PCR-based approach

1. Sample/library preparation

2. Cluster generation

3. Sequencing and Imaging

4. Downstream data analysis for a typical RNA-Seq experiment


1. Sample preparation: nucleic acid extraction

http://www.brown.edu/Research/CGP/download/illumina-public/R%20Sequerra%20-%20trouble%20shooting%20library%20preps.pdf


1. Sample preparation: fragmentation and adapter ligation

http://support.illumina.com/content/dam/illumina-marketing/documents/clinical/trusight-one-cardio/primer-ngs-cardiology.pdf


1. Sample preparation: fragmentation

Expected library traces (analyzed using Bioanalyzer)

http://www.brown.edu/Research/CGP/download/illumina-public/R%20Sequerra%20-%20trouble%20shooting%20library%20preps.pdf


Problematic library traces (analyzed using Bioanalyzer)

http://www.mbl.edu/jbpc/files/2014/05/Bioanalyzer_for_NGS_slideshow.pdf

1. Sample preparation: fragmentation

Tailing

Uneven shearing

Increased size range


Attach library fragments to surface of flow cell

http://support.illumina.com/content/dam/illumina-marketing/documents/clinical/trusight-one-cardio/primer-ngs-cardiology.pdf

2. Bridge amplification and cluster generation


Flow cells

http://research.stowers-institute.org/microscopy/external/PowerpointPresentations/ppt/Methods_Technology/KSH_Tech&Methods_012808Final.pdf


3. Sequencing and Imaging: Clonal Single Molecule Array

Flat files (*.bcl …)

Fastq files:

@DJB77P1:476:H15H9ADXX:1:1101:1745:1986 1:N:0:AGTTCC
NCATCTGCTTCCAATTTGTTGGCCATCTTGGTAGCCGCATGGCATATCTC
+
#4=DFFFFHHHHHJJJJJJJJJJJIJJJJJJHIGIIIIJJJIIIJJJGIJ
@DJB77P1:476:H15H9ADXX:1:1101:1807:1994 1:N:0:AGTTCC
NGGACCTGGAATTACATCACCAATAGCATAGACACCTGAAACATTTGTAG
+
#4=DFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJJJJJJJJIJJJJJGHH
@DJB77P1:476:H15H9ADXX:1:1101:2215:1967 1:N:0:AGTTCC
NTTTACTGCATCCCTGTGTTGGGTTGAGATTTTGGGTACTCTGAGATAAA
+
#4=DDFFFHHHHHJJJGHIIJJJGIJGJJJJJJJJJDHIJJIJIJJIIJJ

Partially adapted from: http://support.illumina.com/content/dam/illumina-marketing/documents/clinical/trusight-one-cardio/primer-ngs-cardiology.pdf

4. Downstream Data analysis for a typical RNA-Seq experiment

4.1 Quality control (FastQC) (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

4.2 Alignment to genome (using TopHat2/STAR/BWA …) (http://ccb.jhu.edu/software/tophat/index.shtml) (http://bioinformatics.oxfordjournals.org/content/29/1/15) (http://bio-bwa.sourceforge.net)

4.2.1 Manipulation of SAM/BAM files with Samtools (http://samtools.sourceforge.net)

4.3 Read counting (HTSeq) (http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html)

4.4 Statistical analysis for differential expression in R (http://www.bioconductor.org/packages/release/bioc/html/DESeq2.html) (http://www.bioconductor.org/packages/release/bioc/html/edgeR.html) (http://www.bioconductor.org/packages/release/bioc/html/limma.html)
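The command-line part of this workflow (steps 4.1 to 4.3) could be chained roughly as in the sketch below. It only builds the command lines and does not run anything. The file names, the output directory and the function name `rnaseq_commands` are hypothetical, the `-s no` option assumes an unstranded library, and the flags are given as I recall them from each tool's documentation, so they should be checked against the installed versions. Step 4.4 happens in R and is not covered here.

```python
from typing import List


def rnaseq_commands(fastq: str, bowtie_index: str, gtf: str, outdir: str) -> List[List[str]]:
    """Build (but do not run) one command line per analysis step."""
    # TopHat writes its alignments to accepted_hits.bam inside its output directory
    bam = f"{outdir}/tophat/accepted_hits.bam"
    return [
        # 4.1 Quality control; the output directory must already exist
        ["fastqc", "-o", f"{outdir}/fastqc", fastq],
        # 4.2 Alignment to the genome (TopHat2 shown; STAR or BWA are alternatives)
        ["tophat2", "-o", f"{outdir}/tophat", bowtie_index, fastq],
        # 4.2.1 Index the BAM file, e.g. for viewing in a genome browser
        ["samtools", "index", bam],
        # 4.3 Read counting; the counts are written to standard output
        ["htseq-count", "-f", "bam", "-r", "pos", "-s", "no", bam, gtf],
    ]


# Hypothetical inputs, for illustration only
commands = rnaseq_commands("sample.fastq", "genome_index", "genes.gtf", "results")
for command in commands:
    print(" ".join(command))
```

Keeping each command as a list of arguments makes it straightforward to hand them to `subprocess.run` later without shell quoting issues.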


Fastq format

@DJB77P1:476:H15H9ADXX:1:1101:1745:1986 1:N:0:AGTTCC
NCATCTGCTTCCAATTTGTTGGCCATCTTGGTAGCCGCATGGCATATCTC
+
#4=DFFFFHHHHHJJJJJJJJJJJIJJJJJJHIGIIIIJJJIIIJJJGIJ
@DJB77P1:476:H15H9ADXX:1:1101:1807:1994 1:N:0:AGTTCC
NGGACCTGGAATTACATCACCAATAGCATAGACACCTGAAACATTTGTAG
+
#4=DFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJJJJJJJJIJJJJJGHH
@DJB77P1:476:H15H9ADXX:1:1101:2215:1967 1:N:0:AGTTCC
NTTTACTGCATCCCTGTGTTGGGTTGAGATTTTGGGTACTCTGAGATAAA
+
#4=DDFFFHHHHHJJJGHIIJJJGIJGJJJJJJJJJDHIJJIJIJJIIJJ

Each four-line record is a single sequence (here 50 bp long)

Line 1: starts with '@', contains the sequence identifier and an optional description
Line 2: raw sequence letters
Line 3: starts with '+', optionally followed by the same information as in Line 1
Line 4: encodes the quality values for the sequence in Line 2 (ASCII)
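The four-line record layout can be read with a few lines of Python. This is a minimal sketch rather than a replacement for a tested parser: the names `FastqRecord` and `read_fastq` are made up for illustration, the example record is synthetic, and the code assumes every record occupies exactly four lines (no wrapped sequences).

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str  # line 1 without the leading '@'
    sequence: str    # line 2, raw sequence letters
    quality: str     # line 4, ASCII-encoded quality values


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream, assuming four lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality strings differ in length")
        yield FastqRecord(header[1:], sequence, quality)


# Synthetic two-record example (not taken from a real run)
example = io.StringIO("@read1 demo\nACGTN\n+\nIIII#\n@read2\nGGCC\n+read2\nJJJJ\n")
records = list(read_fastq(example))
for record in records:
    print(record.identifier, record.sequence, record.quality)
```

The length check reflects the rule that Line 4 carries one quality character per base in Line 2.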

4.1 Raw data inspection: Fastq format

@DJB77P1:476:H15H9ADXX:1:1101:1745:1986 1:N:0:AGTTCC
NCATCTGCTTCCAATTTGTTGGCCATCTTGGTAGCCGCATGGCATATCTC
+
#4=DFFFFHHHHHJJJJJJJJJJJIJJJJJJHIGIIIIJJJIIIJJJGIJ

Single sequence (here 50 bp long)

Fields of the identifier line @DJB77P1:476:H15H9ADXX:1:1101:1745:1986 1:N:0:AGTTCC

DJB77P1: unique instrument name
476: run ID
H15H9ADXX: flow cell ID
1: flow cell lane
1101: tile number within the flow cell lane
1745:1986: ‘x’ and ‘y’ coordinates of the cluster within the tile
1: the member of a pair, 1 or 2
N: Y if the read is filtered, N otherwise
0: 0 when none of the control bits are on, otherwise it is an even number
AGTTCC: index sequence
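The identifier line can be split into these fields programmatically. The sketch below assumes the field order instrument, run ID, flow cell ID, lane, tile, x, y, followed after a space by pair member, filter flag, control bits and index sequence, which is how the labels map onto the example identifier. The names `IlluminaHeader` and `parse_illumina_header` are illustrative only.

```python
from typing import NamedTuple


class IlluminaHeader(NamedTuple):
    instrument: str
    run_id: int
    flowcell_id: str
    lane: int
    tile: int
    x: int
    y: int
    pair_member: int
    is_filtered: bool
    control_bits: int
    index_sequence: str


def parse_illumina_header(line: str) -> IlluminaHeader:
    """Split a FASTQ identifier line into its colon-separated fields."""
    # The part before the space locates the cluster, the part after describes the read
    location, description = line.lstrip("@").split(" ", 1)
    instrument, run_id, flowcell_id, lane, tile, x, y = location.split(":")
    pair_member, filtered, control_bits, index_sequence = description.split(":")
    return IlluminaHeader(
        instrument, int(run_id), flowcell_id, int(lane), int(tile), int(x), int(y),
        int(pair_member), filtered == "Y", int(control_bits), index_sequence,
    )


header = parse_illumina_header("@DJB77P1:476:H15H9ADXX:1:1101:1745:1986 1:N:0:AGTTCC")
print(header.flowcell_id, header.lane, header.tile, header.index_sequence)
```

Older or differently configured pipelines may write identifiers in another layout, in which case the tuple unpacking raises a `ValueError` instead of silently mislabeling fields.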

4.1 Raw data inspection: FastQC reports

[Figure: two FastQC "Quality scores across all bases" plots, Phred score (q) against position in read (bp). Left: good Illumina 65 bp raw data (in-house data). Right: bad Illumina 40 bp raw data (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html)]

Calculation of the Phred quality score

$q = -10 \log_{10}(p)$

• p = error probability for the base
• if p = 0.01 (1% chance of error), then q = 20
• if p = 0.001 (0.1% chance of error), then q = 30
• Phred quality values are rounded to the nearest integer
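The relation between error probability and Phred score (q = -10 · log10 p, so p = 0.01 gives q = 20) is easy to check numerically. The sketch below also decodes a quality string into scores; the ASCII offset of 33 is an assumption (the common "Phred+33" encoding) and would need to be changed for data written with a different offset. All function names are illustrative.

```python
import math
from typing import List


def phred_from_error_probability(p: float) -> int:
    """Phred quality score, rounded to the nearest integer."""
    if not 0.0 < p <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return round(-10.0 * math.log10(p))


def error_probability_from_phred(q: float) -> float:
    """Inverse relation: p = 10^(-q/10)."""
    return 10.0 ** (-q / 10.0)


def decode_quality_string(quality: str, offset: int = 33) -> List[int]:
    """Convert ASCII quality characters to Phred scores (offset is an assumption)."""
    return [ord(character) - offset for character in quality]


print(phred_from_error_probability(0.01))    # 20
print(phred_from_error_probability(0.001))   # 30
print(decode_quality_string("#4=DF"))        # [2, 19, 28, 35, 37]
```

Under this offset, the leading '#' seen in the example reads corresponds to q = 2, a very low-confidence call, which fits the 'N' at the first base of each sequence.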

4.2 Alignment/mapping to the genome: TopHat2

Kim et al., Genome Biology (2013)

http://ccb.jhu.edu/software/tophat/downloads/





4.2 Alignment/mapping: Visualization with e.g. IGV

Sox17 gene, mm10 genome

SNPs

https://www.broadinstitute.org/software/igv/download

4.3 Read counting (HTSeq)

Anders et al., Bioinformatics (2014)

http://www-huber.embl.de/users/anders/HTSeq/doc/count.html



4.4 Differential expression (DE) analysis in R

Czemmel et al., PLoS One (2014): using the DESeq package, Anders and Huber, Genome Biology (2010)

http://www.statsci.org/smyth/pubs/edgeRChapterPreprint.pdf: using the edgeR package, Robinson et al., Bioinformatics (2010)

Broderick et al., BMC Plant Biol. (2014): using the DESeq2 package, Love et al., Genome Biology (2014)

The choice of statistical test depends on the experimental design at hand, see e.g. Luo et al., Genome Biology (2014)

Counting (RNA-Seq) and coverage (WGS)

RNA-Seq: 13 read counts for Gene A

WGS: 5x coverage at position X in Gene A, 2x coverage at position Y in Gene A

[Figure: reads aligned to Gene A, with positions X and Y marked]
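The difference between a read count (one number per gene) and coverage (one number per position) can be illustrated with a toy example. The read intervals and gene coordinates below are invented, and the counting rule (any overlap with the gene counts) is a simplification; dedicated tools apply stricter rules for ambiguous or partially overlapping reads.

```python
from typing import List, Tuple

Read = Tuple[int, int]  # (start, end) on the reference, end exclusive


def count_reads_in_gene(reads: List[Read], gene_start: int, gene_end: int) -> int:
    """RNA-Seq style: number of reads overlapping the gene interval."""
    return sum(1 for start, end in reads if start < gene_end and end > gene_start)


def coverage_at(reads: List[Read], position: int) -> int:
    """WGS style: number of reads covering a single position."""
    return sum(1 for start, end in reads if start <= position < end)


# Invented alignments around a hypothetical gene spanning positions 100 to 200
reads = [(90, 140), (100, 150), (120, 170), (130, 180), (135, 185), (160, 210), (300, 350)]
gene_start, gene_end = 100, 200

gene_count = count_reads_in_gene(reads, gene_start, gene_end)
coverage_x = coverage_at(reads, 138)
coverage_y = coverage_at(reads, 182)
print(f"read count for the gene: {gene_count}")
print(f"coverage at X=138: {coverage_x}x, at Y=182: {coverage_y}x")
```

The same set of reads yields a single count for the gene but different coverage values at different positions, which is why the two quantities drive experimental design differently.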

Experimental design based on coverage/counting

WGS project: the aim is to sequence the whole human genome of three patients at 50x coverage on a HiSeq2500 High Output with 100 bp paired-end (PE) reads. How many lanes/flow cells are needed?

http://www.illumina.com/documents/products/technotes/technote_coverage_calculation.pdf

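Human genome: ~3.9 Gb = 3900 Mb
HiSeq2500 High Output: >= ~180e+06 PE reads per lane
bp per lane = 180e+06 * 200 (2 x 100 bp per read pair) = 3.6e+10 bp
Coverage per lane per multiplexed genome = 3.6e+10 / (3900e+06 * 3) = ~3x
Lanes needed = 50x / 3x = ~16.7, i.e. >= 16 lanes = 2 flow cells (with 8 lanes each); rounding up strictly gives 17 lanes at the minimum yield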

RNASeq project: aim is to sequence the transcriptome of three Arabidopsis plants treated with reagent A and three others with control reagent B on a HiSeq2500 Rapid Run v2 with 50bp single end reads (SE). To reach good scientific standard 20M reads per sample are needed. How many lanes/flow cells are needed?

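Arabidopsis thaliana (Ath) genome: ~126 Mb
HiSeq2500 Rapid Run v2 output: >= ~150e+06 SE reads per lane
Reads per sample on each lane = 150e+06 / 6 = 25M reads per sample
25M >= 20M, so all six samples fit on a single lane

Multiplexing: 3 samples per lane (WGS project), 6 samples per lane (RNA-Seq project)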
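Both planning calculations can be reproduced in a few lines, which makes it easy to rerun them with other genome sizes or yields. The sketch uses the figures given for the two projects; the function names are illustrative. Note that strict rounding up gives 17 lanes for the WGS project at the minimum yield of 180e+06 reads per lane, close to the estimate of about 16 lanes (2 flow cells) obtained with the rounded 3x per lane.

```python
import math


def bases_per_lane(reads_per_lane: int, read_length_bp: int, paired_end: bool) -> int:
    """Total sequenced bases per lane; a paired-end fragment yields two reads."""
    return reads_per_lane * read_length_bp * (2 if paired_end else 1)


def coverage_per_lane(lane_bases: int, genome_size_bp: int, samples_per_lane: int) -> float:
    """Average coverage each multiplexed genome receives from one lane."""
    return lane_bases / (genome_size_bp * samples_per_lane)


def lanes_needed(target_coverage: float, per_lane_coverage: float) -> int:
    """Smallest whole number of lanes that reaches the target coverage."""
    return math.ceil(target_coverage / per_lane_coverage)


# WGS project: three genomes of ~3.9 Gb, 50x target, 100 bp paired-end reads
wgs_bases = bases_per_lane(180_000_000, 100, paired_end=True)
wgs_coverage = coverage_per_lane(wgs_bases, 3_900_000_000, samples_per_lane=3)
wgs_lanes = lanes_needed(50, wgs_coverage)

# RNA-Seq project: six samples on one lane, 20M reads per sample required
reads_per_sample = 150_000_000 / 6
enough_reads = reads_per_sample >= 20_000_000

print(f"WGS: {wgs_coverage:.2f}x per lane and genome, {wgs_lanes} lanes")
print(f"RNA-Seq: {reads_per_sample / 1e6:.0f}M reads per sample, sufficient: {enough_reads}")
```

For RNA-Seq the genome size does not enter the calculation at all: the design is driven by the number of reads per sample, not by coverage.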

Applications of NGS technologies

Instruments: HiSeq2500, MiSeq

Genomics
• Whole genome (re-)sequencing (WGS)
• De novo sequencing
• Targeted (re-)sequencing, e.g. whole exome sequencing (WES)

Transcriptomics
• (Total) RNA-Seq
• mRNA-Seq
• Targeted RNA-Seq
• Small non-coding RNA sequencing (miRNAs …)
• Ribosome profiling

Epigenomics
• Chromatin immunoprecipitation sequencing (ChIP-Seq)
• Bisulfite sequencing (DNA methylation)

For many more see: http://www.illumina.com/content/dam/illumina-marketing/documents/products/research_reviews/sequencing-methods-review.pdf

Transcriptomics: Most widely used technology before NGS: microarrays

Top left: http://www.mun.ca/biology/scarr/cDNA_microarray_Assay_of_Gene_Expression.html

Genomics: De Novo Sequencing

www.illumina.com


www.illumina.com


Genomics: Targeted Seq e.g. Whole exome Seq (WES)

Epigenomics

http://www.illumina.com/applications/epigenetics.html


Applications

Jay Shendure & Hanlee Ji, Nature Biotechnology 26, 1135 - 1145 (2008)


Applications of NGS technologies

Rabbani et al., Journal of Human Genetics (2014)


Applications of WES in cancer research

Applications of de novo sequencing: De novo assembly of the Haitian cholera outbreak strain

Bashir et al., Nature (2012)

Applications of ChIP-Seq: Identification of APETALA2 TF binding sites in the Arabidopsis genome

Yant et al., The Plant Cell (2010)

Combination of WGS and sequence-capture bisulfite sequencing: Identification of genetic perturbations of the maize methylome

Li et al., The Plant Cell (2014)

Application of small non-coding RNA-Seq: Identification of novel miRNA biomarkers of muscle disease

Guess et al., PLoS One (2015)

Summary: Highlighted advantages/disadvantages of NGS technologies

Method: Sanger
Advantages: lowest error rate; long read length (~750 bp); low costs for small studies
Disadvantages: high cost per base and for large studies; long time to generate data; need for cloning; low amount of data per run

Method: Illumina
Advantages: low error rate; lowest cost per base; high output yield can support de novo sequencing approaches performed with PacBio
Disadvantages: shorter read length than e.g. PacBio; high startup costs; de novo assembly difficult

More details: http://www.molecularecologist.com/next-gen-fieldguide-2014/

Summary: Highlighted advantages/disadvantages of NGS technologies continued

Method: Ion Torrent
Advantages: low startup costs; medium/low cost per base; low error rate; fast runs
Disadvantages: costs higher than e.g. Illumina; read length between Illumina and PacBio; higher error rate than Illumina

Method: PacBio
Advantages: single molecule as template; long reads, often used in conjunction with Illumina for de novo sequencing approaches
Disadvantages: still high error rate; low total no. of reads; medium/high costs per base; high startup costs

Method: Oxford Nanopore
Advantages: MinION is a USB device; extremely low cost; extremely long reads feasible
Disadvantages: unknown error rate

More details: http://www.molecularecologist.com/next-gen-fieldguide-2014/

Outlook

Will NGS technologies be translated into clinical diagnostics soon?

NGS technologies are promising for molecular diagnostics, where high sensitivity and specificity are required, because they:

+ are able to provide single-nucleotide resolution
+ constantly improve, with more simplified and automated sample preparation

However:

- the per-base-position error rate (0.5–2%) is still too high for most diagnostic tools
- the combination of various errors and variability arising from DNA fragmentation, sequencing library preparation, sequencing-by-synthesis and short read alignment/assembly could incur a significant false-positive rate

Su et al., Expert Rev Mol Diagn. 2011;11(3):333-343.
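To get a feeling for what a per-base error rate of 0.5–2% means for a single read, the short sketch below computes the expected number of erroneous bases and the chance that a read contains no error at all. The read length of 100 bp is a hypothetical choice, and the calculation assumes independent errors with the same probability at every position, which real data only approximate.

```python
def expected_errors(read_length_bp: int, per_base_error: float) -> float:
    """Expected number of erroneous base calls in one read."""
    return read_length_bp * per_base_error


def prob_error_free(read_length_bp: int, per_base_error: float) -> float:
    """Probability that every base in the read is called correctly."""
    return (1.0 - per_base_error) ** read_length_bp


for rate in (0.005, 0.02):
    errors = expected_errors(100, rate)
    clean = prob_error_free(100, rate)
    print(f"error rate {rate:.1%}: {errors:.1f} expected errors, {clean:.0%} of reads error-free")
```

Under these assumptions roughly 61% of 100 bp reads are error-free at 0.5% and only about 13% at 2%, which illustrates why diagnostic calls rely on many overlapping reads per position rather than on single reads.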

Contact:

Quantitative Biology Center (QBiC)
Auf der Morgenstelle 10
72076 Tübingen · Germany

[email protected]

Thanks for listening – See you next week

Date post:	07-Aug-2015
Category:	Education
Upload:	qbictue