An Introduction to Sequencing Informatics
Daniel P. Depledge
Assistant Professor, Department of Medicine
Lecture 8: Sequencing Technologies, April 28, 3pm
Lecture 9: File Formats, April 30, 3pm
Lecture 10: Linux 1, May 5, 3pm
Lecture 11: Linux 2, May 7, 3pm
Lecture 12: Processing & quality control, May 12, 3pm
Lecture 13: Downloading data, QC & Trimming, May 14, 3pm
Lecture 14: Alignment & Visualization 1, May 19, 3pm
Lecture 15: Alignment & Visualization 2, May 21, 3pm
Lecture 16: RNA sequencing, May 26, 3pm
Lecture 17: Aligning RNA-Seq data & Generating Counts, May 28, 3pm
Lecture 18: Gene expression analysis 1, June 2, 3pm
Lecture 19: Gene expression analysis 2, June 4, 3pm
Course overview
Daniel P. Depledge
Angelina Volkova
Lizabeth Katsnelson
Anna Yeaton
Igor Dolgalev
MacIntosh Cornwell
Applied Sequencing Informatics (2021)
• Expanded version of current course
• Lectures and practicums (50:50 split)
• Advanced sequencing analyses using both short- and long-read data
• Prerequisites: experience working in HPC environments (i.e. Big Purple) + experience in R-based environments
Advanced Integrative Omics (2021)
• Small class size
• Single project to which all class members contribute
• Most work undertaken in small groups (2-4 individuals) that are regularly remixed
• Student-led (no formal lectures)
• Aims to produce a publication including all class members as co-authors
• Prerequisites: significant experience working in HPC environments (i.e. Big Purple) + R-based environments + demonstrated prior experience working with diverse sequencing data types
Advanced course offerings
High-throughput sequencing (HTS) is fundamentally changing how we approach science
• HTS is a readout for many different types of laboratory experiments
• Clinical and basic science investigators from all areas of biology can make use of this technology
• Many (most?) are completely naïve about bioinformatics
• Decreasing sequencing costs = increasing use for routine assays + technical innovation + novel applications
Sequencing informatics is a bottleneck!
• Sequencing is a commodity – easy to outsource
• Sequencing informatics is the essential point of the science
• Data analysis and discovery of meaning in raw results
• Increasing data throughput = increasing time and cost of analysis
Setting the scene
The beauty and challenge of sequencing informatics
Many are the ways to shave a cat
(there is no right way to analyze sequencing data but there are many wrong ways)
• Rapid turnover in technology platforms
• New file formats, new data types
• Different “standards” from different vendors
• Rapid evolution of new sequence approaches & associated analyses
• Constant rapid ‘release’ of methods as ‘software’ via unsupported open source distribution
• Increasingly large data sizes (both experimental and reference)
Staying in the game…
The history and future of nucleic acid sequencing
An abridged history of sequencing
Why we sequence
Illumina (short read) sequencing
PacBio (long read) sequencing
Nanopore (long read) sequencing
The long and short of it: Experimental design
Overview
An abridged history of sequencing
[Timeline figure. Years marked: 1865, 1869, 1953, 1965, 1970, 1972, 1973, 1975, 1977, 1984, 1986, 1995, 1996, 1998, 1999, 2000, 2001, 2002, 2005, 2007, 2008, 2009, 2011, 2012, 2017. Events shown:]
• Gregor Mendel figures out the fundamental principles of heredity
• DNA isolated in the form of chromatin (Friedrich Miescher)
• The structure of DNA (double helix) published
• First tRNA sequenced
• First use of primer extension to read a short sequence of DNA (Ray Wu)
• First gene sequenced (MS2 virus protein) (Walter Fiers)
• DNA sequencing through chemical cleavage (Walter Gilbert & Allan Maxam)
• Frederick Sanger introduces “plus and minus” method
• Frederick Sanger establishes dideoxy sequencing + sequences first genome (PhiX174)
• Fritz Pohl develops non-radioactive sequencing platform
• First ABI semi-automated sequencing platform released
• First bacterial genome sequenced (H. influenzae)
• Sequence-by-synthesis (pyrosequencing) introduced by Mostafa Ronaghi
• First yeast genome sequenced (S. cerevisiae)
• ABI release first commercial capillary sequencing platform
• C. elegans genome sequenced
• A. thaliana and D. melanogaster genomes sequenced
• First draft of human genome published
• Mouse genome sequenced
• The 454 high throughput pyrosequencing system becomes the first NGS sequencer to come on the market
• Illumina (Solexa) sequencing is launched
• Whole genome of a cancer sequenced for the first time
• The era of NGS informatics and personalized medicine begins…
• The first third generation sequencer is launched, utilizing single-molecule fluorescent technology
• Pacific Biosciences launches single molecule real time technology (PacBio RS)
• Oxford Nanopore Technologies launches nanopore sequencing
1975, Frederick Sanger & the birth of DNA sequencing
13 August 1918 – 19 November 2013
2 x Nobel prizes
Sequencing via the “plus & minus” method
1977, The Sanger sequencing method - “dideoxy” chain-termination
Why we sequence?
The rise of high-throughput sequencing
short-read sequencing
The principle of generating a short-read sequencing library
1. Capture DNA or RNA of interest
• cDNA must be synthesized from RNA
2. Fragment DNA/cDNA to produce fragments of 150-300 nt
• Acoustic sonication (random shearing) is favoured
• Alternative strategies include use of transposases or targeted ligation
3. Repair ends and ligate adapter sequences
4. PCR amplification to enrich for fragments with correct ligation
• PCR primes off sequences in the adapters
5. Sequence
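Steps 2 and 3 above can be pictured with a small simulation. This is a minimal sketch and not a model of any real kit: the adapter sequences, the function names, and the choice to cut the template into consecutive pieces are all assumptions made for illustration. It shears a random template into fragments of 150-300 nt and attaches an adapter to each end.

```python
import random

# Hypothetical adapter sequences, invented for this illustration only.
ADAPTER_LEFT = "ACACGACGCT"
ADAPTER_RIGHT = "AGATCGGAAG"


def shear(sequence, min_len=150, max_len=300, rng=None):
    """Cut a sequence into consecutive fragments of random length.

    A trailing piece shorter than min_len is dropped, loosely mimicking
    the loss of short fragments during size selection.
    """
    rng = rng or random.Random()
    fragments = []
    position = 0
    while len(sequence) - position >= min_len:
        length = rng.randint(min_len, max_len)
        fragments.append(sequence[position:position + length])
        position += len(fragments[-1])
    return fragments


def ligate_adapters(fragment):
    """Attach one adapter to each end of a fragment."""
    return ADAPTER_LEFT + fragment + ADAPTER_RIGHT


def build_library(sequence, seed=0):
    """Shear a template and ligate adapters to every fragment."""
    rng = random.Random(seed)
    return [ligate_adapters(fragment) for fragment in shear(sequence, rng=rng)]


# A random 5,000 nt template standing in for captured DNA/cDNA.
template_rng = random.Random(1)
template = "".join(template_rng.choice("ACGT") for _ in range(5000))
library = build_library(template)
print(len(library), "library molecules")
```

Because every molecule carries the same adapter sequences at its ends, a single primer pair can amplify the whole library in step 4.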
Illumina: sequencing by synthesis – the workhorse
https://youtu.be/fCd6B5HRaZ8
The incredible versatility of Illumina sequencing
• Hundreds of distinct Illumina-based methods for DNA & RNA sequencing at global (bulk) or single-cell level
• Almost all of these methods require tweaks and special considerations when performing informatics analyses
https://www.illumina.com/science/sequencing-method-explorer.html
Illumina sequencing platforms
[Figure: Illumina platforms, with outputs of 8 Gb, 15 Gb, 120 Gb, 1 Tb, 2 Tb, and 6 Tb]
[Dozens, hundreds, even thousands of samples in one lane]
The power of pooling samples
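How many samples fit in one lane comes down to simple arithmetic: lane output divided by the data needed per sample. The sketch below assumes a lane yielding 120 Gb and paired-end 150 nt reads. Both figures, and the function name `samples_per_lane`, are illustrative assumptions, not specifications of any platform.

```python
def samples_per_lane(lane_output_gb, fragments_per_sample, read_length_nt,
                     paired_end=True):
    """Estimate how many samples can share one lane.

    lane_output_gb: assumed usable output of the lane in gigabases.
    fragments_per_sample: sequenced fragments (read pairs if paired_end)
        wanted for each sample.
    read_length_nt: length of each read in nucleotides.
    """
    reads_per_fragment = 2 if paired_end else 1
    bases_per_sample = fragments_per_sample * read_length_nt * reads_per_fragment
    return int(lane_output_gb * 1_000_000_000) // bases_per_sample


# Bulk RNA-Seq style requirement: 20 million read pairs per sample.
print(samples_per_lane(120, 20_000_000, 150))

# Small amplicon panel: 100,000 read pairs per sample.
print(samples_per_lane(120, 100_000, 150))
```

In practice the ceiling is also set by the number of distinct barcodes available and by how evenly the pooled libraries are balanced.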
long-read sequencing
The applications of long-read sequencing
• De novo assembly of microbial genomes & metagenomics
• De novo assembly of eukaryotic genomes
• Resolution of problematic genomic sequences (i.e. repeat regions)
• Characterization of structural variants & other rearrangements
• Mapping of long insertions and deletions
• Detection and mapping of mobile elements
• Direct detection of methylated bases
• Full length transcripts
                               Short-read (Illumina)    Long-read (PacBio / Nanopore)
Sequencing depth               high                     medium
Read lengths                   short (up to 300 nt)     (extremely) long***
(RNA) Transcript resolution    low                      high
(DNA) de novo assembly         low                      high
Relative cost                  low                      high (decreasing)
Recoding bias                  yes                      yes/no**
Input requirements             pico/nanograms           micrograms*
Versatility                    very high                high
Long read vs. short read sequencing
* can be achieved through amplification of raw material
** no recoding bias in direct RNA sequencing protocol
*** The longest nanopore read to date is 2,272,580 nt
PacBio sequencing – the great pretender
https://youtu.be/_lD8JyAbwEo
Lower throughput, (much) longer reads
• Sequencing takes place on the surface of a flow cell that contains millions of zero-mode waveguide (ZMW) nanostructures
• ZMWs are tiny holes (70-100 nm) in a metal film coated on top of a fused silica surface
• The ZMW nanostructure isolates a single DNA template molecule and a single DNA polymerase, enabling detection of single-molecule fluorescence events
• Fluorescence data is collected as a DNA polymerase enzyme moves along a single DNA template
• Each read has an average error rate of 15%, dominated by insertion/deletion events
• Circular consensus sequencing (CCS) reduces error rates significantly
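Why repeated passes over the same molecule reduce the error rate can be shown with a back-of-the-envelope calculation. This is a deliberately simplified sketch, not the actual CCS algorithm. It assumes errors in each pass are independent, that the consensus at a position is a simple majority vote, and that every erroneous pass makes the same wrong call (a pessimistic assumption). Real reads are dominated by insertions and deletions, which a per-position vote does not capture.

```python
from math import comb


def majority_error(per_pass_error, passes):
    """Probability that more than half of the passes are wrong at one position.

    Assumes independent errors per pass and an odd number of passes,
    so that a strict majority always exists.
    """
    needed = passes // 2 + 1  # smallest number of wrong passes that wins the vote
    return sum(
        comb(passes, k)
        * per_pass_error ** k
        * (1 - per_pass_error) ** (passes - k)
        for k in range(needed, passes + 1)
    )


# Per-pass error rate of 15%, as quoted above.
for passes in (1, 3, 5, 9, 15):
    print(passes, f"{majority_error(0.15, passes):.6f}")
```

Under these assumptions the consensus error falls steeply as the number of passes grows, which is the intuition behind the large reduction in error rates.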
A complete characterization of a breast cancer cell line
Nanopore sequencing – the game changer
https://youtu.be/RcP85JHLmnI
An array of protein nanopores
• Protein nanopore – set in an electrically-resistant polymer membrane
• Array of microscaffolds – each microscaffold supports a membrane and embedded nanopore
• Sensor chip – each microscaffold corresponds to its own electrode that is connected to a channel in the sensor array chip. Sensor arrays may be manufactured with any number of channels.
• Application-Specific Integrated Circuit (ASIC) – each nanopore channel is controlled and measured individually by the bespoke ASIC. This allows for multiple nanopore experiments to be performed in parallel. More than one ASIC may be included in a device.
Threading the pore
Flongles, dongles, plongles, and so much more
[Figure: Nanopore devices, with outputs of 2 Gb, 50 Gb, 250 Gb, 10 Gb]
The nanopore principle - sequence anything, anytime, anywhere
A few thoughts on experimental design
An elegant design saves more than just time
Consider a situation in which we want to identify m6A modified bases in RNA
• Multiple methodologies exist
    • meRIP-Seq / m6A-Seq (Illumina)
    • SMRT sequencing (PacBio)
    • direct RNA sequencing (Nanopore)
• Which should you choose?
    • What are the biological questions that need answering?
    • How many biological/technical replicates are required? How are these being prepared?
    • How many sequence reads are needed? Is the protocol amenable to multiplexing?
    • What resolution is required? Is linkage information important?
    • What budget is available?
• Which software is appropriate to use for analysis?
    • Is it compatible with your local and/or HPC environment?
    • Can you obtain test datasets to confirm it is working correctly?
    • Which parameters need tweaking?
Confounding factors
batch effects during nucleic acid extraction
batch effects during library preparation
https://www.abmgood.com/marketing/knowledge_base/next_generation_sequencing_experimental_design.php
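Batch effects become confounders when the batches line up with the groups being compared. One way to guard against this is to shuffle samples within each condition and then spread them evenly across batches. The sketch below is a minimal illustration of that idea; the function name, the sample labels, and the round-robin scheme are assumptions, not a prescribed protocol.

```python
import random
from collections import defaultdict


def assign_batches(samples, n_batches, seed=0):
    """Spread samples across batches so every batch mixes all conditions.

    samples: list of (sample_id, condition) tuples.
    Returns a dict mapping batch index to a list of sample ids.
    """
    rng = random.Random(seed)
    by_condition = defaultdict(list)
    for sample_id, condition in samples:
        by_condition[condition].append(sample_id)

    batches = {batch: [] for batch in range(n_batches)}
    slot = 0
    for condition in sorted(by_condition):
        ids = by_condition[condition]
        rng.shuffle(ids)  # randomize order within the condition
        for sample_id in ids:
            batches[slot % n_batches].append(sample_id)  # deal round-robin
            slot += 1
    return batches


# Hypothetical experiment: 6 control and 6 treated samples, 3 extraction batches.
samples = [(f"ctrl_{i}", "control") for i in range(6)]
samples += [(f"trt_{i}", "treated") for i in range(6)]
for batch, members in assign_batches(samples, n_batches=3).items():
    print(batch, members)
```

With this layout any batch-to-batch difference affects controls and treated samples alike, so it adds noise between replicates but does not masquerade as a treatment effect.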
Rules to live by
1. Ask to be part of experimental design meetings during early stages of project
2. If using existing data, inspect it very carefully and obtain as much information as possible about the sample preparation procedure
3. Demand that nucleic acid extractions are carried out using the same kit/batch at the same time (where possible) and/or that samples are grouped appropriately (i.e. so batch effects are between replicates, not comparators)
4. Ask for randomization of samples prior to library preparation
5. Whether library prep is automated or performed by hand, ask for a tracking database that shows proximity of samples to each other (aerosol contamination) during sample preparation
6. Use strict de-multiplexing settings (i.e. no mismatches in 6- or 8-base barcodes)
7. Be particularly careful when dealing with low frequency observations
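Rule 6 can be made concrete with a toy demultiplexer. This is a minimal sketch and not the behaviour of any vendor tool: the barcodes, sample names, and the `demultiplex` function are hypothetical. A read is assigned to a sample only if its index sequence matches exactly one barcode within the allowed number of mismatches; everything else is set aside as undetermined.

```python
def hamming(a, b):
    """Number of mismatching positions; unequal lengths never match."""
    if len(a) != len(b):
        return max(len(a), len(b))
    return sum(x != y for x, y in zip(a, b))


def demultiplex(reads, barcodes, max_mismatches=0):
    """Assign reads to samples by their index sequence.

    reads: iterable of (index_sequence, read_sequence) tuples.
    barcodes: dict mapping sample name to barcode.
    max_mismatches: 0 gives the strict setting recommended above.
    """
    assigned = {sample: [] for sample in barcodes}
    assigned["undetermined"] = []
    for index_sequence, read_sequence in reads:
        hits = [
            sample
            for sample, barcode in barcodes.items()
            if hamming(index_sequence, barcode) <= max_mismatches
        ]
        # Ambiguous or unmatched reads are never guessed at.
        target = hits[0] if len(hits) == 1 else "undetermined"
        assigned[target].append(read_sequence)
    return assigned


# Hypothetical 6-base barcodes and three reads, one with an index error.
barcodes = {"sample_A": "ACGTAC", "sample_B": "TGCATG"}
reads = [("ACGTAC", "AAAA"), ("TGCATG", "CCCC"), ("ACGTAA", "GGGG")]

print(demultiplex(reads, barcodes))                    # strict
print(demultiplex(reads, barcodes, max_mismatches=1))  # relaxed
```

The strict setting discards some genuine reads, but it lowers the chance that a read from one sample is credited to another, which matters most for the low frequency observations mentioned in rule 7.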
Coming up (next lecture)
1. Sequence read data formats (FASTA, FASTQ, FAST5)
2. Single vs. paired-end read data structures
3. The sequence alignment map (SAM) format
Thank you for your attention
Questions?