An Introduction to Sequencing Informatics

Daniel P. Depledge

Assistant Professor, Department of Medicine

Lecture 8: Sequencing Technologies, April 28, 3pm

Lecture 9: File Formats, April 30, 3pm

Lecture 10: Linux 1, May 5, 3pm

Lecture 11: Linux 2, May 7, 3pm

Lecture 12: Processing & quality control, May 12, 3pm

Lecture 13: Downloading data, QC & Trimming, May 14, 3pm

Lecture 14: Alignment & Visualization 1, May 19, 3pm

Lecture 15: Alignment & Visualization 2, May 21, 3pm

Lecture 16: RNA sequencing, May 26, 3pm

Lecture 17: Aligning RNA-Seq data & Generating Counts, May 28, 3pm

Lecture 18: Gene expression analysis 1, June 2, 3pm

Lecture 19: Gene expression analysis 2, June 4, 3pm

Course overview

Daniel P. Depledge

Angelina Volkova

Lizabeth Katsnelson

Anna Yeaton

Igor Dolgalev

MacIntosh Cornwell

Applied Sequencing Informatics (2021)

• Expanded version of current course

• Lectures and practicums (50:50 split)

• Advanced sequencing analyses using both short- and long-read data

• Prerequisites: experience working in HPC environments (e.g. Big Purple) + experience in R-based environments

Advanced Integrative Omics (2021)

• Small class size

• Single project to which all class members contribute

• Most work undertaken in small groups (2-4 individuals) that are regularly remixed

• Student-led (no formal lectures)

• Aims to produce publication including all class-members as co-authors

• Prerequisites: significant experience working in HPC environments (e.g. Big Purple) + R-based environments + demonstrated prior experience working with diverse sequencing data types

Advanced course offerings

High-throughput sequencing (HTS) is fundamentally changing how we approach science

• HTS is a readout for many different types of laboratory experiments

• Clinical and basic science investigators from all areas of biology can make use of this technology

• Many (most?) are completely naïve about bioinformatics

• Decreasing sequencing costs = increasing use for routine assays + technical innovation + novel applications

Sequencing informatics is a bottleneck!

• Sequencing is a commodity – easy to outsource

• Sequencing informatics is the essential point of the science

• Data analysis and discovery of meaning in raw results

• Increasing data throughput = increasing time and cost of analysis

Setting the scene

The beauty and challenge of sequencing informatics

Many are the ways to shave a cat

(there is no right way to analyze sequencing data but there are many wrong ways)

• Rapid turnover in technology platforms

• New file formats, new data types

• Different “standards” from different vendors

• Rapid evolution of new sequence approaches & associated analyses

• Constant rapid ‘release’ of methods as ‘software’ via unsupported open source distribution

• Increasingly large data sizes (both experimental and reference)

Staying in the game…

The history and future of nucleic acid sequencing

An abridged history of sequencing

Why we sequence

Illumina (short read) sequencing

PacBio (long read) sequencing

Nanopore (long read) sequencing

The long and short of it: Experimental design

Overview

An abridged history of sequencing

[Timeline figure. Years marked: 1865, 1869, 1953, 1965, 1970, 1972, 1973, 1975, 1977, 1984, 1986, 1995, 1996, 1998, 1999, 2000, 2001, 2002, 2005, 2007, 2008, 2009, 2011, 2012, 2017]

Events shown on the timeline:

• Gregor Mendel figures out the fundamental principles of heredity

• DNA isolated in the form of chromatin (Friedrich Miescher)

• The structure of DNA (double helix) published

• First tRNA sequenced

• First use of primer extension to read a short sequence of DNA (Ray Wu)

• First gene sequenced (MS2 virus protein) (Walter Fiers)

• DNA sequencing through chemical cleavage (Walter Gilbert & Allan Maxam)

• Frederick Sanger introduces the “plus and minus” method

• Frederick Sanger establishes dideoxy sequencing + sequences the first genome (PhiX174)

• Fritz Pohl develops a non-radioactive sequencing platform

• First ABI semi-automated sequencing platform released

• First bacterial genome sequenced (H. influenzae)

• Sequencing-by-synthesis (pyrosequencing) introduced by Mostafa Ronaghi

• First yeast genome sequenced (S. cerevisiae)

• C. elegans genome sequenced

• ABI releases the first commercial capillary sequencing platform

• A. thaliana and D. melanogaster genomes sequenced

• First draft of the human genome published

• Mouse genome sequenced

• The 454 high-throughput pyrosequencing system becomes the first NGS sequencer to come on the market

• Illumina (Solexa) sequencing is launched

• The first third-generation sequencer is launched, utilizing single-molecule fluorescent technology

• Whole genome of a cancer sequenced for the first time

• The era of NGS informatics and personalized medicine begins…

• Pacific Biosciences launches single-molecule real-time technology (PacBio RS)

• Oxford Nanopore Technologies launches nanopore sequencing

1975, Frederick Sanger & the birth of DNA sequencing

13 August 1918 – 19 November 2013

2 x Nobel prizes

Sequencing via the “plus & minus” method

1977, The Sanger sequencing method - “dideoxy” chain-termination

Why we sequence?

The rise of high-throughput sequencing

short-read sequencing

The principle of generating a short-read sequencing library

1. Capture DNA or RNA of interest

• cDNA must be synthesized from RNA

2. Fragment DNA/cDNA to produce fragments of 150-300 nt

• Acoustic sonication (random shearing) is favoured

• Alternative strategies include use of transposases or targeted ligation

3. Repair ends and ligate adapter sequences

4. PCR amplification to enrich for fragments with correct ligation

• PCR primes off sequences in the adapters

5. Sequence

Illumina: sequencing by synthesis – the workhorse

https://youtu.be/fCd6B5HRaZ8

The incredible versatility of Illumina sequencing

• Hundreds of distinct Illumina-based methods for DNA & RNA sequencing at global (bulk) or single-cell level

• Almost all of these methods require tweaks and special considerations when performing informatics analyses

https://www.illumina.com/science/sequencing-method-explorer.html

Illumina sequencing platforms

8 Gb 15 Gb 120 Gb 1 Tb 6 Tb 2 Tb

[Dozens, hundreds, even thousands of samples in one lane]

The power of pooling samples

long-read sequencing

The applications of long-read sequencing

• De novo assembly of microbial genomes & metagenomics

• De novo assembly of eukaryotic genomes

• Resolution of problematic genomic sequences (e.g. repeat regions)

• Characterization of structural variants & other rearrangements

• Mapping of long insertions and deletions

• Detection and mapping of mobile elements

• Direct detection of methylated bases

• Full length transcripts

Long read vs. short read sequencing

                              Short-read (Illumina)    Long-read (PacBio / Nanopore)
Sequencing depth              high                     medium
Read lengths                  short (up to 300 nt)     (extremely) long***
(RNA) Transcript resolution   low                      high
(DNA) de novo assembly        low                      high
Relative cost                 low                      high (decreasing)
Recoding bias                 yes                      yes/no**
Input requirements            pico/nanograms           micrograms*
Versatility                   very high                high

* can be achieved through amplification of raw material

** no recoding bias in direct RNA sequencing protocol

*** The longest nanopore read to date is 2,272,580 nt

PacBio sequencing – the great pretender

https://youtu.be/_lD8JyAbwEo

Lower throughput, (much) longer reads

• Sequencing takes place on the surface of a flow cell that contains millions of zero-mode waveguide (ZMW) nanostructure arrays

• ZMWs are tiny holes (70-100 nm) in a metal film coated on top of a fused silica surface

• The ZMW nanostructure isolates a single DNA template molecule and a single DNA polymerase, enabling detection of single-molecule fluorescence events

• Fluorescence data is collected as a DNA polymerase enzyme moves along a single DNA template

• Each read has an average error rate of 15%, dominated by insertion/deletion events

• Circular consensus sequencing (CCS) reduces error rates significantly
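Why repeated passes over the same molecule help can be sketched with a simple binomial calculation. This is a toy model, not PacBio's consensus algorithm: it assumes errors in each pass are independent and that the consensus is a per-base majority vote, and it ignores the alignment step that real indel-dominated reads require. The function name is hypothetical; the 15% figure is the per-read error rate quoted above.

```python
from math import comb


def majority_wrong_probability(per_pass_error: float, passes: int) -> float:
    """Chance that more than half of `passes` independent calls of one base
    are wrong, i.e. an upper bound on the majority-vote consensus error."""
    if passes < 1 or passes % 2 == 0:
        raise ValueError("use an odd, positive number of passes to avoid ties")
    needed = passes // 2 + 1  # wrong calls required to outvote the correct ones
    return sum(
        comb(passes, k) * per_pass_error**k * (1 - per_pass_error) ** (passes - k)
        for k in range(needed, passes + 1)
    )


if __name__ == "__main__":
    # Assumed per-pass error of 15%, with an increasing number of passes
    for n in (1, 3, 5, 9):
        print(n, f"{majority_wrong_probability(0.15, n):.4f}")
```

Under these assumptions the bound falls quickly as passes are added, which is the intuition behind consensus reads being far more accurate than any single pass.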

A complete characterization of a breast cancer cell line

Nanopore sequencing – the game changer

https://youtu.be/RcP85JHLmnI

An array of protein nanopores

Protein nanopore – set in an electrically resistant polymer membrane.

Array of microscaffolds – each microscaffold supports a membrane and embedded nanopore.

Sensor chip – each microscaffold corresponds to its own electrode that is connected to a channel in the sensor array chip. Sensor arrays may be manufactured with any number of channels.

Application-Specific Integrated Circuit (ASIC) – each nanopore channel is controlled and measured individually by the bespoke ASIC. This allows for multiple nanopore experiments to be performed in parallel. More than one ASIC may be included in a device.

Threading the pore

Flongles, dongles, plongles, and so much more

2 Gb 50 Gb 250 Gb 10 Gb

The nanopore principle - sequence anything, anytime, anywhere

A few thoughts on experimental design

An elegant design saves more than just time

Consider a situation in which we want to identify m6A modified bases in RNA

• Multiple methodologies exist

• meRIP-Seq / m6A-Seq (Illumina)

• SMRT sequencing (PacBio)

• direct RNA sequencing (Nanopore)

• Which should you choose?

• What are the biological questions that need answering?

• How many biological/technical replicates are required? How are these being prepared?

• How many sequence reads are needed? Is the protocol amenable to multiplexing?

• What resolution is required? Is linkage information important?

• What budget is available?

• Which software is appropriate to use for analysis?

• Is it compatible with your local and/or HPC environment?

• Can you obtain test datasets to confirm it is working correctly?

• Which parameters need tweaking?
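For the "how many reads" and multiplexing questions, a back-of-the-envelope estimate is often a useful starting point. The sketch below uses the standard relation coverage = (number of reads × read length) / genome size. It suits DNA resequencing; expression or modification assays such as m6A profiling are usually planned around reads per sample instead. The function names and the example numbers (a 5 Mb genome, 50x coverage, 150 nt reads, a 15 Gb run) are illustrative assumptions.

```python
import math


def reads_needed(genome_size_nt: int, target_coverage: float, read_length_nt: int) -> int:
    """Number of reads required to reach the target average coverage."""
    return math.ceil(target_coverage * genome_size_nt / read_length_nt)


def samples_per_run(run_output_nt: int, genome_size_nt: int, target_coverage: float) -> int:
    """How many equally pooled samples fit in one run at the target coverage."""
    return int(run_output_nt // (target_coverage * genome_size_nt))


if __name__ == "__main__":
    genome = 5_000_000          # assumed 5 Mb bacterial genome
    coverage = 50               # assumed target average depth
    print(reads_needed(genome, coverage, 150))
    print(samples_per_run(15_000_000_000, genome, coverage))
```

These are idealized figures: they ignore duplicates, adapter and low-quality reads, and uneven pooling, so real designs normally add a safety margin.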

Confounding factors

batch effects during nucleic acid extraction

batch effects during library preparation

https://www.abmgood.com/marketing/knowledge_base/next_generation_sequencing_experimental_design.php
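One way to keep batch effects from lining up with the comparison of interest is to spread every condition across every extraction or library-prep batch. The sketch below is a minimal randomized block assignment. The function name, the sample tuple layout, and the example sample IDs are hypothetical.

```python
import random
from collections import defaultdict


def assign_batches(samples, n_batches, seed=0):
    """Assign samples to batches so each condition is spread evenly.

    samples: list of (sample_id, condition) tuples.
    Returns a dict mapping sample_id to a batch index (0 .. n_batches - 1).
    """
    rng = random.Random(seed)  # fixed seed keeps the layout reproducible
    by_condition = defaultdict(list)
    for sample_id, condition in samples:
        by_condition[condition].append(sample_id)

    assignment = {}
    for condition in sorted(by_condition):
        ids = by_condition[condition]
        rng.shuffle(ids)                  # random order within the condition
        offset = rng.randrange(n_batches) # random starting batch
        for i, sample_id in enumerate(ids):
            assignment[sample_id] = (offset + i) % n_batches
    return assignment


if __name__ == "__main__":
    samples = [(f"ctrl_{i}", "control") for i in range(3)]
    samples += [(f"trt_{i}", "treated") for i in range(3)]
    for sample_id, batch in sorted(assign_batches(samples, n_batches=3).items()):
        print(sample_id, "-> batch", batch)
```

With three controls, three treated samples and three batches, every batch receives one sample of each condition, so any batch effect falls between replicates rather than between comparators.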

Rules to live by

1. Ask to be part of experimental design meetings during early stages of project

2. If using existing data, inspect it very carefully and obtain as much information as possible about the sample preparation procedure

3. Demand that nucleic acid extractions are carried out using the same kit/batch at the same time (where possible) and/or that samples are grouped appropriately (i.e. so batch effects are between replicates, not comparators)

4. Ask for randomization of samples prior to library preparation

5. Whether library prep is automated or performed by hand, ask for a tracking database that shows proximity of samples to each other (aerosol contamination) during sample preparation

6. Use strict de-multiplexing settings (i.e. no mismatches in 6- or 8-base barcodes)

7. Be particularly careful when dealing with low frequency observations
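Rule 6 can be illustrated with a small barcode-matching sketch. It is not the logic of any particular demultiplexing tool. It assigns an index read to a sample only when exactly one known barcode lies within the allowed number of mismatches, and the default of zero mismatches is the strict setting. Function names and barcodes are made up for illustration.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("barcodes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def assign_sample(index_read: str, sample_barcodes: dict, max_mismatches: int = 0):
    """Return the sample for an index read, or None if unmatched or ambiguous."""
    hits = [
        sample
        for barcode, sample in sample_barcodes.items()
        if len(barcode) == len(index_read)
        and hamming(barcode, index_read) <= max_mismatches
    ]
    return hits[0] if len(hits) == 1 else None


if __name__ == "__main__":
    barcodes = {"ACGTAC": "sample_1", "TTGCAA": "sample_2"}  # hypothetical 6-base barcodes
    print(assign_sample("ACGTAC", barcodes))                     # exact match
    print(assign_sample("ACGTAA", barcodes))                     # strict: unassigned
    print(assign_sample("ACGTAA", barcodes, max_mismatches=1))   # relaxed: sample_1
```

Strict matching discards some genuine reads, but it lowers the chance that a sequencing error in the barcode moves a read into the wrong sample, which matters most when chasing low-frequency observations.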

Coming up (next lecture)

1. Sequence read data formats (FASTA, FASTQ, FAST5)

2. Single vs. paired-end read data structures

3. The Sequence Alignment/Map (SAM) format

Thank you for your attention

Questions?
