RNA-Seq analysis workshop

Zhangjun Fei

Outline

• Background of RNA-Seq

• Application of RNA-Seq (what RNA-Seq can do?)

• Available sequencing platforms and strategies and which one to choose

• RNA-Seq data analysis

Read processing and quality assessment

De novo assembly

Alignment to reference genome/transcriptome

Differentially expressed gene identification

Milestones of transcriptome analysis

Year Milestone

1965 Sequence of the first RNA molecule determined

1977 Development of the Northern blot technique and the Sanger sequencing method

1989 Reports of RT-PCR experiments for transcriptome analysis

1991 First high-throughput EST sequencing study

1992 Introduction of Differential Display for the discovery of differentially expressed genes

1995 Reports of the microarray and Serial Analysis of Gene Expression (SAGE) methods

1996 Suppression subtractive hybridization reported

2005 First next-generation sequencing technology (Roche/454) introduced to the market

2006 First transcriptome sequencing studies using a next-generation technology (Roche/454)

New sequencing technologies

Next generation sequencing

• Illumina (HiSeq, NovaSeq)

• Roche/454

• Ion Torrent (Ion Proton)

• ABI/SOLiD

• Helicos

Third generation sequencing

• Pacific Biosciences

• Oxford Nanopore

• Complete Genomics

Desktop sequencer

• Ion Torrent PGM

• Illumina MiSeq

• 454 GS Junior

RNA-Seq applications

• Accelerating gene discovery and gene family expansion

• Improving genome annotation – identifying novel genes and gene models

• Identifying tissue/condition specific alternative splicing events

Alternative splicing

• Short reads can’t provide the complete structure of an isoform

PacBio long reads

SEQUEL SYSTEM

PacBio long reads – error correction

Each sample needs four libraries with different insert sizes: 1-2K, 2-3K, 3-5K, >5K

SNP and SSR marker identification – facilitating breeding

SNP discovery in RNA-Seq is more challenging than in DNA:

• Varying levels of coverage depth

• False discovery around splicing junctions due to incorrect mapping

Phylogenetic relationship, population structure

Xu et al. (2017) Draft genome of spinach Spinacia oleracea and transcriptome diversity of 120 Spinacia accessions. Nature Communications

Selective sweep (Xu et al., 2017)

Expression QTL (eQTL) network

A melon RIL population (Nurit Katzir, unpublished)

Mutant gene cloning (BSA RNA-Seq)

[Figure: white fruit x yellow fruit → F1 → F2 → F3 → yellow pool and white pool → RNA-Seq → SNPs and DE genes; 132 of 189 SNPs in this region (scale in kb)]

Feder et al. (2015) A Kelch domain-containing F-box coding gene negatively regulates flavonoid accumulation in Cucumis melo L. Plant Physiol 169:1714-1726

GWAS

Distribution of mapped markers associating with the erucic acid trait

Genomic imprinting and allele specific expression

Regulatory mode of gene expression in F1 hybrids

Provided by Nabil Elrouby

The interspecific hybrid ‘Shintosa’: C. maxima, Rimu ♀ x C. moschata, Rifu ♂

Tissues: fruit, root, leaf, stem

62-80% trans, 13-24% cis

[Figure: expression in Cma, F1 and Cmo. Panels: defense response (GO:0006952), root; response to heat (GO:0009408), root; carotenoid biosynthesis (GO:0016117), fruit; photosynthesis (GO:0015979), fruit, leaf, stem and root]

Genes exhibiting dominant and transgressive expression patterns in ‘Shintosa’ are enriched with those involved in defense response, response to heat, carotenoid biosynthesis and photosynthesis

Non-coding RNAs (lncRNA, lincRNAs…)

Gene fusion

Gene expression profiling

RNA-Seq vs microarray

Problems of microarrays

• Cross-hybridization

• Stable probe secondary structures

• High background (e.g., nonspecific hybridization)

• Limited dynamic range (e.g., nonlinear and saturable hybridization kinetics)

RNA-Seq (digital expression analysis)

• Allows direct enumeration of transcript molecules

• Digital expression data are absolute, so data can be directly compared across different experiments and laboratories without the need for extensive internal controls or other experimental manipulation

• Provides an open system that allows detection of previously uncharacterized transcripts, as well as rare transcripts

RNA-Seq applications

Summary

Accelerating gene discovery and gene family expansion

Improving genome annotation – identifying novel genes and gene models

Identifying tissue/condition specific alternative splicing events

SNP and SSR marker identification

Phylogenetic relationship, population structure, selective sweep

Expression QTL analysis

Mutant gene cloning (BSA RNA-Seq)

Genome (Transcriptome)-wide association study

Genomic imprinting and allele specific expression analysis

Identifying non-coding RNAs (lncRNA, lincRNAs…)

Identifying gene fusion events

Gene expression profiling analysis

Sequencing platforms and strategies

Sequencing platforms

Next generation sequencing

• Illumina (HiSeq, NovaSeq)

• Ion Torrent (Ion Proton)

• ABI/SOLiD

• Roche/454

• Helicos

Third generation sequencing

• Pacific Biosciences

• Oxford Nanopore

• Complete Genomics

Desktop sequencer

• Ion Torrent PGM

• Illumina MiSeq

• Illumina NextSeq

• 454 GS Junior

http://www.biotech.cornell.edu/brc/genomics/services/price-list

Illumina HiSeq 2000/2500

High-output mode (200-300M reads/read pairs per lane)

Single-end, 50, 100 bp

Paired-end, 2 x 125 bp

Run time: 2-11 days

Rapid run mode (150-200M reads/read pairs per lane)

Single-end, 50, 100, 150 bp

Paired-end, 2 x 100 bp

Paired-end, 2 x 150 bp

Paired-end, 2 x 200 bp

Paired-end, 2 x 250 bp

Run time: 7-40 hours

Illumina MiSeq

50 bp sequencing kit

150 bp sequencing kit (e.g. 2 x 75 bp)

300 bp sequencing kit (e.g. 2 x 150 bp)

500 bp sequencing kit (e.g. 2 x 250 bp)

600 bp sequencing kit (e.g. 2 x 300 bp)

Run time: 5-65 hours

Single-end or paired-end

For gene expression analysis with a reference genome, single-end is enough

For de novo assembly, genome annotation, alternative splicing identification, etc., it’s better to use paired-end

Strand-specific or non strand-specific

Always choose strand-specific RNA-Seq if possible

Strand-specific RNA sequencing

• More accurately determine the expression level

• Significantly reduce false positives in identifying alternatively spliced transcripts

• Identify antisense transcripts – another level of gene regulation in important biological processes

• Determine the transcribed strand of non-coding RNAs (e.g. lincRNAs)

Strand-specific RNA-Seq library construction

High throughput ssRNA-Seq

• Up to 96 libraries in two days

• Paired-end compatible

• Multiplexing

Strand-specific sequencing can produce more accurate digital gene expression data when compared to the conventional Illumina RNA-Seq.

Antisense transcript

cis-natural antisense transcripts (cis-NAT)

1340 cis-NAT pairs in Arabidopsis (Wang et al., 2005)

687 cis-NAT pairs in rice (Osato et al., 2003)

trans-natural antisense transcripts (trans-NAT)

1,320 trans-NAT pairs in Arabidopsis (Wang et al., 2006)

Function: alternative splicing, RNA editing, DNA methylation, genomic imprinting, X-chromosome inactivation

[Figure: antisense transcript examples. LEFL2040O15: 1394 reads and 259 reads; LEFL2002DC06: 1189 reads and 389 reads]

lincRNA (determine the sense strand)

RNA-Seq strategies

Most frequently asked question

• “How many samples should I multiplex in one lane?” or “How many reads

should I generate for each of my samples?”

• Depends on budget ($$$)

• Depends on the quality of the library and the reads

rRNA, tRNA, organelle, adaptor contamination

• No. of biological replicates for expression call

At least three

• Effects of read numbers on expression call

Mature green fruit library (22M reads)

Randomly select 0.1-0.9M, 1-22M reads from the library and calculate gene expression for each dataset (20 different randomizations)
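The subsampling test described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the workshop's actual script: the toy library, the gene names and the helper names `pearson` and `subsample_correlation` are all hypothetical, and each read is represented only by the gene it maps to.

```python
import random
from collections import Counter
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def subsample_correlation(read_genes, fraction, genes, rng):
    """Correlate per-gene counts of a random subsample with those of the full library."""
    k = max(1, int(len(read_genes) * fraction))
    sub = Counter(rng.sample(read_genes, k))   # sample reads without replacement
    full = Counter(read_genes)
    return pearson([sub[g] for g in genes], [full[g] for g in genes])

# Hypothetical toy library: 50,000 reads over 200 genes with skewed abundance.
rng = random.Random(0)
genes = [f"gene{i}" for i in range(200)]
weights = [rng.lognormvariate(0, 1.5) for _ in genes]
library = rng.choices(genes, weights=weights, k=50_000)

for fraction in (0.005, 0.05, 0.5):
    # 20 different randomizations per depth, as on the slide
    rs = [subsample_correlation(library, fraction, genes, rng) for _ in range(20)]
    print(f"fraction {fraction}: mean r = {sum(rs) / len(rs):.4f}")
```

On real data the same loop would run over read counts per gene taken from an alignment, and the depth at which r stops improving suggests how many reads per sample are enough.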

Sequencing depth and no. of biological replicates

RNA-Seq (multiplexing)

[Figure: panels for subsamples of 0.1M, 1M, 2M, 3M, 5M and 10M reads from the mature green fruit library (22M reads); correlation coefficients shown: r=0.8682, r=0.9867, r=0.9934, r=0.9957, r=0.9976, r=0.9992]
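The question "How many samples should I multiplex in one lane?" reduces to simple arithmetic once a target depth is chosen. The sketch below is only an illustration: the function name `samples_per_lane`, the 10M-read target and the 75% usable fraction are assumptions made for the example, while the 200M reads per lane comes from the lower end of the HiSeq high-output range quoted earlier.

```python
def samples_per_lane(reads_per_lane, target_reads_per_sample, usable_fraction=1.0):
    """Number of samples that fit in one lane while each keeps its target of usable reads.

    usable_fraction is the share of reads expected to survive cleaning
    (rRNA, tRNA, organelle and adaptor contamination removed).
    """
    if not 0 < usable_fraction <= 1:
        raise ValueError("usable_fraction must be in (0, 1]")
    usable_reads = reads_per_lane * usable_fraction
    return int(usable_reads // target_reads_per_sample)

# Hypothetical example: 200M reads per lane, 10M usable reads wanted per sample,
# and an assumed 75% of reads left after removing contamination.
print(samples_per_lane(200_000_000, 10_000_000, usable_fraction=0.75))  # 15
```

A poorer library (lower usable fraction) or a higher depth target lowers the number of samples per lane, which is why the answer depends on both budget and library quality.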

Common problems in RNA-Seq experimental design

• No bioinformatics expert is involved in the experimental design. Flaws in the design can cause serious problems for downstream data analysis.

• No biological replicates. Currently most journals require at least three biological replicates.

• Biological replicates collected at different times or in different places.

• For biotic/abiotic stress experiments, no mock control. All treatments are compared to non-treated samples (time 0), so the comparison is confounded by circadian clock genes and by genes differentially expressed due to different environmental factors and developmental stages.

• Directly comparing different genotypes with totally different genetic backgrounds. Genes may be differentially expressed due to other phenotypes, not the one of interest.

RNA-Seq data analysis

Read processing

Read quality control (fastqc)

Remove adaptors and all possible contaminations: rRNA, tRNA, organelle (chloroplast and mitochondrion) RNAs, virus, low quality sequences……

Arabidopsis 25S ribosomal RNA vs GenBank nr protein database

Remove contaminated sequences

• Align reads to rRNA and organelle sequence database (bowtie or BWA)

• Affect RPKM values if not removed

Trim adaptor and low quality sequences

• FASTX-Toolkit

• AdapterRemoval

• Trimmomatic

• Cutadapt

• Condetri

• ERNE-filter

• Prinseq

• SolexaQA-bwa

• Sickle
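To show what the trimming tools listed above do in principle, here is a deliberately simplified Python sketch. It is not how any of those tools is implemented: the function names are hypothetical, qualities are assumed to be Phred+33 encoded, the adaptor is matched exactly (real trimmers tolerate mismatches and partial adaptors at the read end), and the adaptor string is just a commonly seen Illumina adaptor prefix used as an example.

```python
def phred33(quality_string):
    """Decode a Phred+33 quality string into a list of integer scores."""
    return [ord(c) - 33 for c in quality_string]

def clip_adapter(seq, qual, adapter):
    """Cut the read at the first exact occurrence of the adaptor, if any."""
    pos = seq.find(adapter)
    if pos == -1:
        return seq, qual
    return seq[:pos], qual[:pos]

def trim_3prime(seq, qual, min_q=20):
    """Remove bases from the 3' end while their quality is below min_q."""
    scores = phred33(qual)
    end = len(seq)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]

# Example adaptor prefix (an assumption for illustration only).
ADAPTER = "AGATCGGAAGAGC"

seq, qual = clip_adapter("ACGTACGTAA" + ADAPTER, "IIIIIIII##" + "I" * len(ADAPTER), ADAPTER)
seq, qual = trim_3prime(seq, qual, min_q=20)
print(seq, qual)  # ACGTACGT IIIIIIII
```

Adaptor clipping is applied first so that the low-quality bases just upstream of the adaptor become the new 3' end and can be trimmed.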


De novo transcriptome assembly

Long reads (454/Sanger)

overlap-layout-consensus strategy

Short reads (Illumina)

de Bruijn graph approach

Martin & Wang, 2011
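The de Bruijn graph idea used for short reads can be illustrated with a toy Python example. This is a sketch under strong assumptions (error-free reads, no repeats, one transcript) and does not reflect how any particular assembler is implemented; the function names and the three reads are made up for the illustration.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, each k-mer adds one edge."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def walk_unbranched(graph, start):
    """Extend a contig from start while the current node has exactly one successor."""
    path, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:          # stop on a cycle
            break
        path += nxt[-1]          # each step adds one base
        seen.add(nxt)
        node = nxt
    return path

# Three hypothetical overlapping reads from the transcript ATGGCGTGCA.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = de_bruijn(reads, k=4)
print(walk_unbranched(graph, "ATG"))  # ATGGCGTGCA
```

Because reads are only ever broken into k-mers, the cost of building the graph grows with the number of distinct k-mers rather than with the number of read pairs to compare, which is why this approach suits very large short-read datasets, whereas overlap-layout-consensus compares reads with each other.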

Two major problems in existing EST assembly programs and unigene databases:

1) A large portion of different transcripts (mainly alternatively spliced transcripts and paralogs) are incorrectly assembled into the same transcripts – type I error (false positives)

2) A large portion of nearly identical sequences are not assembled into one transcript – type II error (false negatives)

Long reads (454/Sanger)

CAP3 (http://seq.cs.iastate.edu/cap3.html)

TGICL/CAP3 (http://compbio.dfci.harvard.edu/tgi/software/)

MIRA (http://www.chevreux.org/projects_mira.html)

Newbler (-cDNA)

Phrap (http://www.phrap.org/)

Example of type I assembly error (paralog)

In DFCI Tomato Gene Index, AW218649 is a member of TC237370

Sequence identity between AW218649 and TC237370: 91.5%

AW218649 is aligned to tomato chromosome 4

TC237370 is aligned to tomato chromosome 11

Example of type I assembly error (alternative splicing)

In DFCI Tomato Gene Index, U95008 is a member of TC226520

Example of type II assembly error

In DFCI Tomato Gene Index, two unigenes, TC219875 and TC221582, are identical

iAssembler

http://bioinfo.bti.cornell.edu/tool/iAssembler/

• Iterative assemblies (assembly of assemblies) using MIRA and CAP3 (four cycles of MIRA followed by one cycle of CAP3) – reduces errors where nearly identical sequences are not assembled

• Further assembly error identification

1) comparing unigene sequences against themselves to identify nearly identical sequences (type II errors)

2) aligning EST sequences to their corresponding unigene sequences to identify mis-assembled ESTs (type I errors)

• Both type I and II assembly errors are corrected automatically by the program

• Unigene base errors are then corrected based on the resulting SAM files

Short reads (Illumina)

Trinity

Trans-ABySS

Oases/Velvet

SOAPdenovo-Trans

Reference-guided de novo assembly

Cufflinks

IsoLasso

Scripture

Traph

StringTie

Post processing of de novo assemblies

Remove contaminations (bacteria, virus, fungus……)

Remove assembly errors (mainly redundancy)

Remove errors caused by library preparation (incomplete digestion of dUTP containing 2nd strand during strand-specific RNA-Seq library construction)

Remove contamination

blastn

blastx

DeconSeq

SeqClean

Remove type II assembly error (redundancy): iAssembler

Remove transcripts derived from incomplete 2nd strand digestion

Gene ID             length   antisense   sense
UN22492             1504     97          48138
comp38294_c0_seq1   526      10822       103

(figure label: removed)

High number of assembled transcripts

Alternative splicing

Non-coding RNAs

Incomplete coverage of full length transcripts

DFCI gene index

Alignment

Align reads to reference genome

TopHat

HISAT

STAR

Align reads to reference transcriptome

bowtie

BWA

If you have a reference genome, it’s not a good idea to align the reads to the predicted CDS or cDNA, due to the incomplete prediction of UTRs and alternative splicing

Visualization tools

Integrative Genomics Viewer (IGV)

Read counting and normalization

Read counting

htseq-count

samtools (samtools view -c)

Normalization

RPKM: reads per kilobase of exon model per million mapped reads

FPKM: fragments per kilobase of exon model per million mapped reads
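The RPKM definition above translates directly into a one-line formula. The Python sketch below is a minimal illustration; the function name and the example numbers are hypothetical. FPKM uses the same formula with fragments (read pairs) counted instead of reads.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of exon model per million mapped reads."""
    # count / (length in kb) / (mapped reads in millions)
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)

# Hypothetical gene: 500 reads on a 2,000 bp exon model, 10M mapped reads in total.
print(rpkm(500, 2000, 10_000_000))   # 25.0

# Same gene if 2M of those mapped reads were rRNA/organelle reads and had been removed.
print(rpkm(500, 2000, 8_000_000))    # 31.25
```

The second call shows why contaminating reads matter: they inflate the denominator and lower the RPKM of every gene in that sample.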

Quality control – biological replicates

Sample correlation matrix

Differentially expressed gene detection

Pair-wise comparison

DESeq

edgeR

Time course data

edgeR

First transform the data using the getVarianceStabilizedData function in DESeq (to approximate a normal distribution), then identify DE genes using F tests in LIMMA

Multiple test correction

False Discovery Rate (FDR)

q value
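As an illustration of multiple test correction, the following Python sketch implements the Benjamini-Hochberg procedure, one widely used way to control the FDR. It is an assumption that this is the procedure meant here, since the slide does not name one; the adjusted values it returns are often reported as q values, although Storey's q value is estimated differently. The function name and example p-values are made up.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p-values for four genes.
pvals = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(pvals)
significant = [p <= 0.05 for p in adj]
print([round(p, 4) for p in adj])  # [0.02, 0.04, 0.04, 0.02]
```

Genes whose adjusted value falls below the chosen FDR threshold (0.05 in this example) are called differentially expressed.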
