Analysis of bulk RNA-seq II: Reads to DGE
Analysis of Next-Generation Sequencing Data

Friederike Dündar

Applied Bioinformatics Core

Slides at https://bit.ly/2T3sjRg ¹

February 25, 2020

¹ https://physiology.med.cornell.edu/faculty/skrabanek/lab/angsd/schedule_2020/

Friederike Dündar (ABC, WCM) Analysis of bulk RNA-seq II: Reads to DGE February 25, 2020 1 / 63

1 Gene expression quantification recap

2 Normalization of read counts

3 Exploratory analyses

4 Differential gene expression

5 Downstream analyses

6 References


Many slides today were inspired by or directly taken from the excellent book Data Analysis for the Life Sciences by Rafael Irizarry and Michael Love, and training material developed by the Harvard Chan Bioinformatics Core.

Go and check them out for even more details! The Harvard Chan Bioinformatics Core’s material can be found at their github page: https://github.com/hbctraining/DGE_workshop


Gene expression quantification recap


Alignment of NGS data is resource-intensive


Quantification of gene expression

1 Align
  - with splice-aware alignment tools, e.g. STAR

2 Count reads that overlap with annotated genes
  - complicated by alternative isoforms: genes != transcripts


Alternative isoforms are common in eukaryotic transcriptomes

Gene isoforms = mRNAs produced from the same locus, but with different final sequences (possibly giving rise to different protein sequences, too)


2 Philosophies of gene expression quantification

(A) Alignment + counting
Historically, the reads of RNA-seq experiments were treated the same way as reads of DNA-seq experiments, i.e. it was deemed important that we knew the precise location that each read had originated from.

The results of alignment, however, are not inherently quantitative, which is why a 2nd counting step was needed.


(B) Pseudoalignment
For standard bulk RNA-seq, we really just want the number of reads that are compatible with a known transcript sequence. If we decide to not care about the precise genome location, we can:

- reduce the size of our search space, i.e. our index of k-mers can be limited to cDNAs (no introns!)
- chop up the reference cDNAs AND our reads into fairly small k-mers
- perform a “simple” k-mer matching strategy and assign the read to the transcript that most of its k-mers matched to

See Zielezinski et al. [2017] for a good explanation of pseudo-alignment etc.
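
The k-mer matching idea can be sketched in a few lines of base R. This is only a toy illustration, not the actual algorithm of kallisto or salmon (which use far more refined indexing and a probabilistic assignment of ambiguous reads). The sequences, the value of k, and the helper names `get_kmers()` and `assign_read()` are made up for this sketch.

```r
# Toy sketch of k-mer based read assignment; all sequences are hypothetical.

# split a sequence into all overlapping k-mers
get_kmers <- function(seq, k) {
  starts <- seq_len(nchar(seq) - k + 1)
  substring(seq, starts, starts + k - 1)
}

# hypothetical reference cDNAs (no introns)
transcripts <- c(txA = "ATGGCGTACGTTAGC",
                 txB = "ATGGCGTTTTGACCA")
k <- 5

# "index": the set of k-mers found in each transcript
index <- lapply(transcripts, get_kmers, k = k)

# assign a read to the transcript that shares the most k-mers with it
assign_read <- function(read, index, k) {
  read_kmers <- get_kmers(read, k)
  hits <- sapply(index, function(tx_kmers) sum(read_kmers %in% tx_kmers))
  names(hits)[which.max(hits)]
}

read <- "GCGTACGTTA"   # taken from the middle of txA
best <- assign_read(read, index, k)
best
```

Note that no genome coordinate is ever computed: the only result is the name of the most compatible transcript.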


(B) Transcript abundance estimation via pseudoalignment


(B) Pseudoalignment caveats

- abundance estimates for lowly expressed transcripts are highly variable (not enough distinct k-mers)
- short RNAs have inherently fewer distinct k-mers
- problem when coverage of an isoform-defining region is low (or its sequence isn’t distinct)
- any read that originated from somewhere in the genome other than the cDNAs may be mapped spuriously

For very similar transcripts, collapsing all abundances per gene into a gene-centric measure is more robust and accurate. [Soneson et al., 2015]
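
Collapsing transcript-level estimates into a gene-level value is, at its core, a grouped sum. The sketch below shows only that summation step with base R's `rowsum()`. The abundance values and the transcript-to-gene map (`tx2gene`) are hypothetical, and the sketch leaves out the length offsets that tximport additionally provides for real data.

```r
# hypothetical transcript-level abundance estimates
# (rows = transcripts, columns = samples)
tx_abund <- matrix(c(10.5, 4.5, 30,
                     12.0, 3.0, 28),
                   ncol = 2,
                   dimnames = list(c("geneA-201", "geneA-202", "geneB-201"),
                                   c("sample1", "sample2")))

# hypothetical map from transcript ID to gene ID
tx2gene <- c("geneA-201" = "geneA",
             "geneA-202" = "geneA",
             "geneB-201" = "geneB")

# sum all transcripts that belong to the same gene
gene_abund <- rowsum(tx_abund, group = tx2gene[rownames(tx_abund)])
gene_abund
```

In this made-up example the two geneA isoforms differ between samples, while their gene-level sum is identical.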


(B) Transcript abundance estimates

If you decide to use abundance estimates rather than gene-read overlap counts, use the tximport package [Soneson et al., 2015] to import them for use with Bioconductor differential gene expression packages.

The advantages of using the transcript abundance quantifiers in conjunction with tximport to produce gene-level count matrices and normalizing offsets are:

- in-built correction for any potential changes in gene length across samples (e.g. from differential isoform usage) [Trapnell et al., 2012]
- increased speed, less memory and less disk usage compared to alignment-based methods
- it is possible to avoid discarding fragments that can align to multiple genes with homologous sequence


|                           | Traditional                              | Pseudoalignment                                                                       |
|---------------------------|------------------------------------------|---------------------------------------------------------------------------------------|
| Ex. workflow              | STAR + featureCounts                     | kallisto or salmon                                                                    |
| Read mapping based on     | Where does a read match best?            | Which collection of unique k-mers does a read match best?                             |
| Reference                 | Genome seq. + exon boundaries            | cDNA sequences                                                                        |
| Mapping result            | Genome coordinates (BAM)                 | Table of expression level estimates (txt)                                             |
| Expression quantification | Counting how many reads overlap a gene ² | Summing the values assigned to each collection of unique k-mers (equivalence class)   |
| Output                    | Read counts (integers)                   | Estimated transcript abundances (numeric)                                             |
| Speed                     | ++ and +++                               | ++++                                                                                  |

² The read sequence is irrelevant at this point.


General bioinformatics workflow – updated
Understand your null hypothesis!
(See Soneson et al. [2015], Love et al. [2018])

DGE: Differential Gene Expression
  - Has the total output of a gene changed?
  - input for the statistical testing: (estimated) counts per gene used by DESeq2/edgeR/limma (see M. Love’s protocols)

DTU: Differential Transcript Usage
  - Has the isoform composition for a given gene changed? I.e. are there different dominant isoforms depending on the condition?
  - common when comparing different cell types (incl. healthy vs. cancer)
  - input for the statistical testing: (estimated) counts per transcript used by DEXSeq (!)


Normalization of read counts


Read counts are influenced by numerous factors, not just expression strength

Raw counts ³ = number of reads (or fragments) overlapping with the union of exons of a gene.

Raw count numbers are not just a reflection of the actual number of captured transcripts!

They are strongly influenced by:
- sequencing depth
- gene length
- DNA sequence content (% GC)
- expression of all other genes in the same sample

³ also true for "estimated" gene counts from pseudoaligners


Influences on read count numbers

1. Sequencing depth (= total number of reads per sample)
sequencing depth of Sample A ≫ Sample B
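
A minimal sketch of why sequencing depth matters, assuming a made-up count matrix in which sample B was simply sequenced twice as deeply as sample A. Dividing by the total number of reads per sample (here expressed as counts per million, CPM) removes that difference. CPM is used only as the simplest possible illustration; it does not address gene length or RNA composition.

```r
# hypothetical raw counts: sample B has exactly twice the depth of sample A
counts <- matrix(c(100, 300, 600,
                   200, 600, 1200),
                 ncol = 2,
                 dimnames = list(c("gene1", "gene2", "gene3"), c("A", "B")))

# sequencing depth = total number of reads per sample
depth <- colSums(counts)

# counts per million: divide every column by its depth, scale to 1e6
cpm <- sweep(counts, 2, depth, "/") * 1e6

depth
cpm
```

The raw counts of every gene differ two-fold between A and B, yet the depth-scaled values are identical.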


2. Gene length (and GC bias)


3. RNA composition - individual gene abundances

All the numbers within a given sample are relative abundance measurements.
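
The consequence of relative measurements can be worked through with simple arithmetic. In the hypothetical example below, genes g1 to g3 have the same number of molecules in both samples and only g4 is strongly induced in sample B. Because both libraries are sequenced to the same (assumed) depth, the expected read counts of the unchanged genes drop in B.

```r
# hypothetical numbers of RNA molecules per gene in two samples
molecules <- cbind(A = c(g1 = 100, g2 = 100, g3 = 100, g4 = 100),
                   B = c(g1 = 100, g2 = 100, g3 = 100, g4 = 700))

depth <- 1e6  # assumed: both samples are sequenced to the same depth

# expected reads are proportional to each gene's share of all molecules
expected_reads <- sweep(molecules, 2, colSums(molecules), "/") * depth
expected_reads

# apparent fold change of an unchanged gene (B relative to A)
expected_reads["g1", "B"] / expected_reads["g1", "A"]
```

Simple depth scaling cannot correct this, which is why methods for differential expression estimate size factors that are robust to a few strongly changing genes.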


Influences on read count numbers - summary

Which biases are relevant for comparing different samples?


Different units for expression values


Why not RPKMs?

- [RF]PKM values are not comparable between samples – Do NOT use them!
- if you need normalized expression values for exploratory plots, use TPM or DESeq2’s rlog values
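
One way to see the difference between the two units is to compute both on a small made-up example. The counts and gene lengths below are hypothetical. The only difference between the formulas is the order of the two scaling steps, but it determines whether the values of a sample add up to a fixed total.

```r
# hypothetical raw counts (same depth in both samples) and gene lengths in kb
counts <- cbind(s1 = c(g1 = 100, g2 = 200, g3 = 700),
                s2 = c(g1 = 100, g2 = 800, g3 = 100))
len_kb <- c(g1 = 1, g2 = 2, g3 = 10)

# RPKM: scale by depth (per million reads) first, then by gene length
rpkm <- sweep(counts, 2, colSums(counts) / 1e6, "/") / len_kb

# TPM: scale by gene length first, then make every sample sum to one million
rate <- counts / len_kb
tpm  <- sweep(rate, 2, colSums(rate), "/") * 1e6

colSums(rpkm)  # differs between samples
colSums(tpm)   # the same total for every sample
```

Because the per-sample RPKM totals differ, an RPKM value does not represent the same share of the library in different samples, whereas a TPM value does.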


Working with read counts

- Download the featureCounts results to your laptop.
- Read the featureCounts results into R.
- Let’s normalize!
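
A minimal sketch of the "read into R" step, assuming the standard featureCounts table layout: one comment line starting with `#`, a header line, six annotation columns (Geneid, Chr, Start, End, Strand, Length) and then one count column per BAM file. To keep the example self-contained it writes a tiny mock file first; the gene names, sample names and counts are made up, and in practice `fc_file` would point to the downloaded results.

```r
# write a tiny mock featureCounts table to a temporary file (values are made up)
fc_file <- tempfile(fileext = ".txt")
writeLines(c(
  "# Program:featureCounts; Command: (mock example)",
  "Geneid\tChr\tStart\tEnd\tStrand\tLength\tWT_1.bam\tWT_2.bam\tKO_1.bam",
  "geneA\tchrI\t100\t1100\t+\t1001\t15\t20\t3",
  "geneB\tchrI\t2000\t2500\t-\t501\t0\t1\t0",
  "geneC\tchrII\t50\t4050\t+\t4001\t250\t300\t410"
), fc_file)

# the first line starts with "#" and is skipped as a comment by read.table()
fc <- read.table(fc_file, header = TRUE, sep = "\t", stringsAsFactors = FALSE)

# columns 1-6 are annotation; the remaining columns hold the counts per sample
counts <- as.matrix(fc[, -(1:6)])
rownames(counts) <- fc$Geneid
colnames(counts) <- sub("\\.bam$", "", colnames(counts))

counts
```

Keeping the counts as a plain matrix with gene IDs as row names and short sample names as column names makes the later normalization and plotting steps easier to follow.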


Exploratory analyses


Exploratory analyses do not test a null hypothesis! They are meant to familiarize yourself with the data to discover biases and unexpected variability!

Typical exploratory analyses:
- correlation of gene expression between different samples
- (hierarchical) clustering
- dimensionality reduction methods, e.g. PCA
- dot plots/box plots/violin plots of individual genes

Use normalized and transformed read counts for data exploration!


Pairwise correlation of gene expression values

- replicates of the same condition should show high correlations (>0.9)
- Pearson method: metric differences between samples
  - influenced by outliers
  - covariance of two variables divided by the product of their standard deviations
  - suitable for normally distributed values
- Spearman method: based on rankings
  - less sensitive
  - less driven by outliers

R function: cor()
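
The effect of a single outlier on the two methods can be checked directly with `cor()`. The two vectors below are hypothetical expression values of ten genes in two replicates; the last gene of `rep2` is a deliberate outlier.

```r
# hypothetical expression values of 10 genes in two replicates
rep1 <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
rep2 <- c(1.2, 1.9, 3.1, 4.2, 4.8, 6.1, 7.0, 8.3, 8.9, 100)  # last value: outlier

r_pearson  <- cor(rep1, rep2, method = "pearson")
r_spearman <- cor(rep1, rep2, method = "spearman")

r_pearson   # pulled down by the single outlier
r_spearman  # the ranks of both replicates are identical
```

Passing a whole matrix to `cor()` returns all pairwise correlations between its columns, which is the form used for sample-to-sample comparisons.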


Hierarchical clustering – grouping similar samples

Goal: partition the objects into homogeneous groups, such that the within-group similarities are large.

Single-sample (or single-gene) clusters are successively joined, starting with the two least dissimilar samples.

Result: dendrogram
  - clustering is obtained by cutting the dendrogram at the desired level

Similarity measure
  - Euclidean
  - Pearson

Distance measure between clusters (linkage)
  - Complete: largest distance
  - Average: average distance
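
The choices listed above map onto arguments of base R functions: the similarity measure is chosen when computing the distance matrix, the linkage is the `method` argument of `hclust()`, and cutting the dendrogram is done with `cutree()`. The expression matrix below is hypothetical and only meant as a sketch.

```r
# hypothetical log-scale expression values: 4 genes x 4 samples
expr <- cbind(WT_1 = c(1, 5, 9, 2),
              WT_2 = c(1.2, 5.1, 8.7, 2.2),
              KO_1 = c(6, 1, 3, 8),
              KO_2 = c(6.3, 0.8, 3.2, 7.9))

# dist() works on rows, so transpose to get sample-to-sample Euclidean distances
d <- dist(t(expr), method = "euclidean")

# same distances, two different linkage criteria
hc_complete <- hclust(d, method = "complete")
hc_average  <- hclust(d, method = "average")

# cut the dendrogram so that two groups remain
groups <- cutree(hc_complete, k = 2)
groups
```

With clearly separated samples like these, both linkage criteria give the same grouping; on noisier data they can disagree, so it is worth trying more than one.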


Hierarchical clustering - R code

## rlog.norm.counts: matrix of rlog-transformed read counts (genes x samples)

## calculate the correlation between columns of a matrix
pw_cor <- cor(rlog.norm.counts, method = "pearson")

## use the correlation as a distance measure
distance.m_rlog <- as.dist(1 - pw_cor)

## plot() can directly interpret the output of hclust() to generate
## a dendrogram
plot(hclust(distance.m_rlog),
     labels = colnames(rlog.norm.counts),
     main = "rlog transformed read counts")


Principal component analysis – capturing variability

Goal: reduce the dataset to have fewer dimensions, yet approx. preserve the distance between samples

starting point: matrix with expression values per gene and sample, e.g. 6,600 genes x 10 samples

assay(DESeq.rlog)[topVarGenes, ] %>% t %>% prcomp

transformed into principal components x 10 samples (prcomp() returns at most as many PCs as there are samples)
- linear combinations of optimally weighted observed variables
- the vectors along which the variation between samples is maximal
- PC1-3 are usually sufficient to capture the major trends!
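
A small self-contained sketch of the same `prcomp()` call on simulated data, so that the shape of the result can be inspected without a DESeq2 object. The matrix is random noise with a shift added to ten genes in the "KO" samples; all names and numbers are assumptions made for this illustration.

```r
# hypothetical expression matrix: 50 genes x 6 samples (3 WT, 3 KO)
set.seed(42)
expr <- matrix(rnorm(50 * 6), nrow = 50,
               dimnames = list(paste0("gene", 1:50),
                               c(paste0("WT_", 1:3), paste0("KO_", 1:3))))
expr[1:10, 4:6] <- expr[1:10, 4:6] + 5   # shift 10 genes in the KO samples

# prcomp() expects the samples in rows, hence the transpose
pca <- prcomp(t(expr))

# coordinates of the samples: one row per sample, one column per PC
dim(pca$x)

# proportion of the total variance captured by each PC
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 2)

# PC1 vs. PC2, the usual first plot
plot(pca$x[, 1], pca$x[, 2], xlab = "PC1", ylab = "PC2")
```

Because the only systematic signal in this simulation is the WT/KO shift, the two groups end up on opposite sides of PC1.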


PCA vs. hierarchical clustering

- often similar results because both techniques should capture the most dominant patterns
- PCA will always be run on just a subset of the data!
- clustering will ALWAYS return clusters, PCA may not if the patterns of variation are too random

See practical_exploratory.Rmd for R code to generate exploratory plots.

Use the pcaExplorer package!

See the chapter “Distance and Dimension Reduction” in Irizarry and Love [2015] for more details and the StatQuest video(s) on youtube.


Differential gene expression


Understand your null hypothesis!

DGE: Differential Gene Expression
- Has the total output of a gene changed?
- input for the statistical testing: (estimated) counts per gene used by DESeq2/edgeR/limma
- see Soneson et al. [2015] and Bioconductor’s tximport package vignette for details

DTU: Differential Transcript Usage
- Has the isoform composition for a given gene changed? I.e. are there different dominant isoforms depending on the condition?
- common when comparing different cell types (incl. healthy vs. cancer)
- input for the statistical testing: (estimated) counts per transcript used by DEXSeq (!)
- see Love et al. [2018] for details


DGE basics

H0: There is no difference in the read distributions of the 2 conditions.


Applying linear models for read count modeling


To describe all expression values of one (!) example gene (snf2), we can use a linear model like this:

Y = b0 + b1 * x + e

Linear models model a response variable as a linear combination of predictors weighted by coefficients (betas), plus randomly distributed noise (e).


b0: intercept, i.e. average value of the baseline group
b1: difference between baseline and non-reference group
x: 0 if genotype == "SNF2", 1 if genotype == "WT"
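The 0/1 coding of x described above is what R builds internally from a factor. The following sketch shows it with `model.matrix()`; the six-sample layout is an assumption made for illustration.

```r
# Hypothetical sample annotation: 3 SNF2 samples, 3 WT samples.
# The first factor level ("SNF2") is the baseline group.
genotype <- factor(c("SNF2", "SNF2", "SNF2", "WT", "WT", "WT"),
                   levels = c("SNF2", "WT"))

# Design matrix: an intercept column (b0) plus an indicator column x (for b1)
design <- model.matrix(~ genotype)
design
```

The column `genotypeWT` is the x from the model: 0 for the baseline samples and 1 for the WT samples.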


Model formulae syntax in R

regression functions in R (e.g., lm(), glm()) use a “model formula” interface
the basic format is: response variable ~ explanatory variables
where the tilde means “is modeled by” or “is modeled as a function of” (see footnote 4)
e.g.: lm( y ~ x )

If you find yourself using linear models and somewhat complicated experimental designs more often than not, we strongly recommend working through chapters 4 and 5 of the PH525x series Biomedical Data Science [Irizarry and Love, 2016].

4 See King [2016] for more details on the special meaning of mathematical operators within R formula contexts.
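To make the formula interface concrete, here is a short sketch in base R. The variable names (`expr_value`, `batch`, `condition`) and the toy data frame are assumptions chosen for illustration.

```r
# A formula is an ordinary R object that can be stored and inspected
f_simple <- y ~ x
f_batch <- expr_value ~ batch + condition  # response ~ explanatory variables

class(f_simple)

# Variables named in the formula, and the terms on the right-hand side
all.vars(f_batch)
attr(terms(f_batch), "term.labels")

# The same formula object can be handed to a regression function
toy <- data.frame(x = c(1, 2, 3, 4, 5),
                  y = c(2.1, 3.9, 6.2, 7.8, 10.1))
fit <- lm(f_simple, data = toy)
coef(fit)
```

Within a formula, `+` means "also include this term" rather than arithmetic addition, which is the special operator meaning the footnote refers to.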


Applying linear models for read count modeling

Describe the expression values of snf2 using a linear model:

Y = b0 + b1 * x + e

b0: intercept, i.e. average value of the baseline group
b1: difference between baseline and non-reference group
x: 0 if genotype == "SNF2", 1 if genotype == "WT"

Factor of interest (b1) can be estimated as follows:

# 1. FIT the model
> lmfit <- lm(rlog.norm ~ genotype)
# 2. ESTIMATE the coefficients
> coef(lmfit)
(Intercept)  genotypeWT
      6.666       3.111

Both values (b0, b1) are estimates! (They’re spot-on because the values are so clear and the model is so simple!)
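The fit above can be reproduced in spirit with simulated values. This is a sketch under stated assumptions: the expression values are drawn at random around made-up group means, so the numbers will differ from the 6.666 and 3.111 shown on the slide.

```r
set.seed(7)

# Hypothetical rlog-like expression values of one gene in 5 SNF2 and 5 WT samples
genotype <- factor(rep(c("SNF2", "WT"), each = 5), levels = c("SNF2", "WT"))
rlog_norm <- c(rnorm(5, mean = 6.7, sd = 0.1),   # baseline group (SNF2)
               rnorm(5, mean = 9.8, sd = 0.1))   # non-reference group (WT)

# 1. FIT the model
lmfit <- lm(rlog_norm ~ genotype)

# 2. ESTIMATE the coefficients
coefs <- coef(lmfit)
coefs

# For this simple design the estimates equal the group mean (b0)
# and the difference between the group means (b1)
group_means <- tapply(rlog_norm, genotype, mean)
group_means["SNF2"]
group_means["WT"] - group_means["SNF2"]
```

Comparing `coefs` with the group means shows why the estimates are "spot-on" for a two-group design with one factor.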


DGE steps (à la DESeq2)

1 Fitting a sophisticated regression model to the read counts (done per gene; includes normalization)
- library size factor
- dispersion estimate using information across multiple genes
- assuming neg. binomial distribution to describe read count distribution

2 Estimating coefficients to obtain the difference between the estimated mean expression of the different groups (⇒ log2FC)
- define the contrast of interest, e.g. Y ~ batchEffect + condition
- always put the factor of interest last
- order of the factor levels determines the direction of the log2FC values

3 Test whether the log2FC is “far away” from zero (remember H0!)
- likelihood ratio test or Wald test are offered by DESeq2
- multiple hypothesis correction!
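Two of the points above, the role of the factor level order and the need for multiple hypothesis correction, can be sketched in base R. Note the assumptions: this uses an ordinary linear model on simulated log2 values as a simplified stand-in, not DESeq2's negative binomial model, and the helper `fit_gene()` is hypothetical.

```r
set.seed(11)

# Simulated log2 expression: 100 genes x 8 samples (4 WT, 4 SNF2)
n_genes <- 100
condition <- factor(rep(c("WT", "SNF2"), each = 4), levels = c("WT", "SNF2"))
log_expr <- matrix(rnorm(n_genes * 8, mean = 8, sd = 0.3), nrow = n_genes)

# The first 10 genes are shifted upwards by 2 log2 units in SNF2
log_expr[1:10, condition == "SNF2"] <- log_expr[1:10, condition == "SNF2"] + 2

# Hypothetical helper: fit one gene, return the group coefficient and its p-value
fit_gene <- function(y, group) {
  s <- summary(lm(y ~ group))$coefficients
  c(log2FC = s[2, "Estimate"], pvalue = s[2, "Pr(>|t|)"])
}

# One model per gene (WT is the reference level, so log2FC = SNF2 - WT)
res <- as.data.frame(t(apply(log_expr, 1, fit_gene, group = condition)))

# Multiple hypothesis correction across all genes (Benjamini-Hochberg)
res$padj <- p.adjust(res$pvalue, method = "BH")

# Changing the reference level flips the sign of every log2FC
res_flipped <- t(apply(log_expr, 1, fit_gene,
                       group = relevel(condition, ref = "SNF2")))

head(cbind(res, log2FC_flipped = res_flipped[, "log2FC"]))
sum(res$padj < 0.05)
```

The same sign flip happens in DESeq2 when the reference level of the condition factor changes, which is why it pays to set the levels explicitly before fitting.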


Summary: from read counts to DGE et al.


Comparison of additional tools for DGE analysis

When in doubt, compare the results of limma, edgeR, and DESeq2 to get a feeling for how robust your favorite DE genes are. All packages can be found at Bioconductor.


Downstream analyses


Understanding the RESULTS of the DGE analysis

Investigate the results() output:
- How many DE genes? (FDR/q-value!)
- How strongly do the DE genes change?
- Directions of change?
- Are your favorite genes among the DE genes?

Spend some time on this to perform some sanity checks! This will help you spot discrepancies early on!
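The sanity checks listed above take only a few lines once the results are in a table. The sketch below uses a hand-made data frame; the column names `log2FoldChange` and `padj` mirror those in DESeq2's results() output, while the gene names, values, and the 0.05 cutoff are assumptions for illustration.

```r
# Hypothetical results table with six genes
res <- data.frame(
  gene = c("geneA", "geneB", "geneC", "geneD", "geneE", "geneF"),
  log2FoldChange = c(2.1, -1.4, 0.2, -0.1, 3.0, -2.5),
  padj = c(0.001, 0.02, 0.8, NA, 0.04, 0.2)
)

alpha <- 0.05

# How many DE genes? (padj can be NA, e.g. for genes that were filtered out)
is_de <- !is.na(res$padj) & res$padj < alpha
n_de <- sum(is_de)
n_de

# How strongly do they change, and in which direction?
summary(abs(res$log2FoldChange[is_de]))
direction <- table(ifelse(res$log2FoldChange[is_de] > 0, "up", "down"))
direction

# Are your favorite genes among the DE genes?
favorites <- c("geneB", "geneD")
favorite_is_de <- favorites %in% res$gene[is_de]
setNames(favorite_is_de, favorites)
```

Handling `NA` explicitly matters here, because comparing `NA < alpha` would otherwise propagate missing values into the counts.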


Understanding the FUNCTIONS of your DE genes

There are myriad tools for this – many are web-based, many are R packages, many will address very specific questions. Typical points of interest are:

enriched gene ontology (GO) terms
- ontology = standardized vocabulary
- 3 classes of gene ontologies are maintained: biological processes (BP), cellular components (CC), and molecular functions (MF)

enriched pathways
- gene sets: e.g. from MSigDB [Liberzon et al., 2015]
- physical interaction networks: e.g. from STRING [Szklarczyk et al., 2017]
- metabolic (and other) pathways: e.g. from KEGG [Kanehisa et al., 2017]

upstream regulators

None (!) of these methods should lead you to make definitive claims about the role of certain pathways for your phenotype. These are hypothesis-generating tools! Also: make sure you use shrunken logFC values [Zhu et al., 2019].


Two typical approaches of enrichment analyses


1. Over-representation analysis (ORA)
“2x2 table method”
assessing overlap of DE genes with genes of a given pathway
statistical test: e.g. hypergeometric test
limitations:
- direction of change is ignored
- magnitude of change is ignored
- interprets genes as well as pathways as independent entities

See Khatri et al. [2012] for details!
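The "2x2 table method" can be written out directly in base R. All counts below are made up for illustration; the sketch shows that the hypergeometric test and a one-sided Fisher's exact test on the 2x2 table give the same p-value.

```r
# Hypothetical counts
n_universe <- 10000  # all genes that were tested
n_de       <- 500    # DE genes
n_pathway  <- 200    # genes annotated to the pathway
n_overlap  <- 25     # DE genes that are also in the pathway

# The 2x2 table: pathway membership vs. DE status
tab <- matrix(c(n_overlap,             n_de - n_overlap,
                n_pathway - n_overlap, n_universe - n_de - n_pathway + n_overlap),
              nrow = 2,
              dimnames = list(c("in_pathway", "not_in_pathway"),
                              c("DE", "not_DE")))
tab

# Hypergeometric test: probability of seeing at least n_overlap pathway genes
# among n_de genes drawn at random from the universe
p_hyper <- phyper(n_overlap - 1, m = n_pathway, n = n_universe - n_pathway,
                  k = n_de, lower.tail = FALSE)

# Equivalent one-sided Fisher's exact test on the same table
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

c(hypergeometric = p_hyper, fisher = p_fisher)
```

Only the counts enter the test, which is exactly the limitation named above: neither the direction nor the magnitude of the change is used.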


2. Functional Class Scoring (“Gene set enrichment”)
gene-level statistics for all genes in a pathway are aggregated into a single pathway-level statistic
score will depend on size of the pathway, and the amount of correlation between genes in the pathway
all genes are used
direction and magnitude of change matter
coordinated changes of genes within the same pathway matter, too
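The idea of aggregating gene-level statistics into one pathway-level statistic can be sketched as follows. This is a deliberately simplified illustration, not the GSEA algorithm: the pathway score is just the mean statistic, the null distribution comes from random gene sets of the same size (which ignores correlation between genes), and all data are simulated.

```r
set.seed(3)

# Simulated gene-level statistics (e.g. shrunken log2FC values) for 1000 genes
n_genes <- 1000
gene_stat <- rnorm(n_genes)
names(gene_stat) <- paste0("gene", seq_len(n_genes))

# Hypothetical pathway of 30 genes with a modest, coordinated upward shift
pathway <- paste0("gene", 1:30)
gene_stat[pathway] <- gene_stat[pathway] + 1

# Pathway-level statistic: the mean of the gene-level statistics
observed <- mean(gene_stat[pathway])

# Null distribution: the same score for random gene sets of equal size
null_scores <- replicate(2000, mean(sample(gene_stat, length(pathway))))

# Two-sided permutation p-value (with +1 so the estimate is never exactly zero)
p_perm <- (sum(abs(null_scores) >= abs(observed)) + 1) / (length(null_scores) + 1)

c(observed = observed, p_perm = p_perm)
```

No single gene in the pathway needs to pass a DE cutoff for the pathway score to stand out, which is the main contrast to ORA.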


Example GSEA results for positive and negative correlation


Summary – downstream analyses

Know your biological question(s) of interest!

all enrichment methods potentially suffer from gene length bias
- long genes will get more reads

for GO terms:
- use goseq to identify enriched GO terms [Young et al., 2010]
- use additional tools, such as GOrilla, REVIGO [Eden et al., 2009, Supek et al., 2011] to summarize the often redundant GO term lists

for KEGG pathways:
- e.g. GAGE and PATHVIEW [Luo and Brouwer, 2013, Luo et al., 2017] (see footnote 5)

miscellaneous including attempts to predict upstream regulators
- Enrichr [Chen et al., 2013]
- RegulatorTrail [Kehl et al., 2017]
- Ingenuity Pathway Analysis Studio (proprietary software!)

See the additional links and material on our course website!

5 https://www.r-bloggers.com/tutorial-rna-seq-differential-expression-pathway-analysis-with-sailfish-deseq2-gage-and-pathview/


References


Edward Y. Chen, Christopher M. Tan, Yan Kou, Qiaonan Duan, Zichen Wang, Gabriela V. Meirelles, Neil R. Clark, and Avi Ma’ayan. Enrichr: Interactive and collaborative HTML5 gene list enrichment analysis tool. BMC Bioinformatics, 2013. doi: 10.1186/1471-2105-14-128. URL http://amp.pharm.mssm.edu/Enrichr.

Eran Eden, Roy Navon, Israel Steinfeld, Doron Lipson, and Zohar Yakhini. GOrilla: a tool for discovery and visualization of enriched GO terms in ranked gene lists. BMC Bioinformatics, 10(1):48, January 2009. doi: 10.1186/1471-2105-10-48. URL http://cbl-gorilla.cs.technion.ac.il.

R. Irizarry and M. Love. Data Analysis for the Life Sciences. Leanpub, 2015. URL https://leanpub.com/dataanalysisforthelifesciences.

R. Irizarry and M. Love. Biomedical Data Science, 2016.

Minoru Kanehisa, Miho Furumichi, Mao Tanabe, Yoko Sato, and Kanae Morishima. KEGG: New perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Research, 2017. doi: 10.1093/nar/gkw1092.


Tim Kehl, Lara Schneider, Florian Schmidt, Daniel Stöckel, Nico Gerstner, Christina Backes, Eckart Meese, Andreas Keller, Marcel H. Schulz, and Hans Peter Lenhof. RegulatorTrail: A web service for the identification of key transcriptional regulators. Nucleic Acids Research, 2017. doi: 10.1093/nar/gkx350. URL https://regulatortrail.bioinf.uni-sb.de/.

Purvesh Khatri, Marina Sirota, and Atul J. Butte. Ten years of pathway analysis: Current approaches and outstanding challenges. PLoS Computational Biology, 2012. doi: 10.1371/journal.pcbi.1002375.

William B. King. Model Formulae Tutorial, 2016. URL http://ww2.coastal.edu/kingw/statistics/R-tutorials/formulae.html.

Arthur Liberzon, Chet Birger, Helga Thorvaldsdóttir, Mahmoud Ghandi, Jill P. Mesirov, and Pablo Tamayo. The Molecular Signatures Database Hallmark Gene Set Collection. Cell Systems, 2015. doi: 10.1016/j.cels.2015.12.004.


Michael I. Love, Charlotte Soneson, and Rob Patro. Swimming downstream: statistical analysis of differential transcript usage following Salmon quantification. F1000Research, 7(952), 2018. doi: 10.12688/f1000research.15398.1.

Weijun Luo and Cory Brouwer. Pathview: An R/Bioconductor package for pathway-based data integration and visualization. Bioinformatics, 2013. doi: 10.1093/bioinformatics/btt285.

Weijun Luo, Gaurav Pant, Yeshvant K. Bhavnasi, Steven G. Blanchard, and Cory Brouwer. Pathview Web: User friendly pathway visualization and data integration. Nucleic Acids Research, 2017. doi: 10.1093/nar/gkx372. URL https://pathview.uncc.edu/.

Charlotte Soneson, Michael I. Love, and Mark D. Robinson. Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences. F1000Research, 4(0):1521, 2015. doi: 10.12688/f1000research.7563.2.


Fran Supek, Matko Bošnjak, Nives Škunca, and Tomislav Šmuc. REVIGO summarizes and visualizes long lists of gene ontology terms. PLoS ONE, 6(7):e21800, January 2011. doi: 10.1371/journal.pone.0021800. URL http://revigo.irb.hr/.

Damian Szklarczyk, John H. Morris, Helen Cook, Michael Kuhn, Stefan Wyder, Milan Simonovic, Alberto Santos, Nadezhda T. Doncheva, Alexander Roth, Peer Bork, Lars J. Jensen, and Christian Von Mering. The STRING database in 2017: Quality-controlled protein-protein association networks, made broadly accessible. Nucleic Acids Research, 2017. doi: 10.1093/nar/gkw937.

Cole Trapnell, Adam Roberts, Loyal Goff, Geo Pertea, Daehwan Kim, David R. Kelley, Harold Pimentel, Steven L. Salzberg, John L. Rinn, and Lior Pachter. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols, 7(3):562–78, March 2012. doi: 10.1038/nprot.2012.016.

Matthew D. Young, Matthew J. Wakefield, Gordon K. Smyth, and Alicia Oshlack. Gene ontology analysis for RNA-seq: accounting for selection bias. Genome Biology, 2010. doi: 10.1186/gb-2010-11-2-r14.

Anqi Zhu, Joseph G. Ibrahim, and Michael I. Love. Heavy-Tailed prior distributions for sequence count data: Removing the noise and preserving large differences. Bioinformatics, 35(12):2084–2092, 2019. doi: 10.1093/bioinformatics/bty895.

Andrzej Zielezinski, Susana Vinga, Jonas Almeida, and Wojciech M. Karlowski. Alignment-free sequence comparison: Benefits, applications, and tools. Genome Biology, 2017. doi: 10.1186/s13059-017-1319-7.
