RNA-seq from a bioinformatics perspective · 2017-11-07

Post on 05-Aug-2020

RNA-seq from a bioinformatics perspective

Harmen van de Werken, Erasmus MC

Cancer Computational Biology Center (CCBC)

Outlook

RNA-seq software + RNA-seq courses

Alternative splicing & Promoters

Introduction RNAseq data

Differential expression

Read-Through & Fusion Transcripts

SNV / InDels

Novel Transcripts

Table 6-1 Molecular Biology of the Cell (© Garland Science 2008)

Which type of RNA is most abundant? How many different genes do we have per type?

THE HUMAN GENOME

▪ Consensus
~ 22,500 protein-coding genes
~ 9,000 long non-coding RNAs
~ 2,500 – 3,000 small RNAs

▪ miTranscriptome¹
~ 91,013 genes
~ 58,648 lncRNA genes

¹ Iyer MK et al. Nature Genetics 47, 199–208 (2015)

Transcriptomics of (Cancer) Tissue

Common mRNA-seq Workflow

Dry lab Bioinformatics

Wetlab

(c)DNA Next Generation Sequencing (NGS)

ThermoFisher Ion Torrent

Personal Genome Machine (PGM)

PacBio RS II

Illumina HiSeq 2000

Illumina Sequencing

Ion Torrent Platform

RNAseq Data Analysis

Alternative splicing & Promoters

RNAseq data

Differential expression

Read-Through & Fusion Transcripts

SNV / InDels

Novel Transcripts

Detecting Single Nucleotide Variants and small indels

Errors occur at each stage

Primary Analysis
- Incorrect base calling
- Homopolymer errors
- Phasing

Secondary Analysis: Read mapping
- Incorrect ref. sequence
- Pseudogenes
- Indels
- Complex variants

Secondary Analysis: Variant calling
Variant calling filters are heuristics; therefore, they will generate false negatives and false positives and are best applied as soft filters.

Tertiary Analysis
- Incorrect gene annotation
- Contamination in reference databases
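The soft-filter idea from the variant calling slide can be sketched as follows: a variant that fails a heuristic is labelled rather than discarded, so a true variant wrongly flagged can still be recovered later. This is a minimal sketch only. The thresholds, the filter names and the variant records are hypothetical and do not come from the slides; the "PASS" label and semicolon-separated filter codes follow the convention of the VCF FILTER column.

```python
def apply_soft_filters(variant, min_depth=10, min_qual=30.0, min_alt_fraction=0.05):
    """Return a copy of the variant with a FILTER label instead of dropping it.

    All thresholds are hypothetical defaults chosen for illustration.
    """
    failed = []
    if variant["depth"] < min_depth:
        failed.append("LowDepth")
    if variant["qual"] < min_qual:
        failed.append("LowQual")
    if variant["depth"] > 0 and variant["alt_reads"] / variant["depth"] < min_alt_fraction:
        failed.append("LowAltFraction")
    labelled = dict(variant)
    # Soft filter: record which heuristics failed, keep the variant in the output.
    labelled["filter"] = ";".join(failed) if failed else "PASS"
    return labelled


# Hypothetical variant calls (not taken from the slides).
calls = [
    {"id": "var1", "depth": 120, "qual": 55.0, "alt_reads": 40},
    {"id": "var2", "depth": 6, "qual": 22.0, "alt_reads": 3},
]
labelled_calls = [apply_soft_filters(v) for v in calls]
for call in labelled_calls:
    print(call["id"], call["filter"])
```

Because every call is kept, a downstream reviewer can still inspect the flagged ones, which is what makes false negatives recoverable.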

False Negative: c.2237_2259del,insCCAACAAGGAA EGFR

False Negative BRAF p.V600R

Differential Gene Expression mRNA-seq Workflow

Fig 1. FastQC report on per-position base quality and overrepresented sequences

@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
+SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC

Fig 2. FASTQ format of one read
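To make the four-line FASTQ record above concrete, the sketch below parses it and converts the quality string into Phred scores. It assumes Phred+33 (Sanger) encoding, which the slide does not state; the function name is made up for illustration.

```python
def parse_fastq_record(lines):
    """Parse one 4-line FASTQ record into (read_id, sequence, phred_scores)."""
    header, sequence, separator, quality = [line.strip() for line in lines]
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality strings differ in length")
    # Assumption: Phred+33 encoding, so score = ASCII code - 33.
    scores = [ord(ch) - 33 for ch in quality]
    read_id = header[1:].split()[0]
    return read_id, sequence, scores


record = [
    "@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345",
    "GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC",
    "+SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345",
    "IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC",
]
read_id, sequence, scores = parse_fastq_record(record)
# A Phred score Q corresponds to an error probability of 10 ** (-Q / 10).
error_probabilities = [10 ** (-q / 10) for q in scores]
print(read_id, len(sequence), min(scores), max(scores))
```

Under this encoding the character "I" is Q40 and "9" is Q24, so the lower-quality bases of this read sit near its 3' end.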

mRNA-seq alignment

Courtesy: Wikipedia

mRNA derived cDNA fragments alignment to transcriptome

Alignment to transcriptome

Alignment to reference genome

RNA-Seq - Alignment

Alignment algorithms need:
• Reference sequence
• Transcriptome database (optional)

Algorithms commonly used for RNA-Seq alignment:
• TopHat
• STAR
• HISAT2

Visualization of NGS Transcriptomics and Genomics data

RNA-Seq - Alignment/QC

RNA-Seq - Stranded

Differential expression

Rakesh Kaundal et al.

Normalization of RNA-seq

• Total count (TC): Gene counts are divided by the total number of mapped reads.
• Upper Quartile (UQ): Very similar in principle to TC; the total counts are replaced by the upper quartile of counts.
• Median (Med): Also similar to TC; the total counts are replaced by the median counts.
• Trimmed Mean of M-values (TMM): This normalization method is implemented in the edgeR Bioconductor package (Robinson et al., 2010).
• Quantile (Q): First proposed in the context of microarray data, this normalization method consists of matching distributions of gene counts across lanes.
• Reads Per Kilobase per Million mapped reads (RPKM): This approach quantifies gene expression from RNA-Seq data by normalizing for the total transcript length and the number of sequencing reads.
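Three of the scaling methods listed above (TC, Med and RPKM) are simple enough to write out directly. The sketch below uses hypothetical counts and transcript lengths for a single sample, treats the sum of the listed gene counts as the total number of mapped reads, and takes the median over genes with non-zero counts; those choices are assumptions for illustration, not part of the slide.

```python
from statistics import median

# Hypothetical toy data: raw read counts and transcript lengths for one sample.
counts = {"geneA": 100, "geneB": 300, "geneC": 600, "geneD": 0}
lengths_bp = {"geneA": 1000, "geneB": 2000, "geneC": 3000, "geneD": 1500}


def total_count(counts):
    """TC: divide each gene count by the total number of mapped reads."""
    total = sum(counts.values())
    return {gene: c / total for gene, c in counts.items()}


def median_scaled(counts):
    """Med: divide each gene count by the median count.

    Assumption: the median is taken over genes with non-zero counts.
    """
    med = median(c for c in counts.values() if c > 0)
    return {gene: c / med for gene, c in counts.items()}


def rpkm(counts, lengths_bp):
    """RPKM: reads per kilobase of transcript per million mapped reads."""
    total = sum(counts.values())
    return {gene: c * 1e9 / (total * lengths_bp[gene]) for gene, c in counts.items()}


tc = total_count(counts)
med = median_scaled(counts)
rpkm_values = rpkm(counts, lengths_bp)
print(tc["geneA"], med["geneC"], rpkm_values["geneB"])
```

TC and Med only rescale a sample as a whole, whereas RPKM additionally corrects for transcript length, so only RPKM changes the relative ranking of genes within a sample.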

Reduce Dimensions PCA /QC

Principal Component Analysis (PCA) of a multivariate Gaussian distribution. PCA is a linear algorithm: it cannot capture complex polynomial relationships between features.
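The linear nature of PCA can be seen in a small sketch: the first principal component is the eigenvector of the covariance matrix with the largest eigenvalue, and each sample is projected onto it by a dot product. The version below finds that eigenvector by power iteration using only the standard library. The data are hypothetical (four samples, two genes, with the second gene exactly twice the first), and real analyses would normally use an established library instead.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (unit vector of largest variance, mean-centred rows)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration converges to the eigenvector with the largest eigenvalue,
    # provided the start vector is not orthogonal to it.
    vec = [1.0] * d
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return vec, centered


# Hypothetical expression values: rows are samples, columns are two genes.
samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
pc1, centered = first_principal_component(samples)
# Projection onto PC1 is a linear combination of the original features.
scores = [sum(x * w for x, w in zip(row, pc1)) for row in centered]
print(pc1, scores)
```

Because the projection is a weighted sum of features, any structure that is not a straight-line trend in the data is invisible to it.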

Reduce Dimensions t-SNE

t-distributed Stochastic Neighbor Embedding (t-SNE) is a non-linear algorithm.

Clustering Analysis/ QC

Fig 1: Example of hierarchical clustering: clusters are consecutively merged with their nearest clusters. The length of the vertical dendrogram lines reflects the nearness. (Jansen et al.)
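The merging procedure described in the caption can be sketched as a short agglomerative loop. The caption does not say how the distance between two clusters is measured, so single linkage (distance between the closest members) with Euclidean distance is assumed here; the sample coordinates are hypothetical.

```python
import math


def hierarchical_clustering(points):
    """Agglomerative clustering with single linkage; returns the merge history.

    Each entry is (members_of_cluster_1, members_of_cluster_2, merge_distance);
    the merge distance is the height of the corresponding dendrogram line.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage (an assumption): closest pair of members.
                dist = min(math.dist(points[i], points[j])
                           for i in clusters[a] for j in clusters[b])
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        dist, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), dist))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges


# Hypothetical samples in a two-gene expression space.
points = [(1.0, 1.0), (1.5, 1.0), (8.0, 8.0), (8.0, 9.0)]
merges = hierarchical_clustering(points)
for left, right, height in merges:
    print(left, right, round(height, 3))
```

In a QC setting the expectation is that replicates of the same condition merge at low heights, as samples 0 and 1 and samples 2 and 3 do here, before the two groups join.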

Differential expression DESeq2

A common difficulty in the analysis of read count data is the strong variance of Log Fold Change (LFC) estimates for genes with low read count.
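A small numeric illustration of why low counts are a problem: the same absolute difference of two reads gives a large log fold change at low counts and a negligible one at high counts. The counts below are hypothetical. The pseudocount shown at the end is only a crude stand-in for damping the estimate; it is not the shrinkage estimator that DESeq2 uses.

```python
import math


def log2_fold_change(treated, control, pseudocount=0.0):
    """log2 ratio of two counts, optionally damped by adding a pseudocount."""
    return math.log2((treated + pseudocount) / (control + pseudocount))


# Hypothetical counts: both genes differ by exactly two reads.
lfc_low = log2_fold_change(3, 1)          # low-count gene
lfc_high = log2_fold_change(1002, 1000)   # high-count gene

# Crude damping with a pseudocount of 1 (an illustrative assumption only).
lfc_low_damped = log2_fold_change(3, 1, pseudocount=1.0)

print(round(lfc_low, 3), round(lfc_high, 4), lfc_low_damped)
```

A difference of two reads is well within sampling noise, yet for the low-count gene it looks like a threefold change, which is why unshrunken LFC estimates for such genes are unreliable.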

Differential expression of genes

Test for differential gene expression, with correction for multiple testing
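The slide does not say which multiple-testing correction is meant. One widely used choice for gene-level tests is the Benjamini-Hochberg false discovery rate procedure, sketched below under that assumption with hypothetical p-values.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical per-gene p-values from a differential expression test.
pvalues = [0.01, 0.04, 0.03, 0.005]
padj = benjamini_hochberg(pvalues)
significant = [p <= 0.05 for p in padj]
print(padj, significant)
```

With thousands of genes tested at once, thresholding the raw p-values at 0.05 would let through many false positives; thresholding the adjusted values controls the expected fraction of false discoveries instead.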

Gene Set Enrichment Analysis: GO and KEGG database
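As a simple illustration of testing a GO term or KEGG pathway for enrichment, the sketch below computes a hypergeometric over-representation p-value. Note that this is over-representation analysis, a simpler technique than rank-based Gene Set Enrichment Analysis, and it is swapped in here only because it fits in a few lines. All numbers are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(universe, set_size, de_genes, overlap):
    """P(X >= overlap) for X ~ Hypergeometric(universe, set_size, de_genes).

    universe: number of genes tested
    set_size: genes in the universe annotated to the term or pathway
    de_genes: number of differentially expressed genes
    overlap:  differentially expressed genes that carry the annotation
    """
    upper = min(set_size, de_genes)
    total = comb(universe, de_genes)
    # math.comb returns 0 when k > n, so impossible terms drop out.
    tail = sum(comb(set_size, k) * comb(universe - set_size, de_genes - k)
               for k in range(overlap, upper + 1))
    return tail / total


# Hypothetical toy numbers: 10 genes tested, 4 annotated to the pathway,
# 3 differentially expressed, 2 of which are in the pathway.
p_value = hypergeom_enrichment_p(universe=10, set_size=4, de_genes=3, overlap=2)
print(p_value)
```

When many terms are tested, these p-values need the same multiple-testing correction as the gene-level tests.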

Fusion Gene Detection

Fig 1. RNA-seq mapping of short reads over exon-exon junctions; depending on where each end maps, the event can be defined as trans or cis. (Wikipedia)

FusionCatcher Tool

FusionCatcher outperforms other tools by using multiple aligners:
❖ Bowtie
❖ Bowtie2
❖ BLAT
❖ STAR

RNA-seq de novo Assembly

❖ Define the whole transcriptome without a reference.

❖ Trinity

RNA-seq Analysis Software

RNA-seq Molmed Courses

❖ Basic Course on 'R'
❖ Galaxy for NGS
❖ Workshop Ingenuity Pathway Analysis (IPA) + CLC Workbench / Ingenuity Variant Analysis
❖ Gene expression data analysis using R: How to make sense out of your RNA-Seq/microarray data

Take Home Message

Think before you start

Thank you for your attention

Hematology Mathijs A. Sanders Remco Hoogenboezem

CCBC Job van Riet Wesley van de Geer

CCBC@erasmusmc.nl https://ccbc.erasmusmc.nl

@ErasmusMC_CCBC

Harmen van de Werken

Cancer Computational Biology Center (CCBC)