RNA-seq bioinfo analysis - Université de Lille · An overview of RNA-seq data analysis Identify...


RNA-seq bioinfo analysis
Bilille training

13-14 June 2019
Camille Marchet - Pierre Pericard

General Introduction

Main goals of this course:

● Give an overview of RNA-seq data analysis

● Identify the key issues and the critical steps/parameters

Warning! This is NOT a course to train you as a bioinformatician, and it will NOT enable you to design an analysis pipeline tailored to your specific needs.

This course WILL give you the basic information needed to understand and run a generic RNA-seq analysis, its key steps and pitfalls, and to interact with the bioinformaticians/bioanalysts who can analyze your RNA-seq datasets.

Preliminary

Transcriptome/transcript

Transcriptomics

(Alternative) isoform

Splicing

Sequencing: overview

How to make cDNA libraries

- Extract RNA, convert to cDNA
- Pass to next-gen sequencer
- Millions to billions of reads

How to make cDNA?

- Prime mRNA with random hexamers (R6)
- Reverse transcriptase => cDNA first-strand synthesis
- Then second-strand synthesis

=> Illumina cDNA library

How to sequence (1)

- polyA+
- Ribo-Zero (human, mouse, plants, bacteria, …)

(total RNA = 90% rRNA, 1-2% mRNA)

In prokaryotes: no polyA (= no capture), no splicing (= less complex)

- paired-end
- replicates

How to sequence (2)

- RNA-seq reads are around 150-200 bp

- The number of detected transcripts increases with sequencing depth

- The expression measure is more precise with more depth

- 5 million reads can be enough to detect moderately to highly expressed genes in human

- 100 million reads should be preferred to detect lowly expressed genes (see for instance the saturation curves in “Differential expression in RNA-seq: a matter of depth.” Genome Res. 2011)

- These numbers depend on the species/tissues (complex splicing...)

- Keep replicates in mind

There are plenty of protocols...

from Clara Benoit Pilven’s PhD thesis

Resources: genomes, transcriptomes, annotations

From Rachel Legendre (Institut Pasteur)

FASTA/Q formats

FASTA format:
>61DFRAAXX100204:1:100:10494:3070/1
AAACAACAGGGCACATTGTCACTCTTGTATTTGAAAAACACTTTCCGGCCAT

FASTQ format:
@61DFRAAXX100204:1:100:10494:3070/1
AAACAACAGGGCACATTGTCACTCTTGTATTTGAAAAACACTTTCCGGCCAT
+
ACCCCCCCCCCCCCCCCCCCCCCCCCCCCCBC?CCCCCCCCC@@CACCCCCA
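A minimal sketch of how the four-line FASTQ layout (header starting with `@`, sequence, `+` separator, quality string) can be read. The names `FastqRecord` and `parse_fastq` are hypothetical helpers for illustration, and the sketch assumes unwrapped records (exactly four lines each). The record used in the example is a shortened toy version of the read shown on the slide.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    name: str      # header without the leading "@"
    sequence: str  # nucleotides
    quality: str   # one quality character per nucleotide


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from 4-line FASTQ text (assumes no wrapped sequences)."""
    stripped = (line.rstrip("\r\n") for line in lines)
    for header in stripped:
        if not header:
            continue  # tolerate blank lines between records
        sequence = next(stripped, None)
        separator = next(stripped, None)
        quality = next(stripped, None)
        if quality is None:
            raise ValueError(f"truncated FASTQ record: {header!r}")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch: {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Toy record: first 10 bases of the slide's read, with a made-up short name.
example = ["@read1/1", "AAACAACAGG", "+", "ACCCCCBC?C"]
for record in parse_fastq(example):
    print(record.name, record.sequence, record.quality)
```

In practice an established parser (for instance the one used by your QC or mapping tool) is preferable; this only illustrates the format.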

FASTA/Q formats

Quality    Error rate
10         10%
30         0.1%
40         0.01%
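The values in this table follow the Phred definition, error rate = 10^(-Q/10). A small sketch of the conversion follows. The function names are hypothetical, and the ASCII offset of 33 (Phred+33, as in recent Illumina FASTQ files) is an assumption; older files may use another offset.

```python
import math
from typing import List


def phred_to_error_probability(q: float) -> float:
    """Error probability for a Phred score: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Inverse conversion: Q = -10 * log10(p)."""
    return -10 * math.log10(p)


def decode_quality_string(quality: str, offset: int = 33) -> List[int]:
    """Turn FASTQ quality characters into Phred scores (assumes Phred+33)."""
    return [ord(char) - offset for char in quality]


for q in (10, 30, 40):
    print(q, f"{phred_to_error_probability(q):.2%}")

print(decode_quality_string("AC?@"))
```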

What people do with their RNA-seq

From J. Audoux’s PhD thesis

Nature Communications 8, Article number: 59 (2017)

It’s complicated

Outcomes of RNA-seq studies

- gene annotation
- protein/function prediction
- gene/splicing quantification
- isoform discovery/fusion transcripts/lncRNA...
- variant calling
- methylations
- RNA structures

Cleaning - Preprocessing

Known biases in RNA-seq

Biological sample:

● presence of pre-mRNA
● over-represented 3’ ends (RNA degradation)
● contaminations

Library preparation:

● DNase treatment failure
● PCR bias
● variable insert size (smaller than sequencing length)
● reads with no inserts

Sequencing:

● quality drops at the end of reads

Quality Control (QC)

Quality Control (QC) is important to:

● Check if your sample sequencing went well

● Know when you need to sequence again (sequencing platform QC fail)

● Identify potential problems that can be fixed, or not

● Follow the impact of preprocessing steps

⇒ FastQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

+ MultiQC (https://multiqc.info/) when comparing multiple datasets

Practical: Quality Control (QC)

Open Galaxy

Practical Part 1 “Cleaning -Preprocessing”

Loss of base call accuracy with increasing sequencing cycles Source: https://sequencing.qcfail.com

Position specific failures of flowcells

Source: https://sequencing.qcfail.com

Positional sequence bias in random primed libraries Source: https://sequencing.qcfail.com

Contamination with adapter dimers Source: https://sequencing.qcfail.com

Libraries contain technical duplication Source: https://sequencing.qcfail.com

GC content / Contamination ?

Cleaning has to be done in the reverse order of the one in which the errors were generated.

1. Sequencing errors: quality trimming and filtering, Ns removal
2. Library preparation: adapters removal
3. Sample contamination: rRNA, mito, other contaminants

Note 1: step 1 (quality trimming) is not considered critical anymore and could even hinder downstream tools/algorithms.

Note 2: If the reads are going to be aligned against a reference genome, this whole process can be skipped or applied very lightly
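Steps 1 and 2 above can be sketched on a single read as follows. This is an illustration only, not the algorithm used by Trimmomatic or any other tool: the function names, the quality threshold of 20, the Phred+33 offset and the exact-match adapter search are all assumptions, and the adapter sequence in the example is illustrative.

```python
from typing import Tuple


def trim_low_quality_3prime(sequence: str, quality: str,
                            min_quality: int = 20,
                            offset: int = 33) -> Tuple[str, str]:
    """Step 1: drop trailing bases whose Phred score is below min_quality."""
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < min_quality:
        end -= 1
    return sequence[:end], quality[:end]


def trim_adapter_3prime(sequence: str, quality: str, adapter: str,
                        min_overlap: int = 3) -> Tuple[str, str]:
    """Step 2: cut at a full adapter copy, or at an adapter prefix ending the read."""
    pos = sequence.find(adapter)
    if pos == -1:
        # Look for the longest adapter prefix that ends exactly at the 3' end.
        longest = min(len(adapter) - 1, len(sequence))
        for overlap in range(longest, min_overlap - 1, -1):
            if sequence.endswith(adapter[:overlap]):
                pos = len(sequence) - overlap
                break
    if pos == -1:
        return sequence, quality  # no adapter found, read unchanged
    return sequence[:pos], quality[:pos]


# Quality first, then adapter, following the order given on the slide.
seq, qual = trim_low_quality_3prime("ACGTAGATCGGAAGTT", "IIIIIIIIIIIIII##")
seq, qual = trim_adapter_3prime(seq, qual, adapter="AGATCGGAAGAGC")
print(seq, qual)
```

Real trimmers tolerate mismatches in the adapter and use sliding-window quality criteria, which is one reason to rely on them rather than on ad hoc code.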

Cleaning workflow (diagram):

Raw dataset (FastQC) => quality, N, adapters cleaning (Trimmomatic) => quality-cleaned dataset (FastQC) => rRNA removal (SortMeRNA) => contaminant removal (?) => Final dataset (FastQC)

To map or not to map ?

With reference RNA-seq

With-reference RNA-seq: for what purpose?

Mainly:

● Differential expression
  ○ between genes
  ○ between transcripts/isoforms

● Transcriptome assembly
  ○ variant calling
  ○ isoform discovery

What people do with their RNA-seq

From J. Audoux’s PhD thesis

RNA-seq with reference (workflow diagram)

Inputs: raw/cleaned sequencing dataset, reference genome sequence, reference genome annotation, reference transcriptome

- Mapping => aligned reads
- Count gene expression => gene counts, transcript counts
- Transcriptome assembly with reference => assembled transcripts
- Pseudo-mapping => gene pseudo-counts, transcript pseudo-counts

The champion: Tuxedo Suite, “Classic” version

Nat Protoc. 2012;7(3):562–578. doi:10.1038/nprot.2012.016

The champion: Tuxedo Suite, “Classic” version

Nat Protoc. 2012;7(3):562–578. doi:10.1038/nprot.2012.016

EXPIRED

The champion: Tuxedo Suite, New version

HISAT/HISAT2: splice aware aligner

StringTie: Transcriptome assembler

Ballgown: Differential expression analysis

Nat Protoc. 2016;11(9):1650–1667. doi:10.1038/nprot.2016.095

Counting gene expression from alignments

https://htseq.readthedocs.io/en/latest/count.html
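A toy sketch of the idea behind counting from alignments: each aligned read is assigned to the gene it overlaps, and reads overlapping no gene or several genes are set aside. This is a simplification under stated assumptions (a single chromosome, unstranded data, genes as single intervals without exon structure); the function name is hypothetical, and the `__no_feature` and `__ambiguous` labels mirror the special counters reported by htseq-count.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # (start, end), half-open, all on one chromosome


def count_reads_per_gene(genes: Dict[str, Interval],
                         reads: List[Interval]) -> Dict[str, int]:
    """Assign each read to the single gene it overlaps, if any."""
    counts = {name: 0 for name in genes}
    counts["__no_feature"] = 0
    counts["__ambiguous"] = 0
    for read_start, read_end in reads:
        hits = [name for name, (gene_start, gene_end) in genes.items()
                if read_start < gene_end and gene_start < read_end]
        if not hits:
            counts["__no_feature"] += 1   # read outside every gene
        elif len(hits) == 1:
            counts[hits[0]] += 1          # unambiguous assignment
        else:
            counts["__ambiguous"] += 1    # overlaps several genes
    return counts


genes = {"geneA": (100, 200), "geneB": (180, 300)}
reads = [(110, 160), (190, 240), (250, 300), (400, 450)]
print(count_reads_per_gene(genes, reads))
```

How ambiguous reads are handled is one of the parameters that differ between counting tools and modes, so it is worth checking before comparing count tables.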

RNA-seq with reference (workflow diagram, with tools)

Inputs: raw/cleaned sequencing dataset, reference genome sequence, reference genome annotation, reference transcriptome

- Mapping (HISAT2) => aligned reads
- Count gene expression (featureCounts) => gene counts, transcript counts
- Transcriptome assembly with reference (StringTie) => assembled transcripts
- Pseudo-mapping (Salmon) => gene pseudo-counts, transcript pseudo-counts

Practical: With reference RNA-seq

Open Galaxy

Practical Part 2 “With-reference RNA-seq analysis”

Recommended pipeline (as of June 2019)

● Transcriptome assembly: HISAT2 + StringTie (+ Ballgown ?)

● Transcript/Gene quantification with mapping: STAR + featureCounts

● Mapping-less transcript quantification: Kallisto or Salmon

De novo RNA-seq

Outline

1 - De novo assembly

2 - De novo variant call in transcriptomics

3 - Long reads

De novo assembly

This part goals:

● know the main steps of de novo transcriptome assemblers
● understand the difference between genomic and transcriptomic assemblies
● be aware of the main tools
● understand that paper/algorithm/implementation can diverge
● know the tools to evaluate/visualize an assembly

Challenge: get transcripts from cDNA

Assembly: preliminaries

Some vocabulary:

- k-mer: Any sequence of length k

- Contig: gap-less assembled sequence

- Graph: a set of nodes connected by edges
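To make the k-mer definition concrete, here is a short sketch that lists and counts the k-mers of a few reads. The function names are hypothetical, and reverse complements are ignored for simplicity.

```python
from collections import Counter
from typing import Iterable, List


def kmers(sequence: str, k: int) -> List[str]:
    """All substrings of length k, in order of appearance."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def count_kmers(reads: Iterable[str], k: int) -> Counter:
    """Number of occurrences of each k-mer over a set of reads."""
    counts: Counter = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


print(kmers("ACGTA", 3))
print(count_kmers(["ACGT", "CGTA"], 3))
```

A read of length L contains L - k + 1 k-mers, so a read shorter than k contributes none.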

Vocabulary: connected components

Vocabulary: De Bruijn graph

Redundancy in the De Bruijn graph

Path in the De Bruijn graph

assembly : a set of paths covering the graph (after some modifications of the graph…)
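A minimal sketch of a de Bruijn graph and of the sequence spelled by a path. It uses the common convention where nodes are (k-1)-mers and each distinct k-mer is an edge; other conventions exist (nodes as k-mers), and real assemblers also handle reverse complements, which this sketch ignores. The function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def build_de_bruijn_graph(reads: Iterable[str], k: int) -> Dict[str, List[str]]:
    """Nodes are (k-1)-mers; each distinct k-mer adds one edge prefix -> suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return {node: sorted(successors) for node, successors in graph.items()}


def spell_path(path: List[str]) -> str:
    """Sequence spelled by a path: first node, then the last base of each next node."""
    return path[0] + "".join(node[-1] for node in path[1:])


# Two overlapping reads: the shared k-mer GTC is stored only once (redundancy removed).
graph = build_de_bruijn_graph(["ACGTC", "GTCA"], k=3)
print(graph)
print(spell_path(["AC", "CG", "GT", "TC", "CA"]))
```

The path through all five nodes spells a sequence longer than either read, which is the basic idea of assembling contigs from paths.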

Vocabulary: alternative variants

Vocabulary: bubbles/bulges

Vocabulary: tips/dead ends

Assembly: preliminaries

An assembly is generally:

- smaller than the reference
- fragmented

- missing reads create gaps

- repeats fragment assemblies and reduce total size

Contrasting genome and transcriptome assemblies

genome

- uniform coverage
- single contig per locus
- double stranded
- theory: one massive graph per chromosome
- practice: repeats aggregate, contigs smaller than chromosomes

transcriptome

- exponentially distributed coverage
- multiple contigs per locus
- strand specific
- theory: thousands of small disjoint graphs, one per gene
- practice: gene families, Alu & TE, low coverage

Contrasting genome and transcriptome assemblies

Despite these differences, DNA-seq assembly methods apply:

- Construct a de Bruijn graph (same as DNA)
- Output contigs (same as DNA)
- Allow the same contig to be re-used in many different transcripts (new part)

Real instance graphs

Credit: ERABLE team (Lyon)

graph from shallow covered Drosophila dataset

zoomed-in bubbles (+ tips)

gene family

There is no single solution for assembly...

Conclusions of the GAGE benchmark: in terms of assembly quality, there is no single best assembler. This also applies to RNA-seq.

Main tools:

- TransAbyss, Robertson et al. Nat. Methods 2010 https://github.com/bcgsc/transabyss

- IDBA-Tran, Peng et al. Bioinformatics 2013 https://github.com/loneknightpy/idba

- SOAPdenovo-Trans, Xie et al. Bioinformatics 2014 https://github.com/aquaskyline/SOAPdenovo2

- Trinity, Grabherr et al. Nat. Biotechnol. 2011 https://github.com/trinityrnaseq/trinityrnaseq/wiki

- rnaSPAdes, Bushmanova et al. bioRxiv 2018 http://cab.spbu.ru/software/spades/

Assemblers recent benchs

from rnaSPAdes preprint: https://www.biorxiv.org/content/biorxiv/early/2018/09/18/420208.full.pdf

The main building blocks in theory

1. (optional) correct the reads (for instance BayesHammer in rnaSPAdes)
2. build a graph from the reads (remove k-mers seen once)
3. remove likely sequencing errors (tips)
4. remove known patterns (bubbles)
5. return simple paths (i.e. contigs), allow nodes to be used several times
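Steps 2 and 5 of this list can be sketched in a few lines: discard k-mers seen only once, build the graph, and output the non-branching paths as contigs. This is a toy illustration, not the implementation of any assembler named in this course. It skips read correction, tip removal and bubble removal (steps 1, 3, 4), ignores reverse complements, and does not report isolated cycles. Function names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Iterable, List, Set


def solid_kmers(reads: Iterable[str], k: int, min_count: int = 2) -> Set[str]:
    """Keep k-mers seen at least min_count times (drops k-mers seen once)."""
    counts = Counter(read[i:i + k] for read in reads
                     for i in range(len(read) - k + 1))
    return {kmer for kmer, n in counts.items() if n >= min_count}


def simple_paths(kmer_set: Set[str]) -> List[str]:
    """Spell every maximal non-branching path of the de Bruijn graph."""
    succ, pred = defaultdict(set), defaultdict(set)
    for kmer in kmer_set:
        succ[kmer[:-1]].add(kmer[1:])
        pred[kmer[1:]].add(kmer[:-1])
    nodes = set(succ) | set(pred)

    def is_internal(node: str) -> bool:
        # Exactly one way in and one way out: the path continues through it.
        return len(succ[node]) == 1 and len(pred[node]) == 1

    contigs = []
    for node in sorted(nodes):
        if is_internal(node):
            continue  # paths start only at branching or end nodes
        for nxt in sorted(succ[node]):
            contig = node + nxt[-1]
            while is_internal(nxt):
                nxt = next(iter(succ[nxt]))
                contig += nxt[-1]
            contigs.append(contig)
    return contigs


# Two correct reads and one read carrying a sequencing error (T -> A).
reads = ["ACGTTG", "ACGTTG", "ACGTAG"]
print(simple_paths(solid_kmers(reads, k=4, min_count=1)))  # error kept: fragmented
print(simple_paths(solid_kmers(reads, k=4, min_count=2)))  # error removed: one contig
```

The same filter that removes the error would also remove a genuinely rare transcript, which is why low-coverage transcripts are hard to assemble.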

Multi-k assembly

From Rayan Chikhi (http://evomicsorg.wpengine.netdna-cdn.com/wp-content/uploads/2016/01/Assembly-2016-v2.1.pdf)

Warning: what’s in the paper is different than what’s in the implementation...

Example of details in practice: mercy k-mers

Trinity assembler

- Inchworm: de Bruijn graph construction, part 1

- Chrysalis: de Bruijn graph construction, part 2

- Butterfly: graph traversal using reads, isoforms enumeration

Trinity: detail

Trinity output

>TRINITY_DN1000_c115_g5_i1 len=247 path=[31015:0-148 23018:149-246]

AATCTTTTTTGGTATTGGCAGTACTGTGCTCTGGGTAGTGATTAGGGCAAAAGAAGACAC

ACAATAAAGAACCAGGTGTTAGACGTCAGCAAGTCAAGGCCTTGGTTCTCAGCAGACAGA

AGACAGCCCTTCTCAATCCTCATCCCTTCCCTGAACAGACATGTCTTCTGCAAGCTTCTC

CAAGTCAGTTGTTCACAGGAACATCATCAGAATAAATTTGAAATTATGATTAGTATCTGA

TAAAGCA

- Trinity read cluster 'TRINITY_DN1000_c115'

- gene 'g5'

- isoform 'i1'

- path=[31015:0-148 23018:149-246] indicates the path traversed in the Trinity de Bruijn graph to construct that transcript
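A small sketch that splits a header of this form into its parts. The regular expression and the function name are assumptions built from the single example above; headers produced by other Trinity versions may differ.

```python
import re
from typing import Dict, List, Tuple, Union

# Pattern written from the example header above (an assumption, not a Trinity spec).
HEADER_PATTERN = re.compile(
    r"^>?(?P<cluster>TRINITY_DN\d+_c\d+)_(?P<gene>g\d+)_(?P<isoform>i\d+)"
    r"\s+len=(?P<length>\d+)\s+path=\[(?P<path>[^\]]*)\]"
)

PathStep = Tuple[int, int, int]  # (graph node id, start, end) on the transcript


def parse_trinity_header(header: str) -> Dict[str, Union[str, int, List[PathStep]]]:
    """Extract cluster, gene, isoform, length and graph path from a header line."""
    match = HEADER_PATTERN.match(header.strip())
    if match is None:
        raise ValueError(f"unrecognized Trinity header: {header!r}")
    path = []
    for step in match["path"].split():
        node, span = step.split(":")
        start, end = span.split("-")
        path.append((int(node), int(start), int(end)))
    return {
        "cluster": match["cluster"],
        "gene": match["gene"],
        "isoform": match["isoform"],
        "length": int(match["length"]),
        "path": path,
    }


header = ">TRINITY_DN1000_c115_g5_i1 len=247 path=[31015:0-148 23018:149-246]"
print(parse_trinity_header(header))
```

Grouping contigs by cluster and gene in this way is a simple first step toward gene-level summaries of a Trinity assembly.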

Normalization effects on assembly (example of Trinity) From Brian

Errors made by assemblers

Assembly quality assessment

In transcriptome assemblies

● N50 is not very useful.
● Unreasonable isoform annotation for long transcripts drives N50 higher.
● Very sensitive reconstruction of short, lowly expressed transcripts leads to lower N50.
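For reference, N50 is the length L such that contigs of length at least L contain at least half of the total assembled bases. The sketch below computes it and illustrates the last point with made-up lengths: adding correctly reconstructed short transcripts lowers N50 even though the assembly is more complete.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Length of the contig at which the cumulative sum reaches half the total."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contig lengths given")


long_only = [1000, 800]                 # made-up contig lengths
with_short = long_only + [200] * 10     # same assembly plus ten short transcripts
print(n50(long_only), n50(with_short))
```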

Main tools:
● rnaQuast http://cab.spbu.ru/software/rnaquast/
● Transrate http://hibberdlab.com/transrate/

Visualization: Bandage https://rrwick.github.io/Bandage/

Meta-practices

1. Read surveys, Twitter, blogs
2. Pick two assemblers
3. Run each assembler at least two times (different parameters)
4. Compare assemblies
5. If possible, visualize them

An assembly is not the absolute truth, it is a mostly complete, generally fragmented and mostly accurate hypothesis

Practical: Trinity assembly

State of the research

New developments:

1. Long reads are coming

2. Efficient assemblers

3. Best-practice protocols

4. Assembly-based variant calling (mostly for genomics)

Challenges that remain:

-Splice isoforms vs. paralogs

-Sequencing errors vs. polymorphisms

Assembly does not output all variants

KISSPLICE

Goal: instead of assembling full-length transcripts, KISSPLICE (Sacomoto et al. 2012) focuses on assembling ONLY the bubbles that contain events, and on enumerating as many of them as possible

KISSPLICE: graph cleaning + local assembly

example: discard if ratio is < 0.05
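One way to read this ratio filter, as a hedged sketch: at a branching node, an outgoing edge is discarded when its coverage is less than 5% of the total coverage leaving the node, on the assumption that such rare branches are likely sequencing errors. The exact definition of the ratio in KISSPLICE is not given here, so the interpretation, the function name and the data layout below are assumptions.

```python
from typing import Dict


def filter_edges_by_relative_coverage(outgoing: Dict[str, int],
                                      min_ratio: float = 0.05) -> Dict[str, int]:
    """Keep edges whose share of the node's outgoing coverage is >= min_ratio."""
    total = sum(outgoing.values())
    if total == 0:
        return {}
    return {target: coverage for target, coverage in outgoing.items()
            if coverage / total >= min_ratio}


# Made-up coverages: the second branch carries 4% of the reads and is dropped.
print(filter_edges_by_relative_coverage({"CGTT": 96, "CGTA": 4}))
# Here the minor branch carries 10% and is kept as a candidate variant.
print(filter_edges_by_relative_coverage({"CGTT": 90, "CGTA": 10}))
```

A relative threshold adapts to expression level, unlike an absolute count cutoff, which matters because transcript coverage is far from uniform.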

Variants in local assembly

KISSPLICE’s output

Post-processings

KISSPLICE case studies

Discover splicing events: Benoit Pilven et al. 2018

Farline: mapping
B: found only by Kissplice (not annotated)
C: found only by Kissplice (paralog)
D: found only by mapping (Alu repeat)

Discover SNPs in pooled RNA-seq data: Lopez-Maestre et al. 2016

Practical: Kissplice

Long reads : the future of transcriptomics

1. Long read transcriptomics sequencing technologies

2. Available pipelines

3. Current limitations

PACBIO vs Nanopore

from Reuter et al. 2015

Error rates and profiles

From Weirather et al.

The PacBio CCS

Nanopore RNA protocols

(from Oxford Nanopore website)

direct RNA protocol

- no dependence on RT or PCR
- detect modifications (methylations)
- more material is needed, fewer reads

Nanopore evolution

From Rang et al.

New exon-exon junctions

+ quantification seems possible (see Sessegolo et al. 2019 (bioRxiv) and Oikonomopoulos et al. Sci. Rep. 2016)

Some tools to work with RNA long reads

Full pipelines:

- Mandalorion (Byrne et al. 2017, exploits Nanopore reads with reference)
- Tofu (Gordon et al. 2015, for PacBio CCS only, with/without reference)
- TAPIS (Abdel-Ghany et al. 2016, with reference)
- FLAIR (Tang et al. 2018 (bioRxiv), Nanopore with reference)

Clustering:

- IsOnClust (Sahlin et al. RECOMB 2019, for PacBio)
- CARNAC-LR (Marchet et al. NAR 2018)

Correction:

- No dedicated tool at the moment; some genomic tools work, see Lima et al. 2019 for a survey

What was not covered during this session

-bacterial RNA

-genome-guided assembly

-metatranscriptomics

-single cell RNA

-tools specialized for ncRNAs, smallRNAs

-tools specialized for fusion transcripts

-transcript annotation (https://busco.ezlab.org/ for instance)

-up next: differential study (statistics for RNA-seq)
