
RNA-seq bioinfo analysis - Université de Lille · An overview of RNA-seq data analysis Identify...

Date post: 19-Aug-2020
RNA-seq bioinfo analysis, Bilille training, 13-14 June 2019, Camille Marchet - Pierre Pericard
Transcript
Page 1: RNA-seq bioinfo analysis - Université de Lille · An overview of RNA-seq data analysis Identify the (key issues/points) (critical steps/parameters) 3. Warning ! This is NOT a course

RNA-seq bioinfo analysis

Bilille training

13-14 June 2019

Camille Marchet - Pierre Pericard


General Introduction


Goals

The main goals of this course:

● Give an overview of RNA-seq data analysis

● Identify the key issues and the critical steps and parameters


Warning!

This is NOT a course to train you as a bioinformatician, and this course will NOT enable you to design an analysis pipeline tailored to your specific needs.

This course WILL give you the basic information needed to understand and run a generic RNA-seq analysis, its key steps and challenges, and to interact with the bioinformaticians/bioanalysts who can analyze your RNA-seq datasets.


Preliminary

Transcriptome/transcript

Transcriptomics

(Alternative) isoform

Splicing


Sequencing: overview


How to make cDNA libraries

- Extract RNA, convert to cDNA
- pass to next-gen sequencer
- millions to billions of reads

How to make cDNA?

- Prime mRNA with random hexamers (R6)
- reverse transcriptase => cDNA first strand synthesis
- then second strand

=> Illumina cDNA library


How to sequence (1)

- polyA+ selection
- Ribo-Zero rRNA depletion (human, mouse, plants, bacteria, …)

(total RNA is about 90% rRNA and only 1-2% mRNA)

In prokaryotes: no polyA (= no capture), no splicing (= less complex)

- paired-end
- replicates


How to sequence (2)


- RNA-seq reads are around 150-200 bp

- the number of detected transcripts increases with the sequencing depth

- the expression measure is more precise with more depth

- 5 million reads can be enough to detect moderately to highly expressed genes in human

- 100 million reads are preferable to detect lowly expressed genes (see for instance the saturation curves in “Differential expression in RNA-seq: a matter of depth.” Genome Res. 2011)

- these numbers depend on the species/tissues (complex splicing...)

- keep replicates in mind


There are plenty of protocols...

From Clara Benoit Pilven’s PhD thesis

Resources: genomes, transcriptomes, annotations

From Rachel Legendre (Institut Pasteur)


FASTA/Q formats

FASTA format:

>61DFRAAXX100204:1:100:10494:3070/1
AAACAACAGGGCACATTGTCACTCTTGTATTTGAAAAACACTTTCCGGCCAT

FASTQ format:

@61DFRAAXX100204:1:100:10494:3070/1
AAACAACAGGGCACATTGTCACTCTTGTATTTGAAAAACACTTTCCGGCCAT
+
ACCCCCCCCCCCCCCCCCCCCCCCCCCCCCBC?CCCCCCCCC@@CACCCCCA
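To make the four-line FASTQ layout concrete, here is a minimal reading sketch in Python. It assumes the simple case of exactly four lines per record (header, sequence, "+" separator, quality string) with no line wrapping. The function name read_fastq and the toy record are illustrative only and do not come from the course material.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) from a 4-lines-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or trailing empty line)
            return
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], seq, qual


# Toy record (hypothetical data, not taken from a real run).
toy = io.StringIO("@read1/1\nACGTAC\n+\nIIII5#\n")
records = list(read_fastq(toy))
print(records)
```

In practice an established parser (for instance the one shipped with Biopython) is a safer choice than a hand-written reader.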


FASTA/Q formats

Quality    Error rate
10         10%
20         1%
30         0.1%
40         0.01%
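The table follows the Phred relation: error probability p = 10^(-Q/10). The sketch below reproduces it and decodes a FASTQ quality string into scores. It assumes the Phred+33 ASCII encoding (Sanger / recent Illumina); older files may use a different offset. Function names are illustrative.

```python
def phred_to_error_prob(q: int) -> float:
    """Error probability for a Phred quality score: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def decode_quality(qual: str, offset: int = 33) -> list:
    """Convert a FASTQ quality string to integer scores (offset 33 is an assumption)."""
    return [ord(char) - offset for char in qual]


scores = decode_quality("I5+!")
for q in (10, 20, 30, 40):
    print(q, f"{phred_to_error_prob(q):.4%}")
```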


What people do with their RNA-seq

From J. Audoux’s PhD thesis

Nature Communications 8, Article number: 59 (2017)

It’s complicated


Outcomes of RNA-seq studies

- gene annotation
- protein/function prediction
- gene/splicing quantification
- isoform discovery/fusion transcripts/lncRNA...
- variant calling
- methylations
- RNA structures

Cleaning - Preprocessing


Known biases in RNA-seq

Biological sample:

● presence of pre-mRNA
● over-representation of 3' ends (RNA degradation)
● contamination

Library preparation:

● DNase treatment failure
● PCR bias
● variable insert size (can be smaller than the read length)
● reads with no insert

Sequencing:

● quality drops at the end of reads


Quality Control (QC)

Quality Control (QC) is important to:

● Check if your sample sequencing went well

● Know when you need to sequence again (sequencing platform QC fail)

● Identify potential problems that can be fixed, or not

● Follow the impact of preprocessing steps

⇒ FastQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

+ MultiQC (https://multiqc.info/) when comparing multiple datasets

Practical: Quality Control (QC)

Open Galaxy

Practical Part 1 “Cleaning - Preprocessing”

Loss of base call accuracy with increasing sequencing cycles

Source: https://sequencing.qcfail.com

Position specific failures of flowcells

Source: https://sequencing.qcfail.com

Positional sequence bias in random primed libraries

Source: https://sequencing.qcfail.com

Contamination with adapter dimers

Source: https://sequencing.qcfail.com

Libraries contain technical duplication

Source: https://sequencing.qcfail.com

GC content / Contamination ?


GC content / Contamination ?


Cleaning - Preprocessing

Cleaning has to be done in the reverse order in which the errors were introduced.

1. Sequencing errors: quality trimming and filtering, removal of Ns
2. Library preparation: adapter removal
3. Sample contamination: rRNA, mitochondrial sequences, other contaminants

Note 1: step 1 (quality trimming) is no longer considered critical and could even hinder downstream tools/algorithms.

Note 2: if the reads are going to be aligned against a reference genome, this whole process can be skipped or applied very lightly.
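As an illustration of steps 1 and 2, here is a deliberately naive sketch: it trims low-quality bases from the 3' end, then clips everything from the first exact adapter match. It is not how Trimmomatic or any other real tool works internally (those use sliding windows, partial and inexact adapter matching, etc.). The threshold, the Phred+33 offset, the adapter fragment and the toy read are all assumptions made for the example.

```python
def trim_3prime(seq: str, qual: str, min_q: int = 20, offset: int = 33):
    """Remove trailing bases whose Phred score is below min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def clip_adapter(seq: str, qual: str, adapter: str):
    """Cut the read at the first exact occurrence of the adapter, if any."""
    pos = seq.find(adapter)
    if pos == -1:
        return seq, qual
    return seq[:pos], qual[:pos]


# Hypothetical read: 10 bp insert + adapter fragment + 3 low-quality bases.
adapter = "AGATCGG"  # illustrative adapter fragment
seq = "ACGTACGTTT" + adapter + "TTT"
qual = "I" * 17 + "#" * 3

# Same order as on the slide: sequencing errors first, then library preparation.
seq1, qual1 = trim_3prime(seq, qual)
seq2, qual2 = clip_adapter(seq1, qual1, adapter)
print(seq2, qual2)
```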


Cleaning - Preprocessing

Workflow (diagram):

1. Raw dataset (FastQC)
2. Quality, N and adapter cleaning (Trimmomatic)
3. Quality-cleaned dataset (FastQC)
4. rRNA removal (SortMeRNA)
5. Contaminant removal (?)
6. Final dataset (FastQC)

To map or not to map ?


With reference RNA-seq


With-reference RNA-seq: for what purpose?

Mainly:

● Differential expression
  ○ between genes
  ○ between transcripts/isoforms

● Transcriptome assembly
  ○ variant calling
  ○ isoform discovery

What people do with their RNA-seq

From J. Audoux’s PhD thesis

RNA-seq w/ ref

Workflow (diagram):

- Inputs: raw/cleaned sequencing dataset, reference genome sequence, reference genome annotation, reference transcriptome
- Mapping => aligned reads
- Count gene expression => gene counts
- Transcriptome assembly with reference => assembled transcripts, gene counts, transcript counts
- Pseudo-mapping => gene pseudo-counts, transcript pseudo-counts

The champion: Tuxedo Suite, “Classic” version

Nat Protoc. 2012;7(3):562–578. doi:10.1038/nprot.2012.016

The champion: Tuxedo Suite, “Classic” version

Nat Protoc. 2012;7(3):562–578. doi:10.1038/nprot.2012.016

EXPIRED


The champion: Tuxedo Suite, New version

HISAT/HISAT2: splice aware aligner

StringTie: Transcriptome assembler

Ballgown: Differential expression analysis

Nat Protoc. 2016;11(9):1650–1667. doi:10.1038/nprot.2016.095

Counting gene expression from alignments

https://htseq.readthedocs.io/en/latest/count.html
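To give an idea of what "counting from alignments" means, here is a toy sketch loosely inspired by the "union" overlap mode described in the htseq-count documentation linked above: a read is assigned to a gene only if it overlaps exactly one gene, otherwise it is counted as having no feature or as ambiguous. Everything here is simplified and assumed for illustration (half-open coordinates, one interval per gene, no strand, no spliced reads); it is not the implementation of htseq-count or featureCounts.

```python
from collections import Counter


def count_reads(genes: dict, reads: list) -> Counter:
    """Assign each aligned read (chrom, start, end) to the unique gene it overlaps."""
    counts = Counter()
    for chrom, start, end in reads:
        hits = {
            name
            for name, (g_chrom, g_start, g_end) in genes.items()
            if g_chrom == chrom and start < g_end and g_start < end
        }
        if len(hits) == 1:
            counts[hits.pop()] += 1
        elif not hits:
            counts["__no_feature"] += 1
        else:
            counts["__ambiguous"] += 1
    return counts


# Hypothetical annotation and alignments (half-open intervals).
genes = {"geneA": ("chr1", 100, 200), "geneB": ("chr1", 180, 300)}
reads = [("chr1", 110, 160), ("chr1", 170, 220), ("chr1", 250, 290), ("chr1", 400, 450)]
counts = count_reads(genes, reads)
print(dict(counts))
```

The second read overlaps both genes and is therefore set aside as ambiguous, which is why overlapping annotations and the chosen counting mode matter for the final gene counts.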


RNA-seq w/ ref

Workflow (diagram, with tools):

- Inputs: raw/cleaned sequencing dataset, reference genome sequence, reference genome annotation, reference transcriptome
- Mapping (HISAT2) => aligned reads
- Count gene expression (featureCounts) => gene counts
- Transcriptome assembly with reference (StringTie) => assembled transcripts, gene counts, transcript counts
- Pseudo-mapping (Salmon) => gene pseudo-counts, transcript pseudo-counts

Practical: With reference RNA-seq

Open Galaxy

Practical Part 2 “With-reference RNA-seq analysis”


Recommended pipeline (as of June 2019)

● Transcriptome assembly: HISAT2 + StringTie (+ Ballgown ?)

● Transcript/Gene quantification with mapping: STAR + featureCounts

● Mapping-less transcript quantification: Kallisto or Salmon


De novo RNA-seq


Outline

1 - De novo assembly

2 - De novo variant call in transcriptomics

3 - Long reads


De novo assembly

Goals of this part:

● know the main steps of de novo transcriptome assemblers
● understand the difference between genomic and transcriptomic assemblies
● be aware of the main tools
● understand that paper, algorithm and implementation can diverge
● know the tools to evaluate/visualize an assembly


Challenge: get transcripts from cDNA


Assembly: preliminaries

Some vocabulary:

- k-mer: Any sequence of length k

- Contig: gap-less assembled sequence

- Graph:


Vocabulary: connected components


Vocabulary: De Bruijn graph


Redundancy in the De Bruijn graph


Path in the De Bruijn graph

assembly : a set of paths covering the graph (after some modifications of the graph…)
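The vocabulary above (k-mers, de Bruijn graph, paths) can be tied together with a small sketch. It builds a node-centric de Bruijn graph (nodes are k-mers, an edge links two k-mers overlapping on k-1 bases) and spells the sequence of a non-branching path. This is a teaching simplification under explicit assumptions: toy reads, no sequencing errors, no reverse complements, no coverage information. Function names are illustrative.

```python
def kmers(seq: str, k: int) -> list:
    """All k-mers (substrings of length k) of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_de_bruijn(reads: list, k: int):
    """Return successor and predecessor maps of a node-centric de Bruijn graph."""
    nodes = {km for read in reads for km in kmers(read, k)}
    succ = {node: set() for node in nodes}
    pred = {node: set() for node in nodes}
    for node in nodes:
        for base in "ACGT":
            nxt = node[1:] + base  # candidate neighbour sharing k-1 bases
            if nxt in nodes:
                succ[node].add(nxt)
                pred[nxt].add(node)
    return succ, pred


def extend_right(start: str, succ: dict, pred: dict) -> str:
    """Spell the non-branching path starting at `start` (stops at any branch)."""
    contig, node, seen = start, start, {start}
    while len(succ[node]) == 1:
        nxt = next(iter(succ[node]))
        if len(pred[nxt]) != 1 or nxt in seen:
            break
        contig += nxt[-1]
        node = nxt
        seen.add(nxt)
    return contig


# Two hypothetical overlapping reads from the same molecule.
reads = ["ATGGCGT", "GCGTGCA"]
succ, pred = build_de_bruijn(reads, k=4)
contig = extend_right("ATGG", succ, pred)
print(contig)
```

The two reads share the k-mer GCGT, so the graph is a single chain and the path spells one contig covering both reads.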


Vocabulary: alternative variants


Vocabulary: bubbles/bulges


Vocabulary: tips/dead ends


Assembly: preliminaries

An assembly generally is:

- smaller than the reference
- fragmented

Why:

- missing reads create gaps
- repeats fragment assemblies and reduce total size


Contrasting genome and transcriptome assemblies

Genome:

- uniform coverage
- single contig per locus
- double stranded
- theory: one massive graph per chromosome
- practice: repeats aggregate, contigs smaller than chromosomes

Transcriptome:

- exponentially distributed coverage
- multiple contigs per locus
- strand specific
- theory: thousands of small disjoint graphs, one per gene
- practice: gene families, Alu & TE repeats, lowly covered transcripts


Contrasting genome and transcriptome assemblies

Despite these differences, DNA-seq assembly methods apply:

- Construct a de Bruijn graph (same as DNA)
- Output contigs (same as DNA)
- Allow the same contig to be re-used in many different transcripts (new part)

Real instance graphs

Credit: ERABLE team (Lyon)

graph from shallow covered Drosophila dataset

zoomed-in bubbles (+ tips)

gene family


There is no single solution for assembly...

Conclusion of the GAGE benchmark: in terms of assembly quality, there is no single best assembler. This also applies to RNA-seq.

Main tools:

- TransAbyss, Robertson et al. Nat. Methods 2010 https://github.com/bcgsc/transabyss

- IDBA-Tran, Peng et al. Bioinformatics 2013 https://github.com/loneknightpy/idba

- SOAPdenovo-Trans, Xie et al. Bioinformatics 2014 https://github.com/aquaskyline/SOAPdenovo2

- Trinity, Grabherr et al. Nat. Biotechnol. 2011 https://github.com/trinityrnaseq/trinityrnaseq/wiki

- rnaSPAdes, Bushmanova et al. bioRxiv 2018 http://cab.spbu.ru/software/spades/


Recent assembler benchmarks

From the rnaSPAdes preprint: https://www.biorxiv.org/content/biorxiv/early/2018/09/18/420208.full.pdf

The main building blocks in theory

1. (optional) correct the reads (for instance BayesHammer in rnaSPAdes)
2. build a graph from the reads (remove k-mers seen once)
3. remove likely sequencing errors (tips)
4. remove known patterns (bubbles)
5. return simple paths (i.e. contigs), allow nodes to be used several times
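Step 2 ("remove k-mers seen once") can be sketched as a simple abundance filter: count every k-mer across the reads and keep only those seen at least twice, on the assumption that k-mers created by a sequencing error are rare. The min_count threshold and the toy reads are illustrative; real assemblers choose and apply such thresholds in more elaborate ways.

```python
from collections import Counter


def solid_kmers(reads: list, k: int, min_count: int = 2) -> set:
    """Keep k-mers observed at least `min_count` times across all reads."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return {kmer for kmer, count in counts.items() if count >= min_count}


# Hypothetical reads: the third one carries a substitution (A -> T at position 5).
reads = ["ACGTAC", "ACGTAC", "ACGTTC"]
kept = solid_kmers(reads, k=4)
print(sorted(kept))
```

The k-mers CGTT and GTTC, which exist only because of the erroneous base, are seen once and dropped. The flip side, relevant for RNA-seq, is that genuinely rare transcripts can lose k-mers the same way.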


Multi-k assembly

From Rayan Chikhi (http://evomicsorg.wpengine.netdna-cdn.com/wp-content/uploads/2016/01/Assembly-2016-v2.1.pdf)


Warning: what’s in the paper is different than what’s in the implementation...


Example of details in practice: mercy k-mers

Trinity assembler

- Inchworm: de Bruijn graph construction, part 1

- Chrysalis: de Bruijn graph construction, part 2

- Butterfly: graph traversal using reads, isoform enumeration

Trinity: detail


Trinity output

>TRINITY_DN1000_c115_g5_i1 len=247 path=[31015:0-148 23018:149-246]

AATCTTTTTTGGTATTGGCAGTACTGTGCTCTGGGTAGTGATTAGGGCAAAAGAAGACAC

ACAATAAAGAACCAGGTGTTAGACGTCAGCAAGTCAAGGCCTTGGTTCTCAGCAGACAGA

AGACAGCCCTTCTCAATCCTCATCCCTTCCCTGAACAGACATGTCTTCTGCAAGCTTCTC

CAAGTCAGTTGTTCACAGGAACATCATCAGAATAAATTTGAAATTATGATTAGTATCTGA

TAAAGCA

- Trinity read cluster 'TRINITY_DN1000_c115'

- gene 'g5'

- isoform 'i1'

- path=[31015:0-148 23018:149-246] indicates the path traversed in the Trinity de Bruijn graph to construct that transcript
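The header fields described above can be extracted with a small parser. This is a sketch based only on the single header shown on this slide; other Trinity versions may format headers differently, so the regular expression should be treated as an assumption. The function name is illustrative.

```python
import re

# Pattern derived from the example header on the slide (assumed, not official).
HEADER_RE = re.compile(
    r"^>(?P<cluster>TRINITY_DN\d+_c\d+)_(?P<gene>g\d+)_(?P<isoform>i\d+)"
    r"\s+len=(?P<length>\d+)"
    r"\s+path=\[(?P<path>[^\]]*)\]"
)


def parse_trinity_header(line: str) -> dict:
    """Split a Trinity FASTA header into cluster, gene, isoform, length and path."""
    match = HEADER_RE.match(line)
    if match is None:
        raise ValueError(f"unrecognised header: {line!r}")
    path = []
    for step in match["path"].split():
        node, span = step.split(":")
        start, end = span.split("-")
        path.append((int(node), int(start), int(end)))
    return {
        "cluster": match["cluster"],
        "gene": match["gene"],
        "isoform": match["isoform"],
        "length": int(match["length"]),
        "path": path,
    }


header = ">TRINITY_DN1000_c115_g5_i1 len=247 path=[31015:0-148 23018:149-246]"
info = parse_trinity_header(header)
print(info)
```

Note that the two path segments (0-148 and 149-246) together cover 247 positions, consistent with len=247.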


Normalization effects on assembly (example of Trinity)

From Brian Haas

Errors made by assemblers


Assembly quality assessment

In transcriptome assemblies:

● N50 is not very useful
● unreasonable isoform annotation for long transcripts drives higher N50
● very sensitive reconstruction of short, lowly expressed transcripts leads to lower N50

Main tools:

● rnaQUAST http://cab.spbu.ru/software/rnaquast/
● TransRate http://hibberdlab.com/transrate/
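For reference, N50 is the length L such that contigs of length at least L contain half of the total assembled bases. The sketch below computes it and shows, on made-up lengths, how adding many short transcripts lowers the value even though nothing was lost from the assembly.

```python
def n50(lengths: list) -> int:
    """Length L such that contigs of length >= L hold at least half of all bases."""
    if not lengths:
        raise ValueError("no contig lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for positive lengths")


# Hypothetical assemblies: the second one additionally recovers ten short transcripts.
long_only = [1000, 400, 300, 200, 100]
with_short = long_only + [100] * 10
print(n50(long_only), n50(with_short))
```

The more sensitive assembly gets a lower N50, which illustrates why the metric is a poor quality indicator for transcriptomes.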


Visualization: Bandage

https://rrwick.github.io/Bandage/

Meta-practices

1. Read surveys, Twitter, blogs
2. Pick two assemblers
3. Run each assembler at least two times (different parameters)
4. Compare assemblies
5. If possible, visualize them

An assembly is not the absolute truth, it is a mostly complete, generally fragmented and mostly accurate hypothesis


Practical: Trinity assembly


State of the research

New developments:

1. Long reads are coming

2. Efficient assemblers

3. Best-practice protocols

4. Assembly-based variant calling (mostly for genomics)

Challenges that remain:

-Splice isoforms vs. paralogs

-Sequencing errors vs. polymorphisms


Assembly does not output all variants


KISSPLICE

Goal: instead of assembling full-length transcripts, KISSPLICE (Sacomoto et al. 2012) focuses on assembling ONLY the bubbles that contain events, and enumerates as many of them as possible

KISSPLICE: graph cleaning + local assembly

Example: discard if the ratio is < 0.05
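The slide does not define the ratio. The sketch below assumes it is the coverage of an edge divided by the total coverage of all edges leaving the same node, so that weakly supported branches (likely sequencing errors) are removed while real minor variants are kept. The function name, the data and this definition are assumptions for illustration, not KISSPLICE's actual code.

```python
def filter_edges(out_edges: dict, min_ratio: float = 0.05) -> dict:
    """Drop outgoing edges whose share of the node's outgoing coverage is below min_ratio."""
    total = sum(out_edges.values())
    if total == 0:
        return {}
    return {
        target: coverage
        for target, coverage in out_edges.items()
        if coverage / total >= min_ratio
    }


# Hypothetical branching node: two supported paths and one rare branch.
out_edges = {"pathA": 97, "pathB": 40, "error": 3}
kept = filter_edges(out_edges)
print(kept)
```

A relative threshold of this kind adapts to expression level: in a highly expressed gene, an erroneous branch can be seen many times in absolute terms and still be a tiny fraction of the local coverage.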


Variants in local assembly


KISSPLICE’s output


Post-processings


KISSPLICE case studies

Discover splicing events: Benoit Pilven et al. 2018

(Figure) Farline: mapping. B: found only by Kissplice (not annotated). C: found only by Kissplice (paralog). D: found only by mapping (Alu repeat).

Discover SNPs in pooled RNA-seq data: Lopez-Maestre et al. 2016

Practical: Kissplice


Long reads : the future of transcriptomics

1. Long read transcriptomics sequencing technologies

2. Available pipelines

3. Current limitations


PACBIO vs Nanopore

From Reuters et al. 2015

Error rates and profiles

From Weirather et al.


The PacBio CCS


Nanopore RNA protocols

(from Oxford Nanopore website)

direct RNA protocol

- no dependence on RT or PCR
- detects modifications (methylation)
- more material is needed, fewer reads

Nanopore evolution

From Rang et al.

Costs


New exon-exon junctions

+ quantification seems possible (see Sessegolo et al. 2019 (bioRxiv) and Oikonomopoulos et al. Sci. Rep. 2016)

Some tools to work with RNA long reads

Full pipelines:

- Mandalorion (Byrne et al. 2017, exploits Nanopore reads with a reference)
- ToFU (Gordon et al. 2015, for PacBio CCS only, with/without reference)
- TAPIS (Abdel-Ghany et al. 2016, with reference)
- FLAIR (Tang et al. 2018 (bioRxiv), Nanopore with reference)

Clustering:

- isONclust (Sahlin et al. RECOMB 2019, for PacBio)
- CARNAC-LR (Marchet et al. NAR 2018)

Correction:

- No dedicated tool at the moment; some genomic tools work, see Lima et al. 2019 for a survey

What was not covered during this session

-bacterial RNA

-genome-guided assembly

-metatranscriptomics

-single cell RNA

-tools specialized for ncRNAs, smallRNAs

-tools specialized for fusion transcripts

-transcript annotation (https://busco.ezlab.org/ for instance)

- ...

-up next: differential study (statistics for RNA-seq)
