
De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis

Brian J Haas, Alexie Papanicolaou, Moran Yassour, Manfred Grabherr, Philip D Blood, Joshua Bowden, Matthew Brian Couger, David Eccles, Bo Li, Matthias Lieber, Matthew D MacManes, Michael Ott, Joshua Orvis, Nathalie Pochet, Francesco Strozzi, Nathan Weeks, Rick Westerman, Thomas William, Colin N Dewey, Robert Henschel, Richard D LeDuc, Nir Friedman & Aviv Regev

Nature Protocols 8, 2013

Anti Alman, 22.05.2014

Introduction

Platform for de novo transcriptome assembly

◦ From RNA-seq data (only Illumina)

◦ Mainly for non-model organisms

◦ Fully reconstructs a large fraction of the transcripts present in the data

◦ Including alternative splice isoforms and transcripts from recently duplicated genes (with some caveats)

Introduction II

Original methodology published in 2011

Used in many different research projects

◦ Genome sequence of foxtail millet (Setaria italica) provides insights into grass evolution and biofuel potential

◦ The African coelacanth genome provides insights into tetrapod evolution

Significantly improved since 2011

◦ memory requirements halved

◦ increased performance through parallelization

◦ seamlessly uses various third-party tools

Trinity de novo assembly

Three consecutive modules

◦ Inchworm

◦ Chrysalis

◦ Butterfly

Inchworm

Inchworm assembles the read data set by greedily searching for paths in a k-mer graph, resulting in a collection of linear contigs with each k-mer present only once in the contigs.

Grabherr, M.G. et al. Nat. Biotechnol. 29, 644–652 (2011)

Inchworm


Constructs a k-mer dictionary from all sequence reads

Selects the most frequent k-mer in the dictionary (seed)

Extends the seed in each direction by finding the most frequent k-mer that overlaps the current contig end by k-1 bases

Keeps extending the sequence in both directions until it cannot be extended further, then reports the linear contig
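
The steps above can be sketched in a few lines of Python. This is a simplified illustration, not Trinity's actual code: the function names are made up for this example, and it ignores reverse complements, sequencing-error filtering and tie-breaking rules, and assumes k ≥ 2.

```python
from collections import Counter

def count_kmers(reads, k):
    """Build the k-mer dictionary: k-mer -> number of occurrences in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def extend(contig, counts, used, k, forward=True):
    """Greedily extend a contig in one direction, one base at a time."""
    while True:
        if forward:
            overlap = contig[-(k - 1):]
            candidates = [overlap + base for base in "ACGT"]
        else:
            overlap = contig[:k - 1]
            candidates = [base + overlap for base in "ACGT"]
        # Keep only k-mers that were observed and not yet placed in a contig.
        candidates = [c for c in candidates if counts.get(c, 0) > 0 and c not in used]
        if not candidates:
            return contig
        best = max(candidates, key=lambda c: counts[c])
        used.add(best)
        contig = contig + best[-1] if forward else best[0] + contig

def greedy_contigs(reads, k):
    """Seed with the most frequent unused k-mer, extend both ways, repeat."""
    counts = count_kmers(reads, k)
    used = set()
    contigs = []
    for seed, _ in counts.most_common():
        if seed in used:
            continue
        used.add(seed)
        contig = extend(seed, counts, used, k, forward=True)
        contig = extend(contig, counts, used, k, forward=False)
        contigs.append(contig)
    return contigs

print(greedy_contigs(["ATGGCGT", "GGCGTGCA"], k=4))
```

Because every k-mer is marked as used once it is placed, each k-mer ends up in only one contig, which matches the property stated on the Inchworm slide.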

Inchworm

Contiguous (fused) transcripts

Chrysalis

Chrysalis pools (clusters) contigs into components

◦ If they have at least k-1 overlap

◦ If enough reads span the join

An individual de Bruijn graph is built from each pool


de Bruijn graph

Every edge is a k-mer

Every node is a k-1 overlap

HTTP://GCAT.DAVIDSON.EDU/PHAST/DEBRUIJN.HTML
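
As a minimal sketch of this definition, the following Python function builds such a graph from a set of sequences. The function name and the dictionary-of-sets representation are assumptions made for illustration only.

```python
from collections import defaultdict

def de_bruijn(sequences, k):
    """Return {node: set of successor nodes}.

    Each k-mer in the input becomes an edge from its first k-1 bases
    (prefix node) to its last k-1 bases (suffix node).
    """
    graph = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)

# The 3-mers ACG, CGT, GTC become the edges AC->CG, CG->GT, GT->TC.
print(de_bruijn(["ACGTC"], k=3))
```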

Chrysalis I

It recursively groups Inchworm contigs into connected components.

◦ If there is a perfect overlap of k-1 bases

◦ If there is a minimal number of reads that span the junction across both contigs, with a (k-1)/2 base match on each side of the (k-1)-mer junction.
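
One possible reading of these two conditions is sketched below. It is an interpretation for illustration, not Chrysalis itself: the names `shared_welds`, `cluster_contigs` and `min_support` are hypothetical, and a "spanning read" is taken to be a read that contains the shared (k-1)-mer flanked by (k-1)/2 bases from one contig on the left and (k-1)/2 bases from the other on the right.

```python
def shared_welds(a, b, k):
    """Yield junction strings: left flank from a, shared (k-1)-mer, right flank from b."""
    w = k - 1          # length of the required perfect overlap
    h = w // 2         # flank length on each side of the junction
    positions_b = {}
    for j in range(len(b) - w + 1):
        positions_b.setdefault(b[j:j + w], []).append(j)
    for i in range(len(a) - w + 1):
        core = a[i:i + w]
        for j in positions_b.get(core, []):
            left = a[max(0, i - h):i]
            right = b[j + w:j + w + h]
            if len(left) == h and len(right) == h:
                yield left + core + right

def cluster_contigs(contigs, reads, k, min_support=2):
    """Group contig indices into connected components using union-find."""
    parent = list(range(len(contigs)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    for x in range(len(contigs)):
        for y in range(len(contigs)):
            if x == y or find(x) == find(y):
                continue
            for weld in shared_welds(contigs[x], contigs[y], k):
                support = sum(weld in read for read in reads)
                if support >= min_support:
                    parent[find(x)] = find(y)
                    break

    groups = {}
    for idx in range(len(contigs)):
        groups.setdefault(find(idx), []).append(idx)
    return sorted(groups.values())

contigs = ["AACCGGTTAA", "TTGGTTACGA", "TTTTTTTTTT"]
reads = ["ACCGGTTACG", "CCGGTTACGA"]
print(cluster_contigs(contigs, reads, k=5))
```

Union-find is used here because grouping is transitive: if contig A is joined to B and B to C, all three end up in one component.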


Chrysalis II

It builds a de Bruijn graph for each component

◦ using a word size of k-1 to represent nodes

◦ k to define the edges connecting the nodes.

It weights each edge of the de Bruijn graph with the number of (k-1)-mers in the original read set that support it.


Chrysalis III

Each read is assigned to the component with which it shares the largest number of k-mers.

It also determines the regions within each read that contribute k-mers to the component.
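
The read assignment rule can be illustrated with a small Python sketch. The component names and the `assign_reads` helper are hypothetical, and ties are broken arbitrarily in favour of the first component seen, which is an assumption rather than documented Chrysalis behaviour.

```python
def kmers(seq, k):
    """Return the set of k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def assign_reads(reads, components, k):
    """Map each read to the component sharing the most k-mers with it.

    components: {component name: list of contig sequences}
    Reads sharing no k-mer with any component are mapped to None.
    """
    component_kmers = {
        name: set().union(*(kmers(contig, k) for contig in contigs))
        for name, contigs in components.items()
    }
    assignment = {}
    for read in reads:
        read_kmers = kmers(read, k)
        best_name, best_shared = None, 0
        for name, comp_kmers in component_kmers.items():
            shared = len(read_kmers & comp_kmers)
            if shared > best_shared:
                best_name, best_shared = name, shared
        assignment[read] = best_name
    return assignment

components = {"comp0": ["AACCGGTTAA"], "comp1": ["TTTTGCATGC"]}
print(assign_reads(["CCGGTT", "TGCATG", "GGGGGG"], components, k=4))
```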


Butterfly

Butterfly takes each de Bruijn graph from Chrysalis and trims spurious edges and compacts linear paths.

It then reconciles the graph with reads and pairs.

It outputs one linear sequence for each splice form and/or paralogous transcript reflected in the graph.


Butterfly

Butterfly iterates between

◦ merging consecutive nodes in linear paths

◦ pruning edges that represent minor deviations

Reads are typically much longer than k

◦ can resolve ambiguities

◦ reduce the combinatorial number of paths
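
The "merging consecutive nodes in linear paths" step can be sketched as follows. This is a generic graph-compaction routine under simplifying assumptions, not Butterfly's implementation: it takes a graph as {node: set of successors} with (k-1)-mer nodes, does not prune edges or use read support, and skips isolated cycles.

```python
def compact_linear_paths(graph):
    """Collapse unbranched chains of nodes into single sequences.

    A node is 'linear' if it has exactly one incoming and one outgoing edge.
    Each returned string spells out one maximal unbranched path.
    """
    nodes = set(graph)
    indegree = {}
    for node, successors in graph.items():
        for succ in successors:
            nodes.add(succ)
            indegree[succ] = indegree.get(succ, 0) + 1

    def is_linear(n):
        return indegree.get(n, 0) == 1 and len(graph.get(n, ())) == 1

    paths = []
    for node in nodes:
        if is_linear(node):
            continue  # paths start only at branch points, sources or sinks
        for succ in graph.get(node, ()):
            path = [node, succ]
            while is_linear(path[-1]):
                path.append(next(iter(graph[path[-1]])))
            # Consecutive nodes overlap by all but one base.
            paths.append(path[0] + "".join(n[-1] for n in path[1:]))
    return sorted(paths)

# A chain with one branch at node CG: ACG is shared, then CGT or CGA.
branched = {"AC": {"CG"}, "CG": {"GT", "GA"}, "GT": set(), "GA": set()}
print(compact_linear_paths(branched))
```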


Butterfly

Alternatively spliced transcripts

Transcript reconstruction

S. pombe

Oracle

◦ Empirical upper limit based on reads and known protein-coding sequences


Expression profiles reference vs Trinity

Protocol example

Schizosaccharomyces pombe grown in four conditions

◦ 4 million paired-end reads

◦ Requires 8GB RAM (1GB per million)

◦ Takes approximately 4 h

Main steps

◦ Collection of RNA-seq data (10 min)

◦ De novo RNA-seq assembly using Trinity (60-90 min)

◦ Quality assessment (90 min)

◦ Abundance estimation using RSEM (40-60 min)

◦ Differential expression analysis using edgeR (<5 min)

Alternatives

Velvet – de Bruijn

ABySS – de Bruijn

MIRA – overlap graph

Oases – de Bruijn

◦ Based on Velvet

Comparison

SCIENCE CHINA FEBRUARY 2013 VOL.56 NO.2: 156–162

On randomly generated short reads from chromosome 22

Comparison II

SCIENCE CHINA FEBRUARY 2013 VOL.56 NO.2: 156–162

10 highest concentration RNAs in the ERCC mix

After de novo RNA-seq assembly

Relies on third-party tools

Transcriptome analysis package for non-model organisms

◦ Comparing transcriptomes across samples

◦ Transcript abundance estimation

◦ Analysis of differentially expressed transcripts

◦ Protein-coding region prediction and functional annotation of Trinity transcripts

Comparing transcriptomes across samples

Combine all reads across all samples into a single RNA-seq data set

Generate a single reference Trinity assembly

Align each sample’s (non-normalized) reads to the Trinity assembly

Transcript abundance estimation

Re-align reads to the assembled transcripts

◦ Alternatively spliced isoforms and recently duplicated genes?

◦ RNA-seq by Expectation Maximization (RSEM)

◦ Requires gap-free alignments
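
To show why expectation maximization helps with reads that align to several isoforms or recently duplicated genes, here is a toy EM loop in Python. It is a didactic sketch, not RSEM's model or code: the function name, the input format and the assumption that a read is equally likely to come from any position of a transcript are all simplifications introduced here.

```python
def em_abundance(compat, lengths, iterations=100):
    """Estimate the fraction of reads originating from each transcript.

    compat:  list with one entry per read, each a list of transcript ids
             the read aligns to (several ids = ambiguous read)
    lengths: {transcript id: length}
    """
    theta = {t: 1.0 / len(lengths) for t in lengths}
    for _ in range(iterations):
        expected = {t: 0.0 for t in lengths}
        # E-step: split each read among its compatible transcripts.
        for transcripts in compat:
            if not transcripts:
                continue
            weights = {t: theta[t] / lengths[t] for t in transcripts}
            total = sum(weights.values())
            for t, w in weights.items():
                expected[t] += w / total
        # M-step: re-estimate read fractions from the expected counts.
        n = sum(expected.values())
        theta = {t: expected[t] / n for t in lengths}
    return theta

# 8 reads map only to isoform A, 2 reads map to both A and B.
reads = [["A"]] * 8 + [["A", "B"]] * 2
print(em_abundance(reads, {"A": 1000, "B": 1000}))
```

In this toy case isoform B has no uniquely mapping reads, so the ambiguous reads are progressively reassigned to A over the iterations.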

edgeR - compare expression levels of different transcripts or genes across samples

Analysis of differentially expressed transcripts

Relies on tools from the Bioconductor project

◦ edgeR

◦ DESeq

Easy-to-use Perl scripts

Protein-coding region prediction and functional annotation of Trinity transcripts

TransDecoder identifies candidate protein-coding regions

◦ Based on nucleotide composition

◦ Open reading frame length

◦ Pfam domain content
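
Of these criteria, open reading frame length is the easiest to illustrate. The sketch below finds the longest ATG-to-stop ORF in the three forward frames of a transcript. It is not TransDecoder's algorithm: the function name is made up, the reverse strand is ignored, and nucleotide composition and Pfam scoring are left out.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def longest_orf(seq):
    """Return the longest ORF (ATG through stop codon) on the forward strand."""
    seq = seq.upper()
    best = ""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first start codon in this frame
            elif start is not None and codon in STOP_CODONS:
                orf = seq[start:i + 3]
                if len(orf) > len(best):
                    best = orf
                start = None                   # look for the next ORF
    return best

print(longest_orf("CCATGAAATGGTAGCC"))
```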

Limitations

Only for Illumina RNA-seq data

Difficult to fully understand the structural basis for the observed transcript variations

Sequence variations that cannot be properly phased can result in erroneous chimeras between isoforms.

Incorrect transcript assembly or isoform misalignment can be easily misinterpreted as evidence of polymorphism

Thank you for your attention!

Trinity is installed on alligaator.at.mt.ut.ee

◦ /usr/local/trinityrnaseq_r20140413p1

Questions:

1. What is the difference between a model organism and a non-model organism?

2. Why do we need de novo transcriptome assembly and what makes it difficult?
