De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis
BRIAN J HAAS, ALEXIE PAPANICOLAOU, MORAN YASSOUR, MANFRED GRABHERR, PHILIP D BLOOD, JOSHUA BOWDEN, MATTHEW BRIAN COUGER, DAVID ECCLES, BO LI, MATTHIAS LIEBER, MATTHEW D MACMANES, MICHAEL OTT, JOSHUA ORVIS, NATHALIE POCHET, FRANCESCO STROZZI, NATHAN WEEKS, RICK WESTERMAN, THOMAS WILLIAM, COLIN N DEWEY, ROBERT HENSCHEL, RICHARD D LEDUC, NIR FRIEDMAN & AVIV REGEV
NATURE PROTOCOLS 8, 2013
Anti Alman
22.05.2014
Introduction
Platform for de novo transcriptome assembly
◦ From RNA-seq data (only Illumina)
◦ Mainly for non-model organisms
◦ Fully reconstructs a large fraction of the transcripts present in the data
◦ Including alternative splice isoforms and transcripts from recently duplicated genes (with some caveats)
Introduction II
Original methodology published in 2011
Used in many different research projects
◦ Genome sequence of foxtail millet (Setaria italica) provides insights into grass evolution and biofuel potential
◦ The African coelacanth genome provides insights into tetrapod evolution
Significantly improved since 2011
◦ memory requirements halved
◦ increased performance through parallelization
◦ seamlessly uses various third-party tools
Trinity de novo assembly
Three consecutive modules
◦ Inchworm
◦ Chrysalis
◦ Butterfly
Inchworm
Inchworm assembles the read data set by greedily searching for paths in a k-mer graph, resulting in a collection of linear contigs with each k-mer present only once in the contigs.
GRABHERR, M.G. ET AL. NAT. BIOTECHNOL. 29, 644–652 (2011)
Inchworm
Constructs a k-mer dictionary from all sequence reads
Selects the most frequent k-mer in the dictionary (seed)
Extends the seed in each direction by finding the highest occurring k-mer with a k-1 overlap
Extends the sequence in either direction until it cannot be extended further, then reports the linear contig
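The greedy procedure on this slide can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function names are hypothetical, reverse complements and sequencing-error filtering are ignored, k is assumed to be at least 2, and the loop simply restarts from the next most frequent unused k-mer until the dictionary is empty. It is not Trinity's actual implementation.

```python
from collections import Counter


def count_kmers(reads, k):
    """Build the k-mer dictionary: k-mer -> number of occurrences in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def extend(contig, counts, k, forward):
    """Greedily extend the contig in one direction, consuming each k-mer it uses."""
    while True:
        if forward:
            overlap = contig[-(k - 1):]
            candidates = [overlap + base for base in "ACGT"]
        else:
            overlap = contig[:k - 1]
            candidates = [base + overlap for base in "ACGT"]
        candidates = [c for c in candidates if c in counts]
        if not candidates:
            return contig  # no unused k-mer shares the k-1 overlap
        best = max(candidates, key=lambda c: counts[c])
        del counts[best]  # each k-mer appears only once across all contigs
        contig = contig + best[-1] if forward else best[0] + contig


def inchworm_sketch(reads, k):
    """Return linear contigs built by repeated greedy seed extension."""
    counts = count_kmers(reads, k)
    contigs = []
    while counts:
        seed = max(counts, key=counts.get)  # most frequent remaining k-mer
        del counts[seed]
        contig = extend(seed, counts, k, forward=True)
        contig = extend(contig, counts, k, forward=False)
        contigs.append(contig)
    return contigs
```

For example, `inchworm_sketch(["ATGGCGT", "GGCGTGC", "CGTGCA"], 4)` joins the three overlapping reads into a single contig.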
Inchworm
contiguous (fused) transcripts
Chrysalis
Chrysalis pools (clusters) contigs into components
◦ If they have at least k-1 overlap
◦ If enough reads span the join
An individual de Bruijn graph is built from each pool
de Bruijn graph
Every edge is a k-mer
Every node is a k-1 overlap
HTTP://GCAT.DAVIDSON.EDU/PHAST/DEBRUIJN.HTML
Chrysalis I
It recursively groups Inchworm contigs into connected components.
◦ If there is a perfect overlap of k-1 bases
◦ If there is a minimal number of reads that span the junction across both contigs
  ◦ with a (k-1)/2 bases match on each side of the (k-1)-mer junction.
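The grouping step can be illustrated with a small union-find sketch. It applies only the first criterion (two contigs are linked when they share a (k-1)-mer); the read-support check across the junction is deliberately left out, and the function name is hypothetical, so this should be read as a simplified illustration rather than Chrysalis itself.

```python
def cluster_contigs(contigs, k):
    """Group contigs into connected components linked by a shared (k-1)-mer."""
    parent = list(range(len(contigs)))

    def find(i):
        # Follow parent links to the component root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # (k-1)-mer -> index of the first contig containing it
    for idx, contig in enumerate(contigs):
        for i in range(len(contig) - (k - 1) + 1):
            word = contig[i:i + k - 1]
            if word in first_seen:
                parent[find(idx)] = find(first_seen[word])  # merge components
            else:
                first_seen[word] = idx

    components = {}
    for idx, contig in enumerate(contigs):
        components.setdefault(find(idx), []).append(contig)
    return list(components.values())
```

With k = 5, the contigs `ATGGCGT` and `GCGTAAC` share the 4-mer `GCGT` and end up in the same component, while an unrelated contig stays on its own.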
Chrysalis II
It builds a de Bruijn graph for each component
◦ using a word size of k-1 to represent nodes
◦ k to define the edges connecting the nodes.
It weights each edge of the de Bruijn graph with the number of (k-1)-mers in the original read set that support it.
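A minimal sketch of such a weighted graph for one component follows. Nodes are (k-1)-mers and each k-mer in a contig defines an edge between its prefix and suffix. As a simplifying assumption, the weight here is the number of times the edge's k-mer occurs in the reads, which is one plausible reading of "support" and not necessarily how Chrysalis counts it. The function name is hypothetical.

```python
from collections import Counter


def weighted_edges(component_contigs, reads, k):
    """Return {(prefix node, suffix node): read support} for one component."""
    # Count every k-mer occurrence in the original reads.
    read_kmers = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    edges = {}
    for contig in component_contigs:
        for i in range(len(contig) - k + 1):
            kmer = contig[i:i + k]
            # Edge from the first k-1 bases to the last k-1 bases of the k-mer.
            edges[(kmer[:-1], kmer[1:])] = read_kmers[kmer]
    return edges
```

Edges with little or no read support are natural candidates for the pruning that Butterfly performs later.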
Chrysalis III
Each read is assigned to the component with which it shares the largest number of k-mers.
Determines the regions within each read that contribute k-mers to the component.
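The read assignment rule can be sketched as follows. Components are represented as lists of contig strings, reads that share no k-mer with any component are left unassigned, and ties go to the first component; these choices and the function names are assumptions made for the illustration.

```python
def kmers(seq, k):
    """Return the set of k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def assign_reads(reads, components, k):
    """Return, for each read, the index of its best component or None."""
    component_kmers = []
    for contigs in components:
        words = set()
        for contig in contigs:
            words |= kmers(contig, k)
        component_kmers.append(words)

    assignment = []
    for read in reads:
        read_kmers = kmers(read, k)
        best, best_shared = None, 0
        for idx, words in enumerate(component_kmers):
            shared = len(read_kmers & words)
            if shared > best_shared:  # strictly more shared k-mers wins
                best, best_shared = idx, shared
        assignment.append(best)
    return assignment
```

The intersection `read_kmers & words` also identifies which k-mers of the read contribute to the chosen component, which is the information the second statement on the slide refers to.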
Butterfly
Butterfly takes each de Bruijn graph from Chrysalis and trims spurious edges and compacts linear paths.
It then reconciles the graph with reads and pairs.
It outputs one linear sequence for each splice form and/or paralogous transcript reflected in the graph.
Butterfly
Butterfly iterates between
◦ merging consecutive nodes in linear paths
◦ pruning edges that represent minor deviations
Reads are typically much longer than k
◦ can resolve ambiguities
◦ reduce the combinatorial number of paths
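The "merging consecutive nodes in linear paths" step can be illustrated with a generic graph compaction sketch. It takes a set of directed edges between (k-1)-mer nodes and returns the sequence of every maximal unbranched path. Edge pruning and the read-based path resolution are not shown, isolated cycles are ignored, and the function name is hypothetical.

```python
from collections import defaultdict


def compact_linear_paths(edges):
    """Merge chains of unbranched nodes; edges is a set of (u, v) node pairs."""
    succ, pred = defaultdict(list), defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)
    nodes = set(succ) | set(pred)

    def is_internal(node):
        # A node inside a linear path has exactly one way in and one way out.
        return len(pred.get(node, ())) == 1 and len(succ.get(node, ())) == 1

    paths = []
    for start in nodes:
        if is_internal(start):
            continue  # paths start only at branch points, sources, and sinks
        for nxt in succ.get(start, ()):
            path = [start, nxt]
            while is_internal(path[-1]):
                path.append(succ[path[-1]][0])
            # Consecutive nodes overlap by all but one base.
            paths.append(path[0] + "".join(node[-1] for node in path[1:]))
    return sorted(paths)
```

For two sequences that share a prefix and then diverge, the result is one compacted node for the shared part and one for each branch.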
Butterfly
Alternatively spliced transcripts
Transcript reconstruction
S. pombe
Oracle
◦ Empirical upper limit based on reads and known protein-coding sequences
Expression profiles reference vs Trinity
Protocol example
Schizosaccharomyces pombe grown in four conditions
◦ 4 million paired-end reads
◦ Requires 8GB RAM (1GB per million)
◦ Takes approximately 4 h
Main steps
◦ Collection of RNA-seq data (10 min)
◦ De novo RNA-seq assembly using Trinity (60-90 min)
◦ Quality assessment (90 min)
◦ Abundance estimation using RSEM (40-60 min)
◦ Differential expression analysis using edgeR (<5 min)
Alternatives
Velvet – de Bruijn
ABySS – de Bruijn
MIRA – overlap graph
Oases – de Bruijn
◦ Based on Velvet
Comparison
SCIENCE CHINA FEBRUARY 2013 VOL.56 NO.2: 156–162
On randomly generated short reads from chromosome 22
Comparison II
10 highest concentration RNAs in the ERCC mix
After de novo RNA-seq assembly
Relies on third-party tools
Transcriptome analysis package for non-model organisms
◦ Comparing transcriptomes across samples
◦ Transcript abundance estimation
◦ Analysis of differentially expressed transcripts
◦ Protein-coding region prediction and functional annotation of Trinity transcripts
Comparing transcriptomes across samples
Combine all reads across all samples into a single RNA-seq data set
Generate a single reference Trinity assembly
Align each sample’s (not normalized) reads to the Trinity assembly
Transcript abundance estimation
Re-align reads to the assembled transcripts
◦ Alternatively spliced isoforms and recently duplicated genes?
◦ RNA-seq by Expectation Maximization (RSEM)
◦ Requires gap-free alignments
edgeR - compare expression levels of different transcripts or genes across samples
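The reason expectation maximization helps with isoforms and recently duplicated genes is that many reads align equally well to several transcripts. The toy sketch below shows the idea only: each ambiguous read is split among its candidate transcripts in proportion to the current abundance estimates, and the estimates are then updated from those fractional counts. It ignores transcript length, alignment quality, and everything else a real model such as RSEM accounts for, and the function name and input format are assumptions.

```python
def em_abundance(read_hits, n_transcripts, iterations=100):
    """Estimate relative abundances from reads that may hit several transcripts.

    read_hits is a list with one entry per read: the list of transcript
    indices that the read aligns to.
    """
    theta = [1.0 / n_transcripts] * n_transcripts  # start from a uniform guess
    for _ in range(iterations):
        counts = [0.0] * n_transcripts
        # E-step: split each read among its transcripts by current abundance.
        for hits in read_hits:
            total = sum(theta[t] for t in hits)
            for t in hits:
                counts[t] += theta[t] / total
        # M-step: new abundances are the normalized fractional counts.
        theta = [c / len(read_hits) for c in counts]
    return theta
```

With three reads unique to transcript 0, one unique to transcript 1, and two shared by both, the shared reads are gradually apportioned according to the unique evidence instead of being split evenly.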
Analysis of differentially expressed transcripts
Relies on tools from the Bioconductor project
◦ edgeR
◦ DESeq
Easy-to-use Perl scripts
Protein-coding region prediction and functional annotation of Trinity transcripts
TransDecoder identifies candidate protein-coding regions
◦ Based on nucleotide composition
◦ Open reading frame length
◦ Pfam domain content
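Of the three criteria, open reading frame length is the easiest to illustrate. The sketch below scans the three forward reading frames of a transcript for the longest stretch that starts with ATG and ends at the first in-frame stop codon. It is not TransDecoder's algorithm: the reverse strand, nucleotide composition scoring, and Pfam searches are all omitted, and the function name is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return the longest ATG-to-stop ORF on the forward strand, or ''."""
    best = ""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate ORF at the first start codon
            elif start is not None and codon in STOP_CODONS:
                orf = seq[start:i + 3]  # include the stop codon
                if len(orf) > len(best):
                    best = orf
                start = None
    return best
```

A length threshold on the returned ORF would then act as a first filter before the composition and domain criteria are applied.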
Limitations
Only for Illumina RNA-seq data
Difficult to fully understand the structural basis for the observed transcript variations
Sequence variations that cannot be properly phased can result in erroneous chimeras between isoforms.
Incorrect transcript assembly or isoform misalignment can be easily misinterpreted as evidence of polymorphism
Thank you for your attention!
Trinity is installed on alligaator.at.mt.ut.ee
◦ /usr/local/trinityrnaseq_r20140413p1
Questions:
1. What is the difference between a model organism and a non-model organism?
2. Why do we need de novo transcriptome assembly and what makes it difficult?