Ways we use RNASeq data in Ensembl:
• Build complete gene set from scratch for individual or pooled RNASeq data sets
• Incorporate into a new Ensembl gene set
• Add novel models into a gene set
• UTR
• Filtering Models
• Improve old gene sets
Introduction
• Reads are aligned to the genome with a quick un-gapped alignment using BWA
• Transcriptome reads split over introns - we need to allow for this:
• Align with up to 50% mismatches to get intron-spanning reads to align
• The alignments are then processed to collapse overlapping reads into blocks representing exons
• Read pairing is used (if available) to group the exon blocks into approximate transcript structures
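The collapsing step can be illustrated with a minimal sketch. This is not the Ensembl pipeline code; it assumes reads are given as (start, end) coordinates on a single strand of a single sequence, and the function name `collapse_reads` and the `max_gap` parameter are hypothetical.

```python
from typing import List, Tuple

# (start, end), 1-based inclusive, all on one strand of one sequence (assumption)
Interval = Tuple[int, int]


def collapse_reads(reads: List[Interval], max_gap: int = 0) -> List[Interval]:
    """Merge overlapping or adjacent read alignments into exon-like blocks.

    max_gap is a hypothetical tolerance: reads separated by at most this
    many uncovered bases are still placed in the same block.
    """
    blocks: List[Interval] = []
    for start, end in sorted(reads):
        if blocks and start <= blocks[-1][1] + max_gap + 1:
            # The read overlaps or touches the current block: extend it.
            blocks[-1] = (blocks[-1][0], max(blocks[-1][1], end))
        else:
            # The read starts a new block.
            blocks.append((start, end))
    return blocks


reads = [(100, 175), (150, 225), (200, 275), (900, 975), (950, 1025)]
print(collapse_reads(reads))  # [(100, 275), (900, 1025)]
```

Sorting first means each read only has to be compared with the most recent block, so the merge is a single pass over the reads.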
RNASeq Pipeline Alignment and Initial Processing
RNASeq Pipeline Intron Alignment
We align split reads using Exonerate, which has a good splice model but is not a short-read aligner
Intron alignment is made faster in 2 ways:
• Don't realign all the reads:
  • Introns are resolved by realigning partially aligned reads.
  • Use Exonerate word length to define which reads to realign
• Align to a single transcript:
  • Reads are realigned either to the rough transcript sequence or to the genomic span of the rough transcript.
• Limiting the search space allows us to do a more sophisticated Exonerate alignment with a splice model and a shorter word length.
• Aligning to the genomic span of the transcript can identify small exons that were missed by the BWA alignment that can be incorporated into the final model.
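One way to picture the read selection is the sketch below. The slide only says that the Exonerate word length defines which reads are realigned; the exact rule is not given, so the criterion here (realign a read when its unaligned portion is at least one word long) is an assumption, as are the `RoughAlignment` class, the function names, and the default word length of 10.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RoughAlignment:
    read_id: str
    read_length: int
    aligned_length: int  # bases of the read covered by the ungapped hit


def needs_realignment(aln: RoughAlignment, word_length: int = 10) -> bool:
    """Assumed rule: the unaligned part must be long enough to seed a word."""
    unaligned = aln.read_length - aln.aligned_length
    return unaligned >= word_length


def select_for_realignment(alns: List[RoughAlignment],
                           word_length: int = 10) -> List[RoughAlignment]:
    """Keep only the partially aligned reads worth passing to the slow aligner."""
    return [a for a in alns if needs_realignment(a, word_length)]


alns = [
    RoughAlignment("read_1", 76, 76),  # fully aligned, skipped
    RoughAlignment("read_2", 76, 50),  # 26 bases unaligned, realigned
    RoughAlignment("read_3", 76, 70),  # 6 bases unaligned, below word length
]
print([a.read_id for a in select_for_realignment(alns)])  # ['read_2']
```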
[Figure labels: Exonerate spliced alignment; Partially aligned reads; Split reads; Collapsed Intron Features; Final Models; BLASTP; Coverage (PE12)]
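Collapsing split reads into intron features can be sketched as a simple count of reads per distinct intron. This is an illustration only, not the pipeline's implementation; the tuple layout, the function name `collapse_introns`, and the `min_support` threshold are assumptions.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# (sequence name, intron start, intron end, strand): assumed representation
Intron = Tuple[str, int, int, str]


def collapse_introns(split_reads: Iterable[Intron],
                     min_support: int = 1) -> List[Tuple[Intron, int]]:
    """Collapse per-read introns into unique intron features with read counts.

    min_support is a hypothetical filter on the number of supporting reads.
    """
    counts = Counter(split_reads)
    return sorted((intron, n) for intron, n in counts.items() if n >= min_support)


# One entry per split read, giving the intron that the read spans.
split_reads = [
    ("1", 1000, 2000, "+"),
    ("1", 1000, 2000, "+"),
    ("1", 1000, 2000, "+"),
    ("1", 3000, 3500, "+"),
]
print(collapse_introns(split_reads, min_support=2))
# [(('1', 1000, 2000, '+'), 3)]
```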
Website Display of RNASeq pipeline results
Data visible in Ensembl
Transcript models
Intron features
BAM files of BWA alignments
Human gene ZMPSTE24
RNASeq introns by tissue
RNASeq models by tissue & merged
CCDS
GENCODE transcript
RNASeq Volume
We are collecting more and more RNASeq
We now have sizeable RNASeq sets for 12 species +
Pipeline is now being used in production
Further automation has allowed us to speed up model building:
• Process spreadsheet data to automate the pipeline setup and configuration
• Parse meta data out of spreadsheets into the final BAM files
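As a rough illustration of carrying spreadsheet metadata into BAM files, the sketch below turns rows of a CSV export into SAM/BAM `@RG` (read group) header lines. The slides do not say which header fields are used, so the choice of `@RG` with the `ID`, `SM` and `DS` tags is an assumption, and the column names (`run_id`, `sample`, `tissue`) and the example values are made up.

```python
import csv
import io
from typing import List


def read_group_lines(sheet_text: str) -> List[str]:
    """Build one @RG header line per spreadsheet row.

    Assumes a CSV export with hypothetical columns run_id, sample, tissue.
    """
    lines = []
    for row in csv.DictReader(io.StringIO(sheet_text)):
        fields = [
            "@RG",
            f"ID:{row['run_id']}",  # read group identifier
            f"SM:{row['sample']}",  # sample name
            f"DS:{row['tissue']}",  # free-text description, here the tissue
        ]
        lines.append("\t".join(fields))
    return lines


sheet = "run_id,sample,tissue\nrun_1,sample_a,liver\nrun_2,sample_a,brain\n"
for line in read_group_lines(sheet):
    print(line)
```

The resulting lines would still need to be written into the BAM header by whatever tool produces the final files.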
Using RNASeq in the Ensembl genebuild pipeline
Some species have little specific data
E.g. Nile tilapia
131 proteins in Uniprot
35 cDNAs, 119531 ESTs
Rely on data from related species
RNASeq supplements the above data
Species-specific
Fills gaps, alternate splice sites, faster genebuild
Genebuild process
[Flowchart labels: Raw Computes; Targeted stage; Similarity stage; cDNAs/ESTs; UTR addition; Final gene set; Filtering (appears twice); TranscriptConsensus; LayerAnnotation; Annotation Projection (primates)]
Genebuild process
[Flowchart labels: Raw Computes; Targeted stage; Similarity stage; cDNAs/ESTs; UTR addition; Final gene set; Filtering (appears twice); Merged RNA-Seq models; Annotation Projection (primates)]
RNASeq helps with: 1. Choice of splice site
RNASeq
Similarity models
Ensembl model
RNASeq helps with: 2. UTR addition
RNASeq model
Similarity model
Ensembl model
RNASeq helps with: 3. New models
RNASeq introns
RNASeq model
Similarity model
Ensembl model
Species with RNASeq used in generating Ensembl gene set
Released:
• Zebrafish
• Tasmanian Devil
• Coelacanth
• Tilapia
In progress:
Dog, Turtle, Rat, Cat, Chicken, Platyfish
RNASeq is becoming a central part of the genebuild process, and many species' gene sets will include RNASeq components going forward
Gene set Update Pipeline using RNASeq
1. RNA-Seq
• The RNA-Seq pipeline is highly automated; many species take around a week to process
2. Split core gene set into single transcript genes
3. Transcript scoring / filtering
• UTR addition done at the same time
4. Layering
• Avoiding pseudogenes
• Gap filling with fragments
5. Rebuild core set
6. Transfer pseudogenes + ncRNAs
The gene set update pipeline is fast and uses existing code in a novel way with very few alterations
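The layering step can be pictured with the simplified sketch below. It is not Ensembl's LayerAnnotation code: models are reduced to a single genomic span, overlap is tested on that span only, and the layer order, model names, and coordinates in the example are all made up. The idea shown is that models from a lower-priority layer are kept only where no model from a higher-priority layer already covers the region, which is how fragments can fill gaps.

```python
from typing import List, Tuple

# (model id, start, end) on one sequence: an assumed, simplified representation
Model = Tuple[str, int, int]


def overlaps(a: Model, b: Model) -> bool:
    """True if the genomic spans of two models share at least one base."""
    return a[1] <= b[2] and b[1] <= a[2]


def layer_models(layers: List[List[Model]]) -> List[Model]:
    """Accept models layer by layer, highest priority first.

    A model is kept only if it does not overlap any model accepted
    from a higher-priority layer.
    """
    accepted: List[Model] = []
    for layer in layers:
        higher = list(accepted)  # models accepted from earlier layers only
        for model in layer:
            if not any(overlaps(model, h) for h in higher):
                accepted.append(model)
    return accepted


layers = [
    [("rnaseq_1", 100, 500)],                                   # best-scoring models
    [("ensembl_1", 400, 800), ("ensembl_2", 1000, 1500)],       # existing models
    [("fragment_1", 1200, 1300), ("fragment_2", 2000, 2100)],   # gap fillers
]
print([m[0] for m in layer_models(layers)])
# ['rnaseq_1', 'ensembl_2', 'fragment_2']
```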
[Figure: Filter and add UTRs. Tracks: RNASeq models; Ensembl models; RNASeq Introns. Annotations: Add 'UTR'; Extend CDS]
Results

              Monodelphis             Platypus
              Genes    Transcripts    Genes    Transcripts
before merge  19,466   32,541         17,951   26,836
after merge   21,324   22,307         21,695   23,581
joined genes  132                     204
Gene set update pipeline - Summary
Quick, straightforward method of tidying up gene sets
Add species-specific models into gene sets that were previously mostly based on proteins from other species
Much more efficient than a new genebuild
Future work:
Lots of other species we could apply this to
See what effect it has on primates / projection builds - in progress
Ensembl Use of NHPRT data
Primates in Ensembl currently: Chimp, Gorilla, Rhesus macaque, Marmoset, Mouse lemur*, Squirrel monkey+, Baboon+, Orangutan, Gibbon, Tarsier* (+ = Pre!, * = 2x)
Run RNASeq pipeline on NHPRT primates in Ensembl to generate:
• Transcript models
• Introns
• BAM files of alignments
(would like individual tissue RNASeq data for this)
Use NHPRT RNASeq in Ensembl gene builds on new species, e.g. Baboon
Use NHPRT RNASeq to improve existing Ensembl gene sets, e.g. Rhesus macaque
Consider other uses:
• Targeted improvement of models for 'important' genes (disease related)
• Long non-coding genes
• Alignment to human
Steve Searle
Bronwen Aken
Daniel Barrell
Susan Fairley
Carlos Garcia Giron
Thibaut Hourlier
Andreas Kahari
Rishi Nag
Magali Ruffier
Amy Tang
Jan-Hinnerk Vogel
Amonida Zadissa
Acknowledgements
John E Collins
Stephen Keenan
Henrik Kaessmann
Jessica Alfoldi
Illumina (Human Body Map data)