
1 of 34

Ensembl use of RNASeq

Steve Searle

http://www.ensembl.org/java/core-api

2 of 34

Introduction

Ways we use RNASeq data in Ensembl:

• Build complete gene set from scratch for individual or pooled RNASeq data sets

• Incorporate into a new Ensembl gene set

• Add novel models into a gene set

• UTR

• Filtering Models

• Improve old gene sets


3 of 34

RNASeq pipeline: Building genes from RNASeq


4 of 34

RNASeq Pipeline: Alignment and Initial Processing

• Reads are aligned to the genome with a quick ungapped alignment using BWA

• Transcriptome reads can be split over introns, so we need to allow for this:

  • Align with up to 50% mismatches so that intron-spanning reads still align

• The alignments are then processed to collapse overlapping reads into blocks representing exons

• Read pairing is used (if available) to group the exon blocks into approximate transcript structures
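The collapsing step above can be pictured as a plain interval merge. The following is a minimal sketch, not the pipeline's own code: the function name `collapse_reads`, the tuple layout and the 1-based inclusive coordinates are all assumptions made for illustration.

```python
from collections import defaultdict


def collapse_reads(alignments):
    """Merge overlapping read alignments into exon-like blocks.

    `alignments` is an iterable of (chrom, start, end) tuples with
    1-based inclusive coordinates (an assumption of this sketch).
    Returns a list of (chrom, start, end) blocks sorted by position.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in alignments:
        by_chrom[chrom].append((start, end))

    blocks = []
    for chrom in sorted(by_chrom):
        block_start, block_end = None, None
        for start, end in sorted(by_chrom[chrom]):
            if block_start is None:
                # First read on this chromosome opens the first block.
                block_start, block_end = start, end
            elif start <= block_end:
                # Read overlaps the open block: extend it if needed.
                block_end = max(block_end, end)
            else:
                # Gap between reads: close the block and open a new one.
                blocks.append((chrom, block_start, block_end))
                block_start, block_end = start, end
        blocks.append((chrom, block_start, block_end))
    return blocks


reads = [("chr1", 100, 150), ("chr1", 140, 190),
         ("chr1", 300, 350), ("chr2", 10, 60)]
for block in collapse_reads(reads):
    print(block)
```

Grouping the resulting blocks into rough transcripts by read pairing would be a further step on top of this and is not shown.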


5 of 34

6 of 34

RNASeq Pipeline: Intron Alignment

We align split reads using Exonerate, which has a good splice model but is not a short-read aligner

Intron alignment is made faster in 2 ways:

• Don’t realign all the reads:

  • Introns are resolved by realigning partially aligned reads.

  • Use Exonerate word length to define which reads to realign

• Align to a single transcript:

  • Reads are realigned either to the rough transcript sequence or to the genomic span of the rough transcript.

  • Limiting the search space allows us to do a more sophisticated Exonerate alignment with a splice model and a shorter word length.

  • Aligning to the genomic span of the transcript can identify small exons that were missed by the BWA alignment; these can then be incorporated into the final model.
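As an illustration of the "don't realign all the reads" idea, here is a minimal sketch of one plausible selection rule. It assumes that a read is only worth realigning when the number of mismatched bases in its ungapped alignment is at least one Exonerate word long, since a shorter unmatched stretch could not seed a new alignment. That interpretation, the `Read` type, the field names and the default word length of 10 are assumptions for this example, not details taken from the slides.

```python
from dataclasses import dataclass


@dataclass
class Read:
    name: str
    length: int
    mismatches: int  # e.g. the NM tag of the ungapped alignment


def reads_to_realign(reads, word_length=10):
    """Return the reads whose mismatch count reaches the word length.

    Fully (or almost fully) matching reads are skipped, which keeps
    the number of spliced realignments small.
    """
    return [read for read in reads if read.mismatches >= word_length]


reads = [
    Read("read_a", length=76, mismatches=0),   # aligned cleanly
    Read("read_b", length=76, mismatches=3),   # a few sequencing errors
    Read("read_c", length=76, mismatches=30),  # probably spans an intron
]
for read in reads_to_realign(reads):
    print(read.name)
```

Only the selected reads would then be passed to the spliced aligner, against the rough transcript or its genomic span.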


[Figure labels: Exonerate spliced alignment; partially aligned reads; split reads; collapsed intron features; final models; BLASTP; coverage (PE12)]

9 of 34

Website Display of RNASeq pipeline results

Data visible in Ensembl

Transcript models

Intron features

BAM files of BWA alignments


10 of 34

Human gene ZMPSTE24

RNASeq introns by tissue

RNASeq models by tissue & merged

CCDS

GENCODE transcript


11 of 34


12 of 34

Nile tilapia: BAM files


13 of 34

Nile tilapia: BAM files


14 of 34

RNASeq Volume

We are collecting more and more RNASeq

We now have sizeable RNASeq sets for 12 species +

Pipeline is now being used in production

Further automation has allowed us to speed up model building:

• Process spreadsheet data to automate the pipeline setup and configuration

• Parse metadata out of spreadsheets into the final BAM files
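One way to picture the spreadsheet-to-BAM step is to turn each spreadsheet row into a SAM/BAM `@RG` (read group) header line. The sketch below does that for a CSV export. The column names (`sample_id`, `tissue`, `platform`, `description`) and the example row are hypothetical, and the choice of read-group headers as the place to store the metadata is an assumption; only the `@RG` tags themselves (`ID`, `SM`, `PL`, `DS`) come from the SAM format.

```python
import csv
import io


def read_group_lines(csv_text):
    """Build one tab-separated @RG header line per spreadsheet row.

    The column names used here are hypothetical; a real spreadsheet
    would need its own mapping.
    """
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        fields = [
            "@RG",
            f"ID:{row['sample_id']}",
            f"SM:{row['tissue']}",
            f"PL:{row['platform'].upper()}",
            f"DS:{row['description']}",
        ]
        lines.append("\t".join(fields))
    return lines


# Hypothetical example row, for illustration only.
sheet = (
    "sample_id,tissue,platform,description\n"
    "sample1,liver,illumina,paired-end liver sample\n"
)
for line in read_group_lines(sheet):
    print(line)
```

The same parsed rows could also drive the pipeline configuration, for example by listing which input files belong to which tissue.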


15 of 34

Using RNASeq in the Ensembl genebuild pipeline


16 of 34

Using RNASeq in the Ensembl genebuild pipeline

Some species have little specific data, e.g. Nile tilapia:

131 proteins in UniProt

35 cDNAs, 119,531 ESTs

Rely on data from related species

RNASeq supplements the above data:

Species-specific

Fills gaps, alternate splice sites, faster genebuild


17 of 34

Genebuild process

[Flowchart labels: Raw Computes; Targeted stage; Similarity stage; cDNAs/ESTs; Annotation Projection (primates); Filtering; TranscriptConsensus; LayerAnnotation; UTR addition; Final gene set]


18 of 34

Genebuild process

[Flowchart labels: Raw Computes; Targeted stage; Similarity stage; cDNAs/ESTs; Annotation Projection (primates); Merged RNA-Seq models; Filtering; UTR addition; Final gene set]


19 of 34

RNASeq helps with: 1. Choice of splice site

RNASeq

Similarity models

Ensembl model


20 of 34

RNASeq helps with: 2. UTR addition

RNASeq model

Similarity model

Ensembl model


21 of 34

RNASeq helps with: 3. New models

RNASeq introns

RNASeq model

Similarity model

Ensembl model


22 of 34

Species with RNASeq used in generating Ensembl gene set

Released:

• Zebrafish

• Tasmanian Devil

• Coelacanth

• Tilapia

In progress:

Dog, Turtle, Rat, Cat, Chicken, Platyfish

RNASeq is becoming a central part of the genebuild process, and many species' gene sets will include RNASeq components going forward


23 of 34

Gene set update pipeline using RNASeq


24 of 34

Gene set Update Pipeline using RNASeq

1. RNA-Seq

   • The RNA-Seq pipeline is highly automated; many species take around a week to process

2. Split core gene set into single-transcript genes

3. Transcript scoring / filtering

   • UTR addition done at the same time

4. Layering

   • avoiding pseudogenes

   • gap filling with fragments

5. Rebuild core set

6. Transfer pseudogenes + ncRNAs

The gene set update pipeline is fast and uses existing code in a novel way with very few alterations
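The layering step (step 4) can be sketched as a priority scheme: models from a higher-priority layer are kept, and models from lower layers are only added where nothing from a higher layer already covers the locus, which is how fragments end up filling gaps. This is a simplified illustration of the idea, not Ensembl's LayerAnnotation code; the layer order, the dictionary fields and the simple span-overlap test are assumptions, and strand and pseudogene handling are ignored.

```python
def overlaps(a, b):
    """True if two models share a chromosome and their spans intersect."""
    return (a["chrom"] == b["chrom"]
            and a["start"] <= b["end"]
            and b["start"] <= a["end"])


def layer_models(layers):
    """Pick models layer by layer, highest priority first.

    `layers` is a list of lists of model dicts. A model is accepted
    unless it overlaps a model accepted from a higher-priority layer;
    models within the same layer do not block each other.
    """
    accepted = []
    for layer in layers:
        higher = list(accepted)  # snapshot of higher-priority picks
        for model in layer:
            if not any(overlaps(model, kept) for kept in higher):
                accepted.append(model)
    return accepted


# Hypothetical models for illustration only.
full_length = [{"id": "rnaseq_1", "chrom": "1", "start": 100, "end": 900}]
fragments = [
    {"id": "frag_1", "chrom": "1", "start": 500, "end": 700},    # blocked
    {"id": "frag_2", "chrom": "1", "start": 2000, "end": 2400},  # fills a gap
]
for model in layer_models([full_length, fragments]):
    print(model["id"])
```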


Filter and add UTRs

[Figure labels: RNASeq models; Ensembl models; RNASeq introns; Add ‘UTR’; Extend CDS]

27 of 34


28 of 34


29 of 34


30 of 34


31 of 34

Results

Monodelphis      Genes     Transcripts
before merge     19,466    32,541
after merge      21,324    22,307
joined genes     132

Platypus         Genes     Transcripts
before merge     17,951    26,836
after merge      21,695    23,581
joined genes     204


32 of 34

Gene set update pipeline - Summary

Quick, straightforward method of tidying up gene sets

Add species-specific models into gene sets that were previously mostly based on proteins from other species

Much more efficient than a new genebuild

Future work:

Lots of other species we could apply this to

See what effect it has on primates / projection builds - in progress


33 of 34

Ensembl Use of NHPRT data

Primates in Ensembl currently: Chimp, Gorilla, Rhesus macaque, Marmoset, Mouse lemur*, Squirrel monkey+, Baboon+, Orangutan, Gibbon, Tarsier* (+ = Pre!, * = 2x)

Run RNASeq pipeline on NHPRT primates in Ensembl to generate:

• Transcript models

• Introns

• BAM files of alignments

(would like individual tissue RNASeq data for this)

Use NHPRT RNASeq in Ensembl gene builds on new species, e.g. Baboon

Use NHPRT RNASeq to improve existing Ensembl gene sets, e.g. Rhesus macaque

Consider other uses:

• Targeted improvement of models for ‘important’ genes (disease related)

• Long non-coding genes

• Alignment to human


34 of 34

Steve Searle

Bronwen Aken

Daniel Barrell

Susan Fairley

Carlos Garcia Giron

Thibaut Hourlier

Andreas Kahari

Rishi Nag

Magali Ruffier

Amy Tang

Jan-Hinnerk Vogel

Amonida Zadissa

Acknowledgements

John E Collins

Stephen Keenan

Henrik Kaessman

Jessica Alfoldi

Illumina (Human Body Map data)