Nadia Davidson - Introduction to RNA-seq

Nadia Davidson, Murdoch Childrens Research Institute

Introduction to RNA-Seq

Winter School in Mathematical and Computational Biology 2014

The central dogma of molecular biology

Image from wikipedia

Alternative splicing

DNA RNA

Transcriptional abundance

DNA RNA

2 copies

multiple copies, different “splice” variants

Transcriptional abundance

RNA – cell type A RNA – cell type B

Different quantities, different “splice” variants

A

G

Which copy is expressed more?

DNA

G
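One rough way to put a number on "which copy is expressed more", assuming the reads overlapping a heterozygous site have already been counted per allele, is to test the two counts against a 50:50 expectation with an exact binomial test. The function name and the example counts below are hypothetical, and a real analysis would also need to deal with issues such as mapping bias towards the reference allele.

```python
from math import comb


def allelic_imbalance_pvalue(allele1_reads: int, allele2_reads: int) -> float:
    """Two-sided exact binomial test of a 50:50 allelic ratio."""
    n = allele1_reads + allele2_reads
    k = max(allele1_reads, allele2_reads)
    # P(X >= k) for X ~ Binomial(n, 0.5); doubled because the null is symmetric.
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * upper_tail)


# Hypothetical counts: 8 reads carry the A allele, 2 carry the G allele.
p_skewed = allelic_imbalance_pvalue(8, 2)
# Hypothetical counts: both alleles seen equally often.
p_balanced = allelic_imbalance_pvalue(5, 5)
```

With only ten reads even an 8:2 split is weak evidence, which is one reason sequencing depth matters for this application.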

Base change after transcription

DNA

RNA

Structural rearrangement in the genome fuses Gene A to Gene B

DNA

RNA

Gene A Gene B

Benefits and opportunities of RNA-seq

•  Differential expression
  –  Comparing the expression between different samples
•  Whole transcriptome sequencing
  –  Annotation of new exons, transcribed regions, genes or non-coding RNAs
  –  The ability to look at alternative splicing
  –  Allele specific expression
  –  RNA editing
  –  Fusion genes in cancer
  –  Etc.

RNA-Seq

@HWI-ST945:93:c02g4acxx
GGAAAAGGCAGAGGGTGGACTAAATGCTCAATCATGGGATTCTAATCTGG
+
CCCFFFFFHHHHHJJFGIIJJJJJJJJJJJJJGJJJJJGIIJJJJJJJJJJJIHIJJJJJIIJJJ

Millions to billions of these
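Each read in a FASTQ file occupies four lines: a name starting with @, the bases, a + separator, and one quality character per base. The following is a minimal sketch of reading such records, assuming the common Phred+33 quality encoding (an assumption, since the slides do not state the encoding) and using a made-up toy read rather than the one shown above.

```python
from typing import Iterator, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: list[int]  # one Phred score per base


def parse_fastq(lines: list[str], offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from FASTQ text, four lines per read."""
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = (line.strip() for line in lines[i:i + 4])
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch at line {i + 1}")
        # Phred score Q = ord(char) - offset
        yield FastqRecord(header[1:], seq, [ord(c) - offset for c in qual])


# Hypothetical toy read, not taken from real data.
toy = ["@read1", "GATTACA", "+", "CCFFHJJ"]
record = next(parse_fastq(toy))
# Error probability per base: 10 ** (-Q / 10)
error_probs = [10 ** (-q / 10) for q in record.qualities]
```

For files with millions of reads, the same function can be fed lines lazily from an open file handle with small changes; the list-based version is kept here only for brevity.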

RNA-Seq data analysis

•  Whole transcriptome sequencing:
  –  What were the original full length transcript sequences?
  –  This Talk
•  Differential expression:
  –  Do we have more blue transcripts in one cell type than another?
  –  Next Talk

What were the original full length transcript sequences… if we have a reference genome?

The reference annotation

•  Model organisms have a reference annotation
•  E.g. ENSEMBL, RefSeq, UCSC, GENCODE all provide the position of known genes in the reference genome
•  Often, we assume these are the full set of transcripts of a gene
•  But how do we know which gene a read came from?

[Figure: UCSC screen shot. Assembly hg19, chrX (q13.2), ticks at 72,800,000, 72,850,000 and 72,900,000, 50 kb scale bar. Track "Ensembl Gene Predictions - Ensembl 75" showing ENST00000602584, ENST00000438453, ENST00000421245, ENST00000373504, ENST00000373502, ENST00000498407 and ENST00000498318.]

Mapping reads to the genome

Cole Trapnell & Steven L Salzberg, Nature Biotechnology 27, 455-457 (2009)

•  Some reads can be mapped wholly to the genome (grey)
•  Other reads need to be ‘split’ across splice sites (blue)
•  Software: TopHat, STAR, Subread
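In the SAM alignment format, a split read is recorded with an N operation in its CIGAR string, marking the reference bases skipped between the two parts of the read. The sketch below turns a mapping position and a CIGAR string into the genome blocks the read covers. The function names and coordinates are hypothetical and chosen only for illustration.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def aligned_blocks(pos: int, cigar: str) -> list[tuple[int, int]]:
    """Return 1-based inclusive genome intervals covered by one alignment.

    pos is the leftmost mapped position; cigar is a SAM CIGAR string.
    """
    blocks = []
    start = pos
    cursor = pos
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op in "M=XD":      # consumes the reference inside a block
            cursor += n
        elif op == "N":       # skipped reference region, i.e. an intron
            if cursor > start:
                blocks.append((start, cursor - 1))
            cursor += n
            start = cursor
        # I, S, H and P do not advance along the reference
    if cursor > start:
        blocks.append((start, cursor - 1))
    return blocks


def is_split(cigar: str) -> bool:
    """True when the alignment spans at least one skipped region."""
    return "N" in cigar


whole = aligned_blocks(100, "50M")          # read mapped wholly to the genome
split = aligned_blocks(100, "30M500N20M")   # read split across a 500 bp intron
```

Here `whole` is a single block, while `split` comes back as two blocks separated by the intron.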


… if we have a reference genome but want to find something novel?

Genome guided assembly

•  Map reads
•  Graph splicing events
•  Traverse the graph

Gene function? E.g. BLAST against the protein database or a related species (Blast2GO)

Jeffrey A. Martin & Zhong Wang, Nature Reviews Genetics 12, 671-682 (October 2011)

Software: Cufflinks, Scripture
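The "graph splicing events, traverse the graph" steps can be illustrated with a toy splice graph: exons are nodes, observed splice junctions are edges, and each path through the graph is a candidate transcript. This is only a sketch with hypothetical exon coordinates and function names. It enumerates every path, whereas real assemblers choose among paths using read coverage and other criteria, so it should not be read as how Cufflinks or Scripture work internally.

```python
from collections import defaultdict

Exon = tuple[int, int]  # (start, end) on the genome


def build_splice_graph(junctions: list[tuple[Exon, Exon]]) -> dict[Exon, list[Exon]]:
    """Connect each upstream exon to the downstream exons it is spliced to."""
    graph = defaultdict(list)
    for upstream, downstream in junctions:
        if downstream not in graph[upstream]:
            graph[upstream].append(downstream)
    return dict(graph)


def enumerate_transcripts(graph: dict[Exon, list[Exon]], first_exon: Exon) -> list[list[Exon]]:
    """Depth-first traversal listing every path that starts at first_exon
    and ends at an exon with no outgoing junction."""
    paths = []
    stack = [[first_exon]]
    while stack:
        path = stack.pop()
        successors = graph.get(path[-1], [])
        if not successors:
            paths.append(path)
        for nxt in successors:
            stack.append(path + [nxt])
    return sorted(paths)


# Hypothetical gene with three exons; the e1 -> e3 junction skips exon 2.
e1, e2, e3 = (100, 200), (300, 400), (500, 600)
graph = build_splice_graph([(e1, e2), (e2, e3), (e1, e3)])
transcripts = enumerate_transcripts(graph, e1)
```

The two resulting paths correspond to the full-length variant and the exon-skipping variant.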


… if we don’t have a reference genome?

De novo transcriptome assembly

•  Like genome assembly
•  But also needs to deal with:
  –  Splicing
  –  Non-uniform coverage
•  Software: Trinity, Oases, TransAbyss

[Figure 3: eight panels (A to H), each plotted against Reads (Millions) from 0 to 80. A: Number of transcripts. B: Mean transcript length (bp). C: Median transcript length (bp). D: N50 (bp). E: Number of loci. F: Loci per million reads. G: Transcripts per million reads. H: Average transcripts per locus. Samples: C.multidentata, H.californensis, P.robusta, H.imbricata, S.similis, D.gigas, Mouse.]

Francis et al., BMC Genomics 2013
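Several of the assembly metrics plotted in the figure are simple functions of the assembled transcript lengths. As a sketch, N50 is taken here in its usual sense: the length L such that transcripts of length L or longer contain at least half of all assembled bases. The lengths below are invented for illustration and are not from the paper.

```python
from statistics import mean, median


def n50(lengths: list[int]) -> int:
    """Length L such that sequences of length >= L hold at least half of all bases."""
    if not lengths:
        raise ValueError("no transcript lengths given")
    half_total = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length
    raise AssertionError("unreachable: the running total always reaches half")


# Hypothetical assembled transcript lengths in bp.
lengths = [2000, 1500, 1000, 500, 300, 200]
summary = {
    "number_of_transcripts": len(lengths),
    "mean_length": mean(lengths),
    "median_length": median(lengths),
    "n50": n50(lengths),
}
```

Because N50 weights long sequences more heavily than the mean or median, the three length summaries can respond quite differently as more reads are added.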

•  Challenges:
  –  Accuracy
  –  Computational requirements
  –  Lots of transcripts. Need to filter and cluster transcripts into genes (e.g. with Corset, CD-HIT-EST, assembler information etc.)
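To make "cluster transcripts into genes" concrete, here is a toy sketch that groups assembled contigs whenever enough reads map to both of them, using a small union-find. It is loosely inspired by the idea of using multi-mapped reads as evidence that contigs belong together. It is not the algorithm of any tool named above, and the function name, threshold and read assignments are all hypothetical.

```python
from collections import defaultdict


def cluster_by_shared_reads(
    read_to_contigs: dict[str, list[str]], min_shared: int = 1
) -> list[set[str]]:
    """Group contigs that share at least min_shared reads (single linkage)."""
    shared = defaultdict(int)
    contigs = set()
    for hits in read_to_contigs.values():
        unique = sorted(set(hits))
        contigs.update(unique)
        # Count every pair of contigs hit by the same read.
        for i, a in enumerate(unique):
            for b in unique[i + 1:]:
                shared[(a, b)] += 1

    parent = {c: c for c in contigs}

    def find(c: str) -> str:
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving
            c = parent[c]
        return c

    for (a, b), n in shared.items():
        if n >= min_shared:
            parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for c in contigs:
        clusters[find(c)].add(c)
    return sorted(clusters.values(), key=sorted)


# Hypothetical read-to-contig mappings.
mappings = {
    "r1": ["c1", "c2"],
    "r2": ["c2", "c3"],
    "r3": ["c4"],
    "r4": ["c1", "c2"],
}
loose = cluster_by_shared_reads(mappings, min_shared=1)
strict = cluster_by_shared_reads(mappings, min_shared=2)
```

Raising `min_shared` splits off `c3`, which shares only a single read with `c2`.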


… if we have a reference genome but it’s not very good?

More common than you may think

–  Non-model organisms:
  •  A badly assembled genome
  •  No reference genome, but one of a related species
–  Model organisms:
  •  Cancer
  •  Poorly assembled regions in an otherwise good reference genome
–  No standard approach

Example - Annotating the chicken W sex chromosome

Chicken is a model organism, but the sequenced reference W chromosome is poorly assembled, with missing sequence.

Motivation: The mechanism for sex determination in birds has not been proven. Are there any novel W genes which could be involved?

Source: http://mac122.icu.ac.jp/gen-ed/mendel-gifs/13-sex-chromosomes.JPG

Experiment and analysis

Extracted and sequenced mRNA from the gonads of 4 female and 4 male embryonic chickens

1.4 billion 100bp paired-end reads

Re-assembled the reference annotation sequences (Ensembl) with a genome guided assembly (Cufflinks) and a de novo assembly (Abyss)

Identified W genes as those with female specific expression

Discovered 2 novel W genes and found the full length sequences for 1/3 of the known W genes, whose sequences were previously incomplete.

Some W candidates were followed up in the lab for sex determination studies
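The "female specific expression" criterion can be sketched as a simple filter on per-sample read counts: keep genes that are expressed in every female sample and essentially absent from every male sample. The thresholds, gene names and counts below are hypothetical, and the actual study may well have used a different test.

```python
def female_specific(
    counts: dict[str, dict[str, int]],
    females: list[str],
    males: list[str],
    min_female: int = 10,   # assumed minimum count in every female sample
    max_male: int = 0,      # assumed maximum count tolerated in any male sample
) -> list[str]:
    """Return genes expressed in all female samples and in no male sample."""
    hits = []
    for gene, by_sample in counts.items():
        in_all_females = all(by_sample.get(s, 0) >= min_female for s in females)
        absent_in_males = all(by_sample.get(s, 0) <= max_male for s in males)
        if in_all_females and absent_in_males:
            hits.append(gene)
    return sorted(hits)


# Hypothetical read counts per gene and sample (not real data).
females = ["F1", "F2", "F3", "F4"]
males = ["M1", "M2", "M3", "M4"]
counts = {
    "geneA": {"F1": 120, "F2": 98, "F3": 143, "F4": 110, "M1": 0, "M2": 0, "M3": 0, "M4": 0},
    "geneB": {"F1": 55, "F2": 61, "F3": 48, "F4": 70, "M1": 52, "M2": 66, "M3": 49, "M4": 58},
    "geneC": {"F1": 30, "F2": 0, "F3": 25, "F4": 41, "M1": 0, "M2": 0, "M3": 0, "M4": 0},
}
candidates = female_specific(counts, females, males)
```

In practice a small non-zero `max_male` may be needed, since reads from similar genes elsewhere in the genome can be mis-assigned to a W gene.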

An example of one W gene

[Figure: one W gene shown as aligned tracks labelled Genome, Reference Annotation, Genome guided and De novo assembly, above read coverage (axis 0 to 194) for Blastoderm and Gonads, plotted against base position in the transcript (0 to 2500). Legend: on the W chromosome in the reference chicken genome; on “Unknown” contigs in the reference chicken genome; on an autosome in the reference chicken genome.]

Ayers et al, 2013

Take home message: All approaches have their strengths and limitations

Summary

•  RNA-seq is very powerful!
  –  It allows both the transcript sequence and the relative quantities to be measured.
  –  It has numerous applications:
    •  It complements DNA sequencing by telling us how the genome is actually used in a particular cell type.
    •  In some cases (e.g. non-model organisms) it can circumvent the need for DNA sequencing.
  –  There are standard pipelines for some applications, but many require a problem specific solution. Challenging but fun!

Acknowledgements

MCRI Bioinformatics
The (Alicia) Oshlack Lab

This research was partly conducted within the Poultry CRC, established and supported under the Australian Government’s Cooperative Research Centres Program.

Red Jungle Fowl (credit: NHGRI)

Chicken W genes: MCRI Comparative Development
Craig Smith
Katie Ayers

Feel free to email me with questions: [email protected]

More information

•  General:
  –  Wang et al., RNA-Seq: a revolutionary tool for transcriptomics, Nature Reviews Genetics 2009
•  Differential Expression Pipelines and Reviews:
  –  Alicia Oshlack et al., From RNA-seq reads to differential expression results, Genome Biology 2010
  –  Anders et al., Count-based differential expression analysis of RNA sequencing data using R and Bioconductor, Nature Protocols, 2013
  –  http://bioinf.wehi.edu.au/RNAseqCaseStudy/
•  Assembly Pipelines and Reviews:
  –  Jeffrey A. Martin & Zhong Wang, Next-generation transcriptome assembly, Nature Reviews Genetics 2011
  –  https://code.google.com/p/corset-project/wiki/Example
  –  Haas et al., De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis, Nature Protocols, 2013
•  The human transcriptome (ENCODE):
  –  Sarah Djebali et al., Landscape of transcription in human cells, Nature 2012

Date post:	10-May-2015