Date posted: 10-May-2015
Category: Science
Uploaded by: australian-bioinformatics-network
Nadia Davidson Murdoch Childrens Research Institute
Introduction to RNA-Seq
Winter School in Mathematical and Computational Biology 2014
The central dogma of molecular biology
Image from wikipedia
Alternative splicing
DNA RNA
Transcriptional abundance
DNA RNA
2 copies
multiple copies, different “splice” variants
Transcriptional abundance
RNA – cell type A RNA – cell type B
Different quantities, different “splice” variants
A
G
Which copy is expressed more?
DNA
G
Base change after transcription
DNA
RNA
Structural rearrangement in the genome fuses Gene A to Gene B
DNA
RNA
Gene A Gene B
Benefits and opportunities of RNA-seq
• Differential expression
– Comparing the expression between different samples
• Whole transcriptome sequencing
– Annotation of new exons, transcribed regions, genes or non-coding RNAs
– The ability to look at alternative splicing
– Allele specific expression
– RNA editing
– Fusion genes in cancer
– Etc.
RNA-Seq
@HWI-ST945:93:c02g4acxx
GGAAAAGGCAGAGGGTGGACTAAATGCTCAATCATGGGATTCTAATCTGG
+
CCCFFFFFHHHHHJJFGIIJJJJJJJJJJJJJGJJJJJGIIJJJJJJJJJJJIHIJJJJJIIJJJ
Millions to billions of these
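Each read of this kind is a FASTQ record: a header line starting with "@", the base sequence, a "+" separator line and a string of per-base quality characters. The sketch below parses such records and converts the quality characters to numeric scores. It assumes the common four-line layout and the Phred+33 encoding, neither of which is stated on the slide. The names `FastqRecord`, `parse_fastq` and `phred_scores` are illustrative, and the example read is invented.

```python
from typing import List, NamedTuple


class FastqRecord(NamedTuple):
    name: str       # header line without the leading "@"
    sequence: str   # base calls
    quality: str    # one quality character per base


def parse_fastq(text: str) -> List[FastqRecord]:
    """Parse FASTQ text made of consecutive four-line records."""
    lines = [line.strip() for line in text.strip().splitlines()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ text must contain a multiple of four lines")
    records = []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        records.append(FastqRecord(header[1:], seq, qual))
    return records


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Convert quality characters to scores (assumes Phred+33 by default)."""
    return [ord(ch) - offset for ch in quality]


# Invented ten-base read, used only to exercise the functions.
example = """@read1
GGAAAAGGCA
+
CCCFFFFFHH
"""
record = parse_fastq(example)[0]
scores = phred_scores(record.quality)
```

With millions to billions of reads, a real pipeline would stream the file record by record instead of loading it into memory as this sketch does.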
RNA-Seq data analysis
• Whole transcriptome sequencing:
– What were the original full length transcript sequences?
– This Talk
• Differential expression:
– Do we have more blue transcripts in one cell type than another?
– Next Talk
What were the original full length transcript sequences…
if we have a reference genome?
The reference annotation
• Model organisms have a reference annotation
• E.g. ENSEMBL, RefSeq, UCSC, GENCODE all provide the position of known genes in the reference genome
• Often, we assume these are the full set of transcripts of a gene
• But how do we know which gene a read came from?
[Figure: UCSC screen shot of chrX (q13.2), hg19, positions 72,800,000 to 72,900,000 with a 50 kb scale bar, showing the Ensembl Gene Predictions (Ensembl 75) track with transcripts ENST00000602584, ENST00000438453, ENST00000421245, ENST00000373504, ENST00000373502, ENST00000498407 and ENST00000498318]
Mapping reads to the genome
Cole Trapnell & Steven L Salzberg, Nature Biotechnology 27, 455-457 (2009)
• Some reads can be mapped wholly to the genome (grey)
• Other reads need to be ‘split’ across splice sites (blue)
• Software: Tophat, STAR, Subread
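The difference between a wholly mapped read and a ‘split’ read can be illustrated with a toy example. The sketch below is not how Tophat, STAR or Subread work internally. It uses exact string search on a made-up genome, looks only at the first occurrence of each read piece, and accepts a split only when the gap starts with GT and ends with AG (the canonical intron boundary motif, added here as an assumption to resolve ambiguous split points). The function name `map_read` and the `min_anchor` parameter are hypothetical.

```python
from typing import List, Optional, Tuple

Block = Tuple[int, int]  # half-open [start, end) genome coordinates


def map_read(genome: str, read: str, min_anchor: int = 3) -> Optional[List[Block]]:
    """Return the aligned blocks for a read, or None if it cannot be placed."""
    # 1. Try to place the whole read without a gap.
    start = genome.find(read)
    if start != -1:
        return [(start, start + len(read))]
    # 2. Otherwise try each split point, keeping at least min_anchor bases per side.
    for k in range(min_anchor, len(read) - min_anchor + 1):
        left = genome.find(read[:k])
        if left == -1:
            continue
        donor = left + k  # first base of the candidate intron
        right = genome.find(read[k:], donor + 2)
        if right == -1:
            continue
        # Accept only gaps that look like a canonical GT...AG intron.
        if genome[donor:donor + 2] == "GT" and genome[right - 2:right] == "AG":
            return [(left, donor), (right, right + len(read) - k)]
    return None


# Made-up gene: two 10-base exons separated by a 16-base intron.
exon1 = "ATGGCCATTG"
intron = "GTAAGTCTCTCTCTAG"
exon2 = "CAAGGTTCGA"
genome = exon1 + intron + exon2

unspliced = map_read(genome, "GGCCATT")              # lies inside exon 1
spliced = map_read(genome, exon1[-5:] + exon2[:5])   # spans the exon junction
```

Here `unspliced` is a single block inside the first exon, while `spliced` comes back as two blocks whose gap is the intron.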
What were the original full length transcript sequences…
if we have a reference genome but want to find something novel?
Map reads
Graph splicing events
Traverse the graph
Genome guided assembly
Gene function? E.g. BLAST against the protein database or a related species (Blast2GO)
Jeffrey A. Martin & Zhong Wang, Nature Reviews Genetics 12, 671-682 (October 2011)
Software: Cufflinks, Scripture
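The three steps (map reads, graph splicing events, traverse the graph) can be sketched in a few lines. In this toy version exons are nodes, each splice junction seen in a split read is a directed edge from the upstream exon to the downstream exon, and every path from a first exon to a last exon is reported as a candidate transcript. Real assemblers choose among paths using read coverage and other criteria, so enumerating every path is a simplifying assumption. The exon coordinates and the function names are invented for illustration.

```python
from typing import Dict, List, Set, Tuple

Exon = Tuple[int, int]  # (start, end) genome coordinates of an exon


def build_splice_graph(junctions: List[Tuple[Exon, Exon]]) -> Dict[Exon, Set[Exon]]:
    """Build a directed graph with an edge for each observed splice junction."""
    graph: Dict[Exon, Set[Exon]] = {}
    for upstream, downstream in junctions:
        graph.setdefault(upstream, set()).add(downstream)
        graph.setdefault(downstream, set())
    return graph


def enumerate_transcripts(graph: Dict[Exon, Set[Exon]]) -> List[List[Exon]]:
    """List every path from an exon with no incoming edge to one with no outgoing edge.

    Assumes the graph is acyclic, which holds when every junction points
    from an upstream exon to a downstream exon.
    """
    has_incoming = {exon for targets in graph.values() for exon in targets}
    sources = sorted(exon for exon in graph if exon not in has_incoming)
    paths: List[List[Exon]] = []

    def walk(exon: Exon, path: List[Exon]) -> None:
        path = path + [exon]
        if not graph[exon]:          # terminal exon: the path is complete
            paths.append(path)
            return
        for nxt in sorted(graph[exon]):
            walk(nxt, path)

    for source in sources:
        walk(source, [])
    return paths


# Invented gene with three exons; the a -> c junction skips the middle exon.
a, b, c = (100, 200), (300, 400), (500, 600)
graph = build_splice_graph([(a, b), (b, c), (a, c)])
transcripts = enumerate_transcripts(graph)
```

For this example the traversal yields two candidate transcripts: one containing all three exons and one that skips the middle exon.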
What were the original full length transcript sequences…
if we don’t have a reference genome?
De novo transcriptome assembly
• Like genome assembly
• But also needs to deal with:
– Splicing
– Non-uniform coverage
• Software: Trinity, Oases, TransAbyss
[Figure 3 of Francis et al.: de novo assembly statistics plotted against Reads (Millions), 0 to 80, in eight panels: A, Number of transcripts; B, Mean transcript length (bp); C, Median transcript length (bp); D, N50 (bp); E, Number of loci; F, Loci per million reads; G, Transcripts per million reads; H, Average transcripts per locus. Samples: C.multidentata, H.californensis, P.robusta, H.imbricata, S.similis, D.gigas, Mouse]
Francis et al., BMC Genomics 2013
• Challenges:
– Accuracy
– Computational requirements
– Lots of transcripts. Need to filter and cluster transcripts into genes (e.g. with Corset, CD-HIT-EST, assembler information etc.)
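One of the assembly statistics plotted by Francis et al. is N50: the length such that transcripts of at least that length contain half or more of all assembled bases. A minimal sketch of the calculation follows. The function name `n50` and the example lengths are illustrative and not taken from the paper.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of transcript (or contig) lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first length where the longest sequences so far
        # cover at least half of the assembled bases.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


# Invented lengths: the two longest (500 + 400) cover at least half of 1500 bases.
example_n50 = n50([100, 200, 300, 400, 500])
```

Because N50 is weighted by length, a handful of long transcripts can raise it even when most assembled transcripts are short, which is one reason it is usually reported alongside the mean and median.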
What were the original full length transcript sequences…
if we have a reference genome but it’s not very good?
More common than you may think
– Non-model organisms:
• A badly assembled genome
• No reference genome, but one of a related species
– Model organisms:
• Cancer
• Poorly assembled regions in an otherwise good reference genome
– No standard approach
Example - Annotating the chicken W sex chromosome
Chicken is a model organism, but the sequenced reference W chromosome is poorly assembled, with missing sequence.
Motivation: The mechanism for sex determination in birds has not been proven. Are there any novel W genes which could be involved?
Source: http://mac122.icu.ac.jp/gen-ed/mendel-gifs/13-sex-chromosomes.JPG
Experiment and analysis
Extracted and sequenced mRNA from the gonads of 4 female and 4 male embryonic chickens
1.4 billion 100 bp paired-end reads
Re-assembled the reference annotation sequences (Ensembl), with a genome guided assembly (Cufflinks) and a de novo assembly (Abyss)
Identified W genes as those with female specific expression
Discovered 2 novel W genes and found the full length sequences for 1/3 of the known W genes, whose sequences were previously incomplete
Some W candidates were followed up in the lab for sex determination studies
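The idea of identifying W genes by female specific expression can be sketched as a simple filter over a table of per-sample expression values. This is an illustration of the idea only, not the procedure used in the study. The thresholds `min_female` and `max_male`, the function name and all values in the example are invented.

```python
from typing import Dict, List


def female_specific_genes(
    counts: Dict[str, Dict[str, float]],
    females: List[str],
    males: List[str],
    min_female: float = 10.0,   # hypothetical cut-off for "expressed"
    max_male: float = 1.0,      # hypothetical cut-off for "not expressed"
) -> List[str]:
    """Return genes expressed in every female sample and in no male sample."""
    hits = []
    for gene, per_sample in counts.items():
        in_all_females = all(per_sample.get(s, 0.0) >= min_female for s in females)
        in_no_males = all(per_sample.get(s, 0.0) <= max_male for s in males)
        if in_all_females and in_no_males:
            hits.append(gene)
    return sorted(hits)


# Invented expression values for three genes in two female and two male samples.
counts = {
    "geneW": {"F1": 120.0, "F2": 95.0, "M1": 0.0, "M2": 0.5},
    "geneA": {"F1": 80.0, "F2": 75.0, "M1": 82.0, "M2": 90.0},
    "geneZ": {"F1": 40.0, "F2": 0.0, "M1": 0.0, "M2": 0.0},
}
candidates = female_specific_genes(counts, ["F1", "F2"], ["M1", "M2"])
```

Only `geneW` passes here: `geneA` is expressed in both sexes and `geneZ` is missing from one female sample. A fixed threshold like this ignores sequencing depth and mapping noise, so a real analysis would need normalised values and a statistical test.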
An example of one W gene
[Figure: Ayers et al., 2013. Read coverage (0 to 194) in Blastoderm and Gonads plotted against base position in the transcript (0 to 2500), alongside tracks for the Reference Annotation, Genome, Genome guided and De novo assembly sequences. Key: on the W chromosome in the reference chicken genome; on “Unknown” contigs in the reference chicken genome; on an autosome in the reference chicken genome]
Take home message: All approaches have their strengths and limitations
Summary
• RNA-seq is very powerful!
– It allows both the transcript sequence and the relative quantities to be measured.
– It has numerous applications:
• It complements DNA sequencing by telling us how the genome is actually used in a particular cell type.
• In some cases (e.g. non-model organisms) it can circumvent the need for DNA sequencing.
– There are standard pipelines for some applications, but many require a problem specific solution. Challenging but fun!
Acknowledgements
MCRI Bioinformatics
The (Alicia) Oshlack Lab
This research was partly conducted within the Poultry CRC, established and supported under the Australian Government’s Cooperative Research Centres Program.
Red Jungle Fowl (credit: NHGRI)
Chicken W genes: MCRI Comparative Development
Craig Smith
Katie Ayers
Feel free to email me with questions: [email protected]
More information
• General:
– Wang et al., RNA-Seq: a revolutionary tool for transcriptomics, Nature Reviews Genetics, 2009
• Differential Expression Pipelines and Reviews:
– Oshlack et al., From RNA-seq reads to differential expression results, Genome Biology, 2010
– Anders et al., Count-based differential expression analysis of RNA sequencing data using R and Bioconductor, Nature Protocols, 2013
– http://bioinf.wehi.edu.au/RNAseqCaseStudy/
• Assembly Pipelines and Reviews:
– Martin & Wang, Next-generation transcriptome assembly, Nature Reviews Genetics, 2011
– https://code.google.com/p/corset-project/wiki/Example
– Haas et al., De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis, Nature Protocols, 2013
• The human transcriptome (ENCODE):
– Djebali et al., Landscape of transcription in human cells, Nature, 2012