RNA sequencing: advances and opportunities


Paolo Dametto May 2012

Is there a correlation between the size of the genome and the morphological complexity?

- Only to a certain extent!
- There is no clear correlation between the size of a genome and the overall complexity of an organism

Is there a correlation between the number of genes and morphological complexity?

- Once again, only to a certain extent
- The complexity of an organism increases much more than the number of genes
- A more complex construction does not require a greater variety of pieces; it requires a more complex design

Introns 24%

Exons 1%

Intergenic DNA 30%

Transposons 45%

- Transcription factors
- Operators
- Enhancers
- Promoters
- ncRNA (e.g. involved in alternative splicing and miRNA)
- Antisense transcripts

Genetic expression workflow

1951-1965

Nowadays

RNA

RNAs involved in protein synthesis

Type | Abbr. | Function | Distribution
Messenger RNA | mRNA | Codes for protein | All organisms
Ribosomal RNA | rRNA | Translation | All organisms
Signal recognition particle RNA | 7SL RNA or SRP RNA | Membrane integration | All organisms
Transfer RNA | tRNA | Translation | All organisms
Transfer-messenger RNA | tmRNA | Rescuing stalled ribosomes | Bacteria

RNAs involved in post-transcriptional modification or DNA replication

Type | Abbr. | Function | Distribution
Small nuclear RNA | snRNA | Splicing and other functions | Eukaryotes and archaea
Small nucleolar RNA | snoRNA | Nucleotide modification of RNAs | Eukaryotes and archaea
SmY RNA | SmY | mRNA trans-splicing | Nematodes
Small Cajal body-specific RNA | scaRNA | Type of snoRNA; nucleotide modification of RNAs |
Guide RNA | gRNA | mRNA nucleotide modification | Kinetoplastid mitochondria
Ribonuclease P | RNase P | tRNA maturation | All organisms
Ribonuclease MRP | RNase MRP | rRNA maturation, DNA replication | Eukaryotes
Y RNA | | RNA processing, DNA replication | Animals
Telomerase RNA | | Telomere synthesis | Most eukaryotes

Regulatory RNAs

Type | Abbr. | Function | Distribution
Antisense RNA | aRNA | Transcriptional attenuation / mRNA degradation / mRNA stabilisation / Translation block | All organisms
Cis-natural antisense transcript | | Gene regulation |
CRISPR RNA | crRNA | Resistance to parasites, probably by targeting their DNA | Bacteria and archaea
Long noncoding RNA | Long ncRNA | Various | Eukaryotes
MicroRNA | miRNA | Gene regulation | Most eukaryotes
Piwi-interacting RNA | piRNA | Transposon defense, maybe other functions | Most animals
Small interfering RNA | siRNA | Gene regulation | Most eukaryotes
Trans-acting siRNA | tasiRNA | Gene regulation | Land plants
Repeat associated siRNA | rasiRNA | Type of piRNA; transposon defense | Drosophila

The Transcriptome

Aims of transcriptomics:

- To catalogue all species of transcripts;
- To determine the transcriptional structure of genes, in terms of their start sites, 5’ and 3’ ends, splicing patterns and other post-transcriptional modifications;
- To quantify the changing expression levels of each transcript during development and under different conditions.

Various technologies have been developed to deduce and quantify the transcriptome, including hybridization-based approaches (microarray) or sequence-based approaches (RNA-seq)

Hybridization-based approach: Microarray technology…

- High-throughput
- Fast
- Relatively inexpensive
- Multiple applications:
  - gene expression profiling
  - gene fusion detection
  - alternative splicing detection
  - SNP detection
  - Tiling array
  - ChIP

- Reliance upon existing knowledge about genome sequence
- High background levels owing to cross-hybridization
- A limited dynamic range of detection due to both background and saturation of signals
- Comparing expression levels across different experiments is often difficult and can require complicated normalization methods

…and its limitations

- Directly determine the cDNA sequence, hence defining the corresponding mRNA

1. Sanger sequencing of cDNA or EST libraries
   - Low-throughput, expensive, generally not quantitative
2. Tag-based methods were developed: SAGE, CAGE, MPSS
   - Still expensive because they are based on Sanger sequencing; short tags cannot be uniquely mapped to the reference genome; isoforms are generally not distinguishable
3. RNA-seq, based on NGS technologies
   - By analyzing the transcriptome at unprecedented depth and accuracy, RNA-seq has shown that thousands of new transcript variants and isoforms are expressed in mammalian tissues or organs
   - It has greatly accelerated our understanding of the complexity of gene expression, regulation and networks in mammalian cells

Sequence-based approaches

NGS

Roche/454 Illumina/Solexa Life/SOLiD

Helicos/tSMS Pacific Biosciences Life/Ion Torrent

A typical RNA-seq experiment

RNA-seq for detection of alternative splicing events

Library construction

- Larger RNA molecules must be fragmented into smaller pieces (200-500 bp) to be compatible with most deep-sequencing technologies
- RNA fragmentation has little bias over the transcript body, but is depleted for transcript ends compared with other methods
- cDNA fragmentation is usually strongly biased towards the identification of sequences from the 3’ ends of transcripts

Challenges for RNA-seq

Challenges for RNA-seq: Bioinformatic challenges

- Development of efficient methods to store, retrieve and process large amounts of data: ELAND, SOAP, MAQ and RMAP
- High-quality reads are selected and matched against a reference genome, or they are first assembled into contigs before aligning them to the genomic sequence to reveal transcription structure

1. Junction reads are difficult to map:
   - a junction library containing all known and predicted junction sequences is created, and junction reads are mapped against it
2. Many reads match multiple locations in the genome (e.g. repetitive regions):
   - Multi-matched reads are assigned in proportion to the number of reads mapped to their neighbouring unique sequences
   - Roche 454 to obtain longer reads (250 bp)
   - Paired-end sequencing strategy (Solexa)
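The proportional assignment of multi-matched reads described above can be sketched as follows. This is a minimal illustration, not the method of any particular mapper: the function name `allocate_multireads`, the input layout, and the even split used when no candidate locus has unique reads are all assumptions made for the example.

```python
from collections import defaultdict


def allocate_multireads(unique_counts, multireads):
    """Distribute multi-matched reads in proportion to unique read counts.

    unique_counts: dict mapping locus name -> number of uniquely mapped reads.
    multireads: list of lists; each inner list holds the candidate loci
                that one multi-matched read aligns to.
    Returns a dict mapping locus -> estimated total read count (float).
    """
    totals = defaultdict(float)
    for locus, count in unique_counts.items():
        totals[locus] += count

    for candidates in multireads:
        weights = [unique_counts.get(locus, 0) for locus in candidates]
        weight_sum = sum(weights)
        for locus, weight in zip(candidates, weights):
            if weight_sum > 0:
                share = weight / weight_sum
            else:
                # Assumption: with no unique evidence, split the read evenly.
                share = 1 / len(candidates)
            totals[locus] += share
    return dict(totals)


# Toy data: ten reads match both loci; geneA has 9x more unique reads.
unique = {"geneA": 90, "geneB": 10}
multi = [["geneA", "geneB"]] * 10
estimated = allocate_multireads(unique, multi)
print(estimated)
```

In this toy case each ambiguous read contributes 0.9 to geneA and 0.1 to geneB, so the ten multi-matched reads add about 9 reads to geneA and 1 read to geneB.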

Challenges for RNA-seq: Defining transcription level

- RNA-seq can be used to determine expression levels more accurately than microarrays. In principle, it is possible to determine the absolute quantity of every molecule in a cell population, and to directly compare results between experiments.

1. RNA fragmentation + cDNA synthesis (exon body-biased):
   - Gene expression level is deduced from the total number of reads that fall into the exons of a gene, normalized by the length of exons that can be uniquely mapped
2. cDNA fragmentation (3’ end-biased):
   - read counts from a window near the 3’ end are used
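The length normalization in point 1 can be illustrated with a short sketch. The function names are hypothetical. The additional scaling by millions of mapped reads (the widely used RPKM convention) is not stated on the slide and is included here only as an assumption about how samples of different sequencing depth might be made comparable.

```python
def exon_read_density(read_count, mappable_exon_length_bp):
    """Reads per kilobase of uniquely mappable exon sequence."""
    if mappable_exon_length_bp <= 0:
        raise ValueError("mappable exon length must be positive")
    return read_count / (mappable_exon_length_bp / 1000)


def rpkm(read_count, mappable_exon_length_bp, total_mapped_reads):
    """Reads per kilobase per million mapped reads.

    Assumption: the per-million scaling is the common RPKM convention,
    added here for illustration; the slide only mentions length.
    """
    if total_mapped_reads <= 0:
        raise ValueError("total mapped reads must be positive")
    density = exon_read_density(read_count, mappable_exon_length_bp)
    return density / (total_mapped_reads / 1_000_000)


# Toy numbers: 500 reads on 2 kb of mappable exon, 10 million mapped reads.
print(exon_read_density(500, 2000))   # 250.0 reads per kb
print(rpkm(500, 2000, 10_000_000))    # 25.0
```

Dividing by the mappable length rather than the annotated length avoids penalizing genes whose exons contain repetitive stretches where no read can be placed uniquely.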

- RNA-seq can capture transcriptome dynamics across different tissues or conditions without sophisticated normalization of data sets.

- mRNA-seq on a single mouse blastomere and oocyte
- They detected the expression of 75% (5270) more genes than microarray techniques
- They identified 1753 previously unknown splice junctions called by at least 5 reads
- 8-19% of the genes with multiple known transcript isoforms expressed at least two isoforms in the same blastomere or oocyte
- Dicer1-/- and Ago2-/- oocytes show 1696 and 1553 genes, respectively, to be upregulated compared to wild-type controls, with 619 genes in common

Life/SOLiD

Mitinouri S. et al, Nat Protocol, 2007

5 min > 30 min

3 min > 6 min

(80-130 bp) 64% genes

High accuracy of the sequencing technique and mapping algorithms

Comparison of mRNA-Seq and microarray assays

- Microarray analysis of 320 blastomeres found 6650 genes in common with RNA-seq. Overall RNA-seq detected 60% more genes compared with microarray.
- mRNA-Seq missed 5.7% of the transcripts (400 genes)
  - 327/400 genes had fluorescence intensity on the chip lower than 100
  - 9/11 genes tested by RT-PCR were found to be false positives
  - Cross-hybridization
  - Stochastically, some low-expressed genes in a single cell can be either on or off.
- Very similar expression pattern compared to a NIH mouse array
- 380 genes detected by RNA-seq were chosen and tested by RT-PCR. 71% were clearly confirmed

1. Generation of a library containing all possible combinations of exon-exon junctions as 84-bp sequences, with 42 bp from each exon
2. Removal of all known exon junctions
3. Matching of RNA-seq reads against the new library

Results
- One blastomere: 6701 and 1753 new junctions with at least 2 or 5 reads, respectively
  - 8/8 confirmed by RT-PCR
- One mature oocyte: 9012 and 2070 new junctions
- 335 genes (19% of all known genes with at least two known isoforms) expressed at least two transcript isoforms in a single blastomere, at the same time
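Steps 1 and 2 of the junction search can be sketched as below. This is an illustrative reconstruction under stated assumptions, not the authors' pipeline: the function name `junction_library`, the input layout (exon sequences per gene, listed in 5' to 3' order), and the restriction to ordered exon pairs within one gene are all hypothetical choices.

```python
from itertools import combinations

FLANK = 42  # bases taken from each exon, giving 84-bp junction sequences


def junction_library(exons_by_gene, known_junctions=frozenset(), flank=FLANK):
    """Build candidate exon-exon junction sequences.

    exons_by_gene: dict mapping gene -> list of exon sequences in 5'->3' order.
    known_junctions: set of (gene, upstream_index, downstream_index) tuples
                     to leave out (step 2: remove known junctions).
    Returns a dict mapping (gene, i, j) -> junction sequence.
    Exons shorter than `flank` contribute their full length, so those
    junction sequences are shorter than 2 * flank.
    """
    library = {}
    for gene, exons in exons_by_gene.items():
        for i, j in combinations(range(len(exons)), 2):
            if (gene, i, j) in known_junctions:
                continue
            library[(gene, i, j)] = exons[i][-flank:] + exons[j][:flank]
    return library


# Toy gene with three 50-bp exons; the exon 0 / exon 1 junction is known.
toy_exons = {"toyGene": ["A" * 50, "C" * 50, "G" * 50]}
toy_known = {("toyGene", 0, 1)}
novel = junction_library(toy_exons, toy_known)
for key, seq in novel.items():
    print(key, len(seq))
```

Step 3 would then align the RNA-seq reads against the sequences in `novel` and count the reads supporting each candidate junction, keeping those above a threshold such as the 2 or 5 reads quoted in the results.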

New splice isoforms identified by mRNA-seq

- Two separately processed single wild-type mature oocytes showed very similar transcriptome profiles. Same results for Dicer1-/- oocytes

- Differences between Ago2-/- and WT were clearly smaller than those between Dicer1-/- and WT; this is consistent with the fact that the Ago2-/- oocyte phenotype is similar to, but milder than, that of Dicer1-/-

RNA-seq to dissect functional differences: Dicer1-/- vs WT

- Single-exon resolution of RNA-seq with low or even no background: in Dicer1-/- oocytes, exon 23 is deleted by loxP-directed Cre recombination. Result confirmed by TaqMan assay.

- Abnormal upregulation was detected for three genes, Ccne1, Dppa5 and Klf2, and confirmed by RT-PCR. They may contribute to the compromised developmental potential of Dicer1-/- and Ago2-/- oocytes


Overall results

Genes | Dicer1-/- | Ago2-/- | In common
Upregulated | 1696 | 1553 | 619
Downregulated | 1571 | 1121 | 589

Core candidates to dissect the function of microRNAs and endogenous small interfering RNAs involved in oogenesis
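The idea of "core candidates" is a set intersection between the two knockout gene lists. The sketch below shows that operation on placeholder gene identifiers (hypothetical, not real results) and then works out what share of each knockout's upregulated genes the 619 shared genes represent, using only the counts quoted in the slides.

```python
def shared_genes(genes_a, genes_b):
    """Genes deregulated in both knockouts (set intersection)."""
    return set(genes_a) & set(genes_b)


def percent_shared(n_shared, n_total):
    """Share of one knockout's gene list that is also in the other, in %."""
    return 100.0 * n_shared / n_total


# Placeholder identifiers only, to show the operation.
example_dicer1 = {"gene1", "gene2", "gene3", "gene4"}
example_ago2 = {"gene3", "gene4", "gene5"}
print(sorted(shared_genes(example_dicer1, example_ago2)))

# Counts of upregulated genes taken from the slides.
UP_DICER1, UP_AGO2, UP_SHARED = 1696, 1553, 619
print(round(percent_shared(UP_SHARED, UP_DICER1), 1))  # 36.5
print(round(percent_shared(UP_SHARED, UP_AGO2), 1))    # 39.9
```

So roughly 36% of the genes upregulated in Dicer1-/- oocytes and roughly 40% of those upregulated in Ago2-/- oocytes fall in the shared set.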

Conclusions

- mRNA-seq on a single mouse blastomere > small amount of starting material
- 7% > 64% of full-length cDNAs captured
- They detected the expression of 75% (5270) more genes than microarray techniques and identified 1753 previously unknown splice junctions
- 8-19% of the genes with multiple known transcript isoforms expressed at least two isoforms in the same blastomere or oocyte
- Dicer1-/- and Ago2-/- oocytes show 1696 and 1553 genes, respectively, to be upregulated compared to wild-type controls, with 619 genes in common

Limitations
- Only poly(A) mRNAs are captured (e.g. histone mRNA is not detected)
- For mRNAs longer than 3 kb, the 5’ end will not be characterized
- The assay uses double-stranded cDNAs and so cannot discriminate between sense and antisense transcripts

- cDNA synthesis introduces multiple biases:
  - Erases RNA strand information
  - Spurious second-strand cDNA artefacts can be introduced, owing to the DNA-dependent DNA polymerase (DDDP) activities
  - Artefactual cDNAs due to template switching
  - The enzyme is error-prone and inefficient

Direct single molecule RNA sequencing without prior conversion of RNA to cDNA >> it captures all RNAs

Sequencing was performed on poly(A)+ RNA from S. cerevisiae

Helicos/tSMS

(DRS)

PAPI enzyme adds ~150 bp to the 3’ end

Pilot experiment with oligoribonucleotides
- 48.5% of aligned reads have a sequence length of at least 20 nucleotides (nt)
- 38 nt is the longest read with no errors
- Errors: 4%
  - 2-3% missing base errors
  - 1-2% insertion rate
  - 0.1%-0.3% substitution errors

Poly(A)+ S. cerevisiae RNA (Clontech)
- Femtomole quantities of RNA needed
- 120 cycles in 3 days
- 41261 reads of > 20 nt, average of 28 nt
- 50 nt is the longest read
- 19501 reads (48.4%) aligned to the yeast genome using the BLAT algorithm

DRS sequencing read-length statistics

- Of the aligned reads, 91% were within 400 nt downstream of annotated yeast gene 3’ ORF ends
- Most of the reads were in close proximity to EST 3’ ends
- ~2% of the total reads were from ribosomal RNAs and small nucleolar RNAs, indicating that at least a fraction of those can be polyadenylated post-transcriptionally.
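A statistic such as "91% of aligned reads within 400 nt downstream of annotated 3’ ORF ends" can be computed along the following lines. This is a simplified sketch: the tuple layout for reads and ORF ends, the function names, and the use of a single coordinate per read are assumptions for illustration, and the toy coordinates are invented.

```python
def is_near_3prime_end(read, orf_ends, window=400):
    """True if the read lies 0..window nt downstream of an ORF 3' end.

    read: (chromosome, strand, position) of the aligned read.
    orf_ends: iterable of (chromosome, strand, end_position).
    "Downstream" follows the strand: higher coordinates on "+",
    lower coordinates on "-".
    """
    chrom, strand, pos = read
    for end_chrom, end_strand, end_pos in orf_ends:
        if chrom != end_chrom or strand != end_strand:
            continue
        offset = pos - end_pos if strand == "+" else end_pos - pos
        if 0 <= offset <= window:
            return True
    return False


def fraction_near_3prime_end(reads, orf_ends, window=400):
    """Fraction of reads within `window` nt downstream of any ORF 3' end."""
    if not reads:
        return 0.0
    hits = sum(is_near_3prime_end(r, orf_ends, window) for r in reads)
    return hits / len(reads)


# Invented toy annotation and reads.
toy_orf_ends = [("chrI", "+", 1000), ("chrI", "-", 5000)]
toy_reads = [
    ("chrI", "+", 1200),  # 200 nt downstream on the + strand
    ("chrI", "+", 1500),  # 500 nt downstream, outside the window
    ("chrI", "-", 4800),  # 200 nt downstream on the - strand
    ("chrI", "-", 5200),  # upstream of the ORF end on the - strand
]
print(fraction_near_3prime_end(toy_reads, toy_orf_ends))  # 0.5
```

The linear scan over ORF ends is fine for a toy example; for a whole genome one would likely sort the ends per chromosome and strand and use a binary search instead.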

DRS sequencing read-length statistics

The emerging discoveries on the link between polyadenylation and disease states (oculopharyngeal muscular dystrophy, thalassemias, thrombophilia, and IPEX syndrome) underline the need to fully characterize genome-wide polyadenylation states. Here, we report comprehensive maps of global polyadenylation events in human and yeast generated using refinements to the Direct RNA Sequencing technology. This direct approach provides a quantitative view of genome-wide polyadenylation states in a strand-specific manner and requires only attomole RNA quantities. The polyadenylation profiles revealed an abundance of unannotated polyadenylation sites, alternative polyadenylation patterns, and regulatory element-associated poly(A)+ RNAs. We observed differences in sequence composition surrounding canonical and noncanonical human polyadenylation sites, suggesting novel noncoding RNA-specific polyadenylation mechanisms in humans. Furthermore, we observed the correlation level between sense and antisense transcripts to depend on gene expression levels, supporting the view that overlapping transcription from opposite strands may play a regulatory role. Our data provide a comprehensive view of the polyadenylation state and overlapping transcription.

- Requirement of minor RNA quantities
- No biases due to cDNA synthesis, end repair, ligation and amplification procedures
- Potentially useful to study short RNA species

1. Generation of a complete catalogue of transcripts that are derived from genomes ranging from those of simple unicellular organisms to complex mammalian cells, normal or disease tissues, single cells and formalin-fixed paraffin-embedded tissues
2. Generation of complex biological networks in a wide range of biological specimens
3. Use of these networks to fully understand the biological pathways that are active in various physiological conditions

Immediate application in clinical diagnostics: analyses of extracellular nucleic acids (e.g. fetal RNA) and cells (e.g. circulating tumor cells)

Conclusions

Future perspective

Date post:	11-May-2015
Category:	Technology
Upload:	paolo-dametto