Post on 15-Feb-2017
Sean Davis, M.D., Ph.D.
Genetics Branch, Center for Cancer Research
National Cancer Institute
National Institutes of Health
RNA-seq: A High-Resolution View of the Transcriptome
Normal Karyotype
Tumor Karyotype
The Central Dogma
phenotype
Gene Copy Number
Sequence Variation
Chromatin Structure and Function
Gene Expression
Transcriptional Regulation
DNA Methylation
Patient and Population Characteristics
+
=
Your Nature Paper
High-Throughput Sequencing (AKA NGS)
[Figure: Illumina sequencing workflow. Sample preparation from DNA (0.1-1.0 µg), single-molecule array, cluster growth, sequencing cycles 1-9, image acquisition, and base calling (example read: T G C T A C G A T …).]
Illumina SBS Technology: Reversible Terminator Chemistry Foundation
© Illumina, Inc.
http://www.illumina.com/technology/sequencing_technology.ilmn
http://seqanswers.com/forums/showthread.php?t=21
Single end vs paired end sequencing
Illumina Paired-end sequencing
Paired-end: useful for RRBS, essential for RNA-seq, not useful for ChIP-seq
What comes out of the machine: short reads in fastq format
@D3B4KKQ1_0166:8:1101:1960:2190#CGATGT/1
CTCCTGGAAAACGCTTTGGTAGATTTGGCCAGGAGCTTTCTTTTATGTAAATTG
+D3B4KKQ1_0166:8:1101:1960:2190#CGATGT/1
[^^cedeefee`cghhhfcRX`_gfghf^bZbecg^eeb[caef`ef^a_`eXa
@D3B4KKQ1_0166:8:1101:2154:2137#CGATGT/1
TCCANCCATGGCAAATTCCATGGCACCGTCAAGGCTGAGAACGGGAAGCTTGTC
+D3B4KKQ1_0166:8:1101:2154:2137#CGATGT/1
ab_eBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@D3B4KKQ1_0166:8:1101:2249:2171#CGATGT/1
TACAAGTGCAGCATCAAGGAGCGAATGCTCTACTCCAGCTGCAAGAGCCGCCTC
+D3B4KKQ1_0166:8:1101:2249:2171#CGATGT/1
_[_ceeec[^eeghdffffhh^efh_egfhfgeec_fbafhhhhd`caegfheh
@D3B4KKQ1_0166:8:1101:2043:2187#CGATGT/1
GAAGGAGAGAAGGGGAGGAGGGCGGGGGGCACCTACTACATCGCCCTCCACATC
+D3B4KKQ1_0166:8:1101:2043:2187#CGATGT/1
\^_accceg`gga`f[fgcb`Ucgfaa_LVV^[bbbbbRWW`W^Y[_[^bbbbb
@D3B4KKQ1_0166:8:1101:2188:2232#CGATGT/1
GTGGCCGATTCCTGAGCTGTGTTTGAGGAGAGGGCGGAGTGCCATCTGGGTAGC
+D3B4KKQ1_0166:8:1101:2188:2232#CGATGT/1
aa_eeeeegggggihhiiifgeghfeghbgcghifiidg^dbgggeeeee`dcd
@D3B4KKQ1_0166:8:1101:2358:2174#CGATGT/1
CTGACCTGGGTCCTGTGGTGCTCAGCCTTTTGAAGATGCCAGAAAAATACGTCG
+D3B4KKQ1_0166:8:1101:2358:2174#CGATGT/1
\^_cccccg^Y`ega`fg`ebegfhd^egghhghfffhghdhbfffhhhfgfcf
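Quality score (QS) to integer in R: as.integer(charToRaw('e')) - 33
(An offset of 33 applies to Phred+33 files. The reads shown here contain quality characters up to 'i', which indicates the older Illumina Phred+64 encoding; for those, subtract 64 instead.)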
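The one-line conversion above extends naturally to a whole quality string. The following is a minimal sketch, not a definitive implementation: `decode_quality` and `guess_offset` are hypothetical helper names, and the offset guess is only a rough heuristic for Illumina data.

```r
# Decode a FASTQ quality string into integer Phred scores.
# offset = 33 for Sanger / Illumina 1.8+ files,
# offset = 64 for older Illumina files such as the reads shown above.
decode_quality <- function(qual, offset = 33L) {
  as.integer(charToRaw(qual)) - offset
}

# Heuristic guess of the offset from the raw byte values (assumption:
# Illumina data). Bytes below 59 (';') only occur in Phred+33 files;
# bytes above 74 ('J') are not expected in Illumina Phred+33 files.
guess_offset <- function(qual) {
  bytes <- as.integer(charToRaw(qual))
  if (min(bytes) < 59L) return(33L)
  if (max(bytes) > 74L) return(64L)
  NA_integer_  # ambiguous: the string fits either encoding
}

qual <- "ab_eBBBBBBBBBB"  # start of the second read's quality line above
offset <- guess_offset(qual)
scores <- decode_quality(qual, offset)
print(offset)
print(scores)
```

In practice the guess should be made from many reads rather than one, since a single high-quality read can be ambiguous.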
Paired-end sequencing
s_8_1_sequence.txt.gz
@D3B4KKQ1_0166:8:1101:1960:2190#CGATGT/1
CTCCTGGAAAACGCTTTGGTAGATTTGGCCAGGAGCTTTCTTTTATGTAAATTG
+D3B4KKQ1_0166:8:1101:1960:2190#CGATGT/1
[^^cedeefee`cghhhfcRX`_gfghf^bZbecg^eeb[caef`ef^a_`eXa
@D3B4KKQ1_0166:8:1101:2154:2137#CGATGT/1
TCCANCCATGGCAAATTCCATGGCACCGTCAAGGCTGAGAACGGGAAGCTTGTC
+D3B4KKQ1_0166:8:1101:2154:2137#CGATGT/1
ab_eBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@D3B4KKQ1_0166:8:1101:2249:2171#CGATGT/1
TACAAGTGCAGCATCAAGGAGCGAATGCTCTACTCCAGCTGCAAGAGCCGCCTC
+D3B4KKQ1_0166:8:1101:2249:2171#CGATGT/1
_[_ceeec[^eeghdffffhh^efh_egfhfgeec_fbafhhhhd`caegfheh
@D3B4KKQ1_0166:8:1101:2043:2187#CGATGT/1
GAAGGAGAGAAGGGGAGGAGGGCGGGGGGCACCTACTACATCGCCCTCCACATC
+D3B4KKQ1_0166:8:1101:2043:2187#CGATGT/1
\^_accceg`gga`f[fgcb`Ucgfaa_LVV^[bbbbbRWW`W^Y[_[^bbbbb
@D3B4KKQ1_0166:8:1101:2188:2232#CGATGT/1
GTGGCCGATTCCTGAGCTGTGTTTGAGGAGAGGGCGGAGTGCCATCTGGGTAGC
+D3B4KKQ1_0166:8:1101:2188:2232#CGATGT/1
aa_eeeeegggggihhiiifgeghfeghbgcghifiidg^dbgggeeeee`dcd
…
s_8_2_sequence.txt.gz
@D3B4KKQ1_0166:8:1101:1960:2190#CGATGT/2
GGCATATTTAACAGCATTGAACAGAATTCTGTGTCCTGTAAAAAAATTAGCTTA
+D3B4KKQ1_0166:8:1101:1960:2190#CGATGT/2
a__aaa`ce`cgcffdf_acda^ea]befffbeged`g[a`e_caaac]cb`gb
@D3B4KKQ1_0166:8:1101:2154:2137#CGATGT/2
TTGAGGCTGTTGTCATACTTCTCATGGTTCACACCCATGACGAACATGGGGGCG
+D3B4KKQ1_0166:8:1101:2154:2137#CGATGT/2
a__eeeeeggegefhhhiiihhhhhiieghhhghhiiffhiififhhiihegic
@D3B4KKQ1_0166:8:1101:2249:2171#CGATGT/2
CGGGGTGCACCTCGTCGTAGAGGAACTCTGCCGTCAGCTCTGCCCCATCGCCAA
+D3B4KKQ1_0166:8:1101:2249:2171#CGATGT/2
^__ee__cge`cghghhfgddgfgi]ehhfffff^ec[beegidffhhfhadba
@D3B4KKQ1_0166:8:1101:2043:2187#CGATGT/2
CTTAGTCTCAGTTTTCCTCCAGCAGCCTGAGGAAACTCAAAGGCACAGTTCCCA
+D3B4KKQ1_0166:8:1101:2043:2187#CGATGT/2
_abeaaacg^g^eghhhhgafghhdfghfedeghfiiicfbgdHYagfeecggf
@D3B4KKQ1_0166:8:1101:2188:2232#CGATGT/2
TAGGCTCAAAGTCTAACGCCAATCCCGAACCTGGGCATCTGTACACACACACAC
+D3B4KKQ1_0166:8:1101:2188:2232#CGATGT/2
abbeceeegggcghiihiihhhhiifhiiiiihiiiiiiihegh`eggfebfhg
…
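In the two files above, mates appear in the same order and their names differ only in the trailing /1 or /2. A rough sketch of a sanity check on that pairing follows; `fragment_id` and `mates_in_sync` are hypothetical helper names, and the header vectors stand in for the first line of each record as read from the two files.

```r
# Header lines as they would be read from each file
# (the first line of every 4-line FASTQ record).
r1 <- c("@D3B4KKQ1_0166:8:1101:1960:2190#CGATGT/1",
        "@D3B4KKQ1_0166:8:1101:2154:2137#CGATGT/1")
r2 <- c("@D3B4KKQ1_0166:8:1101:1960:2190#CGATGT/2",
        "@D3B4KKQ1_0166:8:1101:2154:2137#CGATGT/2")

# Strip the trailing /1 or /2 so that both mates share one identifier.
fragment_id <- function(headers) sub("/[12]$", "", headers)

# TRUE only if both files hold the same fragments in the same order.
mates_in_sync <- function(h1, h2) {
  length(h1) == length(h2) && all(fragment_id(h1) == fragment_id(h2))
}

print(mates_in_sync(r1, r2))
print(mates_in_sync(r1, rev(r2)))
```

A check like this is worth running after any filtering step that handles the two files separately, because most aligners assume the mates are still in matching order.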
RNA-seq protocol schematic
Our First Experiment
Overview of BAC in the Genome
Sequencing a BAC
Sequence Coverage
Repeats
Repeats
Repeats are not created equal
Approaches to RNA-seq
Nature Biotech (2010) 28, 421-423
Alignment
RNA-seq Alignment
Run Time
Alignment Yield
Splice Read Placement Accuracy
Impact on Transcript Assembly
Transcript Quantification
Models for RNA-seq
• Count-based models
• Multi-reads (isoform resolution)
• Paired-end reads (include length resolution step)
• Positional bias along transcript length
• Sequence bias
Read Counting
Mortazavi, 2008, NMeth
L. Pachter (2011) arXiv:1104.3889v
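The Mortazavi et al. (2008) paper cited above introduced RPKM (reads per kilobase of transcript per million mapped reads) as a length- and depth-normalized count. Assuming that is the measure this slide refers to, a minimal sketch looks like this; the `rpkm` function name, the read counts, and the gene lengths are all made up for illustration.

```r
# RPKM = reads for the gene / (gene length in kb * total mapped reads in millions)
rpkm <- function(counts, gene_length_bp, total_mapped = sum(counts)) {
  counts / (gene_length_bp / 1e3) / (total_mapped / 1e6)
}

counts  <- c(geneA = 500, geneB = 500, geneC = 1000)    # made-up read counts
lengths <- c(geneA = 1000, geneB = 2000, geneC = 1000)  # made-up lengths in bp

# Assume 2 million mapped reads in the whole library.
print(rpkm(counts, lengths, total_mapped = 2e6))
```

Here geneA and geneB have the same raw count, but geneB is twice as long, so its RPKM is half that of geneA.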
Sequence Bias--priming
Hansen (2010), NAR
Sample-specific Sequence Bias
Models for RNA-seq
Result of Quantification
Clustering and Visualization
Hierarchical Clustering
[Figure: hierarchical clustering of eight genes (Gene 1 to Gene 8), built up step by step over four slides.]
Distance Metrics
Euclidean distance
Manhattan distance
Minkowski distance (generalized distance)
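The three metrics named above are related: the Minkowski distance of order p between vectors x and y is (sum over i of |x_i - y_i|^p)^(1/p), which reduces to the Manhattan distance at p = 1 and the Euclidean distance at p = 2. A minimal sketch in R, using made-up vectors and a hypothetical `minkowski` helper alongside base R's `dist()`:

```r
x <- c(1, 2, 3)
y <- c(4, 6, 3)

# Minkowski distance of order p, written out by hand.
minkowski <- function(a, b, p) sum(abs(a - b)^p)^(1 / p)

# The same distances from base R's dist(), which compares the rows of a matrix.
m <- rbind(x, y)
d_euclidean <- as.numeric(dist(m, method = "euclidean"))
d_manhattan <- as.numeric(dist(m, method = "manhattan"))
d_minkowski <- as.numeric(dist(m, method = "minkowski", p = 3))

print(c(euclidean = d_euclidean,
        manhattan = d_manhattan,
        minkowski_p3 = d_minkowski))
```

The element-wise differences here are 3, 4, and 0, so the Euclidean distance is 5 and the Manhattan distance is 7.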
Distance Metrics
• Correlation
– maximum value of 1 if X and Y are perfectly correlated
– minimum value of -1 if X and Y are exactly opposite
– d(X,Y) = 1 - r_xy
• Many, many others
• Choice of distance metric can be driven by underlying data (e.g., binary data, categorical data, outliers, etc.)
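Because the correlation ranges from -1 to 1, the distance d(X,Y) = 1 - r_xy ranges from 0 (perfectly correlated) to 2 (exactly opposite). A small sketch with simulated vectors follows; `cor_dist` is a hypothetical helper name and the data are random.

```r
set.seed(1)
x    <- rnorm(20)
up   <- 2 * x + 5   # perfectly correlated with x
down <- -x          # exactly opposite to x

# Correlation distance between two vectors: d(X, Y) = 1 - r_xy
cor_dist <- function(a, b) 1 - cor(a, b)

print(cor_dist(x, up))    # close to 0
print(cor_dist(x, down))  # close to 2

# For a matrix, cor() correlates columns, so transpose to compare rows (genes).
genes <- rbind(x = x, up = up, down = down)
d <- as.dist(1 - cor(t(genes)))
print(round(as.matrix(d), 3))
```

Note that `up` is at distance 0 from `x` even though its values are shifted and scaled, which is why correlation distance groups genes by expression pattern rather than by magnitude.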
Example of Distance Metric Choice
Example
• dat = matrix(rnorm(10000),ncol=20)
• dat[1:100,1:10] = dat[1:100,1:10]+1
• hclust
• dist
• as.dist(1-cor)
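The fragments on this slide can be assembled into a runnable script. The version below is one possible reading, not necessarily what was demonstrated live: it assumes rows are genes and columns are samples, clusters the samples, and adds a seed and a `cutree()` comparison that are not on the slide.

```r
set.seed(42)                              # assumption: seed added for reproducibility
dat <- matrix(rnorm(10000), ncol = 20)    # 500 "genes" (rows) x 20 "samples" (columns)
dat[1:100, 1:10] <- dat[1:100, 1:10] + 1  # shift the first 100 genes up in samples 1-10

# Cluster the samples (columns). dist() compares rows, so transpose first;
# cor() already compares columns, so no transpose is needed there.
hc_euclid <- hclust(dist(t(dat)))           # Euclidean distance between samples
hc_cor    <- hclust(as.dist(1 - cor(dat)))  # correlation distance between samples

# Cut each tree into two groups and compare with the known split (1-10 vs 11-20).
truth <- rep(1:2, each = 10)
print(table(truth, euclidean = cutree(hc_euclid, k = 2)))
print(table(truth, correlation = cutree(hc_cor, k = 2)))
```

Calling `plot(hc_euclid)` and `plot(hc_cor)` draws the two dendrograms, which makes it easy to see how the choice of distance metric changes the tree.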
Differential Expression
MA Plot
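An MA plot shows, for each gene, the log2 fold change between two conditions (M) against the average log2 expression (A). A minimal sketch of how the two axes are computed, using made-up expression values for five genes:

```r
# Made-up normalized expression values for five genes in two conditions.
cond1 <- c(10, 200, 50, 1000, 8)
cond2 <- c(20, 200, 25, 1100, 64)

# M: log2 fold change between conditions; A: average log2 expression.
M <- log2(cond2) - log2(cond1)
A <- (log2(cond2) + log2(cond1)) / 2

print(data.frame(A = round(A, 2), M = round(M, 2)))
```

Plotting these with `plot(A, M)` followed by `abline(h = 0)` gives the usual picture: genes that do not change sit near the horizontal line, and differentially expressed genes fall above or below it.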
DE False Positive Rates
DE Evaluation
DE Software Runtime
RNA-seq workflow as proposed by Anders et al. in Nature Protocols
MA Plot
Fusion Gene Detection
Fusion gene schematic
Fusion Detection
False Positive Fusion Detection
Experimental Design
• What are my goals?
– Differential expression?
– Transcriptome assembly?
– Identify rare, novel transcripts?
• System characteristics?
– Large, expanded genome?
– Complex intron/exon structures?
– No reference genome or transcriptome?
Experimental Design
• Technical replicates
– Probably not needed due to low technical variation
• Biological replicates
– Not explicitly needed for transcript assembly
– Essential for differential expression analysis
– Number of replicates often driven by sample availability for human studies
– More is almost always better
Links of Interest
• http://bioconductor.org
• http://biostars.org
• http://www.rna-seqblog.com/
• https://genome.ucsc.edu/ENCODE/
• http://www.ncbi.nlm.nih.gov/gds/
Visualizing Splicing