Journal Club
A single-molecule long-read survey of the human transcriptome
Sharon et al., Nature Biotechnology 31, 1009–1014 (2013)
Sanzhen Liu, Plant Pathology
3/12/2014
PacBio technology
• Amplification-free sequencing
• Very long reads (up to 20 kb, peaking at 2-6 kb)
• High error rate (errors are random, not context-specific)
PacBio website
CCS approach
• High-quality, single-molecule, circular-consensus (CCS) reads
http://flxlexblog.wordpress.com/2013/02/11/applications-for-pacbio-circular-consensus-sequencing/
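Why repeated passes help when errors are random can be illustrated with a small calculation. This is a minimal sketch, not PacBio's consensus algorithm: it assumes each pass miscalls a base independently with probability p, that the consensus is a simple majority vote, and (pessimistically) that all wrong calls agree with each other. The function name and the 15% per-pass error rate are illustrative assumptions.

```python
from math import comb


def consensus_error(p: float, n_passes: int) -> float:
    """Upper bound on the per-base error of a majority-vote consensus.

    Assumes independent passes with per-pass error probability p
    (a simplification; real CCS uses a more elaborate model).
    """
    if n_passes % 2 == 0:
        raise ValueError("use an odd number of passes to avoid ties")
    majority = n_passes // 2 + 1
    # The consensus is wrong only if a majority of passes are wrong.
    return sum(
        comb(n_passes, k) * p**k * (1 - p) ** (n_passes - k)
        for k in range(majority, n_passes + 1)
    )


if __name__ == "__main__":
    # 0.15 is an assumed per-pass error rate, chosen only for illustration.
    for n in (1, 3, 5, 7, 9):
        print(n, f"{consensus_error(0.15, n):.4f}")
```

Under these assumptions the error falls quickly as passes are added, which is why short inserts (many passes per molecule) give the most accurate CCS reads.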
Figure 1
• Input: pooled RNAs from 20 tissues
• Approach: prepare double-stranded cDNAs -> CCS library -> PacBio sequencing
• Output: 476,000 CCS reads, mean length = 1 kb
• 61% of reads cover all introns and most of the first and last exons
• CCS reads cover short transcripts (<1.2 kb) well (generally >90%), but coverage stays low for long transcripts, especially those >2.4 kb
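The coverage-by-length comparison can be sketched as follows. This is a hypothetical illustration, not the paper's pipeline: the input format (pairs of transcript length and covered bases) and the function names are assumptions, and only the 1.2 kb and 2.4 kb cut-offs come from the slide.

```python
from collections import defaultdict


def length_bin(length_bp: int) -> str:
    """Assign a transcript to a length class using the slide's cut-offs."""
    if length_bp < 1200:
        return "<1.2 kb"
    if length_bp <= 2400:
        return "1.2-2.4 kb"
    return ">2.4 kb"


def mean_coverage_by_bin(records):
    """records: iterable of (transcript_length_bp, covered_bases) pairs.

    Returns the mean fraction of each transcript covered by its read,
    grouped by length class.
    """
    fractions = defaultdict(list)
    for transcript_len, covered in records:
        fractions[length_bin(transcript_len)].append(covered / transcript_len)
    return {b: sum(v) / len(v) for b, v in fractions.items()}


if __name__ == "__main__":
    # Toy values, invented purely to exercise the function.
    toy = [(1000, 950), (1000, 900), (3000, 1500)]
    print(mean_coverage_by_bin(toy))
```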
Figure 2
Missing 3’ ends
Missing 5’ ends
The correlations of the number of reads and …
ERCC: spike-in mixture of known, quantified RNAs
Figure 3
• An estimated 67% of molecules contain splice sites
• CSMM: consensus split-mapped molecule (accurate CCS reads with splice sites?)
• Detected splice sites match annotated splice sites well
• PacBio (versus 454) has much higher power to detect isoforms with >=10 introns
• Estimate: 21,000 genes and 139,000 isoforms can be detected with high-depth sequencing
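Comparing a read's splice sites against an annotation can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' method: introns are represented as (chromosome, start, end) tuples, the annotation is a dictionary of hypothetical transcript IDs mapped to intron chains, and the category names are invented for this example.

```python
def classify_read(read_introns, annotated_chains):
    """Classify a spliced read by comparing its introns to an annotation.

    read_introns: ordered list of (chrom, start, end) tuples from one read.
    annotated_chains: dict mapping transcript ID -> tuple of intron tuples.
    """
    chain = tuple(read_introns)
    if not chain:
        return "unspliced"
    # The read reproduces the complete intron chain of an annotated transcript.
    if chain in set(annotated_chains.values()):
        return "annotated_chain"
    # Every intron is known, but this combination is not annotated as a whole.
    known_introns = {i for c in annotated_chains.values() for i in c}
    if all(i in known_introns for i in chain):
        return "novel_combination"
    return "novel_intron"


if __name__ == "__main__":
    # Hypothetical annotation with two transcripts of one gene.
    annotation = {
        "tx1": (("chr1", 100, 200), ("chr1", 300, 400)),
        "tx2": (("chr1", 100, 200), ("chr1", 500, 600)),
    }
    print(classify_read([("chr1", 100, 200), ("chr1", 300, 400)], annotation))
    print(classify_read([("chr1", 100, 250)], annotation))
```

Note that a 5'-truncated read carrying only part of an annotated chain is reported here as a novel combination; a real analysis would need to allow for partial chains.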
Summary
• Full-length RNA molecules of up to 1.5 kb can readily be monitored with little sequence loss at the 5’ ends
• With 476k CCS reads (>300 bp), 14,000 spliced genes were identified.
• The majority of introns are consistent with annotations, but >10% are novel.
Conclusion
• Isoforms can be monitored at a single-molecule level without amplification or fragmentation
• The majority of reads represent all splice sites of the original transcripts
• Unannotated splice isoforms: long non-coding RNAs with few introns and isoforms of known protein-coding genes with many introns