ChIP-seq MBD-seq (MIRA-seq) BS-seq RNA-seq miRNA-seq
ChIP-Seq is a new frontier technology to analyze in vivo protein-DNA interactions.
ChIP-Seq
◦ Combination of chromatin immunoprecipitation (ChIP) with ultra-high-throughput massively parallel sequencing
◦ Allows mapping of protein–DNA interactions in vivo on a genome scale
Mardis, E.R. Nat. Methods 4, 613-614 (2007)
Workflow of ChIP-Seq
Current microarray and ChIP-chip designs require prior knowledge of the sequence of interest, such as a promoter, enhancer, or RNA-coding domain.
Lower cost
Higher resolution
Higher accuracy
Alterations in transcription-factor binding in response to environmental stimuli can be evaluated for the entire genome in a single experiment.
Solexa (Illumina)
◦ 1 GB of sequences in a single run
◦ 35 bases in length
454 Life Sciences (Roche Diagnostics)
◦ 25-50 MB of sequences in a single run
◦ Up to 500 bases in length
SOLiD (Applied Biosystems)
◦ 6 GB of sequences in a single run
◦ 35 bases in length
8 lanes, 100 tiles per lane
Sequence Files
Quality Scores
10-40 million reads per lane
~500 MB files
Quality scores describe the confidence of bases in each read
The Solexa pipeline assigns a quality score to each of the four possible nucleotides for each sequenced base
9 million sequences (500 MB file): ~6.5 GB quality score file
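To make "confidence of a base" concrete, here is a minimal sketch that converts a quality score into an error probability. It assumes the log-odds definition commonly attributed to early Solexa pipelines, Q = -10 log10(p / (1 - p)), and shows the Phred definition, Q = -10 log10(p), for comparison. The function names are hypothetical, and the slides do not state which definition their pipeline version used.

```python
def solexa_to_error_prob(q):
    """Error probability implied by a log-odds (Solexa-style) score.

    Assumes Q = -10 * log10(p / (1 - p)), solved here for p.
    """
    return 1.0 / (1.0 + 10.0 ** (q / 10.0))


def phred_to_error_prob(q):
    """Error probability implied by a Phred-style score, Q = -10 * log10(p)."""
    return 10.0 ** (-q / 10.0)


if __name__ == "__main__":
    for q in (-5, 0, 10, 20, 40):
        print(q, solexa_to_error_prob(q), phred_to_error_prob(q))
```

Under these definitions the two scales nearly coincide for high scores and diverge for low ones, which is why the encoding has to be known before a quality cutoff is applied.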
Rapid mapping of these short sequence reads to the reference genome
Visualize mapping results
◦ Thousands of enriched regions
Peak analysis
◦ Peak detection
◦ Finding exact binding sites
Compare results of different experiments
◦ Normalization
◦ Statistical tests
Mapping Methods
◦ Need to allow mismatches and gaps (SNP locations, sequencing errors, reading errors)
◦ Indexing and hashing the genome or the oligonucleotide reads
◦ Use of quality scores
◦ Use of SNP knowledge
◦ Performance
◦ Partitioning the genome or sequence reads
Fast sequence similarity search algorithms (like BLAST)
◦ Not specifically designed for mapping millions of query sequences
◦ Take a very long time, e.g. 2 days to map half a million sequences to a 70 MB reference genome (using BLAST)
◦ Indexing the genome is memory expensive
Both reads and reference genome are converted to numeric data type using 2-bits-per-base coding
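A minimal sketch of 2-bits-per-base coding follows. The code table (A=0, C=1, G=2, T=3) and the function names are assumptions for illustration; the slides do not say which code each base receives, and ambiguous bases such as N are not handled here.

```python
# Hypothetical code table: the source does not specify the assignment.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def encode(seq):
    """Pack a DNA string into one integer, 2 bits per base.

    The first base ends up in the highest-order bits.
    """
    value = 0
    for base in seq.upper():
        value = (value << 2) | CODE[base]
    return value


def decode(value, length):
    """Unpack `length` bases from an integer produced by encode()."""
    out = []
    for shift in range(2 * (length - 1), -2, -2):
        out.append(BASES[(value >> shift) & 0b11])
    return "".join(out)


if __name__ == "__main__":
    packed = encode("ACGT")
    print(packed, decode(packed, 4))
```

The length has to be stored alongside the packed value, because leading A bases encode as zero bits and would otherwise be lost.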
Load reference genome into memory
◦ For the human genome, 14 GB RAM is required for storing reference sequences and index tables
300 (gapped) to 1200 (ungapped) times faster than BLAST
Allows 2 mismatches or a 1-3 bp continuous gap
Errors accumulate during the sequencing process
◦ Much higher number of sequencing errors at the 3’-end (which sometimes makes the reads unalignable to the reference genome)
◦ Iteratively trim several basepairs at the 3’-end and redo the alignment
◦ Improve sensitivity
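The iterative 3’-end trimming described above can be sketched as follows. This is an illustration only: `find_exact` is a toy stand-in for a real mapper, and the `step` and `min_len` parameters are hypothetical values that the slides do not specify.

```python
def find_exact(read, genome):
    """Toy aligner: position of the first exact match, or -1.

    A stand-in for a real short-read mapper.
    """
    return genome.find(read)


def map_with_trimming(read, genome, step=2, min_len=12):
    """Retry alignment after trimming `step` bases from the 3' end.

    Returns (position, aligned_length) for the first successful round,
    or None if the read drops below `min_len` without aligning.
    """
    current = read
    while len(current) >= min_len:
        pos = find_exact(current, genome)
        if pos != -1:
            return pos, len(current)
        current = current[:-step]  # drop the error-prone 3' bases
    return None


if __name__ == "__main__":
    genome = "TTGACCGATTACAGGCTAACGTAGCTAGGATCC"
    read = genome[4:21] + "CCC"  # last three bases simulate 3' errors
    print(map_with_trimming(read, genome))
```

Shorter reads are more likely to match at several locations, so a lower bound on the trimmed length is needed to keep the gain in sensitivity from costing specificity.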
ELAND (Cox, unpublished)
◦ “Efficient Large-Scale Alignment of Nucleotide Databases” (Solexa Ltd.)
SeqMap (Jiang, 2008)
◦ “Mapping massive amount of oligonucleotides to the genome”
RMAP (Smith, 2008)
◦ “Using quality scores and longer reads improves accuracy of Solexa read mapping”
MAQ (Li, 2008)
◦ “Mapping short DNA sequencing reads and calling variants using mapping quality scores”
Partition reads into 4 seeds {A,B,C,D}
◦ At least 2 seeds must match with no mismatches
Scan the genome to identify locations where the seeds match exactly
◦ 6 possible combinations of the seeds to search: {AB, CD, AC, BD, AD, BC}
◦ 6 scans to find all candidates
Do approximate matching around the exactly-matching seeds
◦ Determine all targets for the reads
◦ Insertions/deletions can be incorporated
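The seed strategy above can be sketched in a few lines. The sketch assumes a limit of 2 substitutions per read: with 4 seeds, at most 2 of them can contain a mismatch, so at least 2 seeds must match exactly. The function names are hypothetical, gaps are not handled, and a real mapper would hash the seeds instead of rescanning the genome naively.

```python
from itertools import combinations


def split_seeds(read, n_seeds=4):
    """Split a read into contiguous seeds, returned as (offset, seed) pairs.

    The last seed absorbs any remainder.
    """
    size = len(read) // n_seeds
    bounds = [i * size for i in range(n_seeds)] + [len(read)]
    return [(bounds[i], read[bounds[i]:bounds[i + 1]]) for i in range(n_seeds)]


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def map_read(read, genome, max_mismatches=2):
    """Sorted genome positions where the read aligns with few substitutions."""
    seeds = split_seeds(read)
    hits = set()
    # Six seed pairs (AB, AC, AD, BC, BD, CD): one genome scan per pair.
    for (off1, s1), (off2, s2) in combinations(seeds, 2):
        for pos in range(len(genome) - len(read) + 1):
            # Both seeds matching exactly marks a candidate location.
            if (genome.startswith(s1, pos + off1)
                    and genome.startswith(s2, pos + off2)):
                # Approximate matching: verify the whole read.
                window = genome[pos:pos + len(read)]
                if hamming(read, window) <= max_mismatches:
                    hits.add(pos)
    return sorted(hits)


if __name__ == "__main__":
    genome = "ACGTTGCAAGCTTAGGCTAACCGTATCGGATCCATGCAAT"
    print(map_read(genome[8:28], genome))
```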
The reads are indexed and hashed before scanning genome
Bit operations are used to accelerate mapping
◦ Each nt is encoded into 2 bits
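One way bit operations can speed up comparison on 2-bit-encoded sequences is to XOR two packed values and count the non-zero 2-bit fields. This is a sketch under assumptions: the code table and function names are hypothetical, and the slides do not describe which bit tricks the actual program uses.

```python
# Hypothetical code table: the source does not specify the assignment.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(seq):
    """Pack a DNA string into an integer, 2 bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


def count_mismatches(a, b, length):
    """Count differing bases between two packed sequences of `length` >= 1 bases."""
    diff = a ^ b                      # non-zero 2-bit field wherever bases differ
    low_bits = int("01" * length, 2)  # mask selecting the low bit of every field
    # Fold each field's high bit onto its low bit, then keep only low bits.
    collapsed = (diff | (diff >> 1)) & low_bits
    return bin(collapsed).count("1")


if __name__ == "__main__":
    print(count_mismatches(encode("ACGT"), encode("AGGA"), 4))
```

All positions are compared in a handful of integer operations rather than one character comparison per base.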
Commercial sequence mapping program that comes with the Solexa machine
Allows at most 2 mismatches
Maps sequences up to 32 nt in length
All sequences have to be the same length
Improve mapping accuracy
◦ Possible sequencing errors at 3’-ends of longer reads
◦ Base-call quality scores
Use of base-call quality scores
◦ Quality cutoff
High quality positions are checked for mismatches
Low quality positions always induce a match
◦ Quality control step eliminates reads with too many low quality positions
Allow any number of mismatches
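The quality-cutoff rule above can be sketched as follows: only positions at or above the cutoff are checked for mismatches, and a separate control step rejects reads with too many low-quality positions. The `cutoff` and `max_low` values and both function names are hypothetical; the slides give no concrete thresholds.

```python
def quality_aware_mismatches(read, quals, ref, cutoff=20):
    """Count mismatches at high-quality positions only.

    Positions with quality below `cutoff` always count as a match.
    """
    return sum(
        1
        for base, q, ref_base in zip(read, quals, ref)
        if q >= cutoff and base != ref_base
    )


def passes_quality_control(quals, cutoff=20, max_low=3):
    """Reject reads with more than `max_low` low-quality positions."""
    return sum(q < cutoff for q in quals) <= max_low


if __name__ == "__main__":
    read, ref = "ACGTAC", "ACCTAA"
    quals = [30, 30, 30, 30, 30, 5]
    if passes_quality_control(quals):
        print(quality_aware_mismatches(read, quals, ref))
```

The control step matters because a read made up mostly of low-quality positions would match almost anywhere under this rule.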
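[Figure: read-mapping flow chart. Reads pass a quality filter and are mapped to the reference genome, then fall into four categories: mapped to a unique location, mapped to multiple locations, no mapping, and low quality. Read counts shown in the figure: 12 M, 7.2 M, 3 M, 2.5 M, 1.8 M, 0.5 M.]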
BED files are built to summarize mapping results
BED files can be easily visualized in the UCSC Genome Browser
http://genome.ucsc.edu
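A minimal sketch of turning mapped reads into a BED file follows. It uses the standard six-column BED layout (chrom, 0-based start, exclusive end, name, score, strand). The function names, the tuple layout of a hit, and the track name are assumptions for illustration.

```python
def to_bed_line(chrom, start, read_len, name, strand, score=0):
    """Format one mapped read as a 6-column BED line.

    BED coordinates are 0-based with an exclusive end.
    """
    end = start + read_len
    return "\t".join([chrom, str(start), str(end), name, str(score), strand])


def write_bed(hits, path, track_name="mapped_reads"):
    """Write hits, given as (chrom, start, read_len, name, strand) tuples."""
    with open(path, "w") as handle:
        handle.write(f'track name="{track_name}"\n')
        for hit in hits:
            handle.write(to_bed_line(*hit) + "\n")


if __name__ == "__main__":
    hits = [("chr1", 100, 35, "read1", "+"), ("chr2", 5000, 35, "read2", "-")]
    write_bed(hits, "mapped_reads.bed")
```

The resulting file can be uploaded as a custom track for visualization.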
Robertson, G. et al. Nat. Methods 4, 651-657 (2007)
Mikkelsen,T.S. et al. Nature 448, 553-562 (2007)
300 kb region from mouse ES cells
Frietze et al JBC 2010
SISSRs (Site Identification from Short Sequence Reads): Jothi et al. NAR, 2008.
MACS (Model-based Analysis of ChIP-Seq): Zhang et al. Genome Biology, 2008.
QuEST (Genome-wide analysis of transcription factor binding sites based on ChIP–seq data): Valouev, A. et al. Nature Methods, 2008.
PeakSeq (PeakSeq enables systematic scoring of ChIP–seq experiments relative to controls): Rozowsky, J. et al. Nature Biotech. 2009.
FindPeaks (FindPeaks 3.1: a tool for identifying areas of enrichment from massively parallel short-read sequencing technology): Fejes, A.P. et al. Bioinformatics, 2008.
Hpeak (An HMM-based algorithm for defining read-enriched regions from massive parallel sequencing data): Xu et al. Bioinformatics, 2008.
The methyl-CpG binding domain-based capture (MBDCap) technology is used to capture methylated sites. Double-stranded methylated DNA fragments can be detected, and the method is sensitive to different methylation densities.
Genome-wide sequencing technology was used to get the sequence of each short fragment.
Each sequenced read was mapped to the human genome to find its location.
Lan et al Unpublished
BS-seq: genomic DNA is treated with sodium bisulphite (BS), which converts cytosine, but not methylcytosine, to uracil, and is then subjected to high-throughput sequencing.
Truly single-base resolution
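The conversion rule above gives a simple per-base readout, sketched here for the forward strand only. Uracil is read as T after amplification and sequencing, so a reference C that still reads as C is called methylated and one that reads as T is called unmethylated. The function names are hypothetical, and real pipelines must also handle the reverse strand, incomplete conversion, and sequencing errors.

```python
def bisulfite_convert(seq, methylated_positions=()):
    """Simulate the sequenced readout of bisulphite-treated DNA.

    Unmethylated C is read as T; methylated C stays C.
    """
    methylated = set(methylated_positions)
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(seq)
    )


def call_methylation(reference, read):
    """Map each reference C position to True (methylated) or False (converted)."""
    calls = {}
    for i, (ref_base, read_base) in enumerate(zip(reference, read)):
        if ref_base == "C" and read_base in "CT":
            calls[i] = read_base == "C"
    return calls


if __name__ == "__main__":
    reference = "ACGTCCGA"
    read = bisulfite_convert(reference, methylated_positions=[4])
    print(read, call_methylation(reference, read))
```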
RNA-Seq is a new approach to transcriptome profiling that uses deep-sequencing technologies.
Studies using this method have already altered our view of the extent and complexity of eukaryotic transcriptomes. RNA-Seq also provides a far more precise measurement of levels of transcripts and their isoforms than other methods.
RNA-seq protocol
Single base resolution
High throughput
Low background noise
Ability to distinguish different isoforms and allelic expression
Relatively low cost