ChIP-seq MBD-seq (MIRA-seq) BS-seq RNA-seq miRNA-seq

ChIP-Seq is a new frontier technology to analyze in vivo protein-DNA interactions.

ChIP-Seq
◦ Combination of chromatin immunoprecipitation (ChIP) with ultra-high-throughput massively parallel sequencing
◦ Allows mapping of protein–DNA interactions in vivo on a genome scale

Mardis, E.R. Nat. Methods 4, 613-614 (2007)

Workflow of ChIP-Seq

Current microarray and ChIP-chip designs require knowing the sequence of interest in advance, such as a promoter, enhancer, or RNA-coding domain.

Lower cost
Higher resolution
Higher accuracy

Alterations in transcription-factor binding in response to environmental stimuli can be evaluated for the entire genome in a single experiment.

Solexa (Illumina)
◦ 1 GB of sequences in a single run
◦ 35 bases in length

454 Life Sciences (Roche Diagnostics)
◦ 25-50 MB of sequences in a single run
◦ Up to 500 bases in length

SOLiD (Applied Biosystems)
◦ 6 GB of sequences in a single run
◦ 35 bases in length

8 lanes, 100 tiles per lane

Sequence Files, Quality Scores

10-40 million reads per lane

~500 MB files

Quality scores describe the confidence of bases in each read
Solexa pipeline assigns a quality score to the four possible nucleotides for each sequenced base
9 million sequences (500 MB file): ~6.5 GB quality score file
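A quality score can be read as an error probability. The sketch below assumes the standard Phred definition, Q = -10 log10(p), and the log-odds definition used by early Solexa pipelines, Q = -10 log10(p / (1 - p)). The function names are illustrative and do not come from the slides.

```python
def phred_to_error_prob(q: float) -> float:
    """Phred scale: Q = -10 * log10(p), so p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def solexa_to_error_prob(q: float) -> float:
    """Solexa log-odds scale: Q = -10 * log10(p / (1 - p))."""
    odds = 10 ** (-q / 10)       # p / (1 - p)
    return odds / (1 + odds)     # solve for p


# The two scales agree closely at high quality and diverge at low quality.
for q in (0, 10, 20, 30):
    print(q, phred_to_error_prob(q), solexa_to_error_prob(q))
```

On both scales a higher score means a more confident base call; only the low-quality end differs noticeably.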

Rapid mapping of these short sequence reads to the reference genome

Visualize mapping results
◦ Thousands of enriched regions

Peak analysis
◦ Peak detection
◦ Finding exact binding sites

Compare results of different experiments
◦ Normalization
◦ Statistical tests

Mapping Methods
◦ Need to allow mismatches and gaps: SNP locations, sequencing errors, reading errors
◦ Indexing and hashing: genome or oligonucleotide reads
Use of quality scores
Use of SNP knowledge
Performance
◦ Partitioning the genome or sequence reads

Fast sequence similarity search algorithms (like BLAST)
◦ Not specifically designed for mapping millions of query sequences
◦ Take a very long time
e.g. 2 days to map half a million sequences to a 70 MB reference genome (using BLAST)
◦ Indexing the genome is memory-expensive

Both reads and reference genome are converted to numeric data type using 2-bits-per-base coding

Load reference genome into memory
◦ For the human genome, 14 GB of RAM is required for storing reference sequences and index tables
300 (gapped) to 1200 (ungapped) times faster than BLAST
2 mismatches or 1-3 bp continuous gap
Errors accumulate during the sequencing process

◦ Much higher number of sequencing errors at the 3’-end (sometimes make the reads unalignable to the reference genome)

◦ Iteratively trim several basepairs at the 3’-end and redo the alignment

◦ Improve sensitivity
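The iterative 3'-end trimming described above can be sketched as follows. This is a minimal illustration, not the mapper's actual implementation: it uses exact string matching in place of a real mismatch-tolerant aligner, and `trim_step` and `min_len` are hypothetical parameters.

```python
def find_exact(read, genome):
    """Return all 0-based positions where read occurs exactly in genome."""
    hits, start = [], genome.find(read)
    while start != -1:
        hits.append(start)
        start = genome.find(read, start + 1)
    return hits


def align_with_trimming(read, genome, trim_step=2, min_len=12):
    """Try the full read first; on failure trim the 3' end and retry."""
    current = read
    while len(current) >= min_len:
        hits = find_exact(current, genome)
        if hits:
            return current, hits        # aligned (possibly trimmed) read
        current = current[:-trim_step]  # drop error-prone 3' bases
    return None, []                     # still unalignable


genome = "ACGTTGCATGCCGATAGCTAGGCTTAACG"
read = genome[5:21] + "AA"              # last two bases are sequencing errors
print(align_with_trimming(read, genome))
```

The full-length read fails to align, but the same read aligns once the erroneous 3' bases are removed, which is how trimming improves sensitivity.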

ELAND (Cox, unpublished)
◦ “Efficient Large-Scale Alignment of Nucleotide Databases” (Solexa Ltd.)

SeqMap (Jiang, 2008)
◦ “Mapping massive amount of oligonucleotides to the genome”

RMAP (Smith, 2008)
◦ “Using quality scores and longer reads improves accuracy of Solexa read mapping”

MAQ (Li, 2008)
◦ “Mapping short DNA sequencing reads and calling variants using mapping quality scores”

Partition reads into 4 seeds {A,B,C,D}
◦ At least 2 seeds must map with no mismatches

Scan genome to identify locations where the seeds match exactly
◦ 6 possible combinations of the seeds to search: {AB, CD, AC, BD, AD, BC}
◦ 6 scans to find all candidates

Do approximate matching around the exactly-matching seeds
◦ Determine all targets for the reads
◦ Ins/del can be incorporated
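The seed strategy rests on a pigeonhole argument: if a read has at most 2 mismatches and is split into 4 seeds, at least 2 seeds are mismatch-free, so one of the 6 seed pairs must match exactly. A minimal sketch under simplifying assumptions (plain string search instead of a hashed index, mismatches only, no ins/del; all names are illustrative):

```python
from itertools import combinations


def split_into_seeds(read, n_seeds=4):
    """Split a read into n_seeds contiguous parts; return (offset, seed) pairs."""
    base, extra = divmod(len(read), n_seeds)
    seeds, start = [], 0
    for i in range(n_seeds):
        end = start + base + (1 if i < extra else 0)
        seeds.append((start, read[start:end]))
        start = end
    return seeds


def candidate_positions(read, genome, n_seeds=4, max_mismatches=2):
    """Find read positions where a seed pair matches exactly, then verify."""
    seeds = split_into_seeds(read, n_seeds)
    hits = set()
    # 6 pairs for 4 seeds: AB, AC, AD, BC, BD, CD
    for (off_a, seed_a), (off_b, seed_b) in combinations(seeds, 2):
        pos = genome.find(seed_a)
        while pos != -1:
            start = pos - off_a  # implied start of the whole read
            if 0 <= start <= len(genome) - len(read):
                second = genome[start + off_b:start + off_b + len(seed_b)]
                if second == seed_b:
                    # approximate matching around the exactly-matching seeds
                    window = genome[start:start + len(read)]
                    if sum(x != y for x, y in zip(read, window)) <= max_mismatches:
                        hits.add(start)
            pos = genome.find(seed_a, pos + 1)
    return sorted(hits)
```

A read with mismatches in seeds A and C is still found through the exactly matching pair BD.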

The reads are indexed and hashed before scanning genome

Bit operations are used to accelerate mapping
◦ Each nt encoded into 2 bits
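To illustrate why 2-bit coding speeds up mapping, the sketch below packs a sequence into an integer and counts mismatches with XOR and a mask. The particular code assignment (A=00, C=01, G=10, T=11) and the function names are assumptions for illustration; the slides do not specify them.

```python
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}  # assumed assignment


def encode(seq):
    """Pack a DNA string into an int, 2 bits per base, first base most significant."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


def count_mismatches(a, b, length):
    """Count differing bases between two packed sequences of equal length."""
    diff = a ^ b                               # nonzero 2-bit group = mismatch
    low_bits = ((1 << (2 * length)) - 1) // 3  # mask 0b0101...01
    per_base = (diff | (diff >> 1)) & low_bits # one set bit per differing base
    return bin(per_base).count("1")


print(count_mismatches(encode("ACGT"), encode("AGGA"), 4))
```

One XOR compares every base of the read at once, instead of a character-by-character loop.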

Commercial sequence mapping program that comes with the Solexa machine

Allows at most 2 mismatches
Maps sequences up to 32 nt in length
All sequences have to be the same length

Improve mapping accuracy
◦ Possible sequencing errors at 3’-ends of longer reads
◦ Base-call quality scores

Use of base-call quality scores
◦ Quality cutoff
High quality positions are checked for mismatches
Low quality positions always induce a match
◦ Quality control step eliminates reads with too many low quality positions

Allow any number of mismatches
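The quality-cutoff rule can be sketched as follows: only positions at or above the cutoff are checked for mismatches, and reads with too many low-quality positions are discarded first. This is a hedged illustration of the idea, not the tool's actual code; the function names, the cutoff, and `max_low_quality` are hypothetical.

```python
def passes_qc(quals, cutoff, max_low_quality):
    """Quality control: reject reads with too many low-quality positions."""
    return sum(q < cutoff for q in quals) <= max_low_quality


def quality_aware_mismatches(read, quals, ref, cutoff):
    """Count mismatches at high-quality positions only.

    Positions below the cutoff always count as a match.
    """
    return sum(
        1
        for base, q, ref_base in zip(read, quals, ref)
        if q >= cutoff and base != ref_base
    )


read, ref = "ACGTAC", "ACCTAG"
quals = [30, 30, 5, 30, 30, 30]
print(passes_qc(quals, 10, 1), quality_aware_mismatches(read, quals, ref, 10))
```

Without the QC step, a read made entirely of low-quality positions would match everywhere, which is why the filter is applied before mapping.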

Figure: read-mapping flow chart with the steps “Quality filter” and “Map to reference genome” and the outcomes “Mapped to a unique location”, “Mapped to multiple locations”, “No mapping”, and “Low quality”. Read counts shown: 12 M, 3 M, 7.2 M, 1.8 M, 2.5 M, 0.5 M.

BED files are built to summarize mapping results

BED files can be easily visualized in Genome Browser

http://genome.ucsc.edu
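As a sketch of how mapping results can be summarized, the code below writes mapped reads as 6-column BED lines (chrom, start, end, name, score, strand), where the start is 0-based and the end is exclusive. The input tuple layout and the read names are assumptions made for illustration.

```python
def bed_line(chrom, start, end, name=".", score=0, strand="+"):
    """Format one interval as a tab-separated 6-column BED line."""
    return "\t".join(str(x) for x in (chrom, start, end, name, score, strand))


def reads_to_bed(mapped_reads):
    """mapped_reads: iterable of (chrom, 0-based start, read length, strand)."""
    lines = []
    for i, (chrom, start, length, strand) in enumerate(mapped_reads, 1):
        lines.append(bed_line(chrom, start, start + length, f"read{i}", 0, strand))
    return "\n".join(lines) + "\n"


print(reads_to_bed([("chr1", 100, 35, "+"), ("chr2", 5000, 35, "-")]), end="")
```

Text in this form can be saved to a file and uploaded as a custom track.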

Robertson, G. et al. Nat. Methods 4, 651-657 (2007)

Mikkelsen,T.S. et al. Nature 448, 553-562 (2007)

300 kb region from mouse ES cells

Frietze et al JBC 2010

SISSRs (Site Identification from Short Sequence Reads): Jothi et al. NAR, 2008.

MACS (Model-based Analysis of ChIP-Seq): Zhang et al., Genome Biology, 2008.

QuEST (Genome-wide analysis of transcription factor binding sites based on ChIP–seq data): Valouev, A. et al. Nature Methods, 2008.

PeakSeq (PeakSeq enables systematic scoring of ChIP–seq experiments relative to controls): Rozowsky, J. et al. Nature Biotech. 2009.

FindPeaks (FindPeaks 3.1: a tool for identifying areas of enrichment from massively parallel short-read sequencing technology): Fejes, A.P. et al. Bioinformatics, 2008.

Hpeak (An HMM-based algorithm for defining read-enriched regions from massive parallel sequencing data): Xu et al., Bioinformatics, 2008.
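The tools above differ in their statistical models, and none of them is reproduced here. As a deliberately simplified sketch of the underlying idea of peak detection, the code below extends each read to an assumed fragment length, piles up coverage, and reports regions whose depth reaches a threshold. `fragment_len` and `min_depth` are hypothetical parameters, and strand and control samples are ignored.

```python
def call_enriched_regions(read_starts, genome_len, fragment_len=200, min_depth=5):
    """Return half-open (start, end) regions with read depth >= min_depth.

    Assumes every start satisfies 0 <= start < genome_len.
    """
    delta = [0] * (genome_len + 1)
    for s in read_starts:  # difference array: +1 at start, -1 past the end
        delta[s] += 1
        delta[min(s + fragment_len, genome_len)] -= 1

    regions, depth, region_start = [], 0, None
    for pos in range(genome_len):
        depth += delta[pos]
        if depth >= min_depth and region_start is None:
            region_start = pos                   # region opens
        elif depth < min_depth and region_start is not None:
            regions.append((region_start, pos))  # region closes
            region_start = None
    if region_start is not None:
        regions.append((region_start, genome_len))
    return regions


print(call_enriched_regions([10, 12, 14], 100, fragment_len=10, min_depth=3))
```

A fixed depth threshold is the crudest possible criterion; the published tools replace it with background models and significance tests.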

The MBDCap technology uses the methyl-CpG binding domain (MBD) to capture methylated sites. Double-stranded methylated DNA fragments can be detected. It is sensitive to different methylation densities.

Genome-wide sequencing technology was used to get the sequence of each short fragment.

The sequenced reads were mapped to the human genome to find their locations.

Lan et al Unpublished

BS-seq: genomic DNA is treated with sodium bisulphite (BS), which converts cytosine, but not methylcytosine, to uracil, followed by high-throughput sequencing.

Truly single-base resolution
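The single-base readout can be illustrated in silico. Converted uracil is read as T after amplification and sequencing, so a C that remains C in the read indicates methylation. The sketch below assumes a forward-strand read already aligned to its reference, with no sequencing errors; the function names are illustrative.

```python
def bisulfite_convert(seq, methylated_positions):
    """Simulate conversion: unmethylated C reads as T, methylated C stays C."""
    methylated = set(methylated_positions)
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(seq)
    )


def call_methylation(reference, read):
    """For each reference C: True if the read shows C, False if it shows T."""
    calls = {}
    for i, (ref_base, read_base) in enumerate(zip(reference, read)):
        if ref_base == "C" and read_base in "CT":
            calls[i] = read_base == "C"
    return calls


reference = "ACGTCCGA"
read = bisulfite_convert(reference, methylated_positions=[1, 5])
print(read, call_methylation(reference, read))
```

Every cytosine in the reference gets its own call, which is what makes the method single-base resolution.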

RNA-Seq is a new approach to transcriptome profiling that uses deep-sequencing technologies.

Studies using this method have already altered our view of the extent and complexity of eukaryotic transcriptomes. RNA-Seq also provides a far more precise measurement of levels of transcripts and their isoforms than other methods.

RNA-seq protocol

Single base resolution
High throughput
Low background noise
Ability to distinguish different isoforms and allelic expression
Relatively low cost