Functional annotation of ChIP-peaks
Minghui Wang, Qi Sun
Bioinformatics Facility
Institute of Biotechnology
[Figure: pairwise scatter plots of Sense, Antisense, Antisense_step1 and Antisense_step2; correlation labels R = -0.27, R = 0.13, R = 0.79]
peak_annotation
Columns: seqnames, start, end, width, strand, length, summit, tags, X.10.log10.pvalue., fold_enrichment, FDR..., annotation, geneChr, geneStart, geneEnd, geneLength, geneStrand, geneId, transcriptId, distanceToTSS
Example row (truncated): 1 I 4058 4225 168 * 168 41364612.73.5710.99 Promoter (
Experimental design
Replicates: Biorep 1, Biorep 2, Biorep 3
Young: TR ($\mu_1$), CL ($\mu_2$)
Old: TR ($\mu_3$), CL ($\mu_4$)
$\mu_1 - \mu_2 \;?\; 0$
$\mu_3 - \mu_4 \;?\; 0$
$\left((\mu_1 - \mu_2) - (\mu_3 - \mu_4)\right) \;?\; 0$ is for ????
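GLM (Poisson)
$$
Y_i =
\begin{pmatrix} 34 \\ 12 \\ 42 \\ 18 \\ 44 \\ 20 \\ 25 \\ 10 \\ 32 \\ 15 \\ 38 \\ 14 \end{pmatrix}
=
\begin{pmatrix}
1 & 1 & 1 & 1 \\
1 & 1 & 0 & 0 \\
1 & 1 & 1 & 1 \\
1 & 1 & 0 & 0 \\
1 & 1 & 1 & 1 \\
1 & 1 & 0 & 0 \\
1 & 0 & 1 & 0 \\
1 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 \\
1 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 \\
1 & 0 & 0 & 0
\end{pmatrix}
\begin{pmatrix} \mu \\ \alpha \\ \beta \\ \alpha{*}\beta \end{pmatrix}
+ \varepsilon_i
$$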
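The model above can be checked by hand. The sketch below is not the workshop's code. It assumes a log link, assumes the second design column marks one stage (rows 1 to 6) and the third marks TR samples, and uses hypothetical names (`counts`, `design`, `cell_mean`). Because the four-parameter model is saturated for a 2 x 2 layout, the Poisson maximum-likelihood fit reproduces the four cell means, so the coefficients follow from logs of those means.

```python
import math

# Counts and design rows as read from the slide (12 observations).
counts = [34, 12, 42, 18, 44, 20, 25, 10, 32, 15, 38, 14]
design = [
    (1, 1, 1, 1), (1, 1, 0, 0), (1, 1, 1, 1), (1, 1, 0, 0),
    (1, 1, 1, 1), (1, 1, 0, 0), (1, 0, 1, 0), (1, 0, 0, 0),
    (1, 0, 1, 0), (1, 0, 0, 0), (1, 0, 1, 0), (1, 0, 0, 0),
]

def cell_mean(a, b):
    """Mean count of the observations whose (alpha, beta) indicators equal (a, b)."""
    vals = [y for y, row in zip(counts, design) if row[1] == a and row[2] == b]
    return sum(vals) / len(vals)

m11, m10 = cell_mean(1, 1), cell_mean(1, 0)
m01, m00 = cell_mean(0, 1), cell_mean(0, 0)

# Saturated Poisson model with log link (an assumption, see lead-in):
# fitted cell means equal observed cell means.
mu = math.log(m00)                        # baseline: alpha = 0, beta = 0
alpha = math.log(m10) - math.log(m00)     # stage effect at beta = 0
beta = math.log(m01) - math.log(m00)      # TR effect at alpha = 0
interaction = (math.log(m11) - math.log(m10)) - (math.log(m01) - math.log(m00))

print(f"mu={mu:.4f} alpha={alpha:.4f} beta={beta:.4f} alpha*beta={interaction:.4f}")
```

The last coefficient is a difference of differences on the log scale, which is the quantity behind the third question in the experimental design.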
Identify enriched regions within Young or Old
[Figure: GLM applied across Biorep 1, Biorep 2 and Biorep 3]
Pu M et al. (2015) Trimethylation of Lys36 on H3 restricts gene expression change during aging and impacts life span. Genes Dev 29(7):718-31
Identify enriched regions between Young and Old stages
[Figure: GLM comparing Young (Biorep 1, Biorep 2, Biorep 3) with Old (Biorep 1, Biorep 2, Biorep 3)]
Pu M et al. (2015) Trimethylation of Lys36 on H3 restricts gene expression change during aging and impacts life span. Genes Dev 29(7):718-31
Visualization & Annotation
Downstream analysis workflow
Peak calling
Enriched regions
- Visualization with IGV
- Nearest genes
- Relationship to gene features
- Density plotting of gene features
- Functional enrichment (e.g. GO categories)
- Motif analysis
- Other advanced analysis
Tools for visualization
deepTools: https://deeptools.readthedocs.io/en/latest/
IGV: http://software.broadinstitute.org/software/igv/
HOMER: http://homer.ucsd.edu/homer/
deepTools
Tools for BAM and bigWig file processing
- multiBamSummary: computes read coverage over BAM files. Output is used by plotCorrelation or plotPCA
- multiBigwigSummary: extracts scores from bigWig files. Output is used by plotCorrelation or plotPCA
- correctGCBias: corrects GC bias in a BAM file. Don't use it with ChIP data
- bamCoverage: computes read coverage per bin or region
- bamCompare: computes the log2 ratio (and other operations) of the read coverage of two samples per bin or region
- bigwigCompare: computes the log2 ratio (and other operations) from the bigWig scores of two samples per bin or region
- computeMatrix: prepares the data from bigWig scores for plotting with plotHeatmap or plotProfile
Tools for QC
- plotCorrelation: plots heatmaps or scatterplots of data correlation
- plotPCA: plots PCA
- plotFingerprint: plots the distribution of enriched regions
…
Heatmaps and summary plots
- plotHeatmap: plots one or multiple heatmaps of user-selected regions over different genomic scores
- plotProfile: plots the average profile of user-selected regions over different genomic scores
- plotEnrichment: plots the read/fragment coverage of one or more sets of regions
…
[mingh@cbsumm16 ChIP_seq_workshop_2017]$ bamCoverage -h
bamCoverage -b reads.bam -o coverage.bw
Optional arguments:
  --help, -h                   show this help message and exit
  --scaleFactor SCALEFACTOR
  --binSize INT bp
  …
usage: bamCompare -b1 treatment.bam -b2 control.bam -o log2ratio.bw
Optional arguments:
  --help, -h                   show this help message and exit
  --scaleFactorsMethod
  --binSize INT bp
  --ratio {log2,ratio,subtract,add,mean,reciprocal_ratio,first,second}
  …
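To make the log2 option concrete, here is a minimal sketch of a per-bin log2 ratio between a treatment and a control track. It is not deepTools code. The function name, the scale-factor arguments and the pseudocount of 1 are assumptions for illustration, and the library's own scaling methods are not reproduced.

```python
import math

def log2_ratio_per_bin(treatment, control, scale_treatment=1.0,
                       scale_control=1.0, pseudocount=1.0):
    """Return log2((t * s_t + p) / (c * s_c + p)) for each pair of bin coverages.

    The pseudocount keeps empty bins from producing a division by zero
    or log2(0); its value here is an assumption.
    """
    if len(treatment) != len(control):
        raise ValueError("both tracks must have the same number of bins")
    return [
        math.log2((t * scale_treatment + pseudocount) /
                  (c * scale_control + pseudocount))
        for t, c in zip(treatment, control)
    ]

# Hypothetical coverage values for three bins.
ratios = log2_ratio_per_bin([7, 0, 3], [1, 0, 3])
print(ratios)
```

Bins where both samples have equal coverage come out at 0, enriched bins are positive, and depleted bins are negative.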
deepTools
[Figure: bigwig signal plotted over peak_region]
HOMER
makeTagDirectory <Output Directory Name> [options] <alignment file 1> [alignment file 2]
or
makeTagDirectory reps-combined -d rep-1 rep-2 rep-3
Treatment file
Optimizing tag files...
Estimated genome size = 119638054
Estimated average read density = 0.014636 per bp
Total Tags = 1751052.0
Total Positions = 1559937
Average tag length = 50.0
Median tags per position = 1 (ideal: 1)
Average tags per position = 1.123
Fragment Length Estimate: 291
Peak Width Estimate: 345
Autocorrelation quality control metrics:
Same strand fold enrichment: 3.9
Diff strand fold enrichment: 3.8
Same / Diff fold enrichment: 1.0
Guessing sample is ChIP-Seq or unstranded RNA-Seq - autocorrelation looks good.

Control file
Optimizing tag files...
Estimated genome size = 25588101
Estimated average read density = 0.090854 per bp
Total Tags = 2324782.0
Total Positions = 754412
Average tag length = 48.9
Median tags per position = 2 (ideal: 1)
Average tags per position = 3.082
!! Might have some clonal amplification in this sample if sonication was used
Fragment Length Estimate: 155
Peak Width Estimate: 154
!!! No reliable estimate for peak size
Setting Peak width estimate to be equal to fragment length estimate
Autocorrelation quality control metrics:
Same strand fold enrichment: 1.2
Diff strand fold enrichment: 1.2
Same / Diff fold enrichment: 1.0
Guessing sample is ChIP-Seq - may have low enrichment with lots of background
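Two of the reported QC values can be reproduced from the other numbers in the same output. The sketch below uses a hypothetical helper, `homer_tag_stats`, and assumes that "Average tags per position" is total tags divided by total positions and that the read density is total tags divided by the estimated genome size. The reported values are consistent with those ratios, but this is not taken from HOMER's source.

```python
def homer_tag_stats(total_tags, total_positions, genome_size):
    """Recompute two summary ratios from the counts printed by makeTagDirectory."""
    return {
        "avg_tags_per_position": total_tags / total_positions,
        "read_density_per_bp": total_tags / genome_size,
    }

# Numbers copied from the two outputs above.
first_sample = homer_tag_stats(1751052, 1559937, 119638054)
second_sample = homer_tag_stats(2324782, 754412, 25588101)

for name, stats in (("first", first_sample), ("second", second_sample)):
    print(name,
          round(stats["avg_tags_per_position"], 3),
          round(stats["read_density_per_bp"], 6))
```

An average well above 1 tag per position, as in the second sample, is what triggers the clonal amplification warning.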
HOMER
makeUCSCfile <tag directory> [options] -bigWig <genome_size.file> -o output
bedGraphToBigWig should be available in your executable path:
wget https://github.com/ENCODE-DCC/kentUtils/raw/v302.1.0/bin/linux.x86_64/bedGraphToBigWig
chmod 777 bedGraphToBigWig
genome_size.file (chromosome name, length in bp):
1 30427671
2 19698289
3 23459830
4 18585056
5 26975502
Mt 366924
Pt 154478
Tools for gene features analysis
HOMER
annotatePeaks.pl <peak file> <genome> [options] > <output file>
annotatePeaks.pl test_results_peaks.narrowPeak_chr tair10 > out
Example output (cell values truncated as displayed on the slide; the "Pea" and "FD" columns show no values and are omitted):
Peak ID                  Chr   Start     End       Strand  Annotation      Detailed Annotation  Distance to TSS  Nearest Promoter  Nearest Unigene  Nearest Refseq  Gene Name  Gene Alias
test_results_peak_14368  Chr5  6833504   6837577   +       exon (AT5G20    exon (AT5G20250      1881             AT5G20250.4       At.74986         NM_00103        DIN10      DARK IND
test_results_peak_1382   Chr1  6971312   6973001   +       exon (AT1G20    exon (AT1G20110      671              AT1G20110.1       At.15444         NM_10186        AT1G20110  T20H2.10|
test_results_peak_855    Chr1  4347808   4349969   +       promoter-TSS    promoter-TSS (A      390              AT1G12760.1       At.43884         NM_00103        AT1G12760  T12C24.29
test_results_peak_15041  Chr5  15843775  15845935  +       exon (AT5G39    exon (AT5G39570      896              AT5G39570.1       At.20492         NM_12331        AT5G39570  MIJ24.6|M
test_results_peak_154    Chr1  739488    742090    +       intron (AT1G0   intron (AT1G0309     1110             AT1G03090.2       At.24059         NM_10019        MCCA       -
test_results_peak_6386   Chr2  16483892  16485127  +       exon (AT2G39    exon (AT2G39480      530              AT2G39480.1       At.63501         NM_12950        PGP6       F12L6.14|F
test_results_peak_3313   Chr1  24046891  24048872  +       promoter-TSS    promoter-TSS (A      741              AT1G64720.1       At.74749         NM_10514        CP5        F13O11.4|
test_results_peak_8490   Chr3  7116011   7117776   +       exon (AT3G20    exon (AT3G20410      692              AT3G20410.1       At.8182          NM_11293        CPK9       domain pro
PAVIS
[Figure: peak distribution by gene features and by distance to TSS]
PeakAnalyzer
Inputs: peak file, GTF file
Tools for density plot
https://github.com/shenlab-sinai/ngsplot
ngs.plot
ngs.plot.r -G genome -R region -C [cov|config] file -O name [Options]
  -G   Genome name. Use ngsplotdb.py list to show available genomes.
  -R   Genomic regions to plot: tss, tes, genebody, exon, cgi, enhancer, dhs or bed
  -C   Indexed bam file or a configuration file for multiplot
  -O   Name for output: multiple files will be generated
Options
  -L   Flanking region size (will override flanking factor)
  -N   Flanking region factor
  -RB  The fraction of extreme values to be trimmed on both ends; default=0, 0.05 means 5% of extreme values will be trimmed
  -S   Randomly sample the regions for plot, must be: (0, 1]
  -P   #CPUs to use. Set 0 (default) for auto detection
  -AL  Algorithm used to normalize coverage vectors: spline (default), bin
  -CS  Chunk size for loading genes in batch (default=100)
  -MQ  Mapping quality cutoff to filter reads (default=20)
  -FL  Fragment length used to calculate physical coverage (default=150)
https://github.com/shenlab-sinai/ngsplot/wiki/SupportedGenomes
ngs.plot.r -G hg19 -R tss -C hesc.H3k4me3.rmdup.sort.bam -O hesc.H3k4me3.tss -T H3K4me3 -L 3000 -FL 300
ngs.plot.r -G hg19 -R tss -C hesc.H3k4me3.rmdup.sort.bam:hesc.Input.rmdup.sort.bam -O hesc.H3k4me3vsInp.tss -T H3K4me3 -L 3000
ngs.plot.r -G hg19 -R genebody -C config.hesc.k36.txt -O hesc.k36.genebody -D ensembl -FL 300
config.hesc.k36.txt:
hesc.H3k36me3.rmdup.sort.bam high_expressed_genes.txt "High"
hesc.H3k36me3.rmdup.sort.bam medium_expressed_genes.txt "Med"
hesc.H3k36me3.rmdup.sort.bam low_expressed_genes.txt "Low"
deepTools
computeMatrix scale-regions -S H3K27Me3-input.bigWig H3K4Me1-Input.bigWig H3K4Me3-Input.bigWig -R genes19.bed genesX.bed --beforeRegionStartLength 3000 --regionBodyLength 5000 --afterRegionStartLength 3000 --skipZeros -o matrix.mat.gz
plotProfile -m matrix.mat.gz -out ExampleProfile1.png --numPlotsPerRow 2 --plotTitle "Test data profile"
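The two commands above build a regions-by-bins matrix and then average it per bin. The sketch below illustrates that idea on a toy matrix and is not deepTools code. The helper names are hypothetical, the bin size of 10 bp is an assumption about the default, and the plain column mean is an assumption about how the profile is averaged.

```python
def bins_per_sample(before, body, after, bin_size=10):
    """Number of matrix columns one bigWig contributes (bin_size is an assumption)."""
    return (before + body + after) // bin_size

def average_profile(matrix):
    """Column-wise mean of a regions x bins score matrix."""
    if not matrix:
        raise ValueError("matrix must contain at least one region")
    n_regions = len(matrix)
    return [sum(column) / n_regions for column in zip(*matrix)]

# Flank and body lengths from the computeMatrix call above.
n_cols = bins_per_sample(3000, 5000, 3000)
print(n_cols)

# Toy scores: 3 regions x 4 bins (hypothetical values).
toy = [
    [0.0, 2.0, 4.0, 1.0],
    [1.0, 3.0, 5.0, 1.0],
    [2.0, 4.0, 6.0, 1.0],
]
profile = average_profile(toy)
print(profile)
```

With three bigWig files, the matrix holds three such blocks of columns side by side, one per sample.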
HOMER
annotatePeaks.pl -hist …
ChIPseeker
Functional enrichment
Over-represented functional annotations of nearest genes of peaks
• Gene Ontology
• Biological Pathways
Typical tools
• DAVID https://david.ncifcrf.gov/
• GREAT http://bejerano.stanford.edu/great/public/html/
• Blast2GO https://www.blast2go.com/
• MapMan http://mapman.gabipd.org
• HOMER http://homer.salk.edu/homer/ngs/annotation.html
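Over-representation of a functional category among the genes nearest to peaks is often scored with a one-sided hypergeometric test. The sketch below shows that calculation in plain Python. It is a generic illustration, not the method of any tool listed above, and the function name and example numbers are hypothetical.

```python
from math import comb

def hypergeom_upper_tail(k, population, annotated, drawn):
    """P(X >= k) when `drawn` genes are sampled without replacement from
    `population` genes, of which `annotated` carry the category."""
    if not (0 <= annotated <= population and 0 <= drawn <= population):
        raise ValueError("inconsistent counts")
    total = comb(population, drawn)
    upper = min(annotated, drawn)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, drawn - i)
        for i in range(max(k, 0), upper + 1)
    )
    return favourable / total

# Hypothetical example: 20 genes in the genome, 5 annotated with a GO term,
# 4 genes near peaks, 2 of which carry the term.
p_value = hypergeom_upper_tail(2, population=20, annotated=5, drawn=4)
print(round(p_value, 4))
```

In practice one such p-value is computed per category and then corrected for multiple testing.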
Webpage tools
Oliver et al. (2004) The Plant J.
HOMER
findGO.pl <target ids file> <organism> <output directory> [options]
http://homer.ucsd.edu/homer/microarray/go.html
Motif pattern analysis
Motif analysis
meme <dataset> [optional arguments]
MEME (http://meme.sdsc.edu/meme/cgi-bin/meme.cgi)
HOMER
biomaRt & Bioconductor
ChIPseeker
ChIPpeakAnno
[Figure labels: peak, promoter, genome]