Bioinformatics Analysis of ChIP-Seq
Phil Ewels, NGI Stockholm
Epigenetics and its applications in clinical research (2601)
2017-03-21
Phil Ewels - Bioinformatics Analysis of ChIP-Seq / 42
Talk Overview
• Overview of ChIP-Seq
• ChIP-Seq data processing
• Peak Calling
• Normalisation & quality control
• Analysis Pipelines
• Downstream analyses
Overview of ChIP-Seq
• Question
- Can we find where a protein of interest binds across the genome?
• Requirements
- Good antibody
- Reference genome
• Assumptions
- Protein binds in a stable pattern
- Binding is comparable across a population of cells
Overview of ChIP-Seq
• Cross-link DNA and proteins
• Isolate and fragment DNA
• Chromatin Immunoprecipitation
• Reverse cross-links and purify DNA
• Add adapters & sequence
ChIP-Seq data processing
Data processing
• Sequence QC - FastQC / FastQ Screen
• Trimming - Cutadapt / Trimmomatic / AlienTrimmer / FASTX-Toolkit
• Alignment - Bowtie / BWA / STAR
• Duplicate removal - Picard / Samtools / SeqMonk
Data processing: FastQC
Data processing: FastQ Screen
Data processing: cutadapt
http://opensource.scilifelab.se/projects/cutadapt/
Data processing: cutadapt + FastQC
Data processing: Alignment
• Bowtie 1 & 2
- Bowtie 1 is good for short reads (less than 50 bp)
- Bowtie 2 is better with longer reads
• STAR
- As good as Bowtie but much faster
- Has a large memory footprint (~30 GB for human)
• BWA / Subread / SOAP / MAQ
• Alignments should be unique
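One common way to approximate "unique" alignments is to filter on mapping quality (MAPQ, the fifth column of a SAM record). The sketch below assumes plain-text SAM lines; the function name and the threshold of 20 are illustrative assumptions, not values from the talk, and the meaning of a given MAPQ value differs between aligners.

```python
def filter_unique(sam_lines, min_mapq=20):
    """Yield header lines plus mapped alignments with MAPQ >= min_mapq.

    min_mapq=20 is an assumed, illustrative cut-off.
    """
    for line in sam_lines:
        if line.startswith("@"):           # header lines pass through unchanged
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        flag, mapq = int(fields[1]), int(fields[4])
        if flag & 0x4:                     # SAM flag 0x4 = read is unmapped
            continue
        if mapq >= min_mapq:
            yield line


# Made-up records: one header, one confident hit, one ambiguous hit, one unmapped read
example = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t0\tchr1\t100\t42\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t0\tchr1\t200\t1\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
kept = list(filter_unique(example))
```

Here only the header and `r1` survive; `r2` is dropped for low MAPQ and `r3` because it is unmapped.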
Data processing: Duplicate Removal
• Duplicates can come from multiple sources
- PCR duplicates
- Optical duplicates
- Deep sequencing (genuine duplicates)
• How do you define duplicates?
- Sequence content (but what about sequencing errors?)
- Mapping position
• Sonication makes genuine duplicates unlikely
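The mapping-position definition can be sketched as follows: two single-end reads count as duplicates if they share chromosome, strand and 5' position. This is a simplified illustration with hypothetical names; tools such as Picard apply more elaborate rules (for example, choosing which copy to keep and handling paired-end and optical duplicates).

```python
def remove_duplicates(reads):
    """Keep the first read seen for each (chrom, strand, 5' position).

    reads: iterable of (chrom, start, end, strand) with start < end.
    """
    seen = set()
    kept = []
    for chrom, start, end, strand in reads:
        # The 5' end is the start for "+" reads and the end for "-" reads
        five_prime = start if strand == "+" else end
        key = (chrom, strand, five_prime)
        if key not in seen:
            seen.add(key)
            kept.append((chrom, start, end, strand))
    return kept


# Made-up reads for illustration
reads = [
    ("chr1", 100, 136, "+"),
    ("chr1", 100, 136, "+"),   # same position and strand: duplicate
    ("chr1", 100, 136, "-"),   # opposite strand: kept
    ("chr1", 90, 136, "-"),    # same 5' end (136) on "-": duplicate
    ("chr2", 100, 136, "+"),
]
deduplicated = remove_duplicates(reads)
```

Keying on position rather than sequence content sidesteps the question of sequencing errors, at the cost of discarding any genuine duplicates.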
Results Summary: MultiQC
• Scans your results directory and parses log files
• Builds a single report summarising everything
http://multiqc.info
Peak Calling
Peak Calling: Considerations
• What kind of mark are you looking for?
• Point-source factors
- Few, sharp peaks
- Most transcription factors
• Many peaks
- RNA Polymerase II
• Broad peaks
- Some histone marks (H3K27me3)
Peak Calling: Tools
• Huge number of tools available
• Many different statistical approaches
• The most important thing to remember: your results should look sensible, and you must be consistent
• If in doubt, use MACS v2 or SPP
- https://github.com/taoliu/MACS/
- http://compbio.med.harvard.edu/Supplements/ChIP-seq/
Normalisation & QC
Normalisation & quality control
• What we expect to see:
• Assumptions:
- Specific antibody
- Perfect purification
- Equal representation
• Reality:
- Non-specific antibody binding
- Unbound DNA being sequenced
- Open chromatin bias, repetitive regions not aligned
Normalisation: input controls
• Typically run an input sample
- Cross-linked DNA, but no ChIP step
- Alternatively, use a control antibody against a non-nuclear antigen, such as IgG
- Same sample, same prep
- Captures systematic biases (e.g. chromatin type, GC content)
• Can use the data in multiple ways
- Just determine regions to exclude
- Subtraction normalisation
- Typically used when calling peaks
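Subtraction normalisation can be illustrated with binned read counts: scale the input to the ChIP sequencing depth, then subtract it bin by bin. This is a minimal sketch with made-up counts and a hypothetical function name; scaling by total read count is a simplifying assumption, and real tools often estimate the scaling factor more carefully.

```python
def subtract_input(chip_bins, input_bins):
    """Subtract depth-scaled input counts from ChIP counts, flooring at zero."""
    # Scale factor so that the input has the same total count as the ChIP sample
    scale = sum(chip_bins) / sum(input_bins)
    return [max(0.0, c - scale * i) for c, i in zip(chip_bins, input_bins)]


# Made-up counts for four genomic bins
chip_counts = [10, 50, 10, 10]
input_counts = [5, 5, 5, 5]
signal = subtract_input(chip_counts, input_counts)
# -> [0.0, 30.0, 0.0, 0.0]
```

Only the second bin retains signal once the scaled background has been removed.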
Normalisation: Signal and Noise
• We will sequence lots of irrelevant stuff
- What is signal and what is noise?
• Essentially, we’re looking for enrichment
- Peak callers do a lot of this for you
• Most peak callers need an input sample
- Some can use mappability and GC content instead
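The idea of enrichment over input can be sketched as a Poisson test on a single window: how surprising is the ChIP read count if the depth-scaled input count is taken as the expected background? This is a deliberately simplified illustration, not how any particular peak caller works; the function names and the `min_lambda` floor are assumptions made for the example.

```python
from math import exp


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), by summing the lower tail."""
    term = exp(-lam)          # P(X = 0)
    cdf = 0.0
    for i in range(k):        # accumulate P(X = 0) .. P(X = k - 1)
        cdf += term
        term *= lam / (i + 1)
    return max(0.0, 1.0 - cdf)


def window_pvalue(chip_count, input_count, chip_total, input_total, min_lambda=1.0):
    """P-value for seeing >= chip_count reads given the scaled input background."""
    expected = input_count * chip_total / input_total
    # min_lambda is an assumed floor so empty input windows do not give lambda = 0
    return poisson_sf(chip_count, max(min_lambda, expected))


# Made-up counts: same library sizes, 5 input reads in the window
p_enriched = window_pvalue(30, 5, 1_000_000, 1_000_000)   # strong enrichment
p_flat = window_pvalue(5, 5, 1_000_000, 1_000_000)        # no enrichment
```

The first window gets a very small p-value, the second does not. In a genome-wide scan these values would still need multiple-testing correction.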
Quality Control: visualisation
• Visualising the data is quick and very helpful - UCSC / SeqMonk / IGV
• Fast impression of how the experiment has worked
• Not enough on its own
Quality Control: SeqMonk
Quality Control: Saturation Analysis
• If you sequence more reads, you’ll find more peaks
• If you’ve sequenced enough, you should be nearing a plateau
• Look into complexity of data
- Preseq
- SPP subsampling
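A subsampling-style saturation check can be sketched by counting distinct read positions at increasing fractions of the data. The example below uses simulated reads and hypothetical names; it counts distinct positions rather than peaks, and unlike Preseq it does not extrapolate beyond the observed depth.

```python
import random


def saturation_curve(read_positions, fractions=(0.25, 0.5, 0.75, 1.0), seed=0):
    """Return (reads_used, distinct_positions) for growing subsamples."""
    rng = random.Random(seed)
    shuffled = list(read_positions)
    rng.shuffle(shuffled)
    curve = []
    for frac in fractions:
        n = int(len(shuffled) * frac)
        # Each subsample is a prefix of the same shuffle, so counts never decrease
        curve.append((n, len(set(shuffled[:n]))))
    return curve


# Simulated low-complexity library: 1000 reads drawn from only 200 positions
sim = random.Random(1)
reads = [sim.randrange(200) for _ in range(1000)]
curve = saturation_curve(reads)
```

If the distinct count barely rises between the last fractions, the library is close to saturation and deeper sequencing would mostly yield duplicates.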
Quality Control: Strand Cross-Correlation
• Single-end sequencing should give a bimodal peak around binding sites on the two DNA strands
• Some peak callers use this to aid in region calling and for QC
• Can define NSC and RSC scores (Landt et al. 2012)
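Strand cross-correlation can be sketched by correlating the plus-strand and minus-strand 5' end profiles at increasing shifts; the shift with the highest correlation estimates the fragment length. NSC and RSC are then ratios taken from that profile, following the definitions in Landt et al. (2012). The code below is a toy illustration with hypothetical function names and made-up data. Taking the fragment peak as the simple maximum of the profile is an assumption made here for brevity.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def cross_correlation(plus_5p, minus_5p, length, max_shift):
    """Correlation between strand profiles for shifts 0..max_shift."""
    plus = [0] * length
    minus = [0] * length
    for pos in plus_5p:
        plus[pos] += 1
    for pos in minus_5p:
        minus[pos] += 1
    # At shift k, plus[i] is compared with minus[i + k]
    return [pearson(plus[:length - k], minus[k:]) for k in range(max_shift + 1)]


def nsc_rsc(cc, read_length):
    """NSC = peak / background; RSC = (peak - background) / (phantom - background)."""
    background = min(cc)
    fragment_peak = max(cc)
    phantom_peak = cc[read_length]   # correlation at a shift of one read length
    nsc = fragment_peak / background
    rsc = (fragment_peak - background) / (phantom_peak - background)
    return nsc, rsc


# Toy data: three binding sites, reads start 50 bp either side (fragments of ~100 bp)
sites = [200, 500, 800]
plus_5p = [s - 50 for s in sites for _ in range(5)]
minus_5p = [s + 50 for s in sites for _ in range(5)]
cc = cross_correlation(plus_5p, minus_5p, length=1000, max_shift=200)
fragment_length = cc.index(max(cc))

# Made-up correlation profile, only to show the arithmetic of the two scores
nsc, rsc = nsc_rsc([0.02, 0.04, 0.03, 0.06, 0.03], read_length=1)
```

For the toy reads the estimated fragment length is 100, and the made-up profile gives an NSC of about 3 and an RSC of about 2.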
Quality Control: Stats, stats, stats
• NSC and RSC
- Normalised strand cross-correlation coefficient
- Relative strand cross-correlation coefficient
• FRiP
- Fraction of reads in peaks
• FDRs, IDRs, p-values of peaks
- False discovery rates
- Irreproducible discovery rates
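FRiP is the simplest of these statistics to compute: count the reads that fall inside any called peak and divide by the total number of reads. A minimal sketch, assuming reads are reduced to a single position and peaks are half-open intervals (both are assumptions for the example, as is the function name):

```python
def frip(read_positions, peaks):
    """Fraction of reads in peaks.

    read_positions: list of (chrom, pos)
    peaks: list of (chrom, start, end), half-open intervals
    """
    in_peaks = sum(
        1
        for chrom, pos in read_positions
        if any(c == chrom and start <= pos < end for c, start, end in peaks)
    )
    return in_peaks / len(read_positions)


# Made-up reads and a single peak
reads = [("chr1", 105), ("chr1", 150), ("chr1", 500), ("chr2", 120)]
peaks = [("chr1", 100, 200)]
score = frip(reads, peaks)
# -> 0.5
```

The linear scan over peaks is fine for a toy example; real data would need an interval index or a sorted sweep.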
• Useful if you have a lot of samples
- Allows benchmarking and identification of failed samples
• Don’t be overwhelmed by the acronyms
• Believe your eyes - if the data looks trustworthy, it probably is trustworthy
Analysis Pipelines
Bioinformatics Workflows
• Running all of these steps for many samples is repetitive
- Difficult, dull, prone to errors
• Processing can be automated by a Workflow Manager
- Also known as Pipeline Tools
• These tools execute processing steps for you, managing files and dependencies
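The core job of a workflow manager, running steps in an order that respects their dependencies, can be sketched with Python's standard `graphlib` module (Python 3.9+). The step names and placeholder functions are hypothetical, and this says nothing about how Cluster Flow or Nextflow are implemented.

```python
from graphlib import TopologicalSorter


def run_pipeline(steps, dependencies):
    """Run each step once, only after all of its dependencies have run.

    steps: dict mapping step name -> zero-argument callable
    dependencies: dict mapping step name -> set of steps it depends on
    """
    order = list(TopologicalSorter(dependencies).static_order())
    log = [steps[name]() for name in order]
    return order, log


# Hypothetical steps; each callable stands in for launching a real tool
steps = {
    "trim": lambda: "trimmed reads",
    "align": lambda: "aligned reads",
    "dedup": lambda: "deduplicated alignments",
    "call_peaks": lambda: "peaks",
    "coverage": lambda: "coverage track",
}
dependencies = {
    "trim": set(),
    "align": {"trim"},
    "dedup": {"align"},
    "call_peaks": {"dedup"},
    "coverage": {"dedup"},
}
order, log = run_pipeline(steps, dependencies)
```

Real workflow managers add what this sketch leaves out: tracking intermediate files, submitting jobs to a cluster, and resuming after failures.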
Cluster Flow
• Available on UPPMAX
• ChIP-seq pipeline
- Runs QC, alignment, deduplication and generates coverage tracks / fingerprint plots
- Written with J Westholm
#fastqc
#bowtie1
#samtools_sort_index
#bedtools_bamToBed
#bedToNrf
#picard_dedup
#samtools_sort_index
#phantompeaktools_runSpp
#deeptools_bamCoverage
#deeptools_bamFingerprint
#bedtools_intersectNeg
#samtools_sort_index
module load clusterflow
cf --setup
cf-uppmax --add_genomes
cf --genome GRCh37 chipseq_qc *.fq.gz
Nextflow
• Runs on UPPMAX
• Several pipelines built at NGI, including ChIP-seq - Still under development, could be a little buggy
• Also runs elsewhere. Docker coming soon.
curl -fsSL get.nextflow.io | bash
nextflow run SciLifeLab/NGI-ChIPseq \
  --project b2017123 \
  --reads '*_R{1,2}.fastq.gz' \
  --macsconfig 'macssetup.config' \
  --genome GRCh37
Bioinformatics Workflows
• These are great, but come with some caveats
- Some setup is required
- They don’t always work…
- Results must be checked
• They are not a substitute for understanding the analysis steps
• You are still responsible for your results!
Downstream Analysis
Downstream Analysis: Annotation
• You have reads! Peaks! But where are they?
- Co-ordinates are not helpful by themselves
• BEDTools
- closest: distance to nearest genes
- intersect: overlap with feature classes
• HOMER annotation
• SeqMonk Average quantitation plots
• HOMER can annotate read intensities
• SeqMonk average quantitation plot across genes
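The kind of question answered by BEDTools `closest` (which gene is nearest to each peak, and how far away is it?) can be sketched in a few lines. This is an illustration with made-up coordinates and hypothetical names, assuming half-open intervals; the distance convention here (0 for overlap, otherwise the gap between interval ends) may not match what BEDTools reports in every case.

```python
def closest_gene(peak, genes):
    """Return (gene_name, distance) for the nearest gene on the same chromosome.

    peak: (chrom, start, end); genes: list of (chrom, start, end, name).
    Returns None if no gene is on the peak's chromosome.
    """
    chrom, start, end = peak
    best = None
    for g_chrom, g_start, g_end, name in genes:
        if g_chrom != chrom:
            continue
        # 0 if the intervals overlap, otherwise the gap between them
        distance = max(0, g_start - end, start - g_end)
        if best is None or distance < best[1]:
            best = (name, distance)
    return best


# Made-up gene annotation and peaks
genes = [
    ("chr1", 1000, 2000, "geneA"),
    ("chr1", 5000, 6000, "geneB"),
    ("chr2", 100, 900, "geneC"),
]
upstream_hit = closest_gene(("chr1", 2500, 2600), genes)    # -> ("geneA", 500)
overlap_hit = closest_gene(("chr1", 5500, 5600), genes)     # -> ("geneB", 0)
no_hit = closest_gene(("chr3", 1, 2), genes)                # -> None
```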
Downstream Analysis: Annotation
• Gene Ontology (GO) analysis is increasingly popular
• Databases classify every gene with a restricted vocabulary
• Use your data to find if any GO terms are enriched
• Like peak callers, lots of software available
- DAVID and GREAT are popular
- Cytoscape is good for visualisation
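Testing whether a GO term is enriched is often framed as a hypergeometric test: given how many genes carry the term overall, how likely is it to see at least this many among the genes near peaks? The sketch below uses made-up numbers and a hypothetical function name; individual tools use their own statistics and annotation sets, and results across many terms need multiple-testing correction.

```python
from math import comb


def hypergeom_enrichment_p(total_genes, term_genes, selected_genes, overlap):
    """P(X >= overlap) when drawing selected_genes without replacement.

    total_genes: genes in the background set
    term_genes: background genes annotated with the GO term
    selected_genes: genes associated with peaks
    overlap: selected genes that carry the term
    """
    upper = min(term_genes, selected_genes)
    favourable = sum(
        comb(term_genes, k) * comb(total_genes - term_genes, selected_genes - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / comb(total_genes, selected_genes)


# Made-up example: 200 of 20000 genes carry the term; 20 of 500 peak genes do
p_value = hypergeom_enrichment_p(20000, 200, 500, 20)
```

With an expected overlap of only 5 genes, observing 20 gives a very small p-value.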
Downstream Analysis: Motif Searching
• Search peaks for enriched sequence motifs
- Could indicate a TF binding motif
- Interesting for new ChIP factors
- Can be informative for co-operative binding
• HOMER is one of many tools to do this
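The basic idea behind motif searching can be sketched by comparing k-mer frequencies in peak sequences against background sequences. This is a toy illustration with made-up sequences and hypothetical names; dedicated tools use far more sophisticated motif models and statistics.

```python
from collections import Counter


def kmer_counts(sequences, k):
    """Count every overlapping k-mer across a list of sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def enriched_kmers(peak_seqs, background_seqs, k=6, pseudocount=1.0):
    """Rank k-mers seen in peaks by frequency ratio versus background."""
    fg = kmer_counts(peak_seqs, k)
    bg = kmer_counts(background_seqs, k)
    fg_total = sum(fg.values())
    bg_total = sum(bg.values())
    ratios = {}
    for kmer, count in fg.items():
        # The pseudocount avoids division by zero for k-mers absent from background
        fg_freq = (count + pseudocount) / (fg_total + pseudocount)
        bg_freq = (bg[kmer] + pseudocount) / (bg_total + pseudocount)
        ratios[kmer] = fg_freq / bg_freq
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)


# Made-up sequences: every "peak" contains CACGTG, the background does not
peak_seqs = ["AAACACGTGTTT", "GGCACGTGCC", "TTCACGTGAA"]
background_seqs = ["AAAAAAAAAAAA", "GGGGCCCCGGGG", "TTTTAAAATTTT"]
ranking = enriched_kmers(peak_seqs, background_seqs)
top_kmer = ranking[0][0]
# -> "CACGTG"
```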
Downstream Analysis: Differential binding
• May want to compare samples across conditions or time series
• Simply overlapping peak sets is too simplistic
• DiffBind: R Bioconductor package
- ChIP-seq equivalent of DESeq and edgeR
- Extensive documentation and tutorials
- http://bioconductor.org/packages/release/bioc/html/DiffBind.html
Conclusions
• There is no “correct way” to analyse ChIP-seq
- Depends on biological system and question
- Affected by number of samples and experimental setup
- Defined by your experience and skills
• Two packages that do a lot of steps:
- HOMER
- SeqMonk
- Lots of YouTube walk-through videos: https://youtu.be/LcMVb4zQBXI and https://youtu.be/Cy13yV6Rf6s
Further Reading
• Practical Guidelines for the Comprehensive Analysis of ChIP-seq Data - Bailey et al. PLOS Comp Bio (2013)
• ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia - Landt et al. Genome Research (2012)
• ChIP–seq: advantages and challenges of a maturing technology - Park. Nature Reviews Genetics (2009)
• http://seqanswers.com and http://biostars.org