Analysis of ChIP-Seq Data


Bioinformatics Analysis of ChIP-Seq

Phil Ewels, NGI Stockholm

Epigenetics and its applications in clinical research (2601)

2017-03-21


Phil Ewels - Bioinformatics Analysis of ChIP-Seq / 42

Talk Overview

• Overview of ChIP-Seq

• ChIP-Seq data processing

• Peak Calling

• Normalisation & quality control

• Analysis Pipelines

• Downstream analyses

Overview of ChIP-Seq

• Question
  - Can we find where a protein of interest binds across the genome?

• Requirements
  - Good antibody
  - Reference genome

• Assumptions
  - Protein binds in a stable pattern
  - Binding is comparable across a population of cells


• Cross-link DNA and proteins

• Isolate DNA & fragmentation

• Chromatin Immunoprecipitation

• Reverse cross-links and purify DNA

• Add adapters & sequence


ChIP-Seq data processing


Data processing

• Sequence QC
  - FastQC / FastQ Screen

• Trimming
  - Cutadapt / Trimmomatic / AlienTrimmer / FASTX-Toolkit

• Alignment
  - Bowtie / BWA / STAR

• Duplicate removal
  - Picard / Samtools / SeqMonk


Data processing: FastQC


Data processing: FastQ Screen


Data processing: cutadapt


http://opensource.scilifelab.se/projects/cutadapt/


Data processing: cutadapt + FastQC


Data processing: Alignment

• Bowtie 1 & 2
  - Bowtie 1 good for short reads (less than 50 bp)
  - Bowtie 2 better with longer reads

• STAR
  - As good as Bowtie but much faster
  - Has a large memory footprint (~30 GB for human)

• BWA / Subread / SOAP / MAQ

• Alignments should be unique
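One common way to approximate the "alignments should be unique" rule is to discard reads with a low mapping quality (MAPQ). The sketch below assumes plain-text SAM lines as input. The function name `filter_unique` and the default threshold of 10 are illustrative assumptions, and the meaning of MAPQ values differs between aligners, so the cut-off should be checked against the aligner's documentation.

```python
def filter_unique(sam_lines, min_mapq=10):
    """Yield SAM header lines plus mapped alignments with MAPQ >= min_mapq.

    min_mapq=10 is an assumed, illustrative threshold, not a recommendation
    from the slides.
    """
    for line in sam_lines:
        if line.startswith("@"):
            yield line  # header lines are passed through unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])   # SAM column 2: bitwise FLAG
        mapq = int(fields[4])   # SAM column 5: mapping quality
        if flag & 0x4:          # 0x4 = read is unmapped
            continue
        if mapq < min_mapq:     # low MAPQ usually means ambiguous placement
            continue
        yield line


example = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t42\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t0\tchr1\t200\t1\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
kept = list(filter_unique(example))
```

In practice the same filter is usually applied with an existing tool on BAM files; this version only shows the logic.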


Data processing: Duplicate Removal

• Duplicates can come from multiple sources
  - PCR duplicates
  - Optical duplicates
  - Deep sequencing (genuine duplicates)

• How do you define duplicates?
  - Sequence content (but what about sequencing errors?)
  - Mapping position

• Sonication makes genuine duplicates unlikely
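Defining duplicates by mapping position can be sketched as follows. This is a minimal illustration, not how Picard, Samtools or SeqMonk implement it: the `Read` type, the function name and the `max_per_position` parameter are all hypothetical, and reads are assumed to be single-end with a known 5' mapping position.

```python
from typing import Iterable, List, NamedTuple


class Read(NamedTuple):
    name: str
    chrom: str
    start: int   # 5' mapping position of the read
    strand: str  # "+" or "-"


def remove_position_duplicates(reads: Iterable[Read],
                               max_per_position: int = 1) -> List[Read]:
    """Keep at most max_per_position reads per (chrom, start, strand)."""
    seen = {}
    kept = []
    for read in reads:
        key = (read.chrom, read.start, read.strand)
        count = seen.get(key, 0)
        if count < max_per_position:
            kept.append(read)
        seen[key] = count + 1
    return kept


reads = [
    Read("r1", "chr1", 100, "+"),
    Read("r2", "chr1", 100, "+"),  # same position and strand: duplicate
    Read("r3", "chr1", 100, "-"),  # same position, other strand: kept
    Read("r4", "chr1", 250, "+"),
]
deduplicated = remove_position_duplicates(reads)
```

Because the key ignores sequence content, reads that differ only by sequencing errors are still treated as duplicates.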


Results Summary: MultiQC


• Scans your results directory and parses log files

• Builds a single report summarising everything

http://multiqc.info

Peak Calling

Peak Calling: Considerations

• What kind of mark are you looking for?

• Point-source factors
  - Few, sharp peaks
  - Most transcription factors

• Many peaks
  - RNA Polymerase II

• Broad peaks
  - Some histone marks (H3K27me3)


Peak Calling: Tools

• Huge number of tools available

• Many different statistical approaches

• The only important thing to remember: your results should look sensible and you must be consistent

• If in doubt, use MACS v2 or SPP
  - https://github.com/taoliu/MACS/
  - http://compbio.med.harvard.edu/Supplements/ChIP-seq/

Normalisation & QC

Normalisation & quality control

• What we expect to see:

• Assumptions:
  - Specific antibody
  - Perfect purification
  - Equal representation

• Reality:
  - Non-specific antibody binding
  - Unbound DNA being sequenced
  - Open chromatin bias, repetitive regions not aligned


Normalisation: input controls

• Typically run an input sample
  - Cross-linked DNA, but no ChIP step
  - Can use a non-nuclear antibody such as IgG
  - Same sample, same prep
  - Captures systematic biases (e.g. chromatin type, GC)

• Can use the data in multiple ways
  - Just determine regions to exclude
  - Subtraction normalisation
  - Typically used when calling peaks
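Subtraction normalisation can be illustrated with per-bin read counts. The sketch below assumes the ChIP and input samples have already been counted over the same genomic bins; scaling the input by the ratio of library sizes is one simple choice among several, and the function name is hypothetical.

```python
def subtract_input(chip_counts, input_counts):
    """Subtract library-size-scaled input counts from ChIP counts, bin by bin.

    Negative values are clipped to zero. Scaling by total read count is an
    assumed, simple normalisation; peak callers use more refined models.
    """
    if len(chip_counts) != len(input_counts):
        raise ValueError("ChIP and input must cover the same bins")
    input_total = sum(input_counts)
    if input_total == 0:
        raise ValueError("input sample has no reads")
    scale = sum(chip_counts) / input_total
    return [max(0.0, chip - scale * inp)
            for chip, inp in zip(chip_counts, input_counts)]


chip_bins = [10, 50, 10, 10]    # 80 reads in total
input_bins = [10, 10, 10, 10]   # 40 reads in total, so scale factor is 2
signal = subtract_input(chip_bins, input_bins)
```

Only the second bin keeps signal after subtraction, because the other bins contain no more reads than the scaled input predicts.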


Normalisation: Signal and Noise


• We will sequence lots of irrelevant stuff
  - What is signal and what is noise?

• Essentially, we’re looking for enrichment
  - Peak callers do a lot of this for you

• Most peak callers need an input sample
  - Some can use mappability and GC content instead


Quality Control: visualisation

• Visualising the data is quick and very helpful
  - UCSC / SeqMonk / IGV

• Fast impression of how the experiment has worked

• Not enough on its own


Quality Control: SeqMonk


Quality Control: Saturation Analysis

• If you sequence more reads, you’ll find more peaks

• If you’ve sequenced enough, you should be nearing a plateau

• Look into complexity of data
  - Preseq
  - SPP subsampling
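The idea of a plateau can be illustrated with an observed complexity curve: the number of distinct mapping positions seen as more reads are added. This is only a sketch of the concept. It is not the method used by Preseq (which extrapolates beyond the sequenced depth) or by SPP, and the function name and `steps` parameter are hypothetical.

```python
import random


def complexity_curve(positions, steps=5, seed=0):
    """Return [(reads_sampled, distinct_positions), ...] for nested subsamples.

    positions: one hashable mapping position per read,
               e.g. (chrom, start, strand).
    The reads are shuffled once and then consumed in order, so each point
    of the curve is a superset of the previous one.
    """
    shuffled = list(positions)
    random.Random(seed).shuffle(shuffled)
    step_size = max(1, len(shuffled) // steps)
    seen = set()
    curve = []
    for i, pos in enumerate(shuffled, start=1):
        seen.add(pos)
        if i % step_size == 0 or i == len(shuffled):
            curve.append((i, len(seen)))
    return curve


# Toy library: 100 reads drawn from only 20 distinct positions.
toy_positions = [("chr1", p % 20, "+") for p in range(100)]
curve = complexity_curve(toy_positions)
```

A curve that flattens early suggests that further sequencing would mostly yield duplicates.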


Quality Control: Strand Cross-Correlation


• Single-end sequencing should give a bimodal peak around binding sites on the two DNA strands

• Some peak callers use this to aid in region calling and for QC

• Can define NSC and RSC scores (Landt et al. 2012)
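Strand cross-correlation and the two scores can be sketched in a few lines. The code below assumes per-base counts of read 5' ends on each strand for a single region, and uses the definitions from Landt et al. 2012: NSC is the fragment-length cross-correlation divided by the minimum cross-correlation, and RSC is the fragment-length value minus the minimum, divided by the read-length ("phantom peak") value minus the minimum. Taking the fragment-length peak as the maximum at shifts larger than the read length is a simplifying assumption, and all function names are hypothetical.

```python
def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def cross_correlation(fwd, rev, max_shift):
    """Correlate forward-strand counts with reverse-strand counts shifted
    upstream by 0..max_shift bases. Returns one value per shift."""
    n = len(fwd)
    if len(rev) != n or max_shift >= n:
        raise ValueError("need equal-length strands longer than max_shift")
    return [pearson(fwd[: n - k], rev[k:]) for k in range(max_shift + 1)]


def nsc_rsc(cc, read_length):
    """Compute (NSC, RSC) from cross-correlation values indexed by shift."""
    if len(cc) <= read_length + 1:
        raise ValueError("cross-correlation must extend past the read length")
    background = min(cc)
    phantom = cc[read_length]             # read-length ("phantom") peak
    fragment = max(cc[read_length + 1:])  # assumed fragment-length peak
    if background <= 0 or phantom == background:
        raise ValueError("scores are undefined for these values")
    return fragment / background, (fragment - background) / (phantom - background)


# Toy region: reverse-strand tags sit 5 bp downstream of forward-strand tags.
fwd = [0] * 100
rev = [0] * 100
for p in (10, 30, 50, 70):
    fwd[p] = 1
    rev[p + 5] = 1
cc = cross_correlation(fwd, rev, max_shift=20)
best_shift = max(range(len(cc)), key=cc.__getitem__)
```

On real data the correlation is computed genome-wide and the best shift estimates the fragment length.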


Quality Control: Stats, stats, stats

• NSC and RSC
  - Normalised strand cross-correlation coefficient
  - Relative strand cross-correlation coefficient

• FRiP
  - Fraction of reads in peaks

• FDRs, IDRs, p-values of peaks
  - False discovery rates
  - Irreproducible discovery rates
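Of these statistics, FRiP is the simplest to compute by hand. The sketch below assumes each read is represented by a single position, that peaks are half-open intervals `[start, end)`, and that peaks on the same chromosome do not overlap. The function name is hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def frip(read_positions, peaks):
    """Fraction of reads whose position falls inside any peak.

    read_positions: iterable of (chrom, pos)
    peaks:          iterable of (chrom, start, end), half-open and
                    assumed non-overlapping within a chromosome
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in peaks:
        by_chrom[chrom].append((start, end))
    for intervals in by_chrom.values():
        intervals.sort()

    total = 0
    in_peaks = 0
    for chrom, pos in read_positions:
        total += 1
        intervals = by_chrom.get(chrom, [])
        # Index of the last peak starting at or before pos.
        idx = bisect_right(intervals, (pos, float("inf"))) - 1
        if idx >= 0 and intervals[idx][0] <= pos < intervals[idx][1]:
            in_peaks += 1
    return in_peaks / total if total else 0.0


peaks = [("chr1", 100, 200), ("chr1", 300, 400)]
reads = [("chr1", 150), ("chr1", 250), ("chr1", 399), ("chr2", 150)]
score = frip(reads, peaks)
```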


• Useful if you have a lot of samples
  - Allows benchmarking and identification of failed samples

• Don’t be overwhelmed by the acronyms

• Believe your eyes - if the data looks trustworthy, it probably is trustworthy

Analysis Pipelines

Bioinformatics Workflows

• Running all of these steps for many samples is repetitive
  - Difficult, dull, prone to errors

• Processing can be automated by a Workflow Manager
  - Also known as Pipeline Tools

• Execute processing steps for you, managing files and dependencies.


Cluster Flow

• Available on UPPMAX

• ChIP-seq pipeline
  - Runs QC, alignment, deduplication and generates coverage tracks / fingerprint plots
  - Written with J Westholm

#fastqc #bowtie1 #samtools_sort_index #bedtools_bamToBed #bedToNrf #picard_dedup #samtools_sort_index #phantompeaktools_runSpp #deeptools_bamCoverage #deeptools_bamFingerprint #bedtools_intersectNeg #samtools_sort_index

module load clusterflow

cf --setup

cf-uppmax --add_genomes

cf --genome GRCh37 chipseq_qc *.fq.gz


Nextflow

• Runs on UPPMAX

• Several pipelines built at NGI, including ChIP-seq
  - Still under development, could be a little buggy

• Also runs elsewhere. Docker coming soon.

curl -fsSL get.nextflow.io | bash

nextflow run SciLifeLab/NGI-ChIPseq \
    --project b2017123 \
    --reads '*_R{1,2}.fastq.gz' \
    --macsconfig 'macssetup.config' \
    --genome GRCh37


Bioinformatics Workflows

• These are great, but come with some caveats
  - Some setup is required
  - They don’t always work…
  - Results must be checked

• They are not a substitute for understanding the analysis steps

• You are still responsible for your results!

Downstream Analysis

Downstream Analysis: Annotation

• You have reads! Peaks! But where are they?
  - Co-ordinates are not helpful by themselves

• BEDTools
  - closest: distance to nearest genes
  - intersect: overlap with feature classes

• HOMER annotation

• SeqMonk average quantitation plots
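The "distance to nearest gene" idea can be shown with a simplified analogue of what BEDTools closest reports. The sketch assumes half-open intervals and a plain linear scan; the function name is hypothetical, and the distance convention here may differ slightly from the one BEDTools uses.

```python
def closest_gene(peak, genes):
    """Return (gene_name, distance) for the gene nearest to a peak.

    peak:  (chrom, start, end)
    genes: iterable of (chrom, start, end, name)
    Distance is 0 when the peak overlaps the gene. Returns None if the
    peak's chromosome has no genes.
    """
    chrom, p_start, p_end = peak
    best = None
    for g_chrom, g_start, g_end, name in genes:
        if g_chrom != chrom:
            continue
        if g_end <= p_start:      # gene lies entirely upstream of the peak
            dist = p_start - g_end
        elif g_start >= p_end:    # gene lies entirely downstream
            dist = g_start - p_end
        else:                     # intervals overlap
            dist = 0
        if best is None or dist < best[1]:
            best = (name, dist)
    return best


genes = [
    ("chr1", 1000, 2000, "geneA"),
    ("chr1", 5000, 6000, "geneB"),
    ("chr2", 100, 200, "geneC"),
]
nearest = closest_gene(("chr1", 2500, 2600), genes)
```

For genome-scale data, sorted intervals and a dedicated tool are far faster than this scan.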


• HOMER can annotate read intensities


• SeqMonk average quantitation plot across genes


• GO analysis is increasingly popular
  - Gene Ontology search

• Databases classify every gene with a restricted vocabulary

• Use your data to find if any GO terms are enriched

• Like peak callers, lots of software available
  - DAVID and GREAT are popular
  - Cytoscape good for visualisation
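Testing whether a GO term is enriched is often framed as a hypergeometric test on gene counts. The sketch below shows that generic test only. It is not the specific statistic implemented by DAVID or GREAT, it applies no multiple-testing correction, and the function and parameter names are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population, term_genes, selected, overlap):
    """P(X >= overlap) when drawing `selected` genes without replacement.

    population: number of annotated genes in the background
    term_genes: background genes annotated with the GO term
    selected:   genes in your list (e.g. genes near peaks)
    overlap:    genes in your list that carry the GO term
    """
    upper = min(term_genes, selected)
    total = comb(population, selected)
    tail = sum(
        comb(term_genes, i) * comb(population - term_genes, selected - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


# Toy example: 3 of 10 genes carry the term, and all 3 selected genes have it.
p_value = hypergeom_enrichment_p(population=10, term_genes=3,
                                 selected=3, overlap=3)
```

With many GO terms tested at once, the resulting p-values need correcting for multiple testing before interpretation.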


Downstream Analysis: Motif Searching

• Search peaks for enriched sequence motifs
  - Could indicate a TF binding motif
  - Interesting for new ChIP factors
  - Can be informative for co-operative binding

• HOMER is one of many tools to do this


Downstream Analysis: Differential binding

• May want to compare samples across conditions or time series

• Overlapping peaks is too simplistic

• DiffBind: R Bioconductor package
  - ChIP-seq equivalent of DESeq and edgeR
  - Extensive documentation and tutorials
  - http://bioconductor.org/packages/release/bioc/html/DiffBind.html

Conclusions

• There is no “correct way” to analyse ChIP-seq
  - Depends on biological system and question
  - Affected by number of samples and experimental setup
  - Defined by your experience and skills

• Two packages that do a lot of steps:
  - HOMER
  - SeqMonk
  - Lots of YouTube walk-through videos: https://youtu.be/LcMVb4zQBXI and https://youtu.be/Cy13yV6Rf6s


Further Reading

• Practical Guidelines for the Comprehensive Analysis of ChIP-seq Data
  - Bailey et al. PLOS Comp Bio (2013)

• ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia
  - Landt et al. Genome Research (2012)

• ChIP–seq: advantages and challenges of a maturing technology
  - Park. Nature Reviews Genetics (2009)

• http://seqanswers.com and http://biostars.org


Questions?

Slides: http://tiny.cc/chipseq

