Page 1: RNA-Seq / ChIP-Seq Analysis Workflow

RNA-Seq / ChIP-Seq Analysis Workflow

RNA-Seq / ChIP-Seq Data Analysis Workshop
10 September 2012
CSC, Helsinki

Nicolas Delhomme

Page 2: RNA-Seq / ChIP-Seq Analysis Workflow

Outline

• Introducing ourselves

• Introduction to the topic
  • The technology
  • Technologies
  • Field of application
  • Issues

• Analyses Workflows
  • RNA-Seq
  • ChIP-Seq

Page 3: RNA-Seq / ChIP-Seq Analysis Workflow

Ângela Gonçalves, Bioinformatician

• Computer scientist by training

• Master thesis in artificial intelligence at the University of Coimbra, Portugal

• Trainee at the European Space Agency’s outstation in Italy

• PhD at the European Bioinformatics Institute (EMBL-Cambridge)

• analysing RNA-seq for the study of gene expression regulation in mice

• author of the ArrayExpressHTS package in Bioconductor - a pipeline for analysing RNA-seq data

Page 4: RNA-Seq / ChIP-Seq Analysis Workflow

Myself

• Nicolas Delhomme: Bio-informatician/Bio-statistician

• Geneticist by training

• Master Thesis in Bioinformatics (2001), 6 months internship at LION Bioscience - a biotech company, Heidelberg, Germany

• Worked as a software engineer at LION Bioscience, HD until 2004

• PhD Thesis at the DKFZ (German Cancer Research Center), HD, until 2009 on integrative and comparative analysis of micro-array data

Page 5: RNA-Seq / ChIP-Seq Analysis Workflow

• Technical Officer at the EMBL (European Molecular Biology Laboratory), HD, until 2012
  • establishing a pipeline for Next-Generation Sequencing data (from the sequencer to the analysis (Galaxy))
  • developing tools for Galaxy and Bioconductor (package easyRNASeq, manuscript in press)
  • performing analyses on RNA-Seq, ChIP-Seq and DNA-Seq data
  • assembling de-novo genome (yeast) and transcriptome (fruitfly)

• Post-doc at the UPSC (Umeå Plant Science Center), Umeå, Sweden
  • Working on RNA-Seq and assembly of one of the largest genomes (Spruce, 20 Gb)

Page 6: RNA-Seq / ChIP-Seq Analysis Workflow

Technologies

• Platforms
  • Roche 454
  • Illumina Genome Analyzer
  • ABI SOLiD

• 3rd Generation:
  • PacBio, Ion Torrent, Complete Genomics, Oxford Nanopore, ...

• Most data currently comes from Illumina Genome Analyzer or HiSeq machines
  • most tools are developed for that platform

Page 7: RNA-Seq / ChIP-Seq Analysis Workflow

Important Issues

• New technology

• “expensive”
  • no or too few replicates

• experimental design and protocols
  • unknown/not understood biases
  • sample preparation artifacts

• data properties not fully understood
  • Poisson distribution
  • Negative Binomial distribution

• pre-processing in its infancy
  • alignment biases
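
The Poisson versus Negative Binomial question can be made concrete with a small sketch. Under a Poisson model the variance of the counts equals their mean; the Negative Binomial adds a dispersion term, so that variance = mean + dispersion * mean^2. The function name `dispersion_estimate` and the replicate counts below are hypothetical, chosen only for illustration; this is a simple method-of-moments estimate, not the procedure of any particular package.

```python
from statistics import mean, variance


def dispersion_estimate(counts):
    """Method-of-moments estimate of the Negative Binomial dispersion.

    Poisson:            var = mu
    Negative Binomial:  var = mu + alpha * mu**2
    Solving for alpha gives (var - mu) / mu**2, floored at 0.
    """
    mu = mean(counts)
    var = variance(counts)  # unbiased sample variance
    if mu == 0:
        return 0.0
    return max(0.0, (var - mu) / mu ** 2)


# Hypothetical read counts for one gene across four replicates.
replicate_counts = [90, 110, 150, 50]
mu = mean(replicate_counts)
var = variance(replicate_counts)
alpha = dispersion_estimate(replicate_counts)
print(f"mean={mu}, variance={var:.1f}, dispersion={alpha:.3f}")
```

A variance far above the mean, as in this made-up example, is the over-dispersion that a Poisson model cannot describe. With no or too few replicates, the variance (and therefore the dispersion) cannot be estimated reliably at all.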

Page 8: RNA-Seq / ChIP-Seq Analysis Workflow

Typical NGS experiments

• ChIP-seq
  • Transcription Factor Binding Site, very localized
  • Histone modifications: very variable

• RNA-seq
  • Annotation based
  • Alternative splicing
  • de-novo Transcriptome
  • Absolute versus Differential Expression

• Sequencing
  • Re-sequencing
  • de-novo sequencing
  • Metagenomics

Page 9: RNA-Seq / ChIP-Seq Analysis Workflow

A typical workflow

• Quality Assessment

• Read alignment

• Pre-processing

• Getting annotation of interest

• Determining the count table / peaks

• Exporting/Visualizing the results
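
As an illustration of the "count table" step, here is a minimal sketch that counts aligned reads per gene by interval overlap. The function `count_reads`, the gene names and all coordinates are hypothetical; dedicated tools use indexed data structures and more refined rules for ambiguous reads.

```python
def count_reads(genes, reads):
    """Count reads overlapping each gene.

    genes: dict gene_id -> (chrom, start, end), 1-based inclusive coordinates
    reads: iterable of (chrom, start, end) aligned read intervals
    A read overlapping more than one gene is treated as ambiguous and skipped.
    """
    counts = {gene_id: 0 for gene_id in genes}
    for chrom, r_start, r_end in reads:
        hits = [
            gene_id
            for gene_id, (g_chrom, g_start, g_end) in genes.items()
            if g_chrom == chrom and r_start <= g_end and r_end >= g_start
        ]
        if len(hits) == 1:
            counts[hits[0]] += 1
    return counts


# Hypothetical annotation and alignments.
genes = {"geneA": ("chr1", 100, 200), "geneB": ("chr1", 180, 300)}
reads = [
    ("chr1", 90, 125),   # overlaps geneA only
    ("chr1", 170, 190),  # overlaps both genes: ambiguous, not counted
    ("chr1", 250, 285),  # overlaps geneB only
    ("chr2", 100, 135),  # no annotated gene
]
print(count_reads(genes, reads))
```

Skipping ambiguous reads is only one possible design choice; how such reads are handled changes the resulting count table and should be stated with the results.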

Page 10: RNA-Seq / ChIP-Seq Analysis Workflow

Quality Assessment (day 1: Nicolas and Ângela)

• number of reads
• percentage ATGC
• quality per cycle
• quality variation per cycle
• base per cycle
• tile performance
• tile quality
• alignment quality
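
Two of these metrics, quality per cycle and base per cycle, can be sketched in a few lines. The function `per_cycle_stats` and the two toy reads are hypothetical, and the sketch assumes reads of equal length with Sanger (+33) encoded qualities.

```python
from collections import Counter


def per_cycle_stats(records, offset=33):
    """Compute mean quality and base counts for every sequencing cycle.

    records: list of (sequence, quality_string) pairs of equal length
    offset:  ASCII offset of the quality encoding (33 assumed here)
    """
    n_cycles = len(records[0][0])
    qual_sums = [0] * n_cycles
    base_counts = [Counter() for _ in range(n_cycles)]
    for seq, qual in records:
        for cycle, (base, q_char) in enumerate(zip(seq, qual)):
            qual_sums[cycle] += ord(q_char) - offset
            base_counts[cycle][base] += 1
    mean_quals = [total / len(records) for total in qual_sums]
    return mean_quals, base_counts


# Two made-up reads of four cycles each.
records = [("ACGT", "IIII"), ("ACGA", "II5!")]
mean_quals, base_counts = per_cycle_stats(records)
for cycle, (quality, bases) in enumerate(zip(mean_quals, base_counts), start=1):
    print(cycle, quality, dict(bases))
```

A drop in mean quality or a shift in base composition towards the later cycles is the kind of pattern these plots are meant to reveal.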

Page 11: RNA-Seq / ChIP-Seq Analysis Workflow

Sequence alignment (day 1: Olli-Pekka and Kui)

• Two main approaches:

  • based on hash table
    • spaced seeds

  • based on suffix/prefix tries
    • Burrows-Wheeler transform (BWT)

• Reviewed in Li and Homer: A survey of sequence alignment algorithms for next-generation sequencing. Briefings in Bioinformatics (2010)
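
To show what the BWT-based approach does, here is a toy sketch of the transform and of exact matching by backward search. The function names and the reference string are hypothetical. Real aligners build the index with suffix arrays, precompute the occurrence table, and extend the search to mismatches and gaps; none of that is attempted here.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (fine for toy inputs)."""
    text += "$"  # sentinel, sorts before A, C, G, T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_string, pattern):
    """Count exact matches of pattern using backward search on the BWT."""
    # first_pos[c]: number of characters in the text strictly smaller than c
    first_pos = {}
    for i, c in enumerate(sorted(bwt_string)):
        first_pos.setdefault(c, i)

    def occ(c, i):
        # occurrences of c in bwt_string[:i]; a real index precomputes this
        return bwt_string[:i].count(c)

    low, high = 0, len(bwt_string)  # half-open range of matching rotations
    for c in reversed(pattern):
        if c not in first_pos:
            return 0
        low = first_pos[c] + occ(c, low)
        high = first_pos[c] + occ(c, high)
        if low >= high:
            return 0
    return high - low


reference = "ACGTACGTGACG"  # made-up reference
index = bwt(reference)
for read in ("ACG", "GTG", "TTT"):
    print(read, count_occurrences(index, read))
```

The pattern is consumed from its last base to its first, and each step narrows a range of sorted rotations, which is why the search time depends on the read length rather than on the genome size.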

Page 12: RNA-Seq / ChIP-Seq Analysis Workflow

Aligners (continued)

• 20 aligners published in the last 2 years

• Most deal with short reads

• some of those with ABI specific “color-space”

• A large scale study comparing them is underway:
  • GSNAP (http://research-pub.gene.com/gmap/) is the most efficient so far (personal communication, Paul Bertone, EBI)

Page 13: RNA-Seq / ChIP-Seq Analysis Workflow

Recent developments

• gapped alignment
  • Recent aligners are able to perform gapped alignments
  • small indels
  • no splicing events with large introns
  • BWA, Novoalign

• bisulfite sequencing
  • unmethylated C are converted to T (G complement converted to A)
  • 2 references
    • one with all C converted to T
    • one with all G converted to A
  • C-T mismatch or G-A mismatch are ignored
  • results from both alignments are combined
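
The two-reference strategy for bisulfite reads can be sketched as follows. The helper names `find_all` and `bisulfite_align` and the sequences are hypothetical, and exact string matching stands in for a real aligner. Converting both the reference and the read in silico is what makes C-T (or G-A) mismatches disappear.

```python
def find_all(haystack, needle):
    """Yield every start index of needle in haystack (exact match)."""
    start = haystack.find(needle)
    while start != -1:
        yield start
        start = haystack.find(needle, start + 1)


def bisulfite_align(reference, read):
    """Match a read against the C->T and the G->A converted references.

    Returns a sorted list of (0-based position, conversion label) hits,
    i.e. the combined results of both alignments.
    """
    reference, read = reference.upper(), read.upper()
    hits = []
    for label, src, dst in (("C->T", "C", "T"), ("G->A", "G", "A")):
        converted_ref = reference.replace(src, dst)
        converted_read = read.replace(src, dst)
        for pos in find_all(converted_ref, converted_read):
            hits.append((pos, label))
    return sorted(hits)


reference = "ACGTCCGATCGA"  # made-up reference
# Made-up read from position 3 (TCCGAT): the first C was unmethylated and
# reads as T, the second C was methylated and stayed C.
read = "TTCGAT"
print(bisulfite_align(reference, read))
```

Once a hit is found, comparing the untouched read with the untouched reference at that position tells which C were converted (unmethylated) and which were protected (methylated).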

Page 14: RNA-Seq / ChIP-Seq Analysis Workflow

Formats

• export
  • ID run lane tile x y PE mate seq qual alignment info chastity
  • HWI-EAS225 37 3 1 1080 935 0 1 NAGAGCAAC... BBBBBBBBBB... NM N
  • HWI-EAS225 37 3 1 1218 936 0 1 NGCTGCATT... BBBBBBBBBB... chr5 6827036 F A30C44 39 Y

• fastq
  • first line: @ID; second line: sequence; third line: +(ID); fourth line: quality (different encodings)
  • @HWI-ST169_0186:4:1:1369:1932#0/1
  • NCCTAACGACGTTTGGTCAGTTCCATCAACATCATAGCC...
  • +HWI-ST169_0186:4:1:1369:1932#0/1
  • BUWUX[WVWVccccc_\ccccccccccacccc_cc^V^^V[[][[^^X^B...

• SAM (BAM)
  • ID flag alignment info seq qual additional info
  • HWI-EAS225_37:3:108:4047:5812#0/1 0 ChrC 1 255 76M * 0 0 ATG... III... XA:i:1 MD:Z:44T31 NM:i:1

• BAM is the compressed binary version of SAM. The specification is available at: http://samtools.sourceforge.net/SAM-1.3.pdf
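
As a reading aid for the SAM layout, here is a minimal parser sketch. SAM lines are tab-separated with 11 mandatory fields followed by optional TAG:TYPE:VALUE fields. The function `parse_sam_line` and the shortened toy record are hypothetical; for real data a dedicated library is the safer choice.

```python
SAM_FIELDS = ("qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual")


def parse_sam_line(line):
    """Split one SAM alignment line into a dictionary."""
    columns = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, columns[:11]))
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(record[key])
    # Optional fields have the form TAG:TYPE:VALUE; only type "i" is converted.
    tags = {}
    for field in columns[11:]:
        tag, type_code, value = field.split(":", 2)
        tags[tag] = int(value) if type_code == "i" else value
    record["tags"] = tags
    # The flag is a bit field: 0x4 = read unmapped, 0x10 = reverse strand.
    record["is_unmapped"] = bool(record["flag"] & 0x4)
    record["is_reverse"] = bool(record["flag"] & 0x10)
    return record


# Toy record, shortened to three bases for readability.
line = "read1\t0\tChrC\t1\t255\t3M\t*\t0\t0\tATG\tIII\tNM:i:1\tMD:Z:1T1"
record = parse_sam_line(line)
print(record["rname"], record["pos"], record["cigar"], record["tags"])
```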

Page 15: RNA-Seq / ChIP-Seq Analysis Workflow

Formats (continued)

• In May 2011, Illumina released a new version of its analysis software CASAVA 1.8.

• In that version the default output format is BAM.

• In addition, it abandons the Illumina specific quality scale (offset of 64 for the ASCII encoding) for the standard Sanger one: +33 offset.

• Note that CASAVA still generates export files for backward compatibility and that these still use the +64 offset.
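
The difference between the two offsets is a fixed shift of the ASCII codes, which a short sketch makes explicit. The function names are hypothetical, and the input is assumed to hold Phred scores encoded with the +64 offset.

```python
def decode_qualities(quality, offset):
    """Turn an ASCII quality string into a list of Phred scores."""
    return [ord(ch) - offset for ch in quality]


def phred64_to_phred33(quality):
    """Re-encode an Illumina (+64) quality string with the Sanger (+33) offset."""
    return "".join(chr(ord(ch) - 64 + 33) for ch in quality)


illumina_quality = "Bch"  # made-up string: Phred 2, 35 and 40 at offset 64
sanger_quality = phred64_to_phred33(illumina_quality)
print(sanger_quality)
print(decode_qualities(illumina_quality, 64))
print(decode_qualities(sanger_quality, 33))
```

Decoding a +64 file with a +33 offset inflates every score by 31, so the encoding should be checked before any quality-based filtering.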

Page 16: RNA-Seq / ChIP-Seq Analysis Workflow

Annotation

• External
  • Ensembl gtf, UCSC gff files

• Bioconductor
  • Genome coordinate / gene (and other) relationships
    • GenomicFeatures, ChIPpeakAnno
  • Gene ontology / pathway
    • goseq
  • HapMap, 1000 genomes, UCSC, SRA, ENA, GEO, ArrayExpress
    • rtracklayer, biomaRt, Rsamtools, GEOquery, SRAdb
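
For readers who have not opened a GTF file before, here is a minimal parsing sketch. A GTF line has nine tab-separated columns, the last one holding `key "value";` attribute pairs. The function `parse_gtf_line` and the example line are hypothetical, and the attribute parsing is simplified: it assumes no semicolon occurs inside a quoted value.

```python
def parse_gtf_line(line):
    """Parse one GTF feature line into a dictionary."""
    (seqname, source, feature, start, end,
     score, strand, frame, attributes) = line.rstrip("\n").split("\t")
    attrs = {}
    for item in attributes.split(";"):
        item = item.strip()
        if item:
            key, value = item.split(" ", 1)
            attrs[key] = value.strip().strip('"')
    return {
        "seqname": seqname,
        "feature": feature,
        "start": int(start),  # 1-based, inclusive
        "end": int(end),
        "strand": strand,
        "attributes": attrs,
    }


# Made-up annotation line.
line = 'chr1\texample\texon\t100\t200\t.\t+\t.\tgene_id "geneA"; transcript_id "geneA.1";'
exon = parse_gtf_line(line)
print(exon["feature"], exon["start"], exon["end"], exon["attributes"]["gene_id"])
```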

Page 17: RNA-Seq / ChIP-Seq Analysis Workflow

Analysis (day 1-3)

• Domain specific

• ChIP-seq:

  • chipseq, ChIPseqR, CSAR, BayesPeak

• Differential expression:
  • DESeq, edgeR, baySeq

• RNA-seq:
  • Genominator

• Examples:
  • EatonEtAlChIPseq, leeBamViews

Page 18: RNA-Seq / ChIP-Seq Analysis Workflow

RNA-Seq Workflow (day 1-2 / Nicolas and Ângela)

[Figure: RNA-Seq workflow diagram with the steps QA, Alignment, Read Counts, SNP/Indel, Diff. Exp. and eQTL]

Page 19: RNA-Seq / ChIP-Seq Analysis Workflow

• Not all that glitters is gold!!!

Page 20: RNA-Seq / ChIP-Seq Analysis Workflow

Bias per cycle base call

[Figure: base call bias per cycle, panels RNA-Seq and DNA-Seq; label: Illumina]

Page 22: RNA-Seq / ChIP-Seq Analysis Workflow

Correction

• Numerous publications

• Hansen et al. Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Research (2010) pp.

• Zheng et al. Bias detection and correction in RNA-Sequencing data. BMC Bioinformatics (2011) vol. 12 (1) pp. 290

• But no proper evaluation/comparison of these

Page 23: RNA-Seq / ChIP-Seq Analysis Workflow

Another caveat: what reference?

• How close is your sample’s genome to the published reference genome?

• Specific kind of data, such as RNA-Seq:
  • genome or transcriptome?
  • what about novel exon-exon junctions?

Page 24: RNA-Seq / ChIP-Seq Analysis Workflow

Reference modification

Page 26: RNA-Seq / ChIP-Seq Analysis Workflow

What they did

• Compare RNA and DNA from matched samples
• observe numerous events where RNA != DNA
• process known as RNA editing
• known in human:
  • an enzyme converts A into I (inosine), which is recognized as a G during translation
  • another, less frequently observed event from another enzyme:
    • C -> U

• BUT they observe all possible conversions!
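
The comparison described above amounts to tabulating RNA-DNA differences by conversion type. Here is a minimal sketch, assuming two already aligned, equal-length, same-strand sequences; the function name and both sequences are hypothetical. In sequenced cDNA, A-to-I editing shows up as A>G and C-to-U editing as C>T.

```python
from collections import Counter

# Conversions expected from the two known editing mechanisms,
# as they appear in sequenced cDNA (I is read as G, U as T).
KNOWN_EDITING = {"A>G", "C>T"}


def tabulate_differences(dna, rna):
    """Count DNA>RNA base differences at aligned positions."""
    if len(dna) != len(rna):
        raise ValueError("sequences must be aligned and of equal length")
    return Counter(
        f"{d}>{r}" for d, r in zip(dna.upper(), rna.upper()) if d != r
    )


dna = "ACGTACGTAC"  # made-up genomic sequence
rna = "GCGTATGTAA"  # made-up cDNA sequence from the matched sample
differences = tabulate_differences(dna, rna)
unexpected = {k: v for k, v in differences.items() if k not in KNOWN_EDITING}
print(dict(differences))
print(unexpected)
```

Conversions falling outside the two known types are the ones that call for a technical explanation (sequencing errors, misalignment, paralogs) before a biological one.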

Page 27: RNA-Seq / ChIP-Seq Analysis Workflow

What might be

• They use reads aligning uniquely to the genome.

• The main point can be summarized like this: RNA editing involves the production of two different RNA and/or protein sequences from a single DNA sequence. To infer RNA editing from the presence of two different RNA and/or protein sequences, then, one must be very sure that they derive from the same DNA sequence, rather than from two different copies of the DNA (due to, for example, paralogs or copy number variants).

Page 29: RNA-Seq / ChIP-Seq Analysis Workflow

Always challenge your results...

[Figure: RNA-Seq workflow diagram with warning marks. Steps: QA, Alignment, SNP/Indel, Diff. Exp., Read Counts. Tool labels: samtools, GATK; tophat/cufflinks, RSEM, edgeR, DESeq, DEXSeq; HTSeq, easyRNASeq]

Page 30: RNA-Seq / ChIP-Seq Analysis Workflow

Acknowledgments

• for the organization of the course
  • Ari Löytynoja
  • Gabriella Rustici

• for being willing to participate
  • You
