Analysis of Next Generation Sequence Data

Post on 30-Jan-2016

The Genome Access Course

November 2011

Analysis of Next Generation Sequence Data

Illumina HiSeq2000: 600 Gbp (6 billion reads) in ~11 days

Typical Next Gen Experiments

• Genome sequencing
  – Novel genomes
  – Resequencing

• Transcriptome sequencing (RNA-seq)
  – Characterize transcripts with or without reference genome
    • Typical length
    • Short (microRNAs, …)
  – Find differentially expressed transcripts

• Other
  – Methyl-seq
  – ChIP-seq
  – RIP-seq
  – …

Types of Sequencing Libraries

Single-End Reads - 5’ or 3’ (random)

Paired-End Reads - 5’ and 3’ (200-500 bp apart)

Mate-Pair Reads - 5’ and 3’ (2-5 kbp apart)

What Does the Data Look Like?

FASTQ File Format

Sequence

Quality (ASCII character for each base)
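
A FASTQ record has four lines: an "@" header, the sequence, a "+" separator, and a quality string with one ASCII character per base. The sketch below shows one way to read such records and turn quality characters into Phred scores. The Phred+33 offset and the example record are assumptions for illustration; older Illumina pipelines used an offset of 64.

```python
import io

def read_fastq(handle):
    """Yield (name, sequence, quality_string) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:              # end of file
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()           # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed record near {header!r}")
        yield header[1:], seq, qual

def phred_scores(qual, offset=33):
    """Convert ASCII quality characters to integer Phred scores (offset is assumed)."""
    return [ord(ch) - offset for ch in qual]

# Hypothetical single-record example.
example = "@read1\nACGTN\n+\nIIII#\n"
records = list(read_fastq(io.StringIO(example)))
scores = phred_scores(records[0][2])
```

A Phred score Q corresponds to an error probability of 10^(-Q/10), so Q = 40 means roughly one error in 10,000 base calls.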

> 80 million reads in one lane

Quality Control Analysis of Reads

Trim Sequences Prior To Analysis

• Make sure sequencing adapters are removed

• Trim ends of sequence based on quality scores
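
The two trimming steps can be sketched in a few lines of Python. This is a simplified illustration, not the method of any particular trimming tool: it only finds exact adapter matches and trims from the 3' end. The adapter string, the threshold of 20, and the function names are assumptions.

```python
def trim_adapter(seq, adapter):
    """Cut the read at the first exact occurrence of the adapter, if any."""
    idx = seq.find(adapter)
    return seq if idx == -1 else seq[:idx]

def trim_low_quality_3prime(seq, quals, min_q=20):
    """Drop trailing bases whose Phred score is below min_q (assumed threshold)."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]

# Hypothetical read carrying an adapter at its 3' end.
clean = trim_adapter("ACGTACGTAGATCGGAAGAGC", "AGATCGGAAGAGC")
trimmed_seq, trimmed_quals = trim_low_quality_3prime("ACGTAC", [30, 30, 30, 30, 10, 5])
```

Real trimmers also handle partial adapter matches and sequencing errors inside the adapter, which this sketch ignores.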

Sequence Composition Diagnostics

Unbiased Reads

Biased Reads

First Position Nearly Always “T”
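
A composition diagnostic of this kind tabulates the fraction of each base at every read position. A minimal sketch, using a made-up set of reads in which the first base is always "T":

```python
from collections import Counter

def base_composition(reads):
    """Return, for each read position, a dict of base -> fraction of reads."""
    n_pos = max(len(r) for r in reads)
    counts = [Counter() for _ in range(n_pos)]
    for read in reads:
        for i, base in enumerate(read):
            counts[i][base] += 1
    return [
        {base: n / sum(col.values()) for base, n in col.items()}
        for col in counts
    ]

# Hypothetical biased reads: position 0 is always T, position 1 is balanced.
reads = ["TACG", "TCCG", "TGAA", "TTGC"]
composition = base_composition(reads)
```

In unbiased reads each position should show roughly the genome-wide base frequencies; a position dominated by one base points to a library or adapter artifact.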

Genome Sequencing

Workflows for Genome Sequencing

Novel Genome Sequencing

• de novo assembly
  – Generate contigs and scaffolds using overlapping reads

• If applicable, align reads from a sample back to consensus to examine variation
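
The idea of building contigs from overlapping reads can be illustrated with a toy suffix-prefix merge. This is only a sketch of the overlap concept, assuming error-free reads and a hypothetical minimum overlap of 3 bases; production assemblers use far more elaborate graph-based methods.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if too short)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def merge(a, b, min_len=3):
    """Merge two reads into one contig if they overlap, else return None."""
    k = overlap(a, b, min_len)
    return a + b[k:] if k else None

# Hypothetical reads sharing the 4-base overlap "TGCA".
contig = merge("ACGTTGCA", "TGCAGGT")
```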

Resequencing

• Align reads from a sample to a reference genome assembly to examine variation
  – BWA mapping software

Sequence Alignment/Map (SAM) Format

Common file format to store reads and their alignment to a reference sequence

Generated by most next gen analysis software

samtools software package
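
Each SAM alignment line is tab-delimited with eleven mandatory fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL) followed by optional tags. The sketch below parses one line and decodes two FLAG bits; the example alignment is invented for illustration, and in practice a library such as samtools would normally do this work.

```python
SAM_FIELDS = ["qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual"]

def parse_sam_line(line):
    """Parse one SAM alignment line (not a '@' header line) into a dict."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 11:
        raise ValueError("a SAM alignment line needs 11 mandatory fields")
    rec = dict(zip(SAM_FIELDS, cols[:11]))
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        rec[key] = int(rec[key])
    rec["is_unmapped"] = bool(rec["flag"] & 0x4)   # bit 0x4: read unmapped
    rec["is_reverse"] = bool(rec["flag"] & 0x10)   # bit 0x10: reverse strand
    rec["tags"] = cols[11:]                        # optional TAG:TYPE:VALUE fields
    return rec

# Hypothetical alignment of a 5-base read to the reverse strand of chr17.
line = "read1\t16\tchr17\t30700000\t60\t5M\t*\t0\t0\tACGTN\tIIII#"
rec = parse_sam_line(line)
```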

Binary Alignment/Map (BAM) Files

• SAM (text file) → BAM (binary file)
  – Not human-readable
  – Smaller file sizes

• BAM is widely used:
  – Often deposited to Gene Expression Omnibus (GEO) at NCBI
  – UCSC Genome Browser can display alignments as a track

UCSC Genome Browser with 1,000 Genomes Project Data

LookSeq at Sanger Mouse Genomes Project

Glo1 CNV Present in Mouse Genomes Data for A/J

Proximal Flank (Chr17: 30.5 Mb): max ~50x coverage

Glo1 Locus (Chr17: 30.7 Mb): max >100x coverage

Distal Flank (Chr17: 31.2 Mb): max ~50x coverage

Each panel spans 50 kb.

Glo1 CNV Not Present in Mouse Genomes Data for NZO

Proximal Flank (Chr17: 30.5 Mb): max ~25x coverage

Glo1 Locus (Chr17: 30.7 Mb): max ~25x coverage

Distal Flank (Chr17: 31.2 Mb): max ~25x coverage

Each panel spans 50 kb.
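
The contrast between the two strains can be expressed as a ratio of locus coverage to mean flanking coverage. The sketch below plugs in the approximate maxima from the two slides; since A/J is reported only as ">100x", the value 100 is used as a lower bound. The function name and the use of maximum rather than mean coverage are simplifying assumptions.

```python
def copy_ratio(locus_cov, flank_covs):
    """Coverage at the locus divided by the mean coverage of its flanks."""
    return locus_cov / (sum(flank_covs) / len(flank_covs))

# A/J: locus >100x (100 used as a lower bound), flanks ~50x.
aj_ratio = copy_ratio(100, [50, 50])
# NZO: locus and flanks all ~25x.
nzo_ratio = copy_ratio(25, [25, 25])
```

A ratio of about 2 or more for A/J is consistent with extra copies of the Glo1 region, while a ratio near 1 for NZO indicates no copy number change.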

RNA-seq Data Analysis

RNA-Seq

Reads are randomly sampled fragments from RNA sample

Proportion of reads for a transcript ≈ expression level of transcript

Lots of reads needed to construct models for every alternatively spliced transcript

Garber et al, Nat Methods (2011)

Experimental Design

Auer & Doerge Genetics (2010) 185: 405-416

Marioni et al, Genome Res (2008) 18(9):1509-17

Comparison of Affy and RNA-seq

Marioni et al, Genome Res (2008) 18(9):1509-17

Shendure Nat Methods (2008) 5(7): 585-7

Workflows for RNA-seq

Novel Transcriptome Sequencing

• de novo assembly

• Align reads from each sample/group to assembly
  – Statistics for each transcript contig

Transcriptome Sequencing with Reference Genome

• Align reads from each sample/group to genome
  – Statistics for each transcript model
  – Examine isoforms

QC Reads (both workflows)

Analyze Counts (both workflows)

de novo Transcriptome Assembly

Rarefaction Plot

How much sequencing is enough?
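
A rarefaction curve addresses this question by counting how many distinct transcripts are detected as more reads are sampled; when the curve flattens, extra sequencing finds little that is new. A minimal sketch, assuming each read has already been assigned to a transcript (the labels and depths below are made up):

```python
import random

def rarefaction_curve(read_to_transcript, depths, seed=0):
    """Return (depth, distinct transcripts seen) pairs for nested random subsamples."""
    reads = list(read_to_transcript)
    random.Random(seed).shuffle(reads)      # one shuffle, then take prefixes
    curve = []
    for depth in depths:
        detected = set(reads[:depth])
        curve.append((depth, len(detected)))
    return curve

# Hypothetical assignments: one abundant, one moderate, one rare transcript.
assignments = ["tA"] * 6 + ["tB"] * 3 + ["tC"]
curve = rarefaction_curve(assignments, depths=[2, 5, 10])
```

Because the subsamples are nested prefixes of a single shuffle, the detected count can only stay flat or rise as depth increases.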

Mapping Reads

Align reads to a reference
  – Genome assembly
  – Transcriptome assembly

Commonly used aligners:
  – bwa
  – bowtie

RNAseq Workflow With Reference Genome

Langmead et al. Genome Biology (2010), 11:R83

Map Reads & Obtain Read Counts Per Gene

Both utilize a reference genome
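
Once reads are mapped to the reference genome, per-gene counts come from asking which gene interval each alignment falls in. The sketch below uses invented gene coordinates and alignment start positions, and counts a read if its start lies within the gene; real counting tools also consider strand, read length, and exon structure.

```python
from collections import Counter

def count_reads_per_gene(alignments, genes):
    """Count alignments whose start position lies inside each gene interval.

    alignments: iterable of (chrom, pos); genes: name -> (chrom, start, end).
    """
    counts = Counter({name: 0 for name in genes})
    for chrom, pos in alignments:
        for name, (g_chrom, start, end) in genes.items():
            if chrom == g_chrom and start <= pos <= end:
                counts[name] += 1
    return dict(counts)

# Hypothetical annotation and alignments.
genes = {"geneA": ("chr1", 100, 200), "geneB": ("chr1", 500, 800)}
alignments = [("chr1", 150), ("chr1", 199), ("chr1", 600),
              ("chr2", 150), ("chr1", 300)]
gene_counts = count_reads_per_gene(alignments, genes)
```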

Bowtie/TopHat

Trapnell, Pachter, Salzberg. Bioinformatics (2009) 25(9):1105-1111

Bowtie uses Burrows-Wheeler indexing for rapid mapping

TopHat uses Initially Un-Mapped (IUM) reads to find novel splice sites
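
To give a feel for Burrows-Wheeler indexing, the sketch below builds the Burrows-Wheeler transform of a short text by sorting its rotations and then counts exact pattern occurrences with a backward search. It is a teaching sketch, not Bowtie's implementation: real indexes avoid materializing rotations and precompute rank tables instead of calling `count` on string slices.

```python
from collections import Counter

def bwt(text):
    """Burrows-Wheeler transform of text, with '$' appended as end marker."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)

def count_occurrences(bw, pattern):
    """Count exact matches of pattern in the original text using backward search."""
    totals = Counter(bw)
    first = {}                  # first[c] = number of characters smaller than c
    running = 0
    for ch in sorted(totals):
        first[ch] = running
        running += totals[ch]
    lo, hi = 0, len(bw)         # current range of sorted rotations
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        lo = first[ch] + bw[:lo].count(ch)
        hi = first[ch] + bw[:hi].count(ch)
        if lo >= hi:
            return 0
    return hi - lo

transformed = bwt("banana")
hits = count_occurrences(transformed, "ana")
```

The search touches the pattern one character at a time, right to left, so its cost depends on the read length rather than on the genome size, which is what makes this style of index fast for mapping.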

Cufflinks

FPKM = Fragments Per Kilobase of transcript per Million fragments mapped

Trapnell et al. Nature Biotech (2010) 28(5):511-515
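
Written out, the FPKM definition is fragments mapped to the transcript, divided by the transcript length in kilobases, divided by the total mapped fragments in millions. A small worked sketch with made-up numbers (the function and argument names are illustrative, not Cufflinks' API):

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments Per Kilobase of transcript per Million fragments mapped."""
    kilobases = transcript_length_bp / 1_000
    millions = total_mapped_fragments / 1_000_000
    return fragments / kilobases / millions

# Hypothetical transcript: 500 fragments, 2,000 bp long, 10 million fragments mapped.
value = fpkm(500, 2_000, 10_000_000)   # 500 / 2 kb / 10 M = 25.0
```

Dividing by length and by sequencing depth makes values comparable between transcripts of different sizes and between runs of different depth.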

Galaxy

Can be used to upload FASTQ files and then run a number of QC tools and many other tools:

  – bwa
  – bowtie
  – tophat
  – cufflinks
  – …

Third Generation Sequencing