Date post: 15-Jan-2016
Introduction to Short Read Sequencing Analysis

James Knight (with many slides adapted from Jim Noonan)

GENE 760

Rule to remember when using bioinformatics tools

• Computer scientists and mathematicians/statisticians make simplifications to speed up computations or to create a robust statistical model

– So, every tool solves an approximation of the problem

– The answers you get will largely be correct, but must be taken with a grain of salt
  • What might the tools be missing?
  • What quality/confidence score values can be believed?
  • Do the downstream steps of the pipeline adjust for the simplifications?

Sequence read lengths remain limiting

• For most applications reads are aligned to a reference genome
• Short reads contain inherently limited information
• De novo assembly of short reads is difficult

[Figure: Chr1 (249 Mb) compared with a hypothetical 249 Mb sequencing read]

Current platforms:
• Illumina: A very large number (2 billion) of short reads (75-125 bp)
• PacBio: A moderate number (~500,000) of long reads (~10 kb)

Determining the identity and location of short sequence reads in the genome/exome/transcriptome

@HWI-ST974:58:C059FACXX:2:1201:10589:110434 1:N:0:TGACCA
TGCACACTGAAGGACCTGGAATATGGCGAGAAAACTGAAAATCATGGAAAATGAGAAATACACACTTTAGGACGTG

Need a computationally efficient method to perform accurate alignments of millions of reads

Aligning short reads to much larger reference

Determining the identity and location of short sequence reads in the genome/exome/transcriptome

• Fundamental simplification
– Represent DNA molecules as strings (or sequences)
– 4 character alphabet, plus N

• Algorithmic simplification
– Finding exact matches is quickest

    TAGATTAC
    ||||||||
    TAGATTAC

– Comparing strings is slower

    TAGATTACTCAGA
    |||||||| ||||
    TAGATTACACAGA

– “Gapped” alignment is very slow

    TAGATTACTCAGA-TAC
    |||||||| |||| |||
    TAGATTACACAGATTAC

First attempt: A “telephone directory”

• Suppose input is 100 bp reads

• Create “telephone directory” of 100 bp sections of genome
– Sorted list of sequences, with locations
– <Sequence, chromosome, start position>

• For each read, look it up in the directory to find genome location(s) for the read

• This solution does not work
– Lookup might be slow for hundreds of millions of reads
– Sequencing errors and variation prevent correct lookup
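
The directory idea, and why it fails, can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: it uses a hash table (`dict`) instead of a sorted list, 8 bp "reads" instead of 100 bp, and a made-up chromosome name `chr_toy`; none of these come from the slides.

```python
from collections import defaultdict

def build_directory(chromosomes, read_len):
    """Map each read_len-long sequence to its (chromosome, start position) entries."""
    directory = defaultdict(list)
    for name, sequence in chromosomes.items():
        for start in range(len(sequence) - read_len + 1):
            directory[sequence[start:start + read_len]].append((name, start))
    return directory

# Toy data: one made-up "chromosome" and 8 bp reads instead of 100 bp.
chromosomes = {"chr_toy": "TAGATTACACAGATTACGGA"}
directory = build_directory(chromosomes, read_len=8)

exact_read = "ACAGATTA"   # copied from the reference without errors
error_read = "ACAGTTTA"   # the same read with one substituted base

exact_hits = directory.get(exact_read, [])   # found at its true location
error_hits = directory.get(error_read, [])   # a single error defeats the lookup
```

The error-free read is found, but the read with one substitution returns nothing, which is the second failure mode listed above.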

Older (and Newer) Algorithms: Seed and Extend

• Find shorter exact matches, called seeds
– Long enough to be mostly unique (20-25 bp for human)
– Short enough to find exact matches for “all” reads

• Encode seeds as integer values
– 2-bit encoding, A=00 C=01 G=10 T=11
– 32 nt seed fits within 64 bit integer
– Exact seed match when integer values the same

• Create “telephone directory” for reference, called an index
– Choose forward strand only or forward and reverse strands?
– Store every seed, or spaced seeds (every X nts)?
– Newer algorithms create Burrows-Wheeler Transform index
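
A minimal sketch of the 2-bit seed encoding, assuming fixed-length seeds (seeds of different lengths could otherwise map to the same integer). The function names are illustrative; bases outside A/C/G/T (such as N) are simply rejected here.

```python
ENCODING = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def encode_seed(seed):
    """Pack a seed of up to 32 nt into one integer, 2 bits per base."""
    if len(seed) > 32:
        raise ValueError("a 64-bit integer holds at most 32 bases")
    value = 0
    for base in seed:
        value = (value << 2) | ENCODING[base]   # KeyError on N or other symbols
    return value

def rolling_seeds(sequence, k):
    """Yield (position, encoded seed) for every k-mer with one shift per base."""
    mask = (1 << (2 * k)) - 1                   # keeps only the last k bases
    value = 0
    for i, base in enumerate(sequence):
        value = ((value << 2) | ENCODING[base]) & mask
        if i >= k - 1:
            yield i - k + 1, value
```

Two seeds of the same length match exactly when their integers are equal, so a seed comparison becomes a single integer comparison.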

Reference genome indexing using Burrows-Wheeler transform

Trapnell and Salzberg, Nat Biotechnology 27:455 (2009)

• Reversible encoding scheme
• All locations of every suffix stored in a contiguous region of the index
• For any suffix, transform table can find region when adding 1 nt prefix
• Lookups become very fast
– Independent of reference size
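
The "add 1 nt prefix" step can be illustrated with a toy backward search. This sketch builds the transform naively by sorting all suffixes, which is only workable for tiny inputs; real aligners use far more compact constructions. The names `build_fm_index` and `find_exact` are illustrative.

```python
def build_fm_index(reference):
    """Naive BWT construction by sorting all suffixes (toy inputs only)."""
    text = reference + "$"                       # '$' sorts before A, C, G, T
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in suffix_array)
    # first[c]: number of characters in the text that sort before c
    first, total = {}, 0
    for c in sorted(set(text)):
        first[c] = total
        total += text.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in first}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return suffix_array, first, occ

def find_exact(pattern, index):
    """Return sorted reference positions where pattern occurs exactly."""
    suffix_array, first, occ = index
    lo, hi = 0, len(suffix_array)                # half-open range of sorted suffixes
    for c in reversed(pattern):                  # prepend one base per step
        if c not in first:
            return []
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(suffix_array[lo:hi])

index = build_fm_index("TAGATTACACAGATTAC")
gattac_positions = find_exact("GATTAC", index)
```

Each loop iteration narrows one contiguous range of the index, so the number of steps depends on the pattern length rather than on the reference size.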

Older (and Newer) Algorithms: Seed and Extend

• Look up the seeds occurring in each read
– Forward strand only or forward and reverse?
– Every seed or spaced seeds?

• Extend seeds to calculate alignment
– Ungapped alignment just comparing strings
– Gapped alignment using dynamic programming

Ref:  TGCACACTGAAGGACCTGGAATATGGCGAGAAAACTGAAAATCATGGAAAATGAGAAATACACACTTTAGGACGTG
Read: TGCACACTGAAGGTCCTGGAATATGGCGAGAAAACTGAAAATCATGGAAA--GAGAAATACACACTTTAGGACGTG
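
A minimal seed-and-extend sketch, assuming a plain dictionary index, forward strand only, and ungapped extension (gapped extension would need dynamic programming, which is not shown). The seed length, seed spacing and mismatch limit are illustrative values, not any aligner's defaults.

```python
from collections import defaultdict

def index_seeds(reference, k):
    """Index every k-mer of the reference by its start position."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def seed_and_extend(read, reference, index, k, step, max_mismatches):
    """Return sorted (ref_start, mismatches) for ungapped placements found via exact seed hits."""
    hits = {}
    for offset in range(0, len(read) - k + 1, step):
        for seed_pos in index.get(read[offset:offset + k], []):
            start = seed_pos - offset            # where the whole read would begin
            if start < 0 or start + len(read) > len(reference) or start in hits:
                continue
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits[start] = mismatches
    return sorted(hits.items())

reference = "GGCA" + "TAGATTACACAGATTAC" + "GGTCA"
read = "TAGATTACTCAGATTAC"            # one substitution relative to the reference
index = index_seeds(reference, k=6)
placements = seed_and_extend(read, reference, index, k=6, step=3, max_mismatches=2)
```

Two of the read's seeds miss because they overlap the substitution, but the seeds on either side still hit and both point to the same placement.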

Alignments in Bowtie 2

@HWI-ST974:58:C059FACXX:2:1201:10589:110434 1:N:0:TGACCA
TGCACACTGAAGGACCTGGAATATGGCGAGAAAACTGAAAATCATGGAAAATGAGAAATACACACTTTAGGACGTG

Multiseed alignment (ungapped)
– Ref index: BWT, every nt, fwd only
– Read seeds: 16 nt, every 10 nt, fwd+rev

Seeds are extended (gaps allowed) to generate alignment

Ref:  TGCACACTGAAGGACCTGGAATATGGCGAGAAAACTGAAAATCATGGAAAATGAGAAATACACACTTTAGGACGTG
Read: TGCACACTGAAGGTCCTGGAATATGGCGAGAAAACTGAAAATCATGGAAA--GAGAAATACACACTTTAGGACGTG

Match = 2
Mismatch = -6
Gap = -11 for the 2 bp gap shown (-5 to open, -3 to extend by 1 bp)

Scoring alignments

Ungapped:

    TAGATTACTCAGATTAC
    |||||||| ||||||||
    TAGATTACACAGATTAC

Gapped:

    TAGATTACTCAGA-TAC
    |||||||| |||| |||
    TAGATTACACAGATTAC

Match (+1): C aligned to C
Mismatch (-1, -2, etc.): C aligned to T

Gap penalty: P = a + bN
– a = cost of opening a gap
– b = cost of extending gap by 1
– N = length of gap

    A-TAC        A--AC
    | |||        |  ||
    ATTAC        ATTAC
    (N = 1)      (N = 2)

• Simpler algorithms allow a fixed number of mismatches
• Complex algorithms pick best scoring alignment, stopping extension if score falls below threshold from best score so far

Adapted from Mark Gerstein
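
The scoring scheme can be written as a short function that scores an alignment that has already been laid out. This is a sketch, not an aligner: it does not search for the best alignment. The parameter values in the usage lines are the ones quoted on the Bowtie 2 slide (match 2, mismatch 6, gap open 5, gap extend 3); the function name is illustrative.

```python
def score_alignment(ref_row, read_row, match, mismatch, gap_open, gap_extend):
    """Score two equal-length alignment rows in which '-' marks a gap.

    A gap of length N costs gap_open + gap_extend * N (the P = a + bN form).
    Adjacent gapped columns are treated as one gap.
    """
    if len(ref_row) != len(read_row):
        raise ValueError("alignment rows must have equal length")
    score = 0
    in_gap = False
    for r, q in zip(ref_row, read_row):
        if r == "-" or q == "-":
            if not in_gap:
                score -= gap_open                # a: paid once per gap
            score -= gap_extend                  # b: paid for every gapped base
            in_gap = True
        else:
            score += match if r == q else -mismatch
            in_gap = False
    return score

ungapped = score_alignment("TAGATTACACAGATTAC", "TAGATTACTCAGATTAC", 2, 6, 5, 3)
gapped = score_alignment("TAGATTACACAGATTAC", "TAGATTACTCAGA-TAC", 2, 6, 5, 3)
```

With these values the ungapped alignment (16 matches, 1 mismatch) outscores the gapped one (15 matches, 1 mismatch, one 1 bp gap), so the better-scoring alignment would be the one reported.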

Common algorithms for mapping short reads to a reference genome

Program            Website
ELAND (v2)         N/A (integrated into Illumina pipeline)
Bowtie/Bowtie2     http://bowtie-bio.sourceforge.net/
BWA                http://bio-bwa.sourceforge.net/
Novoalign          http://www.novocraft.com/products/novoalign/

Considerations
• Alignment scoring method
• Speed
• Quality aware
• Seeding
• Gapped alignment
• Split read alignment

Split read alignments for transcriptomes and structural variations

Aligning to transcripts: splice junctions (contiguous)

Aligning to genome: splice junctions (split)

Simplifications Aligners Use

• Assume the reference is complete
• The best scoring match is considered the location (or locations) of the read in the genome
• Speed is the priority
• Difficult to align ends are “soft clipped”
• Reads are aligned individually
– Indels may not be consistent across reads

QUALITY SCORES

Scoring using Likelihood Measures

• Likelihood-based scoring methods are more robust than empirically derived scoring methods

• Construct a model of the error modes
– Usually, a training-based Bayesian or HMM model

• Common NGS quality scores
– Basecall quality scores
– Mapping quality scores
– Variant/Genotype quality scores

Basecall quality scores (“Phred” scores)

A quality score (or Q-score) expresses the probability that a basecall is incorrect. Given a basecall, A:

• The estimated probability that A is not correct is P(~A);

• The quality score for A is Q(A) = -10 log10(P(~A))

A quality score of 10 means a probability of 0.1 that A is the wrong basecall.

Quality scores are logarithmic:

Q-score    Error probability
10         0.1
20         0.01
40         0.0001

P(~A) is platform-specific; Q-scores can be compared across platforms.
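
The conversion between Q-scores and error probabilities is a one-liner in each direction. The sketch below follows the formula above; the helper `expected_errors` is an added illustration of how per-base scores combine over a read.

```python
import math

def phred_from_error_prob(p_error):
    """Q = -10 * log10(P(~A))."""
    return -10 * math.log10(p_error)

def error_prob_from_phred(q):
    """Invert the definition: P(~A) = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

def expected_errors(q_scores):
    """Expected number of wrong basecalls in a read, given its per-base Q-scores."""
    return sum(error_prob_from_phred(q) for q in q_scores)
```

For example, a 100 bp read with Q20 at every position has an expected one wrong base, while the same read at Q40 has an expected 0.01.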

Sequencing by synthesis with reversible dye terminators

[Figure: 1 cycle = add base, scan flow cell, reverse termination; then add next base, etc.]

Errors in Illumina sequencing reads

• Wrong “color” called when scanning image
• Wrong nucleotide is added
• Terminator not removed for a cycle
• Multiple bases added (terminator doesn’t work for a base)

Errors in Illumina sequencing reads

• Errors are mainly mismatches (substitutions)
• Error rates increase with increasing cycle number

Errors in single-molecule sequencing

PacBio: Fluorescent molecules, attached to each nucleotide, captured by direct observation of “zero-mode waveguide” during incorporation by polymerase

• Incorporation happens too quickly to be observed
• Nucleotide sits in polymerase “pocket” (and is detected), but is not incorporated
• Wrong nucleotide incorporated

• Errors are mainly insertions and deletions

    TAGATTA-ACAG-TT-C
    ||||||| |||| || |
    TAGATTACACAGATTAC

Quality Score Methodology

• Training-based calibration of scores
– Identify key “features” of a platform’s sequencing signals
  • Signal intensity/frequency/duration, base position, ...
  • Phred method: must be continuous value range, “monotonic” to error rate
– Train using set of runs from known samples and genomes
– Correlate errors to feature value ranges
  • Phred method: bin combinations of value ranges, then use error counts to set score
– Must recalibrate with changes in instrument or reagent kits

• Simplifications
– The training set characterizes the range of genomes/transcriptomes/...
– Library/Sequencing prep PCR errors not captured by sequencing measurements
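
The "bin, then use error counts to set score" step might look like the sketch below. It assumes each basecall from a known sample has already been assigned to a feature bin and marked as correct or wrong; the bin labels and the +1/+2 pseudocounts (used to avoid taking the log of zero) are illustrative assumptions, not the Phred or GATK procedure.

```python
import math
from collections import defaultdict

def empirical_quality_table(observations):
    """observations: iterable of (feature_bin, is_error) pairs from a known sample."""
    counts = defaultdict(lambda: [0, 0])         # bin -> [errors, total]
    for feature_bin, is_error in observations:
        counts[feature_bin][0] += int(is_error)
        counts[feature_bin][1] += 1
    table = {}
    for feature_bin, (errors, total) in counts.items():
        rate = (errors + 1) / (total + 2)        # pseudocounts: an assumption
        table[feature_bin] = round(-10 * math.log10(rate))
    return table

# Hypothetical training data: early cycles are cleaner than late cycles.
training = (
    [("early_cycle", False)] * 997 + [("early_cycle", True)] * 1
    + [("late_cycle", False)] * 89 + [("late_cycle", True)] * 9
)
quality_by_bin = empirical_quality_table(training)
```

Every new basecall falling in a given bin would then be reported with that bin's score, which is why the table has to be rebuilt when the instrument or reagents change.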

GATK Recalibration

https://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-3-Base_recalibration.pdf

Quality score encoding in FASTQ format
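
In a FASTQ record, the fourth line stores one ASCII character per basecall. A minimal decoding sketch, assuming the widely used Phred+33 convention (character code minus 33 gives the Q-score); some older Illumina pipelines used an offset of 64, so the offset is left as a parameter.

```python
def decode_qualities(quality_string, offset=33):
    """Convert a FASTQ quality line into a list of integer Phred scores."""
    return [ord(ch) - offset for ch in quality_string]

def encode_qualities(scores, offset=33):
    """Convert integer Phred scores back into a FASTQ quality line."""
    return "".join(chr(q + offset) for q in scores)

scores = decode_qualities("I5+!")
```

Under Phred+33, `I`, `5`, `+` and `!` decode to Q40, Q20, Q10 and Q0.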

MAPPING QUALITY

Mappability

• The genome contains non-unique sequences (repeats, segmental duplications)
• Short reads derived from repetitive regions are difficult to map

[Figure: the same repeat on Chr3 and Chr7; panels: Longer reads, Paired reads]

Mappability scores at UCSC

• The genome contains non-unique sequences (repeats, segmental duplications)
• Short reads derived from repetitive regions are difficult to map

[Figure: UCSC mappability tracks for 36mers, 75mers and 100mers, 2 mismatches each]

Poorly mappable regions of the genome

[Figure: mappability tracks for 36mers, 75mers and 100mers, 2 mismatches each]

Mapping quality score

Base quality values and mismatch positions in a candidate alignment are used to assign a p value

P values reflect the probability that a candidate position in the genome would give rise to the observed read if its bases were sequenced at error rates corresponding to the read’s quality values

Mapping quality score for a read is computed from p values of all candidate alignments

If there are two candidates for a read with p values 0.9 and 0.3:

• 0.9/(0.9+0.3) = 0.75, chance highest scoring alignment is correct

• 1 - 0.75 = 0.25, chance highest scoring alignment is wrong

• Mapping quality score = -10 log10(0.25) ≈ 6
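
The worked example generalizes to any number of candidate alignments. A minimal sketch of that arithmetic; the function name is illustrative, and real aligners add their own heuristics and report a capped integer rather than an unbounded value.

```python
import math

def mapping_quality(candidate_p_values):
    """Phred-scaled chance that the best-scoring candidate is the wrong location."""
    best = max(candidate_p_values)
    p_correct = best / sum(candidate_p_values)
    p_wrong = 1 - p_correct
    if p_wrong <= 0:
        return float("inf")                      # a single candidate: no competing location
    return -10 * math.log10(p_wrong)

mapq = mapping_quality([0.9, 0.3])
```

A read from a repeat with two equally good candidates gets a p_wrong of 0.5 and a mapping quality of about 3, which is why low mapping quality marks poorly mappable regions.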

COUNTING AND FREQUENCIES

Counting Reads, Converting to Frequencies

• NGS reads are all “clonally amplified”
– Start with a single molecule
– Either sequence it directly, or amplify it into a cluster/spot/well

• Post-alignment analysis begins with read counting
– Variant vs Non-variant
– Transcript
– ChIP-seq peaks

• Read counts (depth) converted into frequencies
– Normalizes across datapoints to permit comparisons
– “Alt frequency”, FPKM/RPKM, fold change

• Filtering applied to produce call set
– Distinguish significant events from random chance or sequencing error
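
The count-to-frequency conversions named above are simple ratios. The sketch below uses the standard RPKM definition (reads per kilobase of feature per million mapped reads); the function names and the example numbers are illustrative.

```python
def alt_frequency(alt_reads, total_reads):
    """Fraction of reads at a site that carry the alternate allele."""
    return alt_reads / total_reads if total_reads else 0.0

def rpkm(reads_in_feature, feature_length_bp, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads."""
    return reads_in_feature * 1e9 / (feature_length_bp * total_mapped_reads)

def fold_change(value_a, value_b):
    """Ratio of two normalized values, e.g. treated vs control."""
    return value_a / value_b

site_alt_freq = alt_frequency(12, 40)
gene_rpkm = rpkm(500, 2000, 10_000_000)
```

Dividing by feature length and sequencing depth is what lets a 2 kb transcript in a 10 million read run be compared with a longer transcript or a deeper run.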

Counting Reads, Converting to Frequencies

• Confounders
– Sampling variation (across genome/transcriptome)
– PCR duplicates
– Non-random sequencing error
– GC-bias
– Strand bias
– Amplification bias
– Alignment bias
– Sample purity
– Transcript length

Variant Calling Quality Scores

• At each location where reads differ from the reference, compute likelihood values
– Difference from the reference
– Each genotype (0/0, 0/1, 1/1) for diploid genomes
– Models either pre-trained or trained on the reads using a “truth set” of known variants

• Variant quality score
– Convert likelihood into phred-scaled score

• Genotype quality score
– Difference between the most likely genotype and the second most likely genotype
– Phred-scaled score
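
A toy version of the genotype calculation, assuming a simple binomial model in which every read is independent and has the same error rate. Production variant callers use per-base qualities and more elaborate models, so this only illustrates the phred-scaling and the "best minus second best" step.

```python
import math

def genotype_qualities(ref_reads, alt_reads, error_rate=0.01):
    """Return (best genotype, phred-scaled likelihoods, GQ) under a simple binomial model."""
    # Probability that a single read shows the alt allele under each genotype.
    p_alt = {"0/0": error_rate, "0/1": 0.5, "1/1": 1 - error_rate}
    # The binomial coefficient is identical for all genotypes, so it is omitted.
    log10_lik = {
        gt: alt_reads * math.log10(p) + ref_reads * math.log10(1 - p)
        for gt, p in p_alt.items()
    }
    best = max(log10_lik.values())
    # Phred-scale relative to the most likely genotype, which therefore gets 0.
    pl = {gt: round(-10 * (ll - best)) for gt, ll in log10_lik.items()}
    ranked = sorted(pl, key=pl.get)
    gq = pl[ranked[1]] - pl[ranked[0]]           # second most likely minus most likely
    return ranked[0], pl, gq

genotype, phred_likelihoods, gq = genotype_qualities(ref_reads=10, alt_reads=10)
```

With 10 reference and 10 alternate reads the heterozygous genotype wins by a wide margin; with only a few reads the gap between the top two genotypes, and therefore the genotype quality, shrinks.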

Conclusion

• First alignment, then counting, and then comes the analysis

• Alignment and counting are computation-focused
– Pipelines trade off speed vs. accuracy

• Quality/confidence measures calculated and used in the analysis
– Simple likelihoods or trained models?

• Your choices depend on your objectives/budgets

