Introduction to Short Read Sequencing Analysis

James Knight (with many slides adapted from Jim Noonan)

GENE 760

Rule to remember when using bioinformatics tools

• Computer scientists and mathematicians/statisticians make simplifications to speed up computations or to create a robust statistical model

– So, every tool solves an approximation of the problem

– The answers you get will largely be correct, but must be taken with a grain of salt
  • What might the tools be missing?
  • What quality/confidence score values can be believed?
  • Do the downstream steps of the pipeline adjust for the simplifications?

Sequence read lengths remain limiting

• For most applications reads are aligned to a reference genome
• Short reads contain inherently limited information
• De novo assembly of short reads is difficult

Chr1: 249 Mb

249 Mb sequencing read

Current platforms:
• Illumina: A very large number (2 billion) of short reads (75-125 bp)
• PacBio: A moderate number (~500,000) of long reads (~10 kb)

Determining the identity and location of short sequence reads in the genome/exome/transcriptome

@HWI-ST974:58:C059FACXX:2:1201:10589:110434 1:N:0:TGACCA
TGCACACTGAAGGACCTGGAATATGGCGAGAAAACTGAAAATCATGGAAAATGAGAAATACACACTTTAGGACGTG

Need a computationally efficient method to perform accurate alignments of millions of reads

Aligning short reads to much larger reference

Determining the identity and location of short sequence reads in the genome/exome/transcriptome

• Fundamental simplification
– Represent DNA molecules as strings (or sequences)
– 4 character alphabet, plus N

• Algorithmic simplification
– Finding exact matches is quickest

TAGATTAC
||||||||
TAGATTAC

– Comparing strings is slower

TAGATTACTCAGA
|||||||| ||||
TAGATTACACAGA

– “Gapped” alignment is very slow

TAGATTACTCAGA-TAC
|||||||| |||| |||
TAGATTACACAGATTAC

First attempt: A “telephone directory”

• Suppose input is 100 bp reads

• Create “telephone directory” of 100 bp sections of genome
– Sorted list of sequences, with locations
– <Sequence, chromosome, start position>

• For each read, lookup in directory to find genome location(s) for the read

• This solution does not work
– Lookup might be slow for hundreds of millions of reads
– Sequencing errors and variation prevent correct lookup
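The directory idea, and the reason it breaks down, can be sketched in a few lines. This is a toy illustration only: the reference string, the 8 bp window (standing in for 100 bp) and the function name `build_directory` are invented for the example.

```python
def build_directory(genome, read_len):
    """Map every read_len-long window of the genome to its 0-based start positions."""
    directory = {}
    for start in range(len(genome) - read_len + 1):
        window = genome[start:start + read_len]
        directory.setdefault(window, []).append(start)
    return directory


genome = "TAGATTACACAGATTACGGA"      # toy reference, invented for illustration
directory = build_directory(genome, read_len=8)

exact_read = "ACAGATTA"              # copied from the genome, starting at position 8
read_with_error = "ACAGTTTA"         # the same read with one substituted base

print(directory.get(exact_read, []))        # [8]
print(directory.get(read_with_error, []))   # []
```

A single substitution is enough to make the lookup fail, which is the second problem listed above.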

Older (and Newer) Algorithms: Seed and Extend

• Find shorter exact matches, called seeds
– Long enough to be mostly unique (20-25 bp for human)
– Short enough to find exact matches for “all” reads

• Encode seeds as integer values
– 2-bit encoding, A=00 C=01 G=10 T=11
– 32 nt seed fits within 64 bit integer
– Exact seed match when integer values the same

• Create “telephone directory” for reference, called an index
– Choose forward strand only or forward and reverse strands?
– Store every seed, or spaced seeds (every X nts)?
– Newer algorithms create Burrows-Wheeler Transform index
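A minimal sketch of the 2-bit seed encoding and a seed index, assuming a plain Python dictionary as the "directory". The function names, the toy reference and the 4 nt seed length are illustrative choices, not taken from any particular aligner. Python integers are unbounded, so the 32 nt limit is enforced explicitly to mirror the 64-bit case.

```python
ENCODING = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def encode_seed(seed):
    """Pack a seed of up to 32 nt into an integer, 2 bits per base."""
    if len(seed) > 32:
        raise ValueError("a seed longer than 32 nt does not fit in 64 bits")
    value = 0
    for base in seed:
        value = (value << 2) | ENCODING[base]   # raises KeyError on N
    return value


def build_seed_index(reference, seed_len, step=1):
    """Index forward-strand seeds, taking one seed every `step` nt."""
    index = {}
    for pos in range(0, len(reference) - seed_len + 1, step):
        seed = reference[pos:pos + seed_len]
        if "N" in seed:
            continue                             # ambiguous bases are skipped
        index.setdefault(encode_seed(seed), []).append(pos)
    return index


reference = "TAGATTACACAGATTACGGA"               # toy reference
index = build_seed_index(reference, seed_len=4)

print(encode_seed("ACGT"))                       # 27, i.e. 0b00011011
print(index[encode_seed("GATT")])                # [2, 11]
```

All seeds in one index must share the same length; otherwise "A" and "AA" would both encode to 0.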

Reference genome indexing using Burrows-Wheeler transform

Trapnell and Salzberg, Nat Biotechnology 27:455 (2009)

• Reversible encoding scheme
• All locations of every suffix stored in a contiguous region of the index
• For any suffix, transform table can find region when adding 1 nt prefix
• Lookups become very fast
• Independent of reference size
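The properties listed above can be seen in a small sketch of a Burrows-Wheeler index with backward search. It is a teaching-scale version under stated assumptions: the suffix array is built by naive sorting and the occurrence table is stored in full, neither of which would scale to a real genome. All function names are hypothetical.

```python
def suffix_array(text):
    """Sorted start positions of all suffixes (naive; fine for toy inputs)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_index(reference):
    text = reference + "$"                      # "$" sorts before A, C, G, T
    sa = suffix_array(text)
    bwt = "".join(text[i - 1] for i in sa)      # character preceding each sorted suffix
    # first[c]: number of characters in text that sort before c
    first, total = {}, 0
    for c in sorted(set(text)):
        first[c] = total
        total += text.count(c)
    # occ[c][i]: occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in first}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return sa, first, occ


def find(pattern, sa, first, occ):
    """Backward search: grow the match by one prefix character per step."""
    lo, hi = 0, len(sa)                         # suffix-array interval [lo, hi)
    for c in reversed(pattern):
        if c not in first:
            return []
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


reference = "TAGATTACACAGATTACGGA"              # toy reference
sa, first, occ = bwt_index(reference)
print(find("GATT", sa, first, occ))             # [2, 11]
print(find("TTT", sa, first, occ))              # []
```

Each step of `find` does two table lookups regardless of how long the reference is, which is why lookup time depends on the pattern length rather than the reference size.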

Older (and Newer) Algorithms: Seed and Extend

• Lookup the seeds occurring in each read
– Forward strand only or forward and reverse?
– Every seed or spaced seeds?

• Extend seeds to calculate alignment
– Ungapped alignment just comparing strings
– Gapped alignment using dynamic programming

Ref:  TGCACACTGAAGGACCTGGAATATGGCGAGAAAACTGAAAATCATGGAAAATGAGAAATACACACTTTAGGACGTG
Read: TGCACACTGAAGGTCCTGGAATATGGCGAGAAAACTGAAAATCATGGAAA--GAGAAATACACACTTTAGGACGTG
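As a rough illustration of gapped alignment by dynamic programming, the sketch below computes the best global alignment score (Needleman-Wunsch style) with a simple linear gap penalty. The scoring values are placeholders chosen for the example, and production aligners use banded and vectorized variants of this idea rather than the full table.

```python
def global_alignment_score(ref, read, match=1, mismatch=-1, gap=-2):
    """Best global alignment score using a linear gap penalty.

    Only the previous row of the dynamic-programming table is kept,
    so memory is proportional to the read length.
    """
    prev = [j * gap for j in range(len(read) + 1)]
    for i in range(1, len(ref) + 1):
        curr = [i * gap]
        for j in range(1, len(read) + 1):
            diagonal = prev[j - 1] + (match if ref[i - 1] == read[j - 1] else mismatch)
            curr.append(max(diagonal,            # align ref[i-1] with read[j-1]
                            prev[j] + gap,       # ref base opposite a gap
                            curr[j - 1] + gap))  # read base opposite a gap
        prev = curr
    return prev[-1]


print(global_alignment_score("TAGATTAC", "TAGATTAC"))                    # 8
print(global_alignment_score("TAGATTACACAGATTAC", "TAGATTACTCAGATAC"))   # 12
```

The second call is the short gapped example from the earlier slide: 15 matches, one mismatch and one gap give 15 - 1 - 2 = 12. The running time grows with the product of the two lengths, which is why extension is only attempted around seed hits.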

Alignments in Bowtie 2

@HWI-ST974:58:C059FACXX:2:1201:10589:110434 1:N:0:TGACCA
TGCACACTGAAGGACCTGGAATATGGCGAGAAAACTGAAAATCATGGAAAATGAGAAATACACACTTTAGGACGTG

Multiseed alignment (ungapped)
• Ref index: BWT, every nt, fwd only
• Read seeds: 16 nt, every 10 nt, fwd+rev

Seeds are extended (gaps allowed) to generate alignment

Ref:  TGCACACTGAAGGACCTGGAATATGGCGAGAAAACTGAAAATCATGGAAAATGAGAAATACACACTTTAGGACGTG
Read: TGCACACTGAAGGTCCTGGAATATGGCGAGAAAACTGAAAATCATGGAAA--GAGAAATACACACTTTAGGACGTG

Match = 2
Mismatch = -6
Gap = -11 for the 2 bp gap (-5 to open, -3 to extend by 1 bp)

Scoring alignments

Ungapped:

TAGATTACTCAGATTAC
|||||||| ||||||||
TAGATTACACAGATTAC

Gapped:

TAGATTACTCAGA-TAC
|||||||| |||| |||
TAGATTACACAGATTAC

Adapted from Mark Gerstein

Match (+1):

C
|
C

Mismatch (-1, -2, etc.):

C

T

Gap penalty:

P = a + bN
a = cost of opening a gap
b = cost of extending gap by 1
N = length of gap

A-TAC
| |||
ATTAC

A--AC
|  ||
ATTAC

• Simpler algorithms allow a fixed number of mismatches
• Complex algorithms pick best scoring alignment, stopping extension if score falls below threshold from best score so far
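The scoring scheme can be written down directly. The sketch below scores an existing alignment (two equal-length rows, with "-" marking a gap) using match and mismatch values plus the gap penalty P = a + bN. The function name is hypothetical; the example parameters are the Bowtie 2 values quoted on the earlier slide (match 2, mismatch -6, 5 to open, 3 per gap position).

```python
def score_alignment(ref_row, read_row, gap_open, gap_extend, match=1, mismatch=-1):
    """Score an alignment; a gap of length N costs gap_open + gap_extend * N."""
    if len(ref_row) != len(read_row):
        raise ValueError("alignment rows must have the same length")
    score = 0
    gap_in = None                        # row currently inside a gap, if any
    for r, q in zip(ref_row, read_row):
        if r == "-" and q == "-":
            raise ValueError("a column cannot be a gap in both rows")
        row = "ref" if r == "-" else "read" if q == "-" else None
        if row is None:
            score += match if r == q else mismatch
        elif row == gap_in:
            score -= gap_extend          # continuing the current gap
        else:
            score -= gap_open + gap_extend   # first position of a new gap
        gap_in = row
    return score


ref_row = "TAGATTACACAGATTAC"
read_row = "TAGATTACTCAGA-TAC"
print(score_alignment(ref_row, read_row, gap_open=5, gap_extend=3,
                      match=2, mismatch=-6))    # 16
```

Here 15 matches contribute 30, the one mismatch costs 6, and the 1 bp gap costs 5 + 3 = 8, giving 16.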

Common algorithms for mapping short reads to a reference genome

Program           Website
ELAND (v2)        N/A – integrated into Illumina pipeline
Bowtie/Bowtie2    http://bowtie-bio.sourceforge.net/
BWA               http://bio-bwa.sourceforge.net/
Novoalign         http://www.novocraft.com/products/novoalign/

Considerations
• Alignment scoring method
• Speed
• Quality aware
• Seeding
• Gapped alignment
• Split read alignment

Split read alignments for transcriptomes and structural variations

Aligning to transcripts: Splice junctions (contiguous)

Aligning to genome: Splice junctions (split)

Simplifications Aligners Use

• Assume the reference is complete
• The best scoring match is considered the location (or locations) of the read in the genome
• Speed is the priority
• Difficult to align ends are “soft clipped”
• Reads are aligned individually
– Indels may not be consistent across reads

QUALITY SCORES

Scoring using Likelihood Measures

• Likelihood-based scoring methods more robust than empirically derived scoring methods

• Construct a model of the error modes
– Usually, a training-based Bayesian or HMM model

• Common NGS quality scores
– Basecall quality scores
– Mapping quality scores
– Variant/Genotype quality scores

Basecall quality scores (“Phred” scores)

A quality score (or Q-score) expresses the probability that a basecall is incorrect. Given a basecall, A:

• The estimated probability that A is not correct is P(~A);

• The quality score for A is Q(A) = -10 log10(P(~A))

A quality score of 10 means a probability of 0.1 that A is the wrong basecall.

Quality scores are logarithmic:

Q-score    Error probability
10         0.1
20         0.01
40         0.0001

P(~A) is platform-specific; Q-scores can be compared across platforms.
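The Q-score definition above converts in both directions with one line each. A minimal sketch, with hypothetical function names:

```python
import math


def phred_from_error(p_error):
    """Q = -10 * log10(P(error)); p_error must be greater than 0."""
    return -10 * math.log10(p_error)


def error_from_phred(q):
    """Inverse conversion: P(error) = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


for q in (10, 20, 40):
    print(q, error_from_phred(q))
```

Every 10 points of Q-score correspond to a tenfold drop in the estimated error probability.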

Sequencing by synthesis with reversible dye terminators

1 cycle:
• Add base
• Scan flow cell
• Reverse termination
• Add next base, etc.

Errors in Illumina sequencing reads

• Wrong “color” called when scanning image
• Wrong nucleotide is added
• Terminator not removed for a cycle
• Multiple bases added (terminator doesn’t work for a base)

Errors in Illumina sequencing reads

• Errors are mainly mismatches (substitutions)
• Error rates increase with increasing cycle number

Errors in single-molecule sequencing

PacBio: Fluorescent molecules, attached to each nucleotide, are captured by direct observation of the “zero-mode waveguide” during incorporation by polymerase

TAGATTA-ACAG-TT-C
||||||| |||| || |
TAGATTACACAGATTAC

• Incorporation happens too quickly to be observed
• Nucleotide sits in polymerase “pocket” (and is detected), but is not incorporated
• Wrong nucleotide incorporated

• Errors are mainly insertions and deletions

Quality Score Methodology

• Training-based calibration of scores
– Identify key “features” of a platform’s sequencing signals
  • Signal intensity/frequency/duration, base position, ...
  • Phred method: must be continuous value range, “monotonic” to error rate
– Train using set of runs from known samples and genomes
– Correlate errors to feature value ranges
  • Phred method: bin combinations of value ranges, then use error counts to set score
– Must recalibrate with changes in instrument or reagent kits

• Simplifications
– The training set characterizes the range of genomes/transcriptomes/...
– Library/Sequencing prep PCR errors not captured by sequencing measurements

GATK Recalibration

https://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-3-Base_recalibration.pdf

Quality score encoding in FASTQ format
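In a FASTQ record each base's quality is stored as a single ASCII character. The sketch below assumes the Phred+33 offset (the Sanger convention, also used by current Illumina output); some older Illumina pipelines used an offset of 64, so the offset is a parameter. The function name is hypothetical.

```python
def decode_qualities(quality_line, offset=33):
    """Convert a FASTQ quality string to a list of integer Q-scores.

    offset=33 is an assumption (Phred+33); pass offset=64 for data
    written with the older Phred+64 convention.
    """
    return [ord(ch) - offset for ch in quality_line]


print(decode_qualities("I5+!"))   # [40, 20, 10, 0]
```

A quick sanity check on real data is the range of decoded values: negative scores under the chosen offset indicate that the wrong convention was assumed.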

MAPPING QUALITY

Mappability

• The genome contains non-unique sequences (repeats, segmental duplications)
• Short reads derived from repetitive regions are difficult to map

[Figure: the same repeat present on Chr3 and Chr7, with panels for longer reads and paired reads]

Mappability scores at UCSC

• The genome contains non-unique sequences (repeats, segmental duplications)
• Short reads derived from repetitive regions are difficult to map

[Figure: mappability track, 36mers, 2 mismatches]



Poorly mappable regions of the genome




Mapping quality score

Base quality values and mismatch positions in a candidate alignment are used to assign a p value

P values reflect the probability that a candidate position in the genome would give rise to the observed read if its bases were sequenced at error rates corresponding to the read’s quality values

Mapping quality score for a read is computed from p values of all candidate alignments

If there are two candidates for a read with p values 0.9 and 0.3:

• 0.9/(0.9+0.3) = 0.75, chance highest scoring alignment is correct
• 1 - 0.75 = 0.25, chance highest scoring alignment is wrong
• Mapping quality score = -10 log10(0.25) ≈ 6
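The worked example generalizes to any number of candidate alignments. A minimal sketch of that calculation follows; the function name is hypothetical, and real aligners apply further heuristics and report a bounded score rather than this raw value.

```python
import math


def mapping_quality(p_values):
    """Phred-scaled probability that the best candidate alignment is wrong."""
    p_correct = max(p_values) / sum(p_values)
    p_wrong = 1 - p_correct
    if p_wrong <= 0:
        return float("inf")          # a single candidate: no competing location
    return -10 * math.log10(p_wrong)


print(round(mapping_quality([0.9, 0.3]), 2))   # 6.02
```

Two equally good candidates give a 0.5 chance of being wrong, a score of about 3, which is why reads from exact repeats receive very low mapping qualities.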

COUNTING AND FREQUENCIES

Counting Reads, Converting to Frequencies

• NGS reads are all “clonally amplified”
– Start with a single molecule
– Either sequence it directly, or amplify it into a cluster/spot/well

• Post-alignment analysis begins with read counting
– Variant vs Non-variant
– Transcript
– ChIP-seq peaks

• Read counts (depth) converted into frequencies
– Normalizes across datapoints to permit comparisons
– “Alt frequency”, FPKM/RPKM, fold change

• Filtering applied to produce call set
– Distinguish significant events from random chance or sequencing error
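Two of the count-to-frequency conversions named above can be sketched as follows. The alt frequency is the fraction of reads at a site that carry the alternate allele, and RPKM is reads per kilobase of feature per million mapped reads. Function names and the example numbers are invented for illustration.

```python
def alt_frequency(ref_count, alt_count):
    """Fraction of reads at a site that support the alternate allele."""
    depth = ref_count + alt_count
    return alt_count / depth if depth else 0.0


def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads."""
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)


print(alt_frequency(ref_count=30, alt_count=10))                               # 0.25
print(rpkm(read_count=500, feature_length_bp=2000, total_mapped_reads=10_000_000))  # 25.0
```

Dividing by feature length and library size is what makes counts comparable across transcripts and across samples.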

Counting Reads, Converting to Frequencies

• Confounders
– Sampling variation (across genome/transcriptome)
– PCR duplicates
– Non-random sequencing error
– GC-bias
– Strand bias
– Amplification bias
– Alignment bias
– Sample purity
– Transcript length

Variant Calling Quality Scores

• At each location where reads differ from the reference, compute likelihood values
– Difference from the reference
– Each genotype (0/0, 0/1, 1/1) for diploid genomes
– Models either pre-trained or trained on the reads using a “truth set” of known variants

• Variant quality score
– Convert likelihood into phred-scaled score

• Genotype quality score
– Most likely genotype minus second most likely genotype
– Phred-scaled score
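To make the genotype quality idea concrete, here is a deliberately simple sketch. It assumes a binomial model in which every read is an independent draw and has the same fixed error rate; this is an illustrative stand-in, not the model used by any particular variant caller, and all names are hypothetical.

```python
import math


def genotype_phred_likelihoods(ref_count, alt_count, error_rate=0.01):
    """Phred-scaled likelihoods for 0/0, 0/1 and 1/1, with the best set to 0.

    error_rate must lie strictly between 0 and 1. The binomial coefficient
    is omitted because it is identical for all three genotypes.
    """
    scaled = {}
    for genotype, alt_fraction in (("0/0", 0.0), ("0/1", 0.5), ("1/1", 1.0)):
        # Probability that a single read shows the alternate allele
        p_alt = alt_fraction * (1 - error_rate) + (1 - alt_fraction) * error_rate
        log10_likelihood = (alt_count * math.log10(p_alt)
                            + ref_count * math.log10(1 - p_alt))
        scaled[genotype] = -10 * log10_likelihood
    best = min(scaled.values())
    return {g: value - best for g, value in scaled.items()}


def call_genotype(ref_count, alt_count, error_rate=0.01):
    """Return (most likely genotype, genotype quality)."""
    scaled = genotype_phred_likelihoods(ref_count, alt_count, error_rate)
    ranked = sorted(scaled, key=scaled.get)
    return ranked[0], scaled[ranked[1]] - scaled[ranked[0]]


genotype, quality = call_genotype(ref_count=10, alt_count=10)
print(genotype, round(quality))
```

The genotype quality is the gap between the best and second-best Phred-scaled likelihoods, so it shrinks when depth is low or when the read counts sit between two genotypes.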

Conclusion

• First alignment, then counting, and then comes the analysis

• Alignment and counting are computation-focused
– Pipelines trade off speed vs. accuracy

• Quality/confidence measures calculated and used in the analysis
– Simple likelihoods or trained models?

• Your choices depend on your objectives/budgets

Date post:	15-Jan-2016