Indexed Alignment Tricks of the Trade Ross David Bayer 18 th October, 2005 Note: many diagrams taken...


Indexed Alignment

Tricks of the Trade

Ross David Bayer

18th October, 2005

Note: many diagrams taken from Serafim’s CS 262 class

Roadmap

Background Recap

Simple Tricks of the Trade: Wildcards, Multiple Words

State of the Art: Seed Patterns, Optimizing Seeds, Multiple Simultaneous Seeds

Status Check

Motivation

We have a newly discovered gene: Does it occur in other species? How fast does it evolve?

We want to “find” this gene in other species But there will be mutations

Sequence Alignment

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

AGGCTATCACCTGACCTCCAGGCCGATGCCCTAGCTATCACGACCGCGGTCGATTTGCCCGAC

Global Alignment

Running Time: O(MN)

MAGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

Needleman-Wunsch (Dynamic Programming)

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

Local Alignment

Running Time: O(MN)

MAGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

Smith-Waterman

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

Modifications:

• Store 0 instead of negative values

• Search entire table for maximum
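The two modifications above can be sketched in a few lines. This is a minimal illustration, not the lecture's code: the scoring values (`match=1`, `mismatch=-1`, `gap=-1`) and the function name are assumptions chosen for the example.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score of a vs b (linear gap penalty, assumed scores)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Modification 1: store 0 instead of a negative value.
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            # Modification 2: the answer is the maximum over the entire table.
            best = max(best, cur[j])
        prev = cur
    return best
```

Only two rows are kept because this sketch returns the score alone; recovering the alignment itself would need the full table for traceback.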

Alignment Applications

We have our newly discovered gene: Does it occur in other species? How fast does it evolve?

Complete genomes today

About 300 complete genomes have been sequenced

GenBank Growth

Exponential growth in total sequence data

Recently exceeded 100 Gbp (10^11 base pairs)

More DNA is coming …

Alignment Applications

We have our newly discovered gene: Does it occur in other species? How fast does it evolve?

Assume we try Smith-Waterman:

10^15 cells

The entire genomic database

Our new gene

Indexed Alignment

(BLAST- Basic Local Alignment Search Tool)

Main idea:

1. Construct a dictionary of all words in the query

2. Initiate a local alignment for each word match between query and DB

Running Time: O(MN) in worst case

However, in practice orders of magnitude faster than Smith-Waterman

Step 1 (Basic): Construct dictionary of query words

Query indexed by all words of size k: k = 3 in our examples, k ≈ 11 in practice

Query:

AGGCTATCACCTGACCTCCAGGCCGATGCCCTAGCTATCACGACCGCG…

AAA AAC AAG AAT ACA ... AGG ... CTA ... GCT ... GGC ...

AGG GGC GCT CTA TAT ATC TCA CAC ACC CCT CTG TGA GAC ACC CCT CTC TCC CCA CAG AGG GGC GCC CCG CGA GAT ATG TGC GCC CCC CCT CTA TAG AGC GCT CTA TAT ATC TCA CAC ACG CGA GAC ACC CCG CGC GCG
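Step 1 can be sketched as a dictionary from each k-letter word to the query positions where it occurs. The function name `build_word_index` is hypothetical, and 0-based positions are an assumption of this sketch.

```python
from collections import defaultdict

def build_word_index(query, k=3):
    """Map every word of size k in the query to its (0-based) start positions."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    return index

# Usage: words that occur more than once keep all of their positions.
index = build_word_index("AGGCTAGG", k=3)
print(index["AGG"])
```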

Step 1 (Advanced): Relative Generation

For each query word, generate all relatives

A relative is a word with alignment score ≥ T against the query word

The index entries of all relatives are updated to point to the query word's location

Query word:

Threshold: T = 28

Relatives:

GGC 30
AGC 28
GAC 28
AAC 26
GGT 25
GGA 24

Only GGC, AGC and GAC reach T = 28 and are indexed.

Query:

AGGCTATCACCTGACCTCCAGGCCG…

... AGC ... GAC ... GGC ...
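Relative generation can be sketched by scoring every candidate word against the query word and keeping those at or above T. The scoring function is passed in because the slide's scoring matrix is not given; the 1/0 scoring used in the usage lines is an assumption, not the matrix behind the scores 30, 28, 26.

```python
from itertools import product

def relatives(word, T, score_pair, alphabet="ACGT"):
    """All words whose ungapped score against `word` is at least T."""
    found = []
    for candidate in product(alphabet, repeat=len(word)):
        score = sum(score_pair(x, y) for x, y in zip(word, candidate))
        if score >= T:
            found.append(("".join(candidate), score))
    # Highest-scoring relatives first.
    return sorted(found, key=lambda item: -item[1])

# Usage with an assumed scoring: match = 1, mismatch = 0, T = k - 1.
identity = lambda x, y: 1 if x == y else 0
rels = relatives("GGC", T=2, score_pair=identity)
```

Enumerating all 4^k candidates is fine for k = 3 but not for k ≈ 11; a real implementation would prune, which this sketch does not attempt.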

Step 2: Searching

Search through database linearly, one word at a time

Initiate alignment with all occurrences of that word in query

Genomic database:

AGCTAGCTGCTAGTCAGTCGATGCATGCTACTAGCTGCGATCGTCGTC…

AAA AAC AAG AAT ACA ... AGC ... CTA ... GCT ... TAG ...

Query:

AGC GCT

A C G A A G T A A G G T C C A G T
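A minimal sketch of Step 2: scan the database one word at a time and look each word up in the query dictionary. Every (query position, database position) pair returned is a place where an alignment would be initiated. The function name and the 0-based positions are assumptions of this sketch.

```python
from collections import defaultdict

def find_seed_hits(query, database, k=3):
    """Return (query_pos, database_pos) for every shared word of size k."""
    index = defaultdict(list)
    for q in range(len(query) - k + 1):
        index[query[q:q + k]].append(q)
    hits = []
    # Linear scan of the database; each lookup is a dictionary access.
    for d in range(len(database) - k + 1):
        for q in index.get(database[d:d + k], ()):
            hits.append((q, d))
    return hits
```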

Alignment Extension

Example:

The matching word GGT initiates an alignment

Extension to the left and right with no gaps until the alignment score falls a threshold S below the best score so far

Output:

GTAAGGTCC

GTTAGGTCC
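The extension step can be sketched as follows. The scores (`match=1`, `mismatch=-1`) and the database string in the usage lines are assumptions made for illustration; only the query and the seed word GGT come from the slide.

```python
def extend_ungapped(query, database, q, d, k, S, match=1, mismatch=-1):
    """Extend a seed of size k at query[q], database[d] without gaps.

    Returns (score, (query_start, query_end), (db_start, db_end)), ends exclusive.
    """
    def walk(qi, di, step):
        score = best = best_len = n = 0
        while 0 <= qi < len(query) and 0 <= di < len(database):
            score += match if query[qi] == database[di] else mismatch
            n += 1
            if score > best:
                best, best_len = score, n
            elif best - score > S:
                break  # fell more than S below the best score so far
            qi += step
            di += step
        return best, best_len

    seed = sum(match if query[q + i] == database[d + i] else mismatch
               for i in range(k))
    right, right_len = walk(q + k, d + k, +1)
    left, left_len = walk(q - 1, d - 1, -1)
    return (seed + left + right,
            (q - left_len, q + k + right_len),
            (d - left_len, d + k + right_len))

# Usage: the slide's query with a hypothetical database; the seed GGT sits
# at query position 9 and database position 7.
query, database = "ACGAAGTAAGGTCCAGT", "CCCGTTAGGTCCTTT"
score, (qs, qe), (ds, de) = extend_ungapped(query, database, 9, 7, k=3, S=2)
print(query[qs:qe], database[ds:de])  # GTAAGGTCC GTTAGGTCC
```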

BLAST Algorithm Variations

BLAT- BLAST-Like Alignment Tool

1. Builds index (dictionary) for database, scans linearly through query

2. Alignment extensions allow for gaps as well

A C G A A G T A A G G T C C A G T

Gapped Extensions

Extensions with gaps in a band around anchor

Extension to the left and right, now allowing gaps, until the alignment score falls a threshold S below the best score so far

Output:

GTAAGGTCC-AG

GTTAGGTCCTAG

Perfect Match Results

Perfect Match: no relatives generated

Perfect Match Results

Interpreting Results

Word size k

Conservation rate

Conservation rate: 81%

Mutation rate: 19%

Sensitivity

Probability of a particular homologous area being identified

Larger k decreases probability (exact match less likely)

Straightforward mathematics

Sensitivity Calculation

Database (genome)

Homologous area:

Suppose k = 7:

Mutation rate: 19%

Probability whole word is conserved:

0.81^7 ≈ 23%

Sensitivity Calculation

Database (genome)

Homologous area:

Suppose k = 7:

23% 23% 23% 23% 23% 23% 23% 23% 23% 23%

Words: 10

Probability a particular word is conserved: 23%

Probability at least one word is conserved:

1 – 0.77^10 ≈ 93%
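The arithmetic on these two slides can be checked with a short sketch. Like the slides, it treats the words as independent; the function names are hypothetical.

```python
def word_conserved(conservation, k):
    """Probability that all k positions of one word are conserved."""
    return conservation ** k

def sensitivity(conservation, k, n_words):
    """Probability that at least one of n_words words is fully conserved,
    assuming (as the slide does) that the words are independent."""
    p = word_conserved(conservation, k)
    return 1 - (1 - p) ** n_words

print(round(word_conserved(0.81, 7), 2))   # 0.23
print(round(sensitivity(0.81, 7, 10), 2))  # 0.93
```

Raising k lowers `word_conserved` geometrically, which is the sensitivity half of the tradeoff discussed in this part of the talk.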

Specificity

Expected number of alignments initiated by chance

Based on 500 bp query and 3 Gbp database

This is essentially an indication of SPEED

The Classic BLAST Tradeoff

As we increase k …

Sensitivity gets worse

Speed gets better

Status Check

Exact matches unlikely for larger values of k

Include variants with one “wildcard” placed in each position

Relative Generation: Any match: 1, Any mismatch: 0, Threshold: T = k – 1
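One way to realize the wildcard idea is to index each word under k keys, each with one position masked. Two words that differ in at most one position then share at least one key. This is a sketch of that equivalence, not necessarily how the lecture's implementation works; the `*` symbol and the function name are assumptions.

```python
def wildcard_keys(word, wildcard="*"):
    """All variants of `word` with exactly one position replaced by a wildcard."""
    return [word[:i] + wildcard + word[i + 1:] for i in range(len(word))]

# Usage: ACGT and ACTT differ only at position 2, so they share one key.
shared = set(wildcard_keys("ACGT")) & set(wildcard_keys("ACTT"))
print(shared)  # {'AC*T'}
```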

Wildcards

Wildcard Results

Better?

Wildcard Results

Perfect match:

Wildcards:

For the same sensitivity, wildcard variant is about 440 times faster

Wildcard Results

Perfect match:

Wildcards:

For the same sensitivity, wildcard variant is about 40 times faster

Wildcard Results

Better: sensitivity/speed tradeoff consistently improved

Status Check

N perfect matches

Same separation in query and database

Database:

TGCTAGCTACGATCTGCAGTGCGTAATCT…

Query:

TCATTACATCGTGACTTGCAGTCGTCCAG…

All separations less than distance W

Multiple Words

Words: TAC, TGC

TAC and TGC 12 bp apart in the database, as in the query: INITIATE ALIGNMENT

TAC and TGC 7 bp apart in the database: NO INITIATION
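The multiple-words rule for N = 2 can be sketched by grouping word hits by diagonal (database position minus query position): two hits have the same separation in query and database exactly when they lie on the same diagonal. The function name and the choice to skip overlapping words are assumptions of this sketch.

```python
from collections import defaultdict

def two_hit_seeds(query, database, k, W):
    """Pairs of word hits with the same separation in query and database,
    that separation being at least k (non-overlapping) and less than W.
    Returns (diagonal, first_query_pos, second_query_pos) triples."""
    index = defaultdict(list)
    for q in range(len(query) - k + 1):
        index[query[q:q + k]].append(q)
    by_diagonal = defaultdict(list)
    for d in range(len(database) - k + 1):
        for q in index.get(database[d:d + k], ()):
            by_diagonal[d - q].append(q)
    pairs = []
    for diagonal, positions in by_diagonal.items():
        positions.sort()
        for i, second in enumerate(positions):
            for first in positions[:i]:
                if k <= second - first < W:
                    pairs.append((diagonal, first, second))
    return pairs

# Usage with the slide's sequences: TAC and TGC are 12 bp apart in both.
query = "TCATTACATCGTGACTTGCAGTCGTCCAG"
database = "TGCTAGCTACGATCTGCAGTGCGTAATCT"
print((3, 4, 16) in two_hit_seeds(query, database, k=3, W=13))  # True
```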


Intuition Behind Multiple Words

Database (genome)

Homologous area:

If we use a single word of size k = 16:

Mutation rate: 19%

Probability whole word is conserved:

0.81^16 ≈ 3%

Database (genome)

Homologous area:

If we use a single word of size k = 16:

3% 3% 3% 3% 3% 3% 3% 3% 3% 3% 3% 3% 3% 3%

Probability at least one word is conserved:

1 – 0.97^10 ≈ 29% (computed with the unrounded 0.81^16 ≈ 0.034)

Database (genome)

If we use a single word of size k = 16: Probability of a match = 29%

If we use N = 2 words of size k = 8:

Homologous area:

Probability a particular word is conserved:

0.81^8 ≈ 19%

19% 19% 19% 19% 19% 19% 19% 19% 19% 19% 19% 19% 19% 19% 19% 19% 19% 19% 19% 19%

Probability at least two words are conserved:

1 – 0.81^20 – 20 × 0.19 × 0.81^19 ≈ 91%

(Here 0.81 = 1 – 0.19 is the probability that a given word is not conserved.)
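The "at least N words conserved" calculation is a binomial tail, sketched below under the same independence assumption as the slides. The function name is hypothetical; the per-word probability is passed unrounded, which is what reproduces the slide's 29% and 91%.

```python
from math import comb

def prob_at_least(n_hits, n_words, p):
    """Probability that at least n_hits of n_words independent words are
    conserved, each with probability p."""
    fewer = sum(comb(n_words, i) * p ** i * (1 - p) ** (n_words - i)
                for i in range(n_hits))
    return 1 - fewer

print(round(prob_at_least(1, 10, 0.81 ** 16), 2))  # 0.29 (single word, k = 16)
print(round(prob_at_least(2, 20, 0.81 ** 8), 2))   # 0.91 (N = 2 words, k = 8)
```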

Database (genome)

If we use a single word of size k = 16: Probability of a match = 29%

If we use N = 2 words of size k = 8: Probability of a match = 91%

3% 3% 3% 3% 3% 3% 3% 3% 3% 3% 3% 3% 3% 3%

19% 19% 19% 19% 19% 19% 19% 19% 19% 19% 19% 19% 19% 19% 19% 19% 19% 19% 19% 19%

Multiple Words Results

Single perfect match:

Multiple perfect matches:

For the same sensitivity, multiple words variant about 1,200 times faster

Single perfect match:

Multiple perfect matches:

For the same sensitivity, multiple words variant about 75,000 times faster

Much better than single matches

Bigger improvement even than wildcards

Why not combine them: Multiple Wildcard Matches?

Status Check

Contiguous word (k = 10)

GTCAGTACGTCAGTCGTGCGTCGTCTAG
××××××××××

Seed pattern

GTCAGTACGTCAGTCGTGCGTCGTCTAG
××∙×∙×∙∙∙×∙×∙∙∙×∙×∙∙∙×∙∙∙∙∙×

Seed Patterns

GTCAGTACGT

GTATTAGGCG

Patterns increase the likelihood of at least one match within a long conserved region
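Indexing with a seed pattern differs from indexing contiguous words only in which letters form the key. In this sketch a pattern is written as a string of `1` (the × positions) and `0` (the ∙ positions), following the π110 notation used later in the talk; the function names are hypothetical.

```python
from collections import defaultdict

def seed_key(seq, pos, pattern):
    """Letters of seq under the '1' positions of the pattern placed at pos."""
    return "".join(seq[pos + i] for i, bit in enumerate(pattern) if bit == "1")

def build_seed_index(seq, pattern):
    """Map each seed key to the positions where the pattern can be placed."""
    index = defaultdict(list)
    for pos in range(len(seq) - len(pattern) + 1):
        index[seed_key(seq, pos, pattern)].append(pos)
    return index

# Usage: a mismatch under a '0' position does not change the key.
print(seed_key("GTCAGT", 0, "1101") == seed_key("GTTAGT", 0, "1101"))  # True
```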

[Figure: Consecutive Positions vs. Non-Consecutive Positions; panel labels: 3 common, 5 common, 7 common, 6 common]

On a 100-long 70% conserved region:

                          Consecutive    Non-consecutive
Expected # hits:          1.07           0.97
Prob[at least one hit]:   0.30           0.47

Intuition Behind Seed Patterns

Advantage of Patterns

11 positions

10 positions

Status Check

Designing Seeds

Is this a good seed pattern?

×∙×∙×∙×∙×∙×∙×∙×∙×∙× ×∙×∙×∙×∙×∙×∙×∙×∙×∙× 0 matches!

×∙×∙×∙×∙×∙×∙×∙×∙×∙× 9 matches

Not so much …

Designing Seeds

Remember three-periodicity in exons?

Higher mutation rate in last position of codon

A decent pattern is thus π110:

××∙××∙××∙××∙××∙×

But isn’t regularity bad?

Human    CCTGTT  (Proline, Valine)
Mouse    CCAGTC  (Proline, Valine)
Rat      CCAGTC  (Proline, Valine)
Dog      CCGGTA  (Proline, Valine)
Chicken  CCCGTG  (Proline, Valine)

××∙×∙×∙∙∙×∙×∙∙∙×∙×∙∙∙×∙∙∙∙∙×

Optimizing Seeds

Hard problem

No efficient solution known for fixed word size k and max span s

××∙×∙×∙∙∙×∙×∙∙∙×∙×∙∙∙×∙∙∙∙∙×

k = 10

1 2 3 4 5 6 7 8 9 10

s = 28

How sensitive is a specific seed π?

Construct finite-state automaton A_π that accepts strings containing it

Regular expression: (0 + 1)* 1 (0 + 1) 1 (0 + 1)*

Dynamic program to convert into DFA

Computing Detection Probabilities

[Figure: automaton for π101 (×∙×), from START to YES; edge labels 1, then 0,1, then 1]

Computing Detection Probabilities

Compute probability a sequence of length L will be accepted by the DFA

Markov model M (nth order) which dictates conservation pattern

Dynamic program: Θ(k · L · 2^(s – k + n))
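To make the idea concrete, here is a simplified sketch of the detection-probability computation. It is not the dynamic program whose running time is quoted above: it assumes independent positions with a single conservation probability `p` instead of an nth-order Markov model, and it tracks all 2^(s – 1) recent-bit states, so it is only practical for short seeds. Patterns are written with `1` for × and `0` for ∙.

```python
from collections import defaultdict

def hit_probability(pattern, L, p):
    """Probability that a 0/1 conservation string of length L, with each
    position 1 independently with probability p, contains the seed."""
    s = len(pattern)
    mask = int(pattern, 2)        # required-1 positions, oldest bit highest
    keep = (1 << (s - 1)) - 1     # remember only the last s - 1 bits
    alive = {0: 1.0}              # probability mass with no hit so far
    for i in range(L):
        nxt = defaultdict(float)
        for state, mass in alive.items():
            for bit, prob in ((1, p), (0, 1 - p)):
                window = (state << 1) | bit
                if i >= s - 1 and (window & mask) == mask:
                    continue      # seed matched: this mass is now detected
                nxt[window & keep] += mass * prob
        alive = nxt
    return 1 - sum(alive.values())

# Usage: compare a contiguous seed with a spaced seed of the same weight.
contiguous = hit_probability("111", L=20, p=0.7)
spaced = hit_probability("1101", L=20, p=0.7)
```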

CDS: P(1) = 0.9, P(0) = 0.1

NC: P(1) = 0.8, P(0) = 0.2

0.95 0.95

1 0 1 1 0 1 1 1 0 1 1 1 (length L)

××∙×∙∙×××∙∙××××∙××∙××∙∙∙××××∙××∙××∙∙×∙×

Mandala Algorithm

Fast, practical seed design

Start with random seed

Jump to best neighbor (moving one × to unused ∙)

Keep jumping until no better neighbors (greedy hill climbing)

Finds local optimum

Random restarts to try to find a globally optimal seed
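The search loop described above can be sketched as follows. A seed is represented as a frozenset of the × positions within the span, and the objective is passed in as `score`. In the real algorithm the score would be the seed's detection probability; the `distinct_gaps` function used in the usage lines is only a toy stand-in, and all names here are hypothetical.

```python
import random

def neighbors(seed, span):
    """All seeds obtained by moving one × to an unused ∙."""
    free = [j for j in range(span) if j not in seed]
    for i in seed:
        for j in free:
            yield (seed - {i}) | {j}

def hill_climb(k, span, score, restarts=5, rng=None):
    """Greedy hill climbing with random restarts; returns the best seed found."""
    rng = rng or random.Random(0)
    best = None
    for _ in range(restarts):
        current = frozenset(rng.sample(range(span), k))  # random start
        while True:
            candidate = max(neighbors(current, span), key=score, default=current)
            if score(candidate) <= score(current):
                break  # no better neighbors: local optimum
            current = candidate
        if best is None or score(current) > score(best):
            best = current
    return best

# Toy objective (an assumption for this example, not a sensitivity measure):
# count the distinct distances between × positions.
def distinct_gaps(seed):
    return len({b - a for a in seed for b in seed if a < b})

seed = hill_climb(k=4, span=8, score=distinct_gaps)
print("".join("×" if i in seed else "∙" for i in range(8)))
```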

×∙×∙××∙××∙∙∙××∙×∙××∙××∙∙∙××∙×∙××∙××∙∙∙××∙∙∙××∙××∙∙××

×∙×∙××∙××∙∙∙×

×∙∙∙××∙××∙∙×××∙∙∙××∙××∙∙×××∙∙×∙×∙××∙∙×××∙∙∙××∙××∙∙××

×∙∙×∙×∙××∙∙×××∙∙×∙×∙××∙∙×××∙∙×∙×∙××∙∙×××∙∙×∙×∙×××∙∙×

NO BETTER NEIGHBORS

××∙×∙∙××××∙∙×

××∙××∙××∙∙×∙×

×∙∙×∙×∙×××∙∙×

×∙∙×∙×∙××∙∙××

×∙∙∙××∙××∙∙××

Mandala Results

Very fast: 20 seconds on k = 11, s = 22

Within 1% of true optimum in all trials

Mandala pattern for non-coding DNA (k = 11):

××××∙×××∙∙×∙×××

Mandala pattern for coding DNA (k = 11):

×××∙∙∙∙∙××∙××∙××∙××

×∙×∙××∙××∙∙∙×

NO BETTER NEIGHBORS

Seed Pattern Results

Non-coding sequences

Coding sequences

Motivation for Improvement

Good CDS seeds

××∙××∙××∙××∙××

Good NC seeds

××××∙×××∙∙×∙××

Status Check

Single seed

GTCAGTACGTCAGTCGTGCGTCGTCTAG
××∙×∙×∙∙∙×∙×∙∙∙×∙×∙∙∙×∙∙∙∙∙×
×××∙×∙∙×∙∙∙∙×∙∙∙∙∙∙∙×∙∙∙∙×××
∙∙∙××∙××∙××∙××∙××∙∙∙∙∙∙∙∙∙∙∙

How much longer is it taking us to index?

GTCAGTACGTCAGTCGTGCGTCGTCTAG

Multiple Simultaneous Seeds

How much longer is it taking us to search?

Multiple Simultaneous Seeds

AGACTCGTGTGTCGCGTTAGGTATTAGGCG

GTCAGTACGTCAGTCGTGCGTCGTCTAG

... AGCCAGTCAG ... GACAGTCCAG ... GGCATCATCA ...

THREE INDEXES:

XX∙X∙X∙∙∙X∙X∙∙∙X∙X∙∙∙X∙∙∙∙∙X

XXX∙X∙∙X∙∙∙∙X∙∙∙∙∙∙∙X∙∙∙∙XXX

∙∙∙XX∙XX∙XX∙XX∙XX∙∙∙∙∙∙∙∙∙∙∙

... AGCCAGTCAG ... GACAGTCCAG ... GGCATCATCA ...
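Searching with several seeds at once can be sketched as one index per pattern, with the hits pooled. Both the indexing and the search work grow roughly in proportion to the number of seeds. Patterns use `1` for × and `0` for ∙; the function name is hypothetical.

```python
from collections import defaultdict

def multi_seed_hits(query, database, patterns):
    """Union of (query_pos, database_pos) hits over several seed patterns."""
    hits = set()
    for pattern in patterns:
        care = [i for i, bit in enumerate(pattern) if bit == "1"]
        # One index per seed pattern.
        index = defaultdict(list)
        for q in range(len(query) - len(pattern) + 1):
            index["".join(query[q + i] for i in care)].append(q)
        # One scan of the database per seed pattern.
        for d in range(len(database) - len(pattern) + 1):
            key = "".join(database[d + i] for i in care)
            for q in index.get(key, ()):
                hits.add((q, d))
    return hits

# Usage: the contiguous seed misses this pair, the spaced seed finds it.
print(multi_seed_hits("ACGT", "ACTT", ["111"]))          # set()
print(multi_seed_hits("ACGT", "ACTT", ["111", "1101"]))  # {(0, 0)}
```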

Mandala still works

Just change definition of neighbor: Pick one seed, Pick one ×, Move to a ∙

Slower convergence

×∙×∙××∙××∙∙∙××∙×∙××∙××∙∙∙××∙×∙××∙××∙∙∙××∙∙∙××∙××∙∙××∙××∙××∙××∙××∙

×××∙×∙∙×∙××∙×

Results

Non-coding sequences

Back to earlier results …

Future Extensions

Combine all of the above!

Multiple Wildcard Matches of Multiple Patterns?

*GT∙G∙T∙∙∙TGTA∙C∙∙G∙∙∙∙∙GC∙AG∙T

*TC∙T∙T∙∙∙GTCG∙C∙∙C∙∙∙∙∙TC∙AC∙G

ACGATGGCTCCTGACGAGCTAGCTGATGAGCCGTAGCAGACGTGTAACGGTCGCTCCTGCAGCGA

ACGATGCATCGTAGCTAGCTAGCTGATCAGCTGTAGTAGTCGTCTACTGGATGCTGCTGCAGCGT

Database
