Post on 21-Dec-2015
Indexed Alignment
Tricks of the Trade
Ross David Bayer
18th October, 2005
Note: many diagrams taken from Serafim’s CS 262 class
Roadmap
Background Recap
Simple Tricks of the Trade: Wildcards, Multiple Words
State of the Art: Seed Patterns, Optimizing Seeds, Multiple Simultaneous Seeds
Motivation
We have a newly discovered gene: Does it occur in other species? How fast does it evolve?
We want to “find” this gene in other species, but there will be mutations
Sequence Alignment
Aligned:
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Unaligned:
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
Global Alignment
Needleman-Wunsch (Dynamic Programming)
Running Time: O(MN)
Sequence of length M: AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
Sequence of length N: AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC
Local Alignment
Smith-Waterman
Running Time: O(MN)
Sequence of length M: AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
Sequence of length N: AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC
Modifications:
• Store 0 instead of negative values
• Search entire table for maximum
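The two modifications can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's code; the scores (match +1, mismatch -1, gap -1) and the function name are assumptions chosen only for the example.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score of a and b (scores are illustrative assumptions)."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = table[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Modification 1: store 0 instead of a negative value.
            table[i][j] = max(0, diag, table[i - 1][j] + gap, table[i][j - 1] + gap)
            # Modification 2: the answer is the maximum over the entire table.
            best = max(best, table[i][j])
    return best


print(smith_waterman_score("TTACGTT", "GGACGGG"))  # 3 (the shared ACG)
```

The table has (M + 1) × (N + 1) cells, which is where the O(MN) running time comes from.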
Alignment Applications
We have our newly discovered gene: Does it occur in other species? How fast does it evolve?
GenBank Growth
Exponential growth in total sequence data
Recently exceeded 100 Gbp (10^11 base pairs)
Alignment Applications
We have our newly discovered gene: Does it occur in other species? How fast does it evolve?
Assume we try Smith-Waterman:
Our new gene: 10^4 bp
The entire genomic database: 10^11 bp
10^4 × 10^11 = 10^15 cells
Indexed Alignment
(BLAST- Basic Local Alignment Search Tool)
Main idea:
1. Construct a dictionary of all words in the query
2. Initiate a local alignment for each word match between query and DB
Running Time: O(MN) in worst case
However, in practice orders of magnitude faster than Smith-Waterman
query
DB
Step 1 (Basic): Construct dictionary of query words
Query indexed by all words of size k (k = 3 in our examples; k ≈ 11 in practice)
BLAST
Query:
AGGCTATCACCTGACCTCCAGGCCGATGCCCTAGCTATCACGACCGCG…
AAA AAC AAG AAT ACA ... AGG ... CTA ... GCT ... GGC ...
INDEX
AGG GGC GCT CTA TAT ATC TCA CAC ACC CCT CTG TGA GAC ACC CCT CTC TCC CCA CAG AGG GGC GCC CCG CGA GAT ATG TGC GCC CCC CCT CTA TAG AGC GCT CTA TAT ATC TCA CAC ACG CGA GAC ACC CCG CGC GCG
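Step 1 can be sketched as a dictionary from each k-long word to the query positions where it starts. The function name `build_word_index` is hypothetical; the input is the query prefix shown on the slides, with k = 3.

```python
from collections import defaultdict


def build_word_index(query, k=3):
    """Map every length-k word of the query to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(query) - k + 1):
        index[query[pos:pos + k]].append(pos)
    return dict(index)


index = build_word_index("AGGCTATCACCTGACCTCCAGGCCG")
print(index["AGG"])  # [0, 19]
```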
Step 1 (Advanced): Relative Generation
For each query word, generate all relatives
A relative is a word with alignment score ≥ T
All relatives are updated to point to new location
BLAST
Query word:
GGC
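Threshold: T = 28
Candidate words and scores:
GGC 30
AGC 28
GAC 28
AAC 26 (below T, not a relative)
GGT 25 (below T, not a relative)
GGA 24 (below T, not a relative)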
...
Query:
AGGCTATCACCTGACCTCCAGGCCG…
... AGC ... GAC ... GGC ...
INDEX
Step 2: Searching
Search through database linearly, one word at a time
Initiate alignment with all occurrences of that word in query
BLAST
Genomic database:
AGCTAGCTGCTAGTCAGTCGATGCATGCTACTAGCTGCGATCGTCGTC…
AAA AAC AAG AAT ACA ... AGC ... CTA ... GCT ... TAG ...
INDEX
Query:
AGC GCT
AGC GCT
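A minimal sketch of the search step, assuming exact word matches only (no relatives): the query is indexed, the database is scanned one word at a time, and every match is reported as a (query position, database position) pair where an alignment would be initiated. The function name `find_word_hits` is hypothetical.

```python
def find_word_hits(query, database, k=3):
    """Return (query_pos, db_pos) for every k-long word shared by query and database."""
    index = {}
    for q in range(len(query) - k + 1):
        index.setdefault(query[q:q + k], []).append(q)
    hits = []
    for d in range(len(database) - k + 1):          # linear scan of the database
        for q in index.get(database[d:d + k], []):  # all occurrences in the query
            hits.append((q, d))
    return hits


print(find_word_hits("AGCT", "TTAGCTT"))  # [(0, 2), (1, 3)]
```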
[Figure: alignment grid with ACGAAGTAAGGTCCAGT along the horizontal axis and the second sequence along the vertical axis]
Alignment Extension
Example:
The matching word GGT initiates an alignment
Extension to the left and right with no gaps until the alignment score falls a certain threshold S below the best score so far
Output:
GTAAGGTCC
GTTAGGTCC
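The extension rule can be sketched as follows. The scores (match +1, mismatch -1), the drop threshold S = 2, and the function name are assumptions for illustration; the second sequence is read off the vertical axis of the slide's grid, which is also an assumption about the figure.

```python
def extend_ungapped(a, b, i, j, k, match=1, mismatch=-1, s_drop=2):
    """Extend the k-long seed a[i:i+k] == b[j:j+k] left and right without gaps."""
    def walk(pa, pb, step):
        best = score = 0
        best_len = length = 0
        while 0 <= pa < len(a) and 0 <= pb < len(b):
            score += match if a[pa] == b[pb] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score > s_drop:   # fell more than S below the best so far
                break
            pa += step
            pb += step
        return best_len                   # keep only the best-scoring extension

    left = walk(i - 1, j - 1, -1)
    right = walk(i + k, j + k, +1)
    return a[i - left:i + k + right], b[j - left:j + k + right]


top = "ACGAAGTAAGGTCCAGT"
side = "AGCGTTAGGTCCTTCCC"
print(extend_ungapped(top, side, top.find("GGT"), side.find("GGT"), 3))
# ('GTAAGGTCC', 'GTTAGGTCC')
```

With these assumed scores the GGT seed grows into the same pair of strings shown as the slide's output.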
BLAST
BLAST Algorithm Variations
BLAT- BLAST-Like Alignment Tool
1. Builds index (dictionary) for database, scans linearly through query
2. Alignment extensions allow for gaps as well
[Figure: alignment grid with ACGAAGTAAGGTCCAGT along the horizontal axis and the second sequence along the vertical axis]
Gapped Extensions
Extensions with gaps in a band around anchor
Output:
GTAAGGTCC-AG
GTTAGGTCCTAG
BLAT
Interpreting Results
Sensitivity
Probability of a particular homologous area being identified
Larger k decreases probability (exact match less likely)
Straightforward mathematics
Sensitivity Calculation
Database (genome)
Query
Homologous area:
Suppose k = 7:
Conservation rate: 81%
Mutation rate: 19%
Probability whole word is conserved:
0.81^7 ≈ 23%
Sensitivity Calculation
Database (genome)
Query
Homologous area:
Suppose k = 7:
Words: 10
Probability a particular word is conserved: 23%
Probability at least one word is conserved:
1 - 0.77^10 ≈ 93%
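The arithmetic above can be reproduced with a short sketch. It treats the words as independent, as the slide's calculation does; the function name is hypothetical.

```python
def single_word_sensitivity(conservation, k, n_words):
    """P(at least one of n_words independent k-long words is fully conserved)."""
    p_word = conservation ** k              # every base of the word conserved
    return 1 - (1 - p_word) ** n_words      # complement of "no word conserved"


print(round(0.81 ** 7, 2))                             # 0.23
print(round(single_word_sensitivity(0.81, 7, 10), 2))  # 0.93
```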
Interpreting Results
Specificity
Expected number of alignments initiated by chance
Based on 500 bp query and 3 Gbp database
This is essentially an indication of SPEED
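One rough way to estimate this number is sketched below. The formula is an assumption, not taken from the slides: bases are uniform and independent, so each query word matches each database position with probability 4^-k.

```python
def expected_chance_hits(query_len, db_len, k, alphabet=4):
    """Expected number of exact k-word matches between unrelated random sequences."""
    query_words = query_len - k + 1
    db_words = db_len - k + 1
    return query_words * db_words / alphabet ** k


for k in (7, 11, 15):
    print(k, expected_chance_hits(500, 3_000_000_000, k))
```

Each extra base in the word divides the estimate by about 4, which is why larger k is faster but less sensitive.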
Status Check
Background Recap
Simple Tricks of the Trade: Wildcards, Multiple Words
State of the Art: Seed Patterns, Optimizing Seeds, Multiple Simultaneous Seeds
Exact matches unlikely for larger values of k
Include variants with one “wildcard” placed in each position
GTA
*TA
G*A
GT*
Relative Generation
Any match: 1
Any mismatch: 0
Threshold: T = k - 1
Wildcards
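A small sketch of both views of the wildcard trick: the three wildcard variants of GTA, and the equivalent relative generation with match = 1, mismatch = 0 and T = k - 1. The function names are hypothetical, and the brute-force enumeration over all 4^k words is only meant for tiny k.

```python
from itertools import product


def wildcard_variants(word, wildcard="*"):
    """One variant per position, with that position replaced by a wildcard."""
    return [word[:i] + wildcard + word[i + 1:] for i in range(len(word))]


def relatives(word, threshold, alphabet="ACGT"):
    """All words scoring >= threshold against word (match 1, mismatch 0)."""
    found = []
    for candidate in product(alphabet, repeat=len(word)):
        score = sum(x == y for x, y in zip(word, candidate))
        if score >= threshold:
            found.append("".join(candidate))
    return found


print(wildcard_variants("GTA"))        # ['*TA', 'G*A', 'GT*']
print(len(relatives("GTA", 3 - 1)))    # 10: GTA itself plus 3 positions x 3 substitutions
```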
Wildcard Results
Perfect match:
Wildcards:
For the same sensitivity, wildcard variant is about 440 times faster
Wildcard Results
Perfect match:
Wildcards:
For the same sensitivity, wildcard variant is about 40 times faster
Status Check
Background Recap
Simple Tricks of the Trade: Wildcards, Multiple Words
State of the Art: Seed Patterns, Optimizing Seeds, Multiple Simultaneous Seeds
N perfect matches
Same separation in query and database
Database:
TGCTAGCTACGATCTGCAGTGCGTAATCT…
Query:
TCATTACATCGTGACTTGCAGTCGTCCAG…
All separations less than distance W
Multiple Words
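[Figure: TAC and TGC lie 12 bp apart in the query. The database TAC and TGC that are also 12 bp apart: INITIATE ALIGNMENT. The database TAC and TGC that are 7 bp apart: NO INITIATION]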
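A sketch of the N = 2 case, assuming exact word hits and using the slide's two sequences. Two hits have the same separation in query and database exactly when they share a diagonal (database position minus query position), so hits are grouped by diagonal. The function name and the choice W = 13 are assumptions for the example.

```python
from collections import defaultdict


def two_hit_seeds(query, db, k, w):
    """Pairs of non-overlapping word hits on one diagonal, less than w apart."""
    index = defaultdict(list)
    for q in range(len(query) - k + 1):
        index[query[q:q + k]].append(q)
    by_diagonal = defaultdict(list)
    for d in range(len(db) - k + 1):
        for q in index.get(db[d:d + k], []):
            by_diagonal[d - q].append(q)      # same diagonal = same separation
    seeds = []
    for diagonal, positions in by_diagonal.items():
        positions.sort()
        for first, second in zip(positions, positions[1:]):
            if k <= second - first < w:
                seeds.append((first, second, diagonal))
    return seeds


db = "TGCTAGCTACGATCTGCAGTGCGTAATCT"
query = "TCATTACATCGTGACTTGCAGTCGTCCAG"
print((4, 16, 3) in two_hit_seeds(query, db, k=3, w=13))  # True: TAC and TGC, 12 bp apart
```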
Intuition Behind Multiple Words
Database (genome)
Query
Homologous area:
If we use a single word of size k = 16:
Conservation rate: 81%
Mutation rate: 19%
Probability whole word is conserved:
0.81^16 ≈ 3%
Intuition Behind Multiple Words
Database (genome)
Query
Homologous area:
If we use a single word of size k = 16:
Words: 10
Probability a particular word is conserved: 3%
Probability at least one word is conserved:
1 - (1 - 0.81^16)^10 ≈ 29% (the rounded form 1 - 0.97^10 gives about 26%)
Intuition Behind Multiple Words
Database (genome)
Query
If we use a single word of size k = 16: Probability of a match = 29%
If we use N = 2 words of size k = 8:
Homologous area:
Probability a particular word is conserved:
0.81^8 ≈ 19%
Words: 20
Probability at least two words are conserved:
1 - 0.81^20 - 20 × 0.19 × 0.81^19 ≈ 91% (here 0.81 = 1 - 0.19 is the probability that a word is not conserved)
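The comparison can be checked with a binomial tail, again treating the words as independent as the slides do. The function name is hypothetical; unrounded word probabilities are used.

```python
from math import comb


def prob_at_least(n_hits, n_words, p_word):
    """P(at least n_hits of n_words independent words are conserved)."""
    too_few = sum(
        comb(n_words, j) * p_word ** j * (1 - p_word) ** (n_words - j)
        for j in range(n_hits)
    )
    return 1 - too_few


single = prob_at_least(1, 10, 0.81 ** 16)   # one word of size 16, 10 words
double = prob_at_least(2, 20, 0.81 ** 8)    # two words of size 8, 20 words
print(f"{single:.3f} {double:.3f}")         # about 0.295 and 0.908
```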
Intuition Behind Multiple Words
Database (genome)
Query
If we use a single word of size k = 16: Probability of a match = 29%
If we use N = 2 words of size k = 8: Probability of a match = 91%
[Figure: the homologous area tiled by 16-long words (3% each) and by 8-long words (19% each)]
Multiple Words Results
Single perfect match:
Multiple perfect matches:
For the same sensitivity, multiple words variant about 1,200 times faster
Multiple Words Results
Single perfect match:
Multiple perfect matches:
For the same sensitivity, multiple words variant about 75,000 times faster
Status Check
Background Recap
Simple Tricks of the Trade: Wildcards, Multiple Words
State of the Art: Seed Patterns, Optimizing Seeds, Multiple Simultaneous Seeds
Contiguous word (k = 10)
GTCAGTACGTCAGTCGTGCGTCGTCTAG
××××××××××
Seed pattern
GTCAGTACGTCAGTCGTGCGTCGTCTAG
××∙×∙×∙∙∙×∙×∙∙∙×∙×∙∙∙×∙∙∙∙∙×
Seed Patterns
GTCAGTACGT
GTATTAGGCG
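A sketch of how the two index keys above are read off the sequence. The pattern is written with 1 for × and 0 for ∙ (an encoding chosen for the example), and the function name is hypothetical.

```python
def seed_key(sequence, pattern, start=0):
    """Concatenate the bases under the 1-positions of the pattern."""
    return "".join(
        sequence[start + i] for i, mark in enumerate(pattern) if mark == "1"
    )


sequence = "GTCAGTACGTCAGTCGTGCGTCGTCTAG"
contiguous = "1111111111"                      # k = 10, span 10
spaced = "1101010001010001010001000001"        # k = 10, span 28

print(seed_key(sequence, contiguous))  # GTCAGTACGT
print(seed_key(sequence, spaced))      # GTATTAGGCG
```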
Patterns increase the likelihood of at least one match within a long conserved region
[Figure: seeds overlapping shifted copies of themselves, Consecutive Positions vs. Non-Consecutive Positions; labels: 3 common, 5 common, 7 common, 6 common]
On a 100-long 70% conserved region:
                          Consecutive   Non-consecutive
Expected # hits:          1.07          0.97
Prob[at least one hit]:   0.30          0.47
Intuition Behind Seed Patterns
Status Check
Background Recap
Simple Tricks of the Trade: Wildcards, Multiple Words
State of the Art: Seed Patterns, Optimizing Seeds, Multiple Simultaneous Seeds
Designing Seeds
Is this a good seed pattern?
×∙×∙×∙×∙×∙×∙×∙×∙×∙×
 ×∙×∙×∙×∙×∙×∙×∙×∙×∙×    shifted by one: 0 matches!
  ×∙×∙×∙×∙×∙×∙×∙×∙×∙×   shifted by two: 9 matches
Not so much …
Designing Seeds
Remember three-periodicity in exons?
Higher mutation rate in last position of codon
A decent pattern is thus π110:
××∙××∙××∙××∙××∙×
But isn’t regularity bad?
Human    CCTGTT (Proline, Valine)
Mouse    CCAGTC (Proline, Valine)
Rat      CCAGTC (Proline, Valine)
Dog      CCGGTA (Proline, Valine)
Chicken  CCCGTG (Proline, Valine)
××∙×∙×∙∙∙×∙×∙∙∙×∙×∙∙∙×∙∙∙∙∙×
Optimizing Seeds
Hard problem
No efficient solution known for fixed word size k and max span s
××∙×∙×∙∙∙×∙×∙∙∙×∙×∙∙∙×∙∙∙∙∙×
k = 10
1 2 3 4 5 6 7 8 9 10
s = 28
How sensitive is a specific seed π?
Construct finite-state automaton Aπ that accepts strings containing it
Regular expression: (0 + 1)* 1 (0 + 1) 1 (0 + 1)*
Dynamic program to convert into DFA
Computing Detection Probabilities
[Figure: DFA for seed π101 (×∙×), running from START to the accepting state YES, with edges labeled 0 and 1]
Computing Detection Probabilities
Compute probability a sequence of length L will be accepted by the DFA
Markov model M (nth order) which dictates conservation pattern
Dynamic program: θ(k L 2^(s - k + n))
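[Figure: two-state conservation model. CDS: P(1) = 0.9, P(0) = 0.1. NC: P(1) = 0.8, P(0) = 0.2. Arrows labeled 0.95, 0.95, 0.05, 0.05. Example conservation string of length L: 1 0 1 1 0 1 1 1 0 1 1 1]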
××∙×∙∙×××∙∙××××∙××∙××∙∙∙××××∙××∙××∙∙×∙×
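A simplified sketch of the detection-probability computation. It is not the automaton construction from the slides: it assumes a 0th-order model (each position conserved independently with probability p) and tracks the last s - 1 conservation bits directly, so it uses up to 2^(s-1) states and is only practical for short seeds. Seeds are written with 1 for × and 0 for ∙; the function name is hypothetical.

```python
def detection_probability(seed, length, p):
    """P(seed matches somewhere in `length` i.i.d. positions, each 1 with prob p)."""
    span = len(seed)
    required = [i for i, mark in enumerate(seed) if mark == "1"]
    no_hit = {"": 1.0}   # recent bits -> probability mass with no hit so far
    for _ in range(length):
        step = {}
        for recent, mass in no_hit.items():
            for bit, bit_prob in (("1", p), ("0", 1.0 - p)):
                window = recent + bit
                if len(window) == span and all(window[i] == "1" for i in required):
                    continue                     # hit: mass moves to the accepting state
                key = window[-(span - 1):] if span > 1 else ""
                step[key] = step.get(key, 0.0) + mass * bit_prob
        no_hit = step
    return 1.0 - sum(no_hit.values())


print(detection_probability("101", 3, 0.5))  # 0.25
print(detection_probability("101", 4, 0.5))  # 0.4375
```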
Mandala Algorithm
Fast, practical seed design
Start with random seed
Jump to best neighbor (moving one × to unused ∙)
Keep jumping until no better neighbors (greedy hill climbing)
Finds local optimum
Random restarts to try to find globally optimal seed
[Figure: greedy walk between neighboring seeds]
×∙×∙××∙××∙∙∙×
×∙∙∙××∙××∙∙××
×∙∙×∙×∙××∙∙××
×∙∙×∙×∙×××∙∙×   NO BETTER NEIGHBORS
Other seeds shown: ××∙×∙∙××××∙∙×, ××∙××∙××∙∙×∙×
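The greedy search described above can be sketched as follows. This is an illustration of hill climbing with random restarts, not the actual Mandala implementation. Assumptions: seeds use 1 for × and 0 for ∙, both end positions stay fixed at 1 so the span is constant, and the score is a simple detection probability under independent conservation (length 20, p = 0.7). All names are hypothetical.

```python
import random
from functools import lru_cache


def hit_probability(seed, length, p):
    """P(seed hits) on `length` i.i.d. positions, each conserved with prob p."""
    span = len(seed)
    required = [i for i, mark in enumerate(seed) if mark == "1"]
    no_hit = {"": 1.0}
    for _ in range(length):
        step = {}
        for recent, mass in no_hit.items():
            for bit, bit_prob in (("1", p), ("0", 1.0 - p)):
                window = recent + bit
                if len(window) == span and all(window[i] == "1" for i in required):
                    continue
                key = window[-(span - 1):] if span > 1 else ""
                step[key] = step.get(key, 0.0) + mass * bit_prob
        no_hit = step
    return 1.0 - sum(no_hit.values())


def neighbors(seed):
    """Seeds obtained by moving one interior 1 onto an unused interior 0."""
    inner = range(1, len(seed) - 1)
    moved = []
    for i in (i for i in inner if seed[i] == "1"):
        for j in (j for j in inner if seed[j] == "0"):
            chars = list(seed)
            chars[i], chars[j] = "0", "1"
            moved.append("".join(chars))
    return moved


def hill_climb(seed, score):
    """Jump to the best neighbor until no neighbor is better (local optimum)."""
    while True:
        candidates = neighbors(seed)
        best = max(candidates, key=score) if candidates else seed
        if score(best) <= score(seed):
            return seed, score(seed)
        seed = best


def design_seed(k, s, length=20, p=0.7, restarts=5, rng_seed=0):
    """Random restarts of hill_climb; returns the best (seed, score) found."""
    rng = random.Random(rng_seed)
    score = lru_cache(maxsize=None)(lambda seed: hit_probability(seed, length, p))
    results = []
    for _ in range(restarts):
        inner = set(rng.sample(range(1, s - 1), k - 2))
        start = "".join("1" if i in inner or i in (0, s - 1) else "0" for i in range(s))
        results.append(hill_climb(start, score))
    return max(results, key=lambda result: result[1])


print(design_seed(k=4, s=7))
```

Each restart ends at a local optimum only; more restarts raise the chance of reaching the global one.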
Mandala Results
Very fast: 20 seconds on k = 11, s = 22
Within 1% of true optimum in all trials
Mandala pattern for non-coding DNA (k = 11):
××××∙×××∙∙×∙×××
Mandala pattern for coding DNA (k = 11):
×××∙∙∙∙∙××∙××∙××∙××
Status Check
Background Recap
Simple Tricks of the Trade: Wildcards, Multiple Words
State of the Art: Seed Patterns, Optimizing Seeds, Multiple Simultaneous Seeds
Single seed
GTCAGTACGTCAGTCGTGCGTCGTCTAG
××∙×∙×∙∙∙×∙×∙∙∙×∙×∙∙∙×∙∙∙∙∙×
×××∙×∙∙×∙∙∙∙×∙∙∙∙∙∙∙×∙∙∙∙×××
∙∙∙××∙××∙××∙××∙××∙∙∙∙∙∙∙∙∙∙∙
How much longer is it taking us to index?
GTCAGTACGTCAGTCGTGCGTCGTCTAG
Multiple Simultaneous Seeds
How much longer is it taking us to search?
Multiple Simultaneous Seeds
AGACTCGTGTGTCGCGTTAGGTATTAGGCG
GTCAGTACGTCAGTCGTGCGTCGTCTAG
... AGCCAGTCAG ... GACAGTCCAG ... GGCATCATCA ...
THREE INDEXES:
××∙×∙×∙∙∙×∙×∙∙∙×∙×∙∙∙×∙∙∙∙∙×
×××∙×∙∙×∙∙∙∙×∙∙∙∙∙∙∙×∙∙∙∙×××
∙∙∙××∙××∙××∙××∙××∙∙∙∙∙∙∙∙∙∙∙
... AGCCAGTCAG ... GACAGTCCAG ... GGCATCATCA ...
... AGCCAGTCAG ... GACAGTCCAG ... GGCATCATCA ...
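A sketch of searching with several seeds at once: one index per seed, and a hit in any index initiates an alignment. Indexing the database (rather than the query) is an assumption here, as are the short toy seeds and sequences; seeds use 1 for × and 0 for ∙, and all names are hypothetical.

```python
from collections import defaultdict


def seed_key(sequence, pos, seed):
    """Bases under the 1-positions of the seed, starting at pos."""
    return "".join(sequence[pos + i] for i, mark in enumerate(seed) if mark == "1")


def build_indexes(database, seeds):
    """One dictionary per seed: key -> database positions."""
    indexes = []
    for seed in seeds:
        index = defaultdict(list)
        for pos in range(len(database) - len(seed) + 1):
            index[seed_key(database, pos, seed)].append(pos)
        indexes.append(index)
    return indexes


def search(query, seeds, indexes):
    """All (seed, query_pos, db_pos) hits over every seed's index."""
    hits = set()
    for seed, index in zip(seeds, indexes):
        for q in range(len(query) - len(seed) + 1):
            for d in index.get(seed_key(query, q, seed), ()):
                hits.add((seed, q, d))
    return hits


seeds = ["1101", "1011"]
database = "GGGACGTACTTT"      # contains ACGTAC
query = "TTACCTACGG"           # contains ACCTAC, one mismatch
hits = search(query, seeds, build_indexes(database, seeds))
print(("1101", 2, 3) in hits, ("1011", 3, 4) in hits)  # True True
```

Indexing and searching both cost one pass per seed, so the work grows roughly linearly with the number of seeds, which is the trade-off the two questions above point at.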
Mandala still works
Just change definition of neighbor:
Pick one seed
Pick one ×
Move to a ∙
Slower convergence
×∙×∙××∙××∙∙∙××∙×∙××∙××∙∙∙××∙×∙××∙××∙∙∙××∙∙∙××∙××∙∙××∙××∙××∙××∙××∙
×××∙×∙∙×∙××∙×
Future Extensions
Combine all of the above!
Multiple Wildcard Matches of Multiple Patterns?
*GT∙G∙T∙∙∙TGTA∙C∙∙G∙∙∙∙∙GC∙AG∙T
*TC∙T∙T∙∙∙GTCG∙C∙∙C∙∙∙∙∙TC∙AC∙G
ACGATGGCTCCTGACGAGCTAGCTGATGAGCCGTAGCAGACGTGTAACGGTCGCTCCTGCAGCGA
ACGATGCATCGTAGCTAGCTAGCTGATCAGCTGTAGTAGTCGTCTACTGGATGCTGCTGCAGCGT
Database
Query