Sequence Similarity
The Viterbi algorithm for alignment
• Compute the following matrices (DP) M(i, j): most likely alignment of x1…xi with y1…yj ending in state M
I(i, j): most likely alignment of x1…xi with y1…yj ending in state I
J(i, j): most likely alignment of x1…xi with y1…yj ending in state J
M(i, j) = log( Prob(xi, yj) ) +
max{ M(i-1, j-1) + log(1-2),
I(i-1, j) + log(1-), J(i, j-1) + log(1-) }
I(i, j) = max{ M(i-1, j) + log ,
I(i-1, j) + log }
MP(xi, yj)
IP(xi)
JP(yj)
log(1 – 2)
log(1 – )
log log log log
log(1 – )
log Prob(xi, yj)
One way to view the state paths – State M
x1
xm
y1 yn……
……
State I
x1
xm
y1 yn……
……
State J
x1
xm
y1 yn……
……
Putting it all together
States I(i, j) are connected with states J and M (i-1, j)
States J(i, j) are connected with states I and M (i-1, j)
States M(i, j) are connected with states J and I (i-1, j-1)
x1
xm
y1 yn……
……
Putting it all together
States I(i, j) are connected with states J and M (i-1, j)
States J(i, j) are connected with states I and M (i-1, j)
States M(i, j) are connected with states J and I (i-1, j-1)
Optimal solution is the best scoring path from top-left to bottom-right corner
This gives the likeliest alignment according to our HMM
x1
xm
y1 yn……
……
Yet another way to represent this model
Mx1 Mxm
Sequence X
BEGIN Iy Iy
Ix Ix
END
We are aligning, or threading, sequence Y through sequence X
Every time yj lands in state xi, we get substitution score s(xi, yj)
Every time yj is gapped, or some xi is skipped, we pay gap penalty
From this model, we can compute additional statistics
• P(xi ~ yj | x, y) The probability that positions i, j align, given that sequences x and y align
P(xi ~ yj | x, y) = α: alignmentP(α | x, y) 1(xi ~ yj in α)
We will not cover the details, but
this quantity can also be
calculated with DP
MP(xi, yj)
IP(xi)
JP(yj)
log(1 – 2)
log(1 – )
log log log log
log(1 – )
log Prob(xi, yj)
Fast database search – BLAST
(Basic Local Alignment Search Tool)
Main idea:
1. Construct a dictionary of all the words in the query
2. Initiate a local alignment for each word match between query and DB
Running Time: O(MN)
However, orders of magnitude faster than Smith-Waterman
query
DB
BLAST Original Version
• Dictionary:
All words of length k (~11 nucl.; ~4 aa)
Alignment initiated between words of alignment score T (typically T = k)
• Alignment:
Ungapped extensions until score below statistical threshold
• Output:
All local alignments with score > statistical threshold
……
……
query
DB
query
scan
PSI-BLAST
Given a sequence query x, and database D
1. Find all pairwise alignments of x to sequences in D
2. Collect all matches of x to y with some minimum significance
3. Construct position specific matrix M• Each sequence y is given a weight so that many similar sequences cannot have
much influence on a position (Henikoff & Henikoff 1994)
4. Using the matrix M, search D for more matches
5. Iterate 1–4 until convergence
Profile M
BLAST Variants
• BLASTN – genomic sequences• BLASTP – proteins• BLASTX – translated genome versus proteins• TBLASTN – proteins versus translated genomes• TBLASTX – translated genome versus translated genome• PSIBLAST – iterated BLAST search
http://www.ncbi.nlm.nih.gov/BLAST
Multiple Sequence Multiple Sequence AlignmentsAlignments
Protein Phylogenies
• Proteins evolve by both duplication and species divergence
Definition
• Given N sequences x1, x2,…, xN: Insert gaps (-) in each sequence xi, such that
• All sequences have the same length L
• Score of the global map is maximum
• A faint similarity between two sequences becomes significant if present in many
• Multiple alignments can help improve the pairwise alignments
Scoring Function: Sum Of Pairs
Definition: Induced pairwise alignment
A pairwise alignment induced by the multiple alignment
Example:
x: AC-GCGG-C y: AC-GC-GAG z: GCCGC-GAG
Induces:
x: ACGCGG-C; x: AC-GCGG-C; y: AC-GCGAGy: ACGC-GAC; z: GCCGC-GAG; z: GCCGCGAG
Sum Of Pairs (cont’d)
• Heuristic way to incorporate evolution tree:
Human
Mouse
Chicken• Weighted SOP:
S(m) = k<l wkl s(mk, ml)
wkl: weight decreasing with distance
Duck
A Profile Representation
• Given a multiple alignment M = m1…mn Replace each column mi with profile entry pi
• Frequency of each letter in • # gaps• Optional: # gap openings, extensions, closings
Can think of this as a “likelihood” of each letter in each position
- A G G C T A T C A C C T G T A G – C T A C C A - - - G C A G – C T A C C A - - - G C A G – C T A T C A C – G G C A G – C T A T C G C – G G
A 1 1 .8 C .6 1 .4 1 .6 .2G 1 .2 .2 .4 1T .2 1 .6 .2- .2 .8 .4 .8 .4
Multiple Sequence Alignments
Algorithms
Multidimensional DP
Generalization of Needleman-Wunsh:
S(m) = i S(mi)
(sum of column scores)
F(i1,i2,…,iN): Optimal alignment up to (i1, …, iN)
F(i1,i2,…,iN) = max(all neighbors of cube)(F(nbr)+S(nbr))
• Example: in 3D (three sequences):
• 7 neighbors/cell
F(i,j,k) = max{ F(i-1,j-1,k-1)+S(xi, xj, xk),F(i-1,j-1,k )+S(xi, xj, - ),F(i-1,j ,k-1)+S(xi, -, xk),F(i-1,j ,k )+S(xi, -, - ),F(i ,j-1,k-1)+S( -, xj, xk),F(i ,j-1,k )+S( -, xj, xk),F(i ,j ,k-1)+S( -, -, xk) }
Multidimensional DP
Running Time:
1. Size of matrix: LN;
Where L = length of each sequence
N = number of sequences
2. Neighbors/cell: 2N – 1
Therefore………………………… O(2N LN)
Multidimensional DP
Running Time:
1. Size of matrix: LN;
Where L = length of each sequence
N = number of sequences
2. Neighbors/cell: 2N – 1
Therefore………………………… O(2N LN)
Multidimensional DP
• How do gap states generalize?
• VERY badly! Require 2N states, one per combination of
gapped/ungapped sequences Running time: O(2N 2N LN) = O(4N LN)
XY XYZ Z
Y YZ
X XZ
Progressive Alignment
• When evolutionary tree is known:
Align closest first, in the order of the tree In each step, align two sequences x, y, or profiles px, py, to generate a new
alignment with associated profile presult
Weighted version: Tree edges have weights, proportional to the divergence in that edge New profile is a weighted average of two old profiles
x
w
y
z
pxy
pzw
pxyzw
Progressive Alignment
• When evolutionary tree is known:
Align closest first, in the order of the tree In each step, align two sequences x, y, or profiles px, py, to generate a new
alignment with associated profile presult
Weighted version: Tree edges have weights, proportional to the divergence in that edge New profile is a weighted average of two old profiles
x
w
y
z
Example
Profile: (A, C, G, T, -)px = (0.8, 0.2, 0, 0, 0)py = (0.6, 0, 0, 0, 0.4)
s(px, py) = 0.8*0.6*s(A, A) + 0.2*0.6*s(C, A) + 0.8*0.4*s(A, -) + 0.2*0.4*s(C, -)
Result: pxy = (0.7, 0.1, 0, 0, 0.2)
s(px, -) = 0.8*1.0*s(A, -) + 0.2*1.0*s(C, -)
Result: px- = (0.4, 0.1, 0, 0, 0.5)
Progressive Alignment
• When evolutionary tree is unknown:
Perform all pairwise alignments Define distance matrix D, where D(x, y) is a measure of evolutionary
distance, based on pairwise alignment Construct a tree Align on the tree
x
w
y
z?
Heuristics to improve alignments
• Iterative refinement schemes
• A*-based search
• Consistency
• Simulated Annealing
• …
Iterative Refinement
One problem of progressive alignment:• Initial alignments are “frozen” even when new evidence comes
Example:
x: GAAGTTy: GAC-TT
z: GAACTGw: GTACTG
Frozen!
Now clear correct y = GA-CTT
Iterative Refinement
Algorithm (Barton-Stenberg):
1. For j = 1 to N,Remove xj, and realign to x1…
xj-1xj+1…xN
2. Repeat 4 until convergence
x
y
z
x,z fixed projection
allow y to vary
Iterative Refinement
Example: align (x,y), (z,w), (xy, zw):
x: GAAGTTAy: GAC-TTAz: GAACTGAw: GTACTGA
After realigning y:
x: GAAGTTAy: G-ACTTA + 3 matchesz: GAACTGAw: GTACTGA
Variant:Refinement on a tree“tree partitioning”
Iterative Refinement
Example: align (x,y), (z,w), (xy, zw):
x: GAAGTTAy: GAC-TTAz: GAACTGAw: GTACTGA
After realigning y:
x: GAAGTTAy: G-ACTTA + 3 matchesz: GAACTGAw: GTACTGA
Iterative Refinement
Example not handled well:
x: GAAGTTAy1: GAC-TTAy2: GAC-TTAy3: GAC-TTA
z: GAACTGAw: GTACTGA
Realigning any single yi changes nothing
Some Resources
http://www.ncbi.nlm.nih.gov/BLAST
BLAST & PSI-BLAST
http://www.ebi.ac.uk/clustalw/
CLUSTALW – most widely used
http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py
MUSCLE – most scalable
http://probcons.stanford.edu/
PROBCONS – most accurate
MUSCLE at a glance
1. Fast measurement of all pairwise distances between sequences • DDRAFT(x, y) defined in terms of # common k-mers (k~3) – O(N2 L logL) time
2. Build tree TDRAFT based on DDRAFT, with a hierarchical clustering method (UPGMA)
3. Progressive alignment over TDRAFT, resulting in multiple alignment MDRAFT
4. Measure distances D(x, y) based on MDRAFT
5. Build tree T based on D
6. Progressive alignment over T, to build M
7. Iterative refinement; for many rounds, do:• Tree Partitioning: Split M on one branch and realign the two resulting profiles• If new alignment M’ has better sum-of-pairs score than previous one, accept
PROBCONS: Probabilistic Consistency-based Multiple
Alignment of Proteins
INSERTINSERT
XXINSERTINSERT
YY
MATCHMATCH
xxiiyyjj
――yyjj
xxii――
INSERTINSERTXX
INSERTINSERTYY
MATCHMATCH
A pair-HMM model of pairwise alignment
Parameterizes a probability distribution, P(A), over all possible alignments of all possible pairs of sequences
Transition probabilities ~ gap penalties
Emission probabilities ~ substitution matrix
ABRACA-DABRAAB-ACARDI---
xxyy
xxiiyyjj
――yyjj
xxii
――
Computing Pairwise Alignments
• The Viterbi algorithm conditional distribution P(α | x, y) reflects model’s uncertainty over the
“correct” alignment of x and y identifies highest probability alignment, αviterbi, in O(L2) time
Caveat: the most likely alignment is not the most accurate Alternative: find the alignment of maximum expected accuracy
P(α)P(α)
P(α | x, y)P(α | x, y)
ααviterbiviterbi
The Lazy-Teacher Analogy
• 10 students take a 10-question true-false quiz• How do you make the answer key?
Approach #1: Use the answer sheet of the best student! Approach #2: Weighted majority vote!
A- AAB A- A
B+ B+B+B- B- C
4. F4. F 4. T 4. F 4. F
4. F4. F 4. F 4. F 4. T
Viterbi vs. Maximum Expected Accuracy (MEA)
Viterbi
• picks single alignment with highest chance of being completely correct
• mathematically, finds the alignment α that maximizes
Eα*[1{α = α*}]
Maximum Expected Accuracy
• picks alignment with highest expected number of correct predictions
• mathematically, finds the alignment α that maximizes
Eα*[accuracy(α, α*)]
AA
4. T A- AAB A- A
B+ B+B+B- B- C
4. F4. F 4. T 4. F 4. F
4. F4. F 4. F 4. F 4. T