Page 1: Sequence Similarity

Sequence Similarity

Page 2: Sequence Similarity

The Viterbi algorithm for alignment

• Compute the following matrices (DP):

M(i, j): most likely alignment of x1…xi with y1…yj ending in state M

I(i, j): most likely alignment of x1…xi with y1…yj ending in state I

J(i, j): most likely alignment of x1…xi with y1…yj ending in state J

M(i, j) = log Prob(xi, yj) +
          max{ M(i-1, j-1) + log(1 - 2δ),
               I(i-1, j-1) + log(1 - ε),
               J(i-1, j-1) + log(1 - ε) }

I(i, j) = max{ M(i-1, j) + log δ,
               I(i-1, j) + log ε }

J(i, j) = max{ M(i, j-1) + log δ,
               J(i, j-1) + log ε }

[Figure: three-state pair HMM. State M emits an aligned pair with probability P(xi, yj); state I emits xi with probability P(xi); state J emits yj with probability P(yj). Transition log-probabilities: M→M log(1 - 2δ), M→I and M→J log δ, I→I and J→J log ε, I→M and J→M log(1 - ε).]
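The recurrences above translate directly into a dynamic program. Below is a minimal Python sketch, assuming generic emission functions log_p_match and log_q and gap parameters delta (δ) and epsilon (ε); the gap-emission term log_q in states I and J is an assumption of this sketch (the slide's recurrence leaves it implicit), and the traceback is omitted for brevity.

```python
import math

def viterbi_pair_hmm(x, y, log_p_match, log_q, delta, epsilon):
    """Viterbi DP for the three-state pair HMM (M, I, J).

    log_p_match(a, b): log emission probability of the aligned pair (a, b)
    log_q(a):          log emission probability of a gapped character
                       (included here as an assumption of this sketch)
    delta, epsilon:    gap-open and gap-extend transition probabilities
    Returns the log-probability of the best alignment (traceback omitted).
    """
    n, m = len(x), len(y)
    NEG = float("-inf")
    M = [[NEG] * (m + 1) for _ in range(n + 1)]  # best path ending in state M
    I = [[NEG] * (m + 1) for _ in range(n + 1)]  # best path ending in state I (gap in y)
    J = [[NEG] * (m + 1) for _ in range(n + 1)]  # best path ending in state J (gap in x)
    M[0][0] = 0.0

    log_1_2d = math.log(1 - 2 * delta)
    log_1_e = math.log(1 - epsilon)
    log_d = math.log(delta)
    log_e = math.log(epsilon)

    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                M[i][j] = log_p_match(x[i - 1], y[j - 1]) + max(
                    M[i - 1][j - 1] + log_1_2d,
                    I[i - 1][j - 1] + log_1_e,
                    J[i - 1][j - 1] + log_1_e)
            if i > 0:
                I[i][j] = log_q(x[i - 1]) + max(
                    M[i - 1][j] + log_d,
                    I[i - 1][j] + log_e)
            if j > 0:
                J[i][j] = log_q(y[j - 1]) + max(
                    M[i][j - 1] + log_d,
                    J[i][j - 1] + log_e)
    return max(M[n][m], I[n][m], J[n][m])

# Toy usage with made-up emission values (illustrative only).
lp = lambda a, b: math.log(0.15 if a == b else 0.03)
lq = lambda a: math.log(0.25)
print(viterbi_pair_hmm("ACGT", "AGT", lp, lq, delta=0.1, epsilon=0.3))
```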

Page 3: Sequence Similarity

One way to view the state paths – State M

[Figure: the (x1…xm) × (y1…yn) alignment grid; edges show the moves that end in state M]

Page 4: Sequence Similarity

State I

[Figure: the same (x1…xm) × (y1…yn) grid; edges show the moves that end in state I]

Page 5: Sequence Similarity

State J

[Figure: the same (x1…xm) × (y1…yn) grid; edges show the moves that end in state J]

Page 6: Sequence Similarity

Putting it all together

States I(i, j) are connected with states I and M at (i-1, j)

States J(i, j) are connected with states J and M at (i, j-1)

States M(i, j) are connected with states M, I and J at (i-1, j-1)

[Figure: the (x1…xm) × (y1…yn) grid with all three kinds of edges combined]

Page 7: Sequence Similarity

Putting it all together

States I(i, j) are connected with states I and M at (i-1, j)

States J(i, j) are connected with states J and M at (i, j-1)

States M(i, j) are connected with states M, I and J at (i-1, j-1)

Optimal solution is the best scoring path from top-left to bottom-right corner

This gives the likeliest alignment according to our HMM

[Figure: the combined grid; the optimal alignment is a path from the top-left to the bottom-right corner]

Page 8: Sequence Similarity

Yet another way to represent this model

[Figure: sequence X laid out as a chain of match states Mx1…Mxm between BEGIN and END, with insert states Ix and Iy attached; sequence Y is threaded through this model]

We are aligning, or threading, sequence Y through sequence X

Every time yj lands in state xi, we get substitution score s(xi, yj)

Every time yj is gapped, or some xi is skipped, we pay a gap penalty

Page 9: Sequence Similarity

From this model, we can compute additional statistics

• P(xi ~ yj | x, y): the probability that positions i and j align, given that sequences x and y align

P(xi ~ yj | x, y) = Σ{α: alignment} P(α | x, y) · 1(xi ~ yj in α)

We will not cover the details, but this quantity can also be calculated with DP

[Figure: the three-state pair HMM from above, with its emission probabilities P(xi, yj), P(xi), P(yj) and transition log-probabilities log(1 - 2δ), log δ, log ε, log(1 - ε)]

Page 10: Sequence Similarity

Fast database search – BLAST

(Basic Local Alignment Search Tool)

Main idea:

1. Construct a dictionary of all the words in the query

2. Initiate a local alignment for each word match between query and DB

Running time: O(MN) in the worst case

In practice, however, orders of magnitude faster than Smith-Waterman

[Figure: words from the query matched against the database; each word match seeds a local alignment]
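As a rough illustration of the two-step idea, the sketch below builds a dictionary of all k-mers in the query and reports every exact word match against a database sequence; each hit is where a local extension would start. The function name, the default k, and the toy example are illustrative, not part of BLAST itself.

```python
from collections import defaultdict

def seed_hits(query, db_seq, k=11):
    """Find all exact k-mer matches between a query and a database sequence.

    Returns a list of (query_pos, db_pos) seed positions, each of which would
    initiate a local alignment extension.
    """
    # 1. Dictionary of all k-mers (words) in the query.
    words = defaultdict(list)
    for i in range(len(query) - k + 1):
        words[query[i:i + k]].append(i)

    # 2. Scan the database sequence and report every word match.
    hits = []
    for j in range(len(db_seq) - k + 1):
        for i in words.get(db_seq[j:j + k], []):
            hits.append((i, j))
    return hits

# Toy example (k lowered so the short strings produce hits).
print(seed_hits("ACGTACGTGA", "TTACGTACGTGACC", k=8))
```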

Page 11: Sequence Similarity

BLAST Original Version

• Dictionary:

All words of length k (~11 nucl.; ~4 aa)

Alignment initiated between words of alignment score T (typically T = k)

• Alignment:

Ungapped extensions until score below statistical threshold

• Output:

All local alignments with score > statistical threshold

[Figure: the query is scanned against the database; word hits are extended into ungapped local alignments]
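The ungapped extension step can be sketched as follows: starting from a word hit, extend left and right without gaps, and stop in each direction once the running score drops a fixed amount below the best score seen so far (an X-drop-style cutoff). The scoring values and the xdrop threshold below are illustrative placeholders, not BLAST's actual parameters.

```python
def extend_ungapped(query, db_seq, qpos, dpos, k, match=1, mismatch=-3, xdrop=10):
    """Extend a k-word seed in both directions without gaps.

    Extension in each direction stops once the running score falls more than
    `xdrop` below the best score seen so far (a common heuristic cutoff).
    Returns (score, query_start, db_start, length) of the extended segment.
    """
    def step_scores(pairs):
        # Walk outward, tracking the best prefix score and its length.
        best, score, best_len, length = 0, 0, 0, 0
        for a, b in pairs:
            score += match if a == b else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            if score < best - xdrop:
                break
        return best, best_len

    seed_score = sum(match if a == b else mismatch
                     for a, b in zip(query[qpos:qpos + k], db_seq[dpos:dpos + k]))

    # Extend to the right of the seed.
    right_best, right_len = step_scores(zip(query[qpos + k:], db_seq[dpos + k:]))
    # Extend to the left of the seed (walking backwards).
    left_best, left_len = step_scores(zip(reversed(query[:qpos]), reversed(db_seq[:dpos])))

    total = seed_score + right_best + left_best
    return total, qpos - left_len, dpos - left_len, k + left_len + right_len
```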

Page 12: Sequence Similarity

PSI-BLAST

Given a sequence query x, and database D

1. Find all pairwise alignments of x to sequences in D

2. Collect all matches of x to y with some minimum significance

3. Construct a position-specific matrix M
   • Each sequence y is given a weight, so that many similar sequences cannot have too much influence on a single position (Henikoff & Henikoff 1994)

4. Using the matrix M, search D for more matches

5. Iterate 1–4 until convergence

Profile M
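A minimal sketch of step 3, assuming the collected matches have already been stacked into equal-length rows aligned to the query. Uniform weights, a flat background, and a single pseudocount stand in for the Henikoff & Henikoff weighting and the substitution-based pseudocounts that PSI-BLAST actually uses.

```python
import math
from collections import Counter

def position_specific_matrix(aligned_matches, weights=None,
                             alphabet="ACDEFGHIKLMNPQRSTVWY", pseudocount=1.0):
    """Build a simple position-specific log-odds matrix from aligned matches.

    aligned_matches: equal-length strings (gaps as '-') aligned to the query.
    weights:         optional per-sequence weights (uniform if omitted).
    Returns one {residue: log-odds score} dict per alignment column.
    """
    if weights is None:
        weights = [1.0] * len(aligned_matches)
    background = 1.0 / len(alphabet)          # flat background, for illustration
    matrix = []
    for col in range(len(aligned_matches[0])):
        counts = Counter()
        total = 0.0
        for seq, w in zip(aligned_matches, weights):
            if seq[col] != "-":
                counts[seq[col]] += w
                total += w
        column = {}
        for a in alphabet:
            freq = (counts[a] + pseudocount * background) / (total + pseudocount)
            column[a] = math.log(freq / background)
        matrix.append(column)
    return matrix
```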

Page 13: Sequence Similarity

BLAST Variants

• BLASTN – genomic sequences
• BLASTP – proteins
• BLASTX – translated genome versus proteins
• TBLASTN – proteins versus translated genomes
• TBLASTX – translated genome versus translated genome
• PSI-BLAST – iterated BLAST search

http://www.ncbi.nlm.nih.gov/BLAST

Page 14: Sequence Similarity

Multiple Sequence Alignments

Page 15: Sequence Similarity

Protein Phylogenies

• Proteins evolve by both duplication and species divergence

Page 16: Sequence Similarity
Page 17: Sequence Similarity
Page 18: Sequence Similarity

Definition

• Given N sequences x1, x2,…, xN: Insert gaps (-) in each sequence xi, such that

• All sequences have the same length L

• Score of the global map is maximum

• A faint similarity between two sequences becomes significant if present in many

• Multiple alignments can help improve the pairwise alignments

Page 19: Sequence Similarity

Scoring Function: Sum Of Pairs

Definition: Induced pairwise alignment

A pairwise alignment induced by the multiple alignment

Example:

x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG

Induces:

x: ACGCGG-C     x: AC-GCGG-C     y: AC-GCGAG
y: ACGC-GAG     z: GCCGC-GAG     z: GCCGCGAG
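The projection in this example is mechanical: keep the two chosen rows and drop every column in which both have a gap. A small sketch (the helper name is illustrative):

```python
def induced_pairwise(msa, a, b):
    """Project rows a and b out of a multiple alignment.

    Drops every column where both rows have a gap, yielding the induced
    pairwise alignment as in the example above.
    """
    ra, rb = msa[a], msa[b]
    kept = [(ca, cb) for ca, cb in zip(ra, rb) if not (ca == "-" and cb == "-")]
    return "".join(c for c, _ in kept), "".join(c for _, c in kept)

msa = ["AC-GCGG-C",   # x
       "AC-GC-GAG",   # y
       "GCCGC-GAG"]   # z
print(induced_pairwise(msa, 0, 1))   # ('ACGCGG-C', 'ACGC-GAG')
```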

Page 20: Sequence Similarity

Sum Of Pairs (cont’d)

• Heuristic way to incorporate evolution tree:

[Figure: evolutionary tree with leaves Human, Mouse, Chicken, Duck]

• Weighted SOP:

  S(m) = Σ{k<l} wkl s(mk, ml)

  wkl: weight decreasing with distance
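A minimal sketch of the weighted sum-of-pairs score S(m) = Σ{k<l} wkl s(mk, ml), using a toy match/mismatch/gap scheme as a stand-in for a real substitution matrix; the example weights are illustrative only.

```python
from itertools import combinations

def pairwise_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score an induced pairwise alignment column by column (toy scoring)."""
    score = 0
    for ca, cb in zip(a, b):
        if ca == "-" and cb == "-":
            continue                     # column dropped in the induced alignment
        elif ca == "-" or cb == "-":
            score += gap
        else:
            score += match if ca == cb else mismatch
    return score

def weighted_sum_of_pairs(msa, weights):
    """S(m) = sum over k < l of w_kl * s(m_k, m_l)."""
    return sum(weights[(k, l)] * pairwise_score(msa[k], msa[l])
               for k, l in combinations(range(len(msa)), 2))

msa = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]
w = {(0, 1): 1.0, (0, 2): 0.5, (1, 2): 0.5}   # weights decreasing with distance (illustrative)
print(weighted_sum_of_pairs(msa, w))
```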

Page 21: Sequence Similarity

A Profile Representation

• Given a multiple alignment M = m1…mn
• Replace each column mi with a profile entry pi, recording:
  • the frequency of each letter in the column
  • the number of gaps
  • optionally: the number of gap openings, extensions, closings

Can think of this as a "likelihood" of each letter at each position

- A G G C T A T C A C C T G
T A G - C T A C C A - - - G C
C A G - C T A C C A - - - G C
C A G - C T A T C A C - G G
C A G - C T A T C G C - G G

Profile (non-zero letter frequencies, by row):
A: 1  1  .8
C: .6  1  .4  1  .6  .2
G: 1  .2  .2  .4  1
T: .2  1  .6  .2
-: .2  .8  .4  .8  .4
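A small sketch of turning alignment columns into profile entries (letter and gap frequencies per column). The toy alignment in the example is only a fragment of the columns shown above, for brevity.

```python
from collections import Counter

def profile(msa, alphabet="ACGT-"):
    """Replace each column of a multiple alignment with its letter frequencies."""
    cols = []
    for i in range(len(msa[0])):
        counts = Counter(row[i] for row in msa)
        # Keep only letters that actually occur in the column.
        cols.append({a: counts[a] / len(msa) for a in alphabet if counts[a]})
    return cols

msa = ["-AGGCT", "TAG-CT", "CAG-CT", "CAG-CT", "CAG-CT"]   # fragment of the alignment above
for col in profile(msa):
    print(col)
```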

Page 22: Sequence Similarity

Multiple Sequence Alignments

Algorithms

Page 23: Sequence Similarity

Multidimensional DP

Generalization of Needleman-Wunsch:

S(m) = Σi S(mi)

(sum of column scores)

F(i1,i2,…,iN): Optimal alignment up to (i1, …, iN)

F(i1, i2, …, iN) = max over all neighbors nbr of the cell { F(nbr) + S(column added by the move from nbr) }

Page 24: Sequence Similarity

• Example: in 3D (three sequences x, y, z):

• 7 neighbors/cell

F(i, j, k) = max{ F(i-1, j-1, k-1) + S(xi, yj, zk),
                  F(i-1, j-1, k  ) + S(xi, yj, - ),
                  F(i-1, j  , k-1) + S(xi, -,  zk),
                  F(i-1, j  , k  ) + S(xi, -,  - ),
                  F(i  , j-1, k-1) + S( -, yj, zk),
                  F(i  , j-1, k  ) + S( -, yj, - ),
                  F(i  , j  , k-1) + S( -, -,  zk) }

Multidimensional DP
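A compact sketch of the three-sequence recurrence with a simple sum-of-pairs column score. It enumerates the 7 moves per cell exactly as above and therefore runs in O(2^3 · L^3) time, so it is only practical for very short sequences; the column scoring values are a toy choice.

```python
from itertools import product

def sp_column(a, b, c, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of a single 3-character column (characters or '-')."""
    score = 0
    for p, q in [(a, b), (a, c), (b, c)]:
        if p == "-" and q == "-":
            continue
        elif p == "-" or q == "-":
            score += gap
        else:
            score += match if p == q else mismatch
    return score

def align3_score(x, y, z):
    """Optimal 3-way alignment score by multidimensional DP (score only)."""
    NEG = float("-inf")
    F = [[[NEG] * (len(z) + 1) for _ in range(len(y) + 1)] for _ in range(len(x) + 1)]
    F[0][0][0] = 0
    moves = [m for m in product((0, 1), repeat=3) if m != (0, 0, 0)]   # the 7 neighbors
    for i in range(len(x) + 1):
        for j in range(len(y) + 1):
            for k in range(len(z) + 1):
                for di, dj, dk in moves:
                    pi, pj, pk = i - di, j - dj, k - dk
                    if pi < 0 or pj < 0 or pk < 0 or F[pi][pj][pk] == NEG:
                        continue
                    col = sp_column(x[pi] if di else "-",
                                    y[pj] if dj else "-",
                                    z[pk] if dk else "-")
                    F[i][j][k] = max(F[i][j][k], F[pi][pj][pk] + col)
    return F[len(x)][len(y)][len(z)]

print(align3_score("GAAC", "GTAC", "GAC"))
```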

Page 25: Sequence Similarity

Running Time:

1. Size of matrix: L^N,

   where L = length of each sequence,

   N = number of sequences

2. Neighbors/cell: 2^N – 1

Therefore: O(2^N L^N)

Multidimensional DP

Page 26: Sequence Similarity

Running Time:

1. Size of matrix: L^N,

   where L = length of each sequence,

   N = number of sequences

2. Neighbors/cell: 2^N – 1

Therefore: O(2^N L^N)

Multidimensional DP

• How do gap states generalize?

• VERY badly! They require 2^N states, one per combination of gapped/ungapped sequences

  Running time: O(2^N 2^N L^N) = O(4^N L^N)

[Figure: the gap-state combinations for three sequences: X, Y, Z, XY, XZ, YZ, XYZ]

Page 27: Sequence Similarity

Progressive Alignment

• When evolutionary tree is known:

  Align closest first, in the order of the tree
  In each step, align two sequences x, y, or two profiles px, py, to generate a new alignment with associated profile presult

  Weighted version:
  Tree edges have weights, proportional to the divergence on that edge
  The new profile is a weighted average of the two old profiles

[Figure: a guide tree with leaves x, y, z, w; aligning x with y gives profile pxy, z with w gives pzw, and aligning pxy with pzw gives pxyzw]

Page 28: Sequence Similarity

Progressive Alignment

• When evolutionary tree is known:

  Align closest first, in the order of the tree
  In each step, align two sequences x, y, or two profiles px, py, to generate a new alignment with associated profile presult

  Weighted version:
  Tree edges have weights, proportional to the divergence on that edge
  The new profile is a weighted average of the two old profiles

[Figure: the same guide tree with leaves x, y, z, w]

Example

Profile alphabet: (A, C, G, T, -)
px = (0.8, 0.2, 0, 0, 0)
py = (0.6, 0, 0, 0, 0.4)

s(px, py) = 0.8*0.6*s(A, A) + 0.2*0.6*s(C, A) + 0.8*0.4*s(A, -) + 0.2*0.4*s(C, -)
Result: pxy = (0.7, 0.1, 0, 0, 0.2)

s(px, -) = 0.8*1.0*s(A, -) + 0.2*1.0*s(C, -)
Result: px- = (0.4, 0.1, 0, 0, 0.5)
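The example's s(px, py) is just the expected substitution score under the two column distributions, and the merged column is their (weighted) average. A sketch with a toy substitution function s standing in for a real scoring matrix:

```python
def profile_column_score(px, py, s, alphabet=("A", "C", "G", "T", "-")):
    """Expected substitution score between two profile columns:
    s(px, py) = sum over characters a, b of px[a] * py[b] * s(a, b)."""
    return sum(px[i] * py[j] * s(alphabet[i], alphabet[j])
               for i in range(len(alphabet))
               for j in range(len(alphabet))
               if px[i] and py[j])

def merge_columns(px, py, wx=0.5, wy=0.5):
    """New profile column as a (weighted) average of the two old columns."""
    return tuple(wx * a + wy * b for a, b in zip(px, py))

# Reproducing the slide's example with a toy scoring function s.
def s(a, b):
    if a == "-" or b == "-":
        return -2
    return 1 if a == b else -1

px = (0.8, 0.2, 0, 0, 0)
py = (0.6, 0, 0, 0, 0.4)
print(profile_column_score(px, py, s))   # 0.48*1 + 0.12*(-1) + 0.40*(-2) ≈ -0.44
print(merge_columns(px, py))             # ≈ (0.7, 0.1, 0, 0, 0.2)
```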

Page 29: Sequence Similarity

Progressive Alignment

• When evolutionary tree is unknown:

  Perform all pairwise alignments
  Define a distance matrix D, where D(x, y) is a measure of evolutionary distance, based on the pairwise alignment
  Construct a tree from D
  Align on the tree

[Figure: four sequences x, y, z, w with an unknown tree topology]

Page 30: Sequence Similarity

Heuristics to improve alignments

• Iterative refinement schemes

• A*-based search

• Consistency

• Simulated Annealing

• …

Page 31: Sequence Similarity

Iterative Refinement

One problem of progressive alignment:
• Initial alignments are "frozen", even when new evidence comes in later

Example:

x: GAAGTT
y: GAC-TT

z: GAACTG
w: GTACTG

The (x, y) alignment is frozen!

Now it is clear that the correct alignment is y = GA-CTT

Page 32: Sequence Similarity

Iterative Refinement

Algorithm (Barton-Sternberg):

1. For j = 1 to N:
   remove xj, and realign it to x1…xj-1 xj+1…xN

2. Repeat step 1 until convergence

[Figure: sequences x, y, z; x and z are held fixed (their projection), while y is allowed to vary]
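A sketch of the remove-and-realign loop. The realign and score arguments are caller-supplied stand-ins (e.g. a profile-to-sequence aligner and a sum-of-pairs scorer); neither is implemented here, so this only illustrates the control flow.

```python
def iterative_refinement(msa, realign, score, max_rounds=10):
    """Barton-Sternberg-style refinement loop (sketch).

    msa:     list of aligned, equal-length rows.
    realign: callable(other_rows, sequence) -> new list of rows; realigns one
             ungapped sequence against the profile of the remaining rows
             (supplied by the caller, not implemented here).
    score:   callable(msa) -> alignment score, e.g. sum-of-pairs.
    """
    best = list(msa)
    best_score = score(best)
    for _ in range(max_rounds):
        improved = False
        for j in range(len(best)):
            others = best[:j] + best[j + 1:]
            sequence = best[j].replace("-", "")       # remove x_j from the alignment
            candidate = realign(others, sequence)     # realign it to x_1..x_{j-1}, x_{j+1}..x_N
            if score(candidate) > best_score:
                best, best_score = candidate, score(candidate)
                improved = True
        if not improved:                              # converged: no single removal helps
            break
    return best
```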

Page 33: Sequence Similarity

Iterative Refinement

Example: align (x,y), (z,w), (xy, zw):

x: GAAGTTA
y: GAC-TTA
z: GAACTGA
w: GTACTGA

After realigning y:

x: GAAGTTA
y: G-ACTTA   (+ 3 matches)
z: GAACTGA
w: GTACTGA

Variant: refinement on a tree ("tree partitioning")

Page 34: Sequence Similarity

Iterative Refinement

Example: align (x,y), (z,w), (xy, zw):

x: GAAGTTA
y: GAC-TTA
z: GAACTGA
w: GTACTGA

After realigning y:

x: GAAGTTA
y: G-ACTTA   (+ 3 matches)
z: GAACTGA
w: GTACTGA

Page 35: Sequence Similarity

Iterative Refinement

Example not handled well:

x:  GAAGTTA
y1: GAC-TTA
y2: GAC-TTA
y3: GAC-TTA

z: GAACTGA
w: GTACTGA

Realigning any single yi changes nothing

Page 36: Sequence Similarity

Some Resources

http://www.ncbi.nlm.nih.gov/BLAST

BLAST & PSI-BLAST

http://www.ebi.ac.uk/clustalw/

CLUSTALW – most widely used

http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py

MUSCLE – most scalable

http://probcons.stanford.edu/

PROBCONS – most accurate

Page 37: Sequence Similarity

MUSCLE at a glance

1. Fast measurement of all pairwise distances between sequences
   • D_DRAFT(x, y) defined in terms of the number of common k-mers (k ~ 3) – O(N^2 L log L) time

2. Build a tree T_DRAFT based on D_DRAFT, with a hierarchical clustering method (UPGMA)

3. Progressive alignment over T_DRAFT, resulting in multiple alignment M_DRAFT

4. Measure distances D(x, y) based on M_DRAFT

5. Build a tree T based on D

6. Progressive alignment over T, to build M

7. Iterative refinement; for many rounds, do:
   • Tree partitioning: split M on one branch and realign the two resulting profiles
   • If the new alignment M' has a better sum-of-pairs score than the previous one, accept it
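Step 1 can be sketched as a distance based on shared k-mer content, which needs no alignment at all; the exact formula MUSCLE uses for D_DRAFT differs in detail, so treat this as an illustration.

```python
from collections import Counter

def kmer_counts(seq, k=3):
    """Multiset of all k-mers in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def draft_distance(x, y, k=3):
    """Draft distance from shared k-mer content (alignment-free).

    1 - (shared k-mers) / (k-mers in the shorter sequence); an illustrative
    stand-in for MUSCLE's D_DRAFT.
    """
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    shared = sum(min(cx[w], cy[w]) for w in cx if w in cy)
    denom = min(sum(cx.values()), sum(cy.values()))
    return 1.0 - shared / denom if denom else 1.0

print(draft_distance("GAAGTTA", "GAACTGA"))
```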

Page 38: Sequence Similarity

PROBCONS: Probabilistic Consistency-based Multiple Alignment of Proteins

[Figure: a pair HMM with states INSERT X, INSERT Y, and MATCH; MATCH emits an aligned pair (xi, yj), INSERT X emits (xi, -), and INSERT Y emits (-, yj)]

Page 39: Sequence Similarity

[Figure: the same three-state pair HMM (INSERT X, INSERT Y, MATCH)]

A pair-HMM model of pairwise alignment

Parameterizes a probability distribution, P(A), over all possible alignments of all possible pairs of sequences

Transition probabilities ~ gap penalties

Emission probabilities ~ substitution matrix

x: ABRACA-DABRA
y: AB-ACARDI---

Page 40: Sequence Similarity

Computing Pairwise Alignments

• The conditional distribution P(α | x, y) reflects the model's uncertainty over the "correct" alignment of x and y

• The Viterbi algorithm identifies the highest-probability alignment, αviterbi, in O(L^2) time

• Caveat: the most likely alignment is not the most accurate

• Alternative: find the alignment of maximum expected accuracy

[Figure: the posterior distribution P(α | x, y) over alignments, with αviterbi at its mode]

Page 41: Sequence Similarity

The Lazy-Teacher Analogy

• 10 students take a 10-question true-false quiz
• How do you make the answer key?

  Approach #1: Use the answer sheet of the best student!
  Approach #2: Weighted majority vote!

[Figure: ten students with grades from A to C; their answers to question 4 are mostly F, with a couple of T's]

Page 42: Sequence Similarity

Viterbi vs. Maximum Expected Accuracy (MEA)

Viterbi

• picks single alignment with highest chance of being completely correct

• mathematically, finds the alignment α that maximizes

Eα*[1{α = α*}]

Maximum Expected Accuracy

• picks alignment with highest expected number of correct predictions

• mathematically, finds the alignment α that maximizes

Eα*[accuracy(α, α*)]

[Figure: the lazy-teacher analogy again; the single best student (grade A) answers question 4 with T, while the weighted majority of the class answers F]
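Given posterior match probabilities P(xi ~ yj | x, y) (obtainable with forward-backward on the pair HMM), the maximum expected accuracy alignment can be found with a Needleman-Wunsch-style DP that maximizes the total posterior of the matched pairs, which is the idea behind MEA-style aligners such as PROBCONS. The sketch below assumes the posterior matrix is supplied; the toy matrix is illustrative.

```python
def mea_alignment(post):
    """Maximum expected accuracy alignment from a posterior match matrix.

    post[i][j] = P(x_i ~ y_j | x, y), the probability that positions i and j
    are aligned. The DP maximizes the total posterior of the matched pairs:
        A(i, j) = max( A(i-1, j-1) + post[i-1][j-1], A(i-1, j), A(i, j-1) )
    Returns (expected accuracy, list of matched index pairs).
    """
    n, m = len(post), len(post[0]) if post else 0
    A = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            A[i][j] = max(A[i - 1][j - 1] + post[i - 1][j - 1],
                          A[i - 1][j],
                          A[i][j - 1])
    # Traceback to recover the matched pairs.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if A[i][j] == A[i - 1][j - 1] + post[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif A[i][j] == A[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return A[n][m], list(reversed(pairs))

# Toy 3x3 posterior matrix (would normally come from forward-backward on the pair HMM).
post = [[0.9, 0.05, 0.0],
        [0.1, 0.7,  0.1],
        [0.0, 0.1,  0.8]]
print(mea_alignment(post))
```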

