
Rapid Global Alignments

Date post: 27-Jan-2016
Page 1: Rapid Global Alignments

Rapid Global Alignments

How to align genomic sequences in (more or less) linear time

Page 2: Rapid Global Alignments

Methods to CHAIN Local Alignments

Sparse Dynamic Programming: O(N log N)

Page 3: Rapid Global Alignments

The Problem: Find a Chain of Local Alignments

(x, y) → (x’, y’)

requires

x < x’ and y < y’

Each local alignment has a weight

FIND the chain with highest total weight

Page 4: Rapid Global Alignments

Sparse DP for rectangle chaining

• 1,…, N: rectangles

• (hj, lj): y-coordinates of rectangle j

• w(j): weight of rectangle j

• V(j): optimal score of chain ending in j

• L: list of triplets (lj, V(j), j)

L is sorted by lj

L is implemented as a balanced binary tree

[Figure: a rectangle with its y-coordinates h and l marked on the y-axis]

Page 5: Rapid Global Alignments

Sparse DP for rectangle chaining

Main idea:

• Sweep through x-coordinates

• To the right of b, anything chainable to a is chainable to b

• Therefore, if V(b) > V(a), rectangle a is “useless” – remove it

• In L, keep rectangles j sorted by increasing lj-coordinates; their V(j) values are then also increasing

[Figure: rectangles a and b with scores V(a) and V(b)]

Page 6: Rapid Global Alignments

Sparse DP for rectangle chaining

Go through rectangle x-coordinates, from left to right:

1. When on the leftmost end of rectangle i, compute V(i)

a. j: rectangle in L, with largest lj < hi

b. V(i) = w(i) + V(j)

2. When on the rightmost end of i, possibly store V(i) in L:

a. j: rectangle in L, with largest lj ≤ li

b. If V(i) > V(j):

i. INSERT (li, V(i), i) in L

ii. REMOVE all (lk, V(k), k) with V(k) ≤ V(i) & lk ≥ li

[Figure: rectangles i and j]
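The sweep described above can be sketched in Python as follows. This is a minimal illustration, not the slides' implementation: the tuple layout `(x_start, x_end, h, l, weight)`, the function name `chain_rectangles`, and the use of parallel sorted lists with `bisect` in place of a balanced binary tree are all assumptions. List insertion and deletion cost O(N) each, so the sketch shows the logic but does not reach the O(N log N) bound. A rectangle with no chainable predecessor gets V(i) = w(i).

```python
from bisect import bisect_left, bisect_right


def chain_rectangles(rects):
    """rects: list of (x_start, x_end, h, l, weight), with h <= l.

    Rectangle j may precede i if x_end_j < x_start_i and l_j < h_i.
    Returns (best_score, chain) where chain lists rectangle indices.
    """
    if not rects:
        return 0, []
    # Two events per rectangle. At equal x, left ends (kind 0) sort
    # before right ends (kind 1), which enforces the strict x order.
    events = []
    for i, (x0, x1, _, _, _) in enumerate(rects):
        events.append((x0, 0, i))
        events.append((x1, 1, i))
    events.sort()

    V = [0] * len(rects)
    pred = [None] * len(rects)
    ls, vs, ids = [], [], []  # L: sorted by l, and therefore also by V

    for _, kind, i in events:
        _, _, h, l, w = rects[i]
        if kind == 0:
            # Left end: best predecessor has the largest l_j < h_i.
            p = bisect_left(ls, h) - 1
            V[i] = w + (vs[p] if p >= 0 else 0)
            pred[i] = ids[p] if p >= 0 else None
        else:
            # Right end: store i only if it beats the entry with largest l_j <= l_i.
            p = bisect_right(ls, l) - 1
            if p < 0 or V[i] > vs[p]:
                q = bisect_left(ls, l)
                # Entries with l_k >= l_i and V(k) <= V(i) are now useless;
                # they form a consecutive run starting at q.
                while q < len(ls) and vs[q] <= V[i]:
                    del ls[q], vs[q], ids[q]
                ls.insert(q, l)
                vs.insert(q, V[i])
                ids.insert(q, i)

    best = max(range(len(rects)), key=V.__getitem__)
    chain, k = [], best
    while k is not None:
        chain.append(k)
        k = pred[k]
    return V[best], chain[::-1]
```

For instance, with five made-up rectangles `[(0, 2, 0, 2, 5), (3, 5, 3, 5, 6), (1, 4, 6, 8, 3), (6, 8, 1, 2, 4), (6, 9, 6, 9, 2)]`, the best chain uses rectangles 0, 1 and 4.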

Page 7: Rapid Global Alignments

Example

[Figure: five rectangles in the x-y plane, labeled "id: weight" as 1: 5, 2: 6, 3: 3, 4: 4, 5: 2; axis ticks at 2, 5, 6, 9, 10, 11, 12, 14, 15, 16]

Page 8: Rapid Global Alignments

Time Analysis

1. Sorting the x-coords takes O(N log N)

2. Going through x-coords: N steps

3. Each of N steps requires O(log N) time:

• Searching L takes log N

• Inserting to L takes log N

• All deletions are consecutive, so log N per deletion

• Each element is deleted at most once: N log N for all deletions

• Recall that INSERT, DELETE, SUCCESSOR, take O(log N) time in a balanced binary search tree

Page 9: Rapid Global Alignments

Putting it All Together: Fast Global Alignment Algorithms

1. FIND local alignments

2. CHAIN local alignments

          FIND           CHAIN
GLASS:    k-mers         hierarchical DP
MUMmer:   suffix tree    sparse DP
Avid:     suffix tree    hierarchical DP
LAGAN:    CHAOS          sparse DP

Page 10: Rapid Global Alignments

LAGAN: Pairwise Alignment

1. FIND local alignments

2. CHAIN local alignments

3. DP restricted around chain
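Step 3 can be illustrated with a banded global alignment. The sketch below is a simplification under stated assumptions: it restricts the DP to a fixed-radius band around the (scaled) main diagonal rather than around a chain of anchors, and the function name `banded_global_score` and the scoring values (match +1, mismatch -1, gap -2) are hypothetical.

```python
def banded_global_score(x, y, radius, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score computed only for cells within `radius`
    of the scaled main diagonal. Returns None if no path fits the band."""
    n, m = len(x), len(y)
    F = {(0, 0): 0}  # sparse table: only in-band cells are stored
    for i in range(n + 1):
        center = i * m // n if n else 0
        lo, hi = max(0, center - radius), min(m, center + radius)
        for j in range(lo, hi + 1):
            if i == 0 and j == 0:
                continue
            candidates = []
            if (i - 1, j - 1) in F:
                s = match if x[i - 1] == y[j - 1] else mismatch
                candidates.append(F[i - 1, j - 1] + s)
            if (i - 1, j) in F:
                candidates.append(F[i - 1, j] + gap)
            if (i, j - 1) in F:
                candidates.append(F[i, j - 1] + gap)
            if candidates:
                F[i, j] = max(candidates)
    return F.get((n, m))
```

The table holds roughly (2·radius + 1)·n cells instead of n·m, which is the source of the savings when the band is narrow.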

Page 11: Rapid Global Alignments

LAGAN

1. Find local alignments

2. Chain: O(N log N), L.I.S. (longest increasing subsequence)

3. Restricted DP
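The L.I.S. connection in step 2 can be sketched as follows, assuming anchors are reduced to points (x, y) of equal weight: sort by x, then take the longest strictly increasing subsequence of the y values. The function name `chain_points` is hypothetical, and this unweighted version is a simplification of weighted chaining.

```python
from bisect import bisect_left


def chain_points(points):
    """Longest chain of points with strictly increasing x and y, O(N log N)."""
    # Ties in x are ordered by decreasing y so that two points with the
    # same x can never both appear in one increasing run of y values.
    pts = sorted(points, key=lambda p: (p[0], -p[1]))
    tails_y = []    # tails_y[k]: smallest final y of any chain of length k + 1
    tails_idx = []  # index in pts of the point achieving tails_y[k]
    prev = [None] * len(pts)
    for idx, (_, y) in enumerate(pts):
        p = bisect_left(tails_y, y)
        prev[idx] = tails_idx[p - 1] if p > 0 else None
        if p == len(tails_y):
            tails_y.append(y)
            tails_idx.append(idx)
        else:
            tails_y[p] = y
            tails_idx[p] = idx
    chain = []
    idx = tails_idx[-1] if tails_idx else None
    while idx is not None:
        chain.append(pts[idx])
        idx = prev[idx]
    return chain[::-1]
```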

Page 12: Rapid Global Alignments

LAGAN: recursive call

• What if a box is too large? Recursive application of LAGAN, with a more sensitive word search

Page 13: Rapid Global Alignments

A trick to save on memory

“necks” have tiny tracebacks

…only store tracebacks

Page 14: Rapid Global Alignments

Multiple Sequence Alignments

Page 15: Rapid Global Alignments

Sequence Comparison

• Introduction

• Comparison
  Homology -- Analogy
  Identity -- Similarity
  Pairwise -- Multiple
  Scoring matrices
  Gap -- indel
  Global -- Local

• Manual alignment, dot plot: visual inspection

• Dynamic programming
  Needleman-Wunsch: exhaustive global alignment
  Smith-Waterman: exhaustive local alignment

• Multiple alignment

• Database search
  BLAST
  FASTA

Page 16: Rapid Global Alignments

Sequence Comparison

Multiple alignment (Multiple sequence alignment: MSA)

Application: Procedure

Extrapolation: Allocation of an uncharacterized sequence to a protein family.

Phylogenetic analysis: Reconstruction of the history of closely related proteins and protein families.

Pattern identification: Identification of regions characteristic of a function by conserved positions.

Domain identification: Turning an MSA into a domain- or protein-family-specific profile may be useful in identifying new or remote family members.

DNA regulatory elements: Turning DNA MSAs of a binding site into a weight matrix may be used in scanning other DNA sequences for potential similar binding sites.

Structure prediction: Good MSAs yield high-quality predictions of secondary structure and help in building 3D models.

PCR analysis: Identification of less degenerate regions of a protein family is useful for fishing out new members by PCR (primer design).

Page 17: Rapid Global Alignments
Page 18: Rapid Global Alignments

Overview

• Definition

• Scoring Schemes

• Algorithms

Page 19: Rapid Global Alignments

Definition

• Given N sequences x1, x2,…, xN: Insert gaps (-) in each sequence xi, such that

• All sequences have the same length L

• Score of the global map is maximum

• A faint similarity between two sequences becomes significant if present in many

• Multiple alignments can help improve the pairwise alignments

Page 20: Rapid Global Alignments

Scoring Function

• Ideally: Find alignment that maximizes probability that sequences evolved from common ancestor, according to some phylogenetic model

• More on phylogenetic models later

[Figure: phylogenetic tree over the sequences x, y, z, w, v with an unknown ancestor marked "?"]

Page 21: Rapid Global Alignments

Scoring Function

• A comprehensive model would have too many parameters, too inefficient to optimize

• Possible simplifications

Ignore phylogenetic tree

Statistically independent columns:

$S(m) = G(m) + \sum_i S(m_i)$

m: alignment matrix

G: function penalizing gaps

Page 22: Rapid Global Alignments

Scoring Function: Sum Of Pairs

Definition: Induced pairwise alignment

A pairwise alignment induced by the multiple alignment

Example:

x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG

Induces:

x: ACGCGG-C     x: AC-GCGG-C     y: AC-GCGAG
y: ACGC-GAG     z: GCCGC-GAG     z: GCCGCGAG

Page 23: Rapid Global Alignments

Sum Of Pairs (cont’d)

• The sum-of-pairs score of an alignment is the sum of the scores of all induced pairwise alignments

$S(m) = \sum_{k<l} s(m_k, m_l)$

$s(m_k, m_l)$: score of induced alignment (k, l)
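A short sketch of the induced pairwise alignment and the sum-of-pairs score, using the x, y, z rows from the earlier example. The scoring values (match +1, mismatch -1, gap -2) and the function names are assumptions chosen for illustration; the slides do not fix a scoring scheme.

```python
from itertools import combinations


def induced_pair(row_a, row_b):
    """Drop the columns where both rows have a gap."""
    cols = [(a, b) for a, b in zip(row_a, row_b) if not (a == "-" and b == "-")]
    return "".join(a for a, _ in cols), "".join(b for _, b in cols)


def pair_score(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Score of the pairwise alignment induced by two rows."""
    a, b = induced_pair(row_a, row_b)
    total = 0
    for p, q in zip(a, b):
        if p == "-" or q == "-":
            total += gap
        elif p == q:
            total += match
        else:
            total += mismatch
    return total


def sum_of_pairs(rows):
    """S(m) = sum over k < l of s(m_k, m_l)."""
    return sum(pair_score(rows[k], rows[l])
               for k, l in combinations(range(len(rows)), 2))


rows = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]
print(induced_pair(rows[0], rows[1]))  # the (x, y) induced alignment
print(sum_of_pairs(rows))
```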

Page 24: Rapid Global Alignments

Sum Of Pairs (cont’d)

• Heuristic way to incorporate evolution tree:

[Figure: evolutionary tree with leaves Human, Mouse, Duck, Chicken]

• Weighted SOP:

$S(m) = \sum_{k<l} w_{kl} \, s(m_k, m_l)$

$w_{kl}$: weight decreasing with distance

Page 25: Rapid Global Alignments

Consensus

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
CAG-CTATCAC--GACCGC----TCGATTTGCTCGAC

CAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

• Find optimal consensus string m* to maximize

$S(m) = \sum_i s(m^*, m_i)$

$s(m_k, m_l)$: score of pairwise alignment (k, l)

Page 26: Rapid Global Alignments

Multiple Sequence Alignments

Algorithms

Page 27: Rapid Global Alignments

Multiple sequence alignment - Computational complexity

[Figure: dynamic programming lattice for multiple alignment, with axes labeled by the sequences V S N S and A N S; example alignment rows "V S N S" and "_ S N A"]

Page 28: Rapid Global Alignments

Multiple sequence alignment - Computational complexity

Alignment of protein sequences with 200 amino acids using dynamic programming

# of sequences    CPU time (approx.)
2                 1 sec
4                 10^4 sec (2.8 hours)
5                 10^6 sec (11.6 days)
6                 10^8 sec (3.2 years)
7                 10^10 sec (317 years)

Page 29: Rapid Global Alignments

Approximate methods for MSA

• Multidimensional dynamic programming (MSA, Lipman 1988)

• Progressive alignments (ClustalW, Higgins 1996; PileUp, Genetics Computer Group (GCG))

• Local alignments (e.g. DiAlign, Morgenstern 1996; lots of others)

• Iterative methods (e.g. PRRP, Gotoh 1996)

• Statistical methods (e.g. Bayesian Hidden Markov Models)

Page 30: Rapid Global Alignments

Multiple sequence alignment - Programs

[Figure: classification of MSA programs (OMA, Combalign, DCA, T-Coffee, Clustal, DiAlign, MSA, Interalign, Prrp, SAM, HMMER, GA, SAGA) by method (multidimensional dynamic programming, progressive, iterative, HMMs, GAs) and by tree-based versus non-tree-based approach]

Page 31: Rapid Global Alignments

Multiple sequence alignment - Computational complexity

Program     Seq type    Alignment       Method                   Comment
ClustalW    Prot/DNA    Global          Progressive              No format limitation; runs on Windows too
PileUp      Prot/DNA    Global          Progressive              Limited by the format and UNIX based
MultAlin    Prot/DNA    Global          Progressive/Iterative    Limited by the format
T-COFFEE    Prot/DNA    Global/local    Progressive              Can be slow

Page 32: Rapid Global Alignments

1. Multidimensional Dynamic Programming

Generalization of Needleman-Wunsch:

$S(m) = \sum_i S(m_i)$

(sum of column scores)

$F(i_1, i_2, \ldots, i_N) = \max_{\text{all neighbors of cube}} \bigl( F(\text{nbr}) + S(\text{nbr}) \bigr)$

Page 33: Rapid Global Alignments

• Example: in 3D (three sequences):

• 7 neighbors/cell

$$
F(i,j,k) = \max \begin{cases}
F(i-1,j-1,k-1) + S(x_i, x_j, x_k) \\
F(i-1,j-1,k) + S(x_i, x_j, -) \\
F(i-1,j,k-1) + S(x_i, -, x_k) \\
F(i-1,j,k) + S(x_i, -, -) \\
F(i,j-1,k-1) + S(-, x_j, x_k) \\
F(i,j-1,k) + S(-, x_j, -) \\
F(i,j,k-1) + S(-, -, x_k)
\end{cases}
$$
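The three-sequence recurrence can be written out directly. The sketch below assumes a sum-of-pairs column score with hypothetical values (match +1, mismatch -1, gap -2, gap against gap 0); the slides leave S unspecified. It returns only the optimal score, without traceback.

```python
from itertools import product


def column_score(a, b, c, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of one alignment column (assumed scoring)."""
    score = 0
    for p, q in ((a, b), (a, c), (b, c)):
        if p == "-" and q == "-":
            continue
        if p == "-" or q == "-":
            score += gap
        else:
            score += match if p == q else mismatch
    return score


# The 7 neighbors: every nonzero choice of which sequences advance.
MOVES = [mv for mv in product((0, 1), repeat=3) if any(mv)]


def align3_score(x, y, z):
    """Optimal score of a global alignment of three sequences."""
    n1, n2, n3 = len(x), len(y), len(z)
    neg = float("-inf")
    F = [[[neg] * (n3 + 1) for _ in range(n2 + 1)] for _ in range(n1 + 1)]
    F[0][0][0] = 0
    for i in range(n1 + 1):
        for j in range(n2 + 1):
            for k in range(n3 + 1):
                if i == j == k == 0:
                    continue
                best = neg
                for di, dj, dk in MOVES:
                    pi, pj, pk = i - di, j - dj, k - dk
                    if pi < 0 or pj < 0 or pk < 0:
                        continue
                    col = (x[pi] if di else "-",
                           y[pj] if dj else "-",
                           z[pk] if dk else "-")
                    best = max(best, F[pi][pj][pk] + column_score(*col))
                F[i][j][k] = best
    return F[n1][n2][n3]
```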


Page 34: Rapid Global Alignments

Running Time:

1. Size of matrix: $L^N$;

Where L = length of each sequence

N = number of sequences

2. Neighbors/cell: $2^N - 1$

Therefore: $O(2^N L^N)$
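As a quick arithmetic check of these counts, the following sketch tabulates cells, neighbors per cell, and their product for a given L and N. The function name `mdp_work` is hypothetical, and the product is a count of cell updates, not a measured running time.

```python
def mdp_work(L, N):
    """Return (cells, neighbors_per_cell, cell_updates) for N sequences of length L."""
    cells = L ** N
    neighbors = 2 ** N - 1
    return cells, neighbors, cells * neighbors


for n in range(2, 6):
    print(n, mdp_work(200, n))
```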


Page 35: Rapid Global Alignments

2. Progressive Alignment

• Multiple Alignment is NP-complete

• Most used heuristic: Progressive Alignment

Algorithm:

1. Align two of the sequences xi, xj

2. Fix that alignment

3. Align a third sequence xk to the alignment xi,xj

4. Repeat until all sequences are aligned

Running Time: $O(N L^2)$
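The four steps above can be sketched as follows. This is a minimal illustration under stated assumptions: sequences are added in input order rather than by similarity, each new sequence is aligned to the fixed alignment by scoring it against every existing row (match +1, mismatch -1, gap -2, gap against gap 0), and the names `add_sequence` and `progressive` are hypothetical.

```python
def pair(a, b, match=1, mismatch=-1, gap=-2):
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def add_sequence(rows, seq):
    """Align seq to the fixed alignment rows; returns rows plus the new row."""
    n, m, k = len(rows[0]), len(seq), len(rows)

    def col(i, ch):          # score of ch against existing column i
        return sum(pair(r[i], ch) for r in rows)

    def new(ch):             # score of ch against a new all-gap column
        return k * pair("-", ch)

    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = F[i - 1][0] + col(i - 1, "-")
    for j in range(1, m + 1):
        F[0][j] = F[0][j - 1] + new(seq[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + col(i - 1, seq[j - 1]),
                          F[i - 1][j] + col(i - 1, "-"),
                          F[i][j - 1] + new(seq[j - 1]))

    # Traceback, building the columns of the enlarged alignment in reverse.
    out = [[] for _ in range(k + 1)]
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + col(i - 1, seq[j - 1]):
            chars = [r[i - 1] for r in rows] + [seq[j - 1]]
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] + col(i - 1, "-"):
            chars = [r[i - 1] for r in rows] + ["-"]
            i -= 1
        else:
            chars = ["-"] * k + [seq[j - 1]]
            j -= 1
        for row, c in zip(out, chars):
            row.append(c)
    return ["".join(reversed(r)) for r in out]


def progressive(seqs):
    """Steps 1-4: align the first two sequences, then add the rest one by one."""
    rows = [seqs[0]]
    for s in seqs[1:]:
        rows = add_sequence(rows, s)
    return rows
```

Each call to `add_sequence` fills an L × L table, and there are N - 1 calls; scoring a cell against all existing rows adds a further factor that the slide's bound does not count.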

Page 36: Rapid Global Alignments

2. Progressive Alignment

• When evolutionary tree is known:

Align closest first, in the order of the tree

Example:

Order of alignments:

1. (x,y)

2. (z,w)

3. (xy, zw)

[Figure: evolutionary tree with leaves x, y, z, w]

Page 37: Rapid Global Alignments

CLUSTALW: progressive alignment

CLUSTALW: most popular multiple protein alignment

Algorithm:

1. Find all dij: alignment dist (xi, xj)

2. Construct a tree

(Neighbor-joining hierarchical clustering)

3. Align nodes in order of decreasing similarity

+ a large number of heuristics

Page 38: Rapid Global Alignments

CLUSTALW & the CINEMA viewer

Page 39: Rapid Global Alignments

MLAGAN: progressive alignment of DNA

Given N sequences, phylogenetic tree

Align pairwise, in order of the tree (LAGAN)

[Figure: phylogenetic tree with leaves Human, Baboon, Mouse, Rat]

Page 40: Rapid Global Alignments

MLAGAN: main steps

Given a collection of sequences, and a phylogenetic tree

1. Find local alignments for every pair of sequences x, y

2. Find anchors between every pair of sequences, similar to LAGAN anchoring

3. Progressive alignment

• Multi-Anchoring based on reconciling the pairwise anchors

• LAGAN-style limited-area DP

4. Optional refinement steps

Page 41: Rapid Global Alignments

MLAGAN: multi-anchoring

To anchor the (X/Y) and (Z) alignments:

[Figure: pairwise anchor sets XZ and YZ combined into anchors between the X/Y alignment and Z]

Page 42: Rapid Global Alignments

Heuristics to improve multiple alignments

• Iterative refinement schemes

• A*-based search

• Consistency

• Simulated Annealing

• …

Page 43: Rapid Global Alignments

Iterative Refinement

One problem of progressive alignment:

• Initial alignments are “frozen” even when new evidence comes

Example:

x: GAAGTT
y: GAC-TT     Frozen!

z: GAACTG
w: GTACTG

Now it is clear that the correct alignment is y = GA-CTT

Page 44: Rapid Global Alignments

Iterative Refinement

Algorithm (Barton-Sternberg):

1. Align most similar xi, xj

2. Align xk most similar to (xixj)

3. Repeat 2 until (x1…xN) are aligned

4. For j = 1 to N, remove xj, and realign to x1…xj-1xj+1…xN

5. Repeat 4 until convergence

Note: Guaranteed to converge

Page 45: Rapid Global Alignments

Iterative Refinement

For each sequence y

1. Remove y

2. Realign y (while rest fixed)

[Figure: 3D alignment lattice over x, y, z; the x,z projection is fixed while y is allowed to vary]

Page 46: Rapid Global Alignments

Iterative Refinement

Example: align (x,y), (z,w), (xy, zw):

x: GAAGTTA
y: GAC-TTA
z: GAACTGA
w: GTACTGA

After realigning y:

x: GAAGTTA
y: G-ACTTA     + 3 matches
z: GAACTGA
w: GTACTGA

Page 47: Rapid Global Alignments

Iterative Refinement

Example not handled well:

x:  GAAGTTA
y1: GAC-TTA
y2: GAC-TTA
y3: GAC-TTA

z:  GAACTGA
w:  GTACTGA

Realigning any single yi changes nothing
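Both examples above can be reproduced with a small sketch of the remove-and-realign loop. It rests on several assumptions not taken from the slides: sum-of-pairs scoring with match +1, mismatch -1, gap -2 (gap against gap 0); realignment that only repositions the removed sequence within the existing columns, without adding new ones; and the hypothetical names `realign_row` and `refine`. A change is accepted only when the total score strictly improves, which is what makes the loop terminate.

```python
from itertools import combinations

GAP = "-"


def pair(a, b, match=1, mismatch=-1, gap=-2):
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return gap
    return match if a == b else mismatch


def sp_score(rows):
    """Sum-of-pairs score of the whole alignment."""
    return sum(pair(a, b) for r, s in combinations(rows, 2) for a, b in zip(r, s))


def realign_row(fixed, seq):
    """Best placement of seq (ungapped) into the columns of the fixed rows."""
    n, m = len(fixed[0]), len(seq)

    def col(i, ch):
        return sum(pair(r[i], ch) for r in fixed)

    neg = float("-inf")
    # F[i][j]: best score with the first j characters placed in the first i columns.
    F = [[neg] * (m + 1) for _ in range(n + 1)]
    F[0][0] = 0
    for i in range(1, n + 1):
        for j in range(min(i, m) + 1):
            skip = F[i - 1][j] + col(i - 1, GAP)
            use = F[i - 1][j - 1] + col(i - 1, seq[j - 1]) if j > 0 else neg
            F[i][j] = max(skip, use)
    out, j = [], m
    for i in range(n, 0, -1):
        if j > 0 and F[i][j] == F[i - 1][j - 1] + col(i - 1, seq[j - 1]):
            out.append(seq[j - 1])
            j -= 1
        else:
            out.append(GAP)
    return "".join(reversed(out))


def refine(rows, max_rounds=10):
    """Remove each row in turn and realign it while the rest stay fixed."""
    rows = list(rows)
    for _ in range(max_rounds):
        changed = False
        for j in range(len(rows)):
            rest = rows[:j] + rows[j + 1:]
            new_row = realign_row(rest, rows[j].replace(GAP, ""))
            candidate = rows[:j] + [new_row] + rows[j + 1:]
            if sp_score(candidate) > sp_score(rows):
                rows, changed = candidate, True
        if not changed:
            break
    return rows
```

Under these scores, calling `refine` on the four-row x, y, z, w alignment moves the gap in y, while the six-row alignment with y1, y2, y3 is returned unchanged because each yi is held in place by its two identical copies.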

Page 48: Rapid Global Alignments

Restricted MDP

Here is another way to improve a multiple alignment:

1. Construct progressive multiple alignment m

2. Run MDP, restricted to radius R from m

Running Time: $O(2^N R^{N-1} L)$

Page 49: Rapid Global Alignments

Restricted MDP

Run MDP, restricted to radius R from m

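[Figure: 3D alignment lattice over x, y, z with the search restricted to a tube of radius R around m]

Running Time: $O(2^N R^{N-1} L)$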

Page 50: Rapid Global Alignments

Restricted MDP

x:  GAAGTTA
y1: GAC-TTA
y2: GAC-TTA
y3: GAC-TTA

z:  GAACTGA
w:  GTACTGA

• Within radius 1 of the optimal

Restricted MDP will fix it.

Page 51: Rapid Global Alignments

Optional refinement steps in MLAGAN

• Limited-area iterative refinement

• Radius-r 3-sequence refinement on each node of the tree

