
Introduction to Sequence Alignment

PENCE Bioinformatics Research Group

University of Alberta

May 2001

©Duane Szafron 2000

Outline

– Sequence Alignment
– Full Matrix Algorithms
– Hirschberg’s Algorithm
– The FastLSA Algorithm
– Leading and Trailing Blanks

Sequence Alignment

Sequence alignment reduces to a problem of matching two strings by introducing gaps to maximize a scoring function.

The scoring function favors similar characters in the same position, penalizes dissimilar characters and penalizes gaps.

AGTATGCA
ATTGATA

AGT-ATGCA
ATTGAT--A

Scoring Function

There are many different scoring functions. Here is a simple one suitable for illustration, but not actually used:

– Exact match: +2 points
– Different characters: -1 point
– Gap: -2 points

AGT-ATGCA
ATTGAT--A

2-1+2-2+2+2-2-2+2 = 3
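The scoring rules above can be written as a short function. The following is a minimal Python sketch, not code from the original slides; the names `score_alignment`, `MATCH`, `MISMATCH` and `GAP` are illustrative assumptions.

```python
# Illustrative scores from the slide: match +2, mismatch -1, gap -2.
MATCH, MISMATCH, GAP = 2, -1, -2


def score_alignment(top: str, bottom: str) -> int:
    """Sum the column scores of two gapped strings of equal length."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += GAP        # a gap in either sequence
        elif a == b:
            total += MATCH      # identical characters
        else:
            total += MISMATCH   # different characters
    return total


print(score_alignment("AGT-ATGCA", "ATTGAT--A"))  # 3
```

Applied to the example alignment, the function returns 3, the same total as the column-by-column sum on the slide.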

Scoring Ties

There can be several optimal alignment solutions due to scoring ties.

There are actually three optimal solutions in our example alignment:

AGT-ATGCA
ATTGAT--A

2-1+2-2+2+2-2-2+2 = 3

AGTATG-CA
A-T-TGATA

2-2+2-2+2+2-2-1+2 = 3

AGTATGC-A
A-T-TGATA

2-2+2-2+2+2-1-2+2 = 3

Alignment Algorithms

The goal is to find an optimal alignment for a given scoring function as quickly as possible, using a minimum amount of storage.

We will look at three different kinds of algorithms:

– Full matrix algorithms like Needleman-Wunsch and Smith-Waterman
– The Hirschberg algorithm
– Fast linear space alignment (FastLSA)

Matrix Representation

A matrix is used to represent all possible alignments for a pair of sequences.

There is a sequence along each axis.

Each path from the top left corner to the bottom right corner represents an alignment solution.

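Alignments as Matrix Paths

[Figure: alignment matrix with AGTATGCA along the top axis and ATTGATA along the side axis. The highlighted path from the top left to the bottom right passes through the column pairs AA, GT, TT, -G, AA, TT, G-, C-, AA, which spell the alignment below.]

AGT-ATGCA
ATTGAT--A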

Other Alignment Matrix Paths

[Figure: the same matrix with a second path added, passing through the column pairs AA, G-, TT, A-, TT, GG, -A, CT, AA.]

AGTATG-CA
A-T-TGATA

Other Alignment Matrix Paths

[Figure: the same matrix with a third path added, passing through the column pairs AA, G-, TT, A-, TT, GG, CA, -T, AA.]

AGTATGC-A
A-T-TGATA

Matrix Alignment Algorithms

A matrix algorithm uses a dynamic programming matrix to find an optimal solution.

There are two phases to the algorithms:

– FindScore
– FindPath

FindScore Description

The FindScore phase applies the scoring matrix to all paths from the upper left to the lower right.

Values are propagated left-to-right, from top-to-bottom.

At the end, the value in the lower right corner is the optimal score.
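The propagation described above can be sketched in Python as follows. This is a minimal illustration under the simple scoring function from the earlier slide (match +2, mismatch -1, gap -2), not the code behind the slides; the name `find_score` and the constants are assumptions made for the example.

```python
# Illustrative scores from the earlier slide: match +2, mismatch -1, gap -2.
MATCH, MISMATCH, GAP = 2, -1, -2


def find_score(top: str, side: str) -> list[list[int]]:
    """Fill the full dynamic programming matrix, row by row.

    `top` runs along the columns and `side` along the rows, so the matrix
    has len(side) + 1 rows and len(top) + 1 columns.
    """
    rows, cols = len(side) + 1, len(top) + 1
    m = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):              # first row: only gaps so far
        m[0][j] = m[0][j - 1] + GAP
    for i in range(1, rows):
        m[i][0] = m[i - 1][0] + GAP       # first column: only gaps so far
        for j in range(1, cols):
            pair = MATCH if side[i - 1] == top[j - 1] else MISMATCH
            m[i][j] = max(
                m[i - 1][j - 1] + pair,   # enter diagonally
                m[i - 1][j] + GAP,        # enter from above
                m[i][j - 1] + GAP,        # enter from the left
            )
    return m


matrix = find_score("AGTATGCA", "ATTGATA")
print(matrix[-1][-1])  # 3
```

For the example sequences the lower right corner holds 3, the optimal score found earlier.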

FindScore Example

       -    A    G    T    A    T    G    C    A
  -    0   -2   -4   -6   -8  -10  -12  -14  -16
  A   -2   +2    0   -2   -4   -6   -8  -10  -12
  T   -4    0   +1   +2    0   -2   -4   -6   -8
  T   -6   -2   -1   +3   +1   +2    0   -2   -4
  G   -8   -4    0   +1   +2    0   +4   +2    0
  A  -10   -6   -2   -1   +3   +1   +2   +3   +4
  T  -12   -8   -4    0   +1   +5   +3   +1   +2
  A  -14  -10   -6   -2   +2   +3   +4   +2   +3

FindPath Description

The FindPath phase starts in the lower right corner.

At each box, a direction is picked (up, left or diagonal) based on the highest score that entered the box from those three directions.

If two (or three) directions have equal scores, both (all) lead to optimal paths.
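The traceback described above can be sketched in Python as follows. This is an illustrative sketch, not the original implementation: `find_score`, `find_path` and the constants are assumed names, and when directions tie this version simply prefers diagonal, then up, then left, so it returns only one of the optimal alignments.

```python
# Illustrative scores from the earlier slide: match +2, mismatch -1, gap -2.
MATCH, MISMATCH, GAP = 2, -1, -2


def find_score(top, side):
    """Fill the full matrix (rows follow `side`, columns follow `top`)."""
    m = [[0] * (len(top) + 1) for _ in range(len(side) + 1)]
    for j in range(1, len(top) + 1):
        m[0][j] = m[0][j - 1] + GAP
    for i in range(1, len(side) + 1):
        m[i][0] = m[i - 1][0] + GAP
        for j in range(1, len(top) + 1):
            pair = MATCH if side[i - 1] == top[j - 1] else MISMATCH
            m[i][j] = max(m[i - 1][j - 1] + pair,
                          m[i - 1][j] + GAP,
                          m[i][j - 1] + GAP)
    return m


def find_path(top, side, m):
    """Walk back from the lower right corner to the upper left corner."""
    i, j = len(side), len(top)
    out_top, out_side = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = MATCH if side[i - 1] == top[j - 1] else MISMATCH
            if m[i][j] == m[i - 1][j - 1] + pair:    # score entered diagonally
                out_top.append(top[j - 1])
                out_side.append(side[i - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and m[i][j] == m[i - 1][j] + GAP:   # score entered from above
            out_top.append("-")
            out_side.append(side[i - 1])
            i -= 1
        else:                                        # score entered from the left
            out_top.append(top[j - 1])
            out_side.append("-")
            j -= 1
    return "".join(reversed(out_top)), "".join(reversed(out_side))


top, side = "AGTATGCA", "ATTGATA"
aligned_top, aligned_side = find_path(top, side, find_score(top, side))
print(aligned_top)   # AGTATG-CA
print(aligned_side)  # A-T-TGATA
```

With this tie-breaking order the sketch returns the second of the three tied alignments listed under Scoring Ties. Collecting every optimal path would require following all tied directions instead of the first one that matches.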

FindPath Example

[Figure: the FindScore matrix for AGTATGCA and ATTGATA, with the optimal paths traced back from the lower right corner (score 3) to the upper left corner.]

Cost of Full Matrix Algorithms

A full matrix algorithm maintains the entire matrix in memory during both phases (FindScore and FindPath) of the algorithm.

For sequences of length n and m, this takes n×m entries in memory.

FindScore takes n×m operations (time).

FindPath takes m+n operations (time).

If we want to align two sequences of length 10,000, the storage space is prohibitive (100,000,000 entries).

The Hirschberg Algorithm - 1

The Hirschberg algorithm is designed to take less space, but find the same optimal solutions.

It splits one sequence into two and performs the FindScore algorithm on each half, working backwards on the second half sequence.

It does not store all of the results in memory, just the current row of each half matrix (2×n entries instead of m×n entries).
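Keeping only the current row can be sketched in Python as follows. This is a minimal illustration under the simple scoring function from the earlier slide; `last_row` and the constants are assumed names, not part of the original material. The backward pass on the second half is obtained here by running the same function on reversed strings.

```python
# Illustrative scores from the earlier slide: match +2, mismatch -1, gap -2.
MATCH, MISMATCH, GAP = 2, -1, -2


def last_row(top: str, side: str) -> list[int]:
    """Return the final FindScore row while storing only two rows at a time."""
    prev = [j * GAP for j in range(len(top) + 1)]   # row for the empty prefix
    for ch in side:
        cur = [prev[0] + GAP]
        for j in range(1, len(top) + 1):
            pair = MATCH if ch == top[j - 1] else MISMATCH
            cur.append(max(prev[j - 1] + pair,      # diagonal
                           prev[j] + GAP,           # from above
                           cur[j - 1] + GAP))       # from the left
        prev = cur                                  # the older row is discarded
    return prev


top = "AGTATGCA"
forward = last_row(top, "ATT")                            # first half of ATTGATA
backward = last_row(top[::-1], "GATA"[::-1])[::-1]        # second half, backwards
print(forward)   # [-6, -2, -1, 3, 1, 2, 0, -2, -4]
print(backward)  # [0, 2, 1, 0, -1, 1, -3, -4, -8]
```

Entry j of `forward` scores ATT against the first j characters of AGTATGCA, and entry j of `backward` scores GATA against the remaining characters.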

Hirschberg’s Algorithm

At the end of the two FindScore computations, the final rows of each half matrix are used to find the optimal “crossing-point” of the two “half-alignments”.

The complete algorithm is then called again on the two pairs of half sequences.

This recursion continues until the sequences being aligned have length 1.
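The recursion described above can be sketched in Python as follows. This is a hedged sketch rather than the authors' implementation: the names `last_row` and `hirschberg` are assumptions, the side sequence is the one being halved, ties between crossing points are resolved by taking the leftmost, and the length-1 base case assumes that a mismatch (-1) is never worse than two gaps (-4), which holds for the illustrative scoring function.

```python
# Illustrative scores from the earlier slide: match +2, mismatch -1, gap -2.
MATCH, MISMATCH, GAP = 2, -1, -2


def last_row(top, side):
    """Final FindScore row, computed while keeping only the current row."""
    prev = [j * GAP for j in range(len(top) + 1)]
    for ch in side:
        cur = [prev[0] + GAP]
        for j in range(1, len(top) + 1):
            pair = MATCH if ch == top[j - 1] else MISMATCH
            cur.append(max(prev[j - 1] + pair, prev[j] + GAP, cur[j - 1] + GAP))
        prev = cur
    return prev


def hirschberg(top, side):
    """Return one optimal alignment as (aligned_top, aligned_side)."""
    if not top:                                   # only gaps are possible
        return "-" * len(side), side
    if not side:
        return top, "-" * len(top)
    if len(side) == 1:
        # Pair the single character with a matching position if there is one,
        # otherwise with the first position (assumes MISMATCH >= 2 * GAP).
        k = max(top.find(side), 0)
        return top, "-" * k + side + "-" * (len(top) - k - 1)
    mid = len(side) // 2
    forward = last_row(top, side[:mid])
    backward = last_row(top[::-1], side[mid:][::-1])[::-1]
    # Crossing point: where the two half-alignment scores sum to the maximum.
    split = max(range(len(top) + 1), key=lambda j: forward[j] + backward[j])
    left = hirschberg(top[:split], side[:mid])
    right = hirschberg(top[split:], side[mid:])
    return left[0] + right[0], left[1] + right[1]


print(hirschberg("AGTATGCA", "ATTGATA"))  # ('AGT-ATGCA', 'ATTGAT--A')
```

For the example sequences this sketch recovers the first of the three optimal alignments, with score 3.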

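Hirschberg FindScore Example

[Figure: FindScore run on the two halves of the side sequence against AGTATGCA: ATT forwards from the top left and GATA backwards from the bottom right, keeping only the current row of each half matrix.]

Final rows of the two passes:

                    -    A    G    T    A    T    G    C    A
forward  (ATT)     -6   -2   -1   +3   +1   +2    0   -2   -4
backward (GATA)     0   +2   +1    0   -1   +1   -3   -4   -8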

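Hirschberg FindScore Example

[Figure: the final forward and backward rows are added position by position to find the crossing point.]

                    -    A    G    T    A    T    G    C    A
forward  (ATT)     -6   -2   -1   +3   +1   +2    0   -2   -4
backward (GATA)     0   +2   +1    0   -1   +1   -3   -4   -8
sum                -6    0    0   +3    0   +3   -3   -6  -12

The maximum sum, +3, occurs at two crossing points: after AGT and after AGTAT.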

Hirschberg Example Sub-problems

There are two optimal splits of the sequences, colored pink and blue.

However, the blue split generates two different optimal solutions, blue and white.

Pink split: AGT | ATGCA against ATT | GATA

AGT   -ATGCA
ATT   GAT--A

Blue split: AGTAT | GCA against ATT | GATA

AGTAT   G-CA     (blue)
A-T-T   GATA

AGTAT   GC-A     (white)
A-T-T   GATA

Hirschberg Recursion

Hirschberg’s Algorithm

Hirschberg’s algorithm takes only linear space (2×n entries) instead of quadratic space (m×n entries).

This means that aligning two sequences of length 10,000 would only require 20,000 entries instead of 100,000,000 entries.

The disadvantage of this algorithm is that the time goes from m×n operations to about 2×m×n operations, since many matrix computations must be redone.
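The space and time figures quoted above follow from simple arithmetic, which can be checked with a short Python sketch. The function names are illustrative assumptions, and the counts are the entry and operation estimates stated on the slides, not measurements.

```python
def storage_entries(m: int, n: int) -> dict[str, int]:
    """Matrix entries held in memory for sequences of length m and n."""
    return {"full_matrix": m * n, "hirschberg": 2 * n}


def time_operations(m: int, n: int) -> dict[str, int]:
    """Approximate operation counts: FindScore plus FindPath versus Hirschberg."""
    return {"full_matrix": m * n + (m + n), "hirschberg": 2 * m * n}


print(storage_entries(10_000, 10_000))
# {'full_matrix': 100000000, 'hirschberg': 20000}
```

The storage drops by a factor of 5,000 for this example, while the estimated work roughly doubles.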

FastLSA Idea

FastLSA improves on Hirschberg’s algorithm by reducing the number of re-computations that need to be done. This makes the algorithm faster.

There are three improvements to reduce computations:

– Sequences are split on both axes, not just one.
– Sequences are not just bisected, they are cut into several smaller pieces.
– Scores on splitting lines are maintained.

FastLSA - Algorithm

The sequences are split on both axes.

FindScore is called on a region consisting of 3 quadrants (excluding the lower right).

Scores are kept only on the bisecting lines.

FastLSA is called recursively on the lower right quadrant and the optimal path is eventually returned for this quadrant.

Recursive calls are made on part of 1 or 2 of the other 3 quadrants, depending on the path returned from the lower right quadrant.

FastLSA - Stopping the Recursion

When a block has size u*v < some B, stop the recursion and apply a full matrix algorithm to solve the block.

FastLSA - Using Bisection

FastLSA(DPM,rs,re,cs,ce)
  if ((re-rs)*(ce-cs) < B)
    FullMatrix(DPM,rs,re,cs,ce);
    return;
  else
    rm = (rs+re)/2;
    cm = (cs+ce)/2;
    FindScores(DPM,rs,rm,re,cs,cm,ce);
    FastLSA(DPM,rm,re,cm,ce);
    if (direction == diagonal)
      FastLSA(rs,rm,cs,cm)
    else if (direction == side)
      re = path.end.row;
      FastLSA(rm,re,cs,cm);
      if (direction == up)
        ce = path.end.column;
        FastLSA(rs,rm,cs,ce)
    else // direction == up
      ce = path.end.column;
      FastLSA(rs,rm,cm,ce);
      if (direction == side)
        re = path.end.row;
        FastLSA(rs,re,cs,cm);

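[Figure: the dynamic programming matrix with its regions numbered 1 to 11 in the order of the recursive FastLSA calls, and the corresponding call tree: call 1 makes calls 2, 6 and 9; call 2 makes calls 3, 4 and 5; call 6 makes calls 7 and 8; call 9 makes calls 10 and 11.]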

FastLSA - cuts (k) = 4

Using FastLSA

If you don’t have enough memory to run a full-matrix algorithm, use FastLSA and pick your k-value based on your available memory.

It will run faster than Hirschberg’s algorithm.

Aligning Sub-sequences

Sometimes you are trying to align a sub-sequence with a large sequence.

In this case there should be many leading and trailing gaps.

AGATCTGATCGTAAGTCATTCGCATAATGCGT...
----------GTACGTC---------------...

Score = 25*(-2) + 1*(-1) + 6*2 = -39

AGATCTGATCGTAAGTCATTCGCATAATGCGT...
----------GTA---C----G--T----C--...

Score = 25*(-2) + 7*2 = -36

Leading and Trailing Gaps

To score this properly, we assign zero penalties to leading and trailing gaps.

AGATCTGATCGTAAGTCATTCGCATAATGCGT...
----------GTACGTC---------------...

Score = 25*(0) + 1*(-1) + 6*2 = 11

AGATCTGATCGTAAGTCATTCGCATAATGCGT...
----------GTA---C----G--T----C--...

Score = 12*(0) + 13*(-2) + 7*2 = -12
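One common way to make leading and trailing gaps free in a full matrix algorithm can be sketched in Python as follows. This is an illustrative sketch, not the original implementation: `free_end_gap_scores` and the constants are assumed names. Leading gaps become free by starting the first row and first column at 0, and trailing gaps become free by taking the best score anywhere in the last row or last column instead of only the lower right corner.

```python
# Illustrative scores from the earlier slide: match +2, mismatch -1, gap -2.
MATCH, MISMATCH, GAP = 2, -1, -2


def free_end_gap_scores(top: str, side: str) -> tuple[int, int]:
    """Return (score with free leading gaps, score with free leading and trailing gaps)."""
    rows, cols = len(side) + 1, len(top) + 1
    # Row 0 and column 0 stay at 0, so leading gaps cost nothing.
    m = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            pair = MATCH if side[i - 1] == top[j - 1] else MISMATCH
            m[i][j] = max(m[i - 1][j - 1] + pair,   # diagonal
                          m[i - 1][j] + GAP,        # from above
                          m[i][j - 1] + GAP)        # from the left
    leading_only = m[-1][-1]
    # Free trailing gaps: the alignment may stop anywhere on the last row
    # or the last column and pad the rest with unpenalized gaps.
    leading_and_trailing = max(max(m[-1]), max(row[-1] for row in m))
    return leading_only, leading_and_trailing


print(free_end_gap_scores("AGTATGCA", "ATTGATA"))  # (3, 4)
```

For the running example AGTATGCA and ATTGATA this gives 3 with free leading gaps and 4 when trailing gaps are free as well, matching the scores in the slide titles that follow.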

Implementing Leading Gaps

[Figure: FindScore matrix for AGTATGCA (top) and ATTGATA (side) with the first row and first column set to 0 instead of multiples of the gap penalty, so that leading gaps are free. The lower right corner still holds 3.]

New optimal path - same score 3

[Figure: the leading-gap matrix with the new optimal path highlighted, passing through the column pairs A-, G-, T-, AA, TT, -T, GG, CA, -T, AA.]

AGTAT-GC-A
---ATTGATA

Score = 3*(0) + 2 + 2 - 2 + 2 - 1 - 2 + 2 = 3

Implementing Trailing Gaps

[Figure: the leading-gap matrix for AGTATGCA (top) and ATTGATA (side) in which gap moves along the last row and last column also score 0, so that trailing gaps are free. The lower right corner now holds 4.]

New optimal paths - new score 4

[Figure: the matrix with free leading and trailing gaps, with two new optimal paths highlighted.]

AGTATGCA--
---ATTGATA

Score = 5*(0) + 2 + 2 - 1 - 1 + 2 = 4

----AGTATGCA
ATTGA-TA----

Score = 8*(0) + 2 - 2 + 2 + 2 = 4

