BMIF 310: Foundations of Bioinformatics

Sequence Analysis: Lecture D

Needleman-Wunsch and Smith-Waterman

Overview

• Intro to sequence comparison

• Global alignments: the Needleman-Wunsch algorithm

• Local alignments: the Smith-Waterman algorithm

Why compare sequences?

• Recognize orthologs – Having newly sequenced an organism, find genes that match known genes in other organisms.

• Recognize paralogs – Determine whether a sequenced gene is part of a gene family for this organism.

• Recognize conserved regions – Find segments of conserved sequence that may have functional or evolutionary significance.

Milestones in sequence alignment

• 1970 Needleman and Wunsch find optimal global alignments through dynamic programming.

• 1971 Tinoco et al. deduce tRNA secondary structure through dot plots.

• 1981 Smith and Waterman adapt dynamic programming to local alignment.

• 1985 Lipman and Pearson implement rapid alignment and standard format in FASTA.

• 1990 Altschul et al. create the BLAST algorithm, greatly accelerating sequence matching.

Where can we find sequences?

• ExPASy Proteomics (UniProt): http://ca.expasy.org/

• National Center for Biotechnology Information: http://www.ncbi.nlm.nih.gov/

• European Bioinformatics Institute: http://www.ebi.ac.uk/

• DNA Databank of Japan: http://www.ddbj.nig.ac.jp/

Needleman-Wunsch

• Find optimal global alignment between sequences for a given set of rules.

• Simplest form: matching residues score 1, mismatches score 0.

• Employ dynamic programming:

– Results for smaller subproblems are stored and reused by later steps.

– Many possible alignments can be eliminated.

Many alignments possible

• Sequence 1: ASLVNDK

• Sequence 2: ALVNKDK

• Possible alignments:

ASLVNDK--   ASLVN-DK   A----SLVNDK

A-LVN-KDK   A-LVNKDK   ALVNK----DK

• Sequences are typically hundreds of residues long, and the degree of match can vary widely.
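To get a feel for how quickly the possibilities multiply, the sketch below counts the gapped alignments of two sequences from their lengths alone. It assumes that every alignment column is either a residue pair, a gap in the first sequence, or a gap in the second, and that adjacent gaps written in a different order count as distinct alignments. The name `count_alignments` is illustrative and not from the lecture.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_alignments(m, n):
    """Count the gapped alignments of sequences with m and n residues."""
    if m == 0 or n == 0:
        return 1  # only one option left: pair the remainder with gaps
    return (count_alignments(m - 1, n - 1)   # last column pairs two residues
            + count_alignments(m - 1, n)     # last column: gap in sequence 2
            + count_alignments(m, n - 1))    # last column: gap in sequence 1


# Two 7-residue sequences, such as ASLVNDK and ALVNKDK
print(count_alignments(7, 7))  # 48639
```

Under these assumptions even two 7-residue sequences have tens of thousands of alignments, which is why enumerating them one by one is not practical for real sequences.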

Needleman-Wunsch setup

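• Place Sequence H across the top and Sequence V down the side; mark each cell where the two residues match with 1 (empty cells are shown as .).

A S L V N D K

A 1 . . . . . .

L . . 1 . . . .

V . . . 1 . . .

N . . . . 1 . .

K . . . . . . 1

D . . . . . 1 .

K . . . . . . 1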
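The setup grid can be built directly from the two sequences. This is a minimal sketch; the function name `match_matrix` and the variable names are assumptions made for illustration.

```python
def match_matrix(h, v):
    """Rows follow sequence v, columns follow sequence h; 1 marks a match."""
    return [[1 if a == b else 0 for b in h] for a in v]


H_SEQ = "ASLVNDK"  # horizontal sequence
V_SEQ = "ALVNKDK"  # vertical sequence

print(" ", *H_SEQ)
for residue, row in zip(V_SEQ, match_matrix(H_SEQ, V_SEQ)):
    print(residue, *row)
```

Here non-matching cells are printed as 0 rather than left blank.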

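Step 1: In the bottom row and rightmost column, change blanks to zeroes and highlighted (matching) cells to ones.

A S L V N D K

A . . . . . . 0

L . . . . . . 0

V . . . . . . 0

N . . . . . . 0

K . . . . . . 1

D . . . . . . 0

K 0 0 0 0 0 0 1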

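Step 2: For each remaining cell, find the highest value among the cells below and to the right.

A S L V N D K

A . . . . . . 0

L . . . . . . 0

V . . . . . . 0

N . . . . . . 0

K . . . . . . 1

D . . . . . . 0

K 0 0 0 0 0 0 1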

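Step 3: Add one to that value if the cell's residues match.

A S L V N D K

A . . . . . . 0

L . . . . . . 0

V . . . . . . 0

N . . . . . . 0

K . . . . . . 1

D . . . . . 2 0

K 0 0 0 0 0 0 1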

Step 4: Repeat steps 2 and 3 until all cells are filled

A S L V N D K

A . . . . . . 0

L . . . . . . 0

V . . . . . . 0

N . . . . . . 0

K . . . . 2 1 1

D 1 1 1 1 1 2 0

K 0 0 0 0 0 0 1

Step 4: Repeat steps 2 and 3 until all cells are filled

A S L V N D K

A . . . . . . 0

L . . . . . . 0

V . . . . . . 0

N . . . . 3 1 0

K 2 2 2 2 2 1 1

D 1 1 1 1 1 2 0

K 0 0 0 0 0 0 1

Step 4: last cell!

A S L V N D K

A 6 5 4 3 2 1 0

L 4 4 5 3 2 1 0

V 3 3 3 4 2 1 0

N 2 2 2 2 3 1 0

K 2 2 2 2 2 1 1

D 1 1 1 1 1 2 0

K 0 0 0 0 0 0 1
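Steps 1 through 4 can be sketched in a few lines. This version assumes the simplest scoring from the lecture (match = 1, mismatch = 0, no gap penalty) and reads "below and to the right" as every cell in a lower row and a later column. The name `nw_fill` is illustrative.

```python
def nw_fill(h, v):
    """Fill the score matrix from the bottom-right corner upward.

    Rows follow sequence v, columns follow sequence h.
    """
    rows, cols = len(v), len(h)
    score = [[0] * cols for _ in range(rows)]
    for i in range(rows - 1, -1, -1):
        for j in range(cols - 1, -1, -1):
            # Step 2: highest value strictly below and to the right
            # (0 when the cell is in the bottom row or rightmost column).
            best = max((score[k][l]
                        for k in range(i + 1, rows)
                        for l in range(j + 1, cols)), default=0)
            # Step 3: add one if the residues match.
            score[i][j] = best + (1 if v[i] == h[j] else 0)
    return score


for residue, row in zip("ALVNKDK", nw_fill("ASLVNDK", "ALVNKDK")):
    print(residue, *row)
```

Rescanning the whole lower-right block for every cell is simple but slow; it is written this way to mirror the steps rather than to be efficient.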

Step 5: Trace back through the largest values, heading down and to the right

A S L V N D K

A 6 5 4 3 2 1 0

L 4 4 5 3 2 1 0

V 3 3 3 4 2 1 0

N 2 2 2 2 3 1 0

K 2 2 2 2 2 1 1

D 1 1 1 1 1 2 0

K 0 0 0 0 0 0 1

Step 6: Interpret path

A S L V N D K

A 6 5 4 3 2 1 0

L 4 4 5 3 2 1 0

V 3 3 3 4 2 1 0

N 2 2 2 2 3 1 0

K 2 2 2 2 2 1 1

D 1 1 1 1 1 2 0

K 0 0 0 0 0 0 1

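• Sequence H is horizontal (columns); Sequence V is vertical (rows).

• Consecutive matches: the path steps diagonally to the adjacent cell.

• Gap in Seq V: the path skips a column.

• Gap in Seq H: the path skips a row.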

Step 7: Write optimal alignment

Hseq: ASLVN-DK

Vseq: A-LVNKDK

• Match, V gap, Match, Match, Match, H gap, Match, Match

• Sometimes there are ties!
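Steps 5 through 7 can be sketched as a walk through the filled matrix. The code below assumes non-empty sequences, match = 1 and mismatch = 0 scoring, and breaks ties by taking the first maximal cell in row-major order, which is only one of several valid choices. The names `nw_fill` and `nw_traceback` are illustrative.

```python
def nw_fill(h, v):
    """Score matrix filled from the bottom right (match = 1, mismatch = 0)."""
    rows, cols = len(v), len(h)
    score = [[0] * cols for _ in range(rows)]
    for i in range(rows - 1, -1, -1):
        for j in range(cols - 1, -1, -1):
            best = max((score[k][l]
                        for k in range(i + 1, rows)
                        for l in range(j + 1, cols)), default=0)
            score[i][j] = best + (1 if v[i] == h[j] else 0)
    return score


def nw_traceback(h, v, score):
    """Follow the largest values down and to the right; return both rows."""
    rows, cols = len(v), len(h)
    # Start at the best cell in the top row or leftmost column.
    starts = [(0, j) for j in range(cols)] + [(i, 0) for i in range(1, rows)]
    i, j = max(starts, key=lambda cell: score[cell[0]][cell[1]])
    h_out, v_out = [], []
    next_i = next_j = 0  # first residues not yet written out
    while True:
        # Residues skipped on the way to (i, j) are paired with gaps.
        h_out.append(h[next_j:j] + "-" * (i - next_i))
        v_out.append("-" * (j - next_j) + v[next_i:i])
        # The cell itself pairs one residue from each sequence.
        h_out.append(h[j])
        v_out.append(v[i])
        next_i, next_j = i + 1, j + 1
        if next_i >= rows or next_j >= cols:
            break
        cells = [(k, l) for k in range(next_i, rows)
                 for l in range(next_j, cols)]
        i, j = max(cells, key=lambda cell: score[cell[0]][cell[1]])
    # Any residues left at the ends are also paired with gaps.
    h_out.append(h[next_j:] + "-" * (rows - next_i))
    v_out.append("-" * (cols - next_j) + v[next_i:])
    return "".join(h_out), "".join(v_out)


h_seq, v_seq = "ASLVNDK", "ALVNKDK"
hseq, vseq = nw_traceback(h_seq, v_seq, nw_fill(h_seq, v_seq))
print("Hseq:", hseq)  # Hseq: ASLVN-DK
print("Vseq:", vseq)  # Vseq: A-LVNKDK
```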

The path to Smith-Waterman

• Small regions of a sequence may align better without their flanking regions, or may match at multiple positions.

• Using substitution matrices rather than {1,0} scoring improves realism of alignment.

• Penalizing gaps in one sequence versus another reflects evolutionary “cost.”

• Enabling alignments to span any subset of sequence letters finds small matching pieces.

Incorporating substitution matrices

• Residues need not be identical to be significant matches.

• N-W gives optimal results only when pair match scores are >=0, but S-W allows for increases and decreases in alignment score.

• BLOSUM62 is compatible with S-W algorithm.

• A score reflects the matrix used, the length of the sequences being aligned, and the number of sequences being compared.
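As a small illustration of scoring with a substitution matrix, the sketch below hard-codes only the BLOSUM62 entries for the seven residues that appear in the example sequences. The names `BLOSUM62_SUBSET` and `ungapped_score` are illustrative; a real analysis would load the full matrix from a library or file.

```python
RESIDUES = "ASLVNDK"
# BLOSUM62 scores for these seven residues only (rows and columns in the
# order of RESIDUES); the full matrix covers all 20 amino acids.
ROWS = [
    [4, 1, -1, 0, -2, -2, -1],    # A
    [1, 4, -2, -2, 1, 0, 0],      # S
    [-1, -2, 4, 1, -3, -4, -2],   # L
    [0, -2, 1, 4, -3, -3, -2],    # V
    [-2, 1, -3, -3, 6, 1, 0],     # N
    [-2, 0, -4, -3, 1, 6, -1],    # D
    [-1, 0, -2, -2, 0, -1, 5],    # K
]
BLOSUM62_SUBSET = {(a, b): ROWS[i][j]
                   for i, a in enumerate(RESIDUES)
                   for j, b in enumerate(RESIDUES)}


def ungapped_score(seq1, seq2):
    """Sum the substitution scores of two equal-length, gap-free sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(BLOSUM62_SUBSET[a, b] for a, b in zip(seq1, seq2))


print(BLOSUM62_SUBSET["N", "D"])     # 1: similar residues still score above zero
print(BLOSUM62_SUBSET["L", "D"])     # -4: dissimilar residues lower the score
print(ungapped_score("LVN", "LVN"))  # 14
```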

Gaps

• In our simplification, gaps simply added nothing to the score. Instead, we can subtract a gap penalty for each skipped residue.

• An improved version has two different penalties:

– gap initiation: a cost charged once for each run of gaps

– gap extension: a cost for each additional position in that run

• Example for BLOSUM62: 5 for initiation, 3 for extension

Reese and Pearson. Bioinformatics (2002) 18: 1500-1507.
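The two-part penalty can be written as a small function. This sketch assumes the 5 + 3(n − 1) form used with BLOSUM62 later in the lecture, where a gap of n positions pays the initiation cost once and the extension cost for each position after the first. The names `gap_penalty` and `total_gap_cost` are illustrative.

```python
import re


def gap_penalty(length, initiation=5, extension=3):
    """Cost of one run of `length` consecutive gap positions."""
    if length <= 0:
        return 0
    return initiation + extension * (length - 1)


def total_gap_cost(aligned_row, initiation=5, extension=3):
    """Sum the penalties of every run of '-' in one row of an alignment."""
    return sum(gap_penalty(len(run), initiation, extension)
               for run in re.findall(r"-+", aligned_row))


print(gap_penalty(1))                 # 5
print(gap_penalty(4))                 # 14
print(total_gap_cost("A-LVN-KDK"))    # 10: two separate one-position gaps
print(total_gap_cost("A----SLVNDK"))  # 14: one four-position gap
```

Two short gaps can cost less than one long gap here only because each is a single position; for the same total number of positions, one long run is cheaper than several runs.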

Smith-Waterman algorithm

• Negative scores (reflecting consecutive mismatches) are zeroed out.

• Backtracking begins at the highest value in the matrix, which is not necessarily the value at the upper left.

• Intended to produce optimal local alignments rather than global alignments.

BLOSUM62, using a gap penalty of 5 + 3(n − 1) for a gap of n residues

A S L V N D K

A 19 21 10 4 0 0 0

L 7 9 20 8 0 0 0

V 4 1 9 16 3 0 0

N 0 4 3 8 12 6 0

K 0 0 1 4 11 0 5

D 0 0 0 0 1 11 0

K 0 0 0 0 0 0 5

Sequence H (horizontal), Sequence V (vertical)
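A local-alignment fill in the spirit of the table above might look like the sketch below. It makes several assumptions: the matrix is filled from the bottom right as in the Needleman-Wunsch walkthrough, each step may skip residues in only one sequence at a time, only the seven needed BLOSUM62 entries are hard-coded, and all function names are illustrative. It reproduces the table's top score of 21, but it is not guaranteed to match the table cell for cell.

```python
RESIDUES = "ASLVNDK"
# BLOSUM62 scores for these seven residues only, in the order of RESIDUES.
ROWS = [
    [4, 1, -1, 0, -2, -2, -1],    # A
    [1, 4, -2, -2, 1, 0, 0],      # S
    [-1, -2, 4, 1, -3, -4, -2],   # L
    [0, -2, 1, 4, -3, -3, -2],    # V
    [-2, 1, -3, -3, 6, 1, 0],     # N
    [-2, 0, -4, -3, 1, 6, -1],    # D
    [-1, 0, -2, -2, 0, -1, 5],    # K
]
SUBST = {(a, b): ROWS[i][j]
         for i, a in enumerate(RESIDUES)
         for j, b in enumerate(RESIDUES)}


def gap_penalty(length, initiation=5, extension=3):
    """Penalty 5 + 3(n - 1) for a gap of n >= 1 residues."""
    return initiation + extension * (length - 1)


def sw_fill(h, v):
    """Local-alignment score matrix; rows follow v, columns follow h."""
    rows, cols = len(v), len(h)
    score = [[0] * cols for _ in range(rows)]
    for i in range(rows - 1, -1, -1):
        for j in range(cols - 1, -1, -1):
            best_next = 0  # option: the local alignment ends at this cell
            if i + 1 < rows and j + 1 < cols:
                # Continue diagonally with no gap.
                best_next = max(best_next, score[i + 1][j + 1])
                # Skip residues of h (a gap in v) before continuing.
                for l in range(j + 2, cols):
                    best_next = max(best_next,
                                    score[i + 1][l] - gap_penalty(l - j - 1))
                # Skip residues of v (a gap in h) before continuing.
                for k in range(i + 2, rows):
                    best_next = max(best_next,
                                    score[k][j + 1] - gap_penalty(k - i - 1))
            # Negative totals are zeroed out.
            score[i][j] = max(0, SUBST[v[i], h[j]] + best_next)
    return score


matrix = sw_fill("ASLVNDK", "ALVNKDK")
best = max((value, i, j)
           for i, row in enumerate(matrix)
           for j, value in enumerate(row))
print(best)  # (21, 0, 1): backtracking starts at row A, column S
```

Starting the traceback at that cell corresponds to aligning SLVN-DK with ALVNKDK, leaving the leading A of Sequence H outside the local alignment.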

Summary

• Sequence comparison is useful for discovering relationships between sequences.

• Dynamic programming improves the efficiency of aligning sequences.

• Local alignments highlight conserved domains, but global alignments seek best total matches.