BMIF 310: Foundations of Bioinformatics
Sequence Analysis: Lecture D
Needleman-Wunsch and Smith-Waterman
Overview
• Intro to sequence comparison
• Global alignments: the Needleman-Wunsch algorithm
• Local alignments: the Smith-Waterman algorithm
Why compare sequences?
• Recognize orthologs – Having newly sequenced an organism, find genes
that match known genes in other organisms.
• Recognize paralogs – Determine whether a sequenced gene is part of a
gene family for this organism.
• Recognize conserved regions – Find segments of conserved sequence that may
have functional or evolutionary significance.
Milestones in sequence alignment
• 1970 Needleman and Wunsch find optimal global alignments through dynamic programming.
• 1971 Tinoco et al. deduce tRNA secondary structure through dot plots.
• 1981 Smith and Waterman adapt dynamic programming to local alignment.
• 1985 Lipman and Pearson implement rapid alignment and standard format in FASTA.
• 1990 Altschul et al. create the BLAST algorithm, greatly accelerating sequence matching.
Where can we find sequences?
• ExPASy Proteomics (UniProt): http://ca.expasy.org/
• Nat’l Center for Biotechnology Information: http://www.ncbi.nlm.nih.gov/
• European Bioinformatics Institute: http://www.ebi.ac.uk/
• DNA Databank of Japan: http://www.ddbj.nig.ac.jp/
Needleman-Wunsch
• Find optimal global alignment between sequences for a given set of rules.
• Simplest form: matching residues score 1, mismatches score 0.
• Employ dynamic programming:
– Early work informs later work: each cell's score builds on cells already computed.
– Many possible alignments can be eliminated without scoring them one by one.
Many alignments possible
• Sequence 1: ASLVNDK
• Sequence 2: ALVNKDK
• Possible alignments (Sequence 1 over Sequence 2):
ASLVNDK--   ASLVN-DK   A----SLVNDK
A-LVN-KDK   A-LVNKDK   ALVNK----DK
• Sequences are typically hundreds of residues, and degree of match can be quite variable.
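To make the comparison concrete, the three candidate alignments can be scored under the simplest rule used later in this lecture (matching residues score 1, everything else scores 0). This is only a sketch; the function name identity_score is made up for illustration.

```python
def identity_score(row_h: str, row_v: str) -> int:
    """Count aligned columns that hold the same residue (match = 1, all else = 0)."""
    if len(row_h) != len(row_v):
        raise ValueError("aligned rows must have equal length")
    # A column counts only when both rows hold the same letter; gap columns score 0.
    return sum(1 for a, b in zip(row_h, row_v) if a == b and a != "-")


# The three candidate alignments of ASLVNDK and ALVNKDK from the slide.
candidates = [
    ("ASLVNDK--", "A-LVN-KDK"),
    ("ASLVN-DK", "A-LVNKDK"),
    ("A----SLVNDK", "ALVNK----DK"),
]
scores = [identity_score(h, v) for h, v in candidates]
print(scores)  # [5, 6, 3]
```

The second candidate scores highest of the three, but checking every possible alignment this way quickly becomes impractical for sequences hundreds of residues long.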
Needleman-Wunsch setup
A S L V N D K
A 1 . . . . . .
L . . 1 . . . .
V . . . 1 . . .
N . . . . 1 . .
K . . . . . . 1
D . . . . . 1 .
K . . . . . . 1
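The setup grid can be sketched in a few lines: rows follow the vertical sequence, columns follow the horizontal one, and a cell holds 1 wherever the two residues are identical. The variable names below are illustrative assumptions, not part of the original slides.

```python
seq_h = "ASLVNDK"  # horizontal sequence (columns)
seq_v = "ALVNKDK"  # vertical sequence (rows)

# match[i][j] is 1 when row residue seq_v[i] equals column residue seq_h[j].
match = [[1 if v == h else 0 for h in seq_h] for v in seq_v]

print("  " + " ".join(seq_h))
for residue, row in zip(seq_v, match):
    print(residue + " " + " ".join(str(x) for x in row))
```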
Step 1: In the bottom row and rightmost column, change blanks to zeroes and matches (yellow cells) to ones.
A S L V N D K
A . . . . . . 0
L . . . . . . 0
V . . . . . . 0
N . . . . . . 0
K . . . . . . 1
D . . . . . . 0
K 0 0 0 0 0 0 1
Step 2: For each cell, find the highest value among the cells below and to the right
A S L V N D K
A . . . . . . 0
L . . . . . . 0
V . . . . . . 0
N . . . . . . 0
K . . . . . . 1
D . . . . . . 0
K 0 0 0 0 0 0 1
Step 3: Add one if matched
A S L V N D K
A . . . . . . 0
L . . . . . . 0
V . . . . . . 0
N . . . . . . 0
K . . . . . . 1
D . . . . . 2 0
K 0 0 0 0 0 0 1
Step 4: Repeat 2 and 3 until all filled
A S L V N D K
A . . . . . . 0
L . . . . . . 0
V . . . . . . 0
N . . . . . . 0
K . . . . 2 1 1
D 1 1 1 1 1 2 0
K 0 0 0 0 0 0 1
Step 4: Repeat 2 and 3 until all filled
A S L V N D K
A . . . . . . 0
L . . . . . . 0
V . . . . . . 0
N . . . . 3 1 0
K 2 2 2 2 2 1 1
D 1 1 1 1 1 2 0
K 0 0 0 0 0 0 1
Step 4: last cell!
A S L V N D K
A 6 5 4 3 2 1 0
L 4 4 5 3 2 1 0
V 3 3 3 4 2 1 0
N 2 2 2 2 3 1 0
K 2 2 2 2 2 1 1
D 1 1 1 1 1 2 0
K 0 0 0 0 0 0 1
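Steps 1 through 4 can be sketched as a single loop that works from the bottom-right corner toward the top-left. This is a deliberately naive version of the simplified scheme on these slides (match = 1, mismatch = 0, no gap penalty); the function name fill_matrix is an assumption, and the inner search over the whole lower-right block is written for clarity rather than speed.

```python
def fill_matrix(seq_h: str, seq_v: str) -> list[list[int]]:
    """Fill the score grid: rows follow seq_v, columns follow seq_h."""
    n_rows, n_cols = len(seq_v), len(seq_h)
    score = [[0] * n_cols for _ in range(n_rows)]
    # Work upward and leftward so every cell below and to the right is ready.
    for i in range(n_rows - 1, -1, -1):
        for j in range(n_cols - 1, -1, -1):
            # Step 2: highest value strictly below and to the right (0 if none).
            best = 0
            for r in range(i + 1, n_rows):
                for c in range(j + 1, n_cols):
                    best = max(best, score[r][c])
            # Step 3: add one if the residues match.
            score[i][j] = best + (1 if seq_v[i] == seq_h[j] else 0)
    return score


filled = fill_matrix("ASLVNDK", "ALVNKDK")
for residue, row in zip("ALVNKDK", filled):
    print(residue, *row)
```

For the two example sequences this reproduces the filled grid above, with 6 in the upper-left cell.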
Step 5: Backtrack biggest numbers, heading down and to right
A S L V N D K
A 6 5 4 3 2 1 0
L 4 4 5 3 2 1 0
V 3 3 3 4 2 1 0
N 2 2 2 2 3 1 0
K 2 2 2 2 2 1 1
D 1 1 1 1 1 2 0
K 0 0 0 0 0 0 1
Step 6: Interpret path
A S L V N D K
A 6 5 4 3 2 1 0
L 4 4 5 3 2 1 0
V 3 3 3 4 2 1 0
N 2 2 2 2 3 1 0
K 2 2 2 2 2 1 1
D 1 1 1 1 1 2 0
K 0 0 0 0 0 0 1
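• Sequence H (horizontal) runs across the columns; Sequence V (vertical) runs down the rows.
• Consecutive matches: the path steps diagonally to the adjacent cell.
• Gap in Seq V: the path skips a column.
• Gap in Seq H: the path skips a row.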
Step 7: Write optimal alignment
Hseq: ASLVN-DK
Vseq: A-LVNKDK
• Match, V gap, Match, Match, Match, H gap, Match, Match
• Sometimes there are ties!
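Steps 5 through 7 can be sketched as follows, again under the simplified scoring (match = 1, mismatch = 0, no gap penalty). The function names are assumptions made for this example, and the tie-breaking rule (prefer the nearest cell) is one arbitrary choice among several valid ones; a different rule may give a different alignment with the same score.

```python
def fill_matrix(seq_h, seq_v):
    """Score grid filled from the bottom-right corner (rows: seq_v, columns: seq_h)."""
    n_rows, n_cols = len(seq_v), len(seq_h)
    score = [[0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows - 1, -1, -1):
        for j in range(n_cols - 1, -1, -1):
            below_right = [score[r][c] for r in range(i + 1, n_rows)
                           for c in range(j + 1, n_cols)]
            score[i][j] = max(below_right, default=0) + (seq_v[i] == seq_h[j])
    return score


def traceback(seq_h, seq_v, score):
    """Follow the biggest numbers down and to the right; return the two aligned rows."""
    n_rows, n_cols = len(seq_v), len(seq_h)
    # Start at the largest value in the top row or left column.
    starts = [(0, c) for c in range(n_cols)] + [(r, 0) for r in range(1, n_rows)]
    i, j = max(starts, key=lambda rc: score[rc[0]][rc[1]])
    # Residues before the start cell are written opposite gaps.
    out_h = [seq_h[:j] + "-" * i]
    out_v = ["-" * j + seq_v[:i]]
    while True:
        out_h.append(seq_h[j])
        out_v.append(seq_v[i])
        cells = [(r, c) for r in range(i + 1, n_rows) for c in range(j + 1, n_cols)]
        if not cells:
            break
        # Biggest number below and to the right; ties go to the nearest cell.
        r, c = max(cells, key=lambda rc: (score[rc[0]][rc[1]], -(rc[0] + rc[1])))
        # Skipped columns become gaps in Seq V; skipped rows become gaps in Seq H.
        out_h.append(seq_h[j + 1:c] + "-" * (r - i - 1))
        out_v.append("-" * (c - j - 1) + seq_v[i + 1:r])
        i, j = r, c
    # Residues after the last cell are written opposite gaps.
    out_h.append(seq_h[j + 1:] + "-" * (n_rows - i - 1))
    out_v.append("-" * (n_cols - j - 1) + seq_v[i + 1:])
    return "".join(out_h), "".join(out_v)


seq_h, seq_v = "ASLVNDK", "ALVNKDK"
aligned_h, aligned_v = traceback(seq_h, seq_v, fill_matrix(seq_h, seq_v))
print("Hseq:", aligned_h)  # Hseq: ASLVN-DK
print("Vseq:", aligned_v)  # Vseq: A-LVNKDK
```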
The path to Smith-Waterman
• Small regions of sequence align better without flanking regions, or match multiple times.
• Using substitution matrices rather than {1,0} scoring improves realism of alignment.
• Penalizing gaps in one sequence versus another reflects evolutionary “cost.”
• Enabling alignments to span any subset of sequence letters finds small matching pieces.
Incorporating substitution matrices
• Residues need not be identical to be significant matches.
• N-W gives optimal results only when pair match scores are >=0, but S-W allows for increases and decreases in alignment score.
• BLOSUM62 is compatible with the S-W algorithm.
• A score reflects the matrix used, the length of the sequences being aligned, and the number of sequences being compared.
Gaps
• In our simplification, gaps simply didn’t add to score. Instead, we can subtract a gap penalty for each skipped residue.
• An improved version has two different penalties:
– gap initiation: a cost charged once for each run of gaps
– gap extension: a cost for each additional skipped residue within a run
• Example for BLOSUM62: 5 for initiation, 3 for extension
Reese and Pearson. Bioinformatics (2002) 18: 1500-1507.
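The two-penalty scheme can be written as a small function. The sketch below assumes that the first skipped residue in a run pays the initiation cost and each further residue pays the extension cost, which matches the 5+3(n-1) form used later in this lecture; the function and parameter names are made up for illustration.

```python
def affine_gap_penalty(n: int, initiation: int = 5, extension: int = 3) -> int:
    """Penalty for one run of n skipped residues: initiation + extension * (n - 1)."""
    if n <= 0:
        return 0  # no gap, no penalty
    return initiation + extension * (n - 1)


for n in (1, 2, 3):
    print(n, affine_gap_penalty(n))  # 1 5, 2 8, 3 11
```

Under this scheme one run of three skipped residues costs 11, whereas three separate single-residue gaps cost 15, so a single long gap is favored over several short ones.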
Smith-Waterman algorithm
• Negative scores (reflecting consecutive mismatches) are zeroed out.
• Backtracking begins at highest value, not necessarily the value at the upper left.
• Intended to produce optimal local alignments rather than global alignments.
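A minimal sketch of these three points follows. It makes several simplifying assumptions that differ from the slides: scores are match = +2 and mismatch = -1 instead of BLOSUM62, the gap penalty is a flat 2 per skipped residue instead of separate initiation and extension costs, and the grid is filled from the top-left corner (the common textbook orientation), so the traceback walks up and to the left. The test sequences are invented for the example.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = 2):
    """Return (best score, aligned piece of a, aligned piece of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Negative scores are zeroed out, so a new alignment can start anywhere.
            H[i][j] = max(0, diag, H[i - 1][j] - gap, H[i][j - 1] - gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Backtracking begins at the highest value and stops at the first zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        pair = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + pair:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] - gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Invented sequences: a shared core (LVNDK) surrounded by unrelated flanks.
result = smith_waterman("GGGLVNDKTTT", "CCLVNDKPP")
print(result)  # (10, 'LVNDK', 'LVNDK')
```

The unrelated flanking residues are left out of the reported alignment, which is the behavior that distinguishes a local alignment from a global one.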
BLOSUM62, using a gap penalty of 5+3(n-1) for a gap of n residues
A S L V N D K
A 19 21 10 4 0 0 0
L 7 9 20 8 0 0 0
V 4 1 9 16 3 0 0
N 0 4 3 8 12 6 0
K 0 0 1 4 11 0 5
D 0 0 0 0 1 11 0
K 0 0 0 0 0 0 5
Sequence H (horizontal) labels the columns; Sequence V (vertical) labels the rows.
Summary
• Sequence comparison is useful for discovering relationships between sequences.
• Dynamic programming improves the efficiency of aligning sequences.
• Local alignments highlight conserved domains, while global alignments seek the best total match.