Date post: | 01-Jan-2016 |
Upload: | brandon-hancock |
04/20/23 1
Multiple sequence alignment
04/20/23 2
Copyright notice
• Many of the images in this power point presentation are from Bioinformatics and Functional Genomics by Jonathan Pevsner (ISBN 0-471-21004-8). Copyright © 2003 by John Wiley & Sons, Inc.
• Many slides in this PowerPoint presentation are from slides by Dr. Jonathan Pevsner and other people. The copyright belongs to the original authors. Thanks!
04/20/23 3
Multiple sequence alignment: definition
• a collection of three or more protein (or nucleic acid) sequences that are partially or completely aligned
• Homologous residues are aligned in columns across the length of the sequences
• residues are homologous in an evolutionary sense
• residues are homologous in a structural sense
04/20/23 4
Multiple sequence alignment: properties
• not necessarily one “correct” alignment of a protein family
• protein sequences evolve...
• ...the corresponding three-dimensional structures of proteins also evolve
• may be impossible to identify amino acid residues that align properly (structurally) throughout a multiple sequence alignment
• for two proteins sharing 30% amino acid identity, about 50% of the individual amino acids are superposable in the two structures
04/20/23 5
Multiple sequence alignment: features
• some aligned residues, such as Cysteines that form disulfide bridges, may be highly conserved
• there may be conserved motifs such as a transmembrane domain
• there may be conserved secondary structure features
• there may be regions with consistent patterns of insertions or deletions (indels)
04/20/23 6
Multiple sequence alignment: uses
• MSA is more sensitive than pairwise alignment to detect homologs
• BLAST output can take the form of a MSA, and can reveal conserved residues or motifs
• Population data can be analyzed in a MSA (PopSet)
• A single query can be searched against a database of MSAs
• Regulatory regions of genes may have consensus sequences identifiable by MSA
04/20/23 7
Multiple Sequence Alignment: Approaches
• Optimal Global Alignments - Dynamic programming
• Global Progressive Alignments - Match closely related sequences first using a guide tree (Feng & Doolittle)
• Global Iterative Alignments - Multiple rebuilding attempts to find the best alignment
• Local Alignments - Profiles, Blocks, Patterns
04/20/23 8
Dynamic Programming
• Generalization of Needleman-Wunsch
– Find alignment that maximizes a score function
• Computationally expensive: time grows as the product of the sequence lengths
– 2 sequences: O(n^2)
– 3 sequences: O(n^3)
– 4 sequences: O(n^4)
– N sequences: O(n^N)
• Can align about 7 relatively short (200-300) protein sequences in a reasonable amount of time; not much beyond that
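As a rough illustration of why exact dynamic programming stops being practical so quickly, the sketch below counts the cells of the N-dimensional scoring table. It assumes all sequences have the same length n; the function name `dp_cells` is ours and does not come from any alignment package.

```python
def dp_cells(n: int, num_seqs: int) -> int:
    """Number of cells in the DP table for num_seqs sequences of length n.

    Each dimension has n + 1 positions (the extra one is the empty prefix).
    """
    return (n + 1) ** num_seqs


# Table size for sequences of 250 residues, a length in the range the slide mentions.
for k in (2, 3, 4, 7):
    print(f"{k} sequences: {dp_cells(250, k):.2e} cells")
```

Each cell also has to consider up to 2^N - 1 predecessor cells, so the real cost grows even faster than the table size alone suggests.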
04/20/23 9
Progressive Alignment
• Find succession of pairwise alignments
• Heuristic – cannot separate scoring and optimization
• Works well for closely related sequences
• Very sensitive to initial alignments
10
Progressive Alignment
• Use a series of pairwise alignments to align larger and larger groups of sequences, following the branching order in the guide tree
– Align the most closely related sequences first, then add the next most closely related sequence, iteratively
– The full DP algorithm is used to align two existing alignments or sequences
– Gaps in existing (older) alignments remain fixed
04/20/23 11
Progressive Alignment Examples
• Feng-Doolittle (1987)
• ClustalW
• T-coffee
04/20/23 12
Feng-Doolittle MSA occurs in 3 stages
• [1] Do a set of global pairwise alignments (Needleman and Wunsch)
• [2] Create a guide tree
• [3] Progressively align the sequences
04/20/23 13
Progressive MSA stage 1 of 3:generate global pairwise alignments
five distantly related lipocalins
best score
04/20/23 14
Progressive MSA stage 1 of 3: generate global pairwise alignments
Start of Pairwise alignments
Aligning...
Sequences (1:2) Aligned. Score: 84
Sequences (1:3) Aligned. Score: 84
Sequences (1:4) Aligned. Score: 91
Sequences (1:5) Aligned. Score: 92
Sequences (2:3) Aligned. Score: 99
Sequences (2:4) Aligned. Score: 86
Sequences (2:5) Aligned. Score: 85
Sequences (3:4) Aligned. Score: 85
Sequences (3:5) Aligned. Score: 84
Sequences (4:5) Aligned. Score: 96
five closely related lipocalins
best score
04/20/23 15
Number of pairwise alignments needed
For N sequences, (N-1)(N)/2
For 5 sequences, (4)(5)/2 = 10
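The count can be checked by simply enumerating the pairs. This is a minimal sketch using Python's standard library; the helper name `pairwise_jobs` is hypothetical.

```python
from itertools import combinations


def pairwise_jobs(n_seqs: int) -> list[tuple[int, int]]:
    """All unordered pairs (i, j) with i < j, numbering sequences from 1."""
    return list(combinations(range(1, n_seqs + 1), 2))


pairs = pairwise_jobs(5)
print(len(pairs))           # number of pairwise alignments for 5 sequences
print(pairs[0], pairs[-1])  # first and last pair, as in the (1:2) ... (4:5) listing
```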
04/20/23 16
Feng-Doolittle stage 2: guide tree
• Convert similarity scores to distance scores
• A tree shows the distance between objects
• ClustalW provides a syntax to describe the tree
• A guide tree is not a phylogenetic tree
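A minimal sketch of the first bullet, using the pairwise scores listed earlier for the five closely related lipocalins. It assumes the scores are percent identities and that the distance is simply 1 - score/100; the conversion actually used by a given program may differ, and the function name `to_distance` is ours.

```python
# Pairwise scores for the five closely related lipocalins (sequence numbers as keys).
scores = {
    (1, 2): 84, (1, 3): 84, (1, 4): 91, (1, 5): 92, (2, 3): 99,
    (2, 4): 86, (2, 5): 85, (3, 4): 85, (3, 5): 84, (4, 5): 96,
}


def to_distance(score_percent: float) -> float:
    """Assumed conversion: treat the score as percent identity."""
    return 1.0 - score_percent / 100.0


distances = {pair: round(to_distance(s), 2) for pair, s in scores.items()}
closest = min(distances, key=distances.get)
print(closest, distances[closest])  # the pair a guide tree would join first
```

Under this assumption the highest-scoring pair becomes the shortest distance, which is the pair a clustering method picks up first.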
04/20/23 17
Guide Tree
• UPGMA – Unweighted Pair Group Method by Arithmetic Mean
– Simplest method of tree construction
– Assumes equal rates of mutation along the branches
• UPGMA Algorithm
– Definition: a node in a tree is called an Operational Taxonomic Unit (OTU)
– From the distance matrix, cluster the pair of OTUs with the smallest distance, and calculate the new distances
– Repeat the previous step until all OTUs have been merged into a single cluster
04/20/23 18
Guide Tree - UPGMA
• Cluster pair with smallest distance
• Recalculate distance matrix
A B C D E
B 2
C 4 4
D 6 6 6
E 6 6 6 4
F 8 8 8 8 8
04/20/23 19
Guide Tree - UPGMA
• Calculate new distance using composite OTU (A,B):
– Distance between a simple OTU and a composite OTU is the average of the distances between the simple OTU and the constituent simple OTUs of the composite OTU
dist((A,B),C) = (dist(A,C) + dist(B,C)) / 2 = (4 + 4) / 2 = 4
dist((A,B),D) = (dist(A,D) + dist(B,D)) / 2 = (6 + 6) / 2 = 6
dist((A,B),E) = (dist(A,E) + dist(B,E)) / 2 = (6 + 6) / 2 = 6
dist((A,B),F) = (dist(A,F) + dist(B,F)) / 2 = (8 + 8) / 2 = 8
04/20/23 20
Guide Tree - UPGMA
• Calculate new distance using composite OTU (A,B):
– Distance between a simple OTU and a composite OTU is the average of the distances between the simple OTU and the constituent simple OTUs of the composite OTU
A,B C D E
C 4
D 6 6
E 6 6 4
F 8 8 8 8
04/20/23 21
Guide Tree - UPGMA
• Second Iteration
A,B C D E
C 4
D 6 6
E 6 6 4
F 8 8 8 8
04/20/23 22
Guide Tree - UPGMA
• Third Iteration
A,B C D,E
C 4
D,E 6 6
F 8 8 8
04/20/23 23
Guide Tree - UPGMA
• Fourth Iteration
AB,C D,E
D,E 6
F 8 8
04/20/23 24
Guide Tree - UPGMA
• Fifth Iteration
ABC,DE
F 8
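The iterations above can be reproduced with a short sketch. It averages over all leaf pairs of the two clusters (the size-weighted form of UPGMA), which gives the same numbers as the simple averages on the slides for this particular matrix. Ties, such as the two candidate merges at distance 4, are broken by whichever pair is found first, so the order of those two merges may differ from the slides. The names `D`, `cluster_dist` and `upgma` are ours.

```python
from itertools import combinations

# Distance matrix from the example (lower triangle).
D = {
    ("A", "B"): 2,
    ("A", "C"): 4, ("B", "C"): 4,
    ("A", "D"): 6, ("B", "D"): 6, ("C", "D"): 6,
    ("A", "E"): 6, ("B", "E"): 6, ("C", "E"): 6, ("D", "E"): 4,
    ("A", "F"): 8, ("B", "F"): 8, ("C", "F"): 8, ("D", "F"): 8, ("E", "F"): 8,
}


def leaf_dist(a, b):
    """Look up a distance regardless of the order of the two leaves."""
    return D[(a, b)] if (a, b) in D else D[(b, a)]


def cluster_dist(c1, c2):
    """Average distance over all leaf pairs drawn from the two clusters."""
    return sum(leaf_dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))


def upgma(leaves):
    """Return the list of merges as (cluster1, cluster2, distance)."""
    clusters = [(leaf,) for leaf in leaves]
    merges = []
    while len(clusters) > 1:
        # Pick the closest pair of clusters (first one found on ties).
        c1, c2 = min(combinations(clusters, 2), key=lambda p: cluster_dist(*p))
        merges.append((c1, c2, cluster_dist(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


merges = upgma("ABCDEF")
for c1, c2, d in merges:
    print("".join(c1), "+", "".join(c2), "at distance", d)
```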
25
Guide Tree
• ClustalW uses Neighbor-Joining
• Neighbor Joining corrects the UPGMA method for its (frequently invalid) assumption that the same rate of evolution applies to each branch of a tree.
• Neighbor Joining has given the best results in simulation studies and it is the most computationally efficient of the distance algorithms (N. Saitou and T. Imanishi, Mol. Biol. Evol. 6:514 (1989)).
• Neighbor-Joining Algorithm
– Allows for unequal rates of mutation along each branch
– Find pairs of OTUs that minimize total branch length at each stage of clustering, starting with a starlike tree (Minimum-Evolution Tree).
– The distance matrix is adjusted for differences in the rate of evolution of each taxon (branch).
NJ Algorithm
Neighbor Joining to Calculate the Guide Tree Phase:
– does not require a uniform molecular clock
– the raw data are provided as a distance matrix
– the initial tree is a star tree
– distance matrix is modified
• distance between node pairs is adjusted on the basis of their average divergence from all other nodes.
– the least-distant pair of nodes are linked.
NJ Algorithm
Neighbor Joining to Calculate the Guide Tree Phase:
– When two nodes are linked:
• Add their common ancestral node to the tree
• delete the terminal nodes with their branches
• the common ancestor is now a terminal node on a smaller tree
– At each step, two terminal nodes are replaced by one new node
– The process is complete when there are only two nodes separated by a single branch
NJ Algorithm
• Advantages of Neighbor Joining
– Fast.
• Can be used on large datasets
• Can support bootstrap analysis
– Can handle lineages with largely different branch lengths (different molecular evolutionary rates)
– Can be used with methods that use correction for multiple substitutions
NJ Algorithm
• Disadvantages of Neighbor Joining
– sequence information is reduced
• Sequences are boiled down to distances
• No secondary or tertiary features used
– gives only one possible tree
– strongly dependent on the model of evolution used
NJ Algorithm
• NJ example from: http://www.icp.ucl.ac.be/~opperd/private/neighbor.html
• Consider the following tree:
• Notice that the branches for D and B are longer.
• This expresses the idea that they have a faster molecular clock than the other OTUs.
NJ Algorithm
The distance matrix for the tree is:
    A   B   C   D   E
B   5
C   4   7
D   7  10   7
E   6   9   6   5
F   8  11   8   9   8
Normally, we create the tree from the distances.
In this example, we use the tree to derive the distances.
NJ Algorithm
• We start with a star tree.
• Notice that we have 6 operational taxonomic units (OTUs)
• The star tree has a leaf for each OTU
[Figure: star tree with leaves A, B, C, D, E and F joined at a single central node]
NJ Algorithm
Step 1: Calculate the net divergence for each OTU.
The net divergence r(i) is the sum of distances from i to all other OTUs:
r(i) = Σ_{j≠i} d(ij)
    A   B   C   D   E
B   5
C   4   7
D   7  10   7
E   6   9   6   5
F   8  11   8   9   8
r(A) = 5+4+7+6+8 = 30
r(B) = 42
r(C) = 32
r(D) = 38
r(E) = 34
r(F) = 44
NJ Algorithm
Step 2: Calculate a new distance matrix based on average divergence:
M(ij) = d(ij) - [r(i) + r(j)]/(N-2)
Example: A,B
M(AB) = d(AB) - [r(A) + r(B)]/(N-2) = 5 - (30 + 42)/4 = -13
    A   B   C   D   E
B   5
C   4   7
D   7  10   7
E   6   9   6   5
F   8  11   8   9   8
Recall: r(A) = 30, r(B) = 42
NJ Algorithm
Step 2 (continued): M(ij) = d(ij) - [r(i) + r(j)]/(N-2)
Average divergence matrix:
      A      B      C      D      E
B  -13.0
C  -11.5  -11.5
D  -10.0  -10.0  -10.5
E  -10.0  -10.0  -10.5  -13.0
F  -10.5  -10.5  -11.0  -11.5  -11.5
Distance matrix:
    A   B   C   D   E
B   5
C   4   7
D   7  10   7
E   6   9   6   5
F   8  11   8   9   8
NJ Algorithm
Step 3: Choose the two OTUs for which M(ij) is the smallest.
– the possible choices are: A,B and D,E
– arbitrarily choose A and B
– form a new node called U, the parent of A and B
– calculate the branch length from U to A and B
S(AU) = d(AB)/2 + [r(A) - r(B)] / [2(N-2)] = 5/2 + (30 - 42)/8 = 1
S(BU) = d(AB) - S(AU) = 5 - 1 = 4
NJ Algorithm
• The tree after U is added.
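[Figure: the tree after joining A and B at the new node U, with branch lengths A-U = 1 and B-U = 4; C, D, E and F still attach to the central node]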
NJ Algorithm
Step 4: Define distances from U to the other terminal nodes:
– d(CU) = [d(AC) + d(BC) - d(AB)] / 2 = (4 + 7 - 5) / 2 = 3
– d(DU) = [d(AD) + d(BD) - d(AB)] / 2 = (7 + 10 - 5) / 2 = 6
– d(EU) = [d(AE) + d(BE) - d(AB)] / 2 = (6 + 9 - 5) / 2 = 5
– d(FU) = [d(AF) + d(BF) - d(AB)] / 2 = (8 + 11 - 5) / 2 = 7
– Note: no change in paired distances among {C,D,E,F}
    U   C   D   E
C   3
D   6   7
E   5   6   5
F   7   8   9   8
NJ Algorithm
• Now N = N-1 = 5
• Repeat steps 1 through 4
• Stop when N = 2
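One pass through steps 1 to 4 can be sketched as follows for the six-OTU example. This is an illustration of the arithmetic on the slides, not ClustalW's implementation; variable names are ours, and the tie between A,B and D,E is resolved by taking the alphabetically first pair.

```python
names = "ABCDEF"
# Distance matrix from the example (lower triangle).
d = {
    ("A", "B"): 5,
    ("A", "C"): 4, ("B", "C"): 7,
    ("A", "D"): 7, ("B", "D"): 10, ("C", "D"): 7,
    ("A", "E"): 6, ("B", "E"): 9, ("C", "E"): 6, ("D", "E"): 5,
    ("A", "F"): 8, ("B", "F"): 11, ("C", "F"): 8, ("D", "F"): 9, ("E", "F"): 8,
}


def dist(i, j):
    """Look up a distance regardless of the order of the two OTUs."""
    return d[(i, j)] if (i, j) in d else d[(j, i)]


N = len(names)

# Step 1: net divergence r(i) = sum of distances from i to all other OTUs.
r = {i: sum(dist(i, j) for j in names if j != i) for i in names}

# Step 2: rate-corrected matrix M(ij) = d(ij) - [r(i) + r(j)] / (N - 2).
M = {(i, j): dist(i, j) - (r[i] + r[j]) / (N - 2) for (i, j) in d}

# Step 3: pick a pair with the smallest M(ij) and place the new node U.
best = min(M.values())
candidates = sorted(pair for pair, m in M.items() if m == best)
i, j = candidates[0]
s_iu = dist(i, j) / 2 + (r[i] - r[j]) / (2 * (N - 2))
s_ju = dist(i, j) - s_iu

# Step 4: distances from U to every remaining OTU.
d_u = {k: (dist(i, k) + dist(j, k) - dist(i, j)) / 2 for k in names if k not in (i, j)}

print(r)
print(candidates, s_iu, s_ju)
print(d_u)
```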
04/20/23 40
Progressive MSA stage 2 of 3:generate a guide tree calculated from
the distance matrix
04/20/23 42
Progressive MSA stage 2 of 3:generate guide tree
((gi|5803139|ref|NP_006735.1|:0.04284,(gi|6174963|sp|Q00724|RETB_MOUS:0.00075,gi|132407|sp|P04916|RETB_RAT:0.00423):0.10542):0.01900,gi|89271|pir||A39486:0.01924,gi|132403|sp|P18902|RETB_BOVIN:0.01902);
five closely related lipocalins
04/20/23 44
Feng-Doolittle stage 3: progressive alignment
• Make a MSA based on the order in the guide tree
• Start with the two most closely related sequences
• Then add the next closest sequence
• Continue until all sequences are added to the MSA
• Rule: “once a gap, always a gap.”
04/20/23 45
Use Clustal W to do a progressive MSA
http://www2.ebi.ac.uk/clustalw/
04/20/23 46
Progressive MSA stage 3 of 3:progressively align the sequences
following the branch order of the tree
04/20/23 47
Clustal W alignment of 5 closely related lipocalins
CLUSTAL W (1.82) multiple sequence alignment
gi|89271|pir||A39486            MEWVWALVLLAALGSAQAERDCRVSSFRVKENFDKARFSGTWYAMAKKDP 50
gi|132403|sp|P18902|RETB_BOVIN  ------------------ERDCRVSSFRVKENFDKARFAGTWYAMAKKDP 32
gi|5803139|ref|NP_006735.1|     MKWVWALLLLAAW--AAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDP 48
gi|6174963|sp|Q00724|RETB_MOUS  MEWVWALVLLAALGGGSAERDCRVSSFRVKENFDKARFSGLWYAIAKKDP 50
gi|132407|sp|P04916|RETB_RAT    MEWVWALVLLAALGGGSAERDCRVSSFRVKENFDKARFSGLWYAIAKKDP 50
                                                  ********************:* ***:*****

gi|89271|pir||A39486            EGLFLQDNIVAEFSVDENGHMSATAKGRVRLLNNWDVCADMVGTFTDTED 100
gi|132403|sp|P18902|RETB_BOVIN  EGLFLQDNIVAEFSVDENGHMSATAKGRVRLLNNWDVCADMVGTFTDTED 82
gi|5803139|ref|NP_006735.1|     EGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMVGTFTDTED 98
gi|6174963|sp|Q00724|RETB_MOUS  EGLFLQDNIIAEFSVDEKGHMSATAKGRVRLLSNWEVCADMVGTFTDTED 100
gi|132407|sp|P04916|RETB_RAT    EGLFLQDNIIAEFSVDEKGHMSATAKGRVRLLSNWEVCADMVGTFTDTED 100
                                *********:*******.*:************.**:**************

gi|89271|pir||A39486            PAKFKMKYWGVASFLQKGNDDHWIIDTDYDTYAAQYSCRLQNLDGTCADS 150
gi|132403|sp|P18902|RETB_BOVIN  PAKFKMKYWGVASFLQKGNDDHWIIDTDYETFAVQYSCRLLNLDGTCADS 132
gi|5803139|ref|NP_006735.1|     PAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADS 148
gi|6174963|sp|Q00724|RETB_MOUS  PAKFKMKYWGVASFLQRGNDDHWIIDTDYDTFALQYSCRLQNLDGTCADS 150
gi|132407|sp|P04916|RETB_RAT    PAKFKMKYWGVASFLQRGNDDHWIIDTDYDTFALQYSCRLQNLDGTCADS 150
                                ****************:*******:****:*:* ****** *********
04/20/23 48
Why “once a gap, always a gap”?
• There are many possible ways to make a MSA
• Where gaps are added is a critical question
• Gaps are often added to the first two (closest) sequences
• To change the initial gap choices later on would be to give more weight to distantly related sequences
• To maintain the initial gap choices is to trust that those gaps are most believable
04/20/23 49
Progressive Alignment: Discussion
• Strengths:
– Speed
– Progression biologically sensible (aligns using a tree)
• Weaknesses:
– No objective function.
– No way of quantifying whether or not the alignment is good
– Local minimum problem
– Any errors in the initial alignment are carried through, no way to correct an early mistake
– More efficient for closely related sequences than for divergent sequences
04/20/23 50
Iterative Methods for Multiple Sequence Alignment
• Seeks to increase MSA score by randomly altering the alignment.
• Usually used to refine alignment
• Attempt to correct initial alignment problems by repeatedly aligning subgroups of the sequences and then by aligning these subgroups into a global alignment of all the sequences
– Starts with a multiple sequence alignment.
– Refine it.
– Repeat until one MSA doesn’t change significantly from the next.
04/20/23 51
MultAlign
• Pairwise scores recalculated during progressive alignment
• Tree is recalculated
• Alignment is refined
04/20/23 52
PRRP
• Initial pairwise alignment predicts tree
• Tree produces weights
• Locally aligned regions considered to produce new alignment and tree
• Continue until alignments converge
04/20/23 53
DIALIGN
• Pairs of sequences aligned to locate ungapped aligned regions
• Diagonals of various lengths identified
• Collection of weighted diagonals provide alignment
04/20/23 54
SAGA: Genetic Algorithms
• Generate many different MSAs by rearrangements that simulate gap insertion and recombination events
• SAGA (Sequence Alignment by Genetic Algorithm) is one approach
04/20/23 55
Simulated Annealing
• Obtain a higher-scoring multiple alignment
• Rearranges current alignment using a probabilistic approach to identify changes that increase alignment score
• MSASA: Multiple Sequence Alignment by Simulated Annealing
MUSCLE: next-generation progressive MSA
[1] Build a draft progressive alignment
Determine pairwise similarity through k-mer counting (not by alignment)
Compute distance (triangular distance) matrix
Construct tree using UPGMA
Construct draft progressive alignment following tree
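To give a feel for alignment-free similarity, here is a simplified k-mer sketch. It scores two sequences by the fraction of k-mers they share; this is an illustrative measure chosen for brevity, not MUSCLE's exact formula, and the function names are hypothetical.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in the sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_similarity(a: str, b: str, k: int = 3) -> float:
    """Shared k-mers divided by the k-mer count of the shorter sequence."""
    if min(len(a), len(b)) < k:
        raise ValueError("both sequences must be at least k residues long")
    shared = sum((kmer_counts(a, k) & kmer_counts(b, k)).values())
    return shared / (min(len(a), len(b)) - k + 1)


# Two short fragments; with k=2 they share the k-mers WY and YA.
sim = kmer_similarity("GTWYA", "GLWYA", k=2)
print(sim, 1.0 - sim)  # similarity and a simple derived distance
```

No alignment is computed at any point, which is why this kind of measure is fast enough to build the first tree for large sequence sets.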
MUSCLE: next-generation progressive MSA
[2] Improve the progressive alignment
Compute pairwise identity through current MSA
Construct new tree with Kimura distance measures
Compare new and old trees: if improved, repeat this step; if not improved, then we’re done
MUSCLE: next-generation progressive MSA
[3] Refinement of the MSA
Split tree in half by deleting one edge
Make profiles of each half of the tree
Re-align the profiles
Accept/reject the new alignment
MUSCLE output (formatted with SeaView)
SeaView is a graphical multiple sequence alignment editor available at http://pbil.univ-lyon1.fr/software/seaview.html
04/20/23 61
Scoring Multiple Alignments
• Because we can’t see the ancestral sequences, it is often impossible to ever know what is the “correct” multiple alignment. (Since some residues may not be structurally superposable, there may not be a correct alignment.)
• The best we can do is to define a “scoring function” for evaluating the “goodness” of a multiple alignment.
• We then try to find the multiple alignment that maximizes this function.
• This is entirely analogous to the scoring function used in pairwise alignment algorithms.
04/20/23 62
Scoring Function Features
• The key difference between multiple alignments and pairwise alignments is the fact that different pairs of sequences are separated by different evolutionary distances.
• Any set of sequences we wish to align is related by a phylogenetic tree.
• Ideally, our scoring system should model molecular sequence evolution.
04/20/23 63
Ideal Scoring Function
• Sequences are related by an evolutionary tree.
• Assume a probabilistic model of molecular evolution.
• Multiple alignment score, S, is
S = Σ_X Pr(Tree | Root = X) Pr(X)
[Figure: evolutionary tree with leaves A, B, C and D descending from Root]
04/20/23 64
Ideal Function is Too Complex
• In most cases, we don’t have nearly enough information to model evolution accurately enough.
• The probability depends on knowing the length of each branch in the tree accurately.
• Evolution is not constant at each column in the alignment since selective pressure is stronger on critical residues.
04/20/23 65
Scoring Function Features cont’d
• As with pairwise alignments, the scoring function should take the chemical/physical properties of residues into account.
04/20/23 66
Simple Score Functions
• If we assume that the columns of the alignment are independent, the scoring function can be written as a sum of column scores plus a gap score:
S(m) = G + Σ_i S(m_i)
where m_i is column i of the alignment and G is a function for scoring gaps.
04/20/23 67
Sum of Pairs: SP Scores
• Using BLOSUM62 matrix, gap penalty -8
• In column 1, we have pairs (-,S), (-,S), (S,S)
• k(k-1)/2 pairs per column for k sequences
- I K
S I K
S S E
Column 1 score: -8 - 8 + 4 = -12
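The column score can be reproduced with a small sketch. Only the BLOSUM62 entries needed for this three-sequence example are included, and scoring a gap paired with a gap as 0 is our assumption (the slide does not cover that case). Function names are ours.

```python
from itertools import combinations

GAP = -8  # gap penalty from the slide

# The few BLOSUM62 entries this example needs, keyed by alphabetically sorted pair.
BLOSUM62_SUBSET = {
    ("S", "S"): 4, ("I", "I"): 4, ("I", "S"): -2,
    ("K", "K"): 5, ("E", "K"): 1, ("E", "E"): 5,
}


def pair_score(a: str, b: str) -> int:
    if a == "-" and b == "-":
        return 0  # assumption: two gaps in the same column cost nothing
    if a == "-" or b == "-":
        return GAP
    return BLOSUM62_SUBSET[tuple(sorted((a, b)))]


def sp_column(column) -> int:
    """Sum the scores of all k(k-1)/2 residue pairs in one column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


alignment = ["-IK", "SIK", "SSE"]
column_scores = [sp_column(col) for col in zip(*alignment)]
print(column_scores, sum(column_scores))
```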
04/20/23 68
Problems with Sum of Pairs Scores
SP scores are very commonly used, but they have problems:
• They have no probabilistic justification.
• The relative difference in score between the correct and incorrect alignment decreases as the evidence increases—this is counter-intuitive.
04/20/23 69
Minimum Entropy Scores
• This is a probabilistic (well, information theoretic) way of saying how “pure” or “good” an alignment column is.
• Intuition: good alignment columns will contain very few different letters
• Method: We convert the alignment column into a probability vector and compute the entropy of the vector.
04/20/23 70
Entropy
• Entropy is a very useful concept from Information Theory.
• If X is a random variable that can have values X_1, X_2, …, X_k, the entropy of X is defined as:
H(X) = −Σ_j Pr(X_j) log Pr(X_j)
• The maximum entropy is log k, when the distribution is uniform, e.g., Pr(X) = (¼, ¼, ¼, ¼).
• The minimum entropy is 0, when the distribution puts all its weight on one letter, e.g., Pr(X) = (0, 0, 1, 0).
Entropy
• Define frequencies for the occurrence of each letter in each column of multiple alignment
– pA = 1, pT=pG=pC=0 (1st column)
– pA = 0.75, pT = 0.25, pG=pC=0 (2nd column)
– pA = 0.50, pT = 0.25, pC=0.25 pG=0 (3rd column)
• Compute entropy of each column:
entropy = −Σ_{X=A,T,G,C} p_X log p_X
Example alignment:
AAA
AAA
AAT
ATC
Entropy: Example
Best case: a column containing A, A, A, A has entropy 0
Worst case: a column containing C, G, T, A has entropy −Σ (1/4) log2(1/4) = −4 × (1/4 × (−2)) = 2
Multiple Alignment: Entropy Score
Entropy for a multiple alignment is the sum of entropies of its columns:
entropy = Σ_{over all columns} ( −Σ_{X=A,T,G,C} p_X log p_X )
Entropy of an Alignment: Example
column entropy: -(pA log pA + pC log pC + pG log pG + pT log pT), with logs taken base 2 and 0*log0 treated as 0
• Column 1 = -[1*log(1) + 0*log0 + 0*log0 + 0*log0] = 0
• Column 2 = -[(1/4)*log(1/4) + (3/4)*log(3/4) + 0*log0 + 0*log0] = -[(1/4)*(-2) + (3/4)*(-.415)] = +0.811
• Column 3 = -[(1/4)*log(1/4) + (1/4)*log(1/4) + (1/4)*log(1/4) + (1/4)*log(1/4)] = 4 * -[(1/4)*(-2)] = +2.0
• Alignment Entropy = 0 + 0.811 + 2.0 = +2.811
A A A
A C C
A C G
A C T
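The worked example can be checked with a few lines of Python. This is a minimal sketch using base-2 logarithms; letters that do not occur in a column are simply skipped, which matches treating 0*log0 as 0. The function name `column_entropy` is ours.

```python
import math
from collections import Counter


def column_entropy(column) -> float:
    """Shannon entropy (base 2) of the letter frequencies in one column."""
    n = len(column)
    # p * log2(1/p) is the same as -p * log2(p) and avoids a negative zero.
    return sum((c / n) * math.log2(n / c) for c in Counter(column).values())


alignment = ["AAA", "ACC", "ACG", "ACT"]
entropies = [column_entropy(col) for col in zip(*alignment)]
print([round(h, 3) for h in entropies], round(sum(entropies), 3))
```

Lower totals indicate purer columns, so under this score an alignment is better when its summed entropy is smaller.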
04/20/23 75
Pros and Cons of Entropy Scores
• The entropy scores are probabilistic.
• They don’t take into account the fact that the sequences are related by a phylogenetic tree. This can be “fixed” by weighting the sequences so that sequences from close species are downweighted relative to sequences from distant species.
04/20/23 76
Multiple sequence alignment to profile: HMMs
• Hidden Markov models (HMMs) consist of “states” that describe the probability of observing a particular amino acid residue in each column of a multiple sequence alignment
• HMMs are probabilistic models
• An HMM gives more sensitive alignments than traditional techniques such as progressive alignments
Simple Hidden Markov Model
Observation: YNNNYYNNNYN
(Y=goes out, N=doesn’t go out)
What is underlying reality (the hidden state chain)?
[Figure: two hidden states, R (rain) and S (sun), connected by transitions labeled 0.15, 0.85, 0.2 and 0.8]
P(dog goes out in rain) = 0.1
P(dog goes out in sun) = 0.85
04/20/23 78
GTWYA (hs RBP)
GLWYA (mus RBP)
GRWYE (apoD)
GTWYE (E Coli)
GEWFS (MUP4)
An HMM is constructed from a MSA
Example: five lipocalins
04/20/23 79
GTWYA
GLWYA
GRWYE
GTWYE
GEWFS
Prob.  1     2     3     4     5
p(G)   1.0
p(T)         0.4
p(L)         0.2
p(R)         0.2
p(E)         0.2               0.4
p(W)               1.0
p(Y)                     0.8
p(F)                     0.2
p(A)                           0.4
p(S)                           0.2
04/20/23 80
P(GEWYE) = (1.0)(0.2)(1.0)(0.8)(0.4) = 0.064
log odds score = ln(1.0) + ln(0.2) + ln(1.0) + ln(0.8) + ln(0.4) = -2.75
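The calculation above can be reproduced by building the per-column frequencies directly from the five aligned fragments. This is a sketch of the position-specific part only: it has no insert or delete states and no pseudocounts, so any residue not seen in a column gets probability 0. The score computed here is the sum of natural logs of the column probabilities, the quantity the slide labels a log odds score; a true log-odds score would also divide each probability by a background frequency. Function names are ours.

```python
import math
from collections import Counter

msa = ["GTWYA", "GLWYA", "GRWYE", "GTWYE", "GEWFS"]


def column_probs(alignment):
    """One dict per column mapping residue -> observed frequency."""
    n = len(alignment)
    return [{res: c / n for res, c in Counter(col).items()} for col in zip(*alignment)]


def sequence_prob(seq, profile):
    """Product of the per-column probabilities of the residues in seq."""
    p = 1.0
    for res, col in zip(seq, profile):
        p *= col.get(res, 0.0)
    return p


profile = column_probs(msa)
p = sequence_prob("GEWYE", profile)
print(round(p, 3), round(math.log(p), 2))
```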
04/20/23 81
[Figure: the same profile drawn as a chain of five states with emission probabilities: state 1 G:1.0; state 2 T:0.4, L:0.2, R:0.2, E:0.2; state 3 W:1.0; state 4 Y:0.8, F:0.2; state 5 E:0.4, A:0.4, S:0.2]
Structure of a hidden Markov model (HMM)
main state
insert state
delete state
04/20/23 83
From MSA to Profile
• Profile HMMs are important because they provide a powerful way to search databases for distantly related homologs.
• HMMs can be created using the HMMER program.
04/20/23 84
HMMER: search an HMM against GenBank
Scores for complete sequences (score includes all domains):
Sequence                          Description      Score   E-value  N
--------                          -----------      -----   ------- ---
gi|20888903|ref|XP_129259.1|      (XM_129259) ret  461.1  1.9e-133  1
gi|132407|sp|P04916|RETB_RAT      Plasma retinol-  458.0  1.7e-132  1
gi|20548126|ref|XP_005907.5|      (XM_005907) sim  454.9  1.4e-131  1
gi|5803139|ref|NP_006735.1|       (NM_006744) ret  454.6  1.7e-131  1
gi|20141667|sp|P02753|RETB_HUMAN  Plasma retinol-  451.1  1.9e-130  1
.
.
gi|16767588|ref|NP_463203.1|      (NC_003197) out  318.2   1.9e-90  1

gi|5803139|ref|NP_006735.1|: domain 1 of 1, from 1 to 195: score 454.6, E = 1.7e-131
            *->mkwVMkLLLLaALagvfgaAErdAfsvgkCrvpsPPRGfrVkeNFDv
               mkwV++LLLLaA  +  +aAErd      Crv+s    frVkeNFD+
gi|5803139   1 MKWVWALLLLAA--W--AAAERD------CRVSS----FRVKENFDK 33

               erylGtWYeIaKkDprFErGLllqdkItAeySleEhGsMsataeGrirVL
               +r++GtWY++aKkDp  E GL+lqd+I+Ae+S++E+G+Msata+Gr+r+L
gi|5803139  34 ARFSGTWYAMAKKDP--E-GLFLQDNIVAEFSVDETGQMSATAKGRVRLL 80

               eNkelcADkvGTvtqiEGeasevfLtadPaklklKyaGvaSflqpGfddy
               +N+++cAD+vGT+t++E          dPak+k+Ky+GvaSflq+G+dd+
gi|5803139  81 NNWDVCADMVGTFTDTE----------DPAKFKMKYWGVASFLQKGNDDH 120
Two kinds of multiple sequence alignment resources
[1] Databases of multiple sequence alignments
Text-based or query-based searches: CDD, Pfam (profile HMMs), PROSITE
[2] Multiple sequence alignment programs
Muscle, ClustalW, ClustalX
Page 329
Databases of multiple sequence alignments
BLOCKS
CDD Pfam SMART
DOMO (Gapped MSA)
INTERPRO
iProClass
MetaFAM
PRINTS
PRODOM (PSI-BLAST)
PROSITE
(annotation on slide: These use HMMs)
04/20/23 87
Multiple sequence alignment programs
• AMAS
• CINEMA
• ClustalW
• ClustalX
• DIALIGN
• HMMT
• Match-Box
• MultAlin
• MSA
• Musca
• PileUp
• SAGA
• T-COFFEE
04/20/23 88
Multiple sequence alignment algorithms
              Local      Global
Progressive   PIMA       CLUSTAL, PileUp, other
Iterative     DIALIGN    SAGA
04/20/23 89
Performance of alignment programs depends on (McClure et al., 1994):
• the number of sequences
• the degree of similarity between sequences
• the number of insertions in the alignment
• the length of the sequences
• the existence of large insertions and N/C-terminal extensions
• over-representation of some members of the protein family
04/20/23 90
Strategy for assessment of alternative multiple sequence alignment algorithms
• [1] Create or obtain a database of protein sequences for which the 3D structure is known. Thus we can define “true” homologs using structural criteria.
• [2] Try making multiple sequence alignments with many different sets of proteins (very related, very distant, few gaps, many gaps, insertions, outliers).
• [3] Compare the answers.
04/20/23 91
BAliBASE: A benchmark alignments database for the evaluation of multiple sequence alignment programs
• BAliBASE is a database of manually-refined multiple sequence alignments specifically designed for the evaluation and comparison of multiple sequence alignment programs. The alignments are categorised by sequence length, similarity, and presence of insertions and N/C-terminal extensions. Core blocks are identified excluding non-superposable regions.
• http://bips.u-strasbg.fr/fr/Products/Databases/BAliBASE/
04/20/23 92
BAliBASE
• Thompson et al., 1999, Nuc. Acids. Res. 27, 2682-2690.
• DIALIGN was found to be the best method for local multiple alignment.
• CLUSTAL W, PRRP and SAGA were superior on globally related sequence sets
04/20/23 93
Conclusions: assessment of alternative multiple sequence alignment algorithms
• [1] As percent identity among proteins drops, performance (accuracy) declines also. This is especially severe for proteins < 25% identity.
– Proteins <25% identity: 65% of residues align well
– Proteins <40% identity: 80% of residues align well
04/20/23 94
Conclusions: assessment of alternative multiple sequence alignment algorithms
• [2] “Orphan” sequences are highly divergent members of a family. Surprisingly, orphans do not disrupt alignments. Also surprisingly, global alignment algorithms outperform local.
04/20/23 95
Conclusions: assessment of alternative multiple sequence alignment algorithms
• [3] Separate multiple sequence alignments can be combined (e.g. RBPs and lactoglobulins).
– Iterative algorithms (PRRP, SAGA) outperform progressive alignments (ClustalX)
04/20/23 96
Conclusions: assessment of alternative multiple sequence alignment algorithms
• [4] When proteins have large N-terminal or C-terminal extensions, local alignment algorithms are superior. PileUp (global) is an exception.