Multiple sequence alignment
Why?
• It is the most important means to assess relatedness of a set of sequences
• Gain information about the structure/function of a query sequence (conservation patterns)
• Construct a phylogenetic tree
• Putting together a set of sequenced fragments (fragment assembly)
• Recognise alternative splice sites
• Many bioinformatics methods depend on it (e.g. secondary/tertiary structure prediction)
Multiple sequence alignment (MSA) of 12 * Flavodoxin + cheY
Pairwise alignment
Now we know how to do it: How do we get a multiple
alignment (three or more sequences)?
Multiple alignment: much greater combinatorial explosion than with pairwise alignment
Multi-dimensional dynamic programming (Murata et al. 1985)
Simultaneous multiple alignment: multi-dimensional dynamic programming
MSA (Lipman et al., 1989, PNAS 86, 4412)
• extremely slow and memory intensive
• up to 8-9 sequences of ~250 residues
DCA (Stoye et al., 1997, CABIOS 13, 625)
• still very slow
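To make the combinatorial explosion concrete, here is a minimal sketch that counts the work done by full multi-dimensional dynamic programming. It assumes N sequences of equal length L, so the search lattice has (L+1)^N cells and each cell looks back at 2^N - 1 predecessor cells. The function name `dp_cost` is made up for this illustration and is not part of any of the programs named above.

```python
def dp_cost(n_sequences: int, length: int) -> tuple[int, int]:
    """Return (lattice cells, predecessors per cell) for full N-dimensional DP."""
    cells = (length + 1) ** n_sequences   # one cell per combination of prefixes
    predecessors = 2 ** n_sequences - 1   # each sequence either advances or gaps
    return cells, predecessors


if __name__ == "__main__":
    for n in (2, 3, 4, 8):
        cells, preds = dp_cost(n, 250)
        print(f"{n} sequences: {cells:.3e} cells, {preds} predecessors per cell")
```

Going from two to three sequences of ~250 residues already multiplies the number of cells by about 250, which is why the exact methods stop at a handful of sequences.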
Alternative multiple alignment methods
• Biopat (Hogeweg Hesper 1984, first method ever)
• MULTAL (Taylor 1987)
• DIALIGN (Morgenstern 1996)
• PRRP (Gotoh 1996)
• Clustal (Thompson Higgins Gibson 1994)
• Praline (Heringa 1999)
• T-Coffee (Notredame Higgins Heringa 2000)
• HMMER (Eddy 1998) [Hidden Markov Model]
• SAGA (Notredame Higgins 1996) [Genetic algorithm]
Progressive multiple alignment general principles
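[Figure: all pairwise alignments of five sequences give scores (Score 1-2, Score 1-3, ..., Score 4-5) that fill a 5×5 similarity matrix; the scores are converted to distances, the distances give a guide tree, and the guide tree drives the multiple alignment. Iteration possibilities are indicated.]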
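The route from pairwise scores to a guide tree can be sketched as follows. This is a rough illustration under stated assumptions, not the procedure of any particular program. The similarity values are invented. The conversion d = 1 - s is one simple choice among several. The tree is built by average-linkage (UPGMA-style) clustering rather than Neighbour Joining. The names `scores_to_distances`, `guide_tree` and `example_sim` are hypothetical.

```python
from itertools import combinations

# Hypothetical pairwise similarities (fractions in [0, 1]) for five sequences.
example_sim = {frozenset(pair): s for pair, s in {
    (1, 2): 0.3, (1, 3): 0.9, (1, 4): 0.3, (1, 5): 0.3, (2, 3): 0.3,
    (2, 4): 0.6, (2, 5): 0.8, (3, 4): 0.3, (3, 5): 0.3, (4, 5): 0.6,
}.items()}


def scores_to_distances(sim):
    """Convert similarities to distances with d = 1 - s (an assumed conversion)."""
    return {pair: 1.0 - s for pair, s in sim.items()}


def guide_tree(names, dist):
    """Average-linkage clustering; returns the tree as nested tuples."""
    # Each cluster is (subtree, members).
    clusters = [(name, (name,)) for name in names]

    def cluster_distance(a, b):
        pairs = [(x, y) for x in a[1] for y in b[1]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Join the two closest clusters; this is the next alignment step.
        a, b = min(combinations(clusters, 2), key=lambda ab: cluster_distance(*ab))
        clusters = [c for c in clusters if c is not a and c is not b]
        clusters.append(((a[0], b[0]), a[1] + b[1]))
    return clusters[0][0]


if __name__ == "__main__":
    print(guide_tree([1, 2, 3, 4, 5], scores_to_distances(example_sim)))
```

A progressive aligner would then walk this nested structure from the innermost pairs outward, aligning sequences or sequence blocks at each join.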
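General progressive multiple alignment technique (follow generated tree)
[Figure: guide tree with leaves 1, 3, 2, 5 and 4, branch length d and a root; the alignment is built up along the tree: 1 with 3, 2 with 5, then 4 added to 2-5, up to the root.]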
Progressive multiple alignment
Problem:
• Accuracy is very important
• Errors are propagated into the progressive steps
• “Once a gap, always a gap” (Feng & Doolittle, 1987)
Pair-wise alignment quality versus sequence identity (Vogt et al., JMB 249, 816-831, 1995)
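Multiple alignment profiles (Gribskov et al. 1987)
[Figure: a profile with one row per amino acid type (A, C, D, ..., W, Y) and one column per alignment position i; a column holds residue frequencies (e.g. 0.3, 0.1, 0, 0.3, 0.3), and a further row holds position dependent gap penalties (e.g. 0.5, 1.0).]
Profile-sequence alignment
[Figure: a sequence aligned against a profile (rows A, C, D, ..., V, W, Y).]
Profile-profile alignment
[Figure: two profiles aligned against each other.]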
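A profile of this kind can be sketched in a few lines. The example below is a simplified illustration: it stores plain residue frequencies per column and scores a residue against a column as a frequency-weighted sum of exchange values. The aligned block, the `identity` exchange function and the function names are assumptions made for the example. A real profile would use a substitution matrix and position dependent gap penalties.

```python
from collections import Counter


def build_profile(aligned_seqs, gap="-"):
    """Return one dict of residue frequencies per alignment column (gaps ignored)."""
    profile = []
    for column in zip(*aligned_seqs):
        counts = Counter(res for res in column if res != gap)
        total = sum(counts.values())
        profile.append({res: n / total for res, n in counts.items()} if total else {})
    return profile


def column_score(column_freqs, residue, exchange):
    """Frequency-weighted exchange score of one residue against one profile column."""
    return sum(freq * exchange(residue, other) for other, freq in column_freqs.items())


def identity(a, b):
    """Toy exchange function standing in for a real substitution matrix."""
    return 1.0 if a == b else 0.0


if __name__ == "__main__":
    block = ["ACD-Y", "ACDWY", "AC-WY", "GCDWY"]  # hypothetical aligned block
    profile = build_profile(block)
    print(profile[0])
    print(column_score(profile[0], "A", identity))
```

Profile-sequence alignment then runs ordinary dynamic programming with `column_score` in place of a residue-residue lookup. Profile-profile alignment scores a pair of columns instead.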
Clustal, ClustalW, ClustalX
• CLUSTAL W/X (Thompson et al., 1994) uses the Neighbour Joining (NJ) algorithm (Saitou and Nei, 1987), widely used in phylogenetic analysis, to construct the guide tree.
• Sequence blocks are represented by profiles, in which the individual sequences are additionally weighted according to the branch lengths in the NJ tree.
• Further carefully crafted heuristics include:
  (i) local gap penalties
  (ii) automatic selection of the amino acid substitution matrix
  (iii) automatic gap penalty adjustment
  (iv) a mechanism to delay alignment of sequences that appear to be distant at the time they are considered
• CLUSTAL (W/X) does not allow iteration, unlike the methods of Hogeweg and Hesper (1984), Corpet (1988), Gotoh (1996) and Heringa (1999, 2002).
Strategies for multiple sequence alignment
• Profile pre-processing
• Secondary structure-induced alignment
• Globalised local alignment
• Matrix extension
Objective: try to avoid (early) errors
Pre-profile generation
[Figure: pairwise scores (Score 1-2, Score 1-3, ..., Score 4-5) are computed for sequences 1-5; for each sequence a pre-alignment is made with the other sequences that score above a cut-off, and each pre-alignment is turned into a pre-profile.]
Pre-profile alignment
[Figure: the pre-profiles of sequences 1-5 are aligned progressively to give the final alignment.]
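The selection step behind pre-profile generation can be sketched as below. This covers only the choice of which sequences enter each pre-alignment, assuming that a sequence is included when its pairwise score with the key sequence reaches the cut-off. The scores, the cut-off value and the function name `pre_alignment_members` are invented for illustration. Building the actual pre-alignments and profiles is left out.

```python
def pre_alignment_members(scores, n_sequences, cutoff):
    """For each sequence, list itself followed by the others scoring >= cutoff."""
    members = {}
    for i in range(1, n_sequences + 1):
        others = [j for j in range(1, n_sequences + 1)
                  if j != i and scores[frozenset((i, j))] >= cutoff]
        # Best-scoring partners first.
        others.sort(key=lambda j: -scores[frozenset((i, j))])
        members[i] = [i] + others
    return members


if __name__ == "__main__":
    # Hypothetical pairwise scores for three sequences.
    scores = {frozenset((1, 2)): 0.8, frozenset((1, 3)): 0.2, frozenset((2, 3)): 0.5}
    print(pre_alignment_members(scores, 3, cutoff=0.4))
```

With a low cut-off every pre-profile contains nearly all sequences, while a high cut-off leaves each sequence almost on its own.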
Protein structure hierarchical levels
• PRIMARY STRUCTURE (amino acid sequence), e.g.
  VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
• SECONDARY STRUCTURE (helices, strands)
• TERTIARY STRUCTURE (fold)
• QUATERNARY STRUCTURE (oligomers)
One of the molecular biology dogmas:
“Structure more conserved than sequence”
Secondary structure-induced alignment
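Using secondary structure for alignment
[Figure: dynamic programming search matrix for two sequences with their secondary structure strings:
  MDAGSTVILCFV   HHHCCCEEEEEE
  MDAASTILCGS    HHHHCCEEECC
Amino acid exchange weights matrices: a separate matrix for each of the states H (helix), E (strand) and C (coil) where the two residues share that state, and a Default matrix for all other combinations.]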
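The matrix choice per cell of the search matrix can be sketched like this. It assumes the rule suggested by the figure: when both residues share the same secondary structure state (H, E or C) the matrix for that state is used, otherwise the default matrix. Only the matrix names are returned. The function names are hypothetical, and the actual exchange matrices are not reproduced here.

```python
def pick_matrix(ss_a: str, ss_b: str) -> str:
    """Name of the exchange matrix for one cell, given both residues' states."""
    if ss_a == ss_b and ss_a in ("H", "E", "C"):
        return ss_a
    return "default"


def matrix_choice_grid(ss1: str, ss2: str):
    """One matrix name per cell of the dynamic programming search matrix."""
    return [[pick_matrix(a, b) for b in ss2] for a in ss1]


if __name__ == "__main__":
    # Secondary structure strings from the slide.
    grid = matrix_choice_grid("HHHCCCEEEEEE", "HHHHCCEEECC")
    for row in grid:
        print(" ".join(name[0] for name in row))  # H, E, C or d(efault)
```

The printout shows blocks of H, C and E cells near the diagonal where the two secondary structures agree, so matching elements are scored with their own matrix.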
Flavodoxin-cheY: using predicted secondary structure
[Alignment: each sequence row is followed by its secondary structure string (ss), whose column spacing is not preserved here.]

1fx1        -PK-ALIVYGSTTGNTEYTAETIARQLANAG-YEVDSRDAASVEAGGLFEGFDLVLLGCSTWGDDSI------ELQDDFIPLFDS-LEETGAQGRKVACF
  ss:       e eeee b ssshhhhhhhhhhhhhhttt eeeee stt tttttt seeee b ee sss ee ttthhhhtt ttss tt eeeee
FLAV_DESVH  MPK-ALIVYGSTTGNTEYTaETIARELADAG-YEVDSRDAASVEAGGLFEGFDLVLLgCSTWGDDSI------ELQDDFIPLFDS-LEETGAQGRKVACf
  ss:       e eeeeee hhhhhhhhhhhhhhh eeeeee eeeeee hhhhhh eeeee
FLAV_DESGI  MPK-ALIVYGSTTGNTEGVaEAIAKTLNSEG-METTVVNVADVTAPGLAEGYDVVLLgCSTWGDDEI------ELQEDFVPLYED-LDRAGLKDKKVGVf
  ss:       e eeeeee hhhhhhhhhhhhhh eeeeee hhhhhh eeeeeee hhhhhh eeeeee
FLAV_DESSA  MSK-SLIVYGSTTGNTETAaEYVAEAFENKE-IDVELKNVTDVSVADLGNGYDIVLFgCSTWGEEEI------ELQDDFIPLYDS-LENADLKGKKVSVf
  ss:       eeeeee hhhhhhhhhhhhhh eeeee eeeee hhhhhhh h eeeee
FLAV_DESDE  MSK-VLIVFGSSTGNTESIaQKLEELIAAGG-HEVTLLNAADASAENLADGYDAVLFgCSAWGMEDL------EMQDDFLSLFEE-FNRFGLAGRKVAAf
  ss:       eeee hhhhhhhhhhhhhh eeeee hhhhhhhhhhheeeee hhhhhhh hh eeeee
2fcr        --K-IGIFFSTSTGNTTEVADFIGKTLGAK---ADAPIDVDDVTDPQALKDYDLLFLGAPTWNTGAD----TERSGTSWDEFLYDKLPEVDMKDLPVAIF
  ss:       eeeee ssshhhhhhhhhhhhhggg b eeggg s gggggg seeeeeee stt s s s sthhhhhhhtggg tt eeeee
FLAV_ANASP  SKK-IGLFYGTQTGKTESVaEIIRDEFGND--VVTL-HDVSQAE-VTDLNDYQYLIIgCPTWNIGEL--------QSDWEGLYSE-LDDVDFNGKLVAYf
  ss:       eeeee hhhhhhhhhhhh eee hhh hhhhhhheeeeee hhhhhhhhh eeeeee
FLAV_ECOLI  -AI-TGIFFGSDTGNTENIaKMIQKQLGKD--VADV-HDIAKSS-KEDLEAYDILLLgIPTWYYGEA--------QCDWDDFFPT-LEEIDFNGKLVALf
  ss:       eee hhhhhhhhhhhh eee hhh hhhhhhheeeee hhhhh eeeeee
FLAV_AZOVI  -AK-IGLFFGSNTGKTRKVaKSIKKRFDDET-MSDA-LNVNRVS-AEDFAQYQFLILgTPTLGEGELPGLSSDCENESWEEFLPK-IEGLDFSGKTVALf
  ss:       eee hhhhhhhhhhhhh hhh hhhhhhheeeee hhhhhhhhh eeeeee
FLAV_ENTAG  MAT-IGIFFGSDTGQTRKVaKLIHQKLDG---IADAPLDVRRAT-REQFLSYPVLLLgTPTLGDGELPGVEAGSQYDSWQEFTNT-LSEADLTGKTVALf
  ss:       eeee hhhhhhhhhhhh hhh hhhhhhheeeee hhhhh eeeee
4fxn        ----MKIVYWSGTGNTEKMAELIAKGIIESG-KDVNTINVSDVNIDELLNE-DILILGCSAMGDEVL------E-ESEFEPFIEE-IST-KISGKKVALF
  ss:       eeeee ssshhhhhhhhhhhhhhhtt eeeettt sttttt seeeeee btttb ttthhhhhhh hst t tt eeeee
FLAV_MEGEL  M---VEIVYWSGTGNTEAMaNEIEAAVKAAG-ADVESVRFEDTNVDDVASK-DVILLgCPAMGSEEL------E-DSVVEPFFTD-LAP-KLKGKKVGLf
  ss:       hhhhhhhhhhhhhh eeeee hhhhhhhh eeeee eeeee
FLAV_CLOAB  M-K-ISILYSSKTGKTERVaKLIEEGVKRSGNIEVKTMNL-DAVDKKFLQESEGIIFgTPTY-YANI--------SWEMKKWIDE-SSEFNLEGKLGAAf
  ss:       eee hhhhhhhhhhhhhh eeeeee hhhhhhhhhh eeee hhhhhhhhh eeeee
3chy        ADKELKFLVVDDFSTMRRIVRNLLKELGFNN-VEEAEDGV-DALNKLQAGGYGFVISD---WNMPNM----------DGLELLKTIRADGAMSALPVLMV
  ss:       tt eeee s hhhhhhhhhhhhhht eeeesshh hhhhhhhh eeeee s sss hhhhhhhhhh ttttt eeee

1fx1        GCGDS-SY-EYFCGAVDAIEEKLKNLGAEIVQD---------------------GLRIDGD--PRAARDDIVGWAHDVRGAI--------
  ss:       eee s ss sstthhhhhhhhhhhttt ee s eeees gggghhhhhhhhhhhhhh
FLAV_DESVH  GCGDS-SY-EYFCGAVDAIEEKLKNLgAEIVQD---------------------GLRIDGD--PRAARDDIVGwAHDVRGAI--------
  ss:       eee hhhhhhhhhhhh eeeee eeeee hhhhhhhhhhhhhh
FLAV_DESGI  GCGDS-SY-TYFCGAVDVIEKKAEELgATLVAS---------------------SLKIDGE--P--DSAEVLDwAREVLARV--------
  ss:       eee hhhhhhhhhhhh eeeee hhhhhhhhhhh
FLAV_DESSA  GCGDS-DY-TYFCGAVDAIEEKLEKMgAVVIGD---------------------SLKIDGD--P--ERDEIVSwGSGIADKI--------
  ss:       hhhhhhhhhhhh eeeee e eee
FLAV_DESDE  ASGDQ-EY-EHFCGAVPAIEERAKELgATIIAE---------------------GLKMEGD--ASNDPEAVASfAEDVLKQL--------
  ss:       e hhhhhhhhhhhhhh eeeee ee hhhhhhhhhhh
2fcr        GLGDAEGYPDNFCDAIEEIHDCFAKQGAKPVGFSNPDDYDYEESKSVRD-GKFLGLPLDMVNDQIPMEKRVAGWVEAVVSETGV------
  ss:       eee ttt ttsttthhhhhhhhhhhtt eee b gggs s tteet teesseeeettt ss hhhhhhhhhhhhhhhht
FLAV_ANASP  GTGDQIGYADNFQDAIGILEEKISQRgGKTVGYWSTDGYDFNDSKALR-NGKFVGLALDEDNQSDLTDDRIKSwVAQLKSEFGL------
  ss:       hhhhhhhhhhhhhh eeee hhhhhhhhhhhhhhhh
FLAV_ECOLI  GCGDQEDYAEYFCDALGTIRDIIEPRgATIVGHWPTAGYHFEASKGLADDDHFVGLAIDEDRQPELTAERVEKwVKQISEELHLDEILNA
  ss:       hhhhhhhhhhhhhh eeee hhhhhhhhhhhhhhhhhh
FLAV_AZOVI  GLGDQVGYPENYLDALGELYSFFKDRgAKIVGSWSTDGYEFESSEAVVD-GKFVGLALDLDNQSGKTDERVAAwLAQIAPEFGLS--L--
  ss:       e hhhhhhhhhhhhhh eeeee hhhhhhhhhhh
FLAV_ENTAG  GLGDQLNYSKNFVSAMRILYDLVIARgACVVGNWPREGYKFSFSAALLENNEFVGLPLDQENQYDLTEERIDSwLEKLKPAV-L------
  ss:       hhhhhhhhhhhhhhh eeee hhhhhhh hhhhhhhhhhhh
4fxn        G-----SYGWGDGKWMRDFEERMNGYGCVVVET---------------------PLIVQNE--PDEAEQDCIEFGKKIANI---------
  ss:       e eesss shhhhhhhhhhhhtt ee s eeees ggghhhhhhhhhhhht
FLAV_MEGEL  G-----SYGWGSGEWMDAWKQRTEDTgATVIGT----------------------AIVNEM--PDNAPE-CKElGEAAAKA---------
  ss:       hhhhhhhhhhh eeeee eeee h hhhhhhhh
FLAV_CLOAB  STANSIA-GGSDIALLTILNHLMVK-gMLVYSG----GVAFGKPKTHLG-----YVHINEI--QENEDENARIfGERiANkV--KQIF--
  ss:       hhhhhhhhhhhhhh eeeee hhhh hhh hhhhhhhhhhhh h
3chy        -----------TAEAKKENIIAAAQAGASGY-------------------------VVK----P-FTAATLEEKLNKIFEKLGM------
  ss:       ess hhhhhhhhhtt see ees s hhhhhhhhhhhhhhht
Globalised local alignment
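Double dynamic programming:
1. Local (SW) alignment (M + Po,e)
2. Global (NW) alignment (no M or Po,e)
Here M is the amino acid exchange matrix, Po the gap-opening penalty and Pe the gap-extension penalty.
[Figure: three panels with M = BLOSUM62 and (Po = 0, Pe = 0), (Po = 12, Pe = 1), (Po = 60, Pe = 5).]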
Matrix extension
T-Coffee: Tree-based Consistency Objective Function For alignmEnt Evaluation
Cedric Notredame, Des Higgins, Jaap Heringa. J. Mol. Biol., 302, 205-217; 2000
Matrix extension – T-Coffee
[Figure: all pairwise alignments of four sequences: 1-2, 1-3, 1-4, 2-3, 2-4, 3-4.]
Integrating alignment methods and alignment information with T-Coffee
• Integrating different pair-wise alignment techniques (NW, SW, ..)
• Combining different multiple alignment methods (consensus multiple alignment)
• Combining sequence alignment methods with structural alignment techniques
• Plug in user knowledge
Using different sources of alignment information
[Figure: alignment information from Clustal, Dialign, Lalign, structure alignments and manual input is fed into T-Coffee.]
Search matrix extension
T-Coffee
• Combine different alignment techniques by adding scores:
$$W(A(x), B(y)) = \sum S(A(x), B(y))$$
– A(x) is residue x in sequence A
– summation is over the scores S of the global and local alignments containing the residue pair (A(x), B(y))
– S is sequence identity percentage of the associated alignment
• Combine direct alignment seqA-seqB with each seqA-seqI-seqB:
$$W'(A(x), B(y)) = W(A(x), B(y)) + \sum_{I \neq A,B} \min\big(W(A(x), I(z)),\, W(I(z), B(y))\big)$$
– Summation over all third sequences I other than A or B
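The extension formula above can be sketched as follows. This is a small illustration and not the T-Coffee implementation. The primary weights are stored as a dictionary `W[(A, B)][(x, y)]`, the residue indices and weights in the example are invented, and for each third sequence the sum runs over every residue z that is paired with both A(x) and B(y). All names are hypothetical.

```python
from collections import defaultdict


def extend_library(W, sequences):
    """Return W' where W[(A, B)][(x, y)] is the primary weight of pair (A(x), B(y))."""

    def weights(a, b):
        """Weights for the ordered pair (a, b), whichever orientation is stored."""
        if (a, b) in W:
            return W[(a, b)]
        return {(y, x): w for (x, y), w in W.get((b, a), {}).items()}

    extended = {}
    for i, a in enumerate(sequences):
        for b in sequences[i + 1:]:
            ext = defaultdict(float, weights(a, b))  # start from the direct alignment
            for third in sequences:
                if third in (a, b):
                    continue
                a_to_i = weights(a, third)
                i_to_b = weights(third, b)
                for (x, z), w1 in a_to_i.items():
                    for (z2, y), w2 in i_to_b.items():
                        if z == z2:  # A(x) and B(y) meet at the same residue I(z)
                            ext[(x, y)] += min(w1, w2)
            extended[(a, b)] = dict(ext)
    return extended


if __name__ == "__main__":
    # Hypothetical primary library: weights are identity percentages.
    W = {
        ("A", "B"): {(0, 0): 88},
        ("A", "C"): {(0, 0): 77, (1, 1): 77},
        ("B", "C"): {(0, 0): 100, (1, 1): 100},
    }
    print(extend_library(W, ["A", "B", "C"])[("A", "B")])
```

In the example the pair (A(1), B(1)) has no direct weight, yet it gains support through sequence C. This is how the extension carries information from the other sequences into each pairwise comparison.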
[Figure: T-Coffee search matrix extension: the direct alignment of two sequences is combined with the alignments that run via the other sequences.]
Evaluating multiple alignments
• Conflicting standards of truth
  – evolution
  – structure
  – function
• With orphan sequences no additional information
• Benchmarks depending on reference alignments
• Quality issue of available reference alignment databases
• Different ways to quantify agreement with reference alignment (sum-of-pairs, column score)
• “Charlie Chaplin” problem
Evaluating multiple alignments
As a standard of truth, often a reference alignment based on structural superpositioning is taken
Evaluation measures (query alignment versus reference alignment)
• Column score
• Sum-of-Pairs score
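Both measures can be sketched under their commonly used definitions, which are assumed here since the slide only names them. The sum-of-pairs score is taken as the fraction of residue pairs aligned in the reference that are also aligned in the query. The column score is taken as the fraction of reference columns reproduced exactly in the query. The toy alignments and function names are invented for illustration.

```python
from itertools import combinations


def columns(alignment, gap="-"):
    """Each column as a tuple of residue indices (None for a gap), one per sequence."""
    counters = [0] * len(alignment)
    cols = []
    for col in zip(*alignment):
        entry = []
        for s, ch in enumerate(col):
            if ch == gap:
                entry.append(None)
            else:
                entry.append(counters[s])
                counters[s] += 1
        cols.append(tuple(entry))
    return cols


def aligned_pairs(cols):
    """All residue pairs (seq1, res1, seq2, res2) placed in the same column."""
    pairs = set()
    for col in cols:
        for (s1, r1), (s2, r2) in combinations(enumerate(col), 2):
            if r1 is not None and r2 is not None:
                pairs.add((s1, r1, s2, r2))
    return pairs


def sum_of_pairs_score(query, reference):
    ref = aligned_pairs(columns(reference))
    return len(ref & aligned_pairs(columns(query))) / len(ref)


def column_score(query, reference):
    ref_cols = columns(reference)
    query_cols = set(columns(query))
    return sum(c in query_cols for c in ref_cols) / len(ref_cols)


if __name__ == "__main__":
    reference = ["ACD-Y", "ACDWY", "A-DWY"]  # hypothetical reference alignment
    query = ["ACD-Y", "ACDWY", "AD-WY"]      # same sequences, one residue shifted
    print(sum_of_pairs_score(query, reference), column_score(query, reference))
```

The column score is the stricter of the two: a single misplaced residue invalidates its whole column, whereas the sum-of-pairs score only loses the pairs that involve that residue.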
Evaluating multiple alignments
[Figure: SP score plotted against BAliBASE alignment size (nseq * len).]
Summary
• Weighting schemes simulating simultaneous multiple alignment
  – Profile pre-processing (global/local)
  – Matrix extension (well balanced scheme)
• Smoothing alignment signals
  – globalised local alignment
• Using additional information
  – secondary structure driven alignment
• Schemes strike balance between speed and sensitivity
References
Heringa, J. (1999) Two strategies for sequence comparison: profile-preprocessed and secondary structure-induced multiple alignment. Comput. Chem., 23, 341-364.
Notredame, C., Higgins, D.G., Heringa, J. (2000) T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol., 302, 205-217.
Heringa, J. (2002) Local weighting schemes for protein multiple sequence alignment. Comput. Chem., 26(5), 459-477.
Where to find this: http://www.ibivu.cs.vu.nl/teaching