Post on 18-Jan-2018
Sequence Alignment Part 3
Protein Sequence Alignment
Multiple Sequence Alignment
Table 3.1. Web sites for alignment of sequence pairs
Name of site | Web address | Reference
Bayes block aligner | http://www.wadsworth.org/resnres/bioinfo | Zhu et al. (1998)
Likelihood-weighted sequence alignment | http://stateslab.bioinformatics.med.umich.edu/service | see Web site
PipMaker (percent identity plot), a graphical tool for assessing long alignments | http://www.bx.psu.edu/miller_lab/ | Schwartz et al. (2000)
BCM Search Launcher | http://searchlauncher.bcm.tmc.edu/ | see Web site
SIM: local similarity program for finding alternative alignments | http://us.expasy.org/ | Huang et al. (1990); Huang and Miller (1991); Pearson and Miller (1992)
Global alignment programs (GAP, NAP) | http://genome.cs.mtu.edu/align/align.html | Huang (1994)
FASTA program suite | http://fasta.bioch.virginia.edu/ | Pearson and Miller (1992); Pearson (1996)
Pairwise BLAST | http://www.ncbi.nlm.nih.gov/blast/bl2seq/bl2.html | Altschul et al. (1990)
AceView: shows alignment of mRNAs and ESTs to the genome sequence | http://www.ncbi.nlm.nih.gov/IEB/Research/Acembly | see Web site
BLAT: fast alignment for finding genes in genome | http://genome.ucsc.edu | Kent (2002)
GeneSeqer: predicts genes and aligns mRNA and genome sequences | http://www.bioinformatics.iastate.edu/bioinformatics2go/ | Usuka et al. (2000)
SIM4 | http://globin.cse.psu.edu | Florea et al. (1998)
Protein Sequence Alignment
Protein Pairwise Sequence Alignment
• The alignment tools are similar to the DNA alignment tools
• BLASTP, FASTA
• Main difference: instead of scoring match (+2) and mismatch (-1) we have similarity scores:
– Score s(i,j) > 0 if amino acids i and j have similar properties
– Score s(i,j) ≤ 0 otherwise
• How should we score s(i,j)?
The 20 Amino Acids
Chemical Similarities Between Amino Acids
Acids & Amides DENQ (Asp, Glu, Asn, Gln)
Basic HKR (His, Lys, Arg)
Aromatic FYW (Phe, Tyr, Trp)
Hydrophilic ACGPST (Ala, Cys, Gly, Pro, Ser, Thr)
Hydrophobic ILMV (Ile, Leu, Met, Val)
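As a rough illustration of a property-based s(i,j), the groups above can be turned into a toy scoring function. This is only a sketch: the names `GROUPS`, `GROUP_OF` and `toy_score`, and the score values 2, 1 and 0, are assumptions made for the example and are not taken from any published matrix.

```python
# Chemical groups from the slide above (one-letter amino acid codes).
GROUPS = {
    "acids_amides": "DENQ",
    "basic": "HKR",
    "aromatic": "FYW",
    "hydrophilic": "ACGPST",
    "hydrophobic": "ILMV",
}

# Map each amino acid to the name of its group.
GROUP_OF = {aa: name for name, members in GROUPS.items() for aa in members}


def toy_score(a: str, b: str, identity: int = 2, similar: int = 1) -> int:
    """Toy s(a, b): identical > same chemical group > everything else.

    The numeric values are illustrative assumptions only.
    """
    if a == b:
        return identity
    if GROUP_OF[a] == GROUP_OF[b]:
        return similar
    return 0


print(toy_score("D", "D"))  # identical residues
print(toy_score("D", "E"))  # same group (acids and amides)
print(toy_score("D", "W"))  # different groups
```

Real substitution matrices replace these hand-picked values with scores estimated from observed substitutions.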
Amino Acid Substitution Matrices
• For aligning amino acids, we need a scoring matrix of 20 rows × 20 columns
• Matrices represent biological processes
– Mutation causes changes in sequence
– Evolution tends to conserve protein function
– Similar function requires similar amino acids
• Could base matrix on amino acid properties
– In practice: based on empirical data
Given an alignment of closely related sequences, we can score the relation between amino acids based on how frequently they substitute for each other.
[Figure: alignment of closely related sequences, with identity and similarity columns marked]
In this column, E and D are found in 8 of 10 sequences.
Amino Acid Matrices
Symmetric matrix of 20 × 20 entries: entry (i,j) = entry (j,i).
Entry (i,j): the score of aligning amino acid i against amino acid j.
Entry (i,i) is greater than any entry (i,j), j ≠ i.
PAM - Point Accepted Mutations
• Developed by Margaret Dayhoff, 1978.
• Analyzed very similar protein sequences
• Proteins are evolutionarily close.
• Alignment is easy.
• Point mutations - mainly substitutions
• Accepted mutations - by natural selection.
• Used global alignment.
• Counted the number of substitutions (i,j) per amino acid pair: many i<->j substitutions => high score s(i,j)
• Found that common substitutions involved chemically similar amino acids.
PAM 250
• Similar amino acids are close to each other.
• Regions define conserved substitutions.
Selecting a PAM Matrix
• Low PAM numbers: short sequences, strong local similarities.
• High PAM numbers: long sequences, weak similarities.
– PAM120 recommended for general use (40% identity)
– PAM60 for close relations (60% identity)
– PAM250 for distant relations (20% identity)
• If uncertain, try several different matrices
– PAM40, PAM120, PAM250 recommended
BLOSUM
• Blocks Substitution Matrix
– Steven Henikoff and Jorja G. Henikoff (1992)
• Based on BLOCKS database (www.blocks.fhcrc.org)
– Families of proteins with identical function
– Highly conserved protein domains
• Ungapped local alignment to identify motifs
– Each motif is a block of local alignment
– Counts amino acids observed in same column
– Symmetrical model of substitution
AABCDA...BBCDA
DABCDA.A.BBCBB
BBBCDABA.BCCAA
AAACDAC.DCBCDB
CCBADAB.DBBDCC
AAACAA...BBCCC
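The column-counting idea can be sketched as follows. The sketch counts every unordered pair of symbols within each column of a block and converts the observed pair frequencies into rounded log-odds scores in half-bit units. The block used here is the leftmost six columns of the illustration above; the function names are assumptions for the example, and the sketch leaves out the clustering of near-identical sequences that distinguishes one BLOSUMn matrix from another.

```python
import math
from collections import Counter
from itertools import combinations

# Leftmost six (fully aligned) columns of the example block above.
BLOCK = ["AABCDA", "DABCDA", "BBBCDA", "AAACDA", "CCBADA", "AAACAA"]


def pair_counts(block):
    """Count unordered symbol pairs observed within each column."""
    counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            counts[tuple(sorted((a, b)))] += 1
    return counts


def log_odds(block):
    """Return {(a, b): score} with score = round(2 * log2(observed / expected))."""
    counts = pair_counts(block)
    total = sum(counts.values())
    observed = {pair: n / total for pair, n in counts.items()}

    # Background frequency of each symbol implied by the pair frequencies.
    background = Counter()
    for (a, b), freq in observed.items():
        if a == b:
            background[a] += freq
        else:
            background[a] += freq / 2
            background[b] += freq / 2

    scores = {}
    for (a, b), freq in observed.items():
        if a == b:
            expected = background[a] ** 2
        else:
            expected = 2 * background[a] * background[b]
        scores[(a, b)] = round(2 * math.log2(freq / expected))
    return scores


for pair, score in sorted(log_odds(BLOCK).items()):
    print(pair, score)
```

Pairs that never co-occur in a column get no entry at all in this sketch; a full implementation would have to decide how to score them.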
BLOSUM Matrices
• Different BLOSUMn matrices are calculated independently from BLOCKS
• BLOSUMn is based on sequences that are at most n percent identical.
Selecting a BLOSUM Matrix
• For BLOSUMn, higher n suitable for sequences which are more similar
– BLOSUM62 recommended for general use
– BLOSUM80 for close relations
– BLOSUM45 for distant relations
Multiple Sequence Alignment
Multiple Alignment
• Like pairwise alignment
– n input sequences instead of 2
– Add indels to make same length
– Local and global alignments
• Score columns in alignment independently
• Seek an alignment to maximize score
Alignment Example
Without gaps (scored over the first 14 columns):
GTCGTAGTCGGCTCGAC
GTCTAGCGAGCGTGAT
GCGAAGAGGCGAGC
GCCGTCGCGTCGTAAC
1*1 + 2*0.75 + 11*0.5
Score = 8

With gaps:
GTCGTAGTCG-GC-TCGAC
GTC-TAG-CGAGCGT-GAT
GC-GAAG-AG-GCG-AG-C
GCCGTCG-CG-TCGTA-AC
4*1 + 11*0.75 + 2*0.5
Score = 13.25

Column score: 4/4 = 1, 3/4 = 0.75, 2/4 = 0.5, 1/4 = 0
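The two scores can be reproduced with a short sketch of the column-scoring rule. The function names are assumptions, and so is the treatment of gaps: a gap never counts as a matching base, which is the reading that reproduces the numbers on the slide.

```python
from collections import Counter


def column_score(column):
    """Fraction of sequences sharing the most common base in a column.

    Gaps are ignored when counting bases, and a base that appears only
    once scores 0 (the "1/4 = 0" rule).
    """
    counts = Counter(base for base in column if base != "-")
    if not counts:
        return 0.0
    most_common = max(counts.values())
    return 0.0 if most_common < 2 else most_common / len(column)


def alignment_score(rows):
    # zip() stops at the shortest row, so sequences of unequal length are
    # scored only over the columns that every sequence reaches.
    return sum(column_score(column) for column in zip(*rows))


ungapped = [
    "GTCGTAGTCGGCTCGAC",
    "GTCTAGCGAGCGTGAT",
    "GCGAAGAGGCGAGC",
    "GCCGTCGCGTCGTAAC",
]
gapped = [
    "GTCGTAGTCG-GC-TCGAC",
    "GTC-TAG-CGAGCGT-GAT",
    "GC-GAAG-AG-GCG-AG-C",
    "GCCGTCG-CG-TCGTA-AC",
]

print(alignment_score(ungapped))  # 8.0
print(alignment_score(gapped))    # 13.25
```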
Dynamic Programming
• Pairwise A–B alignment table
– Cell (i,j) = score of best alignment between first i elements of A and first j elements of B
– Complexity: length of A × length of B
• 3-way A–B–C alignment table
– Cell (i,j,k) = score of best alignment between first i elements of A, first j of B, first k of C
– Complexity: length A × length B × length C
• n-way S1–S2–…–Sn-1–Sn alignment table
– Cell (x1,…,xn) = best alignment score between first x1 elements of S1, …, xn elements of Sn
– Complexity: length S1 × … × length Sn
• Example: protein family alignment
– 100 proteins, 1000 amino acids each
– Complexity: 10^300 table cells
– Calculation time: beyond the big bang!
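A minimal sketch of the pairwise table, together with the table-size arithmetic for the n-way case. The match (+2) and mismatch (-1) values follow the DNA scoring mentioned earlier; the linear gap penalty of -1 and the function names are assumptions made for the example.

```python
import math


def global_alignment_score(a, b, match=2, mismatch=-1, gap=-1):
    """Fill the pairwise table; cell (i, j) holds the best score for
    aligning the first i elements of a with the first j elements of b."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = i * gap  # prefix of a aligned against gaps only
    for j in range(1, cols):
        table[0][j] = j * gap  # prefix of b aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            table[i][j] = max(
                table[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                table[i - 1][j] + gap,       # gap in b
                table[i][j - 1] + gap,       # gap in a
            )
    return table[-1][-1]


def table_cells(lengths):
    """Number of cells in an n-way table: the product of the lengths."""
    return math.prod(lengths)


print(global_alignment_score("GTCGTAG", "GTCTAG"))
print(table_cells([1000] * 100) == 10 ** 300)  # True
```

The pairwise table grows with the product of two lengths, which is practical; the n-way table grows with the product of all n lengths, which is not.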
MSA Complexity
Feasible Approach
• Based on pairwise alignment scores
– Build n by n table of pairwise scores
• Align similar sequences first
– After alignment, consider as single sequence
– Continue aligning with further sequences
• Sum of pairwise alignment scores
– For n sequences, there are n(n-1)/2 pairs
GTCGTAGTCG-GC-TCGAC
GTC-TAG-CGAGCGT-GAT
GC-GAAG-AG-GCG-AG-C
GCCGTCG-CG-TCGTA-AC

1 GTCGTAGTCG-GC-TCGAC
2 GTC-TAG-CGAGCGT-GAT
3 GC-GAAGAGGCG-AGC
4 GCCGTCGCGTCGTAAC

1 GTCGTA-GTCG-GC-TCGAC
2 GTC-TA-G-CGAGCGT-GAT
3 G-C-GAAGA-G-GCG-AG-C
4 G-CCGTCGC-G-TCGTAA-C
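The sum-of-pairs idea can be sketched by scoring every one of the n(n-1)/2 row pairs of a finished alignment and adding the results. The scoring values (match +2, mismatch -1, gap -1, gap against gap ignored) and the function names are assumptions for illustration.

```python
from itertools import combinations


def pair_score(row_a, row_b, match=2, mismatch=-1, gap=-1):
    """Score two rows of the same alignment, column by column."""
    score = 0
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            continue  # a column that is a gap in both rows contributes nothing
        if x == "-" or y == "-":
            score += gap
        elif x == y:
            score += match
        else:
            score += mismatch
    return score


def sum_of_pairs(rows):
    """Add up the pairwise scores over all n(n-1)/2 pairs of rows."""
    return sum(pair_score(a, b) for a, b in combinations(rows, 2))


alignment = [
    "GTCGTA-GTCG-GC-TCGAC",
    "GTC-TA-G-CGAGCGT-GAT",
    "G-C-GAAGA-G-GCG-AG-C",
    "G-CCGTCGC-G-TCGTAA-C",
]

n = len(alignment)
print(n * (n - 1) // 2)         # number of pairs: 6
print(sum_of_pairs(alignment))  # sum-of-pairs score under the assumed values
```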
ClustalW Algorithm
• Compute pairwise alignment for all the pairs of sequences.
• Use the alignment scores to build a phylogenetic tree such that
– similar sequences are neighbors in the tree
– distant sequences are distant from each other in the tree.
• The sequences are progressively aligned according to the branching order in the guide tree.
• http://www.ebi.ac.uk/clustalw/
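The guide-tree step can be sketched with two deliberate simplifications, both assumptions of this example rather than what ClustalW does: the pairwise alignment score is replaced by the similarity ratio from Python's `difflib.SequenceMatcher`, and the tree is built by simple average-linkage clustering (ClustalW itself uses neighbor joining). The function names and the four toy sequences are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    # Stand-in for a pairwise alignment score (assumption for this sketch).
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def guide_tree(seqs):
    """seqs maps name -> sequence. Returns nested tuples whose nesting
    gives the order in which sequences or groups would be aligned."""
    clusters = {name: [name] for name in seqs}  # tree node -> member names
    while len(clusters) > 1:

        def average_similarity(pair):
            x, y = pair
            sims = [
                similarity(seqs[m], seqs[n])
                for m in clusters[x]
                for n in clusters[y]
            ]
            return sum(sims) / len(sims)

        # Join the two most similar clusters into a new tree node.
        x, y = max(combinations(clusters, 2), key=average_similarity)
        clusters[(x, y)] = clusters.pop(x) + clusters.pop(y)
    return next(iter(clusters))


toy = {"s1": "NYLS", "s2": "NKYLS", "s3": "NFS", "s4": "NFLS"}
print(guide_tree(toy))
```

The innermost tuples are the pairs to align first; each enclosing tuple is a later alignment of two already-aligned groups.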
Progressive Sequence Alignment (Higgins and Sharp 1988)
N Y L S
N K Y L S
N F S
N F L S

N K/- Y L S
N F L/- S

N K/- Y/F L/- S
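The slash notation can be reproduced with a small sketch that summarizes each column of an alignment. The gapped rows passed in below are an assumption about how the pairs were aligned, inferred from the profiles shown above; the function name `profile` is hypothetical.

```python
def profile(rows):
    """Summarize each alignment column as its distinct residues joined
    by '/', in order of first appearance, with a gap listed last."""
    columns = []
    for column in zip(*rows):
        residues = list(dict.fromkeys(c for c in column if c != "-"))
        if "-" in column:
            residues.append("-")
        columns.append("/".join(residues))
    return " ".join(columns)


print(profile(["N-YLS", "NKYLS"]))                    # N K/- Y L S
print(profile(["NF-S", "NFLS"]))                      # N F L/- S
print(profile(["N-YLS", "NKYLS", "N-F-S", "N-FLS"]))  # N K/- Y/F L/- S
```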
Progressive Sequence Alignment (Protein sequences example)
Treating Gaps in ClustalW
• Penalty for opening gaps and additional penalty for extending the gap
• Gaps found in initial alignment remain fixed
• New gaps are introduced as more sequences are added (decreased penalty if gap exists)
• Gap penalty is decreased within stretches of hydrophilic residues
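A sketch of an affine gap penalty with position-dependent adjustments of the kind listed above. All numbers here are illustrative assumptions, not ClustalW's actual defaults or formulas: the opening and extension values, the convention that the opening penalty covers the first gap position, and the single hypothetical scaling `factor` used for both adjustments.

```python
def affine_gap_penalty(length, gap_open=10.0, gap_extend=0.5):
    """Penalty for one gap of the given length: an opening charge for the
    first position plus an extension charge for each further position."""
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)


def adjusted_gap_open(gap_open, column_has_gap=False,
                      in_hydrophilic_stretch=False, factor=0.5):
    """Lower the opening penalty where a gap already exists in the column
    or where the position lies in a hydrophilic stretch.

    `factor` is a hypothetical scaling value chosen for illustration.
    """
    penalty = gap_open
    if column_has_gap:
        penalty *= factor
    if in_hydrophilic_stretch:
        penalty *= factor
    return penalty


print(affine_gap_penalty(1))  # opening only
print(affine_gap_penalty(4))  # opening plus three extensions
print(adjusted_gap_open(10.0, column_has_gap=True))
```

With an affine penalty, one long gap costs less than several short gaps of the same total length, which matches the intuition that a single insertion or deletion event can span many residues.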
MSA Approaches
• Progressive approach: CLUSTALW (CLUSTALX), PILEUP, T-COFFEE
• Iterative approach: repeatedly realign subsets of sequences. MultAlin, DiAlign.
• Statistical methods: Hidden Markov Models. SAM2K
• Genetic algorithm: SAGA