MVNLTSDEKTAVLALWNKVDVEDCGGEALGRLLVVYPWTQRFFE…
|| ||  ||||| ||| || |   ||||||||||||||||||||
MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFE…
ATGGTGAACCTGACCTCTGACGAGAAGACTGCCGTCCTTGCCCTGTGGAACAAGGTGGACGTGGAAGACTGTGGTGGTGAGGCCCTGGGCAGGTTTGTATGGAGGTTACAAGGCTGCTTAAGGAGGGAGGATGGAAGCTGGGCATGTGGAGACAGACCACCTCCTGGATTTATGACAGGAACTGATTGCTGTCTCCTGTGCTGCTTTCACCCCTCAGGCTGCTGGTCGTGTATCCCTGGACCCAGAGGTTCTTTGAAAGCTTTGGGGACTTGTCCACTCCTGCTGCTGTGTTCGCAAATGCTAAGGTAAAAGCCCATGGCAAGAAGGTGCTAACTTCCTTTGGTGAAGGTATGAATCACCTGGACAACCTCAAGGGCACCTTTGCTAAACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCCTGAGAATTTCAAGGTGAGTCAATATTCTTCTTCTTCCTTCTTTCTATGGTCAAGCTCATGTCATGGGAAAAGGACATAAGAGTCAGTTTCCAGTTCTCAATAGAAAAAAAAATTCTGTTTGCATCACTGTGGACTCCTTGGGACCATTCATTTCTTTCACCTGCTTTGCTTATAGTTATTGTTTCCTCTTTTTCCTTTTTCTCTTCTTCTTCATAAGTTTTTCTCTCTGTATTTTTTTAACACAATCTTTTAATTTTGTGCCTTTAAATTATTTTTAAGCTTTCTTCTTTTAATTACTACTCGTTTCCTTTCATTTCTATACTTTCTATCTAATCTTCTCCTTTCAAGAGAAGGAGTGGTTCACTACTACTTTGCTTGGGTGTAAAGAATAACAGCAATAGCTTAAATTCTGGCATAATGTGAATAGGGAGGACAATTTCTCATATAAGTTGAGGCTGATATTGGAGGATTTGCATTAGTAGTAGAGGTTACATCCAGTTACCGTCTTGCTCATAATTTGTGGGCACAACACAGGGCATATCTTGGAACAAGGCTAGAATATTCTGAATGCAAACTGGGGACCTGTGTTAACTATGTTCATGCCTGTTGTCTCTTCCTCTTCAGCTCCTGGGCAATATGCTGGTGGTTGTGCTGGCTCGCCACTTTGGCAAGGAATTCGACTGGCACATGCACGCTTGTTTTCAGAAGGTGGTGGCTGGTGTGGCTAATGCCCTGGCTCACAAGTACCATTGA
MVNLTSDEKTAVLALWNKVDVEDCGGEALGRLLVVYPWTQRFFE…
Before we begin…
Pairwise Sequence Alignment
Lesson 2
What is sequence alignment?
Alignment: comparing two (pairwise) or more (multiple) sequences by searching for a series of identical or similar characters in the sequences.
MVNLTSDEKTAVLALWNKVDVEDCGGE
|| ||  ||||| ||| || |   |||
MVHLTPEEKTAVNALWGKVNVDAVGGE
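The row of bars between the two sequences can be generated mechanically. The snippet below is a minimal sketch, not part of the original lesson; the function name `match_line` is made up for illustration. It marks identical positions in two already-aligned sequences and counts them.

```python
def match_line(top: str, bottom: str) -> str:
    """Return '|' under identical residues and ' ' elsewhere.

    Both strings must already be aligned (equal length, '-' for gaps).
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    return "".join(
        "|" if a == b and a != "-" else " " for a, b in zip(top, bottom)
    )


# The two fragments shown on the slide.
seq_a = "MVNLTSDEKTAVLALWNKVDVEDCGGE"
seq_b = "MVHLTPEEKTAVNALWGKVNVDAVGGE"

line = match_line(seq_a, seq_b)
identities = line.count("|")

print(seq_a)
print(line)
print(seq_b)
print(f"{identities}/{len(seq_a)} identical positions")
```

This only marks identities. Marking similar (but not identical) residues would additionally need a substitution matrix.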
Why sequence alignment?
• Predict characteristics of a protein: use the structure or function information available in databases for known proteins with similar sequences in order to predict the structure or function of an unknown protein.
• Assumption: similar sequences produce similar proteins.
Local vs. Global
• Global alignment: finds the best alignment across the whole length of the two sequences.
• Local alignment: finds regions of high similarity in parts of the sequences.
Global alignment (forces alignment in regions which differ):
ADLGAVFALCDRYFQ
||||     |||| |
ADLGRTQN-CDRYYQ
Local alignment (concentrates on regions of high similarity):
ADLG   CDRYFQ
||||   |||| |
ADLG   CDRYYQ
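The difference between the two modes can be seen in how the best score is computed. The sketch below is an illustration under stated assumptions, not the method of any particular tool. It uses the textbook dynamic-programming recurrences (Needleman-Wunsch for global, Smith-Waterman for local), which the slides do not name. The scoring values (match +1, mismatch -2, gap -1) and the function name `alignment_scores` are likewise assumptions.

```python
def alignment_scores(a: str, b: str, match=1, mismatch=-2, gap=-1):
    """Return (global_score, local_score) for two unaligned sequences."""
    rows, cols = len(a) + 1, len(b) + 1
    # glob[i][j]: best score aligning all of a[:i] with all of b[:j].
    glob = [[0] * cols for _ in range(rows)]
    # loc[i][j]: best score of a local alignment ending at cell (i, j),
    # never allowed to drop below zero.
    loc = [[0] * cols for _ in range(rows)]

    # Global alignment pays for leading gaps; local alignment does not.
    for i in range(1, rows):
        glob[i][0] = i * gap
    for j in range(1, cols):
        glob[0][j] = j * gap

    best_local = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            glob[i][j] = max(
                glob[i - 1][j - 1] + s,  # align the two residues
                glob[i - 1][j] + gap,    # gap in b
                glob[i][j - 1] + gap,    # gap in a
            )
            loc[i][j] = max(
                0,                       # start a new local alignment
                loc[i - 1][j - 1] + s,
                loc[i - 1][j] + gap,
                loc[i][j - 1] + gap,
            )
            best_local = max(best_local, loc[i][j])

    # Global score is read from the corner; local score is the best cell.
    return glob[-1][-1], best_local


# The two sequences from the slide, without the gap character.
global_score, local_score = alignment_scores("ADLGAVFALCDRYFQ", "ADLGRTQNCDRYYQ")
print("global:", global_score, "local:", local_score)
```

The local score can never be lower than the global one, because the local recurrence is free to discard the poorly matching middle region that the global alignment is forced to include.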
Sequence evolution
In the course of evolution, sequences diverge from the ancestral sequence through random mutations.
Three types of mutations:
1. Insertion: one or more letters are inserted into the sequence. AAGA → AAGTA
2. Deletion: one or more letters are deleted from the sequence. AAGA → AGA
3. Substitution: one or more letters are replaced by others. AAGA → AACA (G → C)
Insertion + Deletion = Indel
Sequence alignment
Two sequences:
AAGCTGAATTCGAA
AGGCTCATTTCTGA
One possible alignment:
AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-
This alignment includes:
• 2 mismatches
• 4 indels (gaps)
• 10 perfect matches
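The counts above can be checked column by column. This is a small sketch, with `classify_columns` as a hypothetical helper name. It walks the two gapped rows of the alignment and tallies each column as a match, a mismatch, or an indel.

```python
from collections import Counter


def classify_columns(top: str, bottom: str) -> Counter:
    """Count matches, mismatches, and indels in a pairwise alignment."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    counts = Counter()
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            counts["indel"] += 1       # a gap on either row
        elif a == b:
            counts["match"] += 1
        else:
            counts["mismatch"] += 1
    return counts


# The two rows of the alignment shown on the slide.
counts = classify_columns("AAGCTGAATT-C-GAA", "AGGCT-CATTTCTGA-")
print(dict(counts))
```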
Choosing an alignment
Many different alignments are possible for the same pair of sequences:
AAGCTGAATTCGAA
AGGCTCATTTCTGA
Alignment 1:
AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-
Alignment 2:
A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-
Which alignment is better?
Scoring an alignment
Example: a naïve scoring system
Match: +1
Mismatch: -2
Indel: -1
AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-
Score = (+1)×10 + (-2)×2 + (-1)×4 = 2
A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-
Score = (+1)×9 + (-2)×2 + (-1)×6 = -1
Higher score → better alignment
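The two scores can be reproduced with a few lines of code. This is a sketch using the naïve scoring values from the slide as defaults; the function name `score_alignment` is an assumption made for illustration.

```python
def score_alignment(top: str, bottom: str, match=1, mismatch=-2, indel=-1) -> int:
    """Sum per-column scores of a gapped pairwise alignment."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += indel
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


first = score_alignment("AAGCTGAATT-C-GAA", "AGGCT-CATTTCTGA-")
second = score_alignment("A-AGCTGAATTC--GAA", "AG-GCTCA-TTTCTGA-")
print(first, second)  # 2 -1
```

Passing other values for `match`, `mismatch`, and `indel` is an easy way to explore how the choice of scoring system changes the ranking of candidate alignments.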
Scoring system
• Different scoring systems can produce different optimal alignments.
• Scoring systems implicitly represent a particular theory of similarity/dissimilarity between sequence characters: evolution based or physico-chemical properties based.
• Some mismatches are more plausible than others:
  • Transition vs. transversion
  • Lys→Arg ≠ Lys→Cys
• Gap extension vs. gap opening
Substitution matrices
• Nucleic acids: transition-transversion
• Amino acids:
  • Evolution (empirical data) based: PAM, BLOSUM
  • Physico-chemical properties based: Grantham, McLachlan
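For nucleic acids, a transition-transversion matrix can be written down directly. In the sketch below the numeric scores (+1, -1, -2) are illustrative assumptions, not values from the lesson. The only fixed idea is that a transition (purine↔purine or pyrimidine↔pyrimidine) is penalized less than a transversion.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def nucleotide_score(a: str, b: str, match=1, transition=-1, transversion=-2) -> int:
    """Score one aligned pair of nucleotides (score values are assumed)."""
    a, b = a.upper(), b.upper()
    if a not in "ACGT" or b not in "ACGT" or len(a) != 1 or len(b) != 1:
        raise ValueError("expected single nucleotides A, C, G or T")
    if a == b:
        return match
    # Same chemical class: A<->G or C<->T is a transition.
    if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
        return transition
    return transversion


# Build the full 4x4 substitution matrix as a dictionary of pairs.
matrix = {(a, b): nucleotide_score(a, b) for a in "ACGT" for b in "ACGT"}

for a in "ACGT":
    print(a, [matrix[(a, b)] for b in "ACGT"])
```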
PAM matrices
• A family of matrices: PAM80, PAM120, PAM250
• The number attached to a PAM matrix represents evolutionary distance
• Greater numbers denote greater distances
Which PAM matrix to use?
• Low PAM numbers: strong similarities
• High PAM numbers: weak similarities
  • PAM120 for general use (40% identity)
  • PAM60 for close relations (60% identity)
  • PAM250 for distant relations (20% identity)
• If uncertain, try several different matrices: PAM40, PAM120, PAM250
PAM - limitations
• Based on only one original dataset
• Examines proteins with few differences (85% identity)
• Based mainly on small globular proteins, so the matrix is biased
BLOSUM matrices
• The different BLOSUMn matrices are calculated independently from the BLOCKS database
• BLOSUMn is based on sequences that share at least n percent identity
• BLOSUM62 represents closer sequences than BLOSUM45
Example: BLOSUM62
Derived from blocks of sequences that share at least 62% identity
Which BLOSUM matrix to use?
• Low BLOSUM numbers for distant sequences
• High BLOSUM numbers for similar sequences
  • BLOSUM62 for general use
  • BLOSUM80 for close relations
  • BLOSUM45 for distant relations
PAM vs. BLOSUM
Roughly equivalent matrices, ordered from closer to more distant sequences:
PAM100 ≈ BLOSUM90
PAM120 ≈ BLOSUM80
PAM160 ≈ BLOSUM60
PAM200 ≈ BLOSUM52
PAM250 ≈ BLOSUM45
Gap penalty
• We expect to penalize gaps
• A different score is used for gap opening and for gap extension
• Insertions and deletions are rare in evolution
• But once they occur, they are easy to extend
• Gap-extension penalty < gap-opening penalty
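A scheme with separate opening and extension penalties is often called an affine gap penalty. The sketch below assumes one common convention: the first position of a gap costs the opening penalty and every further position costs the extension penalty. Some tools instead charge opening plus length times extension. The values -10 and -1 and the name `affine_gap_penalty` are placeholders.

```python
import re


def affine_gap_penalty(aligned_row: str, gap_open=-10, gap_extend=-1) -> int:
    """Total gap penalty for one row of an alignment ('-' marks a gap)."""
    total = 0
    # Each maximal run of '-' is one gap event of some length.
    for run in re.finditer(r"-+", aligned_row):
        length = len(run.group())
        total += gap_open + (length - 1) * gap_extend
    return total


one_long_gap = affine_gap_penalty("AC---GT")      # one gap of length 3
three_short_gaps = affine_gap_penalty("A-C-G-T")  # three gaps of length 1
print(one_long_gap, three_short_gaps)  # -12 -30
```

With these values, a single three-letter gap is much cheaper than three separate one-letter gaps, which reflects the idea that indels are rare but easy to extend.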
Web servers for pairwise alignment
• BLAST 2 sequences (bl2seq) at NCBI
• Produces the local alignment of two given sequences using the BLAST (Basic Local Alignment Search Tool) engine
• Uses a heuristic rather than an exact algorithm
Back to NCBI
BLAST – bl2seq
• blastn – nucleotide
• blastp – protein
Bl2seq – query
Bl2seq results
Legend: Match, Dissimilarity, Gaps, Similarity, Low complexity
Bl2seq results:
• Bit score: a score for the alignment according to the number of similarities, identities, etc.
• Expected score (E-value): the number of alignments with a score at least this high that one can "expect" to see by chance when searching a database of a particular size. The closer the E-value is to zero, the greater the confidence that the hit is real.
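The link between bit score, search-space size, and E-value can be made concrete. The sketch below assumes the standard relation E = m · n · 2^(-S'), where S' is the bit score, m the query length, and n the total database length. This formula is not stated on the slides, BLAST itself uses corrected "effective" lengths, and the function name and example numbers are made up for illustration.

```python
def expected_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """Approximate E-value from a bit score and the search-space size.

    Assumes E = m * n * 2**(-S'); real BLAST applies length corrections.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical search: a 147-residue query against 10**9 database residues.
for bits in (30, 40, 50):
    print(bits, expected_hits(bits, query_len=147, db_len=10**9))
```

Each additional bit halves the E-value, and the same bit score becomes less significant as the database grows.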
BLAST – programs
Query: DNA or protein
Database: DNA or protein
BLAST – blastp
Blastp – results
Blastp – results (cont'd)
Blastp – acquiring sequences
Blastp – acquiring sequences (cont'd)
FASTA format – multiple sequences
>gi|4504351|ref|NP_000510.1| delta globin [Homo sapiens]
MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFESFGDLSSPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFSQLSELHCDKLHVDPENFRLLGNVLVCVLARNFGKEFTPQMQAAYQKVVAGVAN
ALAHKYH
>gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens]
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN
ALAHKYH
>gi|4885393|ref|NP_005321.1| epsilon globin [Homo sapiens]
MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLT
SFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAI
ALAHKYH
>gi|6715607|ref|NP_000175.1| G-gamma globin [Homo sapiens]
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLT
SLGDAIKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTGVAS
ALSSRYH
>gi|28302131|ref|NP_000550.2| A-gamma globin [Homo sapiens]
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLT
SLGDATKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTAVAS
ALSSRYH
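A multi-sequence FASTA file is simple to read programmatically: each record starts with a ">" header line, and the sequence continues over the following lines until the next header. The parser below is a minimal standard-library sketch, with `parse_fasta` as a made-up name. Established libraries such as Biopython offer more robust readers.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Map each FASTA header (without '>') to its joined sequence."""
    chunks: dict[str, list[str]] = {}
    header = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            header = line[1:]             # start a new record
            chunks[header] = []
        elif header is None:
            raise ValueError("sequence data before the first '>' header")
        else:
            chunks[header].append(line)   # sequence may span several lines
    return {h: "".join(parts) for h, parts in chunks.items()}


# The first two records from the slide.
example = """\
>gi|4504351|ref|NP_000510.1| delta globin [Homo sapiens]
MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFESFGDLSSPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFSQLSELHCDKLHVDPENFRLLGNVLVCVLARNFGKEFTPQMQAAYQKVVAGVAN
ALAHKYH
>gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens]
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN
ALAHKYH
"""

records = parse_fasta(example)
for header, sequence in records.items():
    print(header.split("|")[-1].strip(), len(sequence))
```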
Searching for remote homologs
• Sometimes BLAST isn't enough
• With a large protein family, BLAST finds only the close members; we want the more distant members as well
• PSI-BLAST
• Profile HMMs (not discussed in this exercise)
PSI-BLAST
Position-Specific Iterated BLAST
1. Regular BLAST search
2. Construct a profile from the BLAST results
3. Search the database with the profile
4. Final results (after iterating steps 2 and 3)
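The "construct profile" step can be illustrated with a toy position-specific scoring matrix. The sketch below is a strong simplification and not PSI-BLAST's actual procedure. It assumes ungapped hits of equal length, a uniform background frequency, and a fixed pseudocount, and all function names are hypothetical. The input fragments are the first nine residues of the five globins in the FASTA example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned_hits: list[str], pseudocount: float = 1.0) -> list[dict[str, float]]:
    """Per-position log-odds scores from equal-length, ungapped hits."""
    length = len(aligned_hits[0])
    if any(len(seq) != length for seq in aligned_hits):
        raise ValueError("all hits must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    total = len(aligned_hits) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in aligned_hits)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def profile_score(profile: list[dict[str, float]], candidate: str) -> float:
    """Score a candidate of the same length against the profile."""
    if len(candidate) != len(profile):
        raise ValueError("candidate length must match the profile")
    return sum(column[aa] for column, aa in zip(profile, candidate))


# N-terminal fragments of the delta, beta, epsilon, G-gamma, A-gamma globins.
hits = ["MVHLTPEEK", "MVHLTPEEK", "MVHFTAEEK", "MGHFTEEDK", "MGHFTEEDK"]
profile = build_profile(hits)

related = profile_score(profile, "MVNLTSDEK")    # globin-like fragment
unrelated = profile_score(profile, "WWWWWWWWW")  # arbitrary fragment
print(round(related, 2), round(unrelated, 2))
```

In each PSI-BLAST iteration the sequences that score well against the profile are added to the set of hits and the profile is rebuilt. This is also how one wrong hit can pull in further unrelated sequences.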
PSI-BLAST
• Advantage: PSI-BLAST looks for sequences that are close to the query and learns from them to extend the circle of friends.
• Disadvantage: if we obtain a WRONG hit, we will be led to unrelated sequences (contamination). This gets worse with each iteration.
BLAST – PSI-BLAST
PSI-BLAST – results