Post on 21-Jan-2016
Pairwise Sequence Alignment
Exercise 2
|| ||  ||||| ||| || |   ||||||||||||||||||||
MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFE…
ATGGTGAACCTGACCTCTGACGAGAAGACTGCCGTCCTTGCCCTGTGGAACAAGGTGGACGTGGAAGACTGTGGTGGTGAGGCCCTGGGCAGGTTTGTATGGAGGTTACAAGGCTGCTTAAGGAGGGAGGATGGAAGCTGGGCATGTGGAGACAGACCACCTCCTGGATTTATGACAGGAACTGATTGCTGTCTCCTGTGCTGCTTTCACCCCTCAGGCTGCTGGTCGTGTATCCCTGGACCCAGAGGTTCTTTGAAAGCTTTGGGGACTTGTCCACTCCTGCTGCTGTGTTCGCAAATGCTAAGGTAAAAGCCCATGGCAAGAAGGTGCTAACTTCCTTTGGTGAAGGTATGAATCACCTGGACAACCTCAAGGGCACCTTTGCTAAACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCCTGAGAATTTCAAGGTGAGTCAATATTCTTCTTCTTCCTTCTTTCTATGGTCAAGCTCATGTCATGGGAAAAGGACATAAGAGTCAGTTTCCAGTTCTCAATAGAAAAAAAAATTCTGTTTGCATCACTGTGGACTCCTTGGGACCATTCATTTCTTTCACCTGCTTTGCTTATAGTTATTGTTTCCTCTTTTTCCTTTTTCTCTTCTTCTTCATAAGTTTTTCTCTCTGTATTTTTTTAACACAATCTTTTAATTTTGTGCCTTTAAATTATTTTTAAGCTTTCTTCTTTTAATTACTACTCGTTTCCTTTCATTTCTATACTTTCTATCTAATCTTCTCCTTTCAAGAGAAGGAGTGGTTCACTACTACTTTGCTTGGGTGTAAAGAATAACAGCAATAGCTTAAATTCTGGCATAATGTGAATAGGGAGGACAATTTCTCATATAAGTTGAGGCTGATATTGGAGGATTTGCATTAGTAGTAGAGGTTACATCCAGTTACCGTCTTGCTCATAATTTGTGGGCACAACACAGGGCATATCTTGGAACAAGGCTAGAATATTCTGAATGCAAACTGGGGACCTGTGTTAACTATGTTCATGCCTGTTGTCTCTTCCTCTTCAGCTCCTGGGCAATATGCTGGTGGTTGTGCTGGCTCGCCACTTTGGCAAGGAATTCGACTGGCACATGCACGCTTGTTTTCAGAAGGTGGTGGCTGGTGTGGCTAATGCCCTGGCTCACAAGTACCATTGA
MVNLTSDEKTAVLALWNKVDVEDCGGEALGRLLVVYPWTQRFFE…
Motivation
What is sequence alignment?
Alignment: comparing two (pairwise) or more (multiple) sequences, searching for a series of identical or similar characters in the sequences.
MVNLTSDEKTAVLALWNKVDVEDCGGE
|| ||  ||||| ||| || |   |||
MVHLTPEEKTAVNALWGKVNVDAVGGE
Why sequence alignment?
Predict characteristics of a protein
Premised on: similar sequence (or structure) implies similar function
Local vs. Global
Global alignment – finds the best alignment across the whole of the two sequences.
Local alignment – finds regions of high similarity in parts of the sequences.
Global alignment: forces alignment in regions which differ
ADLGAVFALCDRYFQ
||||     |||| |
ADLGRTQN-CDRYYQ
Local alignment: concentrates on regions of high similarity
ADLG     CDRYFQ
||||     |||| |
ADLG     CDRYYQ
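The difference between the two modes can be sketched with a small dynamic-programming scorer. The slides do not name an algorithm, so this is an assumption: the global mode follows the Needleman-Wunsch recurrence and the local mode the Smith-Waterman recurrence. The function name `alignment_score` and the default scores (+1, -1, -1) are illustrative only.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1, local=False):
    """Best alignment score of a and b by dynamic programming.

    local=False: global alignment (Needleman-Wunsch style recurrence).
    local=True:  local alignment (Smith-Waterman style recurrence).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # A global alignment has to pay for gaps at the sequence ends.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(score[i - 1][j - 1] + pair,  # align two letters
                       score[i - 1][j] + gap,       # gap in b
                       score[i][j - 1] + gap)       # gap in a
            if local:
                cell = max(cell, 0)     # a local alignment may restart anywhere
                best = max(best, cell)  # and may end anywhere
            score[i][j] = cell
    return best if local else score[-1][-1]


# The two sequences from the slide, without the gap character.
print(alignment_score("ADLGAVFALCDRYFQ", "ADLGRTQNCDRYYQ"))
print(alignment_score("ADLGAVFALCDRYFQ", "ADLGRTQNCDRYYQ", local=True))
```

The only differences between the modes are the first row and column (zero for local), the floor at zero, and where the final score is read from.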
Evolutionary changes in sequences
Three types of changes:
1. Substitution – a replacement of one (or more) sequence letter by another
2. Insertion – an insertion of a letter or several letters to the sequence
3. Deletion – deleting a letter (or more) from the sequence
Insertion + Deletion = Indel
[Figure: example sequences for each type of change, e.g. the substitution AAGA -> AACA]
Choosing an alignment:
Many different alignments are possible:
AAGCTGAATTCGAA
AGGCTCATTTCTGA

A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-

Which alignment is better?

AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-
Exercise: compute both alignment scores
Match: +1   Mismatch: -2   Indel: -1

AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-

A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-
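One way to check the exercise by machine is to score each alignment column by column. This is a minimal sketch using the scores given on the slide. The function name `score_alignment` is made up for illustration, and the aligned strings are the two-row alignments split into their upper and lower rows.

```python
def score_alignment(top, bottom, match=1, mismatch=-2, indel=-1):
    """Score two equal-length gapped strings column by column."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += indel      # a letter facing a gap
        elif x == y:
            total += match      # identical letters
        else:
            total += mismatch   # two different letters
    return total


first = score_alignment("A-AGCTGAATTC--GAA", "AG-GCTCA-TTTCTGA-")
second = score_alignment("AAGCTGAATT-C-GAA", "AGGCT-CATTTCTGA-")
print("alignment with 6 indels:", first)
print("alignment with 4 indels:", second)
```

Counting matches, mismatches and indels by hand first, then comparing with the printed totals, is a quick way to verify the exercise.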
Scoring systems: accounting for biological context
Which is true about the scores in a pairwise alignment of nucleotide sequences?
1. Tr > Tv > 0
2. Tr < Tv < 0
3. 0 > Tr > Tv
4. 0 > Tv > Tr
Tr = Transition
Tv = Transversion
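For reference, a transition swaps two purines (A, G) or two pyrimidines (C, T), while a transversion swaps a purine for a pyrimidine. Transitions are observed more often, so scoring schemes usually penalize them less. The sketch below classifies a substitution and applies illustrative scores. The numbers (+1, -1, -2) and the function names are assumptions, not values from the slides.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(x, y):
    """Classify a pair of nucleotides as match, transition or transversion."""
    x, y = x.upper(), y.upper()
    if x == y:
        return "match"
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def nucleotide_score(x, y, match=1, transition=-1, transversion=-2):
    """Illustrative scores only: both changes cost something,
    but the rarer transversion costs more."""
    kind = substitution_type(x, y)
    return {"match": match,
            "transition": transition,
            "transversion": transversion}[kind]


for pair in ("AG", "CT", "AC", "GT"):
    print(pair, substitution_type(*pair), nucleotide_score(*pair))
```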
Scoring systems: accounting for biological context
Which is true about the scores in a pairwise alignment of amino-acid sequences?
1. Asp->Asn > Asp->Glu
2. Arg->His > Ala->Phe
3. Arg->His < Thr->Met
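The three statements can be checked against a substitution matrix. The sketch below hard-codes five entries as they appear in the commonly distributed BLOSUM62 table. Only these pairs are included, and the helper name `pair_score` is made up for illustration, so treat it as a lookup aid and verify the values against a full matrix before relying on them.

```python
# Five entries from the standard BLOSUM62 table (the matrix is symmetric,
# so each pair is stored once as an unordered set of one-letter codes).
BLOSUM62_EXCERPT = {
    frozenset("DN"): 1,   # Asp / Asn
    frozenset("DE"): 2,   # Asp / Glu
    frozenset("RH"): 0,   # Arg / His
    frozenset("AF"): -2,  # Ala / Phe
    frozenset("TM"): -1,  # Thr / Met
}


def pair_score(x, y):
    """Look up the score of an amino-acid pair in the excerpt above."""
    return BLOSUM62_EXCERPT[frozenset((x, y))]


print("1. Asp->Asn > Asp->Glu:", pair_score("D", "N") > pair_score("D", "E"))
print("2. Arg->His > Ala->Phe:", pair_score("R", "H") > pair_score("A", "F"))
print("3. Arg->His < Thr->Met:", pair_score("R", "H") < pair_score("T", "M"))
```

Other matrices (PAM, or physico-chemical ones) may rank the same pairs differently.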
Substitution matrices
Nucleic acids: transition-transversion
Amino acids:
Evolutionary (empirical data) based: PAM, BLOSUM
Physico-chemical properties based: Grantham, McLachlan
PAM matrices
Family of matrices: PAM 80, PAM 120, PAM 250
The number attached to a PAM matrix represents the evolutionary distance between the sequences on which the matrix is based
Greater numbers denote greater distances
PAM - limitations
Based on only one original dataset
Examines proteins with few differences (85% identity)
Based mainly on small globular proteins, so the matrix is biased
BLOSUM matrices
Different BLOSUMn matrices are calculated independently from BLOCKS (ungapped local alignments)
BLOSUMn is based on a cluster of BLOCKS of sequences that share at least n percent identity
BLOSUM62 represents closer sequences than BLOSUM45
Substitution matrices exercise
Pick the best substitution matrix (PAM and BLOSUM) for each pairwise alignment:
Human – chimp
Human – yeast
Human – fish
PAM options: PAM60 PAM120 PAM250
BLOSUM options: BLOSUM45 BLOSUM62 BLOSUM80
PAM vs. BLOSUM
PAM100 = BLOSUM90
PAM120 = BLOSUM80
PAM160 = BLOSUM60
PAM200 = BLOSUM52
PAM250 = BLOSUM45
(going down the list: more distant sequences)
BLOSUM62 for general use
BLOSUM80 for close relations
BLOSUM45 for distant relations
PAM120 for general use
PAM60 for close relations
PAM250 for distant relations
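The rules of thumb above fit in a small lookup table. This is only a convenience sketch: the labels "close", "general" and "distant" and the function name `recommend_matrix` are made up here, and the matrix names are the ones listed on the slide.

```python
# Rule-of-thumb choices taken from the slide; the keys are illustrative labels.
RECOMMENDED = {
    "close":   {"PAM": "PAM60",  "BLOSUM": "BLOSUM80"},
    "general": {"PAM": "PAM120", "BLOSUM": "BLOSUM62"},
    "distant": {"PAM": "PAM250", "BLOSUM": "BLOSUM45"},
}


def recommend_matrix(relation, family="BLOSUM"):
    """Return the suggested matrix name for a relation and a matrix family."""
    try:
        return RECOMMENDED[relation][family]
    except KeyError:
        raise ValueError(f"unknown relation/family: {relation!r}, {family!r}")


print(recommend_matrix("close"))
print(recommend_matrix("distant", family="PAM"))
```

Note that the two families run in opposite directions: a larger PAM number means more distant sequences, while a larger BLOSUM number means closer ones.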
Gap penalty
AAGCGAAATTCGAAC
A-G-GAA-CTCGAAC

AAGCGAAATTCGAAC
AGG---AACTCGAAC
• Which alignment is more likely?
• Which alignment has a higher score?
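The answer to the second question depends on how gaps are charged. The sketch below scores both alignments under an affine gap model, where opening a gap costs more than extending one. The slides give no gap parameters, so the function name and all numbers (match +1, mismatch -2, gap opening -5, gap extension -1) are assumptions chosen for illustration.

```python
def score_with_gaps(top, bottom, match=1, mismatch=-2,
                    gap_open=-5, gap_extend=-1):
    """Score a gapped alignment. The first column of each gap run pays
    gap_open; every further column of the same run pays gap_extend."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    prev_gap = None  # which row held a gap in the previous column
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            row = "top" if x == "-" else "bottom"
            total += gap_extend if row == prev_gap else gap_open
            prev_gap = row
        else:
            total += match if x == y else mismatch
            prev_gap = None
    return total


scattered = ("AAGCGAAATTCGAAC", "A-G-GAA-CTCGAAC")  # three separate gaps
single = ("AAGCGAAATTCGAAC", "AGG---AACTCGAAC")     # one gap of length three

print("affine:", score_with_gaps(*scattered), score_with_gaps(*single))
# Setting gap_open equal to gap_extend gives a plain per-position penalty.
print("linear:", score_with_gaps(*scattered, gap_open=-1),
      score_with_gaps(*single, gap_open=-1))
```

With a per-position penalty the alignment with three scattered gaps scores higher, while the affine model favors the single longer gap, which corresponds to one insertion or deletion event rather than three.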
Web servers for pairwise alignment
BLAST 2 sequences (bl2seq) at NCBI
Produces the local alignment of two given sequences using the BLAST (Basic Local Alignment Search Tool) engine for local alignment
Does not use an exact algorithm but a heuristic
Back to NCBI
BLAST – bl2seq
Bl2seq - query
blastn – nucleotide
blastp – protein
Bl2seq results
Result annotations: match, dissimilarity, gaps, similarity, low complexity
BLAST – programs
Query: DNA Protein
Database: DNA Protein
BLAST – Blastp
Blastp - results
Blastp – results (cont’)
Blast scores:
Bit score – a score for the alignment according to the number of similarities, identities, etc.
Expect value (E-value) – the number of alignments with a score at least this high that one can "expect" to see by chance when searching a random database of a particular size. The closer the E-value is to zero, the greater the confidence that the hit is really a homolog
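The two scores are linked by the Karlin-Altschul relation E = m · n · 2^(-S'), where S' is the bit score, m the query length and n the total database length. The sketch below applies that formula directly. It ignores the effective-length corrections that BLAST itself applies, so the values only approximate what a real report shows, and the function name is made up for illustration.

```python
def expect_value(bit_score, query_length, database_length):
    """Approximate E-value from a bit score: E = m * n * 2 ** (-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical search: a 147-residue query against 10 million residues.
for bits in (20, 40, 60):
    print(bits, expect_value(bits, 147, 10_000_000))
```

Each extra bit halves the E-value, and the same bit score becomes less significant as the database grows.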
Blastp – acquiring sequences
FASTA format – multiple sequences
>gi|4504351|ref|NP_000510.1| delta globin [Homo sapiens]
MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFESFGDLSSPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFSQLSELHCDKLHVDPENFRLLGNVLVCVLARNFGKEFTPQMQAAYQKVVAGVAN
ALAHKYH
>gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens]
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN
ALAHKYH
>gi|4885393|ref|NP_005321.1| epsilon globin [Homo sapiens]
MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLT
SFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAI
ALAHKYH
>gi|6715607|ref|NP_000175.1| G-gamma globin [Homo sapiens]
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLT
SLGDAIKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTGVAS
ALSSRYH
>gi|28302131|ref|NP_000550.2| A-gamma globin [Homo sapiens]
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLT
SLGDATKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTAVAS
ALSSRYH
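A multi-sequence FASTA file is simple enough to read by hand: each record starts with a ">" header line, and all following lines up to the next ">" belong to that sequence. The parser below is a minimal sketch with a made-up function name. It is not the parser used by BLAST or any particular library, and the two short records are toy data.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            if header is not None:        # close the previous record
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)           # sequence lines are concatenated
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>seq1 toy record
MVHLTPEEK
TAVNALWGK
>seq2 another toy record
MGHFTEEDK
"""
for header, sequence in parse_fasta(example):
    print(header, len(sequence), sequence)
```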
Searching for remote homologs
Sometimes BLAST isn’t enough
Large protein family, and BLAST only finds close members. We want more distant members
PSI-BLAST
Position-Specific Iterated BLAST
Regular blast
Construct profile from blast results
Blast profile search
Final results
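The "construct profile" step can be illustrated with a toy position-specific frequency table built from already aligned hits. This is a strong simplification and an assumption on my part: PSI-BLAST itself builds a log-odds position-specific scoring matrix with sequence weighting and pseudocounts, none of which appear here. The function names and the three-letter hits are made up.

```python
from collections import Counter


def build_profile(aligned_hits):
    """Column-wise residue frequencies from equal-length aligned sequences."""
    length = len(aligned_hits[0])
    if any(len(seq) != length for seq in aligned_hits):
        raise ValueError("all aligned hits must have the same length")
    profile = []
    for column in zip(*aligned_hits):
        counts = Counter(column)
        profile.append({res: n / len(column) for res, n in counts.items()})
    return profile


def profile_score(profile, candidate):
    """Sum, over positions, of how often the candidate's residue was seen."""
    return sum(col.get(res, 0.0) for col, res in zip(profile, candidate))


hits = ["ACD", "ACE", "AGD"]   # toy aligned hits from a first search round
profile = build_profile(hits)
print(profile)
print(profile_score(profile, "ACD"), profile_score(profile, "WWW"))
```

In the next round the profile, not the original query, is searched against the database. Any unrelated sequence that enters the profile shifts these frequencies in every later iteration.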
PSI-BLAST
Advantage: PSI-BLAST looks for sequences that are close to the query, and learns from them to extend the circle of friends
Disadvantage: if we obtain a WRONG hit, we will be led to unrelated sequences (contamination). This gets worse with each iteration
PSI-BLAST
Which one(s) of the following is/are correct?
1. PSI-BLAST is expected to give more hits than BLAST
2. PSI-BLAST is an iterative search method
3. PSI-BLAST is faster than BLAST
4. Each iteration of PSI-BLAST can only improve the results of the previous iteration
BLAST – PSI-Blast
PSI-Blast - results