Before we begin…

MVNLTSDEKTAVLALWNKVDVEDCGGEALGRLLVVYPWTQRFFE…
|| ||  ||||| ||| || |   ||||||||||||||||||||
MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFE…

ATGGTGAACCTGACCTCTGACGAGAAGACTGCCGTCCTTGCCCTGTGGAACAAGGTGGACGTGGAAGACTGTGGTGGTGAGGCCCTGGGCAGGTTTGTATGGAGGTTACAAGGCTGCTTAAGGAGGGAGGATGGAAGCTGGGCATGTGGAGACAGACCACCTCCTGGATTTATGACAGGAACTGATTGCTGTCTCCTGTGCTGCTTTCACCCCTCAGGCTGCTGGTCGTGTATCCCTGGACCCAGAGGTTCTTTGAAAGCTTTGGGGACTTGTCCACTCCTGCTGCTGTGTTCGCAAATGCTAAGGTAAAAGCCCATGGCAAGAAGGTGCTAACTTCCTTTGGTGAAGGTATGAATCACCTGGACAACCTCAAGGGCACCTTTGCTAAACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCCTGAGAATTTCAAGGTGAGTCAATATTCTTCTTCTTCCTTCTTTCTATGGTCAAGCTCATGTCATGGGAAAAGGACATAAGAGTCAGTTTCCAGTTCTCAATAGAAAAAAAAATTCTGTTTGCATCACTGTGGACTCCTTGGGACCATTCATTTCTTTCACCTGCTTTGCTTATAGTTATTGTTTCCTCTTTTTCCTTTTTCTCTTCTTCTTCATAAGTTTTTCTCTCTGTATTTTTTTAACACAATCTTTTAATTTTGTGCCTTTAAATTATTTTTAAGCTTTCTTCTTTTAATTACTACTCGTTTCCTTTCATTTCTATACTTTCTATCTAATCTTCTCCTTTCAAGAGAAGGAGTGGTTCACTACTACTTTGCTTGGGTGTAAAGAATAACAGCAATAGCTTAAATTCTGGCATAATGTGAATAGGGAGGACAATTTCTCATATAAGTTGAGGCTGATATTGGAGGATTTGCATTAGTAGTAGAGGTTACATCCAGTTACCGTCTTGCTCATAATTTGTGGGCACAACACAGGGCATATCTTGGAACAAGGCTAGAATATTCTGAATGCAAACTGGGGACCTGTGTTAACTATGTTCATGCCTGTTGTCTCTTCCTCTTCAGCTCCTGGGCAATATGCTGGTGGTTGTGCTGGCTCGCCACTTTGGCAAGGAATTCGACTGGCACATGCACGCTTGTTTTCAGAAGGTGGTGGCTGGTGTGGCTAATGCCCTGGCTCACAAGTACCATTGA

MVNLTSDEKTAVLALWNKVDVEDCGGEALGRLLVVYPWTQRFFE…


Pairwise Sequence Alignment

Lesson 2

What is sequence alignment?

Alignment: Comparing two (pairwise) or more (multiple) sequences. Searching for a series of identical or similar characters in the sequences.

MVNLTSDEKTAVLALWNKVDVEDCGGE
|| ||  ||||| ||| || |   |||
MVHLTPEEKTAVNALWGKVNVDAVGGE
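
The "|" marks between two aligned sequences flag identical characters. As a minimal sketch (the function names `match_line` and `percent_identity` are illustrative, not part of the lesson), the marks and a simple percent identity can be computed like this:

```python
def match_line(row1, row2):
    """Return a string with '|' under identical characters and ' ' elsewhere."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    # A gap character ('-') never counts as a match.
    return "".join("|" if a == b and a != "-" else " " for a, b in zip(row1, row2))


def percent_identity(row1, row2):
    """Percentage of alignment columns that hold identical characters."""
    line = match_line(row1, row2)
    return 100.0 * line.count("|") / len(line)


top = "MVNLTSDEKTAVLALWNKVDVEDCGGE"
bottom = "MVHLTPEEKTAVNALWGKVNVDAVGGE"
print(top)
print(match_line(top, bottom))
print(bottom)
```

Here identity is counted over all alignment columns; other definitions (for example, dividing by the shorter sequence length) are also in use.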

Why sequence alignment?

Predict characteristics of a protein – use the structure or function information on known proteins with similar sequences available in databases in order to predict the structure or function of an unknown protein

Assumption: similar sequences produce similar proteins

Local vs. Global

Global alignment – finds the best alignment across the entire length of the two sequences.

Local alignment – finds regions of high similarity in parts of the sequences.

Global alignment: forces alignment in regions which differ

ADLGAVFALCDRYFQ
||||     |||| |
ADLGRTQN-CDRYYQ

Local alignment: concentrates on regions of high similarity

ADLG CDRYFQ
|||| |||| |
ADLG CDRYYQ
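
The lesson does not name specific algorithms, so the following is only a sketch that swaps in the classic dynamic-programming approaches: Needleman-Wunsch for the global score and Smith-Waterman for the local score. The scoring values (match +1, mismatch -1, gap -1) and the function names are assumptions chosen for illustration.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best score of an end-to-end alignment (Needleman-Wunsch, linear gaps)."""
    prev = [j * gap for j in range(len(b) + 1)]  # first row: only gaps
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # first column: only gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best score of any pair of sub-regions (Smith-Waterman, linear gaps)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The extra 0 lets an alignment restart instead of going negative.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


print(global_score("ADLGAVFALCDRYFQ", "ADLGRTQNCDRYYQ"))
print(local_score("ADLGAVFALCDRYFQ", "ADLGRTQNCDRYYQ"))
```

The only difference between the two functions is the floor at zero and taking the maximum over all cells, which is what lets the local version ignore the regions that differ.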

Sequence evolution

In the course of evolution, the sequences changed from the ancestral sequence by random mutations.

Three types of changes:
1. Insertion - an insertion of a letter or several letters into the sequence. AAGA → AAGTA
2. Deletion - a deletion of a letter (or more) from the sequence. AAGA → AGA
3. Substitution - a replacement of one (or more) sequence letter by another. AAGA → AACA (G → C)

Insertion + Deletion → Indel
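
The three kinds of change can be mimicked with plain string operations. This is only an illustrative sketch: the function names and the zero-based positions are assumptions, and the strings follow the AAGA examples from the slides.

```python
def insert(seq, pos, letters):
    """Insert one or more letters before position pos (0-based)."""
    return seq[:pos] + letters + seq[pos:]


def delete(seq, pos, length=1):
    """Delete length letters starting at position pos (0-based)."""
    return seq[:pos] + seq[pos + length:]


def substitute(seq, pos, letter):
    """Replace the letter at position pos (0-based) with another letter."""
    return seq[:pos] + letter + seq[pos + 1:]


ancestor = "AAGA"
print(insert(ancestor, 3, "T"))      # AAGTA
print(delete(ancestor, 0))           # AGA
print(substitute(ancestor, 2, "C"))  # AACA
```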

Sequence alignment

AAGCTGAATTCGAA
AGGCTCATTTCTGA

One possible alignment:

AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-

This alignment includes:
2 mismatches
4 indels (gaps)
10 perfect matches

Choosing an alignment:

Many different alignments are possible:

AAGCTGAATTCGAA
AGGCTCATTTCTGA

A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-

Which alignment is better?

Scoring an alignment:
Example - naïve scoring system:
Match: +1
Mismatch: -2
Indel: -1

AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-
Score = (+1)x10 + (-2)x2 + (-1)x4 = 2

A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-
Score = (+1)x9 + (-2)x2 + (-1)x6 = -1

Higher score → Better alignment
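
The naïve scoring system can be written as a short function that walks over the alignment columns. This is a minimal sketch; the name `score_alignment` and its parameter names are illustrative, while the default values are the ones from the slide.

```python
def score_alignment(row1, row2, match=1, mismatch=-2, indel=-1):
    """Score two gapped rows of equal length, column by column."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            score += indel      # a gap in either row is an indel
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


print(score_alignment("AAGCTGAATT-C-GAA", "AGGCT-CATTTCTGA-"))    # 2
print(score_alignment("A-AGCTGAATTC--GAA", "AG-GCTCA-TTTCTGA-"))  # -1
```

Under this scoring system the first alignment is preferred; changing the three parameters can change which alignment wins.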

Scoring system:

Different scoring systems can produce different optimal alignments

Scoring systems implicitly represent a particular theory of similarity/dissimilarity between sequence characters: evolution based, physico-chemical properties based

Some mismatches are more plausible:
• Transition vs. Transversion
• Lys→Arg ≠ Lys→Cys

Gap extension vs. Gap opening
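
One way to make some mismatches "more plausible" than others for nucleotides is to penalize transitions (purine to purine, or pyrimidine to pyrimidine) less than transversions. The sketch below assumes illustrative scores of +1, -1 and -2; these numbers and the function names are hypothetical, not values given in the lesson.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(x, y):
    """Classify a pair of nucleotides as match, transition or transversion."""
    if x == y:
        return "match"
    pair = {x, y}
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"    # A<->G or C<->T
    return "transversion"      # purine <-> pyrimidine


def nucleotide_score(x, y, match=1, transition=-1, transversion=-2):
    """Score a pair of nucleotides, penalizing transversions more heavily."""
    kind = substitution_type(x, y)
    return {"match": match, "transition": transition, "transversion": transversion}[kind]


print(substitution_type("A", "G"), nucleotide_score("A", "G"))
print(substitution_type("A", "C"), nucleotide_score("A", "C"))
```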

Substitution Matrices

Nucleic acids:
• Transition-transversion

Amino acids:
• Evolution (empirical data) based: PAM, BLOSUM
• Physico-chemical properties based: Grantham, McLachlan

PAM Matrices

Family of matrices: PAM 80, PAM 120, PAM 250

The number of a PAM matrix represents evolutionary distance

Greater numbers denote greater distances

Which PAM matrix to use?

Low PAM numbers: strong similarities

High PAM numbers: weak similarities

• PAM120 for general use (40% identity)
• PAM60 for close relations (60% identity)
• PAM250 for distant relations (20% identity)

If uncertain, try several different matrices: PAM40, PAM120, PAM250

PAM - limitations

Based on only one original dataset

Examines proteins with few differences (85% identity)

Based mainly on small globular proteins, so the matrix is biased

BLOSUM Matrices

Different BLOSUMn matrices are calculated independently from BLOCKS

BLOSUMn is based on sequences that share at least n percent identity

BLOSUM62 represents closer sequences than BLOSUM45

Example: BLOSUM62 is derived from blocks of sequences that share at least 62% identity

Which BLOSUM matrix to use?

Low BLOSUM numbers for distant sequences

High BLOSUM numbers for similar sequences

• BLOSUM62 for general use
• BLOSUM80 for close relations
• BLOSUM45 for distant relations

PAM vs. BLOSUM

PAM100 = BLOSUM90
PAM120 = BLOSUM80
PAM160 = BLOSUM60
PAM200 = BLOSUM52
PAM250 = BLOSUM45

More distant sequences toward the bottom of the list (higher PAM, lower BLOSUM)

Gap penalty

We expect to penalize gaps

A different score for gap opening and for extension:
• Insertions and deletions are rare in evolution
• But once they occur, they are easy to extend
• Gap-extension penalty < gap-opening penalty
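
A gap penalty with separate opening and extension costs is often called an affine gap penalty. The sketch below assumes one common convention (the first gap position pays the opening cost, every further position pays the extension cost) and hypothetical values of 10 and 1; neither the convention nor the numbers come from the lesson.

```python
def affine_gap_penalty(length, gap_open=10, gap_extend=1):
    """Penalty for one contiguous gap of the given length."""
    if length <= 0:
        return 0
    # Opening is paid once; each additional position only pays the extension.
    return gap_open + (length - 1) * gap_extend


# One gap of three positions is much cheaper than three separate gaps.
print(affine_gap_penalty(3))      # 12
print(3 * affine_gap_penalty(1))  # 30
```

This reflects the reasoning above: opening a gap is a rare event, while lengthening an existing gap is comparatively cheap.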

Web servers for pairwise alignment

BLAST 2 sequences (bl2Seq) at NCBI

Produces the local alignment of two given sequences using the BLAST (Basic Local Alignment Search Tool) engine for local alignment

Does not use an exact algorithm but a heuristic

Back to NCBI

BLAST – bl2seq

blastn – nucleotide

blastp – protein

Bl2Seq - query

Bl2seq results

Legend: Match, Dissimilarity, Gaps, Similarity, Low complexity

Bl2seq results:

Bit score – A score for the alignment according to the number of similarities, identities, etc.

Expected-score (E-value) – The number of alignments with the same or a better score one can "expect" to see by chance when searching a database of a particular size. The closer the E-value approaches zero, the greater the confidence that the hit is real
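
The link between the bit score and the E-value can be illustrated with the commonly cited Karlin-Altschul relation E = m x n x 2^(-S'), where S' is the bit score and m and n are the (effective) query and database lengths. The sketch below assumes that relation; the function name and the example lengths are hypothetical, and real BLAST output applies further corrections to the lengths.

```python
def expected_hits(bit_score, query_len, db_len):
    """Approximate E-value from a bit score and the search-space size."""
    return query_len * db_len * 2.0 ** (-bit_score)


# The same bit score is less convincing in a larger database.
for db_len in (1_000_000, 1_000_000_000):
    print(db_len, expected_hits(40, 150, db_len))
```

Each extra bit of score halves the E-value, while doubling the database size doubles it, which is why the E-value always refers to "a database of a particular size".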

BLAST – programs

Query: DNA Protein

Database: DNA Protein

BLAST – Blastp

Blastp - results

Blastp – results (cont’)

Blastp – acquiring sequences

Blastp – acquiring sequences (cont’)

FASTA format – multiple sequences

>gi|4504351|ref|NP_000510.1| delta globin [Homo sapiens]
MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFESFGDLSSPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFSQLSELHCDKLHVDPENFRLLGNVLVCVLARNFGKEFTPQMQAAYQKVVAGVAN
ALAHKYH

>gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens]
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN
ALAHKYH

>gi|4885393|ref|NP_005321.1| epsilon globin [Homo sapiens]
MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLT
SFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAI
ALAHKYH

>gi|6715607|ref|NP_000175.1| G-gamma globin [Homo sapiens]
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLT
SLGDAIKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTGVAS
ALSSRYH

>gi|28302131|ref|NP_000550.2| A-gamma globin [Homo sapiens]
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLT
SLGDATKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTAVAS
ALSSRYH
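
A multi-sequence FASTA file is easy to read programmatically: each record starts with a ">" header line, and the sequence continues over the following lines until the next header. The parser below is a minimal sketch (the name `parse_fasta` is illustrative, and it ignores the finer points of the format such as comment lines).

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are joined without breaks
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>seq1 first example
MVHLTPEEK
TAVNALWGK

>seq2 second example
MGHFTEEDK
"""
for name, sequence in parse_fasta(example):
    print(name, len(sequence))
```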

Searching for remote homologs

Sometimes BLAST isn’t enough

Large protein family, and BLAST only finds close members. We want more distant members
• PSI-BLAST
• Profile HMMs (not discussed in this exercise)

PSI-BLAST

Position Specific Iterated BLAST

Regular blast

Construct profile from blast results

Blast profile search

Final results
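
The loop of searching, building a profile from the hits, and searching again with the profile can be illustrated with a toy model. This is a heavily simplified sketch and not how PSI-BLAST is actually implemented: it assumes ungapped sequences of equal length, uses raw column frequencies instead of a log-odds position-specific scoring matrix, and uses a fixed score threshold instead of E-values. All names and numbers are hypothetical.

```python
from collections import Counter


def build_profile(seqs):
    """Per-column residue frequencies for equal-length, ungapped sequences."""
    profile = []
    for i in range(len(seqs[0])):
        counts = Counter(s[i] for s in seqs)
        profile.append({res: c / len(seqs) for res, c in counts.items()})
    return profile


def profile_score(profile, seq):
    """Toy score: sum of the profile frequencies of the residues in seq."""
    return sum(col.get(res, 0.0) for col, res in zip(profile, seq))


def iterated_search(query, database, threshold, max_rounds=5):
    """Repeat profile searches until the set of hits stops changing."""
    hits = [query]
    for _ in range(max_rounds):
        profile = build_profile(hits)
        new_hits = [query] + [s for s in database
                              if profile_score(profile, s) >= threshold]
        if set(new_hits) == set(hits):
            break
        hits = new_hits
    return hits


database = ["AAAT", "AATT", "ATTT", "CCCC"]
print(iterated_search("AAAA", database, threshold=2.5))
```

In this toy run the first round finds only the closest sequence, and the profile built from it is what allows a more distant sequence to pass the threshold in the next round.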

PSI-BLAST

Advantage: PSI-BLAST looks for sequences that are close to the query, and learns from them to extend the circle of friends

Disadvantage: if we obtain a WRONG hit, we will be led to unrelated sequences (contamination). This gets worse with each iteration

BLAST – PSI-Blast

PSI-Blast - results

Date post:	02-Feb-2016
Upload:	fell