Blast Clustal 4Students

Date post: 03-Jun-2018
Upload: james-mcinerney

Transcript
  • 8/12/2019 Blast Clustal 4Students

    1/33

    So you have a sequence. What now?


    The simplest bioinformatic problem:

    Let us assume you have an as yet uncharacterised nucleotide sequence that you obtained from a PCR experiment.

    Question: How do you characterise (validate) your PCR product?

    Answer:

    (1) You interrogate a PRIMARY database (e.g. GenBank) and retrieve all the sequences that are significantly similar to your query; significant similarity is taken as evidence that they are HOMOLOGOUS to it. This is done using the BLAST software.

    (2) You generate a multiple sequence alignment of the retrieved (homologous) sequences. This is done using ClustalW.


    1) Create a 2-D matrix and populate it with scores representing the similarities of the compared sequences.

    2) Accumulate the scores in the matrix and penalize insertions and deletions.

    3) Identify the highest-scoring path in the matrix.


    SEQ1 runs across the columns and SEQ2 down the rows. A cell is 1 where the two residues are identical and 0 otherwise.

                SEQ1
                A H C N I R V S G V C L C R P M
            A   1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
            I   0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
            C   0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0
            I   0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
            N   0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
    SEQ2    R   0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0
            C   0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0
            K   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
            C   0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0
            R   0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0
            H   0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
            P   0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
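    Step 1 can be sketched in a few lines of Python. This is a minimal illustration rather than the code behind the slide: the function name `dot_matrix` is made up here, and the scoring is plain identity (1 for identical residues, 0 otherwise) rather than a substitution matrix.

```python
SEQ1 = "AHCNIRVSGVCLCRPM"  # columns
SEQ2 = "AICINRCKCRHP"      # rows


def dot_matrix(rows, cols):
    """Return a matrix with 1 where the row and column residues are identical."""
    return [[1 if r == c else 0 for c in cols] for r in rows]


matrix = dot_matrix(SEQ2, SEQ1)

# Print the matrix with SEQ1 as the header and SEQ2 down the side.
print("  " + " ".join(SEQ1))
for residue, row in zip(SEQ2, matrix):
    print(residue, " ".join(str(v) for v in row))
```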


    Accumulated matrix for the same two sequences (SEQ1 across, SEQ2 down):

                A H C N I R V S G V C L C R P M
            A   8 7 6 6 5 4 4 4 4 4 3 3 2 1 0 0
            I   7 7 6 6 6 4 4 4 4 4 3 3 2 1 0 0
            C   6 6 7 6 5 4 4 4 4 4 4 3 3 1 0 0
            I   6 6 6 5 6 4 4 4 4 4 3 3 2 1 0 0
            N   5 5 5 6 5 4 4 4 4 4 3 3 2 1 0 0
            R   4 4 4 4 4 5 4 4 4 4 3 3 2 2 0 0
            C   3 3 4 3 3 3 3 3 3 3 4 3 3 1 0 0
            K   3 3 3 3 3 3 3 3 3 3 3 3 2 1 0 0
            C   2 2 3 2 2 2 2 2 2 2 3 2 3 1 0 0
            R   2 1 1 1 1 2 1 1 1 1 1 1 1 2 0 0
            H   1 2 1 1 1 1 1 1 1 1 1 1 1 1 0 0
            P   0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0

    The matrix is accumulated moving from the bottom-right corner to the top-left corner!

    Each cell (the RED CELL on the slide) receives its own match value plus the MAX previous score. The MAX previous score is the highest value in the BLUE ROW OR COLUMN: the part of the row immediately below that lies to the right of the cell, and the part of the column immediately to the right that lies below it.
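    The accumulation rule can be written out as a short Python sketch. The function name `accumulate` is hypothetical, the scoring is identity only, and, like the worked example, the sketch adds no explicit gap penalty. For each cell it looks at the row directly below (columns to the right) and the column directly to the right (rows below) and adds the largest value found there.

```python
SEQ1 = "AHCNIRVSGVCLCRPM"  # columns
SEQ2 = "AICINRCKCRHP"      # rows


def accumulate(match):
    """Accumulate a match matrix from the bottom-right to the top-left corner."""
    n_rows, n_cols = len(match), len(match[0])
    acc = [row[:] for row in match]  # last row and last column keep their match values
    for i in range(n_rows - 2, -1, -1):
        for j in range(n_cols - 2, -1, -1):
            row_below = acc[i + 1][j + 1:]                             # row below, to the right
            col_right = [acc[k][j + 1] for k in range(i + 1, n_rows)]  # column to the right, below
            acc[i][j] = match[i][j] + max(row_below + col_right)
    return acc


match = [[1 if r == c else 0 for c in SEQ1] for r in SEQ2]
accumulated = accumulate(match)
for residue, row in zip(SEQ2, accumulated):
    print(residue, " ".join(str(v) for v in row))
```

    The top-left cell ends up holding the score of the best path through the whole matrix (8 for these two sequences).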


    [Figure: Venn diagram of amino acid properties, with overlapping sets labelled Tiny, Small, Hydrophobic, Aliphatic, Aromatic, Polar and Charged (positive and negative).]


    Matrix representing probabilities of amino acid substitutions. This and other existing matrices can be used to build more accurate alignments of two sequences.


    To search databases we use heuristic, similarity-based algorithms.

    Similarity-based database searches generate local alignments to find, within a sequence database, sequences related to the query sequence.

    Given a query sequence, local alignments of the query are generated against every sequence in the database. The scores of the alignments are used to identify the sequences that are related to the query.

    BLAST is the most common heuristic algorithm used to search sequence databases.


    BLAST: The Basic Local Alignment Search Tool

    BLAST is the standard database search tool. It was developed by Stephen Altschul and colleagues in 1990.

    BLAST is a family of related programs that perform a variety of database comparisons, for example blastn (nucleotide query against a nucleotide database) and blastp (protein query against a protein database).

    Objective: to find high-scoring ungapped alignments between a query sequence and the sequences in a database. These are called High-scoring Segment Pairs (HSPs).

    The existence of such segments above a given similarity threshold indicates pairwise similarity beyond random chance. This is used to distinguish related from unrelated sequences in a database.


    The Algorithm

    Given a query sequence (e.g. QLNFSAGW):

    FIRST STEP - SEEDING. Generate all words of length k (e.g. k = 2) in the query sequence. Words in our example: QL, LN, NF, FS, SA, AG, GW.

    SECOND STEP. Identify all words in the sequences in the database.

    THIRD STEP. Align every seed against every word generated from the database. Calculate (using BLOSUM62 or another matrix) the score of every ungapped two-letter alignment generated in this way. An alignment is considered a MATCH if its score reaches a certain threshold (default = 8 for amino acids).

    FOURTH STEP. Only matches are extended to generate longer alignments. If no match is found between the query and a database sequence, that sequence is not considered any further; this saves time. If multiple matches are found for a pair of sequences, all of them are extended. The extension of a match continues until mismatches cause the alignment score to drop below a given threshold (22 for proteins, 20 for DNA).

    The resulting ungapped alignments are the HSPs.
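    The seeding and scoring steps can be illustrated with a small Python sketch. The helper names `words` and `word_score` are invented for this example, and `BLOSUM62_EXCERPT` holds only the four BLOSUM62 entries needed here; a real search would load the full matrix. The database words "EL" and "QI" are made-up examples.

```python
def words(sequence, k=2):
    """Return every overlapping word of length k, in sequence order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# Small excerpt of BLOSUM62, limited to the residue pairs used below.
BLOSUM62_EXCERPT = {("Q", "Q"): 5, ("L", "L"): 4, ("Q", "E"): 2, ("L", "I"): 2}


def word_score(word_a, word_b):
    """Score an ungapped alignment of two equal-length words."""
    total = 0
    for a, b in zip(word_a, word_b):
        pair = (a, b) if (a, b) in BLOSUM62_EXCERPT else (b, a)
        total += BLOSUM62_EXCERPT[pair]
    return total


THRESHOLD = 8  # match threshold quoted on the slide

seeds = words("QLNFSAGW", k=2)
print(seeds)  # ['QL', 'LN', 'NF', 'FS', 'SA', 'AG', 'GW']

# Compare the seed QL with three hypothetical database words.
for database_word in ("QL", "EL", "QI"):
    score = word_score("QL", database_word)
    print(database_word, score, "MATCH" if score >= THRESHOLD else "no match")
```

    With this excerpt, QL against QL scores 9 and is a match, while QL against EL (6) and QL against QI (7) fall below the threshold and would not be extended.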


    Extending a match

    Every MATCH (an alignment with a score of at least 8) is extended until we find the best extension (the alignment of maximal score).

    AGT PYNNGT NNT LTW HKR RRR K
    TAG PYNNGT NNT LTW KHK KKK R

    The initial match (or hit) is extended for as long as the score of the alignment increases. Keep extending until the score drops below 22.

    Stop when: score of the current extension < 22.
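    The slide states the stopping rule only loosely, so the sketch below makes one possible reading concrete; it is not BLAST's actual implementation. It assumes identity scoring (+1 match, -1 mismatch) instead of BLOSUM62, takes the initial hit to be the NNT word at the same position in both sequences, and stops extending in a direction once the running score has fallen `max_drop` below the best score seen so far (a drop-off rule used here in place of the fixed threshold of 22). The function name `extend_hit` and all parameter values are hypothetical.

```python
def extend_hit(a, b, start, length, match=1, mismatch=-1, max_drop=3):
    """Extend an ungapped hit a[start:start+length] / b[start:start+length] both ways.

    Returns (begin, end, score) of the best extension, with end exclusive.
    """
    def walk(positions):
        score = best = best_steps = 0
        for steps, i in enumerate(positions, start=1):
            score += match if a[i] == b[i] else mismatch
            if score > best:
                best, best_steps = score, steps  # remember the best extension so far
            elif best - score >= max_drop:
                break                            # score has dropped too far: stop
        return best, best_steps

    seed_score = sum(match if a[i] == b[i] else mismatch
                     for i in range(start, start + length))
    right_gain, right_steps = walk(range(start + length, min(len(a), len(b))))
    left_gain, left_steps = walk(range(start - 1, -1, -1))
    return start - left_steps, start + length + right_steps, seed_score + left_gain + right_gain


# The two sequences from the slide, with the spaces removed.
seq_a = "AGTPYNNGTNNTLTWHKRRRRK"
seq_b = "TAGPYNNGTNNTLTWKHKKKKR"

begin, end, score = extend_hit(seq_a, seq_b, start=9, length=3)  # hit = "NNT"
print(seq_a[begin:end], score)
```

    Under these assumptions the hit grows to the whole conserved block PYNNGTNNTLTW and the mismatching flanks are left out.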


    Interpreting BLAST

    The output of BLAST provides a list of pairwise sequence matches ranked by the statistical significance of the scores of their HSPs.

    In BLAST the statistical indicator is the E-value (NOT to be confused with a P value, see below). E-values (expectation values) express how many HSPs of at least a certain score are expected to be observed by chance alone in a database of a given size.

    E = m * n * P

    m = total number of residues in the database
    n = number of residues in the query sequence
    P = the probability that an HSP alignment is a result of random chance (THIS IS THE PROBABILITY OF THE ALIGNMENT!)


    Interpretations of E-values

    E <= 1e-50: Extremely high sequence similarity. Very close homologues.

    1e-50 < E < 1e-8: Significantly high similarity. Surely homologous.

    1e-7 < E < 1e-2 (0.01): Sequences are similar but not necessarily homologous. If they are homologous, they are distant homologues.

    0.01 < E < 10: Match not significant.

    Generally speaking, as a rule of thumb: E <= 1e-8 is significant.

    Calculating E-values: an example

    Given a query sequence 100 residues long, a database containing 10^12 residues, and P = 1 * 10^-20 (for the HSP between the two sequences):

    E-value = 100 * 10^12 * 10^-20 = 10^-6

    This will be expressed as 1e-6 in the BLAST output.
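    The same arithmetic in Python, as a quick check of the worked example. The function name `e_value` and the constant `SIGNIFICANT` are illustrative names; the formula is the simplified E = m * n * P used on these slides.

```python
import math


def e_value(query_length, database_size, p):
    """E = m * n * P, with m = residues in the database and n = residues in the query."""
    return database_size * query_length * p


SIGNIFICANT = 1e-8  # rule-of-thumb cutoff from the slide

e = e_value(query_length=100, database_size=10**12, p=1e-20)
print(f"E = {e:.0e}")                       # scientific notation, as BLAST reports it
print("significant:", e <= SIGNIFICANT)     # 1e-6 is above the 1e-8 cutoff
print("matches 10^-6:", math.isclose(e, 1e-6))
```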


    E = 4.2


    Proteins can be classified in families

    Members of a family generally perform similar (or related) tasks and have specific signatures. They are identified using BLAST.

    If we can identify a protein as a member of a well-characterised family, we can generally predict its function.

    Signatures of a protein family are referred to as Conserved Motifs.

    Conserved motifs can only be identified by building a multiple sequence alignment. If we can identify a conserved motif, we have learned something useful about the protein family under consideration.

    Motifs generally have functional and/or structural relevance.

    Understanding motifs is useful for biotech purposes: proteins with specific functions can be engineered. Clues about the causes of diseases can also be revealed.


    Building a multiple sequence alignment can be easy or difficult

    Easy:

    GCGGCCCA TCAGGTAGTT GGTGG
    GCGGCCCA TCAGGTAGTT GGTGG
    GCGTTCCA TCAGCTGGTT GGTGG
    GCGTCCCA TCAGCTAGTT GGTGG
    GCGGCGCA TTAGCTAGTT GGTGA
    ******** ********** *****

    Difficult due to insertions or deletions (indels):

    TTGACATG CCGGGG---A AACCG
    TTGACATG CCGGTG--GT AAGCC
    TTGACATG -CTAGG---A ACGCG
    TTGACATG -CTAGGGAAC ACGCG
    TTGACATC -CTCTG---A ACGCG
    ******** ?????????? *****


    Multiple Sequence Alignment - Goals

    To generate a concise, information-rich summary of sequence data.

    Sometimes used to illustrate the dissimilarity or similarity between a group of sequences.

    Alignments can be treated as models that can be used to test hypotheses.

    Does this model of events accurately reflect known biological evidence?


    Multiple Sequence Alignment with Clustal (Thompson 1996): The Principle

    1) Given a set of sequences, the first step of a multiple sequence alignment is calculating the pairwise distances between the sequences.

    2) The pairwise distances are used to build a guide tree, which guides the multiple sequence alignment.

    3) Following the guide tree, sequences are aligned starting from the two most similar. More distantly related sequences are progressively added.

    [Figure labels: Seq_a, Seq_b, Seq_c, Seq_d]
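    As a rough illustration of step 1, the sketch below computes a simple pairwise distance: the fraction of compared positions that differ. ClustalW derives its distances from pairwise alignments of the unaligned sequences; to keep the example short, it instead reuses three rows of the "easy" nucleotide alignment from the earlier slide, which are already aligned. The names `p_distance`, `seq_1`, `seq_4` and `seq_5` are made up for this example.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of aligned positions that differ, ignoring columns with a gap."""
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in compared) / len(compared)


# Rows 1, 4 and 5 of the "easy" alignment, with the spaces removed.
sequences = {
    "seq_1": "GCGGCCCATCAGGTAGTTGGTGG",
    "seq_4": "GCGTCCCATCAGCTAGTTGGTGG",
    "seq_5": "GCGGCGCATTAGCTAGTTGGTGA",
}

distances = {
    (n1, n2): p_distance(sequences[n1], sequences[n2])
    for n1, n2 in combinations(sequences, 2)
}
for pair, d in distances.items():
    print(pair, round(d, 3))

# The closest pair is the one a progressive aligner would align first.
closest = min(distances, key=distances.get)
print("align first:", closest)
```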


    ClustalW - Guide Tree

    Generate a Neighbor-Joining guide tree from these pairwise distances. This guide tree gives the order in which the progressive alignment will be carried out.


    ClustalW - First pair

    Align the two most closely related sequences first. This alignment is then fixed and will never change. If a gap is to be introduced subsequently, it will be introduced in the same place in both sequences, but their relative alignment remains unchanged.


    CLUSTAL W

    Quick pairwise alignment: calculate distance matrix

                         1     2     3     4     5
    Hbb_Human    1       -
    Hbb_Horse    2      .17    -
    Hba_Human    3      .59   .60    -
    Hba_Horse    4      .59   .59   .13    -
    Myg_Whale    5      .77   .77   .75   .75    -

    Neighbor-joining tree (guide tree)

    [Figure: guide tree joining Hbb_Human, Hbb_Horse, Hba_Horse, Hba_Human and Myg_Whale, with steps numbered 1 to 4.]

    Progressive alignment following guide tree

    1 PEEKSAVTALWGKVN--VDEVGG
    2 GEEKAAVLALWDKVN--EEEVGG
    3 PADKTNVKAAWGKVGAHAGEYGA
    4 AADKTNVKAAWSKVGGHAGEYGA
    5 EHEWQLVLHVWAKVEADVAGHGQ

    [Figure label: alpha-helices]
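    To see how a distance matrix turns into an alignment order, the sketch below clusters the five globin sequences using the distances from this slide. Note the substitution: it uses simple average-linkage clustering (UPGMA-style), which is easier to follow than the neighbor-joining method that ClustalW uses, so it should be read as an illustration of the idea only. The function names `cluster_distance` and `join_order` are hypothetical.

```python
from itertools import combinations

NAMES = ["Hbb_Human", "Hbb_Horse", "Hba_Human", "Hba_Horse", "Myg_Whale"]

# Lower triangle of the distance matrix shown on the slide.
LOWER = [
    [],
    [0.17],
    [0.59, 0.60],
    [0.59, 0.59, 0.13],
    [0.77, 0.77, 0.75, 0.75],
]
DIST = {
    frozenset((NAMES[i], NAMES[j])): d
    for i, row in enumerate(LOWER)
    for j, d in enumerate(row)
}


def cluster_distance(c1, c2):
    """Average distance between the members of two clusters (average linkage)."""
    pairs = [(a, b) for a in c1 for b in c2]
    return sum(DIST[frozenset(p)] for p in pairs) / len(pairs)


def join_order(names):
    """Repeatedly merge the two closest clusters and record each merge."""
    clusters = [(name,) for name in names]
    order = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: cluster_distance(*p))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        order.append((c1, c2))
    return order


for step, (c1, c2) in enumerate(join_order(NAMES), start=1):
    print(step, c1, "+", c2)
```

    With these distances the two alpha-globins are joined first (.13), then the two beta-globins (.17), then the two globin pairs, and the whale myoglobin is added last.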


    Advice on progressive alignment

    Progressive alignment is a mathematical process that is completely independent of biological reality.

    Can be a very good estimate.

    Can be an impossibly poor estimate.

    Requires user input and skill.

