
BLAST ND FASTA

Date post: 10-Apr-2018
Upload: devbaljinder
  • 8/8/2019 BLAST ND FASTA

    1/28

    BLOSUM 62

    The Blast and FastA algorithms


    Global alignments that do not include gaps: a 200 PAM matrix for sequences that are thought to be related.

    Unknown sequences: a 120 PAM matrix was the best compromise.

    Local alignment methods: PAM40, PAM120 and PAM250. The lower PAM matrices (40-120) find short alignments of highly similar sequences, while the higher PAM matrices (120-250) find longer, weaker local alignments.


    Standard BLAST: overall, the BLOSUM 62 matrix is the most effective.

    However, each of the other substitution matrices performs better than BLOSUM 62 for a proportion of the families.


    Algorithms

    Comparing sequences by dot matrix display or by any other standard method of sequence comparison is a very slow process. Therefore, the BLAST and FASTA approximation algorithms are most commonly used.


    BLAST and FASTA create alignments

    In an optimal alignment, non-identical characters and gaps are placed so as to bring as many identical or similar characters as possible into columns.

    Two types of sequence alignment are used: global and local.


    In global alignment, an attempt is made to align the entire sequences, using as many characters as possible. The alignment is stretched over the entire sequence lengths to include as many matching amino acids as possible, up to and including the sequence ends. Although there is an obvious region of identity in this example (the sequence FGKG), a global alignment may not align such regions in order to favour matching more amino acids along the entire sequence lengths.

    LGPSTKQFGKGSSSRIWDN
    |      ||||   |  |    global alignment
    LNQIERSFGKGAIMRLGDA
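
    The idea of stretching an alignment over both full sequences can be sketched with the classic Needleman-Wunsch dynamic programme. This is a minimal illustration only: the scoring values (match, mismatch, gap) are assumptions chosen for the example, not parameters taken from these slides, and a real protein search would score with a substitution matrix such as BLOSUM 62.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment sketch; returns (score, aligned_a, aligned_b).

    The match/mismatch/gap values are illustrative assumptions.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner so both ends are included.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACGT", "AGT"))  # (1, 'ACGT', 'A-GT')
```

    Because the traceback always starts at the final cell and runs to the origin, every residue of both sequences appears in the output, which is what distinguishes this from a local alignment.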


    Local alignment.

    The alignment tends to stop at the ends of regions of identity or strong similarity. A much higher priority is given to finding these local regions than to extending the alignment to include more neighbouring amino acid pairs. Dashes indicate sequence not included in the alignment. This type of alignment favours finding conserved amino acid motifs in related protein sequences.

    -------FGKG--------
           ||||           local alignment
    -------FGKG--------
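
    A local alignment of the same two sequences can be sketched with the Smith-Waterman dynamic programme, in which scores are never allowed to fall below zero and the traceback starts at the highest-scoring cell. The scoring values below are assumptions made for illustration, not parameters from these slides.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment sketch; returns (score, aligned_a, aligned_b).

    The match/mismatch/gap values are illustrative assumptions.
    """
    n, m = len(a), len(b)
    h = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Clamping at zero lets an alignment start anywhere.
            h[i][j] = max(0,
                          h[i - 1][j - 1] + s,
                          h[i - 1][j] + gap,
                          h[i][j - 1] + gap)
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and h[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif h[i][j] == h[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("LGPSTKQFGKGSSSRIWDN", "LNQIERSFGKGAIMRLGDA"))
# (8, 'FGKG', 'FGKG')
```

    With these assumed scores the flanking mismatches cost more than the nearby isolated matches gain, so the alignment stops at the ends of the FGKG motif.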


    Global alignment is appropriate for sequences that are

    known to share similarity over their whole length.


    Algorithm: FASTA

    Step 1: Preprocessing

    Finds regions of similarity by making an index showing all of the amino acid positions for each sequence, e.g. a C at position 1, an S at position 2, etc.

    Step 2: Heuristic searching

    These indexes are used to find whether a run of the same characters occurs in the same order in the two sequences being compared.

    If these runs are long enough, the sequences are similar.

    An alignment is shown with the best-matched sequences in the database.
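
    The two steps above can be sketched as a word (k-tuple) lookup followed by a count of hits per diagonal. This is a simplified illustration under assumed names (`ktup_index`, `diagonal_hits`) and an assumed word length of 2; it is not the FASTA program's actual implementation, which goes on to rescore and join the best regions.

```python
from collections import defaultdict


def ktup_index(seq, k):
    """Step 1 (preprocessing): map every k-letter word to its positions."""
    index = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        index[seq[pos:pos + k]].append(pos)
    return index


def diagonal_hits(query, target, k=2):
    """Step 2 (heuristic search): count shared words on each diagonal.

    A diagonal is target_pos - query_pos. Many hits on one diagonal mean
    a run of the same characters in the same order in both sequences.
    """
    index = ktup_index(query, k)
    counts = defaultdict(int)
    for t_pos in range(len(target) - k + 1):
        for q_pos in index.get(target[t_pos:t_pos + k], ()):
            counts[t_pos - q_pos] += 1
    return dict(counts)


print(diagonal_hits("LGPSTKQFGKGSSSRIWDN", "LNQIERSFGKGAIMRLGDA"))
# {0: 3, 15: 1}
```

    The three hits on diagonal 0 are the overlapping words FG, GK and KG of the FGKG motif; the single hit on diagonal 15 is a chance LG match.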


    FastA output (PAM250), top 10 sequences

    init1: scores used to rank the database sequences

    initn: sum of init1 scores minus a penalty for gaps (20)

    NW opt score


    Characteristics of FASTA:

    Local alignments: FASTA tries to find patches of regional similarity, rather than trying to find the best alignment between your entire query and an entire database sequence.

    Gapped alignments: alignments generated with FASTA can contain gaps.

    Rapid.

    Heuristic: FASTA is not guaranteed to find the best alignment between your query and the database; it may miss matches. This is because it uses a strategy which is expected to find most matches, but sacrifices complete sensitivity in order to gain speed.


    initn = init1 = opt indicates 100% identity over the matched stretch.

    initn > init1 indicates that there is more than one matching region in the database sequence, with poorly matching region(s) separating them.

    opt > initn shows that the matching regions are greatly improved by the addition of gaps in one or both of the sequences. Such differences in score are indicative of non-homologous sequences.

    opt < initn: FASTA only optimizes within a narrow band along the same diagonal as the init1 region (the best single region of match). If any of the (n-1) other regions lie outside the band, they are excluded from the optimized score, i.e. there is too large a separation between the good scoring regions for FASTA to join them.

    Scores
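
    The reading rules above can be written down as a small helper. This is only a restatement of the slide's rules of thumb in code; the function name `interpret_fasta_scores` and the wording of the returned notes are hypothetical and not part of FASTA itself.

```python
def interpret_fasta_scores(init1, initn, opt):
    """Return the rules of thumb that apply to one FASTA hit.

    Hypothetical helper: it only encodes the comparisons listed above.
    """
    notes = []
    if init1 == initn == opt:
        notes.append("single region, identical over the matched stretch")
    if initn > init1:
        notes.append("more than one matching region in the database sequence")
    if opt > initn:
        notes.append("score improved by adding gaps")
    if opt < initn:
        notes.append("some regions lie outside the optimization band")
    return notes


for scores in [(120, 120, 120), (60, 95, 140), (60, 95, 70)]:
    print(scores, interpret_fasta_scores(*scores))
```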


    With the BLAST algorithm, a substitution matrix is used during all phases of protein searches (BLASTP, BLASTX, TBLASTN).

    FASTA uses a substitution matrix only for the extension phase. This is in contrast to BLAST, which uses a matrix for both phases. To reduce the penalty of using a substitution matrix for only the second phase, set the k-tuple parameter to a low value (1). However, this incurs a significant speed penalty for you.

    Finding a local alignment: BLAST algorithm


    Algorithms: BLAST

    BLAST makes an index of the query sequence showing the positions of each possible amino acid triplet, e.g. CCC occurs at position 1, YTL at position 23, etc.

    Triplets are ordered according to how often they will occur by chance in two related proteins, the most rarely found being the most significant.

    A matrix (for instance BLOSUM62) is used to determine these significances.


    Each database sequence is searched for these unusual triplets first.

    An alignment is shown with the best matched sequences in the database

    This is a heuristic (tried-and-true) method which usually works well.


    BLAST (Basic Local Alignment Search Tool).

    Characteristics:

    Local alignments: BLAST tries to find patches of regional similarity, rather than trying to find the best alignment between your entire query and an entire database sequence.

    Ungapped alignments: alignments generated with the original BLAST algorithm do not contain gaps. BLAST's speed and statistical model depend on this, but in theory it reduces sensitivity. However, BLAST will report multiple local alignments between your query and a database sequence.


    Rapid: BLAST is extremely fast.

    Heuristic: BLAST is not guaranteed to find the best alignment between your query and the database; it may miss matches. This is because it uses a strategy which is expected to find most matches, but sacrifices complete sensitivity in order to gain speed.

    However, in practice few biologically significant matches that can be found with other sequence search programs are missed by BLAST.

    BLAST searches the database in two phases. First it looks for short subsequences which are likely to produce significant matches, and then it tries to extend these subsequences.
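
    The two phases can be sketched as word seeding followed by ungapped extension. This is a simplified illustration, not BLAST's implementation: it seeds on exact 3-letter words only, scores with assumed match/mismatch values instead of a substitution matrix, and stops extending once the score falls an assumed `x_drop` below the best seen. The function names are hypothetical.

```python
def find_seeds(query, subject, w=3):
    """Phase 1: positions (query_pos, subject_pos) of shared w-letter words."""
    index = {}
    for q in range(len(query) - w + 1):
        index.setdefault(query[q:q + w], []).append(q)
    seeds = []
    for s in range(len(subject) - w + 1):
        for q in index.get(subject[s:s + w], []):
            seeds.append((q, s))
    return seeds


def _extend(query, subject, i, j, step, match, mismatch, x_drop):
    """Walk one way along the diagonal; return (best gain, residues kept)."""
    gain = best = kept = length = 0
    while 0 <= i < len(query) and 0 <= j < len(subject):
        gain += match if query[i] == subject[j] else mismatch
        length += 1
        if gain > best:
            best, kept = gain, length
        elif best - gain >= x_drop:
            break  # score has dropped too far below the best seen
        i += step
        j += step
    return best, kept


def extend_seed(query, subject, q, s, w=3, match=2, mismatch=-1, x_drop=3):
    """Phase 2: ungapped extension of one seed; returns (score, segment)."""
    right_gain, right_len = _extend(query, subject, q + w, s + w, 1,
                                    match, mismatch, x_drop)
    left_gain, left_len = _extend(query, subject, q - 1, s - 1, -1,
                                  match, mismatch, x_drop)
    score = w * match + left_gain + right_gain
    return score, query[q - left_len:q + w + right_len]


a, b = "LGPSTKQFGKGSSSRIWDN", "LNQIERSFGKGAIMRLGDA"
print(find_seeds(a, b))           # [(7, 7), (8, 8)]
print(extend_seed(a, b, 7, 7))    # (8, 'FGKG')
```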


    BLASTP: search a protein sequence against a protein database.

    BLASTN: search a nucleotide sequence against a nucleotide database.

    TBLASTN: search a protein sequence against a nucleotide database, by translating each database nucleotide sequence in all 6 reading frames.

    BLASTX: search a nucleotide sequence against a protein database, by first translating the query nucleotide sequence in all 6 reading frames.

    Especially good for EST databases.
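
    Translation in all 6 reading frames, as used by TBLASTN and BLASTX, can be sketched as follows. The sketch assumes the standard genetic code and plain A/C/G/T input; the function names and the frame labels (+1 to +3, -1 to -3) are choices made for this example.

```python
# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    x + y + z: AMINO[16 * i + 4 * j + k]
    for i, x in enumerate(BASES)
    for j, y in enumerate(BASES)
    for k, z in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Return {frame label: protein} for 3 forward and 3 reverse frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


print(six_frame_translations("ATGGCCTAA"))
# {'+1': 'MA*', '+2': 'WP', '+3': 'GL', '-1': 'LGH', '-2': '*A', '-3': 'RP'}
```

    Each of the six protein strings would then be compared against the protein query (TBLASTN) or protein database (BLASTX).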


    Finally, some rules of thumb: Homology

    Protein sequence comparisons typically double the evolutionary look-back time over DNA sequence comparisons.

    The requirement for a common folded structure in homologous proteins usually causes these proteins to be similar over the entire length of the gene product (or domain). Therefore, most sequences that share statistically significant similarity throughout their entire lengths are homologous.

    Matches that are more than 50% identical in a 20-40 amino acid region occur frequently by chance.


    Distantly related homologs may lack significant similarity. Two or more homologous sequences may have very few absolutely conserved residues.

    If homology has been inferred due to significant similarity scores between two proteins, A and B, that align over their entire lengths, and between protein B and a third protein, C, then proteins A and C must also be homologous, even if they share no significant similarity.

    Low complexity regions, transmembrane regions and coiled-coil regions frequently display significant similarity in the absence of homology. Low complexity regions can be filtered out using the default parameters of BLAST. Transmembrane and coiled-coil regions should be identified and masked (by eliminating these regions from the query) by the user.


    Significance

    Results of searches using different scoring systems may be compared directly using normalized scores.

    If S is the (raw) score for a local alignment, the normalized score S' (in bits) is calculated by the formula S' = (lambda*S - ln K) / ln 2, where lambda and K are parameters associated with a given scoring system.

    A normalized score S' is statistically significant at E value E if it exceeds log2(N/E), where N is the size of the search space. As the evolutionary distance between two sequences increases, the length of a local alignment required to achieve a statistically significant score also increases.
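
    The bit-score formula and the significance threshold can be checked numerically. In the sketch below the function names are hypothetical, and the lambda, K and search-space values in the usage lines are assumptions chosen for illustration; real values depend on the scoring system and database in use.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalized score in bits: S' = (lambda*S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits, search_space):
    """Expected number of chance hits scoring at least S': E = N / 2**S'."""
    return search_space * 2.0 ** (-bits)


def significance_threshold(search_space, e):
    """Bit score that must be exceeded for E value e: log2(N / e)."""
    return math.log2(search_space / e)


# Illustrative, assumed parameters (not taken from the slides).
lam, k = 0.3176, 0.134
n = 1e9  # assumed search-space size
bits = bit_score(80, lam, k)
print(round(bits, 1), e_value(bits, n), significance_threshold(n, 0.01))
```

    The two relations are consistent with each other: a score equal to log2(N/E) bits gives exactly E expected chance hits.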


    Global alignment is appropriate for sequences that are known to share similarity over their whole length.

    Local alignment is appropriate when the sequences may show isolated regions of similarity, for example multiple domains or repeats.

    Local alignment is best applied when scanning a database to find similarities or when there is no a priori knowledge that the protein sequences are similar.

    Summary of previous


    Database artifacts and Low complexity filters


    Database Artifacts

    Vector sequences: a number of authors have identified and catalogued the contamination of sequence databases with vectors.

    Among the studies are:

    Claverie. Genomics 12:838. 1992.

    Lamperti et al. Nucleic Acids Res. 20:2741. 1992. Of particular note in this paper is the finding of short apparent vector sequences in the middle of non-vector sequence. The authors speculate that these may be due to errors in the editing of sequences or to rearranged plasmids.

    Lopez, Kristensen, & Prydz. Nature 355:211. 1992.

    Kristensen, Lopez, & Prydz. An estimate of the sequencing error frequency in the DNA sequence databases. DNA Seq 2:343. 1989.


    Heterologous sequences

    White, O. et al. Nucleic Acids Res. 21:2829.

    Describes a statistical method to compare sequence sets (but not individual sequences). Shows that several sets of cDNAs show bulk properties different from those of human cDNAs. Sequence comparisons are used to show that this is due to contamination of the anomalous libraries with yeast and bacterial sequences.

    Rearranged & deleted sequences

    Repetitive element contamination

    cDNA cloning methods may sometimes capture retroelements such as Alus. In some cases, chimaeras between cellular transcripts and Alus may form. Derived protein sequences which appear to contain Alu-derived sequences were catalogued by Claverie (Genomics 12:838).

    Sequencing errors / Natural polymorphisms


    Sequence Pre-Filters

    Reducing matches due to biased amino acid composition

    Many amino acid sequences are highly repetitive in nature, especially naive translations of genomic DNA. Matches between such segments are more likely to be due to these local amino acid composition biases than to common descent. Filters have been developed to mask out regions showing highly biased local composition.

    SEG (Wootton & Federhen, Computers & Chemistry 17:149. 1993)

    XNU (Claverie & States, Computers & Chemistry 17:191. 1993)
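
    The idea of masking regions of biased composition can be illustrated with a sliding-window Shannon entropy filter. This is a deliberately simple stand-in and is not the SEG or XNU algorithm; the window size, the entropy threshold and the function names are assumptions made for this example.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Compositional entropy of a window, in bits per residue."""
    counts = Counter(window)
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=12, threshold=2.2, mask_char="X"):
    """Replace residues in any low-entropy window with mask_char.

    window and threshold are illustrative assumptions, not SEG defaults.
    """
    masked = [False] * len(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = True
    return "".join(mask_char if m else c for c, m in zip(seq, masked))


print(mask_low_complexity("ACDEFGHIKLMN" + "QQQQQQQQQQQQ"))
```

    Masked residues are replaced with X so that a subsequent search cannot seed matches inside the repetitive stretch.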


    The end

    Thank you for your attention
