BLAST AND FASTA

1/28

BLOSUM 62

The BLAST and FASTA algorithms


2/28

Global alignments that do not include gaps: a 200 PAM matrix for sequences that are thought to be related.

Unknown sequences: a 120 PAM matrix was the best compromise.

Local alignment methods: PAM40, PAM120 and PAM250. The lower PAM matrices (40-120) find short alignments of highly similar sequences, while the higher PAM matrices (120-250) find longer, weaker local alignments.


3/28

Standard BLAST: overall, the BLOSUM 62 matrix is the most effective.

However, each of the other substitution matrices performs better than BLOSUM 62 for a proportion of the families.


4/28

Algorithms

Comparing sequences by dot matrix display, or by any other standard method of sequence comparison, is a very slow process. Therefore:

Most commonly the BLAST and FASTA approximation algorithms are used.


5/28

BLAST and FASTA create alignments

In an optimal alignment, non-identical characters and gaps are placed so as to bring as many identical or similar characters as possible into columns.

Two types of sequence alignment are used: global and local.


6/28

In global alignment, an attempt is made to align the entire sequences, using as many characters as possible. The alignment is stretched over the entire sequence lengths to include as many matching amino acids as possible, up to and including the sequence ends. Although there is an obvious region of identity in this example (the sequence FGKG), a global alignment may not align such regions, in order to favour matching more amino acids along the entire sequence lengths.

LGPSTKQFGKGSSSRIWDN
|      ||||   |  |     global alignment
LNQIERSFGKGAIMRLGDA
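A global alignment of this kind can be sketched with the Needleman-Wunsch dynamic programming method. The sketch below is illustrative only: the function name and the toy scoring scheme (match +1, mismatch -1, gap -2) are assumptions, not values from the slides, and a real search would score residue pairs with a substitution matrix such as PAM or BLOSUM.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a toy scoring scheme (assumed values)."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align the two residues
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    # Trace back from the bottom-right corner so both sequence ends are included.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(global_align("LGPSTKQFGKGSSSRIWDN", "LNQIERSFGKGAIMRLGDA"))
```

Because the traceback starts in the corner of the matrix, every residue of both sequences appears in the result, which is the defining property of a global alignment.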


7/28

Local alignment.

The alignment tends to stop at the ends of regions of identity or strong similarity. A much higher priority is given to finding these local regions than to extending the alignment to include more neighbouring amino acid pairs. Dashes indicate sequence not included in the alignment. This type of alignment favours finding conserved amino acid motifs in related protein sequences.

-------FGKG--------
       ||||            local alignment
-------FGKG--------
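A local alignment can be sketched with the Smith-Waterman method, which differs from the global method in two ways: scores are never allowed to fall below zero, and the traceback starts at the highest-scoring cell and stops at the first zero. The function name and the scoring values (match +2, mismatch -1, gap -2) are assumptions chosen for illustration.

```python
def local_align(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment with a toy scoring scheme (assumed values)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The floor of 0 lets an alignment restart anywhere.
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until the score reaches zero.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i -= 1
            j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(local_align("LGPSTKQFGKGSSSRIWDN", "LNQIERSFGKGAIMRLGDA"))
```

With these scoring values the two example sequences yield only the conserved FGKG region, and the flanking residues are left out of the alignment.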


8/28

Global alignment is appropriate for sequences that are

known to share similarity over their whole length.


9/28

Algorithms FASTA

Step 1 Preprocessing: finds regions of similarity by making an index showing all of the amino acid positions for each sequence, i.e. a C at position 1, an S at position 2, etc.

Step 2 Heuristic searching: these indexes are used to find whether a run of the same characters occurs in the same order in the two sequences being compared.

If these runs are long enough, the sequences are similar.

An alignment is shown with the best matched sequences in the database.
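The two steps can be sketched as a k-tuple lookup table followed by a count of hits per diagonal. This is a simplified illustration, not the FASTA program itself: the function names and the word length k=2 are assumptions, and the real program goes on to rescore and join the best diagonal regions.

```python
from collections import Counter, defaultdict


def ktuple_index(seq, k):
    """Step 1: map every k-tuple to the 0-based positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        index[seq[pos:pos + k]].append(pos)
    return index


def diagonal_hits(query, subject, k=2):
    """Step 2: count shared k-tuples per diagonal (subject position - query position).

    Many hits on one diagonal mean a run of characters occurs in the same
    order in both sequences.
    """
    index = ktuple_index(query, k)
    hits = Counter()
    for s_pos in range(len(subject) - k + 1):
        for q_pos in index.get(subject[s_pos:s_pos + k], []):
            hits[s_pos - q_pos] += 1
    return hits


if __name__ == "__main__":
    hits = diagonal_hits("LGPSTKQFGKGSSSRIWDN", "LNQIERSFGKGAIMRLGDA")
    print(hits.most_common(3))
```

For the two example sequences used earlier, the main diagonal collects three hits (FG, GK, KG), which together cover the FGKG region.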


10/28

FASTA output (PAM250, top 10 sequences)

init1: the init1 scores are used to rank the database sequences.

initn: sum of the init1 scores minus a penalty for gaps (20).

opt: NW optimized score.


11/28

Characteristics of FASTA:

Local alignments: FASTA tries to find patches of regional similarity, rather than trying to find the best alignment between your entire query and an entire database sequence.

Gapped alignments: Alignments generated with FASTA can contain gaps.

Rapid

Heuristic: FASTA is not guaranteed to find the best alignment between your query and the database; it may miss matches. This is because it uses a strategy which is expected to find most matches, but sacrifices complete sensitivity in order to gain speed.


12/28

Scores

Initn = init1 = opt indicates 100% identity over the matched stretch.

Initn > init1 indicates that there is more than one matching region in the database sequence, with poorly matching region(s) separating them.

Opt > initn shows that the matching regions are greatly improved by the addition of gaps in one or both of the sequences. Such differences in score are indicative of non-homologous sequences.

Opt < initn: FASTA only optimizes within a narrow band along the same diagonal as the init1 region (the best single region of match). If any of the other (n-1) regions lie outside the band, they are excluded from the optimized score, i.e. there is too large a separation between the good scoring regions for FASTA to join them.


13/28

With the BLAST algorithm a substitution matrix is used during all phases of protein searches (BLASTP, BLASTX, TBLASTN).

FASTA uses a substitution matrix only for the extension phase. This is in contrast to BLAST, which uses a matrix for both phases. To reduce the penalty of using a substitution matrix for only the second phase, set the k-tuple parameter to a low value (1). However, this will give a significant speed penalty (for you).

Finding a local alignment: BLAST algorithm


14/28

Algorithms BLAST

BLAST makes an index of the query sequence showing the positions of each possible amino acid triplet, i.e. CCC occurs at position 1, YTL at position 23, etc.

Triplets are ordered according to how often they will occur by chance in two related proteins, the most rarely found being the most significant.

A matrix (for instance BLOSUM62) is used to determine these significances.
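The indexing and ranking of triplets can be sketched as follows. This is a rough illustration under stated assumptions: the function names are hypothetical, and each triplet is ranked only by its BLOSUM62 self-match score (the diagonal of the matrix), used here as a simple stand-in for how unlikely the triplet is to match by chance. The BLAST program works with the full matrix and a score threshold rather than with self-scores alone.

```python
# Diagonal (self-match) scores of BLOSUM62, one entry per amino acid.
# Rare residues such as W and C score highest.
BLOSUM62_SELF = {
    "A": 4, "R": 5, "N": 6, "D": 6, "C": 9, "Q": 5, "E": 5, "G": 6,
    "H": 8, "I": 4, "L": 4, "K": 5, "M": 5, "F": 6, "P": 7, "S": 4,
    "T": 5, "W": 11, "Y": 7, "V": 4,
}


def word_index(query, w=3):
    """Map each word of length w to its 1-based start positions in the query."""
    index = {}
    for start in range(len(query) - w + 1):
        index.setdefault(query[start:start + w], []).append(start + 1)
    return index


def rank_words(query, w=3):
    """Order the query words from most to least significant (assumed proxy: self-score)."""
    scored = [(word, sum(BLOSUM62_SELF[c] for c in word))
              for word in word_index(query, w)]
    return sorted(scored, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    query = "LGPSTKQFGKGSSSRIWDN"
    print(word_index(query)["FGK"])
    print(rank_words(query)[:3])
```

Triplets containing tryptophan rank first for this query, while a common triplet such as SSS ranks last.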


15/28

Each database sequence is searched for these unusual triplets first.

An alignment is shown with the best matched sequences in the database

This is a heuristic (tried-and-true) method which usually works well.


16/28


17/28

BLAST (Basic Local Alignment Search Tool).

Characteristics :

Local alignments: BLAST tries to find patches of regional similarity, rather than trying to find the best alignment between your entire query and an entire database sequence.

Ungapped alignments: Alignments generated with BLAST do not contain gaps. BLAST's speed and statistical model depend on this, but in theory it reduces sensitivity. However, BLAST will report multiple local alignments between your query and a database sequence.


18/28

Rapid: BLAST is extremely fast.

Heuristic: BLAST is not guaranteed to find the best alignment between your query and the database; it may miss matches. This is because it uses a strategy which is expected to find most matches, but sacrifices complete sensitivity in order to gain speed.

However, in practice few biologically significant matches that can be found with other sequence search programs are missed by BLAST.

BLAST searches the database in two phases. First it looks for short subsequences which are likely to produce significant matches, and then it tries to extend these subsequences.
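The second phase, extending a short matching subsequence without gaps, can be sketched as below. The function name, the toy scores (match +2, mismatch -1) and the drop-off limit are assumptions for illustration; the seed positions are taken as already found by the first phase, and the seed is assumed to be an exact word match.

```python
def extend_hit(query, subject, q_pos, s_pos, w=3, match=2, mismatch=-1, x_drop=3):
    """Extend an exact word hit in both directions without gaps.

    q_pos and s_pos are the 0-based start positions of the word in each
    sequence. Extension in one direction stops once the running score has
    fallen x_drop below the best score seen in that direction.
    """
    def extend(qi, si, step):
        best = run = 0
        best_len = length = 0
        while 0 <= qi < len(query) and 0 <= si < len(subject):
            run += match if query[qi] == subject[si] else mismatch
            length += 1
            if run > best:
                best, best_len = run, length
            elif best - run >= x_drop:
                break
            qi += step
            si += step
        return best, best_len

    seed_score = w * match  # exact seed: every position is a match
    right, r_len = extend(q_pos + w, s_pos + w, +1)
    left, l_len = extend(q_pos - 1, s_pos - 1, -1)
    q_segment = query[q_pos - l_len:q_pos + w + r_len]
    s_segment = subject[s_pos - l_len:s_pos + w + r_len]
    return seed_score + left + right, q_segment, s_segment


if __name__ == "__main__":
    # The word FGK starts at 0-based position 7 in both example sequences.
    print(extend_hit("LGPSTKQFGKGSSSRIWDN", "LNQIERSFGKGAIMRLGDA", 7, 7))
```

Starting from the FGK word, the extension gains one more matching residue on the right and then stops, giving the ungapped segment FGKG.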


19/28

BLASTP: searches a protein sequence against a protein database.

BLASTN: searches a nucleotide sequence against a nucleotide database.

TBLASTN: searches a protein sequence against a nucleotide database, by translating each database nucleotide sequence in all 6 reading frames.

BLASTX: searches a nucleotide sequence against a protein database, by first translating the query nucleotide sequence in all 6 reading frames.

Especially good for EST databases.
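Translation in all 6 reading frames, as used by TBLASTN and BLASTX, can be sketched as follows. The function names are hypothetical; the codon table is the standard genetic code, written in the usual TCAG order, and codons containing unexpected characters are translated as X by assumption.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for first, second and third base.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons from the start of dna; '*' marks a stop codon."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return translations of frames +1, +2, +3, then -1, -2, -3."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]  # reverse complement strand
    return [translate(strand[offset:])
            for strand in (dna, reverse)
            for offset in range(3)]


if __name__ == "__main__":
    print(six_frames("ATGGCCAAA"))
```

Each of the six protein sequences is then compared like an ordinary protein query or database entry, which is why these searches cost several times more than a plain protein search.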


20/28

Finally some rules of thumb: Homology

Protein sequence comparisons typically double the evolutionary look-back time over DNA sequence comparisons.

The requirement for a common folded structure in homologous proteins usually causes these proteins to be similar over the entire length of the gene product (or domain). Therefore, most sequences that share statistically significant similarity throughout their entire lengths are homologous.

Matches that are more than 50% identical in a 20-40 amino acid region occur frequently by chance.


21/28

Distantly related homologs may lack significant similarity. Two or more homologous sequences may have very few absolutely conserved residues.

If homology has been inferred due to significant similarity scores between two proteins, A and B, that align over their entire lengths, and between protein B and a third protein, C, then proteins A and C must also be homologous, even if they share no significant similarity.

Low complexity regions, transmembrane regions and coiled-coil regions frequently display significant similarity in the absence of homology. Low complexity regions can be filtered out using the default parameters of BLAST. Transmembrane and coiled-coil regions should be identified and masked (by eliminating these regions from the query) by the user.


22/28

Significance

Results of searches using different scoring systems may be compared directly using normalized scores.

If S is the (raw) score for a local alignment, the normalized score S' (in bits) is calculated by the formula

S' = (lambda*S - ln K) / ln 2

where lambda and K are parameters associated with a given scoring system.

A normalized score S', with E value = E, is statistically significant if it exceeds log2(N/E), where N is the size of the search space. As the evolutionary distance between two sequences increases, the length of a local alignment required to achieve a statistically significant score also increases.
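The normalized score and the significance test can be worked through in a few lines. The function names are hypothetical. The E-value relation used here, E = N / 2^S', is the standard companion to the bit-score formula and is the same statement as the threshold S' > log2(N/E). Values for lambda and K depend on the scoring system and must be supplied by the caller.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalized score in bits: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(bits, search_space):
    """Expected number of chance hits with at least this bit score: E = N / 2**S'."""
    return search_space * 2.0 ** (-bits)


def is_significant(bits, search_space, e_threshold):
    """True if the bit score exceeds log2(N / E) for the chosen E threshold."""
    return bits > math.log2(search_space / e_threshold)


if __name__ == "__main__":
    # Illustrative numbers only: lambda = ln 2 and K = 1 make bits equal the raw score.
    s_bits = bit_score(40, math.log(2), 1.0)
    print(s_bits, expect_value(s_bits, 1e9), is_significant(s_bits, 1e9, 0.01))
```

Because the bit score already absorbs lambda and K, two searches run with different matrices can be compared on this scale as long as the search space N is the same.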


23/28

Summary of previous

Global alignment is appropriate for sequences that are known to share similarity over their whole length.

Local alignment is appropriate when the sequences may show isolated regions of similarity, for example multiple domains or repeats.

Local alignment is best applied when scanning a database to find similarities or when there is no a priori knowledge that the protein sequences are similar.


24/28

Database artifacts and Low complexity filters


25/28

Database Artifacts

Vector sequences: A number of authors have identified and catalogued the contamination of sequence databases with vectors.

Among the studies are:

Claverie. Genomics 12:838. 1992.

Lamperti et al. Nucleic Acids Res. 20:2741. 1992. Of particular note in this paper is the finding of short apparent vector sequences in the middle of non-vector sequence. The authors speculate that these may be due to errors in the editing of sequences or to rearranged plasmids.

Lopez, Kristensen, & Prydz. Nature 355:211. 1992.

Kristensen, Lopez, & Prydz. An estimate of the sequencing error frequency in the DNA sequence databases. DNA Seq 2:343. 1989.


26/28

Heterologous sequences

White, O. et al. Nucl. Acids Res. 21:2829. Describes a statistical method to compare sequence sets (but not individual sequences). Shows that several sets of cDNAs show bulk properties different from those of human cDNAs. Sequence comparisons are used to show that this is due to contamination of the anomalous libraries with yeast and bacterial sequences.

Rearranged & deleted sequences

Repetitive element contamination

cDNA cloning methods may sometimes capture retroelements such as Alus. In some cases, chimaeras between cellular transcripts and Alus may form. Derived protein sequences which appear to contain Alu-derived sequences were catalogued by Claverie (Genomics 12:838).

Sequencing errors / Natural polymorphisms


27/28

Sequence Pre-Filters

Reducing matches due to biased amino acid composition

Many amino acid sequences are highly repetitive in nature, especially naive translations of genomic DNA. Matches between such segments are more likely to be due to these local amino acid composition biases than to common descent. Filters have been developed to mask out regions showing highly biased local composition.

SEG (Wootton & Federhen, Computers & Chemistry 17:149. 1993)

XNU (Claverie & States, Computers & Chemistry 17:191. 1993)
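The idea of masking regions with biased composition can be sketched with a sliding window and Shannon entropy. This is a simplified stand-in, not the SEG or XNU algorithm: the function names, the window length of 12 and the threshold of 2.2 bits are assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits of the residue composition of a window."""
    total = len(window)
    return -sum(count / total * math.log2(count / total)
                for count in Counter(window).values())


def mask_low_complexity(seq, window=12, threshold=2.2):
    """Replace with 'X' every residue that lies in a window below the entropy threshold."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)


if __name__ == "__main__":
    print(mask_low_complexity("ACDEFGHIKLMN" + "S" * 12))
```

A run of a single residue has zero entropy and is masked completely, while a window of twelve different residues reaches the maximum of log2(12) bits and is left untouched.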


28/28

The end

Thank you for your attention

Date post:	10-Apr-2018
Upload:	devbaljinder