04/10/23 1
Scoring Matrices, Database Searching and Heuristic Alignment Algorithms
ISSP 2081 / BIOINF 2051
Fall 2002
Lecture #6
2
Handouts on Weight Matrices
Weight matrices for sequence similarity scoring, by David Wheeler
Supplement to the above, by David Wheeler
3
PAM Matrices
First substitution matrices widely used
Based on the point-accepted-mutation (PAM) model of evolution (Dayhoff et al., 1978)
PAMs are relative measures of evolutionary distance
1 PAM = 1 accepted mutation per 100 AAs
Does this mean that after 100 PAMs every AA will be different? Why or why not?
4
PAM Matrices
If changes were purely random:
  Frequency of each possible substitution is proportional to background frequencies
In related proteins:
  Observed substitution frequencies, called the target (replacement) frequencies, are biased toward those that do not seriously disrupt the protein’s function
  These point mutations are “accepted” during evolution
Log-odds approach:
  Scores proportional to the natural log of the ratio of target frequencies to background frequencies
5
The Math
Score matrix entry for time t given by:
$$ s(a,b \mid t) = \log \frac{P(b \mid a,t)}{q_b} $$
P(b|a,t): conditional probability that a is substituted by b in time t
q_b: frequency of amino acid b
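A minimal sketch of the log-odds entry above, using the natural log as on the previous slide. The function name and the probabilities are hypothetical, chosen only to show how the sign of the score follows from the ratio.

```python
import math


def log_odds_score(p_b_given_a, q_b):
    """Return log(P(b|a,t) / q_b), using the natural log."""
    return math.log(p_b_given_a / q_b)


# Hypothetical numbers, for illustration only.
q_b = 0.05                                # background frequency of residue b
favoured = log_odds_score(0.10, q_b)      # b replaces a more often than chance
neutral = log_odds_score(0.05, q_b)       # b replaces a exactly as often as chance
disfavoured = log_odds_score(0.01, q_b)   # b replaces a less often than chance
```

A substitution seen more often than the background predicts scores positive, one seen less often scores negative, and one seen exactly at the background rate scores zero.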
6
PAM Matrices Construction
Pairs of very closely related sequences used to collect mutation frequencies corresponding to 1 PAM
Explicit model
Two families studied: immunoglobulin, cytochrome c
Extrapolation of the data to a distance of 250 PAMs
  PAM250 was the original Dayhoff matrix
Family of matrices: PAM10… PAM200
  Matrix multiplication using PAM-1
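The extrapolation step can be sketched as repeated matrix multiplication. This is a toy under stated assumptions: a 3-letter alphabet and a made-up 1-step matrix in which each residue stays unchanged with probability 0.99. It is not the Dayhoff PAM-1 matrix, and the names `toy_pam1`, `mat_mul` and `mat_pow` are hypothetical.

```python
def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, n):
    """Multiply m by itself n times, starting from the identity matrix."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Toy 1-step matrix: row a holds P(b | a) for one unit of distance.
toy_pam1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]

toy_pam100 = mat_pow(toy_pam1, 100)
toy_pam250 = mat_pow(toy_pam1, 250)
```

In this toy model the diagonal of `toy_pam100` is about 0.48, so roughly half of the positions are still unchanged after 100 steps, because repeated changes can hit the same position or revert it.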
7
PAM Matrices: salient points
Derived from global alignments of closely related sequences.
Matrices for greater evolutionary distances are extrapolated from those for lesser ones.
The number with the matrix (PAM40, PAM100) refers to the evolutionary distance; greater numbers are greater distances.
Does not take into account different evolutionary rates between conserved and non-conserved regions.
8
BLOSUM Matrices
Henikoff, S. & Henikoff, J.G. (1992)
Use blocks of protein sequence fragments from different families (the BLOCKS database)
Amino acid pair frequencies calculated by summing over all possible pairs in block
Different evolutionary distances are incorporated into this scheme with a clustering procedure (identity over particular threshold = same cluster)
9
BLOSUM Matrices
Similar idea to PAM matrices
Probabilities estimated from blocks of sequence fragments
Blocks represent structurally conserved regions
10
BLOSUM Matrices
Target frequencies are identified directly instead of by extrapolation.
Sequences more than x% identical within the block where substitutions are being counted are grouped together and treated as a single sequence
  BLOSUM50: >= 50% identity
  BLOSUM62: >= 62% identity
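The clustering and pair-counting idea can be sketched as follows. This is an illustration, not the Henikoffs' code. It assumes single-linkage clustering at the identity threshold, and it assumes each sequence in a cluster of size n carries weight 1/n so that the cluster counts as one sequence. The block `toy_block` and all function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def percent_identity(s1, s2):
    """Percent of aligned positions that are identical (ungapped, equal length)."""
    same = sum(1 for a, b in zip(s1, s2) if a == b)
    return 100.0 * same / len(s1)


def cluster_sequences(block, threshold):
    """Single-linkage clustering: sequences at >= threshold % identity share a label."""
    labels = list(range(len(block)))
    for i, j in combinations(range(len(block)), 2):
        if percent_identity(block[i], block[j]) >= threshold:
            old, new = labels[j], labels[i]
            labels = [new if lab == old else lab for lab in labels]
    return labels


def weighted_pair_counts(block, threshold):
    """Count residue pairs column by column, between clusters only."""
    labels = cluster_sequences(block, threshold)
    size = defaultdict(int)
    for lab in labels:
        size[lab] += 1
    counts = defaultdict(float)
    for i, j in combinations(range(len(block)), 2):
        if labels[i] == labels[j]:
            continue  # pairs inside one cluster are not counted
        weight = 1.0 / (size[labels[i]] * size[labels[j]])
        for a, b in zip(block[i], block[j]):
            counts[tuple(sorted((a, b)))] += weight
    return dict(counts)


# Toy block of three aligned fragments (made up for illustration).
toy_block = ["AVLK", "AVLR", "GILK"]
counts = weighted_pair_counts(toy_block, 62)
```

Here the first two fragments are 75% identical, so at a threshold of 62 they collapse into one cluster and their K/R difference is counted only against the third fragment.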
11
BLOSUM Matrices: Salient points
Derived from local, ungapped alignments of distantly related sequences
All matrices are directly calculated; no extrapolations are used and there is no explicit model
The number after the matrix (BLOSUM62) refers to the minimum percent identity of the blocks used to construct the matrix; greater numbers are lesser distances.
The BLOSUM series of matrices generally perform better than PAM matrices for local similarity searches (Proteins 17:49).
12
BLOSUM Example
PSC Tutorial: BLOSUM example
http://www.psc.edu/biomed/training/tutorials/sequence/db/index.html
13
Heuristic Alignment Algorithms
Database searching vs. sequence alignment
What is a heuristic? Why use heuristics?
Approximations to Smith-Waterman
  FASTA [Pearson & Lipman, 1988]
  BLAST [Altschul et al., 1990]
What are the tradeoffs in terms of search?
  Sensitivity vs. selectivity
14
BLAST Overview
BLAST heuristically finds maximal segment pairs (MSPs): the highest-scoring pair of identical-length segments from 2 sequences
SP = segment pair: an ungapped, local alignment
MSP = a segment pair (SP) with maximum score over all segment pairs in S1 and S2
15
BLAST Overview
Given: query sequence q, word length w, word score threshold T, segment score threshold S
  Compile a list of “words” that score at least T when compared to words from q
  Scan database for matches to words in list
  Extend all matches to seek high-scoring segment pairs
Return: segment pairs scoring at least S
16
Determining Query Words
Given:
  Query sequence: QLNFSAGW
  Word length w = 2
  Word score threshold T = 8
Step 1: Determine all words of length w in query sequence
  QL LN NF FS SA AG GW
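Step 1 is a sliding window of length w over the query. A minimal sketch (the function name `query_words` is an assumption):

```python
def query_words(query, w):
    """Return every overlapping word of length w, in order of position."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


words = query_words("QLNFSAGW", 2)
# words is ["QL", "LN", "NF", "FS", "SA", "AG", "GW"]
```

A query of length n yields n - w + 1 words, here 8 - 2 + 1 = 7.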
17
Determining Query Words
Step 2: Determine all words that score at least T when compared to a word in the query sequence
  QL: QL=11, QM=9, HL=8, ZL=9
  LN: LN=9, LB=8
  ….
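Step 2 can be sketched by scoring every candidate word against a query word and keeping those at or above T. The slide does not name its scoring matrix, so the pair scores below are toy values chosen only so that the totals for QL match the slide. The restricted alphabet, the default score of -1, and all names are assumptions.

```python
from itertools import product

# Toy pair scores (assumed); keys are alphabetically sorted residue pairs.
TOY_SCORES = {
    ("Q", "Q"): 6,
    ("L", "L"): 5,
    ("L", "M"): 3,
    ("H", "Q"): 3,
    ("Q", "Z"): 4,
}
DEFAULT_SCORE = -1      # assumed score for every pair not listed above
TOY_ALPHABET = "QHZLM"  # restricted alphabet to keep the example small


def pair_score(a, b):
    """Look up a symmetric pair score, falling back to the default."""
    return TOY_SCORES.get(tuple(sorted((a, b))), DEFAULT_SCORE)


def word_score(u, v):
    """Sum of position-by-position pair scores for two equal-length words."""
    return sum(pair_score(a, b) for a, b in zip(u, v))


def neighborhood(query_word, threshold, alphabet=TOY_ALPHABET):
    """Return {word: score} for all words scoring >= threshold against query_word."""
    hits = {}
    for letters in product(alphabet, repeat=len(query_word)):
        candidate = "".join(letters)
        score = word_score(query_word, candidate)
        if score >= threshold:
            hits[candidate] = score
    return hits


neighbors = neighborhood("QL", 8)
# neighbors is {"QL": 11, "QM": 9, "HL": 8, "ZL": 9}
```

With a full 20-letter alphabet the enumeration covers 20^w candidates per query word, which is one reason large w makes word generation expensive.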
18
Scanning the database
Search database for all occurrences of query words
Approach:
  Build a DFA that recognizes all query words
  Run DB sequences through DFA
  Remember hits
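A sketch of the scan step. Note the swap: instead of a DFA, this uses a hash-set lookup of each length-w window, which finds the same hits for fixed-length words but is not the automaton the slide describes. The toy database and the function name are assumptions.

```python
def scan_database(db_sequences, word_list, w):
    """Return (sequence index, position, word) for each occurrence of a listed word."""
    wanted = set(word_list)
    hits = []
    for seq_index, seq in enumerate(db_sequences):
        for pos in range(len(seq) - w + 1):
            word = seq[pos:pos + w]
            if word in wanted:
                hits.append((seq_index, pos, word))
    return hits


# Toy database (made up for illustration).
toy_db = ["MKQLAA", "HLHL", "GGGG"]
hits = scan_database(toy_db, ["QL", "QM", "HL", "ZL"], 2)
# hits is [(0, 2, "QL"), (1, 0, "HL"), (1, 2, "HL")]
```

Each remembered hit records where the word occurs in the database sequence, which is the starting point for the extension step.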
19
Finding MSPs
Extend hits in both directions (without allowing gaps) as long as score of segment pair increases
Return segment pairs scoring at least S
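The extension rule as stated on this slide can be sketched as below: keep adding the next residue pair on each side while it raises the score, and stop at the first pair that does not. The match/mismatch scores, the example sequences, and the function names are assumptions. The published BLAST algorithm is more tolerant and allows a limited score drop before stopping, which this sketch does not implement.

```python
def score_pair(a, b, match=2, mismatch=-1):
    """Toy scoring (assumed): reward identical residues, penalise the rest."""
    return match if a == b else mismatch


def extend_hit(query, subject, q_pos, s_pos, w, score=score_pair):
    """Extend a word hit without gaps while each added residue pair raises the score."""
    q_start, s_start = q_pos, s_pos
    q_end, s_end = q_pos + w, s_pos + w  # end indices are exclusive
    total = sum(score(query[q_pos + k], subject[s_pos + k]) for k in range(w))

    # Extend to the right.
    while q_end < len(query) and s_end < len(subject):
        gain = score(query[q_end], subject[s_end])
        if gain <= 0:
            break
        total += gain
        q_end += 1
        s_end += 1

    # Extend to the left.
    while q_start > 0 and s_start > 0:
        gain = score(query[q_start - 1], subject[s_start - 1])
        if gain <= 0:
            break
        total += gain
        q_start -= 1
        s_start -= 1

    return total, query[q_start:q_end], subject[s_start:s_end]


# Hit: the word "LN" at query position 1 and subject position 2.
result = extend_hit("QLNFSAGW", "AALNFSTT", 1, 2, 2)
# result is (8, "LNFS", "LNFS")

S = 6  # segment score threshold (assumed value)
reported = [result] if result[0] >= S else []
```

Because the extension stops at the first non-improving pair, it can miss a longer segment that dips briefly and then recovers.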
20
Choosing Values for w and T
Trade-off: sensitivity vs. running time
Choosing a value for w
  Small w: many matches to expand
  Big w: many words to be generated
  w = 4 is a good compromise
Choosing a value for T
  Small T: greater sensitivity, more matches to expand
21
BLAST Notes
May fail to find optimal MSPs
  May miss seeds if T is too stringent
  Extension is greedy
Empirically, 10 to 50 times faster than Smith-Waterman
Large impact: NCBI’s BLAST server handles more than 50,000 queries a day
22
Statistics of alignment scores (or how to choose a value for S)
[Karlin & Altschul, 1990]
A model of random sequences
  Ungapped alignments
  All residues drawn independently
  Expected score for a pair of randomly chosen residues required to be negative. Why?
See text for math
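The expected score for a random residue pair is the sum over all pairs of p_a * p_b * s(a,b). A minimal sketch, assuming uniform residue frequencies and a toy +1/-1 scoring scheme (both are assumptions, as are the function names):

```python
def expected_score(freqs, score):
    """Sum of p_a * p_b * s(a, b) over all ordered residue pairs."""
    return sum(pa * pb * score(a, b)
               for a, pa in freqs.items()
               for b, pb in freqs.items())


def match_mismatch(a, b):
    """Toy scoring (assumed): +1 for identical residues, -1 otherwise."""
    return 1 if a == b else -1


def always_positive(a, b):
    """A scheme that violates the requirement: every pair scores +1."""
    return 1


# Uniform frequencies over the 20 amino acids (assumed for simplicity).
uniform = {aa: 1.0 / 20 for aa in "ACDEFGHIKLMNPQRSTVWY"}

e_toy = expected_score(uniform, match_mismatch)   # 0.05 * 1 + 0.95 * (-1)
e_bad = expected_score(uniform, always_positive)
```

With the toy scheme the expectation is -0.9. If it were positive, as with `always_positive`, unrelated segments would tend to score higher simply by being longer, so a high score would no longer indicate similarity.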
23
FASTA
Heuristic, exclusion method
http://gcg.nhri.org.tw/fasta.html
See PSC tutorial for examples:
www.cbmi.upmc.edu/~vanathi/syllabus.html
24
Readings for next class
FASTA
Summary for FASTA paper due