COMPUTATIONAL BIOLOGY
B.Tech – BioTech (VIth Semester)
Module 2
Scoring Matrices
INTRODUCTION
• It is assumed that the sequences being sought share a common evolutionary ancestor with the query sequence.
• The best guess at the actual path of evolution is the path that requires the fewest evolutionary events.
• Not all substitutions are equally likely, so they should be weighted to account for this.
• Insertions and deletions are less likely than substitutions and should be weighted to account for this.
INTRODUCTION
• A substitution is more likely to occur between amino acids with similar biochemical properties.
• For example, the hydrophobic amino acids isoleucine (I) and valine (V) receive a positive score in such matrices, reflecting the likelihood that one will substitute for the other.
• In contrast, the hydrophobic amino acid isoleucine receives a negative score with the hydrophilic amino acid cysteine (C), because this substitution is far less likely to occur in a protein.
• Thus matrices are used to estimate how well two residues of given types would match if they were aligned in a sequence alignment.
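To make the idea of residue-pair scores concrete, the following is a minimal Python sketch that sums substitution scores over an ungapped alignment. The six scores are the isoleucine, valine and cysteine entries of the standard BLOSUM62 matrix. The helper names `pair_score` and `alignment_score` and the two three-residue peptides are hypothetical and chosen only for illustration.

```python
# Excerpt of BLOSUM62 for isoleucine (I), valine (V) and cysteine (C).
# Only one orientation of each pair is stored because the matrix is symmetric.
BLOSUM62_EXCERPT = {
    ("I", "I"): 4, ("V", "V"): 4, ("C", "C"): 9,
    ("I", "V"): 3, ("C", "I"): -1, ("C", "V"): -1,
}

def pair_score(x, y):
    """Score of aligning residue x with residue y (order does not matter)."""
    return BLOSUM62_EXCERPT[tuple(sorted((x, y)))]

def alignment_score(seq_a, seq_b):
    """Sum of pair scores over an ungapped alignment of equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(x, y) for x, y in zip(seq_a, seq_b))

print(pair_score("I", "V"))           # 3: conservative substitution
print(pair_score("I", "C"))           # -1: less likely substitution
print(alignment_score("IVC", "VIC"))  # 3 + 3 + 9 = 15
```

Gaps are not handled here; a full alignment score would also subtract penalties for insertions and deletions.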
IMPORTANCE OF SCORING MATRICES
• Scoring matrices appear in all analyses involving sequence comparison.
• The choice of matrix can strongly influence the outcome of the analysis.
• Scoring matrices implicitly represent a particular theory of evolution.
• Understanding the theory underlying a given scoring matrix helps in choosing the proper matrix.
TYPES OF SCORING MATRICES
• An amino-acid scoring matrix is a 20x20 table indexed by amino acids, so that position (X, Y) in the table gives the score of aligning amino acid X with amino acid Y.
• Identity matrix – exact matches receive one score and mismatches a different score (for example, 1 on the diagonal and 0 everywhere else).
• Mutation data matrix – a scoring matrix compiled from observed protein mutation rates: some mutations are observed more often than others (PAM, BLOSUM).
• Physical properties matrix – amino acids with similar biophysical properties receive a high score.
• Genetic code matrix – amino acids are scored based on similarities in their coding triplets (codons).
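The identity matrix is the simplest of these to write down. A minimal Python sketch follows, assuming the matrix is stored as a dictionary keyed by residue pairs; the function name `identity_matrix` and its `match` and `mismatch` parameters are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 standard amino acids

def identity_matrix(alphabet, match=1, mismatch=0):
    """Return {(x, y): score} with `match` on the diagonal and `mismatch` elsewhere."""
    return {(x, y): (match if x == y else mismatch)
            for x in alphabet for y in alphabet}

identity = identity_matrix(AMINO_ACIDS)
print(len(identity))         # 400 entries, i.e. a 20x20 table
print(identity[("I", "I")])  # 1
print(identity[("I", "V")])  # 0
```

Note that this matrix scores isoleucine against valine exactly like isoleucine against cysteine, which is the limitation the mutation data matrices address.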
Matrices used
PSSM = Position Specific Scoring Matrices
PAM matrices
BLOSUM (BLOcks SUbstitution Matrices)
• Publication
– Henikoff and Henikoff, 1992
• Motivation
– PAM matrices do not capture the difference between short-term and long-term mutations
• Method
– For several degrees of sequence divergence, derive mutations from a set of related proteins
– BLOSUM-k is based on related proteins with k% identity or less
BLOSUM METHOD
• Use Blocks – collections of multiple alignments of similar segments without gaps
• Cluster together sequences whenever more than k% identical residues are shared
• Count number of substitutions across different clusters (in the same family)
• Estimate frequencies using the counts
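The counting and estimation steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the published procedure in full: the four-sequence block is hypothetical, the clustering step is skipped (each sequence is treated as its own cluster), and the counts are converted to scores with the usual log-odds form in half-bit units. Only residue pairs actually observed in the block receive a score.

```python
from collections import Counter
from itertools import combinations
from math import log2

# Hypothetical toy block: four gap-free aligned segments of equal length.
block = ["AVLE", "AILE", "SVLD", "AVME"]

# 1. Count residue pairs column by column over all pairs of sequences.
pair_counts = Counter()
for column in zip(*block):
    for x, y in combinations(column, 2):
        pair_counts[tuple(sorted((x, y)))] += 1

total_pairs = sum(pair_counts.values())

# 2. Observed pair frequencies q_xy.
q = {pair: n / total_pairs for pair, n in pair_counts.items()}

# 3. Background frequency p_x of each residue implied by the pairs.
p = Counter()
for (x, y), freq in q.items():
    if x == y:
        p[x] += freq
    else:
        p[x] += freq / 2
        p[y] += freq / 2

# 4. Log-odds score in half-bit units: 2 * log2(observed / expected).
def log_odds(x, y):
    expected = p[x] * p[y] if x == y else 2 * p[x] * p[y]
    return round(2 * log2(q[tuple(sorted((x, y)))] / expected))

scores = {pair: log_odds(*pair) for pair in q}
for pair in sorted(scores):
    print(pair, scores[pair])
```

Pairs that occur more often than expected by chance get positive scores, and pairs that occur less often get negative ones.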
BLOCKS
Each BLOCK represents a conserved region in a group of proteins
1 5 n
sequence 1 ABPEDG… …FGW
sequence 2 ABSEDQ… …QGW
sequence 3 SBPEDQ… …FGD
: : :
: : :
sequence m ABAEDS… …QGD
BLOSUM = BLOCK SUBSTITUTION MATRIX
The relationship between BLOSUM and PAM substitution matrices
• BLOSUM matrices with high numbers and PAM matrices with low numbers are both designed for comparisons of closely related sequences.
• BLOSUM matrices with low numbers and PAM matrices with high numbers are designed for comparisons of distantly related proteins.
Position-Specific Scoring Matrix
• A weight matrix or position-specific scoring matrix (PSSM) is a table of numbers containing scores for each residue at each position of a fixed-length (gap-free) motif.
• There are two types of numerical representations:
– Frequency matrix: reflects position-dependent frequencies of residues
– Scoring matrix: contains additive weights for computing a match score
• Weight matrices or PSSMs are quantitative, fixed-length motif descriptors. Unlike regular expressions, they can distinguish between mild and severe mismatches.
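One common way to turn a frequency matrix into additive weights is a log-odds score against a background frequency. It is given here as an assumption, not necessarily the construction intended in this module. The symbols are introduced for illustration: f_{b,i} is the relative frequency of residue b at position i, p_b is the background frequency of b (for example 0.25 per nucleotide under a uniform background), and L is the motif length.

```latex
s_{b,i} = \log_2 \frac{f_{b,i}}{p_b},
\qquad
S(x_1 \dots x_L) = \sum_{i=1}^{L} s_{x_i,\,i}
```

Because the weights add up along the motif, a single unusual residue lowers the total score only slightly, while several unlikely residues lower it a lot. A frequency of zero has no finite logarithm, so in practice small pseudocounts are usually added before taking logs.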
Position-Specific Scoring Matrix
• A PSSM is a motif descriptor
• The descriptor includes a weight (score, probability) for each symbol occurring at each position along the motif
• Examples of motifs:
– Protein active sites
– Structural elements
– Zinc fingers
– Intron/exon boundaries
– Transcription-factor binding sites, etc.
Position-Specific Scoring Matrix
Construction of a PSSM is a multi-stage process:
1. Architecture of the matrix
2. Create the multiple alignment from which the matrix is derived
3. Calculate frequencies for each position
4. Apply BLAST to the PSSM
Position-Specific Scoring Matrix
• 10 vertebrate donor site sequences aligned at the exon/intron boundary
seq 1 GAGGTAAAC
seq 2 TCCGTAAGT
seq 3 CAGGTTGGA
seq 4 ACAGTCAGT
seq 5 TAGGTCATT
seq 6 TAGGTACTG
seq 7 ATGGTAACT
seq 8 CAGGTATAC
seq 9 TGTGTGAGT
seq 10 AAGGTAAGT
Position-Specific Scoring Matrix
• Calculate the absolute frequency (count) of each nucleotide at each position of the 10 aligned donor site sequences
Position   1    2    3    4    5    6    7    8    9
A          3    6    1    0    0    6    7    2    1
C          2    2    1    0    0    2    1    1    2
G          1    1    7    10   0    1    1    5    1
T          4    1    1    0    10   1    1    2    6
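The counting step can be reproduced with a few lines of Python. This is a minimal sketch using the 10 donor site sequences from the slide; the function name `count_matrix` and the dictionary-of-lists layout are hypothetical choices.

```python
# The 10 aligned vertebrate donor site sequences from the slide.
sequences = [
    "GAGGTAAAC", "TCCGTAAGT", "CAGGTTGGA", "ACAGTCAGT", "TAGGTCATT",
    "TAGGTACTG", "ATGGTAACT", "CAGGTATAC", "TGTGTGAGT", "AAGGTAAGT",
]

def count_matrix(seqs, alphabet="ACGT"):
    """Return {nucleotide: [count at position 1, count at position 2, ...]}."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    counts = {base: [0] * length for base in alphabet}
    for seq in seqs:
        for position, base in enumerate(seq):
            counts[base][position] += 1
    return counts

counts = count_matrix(sequences)
for base in "ACGT":
    print(base, counts[base])
# A [3, 6, 1, 0, 0, 6, 7, 2, 1]
# C [2, 2, 1, 0, 0, 2, 1, 1, 2]
# G [1, 1, 7, 10, 0, 1, 1, 5, 1]
# T [4, 1, 1, 0, 10, 1, 1, 2, 6]
```

Positions 4 and 5 are the invariant GT at the start of the intron, which is why G and T reach a count of 10 there.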
Position-Specific Scoring Matrix
• Calculate the relative frequency of each nucleotide at each position by dividing each count by the number of sequences (10)
Position   1    2    3    4    5    6    7    8    9
A          0.3  0.6  0.1  0    0    0.6  0.7  0.2  0.1
C          0.2  0.2  0.1  0    0    0.2  0.1  0.1  0.2
G          0.1  0.1  0.7  1    0    0.1  0.1  0.5  0.1
T          0.4  0.1  0.1  0    1    0.1  0.1  0.2  0.6
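The normalisation from counts to relative frequencies can be sketched in Python as follows. The count table is the one from the slides; the function name `frequency_matrix` is hypothetical.

```python
# Absolute frequencies (counts) per position, as tabulated in the slides.
counts = {
    "A": [3, 6, 1, 0, 0, 6, 7, 2, 1],
    "C": [2, 2, 1, 0, 0, 2, 1, 1, 2],
    "G": [1, 1, 7, 10, 0, 1, 1, 5, 1],
    "T": [4, 1, 1, 0, 10, 1, 1, 2, 6],
}

def frequency_matrix(count_table):
    """Divide each count by its column total (the number of aligned sequences)."""
    bases = list(count_table)
    length = len(count_table[bases[0]])
    totals = [sum(count_table[b][i] for b in bases) for i in range(length)]
    return {b: [count_table[b][i] / totals[i] for i in range(length)]
            for b in bases}

freqs = frequency_matrix(counts)
print(freqs["G"])  # [0.1, 0.1, 0.7, 1.0, 0.0, 0.1, 0.1, 0.5, 0.1]
```

Each column of the resulting table sums to 1, so every position can be read as a probability distribution over the four nucleotides.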
Position-Specific Scoring Matrix
• What is the probability of finding CAGGTTGGA?
– Treating positions as independent, it is the product of the relative frequency of each nucleotide at its position:
– C is 0.2 at position 1, A is 0.6 at position 2, etc.
– 0.2 * 0.6 * 0.7 * 1 * 1 * 0.1 * 0.1 * 0.5 * 0.1 = 0.000042
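The same product can be computed in Python. This minimal sketch uses the relative frequency table from the slides; the function name `sequence_probability` is hypothetical, and `math.prod` requires Python 3.8 or later.

```python
from math import prod

# Relative frequencies per position, as tabulated in the slides.
freqs = {
    "A": [0.3, 0.6, 0.1, 0.0, 0.0, 0.6, 0.7, 0.2, 0.1],
    "C": [0.2, 0.2, 0.1, 0.0, 0.0, 0.2, 0.1, 0.1, 0.2],
    "G": [0.1, 0.1, 0.7, 1.0, 0.0, 0.1, 0.1, 0.5, 0.1],
    "T": [0.4, 0.1, 0.1, 0.0, 1.0, 0.1, 0.1, 0.2, 0.6],
}

def sequence_probability(seq, freq_table):
    """Product of the per-position frequencies of the residues in seq."""
    if len(seq) != len(freq_table["A"]):
        raise ValueError("sequence length must match the motif length")
    return prod(freq_table[base][i] for i, base in enumerate(seq))

p = sequence_probability("CAGGTTGGA", freqs)
print(f"{p:.2e}")  # 4.20e-05

# A single residue never seen at its position drives the product to zero.
print(sequence_probability("CAGCTTGGA", freqs))  # 0.0
```

The second call shows a practical weakness of raw frequencies: a nucleotide that was never observed at a position, such as C at position 4, makes the whole sequence impossible under the model.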