Sequence motifs, correlations and structural mapping of
evolutionary data
Eran Eyal, March 2011
Talk overview
• Sequence profiles – position specific scoring matrix
• PSI-BLAST. Automated way to create and use sequence profiles in similarity searches
• Sequence patterns and sequence logos
• Bioinformatic tools which employ sequence profiles: PFAM, BLOCKS, PROSITE, PRINTS, InterPro
• Correlated mutations and structural insight
• Mapping sequence data on structures: conservation, correlations
PSSM – position specific scoring matrix
• A position-specific scoring matrix (PSSM) is a commonly used
representation of motifs (patterns) in biological sequences
• PSSM enables us to represent multiple sequence alignments as
mathematical entities which we can work with.
• PSSMs enable multiple alignments to be scored against sequences or against other PSSMs.
PSSM – position specific scoring matrix
Assuming a string S of length n:
$$S = s_1 s_2 s_3 \ldots s_n$$
If we want to score this string against our PSSM of length n (with n lines):
$$\mathrm{alignment\_score} = \sum_{j=1}^{n} m_{s_j,\,j}$$
where m is the PSSM matrix and s_j are the string elements. PSSMs can also be incorporated into both dynamic programming algorithms and heuristic algorithms (like PSI-BLAST).
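The scoring rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the alignment is ungapped, the background frequency is uniform (1/20), a pseudocount of 1 is added to every count, and the names `build_pssm` and `alignment_score` are made up for this example.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} row per alignment column.

    Assumes an ungapped alignment and a uniform background (1/20).
    """
    total = len(alignment) + pseudocount * len(ALPHABET)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        # log2 of (smoothed column frequency / uniform background frequency)
        pssm.append({aa: math.log2((counts[aa] + pseudocount) / total * len(ALPHABET))
                     for aa in ALPHABET})
    return pssm

def alignment_score(pssm, s):
    """Sum m[s_j, j] over all positions j of the string s."""
    if len(s) != len(pssm):
        raise ValueError("string and PSSM must have the same length")
    return sum(pssm[j][s[j]] for j in range(len(s)))
```

For example, a PSSM built from `["PVAILL", "PVAILL", "PIAILL", "PVGILL"]` scores `"PVAILL"` higher than an unrelated string such as `"GGGGGG"`.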
PSI-BLAST
• For a query sequence use Blast to find matching
sequences.
• Construct a multiple sequence alignment from the hits to
find the common regions (consensus).
• Use the “consensus” to search the database again, and
get a new set of matching sequences.
• Repeat the process!
Sequence space
Position-Specific-Iterated-BLAST
• Intuition
– Substitution matrices should be specific to sites and not global.
– Example: penalize alanine→glycine more in a helix.
• Idea
– Use BLAST with high stringency to get a set of closely related sequences.
– Align those sequences to create a new substitution matrix for each position.
– Then use that matrix to find additional sequences.
• Cycling/iterative method
– Gives increased sensitivity for detecting distantly related proteins
– Can give insight into functional relationships
– Very refined statistical methods
• Fast and simple
Position-Specific-Iterated-BLAST: PSI-BLAST principle
• First, a standard blastp is performed
• The highest scoring hits are used to generate a multiple alignment
• A PSSM is generated from the multiple alignment.
• Another similarity search is performed, this time using the new PSSM
• Repeat previous steps until convergence (no new sequences appear after iteration)
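The loop above can be imitated on toy data. The sketch below is not PSI-BLAST: it assumes equal-length, ungapped sequences, uses a PSSM built from the query alone as a stand-in for the first blastp round, and replaces e-values with a plain score threshold. All names (`build_pssm`, `score`, `iterative_search`) are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(sequences, pseudocount=1.0):
    """Log-odds PSSM from equal-length, ungapped sequences (uniform background)."""
    total = len(sequences) + pseudocount * len(ALPHABET)
    pssm = []
    for column in zip(*sequences):
        counts = Counter(column)
        pssm.append({aa: math.log2((counts[aa] + pseudocount) / total * len(ALPHABET))
                     for aa in ALPHABET})
    return pssm

def score(pssm, sequence):
    """Sum of the position-specific scores of a sequence."""
    return sum(row[aa] for row, aa in zip(pssm, sequence))

def iterative_search(query, database, threshold, max_iterations=10):
    """Return the hit list of every iteration; stops when no new sequences appear."""
    included = [query]
    history = []
    for _ in range(max_iterations):
        pssm = build_pssm(included)          # profile from the current hit set
        hits = [query] + [s for s in database
                          if s != query and score(pssm, s) >= threshold]
        history.append(hits)
        if set(hits) == set(included):       # convergence
            break
        included = hits
    return history
```

With the query `"ACDEFGHI"`, the database `["ACDEFGKL", "ACDEWGKL", "ACNQWGKL", "WWWWWWWW"]` and a threshold of 4.0, the more distant `"ACNQWGKL"` is missed in the first round and picked up in the second, once the profile has absorbed the K, L and W variants of the closer hits.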
Example: Aminoacyl tRNA Synthetases
• Each is very different
– Aminoacyl tRNA synthetases differ widely in size, multimeric state, etc.
– But all bind to their own tRNAs and amino acids with high specificity.
• TrpRS and TyrRS share only 13% sequence identity
– Yet the structures of TrpRS and TyrRS are similar
– Structure–function relationship (see the ellipsoid slide from the previous lecture)
• Same SCOP family based on catalytic domain; overall structure similarity noted
• Given structural similarities, we would expect to find sequence similarity…
• However, blastp of E. coli TyrRS against bacterial sequences in SwissProt does NOT show similarity with TrpRS at an e-value cutoff of 10
[Figure: blastp results. No TrpRS!]
[Figure: PSI-BLAST results after a few iterations. TrpRS similarity to TyrRS!]
PSI-BLAST
– Be sure to inspect and think about the results included in the PSSM build
– include/exclude sequences on basis of biological knowledge: you are in the driving seat!
– PSI-BLAST performance varies according to choice of matrix, filter, statistics and nature of data just like any other alignment tool.
Using PSI-BLAST
• PSI-BLAST available from BLAST web sites
• Query form just like for blastp
– BUT: one extra formatting option must be used
– A special e-value cutoff used to determine which alignments will be used for PSSM build.
– PSI-BLAST also available from the stand alone versions of BLAST.
Why (not) PSI-BLAST
• If the sequences used to construct the Position Specific Scoring Matrices (PSSMs) are true homologs, the sensitivity at a given specificity improves significantly.
• However, if non-homologous sequences are included in the PSSMs, they are “corrupted.” They then pull in further non-homologous sequences, and the errors are amplified in the next rounds.
• If all hits in the first rounds are highly similar, then the predictive power of the new PSSM will not be significantly better than that of the original substitution matrix.
Query
Does the query really have a relationship with the results?
PSI-BLAST caveat
• Increased ability to find distant homologues
• Cost of additional required care to prevent non-homologous sequences from being included in the PSSM calculation
– When in doubt, leave it out!
– Examine sequences with moderate similarity carefully.
• Be particularly cautious about matches to sequences with highly biased amino acid content
– Low complexity regions, transmembrane regions and coiled-coil regions often display significant similarity without homology
– Screen them out of your query sequences!
PSI-BLAST on the command line
• As with simple BLAST searches, using PSI-BLAST on the command line gives the user more power
• Opens up additional options, e.g.
– PSI-BLASTing over nucleotide databases
– automating number of iterations
– trying out lots of different settings in parallel
– inputting multiple sequences
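As an illustration of scripting such runs, the sketch below assembles a command line for the `psiblast` program of the NCBI BLAST+ package. The helper name `build_psiblast_command`, the file names and the default values are assumptions made for this example; the inclusion e-value corresponds to the cutoff that decides which alignments enter the PSSM build.

```python
def build_psiblast_command(query_fasta, database, iterations=3,
                           inclusion_evalue=0.002, pssm_out=None):
    """Assemble an argument list for the BLAST+ `psiblast` program."""
    cmd = [
        "psiblast",
        "-query", query_fasta,                        # protein query in FASTA format
        "-db", database,                              # name of a formatted protein database
        "-num_iterations", str(iterations),           # number of PSI-BLAST rounds
        "-inclusion_ethresh", str(inclusion_evalue),  # e-value cutoff for the PSSM build
    ]
    if pssm_out is not None:
        cmd += ["-out_ascii_pssm", pssm_out]          # save the final PSSM as text
    return cmd
```

The returned list can be passed to `subprocess.run`, and calling the helper in a loop over several settings is one simple way to try many parameter combinations.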
PFAM – Database of Protein families represented by HMM
The Pfam database is a large collection of protein domain families. Each family is represented by multiple sequence alignments and hidden Markov models (HMMs).
There are two levels of quality to Pfam families: Pfam-A and Pfam-B. For each Pfam-A family Pfam builds a single curated profile hidden Markov model (HMM) from a seed alignment (a small set of representative members of the family). Pfam-B families have no associated annotation or literature reference and are of much lower quality than Pfam-A families.
Release 24.0 has 11912 families
For each Pfam accession there is a family page, which can be accessed in several ways.
• The HMMs are generated using the HMMER3 program, a new and efficient HMM builder.
• There is a new option to search single DNA sequences against the library of Pfam HMMs
• HMM models can be downloaded, as well as the multiple alignments of the seed and full alignments used to create the models.
Prosite: http://www.expasy.org/prosite
• PROSITE is a method for determining the function of uncharacterized proteins translated from genomic or cDNA sequences. It consists of a database of biologically significant sites and patterns, with which one can rapidly and reliably identify the known protein family (if any) to which a new sequence belongs.
• In some cases the sequence of an unknown protein is too distantly related to any protein of known structure to detect its resemblance by overall sequence alignment, but it can be identified by the occurrence in its sequence of a particular cluster of residues which is variously known as a pattern, motif, signature, or fingerprint.
What is a sequence pattern
A pattern in our context is a protein WORD conserved in many sequences:
PVAILL
A pattern lets you identify a protein family.
Prosite patterns can describe complex signatures:
[RK]-x-[ST]
This reads as follows: “an Arginine or a Lysine, followed by any one residue, followed by a Serine or a Threonine”
C-[DES]-x-C-x(3)-I-x(3)-R-x(4)-P-x(4)-C-x(2)-C
is a signature for Zn finger proteins which bind DNA.
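Patterns written in this syntax map almost directly onto regular expressions. The sketch below handles the elements shown here plus forbidden residues `{...}`, ranges `x(n,m)` and the `<`/`>` terminus anchors; it is an illustration rather than a full PROSITE parser, and the function names are made up for the example.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE pattern such as 'C-x(2,4)-[DE]' into a Python regex."""
    regex = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):          # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):            # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split the element into its residue part and an optional (n) or (n,m) repeat.
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element).groups()
        if core == "x":                      # any residue
            core = "."
        elif core.startswith("{"):           # forbidden residues
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if low:
            repeat = "{%s,%s}" % (low, high) if high else "{%s}" % low
        regex.append(prefix + core + repeat + suffix)
    return "".join(regex)

def scan(pattern, sequence):
    """Return (1-based start, matched segment) for each non-overlapping match."""
    compiled = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in compiled.finditer(sequence)]
```

`prosite_to_regex("[RK]-x-[ST]")` gives `[RK].[ST]`, and `scan("[RK]-x-[ST]", "AAKGSAA")` reports the segment `KGS` starting at position 3.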
MALRAGLVLG FHTLMTLLSP QEAGATKADH MGSYGPAFYQ SYGASGQFTH EFDEEQLFSV DLKKSEAVWR LPEFGDFARF DPQGGLAGIA AIKAHLDILV ERSNRSRAIN VPPRVTVLPK SRVELGQPNI LICIVDNIFP PVINITWLRN GQTVTEGVAQ TSFYSQPDHL FRKFHYLPFV
Using PrositeScan: http://expasy.org/tools/scanprosite/
Using PrositeScan: structure
Prints: http://www.bioinf.manchester.ac.uk/dbbrowser/PRINTS/index.php
• PRINTS is a compendium of protein fingerprints, which are conserved motifs used to characterize a protein family.
• Release 41.1 of PRINTS contains 2050 entries, encoding 12,121 individual motifs.
• Two types of fingerprint are represented in the database: simple and composite. Simple fingerprints are essentially single motifs, while composite fingerprints encode multiple motifs.
• Most entries are of the latter type because discrimination power is greater for multi-component searches, and results are easier to interpret.
Direct PRINTS access: by accession number, by PRINTS code, by database code, by text, by sequence, by title, by number of motifs, by author, by query language
Sequence logos
• A sequence logo is a graphical representation of aligned sequences where at each position the size of each residue is proportional to its frequency in that position and the total height of all the residues in the position is proportional to the conservation (information content) of the position
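The letter heights of a logo column can be computed as below. This is a minimal sketch, assuming a 20-letter alphabet, a column without gaps, and no small-sample correction; `logo_heights` is a name chosen for the example.

```python
import math
from collections import Counter

def logo_heights(column, alphabet_size=20):
    """Return {residue: letter height in bits} for one alignment column."""
    n = len(column)
    freqs = {aa: count / n for aa, count in Counter(column).items()}
    entropy = -sum(p * math.log2(p) for p in freqs.values())
    information = math.log2(alphabet_size) - entropy   # total stack height
    # Each letter takes a share of the stack proportional to its frequency.
    return {aa: p * information for aa, p in freqs.items()}
```

A fully conserved column reaches the maximum stack height of log2(20) ≈ 4.32 bits, while a column split evenly between two residues gives two letters that each take half of a stack of log2(20) − 1 bits.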
Blocks
• Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins.
• The blocks for the Blocks Database are made automatically by looking for the most highly conserved regions in groups of proteins.
• The Blocks database is no longer updated. The last version (14.3) is from 2007.
InterPro: http://www.ebi.ac.uk/interpro/
• InterPro is an integrated database of predictive protein "signatures" used for the classification and automatic annotation of proteins and genomes.
• It facilitates prediction for the occurrence of functional domains, repeats and important sites.
• InterPro combines a number of databases (referred to as member databases) that use different methodologies to derive protein signatures. By uniting the member databases, InterPro capitalises on their individual strengths, producing a powerful integrated database and diagnostic tool (InterProScan).
The member databases use a number of approaches:
- ProDom: provider of sequence clusters built from UniProtKB using PSI-BLAST.
- PROSITE patterns: provider of simple regular expressions.
- PROSITE and HAMAP profiles: providers of sequence matrices.
- PRINTS: provider of fingerprints, which are groups of aligned, un-weighted Position Specific Sequence Matrices (PSSMs).
- PANTHER, PIRSF, Pfam, SMART, TIGRFAMs, Gene3D and SUPERFAMILY: providers of hidden Markov models (HMMs).
Entries typed Family contain signatures that cover all domains in the matching proteins and span >80% of the protein length with no adjacent signatures of type Domain or Region in >90% of the entry protein set. Entries typed Domain identify biological units with defined boundaries, which includes structural and functional domains as well as defined sub-domains.
Correlated mutations
• Approaches to detect residue coupling
• Applications
• Some debates regarding correlated mutations
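Information extracted from multiple sequence alignment (MSA)
[Figure: example MSA columns (such as VVVVVVV and NNNSSSS) labelled “conserved”, “tree determinant” and “coupled (correlated mutations)”.]
Conservation at position i can be measured as the fraction of sequences that carry the consensus residue:
$$C_i = \frac{N_{\mathrm{consensus}}}{N_{\mathrm{sequences}}}$$
or as the entropy over the 20 amino acid types, where p_i^{aa} is the frequency of amino acid aa at position i:
$$C_i = -\sum_{aa=1}^{20} p_i^{aa} \ln p_i^{aa}$$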
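Both conservation measures are easy to compute per column. A small sketch, assuming gap-free columns and natural logarithms; the function names are illustrative only.

```python
import math
from collections import Counter

def consensus_fraction(column):
    """Fraction of sequences that carry the most common residue of the column."""
    return Counter(column).most_common(1)[0][1] / len(column)

def column_entropy(column):
    """Entropy of the residue frequencies in the column, in nats."""
    n = len(column)
    return -sum((c / n) * math.log(c / n) for c in Counter(column).values())
```

For the column `VVVVVVV` the consensus fraction is 1 and the entropy is 0; for `NNNSSSS` the consensus fraction is 4/7 and the entropy is positive.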
So what can correlated mutations tell us, and where are they useful?
Basically every evolutionary constrained
Very useful for RNA folding
In proteins:
• Contact prediction
• Analysis of important interactions
• Analysis of allosteric paths and energetically coupled residues
• Correlation coefficient: Gobel et al. (1994), Proteins
• OMES (Kass & Horovitz): Kass and Horovitz (2002), Proteins
• SCA: Lockless and Ranganathan (1999), Science
• Mutual Information
http://bip.weizmann.ac.il/correlated_mutations/
[Figure: 400 × 400 matrix; both axes are labelled “400 amino acid pairs”.]
i, j – positions
s, t – sequences
w_s, w_t – sequence weights
s_i – amino acid found at position i on sequence s
P2P
[Figure: example MSA columns with coupled positions.]
Instead of calculating correlations, we can derive universal scores for substitutions between amino acid pairs
Score (R,E → E,K)?
How to derive such a substitution matrix?
Eyal et al. (2007) Proteins, 67, 142-153
[Figure: a block of aligned sequences with per-sequence weights w1 … w7.]
Blocks: small un-gapped multiple alignments
Advantages:
• Accurate alignments
• No gaps
• Sequence weights
Representative structures for each block
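The pair-to-pair (P2P) substitution matrix
$$M[xy][uv] = \underbrace{\ln\frac{f^{con}_{obs}[xy][uv]}{f^{con}_{exp}[xy][uv]}}_{\text{signal}} - \underbrace{\ln\frac{f^{nocon}_{obs}[xy][uv]}{f^{nocon}_{exp}[xy][uv]}}_{\text{noise}}$$
$$f^{con}_{obs}[xy][uv] = \frac{n^{con}_{obs}[xy][uv]}{\sum_{abcd} n^{con}_{obs}[ab][cd]}$$
$$f^{con}_{exp}[xy][uv] = \frac{n_{obs}[x][u]}{\sum_{ab} n_{obs}[a][b]} \cdot \frac{n_{obs}[y][v]}{\sum_{ab} n_{obs}[a][b]}$$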
Flipped pairs: XY → YX
Invariant pairs: XY → XY
[Figure: examples of pair-to-pair substitutions, including XY → XY, XY → XZ, XY → YX and XY → WZ.]
What is the matrix useful for?
• Detect contact between amino acids when there is no structural data
• Evaluate structures
• Detect functionally/structurally important regions
P2P for contact prediction
Eyal et al. (2007) Proteins, 67, 142-153
Contact prediction in smaller proteins is easier
Galectin-7 (1bkz): all predictions are between 2 β-sheets
β-lactamase II (1bc2): most predictions are around the metal binding site
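P2P as a scoring function for structure evaluation
For each pair of positions i, j that are in contact, a score S_ij is calculated using P2P. The overall score is:
$$S = \sum_{i,j\ \mathrm{in\ contact}} S_{ij}$$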
Advantages of the P2P method over other methods
• No need for large MSAs
• No need to construct evolutionary trees
• Naturally handles conservation and correlations
• Interactive web implementation
Considering correlations together improves contact prediction – the GARP approach
Frankel et al. (2007) BMC Bioinformatics, 67, 142-153
Considering also neighbors and “windows” of correlations may improve predictions of primary correlated mutations methods
Methods based on the evolutionary tree
[Figure: two alternative evolutionary trees, both rooted at ADSDDFGRLIILM. In the first, the paired change (D→L together with R→V) is marked once as “2 mutation events”; in the second, “2 mutation events” are marked on two separate branches.]
The resulting alignment:
GDTDDFGRLIILM
ADSDDFGRLIILL
ADTDLFGVLIILM
ADSDLFGVLIILL
Although the same multiple alignment will be obtained in the 2 cases, it is clear that an evolutionary history of multiple independent events is a much stronger indication for real coupling.
Methods based on evolutionary tree:
Pagel M. (1994) Proc R Soc Lond
Pollock D, Taylor W. (1997) Protein Eng
Pollock D et al. (1999) J Mol Biol
Tuffery and Darlu (2000) Mol Biol Evol
Fleishman S et al. (2004) J Mol Biol
Noivirt et al. (2005) Protein Eng
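Statistical coupling
Lockless and Ranganathan (1999), Science, 286, 295-299
[Figure: the full alignment (MSA) and the sub-alignment that remains after a perturbation at site j (MSA|δj), with positions i and j marked.]
For every selected j we can measure the coupling to all other sites i:
$$\Delta\Delta G^{stat}_{i,j} = kT^{*}\sqrt{\sum_{aa=1}^{20}\left(\ln\frac{p^{aa}_{i|\delta j}}{p^{aa}_{MSA|\delta j}} - \ln\frac{p^{aa}_{i}}{p^{aa}_{MSA}}\right)^{2}}$$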
Statistical coupling
PDZ domain: Lockless and Ranganathan (1999), Science, 286, 295-299
WW domain: Russ et al. (2005), Nature, 437, 579-583
Cooperativity in WW domains: Russ et al. (2005), Nature, 437, 579-583
Studies using SCA
Estabrook et al. (2005), PNAS: methyltransferases
Marcelino et al. (2006), Proteins: intracellular lipid binding proteins (iLBPs)
Swain et al. (2006), Curr Opin Str Biol: HSP70 chaperones
Chen et al. (2006), JBC: Cys loop ligand-gated ion channels
Dima and Thirumalai (2006), Protein Sci: selectins
Ferguson et al. (2007), PNAS: TonB-dependent transporters
Yu et al. (2007), Biophys J: DNA helicases
Lee et al. (2008), Science: PAS-DHFR
Hsu and Traugh (2010), PLoS One: protein kinase Pak2
Is correlated mutations analysis really meaningful?
Which are the leading methods?
Fodor and Aldrich (2004), Proteins
Halperin et al. (2006), Proteins
Can correlated mutations reveal allosteric pathways?
[Figure, panels A and B: networks of residues numbered 1–6 and the corresponding 6 × 6 matrices.]
Fodor and Aldrich (2004), JBC, 279, 19046-19050
Can CM detect interactions between different molecules?
Halperin et al. (2006), Proteins
[Figure panels: intra, inter]
The main problems in detecting inter protein coupling
• Correct selection of paralogs
• Basic assumptions: are interfaces conserved? Do all pairs interact?
• Smaller number of protein complexes and data about interfaces for testing/training
Covariance analysis of Glutamate transporter
• Covariance analysis was performed using different methods
• 989 sequences were extracted from PFAM for the sodium-dicarboxylate symporter family (PF00375).
• The alignment was modified so that the reference numbering is that of the human EAAT1 protein.
• Hierarchical clustering was used to analyze the matrices
[Figure: clustered covariance matrix with labelled groups: residues from TM2, residues from TM4a, residues from TM4c, residues in the core region; core region and interface regions marked.]
Is there a real connection between CM and energetically coupled residues?
Works for PDZ domain
Not so fast….
Bouncing back
Different correlated mutation methods are appropriate for different tasks
Did Fodor implement SCA appropriately?
Dima and Thirumalai (2006), Protein Sci 15, 258-268
• The number of sequences remaining after the perturbation should be considered.
• The matrix is not symmetrical, but was treated as such by Fodor.
[Figure panels: McBasc, P2P, OMES, MI, SCA]
Correlations in HIV-1 Protease
• cleavage of premature polypeptides to form the proteins required by the virus
• a major drug target in AIDS therapies
• exhibits multi-drug resistance
• large amount of data available
– sequence databases
– many solved structures
– clinical information and known drug resistant mutations
Dataset and MSAs
Data source: http://hivdb.stanford.edu/
IDV = indinavir, protease inhibitor
NFV = nelfinavir, protease inhibitor
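Mutual Information
• Mutual information measures the dependence between two random variables
• Suppose Xi and Xj are two random variables. The mutual information between Xi and Xj is defined as
$$I(X_i, X_j) = \sum_{x_i}\sum_{x_j} P(x_i, x_j)\,\log\frac{P(x_i, x_j)}{P(x_i)\,P(x_j)}$$
where xi and xj are specific amino acid types, P(xi, xj) is the joint probability and P(xi), P(xj) are the singlet (marginal) probabilities
Two Extreme Cases
• when Xi and Xj are independent, P(xi, xj) = P(xi)P(xj) and I(Xi, Xj) = 0
• when Xi and Xj are perfectly coupled (each determines the other, so they follow exactly the same distribution), I(Xi, Xj) equals the entropy H(Xi) = H(Xj)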
Mutual Information Matrix
• By calculating I(Xi, Xj) for all pairs of i, j, we obtain an N×N mutual information matrix W with element I(Xi, Xj).
• In our case N = 99.
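A minimal sketch of how such a matrix could be computed from an alignment follows. It assumes gap-free sequences of equal length, unweighted sequences, plain frequency estimates without pseudocounts, and natural logarithms; the function names are illustrative.

```python
import math
from collections import Counter

def mutual_information(col_i, col_j):
    """I(Xi, Xj) in nats, estimated from two alignment columns."""
    n = len(col_i)
    joint = Counter(zip(col_i, col_j))       # counts of residue pairs
    single_i, single_j = Counter(col_i), Counter(col_j)
    return sum((c / n) * math.log((c / n) / ((single_i[a] / n) * (single_j[b] / n)))
               for (a, b), c in joint.items())

def mi_matrix(msa):
    """Symmetric N x N matrix W with W[i][j] = I(Xi, Xj) for an MSA with N columns."""
    columns = list(zip(*msa))
    n = len(columns)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            w[i][j] = w[j][i] = mutual_information(columns[i], columns[j])
    return w
```

For the toy alignment `["REV", "REV", "EKV", "EKV"]` the first two columns are perfectly coupled, so their mutual information equals the column entropy ln 2, while the conserved third column has zero mutual information with every other column.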
Clustering
Why do clustering?
– The origin of the correlations is not always pairwise, but most available statistical methods are based on pairwise metrics. Clustering helps in detecting more integrated patterns.
– Enhance signal over noise (S/N ratio) (Noivirt et al., Protein Eng. Des. Sel., 2005)
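Spectral Clustering
• A graph segmentation algorithm (Shi and Malik, IEEE Trans, 2000)
[Figure: two ways of cutting the same graph, Scheme A and Scheme B.]
• The cut between two groups A and B is the total weight of the edges that connect them:
$$cut(A,B) = \sum_{u\in A,\ v\in B} w(u,v)$$
• Minimize the normalized cut between two groups
$$Ncut(A,B) = \frac{cut(A,B)}{assoc(A,V)} + \frac{cut(A,B)}{assoc(B,V)}$$
• assoc(A, V) is the total weight of connection from A to all nodes in the graph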
Back to the Protein
• Each column in the MSA corresponds to a residue, which in turn is represented as a node in the graph.
• The mutual information is the weight of the edge between nodes i and j: mutual information matrix = weight matrix
[Figure: nodes “Residue i” and “Residue j” joined by an edge weighted by the mutual information between Xi and Xj.]
Spectral Clustering
• The problem reduces to solving a generalized eigenvalue problem
$$(D - W)\,y = \lambda D\,y$$
where D is a diagonal matrix with elements $D_{ii} = \sum_j W_{ij}$ and W is the mutual information weight matrix
• The eigenvector with the first nonzero eigenvalue is used to bi-partition the nodes
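The bi-partition step can be sketched without any numerical library. The code below is an illustration, not the implementation used for the results in these slides: it assumes a symmetric weight matrix with positive row sums and a connected graph, and it finds the eigenvector of the first nonzero eigenvalue by power iteration on the shifted matrix (I + D^-1/2 W D^-1/2)/2 after removing the trivial eigenvector. A linear-algebra package would normally be used instead.

```python
import math

def spectral_bipartition(w, iterations=200):
    """Split the nodes of a weighted graph in two using the second eigenvector.

    w is a symmetric list-of-lists weight matrix with positive row sums.
    """
    n = len(w)
    degree = [sum(row) for row in w]
    inv_sqrt = [1.0 / math.sqrt(d) for d in degree]
    # Shifted, normalized weights: eigenvalues lie in [0, 1], and the largest
    # one belongs to the trivial eigenvector sqrt(degree).
    m = [[0.5 * (w[i][j] * inv_sqrt[i] * inv_sqrt[j] + (i == j)) for j in range(n)]
         for i in range(n)]
    norm = math.sqrt(sum(degree))
    trivial = [math.sqrt(d) / norm for d in degree]
    z = [float(i + 1) for i in range(n)]              # deterministic start vector
    for _ in range(iterations):
        z = [sum(m[i][j] * z[j] for j in range(n)) for i in range(n)]
        overlap = sum(a * b for a, b in zip(z, trivial))
        z = [a - overlap * b for a, b in zip(z, trivial)]   # remove trivial component
        length = math.sqrt(sum(a * a for a in z))
        z = [a / length for a in z]
    y = [z[i] * inv_sqrt[i] for i in range(n)]        # generalized eigenvector
    return [1 if value >= 0 else 0 for value in y]    # partition by sign
```

Applied to a weight matrix with two tightly connected groups of nodes and weak links between them, the function assigns one label to each group.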
Sequence Correlation Matrix and its Permutation Based on Clustering
[Figure panels: treated data; untreated data]
Results
• two clusters were distinguished based on the spectral clustering procedure
• one of the clusters (blue) contains residues known to be involved in multi-drug resistance
• the other cluster (red) contains residues that exhibit substantial sequence variability between subtypes of HIV (Gonzales et al., J. Infect. Dis., 2001)
Cooperative Coupling
Relation between Sequence Variability and Protein Dynamics (GNM)
[Figure: the two clustering partitions compared with residue mobilities.]
• Covariance analysis detects the drug-resistance mutations and their cooperativity in HIV-1 protease in agreement with experimental data.
• Clustering techniques can be applied to analyse the data.
• A relationship is elucidated between coevolving residue clusters and the collective dynamics of the protease.
Correlated mutations - summary
• Correlated mutations analysis is a simple tool to detect coupling between residues.
• The tremendous amount of available sequences makes it more attractive
• Current methods can assist in detection of close tertiary contacts
• Depending on the application, different CM methods should be applied
• A relation between sets of correlated residues and allosteric paths has been suggested, but not always in a consistent and convincing way.
• A relation between CM and free energy has been suggested but shown not to hold on a consistent basis
Correlated mutations - future directions
• Improve MSA – filtering out sequences
• N-body correlations instead of pair-wise correlations
• Improved clustering techniques
ConSurf – mapping conservation scores on 3D structures
• ConSurf is a tool developed at TAU for mapping conservation scores on protein (and recently nucleic acid) structures.
• Detailed understanding of the mechanism of biological processes requires the identification of functionally important amino acids at the protein surface that are responsible for these interactions.
• The ConSurf server is a useful and user-friendly tool that enables the identification of functionally important regions on the surface of a protein of known three-dimensional structure, based on evolutionary analysis.
Ashkenazy H., Erez E., Martz E., Pupko T. and Ben-Tal N. (2010) ConSurf 2010: calculating evolutionary conservation in sequence and structure of proteins and nucleic acids. Nucl. Acids Res.
http://consurf.tau.ac.il/
• Given the 3D structure of a protein or a domain as input, ConSurf extracts the sequence from the PDB.
• It then carries out a search for close homologous sequences of the protein of known structure using PSI-BLAST.
• Multiple sequence alignment is done using MUSCLE or CLUSTALW. The multiple sequence alignment is used to build a phylogenetic tree.
• Conservation scores are calculated using a Bayesian or a Maximum Likelihood method.
• The protein, with the conservation scores color-coded onto its surface, can finally be visualized on-line using Jmol.
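One common way to carry per-residue scores into a molecular viewer is to write them into the B-factor field of the PDB ATOM records and then colour by B-factor. The sketch below illustrates that idea only; it is not a description of how ConSurf itself stores its scores, and the function name and the `(chain, residue number)` key format are assumptions.

```python
def map_scores_to_bfactor(pdb_lines, scores, default=0.0):
    """Write per-residue scores into the B-factor field of ATOM/HETATM records.

    scores maps (chain ID, residue number) to a float; residues without a
    score receive `default`. Other records are returned unchanged.
    """
    out = []
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")) and len(line) >= 66:
            chain = line[21]                 # column 22: chain identifier
            resnum = int(line[22:26])        # columns 23-26: residue sequence number
            value = scores.get((chain, resnum), default)
            # Columns 61-66 hold the B-factor (width 6, two decimals).
            line = line[:60] + f"{value:6.2f}" + line[66:]
        out.append(line)
    return out
```

The rewritten lines can be saved as a new PDB file and opened in a viewer that supports colouring by B-factor.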