Post on 11-Jan-2016
EMBL-EBI
PATTERNS
Kim Henrick
EMBL-EBI
Terri Attwood
School of Biological Sciences
University of Manchester, Oxford Road
Manchester M13 9PT, UK
http://www.bioinf.man.ac.uk/dbbrowser/
Motifs and domains
Motif: a simple combination of a few consecutive secondary structure elements with a specific geometric arrangement (e.g., helix-loop-helix). May have a specific biological function.
Domain: the fundamental unit of structure folding and evolution. It combines several secondary structure elements and motifs packed in a compact globular structure. A domain can fold independently into a stable 3D structure, and may have a specific function.
Domain family: proteins that share a domain (possibly in combination with other domains)
Protein family: proteins that have the same combination of domains
Profiles & Motifs are Useful
Helped identify active site of HIV protease
Helped identify SH2/SH3 class of STPs
Helped identify important GTP oncoproteins
Helped identify hidden leucine zipper in HGA
Used to scan for lectin binding domains
Regularly used to predict T-cell epitopes
Domains are More Useful
Rules of Thumb
Sequence pattern-based motifs should be determined from no fewer than 5 multiply aligned sequences
A good degree of sequence divergence is needed. If “S” is the similarity (expressed as a fraction, i.e. %similarity / 100) and “N” is the no. of sequences, then 1 - S^N > 0.95
A good sequence pattern should have no fewer than 8 defined amino acid positions
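The divergence rule above can be checked numerically. This is a minimal sketch, assuming S is the pairwise similarity written as a fraction between 0 and 1; the function names are illustrative and do not come from the slides.

```python
def divergence_score(similarity: float, n_sequences: int) -> float:
    """Return 1 - S^N, with S given as a fraction between 0 and 1."""
    return 1.0 - similarity ** n_sequences


def min_sequences(similarity: float, threshold: float = 0.95) -> int:
    """Smallest N for which 1 - S^N exceeds the threshold (requires 0 < S < 1)."""
    if not 0.0 < similarity < 1.0:
        raise ValueError("similarity must be a fraction strictly between 0 and 1")
    n = 1
    while divergence_score(similarity, n) <= threshold:
        n += 1
    return n


if __name__ == "__main__":
    for s in (0.5, 0.8, 0.9):
        print(f"S = {s:.0%}: at least {min_sequences(s)} sequences")
```

Under this reading, the more similar the sequences are, the more of them are needed before the rule is satisfied.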
Representations of protein families
Regular expression
Position specific scoring matrices (profiles)
Hidden Markov Models
Probabilistic suffix trees
Sparse Markov transducers
Pattern recognition methods
These methods classify proteins into families
the basis of the methods is multiple sequence alignment
They depend on developing a representation of conserved elements of alignments that may be diagnostic of structure or function, whether from
homologous sequence families
sequences that share some structural/functional domains
Regular expressions/patterns
These are derived from single conserved regions, which are reduced to consensus expressions for db searches
they are minimal expressions, so sequence information is lost
the more divergent the sequences used, the more fuzzy & poorly discriminating the pattern becomes
Alignment        Pattern
GAVDFIALCDRYF
GPIDFVCFCERFY    G-X-[IV]-[DE]-F-[IVL]-X2-C-[DE]-R-[FY]2
GRVEFLNRCDRYY
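The consensus expression above maps directly onto an ordinary regular expression. The sketch below is a hand translation into Python `re` syntax (X becomes `.`, X2 becomes `.{2}`, [FY]2 becomes `[FY]{2}`); the `matches` helper and the mutated test sequence are illustrative additions, not part of the slides.

```python
import re

# Hand translation of G-X-[IV]-[DE]-F-[IVL]-X2-C-[DE]-R-[FY]2 (hyphens dropped).
PATTERN = re.compile(r"G.[IV][DE]F[IVL].{2}C[DE]R[FY]{2}")

# The three aligned sequences the pattern was derived from.
ALIGNMENT = ["GAVDFIALCDRYF", "GPIDFVCFCERFY", "GRVEFLNRCDRYY"]


def matches(sequence: str) -> bool:
    """True if the pattern occurs anywhere in the sequence."""
    return PATTERN.search(sequence) is not None


if __name__ == "__main__":
    for seq in ALIGNMENT:
        print(seq, matches(seq))
    # Illustrative variant: replacing the conserved Cys with Ser breaks the match.
    print("GAVDFIALSDRYF", matches("GAVDFIALSDRYF"))
```

The last line shows the binary nature of the approach: one substitution at a fixed position turns a match into a non-match, however similar the rest of the sequence is.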
Patterns do not tolerate similarity
sequences either match or not, regardless of how similar they are
matching is a binary ‘on-off’ event & frequently misses true matches
single-motif methods are very hit-or-miss – how do you know if you've encoded the ‘best’ region?
PROSITE
G_PROTEIN_RECEPTOR; PATTERN PS00237; G-protein coupled receptor signature
[GSTALIVMYWC]-[GSTANCPDE]-{EDPKRH}-X(2)-[LIVMNQGA]-X(2)-[LIVMFT]-[GSTANC]-[LIVMFYWSTAC]-[DENH]-R
/TOTAL=919(919); /POS=869(869); /FALSE_POS=50(50); /FALSE_NEG=70; /PARTIAL=49; UNKNOWN=0(0)
This represents an apparent 18% error rate
the actual rate is probably higher
Thus, a match to a pattern is not necessarily true & a mis-match is not necessarily false!
False-negatives are a fundamental limitation to this type of pattern matching
if you don't know what you're looking for, you'll never know you missed it!
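To scan sequences with a pattern written in this notation, it first has to be translated into a form a regex engine understands. The following is a rough sketch of such a translation for the elements used in the PS00237 signature (residue sets, `{}` exclusions, X wildcards and `(n)` repeats). The function name `prosite_to_regex` is hypothetical, anchors and other parts of the full PROSITE syntax are deliberately not handled, and the test peptide is synthetic.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into Python regular-expression syntax."""
    parts = []
    for element in pattern.replace(" ", "").rstrip(".").split("-"):
        # Separate an optional repeat suffix such as (2) or (2,4) from the element.
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        if m is None:
            raise ValueError(f"empty pattern element in {pattern!r}")
        core, repeat = m.group(1), m.group(2)
        if core in ("x", "X"):
            regex = "."                      # any residue
        elif core.startswith("{") and core.endswith("}"):
            regex = "[^" + core[1:-1] + "]"  # any residue except these
        elif core.startswith("[") and core.endswith("]"):
            regex = core                     # one of these residues
        elif len(core) == 1 and core.isalpha():
            regex = core                     # exactly this residue
        else:
            raise ValueError(f"unsupported pattern element: {element!r}")
        if repeat:
            regex += "{" + repeat + "}"
        parts.append(regex)
    return "".join(parts)


PS00237 = ("[GSTALIVMYWC]-[GSTANCPDE]-{EDPKRH}-X(2)-[LIVMNQGA]-"
           "X(2)-[LIVMFT]-[GSTANC]-[LIVMFYWSTAC]-[DENH]-R")

if __name__ == "__main__":
    regex = prosite_to_regex(PS00237)
    print(regex)
    # Synthetic 13-residue peptide built to satisfy every position of the pattern.
    print(bool(re.search(regex, "AGAAALAALGLDR")))
```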
Regular expressions/rules
Regular expression patterns are most effective when applied to highly-conserved, family-specific motifs
It is often possible to identify shorter generic patterns that are characteristic of common functional sites
Functional site                     Rule
N-glycosylation                     N-{P}-[ST]-{P}
Protein kinase C phosphorylation    [ST]-X-[RK]
Casein kinase II phosphorylation    [ST]-X2-[DE]
Such features result from convergence to a common property
glycosylation sites, phosphorylation sites, etc.
They cannot be used for family diagnosis & don't discriminate
they can only be used to suggest whether a certain functional site might exist (which must then be tested by experiment)
such patterns are termed rules
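A short sketch of how the three rules above could be applied to a sequence. The regexes are hand translations of the rules; the `find_sites` helper and the example peptide are made up for illustration, and any hit is only a candidate site.

```python
import re

# Hand-translated rules: {P} -> [^P], X -> ".", X2 -> ".{2}".
RULES = {
    "N-glycosylation": r"N[^P][ST][^P]",
    "Protein kinase C phosphorylation": r"[ST].[RK]",
    "Casein kinase II phosphorylation": r"[ST].{2}[DE]",
}


def find_sites(sequence: str) -> dict[str, list[int]]:
    """Return 1-based start positions of every (possibly overlapping) rule match."""
    hits = {}
    for name, regex in RULES.items():
        # A zero-width lookahead lets overlapping sites all be reported.
        lookahead = re.compile(f"(?={regex})")
        hits[name] = [m.start() + 1 for m in lookahead.finditer(sequence)]
    return hits


if __name__ == "__main__":
    # Hypothetical peptide, chosen only to trigger each rule.
    for rule, positions in find_sites("MNGTASGRSTAEDK").items():
        print(f"{rule}: {positions}")
```

Because the rules are so short, hits like these turn up in many unrelated proteins, which is why they suggest sites rather than diagnose families.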
Diagnostic limitations of short motifs
Consider the sequence motif Asp-Ala-Val-Ile-Asp (DAVID)
results of db searching for such a sequence will differ, depending on whether we search for exact or permissive ‘fuzzy’ matches
Pattern                             Matches
D-A-V-I-D                           99
D-A-V-I-[DEQN]                      252
[DEQN]-A-V-I-[DEQN]                 925
[DEQN]-A-[VLI]-I-[DEQN]             2,739
[DEQN]-[AG]-[VLI]-[VLI]-[DEQN]      51,506
D-A-V-E                             1,493
(number of matches in OWL31.1)
Use of fuzzy regular expressions has the potential advantage of being able to recognise more distant relationships
& the inherent disadvantage that more matches will be made by chance, making it difficult to separate out true matches from noise
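Why fuzzier patterns pick up more chance matches can be seen by counting how many distinct strings each pattern accepts. The sketch below does this under a deliberately crude assumption that all 20 amino acids are equally likely at every position (real databases are not uniform, so the OWL counts above will not follow these ratios exactly). All function names are illustrative.

```python
from math import prod


def allowed_sets(pattern: str) -> list[str]:
    """Split e.g. 'D-A-[VLI]' into the residues allowed at each position."""
    return [element.strip("[]") for element in pattern.split("-")]


def accepted_strings(pattern: str) -> int:
    """Number of distinct peptides the pattern accepts."""
    return prod(len(allowed) for allowed in allowed_sets(pattern))


def chance_probability(pattern: str, alphabet_size: int = 20) -> float:
    """Chance that one random window matches, assuming uniform residue usage."""
    return accepted_strings(pattern) / alphabet_size ** len(allowed_sets(pattern))


PATTERNS = [
    "D-A-V-I-D",
    "D-A-V-I-[DEQN]",
    "[DEQN]-A-V-I-[DEQN]",
    "[DEQN]-A-[VLI]-I-[DEQN]",
    "[DEQN]-[AG]-[VLI]-[VLI]-[DEQN]",
]

if __name__ == "__main__":
    for pattern in PATTERNS:
        print(f"{pattern:32s} accepts {accepted_strings(pattern):4d} peptides, "
              f"p = {chance_probability(pattern):.2e} per window")
```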
Fingerprints
Fingerprints are groups of motifs excised from alignments & used for iterative db searching
no weighting scheme is used
searches depend only on residue frequencies
resulting scoring matrices are thus sparse
Each motif trawls the database independently
search results are correlated to determine which sequences match all the motifs & which match only partially
no information is thrown away
Iteration refines the fingerprint & increases its potency
fingerprints are diagnostically more powerful than regular expressions
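The idea of an unweighted, frequency-only scoring matrix can be sketched for a single motif as follows. This is not the PRINTS implementation: it only illustrates how raw residue counts give a sparse matrix in which unseen residues score zero. The motif reuses the three aligned sequences shown earlier in the deck; the function names and the query sequence are hypothetical, and a real fingerprint would combine several such motifs over repeated searches.

```python
from collections import Counter

MOTIF = ["GAVDFIALCDRYF", "GPIDFVCFCERFY", "GRVEFLNRCDRYY"]


def frequency_matrix(motif: list[str]) -> list[Counter]:
    """One Counter per column: raw residue counts, no weighting, no pseudocounts."""
    width = len(motif[0])
    if any(len(seq) != width for seq in motif):
        raise ValueError("all motif sequences must have the same length")
    return [Counter(column) for column in zip(*motif)]


def score_window(matrix: list[Counter], window: str) -> int:
    """Sum the observed counts; residues never seen in a column score 0."""
    return sum(column[residue] for column, residue in zip(matrix, window))


def best_hit(matrix: list[Counter], sequence: str) -> tuple[int, int]:
    """Return (score, 0-based start) of the best-scoring window in the sequence."""
    width = len(matrix)
    return max(
        (score_window(matrix, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


if __name__ == "__main__":
    matrix = frequency_matrix(MOTIF)
    print(best_hit(matrix, "MKKGAVDFIALCDRYFLL"))
```

Unlike a regular expression, a near miss still earns a score here, so partial matches can be ranked instead of being discarded.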
Profiles
Profiles are scoring tables derived from full alignments
these define which residues are allowed at given positions
which positions are conserved & which degenerate
which positions, or regions, can tolerate insertions
the scoring system is intricate, & may include evolutionary weights, results from structural studies, & data implicit in the alignment
variable penalties are specified to weight against INDELs occurring in core secondary structure elements
Within a profile, fields contain position-specific scores for insert & match positions
in conserved regions, INDELs aren't totally forbidden, but are strongly impeded by large penalties defined in a DEFAULT field
these are superseded by more permissive values in gapped regions
the inherent complexity of profiles renders them highly potent discriminators, but they are time-consuming to derive
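A toy data structure can make the "default penalty, superseded in gapped regions" idea concrete. Everything here is an assumption made for illustration: the class and constant names, the score values, and the simplification that the sequence is already aligned to the profile with '-' marking a deleted position. It is not the PROSITE profile format.

```python
from dataclasses import dataclass

DEFAULT_GAP_PENALTY = -10.0    # hypothetical: large penalty used in conserved regions
UNLISTED_RESIDUE_SCORE = -1.0  # hypothetical score for residues absent from the table


@dataclass
class ProfilePosition:
    match_scores: dict[str, float]
    # Positions in gapped regions override the default with a milder penalty.
    gap_penalty: float = DEFAULT_GAP_PENALTY


def score_aligned(profile: list[ProfilePosition], aligned: str) -> float:
    """Score a sequence already aligned to the profile ('-' marks a deletion)."""
    if len(aligned) != len(profile):
        raise ValueError("aligned sequence must have one symbol per profile position")
    total = 0.0
    for position, symbol in zip(profile, aligned):
        if symbol == "-":
            total += position.gap_penalty
        else:
            total += position.match_scores.get(symbol, UNLISTED_RESIDUE_SCORE)
    return total


PROFILE = [
    ProfilePosition({"G": 5.0}),                                        # conserved
    ProfilePosition({"A": 1.0, "P": 1.0, "R": 1.0}, gap_penalty=-2.0),  # gapped region
    ProfilePosition({"V": 3.0, "I": 2.0}),                              # conserved
]

if __name__ == "__main__":
    for aligned in ("GAV", "G-V", "-AV"):
        print(aligned, score_aligned(PROFILE, aligned))
```

A deletion in the permissive position costs little, while the same deletion at the conserved first position is heavily penalised but not forbidden.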
Hidden Markov Models
HMMs are similar in concept to profiles
they are probabilistic models consisting of interconnecting states
essentially, linear chains of match, delete or insert states
Match states are assigned to conserved columns in an alignment
Insert states allow for insertions relative to match states
Delete states allow match positions to be skipped
Thus, building an HMM requires each position in an alignment to be assigned to match, delete or insert states
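The state assignment described above can be sketched in a few lines. The rule used here to decide which columns are "conserved" (at most half the sequences have a gap) is an assumed heuristic, not something stated in the slides, and the small alignment is invented for the example.

```python
def match_columns(alignment: list[str], max_gap_fraction: float = 0.5) -> list[bool]:
    """Mark a column as a match column if at most max_gap_fraction of it is gaps."""
    n = len(alignment)
    return [column.count("-") / n <= max_gap_fraction for column in zip(*alignment)]


def state_path(aligned_seq: str, is_match: list[bool]) -> str:
    """Return the M/D/I state string for one aligned sequence."""
    states = []
    for symbol, match in zip(aligned_seq, is_match):
        if match:
            # Residue in a match column -> match state; gap -> delete state.
            states.append("M" if symbol != "-" else "D")
        elif symbol != "-":
            # Residue in a non-match column -> insert state.
            states.append("I")
        # A gap in an insert column uses no state at all.
    return "".join(states)


# Hypothetical gapped alignment: column 3 is mostly gaps, so it becomes an insert column.
ALIGNMENT = [
    "GA-VDF",
    "GP-IDF",
    "G-KVEF",
    "GR-V-F",
]

if __name__ == "__main__":
    columns = match_columns(ALIGNMENT)
    for seq in ALIGNMENT:
        print(seq, state_path(seq, columns))
```

Counting how often each state and residue occurs along these paths is what would then give the model its transition and emission probabilities.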
HMMs usually perform well, but can be over-trained
they may also suffer if created from automatic iterative processes
if it once accepts a false match, an HMM becomes corrupt
Probabilistic Suffix Trees
Identify short significant contiguous segments
Do not require multiple alignment
Induces a probability distribution on the next symbol to appear right after the segment (short term memory)
Variable memory length
More efficient than order L Markov chains
Longer memory length compared to first-order HMMs, and easier to learn
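The "variable memory length" idea can be illustrated with a much simplified stand-in for a probabilistic suffix tree: count which symbol follows every short context, then predict from the longest context that has been seen often enough. A real PST prunes contexts by statistical criteria and smooths the probabilities; the `min_count` cut-off, the function names and the toy training strings below are assumptions for the sketch.

```python
from collections import Counter, defaultdict


def train(sequences: list[str], max_memory: int = 3) -> dict[str, Counter]:
    """Count which symbol follows every context of length 0..max_memory."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for seq in sequences:
        for i, symbol in enumerate(seq):
            for k in range(min(i, max_memory) + 1):
                counts[seq[i - k:i]][symbol] += 1
    return counts


def next_symbol_distribution(counts: dict[str, Counter], history: str,
                             max_memory: int = 3, min_count: int = 2) -> dict[str, float]:
    """Predict from the longest suffix of history seen at least min_count times."""
    for k in range(min(len(history), max_memory), -1, -1):
        context = history[len(history) - k:]
        seen = counts.get(context)
        if seen:
            total = sum(seen.values())
            if total >= min_count:
                return {symbol: count / total for symbol, count in seen.items()}
    return {}


if __name__ == "__main__":
    model = train(["ABABAB", "ABAC"], max_memory=2)
    print(next_symbol_distribution(model, "XAB", max_memory=2))
    print(next_symbol_distribution(model, "BA", max_memory=2))
```

No multiple alignment is needed at any point: the model is learned directly from the unaligned strings.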
Which method is best?
The range of methods available leads to familiar problems
which should we use?
which is the most reliable?
which is the most comprehensive?
None of the pattern-recognition techniques is infallible
each has its optimum area of application
None of the resulting pattern databases are complete
none is the best
bearing in mind the diagnostic strengths & weaknesses of the different approaches, & keeping biological significance in mind, the best strategy is to use them all
Pattern recognition & prediction
In investigating the meaning of sequences, 2 distinct analytical approaches have emerged
pattern recognition is used to detect similarity between sequences & hence to infer related structures & functions
ab initio prediction is used to deduce structure, & to infer function, directly from sequence
These methods are different & shouldn’t be confused!
Sequence- & structure-based pattern recognition methods demand that some characteristic has been seen before & housed in a db
Prediction methods remove the need for template dbs because deductions are made directly from sequence
fact & fiction
Sequence pattern recognition is easier to achieve, & is much more reliable, than fold recognition
which is only ~40-50% reliable even in expert hands
Prediction is still not possible & is unlikely to be so for decades to come (if ever)
Structural genomics will yield representative structures for more proteins in future
structures of new sequences will be determined by modelling
prediction will become an academic exercise
But, to debunk a popular myth, knowing structure alone does not inherently tell us function
Prediction methods don’t work because we don’t fully understand the Folding Problem
we can’t read the language sequences use to create their folds
But, with sequence analysis techniques, we can try to find similarities between new sequences & those in dbs
whose structures & functions we hope have been elucidated
This is straightforward at high levels of identity, but below 50% it is difficult to establish relationship reliably
Analyses can be pursued with decreasing certainty down to ~20% identity, where results may look plausible to the eye, but are no longer statistically significant
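For reference, percent identity between two aligned sequences can be computed as below. The choice of denominator (columns where neither sequence has a gap) is one convention among several and is an assumption here, as is the `certainty_band` helper, which simply restates the ~50% and ~20% figures from the slide as rough labels.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    identical = sum(a == b for a, b in pairs)
    return 100.0 * identical / len(pairs)


def certainty_band(identity: float) -> str:
    """Rough label based on the thresholds quoted on the slide."""
    if identity >= 50.0:
        return "relationship usually straightforward to establish"
    if identity > 20.0:
        return "relationship difficult to establish reliably"
    return "may look plausible, but not statistically significant"


if __name__ == "__main__":
    identity = percent_identity("GAVDFIALCDRYF", "GPIDFVCFCERFY")
    print(f"{identity:.1f}% identity: {certainty_band(identity)}")
```

The two sequences in the example come from the alignment shown earlier in the deck.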
TERMINOLOGY
Homology & analogy
The term homology is confounded & abused in the literature!
sequences are homologous if they’re related by divergence from a common ancestor
analogy relates to the acquisition of common features from unrelated ancestors via convergent evolution
e.g., β-barrels occur in soluble serine proteases & integral membrane porins; chymotrypsin & subtilisin share groups of catalytic residues, with near identical spatial geometries, but no other similarities
Homology is not a measure of similarity & is not quantifiable
it is an absolute statement that sequences have a divergent rather than a convergent relationship
the phrases "the level of homology is high" or "the sequences show 50% homology", or any like them, are strictly meaningless!
This is not just a semantic issue
loose use muddies thinking about evolutionary relationships
A terminology muddle
In comparing 3D structures, exactly the same arguments apply
structures may be similar, as denoted by RMS positional deviation between compared atomic positions
common evolutionary origin remains a hypothesis, until supported by other evidence
homology among similar structures is a hypothesis
This may be correct or mistaken, but their similarity is a fact, no matter how it is interpreted
Similarity of sequence or structure is just that - similarity
Homology connotes a common evolutionary origin
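The RMS positional deviation mentioned above is a plain measurement, which is part of why similarity is a fact while homology is an inference. A minimal sketch, assuming the two structures have already been superposed and their atoms paired in the same order (the superposition step itself is not shown); the function name and coordinates are illustrative.

```python
import math

Point = tuple[float, float, float]


def rmsd(coords_a: list[Point], coords_b: list[Point]) -> float:
    """RMS deviation between paired atoms of two already-superposed structures."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


if __name__ == "__main__":
    # Two hypothetical atom pairs; the second pair is displaced by 2 units along z.
    a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    b = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
    print(f"RMSD = {rmsd(a, b):.3f}")
```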
Classification of homologs
Orthologs – Two genes from two different species that derive from a single gene in the last common ancestor of the species.
Paralogs – Two genes that derive from a single gene that was duplicated within a genome.
Orthology & paralogy
Among homologous sequences we can distinguish
orthologues - largely perform the same function in different species
paralogues - perform different but related functions in one organism
Studying orthologues opens the way to molecular palaeontology
e.g., using phylogenetic trees to show cross-species relationships
Paralogues shed light on underlying evolutionary mechanisms
paralogous proteins are thought to have arisen from single genes via successive duplication events
duplicated genes follow separate evolutionary pathways & new specificities evolve through variation & adaptation
Such complexity presents real challenges for sequence analysis
Classification of homologs
Inparalogs - paralogs that evolved by gene duplication after the speciation event.
Outparalogs - paralogs that evolved by gene duplication before the speciation event.
Challenges for sequence analysis
Much of the challenge is in getting the biology right
complicated by orthology vs paralogy
Following a db search, it may be unclear how much functional annotation can be legitimately inherited by a query
source of numerous annotation errors in dbs
propagation could lead to an error catastrophe
Further complications result from the modular nature of proteins
modules are autonomous folding units, used as protein building blocks - like Lego bricks, they can confer a variety of functions on the parent protein, either by multiple combinations of the same module, or via different modules to form mosaics
Automatic systems don’t distinguish orthologues from paralogues & don’t consider the modular nature of proteins
Identifying evolutionary links between sequences is useful
this often implies a shared function
Arguably, prediction of function from sequence is of more immediate value than the prediction of structure
However, between distantly-related proteins, structure is more conserved than the underlying sequences
thus, some relationships are only apparent at the structural level
Such relationships can't be detected by even the most sensitive sequence comparison methods
there is thus a theoretical limit to the effectiveness of sequence analysis methods and a region of identity where sequence comparisons fail completely to detect structural similarity
What can we learn from them?
Orthologous proteins are evolutionary, and typically functional, counterparts in different species.
Paralogous proteins are important for detecting lineage-specific adaptations.
Both of them can reveal information on a specific species or a set of species.
Databases of protein domains
Prosite http://www.expasy.ch/prosite/
Pfam http://www.sanger.ac.uk/Software/Pfam/
Blocks http://www.blocks.fhcrc.org/
ProDom http://prodes.toulouse.inra.fr/prodom/doc/prodom.html
Prints http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/
Domo http://www.infobiogen.fr/services/domo/
InterPro http://www.ebi.ac.uk/interpro/
Smart http://smart.embl-heidelberg.de/
eMotif http://dna.stanford.edu/identify
Integrating Pattern Databases
MetaFam
IProClass
CDD
InterPro
Reference
Jinfeng Liu & Burkhard Rost. Domains, motifs, and clusters in the protein universe. Current Opinion in Chemical Biology, Vol 7 No 1, 2003