EMBL-EBI PATTERNS Kim Henrick. EMBL-EBI Terri Attwood School of Biological Sciences University of...

EMBL-EBI

PATTERNS

Kim Henrick

EMBL-EBI

Terri Attwood

School of Biological Sciences, University of Manchester, Oxford Road, Manchester M13 9PT, UK

http://www.bioinf.man.ac.uk/dbbrowser/

Motifs and domains

Motif: a simple combination of a few consecutive secondary structure elements with a specific geometric arrangement (e.g., helix-loop-helix). May have a specific biological function.

Domain: the fundamental unit of structure folding and evolution. It combines several secondary elements and motifs packed in a compact globular structure. A domain can fold independently into a stable 3D structure, and may have a specific function.

Domain family: proteins that share a domain (possibly in combination with other domains)

Protein family: proteins that have the same combination of domains

Profiles & Motifs are Useful

Helped identify active site of HIV protease

Helped identify SH2/SH3 class of STP’s

Helped identify important GTP oncoproteins

Helped identify hidden leucine zipper in HGA

Used to scan for lectin binding domains

Regularly used to predict T-cell epitopes

Domains are More Useful

Rules of Thumb

Sequence pattern-based motifs should be determined from no fewer than 5 multiply aligned sequences

A good degree of sequence divergence is needed. If “S” is the similarity (expressed as a fraction) and “N” is the no. of sequences, then 1 - S^N > 0.95

A good sequence pattern should have no fewer than 8 defined amino acid positions
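
As a rough illustration of the divergence rule of thumb, the sketch below reads the formula as 1 - S^N > 0.95, with S given as a fraction between 0 and 1. That reading is an interpretation of the slide, and the function names are hypothetical.

```python
def divergence_score(similarity: float, n_sequences: int) -> float:
    """Return 1 - S^N, where S is the similarity as a fraction (0..1)."""
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must be a fraction between 0 and 1")
    if n_sequences < 1:
        raise ValueError("need at least one sequence")
    return 1.0 - similarity ** n_sequences


def is_divergent_enough(similarity: float, n_sequences: int,
                        threshold: float = 0.95) -> bool:
    """Apply the rule of thumb: 1 - S^N must exceed the threshold."""
    return divergence_score(similarity, n_sequences) > threshold


if __name__ == "__main__":
    # Five sequences at 50% similarity pass; five at 90% do not.
    for s, n in [(0.5, 5), (0.9, 5)]:
        print(s, n, round(divergence_score(s, n), 5), is_divergent_enough(s, n))
```

Under this reading, highly similar sequences need a much larger alignment before the rule is satisfied.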

Representations of protein families

Regular expression

Position specific scoring matrices (profiles)

Hidden Markov Models

Probabilistic suffix trees

Sparse Markov transducers

Pattern recognition methods

These methods classify proteins into families

the basis of the methods is multiple sequence alignment

They depend on developing a representation of conserved elements of alignments that may be diagnostic of structure or function, whether from

homologous sequence families

sequences that share some structural/functional domains

Regular expressions/patterns

These are derived from single conserved regions, which are reduced to consensus expressions for db searches

they are minimal expressions, so sequence information is lost

the more divergent the sequences used, the more fuzzy & poorly discriminating the pattern becomes

Alignment        Pattern
GAVDFIALCDRYF    G-X-[IV]-[DE]-F-[IVL]-X2-C-[DE]-R-[FY]2
GPIDFVCFCERFY
GRVEFLNRCDRYY
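
To make the alignment-to-pattern example concrete, here is a minimal sketch that translates a PROSITE-style pattern into a Python regular expression and checks it against the three aligned sequences. The converter `prosite_to_regex` is a hypothetical helper, not part of any PROSITE tool, and it handles only the syntax seen in these slides: single residues, `X`, `[...]`, `{...}`, and repeat counts written as `X2` or `X(2)`.

```python
import re

# One pattern element: a residue, x/X, [allowed] or {forbidden},
# optionally followed by a count such as (2), (2,4) or a bare 2.
_ELEMENT = re.compile(
    r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\)|(\d+))?"
)


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        element = element.strip()
        m = _ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unrecognised pattern element: {element!r}")
        core, low, high, bare = m.groups()
        if core in ("x", "X"):
            regex = "."                         # any residue
        elif core.startswith("{"):
            regex = "[^" + core[1:-1] + "]"     # any residue except these
        else:
            regex = core                        # literal residue or [set]
        if bare:
            regex += "{" + bare + "}"
        elif low and high:
            regex += "{" + low + "," + high + "}"
        elif low:
            regex += "{" + low + "}"
        parts.append(regex)
    return "".join(parts)


if __name__ == "__main__":
    regex = prosite_to_regex("G-X-[IV]-[DE]-F-[IVL]-X2-C-[DE]-R-[FY]2")
    for seq in ("GAVDFIALCDRYF", "GPIDFVCFCERFY", "GRVEFLNRCDRYY"):
        print(seq, bool(re.fullmatch(regex, seq)))
```

All three aligned sequences match the derived pattern, while any sequence that differs at a fully defined position (for example, a non-G first residue) is rejected outright.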

Regular expressions/patterns

Patterns do not tolerate similarity

sequences either match or not, regardless of how similar they are

matching is a binary ‘on-off’ event & frequently misses true matches

single-motif methods are very hit-or-miss – how do you know if you've encoded the ‘best’ region?

PROSITE

G_PROTEIN_RECEPTOR; PATTERN
PS00237; G-protein coupled receptor signature
[GSTALIVMYWC]-[GSTANCPDE]-{EDPKRH}-X(2)-[LIVMNQGA]-X(2)-[LIVMFT]-[GSTANC]-[LIVMFYWSTAC]-[DENH]-R
/TOTAL=919(919); /POS=869(869); /FALSE_POS=50(50); /FALSE_NEG=70; /PARTIAL=49; UNKNOWN=0(0)

This represents an apparent 18% error rate

the actual rate is probably higher

Thus, a match to a pattern is not necessarily true & a mis-match is not necessarily false!

False-negatives are a fundamental limitation to this type of pattern matching

if you don't know what you're looking for, you'll never know you missed it!
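
The counts in the PS00237 entry can be turned into simple performance figures. The sketch below assumes that the quoted 18% comes from (false positives + false negatives + partial matches) / total. The slide does not say how the figure was derived, so this is only one plausible reading. The precision and recall definitions are the standard ones, and the function name is hypothetical.

```python
# Counts copied from the PS00237 entry quoted on the slide.
PS00237_COUNTS = {
    "total": 919,      # /TOTAL
    "true_pos": 869,   # /POS
    "false_pos": 50,   # /FALSE_POS
    "false_neg": 70,   # /FALSE_NEG
    "partial": 49,     # /PARTIAL
}


def pattern_stats(counts):
    """Return (precision, recall, apparent_error) for a pattern entry."""
    tp = counts["true_pos"]
    precision = tp / (tp + counts["false_pos"])
    recall = tp / (tp + counts["false_neg"])
    # Assumed reading of the slide's "apparent error rate".
    apparent_error = (
        counts["false_pos"] + counts["false_neg"] + counts["partial"]
    ) / counts["total"]
    return precision, recall, apparent_error


if __name__ == "__main__":
    precision, recall, error = pattern_stats(PS00237_COUNTS)
    print(f"precision={precision:.3f} recall={recall:.3f} error={error:.3f}")
```

With these counts the assumed error measure comes to 169/919, roughly 0.18, which is consistent with the figure on the slide.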

Regular expressions/rules

Regular expression patterns are most effective when applied to highly-conserved, family-specific motifs

It is often possible to identify shorter generic patterns that are characteristic of common functional sites

Functional site                     Rule
N-glycosylation                     N-{P}-[ST]-{P}
Protein kinase C phosphorylation    [ST]-X-[RK]
Casein kinase II phosphorylation    [ST]-X2-[DE]

Such features result from convergence to a common property

glycosylation sites, phosphorylation sites, etc.

They cannot be used for family diagnosis & don't discriminate

they can only be used to suggest whether a certain functional site might exist (which must then be tested by experiment)

such patterns are termed rules
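
The three rules can be written directly as regular expressions and scanned over a sequence. This is a minimal sketch: the regex translations are hand-written from the rules on the slide, `find_sites` is a hypothetical helper, and the test sequence is invented for illustration. A lookahead is used so that overlapping sites are all reported.

```python
import re

# Hand translation of the slide's rules into Python regex syntax.
RULES = {
    "N-glycosylation": "N[^P][ST][^P]",
    "Protein kinase C phosphorylation": "[ST].[RK]",
    "Casein kinase II phosphorylation": "[ST].{2}[DE]",
}


def find_sites(sequence):
    """Return {rule name: [(1-based start, matched residues), ...]}."""
    hits = {}
    for name, regex in RULES.items():
        # Zero-width lookahead so overlapping candidate sites are found.
        scanner = re.compile("(?=(" + regex + "))")
        hits[name] = [(m.start() + 1, m.group(1))
                      for m in scanner.finditer(sequence)]
    return hits


if __name__ == "__main__":
    made_up_sequence = "MKNGSATSVRLSGEEDK"  # invented, not a real protein
    for rule, sites in find_sites(made_up_sequence).items():
        print(rule, sites)
```

Each hit is only a candidate site. As the slide stresses, such short rules match by chance very often and say nothing about family membership.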

Diagnostic limitations of short motifs

Consider the sequence motif Asp-Ala-Val-Ile-Asp (DAVID)

results of db searching for such a sequence will differ, depending on whether we search for exact or permissive ‘fuzzy’ matches

Pattern                           Matches
D-A-V-I-D                         99
D-A-V-I-[DEQN]                    252
[DEQN]-A-V-I-[DEQN]               925
[DEQN]-A-[VLI]-I-[DEQN]           2,739
[DEQN]-[AG]-[VLI]-[VLI]-[DEQN]    51,506
D-A-V-E                           1,493
(number of matches in OWL31.1)

Use of fuzzy regular expressions has the potential advantage of being able to recognise more distant relationships

& the inherent disadvantage that more matches will be made by chance, making it difficult to separate out true matches from noise
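
A back-of-the-envelope calculation shows why fuzzier patterns attract more chance matches. The sketch below assumes all 20 amino acids are equally likely and independent at each position, which is a deliberate simplification (real residue frequencies are far from uniform). The function name is hypothetical, and it handles only single residues and `[...]` sets.

```python
from math import prod


def chance_match_probability(pattern: str, alphabet_size: int = 20) -> float:
    """Probability that a random window matches the pattern,
    assuming uniform, independent residues (a crude assumption)."""
    # Each element is either one residue ("D") or a bracketed set ("[DEQN]").
    set_sizes = [len(element.strip("[]")) for element in pattern.split("-")]
    return prod(size / alphabet_size for size in set_sizes)


if __name__ == "__main__":
    exact = chance_match_probability("D-A-V-I-D")
    fuzzy = chance_match_probability("[DEQN]-[AG]-[VLI]-[VLI]-[DEQN]")
    print(f"exact: {exact:.3e}")
    print(f"fuzzy: {fuzzy:.3e}")
    print(f"fuzzy/exact: {fuzzy / exact:.0f}")
```

Under the uniform assumption, the fuzziest pattern is 4 x 2 x 3 x 3 x 4 = 288 times more likely to match by chance than the exact motif. The observed counts on the slide rise from 99 to 51,506, a factor of about 520, so the uniform model captures the trend but not the exact size of the effect.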

Fingerprints

Fingerprints are groups of motifs excised from alignments & used for iterative db searching

no weighting scheme is used

searches depend only on residue frequencies

resulting scoring matrices are thus sparse

Each motif trawls the database independently

search results are correlated to determine which sequences match all the motifs & which match only partially

no information is thrown away

Iteration refines the fingerprint & increases its potency

fingerprints are diagnostically more powerful than regular expressions
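
The idea of an unweighted, frequency-only motif matrix can be sketched as follows. This is not the PRINTS implementation: the function names are hypothetical and the scoring (a plain sum of observed residue counts) is an assumption chosen to mirror "no weighting scheme" and "sparse" matrices. The motif reuses the three aligned sequences from the earlier regular-expression slide.

```python
from collections import Counter


def frequency_matrix(motif):
    """One Counter per motif column holding raw residue counts.
    Unobserved residues are simply absent, so the matrix is sparse."""
    width = len(motif[0])
    if any(len(seq) != width for seq in motif):
        raise ValueError("all motif sequences must have the same length")
    return [Counter(column) for column in zip(*motif)]


def score_window(matrix, window):
    """Sum the observed counts of the window's residues (0 if unseen)."""
    return sum(column[residue] for column, residue in zip(matrix, window))


def best_hit(matrix, sequence):
    """Return (score, 0-based start) of the best-scoring window."""
    width = len(matrix)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the motif")
    return max(
        (score_window(matrix, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


if __name__ == "__main__":
    motif = ["GAVDFIALCDRYF", "GPIDFVCFCERFY", "GRVEFLNRCDRYY"]
    matrix = frequency_matrix(motif)
    print(best_hit(matrix, "MMGAVDFIALCDRYFKK"))
```

A fingerprint would hold several such matrices, one per motif, scan with each independently, and then correlate which sequences are hit by all of them and which only by some.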

Profiles

Profiles are scoring tables derived from full alignments

these define which residues are allowed at given positions

which positions are conserved & which degenerate

which positions, or regions, can tolerate insertions

the scoring system is intricate, & may include evolutionary weights, results from structural studies, & data implicit in the alignment

variable penalties are specified to weight against INDELs occurring in core secondary structure elements

Profiles

Within a profile, fields contain position-specific scores for insert & match positions

in conserved regions, INDELs aren't totally forbidden, but are strongly impeded by large penalties defined in a DEFAULT field

these are superseded by more permissive values in gapped regions

the inherent complexity of profiles renders them highly potent discriminators, but they are time-consuming to derive
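
The two ingredients described for profiles, position-specific match scores and position-specific gap penalties with a strict default, can be sketched in a few lines. This is a simplified illustration, not the PROSITE profile format: it assumes a uniform background, add-one pseudocounts, and log-odds scores, and all names and numeric values are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Per-column log2-odds scores against a uniform background."""
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def gap_penalties(length, default=10.0, permissive=None):
    """Strict default penalty everywhere, overridden at positions
    (0-based) where the alignment tolerates insertions or deletions."""
    penalties = [default] * length
    for position, value in (permissive or {}).items():
        penalties[position] = value
    return penalties


if __name__ == "__main__":
    alignment = ["GAVDFIALCDRYF", "GPIDFVCFCERFY", "GRVEFLNRCDRYY"]
    profile = build_profile(alignment)
    print(round(profile[0]["G"], 3), round(profile[0]["A"], 3))
    print(gap_penalties(len(profile), permissive={6: 2.0, 7: 2.0}))
```

Conserved residues receive positive scores and unobserved ones negative scores, while the penalty list plays the role of the DEFAULT field being superseded by gentler values in variable regions.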

HMMs are similar in concept to profiles

they are probabilistic models consisting of interconnecting states

essentially, linear chains of match, delete or insert states

Match states are assigned to conserved columns in an alignment

Insert states allow for insertions relative to match states

Delete states allow match positions to be skipped

Thus, building an HMM requires each position in an alignment to be assigned to match, delete or insert states
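
The state assignment described above can be sketched as follows. The rule used here, treating a column as a match column when at most half of its entries are gaps, is a common heuristic and an assumption on my part. The slides do not say how columns are assigned, and the function names are hypothetical.

```python
def assign_columns(alignment, max_gap_fraction=0.5):
    """Label each alignment column 'M' (match) or 'I' (insert)
    according to the fraction of gap characters it contains."""
    states = []
    for column in zip(*alignment):
        gap_fraction = column.count("-") / len(column)
        states.append("M" if gap_fraction <= max_gap_fraction else "I")
    return states


def state_path(aligned_sequence, column_states):
    """State path of one aligned sequence: match (M), delete (D)
    where a match column is skipped, insert (I) for extra residues."""
    path = []
    for residue, state in zip(aligned_sequence, column_states):
        if state == "M":
            path.append("M" if residue != "-" else "D")
        elif residue != "-":
            path.append("I")  # gaps in insert columns emit nothing
    return path


if __name__ == "__main__":
    alignment = ["AC-DE", "AC-DE", "ACGDE", "A--D-"]  # toy alignment
    columns = assign_columns(alignment)
    print(columns)
    for seq in alignment:
        print(seq, "".join(state_path(seq, columns)))
```

Counting the transitions and emissions along these paths is what then provides the probabilities of the model.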

HMMs usually perform well, but can be over-trained

they may also suffer if created from automatic iterative processes

if it once accepts a false match, an HMM becomes corrupt

Probabilistic Suffix Trees

Identify short significant contiguous segments

Do not require multiple alignment

Induce a probability distribution on the next symbol to appear right after the segment (short term memory)

Variable memory length

More efficient than order L Markov chains

Longer memory length compared to first-order HMMs, and easier to learn
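
The variable-memory idea can be illustrated with a much-simplified stand-in for a probabilistic suffix tree: a dictionary of contexts up to a maximum length, each with next-symbol counts, where prediction uses the longest context that was seen often enough. Real PST learning uses statistical criteria to decide which contexts to keep. The count threshold here is an assumption, and all names are hypothetical.

```python
from collections import Counter, defaultdict


def train_contexts(sequences, max_length=3, min_count=2):
    """Count next symbols after every context of length 0..max_length,
    keeping contexts observed at least min_count times."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i, symbol in enumerate(seq):
            for k in range(max_length + 1):
                if i - k < 0:
                    break
                counts[seq[i - k:i]][symbol] += 1
    # The empty context is always kept as the fallback.
    return {ctx: c for ctx, c in counts.items()
            if ctx == "" or sum(c.values()) >= min_count}


def next_symbol_distribution(model, history):
    """Use the longest suffix of history stored in the model."""
    for start in range(len(history) + 1):
        context = history[start:]
        if context in model:
            counter = model[context]
            total = sum(counter.values())
            return context, {s: n / total for s, n in counter.items()}
    raise ValueError("model has no contexts (was it trained on data?)")


if __name__ == "__main__":
    model = train_contexts(["ABABAB", "ABAC"])  # toy alphabet
    print(next_symbol_distribution(model, "ABA"))
    print(next_symbol_distribution(model, "CCA"))
```

The memory length adapts to the data: a well-supported long context is used when available, and the model falls back to shorter contexts otherwise. No multiple alignment is needed at any point.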

Which method is best?

The range of methods available leads to familiar problems

which should we use?

which is the most reliable?

which is the most comprehensive?

None of the pattern-recognition techniques is infallible

each has its optimum area of application

None of the resulting pattern databases are complete

none is the best

bearing in mind the diagnostic strengths & weaknesses of the different approaches, & keeping biological significance in mind, the best strategy is to use them all

Pattern recognition & prediction

In investigating the meaning of sequences, 2 distinct analytical approaches have emerged

pattern recognition is used to detect similarity between sequences & hence to infer related structures & functions

ab initio prediction is used to deduce structure, & to infer function, directly from sequence

These methods are different & shouldn’t be confused!

Pattern recognition & prediction

Sequence- & structure-based pattern recognition methods demand that some characteristic has been seen before & housed in a db

Prediction methods remove the need for template dbs because deductions are made directly from sequence

fact & fiction

Sequence pattern recognition is easier to achieve, & is much more reliable, than fold recognition which is only ~40-50% reliable even in expert hands

Prediction is still not possible & is unlikely to be so for decades to come (if ever)

Structural genomics will yield representative structures for more proteins in future

structures of new sequences will be determined by modelling

prediction will become an academic exercise

But, to debunk a popular myth, knowing structure alone does not inherently tell us function

fact & fiction

Prediction methods don’t work because we don’t fully understand the Folding Problem

we can’t read the language sequences use to create their folds

But, with sequence analysis techniques, we can try to find similarities between new sequences & those in dbs

whose structures & functions we hope have been elucidated

This is straightforward at high levels of identity, but below 50% it is difficult to establish relationship reliably

Analyses can be pursued with decreasing certainty down to ~20% identity, where results may look plausible to the eye, but are no longer statistically significant
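
The identity thresholds mentioned here presuppose a way of measuring percent identity between two aligned sequences. The sketch below uses one common definition, identical pairs divided by the number of aligned positions where neither sequence has a gap. Other denominators are in use, so the choice is an assumption. The zone labels simply restate the slide's rough 50% and ~20% marks, and the function names are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned, gap-free positions."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


def identity_zone(identity: float) -> str:
    """Rough interpretation bands taken from the slide's thresholds."""
    if identity >= 50.0:
        return "relationship usually straightforward to establish"
    if identity > 20.0:
        return "relationship increasingly uncertain"
    return "may look plausible, but not statistically significant"


if __name__ == "__main__":
    a, b = "GAVDFIALCDRYF", "GPIDFVCFCERFY"
    identity = percent_identity(a, b)
    print(f"{identity:.1f}% -> {identity_zone(identity)}")
```

Percent identity on its own is a blunt measure. Alignment length and statistical significance matter as well, which is why the bands above are only indicative.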

TERMINOLOGY

Homology & analogy

The term homology is confounded & abused in the literature!

sequences are homologous if they’re related by divergence from a common ancestor

analogy relates to the acquisition of common features from unrelated ancestors via convergent evolution

e.g., β-barrels occur in soluble serine proteases & integral membrane porins; chymotrypsin & subtilisin share groups of catalytic residues, with near identical spatial geometries, but no other similarities

Homology & analogy

Homology is not a measure of similarity & is not quantifiable

it is an absolute statement that sequences have a divergent rather than a convergent relationship

the phrases "the level of homology is high" or "the sequences show 50% homology", or any like them, are strictly meaningless!

This is not just a semantic issue

loose use muddies thinking about evolutionary relationships

A terminology muddle

In comparing 3D structures, exactly the same arguments apply

structures may be similar, as denoted by RMS positional deviation between compared atomic positions

common evolutionary origin remains a hypothesis, until supported by other evidence

homology among similar structures is a hypothesis

This may be correct or mistaken, but their similarity is a fact, no matter how it is interpreted

Similarity of sequence or structure is just that - similarity

Homology connotes a common evolutionary origin
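
The RMS positional deviation used to quantify structural similarity is straightforward to compute once two structures have equivalent atoms paired up. The sketch below assumes the coordinates are already superimposed and listed in matching order. It performs no optimal superposition, which a real comparison would need first. The function name and the coordinates are invented for illustration.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between paired (x, y, z) positions.
    Assumes the two coordinate sets are already superimposed."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


if __name__ == "__main__":
    # Toy coordinates in arbitrary units, not taken from any structure.
    a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    b = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
    print(round(rmsd(a, b), 4))
```

A low value is a statement of similarity only. Whether the two structures are homologous remains a separate hypothesis.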

Classification of homologs

Orthologs – Two genes from two different species that derive from a single gene in the last common ancestor of the species.

Paralogs – Two genes that derive from a single gene that was duplicated within a genome.

Orthology & paralogy

Among homologous sequences we can distinguish

orthologues - largely perform the same function in different species

paralogues - perform different but related functions in one organism

Orthology & paralogy

Studying orthologues opens the way to molecular palaeontology

e.g., using phylogenetic trees to show cross-species relationships

Paralogues shed light on underlying evolutionary mechanisms

paralogous proteins are thought to have arisen from single genes via successive duplication events

duplicated genes follow separate evolutionary pathways & new specificities evolve through variation & adaptation

Such complexity presents real challenges for sequence analysis

Classification of homologs

Inparalogs - paralogs that evolved by gene duplication after the speciation event.

Outparalogs - paralogs that evolved by gene duplication before the speciation event.

Challenges for sequence analysis

Much of the challenge is in getting the biology right

complicated by orthology vs paralogy

Following a db search, it may be unclear how much functional annotation can be legitimately inherited by a query

source of numerous annotation errors in dbs

propagation could lead to an error catastrophe

Further complications result from the modular nature of proteins

modules are autonomous folding units, used as protein building blocks - like Lego bricks, they can confer a variety of functions on the parent protein, either by multiple combinations of the same module, or via different modules to form mosaics

Automatic systems don’t distinguish orthologues from paralogues & don’t consider the modular nature of proteins

Identifying evolutionary links between sequences is useful

this often implies a shared function

Arguably, prediction of function from sequence is of more immediate value than the prediction of structure

However, between distantly-related proteins, structure is more conserved than the underlying sequences

thus, some relationships are only apparent at the structural level

Such relationships can't be detected by even the most sensitive sequence comparison methods

there is thus a theoretical limit to the effectiveness of sequence analysis methods and a region of identity where sequence comparisons fail completely to detect structural similarity

What can we learn from them?

Orthologous proteins are evolutionary, and typically functional, counterparts in different species.

Paralogous proteins are important for detecting lineage-specific adaptations.

Both of them can reveal information on a specific species or a set of species.

Databases of protein domains

Prosite http://www.expasy.ch/prosite/

Pfam http://www.sanger.ac.uk/Software/Pfam/

Blocks http://www.blocks.fhcrc.org/

ProDom http://prodes.toulouse.inra.fr/prodom/doc/prodom.html

Prints http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/

Domo http://www.infobiogen.fr/services/domo/

InterPro http://www.ebi.ac.uk/interpro/

Smart http://smart.embl-heidelberg.de/

eMotif http://dna.stanford.edu/identify

Integrating Pattern Databases

MetaFam

IProClass

InterPro

Reference

Jinfeng Liu & Burkhard Rost. Domains, motifs, and clusters in the protein universe. Current Opinion in Chemical Biology, Vol 7 No 1 2003
