
Centre for Integrative Bioinformatics VU (IBIVU)

Date post: 23-Jan-2016
Page 1: Centre for  Integrative Bioinformatics VU (IBIVU)

Bioinformatics master course

DNA/Protein structure-function analysis and prediction

Lecture 5: Protein Fold Families

Centre for Integrative Bioinformatics VU (IBIVU)

Faculty of Sciences / Faculty of Earth & Life Sciences

Page 2: Centre for  Integrative Bioinformatics VU (IBIVU)

Protein structure evolution

Insertion/deletion of secondary structural elements can ‘easily’ be done at loop sites

Page 3: Centre for  Integrative Bioinformatics VU (IBIVU)

Protein structure evolution

Insertion/deletion of structural domains can ‘easily’ be done at loop sites

[Figure labels: N (N-terminus), C (C-terminus)]

Page 4: Centre for  Integrative Bioinformatics VU (IBIVU)

Fold classification

Four broad structural protein fold classes:

• all-α

• all-β

• α/β (α mixed with β)

• α+β (separated α and β regions)

Page 5: Centre for  Integrative Bioinformatics VU (IBIVU)

The first protein structure in 1960: Myoglobin - α fold

Page 6: Centre for  Integrative Bioinformatics VU (IBIVU)

There are a number of examples of small proteins (or peptides) that consist of little more than a single helix. A striking example is alamethicin, a transmembrane voltage-gated ion channel that acts as a peptide antibiotic.

Page 7: Centre for  Integrative Bioinformatics VU (IBIVU)

Tropomyosin

Coiled-coil domains

This long protein is involved in muscle contraction

Page 8: Centre for  Integrative Bioinformatics VU (IBIVU)

The interface areas of two packed helices should have complementary surfaces. The α-helix surface can be thought of as consisting of grooves and ridges, like a screw thread: for instance, the side chains of every 4th residue form an “i+4” ridge (because there are 3.6 residues per turn). The direction of this ridge is 26° from the direction of the helix axis. Therefore, if two helices pack such that a ridge from each fits into the other's groove, the expected angle between them is 52°. Indeed, the observed distribution of this angle between packed α-helices has a sharp peak at 50°. Ridges can also be formed by other stacking patterns of residues, such as every 3rd residue, or indeed every residue. The “i+4” ridge is believed to be the most common because residues at every 4th position have side chains that are more closely aligned than in “i+3” or “i+1” ridges, as indicated below.

Alpha-helix interaction

http://swissmodel.expasy.org/course/text/chapter4.htm
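The ridge geometry on this slide can be checked with a small calculation. The sketch below is an illustration only and is not taken from the slides: it unrolls an ideal α-helix onto a flat cylinder surface, assuming 3.6 residues per turn, a rise of 1.5 Å per residue, and side-chain positions at a radius of 4.2 Å. The radius is an assumed value, picked so that the i+4 ridge comes out near the quoted 26°, and `ridge_angle_deg` is a hypothetical helper name.

```python
import math

RESIDUES_PER_TURN = 3.6   # from the slide text
RISE_PER_RESIDUE = 1.5    # Å per residue along the axis (assumed ideal-helix value)
SIDE_CHAIN_RADIUS = 4.2   # Å, ASSUMPTION chosen to reproduce the quoted 26°


def ridge_angle_deg(step: int) -> float:
    """Angle (degrees) between the helix axis and the ridge joining residues i and i+step."""
    turn_deg = 360.0 / RESIDUES_PER_TURN * step
    # Smallest signed rotation about the axis between the two residues
    offset_deg = (turn_deg + 180.0) % 360.0 - 180.0
    arc = SIDE_CHAIN_RADIUS * math.radians(abs(offset_deg))  # distance around the cylinder
    rise = RISE_PER_RESIDUE * step                           # distance along the axis
    return math.degrees(math.atan2(arc, rise))


for step in (4, 3, 1):
    print(f"i+{step} ridge: {ridge_angle_deg(step):.1f} degrees from the helix axis")

# Two helices packing ridge-into-groove with i+4 ridges cross at twice the ridge angle
print(f"expected packing angle: {2 * ridge_angle_deg(4):.1f} degrees")
```

Under these assumptions the i+4 ridge lies roughly 26° from the axis and the i+3 ridge roughly 44°, so the doubled i+4 angle lands close to the 52° mentioned above. A different assumed radius shifts all of these values.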

Page 9: Centre for  Integrative Bioinformatics VU (IBIVU)

Here is a diagram of Interleukin-2, human Growth Hormone, Granulocyte-macrophage colony-stimulating factor (GM-CSF) and Interleukin-4.

Helix-turn-helix and 4-helix bundles

Page 10: Centre for  Integrative Bioinformatics VU (IBIVU)

Beta-proteins

Page 11: Centre for  Integrative Bioinformatics VU (IBIVU)

porin

Beta-sheet structures

Page 12: Centre for  Integrative Bioinformatics VU (IBIVU)

Greek key β-strand motif

Page 13: Centre for  Integrative Bioinformatics VU (IBIVU)

Greek key β-strand motif

Structure: gamma-crystallin

Page 14: Centre for  Integrative Bioinformatics VU (IBIVU)

5(βα) fold

Flavodoxin fold

α/β fold

Page 15: Centre for  Integrative Bioinformatics VU (IBIVU)

α/β fold

Flavodoxin family - TOPS diagrams (Flores et al., 1994)

[Figure: TOPS diagrams, elements numbered 1-5]

Page 16: Centre for  Integrative Bioinformatics VU (IBIVU)

Beta-alpha-beta structures

Page 17: Centre for  Integrative Bioinformatics VU (IBIVU)

Alpha-beta barrel

Page 18: Centre for  Integrative Bioinformatics VU (IBIVU)

Plait motif

Page 19: Centre for  Integrative Bioinformatics VU (IBIVU)

3-layer motifs (2 layers of helices with a β-sheet in between) are often specified as x-y-z (e.g. 4-14-5), where x is the number of helices in the first helical layer, y is the number of strands in the β-sheet, and z is the number of helices in the second helical layer.
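As a small illustration of the x-y-z notation, the sketch below parses such a string into named counts. The names `LayerSpec` and `parse_layer_spec` are hypothetical and not part of any classification tool.

```python
from typing import NamedTuple


class LayerSpec(NamedTuple):
    helices_first: int   # x: helices in the first helical layer
    strands: int         # y: strands in the beta-sheet
    helices_second: int  # z: helices in the second helical layer


def parse_layer_spec(spec: str) -> LayerSpec:
    """Parse an 'x-y-z' string such as '4-14-5' into its three counts."""
    parts = spec.strip().split("-")
    if len(parts) != 3:
        raise ValueError(f"expected three dash-separated integers, got {spec!r}")
    x, y, z = (int(p) for p in parts)
    return LayerSpec(x, y, z)


print(parse_layer_spec("4-14-5"))
```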

Page 20: Centre for  Integrative Bioinformatics VU (IBIVU)

For proteins, there are no good classification systems. You can only count…

Page 21: Centre for  Integrative Bioinformatics VU (IBIVU)

How many folds – Chothia 1992

The first explicit estimate of the number of protein families was made by Chothia in 1992. At that time about 120 structural families were known. Chothia summarized the results of several genome projects and found that the chance that a random protein belongs to one of the known sequence families is approximately 1/3. According to a sequence comparison of the PDB with sequence databases (Sander, Schneider 1991), about 1/4 of all sequences were similar to one of the PDB entries at the 25% identity level. Assuming an equal distribution of proteins among the families, Chothia concluded that the total number of protein structural families should be about 120 × 3 × 4 = 1440.
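The arithmetic behind this estimate can be written out explicitly. The sketch below only reproduces the scaling argument on this slide; the function and parameter names are hypothetical.

```python
from fractions import Fraction


def estimate_total_families(known_structural_families: int,
                            frac_in_known_seq_families: Fraction,
                            frac_similar_to_pdb: Fraction) -> Fraction:
    """Scale the known structural families up by the two coverage fractions."""
    covered = frac_in_known_seq_families * frac_similar_to_pdb  # share already seen
    return known_structural_families / covered


estimate = estimate_total_families(120, Fraction(1, 3), Fraction(1, 4))
print(estimate)  # 120 * 3 * 4
```

The result depends entirely on the equal-distribution assumption stated on the slide.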

Page 22: Centre for  Integrative Bioinformatics VU (IBIVU)

How many folds – Alexandrov & Go, 1994, updated

The Pfam-2.1 database consists of 101,724 protein domains from SwissProt (Bairoch & R., 1996) release 34, clustered in 13,816 families. SwissProt-34 also contained 7,694 proteins of 30 or more amino acids that are not present in Pfam and are not similar to other proteins. We added them to the database, which now contains 109,418 domains in 21,510 families. To make the database more homogeneous, we eliminated very similar sequences. The final classification contained 60,601 domains distributed over 21,510 families. All families were ranked by the number of sequences in each family. The resulting distribution fits Zipf’s law well (http://wwww.bionet.nsc.ru/bgrs/thesis/100/)

Page 23: Centre for  Integrative Bioinformatics VU (IBIVU)

How many folds

$n(r) = a r^{-b}$

Here r is the rank of the family, n(r) is the number of proteins in the r-th family, a is a scaling constant that depends on the number of proteins in the dataset, and b ≈ 0.64. The constant b does not depend on the size of the dataset.

Page 24: Centre for  Integrative Bioinformatics VU (IBIVU)

How many folds (cont.)

Distribution of protein sequences among protein families. The distribution is clearly far from uniform. Its shape is described very well by Zipf’s law:

$n(r) = a r^{-b}$, with a = 640 and b = 0.64. The correlation coefficient of this approximation is 0.992.
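To make the power law concrete, the sketch below evaluates n(r) = a·r^(−b) with the quoted a = 640 and b = 0.64, and shows one common way to estimate a and b: a least-squares line through log n against log r. How the original fit was done is not stated on the slides, so the fitting procedure here is an assumption, and the data are synthetic values generated from the law itself.

```python
import math


def zipf_size(rank: int, a: float = 640.0, b: float = 0.64) -> float:
    """Family size predicted by n(r) = a * r**(-b) for the family of the given rank."""
    return a * rank ** (-b)


def fit_zipf(sizes):
    """Fit log n = log a - b * log r by least squares; sizes[0] belongs to rank 1.

    Returns (a, b, corr), where corr is the Pearson correlation between
    log r and log n (negative, because sizes fall as rank grows).
    """
    xs = [math.log(rank) for rank in range(1, len(sizes) + 1)]
    ys = [math.log(n) for n in sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope, sxy / math.sqrt(sxx * syy)


# Synthetic family sizes (not real Pfam data), so the fit should recover a and b.
synthetic = [zipf_size(r) for r in range(1, 1001)]
a_fit, b_fit, corr = fit_zipf(synthetic)
print(f"a = {a_fit:.1f}, b = {b_fit:.2f}")
```

On real family sizes the points scatter around the line, which is what a correlation coefficient below 1 reflects.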

Page 25: Centre for  Integrative Bioinformatics VU (IBIVU)

Fold number according to Alexandrov & Go

60,000 protein sequence families in 14,000 different folds

Page 26: Centre for  Integrative Bioinformatics VU (IBIVU)

Fold number according to Alexandrov & Go

An important feature of Zipf’s distribution is its very long tail of clusters with only a few members each. For example, if b = 0.7, half of all proteins are located in 10% of all clusters.
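The "half of all proteins in 10% of clusters" statement can be checked numerically. The sketch below sums cluster sizes proportional to r^(−b) and reports the share held by the largest clusters. The cluster count of 21,510 is borrowed from the family count quoted earlier, purely as an illustrative choice, and the function name is hypothetical.

```python
def fraction_in_top_clusters(n_clusters: int, b: float, top_fraction: float) -> float:
    """Share of all proteins held by the largest `top_fraction` of clusters when n(r) ~ r**(-b)."""
    sizes = [rank ** (-b) for rank in range(1, n_clusters + 1)]  # scaling constant a cancels
    n_top = max(1, round(n_clusters * top_fraction))
    return sum(sizes[:n_top]) / sum(sizes)


share = fraction_in_top_clusters(n_clusters=21_510, b=0.7, top_fraction=0.10)
print(f"share of proteins in the top 10% of clusters: {share:.2f}")
```

For a large number of clusters the share approaches top_fraction^(1−b), which for 0.1^0.3 is close to one half, in line with the slide.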

Page 27: Centre for  Integrative Bioinformatics VU (IBIVU)

General fold classification systems

The definitions of four broad structural classes, all-α, all-β, α/β, and α+β, based on secondary structure compositions and β-sheet topologies [Levitt & Chothia, 1976] represented the first step towards a global characterization of the protein fold space. These definitions have been generally accepted and are being used by many classification systems to organize the fold hierarchy [Murzin et al., 1995; Orengo et al., 1997]. However, there is a need for methods to represent the full range of structural relationships among folds for a better understanding of the organizing principles and features of the protein fold space.

Page 28: Centre for  Integrative Bioinformatics VU (IBIVU)

General fold classification systems (cont.)

The fold family trees such as those built by Efimov [1997], Zhang and Kim [2000] and Taylor [2002] are very informative, but the construction of such trees involves extensive manual operations and, sometimes, considerable human judgment. An alternative approach is to apply a uniform measure of structural similarity across all fold types and map the structural relationships into a low-dimensional space. Two such maps have been introduced: one is represented in the CATH database by Orengo and colleagues [1997] and the other in the DALI database by Holm and Sander [1993]. Although the two maps are based on different structural alignment algorithms and multivariate analysis methods, they give similar two-dimensional projections featuring three large clusters corresponding to α, β, and α/β folds, respectively.

Page 29: Centre for  Integrative Bioinformatics VU (IBIVU)

General fold classification system references

Levitt, M. and C. Chothia, Structural patterns in globular proteins. Nature, 1976. 261(5561): p. 552-8.

Murzin, A.G., et al., SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol, 1995. 247(4): p. 536-40.

Orengo, C.A., et al., CATH--a hierarchic classification of protein domain structures. Structure, 1997. 5(8): p. 1093-108.

Taylor, W.R., A 'periodic table' for protein structures. Nature, 2002. 416(6881): p. 657-60.

Orengo, C.A., et al., Identification and classification of protein fold families. Protein Eng, 1993. 6(5): p. 485-500.

Page 30: Centre for  Integrative Bioinformatics VU (IBIVU)

General fold classification system references (cont.)

Efimov, A.V., Structural trees for protein superfamilies. Proteins, 1997. 28(2): p. 241-60.

Zhang, C. and S.H. Kim, A comprehensive analysis of the Greek key motifs in protein beta-barrels and beta-sandwiches. Proteins, 2000. 40(3): p. 409-19.

Holm, L. and C. Sander, Protein structure comparison by alignment of distance matrices. J Mol Biol, 1993. 233(1): p. 123-38.

Page 31: Centre for  Integrative Bioinformatics VU (IBIVU)

Fold distribution

The metric matrix distance geometry method was applied to all pair-wise “distances” (structural dissimilarities) to assign three-dimensional coordinates to a set of 498 SCOP folds, such that the relative distance between two folds is inversely correlated with their DALI alignment score. The results of the mapping are shown in the figure on the left.
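The metric matrix approach can be sketched in a few lines. The code below is a minimal pure-Python version of classical multidimensional scaling: it builds the metric (Gram) matrix from squared distances by double centering and extracts the leading eigenvectors by power iteration with deflation. It takes a ready-made distance matrix as input; how DALI scores were converted into distances is not stated on the slide and is not modelled here. The function name and the power-iteration solver are choices made for this illustration.

```python
import math


def classical_mds(dist, n_dims=3, n_iter=500):
    """Embed points in n_dims dimensions from a symmetric distance matrix (list of lists).

    Assumes the leading eigenvalues of the metric matrix are positive,
    which holds for Euclidean distances.
    """
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    # Metric matrix by double centering of the squared distances
    gram = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
             for j in range(n)] for i in range(n)]

    def matvec(vec):
        return [sum(g * v for g, v in zip(row, vec)) for row in gram]

    coords = [[0.0] * n_dims for _ in range(n)]
    eigenvalues = []
    start_norm = math.sqrt(sum((i + 1) ** 2 for i in range(n)))
    for k in range(n_dims):
        vec = [(i + 1) / start_norm for i in range(n)]  # deterministic start vector
        for _ in range(n_iter):                          # power iteration
            nxt = matvec(vec)
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm < 1e-12:                             # nothing left to extract
                break
            vec = [x / norm for x in nxt]
        val = sum(v * w for v, w in zip(vec, matvec(vec)))  # Rayleigh quotient
        eigenvalues.append(val)
        scale = math.sqrt(val) if val > 1e-9 else 0.0
        for i in range(n):
            coords[i][k] = scale * vec[i]
        for i in range(n):                               # deflate the found component
            for j in range(n):
                gram[i][j] -= val * vec[i] * vec[j]
    return coords, eigenvalues
```

Calling `classical_mds(dist, n_dims=3)` on a 498×498 dissimilarity matrix would return one coordinate triple per fold plus the three leading eigenvalues; a production analysis would more likely use a linear-algebra library for the eigendecomposition.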

Page 32: Centre for  Integrative Bioinformatics VU (IBIVU)

The first 20 eigenvalues of the metric matrix calculated from the 498x498 DALI structural alignment scores.

Page 33: Centre for  Integrative Bioinformatics VU (IBIVU)

Plotting the first 3 eigenvectors; i.e., the eigenvectors corresponding to the three largest eigenvalues. Again, notice the segregation of the four main structural classes.

Page 34: Centre for  Integrative Bioinformatics VU (IBIVU)

The same as the preceding slide, but from another angle…

Page 35: Centre for  Integrative Bioinformatics VU (IBIVU)

Comparing fold usage between two species in the eubacterial domain (Chlamydia versus Aquifex, A) and between those of two different domains (Chlamydia of bacteria versus Halobacterium of archaea, B). The usages of the 498 folds by the second organism are subtracted from the fold usages by the first organism. A contour surface (mesh) is then constructed and set at the values of 0.4% for blue and –0.4% for red. Regions within the blue contour include folds that appear more frequently in the first organism, whereas regions within the red contour include folds that occur more frequently in the second organism.
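The subtraction step described above can be illustrated per fold. The sketch below converts fold counts into usage percentages, subtracts the second organism from the first, and labels each fold with the ±0.4% levels. This is a simplification: the slide draws contour surfaces over the three-dimensional fold map, whereas the code thresholds each fold separately. The counts and fold names are invented for illustration.

```python
def fold_usage(counts):
    """Convert a fold -> count mapping into percentages of the organism's total."""
    total = sum(counts.values())
    return {fold: 100.0 * n / total for fold, n in counts.items()}


def usage_difference(first, second):
    """Usage in the first organism minus usage in the second, in percentage points."""
    usage_a, usage_b = fold_usage(first), fold_usage(second)
    folds = sorted(set(usage_a) | set(usage_b))
    return {f: usage_a.get(f, 0.0) - usage_b.get(f, 0.0) for f in folds}


def classify(diff, threshold=0.4):
    """Label folds: 'first' (blue, above +threshold), 'second' (red, below -threshold), else 'similar'."""
    labels = {}
    for fold, value in diff.items():
        if value > threshold:
            labels[fold] = "first"
        elif value < -threshold:
            labels[fold] = "second"
        else:
            labels[fold] = "similar"
    return labels


# Hypothetical counts, not real genome data
organism_1 = {"fold_A": 60, "fold_B": 30, "fold_C": 10}
organism_2 = {"fold_A": 50, "fold_B": 30, "fold_C": 20}
labels = classify(usage_difference(organism_1, organism_2))
print(labels)
```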

Page 36: Centre for  Integrative Bioinformatics VU (IBIVU)

CATH database

Class

Architecture

Topology

Homologous superfamily

Page 37: Centre for  Integrative Bioinformatics VU (IBIVU)

CATH database

Page 38: Centre for  Integrative Bioinformatics VU (IBIVU)

Structural Classification of Proteins (SCOP) database

1. All alpha proteins

2. All beta proteins

3. Alpha and beta proteins (a/b) - Mainly parallel beta sheets (beta-alpha-beta units)

4. Alpha and beta proteins (a+b) - Mainly antiparallel beta sheets (segregated alpha and beta regions)

5. Multi-domain proteins (alpha and beta) - Folds consisting of two or more domains belonging to different classes

6. Membrane and cell surface proteins and peptides - Does not include proteins in the immune system

Page 39: Centre for  Integrative Bioinformatics VU (IBIVU)

Structural Classification of Proteins (SCOP) database (cont.)

7. Small proteins - Usually dominated by metal ligand, heme, and/or disulfide bridges

8. Coiled coil proteins - Not a true class

9. Low resolution structures - Not a true class

10. Peptides - Peptides and fragments. Not a true class

11. Designed proteins - Experimental structures of proteins with essentially non-natural sequences. Not a true class

Page 40: Centre for  Integrative Bioinformatics VU (IBIVU)

SCOP

• Gold standard of protein classification

• In essence, the work of a single man (Alexei Murzin)

• The classification has been constructed manually by visual inspection and comparison of structures, but with the assistance of tools to make the task manageable and help provide generality.

Page 41: Centre for  Integrative Bioinformatics VU (IBIVU)

SCOP

The different major levels in the hierarchy are:

1. Family: Clear evolutionary relationship

Proteins clustered together into families are clearly evolutionarily related. Generally, this means that pairwise residue identities between the proteins are 30% or greater. However, in some cases similar functions and structures provide definitive evidence of common descent in the absence of high sequence identity; for example, many globins form a family though some members have sequence identities of only 15%.
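The 30% figure refers to pairwise residue identity between aligned sequences. The sketch below shows one way to compute it from two already-aligned sequences. Several conventions exist for the denominator; this version divides by the number of aligned (non-gap) columns, which is an assumption, as are the function name and the toy sequences.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


# Toy alignment: 6 aligned columns, 5 of them identical
identity = percent_identity("ACDE-FG", "ACQEYFG")
print(f"{identity:.1f}% identity")
```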

Page 42: Centre for  Integrative Bioinformatics VU (IBIVU)

SCOP

The different major levels in the hierarchy are:

2. Superfamily: Probable common evolutionary origin

Proteins that have low sequence identities, but whose structural and functional features suggest that a common evolutionary origin is probable, are placed together in superfamilies. For example, actin, the ATPase domain of the heat shock protein, and hexokinase together form a superfamily.

Page 43: Centre for  Integrative Bioinformatics VU (IBIVU)

SCOP

The different major levels in the hierarchy are:

3. Fold: Major structural similarity

Proteins are defined as having a common fold if they have the same major secondary structures in the same arrangement and with the same topological connections. Different proteins with the same fold often have peripheral elements of secondary structure and turn regions that differ in size and conformation. In some cases, these differing peripheral regions may comprise half the structure. Proteins placed together in the same fold category may not have a common evolutionary origin: the structural similarities could arise just from the physics and chemistry of proteins favouring certain packing arrangements and chain topologies.

Page 44: Centre for  Integrative Bioinformatics VU (IBIVU)

DALI database

• Based upon the DALI method for structural superpositioning. The programme optimises the overlay of distance plots (see next slide)

• Fully automatic

• Database contains clusters of protein families (e.g. a giant PDB structures tree) and structural alignments

• Database is consistent, but grouping is not done manually by experts

Page 45: Centre for  Integrative Bioinformatics VU (IBIVU)

DALI database

Contact Maps

Figures (c) and (d)

Fig (c): contact map of ROP (lower) and 256B (upper triangle). Fig (d): ‘Collapsed’ ROP (lower) and difference contact plot (upper triangle)
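A contact map of the kind shown in these figures is derived from the matrix of pairwise residue distances within one structure. The sketch below computes such a distance matrix from Cα coordinates and thresholds it into a contact map. The 8 Å cutoff and the toy coordinates are assumptions made for illustration, and the code does not implement the DALI alignment itself.

```python
import math


def distance_matrix(ca_coords):
    """Pairwise Euclidean distances between C-alpha positions given as (x, y, z) tuples."""
    return [[math.dist(p, q) for q in ca_coords] for p in ca_coords]


def contact_map(ca_coords, cutoff=8.0):
    """Boolean matrix: True where two residues lie within `cutoff` Å of each other."""
    return [[d <= cutoff for d in row] for row in distance_matrix(ca_coords)]


# Toy chain: four residues spaced 3.8 Å apart along a straight line
toy_chain = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
contacts = contact_map(toy_chain)
for row in contacts:
    print("".join("#" if c else "." for c in row))
```

Plotting one protein in the lower triangle and another in the upper triangle, as in Fig (c), only requires combining two such matrices.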

Page 46: Centre for  Integrative Bioinformatics VU (IBIVU)

PROTOMAP database (Linial et al.)

• Number of proteins in DB (May 2000) is 365,174 (341,645 after merging identical entries), number of clusters is 18,140, number of singletons is 43,219 (of which 14,384 are satellites of other clusters)

•Provides software to group new protein sequences

•Fully automatic

•Classifies UniProt + TrEMBL (translated EMBL) databases

Page 47: Centre for  Integrative Bioinformatics VU (IBIVU)

Folds: how many?

• Chothia (1992) – appr. 1,000 folds

• Estimates vary from 1,000 – 15,000

• With 30,000 human genes, ≥3 genes per fold on average (but think about alternative splicing)

Chothia, C., Proteins. One thousand families for the molecular biologist. Nature, 1992. 357(6379): p. 543-4.

Zhang, C. and C. DeLisi, Estimating the number of protein folds. J Mol Biol, 1998. 284(5): p. 1301-5.

