Protein Structure

Michael Schroeder, BioTechnological Center (Biotec), TU Dresden

Reading: Lesk, chapter 5

Details on SCOP and CATH can be found in Structural Bioinformatics, Bourne/Weissig, chapters 12 and 13

Folding

Proteins are linear polymer main chains with different amino acid side chains

Proteins fold spontaneously, reaching a state of minimal energy

Side and main chains interact with one another and with solvent

Example movie

Jones, D.T. (1997) Successful ab initio prediction of the tertiary structure of NK-Lysin using multiple sequences and recognized supersecondary structural motifs. PROTEINS. Suppl. 1, 185-191


Examining Proteins

Specialised tools with different views of structure

Corey, Pauling, Koltun (CPK): diameter of sphere ~ atomic radius

Hydrogen white, carbon grey, nitrogen blue, oxygen red, sulphur yellow

Views: cartoon, wire, balls


Examining Proteins


Protein Folding

Residue

Image taken from www.expasy.org/swissmod/course

Conformation of residue

Rotation around N-Cα bond (φ, phi)

Rotation around Cα-C bond (ψ, psi)

Rotation around peptide bond (ω, omega)

Peptide bond tends to be planar and in one of two states: trans, ω = 180° (usually), and cis, ω = 0° (rarely, and mostly proline)
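The rotation angles above are torsion (dihedral) angles defined by four consecutive atoms. The following is a minimal sketch of how such an angle can be computed from coordinates; the function name `dihedral` and the example points are illustrative assumptions, not part of the lecture.

```python
import math

def sub(u, v):
    return tuple(a - b for a, b in zip(u, v))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def dihedral(p0, p1, p2, p3):
    """Signed torsion angle in degrees for rotation about the p1-p2 bond."""
    b1 = sub(p1, p0)
    b2 = sub(p2, p1)
    b3 = sub(p3, p2)
    n1 = cross(b1, b2)          # normal of the plane through p0, p1, p2
    n2 = cross(b2, b3)          # normal of the plane through p1, p2, p3
    b2_len = math.sqrt(dot(b2, b2))
    y = b2_len * dot(b1, n2)
    x = dot(n1, n2)
    return math.degrees(math.atan2(y, x))

# Hypothetical planar examples: a zigzag (trans) and a U shape (cis).
trans = dihedral((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, -1, 0))
cis = dihedral((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, 1, 0))
print(abs(trans), abs(cis))
```

For φ the four atoms would be C(i-1), N, Cα, C; for ψ they would be N, Cα, C, N(i+1); for ω they would be Cα, C, N(i+1), Cα(i+1).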


Sasisekharan-Ramakrishnan-Ramachandran plot

Solid line = energetically preferred

Outside dotted line = disallowed

Most amino acids fall into the αR region (right-handed alpha helix) or the β region (beta-strand)

Glycine has additional conformations (e.g. left-handed alpha helix = αL region, and conformations in the lower right panel)



Ramachandran plot

Plot for a protein with mostly beta-sheets

Example for conformations



Helices and Strands

Consecutive residues in alpha or beta conformation generate alpha-helices and beta-strands, respectively

Such secondary structure elements are stabilised by weak hydrogen bonds

They are connected by turns or loops, regions in which the chain alters direction

Turns are often surface exposed and tend to contain charged or polar residues


Alpha Helix

Residue j is hydrogen-bonded to residue j+4

3.6 residues per turn

1.5 Å rise per residue

Repeat every 3.6 × 1.5 Å = 5.4 Å

φ = -60°, ψ = -45°



Beta strand



Beta Sheets



Turn

Residue j is bonded to residue j+3

Often proline and glycine



How to Fold a Structure

All residues must have stereochemically allowed conformations

Buried polar atoms must be hydrogen-bonded

If a few are missed, it might be energetically preferable to bond these to solvent

Enough hydrophobic surface must be buried and the interior must be sufficiently densely packed

There is evidence that folding occurs hierarchically: first secondary structure elements, then super-secondary, …

This justifies a hierarchic approach when simulating folding


Structure Alignment

Slides from Hanekamp, University of Wyoming, www.uwyo.edu


Structure Alignment

In the same way that we align sequences, we wish to align structure

Let’s start simple: how to score an alignment

Sequences: e.g. percentage of matching residues

Structure: rmsd (root mean square deviation)


Root Mean Square Deviation

What is the distance between two points a with coordinates (xa, ya, za) and b with coordinates (xb, yb, zb)? Euclidean distance:

$$d(a,b) = \sqrt{(x_a - x_b)^2 + (y_a - y_b)^2 + (z_a - z_b)^2}$$


Root Mean Square Deviation

In a structure alignment the score measures how far the aligned atoms are from each other on average

Given the distances d_i between n aligned atoms, the root mean square deviation is defined as

$$\mathrm{rmsd} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} d_i^2}$$
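As a small illustration of the definition above, the sketch below computes the rmsd for two equally long lists of 3D coordinates. It assumes the two structures are already superimposed and that atom i of one list is aligned with atom i of the other; the function name `rmsd` and the coordinates are made up for the example.

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean square deviation between paired 3D points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    # Squared Euclidean distance d_i^2 for each aligned atom pair
    squared = [sum((p - q) ** 2 for p, q in zip(a, b))
               for a, b in zip(coords_a, coords_b)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical example: every atom is displaced by exactly 1 Å along z.
structure_a = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 1.0, 0.0)]
structure_b = [(0.0, 0.0, 1.0), (1.5, 0.0, 1.0), (3.0, 1.0, 1.0)]
print(rmsd(structure_a, structure_b))
```

In practice the optimal superposition has to be found first; this sketch covers only the scoring step.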


Quality of Alignment and Example

Unit of RMSD: e.g. Ångströms

Identical structures: RMSD = 0

Similar structures: RMSD is small (1 – 3 Å)

Distant structures: RMSD > 3 Å

Structural superposition of gamma-chymotrypsin and Staphylococcus aureus epidermolytic toxin A


Pitfalls of RMSD

all atoms are treated equally (e.g. residues on the surface have a higher degree of freedom than those in the core)

best alignment does not always mean minimal RMSD

significance of RMSD is size dependent

From www.uwyo.edu/molecbio/LectureNotes/ MOLB5650


Alternative RMSDs

aRMSD = best root-mean-square deviation calculated over all aligned alpha-carbon atoms

bRMSD = the RMSD over the highest scoring residue pairs

wRMSD = weighted RMSD

Source: W. Taylor (1999), Protein Science, 8: 654-665. http://www.prosci.uci.edu/Articles/Vol8/issue3/8272/8272.html#relat



Computing Structural Alignments

DALI (Distance-matrix ALIgnment) is one of the first tools for structural alignment

How does it work?

Atoms: given the atomic coordinates of two structures

Compute two distance matrices: for each structure, compute all pairwise inter-atom distances

This step is done because the computed distances are independent of the coordinate system

The two original atomic coordinate sets cannot be compared directly; the two distance matrices can

Align the two distance matrices: find small (e.g. 6x6) sub-matrices along the diagonal that match, then extend these matches to form the overall alignment

This method is somewhat similar to how BLAST works

SSAP (double dynamic programming) in term 3
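The key property used here, that a distance matrix does not depend on the coordinate system, can be checked with a short sketch. This is not DALI itself; the helper names and the four example points are assumptions made for illustration only.

```python
import math

def distance_matrix(coords):
    """All pairwise Euclidean distances between 3D points."""
    return [[math.dist(p, q) for q in coords] for p in coords]

def rotate_z_and_shift(coords, angle_deg, shift):
    """Rotate points about the z axis, then translate them."""
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    return [(c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
            for x, y, z in coords]

def same_matrix(m1, m2, tol=1e-9):
    """Element-wise comparison of two matrices of equal shape."""
    return all(abs(a - b) <= tol
               for row1, row2 in zip(m1, m2)
               for a, b in zip(row1, row2))

# Hypothetical coordinates; the second set is the same structure in another frame.
atoms = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.0, 3.6, 0.0), (8.1, 4.0, 2.2)]
moved = rotate_z_and_shift(atoms, 40.0, (10.0, -5.0, 2.5))

print(atoms[1] == moved[1])                                         # coordinates differ
print(same_matrix(distance_matrix(atoms), distance_matrix(moved)))  # distances agree
```

The raw coordinates of the two copies differ, but their distance matrices agree, which is why the matrices rather than the coordinates are compared.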


DALI Example

The regions of common fold, as determined by the program DALI by L. Holm and C. Sander, in the TIM-barrel proteins mouse adenosine deaminase [1fkx] (black) and Pseudomonas diminuta phosphotriesterase [1pta] (red):


Protein zinc finger (4znf)



Superimposed 3znf and 4znf

30 CA atoms: RMS = 0.70 Å

248 atoms: RMS = 1.42 Å


Lys30


Superimposed 3znf and 4znf backbones

30 CA atoms RMS = 0.70Å



RMSD vs. Sequence Similarity

At low sequence identity, good structural alignments are possible

Picture from www.jenner.ac.uk/YBF/DanielleTalbot.ppt


Structure Classification


Why classify structures?

Structure similarity is a good indicator of homology, therefore classify structures

Classification at different levels:

Similar general folding patterns (structures not necessarily related)

Possibly low sequence similarity, but similar structure and function implies very likely homology

High sequence similarity implies similar structures and homology

Classification can be used to investigate evolutionary relationships and possibly infer function


Structure Classification

SCOP: Structural Classification of Proteins. Hand curated (Alexei Murzin, Cambridge) with some automation

CATH: Class, Architecture, Topology, Homology. Automated where possible, some checks by hand

FSSP: Fold classification based on Structure-Structure alignment of Proteins. Fully automated

Reasonable correspondence (>80%)


Evolutionary Relation

Strong sequence similarity is assumed to be sufficient to infer homology

Close structural and functional similarity together are also considered sufficient to infer homology

Similar structure alone is not sufficient, as proteins may have converged on a structure due to physicochemical necessity

Similar function alone is not sufficient, as proteins may have developed it due to functional selection

In general, structure is more conserved than sequence

Beware: descendants of an ancestor may have different function, structure, and sequence! Difficult to detect


What is a domain? Single and Multi-Domain Proteins


What is a domain?

Functional: Domain is “independent” functional unit, which occurs in more than one protein

Physicochemical: Domain has a hydrophobic core

Topological: Intra-domain distances of atoms are minimal, inter-domain distances maximal

Difficult to define a domain exactly

Difficult to agree on exact domain borders


Domains re-occur

A domain re-occurs in different structures and possibly in the context of different other domains

P-loop domain in

1goj: Structure Of A Fast Kinesin: Implications For ATPase Mechanism and Interactions With Microtubules. Motor Protein (single domain)

1ii6: Crystal Structure Of The Mitotic Kinesin Eg5 In Complex With Mg-ADP. Cell Cycle (two domains)


Domains re-occur

1in5: interaction of P-loop domain (green & orange) and winged helix DNA binding domain

1a5t: interaction of P-loop domain (green & orange) and DNA polymerase III domain


Domains have hydrophobic core

Kyte J., Doolittle R.F., J. Mol. Biol. 157:105-132 (1982).

Figure: Hydrophobicity plot for 1GOJ kinesin motor (x-axis: residue, 1 to 301; y-axis: hydrophobicity, -3 to 3)

Hydrophobicity scale:

Ala: 1.800 Arg: -4.500 Asn: -3.500 Asp: -3.500 Cys: 2.500 Gln: -3.500 Glu: -3.500 Gly: -0.400 His: -3.200 Ile: 4.500 Leu: 3.800 Lys: -3.900 Met: 1.900 Phe: 2.800 Pro: -1.600 Ser: -0.800 Thr: -0.700 Trp: -0.900 Tyr: -1.300 Val: 4.200
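A hydrophobicity plot of this kind is typically obtained by averaging the scale values over a sliding window along the sequence. The sketch below uses the values listed above with one-letter residue codes; the window size of 9 and the example sequence are assumptions for illustration, not values taken from the slide.

```python
# Kyte-Doolittle values from the list above, keyed by one-letter code.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(sequence, window=9):
    """Mean hydrophobicity for every window of consecutive residues."""
    if window < 1 or window > len(sequence):
        raise ValueError("window must be between 1 and the sequence length")
    scores = [KYTE_DOOLITTLE[residue] for residue in sequence.upper()]
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]

# Hypothetical sequence: a hydrophobic stretch flanked by charged residues.
profile = hydropathy_profile("KKDEEAILVVLLAIGDEKRK", window=9)
peak = max(range(len(profile)), key=profile.__getitem__)
print(f"most hydrophobic window starts at residue {peak + 1}: {profile[peak]:.2f}")
```

Windows with high average values are candidates for buried, core regions; windows with low values tend to be surface exposed.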


Intra-domain distances minimal

Distances between atoms within domain are minimal

Distances between atoms of two different domains are maximal


PDB, Proteins, and Domains

Ca. 20,000 structures in PDB

50% single domain

50% multiple domains

90% have fewer than 5 domains

Figure: Distribution of Number of Domains (x-axis: number of domains, 0 to 60; y-axis: frequency, 0 to 10000)

Dom#   Freq.
1      8464
2      4358
3      926
4      1888
5      148
6      624
7      42
8      491
9      22
10     58
…      …
30     7
31     1
32     16
36     1
40     8
42     1
48     3
49     1


A structure with 49 domains

1AON, Asymmetric Chaperonin Complex GroEL/GroES/(ADP)7


SCOP: Structural Classification of Proteins

Hierarchy, from top to bottom: Class, Fold, Superfamily, Family

Class: All alpha (218), All beta (144), Alpha/beta (136), Alpha+beta (279)

Fold: Trypsin-like serine proteases (1), Immunoglobulin-like (23)

Superfamily: Transglutaminase (1), Immunoglobulin (6)

Family: C1 set domains (antibody constant), V set domains (antibody variable)


Class

All alpha (possibly small beta adornments)

All beta (possibly small alpha adornments)


Class

Alpha/beta (alpha and beta) = single beta sheet with alpha helices joining the C-terminus of one strand to the N-terminus of the next

Subclass: beta sheet forming barrel surrounded by alpha helices

Subclass: central planar beta sheet

Alpha+beta (alpha plus beta) = alpha and beta units are largely separated

Strands joined by hairpins leading to antiparallel sheets


Class

Multi-domain proteins

Have domains placed in different classes

Domains have not been observed elsewhere

E.g. 1hle


Class

Membrane (few and mostly unique) and cell surface proteins

E.g. Aquaporin 1ih5


Class

Small proteins

E.g. Insulin, 1pid


Class

Coiled coil proteins

E.g. 1i4d, Arfaptin-Rac binding fragment


Class

Low-resolution structures, peptides, designed proteins

E.g. 1cis, a designed protein, hybrid protein between chymotrypsin inhibitor CI-2 and helix E from subtilisin Carlsberg from Barley (Hordeum vulgare), hiproly strain


Fold, Superfamily, Family

Fold: common core structure, i.e. same secondary structure elements in the same arrangement with the same topological structure

Superfamily: very similar structure and function

Family: sequence identity (>30%) or extremely similar structure and function


Distribution (2007)

Class                        Fold   Superfamily   Family
All alpha                    259    459           772
All beta                     165    331           679
Alpha/beta                   141    232           736
Alpha+beta                   334    488           897
Multidomain                  53     53            74
Membrane and cell surface    50     92            104
Small proteins               85     122           202
Total                        1086   1777          3464


Uses of SCOP

Automatic classification

Understanding of protein enzymatic function

Use superfamily and fold to study distantly related proteins

Study sequence and structure variability

Derive substitution matrices for sequence comparison

Extract structural principles for design

Study decomposition of multi-domain proteins

Estimate total number of folds

Derived databases


PDB, Proteins, Domains revisited

80% of PDB structures have only one type of SCOP superfamily

15% of PDB structures have two different SCOP superfamilies

Figure: Frequency of Number of SCOP Superfamilies (x-axis: number of superfamilies, 0 to 25; y-axis: frequency, 0 to 16000)

sfNo   Freq
1      13960
2      2721
3      495
4      178
5      33
6      25
7      1
9      4
20     9
21     1
22     1
23     6


A structure with 23 different superfamilies

1k9m: Co-Crystal Structure Of Tylosin Bound To The 50S Ribosomal Subunit Of Haloarcula Marismortui. Ribosome


The 20 Most Frequently Occurring Superfamilies

Superfamily                                            SCOP ID   #PDB
Immunoglobulin                                         b.1.1     823
Lysozyme-like                                          d.2.1     777
Trypsin-like serine proteases                          b.47.1    649
P-loop containing nucleotide triphosphate hydrolases   c.37.1    521
NAD(P)-binding Rossmann-fold domains                   c.2.1     384
Globin-like                                            a.1.1     384
(Trans)glycosidases                                    c.1.8     332
Acid proteases                                         b.50.1    288
Concanavalin A-like lectins/glucanases                 b.29.1    230
Thioredoxin-like                                       c.47.1    217
EF-hand                                                a.39.1    212
alpha/beta-Hydrolases                                  c.69.1    195
Cupredoxins                                            b.6.1     178
Ribonuclease H-like                                    c.55.3    178
PLP-dependent transferases                             c.67.1    176
Periplasmic binding protein-like II                    c.94.1    171
Carbonic anhydrase                                     b.74.1    169
Metalloproteases ("zincins"), catalytic domain         d.92.1    169
FAD/NAD(P)-binding domain                              c.3.1     162
Cytochrome c                                           a.3.1     161


CATH

Class: secondary structure composition

Architecture: orientation in 3D

Topology: connectivity

Homology: grouped by evidence for homology (sequence, structure and function)


Generating CATH

1. Identify close relatives by pairwise sequence alignment

2. Detect more distant relatives using 2a. sequence profiles and 2b. structure alignment

3. Structures still unclassified after 1. and 2. are examined by hand to detect domain boundaries

4. Try 2. and 3. again

5. If still unclassified, assign manually


CATH step 1: Sequence-based Identification of Homologous Structures

> 30% sequence similarity implies similar structure

Relatives identified using pairwise alignment are clustered using hierarchical clustering with single linkage

Reminder…


Hierarchical Clustering

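Initial distance matrix:

        1    2    3    4    5
1       0    2    6    10   9
2            0    5    9    8
3                 0    4    5
4                      0    3
5                           0

After merging 1 and 2:

        (1,2)   3    4    5
(1,2)   0       5    9    8
3               0    4    5
4                    0    3
5                         0

After merging 4 and 5:

        (1,2)   3    (4,5)
(1,2)   0       5    8
3               0    4
(4,5)                0

After merging 3 and (4,5):

            (1,2)   (3,(4,5))
(1,2)       0       5
(3,(4,5))           0

Figure: dendrogram over leaves 1 2 3 4 5 (height axis 0 to 5)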
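The merge sequence in this example can be reproduced with a few lines of code. The sketch below is a naive single-linkage implementation written for illustration (it is not the CATH pipeline); the names `D`, `cluster_distance` and `single_linkage` are assumptions, while the distances are those of the 5 x 5 matrix above.

```python
from itertools import combinations

# Pairwise distances between items 1..5 from the example matrix.
D = {
    (1, 2): 2, (1, 3): 6, (1, 4): 10, (1, 5): 9,
    (2, 3): 5, (2, 4): 9, (2, 5): 8,
    (3, 4): 4, (3, 5): 5,
    (4, 5): 3,
}

def item_distance(a, b):
    return D[(a, b)] if a < b else D[(b, a)]

def cluster_distance(c1, c2):
    """Single linkage: distance of the closest pair of members."""
    return min(item_distance(a, b) for a in c1 for b in c2)

def single_linkage(items):
    """Return the list of (merged cluster, merge height) in merge order."""
    clusters = [(i,) for i in items]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair))
        height = cluster_distance(c1, c2)
        merged = tuple(sorted(c1 + c2))
        clusters = [c for c in clusters if c not in (c1, c2)] + [merged]
        merges.append((merged, height))
    return merges

for cluster, height in single_linkage([1, 2, 3, 4, 5]):
    print(cluster, height)
```

The merge heights correspond to the levels at which branches join in the dendrogram.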


Hierarchical Clustering: How to define distance between clusters?

Single linkage: minimum. Example: distance of (A,B) to C is 1

Complete linkage: maximum. Example: distance of (A,B) to C is 2

Average linkage: average. Example: distance of (A,B) to C is 1.5

Are dendrograms always the same, independent of the linkage method?

Distance matrix for the example:

     A    B    C
A    0    1    2
B         0    1
C              0

Figure: two dendrograms over A, B, C
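The three example values follow directly from the distances d(A,C) = 2 and d(B,C) = 1. A minimal sketch, with the function name `linkage_distance` chosen for illustration:

```python
# Distances from the small example: d(A,B) = 1, d(A,C) = 2, d(B,C) = 1.
DIST = {("A", "B"): 1, ("A", "C"): 2, ("B", "C"): 1}

def dist(a, b):
    return DIST[(a, b)] if (a, b) in DIST else DIST[(b, a)]

def linkage_distance(cluster, other, method):
    """Distance between two clusters under the chosen linkage method."""
    pairwise = [dist(a, b) for a in cluster for b in other]
    if method == "single":
        return min(pairwise)
    if method == "complete":
        return max(pairwise)
    if method == "average":
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage method: {method}")

for method in ("single", "complete", "average"):
    print(method, linkage_distance(("A", "B"), ("C",), method))
```

Because the cluster distances differ, the order and height of later merges, and hence the dendrogram, can differ between linkage methods.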


Hierarchical Clustering: Chaining

Beware of chaining when using single linkage

As the nearest neighbour is selected, it appears that all members of the cluster are very similar to each other, when in fact A and Z are very different

     A    B    C    D    …    Z
A    0    1    2    3    …    25
B         0    1    2    …    24
C              0    1    …    23
D                   0    …    22
…                        …
Z                             0

Figure: dendrogram over A, B, C, D, …, Z


CATH and single linkage

It is argued that structural data is quite sparse, hence it cannot be expected that all cluster members will be very similar (in terms of sequence) to each other, so that the chaining effect is even useful


CATH step 2a: Sequence profiles

Profile-based methods such as PSI-BLAST are used to detect distant relatives

Build profiles using all sequence data available (rather than only sequences for which structure exists)

This increases the quality of profiles dramatically:

51% of distant relatives retrieved using profiles based on sequences with known structure only

82% of distant relatives retrieved using profiles based on all sequences


CATH step 2b: Structure-based methods to detect distant relatives

For ca. 15% of structures, the sequence-based method does not work

Example: for globins, sequence similarity can fall below 10%, yet structure and function (oxygen-binding) are preserved

Use SSAP, the Sequential Structure Alignment Program


Clustering Result of Structure Alignment

Relatives identified using pairwise alignment are clustered using hierarchical clustering with single linkage


Improving Efficiency: GRATH

Screening large structures (>300 residues) against database can take days

Idea of GRATH (Graphical Representation of CATH): improve efficiency by filtering at a higher level before doing detailed comparison

Represent protein as a graph where

Nodes are secondary structure elements represented by their midpoint, tilt, and rotation

Edges are distances between midpoints of secondary structure elements

Use an algorithm to determine subgraph isomorphism (i.e. does one graph occur in another one); if yes, then do detailed comparison using SSAP


Structure Prediction and Modelling


Structure Prediction: Four Main Problem Areas

Given a sequence with unknown structure, predict its structure

Secondary structure prediction: predict regions of helices and strands

Homology modelling: predict structure from known structures of one or more related proteins

Fold recognition: given a library of structures, determine which one (if any) is the fold of the given sequence

Prediction of novel folds: a priori and knowledge-based methods


Structure Prediction of Novel Folds: Two Approaches

A priori: most approaches aim to reproduce inter-atomic interactions by defining an energy function and trying to find its global minimum

Problems: inadequacy of the energy function; algorithms get stuck in local minima

Knowledge-based: find similarities to known structures or sub-structures


Secondary Structure Prediction

A successful tool for secondary structure prediction is PROF

PROF uses neural networks to learn secondary structure from known structures

About ¾ of PROF’s predictions are correct

At CASP 2000 it predicted e.g. the following (H = helix, E = strand, - = other)

Residues 1-50
Sequence    ALVEDPPLKVSEGGLIREGYDPDLDALRAAHREGVAYFLELEERERERTG
Prediction  HH------------EEE------HHHHHHHHHH-HHHHHHHHHHHHHHH-
Experiment  -E-------------E-----HHHHHHHHHHHHHHHHHHHHHHHHHHHH-

Residues 51-100
Sequence    IPTLKVGYNAVFGYYLEVTRPYYERVPKEYRPVQTLKDRQRYTLPEMKEK
Prediction  --EEEEEEEEEEEEEEEE-----------EEEEEEEE--EEEE-HHHHHH
Experiment  ----EEEEE---EEEEEEEHHHHHH-----EEEEE---EEEEE-HHHHHH

Residues 101-129
Sequence    EREVYRLEALIRRREEEVFLEVRERAKRQ
Prediction  HHHHHHHHHHHHHHHHHHHHHHHHHHHH-
Experiment  HHHHHHHHHHHHHHHHHHHHHHHHHHH--
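A statement such as "three quarters of predictions are correct" is usually backed by a per-residue comparison of the predicted and the experimentally determined state strings. The sketch below computes the fraction of positions where both strings agree (often called Q3 for three states); the function name and the short strings are hypothetical and are not the CASP data above.

```python
def q3(predicted, observed):
    """Fraction of residues whose predicted state equals the observed state."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need two non-empty state strings of equal length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(predicted)

# Hypothetical three-state strings: H = helix, E = strand, - = other.
predicted = "HHHH--EEEE"
observed  = "HHH---EEE-"
print(q3(predicted, observed))
```

The same function can be applied to the full prediction and experiment strings once they have been split into equal-length lines.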


PROF’s prediction

The regions predicted by the PROF server of Rost to be helical are shown as wider ribbons. The prediction missed only a short helix, at the top left of the picture


Homology modelling

Define the model of an unknown structure by making minimal changes to a relative with known structure

Align amino acid sequences of target and one or more known structures

Insertions and deletions should be in loop regions

Determine mainchain segments to represent the regions containing insertions and deletions and stitch these into the known structure

Replace the sidechains of the residues that have been mutated

Examine the model (by hand and computationally) to detect collisions between atoms

Refine the model by limited energy minimisation


Accuracy of Homology Modelling

Works for >40-50% sequence similarity

Example: SWISS-MODEL prediction of neurotoxin of red scorpion (1DQ7) from neurotoxin of yellow scorpion (1PTX)


Fold Recognition: 3D Profiles

Given a sequence, determine which (if any) fold is most similar

Can we build profiles to represent structures of similar fold (similar to sequence profiles)?

3D profiles: classify the environment of each residue

Secondary structure: is it part of a helix, sheet or other (determined by main chain hydrogen bonding interactions)?

Surface exposure: <40 Å², 40-114 Å², or >114 Å² accessible surface area

Polar or non-polar nature of environment

Total of 18 residue classes, one of which each residue is part of

The sequence of these residue classes is the 3D profile


3D Profiles and Alignments

Structure-Structure Alignment: 3D profiles of two known structures can be aligned against each other

Sequence-Structure Alignment: based on existing 3D profiles, a probability can be determined for a residue occurring in a residue class

Using this probability, we can assign a 3D profile to a sequence and hence align the sequence 3D profile to a structure 3D profile

For correctly determined protein structures, the structure 3D profile fits the sequence 3D profile well

However, other proteins may score even better

If a structure does not match its own 3D profile well it is likely that there is an error in the structure determination


Threading

Pull query sequence through known structure and rate the score

Necessary:

Method to score the models to select the best one

Method to calibrate the scores to decide whether the best one is correct

Homology modelling              Threading
Identify homologues             Try all possible parents
Determine optimal alignment     Try many alignments
Optimize one model              Evaluate many rough models


Scoring for Threading

Empirical patterns of residue neighbours derived from known structures

Observe distribution of inter-residue distances for all 20 x 20 residue pairs

Derive probability distribution as function of distance in space and on sequence

Boltzmann equation relates probability and energy

Reverse this and derive an energy function from the probability distribution
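The reversal of the Boltzmann relation can be sketched as E(r) = -kT ln(p_obs(r) / p_ref(r)), where p_obs is the observed probability of a residue pair at distance bin r and p_ref is a reference probability. The choice of reference state, the value kT = 1 and the counts below are assumptions made for this illustration; the slide does not specify them.

```python
import math

def pseudo_energies(observed_counts, reference_counts, kT=1.0):
    """Inverse Boltzmann: one energy value per distance bin.

    Both count lists must contain only positive numbers, one per bin.
    """
    total_obs = sum(observed_counts)
    total_ref = sum(reference_counts)
    energies = []
    for obs, ref in zip(observed_counts, reference_counts):
        p_obs = obs / total_obs      # observed probability for this bin
        p_ref = ref / total_ref      # reference (expected) probability
        energies.append(-kT * math.log(p_obs / p_ref))
    return energies

# Hypothetical counts for one residue pair in three distance bins.
observed = [10, 30, 60]    # seen less often than expected at short range
reference = [30, 30, 40]
print([round(e, 3) for e in pseudo_energies(observed, reference)])
```

Bins that are observed more often than the reference predicts get a negative (favourable) energy, and bins observed less often get a positive one.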


Threading the sequence

template

Target



“Threaded” sequence

Yellow = adrenergic receptor sequence

Blue = adrenergic receptor (PDB 1F88)



Modeled structure

Gaps



Corrected Model



Ab initio Structure Prediction


Molecular dynamics

Structure prediction = place atoms so that interactions between them create a unique state of maximum stability

Problem: Model of inter-atomic distances is not complete

Computational scale:

Large number of variables and massive search space

Non-linearities

Rough energy surface with many local minima


Conformational energy calculations

Bond stretching

Bond angle bend

Torsion angle (e.g. φ, ψ, ω)

Van der Waals interactions

Short-range repulsion ~R^{-12} and long-range attraction ~R^{-6}, where R is the inter-atom distance

Hydrogen bond: weak chemical/electrostatic interaction, ~R^{-12} and ~R^{-10}

Electrostatics: charges on atoms

Solvent: interactions with water, salt, sugar, etc.
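The van der Waals term with R^{-12} repulsion and R^{-6} attraction is commonly written as a Lennard-Jones 12-6 potential, E(R) = 4ε[(σ/R)^12 - (σ/R)^6]. The sketch below uses this standard form; the parameter names `epsilon` and `sigma` and their default values of 1 are assumptions for illustration, not values from the lecture.

```python
def lennard_jones(r, epsilon=1.0, sigma=1.0):
    """12-6 potential: repulsive at short range, attractive at long range."""
    ratio6 = (sigma / r) ** 6
    return 4.0 * epsilon * (ratio6 * ratio6 - ratio6)

# Energy at a few inter-atom distances (in units of sigma).
r_min = 2 ** (1 / 6)          # distance of the energy minimum for sigma = 1
for r in (0.9, 1.0, r_min, 1.5, 2.5):
    print(f"R = {r:.3f}  E = {lennard_jones(r):+.4f}")
```

The energy is strongly positive below σ, passes through zero at R = σ, reaches its minimum of -ε at R = 2^(1/6) σ, and then decays towards zero.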


Rosetta

Predicts structure by first generating structures of fragments (3-9 residues) using known structures

Combines fragments using Monte Carlo simulation with an energy function with terms for

Paired beta-sheets

Burial of hydrophobic residues

Carries out 1000 simulations

Results are clustered and the centre of the largest cluster is presented as prediction

Demo


ROSETTA

The program ROSETTA, by D. Baker and colleagues, can predict the structures of proteins for which no complete domain of similar folding pattern appears in the database.

Prediction by ROSETTA of H. influenzae hypothetical protein. Black lines, experimental structure; red lines, prediction


Rosetta

Prediction by ROSETTA of the N-terminal half of domain 1 of human DNA repair protein Xrcc4. This figure shows a selected substructure of Xrcc4 containing the N-terminal 55 out of 116 residues. Black lines, experimental structure; red lines, prediction


LINUS

Another programme with a similar idea

Prediction by LINUS (program by G.D. Rose and R. Srinivasan) of C-terminal domain of rat endoplasmic reticulum protein ERp29. Black lines, experimental structure; red lines, prediction


Monte Carlo Simulation

Objective: Find conformation with minimal energy

Problem: Avoid local minima

Algorithm:

1. Generate a random initial conformation x

2. Perturb conformation x to generate a neighbouring conformation x’

3. Calculate the energies E(x) and E(x’) for conformations x and x’, respectively

4. If E(x) > E(x’) (i.e. x’ is an improvement, we go downhill from x to x’), then accept x’ as new conformation and go to 2.

5. If E(x) < E(x’) (i.e. x’ is no improvement, we go uphill from x to x’), then accept x’ as new conformation with probability p

6. The probability p to accept uphill moves is reduced with every step

7. Go to step 2.

Steps 1-4 make sure that we “walk” downhill towards a minimum

Steps 5-7 make sure that if we are in a local minimum there is a chance to get out of it by accepting an uphill move. It is important that this probability decreases, so that we become less and less likely to walk uphill
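The algorithm can be sketched on a toy problem in which a conformation is a single number x. The energy function, the step size and the cooling schedule below are assumptions for illustration; in particular, the uphill acceptance probability is taken to be the Metropolis form p = exp(-(E(x’) - E(x)) / T) with a temperature T that shrinks every step, which is one common way to realise a decreasing p and is not prescribed by the steps above.

```python
import math
import random

def toy_energy(x):
    """Hypothetical rough 1D energy surface with several local minima."""
    return 0.1 * x * x + math.sin(3.0 * x)

def monte_carlo_minimise(energy, x0, steps=5000, step_size=0.5,
                         t_start=2.0, cooling=0.999, seed=0):
    rng = random.Random(seed)            # fixed seed for a repeatable run
    x = best = x0                        # step 1 (start conformation is given here)
    temperature = t_start
    for _ in range(steps):
        candidate = x + rng.uniform(-step_size, step_size)   # step 2: perturb
        delta = energy(candidate) - energy(x)                # step 3: compare energies
        if delta <= 0:
            x = candidate                                    # step 4: downhill, accept
        elif rng.random() < math.exp(-delta / temperature):
            x = candidate                                    # step 5: uphill with prob. p
        if energy(x) < energy(best):
            best = x                     # remember the best conformation seen
        temperature *= cooling           # step 6: p shrinks with every step
    return best

start = 4.0
result = monte_carlo_minimise(toy_energy, start)
print(f"E(start) = {toy_energy(start):.3f}, E(result) = {toy_energy(result):.3f}")
```

Keeping track of the best conformation seen is an addition for convenience; the steps as listed simply continue from the current conformation.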


Summary

You should know now

What helices, strands, sheets are

What a Ramachandran plot is

How to score a structural alignment (rmsd)

How to compute a structural alignment

How a domain can be characterised

Why structure classification is useful

What the main structure classes are

How classifications can be generated automatically

What the problems are

What secondary structure prediction, homology modelling, threading, ab-initio and knowledge-based structure prediction of novel folds are

Visit PDB, SCOP and CATH websites and read chapter 5
