Post on 21-Jan-2016

1

Protein Structure Prediction

2

Copyright notice

• Many of the images in this power point presentation are from Bioinformatics and Functional Genomics by Jonathan Pevsner (ISBN 0-471-21004-8). Copyright © 2003 by John Wiley & Sons, Inc.

• Many slides in this power point presentation are from slides by Dr. Jonathan Pevsner and other people. The copyright belongs to the original authors. Thanks!

3

Levels of Protein Structure

4

Why is protein structure prediction needed?

• 3D structure determination is expensive, slow and difficult (by X-ray crystallography or NMR)

• Assists in the engineering of new proteins

5

Approaches to predicting protein structures

• Ab initio

– Use just first principles: energy, geometry, and kinematics

• Homology (comparative) modeling

– Find the best match to a database of sequences with known 3D structure

• Threading

• Combinations

6

Comparative Modeling

Protein Data Bank (PDB): http://www.pdb.org

• Database of templates

– Separate into single chains

– Remove bad structures (models)

– Create BLAST database

• Workflow: target sequence and known structures (templates) → template(s) selection → sequence alignment → structure modeling → structure evaluation → final structural models

7

Comparative Modeling: Template(s) selection

• Sequence similarity / fold recognition

• Structure quality (resolution, experimental method)

• Experimental conditions (ligands and cofactors)

8

Comparative Modeling: Sequence Alignment

• Key step in homology modeling

• Global alignment is required

• A small error in the alignment can lead to a big error in the model

• Multiple alignments are better than pairwise alignments

9

Comparative Modeling: Structure Modeling

• Template-based fragment assembly (SWISS-MODEL)

• Satisfaction of spatial restraints (MODELLER)

10

Comparative Modeling: Structure Evaluation

• Errors in template selection or alignment result in bad models

• Iterative cycles of alignment, modeling and evaluation

11

Measuring Protein Structure Similarity

• Need ways to determine if two protein structures are related and to compare predicted models to experimental structures

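• A commonly used measure is the root mean square deviation (RMSD) of the Cartesian coordinates of corresponding atoms in two structures after optimal superposition (McLachlan, 1979):

$$\mathrm{RMSD} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(dx_i^2 + dy_i^2 + dz_i^2\right)}$$

where N is the number of atom pairs and dx_i, dy_i, dz_i are the coordinate differences for pair i.

• Usually use Cα atoms

[Figure: NK-lysin (1nkl), Bacteriocin T102/as48 (1e68) and the T102 best model; labeled RMSD values 3.6 Å and 2.9 Å]

• Other measures include contact maps and torsion angle RMSDs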
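The RMSD formula can be sketched in a few lines of Python. This is a minimal illustration that assumes the two structures are already optimally superposed and that atom i in one list corresponds to atom i in the other; the function name and the toy coordinates are made up for the example.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of (x, y, z) coordinates.

    Assumes the structures are already superposed and that atoms are
    paired by index (e.g. C-alpha atoms of aligned residues).
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        dx, dy, dz = xa - xb, ya - yb, za - zb
        total += dx * dx + dy * dy + dz * dz
    return math.sqrt(total / len(coords_a))

# Toy C-alpha traces (made-up coordinates, in Å): every atom is shifted by 1 Å in y
model = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
native = [(0.0, 1.0, 0.0), (3.8, 1.0, 0.0), (7.6, 1.0, 0.0)]
print(rmsd(model, native))
```

The superposition step itself (finding the rotation and translation that minimize this value) is not shown here.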

12

Comparative modeling

• In general, accuracy of structure prediction depends on the percent amino acid identity shared between target and template.

• For >50% identity, RMSD is often only 1 Å.

13

Comparative modeling

Many web servers offer comparative modeling services. Examples are:

• SWISS-MODEL (ExPASy)

• Predict Protein server (Columbia)

• WHAT IF (CMBI, Netherlands)

14

Ab Initio Methods

• Ab initio: “From the beginning”.

• Assumption 1: All the information about the structure of a protein is contained in its sequence of amino acids.

• Assumption 2: The structure that a (globular) protein folds into is the structure with the lowest free energy.

• Finding native-like conformations requires:

– A scoring function (potential).

– A search strategy.

15

Ab initio protein structure prediction

Ab initio prediction can be performed when a protein has no detectable homologs.

Protein folding is modeled based on global free-energy minimum estimates.

16

Ab initio Prediction

• Sampling the global conformation space

– Lattice models / Discrete-state models

– Molecular Dynamics

• Picking native conformations with an energy function

– Solution model: how protein interacts with water

– Pair interactions between amino acids

• Predicting secondary structure

– Local homology

– Fragment libraries

17

ROSETTA

• ROSETTA is mainly an ab initio structure prediction algorithm, although various parts of it can be used for other purposes as well (such as homology modeling).

• Rationale

– Local structures often fold independently of full protein

– Can predict large areas of protein by matching sequence to I-Sites

David Baker

18

Ab initio Prediction – ROSETTA

1. PSI-BLAST – homology search

Discard sequences with >25% homology

2. PHD

For each 3-long and each 9-long sequence fragment, get 25 structure fragments that match “well”

3. Markov-Chain Monte Carlo method

Iteratively insert and remove one short structure fragment at a time
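The fragment-insertion Monte Carlo step can be illustrated with a toy sketch. It is not ROSETTA's implementation: the fragment library, the starting conformation, the temperature and the energy function below are all stand-ins chosen for the example. Only the general idea is taken from the slide, namely replacing one short fragment at a time and accepting or rejecting the move with a Metropolis criterion.

```python
import math
import random

def fragment_monte_carlo(n_residues, fragments, energy, steps=2000,
                         temperature=1.0, seed=0):
    """Metropolis Monte Carlo over backbone torsions using fragment insertion.

    fragments: list of fragments, each a list of (phi, psi) tuples
               (hypothetical library; a real one is built per target).
    energy:    callable scoring a list of (phi, psi) tuples; lower is better.
    Returns the best conformation seen and its energy.
    """
    rng = random.Random(seed)
    conf = [(180.0, 180.0)] * n_residues      # start from an extended chain
    current = energy(conf)
    best_conf, best = list(conf), current
    for _ in range(steps):
        frag = rng.choice(fragments)
        start = rng.randrange(n_residues - len(frag) + 1)
        trial = conf[:start] + list(frag) + conf[start + len(frag):]
        e = energy(trial)
        # Metropolis criterion: always accept downhill, sometimes accept uphill
        if e <= current or rng.random() < math.exp((current - e) / temperature):
            conf, current = trial, e
            if current < best:
                best_conf, best = list(conf), current
    return best_conf, best

# Toy setup: 3- and 9-residue fragments of ideal helix or strand torsions
HELIX = (-60.0, -45.0)
STRAND = (-120.0, 130.0)
library = [[HELIX] * 3, [STRAND] * 3, [HELIX] * 9, [STRAND] * 9]

def toy_energy(conf):
    # Stand-in scoring function: counts residues that are not helical
    return sum(1.0 for torsions in conf if torsions != HELIX)

conf, e = fragment_monte_carlo(12, library, toy_energy)
print(e)
```

A real scoring function would combine terms such as solvation and residue pair interactions rather than counting helical residues.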

19

Ab initio Prediction

20

Protein Threading

• The goal: find the “correct” sequence-structure alignment between a target sequence and its native-like fold in PDB

• Energy function – knowledge (or statistics) based rather than physics based

– Should be able to distinguish correct structural folds from incorrect structural folds

– Should be able to distinguish correct sequence-fold alignment from incorrect sequence-fold alignments

MTYKLILN …. NGVDGEWTYTE

21

Threading

• Threading is in-between homology-based prediction and molecular modeling

MTYKLILN …. NGVDGEWTYTE

Main difference between homology-based prediction and threading:

Threading uses the structure to compute energy function during alignment

22

Threading – Overview

• Build a structural template database

• Define a sequence–structure energy function

• Apply a threading algorithm to query sequence

• Perform local refinement of secondary structure

• Report best resulting structural model

23

Threading – Template Database

• FSSP, SCOP, CATH

• Remove pairs of proteins with highly similar structures

– Efficiency

– Statistical skew in favor of large families

24

Threading – Energy Function

MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE

• how often a residue mutates to the template residue: Em

• how well a residue fits a structural environment: Es

• how preferable it is to put two particular residues nearby: Ep

• alignment gap penalty: Eg

• compatibility with local secondary structure prediction: Ess

• total energy: wmEm + wsEs + wpEp + wgEg + wssEss
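As a small worked example, the total energy is just a weighted sum of the five terms. The sketch below uses made-up term values and weights for one candidate alignment; in practice the weights would be tuned for the threading program, and none of these numbers come from the slides.

```python
def total_energy(terms, weights):
    """Weighted sum wm*Em + ws*Es + wp*Ep + wg*Eg + wss*Ess.

    terms and weights are dicts keyed by 'm', 's', 'p', 'g', 'ss'.
    """
    keys = ("m", "s", "p", "g", "ss")
    return sum(weights[k] * terms[k] for k in keys)

# Made-up values for a single candidate sequence-structure alignment
terms = {"m": -12.0, "s": -30.0, "p": -8.5, "g": 6.0, "ss": -4.0}
weights = {"m": 1.0, "s": 1.0, "p": 0.5, "g": 1.0, "ss": 2.0}
print(total_energy(terms, weights))  # -48.25
```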

25

Protein Threading -- algorithm

• Threading algorithm – to find a sequence-structure alignment with the minimum energy

– considering only singleton energy and gap penalty

– considering all three energy terms

[Figure: a sequence aligned to a fold, with links between them]
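When only singleton energies and gap penalties are considered, the score of each aligned position does not depend on the rest of the alignment, so a minimum-energy alignment can be found with standard dynamic programming. The sketch below assumes a linear gap penalty and a toy singleton energy based on buried/exposed environments; the function names and the toy energy are illustrative only.

```python
def thread_dp(seq, n_positions, singleton, gap=1.0):
    """Minimum-energy global alignment of seq onto template positions.

    singleton(residue, position) -> energy of placing residue at that
    template position (lower is better).
    gap: penalty for each unaligned residue or skipped template position.
    Returns (energy, pairs) where pairs lists (seq_index, position).
    """
    n, m = len(seq), n_positions
    inf = float("inf")
    dp = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                cand = dp[i - 1][j - 1] + singleton(seq[i - 1], j - 1)
                if cand < dp[i][j]:
                    dp[i][j], back[i][j] = cand, "match"
            if i > 0 and dp[i - 1][j] + gap < dp[i][j]:
                dp[i][j], back[i][j] = dp[i - 1][j] + gap, "skip_residue"
            if j > 0 and dp[i][j - 1] + gap < dp[i][j]:
                dp[i][j], back[i][j] = dp[i][j - 1] + gap, "skip_position"
    # Trace back from the bottom-right corner to recover the alignment
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "match":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "skip_residue":
            i -= 1
        else:
            j -= 1
    return dp[n][m], pairs[::-1]

# Toy template: structural environment of each position (made up)
ENVIRONMENTS = ["buried", "exposed", "buried"]
HYDROPHOBIC = set("AVILMFWC")

def toy_singleton(residue, position):
    # Reward hydrophobic residues in buried positions and polar ones in exposed positions
    buried = ENVIRONMENTS[position] == "buried"
    return -1.0 if (residue in HYDROPHOBIC) == buried else 1.0

energy, alignment = thread_dp("LKV", len(ENVIRONMENTS), toy_singleton)
print(energy, alignment)  # -3.0 [(0, 0), (1, 1), (2, 2)]
```

Once pairwise terms are included, the score of one aligned position depends on where other residues are placed, and this simple recurrence no longer applies.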

26

Protein Threading -- algorithm

• Iterative procedures, e.g. repeated 3D-profile alignment

• Double dynamic programming

• Integer programming

27

Assessing Prediction Reliability

MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE

Score = -1500; Score = -900; Score = -1120; Score = -720

Which one, if any, is the correct structural fold for the target sequence?

The one with the highest score?

28

Assessing Prediction Reliability

Template #1: AATTAATACATTAATATAATAAAATTACTGA

Query sequence: AAAA

Template #2: CGGTAGTACGTAGTGTTTAGTAGCTATGAA

Better template?

Which of these two templates has a better chance of matching the query sequence well after being randomly reshuffled?

29

Assessing Prediction Reliability

• Different template structures may have different background scores, making direct comparison of threading scores against different templates invalid

• Threading results should be compared by how far each score stands out from its own background score distribution, rather than by the raw threading scores

30

Assessing Prediction Reliability

Threading 100,000 sequences against a template structure provides the baseline information about the background scores of the template

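By locating where the threading score for a particular query sequence falls in this background distribution, one can decide how significant the score, and hence the threading result, is!

[Figure: background score distribution with "not significant" and "significant" regions; the significance is expressed as an E-value]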
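The idea of judging a score against a background distribution can be sketched as follows. This is a minimal illustration, not a real E-value calculation: it builds the background by shuffling the query (rather than threading 100,000 sequences), uses a made-up scoring function in place of a threading energy, and reports a z-score and an empirical p-value.

```python
import random
import statistics

def score_significance(query, score_fn, n_shuffles=1000, seed=0):
    """Z-score and empirical p-value of score_fn(query) against shuffled queries.

    Scores are treated as energies, so lower is better.
    """
    rng = random.Random(seed)
    observed = score_fn(query)
    residues = list(query)
    background = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)
        background.append(score_fn("".join(residues)))
    mean = statistics.mean(background)
    sd = statistics.pstdev(background)
    z = (observed - mean) / sd if sd > 0 else 0.0
    # Fraction of background scores at least as good as the observed one
    p = (1 + sum(s <= observed for s in background)) / (n_shuffles + 1)
    return z, p

# Toy template environments (B = buried, E = exposed) and toy scoring function
TEMPLATE = "BBBBEEEEBBBBEEEE"
HYDROPHOBIC = set("AVILMFWC")

def toy_score(seq):
    # -1 for each residue whose hydrophobicity matches its environment
    return -float(sum((aa in HYDROPHOBIC) == (env == "B")
                      for aa, env in zip(seq, TEMPLATE)))

z, p = score_significance("LLLLKKKKVVVVDDDD", toy_score)
print(z, p)
```

A strongly negative z-score together with a small p-value indicates that the query scores much better against this template than shuffled sequences of the same composition do.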

31

Assessing Prediction Reliability

MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE

Score = -1500, E-value = e-1

Score = -900, E-value = e-21

Score = -1120, E-value = 0.5 e-1

Score = -720, E-value = e-2

If no prediction has a significant e-value, a prediction program should indicate that it could not make a prediction!

32

Prediction of Protein Structures

• Threading against a template database

• Select the hits with good e-values, e.g., < e-10

• Place the backbone atoms of each aligned target residue at the position of the corresponding template residue

FMFTAIGEEVVQRSRKIL- - - DDLVELVK

AVLTRYGQRLIQLYDLLAQIQQKAFDVLS

Unaligned residues will not have 3D coordinates

33

Prediction of Protein Structures

• Protein threading can predict only the backbone structure of a protein (side-chains have to be predicted using other methods)

• Typically the lower the e-value, the higher the prediction accuracy

[Figure: predicted vs. actual structure. Blue: actual structure; green: predicted structure]

34

Prediction of Protein Structures

• Examples – a few good examples

[Figure: predicted structures shown next to the corresponding actual structures]

35

Prediction of Protein Structures

• A not-so-good example

36

Prediction of Protein Structures

• State of the art: ~50% of the soluble proteins in a microbial genome could have a correct fold prediction, and perhaps 50% of these could have a good backbone structure prediction

• Functional inference could be made based on

– accurately predicted structures

– correctly identified structural folds

37

Prediction of Protein Structures

• All-atom structures could be predicted through

– prediction of backbone structure

– prediction of sidechain packing

• Backbone-dependent rotamers

• Ab initio prediction of sidechains

• State of the art – accurate prediction of side chains remains a challenging problem

38

Structure prediction using additional information

• Some structural information may be available before whole structure is solved

– disulfide bonds

– active sites

– residues identified as buried/exposed

– (partial) secondary structure

– partial NMR data

– inter-residue distances by cross-linking and mass spec

– overall shape derived from cryo-EM

– …

• These data can provide highly useful constraints on threading prediction

39

Structure prediction using additional information

• The basic idea

MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE

Distance or other types of constraints could be derived before the structure is solved, which could help make the structure prediction more accurate

40

Applications

• Many protein structures have been successfully predicted prior to the solution of their experimental structures (and were later verified by the experimental structures)

• Structure predictions of all predicted genes in three microbial genomes, Synechococcus, Prochlorococcus MIT/MED

– ~60% of predicted genes have structural fold assignments

41

Existing Prediction Programs

• PROSPECT

– https://csbl.bmb.uga.edu/protein_pipeline

• FUGUE

– http://www-cryst.bioc.cam.ac.uk/~fugue/prfsearch.html

• THREADER

– http://bioinf.cs.ucl.ac.uk/threader/

42

CASP: Critical Assessment of Structure Prediction

• A decade of CASP: progress, bottlenecks and prognosis in protein structure prediction, John Moult

• First held in 1994, every 2 years afterwards

• Teams make structure predictions from sequences alone

43

CASP

• Two categories of predictors

– Automated

• Automatic servers, must complete analysis within 48 hours

• Shows what is possible through computer analysis alone

– Non-automated

• Groups spend considerable time and effort on each target

• Utilize computer techniques and human analysis techniques

44

CAFASP

GOAL

The goal of CAFASP is to evaluate the performance of fully automatic structure prediction servers available to the community. In contrast to the normal CASP procedure, CAFASP aims to answer the question of how well servers do without any intervention of experts, i.e. how well ANY user using only automated methods can predict protein structure. CAFASP assesses the performance of methods without the user intervention allowed in CASP.

45

Performance Evaluation in CAFASP3

Servers (54 in total)    Sum MaxSub score    # correct (30 FR targets)
3ds5 robetta             5.17-5.25           15-17
pmod 3ds3 pmode3         4.21-4.36           13-14
RAPTOR                   3.98                13
shgu                     3.93                13
3dsn                     3.64-3.90           12-13
pcons3                   3.75                12
fugu3 orf_c              3.38-3.67           11-12
…                        …                   …
pdbblast                 0.00                0

(http://ww.cs.bgu.ac.il/~dfischer/CAFASP3, released in December, 2002.)

Servers with name in italic are meta servers

MaxSub score ranges from 0 to 1

Therefore, maximum total score is 30

46

One structure where RAPTOR did best

Red: true structure

Blue: correct part of prediction

Green: wrong part of prediction

• Target size: 144

• Super-imposable size within 5 Å: 118

• RMSD: 1.9 Å

47

Some more results by other programs

50

Summary of current state of the art

51

Secondary Structure Prediction

• Given a protein sequence a1a2…aN, secondary structure prediction aims at defining the state of each amino acid ai as being either H (helix), E (extended=strand), or O (other) (Some methods have 4 states: H, E, T for turns, and O for other).

52

Measures used to evaluate secondary structure predictions

• Percentage of residues predicted ("PP"): percentage of residues for which a secondary structure prediction was made (residues were assigned secondary structure with nonzero probability). The number is provided for reference.

53

Measures used to evaluate secondary structure predictions

• Qindex: Qindex (Qhelix, Qstrand, Qcoil, Q3) gives percentage of residues predicted correctly as helix(H), strand(E), coil(C) or for all three conformational states.

• Qhelix ("Q_H")

• Qstrand ("Q_S")

• Qcoil ("Q_C")

• Q3 ("Q3")

54

Qindex

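• For a single conformational state:

$$Q_i = \frac{\text{number of residues correctly predicted in state } i}{\text{number of residues observed in state } i} \times 100$$

• where i is either helix, strand or coil.

• For all three states:

$$Q_3 = \frac{\text{number of residues correctly predicted}}{\text{total number of residues}} \times 100$$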

55

Limitations of Q3

Amino acid sequence: ALHEASGPSVILFGSDVTVPPASNAEQAK

Actual secondary structure: hhhhhooooeeeeoooeeeooooohhhhh

Prediction 1: ohhhooooeeeeoooooeeeooohhhhhh, Q3=22/29=76% (useful prediction)

Prediction 2: hhhhhoooohhhhooohhhooooohhhhh, Q3=22/29=76% (terrible prediction)

Q3 for random prediction is 33%

Secondary structure assignment in real proteins is uncertain to about 10%; Therefore, a “perfect” prediction would have Q3=90%.
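The limitation can be checked directly on the example strings from this slide. The sketch below computes Q3 and a per-state Q value; the function names are chosen for illustration. Both predictions reach the same Q3, but the per-state value for strands shows that the second one misses every strand.

```python
def q3(predicted, observed):
    """Percentage of residues whose predicted state matches the observed state."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("need two non-empty state strings of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)

def q_state(predicted, observed, state):
    """Percentage of residues observed in `state` that are predicted in `state`."""
    n_observed = observed.count(state)
    if n_observed == 0:
        raise ValueError("state does not occur in the observed structure")
    correct = sum(p == o == state for p, o in zip(predicted, observed))
    return 100.0 * correct / n_observed

actual = "hhhhhooooeeeeoooeeeooooohhhhh"
useful = "ohhhooooeeeeoooooeeeooohhhhhh"
terrible = "hhhhhoooohhhhooohhhooooohhhhh"

print(round(q3(useful, actual)), round(q3(terrible, actual)))  # 76 76
print(round(q_state(useful, actual, "e")), round(q_state(terrible, actual, "e")))  # 71 0
```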

56

Early methods for Secondary Structure Prediction

• Chou and Fasman (Chou and Fasman. Prediction of protein conformation. Biochemistry, 13: 211-245, 1974)

• GOR (Garnier, Osguthorpe and Robson. Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular proteins. J. Mol. Biol., 120:97-120, 1978)

57

Chou and Fasman

• Start by computing amino acid propensities to belong to a given type of secondary structure:

$$\frac{P(i \mid \mathrm{Helix})}{P(i)}, \qquad \frac{P(i \mid \mathrm{Beta})}{P(i)}, \qquad \frac{P(i \mid \mathrm{Turn})}{P(i)}$$

Propensities > 1 mean that the residue type i is likely to be found in the corresponding secondary structure type.
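As a worked example, a propensity can be computed from residue counts: P(i | state) is the fraction of residues in that state that are of type i, and P(i) is the fraction of all residues that are of type i. The counts below are made up for a two-letter alphabet and are not taken from any real data set.

```python
def propensity(residue, state_counts, total_counts):
    """Propensity P(i | state) / P(i) of one residue type for one state.

    state_counts: residue type -> number of occurrences within the state
    total_counts: residue type -> number of occurrences in the whole data set
    """
    p_given_state = state_counts[residue] / sum(state_counts.values())
    p_overall = total_counts[residue] / sum(total_counts.values())
    return p_given_state / p_overall

# Made-up counts: 200 residues in total, 80 of them in helices
total_counts = {"A": 100, "G": 100}
helix_counts = {"A": 60, "G": 20}

print(propensity("A", helix_counts, total_counts))  # 1.5: Ala is enriched in helices
print(propensity("G", helix_counts, total_counts))  # 0.5: Gly is depleted in helices
```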

58

Chou and Fasman

Amino acid    α-Helix    β-Sheet    Turn
Favors α-helix:
Ala           1.29       0.90       0.78
Cys           1.11       0.74       0.80
Leu           1.30       1.02       0.59
Met           1.47       0.97       0.39
Glu           1.44       0.75       1.00
Gln           1.27       0.80       0.97
His           1.22       1.08       0.69
Lys           1.23       0.77       0.96
Favors β-strand:
Val           0.91       1.49       0.47
Ile           0.97       1.45       0.51
Phe           1.07       1.32       0.58
Tyr           0.72       1.25       1.05
Trp           0.99       1.14       0.75
Thr           0.82       1.21       1.03
Favors turn:
Gly           0.56       0.92       1.64
Ser           0.82       0.95       1.33
Asp           1.04       0.72       1.41
Asn           0.90       0.76       1.23
Pro           0.52       0.64       1.91
No strong preference:
Arg           0.96       0.99       0.88

59

Chou and Fasman

Predicting helices:

- find nucleation site: 4 out of 6 contiguous residues with P(α) > 1

- extension: extend helix in both directions until a set of 4 contiguous residues has an average P(α) < 1 (breaker)

- if average P(α) over whole region is > 1, it is predicted to be helical

Predicting strands:

- find nucleation site: 3 out of 5 contiguous residues with P(β) > 1

- extension: extend strand in both directions until a set of 4 contiguous residues has an average P(β) < 1 (breaker)

- if average P(β) over whole region is > 1, it is predicted to be a strand
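The helix rules can be sketched in Python using the α-helix propensities from the table. This is one possible reading of the rules, not a reference implementation: in particular, the extension step here moves one residue at a time and stops when the 4-residue window that includes the next residue averages below 1. The test sequence is made up.

```python
# α-helix propensities from the Chou and Fasman table (one-letter codes)
P_ALPHA = {
    "A": 1.29, "C": 1.11, "L": 1.30, "M": 1.47, "E": 1.44, "Q": 1.27,
    "H": 1.22, "K": 1.23, "V": 0.91, "I": 0.97, "F": 1.07, "Y": 0.72,
    "W": 0.99, "T": 0.82, "G": 0.56, "S": 0.82, "D": 1.04, "N": 0.90,
    "P": 0.52, "R": 0.96,
}

def mean(values):
    return sum(values) / len(values)

def predict_helices(seq, p=P_ALPHA, nucleus=6, min_formers=4, breaker=4):
    """Return a string with 'H' for predicted helix residues and '-' otherwise."""
    n = len(seq)
    helix = [False] * n
    for start in range(n - nucleus + 1):
        window = seq[start:start + nucleus]
        # Nucleation: at least 4 of 6 contiguous residues with P(alpha) > 1
        if sum(p[aa] > 1.0 for aa in window) < min_formers:
            continue
        left, right = start, start + nucleus      # half-open region [left, right)
        # Extend right until the 4-residue window ending at the next residue averages < 1
        while right < n and mean([p[aa] for aa in seq[right - breaker + 1:right + 1]]) >= 1.0:
            right += 1
        # Extend left until the 4-residue window starting at the previous residue averages < 1
        while left > 0 and mean([p[aa] for aa in seq[left - 1:left - 1 + breaker]]) >= 1.0:
            left -= 1
        # Accept the region only if its average propensity is above 1
        if mean([p[aa] for aa in seq[left:right]]) > 1.0:
            for k in range(left, right):
                helix[k] = True
    return "".join("H" if h else "-" for h in helix)

# Made-up sequence: a helix-favoring stretch flanked by Gly/Ser/Pro
sequence = "GSPGEELLKKLLEEGPSG"
print(sequence)
print(predict_helices(sequence))
```

Strand prediction follows the same pattern with the β-sheet propensities, a 5-residue nucleus and 3 required formers.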

60

Chou and Fasman

Position-specific parameters for turn: each position has distinct amino acid preferences.

Examples:

- At position 2, Pro is highly preferred; Trp is disfavored

- At position 3, Asp, Asn and Gly are preferred

- At position 4, Trp, Gly and Cys are preferred

The four position-specific parameters are written f(i), f(i+1), f(i+2), f(i+3).

61

Chou and Fasman

Predicting turns:

- for each tetrapeptide starting at residue i, compute:

- PTurn (average propensity over all 4 residues)

- F = f(i)*f(i+1)*f(i+2)*f(i+3)

- if PTurn > P(α) and PTurn > P(β) and PTurn > 1 and F > 0.000075, the tetrapeptide is considered a turn.

Chou and Fasman prediction: http://fasta.bioch.virginia.edu/fasta_www/chofas.htm

62

The GOR method

Position-dependent propensities for helix, sheet or turn are calculated for each amino acid. For each position j in the sequence, eight residues on either side are considered.

A helix propensity table contains information about propensity for residues at 17 positions when the conformation of residue j is helical. The helix propensity tables have 20 x 17 entries. Build similar tables for strands and turns.

GOR simplification: The predicted state of AAj is calculated as the sum of the position-dependent propensities of all residues around AAj.

GOR can be used at: http://abs.cit.nih.gov/gor/ (current version is GOR IV)
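The GOR simplification can be sketched as a sum over a 17-residue window. The sketch below assumes each state has a table indexed by window offset (-8 to +8) and residue type, and predicts the state with the largest sum. The tables used here are tiny made-up placeholders with entries only at offset 0; real GOR tables have 20 x 17 entries per state.

```python
def gor_predict(seq, tables, half_window=8):
    """Predict one state per residue by summing position-dependent propensities.

    tables: state -> {offset -> {residue -> propensity}}, with offsets
            from -half_window to +half_window; missing entries count as 0.
    """
    prediction = []
    for j in range(len(seq)):
        scores = {}
        for state, table in tables.items():
            total = 0.0
            for offset in range(-half_window, half_window + 1):
                k = j + offset
                if 0 <= k < len(seq):      # window is truncated at the sequence ends
                    total += table.get(offset, {}).get(seq[k], 0.0)
            scores[state] = total
        prediction.append(max(scores, key=scores.get))
    return "".join(prediction)

# Toy tables (made up): only the central position contributes
toy_tables = {
    "H": {0: {"A": 1.0}},   # helix
    "E": {0: {"V": 1.0}},   # sheet
    "T": {0: {"G": 1.0}},   # turn
}
print(gor_predict("AAVVGG", toy_tables))  # HHEETT
```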

63

Accuracy

• Both Chou and Fasman and GOR have been assessed and their accuracy is estimated to be Q3=60-65%.

(Initially, higher scores were reported, but the experiments set up to measure Q3 were flawed, as the test cases included proteins used to derive the propensities!)

64

Secondary Structure Prediction

Available servers:

- JPRED: http://www.compbio.dundee.ac.uk/~www-jpred/

- PHD: http://cubic.bioc.columbia.edu/predictprotein/

- PSIPRED: http://bioinf.cs.ucl.ac.uk/psipred/

- NNPREDICT: http://www.cmpharm.ucsf.edu/~nomi/nnpredict.html

- Chou and Fasman: http://fasta.bioch.virginia.edu/fasta_www/chofas.htm