CSCE555 Bioinformatics

Lecture 18: Protein Tertiary Structure Prediction

Meeting: MW 4:00PM-5:15PM SWGN2A21

Instructor: Dr. Jianjun Hu

Course page: http://www.scigen.org/csce555

University of South Carolina

Department of Computer Science and Engineering

2008 www.cse.sc.edu

Outline

Experimental limitations of protein structure determination

Tertiary Structure Prediction

◦ Ab initio

◦ Homology modeling

◦ Threading

Experimental Protein Structure Determination

High-resolution structure determination

◦ X-ray crystallography (<1 Å)

◦ Nuclear magnetic resonance (NMR) (~1-2.5 Å)

Lower-resolution structure determination

◦ Cryo-EM (electron microscopy) (~10-15 Å)

Theoretical models?

◦ Highly variable, but a few are equivalent to X-ray!

Tertiary Structure Prediction

Fold (tertiary structure) prediction can be formulated as a search for the minimum-energy conformation

◦ Search space is defined by the phi/psi angles of the backbone and the side-chain rotamers

◦ Search space is enormous even for small proteins!

◦ Number of local minima increases exponentially with the number of residues

Computationally it is an exceedingly difficult problem!

Levinthal Paradox of Protein Folding: How does nature search?

Assume that each amino acid has three conformations (e.g. α-helix, β-sheet and random coil). If a protein is made up of 100 amino acid residues, the total number of conformations is

$3^{100} = 515377520732011331036461129765621272702107522001 \approx 5 \times 10^{47}$.

If 100 psec ($10^{-10}$ sec) were required to convert from one conformation to another, a random search of all conformations would require

$5 \times 10^{47} \times 10^{-10}$ sec $\approx 1.6 \times 10^{30}$ years.

However, proteins fold on a timescale of milliseconds to seconds. Therefore, proteins fold not via a random search but via a more sophisticated search process.
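The arithmetic behind the Levinthal estimate can be checked in a few lines. This is a minimal sketch using the figures quoted above (3 conformations per residue, 100 residues, 100 psec per conformational change); the length of a year in seconds is an assumption added here for the unit conversion.

```python
# Levinthal estimate with the numbers from the slide.
STATES_PER_RESIDUE = 3
N_RESIDUES = 100
SECONDS_PER_STEP = 1e-10                 # 100 psec per conformational change
SECONDS_PER_YEAR = 365.25 * 24 * 3600    # assumption: Julian year

# Python integers are arbitrary precision, so this count is exact.
n_conformations = STATES_PER_RESIDUE ** N_RESIDUES

search_seconds = n_conformations * SECONDS_PER_STEP
search_years = search_seconds / SECONDS_PER_YEAR

print(f"conformations: {n_conformations:.2e}")
print(f"search time:   {search_years:.2e} years")
```

The result is on the order of 10^30 years, many orders of magnitude longer than the observed folding times.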

We want to watch the folding process of a protein using molecular simulation techniques.

Steps in Protein Folding

1- "Collapse" - driving force is burial of hydrophobic amino acids (fast - msecs)

2- Molten globule - helices & sheets form, but "loose" (slow - secs)

3- "Final" native folded state - compaction, some secondary structures rearranged

Native state? - assumed to be lowest free energy - may be an ensemble of structures


Protein Folding Funnel

[Figure: folding funnel energy landscape; local minima lie along the funnel, and the native structure sits at the global minimum]

Protein Structure Prediction

Ab initio

◦ Use just first principles: energy, geometry, and kinematics

Homology

◦ Find the best match to a database of sequences with known 3D structure

Combinations

Threading

Meta-servers and other methods

Knowledge-based approaches


Ab Initio Prediction

Basic idea

Anfinsen's thermodynamic hypothesis: the native structure of a protein corresponds to the state with the lowest free energy of the protein-solvent system.

General procedure

◦ Develop a potential/energy function

Evaluate the energy of a protein conformation

Select the native structure

◦ Conformational search algorithm

Produce new conformations

Search the potential energy surface and locate the global minimum (native conformation)

Provides both folding pathway & folded structure

Can only be applied to very small proteins


Potential Functions for PSP

Potential function

◦ Physics-based energy function

Empirical all-atom forcefields: CHARMM, AMBER, ECEPP-3, GROMOS, OPLS

Parameterization: quantum mechanical calculations, experimental data

Simplified potential: UNRES (united residue)

◦ Solvation energy

Implicit solvation model: Generalized Born (GB) model, surface area based model

Explicit solvation model: TIP3P (computationally expensive)


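General Form of All-atom Forcefields

$$
V_{total} = \sum_{bonds} K_b (r - r_0)^2
+ \sum_{angles} K_\theta (\theta - \theta_0)^2
+ \sum_{dihedrals} K_\phi \left[ 1 + \cos(n\phi) \right]
+ \sum_{H\,bonds} \left( \frac{C_{ij}}{r_{ij}^{12}} - \frac{D_{ij}}{r_{ij}^{10}} \right)
+ \sum_{\text{van der Waals pairs } i,j} \left( \frac{A_{ij}}{r_{ij}^{12}} - \frac{B_{ij}}{r_{ij}^{6}} \right)
+ \sum_{\text{electrostatic pairs } i,j} \frac{q_i q_j}{r_{ij}}
$$

Terms, in order: bond stretching, angle bending, dihedral, H-bonding, van der Waals, electrostatic.

[Figure: diagrams of each term, showing the bond length r, bond angle θ, dihedral angle φ, the O···H hydrogen-bond distance, and the distance r between a positive and a negative charge]

The non-bonded pair terms (van der Waals and electrostatic) are the most time-demanding part, because the number of atom pairs grows quadratically with the number of atoms.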
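To illustrate why the pair sums dominate the cost, the sketch below evaluates only the 12-6 van der Waals and electrostatic terms over all atom pairs. It is a toy under stated assumptions: the coordinates and charges are made up, a single `A` and `B` is used for every pair (real force fields assign them per atom-type pair), and `coulomb_k` is a placeholder unit constant.

```python
import math
from itertools import combinations

def nonbonded_energy(coords, charges, A=1.0, B=1.0, coulomb_k=1.0):
    """Sum 12-6 van der Waals and Coulomb terms over all atom pairs.

    A, B and coulomb_k are placeholder constants (assumptions for this sketch).
    """
    e_vdw = 0.0
    e_elec = 0.0
    # combinations() visits each unordered pair once: N*(N-1)/2 pairs in total.
    for i, j in combinations(range(len(coords)), 2):
        r = math.dist(coords[i], coords[j])
        e_vdw += A / r**12 - B / r**6
        e_elec += coulomb_k * charges[i] * charges[j] / r
    return e_vdw, e_elec

# Hypothetical three-atom system.
coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 2.0, 0.0)]
charges = [0.4, -0.4, 0.0]
e_vdw, e_elec = nonbonded_energy(coords, charges)
n_pairs = len(coords) * (len(coords) - 1) // 2
print(f"{n_pairs} pairs: E_vdw = {e_vdw:.4f}, E_elec = {e_elec:.4f}")
```

The bonded terms scale with the number of atoms, while this loop scales with the number of pairs, which is why practical codes introduce distance cutoffs or neighbor lists.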


Search Potential Energy Surface

We are interested in minimum points on the Potential Energy Surface (PES)

Conformational search techniques

Energy Minimization

Monte Carlo

Molecular Dynamics

Others: Genetic Algorithm, Simulated Annealing


Energy Minimization

Methods

First-order minimization: steepest descent, conjugate gradient minimization

Second-derivative methods: Newton-Raphson method

Quasi-Newton methods: L-BFGS

Energy minimization only locates a local minimum
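The local-minimum limitation is easy to see on a toy surface. The sketch below runs fixed-step steepest descent on a one-dimensional double-well function; the energy function, step size and starting points are all assumptions chosen for illustration, not a protein force field.

```python
def energy(x):
    """Toy 1D double-well energy: a shallow well near x = +1, a deeper one near x = -1."""
    return (x**2 - 1.0)**2 + 0.3 * x

def gradient(x):
    """Analytical derivative of energy(x)."""
    return 4.0 * x * (x**2 - 1.0) + 0.3

def steepest_descent(x, step=0.01, tol=1e-8, max_iter=10000):
    """Follow the negative gradient with a fixed step until the gradient vanishes."""
    for _ in range(max_iter):
        g = gradient(x)
        if abs(g) < tol:
            break
        x -= step * g
    return x

x_local = steepest_descent(1.5)    # starts in the basin of the shallow well
x_global = steepest_descent(-1.5)  # starts in the basin of the deep well
print(f"from +1.5: x = {x_local:.3f}, E = {energy(x_local):.3f}")
print(f"from -1.5: x = {x_global:.3f}, E = {energy(x_global):.3f}")
```

Which minimum is found depends entirely on the starting point, so minimization is usually combined with a search method that can cross barriers.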


Monte Carlo

In molecular simulations, 'Monte Carlo' is an importance sampling technique.

1. Make a random move and produce a new conformation

2. Calculate the energy change $\Delta E$ for the new conformation

3. Accept or reject the move based on the Metropolis criterion

$P = \exp\left(-\frac{\Delta E}{kT}\right)$ (Boltzmann factor)

If $\Delta E < 0$, then $P > 1$: accept the new conformation.

Otherwise: accept if $P > \mathrm{rand}(0,1)$, else reject.
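The three steps above can be sketched as a short Metropolis loop. This is a minimal illustration on a toy one-dimensional double-well energy; the energy function, the temperature `kT`, the move size and the fixed random seed are assumptions for the example, and a real simulation would move atoms or torsion angles instead of a single coordinate.

```python
import math
import random

def energy(x):
    """Toy 1D double-well energy (an assumption, not a protein force field)."""
    return (x**2 - 1.0)**2 + 0.3 * x

def accept_probability(delta_e, kT):
    """Metropolis acceptance probability: 1 for downhill moves, exp(-dE/kT) otherwise."""
    if delta_e <= 0:
        return 1.0
    return math.exp(-delta_e / kT)

def metropolis(x, n_steps, kT, max_move=0.5, seed=0):
    """Run n_steps of Metropolis sampling; return the final and the lowest-energy x."""
    rng = random.Random(seed)
    best_x = x
    for _ in range(n_steps):
        x_new = x + rng.uniform(-max_move, max_move)         # 1. random move
        delta_e = energy(x_new) - energy(x)                  # 2. energy change
        if rng.random() < accept_probability(delta_e, kT):   # 3. Metropolis test
            x = x_new
            if energy(x) < energy(best_x):
                best_x = x
    return x, best_x

# Start in the shallow well near x = +0.96.
final_x, best_x = metropolis(x=0.96, n_steps=5000, kT=0.3)
print(f"final x = {final_x:.3f}, lowest-energy x seen = {best_x:.3f}")
```

Because uphill moves are sometimes accepted, the walker can leave the well it starts in, which plain energy minimization cannot do.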

Ab initio Prediction – CASP results

Comparative Modeling (Knowledge-based approach)

Provides folded structure only

Two primary methods

1) Homology modeling

2) Threading (fold recognition)

Both rely on the availability of experimentally determined structures that are "homologous" or at least structurally very similar to the target

Homology Modeling

1. Identify homologous protein sequences (PSI-BLAST)

2. Among available structures, choose the one with the closest sequence match to the target as template (can combine steps 1 & 2 by using PDB-BLAST)

3. Build the model by placing residues in the corresponding positions of the homologous structure & refine by "tweaking"

Homology modeling - works "well"

• Computationally? Not very expensive

• Accuracy? Higher sequence identity → better model

Requires ~30% sequence identity with a sequence for which the structure is known
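Checking a candidate template against the ~30% rule of thumb only needs the percent identity of the target-template alignment. The helper below is a small sketch; the function name, the example sequences and the choice of denominator (aligned columns without gaps) are assumptions, and other definitions of percent identity are in use.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over the aligned, gap-free columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    aligned_cols = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue                      # skip columns containing a gap
        aligned_cols += 1
        matches += (a == b)
    return 100.0 * matches / aligned_cols if aligned_cols else 0.0

# Hypothetical pre-aligned target and template sequences.
target   = "ACDEFGHIK-"
template = "ACDQFGHLKM"
identity = percent_identity(target, template)
print(f"identity = {identity:.1f}%, usable as template: {identity >= 30.0}")
```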

Homology-based Prediction

Raw model

Loop modeling

Side chain placement

Refinement


Threading - Fold Recognition

Identify the "best" fit between target sequence & template structure

Threading - works "sometimes"

• Computationally? Can be expensive or cheap, depending on the energy function & whether threading is "all atom" or "backbone only"

• Accuracy? In theory, should not depend on sequence identity (should depend on quality of template library & "luck")

Usually, higher sequence identity to a protein of known structure → better model

Threading Algorithm for PSP

Database of 3D structures and sequences

◦ Protein Data Bank (or non-redundant subset)

Query sequence

◦ Sequence with < 25% identity to known structures

Alignment protocol

◦ Dynamic programming

Evaluation protocol

◦ Distance-based potential or secondary structure

Ranking protocol

Threading

Basic premise:

The number of unique structural folds in nature is fairly small (probably 2000-3000)

Statistics from the Protein Data Bank (~40,000 structures)

◦ Until very recently, 90% of new structures submitted to PDB had similar structural folds in PDB

Thus, chances for a protein to have a native-like structural fold in PDB are quite good

◦ Note: Proteins with similar structural folds could be either homologs or analogs

Steps in Threading

1. Align target sequence with template structures (fold library) from the Protein Data Bank (PDB)

2. Calculate energy score to evaluate goodness of fit between target sequence & template structure

3. Rank models based on energy scores

[Figure: target sequence ALKKGF…HFDTSE threaded onto a set of structure templates]
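The three threading steps can be sketched on a deliberately tiny example. Everything here is an assumption for illustration: templates are reduced to strings of buried ('b') and exposed ('e') positions, the hydrophobic residue set and the ±1 scoring are crude placeholders, and the alignment is gapless sliding where a real threader would use dynamic programming with gaps.

```python
HYDROPHOBIC = set("AVLIMFWC")   # assumption: a crude hydrophobic residue set

def fit_energy(residue, environment):
    """-1 if the residue suits the environment ('b' buried, 'e' exposed), else +1."""
    buried = environment == "b"
    return -1 if (residue in HYDROPHOBIC) == buried else 1

def thread(target, template_env):
    """Steps 1-2: slide the target along the template without gaps, keep the lowest energy."""
    best = None
    for offset in range(len(template_env) - len(target) + 1):
        e = sum(fit_energy(r, template_env[offset + k]) for k, r in enumerate(target))
        if best is None or e < best:
            best = e
    return best   # None if the template is shorter than the target

def rank_templates(target, library):
    """Step 3: sort templates by energy, lowest (best) first."""
    scored = [(thread(target, env), name) for name, env in library.items()]
    return sorted((e, name) for e, name in scored if e is not None)

# Hypothetical fold library and target.
library = {"foldA": "bebe", "foldB": "eeee", "foldC": "ebebeb"}
ranking = rank_templates("LKVE", library)
for e, name in ranking:
    print(name, e)
```

In this toy case the alternating buried/exposed templates fit the alternating hydrophobic/polar target best, and the fully exposed template ranks last.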

Threading Issues

Structure database - must be complete: no decent model if no good template in library!

Sequence-structure alignment algorithm: bad alignment → bad score!

Energy function (scoring scheme):

◦ must distinguish the correct sequence-fold alignment from incorrect sequence-fold alignments

◦ must distinguish the "correct" fold from close decoys

Prediction reliability assessment - how do we determine whether the predicted structure is correct (or even close)?

Find the "correct" sequence-structure alignment of a target sequence with its native-like fold in PDB

Threading: Template database

Build a database of structural templates (e.g., ASTRAL domain library derived from the PDB)

Supplement with additional decoys, e.g., generated using an ab initio approach such as Rosetta (Baker)

Threading: Energy function

Two main methods (and combinations of these)

Structural profile (environmental): based on physico-chemical properties of amino acids

Contact potential (statistical): based on contact statistics from PDB, Miyazawa & Jernigan (ISU)

Protein Threading: Typical energy function

How well does a specific residue fit its structural environment? (singleton term, $E_s$)

What is the "probability" that two specific residues are in contact? (pairwise term, $E_p$)

Alignment gap penalty? (gap term, $E_g$)

Total energy: $E_p + E_s + E_g$

Goal: Find a sequence-structure alignment that minimizes the energy function
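A total energy of this three-term form can be evaluated for one fixed alignment as sketched below. The scoring functions are toy assumptions (a ±1 environment fit, a reward for hydrophobic-hydrophobic contacts, a flat penalty per unaligned target residue); published threading potentials are derived from PDB statistics and are far richer.

```python
HYDROPHOBIC = set("AVLIMFWC")   # assumption: a crude hydrophobic residue set

def singleton_energy(residue, environment):
    """Environment fit: -1 if hydrophobic/buried or polar/exposed, else +1."""
    buried = environment == "b"
    return -1.0 if (residue in HYDROPHOBIC) == buried else 1.0

def pair_energy(res_a, res_b):
    """Toy contact potential: reward hydrophobic-hydrophobic contacts only."""
    return -1.0 if res_a in HYDROPHOBIC and res_b in HYDROPHOBIC else 0.0

def total_energy(target, alignment, environments, contacts, gap_penalty=2.0):
    """Return E_p + E_s + E_g for one sequence-structure alignment.

    alignment[k] is the template position of target residue k, or None for a gap.
    contacts is a list of template position pairs that touch in the 3D structure.
    """
    placed = {pos: target[k] for k, pos in enumerate(alignment) if pos is not None}
    e_s = sum(singleton_energy(res, environments[pos]) for pos, res in placed.items())
    e_p = sum(pair_energy(placed[i], placed[j])
              for i, j in contacts if i in placed and j in placed)
    e_g = gap_penalty * sum(1 for pos in alignment if pos is None)
    return e_p + e_s + e_g

# Hypothetical template: four positions, positions 0 and 2 in contact.
environments = "bebe"
contacts = [(0, 2)]
e_full = total_energy("LKVE", [0, 1, 2, 3], environments, contacts)
e_gapped = total_energy("LKVE", [0, None, 2, 3], environments, contacts)
print(e_full, e_gapped)
```

A threading program searches over alignments for the one with the lowest total; here only the target-side gaps are penalized, which is a simplification.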

CAFASP

GOAL

The goal of CAFASP is to evaluate the performance of fully automatic structure prediction servers available to the community. In contrast to the normal CASP procedure, CAFASP aims to answer the question of how well servers do without any intervention of experts, i.e. how well ANY user using only automated methods can predict protein structure. CAFASP assesses the performance of methods without the user intervention allowed in CASP.

Performance Evaluation in CAFASP3

Servers (54 in total)      Sum MaxSub Score      # correct (30 FR targets)
3ds5, robetta              5.17-5.25             15-17
pmod, 3ds3, pmode3         4.21-4.36             13-14
RAPTOR                     3.98                  13
shgu                       3.93                  13
3dsn                       3.64-3.90             12-13
pcons3                     3.75                  12
fugu3, orf_c               3.38-3.67             11-12
…                          …                     …
pdbblast                   0.00                  0

(http://ww.cs.bgu.ac.il/~dfischer/CAFASP3, released in December, 2002.)

Servers with name in italic are meta servers

MaxSub score ranges from 0 to 1 per target

Therefore, the maximum total score over the 30 targets is 30

One structure where RAPTOR did best

Red: true structure

Blue: correct part of prediction

Green: wrong part of prediction

• Target size: 144

• Super-imposable size within 5 Å: 118

• RMSD: 1.9 Å

Some more results by other programs

Summary of current state of the art

Automated Web-Based Homology Modeling

SWISS Model : http://www.expasy.org/swissmod/SWISS-MODEL.html

WHAT IF : http://www.cmbi.kun.nl/swift/servers/

The CPHModels Server : http://www.cbs.dtu.dk/services/CPHmodels/

3D Jigsaw : http://www.bmm.icnet.uk/~3djigsaw/

SDSC1 : http://cl.sdsc.edu/hm.html

EsyPred3D : http://www.fundp.ac.be/urbm/bioinfo/esypred/

Comparative Modeling Server & Program

COMPOSER http://www.tripos.com/sciTech/inSilicoDisc/bioInformatics/matchmaker.html

MODELER http://salilab.org/modeler

InsightII http://www.msi.com/

SYBYL http://www.tripos.com/
