Date posted: 12-Jan-2016
CSCE555 Bioinformatics
Lecture 18: Protein Tertiary Structure Prediction
Meeting: MW 4:00PM-5:15PM SWGN2A21
Instructor: Dr. Jianjun Hu
Course page: http://www.scigen.org/csce555
University of South Carolina
Department of Computer Science and Engineering
2008 www.cse.sc.edu
Outline
Experimental limitation of protein structure determination
Tertiary Structure Prediction
◦ Ab initio
◦ Homology modeling
◦ Threading
Experimental Protein Structure Determination
High-resolution structure determination
◦ X-ray crystallography (<1 Å)
◦ Nuclear magnetic resonance (NMR) (~1-2.5 Å)
Lower-resolution structure determination
◦ Cryo-EM (electron microscopy) (~10-15 Å)
Theoretical models?
◦ Highly variable, but a few are equivalent to X-ray!
Tertiary Structure Prediction
The fold (tertiary structure) prediction problem can be formulated as a search for the minimum-energy conformation
◦ Search space is defined by the phi/psi angles of the backbone and the side-chain rotamers
◦ Search space is enormous even for small proteins!
◦ Number of local minima increases exponentially with the number of residues
Computationally it is an exceedingly difficult problem!
Levinthal Paradox of Protein Folding: How does nature search?
Assume that there are three conformations for each amino acid (e.g., α-helix, β-sheet, and random coil). If a protein is made up of 100 amino acid residues, the total number of conformations is
$3^{100} = 515377520732011331036461129765621272702107522001 \approx 5 \times 10^{47}$.
If 100 psec ($10^{-10}$ sec) were required to convert from one conformation to another, a random search of all conformations would require
$5 \times 10^{47} \times 10^{-10}$ sec $\approx 1.6 \times 10^{30}$ years.
However, protein folding takes place on the order of msec to sec. Therefore, proteins fold not via a random search but via a more sophisticated search process.
We want to watch the folding process of a protein using molecular simulation techniques.
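The Levinthal estimate above can be checked with a few lines of Python. This is only an arithmetic sketch: the function name and the use of a 365.25-day year are assumptions made for illustration, not part of the lecture.

```python
# Assumption: one year is taken as 365.25 days.
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def levinthal_estimate(n_residues=100, states_per_residue=3, seconds_per_step=1e-10):
    """Return (number of conformations, years needed to visit each one once)."""
    conformations = states_per_residue ** n_residues  # exact Python integer
    total_seconds = conformations * seconds_per_step
    return conformations, total_seconds / SECONDS_PER_YEAR


conformations, years = levinthal_estimate()
print(conformations)
print(f"{conformations:.2e} conformations, about {years:.1e} years")
```

Changing `n_residues` shows how quickly the count grows: every added residue multiplies the search space by three.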
Steps in Protein Folding
1- "Collapse" - driving force is burial of hydrophobic amino acids (fast - msecs)
2- Molten globule - helices & sheets form, but "loose" (slow - secs)
3- "Final" native folded state - compaction, some secondary structures rearranged
Native state? - assumed to be lowest free energy - may be an ensemble of structures
Protein Folding Funnel
[Figure: folding funnel energy landscape with local minima and a global minimum corresponding to the native structure]
Protein Structure Prediction
Ab initio
◦ Use just first principles: energy, geometry, and kinematics
Homology
◦ Find the best match to a database of sequences with known 3D structure
Combinations
Threading
Meta-servers and other methods
Knowledge based approaches
Ab Initio Prediction
Basic idea
Anfinsen’s theory: the native structure of a protein corresponds to the state with the lowest free energy of the protein-solvent system.
General procedures
◦ Develop a potential/energy function
Evaluate the energy of a protein conformation
Select the native structure
◦ Conformational search algorithm
To produce new conformations
Search the potential energy surface and locate the global minimum (native conformation)
Provides both folding pathway & folded structure
Can only be applied to very small proteins
Potential Functions for PSP (Protein Structure Prediction)
Potential function
◦ Physics-based energy function
Empirical all-atom force fields: CHARMM, AMBER, ECEPP-3, GROMOS, OPLS
Parameterization: quantum mechanical calculations, experimental data
Simplified potential: UNRES (united residue)
◦ Solvation energy
Implicit solvation model: Generalized Born (GB) model, surface-area-based model
Explicit solvation model: TIP3P (computationally expensive)
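General Form of All-atom Force Fields
$$
V_{\text{total}} = \sum_{\text{bonds}} K_b (r - r_0)^2
+ \sum_{\text{angles}} K_\theta (\theta - \theta_0)^2
+ \sum_{\text{dihedrals}} K_\phi \left[ 1 + \cos(n\phi - \gamma) \right]
+ \sum_{\text{H-bonds}} \left( \frac{C_{ij}}{r_{ij}^{12}} - \frac{D_{ij}}{r_{ij}^{10}} \right)
+ \sum_{\text{van der Waals pairs } i,j} \left( \frac{A_{ij}}{r_{ij}^{12}} - \frac{B_{ij}}{r_{ij}^{6}} \right)
+ \sum_{\text{electrostatic pairs } i,j} \frac{q_i q_j}{\varepsilon r_{ij}}
$$
Terms, in order: bond stretching, angle bending, dihedral, H-bonding, van der Waals, electrostatic.
The non-bonded pair terms (van der Waals and electrostatic) are the most time-demanding part.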
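As a rough illustration of how the individual force-field terms are evaluated, the sketch below implements each term as a small Python function. All function names, parameter values, and the unit-free treatment of the electrostatic term are assumptions for illustration; real force fields such as CHARMM or AMBER use their own parameter tables, units, and cutoffs.

```python
import math
from itertools import combinations


def bond_term(r, r0, k_b):
    """Bond stretching: K_b (r - r0)^2."""
    return k_b * (r - r0) ** 2


def angle_term(theta, theta0, k_theta):
    """Angle bending: K_theta (theta - theta0)^2, angles in radians."""
    return k_theta * (theta - theta0) ** 2


def dihedral_term(phi, k_phi, n, gamma):
    """Dihedral: K_phi [1 + cos(n*phi - gamma)]."""
    return k_phi * (1.0 + math.cos(n * phi - gamma))


def hbond_term(r, c, d):
    """H-bond 12-10 term: C / r^12 - D / r^10."""
    return c / r ** 12 - d / r ** 10


def vdw_term(r, a, b):
    """Van der Waals 12-6 term: A / r^12 - B / r^6."""
    return a / r ** 12 - b / r ** 6


def electrostatic_term(r, q_i, q_j, eps=1.0):
    """Coulomb term q_i q_j / (eps * r); unit conversion constants are omitted."""
    return q_i * q_j / (eps * r)


def nonbonded_energy(coords, charges, a, b, eps=1.0):
    """Sum van der Waals and electrostatic terms over all atom pairs.

    Assumes one shared (a, b) pair for every atom pair, which is a simplification.
    """
    total = 0.0
    for i, j in combinations(range(len(coords)), 2):
        r = math.dist(coords[i], coords[j])
        total += vdw_term(r, a, b) + electrostatic_term(r, charges[i], charges[j], eps)
    return total


# Two opposite unit charges 2 distance units apart, van der Waals switched off.
print(nonbonded_energy([(0, 0, 0), (2, 0, 0)], [1.0, -1.0], a=0.0, b=0.0))
```

The double loop in `nonbonded_energy` visits N(N-1)/2 pairs for N atoms, which is why the non-bonded terms dominate the cost as the system grows.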
Search Potential Energy Surface
We are interested in the minimum points on the Potential Energy Surface (PES)
Conformational search techniques
Energy Minimization
Monte Carlo
Molecular Dynamics
Others: Genetic Algorithm, Simulated Annealing
Energy Minimization
Energy minimization methods
First-order minimization: steepest descent, conjugate gradient minimization
Second-derivative methods: Newton-Raphson method
Quasi-Newton methods: L-BFGS
[Figure label: local minimum]
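To make the "local minimum" behaviour concrete, here is a minimal steepest-descent sketch on a one-dimensional double-well function. The energy function, step size, and tolerances are all assumptions chosen for illustration; a protein energy surface has thousands of dimensions, but the qualitative point is the same.

```python
def energy(x):
    """Hypothetical double-well energy with two minima of different depth."""
    return (x * x - 1.0) ** 2 + 0.3 * x


def gradient(x):
    """Analytical derivative of energy(x)."""
    return 4.0 * x * (x * x - 1.0) + 0.3


def steepest_descent(x0, step=0.01, tol=1e-8, max_iter=100000):
    """Follow the negative gradient until it is (nearly) zero."""
    x = x0
    for _ in range(max_iter):
        g = gradient(x)
        if abs(g) < tol:
            break
        x -= step * g
    return x


x_right = steepest_descent(1.5)   # starts in the right-hand basin
x_left = steepest_descent(-1.5)   # starts in the left-hand basin
print(f"from  1.5: x = {x_right:.3f}, E = {energy(x_right):.3f}")
print(f"from -1.5: x = {x_left:.3f}, E = {energy(x_left):.3f}")
```

The two runs end in different minima with different energies: a minimizer only slides downhill into the basin it starts in, so it cannot by itself locate the global minimum.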
Monte Carlo
In molecular simulations, ‘Monte Carlo’ is an importance sampling technique.
1. Make a random move and produce a new conformation
2. Calculate the energy change ΔE for the new conformation
3. Accept or reject the move based on the Metropolis criterion
$$
P = \exp\left( -\frac{\Delta E}{kT} \right) \quad \text{(Boltzmann factor)}
$$
If ΔE < 0, then P > 1: accept the new conformation.
Otherwise: if P > rand(0,1), accept; else reject.
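The three steps above can be sketched as a short Metropolis Monte Carlo loop. The one-dimensional energy function, the move size, the value of kT, and the fixed random seed are assumptions for illustration only; a real simulation would move atoms or torsion angles and use a force-field energy.

```python
import math
import random


def energy(x):
    """Hypothetical double-well energy standing in for a conformational energy."""
    return (x * x - 1.0) ** 2 + 0.3 * x


def metropolis_accept(delta_e, kt, rng):
    """Metropolis criterion: always accept downhill moves, uphill with P = exp(-dE/kT)."""
    if delta_e < 0:
        return True
    p = math.exp(-delta_e / kt)
    return p > rng.random()


def monte_carlo(x0, kt=0.3, step=0.5, n_steps=5000, seed=0):
    """Return the lowest-energy point visited during the walk."""
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    for _ in range(n_steps):
        x_new = x + rng.uniform(-step, step)      # 1. random move
        e_new = energy(x_new)
        delta_e = e_new - e                       # 2. energy change
        if metropolis_accept(delta_e, kt, rng):   # 3. accept or reject
            x, e = x_new, e_new
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e


best_x, best_e = monte_carlo(1.5)
print(f"lowest energy visited: E = {best_e:.3f} at x = {best_x:.3f}")
```

Because uphill moves are sometimes accepted, the walk can cross barriers between minima, which plain energy minimization cannot do. Lowering `kt` makes uphill moves rarer.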
Ab initio Prediction – CASP results
Comparative Modeling (Knowledge-based approach)
Provides folded structure only
Two primary methods
1) Homology modeling
2) Threading (fold recognition)
Both rely on the availability of experimentally determined structures that are "homologous" or at least structurally very similar to the target
Homology Modeling
1. Identify homologous protein sequences (PSI-BLAST)
2. Among available structures, choose the one with the closest sequence match to the target as template (can combine steps 1 & 2 by using PDB-BLAST)
3. Build model by placing residues in corresponding positions of the homologous structure & refine by "tweaking"
Homology modeling - works "well"
• Computationally? Not very expensive
• Accuracy? Higher sequence identity → better model
Requires ~30% sequence identity with a sequence for which the structure is known
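The ~30% rule of thumb refers to the sequence identity of a target-template alignment. The sketch below shows one way to compute it from two already-aligned sequences. The function names, the choice of denominator (columns where neither sequence has a gap), and the toy sequences are assumptions; alignment tools may define percent identity slightly differently.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over columns where neither aligned sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip gapped columns
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def usable_template(aligned_target, aligned_template, threshold=30.0):
    """Apply the ~30% identity rule of thumb from the slide."""
    return percent_identity(aligned_target, aligned_template) >= threshold


# Toy alignment, invented for illustration.
target = "ALKKGF-HFDTSE"
template = "ALRKGYQHWDT-E"
print(f"{percent_identity(target, template):.1f}% identity")
print("usable template:", usable_template(target, template))
```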
Homology-based Prediction
Raw model
Loop modeling
Side chain placement
Refinement
Threading - Fold Recognition
Identify “best” fit between target sequence & template structure
Threading - works "sometimes"
• Computationally? Can be expensive or cheap, depends on energy function & whether "all atom" or "backbone only" threading
• Accuracy? In theory, should not depend on sequence identity (should depend on quality of template library & "luck")
Usually, higher sequence identity to a protein of known structure → better model
Threading Algorithm for PSP
Database of 3D structures and sequences
◦ Protein Data Bank (or non-redundant subset)
Query sequence
◦ Sequence < 25% identity to known structures
Alignment protocol
◦ Dynamic programming
Evaluation protocol
◦ Distance-based potential or secondary structure
Ranking protocol
Threading
Basic premise:
The number of unique structural folds in nature is fairly small (probably 2000-3000)
Statistics from Protein Data Bank (~40,000 structures)
Until very recently, 90% of new structures submitted to PDB had similar structural folds in PDB
Thus, chances for a protein to have a native-like structural fold in PDB are quite good
◦ Note: Proteins with similar structural folds could be either homologs or analogs
Steps in Threading
1. Align target sequence with template structures (fold library) from the Protein Data Bank (PDB)
2. Calculate energy score to evaluate goodness of fit between target sequence & template structure
3. Rank models based on energy scores
[Figure: target sequence (ALKKGF…HFDTSE) aligned against structure templates]
Threading Issues
Structure database - must be complete: no decent model if no good template in library!
Sequence-structure alignment algorithm:
Find “correct” sequence-structure alignment of a target sequence with its native-like fold in PDB
Bad alignment → bad score!
Energy function (scoring scheme):
must distinguish correct sequence-fold alignment from incorrect sequence-fold alignments
must distinguish “correct” fold from close decoys
Prediction reliability assessment - how to determine whether the predicted structure is correct (or even close)?
Threading: Template database
Build a database of structural templates (e.g., ASTRAL domain library derived from the PDB)
Supplement with additional decoys, e.g., generated using an ab initio approach such as Rosetta (Baker)
Threading: Energy function
Two main methods (and combinations of these)
Structural profile (environmental): based on physico-chemical properties of amino acids
Contact potential (statistical): based on contact statistics from PDB
Miyazawa & Jernigan (ISU)
Protein Threading: Typical energy function
How well does a specific residue fit its structural environment? (Es)
What is the "probability" that two specific residues are in contact? (Ep)
Alignment gap penalty? (Eg)
Total energy: Ep + Es + Eg
Goal: Find a sequence-structure alignment that minimizes the energy function
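A toy version of this scoring scheme is sketched below. Everything in it is hypothetical: the hydrophobic residue set, the numeric scores, the template format, and the template names are invented for illustration. It also uses a trivial ungapped alignment in place of the dynamic programming search a real threading program would run.

```python
HYDROPHOBIC = set("AVLIMFWC")  # assumed residue classification


def singleton_energy(residue, environment):
    """Es: hypothetical fit of one residue to a 'buried' or 'exposed' position."""
    hydrophobic = residue in HYDROPHOBIC
    if environment == "buried":
        return -1.0 if hydrophobic else 1.0
    return 0.5 if hydrophobic else -0.5


def pair_energy(res_a, res_b):
    """Ep: hypothetical contact score; hydrophobic-hydrophobic contacts are favourable."""
    return -1.0 if (res_a in HYDROPHOBIC and res_b in HYDROPHOBIC) else 0.2


def threading_energy(sequence, template, alignment, gap_penalty=2.0):
    """Total energy Ep + Es + Eg.

    alignment[k] is the sequence index placed at template position k, or None for a gap.
    """
    e_s = sum(
        singleton_energy(sequence[i], template["environments"][k])
        for k, i in enumerate(alignment)
        if i is not None
    )
    e_p = sum(
        pair_energy(sequence[alignment[a]], sequence[alignment[b]])
        for a, b in template["contacts"]
        if alignment[a] is not None and alignment[b] is not None
    )
    e_g = gap_penalty * sum(1 for i in alignment if i is None)
    return e_p + e_s + e_g


def rank_templates(sequence, templates):
    """Score each template with an ungapped alignment and sort by energy (lowest first)."""
    scored = []
    for name, template in templates.items():
        n = len(template["environments"])
        alignment = [k if k < len(sequence) else None for k in range(n)]
        scored.append((threading_energy(sequence, template, alignment), name))
    return sorted(scored)


templates = {
    "fold_a": {
        "environments": ["buried", "exposed", "buried", "exposed", "buried"],
        "contacts": [(0, 2), (2, 4)],
    },
    "fold_b": {
        "environments": ["exposed", "buried", "exposed", "buried", "exposed"],
        "contacts": [(1, 3)],
    },
}
for score, name in rank_templates("LKVEL", templates):
    print(f"{name}: {score:.1f}")
```

In this toy case the sequence places its hydrophobic residues at the buried positions of `fold_a`, so that template gets the lower energy and ranks first.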
CAFASP
GOAL
The goal of CAFASP is to evaluate the performance of fully automatic structure prediction servers available to the community. In contrast to the normal CASP procedure, CAFASP aims to answer the question of how well servers do without any intervention of experts, i.e. how well ANY user using only automated methods can predict protein structure. CAFASP assesses the performance of methods without the user intervention allowed in CASP.
Performance Evaluation in CAFASP3
Servers (54 in total)    Sum MaxSub Score    # correct (30 FR targets)
3ds5, robetta            5.17-5.25           15-17
pmod, 3ds3, pmode3       4.21-4.36           13-14
RAPTOR                   3.98                13
shgu                     3.93                13
3dsn                     3.64-3.90           12-13
pcons3                   3.75                12
fugu3, orf_c             3.38-3.67           11-12
…                        …                   …
pdbblast                 0.00                0
(http://ww.cs.bgu.ac.il/~dfischer/CAFASP3, released in December, 2002.)
Servers with name in italic are meta servers
MaxSub score ranges from 0 to 1; therefore, the maximum total score is 30
One structure where RAPTOR did best
Red: true structure
Blue: correct part of prediction
Green: wrong part of prediction
• Target size: 144
• Super-imposable size within 5 Å: 118
• RMSD: 1.9 Å
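The two numbers reported above, RMSD and the count of positions within 5 Å, can be computed as in the sketch below. It assumes the predicted and true coordinates have already been optimally superimposed and are given as matched lists of (x, y, z) tuples; the superposition step itself is not shown, and the coordinates are invented for illustration.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between matched, already superimposed coordinates."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


def count_within(coords_a, coords_b, cutoff=5.0):
    """Number of matched positions whose distance is at most the cutoff (in Å)."""
    return sum(1 for a, b in zip(coords_a, coords_b) if math.dist(a, b) <= cutoff)


# Invented three-residue example: two positions are close, one is far off.
native = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
model = [(0.0, 0.0, 1.0), (3.8, 0.0, -1.0), (7.6, 6.0, 0.0)]
print(f"RMSD = {rmsd(native, model):.2f} Å")
print(f"within 5 Å: {count_within(native, model)} of {len(native)}")
```

A single badly placed residue inflates the RMSD noticeably, which is one reason a "super-imposable size within a cutoff" count is reported alongside it.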
Some more results by other programs
Summary of current state of the art
Automated Web-Based Homology Modeling
SWISS Model : http://www.expasy.org/swissmod/SWISS-MODEL.html
WHAT IF : http://www.cmbi.kun.nl/swift/servers/
The CPHModels Server : http://www.cbs.dtu.dk/services/CPHmodels/
3D Jigsaw : http://www.bmm.icnet.uk/~3djigsaw/
SDSC1 : http://cl.sdsc.edu/hm.html
EsyPred3D : http://www.fundp.ac.be/urbm/bioinfo/esypred/
Comparative Modeling Server & Program
COMPOSER http://www.tripos.com/sciTech/inSilicoDisc/bioInformatics/matchmaker.html
MODELLER http://salilab.org/modeler
InsightII http://www.msi.com/
SYBYL http://www.tripos.com/