NEW APPROACHES TO PROTEIN STRUCTURE PREDICTION AND DESIGN
Joe DeBartolo
[Diagram: amino acid sequence → native protein structure (structure prediction); native protein structure → amino acid sequence (protein design)]
An overview of my thesis
Why do prediction and design matter?
Structure Prediction. Growth of sequences outpaces experimental characterization. Knowing their structure provides insights into their function and interactions
Protein design. Understanding design principles can allow the creation of new proteins with therapeutic and industrial applications
Protein structure prediction and design
PART I ItFix: Homology-free structure prediction
PART II SPEED: ItFix enhanced with evolution
PART III Future directions in prediction
PART IV Protein design
Protein structure prediction
1° structure: MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLR
2° and 3° structure
local 2° structure
topology diagram
3D model
Residue-residue contact map
The Challenge: Distill the folding problem down to the basic principles, code them into an algorithm, and predict pathways and structure without using homology
native structure
…LEKVQLN…
amino acid sequence
Capturing the interrelated forces of protein structure: local structures
• Ramachandran angles (φ, ψ)
• backbone hydrogen bonds
• local sterics
• solvation
• backbone entropy
• long range sterics
• Van der Waals
• electrostatics
• hydrophobic effect
[Figure: Ramachandran maps (φ and ψ from -180 to 180) for α-helix, β-strand, and turn]
The overlapping features of local protein structure
[Diagram: backbone Ramachandran torsion angles (φ, ψ); backbone H-bonds; amphipathic sidechain patterning (polar vs. apolar, mostly polar vs. apolar)]
Capturing the interrelated forces of protein structure: long range effects
• sterics
• Van der Waals
• electrostatics
• Ramachandran angles (φ, ψ)
• backbone hydrogen bonds
• solvation
• long-range hydrogen bonding
[Diagram labels:]
3° packing specificity of the chain
solvent exposed residues
apolar buried residues
hydrophobic effect surface residue placement
salt bridges and other favorable pairings
long-range hydrogen bonding
contacts that are highly separated in sequence
The structure prediction challenge: To integrate all of these features into an algorithm
requirements
• a way to sample conformations (Ramachandran φ, ψ space, -180 to 180)
• a way to evaluate conformations
Sample Ramachandran space
[Figure: Rama map of the PDB (φ, ψ from -180 to 180) with a single Rama angle pair marked]
Rama angle pairs describe the entire conformation. NO sidechain rotamer sampling: sidechains beyond Cβ are excluded.
1° and 2° structure information refines the Rama search space
[Figure: four Rama maps (φ, ψ from -180 to 180), one per library selection below]
Entire PDB: 2° structure ALL-ALL-ALL, 1° structure ALL-ALL-ALL
Add amino acid identity: 2° structure ALL-ALL-ALL, 1° structure ALL-ASN-ALL
Add neighbor identity: 2° structure ALL-ALL-ALL, 1° structure ALL-ASN-GLY
Add 2° structure identity: 2° structure BETA-ALL-ALL, 1° structure ALL-ASN-GLY
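The refinement of the Rama search space by trimer identity can be sketched as a keyed lookup. This is a minimal illustration, not the thesis code: the library contents are made-up numbers, the function names are hypothetical, and the fall-back from the most specific key to the whole-PDB map is an assumption about how sparse keys might be handled.

```python
import random

# Hypothetical toy library: (2° structure trimer, 1° structure trimer) -> (phi, psi) pairs.
# A real library would be compiled from the PDB; these values are invented.
RAMA_LIBRARY = {
    ("ALL-ALL-ALL", "ALL-ALL-ALL"): [(-60.0, -45.0), (-120.0, 130.0), (60.0, 40.0)],
    ("ALL-ALL-ALL", "ALL-ASN-ALL"): [(-90.0, 0.0), (55.0, 40.0), (-120.0, 130.0)],
    ("ALL-ALL-ALL", "ALL-ASN-GLY"): [(55.0, 40.0), (-90.0, 0.0)],
    ("BETA-ALL-ALL", "ALL-ASN-GLY"): [(55.0, 40.0)],
}


def candidate_keys(ss_first, aa_center, aa_right):
    """List library keys from most to least specific (the back-off order is an assumption)."""
    return [
        (f"{ss_first}-ALL-ALL", f"ALL-{aa_center}-{aa_right}"),
        ("ALL-ALL-ALL", f"ALL-{aa_center}-{aa_right}"),
        ("ALL-ALL-ALL", f"ALL-{aa_center}-ALL"),
        ("ALL-ALL-ALL", "ALL-ALL-ALL"),
    ]


def sample_phi_psi(ss_first, aa_center, aa_right, rng, library=RAMA_LIBRARY):
    """Draw one (phi, psi) pair from the most specific library entry that exists."""
    for key in candidate_keys(ss_first, aa_center, aa_right):
        pairs = library.get(key)
        if pairs:
            return key, rng.choice(pairs)
    raise KeyError("no Rama data for this trimer")


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_phi_psi("BETA", "ASN", "GLY", rng))
    print(sample_phi_psi("HELIX", "LEU", "ALA", rng))
```

Each added piece of 1° or 2° structure information narrows the set of (φ, ψ) pairs that can be drawn, which is the sense in which the search space is refined.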
The structure prediction challenge: To integrate all of these features into one algorithm
requirements
• a way to sample conformations (Ramachandran φ, ψ space, -180 to 180)
• a way to evaluate conformations
The DOPE statistical potential
Discrete Optimized Protein Energy: knowledge-based modeling of the energy of a conformation
Shen and Sali, Proteins (2007)
Energy_PDB(r_ij) = -ln( Prob_PDB(r_ij) ), with probabilities taken from the PDB
The DOPE atom pair energy…
residue i: amino acid i, atom type i
residue j: amino acid j, atom type j
r_ij is the distance between atoms i and j
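The relation Energy = -ln(Prob) can be illustrated with a distance histogram for a single atom-pair type. This is a simplified sketch under stated assumptions: the function names, the pseudocount, and the binning are illustrative choices, and the published DOPE potential also involves a reference state that is not modeled here.

```python
import math


def pair_energies(distance_counts, pseudocount=1.0):
    """Turn a histogram of observed distances for one atom-pair type into E = -ln(P) per bin."""
    total = sum(distance_counts) + pseudocount * len(distance_counts)
    return [-math.log((count + pseudocount) / total) for count in distance_counts]


def score_distances(distances, energies, bin_width):
    """Sum bin energies over a list of distances; distances past the last bin are skipped."""
    score = 0.0
    for r in distances:
        index = int(r // bin_width)
        if index < len(energies):
            score += energies[index]
    return score


if __name__ == "__main__":
    counts = [2, 30, 50, 18]  # made-up counts for bins 0-2, 2-4, 4-6, 6-8 Å
    energies = pair_energies(counts)
    print(energies)
    print(score_distances([3.1, 5.5, 9.0], energies, bin_width=2.0))
```

Frequently observed distances receive low energies and rare ones receive high energies, so a conformation is scored by summing over its atom pairs.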
[Figure: DOPE energy vs. distance (Å) for GLU-Cβ - GLU-Cβ and LEU-Cβ - LEU-Cβ atom pairs]
I have added to DOPE…
• orientation dependence
• 2° structure dependence
• elimination of local biases
[Figure: DOPE-PW energy vs. distance (Å) for GLU-Cβ - GLU-Cβ and LEU-Cβ - LEU-Cβ atom pairs]
Capturing sidechain orientation in a sidechain-free model
DeBartolo et al., PNAS 2009
[Diagram: residue 1 and residue 2, each drawn as Cα and Cβ, with angles ρ1-2 and ρ2-1; high ρ (in-line) vs. low ρ]
ρ1-2 is the angle between two vectors
PW = ρ = (90 - ρ1-2)² + (90 - ρ2-1)²
DOPE-PW (uniquely) captures the hydrophobic effect
[Diagram: potential orientations of high PW, shown as pairs of Cα-Cβ vectors]
[Figure: DOPE energy vs. distance (Å) for GLU-Cβ GLU-Cβ and LEU-Cβ LEU-Cβ pairs; annotations: buried in the core, large distance preferred]
Hydrophobic residue pairs have lower energy at smaller distances
DOPE-PW captures the amphipathic nature of β-sheets
Polar and apolar residues prefer opposing sides of the β-sheet
[Diagram: potential orientations of low PW, shown as pairs of Cα-Cβ vectors]
[Figure: DOPE energy vs. distance (Å) for GLU-Cβ LYS-Cβ and GLU-Cβ LEU-Cβ pairs; annotations: same side of β-sheet, opposite side of β-sheet]
The challenge: To integrate all of these features into one algorithm
requirements
• a way to sample conformations (Ramachandran φ, ψ space, -180 to 180)
• a way to evaluate conformations
ItFix: Iterative Fixing to reduce the conformational search
DeBartolo et al., PNAS 2009
"U": starting configuration, 1° structure only (no 2° structure restriction)
"I1": fold with (φ, ψ) from Library_Initial, then remove trimers of lowly-populated 2° structure
"I2": fold with (φ, ψ) from Library_Restricted 1, then remove trimers
Fold with (φ, ψ) from Library_Restricted 2, then repeat removal
Repeat until no further fixing is possible
Final round, "N": fold with (φ, ψ) from Library_Restricted final
When a 2° structure option is removed, the search space of the sampling library is restricted
2° structure options: helix, strand, Not(Helix), Not(Strand), coil subtypes
[Figure: Rama maps (φ, ψ from -180° to 180°) of the sampling library at each stage]
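The fix-and-refold loop can be sketched as follows. This is a toy illustration, not the published ItFix implementation: `fold_once` is a hypothetical stand-in for a real folding run, options are pruned per position rather than per trimer, and the 10% frequency threshold is an invented value.

```python
import random

SS_OPTIONS = ("helix", "strand", "coil")


def fold_once(allowed, rng):
    """Hypothetical stand-in for one folding run: pick an allowed 2° structure per position."""
    return [rng.choice(sorted(options)) for options in allowed]


def itfix(n_positions, fold=fold_once, n_folds=200, threshold=0.1, max_rounds=20, seed=0):
    """Fold repeatedly, remove lowly-populated 2° structure options, stop when nothing changes."""
    rng = random.Random(seed)
    allowed = [set(SS_OPTIONS) for _ in range(n_positions)]
    history = []
    for _ in range(max_rounds):
        models = [fold(allowed, rng) for _ in range(n_folds)]
        fixed_any = False
        for pos, options in enumerate(allowed):
            if len(options) == 1:
                continue  # this position is already fixed
            freqs = {ss: sum(m[pos] == ss for m in models) / n_folds for ss in options}
            rare = {ss for ss, f in freqs.items() if f < threshold}
            if rare and len(rare) < len(options):
                options -= rare  # restrict the sampling library for the next round
                fixed_any = True
        history.append([set(options) for options in allowed])
        if not fixed_any:
            break  # no further fixing is possible
    return allowed, history


if __name__ == "__main__":
    final, rounds = itfix(5)
    print(len(rounds), final)
```

The per-round `history` plays the role of the pathway output: it records which options survived after each round of folding.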
Native  ---HHHHHHHHHHHHHHH-----GGGHHHHHHHHHHHHHHHT---HHHHHHHHHH-TT-THHHHHHHH-
ItFix   ---HHHHHHHHHHHHHHHT-----S-HHHHHHHHHHHHHHHT-S--HHHHHHHHHT---HHHHHHHHH-
SSPro   ---HHHHHHHHHHHHHHHHHHE-TTHHHHHHHHHHHHHHHHT--HHHHHHHHHHT-TTHHHHHHHHHH-
PSIPRED ---HHHHHHHHHHHHHHH-----HHHHHHHHHHHHHHHHHH----HHHHHHHHH----HHHHHHHHH--

Native  -HHHHHHHHHHHTT-SS--HHHHHHHHHHHT--HHHHHHHHHHHHHHHH-
ItFix   --HHHHHHHHHHHH-----HHHHHHHHHHHH--S-HHHHHHHHHHHHHH-
SSPro   -HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH-HHHHEEHEHHHHHHH--
PSIPRED -HHHHHHHHHHHHH-----HHHHHHHHHHHHHHHHHHHHHHH-HHHH---

Native  -EEEEEEEEETTTTEEEEE-TTS--EEEEGGGB-SSSS----TT-EEEEEEEEETTEEEEEEEEE--
ItFix   -EEEEEEEE-STTTEEEEEEET-T-EEEEEEE--SSS-----TS--EEEEEEES--S----EEEEE-
SSPro   --TEEEEEE-TTTTEEEE--TT--EEEEEEEHEETTT--E--TT-EEEEEEEE-TT--E-EE-----
PSIPRED --EEEEEEEE----EEEEE-----EEEEEEE--------------EEEEEEEE-----EEEEEE---

Native  --BGGG---SEEEEE-TTS-EEEEEEHHHHHHHHHHTT-EEEEEETTSSS-EEEEE-
ItFix   -EEE-SSSSEEEEEE-TTS-EEEEEEHHHHHHHHHHHT--EEEE-TTSSS-EEEEE-
SSPro   --BBTEEE-EEEEEEETTT-EEEEE-HHHHHHHHHHHT--EEEE-TT----EEEE--
PSIPRED ----------EEEEE-----EEEEE-HHHHHHHHHH----EEEE-------EEEE--

Native  -HHHHHHHHHHHTT--HHHHHHHHTS-HHHHHHHHTTS-SS-TTHHHHHHHTT--HHHHH-
ItFix   -HHHHHHHHHHHHT--HHHHHHHHT--HHHHHHHHTT--SS----HHHHHHHT--HHHHH-
SSPro   ---HHHHHHHHHHHHHHHHHHHHHT-HHHHHHHHHTT-------HHHHHHHHHT--HHHH-
PSIPRED -HHHHHHHHHHH----HHHHHHHH---HHHHHHHH------HHHHHHHHHHH---HHHH--

Native  -EEEEEETTS-EEEEE--TTSBHHHHHHHHHHHH---GGGEEEEETTEE--TTSBTGGGT--TT-EEEEEE-
ItFix   -EEEEEETTS-EEEEEE---S-B-HHHHHHHHHSS---SSEEEEETT----TT-B----------EEEEEE-
SSPro   -EEEEEEETTEEEEEEE---SHHHHHHHHHHHTTT---T--E--ETT-E--TT-EEEEEE--TT-EEEEEE-
PSIPRED -EEEEEE----EEEEEE-----HHHHHHHHHHHH---HHHEEEEE--EE------HHH-------EEEEEE-
1af7 2.5 Å
1b72 1.6 Å
1csp 6.0 Å
1tif 4.2 Å
1r69 2.4 Å
1ubq 3.1 Å
Homology-free ItFix 2° and 3° structure prediction results
Mimicking folding pathways
DeBartolo et al., PNAS 2009
[Figure: 2° structure frequency (0 to 1) vs. residue index (1 to 73), from the unfolded state (Round 0) through Rounds 1, 2, 3, 4, 6, and 9 to the native state; elements labeled β1, β2, helix, β4, β5, 3-10, β3]
Major pathway (from experiment); step labels shown: β1-β2 hairpin, + helix, + β3, + β4, + β5, + 3-10 helix
Part I Conclusions
Challenge: Distill the folding problem down to the basic principles, code them into an algorithm, and predict pathways and structure without using homology
What is novel about how we approached this challenge?
Use basic principles of protein structure and folding.
Search strategies: mimic true folding behavior
i) Coupled 2° & 3° structure formation
ii) Iterative fixing to reduce the search
iii) Outputs pathway information
Energy functions: orientational and 2° structure dependence
Protein structure prediction and design
PART I ItFix: Homology-free structure prediction
PART II SPEED: ItFix enhanced with evolution
PART III Future directions in prediction
PART IV Protein design
Cover image of Protein Science, March 2010
SPEED: Structure Prediction Enhanced by Evolutionary Diversity
Increase φ, ψ diversity and accuracy
target sequence: MQIFVKTLTGKTITLEV
sequence database → multiple sequence alignment (~1000 sequences):
IEIKIRDIYSKTYKFMA
IEITCNDRLGKKVRVKC
MRLFIRSHLHDQVVISA
MKLSVKSPNGRIEIFNE
LQFFVRLLDGKSVTLTF
IEITLNDRLGKKIRVKC
IEIWVNDHLSHRERIKC
MDVFLMIRRQKTTIFDA
IIVTVNDRLGTKAQIPA
MRISVIKLDSTSFDVAV
MNVNFRTILGKTYTITV
MLLTVRDRSELTFSLQV
MQIFVTTPSENVFGLEV
MSLTIKF-GAKSIALSL
MKYRIRTISNDEAVIEL
…
[Figure: Rama maps (φ, ψ from -180° to 180°) for homology-free sampling and for SPEED sampling]
Uses the sequence database: 10^7 sequences, growing fast; the PDB has only 10^4 structures, growing slowly
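One way to picture how an alignment can diversify φ, ψ sampling is to pool the Rama statistics of the trimers found at one aligned position. This is a minimal sketch under assumptions: the equal weighting of homologs, the coarse "alpha"/"beta" bins, and the toy library counts are all invented for illustration and are not taken from the SPEED method itself.

```python
from collections import Counter


def trimers_at(msa, center):
    """Collect the residue trimer centred on 0-based column `center`, skipping gapped trimers."""
    trimers = []
    for seq in msa:
        trimer = seq[center - 1:center + 2]
        if len(trimer) == 3 and "-" not in trimer:
            trimers.append(trimer)
    return trimers


def pooled_rama(msa, center, library):
    """Pool Rama-bin counts over all homolog trimers (equal weighting is an assumption)."""
    pooled = Counter()
    for trimer in trimers_at(msa, center):
        pooled.update(library.get(trimer, Counter()))
    total = sum(pooled.values())
    return {rama_bin: n / total for rama_bin, n in pooled.items()} if total else {}


if __name__ == "__main__":
    msa = ["MQIFVKTLTGKTITLEV", "IEIKIRDIYSKTYKFMA", "MSLTIKF-GAKSIALSL"]
    # Hypothetical counts of coarse Rama bins per trimer; real data would come from the PDB.
    toy_library = {
        "TLT": Counter({"beta": 3, "alpha": 1}),
        "DIY": Counter({"beta": 4}),
    }
    print(trimers_at(msa, 7))
    print(pooled_rama(msa, 7, toy_library))
```

Because homologs contribute trimers the target sequence does not contain, the pooled distribution can cover φ, ψ regions that a single-sequence lookup would miss.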
ItFix-SPEED overview
DeBartolo et al., Protein Sci. 2010
Input sequence: …AGTYEFRKAKIT…
Rama distribution for each position, e.g. 1tif position 4:
  Homology-free: trimer INE
  SPEED: trimers from the multiple sequence alignment (MSA), {IND, IGD, VGN, …}
ItFix:
  Fold 500x with E_radial
  Analyze 2° structure statistics
  2° structure converged? no: update the Rama distribution (Round 1, Round 2, …) and fold again; yes: final Rama distribution
Final 2° structure
Fold 10000x with E_radial or DOPE-PW (all α)
Cluster and take the largest cluster
Refine 100X each with DOPE-PW; reject ∆E_radial > 0
Prediction: min <Energy>_100
[Figure: Rama maps (φ, ψ from -180° to 180°) for the Round 1, Round 2, and final Rama distributions]
Clustering predicts model accuracy and confidence
Assaying accuracy (i.e. we know whether we got it right or wrong)
Fold with ItFix predicted 2° structure → cluster → identify best cluster
Local Accuracy
[Figure: clustered models for 1af7, 1b72, and 1r69]
Global Accuracy
[Figure: mean Cα-RMSD to native of cluster (Å) vs. mean Cα-RMSD between models in cluster (Å); R² = 0.85]
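The quantity on the horizontal axis of the global-accuracy plot, the mean Cα-RMSD between models in a cluster, can be computed from a pairwise RMSD matrix. This sketch assumes the matrix and the cluster memberships are already available (superposition and clustering are not shown), and the function names are hypothetical.

```python
import math


def mean_pairwise_rmsd(rmsd, members):
    """Mean RMSD over all distinct pairs of models in one cluster (inf for a singleton)."""
    pairs = [(i, j) for k, i in enumerate(members) for j in members[k + 1:]]
    if not pairs:
        return math.inf  # a single model gives no spread estimate
    return sum(rmsd[i][j] for i, j in pairs) / len(pairs)


def rank_clusters(rmsd, clusters):
    """Sort clusters largest first, breaking ties by tighter mean intra-cluster RMSD."""
    summary = [(len(m), mean_pairwise_rmsd(rmsd, m), m) for m in clusters]
    return sorted(summary, key=lambda s: (-s[0], s[1]))


if __name__ == "__main__":
    # Made-up symmetric RMSD matrix (Å) for four models.
    rmsd = [
        [0.0, 2.0, 3.0, 9.0],
        [2.0, 0.0, 4.0, 8.0],
        [3.0, 4.0, 0.0, 7.0],
        [9.0, 8.0, 7.0, 0.0],
    ]
    for size, spread, members in rank_clusters(rmsd, [[3], [0, 1, 2]]):
        print(size, spread, members)
```

Given the correlation reported on the slide, a small intra-cluster spread would be read as a sign that the cluster is close to the native structure.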
Performance in CASP8
DeBartolo et al., Protein Sci. 2010
[Figures: Global Distance Test plots, cut-off distance (Å) vs. percentage of residues, comparing ItFix and RAPTOR; one panel is labeled "Better template"]
Targets shown: T0405 D1 (6.4 Å), T0464 D1 (4.5 Å), T0429 D2 (6.8 Å), T0482 (4.8 Å)
• free modeling
• loop insertion modeling
• template identification using folding
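A Global Distance Test curve reports, for each cut-off distance, the percentage of residues whose Cα lies within that distance of its native position. The sketch below assumes per-residue deviations from one fixed superposition are already available; the official GDT procedure searches over superpositions for each cut-off, which is not done here. The 1, 2, 4, and 8 Å cut-offs are the standard GDT_TS set.

```python
def fraction_within(deviations, cutoff):
    """Fraction of residues whose deviation from native is at most `cutoff` (Å)."""
    return sum(d <= cutoff for d in deviations) / len(deviations)


def gdt_curve(deviations, cutoffs):
    """Percentage of residues under each cut-off, as (cutoff, percent) pairs."""
    return [(c, 100.0 * fraction_within(deviations, c)) for c in cutoffs]


def gdt_ts(deviations):
    """Average percentage of residues within 1, 2, 4, and 8 Å."""
    return 100.0 * sum(fraction_within(deviations, c) for c in (1.0, 2.0, 4.0, 8.0)) / 4.0


if __name__ == "__main__":
    per_residue = [0.5, 1.5, 3.0, 9.0]  # made-up Cα deviations in Å
    print(gdt_curve(per_residue, [1.0, 2.0, 4.0, 8.0]))
    print(gdt_ts(per_residue))
```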
Aashish Adhikari
Part II Conclusions
• Adding evolutionary information to ItFix improves the accuracy of the conformational search
• Clustering permits global and local prediction of cluster accuracy and uncertainty
• SPEED is successful in the CASP8 experiment
Protein structure prediction and design
PART I ItFix: Homology-free structure prediction
PART II SPEED: ItFix enhanced with evolution
PART III Future directions in prediction
PART IV Protein design
Invert the structure prediction problem
1° structure: MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLR
2° and 3° structure: local 2° structure, topology diagram, 3D model, 3D contacts
design        length  fold  wt % id (wt % sim)  top % id (top % sim)  top-wt % id (top-wt % sim)
protein L1    62      αβ    35 (61)             50 (62)               73 (86)
protein L2    62      αβ    45 (60)             45 (60)               73 (86)
ACP           98      αβ    41 (54)             39 (57)               67 (69)
PCP           70      αβ    31 (56)             33 (56)               73 (84)
S6            94      αβ    26 (43)             32 (46)               33 (52)
U1A           96      αβ    32 (57)             33 (57)               97 (100)
FKB           107     αβ    42 (59)             44 (62)               96 (96)
zinc-finger   28      αβ    21 (38)             N/A                   N/A
tenascin      89      β     42 (64)             42 (64)               100 (100)
Current designs are very similar to parent sequences
Can we design a more unique protein sequence?
Design method
1. Restrict AA possibilities by burial in the native structure, for the hydrophobic effect (burial pattern shown: 01010111)
2. Find best sequences for maximum Rama propensity, giving a positional sequence library (shown: MKLFVKTP…LTVTIR with per-position alternatives)
3. Monte Carlo search of the statistical potential (DOPE-PW)
[Figure: DOPE-PW energy vs. distance (Å) for GLU-Cβ - GLU-Cβ and LEU-Cβ - LEU-Cβ atom pairs]
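Step 3 can be pictured as a Metropolis Monte Carlo search over a positional sequence library. This is a generic sketch, not the thesis design code: the acceptance rule, temperature, step count, and the toy energy function are assumptions, and a real run would score each candidate with a statistical potential such as DOPE-PW.

```python
import math
import random


def design_sequence(library, energy, n_steps=2000, temperature=0.5, seed=0):
    """Metropolis search: mutate one position at a time within its allowed residues."""
    rng = random.Random(seed)
    seq = [rng.choice(options) for options in library]
    current = energy(seq)
    best, best_energy = list(seq), current
    for _ in range(n_steps):
        pos = rng.randrange(len(seq))
        old = seq[pos]
        seq[pos] = rng.choice(library[pos])
        trial = energy(seq)
        # Accept downhill moves always, uphill moves with Boltzmann probability.
        if trial <= current or rng.random() < math.exp(-(trial - current) / temperature):
            current = trial
            if current < best_energy:
                best, best_energy = list(seq), current
        else:
            seq[pos] = old  # reject the move
    return "".join(best), best_energy


def toy_energy(seq, target="LKVE"):
    """Hypothetical energy: number of positions that differ from an arbitrary target."""
    return sum(a != b for a, b in zip(seq, target))


if __name__ == "__main__":
    # Per-position allowed residues, e.g. apolar at buried sites and polar at exposed sites.
    library = ["LIV", "KE", "LIV", "KE"]
    print(design_sequence(library, toy_energy))
```

Restricting each position to a short list of residues (steps 1 and 2) keeps the search space small enough for a simple sampler of this kind to explore.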
Hello Jello
Preliminary wetlab analysis
• 1ds0 expresses in inclusion bodies
• mutations enhance in vitro solubility
• further experiments needed
[Gel lanes: soluble at induction, insoluble at induction, soluble at 3 hrs, insoluble at 3 hrs]
[Figure: CD vs. wavelength (nm) for design, design-sol, and native]
Thesis defense: Conclusions
• Homology-free structure prediction can provide accurate models by mimicking folding pathways
• Adding evolutionary information improves the accuracy of the conformational search
• Inverting our homology-free prediction method into a design algorithm aims to generate unique amino acid sequences
Acknowledgements
Prof. Tobin Sosnick, Prof. Karl Freed, Prof. Jinbo Xu
Glen Hocky, Andres Colubri, James Fitzgerald, Abhishek Jha, Esmael Haddadian, James Hinshaw, Aashish Adhikari, Jouko Virtanen, Chloe Antoniou, Josiah Zayner, Feng Zhao, Jian Peng, Grzegorz Gawlak, Srikanth Aravamuthan
Funding: NIH, NSF, Joint Theory Institute
Enhancement of Ramachandran propensity
[Figure: native Rama probability by position, with amino acid (AA) and 2° structure (SecStr) tracks]
Enhancement in energy and structure prediction
• ∆∆E = -120 (arb. units)
• 2X enhancement in native-like models in prediction
[Figure: secondary structure frequency (0 to 1) vs. residue index by ItFix round (Round 0 through Round 8) for 1af7 (2.7 Å), 1di2 (4.6 Å), 1r69 (2.4 Å), and 1b72 (1.6 Å)]
SPEED improves native φ, ψ probability across sequence
[Figure: native basin probability for 1b72, with 2° structure by position and amino acid by position]
SPEED increases the native Rama probability
SPEED reduces cases where native φ, ψ has a very low probability
[Figure: % positions with P_Native > 0.25 vs. PDB id of target]
[Figure: native Rama regions 1 to 4 on a Rama map (φ, ψ from -180 to 180)]
Radial energy terms enforce productive chain collapse (global terms)
[Diagram: center of mass (CM), Cα, Cβ, Rg-Cα, Rg-phil, Rg-phob]
Rg-Cα: Root-mean-squared distance of Cα from CM. Compactness of model
Ru-Cα: Root-mean-squared deviation of Cα from CM. Enforces a spherical model
Rg-phob/Rg-phil (burial ratio): best packing of hydrophobic residues
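Two of these radial quantities can be sketched directly from Cα coordinates. This is an illustration under assumptions: all atoms are given equal mass, Cα positions stand in for each residue, and both hydrophobic and hydrophilic radii are measured from the center of mass of the whole chain. The function names are hypothetical, and how the terms are weighted into E_radial is not shown.

```python
import math


def center_of_mass(coords):
    """Unweighted centroid of a list of (x, y, z) points."""
    n = len(coords)
    return tuple(sum(c[k] for c in coords) / n for k in range(3))


def radius_of_gyration(coords, center=None):
    """Root-mean-squared distance of the points from `center` (their own centroid by default)."""
    if center is None:
        center = center_of_mass(coords)
    return math.sqrt(sum(math.dist(c, center) ** 2 for c in coords) / len(coords))


def burial_ratio(ca_coords, hydrophobic):
    """Rg of hydrophobic residues divided by Rg of hydrophilic residues, both about the chain CM."""
    cm = center_of_mass(ca_coords)
    phob = [c for c, h in zip(ca_coords, hydrophobic) if h]
    phil = [c for c, h in zip(ca_coords, hydrophobic) if not h]
    return radius_of_gyration(phob, cm) / radius_of_gyration(phil, cm)


if __name__ == "__main__":
    # Made-up coordinates (Å): two residues near the center, two far from it.
    coords = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (0.0, 3.0, 0.0), (0.0, -3.0, 0.0)]
    print(radius_of_gyration(coords))
    print(burial_ratio(coords, [True, True, False, False]))
```

A burial ratio below 1 means the hydrophobic residues sit closer to the center than the hydrophilic ones, which is the packing the term is meant to favor.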
Eliminating the fixing thresholds from ItFix
round0: MQIFVKT…STLHLVLR → Rama distribution (e.g. pos. 67)
round1: fold 2000X → Rama distribution
round2: fold 2000X → Rama distribution
round3: fold 2000X → Rama distribution
[Figure: Rama maps (φ, ψ from -180 to 180) of the distribution at each round]
An evolution-enhanced energy function: DOPE-PW-SPEED
[Figures: energy vs. distance (Å), comparing DOPE-PW and DOPE-PW-SPEED; panel labels PHE4 and THR14]
WT: ILE; Homologs: polar
WT: Ala; Homologs: polar