Biological Overview: Sequence-Structure Asymmetry
1DWR (horse) vs. 2JHO (sperm whale)
Sequence identity ~ 85%

Biological Overview: Sequence-Structure Asymmetry
1LH1 (Lupinus luteus) vs. 2JHO (sperm whale)
Sequence identity ~ 20%

Structures are better conserved than sequences during evolution
-> Homology modeling of structures
-> Protein design and evolution
1LH1:_   2/3     ALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE--VPQNNPE
1MBC:_   1/2     VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASED

1LH1:_  60/61    LQAHAGKVFKLVYEAAIQLEVTGVVV--TDATLKNLGSVHVSKGVADAHFPVVKEAILKT
1MBC:_  61/62    LKKHGVTVLTALGAILKKK---GHHEAELKPLAQSHATKHK---IPIKYLEFISEAIIHV

1LH1:_ 118/119   IKEVVGAKWSEELNSAWTIAYDELAIVIKKEMDDAA
1MBC:_ 115/116   LHSRHPGDFGADAQGAMNKALELFRKDIAAKYKELG
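
A minimal sketch of how a pairwise identity figure can be computed from an alignment like the one above. The function name `percent_identity` and the choice to skip gapped columns are assumptions made for illustration; other conventions (for example, dividing by the full alignment length) give different numbers.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of identical residues over columns where neither row has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # assumption: gapped columns are not counted
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# First block of the 1LH1 / 1MBC alignment from the slide.
row_1lh1 = "ALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE--VPQNNPE"
row_1mbc = "VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASED"
print(f"identity over this block: {percent_identity(row_1lh1, row_1mbc):.1f}%")
```

The value for a single block is only a local figure; the identity quoted for a pair of proteins is taken over the whole alignment.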

Biological Overview: Sequence-Structure Asymmetry

Biological Overview: Mechanisms of Protein Evolution
… …
… …
Rigidity and stability

The Questions: Sequence Capacity and Flow
Can we estimate sequence capacity: the number of sequences that are compatible with a given structure?
Is there migration, or flow, of sequences between structures under point mutations? (May impact protein design and studies of evolution.)
Recent paper: Leonid Meyerguz, Jon Kleinberg, and Ron Elber, "The network of sequence flow between protein structures," PNAS 2007 104: 11627-11632.

Physical (stability based) network model for sequence capacity of structures & structural flips
Detailed (whole PDB), efficiently computable and experimentally testable model (the set of PDB structures was argued to be complete)
Design of protein structures and protein switches
Zero order model of the evolution of protein sequences & structures (no selection due to function).

Related work on capacity of specific protein models
Shakhnovich, Dill, Wolynes, Thirumalai, Levitt, …
So far no global view of capacity (and thermodynamics) of the PDB, no flow.

Is capacity relevant to biology?
Capacity shows weak correlation with the number of sequences that are found for a particular fold in the NR database (correlation coefficient 0.2).
Capacity correlates with mutation rates measured experimentally (J. D. Bloom, D. A. Drummond, F. H. Arnold, and C. O. Wilke. Structural determinants of the rate of protein evolution in yeast. Mol. Biol. Evol. 23:1751-1761, 2006; and C. Wilke, private communication).

Experimental tests
Collaboration with Thomas Magliery on Lambda repressor - 160K mutants
Protein flips - Bryan lab (flips are well known for RNA)

Protein flips
Patrick A. Alexander, Yanan He, Yihong Chen, John Orban, and Philip N. Bryan, "The design and characterization of two proteins with 88% sequence identity but different structure and function," PNAS 2007 104: 11963-11968.

Measuring Protein Fitness: Energy Functions
Energy functions measure the fitness of sequences to structures.
If energy is low, then the structure is thermodynamically stable with respect to the probe sequence.
Energy calculation is based on placing the amino acids of the sequence S into sites of the target structure X.
[Figure: the amino acids of the sequence ARNDECQ placed at the sites of a target structure]

Approximate Energy Functions
TE13 (Tobi, Elber): NP complete
$$E(S \to X) = \sum_{i<j} c^{(X)}(\alpha_i, \alpha_j, r_{ij})$$
THOM2 (Meller, Elber): ergodic/well mixed model
$$E(S \to X) = \sum_{i=1}^{n} c_i^{(X)}(\alpha_i)$$
[Figures: amino acids placed at the sites of the target structure for each energy function]
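
The two functional forms on this slide, a sum over sites and a sum over pairs, can be sketched as follows. This is only an illustration under stated assumptions: the function names, the toy per-site tables, and the toy contact rule are hypothetical and are not the published TE13 or THOM2 parameters.

```python
from typing import Callable, Dict, Sequence, Tuple


def site_additive_energy(seq: str, site_tables: Sequence[Dict[str, float]]) -> float:
    """Sum of per-site terms: each site i contributes a value for its amino acid."""
    if len(seq) != len(site_tables):
        raise ValueError("sequence and structure must have the same length")
    return sum(table[aa] for aa, table in zip(seq, site_tables))


def pairwise_energy(
    seq: str,
    distances: Dict[Tuple[int, int], float],
    pair_term: Callable[[str, str, float], float],
) -> float:
    """Sum over site pairs i < j of a term depending on both amino acids and r_ij."""
    return sum(pair_term(seq[i], seq[j], r) for (i, j), r in distances.items() if i < j)


# Hypothetical toy parameters, for illustration only.
toy_sites = [{"A": -1.0, "R": 0.5}, {"A": 0.2, "R": -0.3}]
toy_distances = {(0, 1): 5.0, (0, 2): 9.0, (1, 2): 4.0}


def toy_pair_term(a: str, b: str, r: float) -> float:
    # Hypothetical rule: identical residues closer than 6.5 attract.
    return -1.0 if a == b and r < 6.5 else 0.0


print(site_additive_energy("AR", toy_sites))
print(pairwise_energy("AAR", toy_distances, toy_pair_term))
```

In the site-additive form a single mutation changes one term, so the energy change ΔE can be evaluated without revisiting the rest of the sequence.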

Specifying Sequence-Structure Fitness Criteria
Original (native) sequence S_nat: HILKMFP
Candidate sequence S: ARNEDCQ
[Figure: the native and candidate sequences placed at the sites of structure X]
We say S is fit for X if $E(S \to X) \le E(S_{nat} \to X)$.

Estimating the Sequence Capacity of a Fold
In general, we would like to compute the function N(E):
$$N(E) = \left|\{\, S : E(S \to X) \le E \,\}\right|$$
Specifically, we want to compute N(E_nat).
The number of all possible sequences is 20^n, for proteins of length n. For small proteins, n ≈ 50.
Random sampling of sequence space does not work, since N(E_nat) / 20^n can be exponentially small.
Need a more sophisticated counting method.
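
To make the definition of N(E) concrete, here is a brute-force sketch that enumerates every sequence of a tiny toy model. The two-letter alphabet, the per-site table, and the inclusive threshold are assumptions for illustration; the last lines show why the same enumeration is hopeless for a real protein.

```python
import itertools
from typing import Dict, Sequence


def count_below(site_tables: Sequence[Dict[str, float]], energy_cap: float) -> int:
    """Exact N(E): number of sequences whose energy is at most energy_cap."""
    alphabet = sorted(site_tables[0])
    count = 0
    for seq in itertools.product(alphabet, repeat=len(site_tables)):
        energy = sum(table[aa] for aa, table in zip(seq, site_tables))
        if energy <= energy_cap:  # assumption: threshold is inclusive
            count += 1
    return count


# Hypothetical toy structure: 3 sites, each preferring "A" over "B".
toy_tables = [{"A": -1.0, "B": 1.0}] * 3
print(count_below(toy_tables, -1.0), "of", 2 ** 3, "toy sequences")

# For 20 amino acids and n = 50 the sequence space has 20**50 members.
print("digits in 20**50:", len(str(20 ** 50)))
```

Enumeration scales as 20^n, which is the reason the slides turn to a sampling-based counting method.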

Estimating N(E)
Express N(E) = |{ S : E(S → X) < E }| as telescoping ratios (umbrella sampling):
$$N(E) = N(E_{ref}) \cdot \frac{N(E_1)}{N(E_{ref})} \cdot \frac{N(E_2)}{N(E_1)} \cdots \frac{N(E)}{N(E_m)}$$
E_ref: pre-selected reference energy
N(E_ref): number of sequences below E_ref
E_1 … E_m: intermediate energy values used in the ratios above
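
A small sketch of how the telescoping product might be assembled once the individual ratios are known. Working in log space is an assumption made here to avoid underflow when many small ratios are multiplied; the function name is hypothetical.

```python
import math
from typing import Iterable


def log_count_from_ratios(log_n_ref: float, ratios: Iterable[float]) -> float:
    """Return log N(E) = log N(E_ref) + sum of the logs of the successive ratios."""
    total = log_n_ref
    for ratio in ratios:
        if ratio <= 0.0:
            raise ValueError("each ratio must be positive")
        total += math.log(ratio)
    return total


# Toy numbers: N(E_ref) = 1000 and three successive ratios.
log_n = log_count_from_ratios(math.log(1000.0), [0.1, 0.2, 0.5])
print(math.exp(log_n))
```

Each ratio is kept far from zero by spacing the intermediate energies closely, so every factor can be estimated with a modest number of samples.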

Approximating Successive Ratios
Select site i uniformly at random (u.a.r.) from S, and amino acid r u.a.r. from among the 20 types.
If E(S → X) + ΔE < E_k, accept. Otherwise, reject.
[Figure: nested energy levels E_k, E_{k+1} and E_min, with the counts N(E_k) and N(E_{k+1})]

Approximating Successive Ratios
Let l_t(k) = |{ S : E(S → X) < E_k }| be the number of sequences sampled below E_k after t steps. Then, for sufficiently large t,
N(E_{k+1}) / N(E_k) ≈ l_t(k+1) / l_t(k).
[Figure: nested energy levels E_k, E_{k+1} and E_min, with the counts N(E_k) and N(E_{k+1})]
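
The walk and the ratio estimate on these two slides can be sketched as below. Everything concrete here is an assumption for illustration: a two-letter alphabet instead of 20 amino acids, a hypothetical site-additive energy, a start at the minimum-energy sequence, and a fixed random seed.

```python
import random

ALPHABET = "AB"                       # toy alphabet (assumption)
SITE_ENERGY = {"A": -1.0, "B": 1.0}   # hypothetical per-site energies


def estimate_ratio(n_sites: int, e_k: float, e_k1: float, steps: int, seed: int = 0) -> float:
    """Estimate N(E_{k+1}) / N(E_k) by a walk confined to sequences with E < E_k."""
    rng = random.Random(seed)
    seq = ["A"] * n_sites             # minimum-energy sequence of the toy model
    energy = sum(SITE_ENERGY[a] for a in seq)
    if not energy < e_k1 < e_k:
        raise ValueError("need E_min < E_{k+1} < E_k")
    below = 0
    for _ in range(steps):
        i = rng.randrange(n_sites)    # site chosen uniformly at random
        r = rng.choice(ALPHABET)      # amino acid chosen uniformly at random
        delta = SITE_ENERGY[r] - SITE_ENERGY[seq[i]]
        if energy + delta < e_k:      # accept only if the move stays below E_k
            seq[i] = r
            energy += delta
        if energy < e_k1:             # count samples that fall below E_{k+1}
            below += 1
    return below / steps


# With 8 sites, 163 sequences have E < 1 and 37 of them have E < -3.
print(estimate_ratio(8, e_k=1.0, e_k1=-3.0, steps=100_000, seed=1), "vs exact", 37 / 163)
```

Because the proposal is symmetric and rejected moves leave the sequence unchanged, the walk samples uniformly from the sequences below E_k, so the visit fraction converges to the ratio of the two counts.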

Choosing Intermediate E_k Values
[Figure: energy axis marking E_mean, E_nat and E_min]

Algorithm Summary: N(E)
Given a structure X, compute E_mean and N(E_mean).
Pick E_1 … E_m s.t. E_k > E_{k+1} and (E_k − E_{k+1}) is decreasing with k.
For k = 1 … m, run the Markov chain for t steps.
Compute l_t(k+1) / l_t(k) ≈ N(E_{k+1}) / N(E_k).
$$N(E) = N(E_{mean}) \cdot \frac{N(E_1)}{N(E_{mean})} \cdot \frac{N(E_2)}{N(E_1)} \cdots \frac{N(E)}{N(E_m)}$$
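
One possible way to pick intermediate energies that satisfy the two conditions in the summary (decreasing levels, decreasing gaps) is a geometric schedule. The slides do not say which schedule was used, so the function name and the `shrink` factor below are assumptions for illustration.

```python
from typing import List


def energy_schedule(e_mean: float, e_target: float, m: int, shrink: float = 0.5) -> List[float]:
    """Return E_1 > E_2 > ... > E_m between e_mean and e_target with shrinking gaps."""
    if not e_target < e_mean:
        raise ValueError("e_target must lie below e_mean")
    if not 0.0 < shrink < 1.0:
        raise ValueError("shrink must be in (0, 1)")
    span = e_mean - e_target
    # Level k sits a fraction shrink**k of the span above the target energy.
    return [e_target + span * shrink ** k for k in range(1, m + 1)]


levels = energy_schedule(e_mean=0.0, e_target=-16.0, m=4)
gaps = [a - b for a, b in zip(levels, levels[1:])]
print(levels, gaps)
```

The levels approach the target energy without reaching it; the final factor N(E)/N(E_m) in the product covers the remaining step.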

Counting With THOM2: Markov Chain Convergence
State space is connected: all states communicate via the minimum-energy state.
Mixing time (the Markov chain is ergodic) is polynomial in sequence length.
Generalizes Morris-Sinclair algorithm (1999) for counting knapsack solutions to arbitrary alphabets.

Sequence capacity without competition
Remain at a particular fold X and perform counting for this single structure.
Compute N(E), Ω(E) = dN/dE, S(E) = log(Ω(E)).
The temperature of sequence selection is defined as: T = (dS/dE)^-1
Compute for a representative set of PDB structures (~3000 folds).
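
A sketch of the chain N(E) → Ω(E) → S(E) → T using central finite differences on a tabulated N(E). The differencing scheme, the function names, and the exponential toy curve are assumptions for illustration, not the procedure used in the study.

```python
import math
from typing import List, Sequence, Tuple


def central_diff(xs: Sequence[float], ys: Sequence[float]) -> List[float]:
    """Central finite differences dy/dx at the interior points xs[1:-1]."""
    return [(ys[i + 1] - ys[i - 1]) / (xs[i + 1] - xs[i - 1]) for i in range(1, len(xs) - 1)]


def selection_temperature(energies: Sequence[float], counts: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Return (energy grid, T) where T = (dS/dE)^-1 and S = log(dN/dE)."""
    if len(energies) != len(counts) or len(energies) < 5:
        raise ValueError("need matching arrays with at least 5 points")
    omega = central_diff(energies, counts)        # Omega(E) = dN/dE
    inner = list(energies[1:-1])
    entropy = [math.log(w) for w in omega]        # S(E) = log Omega(E)
    ds_de = central_diff(inner, entropy)
    return inner[1:-1], [1.0 / s for s in ds_de]  # T = (dS/dE)^-1


# Toy curve N(E) = exp(E / 2), for which the temperature is 2 everywhere.
energies = [-10.0 + 0.5 * i for i in range(21)]
counts = [math.exp(0.5 * e) for e in energies]
grid, temps = selection_temperature(energies, counts)
print(temps[0], temps[-1])
```

Two points are lost at each end of the grid, one per differentiation step.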

Weaknesses (before we even start…)
No selection due to function
No domain swaps (only single point mutations)
Coarse structural models and approximate energy function.
No structural competition yet (addressed later)

Counting without Competition: N(E)

Counting without competition: Different folds, same length (150)

Sequence Capacity and Flow

Coarse description of fold connectivity
Different temperatures for alternate folds suggest lack of connectivity.
T = (dS/dE)^-1

Temperature distribution for the potential TE-13

Peaked temperature distribution does not exclude connectivity
Can we suggest a more direct calculation of connectivity between folds?

Counting with Competition
Fix structure of interest X as a reference structure.
Run counting algorithm as described above.
Keep track of how many sequences have lowest energy in X, and how many "escape" to (achieve lower energy in) competing structures.
Let f_ret(E) be the fraction of retained (non-escaping) sequences below energy level E.
Approximate C(E) = N(E) × f_ret(E).
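
The bookkeeping for retained and escaping sequences can be sketched as follows. The site-additive toy energies, the function names, and the three sample sequences are hypothetical; in the procedure above the samples would come from the counting walk on X.

```python
from typing import Dict, List, Sequence

Tables = Sequence[Dict[str, float]]


def site_energy(seq: str, tables: Tables) -> float:
    """Toy site-additive energy of a sequence threaded onto one structure."""
    return sum(table[aa] for aa, table in zip(seq, tables))


def retained_fraction(samples: List[str], ref: Tables, competitors: List[Tables]) -> float:
    """f_ret: share of samples whose energy in X is lower than in every competitor."""
    if not samples:
        raise ValueError("need at least one sampled sequence")
    retained = 0
    for seq in samples:
        e_ref = site_energy(seq, ref)
        if all(e_ref < site_energy(seq, comp) for comp in competitors):
            retained += 1
    return retained / len(samples)


# Hypothetical two-site structures X and Y.
fold_x = [{"A": -1.0, "B": 0.0}, {"A": -1.0, "B": 0.0}]
fold_y = [{"A": 0.0, "B": -1.5}, {"A": 0.0, "B": -1.5}]
samples = ["AA", "AB", "BB"]          # stand-in for sequences sampled below E in X

f_ret = retained_fraction(samples, fold_x, [fold_y])
n_below = 3                           # stand-in for the estimated N(E)
print("f_ret =", f_ret, " C(E) ~", n_below * f_ret)
```

Only "AA" is retained here; "AB" and "BB" reach a lower energy in Y and count as escapes.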

Differentiating Between Competing Folds
For a fold X, how do we count only sequences that are both compatible with X and prefer X to all other folds?
Given a structure X and a set of competing structures Y = {Y_1, … Y_K}, we wish to estimate the function C(E) which gives the size of the set
$$\{\, S : E(S \to X) < E \ \ \&\ \ E(S \to X) < E(S \to Y_j),\ \forall j = 1 \ldots K \,\}$$

Differentiating Between Competing Folds
[Figure: energy scales of folds X and Y, marking E_min^(X), E^(X), E_nat^(X), E_nat^(Y), E_min^(Y), E^(Y) and the level E^(X) = E^(Y)]

Counting with Competition
$$C(E_k) = N(E_k) \times f_{ret}(E_k)$$
[Figure: energy levels E_k, E_{k+1} and E_min]

Results: Competition Among a Large Library of Protein Folds
We use a set of 2060 structurally dissimilar protein folds, constituting a representative sample.
Fix each fold in turn as a reference fold, and run the counting procedure, using the remaining 2059 folds as competitors.
We are interested in how each fold retains or loses sequences as a function of energy.
We can model the competition for sequences among folds as a network of sequence flow.

C(E) and N(E)

Maximal Retention Energy E*
E* is the first-encountered energy where f_ret is maximum (mostly 1). In general, E* differs from E_min.
Between E* and E_min the protein evolves in structure and sequence spaces.
Below E* only the sequence evolves. Native proteins are always found above E*.

E* and Behavior of f_ret(E)
[Figure: f_ret (reaching 1.0) versus energy, marking E_mean, E_nat and E*]

E_nat - E* and contact density

E* and f_ret(E*)

Sequence Flow Network
Nodes are protein structures.
Edges depend on energy E and a cutoff value c.
There is an edge from X to Y if the fraction of sequences that escape from X into Y at energy level E exceeds c.
We are interested in network connectivity at the native energy range.
Standard cutoff is c = 1/K, where K = 2060 is the number of folds in the dataset.
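
The edge rule above translates directly into a small graph construction. The fold labels and escape fractions below are made-up toy values and the function names are hypothetical; only the rule "edge if the escape fraction exceeds c = 1/K" comes from the slide.

```python
from collections import Counter
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def flow_edges(escape_fraction: Dict[Edge, float], cutoff: float) -> List[Edge]:
    """Directed edges X -> Y where the fraction escaping from X into Y exceeds cutoff."""
    return sorted((x, y) for (x, y), f in escape_fraction.items() if x != y and f > cutoff)


def in_degrees(edges: List[Edge]) -> Counter:
    """Number of incoming edges per fold; folds with high in-degree act as sinks."""
    return Counter(target for _, target in edges)


# Toy data for K = 3 hypothetical folds at one energy level.
toy_escape = {("X", "Z"): 0.5, ("Y", "Z"): 0.4, ("Z", "X"): 0.1, ("X", "Y"): 0.2}
edges = flow_edges(toy_escape, cutoff=1.0 / 3.0)
print(edges, dict(in_degrees(edges)))
```

Since the edges depend on the energy level E, the graph would be rebuilt for each energy of interest, for example across the native energy range.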

Example Sequence Flow Network

Protein sinks are rich in beta sheets
[Figure: log scale; labeled structure 1tyv]

In-Degree Correlations
Protein length: ρ = 0.630, P-value 1e-12
β-Sheet content: ρ = 0.215, P-value 1e-12
No correlation between length and β-content.
Number of related sequences found by BLAST: ρ = 0.223, P-value 1e-12

Almost ready to speculate
Modeling kinetics of structural evolution (Directed graph and a Master equation at hand -- Many sinks, no origin, folds rich in beta sheet structures are attractors)