
Biological Overview: Sequence-Structure Asymmetry

Date post: 26-Aug-2020

1

Biological Overview: Sequence-Structure Asymmetry

1DWR (horse) and 2JHO (sperm whale)

Sequence identity ~85%

2

Biological Overview: Sequence-Structure Asymmetry

1LH1 (Lupinus luteus) and 2JHO (sperm whale)

Sequence identity ~20%


3

Structures are better conserved than sequences during evolution

-> Homology modeling of structures

-> Protein design and evolution

1LH1:_ 2/3     ALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE--VPQNNPE
1MBC:_ 1/2     VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASED

1LH1:_ 60/61   LQAHAGKVFKLVYEAAIQLEVTGVVV--TDATLKNLGSVHVSKGVADAHFPVVKEAILKT
1MBC:_ 61/62   LKKHGVTVLTALGAILKKK---GHHEAELKPLAQSHATKHK---IPIKYLEFISEAIIHV

1LH1:_ 118/119 IKEVVGAKWSEELNSAWTIAYDELAIVIKKEMDDAA
1MBC:_ 115/116 LHSRHPGDFGADAQGAMNKALELFRKDIAAKYKELG

4

Biological Overview: Sequence-Structure Asymmetry


5

Biological Overview: Mechanisms of Protein Evolution

… …

… …

Rigidity and stability

6

The Questions: Sequence Capacity and Flow

Can we estimate sequence capacity: the number of sequences that are compatible with a given structure?

Is there migration, or flow, of sequences between structures under point mutations? (May impact protein design and studies of evolution.)

Recent paper: Leonid Meyerguz, Jon Kleinberg, and Ron Elber, "The network of sequence flow between protein structures," PNAS 2007 104: 11627-11632.


7

Physical (stability based) network model for sequence capacity of structures & structural flips

Detailed (whole PDB), efficiently computable and experimentally testable model (the set of PDB structures was argued to be complete)

Design of protein structures and protein switches

Zero order model of the evolution of protein sequences & structures (no selection due to function).

8

Related work on capacity of specific protein models

Shakhnovich, Dill, Wolynes, Thirumalai, Levitt, …

So far no global view of capacity (and thermodynamics) of the PDB, no flow.


9

Is capacity relevant to biology?

Capacity shows weak correlation with the number of sequences that are found for a particular fold in the NR database (correlation coefficient 0.2).

Capacity correlates with mutation rates measured experimentally (J. D. Bloom, D. A. Drummond, F. H. Arnold, and C. O. Wilke. Structural determinants of the rate of protein evolution in yeast. Mol. Biol. Evol. 23:1751-1761, 2006; and C. Wilke, private communication).

10

Experimental tests

Collaboration with Thomas Magliery on Lambda repressor - 160K mutants

Protein flips - Bryan lab (flips are well known for RNA)


11

Protein flips

Patrick A. Alexander, Yanan He, Yihong Chen, John Orban, and Philip N. Bryan, "The design and characterization of two proteins with 88% sequence identity but different structure and function," PNAS 2007 104: 11963-11968.

12

Measuring Protein Fitness: Energy Functions

Energy functions measure the fitness of sequences to structures.

If energy is low, then the structure is thermodynamically stable with respect to the probe sequence.

[Figure: probe sequence ARNDECQ placed on the sites of a target structure]

Energy calculation is based on placing the amino acids of the sequence S into the sites of the target structure X.


13

Approximate Energy Functions

THOM2 (Meller, Elber): a sum over sites. Ergodic/well mixed model.

$E(S \to X) = \sum_{i=1}^{n} c_i^{(X)}(\alpha_i)$

TE13 (Tobi, Elber): a sum over pairs of sites. NP complete.

$E(S \to X) = \sum_{i<j} c\left(\alpha_i, \alpha_j, r_{ij}^{(X)}\right)$
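The two functional forms on this slide, a sum over sites and a sum over pairs of sites, can be illustrated with a minimal Python sketch. The three-site structure, the two-letter alphabet, the energy tables, and the contact rule below are all invented for illustration; they are not the TE13 or THOM2 parameters.

```python
def site_energy(seq, site_tables):
    """Site-sum form: E = sum_i c_i(alpha_i); each site has its own table."""
    return sum(table[aa] for aa, table in zip(seq, site_tables))


def pair_energy(seq, dist, c):
    """Pairwise form: E = sum_{i<j} c(alpha_i, alpha_j, r_ij)."""
    n = len(seq)
    return sum(c(seq[i], seq[j], dist[i][j])
               for i in range(n) for j in range(i + 1, n))


# Hypothetical 3-site structure over a two-letter alphabet (H, P).
SITE_TABLES = [{"H": -1.0, "P": 0.5},
               {"H": 0.2, "P": -0.3},
               {"H": -0.5, "P": 0.0}]
DIST = [[0.0, 3.8, 5.0],
        [3.8, 0.0, 3.8],
        [5.0, 3.8, 0.0]]


def toy_contact(a, b, r):
    """Invented rule: two H residues closer than 6.5 contribute -1."""
    return -1.0 if a == "H" and b == "H" and r < 6.5 else 0.0


e_site = site_energy("HPH", SITE_TABLES)
e_pair = pair_energy("HPH", DIST, toy_contact)
```

In the site-sum form a point mutation changes a single term, so the energy can be updated in constant time; in the pairwise form it changes every term that involves the mutated site.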

14

Specifying Sequence-Structure Fitness Criteria

Original (native) sequence: $S_{nat}$ = HILKMFP

Candidate sequence: $S$ = ARNEDCQ

[Figure: the native and candidate sequences placed on the sites of structure X]

We say S is fit for X if $E(S \to X) \le E(S_{nat} \to X)$.


15

Estimating the Sequence Capacity of a Fold

In general, we would like to compute the function N(E):

$N(E) = \left| \{ S : E(S \to X) \le E \} \right|$

Specifically, we want to compute $N(E_{nat})$.

The number of all possible sequences is $20^n$, for proteins of length $n$. For small proteins, $n \approx 50$.

Random sampling of sequence space does not work, since $N(E_{nat}) / 20^n$ can be exponentially small.

Need a more sophisticated counting method.
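To make the definition of N(E) concrete, here is a brute-force count on a toy problem small enough to enumerate. The four-site energy tables, the two-letter alphabet, and the choice of "native" sequence are assumptions made for the example; real proteins are far beyond the reach of enumeration, which is the point of the slide.

```python
from itertools import product

ALPHABET = "HP"
# Hypothetical per-site energies for a 4-site structure X.
SITE_TABLES = [{"H": -2, "P": 1},
               {"H": 1, "P": -1},
               {"H": -1, "P": 0},
               {"H": 0, "P": 2}]


def energy(seq):
    """Site-sum energy of a sequence placed on the toy structure."""
    return sum(table[aa] for aa, table in zip(seq, SITE_TABLES))


def count_at_or_below(e_max):
    """N(E): number of sequences whose energy is at most e_max."""
    return sum(1 for seq in product(ALPHABET, repeat=len(SITE_TABLES))
               if energy(seq) <= e_max)


e_nat = energy("HPPH")                  # invented native sequence
n_nat = count_at_or_below(e_nat)        # N(E_nat)
fraction = n_nat / len(ALPHABET) ** len(SITE_TABLES)

# Size of the real search space for a small protein (n = 50).
digits_in_20_pow_50 = len(str(20 ** 50))
```

Here only 2 of the 16 toy sequences are at or below the native energy. For a real protein the corresponding fraction of $20^{50}$ (a 66-digit number) is far too small to hit by uniform sampling.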

16

Estimating N(E)

Express $N(E) = \left| \{ S : E(S \to X) < E \} \right|$ as telescoping ratios / umbrella sampling:

$N(E) = N(E_{ref}) \times \frac{N(E_1)}{N(E_{ref})} \times \frac{N(E_2)}{N(E_1)} \times \cdots \times \frac{N(E)}{N(E_m)}$

$E_{ref}$: pre-selected reference energy
$N(E_{ref})$: number of sequences below $E_{ref}$
$E_1 \dots E_m$: intermediate energy values used in the ratios above
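Because the individual ratios can be small and the counts astronomically large, the telescoping product is most safely accumulated in log space. A minimal sketch, with made-up numbers standing in for the reference count and for the estimated ratios:

```python
import math


def log_count(log_n_ref, ratios):
    """log N(E) = log N(E_ref) + sum of the logs of the successive ratios."""
    return log_n_ref + sum(math.log(r) for r in ratios)


# Hypothetical inputs: N(E_ref) = 1000 and three successive ratios.
log_n = log_count(math.log(1000.0), [0.5, 0.2, 0.1])
n_estimate = math.exp(log_n)
```

With these inputs the product is 1000 × 0.5 × 0.2 × 0.1 = 10.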


17

Approximating Successive Ratios

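[Figure: energy levels $E_{min}$, $E_{k+1}$, $E_k$, with the sets counted by $N(E_{k+1})$ and $N(E_k)$]

Select site $i$ uniformly at random from $S$, and amino acid $r$ uniformly at random from among the 20 types.

If $E(S \to X) + \Delta E < E_k$, accept. Otherwise, reject.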

18

Approximating Successive Ratios

Let $l_t(k) = \left| \{ S : E(S \to X) < E_k \} \right|$ after $t$ steps. Then, for sufficiently large $t$,

$N(E_{k+1}) / N(E_k) \approx l_t(k+1) / l_t(k)$.

[Figure: energy levels $E_{min}$, $E_{k+1}$, $E_k$, with the sets counted by $N(E_{k+1})$ and $N(E_k)$]
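The ratio estimate can be sketched as a random walk that is confined below $E_k$ and counts how often it sits below $E_{k+1}$. The six-site tables, the three-letter alphabet, the energy thresholds, and the step count are all assumptions chosen so that the estimate can be checked against exact enumeration; this is an illustration of the idea, not the authors' implementation.

```python
import random
from itertools import product

# Hypothetical site-sum energy: TABLES[i][a] is the energy of letter a at site i.
TABLES = [[-2, 0, 1], [-1, 1, 0], [0, -3, 2],
          [1, -1, -2], [-2, 2, 0], [0, 1, -1]]
N_LETTERS = 3


def energy(seq):
    return sum(TABLES[i][a] for i, a in enumerate(seq))


def estimate_ratio(e_k, e_next, steps, rng):
    """Estimate N(e_next) / N(e_k) from a walk confined to E < e_k."""
    seq = [row.index(min(row)) for row in TABLES]  # start at the minimum
    e = energy(seq)
    hits = 0
    for _ in range(steps):
        i = rng.randrange(len(seq))        # site chosen uniformly at random
        r = rng.randrange(N_LETTERS)       # letter chosen uniformly at random
        delta = TABLES[i][r] - TABLES[i][seq[i]]
        if e + delta < e_k:                # accept; otherwise stay in place
            seq[i] = r
            e += delta
        if e < e_next:                     # current state is below E_{k+1}
            hits += 1
    return hits / steps


def exact_ratio(e_k, e_next):
    """Same ratio by brute-force enumeration (feasible only for toy sizes)."""
    energies = [energy(s) for s in product(range(N_LETTERS), repeat=len(TABLES))]
    return sum(e < e_next for e in energies) / sum(e < e_k for e in energies)


approx = estimate_ratio(0, -4, 200_000, random.Random(0))
exact = exact_ratio(0, -4)
```

The proposal is symmetric and rejected moves leave the walk where it is, so the walk samples the set $\{S : E < E_k\}$ uniformly once it has mixed; every step contributes to $l_t(k)$ and the hits contribute to $l_t(k+1)$.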


19

Choosing Intermediate $E_k$ Values

[Figure: energy levels $E_{mean}$, $E_{nat}$, and $E_{min}$]

20

Algorithm Summary: N(E)

Given a structure $X$, compute $E_{mean}$ and $N(E_{mean})$.

Pick $E_1 \dots E_m$ s.t. $E_k > E_{k+1}$ and $(E_k - E_{k+1})$ is decreasing with $k$.

For $k = 1 \dots m$, run the Markov chain for $t$ steps.

Compute $l_t(k+1) / l_t(k) \approx N(E_{k+1}) / N(E_k)$.

$N(E) = N(E_{mean}) \times \frac{N(E_1)}{N(E_{mean})} \times \frac{N(E_2)}{N(E_1)} \times \cdots \times \frac{N(E)}{N(E_m)}$


21

Counting With THOM2: Markov Chain Convergence

State space is connected: all states communicate via the minimum-energy state

Mixing time (the Markov chain is ergodic) is polynomial in sequence length.

Generalizes Morris-Sinclair algorithm (1999) for counting knapsack solutions to arbitrary alphabets

22

Sequence capacity without competition

Remain at a particular fold $X$ and perform counting for this single structure

Compute $N(E)$, $\Omega(E) = dN/dE$, $S(E) = \log(\Omega(E))$

The temperature of sequence selection is defined as: $T = (dS/dE)^{-1}$

Compute for a representative set of PDB structures (~3000 folds)
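Given $N(E)$ tabulated on an energy grid, the density of sequences, the entropy, and the selection temperature follow by numerical differentiation. The sketch below assumes a uniform grid and uses central differences; the function name and the synthetic input are illustrative only.

```python
import math


def temperature(energies, log_n):
    """Finite-difference T = (dS/dE)^-1 with S = log(dN/dE).

    energies: uniformly spaced grid; log_n: log N(E) on that grid.
    Returns T at the grid points that have two neighbours on each side.
    """
    h = energies[1] - energies[0]
    n = [math.exp(x) for x in log_n]
    # Omega = dN/dE at interior points (central difference).
    omega = [(n[i + 1] - n[i - 1]) / (2 * h) for i in range(1, len(n) - 1)]
    s = [math.log(w) for w in omega]
    # T = 1 / (dS/dE), again by central difference on the Omega grid.
    return [2 * h / (s[j + 1] - s[j - 1]) for j in range(1, len(s) - 1)]


# Synthetic check: if N(E) = exp(a * E) then T = 1 / a at every energy.
grid = [0.5 * k for k in range(8)]
temps = temperature(grid, [2.0 * e for e in grid])
```

For long sequences $N(E)$ can exceed floating-point range, in which case the differences would have to be formed from $\log N$ directly.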


23

Weaknesses (before we even start…)

No selection due to function

No domain swaps (only single point mutations)

Coarse structural models and approximate energy function.

No structural competition yet (addressed later)

24

Counting without Competition: N(E)


25

Counting without competition: Different folds, same length (150)

26

Sequence Capacity and Flow


27

Coarse description of fold connectivity

Different temperatures for alternate folds suggest lack of connectivity.

$T = (dS/dE)^{-1}$

28

Temperature distribution for the potential TE-13


29

Peaked temperature distribution does not exclude connectivity

Can we suggest a more direct calculation of connectivity between folds?

30

Counting with Competition

Fix structure of interest $X$ as a reference structure.

Run counting algorithm as described above.

Keep track of how many sequences have lowest energy in $X$, and how many "escape" to (achieve lower energy in) competing structures.

Let $f_{ret}(E)$ be the fraction of retained (non-escaping) sequences below energy level $E$.

Approximate $C(E) = N(E) \times f_{ret}(E)$.


31

Differentiating Between Competing Folds

For a fold $X$, how do we count only sequences that are both compatible with $X$ and prefer $X$ to all other folds?

Given a structure $X$ and a set of competing structures $Y = \{Y_1, \dots, Y_K\}$, we wish to estimate the function $C(E)$, which gives the size of the set

$\{ S : E(S \to X) < E \ \&\ E(S \to X) < E(S \to Y_j),\ j = 1, \dots, K \}$
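The set defined above can be counted exactly on a toy example, which also shows how $C(E)$ relates to $N(E)$ and $f_{ret}(E)$. The reference structure, the two competitors, and their energy tables are invented for illustration, and enumeration replaces the sampling that a realistic problem would need.

```python
from itertools import product

# Hypothetical site-sum tables over a two-letter alphabet (indices 0 and 1).
X = [[-2, 1], [-1, 1], [0, -1]]                  # reference structure
COMPETITORS = [[[1, -2], [0, 0], [-1, 1]],       # competing structure Y_1
               [[0, 0], [1, -2], [1, 0]]]        # competing structure Y_2


def energy(seq, tables):
    return sum(tables[i][a] for i, a in enumerate(seq))


def capacity_with_competition(e_max):
    """Return (N(E), f_ret(E), C(E)) for the toy system by enumeration."""
    n_total = 0      # sequences with E(S -> X) < e_max
    n_retained = 0   # ... that also prefer X to every competitor
    for seq in product(range(2), repeat=len(X)):
        e_x = energy(seq, X)
        if e_x < e_max:
            n_total += 1
            if all(e_x < energy(seq, y) for y in COMPETITORS):
                n_retained += 1
    f_ret = n_retained / n_total if n_total else 0.0
    return n_total, f_ret, n_retained


high = capacity_with_competition(1)    # many sequences, most escape
low = capacity_with_competition(-2)    # few sequences, all retained
```

In this toy system the retained fraction rises from 1/3 at the higher threshold to 1 at the lower one: at low enough energy every compatible sequence prefers the reference structure.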

32

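Differentiating Between Competing Folds

[Figure: energies of folds X and Y, with axes $E^{(X)}$ and $E^{(Y)}$, marking $E^{(X)}_{min}$, $E^{(X)}_{nat}$, $E^{(Y)}_{nat}$, $E^{(Y)}_{min}$, and the line $E^{(X)} = E^{(Y)}$]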


34

Counting with Competition

[Figure: energy levels $E_{min}$, $E_{k+1}$, $E_k$]

$C(E_k) = N(E_k) \times f_{ret}(E_k)$


35

Results: Competition Among a Large Library of Protein Folds

We use a set of 2060 structurally dissimilar protein folds, constituting a representative sample.

Fix each fold in turn as a reference fold, and run the counting procedure, using the remaining 2059 folds as competitors.

We are interested in how each fold retains or loses sequences as a function of energy.

We can model the competition for sequences among folds as a network of sequence flow.

36

C(E) and N(E)


37

Maximal Retention Energy E*

$E^*$ is the first-encountered energy where $f_{ret}$ is maximum (mostly 1). In general, $E^*$ [symbol missing] $E_{min}$.

Between $E^*$ and $E_{min}$ the protein evolves in structure and sequence spaces.

Below $E^*$ only the sequence evolves. Native proteins are always found above $E^*$.

38

$E^*$ and Behavior of $f_{ret}(E)$

[Figure: $f_{ret}$ (up to 1.0) versus energy, marking $E_{mean}$, $E_{nat}$, and $E^*$]


39

$E_{nat} - E^*$ and contact density

40

$E^*$ and $f_{ret}(E^*)$


41

Sequence Flow Network

Nodes are protein structures.

Edges depend on energy $E$ and a cutoff value $c$.

There is an edge from $X$ to $Y$ if the fraction of sequences that escape from $X$ into $Y$ at energy level $E$ exceeds $c$.

We are interested in network connectivity at the native energy range.

Standard cutoff is $c = 1/K$, where $K = 2060$ is the number of folds in the dataset.
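The edge rule can be sketched directly: given, for each fold, the fraction of its sequences that escape into each other fold at a chosen energy level, keep the edges above the cutoff and count in-degrees. The four fold names and the escape fractions below are made up; only the rule $c = 1/K$ comes from the slide.

```python
def flow_edges(escape_fraction, cutoff):
    """Directed edges X -> Y whose escape fraction exceeds the cutoff."""
    return [(x, y)
            for x, row in escape_fraction.items()
            for y, f in row.items()
            if f > cutoff]


def in_degree(edges, nodes):
    """Number of incoming edges for every node."""
    deg = {node: 0 for node in nodes}
    for _, y in edges:
        deg[y] += 1
    return deg


# Hypothetical escape fractions among K = 4 folds at one energy level.
ESCAPE = {
    "A": {"B": 0.30, "C": 0.05},
    "B": {"C": 0.40},
    "C": {},
    "D": {"C": 0.26, "A": 0.25},
}
K = len(ESCAPE)
edges = flow_edges(ESCAPE, 1.0 / K)
degrees = in_degree(edges, ESCAPE)
sinks = [n for n in ESCAPE if degrees[n] > 0 and all(x != n for x, _ in edges)]
```

In this toy network fold C receives sequences from two folds and loses none above the cutoff, so it is the only sink.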

42

Example Sequence Flow Network


43

Protein sinks are rich in beta sheets

[Figure: log scale; labeled structure 1tyv]

44

In-Degree Correlations

Protein length: ρ = 0.630, P-value 1e-12

β-Sheet content: ρ = 0.215, P-value 1e-12

No correlation between length and β-content.

Number of related sequences found by BLAST: ρ = 0.223, P-value 1e-12


45

Almost ready to speculate

Modeling kinetics of structural evolution (directed graph and a master equation at hand: many sinks, no origin, folds rich in beta sheet structures are attractors)
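For reference, a master equation on a directed graph is commonly written in the generic form below. The symbols are assumptions introduced here, not taken from the slides: $P_Y(t)$ is the probability of finding a sequence in fold $Y$ at time $t$, and $k_{X \to Y}$ is a transition rate along the edge from $X$ to $Y$ (zero where the network has no edge). How the rates would be derived from the escape fractions is not specified in the slides.

```latex
\frac{dP_Y(t)}{dt} = \sum_{X \neq Y} \left[ k_{X \to Y}\, P_X(t) - k_{Y \to X}\, P_Y(t) \right]
```

A sink in this picture is a fold with incoming rates but no outgoing ones, so its probability can only grow.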

