
Post on 25-Dec-2015


A Combinatorial Toolbox for Protein Sequence Design and Landscape Analysis

in the Grand Canonical Model

Ming-Yang Kao
Department of Computer Science
Northwestern University
Evanston, Illinois
U.S.A.

2

Acknowledgments

This talk is based on joint work with colleagues & students at Yale University:

Computer Science:
• Jim Aspnes
• Gauri Shah

Biology:
• Julia Hartling
• Junhyong Kim

3

Dual Purposes of This Talk

1. Discuss protein folding problems.

2. Emphasize the point that as bioinformatics grows, advanced algorithmic techniques will become useful and crucial.

4

Importance of Protein Folding

A protein's 3D structure largely determines its function.

5

Two Complementary Problems for Protein Folding

1. Protein Folding Prediction --- Given a protein sequence, determine the 3D folding of the sequence.

2. Protein Sequence Design --- Given a 3D structure, determine the fittest protein sequence for the structure, i.e., one that has the smallest energy among all possible sequences when folded into the structure.

6

Complexity for Protein Folding Problems

• Protein Folding Prediction --- Given a protein sequence, determine the 3D folding of the sequence.

NP-hard under various models.

• Protein Sequence Design --- Given a 3D structure, determine the fittest protein sequence for the structure, i.e., one that has the smallest energy among all possible sequences when folded into the structure.

Solvable in polynomial time under the Grand Canonical model.

7

History of Protein Sequence Design

• Protein Sequence Design --- Given a 3D structure, determine the fittest protein sequence for the structure, i.e., one that has the smallest energy among all possible sequences when folded into the structure.

1. Sun et al, 1995: Heuristic search without optimality guarantee.

2. Hart, 1997: Open question on the computational tractability.

3. Kleinberg, 1999: Polynomial-time algorithms.

4. Aspnes, Hartling, Kao, Kim, Shah, 2001: Improved algorithms and generalized problems (this talk).

8

Outline of Technical Discussions

• The Grand Canonical Model
• Two Basic Computational Problems
• Experimental Results
• Combinatorial Tools
  (1a) Linear Programming
  (1b) Network Flow
  (1c) Compact Representation of All Min Cuts
  (1d) others
• Further Algorithmic & Computational Hardness Results
• Conclusions

9

Outline of Technical Discussions (1)


10

Grand Canonical Model (Sun et al, 1995)

• Each amino acid is classified as either Hydrophobic (H) or Polar (P).

• Each amino acid sequence is then treated as a binary sequence of H's and P's. (For mathematical convenience, set H = 1 and P = 0.)

• Hydrophobic (H): A, C, F, I, L, M, V, W, Y.
• Polar (P): the other amino acids.

• Sun, Brem, Chan, Dill. Designing amino acid sequences to fold with good hydrophobic cores. Protein Engineering, 1995.
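
The H/P classification above can be sketched in a few lines of Python. The function names `to_hp` and `to_binary` are hypothetical, and the sketch assumes sequences are given in one-letter amino acid codes:

```python
# Hydrophobic residues as listed on the slide; every other residue is Polar.
HYDROPHOBIC = set("ACFILMVWY")

def to_hp(sequence):
    """Map a one-letter amino acid sequence to its H/P string."""
    return "".join("H" if aa in HYDROPHOBIC else "P" for aa in sequence.upper())

def to_binary(hp):
    """Encode an H/P string as a list of bits, with H = 1 and P = 0."""
    return [1 if c == "H" else 0 for c in hp]

print(to_hp("MKVLA"))  # M, V, L, A are hydrophobic; K is polar
```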

11

Representation of a 3D structure: (Sun et al, 1995)

A 3D folded structure S of a sequence of n amino acids:

• the coordinates of each atom in S.

1. the pairwise distances between the centers of amino acid residues in S.

2. the solvent-accessible areas of the amino acid residues in S.

12

Goal of Protein Sequence Design: (Sun et al, 1995)

Input: A 3D structure S and a sequence length n.

Output: a sequence X of n amino acids that, when folded into S, has the following properties:

1. The H-residues in X are as close to each other as possible.

2. The solvent-accessible areas of the H-residues of X are as small as possible.

13

Fitness of a Sequence (Sun et al, 1995)

14

Fitness of a Sequence (Sun et al, 1995)

closeness among H-residues

small surface area
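
The formula itself is not in the transcript. A sketch consistent with the quantities used later in the talk (alpha, beta, g(dij), si) and with the usual statement of the Grand Canonical model is given below. The symbol Φ and the exact index set of the first sum are assumptions here, not a copy of the slide:

```latex
\Phi(X) \;=\; \alpha \sum_{\substack{i<j \\ x_i = x_j = 1}} g(d_{ij})
        \;+\; \beta \sum_{i \,:\, x_i = 1} s_i ,
\qquad \alpha < 0,\quad \beta > 0 .
```

Here x_i = 1 when residue i is H, d_ij is the distance between residues i and j, g is a function that is larger for closer pairs, and s_i is the solvent-accessible area of residue i. The first term rewards closeness among H-residues and the second penalizes exposed surface area; a fittest sequence minimizes Φ.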

15

Outline of Technical Discussions (2)


16

Problem #1

Input:
1. the parameters alpha and beta,
2. a protein sequence Y,
3. Y’s 3D structure,
4. the sequence length n of Y.

Output: a fittest sequence X for the 3D structure with respect to the given alpha and beta.

Application of this problem: design the best sequences for novel structures; this works because the computation does not really need Y.

17

Problem #2

Input:
1. the parameters alpha and beta,
2. a protein sequence Y,
3. Y’s 3D structure,
4. the sequence length n of Y.

Output: a fittest sequence X for the 3D structure that is the most similar to Y over all possible alpha and beta.

Application of this problem: tune the alpha and beta of the Grand Canonical model.

18

Basic Computational Scheme (1)

3D structure → network → a min cut → a fittest sequence (e.g., HPPPHHPHP)

19

Problem #1

Input:
1. the parameters alpha and beta,
2. a protein sequence Y,
3. Y’s 3D structure,
4. the sequence length n of Y.

Output: a fittest sequence X for the 3D structure with respect to the given alpha and beta.

Application of this problem: design the best sequences for novel structures; this works because the computation does not really need Y.

Computational Complexity: 1 network flow.

20

Problem #2

Input:
1. the parameters alpha and beta,
2. a protein sequence Y,
3. Y’s 3D structure,
4. the sequence length n of Y.

Output: a fittest sequence X for the 3D structure that is the most similar to Y over all possible alpha and beta.

Application of this problem: tune the alpha and beta of the Grand Canonical model.

Computational Complexity: O(n) network flows.

21

Outline of Technical Discussions (3)


22

Empirical Study: Predictive Ability

1. Computed Fittest Sequence versus Native Sequences (% similarity)

2. Our % Similarity versus Kleinberg’s

3. % Similarity versus Protein Family Size.

23

% similarity --- computed versus native

1. % similarity = the percentage of the H/P’s in the computed fittest sequence that are identical to those in the native sequence.

2. The average percentage of the hydrophobic residues is 42% in the native sequences that were studied.

3. Since 58% of the residues are polar on average, the best sequence picked without “domain knowledge” (the all-P sequence) would have a 58% similarity on average.
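
The similarity measure and the no-knowledge baseline can be made concrete with a short sketch. The function name `percent_similarity` and the example H/P string are illustrative assumptions, not data from the study:

```python
def percent_similarity(computed, native):
    """Percentage of positions where two equal-length H/P strings agree."""
    if len(computed) != len(native):
        raise ValueError("sequences must have the same length")
    matches = sum(c == n for c, n in zip(computed, native))
    return 100.0 * matches / len(native)

native = "HPPPHHPHP"            # illustrative H/P string with 4 H's out of 9
baseline = "P" * len(native)    # the all-P guess uses no structural knowledge
print(percent_similarity(baseline, native))
```

The all-P baseline agrees with the native string exactly at its P positions, which is why a native sequence with 42% H's yields a 58% baseline.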

24

% similarity --- computed versus native (1)

25

% similarity --- computed versus native (2)

Our results versus Kleinberg’s

26

% similarity --- computed versus native (3)

27

% similarity versus PFAM family size (1)

1. % similarity = the percentage of the H/P’s in the computed fittest sequence that are identical to those in the native sequence.

2. PFAM family size of a protein = # of proteins in the PFAM database that are related to the given protein.

• The relatedness is computed via hidden Markov models (HMMs).
• pfam.wustl.edu
• Family size is a measure of success of a protein in Nature.

28

% similarity versus PFAM family size (2)

1. % similarity = the percentage of the H/P’s in the computed fittest sequence that are identical to those in the native sequence.

2. PFAM family size of a protein = # of proteins in the PFAM database that are related to the given protein.

Intuition/Conjecture:
(3A) the more diverse a protein family is,
(3B) the more its 3D structures vary, and
(3C) the smaller the % similarity will be.

29

% similarity versus PFAM family size (3)

30

% similarity versus PFAM family size (4)

31

Outline of Technical Discussions (4)


32

Tool #1: Linear Programming

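Goal: find a fittest sequence X of n amino acids. (Stated this way: clueless!)

Reformulation 1: find a binary sequence x that minimizes the fitness function. (Quadratic.)

Reformulation 2: find x and y that minimize an equivalent linear objective. This formulation is:
1. Linear
2. Totally unimodular
3. Guaranteed to have an integer solution
4. Useful for proving theorems
5. Still too inefficient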
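
The two formulas on this slide are not in the transcript. One standard way to write the step from the quadratic objective to a linear program is sketched below, under the assumption that the fitness function has the form α Σ g(d_ij) x_i x_j + β Σ s_i x_i with α < 0. The auxiliary variables y_ij are this sketch's notation and may differ from the slide's:

```latex
% Quadratic 0/1 formulation
\min_{x \in \{0,1\}^n} \;\; \alpha \sum_{i<j} g(d_{ij})\, x_i x_j \;+\; \beta \sum_{i} s_i x_i

% Linearization: y_{ij} stands in for the product x_i x_j
\min_{x,\,y} \;\; \alpha \sum_{i<j} g(d_{ij})\, y_{ij} \;+\; \beta \sum_{i} s_i x_i
\quad \text{s.t.} \quad
y_{ij} \le x_i,\;\; y_{ij} \le x_j,\;\; 0 \le x_i \le 1,\;\; 0 \le y_{ij} \le 1 .
```

Because α g(d_ij) ≤ 0, an optimal solution pushes each y_ij up to min(x_i, x_j), which equals x_i x_j when x is binary. Each constraint y_ij − x_i ≤ 0 has one +1 and one −1 coefficient, which is the kind of structure that makes a constraint matrix totally unimodular and so guarantees an integer optimum.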

33

Tool #2: Network Flow (1)

• analogy: a network of oil pipes

1. source s (origin of oil)
2. sink t (destination of oil)
3. other nodes (midway stations)
4. arcs (pipes)
5. arc capacity (pipe capacity)
6. flow (amount of oil through a pipe)

• goal: deliver max amount of oil from source to sink

• computational goal: a max flow
• computational complexity: O(VE log(V²/E))

[Figure: example network from source s to sink t, with arc capacities 14, 14, 4, 5, 5, 4, 5, 1, 9, 20, 8, 10.]

34

Tool #2: Network Flow (2)

example of max flow

1. source (origin of oil)
2. sink (destination of oil)
3. other nodes (midway stations)
4. arcs (pipes)
5. arc capacity (pipe capacity)
6. flow (amount of oil through a pipe)

• goal: deliver max amount of oil from source to sink

• computational goal: a max flow
• computational complexity: O(VE log(V²/E))

[Figure: the example network with a max flow; arcs labeled capacity (flow): 14 (1), 14 (14), 4 (4), 5 (5), 5 (4), 4 (4), 5, 1 (1), 9 (9), 20, 8 (5), 10 (5).]

35

Tool #2: Network Flow (3)

max flow versus min cut

1. min cut = bottleneck.

2. a partition (S,T) of nodes with s in S and t in T.

3. total capacity of arcs from S to T = max flow.

[Figure: the example network from s to t, arcs labeled capacity (flow).]

36

Tool #2: Network Flow (4)

max flow versus min cut

1. min cut = bottleneck.

2. a partition (S,T) of nodes with s in S and t in T.

3. total capacity of arcs from S to T = max flow.

• computational complexity: O(VE log(V²/E))

[Figure: the example network from s to t, arcs labeled capacity (flow).]
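
The max flow / min cut relationship can be checked on a small network. The sketch below uses the Edmonds-Karp augmenting-path method, which is simpler but slower than the O(VE log(V²/E)) bound quoted on the slide. The network `pipes` is a made-up example, not the one in the figure, whose topology is not recoverable from the transcript:

```python
from collections import deque

def max_flow_min_cut(capacity, s, t):
    """Edmonds-Karp: return (max flow value, source side S of a min cut)."""
    # Residual capacities; every arc gets a reverse arc of capacity 0.
    res = {s: {}, t: {}}
    for u, arcs in capacity.items():
        for v, c in arcs.items():
            res.setdefault(u, {})
            res.setdefault(v, {})
            res[u][v] = res[u].get(v, 0) + c
            res[v].setdefault(u, 0)
    value = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            # No augmenting path: the nodes still reachable from s form S.
            return value, set(parent)
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(res[u][v] for u, v in path)
        for u, v in path:
            res[u][v] -= push
            res[v][u] += push
        value += push

# Hypothetical network: {node: {successor: pipe capacity}}.
pipes = {"s": {"a": 3, "b": 2}, "a": {"b": 1, "t": 2}, "b": {"t": 3}}
value, S = max_flow_min_cut(pipes, "s", "t")
cut_capacity = sum(c for u, arcs in pipes.items() for v, c in arcs.items()
                   if u in S and v not in S)
print(value, sorted(S), cut_capacity)
```

The total capacity of the arcs leaving S equals the flow value, which is the bottleneck property stated above.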

37

Basic Computational Scheme (1)

3D structure → network → a min cut → a fittest sequence (e.g., HPPPHHPHP)

38

Tool #2: 3D Network (1)

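[Figure: example structure of 9 residues, drawn in three rows: 6 5 4 / 7 8 9 / 1 2 3.]

Solvent-accessible areas: S1 = 3, S2 = 18, S3 = 6, S4 = 9, S5 = 3, S6 = 9, S7 = 6, S8 = 24, S9 = 9
Distance terms: g(d16) = 0.5, g(d25) = 0.75, g(d58) = 0.9, g(d49) = 0.75
Parameters: alpha = -8, beta = 1/3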

39

Tool #2: 3D Network (2)

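[Figure: example structure of 9 residues, drawn in three rows: 6 5 4 / 7 8 9 / 1 2 3.]

Solvent-accessible areas: S1 = 3, S2 = 18, S3 = 6, S4 = 9, S5 = 3, S6 = 9, S7 = 6, S8 = 24, S9 = 9
Distance terms: g(d16) = 0.5, g(d25) = 0.75, g(d58) = 0.9, g(d49) = 0.75
Parameters: alpha = -8, beta = 1/3

[Figure: the network built from this structure. Residue nodes 1 to 9 have arcs of capacity beta*si (1, 6, 2, 3, 1, 3, 2, 8, 3); pair nodes (1,6), (2,5), (5,8), (4,9) have arcs of capacity -alpha*g(dij) (4, 6, 7.2, 6).]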

40

Tool #2: 3D Network (3)


41

Tool #2: 3D Network (4)

[Figure: the example structure and the network built from it, with arc capacities -alpha*g(dij) on the pair nodes and beta*si on the residue nodes.]

Theorem (Kleinberg, 1999): The amino acids on the source side of a min cut are the H’s of a fittest sequence.
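
The construction can be tried on the numbers from the example slides. The arc layout used here (source to pair node, pair node to its two residues, residue to sink) is an assumption, since the figure's arcs are not in the transcript. All function names are hypothetical, and the max flow routine is a plain augmenting-path method rather than anything specific to the talk:

```python
from collections import deque
from fractions import Fraction

# Data from the example slides (exact fractions avoid rounding in ties).
ALPHA, BETA = Fraction(-8), Fraction(1, 3)
AREA = {1: 3, 2: 18, 3: 6, 4: 9, 5: 3, 6: 9, 7: 6, 8: 24, 9: 9}
G = {(1, 6): Fraction(1, 2), (2, 5): Fraction(3, 4),
     (5, 8): Fraction(9, 10), (4, 9): Fraction(3, 4)}

def energy(h_set):
    """Fitness of the sequence whose H positions are h_set (lower is fitter)."""
    contact = sum(g for (i, j), g in G.items() if i in h_set and j in h_set)
    return ALPHA * contact + BETA * sum(AREA[i] for i in h_set)

def build_network():
    """Assumed layout: s -> pair node -> its two residues -> t."""
    big = sum(-ALPHA * g for g in G.values()) + 1   # stands in for "infinite"
    cap = {"s": {}}
    for (i, j), g in G.items():
        cap["s"][(i, j)] = -ALPHA * g                # reward lost if arc is cut
        cap[(i, j)] = {i: big, j: big}
    for i, area in AREA.items():
        cap[i] = {"t": BETA * area}                  # cost paid if i is H
    return cap

def source_side(cap, s="s", t="t"):
    """Run augmenting-path max flow; return nodes reachable from s afterwards."""
    res = {s: {}, t: {}}
    for u, arcs in cap.items():
        for v, c in arcs.items():
            res.setdefault(u, {})
            res.setdefault(v, {})
            res[u][v] = res[u].get(v, 0) + c
            res[v].setdefault(u, 0)
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return set(parent)
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(res[u][v] for u, v in path)
        for u, v in path:
            res[u][v] -= push
            res[v][u] += push

side = source_side(build_network())
h_set = {i for i in AREA if i in side}               # residues with the source are H
sequence = "".join("H" if i in h_set else "P" for i in sorted(AREA))
print(sequence, energy(h_set))
```

With these particular parameters several H/P assignments tie for the minimum energy, and this sketch returns the one whose source side is smallest.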

42

Basic Computational Scheme (1)

3D structure → network → a min cut → a fittest sequence (e.g., HPPPHHPHP)

43

Problem #1

Input:
1. the parameters alpha and beta,
2. a protein sequence Y,
3. Y’s 3D structure,
4. the sequence length n of Y.

Output: a fittest sequence X for the 3D structure with respect to the given alpha and beta.

Application of this problem: design the best sequences for novel structures; this works because the computation does not really need Y.

44

Tool #3: Linear Size Representation of All Min Cuts (1)

[Figure: the example network on nodes s, v1 to v7, t; arcs labeled capacity (flow).]

Step 1: Compute a max flow of G.
Step 2: Compute the residual network G’.
Step 3: Contract every strongly connected component into a super node. Call the new graph G”.

Def: A node subset U of G” is a closed set if for every node x in U, every descendant of x is also in U.

Theorem (Picard and Queyranne, 1980): Every closed set not including the sink forms a min cut, and vice versa.
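
The three steps can be sketched on the 9-residue example from the 3D Network slides. This is a brute-force illustration, not the linear-size representation itself: it finds strongly connected components by pairwise reachability and then tries every combination of components. It also assumes a network layout of source to pair node to residues to sink, and it keeps only the closed sets that contain the source as well as exclude the sink. All function names are hypothetical:

```python
from collections import deque
from fractions import Fraction
from itertools import combinations

ALPHA, BETA = Fraction(-8), Fraction(1, 3)
AREA = {1: 3, 2: 18, 3: 6, 4: 9, 5: 3, 6: 9, 7: 6, 8: 24, 9: 9}
G = {(1, 6): Fraction(1, 2), (2, 5): Fraction(3, 4),
     (5, 8): Fraction(9, 10), (4, 9): Fraction(3, 4)}

def build_network():
    """Assumed layout: s -> pair node -> its two residues -> t."""
    big = sum(-ALPHA * g for g in G.values()) + 1    # stands in for "infinite"
    cap = {"s": {p: -ALPHA * g for p, g in G.items()}}
    for (i, j) in G:
        cap[(i, j)] = {i: big, j: big}
    for i, area in AREA.items():
        cap[i] = {"t": BETA * area}
    return cap

def residual_after_max_flow(cap, s, t):
    """Steps 1 and 2: augment until no s-t path remains; return the residual."""
    res = {s: {}, t: {}}
    for u, arcs in cap.items():
        for v, c in arcs.items():
            res.setdefault(u, {})
            res.setdefault(v, {})
            res[u][v] = res[u].get(v, 0) + c
            res[v].setdefault(u, 0)
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return res
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(res[u][v] for u, v in path)
        for u, v in path:
            res[u][v] -= push
            res[v][u] += push

def reach(res, start):
    """Nodes reachable from start along arcs with positive residual capacity."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v, c in res[u].items():
            if c > 0 and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def min_cut_source_sides(res, s, t):
    """Step 3 plus enumeration of closed sets that contain s and exclude t."""
    r = {u: reach(res, u) for u in res}
    # Strongly connected components: nodes that reach each other.
    comps = {frozenset(v for v in r[u] if u in r[v]) for u in res}
    base = r[s]                                      # forced into every such set
    free = [c for c in comps if not c <= base and all(t not in r[u] for u in c)]
    sides = []
    for k in range(len(free) + 1):
        for chosen in combinations(free, k):
            side = set(base).union(*chosen)
            if all(r[u] <= side for u in side):      # closed under descendants
                sides.append(side)
    return sides

res = residual_after_max_flow(build_network(), "s", "t")
fittest = sorted("".join("H" if i in side else "P" for i in sorted(AREA))
                 for side in min_cut_source_sides(res, "s", "t"))
print(fittest)
```

Each closed set found this way is the source side of one min cut, and reading off which residues it contains gives one fittest H/P sequence.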

45

Tool #3: Linear Size Representation of All Min Cuts (2)

[Figure: Residual Network of the example, on nodes s, v1 to v7, t, with residual capacities on the arcs.]

46

Tool #3: Linear Size Representation of All Min Cuts (3)

[Figure: Picard-Queyranne Representation of the example network, on nodes s, v1 to v7, t.]

47

Tool #3: Linear Size Representation of All Min Cuts (4)

[Figure: Picard-Queyranne Representation of the example network, on nodes s, v1 to v7, t.]

Applications:
1. Obtain all fittest sequences.
2. Study the landscape of the fittest sequences.
3. Compute fittest sequences with additional optimization objectives.

48

Basic Computational Scheme (2)

3D structure → network → a max flow/min cut → Picard-Queyranne Representation → the space of all fittest sequences (e.g., HPPPHHPHP)

49

Outline of Technical Discussions (5)


50

Problem #3

• Input: a 3D structure.

• Output: all its fittest protein sequences.

• Computational Complexity: (A) A linear size representation can be computed with 1 network flow.

(B) Each individual fittest protein sequence can be generated from this representation in O(n) time.

51

Problem #4

Input: f 3D structures.

Output: the set of all protein sequences that are the fittest simultaneously for all these 3D structures.

Computational Complexity: f network flows.

52

Problem #5

• Input: a protein sequence Y and its native 3D structure.

• Output: the set of all fittest protein sequences that are also the most (or least) similar to Y in terms of unweighted (or weighted) Hamming distances.

• Computational Complexity: 1 network flow.

53

Problem #6

• Input: a 3D structure.

• Output: Count the number of protein sequences in the solution to each of Problems #3, #4, and #5.

• Computational Complexity: #P-complete.

54

Problem #7

• Input: a 3D structure and a bound e.

• Output: Enumerate the protein sequences whose fitness function values are within an additive factor e of that of the fittest protein sequences.

• Computational Complexity: polynomial time to generate each desired protein sequence.

55

Problem #8

• Input: a 3D structure.

• Output: the largest possible unweighted (or weighted) Hamming distance between any two fittest protein sequences.

• Computational Complexity: 1 network flow.

56

Problem #9

• Input: a protein sequence Y and its native 3D structure.

• Output: the average unweighted (or weighted) Hamming distance between Y and the fittest protein sequences for the 3D structure.

• Computational Complexity: #P-complete.

57

Problem #10

• Input: a protein sequence Y, its native 3D structure, and two unweighted Hamming distances d1 and d2.

• Output: a fittest protein sequence whose distance from Y is also between d1 and d2.

• Computational Complexity: NP-hard.

58

Problem #11

• Input: a protein sequence Y, its native 3D structure, and an unweighted Hamming distance d.

• Output: the fittest among the protein sequences which are at distance d from Y.

• Computational Complexity: NP-hard. We have a polynomial-time approximation algorithm.

59

Problem #12

• Input: a protein sequence Y and its native 3D structure.

• Output: all the ratios between the scaling factors alpha and beta in the Grand Canonical model for which the smallest possible unweighted (or weighted) Hamming distance between Y and any fittest protein sequence is minimized over all possible alpha and beta.

• Computational Complexity: O(n) network flows.

60

Problem #13

• Input: a 3D structure.

• Output: Determine whether the fittest protein sequences are connected, i.e., whether they can mutate into each other through allowable mutations, such as point mutations, while the intermediate protein sequences all remain the fittest.

• Computational Complexity: 1 network flow.
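
For a tiny structure, connectivity can be checked directly by brute force. The sketch below is not the one-network-flow algorithm of the talk: it enumerates all 2^n sequences, so it is only usable for very small n. It reuses the numbers from the 9-residue example on the 3D Network slides, allows only point mutations, and uses hypothetical function names:

```python
from fractions import Fraction
from itertools import product

ALPHA, BETA = Fraction(-8), Fraction(1, 3)
AREA = {1: 3, 2: 18, 3: 6, 4: 9, 5: 3, 6: 9, 7: 6, 8: 24, 9: 9}
G = {(1, 6): Fraction(1, 2), (2, 5): Fraction(3, 4),
     (5, 8): Fraction(9, 10), (4, 9): Fraction(3, 4)}

def energy(bits):
    """Fitness of a 0/1 tuple (1 = H, 0 = P) for residues 1..9."""
    contact = sum(g for (i, j), g in G.items() if bits[i - 1] and bits[j - 1])
    area = sum(AREA[i] for i in AREA if bits[i - 1])
    return ALPHA * contact + BETA * area

def connected(seqs):
    """True if seqs form one component under single point mutations."""
    start = next(iter(seqs))
    seen, stack = {start}, [start]
    while stack:
        cur = stack.pop()
        for k in range(len(cur)):
            nxt = cur[:k] + (1 - cur[k],) + cur[k + 1:]   # flip residue k
            if nxt in seqs and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen == seqs

all_seqs = list(product((0, 1), repeat=len(AREA)))
best = min(energy(b) for b in all_seqs)
fittest = {b for b in all_seqs if energy(b) == best}
print(len(fittest), connected(fittest))
```

On this example the fittest sequences differ from each other in at least two positions, so no chain of point mutations links them while staying fittest.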

61

Problem #14

• Input: a 3D structure and two fittest protein sequences.

• Output: Determine whether the two sequences are connected.

• Computational Complexity: 1 network flow.

62

Problem #15

• Input: a 3D structure.

• Output: the smallest set of allowable mutations with respect to which the fittest protein sequences (or two given fittest protein sequences) for the structure are connected.

• Computational Complexity: 1 network flow.

63

Outline of Technical Discussions (6)


64

Further Research for Protein Sequence Design

1. More sophisticated models (biology).

2. Algorithms and complexity for such models (computer science).

3. Wet lab validation (biology).

65

Further Algorithmic Research for Bioinformatics

Current State of Bioinformatics:

1. Biology: mostly very simple heuristics

2. Algorithms: mostly very simple techniques

Conjectures:

1. Biology: Nature is not so simple. Most of the biological information is very complicated.

2. Algorithms: Very sophisticated, novel, and fundamental techniques will be needed to unlock Nature’s secrets.