
1 A Combinatorial Toolbox for Protein Sequence Design and Landscape Analysis in the Grand Canonical...

Date post: 25-Dec-2015
Upload: jack-freeman
1 A Combinatorial Toolbox for Protein Sequence Design and Landscape Analysis in the Grand Canonical Model Ming-Yang Kao Department of Computer Science Northwestern University Evanston, Illinois U. S. A
Transcript
Page 1: 1 A Combinatorial Toolbox for Protein Sequence Design and Landscape Analysis in the Grand Canonical Model Ming-Yang Kao Department of Computer Science.


A Combinatorial Toolbox for Protein Sequence Design and Landscape Analysis in the Grand Canonical Model

Ming-Yang Kao

Department of Computer Science, Northwestern University

Evanston, Illinois, U.S.A.


Acknowledgments

This talk is based on joint work with colleagues & students at Yale University:

Computer Science:
• Jim Aspnes
• Gauri Shah

Biology:
• Julia Hartling
• Junhyong Kim


Dual Purposes of This Talk

1. Discuss protein folding problems.

2. Emphasize the point that as bioinformatics grows, advanced algorithmic techniques will become useful and crucial.


Importance of Protein Folding

A protein’s 3D structure largely determines its function.


Two Complementary Problems for Protein Folding

1. Protein Folding Prediction --- Given a protein sequence, determine the 3D folding of the sequence.

2. Protein Sequence Design --- Given a 3D structure, determine the fittest protein sequence for the structure, i.e., one that has the smallest energy among all possible sequences when folded into the structure.


Complexity for Protein Folding Problems

• Protein Folding Prediction --- Given a protein sequence, determine the 3D folding of the sequence.

Complexity: NP-hard under various models.

• Protein Sequence Design --- Given a 3D structure, determine the fittest protein sequence for the structure, i.e., one that has the smallest energy among all possible sequences when folded into the structure.

Complexity: solvable in polynomial time under the Grand Canonical model.


History of Protein Sequence Design

• Protein Sequence Design --- Given a 3D structure, determine the fittest protein sequence for the structure, i.e., one that has the smallest energy among all possible sequences when folded into the structure.

1. Sun et al, 1995: Heuristic search without optimality guarantee.

2. Hart, 1997: Open question on the computational tractability.

3. Kleinberg, 1999: Polynomial-time algorithms.

4. Aspnes, Hartling, Kao, Kim, Shah, 2001: Improved algorithms and generalized problems (this talk).


Outline of Technical Discussions

• The Grand Canonical Model
• Two Basic Computational Problems
• Experimental Results
• Combinatorial Tools
  (1a) Linear Programming
  (1b) Network Flow
  (1c) Compact Representation of All Min Cuts
  (1d) others
• Further Algorithmic & Computational Hardness Results
• Conclusions


Outline of Technical Discussions (1)

• The Grand Canonical Model
• Two Basic Computational Problems
• Experimental Results
• Combinatorial Tools
  (1a) Linear Programming
  (1b) Network Flow
  (1c) Compact Representation of All Min Cuts
  (1d) others
• Further Algorithmic & Computational Hardness Results
• Conclusions


Grand Canonical Model (Sun et al, 1995)

• Each amino acid is classified as either Hydrophobic (H) or Polar (P).

• Each amino acid sequence is then treated as a binary sequence of H’s and P’s. (For mathematical convenience, set H = 1 and P = 0.)

• Hydrophobic (H): A, C, F, I, L, M, V, W, Y.

• Polar (P): the other amino acids.

• Reference: Sun, Brem, Chan, Dill. Designing amino acid sequences to fold with good hydrophobic cores. Protein Engineering, 1995.
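The H/P classification above can be sketched in a few lines of Python. This is only an illustration: the function names `to_hp` and `to_binary` and the example sequence are hypothetical, and the only fact taken from the slide is the list of hydrophobic residues.

```python
# Hydrophobic residues as listed on the slide; every other residue is treated as polar.
HYDROPHOBIC = frozenset("ACFILMVWY")


def to_hp(sequence: str) -> str:
    """Map a one-letter amino acid sequence to its H/P string."""
    return "".join("H" if aa in HYDROPHOBIC else "P" for aa in sequence.upper())


def to_binary(sequence: str) -> list[int]:
    """Same classification with H = 1 and P = 0."""
    return [1 if symbol == "H" else 0 for symbol in to_hp(sequence)]


if __name__ == "__main__":
    example = "MKVLAT"  # hypothetical sequence, for illustration only
    print(to_hp(example), to_binary(example))
```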


Representation of a 3D structure: (Sun et al, 1995)

A 3D folding structure S of a sequence of n amino acids:

• the coordinates of each atom in S.

Two quantities of S are used:

1. the pairwise distances between the centers of amino acid residues in S.

2. the solvent-accessible areas of the amino acid residues in S.


Goal of Protein Sequence Design: (Sun et al, 1995)

Input: A 3D structure S and a sequence length n.

Output: a sequence X of n amino acids that, when folded into S, has the following properties:

1. The H-residues in X are as close to each other as possible.

2. The solvent-accessible areas of the H-residues of X are as small as possible.


Fitness of a Sequence (Sun et al, 1995)

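[The fitness formula on this slide was an image and is not recovered in this transcript. Its two terms were annotated as follows.]

• one term: closeness among H-residues

• the other term: small surface area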
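For reference, a fitness function consistent with these two annotations, and with the arc labels -alpha*g(dij) and beta*si that appear later in the talk, can be written as below. This is a reconstruction under assumptions, not the slide's own formula: the symbol \Phi, the encoding x_i = 1 for H and x_i = 0 for P, and the signs alpha < 0 < beta are assumed (the worked example later uses alpha = -8 and beta = 1/3), and the published model may restrict which pairs (i, j) are counted.

```latex
\Phi(x) \;=\; \alpha \sum_{i<j} g(d_{ij})\, x_i x_j \;+\; \beta \sum_{i=1}^{n} s_i\, x_i,
\qquad x_i \in \{0,1\}, \quad \alpha < 0 < \beta .
```

Here d_{ij} is the distance between residues i and j, g is a nonnegative function of that distance, and s_i is the solvent-accessible area of residue i. Under this reading, a fittest sequence is one that minimizes \Phi.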


Outline of Technical Discussions (2)

• The Grand Canonical Model
• Two Basic Computational Problems
• Experimental Results
• Combinatorial Tools
  (1a) Linear Programming
  (1b) Network Flow
  (1c) Compact Representation of All Min Cuts
  (1d) others
• Further Algorithmic & Computational Hardness Results
• Conclusions


Problem #1

Input:
1. the parameters alpha and beta,
2. a protein sequence Y,
3. Y’s 3D structure,
4. the sequence length n of Y.

Output: a fittest sequence X for the 3D structure with respect to the given alpha and beta.

Applications of this problem: design the best sequences for novel structures, since the sequence Y itself is not really needed.


Problem #2

Input:
1. the parameters alpha and beta,
2. a protein sequence Y,
3. Y’s 3D structure,
4. the sequence length n of Y.

Output: a fittest sequence X for the 3D structure that is the most similar to Y over all possible alpha and beta.

Applications of this problem: tune the alpha and beta of the Grand Canonical model.


Basic Computational Scheme (1)

3D structure → network → a min cut → a fittest sequence (e.g., HPPPHHPHP)


Problem #1


Computational Complexity: 1 network flow.


Problem #2


Computational Complexity: O(n) network flows.


Outline of Technical Discussions (3)

• The Grand Canonical Model
• Two Basic Computational Problems
• Experimental Results
• Combinatorial Tools
  (1a) Linear Programming
  (1b) Network Flow
  (1c) Compact Representation of All Min Cuts
  (1d) others
• Further Algorithmic & Computational Hardness Results
• Conclusions


Empirical Study: Predictive Ability

1. Computed Fittest Sequence versus Native Sequences (% similarity)

2. Our % Similarity versus Kleinberg’s

3. % Similarity versus Protein Family Size.


% similarity --- computed versus native

1. % similarity = the percentage of the H/P’s in the computed fittest sequence that are identical to those in the native sequence.

2. The average percentage of the hydrophobic residues is 42% in the native sequences that were studied.

3. The best sequence picked without “domain knowledge” would have a 58% similarity on average (for example, the all-P sequence matches the 58% of residues that are polar).
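The similarity measure and the no-knowledge baseline can be illustrated with a short sketch. The function names and the two example H/P strings are hypothetical; the baseline simply scores whichever constant sequence (all H or all P) agrees with more positions of the native sequence, which is how a 42% hydrophobic fraction leads to 58%.

```python
def percent_similarity(computed: str, native: str) -> float:
    """Percentage of positions where two H/P strings carry the same symbol."""
    if len(computed) != len(native):
        raise ValueError("sequences must have the same length")
    matches = sum(c == n for c, n in zip(computed, native))
    return 100.0 * matches / len(native)


def constant_baseline(native: str) -> float:
    """Similarity achieved by the better of the all-H and all-P sequences."""
    h_count = native.count("H")
    return 100.0 * max(h_count, len(native) - h_count) / len(native)


if __name__ == "__main__":
    native = "HPPHPPHPPP"    # hypothetical native H/P pattern (30% H)
    computed = "HPPPPPHPPH"  # hypothetical computed fittest sequence
    print(percent_similarity(computed, native))
    print(constant_baseline(native))
```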

% similarity --- computed versus native (1)-(3)

[Three figure slides; the plots were not recovered in this transcript. Slide (2) compares our results with Kleinberg’s.]

% similarity versus PFAM family size (1)

1. % similarity = the percentage of the H/P’s in the computed fittest sequence that are identical to those in the native sequence.

2. PFAM family size of a protein = # of proteins in the PFAM database that are related to the given protein.

• The relatedness is computed via HMM models.

• Source: pfam.wustl.edu

• Family size is used as a measure of the success of a protein in Nature.


% similarity versus PFAM family size (2)

1. % similarity = the percentage of the H/P’s in the computed fittest sequence that are identical to those in the native sequence.

2. PFAM family size of a protein = # of proteins in the PFAM database that are related to the given protein.

Intuition/Conjecture: (3A) the more diverse a protein family is, (3B) the more its 3D structures vary, and (3C) the smaller the % similarity will be.

% similarity versus PFAM family size (3)-(4)

[Two figure slides; the plots were not recovered in this transcript.]

Outline of Technical Discussions (4)

• The Grand Canonical Model
• Two Basic Computational Problems
• Experimental Results
• Combinatorial Tools
  (1a) Linear Programming
  (1b) Network Flow
  (1c) Compact Representation of All Min Cuts
  (1d) others
• Further Algorithmic & Computational Hardness Results
• Conclusions


Tool #1: Linear Programming

Goal: find a fittest sequence X of n amino acids.

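Formulation 1: find a binary sequence x that minimizes the fitness function. [Formula not recovered.] This objective is quadratic (slide annotation: “clueless!”).

Formulation 2: find x and y that solve a linear program. [Formula not recovered.]

1. Linear
2. Totally unimodular
3. Integer solution
4. Useful for proving theorems
5. Still too inefficient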
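Since the slide's formulas are lost, the following is one standard way to write the two formulations; it is an assumed reconstruction, not the authors' exact program. It reuses the notation alpha, beta, g(d_{ij}), and s_i from the talk, assumes alpha < 0 and g >= 0, and introduces a variable y_{ij} (the "y" on the slide is assumed to play this role) that stands in for the product x_i x_j.

```latex
% Quadratic 0/1 form (assumed reconstruction):
\min_{x \in \{0,1\}^n} \;\; \alpha \sum_{i<j} g(d_{ij})\, x_i x_j \;+\; \beta \sum_{i} s_i\, x_i

% One standard linearization, with y_{ij} standing in for x_i x_j:
\min_{x,\,y} \;\; \alpha \sum_{i<j} g(d_{ij})\, y_{ij} \;+\; \beta \sum_{i} s_i\, x_i
\quad \text{s.t.} \quad y_{ij} \le x_i, \;\; y_{ij} \le x_j, \;\; 0 \le x_i \le 1, \;\; y_{ij} \ge 0 .
```

Because alpha < 0 and g is nonnegative, an optimal solution pushes each y_{ij} up to min(x_i, x_j), which equals x_i x_j when x is binary. Each constraint of the form y_{ij} - x_i <= 0 has exactly one +1 and one -1 coefficient, which is the kind of structure that makes a constraint matrix totally unimodular and hence gives integer optimal solutions.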


Tool #2: Network Flow (1)

• analogy: a network of oil pipes

1. source s (origin of oil)
2. sink t (destination of oil)
3. other nodes (midway stations)
4. arcs (pipes)
5. arc capacity (pipe capacity)
6. flow (amount of oil through a pipe)

• goal: deliver max amount of oil from source to sink

• computational goal: a max flow

• computational complexity: O(VE log(V^2/E))

[Figure: an example flow network from source s to sink t; arcs are labeled with their capacities.]

Tool #2: Network Flow (2)

Example of a max flow on the same network (legend, goal, and complexity as on the previous slide).

[Figure: the example network with each arc labeled “capacity (flow)”, for example 14 (14), 5 (4), 8 (5), 10 (5).]

Tool #2: Network Flow (3)

max flow versus min cut

1. A min cut is the bottleneck of the network.

2. A cut is a partition (S,T) of the nodes with s in S and t in T.

3. For a min cut, the total capacity of arcs from S to T equals the max flow.
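The max flow / min cut relationship can be checked on a small example. The sketch below uses the Edmonds-Karp algorithm (shortest augmenting paths), which is simpler than, and asymptotically slower than, an algorithm achieving the O(VE log(V^2/E)) bound quoted in the talk. The function name, the input format, and the five-arc example network are hypothetical; the slide's own network cannot be recovered from this transcript.

```python
from collections import deque


def max_flow_min_cut(capacity, s, t):
    """capacity: dict {(u, v): c}. Returns (max flow value, source side S of a min cut)."""
    residual = {}
    neighbors = {s: set(), t: set()}
    for (u, v), c in capacity.items():
        residual[(u, v)] = residual.get((u, v), 0) + c
        residual.setdefault((v, u), 0)          # reverse arc for undoing flow
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    flow = 0
    while True:
        # Breadth-first search for an augmenting path in the residual network.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v in neighbors[u]:
                if v not in parent and residual[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            # No augmenting path: the nodes still reachable from s form S.
            return flow, set(parent)
        path = []
        v = t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[arc] for arc in path)
        for u, v in path:
            residual[(u, v)] -= bottleneck
            residual[(v, u)] += bottleneck
        flow += bottleneck


if __name__ == "__main__":
    # Hypothetical network, not the one drawn on the slide.
    example = {("s", "a"): 10, ("s", "b"): 10, ("a", "b"): 2,
               ("a", "t"): 4, ("b", "t"): 6}
    value, source_side = max_flow_min_cut(example, "s", "t")
    cut_capacity = sum(c for (u, v), c in example.items()
                       if u in source_side and v not in source_side)
    print(value, sorted(source_side), cut_capacity)
```

In this example the arcs leaving S are a→t and b→t, whose capacities sum to the max flow value, as item 3 states.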

[Figure: the max-flow example network, repeated to illustrate a min cut.]

Tool #2: Network Flow (4)

max flow versus min cut (continued)

• computational complexity: O(VE log(V^2/E))

[Figure: the same example network.]



Tool #2: 3D Network (1)

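[Figure: a 9-residue chain folded on a 3×3 grid; residue labels as extracted: 6 5 4 / 7 8 9 / 1 2 3.]

Surface areas: s1 = 3, s2 = 18, s3 = 6, s4 = 9, s5 = 3, s6 = 9, s7 = 6, s8 = 24, s9 = 9

Contact terms: g(d16) = 0.5, g(d25) = 0.75, g(d58) = 0.9, g(d49) = 0.75

Parameters: alpha = -8, beta = 1/3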

Tool #2: 3D Network (2)-(4)

[Figure: the network built from the same example (same grid, surface areas, contact terms, alpha, and beta). It has one node per residue 1-9 and one node per contacting pair (1,6), (2,5), (5,8), (4,9). Arcs at the pair nodes have capacity -alpha*g(dij): 4, 6, 7.2, 6. Arcs at the residue nodes have capacity beta*si: 1, 6, 2, 3, 1, 3, 2, 8, 3.]

Theorem (Kleinberg, 1999): The amino acids that are with the source in a min cut are H’s.
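For a structure as small as the 9-residue example, the fittest sequences can also be found by exhaustive search, which gives an independent check on what a min cut should return. The sketch below is not the network-flow method of the talk: it enumerates all 2^9 H/P sequences and is only workable for tiny n. It assumes the fitness is alpha times the sum of g(dij) over pairs of H-residues plus beta times the sum of si over H-residues, and that pairs not listed on the slide contribute 0. Exact fractions are used so that ties are detected reliably.

```python
from fractions import Fraction
from itertools import product

ALPHA = Fraction(-8)    # from the slide
BETA = Fraction(1, 3)   # from the slide
# Solvent-accessible areas s_i from the slide.
AREA = {1: 3, 2: 18, 3: 6, 4: 9, 5: 3, 6: 9, 7: 6, 8: 24, 9: 9}
# g(d_ij) from the slide; pairs not listed are ASSUMED to contribute 0.
G = {(1, 6): Fraction(1, 2), (2, 5): Fraction(3, 4),
     (5, 8): Fraction(9, 10), (4, 9): Fraction(3, 4)}


def fitness(is_h):
    """Assumed fitness: alpha * (contacts among H's) + beta * (area of H's)."""
    contact = sum(g for (i, j), g in G.items() if is_h[i] and is_h[j])
    surface = sum(area for i, area in AREA.items() if is_h[i])
    return ALPHA * contact + BETA * surface


def fittest_sequences():
    """Return the minimum fitness and every H/P string that attains it."""
    best, winners = None, []
    for bits in product((0, 1), repeat=len(AREA)):
        is_h = dict(zip(sorted(AREA), bits))
        value = fitness(is_h)
        if best is None or value < best:
            best, winners = value, []
        if value == best:
            winners.append("".join("H" if b else "P" for b in bits))
    return best, winners


if __name__ == "__main__":
    best, winners = fittest_sequences()
    print(best, winners)
```

Under these assumptions the minimum is 0 and is attained by four sequences (PPPPPPPPP, HPPPPHPPP, PPPHPPPPH, HPPHPHPPH), so this example has several fittest sequences rather than a unique one.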



Tool #3: Linear Size Representation of All Min Cuts (1)

[Figure: the max-flow example network; arcs are labeled “capacity (flow)” and nodes are labeled s, t, and v1-v7.]

Step 1: Compute a max flow of G.

Step 2: Compute the residual network G’.

Step 3: Contract every strongly connected component into a super node. Call the new graph G”.

Def: A node subset U of G” is a closed set if for every node x in U, every descendant of x is also in U.

Theorem (Picard and Queyranne, 1980): Every closed set not including the sink forms a min cut, and vice versa.
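The closed-set characterization can be illustrated on a tiny residual network. The sketch below is a brute-force illustration, not the linear-size representation itself: it skips the contraction in Step 3 and tests every candidate subset directly, which is exponential in the number of nodes. It also assumes the usual reading for an s-t cut, namely that the source side contains the source and excludes the sink. The function names and the four-node example are hypothetical.

```python
from itertools import combinations


def is_closed(subset, arcs):
    """True if no arc leaves the subset, i.e. it contains all of its descendants."""
    return all(v in subset for u, v in arcs if u in subset)


def min_cut_source_sides(nodes, residual_arcs, source, sink):
    """Enumerate closed sets that contain the source and exclude the sink."""
    middle = sorted(n for n in nodes if n not in (source, sink))
    sides = []
    for size in range(len(middle) + 1):
        for extra in combinations(middle, size):
            candidate = {source, *extra}
            if is_closed(candidate, residual_arcs):
                sides.append(candidate)
    return sides


if __name__ == "__main__":
    # Hypothetical network s->a, s->b, a->t, b->t with unit capacities.
    # After a max flow of value 2 every arc is saturated, so the residual
    # network contains only the reversed arcs listed here.
    nodes = {"s", "a", "b", "t"}
    residual_arcs = [("a", "s"), ("b", "s"), ("t", "a"), ("t", "b")]
    for side in min_cut_source_sides(nodes, residual_arcs, "s", "t"):
        print(sorted(side))
```

In this example all four candidate source sides are closed, and each corresponds to a cut of capacity 2, the max flow value. In the protein setting, each such set would correspond to one fittest sequence.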


Tool #3: Linear Size Representation of All Min Cuts (2)

[Figure: the residual network of the example (nodes s, t, v1-v7) after the max flow.]

Tool #3: Linear Size Representation of All Min Cuts (3)

[Figure: the Picard-Queyranne representation of the example.]

Tool #3: Linear Size Representation of All Min Cuts (4)

[Figure: the Picard-Queyranne representation of the example.]

Applications:

1. Obtain all fittest sequences.

2. Study the landscape of the fittest sequences.

3. Compute fittest sequences with additional optimization objectives.


Basic Computational Scheme (2)

3D structure → network → a max flow/min cut → Picard-Queyranne representation → the space of all fittest sequences (e.g., HPPPHHPHP)


Outline of Technical Discussions (5)

• The Grand Canonical Model
• Two Basic Computational Problems
• Experimental Results
• Combinatorial Tools
  (1a) Linear Programming
  (1b) Network Flow
  (1c) Compact Representation of All Min Cuts
  (1d) others
• Further Algorithmic & Computational Hardness Results
• Conclusions


Problem #3

• Input: a 3D structure.

• Output: all its fittest protein sequences.

• Computational Complexity: (A) A linear size representation can be computed with 1 network flow.

(B) Each individual fittest protein sequence can be generated from this representation in O(n) time.


Problem #4

Input: f 3D structures.

Output: the set of all protein sequences that are the fittest simultaneously for all these 3D structures.

Computational Complexity: f network flows.


Problem #5

• Input: a protein sequence Y and its native 3D structure.

• Output: the set of all fittest protein sequences that are also the most (or least) similar to Y in terms of unweighted (or weighted) Hamming distances.

• Computational Complexity: 1 network flow.
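The distance measure in this problem can be made concrete with a short sketch. It computes unweighted or weighted Hamming distances and filters an explicit list of candidate sequences; it is not the one-network-flow algorithm the slide refers to. The function names, the candidate list, the native sequence, and the weights are all hypothetical.

```python
def hamming(x, y, weights=None):
    """Unweighted (weights=None) or weighted Hamming distance between H/P strings."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    if weights is None:
        weights = [1] * len(x)
    if len(weights) != len(x):
        raise ValueError("need one weight per position")
    return sum(w for a, b, w in zip(x, y, weights) if a != b)


def most_similar(candidates, native, weights=None):
    """Candidates at minimum distance from the native sequence."""
    best = min(hamming(c, native, weights) for c in candidates)
    return [c for c in candidates if hamming(c, native, weights) == best]


if __name__ == "__main__":
    # Hypothetical data, for illustration only.
    candidates = ["PPPPPPPPP", "HPPPPHPPP", "PPPHPPPPH", "HPPHPHPPH"]
    native = "HPPHPPPPH"
    print(most_similar(candidates, native))
    print(most_similar(candidates, native, weights=[5, 1, 1, 1, 1, 1, 1, 1, 1]))
```

With unit weights two candidates tie at distance 1; giving the first position a larger weight breaks the tie in favor of the candidate that agrees with the native sequence there.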


Problem #6

• Input: a 3D structure.

• Output: Count the number of protein sequences in the solution to each of Problems #3, #4, and #5.

• Computational Complexity: #P-complete.


Problem #7

• Input: a 3D structure and a bound e.

• Output: Enumerate the protein sequences whose fitness function values are within an additive factor e of that of the fittest protein sequences.

• Computational Complexity: polynomial time to generate each desired protein sequence.


Problem #8

• Input: a 3D structure.

• Output: the largest possible unweighted (or weighted) Hamming distance between any two fittest protein sequences.

• Computational Complexity: 1 network flow.


Problem #9

• Input: a protein sequence Y and its native 3D structure.

• Output: the average unweighted (or weighted) Hamming distance between Y and the fittest protein sequences for the 3D structure.

• Computational Complexity: #P-complete.


Problem #10

• Input: a protein sequence Y, its native 3D structure, and two unweighted Hamming distances d1 and d2.

• Output: a fittest protein sequence whose distance from Y is also between d1 and d2.

• Computational Complexity: NP-hard.


Problem #11

• Input: a protein sequence Y, its native 3D structure, and an unweighted Hamming distance d.

• Output: the fittest among the protein sequences which are at distance d from Y.

• Computational Complexity: NP-hard. We have a polynomial-time approximation algorithm.


Problem #12

• Input: a protein sequence Y and its native 3D structure

• Output: all the ratios between the scaling factors alpha and beta in the GC model such that the smallest possible unweighted (or weighted) Hamming distance between Y and any fittest protein sequence is minimized over all possible alpha and beta.

• Computational Complexity: O(n) network flows.


Problem #13

• Input: a 3D structure.

• Output: Determine whether the fittest protein sequences are connected, i.e., whether they can mutate into each other through allowable mutations, such as point mutations, while the intermediate protein sequences all remain the fittest.

• Computational Complexity: 1 network flow.


Problem #14

•  Input: a 3D structure and two fittest protein sequences.

• Output: Determine whether the two sequences are connected.

• Computational Complexity: 1 network flow.


Problem #15

• Input: a 3D structure.

• Output: the smallest set of allowable mutations with respect to which the fittest protein sequences (or two given fittest protein sequences) for the structure are connected.

• Computational Complexity: 1 network flow.


Outline of Technical Discussions (6)

• The Grand Canonical Model
• Two Basic Computational Problems
• Experimental Results
• Combinatorial Tools
  (1a) Linear Programming
  (1b) Network Flow
  (1c) Compact Representation of All Min Cuts
  (1d) others
• Further Algorithmic & Computational Hardness Results
• Conclusions


Further Research for Protein Sequence Design

1. More sophisticated models (biology).

2. Algorithms and complexity for such models (computer science).

3. Wet lab validation (biology).


Further Algorithmic Research for Bioinformatics

Current State of Bioinformatics:

1. Biology: mostly very simple heuristics

2. Algorithms: mostly very simple techniques

Conjectures:

1. Biology: Nature is not so simple. Most of the biological information is very complicated.

2. Algorithms: Very sophisticated, novel, and fundamental techniques will be needed to unlock Nature’s secrets.

