Computational Methods for Protein Structure Prediction Ying Xu.

Computational Methods for Protein Structure Prediction

Ying Xu Ying Xu

Outline introduction to protein structures

the problem of protein structure prediction

why we can predict protein structure

protein tertiary structure prediction – Ab initio folding– homology modeling

protein threading

Protein and Structure

>1MBN:_ MYOGLOBIN (154 AA) MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASEDLKKAGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHPGNFGADAQGAMNKALELFRKDIAAKYKELGYQG

Protein sequenc

e

Protein structur

e

Oxygen storage

Protein function

Ball and stick

spacefill

Protein Structure protein sequence folds into a “unique” shape (“structure”) that

minimizes its free potential energy

Protein Structures Primary sequence

Secondary structure

MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE

-helix

-sheet

anti-parallel

parallel

Protein Structures Tertiary structure

Quaternary structure

Protein Structures Protein structure

– generally compact

Soluble protein structure– individual domains are generally globular

– they share various common characteristics, e.g. hydrophobic moment profile

Membrane protein structure

most of the amino acid sidechains of transmembrane segments are non-polar

polar groups of the polypeptide backbone of transmembrane segments generally participate in hydrogen bonds

Protein Tertiary Structures

Family: Clear evolutionary relationship, protein in the same family are homologous, sequence identity >=30%.

Superfamily: Low sequence identity, probable common evolutionary origin. Fold: May not have a common evolutionary origin. Major structural similarity.

Class: all-, all-, /, +, … http://scop.mrc-lmb.cam.ac.uk/scop/SCOP 1.65 release: 2327 families, 1294 superfamilies, 800 folds

SCOP: Structural Classification Of Proteins

Protein Tertiary Structures

Protein Structure Determination X-ray crystallography

– most accurate– in vitro– need crystals proteins – ~50K per structure

NMR – accurate– in vivo– no need for crystals– limited to small proteins

Cryo-EM– Imaging technology– Low-resolution

Protein Structure Prediction

Problem:Given the amino acid sequence of a protein,

what’s its 3-dimensional shape?

MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASEDLKKAGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHPGNFGADAQGAMNKALELFRKDIAAKYKELGYQG

? ……..

Why Protein Structure Prediction? Importance of protein structure

– knowledge of the structure of a protein enable us to understand its function and functional mechanism

– design better mutagenesis experiments

– structure-based rational drug design Experimental methods for protein structure determination

Pros: high resolution Cons: time-consuming and very expensive

Why Protein Structure Prediction?

Big gap between the number of protein sequences and the number of protein structures– Uniprot/Swiss-prot, 283,454 protein sequences

– Uniprot/TrEMBL, 4,864,587 gene sequences

– PDB (Protein Data Bank), 48,385 protein structures

Fundamental, unsolved, challenging problem

Why We Can Predict Structure In theory, a protein structure can solved computationally

A protein folds into a 3D structure to minimizes its free potential energy

– Anfinsen’s classic experiment on Ribonuclease A folding in the 1960’s

– energy functions

This problem can be formulated as an optimization problem– protein folding problem, or ab initio folding

Why We Can Predict Structure While there could be billions of billions of proteins in nature, the

number of unique structural folds (shapes) might be small

90% of new structures submitted to PDB in the past three years have similar structural folds in PDB

Why We Can Predict Structure Theoretical studies suggest that the vast majority of the proteins

in nature fall into not much more than 1000 structural folds

This realization has fundamentally changed how protein structures can be predicted

The structure prediction problem becomes for a protein sequence, find which of the structural folds the protein can fold into, plus possibly some structural refinement

MTYKLILN …. NGVDGEWTYTE

Computational Methods for Protein Structure Prediction

ab initio --use first principles to fold proteins --does not require templates --high computational complexity

•homology modeling --similar sequence similar structures --practically very useful, need homologues

• protein threading --many proteins share the same structural fold --a folding problem becomes a fold recognition problem

Need known protein structures

ab initio structure prediction

An energy function to describe the proteino bond energyo bond angle energyo dihedral angel energyo van der Waals energyo electrostatic energy

Efficient and reliable algorithms to search the conformational space to minimize the function and obtain the structure.

ab initio structure prediction

The problem is exceedingly difficult to solve– the search space is defined by psi/phi angles of backbone and side-

chain positions– the search space is enormous even for small proteins!– the number of local minima increases exponentially of the number

of residues

Theoretically solvable but practically infeasible!

ROSETTA (Dave Baker’s Lab)

Construct a library of small structure fragments, e.g. 9 AA

Cut a target sequence to sequence fragments. For each sequence fragment, choose structural candidate fragments from the fragment library

Assemble the fragment structures by Monte Carlo simulation

The generated structures are grouped into clusters

Clusters are ranked by their energy

Homology ModelingObservation: proteins with similar sequences tend to fold into similar structures.

1. Target sequence is aligned with the sequence of a known structure, they usually share sequence identity of 30% or higher

2. Superimpose target sequence onto the template, replacing equivalent side-chain atoms where necessary

3. Refine the model by minimizing an energy function.

Programs: Modeller http://salilab.org/modeller/

Swiss-Model http://swissmodel.expasy.org//SWISS-MODEL.html

Protein Threading Basic premise

Statistics from Protein Data Bank (~48,000 structures)

Chances for a protein to have a native-like structural fold in PDB are quite good (estimated to be 60-70%)

– Proteins with similar structural folds could be homologues or analogues

The number of unique structural (domain) folds in nature is fairly small (possibly a few thousand)

90% of new structures submitted to PDB in the past three years have similar structural folds in PDB

Protein Threading The goal: find the “correct” sequence-structure alignment

between a target sequence and its native-like fold in PDB

Energy function – knowledge (or statistics) based rather than physics based – Should be able to distinguish correct structural folds from

incorrect structural folds

– Should be able to distinguish correct sequence-fold alignment from incorrect sequence-fold alignments

MTYKLILN …. NGVDGEWTYTE

Protein Threading – four basic components

Structure database

Energy function

Sequence-structure alignment algorithm

Prediction reliability assessment

Protein Threading – structure database

Build a template database

Protein Threading – structure database

• Non-redundant representatives through structure-structure and/or sequence-sequence comparison

FSSP (http://www.bioinfo.biocenter.helsinki.fi:8080/dali/index.html)

(Families of Structurally Similar Proteins)

SCOP (http://scop.mrc-lmb.cam.ac.uk/scop/)

PDB-Select (http://www.sander.embl-heidelberg.de/pdbsel/)

Pisces (http://www.fccc.edu/research/labs/dunbrack/pisces/)

Protein Threading – energy function


how well a residue fits a structural environment: E_s

how preferable to put two particular residues nearby: E_p

alignment gap penalty: E_g

total energy: E_p + E_s + E_g

find a sequence-structure alignment to minimize the energy function


A singleton energy measures each residue’s preference in a specific structural environments– secondary structure

– solvent accessibility Compare actual occurrence against its “expected value” by

chance

Where


A simple definition of structural environment– secondary structure: alpha-helix, beta-strand, loop– solvent accessibility: 0, 10, 20, …, 100% of accessibility– each combination of secondary structure and solvent

accessibility level defines a structural environment• E.g., (alpha-helix, 30%), (loop, 80%), …

E_s: a scoring matrix of 30 structural environments by 20 amino acids– E.g., E_s ((loop, 30%), A)

Singleton energy term


Helix Sheet Loop Buried Inter Exposed Buried Inter Exposed Buried Inter ExposedALA -0.578 -0.119 -0.160 0.010 0.583 0.921 0.023 0.218 0.368ARG 0.997 -0.507 -0.488 1.267 -0.345 -0.580 0.930 -0.005 -0.032ASN 0.819 0.090 -0.007 0.844 0.221 0.046 0.030 -0.322 -0.487ASP 1.050 0.172 -0.426 1.145 0.322 0.061 0.308 -0.224 -0.541CYS -0.360 0.333 1.831 -0.671 0.003 1.216 -0.690 -0.225 1.216GLN 1.047 -0.294 -0.939 1.452 0.139 -0.555 1.326 0.486 -0.244GLU 0.670 -0.313 -0.721 0.999 0.031 -0.494 0.845 0.248 -0.144GLY 0.414 0.932 0.969 0.177 0.565 0.989 -0.562 -0.299 -0.601HIS 0.479 -0.223 0.136 0.306 -0.343 -0.014 0.019 -0.285 0.051ILE -0.551 0.087 1.248 -0.875 -0.182 0.500 -0.166 0.384 1.336LEU -0.744 -0.218 0.940 -0.411 0.179 0.900 -0.205 0.169 1.217LYS 1.863 -0.045 -0.865 2.109 -0.017 -0.901 1.925 0.474 -0.498MET -0.641 -0.183 0.779 -0.269 0.197 0.658 -0.228 0.113 0.714PHE -0.491 0.057 1.364 -0.649 -0.200 0.776 -0.375 -0.001 1.251PRO 1.090 0.705 0.236 1.249 0.695 0.145 -0.412 -0.491 -0.641SER 0.350 0.260 -0.020 0.303 0.058 -0.075 -0.173 -0.210 -0.228THR 0.291 0.215 0.304 0.156 -0.382 -0.584 -0.012 -0.103 -0.125TRP -0.379 -0.363 1.178 -0.270 -0.477 0.682 -0.220 -0.099 1.267TYR -0.111 -0.292 0.942 -0.267 -0.691 0.292 -0.015 -0.176 0.946VAL -0.374 0.236 1.144 -0.912 -0.334 0.089 -0.030 0.309 0.998


It measures the preference of a pair of amino acids to be close in 3D space.

How close is close?– distance dependent– single cutoff– C, C, or centroid of the sidechain

Observed occurrence of a pair compared with its “expected” occurrence

Pair-wise interaction energy term


ALA -140ARG 268 -18ASN 105 -85 -435ASP 217 -616 -417 17CYS 330 67 106 278 -1923GLN 27 -60 -200 67 191 -115GLU 122 -564 -136 140 122 10 68GLY 11 -80 -103 -267 88 -72 -31 -288HIS 58 -263 61 -454 190 272 -368 74 -448ILE -114 110 351 318 154 243 294 179 294 -326LEU -182 263 358 370 238 25 255 237 200 -160 -278LYS 123 310 -201 -564 246 -184 -667 95 54 194 178 122MET -74 304 314 211 50 32 141 13 -7 -12 -106 301 -494PHE -65 62 201 284 34 72 235 114 158 -96 -195 -17 -272 -206PRO 174 -33 -212 -28 105 -81 -102 -73 -65 369 218 -46 35 -21 -210SER 169 -80 -223 -299 7 -163 -212 -186 -133 206 272 -58 193 114 -162 -177THR 58 60 -231 -203 372 -151 -211 -73 -239 109 225 -16 158 283 -98 -215 -210TRP 51 -150 -18 104 52 -12 157 -69 -212 -18 81 29 -5 31 -432 129 95 -20TYR 53 -132 53 268 62 -90 269 58 34 -163 -93 -312 -173 -5 -81 104 163 -95 -6VAL -105 171 298 431 196 180 235 202 204 -232 -218 269 -50 -42 46 267 73 101 107 -324 ALA ARG ASN ASP CYS GLN GLU GLY HIS ILE LEU LYS MET PHE PRO SER THR TRP TYR VAL


w(k) = h + gk, k ≥ 1, w(0) = 0;

Where h and g are constants.h: opening gap penaltyg: extension gap penalty

FDSK---THRGHR:.: :: :::FESYWTCTH-GHR

FDSK-T--HRGHR:.: : : :::FESYWTCTH-GHR

gap penalty term


• Secondary structure prediction is mature and can achieve ~80% accuracy

• The performance of using probabilities of the predicted three secondary structure states (-helices, -strand, and loop) is better

Secondary structure match energy

Threading Parameter Optimization

The contribution of each term (weight).

Based on threading performance on a training set (fold recognition and alignment accuracy).

Different weight for different classes? (superfamily, fold) pair-wise may contribute more for fold level threading mutation/profile terms dominate in superfamily level threading

Etotal = sEsingleton + pEpairwise + gEgap + ssEss

Protein Threading -- algorithm

Considering only singleton energy + gap penalty

Represent a structure a sequence of “structural environments”– (helix, 100%), (helix, 90%), ….. (strand, 0%)

Align a sequence MACKLPV …. with a structural sequence (helix, 100%), (helix, 90%), ….. (strand, 0%)

Protein Threading – dynamic programming

AAGG

AACG | | |

Two sequences: AACG and AAGG

A A C G

A

A

G

G

2

-1

-1

2

-1-12

2

3

4 3 2

5

2

3

5

Step #1: calculating alignment matrixRule:

1: initialization– fill the first row and column with matching scores

2: fill an empty cell based on scores of its left, upper and upper-left neighbors + the matching score of the current cell

3: chose the one giving the highest score


A A C G

A

A

G

G

2

-1

-1

2

-1-12

2

3

4 3 2

5

2

3

5

Step #2: Tracing back to recover the alignment

Rule:

1: start from the right-lower corner

2: trace back to left, upper or upper-left neighbor which gives the current cell’s score

3. Keep doing this until it cannot continue


Steps: 1. Initialization: construct an (n+1) x (m+1) matrix F for two sequences

of lengths n and m.

2. Matrix fill: for each cell in the matrix F, check all possible pathways back to the beginning of the sequence (allowing insertions and deletions) and give that cell the value of the maximum scoring.

3. Traceback: construct an alignment back from the last cell in the matrix (or the highest scoring) cell to give the highest scoring alignment.

Protein Threading -- dynamic programming

(helix, 100%)

(helix, 90%)

(helix, 80%)

(loop, 80%)

M

L

V

A


Considering all three energy terms

Considering the pair-wise interaction energy makes the problem much more difficult to solve – dynamic programming algorithm does not work any more!

There are other techniques that can be used to solve the problem


Dynamic programming Heuristic algorithms for pair-wise interactions

– Frozen approximation algorithm (A. Godzik et al.)

– Double dynamic programming (D. Jones et al.)

– Monte carlo sampling (S.H. Bryant et al.)

Rigorous algorithms for pair-wise interactions– Branch-and-bound (R.H. Lathrop and T.F. Smith)

– Divide-and-conquer (Y. Xu et al.) --PROSPECT

– Linear programming (J. Xu et al.) –RAPTOR

– Tree decomposition (L. Cai et al.) Rigorous algorithm for treating backbone and side-chain

simultaneously (Li et al.)

Fold Recognition


Score = -1500 Score = -900Score = -1120Score = -720

Which one is the correct structural fold for the target sequence if any?

The one with the highest score ?

Fold Recognition

Template #1: AATTAATACATTAATATAATAAAATTACTGA

Query sequence: AAAA

Template #2: CGGTAGTACGTAGTGTTTAGTAGCTATGAA

Better template?

Which of these two sequences will have better chance to have a good match with the query sequence after randomly reshuffling them?

Fold Recognition

Different template structures may have different background scores, making direct comparison of threading scores against different templates invalid

Comparison of threading results should be made based on how standout the score is in its background score distribution rather than the threading scores directly

Fold Recognition

Threading 100,000 sequences against a template structure provides the baseline information about the background scores of the template

By locating where the threading score with a particular query sequence, one can decide how significant the score, and hence the threading result, is!

Not significant significant

Fold Recognition

Z-score = standard deviation

score - average

--randomly shuffle the query sequence and calculate the alignment score

Fold Recognition

Significance score versus prediction specificity/sensitivity

Z-score

Fold Recognition

Examine feature space of threading alignments: (singleton score, pair contact scores, secondary structure score, hydrophobic moment score, ......) versus true/false fold recognition

Separate true ones from false ones using support vector machine (SVM)

false

true

-2000, -500, -35, -90, ......, true

-1000, -201, -11, -500, ......, false

-5020, -900, -20, -75, ......, true

-1050, -185, -18, -320, ......, false

......

Fold Recognition

Each feature has somewhat different distributions in the true and false predictions

E.g., hydrophobic moments (Hydrophobic moments of protein structures: spatially profiling the

distribution, David Silverman, PNAS 2001 98: 4996-5001) is quite useful in distinguishing true from false threading predictions

Hydrophobic Moment profiles

-100

-80

-60

-40

-20

0

20

40

60

80

100

120

0 5 10 15 20 25 30 35

d (Angstromes)

H2

(d)

Application

Sequence preprocessing

Protein sequenceRemove signal peptide

Membrane/soluble??

Domain prediction

Database searching

Find homolog in PDB?

YESHomologymodeling

NO Fold recognition

3Dstructure

Sequence profile generationSecondary structure PredictionExperimental data constraints

Challenging Issues Much improved energy functions for threading

Considering side-chain information when doing protein threading

Effective integration of fragment-based methods and threading techniques

Date post:	19-Dec-2015
Category:	Documents
Upload:	lawrence-chase
View:	226 times
Download:	8 times

Computational Methods for Protein Structure Prediction Ying Xu.

Documents