
Post on 04-Jan-2016


Protein structure prediction
May 26, 2011

HW #8 due today
Quiz #3 on Tuesday, May 31

Learning objectives
• Understand the biochemical basis of secondary structure prediction programs.
• Become familiar with the databases that hold secondary structure information.
• Understand neural networks and how they help to predict secondary structure.

Workshop: Predict secondary structure of p53.

Homework #9: Due June 2

What is secondary structure?

Three major types:

Alpha Helical Regions

Beta Strand Regions

Coils, Turns, Extended (anything else)

Can we predict the final structure?

http://en.wikipedia.org/wiki/Protein_folding

Some Prediction Methods

• ab initio methods: based on physical properties of aa’s and bonding patterns
• Statistics of amino acid distributions in known structures: Chou-Fasman
• Sequence similarity to sequences with known structure: PSIPRED

Chou-Fasman

• First widely used procedure
• Output: helix, strand or turn
• Percent accuracy: 60-65%

PSI-BLAST-based secondary structure prediction (PSIPRED)

Three steps:
1) Generation of position specific scoring matrix.
2) Prediction of initial secondary structure.
3) Filtering of predicted structure.

AA   P(α)   AA   P(β)   AA   P(T)   AA   f(i)   f(i+1)  f(i+2)  f(i+3)
Glu  1.51   Val  1.70   Asn  1.56   Ala  0.060  0.076   0.035   0.058
Met  1.45   Ile  1.60   Gly  1.56   Arg  0.070  0.106   0.099   0.085
Ala  1.42   Tyr  1.47   Pro  1.52   Asp  0.147  0.110   0.179   0.081
Leu  1.21   Phe  1.38   Asp  1.46   Asn  0.161  0.083   0.191   0.091
Lys  1.14   Trp  1.37   Ser  1.43   Cys  0.149  0.050   0.117   0.128
Phe  1.13   Leu  1.30   Cys  1.19   Glu  0.056  0.060   0.077   0.064
Gln  1.11   Cys  1.19   Tyr  1.14   Gln  0.074  0.098   0.037   0.098
Ile  1.08   Thr  1.19   Lys  1.01   Gly  0.102  0.085   0.190   0.152
Trp  1.08   Gln  1.10   Gln  0.98   His  0.140  0.047   0.093   0.054
Val  1.06   Met  1.05   Thr  0.96   Ile  0.043  0.034   0.013   0.056
Asp  1.01   Arg  0.93   Trp  0.96   Leu  0.061  0.025   0.036   0.070
His  1.00   Asn  0.89   Arg  0.95   Lys  0.055  0.115   0.072   0.095
Arg  0.98   His  0.87   His  0.95   Met  0.068  0.082   0.014   0.055
Thr  0.83   Ala  0.83   Glu  0.74   Phe  0.059  0.041   0.065   0.065
Ser  0.77   Gly  0.75   Ala  0.66   Pro  0.102  0.301   0.034   0.068
Cys  0.70   Ser  0.75   Met  0.60   Ser  0.120  0.139   0.125   0.106
Tyr  0.69   Lys  0.74   Phe  0.60   Thr  0.086  0.108   0.065   0.079
Asn  0.67   Pro  0.55   Leu  0.59   Trp  0.077  0.013   0.064   0.167
Gly  0.57   Asp  0.54   Val  0.50   Tyr  0.082  0.065   0.114   0.125
Pro  0.57   Glu  0.37   Ile  0.47   Val  0.062  0.048   0.028   0.053

Conformational parameters for α-helical, β-strand, and turn amino acids (from Chou and Fasman, 1978)
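To see how the P(α) and P(β) columns can be used, here is a minimal Python sketch. It is a simplification, not the full Chou-Fasman procedure: it only averages the propensities over a peptide and reports which is higher. The function names and the example peptides are hypothetical; the numbers are copied from the table.

```python
# Conformational parameters copied from the table (Chou and Fasman, 1978),
# keyed by one-letter amino acid code.
P_ALPHA = {
    "E": 1.51, "M": 1.45, "A": 1.42, "L": 1.21, "K": 1.14,
    "F": 1.13, "Q": 1.11, "I": 1.08, "W": 1.08, "V": 1.06,
    "D": 1.01, "H": 1.00, "R": 0.98, "T": 0.83, "S": 0.77,
    "C": 0.70, "Y": 0.69, "N": 0.67, "G": 0.57, "P": 0.57,
}
P_BETA = {
    "V": 1.70, "I": 1.60, "Y": 1.47, "F": 1.38, "W": 1.37,
    "L": 1.30, "C": 1.19, "T": 1.19, "Q": 1.10, "M": 1.05,
    "R": 0.93, "N": 0.89, "H": 0.87, "A": 0.83, "G": 0.75,
    "S": 0.75, "K": 0.74, "P": 0.55, "D": 0.54, "E": 0.37,
}


def mean_propensity(peptide, table):
    """Average the per-residue propensity over the whole peptide."""
    return sum(table[aa] for aa in peptide) / len(peptide)


def favored_state(peptide):
    """Simplified decision (an assumption): the higher mean propensity wins."""
    helix = mean_propensity(peptide, P_ALPHA)
    strand = mean_propensity(peptide, P_BETA)
    return "helix" if helix > strand else "strand"


# Hypothetical peptides chosen only for illustration.
for peptide in ("EMALK", "VIYFW"):
    print(peptide, favored_state(peptide))
```

A peptide rich in Glu, Met, Ala and Leu comes out as helix, while one rich in Val, Ile and Tyr comes out as strand, which matches the ranking in the table.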

PSIPRED

• Uses multiple aligned sequences for prediction.
• Uses training set of folds with known structure.
• Uses a two-stage neural network to predict structure based on position specific scoring matrices generated by PSI-BLAST (Jones, 1999).
• First network converts a window of 15 aa’s into a raw score of h (helix), e (sheet), c (coil) or terminus.
• Second network filters the first output. For example, an output of hhhhehhhh might be converted to hhhhhhhhh.

Can obtain a Q3 value of 70-78% (may be the highest achievable)
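The 15-residue window can be pictured as a flat list of numbers fed to the first network. The Python sketch below is an illustration under assumptions, not PSIPRED's actual code: it assumes each position contributes 20 PSSM scores plus one flag that marks positions past the end of the chain, and it skips any score scaling. The names `window_inputs` and `toy_pssm` are hypothetical.

```python
WINDOW = 15          # residues per window, as stated in the slides
UNITS_PER_POS = 21   # assumed layout: 20 amino acid scores + 1 end flag


def window_inputs(pssm, center):
    """Flatten the window centered on `center` into 15 * 21 input values.

    `pssm` is a list of rows, one per residue, each holding 20 scores.
    """
    half = WINDOW // 2
    inputs = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(pssm):
            inputs.extend(list(pssm[pos]) + [0.0])   # real residue, flag off
        else:
            inputs.extend([0.0] * 20 + [1.0])        # past a terminus, flag on
    return inputs


# Hypothetical 30-residue PSSM filled with placeholder scores.
toy_pssm = [[float(i)] * 20 for i in range(30)]
first = window_inputs(toy_pssm, 0)    # window hangs over the N-terminus
middle = window_inputs(toy_pssm, 15)  # window lies fully inside the chain
print(len(first), sum(first[20::21]), sum(middle[20::21]))
```

For the first residue, seven of the fifteen positions fall before the start of the chain, so seven end flags are switched on.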

Neural networks

• Computer neural networks are based on simulation of adaptive learning in networks of real neurons.
• Neurons connect to each other via synaptic junctions which are either stimulatory or inhibitory.
• Adaptive learning involves the formation or suppression of the right combinations of stimulatory and inhibitory synapses so that a set of inputs produces an appropriate output.

Neural Networks (cont. 1)

• The computer version of the neural network involves identification of a set of inputs (amino acids in the sequence), which transmit through a network of connections.
• At each layer, inputs are numerically weighted and the combined result is passed to the next layer.
• Ultimately a final output, a decision (helix, sheet or coil), is produced.
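The layer-by-layer weighting can be sketched in a few lines of Python. This is a toy example under assumptions: the sigmoid activation, the network size and all weights are hypothetical and are not taken from PSIPRED.

```python
import math


def sigmoid(z):
    """Squash a weighted sum into the range 0..1."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(inputs, layers):
    """Pass `inputs` through each layer; a layer is (weights, biases)."""
    activations = inputs
    for weights, biases in layers:
        activations = [
            sigmoid(sum(w * a for w, a in zip(row, activations)) + b)
            for row, b in zip(weights, biases)
        ]
    return activations


LABELS = ["H", "E", "C"]  # helix, sheet (strand), coil


def decide(inputs, layers):
    """Return the label of the output unit with the highest activation."""
    outputs = forward(inputs, layers)
    return LABELS[outputs.index(max(outputs))]


# Hypothetical weights: 2 inputs -> 2 hidden units -> 3 output units.
toy_layers = [
    ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),
    ([[2.0, -2.0], [-2.0, 2.0], [0.0, 0.0]], [0.0, 0.0, 0.0]),
]
print(decide([1.0, 0.0], toy_layers), decide([0.0, 1.0], toy_layers))
```

Each layer computes a weighted sum per unit and passes the result on; the final decision is simply the strongest of the three outputs.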

Neural Networks (cont. 2)

90% of the set of known structures was used for training. The remaining 10% was used to evaluate the performance of the neural network after the training session.
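A 90/10 split like this one can be sketched as follows. The protein identifiers, the function name and the fixed random seed are hypothetical; the slides do not say how the structures were actually chosen.

```python
import random


def split_structures(structures, test_fraction=0.1, seed=0):
    """Shuffle the known structures and hold out a fraction for evaluation."""
    shuffled = list(structures)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical identifiers standing in for proteins of known structure.
proteins = [f"protein_{i:03d}" for i in range(100)]
train, held_out = split_structures(proteins)
print(len(train), len(held_out))
```

Keeping the held-out proteins out of training is what makes the later accuracy estimate meaningful.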

Neural Networks (cont. 3)

•During the training phase, selected sets of proteins of known structure were scanned, and if the decisions were incorrect, the input weightings were adjusted by the software to produce the desired result.

•Training runs were repeated until the success rate was maximized.

•Careful selection of the training set is an important aspect of this technique. The set must contain as wide a range of different fold types as possible without duplications of structural types that may bias the decisions.
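The idea of adjusting weights only when a decision is wrong can be illustrated with the classic perceptron rule. This is a stand-in technique chosen for brevity: the slides do not say which training rule the software uses, and the toy data below are hypothetical.

```python
def predict(weights, bias, inputs):
    """Make a yes/no decision from a weighted sum of the inputs."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


def train_perceptron(samples, epochs=100, rate=1.0):
    """Repeat training runs, adjusting weights after each wrong decision."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        errors = 0
        for inputs, target in samples:
            output = predict(weights, bias, inputs)
            if output != target:
                errors += 1
                delta = rate * (target - output)
                weights = [w + delta * x for w, x in zip(weights, inputs)]
                bias += delta
        if errors == 0:   # every training example is now decided correctly
            break
    return weights, bias


# Hypothetical toy task: output 1 only when both inputs are on.
toy_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(toy_samples)
print(all(predict(weights, bias, x) == t for x, t in toy_samples))
```

Training stops once a full pass over the examples produces no errors, which mirrors repeating the runs until the success rate stops improving.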

Neural Networks (cont. 4)

•An additional component of the PSIPRED procedures involves sequence alignment with similar proteins.

•The rationale is that some amino acid positions in a sequence contribute more to the final structure than others. (This has been demonstrated by systematic mutation experiments in which each consecutive position in a sequence is substituted by a spectrum of amino acids. Some positions are remarkably tolerant of substitution, while others have unique requirements.)

•To predict secondary structure accurately, one should place less weight on the tolerant positions, which clearly contribute little to the structure.

•One must also put more weight on the intolerant positions.

[Figure: PSIPRED neural network (Jones, 1999). Labels: 15 groups of 21 units (1 unit for each aa plus one specifying the end); row specifies aa position; provides info on tolerant or intolerant positions; three outputs are helix, strand or coil; filtering network.]

Example of Output from PSIPRED

PSIPRED PREDICTION RESULTS

Key

Conf: Confidence (0=low, 9=high)

Pred: Predicted secondary structure (H=helix, E=strand, C=coil)

AA: Target sequence

Conf: 923788850068899998538983213555268822788714786424388875156215

Pred: CCEEEEEEEHHHHHHHHHHCCCCCCHHHHHHCCCCCEEEEECCCCCCHHHHHHHCCCCCC

AA: KDIQLLNVSYDPTRELYEQYNKAFSAHWKQETGDNVVIDQSHGSQGKQATSSVINGIEAD

10 20 30 40 50 60

How to calculate Q3?

Sequence:           MEETHAPYRGVCNNM
Actual Structure:   CCCCCHHHHHHEEEE
PSIPRED Prediction: CCCCCHHHHHHEEEH

Q3 = 14/15 x 100 = 93%
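The same calculation in Python, as a small sketch: Q3 is the percentage of positions where the predicted state matches the actual state. The function name `q3` is an assumption; the two strings are the ones from the slide.

```python
def q3(actual, predicted):
    """Percentage of residues whose predicted state equals the actual state."""
    if len(actual) != len(predicted):
        raise ValueError("structures must have the same length")
    matches = sum(a == p for a, p in zip(actual, predicted))
    return 100.0 * matches / len(actual)


actual = "CCCCCHHHHHHEEEE"
predicted = "CCCCCHHHHHHEEEH"
score = q3(actual, predicted)
print(f"Q3 = {score:.0f}%")  # 14 of 15 positions agree
```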

Recognizing motifs in proteins.

PROSITE is a database of protein families and domains.

Most proteins can be grouped, on the basis of similarities in their sequences, into a limited number of families.

Proteins or protein domains belonging to a particular family generally share functional attributes and are derived from a common ancestor.

PROSITE Database

• Contains 1612 documentation entries.
• Signatures are produced by scanning the PROSITE database with your query.
• A “signature” of a protein allows one to place a protein within a specific function class based on structure and/or function.
• An example of a documentation entry in PROSITE is:

http://ca.expasy.org/cgi-bin/nicedoc.pl?PDOC50020

Signatures are produced from profiles and patterns.

Profile: “a table of position-specific amino acid weights and gap costs. These numbers (also referred to as scores) are used to calculate a similarity score for any alignment between a profile and a sequence, or parts of a profile and a sequence. An alignment with a similarity score higher than or equal to a given cut-off value constitutes a motif occurrence.”

Sequences in one profile and the PSSM associated with the profile

F K L L S H C L L V
F K A F G Q T M F Q
Y P I V G Q E L L G
F P V V K E A I L K
F K V L A A V I A D
L E F I S E C I I Q
F K L L G N V L V C

A  -18  -10   -1   -8    8   -3    3  -10   -2   -8
C  -22  -33  -18  -18  -22  -26   22  -24  -19   -7
D  -35    0  -32  -33   -7    6  -17  -34  -31    0
E  -27   15  -25  -26   -9   23   -9  -24  -23   -1
F   60  -30   12   14  -26  -29  -15    4   12  -29
G  -30  -20  -28  -32   28  -14  -23  -33  -27   -5
H  -13  -12  -25  -25  -16   14  -22  -22  -23  -10
I    3  -27   21   25  -29  -23   -8   33   19  -23
K  -26   25  -25  -27   -6    4  -15  -27  -26    0
L   14  -28   19   27  -27  -20   -9   33   26  -21
M    3  -15   10   14  -17  -10   -9   25   12  -11
N  -22   -6  -24  -27    1    8  -15  -24  -24   -4
P  -30   24  -26  -28  -14  -10  -22  -24  -26  -18
Q  -32    5  -25  -26   -9   24  -16  -17  -23    7
R  -18    9  -22  -22  -10    0  -18  -23  -22   -4
S  -22   -8  -16  -21   11    2   -1  -24  -19   -4
T  -10  -10   -6   -7   -5   -8    2  -10   -7  -11
V    0  -25   22   25  -19  -26    6   19   16  -16
W    9  -25  -18  -19  -25  -27  -34  -20  -17  -28
Y   34  -18   -1    1  -23  -12  -19    0    0  -18
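Scoring a sequence against this PSSM amounts to adding up one number per column. The Python sketch below is a simplified illustration: it handles ungapped matches only and leaves out the gap costs that a full profile also carries. The function names and the flanking "AA" residues in the test sequence are hypothetical; the matrix values are the ones shown above.

```python
PSSM_TEXT = """
A -18 -10  -1  -8   8  -3   3 -10  -2  -8
C -22 -33 -18 -18 -22 -26  22 -24 -19  -7
D -35   0 -32 -33  -7   6 -17 -34 -31   0
E -27  15 -25 -26  -9  23  -9 -24 -23  -1
F  60 -30  12  14 -26 -29 -15   4  12 -29
G -30 -20 -28 -32  28 -14 -23 -33 -27  -5
H -13 -12 -25 -25 -16  14 -22 -22 -23 -10
I   3 -27  21  25 -29 -23  -8  33  19 -23
K -26  25 -25 -27  -6   4 -15 -27 -26   0
L  14 -28  19  27 -27 -20  -9  33  26 -21
M   3 -15  10  14 -17 -10  -9  25  12 -11
N -22  -6 -24 -27   1   8 -15 -24 -24  -4
P -30  24 -26 -28 -14 -10 -22 -24 -26 -18
Q -32   5 -25 -26  -9  24 -16 -17 -23   7
R -18   9 -22 -22 -10   0 -18 -23 -22  -4
S -22  -8 -16 -21  11   2  -1 -24 -19  -4
T -10 -10  -6  -7  -5  -8   2 -10  -7 -11
V   0 -25  22  25 -19 -26   6  19  16 -16
W   9 -25 -18 -19 -25 -27 -34 -20 -17 -28
Y  34 -18  -1   1 -23 -12 -19   0   0 -18
"""

# Parse the table into {residue: [score for column 1, ..., column 10]}.
PSSM = {}
for line in PSSM_TEXT.strip().splitlines():
    residue, *scores = line.split()
    PSSM[residue] = [int(s) for s in scores]


def score_segment(segment):
    """Sum the column-specific score of each residue (no gaps allowed)."""
    return sum(PSSM[aa][col] for col, aa in enumerate(segment))


def best_window(sequence, width=10):
    """Slide the profile along the sequence; return (best score, start)."""
    return max(
        (score_segment(sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


print(score_segment("FKLLSHCLLV"))    # first sequence of the alignment
print(best_window("AAFKLLSHCLLVAA"))  # hypothetical flanking residues added
```

A hit would then be reported when the best score reaches the profile's cut-off value, which is not given in the slides.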

How are the patterns constructed?

ALRDFATHDDVCGK..
SMTAEATHDSVACY..
ECDQAATHEAVTHR..

Sequences necessary for structure or function are aligned manually by experts in the field. Then a pattern is created.

A-T-H-[DE]-X-V-X(4)-{ED}

This pattern is translated as: Ala, Thr, His, [Asp or Glu], any, Val, any, any, any, any, any but Glu or Asp
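A PROSITE-style pattern maps almost directly onto a regular expression. The Python sketch below covers only the elements used in these slides (single residues, X, [..], {..} and repeat counts); it is not a complete implementation of the PROSITE syntax. The function name and the trailing "AS" in the test sequence are hypothetical.

```python
import re

# One pattern element: a residue, X, [allowed] or {forbidden},
# optionally followed by a repeat count such as (4) or (2,4).
ELEMENT = re.compile(
    r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Za-z])(?:\((\d+)(?:,(\d+))?\))?"
)


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core.upper() == "X":
            regex = "."                        # any amino acid
        elif core.startswith("{"):
            regex = "[^" + core[1:-1] + "]"    # any but the listed residues
        else:
            regex = core                       # single residue or [choices]
        if low:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(regex)
    return "".join(parts)


regex = prosite_to_regex("A-T-H-[DE]-X-V-X(4)-{ED}")
hit = re.search(regex, "ALRDFATHDDVCGKAS")  # "AS" tail is hypothetical
print(regex, hit.start(), hit.group())
```

The {ED} element becomes a negated character class, so the final position accepts any residue except Glu or Asp.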

Example of a pattern in a PROSITE record

ID ZINC_FINGER_C3HC4; PATTERN.

PA C-X-H-X-[LIVMFY]-C-X(2)-C-[LIVMYA]

Scanning the PROSITE database

“Scan a sequence against PROSITE patterns and profiles” lets the user scan a query sequence against the PROSITE database to search for matching patterns and profiles. It uses dynamic programming to determine optimal alignments. If the alignment produces a high score (a hit), then the hit is shown to the user.

http://www.expasy.ch/prosite/

If a “hit” is generated, the program gives an output that shows the region of the query that contains the pattern and a reference to the 3-D structure database if available.

Example of output from Prosite Scan

RPSBlast

Reverse PSI-BLAST, or rpsblast, is a program that searches one or more query protein sequences against a database of position specific scoring matrices (PSSMs). The PSSMs are derived from conserved protein sequences that have known functions or structures.

3D structure data

• The largest 3D structure database is the Protein Data Bank (PDB).
• It contains over 20,000 records.
• Each record contains 3D coordinates for macromolecules.
• 80% of the records were obtained from X-ray diffraction studies, 20% from NMR.

ATOM 1 N ARG A 14 22.451 98.825 31.990 1.00 88.84 N

ATOM 2 CA ARG A 14 21.713 100.102 31.828 1.00 90.39 C

ATOM 3 C ARG A 14 22.583 101.018 30.979 1.00 89.86 C

ATOM 4 O ARG A 14 22.105 101.989 30.391 1.00 89.82 O

ATOM 5 CB ARG A 14 21.424 100.704 33.208 1.00 93.23 C

ATOM 6 CG ARG A 14 20.465 101.880 33.215 1.00 95.72 C

ATOM 7 CD ARG A 14 20.008 102.147 34.637 1.00 98.10 C

ATOM 8 NE ARG A 14 18.999 103.196 34.718 1.00100.30 N

ATOM 9 CZ ARG A 14 18.344 103.507 35.833 1.00100.29 C

ATOM 10 NH1 ARG A 14 18.580 102.835 36.952 1.00 99.51 N

ATOM 11 NH2 ARG A 14 17.441 104.479 35.827 1.00100.79 N

Part of a record from the PDB
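ATOM lines in a PDB file use fixed column positions, so they are best read by slicing rather than by splitting on spaces. Atom 8 above shows why: its occupancy and temperature factor run together as 1.00100.30. The Python sketch below writes one record in the standard column layout and reads it back; the helper names and dictionary keys are hypothetical.

```python
def format_atom(serial, name, res, chain, resseq, x, y, z, occ, bfac, elem):
    """Write one ATOM record using the fixed PDB column positions."""
    return (
        f"ATOM  {serial:>5} {' ' + name:<4} {res:>3} {chain}{resseq:>4}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}{occ:6.2f}{bfac:6.2f}          {elem:>2}"
    )


def parse_atom(line):
    """Read the fields back by column position (0-based slices)."""
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res": line[17:20].strip(),
        "chain": line[21],
        "resseq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),   # temperature factor
        "element": line[76:78].strip(),
    }


# Values of atom 8 (NE of ARG A 14) from the record shown above.
line = format_atom(8, "NE", "ARG", "A", 14,
                   18.999, 103.196, 34.718, 1.00, 100.30, "N")
atom = parse_atom(line)
print(line)
print(atom["occupancy"], atom["b_factor"])
```

Splitting this line on whitespace would leave a single token 1.00100.30, whereas the column slices recover both values.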

Quiz #3 prep

BLAST
• Three steps
• Gapped BLAST
• Heuristic program
• Uses S-W algorithm for final scoring

CLUSTAL W
• Pairwise alignments
• Difference matrix
• Guide tree
• Importance of having highly similar sequences

Secondary Structure prediction
• Chou-Fasman
• PSIPRED
• Good for secondary str

Protein analysis
• ProScan
• RPSBlast