Brian Kidd
November 23, 2010
Computational Biology Tools
Lecture 15:
Protein Structure Prediction/Analysis
*Slides from David Bernick and Carol Rohl
Questions/Concerns from Last Time
Overview
1. Structure alignments
• methods and applications
2. Protein structure prediction
• methods and applications
3. Case study
4. 3D structure visualization
Structure more conserved than sequence
Why Examine Protein Structures?
Similar folds often share similar function
Remote similarities may only be detectable at structure level
Interpret experimental data
• Locate sites of interesting mutations
• Locate splice sites
Design experiments
• In silico mutagenesis
Structure Analysis
Identify interesting sites on a protein
Homologs
Mutants
With and without ligand (or binding partner)
Measure geometry (distances, angles, ...)
Examine surface properties (shape, charge)
Compare two structures
Comparing Protein Structures
Defined alignment
• Mutant vs. wild type, model vs. experimental, i.e. two different conformations
• Unique solution exists – we know the true alignment
Derived alignment
• Query is an unknown protein
• Known parent (assumed homolog)
• Calculate an “optimal” alignment computationally
• Infer annotation from parent to query
What do we want from an alignment?
Optimal alignment
• Important parts of the protein should associate (align) with each other
• Catalytic residues and their positions
• Important structures (hinges, binding sites, etc.)
• Protein interface residues and their positions
Evolutionary history
• Natural selection only selects for successful function
• Sequences (and alignments) are assumed to be sequential
What do we want from an alignment?
Sequence alignments can be improved when we have structural information
No unique solution (more residues or closer match?)
Structural alignment implies a sequence alignment
Tools and Databases
NCBI Structure (VAST and MMDB)
• http://www.ncbi.nlm.nih.gov/Structure/
• Molecular Modeling Database
• Experimentally derived structures from PDB (not theoretical)
FSSP (DALI)
• http://www2.embl-ebi.ac.uk/dali/fssp/
• http://ekhidna.biocenter.helsinki.fi/dali
• Families of structurally similar proteins
• Maintains database of protein neighbors organized by PDB code
• Fully automated using the DALI algorithm (Holm & Sander)
• No internal node annotations
• Structural similarity search using DALI
CE
• http://cl.sdsc.edu/
• Combinatorial extension
• Maintains database of protein neighbors by PDB code
Tools and Databases
Structure classification by domain
Classification based on secondary structure
SCOP – Structural Classification of Proteins
• http://scop.berkeley.edu/
• Class-fold-superfamily-family
• Manual assembly by inspection (last release June 2009)
CATH – Class-Architecture-Topology-Homologous Superfamily
• http://www.biochem.ucl.ac.uk/bsm/cath/
• Manual classification at Architecture level
• Automated topology classification using SSAP (Orengo & Taylor)
• Last release July 2009
CEMC – Multiple Structure Alignment
• http://bioinformatics.albany.edu/~cemc/
How Structure Alignments Work
Methods
• Structal – Gerstein group at Yale
• DALI – Holm group at Helsinki
• VAST – NCBI resource
Structure similarity measures
• RMSD – similarity metric
• p-values – significance measure
Iterative Dynamic Programming
Algorithm
1. Make initial guess for the superposition
2. Calculate all pairwise Cα-Cα distances and generate scoring matrix
3. Find optimal alignment according to this scoring matrix by dynamic programming
4. Re-superimpose structures using this alignment
5. Repeat steps 2–4 until converged
No guarantee of an optimal solution; the final result depends on the initial alignment selected
Structal: Subbiah et al., Curr. Biol. 3:141 (1993)
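The five steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the Structal program itself: the function names (`kabsch`, `dp_align`, `iterative_alignment`), the score S = 20 / (1 + (d / 2.24)²), the linear gap penalty of 10, and the "pair residues in order" initial guess are all assumptions chosen for the example. Inputs are (N, 3) arrays of Cα coordinates.

```python
import numpy as np

def kabsch(P, Q):
    """Rotation R and translation t that best map points P onto Q (least squares)."""
    pc, qc = P.mean(axis=0), Q.mean(axis=0)
    H = (P - pc).T @ (Q - qc)                      # 3x3 covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0   # avoid reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, qc - R @ pc

def dp_align(S, gap=10.0):
    """Global alignment by dynamic programming on score matrix S (linear gaps)."""
    n, m = S.shape
    F = np.zeros((n + 1, m + 1))
    F[:, 0] = -gap * np.arange(n + 1)
    F[0, :] = -gap * np.arange(m + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i, j] = max(F[i - 1, j - 1] + S[i - 1, j - 1],
                          F[i - 1, j] - gap,
                          F[i, j - 1] - gap)
    pairs, i, j = [], n, m
    while i > 0 and j > 0:                         # trace back the best path
        if F[i, j] == F[i - 1, j - 1] + S[i - 1, j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif F[i, j] == F[i - 1, j] - gap:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]

def iterative_alignment(A, B, max_iter=20):
    """Alternate superposition and DP alignment until the pairing stops changing."""
    k = min(len(A), len(B))
    pairs = [(i, i) for i in range(k)]             # step 1: initial guess (assumed)
    for _ in range(max_iter):
        ia, ib = map(list, zip(*pairs))
        R, t = kabsch(A[ia], B[ib])                # step 4 (and first superposition)
        A_fit = A @ R.T + t
        dist = np.linalg.norm(A_fit[:, None, :] - B[None, :, :], axis=2)
        S = 20.0 / (1.0 + (dist / 2.24) ** 2)      # step 2: scoring matrix
        new_pairs = dp_align(S)                    # step 3: dynamic programming
        if new_pairs == pairs:                     # step 5: converged
            break
        pairs = new_pairs
    return pairs, A_fit
```

Because the loop only refines the starting pairing, a poor initial guess can converge to a different (worse) alignment, which is the caveat noted above.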
Structural Alignment
Many methods other than dynamic programming are used
Most methods use heuristics to speed things up and to make a good initial guess
Sheba – sequence alignment
Mammoth – local structural alignment
VAST – aligns secondary structure element vectors
DALI – distance matrix alignment
Distance Matrix Alignment
Matrix of all pairwise distances
Characteristic patterns:
• Runs along the main diagonal correspond to helices (i.e. local contacts)
• Hairpins – start on the main diagonal, run perpendicular to it
• Parallel pairs run parallel to the main diagonal
• Others are long-range contacts
Converts 3D alignment problem into a 2D problem
Find best subset of rows and columns such that the distance matrices of two proteins are optimally similar
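A small sketch of the distance-matrix idea, assuming Cα coordinates in an (N, 3) NumPy array. The helper names, the 8 Å contact cutoff, and the ideal-helix geometry (radius 2.3 Å, rise 1.5 Å, 100° per residue) are illustrative assumptions, and `submatrix_difference` is only a crude stand-in for the similarity score that DALI actually optimizes.

```python
import numpy as np

def distance_matrix(coords):
    """All pairwise Euclidean distances between rows of an (N, 3) array."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def contact_map(coords, cutoff=8.0):
    """Boolean matrix: True where two residues are closer than `cutoff` (Å)."""
    return distance_matrix(coords) < cutoff

def ideal_helix(n, radius=2.3, rise=1.5, turn_deg=100.0):
    """Approximate Cα trace of an ideal α-helix (illustrative geometry)."""
    k = np.arange(n)
    ang = np.deg2rad(turn_deg) * k
    return np.column_stack([radius * np.cos(ang), radius * np.sin(ang), rise * k])

def submatrix_difference(DA, DB, idx_a, idx_b):
    """Mean absolute difference between the sub-matrices selected by two
    equal-length lists of residue indices (smaller means more similar)."""
    sub_a = DA[np.ix_(idx_a, idx_a)]
    sub_b = DB[np.ix_(idx_b, idx_b)]
    return float(np.abs(sub_a - sub_b).mean())

helix = ideal_helix(12)
cm = contact_map(helix)   # contacts form a band hugging the main diagonal
```

The distance matrix does not change when a structure is rotated or translated, so two proteins can be compared without superimposing them first; the search is over which rows and columns to pair.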
[Figure: Myoglobin]
Contact Map Comparison
[Figure: contact maps of Myoglobin and Protein G, annotated with parallel strands, α-helix, and β-hairpin]
Similarity Measure
RMSD – root mean square deviation
$\mathrm{RMSD} = \sqrt{\left\langle \lVert X_i^A - X_i^B \rVert^2 \right\rangle}$
1. Superimpose optimally
2. Pair up residues
3. Calculate RMSD
Sensitive to outliers
Depends on number of pairs compared
A better measure is the significance of this RMSD for similar-sized matches
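A short worked example of step 3 and of the two caveats, assuming steps 1 and 2 are already done (the coordinates are superimposed and residues are paired row by row). The function name `rmsd` and the toy coordinates are assumptions made for illustration.

```python
import numpy as np

def rmsd(xa, xb):
    """RMSD between two (N, 3) arrays of superimposed, paired atoms."""
    sq_dist = ((xa - xb) ** 2).sum(axis=1)   # squared distance per pair
    return float(np.sqrt(sq_dist.mean()))

# 100 perfectly matched pairs, then a single residue displaced by 10 Å.
xa = np.zeros((100, 3))
xb = xa.copy()
xb[0] = [10.0, 0.0, 0.0]
print(rmsd(xa, xb))            # sqrt(10**2 / 100) = 1.0

# The same outlier among only 25 pairs doubles the RMSD.
print(rmsd(xa[:25], xb[:25]))  # sqrt(10**2 / 25) = 2.0
```

One badly placed residue moves the value from 0 Å to 1 Å, and the same deviation weighs more heavily when fewer pairs are compared, which is why a raw RMSD is hard to interpret without knowing the number of aligned residues.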
z-scores and p-values
z-score: number of standard deviations above/below the mean
± 1 sd ~ 68%
± 2 sd ~ 95%
If we have a histogram, we can just count, or integrate a function fitted to the histogram
p-value: probability of obtaining a score ≥ this one under the null model (normally distributed data, i.e. “by chance”)
Histogram of scores for random matches
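Both routes to a p-value can be sketched with the Python standard library: counting the background histogram directly, or assuming a normal null model and integrating its tail. The function names and the toy background scores are assumptions for illustration; in practice the background would be scores from many random (unrelated) structure matches.

```python
import math
import statistics

def z_score(score, background):
    """Number of standard deviations `score` lies above the background mean."""
    mu = statistics.mean(background)
    sd = statistics.stdev(background)
    return (score - mu) / sd

def p_value_normal(z):
    """Upper-tail probability P(Z >= z) for a standard normal null model."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def p_value_empirical(score, background):
    """Fraction of background scores >= score (counting the histogram)."""
    return sum(s >= score for s in background) / len(background)

background = [1, 2, 3, 4, 5]             # toy scores for random matches
z = z_score(4, background)
print(z, p_value_normal(z), p_value_empirical(4, background))

# Fraction of a normal distribution within ±1 and ±2 standard deviations.
print(1 - 2 * p_value_normal(1.0), 1 - 2 * p_value_normal(2.0))
```

The empirical count makes no assumption about the shape of the distribution but needs a large background sample to resolve small p-values; the normal approximation extrapolates into the tail at the cost of assuming the null scores really are normally distributed.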
Meaning of Structural Alignments
[Figure: structural alignment output for the two proteins]
Two proteins are clearly structurally similar
Mammoth identifies similar substructures, but the alignment is not entirely “correct”
Opportunistically matched residues
Misses some analogous elements
1ubq 4fxc
Why Predict Protein Structures?
Test models and theories about structural biochemistry
Why Predict Protein Structures?
Identify drug targets for medicine
Experimentally derived structures are still slow and not all structures are easily solved
Explore states that are difficult to examine experimentally
Challenges for Structure Prediction
Search space is astronomical – need an efficient sampling algorithm
Actual proteins tend to be in energy minima – need a scoring system for discriminating between models
CASP
Critical Assessment of Structure Prediction
Community effort to improve predictions
Forced scientists to start learning what actually works in prediction
Types of Predictions
Comparative Model
Ab initio or de novo
Stage I. Fragment Assembly
Baker Method
*Slides from Rhiju Das
Stage II. All-atom refinement
Baker Method
Example
Native vs. model: 2.0 Å over 61 residues
CASP7 target T0316 (domain 3)
Case Study
PAZ domain of Pf_Ago
Ji-Joon Song [PMID: 15284453] asserts on p. 1435 that the following residues are functionally equivalent:
Pf_Ago (1u04)    hAgo1 (1si2/1si3)
Y212             Y309 (Y90)
Y216             Y314 (Y95)
H217             H269 (H49)
Y190             Y277 (Y57)
What do you think?
Case Study Continued
1u04 – PAZ domain, chain A 152-275
1si2 – PAZ domain, chain A 4-128
http://www.pdb.org
Get the above structures
http://www.ebi.ac.uk/DaliLite/
Align 1U04 with 1SI2:A (h_Ago)
What is the z-score? Is it significant?
What is the RMSD? Is this a reasonable alignment?
How many residues aligned?
PyMOL In Class
Load both molecules (aligned) into PyMOL
Action-preset-pretty for both molecules
For 1u04, delete everything you don’t need
select-rename object 1u04
chain B; select-remove atoms
chain A and resi 1-151; select-remove atoms
chain A and resi 276-770; select-remove atoms
color red
Load 1si2, color it yellow
chain B is a small RNA; show spheres, chain B; color blue
select 1u04 and resi 212; show as sticks
repeat for 190, 216, 217
select 1si2 and resi 309; show as sticks
repeat for 269, 277, 314
In Class Summary
So, who’s correct?
Is J.J. Song correct?
Is Dali?
Is Vast?
Essentials at this Point
Accessing literature and sequence information from various databases (NCBI and UCSC)
BLAST (all variants)
Pairwise sequence analysis tools and algorithms
Single sequence analysis tools (DNA): EMBOSS, ORFs, restriction enzymes, and primers
Protein databases and analysis tools
PSI and PHI BLASTs
Multiple sequence alignments
Phylogeny
RNA structure (basics and analytical tools)
Protein structure (basics and analytical tools)
This is everything!
For Next Time
Reading
Problem set
Review
Finish up PS #3 (due Tuesday, November 23)
Start working on PS #4 (due Friday, December 3)
http://www.soe.ucsc.edu/classes/bme110/Fall10/calendar.html