Date post: | 22-Jan-2018 |
Category: |
Science |
Upload: | bcbbslides |
PROTEIN STRUCTURE PREDICTION
WITH A FOCUS ON ROSETTA
This presentation was prepared by: Xavier Ambroggio, [email protected]
OFFICE OF CYBER INFRASTRUCTURE AND COMPUTATIONAL BIOLOGY
NATIONAL INSTITUTE OF ALLERGY AND INFECTIOUS DISEASES
Fall 2011 Computational Structural Biology Seminar Series
9 – 11 AM, T/Th in 12A/B51
http://training.cit.nih.gov
Week | Day | Date | Course | Instructor | CIT Course #
Week 1 | Tues | Aug. 23 | Fundamentals, Data Sources, and Visualization of Macromolecular Structure | Darrell Hurt | SS260-11001
Week 1 | Thurs | Aug. 25 | Generating Protein Structures from Homology | Darrell Hurt | SS270-11001
Week 2 | Tues | Aug. 30 | Predicting Protein Structures from Amino Acid Sequences | Xavier Ambroggio | SS660-11001
Week 2 | Thurs | Sept. 1 | Predicting Macromolecular Complexes from Uncomplexed Structures | Xavier Ambroggio | SS670-11001
Week 3 | Tues | Sept. 6 | Design and Analysis of Macromolecular Interfaces | Xavier Ambroggio | SS770-11001
Week 3 | Thurs | Sept. 8 | Analysis and Advanced Visualization of Macromolecular Structure | Darrell Hurt | SS330-11001
Week 4 | Tues | Sept. 13 | Computational Drug Design | Mike Dolan | SS340-11001
Week 4 | Thurs | Sept. 15 | Introduction to Molecular Dynamics | Mike Dolan | TBA
Week 5 | Thurs | Sept. 22 | Advanced Molecular Dynamics | Mike Dolan | TBA
Bioinformatics and Computational Biosciences Branch
Scientific Collaboration
Scientific Training
Custom Scientific Software & Infrastructure
• Structural Biology
• Phylogenetics
• Statistics
• Sequence Analysis
• Microarray Analysis
• NGS Analysis
• Bioinformatics
• Biological Networks
• Function Prediction
• …
Ab Initio Structure Prediction: Given an amino acid sequence, find the tertiary structure
“Protein folding problem”
CASP: Critical Assessment of protein Structure Prediction
http://predictioncenter.org
• Double-blind experiment (…competition)
• World-wide scientific community
• Unbiased assessment of techniques in structure prediction
• Biennial (every even year)
• “Pulse” of the prediction community
• What can be predicted?
• Which servers/algorithms perform best?
CASP Overview
Blutsbrüder Design
CASP Top Free-Modeling Servers
Why focus on Rosetta?
• Standalone
• Versatile: prediction, design, docking, RNA, …
• Open source
• Substantial literature
• Shared methodology
Use any and all available servers!
Figure adapted from Das & Baker, Annu. Rev. Biochem. 2008
Rosetta: multipurpose macromolecular modeling suite
CIT Course # SS660-11001
CIT Course # SS670-11001
CIT Course # SS770-11001
Brief Description of Select Rosetta Functions
Function | Description
ab initio | predict the structure from sequence
relax | refine the structure using Rosetta energy functions
idealize | replace bond geometries with ideal values
loop modeling | build and refine local, structurally variable regions in the context of a structural template
design | optimize sequence given a structure with a fixed backbone
docking | structure prediction for a protein-protein complex given subunits
ligand | ligand docking
ddG prediction | protein-protein interface ddG and protein stability ddG calculations for mutations
scoring | score input conformations with Rosetta energy functions
RNA | predict RNA structures from sequences and design sequences from fixed structures
clustering | group input structures by RMSD to each other for structure prediction analysis
backrub | generate alternate backbone conformations based on sets of rotations
membrane ab initio | predict the structures of helical membrane proteins
enzyme design | redesign a protein around a ligand
domain assembly | fixed domains connected by variable regions
antibody | automated antibody homology modeling
XML parsing | parse XML scripts into protocols
What types of protein domains can Rosetta fold?
Small, globular, soluble protein domains…
Small, simple membrane protein domains…
…but not complex domains or multi-domain proteins.
Figure panels A, B, C: T4-lysozyme C-terminal domain; V-type Na+ ATP synthase subunit; rhodopsin
Slide content adapted from Stephanie Hirst at the 2011 Vanderbilt Rosetta Workshop
What are the success rates?
High resolution predictions are achievable
• targets ≤100 residues
• success rate ~30%
• success rate with accurate secondary structure ~50%
• a hallmark of accuracy: convergence
Slide content courtesy Rhiju Das, Baker Lab
What types of protein domains can no one fold?
CASP9: domains with no good FM predictions
• Non-globular
• Trimeric
• Fe stabilized
• High contact order: many residues close in 3D, far in 1D
• + elongated sheet?
Targets shown: T0591d1 (3MWT), T0550d2 (3NQK), T0629d2 (2XGF)
Slide content adapted from talk given by Lisa Kinch of the Grishin lab at CASP9 meeting: http://predictioncenter.org/casp9/
Basic Ab Initio Rosetta protocol
1. Select fragments consistent with local sequence preferences
2. Assemble fragments into models with native-like global properties
3. Identify the best model from the population of decoys
[Figure: fragments are assembled into a population of decoys]
Slide content adapted from Ora Schueler-Furman’s “Workshop in Structural Computational Biology”
Figures adapted from Charlie Strauss; Protein structure prediction using ROSETTA, Rohl et al (2004) Methods in Enzymology, 383:66
Fragment-Based Structure Prediction
Rosetta, Quark, …
Homology modeling:
[Figure labels: Template(s), Alignment, Model]
First atomic-resolution model
Target 0281, CASP6
• Topology sampled by ab initio trajectory of homolog sequence (rmsd = 2.2 Å)
• Full-atom refinement reduces rmsd to 1.5 Å
• Side chain packing accurately recovered
Slide content adapted from Ora Schueler-Furman’s “Workshop in Structural Computational Biology” Figures adapted from Bradley P, Malmström L, Qian B, Schonbrun J, Chivian D, Kim DE, Meiler J, Misura KM, Baker D. Free modeling with Rosetta in CASP6. Proteins.
Folding Theory: Sequence-Structure Relationships
• Secondary structure formation is the earliest part of the folding process
• Local sequence codes for local structures… i.e. fragments
helical sequences in a folded protein tend to be helical in isolation
• Secondary structure prediction algorithms have ~70-80% accuracy
Partial failure due to tertiary interactions stabilizing secondary structure elements
Rosetta fragments
• 3- and 9-residue fragments matched to the query sequence
• database created from crystal structures with < 2.5 Å resolution and < 50% sequence identity
• low-resolution modeling: centroid representation of side chains
• ranked by: alignment, secondary structure predictions
  • PSI-PRED
  • SAM-T02
  • Jufo
  • PhD
Sliding fragment windows
query:   KVFGRCELAAAMKRHGLDNYRGYSLGNWVC...
3-mers:  KVF, VFG, FGR, GRC
9-mers:  KVFGRCELA, VFGRCELAA, FGRCELAAA, GRCELAAAM
sec str: EEEE TT S EEEEEEE TT HH...
Slide content courtesy David Hoover, CIT, NIH
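The sliding-window idea on this slide can be sketched in a few lines of Python. This only shows how overlapping 3-residue and 9-residue query windows are enumerated; the function name `sliding_windows` is made up for illustration, and real fragment picking also matches each window against a structure database.

```python
def sliding_windows(sequence, size):
    """Return every overlapping window of `size` residues, left to right."""
    return [sequence[i:i + size] for i in range(len(sequence) - size + 1)]


# Query sequence as shown on the slide (it is truncated there).
query = "KVFGRCELAAAMKRHGLDNYRGYSLGNWVC"

three_mers = sliding_windows(query, 3)
nine_mers = sliding_windows(query, 9)

print(three_mers[:4])
print(nine_mers[:4])
```

Each window shares all but one residue with its neighbour, so every query position is covered by several candidate fragments.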
Example 3-mer fragment library
 #   Rank   G K L M Q E R A
13   1000   G K L
25    821   G R L
46   1000     K L M
21    635     R L M
43    923     K V M
26    523     R V M
15    970         M Q E
26    934           E R A
Separate 3-mer and 9-mer libraries generated
Slide content courtesy David Hoover, CIT, NIH
Making Fragment Libraries with Robetta
http://robetta.bakerlab.org/
Slide content adapted from Stephanie Hirst at the 2011 Vanderbilt Rosetta Workshop
Making Fragment Libraries on Biowulf
Slide content by David Hoover from: http://biowulf.nih.gov/apps/Rosetta23.html#RosettaFragments
Folding Theory: The Folding Landscape
• Levinthal paradox:
Given either alpha, beta, or loop conformation for each residue, a protein of nres residues has 3^nres possible conformations.
If nres = 100, sampling one conformation every 10^-13 seconds would take ~10^27 years to fold.
The universe is ~10^10 years old.
Folding is non-random and cooperative.
• Many different combinations of secondary structure elements have similar stabilities
Tertiary (side-chain level) interactions drive folding towards the native topology
Phase transition results in a substantial energy gap between native and non-native structures
• Cyrus Levinthal, J. Chim. Phys. 65, 44; 1968
• Hue Sun Chan and Ken A. Dill, Protein Folding in the Landscape Perspective: Chevron Plots and Non-Arrhenius Kinetics, Proteins: Structure, Function, and Genetics, Volume 30, No. 1, January 1998, pp 2-33.
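The Levinthal estimate can be checked with a few lines of arithmetic. The sketch below assumes three states per residue and one conformation sampled every 10^-13 seconds, as on the slide; the function name and the 365.25-day year are choices made here for illustration.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def levinthal_years(nres, states_per_residue=3, seconds_per_conformation=1e-13):
    """Years needed to enumerate every conformation exhaustively."""
    conformations = states_per_residue ** nres
    return conformations * seconds_per_conformation / SECONDS_PER_YEAR


years = levinthal_years(100)
print(f"3^100 conformations take about 10^{math.floor(math.log10(years))} years")
```

Comparing the result with the ~10^10-year age of the universe gives the gap that motivates non-random, cooperative folding.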
Implications and requirements for folding algorithm:
• Fast conformational sampling algorithm
• Accurate scoring function
• Full-atom modeling
[Figure stages: assembly, early centroid models, centroid models, final full-atom models]
[Figure labels: coarse funnel to native-like decoys; fine-grained funnel to near-native decoys]
Major Classes of Energy Functions in Rosetta
Low resolution:
• reduced atom representation (centroid)
• simplified energy function
• used for aggressive search of state space
High resolution:
• full-atom representation
• detailed energy function
• local search of state space
• refinement and minimization
General:
• weighted sum of linear terms: Energy = w1*term1 + w2*term2 + …
• pairwise decomposable (speed)
• weighted for task, e.g. ligand docking
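The weighted-sum form Energy = w1*term1 + w2*term2 + … can be illustrated with a small sketch. The term names, values, and weight sets below are hypothetical placeholders, not actual Rosetta score terms or weights; the point is only that the same terms can be re-weighted for different tasks.

```python
def total_energy(terms, weights):
    """Weighted linear sum: Energy = w1*term1 + w2*term2 + ..."""
    return sum(weights[name] * value for name, value in terms.items())


# Hypothetical raw term values for a single decoy (illustrative only).
terms = {"vdw": 2.0, "rg": 10.0, "hbond": -4.0}

# Hypothetical weight sets: one emphasising compactness, one hydrogen bonding.
low_resolution_weights = {"vdw": 1.0, "rg": 0.5, "hbond": 0.0}
high_resolution_weights = {"vdw": 1.0, "rg": 0.0, "hbond": 2.0}

print(total_energy(terms, low_resolution_weights))
print(total_energy(terms, high_resolution_weights))
```

Switching from a low-resolution to a high-resolution stage, or to a task such as ligand docking, then amounts to swapping the weight table rather than rewriting the scoring code.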
Low resolution (centroid) folding
Fragment insertion
• conformation modification occurs in torsion space
• random insertion
• initial insertions result in large changes in dihedrals
• 9-mers inserted first, followed by 3-mers later in the process
• later insertions purposefully result in small changes in dihedrals

Driving assembly towards native-like decoys
• Sss + SHS: sheet and helix-sheet geometries
• Scβ: density/compactness of structure
• Svdw: no clashes
• SRgyr: radius of gyration (Rgyr), globular structure
Slide content adapted from Ora Schueler-Furman’s “Workshop in Structural Computational Biology”
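Random fragment insertion in torsion space can be sketched as a Metropolis Monte Carlo loop. This is a toy illustration, not Rosetta's implementation: the score is a stand-in rather than a centroid score term, the tiny fragment library is invented, and only 3-residue fragments are used (the 9-mers-first schedule is not reproduced).

```python
import math
import random


def insert_fragment(torsions, fragment, start):
    """Return a copy of `torsions` with `fragment` spliced in at `start`."""
    new = list(torsions)
    new[start:start + len(fragment)] = fragment
    return new


def toy_score(torsions):
    """Placeholder score (NOT a Rosetta term): distance from (-60, -45)."""
    return sum(abs(phi + 60.0) + abs(psi + 45.0) for phi, psi in torsions)


def fold(torsions, library, score, steps, kT=1.0, rng=None):
    """Metropolis Monte Carlo over random fragment insertions."""
    rng = rng or random.Random()
    current, current_score = list(torsions), score(torsions)
    best, best_score = current, current_score
    for _ in range(steps):
        start = rng.choice(sorted(library))          # random insertion window
        trial = insert_fragment(current, rng.choice(library[start]), start)
        delta = score(trial) - current_score
        # Accept downhill moves always, uphill moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / kT):
            current, current_score = trial, current_score + delta
            if current_score < best_score:
                best, best_score = current, current_score
    return best, best_score


# Hypothetical 6-residue chain of (phi, psi) pairs starting fully extended.
extended = [(-150.0, 150.0)] * 6
# Hypothetical library: start position -> candidate 3-residue fragments.
library = {
    0: [[(-60.0, -45.0)] * 3, [(-120.0, 130.0)] * 3],
    3: [[(-60.0, -45.0)] * 3, [(-70.0, -40.0)] * 3],
}

best, best_score = fold(extended, library, toy_score, steps=200,
                        rng=random.Random(0))
print(best_score)
```

Because each move replaces a whole window of dihedrals, early insertions change the conformation a lot; shorter fragments later in a run give the smaller changes described above.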
Low-resolution homolog folding improves prediction
• Collect homologs
• Create low-resolution models
• Cluster
• Thread query sequence onto models
• Proceed to full-atom refinement
Slide content adapted from Ora Schueler-Furman’s “Workshop in Structural Computational Biology”
Low resolution (centroid) folding example
Clustering: Graphical representation
High resolution (full-atom) refinement
Chen Y et al. Nucl. Acids Res. 2004;32:5147-5162
evaluating/optimizing specific atom-atom interactions e.g. hydrogen bonding:
Comparison of low resolution, relax, and abrelax folding example
Examples from the Rosetta@home archive of top predictions
Note: massively parallel computation
[Figure labels: Rosetta prediction, crystal structure]
Detailed ab initio Rosetta Workflow
INPUT
• amino acid sequence
• secondary structure prediction(s)
• fragment library
• constraints from experimental data
  • NMR
  • biochemical/biophysical studies
  • ...
LOW RESOLUTION FOLDING
• fragment insertions
• scoring
• filters
CLUSTERING
• groups of decoys with low RMSD to each other
• lowest energy decoy of clusters selected for further refinement or prediction
HIGH RESOLUTION REFINEMENT
• backbone minimization
• rotamer optimization
ADDITIONAL MODELING
• identifying variable regions
• rebuilding
>10^3-10^6 trajectories
[Figure labels: automated, manual]
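The clustering step of this workflow (group decoys with low RMSD to each other, then take the lowest energy decoy of each cluster) can be sketched as follows. This is a simple greedy scheme written for illustration, not necessarily the algorithm used by Rosetta's clustering application. It assumes the decoys are already superimposed, and the coordinates and energies are invented.

```python
import math


def rmsd(a, b):
    """RMSD between two equal-length coordinate lists (no superposition)."""
    total = sum((x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
                for (x1, y1, z1), (x2, y2, z2) in zip(a, b))
    return math.sqrt(total / len(a))


def cluster(decoys, radius):
    """Greedy clustering: repeatedly take the decoy with the most neighbours."""
    remaining = list(range(len(decoys)))
    clusters = []
    while remaining:
        neighbours = {
            i: [j for j in remaining if rmsd(decoys[i], decoys[j]) <= radius]
            for i in remaining
        }
        center = max(remaining, key=lambda i: len(neighbours[i]))
        clusters.append(neighbours[center])
        remaining = [j for j in remaining if j not in neighbours[center]]
    return clusters


def lowest_energy_member(members, energies):
    """Index of the lowest energy decoy within one cluster."""
    return min(members, key=lambda i: energies[i])


# Hypothetical two-atom "decoys" and energies, for illustration only.
decoys = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 0.2, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.3)],
    [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)],
]
energies = [-10.0, -12.0, -9.0, -20.0]

clusters = cluster(decoys, radius=0.5)
representatives = [lowest_energy_member(c, energies) for c in clusters]
print(clusters, representatives)
```

In this toy data the largest cluster and the single lowest energy decoy are different structures, which is one reason to look at cluster size and convergence rather than energy alone.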
Computational Considerations
Protocol | Utility | Caveats
Centroid | fast; widely sample conformational space | possibility of no near-native models after low resolution folding; no discrimination by energy
Full-atom refinement | near-native decoys separated by energy | more computationally demanding; must have near-native in starting decoy pool
Combined | streamlined; for powerful and massively parallel computing | most computationally demanding; improvement only with sufficient sampling
Native (CheY)
A ~1000-fold increase in computational power
Slide content courtesy Rhiju Das, Baker Lab
Architect of Rosetta@home: David Kim
Lowest energy Rosetta structure
“brute force” approach
Computational power vs. accuracy in ab initio structure prediction
[Plot: Cα RMSD of lowest energy model to the native structure vs. sample size; x-axis: sample size, y-axis: RMSD to native]
Category 1: Successful high-resolution predictions
Category 2: Successful high-resolution predictions with additional sampling
Category 3: Unsuccessful predictions (with any amount of sampling)
Kim DE, Blum B, Bradley P, Baker D. Sampling bottlenecks in de novo protein structure prediction. J Mol Biol. 2009 Oct 16;393(1):249-60.
“De novo” phasing: large-scale tests
Tests on 30 data sets (covering 16 proteins)
TF Z-score | Have I solved it?
< 5 | no
5 - 6 | unlikely
6 - 7 | possibly
7 - 8 | probably
> 8 | definitely
Slide content courtesy Rhiju Das, Baker Lab; Qian et al., Nature 2007.
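The TF Z-score guide above is a simple lookup, sketched here as a function. The function name is made up, and the treatment of values that fall exactly on a boundary (5, 6, 7, 8) is an assumption, since the table does not say which bin they belong to.

```python
def interpret_tfz(tfz):
    """Map a translation-function Z-score to the verdict in the table above.

    Boundary handling is an assumption: 5, 6 and 7 go to the higher bin,
    and exactly 8 is still reported as "probably".
    """
    if tfz < 5:
        return "no"
    if tfz < 6:
        return "unlikely"
    if tfz < 7:
        return "possibly"
    if tfz <= 8:
        return "probably"
    return "definitely"


for value in (4.2, 5.5, 6.5, 7.5, 9.1):
    print(value, interpret_tfz(value))
```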
Success in 14/30 data sets
[Figure: example data set 1hz5-sf.cif, axis in Å. Legend: Rosetta-refined native (positive controls); Rosetta-refined de novo models; Rosetta-refined de novo models, fragments with correct native 2° structure]
Preparation for folding simulations
• proper secondary structure assignment
• constraints
  • limit search space
  • increase sampling efficiency
  • decrease CPU time
Constraints
• There are constraint types and function types
  Constraint types: AtomPair, Angle, Dihedral, etc.
  Function types: Bounded, Spline, Harmonic, Gaussian, etc.
• Each constraint is scored individually, and the total constraint score is the sum of all individual scores
• Each constraint can have its own constraint type and function type. In some cases, such as with the Spline function, each constraint can also have its own weight
• How a constraint is defined and scored depends on its constraint type; the same is true of the function type.
Slide content adapted from Stephanie Hirst at the 2011 Vanderbilt Rosetta Workshop
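The "sum of individually scored constraints" idea can be sketched for AtomPair constraints as below. This is an illustration, not Rosetta code. The harmonic form ((x - x0)/sd)^2 is the usual harmonic penalty, while the bounded function here is a simplified flat-bottom version (quadratic outside the bounds) and should not be read as Rosetta's exact definition. Coordinates, atom keys, and parameters are invented.

```python
import math


def harmonic(x, x0, sd):
    """Harmonic penalty around the target value x0."""
    return ((x - x0) / sd) ** 2


def bounded(x, lower, upper, sd):
    """Simplified flat-bottom penalty: zero inside [lower, upper]."""
    if x < lower:
        return ((x - lower) / sd) ** 2
    if x > upper:
        return ((x - upper) / sd) ** 2
    return 0.0


def total_constraint_score(coords, constraints):
    """Sum of weight * function(distance) over all AtomPair constraints."""
    return sum(weight * func(math.dist(coords[a], coords[b]))
               for a, b, func, weight in constraints)


# Hypothetical CB coordinates keyed by (atom name, residue number).
coords = {
    ("CB", 32): (0.0, 0.0, 0.0),
    ("CB", 36): (0.0, 0.0, 6.0),
    ("CB", 59): (3.0, 4.0, 0.0),
}

# Each constraint carries its own function type and weight.
constraints = [
    (("CB", 32), ("CB", 36), lambda d: harmonic(d, 5.0, 1.0), 1.0),
    (("CB", 32), ("CB", 59), lambda d: bounded(d, 4.0, 6.0, 0.5), 2.0),
]

print(total_constraint_score(coords, constraints))
```

Keeping the function and weight inside each constraint entry mirrors the point above that every constraint can have its own function type and, in some cases, its own weight.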
Constraint file example: EPR data
<cst type> <atom1> <res1> <atom2> <res2> <cst_func> <RosettaEPR> <Dcb> <weight> <bin>
AtomPair CB 32 CB 36 SPLINE EPR_DISTANCE 16.0 1.0 0.5
AtomPair CB 59 CB 74 SPLINE EPR_DISTANCE 19.0 1.0 0.5
AtomPair CB 62 CB 71 SPLINE EPR_DISTANCE 19.0 1.0 0.5
AtomPair CB 62 CB 74 SPLINE EPR_DISTANCE 25.0 1.0 0.5
AtomPair CB 63 CB 74 SPLINE EPR_DISTANCE 14.0 1.0 0.5
AtomPair CB 66 CB 74 SPLINE EPR_DISTANCE 23.0 1.0 0.5
AtomPair CB 83 CB 90 SPLINE EPR_DISTANCE 13.0 1.0 0.5
Fields 1-5: constraint info; fields 6-10: constraint function info
Slide content adapted from Stephanie Hirst at the 2011 Vanderbilt Rosetta Workshop
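A constraint file in the ten-column layout shown on this slide can be read with a short parser. This is a sketch for this specific EPR example only: the class name `EprConstraint` and its field names are chosen here to follow the slide's header, and skipping blank lines and `#` comment lines is an assumption.

```python
from typing import NamedTuple


class EprConstraint(NamedTuple):
    cst_type: str
    atom1: str
    res1: int
    atom2: str
    res2: int
    cst_func: str
    potential: str   # the <RosettaEPR> column, e.g. EPR_DISTANCE
    dcb: float       # the <Dcb> column
    weight: float
    bin_size: float


def parse_constraint_line(line):
    """Parse one whitespace-separated, ten-field constraint line."""
    f = line.split()
    if len(f) != 10:
        raise ValueError(f"expected 10 fields, got {len(f)}: {line!r}")
    return EprConstraint(f[0], f[1], int(f[2]), f[3], int(f[4]),
                         f[5], f[6], float(f[7]), float(f[8]), float(f[9]))


def parse_constraints(text):
    """Parse every non-blank, non-comment line of a constraint file."""
    return [parse_constraint_line(line) for line in text.splitlines()
            if line.strip() and not line.lstrip().startswith("#")]


example = """\
AtomPair CB 32 CB 36 SPLINE EPR_DISTANCE 16.0 1.0 0.5
AtomPair CB 59 CB 74 SPLINE EPR_DISTANCE 19.0 1.0 0.5
"""

for cst in parse_constraints(example):
    print(cst.res1, cst.res2, cst.dcb)
```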
Membrane protein ab initio
• RosettaMembrane divides the protein into hydrophobic, hydrophilic, and soluble layers
• Specific scoring function for each layer
Slide content adapted from Stephanie Hirst at the 2011 Vanderbilt Rosetta Workshop
Figure from Yarov-Yarovoy, Schonbrun, and Baker 2006.
Input Files
Spanfile - *.span
-- Transmembrane topology prediction file generated using the octopus2span.pl script
-- Input OCTOPUS topology file is generated at http://octopus.cbr.su.se using the protein sequence as input.
Lipophilicity prediction file - *.lips4
-- Generate using the run_lips.pl script
-- Needs the input FASTA file, spanfile, blastpgp, and the nr (NCBI) database to run
Fragment generation
-- Advised to use SAM but not JUFO or PSIPRED, which predict TMH regions poorly
Slide content adapted from Stephanie Hirst at the 2011 Vanderbilt Rosetta Workshop
Folding and studying folding with molecular dynamics
Specialized hardware (Anton) is capable of continuous millisecond-length trajectories
Standard simulations: 1 - 3 µs simulations ~ months of HPC
Approximate rates of folding:
• 1 µs: helix
• 10 µs: sheet
• 100 µs: fast-folding protein
• 1+ ms: typical protein
D E Shaw et al. Science 2010;330:341-346
Folding proteins at x-ray resolution
• simulation of villin at 300 K (2-8 µs folder)
• simulation of FiP35 at 337 K (20-80 µs folder)
Blue: x-ray structures; red: last frame of MD simulation
Reversible folding simulation of FiP35
[Figure legend: tip of hairpin 1 (12-18, blue); hairpin 1 (8-22, green); hairpin 2 (19-30, orange); full protein (2-33, red)]
D E Shaw et al. Science 2010;330:341-346