Protein structure prediction with a focus on Rosetta


This presentation was prepared by: Xavier Ambroggio, [email protected]


OFFICE OF CYBER INFRASTRUCTURE AND COMPUTATIONAL BIOLOGY

NATIONAL INSTITUTE OF ALLERGY AND INFECTIOUS DISEASES

Fall 2011 Computational Structural Biology Seminar Series


9 – 11 AM, T/Th in 12A/B51 http://training.cit.nih.gov

Week | Day | Date | Course | Instructor | CIT Course #
Week 1 | Tues | Aug. 23 | Fundamentals, Data Sources, and Visualization of Macromolecular Structure | Darrell Hurt | SS260-11001
Week 1 | Thurs | Aug. 25 | Generating Protein Structures from Homology | Darrell Hurt | SS270-11001
Week 2 | Tues | Aug. 30 | Predicting Protein Structures from Amino Acid Sequences | Xavier Ambroggio | SS660-11001
Week 2 | Thurs | Sept. 1 | Predicting Macromolecular Complexes from Uncomplexed Structures | Xavier Ambroggio | SS670-11001
Week 3 | Tues | Sept. 6 | Design and Analysis of Macromolecular Interfaces | Xavier Ambroggio | SS770-11001
Week 3 | Thurs | Sept. 8 | Analysis and Advanced Visualization of Macromolecular Structure | Darrell Hurt | SS330-11001
Week 4 | Tues | Sept. 13 | Computational Drug Design | Mike Dolan | SS340-11001
Week 4 | Thurs | Sept. 15 | Introduction to Molecular Dynamics | Mike Dolan | TBA
Week 5 | Thurs | Sept. 22 | Advanced Molecular Dynamics | Mike Dolan | TBA

Bioinformatics and Computational Biosciences Branch


Scientific Collaboration

Scientific Training

Custom Scientific Software & Infrastructure

•  Structural Biology
•  Phylogenetics
•  Statistics
•  Sequence Analysis
•  Microarray Analysis
•  NGS Analysis
•  Bioinformatics
•  Biological Networks
•  Function Prediction
•  …


Ab Initio Structure Prediction: Given an amino acid sequence, find the tertiary structure

“Protein folding problem”

CASP: Critical Assessment of protein Structure Prediction

http://predictioncenter.org

•  Double-blind experiment (…competition)
•  World-wide scientific community
•  Unbiased assessment of techniques in structure prediction
•  Biennial (every even year)
•  “Pulse” of the prediction community
•  What can be predicted?
•  Which servers/algorithms perform best?


CASP Overview

Blutsbrüder Design

CASP Top Free-Modeling Servers


Why Rosetta focus?

•  Standalone
•  Versatile
  RNA, design, dock, …
•  Open Source
•  Substantial Literature
•  Shared methodology

Use any and all available servers!!!

Das & Baker Annu. Rev. Biochem 2008

prediction

design

Rosetta: multipurpose macromolecular modeling suite

CIT Course # SS660-11001



Brief Description of Select Rosetta Functions

Function | Description
ab initio | predict the structure from sequence
relax | refine the structure using Rosetta energy functions
idealize | replace bond geometries with ideal values
loop modeling | build and refine local structurally variable regions in context of a structural template
design | optimize sequence given a structure with a fixed backbone
docking | structure prediction for a protein-protein complex given subunits
ligand | ligand docking
ddG | prediction of protein-protein interface and protein stability ddG; stability calculations for mutations
scoring | score input conformations with Rosetta energy functions
RNA | predict RNA structures from sequences and design sequences from fixed structures
clustering | group input structures by RMSD to each other for structure prediction analysis
backrub | generate alternate backbone conformations based on sets of rotations
membrane | ab initio predict the structures of helical membrane proteins
enzyme design | redesign a protein around a ligand
domain assembly | fixed domains connected by variable regions
antibody | automated antibody homology modeling
XML parsing | parse XML scripts into protocols

What types of protein domains can Rosetta fold?

Small, globular, soluble protein domains…

Small, simple membrane protein domains…

…but not complex domains or multi-domain proteins.

[Figure panels A, B, C. Labels: T4-lysozyme C-terminal domain; V-type Na+ ATP synthase subunit; rhodopsin]

Slide content adapted from Stephanie Hirst at the 2011 Vanderbilt Rosetta Workshop

What are the success rates?

High resolution predictions are achievable

•  targets ≤100 residues
•  success rate ~30%
•  success rate with accurate secondary structure ~50%
•  a hallmark of accuracy: convergence

Slide content courtesy Rhiju Das, Baker Lab

What types of protein domains can no one fold?
CASP9: domains with no good FM predictions

Slide content adapted from talk given by Lisa Kinch of the Grishin lab at CASP9 meeting: http://predictioncenter.org/casp9/

•  Non-globular
•  Trimeric
•  Fe stabilized
•  High contact order: many residues close in 3D, far in 1D
•  + elongated sheet?

T0591d1, 3MWT
T0550d2, 3NQK
T0629d2, 2XGF
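"High contact order" can be made concrete with a small calculation. The sketch below computes a relative contact order: the mean sequence separation of residue pairs that are in spatial contact, divided by chain length. It is an illustration only. It assumes one point per residue (for example Cα coordinates), an 8 Å contact cutoff, and a minimum sequence separation of 2; none of these choices comes from the slides, and the function name is hypothetical.

```python
import math


def relative_contact_order(coords, cutoff=8.0, min_separation=2):
    """Mean sequence separation of contacting residue pairs, divided by chain length.

    coords: one (x, y, z) point per residue (assumed here: C-alpha positions).
    cutoff and min_separation are illustrative assumptions.
    """
    n = len(coords)
    separations = [
        j - i
        for i in range(n)
        for j in range(i + min_separation, n)
        if math.dist(coords[i], coords[j]) <= cutoff
    ]
    if not separations:
        return 0.0
    return sum(separations) / (len(separations) * n)


# Toy chains: a straight rod has only local contacts; a hairpin adds
# contacts between residues that are far apart in sequence.
straight = [(3.8 * i, 0.0, 0.0) for i in range(10)]
hairpin = [(3.8 * i, 0.0, 0.0) for i in range(5)] + [
    (3.8 * (9 - i), 5.0, 0.0) for i in range(5, 10)
]

print(relative_contact_order(straight))
print(relative_contact_order(hairpin))
```

The hairpin scores higher than the rod because its contacts pair residues that are close in 3D but far in 1D.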

Basic Ab Initio Rosetta protocol

1.  Select fragments consistent with local sequence preferences
2.  Assemble fragments into models with native-like global properties
3.  Identify the best model from the population of decoys

Slide content adapted from Ora Schueler-Furman’s “Workshop in Structural Computational Biology”
Figures adapted from Charlie Strauss; Protein structure prediction using ROSETTA, Rohl et al (2004) Methods in Enzymology, 383:66

Fragment-Based Structure Prediction

Rosetta, Quark, …

[Figure: fragment-based structure prediction. Labels: Fragment (many), Assembly, Decoy (many)]

[Figure: homology modeling. Labels: Template(s), Alignment, Model]

First atomic-resolution model

Target 0281 CASP6

•  Topology sampled by ab initio trajectory of homolog sequence (rmsd=2.2Å)
•  Full atom refinement reduces rmsd to 1.5Å
•  Side chain packing accurately recovered

Slide content adapted from Ora Schueler-Furman’s “Workshop in Structural Computational Biology”
Figures adapted from Bradley P, Malmström L, Qian B, Schonbrun J, Chivian D, Kim DE, Meiler J, Misura KM, Baker D. Free modeling with Rosetta in CASP6. Proteins.

Folding Theory: Sequence-Structure Relationships


•  Secondary structure formation is the earliest part of the folding process

•  Local sequence codes for local structures… i.e. fragments

  helical sequences in a folded protein tend to be helical in isolation

•  Secondary structure prediction algorithms have ~70-80% accuracy

  Partial failure due to tertiary interactions stabilizing secondary structure elements
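The ~70-80% figure is easier to interpret with a concrete measure. The slide does not say which metric it means; the sketch below assumes per-residue three-state accuracy (helix H, strand E, coil C), often called Q3, and uses made-up strings purely for illustration.

```python
def q3_accuracy(predicted, observed):
    """Fraction of residues whose predicted state (H, E or C) matches the observed one."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("strings must be non-empty and of equal length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


# Hypothetical example: a helix end and a strand end are mispredicted.
observed = "CHHHHHHCCEEEEC"
predicted = "CHHHHHCCCEEECC"

print(f"Q3 = {q3_accuracy(predicted, observed):.2f}")
```

Errors of this kind, at the ends of elements, are typical of what a local-sequence predictor misses when tertiary contacts stabilize the element.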

Rosetta fragments

•  3 and 9 residue fragments matched to query sequence
•  database created from crystal structures
  < 2.5Å resolution
  < 50% sequence identity
•  low resolution modeling
  centroid representation of side chains
•  ranked by:
  alignment
  secondary structure predictions
    •  PSI-PRED
    •  SAM-T02
    •  Jufo
    •  PhD


query:    KVFGRCELAAAMKRHGLDNYRGYSLGNWVC...
3-mers:   KVF, VFG, FGR, GRC
9-mers:   KVFGRCELA, VFGRCELAA, FGRCELAAA, GRCELAAAM
sec str:  EEEE TT S EEEEEEE TT HH...

Slide content courtesy David Hoover, CIT, NIH

Sliding fragment windows
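The sliding windows over the query can be reproduced in a few lines. This is a minimal sketch of the windowing step only: it does not pick or rank library fragments, and the function name is hypothetical.

```python
def sliding_fragments(sequence, size):
    """Return every contiguous window of `size` residues, in sequence order."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [sequence[i:i + size] for i in range(len(sequence) - size + 1)]


# Visible part of the query from the slide (the original sequence continues).
query = "KVFGRCELAAAMKRHGLDNYRGYSLGNWVC"
three_mers = sliding_fragments(query, 3)
nine_mers = sliding_fragments(query, 9)

print(three_mers[:4])
print(nine_mers[:4])
```

Each window position then gets its own list of candidate fragments, which is why separate 3-mer and 9-mer libraries are produced.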

Example 3-mer fragment library

#  | Rank | G K L M Q E R A
13 | 1000 | G K L
25 | 821  | G R L
46 | 1000 | K L M
21 | 635  | R L M
43 | 923  | K V M
26 | 523  | R V M
15 | 970  | M Q E
26 | 934  | E R A

Separate 3-mer and 9-mer libraries generated

Slide content courtesy David Hoover, CIT, NIH

Making Fragment Libraries with Robetta

http://robetta.bakerlab.org/


Making Fragment Libraries on Biowulf

Slide content by David Hoover from: http://biowulf.nih.gov/apps/Rosetta23.html#RosettaFragments


Folding Theory: The Folding Landscape

•  Levinthal paradox:
  Given either alpha, beta, or loop conformation, for a protein of nres residues, 3^nres possible conformations.
  If nres = 100, sampling a conformation every 10^-13 seconds = 10^27 years to fold
  Universe is 10^10 years old.
  Folding is non-random and cooperative.

•  Many different combinations of secondary structure elements have similar stabilities
  Tertiary (side-chain level) interactions drive folding towards the native topology
  Phase transition results in a substantial energy gap between native and non-native structures

•  Cyrus Levinthal, J. Chim. Phys. 65, 44; 1968
•  Hue Sun Chan and Ken A. Dill, Protein Folding in the Landscape Perspective: Chevron Plots and Non-Arrhenius Kinetics, Proteins: Structure, Function, and Genetics, Volume 30, No. 1, January 1998, pp 2-33.
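The Levinthal estimate is simple arithmetic and can be checked directly. The sketch below uses the slide's numbers (three states per residue, 100 residues, one conformation sampled every 10^-13 seconds); the function name and the 365.25-day year are assumptions made for the illustration.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def levinthal_years(nres, states_per_residue=3, seconds_per_sample=1e-13):
    """Years needed to visit every conformation once by exhaustive search."""
    conformations = states_per_residue ** nres
    return conformations * seconds_per_sample / SECONDS_PER_YEAR


years = levinthal_years(100)
print(f"{years:.1e} years")
```

The result is on the order of 10^27 years, far longer than the ~10^10 year age of the universe, which is why folding cannot be a random search.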

Implications and requirements for folding algorithm:

•  Fast conformational sampling algorithm

•  Accurate scoring function

•  Full-atom modeling

[Figure: early centroid models (assembly); centroid models (coarse funnel to native-like decoys); final full-atom models (fine-grained funnel to near-native decoys)]

Major Classes of Energy Functions in Rosetta


Low resolution: reduced atom representation (centroid)
  simplified energy function
  used for aggressive search of state space

High resolution: full-atom representation
  detailed energy function
  local search of state space
  refinement and minimization

General
  weighted sum of linear terms: Energy = w1*term1 + w2*term2 + …
  pairwise decomposable (speed)
  weighted for task, e.g. ligand docking
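The weighted-sum form Energy = w1*term1 + w2*term2 + … can be sketched as below. The term names, values, and weights are hypothetical placeholders, not Rosetta's actual score terms or weight sets; the point is only that changing the weight set re-tunes the same terms for a different task.

```python
def total_energy(terms, weights):
    """Weighted linear sum: Energy = w1*term1 + w2*term2 + ..."""
    missing = set(terms) - set(weights)
    if missing:
        raise KeyError(f"no weight for terms: {sorted(missing)}")
    return sum(weights[name] * value for name, value in terms.items())


# Hypothetical per-term scores for one conformation.
terms = {"vdw": 2.0, "hbond": -3.0, "solvation": 1.5}

# Two hypothetical weight sets: same terms, weighted for different tasks.
folding_weights = {"vdw": 1.0, "hbond": 2.0, "solvation": 0.5}
docking_weights = {"vdw": 0.5, "hbond": 1.0, "solvation": 1.0}

print(total_energy(terms, folding_weights))
print(total_energy(terms, docking_weights))
```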

Low resolution (centroid) folding


Fragment insertion
  conformation modification occurs in torsion space
  initial insertions result in large changes in dihedrals
  9-mers inserted first, followed by 3-mers later in the process
  later insertions purposefully result in small changes in dihedrals

[Figure: random insertion]
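A fragment insertion in torsion space amounts to overwriting the backbone dihedrals of one window with those of a library fragment. The sketch below shows only that move, with 9-mers tried before 3-mers as described above. It is a toy under stated assumptions: the chain is a list of (phi, psi) pairs, the library holds just two idealized fragments per window, and scoring and the accept/reject step are left out entirely. All names are hypothetical.

```python
import random


def insert_fragment(torsions, fragment, start):
    """Return a copy of `torsions` with `fragment` written in at position `start`."""
    if start < 0 or start + len(fragment) > len(torsions):
        raise ValueError("fragment does not fit at this position")
    updated = list(torsions)
    updated[start:start + len(fragment)] = fragment
    return updated


def random_insertion(torsions, library, size, rng):
    """Pick a random window, then a random fragment for that window, and insert it."""
    start = rng.randrange(len(torsions) - size + 1)
    fragment = rng.choice(library[(start, size)])
    return insert_fragment(torsions, fragment, start)


# Toy setup: 12-residue extended chain; idealized helix and strand dihedrals.
EXTENDED, HELIX, STRAND = (-150.0, 150.0), (-60.0, -45.0), (-120.0, 130.0)
chain = [EXTENDED] * 12
library = {
    (start, size): [[HELIX] * size, [STRAND] * size]
    for size in (9, 3)
    for start in range(len(chain) - size + 1)
}

rng = random.Random(0)
for size in (9, 3):  # large moves first, smaller moves later
    for _ in range(5):
        chain = random_insertion(chain, library, size, rng)
```

In a real run each trial conformation would be scored and either kept or discarded before the next insertion.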

Driving assembly towards native-like decoys

•  S_ss + S_HS: sheet and helix-sheet geometries
•  S_cβ: density/compactness of structure
•  S_vdw: no clashes
•  S_Rgyr: radius of gyration (Rgyr), globular structure

Slide content adapted from Ora Schueler-Furman’s “Workshop in Structural Computational Biology”
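Of the terms above, the radius of gyration is the easiest to compute by hand. The sketch below uses the unweighted definition (root-mean-square distance of points from their centroid), one point per residue; how Rosetta turns Rgyr into a score is not shown on the slide and is not modeled here.

```python
import math


def radius_of_gyration(coords):
    """Root-mean-square distance of the points from their centroid (unweighted)."""
    n = len(coords)
    if n == 0:
        raise ValueError("need at least one point")
    center = [sum(p[k] for p in coords) / n for k in range(3)]
    mean_sq = sum(
        sum((p[k] - center[k]) ** 2 for k in range(3)) for p in coords
    ) / n
    return math.sqrt(mean_sq)


# Toy comparison: four points packed around the origin vs. strung along a line.
compact = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0)]
extended = [(-3, 0, 0), (-1, 0, 0), (1, 0, 0), (3, 0, 0)]

print(radius_of_gyration(compact))
print(radius_of_gyration(extended))
```

A smaller Rgyr for the same number of residues indicates a more compact, globular arrangement.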

Low-resolution homolog folding improves prediction

•  Collect homologs
•  Create low-resolution models
  cluster
•  Thread query sequence onto models
•  Proceed to full-atom refinement

Slide content adapted from Ora Schueler-Furman’s “Workshop in Structural Computational Biology”

Low resolution (centroid) folding example


Clustering: Graphical representation


High resolution (full-atom) refinement

Chen Y et al. Nucl. Acids Res. 2004;32:5147-5162

evaluating/optimizing specific atom-atom interactions e.g. hydrogen bonding:

Comparison of low resolution, relax, and abrelax folding example


Examples from the Rosetta@home archive of top predictions Note: massively parallel computation

rosetta prediction crystal structure

Detailed ab initio Rosetta Workflow


INPUT
•  amino acid sequence
•  secondary structure prediction(s)
•  fragment library
•  constraints from experimental data
  •  NMR
  •  biochemical/biophysical studies
  •  ...

LOW RESOLUTION FOLDING
•  fragment insertions
•  scoring
•  filters

CLUSTERING
•  groups of decoys with low RMSD to each other
•  lowest energy decoy of clusters selected for further refinement or prediction

HIGH RESOLUTION REFINEMENT
•  backbone minimization
•  rotamer optimization

ADDITIONAL MODELING
•  identifying variable regions
•  rebuilding

>10^3-10^6 trajectories

[Figure labels: automated, manual]
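The clustering step of the workflow (group decoys with low RMSD to each other, then keep the lowest-energy decoy of each cluster) can be sketched as follows. This is a simplified greedy scheme chosen for brevity, not Rosetta's clustering application: the RMSD is computed without superposition, the radius is arbitrary, and the decoy records are hypothetical.

```python
import math


def rmsd(a, b):
    """RMSD between two equal-length coordinate lists (no superposition applied)."""
    if len(a) != len(b) or not a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    return math.sqrt(sum(math.dist(p, q) ** 2 for p, q in zip(a, b)) / len(a))


def cluster_decoys(decoys, radius):
    """Greedy clustering: visit decoys from lowest to highest energy and add each
    to the first cluster whose seed lies within `radius`, else start a new cluster.
    The seed (first member) of every cluster is therefore its lowest-energy decoy."""
    clusters = []
    for decoy in sorted(decoys, key=lambda d: d["energy"]):
        for cluster in clusters:
            if rmsd(decoy["coords"], cluster[0]["coords"]) <= radius:
                cluster.append(decoy)
                break
        else:
            clusters.append([decoy])
    return clusters


# Hypothetical two-atom "decoys" forming two well-separated groups.
decoys = [
    {"name": "a", "energy": -10.0, "coords": [(0, 0, 0.0), (1, 0, 0.0)]},
    {"name": "b", "energy": -12.0, "coords": [(0, 0, 0.5), (1, 0, 0.5)]},
    {"name": "c", "energy": -8.0, "coords": [(5, 5, 5.0), (6, 5, 5.0)]},
    {"name": "d", "energy": -9.0, "coords": [(5, 5, 5.2), (6, 5, 5.2)]},
]

clusters = cluster_decoys(decoys, radius=1.0)
representatives = [cluster[0]["name"] for cluster in clusters]
print(representatives)
```

The representatives are the decoys that would be passed on to high resolution refinement.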


Computational Considerations

Protocol | Utility | Caveats
Centroid | fast; widely sample conformational space | possibility of no near-native models after low resolution folding; no discrimination by energy
Full-atom refinement | near-native decoys separated by energy | more computationally demanding; must have near-native in starting decoy pool
Combined | streamlined; for powerful and massively parallel computing | most computationally demanding; improvement only with sufficient sampling


Native (CheY)

A ~1000-fold increase in computational power

Slide content courtesy Rhiju Das, Baker Lab


Architect of Rosetta@home: David Kim


Lowest energy Rosetta structure

“brute force” approach

Computational power vs. accuracy in ab initio structure prediction


[Plot: Cα RMSD of lowest energy model to the native structure vs. sample size. Axes: Sample Size; RMSD to native]

Category 1: Successful high-resolution predictions

Category 2: Successful high-resolution predictions with additional sampling

Category 3: Unsuccessful predictions (with any amount of sampling)

Kim DE, Blum B, Bradley P, Baker D. Sampling bottlenecks in de novo protein structure prediction. J Mol Biol. 2009 Oct 16;393(1):249-60.


“De novo” phasing: large-scale tests

Tests on 30 data sets (covering 16 proteins)

Slide content courtesy Rhiju Das, Baker Lab; Qian et al., Nature 2007.

TF Z-score | Have I solved it?
< 5 | no
5 - 6 | unlikely
6 - 7 | possibly
7 - 8 | probably
> 8 | definitely

[Figure: 1hz5-sf.cif, scale in Å. Series: Rosetta-refined native (positive controls); Rosetta-refined de novo models]

Success in 14/30 data sets

[Figure: 1hz5-sf.cif, scale in Å. Series: Rosetta-refined de novo models, fragments with correct native 2° structure]


Preparation for folding simulations

•  proper secondary structure assignment
•  constraints
•  limit search space
•  increase sampling efficiency
•  decrease CPU time


Constraints

•  There are constraint types and function types
  Constraint types: AtomPair, Angle, Dihedral, etc.
  Function types: Bounded, Spline, Harmonic, Gaussian, etc.
•  Each constraint is scored individually, and the total constraint score is the sum of all individual scores
•  Each constraint can have its own constraint type and function type.
  In some cases, such as when using the Spline function, each constraint can have its own weight
•  How a constraint is defined and scored depends on its constraint type; the same is true of its function type.
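The rule that the total constraint score is the sum of individually scored constraints can be sketched as below. The two function shapes are simplified stand-ins: a harmonic penalty of the form ((x - x0)/sd)^2 and a flat-bottomed well that is zero inside its bounds. They are written here for illustration and are not Rosetta's implementations of Harmonic, Bounded, or Spline. The residue pairs and measured distances are hypothetical.

```python
def harmonic(x0, sd):
    """Penalty that grows quadratically with the deviation from x0."""
    return lambda x: ((x - x0) / sd) ** 2


def flat_bottom(lower, upper, sd):
    """Zero penalty inside [lower, upper], quadratic outside (simplified)."""
    def score(x):
        if x < lower:
            return ((x - lower) / sd) ** 2
        if x > upper:
            return ((x - upper) / sd) ** 2
        return 0.0
    return score


def total_constraint_score(constraints, measured):
    """Sum of weight * function(measured value), one term per constraint."""
    return sum(weight * func(measured[key]) for key, func, weight in constraints)


# Hypothetical model distances (in Angstrom) for two residue pairs.
measured = {(32, 36): 18.0, (59, 74): 19.5}

# Each constraint carries its own function type and its own weight.
constraints = [
    ((32, 36), harmonic(16.0, 1.0), 1.0),
    ((59, 74), flat_bottom(17.0, 21.0, 1.0), 0.5),
]

total = total_constraint_score(constraints, measured)
print(total)
```

Only the first constraint contributes here, because the second distance falls inside its allowed range.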


Constraint file example: EPR data

<cst type> <atom1> <res1> <atom2> <res2> <cst_func> <RosettaEPR> <Dcb> <weight> <bin>
AtomPair CB 32 CB 36 SPLINE EPR_DISTANCE 16.0 1.0 0.5
AtomPair CB 59 CB 74 SPLINE EPR_DISTANCE 19.0 1.0 0.5
AtomPair CB 62 CB 71 SPLINE EPR_DISTANCE 19.0 1.0 0.5
AtomPair CB 62 CB 74 SPLINE EPR_DISTANCE 25.0 1.0 0.5
AtomPair CB 63 CB 74 SPLINE EPR_DISTANCE 14.0 1.0 0.5
AtomPair CB 66 CB 74 SPLINE EPR_DISTANCE 23.0 1.0 0.5
AtomPair CB 83 CB 90 SPLINE EPR_DISTANCE 13.0 1.0 0.5

Constraint info Constraint Function info
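The two halves of each line ("constraint info" and "constraint function info") can be separated with a small parser. The sketch below handles only the AtomPair layout shown in this example and keeps the function arguments as raw strings; the class and function names are hypothetical, and this is not Rosetta's own parser.

```python
from dataclasses import dataclass


@dataclass
class AtomPairConstraint:
    atom1: str
    res1: int
    atom2: str
    res2: int
    func: str          # function type, e.g. "SPLINE"
    func_args: tuple   # remaining fields, kept as raw strings


def parse_atom_pair(line):
    """Split one AtomPair line into constraint info and function info."""
    fields = line.split()
    if len(fields) < 6 or fields[0] != "AtomPair":
        raise ValueError(f"not an AtomPair constraint line: {line!r}")
    return AtomPairConstraint(
        atom1=fields[1],
        res1=int(fields[2]),
        atom2=fields[3],
        res2=int(fields[4]),
        func=fields[5],
        func_args=tuple(fields[6:]),
    )


cst = parse_atom_pair("AtomPair CB 32 CB 36 SPLINE EPR_DISTANCE 16.0 1.0 0.5")
print(cst.res1, cst.res2, cst.func, cst.func_args)
```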


Membrane protein ab initio

•  RosettaMembrane divides the protein into:
  hydrophobic
  hydrophilic
  soluble layers

•  Specific scoring function for each layer

Slide content adapted from Stephanie Hirst at the 2011 Vanderbilt Rosetta Workshop
Figure from Yarov-Yarovoy, Schonbrun, and Baker 2006.

Input Files

Spanfile - *.span
  -- transmembrane topology prediction file generated using octopus2span.pl script
  -- Input OCTOPUS topology file is generated at http://octopus.cbr.su.se using protein sequence as input.

Lipophilicity prediction file - *.lips4
  -- Generate using run_lips.pl script
  -- Need input FASTA file, spanfile, blastpgp and nr (NCBI) database to run

Fragment generation
  -- Advised to use SAM but not JUFO or PSIPRED, which predict TMH regions poorly


Folding and studying folding with molecular dynamics

Specialized hardware, ANTON capable of continuous ms length trajectories

Standard simulations: 1 - 3 µs simulations ~ months of HPC

Approximate Rates of Folding:
  1 µs: helix
  10 µs: sheet
  100 µs: fast folding protein
  1+ ms: typical protein

D E Shaw et al. Science 2010;330:341-346

simulation of villin at 300 K 2-8 µs folder

simulation of FiP35 at 337 K 20-80 µs folder

Blue: x-ray structures Red: last frame of MD simulation

Folding proteins at x-ray resolution


tip of hairpin 1 (12-18, blue) hairpin 1 (8-22, green) hairpin 2 (19-30, orange) full protein (2-33, red)

D E Shaw et al. Science 2010;330:341-346

Reversible folding simulation of FiP35.

Thank You

For questions or comments please contact:

[email protected]

301.496.4455
