SCIENCE & TECHNOLOGY
DIVINING PROTEIN ARCHITECTURE
Predicting structure from sequence has advanced impressively in the past few years
STU BORMAN, C&EN WASHINGTON
EVER SINCE THE 1960S, WHEN experiments on the refolding of ribonuclease by the late biochemist Christian B. Anfinsen demonstrated that a protein's
amino acid sequence determines its three-dimensional structure, researchers have been working toward using sequence information as a basis for predicting protein structure.
Applications of protein structure prediction include learning more about protein folding, designing functional proteins from scratch, and divining the structure and function of proteins from living organisms.
Five or six years ago, the protein structure prediction problem was considered to be pretty much unsolved. But since then considerable progress has been made: Scientists can now construct fairly accurate models that show where a protein's chain is supposed to go, and many of the topological features of these models closely resemble those in actual structures.
Not that all the problems have now been solved. "In terms of actually explaining protein folding from first principles, I think we are very far away," says Krzysztof Fidelis, senior scientist at Lawrence Livermore National Laboratory's Protein Structure Prediction Center. "I would like to see that bridge built, but it's not happening very easily."
Attaining more accurate predictions is one of the key challenges. "It's going to take not just two years' time, but maybe 20 years' time to solve that problem," says Michael Levitt, professor and chairman of computational structural biology at Stanford University School of Medicine. "By 'solved,' I mean correctly predicting to 1-Å resolution the structure of a large protein. We're very far from that."
Nevertheless, continual progress in the field is being made. "If you look back over multiple years, you
do see an improvement over the whole range of modeling," says John Moult, professor and fellow of the Center for Advanced Research in Biotechnology at the University of Maryland Biotechnology Institute, Rockville, Md. "The field is moving forward—not always rapidly but steadily."
Protein structure prediction techniques are of three main types. An amino acid sequence that turns up in a genome-sequencing project or becomes known some other way may adopt a type of structure that's already been seen in nature. If you find a similar sequence in a protein database, you can then build a model based on the structure of the known protein. That's comparative modeling or homology modeling.
SUPERMODELS Baker and coworkers used the program Rosetta to create ab initio models (right sides of each row) that turned out to closely resemble crystal structures (left sides) of the DNA repair protein MutS and the bacterial protein HI0817.
Even if a sequence does not match any in protein databases, it may still be possible to find a suitable structure using a second technique, fold recognition. For perhaps half or three-quarters of unknown (structurally uncharacterized) proteins, there will be a suitable structure in the database one can use as a basis or template for extrapolating a 3-D model, even in the absence of a sequence match.
"The idea of fold recognition is to turn the folding problem on its head and say 'Rather than finding a structure that's suitable for the sequence, let's see if we can find whether our sequence fits on any existing structures,' " explains Rob B. Russell, group leader of structural bioinformatics at the European Molecular Biology Laboratory, Heidelberg, Germany. "You thread the sequence onto existing structures and see if those structures bury the sequence's hydrophobic residues well, among a number of other structural considerations."
If there's no known sequence or known structure to which you can match your sequence, a third option is to build a 3-D structure of a protein from scratch, called de novo or ab initio modeling. This is the only viable option for proteins with new folds—structures that have not been observed before in nature.
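Of the three approaches just described, fold recognition is perhaps the easiest to caricature in a few lines of code. The sketch below is only a toy version of the burial test Russell mentions: the set of hydrophobic residues, the +1/-1 scoring, the gapless placement, and the two templates are all assumptions made for illustration, and real threading programs weigh many more structural considerations.

```python
# Toy fold-recognition ("threading") score. Everything here is illustrative:
# the residue classes, the +1/-1 scoring, and the templates are assumptions.

HYDROPHOBIC = set("AVLIMFWC")  # assumed hydrophobic one-letter codes


def burial_score(sequence, buried):
    """Score a gapless placement of `sequence` onto a template.

    `buried` holds one boolean per template position (True = buried in the core).
    A hydrophobic residue in a buried position, or a polar residue in an
    exposed position, earns +1; any mismatch costs -1.
    """
    if len(sequence) != len(buried):
        raise ValueError("sequence and template must have the same length")
    score = 0
    for residue, is_buried in zip(sequence, buried):
        is_hydrophobic = residue in HYDROPHOBIC
        score += 1 if is_hydrophobic == is_buried else -1
    return score


def best_template(sequence, templates):
    """Return (name, score) for the template that buries the sequence best."""
    scored = [(name, burial_score(sequence, buried))
              for name, buried in templates.items()]
    return max(scored, key=lambda item: item[1])


# Two hypothetical six-residue templates with opposite burial patterns.
templates = {
    "fold_A": [True, True, False, False, True, False],
    "fold_B": [False, False, True, True, False, True],
}
print(best_template("LVKDIE", templates))  # → ('fold_A', 6)
```

Here the hydrophobic residues L, V, and I all land in buried positions of the hypothetical fold_A, so it outscores fold_B, which would expose them.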
SOMETHING THAT has had a huge impact on the development of these techniques in the past few years is a program called CASP, Critical Assessment of Structure Prediction Methods, founded by Moult. Held every two years, it's essentially a competition to see which structure prediction techniques are most accurate at modeling unknown proteins.
The predictions are evaluated and compared after the unknown structures have been determined crystallographically. Results are discussed in an online CASP forum called FORCASP, and papers arising from each CASP are published in a special issue of Proteins: Structure, Function & Genetics. Such an issue will be published later this year for the current program, CASP 5.
26 C&EN / AUGUST 4, 2003 HTTP://WWW.CEN-ONLINE.ORG
CASP FOUNDER Moult observes that the big news of the CASP 5 competition was in the area of fold recognition, where metaservers have greatly improved accuracy.
Comparative modeling results at CASP 5 were summarized recently by biochemistry professor Anna Tramontano of the University of Rome "La Sapienza" [Nat. Struct. Biol., 10, 87 (2003)]. "Overall, the average quality of comparative modeling predictions ... improved, with the vast majority of methods producing good models ... for targets sharing greater than 25% sequence identity with known structures," she wrote. "It is undoubtedly true that biologists can now confidently use comparative modeling for structure prediction, [although] it is still difficult to predict the structure of regions of the target that are substantially different (farther than 2.5 Å) from the template." Predictions by two groups in Poland and one in the U.S. were most accurate at CASP 5, Tramontano noted.
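For readers unfamiliar with the 25% figure, sequence identity is simply the share of aligned positions at which two sequences carry the same residue. The sketch below computes it for an alignment that has already been made. The choice of denominator (columns in which neither sequence has a gap) is an assumption, since published definitions differ, and the example sequences are invented.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of aligned, non-gap columns in which the residues are identical.

    Both arguments are rows of an existing pairwise alignment, so they must
    have the same length. Columns with a gap in either row are ignored
    (an assumed convention; other denominators are also in use).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != gap and b != gap]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


# Invented four-residue example: three of four columns match.
print(percent_identity("ACDE", "ACDF"))  # → 75.0
```

By this measure, a target whose best database match scores above about 25 would fall in the range where Tramontano reports that most methods produce good models.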
However, others in the field are more critical. "In comparative modeling, where your protein of interest has a sequence that is related to that of a structure that's already known, we've really been stuck for some time, in my judgment, and we still seem to be pretty much stuck in that area," Moult says.
Fidelis agrees that "the lack of progress in the area of comparative modeling is quite disappointing. We haven't seen any significant progress since CASP 2," in 1996.
"It's always been clear that homology modeling is a random displacement away from the starting model built on a homolog, and nobody can consistently move toward the right answer," adds professor of biological chemistry David Shortle of Johns Hopkins University School of Medicine. "You've demonstrated that a target protein is evolutionarily related to a protein of known structure, so you know roughly how it's folded. Can you then move that structure away from the known homolog toward the right answer in some sort of consistent way? And people cannot. Sometimes they get closer, sometimes they get further away."
GO WITH THE FLOW Protein structure prediction techniques are of three main types: comparative modeling, fold recognition, and de novo prediction. [Flowchart: protein sequence; search databases of known structures; homologous sequence of known structure found?; three-dimensional protein structure.]
However, computational structural biologist Roland L. Dunbrack Jr. of Fox Chase Cancer Center, Philadelphia, who specializes in comparative modeling, views the technique's prospects more hopefully: "Shortle is right that we don't move backbones closer to the target structure from the template structure even most of the time. However, sequence alignments have improved tremendously in the last five years, and comparative models have improved accordingly. Also, while backbones have not improved, side-chain modeling for the target sequence onto the template backbone is reasonably accurate at higher sequence identities."
Comparative models at lower sequence identities have also been improving, he says. "We now frequently make models in the 10 to 30% sequence identity range. This is necessary, since most proteins of unknown structure are only very distantly related to proteins of known structure."
But the task isn't easy or fast, he says: "At low identity—10 to 20%—it's a lot of work to get the alignment right."
Dunbrack and coworkers developed SCWRL, a side-chain conformation prediction program used for comparative modeling. But he notes that Modeller, created by professor of computational biology Andrej Sali and coworkers at the University of California, San Francisco, "is the most commonly used comparative modeling software and has had a large impact on the field."
Sali agrees with Dunbrack that comparative modeling has improved considerably in recent years. "I am not saying that the problems are solved—they remain and require additional work—but I simply do not agree with the suggestion that the field
is stuck," he says.
"THE BIG NEWS at CASP 5," Moult says, "was in the fold-recognition category, where you're trying to predict the structure of a protein that's not obviously related at a sequence level to known structures but does have a fold that's been seen before. What we saw there was a quite large increase in the quality of the models in terms of accuracy. And the reason for that seems to have to do with the introduction of metaservers."
Various research groups have developed computer servers that accept incoming sequence data and generate models in an automated manner. "One can also set up a metaserver," Moult explains—"a server that sends out sequences to multiple other servers, gets a number of models from them, and then uses them to make consensus models. The result is a significant improvement of results in the fold-recognition category."
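The consensus step Moult describes can be pictured as a vote. The sketch below assumes, purely for illustration, that each server reports a single template identifier and that the metaserver keeps the one proposed most often. The server and template names are hypothetical, and actual metaservers compare and combine full 3-D models instead of labels.

```python
from collections import Counter


def consensus_template(predictions):
    """Pick the template that the most servers agree on.

    `predictions` maps a server name to the template identifier it proposed.
    Returns (template, number_of_votes). Ties are broken alphabetically so
    the result is deterministic.
    """
    if not predictions:
        raise ValueError("no predictions supplied")
    votes = Counter(predictions.values())
    template, count = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))[0]
    return template, count


# Hypothetical answers collected from four independent servers.
predictions = {
    "server_1": "template_X",
    "server_2": "template_Y",
    "server_3": "template_X",
    "server_4": "template_Z",
}
print(consensus_template(predictions))  # → ('template_X', 2)
```

The vote is only as good as the servers feeding it, which is the dependence on other groups' work that critics of the approach point to.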
Automated servers and metaservers can currently be used for both comparative modeling and
fold recognition, and some of the better comparative modeling at CASP 5 was carried out by metaservers. But metaserver performance was more impressive in fold recognition. In fact, metaserver fold-recognition scores at CASP 5 were better than those of almost all human participants.
Computer science senior lecturer Daniel Fischer of Ben-Gurion University of the
DE NOVO ASSESSOR Russell was impressed with this year's CASP results.
Negev, Israel, pioneered the metaserver concept and also runs CAFASP (Critical Assessment of Fully Automated Structure Prediction), a parallel program to CASP solely for automated servers and metaservers. CAFASP uses the same targets as those in CASP, and all CAFASP predictions become part of the CASP evaluation.
Other important metaserver groups include those of Leszek Rychlewski, head of the Bioinformatics Laboratory at BioInfoBank Institute, Poznan, Poland, and biochemistry and molecular biophysics professor Burkhard Rost at Columbia University. Rychlewski runs the LiveBench Project, and Rost runs EVA—services that help researchers evaluate server and metaserver performance on structure prediction problems.
Although automated metaservers did surprisingly well in fold recognition in CASP 5, some scientists are troubled by the metaserver concept. "You send sequences to servers, and they make predictions," Levitt says. "Then metaservers collect results from other servers and make consensus predictions. Then you have metaservers that go to the consensus metaservers and collect new consensus results. And whoever gets the last result seems to do better. The last person into the game wins."
The trouble is that "you can't win if the others don't do their part," Levitt says. "So it's a strange business. This is kind of a meaningless technique. It works fine for
CASP because all these different machines run the CASP sequences. But if you came along with a whole genome of 50,000 sequences that you would like to predict this way, you couldn't. Because to be the top guy who's getting the best predictions, you have to rely on everyone else to do the work for you."
However, Fischer notes that there are two types of metaservers: "selectors," which gather information from other servers and just select an answer, and "added-value metaservers,"
which not only make selections but also enhance the input to generate better predictions. "A number of groups, including ours, are now working on developing fast, independent 'metapredictors' that run all components internally and do not depend on others," he tells C&EN. "Thus, some of the criticism attributed to the first generation of metaservers may not be justified and will certainly fade away in the future, when fast, powerful, independent metapredictors will challenge the best human predictors."
Ab initio or de novo techniques are used for proteins that don't share either sequence or structural similarity with known proteins and thus have new folds. "To predict the structure of a protein with a new fold, you might imagine in the limit solving Schrödinger's equation—getting a purely quantum mechanical solution of the problem," explains associate professor of biochemistry and Howard Hughes Medical Institute assistant investigator David A. Baker of the University of Washington, Seattle. "But you can't do that because you can't get exact solutions for molecules with tens of atoms, let alone thousands of atoms as in proteins. So you have to make approximations."
To approximate the structure of new folds, ab initio programs assign energies
to different polypeptide chain conformations and then use optimization routines to find the lowest energy (most stable) conformations.
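As a rough picture of what assigning energies and searching for the minimum means, the sketch below describes each residue by one of three states (helix, strand, or coil), scores every possible assignment with a made-up energy function, and keeps the lowest. The states, the energy terms, and their values are assumptions chosen only to make the example run. Real programs work with 3-D coordinates and far richer energy functions, and they rely on stochastic or gradient-based optimizers because exhaustive enumeration is possible only for toy cases like this one.

```python
from itertools import product

STATES = "HEC"  # assumed per-residue states: helix, strand, coil

# Hypothetical energy terms in arbitrary units; lower means more stable.
PREFERENCE = {
    ("A", "H"): -1.0, ("L", "H"): -1.0,
    ("V", "E"): -1.0, ("I", "E"): -1.0,
    ("G", "C"): -1.0, ("P", "C"): -1.0,
}
SAME_NEIGHBOR_BONUS = -0.5  # reward adjacent residues that share a state


def energy(sequence, conformation):
    """Sum the per-residue preferences and the neighbor-agreement bonuses."""
    total = sum(PREFERENCE.get(pair, 0.0)
                for pair in zip(sequence, conformation))
    total += sum(SAME_NEIGHBOR_BONUS
                 for left, right in zip(conformation, conformation[1:])
                 if left == right)
    return total


def lowest_energy_conformation(sequence):
    """Brute-force search over all 3**n assignments (short chains only)."""
    candidates = ("".join(states)
                  for states in product(STATES, repeat=len(sequence)))
    return min(candidates, key=lambda conf: energy(sequence, conf))


best = lowest_energy_conformation("ALAGVIV")
print(best, energy("ALAGVIV", best))  # → HHHCEEE -9.0
```

Even this seven-residue toy has 2,187 candidate conformations, and the count triples with every added residue, which is why the choice of optimization routine matters so much in practice.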
In new-fold prediction, Moult says, "we started with very poor results in CASP 1 but saw steady improvement through CASPs 2, 3, and 4. We actually hiccuped a bit between CASP 4 and CASP 5—you can't see much progress there. But I'm not very put off by that. There's a lot of good stuff going on in the new-fold area, and I think we'll see things pick up again next time."
Russell, who helped assess CASP 5 de novo predictions, says: "I was very impressed with the results. It's clear that a number of groups are able to do things that I never would have dreamt possible 10 years ago."
TRADITIONALLY, "having a lot of related sequences is the sort of thing that helps you a lot if you're trying to do a de novo prediction," Russell says. "It gives you some information about where the location of helices and strands might be, and so on. For some proteins in CASP 5, there was very little help of this sort. And there was at least one case where the protein didn't actually have any sequence homologs at all. It was a complete orphan, all by itself in the whole world. Nevertheless, a few groups—and Baker's certainly stands out among them—were able to get quite accurate structures. I thought this was really phenomenal."
Shortle says that in new-fold techniques, "there's been significant progress, and 80% of it derives from the results of the Baker lab. When they get it right, they get it dramatically right. I think the secret of their success has several components, but the major one is their selection of fragments from the Protein Data Bank to assemble their models with. We were told we came in second in CASP 5 in new-fold prediction, but it was a distant second. So there isn't a really dramatic story outside of the Baker lab's success. I think that will change in the future. There's a lot on the horizon, but the Baker lab has led the way."
BUILDING BLOCKS Baker uses a computer program called Rosetta that combines little bits and pieces from known proteins.
To perform their predictions, Baker and coworkers use a program called Rosetta. Essentially it makes new proteins by assembling little bits from known proteins.
"If you look at any nine-amino-acid chunk of a protein, during the folding process it doesn't immediately go to one conformation," Baker says. "It flickers between a number of different possible local conformations. Folding occurs when everything happens to be in the right place at the right time—when the different pieces are oriented so they make low-energy interactions throughout the chain."
To model this flickering between local conformations, "you have to know what distribution of conformations any given portion of this chain is going to adopt," Baker says. Rosetta gets that information from protein databases. The program then searches for combinations of local conformations that, when spliced together, produce very low energy protein tertiary structures.
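The general idea of splicing database-derived local conformations into a chain and keeping low-energy combinations can be sketched as follows. This is not Rosetta's algorithm or code. The fragment library, the three-residue window (Baker speaks of nine-residue chunks), the toy energy function, and the accept-if-not-worse rule are all assumptions made so that the example stays short and runnable.

```python
import random

# Hypothetical fragment library: for each 3-residue sequence window, a few
# local conformations (H = helix, E = strand, C = coil) of the kind one
# might collect from known structures.
FRAGMENTS = {
    "ALA": ["HHH", "CCC"],
    "LAG": ["HHC", "CCC"],
    "AGV": ["HCE", "CCE"],
    "GVI": ["CEE", "CCC"],
    "VIV": ["EEE", "CEC"],
}

# Hypothetical energy terms in arbitrary units; lower means more stable.
PREFERENCE = {("A", "H"): -1.0, ("L", "H"): -1.0, ("V", "E"): -1.0,
              ("I", "E"): -1.0, ("G", "C"): -1.0}


def energy(sequence, conformation):
    """Per-residue preferences plus -0.5 for each like-state neighbor pair."""
    total = sum(PREFERENCE.get(pair, 0.0)
                for pair in zip(sequence, conformation))
    total -= 0.5 * sum(1 for a, b in zip(conformation, conformation[1:])
                       if a == b)
    return total


def assemble(sequence, steps=200, seed=0):
    """Insert random library fragments, keeping moves that do not raise energy."""
    if len(sequence) < 3:
        raise ValueError("sequence must contain at least three residues")
    rng = random.Random(seed)
    conformation = "C" * len(sequence)  # start from an all-coil chain
    current = energy(sequence, conformation)
    for _ in range(steps):
        start = rng.randrange(len(sequence) - 2)
        options = FRAGMENTS.get(sequence[start:start + 3])
        if not options:
            continue  # no library entry for this window
        fragment = rng.choice(options)
        trial = conformation[:start] + fragment + conformation[start + 3:]
        trial_energy = energy(sequence, trial)
        if trial_energy <= current:
            conformation, current = trial, trial_energy
    return conformation, current


print(assemble("ALAGVIV"))
```

Because only non-worsening moves are accepted, the energy can never rise above that of the all-coil start. A search this greedy can stall in a local minimum, which is one reason practical methods use more forgiving acceptance rules and many independent runs.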
The method is thus partly empirical, in that it makes use of databases of known structures. Other ab initio prediction programs are more strictly theoretical, using molecular mechanics or molecular dynamics and making little to no use of data from known structures.
Some contend that programs like Rosetta are less pure, in a sense, than the more exclusively theoretical ones. "This has been an area of discussion, to put it politely," Moult says.
"Years ago we didn't have a lot of other structures to use as a basis for building models," Moult says, "so early ab initio methods were more or less based on physics. However, we saw early in the CASP program that those methods didn't work very well. Meanwhile, people like Baker became very clever at using the information from known structures in various ways. This has been more successful. But it has upset some people with the older methods, who feel that in some sense this is not really science, but rather information science that's not based on physics."
CLOSE MATCH Dunbrack and coworkers used their SCWRL program to generate a comparative model (yellow and orange) of key residues from a complex of BACE protease with its substrate, amyloid precursor protein (APP, red). The model closely matches a crystal structure (dark and light blue) of a complex of BACE with an APP-like inhibitor (green). The comparative model shows a likely physiologically important salt bridge between a BACE residue and one residue on APP—an interaction not present in the crystal structure, which instead has a single hydrogen bond in a different location. Note: D = aspartic acid, R = arginine, Y = tyrosine, I = isoleucine, L = leucine.
These researchers "certainly have a point," he says. "The problem is that the traditional physics-based methods are still not delivering. I think it's important that we still give them space and that they still get funded, because if not, we're never going to move forward in that area. But CASP puts an awful lot of emphasis on results, and results right now are better from methods that somehow use a knowledge base."
Baker concedes that using information from known protein structures in constructing sets of possible local conformations is not the same as starting from scratch. However, "there is no method that starts from scratch," he says. "You can't do
a truly first-principles calculation, so you have to get parameters from somewhere. It's a bit of semantics."
Because Schrödinger's equation can't be solved exactly for proteins, Baker says, "you have to be able to combine information from quite different areas. I think that's really the way of the future."
Levitt and coworkers, on the other hand, are among a number of groups that continue to develop a more purely theoretical approach. "It should be possible to predict protein structure from quantum mechanics or molecular mechanics force fields—that is, from the basic physics and chemistry of a situation," Levitt says.
An ab initio approach that he and his postdoc Chen Keasar recently tried was to write an expression for the free energy of a protein and then minimize that directly [J. Mol. Biol., 329, 159 (2003)]. "We got some very interesting results," Levitt says, but the program didn't do very well in CASP 5.
Nevertheless, "ab initio approaches need to be emphasized further," Levitt says. "You could argue that just for the purity of chemistry and physics we need to be able to predict protein structures" that way.
One promising ab initio effort is the Folding@Home project run by assistant professor of chemistry and of structural biology Vijay S. Pande of Stanford University. Folding@Home uses free time on the computers of thousands of volunteers to carry out computationally intensive protein calculations.
Pande and coworkers primarily study the mechanism of protein folding, but in a study last
year they applied the computer power of the Folding@Home system to protein structure prediction. They discovered that the average unfolded structure of a protein mirrors the structure of its folded state [J. Mol. Biol., 323, 153 (2002)]. By using a molecular dynamics routine to calculate average distances between pairs of residues in an unfolded protein, they were able to derive its folded structure.
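The averaging step at the heart of that result is simple to state in code. The sketch below takes a set of conformations, each a list of one (x, y, z) coordinate per residue, and returns the mean distance for every residue pair. The input format and the tiny two-member ensemble are invented for illustration. The molecular dynamics sampling that produces the conformations, and the later step of turning mean distances into a 3-D model, are not shown.

```python
import math


def mean_pair_distances(ensemble):
    """Average the distance between every residue pair over all conformations.

    `ensemble` is a list of conformations; each conformation is a list of
    (x, y, z) coordinates, one per residue. Returns {(i, j): mean distance}
    for every pair with i < j.
    """
    if not ensemble:
        raise ValueError("ensemble is empty")
    n = len(ensemble[0])
    if any(len(conf) != n for conf in ensemble):
        raise ValueError("all conformations must have the same number of residues")
    means = {}
    for i in range(n):
        for j in range(i + 1, n):
            total = sum(math.dist(conf[i], conf[j]) for conf in ensemble)
            means[(i, j)] = total / len(ensemble)
    return means


# Two invented three-residue conformations (coordinates in arbitrary units).
ensemble = [
    [(0, 0, 0), (3, 0, 0), (3, 4, 0)],
    [(0, 0, 0), (5, 0, 0), (5, 12, 0)],
]
print(mean_pair_distances(ensemble))
# → {(0, 1): 4.0, (0, 2): 9.0, (1, 2): 8.0}
```

The expensive part of the real calculation is generating enough unfolded conformations for the averages to be meaningful, which is where the volunteers' computers come in.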
"The trouble is that Vijay, with all the computers in the world and three months of computer time, can fold a protein with 36 amino acids sometimes," Levitt says, "and Baker can get much better results in two minutes."
The question is whether empirical methods like Baker's will ultimately be able to fold most biological proteins, or whether more purely ab initio methods like Levitt's will in the end be needed as well. "My own preferences are for the kind of methods where there's deep understanding," says Eaton E. Lattman, professor and chairman of the department of biophysics at Johns Hopkins University and editor-in-chief of Proteins: Structure, Function & Genetics. "So I think people have to respect Baker's achievements enormously, but whether the empirical methods are going to run out of gas and only get you so far isn't clear."
The CASP program is generally believed to have helped the field. But some say CASP cycles are too fast, that the number of targets is too small for results to be valid statistically, and that head-to-head competition isn't a good way to do science.
"One of the problems may be that we're reaching the limits of what we can do," Levitt says. "Normal science is not done at all like CASP. It's done by people having ideas, thinking about them, writing careful papers, and so on." From the end of one CASP to the start of another, "you basically have a year to develop a new method, and that isn't enough time. At CASP, all that matters is how you do, and that may be a long-term negative thing."
"Because the number of targets dealt with is small and the number of groups that are uniformly successful is even smaller, it's unclear whether CASP is really assessing progress and whether we're really learning anything about protein structure prediction there," adds molecular biology professor Charles L. Brooks III of Scripps Research Institute.
However, "I've attended all five CASPs, and I don't perceive any serious problems," Shortle says. "I think the people in charge do a very reasonable job in the assessments. Every year they've gotten better—more
rigorous and well defined. But every year there's a certain amount of grumbling. The rules of the game are pretty clear to anyone who has been around awhile, and I think the whining is inappropriate."
AN IMPORTANT POTENTIAL application area for protein structure prediction is structural genomics, a large-scale effort to determine the structures of proteins across entire genomes. One might think that by solving all protein structures experimentally, structural genomics will eclipse structure prediction and eventually even make it obsolete. But researchers say that's not the plan and that structural genomics is indeed depending on protein structure prediction to help achieve its aims.
An example is the University at Buffalo Center of Excellence in Bioinformatics, where director Jeffrey Skolnick and coworkers are developing comparative modeling and ab initio prediction techniques to advance the center's structural genomics goals.
"The structural genomics field has become a very strong raison d'etre for computational structure prediction," Levitt says. "It's impossible to make crystal structures of every single gene on Earth, although getting the sequence of these is very likely to happen. In some sense, the premise of structural genomics is that we need to do enough structures so that modeling can then do the rest for us."
"The current plan for structural genomics," Moult says, "is to try to sample structure space experimentally in such a way that you can build useful models of all of the other proteins. Right now, we have experimental structures for maybe 1% of
the proteins for which we have a sequence. So you'd like to build models for the other 99%, and that's never going to go down to less than 90%."
Because the number of proteins in different genomes is enormous, metaservers could play an important role in determining their structures. "As servers continue to improve, they will become increasingly important in any prediction process, especially when dealing with genome-scale prediction tasks," Fischer and coworkers wrote in a CAFASP paper [Proteins: Struct., Funct., Genet., 45, 171 (2001)]. "We expect that in the near future, the performance difference between humans and machines will continue to narrow and that fully automated structure prediction will become an effective companion and complement to experimental structural genomics."
For applications related to protein function, efforts will continue to be made to refine modeling accuracy. In most current models, "if you look in detail at where the atoms are, they're not in exactly the right places," Baker says. "That means that they're not good for applications where you want to understand catalysis or you want to do drug design. So the problem really then becomes how to take a rough model and make it more accurate."
As solutions to that problem develop, new-fold modeling could prove increasingly useful for designing proteins as well as divining structures. "Many of the principles used in ab initio folding are obviously applicable to problems where you're trying to change an enzyme specificity, modify the structure of an existing protein, or design proteins from scratch," Russell says. "That seems to be where the field is moving."
"The long-term dream," Levitt says, "is being able to treat proteins as a materials science and to model in proteins the way we can model in silicon and steel and polymers. It's clear that with the right sequence you can develop proteins that do anything you want. But to do that you're going to need to solve ab initio protein folding. It's a problem that's very much at the boundary of physics, chemistry, and life. Because on the one hand we think it's a purely computationally solvable problem, and on the other hand it's what makes living things possible. So there are many reasons why this probably will not go away."