Worldwide Protein Data Bank
www.wwpdb.org
What the Protein Data Bank teaches us about structural biology
Helen M. Berman
NCMI Workshop
December 13, 2008
1960’s
Protein crystallography begins to take off
Emerging interest in protein folding
Use of computer graphics to represent structure
Nobel Prize awarded for the first 3D protein structures: myoglobin and hemoglobin
Lysozyme
Hemoglobin
Ribonuclease
Myoglobin
Myoglobin: Kendrew, Bodo, Dintzis, Parrish, Wyckoff, Phillips (1958) Nature 181 662-666; Hemoglobin: Perutz (1962) Proc. R. Soc. A265, 161-187; Lysozyme: Blake, Koenig, Mair, North, Phillips, Sarma (1965) Nature 206 757; Ribonuclease: Kartha, Bello, Harker (1967) Nature 213, 862-865; Wyckoff, Hardman, Allewell, Inagami, Johnson, Richards (1967) J. Biol. Chem. 242, 3753-3757.
1970’s
Grass roots community efforts to archive data
Protein crystallographers discuss how to archive data
June 1971 Cold Spring Harbor meeting brings groups together (Cold Spring Harbor Symposia on Quantitative Biology, vol. XXXVI, 1972)
October 1971 PDB is announced in Nature New Biology (7 structures; vol 233, 1971, page 223)
1975 PDB receives first funding from NSF (~32 structures)
Hemoglobin: M.F. Perutz (1962) Proc. R. Soc. A265: 161-187
Carboxypeptidase A: F.A. Quiocho, W.N. Lipscomb (1971) Adv Protein Chem 25: 1-78
Myoglobin: J.C. Kendrew, G. Bodo, H.M. Dintzis, R.G. Parrish, H. Wyckoff, D.C. Phillips (1958) Nature 181: 662-666
Subtilisin: R.A. Alden, J.J. Birktoft, J. Kraut, J.D. Robertus, C.S. Wright (1971) Biochem Biophys Res Commun 45: 337-344
Alpha-chymotrypsin: J.J. Birktoft, D.M. Blow (1972) J Mol Biol 68: 187-240
Pancreatic trypsin inhibitor: R. Huber, D. Kukla, A. Ruhlmann, O. Epp, H. Formanek (1970) Nature 57: 389-392
Rubredoxin: K.D. Watenpaugh, L.C. Sieker, J.R. Herriott, L.H. Jensen (1973) Acta Crystallogr B29: 943-956
Lactate dehydrogenase: J.L. White, M.L. Hackert, M. Buehner, M.J. Adams, G.C. Ford, P.J. Lentz Jr., I.E. Smiley, S.J. Steindel, M.G. Rossmann (1976) J Mol Biol 102: 759-779
Cytochrome b5: F.S. Mathews, P. Argos, M. Levine (1972) Cold Spring Harb Symp Quant Biol 36: 387-395
Papain: J. Drenth, J.N. Jansonius, R. Koekoek, H.M. Swen, B.G. Wolthers (1968) Nature 218: 929-932
Proportion of enzyme classes relative to total enzyme structures
[Figure: chart of enzyme classes; legend: Ligases, Isomerases, Lyases, Hydrolases, Transferases, Oxidoreductases]
Enzyme Class     1972-79  1980-89  1990-99  2000-08  Total
Oxidoreductases        5       25      918     2977   3925
Transferases           3       29     1423     5246   6701
Hydrolases            29      123     2797     6846   9795
Lyases                 2        3      451     1337   1793
Isomerases             1        2      280      716    999
Ligases                0        4      123      652    779
Total                 40      186     5992    17774  23992
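The proportions named in the chart title can be recomputed from the counts in the table. The following is a minimal sketch, assuming the proportion is each class's share of all enzyme structures in the same decade; the names `COUNTS` and `class_proportions` are illustrative and do not come from the slides.

```python
# Counts copied from the enzyme-class table (structures per decade).
DECADES = ["1972-79", "1980-89", "1990-99", "2000-08"]
COUNTS = {
    "Oxidoreductases": [5, 25, 918, 2977],
    "Transferases": [3, 29, 1423, 5246],
    "Hydrolases": [29, 123, 2797, 6846],
    "Lyases": [2, 3, 451, 1337],
    "Isomerases": [1, 2, 280, 716],
    "Ligases": [0, 4, 123, 652],
}


def class_proportions(counts):
    """Return (percent per class per decade, total enzyme structures per decade)."""
    n_decades = len(next(iter(counts.values())))
    totals = [sum(row[i] for row in counts.values()) for i in range(n_decades)]
    props = {
        name: [100.0 * row[i] / totals[i] for i in range(n_decades)]
        for name, row in counts.items()
    }
    return props, totals


props, totals = class_proportions(COUNTS)
print("Class".ljust(16) + "".join(d.rjust(9) for d in DECADES))
for name, row in props.items():
    print(name.ljust(16) + "".join(f"{p:9.1f}" for p in row))
```

Under that assumption, hydrolases account for 29 of the 40 enzyme structures from 1972-79.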
Enzymes
In the beginning
Lysozyme: Blake, Koenig, Mair, North, Phillips, Sarma (1965) Nature 206: 757
Ribonuclease: Kartha, Bello, Harker (1967) Nature 213: 862-865; Wyckoff, Hardman, Allewell, Inagami, Johnson, Richards (1967) J. Biol. Chem. 242: 3753-3757.
[Figure axes: Percent by Decade]
RNA-containing structures (1317)
[Figure: Number of Structures by Decade (1972-1979, 1980-1989, 1990-1999, 2000-2008); legend: RNA only, Protein/RNA complexes, DNA/RNA hybrid, Protein/DNA/RNA complexes]
In the beginning: tRNA
J.L. Sussman, S.-H. Kim (1976) Biochem Biophys Res Commun. 68: 89-96; J.D. Robertus, J.E. Ladner, J.T. Finch, D. Rhodes, R.S. Brown, B.F.C. Clark, & A. Klug (1974) Nature 250: 546-551.
1980’s
Technology takes off
Structural biology is able to focus on medical problems
Community efforts to promote data sharing
IUCr guidelines requiring data deposition in the PDB are published
DNA-containing structures (2474)
[Figure: structures by Decade; legend: DNA only, Protein/DNA complexes, DNA/RNA hybrid, Protein/DNA/RNA complexes]
In the beginning
B-DNA: 1bna, Dickerson & Drew (1981) J. Mol. Biol. 149: 761-786
Z-DNA: 2dcg, Wang, Quigley, Kolpak, Crawford, van Boom, van der Marel, Rich (1979) Nature 282: 680-686
Protein-nucleic acid complexes (1920)
[Figure: Number of Structures by Decade; legend: Protein/DNA complexes, Protein/RNA complexes, Protein/DNA/RNA complexes]
In the beginning: Phage 434 repressor-operator
2or1: Aggarwal, Rodgers, Drottar, Ptashne, & Harrison (1988) Science 242: 899-907
Viruses (280 total)
Helical (25)
Icosahedral (255)
In the beginning
Hopper, Harrison, Sauer (1984) Structure of tomato bushy stunt virus. V. Coat protein sequence determination and its structural implications. J. Mol. Biol. 177: 701-713
Silva, Rossmann (1985) The refinement of southern bean mosaic virus in reciprocal space. Acta Crystallogr. B41: 147-157
Number of Structures by Decade
1980-1989    20
1990-1999   121
>=2000      139
Cooperative community action
Individual letters to editors of journals
Committees
– IUCr Commission on Biological Macromolecules
– ACA/USNCCr
– Richards committee
Funding agencies
Articles in journals
Marvin Cassman, Fred Richards, Richard Dickerson
1990’s
Number of structures increases exponentially
Complexity of structures increases
mmCIF dictionary created
New databases begin to emerge
User base expands dramatically
PDB archive moves
mmCIF Working Group Members
In the beginning
Electron Microscopy structures
Bacteriorhodopsin
Henderson, Baldwin, Ceska, Zemlin, Beckmann, Downing (1990) J. Mol. Biol. 213: 899-929.
Ribosome structures (214)
Prokaryotic Eukaryotic
In the beginning
Ban, Nissen, Hansen, Moore, & Steitz (2000) Science 289: 905-920; Clemons Jr., May, Wimberly, McCutcheon, Capel, & Ramakrishnan (1999) Nature 400: 833-840; Schluenzen, Tocilj, Zarivach, Harms, Gluehmann, Janell, Bashan, Bartels, Agmon, Franceschi, Yonath (2000) Cell 102: 615-623; Yusupova, Yusupov, Cate, & Noller (2001) Cell 106: 233-241.
[Figure: pie chart of ribosome structures; slice labels include 30S and 50S; slice values 55%, 41%, 2%, 1%, 1%]
2000’s
wwPDB is formed
Continued growth in structures
Structural genomics takes off
Depositions to the PDB by decade (July 2008)
[Figure: Number of released entries by Year]
What can we learn from the PDB?
Structure distribution
[Figure: structures by molecule type. Categories: Protein only, Protein-DNA complexes, DNA only, Protein-RNA complexes, RNA only, RNA-DNA hybrid, Other. Values in chart order: 46157, 1301, 1093, 755, 39, 582655]
[Figure: structures by function. Categories: Response to stimuli*, Biological regulation & signal transduction*, Cellular processes*, Immune system process*, Other, Ribosome, Virus, Enzyme. Values in chart order: 17988, 23466, 819, 4445t500, 2911, 218, 280]
* GO process
Structure determination methods (April 30, 2008)
Number of structures by Decade
Method   1972-1979  1980-1989  1990-1999  2000-2008
X-Ray           86        341       8837      33797
NMR              0          2       1790       5492
EM               0          0          2        154
Other methods, in legend order: FIBER DIFFRACTION, NEUTRON DIFFRACTION, SOLUTION SCATTERING, POWDER DIFFRACTION, ELECTRON DIFFRACTION, ELECTRON TOMOGRAPHY, INFRARED SPECTROSCOPY
Number of structures for the other methods, in chart order: 22, 22, 21, 18, 15, 11, 3
6 176
Resolution distribution of all structures
[Figure: two panels, "Resolution distribution of protein structures" and "Resolution distribution of other structures"; axes: Resolution vs. Year]
Distinct and novel protein sequences
[Figure: Percent of distinct/novel structures by Decade (1972-1979, 1980-1989, 1990-1999, 2000-2008). Series: Structures containing distinct protein sequences (<98%); Structures containing novel protein sequences (<30%); Subset of PSI structures; Subset of other SG structures. Data labels in chart order: 63%, 37%, 51%, 27%, 32%, 14%, 39%, 16%, 7%, 7%, 25%, 4%, 2%, 10%]
Redundancy: protein clusters
Cluster #  Total distinct chains in cluster  Protein cluster                         First structure  Deposition Date
1          459                               Bacteriophage T4 lysozyme               2LZM             1977-03-28
2          297                               Hen egg white lysozyme                  2LYZ             1975-02-01
3          196                               Human lysozyme                          1GFE             1984-10-12
4          445                               Mouse immunoglobulin Fc&Fab fragments   1GIG             1993-01-20
5          218                               Human immunoglobulin Fc&Fab fragments   1FC1             1981-05-21
6          330                               HIV-1 protease                          2HVP             1989-04-10
7          302                               Trypsin (serine protease)               5PTP             1977-12-19
8          254                               Thrombin                                2HGT             1991-06-03
9          229                               Human carbonic anhydrase II             1CA2             1976-05-22
10         185                               Whale myoglobin                         1MBN             1973-04-05
11         182                               Human leukocyte antigen                 1HLA             1987-10-15
12         178                               Human hemoglobin -subunit               3HHB             1975-04-01
13         176                               Human hemoglobin -subunit               3HHB             1975-04-01
14         160                               Ribonuclease A                          2RNS             1973-04-01
15         153                               Human cyclin-dependent kinase 2 (CDK2)  1HCK             1996-06-03
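The slides do not say how these clusters were computed. One common way to group redundant chains is greedy clustering at a sequence-identity threshold, sketched below under loud assumptions: the sequences are hypothetical toy strings that are already aligned and of equal length, and `identity` and `greedy_cluster` are illustrative names, not part of any PDB tool.

```python
def identity(a, b):
    """Fraction of identical positions between two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(named_seqs, threshold):
    """Put each sequence in the first cluster whose representative it matches.

    The representative of a cluster is the first sequence placed in it.
    """
    clusters = []  # list of (representative sequence, member names)
    for name, seq in named_seqs:
        for rep, members in clusters:
            if identity(seq, rep) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((seq, [name]))
    return [members for _, members in clusters]


# Hypothetical 10-residue sequences, for illustration only.
TOY = [
    ("toy_a", "MKTAYIAKQR"),
    ("toy_b", "MKTAYIAKQK"),  # differs from toy_a at one position
    ("toy_c", "GSHMLEDPVV"),  # unrelated to the other two
]

print(greedy_cluster(TOY, 0.90))
print(greedy_cluster(TOY, 0.98))
```

Raising the threshold splits near-identical chains into separate clusters, which is why the choice of cutoff matters when counting "distinct" entries.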
Lysozyme: Lessons learned
Blake, Koenig, Mair, North, Phillips, Sarma (1965) Nature 206: 757.
T4 bacteriophage (459 structures)
Amino acid replacement studies suggest that about 50% of the amino acid residues define the structure of T4 lysozyme. B.W. Matthews (1996) FASEB J. 10: 35-41.
Insight into folding and catalysis
Hen egg white (297 structures)
Low sequence identity
Structural similarity of active site to T4
B.W. Matthews, S.J. Remington, M.G. Grutter, W.F. Anderson (1981) J. Mol. Biol. 147: 545-58.
Insight into evolution and catalysis
Myoglobin and hemoglobin: Lessons learned
Whale myoglobin (185 structures)
Different ligands: oxygen, carbon monoxide [1]
Amino acid substitution studies [2]
Laue studies [3]
Insight into function and dynamics
Other species myoglobin
Low sequence identity, same structure [4]
Insight into evolution
Human hemoglobin (178 structures)
Insight into function and disease (sickle cell anemia, thalassemia) [5]
Other species hemoglobin
Low sequence identity, same structure [4]
Profound insight into evolution
Lodish et al. [6]
[1] Kuriyan, Wilz, Karplus, Petsko (1986) J. Mol. Biol. 192: 133-154
[2] Quillin, Arduini, Olson, Phillips Jr. (1993) J. Mol. Biol. 234: 140-155; Carver, Brantley Jr., Singleton, Arduini, Quillin, Phillips Jr., Olson (1992) J. Biol. Chem. 267: 14443-14450
[3] Bourgeois, Vallone, Schotte, Arcovito, Miele, Sciara, Wulff, Anfinrud, Brunori (2003) PNAS 100: 8704-8709
[4] Dickerson, Geis (1983) Hemoglobin: structure, function, and pathology
[5] Kidd, Baker, Mathews, Brittain, Baker (2001) Prot. Sci. 10: 1739-1749; Harrington, Adachi, Royer Jr. (1998) J. Biol. Chem. 273: 32690-32696
[6] Lodish, Berk, Zipursky, Matsudaira, Baltimore, Darnell (2000) Molecular Cell Biology, WH Freeman & Co.
TIM barrel proteins: Lessons learned
TIM barrel structures (1727)
http://www.cathdb.info
Share the same fold but represent significant sequence and functional diversity
Are enzymes or enzyme-related proteins involved in molecular or energy metabolism
Comparative structure analysis indicates evolutionary relatedness of TIM barrel proteins
Banner, Bloomer, Petsko, Phillips, Wilson (1976) Biochem. Biophys. Res. Commun. 72: 146-155
Nagano, Orengo, Thornton (2002) J. Mol. Biol. 321: 741-65.
HIV-related structures (609)
Protease                311
Reverse Transcriptase   110
Gag protein              39
Integrase                27
Other                   122
[Figure: Number of Structures by Decade for each category]
Amprenavir (GSK) Fosamprenavir (GSK)
Lopinavir (Abbott) Atazanavir (BMS)
Nelfinavir (Agouron) Darunavir (Tibotec)
Tipranavir (BI) Indinavir (Merck)
Ritonavir (Abbott) Saquinavir (Roche)
HIV-1 protease (311)
Navia, Fitzgerald, McKeever, Leu, Heimbach, Herber, Sigal, Darke, Springer (1989) Nature 337: 615-620; Wlodawer, Miller, Jaskolski, Sathyanarayana, Baldwin, Weber, Selk, Clawson, Schneider, Kent (1989) Science 245: 616-621
226 structures with ligands
2R5P, 2B7Z, 2AVV, 2AVO, 2AVS, 1SGU, 1SDT, 1SDV, 1SDU, 1K6C, 1C6Y, 2BPX, 1HSG, 1HSH
1T7J, 1HPV
2B60, 1RL8, 1SH9, 1N49, 1HXW
2QAK, 2PYM, 2Q63, 2PYN, 2Q64, 2R5Q, 1OHR
2O4N, 2O4L, 2O4P, 1D4Y, 1D4S
3D1X, 3D1Y, 3CYX, 2NMW, 2NMZ, 2NNP, 2NMY, 2NNK, 1C6Z, 1FB7
2FXE, 2FXD, 2O4K, 2AQU, 2FND
2RKG, 2RKF, 2QHC, 2Z54, 2Q5K, 2O4S, 1RV7, 1MUI
Abacavir (GSK)
Nevirapine (BI) Stavudine (BMS)
Efavirenz (BMS) Lamivudine (GSK)
Zidovudine (GSK) Emtricitabine (Gilead)
Tenofovir (Gilead) Zalcitabine (Hoffmann-La Roche)
Etravirine (Tibotec) Delavirdine (Pfizer)
HIV-1 reverse transcriptase (110)
[Figure: Number of Structures by Year]
Wang, Smerdon, Jager, Kohlstaedt, Rice, Friedman, Steitz, (1994) Proc.Natl.Acad.Sci.USA 91: 7242-7246
76 structures with ligands
2HND, 2HNY, 1S1U, 1S1X, 1LW0, 1LWE, 1LWC, 1LWF, 1JLB, 1JLF, 1FKP, 1VRT, 3HVT
1JKH, 1IKW, 1IKV, 1FKO, 1FK9
1T05
1S6P
Structural coverage of KEGG pathways
KEGG Pathway                            Number of Structures
Complement and coagulation cascades     506
Small cell lung cancer                  506
Regulation of actin cytoskeleton        449
Non-small cell lung cancer              407
Pyrimidine metabolism                   402
Nitrogen metabolism                     399
Two-component system - General          360
Ribosome                                333
Base excision repair                    328
Purine metabolism                       310
Antigen processing and presentation     281
Nicotinate and nicotinamide metabolism  252
Insulin signaling pathway               248
Porphyrin and chlorophyll metabolism    248
ABC transporters - General              246
Prostate cancer                         244
50136 structures
16526 structures associated with KEGG pathway (33%)
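The 33% coverage figure follows directly from the two totals on this slide. A minimal sketch of the arithmetic, with a few pathway counts copied from the list; the variable and function names are illustrative. A structure may map to more than one pathway, so per-pathway shares are not expected to add up to 100%.

```python
TOTAL_STRUCTURES = 50136  # all structures counted on the slide
KEGG_STRUCTURES = 16526   # structures associated with a KEGG pathway

# A few pathway counts copied from the list, for illustration.
PATHWAYS = {
    "Complement and coagulation cascades": 506,
    "Regulation of actin cytoskeleton": 449,
    "Ribosome": 333,
}


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


coverage = percent(KEGG_STRUCTURES, TOTAL_STRUCTURES)
print(f"KEGG coverage: {coverage:.0f}%")
for name, count in PATHWAYS.items():
    share = percent(count, KEGG_STRUCTURES)
    print(f"{name}: {share:.1f}% of KEGG-associated structures")
```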
Human biological pathways
Genes that contain a PDB structure are in red
Complement and coagulation cascades pathway
Small cell lung cancer
Non-small cell lung cancer
Regulation of actin cytoskeleton
KEGG (http://www.genome.jp/kegg/)
EM maps and Models in the PDB
How EM experiments are archived
Nuclear pore complex, 85 Å, EMD-1097
Rotavirus VP6 protein, 3.8 Å, EMD-1461
EMDataBank
Created by EBI in 2002 for archiving EM maps
US deposition/annotation site added this year
Maps stored in CCP4/MRC format
Associated metadata stored in XML format
580 entries total
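For readers unfamiliar with the CCP4/MRC map format, the sketch below shows how the start of a map header might be read. It assumes the usual layout (a 1024-byte header whose first four 32-bit integers are NX, NY, NZ and MODE, with the ASCII tag "MAP " at byte offset 208) and little-endian byte order; real files should be checked against the format specification and the machine stamp. The functions `make_toy_header` and `read_map_header` are hypothetical helpers, and the header is synthetic rather than taken from an EMDataBank entry.

```python
import struct

HEADER_SIZE = 1024  # bytes in a CCP4/MRC header
MAP_TAG_OFFSET = 208  # byte offset of the "MAP " identifier


def make_toy_header(nx, ny, nz, mode=2):
    """Build a synthetic header for demonstration (mode 2 = 32-bit floats)."""
    header = bytearray(HEADER_SIZE)
    struct.pack_into("<4i", header, 0, nx, ny, nz, mode)
    header[MAP_TAG_OFFSET:MAP_TAG_OFFSET + 4] = b"MAP "
    return bytes(header)


def read_map_header(header):
    """Parse grid dimensions and data mode from a little-endian header."""
    if len(header) < HEADER_SIZE:
        raise ValueError("header is shorter than 1024 bytes")
    if header[MAP_TAG_OFFSET:MAP_TAG_OFFSET + 4] != b"MAP ":
        raise ValueError("missing 'MAP ' identifier")
    nx, ny, nz, mode = struct.unpack_from("<4i", header, 0)
    return {"nx": nx, "ny": ny, "nz": nz, "mode": mode}


info = read_map_header(make_toy_header(64, 64, 32))
print(info)
```

In practice a maintained reader library is preferable to hand-parsing; the sketch only illustrates that the map itself is a simple binary grid, while the descriptive metadata lives separately in XML.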
EM entries in the PDB
Atomic coordinate models fitted to EM maps
Storage format for models and metadata is CIF
Matrix representations possible
Some large entries “break” PDB format
PBCV-1 (1m4x, 1680 matrices)
80S ribosome (1s1h + 1s1i)
230 entries total
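A matrix representation lets an entry store one copy of the coordinates plus a set of transformations that generate the remaining copies. The sketch below illustrates the idea with plain tuples, assuming each matrix is a 3x4 rotation-plus-translation; the atoms, matrices, and function names are made up for illustration and are not taken from 1m4x.

```python
def apply_matrix(matrix, xyz):
    """Apply a 3x4 matrix (rows of r1, r2, r3, t) to one coordinate."""
    x, y, z = xyz
    return tuple(r[0] * x + r[1] * y + r[2] * z + r[3] for r in matrix)


def expand(matrices, atoms):
    """Generate one transformed copy of the atom list per matrix."""
    return [[apply_matrix(m, atom) for atom in atoms] for m in matrices]


# Illustrative matrices: the identity and a 90-degree rotation about z.
IDENTITY = ((1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0))
ROT_Z_90 = ((0, -1, 0, 0), (1, 0, 0, 0), (0, 0, 1, 0))

# Hypothetical coordinates standing in for the deposited unit.
ATOMS = [(1.0, 0.0, 0.0), (2.0, 1.0, 0.5)]

copies = expand([IDENTITY, ROT_Z_90], ATOMS)
for i, copy in enumerate(copies):
    print(i, copy)
```

With 1680 matrices, as quoted for PBCV-1, the same loop would produce 1680 copies from a single deposited unit, which hints at why the full assembly does not fit comfortably in a flat coordinate listing.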
PDBj
Goals
Common data model
Data harvesting tools
“One-stop shop” for deposition and retrieval
Tools for visualization, segmentation, and assessment
Acknowledgements
Wellcome Trust, EU, CCP4, BBSRC, MRC, EMBL
NLM
BIRD-JST, MEXT
NSF, NIGMS, DOE, NLM, NCI, NCRR, NIBIB, NINDS, NIDDK
Acknowledgements
NIH GM079429 (Baylor, Rutgers, EBI) 2007-2012
EU Network of Excellence LSHG-CT-2004-50282 (EBI) 2004-2009