Date post: | 11-Jan-2016 |
Category: | Documents |
Upload: | francine-wiggins |
Exploring 3D Molecular Structures Using NCBI Tools
A Field Guide
June 17, 2004
NCBI Structure Resources
• Overview of Structural Informatics at NCBI
• How 3D Macromolecular Structures are Determined
• Indexing Structural Data at NCBI
• Finding Homologous Structures
– By Sequence Similarity: BLAST
– By Structure Similarity: VAST
– By Conserved Function: RPS-BLAST and CDD
• Finding a Structural Template for a Query Protein
The National Center for Biotechnology Information
• Created as a part of NLM in 1988
– Establish public databases
– Perform research in computational biology
– Develop software tools for sequence analysis
– Disseminate biomedical information
Structural Informatics
Chemical Formula: ARKLMPQSCSW…, Modifications, Ions, Ligands
3D Conformation: Structure, Dynamics, Active States, Folding
Function: Binding Sites, Catalytic Residues, Kinetics, Thermodynamics, Substrates, Intermediates
Structural Informatics
Chemical Formula: GenPept, NCBI RefSeq, SWISS-PROT, PIR, PRF
3D Conformation: PDB
Function: Multiple Sequence Alignments: Pfam, SMART, COGs, CDD
Structural Informatics at NCBI
Chemical Formula: GenPept, NCBI RefSeq, SWISS-PROT, PIR, PRF → Entrez Protein
3D Conformation: PDB → Entrez Structure, Entrez 3D Domains
Function: Multiple Sequence Alignments: Pfam, SMART, COGs, CDD → Entrez Domains
Record counts shown on the slide: 4,818,495; 25,003; 11,382; 103,820
The Entrez System
Entrez
Nucleotide
PubMed
Protein
Taxonomy
Structure
Domains
3D Domains
Books
Journals
PMC
OMIM
UniSTS
PopSet
Genome
SNP
UniGene
Gene
GEO
GEO Datasets
MeSH
Solving Structures: X-Ray Crystallography
Bond r (Å)
C-S 1.82
C-C 1.54
C-N 1.47
C-O 1.43
S-H 1.34
C=O 1.20
C-H 1.09
N-H 1.01
O-H 0.96
Electron Density Map
P F I
Resolution
5 Å 3 Å 1 Å T or V?
Challenges
Disorder
Cn3D
More About Resolution
1EJG: Crambin at 0.54 Å: protons!!
2TMA: Tropomyosin at 15 Å: only alpha carbons!!
3 Å
“Temperature”
Solving Structures: Nuclear Magnetic Resonance Spectroscopy
B₀
Constraint List
• Distances
• Dihedral Angles
• Orientation
Models consistent with constraints
RMSD (Å)
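The RMSD quoted for a family of NMR models measures how far corresponding atoms sit from one another. A minimal sketch of that calculation follows; it assumes the two coordinate sets are already superimposed and listed in the same atom order, and the function name `rmsd` is illustrative rather than taken from any NCBI tool.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation (in Å) between two equal-length lists of (x, y, z) points.

    Assumes the two models are already superimposed; no fitting is done here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and the same length")
    total = 0.0
    for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b):
        total += (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
    return math.sqrt(total / len(coords_a))

# Two hypothetical two-atom models that differ only in the second atom.
model_1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
model_2 = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
print(round(rmsd(model_1, model_2), 3))
```

A low RMSD across the models suggests the constraints pin the region down well; a high value often marks a flexible or poorly constrained region.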
Cn3D
PDB
PDB File: Header
HEADER ISOMERASE/DNA 01-MAR-00 1EJ9
TITLE CRYSTAL STRUCTURE OF HUMAN TOPOISOMERASE I DNA COMPLEX
COMPND MOL_ID: 1;
COMPND 2 MOLECULE: DNA TOPOISOMERASE I;
COMPND 3 CHAIN: A;
COMPND 4 FRAGMENT: C-TERMINAL DOMAIN, RESIDUES 203-765;
COMPND 5 EC: 5.99.1.2;
COMPND 6 ENGINEERED: YES;
COMPND 7 MUTATION: YES;
COMPND 8 MOL_ID: 2;
COMPND 9 MOLECULE: DNA (5'-
COMPND 10 D(*C*AP*AP*AP*AP*AP*GP*AP*CP*TP*CP*AP*GP*AP*AP*AP*AP*AP*TP*
COMPND 11 TP*TP*TP*T)-3');
COMPND 12 CHAIN: C;
COMPND 13 ENGINEERED: YES;
COMPND 14 MOL_ID: 3;
COMPND 15 MOLECULE: DNA (5'-
COMPND 16 D(*C*AP*AP*AP*AP*AP*TP*TP*TP*TP*TP*CP*TP*GP*AP*GP*TP*CP*TP*
COMPND 17 TP*TP*TP*T)-3');
COMPND 18 CHAIN: D;
COMPND 19 ENGINEERED: YES
SOURCE MOL_ID: 1;
SOURCE 2 ORGANISM_SCIENTIFIC: HOMO SAPIENS;
SOURCE 3 EXPRESSION_SYSTEM_COMMON: BACULOVIRUS EXPRESSION SYSTEM;
SOURCE 4 EXPRESSION_SYSTEM_CELL: SF9 INSECT CELLS;
SOURCE 5 MOL_ID: 2;
SOURCE 6 SYNTHETIC: YES;
SOURCE 7 MOL_ID: 3;
SOURCE 8 SYNTHETIC: YES
KEYWDS PROTEIN-DNA COMPLEX, TYPE I TOPOISOMERASE, HUMAN
REMARK 1
REMARK 2
REMARK 2 RESOLUTION. 2.60 ANGSTROMS.
REMARK 3
REMARK 3 REFINEMENT.
REMARK 3 PROGRAM : X-PLOR 3.1
REMARK 3 AUTHORS : BRUNGER
…
REMARK 280
REMARK 280 CRYSTALLIZATION CONDITIONS: 27% PEG 400, 145 MM MGCL2, 20
REMARK 280 MM MES PH 6.8, 5 MM TRIS PH 8.0, 30 MM DTT
REMARK 290 ...
PDB File: Data
ATOM      1  N   TRP A 203      30.156  -4.908  37.767  1.00 50.81           N
ATOM      2  CA  TRP A 203      30.797  -4.667  36.431  1.00 49.96           C
ATOM      3  C   TRP A 203      30.369  -3.337  35.766  1.00 49.18           C
ATOM      4  O   TRP A 203      29.315  -3.238  35.147  1.00 49.27           O
ATOM      5  CB  TRP A 203      30.518  -5.863  35.513  1.00 46.77           C
ATOM      6  CG  TRP A 203      30.847  -5.651  34.081  1.00 44.60           C
ATOM      7  CD1 TRP A 203      32.028  -5.234  33.553  1.00 49.72           C
ATOM      8  CD2 TRP A 203      29.980  -5.876  32.984  1.00 43.73           C
ATOM      9  NE1 TRP A 203      31.956  -5.191  32.177  1.00 45.45           N
ATOM     10  CE2 TRP A 203      30.704  -5.582  31.805  1.00 45.23           C
ATOM     11  CE3 TRP A 203      28.657  -6.305  32.877  1.00 46.48           C
ATOM     12  CZ2 TRP A 203      30.149  -5.705  30.539  1.00 46.06           C
ATOM     13  CZ3 TRP A 203      28.101  -6.431  31.622  1.00 43.08           C
ATOM     14  CH2 TRP A 203      28.849  -6.131  30.463  1.00 45.77           C
…
Fields, left to right: Record Name, Atom Number, Atom Name, Residue Name, Chain ID, Residue Number, X, Y, Z, Occupancy, Temperature Factor
Issues:
• Justification
• Nomenclature
ATOM 1 N TRP A 203 30.156 -4.908 37.767 1.00 50.81
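The justification issue matters because ATOM records are fixed-column, not whitespace-delimited. Below is a minimal sketch of a column-based reader. The column ranges follow the published PDB format; the helper names `format_atom` and `parse_atom` are illustrative, and `format_atom` only handles the common case of atom names for one-letter elements.

```python
def format_atom(serial, name, res_name, chain, res_seq, x, y, z, occ, b):
    """Build a fixed-column ATOM record (illustrative helper for this example)."""
    # Names shorter than four characters start in column 14, not 13.
    name_field = f" {name:<3}" if len(name) < 4 else name
    return (f"ATOM  {serial:>5} {name_field} {res_name:>3} {chain}{res_seq:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{occ:6.2f}{b:6.2f}")

def parse_atom(line):
    """Read an ATOM record by column position (0-based slices of 1-based PDB columns)."""
    return {
        "record": line[0:6].strip(),     # columns 1-6
        "serial": int(line[6:11]),       # columns 7-11
        "name": line[12:16].strip(),     # columns 13-16
        "res_name": line[17:20].strip(), # columns 18-20
        "chain": line[21],               # column 22
        "res_seq": int(line[22:26]),     # columns 23-26
        "x": float(line[30:38]),         # columns 31-38
        "y": float(line[38:46]),         # columns 39-46
        "z": float(line[46:54]),         # columns 47-54
        "occupancy": float(line[54:60]), # columns 55-60
        "b_factor": float(line[60:66]),  # columns 61-66
    }

line_1 = format_atom(1, "N", "TRP", "A", 203, 30.156, -4.908, 37.767, 1.00, 50.81)
# Hypothetical coordinates chosen so that two fields touch with no space between them.
line_2 = format_atom(2, "CA", "TRP", "A", 203, -12.345, -123.456, 7.0, 1.0, 50.0)

print(parse_atom(line_1))
print(len(line_1.split()), len(line_2.split()))  # naive splitting loses a field in line_2
```

Slicing by column keeps working when large negative coordinates fill their fields completely, which is exactly where `str.split()` silently merges two values.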
From PDB to Entrez
Structure
3D Domains
Protein
Domains
From Coordinates to Models
1EJ9: Human topoisomerase I
Building the Structure Summary
Taxonomy
Pubmed
Protein 3D Domains
Domains
Nucleotide
Indexing into MMDB
Structure
• Import only experimentally determined structures
• Convert to ASN.1
• Verify sequences
• Add chemical bonds
inter-residue-bonds { { atom-id-1 { molecule-id 1 , residue-id 1 , atom-id 1 } , atom-id-2 { molecule-id 1 , residue-id 2 , atom-id 9 } } ,
• Add secondary structure
id 1 , name "helix 1" , type helix , location subgraph residues interval { { molecule-id 1 , from 49 , to 61 } } } ,
• Create “backbone” model (Cα, P only)
• Create single-conformer model
Structure Indexing
Entrez
• MMDB-ID
• MMDB entry date
• EC number
• Organism
PDB
• Accession
• Release date
• Class
• Source
• Description
• Comment
Ligands
• PDB code
• PDB name
• PDB description
Literature
• Article title
• Author
• Journal
• Publication date
Experimental
• Method
• Resolution
Counters
• Ligand types
• Modified amino acids
• Modified nucleotides
• Modified ribonucleotides
• Protein chains
• DNA chains
• RNA chains
topoisomerase AND 2[dnachaincount] AND human[organism]
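A fielded query like the one above can also be assembled programmatically. The sketch below only builds a search URL for NCBI's E-utilities `esearch` endpoint and makes no network request. The helper names are illustrative, and the field names (`dnachaincount`, `organism`) are copied from this 2004 slide, so they are assumptions that may not match the current Entrez Structure index.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def fielded(value, field):
    """Format one Entrez term restricted to an index field, e.g. human[organism]."""
    return f"{value}[{field}]"

def esearch_url(db, *terms):
    """Join terms with AND and return an esearch URL for the given Entrez database."""
    term = " AND ".join(terms)
    return ESEARCH + "?" + urlencode({"db": db, "term": term})

url = esearch_url(
    "structure",
    "topoisomerase",
    fielded(2, "dnachaincount"),   # field name as shown on the slide
    fielded("human", "organism"),
)
print(url)
```

Keeping each restriction as a separate argument makes it easy to add or drop a counter field without editing a long query string by hand.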
Creating Sequence Records
Protein Nucleotide Nucleotide
1EJ9A 1EJ9C 1EJ9D
One record per chain
Building the Structure Summary
Annotating Secondary Structure
1EJ9: Human topoisomerase I
α-Helices
β-strands
coils/loops
Creating 3D Domains
3D Domain 0: 1EJ9A0 = entire polypeptide
Creating 3D Domains
3D Domains
1EJ9A1
1EJ9A3
1EJ9A2
1EJ9A4
1EJ9A5
< 3 Secondary Structure Elements
Building the Structure Summary
3D Domain Indexing
Entrez
• SDI
• MMDB-ID
• Accession
• MMDB entry date
• Organism
• Domain number
• Cumulative number
PDB
• Accession
• Release date
• Class
• Source
• Description
• Comment
Literature
• Article title
• Author
• Publication date
Counters
• Modified amino acids
• α-Helices
• β-Strands
• Residues
• Molecular weight
REMEMBER: 3D Domain 0 is the entire polypeptide chain!
4[helixcount] AND 0[strandcount] AND 0[domainno] AND viruses[organism]
Find all viral four helix bundles
Conserved Domains
Weakly conserved serine Active site serine
Sequences Aligned by Function
Linking Sequence to Function
The PSSM: Position Specific Score Matrix
         A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
206 D    0  -2   0   2  -4   2   4  -4  -3  -5  -4   0  -2  -6   1   0  -1  -6  -4  -1
207 G   -2  -1   0  -2  -4  -3  -3   6  -4  -5  -5   0  -2  -3  -2  -2  -1   0  -6  -5
208 V   -1   1  -3  -3  -5  -1  -2   6  -1  -4  -5   1  -5  -6  -4   0  -2  -6  -4  -2
209 I   -3   3  -3  -4  -6   0  -1  -4  -1   2  -4   6  -2  -5  -5  -3   0  -1  -4   0
210 S   -2  -5   0   8  -5  -3  -2  -1  -4  -7  -6  -4  -6  -7  -5   1  -3  -7  -5  -6
211 S    4  -4  -4  -4  -4  -1  -4  -2  -3  -3  -5  -4  -4  -5  -1   4   3  -6  -5  -3
212 C   -4  -7  -6  -7  12  -7  -7  -5  -6  -5  -5  -7  -5   0  -7  -4  -4  -5   0  -4
213 N   -2   0   2  -1  -6   7   0  -2   0  -6  -4   2   0  -2  -5  -1  -3  -3  -4  -3
214 G   -2  -3  -3  -4  -4  -4  -5   7  -4  -7  -7  -5  -4  -4  -6  -3  -5  -6  -6  -6
215 D   -5  -5  -2   9  -7  -4  -1  -5  -5  -7  -7  -4  -7  -7  -5  -4  -4  -8  -7  -7
216 S   -2  -4  -2  -4  -4  -3  -3  -3  -4  -6  -6  -3  -5  -6  -4   7  -2  -6  -5  -5
217 G   -3  -6  -4  -5  -6  -5  -6   8  -6  -8  -7  -5  -6  -7  -6  -4  -5  -6  -7  -7
218 G   -3  -6  -4  -5  -6  -5  -6   8  -6  -7  -7  -5  -6  -7  -6  -2  -4  -6  -7  -7
219 P   -2  -6  -6  -5  -6  -5  -5  -6  -6  -6  -7  -4  -6  -7   9  -4  -4  -7  -7  -6
220 L   -4  -6  -7  -7  -5  -5  -6  -7   0  -1   6  -6   1   0  -6  -6  -5  -5  -4   0
221 N   -1  -6   0  -6  -4  -4  -6  -6  -1   3   0  -5   4  -3  -6  -2  -1  -6  -1   6
222 C    0  -4  -5  -5  10  -2  -5  -5   1  -1  -1  -5   0  -1  -4  -1   0  -5   0   0
223 Q    0   1   4   2  -5   2   0   0   0  -4  -2   1   0   0   0  -1  -1  -3  -3  -4
224 A   -1  -1   1   3  -4  -1   1   4  -3  -4  -3  -1  -2  -2  -3   0  -2  -2  -2  -3
Serine scored differently in these two positions
Active site nucleophile
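A PSSM lookup makes the difference between the two serines concrete. The sketch below copies rows 210 and 214 to 218 from the matrix above and scores residues by position; the function names are illustrative, and summing column scores over an ungapped segment is a simplified stand-in for what RPS-BLAST does.

```python
# Column order used by the matrix on the slide.
RESIDUES = "ARNDCQEGHILKMFPSTWYV"

# Rows copied from the slide, keyed by alignment position.
PSSM_ROWS = {
    210: [-2, -5, 0, 8, -5, -3, -2, -1, -4, -7, -6, -4, -6, -7, -5, 1, -3, -7, -5, -6],
    214: [-2, -3, -3, -4, -4, -4, -5, 7, -4, -7, -7, -5, -4, -4, -6, -3, -5, -6, -6, -6],
    215: [-5, -5, -2, 9, -7, -4, -1, -5, -5, -7, -7, -4, -7, -7, -5, -4, -4, -8, -7, -7],
    216: [-2, -4, -2, -4, -4, -3, -3, -3, -4, -6, -6, -3, -5, -6, -4, 7, -2, -6, -5, -5],
    217: [-3, -6, -4, -5, -6, -5, -6, 8, -6, -8, -7, -5, -6, -7, -6, -4, -5, -6, -7, -7],
    218: [-3, -6, -4, -5, -6, -5, -6, 8, -6, -7, -7, -5, -6, -7, -6, -2, -4, -6, -7, -7],
}

def score(position, residue):
    """Score of placing `residue` at alignment `position`."""
    return PSSM_ROWS[position][RESIDUES.index(residue)]

def score_segment(start, segment):
    """Sum of per-position scores for an ungapped segment beginning at `start`."""
    return sum(score(start + k, res) for k, res in enumerate(segment))

print(score(210, "S"), score(216, "S"))   # same residue, different positions
print(score_segment(214, "GDSGG"))        # segment with the active-site serine
print(score_segment(214, "GDAGG"))        # serine replaced by alanine
```

Unlike a single substitution matrix, the PSSM rewards serine strongly only where the family conserves it, so a mutation at position 216 costs far more than one at position 210.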
Entrez Domains (CDD v2.00)
Pfam (Sanger), e.g. pfam01234: Pfam-A seeds: HMM based models representing a wide variety of functional domains derived from SWISS-PROT
SMART (EMBL), e.g. smart00123: HMM based models originally concentrating on eukaryotic signaling domains, now expanding
COG (NCBI), e.g. COG0123: BLAST based alignments derived from complete proteomes of prokaryotes
CD (NCBI), e.g. cd01234: NCBI curated domains based on sequence and structural alignments
Single Domains
Protein Families
A database of Position Specific Score Matrices (PSSMs)
CD-Search Output
CD
SMART
Pfam
COG
Click on a colored bar to align your sequence to the CD
CD Summary
Alignment view controls
Cn3D launch
PSSM created
Aligned query
Building the Structure Summary
Cn3D
Creating Entrez Links
NCBI Taxonomy
Literature from PDB
Sequences
Full Chain
Entrez Structure
Entrez 3D Domains
Links to CDs
CD-Search / RPS-BLAST
1EJ9A
Query: protein sequence
Database: PSSMs
Pre-computed in Entrez Protein
Enter accession, GI, or FASTA sequence into RPS-BLAST
Finding Homologous Structures
• By sequence similarity: BLAST
• By structural similarity: VAST
• By conserved function: CD-Search
Entrez Protein
Entrez Structure
Entrez 3D Domains
Entrez Domains
BLAST: Sequence Neighbors
BLAST Related Structures
Displays a graphical and text alignment between a query sequence and a similar sequence with structure
Accessed from
• BLink
• Any protein BLAST search
?GVKWKYLEHKGPVFAPPYDPLP
GIKWKFLEHKGPVFAPPYEPLP
BLink Neighbors
EAA05377: ENSANGP00000011118 from A. gambiae
Related Structures
Related Structures from BLASTp
Related Structures Cn3D
VAST: Searching by Structure
Why search for similar structures?
• To find homologs that sequence searches cannot: distant protein homologs often conserve structure more strongly than sequence
• To explore protein evolution: similar protein folds can be used to support different functions
• To identify conserved core elements of a protein fold that can be used to model related proteins of unknown structure
VAST: Structure Neighbors
Vector Alignment Search Tool
For each protein chain, locate SSEs (secondary structure elements), and represent them as individual vectors.
[Figure: Human IL-4 with its SSEs numbered 1 to 6]
VAST: Calculate ij
Vector position about the z axis
For both the query and target structures, calculate the midpoint of each SSE.
For each SSE k, align k along z and project midpoints onto the xy plane.
Then calculate [ij]k for i ≠ k, j ≠ k.
[Figure: midpoints of SSEs 1 to 6 projected onto the xy plane]
VAST: Calculate (rik, zik)
Vector position relative to the xy plane
For both the query and target structures, for each SSE k, set the origin at the midpoint of k.
Then calculate rik and zik for the endpoints of SSEs i ≠ k.
[Figure: x, y, z axes centered on SSE 3, showing r13 and z13 for SSE 1]
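The two geometric descriptors above can be sketched in a few lines. The code below is an illustration under stated assumptions, not VAST's implementation: each SSE is taken to be a pair of 3D endpoints, a local frame is built with z along SSE k and the origin at its midpoint, and the function names are hypothetical.

```python
import math

def sub(a, b): return tuple(x - y for x, y in zip(a, b))
def dot(a, b): return sum(x * y for x, y in zip(a, b))
def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0])
def unit(a):
    n = math.sqrt(dot(a, a))
    return tuple(x / n for x in a)
def midpoint(sse):
    start, end = sse
    return tuple((s + e) / 2 for s, e in zip(start, end))

def frame_for(k):
    """Right-handed frame with z along SSE k and origin at k's midpoint."""
    z = unit(sub(k[1], k[0]))
    helper = (1.0, 0.0, 0.0) if abs(z[0]) < 0.9 else (0.0, 1.0, 0.0)
    x = unit(cross(helper, z))
    y = cross(z, x)
    return midpoint(k), x, y, z

def angle_ij_about_k(k, i, j):
    """Angle in degrees, about k's axis, from the projected midpoint of i to that of j."""
    origin, x, y, _ = frame_for(k)
    def azimuth(sse):
        p = sub(midpoint(sse), origin)
        return math.atan2(dot(p, y), dot(p, x))
    degrees = math.degrees(azimuth(j) - azimuth(i))
    return (degrees + 180.0) % 360.0 - 180.0   # wrap into [-180, 180)

def r_and_z(k, point):
    """Distance of `point` from k's axis (r) and its height along that axis (z)."""
    origin, _, _, z = frame_for(k)
    p = sub(point, origin)
    height = dot(p, z)
    return math.sqrt(max(dot(p, p) - height * height, 0.0)), height

# Three hypothetical parallel SSEs, each 10 Å long.
sse_k = ((0.0, 0.0, -5.0), (0.0, 0.0, 5.0))
sse_i = ((10.0, 0.0, -5.0), (10.0, 0.0, 5.0))
sse_j = ((0.0, 10.0, -5.0), (0.0, 10.0, 5.0))
print(angle_ij_about_k(sse_k, sse_i, sse_j))
print(r_and_z(sse_k, sse_i[1]))
```

Because both descriptors are measured in a frame attached to SSE k, they do not change when the whole structure is rotated or translated, which is what lets query and target be compared without superimposing them first.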
VAST: Create Comparison Graph
[Figure: comparison graph pairing the SSEs of IL-4 with the SSEs of IL-6, with N and C termini marked]
Nodes: r13<>r12, z13<>z12
Arcs: 16<>15
Paths must follow sequence order
Select path with highest “weights”
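The requirement that a path follow sequence order and carry the highest total weight can be illustrated with a small dynamic program. This is a deliberately simplified stand-in, not VAST's actual graph search: it assumes a precomputed table `weights[i][j]` giving the score for pairing query SSE i with target SSE j (`None` when the pair is incompatible), and the function name is hypothetical.

```python
def best_ordered_path(weights):
    """Highest-weight set of SSE pairings in which both indices strictly increase.

    weights[i][j] is the score for pairing query SSE i with target SSE j,
    or None if the two SSEs are not compatible.
    Returns (total weight, list of (i, j) pairs).
    """
    n, m = len(weights), len(weights[0])
    best = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(best[i - 1][j], best[i][j - 1])  # leave i or j unpaired
            w = weights[i - 1][j - 1]
            if w is not None:
                best[i][j] = max(best[i][j], best[i - 1][j - 1] + w)

    # Trace back to recover which pairs were used.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        w = weights[i - 1][j - 1]
        if w is not None and best[i][j] == best[i - 1][j - 1] + w:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j]:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best[n][m], pairs

# Hypothetical weights: pairings (1, 2) and (2, 1) cross, so only one can be kept.
example = [
    [3, None, None],
    [None, None, 2],
    [None, 4, None],
]
print(best_ordered_path(example))
```

The crossing pair with the lower weight is dropped, which is the sense in which an order-respecting search is sensitive to topology and not just to which SSEs resemble each other.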
VAST: Refinement
Aligned residues are red
Alignment extended to the end of this strand
Cα atoms are added to the aligned SSEs
Alignments are allowed to extend beyond SSE boundaries
All atoms are added to the models, and the detailed backbone and sidechain positions are refined
VAST: Alignment of Sequence
• Aligned blocks represent structural core elements
• Aligned blocks have no internal gaps
• Aligned residues occupy the same position in space
• Aligned residues are shown in CAPITAL letters
Helix 1
Helix 2 Helix 3
Helix 4
VAST: Scoring
p = d × P(s > s0, n) × c(n, P1, P2)
p: The probability that the VAST alignment occurred by chance.
P(s > s0, n): Probability of observing an alignment of n SSEs with a score greater than s0 by chance.
c(n, P1, P2): Search space: number of possible alignments of n SSEs between vector sets P1 and P2.
d: Number of structures searched (set to 500)
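A short worked example shows how the three factors combine. It reads the slide's formula as a plain product, which is an assumption about the notation, and every number other than d = 500 is made up for illustration.

```python
def vast_p_value(p_score, search_space, n_structures=500):
    """p = d * P(s > s0, n) * c(n, P1, P2), reading the slide's formula as a product."""
    return n_structures * p_score * search_space

# Hypothetical values: the same chance probability for the score, but a whole
# chain offers many more possible SSE alignments than a single small domain does.
p_whole_chain = vast_p_value(p_score=1e-9, search_space=2000)
p_domain = vast_p_value(p_score=1e-9, search_space=50)

print(p_whole_chain)
print(p_domain)
```

With everything else held fixed, shrinking the search space lowers p, so a match that looks unremarkable against a whole chain can become significant when the query is a single domain.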
VAST: Summary
• Secondary structure elements are represented as vectors and are aligned based on their relative orientations
• VAST ignores loops and tolerates variation in SSE length
• The initial alignment is wholly ignorant of atomic coordinates
• Pathways through aligned SSEs respect sequence order
• VAST is sensitive to topology
• Alignments are extended and optimized using all-atom models
• Aligned blocks may extend across or into loops or other SSEs
Query by Chain vs 3D Domain
Query by whole chain
Query by domain 5
Not found using whole chain query!
c(n, P1, P2) is smaller for a 3D domain!
VAST: Multiple Alignments Cn3D
nr-PDB Sets
Entrez Structure
Choose criteria for inclusion in a set
Non-redundant set of sequence similar clusters
VAST reports one representative from each cluster
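The idea of grouping sequence-similar chains and reporting one representative per cluster can be sketched with a union-find structure. Everything specific here is an assumption for illustration: the chain labels are invented, the list of similar pairs would come from a sequence comparison at whatever threshold the set uses, and picking the best-resolution chain is only one plausible ranking rule.

```python
def cluster_chains(chains, similar_pairs):
    """Single-linkage clusters: chains joined by any chain of 'similar' pairs share a cluster."""
    parent = {c: c for c in chains}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]   # path halving
            c = parent[c]
        return c

    for a, b in similar_pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for c in chains:
        clusters.setdefault(find(c), []).append(c)
    return list(clusters.values())

def representatives(clusters, resolution):
    """Pick one chain per cluster; here, the one with the best (lowest) resolution in Å."""
    return [min(cluster, key=lambda c: resolution[c]) for cluster in clusters]

# Hypothetical chains and resolutions.
chains = ["chainA", "chainB", "chainC", "chainD"]
similar = [("chainA", "chainB"), ("chainB", "chainC")]
resolution = {"chainA": 2.5, "chainB": 1.8, "chainC": 3.0, "chainD": 2.0}

clusters = cluster_chains(chains, similar)
print(clusters)
print(representatives(clusters, resolution))
```

Reporting one chain per cluster keeps a neighbor list from being dominated by many near-identical entries of the same protein.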
Submitting a PDB File to VAST
• Pick the correct file format
• Remove all records except ATOM
• This is the best way to convert PDB into MMDB format!
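Stripping a file down to its ATOM records is a one-line filter. The sketch below assumes a standard PDB-format text file in which the record name occupies the first six columns; the function name and the sample text are illustrative.

```python
def keep_atom_records(pdb_text):
    """Return only the lines whose record name (columns 1-6) is ATOM."""
    return "".join(
        line
        for line in pdb_text.splitlines(keepends=True)
        if line[0:6].rstrip() == "ATOM"
    )

# Hypothetical fragment of a PDB file with several record types.
sample = (
    "HEADER    EXAMPLE\n"
    "ATOM      1  N   TRP A 203      30.156  -4.908  37.767  1.00 50.81           N\n"
    "HETATM  999  O   HOH A 900       1.000   2.000   3.000  1.00 30.00           O\n"
    "ATOM      2  CA  TRP A 203      30.797  -4.667  36.431  1.00 49.96           C\n"
    "END\n"
)
print(keep_atom_records(sample), end="")
```

Comparing the record name field exactly, instead of searching for the substring, avoids keeping HETATM or REMARK lines that happen to mention atoms.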
Blocks in CD Alignments
Alignment view controls
Aligned query
Cn3D launch
Block 1 Block 2 Block 3
Consensus sequence created
PSSM created
Curating CD Alignments
smart00235
VAST
cd00203
Cn3D
Curated CD Summary
List of annotated features
Customized view of the selected feature in Cn3D
Residues comprising the selected feature
Cn3D
CD-Curation: Effect on model alignment accuracy
[Figure: three panels (VAST, RPS-BLAST before curation, RPS-BLAST after curation), each plotting model alignment RMS against %id in structure alignment, 0 to 100]
A. Marchler-Bauer
CDART
Only available for single domain records:cd, pfam, smart
Finding a Structural Template
Overall Strategy: For a query protein sequence, construct a block alignment representing conserved core SSEs of the structures most similar in sequence to the query, and then align the query sequence to this template.
1. Construct the block alignment
A. Curated CD: Locate using CD-Search and use the sequences most similar to the query
B. VAST: Find the most sequence similar structure and find its VAST neighbors
2. Align the query to the template: Use Cn3D
A. PSI-BLAST: Aligns sequence using PSSM of current alignment
B. BLOCKER: Aligns sequence to an existing block alignment: use where sequence similarity is high
C. Threader: Aligns sequence to a structure and a block alignment: use where sequence similarity is low
BLOCKER: The Block Aligner
PSSM
• Creates alignments that match the existing block structure
• Matches are scored from a PSSM generated from the block alignment
• An entire block must be matched with no internal gaps
• There are no penalties for gaps between blocks up to a set gap length
• Can perform both local and global alignments
• Generally used after BLAST or PSI-BLAST
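The rules above (whole blocks, no internal gaps, free gaps between blocks up to a limit) translate naturally into a small dynamic program. The sketch below is an illustration of those rules for the global case only, not NCBI's block aligner: each block is a toy PSSM given as a list of per-column dictionaries, unlisted residues score -1, and all names are hypothetical.

```python
def place_blocks(seq, blocks, max_gap):
    """Place every block, in order and ungapped, on `seq` to maximize the summed PSSM score.

    blocks: list of PSSMs; each PSSM is a list of {residue: score} dicts, one per column.
    max_gap: largest number of unaligned residues allowed between consecutive blocks.
    Returns (score, start positions), or None if no placement satisfies the constraints.
    """
    def block_score(pssm, start):
        return sum(col.get(seq[start + k], -1) for k, col in enumerate(pssm))

    best = {}  # start of the latest block -> (score so far, starts of all blocks so far)
    for b, pssm in enumerate(blocks):
        current = {}
        for start in range(len(seq) - len(pssm) + 1):
            s = block_score(pssm, start)
            if b == 0:
                current[start] = (s, [start])
                continue
            prev_len = len(blocks[b - 1])
            candidates = [
                (prev_score + s, prev_starts + [start])
                for prev_start, (prev_score, prev_starts) in best.items()
                if 0 <= start - (prev_start + prev_len) <= max_gap
            ]
            if candidates:
                current[start] = max(candidates, key=lambda c: c[0])
        best = current
    return max(best.values(), key=lambda c: c[0]) if best else None

# Toy PSSMs: one block favoring GDSGG, one favoring CC.
block_1 = [{"G": 5}, {"D": 5}, {"S": 5}, {"G": 5}, {"G": 5}]
block_2 = [{"C": 5}, {"C": 5}]
query = "AAGDSGGAAACCAA"

print(place_blocks(query, [block_1, block_2], max_gap=5))
print(place_blocks(query, [block_1, block_2], max_gap=2))
```

Tightening `max_gap` forces the second block off its best match, which is one way such an aligner can reveal that the existing block structure does not fit a new sequence.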
The Block Aligner tests the existing block structure
BLAST/PSSM vs BLOCKER
BLAST/PSSM
BLOCKER
Alignment
Import and align GI 1470115
The NCBI Threader
LRLSLEQLQVIAIAN
Input
• Structure
• Block alignment
• Sequence
Attempts to find matches based on chemical contacts, mainly buried hydrophobic interactions
Useful on blocks for which sequence alignment methods fail
Should be iterated with varying block structures
Cn3D
The Future
• More curated CDs: they keep coming…
• Pre-computed Related Structures for all sequences in Entrez Protein
• CD “children”: subfamilies of large CD records based on sequence and structure similarity
• Improved mapping of SNP data onto 3D structures
• Further linking of structural and genomic biology
What comes next…
• Workshop I
– Working with Structures
• Workshop II
– Working with Alignments
• All exercises and other resources will remain on the course web pages
• [email protected]
• NCBI Handbook, Ch. 3