Exploring 3D Molecular Structures Using NCBI Tools A Field Guide June 17, 2004.

Exploring 3D Molecular Structures Using NCBI Tools

A Field Guide

June 17, 2004

NCBI Structure Resources

• Overview of Structural Informatics at NCBI
• How 3D Macromolecular Structures are Determined
• Indexing Structural Data at NCBI
• Finding Homologous Structures
  – By Sequence Similarity: BLAST
  – By Structure Similarity: VAST
  – By Conserved Function: RPS-BLAST and CDD
• Finding a Structural Template for a Query Protein

The National Center for Biotechnology Information

• Created as a part of NLM in 1988
  – Establish public databases
  – Perform research in computational biology
  – Develop software tools for sequence analysis
  – Disseminate biomedical information

Structural Informatics

Chemical Formula: ARKLMPQSCSW…, Modifications, Ions, Ligands

3D Conformation: Structure, Dynamics, Active States, Folding

Function: Binding Sites, Catalytic Residues, Kinetics, Thermodynamics, Substrates, Intermediates

Structural Informatics

Chemical Formula: GenPept, NCBI RefSeq, SWISS-PROT, PIR, PRF

3D Conformation: PDB

Function: Multiple Sequence Alignments: Pfam, SMART, COGs, CDD

Structural Informatics at NCBI

Chemical Formula: GenPept, NCBI RefSeq, SWISS-PROT, PIR, PRF (Entrez Protein)

3D Conformation: PDB (Entrez Structure, Entrez 3D Domains)

Function: Multiple Sequence Alignments: Pfam, SMART, COGs, CDD (Entrez Domains)

4,818,495 25,003

11,382

103,820

The Entrez System

Entrez

Nucleotide

PubMed

Protein

Taxonomy

Structure

Domains

3D Domains

Books

Journals

PMC

OMIM

UniSTS

PopSet

Genome

SNP

UniGene

Gene

GEO

GEO Datasets

MeSH

Solving Structures: X-Ray Crystallography

Bond r (Å)

C-S 1.82

C-C 1.54

C-N 1.47

C-O 1.43

S-H 1.34

C=O 1.20

C-H 1.09

N-H 1.01

O-H 0.96

Electron Density Map

P F I

Resolution

5 Å 3 Å 1 Å T or V?

Challenges

Disorder

Cn3D

More About Resolution

1EJG: Crambin at 0.54 Å (protons!!)

2TMA: Tropomyosin at 15 Å (only alpha carbons!!)

3 Å

“Temperature”

Solving Structures: Nuclear Magnetic Resonance Spectroscopy

B0

Constraint List: Distances, Dihedral Angles, Orientation

Models consistent with constraints

RMSD (Å)

Cn3D

PDB

PDB File: Header

HEADER    ISOMERASE/DNA                           01-MAR-00   1EJ9
TITLE     CRYSTAL STRUCTURE OF HUMAN TOPOISOMERASE I DNA COMPLEX
COMPND    MOL_ID: 1;
COMPND   2 MOLECULE: DNA TOPOISOMERASE I;
COMPND   3 CHAIN: A;
COMPND   4 FRAGMENT: C-TERMINAL DOMAIN, RESIDUES 203-765;
COMPND   5 EC: 5.99.1.2;
COMPND   6 ENGINEERED: YES;
COMPND   7 MUTATION: YES;
COMPND   8 MOL_ID: 2;
COMPND   9 MOLECULE: DNA (5'-
COMPND  10 D(*C*AP*AP*AP*AP*AP*GP*AP*CP*TP*CP*AP*GP*AP*AP*AP*AP*AP*TP*
COMPND  11 TP*TP*TP*T)-3');
COMPND  12 CHAIN: C;
COMPND  13 ENGINEERED: YES;
COMPND  14 MOL_ID: 3;
COMPND  15 MOLECULE: DNA (5'-
COMPND  16 D(*C*AP*AP*AP*AP*AP*TP*TP*TP*TP*TP*CP*TP*GP*AP*GP*TP*CP*TP*
COMPND  17 TP*TP*TP*T)-3');
COMPND  18 CHAIN: D;
COMPND  19 ENGINEERED: YES
SOURCE    MOL_ID: 1;
SOURCE   2 ORGANISM_SCIENTIFIC: HOMO SAPIENS;
SOURCE   3 EXPRESSION_SYSTEM_COMMON: BACULOVIRUS EXPRESSION SYSTEM;
SOURCE   4 EXPRESSION_SYSTEM_CELL: SF9 INSECT CELLS;
SOURCE   5 MOL_ID: 2;
SOURCE   6 SYNTHETIC: YES;
SOURCE   7 MOL_ID: 3;
SOURCE   8 SYNTHETIC: YES
KEYWDS    PROTEIN-DNA COMPLEX, TYPE I TOPOISOMERASE, HUMAN

REMARK   1
REMARK   2
REMARK   2 RESOLUTION. 2.60 ANGSTROMS.
REMARK   3
REMARK   3 REFINEMENT.
REMARK   3   PROGRAM     : X-PLOR 3.1
REMARK   3   AUTHORS     : BRUNGER
…
REMARK 280
REMARK 280 CRYSTALLIZATION CONDITIONS: 27% PEG 400, 145 MM MGCL2, 20
REMARK 280 MM MES PH 6.8, 5 MM TRIS PH 8.0, 30 MM DTT
REMARK 290
...

PDB File: Data

ATOM      1  N   TRP A 203      30.156  -4.908  37.767  1.00 50.81           N
ATOM      2  CA  TRP A 203      30.797  -4.667  36.431  1.00 49.96           C
ATOM      3  C   TRP A 203      30.369  -3.337  35.766  1.00 49.18           C
ATOM      4  O   TRP A 203      29.315  -3.238  35.147  1.00 49.27           O
ATOM      5  CB  TRP A 203      30.518  -5.863  35.513  1.00 46.77           C
ATOM      6  CG  TRP A 203      30.847  -5.651  34.081  1.00 44.60           C
ATOM      7  CD1 TRP A 203      32.028  -5.234  33.553  1.00 49.72           C
ATOM      8  CD2 TRP A 203      29.980  -5.876  32.984  1.00 43.73           C
ATOM      9  NE1 TRP A 203      31.956  -5.191  32.177  1.00 45.45           N
ATOM     10  CE2 TRP A 203      30.704  -5.582  31.805  1.00 45.23           C
ATOM     11  CE3 TRP A 203      28.657  -6.305  32.877  1.00 46.48           C
ATOM     12  CZ2 TRP A 203      30.149  -5.705  30.539  1.00 46.06           C
ATOM     13  CZ3 TRP A 203      28.101  -6.431  31.622  1.00 43.08           C
ATOM     14  CH2 TRP A 203      28.849  -6.131  30.463  1.00 45.77           C
…

Name

Atom Number

Atom Name

Residue Name

Chain ID

Residue Number

X Y Z

Occupancy

Temperature Factor

Issues: Justification, Nomenclature

ATOM 1 N TRP A 203 30.156 -4.908 37.767 1.00 50.81
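The justification issue noted above comes from ATOM records being fixed-width: fields are read by column position, not by splitting on whitespace. The following is a minimal Python sketch, assuming the standard PDB column layout. The function name `parse_atom_record` and the dictionary keys are illustrative choices and are not part of any NCBI tool.

```python
def parse_atom_record(line: str) -> dict:
    """Parse one fixed-width PDB ATOM record into named fields."""
    if line[0:6].strip() != "ATOM":
        raise ValueError("not an ATOM record")
    return {
        "atom_number": int(line[6:11]),        # columns 7-11
        "atom_name": line[12:16].strip(),      # columns 13-16
        "residue_name": line[17:20].strip(),   # columns 18-20
        "chain_id": line[21],                  # column 22
        "residue_number": int(line[22:26]),    # columns 23-26
        "x": float(line[30:38]),               # columns 31-38
        "y": float(line[38:46]),               # columns 39-46
        "z": float(line[46:54]),               # columns 47-54
        "occupancy": float(line[54:60]),       # columns 55-60
        "temperature_factor": float(line[60:66]),  # columns 61-66
    }


# First atom of 1EJ9 chain A, laid out in the assumed standard columns.
record = "ATOM      1  N   TRP A 203      30.156  -4.908  37.767  1.00 50.81"
atom = parse_atom_record(record)
print(atom["residue_name"], atom["chain_id"], atom["residue_number"], atom["x"])
```

Slicing by column keeps the parse correct even when adjacent fields touch, which a whitespace split would not handle.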

From PDB to Entrez

Structure

3D Domains

Protein

Domains

From Coordinates to Models

1EJ9: Human topoisomerase I

Building the Structure Summary

Taxonomy

PubMed

Protein

3D Domains

Domains

Nucleotide

Indexing into MMDB

Structure

• Import only experimentally determined structures
• Convert to ASN.1
• Verify sequences
• Add chemical bonds

inter-residue-bonds {
  {
    atom-id-1 { molecule-id 1 , residue-id 1 , atom-id 1 } ,
    atom-id-2 { molecule-id 1 , residue-id 2 , atom-id 9 } } ,

• Add secondary structure

id 1 ,
name "helix 1" ,
type helix ,
location subgraph residues interval {
  { molecule-id 1 , from 49 , to 61 } } } ,

• Create “backbone” model (Cα, P only)
• Create single-conformer model

Structure Indexing

Entrez
• MMDB-ID
• MMDB entry date
• EC number
• Organism

PDB
• Accession
• Release date
• Class
• Source
• Description
• Comment

Ligands
• PDB code
• PDB name
• PDB description

Literature
• Article title
• Author
• Journal
• Publication date

Experimental
• Method
• Resolution

Counters
• Ligand types
• Modified amino acids
• Modified nucleotides
• Modified ribonucleotides
• Protein chains
• DNA chains
• RNA chains

topoisomerase AND 2[dnachaincount] AND human[organism]
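A fielded query like the one above can also be assembled programmatically. The sketch below only builds the query string and a search URL and makes no network request. It assumes the public E-utilities `esearch.fcgi` endpoint and the database name `structure`; the field names (`dnachaincount`, `organism`) are taken from the slide and may not match the current index. The helper names are illustrative.

```python
from urllib.parse import urlencode

# Assumed endpoint of the NCBI E-utilities search service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def fielded(value, field):
    """Attach an Entrez index field to a value, e.g. 2[dnachaincount]."""
    return f"{value}[{field}]"


def entrez_term(*clauses):
    """Join clauses with the Boolean AND operator."""
    return " AND ".join(clauses)


def esearch_url(db, term):
    """Build (but do not fetch) a search URL for the given database."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term})


term = entrez_term(
    "topoisomerase",
    fielded(2, "dnachaincount"),
    fielded("human", "organism"),
)
url = esearch_url("structure", term)
print(term)
print(url)
```

Keeping the term builder separate from the URL builder means the same term can be pasted directly into the Entrez search box.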

Creating Sequence Records

One record per chain:

1EJ9A: Protein
1EJ9C: Nucleotide
1EJ9D: Nucleotide



Annotating Secondary Structure

1EJ9: Human topoisomerase I

α-Helices

β-strands

coils/loops

Creating 3D Domains

3D Domain 0: 1EJ9A0 = entire polypeptide

Creating 3D Domains

3D Domains

1EJ9A1

1EJ9A3

1EJ9A2

1EJ9A4

1EJ9A5

< 3 Secondary Structure Elements



3D Domain Indexing

Entrez
• SDI
• MMDB-ID
• Accession
• MMDB entry date
• Organism
• Domain number
• Cumulative number

PDB
• Accession
• Release date
• Class
• Source
• Description
• Comment

Literature
• Article title
• Author
• Publication date

Counters
• Modified amino acids
• α-Helices
• β-Strands
• Residues
• Molecular weight

REMEMBER: 3D Domain 0 is the entire polypeptide chain!

4[helixcount] AND 0[strandcount] AND 0[domainno] AND viruses[organism]

Find all viral four helix bundles
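The 3D domain identifiers shown earlier (1EJ9A0 through 1EJ9A5) appear to combine a four-character PDB accession, a one-character chain ID, and a domain number, with domain 0 standing for the whole chain. The sketch below assumes that layout, which is inferred from the examples on the slides and is not a documented specification. The names `DomainId` and `parse_domain_id` are illustrative.

```python
import re
from typing import NamedTuple


class DomainId(NamedTuple):
    pdb: str
    chain: str
    domain: int

    @property
    def is_whole_chain(self) -> bool:
        # Domain 0 denotes the entire polypeptide chain.
        return self.domain == 0


# Assumed layout: 4-character PDB code, 1-character chain, domain number.
_PATTERN = re.compile(r"^([0-9][A-Za-z0-9]{3})([A-Za-z0-9])(\d+)$")


def parse_domain_id(identifier: str) -> DomainId:
    match = _PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"unrecognized 3D domain identifier: {identifier!r}")
    return DomainId(match.group(1).upper(), match.group(2), int(match.group(3)))


whole = parse_domain_id("1EJ9A0")
third = parse_domain_id("1EJ9A3")
print(whole, whole.is_whole_chain)
print(third, third.is_whole_chain)
```

A check such as `is_whole_chain` is what the `0[domainno]` clause in the query above expresses on the Entrez side.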

Conserved Domains

Weakly conserved serine Active site serine

Sequences Aligned by Function

Linking Sequence to Function

The PSSM: Position Specific Score Matrix

        A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
206 D   0  -2   0   2  -4   2   4  -4  -3  -5  -4   0  -2  -6   1   0  -1  -6  -4  -1
207 G  -2  -1   0  -2  -4  -3  -3   6  -4  -5  -5   0  -2  -3  -2  -2  -1   0  -6  -5
208 V  -1   1  -3  -3  -5  -1  -2   6  -1  -4  -5   1  -5  -6  -4   0  -2  -6  -4  -2
209 I  -3   3  -3  -4  -6   0  -1  -4  -1   2  -4   6  -2  -5  -5  -3   0  -1  -4   0
210 S  -2  -5   0   8  -5  -3  -2  -1  -4  -7  -6  -4  -6  -7  -5   1  -3  -7  -5  -6
211 S   4  -4  -4  -4  -4  -1  -4  -2  -3  -3  -5  -4  -4  -5  -1   4   3  -6  -5  -3
212 C  -4  -7  -6  -7  12  -7  -7  -5  -6  -5  -5  -7  -5   0  -7  -4  -4  -5   0  -4
213 N  -2   0   2  -1  -6   7   0  -2   0  -6  -4   2   0  -2  -5  -1  -3  -3  -4  -3
214 G  -2  -3  -3  -4  -4  -4  -5   7  -4  -7  -7  -5  -4  -4  -6  -3  -5  -6  -6  -6
215 D  -5  -5  -2   9  -7  -4  -1  -5  -5  -7  -7  -4  -7  -7  -5  -4  -4  -8  -7  -7
216 S  -2  -4  -2  -4  -4  -3  -3  -3  -4  -6  -6  -3  -5  -6  -4   7  -2  -6  -5  -5
217 G  -3  -6  -4  -5  -6  -5  -6   8  -6  -8  -7  -5  -6  -7  -6  -4  -5  -6  -7  -7
218 G  -3  -6  -4  -5  -6  -5  -6   8  -6  -7  -7  -5  -6  -7  -6  -2  -4  -6  -7  -7
219 P  -2  -6  -6  -5  -6  -5  -5  -6  -6  -6  -7  -4  -6  -7   9  -4  -4  -7  -7  -6
220 L  -4  -6  -7  -7  -5  -5  -6  -7   0  -1   6  -6   1   0  -6  -6  -5  -5  -4   0
221 N  -1  -6   0  -6  -4  -4  -6  -6  -1   3   0  -5   4  -3  -6  -2  -1  -6  -1   6
222 C   0  -4  -5  -5  10  -2  -5  -5   1  -1  -1  -5   0  -1  -4  -1   0  -5   0   0
223 Q   0   1   4   2  -5   2   0   0   0  -4  -2   1   0   0   0  -1  -1  -3  -3  -4
224 A  -1  -1   1   3  -4  -1   1   4  -3  -4  -3  -1  -2  -2  -3   0  -2  -2  -2  -3

Serine scored differently in these two positions

Active site nucleophile
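To make the position-specific scoring concrete, the sketch below looks up scores in a few rows copied from the PSSM above. The slide does not say which two serine positions it highlights; positions 211 and 216 are chosen here purely for illustration. The function names are hypothetical, and summing per-position scores over an ungapped segment is a simplification of what RPS-BLAST does.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # column order of the PSSM above

# Rows copied from the matrix above, keyed by position.
PSSM_ROWS = {
    211: [4, -4, -4, -4, -4, -1, -4, -2, -3, -3, -5, -4, -4, -5, -1, 4, 3, -6, -5, -3],
    214: [-2, -3, -3, -4, -4, -4, -5, 7, -4, -7, -7, -5, -4, -4, -6, -3, -5, -6, -6, -6],
    215: [-5, -5, -2, 9, -7, -4, -1, -5, -5, -7, -7, -4, -7, -7, -5, -4, -4, -8, -7, -7],
    216: [-2, -4, -2, -4, -4, -3, -3, -3, -4, -6, -6, -3, -5, -6, -4, 7, -2, -6, -5, -5],
    217: [-3, -6, -4, -5, -6, -5, -6, 8, -6, -8, -7, -5, -6, -7, -6, -4, -5, -6, -7, -7],
    218: [-3, -6, -4, -5, -6, -5, -6, 8, -6, -7, -7, -5, -6, -7, -6, -2, -4, -6, -7, -7],
}


def score(position: int, residue: str) -> int:
    """Score of placing `residue` at `position` of the alignment."""
    return PSSM_ROWS[position][AMINO_ACIDS.index(residue)]


def score_segment(start: int, segment: str) -> int:
    """Sum of per-position scores for an ungapped segment."""
    return sum(score(start + i, aa) for i, aa in enumerate(segment))


print(score(211, "S"), score(216, "S"))
print(score_segment(214, "GDSGG"), score_segment(214, "GDAGG"))
```

The same residue earns a different score depending on where it falls, and replacing the serine at 216 with alanine lowers the segment score by the difference between the two entries in that row.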

Entrez Domains (CDD v2.00)

• Pfam (Sanger), e.g. pfam01234: Pfam-A seeds, HMM-based models representing a wide variety of functional domains derived from SWISS-PROT
• SMART (EMBL), e.g. smart00123: HMM-based models originally concentrating on eukaryotic signaling domains, now expanding
• COG (NCBI), e.g. COG0123: BLAST-based alignments derived from complete proteomes of prokaryotes
• CD (NCBI), e.g. cd01234: NCBI-curated domains based on sequence and structural alignments

Single Domains

Protein Families

A database of Position Specific Score Matrices (PSSMs)
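Because each source database uses its own accession prefix, the origin of a CDD record can be read from its accession. A small sketch, assuming only the four prefixes shown on the slide (`pfam`, `smart`, `cd`, `COG`) followed by digits; the function name and the returned labels are illustrative.

```python
# Prefix to source database, as listed on the slide.
CDD_SOURCES = {
    "pfam": "Pfam",
    "smart": "SMART",
    "cd": "CD",
    "COG": "COG",
}


def cdd_source(accession: str) -> str:
    """Return the source database implied by a CDD accession prefix."""
    for prefix, source in CDD_SOURCES.items():
        suffix = accession[len(prefix):]
        if accession.startswith(prefix) and suffix.isdigit():
            return source
    raise ValueError(f"unrecognized CDD accession: {accession!r}")


for acc in ("pfam01234", "smart00123", "cd01234", "COG0123"):
    print(acc, "->", cdd_source(acc))
```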

CD-Search Output

CD

SMART

Pfam

COG

Click on a colored bar to align your sequence to the CD

CD Summary

Alignment view controls

Cn3D launch

PSSM created

Aligned query



Cn3D

Creating Entrez Links

NCBI Taxonomy

Literature from PDB

Sequences

Full Chain

Entrez Structure

Entrez 3D Domains

Links to CDs

CD-Search / RPS-BLAST

1EJ9A

Query: protein sequence
Database: PSSMs

Pre-computed in Entrez Protein

Enter accession, GI, or FASTA sequence into RPS-BLAST

Finding Homologous Structures

• By sequence similarity: BLAST

• By structural similarity: VAST

• By conserved function: CD-Search

Entrez Protein

Entrez Structure

Entrez 3D Domains

Entrez Domains

BLAST: Sequence Neighbors

BLAST Related Structures

Displays a graphical and text alignment between a query sequence and a similar sequence with structure

Accessed from
• BLink
• Any protein BLAST search

?GVKWKYLEHKGPVFAPPYDPLP

GIKWKFLEHKGPVFAPPYEPLP

BLink Neighbors

EAA05377: ENSANGP00000011118 from A. gambiae

Related Structures

Related Structures from BLASTp

Related Structures Cn3D

VAST: Searching by Structure

Why search for similar structures?

• To find homologs that sequence searches cannot: distant protein homologs often conserve structure more strongly than sequence

• To explore protein evolution: similar protein folds can be used to support different functions

• To identify conserved core elements of a protein fold that can be used to model related proteins of unknown structure

VAST: Structure Neighbors

Vector Alignment Search Tool

For each protein chain, locate SSEs (secondary structure elements), and represent them as individual vectors.

[Figure: Human IL-4, with its SSEs numbered 1 to 6]
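The idea of reducing each SSE to a single vector can be sketched as a small data structure. This is an illustration only: it assumes the two endpoints of each element's axis are already known (how VAST fits that axis is not described here), and the coordinates are made up. The class name `SSE` and its methods are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SSE:
    """A secondary structure element reduced to one vector."""
    kind: str      # "helix" or "strand"
    start: tuple   # (x, y, z) of the axis start
    end: tuple     # (x, y, z) of the axis end

    def vector(self):
        return tuple(e - s for s, e in zip(self.start, self.end))

    def length(self):
        return math.sqrt(sum(c * c for c in self.vector()))

    def midpoint(self):
        return tuple((s + e) / 2 for s, e in zip(self.start, self.end))

    def direction(self):
        n = self.length()
        return tuple(c / n for c in self.vector())


# Made-up helix running 15 Å along z.
helix = SSE("helix", (0.0, 0.0, 0.0), (0.0, 0.0, 15.0))
print(helix.length(), helix.midpoint(), helix.direction())
```

Midpoints and directions are the quantities the following steps compare between structures.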

VAST: Calculate the Angle [ij]k

For both the query and target structures:

• Calculate the midpoint of each SSE.
• For each SSE k, align k along z and project the midpoints onto the xy plane.
• Then calculate [ij]k for i ≠ k, j ≠ k.

[ij]k describes vector position about the z axis.

[Figure: midpoints of SSEs 1 to 6 projected onto the xy plane, with the reference SSE along z]
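The angular measure described on this slide can be sketched as follows. The sketch takes SSE k as the z axis, projects the midpoints of SSEs i and j onto the plane perpendicular to it, and returns the signed angle between the two projections. Using the midpoint of k as the origin of the projection is an assumption made here, and the function names are illustrative rather than taken from VAST.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def angle_about_axis(mid_k, axis_k, mid_i, mid_j):
    """Signed angle in degrees between midpoints i and j, seen down SSE k."""
    norm = math.sqrt(dot(axis_k, axis_k))
    z = tuple(c / norm for c in axis_k)  # unit vector along SSE k

    def project(point):
        v = sub(point, mid_k)            # assumed origin: midpoint of k
        height = dot(v, z)
        return tuple(vc - height * zc for vc, zc in zip(v, z))

    proj_i, proj_j = project(mid_i), project(mid_j)
    return math.degrees(math.atan2(dot(z, cross(proj_i, proj_j)),
                                   dot(proj_i, proj_j)))


# Made-up midpoints: i lies along x and j along y, at different heights.
angle = angle_about_axis((0, 0, 0), (0, 0, 2), (1, 0, 5), (0, 2, -3))
print(angle)
```

Heights along z drop out of the result, so the angle depends only on where the midpoints sit around the reference element.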

VAST: Calculate (rik, zik)

For both the query and target structures:

• For each SSE k, set the origin at the midpoint of k.
• Then calculate rik and zik for the endpoints of SSEs i ≠ k.

(rik, zik) describe vector position relative to the xy plane.

[Figure: SSE 3 positioned relative to SSE 1 on x, y, z axes, with r13 and z13 marked]
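The pair (rik, zik) reads as cylindrical coordinates: with the origin at the midpoint of SSE k and the z axis along k, z is the height of a point along the axis and r is its distance from the axis. A minimal sketch under that reading; the function name `cylindrical_rz` and the sample coordinates are illustrative.

```python
import math


def cylindrical_rz(point, origin, axis):
    """Return (r, z) of `point` in a frame centered at `origin` with z along `axis`."""
    norm = math.sqrt(sum(c * c for c in axis))
    unit = tuple(c / norm for c in axis)
    v = tuple(p - o for p, o in zip(point, origin))
    z = sum(vc * uc for vc, uc in zip(v, unit))             # height along the axis
    radial = tuple(vc - z * uc for vc, uc in zip(v, unit))  # part perpendicular to the axis
    r = math.sqrt(sum(c * c for c in radial))
    return r, z


# Made-up endpoint of SSE i, with SSE k centered at the origin along z.
r, z = cylindrical_rz((3.0, 4.0, 5.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0))
print(r, z)
```

Together with the angle about the z axis, these two numbers fix where an endpoint sits relative to the reference element.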

VAST: Create Comparison Graph

[Figure: comparison graph pairing the SSEs of IL-4 (1 to 6) with the SSEs of IL-6 (1 to 5); N and C termini are marked on both structures]

Nodes: r13 <> r12, z13 <> z12

Arcs: 16 <> 15

Paths must follow sequence order

Select the path with the highest “weights”
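The constraint that a path must follow sequence order while maximizing weight can be illustrated with a small dynamic program. This is a deliberately simplified stand-in and not VAST's actual procedure: it scores only node weights (how well query SSE i matches target SSE j) and ignores the arc comparisons between pairs of nodes. The weights are made up and the function name is hypothetical.

```python
def best_ordered_path(weights):
    """Highest-weight set of SSE pairings that preserves sequence order.

    weights[i][j] is the weight of pairing query SSE i with target SSE j.
    """
    n = len(weights)
    m = len(weights[0]) if n else 0
    best = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(best[i - 1][j],
                             best[i][j - 1],
                             best[i - 1][j - 1] + weights[i - 1][j - 1])
    # Trace back to recover which pairs were used.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if best[i][j] == best[i - 1][j]:
            i -= 1
        elif best[i][j] == best[i][j - 1]:
            j -= 1
        else:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
    return best[n][m], pairs[::-1]


total, pairs = best_ordered_path([[5, 1, 0], [0, 4, 2], [3, 0, 6]])
print(total, pairs)

# Crossing pairs (0,1) and (1,0) cannot both be chosen.
crossed_total, crossed_pairs = best_ordered_path([[1, 9], [8, 1]])
print(crossed_total, crossed_pairs)
```

In the second example the two strongest pairings would cross, so only one of them survives, which is the sense in which the search is sensitive to topology.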

VAST: Refinement

Aligned residues are red

Alignment extended to the end of this strand

Cα atoms are added to the aligned SSEs

Alignments are allowed to extend beyond SSE boundaries

All atoms are added to the models, and the detailed backbone and sidechain positions are refined

VAST: Alignment of Sequence

• Aligned blocks represent structural core elements
• Aligned blocks have no internal gaps
• Aligned residues occupy the same position in space
• Aligned residues are shown in CAPITAL letters

Helix 1

Helix 2 Helix 3

Helix 4

VAST: Scoring

p = d × P(s > s0, n) × c(n, P1, P2)

p: The probability that the VAST alignment occurred by chance.

P(s > s0, n): Probability of observing an alignment of n SSEs with a score greater than s0 by chance.

c(n, P1, P2): Search space, the number of possible alignments of n SSEs between vector sets P1 and P2.

d: Number of structures searched (set to 500)
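As a worked example of the scoring expression, the sketch below multiplies the three factors. It reads the juxtaposed terms on the slide as a product; the input values are invented for illustration, and the function and parameter names are hypothetical.

```python
def vast_p_value(chance_probability, search_space, structures_searched=500):
    """p = d * P(s > s0, n) * c(n, P1, P2), with d defaulting to 500."""
    return structures_searched * chance_probability * search_space


# Invented numbers: the same chance probability, two search-space sizes.
p_large_space = vast_p_value(1e-9, 1000)
p_small_space = vast_p_value(1e-9, 50)
print(p_large_space, p_small_space)
```

A smaller search space gives a smaller p for the same underlying score, because fewer alternative alignments had the opportunity to score as well by chance.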

VAST: Summary

• Secondary structure elements are represented as vectors and are aligned based on their relative orientations
• VAST ignores loops and tolerates variation in SSE length
• The initial alignment is wholly ignorant of atomic coordinates
• Pathways through aligned SSEs respect sequence order
• VAST is sensitive to topology

[Figure: three folds with N and C termini marked]

• Alignments are extended and optimized using all-atom models
• Aligned blocks may extend across or into loops or other SSEs

Query by Chain vs 3D Domain

Query by whole chain

Query by domain 5

Not found using whole chain query!

c(n, P1, P2) is smaller for a 3D domain!

VAST: Multiple Alignments Cn3D

nr-PDB Sets

EntrezStructure

Choose criteria for inclusion in a set

Non-redundant set of sequence-similar clusters

VAST reports one representative from each cluster

Submitting a PDB File to VAST

• Pick the correct file format
• Remove all records except ATOM
• This is the best way to convert PDB into MMDB format!
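The step of removing all records except ATOM can be done with a short filter. A minimal sketch, assuming the record name occupies the first six columns of each line; the function name `keep_atom_records` is illustrative and the sample text is abbreviated.

```python
def keep_atom_records(pdb_text: str) -> str:
    """Return only the ATOM records of a PDB file, one per line."""
    kept = [line for line in pdb_text.splitlines()
            if line[:6].rstrip() == "ATOM"]
    return "\n".join(kept) + ("\n" if kept else "")


sample = (
    "HEADER    ISOMERASE/DNA\n"
    "ATOM      1  N   TRP A 203\n"
    "HETATM 9999 MG    MG A 900\n"
    "ATOM      2  CA  TRP A 203\n"
    "END\n"
)
filtered = keep_atom_records(sample)
print(filtered, end="")
```

Comparing the six-column record name exactly keeps HETATM and other record types out of the result.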

Blocks in CD Alignments

Alignment view controls

Aligned query

Cn3D launch

Block 1 Block 2 Block 3

Consensus sequence created

PSSM created

Curating CD Alignments

smart00235

VAST

cd00203

Cn3D

Curated CD Summary

List of annotated features

Customized view of the selected feature in Cn3D

Residues comprising the selected feature

Cn3D

CD-Curation: Effect on model alignment accuracy

[Figure: three panels (VAST; RPS-BLAST before curation; RPS-BLAST after curation), each plotting model alignment RMS (0 to 12) against %id in structure alignment (0 to 100)]

A. Marchler-Bauer

CDART

Only available for single domain records: cd, pfam, smart

Finding a Structural Template

Overall Strategy: For a query protein sequence, construct a block alignment representing conserved core SSEs of the structures most similar in sequence to the query, and then align the query sequence to this template.

1. Construct the block alignment
   A. Curated CD: Locate using CD-Search and use the sequences most similar to the query
   B. VAST: Find the most sequence-similar structure and find its VAST neighbors

2. Align the query to the template: Use Cn3D
   A. PSI-BLAST: Aligns sequence using PSSM of current alignment
   B. BLOCKER: Aligns sequence to an existing block alignment: use where sequence similarity is high
   C. Threader: Aligns sequence to a structure and a block alignment: use where sequence similarity is low

BLOCKER: The Block Aligner

PSSM

• Creates alignments that match the existing block structure
• Matches are scored from a PSSM generated from the block alignment
• An entire block must be matched with no internal gaps
• There are no penalties for gaps between blocks up to a set gap length
• Can perform both local and global alignments
• Generally used after BLAST or PSI-BLAST

The Block Aligner tests the existing block structure
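The block-alignment rules listed above (ungapped blocks, PSSM scoring, free gaps between blocks up to a set length) can be sketched as a dynamic program. This is an illustration of those rules, not the BLOCKER implementation: it places every block in order (a global alignment over blocks), uses toy per-column score dictionaries in place of a real PSSM, and all names and scores are made up.

```python
def block_score(block, sequence, start, mismatch=-1):
    """Score one ungapped placement of a block (a list of per-column score dicts)."""
    return sum(col.get(sequence[start + i], mismatch)
               for i, col in enumerate(block))


def align_blocks(blocks, sequence, max_gap):
    """Place all blocks in order, ungapped, with at most max_gap residues between them."""
    n = len(sequence)
    if not blocks or n == 0:
        return None
    neg = float("-inf")
    best, back = [], []
    for b, block in enumerate(blocks):
        row, brow = [neg] * n, [None] * n
        for s in range(n - len(block) + 1):
            sc = block_score(block, sequence, s)
            if b == 0:
                row[s] = sc
                continue
            prev_len = len(blocks[b - 1])
            # Previous block must end at or before s, within max_gap residues.
            for t in range(max(0, s - prev_len - max_gap), s - prev_len + 1):
                cand = best[b - 1][t] + sc
                if cand > row[s]:
                    row[s], brow[s] = cand, t
        best.append(row)
        back.append(brow)
    end = max(range(n), key=lambda s: best[-1][s])
    if best[-1][end] == neg:
        return None
    starts = [end]
    for b in range(len(blocks) - 1, 0, -1):
        starts.append(back[b][starts[-1]])
    return best[-1][end], starts[::-1]


# Toy two-block model and sequence; scores are invented.
blocks = [[{"G": 5}, {"D": 5}, {"S": 5}], [{"C": 6}, {"Q": 4}]]
sequence = "AAGDSAAACQA"
score, starts = align_blocks(blocks, sequence, max_gap=5)
print(score, starts)
```

With `max_gap=5` both blocks land on their matching residues; tightening the limit forces the second block onto a poorer placement, which is how the gap length setting tests whether the block structure fits the sequence.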

BLAST/PSSM vs BLOCKER

BLAST/PSSM

BLOCKER

Alignment

Import and align GI 1470115

The NCBI Threader

LRLSLEQLQVIAIAN

Input
• Structure
• Block alignment
• Sequence

Attempts to find matches based on chemical contacts, mainly buried hydrophobic interactions

Useful on blocks for which sequence alignment methods fail

Should be iterated with varying block structures

Cn3D

The Future

• More curated CDs: they keep coming…
• Pre-computed Related Structures for all sequences in Entrez Protein
• CD “children”: subfamilies of large CD records based on sequence and structure similarity
• Improved mapping of SNP data onto 3D structures
• Further linking of structural and genomic biology

What comes next…

• Workshop I
  – Working with Structures
• Workshop II
  – Working with Alignments
• All exercises and other resources will remain on the course web pages
• [email protected]
• NCBI Handbook, Ch. 3
