
Pharm201 Lecture 4 2009 1

Data Representation

Pharm 201/Bioinformatics I

Philip E. Bourne

Department of Pharmacology, UCSD

Prerequisite Reading: Structural Bioinformatics Chapters 10


Take Home Message

• Good data representation of complex data is not a trivial undertaking

• However it is prerequisite to effective use of those data

• History often precludes the above

You should have got a sense of the first item from the assignment


Global Considerations in Defining a Data Representation

• Scope - breadth and depth of data to be included

• Name space

• How to cast that data

• What will the definition be used for?

– Archiving, schema representation, methods ...


Global Considerations – More Specifically

• Simple query, browsing and retrieval

• Consistent data resulting from autonomous validation and verification

• Simple and consistent data exchange

• A unified view of disparate types of data

• Accommodation of new knowledge as it is discovered

• Inclusion of procedures (methods) to specify how a particular item of data is derived or verified


Given These Considerations – Where Does the PDB Format Fit In?

First we need to examine the format


The PDB Format

• A full description is at http://www.wwpdb.org/docs.html

• It was designed around an 80 column punched card!

• It was designed to be human readable

• It is used by every piece of software that deals with structural data


The PDB Format – An Example – The Header


The PDB Format - Records

• Every PDB file may be broken into a number of lines terminated by an end-of-line indicator. Each line in the PDB entry file consists of 80 columns. The last character in each PDB entry should be an end-of-line indicator.

• Each line in the PDB file is self-identifying. The first six columns of every line contain a record name, left-justified and blank-filled. This must be an exact match to one of the stated record names.

• The PDB file may also be viewed as a collection of record types. Each record type consists of one or more lines.

• Each record type is further divided into fields.
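As a rough illustration of the record rules above, the following Python sketch reads the record name from the first six columns of each line and groups the lines by record type. The sample lines and the function names (record_name, group_by_record_type) are made up for illustration; they are not taken from a real entry or from any PDB software.

```python
from collections import defaultdict

# Hypothetical sample lines (not from a real entry); padded to 80 columns below.
SAMPLE = [
    "HEADER    EXAMPLE ENTRY",
    "REMARK   4 FREE TEXT COMMENT",
    "ATOM      1  N   VAL A  11      25.360  30.691  11.795  1.00 17.93           N",
    "ATOM      2  CA  VAL A  11      25.970  31.965  12.332  1.00 17.75           C",
    "END",
]


def record_name(line):
    """Return the record name: columns 1-6, left-justified and blank-filled."""
    return line[:6].rstrip()


def group_by_record_type(lines):
    """Collect the lines of each record type, keyed by record name."""
    groups = defaultdict(list)
    for line in lines:
        padded = line.ljust(80)  # every line of an entry is 80 columns wide
        groups[record_name(padded)].append(padded)
    return dict(groups)


records = group_by_record_type(SAMPLE)
for name, lines in records.items():
    print(f"{name:<6} {len(lines)} line(s)")
```

Splitting on the record name is the easy part. Each record type then needs its own field layout, which lives in the format description rather than in the file.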


The PDB Format – An Example – The Atomic Coordinates


The Description – Atom Records


What is Wrong with this Approach?

• The description and the data are separate

• Parsing is a nightmare – the most complex piece of code we have in our research laboratory probably remains the PDB parser

• There are no relationships between items of data

• Some data just cannot be parsed ….


PDB Format - Important Components of the Data are Lost to All But Humans

REMARK 3 REFINEMENT. BY THE RESTRAINED LEAST-SQUARES PROCEDURE OF
REMARK 3 J. KONNERT AND W. HENDRICKSON (PROGRAM *PROLSQ*). THE R
REMARK 3 VALUE IS 0.168 FOR 2680 REFLECTIONS WITH I GREATER THAN
REMARK 3 2.0*SIGMA(I) REPRESENTING 74 PER CENT OF THE TOTAL
REMARK 3 AVAILABLE DATA IN THE RESOLUTION RANGE 10.0 TO 2.0
REMARK 3 ANGSTROMS.

REMARK 4 THE ERABUTOXIN A (EA) CRYSTAL STRUCTURE IS ISOMORPHOUS WITH
REMARK 4 THE KNOWN STRUCTURE OF ERABUTOXIN B (PROTEIN DATA BANK
REMARK 4 ENTRIES *2EBX*, *3EBX*). EA DIFFERS FROM EB BY A SINGLE
REMARK 4 SUBSTITUTION - EA ASN 26 FOR EB HIS 26. THE EA STARTING
REMARK 4 MODEL WAS OBTAINED FROM A MOLECULAR REPLACEMENT STUDY IN
REMARK 4 WHICH COORDINATES FOR 309 OF THE 475 ATOMS IN THE EB
REMARK 4 STRUCTURE (*2EBX*) WERE USED.
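To see why such remarks are effectively lost to software, here is a minimal Python sketch that tries to recover the R value from the REMARK 3 text above. The helper names and the regular expression are ad hoc assumptions written for this one remark; nothing guarantees that another entry phrases the sentence the same way.

```python
import re

REMARK_3 = [
    "REMARK 3 REFINEMENT. BY THE RESTRAINED LEAST-SQUARES PROCEDURE OF",
    "REMARK 3 J. KONNERT AND W. HENDRICKSON (PROGRAM *PROLSQ*). THE R",
    "REMARK 3 VALUE IS 0.168 FOR 2680 REFLECTIONS WITH I GREATER THAN",
    "REMARK 3 2.0*SIGMA(I) REPRESENTING 74 PER CENT OF THE TOTAL",
    "REMARK 3 AVAILABLE DATA IN THE RESOLUTION RANGE 10.0 TO 2.0",
    "REMARK 3 ANGSTROMS.",
]


def remark_text(lines, number):
    """Join the free text of one REMARK number into a single string."""
    prefix = f"REMARK {number} "
    return " ".join(l[len(prefix):].strip() for l in lines if l.startswith(prefix))


def guess_r_value(text):
    """Ad hoc pattern that happens to fit this remark's wording."""
    match = re.search(r"\bR VALUE IS (\d+\.\d+)", text)
    return float(match.group(1)) if match else None


joined = remark_text(REMARK_3, 3)
print(guess_r_value(joined))       # works only because the lines were joined first
print(guess_r_value(REMARK_3[2]))  # fails: the "R" sits at the end of the previous line
```

The value is recoverable here, but only with a pattern tailored to one author's sentence and only after undoing the 80-column line wrapping.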


Enter mmCIF

• Prerequisite reading: http://www.sdsc.edu/pb/papers/methenz97.pdf

• Complete information: http://mmcif.pdb.org


The macromolecular Crystallographic Information File (mmCIF) – An Approach to Addressing Problems with the PDB Format

• Has the support of a major scientific society

• Is the backbone of the current PDB

• Provides a rich description of very complex data

• Predates any use of ontologies, Web developments, CORBA, XML etc.

• Still has some problems


mmCIF - Initial Motivator Circa Late 1980s

The temperature is 30 degrees

A human would know whether that was Centigrade or Fahrenheit with additional context. A computer would have more difficulty!

What would be the point of archiving such data if in 10 years the meaning was lost?


mmCIF – Scope of the Initial Effort

• All PDB data should be captured

• Describe a paper’s material and methods section

• Describe biologically active molecule

• Fully describe secondary structure but not tertiary or quaternary

• Describe details of chemistry (inc. 2D)

• Meaningful 3D views


mmCIF - Topology


mmCIF - STAR Encoding Rules

• Data are defined in data blocks

• A global declaration spans data blocks

• Data exist as name-value pairs

• A data name may appear only once in a data block

• Loop constructs are supported
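The name-value and uniqueness rules can be sketched in a few lines of Python. This is a deliberately reduced reader, not a STAR parser: it assumes one unquoted value per line and ignores loops, quoting and multi-line text. The block contents and the function name parse_name_value_block are invented for illustration.

```python
def parse_name_value_block(text):
    """Parse simple `_name value` lines of one data block into a dict.

    Raises ValueError if a data name appears more than once in the block.
    """
    items = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or line.startswith("data_"):
            continue  # skip blanks, comments and the data block header
        name, _, value = line.partition(" ")
        if not name.startswith("_"):
            raise ValueError(f"not a data name: {name!r}")
        if name in items:
            raise ValueError(f"duplicate data name in block: {name}")
        items[name] = value.strip()
    return items


# Made-up block; the values are placeholders, not real data.
BLOCK = """\
data_example
_entry.id          EXAMPLE
_cell.length_a     58.39
_cell.length_b     86.70
"""

print(parse_name_value_block(BLOCK))
```

Because a name can occur only once per block, repeated values have to go through the loop construct rather than through repeated name-value pairs.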


mmCIF - Extract from a Data File

loop_
_atom_site.group_PDB
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.label_seq_id
_atom_site.label_alt_id
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
_atom_site.occupancy
_atom_site.B_iso_or_equiv
_atom_site.footnote_id
_atom_site.entity_id
_atom_site.entity_seq_num
_atom_site.id
ATOM N N VAL A 11 . 25.360 30.691 11.795 1.00 17.93 . 1 11 1
ATOM C CA VAL A 11 . 25.970 31.965 12.332 1.00 17.75 . 1 11 2
ATOM C C VAL A 11 . 25.569 32.010 13.881 1.00 17.83 . 1 11 3
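A minimal sketch of how the loop in this extract can be read: collect the data names, then cut the whitespace-separated values into rows of that length. The function parse_loop is an illustrative helper, not part of any mmCIF library, and it assumes unquoted values with no embedded spaces, which holds for this extract but not for mmCIF in general.

```python
LOOP_TEXT = """\
loop_
_atom_site.group_PDB
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.label_seq_id
_atom_site.label_alt_id
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
_atom_site.occupancy
_atom_site.B_iso_or_equiv
_atom_site.footnote_id
_atom_site.entity_id
_atom_site.entity_seq_num
_atom_site.id
ATOM N N VAL A 11 . 25.360 30.691 11.795 1.00 17.93 . 1 11 1
ATOM C CA VAL A 11 . 25.970 31.965 12.332 1.00 17.75 . 1 11 2
ATOM C C VAL A 11 . 25.569 32.010 13.881 1.00 17.83 . 1 11 3
"""


def parse_loop(text):
    """Return one dict per row, keyed by the data names declared in the loop."""
    lines = [l.strip() for l in text.splitlines() if l.strip()]
    if not lines or lines[0] != "loop_":
        raise ValueError("expected the text to start with loop_")
    names, values = [], []
    for line in lines[1:]:
        if line.startswith("_") and not values:
            names.append(line)           # still in the header part of the loop
        else:
            values.extend(line.split())  # simple tokens only, no quoting
    if not names or len(values) % len(names) != 0:
        raise ValueError("value count is not a multiple of the name count")
    n = len(names)
    return [dict(zip(names, values[i:i + n])) for i in range(0, len(values), n)]


rows = parse_loop(LOOP_TEXT)
for row in rows:
    print(row["_atom_site.id"], row["_atom_site.label_atom_id"],
          float(row["_atom_site.Cartn_x"]))
```

Unlike the fixed-column PDB records, nothing here depends on column positions: the meaning of each value comes from the data name declared at the top of the loop.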


mmCIF - Extract from the Dictionary

save__atom_site.Cartn_x
_item_description.description
; The x atom site coordinate in angstroms specified according to a set of orthogonal Cartesian axes related to the cell axes as specified by the description given in _atom_sites.Cartn_transform_axes.
;
_item.name '_atom_site.Cartn_x'
_item.category_id atom_site
_item.mandatory_code no
_item_aliases.alias_name '_atom_site_Cartn_x'
_item_aliases.dictionary cifdic.c94
_item_aliases.version 2.0
loop_
_item_dependent.dependent_name
'_atom_site.Cartn_y'
'_atom_site.Cartn_z'
_item_related.related_name '_atom_site.Cartn_x_esd'
_item_related.function_code associated_esd
_item_sub_category.id cartesian_coordinate
_item_type.code float
_item_type_conditions.code esd
_item_units.code angstroms
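The point of the dictionary is that software can look up how to interpret a raw value instead of hard-coding it. The sketch below copies a few fields of the definition above into a plain Python dict by hand and uses them to cast a value and report its units. The dict layout, the CASTS table and the function interpret are assumptions made for illustration, not how any real dictionary-driven tool is organised.

```python
# Hand-copied subset of the _atom_site.Cartn_x definition shown above.
DICTIONARY = {
    "_atom_site.Cartn_x": {
        "category_id": "atom_site",
        "mandatory_code": "no",
        "type_code": "float",
        "units_code": "angstroms",
    },
}

# Hypothetical mapping from dictionary type codes to Python casts.
CASTS = {"float": float, "int": int, "text": str}


def interpret(name, raw):
    """Return (typed value, units) for a raw string, driven by the dictionary."""
    definition = DICTIONARY.get(name)
    if definition is None:
        raise KeyError(f"{name} is not defined in the dictionary")
    units = definition.get("units_code")
    if raw in (".", "?"):  # CIF placeholders for inapplicable / unknown
        return None, units
    return CASTS[definition["type_code"]](raw), units


value, units = interpret("_atom_site.Cartn_x", "25.360")
print(value, units)
```

This is the answer to the "30 degrees" problem: the units travel with the definition of the data item, so they remain recoverable long after the file was written.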


The DDL category item_description holds a description for each data item. The key item for this category is _item_description.name, which is defined in the parent category item. The text of the item description is held by the item _item_description.description.

A single description may be provided for each data item.

The DDL for the item_description category is given in the following section.

save_ITEM_DESCRIPTION
_category.description
;
This category holds the descriptions of each data item.
;
_category.id item_description
_category.mandatory_code yes
loop_
_category_key.name '_item_description.name'
'_item_description.description'
loop_
_category_group.id 'ddl_group'
'item_group'
save_

save__item_description.description
_item_description.description
;
Text description of the defined data item.
;
_item.name '_item_description.description'
_item.category_id item_description
_item.mandatory_code yes
_item_type.code text
save_

mmCIF Dictionary Definition Language


mmCIF – Topology Revisited


mmCIF - The Category Group Organization of any Macromolecular Structure

STRUCT_BIOL
STRUCT_BIOL_GEN
STRUCT_ASYM
ENTITY
ENTITY_POLY
ENTITY_POLY_SEQ
CHEM_COMP
ATOM_SITE
STRUCT_CONF
STRUCT_CONN
STRUCT_SITE_GEN
STRUCT_REF


mmCIF - Entity - Unique Chemical Component


Entity Information


Shows ATOM and UniProt Sequences


mmCIF - Defining Secondary Structure


mmCIF - Other Interactions


mmCIF – Defining the Biological Assembly


_pdbx_struct_assembly.id 1
_pdbx_struct_assembly.details author_and_software_defined_assembly
_pdbx_struct_assembly.method_details PISA
_pdbx_struct_assembly.oligomeric_details dimeric
_pdbx_struct_assembly.oligomeric_count 2
#
_pdbx_struct_assembly_gen.assembly_id 1
_pdbx_struct_assembly_gen.oper_expression 1,2
_pdbx_struct_assembly_gen.asym_id_list A,B,C,D,E,F,G,H

loop_
_pdbx_struct_oper_list.id
_pdbx_struct_oper_list.type
_pdbx_struct_oper_list.name
_pdbx_struct_oper_list.symmetry_operation
_pdbx_struct_oper_list.matrix[1][1]
_pdbx_struct_oper_list.matrix[1][2]
_pdbx_struct_oper_list.matrix[1][3]
_pdbx_struct_oper_list.vector[1]
_pdbx_struct_oper_list.matrix[2][1]
_pdbx_struct_oper_list.matrix[2][2]
_pdbx_struct_oper_list.matrix[2][3]
_pdbx_struct_oper_list.vector[2]
_pdbx_struct_oper_list.matrix[3][1]
_pdbx_struct_oper_list.matrix[3][2]
_pdbx_struct_oper_list.matrix[3][3]
_pdbx_struct_oper_list.vector[3]
1 'identity operation' 1_555 x,y,z 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
2 'crystal symmetry operation' 4_565 x,-y+1,-z 1.0 0.0 0.0 0.0 0.0 -1.0 0.0 106.344 0.0 0.0 -1.0 0.0

Asymmetric Unit (monomer)

Biological Assembly (dimer)

The pdbx_struct_assembly and pdbx_struct_assembly_gen categories describe the components and symmetry operators required to generate the biological assembly.

The pdbx_struct_oper_list category describes the symmetry operations. Each symmetry operation consists of a 3x3 rotation matrix and a translation vector.

http://www.pdb.org/pdb/static.do?p=education_discussion/Looking-at-Structures/biounit_tutorial.html

Example: PDB 3C70
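A hedged sketch of what "generating the biological assembly" means computationally: each operator listed in the oper_expression is applied to the coordinates as x' = R·x + t. The two operators below are copied from the loop above; the test coordinate is an arbitrary illustrative point (not an atom of 3C70), and the helper names are assumptions. Only a plain comma-separated expression such as 1,2 is handled.

```python
# Rotation matrices and translation vectors copied from the operator list above.
OPERATORS = {
    "1": {"matrix": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
          "vector": [0.0, 0.0, 0.0]},
    "2": {"matrix": [[1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, -1.0]],
          "vector": [0.0, 106.344, 0.0]},
}


def apply_operator(op, xyz):
    """Return R * xyz + t for one symmetry operation."""
    m, v = op["matrix"], op["vector"]
    return tuple(sum(m[i][j] * xyz[j] for j in range(3)) + v[i] for i in range(3))


def build_assembly(coords, oper_expression):
    """Apply each operator named in a simple 'a,b,...' expression to all coords."""
    return {op_id: [apply_operator(OPERATORS[op_id], p) for p in coords]
            for op_id in oper_expression.split(",")}


point = (25.360, 30.691, 11.795)  # arbitrary illustrative coordinate
assembly = build_assembly([point], "1,2")
for op_id, copies in assembly.items():
    print(op_id, copies[0])
```

Operator 1 reproduces the deposited (asymmetric unit) coordinates and operator 2 produces the second copy, which together give the dimer.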


mmCIF – Defining non-standard Amino Acids


mmCIF - Problems

• Two sets of chain identifier and residue numbering schemes (cif and author defined)

– Example: _atom_site.label_asym_id, _atom_site.auth_asym_id

– PDB files use the author defined scheme

• Poor data typing

• Redundant or unnecessary data (i.e. amino acid info)

• Use of free text instead of controlled vocabulary

– IMAGE PLATE (RAXIS V); IMAGE PLATE (RAXIS-V); IMAGE PLATE RAXIS V; IMAGE PLATE RAXIS-V; RAXISIV

• Little software support due to complexity of mmCIF format

• Parsing issues
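The free-text problem can be made concrete with a small Python sketch. It takes the detector strings from the slide (split into five values here, which is an assumption about where one value ends and the next begins) and groups them under a crude normalised key. The normalisation rule is invented for illustration and is not a PDB curation procedure.

```python
import re
from collections import defaultdict

# Detector descriptions as they appear on the slide, assumed to be five values.
FREE_TEXT = [
    "IMAGE PLATE (RAXIS V)",
    "IMAGE PLATE (RAXIS-V)",
    "IMAGE PLATE RAXIS V",
    "IMAGE PLATE RAXIS-V",
    "RAXISIV",
]


def normalize(value):
    """Uppercase the text and drop everything except letters and digits."""
    return re.sub(r"[^A-Z0-9]", "", value.upper())


clusters = defaultdict(list)
for text in FREE_TEXT:
    clusters[normalize(text)].append(text)

for key, variants in clusters.items():
    print(key, variants)
```

Four spellings collapse to one key, but the fifth stays separate, and a program cannot tell whether that is a different instrument or just another spelling. A controlled vocabulary avoids having to guess.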


Examples of File Parsing Issues

• Non-standard quoting rules for strings

• Items that require sub-parsing of expressions

loop_
_pdbx_struct_assembly_gen.assembly_id
_pdbx_struct_assembly_gen.oper_expression
_pdbx_struct_assembly_gen.asym_id_list
1 '(1-60)(61-88)' A,B,C
4 '(1,2,6,10,23,24)(61-88)' A,B,C
6 '(1,10,23)(61,62,69-88)' A,B,C
PAU '(P)(61-88)' A,B,C

TURN_P TURN_P1 'A'"' VAL A 31 ? ILE A 34 ? VAL A 31 ILE A 34 ?
;TYPE I'
;
?
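The oper_expression values above show what sub-parsing involves: after the mmCIF parser has returned the string, a second small parser is needed to expand it. The Python sketch below assumes the usual reading of this notation, namely that each parenthesised group is a list of operator ids with ranges, and that several groups are combined pairwise. The function names are hypothetical.

```python
import re
from itertools import product


def expand_group(group):
    """Expand '1,10,69-88' into a list of operator ids (kept as strings)."""
    ids = []
    for part in group.split(","):
        part = part.strip()
        span = re.fullmatch(r"(\d+)-(\d+)", part)
        if span:
            lo, hi = int(span.group(1)), int(span.group(2))
            ids.extend(str(i) for i in range(lo, hi + 1))
        else:
            ids.append(part)  # single id, possibly non-numeric such as 'P'
    return ids


def parse_oper_expression(expr):
    """Split '(a)(b)' into groups; an expression without parentheses is one group."""
    groups = re.findall(r"\(([^()]*)\)", expr)
    if not groups:
        groups = [expr]
    return [expand_group(g) for g in groups]


def operator_combinations(expr):
    """Every combination of one operator id taken from each group."""
    return list(product(*parse_oper_expression(expr)))


for expr in ["1,2", "(1-60)(61-88)", "(1,10,23)(61,62,69-88)", "(P)(61-88)"]:
    print(expr, len(operator_combinations(expr)))
```

None of this structure is visible to the dictionary, which only knows that oper_expression is a string. That is the sense in which the item requires sub-parsing.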


Summary

• mmCIF has provided the PDB with a robust data representation which serves as the conceptual and physical schema upon which the current RCSB, PDBe and PDBj are built

• This work predated XML and XML-schema but embodies the important concepts inherent in these descriptions

• mmCIF was later converted exactly into XML; the XML form is now used more than mmCIF, but much less than the old PDB format


Take Home Message

• Good data representation of complex data is not a trivial undertaking

• However it is prerequisite to effective use of those data

