PubChem—Substance, Compound, BioAssay Part 3: Essentials.

Post on 16-Jan-2016

Global Entrez Search Page

All[Filter]

Overall Goal:

An online resource providing comprehensive information on the biological activities of small molecules

Why Are Small Molecules Important?

Constituents of all macromolecules (DNA, RNA, protein, carbohydrates, etc.)

Serve as cofactors and signaling molecules for thousands of proteins

The chemistry part of “biochemistry”

Most drug entities and drug types are small molecules

Most biomarkers used in clinical chemistry are small molecules

PubChem Databases and Tools: http://pubchem.ncbi.nlm.nih.gov/

The Molecular Libraries Roadmap: An Integrated Initiative

[Diagram components: Chemical Diversity; Technology Development; Screening; Instrumentation; Assay Development; Predictive ADMET; Compound Repository (MLSMR); Informatics; Cheminformatics Research Centers; Molecular Libraries Screening Centers Network (MLSCN)]

PubChem = repository for small molecules and bioactivity assay data

Part of the Entrez search and linking system

Links to other NCBI databases, e.g.:
• PubMed, MeSH
• Protein structures (MMDB)
• Protein/nucleotide sequences (GenPept/GenBank)

Contains complete chemical structures

Standardized for uniformity

Small set of computed properties

Structure similarity searching

and more…

Other Depositors to PubChem

PubChem: Bird’s Eye View

[Diagram labels: Depositors; PubChem BioAssay; PubChem Compound; PubChem Substance; Chemical Structure Similarity]

How does data get into PubChem?

PubChem integration in Entrez

[Diagram labels: Protein Sequences; Literature; VAST Structure Similarity; Bioactivity Assay Results; Small Molecule Structures; 3D Structures; Term Frequency Statistics; Chemical Structure Similarity; Activity Profile Similarity]

Primary Database

Depositor Data

• No “global” rules or standards
– Based on organizational needs
– Lots of data overlap
– Often based on individual scientists' preferences
• PubChem accepts data from many organizations
– Previously unseen data representations
– Combinatorial explosion of ways of drawing the same structure

Redundancy, mixtures

Mixture

Derivative Database

Chemical structures may be represented in many different ways

Compound

Substance

Known stereochemistry

Unknown stereo

Unknown E/Z isomers

Compound

Substance

Substances (heterogens) from Protein 3D structures (PDB)

PDB lacks chemical detail
– no bond order information
– no hydrogens

Deposited structure receives
– bond information
– hydrogens
– stereochemistry (where possible)

Most molecules come out right, even complex ones

Vancomycin

Sometimes there is a need to fix problems, e.g. bond orders

Need to fix heme bond orders

Result

Dopamine

PubChem Compound Processing

• Chemical Data Verification
– Atom description (label, element?)
– Functional group clean-up
– Atom valence verification to prevent nonsense
• “Normalize” and “Standardize”
– Valence-bond canonicalization (for tautomer invariance)
– Aromaticity detection and self-consistency
– Stereochemistry detection
– Explicit hydrogen assignment
• Calculation
– 2-D coordinate generation
– Image depictions
– Fingerprints
– IUPAC name
– SMILES, InChI, hash codes
– XLogP, TPSA, HBD, HBA, MW, MF

Chemical Structure “Sanitization”

Chemical structures that fail sanitization:
– Are not part of the aggregated PubChem Compound database
– Are still “searchable” via the PubChem Substance database

Keeps the PubChem Compound database “clean” for cheminformatic analysis

Collapses structures represented in various ways into a uniform, identical representation
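
The split between the two databases can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the SIDs, the structure strings, and the `standardize` lookup below are hypothetical stand-ins, not PubChem's actual pipeline (which performs valence checks, aromaticity detection, tautomer canonicalization, and more). The point is that records failing the step stay Substance-only, while the rest collapse onto one shared key.

```python
from collections import defaultdict

# Hypothetical deposited records: (SID, structure string as drawn by the depositor).
# All values are invented for illustration.
DEPOSITED = [
    (101, "C1=CC=CC=C1"),        # benzene, Kekule form
    (102, "c1ccccc1"),           # benzene, aromatic form
    (103, "OCC"),                # ethanol, written from the oxygen end
    (104, "CCO"),                # ethanol
    (105, "C(C)(C)(C)(C)(C)C"),  # carbon with six bonds: fails valence checks
]

# Toy lookup standing in for a real standardization procedure (assumption).
_TOY_CANONICAL = {
    "C1=CC=CC=C1": "c1ccccc1",
    "c1ccccc1": "c1ccccc1",
    "OCC": "CCO",
    "CCO": "CCO",
}


def standardize(structure):
    """Return a canonical key, or None if the structure fails sanitization."""
    return _TOY_CANONICAL.get(structure)


def build_compounds(records):
    """Group substance IDs by canonical structure; keep failures separately."""
    compounds = defaultdict(list)
    substance_only = []
    for sid, structure in records:
        key = standardize(structure)
        if key is None:
            substance_only.append(sid)   # still a Substance record, no Compound
        else:
            compounds[key].append(sid)   # many substances, one compound
    return dict(compounds), substance_only


compounds, substance_only = build_compounds(DEPOSITED)
print(compounds)
print(substance_only)
```

In this toy data set the two drawings of benzene and the two drawings of ethanol each end up under one key, and the record with the impossible valence is left out of the compound grouping.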

Compound for mixture

Component compounds

Components of a mixture

Substance vs. Compound

Substance summary

Compound summary

Examples of queries

"InChI=1/Ca.3H2O/h;3*1H2/q+2;;;/p-3/fCa.3HO/h;3*1h/qm;3*-1"[InChI]
200[MW]
300:500[MW]
"dopamine"[CompleteSynonym]
"pcsubstance structure"[Filter]
"ca"[Element] AND 300:500[MW] AND "chemidplus"[SourceName]
"lipinski"[Filter] AND "antineoplastic agents"[PharmAction]
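
Fielded queries like these can also be assembled programmatically. The sketch below is a minimal illustration, assuming the NCBI E-utilities `esearch.fcgi` endpoint with its `db`, `term`, and `retmax` parameters; the helper names (`term`, `mw_range`, `esearch_url`) are hypothetical, and the filter and field names are copied from the slide and may have changed since. It only builds the URL and does not contact NCBI.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Assumed E-utilities search endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def term(value, field):
    """Format a quoted text term restricted to one index field."""
    return f'"{value}"[{field}]'


def mw_range(low, high):
    """Format a molecular weight range using the low:high[MW] syntax."""
    return f"{low}:{high}[MW]"


def esearch_url(db, query, retmax=20):
    """Build (but do not fetch) an esearch URL for the given database."""
    return ESEARCH + "?" + urlencode({"db": db, "term": query, "retmax": retmax})


# Substances of MW 300-500 with antineoplastic action that pass the Lipinski filter.
query = " AND ".join([
    term("lipinski", "Filter"),
    term("antineoplastic agents", "PharmAction"),
    mw_range(300, 500),
])
url = esearch_url("pcsubstance", query)
print(query)
print(url)
```

Keeping the query string separate from the URL makes it easy to paste the same text into the Entrez search box for comparison.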

Lipinski rule of 5 -- a molecule is more likely to be orally active if it has:
• not more than 5 hydrogen bond donors (OH and NH groups)
• not more than 10 hydrogen bond acceptors (N or O atoms)
• a molecular weight under 500
• a LogP not greater than 5
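
As a rough illustration, the rule can be written as a small check over four computed properties. The sketch uses the commonly cited thresholds (no more than 5 donors, no more than 10 acceptors, molecular weight under 500, LogP not greater than 5). The `Properties` class and function name are hypothetical, the dopamine values are approximate, and the second molecule is invented.

```python
from dataclasses import dataclass


@dataclass
class Properties:
    hbd: int     # hydrogen bond donor count
    hba: int     # hydrogen bond acceptor count
    mw: float    # molecular weight
    logp: float  # computed LogP (e.g. an XLogP value)


def lipinski_violations(p):
    """Return a list describing each rule-of-5 criterion the molecule breaks."""
    violations = []
    if p.hbd > 5:
        violations.append("more than 5 hydrogen bond donors")
    if p.hba > 10:
        violations.append("more than 10 hydrogen bond acceptors")
    if p.mw >= 500:
        violations.append("molecular weight of 500 or more")
    if p.logp > 5:
        violations.append("LogP greater than 5")
    return violations


# Approximate property values for dopamine, for illustration only.
dopamine = Properties(hbd=3, hba=3, mw=153.18, logp=-1.0)
# Invented values for a large, greasy molecule.
big_molecule = Properties(hbd=7, hba=12, mw=812.0, logp=6.3)

print(lipinski_violations(dopamine))
print(lipinski_violations(big_molecule))
```

Returning the list of broken criteria, rather than a bare true/false, makes it easier to see why a record would fall outside a "lipinski" filter.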

Examples of PubChem Index Fields …

All [ALL] -- All of the following fields are searched; default search field.
Uid [UID] -- The integer represents the SID for the PCSubstance database. By default, an integer without a field alias is recognized as a UID. Same as [SID].
Filter [Filter] -- Limits the records to various indexed filters.
ActiveAid [AA] -- Active BioAssay identifier, integer.
ActiveAidCount [AC, ACNT] -- Number of bioassays in which the record tested active.
AtomChiralCount [ACC, ACCNT] -- Total count of chiral atoms in a given compound.
BioAssayID [BAID, AID] -- BioAssay identifier.
BondChiralCount [BCC, BCCNT] -- Number of chiral bonds.
Comment [CMT] -- Substance or bioassay comment.
CompleteSynonym [CSYN, CSYNO] -- Exactly matching name for a substance/compound.
CompoundID [CID] -- Compound identifier, integer.
DepositDate [DDAT, DEPDAT] -- Deposition timestamp for a substance.
Element [ELMT, EL] -- Chemical element in a substance/compound.
ExactMass [EMAS, EXMASS] -- The calculated mass of an ion or molecule with the most likely isotopic composition for a single random molecule, corresponding to the most intense ion/molecule peak in a mass spectrum. A real number.
HeavyAtomCount [HAC, HACNT] -- Count of atoms in a compound other than hydrogen, integer.
HydrogenBondAcceptorCount [HBAC, HBACNT] -- Hydrogen bond acceptors for a compound, integer.
HydrogenBondDonorCount [HBDC, HBDCNT] -- Hydrogen bond donors for a compound, integer.
InChI [inchi] -- IUPAC International Chemical Identifier.

Examples of PubChem Index Fields, contd.

IUPACName [UPAC, IUPAC] -- Standard IUPAC name for a compound.
MeSHDescription [MHD]
MeSHTerm [MSHT, MESHT] -- Medical Subject Heading term.
MeSHTreeNode [MSHN, MESHTN] -- Medical Subject Heading tree node (tree structures).
MolecularWeight [MW, MWT, MOLWT] -- Mass of a molecule calculated using the average mass of each element weighted for its natural isotopic abundance. E.g., carbon has two natural isotopes, 12C and 13C, with relative abundances of 98.9% and 1.1%, yielding an average mass of 12.011 g/mol. A real number.
MonoisotopicMass [MMAS, MIMASS] -- Mass of a molecule calculated using the mass of the most abundant isotope of each element. E.g., carbon has a monoisotopic mass of 12.000 g/mol. A real number.
PharmAction [PHMA, PHARMA] -- MeSH pharmacological actions heading.
RotatableBondCount [RBC, RBCNT] -- Number of rotatable bonds.
SourceCategory [SRCC, SRCCAT, SRCCATG] -- Depositor categories.
SourceID [SRID, SRCID] -- Depositor's external ID.
SourceName [SRC, SRCNAM, SRCNAME] -- Official depositor name.
SubstanceID [SID] -- Substance ID. Same as [UID].
Synonym [SYNO] -- Synonyms for a substance.
TautomerCount [TC, TCNT, TTMC] -- Possible tautomer count for each given structure, ≤ 200.
TotalFormalCharge [TFC, CHG, CHRG] -- Total formal charge.
TPSA [TPSA] -- Topological Polar Surface Area.
XLogP [XLGP, LOGP]
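
The difference between the MolecularWeight and MonoisotopicMass fields can be worked through for dopamine (C8H11NO2). The sketch below is illustrative only: the atomic masses are rounded standard values, the 13C mass of 13.00335 is an added assumption not given on the slide, and the tiny formula parser handles only simple formulas without parentheses or charges.

```python
import re

# Average atomic masses (weighted by natural isotopic abundance), rounded.
AVERAGE_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}
# Masses of the most abundant isotope of each element, rounded.
MONOISOTOPIC_MASS = {"H": 1.007825, "C": 12.000000, "N": 14.003074, "O": 15.994915}


def parse_formula(formula):
    """Count atoms in a simple formula such as 'C8H11NO2'."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts


def mass(formula, table):
    """Sum per-element masses from the given table over the formula."""
    return sum(table[element] * n for element, n in parse_formula(formula).items())


# Carbon's average mass from its two natural isotopes (98.9% 12C, 1.1% 13C).
carbon_average = 0.989 * 12.000 + 0.011 * 13.00335

dopamine_mw = mass("C8H11NO2", AVERAGE_MASS)
dopamine_mono = mass("C8H11NO2", MONOISOTOPIC_MASS)

print(round(carbon_average, 3))
print(round(dopamine_mw, 3))
print(round(dopamine_mono, 6))
```

With these tables the average-mass sum comes to about 153.181 and the monoisotopic sum to about 153.078979, so a search such as 153:154[MW] would match on the first value, while a mass spectrum peak would sit near the second.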

Preview/Index Tab

History Tab

Substances of MW 300-500Da having antineoplastic properties and obeying Lipinski rule of 5

Links

For the whole set or only selected records

Property Report

SDF format

Medical Subject Headings (MeSH)

MeSH is the National Library of Medicine's controlled vocabulary thesaurus.

Consists of sets of terms naming descriptors in a hierarchical and alphabetic structure, e.g.: “Mental Disorders”, “Pharmacological action”, “Catecholamine hormones”, etc.

Permits searching at various levels of specificity

MeSH thesaurus is used for indexing articles for the MEDLINE/PubMed database

MeSH is continually updated

PubChem assigns MeSH headings to Compound records

Contains bioactivity screens of chemical substances described in PubChem Substance

Provides searchable descriptions of each bioassay, including descriptions of the conditions and readouts specific to a screening protocol

Depositor decides on data definitions and interpretation

Data can be plotted as graphs of statistical histograms

Cross-indexed to other Entrez databases

Primary Database

NCBI FTP >> PubChem Folder

Entrez PubChem: Help and Tabs

Brief Summary

PubChem is part of the NIH Molecular Libraries Roadmap for Medicine Initiative

PubChem consists of 3 databases (Substance, Compound, and BioAssay) and a powerful structure search engine

Substance = samples; Compound = calculated structures, properties

PubChem is integrated into NCBI’s Entrez search and linking system of databases

Records are indexed using a number of terms

Records are linked to each other and to other databases at NCBI

For More Information…

E-mail addresses
• General help: info@ncbi.nlm.nih.gov
• BLAST: blast-help@ncbi.nlm.nih.gov

Telephone
• Voice: +1 (301) 496-2475
• Fax: +1 (301) 480-9241

The (free!) NCBI Newsletter: http://www.ncbi.nih.gov/About/newsletter.html

The NCBI Education Page: http://www.ncbi.nih.gov/Education/index.html

The NCBI Handbook: follow the link from the NCBI Home Page