
1 Chemical Structure Representation and Search Systems

Lecture 2. Oct 30, 2003

John Barnard

Barnard Chemical Information Ltd
Chemical Informatics Software & Consultancy Services

Sheffield, UK

2 Lecture 2: Topics to be Covered

Problems for chemical structure representation
• aromaticity
• tautomerism
• multi-centre bonds
• stereochemistry
• organometallics and inorganics
• macromolecules and polymers
• incompletely-defined substances
o Markush structures

3 Structure diagrams and topological graphs

[Figure: example structure diagram; atom labels OH, CH2, CH, NH2, OH, O]

4

Structure diagrams and topological graphs

useful analogy, but not a perfect one
• identical graphs should mean identical molecules
• different graphs should mean different molecules

realities of chemical structures cause problems

5 Aromaticity

electronic property of certain ring systems, giving enhanced chemical stability

bonds in aromatic rings have properties that are distinct from single and double bonds

generally accepted definition is Hückel rule
• 4n+2 pi-electrons (n is a small integer)

there are borderline cases

aromaticity causes problems for computer representation
• different systems deal with it in different ways
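The 4n+2 count can be checked with a few lines of Python. This is only a minimal sketch of the arithmetic: the function name `satisfies_huckel` is made up for illustration, the pi-electron count has to be supplied by the caller, and planarity and conjugation are not tested at all.

```python
def satisfies_huckel(pi_electrons: int) -> bool:
    """Return True if the count has the form 4n + 2 for some integer n >= 0."""
    return pi_electrons >= 2 and (pi_electrons - 2) % 4 == 0


# Pi-electron counts are supplied by hand here; a real system would have to
# derive them from the structure, which is where the borderline cases arise.
for name, count in [("benzene", 6), ("cyclooctatetraene", 8), ("10-electron perimeter", 10)]:
    print(name, count, satisfies_huckel(count))
```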

6 Aromaticity problems

using single and double bonds can give different topological graphs for the same compound

one solution is to use an aromatic bond type

[Figure: bromine-substituted benzene ring diagrams; only the Br atom labels survive]

7 Alternating bonds and aromaticity

Chemical Abstracts Registry System uses a “normalised” bond type for all rings with alternating single and double bonds

• this includes some systems that are not aromatic (8 ≠ 4n+2)

• and omits some that are

[Figure: example ring structures; only an S atom label survives]
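The effect of an "alternating bonds" criterion can be illustrated with a small sketch. This is not the Chemical Abstracts algorithm, only an assumed reading of the rule as stated: a ring qualifies if single (1) and double (2) bonds alternate all the way round. The function name and the hand-written bond-order lists are hypothetical.

```python
def has_alternating_bonds(ring_bond_orders):
    """True if single (1) and double (2) bonds alternate all the way round the ring."""
    n = len(ring_bond_orders)
    if n == 0 or n % 2 != 0:
        return False  # an odd-membered ring cannot alternate all the way round
    return all(
        {ring_bond_orders[i], ring_bond_orders[(i + 1) % n]} == {1, 2}
        for i in range(n)
    )


# Bond orders listed in ring order for one Kekule form of each ring.
benzene = [2, 1, 2, 1, 2, 1]
cyclooctatetraene = [2, 1, 2, 1, 2, 1, 2, 1]
thiophene = [1, 2, 1, 2, 1]  # S-C, C=C, C-C, C=C, C-S

for name, ring in [("benzene", benzene),
                   ("cyclooctatetraene", cyclooctatetraene),
                   ("thiophene", thiophene)]:
    print(name, has_alternating_bonds(ring))
```

Under this reading cyclooctatetraene qualifies even though 8 is not of the form 4n+2, while thiophene does not qualify because the two bonds to sulfur are both single.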

8 Representing aromaticity

some systems represent aromaticity as an atom property

• SMILES allows use of lower-case atomic symbols for aromatic atoms (adjacent aromatic atoms are assumed to be joined by aromatic bonds)

problem is that aromaticity is really a ring property

thiophene: s1cccc1 or S1C=CC=C1

1,2-dibromobenzene: Brc1c(Br)cccc1 or BrC1=C(Br)C=CC=C1

9 Aromaticity: problem areas

Aromaticity is sometimes a matter of degree or opinion

Aromatic envelope rings
Outer ring has 10 = 4n+2 pi electrons
fusion bond is not aromatic

Exocyclic bonds
right ring has 6 pi electrons (2 from lone pair, 2 from bond in ring, 2 from bonds in left ring and 0 from exocyclic bond to O)

[Figure: two fused-ring structures; O atom labels and lone-pair dots survive]

10 Tautomerism

dynamic equilibrium between positional isomers (labile H)

are they different compounds?
• answer depends on what you want to do with them

can use normalised bonds to represent them by a single graph
• gets mixed up with ring alternating bonds
• some tautomers may be aromatic, when others are not

[Figure: tautomeric forms of a ring compound; atom labels NH, O, N, OH survive]

11 Tautomerism

tautomerism is a matter of degree

tautomers can be defined in different ways

HQ–X=R ⇌ Q=X–RH
only certain elements can be Q, X or R
o keto-enol tautomers are not recognised by Chemical Abstracts
o mono-unsaturated carbon chains are not distinguished by Daylight

[Figure: keto and enol forms; atom labels OH and O survive]

12 Structure conventions

sometimes called “business rules”
• some chemical groups can be shown in different but equally valid ways
• conventions are needed to determine which is preferred
• software may be needed to convert to preferred form

[Figure: two depictions of the same group; atom labels N, O, O and N+, O, O survive]
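A conversion to a preferred form might look like the sketch below. It takes the nitro group as the example, which can be written with a pentavalent nitrogen or with separated charges, and assumes a house rule that prefers the charged form. The rule, the constant names and the function name are all hypothetical, and matching on SMILES text is a shortcut: a real system would work on the connection table and would catch other ways of writing the same group.

```python
# Hypothetical house convention: prefer the charge-separated nitro form.
PENTAVALENT_NITRO = "N(=O)=O"
CHARGED_NITRO = "[N+](=O)[O-]"


def to_preferred_form(smiles: str) -> str:
    """Rewrite pentavalent nitro groups as the charge-separated form (text match only)."""
    return smiles.replace(PENTAVALENT_NITRO, CHARGED_NITRO)


print(to_preferred_form("c1ccccc1N(=O)=O"))
print(to_preferred_form("C[N+](=O)[O-]"))  # already in the preferred form
```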

13 Structure conventions

Getting the structure representation “right” can be very important
• automatic property prediction
o wrong tautomeric form can give poor prediction of solubility, acid dissociation constant etc.
• receptor site docking
o molecular modelling programs “dock” small molecules into protein receptor sites, and calculate score based on hydrogen-bond interactions, charges etc.
o wrong ionisation state / tautomer can give misleading results

14 Multi-centre bonds

sometimes bonds involve more than 2 atoms
• graph edges always involve exactly 2

e.g. ferrocene

most systems fudge this sort of structure
• bond to arbitrary carbon
• bonds to all 5 carbons
• bond to dummy atom placed in ring
o which itself has dummy bonds to ring atoms

[Figure: ferrocene; only the Fe label survives]

15 Stereochemistry

different compounds with identical connectivity

same topology, different topography

[Figure: S-tyrosine and R-tyrosine]

16 Stereochemistry

configuration is often unknown
• or partially known (relative stereochemistry)
• or you may have a mixture of stereoisomers
o in which one isomer may occur in enantiomeric excess

many different descriptors used by chemists
• wedge (up) and hatched (down) bonds in structure diagrams
• Cahn, Ingold, Prelog (CIP) designators (R, S, E, Z)
• text-based descriptors (stereoparent, or optical rotation)

17 Stereochemistry: up/down bonds

can be used as additional “colours” for graph edges
• many connection table formats have special codes for up and down bonds
• need to know which end of bond is which

useful for re-generating diagrams for display

can be used to calculate other stereo descriptors

[Figure: two structure diagrams with up/down bonds; atom labels OH, CH2, NH2, O, OH survive]

18 Up/down bond problems

different patterns of up/down bonds can show the same stereoisomer
• different graphs, same molecule

some patterns of up and down bonds actually convey no useful information about configuration

[Figure: structure diagrams with different up/down bond patterns; atom labels OH, CH2, NH2, O, OH, Cl, F, CH3, CH2, CH3 survive]

19 Stereochemistry: CIP designators

R.S. Cahn, C. Ingold, and V. Prelog, Angewandte Chemie Intl. Ed. in English 1966, 5, 385-415

one-letter designator for stereocentres
• based on rules assigning priorities to groups around it
• tetrahedral carbons (R, S)
• double bonds (E, Z)

additional colours for graph nodes or edges
• useful for distinguishing stereoisomers when absolute configuration is known
• less useful for matching parts of structures (substructure search) as priority rules can cause designator to change when remote part of structure is changed

20 Stereochemistry: ordered “stereovertex” lists

define order of neighbours around stereocentre
• there are two sets of equivalent orders, corresponding to the two configurations of a tetrahedral carbon atom

[Figure: first configuration of a tetrahedral centre with neighbours A, B, C, D]
A B C D   A D B C   A C D B
B C A D   B D C A   B A D C
C A B D   C D A B   C B D A
D A C B   D B A C   D C B A

[Figure: second configuration of a tetrahedral centre with neighbours A, B, C, D]
A D C B   A C B D   A B D C
B A C D   B D A C   B C D A
C B A D   C D B A   C A D B
D A B C   D C A B   D B C A

neighbours are listed around a right-handed spiral
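The two sets of twelve orders are the even and the odd permutations of a reference order, since swapping any two neighbours of a tetrahedral centre inverts it. A minimal sketch that regenerates both sets follows; the function names are illustrative only.

```python
from itertools import permutations


def is_even(perm, reference):
    """True if perm is an even permutation of reference (even number of inversions)."""
    idx = [reference.index(x) for x in perm]
    inversions = sum(
        1
        for i in range(len(idx))
        for j in range(i + 1, len(idx))
        if idx[i] > idx[j]
    )
    return inversions % 2 == 0


def equivalent_orders(reference):
    """Split all neighbour orders into (same configuration, mirror configuration)."""
    same, mirror = [], []
    for p in permutations(reference):
        (same if is_even(p, reference) else mirror).append("".join(p))
    return sorted(same), sorted(mirror)


same, mirror = equivalent_orders("ABCD")
print(len(same), same)
print(len(mirror), mirror)
```

Two stereovertex lists for the same centre then describe the same configuration exactly when one is an even permutation of the other.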

21 Stereochemistry: stereovertex lists

Two alternative approaches:

1. Geometric ordering
List neighbours of stereo centre in a predefined order for the geometry (e.g. right-handed spiral)

Advantages:
• ordering is locally-defined (rest of molecule is irrelevant)
• stereocentre need not be a single atom

Disadvantage:
• equivalent orderings need to be defined

22 Stereochemistry: stereovertex lists

2. Parity value
• most commonly used approach in practice
• list neighbours according to an ordering rule
o atom numbers in connection table
o CIP priority rules
• decide which geometry they conform to
o right-handed (clockwise) or left-handed (anti-clockwise) spiral
• record this as parity value on stereocentre
• CIP R and S designators are an example of this
• potential disadvantage:
o ordering rule may be globally defined (rest of molecule is relevant)

23 Stereochemistry: parity values

MDL formats:
• number atoms around stereo centre with 1, 2, 3, and 4 in order of increasing connection table atom number
o “implicit” hydrogen atom is considered to be atom 4
• view stereo centre so that the bond to atom 4 projects behind the plane formed by atoms 1, 2, and 3
• if numbers increase:
o clockwise: parity value is 1
o anti-clockwise: parity value is 2
• parity value stored at node for stereo centre atom
o parity 0 = not stereo
o parity 3 = unknown stereo

[Figure: two stereo centres with neighbours numbered 1 to 4, one labelled Parity 2 and one labelled Parity 1]
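The clockwise test described above can be sketched with a signed volume. The code assumes 3D coordinates in a right-handed axis system and four explicit neighbours already sorted by increasing atom number; the function names and the choice to return 0 for a flat arrangement are illustrative assumptions, not part of any MDL specification.

```python
def signed_volume(p1, p2, p3, p4):
    """Triple product (p1 - p4) . ((p2 - p4) x (p3 - p4))."""
    ax, ay, az = (p1[i] - p4[i] for i in range(3))
    bx, by, bz = (p2[i] - p4[i] for i in range(3))
    cx, cy, cz = (p3[i] - p4[i] for i in range(3))
    return (ax * (by * cz - bz * cy)
            + ay * (bz * cx - bx * cz)
            + az * (bx * cy - by * cx))


def parity_from_coordinates(p1, p2, p3, p4, tolerance=1e-6):
    """Parity 1 if atoms 1, 2, 3 run clockwise with atom 4 pointing away from the viewer."""
    volume = signed_volume(p1, p2, p3, p4)
    if abs(volume) < tolerance:
        return 0  # coplanar: the geometry carries no configuration (assumed convention)
    return 1 if volume < 0 else 2


# Viewer on the +z axis looking down; atom 4 lies behind the plane of atoms 1-3.
top, lower_right, lower_left = (0.0, 1.0, 0.0), (0.866, -0.5, 0.0), (-0.866, -0.5, 0.0)
behind = (0.0, 0.0, -1.0)
print(parity_from_coordinates(top, lower_right, lower_left, behind))  # clockwise
print(parity_from_coordinates(top, lower_left, lower_right, behind))  # anti-clockwise
```

Swapping any two neighbours changes the sign of the volume, which is why the parity value flips between 1 and 2.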

24 Stereochemistry: parity value

Stereochemistry in SMILES

clockwise/anticlockwise approach, like MDL

atoms are numbered according to sequence of atoms in SMILES

view from first atom (instead of toward last atom as in MDL)
• if other three atoms are anticlockwise – use @
• if other three atoms are clockwise – use @@

OC(=O)[C@H](N)CC1=CC=C(O)C=C1
OC(=O)[C@@H](N)CC1=CC=C(O)C=C1

25 Double bond stereochemistry

depiction of double bonds in a structure diagram usually implies either cis or trans configuration

MDL files use a bond stereo code to indicate
• 0: use 2D atom co-ordinates to determine cis/trans
• 3: double bond stereochemistry not specified
(other code values are used for up/down/either single bonds)

[Figure: two isomers about a double bond; substituent labels Cl, I, Br, F survive]

26 Double bond stereo in SMILES

/ and \ used as “directional” single bonds
• only meaningful when used on both atoms of a double bond
• several ways of showing same configuration

[Figure: two isomers about a double bond; substituent labels Cl, I, Br, F survive]

Cl/C(F)=C(Br)/I    Cl\C(F)=C(Br)/I
Cl\C(F)=C(Br)\I    Cl/C(F)=C(Br)\I
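For strings of this particular shape, the configuration can be read off by comparing the two directional symbols: matching symbols put the first and last substituent on opposite sides (trans), differing symbols put them on the same side (cis). The sketch below relies on that and nothing more. It is not a SMILES parser, the function name is made up, and it assumes exactly one directional bond written before the first double-bond atom and one written after the second.

```python
def relative_configuration(smiles: str) -> str:
    """Classify the first and last substituent of X/C(...)=C(...)/Y style strings."""
    marks = [ch for ch in smiles if ch in "/\\"]
    if len(marks) != 2:
        raise ValueError("expected exactly two directional bonds")
    return "trans" if marks[0] == marks[1] else "cis"


for s in ["Cl/C(F)=C(Br)/I", "Cl\\C(F)=C(Br)\\I",
          "Cl\\C(F)=C(Br)/I", "Cl/C(F)=C(Br)\\I"]:
    print(s, relative_configuration(s))
```

The first two strings place Cl and I trans to each other and the last two place them cis, so each configuration has at least two equally valid spellings.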

27 Stereovertex lists for double bonds

neighbours of stereocentre have rectangular geometry

[Figure: two rectangles with corners labelled A, B, C, D, one per configuration]

A B C D   B C D A   C D A B   D A B C

A C B D   B D A C   C B D A   D A C B

neighbours are listed around a right-handed spiral (clockwise)
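Each set of four orders above is the set of cyclic rotations of one starting order, which gives a simple equivalence test. A minimal sketch with illustrative function names:

```python
def rotations(order: str) -> set:
    """All cyclic rotations of a neighbour order such as 'ABCD'."""
    return {order[i:] + order[:i] for i in range(len(order))}


def same_rectangular_configuration(a: str, b: str) -> bool:
    """True if list b is one of the cyclic rotations of list a."""
    return b in rotations(a)


print(sorted(rotations("ABCD")))
print(sorted(rotations("ACBD")))
print(same_rectangular_configuration("ABCD", "CDAB"))
print(same_rectangular_configuration("ABCD", "ACBD"))
```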

28 Other stereochemistry geometries

Many coordination complexes have other stereochemical geometries, e.g. square pyramid, octahedron, trigonal bipyramid

there are special SMILES rules for these

specification of equivalent geometric orderings defines symmetry properties of each geometry

[Figure: Square Pyramid (positions 1 to 5), Octahedron (positions 1 to 6), Trigonal Bipyramid (positions 1 to 5)]

29 Stereochemistry of biphenyls

some stereoisomers occur because of sterically-hindered rotation of a single bond
o stereocentre is C–C bond here

geometric ordering of neighbours of stereocentre can specify configuration

3 1 4 2

[Figure: substituted biphenyl (labels Cl, Br, OH, CH3) with neighbours numbered 1 to 4, and the Anti-rectangle geometry with positions 1 to 4]

30 Allene stereochemistry

anti-rectangle geometry also applies to allene configuration
• stereocentre is C=C=C group

[Figure: allene (labels C, Br, I, F, Cl) and the Anti-rectangle geometry with positions 1 to 4]

31 Stereochemistry: conclusions

Many different systems in use

Interconversions between different representations not always easy
• e.g. wedge bonds → CIP descriptors

Several problems remain
• incomplete/partially-defined stereochemistry
• “knotted” structures, helices etc.

B. Rohde, “Representation and manipulation of stereochemistry”, in J. Gasteiger (Ed.) Handbook of Chemoinformatics, Vol 1, pp. 206-230. Wiley, 2003

32 Other representation complications

Organometallic and co-ordination compounds
• complex stereochemistry
• special bond types may be needed (dative bonds etc.)
• ambiguity over covalent/ionic character of bonds
o “business rules” usually needed

Inorganic compounds
• topological representation often not possible
• composition may not involve integral ratios between elements

33 Macromolecules

in principle can represent all atoms, as for small molecules

some systems use “shortcuts” or “superatoms” for subunits (e.g. amino acids)

[Figure: peptide chains drawn with amino acid shortcuts (Asp, His, Val, Cys, Gly, Ala, Arg, Trp, Tyr, Pro) and terminal OH groups]

34 Macromolecules

Each shortcut is defined with appropriate attachment points

ordinary atoms can be mixed with shortcuts

system can expand shortcuts when needed

[Figure: the Tyr shortcut expanded to its atoms (labels NH, O, O, OH), with attachment points marked *]
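Expansion of shortcuts can be sketched as a table lookup. The example below assumes each shortcut is stored as a SMILES fragment whose left attachment point is the amine nitrogen and whose right attachment point is the carbonyl carbon, so that fragments join by simple concatenation. The table, the names and the omission of stereochemistry are all simplifications for illustration.

```python
# Hypothetical shortcut table: residue code -> SMILES fragment (no stereo).
# Left attachment point is the N atom, right attachment point is the C(=O) carbon.
RESIDUE_SMILES = {
    "Gly": "NCC(=O)",
    "Ala": "NC(C)C(=O)",
    "Tyr": "NC(Cc1ccc(O)cc1)C(=O)",
}


def expand_peptide(residues):
    """Expand a list of three-letter shortcuts into one SMILES string."""
    chain = "".join(RESIDUE_SMILES[code] for code in residues)
    return chain + "O"  # ordinary atom mixed with shortcuts: the terminal OH


print(expand_peptide(["Gly", "Ala"]))
print(expand_peptide(["Tyr", "Gly"]))
```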

35 Polymers

special problems are presented because properties of polymer can be affected by polymerisation conditions
• average number of subunits
• extent of cross-linking
• ratio between different subunits
• random / block sequences of subunits
• etc.

Two main approaches
• monomer representation
• structural repeating unit (SRU) representation

36 Polymers

Monomer-based representation
• show original monomer(s) and describe polymerisation conditions in text notes

SRU-based representation
• show repeating units (as shortcuts), with details of length etc.
• generally more satisfactory for structure search
• complications when composition is incompletely defined

37 Incompletely-defined substances

unknown stereochemistry

unknown attachment position

unknown repetition

[Figure: example structures; labels OH, n, NH2, Cl survive]

38 Markush (“Generic”) structures

• structures with R-groups
• shorthand for describing sets of structures with common features

[Figure: core structure bearing OH, R1 and R2]

R1 = Br, I, Cl
R2 = CH2CH3, CH2CH2CH3, CH2CH2CH2CH3
(* marks the attachment point of each group)
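A Markush structure with two R-groups stands for every combination of their values, which can be enumerated directly. In the sketch below the core SMILES template and its substitution positions are hypothetical, and the R-group values (three halogens, three alkyl chains) are illustrative.

```python
from itertools import product

# Hypothetical core: a phenol with two substitution positions.
CORE = "Oc1ccc({R1})cc1{R2}"
R1_VALUES = ["Br", "I", "Cl"]
R2_VALUES = ["CC", "CCC", "CCCC"]  # ethyl, propyl, butyl


def enumerate_markush(core, r1_values, r2_values):
    """Return one SMILES string per combination of R-group values."""
    return [core.format(R1=r1, R2=r2) for r1, r2 in product(r1_values, r2_values)]


structures = enumerate_markush(CORE, R1_VALUES, R2_VALUES)
print(len(structures))
for smiles in structures:
    print(smiles)
```

Three values for each of two R-groups give nine specific structures; the count grows multiplicatively with every further R-group, which is why full enumeration quickly becomes impractical for real patent claims.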

39 Markush structures

• also called “generic” structures
• very important in chemical patents
o inventor claims whole class of related compounds
• can be used to describe combinatorial libraries
• can be used as queries in database searches
• will be discussed in more detail in lecture 5 (Nov 13)

40 Conclusions from Lecture 2

analogy between chemical structures and topological graphs is not perfect and many problems arise in situations where it breaks down
• aromaticity and tautomerism
• stereochemistry

additional complications arise in representing some classes of molecule
• inorganic and coordination compounds
• macromolecules and polymers
• incompletely-defined substances

41 Lecture 3: Topics to be Covered

More Graph Theory

Structure Analysis and Processing
• canonicalisation and symmetry perception
• ring perception
• functional group identification
• structure fingerprints and fragments
• structure depiction
• principles of structure searching

