
Bioinformatics - Lecture 05

Bioinformatics: Structural and global properties

Martin Saturka

http://www.bioplexity.org/lectures/

EBI version 0.52

Creative Commons Attribution-Share Alike 2.5 License

Martin Saturka www.Bioplexity.org Bioinformatics - Structure

Structures

Sequence based secondary structures, domains and folding
computational structure predictions
experimental data based explorations

Structure explorations
dynamic programming
- secondary structure prediction
- RNA folds, CM and SCFG
structural biology
- experiments, comparisons
- combinatorial chemistry, docking
feature characteristics
- hydrophobic packing
- global properties


Structure hierarchy

from sequences to 3D structures

primary, secondary, tertiary, quaternary structures

primary: linear sequences of monomers
modifications: crosslinking, cleavage, ligation

secondary: local structural motifs
regular simple structures vs. random coils

tertiary: whole single molecule structures
folds, complex, dependency on environments

quaternary: molecular complexes
enzymatic complexes, cytoskeleton, capsids


Secondary structure

systematics of local structures of proteins and nucleic acids

proteins
Ramachandran plot of dihedral angles
α-helices, β-sheets, coils, others

[Figure: Ramachandran plot regions - β-sheets: up; helices (α, 3_10): down]

RNAs
base pairing: hairpins - stem / loop
pseudoknots, kissing structures

DNAs
relatively rigid double helix
G and C quadruplex structures


Protein local structures

the basic protein building blocks

[Figure: backbone C=O and N−H groups forming hydrogen bonds in an α-helix and in β-sheets]

α-helix (ψ(i) + ϕ(i+1) ≈ −105°)
right-handed, 3.6 residues per turn
approximately every fourth residue points in the same direction

β-sheets
parallel (ϕ, ψ) = (−120°, 115°) and anti-parallel (−140°, 135°) cases
residue directions alternate with respect to the plane


Protein local predictions

mediocre results, methods based on dynamic programming

specifics to consider
start at the N-terminal ends, which are folded first
proneness to helices, beta sheets, structure breakers
residues assumed from multiple sequence alignment
features, e.g. alternations of hydrophobic character

altered HMM algorithm
n-th order Markov model
inner states: helices, sheets, turns, coils
consider the shortest structure lengths

either look back over the shortest lengths, or
use multiple inner states for the (short) structure lengths


Prediction example

[Figure: chained inner states over sequence positions i−4 ... i: H1 H2 H3 ... Hn, E1 E2 E3 ... En, T1 ... T5]

Inner states: H - helices, E - sheets, T - turns

α-helix, min 4 residues
propensities: MALEK

β-sheet, min 2 (5 �) residues
propensities: YFWTVI

turns, 3-5 residues
propensities: GP

sequence: A Q G L A E
OK: H1 H2 H3 H4 Hn Hn
NO: H1 H2 T1 H1 H2 H3
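
The chained inner states can be read as a transition table that makes too-short segments impossible. The following is a minimal sketch of that idea only, not the lecture's implementation: the SEGMENTS table, the function names and the treatment of turns as a chain T1 ... T5 without a looping state are assumptions made for illustration, and no emission or transition probabilities are modelled.

```python
# Hypothetical segment-length table: (minimum, maximum) residues per segment.
# Helix >= 4, sheet >= 2 and turn 3-5 follow the slide; None = no upper bound.
SEGMENTS = {"H": (4, None), "E": (2, None), "T": (3, 5)}


def allowed_transitions(segments=SEGMENTS):
    """Return the set of permitted (state, next_state) pairs."""
    allowed = set()
    for kind, (lo, hi) in segments.items():
        # Numbered states count the residues already spent in the segment.
        numbered = [f"{kind}{k}" for k in range(1, (hi or lo) + 1)]
        allowed.update(zip(numbered, numbered[1:]))
        exits = numbered[lo - 1:]          # states from which leaving is legal
        if hi is None:
            tail = f"{kind}n"              # self-looping state for long segments
            allowed.update({(numbered[-1], tail), (tail, tail)})
            exits = exits + [tail]
        for other in segments:
            if other != kind:
                # A new segment always starts in its first numbered state.
                allowed.update((state, f"{other}1") for state in exits)
    return allowed


def path_is_valid(path, allowed):
    """True if every consecutive pair of states is a permitted transition."""
    return all(step in allowed for step in zip(path, path[1:]))


ALLOWED = allowed_transitions()
print(path_is_valid("H1 H2 H3 H4 Hn Hn".split(), ALLOWED))  # the slide's OK path
print(path_is_valid("H1 H2 T1 H1 H2 H3".split(), ALLOWED))  # the slide's NO path
```

In a full HMM the same table would simply zero out the forbidden transition probabilities before running the Viterbi recursion.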


Hierarchical HMMs

secondary and super-secondary structures

make secondary structure prediction
input (outer) string: primary structure
inner states sequence: secondary structure

run subsequent higher order HMM prediction
input (outer) string: secondary structure
inner states sequence: complex motifs

...AGPGAQGLAE... → ...HTTTHHHHHH...

...HTTTHHHHHH... → helix-turn-helix
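
To make the second stage concrete, the sketch below scans a predicted H/E/T string for a helix-turn-helix motif. Note the substitution: a regular expression stands in for the higher order HMM of the slide, so this only illustrates the layering (the output of stage one is the input of stage two). The pattern and the function name are assumptions.

```python
import re

# Hypothetical motif pattern over the predicted secondary structure string:
# a helix, a turn of 3-5 residues, then a helix of at least 4 residues.
HELIX_TURN_HELIX = re.compile(r"H+T{3,5}H{4,}")


def find_helix_turn_helix(secondary):
    """Return (start, end) spans of helix-turn-helix candidates."""
    return [(m.start(), m.end()) for m in HELIX_TURN_HELIX.finditer(secondary)]


print(find_helix_turn_helix("HTTTHHHHHH"))  # [(0, 10)]
```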


RNA

nucleotides pairing - strong long range interactions

standard and non-standard ribonucleotide pairing

HMMs: without long range interactions, not sufficient
covariance versions of locality based algorithms

covariance Gibbs sampling - pairs of trials
covariance extension of HMMs - pairwise probabilities

[Figure: RNA base pairing between positions i−4 ... i+1 and j−1 ... j+2]


Automata and grammars

Chomsky-Schützenberger hierarchy
regular languages
context-free languages

non-terminal symbols (with the start symbol)
terminal symbols, the outer alphabet
rewrite rules

context-sensitive languages
recursive languages

stochastic models
finite automata / regular grammars → HMMs
pushdown automata / context-free grammars → CMs

nonterminals: s, a1, a2
terminals: A, C, G, U
s → A a1 U
a1 → C a1 G
a1 → C a2 G
a2 → A A U

ACC...CCAAUGG...GGU
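
The rewrite rules above can be applied mechanically to produce such strings. A small sketch follows, with the rules copied from the slide; the helper names derive and stem_loop and the way rule choices are passed in are assumptions made for illustration.

```python
# Rewrite rules from the slide; the nonterminals are the dictionary keys.
RULES = {
    "s": [["A", "a1", "U"]],
    "a1": [["C", "a1", "G"], ["C", "a2", "G"]],
    "a2": [["A", "A", "U"]],
}


def derive(symbol, picks):
    """Expand a symbol into a terminal string.

    `picks` is an iterator of rule indices; one index is consumed each time
    a nonterminal with more than one rule is expanded.
    """
    if symbol not in RULES:
        return symbol                      # terminal symbol, emitted as is
    options = RULES[symbol]
    rule = options[next(picks)] if len(options) > 1 else options[0]
    return "".join(derive(part, picks) for part in rule)


def stem_loop(stem_length):
    """Use a1 -> C a1 G (stem_length - 1) times, then a1 -> C a2 G once."""
    picks = iter([0] * (stem_length - 1) + [1])
    return derive("s", picks)


print(stem_loop(2))  # ACCAAUGGU
```

Because every C is produced together with its matching G, the nesting of the pairs is built into the derivation, which is what a regular grammar (and so an HMM) cannot express.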


Covariance model

SCFG (PCFG): stochastic (probabilistic) context-free grammars

sets of: terminals, nonterminals, probabilistic rules
the rule probabilities of each nonterminal sum to one

rewrite rules play the roles of both the inner transitions
and the symbol outputs of HMMs

[Figure: parse tree of A C C A A U G G U with s at the root, then a1, a1, a2]

60%: a1 → C a1 G
10%: a1 → A a1 A a1 A
30%: a1 → C a2 G
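
As a worked example, the probability of one derivation is the product of the probabilities of the rules it uses. The sketch below uses the three a1 probabilities from the slide; the probability 1.0 for the s and a2 rules is an assumption (each has a single rule in the earlier grammar), and the names are illustrative.

```python
import math

# a1 probabilities are from the slide; s and a2 each have a single rule,
# so probability 1.0 is ASSUMED for them.
PROB = {
    ("s", "A a1 U"): 1.0,
    ("a1", "C a1 G"): 0.6,
    ("a1", "A a1 A a1 A"): 0.1,
    ("a1", "C a2 G"): 0.3,
    ("a2", "A A U"): 1.0,
}


def is_normalized(prob, tol=1e-9):
    """Check that the rule probabilities of every nonterminal sum to one."""
    totals = {}
    for (head, _body), p in prob.items():
        totals[head] = totals.get(head, 0.0) + p
    return all(abs(total - 1.0) <= tol for total in totals.values())


def derivation_probability(rules, prob=PROB):
    """Multiply the probabilities of the rules used in one derivation."""
    return math.prod(prob[rule] for rule in rules)


# Derivation of A C C A A U G G U: s, then a1 twice, then a2.
PARSE = [("s", "A a1 U"), ("a1", "C a1 G"), ("a1", "C a2 G"), ("a2", "A A U")]
print(is_normalized(PROB), derivation_probability(PARSE))
```

Under these assumptions the derivation has probability 0.6 · 0.3 = 0.18, up to floating-point rounding.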


CM algorithms

time complexity of n^3 in the length n of the parsed sequence

weighted CYK algorithm for the most probable production
analogy of the Viterbi algorithm of HMMs

inside / outside algorithm for SCFG adjusting
analogy of the forward / backward algorithm of HMMs

the Inside and weighted CYK algorithms
difference: ’inside’ makes sums, ’CYK’ takes maxima
iterative substring parse generation (for the CYK)

first, find the best parses for short subsequences
for larger subsequences, make the best separation
into the most paired / the best parsed subsequences
possibility to do separations at just single points
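
The iterative scheme above (short subsequences first, larger ones from their best separation) can be sketched in its simplest, grammar-free form as base-pair maximization. This is a Nussinov-style recursion, named here as an assumption since the slides do not name the algorithm; the allowed pairs (including G-U wobble) and the minimum loop length of 3 are also assumptions.

```python
# Assumed pairing rules: Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs in seq (no pseudoknots)."""
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] = best pairing count for the subsequence i..j (0-based).
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):    # short subsequences first
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]         # case: position j stays unpaired
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in PAIRS:
                    # Separate i..j into i..k-1 and the pair (k, j) around k+1..j-1.
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


print(max_base_pairs("ACCAAUGGU"))  # 3
```

The three nested loops over i, j and k give the n^3 time complexity quoted above.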


Inside algorithm

normalizing grammar rules for just binary tree parses
n_a → n_b n_c - for inner state changes
n_a → T_r - for outer symbol outputs

square matrices indexed by sequence positions
each matrix for subsequences from a single non-terminal
just one matrix for a non-grammar / best pairing search

first, filled with zeros - initial parse weight sums
diagonals with probabilities of respective symbol outputs

iterative filling of the matrices outward from the main diagonals
computation for each matrix of a nonterminal symbol
for every subsequence, make its two-piece separations
compute parse weights for the pieces of the subsequence
for generation from any pair of nonterminal symbols
multiply by the probability of the nonterminal pair rule
the end is for the whole sequence from the start symbol


Algorithm structuring

[Figure: three panels - the ’outside’ structuring, the ’inside’ structuring, and base-pairing maximization (no grammar); nodes s, n_a, n_b, n_c with weights α and β over the subsequences x and y]


Inside / Outside

the Inside algorithm: α(i, j, n_a)
probability sums of all parse trees of the subsequence
(positions i to j) generated from the nonterminal n_a

the Outside algorithm: β(i, j, n_a)
probability sums of all parse trees without counting
the probabilities of generating the subsequence
(positions i to j) from the nonterminal n_a

\alpha(i, i, n_a) = \Pr(n_a \to o(i))
\alpha(i, j, n_a) = \sum_{n_b} \sum_{n_c} \sum_{k=i}^{j-1} \alpha(i, k, n_b) \cdot \alpha(k+1, j, n_c) \cdot \Pr(n_a \to n_b\, n_c)
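
The inside recursion translates almost line by line into code. The sketch below is a plain dictionary-based version under stated assumptions: the grammar is already in the binary normal form, rule probabilities are passed as the hypothetical tables emit and binary, and positions are 0-based (the formulas above are 1-based). Setting use_max=True replaces sums by maxima, which gives the weighted CYK score.

```python
def inside(obs, emit, binary, start, use_max=False):
    """Inside weight of the whole sequence `obs` from the nonterminal `start`.

    emit[(n_a, terminal)]    = Pr(n_a -> terminal)
    binary[(n_a, n_b, n_c)]  = Pr(n_a -> n_b n_c)
    With use_max=True the sums become maxima (weighted CYK).
    """
    n = len(obs)
    alpha = {}
    # Diagonal: probabilities of the single symbol outputs.
    for i, symbol in enumerate(obs):
        for (na, terminal), p in emit.items():
            if terminal == symbol:
                alpha[i, i, na] = p
    # Longer subsequences are built from their two-piece separations.
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            for (na, nb, nc), p in binary.items():
                for k in range(i, j):
                    w = alpha.get((i, k, nb), 0.0) * alpha.get((k + 1, j, nc), 0.0) * p
                    old = alpha.get((i, j, na), 0.0)
                    alpha[i, j, na] = max(old, w) if use_max else old + w
    return alpha.get((0, n - 1, start), 0.0)


# Toy grammar (an assumption, not from the lecture): S -> S S (0.3) | a (0.7)
EMIT = {("S", "a"): 0.7}
BINARY = {("S", "S", "S"): 0.3}
print(inside("aaa", EMIT, BINARY, "S"), inside("aaa", EMIT, BINARY, "S", use_max=True))
```

For "aaa" the toy grammar has two parse trees of equal weight, so the inside value is twice the CYK value.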

\beta(1, |o|, n_s) = 1 for the start non-terminal
\beta(1, |o|, n_z) = 0 for a non-start non-terminal
\beta(i, j, n_a) = \sum_{n_b} \sum_{n_c} \sum_{k=1}^{i-1} \alpha(k, i-1, n_b) \cdot \beta(k, j, n_c) \cdot \Pr(n_c \to n_b\, n_a)
    + \sum_{n_b} \sum_{n_c} \sum_{k=j+1}^{|o|} \alpha(j+1, k, n_b) \cdot \beta(i, k, n_c) \cdot \Pr(n_c \to n_a\, n_b)


CM profiling

the covariance model can work without a grammar
just with maximizing base pairing - poor results

having a grammar, we need to adjust the probabilities
analogously to HMM profiling

parameter reestimation: the expected number of uses of a rule
divided by the expected number of uses of all the rules from the same non-terminal

new output probabilities
new \Pr(n_a \to T_r) = c(n_a \to T_r) / c(n_a)

count of n_a used to generate the terminal T_r
c(n_a \to T_r) = \sum_{i:\, o(i) = T_r} \beta(i, i, n_a) \cdot \Pr(n_a \to T_r)

count of n_a used to generate anything
c(n_a) = \sum_{i} \sum_{j} \beta(i, j, n_a) \cdot \alpha(i, j, n_a)


Profiling counts

new \Pr(n_a \to n_b\, n_c) = c(n_a \to n_b\, n_c) / c(n_a)

c(n_a \to n_b\, n_c), the count of n_a used to generate the non-terminal pair n_b n_c, is
\sum_{i=1}^{|o|-1} \sum_{j=i+1}^{|o|} \sum_{k=i}^{j-1} \beta(i, j, n_a) \cdot \Pr(n_a \to n_b\, n_c) \cdot \alpha(i, k, n_b) \cdot \alpha(k+1, j, n_c)

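[Figure: parse tree from the start symbol s over positions 1 ... |o|; n_a spans i ... j with the outside weight β(i, j, n_a) and is rewritten with Pr(n_a → n_b n_c) into n_b over i ... k, with α(i, k, n_b), and n_c over k+1 ... j, with α(k+1, j, n_c)]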


CM obstacles

high time complexity
not suitable for large RNA molecules

pseudoknots
usually low depth subtree separations

[Figure: an RNA pseudoknot, strand drawn from the 5’ end to the 3’ end]


Physical approaches

energy minimization

symbolic base-pairing approaches are popular,
but physics-agnostic methods suffer from ignoring the physics

many dispersed short base pairings are unfavourable
different base pairs have different strengths
minor bases are common in RNA molecules

complex RNA structures
complex molecular modeling and stochastic grammars
alignment based structure prediction

many RNA molecules with known folds


Experimental techniques

structural biology

necessary source of solid molecular structure data

standard techniques
crystallography

inner cores generally correct
reduced possibilities for surfaces and domain flipping

NMR spectroscopy
less accurate than X-ray diffraction
measurements in more natural environments

IR, Raman spectroscopies
simpler, for vibrations of specific parts

EM, AFM
for structures of greater molecular complexes

other methods
many kinds of spectroscopy and microscopy,
ultracentrifugation, chromatography, etc.


Structure refinement

frequent usage
experimental data adjustment
exploring small alterations
short macromolecular dynamics
states of small molecules

molecular mechanics
statics, energy minimizations
standard hill-climbing methods

molecular dynamics
intensive computer simulations
amount of solvent, long range interactions

stochastic dynamics
Langevin dynamics - extra random forces
Monte Carlo - probabilities, not forces
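
The remark "probabilities, not forces" can be illustrated with the Metropolis acceptance rule: a trial move is judged only by its energy change, and no gradient is ever computed. This is a minimal sketch under stated assumptions: a hypothetical one-dimensional harmonic energy in arbitrary units, uniform trial moves and a fixed random seed for reproducibility.

```python
import math
import random


def energy(x):
    """Hypothetical one-dimensional harmonic energy, arbitrary units."""
    return 0.5 * x * x


def acceptance_probability(delta_e, kt):
    """Metropolis rule: downhill always, uphill with probability exp(-dE/kT)."""
    return 1.0 if delta_e <= 0 else math.exp(-delta_e / kt)


def metropolis(x0, steps, kt, max_move=0.5, seed=0):
    """Return the trajectory of a Metropolis Monte Carlo walk started at x0."""
    rng = random.Random(seed)
    x, trajectory = x0, [x0]
    for _ in range(steps):
        trial = x + rng.uniform(-max_move, max_move)
        if rng.random() < acceptance_probability(energy(trial) - energy(x), kt):
            x = trial                      # accept the trial move
        trajectory.append(x)               # a rejected move repeats the old state
    return trajectory


walk = metropolis(x0=5.0, steps=2000, kt=1.0)
print(len(walk), energy(walk[0]))
```

With kt close to zero the walk degenerates into a stochastic hill-climbing minimization; larger kt lets it cross energy barriers.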


Force fields

biomacromolecules: classical forces approximation

quantum potentials for ligands, limited areas

environment approximation
charge shielding, hydrophobic interaction, entropy

empirical potentials
bonds, bond angles, dihedral angles
electrostatic, van der Waals forces
implicit solvation
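
The empirical terms listed above are commonly written as simple analytic functions. The sketch below shows one widespread choice of functional forms (harmonic bonds and angles, a cosine dihedral term, Lennard-Jones and Coulomb); it is not the parametrization of any particular force field, every parameter name is hypothetical, and the Coulomb constant assumes kcal/mol, Ångström and elementary-charge units.

```python
import math

# Approximate Coulomb constant in kcal*Angstrom/(mol*e^2); an assumed unit choice.
COULOMB = 332.06


def bond_energy(r, r0, k):
    """Harmonic bond stretching around the equilibrium length r0."""
    return 0.5 * k * (r - r0) ** 2


def angle_energy(theta, theta0, k):
    """Harmonic bond-angle bending around the equilibrium angle theta0 (radians)."""
    return 0.5 * k * (theta - theta0) ** 2


def dihedral_energy(phi, barrier, periodicity, phase=0.0):
    """Periodic torsion term; phi and phase in radians."""
    return 0.5 * barrier * (1.0 + math.cos(periodicity * phi - phase))


def lennard_jones(r, epsilon, sigma):
    """Van der Waals term: 12-6 Lennard-Jones potential."""
    s6 = (sigma / r) ** 6
    return 4.0 * epsilon * (s6 * s6 - s6)


def coulomb(r, q1, q2, dielectric=1.0):
    """Electrostatic term; a dielectric above 1 mimics charge shielding."""
    return COULOMB * q1 * q2 / (dielectric * r)


# The Lennard-Jones minimum lies at r = 2^(1/6) * sigma with depth -epsilon.
print(lennard_jones(2 ** (1 / 6), epsilon=1.0, sigma=1.0))
```

A total energy is the sum of such terms over all bonds, angles, dihedrals and non-bonded atom pairs.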


Structural alignment

structure comparison and prediction

comparing similar structures
minimizing root mean square of distances
distance matrices for chosen atoms
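
Both comparison measures are short to write down. The sketch assumes two equally long lists of (x, y, z) coordinates for matched atoms; the plain RMSD additionally assumes the structures are already superimposed (no optimal rotation is searched for), while the distance-matrix variant does not depend on the placement at all. Function names are illustrative.

```python
import math


def rmsd(a, b):
    """Root mean square deviation of matched atoms, without superposition."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    return math.sqrt(sum(math.dist(p, q) ** 2 for p, q in zip(a, b)) / len(a))


def distance_matrix(coords):
    """All pairwise distances within one structure."""
    return [[math.dist(p, q) for q in coords] for p in coords]


def distance_matrix_rmsd(a, b):
    """Compare two structures through their internal distance matrices."""
    da, db = distance_matrix(a), distance_matrix(b)
    n = len(a)
    total = sum((da[i][j] - db[i][j]) ** 2 for i in range(n) for j in range(n))
    return math.sqrt(total / (n * n))


A = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
B = [(x, y, z + 2.0) for x, y, z in A]     # the same shape, shifted along z
print(rmsd(A, B), distance_matrix_rmsd(A, B))
```

The shifted copy has a non-zero plain RMSD but identical internal distances, which is why distance matrices are convenient when no superposition is available.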

sequence to structure alignment
prediction of structures of large protein blocks
popular methods with growing structural databases
structural alignment onto structures of similar sequences

protein threading
threading 1D sequences onto 3D structures
usable techniques with large structural databases available

chance of a structure with a domain of a similar sequence


Enzyme activity

proteins have been the most studied, RNAs are the current ’big thing’

several basic facts
active sites formed by sequence-distant residues
induced fit action - structure adjustment on substrates
enzyme activity modulation by cofactors and coenzymes
structure change as allosteric regulation of many enzymes
in evolution, function interchange as receptors

protein flexibility
enabling a huge number of protein functions
path to intentionally regulate enzyme functions
path to escape intentional regulations


Ligand docking

combinatorial chemistry
drug design - de novo, known substrate alterations
usage in medicinal chemistry, pharmaceutical industry
computational reduction of vast amount of ligands

QSAR approaches
quantitative structure-activity relationship
rules for combinatorial ligand construction

docking methods
generation of possible ligand conformations
initial ligand positions and orientations
molecular mechanics to minimize interaction energy
too tight bindings lack entropy contributions


Protein folding

random vs. natural sequences
random polypeptides do not form folded structures
proteins with folded and denatured forms

folding paths
Levinthal paradox - too many degrees of freedom
thus sampling of just a minor fraction of them is possible
funnels of folding paths, directing to the right conformations
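
The scale of the Levinthal paradox is easy to check with a back-of-the-envelope calculation. All numbers below are illustrative assumptions, not values from the lecture: 3 conformational states per residue, a chain of 100 residues and 10^13 conformations sampled per second.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def levinthal_years(residues, states_per_residue=3, samples_per_second=1e13):
    """Years needed to enumerate every conformation of the chain once."""
    conformations = states_per_residue ** residues
    return conformations / samples_per_second / SECONDS_PER_YEAR


# Assumed example: 100 residues, 3 states each, 1e13 samples per second.
print(f"{levinthal_years(100):.1e} years")
```

Under these assumptions the exhaustive search takes on the order of 10^27 years, so real folding must follow directed paths rather than random sampling.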

many proteins need chaperones for the right folds
dual forms of prions, probably of many other (innocent)
proteins as well, hidden by cellular degradation pathways


Fold predictions

threading - global structure predictions

inverse approaches more feasible than direct predictions
threading of altered structures onto the original folds

generated databases of such threaded sequences

search threading databases for similar fragments
arrangement of the subsequences onto the structures
scoring with coarse-grained pseudo-energy functions

better with experimental (e.g. NMR) distance constraints


Global view

properties along the whole sequences

hydrophobic character
regulatory - active sites connection
position vs. frequency views
auto-correlation, repetitions

structure recognition by hydrophobicity distribution
membrane proteins with specific characteristics
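
Auto-correlation of a per-residue property is one way to expose repetitions such as the alternating hydrophobic pattern of a β-strand. The sketch below is a generic normalized auto-correlation; the normalization by the total variance and the toy alternating input are assumptions made for illustration.

```python
def autocorrelation(values, lag):
    """Normalized auto-correlation of a numeric sequence at the given lag."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    variance = sum(c * c for c in centered)
    if variance == 0 or not 0 <= lag < n:
        return 0.0                         # flat signal or lag out of range
    shifted = sum(centered[i] * centered[i + lag] for i in range(n - lag))
    return shifted / variance


# Toy profile: strictly alternating hydrophobic (+1) and polar (-1) residues.
alternating = [1, -1, 1, -1, 1, -1, 1, -1]
print(autocorrelation(alternating, 1), autocorrelation(alternating, 2))
```

The alternating profile gives a negative value at lag 1 and a positive value at lag 2, the signature of a period of two residues.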


Sequence profiles

floating-point sequences
hydrophobic values, charge values
structure prediction by profile similarity

cores vs. surfaces
hydrophobic cores as structure identification
convex and alpha hull surfaces for protein docking

wavelet analysis
localizing in both position and frequency spaces
used for protein core predictions on hydrophobic scales

the reliability is claimed to be similar to that of standard
secondary structure predictions
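
A floating-point sequence of hydrophobic values can be obtained by mapping each residue to a hydropathy scale and smoothing with a sliding window. The sketch uses the Kyte-Doolittle scale as one possible choice (the slides do not name a scale); the window length of 7 and the function name are likewise assumptions.

```python
# Kyte-Doolittle hydropathy values; the choice of this scale is an assumption.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=7):
    """Sliding-window mean hydropathy along a protein sequence."""
    values = [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


profile = hydropathy_profile("MALEKMALEKILVILVILV", window=7)
print(len(profile), max(profile) > min(profile))
```

Long stretches of high window averages are the usual first hint of a buried core or a membrane-spanning segment; a wavelet analysis would replace the single fixed window by a family of window widths.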


Qualitative modeling

qualitative topology and flexibility

discrete structure modeling
hydrophobic packing
frustration minimization

qualitative dynamics
coarse-grained domain vibrations
normal modes of the domains


Other biomacromolecules

saccharides - the next ’big thing’

lipid membranes
separation of the inner vs. the outer
surfaces with protein and carbohydrate markers

carbohydrates
major roles in immune system, cell recognition
common glycosylation of lipids and proteins


Items to remember

Nota bene:

molecular structure hierarchy

Secondary structure predictions
proteins
RNAs

Higher order structures
alignments
global methods