Bioinformatics - Lecture 05
Bioinformatics
Structural and global properties
Martin Saturka
http://www.bioplexity.org/lectures/
EBI version 0.52
Creative Commons Attribution-Share Alike 2.5 License
Martin Saturka www.Bioplexity.org Bioinformatics - Structure
Structures
Sequence based
- secondary structures, domains and folding
- computational structure predictions
- experimental data based explorations
Structure explorations
dynamic programming
- secondary structure prediction
- RNA folds, CM and SCFG
structural biology
- experiments, comparisons
- combinatorial chemistry, docking
feature characteristics
- hydrophobic packing
- global properties
Structure hierarchy
from sequences to 3D structures
primary, secondary, tertiary, quaternary structures
primary: linear sequences of monomers
- modifications: crosslinking, cleavage, ligation
secondary: local structural motifs
- regular simple structures vs. random coils
tertiary: whole single molecule structures
- folds, complex, dependency on environments
quaternary: molecular complexes
- enzymatic complexes, cytoskeleton, capsids
Secondary structure
systematics of local structures of proteins and nucleic acids
proteins
- Ramachandran plot of dihedral angles
- α-helices, β-sheets, coils, others
[Figure: Ramachandran plot regions; β-sheets: up, helices (α, 3-10): down]
RNAs
- base pairing: hairpins - stem / loop
- pseudoknots, kissing structures
DNAs
- relatively rigid double helix
- G and C quadruplex structures
Protein local structures
the basic protein building blocks
[Figure: backbone C=O / N−H hydrogen bonding patterns in an α-helix and in β-sheets]
α-helix (ψ(i) + ϕ(i+1) ≈ −105°)
- right-handed, 3.6 residues per turn
- approximately every fourth residue points in the same direction
β-sheets
- parallel (ϕ, ψ) = (−120°, 115°) and anti-parallel (−140°, 135°) cases
- alternating residue directions with respect to the plane
Protein local predictions
mediocre results, methods based on dynamic programming
specifics to consider
- start at the N-terminal ends, which are folded first
- proneness to helices, beta sheets, structure breakers
- residues assumed from multiple sequence alignments
- features, e.g. hydrophobic character alternations
altered HMM algorithm
- n-th order Markov model
- inner states: helices, sheets, turns, coils
- consider the shortest structure lengths
either look back over the shortest lengths, or
use multiple inner states for the (short) structure lengths
Prediction example
[Figure: chains of inner states H1 H2 H3 H4 ... Hn, E1 E2 E3 ... En and T1 ... T5 along positions i−4 ... i]
Inner states for:
H - helices
E - sheets
T - turns
α-helix, min 4 residues
- propensities: MALEK
β-sheet, min 2 (5 Å) residues
- propensities: YFWTVI
turns, 3-5 residues
- propensities: GP
A Q G L A E
OK: H1 H2 H3 H4 Hn Hn
NO: H1 H2 T1 H1 H2 H3
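The minimal-length rule above can be sketched as a Viterbi search over numbered inner states. This is a minimal sketch, not the lecture's implementation: the emission scores (0.15 for a favoured residue, 0.03 otherwise), the unweighted transitions, the exact state chains and the requirement that the last structure is complete are all assumptions made for illustration.

```python
import math

# Propensity sets from the slide; the numeric scores below are assumed.
FAVOURED = {"H": set("MALEK"), "E": set("YFWTVI"), "T": set("GP")}
FIRST = ("H1", "E1", "T1")
# States in which the minimal structure length has been reached.
COMPLETE = {"H4", "Hn", "E2", "En", "T3", "T4", "T5"}
# Forced continuation inside a structure (Hn and En loop on themselves).
CHAIN = {"H1": "H2", "H2": "H3", "H3": "H4", "H4": "Hn", "Hn": "Hn",
         "E1": "E2", "E2": "En", "En": "En",
         "T1": "T2", "T2": "T3", "T3": "T4", "T4": "T5"}

def successors(state):
    """Allowed next states; a structure can be left only once it is complete."""
    nxt = [CHAIN[state]] if state in CHAIN else []
    if state in COMPLETE:
        nxt += [f for f in FIRST if f[0] != state[0]]
    return nxt

def emit_score(state, residue):
    """Assumed log-score: favoured residues score higher than the rest."""
    return math.log(0.15 if residue in FAVOURED[state[0]] else 0.03)

def viterbi(seq):
    """Best inner-state path for seq; best[state] = (log-score, path)."""
    best = {s: (emit_score(s, seq[0]), [s]) for s in FIRST}
    for residue in seq[1:]:
        new = {}
        for state, (score, path) in best.items():
            for nxt in successors(state):
                cand = score + emit_score(nxt, residue)
                if nxt not in new or cand > new[nxt][0]:
                    new[nxt] = (cand, path + [nxt])
        best = new
    finished = [sp for s, sp in best.items() if s in COMPLETE]
    if not finished:
        raise ValueError("sequence too short for any complete structure")
    return max(finished, key=lambda sp: sp[0])[1]

print(viterbi("AQGLAE"))
```

Because H1 can only be followed by H2, a path such as H1 H2 T1 is never generated; the numbered states encode the minimal lengths without looking backward.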
Hierarchical HMMs
secondary and super-secondary structures
make secondary structure prediction
- input (outer) string: primary structure
- inner states sequence: secondary structure
run subsequent higher order HMM prediction
- input (outer) string: secondary structure
- inner states sequence: complex motifs
...AGPGAQGLAE... → ...HTTTHHHHHH...
...HTTTHHHHHH... → helix-turn-helix
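The second stage above takes the secondary-structure string as its input. As a rough illustration only, the sketch below replaces the higher-order HMM with a plain regular expression (a non-stochastic stand-in); the motif pattern and the function name `find_motifs` are hypothetical.

```python
import re

# Assumed motif definition: a helix, a 3-5 residue turn, another helix.
HELIX_TURN_HELIX = re.compile(r"H+T{3,5}H+")

def find_motifs(secondary):
    """Return (start, end, label) for each motif found in a string of H/E/T states."""
    return [(m.start(), m.end(), "helix-turn-helix")
            for m in HELIX_TURN_HELIX.finditer(secondary)]

print(find_motifs("HTTTHHHHHH"))
```

A real second-level HMM would assign probabilities to competing motif labels instead of a yes/no match.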
RNA
nucleotides pairing - strong long range interactions
standard and non-standard ribonucleotide pairing
HMMs: without long range interactions, not sufficient
- covariance versions of locality based algorithms
covariance Gibbs sampling - pairs of trials
covariance extension of HMMs - pairwise probabilities
[Figure: RNA stem pairing positions i−4 ... i+1 with positions j−1 ... j+2]
Automata and grammars
Chomsky-Schützenberger hierarchy
- regular languages
- context-free languages
  non-terminal symbols (with the start symbol)
  terminal symbols, the outer alphabet
  rewrite rules
- context-sensitive languages
- recursive languages
stochastic models
- finite automata / regular grammars → HMMs
- pushdown automata / context-free grammars → CMs
nonterminals: s, a1, a2
terminals: A, C, G, U
s → A a1 U
a1 → C a1 G
a1 → C a2 G
a2 → A A U
ACC...CCAAUGG...GGU
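The example grammar generates strings of the form A C^n AAU G^n U, which no regular grammar can describe because the numbers of C and G must match. A small sketch of a leftmost derivation with these rules (the helper names `derive` and `in_language` are hypothetical):

```python
RULES = {
    "s":  [["A", "a1", "U"]],
    "a1": [["C", "a1", "G"], ["C", "a2", "G"]],
    "a2": [["A", "A", "U"]],
}

def derive(n_pairs):
    """Apply a1 -> C a1 G (n_pairs - 1) times, then a1 -> C a2 G, then a2 -> A A U."""
    form = ["s"]
    a1_uses = 0
    while any(sym in RULES for sym in form):
        # Rewrite the leftmost nonterminal of the current sentential form.
        i = next(k for k, sym in enumerate(form) if sym in RULES)
        options = RULES[form[i]]
        if form[i] == "a1":
            a1_uses += 1
            rhs = options[0] if a1_uses < n_pairs else options[1]
        else:
            rhs = options[0]
        form[i:i + 1] = rhs
    return "".join(form)

def in_language(seq):
    """A string of length 2n + 5 belongs to the language iff it equals derive(n)."""
    n = (len(seq) - 5) // 2
    return n >= 1 and seq == derive(n)

print(derive(3))
```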
Covariance model
SCFG (PCFG): stochastic (probabilistic) context-free grammars
sets of: terminals, nonterminals, probabilistic rules
- the rule probabilities of each nonterminal sum to one
rewrite rules play the roles of both the inner transitions
and the symbol outputs of HMMs
[Figure: parse tree s → a1 → a1 → a2 deriving A C C A A U G G U]
60%: a1 → C a1 G
10%: a1 → A a1 A a1 A
30%: a1 → C a2 G
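To illustrate how the rule probabilities define a distribution over derivations, the following sketch samples from the grammar. The three a1 probabilities are taken from the slide; giving s and a2 a single rule each with probability 1 is an assumption carried over from the earlier grammar example.

```python
import random

RULES = {
    "s":  [(1.0, ["A", "a1", "U"])],          # assumed single rule
    "a1": [(0.6, ["C", "a1", "G"]),
           (0.1, ["A", "a1", "A", "a1", "A"]),
           (0.3, ["C", "a2", "G"])],
    "a2": [(1.0, ["A", "A", "U"])],          # assumed single rule
}

def sample(symbol="s", rng=random):
    """Expand symbol recursively; return (string, probability of this derivation)."""
    if symbol not in RULES:
        return symbol, 1.0                    # terminal symbol
    probs, bodies = zip(*RULES[symbol])
    idx = rng.choices(range(len(bodies)), weights=probs)[0]
    text, prob = "", probs[idx]
    for child in bodies[idx]:
        sub_text, sub_prob = sample(child, rng)
        text += sub_text
        prob *= sub_prob
    return text, prob

text, prob = sample("s", random.Random(0))
print(text, prob)
```

The derivation probability is the product of the probabilities of all rules used, in the same way that an HMM path probability is a product of transition and output probabilities.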
CM algorithms
time complexity of n^3 in the length n of the parsed sequence
weighted CYK algorithm for the most probable production
- analogy of the Viterbi algorithm of HMMs
inside / outside algorithm for SCFG adjusting
- analogy of the forward / backward algorithm of HMMs
the Inside and weighted CYK algorithms
- difference: 'inside' makes sums, 'CYK' takes maxima
- iterative substring parse generation (for the CYK)
first, find the best parses for short subsequences
- for larger subsequences, make the best separation
  into the most paired / the best parsed subsequences
- separations can be done at just single points
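In its simplest, grammar-free form, the iterative scheme above maximizes the number of base pairs. The sketch below follows the well-known Nussinov-style recursion; the allowed pairs (Watson-Crick plus G-U wobble) and the minimal hairpin loop of 3 unpaired bases are assumptions, not values from the lecture.

```python
# Assumed pairing rule: Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def max_pairs(seq, min_loop=3):
    """Maximal number of nested base pairs in seq (no pseudoknots)."""
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j]: best pairing count for the subsequence i..j; short spans stay 0.
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            # Separate i..j at a single point k into two independently folded pieces.
            score = max(best[i][k] + best[k + 1][j] for k in range(i, j))
            # Or pair both ends and keep the best fold of the interior.
            if (seq[i], seq[j]) in PAIRS:
                score = max(score, best[i + 1][j - 1] + 1)
            best[i][j] = score
    return best[0][n - 1]

print(max_pairs("GGGAAAACCC"))
```

The three nested loops give the cubic time complexity quoted above.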
Inside algorithm
normalize the grammar rules for just binary tree parses
- na → nb nc - for inner state changes
- na → Tr - for outer symbol outputs
square matrices indexed by sequence positions
- each matrix for subsequences from a single non-terminal
- just one matrix for a non-grammar / best pairing search
first, filled with zeros - initial parse weight sums
- diagonals with probabilities of the respective symbol outputs
iterative filling of the matrices outward from the main diagonals
- computation for each matrix of a nonterminal symbol
- for every subsequence, make its two-piece separations
- compute the parse weights for the pieces of the subsequence
- for generation from any pair of nonterminal symbols
- multiply by the probability of the nonterminal pair rule
- the end is for the whole sequence from the start symbol
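The matrix filling described above can be sketched as follows, for a grammar already normalized to binary and terminal rules. The dictionary encoding of the rules, the function name `parse_weight` and the toy grammar in the usage lines are assumptions for illustration; passing `max` instead of `sum` turns the Inside computation into the weighted CYK score.

```python
def parse_weight(seq, binary, unary, start, combine=sum):
    """Inside weight (combine=sum) or weighted CYK weight (combine=max) of seq.

    binary: {(na, nb, nc): Pr(na -> nb nc)}
    unary:  {(na, terminal): Pr(na -> terminal)}
    """
    n = len(seq)
    nonterminals = {a for a, _ in unary}
    for a, b, c in binary:
        nonterminals |= {a, b, c}
    # One square matrix per nonterminal, first filled with zeros.
    alpha = {a: [[0.0] * n for _ in range(n)] for a in nonterminals}
    # Main diagonals: probabilities of the respective symbol outputs.
    for i, symbol in enumerate(seq):
        for (a, terminal), p in unary.items():
            if terminal == symbol:
                alpha[a][i][i] = p
    # Fill outward from the diagonals, shorter subsequences first.
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            terms = {a: [] for a in nonterminals}
            for (a, b, c), p in binary.items():
                for k in range(i, j):          # every two-piece separation
                    terms[a].append(alpha[b][i][k] * alpha[c][k + 1][j] * p)
            for a, values in terms.items():
                alpha[a][i][j] = combine(values) if values else 0.0
    # The end: the whole sequence generated from the start symbol.
    return alpha[start][0][n - 1]

# Toy grammar (hypothetical): S -> S S with 0.3, S -> 'a' with 0.7.
binary = {("S", "S", "S"): 0.3}
unary = {("S", "a"): 0.7}
print(parse_weight("aaa", binary, unary, "S"))
print(parse_weight("aaa", binary, unary, "S", combine=max))
```

For the toy grammar, "aaa" has two parse trees of equal weight, so the Inside value is twice the CYK value.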
Algorithm structuring
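[Figure: three panels.
 the 'inside' structuring: a new α for na from the α values of nb and nc, below the start symbol s;
 the 'outside' structuring: a new β from the β of the parent and the α of the sibling (na, nb, nc below s);
 base-pairing maximization (no grammar): a new α for x ... y from the α values of the pieces]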
Inside / Outside
the Inside algorithm: α(i, j, na)
- probability sums of all parse trees of the subsequence
  (i to j positions) generated from the na nonterminal
the Outside algorithm: β(i, j, na)
- probability sums of all parse trees without counting
  the probabilities of the (i to j positions) subsequence
  generation from the na nonterminal
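\[ \alpha(i, i, n_a) = \Pr(n_a \to o(i)) \]
\[ \alpha(i, j, n_a) = \sum_{n_b} \sum_{n_c} \sum_{k=i}^{j-1} \alpha(i, k, n_b) \cdot \alpha(k+1, j, n_c) \cdot \Pr(n_a \to n_b\, n_c) \]
\[ \beta(1, |o|, n_s) = 1 \quad \text{for the start non-terminal} \]
\[ \beta(1, |o|, n_z) = 0 \quad \text{for a non-start non-terminal} \]
\[ \beta(i, j, n_a) = \sum_{n_b} \sum_{n_c} \sum_{k=1}^{i-1} \alpha(k, i-1, n_b) \cdot \beta(k, j, n_c) \cdot \Pr(n_c \to n_b\, n_a) + \sum_{n_b} \sum_{n_c} \sum_{k=j+1}^{|o|} \alpha(j+1, k, n_b) \cdot \beta(i, k, n_c) \cdot \Pr(n_c \to n_a\, n_b) \]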
CM profiling
the covariance model can work without a grammar
- just with maximizing base pairing - poor results
having a grammar, we need to adjust the probabilities
- analogously to HMM profiling
parameter reestimation: the expected number of uses of a rule
divided by the expected number of uses of all rules of the non-terminal
new output probabilities
\[ \Pr\nolimits_{new}(n_a \to T_r) = c(n_a \to T_r) / c(n_a) \]
count of na used to generate the terminal Tr
\[ c(n_a \to T_r) = \sum_{i,\; o(i) = T_r} \beta(i, i, n_a) \cdot \Pr(n_a \to T_r) \]
count of na used to generate anything
\[ c(n_a) = \sum_{i} \sum_{j} \beta(i, j, n_a) \cdot \alpha(i, j, n_a) \]
Profiling counts
new Pr(na → nb nc) = c(na → nb nc)/c(na)
c(na → nb nc), the count of na used to generate a non-terminal pair nb nc, is
\[ c(n_a \to n_b\, n_c) = \sum_{i=1}^{|o|-1} \sum_{j=i+1}^{|o|} \sum_{k=i}^{j-1} \beta(i, j, n_a) \cdot \Pr(n_a \to n_b\, n_c) \cdot \alpha(i, k, n_b) \cdot \alpha(k+1, j, n_c) \]
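[Figure: parse tree from the start symbol s over positions 1 ... |o|; the outside part β(i,j,na) surrounds na, which splits with Pr(na → nb nc) into the inside parts α(i,k,nb) over i ... k and α(k+1,j,nc) over k+1 ... j]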
CM obstacles
high time complexity
- not suitable for large RNA molecules
pseudoknots
- usually low depth subtree separations
[Figure: RNA pseudoknot, 5' to 3']
Physical approaches
energy minimization
while symbolic base-pairing approaches are popular,
physics-agnostic methods suffer from ignoring the physics
- many dispersed short base pairings are unfavourable
- different base pairs have different strengths
- minor bases are common in RNA molecules
complex RNA structures
- complex molecular modeling and stochastic grammars
- alignment based structure prediction
many RNA molecules with known folds
Experimental techniques
structural biology
necessary source of solid molecular structure data
standard techniques
crystallography
- inner cores generally correct
- reduced possibilities for surfaces and domain flipping
NMR spectroscopy
- less accurate than X-ray diffraction
- measurements in more natural environments
IR, Raman spectroscopies
- simpler, for vibrations of specific parts
EM, AFM
- for structures of greater molecular complexes
other methods
- many kinds of spectroscopy and microscopy,
  ultracentrifugation, chromatography, etc.
Structure refinement
frequent usage
- experimental data adjustment
- exploring small alterations
- short macromolecular dynamics
- states of small molecules
molecular mechanics
- statics, energy minimizations
- standard hill-climbing methods
molecular dynamics
- intensive computer simulations
- amount of solvent, long range interactions
stochastic dynamics
- Langevin dynamics - extra random forces
- Monte Carlo - probabilities, not forces
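The "probabilities, not forces" idea of Monte Carlo refinement can be sketched with a Metropolis walk on a one-dimensional energy function. Everything numeric here (step size, temperature factor `kT`, the quadratic toy energy) is an assumption chosen for illustration; a real refinement would move atomic coordinates under a force field.

```python
import math
import random

def metropolis(energy, x0, steps=10000, step_size=0.1, kT=0.6, rng=None):
    """Metropolis Monte Carlo; returns the lowest-energy point visited and its energy."""
    rng = rng or random.Random()
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    for _ in range(steps):
        cand = x + rng.uniform(-step_size, step_size)   # random trial move
        e_cand = energy(cand)
        # Downhill moves are always accepted, uphill ones with Boltzmann probability.
        if e_cand <= e or rng.random() < math.exp(-(e_cand - e) / kT):
            x, e = cand, e_cand
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e

# Toy energy with a single minimum at x = 2 (hypothetical).
print(metropolis(lambda x: (x - 2.0) ** 2, 0.0, rng=random.Random(1)))
```

With a very small `kT` almost no uphill move is accepted, and the walk behaves like the hill-climbing energy minimization of molecular mechanics.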
Force fields
biomacromolecules: classical forces approximation
quantum potentials for ligands, limited areas
environment approximation
- charge shielding, hydrophobic interaction, entropy
empirical potentials
- bonds, bond angles, dihedral angles
- electrostatic, van der Waals forces
- implicit solvation
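A few of the empirical terms listed above, written in their common textbook forms (harmonic bond, Lennard-Jones 12-6, Coulomb). This is a sketch under stated assumptions: the functional forms are the usual ones but are not specified in the lecture, the unit convention (kcal/mol, Å, elementary charges) is assumed, and a constant dielectric stands in very crudely for charge shielding by the solvent.

```python
COULOMB_CONST = 332.06  # approx. kcal*Angstrom/(mol*e^2), assumed unit convention

def bond_energy(r, r0, k):
    """Harmonic bond stretching around the equilibrium length r0."""
    return 0.5 * k * (r - r0) ** 2

def lennard_jones(r, epsilon, sigma):
    """Van der Waals 12-6 term: zero at r = sigma, minimum -epsilon at 2**(1/6) * sigma."""
    s6 = (sigma / r) ** 6
    return 4.0 * epsilon * (s6 * s6 - s6)

def coulomb(r, q1, q2, dielectric=1.0):
    """Electrostatic term; a larger dielectric mimics charge shielding."""
    return COULOMB_CONST * q1 * q2 / (dielectric * r)

def nonbonded(r, q1, q2, epsilon, sigma, dielectric=80.0):
    """Non-bonded pair energy; dielectric=80 is an assumed water-like value."""
    return lennard_jones(r, epsilon, sigma) + coulomb(r, q1, q2, dielectric)

print(nonbonded(3.5, 0.4, -0.4, 0.15, 3.2))
```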
Structural alignment
structure comparison and prediction
comparing similar structures
- minimizing root mean square of distances
- distance matrices for chosen atoms
sequence to structure alignment
- prediction of structures of large protein blocks
- popular methods with growing structural databases
- structural alignment onto structures of similar sequences
protein threading
- threading 1D sequences onto 3D structures
- a usable technique with large structural databases available
- chance of a structure with a domain of a similar sequence
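The distance-matrix comparison of similar structures mentioned above can be sketched as a root mean square difference between the two intra-molecular distance matrices, which needs no superposition of the structures. The function names are hypothetical, and the coordinates are assumed to be matched lists of chosen atoms (for example Cα atoms).

```python
import math

def distance_matrix(coords):
    """All pairwise Euclidean distances between the chosen atoms."""
    return [[math.dist(a, b) for b in coords] for a in coords]

def drmsd(coords_a, coords_b):
    """Root mean square difference of the two distance matrices (upper triangles)."""
    if len(coords_a) != len(coords_b) or len(coords_a) < 2:
        raise ValueError("need two matched coordinate lists with at least 2 atoms")
    da, db = distance_matrix(coords_a), distance_matrix(coords_b)
    n = len(coords_a)
    squares = [(da[i][j] - db[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n)]
    return math.sqrt(sum(squares) / len(squares))

a = [(0, 0, 0), (1, 0, 0), (0, 1, 0)]
b = [(5, 5, 5), (5, 6, 5), (4, 5, 5)]   # a rotated by 90 degrees and shifted
print(drmsd(a, b))
```

Since distances do not change under rotation and translation, the rotated and shifted copy scores zero up to rounding; a coordinate RMSD would first require an optimal superposition.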
Enzyme activity
proteins have been the most studied, RNAs are the current 'big thing'
several basic facts
- active sites formed by sequence-distant residues
- induced fit action - structure adjustment on substrates
- enzyme activity modulation by cofactors and coenzymes
- structure change as allosteric regulation of many enzymes
- in evolution, function interchange as receptors
protein flexibility
- enabling a huge number of protein functions
- path to intentionally regulate enzyme functions
- path to escape intentional regulations
Ligand docking
combinatorial chemistry
- drug design - de novo, known substrate alterations
- usage in medicinal chemistry, pharmaceutical industry
- computational reduction of the vast number of ligands
QSAR approaches
- quantitative structure-activity relationship
- rules for combinatorial ligand construction
docking methods
- generation of possible ligand conformations
- initial ligand positions and orientations
- molecular mechanics to minimize interaction energy
- too tight bindings lack entropy contributions
Protein folding
random vs. natural sequences
- random polypeptides do not form folded structures
- proteins with folded and denatured forms
folding paths
- Levinthal paradox - too many degrees of freedom
- thus sampling of just a minor fraction of them is possible
- funnels of folding paths, directing to the right conformations
many proteins need chaperones for the right folds
- dual forms of prions, probably of many other (innocent)
  proteins as well, hidden by cellular degradation pathways
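The Levinthal paradox above is usually illustrated with a back-of-the-envelope count. The numbers below (100 residues, 3 conformations per residue, 10^13 conformations tried per second) are assumed, conventional illustration values and do not come from the lecture.

```python
def levinthal_years(residues=100, states_per_residue=3, rate_per_second=1e13):
    """Years needed to try every backbone conformation once (all inputs assumed)."""
    conformations = states_per_residue ** residues
    seconds = conformations / rate_per_second
    return seconds / (365.25 * 24 * 3600)

print(f"{levinthal_years():.1e} years")
```

Under these assumptions an exhaustive search would take on the order of 10^27 years, so folding has to follow directed paths rather than random sampling.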
Fold predictions
threading - global structure predictions
inverse approaches are more feasible than direct predictions
- threading of altered structures onto the original folds
generated databases of such threaded sequences
search threading databases for similar fragments
- arrangement of the subsequences onto the structures
- scoring with coarse-grained pseudo-energy functions
better with experimental (e.g. NMR) distance constraints
Global view
properties along the whole sequences
hydrophobic character
- regulatory - active sites connection
- position vs. frequency views
- auto-correlation, repetitions
structure recognition by hydrophobicity distribution
- membrane proteins with specific characteristics
Sequence profiles
floating-point sequences
- hydrophobic values, charge values
- structure prediction by profile similarity
cores vs. surfaces
- hydrophobic cores as structure identification
- convex and alpha hull surfaces for protein docking
wavelet analysis
- localizing in both position and frequency spaces
- used for protein core predictions on hydrophobic scales
- the reliability is claimed to be similar to that of standard
  secondary structure predictions
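A floating-point profile of the kind described above can be sketched by mapping residues to hydrophobicity values and smoothing them with a sliding window; the auto-correlation then exposes repetitions such as alternating hydrophobic and polar residues. The Kyte-Doolittle scale and the window length of 7 are assumed choices, not prescribed by the lecture.

```python
# Kyte-Doolittle hydropathy values (one common hydrophobic scale).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def profile(seq, window=7):
    """Sliding-window mean hydrophobicity; windows are truncated at the ends."""
    values = [KYTE_DOOLITTLE[aa] for aa in seq]
    half = window // 2
    out = []
    for i in range(len(values)):
        piece = values[max(0, i - half): i + half + 1]
        out.append(sum(piece) / len(piece))
    return out

def autocorrelation(values, lag):
    """Normalized auto-correlation at the given lag (values must not be constant)."""
    mean = sum(values) / len(values)
    dev = [v - mean for v in values]
    var = sum(d * d for d in dev)
    return sum(dev[i] * dev[i + lag] for i in range(len(dev) - lag)) / var

raw = [KYTE_DOOLITTLE[aa] for aa in "VKVKVKVK"]
print(autocorrelation(raw, 1), autocorrelation(raw, 2))
```

For the alternating toy sequence the lag-1 value is negative and the lag-2 value positive, which is the period-2 signature of hydrophobic character alternations.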
Qualitative modeling
qualitative topology and flexibility
discrete structure modeling
- hydrophobic packing
- frustration minimization
qualitative dynamics
- coarse-grained domain vibrations
- normal modes of the domains
Other biomacromolecules
saccharides - the next ’big thing’
lipid membranes
- separation of the inner vs. the outer
- surfaces with protein and carbohydrate markers
carbohydrates
- major roles in immune system, cell recognition
- common glycosylation of lipids and proteins
Items to remember
Nota bene:
molecular structure hierarchy
Secondary structure predictions
- proteins
- RNAs
Higher order structures
- alignments
- global methods