
    accuracy, 353–4, 353F, 354F, 403–5, 403F, 404F
    E. coli segment, 322, 323F
    evaluation and reevaluation, 405
    functional, 400–3
    pathway information aiding, 348, 349–50F
    pipeline approach, 319
    practical aspects, 346–52, 347FD
    quality of information used, 403
    role of gene ontology, 348–52, 402
    theoretical basis, 357–9, 358MM
Genome Browser, 352, 352F
GenomeNet, 84
GenomeScan, 397
genome sequence alignments
    to verify annotation, 353–4, 353F, 354F, 403–5, 403F, 404F
    whole genomes, 156–9, 157FD
genome sequences
    excluding noncoding regions, 319–21
    gene prediction from see gene prediction
    preliminary examination, 318–22, 319FD
    splitting, 319
genome sequencing, 71
    multiple genomes, 376B
    shotgun procedure, 376B
genomic imprinting, 7
genomics
    functional, 600
    role in systems biology, 668
    structural, 569
GenScan program, 334
    comparative results, 331T, 332T, 336
    exon detection, 390
    promoter detection, 385, 385F
    splice site prediction, 394, 395F, 396
    transcription stop signal detection, 389
    translation start site detection, 389
    use of gene models, 398–9, 401F
    use of homology searches, 397
GenTHREADER, 532–3, 534–5, 535F, 536F

GEPASI, 691–2, 691F
GES (Goldman, Engelman and Steitz) hydrophobicity scale, 438, 475, 477T
Gibbs program, 215–17
Gleevec®, 593
GLIMMER program, 323, 371–2
global alignments, 88–9, 89F
    large genome sequences, 352F, 353
    optimal, 128, 129–35, 129F, 130F, 131F
    score significance, 154
    time saving methods of deriving, 139–41, 139F, 140F
global–local dynamic programming, 533
globular proteins, 41
    length distributions of secondary structures, 467, 468F
    secondary structure prediction, 509
    secondary structures, 463
gluconeogenesis pathway, 348, 349–50F
glycolytic pathway, 671, 672F
    E. coli, 673F
    interactions, 673F
    modularity, 686F, 687F
glycosylphosphatidylinositol (GPI) anchors, 513–14, 513F
Godzik, Adam, 491
Gojobori, Takashi, 240B
GOLD program, 591–2, 592F
GOR methods, 414, 422–5, 425F, 472–3
    accuracy, 422, 423, 424T, 484
    derivation, 480–4, 482F
    version III, 483, 484F
    version IV, 423–5, 427F, 483
    version V, 423–5, 425–6, 426F, 483
Gotoh, Osamu, 206
GPI-SOM method, 513–14, 513F
G-protein-coupled receptors, 436, 436B
GrailEXP program, 331T, 332T, 334–5, 336
Grail program, 323, 386, 387F, 389, 399
greedy alignment methods, 199
greedy permutation encoding method, 646–7
Greek Key structure, 40B
GRID program, 591
GRIN program, 591
Grishin, Nick, 466
growth factors, 616–17, 617F
guanine (G), 6, 6F
guide tree, 90, 199–200
    construction, 204–6, 205F
    multiple alignment from, 206, 206F
    pattern discovery, 214
Guigo, Roderic, 365–6B, 392B
Gumbel extreme-value distribution see extreme-value distribution

H

HbP method, 491–2, 492F, 493F
Haemophilus influenzae, 371
hairpins, 36–7
harmonic approximation, 526, 702–3, 702F
hashing, 95
    theoretical basis, 143–6
    whole genome sequences, 158
heart
    cellular modeling, 685T
    modeling of function, 677, 678F
heat shock response, E. coli, 680, 680F
helical wheels, 439F, 440–1, 448
helices, 435
    see also 3₁₀-helices; α-helices; π-helices; transmembrane helices
helix tails, 441
hemagglutinin, 34, 486, 486F
hemoglobin, 43, 43F
Henikoff, Steven and Jorja, 122, 171F
heptads, 451, 451F, 510
Hessian, 714–15
hexamers (hexanucleotides) see dicodons
HHsearch, 195F, 196
hidden layers, 431, 431F, 494, 499
hidden Markov models (HMMs), 166, 179, 179FD
    with duration, or explicit state duration, 374–6
    EcoParse gene model, 375F, 376–7
    exon prediction, 328, 332
    GAZE gene model, 402F
    GeneMark.hmm algorithm, 374–6, 374F
    genome annotation, 359
    GenScan gene model, 399, 401F
    multiple sequence alignments, 200, 203–4
    profile see profile hidden Markov models
    secondary structure prediction, 504–10, 506FD
    transmembrane protein prediction, 446–7, 446F, 451
    vs finite-state automata, 147, 179, 180–1
hidden neural networks (HNN), 509
hierarchical clustering, 638, 639–41
    see also UPGMA method
    gene expression microarray data, 606–8, 606F, 607F
    protein expression data, 616–17, 617F, 618F
    vs other clustering methods, 643B
hierarchical likelihood ratio test (hLRT), 253, 254F, 255
Higgins, Desmond, 209
high-scoring segment pairs (HSPs), 141, 149
Hinton diagram, 499F
histone deposition protein, 571F
HIV (HIV-1), 337B
    drug design, 589B
    protease (HIV-PR), 551–2, 552F
HKY85 model, 253T, 254F, 256T, 273
HMM see hidden Markov models

HMMER2 program, 185
HMMGene program, 331T, 332, 332T, 333
HMMTOP program, 441F, 446–7, 448F, 506–7, 507F
Hochberg, Yosef, 659
Hollerith, Herman, 48, 48F
homolog methods see nearest-neighbor methods
homologous genes
    chicken, human and puffer fish genomes, 245, 246F
    evolution, 239–42, 242F
homologous proteins, 38
    see also protein families
    alignment, 38, 74
    secondary structure prediction, 416, 418–19, 419F
homologous sequences
    see also sequence alignment
    cut-off points for identifying, 81
    identifying, 74–6
    inserting gaps, 85–7
    scoring alignments, 76–81
    searching databases see searching sequence databases
    secondary structure prediction using, 425–6, 484–5, 489–90, 502–3
homology
    exon prediction based on, 397
    functional, 569–70, 569F, 570F
    gene prediction based on, 320F, 321B, 322, 372–3
homology modeling (3D protein structure), 522–3, 537–64, 538FD
    assumptions, 541–2
    automated, 541, 552–6, 553FD, 561–3
    checking for accuracy, 549–51, 550F, 551T, 560, 560F
    energy minimization, 548, 559–60
    history, 538–9, 538F
    loops, 545–6, 546F, 547F, 559, 559F
    manual or semi-manual, 557–61
    molecular dynamics, 548
    mTOR protein, 563, 563T
    multidomain proteins, 564
    PI3 kinase p110α, 557–63
    principles, 537–42
    sequence length cut-offs, 540–1, 542F
    sequence similarity thresholds, 539–40, 541F
    steps, 540F, 542–52, 543FD
    structurally conserved regions (SCRs), 544–5, 545F, 554
    trustworthiness, 551–2
    Web-based servers, 554, 561–3
homoplasy, 244



horizontal gene transfer (HGT), 246–7, 246F, 247F, 292B
Hsp60, 249
HSSP database, 490
HTML (hypertext markup language), 50–1
human immunodeficiency virus see HIV
Hutchinson, Gail, 475
Hutchinson, Gordon, 387
hybridization, 9, 602
hydrogen bonds
    DNA, 7, 8
    energetics, 525–6, 701
    peptide bonds, 29, 32, 32B
    protein folds, 42
    RNA, 456
    secondary protein structure, 34, 35, 35F, 36F
        defining, for prediction algorithms, 464–5, 465F
        nonidealized patterns, 463–4
hydropathic (hydrophobicity) profiles, 439, 442
hydrophilic amino acid residues, 29, 30F
    transmembrane proteins, 439F, 440–1
hydrophilic regions, folded proteins, 41
hydrophobic amino acid residues, 29, 30F
hydrophobic cluster analysis (HCA), 110–11, 110F
hydrophobicity scales, 437–9, 450, 475, 477T
hydrophobic moment, 440
hydrophobic regions
    folded proteins, 41, 42
    indicating binding sites, 583
    transmembrane proteins, 437–41, 439F
hyperplanes, separating, 661, 662, 662F
hypertext markup language (HTML), 50–1
HyPhy program, 255
hypothetical proteins, 65, 348
    conserved, 348

I

identity, 76
    percent/percentage see percent/percentage identity
    visual assessment, 77–8, 77F, 79F
immunoglobulin folds, 571F
immunoglobulins, 381, 555–6B
importin α, 480F
imprinting, genomic, 7
indels, 85, 117
    see also deletions; insertions
indexing techniques, 141–6
    see also hashing; suffix trees
    whole genome sequences, 157–9
influenza virus
    hemagglutinin, 34, 486, 486F
    rational drug design, 589B, 591
information
    directional, 423, 482
    mutual, 697
    pair, 423, 482
    Shannon entropy and, 696
information theory approach, secondary structure prediction, 422–5, 480–4
informative sites, 298
ingroups, 230
inhomogeneous Markov chain (IMC) models, 328, 368–70
initiator (Inr) see cap signal
input, 431, 494
input layer, 430, 494
insertions
    accounting for, in sequence alignment, 85–7
    alignment scoring schemes, 117, 126–7
    homology modeling, 542, 545–6, 545F
    threading and, 532, 537
integral membrane proteins see transmembrane proteins
integrative approach, 670F
intermediate alignment, 198, 204–5, 205F
intermediate sequences, 97
internal nodes, 226, 227F
Internet, access to databases via, 52
interpolated Markov models, 371–2, 388
intrinsic classification, 638
intrinsic gene detection methods, 361, 368FD
intron prediction, 319, 323, 379–81
    approaches used, 324–5
    theoretical basis, 389–97, 391FD
introns, 18–19, 18F
    see also splice sites
    AT–AC or U12, 19, 392
    branch point, 18–19, 396
    length distributions, 379, 379F
invariable sites, 298
inverse protein folding, 530–1
inversion, sequence, 158F
I-sites library, 487–8, 487F
isochores, 275, 378
isoelectric focusing (IEF), 613
iterated sequence search (ISS), 168
iterative alignment, 198, 206, 206F

J

Jarnac, 690F, 691–2
JC model see Jukes–Cantor model
Jnet program, 424T, 434, 435F
Jones, David, 276, 503
JTEF program, 397
JTT matrix, 276
Jukes, Thomas, 271
Jukes–Cantor (JC) model, 253T, 271–2
    evaluation using maximum likelihood, 302, 303–4
    example distance corrections, 252F
    examples of constructed trees, 256, 258F, 261F, 262
    Gamma distribution applied to (JC+Γ), 273
    more complex models based on,

    mutations, 241B
    testing for suitability, 253, 254F, 256T
junk DNA, 22B, 336, 378–9
jury decision neural networks, 432, 501
jury voting technique, 485
JWS Online Cellular System Modeling,


K

Kabat database, 103
Kabsch, Wolfgang, 464–5
Katoh, Kazutaka, 206
KD hydrophobicity scale, 475, 477T, 479F
Kendrew, John Cowdery, 538F
keratins, 451
keys, 49, 49F
Kihara, Daisuke, 480
Kimura-two-parameter (K2P or K80) model, 253, 253T, 272–3
    practical application, 261F, 262
    transition/transversion ratio calculation, 274–5B
Kimura-three-parameter (K3P or K81) model, 253, 253T
kinetic energy, 718
kinetic models, 678, 690F
kinetic parameters, biological networks, 674
k-means clustering, 608, 641–2, 642F
    vs other clustering methods, 643B
k-mers, 141, 147, 199–200
k-nearest-neighbor method, sample classification, 660–1
knockout mice, 688
knowledge-based methods
    modeling 3D protein structure see homology modeling
    secondary structure prediction, 414–15, 421–30
    transmembrane protein prediction, 443
knowledge-based scoring, 590
KOG database, 243, 245B
Kohonen networks see self-organizing maps
Krebs cycle see tricarboxylic acid cycle
Krogh, Anders, 500, 501F, 502–3
k-tuples, 95, 141, 143–4, 147
    whole genome sequences, 158–9
Kullback–Leibler distance see relative entropy
kuru, 101, 101B
Kyoto Encyclopedia of Genes and Genomes (KEGG), 348, 671, 672F
Kyte–Doolittle (KD) hydrophobicity scale, 475, 477T, 479F

L

L2L tool, 611
Laboratory Information Management System (LIMS), 600
LAGAN method, 352F, 353
Lake, James, 292B
LAMA program, 106
    alignment of PSSMs, 193–5, 194F
Lander, Eric, 488, 491
lariat RNA, 18–19, 18F
last common ancestor, 227, 227F
last universal common ancestor, 293
lateral gene transfer (LGT) see horizontal gene transfer
layers, neural networks, 430–1, 431F,

    supervised, 497B, 638
    unsupervised, 638, 644
learning rate, 497B
least-squares method, 250
    Bryant and Waddell version, 296, 296F
    evaluating tree topologies, 294–6, 295F, 297
leaves, 226, 227F
LEGO® system, 686, 688F
length distributions
    α-helices and β-strands, 467, 468F
    prokaryotic coding/noncoding regions, 374–5, 374F
    vertebrate introns and exons, 379, 379F
Lennard–Jones terms, 705, 705F
leucine zipper, 413, 451
    prediction, 453–4, 455F
Levitt, Michael, 195
LIBRA, 536, 537F

library extension, COFFEE scoring scheme, 203, 204F
ligands
    docking procedures, 587–93, 588FD
    drug design methods, 588, 589B
    identifying candidate, 590
likelihood ratio test, hierarchical (hLRT), 253, 254F, 255
linear discriminant analysis (LDA)
    promoter prediction, 340, 388, 389F
    secondary structure prediction, 512–13
linear gap penalties, 126–7
    global alignments, 131F, 132–3
    local alignments, 137
    suboptimal alignments, 137F, 139
links (in databases), 52, 53
lipopolysaccharide (LPS), 608, 609F, 674F
liquid chromatography, 623
local alignments, 88–9, 89F
    dynamic programming algorithm, 135–9, 136F
    gapped, score statistics, 153, 156
    multiple alignment using, 92–3, 93F
    optimal, 135–7, 136F
    profile hidden Markov model, 183–4, 184F
    suboptimal, 137–9, 137F
    ungapped, score statistics, 153,

    amino acid propensities, 476, 478F
    evolutionary models, 254F, 256T
    multiple alignments, 192, 216
log-odds ratio, 118–19, 169–70
log-odds scores, 188–90
logos, 177
    aligned HMMs, 196, 196F
    patterns, 213
    PSSMs, 106F, 177–8, 178F
log ratios
    defining distances between, 634–7
    expression data, 629–30, 630F
    t-test, 654
    z-test, 653–4
long-branch attraction see Felsenstein zone
LOOPP program, 532–3, 533F, 535–6, 536F
loops, 36–7
    amino acid residue preferences, 202
    homology modeling and, 542, 545–6, 546F, 547F, 559, 559F
    transmembrane proteins, prediction, 506, 508
Loopy program, 561
low-complexity regions, 100–2, 151–2B
    see also repeat sequences
Lowe, Todd, 362
lowess normalization, 630–1, 631F
LUDI program, 591
lysozyme, 538–9, 539F

M

M (mutation probability matrix), 120, 121–2
machine-learning methods, 430
    see also neural network methods
    secondary structure prediction, 414, 415–16
Macromolecular Structure Database (MSD), 52, 60, 64
macrophages, 608, 609F, 674F
MAFFT method, 199–200, 206
Mahalanobis distance, 636–7
main chain see backbone
Major, Francois, 466
major histocompatibility complex (MHC) proteins, 593
majority-rule consensus trees, 234F, 235
majority voting technique, 485
Mann–Whitney U test, 656–7
MAO (multiple alignment ontology), 54F, 55
MARCOIL, 510, 510F
Markov chain Monte Carlo (MCMC), 307
Markov chains, 368–9
    first order, splice site prediction, 394–5
Markov models, 179
    see also hidden Markov models; inhomogeneous Markov chain models
    fifth order, 368–70, 370F
    interpolated, 371–2, 388
    splice site prediction, 394–5
    used by GeneMark, 369, 370F
MASCOT program, 622–3
mass spectrometry (MS), 600, 621–3
    protein identification, 621–3, 622F
    protein quantitation, 623
MAST program, 106
mathematical modeling of biological systems, 689–92, 689FD
    approaches, 674–7, 676F
    model databases, 692
    model structure, 679–83, 679FD
    specialized programs, 690F, 691–2, 691F
    standardized languages, 692
Matthews correlation coefficient, 469–70
maximal dependence decomposition (MDD), 394, 395F



maximal segment pair (MSP), 141, 149
maximum likelihood (ML), 250, 251T, 286
    evaluating tree topologies, 302–5, 302F, 303F, 304F
    hidden Markov model parameterization, 191
    inference of parameter values, 698
    measure of optimality, 287
    practical application, 255–6, 257F, 262, 263F
    testing for suitability, 253
maximum parsimony, 250, 251T, 286
    branch-and-bound technique, 288
    long-branch attraction problem, 309, 309F
    measure of optimality, 287
    unweighted, 297–300, 299F
    weighted, 300–2, 300F, 301F
McClintock, Barbara, 337B
McPromoter program, 388
mean(s), 626, 652
    comparison between two, 652–5
MEGA3 program, 250, 260
Melanie program, 614–15, 620
membrane proteins, 436–7, 462
    see also transmembrane proteins
    interactions with membrane, 437, 437F
    secondary structure prediction, 468
MEME program, 105–7, 107F, 215–17
MEMSAT program, 443, 475–6, 479
messenger RNA (mRNA), 11
    analysis of transcribed see gene expression analysis
    capping, 18
    genetic code, 12–13, 12T
    polyadenylation, 18
    reading frames, 13, 13F
    secondary structure, 455
    splicing see RNA splicing
    synthesis see transcription
    translation see translation
metabolic models, 678
metabolic pathways
    databases as sources, 671, 672F, 673F
    modeling interactions, 681–3, 682F
    modularity, 685, 686F, 687F
    simulation programs, 690F, 691–2, 691F
methylation, 6–7
MFOLD program, 457F, 458
MIAME (Minimum Information About a Microarray Experiment), 64, 606
Michener, Charles, 278
microarray databases, 58, 60F
    applications, 610–11, 612F
    data standards, 64, 606
Microarray Gene Expression Data (MGED), 54–5, 606
MicroArray Quality Control (MAQC) project, 606
microarrays, 602
    DNA see DNA microarrays
    protein, 621
middle-out approach, modeling biological systems, 677, 678F
midnight zone, 81
minimum evolution, 250, 282
    methods, 250, 251T, 297
MIRIAM standard, 692
mitochondria, 22, 292B, 367
modeling biological systems see mathematical modeling of biological systems

modeling (tertiary) protein structure, 521–65, 522MM
    ab initio approach, 522, 523B
    assessment of predicted structure, 554–6
    comparative, homology or knowledge-based see homology modeling
    potential energy functions and forcefields, 524–9, 524FD
    ROSETTA/HMMSTR method, 523B
    threading (fold recognition) see threading
MODELLER program, 535, 541, 552, 553, 554F
model surgery, 182
ModelTest, 255
modularity, biological systems, 685–6
modules, 680, 681F, 685–6, 686F
Molecular Biology Database Collection, 55, 56F
molecular clock, 229–30, 278
    hypothesis, 250
molecular configuration, 33B
molecular dynamics, 528–9
    function optimization, 718–19
    homology modeling, 548
molecular energy functions, 700–8
    see also bonding terms; nonbonding terms
    force fields for intra- and intermolecular interactions, 701–5
    potentials used in threading, 706–8
molecular evolution, 235–48, 235FD
Molecular INTeraction database (MINT), 58
molecular mechanics, 524–9
molecular modeling, ligand binding, 588, 589B
molecular models, 39, 39F
MolIDE, 542, 557–8, 560–1, 561F
MolProbity program, 527, 549, 551T
monophyletic (groups), 231, 255–6, 258
Monte Carlo methods
    see also Markov chain Monte Carlo
    docking, 590
    function optimization, 716–18, 716F
    modeling protein structure, 523B
Morse potential, 702F, 703
MOTIF program, 217
motifs, 103–9, 412
    see also patterns
    automated generation, 105–7, 106F, 107F
    creating databases, 104–5
    searching for, 103–4, 107–8
MrAIC script, 255
mRNA see messenger RNA
mTOR protein, 563, 563T
MULTICOIL program, 453
multidomain proteins, 41
    3D structural modeling, 537, 564
    sequence alignment, 88, 88F

multifurcating trees, 227, 233, 233F
Multi-LAGAN method, 353
multiple alignment, 89–93
    applications, 90
    construction methods, 90–1, 196–211
    discovering patterns, 213–15
    divide-and-conquer method, 91, 91F
    by gradual sequence addition, 196–206, 197FD
    manual refinement, 93
    methods not using pairwise alignment, 207–11, 207FD
    phylogenetic tree reconstruction using, 250–1, 255, 260
    secondary structure prediction using, 425–7, 427F
    from series of local alignments, 92–3, 93F
    theory, 165–7, 166MM
    transmembrane protein prediction using, 444, 445
    value for sequences of low similarity, 91–2, 92F
    vs pairwise alignments, 90, 166–7
multiple alignment ontology (MAO), 54F, 55
multiple linear regression, 514
MUMmer method, 159
MUSCLE method, 199–200, 206
mutation data matrices (MDMs), Dayhoff see PAM matrices
mutation probability matrix (M), 120, 121–2
mutation rates
    codon position and, 238–9, 238F
    estimating and predicting, 236, 237F
    type of base substitution and, 236–8, 238F
mutations, 22–3
    accepted, 84
    masking sequence similarities, 72, 73–4
    selective pressures on, 240–1B
    synonymous/nonsynonymous, 238, 240–1B, 245
    transition and transversion, 237–8, 238F
mutual information, 697
Mycoplasma, 684
myoglobin, sperm whale, 538F
myosin II, 451
MZEF program, 328
    comparative results, 331–2, 331T, 332T
    scores used, 331T

N

N-acetylneuraminate lyase gene, 247F
National Center for Biotechnology Information (NCBI), 52, 55
    dbEST, 56, 321B
    GEO, 606
    Protein Database, 56–8
    SAGE analysis programs, 605
    UniGene database, 103, 605–6, 605F
native structure or state (of proteins), 522
NCBI see National Center for Biotechnology Information
nearest-neighbor interchange (NNI) method, 289–90, 289B
nearest-neighbor methods, 414–15, 428–30, 485–92, 485FD
    misfolding proteins, 491–2, 492F, 493F
    outline, 486, 487F
    sample classification, 660–1
    similarity measures used, 488–90, 490F
    weighting of predictions, 490–1
Needleman, S.B., 87, 128
Needleman–Wunsch algorithm, 87, 128
    database search programs using, 95
    discarding intermediate calculations, 138B
    extension to multiple alignments, 199
    illustration of original, 135, 135F
    more efficient variations, 129–35, 129F, 130F
negative selection, 240–1B
Nei, Masatoshi, 240B, 282
neighbor-joining (NJ) method, 250, 251T, 252–3
    generating single trees, 282–5, 282F, 284F
    multiple alignment, 199, 200
    practical application, 261F, 262
    variants, 285
Nei–Gojobori method, 240–1B
Neisseria meningitidis, 348, 350F
nested genes, 399
NetPhos server, 110
NetPlantGene program, 390–1, 393F,


    see also neural networks; systems, biological
    architectures, 676, 677F
    biological, 670–1
    information for constructing, 671–4
    kinetic models, 678
    mathematical modeling approaches, 674–7
    mathematical representation of interactions, 680–3
    scale-free, 676
neural network methods
    exon prediction, 334–5, 390–1
    genome annotation, 359
    promoter prediction, 340, 385–6, 386F, 387F
    secondary structure prediction, 415–16, 430–4, 430FD
        assessing reliability, 432
        Qian and Sejnowski studies, 496–9, 499F, 500F
        Riis and Krogh methods, 500–1, 501F, 502–3
        theoretical basis, 492–504, 493FD
        transmembrane proteins, 445
        using homologous sequences, 502–3
        Web-based programs using, 432–4
    splice site prediction, 395–6
neural networks, 430–2
    GenTHREADER, 534–5, 535F
    Kohonen see self-organizing maps
    layered feed-forward, 494–502, 495F
    more complex, 503–4, 504F, 505F
    multilayer, 431, 431F
    training process, 496, 497–8B
    two-layered, 430–1, 431F
neuraminidase, 589B
Nevill-Manning, Craig, 213
Newick or New Hampshire format, 231–2
Newton–Raphson method, 528
NMR see nuclear magnetic resonance
NNPP program, 340, 341T, 385–6, 386F
NNSSP program, 424T, 433, 488–9, 490, 491

nodes
    neural networks see units, neural network
    phylogenetic trees, 226, 227F
    self-organizing maps, 608, 608F, 644, 644F
    self-organizing tree algorithms, 648, 648F
nonbonding terms, 525–6, 701, 704–5
noncoding DNA see junk DNA
noncoding RNA (ncRNA) genes, detection, 319–21, 361–3
noncoding strand, 11
nonlinearity, 667
nonparametric tests, 656–7
nonrandom model, sequence alignment, 117–19
nonredundant database, 63
nonsynonymous mutations, 239, 240–1B, 245
normal distributions, 626, 628F, 698
    statistical tests, 653–5
normalization
    data, 627–31, 628F, 630F
    lowess, 630–1, 631F
Notredame, Cedric, 209
N terminus, 29
nuclear magnetic resonance (NMR), 411, 521
nucleic acid sequences see nucleotide sequences
Nucleic Acids Research (NAR), 55, 56F
nucleic acid world, 3–23, 4MM
    see also DNA; RNA
nucleotides, 5–6, 6F
nucleotide sequences, 5, 6
    see also DNA sequences; RNA sequences
    base composition variations, 275–6
    comparison with protein sequences, 150–3
    databases, 55–6, 57F, 58
    derivation of scoring matrices, 124F, 125
    detection of homology, 75–6
    evolutionary changes, 236–9
    evolutionary models, 271–2
    large-scale rearrangements see rearrangements, large-scale
    low-complexity regions, 151B
    scoring of alignment, 76–7, 80–1
    searching with, 97–103
null distribution, 656
null model, 189–90
NVT ensemble, 718



O

object-oriented databases, 48, 51
odds ratio, 118
Ohler, Uwe, 388
oligomeric proteins, 42–3
one-tailed test, 653
Online Mendelian Inheritance in Man (OMIM) Web site, 352
ontologies, 54–5, 54F, 64
    gene see gene ontology
open reading frames (ORFs), 13, 318, 367
    compared to eukaryotic genes, 377–8
    hypothetical proteins, 348
    identifying, 318–19, 359–60
        practical aspects, 322–3
        theoretical basis, 364, 371, 372–3
    minimum and maximum sizes, 328, 405
    orphan (ORFans), 405
    potential, 364
operational taxonomic units (OTUs), 225
operons, 19–20, 19F, 319, 341
optimal alignments, 76, 128
    extreme-value distribution, 155, 155F
    global, 128, 129–35, 129F, 130F, 131F
    local, 135–7, 136F
    score significance, 153–6, 154FD
optimization, function, 709–19, 709F
    full search methods, 710
    global, 715–19, 715F
    local, 710–15
ordinary differential equations (ODEs), 683
ORFs see open reading frames
Organismic System Theory, 667
orphan ORFs (ORFans), 405
ORPHEUS program, 323, 372–3
orthogonal encoding, 496
orthologous genes, 239, 242F
    chicken, human and puffer fish genomes, 245, 246F
    to construct species trees, 239–47
    identifying, 243, 245B
    large-scale rearrangements and, 248
orthologous sequences (orthologs), 223
Osguthorpe, David, 422
outgroups, 229F, 230, 258, 291–2
output, 680
output expansion, 500
output layer, 430, 494
overall alignment score, 80
overlapping classification, 638
overlapping genes, 12, 12F, 360
overtraining, neural networks, 498B
OWL database, 109
oxygen, molecular (O₂), 684–5

P

p53 protein, 580–2, 581F, 582F
    identifying interaction sites, 584–5, 584F, 587, 587F
    module, apoptotic pathway, 680, 681F
Pacific Northwest National Laboratory (PNNL), 668
PAIRCOIL program, 453
paired-site tests, 311
pair information, 423, 482
pairwise alignment, 89, 115–61, 116MM
    alignment score significance, 153–6
    complete genome sequences, 156–9
    discarding intermediate calculations, 138B
    dynamic programming algorithms, 127–41, 128FD
    indexing techniques and algorithmic approximations, 141–53, 142FD
    inserting gaps, 86, 86F
    multiple alignments based on, 196–206, 197FD
    secondary structure prediction method using, 430
    substitution matrices and scoring, 117–27, 117FD
    vs multiple alignment, 90, 166–7
pairwise contact potential (PCP), 533
PALSSE method, 466, 466F, 467, 467F, 468
PAM matrices, 82–4, 83F
    derivation, 119–22, 119F
    evolutionary model incorporation, 276
    PET91 version of PAM250, 121F, 122
    selection, 84, 85
    summary score measures, 125F, 126
    vs percentage identity, 120F, 121

paralogous genes (paralogs), 239–42, 242F
    identifying, 243, 245B
parameters
    Bayesian inference, 698
    system, 678, 679, 679F
Parisien, Marc, 466
parsimony methods see unweighted parsimony
partially resolved tree, 227
partitional classification, 638
partition function, 706, 707, 716
partitions
    see also splits
    clustering methods, 637, 638
    hierarchical clustering, 639–41
    k-means clustering, 641–2
    phylogenetic trees, 231
parvalbumin (1B8C), 421F, 422F
path, 179
pathogenicity islands, 342, 402–3
pathways, metabolic see metabolic pathways
patristic distances, 294
PatternHunter program, 159
patterns, 103–11, 104FD, 151B
    see also motifs
    automated generation, 105–7, 106F, 107F
    creating databases, 104–5
    discovery, 165, 166MM, 211–18, 212FD
    protein function and, 109–11
    searching for, 103–4, 107–9, 108F, 109F
Pavesi, Angelo, 362–3
PDB see Protein Data Bank
PDB_SELECT, 416–17, 473
p-distance, 236, 237F, 268–9
    effects of correction, 252F
    Gamma correction, 269F, 270, 270F
    phylogenetic tree reconstruction, 251–2
    Poisson correction, 269F, 270
Pearson, William, 144
Pearson correlation coefficient, 194, 635–6, 636F
peptide bonds, 29–33, 31F
    trans and cis conformations, 32, 33F
percent/percentage identity, 76–7
    BLOSUM matrices and, 84
    homology modeling and, 540–1, 541F, 542F
    limitations, 79–81
    minimum acceptable, 81
    PAM matrices, 120F, 121
percent similarity, 80–1
perceptrons, 430–1, 494
per-comparison error rate (PCER), 658
per-family error rate (PFER), 658
periodicity, 151B
PET91 matrix, 121F, 122
Petersen, Thomas, 499–500, 501
Pfam database, 109
phages, sequenced genomes, 324T
PHAT matrix, 84
PHDhtm program, 442F, 445
PHD program, 424T, 432, 432F
PHDsec program, 499, 501–2, 503
PHI-BLAST program, 108
Phobius method, 509



phosphatidylinositol 3-OH kinase (PI3 kinase) p110α subunit, 557
    alignment, 86, 86F
    homology modeling, 557–64, 563T
    local and global alignment, 89, 89F
    multiple alignment, 91–2, 92F
    protein family profile, 109
    searching sequence databases, 99F, 100, 101F
phosphatidylinositol 3-OH kinase (PI3 kinase) p110γ subunit, 557, 557F, 563, 563T, 564
phosphatidylinositol 3-OH kinases (PI3 kinases), 87B, 557
    multidomain nature, 88, 88F
    patterns and motifs, 106–9, 106F, 107F, 109F
phosphatidylinositol-4-OH kinases (PI4-kinases), 87B, 88F
    patterns and motifs, 106, 107–8
phosphoinositol kinase, 439F, 441
phospholipid kinases, 87B
phosphopeptide-binding proteins, 570–1, 572F
phosphorylation sites, predicting, 110
phosphotyrosine-binding (PTB) domain, 571, 572F
phylogenetic tree reconstruction, 248–64
    assessing tree feature reliability, 307–10, 308FD
    choice of method, 249–51, 251T
    clustering methods, 276–85, 277FD
    data choice, 249
    evaluating topologies, 293–307, 294FD
    evolutionary model choice, 251–5
    multiple alignment as starting point, 255, 260
    multiple topologies, 286–93, 287FD
    practical examples, 255–8, 257F, 258F
    single trees, 276–86, 277FD
    starting trees for further exploration, 285–6
    theoretical basis, 267–311, 268MM
phylogenetic trees, 223–4
    see also guide tree
    additive, 228–9, 229F, 230
    comparing two or more alternative, 310–11
    condensed, 233–4, 233F
    consensus, 234–5, 234F, 291
    fully resolved, 227
    gene see gene trees
    measuring difference between two, 289B
    multifurcating (polytomous), 227, 233, 233F
    partially resolved, 227
    reconciled, 243, 244F
    rooted see rooted trees
    scoring multiple alignments, 200–1, 200F
    species see species trees
    strict consensus, 234–5, 234F
    structure and interpretation, 225–35, 226FD
    substitution matrix derivation from, 82–3, 119F, 120
    topologies see tree topologies
    ultrametric, 229–30, 229F
    unrooted see unrooted trees

phylogenomics, 262
PHYML program, 251, 255
PHYRE program, 535–6, 536F
PI3 kinase see phosphatidylinositol 3-OH kinase
PISSLRE see CDK10 gene
PKN/PRK1 protein kinase, 452, 452F, 453F, 454F
plasmids, 21
platelet-derived growth factor (PDGF), 616–17, 617F
pleckstrin homology (PH) domain, 571, 572F
Pocket-Finder program, 585–6, 586F
point accepted mutations matrices see PAM matrices
Poisson corrected distance, 269F, 270
polyadenylation, 18
    signal detection, 389
polycystein-1-protein, 571F
polypeptide chain, 29, 31–2
    conformational flexibility, 32, 32F
polytomous (multifurcating) trees, 227, 233, 233F
porins, 35, 436
    secondary structure prediction, 450–1, 450F
position-specific scoring matrices (PSSMs), 96, 166, 168–78
    see also profiles
    aligning, 193–5, 194F
    construction, 168–71
    overcoming lack of data, 171–5, 176F
    representation as logos, 177–8, 178F
    secondary structure prediction using, 503, 504, 505F, 514
    sequence weighting schemes, 171, 171F
    using PSI-BLAST, 176–7, 177F
positive-inside rule, 441
positive selection, 240–1B
posterior probability, 698
post-order traversal, 298–9, 298F,

potential energy, 522, 524, 525
    see also force fields
    calculations, 525–6
    functions, 522, 524–9, 706–8
    surface, 525
potentials of mean force, 532–3, 706–7
PPI-PRED program, 584–5, 584F
PRATT program, 108, 109F, 217–18
Predator Multiple Seq., 424T
PREDATOR program, 414, 424T, 428–30
prediction confidence level (PCL), 432
prediction filtering, 484
PRED-TMBB method, 509
Pribnow box, 339, 340
primary structure, 26–7, 27F, 29–33
principal component analysis (PCA)
    application, 618, 619F
    principle, 631–3, 632F, 633F
PRINTS database, 109
prion proteins (PrP), 101B
    chameleon sequences, 37B
    hydrophobic cluster analysis, 110F, 111
    low-complexity regions, 101–2, 102F
prior distribution, 172
prior probabilities, 307, 698
probabilistic approaches
    alignment scoring, 117–19
    pattern discovery, 215–17
    secondary structure prediction, 414
probability
    conditional, 696
    marginal, 696
    posterior, 698
    prior, 307, 698
    statistical tests, 652–3, 653F
probability theory, 695–7
ProbCons method, 200, 203–4, 206
PROCHECK program, 527, 549, 550F, 551T
Prodom database, 58
profile hidden Markov models (HMMs), 109, 179–93, 374
    aligning, 195–6, 195F, 196F
    basic structure, 180–5, 181F, 183F,

        using aligned sequences, 185–7
        using unaligned sequences, 191–3
    path lengths, 185, 185F, 186F
    scoring sequences against, 187–91
profiles, 96, 165–96, 166MM
    see also position-specific scoring matrices
    aligning, 193–6, 193FD
    defining, 167–78, 167FD
    representation as logos, 177–8, 178F



PROF program, 424T, 433F, 434
prof-sim method, 195
PROFtmb program, 450F, 451, 508F, 509
progressive alignment, 198, 204–6, 205F
prokaryotes, 21, 21F
    see also bacteria
    16S RNA, 249
    control of translation, 19–20
    gene detection, 359–60
        algorithms, 368–77, 368FD
        homology searching, 322
        practical aspects, 322–3, 322FD, 323F
        sequence features used, 364–8, 364FD
        vs methods used in eukaryotes, 377–9
    gene structure, 318–19
    genomes, 324T
    promoter prediction, 339–40, 341–2
    regulation of transcription, 15–17, 16F
    tRNA gene detection, 361–2, 362F, 363F
ProMate, 584, 584F
PromFind program, 387–8
Promoter 2.0 algorithm, 340, 341T
PromoterInspector program, 341, 341T, 388
promoter prediction, 338–42, 381–9
    eukaryotes see under eukaryotes
    indefinite nature of results, 341, 341T
    online methods, 340–1
    prokaryotes, 339–40, 341–2
Promoter Recognition Profile, 341
promoters, 15–16, 319
    core (basal) see core promoters
    eukaryotic, 17, 17F, 381
ProScan program, 341, 341T, 386–7
PROSITE database, 105, 107–8, 108F, 109
protease, HIV (HIV-PR), 551–2, 552F
protein(s), 4–5
    concentration measurement, 623
    conformation see conformation
    denatured, 42
    function see function
    hypothetical, 65, 348
    identification of purified, 621–3, 622F
    interactions between atoms, 32B
    localization signals, 111, 111B
    phylogenetic trees, 226, 230
    stability of folded, 41–2
    synthesis see translation

protein backbone see backbone
protein binding sites
    docking procedures, 587–93, 588FD
    finding, 580–7, 581FD
        highlighting clefts or holes, 585–6, 585F, 586F
        residue conservation for, 586–7, 587F
        surface properties for, 584–5, 585F
        useful features for, 582–4
    types, 582
    water molecules, 592–3
Protein Data Bank (PDB), 60, 62F, 102–3, 531
    finding target protein homologs, 543, 557
    PDB_SELECT, 416–17, 473
Protein Domain Parser (PDP), 575, 576
protein expression
    2D gel electrophoresis see two-dimensional gel electrophoresis
    analysis, 612–23, 612FD
        cluster analysis, 615–17, 617F, 618F
        data preparation for, 626–33, 627F, 627FD
        differential, 615, 616F, 617F
        methods, 614–20
        online tools, 620
        principal component analysis, 618, 619F
        statistics, 652–9
        tracking changes over different samples, 618–20, 619F
    clustering methods and statistics, 625–64, 626MM
    databases, 58, 620
    identification of purified proteins, 621–3, 622F
    quantitation, 623
    sample classification, 659–62, 660FD
protein families, 259B
    phylogenetic tree reconstruction, 259–63, 261F, 263F
    profiles of, 109
    protein fold libraries, 573
    topological, 573F, 574
protein folding, 40–1, 41F, 412
    alternative, 486, 491–2, 492F, 493F
    inverse, 530–1
protein fold recognition see threading
protein folds, 40, 41, 411
    classification, 573–4, 573F
    libraries, 531, 532F, 571–4
    prediction in absence of known homologs, 531
    recognition see threading
    structurally different, with similar functions, 570–1, 572F
    structurally similar
        with different functions, 570, 571F, 572F
        unrelated molecules, 529, 530F
protein interaction(s), 580–2
    databases, 58–9
    interactive Web sites, 671–2, 673F, 674F
    maps, 610, 611F
    sites see protein binding sites
protein kinases, 86, 87B
    cAMP-dependent see cyclic AMP-dependent protein kinase
    catalytic subunit (PRKD), 107–8, 107F
    microarrays, 621
    PKN/PRK1, 452, 452F, 453F, 454F
protein microarrays, 621
ProteinProspector program, 622–3
protein–protein interactions
    see also protein interaction(s)
    analysis using clustered data, 610, 611F
    searching for, 584–5, 584F
protein sequence databases, 56–8, 59F
    nomenclature for amino acid uncertainty, 63
protein sequences
    see also amino acid sequences
    comparison with nucleotide sequences, 150–3
    constructing predicted, 343–6, 345
    detection of homology, 75–6
    evolutionary models, 276
    low-complexity regions, 100–2, 151B
    multiple alignments, 92
    obtaining secondary structure from see secondary structure prediction
    phylogenetic tree reconstruction, 249
    scoring of alignment, 76–7, 79–80
    searching for motifs or patterns, 103–4
    searching with, 97–103
    substitution matrices, 82–5, 117–25
protein structure, 25–43, 26FD, 26MM
    classification, 421F, 573–4, 573F
    comparison methods, 574–80, 575FD
    implications for bioinformatics, 37–9, 38FD
    low secondary structure content (low SS), 573F, 574
    modeling see modeling protein structure
    molecular representations, 39, 39F
    native, 522
    primary see primary structure
    quaternary see quaternary conformation
    secondary see secondary structure
    supersecondary, 40B, 529
    tertiary see tertiary protein structure
    three-dimensional see tertiary protein structure
    visualization and computer manipulation, 38–9, 39F
protein subunits, 27, 42–3
Proteobacteria, 249, 255–8, 257F, 258F
proteome, 600, 612
    see also protein expression
    analysis, 612–23, 612FD
proteomics, 600–1
    applications, 601T
    role in systems biology, 668
protocols, 686
ProtScale, 110
prrp program, 206
pseudocounts, 172–3, 176F
pseudo-energy functions, 526–7
pseudogenes, 22B, 73, 73B, 242
pseudoknots, 457
pseudo-torsion angles, 703
PSI-BLAST program, 96–7, 108
    comparative effectiveness, 177, 178T
    homology modeling, 560–1
    PSSM construction, 176–7, 177F
    secondary structure prediction, 433F, 502, 503, 504
PSIMLR method, 514
PSIPRED program, 433F, 434, 434F, 503
    accuracy, 424T, 469, 469F, 472, 503
    homology modeling, 560–1
PSORT programs, 111
PSSMs see position-specific scoring matrices
pSTIING, 58–9, 671–2
    analysis of clustered genes, 610, 611F
    protein interaction networks, 61F, 674F
purifying (negative) selection, 240B
purines, 6, 6F
pyridoxal phosphate-dependent aminotransferases, 570
pyrimidines, 6, 6F
pyruvate formate-lyase, 467F
pyruvate kinase, 480F

Q

Q3, 417–19, 418F, 469
    compared to Sov, 470T, 471–2
    different methods compared, 422, 424T
    GOR method, 422, 423, 484
    nearest-neighbor methods, 491
    neural network methods, 499, 501, 503, 504
    range of values, 469, 469F
Qian, Ning, 496–9, 499F
Q-SiteFinder, 585–6, 586F
quadratic discriminant analysis (QDA), 340, 388, 389F, 396
quality match scores, 200, 203–4
quantum mechanics, 700
quartet-puzzling method, 251T, 305–6, 306F
quaternary conformation, 27, 27F, 42–3, 42F, 43F

R

Ramachandran plots, 33, 34F, 525
    PI3 kinase p110α model, 560, 560F
random error, 627–8
random model, sequence alignment, 117–19
rank-sum test, 656–7
reaction rates, 679–80
reading frames, 13, 13F
    see also open reading frames
    exon prediction and, 325–7, 328F, 329F, 391–2
rearrangements, large-scale, 248
    examples, 158F
    identifying, 156–7, 158F, 159
    rat and mouse X chromosomes, 403–4, 403F
receptor tyrosine kinases (RTKs), 436B
Reciprocal Net database, 52
reconciled trees, 243, 244F
RECON program, 347
records, database, 46–7
reductionist approach, 670F
redundancy, biological systems, 686–8
redundant data, 63
regulatory elements, 15
relational databases, 48, 49–50, 49F
relative entropy, 697
    substitution matrices, 125F, 126
relative mutability, 120
Relenza®, 589B, 591
reliability (confidence index), 432
RELL method, 311
repeated elements, 337B
RepeatMasker program, 347, 378–9
repeat sequences
    see also DNA repeats; low-complexity regions
    annotation, 347
    dot-plots for identifying, 77–8, 79F
    exclusion from analysis, 319–21, 360, 378–9
    SEG for identifying, 151–2B
repressors, 16–17
resolution, 64
response function, 495, 496F
restriction enzymes, type I, 420B
retrotransposons, 337B
REV+G model, 254F, 255–6, 256T
REV (GTR) model, 253T, 255, 262
R factor, 64
Rhodopseudomonas blastica, 450
rhodopsin, 440–1, 440F
    helical wheel representation, 439F, 441
    secondary structure prediction, 441F, 442F, 443, 447F
ribonuclease (RNase), 412
ribonucleic acid see RNA
ribonucleotides, 6
ribose, 5–6
Ribosomal Database Project (RDP) database, 255
ribosomal RNA (rRNA), 13
    see also 16S RNA sequences
    sequences, identifying, 361
    small ribosomal subunit, 249
ribosome-binding sites (RBS), 366F, 380
    absence in eukaryotes, 380, 389
    GeneMark.hmm, 375
    ORPHEUS scoring scheme, 372–3
ribosomes, 13–14, 14F
rice genome, 335B
Riis, Søren, 500–1, 501F, 502–3
RING-finger domains, 575
ring of life, 292B
ritonavir, 589B
Rivera, Maria, 292B
RMSD see root mean square deviation
RNA, 4
    central dogma concept, 10, 10F, 10FD
    functions, 13
    noncoding, detection, 319–21, 361–3, 361FD
    structure, 5, 5FD, 6F, 9–10, 9F
    transcription see transcription
RNA capping, 18
RNAfold, 457F, 458
RNA polymerase II, 17
    promoters, detection, 383–7, 387F
    subunit, 582, 582F
RNA polymerases, 11
    bacterial, 15–17, 339
    eukaryotic, 17–18, 383
RNA secondary structure, 9, 435, 455–6
    prediction, 455–8, 455FD, 456F
    types, 456, 456F
RNA sequences
    databases, 56
    searching with, 97
RNA splicing, 18–19, 18F
    alternative, 19, 380–1
Robinson–Foulds difference see symmetric difference
Robson, Barry, 422, 480
robustness
    biological systems, 683–9, 684FD
    characterization, 690
    as feature of complexity, 684–5
Rocke, David, 627–8
roll, 573F
root, 227, 227F
rooted trees, 227, 227F
    construction, 291–3
root mean square deviation (RMSD), 542
    domain identification, 577
    modeling of loops, 546, 547F
    practical application, 563, 563T
ROSETTA/HMMSTR method, 523B
Rost, Burkhard, 470
rotamer libraries, 547–8
rRNA see ribosomal RNA
Rychlewski, Leszek, 491

S

Saccharomyces cerevisiae, 324, 404, 405
    cDNA array data analysis, 632F
    gene expression microarray database, 611, 612F
SAGA multiple alignment method, 209–11, 210F, 211F
SAGE (serial analysis of gene expression), 604–5, 604F
SAGEmap, 605
Saitou, Naruya, 282
Salzberg, Steven, 489, 491
SAM (significance analysis of microarray method), 656
sample classification, 659–62, 660FD
    see also data classification
    biclustering, 649–50, 650F
    methods available, 660–1
    principal component analysis, 631–3, 632F, 633F
    support vector machines, 661–2, 662F, 663F
sample classifier, 660
SAM program, 182, 184
Sander, Christian, 464–5
sandwich, 573F
Sanger, Frederick, 45
Sanger Institute, 55
Sankoff algorithm, 300–2, 301F
SATCHMO program, 200, 203
scatterplots, protein expression data, 615, 617F
Scherf, Matthias, 388
Schneider, Thomas, 178
SCOP database, 531, 532F, 572–4
scores (alignment), 76, 117
    derivation, 117–19
    expected, 119, 126
    overall, 80
    statistical significance, 153–6, 154FD
scoring schemes/matrices, 75, 76–81
    see also position-specific scoring matrices; substitution matrices
    constructing multiple alignments, 200–4
    selection of appropriate, 126
    theoretical basis, 117–27, 117FD
    threading, 531–3
scrapie, 101B
SCWRL3, 561
searching sequence databases, 93–111, 94FD
    assessing quality of match, 97–100, 99F
    database selection, 102–3
    dealing with low-complexity regions, 100–2
    exon prediction, 397
    patterns and protein function, 109–11
    programs, 94–7
    protein sequence motifs or patterns, 103–7
    using motifs and patterns, 107–9
secondary RNA structure see RNA secondary structure
secondary structure, 27, 27F, 33–6
    see also α-helices; β-strands
    alternative conformations, 486, 486F
    common types, 413–14, 413F
    databases, 60–1
    defining, for prediction algorithms, 463–8
    length distributions, 467, 468, 468F
    local sequence effects, 479–84, 480F
    sequence correlations, 487–8, 487F
secondary structure prediction, 37, 411–59, 412MM
    assessing accuracy, 417–19, 418FD, 469–72
    based on residue propensities, 472–85, 472FD
    coiled coils, 451–4, 452FD
    defining secondary structure, 463–8, 464FD
    expected accuracy, 468
    general data classification techniques, 510–14, 511FD
    hidden Markov models, 504–10, 506FD
    methods of defining structures, 417, 417F
    nearest-neighbor methods see nearest-neighbor methods
    neural network methods see under neural network methods
    specialized methods, 435–58, 435FD
    statistical and knowledge-based methods, 421–30, 421FD
    success application, 420B
    theoretical basis, 461–514, 462MM
    training and test databases, 416–17, 416FD
    transmembrane proteins, 438–51, 438FD
    types of methods available, 413–16, 413FD
second derivative methods, function optimization, 714–15
SEG program, 151–2B
Sejnowski, Terrence, 496–9, 499F
selective pressures, 240–1B
self-information, 423, 482
self-organizing maps (SOMs), 644–6, 644F, 645F
    basic principle, 608, 608F
    biclustering, 650, 650F
    gene expression microarray data, 608–9, 609F, 610
    secondary structure prediction, 513–14, 513F
    vs other clustering methods, 643B
self-organizing tree algorithms (SOTA), 648–9, 648F
    evaluating validity of clusters, 651
    gene expression microarray data, 610, 610F
semiglobal alignment, 132F, 133
semi-Markov model, 374
sense strand, 11–12
sensitivity (Sn)
    exon prediction, 343, 392B
    gene prediction at nucleotide level, 365–6B
separating hyperplane, 661, 662, 662F
sequence alignment, 71–112, 72MM
    see also global alignments; local alignments
    applications, 72
    detection of homology, 74–6
    genome sequences see genome sequence alignments
    homology modeling, 543–4, 544F, 558–9
    inserting gaps, 85–7
    multiple see multiple alignment
    optimal see optimal alignments
    pairwise see pairwise alignment
    principles, 72–6, 73FD
    progressive, 198, 204–6, 205F
    scores see scores (alignment)
    scoring see scoring schemes/matrices
    searching databases see searching sequence databases
    suboptimal, 76
    substitution matrices, 81–5
    types, 87–93, 88FD
sequence analysis, 71, 72MM
    evolutionary conservation and, 38
sequence databases, 55–8
    automated data analysis, 64–5
    gene prediction using, 334–6
    nonredundancy, 62–3
    searching see searching sequence databases
    selecting, 102–3
sequence length
    compositional complexity and, 151B
    homology modeling and, 540–1, 542F
    substitution matrix choice and, 85
sequence motifs see motifs
sequence ontology project (SOP), 55
Sequences Annotated by Structure (SAS), 103
sequence similarity see similarity, sequence
sequence–structure correlations, 487–8, 487F
sequence-to-structure networks, 432, 499–500, 500F
serial analysis of gene expression (SAGE), 604–5, 604F
serine proteases, 570
serotonin N-acetyltransferase, 421F
    secondary structure prediction, 423F
SH2 domains, 78B, 571, 572F
    Cbl protein, 575, 576F
    dot-plot assessment, 77F, 78
    identification, 576–80
    searching sequence databases, 98–100
    sequence alignments, 92, 93F
SH3 domains, 529, 530F
Shannon entropy, 695–6
Shigella flexneri, 262
Shine–Dalgarno sequence, 19, 373
shotgun genome sequencing procedure, 376B
SH test, 311
shuffle test, 534
Sibbald, Peter, 171
side chains, amino acid see amino acid side chains
sigma factors (σ), 339
signaling pathways, 110
    modeling interactions, 681–3, 682F
    network models, 678
signal peptide, 508–9
signal sequences, protein localization, 111, 111B
significance, statistical, 653
SigPath, 692
silent states, 180, 181, 183–4, 184F
similarity, sequence, 74
    dot-plots for assessing, 77–8, 77F
    gene prediction using, 334–6
    homology modeling and, 539–40, 541F
    percent, 80–1
    percent identity for quantifying, 76–7
    scoring, 80, 81
    secondary structure prediction, 488–90
Simon, István, 506–7
SIMPA96 scoring method, 488, 490, 491
simple sequences, 151–2B
    see also low-complexity regions
simplex, 711, 712F
SIM program, 554
simulated annealing, 528–9
    function optimization, 719
single linkage clustering, 640, 641F
singleton sites, 298
Sippl, Manfred, 534, 706
Sippl test, 534
Sjögren–Larsson syndrome (SLS), 351, 351B
Sjölander, Kimmen, 174, 174F
SLAGAN program, 158F, 159
SLIM matrices, 84
small ribosomal subunit rRNA, 249
Smith, Randal, 214
Smith, Temple, 214
Smith–Waterman algorithm, 88–9, 136–7
    database search programs using, 95, 97, 145–6
    discarding intermediate calculations, 138B
    vs PSI-BLAST, 178T
Söding, Johannes, 195F, 196
sodium dodecyl sulfate (SDS), 613
softmax, 495–6
Sokal, Robert, 278
solvation potential, 533
solvents
    see also water molecules
    omission from energetics calculations, 700
    potential terms relating to, 526–7, 707–8
SOMs see self-organizing maps
SOSUI program, 442F, 443, 444F, 447
SOTA see self-organizing tree algorithms
Sov, 417, 419, 419F
    compared to Q3, 470T, 471–2
    derivation, 470–2
    different methods compared, 422, 424T
    GOR method, 423
    range of values, 469F, 472
spaced seed method, 158–9
spacer unit, 496, 500
speciation duplication inference (SDI), 293
speciation events, 226, 239, 242F
species
    reconstructing evolution, 249
    specific databases, 103
species (phylogenetic) trees, 225–30, 227F, 229F
    combined with gene trees, 243, 244F
    effects of gene loss/missing gene data, 242–3, 243F
    orthologous genes for constructing, 239–47, 242F
    vs gene trees, 230, 231F
specificity (Sp)
    exon prediction, 343, 392B
    gene prediction at nucleotide level, 365–6B
spliceosomes, 18
SplicePredictor program, 393–4
splice sites, 18–19
    detection, 337–8, 338F, 379–81, 390
        theoretical basis, 392–6, 395F
    donor and acceptor, 18F, 380F, 392
    variability, 379, 380F
splice variants, 380–1
SpliceView program, 338, 339F
splicing, RNA see RNA splicing
splits
    assessing accuracy, 309
    differences between two trees, 289B
    multiple alignment guide trees, 206, 206F
    phylogenetic trees, 231–2, 232F
Src-homology domains see SH2 domains; SH3 domains
SSAHA program, 158
SSEARCH program, 96T, 97, 100
SSPAL method, 489, 490, 490F, 491
SSpro method, 504, 505F
standard deviation, 652, 653F
    dealing with lack of replicates, 657B
Stanford Microarray Database (SMD), 58, 60F, 611
star decomposition, 285–6
start codons, 13, 19, 318, 367
    E. coli, 366F, 367
    predicting correct, 327, 330F, 333–4, 389
star tree, 200F, 201
start state, 179, 182–3, 183F
states (hidden Markov models), 179, 180, 181
state variables, 679–80
statistical methods
    secondary structure prediction, 414, 415F, 421–30
    transmembrane protein prediction, 443
statistical tests, 625, 626MM, 651–62, 651FD
    importance of variance, 652, 652F
    multiple, controlling error rates, 657–9, 658T
    nonparametric, 656–7
steady state, 690
steepest descent method, 528, 711–13, 713F
step-down Holm method, 658
Stephens, Michael, 178
step-up Hochberg method, 659
stepwise addition, 285–6
steric hindrance, 32
Sternberg, Michael, 206
stop codons, 12T, 13, 19, 318, 367
    detection, 389
Streptococcus protein G, 484F
Streptomyces coelicolor, 643B
strict consensus trees, 234–5, 234F
STRIDE program, 417
STR matrix, 84
Structural Bioinformatics Protein Databank see Protein Data Bank
structural databases, 59–61
    automated data analysis, 64
    checking for data consistency, 63–4
structure, protein see protein structure
Structured Query Language (SQL), 49–50
structure–function relationships, 567–93, 568MM
    docking methods and programs, 587–93, 588FD
    finding binding sites, 580–7, 581FD
    functional conservation, 568–74, 568FD
    structure comparison methods, 574–80, 575FD
structure-to-structure network, 432, 499
Student’s t-distribution, 654, 655
suboptimal alignments, 76, 135–9, 137F
substitution groups, 213
substitution matrices, 81–5
    see also BLOSUM matrices; PAM matrices
    evolutionary models and, 276
    position-specific scoring matrices and, 168–71
    selection of appropriate, 126
    theoretical background, 117–27, 117FD
    threading, 532
subtilisin, 243–4, 244F
subtree pruning and regrafting (SPR), 289B, 290, 290F
subtrees, 230
subunits, protein, 27, 42–3
suffix, 142
suffix trees, 141–3, 143F
    whole genome sequences, 158
sum-of-pairs (SP), scoring multiple alignments, 200F, 201
superfamilies, 259, 259B
    phylogenetic tree reconstruction, 259–63, 261F, 263F
    protein fold libraries, 573
superkingdoms, 21
supersecondary structures, 40B, 529
supervised learning, 497B, 638
support vector machines (SVMs)
    sample classification, 661–2, 662F, 663F
    secondary structure prediction, 511–12, 512F, 513F
survivin, 583, 583F
S-values
    branch-and-bound method, 288
    maximum-likelihood methods, 287
    minimum evolution method, 297
    optimizing tree topologies, 288, 290, 291, 291F, 293
    parsimony methods, 287, 293, 297–9, 301
    starting trees, 286
SWISS-2D-PAGE, 620
Swiss Institute for Bioinformatics (SIB), 620
Swiss-Model, 552, 554, 561–3, 562F
Swiss-Pdb Viewer, 542, 557–60, 558F, 559F, 562–3
Swiss-Prot database, 54, 56–8, 59F, 102–3
    manual annotation, 65
    pattern and motif searching, 105, 106–8
    searching, 98–100, 99F, 101F
    vs PSI-BLAST, 178T
switches, bistable, 688–9, 689F
symmetric difference, 289, 289B, 291
SYM model, 253T
synonymous mutations, 238, 240–1B, 245
syntenic regions, 248, 403–4, 404F
systematic errors, 625, 627–8
systems, biological, 669–78, 669FD
    see also networks
    bistable switches, 688–9, 689F
    concept, 669–70, 670F, 671F
    control circuits, 680, 680F
    information needed to construct, 671–4
    mathematical modeling approaches, 674–7, 676F
    mathematical representation of interactions, 680–3
    modularity, 685–6
    network properties, 670–1
    redundancy, 686–8
    robustness, 683–9, 684FD
    standardized description, 692
    storing and running models, 689–92, 689FD
systems biology, 667–93, 668MM
    model types used, 678
    structure of model, 679–83, 679FD
    system properties, 683–9, 684FD
    Web-based tool and databases, 671–2, 675T
Systems Biology Markup Language (SBML), 692

T

Tamura-Nei (TN) model, 253T
target protein, 527
    alignment with template, 543–4, 544F
    finding structural homologs, 543, 557
    similarity to template, 539–40
TATA-binding protein (TBP), 17
TATA box, 17, 383
    Bucher weight matrix, 383, 384, 384F
    detection, 383–7, 389
    genes lacking, 381, 383
    GenScan prediction method, 385, 385F
    NNPP prediction method, 385–6, 386F
taxa, 225
Taylor, Willie, 276
tblastx, 96, 150
T-Coffee program, 203, 204F
temperature
    biological systems, 679–80
    molecular dynamics simulations, 718
    simulated annealing, 529, 719
template protein, 527, 542–3
    alignment with target, 543–4, 544F
    locating, 543, 557
    similarity to target, 539–40
terminator signal, 16
tertiary contact (TC) measure, 491–2,

tertiary protein structure, 27, 27F, 40–2
    see also protein folds
    analyzing function from see structure–function relationships
    experimental methods of determining, 521
    modeling see modeling protein structure
    visualization and computer manipulation, 38–9, 39F
test dataset, 416–17
test statistic, 652, 653F
tetramers, 43
thermodynamic simulation, and global optimization, 715–19, 715F
thermodynamic stability, folded proteins, 41–2
thiamine diphosphate (TDP), 259B, 260
Thornton, Janet, 276, 475
THREADER program, 707
threading (fold recognition), 523–4, 529–37, 530FD
    assessing confidence of prediction, 534–5, 535F
    dynamic programming methods, 533–4, 534F
    libraries of protein folds, 531
    potentials used, 706–8
    practical example, 535–7, 536F, 537F
    procedure, 530–1, 531F
    pseudo-energy functions, 527
    scoring schemes, 531–3
three-dimensional protein structure see tertiary protein structure
thymine (T), 6, 6F
Tie, Jien-Ke, 449B
TIM barrel folds, 570, 570F, 573F
    differing functions, 570, 572F
time-delay neural network (TDNN), 385–6, 386F
TMAP program, 442F, 444, 447
TMbase, 443
TMHMM server, 446, 446F, 447F, 507–9
    assessing accuracy, 471F, 472
    comparative results, 442F
TMpred program, 442F, 443
Toll-like receptor, 608
top-down approach, modeling biological systems, 676–7, 677F
topological families, 573F, 574
topological models, 678
TopPred program, 441, 442
torsion angle potential, 703, 703F
torsion (dihedral) angles, 29–33
    amino acid side chains (χ1, χ2, etc), 547, 548F
    Cα chain (φ, ψ), 29–32, 32F
        ideal β-strands, 36F
        Ramachandran plots, 33, 34F
        secondary structure prediction, 417, 466, 466F, 503–4, 504F, 505F
    improper, 703
    peptide bond (ω), 31–2, 32F
traceback, 132, 136, 138B, 300
training, neural networks, 496, 497–8B
training dataset, 416–17
trans conformation, 32, 33F
transcription, 11–12, 11F
    regulation, 15–18, 16F, 17F
    stop signals, detection, 389
transcription (initiation) factors
    binding sites, 381, 386
        detection algorithm, 386–7
    general, 17
    leucine zipper, 413, 451
transcription start site (TSS), 15–16, 16F, 17F
    prediction, 338–9, 340, 381–9
transcriptome, 600
transfer function see response function
transfer RNA (tRNA), 13
    base modifications, 7
    function in translation, 13–14
    gene detection methods, 320–1, 320F, 361–3
    secondary structure prediction, 457F, 458
    structure, 14F
transition mutations, 237–8, 238F
transitions, hidden Markov models, 179, 180, 181, 181F
transition/transversion ratio (R), 237–8
    calculation, 274–5B
    weighted parsimony method, 300, 300F
translation, 13–14, 14F
    control, 19–20
    genetic code, 12–13, 12T
    predicted exons, 343, 344F, 345, 345
    start sites, prediction, 389
    stop signals see stop codons
translation initiation factor 5A (1BKB), 421F
    secondary structure prediction, 422F
TRANSLATOR program, 345
translocation, 158F
transmembrane β-barrels, prediction, 448–50, 450F, 508F, 509
transmembrane helices, 436
    amino acid propensities, 475–6, 478F
    helical wheel diagrams, 439F, 440–1
    length distribution, 468, 468F
    prediction, 439–48
        algorithms available, 441–7
        assessing accuracy, 471F, 472
        based on residue propensities, 477–8, 479, 479F
        comparing results, 447–8
        example, 449B
        hidden Markov models, 506–9, 507F
        using evolutionary information, 444–5
    three-dimensional structure, 440F
transmembrane proteins, 435, 436–51
    7-transmembrane spanning superfamily, 436B
    bitopic and polytopic, 437, 437F
    functional importance, 436B
    hydrophobicity scales and, 437–8
    prediction, 438–51, 438FD
        example, 449B
        hidden Markov models, 506–9
    structural elements, 437T
transmissible spongiform encephalopathies, 101B
transport systems, 669–70, 670F
transposons, 22B, 336, 337B
transversion mutations, 237–8, 238F
transversion parsimony, 300
tree bisection and reconnection (TBR), 289B, 290–1
tree methods, multiple alignment, 90–1, 90F, 200–1
tree of life, 20–3, 20F, 21F, 38F
    horizontal gene transfer within, 246F, 247
    origins, 292B
tree topologies, 227–8, 228B
    comparing, 232–5, 233F, 234F
    describing, 230–2, 232F
    evaluating, 293–307, 294FD
    generating initial, 285–6
    generating multiple, 286–93, 287FD
    interior branch examination, 309–10
    measuring difference between two, 289B
TrEMBL, 102–3
tricarboxylic acid (TCA) cycle, 685, 686F, 687F
trimers, 43
tRNA see transfer RNA
tRNAscan algorithm, 321, 361–2, 362F, 363F
tRNAscan-SE algorithm, 362–3
TSSG algorithm, 340, 341T
TSSW algorithm, 340, 341, 341T
t-statistic, 654, 655
t-test, 654–5, 656T
    modifications, 657–9
tumors
    invasion, mathematical modeling, 676–7, 677F
    sample classification, 662, 663F
turns, 36–7, 37F
    see also β-turns
    amino acid preferences, 37
Tusnády, Gábor, 506–7
twilight zone, 81
TWINSCAN program, 331T, 332T, 336–7
two-dimensional (2D) gel electrophoresis, 600, 613–20
    see also protein expression
    analysis of data, 614–20
        clustering, 615–17, 617F, 618F
        differential protein expression, 615, 616F, 617F
        measuring expression levels, 614–15
        principal component analysis, 618, 619F
    identification of separated proteins, 621–3, 622F
    spot detection, 614, 614F
    technique, 613–14, 613F
two-hit method, 149
two-tailed test, 653, 653F
type I error, 653, 658

U

ubiquitin ligases, 575
UGA codon, 23
ultrametric trees, 229–30, 229F
UniGene database, 103, 605–6, 605F
UniProtKB, 56–8, 65
units
    see also nodes
    neural network, 430–1, 494–5, 495F
unrooted trees, 227, 227F
    generation, 286–91
unsupervised learning, 638, 644
untranslated regions (UTRs), 325F, 379
    detection, 390, 396–7
unweighted parsimony, 297–300, 299F
UPGMA method, 199, 250, 251T, 608, 639
    practical application, 256–8, 258F
    theoretical basis, 278–9, 279F, 640
    vs Fitch–Margoliash, 280
UPGMC method, 640
upstream sequences, 16
URL, 53
UTRs see untranslated regions
Uzzell, Thomas, 270

V

van der Waals interactions, 32B
van der Waals terms, 705
variable region, 555B
variance, 626, 652, 653–4
    importance in statistical testing, 652, 652F
Vector Alignment Search Tool (VAST), 577–8, 579F
Venn diagram, amino acid conservation, 426, 428F
Venter, J. Craig, 376B
Viagra, 589B
virtual heart project, 677
virulence factors, 341–2
viruses, 21
    overlapping genes, 12, 360
    sequenced genomes, 324T
VISTA program, 353–4, 353F, 354F
vitamin K epoxide reductase (VKOR), 449B
Viterbi algorithm, 188–9
von Bertalanffy, Ludwig, 667
von Heijne, G., 441, 442

W

Waddell, Peter, 296, 296F
Waterman, M.S., 136, 154
water molecules, 700
    see also solvents
    ligand–protein docking and, 592–3
Watson, James, 7
Watson–Crick base-pairing, 7–9, 8F
weight matrices
    Bucher, 383–4, 384F
    splice site prediction, 394
weight sharing, neural networks, 500–1, 501F
Welch’s t-test, 655
WHAT_CHECK program, 549–50
WHAT-IF program, 549, 551T
whole-genome alignment, 156–9, 157FD
    see also genome sequence alignments
Wilcoxon test, 656–7
Wilkins, Maurice, 7, 7F
windows (sequence), 476–9
    GOR methods, 422–3
    nearest-neighbor methods, 428, 486, 487F, 489
    neural network methods, 431
    support vector machines, 511
winner takes all strategy, 495
wobble base-pairing, 14
Woese, Carl, 249
Wood, Valerie, 405
words, 95, 141
WormBase, 399
Wu-BLAST, 95
Wunsch, C.D., 87, 128

X

X chromosomes, mouse and rat, 403–4, 403F
X-drop method, 139F, 140–1, 140F
xenologous genes, 247
XHTML (eXtensible hypertext markup language), 50–1
XML (eXtensible markup language), 50–1
xProfiler, 605
Xquery, 51
X-ray crystallography, 411, 521
X-SITE program, 591, 592F

Y

YASPIN, 509, 509F
YBL036C hypothetical protein (1CT5), 421F
    secondary structure prediction, 423F
Yi, Tau-Mu, 488, 491
Yona, Golan, 195

Z

Zmasek, Christian, 293
Zpred program, 425–7, 484, 485
    accuracy, 424T
    amino acid properties used, 426, 428F, 429T
    conservation values, 426, 427F, 428F, 429T
z-statistic, 577, 578F, 654
z-test, 309, 653–4
Zvelebil conservation number, 426
Zviling hydrophobicity scale, 477T