Bioinformatics Tools
Byoung-Tak Zhang and Chul Joo Kang
School of Computer Science and Engineering
Seoul National University
(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 2
Contents
1. Sequence Alignment
2. Multiple Sequence Alignment
3. Pattern Finding
4. Structure Prediction
5. DNA Microarray
6. Major Tools for Proteomics
Bioinformatics Tools
Problems                       Tools or Database
Sequence Alignment             BLAST, FASTA
Multiple Sequence Alignment    Clustal W, Macaw
Pattern Finding                GRAIL, FGENEH, tRNAscan-SE, NNPP, eMOTIF, PROSITE, ChloroP
Structure Prediction           Bend.it, RNA Draw, NNPREDICT, SWISS-MODEL
DNA Microarray                 GeneX, GOE, MAT, GeNet
Hardware for Proteomics        2D Gel, MALDI-TOF
Bioinformatics Based on Sequence
• Genome sequencing projects: produce huge amounts of sequence data
• Retrieving useful information from sequence data
  - Find genes and other elements
  - Classification
  - Predict function
  - Others
• Read the articles in Nature and Science
1. Sequence Alignment
Pairwise Sequence Alignment
• A basic task in handling sequences
• Alignment between two sequences
• Sequence database search
• Finding similar DNA or protein sequences
• Smith-Waterman algorithm
• BLAST
• FASTA
Smith-Waterman Algorithm (1)
• Finding local alignments
• Using dynamic programming
[Figure: dynamic programming matrix; extracted labels: TCAT*G*CATTG]
Smith-Waterman Algorithm (2)
• Gives the optimal local alignment, but is slow
• Can find alignments missed by BLAST or FASTA
• Available on the web: spiral.genes.nig.ac.jp/homolgy/ssearch-e.shtml
• Ref: Smith, T. F. and Waterman, M. S. (1981) Identification of common molecular subsequences. Journal of Molecular Biology, 147: 195-197.
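The local-alignment recurrence can be sketched in a few lines of Python. This is a minimal illustration with a linear gap penalty, not the implementation behind the web service listed above; the function name `smith_waterman` and the scoring values (match = 2, mismatch = -1, gap = -1) are assumptions chosen for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for one best local alignment.

    Scoring values are illustrative assumptions, not taken from the slides.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1] and b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                      # start a new alignment
                          H[i - 1][j - 1] + s,    # match or mismatch
                          H[i - 1][j] + gap,      # gap in b
                          H[i][j - 1] + gap)      # gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

With these assumed scores, `smith_waterman("TCATG", "CATTG")` returns `(7, "CA-TG", "CATTG")`. The nested loop fills every cell of the matrix, which is why the method is slow on whole databases.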
BLAST & FASTA
• Using heuristic algorithms
• Word-based matching
• Faster than Smith-Waterman
• BLAST: www.ncbi.nlm.nih.gov/BLAST
  Ref: Altschul, S. F., et al. (1990) Basic local alignment search tool. Journal of Molecular Biology, 215, 403-410.
• FASTA: www.ebi.ac.uk/fasta3
  Ref: Wilbur, W. J. and Lipman, D. J. (1983) Rapid similarity searches of nucleic acid and protein data banks. Proceedings of the National Academy of Sciences USA 80, 726-30.
The BLAST Search
For the query, find the list of high-scoring words of length W.
Compare the word list to the database and identify exact matches.
For each word match, extend the alignment in both directions to find alignments that score greater than a threshold value S.
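The three steps above can be sketched as a toy seed-and-extend search. This is a simplification under stated assumptions, not the NCBI implementation: it uses only the exact words of the query (real BLAST also scores similar "neighborhood" words), extends without gaps, and stops an extension once the score falls `xdrop` below the best seen. All function names and default values are hypothetical.

```python
def find_seeds(query, subject, w=3):
    """Yield (query_pos, subject_pos) for every exact shared word of length w."""
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], []):
            yield i, j

def _extend(query, subject, qi, sj, step, match, mismatch, xdrop):
    """Walk from (qi, sj) in direction step; return (best gain, length at best)."""
    score = best = best_len = length = 0
    while 0 <= qi < len(query) and 0 <= sj < len(subject):
        score += match if query[qi] == subject[sj] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score >= xdrop:
            break  # score dropped too far below the best: give up
        qi += step
        sj += step
    return best, best_len

def blast_like(query, subject, w=3, threshold=5, match=1, mismatch=-1, xdrop=2):
    """Return sorted (score, query_start, subject_start, length) hits >= threshold."""
    hits = set()
    for qi, sj in find_seeds(query, subject, w):
        right, r_len = _extend(query, subject, qi + w, sj + w, 1, match, mismatch, xdrop)
        left, l_len = _extend(query, subject, qi - 1, sj - 1, -1, match, mismatch, xdrop)
        score = w * match + right + left
        if score >= threshold:
            hits.add((score, qi - l_len, sj - l_len, l_len + w + r_len))
    return sorted(hits, reverse=True)
```

For example, `blast_like("GATTACA", "CCGATTACATT")` returns `[(7, 0, 2, 7)]`: several seeds on the same diagonal extend to the same 7-letter match, while the isolated seed near the end of the subject stays below the threshold.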
BLAST Result
Program: BLASTP
Query: human MYB binding protein
Database: Swiss-Prot
2. Multiple Sequence Alignment
Multiple Sequence Alignment
• Aligning multiple sequences
• Finding well-conserved regions
  - Finding motifs or other elements
• Building gene families
• Constructing a phylogenetic tree
• Using full dynamic programming?
  - O(n^k) problem (n = sequence length, k = number of sequences)
• Clustal W
• Macaw
Clustal W
• Step 1: Pairwise alignment between every two sequences
• Step 2: Making sequence weights and clusters
• Step 3: Alignment between the most similar sequences or clusters
• Step 4: Producing a single output alignment
Ref: Higgins, D. G. and Sharp, P. M. (1988) CLUSTAL: A package for performing multiple sequence alignment on a microcomputer. Gene 73, 237-244.
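Steps 1 to 3 amount to building a distance matrix and then joining the closest sequences or clusters first. The sketch below illustrates only that ordering idea and is not Clustal W itself: the k-mer Jaccard distance is a crude stand-in for a distance derived from pairwise alignments, average linkage is an assumed clustering rule, and all names are hypothetical.

```python
from itertools import combinations

def kmer_set(seq, k=2):
    """Set of all overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def distance(a, b, k=2):
    """1 - Jaccard similarity of k-mer sets (assumed stand-in for an alignment distance)."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    return 1.0 - len(ka & kb) / len(ka | kb)

def cluster_distance(c1, c2, seqs):
    """Average pairwise distance between the members of two clusters."""
    total = sum(distance(seqs[x], seqs[y]) for x in c1 for y in c2)
    return total / (len(c1) * len(c2))

def merge_order(seqs):
    """Return the order in which sequences/clusters would be joined.

    seqs maps a name to a sequence; each merge is a pair of name tuples.
    """
    clusters = [(name,) for name in seqs]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(pair[0], pair[1], seqs))
        merges.append((c1, c2))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges
```

For `{"a": "ACGTACGT", "b": "ACGTACGA", "c": "TTTTGGGG"}` the first merge joins `a` and `b`, and `c` is added last. A progressive aligner would align in this order, which avoids the O(n^k) cost of aligning all sequences at once.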
Multiple Alignment by Clustal W
Phylogenetic Tree Construction Based on Clustal W Results
3. Pattern Finding
Pattern Finding
• DNA sequences carry all the information about their products
  - Gene finding
  - Other sequence element finding
  - Motif finding
  - Localization prediction
  - And others
Gene Structure
Gene Finding
• From an unannotated DNA sequence, find putative expressed regions
• Distinguish exon and intron regions (in eukaryotes)
• Solutions
  - Compare with known genes (alignments)
  - Simply find conserved regions
  - Use statistical models such as hidden Markov models
GRAIL: Find Exon and Intron
• In eukaryotes, RNA is processed.
• Introns: spliced out
• Exons: joined together
• GU-AG rule
• GRAIL: exon prediction from a genomic sequence with RepeatMasker filtering
• Neural networks that combine a series of coding prediction algorithms
• http://compbio.ornl.gov/Grail-1.3
• Ref: Einstein et al. (1992) Computer-based construction of gene models using the GRAIL gene assembly program. Oak Ridge Natl. Lab. Report TM-12174.
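The GU-AG rule says that an intron begins with GU and ends with AG in the RNA, which appears as GT ... AG on the coding strand of the DNA. The sketch below only enumerates spans that satisfy this rule and shows the splicing step; it is not how GRAIL works, which combines several coding measures in a neural network. The function names and the `min_len` cutoff are assumptions for illustration.

```python
def candidate_introns(dna, min_len=6):
    """Return (start, end) spans that begin with GT and end with AG.

    Spans are half-open intervals on the DNA coding strand.
    """
    dna = dna.upper()
    donors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "GT"]
    acceptors = [j + 2 for j in range(len(dna) - 1) if dna[j:j + 2] == "AG"]
    return [(i, j) for i in donors for j in acceptors if j - i >= min_len]

def splice(dna, introns):
    """Remove non-overlapping (start, end) introns and join the remaining exons."""
    out, pos = [], 0
    for start, end in sorted(introns):
        out.append(dna[pos:start])
        pos = end
    out.append(dna[pos:])
    return "".join(out)
```

For the made-up sequence `"ATGGCCGTAAGTTTCAGGAATAA"`, `candidate_introns` returns `[(6, 17), (10, 17)]`, and `splice(dna, [(6, 17)])` gives `"ATGGCCGAATAA"`. Even this tiny example yields more than one candidate, which is why real gene finders need additional evidence to choose among them.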
GRAIL Results
FGENEH: Prediction of Multiple Genes in Human DNA Sequences
• Finding genes from DNA sequences
• HMM-based human gene structure prediction
• Available on the web: http://searchlauncher.bcm.tmc.edu/seq-search/gene-search.html
• Ref: According to the tool's authors, the algorithm is trained on their own data and is based on an HMM similar to Genscan (Burge, C. and Karlin, S. (1997) J. Mol. Biol. 268, 78-94) and Genie (Kulp, D., Haussler, D., Reese, M. G., and Eeckman, F. H. (1996) Proc. Conf. on Intelligent Systems in Molecular Biology '96, 134-142).
FGENEH: Prediction of Multiple Genes in Human DNA Sequences
tRNAscan-SE: Find tRNA Genes
• Finding tRNA genes in genomic sequence
• Finding pol III promoter sites and searching for conserved secondary structures: an easier problem than finding protein-coding genes
• Combining several earlier programs and algorithms
• http://www.genetics.wustl.edu/eddy/tRNAscan-SE/
• Ref: Lowe, T. M. and Eddy, S. R. (1997) tRNAscan-SE: A program for improved detection of transfer RNA genes in genomic sequence. Nucl. Acids Res. 25: 955-964.
tRNAscan-SE Results
Sequence Element Finding
* Promoter region of the E. coli lacZ operon
• Promoter and other elements: regulatory and other protein binding sites
NNPP: Neural Network Promoter Prediction
• Finding promoter sequences in DNA sequences
• Time-delay neural networks
• Two feature layers
  - Recognizing TATA boxes
  - Recognizing initiators
• http://www.fruitfly.org/seq_tools/promoter.html
Neural Network Promoter Prediction
Motif Finding
• Find conserved residues in DNA or protein sequences.
• Commonly, conserved residues are related to protein function.
Motif of one class of zinc finger protein (C2H2): Cx{2,4}Cx{12}Hx{3,5}H
Solutions:
• Finding consensus sequences
• Alignment
• Position specific weight matrix
• Hidden Markov models
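The C2H2 pattern above maps almost directly onto a regular expression: each `x` stands for any residue and `{m,n}` is a repeat count. The sketch below performs that translation with Python's standard `re` module; the helper names and the test protein are made up for illustration, and the notation handled is the one used on this slide rather than the full PROSITE pattern syntax.

```python
import re

def motif_to_regex(motif):
    """Translate the slide's motif notation into a compiled regex.

    'x' (any residue) becomes '.'; residue letters and {m,n} counts are kept.
    """
    return re.compile(motif.replace("x", "."))

C2H2 = motif_to_regex("Cx{2,4}Cx{12}Hx{3,5}H")

def find_motif(pattern, protein):
    """Return (start, matched_text) for each non-overlapping match."""
    return [(m.start(), m.group()) for m in pattern.finditer(protein)]
```

For the invented sequence `"MKCPECGKSFSQKSNLQKHQRTHTG"`, `find_motif(C2H2, ...)` returns `[(2, "CPECGKSFSQKSNLQKHQRTH")]`. A consensus pattern like this is strict; weight matrices and hidden Markov models relax it by scoring each position instead of requiring an exact match.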
Motif Databases: eMOTIF, PROSITE
• PROSITE, eMOTIF: protein motif databases
• eMOTIF (motif.stanford.edu/emotif)
  - eMOTIF Maker: builds new motifs from a multiple alignment
  - eMOTIF Scan: finds occurrences of a motif pattern
  - eMOTIF Search: finds motifs in a protein sequence
• Ref: Huang, J. Y. and Brutlag, D. L. (2001) The eMOTIF Database. Nucleic Acids Research, 29(1), 202-204.
  Nevill-Manning, C. G., et al. (1998) Highly specific protein sequence motifs for genome analysis. Proc. Natl. Acad. Sci. 95: 5865-5871.
eMOTIF (1)
eMOTIF (2)
eMOTIF (3)
PROSITE (1)
• PROSITE: database of protein families and domains (http://www.expasy.ch/prosite/)
• Consists of biologically significant sites, patterns, and profiles that help to reliably identify which known protein family (if any) a new sequence belongs to.
PROSITE (2)
e.g., C2H2 zinc finger protein
ScanPROSITE
• Compares query (protein) sequences to PROSITE
• Finds patterns
Others: Find Gene Destination
• After translation, proteins are transported to their proper locations (nucleus, mitochondria, outside of the cell, …)
• This process is determined by the amino acid sequences of the proteins
  - Finding signal peptides
  - Predicting protein targets
ChloroP: Prediction of Chloroplast Importing Protein (1)
• Predicting chloroplast-targeting transit peptides
• Predicting cleavage sites
• Neural network based method
• Ref: Emanuelsson, O., Nielsen, H., and von Heijne, G. (1999) A neural network-based method for predicting chloroplast transit peptides and their cleavage sites. Protein Science 8, 978-984.
• Service site: http://www.cbs.dtu.dk/services/ChloroP/
ChloroP: Prediction of Chloroplast Importing Protein (2)
4. Structure Prediction
Structure Prediction
• DNA, RNA, and protein structure prediction
• These structures affect function, stability, etc.
• Protein structure prediction by calculating the biochemical properties of each amino acid
  - For most protein secondary or tertiary structure predictions, this is an enormous computing job (almost impossible)
• Prediction based on known structures
DNA & RNA Structure
• DNA and RNA strands are dynamic, not fixed linear forms
• DNA: regionally denatured (breathing), bending
  - Affects its expression and more
• RNA: forms intra-strand base pairs
  - Affects its stability, function (RNase), and more
DNA Bending
• DNA structure is flexible.
• The DNA double strand is bent by protein binding.
• Bending of the DNA structure influences the binding of other proteins to it.
Bend.it: Predict DNA Bendability
• Predicting DNA curvature from DNA sequences
• The curvature is calculated as a vector sum of dinucleotide geometries (roll, tilt, and twist angles).
• Expressed as degrees per helical turn.
• Ref: Munteanu, M. G., et al. (1998) Rod models of DNA: sequence-dependent anisotropic elastic modelling of local bending phenomena. Trends Biochem. Sci. 23(9), 341-346.
• Service site: http://www2.icgeb.trieste.it/~dna/bend_it.html
RNA Draw: Predict RNA Secondary Structure on PC (1)
• Predicting RNA secondary conformation under a given energy state
• Using dynamic programming, base pair probabilities, and energy parameters
Clover leaf structure of tRNA and its 3D structure
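To show how dynamic programming applies to RNA folding, the sketch below uses the classic Nussinov recurrence, which simply maximizes the number of base pairs. This is a deliberately simpler technique than the energy-based folding RNA Draw performs: it ignores energy parameters and base pair probabilities entirely. The pairing table (including G-U wobble pairs) and the minimum hairpin loop size of 3 are assumptions for illustration.

```python
# Allowed pairs: Watson-Crick plus G-U wobble (an assumption for this sketch).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def nussinov(rna, min_loop=3):
    """Return the maximum number of nested base pairs in rna.

    min_loop is the minimum number of unpaired bases inside a hairpin.
    """
    n = len(rna)
    if n == 0:
        return 0
    # dp[i][j] = maximum number of pairs within rna[i..j] (inclusive)
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]                      # case 1: j is unpaired
            for k in range(i, j - min_loop):         # case 2: j pairs with k
                if (rna[k], rna[j]) in PAIRS:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]
```

For example, `nussinov("GGGAAAUCC")` returns 3, corresponding to a hairpin whose stem pairs the three leading G's with U, C, and C around an AAA loop. Energy-based methods use the same table-filling scheme but replace the pair count with free-energy terms.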
RNA Draw: Predict RNA Secondary Structure on PC (2)
E. coli tRNA-Ile predicted structure at 37 °C
Protein Structure
• Protein function is determined by its structure (e.g., an enzyme is inactivated by heat because heat changes its structure).
Primary structure (amino acid sequence) determines secondary structure (local structure: α-helix, β-sheet) and tertiary structure.
Secondary Structure Prediction
• Prediction of secondary structure
  - α-helix
  - β-sheet
• Can be helpful for studying 3D conformation.
• Solutions
  - Calculate the “propensity” of each amino acid to be in a helix, sheet, …
  - Calculate “information content”
  - Neural network based methods
  - Nearest neighbor
NNPREDICT: Protein Secondary Structure Prediction
• Two-layer, feed-forward neural network
• Finding secondary structure from amino acid sequences
• http://www.cmpharm.ucsf.edu/~nomi/nnpredict.html
Tertiary Structure Prediction
• Using energy states? Simply selecting the lowest-energy state is computationally infeasible
• Using homology modeling: similar sequence = similar structure
SWISS-MODEL
• Predicting protein 3D structure
Swiss-Pdb Viewer
SWISS-MODEL
• Ref: Guex, N. and Peitsch, M. C. (1997) SWISS-MODEL and the Swiss-PdbViewer: An environment for comparative protein modelling. Electrophoresis 18: 2714-2723.
  Guex, N. and Peitsch, M. C. (1999) Molecular modelling of proteins. Immunology News 6: 132-134.
• Available on the web: http://www.expasy.ch/swissmod/SWISS-MODEL.html
5. DNA Microarray
DNA Microarray (1)
• Monitoring of gene expression
• cDNA chip: spotting cDNAs
• Oligonucleotide chip: spotting short DNA sequences
DNA Microarray: cDNA chip (2)
• DNAs are spotted on glass surfaces.
• Applications: drug effects, drug metabolism, disease diagnosis, finding gene pathways, finding disease genes, etc.
DNA Microarray: oligonucleotide chip (3)
• Short (10~25 nt) oligonucleotides on chip
• Applications: finding SNPs, detecting mutations, sequencing by chip, genotyping
DNA Microarray (4)
• Processing DNA chip images to obtain reliable, quantified data
• Data mining of DNA chip data, e.g., clustering (hierarchical algorithms, partitioning algorithms, …)
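As one example of a partitioning algorithm, the sketch below runs plain k-means on expression profiles (one list of measurements per gene). It is a minimal illustration rather than a recommended analysis pipeline: the initial centroids are simply the first k profiles, squared Euclidean distance is assumed, and the data in the usage note are invented.

```python
def kmeans(profiles, k, iterations=20):
    """Cluster equal-length expression profiles; return one label per profile.

    Centroids are initialised to the first k profiles (a simplifying assumption).
    """
    centroids = [list(p) for p in profiles[:k]]
    labels = [0] * len(profiles)
    for _ in range(iterations):
        # Assignment step: each profile goes to its nearest centroid.
        labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in profiles
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(profiles, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

With four hypothetical genes measured under three conditions, `kmeans([[1.0, 2.0, 3.0], [-1.0, -2.0, -3.0], [1.1, 2.1, 2.9], [-0.9, -2.2, -3.1]], k=2)` returns `[0, 1, 0, 1]`, grouping the two induced genes apart from the two repressed ones.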
DNA Microarray (4)
• Ref: Schena, M., et al. (1996) Parallel human genome analysis: Microarray-based expression monitoring of 1000 genes. Proc. Natl. Acad. Sci. USA 93(20): 10614-9.
• Useful websites:
  - www.affimatrix.com
  - www.genechip.co.kr
  - inkage.rockefeller.edu/wli/microarray
  - cmgm.stanford.edu/~plf58/microArray
6. Major Tools for Proteomics
Major Tools for Proteomics: 2D Gel & MALDI-TOF
• Proteomics:
  - Protein expression profile
  - Protein interaction
  - Protein structure
  - Protein variation
• Why proteomics?
  - Post-genomics era: understanding functions
  - Applications in target discovery
  - Novel proteins and genes
  - Disease-associated proteins
  - Elucidating pathways and processes in disease/cell biology
2-D Gel Electrophoresis
• Separates proteins by their net charge and molecular weight.
• Displays all proteins in a sample at once
• Like a DNA chip, compares expression under different treatments and detects tissue- or cell-specific expression.
  - Finding specific proteins for further study
• Poor reproducibility
2-D Gel Image of Human Liver
2-D Electrophoretic Gel Images
MALDI-TOF: Matrix-Assisted Laser Desorption Ionization
• Detection of protein molecular weights (mass spectrometer)
• Can determine protein sequence.
Other bioinformatic work
• Sequence assembly
  - From scattered DNA sequences, construct the full sequence
• Specific database warehousing
  - Structural databases, protein block databases, drug databases
• SNPs
  - Find single nucleotide polymorphisms in a population
• Construct gene-to-gene, protein-to-protein, and gene-to-protein interaction pathways
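The sequence assembly item can be illustrated with a toy greedy assembler that repeatedly merges the two fragments sharing the longest suffix-prefix overlap. This is only a sketch of the idea: it assumes error-free reads from a single strand, and the function names and the minimum overlap of 3 are hypothetical choices.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b, or 0."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Merge reads by largest overlap first; return the remaining contigs."""
    reads = list(reads)
    while len(reads) > 1:
        # Find the ordered pair (i, j) with the largest overlap.
        n, i, j = max(
            ((overlap(a, b, min_len), i, j)
             for i, a in enumerate(reads)
             for j, b in enumerate(reads) if i != j),
            key=lambda t: t[0],
        )
        if n == 0:
            break  # no remaining overlaps: contigs cannot be joined further
        merged = reads[i] + reads[j][n:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)] + [merged]
    return reads
```

For instance, `greedy_assemble(["ATGGCCT", "CCTGAAT", "AATTCGA"])` returns `["ATGGCCTGAATTCGA"]`.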