Bioinformatics. introduction molecular biology biotechnology bioMEMS bioinformatics ...

Post on 25-Dec-2015


bioinformatics

introduction molecular biology biotechnology bioMEMS bioinformatics bio-modeling cells and e-cells transcription and regulation cell communication neural networks dna computing fractals and patterns the birds and the bees ….. and ants

course layout

book

Introduction to Computational Molecular Biology

introduction

DNA

DNA

central dogma

definitions

Informatics the science of information management

Bioinformatics the science of biological information management

what is bioinformatics?

Bioinformatics is Multidisciplinary

ComputerScience

Math

Statistics

StructuralBiology

Phylogenetics

Drug Design

Genomics

MolecularBiology

interdisciplinary

increasing levels of complexity

Genome (DNA)

Transcriptome (RNA)

Proteome (proteins)

Metabolome (metabolic pathways)

GenBank basepair growth (millions of base pairs per year, 1982-1999; chart axis from 0 to 4,000)

82: 1, 83: 2, 84: 3, 85: 5, 86: 10, 87: 16, 88: 24, 89: 35, 90: 49, 91: 72, 92: 101, 93: 157, 94: 217, 95: 385, 96: 652, 97: 1,160, 98: 2,009, 99: 3,841

Source: GenBank

growth of biological databases


3D structures growth: http://www.rcsb.org/pdb/holdings.html

symbol  meaning       explanation
G       G             Guanine
A       A             Adenine
T       T             Thymine
C       C             Cytosine
R       A or G        puRine
Y       C or T        pYrimidine
N       A, C, G or T  aNy base
U       U             Uracil

DNA/RNA
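The symbol table can be written down as a small lookup structure. The sketch below is illustrative only: the names `IUPAC`, `matches` and `expand` are made up for this example, and it covers just the symbols listed in the table, not the full IUPAC alphabet.

```python
# Nucleotide symbols from the table, mapped to the concrete bases they stand for.
IUPAC = {
    "G": {"G"}, "A": {"A"}, "T": {"T"}, "C": {"C"}, "U": {"U"},
    "R": {"A", "G"},            # puRine
    "Y": {"C", "T"},            # pYrimidine
    "N": {"A", "C", "G", "T"},  # aNy base
}


def matches(symbol: str, base: str) -> bool:
    """Return True if the concrete base is allowed by the (possibly ambiguous) symbol."""
    return base.upper() in IUPAC[symbol.upper()]


def expand(pattern: str) -> list[str]:
    """List every concrete sequence that an ambiguous pattern can represent."""
    results = [""]
    for symbol in pattern:
        results = [prefix + base
                   for prefix in results
                   for base in sorted(IUPAC[symbol.upper()])]
    return results


print(expand("RAY"))  # every sequence matching purine, A, pyrimidine
```

A dictionary of sets keeps the membership test (`matches`) and the enumeration (`expand`) on the same data, which is convenient when patterns such as restriction sites contain ambiguous symbols.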

some definitions

use of computers to catalog and organize molecular life science information into meaningful entities

subset of computational biology

methods to analyse, store, search, retrieve and represent biological data by computers / in computers

massive amounts of data: databases

extracting information and knowledge from "raw" data

for most bioscientists, all they need in bioinformatics is sequence analysis

definitions of bioinformatics

bioinformatics is not just the storage of data in a computer.

bioinformatics is the use of computers to test a biological hypothesis prior to performing the experiment in the laboratory.

bioinformatics is the design of software programs that analyse data.

what does it do?

nucleotide and protein sequences

protein structures

all sorts of functional data related to genes, proteins and their regulation, interactions etc.

curated and non-curated databases

bioinformatics databases

sequence searching and sequence alignments

looking at properties that can be analyzed/predicted from sequence data

protein structures and their analysis

structural classification

visualisation of macromolecules

"system-wide" understanding of the biology of a given organism

some goals

genomes and their annotation

complete genomes of many organisms are available

seeing "parts lists" of everything an organism needs and figuring out how they work together

annotation: looking at the DNA sequence

genomes and their annotation

gene finding is not always straightforward

problem: rare gene products, for which you cannot find corresponding mRNA or protein sequences in databanks

additional complication: alternative splicing, many transcripts per gene

genomes and their annotation

if you intend to analyze or just use data from a databank, it is useful to know both the goals and the reality of its annotation level

inconsistencies, missing data

even well-annotated databanks provide only a fraction of all biologically relevant information on a gene or a molecule (compared to literature)

annotation: a vision

databank content: all knowledge on functions of a gene product

add structural information: insights in structure-function relationships

add data on expression patterns and regulation: understanding cell differentiation and other big questions in biology on molecular level

current -omics

metabolomics

“…to identify, measure and interpret the complex time-related concentration, activity and flux of metabolites in cells, tissues, and other bio-samples such as blood, urine, and saliva.”

systems biology

Integrated view of biology at multiple levels

Generation of quantitative, predictive models of the behavior of biological systems, such as organisms

bioinformatics in short

very short

common genes?

Application of information technology to the storage, management and analysis of biological information

Facilitated by the use of computers

what is bioinformatics?


Sequence analysis: Geneticists / molecular biologists analyse genome sequence information to understand disease processes

Molecular modeling: Crystallographers / biochemists design drugs using computer-aided tools

Phylogeny/evolution: Geneticists obtain information about the evolution of organisms by looking for similarities in gene sequences

Ecology and population studies: Bioinformatics is used to handle large amounts of data obtained in population studies

Medical informatics

Personalised medicine

Nucleotide sequence file

Search databases for similar sequences

Sequence comparison

Multiple sequence analysis

Design further experiments

Restriction mapping

PCR planning

Translate into protein

Search for known motifs

RNA structure prediction

non-coding

coding

Protein sequence analysis

Search for protein coding regions

Sequencing project management

Protein sequence file

Sequence comparison

Search for known motifs

Predict secondary structure

Predict tertiary structure

Create a multiple sequence alignment

Edit the alignment

Format the alignment for publication

Molecular phylogeny

Protein family analysis

Nucleotide sequence analysis

Sequence entry

sequence analysis: overview

Manual sequence entry

Sequence database browsing

Search databases for similar sequences

gene sequencing

Automated chemical sequencing methods allow rapid generation of large data banks of gene sequences

Sequences producing significant alignments:                         (bits)  Value

gnl|PID|e252316 (Z74911) ORF YOR003w [Saccharomyces cerevisiae]       112   7e-26
gi|603258 (U18795) Prb1p: vacuolar protease B [Saccharomyces ce...    106   5e-24
gnl|PID|e264388 (X59720) YCR045c, len:491 [Saccharomyces cerevi...     69   7e-13
gnl|PID|e239708 (Z71514) ORF YNL238w [Saccharomyces cerevisiae]        30   0.66
gnl|PID|e239572 (Z71603) ORF YNL327w [Saccharomyces cerevisiae]        29   1.1
gnl|PID|e239737 (Z71554) ORF YNL278w [Saccharomyces cerevisiae]        29   1.5

gnl|PID|e252316 (Z74911) ORF YOR003w [Saccharomyces cerevisiae]
Length = 478
Score = 112 bits (278), Expect = 7e-26
Identities = 85/259 (32%), Positives = 117/259 (44%), Gaps = 32/259 (12%)

Query: 2   QSVPWGISRVQAPAAHNRG---------LTGSGVKVAVLDTGIST-HPDLNIRGG-ASFV 50
           +  PWG+ RV        G           G GV   VLDTGI T H D   R    + +
Sbjct: 174 EEAPWGLHRVSHREKPKYGQDLEYLYEDAAGKGVTSYVLDTGIDTEHEDFEGRAEWGAVI 233

Query: 51  PGEPSTQDGNGHGTHVAGTIAALNNSIGVLGVAPSAELYXXXXXXXXXXXXXXXXXQGLE 110
           P      D NGHGTH AG I + +      GVA + ++                  +G+E
Sbjct: 234 PANDEASDLNGHGTHCAGIIGSKH-----FGVAKNTKIVAVKVLRSNGEGTVSDVIKGIE 288

The BLAST program has been written to allow rapid comparison of a new gene sequence with the 100s of 1000s of gene sequences in data bases

database similarity searching

768 TT....TGTGTGCATTTAAGGGTGATAGTGTATTTGCTCTTTAAGAGCTG 813
    ||     ||   ||  |  | ||| |  |||| |||||    |||  |||
 87 TTGACAGGTACCCAACTGTGTGTGCTGATGTA.TTGCTGGCCAAGGACTG 135
             .         .         .         .         .
814 AGTGTTTGAGCCTCTGTTTGTGTGTAATTGAGTGTGCATGTGTGGGAGTG 863
    |  | |              |  |||||| |   ||||  | || |   |
136 AAGGATC.............TCAGTAATTAATCATGCACCTATGTGGCGG 172
             .         .         .         .         .
864 AAATTGTGGAATGTGTATGCTCATAGCACTGAGTGAAAATAAAAGATTGT 913
    ||| | ||| || || |||   |   |||||||||  ||   |||||| |
173 AAA.TATGGGATATGCATGTCGA...CACTGAGTG..AAGGCAAGATTAT 216

sequence comparison

Gene sequences can be aligned to see similarities between genes from different sources

restriction mapping

Genes can be analysed to detect gene sequences that can be cleaved with restriction enzymes

Enzyme    Sites  Recognition sequence
AceIII    1      CAGCTCnnnnnnn’nnn...
AluI      2      AG’CT
AlwI      1      GGATCnnnn’n_
ApoI      2      r’AATT_y
BanII     1      G_rGCy’C
BfaI      2      C’TA_G
BfiI      1      ACTGGG
BsaXI     1      ACnnnnnCTCC
BsgI      1      GTGCAGnnnnnnnnnnn...
BsiHKAI   1      G_wGCw’C
Bsp1286I  1      G_dGCh’C
BsrI      2      ACTG_Gn’
BsrFI     1      r’CCGG_y
CjeI      2      CCAnnnnnnGTnnnnnn...
CviJI     4      rG’Cy
CviRI     1      TG’CA
DdeI      2      C’TnA_G
DpnI      2      GA’TC
EcoRI     1      G’AATT_C
HinfI     2      G’AnT_C
MaeIII    1      ’GTnAC_
MnlI      1      CCTCnnnnnn_n’
MseI      2      T’TA_A
MspI      1      C’CG_G
NdeI      1      CA’TA_TG
Sau3AI    2      ’GATC_
SstI      1      G_AGCT’C
TfiI      2      G’AwT_C
Tsp45I    1      ’GTsAC_
Tsp509I   3      ’AATT_
TspRI     1      CAGTGnn’
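A restriction map of this kind comes down to searching a sequence for each enzyme's recognition site. The following is a minimal sketch under simplifying assumptions: only unambiguous sites are handled (no `n`, `r`, `y` symbols), cut positions are ignored, and the DNA string and the function name `find_sites` are invented for illustration.

```python
def find_sites(sequence: str, site: str) -> list[int]:
    """Return the 1-based start position of every occurrence of site (overlaps included)."""
    sequence, site = sequence.upper(), site.upper()
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start + 1)             # report 1-based coordinates
        start = sequence.find(site, start + 1)  # step by one so overlapping hits are kept
    return positions


# Recognition sequences for three enzymes, written without the cut markers.
ENZYMES = {"EcoRI": "GAATTC", "AluI": "AGCT", "DpnI": "GATC"}

dna = "TTGAATTCAGCTGATCAGCT"  # made-up example sequence
for name, site in ENZYMES.items():
    print(name, len(find_sites(dna, site)), find_sites(dna, site))
```

Each printed row has the same shape as a row of a restriction table: enzyme name, number of sites found, and where they are.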

PCR primer design

Oligonucleotides for use in the polymerase chain reaction can be designed using computer based programs

OPTIMAL primer length --> 20
MINIMUM primer length --> 18
MAXIMUM primer length --> 22
OPTIMAL primer melting temperature --> 60.000
MINIMUM acceptable melting temp --> 57.000
MAXIMUM acceptable melting temp --> 63.000
MINIMUM acceptable primer GC% --> 20.000
MAXIMUM acceptable primer GC% --> 80.000
Salt concentration (mM) --> 50.000
DNA concentration (nM) --> 50.000
MAX no. unknown bases (Ns) allowed --> 0
MAX acceptable self-complementarity --> 12
MAXIMUM 3' end self-complementarity --> 8
GC clamp how many 3' bases --> 0
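To give a feel for how such limits are applied, here is a rough sketch that screens a candidate primer against the length, GC% and melting-temperature windows listed above. It is an assumption-laden simplification: the melting temperature uses the rule-of-thumb "2 degrees per A/T, 4 degrees per G/C" estimate, not the salt- and concentration-aware formula a real primer design program would use, and self-complementarity is not checked at all. All function names are hypothetical.

```python
def gc_percent(primer: str) -> float:
    """Percentage of G and C bases in the primer."""
    primer = primer.upper()
    return 100.0 * (primer.count("G") + primer.count("C")) / len(primer)


def rule_of_thumb_tm(primer: str) -> float:
    """Crude melting temperature estimate: 2 degC per A/T plus 4 degC per G/C."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2.0 * at + 4.0 * gc


def acceptable(primer: str, min_len=18, max_len=22, min_tm=57.0, max_tm=63.0,
               min_gc=20.0, max_gc=80.0, max_unknown=0) -> bool:
    """Check a primer against the limits from the parameter list (defaults copied from it)."""
    primer = primer.upper()
    return (min_len <= len(primer) <= max_len
            and primer.count("N") <= max_unknown
            and min_gc <= gc_percent(primer) <= max_gc
            and min_tm <= rule_of_thumb_tm(primer) <= max_tm)


for candidate in ["ATGCATGCATGCATGCATGC", "ATATATATATATATATAT"]:  # made-up primers
    print(candidate, gc_percent(candidate), rule_of_thumb_tm(candidate), acceptable(candidate))
```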

Alignment formatted using MacBoxshade

Sequences of proteins from different organisms can be aligned to see similarities and differences

multiple sequence alignment

E.coli

C.botulinum

C.cadavers

C.butyricum

B.subtilis

B.cereus

Phylogenetic tree constructed using the Phylip package

Analysis of sequences allows evolutionary relationships to be determined

phylogeny inference

large scale bioinformatics: genome projects

Mapping: Identifying the location of clones and markers on the chromosome by genetic linkage analysis and physical mapping

Sequencing: Assembling clone sequence reads into large (eventually complete) genome sequences

Gene discovery: Identifying coding regions in genomic DNA by database searching and other methods

Function assignment: Using database searches, pattern searches, protein family analysis and structure prediction to assign a function to each predicted gene

Data mining: Searching for relationships and correlations in the information

Genome comparison: Comparing different complete genomes to infer evolutionary history and genome rearrangements

genomics

introduction to DNA microarrays

massive data sets from simultaneous expression levels of thousands of genes

impossible to grasp directly by the human mind

methods are needed for finding meaningful results and patterns from the bulk of data

DNA microarray bioinformatics

data manipulation: normalization etc.

data clustering: genes which behave in a similar fashion

sample classification by profiles of predictive genes (e.g. cancer typing)

data mining: finding interpretation to clustering results

example: recognition of regulatory factor binding sites in coexpressed genes
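The normalization and clustering steps can be illustrated on a toy data set. This sketch assumes log2 ratios against a single reference value as the normalization and a greedy grouping by Pearson correlation as a stand-in for proper hierarchical or k-means clustering; the gene names, intensities and the 0.9 threshold are all invented.

```python
import math


def log2_ratio(sample: float, reference: float) -> float:
    """Normalize one measurement as a log2 expression ratio."""
    return math.log2(sample / reference)


def pearson(x, y) -> float:
    """Pearson correlation between two equally long expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def group_genes(profiles: dict, threshold: float = 0.9) -> list:
    """Greedy grouping: a gene joins the first cluster whose seed profile it correlates with."""
    clusters = []
    for gene, profile in profiles.items():
        for cluster in clusters:
            if pearson(profile, profiles[cluster[0]]) >= threshold:
                cluster.append(gene)
                break
        else:
            clusters.append([gene])  # no similar cluster found, start a new one
    return clusters


# Made-up spot intensities over four time points, plus one reference intensity.
raw = {"geneA": [2, 4, 8, 16], "geneB": [3, 6, 12, 25], "geneC": [16, 8, 4, 2]}
reference = 2.0
profiles = {gene: [log2_ratio(v, reference) for v in values] for gene, values in raw.items()}
print(group_genes(profiles))
```

Genes A and B rise together and end up in one group, while gene C falls over time and forms its own group, which is the "genes which behave in a similar fashion" idea in miniature.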

hierarchy of relationships:

genome

gene 1 gene 3gene 2 gene X

protein 1 protein 2 protein 3 protein X

function 1 function 2 function 3 function X

basis of molecular biology

organism         genome size (bp)   genes
FERN             160,000,000,000
LUNGFISH         139,000,000,000
SALAMANDER       81,300,000,000
NEWT             20,600,000,000
ONION            18,000,000,000
GORILLA          3,523,200,000
MOUSE            3,454,200,000
HUMAN            3,400,000,000      31,000
Drosophila       137,000,000        13,500
C. Elegans       96,000,000         19,000
Yeast            12,000,000         6,315
E. Coli          5,000,000          5,361
smallest Genome  ??????

genome size

comparative genomics: whole-genome analyses, evolution studies, analyses of components in a "complete" system

functional genomics = inferring functions from data: expression patterns, gene regulation; sequence comparisons, homologue relationships; studies of gene variation, altered phenotypes

genomics


DNA array technology

Array Type             Spot Density (per cm2)  Probe   Target  Labeling
Nylon Macroarrays      < 100                   cDNA    RNA     Radioactive
Nylon Microarrays      < 5000                  cDNA    mRNA    Radioactive/Fluorescent
Glass Microarrays      < 10,000                cDNA    mRNA    Fluorescent
Oligonucleotide Chips  < 250,000               oligos  mRNA    Fluorescent

spotting robot

microarray expression analysis

microarray

photolithography

array terminology

[Figure: 70 mer vs 40 mer attachment. Probes are attached through NH2 groups; labels: 70 mer, 40 mer, target]

microarray

microarray data

control mouse

a stressed mouse

RNA extraction

target labeling

image analysis

gene expression analysis

determination of expression levels

DNA micro-array

genomics

The application of high-throughput automated technologies to molecular biology.

The experimental study of complete genomes.

genomics technologies

Automated DNA sequencing

Automated annotation of sequences

DNA microarrays: gene expression (measure RNA levels), single nucleotide polymorphisms (SNPs)

Protein chips (SELDI, etc.)

Protein-protein interactions

cDNA spotted microarrays

Affymetrix gene chips

microarray data analysis

Clustering and pattern detection

Data mining and visualization

Controls and normalization of results

Statistical validation

Linkage between gene expression data and gene sequence/function/metabolic pathways databases

Discovery of common sequences in co-regulated genes

Meta-studies using data from multiple experiments

microarray data analysis

impact on bioinformatics

Genomics produces high-throughput, high-quality data, and bioinformatics provides the analysis and interpretation of these massive data sets.

It is impossible to separate genomics laboratory technologies from the computational tools required for data analysis.

proteomics

what is proteomics?

"The analysis of the entire protein complement expressed by a genome, or by a cell or tissue type."

Wasinger VC et al. Progress with gene-product mapping of the Mollicutes: Mycoplasma genitalium. Electrophoresis 16 (1995) 1090-1094

Two most related technologies:

2-D electrophoresis: separation of complex protein mixtures

Mass spectrometry: identification and structure analysis

transcription

genomic DNA

Structure Regulation Information

Computers cannot determine which of these 3 roles DNA plays solely based on sequence… (although we would all like to believe they can)

introduction to proteomics

Definitions

Classical - restricted to large scale analysis of gene products involving only proteins

Inclusive - combination of protein studies with analyses that have genetic components such as mRNA, genomics, and yeast two-hybrid

Don’t forget that the proteome is dynamic, changing to reflect the environment that the cell is in

1 gene is no longer equal to one protein

1 gene = how many proteins?

1 gene = 1 protein?

why proteomics?

Annotation of genomes, i.e. functional annotation

Genome + proteome = annotation

Protein Function

Protein Post-Translational Modification

Protein Localization and Compartmentalization

Protein-Protein Interactions

Protein Expression Studies

Differential gene expression is not the answer

types of proteomics

Protein Expression: Quantitative study of protein expression between samples that differ by some variable

Structural Proteomics: Goal is to map out the 3-D structure of proteins and protein complexes

Functional Proteomics: To study protein-protein interactions, 3-D structures, cellular localization and PTMs in order to understand the physiological function of the whole proteome.

introduction to proteomics

composition of the proteome depends on cell type, developmental phase and conditions

proteome analyses are still struggling to solve the "basic proteome" of different cells and tissues or limited changes under changing conditions or during processes

current methods can only "see" the most abundant proteins

expression proteomics = differential proteomics = 2D-PE + MS

interaction proteomics

functional proteomics = systematic perturbation or functional inactivation of proteins in a given environment

structural proteomics

proteomics

typically a combination of 2D protein electrophoresis and mass spectrometry

labour-intensive, not really "high-throughput" methods

more efficient "protein array" methods are emerging

proteomics experiments

bioinformatics in proteomics

High-throughput determination of the 3D structure of proteins

Goal: to be able to determine or predict the structure of every protein.

Direct determination - X-ray crystallography and nuclear magnetic resonance (NMR).

Prediction - Comparative modeling, Threading/Fold recognition, Ab initio

structural proteomics

To study proteins in their active conformation.

Study protein:drug interactions

Protein engineering

Proteins that show little or no similarity at the primary sequence level can have strikingly similar structures.

why structural proteomics?

FtsZ - protein required for cell division in prokaryotes, mitochondria, and chloroplasts.

Tubulin - structural component of microtubules - important for intracellular trafficking and cell division.

FtsZ and Tubulin have limited sequence similarity and would not be identified as homologous proteins by sequence analysis.

an example

Burns, R., Nature 391:121-123Picture from E. Nogales

FtsZ and tubulin have little similarity at the amino acid sequence level

homologues

Yes! Proteins that have conserved secondary structure can be derived from a common ancestor even if the primary sequence has diverged to the point that no similarity is detected.

are FtsZ and tubulin homologues?

structure is function

protein structure

Imaging: Experimental X-ray diffraction data

Predicting structure in silico from sequence

Make crystals of your protein, 0.3-1.0 mm in size. Proteins must be in an ordered, repeating pattern.

X-ray beam is aimed at the crystal and data is collected.

Structure is determined from the diffraction data.

X-ray crystallography

http://www-structure.llnl.gov/Xray/101index.html

X-ray crystallography

Schmid, M. Trends in Microbiology, 10:s27-s31.

crystals

X-ray crystallography

Protein must crystallize. Need large amounts (good expression). Soluble (many proteins aren’t, e.g. membrane proteins).

Need to have access to an X-ray beam.

Solving the structure is computationally intensive.

Time - can take several months to years to solve a structure

Efforts to shorten this time are underway to make this technique high-throughput.

general process for proteomics research

Image Analysis

Digester

Spot picker

Gel hotel

Spotter

MS

2-D Gel

Source: web page of Professor 莊榮輝, Department of Microbiology and Biochemistry, National Taiwan University

general process for proteomics research

protein microarray

arrayIT TM

protein microarray

arrayIT TM

G. MacBeath and S.L. Schreiber, 2000, Science 289:1760

what can protein microarrays do?

1. Protein / protein interaction
2. Enzyme / substrate interaction (transient)
3. Protein / small molecule interaction
4. Protein / lipid interaction
5. Protein / glycan interaction
6. Protein / Ab interaction

1. G. MacBeath and S.L. Schreiber, 2000, Science 289:1760

2. H.Zhu et al, 2001 Science 293:2101

3. Ziauddin J and Sabatini DM, 2001 Nature 411:107

protein microarrays (Antibody arrays)

the real world

The true spot quality compared to a real experiment

mobility of protein in an electric field

Mobility: Electrolytic molecules move in an electric field

\[ \text{Mobility} \sim \frac{[\text{Electric field (mV)}]\,[\text{Net charge of molecule}]}{[\text{Friction between molecules and matrix}]} \]

2-dim electrophoresis

Digest to peptide fragment

MS analysis

2-D gel electrophoresis

First dimension: denaturing iso-electric focusing, separation according to the pI

Second dimension: SDS-PAGE (Sodium Dodecyl Sulfate Poly-Acrylamide Gel Electrophoresis), separation according to the molecular weight

pI is the iso-electric point of the protein

result example

Ion source: substance to ion gas

Mass analysis: according to mass/charge (m/z)

Detection: femtomole - attomole

Ion source, ion separator, detector

mass spectrometry

[Figure: a peptide mixture embedded in light absorbing chemicals (matrix) is hit by a pulsed UV or IR laser (3-4 ns) in a vacuum, giving a cloud of protonated peptide molecules; a strong electric field (accV) sends them through the Time Of Flight tube to the detector]

principle of mass spec

[Figure: linear Time Of Flight tube (ion source, detector) and reflector Time Of Flight tube (ion source, reflector, detector); horizontal axes: time of flight]

principle of mass spec

typical result

Nuclear Magnetic Resonance Spectroscopy (NMR)

Can perform in solution. No need for crystallization

Can only analyze proteins that are <300 aa. Many proteins are much larger.

Can’t analyze multi-subunit complexes

Proteins must be stable.

structure modeling

Comparative modeling: Modeling the structure of a protein that has a high degree of sequence identity with a protein of known structure. Must be >30% identity to have a reliable structure

Threading/fold recognition: Uses known fold structures to predict folds in primary sequence.

Ab initio: Predicting structure from primary sequence data. Usually not as robust, computationally intensive

sequence alignment

sequence alignment

©Ken Howard, Scientific American, July 2000

sequenceGATCAAACATTAAACATCCTGAGATCCAAAGGTAAGAGATCTAGCCACAGGGAGTGCTGGGGATTCGGGTCCTGGTGATCTTCACATGCTGACATAGCTCAGCCCTTTTTGGCCCTGGCTTTGTCCTGTTGTGGGCTTTCCCATCTGCAACCCATGCTCCTGGGCCATTTTCCTATGGGCCAGGGAAAACAAGATGGGGTGAAGGCACCCTTACATTTAGGGGCAAGACCTAGTACTCAGAAGGATTCAGAAACTGAAATAGCTGGGTGATACCACACAGGTGCTAGGGATAAGGGGCCTTGAGCCATGGACCATGGGAACTACAAAGCTGAAGGAGCTGCTGCCTCAGCAGAACCAGCGCTTGAATTTGTTCTTTCAGAACCTCAGTCTCTTCCTCTGAAAAATGGGTGTGTTGTGTATCCCACATTCCCAAGTCAGCCATGGGACCAAATGTGAGCGTGTGGGTTTTGCCTCCTGAGAAACTCAGGGGAGCAGAATGCTACAGTGGGTGAATTGGATTCTTTCAGAGAGCCCACCCTGTTTCCCACATCAGCCAGAAGGCTCAAAACCCTGAAGAGCTTTCTGAACTTTGAGGTGCCCAAAGCTTCAGGGCTGTATGGGAAGCACCTGAGGTCCAAGTCCGTTTACAAGAATTTTGTTTTTTGGTTTACAGCTGCTTGGCCGGTCCAAGGAGCAGGTTTGGGTCCTGTGCTCCACAGACCTAAGGGTTACCTTAGAGCTTATGGGAGAGCATTGTGTGTGGACAGTGGACAGTGCCCTCTAGTGCTCAGTGTTAGCACTACATCCAGTTGCCCTCCACCAGTTTATGCTGCTGAGGAAGTCTTTCTTTTCCCAACAGCAGTGTCTCTCCCTCTCCCACCCCCTCTCCCTCTCCCTCCCCCCCTAGGTTATTTTTATTTTTACTGGTGTGTATGTGTGTGAGTCTATGTCACATGTATGAGAGTGCTTGTGGAGACCAGAAGAGGGCATCAGAAGAGCCCCTAGAACTGGAGTATAGGTGGTTGTGAGCCACTTGTCATGGGTGTTGGGAACCAAACTCAGGTTCTCTGGAAGAACAACAAGCTCCCTTATCATATAAGCCATCTCTAAATCCAGGACATTTTTTTTTTTTTTTTTGAGATTTAGAGATTCAAGGAGGAGGAACAATAGGAGGAAGAAGGGGACAGAATAAGGCCAACAAAATGACCAAGGAGGTATAGGCACTTGAAGCCAAACCTAAGTACCTGAGTTCAATCCCTGGGACCCACATGATGGAAAGATGGAATCGATCCCCAAAAGTTATCTTCTGATCCCTATATGCACACACTTGAGGATGGACAGACAAAGAGACAGACACACAAACACACACAAATGTAACTGAAAAAGAAACCTCTATGGGGACATCGCCTTCTTGGAGAGGCTCTGTTGCCCCTCATCCTAGTGAACAAACAACTCCTACTCCCTGCCAGAGTATCCTACCCTTGGATTCAAAATGGTCTCAGAGGACACACCGGGTGGGCTCTGTCGCTGGGATCTTGCATAACCAATGCCCATAAGCCTGGCAAAGGTGGCGATGAGACGATAAGGTCAGGGACATGACCGCAGAAGAGGAGTGGGGACGCGATGAGTGGGAGGAGCTTCTAAATTATCCATCAGCACAAGCTGTCAGTGGCCCCAGCCATGAATAAATGTATAGGGGGAAAGGCAGGAGCCTTGGGGTCGAGGAAAACAGGTAGGGTATAAAAAGGGCACGCAAGGGACCAAGTCCAGCATCCTAGAGTCCAGATTCCAAACTGCTCAGAGTCCTGTGGACAGATCACTGCTTGGCAATGGCTACAGGTAAGCATGCGCAAATCCCGCTGGGTGTGGTTTGGGACCCAGGGCCCCTGAAGATGGATCTGAGGCTTCTAATGTGAGTGCGTTCCAACTTCTGCCATGTTGGGAATACTCTGGGTCCCTATGGGGATTGGGAGAGATCGGCCATTGCTCCCA
GGTTTCTCCTGCCCTCCTGTCTCTCTCTAGACTCTCGGACCTCCTGGCTCCTGACCGTCAGCCTGCTCTGCCTGCTCTGGCCTCAGGAGGCTAGTGCTTTTCCCGCCATGCCCTTGTCCAGTCTGTTTTCTAATGCTGTGCTCCGAGCCCAGCACCTGCACCAGCTGGCTGCTGACACCTACAAAGAGTTCGTAAGTTCCCCAGAGATGGGTGCCCGTTTGTGGAAGCAGGAAGGGGCAGGTCCTACCCCATACTCCTGGCCCCAGGGAAGGTCAATGGAGGGGAAATTATGGGGTAGGGGAATCTTAGCCAATGCTGTACCATAGTAATGATGGTGACGAGACACAAGCTGGTCCCTCAGTGACCACCCTTCTTCCAGGAGCGTGCCTACATTCCCGAGGGACAGCGCTATTCCATTCAGAATGCCCAGGCTGCTTTCTGCTTCTCAGAGACCATCCCGGCCCCCACAGGCAAGGAGGAGGCCCAGCAGAGAACCGTGAGTAGTCCCAGGCCTTGTCTGCACAAATCCTCGTTTCCCTCCATGCAGCCCTAACTGCACTCCAGGCCAGGGACCAGCTCCTCCCTGAAGCTGGGGTAACCTGGGAGTCCCAGGCAGAGGTCACTAGGCAATACACTAACCCCAGCCCTTTTTTTCCCCCCTCAGGACATGGAATTGCTTCGCTTCTCGCTGCTGCTCATCCAGTCATGGCTGGGGCCCGTGCAGTTCCTCAGCAGGATTTTCACCAACAGCCTGATGTTCGGCACCTCGGACCGTGTCTATGAGAAACTGAAGGACCTGGAAGAGGGCATCCAGGCTCTGATGCAGGTGAGGATGGACTAGCCTGGGGTTATGCCTGGAGCCTAGGTGGGGCTCACTGTCCTCTGTTTTACCGGTCAGCCCTTAGACCCTTGAGAAGGCTTCTTCTTCTTCATTTTCCTTTATGAAGCCTCCAGGCTTTTCCTTCGGTCCTGGGGTGGAGGGAGGCACAGCTCCCGAGTCTCCTGCCCTTCTTTCCCACGACAGGAGCTGGAAGATGGCAGCCCCCGTGTTGGGCAGATCCTCAAGCAAACCTATGACAAGTTTGACGCCAACATGCGCAGCGACGACGCGCTGCTCAAAAACTATGGGCTGCTCTCCTGCTTCAAGAAGGACCTGCACAAAGCGGAGACCTACCTGCGGGTCATGAAGTGTCGCCGCTTTGTGGAAAGCAGCTGTGCCTTCTAGCCACTCACCAGTGTCTCTGCTGCACTCTCCTGTGCCTCCCTGCCCCCTGGCAACTGCCACCCCGCGCTTTGTCCTAATAAAATTAAGATGCATCATATCACCCGGCTAGAGGTCTTTCTGTTATGGGATGGAGCAGTTGTGTCAATCTTGTTCCTGGAAGCCTGCGAGAA

sequence alignment: why?

Early in the days of protein and gene sequence analysis, it was discovered that the sequences from related proteins or genes were similar, in the sense that one could align the sequences so that many corresponding residues match.

This discovery was very important: strong similarity between two genes is a strong argument for their homology. Bioinformatics is based on it.

sequence alignment: why?

Terminology:

Homology means that two (or more) sequences have a common ancestor. This is a statement about evolutionary history.

Similarity simply means that two sequences are similar, by some criterion. It does not refer to any historical process, just to a comparison of the sequences by some method. It is a logically weaker statement.

However, in bioinformatics these two terms are often confused and used interchangeably. The reason is probably that significant similarity is such a strong argument for homology.

two protein alignment

many genes have a common ancestor

The basis for comparison of proteins and genes using the similarity of their sequences is that the proteins or genes are related by evolution; they have a common ancestor.

Random mutations in the sequences accumulate over time, so that proteins or genes that have a common ancestor far back in time are not as similar as proteins or genes that diverged from each other more recently.

Analysis of evolutionary relationships between protein or gene sequences depends critically on sequence alignments.

definition of sequence alignment

Sequence alignment is the procedure of comparing two (pairwise alignment) or more (multiple alignment) sequences by searching for a series of individual characters or patterns that are in the same order in the sequences.

There are two types of alignment: local and global. In global alignment, an attempt is made to align the entire sequences. If two sequences have approximately the same length and are quite similar, they are suitable for global alignment.

Local alignment concentrates on finding stretches of sequences with a high level of similarity.

definition of sequence alignment

L G P S S K Q T G K G S - S R I W D N

Global alignment

L N - I T K S A G K G A I M R L G D A

- - - - - - - T G K G - - - - - - - -

Local alignment

- - - - - - - A G K G - - - - - - - -

interpretation of sequence alignment

Sequence alignment is useful for discovering structural, functional and evolutionary information.

Sequences that are very much alike may have similar secondary and 3D structure, similar function and likely a common ancestral sequence. It is extremely unlikely that such sequences obtained similarity by chance. For DNA molecules with n nucleotides such probability is very low: $P = 4^{-n}$. For proteins the probability is even much lower: $P = 20^{-n}$, where n is the number of amino acid residues

Large scale genome studies revealed existence of horizontal transfer of genes and other sequences between species, which may cause similarity between some sequences in very distant species
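The chance probabilities quoted for DNA and protein sequences are easy to evaluate. The sketch below assumes, as the formulas do, that every residue is equally likely and that positions are independent; the function names are made up. Because the raw probability underflows quickly for long sequences, the base-10 logarithm is computed as well.

```python
import math


def chance_identity(n: int, alphabet_size: int) -> float:
    """Probability that two random length-n sequences agree at every position."""
    return float(alphabet_size) ** -n


def log10_chance_identity(n: int, alphabet_size: int) -> float:
    """Same probability as a base-10 logarithm, usable for long sequences."""
    return -n * math.log10(alphabet_size)


print(chance_identity(10, 4))           # DNA, 10 nucleotides
print(chance_identity(10, 20))          # protein, 10 residues
print(log10_chance_identity(100, 4))    # DNA, 100 nucleotides
print(log10_chance_identity(100, 20))   # protein, 100 residues
```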

Dot matrix analysis

The dynamic programming (DP) algorithm

Word or k-tuple methods

methods of sequence alignment

dot matrix analysis

A dot matrix analysis is a method for comparing two sequences to look for possible alignment (Gibbs and McIntyre 1970)

One sequence (A) is listed across the top of the matrix and the other (B) is listed down the left side

Starting from the first character in B, one moves across the page keeping in the first row and placing a dot in every column where the character in A is the same

The process is continued until all possible comparisons between A and B are made

Any region of similarity is revealed by a diagonal row of dots

Isolated dots not on diagonal represent random matches

Detection of matching regions can be improved by filtering out random matches and this can be achieved by using a sliding window

This means that instead of comparing a single sequence position, several positions are compared at the same time, and a dot is printed only if a certain minimal number of matches occurs

Dot matrix analysis can also be used to find direct and inverted repeats within the sequences
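The procedure described above, including the sliding-window filter, fits in a few lines. This is a minimal sketch, not the code behind any particular dot-plot tool: sequence A runs across the top, sequence B down the side, and the parameter names `window` and `min_matches` are chosen here for illustration.

```python
def dot_plot(a: str, b: str, window: int = 1, min_matches: int = 1) -> list[str]:
    """Return the dot matrix as text rows; '*' marks a window with enough matching positions."""
    rows = []
    for i in range(len(b) - window + 1):          # position in B (rows)
        row = []
        for j in range(len(a) - window + 1):      # position in A (columns)
            hits = sum(b[i + k] == a[j + k] for k in range(window))
            row.append("*" if hits >= min_matches else " ")
        rows.append("".join(row))
    return rows


seq = "GATTACA"
print("\n".join(dot_plot(seq, seq)))                             # single positions: diagonal plus random dots
print("\n".join(dot_plot(seq, seq, window=3, min_matches=3)))    # filtered: only the diagonal remains
```

With `window=1` every coincidental single-letter match produces an isolated dot; requiring three matches in a window of three removes them and leaves the diagonal that signals the real similarity.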

dot matrix analysis

Nucleic Acids Dot Plots - http://arbl.cvmbs.colostate.edu/molkit/dnadot/index.html

dot matrix analysis: two identical sequences

Nucleic Acids Dot Plots of genes Adh1 and G6pd in the mouse

dot matrix analysis: two very different sequences

Nucleic Acids Dot Plots of genes Adh1 from the mouse and rat (25 MY)

dot matrix analysis: two similar sequences

dynamic programming

back to the basics

DNA/RNA sequences: strings composed of an alphabet of 4 letters

Protein sequences: alphabet of 20 letters

why do we do it?

Identify a gene

Find clues to gene function (ortholog?)

Find other organisms with this gene (homology)

Gather info for an evolutionary model

…

alignment

alignment is the basis for finding similarity

Pairwise alignment = dynamic programming

Multiple alignment: protein families and functional domains

Multiple alignment is "impossible" for lots of sequences

Another heuristic - progressive pairwise alignment

an example

GCGCATGGATTGAGCGA

TGCGCCATTGATGACCA

possible alignment

-GCGC-ATGGATTGAGCGA

TGCGCCATTGAT-GACC-A

alignment

three elements: perfect matches, mismatches, insertions & deletions (indel)

-GCGC-ATGGATTGAGCGA

TGCGCCATTGAT-GACC-A

significant similarity in an alignment

Ho: the current alignment is a result of random line-up (the 2 sequences are unrelated)

Ha: the sequences diverge from a common ancestor (related)

Test statistic: Ymax = length of the longest running perfect match subsequence

exact matching subsequences

In DNA alignment, the matching probability is

\[ p_{match} = p_a^2 + p_c^2 + p_g^2 + p_t^2 \]

Under Ho, the lengths of exact-match subsequences Y should follow a geometric distribution
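As a hedged illustration of the test statistic, the sketch below assumes the matching probability is the sum of squared base frequencies and that, under Ho, a run of exact matches starting at a given position reaches length k with probability p_match^k (the geometric tail). Function names and the uniform base frequencies are assumptions made for the example.

```python
def match_probability(freqs: dict) -> float:
    """Chance that two independently drawn bases are identical: sum of squared frequencies."""
    return sum(p * p for p in freqs.values())


def longest_match_run(top: str, bottom: str) -> int:
    """Length of the longest run of identical, non-gap columns in an alignment."""
    best = run = 0
    for a, b in zip(top, bottom):
        run = run + 1 if (a == b and a != "-") else 0
        best = max(best, run)
    return best


def run_tail_probability(p_match: float, k: int) -> float:
    """P(a run starting at a fixed position has at least k matches) under Ho."""
    return p_match ** k


uniform = {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}
p = match_probability(uniform)
y_max = longest_match_run("-GCGC-ATGGATTGAGCGA", "TGCGCCATTGAT-GACC-A")
print(p, y_max, run_tail_probability(p, y_max))
```

Note that the tail probability applies to one fixed starting position; a p-value for Ymax over a whole alignment would need to account for all possible starting positions.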

well-matching subsequences

Evolution may cause small differences to even sequences with a reasonably recent common ancestor.

We consider Ymax to be the longest subseq with up to k mismatches.

Y follows a hyper-geometric distribution

P-value: exact/simulated/approximate (independence among Y does not hold any more)

choosing alignments

There are many possible alignments

For example, compare:

-GCGC-ATGGATTGAGCGA

TGCGCCATTGAT-GACC-A

to

------GCGCATGGATTGAGCGA

TGCGCC----ATTGATGACCA--

Which one is better?

Score: Used to determine quality of match and basis for the selection of matches. Scores are relative.

Expectation value: An estimate of the likelihood that a given hit is due to pure chance, given the size of the database; should be as low as possible. E.V.’s are absolute. A high score and a low E.V. indicate a true hit.

Sequence identity (%) (or Similarity): Number of matched residues divided by total length of probe

scoring

scoring rule

Example Score = (# matches) – (# mismatches) – (# indels) x 2

examples

-GCGC-ATGGATTGAGCGA

TGCGCCATTGAT-GACC-A

Score: (+1x13) + (-1x2) + (-2x4) = 3

------GCGCATGGATTGAGCGA

TGCGCC----ATTGATGACCA--

Score: (+1x5) + (-1x6) + (-2x12) = -25
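The example scoring rule can be applied mechanically by walking down the columns of an alignment. A minimal sketch, with the function name `score_alignment` chosen for illustration and the weights taken from the rule above (match +1, mismatch -1, indel -2):

```python
def score_alignment(top: str, bottom: str, match: int = 1, mismatch: int = -1, indel: int = -2) -> int:
    """Score = (# matches) - (# mismatches) - 2 x (# indels), counted column by column."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += indel      # a gap in either row is an insertion/deletion
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


print(score_alignment("-GCGC-ATGGATTGAGCGA", "TGCGCCATTGAT-GACC-A"))
print(score_alignment("------GCGCATGGATTGAGCGA", "TGCGCC----ATTGATGACCA--"))
```

Because the score is a sum over independent columns, it can be built up one column at a time, which is the property dynamic programming exploits.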

edit distance

The edit distance between two sequences is the “cost” of the “cheapest” set of edit operations needed to transform one sequence into the other

Computing the edit distance between two sequences is almost equivalent to finding the alignment that minimizes the distance:

$d(s_1, s_2) = \max_{\text{alignment of } s_1 \,\&\, s_2} \text{score(alignment)}$

computing edit distance

How can we compute the edit distance?

If |s| = n and |t| = m, there are more than $\binom{m+n}{m}$ alignments

2 sequences each of length 1000: > 10^600 alignments

The additive form of the score allows us to use dynamic programming to compute the edit distance efficiently
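The size of this search space is easy to confirm with exact integer arithmetic. A quick check, assuming Python 3.8 or later for `math.comb`:

```python
import math

n = m = 1000
lower_bound = math.comb(n + m, m)        # binomial coefficient C(n+m, m)
print(len(str(lower_bound)) - 1)         # power of ten of the leading digit
print(lower_bound > 10 ** 600)
```

Enumerating that many alignments is out of the question, which is what motivates the dynamic programming approach.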

recursive argument

Suppose we have two sequences: s[1..n+1] and t[1..m+1]

The best alignment must be in one of three cases (σ(a, b) denotes the score of a single alignment column):

1. Last position is (s[n+1], t[m+1]):

$d(s[1..n+1], t[1..m+1]) = d(s[1..n], t[1..m]) + \sigma(s[n+1], t[m+1])$

2. Last position is (s[n+1], -):

$d(s[1..n+1], t[1..m+1]) = d(s[1..n], t[1..m+1]) + \sigma(s[n+1], -)$

3. Last position is (-, t[m+1]):

$d(s[1..n+1], t[1..m+1]) = d(s[1..n+1], t[1..m]) + \sigma(-, t[m+1])$

recursive argument

Define the notation:

$V[i, j] = d(s[1..i], t[1..j])$

Using the recursive argument, we get the following recurrence for V, where σ(a, b) scores one alignment column:

$V[i+1, j+1] = \max \begin{cases} V[i, j] + \sigma(s[i+1], t[j+1]) \\ V[i, j+1] + \sigma(s[i+1], -) \\ V[i+1, j] + \sigma(-, t[j+1]) \end{cases}$

Of course, we also need to handle the base cases in the recursion:

$V[0, 0] = 0$

$V[i+1, 0] = V[i, 0] + \sigma(s[i+1], -)$

$V[0, j+1] = V[0, j] + \sigma(-, t[j+1])$

dynamic programming algorithm

We fill the matrix using the recurrence rule (S = AAAC along the rows, T = AGC along the columns):

            A    G    C
       0    1    2    3
  0    0   -2   -4   -6
A 1   -2    1   -1   -3
A 2   -4   -1    0   -2
A 3   -6   -3   -2   -1
C 4   -8   -5   -4   -1

Conclusion: d(AAAC,AGC) = -1
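The matrix fill can be sketched as follows. The column scores (match +1, mismatch -1, indel -2) are taken from the earlier scoring-rule example and are an assumption here; they do reproduce the matrix for AAAC and AGC. The names `sigma` and `fill_matrix` are illustrative.

```python
def sigma(a, b, match=1, mismatch=-1, indel=-2):
    """Score of one alignment column; '-' stands for a gap."""
    if a == "-" or b == "-":
        return indel
    return match if a == b else mismatch


def fill_matrix(s, t):
    """Return V where V[i][j] is the best score for s[:i] versus t[:j]."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n):                           # first column: s against gaps
        V[i + 1][0] = V[i][0] + sigma(s[i], "-")
    for j in range(m):                           # first row: gaps against t
        V[0][j + 1] = V[0][j] + sigma("-", t[j])
    for i in range(n):
        for j in range(m):
            V[i + 1][j + 1] = max(
                V[i][j] + sigma(s[i], t[j]),        # match or substitution
                V[i][j + 1] + sigma(s[i], "-"),     # s[i] against a gap
                V[i + 1][j] + sigma("-", t[j]),     # gap against t[j]
            )
    return V


V = fill_matrix("AAAC", "AGC")
for row in V:
    print(row)
print("d(AAAC, AGC) =", V[-1][-1])
```

Python strings are 0-based, so `s[i]` in the code plays the role of s[i+1] in the recurrence.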

interpretation of pointers

Insertion of S2(j) into S1

Deletion of S1(i) from S1

Match or Substitution

reconstructing the best alignment

We now trace back the path that corresponds to the best alignment in the matrix:

AAAC

AG-C

Sometimes, more than one alignment has the best score:

AAAC

A-GC
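A sketch of the traceback that follows every pointer attaining the cell value, so that all co-optimal alignments are returned. It assumes the same column scores as the worked matrix (match +1, mismatch -1, indel -2); the function names are illustrative. The number of optimal alignments can grow exponentially, so this is only meant for small examples.

```python
def sigma(a, b):
    """Assumed scores: match +1, mismatch -1, indel -2 ('-' is a gap)."""
    if a == "-" or b == "-":
        return -2
    return 1 if a == b else -1


def fill_matrix(s, t):
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + sigma(s[i - 1], "-")
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + sigma("-", t[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(V[i - 1][j - 1] + sigma(s[i - 1], t[j - 1]),
                          V[i - 1][j] + sigma(s[i - 1], "-"),
                          V[i][j - 1] + sigma("-", t[j - 1]))
    return V


def optimal_alignments(s, t):
    """Return every alignment of s and t whose score equals V[n][m]."""
    V = fill_matrix(s, t)

    def walk(i, j):
        if i == 0 and j == 0:
            return [("", "")]
        found = []
        # Diagonal pointer: match or substitution.
        if i > 0 and j > 0 and V[i][j] == V[i - 1][j - 1] + sigma(s[i - 1], t[j - 1]):
            found += [(a + s[i - 1], b + t[j - 1]) for a, b in walk(i - 1, j - 1)]
        # Vertical pointer: s[i-1] aligned with a gap.
        if i > 0 and V[i][j] == V[i - 1][j] + sigma(s[i - 1], "-"):
            found += [(a + s[i - 1], b + "-") for a, b in walk(i - 1, j)]
        # Horizontal pointer: gap aligned with t[j-1].
        if j > 0 and V[i][j] == V[i][j - 1] + sigma("-", t[j - 1]):
            found += [(a + "-", b + t[j - 1]) for a, b in walk(i, j - 1)]
        return found

    return walk(len(s), len(t))


for top, bottom in optimal_alignments("AAAC", "AGC"):
    print(top)
    print(bottom)
    print()
```

For AAAC and AGC this returns AG-C and A-GC, plus a third alignment of equal score, -AGC.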

complexity

Space: O(mn)

Time: O(mn) (filling the matrix: O(mn); backtrace: O(m+n))

other scoring schemes

Needleman and Wunsch: 1 for identical amino acid, 0 otherwise

Dayhoff PAM scoring matrix: variations include BLOSUM matrices (Henikoff and Henikoff 1992, Proc. Nat. Acad. Sci. 89, 10915-10919).

… Different Gap Cost Function

scoring matrix for protein sequences

substitution “log odds” matrix BLOSUM 62 (Henikoff and Henikoff, 1992; PNAS 89:10915-10919)

PAM 250 matrix (M.O. Dayhoff, ed., 1978, Atlas of Protein Sequence and Structure, Vol. 5)

multiple sequence alignment

Often a probe sequence will yield many hits in a search. We then want to know which residues and positions are common to all or most of the probe and match sequences.

In multiple sequence alignment, all similar sequences can be compared in one single figure or table. The basic idea is that the sequences are aligned on top of each other, so that a coordinate system is set up, where each row is the sequence for one protein, and each column is the 'same' position in each sequence.

an example: cellulose-binding domain of cellobiohydrolase

[Figure labels: names of homologous domains; position of residue; consensus = residues and positions common to most homologs]

why multiple sequence alignment?

Identify consensus segments, hence the most conserved sites and residues

Use for construction of phylogenies of genes, strains, organisms, species, life: convert similarity to distance

www.ch.embnet.org/software/ClustalW.html
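Identifying a consensus from an existing multiple alignment can be sketched column by column. The rows below are made up for illustration, and the 50% threshold and the '.' placeholder are assumptions, not a convention from ClustalW.

```python
from collections import Counter


def consensus(rows, min_fraction=0.5):
    """Most common residue per column; '.' where none is frequent enough."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all aligned rows must have the same length")
    out = []
    for column in zip(*rows):
        residue, count = Counter(column).most_common(1)[0]
        conserved = residue != "-" and count / len(rows) >= min_fraction
        out.append(residue if conserved else ".")
    return "".join(out)


# Made-up aligned fragments, for illustration only.
rows = ["ACG-TA", "ACGTTA", "ACC-TA", "TCG-TA"]
print(consensus(rows))
```

Columns dominated by gaps are reported as '.', so insertions present in only a few sequences do not enter the consensus.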

sequence logo

This shows the conserved residues as larger characters, where the total height of a column is proportional to how conserved that position is. Technically, the height is proportional to the information content of the position.
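The information content of a column can be computed as the maximum possible entropy minus the observed entropy. A minimal sketch, assuming a DNA alphabet of four letters by default and ignoring the small-sample correction that logo software usually applies:

```python
import math
from collections import Counter


def information_content(column, alphabet_size=4):
    """Bits of information in one alignment column: log2(alphabet) - entropy."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0                      # a gap-only column carries no signal
    total = len(residues)
    counts = Counter(residues)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) - entropy


print(information_content("AAAA"))      # fully conserved column
print(information_content("ACGT"))      # all four bases equally frequent
```

In a logo, each letter's height is its frequency in the column multiplied by this value; for proteins, `alphabet_size=20` would be used.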

sample multiple alignment

constructing the tree of life with k-mers (16S RNA, 35 organisms)

[Figure: tree with labels Eukarya, Bacteria, A. aeolicus, T. maritima, Archaea]

Black tree: distribution of 8-mers. Red tree: sequence alignment.
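An alignment-free comparison of this kind can be sketched by counting k-mers and comparing the resulting frequency profiles. The slide does not say which distance was used; Euclidean distance is an assumption here, and the toy sequences with k = 2 stand in for real 16S sequences with k = 8.

```python
import math
from collections import Counter


def kmer_profile(seq, k):
    """Relative frequency of every k-mer that occurs in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def profile_distance(p, q):
    """Euclidean distance between two k-mer frequency profiles."""
    keys = set(p) | set(q)
    return math.sqrt(sum((p.get(x, 0.0) - q.get(x, 0.0)) ** 2 for x in keys))


# Toy sequences, for illustration only.
a = kmer_profile("ACGTACGTAC", 2)
b = kmer_profile("ACGTACGTAC", 2)
c = kmer_profile("TTTTGGGGCC", 2)
print(profile_distance(a, b), profile_distance(a, c))
```

A matrix of such pairwise distances could then be handed to a distance-based tree-building method.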

databases of multiple alignments

Pfam: protein families database of alignments and HMMs (www.cgr.ki.se)

PRINTS: multiple motifs consisting of ungapped, aligned segments of sequences, which serve as fingerprints for a protein family (www.bioinf.man.ac.uk)

BLOCKS: multiple motifs of ungapped, locally aligned segments created automatically (fhcrc.org)

software

manual alignment software

GDE: The Genetic Data Environment (UNIX)

CINEMA: Java applet, available from: http://www.biochem.ucl.ac.uk

Seqapp/Seqpup: Mac/PC/UNIX, available from: http://iubio.bio.indiana.edu

SeAl for Macintosh, available from: http://evolve.zoo.ox.ac.uk/Se-Al/Se-Al.html

BioEdit for PC, available from: http://www.mbio.ncsu.edu/RNaseP/info/programs/BIOEDIT/bioedit.html

BLAST

Search a sequence database for fragments similar to the query sequence

1. Compile a list of high-scoring short words shared by the query sequence and the database;

2. Scan the database for “hits”;

3. Expand the “hits” to MSP (maximum segment pair = a pair of equal-length/no-gap segments with the highest alignment score)
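The three steps can be illustrated with a toy seed-and-extend search. This is a simplified sketch, not the BLAST implementation: it uses exact word matches only, a single subject sequence, match/mismatch scores of +1/-1, and a fixed drop-off threshold, all of which are assumptions made for the example.

```python
def find_seeds(query, subject, w):
    """Steps 1-2: index the query's words, then scan the subject for exact hits."""
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    return [(i, j)
            for j in range(len(subject) - w + 1)
            for i in index.get(subject[j:j + w], [])]


def extend(query, subject, i, j, w, match=1, mismatch=-1, x_drop=3):
    """Step 3: grow a seed in both directions without gaps.

    Extension in one direction stops once the running score has fallen
    x_drop below the best score seen in that direction.
    """
    def grow(qi, sj, step):
        best = score = length = best_len = 0
        while 0 <= qi < len(query) and 0 <= sj < len(subject):
            score += match if query[qi] == subject[sj] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score >= x_drop:
                break
            qi += step
            sj += step
        return best, best_len

    right, r_len = grow(i + w, j + w, +1)
    left, l_len = grow(i - 1, j - 1, -1)
    q_start, s_start = i - l_len, j - l_len
    length = l_len + w + r_len
    return (w * match + left + right,
            query[q_start:q_start + length],
            subject[s_start:s_start + length])


def best_segment_pair(query, subject, w=4):
    """Highest-scoring ungapped segment pair found from any seed, or None."""
    hits = [extend(query, subject, i, j, w)
            for i, j in find_seeds(query, subject, w)]
    return max(hits, default=None)


print(best_segment_pair("AAGCGTTACC", "TTGCGTTAGG"))
```

Because only seeded regions are extended, the search is fast but heuristic: a similar region with no exact shared word of length w is missed.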

BLAST

Altschul et al. (1990) "Basic local alignment search tool." J. Mol. Biol. 215:403-410.

Variations of BLAST designed for specific purposes http://www.ncbi.nlm.nih.gov/BLAST/

similarity searching the databanks

What is similar to my sequence?

Searching gets harder as the databases get bigger - and quality degrades

Tools: BLAST and FASTA = time saving heuristics (approximate)

Statistics + informed judgement of the biologist

read out

>gb|BE588357.1|BE588357 194087 BARC 5BOV Bos taurus cDNA 5'. Length = 369

Score = 272 bits (137), Expect = 4e-71
Identities = 258/297 (86%), Gaps = 1/297 (0%)
Strand = Plus / Plus

Query: 17 aggatccaacgtcgctccagctgctcttgacgactccacagataccccgaagccatggca 76
|||||||||||||||| | ||| | ||| || ||| | |||| ||||| |||||||||
Sbjct: 1 aggatccaacgtcgctgcggctacccttaaccact-cgcagaccccccgcagccatggcc 59

Query: 77 agcaagggcttgcaggacctgaagcaacaggtggaggggaccgcccaggaagccgtgtca 136
|||||||||||||||||||||||| | || ||||||||| | ||||||||||| ||| ||
Sbjct: 60 agcaagggcttgcaggacctgaagaagcaagtggagggggcggcccaggaagcggtgaca 119

Query: 137 gcggccggagcggcagctcagcaagtggtggaccaggccacagaggcggggcagaaagcc 196
|||||||| | || | ||||||||||||||| ||||||||||| || ||||||||||||
Sbjct: 120 tcggccggaacagcggttcagcaagtggtggatcaggccacagaagcagggcagaaagcc 179

Query: 197 atggaccagctggccaagaccacccaggaaaccatcgacaagactgctaaccaggcctct 256
||||||||| | |||||||| |||||||||||||||||| ||||||||||||||||||||
Sbjct: 180 atggaccaggttgccaagactacccaggaaaccatcgaccagactgctaaccaggcctct 239

Query: 257 gacaccttctctgggattgggaaaaaattcggcctcctgaaatgacagcagggagac 313
|| || ||||| || ||||||||||| | |||||||||||||||||| ||||||||
Sbjct: 240 gagactttctcgggttttgggaaaaaacttggcctcctgaaatgacagaagggagac 296

structure - function relationships

Can we predict the function of protein molecules from their sequence?

sequence > structure > function

Conserved functional domains = motifs

Prediction of some simple 3-D structures (α-helix, β-sheet, membrane spanning, etc.)

protein domains

(from ProDom database)

DNA sequencing

Automated sequencers > 40 KB per day

500 bp reads must be assembled into complete genes: errors, especially insertions and deletions; error rate is highest at the ends, where we want to overlap the reads; vector sequences must be removed from ends

Faster sequencing relies on better software: overlapping deletions vs. shotgun approaches (TIGR)

DNA sequencing

finding genes in genome sequence is not easy

About 2% of human DNA encodes functional genes.

Genes are interspersed among long stretches of non-coding DNA.

Repeats, pseudo-genes, and introns confound matters

pattern finding tools

It is possible to use DNA sequence patterns to predict genes: promoters, translational start and stop codons (ORFs), intron splice sites, codon bias

Can also use similarity to known genes/ESTs
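The simplest of these patterns, an open reading frame between a start codon and a stop codon, can be scanned for directly. The sketch below looks only at the three forward frames of one strand and ignores introns, promoters and codon bias; the test sequence is made up, and `min_codons` is an illustrative parameter.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) for ATG...stop stretches on the given strand.

    Coordinates are 0-based and end-exclusive; the stop codon is included.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos                       # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos + 3 - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None                      # look for the next ORF
    return orfs


# Made-up sequence, for illustration only.
print(find_orfs("CCATGAAATGGTAAGG"))
```

A real gene finder would also scan the reverse complement and combine ORFs with the other signals listed above.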

phylogenetics

Evolution = mutation of DNA (and protein) sequences

Can we define evolutionary relationships between organisms by comparing DNA sequences?

Is there one molecular clock? Phenetic vs. cladistic approaches. Lots of methods and software; what is the "correct" analysis?

phylogenetics

software tools on the web

Many of the best tools are free over the Web: BLAST, ENTREZ/PUBMED, protein motifs databases

Bioinformatics “service providers”: DoubleTwist™, Celera, BioNavigator™

Hodgepodge collection of other tools: PCR primer design, pairwise and multiple alignment

PC programs

Macintosh and Windows applications

Commercial: Vector NTI™, MacVector™, OMIGA™, Sequencher™

Freeware: Phylip, Fasta, Clustal, etc.

Better graphics, easier to use

Can't access very large databases or perform demanding calculations

Integration with web databases and computing services

Vector NTI

most important sequence databases

GenBank: maintained by USA National Center for Biotechnology Information (NCBI). All biological sequences

www.ncbi.nlm.nih.gov/Genbank/GenbankOverview.html

Genomes: www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=Genome

Swiss-Prot: maintained by EMBL - European Bioinformatics Institute (EBI). Protein sequences

www.ebi.ac.uk/swissprot/

genome project

the human genome project

The genome sequence is complete - almost! Approximately 3.2 billion base pairs.

Any human gene can now be found in the genome by similarity searching with over 99% certainty.

However, the sequence still has many gaps: it is hard to find an uninterrupted genomic segment for any gene, and we still can’t identify pseudogenes with certainty

This will improve as more sequence data accumulates

all the genes

Raw Genome Data: example of the code

The next step is obviously to locate all of the genes and describe their functions. This will probably take another 15-20 years!

example of the code

inconsistency

Celera says that there are only ~34,000 genes, so why are there ~60,000 human genes on Affymetrix GeneChips?

Why does GenBank have 49,000 human gene coding sequences and UniGene have 96,000 clusters of unique human ESTs?

Clearly we are in desperate need of a theoretical framework to go with all of this data

http://www.celera.com/

implications for biomedicine

Physicians will use genetic information to diagnose and treat disease.

Virtually all medical conditions have a genetic component.

Faster drug development research

Individualized drugs

Gene therapy

All Biologists will use gene sequence information in their daily work

the equipment

meaning of the code …

meaning of the code …

evolution

how do genomes evolve?

Point mutations, rearrangements, recombination, selection and drift

how can you view the evolution?

Individual gene alignment view (usually proteins); dot plot or VISTA (local similarity) view; synteny view; composite (average) views

DNA dot plot view

Show one DNA along X-axis, second on Y-axis. For every position along both, score local similarity. Display 2-D plot of similarity in gray-scale.
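A dot plot can be sketched in a few lines. The window size, the fraction-of-matches score and the text rendering below are assumptions chosen for illustration; plotting tools use more refined similarity scores.

```python
def dot_plot(x_seq, y_seq, window=3):
    """Rows follow y_seq, columns follow x_seq.

    Each cell holds the fraction of matching bases in a window starting at
    that pair of positions, a value in [0, 1] usable as a gray level.
    """
    rows = len(y_seq) - window + 1
    cols = len(x_seq) - window + 1
    plot = []
    for j in range(rows):
        row = []
        for i in range(cols):
            same = sum(x_seq[i + k] == y_seq[j + k] for k in range(window))
            row.append(same / window)
        plot.append(row)
    return plot


def render(plot, threshold=1.0):
    """Text rendering: '#' where the window score reaches the threshold."""
    return "\n".join("".join("#" if v >= threshold else "." for v in row)
                     for row in plot)


seq = "ACGTACGTTT"          # toy sequence containing a tandem repeat of ACGT
print(render(dot_plot(seq, seq)))
```

Plotting a sequence against itself gives the main diagonal (self-match); the tandem repeat shows up as shorter diagonals parallel to it.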

self-match

tandem duplication

dot plot example

random dot plot

promoter conservation

gene structure revealed by dot plot

synteny view

Synteny definition: a contiguous region in another genome that has more-or-less the same genes in the same order.

The boundaries of what constitutes synteny are a bit fuzzy… for example, you would probably still call a region syntenic if it is missing one gene out of many.

single inversion

insertion or deletion

double inversion

intra-chromosomal rearrangements

inter-chromosomal rearrangements

syntenic scaling

These regions are perfectly syntenic, but on average the mouse has shorter regions separating alignable conserved blocks.

limitations to synteny view

Provides only overview of arrangement, with no information about the degree or areas of conservation.

As genomes become more distant synteny becomes more chaotic, until (in the extreme) most blocks are one gene long (e.g. flies vs. human).

In some cases, very deep synteny can be seen, most dramatically in the Hox clusters.

Hox cluster

composite or summary views

View comparative summaries that encapsulate general properties of the genome.

For example, G-C content comparison:
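G-C content is straightforward to compute, either for a whole sequence or in sliding windows along it. A minimal sketch with made-up sequences; the window and step sizes are illustrative parameters.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous bases of seq."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc / (gc + at) if gc + at else 0.0


def gc_windows(seq, window, step):
    """G-C fraction in sliding windows, as (start, fraction) pairs."""
    return [(i, gc_content(seq[i:i + window]))
            for i in range(0, len(seq) - window + 1, step)]


# Toy sequences, for illustration only.
print(gc_content("GGCCAATT"))
print(gc_windows("GGGGAAAACCCC", window=4, step=4))
```

Comparing such window profiles across genomes gives one simple composite view.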

phylogenetics and evolution