+ All Categories
Home > Documents > Mathematics and the Genome Winfried Just Department of Mathematics and Quantitative Biology...

Mathematics and the Genome Winfried Just Department of Mathematics and Quantitative Biology...

Date post: 29-Dec-2015
Category:
Upload: willis-morgan
View: 213 times
Download: 0 times
Share this document with a friend
Popular Tags:
38
Mathematics and the Genome Winfried Just Department of Mathematics and Quantitative Biology Institute Ohio University
Transcript

Mathematics and the Genome

Winfried JustDepartment of Mathematics and

Quantitative Biology InstituteOhio University

This talk is dedicated to the memory of Dr. Pawel Zbierski,

one of the great teachers in my life.

Biology’s dilemma: There is too much to know about living things

Roughly 1.5 million species of organisms have been described and given scientific names to date. Some biologists estimate that the total number of all living species may be several times higher. It is impossible tolearn everything about all these organisms. Biologists solve the dilemma by focusing on some species, so-calledmodel organisms, and trying to find out as much as theycan about these model organisms.

Some important model organisms

Mammals: Homo sapiens, Chimpanzee, mouse, ratFish: Zebrafish, PufferfishInsects: Fruitfly (Drosophila melanogaster)Roundworms: Ceanorhabditis elegansProtista: Malaria parasite (Plasmodium falciparum)Fungi: Yeast (Saccharomyces cerevisiae, S. pombe) Plants: Thale cress (Arabidopsis thaliana), corn, riceBacteria: Escherichia coli, salmonellaArchea: Methanococcus janaschii

Let’s find out everything about some species

What would it mean to learn everything about a given species? All available evidence indicates that the

completeblueprint for making an organism is encoded in the organism’s genome. Chemically, the genome consists of one or several DNA molecules. These are long strings composed of pairs of nucleotides. There are only fourdifferent nucleotides, denoted by A, C, G, T. The information about how to make the organism is encoded by the order in which the nucleotides appear.

Some genome sizes

HIV2 virus 9671 bp Mycoplasma genitalis 5.8 · 105 bp Haemophilus influenzae 1.83 · 106 bp Saccharomyces cerevisiae 1.21 · 107 bp Caenorhabditis elegans 108 bp Drosophila melanogaster 1.65 · 108 bp Homo sapiens 3.14 · 109 bp Some amphibians 8 · 1010 bp Amoeba dubia 6.7 · 1011 bp

Sequencing Genomes

Contemporary technology makes it possible to completelysequence entire genomes, that is, determine the sequenceof A’s, C’s, G’s, and T’s in the organism’s genome. Thefirst virus was sequenced in the 1980’s, the firstbacterium (Haemophilus influenzae) in 1995, the first multicellular organism (Caenorhabditis elegans) in 1998. The rough draft of the human genome was announced inJune 2000.

How rough is the draft of the human genome?

Announced June 2000 Covers about 95% of the genome. Contains more than 100,000 gaps Public version: Started: 1990

Based on genome of one person Celera version: Started 1998

Based on genome of five persons

Where to store all these data?

Some of the sequence data are stored in proprietary data bases, but most of them are stored in the public data base Genbank and can be accessed via the World Wide Web. In fact, most relevant journals require proof of submission to Genbank before an article discussing sequence data will be published. A notable exception was the publication ofCelera’s announcement in Science.

What’s in the databases?

As of February 20, 2000, Genbank contained 5,861,088,510 bp of information. There were about

600 completely sequenced viruses, 19 completely

sequencedbacteria, 6 completely sequenced archaea, and 3 complete genomes of eukaryotes: S. cerevisiaea

(baker’s yeast), C. elegans (a roundworm), and Drosophilamelanogaster (fruitfly).

What’s in the databases?

As of November 23, 2000, Genbank contained 10,853,673,034 bp of information. There were about 600 completely sequenced viruses, 29 completely sequencedbacteria, 8 completely sequenced archaea, and 3 complete genomes of eukariotes: S. cerevisiaea (baker’s yeast), C. elegans (a roundworm), and Drosophilamelanogaster (fruitfly). The genome of Arabidopsis (thalecress) was near completion, and a first draft of the human genome had been completed.

What’s in the databases?

As of March 18, 2002, Genbank contained 20,197,497,568 bp of information. There were about

700 completely sequenced viruses, 63 completely

sequencedbacteria, 13 completely sequenced archaea, and 5complete genomes of eukaryotes: S. cerevisiaea, S.

pombe(two yeasts), C. elegans (a roundworm), Drosophilamelanogaster (fruitfly) and Arabidopsis thaliana (thale cress), as well as a draft of the human genome.

First mathematical challenge:Sequencing large genomes

Currently, much of the sequencing process is automated.

However, contemporary sequencing machines can only

sequence stretches of DNA that are a few hundred base

pairs long at a time. The process of assembling these stretches of sequence into a whole genome poses

some interesting mathematical problems.

First mathematical challenge:Sequencing large genomes

For example, the publicly financed Human Genome Project used an approach called genome mapping to facilitate sequence assembly. Most of the time the HGP took was in fact spent on onstructing the scaffold of this map.In contrast, Celera Genomics allegedly used an approach called shotgun sequencing that works by randomly

cutting up the genome into small streches, sequencing them, and then using a clever algorithm to assemble the whole genome. There was much debate over the feasibility of the latter approach, but it apparently worked.

You have sequenced your genome - what do you do with it?

This is known as genome analysis or sequence analysis.

At present, most of bioinformatics is concerned with sequence analysis. Here are some of the questions studied in sequence analysis: gene finding protein 3D structure prediction gene function prediction prediction of important sites in proteins reconstruction of phylogenetic trees

How the genome controls the organism

The genome controls the making and workings of an organism by telling the cell which proteins to

manufactureunder which conditions. Proteins are the workhorses ofbiochemistry and play a variety of roles. Most notably, many proteins are enzymes that catalyze specific chemical reactions. All biochemical reactions are catalyzed by enzymes.

A gene is a stretch of DNA that codes a given protein.

Where are the genes?

The objective of gene finding is to identify the regions of DNA that are genes. Ideally, we want to make

statementslike: “Positions 28,354 through 29,536 of this genome

code a protein.” Once we have identified a gene, it is easy to translate the DNA code into the sequence of amino acids that make up the corresponding protein.The mathematical challenge here is to identify patterns in DNA that reliably indicate where a gene starts and ends,especially in eukaryotes.

Hidden Markov Models for gene finding (caricature)

Most current gene finding programs are based on Hidden Markov Models. These work as follows: assume (wrongly)that the DNA-sequence has been generated randomly by a Markov model that can be in one of two states: “gene” or “intergenic region.” Each state has a characteristic probability of “emitting” a given nucleotide, and has a characteristic (low) probability of switching to the other state. The observer sees the sequence of emissions(nucleotides), but the information by which state a givennucleotide was emitted is hidden from the observer.

Hidden Markov Models for gene finding (caricature)

Now the observer wants to infer the actual sequence ofstates of the Markov model that caused the observed emissions. This sequence is called the path through the hidden Markov model. The (posterior) probability of anygiven path is easy to calculate, and it is computationallyinexpensive to infer the most likely path for a given sequence of emissions (using the so-called Viterbi algorithm). This path gives some hypothesis for the location of the genes. It is also easy to calculate probabilities that a predicted gene is actually a gene (under the assumptions of the model).

Hidden Markov Models for gene finding - the real picture

In reality, the situation is much more complicated. Coding

regions of genes are not characterized by frequencies of single nucleotides, but of triplets and hexamers of nucleotides. Additional information, such as signals that indicate the beginning or end of a gene or a splicing site are being used. Additional difficulties arise because of: existence of six possible reading frames existence of introns in eukaryotes variable codon usage frequencies in different species

A big mathematical challenge

The underlying assumption of Hidden Markov Models that

DNA sequences are emitted by a Markov Model is obviously far removed from biological reality. So the question is: How can we construct gene finding tools

thatare based on biologically more meaningful assumptions?This has practical consequences for evaluating the probability that a predicted gene is actually a gene, or estimating the fraction of actual genes that have been identified as such by a given gene-finding algorithm.

What did the Hidden Markov Models find?

Mycoplasma genitalis (bacterium) 500 Genes Escherichia coli (bacterium) 4,500 Genes Saccharomyces cerevisiae (yeast) 6,000 Genes Caenorhabditis elegans (worm) 19,000 Genes Drosophila Melanogaster (fruitfly) 13,500 Genes Arabidopsis thaliana (thale cress) 25,500 Genes Homo sapiens (Human) 24,000-40,000 Genes Oryza sativa japonica (rice) 32,000-50,000 Genes Oryza sativa indica (rice) 45,000-56,000 Genes

So we know the genes - do we know everything?

Far from it. The next two questions are:

Given a single gene, how does it function in the biology of an organism?

How do various genes interact?

From genes to proteins

From the chemical point of view, proteins are long chains

of chemicals called amino acids. There are 20 amino acids

used to make most proteins in most organisms. Amino

acids are coded by triplets of nucleotides, which are also

called codons.

Protein structure prediction

When a protein is manufactured in the cell, it assumes acharacteristic 3D structure or fold. It is very costly to determine the 3D structure of a protein experimentally

(by NMR or X-ray crystallography). It would be much cheaper if we could predict the 3D structure of a protein directly from its sequence of amino acids that is coded in the genome. This is known as the protein folding problem. Many approaches have been proposed to develop algorithms for solving this problem; so far results are mixed.

Protein structure prediction

In theory, it is possible to predict a protein fold ab initio, that is from first principles. However, the task is beyond the capabilities of current supercomputers. Recently IBM announced plans to develop a new pentaflopsupercomputer (1015 floating point operations per second)called “Blue Gene” that will be designed specifically with the task of ab initio protein fold prediction in mind. It should be just powerful enough for the task if no unexpected complications arise.

Prediction of gene function

Suppose you have identified a gene. What is its role in the

biochemistry of its organism? Sequence databases can help us in formulating reasonable hypotheses. Search the database for genes with similar nucleotide

sequences in other organisms. If the functions of the most similar genes are known

and if they tend to be the same function (e.g., “codes enzyme involved in glucose metabolism”), then it is reasonable to conjecture that your gene also codes an enzyme involved in glucose metabolism.

Prediction of gene function: homology searches

Given a nucleotide or DNA sequence, searching the data base(s) for similar sequences is known as “homologysearches”. The most popular software tool for performingthese searches is called BLAST; therefore biologists oftenspeak of “BLAST searches”. There are two interesting problems here: How to measure “similarity” of two sequences. How much similarity constitutes evidence of biologically

meaningful homology as opposed to random chance?

Prediction of important sites in proteins

Not all parts of a protein are equally important; the function of most of its amino acids is often just to

maintain an appropriate 3D structure, and mutations of those

less crucial amino acids often don't have much effect. However, most proteins have crucial parts such asbinding sites. Any mutations occurring at binding

sitestend to be lethal and will be weeded out by evolution.

How to predict binding sites from sequence data:

Get a collection of proteins of similar amino acid sequences and analogous biochemical function from your database.

Align these sequences amino acid by amino acid. Check which regions of the protein are highly

conserved in the course of evolution. The binding site should be in one of the highly

conserved regions.

Using genomic data for reconstruction of phylogenies

A phylogenetic tree depicts the branching pattern in theevolution of contemporary species from their common ancestor. Given the sequences of homologous genes (i.e., genes derived from a common ancestral gene), one can try to reconstruct the phylogenetic tree for these species by looking at the amount of evolutionary change that has occurred at the molecular level and estimating thetimes at which any two of these species diverged.

Methods for reconstruction of phylogenies

Distance methods Maximum parsimony Maximum likelihood Bayesian Analysis

A big mathematical challenge is to devise fast and reliablealgorithms for phylogenetic reconstruction for large sets ofspecies, especially using Maximum Likelihood or BayesianAnalysis.

Reconstruction of phylogenies: A success story

There are two basic kinds of free-living organisms: prokaryotes, that do not have a cell nucleus, and eukaryotes, which do. Prokaryotes fall into two major groups: Eubacteria and Archaea. Phenotypically, eubacteria and archaea are very similar to each other. However, it has been demonstrated by using moleculardata that archaea are more closely related to eukaryotesthan to eubacteria, and thus it appears that the evolutionary branching between archaea and eubacteria occurred before the branching of archaea and

eukaryotes.

Gene interactions: Collecting gene expression data

All cells of a multicellular organism have the same set ofgenes. What accounts for the differences in various celltypes and function is which of the genes are being expressed (switched on) at a given time in a given cell.A relatively new technology, called gene chips or microarrays makes it possible to monitor, for tens of thousands of genes simultaneously, the differences in

geneexpression levels between two different experimental conditions.

Gene interactions: Interpeting gene expression data

Once gene expression data have been collected, it is possible to identify clusters of genes that have similarexpression profiles, that is, are up- or downregulatedunder the same experimental conditions. One then conjectures that genes with similar expression profileshave similar functions, for example, are involved in thesame biochemical pathways. Such conjectures can

serveas powerful guides for setting up experiments to confirmthe biochemical role of groups of genes.

Interpeting gene expression data:A mathematical challenge

Gene expression data sets are peculiar in the sense that we typically have very few experiments (5-10 perhaps) and a large number (tens of thousands) monitored genes.It seems inevitable that some genes will show similar expression profiles just by random accident.

Question: How can we tell spurious clusters of genes from biologically meaningful ones?

Gene expression profiles:A success story

Cancer patients with the same clinical picture often respond to very different types of treatment. Gene expression profiles of groups of cancer patients have revealed that what looks to the clinician as the same disease can sometimes be one of several diseases at thebiochemical level. The latter can be distinguished bycharacteristic expression profiles of certain groups of genes. Once the biochemical nature of the disease has been established, treatment can be tailored to the type ofdisease a patient actually has.

The databases of the future

Genomic data bases like Genbank are just the beginning.In the near future we will see: Gene expression data banks SNP (single nucleotide polymorphism) data banks proteomics data banks data banks of biochemical pathways …Setting up these data banks and making intelligent use ofthem will require new mathematical tools, to be

developed by the next generation of mathematicians.


Recommended