Pharmacogenomics and Bioinformatics
M. Saleet Jafri
What is pharmacogenomics?
• Pharmacogenomics is the use of genomic and sequence data from hosts and pathogens to identify potential drug targets
• Involves a variety of techniques/disciplines such as sequence analysis, protein structure, genomics, microarray analysis and others
• These fields rely heavily on bioinformatics
• Usually focuses on medical or agricultural applications
Human Genome Project
Project goals are to
• identify all the approximately 20,000-25,000 genes in human DNA,
• determine the sequences of the 3 billion chemical base pairs that make up human DNA,
• store this information in databases,
• improve tools for data analysis,
• transfer related technologies to the private sector, and
• address the ethical, legal, and social issues (ELSI) that may arise from the project.
From http://www.ornl.gov/hgmis/
Human Genome Project
Progress
- Several types of genome maps have already been completed, and a working draft of the entire human genome sequence was announced in June 2000, with analyses published in February 2001.
- An important feature of this project is the federal government's long-standing dedication to the transfer of technology to the private sector. By licensing technologies to private companies and awarding grants for innovative research, the project is catalyzing the multibillion-dollar U.S. biotechnology industry and fostering the development of new medical applications.
From http://www.ornl.gov/hgmis/
Human Genome Project
• Seven organisms were originally chosen for sequencing.
– E. coli
– Yeast
– Fly
– Worm
– Arabidopsis
– Mouse
– Human
• Why were these chosen?
Genome Projects
As of January 2005 there were many more sequenced
– 25 non-plant eukaryotes
– 5 plants
– 213 microbes completed
– 21 Archaea
– 274 microbes in progress
– 1431 viruses in progress
– 833 non-virus organisms with at least one nucleotide sequence submitted
• Why were these chosen?
Genome Projects
• Chosen by funding agencies
• Four main categories
– Medical applications
– Evolutionary significance
– Environmental impact
– Food production
How is genomics used for drug target identification?
• The basic idea is to look for genes unique to the pathogen that are crucial for its survival. This would be the drug target.
• If this is a pathogen in the host, the gene would be in the pathogen and not in the host.
• If this was in the environment, the gene should be as specific as possible for the pathogen to avoid harming other organisms that might be beneficial.
How can this be done?
• To do this genomics, proteomics and bioinformatics are involved.
• In any of these cases bioinformatics tools are necessary.
Genome Sequencing and Comparison
• As mentioned earlier, many pathogens (viruses, bacteria, and other microorganisms) have been sequenced.
• Once they are sequenced, they are annotated. Annotation is the process by which the functions of the different proteins (genes) are determined.
• In this way, an understanding of the organism's metabolism is gained.
Malaria
• Malaria is caused by parasites of the genus Plasmodium, with Plasmodium falciparum being the most lethal.
• Its genome has been sequenced
• It is a pathogen that digests proteins for food. It does not contain any amino acid producing genes in its genome, i.e. it does not make its own amino acids.
• Purines are recycled, but there are no genes for purine synthesis.
• Has many ATP-dependent solute transporters and one novel multifunctional transporter.
How is annotation done?
• Annotation is the process of predicting the function of genes in a genome.
• First, all the genes have to be found. This is done by finding the open reading frames (ORFs).
• ORFs are located with gene finding or gene prediction software.
Gene Prediction
• Analysis by sequence similarity can only reliably identify about 30% of the protein-coding genes in a genome
• 50-80% of new genes identified have a partial, marginal, or unidentified homolog
• Frequently expressed genes tend to be more easily identifiable by homology than rarely expressed genes
Gene Finding
• Process of identifying potential coding regions in an uncharacterized region of the genome
• Still a subject of active research
• There are many different gene finding software packages and no one program is capable of finding everything
Eukaryotes vs Prokaryotes
• Eukaryotic DNA is wrapped around histones, which might result in repeated patterns (periodicity of 10) for histone binding. The promoter regions might be near these sites so that they remain hidden.
• Prokaryotes have no introns.
• Promoter regions and start sites are more highly conserved in prokaryotes
• Different codon use frequencies
Gene finding is species-specific
• Codon usage patterns vary by species
• Functional regions (promoters, splice sites, translation initiation sites, termination signals) vary by species
• Common repeat sequences are species-specific
• Gene finding programs rely on this information to identify coding regions
The genetic code
Codon usage
Identifying ORFs
• Simple first step in gene finding
• Translate genomic sequence in six frames. Identify stop codons in each frame
• Regions without stop codons are called "open reading frames" or ORFs
• Locate and tag all of the likely ORFs in a sequence
• The longest ORF from a Met codon is a good prediction of a protein encoding sequence.
• SOFTWARE: NCBI ORF Finder
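The six-frame scan described above can be sketched in a few lines of Python. This is a minimal illustration, not how NCBI ORF Finder is implemented: the function names (`reverse_complement`, `find_orfs`) and the `min_codons` cutoff are assumptions made for this example, and only the standard genetic code's ATG start and TAA/TAG/TGA stops are considered.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """Scan all six frames for ATG...stop ORFs (hypothetical helper).

    Returns tuples (strand, frame, start, end, orf). start/end are 0-based
    offsets on the scanned strand; end is exclusive and includes the stop.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == START_CODON:
                    # First Met after the previous stop: gives the longest ORF
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3, s[start:i + 3]))
                    start = None
    return orfs


print(find_orfs("CCATGAAATTTGGGTAACC"))
```

Because the scan keeps the first Met after each stop codon, every reported ORF is the longest one available from a Met codon in its frame, matching the heuristic on the slide.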
ORF Finder input
ORF finder results
Tests of the Predicted ORF
• Check whether the third base in the codons tends to be the same one more often than expected by chance alone.
• Check whether the codons used in the ORF are the same as those used in other genes (this requires codon usage frequencies).
• Compare the amino acid sequence for similarity with other known amino acid sequences.
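The first two tests above reduce to simple counting. The sketch below tabulates third-position base composition and codon frequencies for a candidate ORF. The helper names are hypothetical, and deciding what counts as "more often than by chance" would still require a reference codon usage table and a statistical test, neither of which is shown here.

```python
from collections import Counter


def codons(orf):
    """Split an in-frame sequence into complete codons."""
    usable = len(orf) - len(orf) % 3
    return [orf[i:i + 3] for i in range(0, usable, 3)]


def third_base_fractions(orf):
    """Fraction of codons ending in each base (hypothetical helper)."""
    counts = Counter(c[2] for c in codons(orf.upper()))
    total = sum(counts.values())
    return {base: counts[base] / total for base in "ACGT"}


def codon_frequencies(orf):
    """Relative frequency of each codon observed in the ORF."""
    counts = Counter(codons(orf.upper()))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


orf = "ATGGCCGCCGACTAA"
print(third_base_fractions(orf))
print(codon_frequencies(orf))
```

The resulting frequencies could then be compared with those of known genes from the same species.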
Problems with ORF finding
• A single-character sequencing error can hide a stop codon or insert a false stop codon, preventing accurate identification of ORFs
• Short exons can be overlooked
• Multiple transcripts or ORFs on complementary strand can confuse results
Pattern-based gene finding
• ORF finding based on start and stop codon frequency is a pattern-based procedure
• Other pattern-based procedures recognize characteristic sequences associated with known features and genes, such as ribosome binding sites, promoter sites, histone binding sites, etc.
• Statistically based.
Content-based gene finding
• Content-based gene finding methods rely on statistical information derived from known sequences to predict unknown genes
• Some evaluative measures include: "coding potential" (based on codon bias), periodicity in the sequence, sequence homogeneity, etc.
A standard content-based alignment procedure
• Select a window of DNA sequence from the unknown. The window is usually around 100 base pairs long
• Evaluate the window's potential as a gene, based on a variety of factors
• Move the window over by one base
• Repeat procedure until end of sequence is reached; report continuous high-scoring regions as putative genes
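The sliding-window procedure above might look like the following sketch. The scoring function here is plain GC fraction, chosen only as a stand-in for a real coding-potential measure; the function names, the threshold, and the short window in the example are all assumptions for illustration.

```python
def gc_fraction(window):
    """Toy score: fraction of G/C bases (stand-in for coding potential)."""
    return sum(base in "GC" for base in window) / len(window)


def scan_windows(seq, window=100, score=gc_fraction):
    """Score every window, moving over by one base each time."""
    seq = seq.upper()
    return [(i, score(seq[i:i + window]))
            for i in range(len(seq) - window + 1)]


def high_scoring_regions(scores, window, threshold):
    """Merge runs of consecutive high-scoring windows into (start, end) spans."""
    regions = []
    run_start = last = None
    for pos, value in scores:
        if value >= threshold:
            if run_start is None:
                run_start = pos
            last = pos
        elif run_start is not None:
            regions.append((run_start, last + window))
            run_start = None
    if run_start is not None:
        regions.append((run_start, last + window))
    return regions


seq = "AT" * 10 + "GC" * 10 + "AT" * 10
scores = scan_windows(seq, window=10)
print(high_scoring_regions(scores, window=10, threshold=1.0))
```

In practice the score would combine several measures, and the window would be closer to the 100 base pairs mentioned on the slide.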
Combining measures
• Programs rarely use one measure to predict genes
• Different values are combined (using probabilistic methods, discriminant analysis, neural net methods, etc.) to produce one "score" for the entire window
Drawbacks to window-based evaluation
• A sequence length of at least 100 b.p. is required before significant information can be gained from the analysis
• Results in a +/- 100 b.p. uncertainty in the start site of predicted coding regions, unless an unambiguous pattern can also be found to indicate the start.
Most are web-based, but...
• Submit sequence; input sequence length may be limited
• Select parameters, if any
• Interpret results
• Most software is first or second generation; results come in non-graphical formats.
• GeneMark, GenScan, Glimmer
How is annotation done?
• This is done by comparing the DNA sequences of the genes to known genes in a database. If the sequences are similar, then a similar function is assumed.
• The comparison is done using sequence comparison tools such as BLAST
Database Searching for Similar Sequences
• Database searching for similar sequences is ubiquitous in bioinformatics.
• Databases are large and getting larger
• Need fast methods
Types of Searches
• Sequence similarity search with query sequence
• Alignment search with profile (scoring matrix with gap penalties)
• Search with position-specific scoring matrix representing ungapped sequence alignment
• Iterative alignment search for similar sequences that starts with a query sequence, builds a multiple alignment, and then uses the alignment to augment the search
• Search query sequence for patterns representative of protein families
From Bioinformatics by Mount
DNA vs Protein Searches
• DNA sequences consist of 4 characters (nucleotides)
• Protein sequences consist of 20 characters (amino acids)
• Hence, it is easier to detect patterns in protein sequences than DNA sequences
• Better to convert DNA sequences to protein sequences for searches.
Database Searching Efficacy
• To evaluate searching methods, selectivity and sensitivity need to be considered.
• Selectivity is the ability of the method not to find members known to be of another group (i.e. false positives).
• Sensitivity is the ability of the method to find members of the same protein family as the query sequence.
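Both measures can be estimated when the true family membership of the database entries is known. The sketch below is one possible formalization, not a definition taken from the slides: sensitivity is computed as the fraction of true family members that were found, and selectivity as the fraction of reported hits that are true members. Other texts define selectivity differently (for example as a true-negative rate), so the formula and the function name are assumptions.

```python
def search_performance(hits, family):
    """Return (sensitivity, selectivity) for one database search.

    hits   -- identifiers reported by the search method
    family -- identifiers known to belong to the query's protein family
    """
    hits, family = set(hits), set(family)
    true_pos = len(hits & family)    # family members that were found
    false_pos = len(hits - family)   # reported hits from other groups
    false_neg = len(family - hits)   # family members that were missed
    sensitivity = true_pos / (true_pos + false_neg) if family else 0.0
    selectivity = true_pos / (true_pos + false_pos) if hits else 0.0
    return sensitivity, selectivity


# Hypothetical identifiers, for illustration only
print(search_performance(hits=["a", "b", "c", "x"],
                         family=["a", "b", "c", "d", "e"]))
```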
Protein Searches
• Easier to identify protein families by sequence similarity rather than structural similarity. (same structure does not mean same sequence)
• Use the appropriate gap penalty scorings
• Evaluate results for statistical significance.
History
• Historically dynamic programming was used for database sequence similarity searching.
• Computer memory, disk space, and CPU speed were limiting factors.
• Speed is still a factor due to larger databases and the increased number of searches.
• FASTA and BLAST allow fast searching.
History
• The PAM250 matrix was used for a long time. It corresponds to a period of time where only 20% of the amino acids have remained unchanged.
• BLOSUM has replaced PAM250 in most applications. BLAST uses the BLOSUM62 matrix. FASTA uses the BLOSUM50 matrix.
Search Tools
• Similarity Search Tools
– Smith-Waterman Searching
• Heuristic Search Tools
– FASTA
– BLAST
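To make the dynamic programming behind Smith-Waterman searching concrete, here is a minimal sketch that computes only the best local alignment score. It assumes a simple match/mismatch scheme and a linear gap penalty (the default values are arbitrary); real searches would use a substitution matrix such as BLOSUM62 and usually affine gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0
            curr[j] = max(0, diagonal, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("ACGT", "TTACGTCC"))
```

The running time grows with the product of the two sequence lengths, which is why heuristic tools such as FASTA and BLAST are preferred for large databases.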
Malaria Vaccine
• A German and American team used reverse genetics, i.e. they used the sequenced genome, deduced the candidate genes, and then knocked out a particular gene (Uis3).
• This gives 30-day immunity in mice, which is better than vaccines made by traditional methods
Microarray Data Analysis
Gene chips allow the simultaneous monitoring of the expression level of thousands of genes. Many statistical and computational methods are used to analyze this data. These include:
– statistical hypothesis tests for differential expression analysis
– principal component analysis and other methods for visualizing high-dimensional microarray data
– cluster analysis for grouping together genes or samples with similar expression patterns
– hidden Markov models, neural networks and other classifiers for predictively classifying sample expression patterns as one of several types (diseased, i.e. cancerous, vs. normal)
What is Microarray Data?
In spite of its ability to let us simultaneously monitor the expression of thousands of genes, microarray data has some liabilities. Each microarray is very expensive, the statistical reproducibility of the data is relatively poor, and there are a lot of genes and complex interactions in the genome.
Microarray data is often arranged in an n x m matrix M with rows for the n genes and columns for the m biological samples in which gene expression has been monitored. Hence, m_ij is the expression level of gene i in sample j. A row e_i is the gene expression pattern of gene i over all the samples. A column s_j is the expression level of all genes in a sample j and is called the sample expression pattern.
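As a small illustration of this layout, the matrix can be held as a list of rows, one per gene. The values and helper names below are made up for the example.

```python
# Hypothetical 3-gene x 4-sample expression matrix M (values are invented)
M = [
    [2.0, 4.0, 8.0, 16.0],  # gene 0
    [1.0, 1.0, 1.0, 1.0],   # gene 1
    [9.0, 7.0, 5.0, 3.0],   # gene 2
]


def gene_pattern(M, i):
    """Row e_i: expression of gene i across all samples."""
    return M[i]


def sample_pattern(M, j):
    """Column s_j: expression of all genes in sample j."""
    return [row[j] for row in M]


print(gene_pattern(M, 0))    # pattern of gene 0 over the 4 samples
print(sample_pattern(M, 1))  # pattern of sample 1 over the 3 genes
```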
Types of Microarrays
• cDNA microarray
• Nylon membrane and plastic arrays (by Clontech)
• Oligonucleotide silicon chips (by Affymetrix)
• Note: Each new version of a microarray chip is at least slightly different from the previous version. This means that the measures are likely to change. This has to be taken into account when analyzing data.
cDNA Microarray
• The expression level e_ij of a gene i in sample j is expressed as a log ratio, log(r_ij/g_i), of its actual expression level r_ij in this sample over its expression level g_i in a control.
• When this data is visualized, e_ij is color coded as red (r_ij >> g_i), green (r_ij << g_i), or a mixture in between.
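A short sketch of the log ratio follows. The slide does not state the base of the logarithm; base 2 is assumed here because it makes a twofold increase and a twofold decrease come out as +1 and -1. The function name is hypothetical.

```python
import math


def log_ratio(r, g):
    """log2 of sample expression r over control expression g (base 2 assumed)."""
    return math.log2(r / g)


print(log_ratio(200.0, 100.0))  # twofold up in the sample
print(log_ratio(50.0, 100.0))   # twofold down in the sample
print(log_ratio(100.0, 100.0))  # unchanged
```

The symmetry around zero is the main reason the ratio is log-transformed before visualization or clustering.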
Nylon Membrane and Plastic Arrays (by Clontech)
• A raw intensity and a background value are measured for each gene.
• The analyst is free to choose the raw intensity or can adjust it by subtracting the background intensity.
Oligonucleotide Silicon Chips (by Affymetrix)
• These arrays produce a variety of numbers derived from 16-20 pairs of perfect match (PM) and mismatch (MM) probes.
• There are several statistics related to gene expression that can be derived from this data. The most commonly used one is the average difference (AVD), which is derived from the differences of PM-MM in the 16-20 probe pairs.
• The next most commonly used method is the log absolute value (LAV), which comes from the ratios PM/MM in the probe pairs.
• Note: The Affymetrix gene-chip software has an absent/present call for each gene on a chip. According to Jagota, the method is complex and arbitrary so they usually ignore it.
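In its simplest form, the average difference is the mean of PM-MM over the probe pairs of one gene. The sketch below computes that plain mean; the vendor software may apply additional steps (for example, discarding outlying probe pairs) that are not reproduced here, and the intensities are invented for illustration.

```python
def average_difference(pm, mm):
    """Plain mean of PM - MM over the probe pairs of one gene."""
    if len(pm) != len(mm):
        raise ValueError("PM and MM must have one value per probe pair")
    differences = [p - m for p, m in zip(pm, mm)]
    return sum(differences) / len(differences)


# Hypothetical intensities for four probe pairs
pm = [120.0, 150.0, 90.0, 200.0]
mm = [100.0, 100.0, 100.0, 100.0]
print(average_difference(pm, mm))
```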
For What Do We Use Microarray Data?
• Genes with similar expression patterns over all samples – We can compare the expression patterns e_i and e_i' of two genes i and i' over all samples.
• If we use cluster analysis, we can separate the genes into groups of genes with similar expression patterns (trees).
• This will allow us to find what unknown genes have altered expression in a particular disease by comparing the pattern to genes known to be affiliated with a disease.
• It can also find genes that fit a certain pattern such as a particular pattern of change with time.
• It can also characterize broad functional classes of new genes from the known classes of genes with similar expression.
For What Do We Use Microarray Data?
• Genes with unusual expression levels in a sample – In contrast to standard statistical methods where we ignore outliers, here outliers might have particular importance. Hence, we look for genes whose expression levels are very different from the others.
• Genes whose expression levels vary across samples – We can compare gene expression levels of a particular gene or set of genes in different samples. This can be used to compare normal and diseased tissues or diseased tissue before and after treatment.
For What Do We Use Microarray Data?
• Samples that have similar expression patterns – We might want to compare the expression patterns of all genes between two samples. We might cluster the genes into groups of genes with similar expression patterns to help with the comparison. This can be used to compare normal and diseased tissues or diseased tissue before and after treatment.
• Tissues that might be cancerous (diseased) – We can take the gene expression pattern of a sample and compare it to library expression patterns that indicate diseased or not diseased tissue.
Statistical Methods Can Help
• Experimental Design – Since using microarrays is costly and time consuming, we want to design experiments to use the minimal number of microarrays that will give a statistically significant result.
• Data Pre-processing – It is sometimes useful to preprocess the data prior to visualization. An example of this is the log ratio mentioned earlier. It is often necessary to rescale data from different microarrays so that they can be compared. This is due to chip-to-chip variation in intensity. Another type of preprocessing is subtracting the mean and dividing by the standard deviation.
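Two of the preprocessing steps mentioned above can be sketched briefly. Standardization is taken here to mean subtracting the mean and dividing by the (population) standard deviation, the usual z-score. Rescaling chips to a common mean intensity is just one simple way to account for chip-to-chip variation. Both function names are hypothetical.

```python
import statistics


def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def rescale_to_common_mean(chip, target_mean):
    """Scale one chip's intensities so that their mean equals target_mean."""
    factor = target_mean / statistics.mean(chip)
    return [v * factor for v in chip]


print(standardize([2.0, 4.0, 6.0]))
print(rescale_to_common_mean([10.0, 20.0, 30.0], target_mean=40.0))
```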
Statistical Methods Can Help
• Data Visualization – Principal component analysis and multidimensional scaling are two useful techniques for reducing multidimensional data to two and three dimensions. This allows us to visualize it.
• Cluster Analysis – By associating genes with similar expression patterns, we might be able to draw conclusions about their functional expression.
• Probability Theory – We can use statistical modeling and inference to analyze our data. Probability theory is the basis for these.
Statistical Methods Can Help
• Statistical Inference – This is the formulation and statistical testing of a hypothesis and alternative hypothesis.
• Classifiers for the Data – We can construct classes from data, such as diseased vs. non-diseased tissue. We can build a model (such as a hidden Markov model) that fits known data for the different classes. This can then be used to classify previously unclassified data.
Preprocessing Microarray Data
• Before microarray data can be analyzed or stored, a number of procedures or transformations must be applied to it.
• In order to analyze the data correctly, it is important to understand what the transformations might be doing to the data.
Preprocessing Microarray Data
• Ratioing the data
• Log-transforming ratioed data
• Alternative to ratioing the data
• Differencing the data
• Scaling data across chips to account for chip-to-chip difference
• Zero-centering a gene on a sample expression pattern
• Weighting the components of a gene or sample expression pattern differently
• Handling missing data
• Variation filtering expression patterns
• Discretizing expression data
Cluster Analysis of Microarray Data
• Recall that microarray data can be thought of as gene expression patterns or sample expression patterns. Each of these can be considered to be a vector. The first thing we have to do before applying cluster analysis is to find a distance between the various expression pattern vectors. This is done using similarity/dissimilarity measures such as Euclidean distance, Mahalanobis distance, or linear correlation coefficients. Once a distance matrix is computed, the following clustering algorithms can be used. The clusters formed can differ significantly depending upon the distance measure used.
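The sketch below computes two of these measures and shows how strongly they can disagree. Turning the linear correlation coefficient into a distance as 1 - r is a common convention assumed here, not something stated on the slide. Mahalanobis distance is omitted because it needs a covariance estimate.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two expression pattern vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson(x, y):
    """Linear (Pearson) correlation coefficient of two vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def correlation_distance(x, y):
    """Assumed convention: distance = 1 - correlation."""
    return 1 - pearson(x, y)


def distance_matrix(patterns, dist=euclidean):
    """All pairwise distances between the given patterns."""
    return [[dist(p, q) for q in patterns] for p in patterns]


e1, e2 = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
print(euclidean(e1, e2))             # far apart in absolute level
print(correlation_distance(e1, e2))  # essentially identical in shape
```

Two genes whose patterns have the same shape but different magnitudes are far apart in Euclidean terms yet nearly identical under correlation, which is why the choice of measure changes the clusters.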
Cluster Analysis of Microarray Data
• Hierarchical Clustering
– Assume each data point is in a singleton cluster.
– Find the two clusters that are closest together. Combine these to form a new cluster.
– Compute the distance from all clusters to the new cluster using some form of averaging.
– Find the two closest clusters and repeat.
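These steps can be sketched as follows, assuming Euclidean distance and average linkage as the "form of averaging". The function names and the stopping rule (`n_clusters`) are assumptions, and this naive version recomputes distances on every pass, so it is only suitable for small examples.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def average_linkage(c1, c2, points):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def hierarchical_clustering(points, n_clusters=1):
    """Agglomerative clustering; clusters are lists of point indices."""
    clusters = [[i] for i in range(len(points))]  # start with singletons
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters, merges


points = [[0, 0], [0, 1], [10, 10], [10, 11]]
clusters, merges = hierarchical_clustering(points, n_clusters=2)
print(clusters)
```

The recorded merges, in order, are what a dendrogram (tree) drawing would be built from.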
Cluster Analysis of Microarray Data
• k-Means Clustering – An alternate method of clustering, called k-means clustering, partitions the data into k clusters and finds a cluster mean μ_i for each cluster. In our case, the means will also be vectors. Usually, the number of clusters k is fixed in advance, so something must be known about the data to choose k. There might be a range of possible k values. To decide which is best, one optimizes a quantity that measures cluster tightness, i.e. minimizes the distances between points within a cluster.
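A minimal k-means sketch follows. For reproducibility it initializes the means with the first k points, which is an assumption of this example; practical implementations usually use random or smarter initialization. The `tightness` function is one possible quantity to compare across candidate values of k.

```python
def sq_dist(x, y):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def kmeans(points, k, iterations=100):
    """Return (means, clusters); means start at the first k points."""
    means = [list(p) for p in points[:k]]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest mean
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, means[c]))
            clusters[nearest].append(p)
        # Update step: each mean becomes the centroid of its members
        new_means = [
            [sum(col) / len(members) for col in zip(*members)] if members else means[c]
            for c, members in enumerate(clusters)
        ]
        if new_means == means:
            break
        means = new_means
    return means, clusters


def tightness(means, clusters):
    """Within-cluster sum of squared distances (smaller is tighter)."""
    return sum(sq_dist(p, m) for m, members in zip(means, clusters) for p in members)


points = [[0, 0], [10, 10], [0, 1], [10, 11]]
means, clusters = kmeans(points, k=2)
print(means, tightness(means, clusters))
```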
Cluster Analysis of Microarray Data
• Self-organizing Maps – This is basically an application of neural networks to microarray data. Assume that there is a 2-dimensional grid of cells and a map from a given set of expression data vectors in R^n to the grid, i.e., there are n nodes in the input layer and a connection from each of these to each cell. Each cell (i, j) gets its own weight from each of the n input neurons. The weight vector m_ij is the mean of the cluster associated with cell (i, j). Each data vector d gets mapped to the cell (i, j) whose weight vector is closest to d using Euclidean distance. In order to train the network, the mean vectors m_ij for the cells (i, j) must be learned.
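The mapping and training steps can be sketched as below. This is a deliberately simplified illustration: the learning-rate schedule, the fixed neighborhood (the winning cell plus its directly adjacent cells at half strength), the random initialization, and all function names are assumptions of this example rather than a standard SOM recipe.

```python
import random


def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def best_cell(weights, d):
    """Return the grid cell (i, j) whose weight vector is closest to d."""
    return min(weights, key=lambda cell: sq_dist(weights[cell], d))


def train_som(data, rows, cols, epochs=50, rate=0.5, seed=0):
    """Learn one weight (mean) vector m_ij per grid cell."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {(i, j): [rng.random() for _ in range(dim)]
               for i in range(rows) for j in range(cols)}
    for epoch in range(epochs):
        lr = rate * (1 - epoch / epochs)  # learning rate decays to zero
        for d in data:
            bi, bj = best_cell(weights, d)
            for (i, j), w in weights.items():
                grid_dist = abs(i - bi) + abs(j - bj)
                if grid_dist == 0:
                    influence = lr        # winning cell
                elif grid_dist == 1:
                    influence = lr / 2    # adjacent cells follow more weakly
                else:
                    continue
                for k in range(dim):
                    w[k] += influence * (d[k] - w[k])
    return weights


data = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]]
som = train_som(data, rows=2, cols=2)
for d in data:
    print(d, "->", best_cell(som, d))
```

After training, `best_cell` assigns each expression vector to a grid cell, and vectors mapped to the same cell form a cluster.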
Sample Microarray
Correlations
Clustering of Genes
Personalized Medicine
• There is a new buzzword called personalized medicine.
• The idea is to develop medicines and treatment plans based on an individual's genetic make-up.
Proteomics
• Understanding protein function
• Functional genomics
• Multiple approaches – structure, expression levels, biochemistry, modeling etc.
• Combining technologies is necessary to understand in vivo protein function
Approach
• Use data to determine pathway.
• Use biochemistry to figure out kinetics and concentrations.
• Use new proteomic approaches to determine relative concentrations.
• Apply pathway model to determine functional consequence.
Pathway Data
• Using molecular biological techniques we can determine what proteins make up a biochemical pathway.
[Figure: pathway diagram with nodes A, B, C and D]
Pathways
• Biochemical Pathways form complex biochemical reaction networks.
• There might be multiple ways to get from A to B.
• The path chosen depends on biochemical kinetics.
Biochemistry
• Classical biochemistry isolates proteins from tissue or cells.
• Modern molecular biology allows the production of purified protein.
• The concentration of the protein is determined
• The kinetic properties of the proteins are determined by biochemical assay – rates of reactions, modulating factors, etc.
Pathway Modeling Methods
• Boolean Models
• Metabolic Control Theory – Flux Balance Analysis
• Biochemical Systems Analysis
• Kinetic Modeling Approach
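As one illustration of the kinetic modeling approach, the sketch below simulates a hypothetical two-step pathway A -> B -> C with first-order (mass-action) kinetics, integrated with a simple forward Euler scheme. The pathway, the rate constants, and the step size are invented for the example. A real model would use measured kinetics and a proper ODE solver.

```python
def simulate_chain(a0, k1, k2, dt=0.001, t_end=10.0):
    """Integrate A -> B -> C with rate constants k1 and k2 (forward Euler)."""
    a, b, c = a0, 0.0, 0.0
    for _ in range(round(t_end / dt)):
        r1 = k1 * a  # rate of A -> B
        r2 = k2 * b  # rate of B -> C
        a += -r1 * dt
        b += (r1 - r2) * dt
        c += r2 * dt
    return a, b, c


a, b, c = simulate_chain(a0=1.0, k1=1.0, k2=0.5)
print(a, b, c)  # concentrations after t_end time units
```

Total mass a + b + c stays at its initial value throughout, which is a useful sanity check on any such model.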
Disorders of Thrombophilia
• The functional consequences of nonsynonymous SNPs can be predicted by comparison of protein structures.
• There are various SNPs known
– Activated protein C resistance by Arg 506 to Gln
– Prothrombin polymorphism (G20210A) causing elevated prothrombin levels
– Protein C deficiency
– Protein S deficiency
– Antithrombin deficiency
– Elevated factor VIII levels
Fibrinogen Abnormalities
• Various polymorphisms found in the long arm of chromosome 4
• Two dimorphisms of the β-chain gene are of major importance and in linkage disequilibrium with each other.
• These affect plasma fibrinogen levels
Prothrombin G20210A Polymorphism
• Replacement of a G by A at nucleotide 20210 in the untranslated section of the prothrombin gene increases translation without altering transcription of the gene.
• This results in elevated synthesis and secretion of prothrombin by the liver.
• This results in increased thrombin levels
Activated protein C resistance
• Factor V Leiden R506Q mutation occurs in 8% of the population.
• It is a G→A substitution at nucleotide 1691 in the gene for factor V.
• Factor V is cleaved less efficiently by activated protein C
• Results in deep vein thrombosis, early kidney transplant loss, recurrent miscarriages and other disorders