
Biological information analysis by heterogeneous data comparison

Date post: 09-Jan-2016
Page 1: Biological information analysis by heterogeneous data comparison

Biological information analysis by heterogeneous data comparison

Advisor: Prof. Chuan Yi Tang

Speaker: Jeh Ting Hsu

Page 2: Biological information analysis by heterogeneous data comparison

Outline

Extraction of correlated gene clusters by multiple graph comparison.

Bayesian Network.

Generalized kernel canonical correlation analysis.

Filling gaps in a metabolic network using expression information.

Page 3: Biological information analysis by heterogeneous data comparison

Extraction of correlated gene clusters by multiple graph comparison

Page 4: Biological information analysis by heterogeneous data comparison

Motivation

The complete genome sequence contains information about the ordering of genes along the chromosome. Besides such geometrical relationships, other features characterize relationships among genes, including similarity relationships based on sequences or 3D structures of gene products, and functional relationships in metabolic/regulatory pathways.

When multiple gene-gene relationships can be found on different attributes, as above, it is interesting to see whether a set of genes preserves its mutual relationships with respect to each attribute.

Page 5: Biological information analysis by heterogeneous data comparison

Example

The enzymes in the glycolytic pathway commonly display α/β folds. This observation is obtained by examining the relationships of enzymes with respect to their structural similarities and their neighboring relationships in the pathway. This type of observation has been made for specific sets of genes. Here we examine all the sets of genes in a given organism that are correlated with respect to more than one attribute.

A series of enzymes in the glycolytic pathway display the α/β fold.

Page 6: Biological information analysis by heterogeneous data comparison

Gene-gene relationships on a specific attribute can be denoted in a general manner by a set of binary relationships. For example, let the binary operator '∼' denote a binary relationship between two genes, and let g1, g2, g3, and g4 be a series of genes arranged in this order in a genome sequence. Their geometrical relationships are then broken down into the set of binary relationships {g1 ∼ g2, g2 ∼ g3, g3 ∼ g4}. A set of binary relationships among genes forms a graph structure as a whole.

Each graph node corresponds to a gene or a gene product. Two nodes are connected by an edge (drawn as a solid line) when they are related by a binary relationship.
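The decomposition above can be sketched in a few lines of Python. This is a minimal illustration, not code from the presentation: the function names `genome_order_relations` and `relations_to_graph` are hypothetical, and the graph is assumed to be undirected.

```python
from collections import defaultdict


def genome_order_relations(genes):
    """Break an ordered gene list into binary relationships between neighbors."""
    return list(zip(genes, genes[1:]))


def relations_to_graph(relations):
    """Build an undirected adjacency map: one node per gene, one edge per relationship."""
    adjacency = defaultdict(set)
    for a, b in relations:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return dict(adjacency)


# The four-gene example from the slide.
genome_relations = genome_order_relations(["g1", "g2", "g3", "g4"])
genome_graph = relations_to_graph(genome_relations)

print(genome_relations)
print(sorted(genome_graph["g2"]))
```

The same `relations_to_graph` helper could be fed relationships from any other attribute (pathway neighbors, structural similarity), which gives one graph per attribute over a shared set of gene nodes.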

Page 7: Biological information analysis by heterogeneous data comparison
Page 8: Biological information analysis by heterogeneous data comparison

Protein interaction networks are composed of groups of interacting proteins.

Page 9: Biological information analysis by heterogeneous data comparison

Heuristic algorithm for network comparison

Consider two graphs: G1 = (V1, E1) and G2 = (V2, E2).

d1(i,j): the length of the shortest path between nodes v1i and v1j in graph G1

d2(i,j): the length of the shortest path between nodes v2i and v2j in graph G2
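The shortest-path lengths d1 and d2 can be computed with a breadth-first search when edges are unweighted. The sketch below assumes unweighted, undirected graphs stored as adjacency maps; the two toy graphs and the name `shortest_path_lengths` are illustrative and do not come from the slides.

```python
from collections import deque


def shortest_path_lengths(adjacency, source):
    """Return {node: number of edges on the shortest path from source} via BFS."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return distances  # unreachable nodes are simply absent


# Hypothetical toy graphs over the same four nodes.
g1_adjacency = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
g2_adjacency = {"a": {"b", "d"}, "b": {"a"}, "c": set(), "d": {"a"}}

d1 = {v: shortest_path_lengths(g1_adjacency, v) for v in g1_adjacency}
d2 = {v: shortest_path_lengths(g2_adjacency, v) for v in g2_adjacency}

print(d1["a"]["d"])  # 3: a-b-c-d
print(d2["b"]["d"])  # 2: b-a-d
```

Comparing d1(i,j) and d2(i,j) for the same node pair then shows whether two genes that are close in one graph are also close in the other.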

Page 10: Biological information analysis by heterogeneous data comparison
Page 11: Biological information analysis by heterogeneous data comparison

An example of an E. coli FREC (functionally related enzyme cluster)

Page 12: Biological information analysis by heterogeneous data comparison
Page 13: Biological information analysis by heterogeneous data comparison

Graph comparison or local pathway alignment

Page 14: Biological information analysis by heterogeneous data comparison

The structure similarity dataset defines 3D structural similarities among proteins. Two proteins in the same category are connected by an edge.

E. coli correlated gene clusters

Page 15: Biological information analysis by heterogeneous data comparison

$$D(C_1, C_2) = \sum_{s=1}^{n} \mathrm{dis}(C_1^s, C_2^s)$$

$$D(C_1, C_2) = \mathrm{dis}(C_1^1, C_2^1) + \mathrm{dis}(C_1^2, C_2^2) + \mathrm{dis}(C_1^3, C_2^3) = 10 + 8 + 8 = 26$$

Page 16: Biological information analysis by heterogeneous data comparison

Using the distance D, we cluster the hyperedges. Let C be the initial set of clusters, each of which consists of a single hyperedge, i.e., C = {{h1}, …, {hm}}. Starting with C, we repeatedly pick the two clusters with the smallest distance between them and merge them into a new cluster (i.e., hierarchical clustering using the distance D).
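The merging loop can be sketched as follows. This is a minimal illustration under stated assumptions: the pairwise distances between the four hyperedges are made up, and single linkage (the smallest pairwise distance) stands in for the cluster distance D, whose exact definition is given on the previous slide. The names `agglomerate`, `single_linkage`, and `pair_distance` are hypothetical.

```python
from itertools import combinations


def agglomerate(items, cluster_distance, stop_at=1):
    """Repeatedly merge the two closest clusters until stop_at clusters remain."""
    clusters = [frozenset([item]) for item in items]  # C = {{h1}, ..., {hm}}
    merges = []
    while len(clusters) > stop_at:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: cluster_distance(*pair))
        merges.append((a, b, cluster_distance(a, b)))
        clusters = [c for c in clusters if c is not a and c is not b]
        clusters.append(a | b)
    return clusters, merges


# Hypothetical distances between four hyperedges.
pair_distance = {
    frozenset(("h1", "h2")): 1, frozenset(("h1", "h3")): 5,
    frozenset(("h1", "h4")): 6, frozenset(("h2", "h3")): 4,
    frozenset(("h2", "h4")): 7, frozenset(("h3", "h4")): 2,
}


def single_linkage(c1, c2):
    """Assumed stand-in for D: smallest distance between members of two clusters."""
    return min(pair_distance[frozenset((x, y))] for x in c1 for y in c2)


clusters, merges = agglomerate(["h1", "h2", "h3", "h4"], single_linkage, stop_at=2)
print([sorted(c) for c in clusters])
print([d for _, _, d in merges])
```

Recording the merge distances makes it possible to cut the hierarchy at a chosen threshold instead of a fixed number of clusters.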

Page 17: Biological information analysis by heterogeneous data comparison

References

Tohsato, Y., Matsuda, H. and Hashimoto, A. A Multiple Alignment Algorithm for Metabolic Pathway Analysis using Enzyme Hierarchy. In International Conference on Intelligent Systems for Molecular Biology (ISMB). AAAI Press, 2000.

Ogata, H., Fujibuchi, W., Goto, S. and Kanehisa, M. A heuristic graph comparison algorithm and its application to detect functionally related enzyme cluster. Nucleic Acids Research, 2000, 28, 4021-4028.

Page 18: Biological information analysis by heterogeneous data comparison

Bayesian Network

Page 19: Biological information analysis by heterogeneous data comparison

Motivation

Whereas gene coexpression data are an excellent tool for hypothesis generation, microarray data alone often lack the degree of specificity needed for accurate gene function prediction.

This improvement in specificity can be achieved through incorporation of heterogeneous functional data in an integrated analysis.

The system is based on a Bayesian network that combines evidence from diverse data sources to predict whether two proteins are functionally related.

The network essentially performs a probabilistic "weighting" of data sources, thus avoiding double counting evidence and allowing for formal representation of expert knowledge about the methods.

Page 20: Biological information analysis by heterogeneous data comparison
Page 21: Biological information analysis by heterogeneous data comparison

General architecture of the MAGIC Bayesian Network (expression data)

[Figure: network diagram. Node labels: Functional Relationship, Expression, Coexpression, Data Type, Data Noise Level, K-means Clustering, Self Organizing Maps, Hierarchical Clustering.]

Page 22: Biological information analysis by heterogeneous data comparison

General architecture of the MAGIC Bayesian Network (nonexpression-based data)

[Figure: network diagram. Node labels: Functional Relationship, Colocalization, Physical Association, Transcription Factor Binding, Genetic Association, Affinity Precipitation, Two Hybrid, Direct Binding, Purified Complex, Reconstructed Complex, Biochemical Assay, Unlinked Noncomplementation, Synthetic Rescue, Dosage Lethality, Synthetic Lethality.]

Page 23: Biological information analysis by heterogeneous data comparison

General architecture of the MAGIC Bayesian Network.

Page 24: Biological information analysis by heterogeneous data comparison

Physical Association (child nodes: Affinity Precipitation, Two Hybrid, Direct Binding)

Conditional Probability Table

PhysAss         PhysAssYes   PhysAssNo
AffPrecipYes    0.75         0.05
unknown         0.25         0.95
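To show how such a table is used, the sketch below applies Bayes' rule to the two rows of the conditional probability table (0.75 / 0.05 for a positive affinity precipitation result, 0.25 / 0.95 for "unknown"). The prior of 0.5 is a made-up value for illustration only, and the single-node update is a simplification: the full network combines many evidence nodes.

```python
def posterior_yes(prior_yes, likelihood_given_yes, likelihood_given_no):
    """P(PhysAss = yes | evidence) by Bayes' rule for a single evidence node."""
    evidence = (likelihood_given_yes * prior_yes
                + likelihood_given_no * (1.0 - prior_yes))
    return likelihood_given_yes * prior_yes / evidence


# P(affinity precipitation state | physical association state), from the slide.
cpt = {
    "AffPrecipYes": {"PhysAssYes": 0.75, "PhysAssNo": 0.05},
    "unknown": {"PhysAssYes": 0.25, "PhysAssNo": 0.95},
}

prior = 0.5  # hypothetical prior probability of a physical association

post_positive = posterior_yes(prior, cpt["AffPrecipYes"]["PhysAssYes"],
                              cpt["AffPrecipYes"]["PhysAssNo"])
post_unknown = posterior_yes(prior, cpt["unknown"]["PhysAssYes"],
                             cpt["unknown"]["PhysAssNo"])

print(round(post_positive, 4))  # 0.9375
print(round(post_unknown, 4))   # 0.2083
```

With this prior, a positive result raises the belief in a physical association sharply, while an "unknown" result lowers it only moderately, which reflects the asymmetry encoded in the table.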

Page 25: Biological information analysis by heterogeneous data comparison

Tradeoff between the number of true positives (TP) and false positives (FP) for each method.

Page 26: Biological information analysis by heterogeneous data comparison

References

Gasch, A. P., Spellman, P. T., Kao, C. M., Carmel-Harel, O., Eisen, M. B., Storz, G., Botstein, D. and Brown, P. O. (2000). Genomic expression programs in the response of yeast cells to environmental changes. Mol Biol Cell 11, 4241-57.

Troyanskaya, O. G., Dolinski, K., Owen, A. B., Altman, R. B. and Botstein, D. (2003). A Bayesian framework for combining heterogeneous data sources for gene function prediction. Proc Natl Acad Sci USA 100(14), 8348-53.

http://biodata.mshri.on.ca/grid/servlet/Index

http://cgsigma.cshl.org/jian/

Page 27: Biological information analysis by heterogeneous data comparison

Generalized kernel canonical correlation analysis

Page 28: Biological information analysis by heterogeneous data comparison

Motivation

It is crucial to investigate the correlation that exists between multiple biological attributes, and eventually to use this correlation to extract biologically meaningful features from heterogeneous genomic data.

Indeed, a correlation detected between multiple datasets is likely to be due to some hidden biological phenomenon. Moreover, by selecting the genes responsible for the correlation, one can expect to select groups of genes that play a special role in, or are affected by, the underlying biological phenomenon.

Page 29: Biological information analysis by heterogeneous data comparison

Example

The existence of operons in prokaryotes is responsible for a form of correlation between several datasets, because genes that form operons are close to each other along chromosomes, have similar expression profiles, and can catalyze successive reactions in a pathway. Conversely, one can start from three datasets containing the localization of the genes on the genome, their expression profiles, and the chemical reactions they catalyze in known pathways, and look for correlations between these datasets in order to recover groups of genes that may form operons.

Page 30: Biological information analysis by heterogeneous data comparison
Page 31: Biological information analysis by heterogeneous data comparison

Methods

Canonical correlation analysis (CCA). Ordinary CCA cannot be applied to non-vectorial genomic data, such as pathways, protein-protein interactions or gene positions in a chromosome.

Kernel CCA (KCCA). Its goal is to detect correlations between two datasets.

Multiple kernel CCA (MKCCA).

Integrated kernel CCA (IKCCA). One can prefer to maximize the correlation between one type of attribute and a combination of other types.

Page 32: Biological information analysis by heterogeneous data comparison

Kernel function

$$K_g(i, j) = k_g(x_i^g, x_j^g) = \langle \phi(x_i^g), \phi(x_j^g) \rangle = \exp\left(-\frac{d_{ij}}{h}\right)$$

dij is the number of nucleotides between the end of the ith gene and the start of the jth gene along the chromosomes.
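A small sketch of a position kernel built from d_ij follows. It assumes the kernel has the form exp(-d_ij / h), that all genes lie on one chromosome, and that each gene is given as a hypothetical (start, end) coordinate pair; the bandwidth h = 1000 and the function names are illustrative choices, not values from the slides.

```python
import math


def gap_between(gene_a, gene_b):
    """Nucleotides strictly between the end of the upstream gene and the
    start of the downstream gene; 0 for the same or overlapping genes."""
    (_, end_first), (start_second, _) = sorted([gene_a, gene_b])
    return max(0, start_second - end_first - 1)


def position_kernel(genes, h):
    """Kernel matrix with entries exp(-d_ij / h) (assumed form)."""
    return [[math.exp(-gap_between(a, b) / h) for b in genes] for a in genes]


# Hypothetical (start, end) coordinates on a single chromosome.
genes = [(1, 100), (151, 300), (1301, 1500)]
K = position_kernel(genes, h=1000.0)

for row in K:
    print([round(value, 3) for value in row])
```

Genes separated by a short gap get a kernel value close to 1 and distant genes a value close to 0, so the matrix expresses chromosomal proximity as a similarity that can be fed to kernel CCA alongside kernels for the other datasets.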

Page 33: Biological information analysis by heterogeneous data comparison
Page 34: Biological information analysis by heterogeneous data comparison
Page 35: Biological information analysis by heterogeneous data comparison
Page 36: Biological information analysis by heterogeneous data comparison

References

Yamanishi, Y., Vert, J.-P. and Kanehisa, M., Heterogeneous data comparison and gene selection with kernel canonical correlation analysis. In Schoelkopf, B., Tsuda, K. and Vert, J.-P., editors, Kernel Methods in Computational Biology, pp. 209-230, MIT Press, 2004.

Yamanishi, Y., Vert, J.-P., Nakaya, A. and Kanehisa, M., Extraction of Correlated Gene Clusters from Multiple Genomic Data by Generalized Kernel Canonical Correlation Analysis. Bioinformatics (in ISMB2003), 19, i323-i330.

Page 37: Biological information analysis by heterogeneous data comparison

Filling gaps in a metabolic network using expression information

Page 38: Biological information analysis by heterogeneous data comparison

Motivation

With a growing number of completely sequenced genomes, increasing attention has been devoted to understanding the functional coordination of individual genes in complex biological processes.

Computational reconstruction of metabolic networks typically uses genomic information to associate genes with enzymatic functions, thereby identifying the metabolic pathways encoded by the organism.

In many cases, while there exists sufficient biological evidence to believe that a given pathway is present in an organism, one or more enzymes responsible for the critical reaction steps cannot be identified via sequence homology methods alone. Similarity of gene expression profiles has been used extensively to assign genes to general functional categories.

However, prediction of specific gene function from expression information alone has, so far, not been possible.

Page 39: Biological information analysis by heterogeneous data comparison

Specific functional predictions can be made by considering expression similarity together with the structural information of the metabolic network.

Page 40: Biological information analysis by heterogeneous data comparison

Hughes, T.R., Marton, M.J., Jones, A.R., Roberts, C.J., Stoughton, R., Armour, C.D., Bennett, H.A., Coffey, E., Dai, H., He, Y.D. et al. Functional discovery via a compendium of expression profiles. Cell, 2000, 102, 109-126.

Page 41: Biological information analysis by heterogeneous data comparison

The metabolic gene dependency graph is then used to calculate network distance between the genes. We define a pair of directly dependent metabolic genes X and Y to be separated by a distance 1; similarly, the distance between X and Z, given dependencies between X->Y and Y->Z, is 2 and so on.

The expression distance measure between ORFs X and Y is calculated as 1- |corr(px, py)|, where corr(px,py) is the Spearman’s rank correlation between expression profile vectors of X and Y.
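The expression distance 1 - |corr(px, py)| can be sketched in plain Python. This is an illustrative implementation, not the authors' code: Spearman's correlation is computed as the Pearson correlation of average ranks, the function names are hypothetical, and a constant profile (zero variance) is not handled.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for k in order[pos:end + 1]:
            ranks[k] = shared
        pos = end + 1
    return ranks


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def expression_distance(px, py):
    """1 - |Spearman rank correlation| between two expression profiles."""
    return 1.0 - abs(pearson(average_ranks(px), average_ranks(py)))


print(expression_distance([0.1, 0.5, 0.9, 2.0], [1, 2, 3, 10]))  # co-regulated
print(expression_distance([1, 2, 3, 4], [1, 3, 2, 4]))
```

Because of the absolute value, strongly anti-correlated profiles are treated as close as strongly correlated ones, so the measure captures coordinated regulation in either direction.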

Page 42: Biological information analysis by heterogeneous data comparison

Local coexpression in the metabolic network. Mean expression distance is shown as a function of the network distance between metabolic genes.

Page 43: Biological information analysis by heterogeneous data comparison

Given a node L in the metabolic dependency graph and a set of candidate ORFs, we can evaluate the similarity of the expression profile of each candidate ORF with each member of the metabolic neighborhood of location L.

Cost function (Type 1):

$$F(x, N) = \frac{1}{|N|} \sum_{i=1}^{R} w_i \sum_{g \in N_i} d(x, g)^p$$

x is the candidate gene. R is the network neighborhood radius.

N is a neighborhood of radius R around the metabolic location L.

|N| is the total number of genes in the neighborhood.

Ni is the set of genes in the i-th layer of the network neighborhood, and d(x,g) is the expression distance between gene x and gene g.

w is a vector of the neighborhood layer weights and p is the positive power factor.
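The scoring of candidates can be sketched as below, assuming the Type 1 cost is the layer-weighted sum of d(x,g)^p over the neighborhood, divided by |N|. All gene names, distances, weights, and the value of p are made up for illustration, and `neighborhood_cost` is a hypothetical name.

```python
def neighborhood_cost(candidate, layers, weights, distance, p=1.0):
    """Assumed Type 1 cost: (1/|N|) * sum_i w_i * sum_{g in N_i} d(x, g)^p."""
    total_genes = sum(len(layer) for layer in layers)  # |N|
    weighted_sum = sum(
        w * sum(distance(candidate, g) ** p for g in layer)
        for w, layer in zip(weights, layers)
    )
    return weighted_sum / total_genes


# Hypothetical neighborhood of radius R = 2 around a network gap.
layers = [["geneA", "geneB"], ["geneC"]]  # N_1, N_2
weights = [1.0, 0.5]                      # w_1, w_2

# Hypothetical expression distances between candidate ORFs and neighbors.
toy_distance = {
    ("orfX", "geneA"): 0.2, ("orfX", "geneB"): 0.4, ("orfX", "geneC"): 0.6,
    ("orfY", "geneA"): 0.9, ("orfY", "geneB"): 0.8, ("orfY", "geneC"): 0.1,
}


def lookup(x, g):
    return toy_distance[(x, g)]


costs = {orf: neighborhood_cost(orf, layers, weights, lookup, p=2.0)
         for orf in ("orfX", "orfY")}
best = min(costs, key=costs.get)
print(best)
```

The candidate with the lowest cost is the one whose expression profile is most similar to the genes surrounding the gap, with nearer layers counting more when their weights are larger.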

Page 44: Biological information analysis by heterogeneous data comparison

Results

Page 45: Biological information analysis by heterogeneous data comparison

References

Forster, J., Famili, I., Fu, P., Palsson, B.O. and Nielsen, J. Genome-scale reconstruction of the Saccharomyces cerevisiae metabolic network. Genome Res., 2003, 13, 244-253.

Hughes, T.R., Marton, M.J., Jones, A.R., Roberts, C.J., Stoughton, R., Armour, C.D., Bennett, H.A., Coffey, E., Dai, H., He, Y.D. et al. Functional discovery via a compendium of expression profiles. Cell, 2000, 102, 109-126.

Kharchenko, P., Vitkup, D. and Church, G.M. Filling gaps in a metabolic network using expression information. Bioinformatics, 2004, 20 Suppl. 1, i178-i185.
