Part I Iterative Clustering of Gene Expression Data for Analyzing Temporal Patterns


By

Bharath Sankararaman

Tarun V Rajavelu

Under Dr. Aidong Zhang

Agenda

Microarray – what is it?

Gene Expression – what is it?

Iterative Clustering Algorithm

Motivation

Clustering Challenges

Clustering Approach

The algorithm

Measures

Comparison

Conclusion

Microarray

A DNA microarray is a collection of microscopic DNA spots attached to a solid surface, such as a glass, plastic or silicon chip, forming an array used for expression profiling (monitoring expression levels for thousands of genes simultaneously) or for comparative genomic hybridization

Microarray

The affixed DNA segments are known as probes, thousands of which can be placed in known locations on a single DNA microarray

DNA microarrays can be used to detect RNAs that may or may not be translated into active proteins. This kind of analysis is referred to as "expression analysis" or expression profiling

Gene Expression

Gene expression, or simply expression, is the process by which a gene's DNA sequence is converted into the functional protein structures of the cell. Non-protein coding genes (e.g. rRNA genes, tRNA genes) are not translated into protein.

The expression of particular genes may be assessed with DNA microarray technology, which can provide a rough measure of the cellular concentration of different messenger RNAs

Recap

Iterative Clustering Algorithm for Analyzing Temporal Patterns of Gene Expression

Seo Young Kim, Jae Won Lee, Jong Sung Bae

International Journal of Computational Intelligence, Volume 2, Number 1, 2005, ISSN 1304-4508

Motivation

Microarray experiments provide a wealth of information; however, extensive data mining is required to identify the patterns that characterize the underlying mechanisms of action.

For biologists, a key aim when analyzing microarray data is to group genes based on the temporal patterns of their expression levels

Provides insights into genetic capacities and their interactions

Genes with similar functions often evince similar temporal patterns of co-regulation

Due to the large number of genes involved in these experiments and the complexity of biological processes in general, an effective clustering algorithm for grouping genes is crucial to such studies

Clustering challenges

How to determine the number of true clusters ?

How to evaluate samples assigned to those clusters ?

Clustering analysis results rely heavily on limited biological and medical information (e.g., tumor classification); they are not only sensitive to noise but can also be prone to over-fitting

Clustering Approaches

Studies on the analysis of gene expression data have extensively explored the use of unsupervised clustering analysis to find temporal patterns

Resampling and cross-validation methods have been shown to be effective for evaluating the stability of clusters

The consensus clustering method for class discovery is based on a resampling method. Kim et al. devised an extension of consensus clustering that exploits a mixed clustering algorithm based on a mixed distance measure

Here, we introduce a new clustering method based on an iterative algorithm that measures the relative stabilities of clusters from cross-validation criteria

Cross - validation

When a clustering program is created in a supervised situation, it is necessary to be sure that it can perform in an unsupervised situation

In cross-validation, a portion of the data is set aside as training data, leaving the remainder as testing data. The quality of performance of the program on the testing data reflects how well it would perform in an unsupervised setting

Approach

One important property of temporal gene expression data is that the data for a given gene at different times may be correlated

Gene expression levels may vary markedly over time

To reflect such time dependencies in observed data, compare the stability and consistency of the results produced by deleting one set of temporal observations at a time

In addition, compare the average expression patterns in each group with the model profiles obtained using our iterative algorithm and existing clustering algorithms

Consensus clustering

The consensus clustering method is a type of resampling-based method

The original data set is perturbed by subsampling iteratively, and then existing clustering methods are iteratively applied to the perturbed data sets to construct a distance measure

Data are represented as a matrix $X = [x_{g t_i}]$, where $x_{g t_i}$ denotes the expression of the $g$th gene at time $T_i$, $1 \le i \le t$

Consensus clustering algorithm

1. A data resampling scheme and an initial clustering algorithm must be chosen
2. A similarity matrix S is used to assign genes to the proper clusters obtained by applying the algorithm to the various perturbed data sets

In the definition of S, N is the number of genes and Mh is the matrix corresponding to the results obtained by applying the initially selected clustering algorithm to the hth perturbed data set.

Mh (i, j) equals 1 if observations i and j belong to the same cluster, and 0 otherwise

When all the entries of S are close to 1 or 0, we can infer that the results have been well clustered. Here, the matrix I − S represents a distance matrix
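A minimal sketch of the similarity matrix described above. The formula for S did not survive on the slide, so this assumes S is the entrywise average of the matrices Mh over all perturbed data sets, and that every gene receives a label in every run. The function names and the three label vectors are hypothetical stand-ins for the output of the chosen clustering algorithm.

```python
def connectivity_matrix(labels):
    """M_h(i, j) = 1 if observations i and j share a cluster, else 0."""
    n = len(labels)
    return [[1.0 if labels[i] == labels[j] else 0.0 for j in range(n)]
            for i in range(n)]


def consensus_similarity(label_sets):
    """Average the connectivity matrices of all perturbed-data clusterings
    (assumed form of S; the slide's own formula is missing)."""
    mats = [connectivity_matrix(labels) for labels in label_sets]
    n, h = len(label_sets[0]), len(mats)
    return [[sum(m[i][j] for m in mats) / h for j in range(n)]
            for i in range(n)]


# Three hypothetical clusterings of N = 4 genes.
runs = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 1, 1]]
S = consensus_similarity(runs)
# Distance matrix, reading "I - S" as the all-ones matrix minus S (assumption).
D = [[1.0 - s for s in row] for row in S]
```

Entries of S near 1 or 0 correspond to gene pairs that are almost always, or almost never, clustered together across the runs.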

Consensus clustering (contd)

When the consensus clustering method is applied to gene expression data to identify temporal patterns, the results obtained depend on the choice of the initial clustering method

To improve consensus clustering, take a mixed similarity matrix to be the average of two similarity matrices, i.e., Sm = average(S1, S2)

Iterative clustering algorithm

Assessing clusters

Two measures were used to assess and compare the performance of various clustering methods

The average proportion of non-overlap measure computes the average proportion of genes that are not placed in the same cluster by the clustering method under consideration on the basis of the entire data set and the data sets obtained by deleting the expression levels at one time point at a time

The average of the adjusted Rand index computes the average degree of agreement between two partitions
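As an illustration of the second measure, the sketch below computes the standard adjusted Rand index between two partitions from their contingency counts. The slides do not say which pairs of partitions are averaged, so only the pairwise index is shown; the function name is hypothetical and at least two observations are assumed.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same observations."""
    n = len(labels_a)
    # Pairs of observations that are together in both partitions.
    sum_ij = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs that are together within each partition separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        # Degenerate case, e.g. both partitions are a single cluster.
        return 1.0
    return (sum_ij - expected) / (max_index - expected)
```

The index is 1 when the two partitions agree up to relabeling and is close to 0 for agreement at chance level.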

Average proportion of non-overlap measure

where Cg,i denotes the cluster containing gene g in the clustering based on the data set from which the observations at time Ti have been deleted, and Cg,0 denotes the original cluster containing gene g in the clustering based on the entire data set.

A good algorithm is expected to yield a small value of VM1 (K)
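The formula for VM1(K) is not reproduced on the slide. The sketch below assumes the usual form of this measure: for every gene g and every deleted time point Ti, take 1 minus the fraction of Cg,0 that is also in Cg,i, then average over all genes and time points. The function name and the input layout (one label list per clustering) are hypothetical.

```python
def average_proportion_non_overlap(full_labels, deleted_labels):
    """full_labels: cluster label of each gene using the entire data set.
    deleted_labels: one label list per deleted time point."""

    def cluster_of_each_gene(labels):
        groups = {}
        for gene, label in enumerate(labels):
            groups.setdefault(label, set()).add(gene)
        return [groups[label] for label in labels]

    c0 = cluster_of_each_gene(full_labels)
    total = 0.0
    for labels in deleted_labels:
        ci = cluster_of_each_gene(labels)
        for g in range(len(full_labels)):
            # Proportion of gene g's original cluster lost after the deletion.
            total += 1.0 - len(ci[g] & c0[g]) / len(c0[g])
    return total / (len(full_labels) * len(deleted_labels))
```

A value of 0 means every gene keeps exactly the same cluster mates no matter which time point is removed.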

Implementation

The agglomerative clustering method UPGMA (Unweighted Pair Group Method with Arithmetic Mean) and the divisive clustering method Diana were used as initial clustering algorithms

Additionally, iterative clustering with UPGMA (ITU), iterative clustering with Diana (ITD), and iterative mixed clustering with UPGMA and Diana (ITM) were applied

The clustering performance was tested on a real data set and a simulated data set

A publicly available gene expression data set on yeast sporulation was used. The data set consists of expression levels of 6118 genes in the yeast genome measured at seven time points during the sporulation process (i.e., 0, 0.5, 2, 5, 7, 9 and 11.5 hours)

Results of gene expression data

For the five clustering algorithms under consideration, the two cluster-assessment measures, VM1(K) and VM2(K), were computed over a range of cluster numbers around 7, specifically K = 4–12.

The average proportion of non-overlap measure gave similar results for the five algorithms, although UPGMA and ITU appeared to be the best as judged by this measure.

The results for the average of the adjusted Rand index, on the other hand, indicated that ITU and ITM gave the best performance

Model profile comparison

To classify yeast genes based on their expression levels, Chu et al. hand-picked seven small subsets of representative genes using their knowledge of the yeast genome

The same subsets are used to construct the model temporal profiles by averaging the log-expression ratio of all genes in each subset

Inspection of the plots for the various algorithms suggests that the ITM plots are closest to the model profiles

Conclusion

The iterative algorithms were found to be more accurate and consistent than existing methods. Furthermore, the mixed iterative algorithm gave superior results to the other iterative algorithms tested. The present findings suggest that the mixed iterative algorithm overcomes the demerits of the agglomerative and divisive hierarchical clustering algorithms

Part II Clustering Gene Expression Data Using Graph Theoretic Approach - An Application of Minimum Spanning Trees

MST based Clustering Algorithms

Limitations of well-known clustering algorithms:

• None of the algorithms (K-means, SOM) is guaranteed to produce globally optimal solutions for non-trivial objective functions.

• K-means and SOM depend on the ‘regularity’ of the cluster boundaries.

Intuition

The general 1D problem of grouping n data points into k clusters can be solved by finding k-1 connecting points and “cutting” them.

Here n = 9, k = 3
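A minimal sketch of the 1D case, assuming the places to cut are the k-1 widest gaps between consecutive sorted points (the slide does not state the criterion explicitly). The function name and the nine sample values are made up for illustration.

```python
def cluster_1d(points, k):
    """Split 1D points into k clusters by cutting the k-1 widest gaps."""
    pts = sorted(points)
    # Index i identifies the gap between pts[i] and pts[i + 1].
    gaps = sorted(range(len(pts) - 1),
                  key=lambda i: pts[i + 1] - pts[i],
                  reverse=True)
    cuts = sorted(gaps[:k - 1])
    clusters, start = [], 0
    for c in cuts:
        clusters.append(pts[start:c + 1])
        start = c + 1
    clusters.append(pts[start:])
    return clusters


# n = 9 points, k = 3 clusters (values are hypothetical).
groups = cluster_1d([1, 2, 3, 10, 11, 12, 20, 21, 22], 3)
```

With these values the two widest gaps (3 to 10 and 12 to 20) are cut, leaving three groups of three points each.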

MST Representation

• The entire gene expression data set is represented as a graph.

• The MST is deduced using Kruskal’s or Prim’s algorithm.

• Clusters now become subtrees of this MST.

• A multidimensional clustering problem is reduced to a tree partitioning problem.

MST representation of a set of data points (figure: 2D data points and their MST)

Spanning Tree Representation of a Data Set

Let D = {di} be a set of expression data with each di = (e1i, ..., eti) representing the expression levels at time 1 through time t of gene i.

We define a weighted (undirected) graph G(D) = (V,E) as follows. The vertex set V = {di | di ∈ D} and the edge set E = {(di, dj) | di, dj ∈ D and i ≠ j}. Hence G(D) is a complete graph. Each edge (u, v) ∈ E has a weight that represents the distance (or dissimilarity), ρ(u, v), between u and v, which could be defined as the Euclidean distance, the correlation coefficient, or some other distance measure.
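As a sketch of this representation, the code below treats each expression profile as a vertex of the complete graph G(D) and extracts a minimum spanning tree with Prim's algorithm, using the Euclidean distance as ρ. The function names and the four 2D sample profiles are hypothetical; any other dissimilarity could be passed in as `dist`.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def prim_mst(data, dist=euclidean):
    """Return the MST edges (i, j, weight) of the complete graph on data."""
    n = len(data)
    if n == 0:
        return []
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known edge weight into the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the vertex outside the tree that is closest to it.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, best[u]))
        for v in range(n):
            if not in_tree[v]:
                w = dist(data[u], data[v])
                if w < best[v]:
                    best[v], parent[v] = w, u
    return edges


# Four hypothetical profiles measured at t = 2 time points.
profiles = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
mst = prim_mst(profiles)
```

This O(n²) version never stores the complete graph explicitly, which matters when n is in the thousands.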

Properties of representation.

Let D be a data set and ρ represent the distance between two data points of D.

Algorithm 1: Clustering through Removing Long MST-Edges

Find the K-1 longest edges of the MST and cut them to get K clusters. Here the objective is to minimize the total edge distance of all K subtrees.

This approach does not work when intra-cluster edges are longer than inter-cluster edges.
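A minimal sketch of Algorithm 1, assuming the MST is already available as a list of (i, j, weight) edges. The function name and the six-vertex example tree are hypothetical; a small union-find recovers the clusters from the edges that remain.

```python
def cut_longest_edges(n, mst_edges, k):
    """Remove the k-1 longest MST edges; return a cluster label per vertex."""
    kept = sorted(mst_edges, key=lambda e: e[2])[:len(mst_edges) - (k - 1)]
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j, _ in kept:
        parent[find(i)] = find(j)
    # Number the components in order of first appearance.
    roots = {}
    return [roots.setdefault(find(v), len(roots)) for v in range(n)]


# Hypothetical MST on 6 vertices with two clearly long edges.
tree = [(0, 1, 1.0), (1, 2, 1.5), (2, 3, 6.0), (3, 4, 1.2), (4, 5, 5.0)]
labels = cut_longest_edges(6, tree, 3)
```

Here the edges of length 6.0 and 5.0 are removed, which separates vertices 0 to 2, vertices 3 and 4, and vertex 5.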

To determine automatically how many clusters there should be, the algorithm examines the optimal K-clustering for all K = 1, 2, ..., up to some large number to see how much improvement we can get as K goes up. Typically, after K reaches the “correct” number of clusters, the quality improvement levels off.

Algorithm 2: An Iterative Clustering Algorithm

Attempts to partition the MST T into K subtrees to optimize a general objective function.

Start with an arbitrary K-partitioning of the tree (selecting K − 1 edges and removing them gives a K-partitioning). Then repeat the following operation until the process converges: for each pair of adjacent clusters (connected by a tree edge), go through all tree edges within the merged cluster of the two to find the edge to cut that globally optimizes the 2-partitioning of the merged cluster, as measured by the objective function

Algorithm 3: A Globally Optimal Clustering

Rather than finding the center point, find “representatives”, i.e., in addition to partitioning into K subtrees, find K data points such that the following objective function is minimized.

Rationale: the center may not belong to, or even be close to, the data points of its cluster when the shape of the cluster boundary is not convex, which may result in biologically less meaningful clustering results. The representative-based scheme provides an alternative when center-based clustering does not generate the desired results.

Algorithm (contd.)

• First convert the minimum spanning tree into a rooted tree by arbitrarily selecting a tree vertex as the root.

• Define the parent-child relationship among all tree vertices. At each tree vertex v:

S(v, k, d) is defined to be the minimum value of the objective function on the subtree rooted at vertex v, under the constraint that the subtree is partitioned into k subtrees and the representative of the subtree rooted at v is d. By definition, the following gives the global minimum of the objective function

• Uses a dynamic programming (DP) approach to calculate the S() values at each tree vertex v, based on the S() values of v’s children in the rooted MST.

• For each v with children, S() of v is calculated as a DP recurrence as follows:

Running time is O(n(2K)s), where

K = number of desired clusters

s = maximum number of children of any tree vertex

This global optimization algorithm runs efficiently for a typical clustering problem with a few hundred data points consisting of a dozen or so clusters.

It finds the optimal k-clustering for all k’s simultaneously, k ≤ K, for some pre-selected K. For a particular application, if we set K to, say, 30 or to certain percentage of the total number of vertices, we will get the optimal objective values for any k = 1, 2, ....,K.


Memetic Algorithms

The combination of an Evolutionary Algorithm with a local search heuristic is called a Memetic Algorithm (MA).

Inspired by the notion of a meme – an “adapting” gene.

The difference between memes and genes is that memes are processed and possibly improved by the people that hold them, something that cannot happen to genes. It is this advantage that the memetic algorithm has over simple genetic or evolutionary algorithms.

MAs are known to exploit the correlation structure of the fitness landscape of combinatorial optimization problems.

Pseudocode of a typical MA:

Begin
    Initialize population
    Local search
    Evaluate fitness
    While (stopping criteria not met) do
        Select individuals for variation
        Crossover
        Mutation
        Local search
        Evaluate fitness
        Select new population
end

MST-MA Clustering Algorithm

Individual representation and initialization

• Compute the MST using Prim’s algorithm.

• Represent an individual with a bit vector of length n-1, where n = number of genes. 0 -> edge deleted, 1 otherwise.

• Cluster memberships can be calculated from the MST partition.

• To initialize the population, randomly choose k-1 edges and delete them from the MST.
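The representation above can be sketched as follows, assuming the MST edges are stored in a fixed order so that bit e refers to edge e. The function names and the five-vertex path used as an MST are hypothetical; cluster memberships are recovered by traversing the edges whose bit is 1.

```python
import random


def random_individual(n_edges, k, rng=random):
    """Bit vector with exactly k-1 zeros (deleted edges) and 1 elsewhere."""
    bits = [1] * n_edges
    for e in rng.sample(range(n_edges), k - 1):
        bits[e] = 0
    return bits


def decode(n, mst_edges, bits):
    """Cluster label per vertex, following only edges that are still present."""
    adj = {v: [] for v in range(n)}
    for (i, j), bit in zip(mst_edges, bits):
        if bit:
            adj[i].append(j)
            adj[j].append(i)
    labels, next_label = [-1] * n, 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = next_label
        stack = [start]
        while stack:  # depth-first traversal of one subtree
            u = stack.pop()
            for v in adj[u]:
                if labels[v] == -1:
                    labels[v] = next_label
                    stack.append(v)
        next_label += 1
    return labels


# Hypothetical MST: a path on n = 5 genes, so n - 1 = 4 edges.
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
example = decode(5, edges, [1, 0, 1, 1])
```

Because the MST is a tree, deleting k-1 edges always yields exactly k subtrees, so every individual encodes a valid k-clustering.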

Fitness function

Two criteria:

• Avoid basing functions on distance to a centroid.

• Prefer compact clusters.

Two known objective functions are used:

Based on distance from the cluster centre: minimizing intra-cluster and maximizing inter-cluster distances.

• The authors came up with another function:

where in both equations d(·, ·) is the Euclidean distance, |Ci| the number of cluster members in cluster Ci, k the number of clusters and p a term to penalize results including clusters with fewer than a defined number Ci of members.

Local search

• Compute MNVs (mutual neighborhood values).

• Edges with higher MNVs might separate two clusters.

• For each individual, a list of deleted and non-deleted edges is created. During each step, a pair of a deleted and a non-deleted edge is chosen randomly. For the non-deleted edges, those with a higher MNV are favored, and for the deleted edges, those with a lower MNV are favored.

• Reverse edge states if the resulting clustering has a smaller objective value according to the objective function.

• This procedure is repeated until no enhancement could be made or the two lists are empty. Since for each flipped deleted edge a non-deleted edge is flipped as well, the number of clusters is preserved during local search.

Selection, Recombination and Mutation

Selection is done for variation and for survival. For variation (recombination and mutation), individuals are randomly selected without favoring better individuals. To determine the parents of the next generation, selection for survival is performed on a pool consisting of all parents of the current generation and the offspring. The new population is derived from the best individuals of that pool. To guarantee that the population contains each solution only once, duplicates are eliminated.

The recombination operator is a modified uniform crossover, similar to the uniform crossover for binary strings. To preserve the number of clusters, lists of the deleted edges are created for both parents. Each bit of the child’s bit vector is set to 1. Then, a pair of deleted edges (one from each parent) is randomly chosen and deleted from the lists. With a probability of 0.5, either the deleted edge of parent a or the one of parent b is copied to the child. This is repeated until both lists are empty. Thus, it is guaranteed that the number of clusters is preserved.

As the mutation operator, a simple modified point mutation is applied. Since each individual contains many more non-deleted than deleted edges, a normal point mutation (just flipping a randomly chosen bit) would lead to more and more clusters. To preserve the number of clusters, the two lists of deleted and non-deleted edges are again created. A pair of a deleted and a non-deleted edge is randomly chosen and both are flipped.
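The mutation operator can be sketched as below, assuming the bit-vector encoding in which 0 marks a deleted MST edge. The function name is hypothetical, and edges are picked uniformly at random, which is the simplest reading of "randomly chosen".

```python
import random


def mutate(bits, rng=random):
    """Flip one deleted edge (0) and one non-deleted edge (1) together,
    so the number of deleted edges, and hence clusters, is unchanged."""
    deleted = [i for i, b in enumerate(bits) if b == 0]
    present = [i for i, b in enumerate(bits) if b == 1]
    if not deleted or not present:
        return list(bits)  # nothing to swap
    child = list(bits)
    child[rng.choice(deleted)] = 1  # restore one deleted edge
    child[rng.choice(present)] = 0  # delete one previously present edge
    return child


parent = [1, 0, 1, 1, 0, 1]
child = mutate(parent, random.Random(1))
```

The child always differs from its parent in exactly two positions, so the search moves to a neighboring partition with the same number of clusters.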

Comparison

• Comparisons were done with the Best2Partition and AvgLink algorithms. The population size was 40 and runs were terminated at the 200th generation. The recombination and mutation rates were set to 40%.

• MST-MA outperforms both in terms of the best and average value of the objective function, and also in running time.

Bibliography

• Minimum Spanning Trees for Gene Expression Data Clustering. Ying Xu, Victor Olman, Dong Xu. Bioinformatics, 2002.

• Clustering Gene Expression Data with Memetic Algorithms based on Minimum Spanning Trees. Nora Speer, Peter Merz, Christian Spieth, Andreas Zell. IEEE, 2003.
