
Post on 07-Jun-2020


Cluster Subspace Identification Via Conditional Entropy Calculations

James Diggans, George Mason University
jdiggans@gmu.edu

Jeffrey L. Solka, George Mason University
jsolka@gmu.edu

Outline

• Subspace identification: why?
• Conditional entropy and clusters in R2.
• Ordering dimensions for easy subspace visualization and identification.
• Maximal cliques lead to automatic subspace identification.

Subspace identification

• Initial, high-level exploration of complex data can inform downstream analyses.
• Explore samples (observations) or genes (dimensions), depending on intent.
• Cluster structure in patients may only be revealed on a subset of genes, and vice versa (Getz et al.).
• Uninformed feature selection can discard informative features.

Conditional entropy and clusters in R2

Using conditional entropy gives us a measure that:

• Is distribution-free
• Is robust to outliers and extreme values
• Has minimal nuisance parameters
• Is robust to noise, as long as the noise exists in all subspaces

Adapted from a method proposed by Guo et al. of the Geography department at Penn State.

Guo et al., Workshop on Clustering High-Dimensional Data and its Applications, 2003

Geography to … Microarrays?

• Guo et al. have data with many (~10,000) observations in a few (~50) dimensions (measurements).
• We have the opposite problem: many more ‘dimensions’ (genes) than observations (‘samples’ or ‘patients’) on those dimensions.
• We flip Guo’s method on its ear: we pretend that observations are dimensions and vice versa.

[Figure: data-matrix schematics with axes labeled Dim. and Obs.; for our data the roles are swapped and the axes are labeled “Obs” and “Dim”.]

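The method

[Figure: method pipeline. Gene Expression Data → Nested Means Matrix → CE Distance Matrix, which feeds both Minimal Spanning Tree → MST Order and Clique Discovery → Cliques. Matrix sizes in the figure are labeled with ng, ns, and nr.]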

CE – what are we looking for?

Nested means discretization

• More resistant to extreme outliers than an equal-interval approach.
• We calculate nested mean vectors as follows:
  • Calculate the mean value of a dimension.
  • Divide the data into two halves on this mean.
  • Recursively divide each half in two again, building up a vector of ‘nested mean’ boundaries.
  • Stop once we have the ‘required’ number of intervals (denoted r).
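The recursive splitting above can be sketched in Python as follows. This is a minimal illustration, not the authors' R code: the function names are hypothetical, r is assumed to be a power of two, and values equal to a mean are assumed to go to the upper half.

```python
from bisect import bisect_right
from statistics import mean


def nested_means_boundaries(values, r):
    """Return the interior boundaries that split `values` into r intervals.

    Assumption: r is a power of two, so each round of splitting doubles
    the number of intervals.
    """
    if r < 2 or r & (r - 1):
        raise ValueError("r must be a power of two and at least 2")

    def split(vals, intervals):
        # Stop when this part needs no further division (or is empty).
        if intervals == 1 or not vals:
            return []
        m = mean(vals)
        lower = [v for v in vals if v < m]
        upper = [v for v in vals if v >= m]  # ties go to the upper half (assumption)
        return split(lower, intervals // 2) + [m] + split(upper, intervals // 2)

    return split(list(values), r)


def discretize(values, boundaries):
    """Map each value to the index of the interval it falls in."""
    return [bisect_right(boundaries, v) for v in values]


sample = [1, 2, 3, 4, 100]  # one extreme outlier
bounds = nested_means_boundaries(sample, 4)
print(bounds)
print(discretize(sample, bounds))
```

With four equal-width intervals over the range 1 to 100, the values 1 through 4 would all share one interval; the nested means boundaries still separate them, which is the outlier resistance the slide refers to.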

We want enough intervals so that, on average, each cell contains ~35 points (Cheng et al., 1999). Guo uses (r is the number of intervals):

$n / r^2 \approx 35$ and $r = 2^k$

Example: for n = 10,000, r = 16, because 16 × 16 = 256 and 256 × 35 = 8,960 < 10,000.
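The rule for picking r can be written out as a short Python sketch. The function name is hypothetical, and two details are assumptions read off the worked example rather than stated rules: r is rounded down to the largest power of two that keeps about 35 points per cell, and r is never allowed below 2.

```python
def choose_r(n, points_per_cell=35):
    """Largest power-of-two r with r * r * points_per_cell <= n (minimum 2)."""
    r = 2
    # Keep doubling r while the next size still leaves enough points per cell.
    while (2 * r) ** 2 * points_per_cell <= n:
        r *= 2
    return r


print(choose_r(10_000))  # the slide's example, n = 10,000
```

For n = 10,000 the loop stops at r = 16, since 32 × 32 × 35 = 35,840 exceeds n, which matches the example above.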


Calculating CE

• For every pair of dimensions (X and Y), discretize the 2D subspace (using the nested means intervals); each cell is then represented in a table by the number of observations that fall in that cell.
• Calculate entropy for every row and column; weight each by the row or column sum divided by the total number of observations.
• Add up the weighted row and column entropy values to get CE(Y|X) and CE(X|Y). The maximum of these two values is the final cluster tendency measure.

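Calculating CE

$$H(C) = -\sum_{x \in \chi} \frac{d(x)\,\log d(x)}{\log |\chi|}$$

Here χ is the set of cells in a row or column C, d(x) is the fraction of that row's or column's observations falling in cell x, and |χ| is the number of cells.

       X1    X2    X3    X4    X5    X6   Sum    Wt    CE
X1      0     1     3     0     0     0     4   .03  .314
X2      1     9     1     0     1     2    14   .09  .629
X3      7    14     3     7     6     0    37   .25  .835
X4      7     6    13    19    12     5    62   .41  .939
X5      0     4    14     5     1     1    25   .17  .668
X6      1     2     3     2     0     0     8   .05  .737
Sum    16    36    37    33    20     8
Wt    .11   .24   .25   .22   .13   .05
CE   .597  .847  .806  .615  .540  .502

CE(Y|X) = .700
CE(X|Y) = .812
CEmax = .812

Example taken from Guo et al.: 150 total values, r = 6 intervals.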
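The table above can be checked with a few lines of Python. This is a sketch with hypothetical function names, not the authors' R implementation. It normalizes each row or column entropy by the log of the number of intervals, and it reports the two weighted sums by orientation (rows, columns) rather than assuming which table axis is X and which is Y.

```python
from math import log


def normalized_entropy(counts):
    """Entropy of a row or column of counts, scaled to [0, 1] by log(#cells)."""
    total = sum(counts)
    if total == 0 or len(counts) < 2:
        return 0.0
    h = -sum((c / total) * log(c / total) for c in counts if c > 0)
    return h / log(len(counts))


def weighted_entropies(table):
    """Return (row-weighted, column-weighted) conditional entropy of a count table."""
    n = sum(sum(row) for row in table)
    columns = list(zip(*table))
    by_rows = sum(sum(row) / n * normalized_entropy(row) for row in table)
    by_cols = sum(sum(col) / n * normalized_entropy(col) for col in columns)
    return by_rows, by_cols


# Cell counts from the example table (150 observations, r = 6).
GUO_TABLE = [
    [0, 1, 3, 0, 0, 0],
    [1, 9, 1, 0, 1, 2],
    [7, 14, 3, 7, 6, 0],
    [7, 6, 13, 19, 12, 5],
    [0, 4, 14, 5, 1, 1],
    [1, 2, 3, 2, 0, 0],
]

by_rows, by_cols = weighted_entropies(GUO_TABLE)
ce_max = max(by_rows, by_cols)
print(f"rows: {by_rows:.3f}  columns: {by_cols:.3f}  max: {ce_max:.3f}")
```

The row-weighted sum reproduces the .812 reported as CE(X|Y) and the column-weighted sum reproduces the .700 reported as CE(Y|X); the larger of the two is CEmax.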


Graph-theoretic analysis

• CE calculation results in a distance matrix; visualizing the fully connected graph is of little use.
• We can use graph theory to answer two questions:
  • Topologically, is there a linear order that, when sorted and imaged, can reveal cluster structure?
  • What fully connected subgraphs (cliques) exist in my data?

Sample ordering – the MST

• A minimum spanning tree (MST) is a spanning tree of a graph whose edges carry weights or lengths, chosen so that the total weight of the tree (the sum of the weights of its edges) is at a minimum.
• We can use the topological ordering of the MST to create a relative ordering of our samples. Sorting the samples in this way in a data image can reveal structure.
• We used Kruskal’s algorithm, a greedy approach to generating an MST, as implemented in the RBGL R library (mstree.kruskal()).
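As a rough illustration of this step, the Python sketch below builds an MST with Kruskal's algorithm and then reads off a linear order. It is not the RBGL or Perl code used in the work. The slides do not say how the tree is turned into an ordering, so the depth-first walk (nearest neighbour first) used here is an assumption, and all names are hypothetical.

```python
def kruskal_mst(dist):
    """Return MST edges (i, j, weight) of the complete graph on a distance matrix."""
    n = len(dist)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted((dist[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    mst = []
    for w, i, j in edges:  # greedy: always try the shortest remaining edge
        ri, rj = find(i), find(j)
        if ri != rj:  # the edge joins two components, so it cannot form a cycle
            parent[ri] = rj
            mst.append((i, j, w))
            if len(mst) == n - 1:
                break
    return mst


def mst_order(dist, start=0):
    """Linear node order from a depth-first walk of the MST (assumed linearization)."""
    neighbours = {i: [] for i in range(len(dist))}
    for i, j, w in kruskal_mst(dist):
        neighbours[i].append((w, j))
        neighbours[j].append((w, i))
    order, seen, stack = [], set(), [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        # Push the farthest neighbour first so the nearest one is visited next.
        for w, nxt in sorted(neighbours[node], reverse=True):
            if nxt not in seen:
                stack.append(nxt)
    return order


# Toy distance matrix: samples 0 and 2 are close, as are samples 1 and 3.
D = [
    [0, 9, 1, 8],
    [9, 0, 8, 2],
    [1, 8, 0, 7],
    [8, 2, 7, 0],
]
print(kruskal_mst(D))
print(mst_order(D))
```

In the toy matrix the resulting order places each close pair side by side, which is the property that makes the sorted data image readable.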

Use of the MST to Induce Orderings on the Dimensions

• similar to UPGMA tree-building

• the linear ordering can be viewed as a 1D compression of the resulting hierarchical tree

MST orderings on the image of the CE values

• After ordering the samples according to their MST order, R’s image() function can generate the image at right.
• This ordering can show us formerly hidden cluster structure without any presupposition.

Ascertaining Clusters of Dimensions Based on the Maximal Cliques of the Complete CE Graph

• If we can see cluster structure, can we retrieve it in an automatic fashion?
• On the fully connected graph, break all edges longer than a threshold distance (somewhat subjective; varies between data sets).
• On the resulting graph, find all cliques (fully connected node sets).
• Dr. Marchette’s graph library: clique()
• Future work: a more efficient method is required.
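The two steps (threshold the graph, then enumerate cliques) can be sketched in Python as below. This uses the standard Bron-Kerbosch algorithm with pivoting as a stand-in; it is not the clique() function mentioned above, and the function names and the choice to keep edges with distance equal to the threshold are assumptions.

```python
def threshold_graph(dist, threshold):
    """Adjacency sets keeping only edges with distance <= threshold."""
    n = len(dist)
    return {
        i: {j for j in range(n) if j != i and dist[i][j] <= threshold}
        for i in range(n)
    }


def maximal_cliques(adj):
    """All maximal cliques of an undirected graph (Bron-Kerbosch with pivoting)."""
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates to add, x: already-explored nodes.
        if not p and not x:
            cliques.append(sorted(r))
            return
        # Pivot on the node with the most neighbours among the candidates.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return sorted(cliques)


# Toy distance matrix: samples 0 and 2 are close, as are samples 1 and 3.
D = [
    [0, 9, 1, 8],
    [9, 0, 8, 2],
    [1, 8, 0, 7],
    [8, 2, 7, 0],
]
print(maximal_cliques(threshold_graph(D, 3)))
print(maximal_cliques(threshold_graph(D, 7)))
```

Nodes left with no edges after thresholding are reported as single-node cliques. Enumerating maximal cliques can take exponential time in the worst case, which is consistent with the note that a more efficient method is needed.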

Implementation details

• Nested means discretization and calculation of conditional entropy written in R
• MST ordering and dot files (our graph format of choice) written in Perl
• Graphs visualized using AT&T’s Graphviz
• All input and output files are tab-delimited ASCII text

Anecdotal Results

Artificial Data Set

• 1000 observations in R^100, distributed N(0,1) in each of the variates
• Observations 1-250 translated by +3 in dimensions {5,6,7,8}
• Observations 251-500 translated by –3 in dimensions {24,25,26,27,28,29,30}
• Observations 501-750 translated by +5 in dimensions {55,56,57,58,59,60,61,62,63,64,65,66,67}
• Observations 751-1000 translated by –5 in dimensions {10,11,12,13,14}
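A data set with this layout can be generated with a short Python sketch. The function name, the seed, and the use of Python's random module are illustrative assumptions; the observation and dimension numbers are the 1-based, inclusive ranges listed above.

```python
import random

# (first_obs, last_obs, shift, first_dim, last_dim), all 1-based and inclusive.
SHIFTS = [
    (1, 250, +3.0, 5, 8),
    (251, 500, -3.0, 24, 30),
    (501, 750, +5.0, 55, 67),
    (751, 1000, -5.0, 10, 14),
]


def make_artificial_data(n_obs=1000, n_dim=100, seed=0):
    """Standard normal noise with four shifted blocks of observations."""
    rng = random.Random(seed)
    data = [[rng.gauss(0.0, 1.0) for _ in range(n_dim)] for _ in range(n_obs)]
    for first_obs, last_obs, shift, first_dim, last_dim in SHIFTS:
        # Convert the 1-based inclusive ranges to 0-based Python ranges.
        for i in range(first_obs - 1, last_obs):
            for j in range(first_dim - 1, last_dim):
                data[i][j] += shift
    return data


data = make_artificial_data()
block_mean = sum(row[4] for row in data[:250]) / 250  # dimension 5, observations 1-250
print(len(data), len(data[0]), round(block_mean, 1))
```

Each block of 250 observations forms a cluster only in its own subset of dimensions, so the correct answer for the subspace search is known in advance.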

Artificial dataset results - MST

Image of Sorted CE Values for the Artificial Dataset

Golub dataset

• An experiment to determine the ability of microarray data to separate acute myeloid leukemia (AML) from acute lymphoblastic leukemia (ALL).
• Custom microarray, 7,129 genes
• 72 samples
  • 47 ALL samples (both B- and T-cell)
  • 25 AML samples

T.R. Golub et al. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science, vol. 286, 531 (1999)

Golub Dataset - MST

• ALL samples

• AML samples

Image of Sorted CE Values for the Golub Dataset

ALL data set

• Acute lymphoblastic leukemia B- and T-cell data set contributed to Bioconductor by the Dana-Farber Cancer Institute.
• Affymetrix U95Av2 chip, 12,625 genes
• 128 samples
  • 95 B-cell samples
  • 33 T-cell samples

ALL - MST

• B-cell samples

• T-cell samples

Image of Sorted CE Values for the ALL Dataset

Summary/Conclusions

• An informative technique for initial high-level data exploration
• Future direction:
  • Concretely determine sensitivity to noise
  • Develop a visualization tool for the MST ordering
  • A more efficient clique-discovery method

References

• Cheng, C., A. Fu, and Y. Zhang. Entropy-based subspace clustering for mining numerical data. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, USA. (1999)
• Getz, G., Levine, E., Domany, E. Coupled two-way clustering analysis of gene microarray data. PNAS. 97:22, 12079. (2000)
• Guo, D. et al. Breaking Down Dimensionality: Effective and Efficient Feature Selection for High-Dimensional Clustering. [Name of Conference]. [date]
• Guo, D., D. Peuquet and M. Gahegan. Opening the Black Box: Interactive Hierarchical Clustering for Multivariate Spatial Patterns. The 10th ACM International Symposium on Advances in Geographic Information Systems, McLean, VA, USA. (2002)