Cluster Subspace Identification Via Conditional Entropy Calculations

James Diggans
George Mason University
[email protected]

Jeffrey L. Solka
George Mason University
[email protected]

Outline

• Subspace identification: why?
• Conditional entropy and clusters in R².
• Ordering dimensions for easy subspace visualization and identification.
• Maximal cliques lead to automatic subspace identification.

Subspace identification

• Initial, high-level exploration of complex data can inform downstream analyses.
• Explore samples (observations) or genes (dimensions), depending on intent.
• Cluster structure in patients may be revealed only on a subset of genes, and vice versa (Getz et al.).
• Uninformed feature selection can discard informative features.

Conditional entropy and clusters in R²

Use of conditional entropy gives us a method with these properties:

• Distribution-free
• Robust to outliers/extreme values
• Minimal nuisance parameters
• Robust to noise, as long as the noise exists in all subspaces

Adapted from a method proposed by Guo et al. of the Geography department at Penn State.

Guo et al., Workshop on Clustering High-Dimensional Data and its Applications, 2003

Geography to … Microarrays?

Guo et al. have data with many (~10,000) observations in a few (~50) dimensions (measurements):

[Figure: data matrix with axes labeled Dim. and Obs.]

We have the opposite problem: many more ‘dimensions’ (genes) than observations (‘samples’ or ‘patients’) on those dimensions. We turn Guo’s method on its ear and treat observations as dimensions and vice versa.

[Figure: the same data matrix with the Dim. and Obs. axes relabeled “Obs” and “Dim”]

The method

[Figure: processing pipeline. Gene Expression Data → Nested Means Matrix → CE Distance Matrix → Minimal Spanning Tree → MST Order, and CE Distance Matrix → Clique Discovery → Cliques. Matrix dimensions in the diagram are labeled with ng, ns, and nr.]

CE – what are we looking for?

Nested means discretization

Resistant to extreme outliers, unlike an equal-interval approach. We calculate nested mean vectors as follows:

• Calculate the mean value of a dimension.
• Divide the data into two halves at this mean.
• Recursively divide each half in two again, calculating a vector of ‘nested mean’ boundaries.
• Stop once we have the ‘required’ number of intervals (denoted r).
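The recursive splitting described above can be sketched in Python as follows. This is a minimal illustration, not the authors' R implementation: the function name `nested_means`, the `depth` parameter (number of halving levels, giving at most 2^depth intervals), and the choice to send values equal to the mean into the upper half are all assumptions.

```python
def nested_means(values, depth):
    """Return sorted interval boundaries from `depth` levels of mean splits.

    Hypothetical helper: a half that cannot be split further (fewer than
    two values, or all values identical) simply contributes no boundary.
    """
    if depth == 0 or len(values) < 2:
        return []
    mean = sum(values) / len(values)
    low = [v for v in values if v < mean]    # strictly below the mean
    high = [v for v in values if v >= mean]  # ties go to the upper half (assumption)
    if not low or not high:
        return []
    # Boundaries of the lower half, then this mean, then the upper half.
    return nested_means(low, depth - 1) + [mean] + nested_means(high, depth - 1)


# One extreme value (100) does not swallow the resolution of the small values.
boundaries = nested_means([1, 2, 3, 4, 100], depth=2)  # [2.5, 22.0]
```

In this toy input an equal-interval split of the range 1 to 100 would place the values 1 to 4 in a single bin, whereas the nested means still separate them at 2.5.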

We want enough intervals so that, on average, each cell contains about 35 points (Cheng et al., 1999). Guo uses (r is the number of intervals):

$$ n / r^2 \approx 35 \quad \text{and} \quad r = 2^k $$

Example: for n = 10,000, r = 16, because 16 × 16 = 256 and 256 × 35 = 8,960 < 10,000.
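The arithmetic in this example can be checked with a few lines of Python. The sketch assumes the rule implied by the example, namely that r is the largest power of two for which 35 · r² does not exceed n; the function name `choose_intervals`, the `points_per_cell` parameter, and the floor of r = 2 for very small n are illustrative assumptions.

```python
def choose_intervals(n, points_per_cell=35):
    """Largest r = 2**k (k >= 1) with points_per_cell * r**2 <= n.

    For very small n the function still returns r = 2 (assumption).
    """
    k = 1
    # Keep doubling r while the next power of two still leaves, on
    # average, at least `points_per_cell` observations per 2D cell.
    while points_per_cell * (2 ** (k + 1)) ** 2 <= n:
        k += 1
    return 2 ** k


r = choose_intervals(10_000)  # 16, since 35 * 16**2 = 8960 <= 10000 < 35 * 32**2
```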

Calculating CE

• For every pair of dimensions (X and Y), discretize the 2D subspace (using the nested means intervals); each cell is then represented in a table by the number of observations that fall in that cell.
• Calculate entropy for every row and column; weight each by the row or column sum divided by the total number of observations.
• Add up the weighted row and column entropy values to get CE(Y|X) and CE(X|Y). The maximum of these two values is the final cluster tendency measure.
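The first step, turning a pair of dimensions into a table of cell counts, might look like the Python sketch below. It assumes the interval boundaries for each dimension are already available as sorted lists; the function name `contingency_table` and the convention that rows index Y intervals and columns index X intervals are assumptions made for illustration.

```python
from bisect import bisect_right


def contingency_table(xs, ys, x_bounds, y_bounds):
    """Count observations per 2D cell.

    xs, ys             : paired values of the two dimensions
    x_bounds, y_bounds : sorted interval boundaries (hypothetical inputs)
    Rows index Y intervals, columns index X intervals (assumption).
    """
    n_rows = len(y_bounds) + 1
    n_cols = len(x_bounds) + 1
    table = [[0] * n_cols for _ in range(n_rows)]
    for x, y in zip(xs, ys):
        # bisect_right sends a value equal to a boundary to the upper interval.
        table[bisect_right(y_bounds, y)][bisect_right(x_bounds, x)] += 1
    return table


# Two tight clusters fall into the two diagonal cells.
counts = contingency_table([0.1, 0.2, 0.8, 0.9], [0.1, 0.2, 0.8, 0.9], [0.5], [0.5])
```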

Calculating CE

$$ H(C) = -\sum_{x \in \chi} [d(x) \log d(x)] / \log|\chi| $$

        X1     X2     X3     X4     X5     X6    Sum    Wt     CE
X1       0      1      3      0      0      0      4   .03   .314
X2       1      9      1      0      1      2     14   .09   .629
X3       7     14      3      7      6      0     37   .25   .835
X4       7      6     13     19     12      5     62   .41   .939
X5       0      4     14      5      1      1     25   .17   .668
X6       1      2      3      2      0      0      8   .05   .737
Sum     16     36     37     33     20      8
Wt     .11    .24    .25    .22    .13    .05
CE    .597   .847   .806   .615   .540   .502

CE(Y|X) = .700
CE(X|Y) = .812
CEmax = .812

Example taken from Guo et al.: 150 total values, r = 6 intervals.
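The table can be recomputed with a short Python sketch. It assumes each row or column entropy is normalized by the log of the number of cells, which is what the tabulated CE values imply, and that columns index the X intervals as in the table header. The function names are illustrative, not the authors' R code.

```python
import math


def normalized_entropy(counts):
    """Entropy of one row or column of counts, divided by log(number of cells)."""
    total = sum(counts)
    if total == 0 or len(counts) < 2:
        return 0.0
    h = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
    return h / math.log(len(counts))


def conditional_entropies(table):
    """Return (row-weighted CE, column-weighted CE) for a table of counts."""
    n = sum(sum(row) for row in table)
    columns = [list(col) for col in zip(*table)]
    ce_rows = sum(sum(row) / n * normalized_entropy(row) for row in table)
    ce_cols = sum(sum(col) / n * normalized_entropy(col) for col in columns)
    return ce_rows, ce_cols


# Cell counts from the example table (150 values, r = 6 intervals).
GUO_TABLE = [
    [0, 1, 3, 0, 0, 0],
    [1, 9, 1, 0, 1, 2],
    [7, 14, 3, 7, 6, 0],
    [7, 6, 13, 19, 12, 5],
    [0, 4, 14, 5, 1, 1],
    [1, 2, 3, 2, 0, 0],
]

ce_rows, ce_cols = conditional_entropies(GUO_TABLE)
ce_max = max(ce_rows, ce_cols)  # final cluster tendency measure
```

Rounded to three places, the row-weighted value matches the .812 in the table and the column-weighted value matches the .700.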

Graph-theoretic analysis

The CE calculation results in a distance matrix; visualizing the fully connected graph is of little use. We can use graph theory to answer two questions:

• Topologically, is there a linear order that, when sorted and imaged, can reveal cluster structure?
• What fully connected subgraphs (cliques) exist in my data?

Sample ordering: the MST

• A minimum spanning tree (MST) is a spanning tree of a graph whose edges carry weights or lengths; its total weight (the sum of the weights of its edges) is the minimum over all spanning trees.
• We can use the topological ordering of the MST to create a relative ordering of our samples. Sorting the samples in this way in a data image can reveal structure.
• We used Kruskal’s algorithm, a greedy approach to generating an MST, as implemented in the RBGL R library (mstree.kruskal()).
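As a rough illustration of this step, the Python sketch below builds an MST with a plain Kruskal implementation (standing in for RBGL's mstree.kruskal()) and then linearizes it. The slides do not say how the linear order is read off the tree, so the depth-first walk that visits the nearest unvisited neighbor first is an assumption, as are the function names and the toy distance matrix.

```python
def kruskal_mst(dist):
    """Kruskal's algorithm on a symmetric distance matrix; returns (i, j, w) edges."""
    n = len(dist)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted((dist[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # the edge joins two components, so it cannot form a cycle
            parent[ri] = rj
            tree.append((i, j, w))
    return tree


def mst_order(dist, start=0):
    """Depth-first walk of the MST, nearest neighbor first (assumed linearization)."""
    adj = {i: [] for i in range(len(dist))}
    for i, j, w in kruskal_mst(dist):
        adj[i].append((w, j))
        adj[j].append((w, i))
    order, seen, stack = [], set(), [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        # Push the farthest neighbor first so the nearest one is popped next.
        for w, nb in sorted(adj[node], reverse=True):
            if nb not in seen:
                stack.append(nb)
    return order


# Hypothetical distances: samples {0, 2} and {1, 3} form two tight pairs.
D = [[0, 9, 1, 8],
     [9, 0, 7, 1],
     [1, 7, 0, 9],
     [8, 1, 9, 0]]
order = mst_order(D)  # [0, 2, 1, 3]
```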

Use of the MST to Induce Orderings on the Dimensions

• similar to UPGMA tree-building

• the linear ordering can be viewed as a 1D compression of the resulting hierarchical tree

MST orderings on the image of the CE values

After ordering the samples according to their MST order, R’s image() function can generate the image at right. This ordering can show us formerly hidden cluster structure without any presupposition.
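The reordering itself amounts to applying one permutation to both the rows and the columns of the CE matrix before imaging it. A minimal Python sketch, with a hypothetical 4 × 4 matrix and a hypothetical MST order, is:

```python
def reorder(matrix, order):
    """Apply the same permutation to the rows and columns of a square matrix."""
    return [[matrix[i][j] for j in order] for i in order]


# Hypothetical CE distances and a hypothetical MST order for four samples.
ce = [[0, 9, 1, 8],
      [9, 0, 7, 1],
      [1, 7, 0, 9],
      [8, 1, 9, 0]]
order = [0, 2, 1, 3]

sorted_ce = reorder(ce, order)
# Small distances now sit in 2 x 2 blocks along the diagonal:
# [[0, 1, 9, 8],
#  [1, 0, 7, 9],
#  [9, 7, 0, 1],
#  [8, 9, 1, 0]]
```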

Ascertaining Clusters of Dimensions Based on the Maximal Cliques of the Complete CE Graph

• If we can see cluster structure, can we retrieve it in an automatic fashion?
• On the fully connected graph, break all edges longer than a threshold distance (somewhat subjective; varies between data sets).

• On the resulting graph, find all cliques (fully connected node sets).
• We used clique() from Dr. Marchette’s graph library.
• Future work: a more efficient method is required.
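The thresholding and clique search can be sketched in Python as below. The clique search here is the standard Bron–Kerbosch algorithm with pivoting, used as a stand-in for the clique() routine the authors mention; it is not claimed to be their method, nor the more efficient one called for as future work. The function names, the `max_dist` parameter, and the toy distance matrix are assumptions.

```python
def threshold_graph(dist, max_dist):
    """Keep only edges no longer than max_dist; returns adjacency sets."""
    n = len(dist)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= max_dist:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def maximal_cliques(adj):
    """Bron-Kerbosch with pivoting; isolated nodes are reported as 1-cliques."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(sorted(r))  # r cannot be extended: it is maximal
            return
        # The pivot with the most neighbors in p prunes the most branches.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return sorted(cliques)


# Hypothetical distances: {0, 2} and {1, 3} are close, everything else is far.
D = [[0, 9, 1, 8],
     [9, 0, 7, 1],
     [1, 7, 0, 9],
     [8, 1, 9, 0]]
cliques = maximal_cliques(threshold_graph(D, max_dist=2))  # [[0, 2], [1, 3]]
```

Raising `max_dist` to 7 in this toy example also keeps the edge between samples 1 and 2, which illustrates how the recovered cliques depend on the threshold.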

Implementation details

• Nested means discretization and calculation of conditional entropy written in R
• MST ordering and dot files (our graph format of choice) written in Perl
• Graphs visualized using AT&T’s Graphviz
• All input and output files are tab-delimited ASCII text

Anecdotal Results

Artificial Data Set

• 1000 observations in R¹⁰⁰, distributed N(0,1) in each of the variates
• Observations 1-250 translated by +3 in dimensions {5,6,7,8}
• Observations 251-500 translated by -3 in dimensions {24,25,26,27,28,29,30}
• Observations 501-750 translated by +5 in dimensions {55,56,57,58,59,60,61,62,63,64,65,66,67}
• Observations 751-1000 translated by -5 in dimensions {10,11,12,13,14}
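A data set with this structure could be generated along the following lines. This is a sketch using only Python's standard library; the seed, the function name `make_artificial_data`, and the `SHIFTS` table layout are assumptions, and the 1-based observation and dimension numbers above are converted to 0-based indices.

```python
import random

# (observation indices, shift, 1-based dimension numbers as listed above)
SHIFTS = [
    (range(0, 250), 3.0, range(5, 9)),       # observations 1-250, dims 5-8
    (range(250, 500), -3.0, range(24, 31)),  # observations 251-500, dims 24-30
    (range(500, 750), 5.0, range(55, 68)),   # observations 501-750, dims 55-67
    (range(750, 1000), -5.0, range(10, 15)), # observations 751-1000, dims 10-14
]


def make_artificial_data(seed=0):
    """1000 x 100 matrix of N(0,1) draws with four translated blocks."""
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    data = [[rng.gauss(0.0, 1.0) for _ in range(100)] for _ in range(1000)]
    for observations, shift, dimensions in SHIFTS:
        for i in observations:
            for d in dimensions:
                data[i][d - 1] += shift  # d is 1-based in the description
    return data


data = make_artificial_data()
```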

Artificial dataset results - MST

Image of Sorted CE Values for the Artificial Dataset

Golub dataset

• An experiment to determine the ability of microarray data to separate acute myeloid leukemia (AML) from acute lymphoblastic leukemia (ALL).
• Custom microarray, 7,129 genes
• 72 samples
  • 47 ALL samples (both B- and T-cell)
  • 25 AML samples

T.R. Golub et al. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science, vol. 286, 531 (1999)

Golub Dataset - MST

• ALL samples

• AML samples

Image of Sorted CE Values for the Golub Dataset

ALL data set

• Acute lymphoblastic leukemia B- and T-cell data set contributed to Bioconductor by the Dana Farber Cancer Institute.
• Affymetrix U95Av2 chip, 12,625 genes
• 128 samples
  • 95 B-cell samples
  • 33 T-cell samples

ALL - MST

• B-cell samples

• T-cell samples

Image of Sorted CE Values for the ALL Dataset

Summary/Conclusions

• An informative technique for initial high-level data exploration
• Future directions:
  • Concretely determine sensitivity to noise
  • Develop a visualization tool for the MST ordering
  • Develop a more efficient clique-discovery method

References

Cheng, C., A. Fu, and Y. Zhang (1999). Entropy-based subspace clustering for mining numerical data. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, USA.
Getz, G., E. Levine, and E. Domany (2000). Coupled two-way clustering analysis of gene microarray data. PNAS 97:22, 12079.
Guo, D. et al. Breaking Down Dimensionality: Effective and Efficient Feature Selection for High-Dimensional Clustering. [Name of Conference]. [date]
Guo, D., D. Peuquet, and M. Gahegan (2002). Opening the Black Box: Interactive Hierarchical Clustering for Multivariate Spatial Patterns. The 10th ACM International Symposium on Advances in Geographic Information Systems, McLean, VA, USA.

