
Distributed, Scalable Clustering for Detecting Halos in Terascale Astronomy Datasets

Abstract— Terascale astronomical datasets have the potential to provide unprecedented insights into the origins of our universe. However, automated techniques for determining regions of interest are a must if domain experts are to cope with the intractable amounts of simulation data. This paper addresses the important problem of locating and tracking high density regions in space that generally correspond to halos and sub-halos, and host galaxies. A density-based, mode-following clustering method called Automated Hierarchical Density Shaving (Auto-HDS) is adapted for this application. Auto-HDS can detect clusters of different densities while discarding the vast majority of background data.

Two alternative parallel implementations of the algorithm, based respectively on the dataflow computational model and on Hadoop/MapReduce functional programming constructs, are realized and compared. Based on runtime performance and scalability across compute cores and across increasing data volumes, we demonstrate the benefits of fine grain parallelism. The proposed distributed and multithreaded AutoHDS clustering algorithm is shown to produce high quality clusters, to be computationally efficient, and to scale from 1 through 1024 compute-cores.

Keywords: Distributed Clustering; Scalable; Terascale; Astronomy

I. INTRODUCTION

Simulation of the dynamical evolution of the entire observable universe via N-Body interactions begins with the presumed first principles of the universe: cosmic background radiation; an expanding volume of cooling helium and hydrogen; dark matter separating from gas and coalescing into massive stars [1][2]. The classical N-body problem simulates the evolution of a system of N bodies, where the force exerted on each body arises from its interaction with all the other bodies in the system. N-body algorithms have numerous applications in areas such as astrophysics, molecular dynamics and plasma physics. The Cube3PM method for carrying out large N-Body simulations to study the formation and evolution of large scale structure in the universe combines direct particle-particle forces at small scales with particle-mesh ones at larger scales (Particle-Particle-Particle-Mesh method) [3]. Such an approach, run at the Texas Advanced Computing Center (TACC) on the Ranger and Lonestar supercomputing clusters with 864 compute-cores (total), has provided a 123 GB (binary) simulated dataset with 1728³ or 5.2 billion particles [4], which we use for illustrative application in this work. Our ultimate goal is to apply this method to much larger datasets, with 4000³-5488³ (64-165 billion) particles. Several such simulations were completed on Ranger on 4,000-22,976 cores.

A. Problem Setting

From the astrophysicist's perspective, the problem of identifying regions of interest in this large scale data and being able to visualize these regions is made intractable by the overwhelming volumes of data. Therefore, automated methods for visualization are critical to advancing the physical understanding of what is happening through better analysis. One important area where data mining methods would be quite useful is in locating collapsed dark matter halos. These dark matter halos host galaxies. There are specific criteria that define a collapsed halo and its sub-structure. These dense clumps or sub-halos within large halos are the result of smaller, previously-formed halos accreting into larger ones. When halos or sub-halos are identified, the region of interest can be followed over time in order to gain understanding of star formation. From the data mining perspective, a highly scalable and automated technique is needed to filter out the vast majority of the simulation data while identifying the large number of high density particle clusters of arbitrary shape. Given the problem requirements, density based clustering methods, which segment the feature space into dense regions of particles separated by regions of relatively low density, seem a more natural fit than, say, centroid based (e.g., k-means) or hierarchical methods. However, such algorithms are not designed for scaling to large scale data across distributed systems; neither are they focused on identifying a few dense regions while ignoring the rest [5]. Moreover, standard quality metrics for clustering aim to show high similarity within a group and low similarity between patterns belonging to two different groups [6]. However, here we are not interested in much of the data, but only in how well the identified regions of interest match the actual halos, how quickly these regions can be detected, and how the software can guide the user through these regions isolated from a far larger "universe".

Srivatsava Daruru, Dept. of Computer Sciences, The University of Texas at Austin, Austin, USA, [email protected]

Gunjan Gupta, Dept. of Electrical and Comp. Eng., The University of Texas, Austin, USA, [email protected]

Ilian Iliev, Dept. of Physics and Astronomy, The University of Sussex, Brighton, UK, [email protected]

Weijia Xu, Paul Navratil, Texas Advanced Computing Center, The University of Texas at Austin, Austin, USA, [email protected], [email protected]

Joydeep Ghosh, Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, USA, [email protected]

Nena Marín, Innovations Laboratory, Pervasive Software, Austin, USA, [email protected]

Sankari Dhandapani, Dept. of Electrical and Comp. Eng., The University of Texas at Austin, Austin, USA, [email protected]

2010 IEEE International Conference on Data Mining Workshops, 978-0-7695-4257-7/10 $26.00 © 2010 IEEE, DOI 10.1109/ICDMW.2010.26


B. Key Contributions

In this paper, we present a density-based clustering algorithm based on the recently proposed Automated Hierarchical Density Shaving (Auto-HDS) method [7][8]. This algorithm is tailored to low dimensional spatial large scale astronomical datasets on the TeraGrid, and implemented as a dataflow system as well as on Hadoop. It also addresses a key problem in astronomy on gigabytes of data representing over 35 million particles. The key contributions include:

1. Introducing a distributed and multithreaded implementation of Auto-HDS. In addition to algorithmic modifications, the proposed solution uses dataflow data-parallel pipelines to "map" data and the clustering application across multiple nodes and then "reduce" the clustering results. Distributed Dataflow Auto-HDS leverages coarse grain parallelism via data parallelism across up to 128 nodes, and fine grain parallelism via partitioning node-level tasks into "p" threads of execution, where "p" corresponds to the number of cores in a single node.

2. Comparing the dataflow implementation versus a Hadoop based implementation of AutoHDS [9] for scalable distributed clustering of large scale datasets.

3. Providing a novel and effective parallel data mining visualization tool for large scale astronomical datasets, and insights into the interactions between clustering algorithm development and parallel implementation.

4. Presenting new lift-based evaluation methods that are more appropriate for this large-scale data mining application.

A Java-based dataflow implementation that includes support for interactive clustering and exploration is available for free download. Dataflow AutoHDS is currently under active testing on terabytes of data by two astronomy centres.

II. BACKGROUND AND RELATED WORK

The literature on clustering algorithms is encyclopedic, but the vast majority of techniques are not relevant to the problem setting being addressed [6], [7], [5]. Therefore we summarize only density-based approaches, which are the most relevant, as the ability to focus on only part of the data is innate to this class of approaches. The history of density-based clustering algorithms dates back to at least 1968, when a broad framework termed Hierarchical Mode Analysis (HMA) was introduced by Wishart [11]. HMA uses kernel density estimation to produce a hierarchy of dense clusters. Its weakness was its cubic computational cost, which was not practical for the computers of the time. Later, in 1996, DBSCAN, which can be seen as a special case of HMA, was introduced by Ester et al. [12]. DBSCAN relies on a density-based notion of clusters, where a cluster is defined as a maximal set of densely connected points. DBSCAN, and its subsequent enhancements, is particularly good at discovering clusters of arbitrary shape in low-dimensional data (e.g., spatial 2D and 3D databases). DBSCAN is resistant to noise and works well with clusters of different shapes and sizes. It is weak on clusters with varying densities and, like many localized approaches, is susceptible to the curse of dimensionality encountered in high dimensional data.

Traditional density based clustering algorithms use two parameters: the maximum radius of the neighborhood and the minimum number of points within that radius. OPTICS [13], introduced in 1999, is an extension of DBSCAN that produces a special ordering of the database with respect to its density-based clustering structure. The cluster ordering is equivalent to the density-based clusterings obtained from a broad range of parameter settings. Finally, Hinneburg's DENCLUE [14] employs a cluster model based on kernel density estimation. A cluster is defined by a local maximum of the estimated density function, and data points are assigned to clusters by hill climbing.

DENCLUE allows a compact description of arbitrarily shaped clusters in high-dimensional data sets. DENCLUE is faster than DBSCAN but requires a large number of parameters. While all these clustering approaches represent considerable progress in scalable clustering methods, they do not address all requirements for scalability, nor do they handle both strict and overlapping clusters. Subsequently, in 2006, Gupta et al. introduced Automated Hierarchical Density Shaving (Auto-HDS) and Gene DIVER to automatically select clusters of different densities, present them in a compact hierarchy, and rank individual clusters based on their stability [7][8]. Most notably, Auto-HDS was successfully applied to biological datasets to identify functionally related genes in datasets where many irrelevant genes or experiments are present. In subsequent sections we summarize the key aspects of Auto-HDS and describe how this algorithm can be adapted for large-scale parallel systems. Below, we provide a quick overview of the two computational frameworks: MapReduce/Hadoop and dataflow.

Map-Reduce [15] is a patented software framework introduced by Google to support distributed computing on large data sets on clusters of computers. Map-Reduce abstracts away difficult low-level tasks, such as distributing and coordinating parallel work across many machines. Hadoop [16] is an open source Java implementation of the Map-Reduce framework.

The Map-Reduce programming model forces users to map their applications onto the simplistic map-reduce model in order to achieve parallelism. For some applications this is not a simple task: the user ends up juggling multiple Map-Reduce tasks with the added complexity of manually orchestrating validation and execution orders. We address these issues using a naturally parallel programming model called dataflow [17][18], the essence of which is computation organized by the flow of data through a graph of operators. We compose these graphs using standard operators designed to tune their parallelism to available resources, such as the number of cores and the size of the heap on a single node. A dataflow application exhibits pipeline, horizontal (data), and vertical parallelism. The dataflow library [19] we use facilitates the development of sophisticated parallel programs designed to fully utilize commodity multi-core hardware, while hiding traditional difficulties such as queuing, threading, memory management, and deadlocks.

Current methods to detect individual halos fall into two basic categories, namely friends-of-friends (FOF; e.g. Jenkins et al. [20]) and spherical overdensity (SO; Cole and Lacey [21]) methods. The FOF method is particle-based. Dense regions are identified by locating particles that are closer to each other than a pre-defined distance, which is a parameter of the model usually referred to as the 'linking length'. Particles within that distance of each other are called 'friends', and the halos produced consist of all particles which are connected by a chain of friends. The SO class of methods, on the other hand, starts by identifying the local density peaks (or gravitational potential minima) as the halo centers and then expands spherical shells around those centers until a pre-defined density threshold (a free parameter of the model picked based on dynamical considerations) is crossed. Within these types of methods there are multiple variations regarding, e.g., how the halo centers are located, how the gravitationally-unbound particles are treated, etc. Each of the two basic approaches, FOF and SO, has its advantages and drawbacks and can fail in certain situations (see e.g. Tinker et al. [22]). Automated methods for halo identification and visualization are critical to advancing the physical understanding of what is happening through better analysis. The methods proposed here supply an alternative to the current approaches, one which is density-based like SO but does not make assumptions about the halo shapes as SO does.
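As a concrete illustration of the FOF idea (not the halo finders cited above), the following brute-force Python sketch labels particles that are chain-connected through neighbors closer than the linking length. The O(n²) neighbor search and the absence of a minimum group size are simplifications for illustration only.

import numpy as np

def friends_of_friends(points, linking_length):
    """Label particles by FOF group: particles within `linking_length` of
    each other are 'friends'; a halo is a chain-connected set of friends.
    points: (n, 3) array of coordinates. Illustrative sketch only."""
    n = len(points)
    labels = np.full(n, -1)
    current = 0
    for i in range(n):
        if labels[i] != -1:
            continue
        labels[i] = current
        stack = [i]
        while stack:
            j = stack.pop()
            # brute-force distances from particle j to all particles
            d = np.linalg.norm(points - points[j], axis=1)
            friends = np.where((d <= linking_length) & (labels == -1))[0]
            labels[friends] = current
            stack.extend(friends.tolist())
        current += 1
    return labels

A real FOF halo finder would additionally use a spatial index for neighbor queries and discard groups below a minimum particle count.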

III. PROBLEM DEFINITION

Auto-HDS [8] is a more scalable embodiment of the Hierarchical Mode Analysis philosophy [11], and represents a non-parametric density-based clustering framework for robust automated clustering and visualization. However, Auto-HDS in its present form scales well only on medium-sized clustering problems involving up to 10⁵ points, and is not parallelized. For the large scale HALO dataset (1728³ points), we propose combining coarse grain and fine grain parallelism. Distributed clustering provides coarse grain data parallelism: subsets of points in the HALO dataset are partitioned based on the number of nodes in the cluster, and the data partitions and the AutoHDS application are distributed and mapped to each node for execution. Additionally, at each node we propose leveraging every available core. The dataflow [23] computational model facilitates construction and execution of efficient data-parallel pipelines of computation as threads, with each thread of computation assigned to an individual core. Here we propose a new multithreaded and parallelized implementation of AutoHDS realized as a dataflow application graph.

IV. ALGORITHM IMPLEMENTATION

A. Density Shaving

The Density Shaving (DS) [8] algorithm takes two parameters: (1) f_shave, the fraction of least dense points to shave or exclude from consideration, and (2) n_eps, the number of points that must be within a distance r_eps of a given point x_i in order for x_i to be considered dense. Let d_neps(x) denote the distance from each point x to its n_eps-th nearest point, and sort these values in increasing order; then r_eps can be calculated from the parameters as r_eps = d_neps(i), where i = ⌈n(1 − f_shave)⌉. In other words, given n_eps, r_eps is the radius such that there are exactly ⌈n(1 − f_shave)⌉ points which are dense w.r.t. these parameters. The DS algorithm finds all such points and then applies a graph traversal process to discover the clusters composed of the dense points, where two dense points are in the same cluster if the distance between them is less than r_eps. The output of the algorithm is the set of labels for all dense points and a "don't care" set containing the remaining points. The pseudo-code of the algorithm is available in [8] along with inline comments (Algorithm 1).
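A minimal Python sketch of the DS step as defined above, using a brute-force distance matrix; it is illustrative only and not the implementation of [8].

import numpy as np
from math import ceil

def density_shave(X, f_shave, n_eps):
    """Illustrative DS: X is an (n, d) array; returns per-point labels
    (-1 marks the "don't care" set) and the derived radius r_eps."""
    n = len(X)
    # brute-force pairwise distances (O(n^2) memory; small inputs only)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # d_neps(x): distance to the n_eps-th nearest neighbor (index 0 is self)
    d_neps = np.sort(D, axis=1)[:, n_eps]
    # r_eps: the ceil(n*(1 - f_shave))-th smallest d_neps value
    k = int(ceil(n * (1.0 - f_shave)))
    r_eps = np.sort(d_neps)[k - 1]
    dense = np.where(d_neps <= r_eps)[0]
    # graph traversal: dense points closer than r_eps share a cluster
    labels = np.full(n, -1)
    next_label = 0
    for seed in dense:
        if labels[seed] != -1:
            continue
        labels[seed] = next_label
        stack = [seed]
        while stack:
            i = stack.pop()
            for j in dense:
                if labels[j] == -1 and D[i, j] < r_eps:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels, r_eps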

B. Hierarchical Density Shaving

The DS algorithm discovers dense clusters given f_shave and n_eps. In this section, we describe Hierarchical Density Shaving (HDS), which clusters the points at varying density scales and outputs a compact cluster hierarchy. In other words, at each level we shave off an r_shave fraction of the dense points as noise. The pseudo-code of the algorithm is available in [8] along with inline comments (Algorithm 2).
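A hedged sketch of the level schedule implied by this description and by the level-count formula used later in Section V: the number of points treated as dense shrinks geometrically by a factor of (1 − r_shave) per level. The compact-hierarchy bookkeeping of [8] is omitted.

import math

def hds_shaving_schedule(n, r_shave=0.1, n_min=2):
    """Return the number of points kept as dense at each HDS level; the
    count shrinks by (1 - r_shave) per level until fewer than n_min remain.
    Illustrative sketch, not the algorithm of [8]."""
    schedule = []
    n_dense = n
    while n_dense >= n_min:
        schedule.append(n_dense)
        n_dense = int(math.floor(n_dense * (1.0 - r_shave)))
    return schedule

# e.g. hds_shaving_schedule(10196, 0.1) enumerates the per-level counts for
# the 10K subset; running DS at level t with f_shave = 1 - schedule[t] / n
# yields that level's clusters.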

C. Auto-HDS

[8] also proposed a method to rank the clusters. The stability of a cluster is given by the number of shaving levels it survived without splitting or becoming part of the "don't care" set. To remove very small clusters from consideration, the authors also define n_part, the minimum number of data points a cluster must contain in order to be considered.

D. Auto-HDS for Halo Detection

Discovering a small set of clusters with varying density has many challenges and generally requires a lot of computation. The main features of Auto-HDS that make it well suited to the HALO dataset application are as follows:

1. It has the ability to cluster only a small subset of the data while pruning a large number of irrelevant points. This is important because the halos in the astronomy dataset make up only a small percentage of the data space, while the remaining portion has to be discarded as noise.


Figure 1 - 2D Representation of the Volume Space Partitioning across Four Nodes [2].

2. Being a density based clustering algorithm, it can detect clusters of varying shapes and sizes, which is necessary to detect the arbitrarily shaped halos.

3. It detects clusters of varying densities. The "dense" halos within the dataset often have variable densities, and hence this property of the algorithm helps in detecting them.

4. There is a need for an unsupervised setting since there is little or no labeled data available to select the model parameters.

5. The ability to find a compact hierarchy of clusters is very useful for exploring the structures of halos at different densities.

This unique set of abilities gives Auto-HDS a clear advantage in discovering dense halos.

E. Astronomical Data Partitioning

The simulation volume is evenly partitioned into cubic sub-sections. The sub-sections are segments of equal length along each dimension, so that each partition occupies a contiguous region. Adjacent cubic sub-sections are made to overlap by exactly 3 * r_eps. As demonstrated in [9], an overlap of at least r_eps between adjacent partitions along each feature dimension is required to guarantee that each point in the dataset is correctly clustered in at least one of the partitions. Figure 1 depicts, in two dimensions, the volume space partitioning across nodes. In this figure, nf_physical_dim corresponds to the physical dimensions of the entire volume included in the HALO dataset. The figure shows partitioning across 4 nodes: the entire volume is divided into 4 equal cubic sub-sections, and nf_physical_node_dim corresponds to the size of each sub-cube. A data partition, though, is comprised of nf_buf + nf_physical_node_dim + nf_buf. The additional bands of nf_buf correspond to the overlapping regions between contiguous sub-cubes, and their size is nf_buf = 1.5 * r_eps. This overlapping section of 3 * r_eps provides the region where clusters that span adjacent cubes can be identified and easily merged [9]. The same particle may appear in the overlapping regions of two adjacent cubes; the cluster assignment for these particles is renamed to the same label on both cubes. In order to obtain equally sized segments in all three dimensions, the number of nodes must be a perfect cube. For example, based on the maximum number of nodes on the TACC Longhorn cluster (256), the maximum number of nodes for running distributed AutoHDS is 6³ = 216 nodes. Each node in the Longhorn cluster receives an equal volume in space but not an equal number of particles, because certain regions in space are denser than others. This results in unbalanced computational loads across nodes.
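A minimal sketch of this cuboid partitioning, assuming a cubic volume of side nf_physical_dim and a perfect-cube node count. The names mirror the text, but the function and the example values are illustrative rather than the authors' partitioner.

from itertools import product

def cuboid_bounds(nf_physical_dim, num_nodes, r_eps):
    """Return per-node (low, high) corner coordinates with a 1.5*r_eps
    buffer on each side, so adjacent partitions overlap by 3*r_eps."""
    nodes_per_dim = round(num_nodes ** (1.0 / 3.0))
    assert nodes_per_dim ** 3 == num_nodes, "node count must be a perfect cube"
    side = nf_physical_dim / nodes_per_dim      # nf_physical_node_dim
    nf_buf = 1.5 * r_eps
    bounds = []
    for ix, iy, iz in product(range(nodes_per_dim), repeat=3):
        lo = [max(0.0, i * side - nf_buf) for i in (ix, iy, iz)]
        hi = [min(nf_physical_dim, (i + 1) * side + nf_buf) for i in (ix, iy, iz)]
        bounds.append((lo, hi))
    return bounds

# e.g. cuboid_bounds(511.0, 216, r_eps=0.5) yields 216 overlapping cuboids
# (the r_eps value here is a made-up illustration).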

F. Dataflow Distributed Multithreaded Auto-HDS

In the same style as Hadoop MapReduce functional programming constructs, we propose a dataflow Mapper and Reducer to perform distributed and multithreaded Auto-HDS on large scale astronomical datasets on the TeraGrid. Mapper: the mapper evenly partitions the large scale HALO dataset across the number of nodes available on the Longhorn cluster and applies Auto-HDS to each partition.

a. Data: the volume of the universe captured in the HALO dataset is assumed to be a three dimensional cube. This volume is equally divided into contiguous and slightly overlapping sub-cubes ordered in increasing x, y and z. The overlapping regions are there to facilitate the Reducer step described below.

b. App: the mapped application is the dataflow implementation of AutoHDS. A large portion of the AutoHDS runtime is spent computing an all-to-all point distance matrix. We present a parallelized k-nearest-neighbor technique to concurrently compute all distances in multiple threads of computation.

Reducer: the reducer stitches the cubes back together in order to obtain a single clustering solution. The overlapping regions in contiguous cubes provide a reduced search space for stitching. The objective of the stitcher is that each cluster carries a single, consistent label across all partitions. The final solution contains cluster assignments and cluster stability.

1) Mapper::DataPartitioner: Let NoN be the number of compute nodes to use in the Longhorn cluster. We need to partition the data into NoN cuboids. All cuboids are equally sized such that in each dimension X, Y, Z we require CuboidSegments = CubeRoot(NoN) segments. The size of a cuboid in the X, Y, Z dimensions is now defined. Starting at (0, 0, 0) and increasing first x, then y and finally z, we can determine the X, Y, Z bounds for each cuboid. We then build a dataflow application graph that reads the data, partitions it into cuboids and writes each partition into NoN binary staged datasets. On a single node, and using all cores available on that node, the dataflow application allocates a thread per cuboid, and data tokens are pushed into the corresponding cuboid flow of data points. Some particles may be pushed into more than one cuboid data pipeline because of the overlapping regions in contiguous cubes.

2) Mapper::Dataflow Auto-HDS: In the dataflow computational model [23], the large-scale dataset is partitioned into independent computational dataflows. As data tokens flow down an application graph, the worker nodes receive the data tokens and operate on them. The clustering and shaving iterations operate concurrently on the partitioned data flows. In the dataflow Auto-HDS, there are three passes through the data. In the first pass, the distance matrix is computed from every particle to every particle in the dataset. In the second pass, given the distance matrix and a number of neighbors, the densest neighbors of each particle are found. In the last step, HDS is applied to establish a baseline of clusters, and AutoHDS follows with multiple shaving iterations until all data is shaved off. The parameter inputs to the algorithm are as follows:

• partitionCount = number of desired partitions or data flows. If set to zero, p partitions are created, where p is the number of available processors.

• Neps = number of neighbors.

• Fshave = initial shaving parameter, e.g. 0.1 (a 10% initial shave in the HDS step).

• Rshave = shaving parameter per iteration, e.g. 0.1 (10% per iteration of Auto-HDS).

• estMaxEps = estimated number of distances to stage that would include the closest Neps neighbors and the radius of the densest points, e.g. 800.

• bytesPerChunk = 10,000,000.

• runt = minimum number of particles in a cluster before it is discarded as an outlier, e.g. 2.

For Pass 1 on a single node, the all-to-all particle distances are needed. The dataflow implementation reads all particles in parallel while partitioning them into subsets of particles across p partitions, where p corresponds to the number of cores available at runtime. Additionally, all particles are read into subsets of m particles, where m is defined by the user at runtime based on bytesPerChunk. For each of the p partitions, distances to the m-particle subsets are calculated concurrently. When the last particle arrives at the distMatrix calculation node, the flows are merged into a single flow, in which a data token corresponds to the distances from a single particle to all others. For each point and its subsequent iterations, we only need the distances to the points in its own partition. The parameter estMaxEps is set to the maximum number of points in any partition, which thus determines the maximum number of closest neighbors for each point to stage to disk. Pass 2 is relatively inconsequential to the runtime. Pass 3 is the most computationally expensive; special partitioning had to be orchestrated in order to achieve the multiple shaving iterations. A sensitivity study on the HALO dataset revealed that a shaving parameter of 10% (0.10) produces stable clusters. Once the initial HDS run is performed, the multiple shaving iterations are independent of each other. Shaving iterations are therefore equally divided into batches across all cores on the single node and run concurrently in separate threads.
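Since the shaving iterations are independent once the initial HDS run completes, they can be batched across cores. A hedged Python sketch of that scheme follows, with run_level a hypothetical stand-in for the per-level work; the actual implementation uses Java dataflow threads, and this is only an illustration of the scheduling idea.

from concurrent.futures import ThreadPoolExecutor
import os

def run_shaving_levels(levels, run_level, workers=None):
    """Run independent shaving iterations concurrently.
    levels: list of level indices; run_level: callable performing one
    shaving iteration (hypothetical stand-in); workers: thread count,
    defaulting to the number of cores on the node."""
    workers = workers or os.cpu_count()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map preserves level order; each level is independent of the others
        return list(pool.map(run_level, levels))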

3) Reducer: The job of the reducer/stitcher is to re-label clusters found across multiple nodes, thus giving us a final clustering that is identical to the non-partitioned Auto-HDS. Stitching is the second main component, next to the partitioner, that is responsible for the correctness of the algorithm. Since the dataset can have multiple partitions and multiple dimensions, stitching can be thought of as having three stages; a typical stitching component may or may not include all three stages, depending on the number of partitions and the dimensionality. The three stages of stitching are listed below, followed by a sketch of the label reconciliation they rely on:

• Stitching between two partitions,

• Stitching along one dimension,

• Stitching along multiple dimensions.
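A minimal sketch of the label reconciliation underlying all three stages, under the assumption that each partition reports a mapping from particle id to a partition-unique cluster label (e.g. a (partition, local label) pair) and that adjacent partitions share the particles in their overlap region. The names and data structures are illustrative, not the authors' code.

def stitch_two(labels_a, labels_b):
    """Merge cluster labels of two adjacent partitions: clusters that share
    a particle in the overlap region are united under one global label."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    for pid, cl_a in labels_a.items():
        cl_b = labels_b.get(pid)
        if cl_b is not None:                # particle lies in the overlap band
            union(cl_a, cl_b)

    merged = {}
    for labels in (labels_a, labels_b):
        for pid, cl in labels.items():
            merged[pid] = find(cl)          # one consistent label per cluster
    return merged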

V. EXPERIMENTAL EVALUATION

In this section we present our experimental results on both the prediction quality and the runtime performance of the proposed dataflow AutoHDS solution.

A. Software and Hardware Specifications

The dataflow Auto-HDS algorithm was implemented using Pervasive DataRush™ version 4.0.1.21 [19] and version 1.6.0_07 of the HotSpot JVM. The Hadoop Auto-HDS algorithm was implemented using Apache Hadoop release 0.20.1. Development hardware specifications: Processor: Intel Xeon CPU L5310, 2.50 GHz (2 processors, each quad core). Memory: 48 GB. System type: 64-bit Windows. Distributed environment: Longhorn, provided by the Texas Advanced Computing Center (TACC). Longhorn system resources include 2048 compute cores (Nehalem quad-core), 512 GPUs (128 NVIDIA Quadro Plex S4s, each containing 4 NVIDIA FX 5800s), 13.5 TB of distributed memory and a 210 TB global file system.

TABLE I - HALO TEST DATASET LABELS AND SIZES

label     size (bytes)       particles
data30    192,457            6,583
10k       301,200            10,196
20k       602,402            20,583
50k       1,469,922          50,854
100k      3,069,891          100,528
200k      5,979,773          200,776
2M        53,031,370         2,003,188
35M       1,026,936,722      35,318,948

B. Tera-Scale Datasets

TABLE I shows the sizes of the test datasets used. We are concerned with scalability with increasing problem size and scalability across cores. The large scale HALO dataset was generated on the Longhorn cluster using the Cube3PM N-Body simulation software. Particle decomposition across nodes for the HALO dataset is shown in Figure 1. There are overlapping sections of space across the nodes in order to facilitate stitching of results when reducing them to a single answer. Since the outputs of the algorithm are the cluster assignments per particle and the cluster stability, particles in the overlapping regions between one node and its neighboring node can be used to identify and rename clusters that span adjacent sub-cubes. For the purpose of experiments, we used a medium sized dataset containing the configuration of the points at an arbitrary time-step. The schema of the dataset is as follows: position (x, y, z) - 3 floats, x mean - 3 floats, v mean - 3 floats, v disp - float, radius calc - float, halo mass - float, imass p - float, halo mass1 - float. For the purposes of clustering, we used only the positions in the coordinate space. This testing dataset contained 35,318,948 data points of x, y, z coordinates (1 GB). For the experiments, smaller datasets were created containing all the data points whose (x, y, z) coordinates lie in the range (0, K), where the value of K determines the size of the dataset.

Table II – Data Subsets and Statistics

Label    Number of Points    Value of K
10K      10,026              35
20K      20,783              47
50K      50,854              75
100K     100,065             102

The value of K for the entire dataset is 511; the larger the value of K, the greater the size of the dataset. Table II shows the breakdown of each subset. In addition to the data points, we also have the ground truth in the form of halos. Each halo is represented by its center and radius. This information is used to evaluate the clusters generated by Auto-HDS.

C. Algorithm Predictive Performance

In this section, we evaluate the performance of AutoHDS on the astronomy dataset. Specifically, we measure the ability to discover the subspaces where the probability of finding a halo is high; this helps the astronomers quickly focus on these areas for analysis. For evaluating the clusters, we used the 10K dataset described in the previous section. We also verified the result by comparing the output with a couple of other random sub-cubes of the data, and the results were very similar. In the interest of space, we present only the output of the 10K dataset. Figure 2 plots the points in 3D space, giving an idea of how the data is distributed: there are some dense regions with a lot of empty or sparse space in between. Figure 3 plots the halos in 3D space, each represented by its center and the radius around it. This is the ground truth given by the astronomers. These halos correspond to the dense regions in the dataset which we hope to discover automatically. To evaluate the output of AutoHDS given the position and radius of the halos, we define a metric called the Lift Factor, which measures the likelihood of finding original halos inside the discovered clusters as compared with random space.

1) Lift Factor

It is defined as follows:

Lift = (vhInside / vHDSclusters) / (vhalo / totalCubeVolume)

where vhInside is the total volume of the actual halos inside the discovered clusters, vHDSclusters is the total volume of the discovered clusters, vhalo is the total volume of the actual halos, and totalCubeVolume is the volume of the entire data space.

Figure 2 – Dataset.

In other words, it is the fraction of the halo volume that lies inside the discovered clusters divided by the clusters' fraction of the entire space. When the discovered clusters are exactly equal to the halos, the lift is totalCubeVolume/vHDSclusters. On the other hand, when none of the halos intersect the discovered clusters, the lift factor becomes zero. To calculate vhInside, we need to find the volume of intersection between an arbitrarily shaped discovered cluster and a spherical halo, which is a difficult problem. We instead approximate it by finding the volume of intersection between the bounding box of the discovered cluster and the bounding box of the halo. Since the bounding boxes are reasonably tight, and the lift factor is unchanged when they exactly coincide, we consider this a reasonable approximation.
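A hedged sketch of this computation, using axis-aligned bounding boxes as described (a halo's box can be taken as its center ± radius in each dimension, per the ground-truth format). Helper names are illustrative, not the authors' code.

def box_volume(box):
    """box = (min_corner, max_corner), each a 3-tuple."""
    lo, hi = box
    return max(0.0, hi[0] - lo[0]) * max(0.0, hi[1] - lo[1]) * max(0.0, hi[2] - lo[2])

def box_intersection_volume(a, b):
    lo = [max(a[0][k], b[0][k]) for k in range(3)]
    hi = [min(a[1][k], b[1][k]) for k in range(3)]
    return box_volume((lo, hi))

def lift_factor(cluster_boxes, halo_boxes, total_cube_volume):
    """Lift = (vhInside / vHDSclusters) / (vhalo / totalCubeVolume),
    with vhInside approximated by bounding-box intersections."""
    v_clusters = sum(box_volume(c) for c in cluster_boxes)
    v_halos = sum(box_volume(h) for h in halo_boxes)
    v_inside = sum(box_intersection_volume(c, h)
                   for c in cluster_boxes for h in halo_boxes)
    return (v_inside / v_clusters) / (v_halos / total_cube_volume)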

2) Discovering Dense Halos

In this section, we evaluate the ability of AutoHDS to discover the dense halos. We have seen in the previous section that the greater the intersection with the halos, the greater the lift factor. Figures 4-7 show the clusters obtained using different parameters, plotted in 3D space. The legend on each figure shows the stability of the clusters: a label of i implies that the corresponding cluster is the ith most stable cluster.

Figure 3 – Ground Truth: Halos defined by center and radius.

Figure 4 - Clustering with parameters: Neps: 5, fShave: 0.7, rShave: 0.1.

Figure 5 - Clustering with parameters: Neps: 5, fShave: 0.9, rShave: 0.1.

Figure 6 - Clustering with parameters: Neps: 20, fShave: 0.7, rShave: 0.1.

From these figures, we can observe that the lift factor alone does not determine cluster quality. For example, Figure 5 has a high lift factor, but it has missed many halos per the ground truth. On the other hand, Figure 4 has a smaller lift factor but was able to cover all the halos. We can also see that Figure 6 is clearly worse, as it has a lower lift factor while covering the same number of halos and covering more volume. This leads us to the next metric: the volume covered. As we cover an increasing volume, we increase the chance of getting a higher lift factor. This trade-off is thus important to analyze in evaluating the clusters. Figure 8 shows the trade-off between lift and the volume covered by the clusters for different clustering results, in the spirit of precision-like and recall-like metrics. The x-axis corresponds to the volume covered and the y-axis corresponds to the lift factor. Each line represents a different clustering result obtained using different parameters. The ith point on each line corresponds to the volume covered and the corresponding lift factor of the first i most stable clusters. As we include more and more clusters, the volume covered by them increases and the lift factor is expected to decrease.

Figure 7 - Clustering with parameters: Neps: 20, fShave: 0.9, rShave: 0.1.

This is clearly shown by all clusters with i > 2. For i ≤ 2, the top stable clusters may be too tiny and miss much of the halo volume, and thus may have a low lift factor. This plot helps in comparing two different clustering results. Figure 9 shows the trade-off between the lift factor and the number of halos covered as we include more and more clusters, similar to the precision-recall framework. Given a minimum number of halos to be discovered, this plot can easily pick the clustering result with the highest lift factor, which can then be used to compare different clustering results.

D. Algorithmic Scalability Performance

1) Dataflow AutoHDS

In this section, we present the scalability results for our dataflow implementation of Auto-HDS. These experiments were performed on the TACC Longhorn cluster unless otherwise specified. The parameter NoN corresponds to the number of nodes to allocate on the Longhorn cluster for execution. The first step in the Mapper is to partition the data into sub-cubes by dividing the entire volume of the dataset equally NoN ways. The data partitioner runs on a single node and exploits parallelism by spawning multiple threads on that node; the number of threads is equal to the desired number of sub-cubes, i.e. the aforementioned parameter NoN. The Stitcher (Reducer) also runs on a single node and stitches two sub-cubes at a time. The merging network of pairwise sub-cubes (see Figure 10) recursively halves the number of sub-cubes until only a single, final cube solution is left. The dataflow computational model exploits enough parallelism via multi-threading on a single node to efficiently map and reduce large scale datasets. Application profiling on 80K particles shows that the Mapper accounts for 5.6% of total runtime and the Stitcher for 8.35%. At 800K particles, the Mapper is 0.45% and the Stitcher 0.39% of the total runtime. Mapper and Stitcher operations are thus negligible compared to the total runtime of dataflow AutoHDS.
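A minimal sketch of such a pairwise merging network, with stitch standing in for the label reconciliation sketched in Section IV; it simply halves the list of sub-cube results each round until one volume remains.

def reduce_pairwise(cubes, stitch):
    """Recursively merge sub-cube results two at a time.
    cubes: list of per-partition results; stitch: callable merging two
    adjacent results (illustrative stand-in for the Stitcher)."""
    while len(cubes) > 1:
        merged = []
        for i in range(0, len(cubes) - 1, 2):
            merged.append(stitch(cubes[i], cubes[i + 1]))
        if len(cubes) % 2 == 1:          # odd leftover carried to next round
            merged.append(cubes[-1])
        cubes = merged
    return cubes[0]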

Figure 8 - Lift versus Volume.

Figure 9 - Lift versus Halo Coverage.

Figure 10 - Dataflow AutoHDS Execution Plan for Reducer, NoN = 8.

2) Core Scalability on Single Node

Figure 11 shows the scalability of the dataflow AutoHDS application as we increase the number of cores, on 100K particles. The x-axis represents the number of cores on a single node and the y-axis represents the corresponding runtime in seconds. The dashed line is the ideal curve obtained by halving the runtime each time the number of cores is doubled.


Figure 11 – Dataflow AutoHDS on a single node: Scaling with Cores (100K particles).

Figure 11 demonstrates nearly linear scalability of the end-to-end dataflow AutoHDS application from 1 through 8 cores when clustering 100K particles on a single node. In the initial stage of AutoHDS, one must find the k-nearest neighbors within a certain radius r_eps. This distance matrix calculation is very easily parallelizable. Auto-HDS also consists of levels or iterations, whose number is given by ⌈−log(n) / log(1 − r_shave)⌉, with each iteration finding dense clusters of a specific density determined by the r_eps corresponding to that level [8]. Notice that the iterations are independent of each other, and the algorithm is therefore easily parallelized by computing shaving iterations concurrently.
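A worked check of this level-count formula (the logarithm base cancels in the ratio; natural log is assumed below), for example for the 100K subset with the 10% shaving parameter used in our runs:

import math

def num_levels(n, r_shave):
    return math.ceil(-math.log(n) / math.log(1.0 - r_shave))

print(num_levels(100528, 0.1))   # -> 110 shaving levels for ~100K particles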


3) Data Scalability Across Nodes

Next we look at scalability with dataset size. For this experiment, we varied the number of particles from 25K through 110K, 340K, 750K, 1.5M and 35M. In Figure 12, the solid black line corresponds to our dataflow AutoHDS implementation. Dhandapani [9] has reported preliminary results on 25K and up to 1.5M particles for Hadoop AutoHDS on the same astronomical HALO dataset and the Longhorn Hadoop cluster. In Figure 12, the dashed line corresponding to the Hadoop AutoHDS runtime has three segments labeled (a), (b), (c). Segment (a) corresponds to 27 Hadoop partitions running on 4 nodes, segment (b) to 125 Hadoop partitions on 16 nodes, and segment (c) to 1000 Hadoop partitions on 125 nodes. Dataflow AutoHDS maintains linear scalability with increasing dataset size while using only 8 nodes. At segment (a), dataflow AutoHDS is 10x faster than Hadoop. At segment (b), dataflow AutoHDS is 5.5x faster than Hadoop on half the number of nodes. At segment (c), dataflow AutoHDS is 1.5x faster than Hadoop using 8 nodes versus 125 Hadoop nodes. To the right of the vertical line marked (d), the runtimes for Hadoop are linearly extrapolated based on 1000 Hadoop partitions on 125 nodes. Comparing actual runs of dataflow AutoHDS on 125 nodes against the extrapolated Hadoop runtimes on 35M particles shows that dataflow is 1.72 times faster. Dataflow AutoHDS is consistently faster on fewer resources because of the performance increase from using all cores on each node. The dataflow computational model is designed to produce increasing performance gains with increasing data volumes: there is a certain amount of overhead in setting up the dataflow application graph, which becomes less significant compared to the total runtime as the data volumes increase.

Figure 12 - AutoHDS Scaling with Problem Size [9].

4) Scalability Across Nodes

Figure 13 compares runtime in seconds for 80K and 800K particles with compute-cores ranging from 216 through 1000. At 80K particles, the dataset is so small that the runtime simply reflects the dataflow framework overhead. At 800K particles, the dataset is large enough that the runtime decreases linearly with the number of nodes and compute-cores. Dataflow is capable of consuming and processing large volumes of data across all nodes while leveraging every core within each node.

VI. CONCLUSION AND FUTURE WORK

We have presented a novel dataflow implementation for discovering dense structures in massive astronomy datasets. The parallelized dataflow AutoHDS facilitates the use of a large number of cores on a single industry standard machine instead of expensive supercomputers. Experiments revealed that when data points were uniformly distributed across partitions, dataflow AutoHDS achieved linear speedup with the increase in the number of machines used. Dataflow AutoHDS also yields better performance with increasing data volumes. In comparisons against Hadoop AutoHDS, dataflow was consistently faster on fewer resources. In both implementations, only the particles in the overlapping regions need to be searched, compared and operated on; therefore the stitching runtime is negligible compared to the HDS clustering step. Hadoop facilitates mapping the large scale data and the in-memory HDS clustering algorithm. As the volumes of data increase and the data partitions become larger than can be processed by the in-memory HDS clustering algorithm, the Hadoop implementation can no longer scale. The dataflow computational model does not suffer from this limitation: on each node, dataflow partitions the particles into streaming data tokens, and queues throttle the number of data tokens in memory at any one time.


Figure 13 – Scaling with Cores across Multiple Nodes.

We are currently extending the application to larger timesteps and to tera- and peta-scale astronomical datasets, and integrating the implementation with the interactive visualization mechanisms developed for the human experts, in order to create a powerful, data-driven knowledge discovery and guidance system.

ACKNOWLEDGMENT

This research was funded by Pervasive Software of Austin, Texas and by NSF grant IIS 0713142. We would like to thank Jonathan Heffner of Pervasive Software for his invaluable contributions to the dataflow engineering of the k-nearest neighbor algorithm.

REFERENCES

[1] R. Barnes, T.R. Quinn, J.J. Lissauer, D. Richardson, "N-Body simulations of growth from 1 km planetesimals at 0.4 AU", Icarus, 203(2), pp. 626-643, 2009.

[2] T. Quinn, Cosmological Simulations, 2008. [Online]. Available: http://www.tacc.utexas.edu/research/users/features/quinn.php

[3] Particle Mesh code, 2010. [Online]. Available: http://www.cita.utoronto.ca/mediawiki/index.php/CubePM

[4] I.T. Iliev, K. Ahn, J. Koda, P.R. Shapiro and U.L. Pen, "Cosmic Structure Formation at High Redshift", Proc. of the 45th Rencontres de Moriond, La Thuile, Italy, March 2010 (eprint arXiv:1005.2502).

[5] J. Ghosh, "Scalable Clustering Methods for Data Mining", in The Handbook of Data Mining, N. Ye (Ed.), Lawrence Erlbaum Assoc., pp. 247-277, 2003.

[6] A.K. Jain and R.C. Dubes, Algorithms for Clustering Data. New Jersey: Prentice Hall, 1988.

[7] G. Gupta, A. Liu, and J. Ghosh, "Hierarchical Density Shaving: A Clustering and Visualization Framework for Large Biological Datasets", Proc. IEEE ICDM Workshop on Data Mining in Bioinformatics (DMB '06), pp. 89-93, Dec. 2006.

[8] G. Gupta, A. Liu, J. Ghosh, "Automated Hierarchical Density Shaving: A Robust Automated Clustering and Visualization Framework for Large Biological Data Sets", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 2, pp. 223-237, 2010.

[9] S. Dhandapani, "Design and Implementation of Scalable Hierarchical Density Based Clustering", MS Thesis, Dept. of CS, UT Austin, May 2010.

[10] P. Berkhin, "A survey of clustering data mining techniques", in Grouping Multidimensional Data: Recent Advances in Clustering, J. Kogan, C. Nicholas and M. Teboulle, Eds. Heidelberg: Springer, 2006, pp. 25-71.

[11] D. Wishart, "Mode analysis: A generalization of nearest neighbour which reduces chaining effects", in Proceedings of the Colloquium in Numerical Taxonomy, pp. 282-308, University of St. Andrews, Fife, Scotland, Academic Press, September 1968.

[12] M. Ester, H. Kriegel, J. Sander, X. Xu, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise", Proc. KDD 1996.

[13] M. Ankerst, M.M. Breunig, H. Kriegel, J. Sander, "OPTICS: Ordering Points To Identify the Clustering Structure", ACM SIGMOD International Conference on Management of Data, ACM Press, pp. 49-60, 1999.

[14] A. Hinneburg, D.A. Keim, "An Efficient Approach to Clustering in Large Multimedia Databases with Noise", Proc. of KDD 1998, pp. 58-65.

[15] J. Dean, S. Ghemawat, "MapReduce: Simplified data processing on large clusters", OSDI 2004, pp. 137-150.

[16] Hadoop: Open source MapReduce. [Online]. Available: http://lucene.apache.org/hadoop/

[17] G. Kahn, "The semantics of a simple language for parallel programming", Proc. of the IFIP Congress 74, North-Holland Publishing Co., 1974.

[18] E. Lee and T. Parks, "Dataflow process networks", Proceedings of the IEEE, 83(5), pp. 773-801, May 1995.

[19] Pervasive Software Inc., Pervasive DataRush, 2010. [Online]. Available: http://www.pervasivedatarush.com/downloads

[20] A. Jenkins, C.S. Frenk, S.D.M. White, J.M. Colberg, et al., "The mass function of dark matter haloes", Mon. Not. Royal Astronomical Society, v321, pp. 372-384, 2001.

[21] S. Cole, C. Lacey, "The structure of dark matter haloes in hierarchical clustering models", M.N.R.A.S., vol. 281, p. 716, 1996.

[22] J. Tinker, A. Kravtsov, A. Klypin, et al., "Toward a Halo Mass Function for Precision Cosmology: The Limits of Universality", ApJ 688, 709, 2008.

[23] S. Daruru, M. Walker, N. Marin, J. Ghosh, "Pervasive parallelism in data mining: dataflow solution to co-clustering large and sparse Netflix data", Proc. 15th ACM SIGKDD, pp. 1115-1124, 2009.


