Optimizing DivKmeans for Multicore Architectures: A Status Report
Jiahu Deng and Beth Plale
Department of Computer Science, Indiana University
March 12, 2007 CICC quarterly meeting
Acknowledgements
• David Wild
• Rajarshi Guha
• Digital Chemistry
• Work funded in part by CICC and Microsoft
Problem Statements
1. Clustering is an important method for organizing thousands of data items into meaningful groups. It is widely applied in chemistry, chemical informatics, biology, drug discovery, etc. However, for large datasets, clustering is a slow process even when it is parallelized and executed on powerful computer clusters.
2. Multi-core architectures provide large degrees of parallelism. Taking advantage of this requires re-examining traditional parallelization approaches. We apply that examination to the DivKmeans clustering method.
Multi-core Architectures
Diagram of an Intel Core 2 dual-core processor, with CPU-local Level 1 caches and a shared, on-die Level 2 cache.
Multi-core processors combine two or more independent processor cores into a single package.
Clustering Algorithms
1. Hierarchical clustering
A series of partitioning steps takes place, generating a hierarchy of clusters. Hierarchical methods fall into two families: agglomerative methods, which work from the leaves upward, and divisive methods, which decompose from the root downward.
http://www.digitalchemistry.co.uk/prod_clustering.html
Clustering Algorithms
2. Non-hierarchical clustering
Clusters form around centroids, the number of which can be specified by the user. All clusters rank equally and there is no particular relationship between them.
http://www.digitalchemistry.co.uk/prod_clustering.html
Divisive KMeans (DivKmeans) Clustering Algorithm
Kmeans Method: K is the number of clusters, which can be specified by the user. Items are initially assigned to clusters at random. K-means clustering then proceeds by repeated application of a two-step process:
1. The mean vector (centroid) for all items in each cluster is computed.
2. Items are reassigned to the cluster whose center is closest to the item.

Features: The K-means algorithm is stochastic, so the results are subject to a random component. It works very well for well-defined clusters with a clear cluster center.
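To make the two-step process concrete, here is a minimal, self-contained C sketch of repeated k-means passes over a tiny 2-D dataset. All names (kmeans_step, N, DIM, etc.) and the toy data are ours for illustration; they are not taken from Cluster 3.0 or the Digital Chemistry code.

#include <float.h>
#include <stdio.h>
#include <string.h>

#define K    2      /* number of clusters          */
#define N    6      /* number of items             */
#define DIM  2      /* dimensionality of each item */

/* One application of the two-step process:
   step 1: compute the mean vector (centroid) of each cluster,
   step 2: reassign every item to the nearest centroid.
   Returns the number of items that changed cluster. */
static int kmeans_step(double data[N][DIM],
                       double centroid[K][DIM], int clusterid[N])
{
    int counts[K] = {0};
    int i, j, changed = 0;

    /* step 1: centroids = per-cluster means of the current assignment */
    memset(centroid, 0, sizeof(double) * K * DIM);
    for (i = 0; i < N; i++) {
        counts[clusterid[i]]++;
        for (j = 0; j < DIM; j++)
            centroid[clusterid[i]][j] += data[i][j];
    }
    for (i = 0; i < K; i++)
        for (j = 0; j < DIM; j++)
            if (counts[i] > 0)
                centroid[i][j] /= counts[i];

    /* step 2: reassign each item to the closest centroid */
    for (i = 0; i < N; i++) {
        double best = DBL_MAX;
        int bestc = clusterid[i], c;
        for (c = 0; c < K; c++) {
            double d = 0.0;
            for (j = 0; j < DIM; j++) {
                double diff = data[i][j] - centroid[c][j];
                d += diff * diff;
            }
            if (d < best) { best = d; bestc = c; }
        }
        if (bestc != clusterid[i]) { clusterid[i] = bestc; changed++; }
    }
    return changed;
}

int main(void)
{
    double data[N][DIM] = { {0,0},{0,1},{1,0},{9,9},{9,8},{8,9} };
    double centroid[K][DIM];
    int clusterid[N] = { 0, 1, 0, 1, 0, 1 };  /* initial random assignment */
    int i;

    while (kmeans_step(data, centroid, clusterid) > 0)
        ;  /* iterate until no item changes cluster */
    for (i = 0; i < N; i++)
        printf("item %d -> cluster %d\n", i, clusterid[i]);
    return 0;
}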
Divisive KMeans (DivKmeans) Clustering Algorithm
Divisive KMeans: a hierarchical kmeans method. In the following discussion we consider k = 2, i.e., each clustering process accepts one cluster as input and generates two partitioned clusters as output.
[Diagram: the original cluster is split by the Kmeans method into cluster1 and cluster2; each resulting cluster is then split again by the Kmeans method, and so on down the hierarchy.]
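A minimal sketch of the divisive structure itself: split one cluster into two, then recurse on each half. The kmeans_bisect() helper below is a hypothetical stand-in (it simply cuts the list in half so the sketch runs); a real implementation would perform the 2-means split described above.

#include <stdio.h>

/* Stand-in for a real 2-means split (hypothetical helper, not the
   Cluster 3.0 API): a real version would partition items[0..n-1] by
   k-means with k = 2 and return the size of the first partition.
   Here we simply cut the list in half to keep the sketch runnable. */
static int kmeans_bisect(int *items, int n)
{
    (void)items;
    return n / 2;
}

/* Divisive KMeans skeleton: split one cluster into two, then recurse
   on each half until clusters contain a single item. */
static void divkmeans(int *items, int n, int depth)
{
    int nleft;
    if (n <= 1) {
        printf("%*sleaf cluster: item %d\n", depth * 2, "", items[0]);
        return;
    }
    nleft = kmeans_bisect(items, n);
    divkmeans(items, nleft, depth + 1);             /* cluster 1 */
    divkmeans(items + nleft, n - nleft, depth + 1); /* cluster 2 */
}

int main(void)
{
    int items[8] = { 0, 1, 2, 3, 4, 5, 6, 7 };
    divkmeans(items, 8, 0);
    return 0;
}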
Parallelization of DivKmeans Algorithm for Multicore
• Proceeding without Digital Chemistry DivKmeans
  • Once agreement was reached (Nov 2006), we could not get a version of the source code isolated that communicated through public interfaces instead of private interfaces.
• Naive parallelization of DivKmeans
  • Chose to work with Cluster 3.0 from the Open Source Clustering Software, Laboratory of DNA Information Analysis, Human Genome Center, Institute of Medical Science, University of Tokyo.
  • The C clustering library is released under the "Python License".
  • Parallelized this Kmeans code with decomposition.
• Gather performance results on naive parallelization
• Suggest multicore-sensitive parallelizations
• Early performance results of these parallelizations
Naive Parallelization of Cluster 3.0 Kmeans
• Treat each kmeans clustering process as a black box, which takes one cluster as input and generates two clusters as outputs
• When a new cluster is generated that has more than one element in it, assign it to a free processor for further clustering
• A master node maintains the status of each node (a minimal MPI sketch of this structure follows)
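A minimal MPI sketch of this master/worker structure, with clusters represented only by their sizes so the example stays self-contained; the message tags, queue layout, and the halving "split" in the worker are our assumptions, not the project's actual code. Run with at least two MPI processes (e.g. mpirun -np 4).

#include <mpi.h>
#include <stdio.h>

#define TAG_WORK   1   /* master -> worker: a cluster to split        */
#define TAG_RESULT 2   /* worker -> master: the two new cluster sizes */
#define TAG_STOP   3   /* master -> worker: shut down                 */
#define MAXQ    32768  /* maximum number of pending clusters          */

int main(int argc, char **argv)
{
    int rank, size, w;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) {
        if (rank == 0) fprintf(stderr, "need at least 2 MPI processes\n");
        MPI_Finalize();
        return 1;
    }

    if (rank == 0) {                        /* master: maintains node status */
        static int queue[MAXQ], idle[MAXQ];
        int qlen = 0, nidle = 0, busy = 0;
        queue[qlen++] = 24000;              /* the original cluster (its size) */
        for (w = 1; w < size; w++) idle[nidle++] = w;

        while (qlen > 0 || busy > 0) {
            /* hand out pending clusters to idle workers */
            while (qlen > 0 && nidle > 0) {
                int job = queue[--qlen];
                int worker = idle[--nidle];
                MPI_Send(&job, 1, MPI_INT, worker, TAG_WORK, MPI_COMM_WORLD);
                busy++;
            }
            /* collect one result: the two child cluster sizes */
            int child[2], i;
            MPI_Status st;
            MPI_Recv(child, 2, MPI_INT, MPI_ANY_SOURCE, TAG_RESULT,
                     MPI_COMM_WORLD, &st);
            busy--;
            idle[nidle++] = st.MPI_SOURCE;
            for (i = 0; i < 2; i++)
                if (child[i] > 1)           /* needs further clustering */
                    queue[qlen++] = child[i];
        }
        for (w = 1; w < size; w++) {
            int stop = 0;
            MPI_Send(&stop, 1, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
        }
    } else {                                /* worker: split one cluster per request */
        for (;;) {
            int n, child[2];
            MPI_Status st;
            MPI_Recv(&n, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            child[0] = n / 2;               /* stand-in for a real 2-means split */
            child[1] = n - n / 2;
            MPI_Send(child, 2, MPI_INT, 0, TAG_RESULT, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}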
Naive Parallelization of Cluster 3.0 Kmeans
[Diagram: a master node dispatches clusters to working nodes. The original cluster is split first; cluster1 is assigned to Node 2 and cluster2 to Node 3, and as nodes become free, newly generated clusters are reassigned to them (e.g., to Node 1, then to Node 2).]
Quality of Cluster 3.0 Kmeans Naive Parallelization
Pros:
  We do not need to worry about the internal details of the DivKmeans method, and can use the Kmeans functions of other libraries directly.
Cons:
  What are the speedup and scalability?
  What is the parallelization overhead?
Profiling Naive Parallelization
• Platform:
  • A Linux cluster; each node has two 2 GHz AMD Opteron(TM) CPUs, and each CPU has dual cores
  • Linux RHEL WS release 4
• Algorithm: Cluster 3.0, parallelized and made divisive
• Dataset: PubChem datasets of 24,000 and 96,000 elements
• Additional Libraries:
  • LAM 7.1.2 / MPI
Speedup: naive parallelization of Cluster 3.0
[Chart: speedup of DivKmeans (item size: 24,000); speedup (0 to 4) plotted against number of nodes (0 to 35).]

Speedup is defined as Sp = T1 / Tp, where p is the number of processors, T1 is the execution time of the sequential algorithm, and Tp is the execution time of the parallel algorithm with p processors.

Conclusion: maximum benefit reached at 17 nodes; significant decrease in speedup after only 5 nodes.
CPU Utilization:
Conclusion: Node 1 maxes out at 100% utilization. A likely limiter to overall performance.
[Chart: CPU utilization (%) of DivKmeans (item size: 96,000) versus running time (seconds), for Node0 through Node7.]
Memory Utilization
Conclusion: nothing outstanding
[Chart: memory utilization (%) of DivKmeans (item size: 96,000) versus running time (seconds), for Node0 through Node7.]
Process Behaviors
Process behaviors were examined with XMPI, a graphical user interface for running, debugging, and visualizing MPI programs.
Conclusions on Naive Parallelization from Profiling
• Poor scalability beyond 5 nodes.
• Performance likely inhibited by 100% utilization of Node 1.

Proposed Solution
• Multi-core solution: use multiple threads on each node, with each thread running on one core.
• How this solution explicitly addresses the two problems identified above is discussed next.
Proposed Solution
Instead of treating each kmeans clustering process as a black box, each clustering process is decomposed into several threads.
[Diagram: the original cluster undergoes some pre-processing, is processed by threads 1-4 in parallel, their results are merged, and after other processing the outputs are cluster1 and cluster2.]
Step 1: identify parts to decompose (parallelize)
Calling sequence of the kmeans clustering process: DivKmeans calls kmeans(), whose main while loop repeatedly runs a "Finding Centroids" step and a "Calculating Distance" step (each a do loop).

Profiling inside Kmeans shows:
• About 93% of total execution time is spent in the kmeans() function.
• Inside kmeans(), almost all time is spent in "Finding Centroids" and "Calculating Distance".
• Hence, parallelize these two.
Simplified code of "Finding Centroids"
// sum up all elements
for (k = 0; k < nrows; k++) {
    i = clusterid[k];
    for (j = 0; j < ncolumns; j++)
        cdata[i][j] += data[k][j];
}
// calculate mean values
for (i = 0; i < nclusters; i++) {
    for (j = 0; j < ncolumns; j++)
        cdata[i][j] /= total_number[i][j];
}
Parallelized Code of "Finding Centroids"
Before parallelization:

// sum up all elements
for (k = 0; k < nrows; k++) {
    i = clusterid[k];
    for (j = 0; j < ncolumns; j++)
        cdata[i][j] += data[k][j];
}
// calculate mean values
…

After parallelization:

// sum up elements assigned to current thread
for (k = nrows * index / n_thread;
     k < nrows * (index + 1) / n_thread; k++) {
    i = clusterid[k];
    for (j = 0; j < ncolumns; j++) {
        if (mask[k][j] != 0) {
            t_data[i][j] += data[k][j];
            t_mask[i][j]++;
        }
    }
}
// merge data
…
// calculate mean values
…
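For context, a minimal self-contained pthreads sketch of this decomposition: each thread sums its slice of the rows into private buffers, then the main thread merges the partial sums and computes the means. The names (centroid_worker, t_count, N_THREAD) and the tiny dataset are ours for illustration; this sketches the approach rather than reproducing the project's code.

#include <pthread.h>
#include <stdio.h>

#define NROWS     8
#define NCOLUMNS  2
#define NCLUSTERS 2
#define N_THREAD  4

static double data[NROWS][NCOLUMNS] = {
    {1,1},{2,2},{3,3},{4,4},{9,9},{8,8},{7,7},{6,6}
};
static int clusterid[NROWS] = { 0,0,0,0,1,1,1,1 };

/* per-thread partial sums and per-cluster item counts (no sharing, no locks) */
static double t_data[N_THREAD][NCLUSTERS][NCOLUMNS];
static int    t_count[N_THREAD][NCLUSTERS];

/* each thread sums only the rows in its slice of [0, NROWS) */
static void *centroid_worker(void *arg)
{
    int index = (int)(long)arg;
    int k, j, i;
    for (k = NROWS * index / N_THREAD;
         k < NROWS * (index + 1) / N_THREAD; k++) {
        i = clusterid[k];
        t_count[index][i]++;
        for (j = 0; j < NCOLUMNS; j++)
            t_data[index][i][j] += data[k][j];
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[N_THREAD];
    double cdata[NCLUSTERS][NCOLUMNS] = {{0}};
    int counts[NCLUSTERS] = {0};
    int t, i, j;

    for (t = 0; t < N_THREAD; t++)
        pthread_create(&tid[t], NULL, centroid_worker, (void *)(long)t);
    for (t = 0; t < N_THREAD; t++)
        pthread_join(tid[t], NULL);

    /* merge data: combine the per-thread partial sums */
    for (t = 0; t < N_THREAD; t++)
        for (i = 0; i < NCLUSTERS; i++) {
            counts[i] += t_count[t][i];
            for (j = 0; j < NCOLUMNS; j++)
                cdata[i][j] += t_data[t][i][j];
        }

    /* calculate mean values */
    for (i = 0; i < NCLUSTERS; i++)
        for (j = 0; j < NCOLUMNS; j++)
            if (counts[i] > 0)
                cdata[i][j] /= counts[i];

    for (i = 0; i < NCLUSTERS; i++)
        printf("centroid %d = (%.2f, %.2f)\n", i, cdata[i][0], cdata[i][1]);
    return 0;
}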
Mapping of Algorithms into Multi-core Architectures
[Diagram: the same decomposition, with each thread mapped to one core: the original cluster undergoes some pre-processing, is processed on cores 1-4 in parallel, the results are merged, and after other processing the outputs are cluster1 and cluster2. Each thread uses one core.]
Mapping of Algorithms into Multi-core Architectures
• How to further benefit from multi-core architectures?
• Data locality
• Cache-aware algorithms
• Architecture-aware algorithms
Mapping of Algorithms into Multi-core Architectures
Example 1: AMD Opteron
There is no cache sharing between the two cores in this architecture.
Diagram of an AMD Opteron dual-core processor
Mapping of Algorithms into Multi-core Architectures
Example 2: Intel Core 2
Improve cache re-use: if two threads share common data, assign them to cores on the same die.
Diagram of an Intel Core 2 dual-core processor
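On Linux, this kind of placement can be expressed with CPU affinity. A minimal sketch using the GNU extension pthread_setaffinity_np, assuming for illustration that logical CPUs 0 and 1 are the two cores of one die; the actual core numbering is machine-dependent.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *worker(void *arg)
{
    /* both threads operate on the same shared data, so keeping them
       on one die lets them share the on-die L2 cache */
    printf("thread %ld running\n", (long)arg);
    return NULL;
}

int main(void)
{
    pthread_t tid[2];
    cpu_set_t set;
    long t;

    for (t = 0; t < 2; t++) {
        pthread_create(&tid[t], NULL, worker, (void *)t);

        /* pin thread t to logical CPU t; here we assume CPUs 0 and 1
           are the two cores on the same die (machine-dependent) */
        CPU_ZERO(&set);
        CPU_SET((int)t, &set);
        pthread_setaffinity_np(tid[t], sizeof(cpu_set_t), &set);
    }
    for (t = 0; t < 2; t++)
        pthread_join(tid[t], NULL);
    return 0;
}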
Mapping of Algorithms into Multi-core Architectures
Example 3: Dell PowerEdge 6950, a NUMA (Non-Uniform Memory Access) system
Improve data locality: keep data in local memory so that each thread uses local memory instead of remote memory as much as possible.
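On Linux, libnuma offers explicit local allocation. A minimal sketch, assuming libnuma is installed (link with -lnuma); the buffer size and usage are illustrative only.

#include <numa.h>
#include <stdio.h>

int main(void)
{
    size_t nbytes = 1 << 20;   /* 1 MB working buffer per thread (example) */
    double *buf;

    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not supported on this system\n");
        return 1;
    }

    /* allocate on the NUMA node the calling thread is running on,
       so its accesses stay in local memory instead of remote memory */
    buf = numa_alloc_local(nbytes);
    if (buf == NULL) {
        fprintf(stderr, "numa_alloc_local failed\n");
        return 1;
    }

    /* ... each thread would fill and process its own local buffer ... */
    buf[0] = 1.0;
    printf("allocated %zu bytes on the local NUMA node\n", nbytes);

    numa_free(buf, nbytes);
    return 0;
}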
Early Results on Multi-core Platform
Experiment Environment
  Platform: 3 nodes in a Linux cluster; each node has two 2 GHz AMD Opteron(TM) CPUs, and each CPU has dual cores
  Linux RHEL WS release 4
  Libraries: LAM 7.1.2 / MPI; Pthreads for Linux RHEL WS release 4

Degree of Parallelization
  Only the code of "Finding Centroids" is parallelized for this early study.
  4 threads are used for "Finding Centroids" on each node, and each thread runs on one core.
Results of Parallelizing “Finding Centroids”
[Chart: total execution time (seconds, 0 to 3,000) of DivKmeans versus data size (number of items, 0 to 100,000), before and after parallelization.]
Conclusion: Modest improvement. DivKmeans runs about 12% faster after parallelization.
Parallelizing "Finding Centroids" with Different Numbers of Threads per Node

[Chart: total execution time (seconds, roughly 320 to 380) of DivKmeans (item size: 12,000) versus number of threads used per node (0 to 80).]

Total number of cores per node: 4
Conclusion: there is little benefit from using more threads than the number of cores.
Optimizations for Next Step
• Reduce the overhead of managing threads (e.g. use a thread pool instead of creating new threads for each call to "Finding Centroids"); see the sketch after this list.
• Parallelize the "Calculating Distance" part, which consumes twice the time of "Finding Centroids".
• More cores (4, 8, 32…) on a single computer are on the way. We should see further performance gains with more cores if the program scales well.
• The platform we used (AMD Opteron(TM)) doesn't support cache sharing between two cores on the same die. However, L2 (and even L1) cache sharing among cores is becoming available.
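A minimal self-contained pthreads thread-pool sketch of the first idea above: a fixed set of persistent worker threads pulls jobs from a shared queue, so no threads are created or destroyed per call. All names (job_t, pool_submit, find_centroids_slice) are ours and purely illustrative.

#include <pthread.h>
#include <stdio.h>

#define N_WORKER 4
#define MAX_JOBS 64

typedef struct {
    void (*fn)(int);   /* work function */
    int arg;           /* its argument  */
} job_t;

static job_t queue[MAX_JOBS];
static int qhead = 0, qtail = 0, qcount = 0;
static int shutting_down = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  notify = PTHREAD_COND_INITIALIZER;

/* persistent worker: waits for jobs instead of being created per call */
static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        job_t job;
        pthread_mutex_lock(&lock);
        while (qcount == 0 && !shutting_down)
            pthread_cond_wait(&notify, &lock);
        if (qcount == 0 && shutting_down) {
            pthread_mutex_unlock(&lock);
            return NULL;
        }
        job = queue[qhead];
        qhead = (qhead + 1) % MAX_JOBS;
        qcount--;
        pthread_mutex_unlock(&lock);
        job.fn(job.arg);               /* run the job outside the lock */
    }
}

static void pool_submit(void (*fn)(int), int arg)
{
    pthread_mutex_lock(&lock);
    if (qcount < MAX_JOBS) {           /* drop silently if full (sketch only) */
        queue[qtail].fn = fn;
        queue[qtail].arg = arg;
        qtail = (qtail + 1) % MAX_JOBS;
        qcount++;
        pthread_cond_signal(&notify);
    }
    pthread_mutex_unlock(&lock);
}

/* example job: stands in for one "Finding Centroids" slice */
static void find_centroids_slice(int index)
{
    printf("processing slice %d\n", index);
}

int main(void)
{
    pthread_t tid[N_WORKER];
    int i;
    for (i = 0; i < N_WORKER; i++)
        pthread_create(&tid[i], NULL, worker, NULL);

    for (i = 0; i < 8; i++)            /* submit jobs; no per-call thread creation */
        pool_submit(find_centroids_slice, i);

    /* shut down: wake all workers and let them drain the queue */
    pthread_mutex_lock(&lock);
    shutting_down = 1;
    pthread_cond_broadcast(&notify);
    pthread_mutex_unlock(&lock);
    for (i = 0; i < N_WORKER; i++)
        pthread_join(tid[i], NULL);
    return 0;
}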
The Multi-core Project in the Distributed Data Everywhere (DDE) Lab and the Extreme Lab
• Multi-core processors represent a major evolution in today's computing technology.
• We are exploring the programming styles and challenges on multi-core platforms, and potential applications in both academic and commercial areas, including chemical informatics, XML parsing, data streaming, Web Services, etc.
References
1. Open Source Clustering Software, Laboratory of DNA Information Analysis, Human Genome Center, Institute of Medical Science, University of Tokyo. http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/
2. http://www.nsc.liu.se/rd/enacts/Smith/img1.htm
3. http://www.mhpcc.edu/training/workshop/parallel_intro/
4. http://www.digitalchemistry.co.uk/prod_clustering.html
5. David Morse, Dell Inc. Performance Benchmarking on the Dell PowerEdge(TM) 6950.