Johann M. Kraus and Hans A. Kestler
AG Bioinformatics and Systems Biology, Institute of Neural Information Processing
University of Ulm
Multi-core Parallelization in Clojure - a Case Study
29.06.2009
Outline
1. Concepts of parallel programming
2. Short introduction to Clojure
3. Multi-core parallel K-means - the case study
4. Analysis and Results
5. Summary
Parallel Programming
Definition: Parallel programming is a form of programming in which many calculations are performed simultaneously.
• Physical constraints prevent further frequency scaling of processors
• This has led to an increasing interest in parallel hardware and parallel programming
• Multi-core hardware is standard on desktop computers
• Parallel software can use this hardware to full capacity
• Large problems are divided into smaller ones, and the sub-problems are solved simultaneously
• Speedup S is limited by the fraction of parallelizable code P
• Amdahl's law: S = 1 / ((1 - P) + P/N), where N is the number of processors
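Amdahl's law can be checked numerically; a minimal Java sketch (class and method names are illustrative, not part of the talk):

```java
public class Amdahl {
    // Speedup predicted by Amdahl's law for parallelizable fraction p
    // and n processors: S = 1 / ((1 - p) + p / n)
    static double speedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    public static void main(String[] args) {
        // Even with unlimited cores, p = 0.95 caps the speedup at 1/0.05 = 20
        System.out.printf("p=0.95, n=8:     %.2f%n", speedup(0.95, 8));
        System.out.printf("p=0.95, n=65536: %.2f%n", speedup(0.95, 65536));
    }
}
```

Note how the serial fraction dominates: going from 8 to 65536 cores only raises the speedup from about 5.9 to just under 20.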
Amdahl's law
Figure: speedup vs. number of processors (1 to 65536) for parallelizable fractions 0.95, 0.90, 0.75, and 0.50; even at P = 0.95 the speedup saturates at 20.
Concepts of Parallel Programming
Explicit vs. implicit parallelization
• Explicitly define communication and synchronization details for each task:
• MPI
• Java Threads
• Functional programming allows implicit parallelization:
• Parallel processing of functions
• Functions are free of side-effects
• Data is immutable
Distributed vs. local hardware
Figure: a master-slave setup (the master sends data to slaves 0-4 and receives their results) vs. a shared-memory setup (CPUs 0-4 read and write a common memory).
• Master - Slave parallelization (e.g. Message Passing Interface)
• Shared memory parallelization (e.g. Open Multi-Processing)
Thread programming
Figure: thread life cycle: new (start) → runnable (schedule) → running (end) → terminated; running moves to waiting on block and back to runnable on awake.
• Threads are refinements of a process that share the same memory and can be processed separately and simultaneously
• Available in many languages, e.g. PThreads (C), Java Threads (Java), OpenMP Threads (C, Fortran)
• Execution of threads is handled by a scheduler that manages the available processing time
• Communication between threads is faster than communication between processes
• Invoking threads is also faster than fork/join processes
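The life cycle above can be walked through with plain Java threads; a minimal sketch (names are illustrative):

```java
public class ThreadDemo {
    // Some CPU-bound work for a thread to run
    static long work(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) sum += i;
        return sum;
    }

    public static void main(String[] args) throws InterruptedException {
        Runnable task = new Runnable() {
            public void run() { // "running" state
                System.out.println(Thread.currentThread().getName()
                                   + ": " + work(1000000));
            }
        };
        Thread t1 = new Thread(task, "worker-1"); // "new" state
        Thread t2 = new Thread(task, "worker-2");
        t1.start(); // "runnable": the scheduler decides when each thread runs
        t2.start();
        t1.join();  // main thread waits until both workers have terminated
        t2.join();
    }
}
```

Both workers share the main thread's heap; only the scheduler decides the interleaving of their output.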
Concurrency control via locking and synchronizing
• Concurrency control ensures that threads can access shared memory without violating data integrity
• The most popular approach to concurrency is locking and synchronizing
• Problems might occur when using too many locks, too few locks, wrong locks, or locks in the wrong order
• Using locks can be fatally error-prone, e.g. deadlocks
public class Counter {
    private int value = 0;
    public synchronized void incr() {
        value = value + 1;
    }
}

Counter counter = new Counter();
counter.incr();
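The synchronized counter can be exercised from several threads; a self-contained sketch (the wrapper class and the added get method are illustrative):

```java
public class CounterDemo {
    // Same idea as the Counter on the slide: synchronized guards the shared field
    public static class Counter {
        private int value = 0;
        public synchronized void incr() { value = value + 1; }
        public synchronized int get() { return value; }
    }

    public static void main(String[] args) throws InterruptedException {
        final Counter counter = new Counter();
        Thread[] workers = new Thread[4];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(new Runnable() {
                public void run() {
                    for (int j = 0; j < 10000; j++) counter.incr();
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) t.join();
        // Without synchronized, lost updates would make this total nondeterministic
        System.out.println(counter.get()); // 40000
    }
}
```

Removing the synchronized keyword turns `value = value + 1` into an unprotected read-modify-write and the final total becomes unpredictable.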
• Transactional memory offers a flexible alternative to lock-based concurrency control
• Functionality is analogous to concurrency control for simultaneous access in database management systems
• Transactions ensure properties:
• Atomicity: Either all changes of a transaction occur or none do
• Consistency: Only valid changes are committed
• Isolation: No transaction sees the effect of other transactions
• Durability: Changes from transactions will be persistent
Concurrency control via transactional memory
Figure: sequence diagram over time: two transactions each get the shared data and send back modified data; a commit succeeds only while the transaction's view of the data is still consistent, otherwise the transaction re-reads and retries.
• Software transactional memory maps transactional memory to concurrency control in parallel programming
Clojure
• Functional programming language hosted on the JVM
• Extends the code-as-data paradigm to maps and vectors
• Based on immutable data structures
• Provides built-in concurrency support via software transactional memory
• Completely symbiotic with Java, e.g. easy access to Java libraries
• Platform independent
• Java interaction
• Add type hints to speed up code
(defn da+ [#^doubles as #^doubles bs]
  (amap as i ret
        (+ (aget as i) (aget bs i))))
• Dynamic typing and multi-methods
• An object is defined as the sum of what it can do (methods), rather than the sum of what it is (type hierarchy)
(import '(cern.jet.random.sampling RandomSamplingAssistant))

(defn sample [n k]
  (seq (. RandomSamplingAssistant
          (sampleArray k (int-array (range n))))))
Transactional references and STM
• Transactional references ensure safe, coordinated, synchronous changes to mutable storage locations
• Are bound to a single storage location for their lifetime
• Only allow mutation of that location to occur within transactions
• Available operations are ref-set, alter, and commute
• No explicit locking is required

(def counter (ref 0))
(dosync (alter counter inc))
Agents
• Agents allow independent asynchronous change of mutable locations
• Are bound to a single storage location for their lifetime
• Only allow mutation of that location to a new state as the result of an action
• Actions are functions that are asynchronously applied to the state of an Agent
• The return value of an action becomes the new state of the Agent
• Agents are integrated with the STM

(def counter (agent 0))
(send counter inc)
Cluster analysis
• Given a data set X, compute a partition of X into k disjoint clusters C_i such that:
(1) ⋃_{i=1}^{k} C_i = X
(2) C_i ≠ ∅ and C_i ∩ C_j = ∅ for i ≠ j
• How many clusters are in the data set?
Figure: example data set partitioned into 3 clusters vs. 9 clusters.
Cluster algorithms
• For all possible partitions, evaluate the objective function f and search for the optimum.
• The cardinality of the set of all possible partitions is given by the Stirling numbers of the second kind:
S(N, k) = (1/k!) Σ_{i=0}^{k} (−1)^{k−i} (k choose i) i^N
• Cluster algorithms provide a heuristic for this search:
• Partitional clustering (K-means, Neuralgas, SOM, Fuzzy C-means, ...)
• Hierarchical clustering (Divisive/agglomerative, Complete linkage, ...)
• Graph-based clustering (Spectral clustering, NMF, Affinity propagation, ...)
• Model-based clustering, Biclustering, Semi-supervised clustering
Figure: Stirling numbers of the second kind: runtime of an exhaustive search over all partitions (nanoseconds) grows explosively with the number of clusters (up to 35) and the number of data points (up to 30).
K-means algorithm
Function KMeans
  Input:  X = {x_1, ..., x_n}  (data to be clustered)
          k                    (number of clusters)
  Output: C = {c_1, ..., c_k}  (cluster centroids)
          m: X -> C            (cluster assignments)

  Initialize C (e.g. random selection from X)
  While C has changed
    For each x_i in X
      m(x_i) = argmin_j distance(x_i, c_j)
    End
    For each c_j in C
      c_j = centroid({x_i | m(x_i) = j})
    End
  End
End
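The pseudocode translates almost line for line into a sequential Java sketch (deterministic initialization with the first k points is an assumption for illustration; the talk's McKmeans is the actual parallel Clojure implementation):

```java
import java.util.Arrays;

public class KMeans {
    // data[n][d]: points, k: number of clusters. Returns the assignments m.
    static int[] cluster(double[][] data, int k, int maxIter) {
        int n = data.length, d = data[0].length;
        double[][] c = new double[k][];
        for (int j = 0; j < k; j++) c[j] = data[j].clone(); // init: first k points
        int[] m = new int[n];
        for (int iter = 0; iter < maxIter; iter++) {
            boolean changed = false;
            // assignment step: m(x_i) = argmin_j distance(x_i, c_j)
            for (int i = 0; i < n; i++) {
                int best = 0;
                double bestDist = dist(data[i], c[0]);
                for (int j = 1; j < k; j++) {
                    double dj = dist(data[i], c[j]);
                    if (dj < bestDist) { bestDist = dj; best = j; }
                }
                if (m[i] != best) { m[i] = best; changed = true; }
            }
            if (!changed) break; // "While C has changed"
            // update step: c_j = centroid({x_i | m(x_i) = j})
            double[][] sum = new double[k][d];
            int[] count = new int[k];
            for (int i = 0; i < n; i++) {
                count[m[i]]++;
                for (int t = 0; t < d; t++) sum[m[i]][t] += data[i][t];
            }
            for (int j = 0; j < k; j++)
                if (count[j] > 0)
                    for (int t = 0; t < d; t++) c[j][t] = sum[j][t] / count[j];
        }
        return m;
    }

    // Squared Euclidean distance suffices for the argmin
    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int t = 0; t < a.length; t++) s += (a[t] - b[t]) * (a[t] - b[t]);
        return s;
    }

    public static void main(String[] args) {
        double[][] data = {{0, 0}, {0, 1}, {10, 10}, {10, 11}};
        System.out.println(Arrays.toString(cluster(data, 2, 100))); // [0, 0, 1, 1]
    }
}
```

The assignment loop is where the parallelization pays off: each point's nearest-centroid search is independent of every other point's.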
Cluster Validation
• MCA-index: mean proportion of samples being consistent over different clusterings
MCA = (1/n) max_σ Σ_{i=1}^{k} |A_i ∩ B_σ(i)|, where σ ranges over the permutations of the k cluster labels
• Evaluation requires repeated runs of clustering, e.g.:
• Resampled data sets
• Different parameters
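For small k, the MCA index can be computed by brute force over all label permutations; a Java sketch (names are illustrative):

```java
public class MCA {
    // MCA index between two label vectors a and b over k clusters:
    // (1/n) * max over permutations sigma of sum_i |A_i ∩ B_sigma(i)|
    static double mca(int[] a, int[] b, int k) {
        int n = a.length;
        int[][] overlap = new int[k][k]; // overlap[i][j] = |A_i ∩ B_j|
        for (int t = 0; t < n; t++) overlap[a[t]][b[t]]++;
        return best(overlap, new int[k], new boolean[k], 0, k) / (double) n;
    }

    // Recursively try every permutation of the k labels, keeping the best match
    static int best(int[][] ov, int[] perm, boolean[] used, int i, int k) {
        if (i == k) {
            int s = 0;
            for (int j = 0; j < k; j++) s += ov[j][perm[j]];
            return s;
        }
        int max = 0;
        for (int j = 0; j < k; j++) {
            if (!used[j]) {
                used[j] = true; perm[i] = j;
                max = Math.max(max, best(ov, perm, used, i + 1, k));
                used[j] = false;
            }
        }
        return max;
    }

    public static void main(String[] args) {
        int[] a = {0, 0, 1, 1, 1};
        int[] b = {1, 1, 0, 0, 1}; // same partition up to a label swap, one disagreement
        System.out.println(mca(a, b, 2)); // 0.8
    }
}
```

The permutation maximum makes the index invariant to cluster relabeling, which is why two K-means runs with swapped labels still score 1.0.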
Figure: mean MCA index (0-1) vs. number of clusters (0-50), four panels.
Estimation of the expected value of a validation index
• Random label: randomly assign each item to one of the k clusters
• Random partition: choose a random partition
• Random prototype: assign each item to its nearest prototype
• Mean value from 100 runs
Multi-core K-means with Clojure
• Split the data set into smaller pieces that are handled by agents
• Each cluster is represented by an agent
• Add a commutative list of cluster members within a transactional reference to accelerate the centroid update step
Figure: n data agents hold the partitioned data; each data agent reads the k cluster agents (centroids) and writes cluster memberships into the member refs (0 to k), with reads and writes across agents happening simultaneously.
(defn assignment []
  (map #(send % update-dataagent) DataAgents))

(defn update-dataagent [datapoints]
  (map update-datapoint datapoints))

(defn update-datapoint [datapoint]
  (let [newass (nearest-cluster datapoint)]
    (dosync (commute (nth MemberRefs newass)
                     conj (:data datapoint)))
    (assoc datapoint :assignment newass)))
read: (nearest-cluster)
write: (commute) (assoc)
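The data-agent split can be approximated on plain Java threads with a fixed pool, where each task owns a disjoint slice of the data; a sketch of the assignment step only (names are illustrative, not the talk's implementation):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelAssign {
    // Each task owns a disjoint slice of the data (like one data agent),
    // so the assignment writes into m need no locking.
    static int[] assign(final double[][] data, final double[][] centroids,
                        int nAgents) throws InterruptedException {
        final int[] m = new int[data.length];
        ExecutorService pool = Executors.newFixedThreadPool(nAgents);
        int chunk = (data.length + nAgents - 1) / nAgents;
        for (int a = 0; a < nAgents; a++) {
            final int lo = a * chunk;
            final int hi = Math.min(lo + chunk, data.length);
            pool.execute(new Runnable() {
                public void run() {
                    for (int i = lo; i < hi; i++)
                        m[i] = nearestCluster(data[i], centroids);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return m;
    }

    // Index of the centroid closest to x (squared Euclidean distance)
    static int nearestCluster(double[] x, double[][] c) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int j = 0; j < c.length; j++) {
            double s = 0;
            for (int t = 0; t < x.length; t++)
                s += (x[t] - c[j][t]) * (x[t] - c[j][t]);
            if (s < bestDist) { bestDist = s; best = j; }
        }
        return best;
    }

    public static void main(String[] args) throws InterruptedException {
        double[][] data = {{0}, {1}, {9}, {10}};
        double[][] centroids = {{0.5}, {9.5}};
        System.out.println(java.util.Arrays.toString(assign(data, centroids, 2))); // [0, 0, 1, 1]
    }
}
```

What this cannot express as cheaply is the shared member lists: in the Clojure version, commute lets all agents append to the same ref concurrently without a lock ordering, which in Java would require explicit synchronization or concurrent collections.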
Benchmark results
• Each data point is sampled from N(0,1)
• Summary for 10 runs of K-means
Figure: runtimes of ParaKMeans, K-means (R), and McKmeans for 10,000 cases with 100 dimensions and 20 clusters (seconds, left), and of K-means (R) and McKmeans for 1,000,000 cases with 200 dimensions and 20 clusters (minutes, right).
Large data sets (artificial):
Figure: McKmeans runtime (seconds) on 100,000 × 500 data with 20 clusters, for 1, 4, and 8 computer cores (left) and for 4 to 10 data agents (right).
• Number of computer cores used • Number of data agents used
• Data sampled from a multi-variate normal distribution
• 100,000 samples, 200/500 dimensions, 10/20 clusters
Figure: runtimes (seconds) of K-means (R) and McKmeans for the dimensions/clusters combinations 200/10, 200/20, 500/10, and 500/20.
Large data sets with cluster structure
• Measured with the MCA index
• Red bars indicate the random-prototype baseline
Figure: MCA index (0-1) for McKmeans and K-means (R) on 100,000 × 200 and 100,000 × 500 data with 10 and 20 clusters.
Accuracy compared to the known grouping of data
• Microarray data (Radiation-induced changes in human gene expression)
• 22277 samples (genes) and 465 features (profiles)
Figure: runtimes (seconds) of K-means (R) and McKmeans for 2, 5, 10, and 20 clusters.
Real world data set
Smirnov D, Morley M, Shin E, Spielman R, Cheung V: Genetic analysis of radiation-induced changes in human gene expression. Nature 2009, 459:587–591
Application to Cluster Number Estimation
• Repeated clustering with different subsets of data
• Repeated for different number of clusters k
• Most stable clustering is produced for the ‘real’ cluster number
Figure: MCA index (0-1) vs. number of clusters (2-7).
• Jackknife resampling
• Evaluation with MCA index
• Data set: 100,000 samples, 100 features, 3 clusters
• 10 runs per cluster number
• 49.26 minutes on dual-quad core 3.2 GHz
Java GUI
(import '(javax.swing JFrame JLabel JTextField JButton)
        '(java.awt.event ActionListener)
        '(java.awt GridLayout))

(let [frame (new JFrame "Hello, World!")
      hello-button (new JButton "Say hello")
      hello-label (new JLabel "")]
  (. hello-button
     (addActionListener
      (proxy [ActionListener] []
        (actionPerformed [evt]
          (. hello-label (setText "Hello, World!"))))))
  (doto frame
    (. setLayout (new GridLayout 1 1 3 3))
    (. add hello-button)
    (. add hello-label)
    (. setSize 300 80)
    (. setVisible true)))
Summary
• Writing parallel programs usually requires careful software design and deep knowledge of thread-safe programming
• Concurrency control via transactional memory circumvents problems of lock-based concurrency strategies
• Immutable data structures play a key role in software transactional memory
• Clojure combines Lisp, Java and a powerful STM system
• This enables fast parallelization of algorithms, even for rapid prototyping
• Our simulations show a good performance of the parallelized code
Thank you for your attention.
Statistical computing library
• http://wiki.github.com/liebke/incanter
• Clojure-based statistical computing
• R-like semantics
• COLT library for numerical computation
• JFreeChart library for graphics