Multithreaded Graph Coloring Algorithms for
Scientific Computing on Many‐core Architectures
Assefaw Gebremedhin [email protected]
Purdue University
ICCS Workshop on Manycore and Accelerator‐based High‐Performance Scientific Computing
Berkeley, January 28, 2011
CSCAPES
www.cscapes.org
Coloring and its applications
• Graph coloring is an abstraction for partitioning a set of binary‐related objects into few “independent sets”
• Coloring contributed to the growth of much of Graph Theory
• Our work on coloring is motivated by its practical applications:
– Concurrency discovery in parallel (scientific) computing
– Sparse derivative matrix computation
– Scheduling
– Frequency Assignment
– Facility Location
– Register Allocation, etc
Graph coloring in concurrency discovery
[Figure: tasks S1 to S6, shown twice, the second time arranged along a Time axis with steps T1, T2, T3]
• Adaptive mesh refinement
• Iterative methods for sparse linear systems
• Full sparse tiling
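As a rough illustration of concurrency discovery, the sketch below colors a conflict graph over six tasks and runs tasks of equal color in the same time step. The conflict edges are hypothetical (the edges in the slide's figure are not recoverable from the text), and `schedule_by_coloring` is an illustrative name, not a function from ColPack or any other library.

```python
from collections import defaultdict

def schedule_by_coloring(tasks, conflicts):
    """Greedy coloring of the conflict graph; each color class is one time step."""
    adj = defaultdict(set)
    for u, v in conflicts:
        adj[u].add(v)
        adj[v].add(u)
    color = {}
    for t in tasks:
        # Colors already taken by conflicting tasks are forbidden for t.
        forbidden = {color[n] for n in adj[t] if n in color}
        c = 0
        while c in forbidden:
            c += 1
        color[t] = c
    steps = defaultdict(list)
    for t in tasks:
        steps[color[t]].append(t)
    return [steps[c] for c in sorted(steps)]

tasks = ["S1", "S2", "S3", "S4", "S5", "S6"]
# Hypothetical conflicts: pairs of tasks that must not run concurrently.
conflicts = [("S1", "S3"), ("S1", "S4"), ("S2", "S3"), ("S2", "S5"),
             ("S3", "S4"), ("S4", "S5"), ("S5", "S6"), ("S6", "S1")]
schedule = schedule_by_coloring(tasks, conflicts)
for step, group in enumerate(schedule, start=1):
    print(f"T{step}: {group}")
```

Tasks within one time step share no conflict edge, so they can safely run concurrently; the number of colors is the number of time steps.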
Coloring models in derivative computation: overview
Matrix | Unidirectional partition | Bidirectional partition | Method
Jacobian | distance‐2 coloring | star bicoloring | Direct
Hessian | star coloring | NA | Direct
Jacobian | NA | acyclic bicoloring | Substitution
Hessian | acyclic coloring | NA | Substitution
4‐step procedure for computing a sparse derivative matrix A using Automatic Differentiation:
• S1: Determine the sparsity structure of A
• S2: Obtain a seed matrix S by coloring the graph of A
• S3: Compute a compressed matrix B = AS
• S4: Recover entries of A from B
Dimensions: A is m × n, S is n × p, B is m × p
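The four steps can be sketched on a tiny example. This is a minimal illustration under stated assumptions, not the ColPack or ADOL‐C interface: the matrix A is given explicitly (in practice S3 would be evaluated by an AD tool without ever forming A), the coloring is a simple greedy column partition for direct recovery, and all function names are made up for the example.

```python
def sparsity_pattern(A):
    """S1: for each column, the set of rows holding a nonzero."""
    m, n = len(A), len(A[0])
    return [{i for i in range(m) if A[i][j] != 0} for j in range(n)]

def color_columns(pattern):
    """S2 (coloring): greedily group columns that share no nonzero row."""
    colors, rows_in_group = [], []
    for rows in pattern:
        c = 0
        while c < len(rows_in_group) and rows_in_group[c] & rows:
            c += 1
        if c == len(rows_in_group):
            rows_in_group.append(set())
        rows_in_group[c] |= rows
        colors.append(c)
    return colors

def seed_matrix(colors):
    """S2 (seed): S[j][k] = 1 exactly when column j has color k."""
    p = max(colors) + 1
    return [[1 if colors[j] == k else 0 for k in range(p)]
            for j in range(len(colors))]

def compress(A, S):
    """S3: B = A S (done here by explicit multiplication for illustration)."""
    m, n, p = len(A), len(S), len(S[0])
    return [[sum(A[i][j] * S[j][k] for j in range(n)) for k in range(p)]
            for i in range(m)]

def recover(B, pattern, colors):
    """S4: entry (i, j) of A is read directly from B[i][color of column j]."""
    m, n = len(B), len(colors)
    A = [[0] * n for _ in range(m)]
    for j, rows in enumerate(pattern):
        for i in rows:
            A[i][j] = B[i][colors[j]]
    return A

A = [[1, 0, 0, 2],
     [0, 3, 0, 0],
     [4, 0, 5, 0],
     [0, 6, 0, 7]]
pattern = sparsity_pattern(A)
colors = color_columns(pattern)
S = seed_matrix(colors)
B = compress(A, S)
A_recovered = recover(B, pattern, colors)
```

Here n = 4 columns are compressed into p = 2, so B needs two columns instead of four; the saving grows with the size and sparsity of A.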
Distance‐2 coloring: an archetypal model in direct methods
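[Figure: two panels relating a matrix A to a graph]
Symmetric case: matrix A and graph Ga on vertices c1, c2, c3, c4, c5
A =
a11  0    0    0    a15
0    a22  a23  0    0
0    a32  a33  a34  a35
0    0    a43  a44  a45
a51  0    a53  a54  a55
Nonsymmetric case: matrix A and graph Gb on column vertices c1, ..., c5 and row vertices r1, ..., r4
A =
a11  a12  0    0    a15
a21  a22  0    0    0
a31  0    0    a34  0
0    0    a43  a44  a45
Figure labels: structurally orthogonal partition; distance‐2 coloring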
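For the nonsymmetric case, a distance‐2 coloring of the column vertices of the bipartite graph can be sketched as follows. The sketch assumes the graph is stored as a list of column indices per row (0‐based), uses the natural column order, and uses a hypothetical function name; two columns receive different colors whenever they share a row, which is what makes same‐colored columns structurally orthogonal.

```python
def distance2_color_columns(row_to_cols, num_cols):
    """Color column vertices so that columns two hops apart (via a shared row) differ."""
    col_to_rows = [[] for _ in range(num_cols)]
    for r, cols in enumerate(row_to_cols):
        for c in cols:
            col_to_rows[c].append(r)
    color = [-1] * num_cols  # -1 marks an uncolored column
    for c in range(num_cols):
        forbidden = set()
        for r in col_to_rows[c]:          # distance-1 neighbors: rows
            for c2 in row_to_cols[r]:     # distance-2 neighbors: columns
                if color[c2] != -1:
                    forbidden.add(color[c2])
        k = 0
        while k in forbidden:
            k += 1
        color[c] = k
    return color

# Nonzero pattern of the 4 x 5 nonsymmetric example: columns present in each row.
rows = [[0, 1, 4], [0, 1], [0, 3], [2, 3, 4]]
colors = distance2_color_columns(rows, 5)
print(colors)
```

Each color class is a group of columns with no common nonzero row, so the group can be summed into one column of the compressed matrix without losing any entry.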
Coloring models in derivative computation revisited
Matrix | Unidirectional partition | Bidirectional partition | Method
Jacobian | distance‐2 coloring: G, Manne and Pothen (05) | star bicoloring: Coleman and Verma (98), Hossain and Steihaug (98) | Direct
Hessian | star coloring: Coleman and Moré (84); restricted star coloring*: Powell and Toint (79) | NA | Direct
Jacobian | NA | acyclic bicoloring: Coleman and Verma (98) | Substitution
Hessian | acyclic coloring: Coleman and Cai (86); triangular coloring*: Coleman and Moré (84) | NA | Substitution
Jacobian: bipartite graph
Hessian: adjacency graph
* Less accurate models
SIAM Review 47(4):629—705, 2005.
ColPack
www.cscapes.org/coloringpage
An Example Application
Principle of Chromatography
[Figure: a chromatographic column. Labels: Desorbent (water, organic solvent, etc); Feed (mixture of red and blue components); Pump; Packing medium (adsorbent particles); Chromatographic column; Blue component; Red component]
• Red component sticks more strongly to adsorbent particles
http://www.cwg.hu/english/r-wtcomp.html
Figure courtesy of Yoshiaki Kawajiri, GT
Simulated Moving Bed process
• A pseudo counter‐current process that mimics operation of TMB
• Reaches only Cyclic Steady State
• Various objectives to be maximized could be identified
E.g.: product purity, product recovery, desorbent consumption, throughput
• We considered throughput maximization
• Objective modeled as an optimization problem with PDAEs as constraints
• Full discretization was used to solve the PDAEs, which leads to sparse Jacobians
Results on Jacobian computation on SMB problem
• Tested efficacy of the 4‐step procedure:
• Used ADOL‐C for steps S1 and S3, and ColPack for steps S2 and S4
• Observed results for each step matched analytical results
• Techniques enabled huge savings in runtime
Time(Jacobian eval) ≈ 100 × Time(function eval)
• Dense computation (without exploiting sparsity) was infeasible
[Figure, two plots. Plot 1: runtime(task)/runtime(F) versus m/100000, with curves for sparsity detection (S1), seed generation (S2), matrix‐vector product (S3), recovery (S4), and total. Plot 2: runtime(F) versus m/100000.]
G, Pothen and Walther: AD2008.
Complexity and algorithms
• Distance‐k, star, and acyclic coloring are NP‐hard (to even approximate)
– Distance‐1 coloring is hard to approximate to within n^(1‐ε) for all ε > 0 [Zuckerman’07]
• A greedy algorithm usually gives a good solution
GREEDY(G=(V,E))
  Order the vertices in V
  for i = 1 to |V| do
    Determine forbidden colors to vi
    Assign vi the smallest permissible color
    [Update collection of induced subgraphs]
  end‐for
• ColPack has
– O(|V|·d_k)‐time algorithms for distance‐k coloring (d_k is the average degree‐k)
– O(|V|·d_2)‐time algorithms for star and acyclic coloring
Key idea: exploit structure of two‐colored induced subgraphs
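The GREEDY scheme above, specialized to distance‐1 coloring, might look like the following sketch. It is not ColPack code: the adjacency‐dictionary representation and the name `greedy_coloring` are assumptions, and the bracketed step for induced subgraphs (needed only for star and acyclic coloring) is omitted.

```python
def greedy_coloring(adj, order=None):
    """Distance-1 greedy coloring: visit vertices in order, take the smallest free color."""
    if order is None:
        order = list(adj)  # natural order if no ordering is supplied
    color = {}
    for v in order:
        # Forbidden colors are those already used by colored neighbors of v.
        forbidden = {color[w] for w in adj[v] if w in color}
        c = 0
        while c in forbidden:
            c += 1
        color[v] = c
    return color

# A 5-cycle: an odd cycle needs three colors.
cycle5 = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 0]}
coloring = greedy_coloring(cycle5)
print(coloring)
```

Each vertex inspects only its own neighbors, so the total work is proportional to the number of edges; the number of colors depends on the vertex order that is passed in.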
Ordering techniques in ColPack: fresh formulation
Ordering | Property
Largest First | for i = 1 to n: vi has largest degree in V \ {v1, v2, ..., vi‐1}
Incidence Degree | for i = 1 to n: vi has largest back degree in V \ {v1, v2, ..., vi‐1}
Dynamic Largest First | for i = 1 to n: vi has largest forward degree in V \ {v1, v2, ..., vi‐1}
Smallest Last | for i = n to 1: vi has smallest back degree in V \ {vn, vn‐1, ..., vi+1}
Formulation enables:
• modular implementation
• linear time implementation
• discovery of use in other contexts
[Figure: the ordered vertices v1, v2, ..., vi, ..., vn‐1, vn on a line, with spans labeled Back degree (before vi), Forward degree (after vi), and Degree (both)]
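As one concrete instance of the table, a Smallest Last ordering can be sketched as below. The sketch fills positions n down to 1 by repeatedly removing a vertex of minimum degree among the vertices not yet placed. For brevity it scans for the minimum, which is quadratic in the worst case; a linear‐time version would use degree buckets instead. The function name and the tie‐breaking by vertex id are assumptions of the example.

```python
def smallest_last_order(adj):
    """Return v1..vn such that each vi has smallest back degree among unplaced vertices."""
    remaining = set(adj)
    deg = {v: len(adj[v]) for v in adj}  # degree within the remaining subgraph
    removed = []
    while remaining:
        # Pick a minimum-degree remaining vertex (ties broken by vertex id).
        v = min(remaining, key=lambda u: (deg[u], u))
        remaining.remove(v)
        removed.append(v)
        for w in adj[v]:
            if w in remaining:
                deg[w] -= 1
    return removed[::-1]  # the first vertex removed is last in the ordering

# A small tree: center 0 with leaves 1 and 2, plus the path 0 - 3 - 4.
tree = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0, 4], 4: [3]}
order = smallest_last_order(tree)
print(order)
```

In the resulting order every vertex of this tree has at most one neighbor placed before it, so a greedy coloring in this order needs at most two colors.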
Parallelization…
Challenges in parallelization in general (on contemporary platforms)
• Parallel Architectural Models?
– Control mechanism; address space (memory) organization; interconnection network; etc
• Parallel Programming Models?
– Shared memory; distributed memory; massive threading; etc
• Parallel Computational Models?
– Wish: realistic yet reasonably simple abstractions
Challenges in parallelizing graph algorithms
• Low available concurrency
• Poor data locality
• Irregular memory access pattern
• Access pattern determined only at runtime
• High data access to computation ratio
Parallel Coloring Algorithms
• Independent‐set based (previous approaches)
– Find maximal independent set in parallel (Luby’s algorithm)
– Limited (or no) success
• Iteration and speculation
Iterative Algorithm (G=(V,E))
  Order V in parallel
  U = V
  while U is not empty
    1. Speculatively color vertices in U in parallel;
    2. Check consistency of colors in U in parallel, store conflicts in R;
    U = R;
• Dataflow
– Fine‐grain (edge‐level) synchronization; no iteration
– Feasible when there is HW support for FGS (like the Cray XMT)
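The Iterative Algorithm can be simulated sequentially to see how speculation and conflict resolution interact. The sketch below is a model, not the multithreaded implementation: it assumes the most pessimistic case, in which every vertex colored in a round sees only the colors fixed before that round (a snapshot), and it resolves each conflict by recoloring the endpoint with the larger id. The ordering step is skipped and the function name is hypothetical.

```python
def iterative_coloring(adj):
    """Round-based simulation of speculative coloring with conflict detection."""
    color = {v: None for v in adj}
    U = sorted(adj)
    rounds = 0
    while U:
        rounds += 1
        snapshot = dict(color)  # what every "thread" sees during this round
        # Phase 1: speculative coloring, ignoring choices made in this round.
        for v in U:
            forbidden = {snapshot[w] for w in adj[v] if snapshot[w] is not None}
            c = 0
            while c in forbidden:
                c += 1
            color[v] = c
        # Phase 2: detect conflicts; the larger-id endpoint is recolored next round.
        in_U = set(U)
        R = [v for v in U
             if any(w in in_U and w < v and color[w] == color[v] for w in adj[v])]
        for v in R:
            color[v] = None
        U = R
    return color, rounds

# Triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
coloring, rounds = iterative_coloring(graph)
print(coloring, rounds)
```

The smallest‐id vertex of U always keeps its color, so U shrinks in every round and the loop terminates. On real hardware threads see most of each other's writes, so conflicts are typically far rarer than in this worst‐case model.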
Enhancing the Iterative Algorithm
• Color choice
– First Fit
– Staggered First Fit
– Least Used
– Random
• Resolving a conflict
– Randomization
Ordering is inherently sequential
Remedy: approximation
Illustration: Smallest Last ordering
Experimental Results on Parallel Performance
Test platforms
• Cray XMT
– 128 processors
– 128 hardware thread streams per processor
– cache‐less, globally accessible shared memory
– hardware support for fine‐grain synchronization
• Sun Niagara 2
– two 8‐core sockets
– 8 hardware threads per core
– L1 cache on core, shared L2 cache
• Intel Nehalem
– two quad‐core chips
– two hyperthreads per core
– private L1 and L2 cache, shared L3 cache
[Figure: block diagrams of the test platforms; the diagram labels are not recoverable]
Test graphs
sc : graphs from scientific computing apps
er : R‐MAT (0.25, 0.25, 0.25, 0.25)
g : R‐MAT (0.45, 0.15, 0.15, 0.25)
b : R‐MAT (0.55, 0.15, 0.15, 0.15)
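For readers unfamiliar with R‐MAT, the sketch below shows the basic generator: each edge is placed by recursively choosing one of the four quadrants of the adjacency matrix with probabilities (a, b, c, d). It is a bare‐bones illustration with hypothetical names; it keeps self‐loops and duplicate edges, which a generator used for experiments would typically filter out, and it is not the generator used for the test graphs in the talk.

```python
import random

# The three R-MAT parameter sets listed on the slide.
RMAT_PARAMS = {
    "er": (0.25, 0.25, 0.25, 0.25),
    "g": (0.45, 0.15, 0.15, 0.25),
    "b": (0.55, 0.15, 0.15, 0.15),
}

def rmat_edges(scale, num_edges, probs, seed=0):
    """Generate num_edges edges on 2**scale vertices by recursive quadrant choice."""
    a, b, c, _d = probs
    rng = random.Random(seed)
    edges = []
    for _ in range(num_edges):
        u = v = 0
        for _level in range(scale):
            r = rng.random()
            u <<= 1
            v <<= 1
            if r < a:
                pass            # top-left quadrant: both bits 0
            elif r < a + b:
                v |= 1          # top-right quadrant
            elif r < a + b + c:
                u |= 1          # bottom-left quadrant
            else:
                u |= 1          # bottom-right quadrant
                v |= 1
        edges.append((u, v))
    return edges

edges = rmat_edges(scale=4, num_edges=100, probs=RMAT_PARAMS["g"], seed=1)
print(edges[:5])
```

With equal probabilities (er) both endpoints of an edge are uniformly random, while the skewed settings (g, b) concentrate edges on a few vertices and so produce more irregular degree distributions.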
Distance‐2 coloring: # colors
[Figures on two slides: Nehalem]
Distance‐2 coloring: runtime
[Figures on two slides: Nehalem]
Distance‐1 coloring: # colors
[Figure: Nehalem, Niagara 2, Cray XMT]
Distance‐1 coloring: runtime
[Figure: runtime plots; axis labels and legends are not recoverable. Captions follow.]
Small‐world graph with 2^24 = 16M vertices and 134M edges
Iterative Dataflow
Niagara 2 Nehalem
Small‐world graphs with 2^24, …, 2^27 vertices and 134M, …, 1B edges
Iterative Iterative
Cray XMT Cray XMT
Iterative: looking inside
[Figure: Nehalem, Niagara 2, Cray XMT]
A “generic” parallelization technique?
• “Standard” Partitioning
– Break up the given problem into p independent subproblems of almost equal sizes
– Solve the p subproblems concurrently
• “Relaxed” Partitioning
– Break up the problem into p, not necessarily entirely independent, subproblems of almost equal sizes
– Solve the p subproblems concurrently
– Detect inconsistencies in the solutions concurrently
– Resolve any inconsistencies
Can potentially be used successfully if the resolution in the fourth step involves only local adjustments
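The four steps of “relaxed” partitioning can be sketched for coloring with a thread pool. This is an illustration of the structure only: the contiguous block partition, the rule of recoloring the larger‐id endpoint, the sequential resolution step, and the function names are all assumptions, and Python threads are used to show the shape of the computation rather than to obtain speedup.

```python
from concurrent.futures import ThreadPoolExecutor

def first_fit(forbidden):
    c = 0
    while c in forbidden:
        c += 1
    return c

def relaxed_partition_coloring(adj, p):
    vertices = sorted(adj)
    # Step 1: break the vertex set into p blocks of almost equal size.
    size = max(1, -(-len(vertices) // p))  # ceiling division
    blocks = [vertices[i:i + size] for i in range(0, len(vertices), size)]
    color = {}

    def solve(block):
        # Step 2: color a block using only edges inside the block.
        members, local = set(block), {}
        for v in block:
            local[v] = first_fit({local[w] for w in adj[v]
                                  if w in members and w in local})
        return local

    def detect(block):
        # Step 3: a vertex conflicts if a smaller-id neighbor has the same color.
        return [v for v in block
                if any(w < v and color[w] == color[v] for w in adj[v])]

    with ThreadPoolExecutor(max_workers=p) as pool:
        for local in pool.map(solve, blocks):
            color.update(local)
        conflicts = [v for found in pool.map(detect, blocks) for v in found]
    # Step 4: resolve with local adjustments (here, sequential recoloring).
    for v in sorted(conflicts):
        color[v] = first_fit({color[w] for w in adj[v]})
    return color, conflicts

# A 6-cycle split into two blocks: only the two cut edges can cause conflicts.
cycle6 = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
coloring, conflicts = relaxed_partition_coloring(cycle6, p=2)
print(coloring, conflicts)
```

Only vertices on block boundaries can be inconsistent, so the resolution step touches few vertices when the partition has few cut edges, which is the situation in which the technique pays off.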
Thanks
• Erik Boman, Doruk Bozdag, Umit Catalyurek, John Feo, Mahantesh Halappanavar, Bruce Hendrickson, Paul Hovland, Fredrik Manne, Duc Nguyen, Mostafa Patwary, Alex Pothen, Arijit Tarafdar, Andrea Walther
• Financial Support: DOE, NSF
Some References
• Gebremedhin, Nguyen, Pothen and Patwary. ColPack: Graph Coloring Software for Derivative Computation and Beyond. ACM Trans. Math. Software. Submitted. 2010.
• Gebremedhin, Manne and Pothen. What color is your Jacobian? Graph coloring for computing derivatives. SIAM Review 47(4):629–705, 2005.
• Gebremedhin, Tarafdar, Manne and Pothen. New acyclic and star coloring algorithms with applications to computing Hessians. SIAM J. Sci. Comput. 29:1042–1072, 2007.
• Gebremedhin, Pothen and Walther. Exploiting sparsity in Jacobian computation via coloring and automatic differentiation: a case study in a Simulated Moving Bed process. AD2008, LNCSE 64:339–349, 2008.
• Catalyurek, Feo, Gebremedhin, Halappanavar, Pothen. Multithreaded Algorithms for Graph Coloring. In submission, 2011.
• Bozdag, Catalyurek, Gebremedhin, Manne, Boman and Ozguner. Distributed‐memory parallel algorithms for distance‐2 coloring and related problems in derivative computation. SIAM J. Sci. Comput. 32(4):2418–2446, 2010.
• Bozdag, Gebremedhin, Manne, Boman and Catalyurek. A framework for scalable greedy coloring on distributed‐memory parallel computers. J. Parallel Distrib. Comput. 68(4):515–535, 2008.
• For more information: www.cs.purdue.edu/homes/agebreme