Parallel Algorithms for Solving Large Assignment Problems on GPU Clusters
2018 Blue Waters Symposium
Ketan Date, Rakesh Nagi (PI)
Department of Industrial and Enterprise Systems Engineering
University of Illinois at Urbana-Champaign
June 6, 2018
Outline
Assignment Problems: Introduction and Impact
Research Tasks and Role of Blue Waters
The Linear Assignment Problem
The Quadratic Assignment Problem
Introduction
Assignment problems: Fundamental optimization problems in Operations Research that have prominent applications in science and engineering.
Our inability to solve large instances of these problems efficiently can greatly inhibit discovery in these domains.
Objectives: Design faster, parallel algorithms for the Linear Assignment Problem (LAP) and the Quadratic Assignment Problem (QAP) using GPUs and large computational clusters such as Blue Waters.
Future work: Extend the proposed methodology to the Generalized Assignment Problem (GAP), the Traveling Salesman Problem (TSP), the Vehicle Routing Problem (VRP), and Graph Association/Matching (GA/GM).
Impact of Assignment Problems
1. Data sciences: Data association in information fusion and multi-target tracking.
2. Bioinformatics: Alignment of protein-protein interaction (PPI) networks.
3. Engineering: Facility location, routing and scheduling problems, etc.
[Figure: Association of complex information from heterogeneous data sources using a graph association formulation. Additional panels: TSP and VRP; Facility Location.]
Research Tasks and Role of Blue Waters
Research tasks
1. Develop a GPU-accelerated Hungarian algorithm for the LAP.
2. Develop a GPU-accelerated Dual Ascent procedure for QAP-RLT2.
3. Couple both algorithms and deploy them on Blue Waters to obtain exact solutions to the QAP in a parallel branch-and-bound scheme.
Role of Blue Waters
- The branch-and-bound tree has an exponential number of nodes.
- The lower bounding procedure requires solving $O(n^3)$ LAPs and adjusting $O(n^6)$ Lagrange multipliers.
- Solving the benchmark Nug30 QAP required over 1200 XK compute nodes for over 110 hours (about 15 years' worth of computation).
- We are grateful to Blue Waters and the NCSA staff for providing this invaluable service to the scientific community.
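As a rough check of the figures quoted above, the following Python sketch converts the Nug30 run into node-years and counts the index tuples behind the $O(n^6)$ multipliers. The function names are illustrative, and the exact count $(n(n-1)(n-2))^2$ is an assumption: it supposes one multiplier per tuple (i, j, k, p, q, r) with distinct i, j, k and distinct p, q, r.

```python
HOURS_PER_YEAR = 24 * 365


def node_years(nodes: int, hours: float) -> float:
    """Total single-node computation time, expressed in years."""
    return nodes * hours / HOURS_PER_YEAR


def multiplier_count(n: int) -> int:
    """Number of (i, j, k, p, q, r) tuples with distinct i, j, k and distinct p, q, r.

    Assumption: one Lagrange multiplier per such tuple.
    """
    return (n * (n - 1) * (n - 2)) ** 2


if __name__ == "__main__":
    # 1200 nodes for 110 hours is a little over 15 node-years.
    print(round(node_years(1200, 110), 2))
    # For n = 30 the tuple count is already in the hundreds of millions.
    print(multiplier_count(30))
```

Under these assumptions, 1200 nodes for 110 hours comes to slightly more than 15 node-years, which matches the figure on the slide.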
Linear Assignment Problem: Introduction
- Also known as the weighted bipartite matching problem.
- Objective: Minimize the total cost of assigning n resources to n tasks.
- Important subproblem of many NP-hard optimization problems, e.g.,
  - Quadratic Assignment Problem
  - Traveling Salesperson Problem
  - Graph Matching and Association Problems, etc.

\[
\begin{aligned}
\min \quad & \sum_{i=1}^{n} \sum_{j=1}^{n} c_{ij} x_{ij} \\
\text{s.t.} \quad & \sum_{j=1}^{n} x_{ij} = 1 \quad \forall i = 1, \dots, n; \\
& \sum_{i=1}^{n} x_{ij} = 1 \quad \forall j = 1, \dots, n; \\
& x_{ij} \in \{0, 1\} \quad \forall i, j = 1, \dots, n.
\end{aligned}
\]
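To make the LAP formulation concrete, here is a minimal Python sketch that enumerates every permutation of a tiny cost matrix and returns the cheapest assignment. It is a brute-force illustration only, not the Hungarian algorithm discussed in this talk; the function name and the 3x3 cost matrix are made up for the example.

```python
from itertools import permutations


def solve_lap_brute_force(cost):
    """Return (best_cost, best_perm), where best_perm[i] is the task given to resource i.

    Enumerates all n! assignments, so it is only usable for very small n.
    """
    n = len(cost)
    best_cost, best_perm = None, None
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if best_cost is None or total < best_cost:
            best_cost, best_perm = total, perm
    return best_cost, best_perm


# Hypothetical cost matrix: cost[i][j] is the cost of assigning resource i to task j.
cost = [
    [4, 1, 3],
    [2, 0, 5],
    [3, 2, 2],
]

if __name__ == "__main__":
    print(solve_lap_brute_force(cost))  # (5, (1, 0, 2))
```

Each permutation corresponds to one feasible 0-1 matrix x in the formulation: exactly one entry per row and per column is set to 1.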
Literature Review
Sequential algorithms
- Hungarian algorithm [Kuhn, 1955, Munkres, 1957].
- Shortest path algorithms [Jonker and Volgenant, 1987].
- Auction algorithm [Bertsekas, 1990].
Parallel implementations
- Parallel synchronous/asynchronous Hungarian algorithms [Bertsekas and Castanon, 1993].
- Parallel shortest path algorithms [Balas et al., 1991, Storøy and Sørevik, 1997].
- Parallel synchronous/asynchronous Auction algorithm [Wein and Zenios, 1990, Bertsekas and Castanon, 1991].
- Parallel Auction algorithm using GPUs [Vasconcelos and Rosenhahn, 2009].
Sequential Hungarian Algorithm
With opportunities for acceleration
[Figure: flowchart of the sequential Hungarian algorithm. Boxes: Start, Initialization, Partial Assignment, Optimality Check, Augmenting Path Search, Augmentation, Dual Update, End. Decisions (Yes/No): "All Assigned?" and "Augmenting Path Found?". Annotations: high granularity, scalable to multiple GPUs; low granularity, executed on a single GPU.]
Accelerated Hungarian: Augmenting Path Search (Forward Pass)
- The goal is to find vertex-disjoint augmenting paths from unassigned rows to unassigned columns.
- In the forward pass, threads traverse the graph one hop at a time and construct augmenting trees.
- More than one augmenting tree may be found per iteration.
- Due to the race condition among threads, the trees are guaranteed to be vertex-disjoint (our innovation).
[Figure: forward pass example on rows R1 to R4 and columns C1 to C4, with threads labeled 11, 12, 21, and 22.
BFS Iteration 1: Frontier: R2 and R3; New Frontier: R1 and R4.
BFS Iteration 2: Frontier: R1 and R4; New Frontier: none.
Augmenting path(s) found: R3-C2-R1-C1 and R3-C2-R1-C4.]
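The forward pass can be sketched sequentially in Python as a level-by-level BFS in which each column is claimed by the first row that reaches it. This is a simplified emulation, not the authors' GPU kernel: on a GPU the claim would be decided by racing threads, whereas here loop order decides the winner. The graph, the matching, and all names (`forward_pass`, `adj`, `col_of_row`, `row_of_col`) are hypothetical, chosen to resemble the example in the figure (0-based indices, so R1 is row 0 and C1 is column 0).

```python
def forward_pass(adj, col_of_row, row_of_col):
    """Grow vertex-disjoint augmenting trees from all unassigned rows.

    adj[r]        : columns reachable from row r (edges eligible for the search)
    col_of_row[r] : column matched to row r, or -1 if r is unassigned
    row_of_col[c] : row matched to column c, or -1 if c is unassigned
    Returns (pred_row_of_col, leaves): the row that claimed each column
    (-1 if unclaimed) and the unassigned columns that were reached.
    """
    pred_row_of_col = [-1] * len(row_of_col)
    frontier = [r for r, c in enumerate(col_of_row) if c == -1]
    leaves = []
    while frontier:
        new_frontier = []
        for r in frontier:
            for c in adj[r]:
                if pred_row_of_col[c] != -1:
                    continue  # column already claimed: the earlier claim wins
                pred_row_of_col[c] = r
                if row_of_col[c] == -1:
                    leaves.append(c)  # unassigned column: end of an augmenting path
                else:
                    new_frontier.append(row_of_col[c])  # continue via the matched row
        frontier = new_frontier
    return pred_row_of_col, leaves


# Hypothetical instance: R1-C2 and R4-C3 are matched; R2 and R3 are unassigned.
adj = [[1, 0, 3], [2], [1], [2]]
col_of_row = [1, -1, -1, 2]
row_of_col = [-1, 0, 3, -1]

if __name__ == "__main__":
    print(forward_pass(adj, col_of_row, row_of_col))  # ([0, 2, 1, 0], [0, 3])
```

Because every column has at most one predecessor row and every assigned row is entered only through its matched column, no vertex can belong to two trees. In this instance the leaves C1 and C4 both hang off R1 in the tree rooted at R3.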
Accelerated Hungarian: Reverse Pass and Augmentation
- The reverse pass extracts augmenting paths from the augmenting trees (each leaf vertex is processed by one thread).
- Due to the "race" condition, only one path survives per tree (our innovation).
- All such paths can be used to augment the current solution and increase the number of assignments.
[Figure: reverse pass and augmentation example on rows R1 to R4 and columns C1 to C4, with threads labeled 21 and 22.
Reverse Pass: Survivor Path: C1-R1-C2-R3.
Augmentation: Number of assignments increased by 1.]
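A minimal sequential sketch of the reverse pass and augmentation follows, assuming the forward pass has recorded, for each reached column, the row that claimed it. All names and input arrays are hypothetical, and the "first leaf to reach a root wins" rule is only a stand-in for the thread race described above. Indices are 0-based, so R1 is row 0 and C1 is column 0.

```python
def reverse_pass_and_augment(pred_row_of_col, leaves, col_of_row, row_of_col):
    """Trace each leaf column back to its root row, keep one path per root, then augment.

    Modifies col_of_row and row_of_col in place and returns the number of
    paths used, i.e. the increase in the number of assignments.
    """
    survivors = {}  # root row -> path as a list of (row, column) pairs
    for leaf in leaves:
        path = []
        c = leaf
        while c != -1:
            r = pred_row_of_col[c]
            path.append((r, c))
            c = col_of_row[r]  # old matched column of r; -1 once r is a root
        root = path[-1][0]
        survivors.setdefault(root, path)  # the first path to claim a root survives
    # Augment only after all tracing, so tracing reads the old matching throughout.
    for path in survivors.values():
        for r, c in path:
            col_of_row[r] = c
            row_of_col[c] = r
    return len(survivors)


# Hypothetical forward-pass output: C1 and C4 are leaves of the tree rooted at R3.
pred_row_of_col = [0, 2, 1, 0]
leaves = [0, 3]
col_of_row = [1, -1, -1, 2]
row_of_col = [-1, 0, 3, -1]

if __name__ == "__main__":
    gained = reverse_pass_and_augment(pred_row_of_col, leaves, col_of_row, row_of_col)
    print(gained, col_of_row, row_of_col)
```

In this instance both leaves trace back to the same root R3, so only the path C1-R1-C2-R3 survives and the number of assignments grows by one.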
Computational Experiments
Experimental setup
- Small scale: n = 500 to n = 5000 in increments of 500. Cost matrix of randomly generated integers in [0, n].
- Large scale: n = 5000 to n = 20000 in increments of 5000. Cost matrix of randomly generated integers in [0, n/10], [0, n], and [0, 10n].
Hardware details
- Computational resources from the Blue Waters Supercomputing Facility at the University of Illinois at Urbana-Champaign.
- CPU: AMD Interlagos 6376, 2.30 GHz clock speed, and 32 GB memory.
- GPU: NVIDIA GK110 "Kepler" K20X, with 2688 processor cores and 6 GB memory.
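Test instances like those in the experimental setup could be generated along the following lines. This is a sketch only: the slides do not say which random number generator or seed was used, so the uniform integer sampling, the seed, and the helper name `random_cost_matrix` are assumptions.

```python
import random


def random_cost_matrix(n, upper, seed=None):
    """Return an n x n matrix of integers drawn uniformly from [0, upper]."""
    rng = random.Random(seed)
    return [[rng.randint(0, upper) for _ in range(n)] for _ in range(n)]


# A deliberately small n; the experiments use n up to 20000.
n = 20
cost_ranges = {"[0, n/10]": n // 10, "[0, n]": n, "[0, 10n]": 10 * n}
matrices = {
    label: random_cost_matrix(n, upper, seed=0)
    for label, upper in cost_ranges.items()
}

if __name__ == "__main__":
    for label, matrix in matrices.items():
        print(label, max(max(row) for row in matrix))
```

A narrow range such as [0, n/10] produces many ties among costs, while [0, 10n] produces few; this is presumably why the large-scale experiments vary the range.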
Execution Times and Speedup Profiles (Small Scale)
Execution Times and Speedup Profiles (Large Scale)
Contributions
1. Developed parallel versions of two variants of the Hungarian algorithm for solving the LAP on GPU(s).
2. The accelerated algorithms leverage a "race condition" to find multiple vertex-disjoint augmenting paths.
3. The single-GPU variant can solve problems with up to 400 million variables in 13 seconds.
4. The multi-GPU variant can solve problems with up to 1.6 billion variables in 24 seconds.
Quadratic Assignment Problem: Introduction
- Introduced by [Koopmans and Beckmann, 1957] as a mathematical model for facility location.
- Objective: Place n facilities on n locations such that the total cost (distance times flow) is minimized.
- Strongly NP-hard problem. No polynomial-time optimal or ε-optimal algorithm is known.

\[
\begin{aligned}
\min \quad & \sum_{i=1}^{n} \sum_{p=1}^{n} b_{ip} x_{ip} + \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{p=1}^{n} \sum_{q=1}^{n} f_{ij} d_{pq} x_{ip} x_{jq} \\
\text{s.t.} \quad & \sum_{i=1}^{n} x_{ip} = 1 \quad \forall p = 1, \dots, n; \\
& \sum_{p=1}^{n} x_{ip} = 1 \quad \forall i = 1, \dots, n; \\
& x_{ip} \in \{0, 1\} \quad \forall i, p = 1, \dots, n.
\end{aligned}
\]
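The QAP objective can be evaluated directly for a given placement, as in the Python sketch below. It represents a solution as a permutation (`perm[i]` is the location of facility i) and finds the optimum of a tiny instance by enumeration. The function names and the 3x3 flow and distance matrices are invented for illustration; enumeration is of course hopeless at the problem sizes targeted in this talk.

```python
from itertools import permutations


def qap_cost(perm, flow, dist, fixed=None):
    """Objective value when facility i is placed at location perm[i].

    flow[i][j]  : flow between facilities i and j (f_ij)
    dist[p][q]  : distance between locations p and q (d_pq)
    fixed[i][p] : optional linear cost of placing facility i at location p (b_ip)
    """
    n = len(perm)
    cost = sum(
        flow[i][j] * dist[perm[i]][perm[j]] for i in range(n) for j in range(n)
    )
    if fixed is not None:
        cost += sum(fixed[i][perm[i]] for i in range(n))
    return cost


def solve_qap_brute_force(flow, dist, fixed=None):
    """Enumerate all n! placements and return (best_cost, best_perm)."""
    n = len(flow)
    best = min(permutations(range(n)), key=lambda p: qap_cost(p, flow, dist, fixed))
    return qap_cost(best, flow, dist, fixed), best


# Hypothetical symmetric instance with three facilities and three locations.
flow = [[0, 5, 2], [5, 0, 3], [2, 3, 0]]
dist = [[0, 1, 4], [1, 0, 2], [4, 2, 0]]

if __name__ == "__main__":
    print(solve_qap_brute_force(flow, dist))  # (38, (0, 1, 2))
```

The double sum runs over ordered pairs (i, j), as in the formulation, so each unordered pair of a symmetric instance is counted twice.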
Linearization Models
- One way to solve the QAP is to convert it into an MILP and use branch-and-bound.
- Linearization is accomplished by introducing a large number of variables and constraints.
- The LP relaxation provides a valid lower bound on the QAP.
- The bound can be used to fathom nodes in the branch-and-bound tree.

Linearization Model              Binary Variables   Continuous Variables   Constraints
[Lawler, 1963]                   O(n^4)             –                      O(n^4)
[Kaufman and Broeckx, 1978]      O(n^2)             O(n^2)                 O(n^2)
[Frieze and Yadegar, 1983]       O(n^2)             O(n^4)                 O(n^4)
[Adams and Johnson, 1994] RLT1   O(n^2)             O(n^4)                 O(n^4)
[Adams et al., 2007] RLT2        O(n^2)             O(n^6)                 O(n^6)
[Hahn et al., 2012] RLT3         O(n^2)             O(n^8)                 O(n^8)
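RLT2, DLRLT2, and Staged LAP Solution
RLT2:
\[
\begin{aligned}
\min \quad & \sum_{i} \sum_{p} b_{ip} x_{ip} + \sum_{i} \sum_{j \ne i} \sum_{p} \sum_{q \ne p} C_{ijpq} y_{ijpq} + \sum_{i} \sum_{j \ne i} \sum_{k \ne i,j} \sum_{p} \sum_{q \ne p} \sum_{r \ne p,q} D_{ijkpqr} z_{ijkpqr}; \\
\text{s.t.} \quad & \sum_{p} x_{ip} = 1, \quad \forall i; \\
& \sum_{i} x_{ip} = 1, \quad \forall p; \\
& \sum_{q \ne p} y_{ijpq} = x_{ip}, \quad \forall (i,j,p) : i \ne j; \\
& \sum_{j \ne i} y_{ijpq} = x_{ip}, \quad \forall (i,p,q) : p \ne q; \\
& \sum_{r \ne p,q} z_{ijkpqr} = y_{ijpq}, \quad \forall (i,j,k,p,q) : i \ne j \ne k,\ p \ne q; \\
& \sum_{k \ne i,j} z_{ijkpqr} = y_{ijpq}, \quad \forall (i,j,p,q,r) : i \ne j,\ p \ne q \ne r; \\
& z_{ijkpqr} = z_{ikjprq} = z_{jikqpr} = z_{jkiqrp} = z_{kijrpq} = z_{kjirqp}, \quad \forall (i,j,k,p,q,r) : i < j < k,\ p \ne q \ne r; \\
& x_{ip} \in \{0, 1\}, \quad \forall i, p; \\
& y_{ijpq} \ge 0, \quad \forall (i,j,p,q) : i \ne j,\ p \ne q; \\
& z_{ijkpqr} \ge 0, \quad \forall (i,j,k,p,q,r) : i \ne j \ne k,\ p \ne q \ne r.
\end{aligned}
\]
DLRLT2(v):
\[
\begin{aligned}
\max \quad & \sum_{i} \alpha_i + \sum_{p} \beta_p; \\
\text{s.t.} \quad & \alpha_i + \beta_p - \sum_{j \ne i} \gamma_{ijp} - \sum_{q \ne p} \delta_{ipq} \le b_{ip}, \quad \forall i, p; \\
& \gamma_{ijp} + \delta_{ipq} - \sum_{k \ne i,j} \xi_{ijkpq} - \sum_{r \ne p,q} \psi_{ijpqr} \le C_{ijpq}, \quad \forall (i,j,p,q) : i \ne j,\ p \ne q; \\
& \xi_{ijkpq} + \psi_{ijpqr} \le D_{ijkpqr} - v_{ijkpqr}, \quad \forall (i,j,k,p,q,r) : i \ne j \ne k,\ p \ne q \ne r; \\
& \alpha_i, \beta_p, \gamma_{ijp}, \delta_{ipq}, \xi_{ijkpq}, \psi_{ijpqr} \ \text{unrestricted}, \quad \forall (i,j,k,p,q,r) : i \ne j \ne k,\ p \ne q \ne r.
\end{aligned}
\]
Staged LAP solution:
Stage 1: $n^2(n-1)^2$ Z-LAPs
Stage 2: $n^2$ Y-LAPs
Stage 3: 1 X-LAP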
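To give a feel for the scale of the staged solution, the sketch below tabulates the number of LAPs per stage for a given n, using the counts stated above. The per-LAP dimensions, (n-2) for Z-LAPs, (n-1) for Y-LAPs, and n for the X-LAP, are an assumption inferred from the summation ranges in the constraints; the function name is illustrative.

```python
def staged_lap_counts(n):
    """Number and assumed dimension of the LAPs solved in each stage for a QAP of size n."""
    return {
        "Z": {"count": n**2 * (n - 1) ** 2, "size": n - 2},  # one per (i, j, p, q)
        "Y": {"count": n**2, "size": n - 1},                 # one per (i, p)
        "X": {"count": 1, "size": n},
    }


if __name__ == "__main__":
    for stage, info in staged_lap_counts(30).items():
        print(stage, info["count"], info["size"])
```

For n = 30 this gives 756,900 Z-LAPs, 900 Y-LAPs, and a single X-LAP.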
LDRLT2 and Lagrangian Dual Ascent
- For a fixed value of v, DLRLT2 is decomposable into a large number of LAPs (which can be solved in parallel).
- Lagrangian dual problem, LDRLT2: $\max_v \mathrm{DLRLT2}(v)$.
- $\nu(\mathrm{DLRLT2}(v)) \le \nu(\mathrm{LDRLT2}) \le \nu(\mathrm{LPRLT2}) \le \nu(\mathrm{RLT2}) = \nu(\mathrm{QAP})$.
- After solving $\mathrm{DLRLT2}(v^m)$ at the m-th iteration, $v^m$ can be adjusted systematically to obtain an improved lower bound.
- Highly parallelizable on GPUs.

\[
v^{m+1}_{ijkpqr} = v^{m}_{ijkpqr} + \kappa_z \pi(z_{ijkpqr}) + \kappa_y \frac{\pi(y_{ijpq})}{n-2} + \kappa_x \frac{\pi(x_{ip})}{(n-1)(n-2)} - \Omega_{ijkpqr},
\]
where $0 \le \kappa \le 1$ and $\Omega_{ijkpqr}$ is a function of the dual slacks $\pi(x)$, $\pi(y)$, $\pi(z)$.
RLT2-DA
With opportunities for parallelization
[Figure: flowchart of RLT2-DA. Boxes: Start, Initialization, Z Dual ascent / Cost update, Solve Z-LAP, Z Solution Transfer, Y Dual ascent / Cost update, Solve Y-LAP, Y Solution Transfer, X Cost update, Solve X-LAP, End. Decision (YES/NO): "Stopping Criteria Satisfied". Annotations: high granularity; low granularity.]
Accelerated RLT2-DA
[Figure: CPU(0), CPU(1), ..., CPU(K-1) communicate via MPI. Each CPU(k) is paired with GPU(k), which runs Initialization + Dual Ascent + Z-TLAP Solve + Y-TLAP Solve + X-LAP Solve + Feasibility Check.]

Table: Minimum number of GPUs for various problem sizes.
n                   ≤ 27   30   35   40   42
Minimum # of GPUs   1      2    4    7    15
Tiled Linear Assignment Problems (TLAP)
- LAPs are solved using our GPU-accelerated Hungarian algorithm [Date and Nagi, 2016].
- LAPs owned by a particular GPU are combined and solved as a tiled LAP (or TLAP).
- In practice, solving a TLAP is much faster than solving the individual LAPs.
[Figure: grouping of LAPs into TLAPs. LAP(0), LAP(1), ..., LAP(M-1) form TLAP(0); LAP(M), LAP(M+1), ..., LAP(2M-1) form TLAP(1); LAP(M^2-M), LAP(M^2-M+1), ..., LAP(M^2-1) form TLAP(M-1).]
Parallel Branch-and-bound
[Figure: parallel branch-and-bound tree of depth d, with branches labeled x11=1, x1(.)=1, x22=1, x2(.)=1, ..., xdd=1, xd(.)=1. Labels: PE Bank 0 through PE Bank (N-1), each with Best First (Priority Queue); Depth First (Stack).]
- One Processing Element (PE) contains 1 CPU and 1 GPU.
- The initial upper bound is established using a local search heuristic.
- Each PE bank executes RLT2-DA on a single node and establishes a lower bound.
- The tree is expanded using a "polytomic" branching rule.
- PEs are synchronized after 300 seconds and work is redistributed if needed.
- Branching rule: Branch on the facility that has the lowest total flow with the previously placed facilities.
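The branching rule and polytomic expansion can be sketched in Python as follows. This is an interpretation rather than the authors' code: it assumes "total flow" means flow in both directions between a candidate facility and the already placed ones, and that polytomic branching creates one child per free location for the chosen facility. The function names and the flow matrix are hypothetical.

```python
def choose_branching_facility(flow, partial):
    """Pick the unplaced facility with the lowest total flow to the placed facilities.

    flow[i][j] : flow between facilities i and j
    partial    : dict mapping each placed facility to its location
    """
    placed = list(partial)
    unplaced = [i for i in range(len(flow)) if i not in partial]

    def flow_with_placed(i):
        return sum(flow[i][j] + flow[j][i] for j in placed)

    return min(unplaced, key=flow_with_placed)


def polytomic_children(partial, facility, n):
    """Create one child node per free location for the chosen facility."""
    used = set(partial.values())
    return [{**partial, facility: loc} for loc in range(n) if loc not in used]


# Hypothetical instance: facility 0 has already been placed at location 1.
flow = [
    [0, 3, 1, 2],
    [3, 0, 4, 0],
    [1, 4, 0, 5],
    [2, 0, 5, 0],
]
partial = {0: 1}

if __name__ == "__main__":
    facility = choose_branching_facility(flow, partial)
    print(facility, polytomic_children(partial, facility, len(flow)))
```

Here facility 2 has the least flow with facility 0, so the node is expanded into three children that place facility 2 at locations 0, 2, and 3.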
Experimental Results

Problem   UB             N     K   ℓ_init   Nodes (initial)   Nodes (explored)   PE util. (min)   PE util. (avg)   PE util. (max)   Time (d:hh:mm:ss)
Nug20†    2570*          4     1   2        98                134                0.77             0.89             0.99             0:00:38:41
Nug22     3596*          10    1   2        462               622                0.82             0.91             0.99             0:01:34:03
Nug25†    3744*          50    2   3        1,755             3,868              0.81             0.90             0.97             0:02:44:24
Nug27     5234*          100   2   3        17,550            55,761             0.97             0.98             0.99             1:02:28:32
Nug30†    6124*          300   4   4        164,520           840,273            0.96             0.96             0.97             4:14:06:21
Tai20a    703,482*       10    1   2        380               3,512              0.88             0.92             0.98             0:03:56:52
Tai20b‡   122,455,319*   1     1   0        1                 1                  1.00             1.00             1.00             0:00:18:57
Tai25a    1,167,256*     100   2   3        13,800            523,005            0.97             0.98             0.98             3:13:53:33
Tai25b‡   344,355,646*   1     4   0        1                 1                  1.00             1.00             1.00             0:01:52:43
Tai30b    637,117,113*   60    4   2        870               30,523             0.91             0.93             0.95             2:09:55:17

† Grid symmetry elimination rules were used for these problem instances.
‡ Gap closure was achieved for these problems in 110 and 684 iterations, respectively.
Contributions
1. Developed/implemented a GPU-accelerated RLT2-DA solver.
2. Developed/implemented a parallel branch-and-bound scheme on a GPU cluster to solve the QAP to optimality.
3. Under initial review in INFORMS Journal on Computing.
4. An Approximate Hungarian Solver can speed up DA iterations with minimal impact on the optimality gap.
5. Next steps: Use the Approximate Hungarian Solver in B&B and solve instances with n ≥ 30.
Acknowledgments
The work on "accelerated Hungarian algorithm" has been supported by a Multidisciplinary University Research Initiative (MURI) grant (Number W911NF-09-1-0392) for "Unified Research on Network-based Hard / Soft Information Fusion," issued by the US Army Research Office (ARO) under the program management of Dr. John Lavery. We gratefully appreciate this support.
This research is part of the Blue Waters sustained-petascale computing project, which is supported by the National Science Foundation (awards OCI-0725070 and ACI-1238993) and the state of Illinois. Blue Waters is a joint effort of the University of Illinois at Urbana-Champaign and its National Center for Supercomputing Applications. We gratefully appreciate this support.
Thank You
References
Adams, W. P., Guignard, M., Hahn, P. M., and Hightower, W. L. (2007). A level-2 reformulation–linearization technique bound for the quadratic assignment problem. European Journal of Operational Research, 180(3):983–996.
Adams, W. P. and Johnson, T. A. (1994). Improved linear programming-based lower bounds for the quadratic assignment problem. DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 16:43–77.
Balas, E., Miller, D., Pekny, J., and Toth, P. (1991). A parallel shortest augmenting path algorithm for the assignment problem. Journal of the ACM (JACM), 38(4):985–1004.
Bertsekas, D. P. (1990). The Auction algorithm for assignment and other network flow problems: A tutorial. Interfaces, 20(4):133–149.
Bertsekas, D. P. and Castanon, D. A. (1991). Parallel synchronous and asynchronous implementations of the Auction algorithm. Parallel Computing, 17(6):707–732.
Bertsekas, D. P. and Castanon, D. A. (1993). Parallel asynchronous Hungarian methods for the assignment problem. ORSA Journal on Computing, 5(3):261–274.
Date, K. and Nagi, R. (2016). GPU-accelerated Hungarian algorithms for the Linear Assignment Problem. Parallel Computing, 57:52–72.
Frieze, A. and Yadegar, J. (1983). On the quadratic assignment problem. Discrete Applied Mathematics, 5(1):89–98.
Hahn, P. M., Zhu, Y.-R., Guignard, M., Hightower, W. L., and Saltzman, M. J. (2012). A level-3 reformulation-linearization technique-based bound for the quadratic assignment problem. INFORMS Journal on Computing, 24(2):202–209.
Jonker, R. and Volgenant, A. (1987). A shortest augmenting path algorithm for dense and sparse linear assignment problems. Computing, 38(4):325–340.
Kaufman, L. and Broeckx, F. (1978). An algorithm for the quadratic assignment problem using Benders' decomposition. European Journal of Operational Research, 2(3):207–211.
Koopmans, T. C. and Beckmann, M. (1957). Assignment problems and the location of economic activities. Econometrica: Journal of the Econometric Society, pages 53–76.
Kuhn, H. W. (1955). The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97.
Lawler, E. L. (1963). The quadratic assignment problem. Management Science, 9(4):586–599.
Munkres, J. (1957). Algorithms for the assignment and transportation problems. Journal of the Society for Industrial & Applied Mathematics, 5(1):32–38.
Storøy, S. and Sørevik, T. (1997). Massively parallel augmenting path algorithms for the assignment problem. Computing, 59(1):1–16.
Vasconcelos, C. N. and Rosenhahn, B. (2009). Bipartite graph matching computation on GPU. In Energy Minimization Methods in Computer Vision and Pattern Recognition, pages 42–55. Springer.
Wein, J. M. and Zenios, S. (1990). Massively parallel Auction algorithms for the assignment problem. In Frontiers of Massively Parallel Computation, 1990. Proceedings., 3rd Symposium on the, pages 90–99. IEEE.