Parallel Algorithms for Solving Large Assignment Problems on GPU Clusters

2018 Blue Waters Symposium

Ketan Date, Rakesh Nagi (PI)

Department of Industrial and Enterprise Systems Engineering
University of Illinois at Urbana-Champaign

June 6, 2018


Outline

Assignment Problems: Introduction and Impact

Research Tasks and Role of Blue Waters

The Linear Assignment Problem

The Quadratic Assignment Problem


Introduction

Assignment problems: Fundamental optimization problems in Operations Research that have prominent applications in science and engineering.

Our inability to efficiently solve large instances of these problems can greatly inhibit discovery in these domains.

Objectives: Designing faster, parallel algorithms for the Linear Assignment Problem (LAP) and the Quadratic Assignment Problem (QAP) using GPUs and large computational clusters like Blue Waters.

Future work: Extending the proposed methodology to the Generalized Assignment Problem (GAP), Traveling Salesman Problem (TSP), Vehicle Routing Problem (VRP), and Graph Association/Matching (GA/GM).


Impact of Assignment Problems

1. Data sciences: Data association in information fusion and multi-target tracking.

2. Bioinformatics: Alignment of protein-protein interaction (PPI) networks.

3. Engineering: Facility location, routing and scheduling problems, etc.

[Figures: (a) Association of complex information from heterogeneous data sources using Graph Association Formulation; (b) TSP and VRP; (c) Facility Location.]



Research tasks
1. Develop GPU-accelerated Hungarian algorithm for the LAP.
2. Develop GPU-accelerated Dual Ascent procedure for QAP-RLT2.
3. Couple both algorithms and deploy on Blue Waters to obtain exact solutions to the QAP in a parallel branch-and-bound scheme.

Role of Blue Waters
- Exponential number of tree nodes in branch-and-bound.
- Lower bounding procedure requires solving $O(n^3)$ LAPs and adjusting $O(n^6)$ Lagrange multipliers.
- Solving the benchmark Nug30 QAP required over 1200 XK compute nodes for over 110 hours (15 years' worth of computation).
- We are grateful to Blue Waters and the NCSA staff for providing this invaluable service to the scientific community.
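The 15-year figure for Nug30 follows from multiplying the node count by the wall-clock time. A minimal sketch of that arithmetic, assuming exactly 1200 nodes and 110 hours (the slide gives both as lower bounds) and a 365-day year; the function name `node_years` is illustrative:

```python
HOURS_PER_YEAR = 24 * 365  # assumption: non-leap year


def node_years(nodes: int, hours: float) -> float:
    """Convert a parallel run into equivalent single-node years."""
    return nodes * hours / HOURS_PER_YEAR


node_hours = 1200 * 110
nug30_years = node_years(nodes=1200, hours=110)
print(f"{node_hours} node-hours = {nug30_years:.1f} node-years")
```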


Linear Assignment Problem: Introduction

- Also known as the weighted bipartite matching problem.
- Objective: To minimize the total cost of assigning n resources to n tasks.
- Important subproblem of many NP-hard optimization problems, e.g.,
  - Quadratic Assignment Problem
  - Traveling Salesperson Problem
  - Graph Matching and Association Problems, etc.

\[
\begin{aligned}
\min \quad & \sum_{i=1}^{n} \sum_{j=1}^{n} c_{ij} x_{ij}; \\
\text{s.t.} \quad & \sum_{j=1}^{n} x_{ij} = 1 \quad \forall i = 1, \dots, n; \\
& \sum_{i=1}^{n} x_{ij} = 1 \quad \forall j = 1, \dots, n; \\
& x_{ij} \in \{0, 1\} \quad \forall i, j = 1, \dots, n.
\end{aligned}
\]
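To make the LAP formulation concrete, here is a minimal sketch of a sequential $O(n^3)$ Hungarian-style solver with row and column potentials, checked against brute-force enumeration on a tiny instance. This is a textbook CPU variant written for illustration, not the GPU implementation discussed in this talk; the function names and the 3x3 cost matrix are made up for the example.

```python
from itertools import permutations


def hungarian(cost):
    """Solve a square LAP; returns (assignment, total) with assignment[i] = column of row i."""
    n = len(cost)
    INF = float("inf")
    u = [0] * (n + 1)    # row potentials (1-based)
    v = [0] * (n + 1)    # column potentials (1-based, index 0 is a dummy column)
    p = [0] * (n + 1)    # p[j] = row currently assigned to column j, 0 if none
    way = [0] * (n + 1)  # way[j] = previous column on the shortest alternating path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [INF] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0 = p[j0]
            delta, j1 = INF, 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]  # reduced cost
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(n + 1):  # dual update
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:  # reached an unassigned column: augmenting path found
                break
        while j0:  # augmentation along the stored path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [0] * n
    for j in range(1, n + 1):
        assignment[p[j] - 1] = j - 1
    return assignment, sum(cost[i][assignment[i]] for i in range(n))


def brute_force(cost):
    """Enumerate all n! assignments; only viable for very small n."""
    n = len(cost)
    return min(sum(cost[i][perm[i]] for i in range(n)) for perm in permutations(range(n)))


cost = [[4, 1, 3],
        [2, 0, 5],
        [3, 2, 2]]
assignment, total = hungarian(cost)
print(assignment, total, brute_force(cost))
```

The brute-force check is only a sanity test: its running time grows as $n!$, which is exactly why polynomial-time methods such as the Hungarian algorithm matter for large $n$.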


Literature Review

Sequential algorithms
- Hungarian algorithm [Kuhn, 1955, Munkres, 1957].
- Shortest path algorithms [Jonker and Volgenant, 1987].
- Auction algorithm [Bertsekas, 1990].

Parallel implementations
- Parallel synchronous/asynchronous Hungarian algorithms [Bertsekas and Castanon, 1993].
- Parallel shortest path algorithms [Balas et al., 1991, Storøy and Sørevik, 1997].
- Parallel synchronous/asynchronous Auction algorithm [Wein and Zenios, 1990, Bertsekas and Castanon, 1991].
- Parallel Auction algorithm using GPUs [Vasconcelos and Rosenhahn, 2009].


Sequential Hungarian Algorithm
With opportunities for acceleration

[Flowchart. Steps: Start, Initialization, Partial Assignment, Optimality Check, Augmenting Path Search, Augmentation, Dual Update, End. Decisions: All Assigned? (Yes/No), Augmenting Path Found? (Yes/No). Annotations: high-granularity steps are scalable to multiple GPUs; low-granularity steps are executed on a single GPU.]


Accelerated Hungarian: Augmenting Path Search (Forward Pass)

- Goal is to find vertex-disjoint augmenting paths from unassigned rows to unassigned columns.
- In the forward pass, threads traverse the graph one hop at a time and construct augmenting trees.
- More than one augmenting tree may be found per iteration.
- Due to the race condition between threads, the trees are guaranteed to be vertex-disjoint (our innovation).

[Figure: forward pass on a bipartite graph with rows R1-R4 and columns C1-C4, one thread per frontier row (Threads 11, 12, 21, 22).
BFS Iteration 1. Frontier: R2 and R3. New Frontier: R1 and R4.
BFS Iteration 2. Frontier: R1 and R4. New Frontier: none.
Augmenting path(s) found: R3-C2-R1-C1 and R3-C2-R1-C4.]
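The forward pass can be illustrated with a sequential simulation of the level-synchronous BFS. This is a sketch under stated assumptions, not the GPU kernel: the edge list and initial matching below are a hypothetical graph chosen so that the frontiers and paths match the example on this slide, and the "first claim wins" check on each column stands in for the thread race that keeps the trees vertex-disjoint.

```python
def forward_pass(adj, row_of_col, free_rows):
    """Grow alternating BFS trees from unassigned rows, one hop per iteration.

    adj[r]        -> columns reachable from row r (hypothetical admissible edges)
    row_of_col[c] -> row currently assigned to column c (absent if c is unassigned)
    """
    pred_row = {}                 # column -> row that claimed it first
    visited_rows = set(free_rows)
    frontier = list(free_rows)
    levels, leaves = [], []
    while frontier:
        levels.append(list(frontier))
        next_frontier = []
        for r in frontier:        # in the parallel version: one thread per frontier row
            for c in adj[r]:
                if c in pred_row:  # column already claimed by some tree
                    continue
                pred_row[c] = r
                owner = row_of_col.get(c)
                if owner is None:
                    leaves.append(c)  # unassigned column: an augmenting path ends here
                elif owner not in visited_rows:
                    visited_rows.add(owner)
                    next_frontier.append(owner)
        frontier = next_frontier
    return pred_row, leaves, levels


def path_to_root(leaf, pred_row, row_of_col):
    """Walk predecessors from a leaf column back to its unassigned root row."""
    col_of_row = {r: c for c, r in row_of_col.items()}
    path, c = [leaf], leaf
    while True:
        r = pred_row[c]
        path.append(r)
        if r not in col_of_row:   # unassigned row: root of the tree
            return path[::-1]
        c = col_of_row[r]
        path.append(c)


# Hypothetical instance: R1-C2 and R4-C3 are assigned; R2 and R3 are unassigned.
adj = {"R1": ["C1", "C2", "C4"], "R2": ["C3"], "R3": ["C2"], "R4": ["C3"]}
row_of_col = {"C2": "R1", "C3": "R4"}
pred_row, leaves, levels = forward_pass(adj, row_of_col, ["R2", "R3"])
paths = [path_to_root(c, pred_row, row_of_col) for c in leaves]
print(levels)
print(paths)
```

Both paths share R3, C2, and R1, so at most one of them can be used for augmentation; selecting the survivor is the job of the reverse pass.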


Accelerated Hungarian: Reverse Pass and Augmentation

- Reverse pass is performed to extract augmenting paths from the augmenting trees (each leaf vertex is processed by one thread).
- Due to the "race" condition, only one path survives per tree (our innovation).
- All such paths can be used to augment the current solution and increase the number of assignments.

[Figure: reverse pass and augmentation on the bipartite graph with rows R1-R4 and columns C1-C4, processed by Threads 21 and 22.
Reverse Pass. Survivor Path: C1-R1-C2-R3.
Augmentation. Number of assignments increased by 1.]
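A sequential sketch of the reverse pass and augmentation follows. The input data are hypothetical values consistent with the slide's example (two leaves, C1 and C4, hanging off the same tree rooted at R3). In a real race the surviving path is arbitrary; this sketch uses a first-writer-wins rule so that C1-R1-C2-R3 survives, as in the figure. All names are illustrative.

```python
def reverse_pass(leaves, pred_row, row_of_col):
    """From each leaf column, walk back to the root and record, per row, the
    column it should take. Only the first claim on a row is kept, so exactly
    one path survives per tree."""
    col_of_row = {r: c for c, r in row_of_col.items()}
    succ_col = {}
    for leaf in leaves:               # in the parallel version: one thread per leaf
        c = leaf
        while True:
            r = pred_row[c]
            succ_col.setdefault(r, c)  # first writer wins
            if r not in col_of_row:    # reached the unassigned root row
                break
            c = col_of_row[r]
    return succ_col


def augment(free_rows, succ_col, row_of_col):
    """Flip assignments along each surviving path, starting from the root rows."""
    new_row_of_col = dict(row_of_col)
    for root in free_rows:
        r = root
        while r is not None and r in succ_col:
            c = succ_col[r]
            displaced = row_of_col.get(c)  # previous owner of c, None if c was free
            new_row_of_col[c] = r
            r = displaced
    return new_row_of_col


# Hypothetical forward-pass output: R1-C2 and R4-C3 assigned, R2 and R3 unassigned.
row_of_col = {"C2": "R1", "C3": "R4"}
pred_row = {"C3": "R2", "C2": "R3", "C1": "R1", "C4": "R1"}
leaves = ["C1", "C4"]

succ_col = reverse_pass(leaves, pred_row, row_of_col)
new_row_of_col = augment(["R2", "R3"], succ_col, row_of_col)
print(succ_col)
print(new_row_of_col)
```

R2's tree contains no unassigned column, so R2 stays unassigned in this iteration and the number of assignments grows from 2 to 3.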


Computational Experiments

Experimental Setup
- Small Scale: n = 500 to n = 5000 in increments of 500. Cost matrix of randomly generated integers in [0, n].
- Large Scale: n = 5000 to n = 20000 in increments of 5000. Cost matrix of randomly generated integers in [0, n/10], [0, n], and [0, 10n].

Hardware details
- Computational resources from the Blue Waters Supercomputing Facility at the University of Illinois at Urbana-Champaign.
- CPU: AMD Interlagos 6376, 2.30 GHz clock speed, and 32 GB memory.
- GPU: NVIDIA GK110 "Kepler" K20X, with 2688 processor cores and 6 GB memory.


Execution Times and Speedup Profiles (Small Scale)


Execution Times and Speedup Profiles (Large Scale)


Contributions

1. Developed parallel versions of two variants of the Hungarian algorithm for solving the LAP on GPU(s).

2. Accelerated algorithms leverage the "race condition" to find multiple vertex-disjoint augmenting paths.

3. Single-GPU variant can solve problems with up to 400 million variables in 13 seconds.

4. Multi-GPU variant can solve problems with up to 1.6 billion variables in 24 seconds.
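The variable counts relate to problem size through the $n \times n$ assignment matrix: an LAP of size n has $n^2$ binary variables. A small sketch of that conversion; note that n = 40,000 for the multi-GPU case is inferred from the 1.6 billion figure and is not stated on the slides.

```python
from math import isqrt


def lap_variables(n: int) -> int:
    """Number of x_ij variables in an n x n LAP."""
    return n * n


def lap_size(variables: int) -> int:
    """Problem size n for a given variable count (assumes a perfect square)."""
    return isqrt(variables)


single_gpu_n = lap_size(400_000_000)
multi_gpu_n = lap_size(1_600_000_000)   # inferred, not stated in the source
print(single_gpu_n, multi_gpu_n)
```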


Quadratic Assignment Problem: Introduction

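- Introduced by [Koopmans and Beckmann, 1957] as a mathematical model for facility location.
- Objective: To place n facilities on n locations such that the total cost (distance times flow) is minimized.
- Strongly NP-hard problem. No polynomial-time optimal or ε-optimal algorithm exists unless P = NP.

\[
\begin{aligned}
\min \quad & \sum_{i=1}^{n} \sum_{p=1}^{n} b_{ip} x_{ip} + \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{p=1}^{n} \sum_{q=1}^{n} f_{ij} d_{pq} x_{ip} x_{jq} \\
\text{s.t.} \quad & \sum_{i=1}^{n} x_{ip} = 1 \quad \forall p = 1, \dots, n; \\
& \sum_{p=1}^{n} x_{ip} = 1 \quad \forall i = 1, \dots, n; \\
& x_{ip} \in \{0, 1\} \quad \forall i, p = 1, \dots, n.
\end{aligned}
\]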
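The QAP objective is easiest to read when the assignment is written as a permutation, with `perm[i]` the location of facility i. The sketch below evaluates the objective and finds the optimum of a tiny instance by enumeration. The flow and distance matrices are made-up illustrative data, and the optional `linear` argument plays the role of the $b_{ip}$ term.

```python
from itertools import permutations


def qap_cost(perm, flow, dist, linear=None):
    """Objective value when facility i is placed at location perm[i]."""
    n = len(perm)
    total = 0
    for i in range(n):
        if linear is not None:
            total += linear[i][perm[i]]                 # b_ip term
        for j in range(n):
            total += flow[i][j] * dist[perm[i]][perm[j]]  # f_ij * d_pq term
    return total


def qap_brute_force(flow, dist, linear=None):
    """Enumerate all n! placements; only viable for very small n."""
    n = len(flow)
    return min((qap_cost(p, flow, dist, linear), p) for p in permutations(range(n)))


flow = [[0, 5, 2],
        [5, 0, 3],
        [2, 3, 0]]
dist = [[0, 1, 4],
        [1, 0, 2],
        [4, 2, 0]]
best_cost, best_perm = qap_brute_force(flow, dist)
print(best_cost, best_perm)
```

Enumeration is hopeless beyond a dozen or so facilities, which motivates the lower-bounding and branch-and-bound machinery in the rest of the talk.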


Linearization Models

- One of the ways to solve QAP is to convert it into an MILP and use branch-and-bound.
- Linearization is accomplished by introducing a large number of variables and constraints.
- LP relaxation provides a valid lower bound on QAP.
- Can be used to fathom nodes in the branch-and-bound tree.

Linearization Model               Binary Variables   Continuous Variables   Constraints
[Lawler, 1963]                    O(n^4)             -                      O(n^4)
[Kaufman and Broeckx, 1978]       O(n^2)             O(n^2)                 O(n^2)
[Frieze and Yadegar, 1983]        O(n^2)             O(n^4)                 O(n^4)
[Adams and Johnson, 1994] RLT1    O(n^2)             O(n^4)                 O(n^4)
[Adams et al., 2007] RLT2         O(n^2)             O(n^6)                 O(n^6)
[Hahn et al., 2012] RLT3          O(n^2)             O(n^8)                 O(n^8)


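RLT2, DLRLT2, and Staged LAP Solution

RLT2:

\[
\begin{aligned}
\min \quad & \sum_{i} \sum_{p} b_{ip} x_{ip} + \sum_{i} \sum_{j \neq i} \sum_{p} \sum_{q \neq p} C_{ijpq}\, y_{ijpq} + \sum_{i} \sum_{j \neq i} \sum_{k \neq i,j} \sum_{p} \sum_{q \neq p} \sum_{r \neq p,q} D_{ijkpqr}\, z_{ijkpqr}; \\
\text{s.t.} \quad & \sum_{p} x_{ip} = 1, \quad \forall i; \\
& \sum_{i} x_{ip} = 1, \quad \forall p; \\
& \sum_{q \neq p} y_{ijpq} = x_{ip}, \quad \forall (i,j,p) : i \neq j; \\
& \sum_{j \neq i} y_{ijpq} = x_{ip}, \quad \forall (i,p,q) : p \neq q; \\
& \sum_{r \neq p,q} z_{ijkpqr} = y_{ijpq}, \quad \forall (i,j,k,p,q) : i \neq j \neq k,\ p \neq q; \\
& \sum_{k \neq i,j} z_{ijkpqr} = y_{ijpq}, \quad \forall (i,j,p,q,r) : i \neq j,\ p \neq q \neq r; \\
& z_{ijkpqr} = z_{ikjprq} = z_{jikqpr} = z_{jkiqrp} = z_{kijrpq} = z_{kjirqp}, \quad \forall (i,j,k,p,q,r) : i < j < k,\ p \neq q \neq r; \\
& x_{ip} \in \{0, 1\}, \quad \forall i, p; \\
& y_{ijpq} \geq 0, \quad \forall (i,j,p,q) : i \neq j,\ p \neq q; \\
& z_{ijkpqr} \geq 0, \quad \forall (i,j,k,p,q,r) : i \neq j \neq k,\ p \neq q \neq r.
\end{aligned}
\]

DLRLT2(v):

\[
\begin{aligned}
\max \quad & \sum_{i} \alpha_{i} + \sum_{p} \beta_{p}; \\
\text{s.t.} \quad & \alpha_{i} + \beta_{p} - \sum_{j \neq i} \gamma_{ijp} - \sum_{q \neq p} \delta_{ipq} \leq b_{ip}, \quad \forall i, p; \\
& \gamma_{ijp} + \delta_{ipq} - \sum_{k \neq i,j} \xi_{ijkpq} - \sum_{r \neq p,q} \psi_{ijpqr} \leq C_{ijpq}, \quad \forall (i,j,p,q) : i \neq j,\ p \neq q; \\
& \xi_{ijkpq} + \psi_{ijpqr} \leq D_{ijkpqr} - v_{ijkpqr}, \quad \forall (i,j,k,p,q,r) : i \neq j \neq k,\ p \neq q \neq r; \\
& \alpha_{i}, \beta_{p}, \gamma_{ijp}, \delta_{ipq}, \xi_{ijkpq}, \psi_{ijpqr} \ \text{unrestricted}, \quad \forall (i,j,k,p,q,r) : i \neq j \neq k,\ p \neq q \neq r.
\end{aligned}
\]

Staged LAP solution:
- Stage 1: $n^2 (n-1)^2$ Z-LAPs
- Stage 2: $n^2$ Y-LAPs
- Stage 3: 1 X-LAP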
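To get a feel for the scale of the staged solution, the stage counts can be evaluated for a concrete size. A minimal sketch using the counts stated on this slide, with n = 30 (the size of Nug30) as the example; the function name is illustrative.

```python
def rlt2_lap_counts(n: int) -> dict:
    """Number of LAPs solved per stage of one RLT2 dual-ascent pass."""
    return {
        "Z": n**2 * (n - 1) ** 2,  # Stage 1
        "Y": n**2,                 # Stage 2
        "X": 1,                    # Stage 3
    }


counts = rlt2_lap_counts(30)
print(counts, sum(counts.values()))
```

For n = 30 this gives 756,900 Z-LAPs per pass, which is why the independent LAPs in each stage are solved in parallel.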


LDRLT2 and Lagrangian Dual Ascent

- For a fixed value of v, DLRLT2 is decomposable into a large number of LAPs (which can be solved in parallel).
- Lagrangian Dual Problem, LDRLT2: $\max_{v} \text{DLRLT2}(v)$
- $\nu(\text{DLRLT2}(v)) \leq \nu(\text{LDRLT2}) \leq \nu(\text{LPRLT2}) \leq \nu(\text{RLT2}) = \nu(\text{QAP})$
- After solving DLRLT2($v^m$) at the m-th iteration, $v^m$ can be adjusted systematically to obtain an improved lower bound.
- Highly parallelizable on GPUs.

\[
v^{m+1}_{ijkpqr} = v^{m}_{ijkpqr} + \kappa_{z}\, \pi(z_{ijkpqr}) + \kappa_{y}\, \frac{\pi(y_{ijpq})}{(n-2)} + \kappa_{x}\, \frac{\pi(x_{ip})}{(n-1)(n-2)} - \Omega_{ijkpqr},
\]

where $0 \leq \kappa \leq 1$ and $\Omega_{ijkpqr}$ is a function of the dual slacks $\pi(x)$, $\pi(y)$, $\pi(z)$.
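The multiplier update can be sketched for a single index tuple. This assumes the update is read with the y-slack divided by (n - 2) and the x-slack divided by (n - 1)(n - 2), which is a reconstruction of the slide's layout. The slide does not define $\Omega$, so it is passed in as a plain number here; all parameter names are illustrative.

```python
def update_multiplier(v, slack_z, slack_y, slack_x, n,
                      kappa_z, kappa_y, kappa_x, omega):
    """One dual-ascent step for a single v_ijkpqr.

    slack_z, slack_y, slack_x are the dual slacks pi(z), pi(y), pi(x);
    each kappa is a step fraction in [0, 1]; omega is supplied by the caller.
    """
    for kappa in (kappa_z, kappa_y, kappa_x):
        if not 0.0 <= kappa <= 1.0:
            raise ValueError("each kappa must lie in [0, 1]")
    return (v
            + kappa_z * slack_z
            + kappa_y * slack_y / (n - 2)
            + kappa_x * slack_x / ((n - 1) * (n - 2))
            - omega)


v_next = update_multiplier(v=0.0, slack_z=2.0, slack_y=3.0, slack_x=12.0, n=5,
                           kappa_z=1.0, kappa_y=1.0, kappa_x=1.0, omega=0.0)
print(v_next)
```

Each of the $O(n^6)$ multipliers is updated independently of the others, which is what makes this step a natural fit for one-thread-per-multiplier execution on a GPU.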


RLT2-DA
With opportunities for parallelization

[Flowchart. Steps: Start, Initialization, Z Dual ascent Cost update, Solve Z-LAP, Z Solution Transfer, Y Dual ascent Cost update, Solve Y-LAP, Y Solution Transfer, X Cost update, Solve X-LAP, End. Decision: Stopping Criteria Satisfied? (YES/NO). Annotations mark steps as High granularity or Low granularity.]


Accelerated RLT2-DA

[Diagram: CPU(0), CPU(1), ..., CPU(K-1) communicate via MPI. Each CPU(k) drives GPU(k), which performs Initialization + Dual Ascent + Z-TLAP Solve + Y-TLAP Solve + X-LAP Solve + Feasibility Check.]

Table: Minimum number of GPUs for various problem sizes.

n                    ≤ 27   30   35   40   42
Minimum # of GPUs    1      2    4    7    15


Tiled Linear Assignment Problems (TLAP)

- LAPs are solved using our GPU-accelerated Hungarian algorithm [Date and Nagi, 2016].
- LAPs owned by a particular GPU are combined and solved as a tiled LAP (or TLAP).
- In practice, solving a TLAP is much faster than solving its LAPs individually.

[Figure: LAP(0), LAP(1), ..., LAP(M-1) are combined into TLAP(0); LAP(M), LAP(M+1), ..., LAP(2M-1) into TLAP(1); ...; LAP(M²-M), LAP(M²-M+1), ..., LAP(M²-1) into TLAP(M-1).]
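The slides do not specify how a TLAP is laid out in memory. As a purely hypothetical illustration of the tiling idea, the sketch below packs M equally sized cost matrices into one flat buffer and addresses entries by (LAP, row, col), so that a single pass (for example, the row-minimum step of an initial reduction) can cover every tile at once instead of launching one operation per LAP.

```python
def tile_laps(cost_matrices):
    """Pack M cost matrices of size n x n into one flat buffer (tile-major, row-major)."""
    m = len(cost_matrices)
    n = len(cost_matrices[0])
    flat = [c for mat in cost_matrices for row in mat for c in row]
    return flat, m, n


def tlap_index(lap, row, col, n):
    """Flat offset of entry (row, col) of tile `lap`."""
    return (lap * n + row) * n + col


def all_row_minima(flat, m, n):
    """One sweep over the tiled buffer: row minima for every row of every tile."""
    return [min(flat[(k * n + r) * n:(k * n + r + 1) * n])
            for k in range(m) for r in range(n)]


laps = [[[1, 2], [3, 4]],
        [[5, 6], [7, 8]]]
flat, m, n = tile_laps(laps)
print(flat[tlap_index(1, 0, 1, n)], all_row_minima(flat, m, n))
```

On a GPU, the motivation for this kind of packing is that many small LAPs individually underuse the device, while one large batched operation can keep all cores busy; the exact layout used by the authors may differ.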


Parallel Branch-and-bound

[Figure: branch-and-bound tree of depth d, with levels x11 = 1, x22 = 1, ..., xdd = 1 and sibling nodes x1(.) = 1, x2(.) = 1, ..., xd(.) = 1. Nodes are distributed across PE Bank 0 through PE Bank (N-1). Search strategies shown: Best First (Priority Queue) and Depth First (Stack).]

- 1 Processing Element (PE) contains 1 CPU and 1 GPU.
- Initial upper bound established using a local search heuristic.
- Each PE bank executes RLT2-DA on a single node and establishes a lower bound.
- Tree expanded using the "polytomic" branching rule.
- PEs synchronized after 300 seconds and work is redistributed if needed.
- Branching Rule: Branch on the facility that has the lowest total flow with the previously placed facilities.
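The branching rule can be sketched as follows. This is an illustration under assumptions: "total flow" is taken to be the sum of flows in both directions between a candidate and the already placed facilities, ties go to the lowest facility index, and polytomic branching is taken to mean one child node per free location for the chosen facility. The flow matrix is made-up data.

```python
def choose_branching_facility(flow, placed):
    """Pick the unplaced facility with the lowest total flow to placed ones.

    placed: dict mapping facility -> location for the current tree node.
    """
    unplaced = [i for i in range(len(flow)) if i not in placed]
    return min(unplaced,
               key=lambda i: sum(flow[i][j] + flow[j][i] for j in placed))


def polytomic_children(flow, placed):
    """Create one child per free location for the chosen facility."""
    facility = choose_branching_facility(flow, placed)
    used = set(placed.values())
    free_locations = [p for p in range(len(flow)) if p not in used]
    return [{**placed, facility: p} for p in free_locations]


flow = [[0, 3, 1, 4],
        [3, 0, 2, 6],
        [1, 2, 0, 5],
        [4, 6, 5, 0]]
node = {0: 2}                       # facility 0 already placed at location 2
children = polytomic_children(flow, node)
print(choose_branching_facility(flow, node), children)
```

Each child fixes one more $x_{ip} = 1$, and its lower bound would then be computed by RLT2-DA before deciding whether to fathom or expand it.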


Experimental Results

                                              Nodes                 PE Utilization       Time
Problem   UB             N     K   ℓ_init     Initial    Explored   Min    Avg    Max    (d:hh:mm:ss)
Nug20†    2570*          4     1   2          98         134        0.77   0.89   0.99   0:00:38:41
Nug22     3596*          10    1   2          462        622        0.82   0.91   0.99   0:01:34:03
Nug25†    3744*          50    2   3          1,755      3,868      0.81   0.90   0.97   0:02:44:24
Nug27     5234*          100   2   3          17,550     55,761     0.97   0.98   0.99   1:02:28:32
Nug30†    6124*          300   4   4          164,520    840,273    0.96   0.96   0.97   4:14:06:21
Tai20a    703,482*       10    1   2          380        3,512      0.88   0.92   0.98   0:03:56:52
Tai20b‡   122,455,319*   1     1   0          1          1          1.00   1.00   1.00   0:00:18:57
Tai25a    1,167,256*     100   2   3          13,800     523,005    0.97   0.98   0.98   3:13:53:33
Tai25b‡   344,355,646*   1     4   0          1          1          1.00   1.00   1.00   0:01:52:43
Tai30b    637,117,113*   60    4   2          870        30,523     0.91   0.93   0.95   2:09:55:17

† Grid symmetry elimination rules were used for these problem instances.
‡ Gap closure was achieved for these problems in 110 and 684 iterations, respectively.


Contributions

1. Developed/implemented GPU-accelerated RLT2-DA solver.
2. Developed/implemented parallel branch-and-bound scheme on a GPU cluster to solve QAP to optimality.
3. Under initial review in INFORMS Journal on Computing.
4. Approximate Hungarian Solver can speed up DA iterations with minimal impact on optimality gap.
5. Next steps: Use Approximate Hungarian in B&B and solve instances with n ≥ 30.


Acknowledgments

The work on "accelerated Hungarian algorithm" has been supported by a Multidisciplinary University Research Initiative (MURI) grant (Number W911NF-09-1-0392) for "Unified Research on Network-based Hard / Soft Information Fusion," issued by the US Army Research Office (ARO) under the program management of Dr. John Lavery. We gratefully acknowledge this support.

This research is part of the Blue Waters sustained-petascale computing project, which is supported by the National Science Foundation (awards OCI-0725070 and ACI-1238993) and the state of Illinois. Blue Waters is a joint effort of the University of Illinois at Urbana-Champaign and its National Center for Supercomputing Applications. We gratefully acknowledge this support.


Thank You


References

Adams, W. P., Guignard, M., Hahn, P. M., and Hightower, W. L. (2007). A level-2 reformulation-linearization technique bound for the quadratic assignment problem. European Journal of Operational Research, 180(3):983–996.

Adams, W. P. and Johnson, T. A. (1994). Improved linear programming-based lower bounds for the quadratic assignment problem. DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 16:43–77.

Balas, E., Miller, D., Pekny, J., and Toth, P. (1991). A parallel shortest augmenting path algorithm for the assignment problem. Journal of the ACM (JACM), 38(4):985–1004.

Bertsekas, D. P. (1990). The Auction algorithm for assignment and other network flow problems: A tutorial. Interfaces, 20(4):133–149.

Bertsekas, D. P. and Castanon, D. A. (1991). Parallel synchronous and asynchronous implementations of the Auction algorithm. Parallel Computing, 17(6):707–732.

Bertsekas, D. P. and Castanon, D. A. (1993). Parallel asynchronous Hungarian methods for the assignment problem. ORSA Journal on Computing, 5(3):261–274.

Date, K. and Nagi, R. (2016). GPU-accelerated Hungarian algorithms for the Linear Assignment Problem. Parallel Computing, 57:52–72.

Frieze, A. and Yadegar, J. (1983). On the quadratic assignment problem. Discrete Applied Mathematics, 5(1):89–98.

Hahn, P. M., Zhu, Y.-R., Guignard, M., Hightower, W. L., and Saltzman, M. J. (2012). A level-3 reformulation-linearization technique-based bound for the quadratic assignment problem. INFORMS Journal on Computing, 24(2):202–209.

Jonker, R. and Volgenant, A. (1987). A shortest augmenting path algorithm for dense and sparse linear assignment problems. Computing, 38(4):325–340.

Kaufman, L. and Broeckx, F. (1978). An algorithm for the quadratic assignment problem using Bender's decomposition. European Journal of Operational Research, 2(3):207–211.

Koopmans, T. C. and Beckmann, M. (1957). Assignment problems and the location of economic activities. Econometrica: Journal of the Econometric Society, pages 53–76.

Kuhn, H. W. (1955). The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97.

Lawler, E. L. (1963). The quadratic assignment problem. Management Science, 9(4):586–599.

Munkres, J. (1957). Algorithms for the assignment and transportation problems. Journal of the Society for Industrial & Applied Mathematics, 5(1):32–38.

Storøy, S. and Sørevik, T. (1997). Massively parallel augmenting path algorithms for the assignment problem. Computing, 59(1):1–16.

Vasconcelos, C. N. and Rosenhahn, B. (2009). Bipartite graph matching computation on GPU. In Energy Minimization Methods in Computer Vision and Pattern Recognition, pages 42–55. Springer.

Wein, J. M. and Zenios, S. (1990). Massively parallel Auction algorithms for the assignment problem. In Frontiers of Massively Parallel Computation, 1990. Proceedings., 3rd Symposium on the, pages 90–99. IEEE.
