Hardware-based Job Queue Management for Manycore Architectures and OpenMP Environments

Junghee Lee, Chrysostomos Nicopoulos, Yongjae Lee, Hyung Gyu Lee and Jongman Kim

Presented by Junghee Lee

http://www.ucy.ac.cy/goto/mainportal/en-US/HOME.aspx

Introduction

• Manycore systems
  – Number of cores is increasing
• Challenges in scalability
  – Memory
  – Power consumption
  – Cache coherence protocol
  – Load balancing

Contents

• Introduction
• Background
  – Programming models
  – Motivation
• IsoNet
• Fault-tolerance
• Evaluation
• Conclusion

Programming Models

• Parallel programming models
  – MPI
  – OpenMP
• Fine-grained parallelism
  – Emerging applications: Recognition, Mining and Synthesis
  – Execution time of each computation kernel is very short, but it has abundant parallelism
  – Excessive overhead in multithreading

Job Queuing

• Creates jobs instead of threads
  – One thread per core is created
  – Thread: a set of instructions and states of execution
  – Job: a set of data that is processed by a thread
• Job queue
  – Manages the list of jobs
  – Maintains load balance

[Figure: two CPUs, each running one Thread, with a list of Jobs]
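The job-queuing model on this slide can be sketched in software as follows. This is only an illustration of the structure (a fixed pool of threads pulling jobs from one queue), not the hardware mechanism the talk proposes. The names `run_jobs` and `kernel` are hypothetical, and Python threads are used for clarity rather than performance.

```python
import os
import queue
import threading


def run_jobs(jobs, kernel, num_workers=None):
    """Process every job with a fixed pool of threads (one per core by default)."""
    num_workers = num_workers or os.cpu_count() or 1

    # The job queue holds data items (jobs), not threads.
    job_queue = queue.Queue()
    for job in jobs:
        job_queue.put(job)

    results = []
    results_lock = threading.Lock()

    def worker():
        # Each long-lived thread keeps pulling jobs until the queue is empty.
        while True:
            try:
                job = job_queue.get_nowait()
            except queue.Empty:
                return
            out = kernel(job)
            with results_lock:
                results.append(out)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because every worker takes its next job from the same queue, a faster worker simply takes more jobs, which is how the queue maintains load balance. It is also the single point where workers contend.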

Conflicts in Job Queue

• Chance of conflicts increases as:
  – The number of cores increases
  – The time taken to update the job queue increases
  – The job queue is accessed more frequently (job is short)
• Previous approaches
  – Distributed queues
    • Load balance is maintained by job-stealing
    • The chance of conflicts in one local queue is decreased
  – Hardware implementation
    • Time spent on updating the queue is reduced
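The distributed-queue approach with job-stealing can be sketched as a small deterministic simulation. The victim policy here (steal from the longest queue) and the round-robin scheduling in `simulate` are assumptions made for illustration; the slides do not say which policy the compared systems use.

```python
from collections import deque


def next_job(queues, me):
    """Return the next job for worker `me`, stealing if its local queue is empty."""
    if queues[me]:
        return queues[me].popleft()  # local work first: no contention with others
    # Hypothetical victim policy: steal from the currently longest queue.
    victim = max(range(len(queues)), key=lambda i: len(queues[i]))
    if queues[victim]:
        return queues[victim].pop()  # steal from the opposite end of the victim
    return None  # no work left anywhere


def simulate(queues):
    """Let the workers take turns until all queues drain; count jobs per worker."""
    done = [0] * len(queues)
    while any(queues):
        for me in range(len(queues)):
            if next_job(queues, me) is not None:
                done[me] += 1
    return done
```

For example, `simulate([deque(range(8)), deque(), deque(), deque()])` returns `[2, 2, 2, 2]`: all eight jobs start on worker 0, and stealing spreads them evenly across the four workers.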

Profile of SMVM

[Figure: ratio of execution time (0 to 1.0) spent on Conflicts, Stealing job and Processing job, versus number of cores (4, 8, 16, 32, 64, 128, 256)]

Objectives

• Requirements of load balancer
  – Scalability: conflict-free
  – Fault-tolerance
    • The probability of faults increases exponentially as technology scales
• Contributions of this paper
  – Lightweight micro-network for load balancing
  – Scalable even with more than a thousand cores
  – Comprehensive fault-tolerance support

Contents

• Introduction
• Background
• IsoNet
  – Architecture
  – Implementation
• Fault-tolerance
• Evaluation
• Conclusion

System View

[Figure: six CPU tiles, each paired with a block labeled R and a block labeled I]

Microarchitecture of IsoNet Node

[Figure: block diagram of an IsoNet node with a Max Selector and a Min Selector (comparators and MUXes over JobCount and Job inputs), a Switch (MUX and DEMUX) and a Dual Clock Stack]
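A software analogue of the Max Selector and Min Selector may help in reading the diagram. The sketch below assumes that each node compares its own job count with those reported by its children in a tree, so that the root learns the most-loaded and least-loaded nodes, and that one job is then moved from the former to the latter. The slides label these blocks but do not spell out the behavior, so the functions `select` and `balance_step`, and the rule for when to transfer, are hypothetical.

```python
def select(tree, counts, node, better):
    """Reduce a subtree to its best (job count, node id) pair."""
    best = (counts[node], node)
    for child in tree.get(node, []):
        cand = select(tree, counts, child, better)
        if better(cand[0], best[0]):  # comparator + MUX: keep the better input
            best = cand
    return best


def balance_step(tree, counts, root=0):
    """Move one job from the most-loaded node to the least-loaded node."""
    _, src = select(tree, counts, root, lambda a, b: a > b)  # max selector
    _, dst = select(tree, counts, root, lambda a, b: a < b)  # min selector
    if counts[src] - counts[dst] > 1:  # transfer only if it reduces imbalance
        counts[src] -= 1
        counts[dst] += 1
    return src, dst
```

With `tree = {0: [1, 2], 1: [3, 4]}` and `counts = [1, 0, 5, 1, 1]`, one call to `balance_step(tree, counts)` selects node 2 as source and node 1 as destination, leaving `counts == [1, 1, 4, 1, 1]`.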

How It Works

[Figure: animation frames of the IsoNet nodes, each annotated with a number (mostly 1, with some 0 and 2)]

Tree-based routing: for fault-tolerance

Single Cycle Implementation

• Estimated critical path delay
  – 11.38 ns (87.8 MHz)
  – By Elmore delay model
• Single cycle implementation offers low hardware cost

[Figure: critical path between Src node and Dest node, passing through a Leaf node, Int. nodes, the Root node and switches (Swt)]
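The quoted clock frequency follows from the critical path delay, assuming the whole path must settle within one clock period:

```latex
f_{\max} = \frac{1}{t_{\text{crit}}} = \frac{1}{11.38\ \text{ns}} \approx 87.87\ \text{MHz}
```

The slide quotes this as 87.8 MHz.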

Hardware Cost Estimation

Component            Gate count   Inst count
DCStack                     204         1024
Selector
  Leaf                        0           64
  1 Child                   110          928
  2 Children                256            2
  3 Children                480           29
  4 Children                682            1
Switch                      356         1024
Root                         59            1
Total                    674.50

674.50 * 240 * 4 = 647.52 K = 0.046% of 1.4 B (NVIDIA GTX285)
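The arithmetic on this slide can be checked with a few lines. The sketch assumes the table is read as (gate count, instance count) pairs for a 1024-node network and that "Total" is the average gate count per node. The factors 240 and 4 are taken verbatim from the slide, which does not explain them; 240 matches the GTX285 core count, and 4 is presumably transistors per gate.

```python
# (gate count, instance count) per component, as read from the slide's table.
components = {
    "DCStack": (204, 1024),
    "Selector, leaf": (0, 64),
    "Selector, 1 child": (110, 928),
    "Selector, 2 children": (256, 2),
    "Selector, 3 children": (480, 29),
    "Selector, 4 children": (682, 1),
    "Switch": (356, 1024),
    "Root": (59, 1),
}

NODES = 1024  # assumption: the instance counts describe a 1024-node network

# Average gate count per node, weighted by how often each variant occurs.
gates_per_node = sum(g * n for g, n in components.values()) / NODES

# Factors copied from the slide (meaning assumed, see the note above).
total = round(gates_per_node, 2) * 240 * 4
share = total / 1.4e9  # fraction of the GTX285's 1.4 B transistors

print(f"{gates_per_node:.2f} gates per node")
print(f"{total / 1e3:.2f} K, {share * 100:.3f}% of 1.4 B")
```

The weighted average reproduces the 674.50 in the table, and the product reproduces the 647.52 K and 0.046% quoted on the slide.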

Contents

• Introduction
• Background
• IsoNet
• Fault-tolerance
  – Transparent mode
  – Reconfiguration mode
• Evaluation
• Conclusion

Supporting Fault-Tolerance

• Transparent mode
  – For faulty CPUs
  – Bypass the corresponding IsoNet node
• Reconfiguration mode
  – For faulty IsoNet node
  – Operation
    • When a fault is detected, all IsoNet nodes go into the reconfiguration mode
    • Reconfigure the topology of IsoNet so that the faulty node is excluded
    • Assign a new root node if the root node fails
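The reconfiguration steps can be illustrated with a breadth-first rebuild of the routing tree. This is a sketch under stated assumptions, not the hardware procedure from the talk: it assumes the nodes sit on a 2D mesh, that a list of root candidates is given in priority order, and that a plain breadth-first search is an acceptable stand-in for whatever the hardware does. The function `reconfigure` and its parameters are hypothetical.

```python
from collections import deque


def reconfigure(width, height, faulty, root_candidates):
    """Rebuild a spanning tree over the healthy nodes of a width x height mesh.

    Returns (root, parent), where parent maps every reachable healthy node to
    its parent in the new tree and the root maps to None.
    Raises StopIteration if every root candidate is faulty.
    """
    # Assign a new root if the preferred one has failed.
    root = next(c for c in root_candidates if c not in faulty)

    parent = {root: None}
    frontier = deque([root])
    while frontier:
        x, y = frontier.popleft()
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            nx, ny = nxt
            inside = 0 <= nx < width and 0 <= ny < height
            # Faulty nodes are never added, so the new topology excludes them.
            if inside and nxt not in faulty and nxt not in parent:
                parent[nxt] = (x, y)
                frontier.append(nxt)
    return root, parent
```

For a 3 x 3 mesh whose center node `(1, 1)` is both faulty and the preferred root, `reconfigure(3, 3, {(1, 1)}, [(1, 1), (0, 0)])` falls back to `(0, 0)` as the root and builds a tree over the remaining eight nodes.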

Reconfiguration

[Figure: IsoNet nodes annotated with numbers 0 to 3, with a Root Node Candidate marked]

Contents

• Introduction
• Background
• IsoNet
• Fault-tolerance
• Evaluation
  – Experimental setup
  – Results
• Conclusion

Experimental Setup

• Simulation framework
  – Wind River’s Simics full-system simulator
  – CMP with 4 to 64 x86-compatible cores
  – Fedora 12 with kernel 2.6.33
• Benchmarks from recognition, mining and synthesis applications
  – GS: Gauss-Seidel
  – MMM: Dense Matrix-Matrix Multiply
  – SVA: Scaled Vector Addition
  – MVM: Dense Matrix Vector Multiply
  – SMVM: Sparse Matrix Vector Multiply

Results

[Figure, two panels: MMM (6,473 instructions) and SMVM (2,872 instructions). Each panel plots execution time (10^7 cycles) for Job stealing, Carbon and IsoNet, with Carbon speedup and IsoNet speedup on a second axis.]

Beyond Hundred Cores

• MMM (6,473 instructions)

[Figure: relative execution time (0 to 1.0) of Carbon and IsoNet versus number of cores (4, 8, 16, 32, 64, 128, 256, 512, 1024)]

Profile of IsoNet

[Figure: ratio of execution time (0 to 1.0) spent on Conflicts, Stealing job and Processing job]

Conclusion

• Scalability is one of the key challenges in the manycore domain
• Scalability in load balancing is critical to utilizing a large number of processing elements
• This paper proposes a novel hardware-based dynamic load distributor and balancer, called IsoNet
• IsoNet also provides comprehensive fault-tolerance support
• Experimental results in a full-system simulation with real applications demonstrate that IsoNet scales better than alternative techniques

Questions?

Contact info

Junghee Lee
[email protected]
Electrical and Computer Engineering
Georgia Institute of Technology


Thank you!
