Hardware-based Job Queue Management for Manycore Architectures and OpenMP Environments
Junghee Lee, Chrysostomos Nicopoulos, Yongjae Lee, Hyung Gyu Lee and Jongman Kim
Presented by Junghee Lee
Introduction
• Manycore systems
– Number of cores is increasing
• Challenges in scalability
– Memory
– Power consumption
– Cache coherence protocol
– Load balancing
Contents
• Introduction
• Background
– Programming models
– Motivation
• IsoNet
• Fault-tolerance
• Evaluation
• Conclusion
Programming Models
• Parallel programming models
– MPI
– OpenMP
• Fine-grained parallelism
– Emerging applications: Recognition, Mining and Synthesis
– Execution time of each computation kernel is very short but it has abundant parallelism
– Excessive overhead in multithreading
Job Queuing
• Creates jobs instead of threads
– One thread per core is created
– Thread: a set of instructions and states of execution
– Job: a set of data that is processed by a thread
• Job queue
– Manages the list of jobs
– Maintains load balance
[Figure: two CPUs, each running one thread, with a list of jobs]
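The job-queuing idea on this slide can be sketched in software as follows. This is only an illustration of "one thread per core, many jobs in a shared queue", not the hardware mechanism the talk proposes. It assumes Python's standard `queue.Queue` as the shared job queue; the names `run_jobs`, `worker`, `num_cores` and `process` are hypothetical.

```python
import queue
import threading


def run_jobs(jobs, num_cores, process):
    """Process every job using one worker thread per (assumed) core."""
    job_queue = queue.Queue()
    for job in jobs:
        job_queue.put(job)  # all jobs are enqueued before the workers start

    results = []
    results_lock = threading.Lock()

    def worker():
        # A worker is created once and keeps pulling jobs (data) until
        # the shared queue is empty; no thread is created per job.
        while True:
            try:
                job = job_queue.get_nowait()
            except queue.Empty:
                return
            out = process(job)
            with results_lock:
                results.append(out)

    threads = [threading.Thread(target=worker) for _ in range(num_cores)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Every worker contends for the same queue here, which is the source of the conflicts that the following slides discuss.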
Conflicts in Job Queue
• Chance of conflicts increases as:
– The number of cores increases
– The time taken to update the job queue increases
– The job queue is accessed more frequently (job is short)
• Previous approaches
– Distributed queues
  • Load balance is maintained by job-stealing
  • The chance of conflicts in one local queue is decreased
– Hardware implementation
  • Time spent on updating the queue is reduced
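The distributed-queue approach with job-stealing can be illustrated with a small, deterministic, round-based simulation. It is a sketch under stated assumptions, not a model of any particular system: each core processes one job per round, an idle core steals a single job from the currently longest queue (real schemes often pick victims differently, for example at random), and the name `simulate_job_stealing` is hypothetical.

```python
from collections import deque


def simulate_job_stealing(local_queues):
    """Return (jobs processed per core, number of steals)."""
    queues = [deque(q) for q in local_queues]
    processed = [0] * len(queues)
    steals = 0
    while any(queues):  # continue while any local queue still holds a job
        for core, own in enumerate(queues):
            if not own:
                # Assumed policy: steal from the most loaded queue.
                victim = max(queues, key=len)
                if not victim:
                    continue  # nothing left to steal; idle this round
                own.append(victim.pop())  # take from the victim's tail
                steals += 1
            own.popleft()  # "process" one job from the local queue
            processed[core] += 1
    return processed, steals
```

For example, `simulate_job_stealing([[1, 2, 3, 4, 5, 6], [], []])` ends with each of the three cores having processed two jobs, at the cost of four steal operations.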
Profile of SMVM
[Figure: ratio of execution time (0 to 1.0) spent on conflicts, stealing jobs and processing jobs versus number of cores (4, 8, 16, 32, 64, 128, 256)]
Objectives
• Requirements of load balancer
– Scalability: conflict-free
– Fault-tolerance
  • The probability of faults increases exponentially as technology scales
• Contributions of this paper
– Light weight micro-network for load balancing
– Scalable even with more than a thousand cores
– Comprehensive fault-tolerance support
Contents
• Introduction
• Background
• IsoNet
– Architecture
– Implementation
• Fault-tolerance
• Evaluation
• Conclusion
System View
[Figure: system view of six CPU tiles, each with blocks labeled R and I]
Microarchitecture of IsoNet Node
[Figure: microarchitecture of an IsoNet node, showing a Dual Clock Stack, a Max Selector, a Min Selector and a Switch, built from comparators (Comp), MUXes and a DEMUX, with JobCount and Job signals]
How It Works
[Figure: grid of nodes labeled with the values 0, 1 and 2]
Tree-based routing: for fault-tolerance
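The slides name a Max Selector, a Min Selector and a Switch in each node but do not spell out the algorithm. One plausible reading, offered here purely as an assumption, is that job counts are compared up a tree so that the root learns the most and the least loaded nodes, and one job is then moved between them. The sketch below models that reading in software; `balance_once`, the `children` map and the "difference greater than 1" threshold are all hypothetical.

```python
def balance_once(job_counts, children, root=0):
    """Move one job from the most loaded to the least loaded node.

    job_counts: list of job counts, indexed by node id (updated in place).
    children:   dict mapping a node id to the list of its child node ids.
    Returns (source, destination), or None if the load is already balanced.
    """

    def select(node):
        # Each node compares its own count with the winners reported by
        # its children and forwards the overall max and min upward.
        best_max = best_min = node
        for child in children.get(node, []):
            c_max, c_min = select(child)
            if job_counts[c_max] > job_counts[best_max]:
                best_max = c_max
            if job_counts[c_min] < job_counts[best_min]:
                best_min = c_min
        return best_max, best_min

    src, dst = select(root)
    # Assumed threshold: moving a job only helps if the gap exceeds 1.
    if job_counts[src] - job_counts[dst] > 1:
        job_counts[src] -= 1
        job_counts[dst] += 1
        return src, dst
    return None
```

With `job_counts = [1, 5, 0, 2]` and `children = {0: [1, 2], 1: [3]}`, one call moves a job from node 1 to node 2.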
Single Cycle Implementation
• Estimated critical path delay
– 11.38 ns (87.8 MHz)
– By Elmore delay model
• Single cycle implementation offers low hardware cost
[Figure: critical path through leaf node, intermediate node, root node and intermediate node, with switches (Swt) at the source and destination nodes]
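The clock frequency quoted on this slide follows directly from the critical path delay: a single-cycle design can be clocked no faster than one cycle per critical-path traversal. The helper below is a small worked check; the name `max_clock_mhz` is hypothetical.

```python
def max_clock_mhz(critical_path_ns):
    """Highest clock frequency (MHz) for a given critical path delay (ns)."""
    # 1 / (delay in ns) gives GHz; multiply by 1000 to express it in MHz.
    return 1000.0 / critical_path_ns
```

For the 11.38 ns delay on the slide this gives roughly 87.87 MHz, which the slide reports as 87.8 MHz.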
Hardware Cost Estimation
Component               Gate count   Inst count
DCStack                        204         1024
Selector, Leaf                   0           64
Selector, 1 Child              110          928
Selector, 2 Children           256            2
Selector, 3 Children           480           29
Selector, 4 Children           682            1
Switch                         356         1024
Root                            59            1
Total                       674.50
674.50 * 240 * 4 = 647.52 K = 0.046% of 1.4 B (NVIDIA GTX285)
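The arithmetic on this slide can be reproduced with a short script. The gate and instance counts below are read from the slide's table; the instance-weighted average over 1024 nodes comes out at 674.50, which matches the slide's total and suggests it is a per-node figure. The meaning of the factors 240 and 4 is not stated on the slide: the sketch assumes 240 is the number of nodes (one per core of the GTX285) and 4 is the number of transistors per gate. All identifiers are hypothetical.

```python
# (gate count, instance count) per component, as listed on the slide.
COMPONENTS = {
    "DCStack": (204, 1024),
    "Selector, leaf": (0, 64),
    "Selector, 1 child": (110, 928),
    "Selector, 2 children": (256, 2),
    "Selector, 3 children": (480, 29),
    "Selector, 4 children": (682, 1),
    "Switch": (356, 1024),
    "Root": (59, 1),
}


def average_gates_per_node(components, num_nodes):
    """Instance-weighted gate count divided by the number of nodes."""
    total_gates = sum(gates * count for gates, count in components.values())
    return total_gates / num_nodes


def transistor_overhead(gates_per_node, num_nodes, transistors_per_gate,
                        chip_transistors):
    """Return (transistors used, fraction of the chip's transistor budget)."""
    used = gates_per_node * num_nodes * transistors_per_gate
    return used, used / chip_transistors


per_node = round(average_gates_per_node(COMPONENTS, 1024), 2)
# Assumed meanings: 240 nodes, 4 transistors per gate, 1.4 B transistors.
used, fraction = transistor_overhead(per_node, 240, 4, 1.4e9)
print(per_node, used, f"{fraction:.3%}")
```

Under these assumptions the script yields about 647.52 K transistors, or roughly 0.046% of a 1.4 B transistor chip, in line with the slide.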
Contents
• Introduction
• Background
• IsoNet
• Fault-tolerance
– Transparent mode
– Reconfiguration mode
• Evaluation
• Conclusion
Supporting Fault-Tolerance
• Transparent mode
– For faulty CPUs
– Bypass the corresponding IsoNet node
• Reconfiguration mode
– For faulty IsoNet node
– Operation
  • When a fault is detected, all IsoNet nodes go into the reconfiguration mode
  • Reconfigure the topology of IsoNet so that the faulty node is excluded
  • Assign a new root node if the root node fails
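The reconfiguration steps above can be sketched in software as rebuilding a routing tree that skips faulty nodes. The slides do not say how the hardware does this, so everything here is an assumption made for illustration: the nodes are taken to sit on a 2D mesh, the tree is rebuilt with a breadth-first search, and a failed root is replaced by the nearest healthy node. The name `rebuild_tree` is hypothetical.

```python
from collections import deque


def rebuild_tree(width, height, faulty, root):
    """Return (root, parent map) of a tree spanning the healthy mesh nodes."""
    healthy = {(x, y) for x in range(width) for y in range(height)}
    healthy -= set(faulty)
    if not healthy:
        raise ValueError("no healthy nodes left")
    if root not in healthy:
        # Assumed policy: the new root is the healthy node closest to the
        # old root (Manhattan distance), ties broken by coordinate order.
        old = root
        root = min(
            healthy,
            key=lambda n: (abs(n[0] - old[0]) + abs(n[1] - old[1]), n),
        )
    parent = {root: None}
    frontier = deque([root])
    while frontier:
        x, y = frontier.popleft()
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if nxt in healthy and nxt not in parent:
                parent[nxt] = (x, y)  # faulty nodes never enter the tree
                frontier.append(nxt)
    return root, parent
```

On a 3x3 mesh with the centre node faulty, the rebuilt tree still reaches the eight remaining nodes; if the root at `(0, 0)` itself fails, this sketch promotes `(0, 1)` to root.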
Reconfiguration
[Figure: reconfiguration example with nodes labeled 0 to 3 and a root node candidate marked]
Contents
• Introduction
• Background
• IsoNet
• Fault-tolerance
• Evaluation
– Experimental setup
– Results
• Conclusion
Experimental Setup
• Simulation framework
– Wind River’s Simics full-system simulator
– CMP with 4~64 x86 compatible cores
– Fedora 12 with kernel 2.6.33
• Benchmarks from recognition, mining and synthesis applications
– GS: Gauss-Seidel
– MMM: Dense Matrix-Matrix Multiply
– SVA: Scaled Vector Addition
– MVM: Dense Matrix Vector Multiply
– SMVM: Sparse Matrix Vector Multiply
Results
[Figure: MMM (6,473 instructions). Execution time (10^7 cycles) and speedup versus number of cores (4, 8, 16, 32, 64) for job stealing, Carbon and IsoNet]
[Figure: SMVM (2,872 instructions). Execution time (10^7 cycles) and speedup versus number of cores (4, 8, 16, 32, 64) for job stealing, Carbon and IsoNet]
Beyond Hundred Cores
• MMM (6,473 instructions)
[Figure: relative execution time (0 to 1.0) of Carbon and IsoNet versus number of cores (4 to 1024)]
Profile of IsoNet
[Figure: ratio of execution time (0 to 1.0) spent on conflicts, stealing jobs and processing jobs versus number of cores (4, 8, 16, 32, 64)]
Conclusion
• Scalability is one of the key challenges in the manycore domain
• Scalability in load balancing is critical to utilizing a large number of processing elements
• This paper proposes a novel hardware-based dynamic load distributor and balancer, called IsoNet
• IsoNet also provides comprehensive fault-tolerance support
• Experimental results in a full-system simulation with real applications demonstrate that IsoNet scales better than alternative techniques
Questions?
Contact info
Junghee Lee
[email protected]
Electrical and Computer Engineering
Georgia Institute of Technology
Thank you!