Insertion Tree Phasers
Efficient and Scalable Barrier Synchronization for Fine-grained Parallelism
Stefan Marr
S. Verhaegen, B. De Fraine, T. D’Hondt, W. De Meuter
Software Languages Lab
Vrije Universiteit Brussel
Agenda
• Introduction
– Barriers, Phasers
• Insertion Tree Phasers
– Insertion Tree
– Phaser Algorithm
• Evaluation
• Summary
9/26/2010 2
Barriers
• Synchronizing parallel activities
– High productivity: easy to get right
• Mostly for scientific computing
• Many-core evolution
– Synchronizing dynamic and irregular problems
– Requires low-overhead dynamic hierarchical barriers
[Figure: threads t1, t2, and t3 each run phases p1, p2, and p3, synchronizing at a barrier between phases]
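The barrier behaviour pictured above (three threads stepping through three phases) can be illustrated with a minimal sketch. It uses Python's standard `threading.Barrier`, not any of the barrier implementations discussed in this talk; the names `worker`, `log`, and the constants are assumptions made for the example.

```python
import threading

NUM_THREADS = 3   # t1..t3 in the figure
NUM_PHASES = 3    # p1..p3 in the figure

barrier = threading.Barrier(NUM_THREADS)
log = []                      # (phase, thread id) records
log_lock = threading.Lock()

def worker(tid):
    for phase in range(1, NUM_PHASES + 1):
        with log_lock:
            log.append((phase, tid))   # stands in for the work of this phase
        # No thread starts phase + 1 before all threads finished this phase.
        barrier.wait()

threads = [threading.Thread(target=worker, args=(t,))
           for t in range(1, NUM_THREADS + 1)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Thanks to the barrier, phases show up in the log in non-decreasing order.
phases = [p for p, _ in log]
print(phases == sorted(phases))   # True
```

Without the `barrier.wait()` call a fast thread could log phase 2 before a slow thread logged phase 1, and the final check could fail.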
Phasers
• Extension of X10 clocks
– Clocks: dynamic two-phase barrier for fork/join parallelism
• Registration modes for barrier
– Enables expression of producer/consumer relation
• Single statements
– Executed only by single thread, avoids duplicated barrier operations
[Figure: thread t1 runs phases p1 and p2; threads t2 and t3 run phases p2 and p3]
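The registration modes and the split between signalling and waiting can be sketched as follows. This is a deliberately central, lock-based toy (the class `MiniPhaser` and the mode constants are hypothetical names), not the Habanero or insertion-tree implementation, and it assumes a single signalling participant so that one shared arrival counter is enough. Single statements are not modelled.

```python
import threading

SIG, WAIT, SIG_WAIT = "signal-only", "wait-only", "signal-wait"

class MiniPhaser:
    """Toy central phaser: one lock, one condition variable."""

    def __init__(self):
        self._cond = threading.Condition()
        self._phase = 0
        self._signalers = 0   # participants whose signal is required
        self._arrived = 0     # signals seen in the current phase

    def register(self, mode):
        with self._cond:
            if mode in (SIG, SIG_WAIT):
                self._signalers += 1   # wait-only participants are not counted

    def signal(self):
        """First half of the barrier: announce arrival, never block."""
        with self._cond:
            self._arrived += 1
            if self._arrived == self._signalers:
                self._arrived = 0
                self._phase += 1
                self._cond.notify_all()

    def wait(self, phase):
        """Second half of the barrier: block until `phase` has completed."""
        with self._cond:
            self._cond.wait_for(lambda: self._phase > phase)

phaser = MiniPhaser()
phaser.register(SIG)    # producer only signals
phaser.register(WAIT)   # consumer only waits
items, received = {}, []

def producer():
    for i in range(3):
        items[i] = i * i
        phaser.signal()          # may run ahead of the consumer

def consumer():
    for i in range(3):
        phaser.wait(i)           # item i is guaranteed to exist afterwards
        received.append(items[i])

workers = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(received)   # [0, 1, 4]
```

Because the producer is registered as signal-only it never blocks, which is the producer/consumer relation the registration modes are meant to express.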
Hierarchical Phasers
• First scalable implementation strategy
• Predefined tree structure
– Degree, i.e., tree arity
– Max. number of tiers, i.e., height
– Composed from phasers
• Problematic
– Non-dynamic structure
– Two-phase support incomplete
– Leaves design decisions open
Shirako & Sarkar in Proc. of IEEE IPDPS 2010 [1]
[Figure: hierarchical phaser over eight signalling activities A1 to A8. Tier 0 is the phaser itself, Tier 1 holds two sub-phasers, and Tier 2 (leaves) holds four sub-phasers. Labels: array access, list access.]
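As a small worked example of how the degree fixes the shape of such a predefined tree, the sketch below computes the number of phaser nodes per tier. The function name `tier_sizes` is hypothetical, and the assumption that every node takes up to `degree` children is an illustration, not taken from [1].

```python
def tier_sizes(participants, degree):
    """Number of phaser nodes per tier, root (tier 0) first."""
    if participants < 1 or degree < 2:
        raise ValueError("need at least one participant and degree >= 2")
    sizes = []
    nodes = participants
    while True:
        nodes = -(-nodes // degree)   # ceiling division
        sizes.append(nodes)
        if nodes == 1:
            break
    return sizes[::-1]

# Eight activities with degree 2, as in the figure above.
print(tier_sizes(8, 2))   # [1, 2, 4]
```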
INSERTION TREE PHASERS
Design Goal
• Support for full generality of Phaser properties
– Two-phase support
– Signal-only/wait-only for producers/consumers
– Single statement
– Full dynamicity: fine-grained hierarchical fork/join
• Adaptation of existing, scalable approaches
– Dissemination barrier not adaptable
– Remaining are tree-based approaches
Insertion Tree
• Goals
– Stable, i.e., minimized tree modifications
• Avoid inconsistent synchronization information
– Maximizing parallel operations
• Solution: Insertion Tree
– Inverted tree
– No removal
– Complete smallest subtree first
[Figure sequence: an insertion tree is built step by step as participants 1 to 8 are inserted; helper nodes h1 to h7 are added as the tree grows]
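The build-up sequence above can be reproduced with a short sketch. It is a reconstruction from the node labels on the slides (participants 1 to 8, helpers h1 to h7) and the three stated rules (inverted tree, no removal, complete smallest subtree first), not the authors' code; the class and method names are assumptions, and the sketch is single-threaded.

```python
class Node:
    def __init__(self, label, left=None, right=None):
        self.label = label
        self.parent = None                 # inverted tree: children know their parent
        self.left, self.right = left, right
        if left is None:                   # leaf, i.e., a participant
            self.height, self.leaves = 0, 1
        else:                              # helper node joining two subtrees
            self.height = left.height + 1
            self.leaves = left.leaves + right.leaves
            left.parent = right.parent = self

    def is_complete(self):
        return self.leaves == 2 ** self.height


class InsertionTree:
    def __init__(self):
        self.root = None
        self.helpers = 0

    def _helper(self, left, right):
        self.helpers += 1
        return Node(f"h{self.helpers}", left, right)

    def insert(self, label):
        leaf = Node(label)
        if self.root is None:
            self.root = leaf
        elif self.root.is_complete():
            # Whole tree is full: grow by one level with a new root.
            self.root = self._helper(self.root, leaf)
        else:
            # Walk down the right spine to the smallest complete subtree
            # and pair it with the new leaf under a fresh helper.
            node = self.root
            node.leaves += 1
            while not node.right.is_complete():
                node = node.right
                node.leaves += 1
            helper = self._helper(node.right, leaf)
            node.right = helper
            helper.parent = node
        return leaf


def path_to_root(node):
    labels = []
    while node is not None:
        labels.append(node.label)
        node = node.parent
    return labels


tree = InsertionTree()
leaves = [tree.insert(str(i)) for i in range(1, 9)]
print(path_to_root(leaves[7]))   # ['8', 'h7', 'h6', 'h4']
print(path_to_root(leaves[0]))   # ['1', 'h1', 'h2', 'h4']
```

In this sketch an insertion creates at most one helper and re-parents at most one existing subtree, which is one way to read the "minimized tree modifications" goal.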
Synchronization Tree*
[Figure: synchronization tree for a phaser at phase 0 with participants A1 to A4. Legend: participant nodes, helper nodes, phase counter, resume flag (rsmd), wait-only flag (wo).]
*) Simplified; leaves out registration modes.
Announcing Synchronization
[Figure sequence: participants A1 to A4 announce synchronization. Their phase counters move from 0 to 1, the two helper nodes follow, and finally the phaser advances from phase 0 to phase 1.]
Synchronization reached. Continue to next phase.
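The upward propagation shown in this sequence can be sketched as a single-threaded simulation. It assumes the phaser acts as the root above two helper nodes and that a node advances once all of its children are ahead of it; resume flags, registration modes, and the atomic operations a concurrent implementation would need are left out. `SyncNode` and `signal` are hypothetical names.

```python
class SyncNode:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        self.phase = 0                     # phase counter
        if parent is not None:
            parent.children.append(self)

def signal(participant):
    """Advance a participant and propagate upward as far as possible."""
    participant.phase += 1
    node = participant.parent
    # A node may advance once every child has moved past the node's phase.
    while node is not None and all(c.phase > node.phase for c in node.children):
        node.phase += 1
        node = node.parent

phaser = SyncNode("phaser")
left, right = SyncNode("helper-left", phaser), SyncNode("helper-right", phaser)
a1, a2 = SyncNode("A1", left), SyncNode("A2", left)
a3, a4 = SyncNode("A3", right), SyncNode("A4", right)

for p in (a2, a3):                         # A2 and A3 arrive first
    signal(p)
mid_state = (left.phase, right.phase, phaser.phase)

for p in (a4, a1):                         # A4 completes the right helper, then A1 the left
    signal(p)
print(mid_state, phaser.phase)             # (0, 0, 0) 1
```

The participant that completes a subtree is the one that carries the announcement one level further up, so most signals touch only a few nodes.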
Dropping Participants
[Figure sequence: participants are dropped from a tree with participants A1 to A4. Affected nodes are marked wait-only (wo), steps are annotated with the labels h1:R and h1:L, and the phaser advances from phase 0 to phase 1.]
Synchronization reached. Continue to next phase.
Adding New Participants
[Figure sequence: new participants are added to a phaser at phase 8 while existing nodes are at phase 8 or 9. Counters are adjusted (-1, +1) and the phase count is propagated.]
EVALUATION
Two-Phase Barrier Operation
[Plot: time per barrier operation in µs, y-axis 80 to 120, x-axis 0 to 56. Series: TreeBarrier, HabaneroPhaser, Central, Dissemination, InsertionTreePhaser, TmcSpinBarrier.]
Overhead: Two-Phase vs. Classic
[Plot: overhead of two-phase relative to classic barrier operation, y-axis 30% to 130%, x-axis 0 to 56. Series: TreeBarrier, HabaneroPhaser, Central, Dissemination, TreePhaser, TmcSpinBarrier.]
Use as Drop-In Replacement for SPLASH-2
Speedup compared to TmcSpinBarrier
[Plot: speedup per benchmark, y-axis -20% to 20%. Series: TreeBarrier, HabaneroPhaser, Central, Dissemination, InsertionTreePhaser.]
• 58 cores: Barnes, Cholesky, LU, Radiosity, Water-Spatial
• 32 cores: FFT, Ocean, Radix
Summary
• Scalable and efficient approach to Phasers
– Documents implementation
– Based on fully dynamic insertion tree
– Overcomes limitations of existing approaches
– Usable as drop-in replacement
• Future work
– Scalability beyond 59 cores
– Optimization for other memory architectures
Stefan Marr, IEEE HPCC 2010, Insertion Tree Phasers
• Implementation
– http://barriers.googlecode.com/
– MIT license
Questions?
References
[1] Shirako, Jun & Sarkar, Vivek: Hierarchical Phasers for Scalable Synchronization and Reductions in Dynamic Parallelism. In: Proc. of IEEE IPDPS (2010).