Synchronizing the timestamps of concurrent events in traces of hybrid MPI/OpenMP applications
D. Becker, M. Geimer, R. Rabenseifner, and F. Wolf
Laboratory for Parallel Programming | September 21, 2010
Daniel Becker
Cluster systems
• Cluster systems represent the majority of today’s supercomputers
  – Availability of inexpensive commodity components
• Vast diversity
  – Architecture
  – Interconnect technology
  – Software environment
• Message-passing and shared-memory programming models for communication and synchronization
Event tracing
• Application areas
  – Performance analysis
    • Time-line visualization
    • Wait-state analysis
  – Performance modeling
  – Performance prediction
  – Debugging
• Events recorded at runtime to enable post-mortem analysis of dynamic program behavior
• Event includes at least timestamp, location, and event type
[Figure: Send, Recv, and Barrier operations traced as event sequences (E, S, R, X, MX); stages: record, write, merge (optional)]
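The event record described above (timestamp, location, event type) and the optional merge step can be sketched as follows. This is only an illustrative sketch: the class and field names are hypothetical, and reading the figure labels E, X, S, and R as enter, exit, send, and receive is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class EventType(Enum):
    # Assumed expansion of the single-letter labels used in the figure.
    ENTER = "E"
    EXIT = "X"
    SEND = "S"
    RECV = "R"


@dataclass(frozen=True)
class Event:
    timestamp: float   # reading of the local (unsynchronized) clock
    location: int      # hypothetical id of the recording process or thread
    type: EventType


def merge(local_traces):
    """Merge per-location traces into one trace ordered by timestamp."""
    merged = [event for trace in local_traces for event in trace]
    return sorted(merged, key=lambda e: e.timestamp)
```

Because every location stamps its events with its own clock, the order produced by `merge` is only meaningful once the timestamps have been synchronized.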
Problem: Non-synchronized clocks
Outline
Clock synchronization
• Lamport, Mattern, Fidge, Rabenseifner
  – Restore and preserve logical correctness
• Dunigan, Maillet, Tron, Doleschal
  – Measure offset values and determine interpolation function
• Duda, Hofmann, Hilgers
  – Determine medial smoothing function based on send/receive differences
• Mills
  – Query time from reference clocks synchronized at regular intervals
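The offset-based approach (measure offset values, then determine an interpolation function) can be illustrated with a linear interpolation between two measurements. This is a minimal sketch, assuming two offset measurements per location, one near the start and one near the end of the run; the function and parameter names are hypothetical.

```python
def make_offset_correction(t_begin, offset_begin, t_end, offset_end):
    """Return a function mapping a local timestamp to reference time.

    offset_begin and offset_end are assumed to be measured as
    (reference clock - local clock) at the local times t_begin and t_end.
    The offset in between is interpolated linearly.
    """
    if t_end <= t_begin:
        raise ValueError("t_end must be later than t_begin")
    drift = (offset_end - offset_begin) / (t_end - t_begin)

    def correct(t_local):
        return t_local + offset_begin + drift * (t_local - t_begin)

    return correct
```

A linear function can only compensate a constant drift, so residual errors may still leave some messages with a receive timestamp earlier than the send timestamp.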
Controlled logical clock
[Figure: message between two time lines with events E, X, S, and R and the minimum message latency µmin]
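The clock condition and the forward correction applied by a controlled logical clock can be sketched as follows. This is a simplified illustration rather than the authors' implementation: `mu_min` corresponds to µmin in the figure, while `gamma`, the function names, and the dictionary of receive constraints are assumptions introduced for the example.

```python
def violates_clock_condition(t_send, t_recv, mu_min):
    """A receive must not occur earlier than its send plus the minimum latency."""
    return t_recv < t_send + mu_min


def forward_correct(local_times, recv_constraints, mu_min, gamma=1.0):
    """Correct the timestamps of one location in forward direction.

    local_times:      original timestamps of the location, ascending
    recv_constraints: dict mapping the index of a receive event to the
                      corrected timestamp of its matching send event
    gamma:            hypothetical control factor; 1.0 keeps the length of
                      local intervals, values below 1.0 let the corrected
                      clock drift back toward the original one
    """
    corrected = []
    for i, t in enumerate(local_times):
        candidate = t
        if corrected:
            # Never fall behind the predecessor plus the scaled original distance.
            delta = t - local_times[i - 1]
            candidate = max(candidate, corrected[-1] + gamma * delta)
        if i in recv_constraints:
            # Restore the clock condition for receive events.
            candidate = max(candidate, recv_constraints[i] + mu_min)
        corrected.append(candidate)
    return corrected
```

Moving a receive forward creates a jump in the local clock; all later events of the same location are shifted along with it so that their local order is kept.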
MPI semantics
[Figure: time lines of MPI operations marked by enter (E) and MX events]
Limitations of the CLC algorithm
• Neither restores nor preserves clock condition in OpenMP event semantics
• May introduce violations in locations that were previously intact
[Figure: time lines with send (S) and receive (R) events and omp_barrier regions]
Collective communication
• Consider OpenMP constructs as composed of multiple logical messages
• Define logical send/receive pairs for each flavor
[Figure: omp_barrier on two time lines with E and OX events]
OpenMP semantics
[Figure: time lines of OpenMP constructs with events E, OX, F, J, L, and U; one panel is labeled "Tasking"]
Happened-before relation
• Operation may have multiple logical receive and send events
• Multiple receives used to synchronize multiple clocks
• Latest send event is the relevant send event
[Figure: time lines with events E, MX, and OX]
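The rule that the latest send event is the relevant one can be illustrated for a barrier. The sketch below assumes that every enter event of the barrier acts as a logical send and every exit event as a logical receive; that mapping, the function name, and the parameter names are assumptions made for the example.

```python
def correct_barrier_exits(enter_times, exit_times, mu_min):
    """Move barrier exits so that none precedes the latest enter.

    enter_times: already corrected enter timestamps of all participants
    exit_times:  original exit timestamps of all participants
    mu_min:      assumed minimum latency of a logical message
    """
    latest_enter = max(enter_times)  # the relevant logical send event
    return [max(t_exit, latest_enter + mu_min) for t_exit in exit_times]
```

For example, with enters at 1.0, 2.0, and 4.0 and a minimum latency of 0.25, an exit recorded at 3.0 is moved to 4.25, while exits at 4.5 and 5.0 stay unchanged.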
Parallelization
• Correct local traces in parallel
  – Keep whole trace in memory
  – Exploit distributed memory & processing capabilities
• Replay communication
  – Traverse trace in parallel
  – Exchange data at synchronization points
  – Use operation of same type
    • MPI functions
    • OpenMP constructs
Forward replay
[Figure: time lines 1, 2, and 3 traversed in forward direction, synchronizing at an omp_barrier]
Backward amortization
• Avoid new violations
• Do not advance send farther than matching receive
[Figure: time lines with send (S) and receive (R) events]
Backward replay
• Data on sender side needed
• Communication direction
  – Communication proceeds in backward direction
  – Roles of sender and receiver are inverted
• Traversal direction
  – Start at end of trace
  – Avoid deadlocks
[Figure: time lines with send (S) and receive (R) events traversed from the end of the trace]
Piece-wise correction
[Figure: differences to LC_i^b plotted over a sequence of send (S) events followed by receive (R) events; a jump ∆t occurs at a receive event and the amortization interval precedes it. Legend:]
• LC_i^b: Controlled logical clock without jump discontinuities
• LC_i' - LC_i^b: Controlled logical clock with jump discontinuities
• LC_i^A' - LC_i^b: Linear interpolation for backward amortization
• LC_i^A - LC_i^b: Piecewise linear interpolation for backward amortization
• min(LC_k'(corr. receive event) - µ - LC_i^b)
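The idea of backward amortization can be sketched as follows: the jump at a receive event is spread linearly over the preceding amortization interval, and no send event is advanced beyond the limit given by its matching receive. Note that this sketch clamps the linear ramp at the send limits instead of constructing the piecewise linear interpolation shown in the figure, so it is a simplified stand-in; all names are hypothetical.

```python
def amortize_backward(times, jump, interval, send_limits):
    """Spread a jump at the last event over the events preceding it.

    times:       timestamps of one location before the jump, ascending;
                 the last entry is the receive event moved forward by `jump`
    jump:        size of the jump at the receive event (>= 0)
    interval:    length of the amortization interval before the receive
    send_limits: dict mapping the index of a send event to its largest
                 allowed advance, that is, corrected matching receive
                 minus mu minus the timestamp of the send
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    start = times[-1] - interval
    # Linear ramp: 0 at the start of the interval, `jump` at the receive.
    advances = []
    for t in times:
        fraction = min(1.0, max(0.0, (t - start) / interval))
        advances.append(jump * fraction)
    # Walk backwards so that no event overtakes a later, limited send event.
    limit = float("inf")
    for i in range(len(times) - 1, -1, -1):
        limit = min(limit, max(0.0, send_limits.get(i, float("inf"))))
        advances[i] = min(advances[i], limit)
    return [t + a for t, a in zip(times, advances)]
```

Because the advance never decreases from one event to the next, the local order of events is preserved, and the limits ensure that the amortization itself does not introduce new violations on the sender side.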
Experimental evaluation
• Evaluation focused on frequency of clock violations, accuracy, and scalability of the correction
• Nicole cluster
  – JSC@FZJ
  – 32 compute nodes
  – 2 quad-core Opteron running at 2.4 GHz
  – Infiniband
• Applications
  – PEPC (4 threads per process)
  – Jacobi solver (2 threads per process)
• Significant percentage of messages violated the clock condition (up to 5%)
• After correction all traces were free of clock condition violations
Accuracy of the algorithm
• Event position
  – Absolute deviations correspond to value of clock condition violations
  – Relative deviations are negligible
• Event distance
  – Larger relative deviations possible
  – Impact on analysis results negligible
• Correction changed the length of local intervals only marginally
Synchronizing hybrid codes
• Original trace violated only MPI semantics
• Roughly half of the corrections correspond to OpenMP semantics
• Algorithm preserved OpenMP semantics
[Figure: time lines with send (S) and receive (R) events and omp_barrier regions]
Scalability
Summary
Outlook
• Exploit knowledge of MPI-internal messaging inside collective operations using PERUSE
• Leverage periodic offset measurements at global synchronization points
Thanks!