IBM T. J. Watson Research Center
© 2013 IBM Corporation
Blue Gene Performance Data Repository and Application Data Prefetching
I-Hsin Chung
Performance Data Repository
§ Goal
– To characterize the applications on existing systems
– To understand the system resource usage
• To provide inputs for next generation system design
§ Objective
– Collect performance data and store it in a relational database
– Uniform storage format
• To support queries and presentation
• To make comparisons across applications or platforms
Application Performance data/trace
[Figure: workflow diagram with the labels: Application, instrumentation, Job execution, collection, Performance data/trace]
Performance Data Repository
[Figure: system diagram with the labels: grotius (submit instrumented binary), Blue Gene compute nodes, mgmt, perf. data, DB2 relational database (bgqsn2)]
How to use it?
§ Link the application with performance tool
– Link the profiler libraries (e.g., -lmpihpmperf or -lpomprofperf) statically
– Or use a modified version of the MPI compiler (wrapper)
– Supports MPI/hardware counter and OpenMP profilers
§ Run the application
§ Query the database (optional)
– SQL statements
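As an illustration of the optional query step, the sketch below uses Python's built-in sqlite3 module as a stand-in for the DB2 repository. The table name perf_data, its columns, and the sample rows are hypothetical; the actual repository schema is not described in these slides.

```python
import sqlite3

def build_demo_repository():
    """Create an in-memory table with a hypothetical uniform schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE perf_data ("
        " app TEXT, platform TEXT, mpi_rank INTEGER,"
        " metric TEXT, value REAL)"
    )
    # Placeholder rows, invented purely for illustration.
    rows = [
        ("app_a", "platform_x", 0, "ipc", 0.25),
        ("app_a", "platform_x", 1, "ipc", 0.75),
        ("app_b", "platform_x", 0, "ipc", 0.5),
        ("app_b", "platform_x", 1, "ipc", 1.0),
    ]
    conn.executemany("INSERT INTO perf_data VALUES (?, ?, ?, ?, ?)", rows)
    return conn

def mean_metric_by_app(conn, metric):
    """Compare applications by the mean of one metric across ranks."""
    cur = conn.execute(
        "SELECT app, AVG(value) FROM perf_data"
        " WHERE metric = ? GROUP BY app ORDER BY app",
        (metric,),
    )
    return cur.fetchall()

conn = build_demo_repository()
print(mean_metric_by_app(conn, "ipc"))
```

Keeping one row per (application, platform, rank, metric) is one way to get a uniform format that supports cross-application comparisons with a single GROUP BY; other layouts are equally plausible.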
Link the application with performance tool – MPI/HW counter
§ Example: NPB-3.3-MPI BT on grotius
§ The MPI/BGPM library is at /gpfs/DDNgpfs1/ihchung/codes11/hpct.db2/bobmpihpm/libmpihpmperf.a
§ /opt/ibmcmp/xlf/bg/14.1/bin/bgxlf_r -O -g -o ../bin/bt.A.4 bt.o make_set.o initialize.o exact_solution.o exact_rhs.o set_constants.o adi.o define.o copy_faces.o rhs.o solve_subs.o x_solve.o y_solve.o z_solve.o add.o error.o verify.o setup_mpi.o ../common/print_results.o ../common/timers.o btio.o -L /gpfs/DDNgpfs1/ihchung/codes11/hpct.db2/bobmpihpm -lmpihpmperf -L /bgsys/drivers/ppcfloor/comm/xl/lib -lmpich -lmpl -lopa -L /bgsys/drivers/ppcfloor/comm/sys/lib -lpami -L /bgsys/drivers/ppcfloor/spi/lib -lSPI_cnk -lrt -lstdc++
Link the application with performance tool - OpenMP
§ Example: an OpenMP toy code on grotius
§ The OpenMP profiler library is at /gpfs/DDNgpfs1/ihchung/codes11/hpct.db2/source/lib/libpomprofperf.a
§ The instrumentation is done via "hooks" provided by the XL compiler
§ /opt/ibmcmp/vac/bg/12.1/bin/bgxlc_r -O2 -g -qsmp=omp -qsimd=noauto -qsmp=omp myomp.c -o myomp -L/bgsys/drivers/ppcfloor/spi/lib -L . -lSPI_cnk -lxlsmp_pomp -L../../lib -lpomprofperf
Rosetta - CPU
[Figure: stacked bar chart (0% to 100%) for Rank 0 and Rank n, with the legend entries: Committed Instructions, Committed AXU uCode sub-ops, Committed XU uCode sub-ops, Flushed Instructions and Operation Cycles, FXU Dep Stalls, AXU Dep Stalls, Thread Arbitration Stalls, IU empty]
Rosetta – instruction & memory
Metric          Rank 0      Rank n
DDR bandwidth   0.001       0.005
Heap Usage      330100736   666894336
Stack Usage     20351       20351
Gflops          0           1.813
IPC             0.3869      0.2072
[Figure: stacked bar charts (0% to 100%) for Rank 0 and Rank n, with the legend entries: integer ratio, float ratio, SIMD, non-SIMD, L1, L1P, L2, DDR]
Rosetta – MPI comm.
Rank 0
MPI Comm     Call count   Data size   Time
P2P          217          868         700.658
Collective   7            84          0.034
total comm                            700.692
total time                            711.03

Rank n
MPI Comm     Call count   Data size   Time
P2P          7            28          0.022
Collective   7            84          0.004
total comm                            0.026
total time                            711.03
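To make the imbalance between the two ranks concrete, the following sketch computes the share of total time spent in MPI communication from the totals reported above. The function name is illustrative, and the time unit is whatever the tool reports (it is not stated on the slide), so only the ratio is computed.

```python
def comm_fraction(total_comm, total_time):
    """Fraction of total time spent inside MPI communication calls."""
    return total_comm / total_time

# Totals copied from the Rank 0 and Rank n figures above.
rank0 = comm_fraction(700.692, 711.03)
rankn = comm_fraction(0.026, 711.03)

print(f"Rank 0: {rank0:.1%} of time in MPI")
print(f"Rank n: {rankn:.4%} of time in MPI")
```

A ratio close to 1 for Rank 0 alongside a near-zero ratio for Rank n may indicate that Rank 0 is mostly waiting in point-to-point calls, although the slide itself does not state the cause.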
Methodology
[Figure: methodology diagram with the boxes: Application Performance data/trace, Hardware Specification, Simulation method, Performance Projection]
Simulation NPB D 64
[Figure: chart per NPB benchmark (bt, cg, ep, ft, is, lu, mg, sp); y-axis 0% to 400%; series: sim, bw/4, bw*4, comp/3.5, comp*3.5]
Status
§ Being deployed to ANL ALCF
§ Performance data will be output into plain-text SQL files in addition to the original performance data files
§ User interface is being developed
– Excel spreadsheet
– Web interface
IBM Research
Application Data Prefetching
§ Memory performance
– Latency: caching, prefetching
– Bandwidth
§ Indirect memory access
– A(B(j)): A, B arrays, j ordered index
– Prevalent in a wide variety of science and engineering applications
– Difficult to optimize
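A minimal sketch of the A(B(j)) pattern follows. It shows that while j advances in order, the cache lines touched in A follow the values stored in B and so need not be sequential. The 8-byte element size and 128-byte line size in line_sequence are assumptions chosen for illustration, as are the function names and sample arrays.

```python
def gather(a, b):
    """Indirect access A(B(j)): j walks B in order, B's values index A."""
    return [a[b[j]] for j in range(len(b))]

def line_sequence(b, elem_bytes=8, line_bytes=128):
    """Line numbers touched in A for each j (sizes are assumptions)."""
    return [(idx * elem_bytes) // line_bytes for idx in b]

a = [10 * i for i in range(64)]
b = [40, 3, 57, 3, 21, 0]

print(gather(a, b))
print(line_sequence(b))
```

The second list jumps back and forth between lines, which is the kind of pattern a purely sequential prefetcher cannot anticipate.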
IBM Blue Gene/Q L1P
§ Level One Cache Prefetch
– Predicts memory access patterns
– Prefetches the data accordingly
§ Stream prefetcher
– Handles sequential memory access patterns
§ List prefetcher
– Handles non-sequential memory access patterns
Level One Cache Prefetcher
§ The interface between the core and the rest of the BG/Q system
§ Each core is paired with an L1P
§ 4K byte L1P buffer storing data from L2
§ Prefetch line size 128 bytes, 32 lines
§ Buffer managed by Prefetch Directory (PFD)
§ Unlike cache, L1P generates load requests autonomously
Stream Prefetch Engine
§ Prefetches data from a consecutive sequence of 128-byte blocks
§ Maintains up to 16 sequences (streams) per core
§ Stream establishment
– Automatic detection
• Optimistic – no consecutive memory access pattern is required
• Confirmed (default) – requires a detection of a stream
– Manual setup
• Use the dcbt (data cache block touch) instruction
• Write to a special memory-mapped I/O (MMIO) register
Stream Prefetch Engine - continued
§ Stream detection table (SDT)
– Holds up to 16 addresses
– L1 miss address
• If there is a match – stream detected
• If no match – insert into table
§ Prefetch directory (PFD)
– Stream established – user configurable depth & stream ID
– L1 miss address matches
• hit: triggers further prefetching
• not ready: increases prefetch depth
[Figure: diagram with the labels SDT (16) and (32)]
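The match-or-insert behavior of the stream detection table can be sketched as a toy model. The slides only say that the table holds up to 16 addresses and that a matching L1 miss signals a stream; what exactly is stored (here, the next consecutive 128-byte line) and the replacement policy (here, oldest-first) are assumptions made for this sketch, as is the class name.

```python
from collections import OrderedDict

LINE_BYTES = 128   # prefetch line size from the slides
SDT_ENTRIES = 16   # SDT capacity from the slides

class StreamDetectionTable:
    """Toy model: remembers the line expected to follow each recent miss."""

    def __init__(self, entries=SDT_ENTRIES):
        self.entries = entries
        self.expected = OrderedDict()  # line number -> None, oldest first

    def l1_miss(self, address):
        """Return True if this miss confirms a stream, else record it."""
        line = address // LINE_BYTES
        if line in self.expected:
            del self.expected[line]
            return True
        # Assumption: store the next consecutive line, evict oldest (FIFO).
        if len(self.expected) >= self.entries:
            self.expected.popitem(last=False)
        self.expected[line + 1] = None
        return False

sdt = StreamDetectionTable()
sequential = [sdt.l1_miss(addr) for addr in (0, 128)]
scattered = [sdt.l1_miss(addr) for addr in (4096, 65536, 1024)]
print(sequential, scattered)
```

In this model the second of two consecutive-line misses is detected as a stream, while scattered misses only fill the table. Once a stream is detected, the real hardware hands it to the PFD, which this sketch does not model.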
List Prefetch Engine
§ Prefetches data by a predefined list of addresses
§ Data prefetched up to a user-configurable depth
§ Tracking location of the current L1 miss in the list
§ Memory access pattern may not repeat itself exactly
– Sliding window “ReadListArray” size of 8
§ L1 miss address compared within the window
– Found – advances the window and triggers further prefetching
– Not found – mismatch counter increases; too many mismatches abort the list prefetching
§ Predefined list resides in main memory
– Can be manually controlled via APIs
– Can be used for analysis/debugging
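The sliding-window tracking described above can be sketched as follows. The window size of 8 comes from the slides; the abort threshold (max_mismatches), the decision to reset the counter on a match, and the class and method names are assumptions made for this illustration, not the hardware's documented behavior.

```python
WINDOW = 8  # sliding-window size from the slides

class ListPrefetchTracker:
    """Toy model of tracking L1 misses against a recorded address list."""

    def __init__(self, recorded, max_mismatches=4):
        self.recorded = recorded              # list from a previous pass
        self.pos = 0                          # start of the sliding window
        self.mismatches = 0
        self.max_mismatches = max_mismatches  # hypothetical abort threshold
        self.aborted = False

    def l1_miss(self, address):
        """Return True if the miss is found in the window (window advances)."""
        if self.aborted:
            return False
        window = self.recorded[self.pos:self.pos + WINDOW]
        if address in window:
            # Advance past the match; skipped entries are simply dropped.
            self.pos += window.index(address) + 1
            self.mismatches = 0  # assumption: a match resets the counter
            return True
        self.mismatches += 1
        if self.mismatches > self.max_mismatches:
            self.aborted = True
        return False

recorded = [100, 200, 300, 400, 500, 600]
tracker = ListPrefetchTracker(recorded)
# The replay skips 200 and contains one unrecorded miss (999).
results = [tracker.l1_miss(a) for a in (100, 300, 999, 400)]
print(results, tracker.pos)
```

The replay tolerates both a skipped entry and an extra miss, which is how a window lets the engine follow a pattern that does not repeat exactly.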
List Prefetch Engine - continued
§ Simple pattern compression mechanism
– Two L1 misses fall into same L1P line
§ List compression table (LCT)
§ The engine can be paused/resumed
– To exclude code segments
§ Stop/deactivate list prefetch engine
– List creating side: flush LCT, attach end-of-list mark
– List prefetching side: wait for the last memory access
§ One list prefetch engine per hardware thread
Evaluation – Basic Analysis
§ Three loops are used
– Only non-uniform memory accesses
– Half uniform and half non-uniform memory accesses
– Only uniform memory accesses
§ Up to 4 hardware threads on same/different cores
§ Performance metrics
– Time (cycles)
– Number of L1P hits
Non-uniform Memory Accesses - Time
§ Weak scaling workload for threads
§ Stream prefetcher is not helpful
§ List prefetcher helps when given enough resource (L1P buffer)
[Figure: chart of time (0 to 7000000) vs. number of threads (1 to 4); series: no prefetch, confirmed, optimistic, list prefetch, confirmed & list, optimistic & list, each for same core and diff core]
Non-uniform Memory Accesses – L1P hits
§ Stream prefetcher with optimistic policy is too aggressive when competing for the L1P buffer with the list prefetcher
[Figure: chart of L1P hits (0 to 60000) vs. number of threads (1 to 4); series: list prefetch, confirmed & list, optimistic & list, each for same core and diff core]
Half Uniform and Half Non-uniform Memory Accesses - Time
§ Strong scaling workload for threads
§ Performance further improved when both prefetchers work together
[Figure: chart of time (0 to 30) vs. number of threads (1 to 4); series: default, opt, conf, list, list & opt, list & conf, each for same core and diff core]
Uniform Memory Accesses - Time
§ Weak scaling workload for threads
§ List prefetcher is competitive with the stream prefetcher, achieving similar performance
[Figure: chart of time (0 to 1800000) vs. number of threads (1 to 4); legend entries: default, opt, conf, list, list & opt, list & conf for same core; default, opt for diff core]
Evaluation – Graph Algorithm
§ A variant of the Shiloach-Vishkin algorithm (SV)
§ A representative connectivity algorithm that adopts the widely used graft-and-shortcut approach
§ Complexities of O(log n) time and O((m+n) log n) work under the CRCW PRAM model
§ Grafting components dominates execution time
§ Indirect memory accesses
§ Works reasonably well with small to mid-size datasets
§ One may break a big graph into small pieces and handle them separately

Performance time (million cycles) by dataset:
Configuration   100K, 400K   1M, 4M   10M, 40M
sw + opt        285          2960     30186
sw + conf       210          2138     23201
sw + dcbt       253          2460     26749
list + opt      141          2299     39611
list + conf     124          2280     44130
list + dcbt     124          2280     44130
[Figure: chart of the same data on a logarithmic axis (1 to 100000 million cycles); series: sw + opt, sw + conf, sw + dcbt, list + opt, list + conf, list + dcbt]
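For readers unfamiliar with the graft-and-shortcut approach, here is a sequential sketch of the idea. It is not the parallel implementation evaluated in the slides; the function name and the choice to hook larger labels onto smaller ones are illustrative assumptions. The point is to show where the indirect accesses of the form d[d[v]] arise.

```python
def connected_components(n, edges):
    """Label components with graft-and-shortcut rounds (sequential sketch)."""
    d = list(range(n))  # d[v] is the current parent/label of vertex v
    changed = True
    while changed:
        changed = False
        # Graft: hook a root with a larger label onto a smaller label.
        for u, v in edges:
            for a, b in ((u, v), (v, u)):
                if d[a] < d[b] and d[b] == d[d[b]]:
                    d[d[b]] = d[a]  # indirect access of the form A(B(j))
                    changed = True
        # Shortcut: pointer jumping until every vertex points at its root.
        for v in range(n):
            while d[v] != d[d[v]]:
                d[v] = d[d[v]]
    return d

print(connected_components(6, [(0, 1), (1, 2), (3, 4)]))
```

Both phases read d through another array's values (edge endpoints, then d itself), so the addresses touched depend on the graph data rather than on the loop index.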
Evaluation – CG Iteration and Sparse Matrix
§ The conjugate gradient method is frequently used to solve linear systems
§ The CG method requires repeated sparse matrix-vector multiplication
§ Sparse matrix is represented in CSR format (compressed sparse row)
§ NPB CG OpenMP 3.3 class C
§ CG kernel: 60% strided memory accesses, 40% random
– Uniform: accessing matrix elements (8-byte) and column indices (4-byte) in CSR
– Random: accessing vector elements (8-byte) by the column indices
– Strong scaling
§ Optimistic stream policy is too aggressive when resources are limited
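The mix of uniform and random accesses is easiest to see in a plain CSR sparse matrix-vector product. The sketch below is a generic textbook formulation, not the NPB CG source; the array names row_ptr, col_idx, and values are conventional but chosen here for illustration.

```python
def csr_spmv(row_ptr, col_idx, values, x):
    """y = A @ x for a matrix A stored in compressed sparse row format."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # values[k] and col_idx[k] are read sequentially (uniform);
            # x[col_idx[k]] is the indirect, data-dependent access.
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y

# 3x3 example: [[2, 0, 1], [0, 3, 0], [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [2.0, 1.0, 3.0, 4.0, 5.0]
print(csr_spmv(row_ptr, col_idx, values, [1.0, 2.0, 3.0]))
```

Per nonzero, the loop streams 12 bytes (an 8-byte value plus a 4-byte index) and gathers 8 bytes from the vector, which by bytes is consistent with the 60%/40% split quoted above.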
Discussion
§ With sufficient hardware resource and time, L1P prefetches data from L2 into L1P and improves the performance
– Overlapping computation with data prefetching
§ If data requests from L1 misses are predicted correctly
– It reduces 2/3 of the latency
– 18 more cycles (between L1 and L1P) to be overlapped
§ The stream of prefetch list addresses goes through L2 to main memory
– The list prefetch reduces L1 cache miss penalty at the cost of the memory bandwidth (to L2 and/or main memory) and the storage space
– Unless the application is heavily memory-bound, list prefetching should incur little overhead

Access latency:
L1       6 cycles
L1P      24 cycles
L2       84 cycles
Memory   346 cycles
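The two figures quoted above can be checked against the latency table. The sketch below assumes that "2/3 of the latency" refers to serving an L1 miss from L1P instead of L2, which is an interpretation rather than something the slide states explicitly; the helper name is illustrative.

```python
# Access latencies in cycles, copied from the slide.
LATENCY = {"L1": 6, "L1P": 24, "L2": 84, "Memory": 346}

def saved_fraction(slow, fast):
    """Fraction of the slower level's latency removed by the faster one."""
    return (LATENCY[slow] - LATENCY[fast]) / LATENCY[slow]

l2_to_l1p = saved_fraction("L2", "L1P")        # (84 - 24) / 84
remaining = LATENCY["L1P"] - LATENCY["L1"]     # 24 - 6

print(f"L2 -> L1P saves {l2_to_l1p:.0%}; {remaining} cycles remain over L1")
```

Under that reading the saving is about 71%, roughly two thirds, and the 18 cycles are the remaining gap between an L1P hit and an L1 hit.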
Discussion - continued
§ L1P buffer (4K bytes) is shared between the stream and list prefetch engines of up to four hardware threads
– Competition for injecting a request into PFD (prefetch directory)
– Stream prefetch has priority when requests come at the same time
– Competition may cause thrashing
§ Size of the sliding window
– Determines how many address mismatches are allowed
– Constrained by the hardware (parallel comparison)
– May be eased by using multiple smaller lists
§ List of addresses provides a channel for performance tuning
– Analysis together with hardware performance counters and debugging registers
Conclusion
§ Performance Data Repository aims to
– Characterize the applications
– Study the hardware usage
§ L1P is intended to benefit applications with performance limited by indirect or random memory access patterns
§ L1P shows effectiveness in graph-based applications and sparse matrix solvers with sufficient hardware resource
§ List prefetcher works well with stream prefetcher
§ Future work
– More intelligent control
– Coordination with other features such as massive speculative threading