Overview
The gap between peak and sustained performance is a well-known problem in HPC
It is generally attributed to the memory system, but the specific bottleneck is difficult to identify
Application benchmarks are too complex to isolate specific architectural features
Microbenchmarks are too narrow to predict actual code performance
We use an adaptable probe to isolate performance limitations and give application and hardware developers possible optimizations
Sqmat uses four parameters to capture the behavior of a broad range of scientific codes:
Working set size (N), Computational Intensity (M), Indirection (S), Irregularity (S)
Architectures examined: Intel Itanium2, AMD Opteron, IBM Power3, IBM Power4
Sqmat overview
Sqmat is based on matrix multiplication and linear solvers
A Java program is used to generate optimally unrolled C code
Square a set of matrices M times each (using enough matrices to exceed the cache size)
M controls computational intensity (CI): the ratio of flops to memory accesses
Each matrix is of size NxN; N controls the working set size: 2N^2 registers are required per matrix
Direct storage: Sqmat's matrix entries are stored contiguously in memory
Indirect: entries are accessed indirectly through a pointer; parameter S controls the degree of indirection: S matrix entries are stored contiguously, followed by a random jump in memory
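A rough sketch of the kind of kernel Sqmat exercises (hypothetical C with illustrative names; the actual probe is generated as optimally unrolled code, and the index array is used only in the indirect variant):

```c
/* Sketch of a Sqmat-style kernel: each NxN matrix is gathered through an
 * index array idx[], squared M times, and scattered back.  idx[] would be
 * built so that S entries are contiguous before each random jump; the
 * direct-storage case simply skips the indirection. */
void sqmat_probe(double *entries, const int *idx,
                 int num_matrices, int N, int M)
{
    double a[N][N], b[N][N];              /* 2*N*N registers when fully unrolled */

    for (int m = 0; m < num_matrices; m++) {
        const int *mi = idx + m * N * N;

        for (int i = 0; i < N * N; i++)   /* gather (indirect loads) */
            a[i / N][i % N] = entries[mi[i]];

        for (int r = 0; r < M; r++) {     /* M squarings: CI grows with M */
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++) {
                    double s = 0.0;
                    for (int k = 0; k < N; k++)
                        s += a[i][k] * a[k][j];
                    b[i][j] = s;
                }
            for (int i = 0; i < N; i++)   /* copy result back into a */
                for (int j = 0; j < N; j++)
                    a[i][j] = b[i][j];
        }

        for (int i = 0; i < N * N; i++)   /* scatter results back */
            entries[mi[i]] = a[i / N][i % N];
    }
}
```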
Unit Stride Algorithmic Peak
The curve increases until the memory system is fully utilized, then plateaus when the FPUs saturate
Itanium2 takes longer to reach its plateau due to the register-spill penalty
The SIMD nature of Opteron's SSE2 unit inhibits a high algorithmic peak
Power3 is effective at hiding cache-access latency
Power4's deep pipeline inhibits its ability to find sufficient ILP to saturate the FPUs
[Figure: percent of algorithmic peak (0-100%) vs. computational intensity (1-10000) for Itanium2, Opteron, Power3, and Power4.]
Slowdown due to Indirection
Opteron and Power3/4 show less than a 10% penalty once M > 8, demonstrating that the bandwidth between cache and processor effectively delivers both addresses and values
Itanium2 shows a high penalty for indirection; this issue is currently under investigation
[Figure: slowdown (1-6x) vs. M (1-512) for unit-stride access via indirection (S=1), for Itanium2, Opteron, Power3, and Power4.]
Cost of Irregularity (1)
[Figure: slowdown for irregular access vs. M (1-512) at irregularity levels from 100% (S=1) down to 0.78% (S=128). Left: irregularity on Itanium2, N=4; right: irregularity on Opteron, N=4.]
Itanium2 and Opteron perform well for irregular accesses due to:
Itanium2's L2 caching of FP values (reduces the cost of a cache miss)
Opteron's low memory latency from its on-chip memory controller
Cost of Irregularity (2)
[Figure: slowdown for irregular access vs. M (1-512) at irregularity levels from 100% (S=1) down to 0.39% (S=256), plus fully random accesses. Left: irregularity on Power3, N=4; right: irregularity on Power4, N=4.]
Power3 and Power4 perform poorly for irregular accesses due to:
Power3's high cache-miss penalty (35 cycles) and limited prefetch abilities
Power4's requirement of 4 cache-line hits to activate prefetching
Tolerating Irregularity
S50: start with some M at S = ∞ (indirect unit stride). For a given M, how large must S be to achieve at least 50% of the original performance?
M50: start with M = 1 at S = ∞. At S = 1 (every access random), how large must M be to achieve 50% of the original performance?
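A minimal sketch of how the S50 sweep could be carried out (hypothetical helper names; assumes a routine that runs the probe and reports its performance):

```c
#define S_UNIT (1 << 30)   /* stands in for "S = infinity" (indirect unit stride) */

/* Hypothetical timing hook: runs the probe and returns performance
 * (e.g., MFLOP/s) for the given working set, CI, and irregularity. */
extern double sqmat_perf(int N, int M, int S);

/* S50 sweep: for a fixed M, find the largest fraction of random accesses
 * (1/S) that still retains at least 50% of the unit-stride performance. */
double find_s50(int N, int M)
{
    double baseline = sqmat_perf(N, M, S_UNIT);

    for (int S = 1; S <= 512; S *= 2)
        if (sqmat_perf(N, M, S) >= 0.5 * baseline)
            return 1.0 / S;      /* smallest S meeting the 50% bar */

    return 0.0;                  /* never recovers 50% within the sweep */
}
```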
Tolerating Irregularity
The probe stresses the balance points of processor design (PMEO-04)
Gather/scatter is expensive on commodity cache-based systems
Power4 can tolerate only 1.6% random accesses (1 in 64); Itanium2 is much less sensitive at 25% (1 in 4)
A huge amount of computation may be required to hide the overhead of irregular data access
Itanium2 requires a CI of about 9 flops/word; Power4 requires a CI of almost 75!
Interested in developing application-driven architectural probes for the evaluation of emerging petascale systems
S50: What % of memory accesses can be random before performance decreases by half?
M50: How much computational intensity is required to hide the penalty of all-random access (i.e., before performance drops by 50%)?
[Figure: % indirection tolerated (S50) per machine, with values of 0.8%, 1.6%, 6.3%, and 25% across Itanium 2, Opteron, Power3, and Power4; CI required to hide indirection (M50): 9.3, 149.3, 18.7, and 74.7 for Itanium 2, Opteron, Power3, and Power4 respectively.]
Emerging Architectures
General-purpose processors are badly suited for data-intensive operations:
Large caches are not useful
Low memory bandwidth
Superscalar methods of increasing ILP are inefficient
High power consumption
Application-specific ASICs: good, but expensive and slow to design
Solution: general-purpose "memory aware" processors
Large number of ALUs: to exploit data parallelism
Huge memory bandwidth: to keep the ALUs busy
Concurrency: overlap memory with computation
VIRAM Overview
MIPS core (200 MHz)
Main memory system:
8 banks with 13 MB of on-chip DRAM
Large 6.4 GB/s on-chip peak bandwidth
Cache-less
Vector unit:
An energy-efficient way to express fine-grained parallelism and exploit bandwidth
Single issue, in order
Low power consumption: 2.0 W
Peak vector performance: 1.6/3.2/6.4 Gop/s; 1.6 Gflop/s (single precision)
Fabricated by IBM; taped out 02/2003
To hide DRAM access latency, load/store and arithmetic instructions are deeply pipelined (15 stages)
We use a simulator with Cray's vcc compiler
VIRAM Power Efficiency
Comparable performance with a lower clock rate
Large power/performance advantage for VIRAM from PIM technology and the data-parallel execution model
[Figure: MOPS/Watt (log scale, 0.1-1000) on the Transitive, GUPS, SPMV, Hist, and Mesh benchmarks for VIRAM, R10K, P-III, P4, Sparc, and EV6.]
Stream Processing
Stream: an ordered set of records (homogeneous, arbitrary data type)
Stream programming: data is streams, computation is a kernel
A kernel loops through all stream elements (in sequential order) and performs a compound (multiword) operation on each element
(By contrast, a vector unit performs a single arithmetic operation on each vector element, then stores the result in a register)
Example: stereo depth extraction
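A minimal sketch of the stream-programming model described above (hypothetical C; real Imagine kernels are written in a kernel language and operate out of the stream register file):

```c
#include <stddef.h>

/* One stream record: a compound (multiword) element. */
typedef struct {
    float left, right;      /* e.g., a pair of matched pixel values */
} Record;

/* A kernel loops over every element of its input stream in order and
 * performs a compound operation on each, producing an output stream;
 * a vector instruction would instead apply one operation across all
 * elements before storing to a register. */
static void depth_kernel(const Record *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        float disparity = in[i].left - in[i].right;   /* several ops per record */
        out[i] = 1.0f / (disparity * disparity + 1.0f);
    }
}
```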
Data and Functional Parallelism
High computational rate
Little data reuse
Producer-consumer and spatial locality
Examples: multimedia, signal processing, graphics
Imagine Overview
"Vector VLIW" processor
Coprocessor to an off-chip host processor
8 arithmetic clusters controlled in SIMD with VLIW instructions
Central 128 KB Stream Register File (SRF) @ 32 GB/s
The SRF can overlap computation with memory access (double buffering)
The SRF can reuse intermediate results (producer-consumer locality)
Stream-aware memory system with 2.7 GB/s off-chip bandwidth
544 GB/s inter-cluster communication
The host sends instructions to the stream controller, which issues commands to the on-chip modules
VIRAM and Imagine
Imagine has an order of magnitude higher performance
VIRAM has twice the memory bandwidth and lower power consumption
Note the peak Flop/Word ratios (table below)
                      VIRAM        Imagine (memory)   Imagine (SRF)
Bandwidth (GB/s)      6.4          2.7                32
Peak Float (32-bit)   1.6 GF/s     20 GF/s            20 GF/s
Peak Float/Word       1            30                 2.5
Speed (MHz)           200          400
Chip Area             15x18 mm     12x12 mm
Data widths (bits)    64/32/16     32/16/8
Transistors           130 x 10^6   21 x 10^6
Power Consumption     2 W          10 W
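The Peak Float/Word row appears to follow from dividing the peak 32-bit flop rate by the word-delivery rate of each path (assuming 4-byte words, matching the 32-bit peak-float row):

VIRAM:            1.6 GF/s / (6.4 GB/s / 4 B) = 1.6 / 1.6   = 1 flop/word
Imagine (memory): 20 GF/s  / (2.7 GB/s / 4 B) = 20 / 0.675  ≈ 30 flops/word
Imagine (SRF):    20 GF/s  / (32 GB/s  / 4 B) = 20 / 8      = 2.5 flops/word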
SQMAT: Performance Crossover
[Figure: Sqmat cycles (0-80,000) and MFLOPS (0-4,500) vs. vector/stream length L (8-1024) for VIRAM and Imagine.]
Large number of ops per word: each 3x3 matrix is raised to the 10th power (N^10, N = 3x3)
Crossover point: L = 64 (cycles), L = 256 (MFLOPS)
Imagine's power becomes apparent at L = 1024: almost 4x VIRAM
Codes at this end of the spectrum greatly benefit from the Imagine architecture
Stencil Probe
Stencil computations are at the core of a wide range of scientific applications
Applications include Jacobi solvers, complex multigrid, and block-structured AMR
We are developing an adaptable stencil probe to model this range of computations (a minimal sketch appears at the end of this section)
Our findings isolate the importance of streaming memory accesses, which engage automatic prefetch engines and thus greatly increase memory throughput
Previous L1 tiling techniques are mostly ineffective for stencil computations on modern microprocessors:
Small blocks inhibit automatic prefetching performance
Modern large on-chip L2/L3 caches have bandwidth similar to L1
Currently investigating tradeoffs between blocking and prefetching (paper in preparation)
Interested in exploring potential benefits of enhancing commodity processors with explicitly programmable prefetching
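A minimal sketch of the kind of kernel such a stencil probe exercises (hypothetical; a plain 2D Jacobi relaxation sweep, not the probe's actual parameterized code):

```c
/* Hypothetical 2D Jacobi sweep over an NX x NY grid: the long unit-stride
 * runs of the inner loop are what engage hardware prefetch engines. */
#define NX 1024
#define NY 1024

static void jacobi_sweep(const double a[NX][NY], double b[NX][NY])
{
    for (int i = 1; i < NX - 1; i++)
        for (int j = 1; j < NY - 1; j++)          /* unit stride in j */
            b[i][j] = 0.25 * (a[i - 1][j] + a[i + 1][j] +
                              a[i][j - 1] + a[i][j + 1]);
}
```

Tiling this loop nest into small L1-sized blocks shortens the unit-stride runs in j, which is consistent with the observation above that small blocks inhibit automatic prefetching.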