pSeries Optimization
Charles Grassl, IBM
August, 2004
© 2004 IBM
2 © 2004 IBM Corporation
Agenda
Optimization
– Strategies and techniques
– Bandwidth Exploitation
– Pipelining
3 © 2004 IBM Corporation
POWER4 Performance Expectations
Floating point:
– Hundreds to 1000's of Mflop/s
– Limitations:
  – System bandwidth
  – Program model
– Floating point operations:
  – Adds and multiplies
  – Divides
– Memory access:
  – Copying

Bandwidth:
– 0.100 - 5 Gbyte/s
– Limitations:
  – Strided access is slow
– Enhancers:
  – Multiple streams
4 © 2004 IBM Corporation
Performance Expectations Example: program POP
– Structured Fortran 90
– Data movement
– Low computational intensity
– Low FMA percentage

Metric                                     Value
Instructions per cycle                     0.9
HW floating point instructions per cycle   0.3
Floating point instruction rate            372 Mflop/s
FMA percentage                             53 %
Computational intensity                    0.87
5 © 2004 IBM Corporation
Program POP: Computation Time Distribution
[Chart: percentage of computation time by subroutine]
6 © 2004 IBM Corporation
Performance Expectations Example: program SPPM
– Fortran 77
– Optimized
– High computational intensity
– Vector intrinsics
– Low FMA percentage

Metric                                     Value
Instructions per cycle                     1.2
HW floating point instructions per cycle   0.7
Floating point instruction rate            979 Mflop/s
FMA percentage                             55 %
Computational intensity                    1.8
7 © 2004 IBM Corporation
Program SPPM: Computation Time Distribution
[Chart: percentage of computation time by subroutine]
8 © 2004 IBM Corporation
Bandwidth Exploitation
Computational intensity
– Increase the number of flops per memory reference
Blocking
– Reuse data in cache
Load streams
– Expose up to 8 memory access patterns
9 © 2004 IBM Corporation
Computational Intensity Examples
Loop                           Comp. Intens.   Comment
A(i) = B(i) + C(i)             0.33            SUM
A(i) = A(i) + s*B(i)           0.67            AXPY
A(i) = r*B(i) + s              1               Scale
A(i) = r*B(i) + s*C(i)         1               Triad
A(i) = r + B(i)*(s + t*B(i))   2               Polynomial
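As a sketch of how these counts arise, take the triad row from the table above: three floating point operations against three memory references per iteration.

      ! Triad: 3 flops (two multiplies, one add) per iteration against
      ! 3 memory references (load B(i), load C(i), store A(i)),
      ! so the computational intensity is 3/3 = 1.
      do i = 1, n
         A(i) = r*B(i) + s*C(i)
      end do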
10 © 2004 IBM Corporation
Computational Intensity Tests
[Chart: Mflop/s versus computational intensity (0.33 to 3); data from a 1.3 GHz p690]
11 © 2004 IBM Corporation
Computational Intensity
Nominal bandwidth is 2.5 Gbyte/s for single-line loops
– 300 Mword/s
– ...
– 300 Mflop/s for computational intensity of 1
– 600 Mflop/s for computational intensity of 2
– ...
Performance is (often) limited by bandwidth
– NOT by functional unit performance
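A rough derivation of these figures, assuming 8-byte (REAL*8) operands:

   2.5 Gbyte/s ÷ 8 bytes/word ≈ 300 Mword/s
   Mflop/s ≈ computational intensity × 300 Mword/s
           ≈ 300 Mflop/s at intensity 1, 600 Mflop/s at intensity 2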
12 © 2004 IBM Corporation
Computational Intensity Strategy: Loop Unrolling
Outer loop strategy:
– Increase computational intensity
– Minimize loads/stores
Find a variable which is constant with respect to the outer loop
– Unroll so that this variable is loaded once but used multiple times
13 © 2004 IBM Corporation
Outer Loop Unrolling Example
DO I = 1, N
   DO J = 1, N
      s = s + X(J)*A(J,I)
   END DO
END DO

2 flops / 2 loads; computational intensity: 1

DO I = 1, N, 4
   DO J = 1, N
      s = s + X(J)*A(J,I+0) + X(J)*A(J,I+1) + X(J)*A(J,I+2) + X(J)*A(J,I+3)
   END DO
END DO

8 flops / 5 loads; computational intensity: 1.6
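A sketch of the same unrolling with a cleanup loop, so that N need not be a multiple of 4 (the cleanup code is an addition here, not part of the original example):

NR = MOD(N, 4)
DO I = 1, N - NR, 4
   DO J = 1, N
      s = s + X(J)*A(J,I+0) + X(J)*A(J,I+1) + X(J)*A(J,I+2) + X(J)*A(J,I+3)
   END DO
END DO
! Cleanup for the 0-3 leftover columns
DO I = N - NR + 1, N
   DO J = 1, N
      s = s + X(J)*A(J,I)
   END DO
END DO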
14 © 2004 IBM Corporation
Outer Loop Unroll Test
[Chart: Mflop/s versus unroll factor (1, 2, 4, 8, 12), compiled with -O3 and with -O3 -qhot]
15 © 2004 IBM Corporation
Outer Loop Unroll Analysis
Strategy:
– Expose 8 prefetch streams
– Unroll up to 8 times
Compiler:
– Unrolls 4 times
– Near-optimal performance
– Combines inner and outer loop unrolling
16 © 2004 IBM Corporation
Inner Loop Unrolling Strategies
Inner loop strategy:
– Reduce data dependencies
– Eliminate intermediate loads and stores
– Expose functional units
– Expose registers
Examples:
– Linear recurrences
– Simple loops
  – Single operation
17 © 2004 IBM Corporation
Software Pipelining
Exploit registers
– The number of registers in POWER4 is critical
– Some assistance from rename registers
Compiler does unrolling at -O3 and higher
– User unrolling can conflict with the compiler's
18 © 2004 IBM Corporation
Software Pipelining
do i = 1, n
   sum = sum + X(i)
end do

do i = 1, n-3, 4
   sum1 = sum1 + X(i  )
   sum2 = sum2 + X(i+1)
   sum3 = sum3 + X(i+2)
   sum4 = sum4 + X(i+3)
end do
sum = sum1 + sum2 + sum3 + sum4

Explicit unrolling: expose more variables for register use
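In full, the unrolled version also needs the partial sums initialized and a cleanup loop when n is not a multiple of 4 (a minimal sketch; the cleanup is an assumption, not shown on the slide). The four independent partial sums break the add dependency chain so the pipelines can fill.

sum1 = 0.0d0
sum2 = 0.0d0
sum3 = 0.0d0
sum4 = 0.0d0
do i = 1, n - mod(n,4), 4
   sum1 = sum1 + X(i  )
   sum2 = sum2 + X(i+1)
   sum3 = sum3 + X(i+2)
   sum4 = sum4 + X(i+3)
end do
! Leftover elements
do i = n - mod(n,4) + 1, n
   sum1 = sum1 + X(i)
end do
sum = sum1 + sum2 + sum3 + sum4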
19 © 2004 IBM Corporation
Software Pipelining Example
[Chart: Mflop/s versus unroll factor (1, 2, 4, 6, 8, 12), compiled with -O2]
20 © 2004 IBM Corporation
Software Pipelining Example
[Chart: Mflop/s versus unroll factor (1, 2, 4, 6, 8, 12), compiled with -O3]
21 © 2004 IBM Corporation
Software Pipelining Example
[Chart: Mflop/s versus unroll factor (1, 2, 4, 6, 8, 12), -O2 versus -O3]
22 © 2004 IBM Corporation
Software Pipelining Example
Conclusion:
– Avoid explicit user unrolling at -O3
– Allow the compiler to perform the unrolling
23 © 2004 IBM Corporation
Computational Intensity: Effect of Cache
Large caches alleviate the memory bandwidth bottleneck
– Store part of the computation's data stream in cache
– Reduce shared memory contention
hpmcount group 59 reports "loads and stores" from the processor
– Discounts cache loads
The concept of computational intensity needs to be extended
24 © 2004 IBM Corporation
Computational Intensity: Effect of Cache
Example:
   ...
   do j = 1, m
      call DAXPY(.., A(1,k), .., A(1,j), ..)
   end do
   ...

   subroutine daxpy(...)
   do i = 1, n
      X(i) = X(i) + t*Y(i)
   end do

Arrays involved: A(:,k), A(:,j)
With A(:,k) held in cache across the j loop, the loop actually performs 2 flops per 2 memory references: computational intensity 1.
25 © 2004 IBM Corporation
Loop of DAXPY
[Chart: DAXPY loop rates (Mop/s): bandwidth and computation, with and without cache. Results from a 1.5 GHz POWER4+]
26 © 2004 IBM Corporation
Memory Access Strides
Strided memory access:
– Fewer words used per cache line
– Reduced cache line utilization
– Reduced bandwidth
Memory is accessed by cache line
– 16 REAL*8 (double) words per line
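A minimal sketch of the effect (the array and the stride of 16 are illustrative, not from the original): with REAL*8 data and a 128-byte cache line, a stride of 16 touches a new line on every iteration but uses only one of its 16 words.

real*8 A(16*n), s
! Unit stride: all 16 words of each fetched cache line are used
do i = 1, n
   s = s + A(i)
end do
! Stride 16: one word used per 128-byte line fetched,
! so roughly 1/16 of the effective bandwidth
do i = 1, 16*n, 16
   s = s + A(i)
end do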
27 © 2004 IBM Corporation
Stride Test Bandwidth
[Chart: bandwidth (Mbyte/s) versus stride (-16 to 16); 1.3 GHz POWER4]
28 © 2004 IBM Corporation
Strides and TLB
Strided memory access:
– Fewer memory references per memory page
– Increased TLB misses
TLB signature obtained from:
– hpmcount -g 59
– Memory references per TLB miss

do i = 1, m
   do j = 1, n
      A(i,j) = A(i,j) + ...
   end do
end do
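A common way to reduce the stride (an addition here, not stated on the slide) is to interchange the loops so that the first, contiguous subscript varies in the inner loop:

do j = 1, n
   do i = 1, m
      A(i,j) = A(i,j) + ...
   end do
end do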
29 © 2004 IBM Corporation
Large Stride Test
[Chart: bandwidth (Mbyte/s) versus stride (8 to 2048), showing cache and TLB effects]
30 © 2004 IBM Corporation
Cache Blocking
Common technique in linear algebra
– Similar to unrolling
– Utilize cache lines
Linear algebra blocking factor NB:
– Typically 96-256
31 © 2004 IBM Corporation
Blocking
32 © 2004 IBM Corporation
Blocking
do i = 1, m
   do j = 1, n
      B(j,i) = A(i,j)
   end do
end do

do j1 = 1, n, nb
   j2 = min(j1+nb-1, n)
   do i1 = 1, m, nb
      i2 = min(i1+nb-1, m)
      do i = i1, i2
         do j = j1, j2
            B(j,i) = A(i,j)
         end do
      end do
   end do
end do
33 © 2004 IBM Corporation
Blocking Example: Transpose
[Chart: transpose bandwidth (Mbyte/s) versus blocking factor (1, 32, 64, 128, 192) and for dgetmo]
34 © 2004 IBM Corporation
Prefetch Strategies
Merge loops
– Combine conforming loops
– Compiler can do much of this
Folding
– Useful for very long loops
– Compiler can do much of this
35 © 2004 IBM Corporation
Merge Loops
Overlap cache line fetches
Expose memory prefetching

for (j=1; j ...   [code example truncated in the original]
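The code example above is cut off in the original. A minimal Fortran sketch of the idea (the arrays and operations are illustrative): merging two conforming loops turns two passes over memory into one, with all four load streams active at once.

! Separate loops: two passes over memory, two streams each
do i = 1, n
   A(i) = A(i) + s*B(i)
end do
do i = 1, n
   C(i) = C(i) + s*D(i)
end do

! Merged loop: one pass, four prefetch streams
do i = 1, n
   A(i) = A(i) + s*B(i)
   C(i) = C(i) + s*D(i)
end do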
37 © 2004 IBM Corporation
Fold Loops
Increase number of streams at the expense of loop length

do i = 1, n
   sum = sum + A(i)
end do

One stream per loop

do i = 1, n/4
   sum = sum + A(i) + A(i+1*n/4) + A(i+2*n/4) + A(i+3*n/4)
end do

Four streams
38 © 2004 IBM Corporation
Folding Example
[Chart: Mflop/s versus fold factor (0, 2, 4, 6, 8, 12); 1.3 GHz POWER4, single RHS, size 10,000,000]
39 © 2004 IBM Corporation
Folding Example
[Chart: Mflop/s versus loop length (1,000 to ~3,000,000) for fold factors 0, 2, 4, 6, 8; 1.3 GHz POWER4]
40 © 2004 IBM Corporation
Folding Example
Beneficial for long loops
– Works at ~100,000 loads
Do not fold past a factor of 8
41 © 2004 IBM Corporation
Stride Folding
Fold the loop to increase the number of outstanding cache loads:

do i = 1, n, m
   sum = sum + A(i)
end do

do i = 1, n/4, m
   sum = sum + A(i) + A(i+1*n/4) + A(i+2*n/4) + A(i+3*n/4)
end do
42 © 2004 IBM Corporation
Stride Folding Example
[Chart: Mflop/s versus fold factor (0 to 16); 1.3 GHz POWER4, single RHS, size 5,000,000, stride 32]
43 © 2004 IBM Corporation
Stride Folding Example
Overlap outstanding loads
– Not limited by the number of stream buffers
– More than 8 RHS
44 © 2004 IBM Corporation
Managing Pipelines
Deep-pipeline functional units:
– FMA
– Divide
– Square root
45 © 2004 IBM Corporation
Divide and Square Root
POWER4 special functions:
– Divide
– Sqrt
Use the FMA functional units
– 2 simultaneous divides or sqrts (or rsqrts)
– NOT pipelined

Instruction   Single   Double   (cycles)
FMA             6        6
Divide         33       33
Sqrt           40       40
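As a rough consistency check against the rates on the following charts (assuming the two floating point units of a 1.3 GHz POWER4), a non-pipelined 33-cycle divide gives about

   2 × 1300 MHz / 33 cycles ≈ 79 Mop/s

which is in line with the measured hardware divide rates.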
46 © 2004 IBM Corporation
Hardware DIV, SQRT, RSQRT
[Chart: hardware divide, SQRT, and RSQRT rates (Mop/s); 1.3 GHz p690]
47 © 2004 IBM Corporation
Hardware and Pipelined DIV, SQRT, RSQRT
[Chart: hardware versus pipelined divide, SQRT, and RSQRT rates (Mop/s); 1.3 GHz p690]
48 © 2004 IBM Corporation
Vectorization Analysis
Dependencies
Compiler overhead:
– Generates (malloc) local temporary arrays
– Extra memory traffic
Moderate vector lengths required

             REC   SQRT   RSQRT
Cross over    20     80      25
N 1/2         45     25      30

(Cross over: vector length above which the pipelined version wins; N 1/2: length needed to reach half the asymptotic rate.)
49 © 2004 IBM Corporation
Function Timings
             Pipelined              Hardware or scalar
Function     Clocks   Rate (Mop/s)  Clocks   Rate (Mop/s)
Reciprocal     20       130           32        81
SQRT           31        84           36        69
RSQRT          29        90           66        39
EXP            32        83          200        13
LOG            34        77          288         9
50 © 2004 IBM Corporation
Function Rates
[Chart: scalar versus vector rates (Mop/s) for reciprocal, SQRT, RSQRT, EXP, LOG]
51 © 2004 IBM Corporation
SQRT
POWER4 has hardware SQRT available
The default, -qarch=com, uses a software library
Use: -qarch=pwr4
[Chart: SQRT rate with -qarch=com, -qarch=pwr4, and -qhot]
52 © 2004 IBM Corporation
Divide
IEEE divide specifies an actual divide
– Multiplication by the reciprocal is not used by default (it can change the rounded result)
Optimize with -O3

do i = 1, n
   B(i) = A(i)/s
end do

rs = 1/s
do i = 1, n
   B(i) = A(i)*rs
end do
53 © 2004 IBM Corporation
Divide
[Chart: divide rate (Mop/s) with -O2, -O2 -qunroll, and -O3; 1.1 GHz POWER4]
54 © 2004 IBM Corporation
Vector Intrinsics
-qhot generates "vector" calls to vector intrinsic functions
Monitor with -qreport=hotlist

do i = 1, n
   B(i) = func(A(i))
end do

becomes:

call __vfunc(B, A, n)
55 © 2004 IBM Corporation
Example of Pipelined Functions
do i = -nbdy+3, n+nbdy-1
   prl = qrprl(i)
   pll = qrmrl(i)
   pavg = vtmp1(i)
   wllfac(i) = 5*gammp1*pavg + gamma*pll
   wrlfac(i) = 5*gammp1*pavg + gamma*prl
   hrholl = rho(1,i-1)
   hrhorl = rho(1,i)
   wll(i) = 1/sqrt(hrholl * wllfac(i))
   wrl(i) = 1/sqrt(hrhorl * wrlfac(i))
end do
56 © 2004 IBM Corporation
Example of Pipelined Functions
allocate(t1(n+2*nbdy-3))
allocate(t2(n+2*nbdy-3))
do i = -nbdy+3, n+nbdy-1
   prl = qrprl(i)
   ...
   t1(i) = hrholl * wllfac(i)
   t2(i) = hrhorl * wrlfac(i)
end do
call __vrsqrt(wll, t1, n+2*nbdy-3)
call __vrsqrt(wrl, t2, n+2*nbdy-3)
57 © 2004 IBM Corporation
Vector Intrinsics
[Chart: rates (Mop/s) for REC, SQRT, RSQRT, EXP, LOG with -O3 versus -O3 -qhot; 1.3 GHz POWER4]
58 © 2004 IBM Corporation
32-bit versus 64-bit Addressing
64-bit addressing mode enables the use of 64-bit integer arithmetic
Integer arithmetic, especially INTEGER(kind=8) (long), is much faster with -q64
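For example, with the XL compilers the addressing mode is chosen at compile and link time (the source file name is illustrative):

xlf_r -q64 -O3 -qarch=pwr4 -qtune=pwr4 prog.f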
59 © 2004 IBM Corporation
Integer Arithmetic
[Chart: integer multiply rates (Mop/s) for I4*I4, I4*I8, and I8*I8 with -q32 versus -q64]
60 © 2004 IBM Corporation
32-bit Floating Point Arithmetic
– Faster
– Less bandwidth required
– Arithmetic operations are the same speed as 64-bit
– More efficient use of cache