1
Tuesday, September 19, 2006
The practical scientist is trying to solve tomorrow's problem
on yesterday's computer. Computer scientists often have
it the other way around.
- Numerical Recipes, C Edition
2
Reference Material
Lectures 1 & 2
• "Parallel Computer Architecture" by David Culler et al., Chapter 1.
• "Sourcebook of Parallel Computing" by Jack Dongarra et al., Chapters 1 and 2.
• "Introduction to Parallel Computing" by Grama et al., Chapter 1 and Chapter 2 §2.4.
• www.top500.org
Lecture 3
• "Introduction to Parallel Computing" by Grama et al., Chapter 2 §2.3.
• "Introduction to Parallel Computing", Lawrence Livermore National Laboratory, http://www.llnl.gov/computing/tutorials/parallel_comp/
Lectures 4 & 5
• "Techniques for Optimizing Applications" by Garg et al., Chapter 9.
• "Software Optimizations for High Performance Computing" by Wadleigh et al., Chapter 5.
• "Introduction to Parallel Computing" by Grama et al., Chapter 2 §2.1-2.2.
3
Software Optimizations
Optimize serial code before parallelizing it.
4
Loop Unrolling
Original loop:
do i=1,n
  A(i)=B(i)
enddo
Unrolled by 4 (assumption: n is divisible by 4):
do i=1,n,4
  A(i)=B(i)
  A(i+1)=B(i+1)
  A(i+2)=B(i+2)
  A(i+3)=B(i+3)
enddo
• Some compilers allow users to specify the unrolling depth.
• Avoid excessive unrolling: register pressure and spills can hurt performance.
• Unrolling helps pipelining hide instruction latencies.
• Unrolling reduces the overhead of the index increment and conditional check.
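When n is not known to be a multiple of 4, the unrolled loop needs a cleanup loop for the leftover iterations. The following is a minimal C sketch of that idea; the function name copy_unrolled4 and the use of double arrays are assumptions made for illustration, not part of the slides.

```c
#include <stddef.h>

/* Copy b into a with the loop body unrolled by 4.
   The cleanup loop handles the last n % 4 elements, so n does not
   have to be divisible by 4. */
void copy_unrolled4(double *a, const double *b, size_t n)
{
    size_t nend = n - (n % 4);   /* largest multiple of 4 not exceeding n */
    size_t i;

    for (i = 0; i < nend; i += 4) {
        a[i]     = b[i];
        a[i + 1] = b[i + 1];
        a[i + 2] = b[i + 2];
        a[i + 3] = b[i + 3];
    }
    for (i = nend; i < n; i++)   /* remainder iterations */
        a[i] = b[i];
}
```

Many compilers perform this transformation themselves at higher optimization levels, so a hand-unrolled version is worth keeping only if measurement shows a gain.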
5
Loop Unrolling
do j=1 to N
do i = 1 to N
Z[i,j]=Z[i,j]+X[i]*Y[j]
enddo
enddo
Unroll outer loop by 2
6
Loop Unrolling
do j=1 to N step 2
do i = 1 to N
Z[i,j]=Z[i,j]+X[i]*Y[j]
Z[i,j+1]=Z[i,j+1]+X[i]*Y[j+1]
enddo
enddo
The number of load operations can be reduced: each X[i] is loaded once and used for both columns j and j+1, so there are half as many loads of X. Assumption: N is even.
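The outer-loop unrolling above can be sketched in C as follows. The function name rank1_update_unroll2 and the storage layout (z[j*n + i] holds Z[i,j], so that i is the contiguous index as in the slide's inner loop) are assumptions for illustration. A cleanup loop covers the last column when n is odd.

```c
#include <stddef.h>

/* Z[i,j] += X[i] * Y[j], with the outer (j) loop unrolled by 2.
   Assumed layout: z[j*n + i] holds Z[i,j]. */
void rank1_update_unroll2(double *z, const double *x, const double *y, size_t n)
{
    size_t nend = n - (n % 2);   /* largest even number not exceeding n */
    size_t i, j;

    for (j = 0; j < nend; j += 2) {
        for (i = 0; i < n; i++) {
            double xi = x[i];    /* loaded once, used for two columns */
            z[j * n + i]       += xi * y[j];
            z[(j + 1) * n + i] += xi * y[j + 1];
        }
    }
    for (j = nend; j < n; j++)   /* leftover column when n is odd */
        for (i = 0; i < n; i++)
            z[j * n + i] += x[i] * y[j];
}
```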
8
Loop Fusion
• Beneficial in loop-intensive programs.
• Decreases index calculation overhead.
• Can also help instruction-level parallelism.
• Beneficial if the same data structures are used in different loops.
9
Loop Fusion
for (i=0; i<n; i++)
temp[i] =x[i]*y[i];
for (i=0; i<n; i++)
z[i] =w[i]+temp[i];
10
Loop Fusion
for (i=0; i<n; i++)
z[i] =x[i]*y[i]+w[i];
Check for register pressure before fusing
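A minimal C sketch of the two versions side by side, so they can be compared on the same input. The function names two_loops and fused_loop are hypothetical; the array names follow the slide.

```c
#include <stddef.h>

/* Unfused: the product is written to temp and read back in a second pass. */
void two_loops(double *z, double *temp, const double *x, const double *y,
               const double *w, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        temp[i] = x[i] * y[i];
    for (i = 0; i < n; i++)
        z[i] = w[i] + temp[i];
}

/* Fused: one pass, one set of index updates, and no temp array traffic. */
void fused_loop(double *z, const double *x, const double *y,
                const double *w, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        z[i] = x[i] * y[i] + w[i];
}
```

Fusion is only valid here because temp is not needed after the second loop; if other code reads temp, the fused version must still store it.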
11
Loop Fission
• Conditional statements can hurt pipelining.
• Split the loop into two: one with the conditional statements and the other without.
• The compiler can apply optimizations such as unrolling to the condition-free loop.
• Beneficial for fat loops that may lead to register spills.
12
Loop Fission
for (i=0;i<nodes;i++) {
a[i] = a[i]*small;
dtime = a[i] + b[i];
dtime = fabs(dtime*ratinpmt);
temp1[i] = dtime*relaxn;
if(temp1[i] > hgreat) {
temp1[i]=1;
}
}
13
Loop Fission
Split into a condition-free loop and a loop holding the condition:
for (i=0;i<nodes;i++) {
  a[i] = a[i]*small;
  dtime = a[i] + b[i];
  dtime = fabs(dtime*ratinpmt);
  temp1[i] = dtime*relaxn;
}
for (i=0;i<nodes;i++) {
  if(temp1[i] > hgreat) {
    temp1[i]=1;
  }
}
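The split loops can be wrapped in a self-contained C function as sketched below. The variable names come from the slide; the function name relax_split, the parameter list, and the use of double are assumptions, since the slides do not show declarations.

```c
#include <math.h>
#include <stddef.h>

/* Loop fission: the arithmetic and the conditional clamp run in
   separate loops, so the first loop body contains no branch. */
void relax_split(double *a, const double *b, double *temp1, size_t nodes,
                 double small, double ratinpmt, double relaxn, double hgreat)
{
    size_t i;

    for (i = 0; i < nodes; i++) {        /* condition-free loop */
        double dtime;
        a[i] = a[i] * small;
        dtime = a[i] + b[i];
        dtime = fabs(dtime * ratinpmt);
        temp1[i] = dtime * relaxn;
    }
    for (i = 0; i < nodes; i++) {        /* loop with the condition */
        if (temp1[i] > hgreat)
            temp1[i] = 1;
    }
}
```

The second loop re-reads temp1, so fission trades an extra pass over that array for a simpler first loop; whether that pays off depends on the array size and the machine.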
14
Reductions
for (i=0; i<n; i++)
{
sum +=x[i];
}
Normally a single register is used for the reduction variable, so each addition has to wait for the previous one to finish.
How can the floating-point instruction latency be hidden?
15
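Reductions
Use several independent partial sums so that consecutive additions do not depend on each other:
sum1=sum2=sum3=sum4=0.0;
nend = (n>>2)<<2;   /* largest multiple of 4 not exceeding n */
for (i=0; i<nend; i+=4){
  sum1 +=x[i];
  sum2 +=x[i+1];
  sum3 +=x[i+2];
  sum4 +=x[i+3];
}
sumx = sum1 + sum2 + sum3 + sum4;
for (i=nend; i<n; i++)   /* remaining elements */
  sumx += x[i];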
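The fragment above can be packaged as a complete C function, sketched here. The function name sum_four_accumulators is hypothetical. Because floating-point addition is not associative, the result can differ from the single-accumulator loop in the last bits.

```c
#include <stddef.h>

/* Sum x[0..n-1] using four independent accumulators so that
   successive additions can overlap in the floating-point pipeline. */
double sum_four_accumulators(const double *x, size_t n)
{
    double sum1 = 0.0, sum2 = 0.0, sum3 = 0.0, sum4 = 0.0, sumx;
    size_t nend = (n >> 2) << 2;   /* largest multiple of 4 not exceeding n */
    size_t i;

    for (i = 0; i < nend; i += 4) {
        sum1 += x[i];
        sum2 += x[i + 1];
        sum3 += x[i + 2];
        sum4 += x[i + 3];
    }
    sumx = sum1 + sum2 + sum3 + sum4;
    for (i = nend; i < n; i++)     /* remaining n % 4 elements */
        sumx += x[i];
    return sumx;
}
```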
16
a**0.5 vs sqrt(a)
• Appropriate include files, e.g. math.h, can help the compiler generate faster code.
18
The time to access memory has not kept pace with CPU clock speeds.
A program can perform suboptimally because the data it operates on are not delivered from memory to registers by the time the processor is ready to use them.
The result is wasted CPU cycles: CPU starvation.
19
20
Ability of the memory system to feed data to the processor:
• Memory latency
• Memory bandwidth
21
Effect of Memory Latency
• 1 GHz processor (1 ns clock)
• Capable of executing 4 instructions in each 1 ns cycle
• DRAM with latency 100 ns
• Memory block size: 1 word
• Peak processor rating?
24
Effect of Memory Latency
• 1 GHz processor (1 ns clock)
• Capable of executing 4 instructions in each 1 ns cycle
• DRAM with latency 100 ns (no caches)
• Memory block: 1 word
• Peak processor rating: 4 GFLOPS
• Dot product of two vectors: peak speed of computation?
• Every operand must be fetched from DRAM, so the processor completes one floating-point operation every 100 ns, i.e. a speed of 10 MFLOPS.
25
Effect of Memory Latency: Introduce Cache
• 1 GHz processor (1 ns clock)
• Capable of executing 4 instructions in each 1 ns cycle
• DRAM with latency 100 ns
• Memory block: 1 word
• Cache: 32 KB with 1 ns latency
• Multiply two matrices A and B of 32x32 words with result in C. (Note: the previous example had no data reuse.)
• Assume ideal cache placement and enough capacity to hold A, B and C.
28
Effect of Memory Latency: Introduce Cache
• Multiply two matrices A and B of 32x32 words with result in C.
• 32x32 = 1K words per matrix.
• Fetching the two input matrices (2K words) requires 2K * 100 ns = 200 µs.
• Multiplying two n x n matrices requires 2n^3 operations = 2*32^3 = 64K operations.
• At 4 operations per cycle we need 64K/4 cycles = 16 µs.
• Total time = 200 + 16 µs = 216 µs.
• Computation rate = 64K operations / 216 µs = 303 MFLOPS.
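The arithmetic above can be reproduced with a few lines of C. This is a sketch of the slide's simplified model only: each word of the two input matrices is fetched from DRAM once, all arithmetic then runs at the peak issue rate, and cache access time is ignored. The function name matmul_mflops and its parameters are hypothetical. With exact values (K = 1024) the model gives about 296 MFLOPS; the slide's 303 MFLOPS comes from rounding the two times to 200 µs and 16 µs.

```c
/* Estimated rate, in MFLOPS, of an n x n matrix multiply under the
   slide's model: fetch 2*n*n words from DRAM, then do 2*n^3 operations
   at ops_per_cycle operations per cycle. */
double matmul_mflops(int n, double dram_latency_ns, double cycle_ns,
                     int ops_per_cycle)
{
    double words      = 2.0 * n * n;                 /* A and B */
    double fetch_ns   = words * dram_latency_ns;     /* one DRAM access per word */
    double ops        = 2.0 * n * n * n;             /* multiply-adds counted as 2 */
    double compute_ns = ops / ops_per_cycle * cycle_ns;

    /* operations per nanosecond is GFLOPS; scale to MFLOPS */
    return ops / (fetch_ns + compute_ns) * 1000.0;
}
```

For the slide's parameters the call is matmul_mflops(32, 100.0, 1.0, 4).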
29
Effect of Memory Bandwidth
• 1 GHz processor (1 ns clock)
• Capable of executing 4 instructions in each 1 ns cycle
• DRAM with latency 100 ns
• Memory block: 4 words
• Cache: 32 KB with 1 ns latency
• Dot product example again
• Bandwidth increased 4-fold
30
Reduce cache misses.
• Spatial locality
• Temporal locality
31
Impact of strided access
for (i=0; i<1000; i++) {
  column_sum[i] = 0.0;
  for(j=0; j<1000; j++)
    column_sum[i]+= b[j][i];   /* walks down column i: stride of one row */
}
32
Eliminating strided access
for (i=0; i<1000; i++)
  column_sum[i] = 0.0;
for(j=0; j<1000; j++)
  for (i=0; i<1000; i++)
    column_sum[i]+= b[j][i];   /* walks along row j: unit stride */
Assumption: Vector column_sum is retained in the cache
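Both loop orders can be written as complete C functions to check that they compute the same sums. This is a sketch: the function names are hypothetical, and the size N is reduced to 256 here to keep the example small, whereas the slides use 1000.

```c
#define N 256   /* assumed size for this sketch; the slides use 1000 */

/* Strided: the inner loop walks down a column of b, touching a
   different row (and usually a different cache line) on every step. */
void column_sum_strided(double b[N][N], double column_sum[N])
{
    int i, j;
    for (i = 0; i < N; i++) {
        column_sum[i] = 0.0;
        for (j = 0; j < N; j++)
            column_sum[i] += b[j][i];
    }
}

/* Unit stride: the inner loop walks along a row of b, which is
   contiguous in memory because C stores arrays row by row. */
void column_sum_row_order(double b[N][N], double column_sum[N])
{
    int i, j;
    for (i = 0; i < N; i++)
        column_sum[i] = 0.0;
    for (j = 0; j < N; j++)
        for (i = 0; i < N; i++)
            column_sum[i] += b[j][i];
}
```

Each column_sum[i] accumulates the same values in the same order in both versions, so the results are identical; only the memory access pattern differs.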
33
do i = 1, N
  do j = 1, N
    A[i] = A[i] + B[j]
  enddo
enddo
N is large, so B[j] cannot remain in cache until it is used again in the next iteration of the outer loop.
There is little reuse between touches.
How many cache misses for A and B?