2007/11/2 First French-Japanese PAAP Workshop
The FFTE Library and the HPC Challenge (HPCC) Benchmark Suite
Daisuke Takahashi
Center for Computational Sciences / Graduate School of Systems and Information Engineering, University of Tsukuba
Outline
• HPC Challenge (HPCC) Benchmark Suite
– Overview
– The Benchmark Tests
– Example Results
• FFTE: A High-Performance FFT Library
– Background
– Related Work
– Block Six-Step/Nine-Step FFT Algorithm
– Performance Results
– Conclusion and Future Work
Overview of the HPC Challenge (HPCC) Benchmark Suite
• HPC Challenge (HPCC) is a suite of tests that examine the performance of HPC architectures using kernels.
• The suite provides benchmarks that bound the performance of many real applications as a function of memory access characteristics, e.g.:
– Spatial locality
– Temporal locality
The Benchmark Tests
• The HPC Challenge benchmark currently consists of 7 performance tests:
– HPL (High Performance Linpack)
– DGEMM (matrix-matrix multiplication)
– STREAM (sustainable memory bandwidth)
– PTRANS (A = A + B^T, parallel matrix transpose)
– RandomAccess (integer updates to random memory locations)
– FFT (complex 1-D discrete Fourier transform)
– b_eff (MPI latency/bandwidth test)
Targeted Application Areas in the Memory Access Locality Space
[Figure: the benchmarks and example application areas (CFD, radar cross-section, TSP, DSP) plotted in the memory access locality space, with spatial locality on one axis and temporal locality on the other: HPL and DGEMM high in both, PTRANS and STREAM high in spatial locality only, FFT high in temporal locality only, and RandomAccess low in both.]
HPCC Testing Scenarios
• Local (S-STREAM, S-RandomAccess, S-DGEMM, S-FFTE)
– Only a single MPI process computes.
• Embarrassingly parallel (EP-STREAM, EP-RandomAccess, EP-DGEMM, EP-FFTE)
– All processes compute and do not communicate (explicitly).
• Global (G-HPL, G-PTRANS, G-RandomAccess, G-FFTE)
– All processes compute and communicate.
• Network only (RandomRing Bandwidth, etc.)
Sample results page: http://icl.cs.utk.edu/hpcc/hpcc_results.cgi
The Winners of the 2006 HPC Challenge Class 1 Awards
• G-HPL: 259 TFlop/s – IBM Blue Gene/L (131072 procs)
• G-RandomAccess: 35 GUPS – IBM Blue Gene/L (131072 procs)
• G-FFTE: 2311 GFlop/s – IBM Blue Gene/L (131072 procs)
• EP-STREAM-Triad (system): 160 TB/s – IBM Blue Gene/L (131072 procs)
FFTE: A High-Performance FFT Library
• FFTE is a Fortran subroutine library for computing the Fast Fourier Transform (FFT) in one or more dimensions.
• It includes complex, mixed-radix, and parallel transforms.
– Shared- and distributed-memory parallel computers (OpenMP, MPI, and OpenMP + MPI)
• It also supports Intel's SSE2/SSE3 instructions.
• The FFTE library can be obtained from
http://www.ffte.jp
Background
• One goal for large FFTs is to minimize the number of cache misses.
• Many FFT algorithms work well when data sets fit into a cache.
• When a problem exceeds the cache size, however, the performance of these FFT algorithms decreases dramatically.
• The conventional six-step FFT algorithm requires:
– Two multicolumn FFTs.
– Three data transpositions.
→ These are the chief bottlenecks on cache-based processors.
Related Work
• FFTW [Frigo and Johnson (MIT)]
– Recursive calls are employed to access main memory hierarchically.
– This technique is very effective when the total amount of data is not much larger than the cache size.
– For parallel FFT, the conventional six-step FFT is used.
– http://www.fftw.org
• SPIRAL [Pueschel et al. (CMU)]
– The goal of SPIRAL is to push the limits of automation in software and hardware development and optimization for digital signal processing (DSP) algorithms.
– http://www.spiral.net
Approach
• Some previously presented six-step FFT algorithms separate the multicolumn FFTs from the transpositions.
• Taking the opposite approach, we combine the multicolumn FFTs and transpositions to reduce the number of cache misses.
• We modify the conventional six-step FFT algorithm to reuse data in the cache memory.
→ We call this a "block six-step FFT".
Discrete Fourier Transform (DFT)
• The DFT is given by

$y(k) = \sum_{j=0}^{n-1} x(j)\,\omega_n^{jk}, \quad 0 \le k \le n-1, \quad \omega_n = \exp(-2\pi i/n)$
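The definition above can be checked directly with a straightforward O(n^2) implementation (a sketch in plain Python using only `cmath`; the function name `dft` is illustrative, not part of FFTE):

```python
import cmath

def dft(x):
    """Direct O(n^2) DFT: y(k) = sum_j x(j) * omega_n^(j*k),
    with omega_n = exp(-2*pi*i/n)."""
    n = len(x)
    omega = cmath.exp(-2j * cmath.pi / n)
    return [sum(x[j] * omega ** (j * k) for j in range(n))
            for k in range(n)]

# Example: the DFT of a unit impulse is the all-ones vector.
y = dft([1, 0, 0, 0])
```

Fast algorithms such as those in FFTE compute exactly this result in O(n log n) operations.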
2-D Formulation
• If $n$ has factors $n_1$ and $n_2$ ($n = n_1 n_2$), then, with $j = j_1 + j_2 n_1$ and $k = k_2 + k_1 n_2$,

$y(k_1, k_2) = \sum_{j_1=0}^{n_1-1} \left[ \left( \sum_{j_2=0}^{n_2-1} x(j_1, j_2)\,\omega_{n_2}^{j_2 k_2} \right) \omega_{n_1 n_2}^{j_1 k_2} \right] \omega_{n_1}^{j_1 k_1}$
Six-Step FFT Algorithm
[Figure: the length-$n$ input viewed as an $n_1 \times n_2$ array passes through two multicolumn FFT stages ($n_2$-point FFTs and $n_1$-point FFTs) separated by three transpose steps.]
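The algebra of the 2-D formulation can be spelled out as a sketch in plain Python (this illustrates the structure of the six-step algorithm, not FFTE's Fortran implementation; `dft` stands in for any column FFT routine):

```python
import cmath

def dft(x):
    """Direct DFT, standing in for a fast column FFT routine."""
    n = len(x)
    w = cmath.exp(-2j * cmath.pi / n)
    return [sum(x[j] * w ** (j * k) for j in range(n)) for k in range(n)]

def six_step_fft(x, n1, n2):
    """Six-step FFT of x (length n1*n2), input indexed j = j1 + j2*n1,
    output indexed k = k2 + k1*n2."""
    n = n1 * n2
    w = cmath.exp(-2j * cmath.pi / n)
    # View x as an n1 x n2 array: a[j1][j2] = x[j1 + j2*n1].
    a = [[x[j1 + j2 * n1] for j2 in range(n2)] for j1 in range(n1)]
    # n1 individual n2-point FFTs (one per row j1).
    a = [dft(row) for row in a]
    # Multiply by the twiddle factors omega_n^(j1*k2).
    a = [[a[j1][k2] * w ** (j1 * k2) for k2 in range(n2)]
         for j1 in range(n1)]
    # Transpose to an n2 x n1 array.
    a = [[a[j1][k2] for j1 in range(n1)] for k2 in range(n2)]
    # n2 individual n1-point FFTs (one per row k2).
    a = [dft(row) for row in a]
    # Final transpose/reorder: y[k2 + k1*n2] = a[k2][k1].
    return [a[k2][k1] for k1 in range(n1) for k2 in range(n2)]
```

Comparing `six_step_fft(x, n1, n2)` against `dft(x)` for a small `x` confirms the decomposition.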
Block Six-Step FFT Algorithm
[Figure: the same $n_1 \times n_2$ decomposition, but the column FFTs are performed $n_B$ columns at a time; each block of $n_B$ columns is combined with partial transposes so that it stays resident in cache, followed by a final full transpose.]
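The cache-blocking idea behind the partial transposes can be illustrated with a blocked matrix transpose (a toy sketch with an illustrative function name; FFTE additionally fuses the column FFTs into this blocked traversal):

```python
def blocked_transpose(a, nb):
    """Transpose the square matrix a (list of lists) tile by tile.
    Each nb x nb tile is read and written while it fits in cache,
    instead of striding across the whole matrix row by row."""
    n = len(a)
    b = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, nb):          # tile row
        for jj in range(0, n, nb):      # tile column
            for i in range(ii, min(ii + nb, n)):
                for j in range(jj, min(jj + nb, n)):
                    b[j][i] = a[i][j]
    return b
```

With `nb` chosen so that two tiles fit in cache, each matrix element is moved between main memory and cache only once, which is the effect the block six-step FFT exploits.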
3-D Formulation
• For very large FFTs, we should switch to a 3-D formulation. If $n$ has factors $n_1$, $n_2$ and $n_3$ ($n = n_1 n_2 n_3$), then

$y(k_1, k_2, k_3) = \sum_{j_1=0}^{n_1-1} \sum_{j_2=0}^{n_2-1} \sum_{j_3=0}^{n_3-1} x(j_1, j_2, j_3)\,\omega_{n_3}^{j_3 k_3} \omega_{n_2 n_3}^{j_2 k_3} \omega_{n_2}^{j_2 k_2} \omega_{n_1 n_2 n_3}^{j_1 k_3} \omega_{n_1 n_2}^{j_1 k_2} \omega_{n_1}^{j_1 k_1}$
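The 3-D identity can be verified numerically for a small size (a sketch assuming the index maps $j = j_1 + j_2 n_1 + j_3 n_1 n_2$ and $k = k_3 + k_2 n_3 + k_1 n_2 n_3$, which are consistent with the 2-D formulation above):

```python
import cmath

def w(n, e):
    """Twiddle factor omega_n^e = exp(-2*pi*i*e/n)."""
    return cmath.exp(-2j * cmath.pi * e / n)

n1, n2, n3 = 2, 2, 2
n = n1 * n2 * n3
x = [complex(t + 1, -t) for t in range(n)]

# Reference: direct DFT, y[k] = sum_j x[j] * omega_n^(j*k).
ref = [sum(x[j] * w(n, j * k) for j in range(n)) for k in range(n)]

# Triple-sum factorization with the six twiddle factors from the slide.
for k1 in range(n1):
    for k2 in range(n2):
        for k3 in range(n3):
            s = 0
            for j1 in range(n1):
                for j2 in range(n2):
                    for j3 in range(n3):
                        s += (x[j1 + j2 * n1 + j3 * n1 * n2]
                              * w(n3, j3 * k3) * w(n2 * n3, j2 * k3)
                              * w(n2, j2 * k2) * w(n, j1 * k3)
                              * w(n1 * n2, j1 * k2) * w(n1, j1 * k1))
            k = k3 + k2 * n3 + k1 * n2 * n3
            assert abs(s - ref[k]) < 1e-9
```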
Parallel Block Nine-Step FFT
[Figure: the $n_1 \times n_2 \times n_3$ data set is distributed across processors $P_0$–$P_3$; FFT stages over each dimension, performed in blocks of size $n_B$, are interleaved with partial transposes and all-to-all communication to redistribute the data.]
Operation Counts for n-point FFT
• Conventional FFT algorithms (e.g., Cooley-Tukey FFT, Stockham FFT)
– Arithmetic operations: $5n \log_2 n$
– Main memory accesses: $4n \log_2 n$
• Block Nine-Step FFT
– Arithmetic operations: $5n \log_2 n$
– Main memory accesses (ideal case): $12n$
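The counts above imply that the arithmetic work is identical, while the main-memory traffic differs by a factor of $\log_2(n)/3$. A quick arithmetic check (a sketch; $n = 2^{24}$ as in the measurements that follow):

```python
import math

n = 2 ** 24

# Main memory accesses per the operation counts above.
conventional = 4 * n * math.log2(n)   # conventional FFT: 4*n*log2(n)
blocked = 12 * n                      # block nine-step FFT (ideal): 12*n

# The traffic ratio reduces to log2(n)/3, i.e. 24/3 = 8 for n = 2^24.
ratio = conventional / blocked
```

So for this problem size the block nine-step FFT moves roughly 8x less data through main memory, which is where its advantage on cache-based processors comes from.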
Performance Results
• To evaluate the implemented parallel FFTs, we compared:
– The implemented parallel FFT, named FFTE (ver. 4.0, supports SSE3, using MPI)
– FFTW (ver. 2.1.5, does not support SSE3, using MPI)
• Target parallel machine:
– A 32-node dual PC SMP cluster (Irwindale 3 GHz, 1 GB DDR2-400 SDRAM per node, Linux 2.4.17-1smp).
– Interconnected through a Gigabit Ethernet switch.
– LAM/MPI 7.1.1 was used as the communication library.
– The compilers used were gcc 4.0.2 and g77 3.2.3.
[Figure: Performance of parallel FFTs on the Xeon PC cluster, N = 2^24 × P. GFLOPS (0–15) versus number of nodes (1, 2, 4, 8, 16, 32) for FFTE 4.0 (SSE3), FFTE 4.0 (x87), and FFTW 2.1.5.]
[Figure: Breakdown of parallel FFT time on the Xeon PC cluster, N = 2^24 × P. Computation and communication time in seconds (0–7) versus number of nodes (1, 2, 4, 8, 16, 32).]
Discussion
• For N = 2^29 and P = 32, FFTE runs about 1.72 times faster than FFTW.
– The performance of FFTE remains at a high level even for the larger problem size, owing to cache blocking.
– Since FFTW uses the conventional six-step FFT, each column FFT does not fit into the L1 data cache.
– Moreover, FFTE exploits the SSE3 instructions.
• These are the three reasons why FFTE outperforms FFTW.
Conclusion and Future Work
• The block nine-step FFT algorithm is most advantageous on processors that have a considerable gap between the speed of the cache memory and that of the main memory.
• Towards petascale computing systems:
– Exploiting multi-level parallelism:
• SIMD or vector accelerator
• Multi-core
• Multi-socket
• Multi-node
– Reducing the number of main memory accesses.
– Improving the all-to-all communication performance.
• In G-FFTE, the all-to-all communication occurs three times.