2007/11/2 First French-Japanese PAAP Workshop
The FFTE Library and the HPC Challenge (HPCC) Benchmark Suite
Daisuke Takahashi
Center for Computational Sciences / Graduate School of Systems and Information Engineering, University of Tsukuba
Outline
• HPC Challenge (HPCC) Benchmark Suite
– Overview
– The Benchmark Tests
– Example Results
• FFTE: A High-Performance FFT Library
– Background
– Related Work
– Block Six-Step/Nine-Step FFT Algorithm
– Performance Results
– Conclusion and Future Work
Overview of the HPC Challenge (HPCC) Benchmark Suite
• HPC Challenge (HPCC) is a suite of tests that examine the performance of HPC architectures using kernels.
• The suite provides benchmarks that bound the performance of many real applications as a function of memory access characteristics, e.g.:
– Spatial locality
– Temporal locality
The Benchmark Tests
• The HPC Challenge benchmark currently consists of 7 performance tests:
– HPL (High Performance Linpack)
– DGEMM (matrix-matrix multiplication)
– STREAM (sustainable memory bandwidth)
– PTRANS (A = A + B^T, parallel matrix transpose)
– RandomAccess (integer updates to random memory locations)
– FFT (complex 1-D discrete Fourier transform)
– b_eff (MPI latency/bandwidth test)
Targeted Application Areas in the Memory Access Locality Space
[Figure: the benchmarks and example application areas (CFD, radar cross-section, TSP, DSP) plotted in the memory access locality space, with spatial locality on one axis and temporal locality on the other: HPL and DGEMM high in both, PTRANS and STREAM high in spatial locality only, FFT high in temporal locality only, and RandomAccess low in both.]
HPCC Testing Scenarios
• Local (S-STREAM, S-RandomAccess, S-DGEMM, S-FFTE)
– Only a single MPI process computes.
• Embarrassingly parallel (EP-STREAM, EP-RandomAccess, EP-DGEMM, EP-FFTE)
– All processes compute and do not communicate (explicitly).
• Global (G-HPL, G-PTRANS, G-RandomAccess, G-FFTE)
– All processes compute and communicate.
• Network only (RandomRing Bandwidth, etc.)
Sample results page: http://icl.cs.utk.edu/hpcc/hpcc_results.cgi
The Winners of the 2006 HPC Challenge Class 1 Awards
• G-HPL: 259 TFlop/s – IBM Blue Gene/L (131072 procs)
• G-RandomAccess: 35 GUPS – IBM Blue Gene/L (131072 procs)
• G-FFTE: 2311 GFlop/s – IBM Blue Gene/L (131072 procs)
• EP-STREAM-Triad (system): 160 TB/s – IBM Blue Gene/L (131072 procs)
FFTE: A High-Performance FFT Library
• FFTE is a Fortran subroutine library for computing the Fast Fourier Transform (FFT) in one or more dimensions.
• It includes complex, mixed-radix, and parallel transforms.
– Shared- and distributed-memory parallel computers (OpenMP, MPI, and OpenMP + MPI)
• It also supports Intel's SSE2/SSE3 instructions.
• The FFTE library can be obtained from
http://www.ffte.jp
Background
• One goal for large FFTs is to minimize the number of cache misses.
• Many FFT algorithms work well when data sets fit into a cache.
• When a problem exceeds the cache size, however, the performance of these FFT algorithms decreases dramatically.
• The conventional six-step FFT algorithm requires:
– Two multicolumn FFTs.
– Three data transpositions.
→ These are the chief bottlenecks on cache-based processors.
Related Work
• FFTW [Frigo and Johnson (MIT)]
– Recursive calls are employed to access main memory hierarchically.
– This technique is very effective when the total amount of data is not much larger than the cache size.
– For parallel FFT, the conventional six-step FFT is used.
– http://www.fftw.org
• SPIRAL [Pueschel et al. (CMU)]
– The goal of SPIRAL is to push the limits of automation in software and hardware development and optimization for digital signal processing (DSP) algorithms.
– http://www.spiral.net
Approach
• Some previously presented six-step FFT algorithms separate the multicolumn FFTs from the transpositions.
• Taking the opposite approach, we combine the multicolumn FFTs and transpositions to reduce the number of cache misses.
• We modify the conventional six-step FFT algorithm to reuse data in the cache memory.
→ We call this a "block six-step FFT".
Discrete Fourier Transform (DFT)
• The DFT is given by

$y(k) = \sum_{j=0}^{n-1} x(j)\,\omega_n^{jk}, \quad 0 \le k \le n-1, \quad \omega_n = \exp(-2\pi i/n)$
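The definition above can be checked directly with a straightforward O(n^2) implementation (a sketch in plain Python using only `cmath`; the function name `dft` is illustrative, not part of FFTE):

```python
import cmath

def dft(x):
    """Direct O(n^2) DFT: y(k) = sum_j x(j) * omega_n^(j*k),
    with omega_n = exp(-2*pi*i/n)."""
    n = len(x)
    omega = cmath.exp(-2j * cmath.pi / n)
    return [sum(x[j] * omega ** (j * k) for j in range(n))
            for k in range(n)]

# Example: the DFT of a unit impulse is the all-ones vector.
y = dft([1, 0, 0, 0])
```

Fast algorithms such as those in FFTE compute exactly this result in O(n log n) operations.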
2-D Formulation
• If $n$ has factors $n_1$ and $n_2$ ($n = n_1 n_2$), then, with $j = j_1 + j_2 n_1$ and $k = k_2 + k_1 n_2$,

$y(k_1, k_2) = \sum_{j_1=0}^{n_1-1} \left[ \left( \sum_{j_2=0}^{n_2-1} x(j_1, j_2)\,\omega_{n_2}^{j_2 k_2} \right) \omega_{n_1 n_2}^{j_1 k_2} \right] \omega_{n_1}^{j_1 k_1}$
Six-Step FFT Algorithm
[Figure: the length-$n$ input viewed as an $n_1 \times n_2$ array passes through two multicolumn FFT stages ($n_2$-point FFTs and $n_1$-point FFTs) separated by three transpose steps.]
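The algebra of the 2-D formulation can be spelled out as a sketch in plain Python (this illustrates the structure of the six-step algorithm, not FFTE's Fortran implementation; `dft` stands in for any column FFT routine):

```python
import cmath

def dft(x):
    """Direct DFT, standing in for a fast column FFT routine."""
    n = len(x)
    w = cmath.exp(-2j * cmath.pi / n)
    return [sum(x[j] * w ** (j * k) for j in range(n)) for k in range(n)]

def six_step_fft(x, n1, n2):
    """Six-step FFT of x (length n1*n2), input indexed j = j1 + j2*n1,
    output indexed k = k2 + k1*n2."""
    n = n1 * n2
    w = cmath.exp(-2j * cmath.pi / n)
    # View x as an n1 x n2 array: a[j1][j2] = x[j1 + j2*n1].
    a = [[x[j1 + j2 * n1] for j2 in range(n2)] for j1 in range(n1)]
    # n1 individual n2-point FFTs (one per row j1).
    a = [dft(row) for row in a]
    # Multiply by the twiddle factors omega_n^(j1*k2).
    a = [[a[j1][k2] * w ** (j1 * k2) for k2 in range(n2)]
         for j1 in range(n1)]
    # Transpose to an n2 x n1 array.
    a = [[a[j1][k2] for j1 in range(n1)] for k2 in range(n2)]
    # n2 individual n1-point FFTs (one per row k2).
    a = [dft(row) for row in a]
    # Final transpose/reorder: y[k2 + k1*n2] = a[k2][k1].
    return [a[k2][k1] for k1 in range(n1) for k2 in range(n2)]
```

Comparing `six_step_fft(x, n1, n2)` against `dft(x)` for a small `x` confirms the decomposition.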
Block Six-Step FFT Algorithm
[Figure: the same $n_1 \times n_2$ decomposition, but the column FFTs are performed $n_B$ columns at a time; each block of $n_B$ columns is combined with partial transposes so that it stays resident in cache, followed by a final full transpose.]
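The cache-blocking idea behind the partial transposes can be illustrated with a blocked matrix transpose (a toy sketch with an illustrative function name; FFTE additionally fuses the column FFTs into this blocked traversal):

```python
def blocked_transpose(a, nb):
    """Transpose the square matrix a (list of lists) tile by tile.
    Each nb x nb tile is read and written while it fits in cache,
    instead of striding across the whole matrix row by row."""
    n = len(a)
    b = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, nb):          # tile row
        for jj in range(0, n, nb):      # tile column
            for i in range(ii, min(ii + nb, n)):
                for j in range(jj, min(jj + nb, n)):
                    b[j][i] = a[i][j]
    return b
```

With `nb` chosen so that two tiles fit in cache, each matrix element is moved between main memory and cache only once, which is the effect the block six-step FFT exploits.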
3-D Formulation
• For very large FFTs, we should switch to a 3-D formulation. If $n$ has factors $n_1$, $n_2$ and $n_3$ ($n = n_1 n_2 n_3$), then

$y(k_1, k_2, k_3) = \sum_{j_1=0}^{n_1-1} \sum_{j_2=0}^{n_2-1} \sum_{j_3=0}^{n_3-1} x(j_1, j_2, j_3)\,\omega_{n_3}^{j_3 k_3} \omega_{n_2 n_3}^{j_2 k_3} \omega_{n_2}^{j_2 k_2} \omega_{n_1 n_2 n_3}^{j_1 k_3} \omega_{n_1 n_2}^{j_1 k_2} \omega_{n_1}^{j_1 k_1}$
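The 3-D identity can be verified numerically for a small size (a sketch assuming the index maps $j = j_1 + j_2 n_1 + j_3 n_1 n_2$ and $k = k_3 + k_2 n_3 + k_1 n_2 n_3$, which are consistent with the 2-D formulation above):

```python
import cmath

def w(n, e):
    """Twiddle factor omega_n^e = exp(-2*pi*i*e/n)."""
    return cmath.exp(-2j * cmath.pi * e / n)

n1, n2, n3 = 2, 2, 2
n = n1 * n2 * n3
x = [complex(t + 1, -t) for t in range(n)]

# Reference: direct DFT, y[k] = sum_j x[j] * omega_n^(j*k).
ref = [sum(x[j] * w(n, j * k) for j in range(n)) for k in range(n)]

# Triple-sum factorization with the six twiddle factors from the slide.
for k1 in range(n1):
    for k2 in range(n2):
        for k3 in range(n3):
            s = 0
            for j1 in range(n1):
                for j2 in range(n2):
                    for j3 in range(n3):
                        s += (x[j1 + j2 * n1 + j3 * n1 * n2]
                              * w(n3, j3 * k3) * w(n2 * n3, j2 * k3)
                              * w(n2, j2 * k2) * w(n, j1 * k3)
                              * w(n1 * n2, j1 * k2) * w(n1, j1 * k1))
            k = k3 + k2 * n3 + k1 * n2 * n3
            assert abs(s - ref[k]) < 1e-9
```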
Parallel Block Nine-Step FFT
[Figure: the $n_1 \times n_2 \times n_3$ data set is distributed across processors $P_0$–$P_3$; FFT stages over each dimension, performed in blocks of size $n_B$, are interleaved with partial transposes and all-to-all communication to redistribute the data.]
Operation Counts for n-point FFT
• Conventional FFT algorithms (e.g., Cooley-Tukey FFT, Stockham FFT)
– Arithmetic operations: $5n \log_2 n$
– Main memory accesses: $4n \log_2 n$
• Block Nine-Step FFT
– Arithmetic operations: $5n \log_2 n$
– Main memory accesses (ideal case): $12n$
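The counts above imply that the arithmetic work is identical, while the main-memory traffic differs by a factor of $\log_2(n)/3$. A quick arithmetic check (a sketch; $n = 2^{24}$ as in the measurements that follow):

```python
import math

n = 2 ** 24

# Main memory accesses per the operation counts above.
conventional = 4 * n * math.log2(n)   # conventional FFT: 4*n*log2(n)
blocked = 12 * n                      # block nine-step FFT (ideal): 12*n

# The traffic ratio reduces to log2(n)/3, i.e. 24/3 = 8 for n = 2^24.
ratio = conventional / blocked
```

So for this problem size the block nine-step FFT moves roughly 8x less data through main memory, which is where its advantage on cache-based processors comes from.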
Performance Results
• To evaluate the implemented parallel FFTs, we compared:
– The implemented parallel FFT, named FFTE (ver. 4.0, supports SSE3, using MPI)
– FFTW (ver. 2.1.5, does not support SSE3, using MPI)
• Target parallel machine:
– A 32-node dual PC SMP cluster (Irwindale 3 GHz, 1 GB DDR2-400 SDRAM per node, Linux 2.4.17-1smp).
– Interconnected through a Gigabit Ethernet switch.
– LAM/MPI 7.1.1 was used as the communication library.
– The compilers used were gcc 4.0.2 and g77 3.2.3.
[Figure: Performance of parallel FFTs on the Xeon PC cluster, N = 2^24 × P. GFLOPS (0–15) versus number of nodes (1, 2, 4, 8, 16, 32) for FFTE 4.0 (SSE3), FFTE 4.0 (x87), and FFTW 2.1.5.]
[Figure: Breakdown of parallel FFT time on the Xeon PC cluster, N = 2^24 × P. Computation and communication time in seconds (0–7) versus number of nodes (1, 2, 4, 8, 16, 32).]
Discussion
• For N = 2^29 and P = 32, FFTE runs about 1.72 times faster than FFTW.
– The performance of FFTE remains at a high level even for the larger problem size, owing to cache blocking.
– Since FFTW uses the conventional six-step FFT, each column FFT does not fit into the L1 data cache.
– Moreover, FFTE exploits the SSE3 instructions.
• These are the three reasons why FFTE outperforms FFTW.
Conclusion and Future Work
• The block nine-step FFT algorithm is most advantageous on processors that have a considerable gap between the speed of the cache memory and that of the main memory.
• Towards petascale computing systems:
– Exploiting multi-level parallelism:
• SIMD or vector accelerator
• Multi-core
• Multi-socket
• Multi-node
– Reducing the number of main memory accesses.
– Improving the all-to-all communication performance.
• In G-FFTE, the all-to-all communication occurs three times.