Page 1:

The FFTE Library and the HPC Challenge (HPCC) Benchmark Suite

Daisuke Takahashi
Center for Computational Sciences / Graduate School of Systems and Information Engineering
University of Tsukuba

Page 2:

Outline
• HPC Challenge (HPCC) Benchmark Suite
  – Overview
  – The Benchmark Tests
  – Example Results
• FFTE: A High-Performance FFT Library
  – Background
  – Related Works
  – Block Six-Step/Nine-Step FFT Algorithm
  – Performance Results
  – Conclusion and Future Work

Page 3:

Overview of the HPC Challenge (HPCC) Benchmark Suite

• HPC Challenge (HPCC) is a suite of tests that examine the performance of HPC architectures using kernels.

• The suite provides benchmarks that bound the performance of many real applications as a function of memory access characteristics, e.g.,
  – Spatial locality
  – Temporal locality

Page 4:

The Benchmark Tests
• The HPC Challenge benchmark consists, at this time, of 7 performance tests:
  – HPL (High Performance Linpack)
  – DGEMM (matrix-matrix multiplication)
  – STREAM (sustainable memory bandwidth)
  – PTRANS (A = A + B^T, parallel matrix transpose)
  – RandomAccess (integer updates to random memory locations)
  – FFT (complex 1-D discrete Fourier transform)
  – b_eff (MPI latency/bandwidth test)
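As a rough illustration of what each kernel stresses, the core operations can be sketched in a few lines of NumPy. This is only a sketch: the real HPCC tests are C/Fortran codes with prescribed problem sizes, verification and timing rules, and b_eff is a pure communication test with no compute kernel.

import numpy as np

n = 1024                                    # toy size, for illustration only
A, B, C = (np.random.rand(n, n) for _ in range(3))
b, c = np.random.rand(n), np.random.rand(n)
alpha = 1.5

C += A @ B                                  # DGEMM: dense matrix-matrix multiply
                                            # (HPL solves a dense Ax = b via LU)
a = b + alpha * c                           # STREAM triad: sustainable bandwidth
A += B.T                                    # PTRANS: A = A + B^T
T = np.zeros(n, dtype=np.uint64)            # RandomAccess: random integer updates
idx = np.random.randint(0, n, size=4 * n)
T[idx] ^= idx.astype(np.uint64)
y = np.fft.fft(np.random.rand(n)            # FFT: complex 1-D DFT
               + 1j * np.random.rand(n))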

Page 5:

Targeted Application Areas in the Memory Access Locality Space

[Figure: the HPCC kernels placed in the memory-access locality plane, with spatial locality and temporal locality as the two axes: PTRANS and STREAM, RandomAccess, FFT, and HPL/DGEMM, together with example application areas such as CFD, Radar X-section, TSP and DSP.]

Page 6:

HPCC Testing Scenarios
• Local (S-STREAM, S-RandomAccess, S-DGEMM, S-FFTE)
  – Only a single MPI process computes.
• Embarrassingly parallel (EP-STREAM, EP-RandomAccess, EP-DGEMM, EP-FFTE)
  – All processes compute and do not communicate (explicitly).
• Global (G-HPL, G-PTRANS, G-RandomAccess, G-FFTE)
  – All processes compute and communicate.
• Network only (RandomRing Bandwidth, etc.)

Page 7:

Sample results page
http://icl.cs.utk.edu/hpcc/hpcc_results.cgi

Page 8:

The winners of the 2006 HPC Challenge Class 1 Awards

• G-HPL: 259 TFlop/s – IBM Blue Gene/L (131072 Procs)
• G-RandomAccess: 35 GUPS – IBM Blue Gene/L (131072 Procs)
• G-FFTE: 2311 GFlop/s – IBM Blue Gene/L (131072 Procs)
• EP-STREAM-Triad (system): 160 TB/s – IBM Blue Gene/L (131072 Procs)

Page 9:

FFTE: A High-Performance FFT Library

• FFTE is a Fortran subroutine library for computing the Fast Fourier Transform (FFT) in one or more dimensions.

• It includes complex, mixed-radix and parallel transforms.
  – Shared / distributed memory parallel computers (OpenMP, MPI and OpenMP + MPI)
• It also supports Intel's SSE2/SSE3 instructions.
• The FFTE library can be obtained from http://www.ffte.jp

Page 10:

Background
• One goal for large FFTs is to minimize the number of cache misses.
• Many FFT algorithms work well when data sets fit into a cache.
• When a problem exceeds the cache size, however, the performance of these FFT algorithms decreases dramatically.
• The conventional six-step FFT algorithm requires
  – Two multicolumn FFTs.
  – Three data transpositions.
  → The chief bottlenecks on cache-based processors.

Page 11:

Related Works
• FFTW [Frigo and Johnson (MIT)]
  – Recursive calls are employed to access main memory hierarchically.
  – This technique is very effective when the total amount of data is not much greater than the cache size.
  – For parallel FFT, the conventional six-step FFT is used.
  – http://www.fftw.org
• SPIRAL [Pueschel et al. (CMU)]
  – The goal of SPIRAL is to push the limits of automation in software and hardware development and optimization for digital signal processing (DSP) algorithms.
  – http://www.spiral.net

Page 12:

Approach
• Some previously presented six-step FFT algorithms separate the multicolumn FFTs from the transpositions.
• Taking the opposite approach, we combine the multicolumn FFTs and transpositions to reduce the number of cache misses.
• We modify the conventional six-step FFT algorithm to reuse data in the cache memory.
  → We will call it a "block six-step FFT".

Page 13:

Discrete Fourier Transform (DFT)

• DFT is given by

y(k) = \sum_{j=0}^{n-1} x(j)\, \omega_n^{jk}, \qquad 0 \le k \le n-1, \quad \omega_n = \exp(-2\pi i / n)
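Written out directly, this definition gives an O(n^2) reference computation that any FFT must reproduce. A minimal NumPy sketch (illustrative only, not part of FFTE):

import numpy as np

def dft(x):
    # y(k) = sum_{j=0}^{n-1} x(j) * omega_n^(j*k), with omega_n = exp(-2*pi*i/n)
    n = len(x)
    j = np.arange(n)
    omega = np.exp(-2j * np.pi / n)
    return np.array([np.sum(x * omega ** (j * k)) for k in range(n)])

x = np.random.rand(16) + 1j * np.random.rand(16)
assert np.allclose(dft(x), np.fft.fft(x))   # agrees with a library FFT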

Page 14:

2-D Formulation

• If n has factors n1 and n2 (n = n1 n2) then, writing j = j1 + j2 n1 and k = k2 + k1 n2,

y(k_1, k_2) = \sum_{j_2=0}^{n_2-1} \sum_{j_1=0}^{n_1-1} x(j_1, j_2)\, \omega_{n_2}^{j_2 k_2}\, \omega_{n_1 n_2}^{j_1 k_2}\, \omega_{n_1}^{j_1 k_1}

Page 15:

Six-Step FFT Algorithm

[Figure: the data, viewed as an n1 × n2 array, is processed by two multicolumn FFT stages (n2 individual n1-point FFTs and n1 individual n2-point FFTs) interleaved with three transposes.]
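A compact NumPy sketch of these six steps, using the 2-D formulation of the previous slide (j = j1 + j2*n1, k = k2 + k1*n2). This only mirrors the data flow; it is not FFTE's Fortran implementation.

import numpy as np

def six_step_fft(x, n1, n2):
    """Six-step FFT for n = n1*n2, written to mirror the six steps literally."""
    n = n1 * n2
    X = x.reshape(n2, n1)                 # x(j1 + j2*n1) -> X[j2, j1]
    # Step 1: transpose so each row holds one length-n2 sub-sequence
    X = X.T                               # X[j1, j2]
    # Step 2: n1 individual n2-point FFTs (over j2)
    X = np.fft.fft(X, axis=1)             # X[j1, k2]
    # Step 3: twiddle-factor multiplication by omega_n^(j1*k2)
    j1 = np.arange(n1).reshape(n1, 1)
    k2 = np.arange(n2).reshape(1, n2)
    X *= np.exp(-2j * np.pi * j1 * k2 / n)
    # Step 4: transpose
    X = X.T                               # X[k2, j1]
    # Step 5: n2 individual n1-point FFTs (over j1)
    X = np.fft.fft(X, axis=1)             # X[k2, k1]
    # Step 6: transpose and flatten to y(k2 + k1*n2)
    return X.T.ravel()

x = np.random.rand(64) + 1j * np.random.rand(64)
assert np.allclose(six_step_fft(x, 8, 8), np.fft.fft(x))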

Page 16:

Block Six-Step FFT Algorithm

[Figure: the n1 × n2 array is processed nB columns at a time; the first two transposes become partial transposes fused with the column FFTs, leaving one full transpose.]
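The blocking idea can be sketched by fusing the partial transpose, the column FFTs and the twiddle multiplication over nB columns at a time, so each block's working set can stay cache-resident. This is only an illustration of the approach on Page 12 (assuming nB divides n1); FFTE's actual blocked Fortran code differs in detail.

import numpy as np

def block_six_step_fft(x, n1, n2, nb):
    """Blocked six-step FFT: the first half of the algorithm is done nb columns
    at a time so that each block can be reused while it is still in cache."""
    n = n1 * n2
    X = x.reshape(n2, n1)                        # X[j2, j1]
    work = np.empty((n1, n2), dtype=complex)     # will hold X[j1, k2]
    k2 = np.arange(n2)
    for j1 in range(0, n1, nb):                  # loop over nb-column blocks
        blk = X[:, j1:j1 + nb].T                 # partial transpose of one block
        blk = np.fft.fft(blk, axis=1)            # nb individual n2-point FFTs
        jj = np.arange(j1, j1 + nb).reshape(-1, 1)
        work[j1:j1 + nb, :] = blk * np.exp(-2j * np.pi * jj * k2 / n)  # twiddle
    # remaining steps as in the plain six-step FFT
    Y = np.fft.fft(work.T, axis=1)               # n2 individual n1-point FFTs
    return Y.T.ravel()

x = np.random.rand(1 << 10) + 1j * np.random.rand(1 << 10)
assert np.allclose(block_six_step_fft(x, 32, 32, 8), np.fft.fft(x))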

Page 17:

3-D Formulation

• For very large FFTs, we should switch to a 3-D formulation. If n has factors n1, n2 and n3 (n = n1 n2 n3) then, writing j = j1 + j2 n1 + j3 n1 n2 and k = k3 + k2 n3 + k1 n2 n3,

y(k_1, k_2, k_3) = \sum_{j_3=0}^{n_3-1} \sum_{j_2=0}^{n_2-1} \sum_{j_1=0}^{n_1-1} x(j_1, j_2, j_3)\, \omega_{n_3}^{j_3 k_3}\, \omega_{n_2 n_3}^{j_2 k_3}\, \omega_{n_2}^{j_2 k_2}\, \omega_{n_1 n_2 n_3}^{j_1 k_3}\, \omega_{n_1 n_2}^{j_1 k_2}\, \omega_{n_1}^{j_1 k_1}

Page 18:

Parallel Block Nine-Step FFT

[Figure: the data, distributed across processes P0–P3, passes through blocked column FFT stages with partial transposes; the global transpose is performed by all-to-all communication.]
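The global transpose in the figure maps onto an all-to-all exchange. A rough mpi4py sketch of such a block-distributed transpose is shown below (run under mpirun); this is illustrative only and not FFTE's communication code.

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
P, rank = comm.Get_size(), comm.Get_rank()

n1, n2 = 8 * P, 4 * P                  # toy global array size, divisible by P
rows, cols = n1 // P, n2 // P          # local rows before / after the transpose

# local block of the global n1 x n2 array (block row distribution)
local = (np.arange(rows * n2).reshape(rows, n2) + 1000 * rank).astype(np.complex128)

# pack: the sub-block destined for process q is this process's slice of
# columns [q*cols, (q+1)*cols); make the P sub-blocks contiguous for Alltoall
send = np.ascontiguousarray(local.reshape(rows, P, cols).swapaxes(0, 1))
recv = np.empty_like(send)             # shape (P, rows, cols)

comm.Alltoall(send, recv)              # the all-to-all exchange in the figure

# unpack: transpose each received piece into this process's block of the
# transposed n2 x n1 array (again block row distributed)
local_T = np.empty((cols, n1), dtype=np.complex128)
for p in range(P):
    local_T[:, p * rows:(p + 1) * rows] = recv[p].T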

Page 19:

Operation Counts for n-point FFT
• Conventional FFT algorithms (e.g., Cooley-Tukey FFT, Stockham FFT)
  – Arithmetic operations: 5n log2 n
  – Main memory accesses: 4n log2 n
• Block Nine-Step FFT
  – Arithmetic operations: 5n log2 n
  – Main memory accesses (ideal case): 12n
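As a worked illustration (using n = 2^24, the per-node size in the measurements that follow): both approaches perform 5n log2 n = 5 · 2^24 · 24 ≈ 2.0 × 10^9 arithmetic operations, but the conventional algorithms make about 4n log2 n ≈ 1.6 × 10^9 main memory accesses versus 12n ≈ 2.0 × 10^8 for the block nine-step FFT, roughly an 8× reduction (4 log2 n / 12 = 8 at this size).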

Page 20:

Performance Results
• To evaluate the implemented parallel FFTs, we compared
  – The implemented parallel FFT, named FFTE (ver. 4.0, supports SSE3, using MPI)
  – FFTW (ver. 2.1.5, no SSE3 support, using MPI)
• Target parallel machine:
  – A 32-node dual PC SMP cluster (Irwindale 3 GHz, 1 GB DDR2-400 SDRAM / node, Linux 2.4.17-1smp).
  – Interconnected through a Gigabit Ethernet switch.
  – LAM/MPI 7.1.1 was used as the communication library.
  – The compilers used were gcc 4.0.2 and g77 3.2.3.

Page 21:

Performance of parallel FFTs (Xeon PC cluster, N = 2^24 × P)

[Figure: GFLOPS vs. number of nodes (1, 2, 4, 8, 16, 32) for FFTE 4.0 (SSE3), FFTE 4.0 (x87) and FFTW 2.1.5.]

Page 22:

Breakdown of parallel FFTs (Xeon PC cluster, N = 2^24 × P)

[Figure: execution time in seconds vs. number of nodes (1, 2, 4, 8, 16, 32), broken down into computation and communication.]

Page 23:

Discussion
• For N = 2^29 and P = 32, FFTE runs about 1.72 times faster than FFTW.
  – The performance of FFTE remains at a high level even for the larger problem sizes, owing to cache blocking.
  – Since FFTW uses the conventional six-step FFT, each column FFT does not fit into the L1 data cache.
  – Moreover, FFTE exploits the SSE3 instructions.
• These are the three reasons why FFTE outperforms FFTW.

Page 24:

Conclusion and Future Work
• The block nine-step FFT algorithm is most advantageous on processors that have a considerable gap between the speed of the cache memory and that of the main memory.
• Towards Petascale computing systems:
  – Exploiting multi-level parallelism:
    • SIMD or vector accelerator
    • Multi-core
    • Multi-socket
    • Multi-node
  – Reducing the number of main memory accesses.
  – Improving the all-to-all communication performance.
• In G-FFTE, the all-to-all communication occurs three times.

