Review of XT6 Architecture
AMD Opteron
Cray Networks
Lustre Basics
Programming Environment
PGI Compiler Basics
The Cray Compiler Environment
Cray Scientific Libraries
Cray Performance Analysis Tools
Optimizations
CPU
Communication
I/O
AMD CPU Architecture
Cray Architecture
Lustre Filesystem Basics
AMD Opteron evolution:
2003: AMD Opteron (K8 core), 130nm SOI, 1MB L2 / no L3, HyperTransport 3x 1.6GT/s, 2x DDR1 300
2005: AMD Opteron (K8 core), 90nm SOI, 1MB L2 / no L3, HyperTransport 3x 1.6GT/s, 2x DDR1 400
2007: "Barcelona" (Greyhound core), 65nm SOI, 512kB L2 / 2MB L3, HyperTransport 3x 2GT/s, 2x DDR2 667
2008: "Shanghai" (Greyhound+ core), 45nm SOI, 512kB L2 / 6MB L3, HyperTransport 3x 4.0GT/s, 2x DDR2 800
2009: "Istanbul" (Greyhound+ core), 45nm SOI, 512kB L2 / 6MB L3, HyperTransport 3x 4.8GT/s, 2x DDR2 800
2010: "Magny-Cours" (Greyhound+ core), 45nm SOI, 512kB L2 / 12MB L3, HyperTransport 4x 6.4GT/s, 4x DDR3 1333
12 cores, 1.7-2.2 GHz, 105.6 Gflops
8 cores, 1.8-2.4 GHz, 76.8 Gflops
Power (ACP): 80 Watts
Stream: 27.5 GB/s
Cache: 12x 64 KB L1, 12x 512 KB L2, 12 MB L3
[Diagram: Magny-Cours die pair, 12 cores (Core 0 to Core 11), each with a private L2 cache, a shared L3 cache per die, two memory controllers, and multiple HyperTransport links.]
A cache line is 64B
Cache is a “victim cache”
All references go to L1 immediately and get evicted down the caches
A cache line is usually only in one level of cache
Hardware prefetcher detects forward and backward strides through memory
Each core can perform a 128b add and 128b multiply per clock cycle
This requires SSE, packed instructions
“Stride-one vectorization”
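To make this concrete, here is a small illustrative C sketch (not from the original slides; array names and sizes are hypothetical): the first loop is stride-one and is a good candidate for packed SSE code, while the second wastes most of each 64B cache line and typically will not vectorize profitably.

#define N 1024

void axpy_stride1(double *restrict y, const double *restrict x, double a)
{
    /* Unit stride: consecutive iterations touch consecutive memory,
       so the compiler can emit packed (SSE) adds and multiplies. */
    for (int i = 0; i < N; i++)
        y[i] += a * x[i];
}

void axpy_strided(double *restrict y, const double *restrict x, double a, int stride)
{
    /* Non-unit stride: each iteration jumps 'stride' elements apart,
       defeating packed loads/stores and wasting cache-line bandwidth. */
    for (int i = 0; i < N; i++)
        y[i * stride] += a * x[i * stride];
}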
SeaStar (XT-series)
Gemini (XE-series)
Microkernel on Compute PEs, full featured Linux on Service PEs.
Service PEs specialize by function
Software Architecture eliminates OS “Jitter”
Software Architecture enables reproducible run times
Large machines boot in under 30 minutes, including filesystem
Service Partition: specialized Linux nodes
Compute PE
Login PE
Network PE
System PE
I/O PE
[Diagram: system overview showing compute nodes, login nodes, network nodes with 10 GigE links, boot/syslog/database nodes, I/O and metadata nodes connected over Fibre Channel to a RAID subsystem, the SMW, and the X/Y/Z torus dimensions.]
Cray XT5 systems ship with the SeaStar2+ interconnect
Custom ASIC
Integrated NIC / Router
MPI offload engine
Connectionless Protocol
Link Level Reliability
Proven scalability to 225,000 cores
[Diagram: SeaStar2+ block diagram with HyperTransport interface, memory, PowerPC 440 processor, DMA engine, 6-port router, and blade control processor interface. Now scaled to 225,000 cores.]
Processor        Frequency (GHz)   Peak (Gflops)   Bandwidth (GB/sec)   Balance (bytes/flop)
Istanbul (XT5)   2.6               62.4            12.8                 0.21
MC-8             2.0               64              42.6                 0.67
MC-8             2.3               73.6            42.6                 0.58
MC-8             2.4               76.8            42.6                 0.55
MC-12            1.9               91.2            42.6                 0.47
MC-12            2.1               100.8           42.6                 0.42
MC-12            2.2               105.6           42.6                 0.40
[Diagram: Cray XT6 node, 6.4 GB/sec direct-connect HyperTransport to the Cray SeaStar2+ interconnect, 83.5 GB/sec direct-connect memory.]
Characteristics:
Number of Cores: 16 or 24 (MC), 32 (IL)
Peak Performance, MC-8 (2.4 GHz): 153 Gflops/sec
Peak Performance, MC-12 (2.2 GHz): 211 Gflops/sec
Memory Size: 32 or 64 GB per node
Memory Bandwidth: 83.5 GB/sec
[Diagram: two Magny-Cours multi-chip modules, four 6-core dies ("Greyhound"), each with 6 MB of L3 cache and two DDR3 channels, fully connected with HT3 links; one HT1/HT3 link to the interconnect.]
2 Multi-Chip Modules, 4 Opteron Dies
8 Channels of DDR3 Bandwidth to 8 DIMMs
24 (or 16) Computational Cores, 24 MB of L3 cache
Dies are fully connected with HT3
Snoop Filter Feature Allows 4-Die SMP to scale well
Without the snoop filter, a streams test shows 25 GB/sec out of a possible 51.2 GB/sec, or 48% of peak bandwidth.
With the snoop filter, a streams test shows 42.3 GB/sec out of a possible 51.2 GB/sec, or 82% of peak bandwidth.
This feature will be key for two-socket Magny-Cours nodes, which are architecturally the same.
New compute blade with 8 AMD Magny-Cours processors
Plug-compatible with XT5 cabinets and backplanes
Initially ships with the SeaStar interconnect as the Cray XT6
Upgradeable to the Gemini interconnect as the Cray XE6
Upgradeable to AMD's "Interlagos" series
XT6 systems will continue to ship with the current SIO blade
First customer ship: March 31st
Supports 2 Nodes per ASIC
168 GB/sec routing capacity
Scales to over 100,000 network endpoints
Link Level Reliability and Adaptive Routing
Advanced Resiliency Features
Provides global address space
Advanced NIC designed to efficiently support
MPI
One-sided MPI
Shmem
UPC, Coarray FORTRAN
[Diagram: Gemini ASIC with two HyperTransport 3 links to two processor nodes, NIC 0 and NIC 1, a netlink block, and a 48-port YARC router; 10 12X Gemini channels (each Gemini acts like two nodes on the 3-D torus).]
Cray Baker Node Characteristics:
Number of Cores: 16 or 24
Peak Performance: 140 or 210 Gflops/s
Memory Size: 32 or 64 GB per node
Memory Bandwidth: 85 GB/sec
High-radix YARC router with adaptive routing, 168 GB/sec capacity
[Diagram: compute module with SeaStar vs. module with Gemini; torus dimensions X, Y, Z.]
FMA (Fast Memory Access): mechanism for most MPI transfers; supports tens of millions of MPI requests per second.
BTE (Block Transfer Engine): supports asynchronous block transfers between local and remote memory, in either direction; used for large MPI transfers that proceed in the background (see the sketch below).
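The BTE offload is what makes it worthwhile to overlap large transfers with computation. A minimal MPI sketch of that pattern (illustrative only; buffer names, counts, and neighbor ranks are assumptions, not from the slides):

#include <mpi.h>

/* Post large non-blocking transfers, compute on independent data while the
   offload engine moves the messages, then wait for completion. */
void exchange_and_compute(double *sendbuf, double *recvbuf, int count,
                          int left, int right, MPI_Comm comm)
{
    MPI_Request req[2];

    MPI_Irecv(recvbuf, count, MPI_DOUBLE, left,  0, comm, &req[0]);
    MPI_Isend(sendbuf, count, MPI_DOUBLE, right, 0, comm, &req[1]);

    /* ... compute on data that does not depend on recvbuf ... */

    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
}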
[Diagram: Gemini NIC block diagram with HT3 cave, FMA, BTE, CQ, NPT, RMT, ORB, RAT, NAT, AMO, SSID, LB ring, netlink block, and router tiles, connected by request/response paths on virtual channels vc0/vc1.]
Two Gemini ASICs are packaged on a pin-compatible mezzanine card
Topology is a 3-D torus
Each lane of the torus is composed of 4 Gemini router "tiles"
Systems with SeaStar interconnects can be upgraded by swapping this card
100% of the 48 router tiles on each Gemini chip are used
Like SeaStar, Gemini has a DMA offload engine allowing large transfers to proceed asynchronously
Gemini provides low-overhead OS-bypass features for short transfers: MPI latency is targeted at ~1 us, and the NIC provides for many millions of MPI messages per second
"Hybrid" programming is not a requirement for performance
RDMA provides a much improved one-sided communication mechanism
AMOs provide a faster synchronization method for barriers
Gemini supports adaptive routing, which reduces problems with network hot spots and allows MPI to survive link failures
Globally addressable memory provides efficient support for UPC, Co-Array Fortran, SHMEM and Global Arrays; the Cray Programming Environment will target this capability directly
Pipelined global loads and stores allow for fast irregular communication patterns
Atomic memory operations provide the fast synchronization needed for one-sided communication models
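For flavor, here is a tiny one-sided SHMEM sketch of the kind of communication this globally addressable memory supports (classic SHMEM spellings start_pes/_my_pe/_num_pes are used here; newer OpenSHMEM spells these shmem_init/shmem_my_pe/shmem_n_pes; everything else is illustrative):

#include <shmem.h>
#include <stdio.h>

#define N 8
static long src[N], dst[N];          /* static storage is symmetric on every PE */

int main(void)
{
    start_pes(0);                    /* classic SHMEM initialization */
    int me   = _my_pe();
    int npes = _num_pes();

    for (int i = 0; i < N; i++)
        src[i] = me * 100 + i;

    /* One-sided put: write my src[] into dst[] on the next PE;
       the target PE makes no receive call. */
    shmem_long_put(dst, src, N, (me + 1) % npes);
    shmem_barrier_all();

    printf("PE %d received dst[0]=%ld\n", me, dst[0]);
    return 0;
}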
Gemini will represent a large improvement over SeaStar in terms of reliability and serviceability
Adaptive routing: multiple paths to the same destination
Allows mapping around bad links without rebooting
Supports warm-swap of blades
Prevents hot spots
Reliable transport of messages
Packet-level CRC carried from start to finish
Large blocks of memory protected by ECC
Can better handle failures on the HT link: discards packets instead of putting backpressure into the network
Supports end-to-end reliable communication (used by MPI)
Improved error reporting and handling
The low-overhead error reporting allows the programming model to replay failed transactions
Performance counters allow tracking of app-specific packets
[Diagram: cabinet airflow, high-velocity and low-velocity airflow paths through the evaporators.]
The hot air stream passes through the evaporator and rejects heat to R134a via liquid-vapor phase change (evaporation).
R134a absorbs energy only in the presence of heated air.
Phase change is 10x more efficient than pure water cooling.
Liquid in, liquid/vapor mixture out; cool air is released into the computer room.
[Diagram: R134a piping, inlet and exit evaporators.]
Lustre
32 MB per OST (32 MB - 5 GB) and 32 MB transfer size: unable to take advantage of file system parallelism; access to multiple disks adds overhead which hurts performance.
[Chart: Single Writer Write Performance, Write (MB/s) vs. stripe count (1-160), for 1 MB and 32 MB stripe sizes.]
Lustre
[Chart: Single Writer Transfer vs. Stripe Size, Write (MB/s) vs. stripe size (1-128 MB), for 32 MB, 8 MB, and 1 MB transfer sizes.]
Single OST, 256 MB file size: performance can be limited by the process (transfer size) or by the file system (stripe size).
Use the lfs command, libLUT, or MPI-IO hints to adjust your stripe count and possibly stripe size
lfs setstripe -c -1 -s 4M <file or directory> (160 OSTs, 4 MB stripe)
lfs setstripe -c 1 -s 16M <file or directory> (1 OST, 16 MB stripe)
export MPICH_MPIIO_HINTS='*: striping_factor=160'
Files inherit striping information from the parent directory; this cannot be changed once the file is written
Set the striping before copying in files
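The MPI-IO hints mentioned above can also be set from inside the program when the file is created. A hedged C sketch (striping_factor/striping_unit are the usual ROMIO hint names for Lustre; the values here are purely illustrative):

#include <mpi.h>

/* Open (create) a file with the requested Lustre striping via MPI-IO hints.
   Striping hints only take effect at file creation time. */
MPI_File open_striped(const char *path, MPI_Comm comm)
{
    MPI_Info info;
    MPI_File fh;

    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "160");      /* stripe count */
    MPI_Info_set(info, "striping_unit",   "4194304");  /* 4 MB stripe size */

    MPI_File_open(comm, (char *)path,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    MPI_Info_free(&info);
    return fh;
}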
PGI Compiler
Cray Compiler Environment
Cray Scientific Libraries
Cray XT/XE supercomputers come with compiler wrappers to simplify building parallel applications (similar to mpicc/mpif90)
Fortran Compiler: ftn
C Compiler: cc
C++ Compiler: CC
Using these wrappers ensures that your code is built for the compute nodes and linked against important libraries
Cray MPT (MPI, Shmem, etc.)
Cray LibSci (BLAS, LAPACK, etc.)
…
The underlying compiler is chosen via the PrgEnv-* modules; do not call the PGI, Cray, etc. compilers directly.
Always load the appropriate xtpe-<arch> module for your machine
Enables proper compiler target
Links optimized math libraries
Traditional (scalar) optimizations are controlled via -O# compiler flags
Default: -O2
More aggressive optimizations (including vectorization) are enabled with the -fast or -fastsse metaflags
These translate to: -O2 -Munroll=c:1 -Mnoframe -Mlre -Mautoinline
-Mvect=sse -Mscalarsse -Mcache_align -Mflushz -Mpre
Interprocedural analysis allows the compiler to perform whole-program optimizations. This is enabled with -Mipa=fast
See man pgf90, man pgcc, or man pgCC for more information about compiler options.
Compiler feedback is enabled with -Minfo and -Mneginfo
This can provide valuable information about what optimizations were or were not done and why.
To debug an optimized code, the -gopt flag will insert debugging information without disabling optimizations
It’s possible to disable optimizations included with -fast if you believe one is causing problems
For example: -fast -Mnolre enables -fast and then disables loop redundant optimizations
To get more information about any compiler flag, add -help with the flag in question
pgf90 -help -fast will give more information about the -fast flag
OpenMP is enabled with the -mp flag
Some compiler options may affect both performance and accuracy. Lower accuracy often means higher performance, but the options below can also be used to enforce stricter accuracy.
-Kieee: All FP math strictly conforms to IEEE 754 (off by default)
-Ktrap: Turns on processor trapping of FP exceptions
-Mdaz: Treat all denormalized numbers as zero
-Mflushz: Set SSE to flush-to-zero (on with -fast)
-Mfprelaxed: Allow the compiler to use relaxed (reduced) precision to speed up some floating point optimizations
Some other compilers turn this on by default; PGI chooses to favor accuracy over speed by default.
Cray has a long tradition of high performance compilers on Cray platforms (Traditional vector, T3E, X1, X2)
Vectorization
Parallelization
Code transformation
More…
Investigated leveraging an open source compiler called LLVM
First release December 2008
[Diagram: Cray Compiling Environment architecture, Fortran front end and C/C++ front end feed interprocedural analysis, optimization and parallelization, then an X86 or Cray X2 code generator produces the object file.]
The C and C++ front end is supplied by the Edison Design Group, with Cray-developed code for extensions and interface support.
X86 code generation comes from open-source LLVM, with additional Cray-developed optimizations and interface support.
The remainder is Cray Inc. compiler technology.
Standard-conforming languages and programming models: Fortran 2003, UPC & Co-Array Fortran
Fully optimized and integrated into the compiler
No preprocessor involved
Target the network appropriately:
GASNet with Portals
DMAPP with Gemini & Aries
Ability and motivation to provide high-quality support for custom Cray network hardware
Cray technology focused on scientific applications
Takes advantage of Cray's extensive knowledge of automatic vectorization
Takes advantage of Cray's extensive knowledge of automatic shared memory parallelization
Supplements, rather than replaces, the available compiler choices
Make sure it is available
module avail PrgEnv-cray
To access the Cray compiler
module load PrgEnv-cray
To target the various chips
module load xtpe-[barcelona,shanghai,istanbul]
Once you have loaded the module, "cc" and "ftn" are the Cray compilers
Recommend just using the default options
Use -rm (Fortran) and -hlist=m (C) to find out what happened
man crayftn
Focus:
Excellent vectorization: vectorizes more loops than other compilers
OpenMP 3.0: task and nesting
PGAS: functional UPC and CAF available today
C++ support
Automatic parallelization: modernized version of the Cray X1 streaming capability; interacts with OMP directives
Cache optimizations: automatic blocking, automatic management of what stays in cache, prefetching, interchange, fusion, and much more
Loop-based optimizations: vectorization, OpenMP, autothreading, interchange, pattern matching, cache blocking / non-temporal / prefetching
Fortran 2003 standard; working on 2008
PGAS (UPC and Co-Array Fortran): some performance optimizations available in 7.1
Optimization feedback: Loopmark
Cray compiler supports a full and growing set of directives and pragmas
!dir$ concurrent
!dir$ ivdep
!dir$ interchange
!dir$ unroll
!dir$ loop_info [max_trips] [cache_na] ... Many more
!dir$ blockable
man directives
man loop_info
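In C the equivalent hints are spelled as pragmas. A small sketch, assuming the usual Cray C spelling #pragma _CRI (consult man directives for the authoritative list); the function and its arguments are made up for illustration:

void scale_gather(double *restrict y, const double *restrict x,
                  const int *idx, double a, int n)
{
    /* Assert there is no loop-carried dependence through the gathered
       subscripts, so the compiler is free to vectorize this loop. */
#pragma _CRI ivdep
    for (int i = 0; i < n; i++)
        y[idx[i]] += a * x[i];
}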
The compiler can generate a <filename>.lst file containing an annotated listing of your source code, with letters indicating the important optimizations performed
%%% L o o p m a r k   L e g e n d %%%
Primary Loop Type:
A - Pattern matched
C - Collapsed
D - Deleted
E - Cloned
I - Inlined
M - Multithreaded
P - Parallel/Tasked
V - Vectorized
W - Unwound
Modifiers:
a - vector atomic memory operation
b - blocked
f - fused
i - interchanged
m - streamed but not partitioned
p - conditional, partial and/or computed
r - unrolled
s - shortloop
t - array syntax temp used
w - unwound
• ftn -rm … or cc -hlist=m …
29. b-------< do i3=2,n3-1
30. b b-----< do i2=2,n2-1
31. b b Vr--< do i1=1,n1
32. b b Vr u1(i1) = u(i1,i2-1,i3) + u(i1,i2+1,i3)
33. b b Vr > + u(i1,i2,i3-1) + u(i1,i2,i3+1)
34. b b Vr u2(i1) = u(i1,i2-1,i3-1) + u(i1,i2+1,i3-1)
35. b b Vr > + u(i1,i2-1,i3+1) + u(i1,i2+1,i3+1)
36. b b Vr--> enddo
37. b b Vr--< do i1=2,n1-1
38. b b Vr r(i1,i2,i3) = v(i1,i2,i3)
39. b b Vr > - a(0) * u(i1,i2,i3)
40. b b Vr > - a(2) * ( u2(i1) + u1(i1-1) + u1(i1+1) )
41. b b Vr > - a(3) * ( u2(i1-1) + u2(i1+1) )
42. b b Vr--> enddo
43. b b-----> enddo
44. b-------> enddo
ftn-6289 ftn: VECTOR File = resid.f, Line = 29
A loop starting at line 29 was not vectorized because a recurrence was found on "U1" between lines 32 and 38.
ftn-6049 ftn: SCALAR File = resid.f, Line = 29
A loop starting at line 29 was blocked with block size 4.
ftn-6289 ftn: VECTOR File = resid.f, Line = 30
A loop starting at line 30 was not vectorized because a recurrence was found on "U1" between lines 32 and 38.
ftn-6049 ftn: SCALAR File = resid.f, Line = 30
A loop starting at line 30 was blocked with block size 4.
ftn-6005 ftn: SCALAR File = resid.f, Line = 31
A loop starting at line 31 was unrolled 4 times.
ftn-6204 ftn: VECTOR File = resid.f, Line = 31
A loop starting at line 31 was vectorized.
ftn-6005 ftn: SCALAR File = resid.f, Line = 37
A loop starting at line 37 was unrolled 4 times.
ftn-6204 ftn: VECTOR File = resid.f, Line = 37
A loop starting at line 37 was vectorized.
-hbyteswapio
Link time option
Applies to all unformatted Fortran I/O
Assign command
With the PrgEnv-cray module loaded do this:
setenv FILENV assign.txt
assign -N swap_endian g:su
assign -N swap_endian g:du
Can use assign to be more precise
OpenMP is ON by default
Optimizations controlled by -Othread#
To shut it off use -Othread0, -xomp, or -hnoomp
Autothreading is NOT on by default; use -hautothread to turn it on
Modernized version of the Cray X1 streaming capability
Interacts with OMP directives
If you do not want to use OpenMP and have OMP directives in the code, make sure to do a run with OpenMP shut off at compile time
Traditional model
Tuned general purpose codes
Only good for dense
Not problem sensitive
Not architecture sensitive
Goal of scientific libraries: improve productivity at optimal performance
Cray uses four concentrations to achieve this:
Standardization: use standard or "de facto" standard interfaces whenever available
Hand tuning: use extensive knowledge of the target processor and network to optimize common code patterns
Auto-tuning: automate code generation and a huge number of empirical performance evaluations to configure software to the target platforms
Adaptive libraries: make runtime decisions to choose the best kernel/library/routine
Three separate classes of standardization, each with a corresponding definition of productivity:
1. Standard interfaces (e.g., dense linear algebra): bend over backwards to keep everything the same despite increases in machine complexity; innovate "behind the scenes". Productivity -> innovation to keep things simple.
2. Adoption of near-standard interfaces (e.g., sparse kernels): assume near-standards and promote those; out-mode alternatives; innovate "behind the scenes". Productivity -> innovation in the simplest areas (requires the same innovation as #1 also).
3. Simplification of non-standard interfaces (e.g., FFT): productivity -> innovation to make things simpler than they are.
Algorithmic tuning (LAPACK, ScaLAPACK): increased performance by exploiting algorithmic improvements such as sub-blocking and new algorithms
Kernel tuning (BLAS, FFT): improve the numerical kernel performance in assembly language
Parallel tuning (ScaLAPACK, P-CRAFFT): exploit Cray's custom network interfaces and MPT
Dense: BLAS, LAPACK, ScaLAPACK, IRT
Sparse: CASK, PETSc, Trilinos
FFT: CRAFFT, FFTW, P-CRAFFT
IRT - Iterative Refinement Toolkit
CASK - Cray Adaptive Sparse Kernels
CRAFFT - Cray Adaptive FFT
Serial and parallel versions of sparse iterative linear solvers
Suites of iterative solvers: CG, GMRES, BiCG, QMR, etc.
Suites of preconditioning methods: IC, ILU, diagonal block (ILU/IC), Additive Schwarz, Jacobi, SOR
Support for the block sparse matrix data format for better performance
Interface to external packages (ScaLAPACK, SuperLU_DIST)
Fortran and C support
Newton-type nonlinear solvers
Large user community: DoE labs, PSC, CSCS, CSC, ERDC, AWE and more
http://www-unix.mcs.anl.gov/petsc/petsc-as
Cray provides state-of-the-art scientific computing packages to strengthen the capability of PETSc
Hypre: scalable parallel preconditioners
AMG (very scalable and efficient for a specific class of problems)
2 different ILU (general purpose)
Sparse approximate inverse (general purpose)
ParMetis: parallel graph partitioning package
MUMPS: parallel multifrontal sparse direct solver
SuperLU: sequential version of SuperLU_DIST
To use Cray-PETSc, load the appropriate module:
module load petsc
module load petsc-complex
(no need to load a compiler-specific module)
Treat the Cray distribution as your local PETSc installation
The Trilinos Project http://trilinos.sandia.gov/
“an effort to develop algorithms and enabling technologies within an object-oriented software framework for the solution of large-scale, complex multi-physics engineering and scientific problems”
A unique design feature of Trilinos is its focus on packages.
Very large user-base and growing rapidly. Important to DOE.
Cray’s optimized Trilinos released on January 21
Includes 50+ Trilinos packages
Optimized via CASK
Any code that uses Epetra objects can access the optimizations
Usage :
module load trilinos
CASK is a product developed at Cray using the Cray Auto-tuning Framework (Cray ATF)
The CASK concept:
Analyze the matrix at minimal cost
Categorize the matrix against internal classes
Based on offline experience, find the best CASK code for that particular matrix
Apply the previously assigned "best" compiler flags to the CASK code
Assign the best CASK kernel and perform Ax
CASK silently sits beneath PETSc on Cray systems
Trilinos support coming soon
Released with PETSc 3.0 in February 2009
Generic and blocked CSR formats
[Diagram: software stack. The large-scale application (highly portable, user controlled) sits on PETSc / Trilinos / Hypre (highly portable, user controlled; available on all systems), which sit on CASK (XT4 & XT5 specific and tuned; invisible to the user; Cray only).]
[Chart: speedup of parallel SpMV on 8 cores for 60 different matrices; speedup (1.0-1.4) vs. matrix ID#.]
[Charts: performance of CASK vs. PETSc, GFlops vs. number of cores (0-1024), N = 65,536 to 67,108,864; SpMV (MatMult-CASK vs. MatMult-PETSc) and Block Jacobi preconditioning (BlockJacobi-IC(0)-CASK vs. BlockJacobi-IC(0)-PETSc).]
[Charts: CASK vs. original Trilinos, MFlops by matrix name and MFlops vs. number of vectors (1-8); geometric mean over 80 sparse matrix instances from the U. of Florida collection.]
In FFTs, the problems are:
Which library to choose?
How to use complicated interfaces (e.g., FFTW)
Standard FFT practice:
Do a plan stage: deduce machine and system information and run micro-kernels; select the best FFT strategy
Do an execute
Our system knowledge can remove some of this cost!
CRAFFT is designed with simple-to-use interfaces
Planning and execution stages can be combined into one function call
Underneath the interfaces, CRAFFT calls the appropriate FFT kernel
CRAFFT provides both offline and online tuning
Offline tuning:
Which FFT kernel to use
Pre-computed PLANs for common-sized FFTs
No expensive plan stages
Online tuning is performed as necessary at runtime as well
At runtime, CRAFFT adaptively selects the best FFT kernel to use based on both offline and online testing (e.g., FFTW, custom FFT)
Transform size:  128x128   256x256   512x512
FFTW plan:       74        312       2758
FFTW exec:       0.105     0.97      9.7
CRAFFT plan:     0.00037   0.0009    0.00005
CRAFFT exec:     0.139     1.2       11.4
1. Load module fftw/3.2.0 or higher.
2. Add a Fortran statement "use crafft"
3. call crafft_init()
4. Call the crafft transform using none, some, or all of the optional arguments
In-place, implicit memory management:
call crafft_z2z3d(n1,n2,n3,input,ld_in,ld_in2,isign)
In-place, explicit memory management:
call crafft_z2z3d(n1,n2,n3,input,ld_in,ld_in2,isign,work)
Out-of-place, explicit memory management:
call crafft_z2z3d(n1,n2,n3,input,ld_in,ld_in2,output,ld_out,ld_out2,isign,work)
Note: the user can also control the planning strategy of CRAFFT using the CRAFFT_PLANNING environment variable and the do_exe optional argument; please see the intro_crafft man page.
As of December 2009, CRAFFT includes distributed parallel transforms
Uses the CRAFFT interface prefixed by “p”, with optional arguments
Can provide performance improvement over FFTW 2.1.5
Currently implemented
complex-complex
Real-complex and complex-real
3-d and 2-d
In-place and out-of-place
Upcoming
C language support for serial and parallel
1. Add "use crafft" to the Fortran code
2. Initialize CRAFFT using crafft_init
3. Assume MPI is initialized and data is distributed (see the manpage)
4. Call crafft, e.g.:
2-d complex-complex, in-place, internal memory management:
call crafft_pz2z2d(n1,n2,input,isign,flag,comm)
2-d complex-complex, in-place with no internal memory:
call crafft_pz2z2d(n1,n2,input,isign,flag,comm,work)
2-d complex-complex, out-of-place, internal memory management:
call crafft_pz2z2d(n1,n2,input,output,isign,flag,comm)
2-d complex-complex, out-of-place, no internal memory:
call crafft_pz2z2d(n1,n2,input,output,isign,flag,comm,work)
Each routine above has a manpage. Also see the 3-d equivalent:
man crafft_pz2z3d
[Chart: 2D FFT (N x N, transposed) on 128 cores; MFlops vs. size N from 128 to 65536, parallel CRAFFT (pcrafft) vs. FFTW.]
Solves linear systems in single precision while obtaining solutions accurate to double precision, for well-conditioned problems
Serial and parallel versions of LU, Cholesky, and QR
Two usage methods:
IRT benchmark routines: uses IRT "under the covers" without changing your code; simply set an environment variable; useful when you cannot alter source code
Advanced IRT API: if greater control of the iterative refinement process is required; allows condition number estimation, return of error bounds, minimization of either forward or backward error, "fall back" to full precision if the condition number is too high; the max number of iterations can be altered by users
"High Power Electromagnetic Wave Heating in the ITER Burning Plasma"
RF heating in a tokamak: Maxwell-Boltzmann equations, FFT, dense linear system, calculation of the quasi-linear operator
Courtesy of Richard Barrett
[Chart: performance relative to theoretical peak.]
Decide if you want to use the advanced API or the benchmark API
Benchmark API: setenv IRT_USE_SOLVERS 1
Advanced API:
1. Locate the factor and solve in your code (LAPACK or ScaLAPACK)
2. Replace factor and solve with a call to the IRT routine
e.g. dgesv -> irt_lu_real_serial
e.g. pzgesv -> irt_lu_complex_parallel
e.g. pzposv -> irt_po_complex_parallel
3. Set advanced arguments
Forward error convergence for the most accurate solution
Condition number estimate
"Fall back" to full precision if the condition number is too high
LibSci 10.4.2 (February 18th, 2010)
OpenMP-aware LibSci
Allows calling of BLAS inside or outside a parallel region (see the sketch after this list)
A single library is supported: no separate multi-threaded and single-threaded libraries (-lsci and -lsci_mp)
Performance not compromised
(there were some usage restrictions with this version)
LibSci 10.4.3 (April 2010)
Parallel CRAFFT improvements
Fixes the usage restrictions of 10.4.2
OMP_NUM_THREADS required (not GOTO_NUM_THREADS)
Upcoming:
PETSc 3.1.0 (May 20)
Trilinos 10.2 (May 20)
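As a sketch of what "calling BLAS inside a parallel region" looks like in practice (illustrative C, not from the slides; dgemm_ is the standard Fortran BLAS symbol, so scalars are passed by reference and matrices are column-major):

#include <omp.h>

/* Fortran BLAS prototype. */
void dgemm_(const char *ta, const char *tb, const int *m, const int *n,
            const int *k, const double *alpha, const double *a, const int *lda,
            const double *b, const int *ldb, const double *beta,
            double *c, const int *ldc);

/* Each OpenMP thread multiplies its own block; an OpenMP-aware LibSci can
   safely run these calls single-threaded inside the parallel region. */
void block_multiply(int nblocks, int n, double **A, double **B, double **C)
{
    const double one = 1.0, zero = 0.0;
    #pragma omp parallel for
    for (int b = 0; b < nblocks; b++)
        dgemm_("N", "N", &n, &n, &n, &one, A[b], &n, B[b], &n, &zero, C[b], &n);
}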
CrayPAT
Assist the user with application performance analysis and optimization
Help the user identify important and meaningful information from potentially massive data sets
Help the user identify problem areas instead of just reporting data
Bring optimization knowledge to a wider set of users
Focus on ease of use and intuitive user interfaces: automatic program instrumentation, automatic analysis
Target scalability issues in all areas of tool development: data management (storage, movement, presentation)
Supports traditional post-mortem performance analysis
Automatic identification of performance problems
Indication of causes of problems
Suggestions of modifications for performance improvement
CrayPat
pat_build: automatic instrumentation (no source code changes needed)
run-time library for measurements (transparent to the user)
pat_report for performance analysis reports
pat_help: online help utility
Cray Apprentice2
Graphical performance analysis and visualization tool
CrayPat
Instrumentation of optimized code
No source code modification required
Data collection transparent to the user
Text-based performance reports
Derived metrics
Performance analysis
Cray Apprentice2
Performance data visualization tool
Call tree view
Source code mappings
When performance measurement is triggered:
External agent (asynchronous): sampling
Timer interrupt
Hardware counter overflow
Internal agent (synchronous): code instrumentation
Event based
Automatic or manual instrumentation
How performance data is recorded:
Profile ::= summation of events over time; run-time summarization (functions, call sites, loops, …)
Trace file ::= sequence of events over time
Millions of lines of code: automatic profiling analysis
Identifies top time-consuming routines
Automatically creates an instrumentation template customized to your application
Lots of processes/threads: load imbalance analysis
Identifies computational code regions and synchronization calls that could benefit most from load balance optimization
Estimates savings if the corresponding section of code were balanced
Long-running applications: detection of outliers
Important performance statistics:
Top time consuming routines
Load balance across computing resources
Communication overhead
Cache utilization
FLOPS
Vectorization (SSE instructions)
Ratio of computation versus communication
No source code or makefile modification required
Automatic instrumentation at group (function) level
Groups: mpi, io, heap, math SW, …
Performs link-time instrumentation
Requires object files
Instruments optimized code
Generates stand-alone instrumented program
Preserves original binary
Supports sample-based and event-based instrumentation
Analyze the performance data and direct the user to meaningful information
Simplifies the procedure to instrument and collect performance data for novice users
Based on a two phase mechanism
1. Automatically detects the most time consuming functions in the application and feeds this information back to the tool for further (and focused) data collection
2. Provides performance information on the most significant parts of the application
Performs data conversion
Combines information from binary with raw performance data
Performs analysis on data
Generates text report of performance results
Formats data for input into Cray Apprentice2
Craypat / Cray Apprentice2 5.0 released September 10, 2009
New internal data format
FAQ
Grid placement support
Better caller information (ETC group in pat_report)
Support larger numbers of processors
Client/server version of Cray Apprentice2
Panel help in Cray Apprentice2
Access performance tools software
% module load xt-craypat apprentice2
Build application keeping .o files (CCE: -h keepfiles)
% make clean
% make
Instrument application for automatic profiling analysis; you should get an instrumented program a.out+pat
% pat_build -O apa a.out
Run application to get the top time-consuming routines; you should get a performance file ("<sdatafile>.xf") or multiple files in a directory <sdatadir>
% aprun … a.out+pat (or qsub <pat script>)
Generate report and .apa instrumentation file
% pat_report -o my_sampling_report [<sdatafile>.xf | <sdatadir>]
Inspect the .apa file and sampling report
Verify whether additional instrumentation is needed
# You can edit this file, if desired, and use it
# to reinstrument the program for tracing like this:
#
# pat_build -O mhd3d.Oapa.x+4125-401sdt.apa
#
# These suggested trace options are based on data from:
#
# /home/crayadm/ldr/mhd3d/run/mhd3d.Oapa.x+4125-401sdt.ap2, /home/crayadm/ldr/mhd3d/run/mhd3d.Oapa.x+4125-401sdt.xf
# ----------------------------------------------------------------------
# HWPC group to collect by default.
-Drtenv=PAT_RT_HWPC=1 # Summary with instructions metrics.
# ----------------------------------------------------------------------
# Libraries to trace.
-g mpi
# ----------------------------------------------------------------------
# User-defined functions to trace, sorted by % of samples.
# Limited to top 200. A function is commented out if it has < 1%
# of samples, or if a cumulative threshold of 90% has been reached,
# or if it has size < 200 bytes.
# Note: -u should NOT be specified as an additional option.
# 43.37% 99659 bytes
-T mlwxyz_
# 16.09% 17615 bytes
-T half_
# 6.82% 6846 bytes
-T artv_
# 1.29% 5352 bytes
-T currenh_
# 1.03% 25294 bytes
-T bndbo_
# Functions below this point account for less than 10% of samples.
# 1.03% 31240 bytes
# -T bndto_
. . .
# ----------------------------------------------------------------------
-o mhd3d.x+apa # New instrumented program.
/work/crayadm/ldr/mhd3d/mhd3d.x # Original program.
biolib Cray Bioinformatics library routines
blacs Basic Linear Algebra communication subprograms
blas Basic Linear Algebra subprograms
caf Co-Array Fortran (Cray X2 systems only)
fftw Fast Fourier Transform library (64-bit only)
hdf5 manages extremely large and complex data collections
heap dynamic heap
io includes stdio and sysio groups
lapack Linear Algebra Package
lustre Lustre File System
math ANSI math
mpi MPI
netcdf network common data form (manages array-oriented scientific data)
omp OpenMP API (not supported on Catamount)
omp-rtl OpenMP runtime library (not supported on Catamount)
portals Lightweight message passing API
pthreads POSIX threads (not supported on Catamount)
scalapack Scalable LAPACK
shmem SHMEM
stdio all library functions that accept or return the FILE* construct
sysio I/O system calls
system system calls
upc Unified Parallel C (Cray X2 systems only)
0  Summary with instruction metrics
1  Summary with TLB metrics
2  L1 and L2 metrics
3  Bandwidth information
4  Hypertransport information
5  Floating point mix
6  Cycles stalled, resources idle
7  Cycles stalled, resources full
8  Instructions and branches
9  Instruction cache
10 Cache hierarchy
11 Floating point operations mix (2)
12 Floating point operations mix (vectorization)
13 Floating point operations mix (SP)
14 Floating point operations mix (DP)
15 L3 (socket-level)
16 L3 (core-level reads)
17 L3 (core-level misses)
18 L3 (core-level fills caused by L2 evictions)
19 Prefetches
Regions, useful to break up long routines
int PAT_region_begin (int id, const char *label)
int PAT_region_end (int id)
Disable/Enable Profiling, useful for excluding initialization
int PAT_record (int state)
Flush buffer, useful when program isn’t exiting cleanly
int PAT_flush_buffer (void)
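A minimal usage sketch of these calls (assumptions: the API header is named pat_api.h and a region id plus label are all that is needed; the work inside the region is a stand-in):

#include <pat_api.h>   /* assumed CrayPat API header name */

static double acc = 0.0;
static void solver_step(void) { for (int i = 0; i < 1000; i++) acc += i * 1e-9; }

int main(void)
{
    /* Label the expensive phase so it appears as its own entry in pat_report. */
    PAT_region_begin(1, "solver");
    for (int it = 0; it < 100; it++)
        solver_step();
    PAT_region_end(1);

    /* Push out buffered measurement data in case the program exits abnormally. */
    PAT_flush_buffer();
    return 0;
}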
Instrument application for further analysis (a.out+apa)
% pat_build -O <apafile>.apa
Run application
% aprun … a.out+apa (or qsub <apa script>)
Generate text report and visualization file (.ap2)
% pat_report -o my_text_report.txt [<datafile>.xf | <datadir>]
View report in text and/or with Cray Apprentice2
% app2 <datafile>.ap2
MUST run on Lustre ( /work/… , /lus/…, /scratch/…, etc.)
Number of files used to store raw data
1 file created for program with 1 – 256 processes
√n files created for program with 257 – n processes
Ability to customize with PAT_RT_EXPFILE_MAX
Full trace files show transient events but are too large
Current run-time summarization misses transient events
Plan to add ability to record:
Top N peak values (N small)
Approximate std dev over time
For time, memory traffic, etc.
During tracing and sampling
Call graph profile
Communication statistics
Time-line view
Communication
I/O
Activity view
Pair-wise communication statistics
Text reports
Source code mapping
Cray Apprentice2 is targeted to help identify and correct:
Load imbalance
Excessive communication
Network contention
Excessive serialization
I/O Problems
Switch Overview display
[Screenshot: distribution display showing min, avg, and max values with -1/+1 standard deviation marks.]
[Screenshot: Cray Apprentice2 call-graph profile.]
Function List
Load balance overview: height = max time, middle bar = average time, lower bar = min time; yellow represents imbalance time
Zoom: height = exclusive time, width = inclusive time
DUH button: provides hints for performance tuning
Filtered nodes or sub-trees
[Screenshot: call-graph view with the function list off.]
Right mouse click on a node: node menu (e.g., hide/unhide children)
Sort options: % Time, Time, Imbalance %, Imbalance time
Right mouse click: view menu (e.g., Filter)
[Screenshot: display showing min, avg, and max values with -1/+1 standard deviation marks.]
Cray Apprentice2 panel help
pat_help – interactive help on the Cray Performance toolset
FAQ available through pat_help
intro_craypat(1)
Introduces the craypat performance tool
pat_build
Instrument a program for performance analysis
pat_help
Interactive online help utility
pat_report
Generate performance report in both text and for use with GUI
hwpc(3)
describes predefined hardware performance counter groups
papi_counters(5)
Lists PAPI event counters
Use papi_avail or papi_native_avail utilities to get list of events when running on a specific architecture
pat_report: Help for -O option:
Available option values are in left column, a prefix can be specified:
ct -O calltree
defaults Tables that would appear by default.
heap -O heap_program,heap_hiwater,heap_leaks
io -O read_stats,write_stats
lb -O load_balance
load_balance -O lb_program,lb_group,lb_function
mpi -O mpi_callers
---
callers Profile by Function and Callers
callers+hwpc Profile by Function and Callers
callers+src Profile by Function and Callers, with Line Numbers
callers+src+hwpc Profile by Function and Callers, with Line Numbers
calltree Function Calltree View
calltree+hwpc Function Calltree View
calltree+src Calltree View with Callsite Line Numbers
calltree+src+hwpc Calltree View with Callsite Line Numbers
...
Interactive by default, or use a trailing '.' to just print a topic
New FAQ in craypat 5.0.0
Has counter and counter group information
% pat_help counters amd_fam10h groups .
The top level CrayPat/X help topics are listed below. A good place to start is:
overview
If a topic has subtopics, they are displayed under the heading "Additional topics", as below. To view a subtopic, you need only enter as many initial letters as required to distinguish it from other items in the list. To see a table of contents including subtopics of those subtopics, etc., enter:
toc
To produce the full text corresponding to the table of contents, specify "all", but preferably in a non-interactive invocation:
pat_help all . > all_pat_help
pat_help report all . > all_report_help
Additional topics:
API          execute
balance      experiment
build        first_example
counters     overview
demos        report
environment  run
pat_help (.=quit ,=back ^=up /=top ~=search) =>
CPU Optimizations
Optimizing Communication
I/O Best Practices
55. 1 ii = 0
56. 1 2-----------< do b = abmin, abmax
57. 1 2 3---------< do j=ijmin, ijmax
58. 1 2 3 ii = ii+1
59. 1 2 3 jj = 0
60. 1 2 3 4-------< do a = abmin, abmax
61. 1 2 3 4 r8----< do i = ijmin, ijmax
62. 1 2 3 4 r8 jj = jj+1
63. 1 2 3 4 r8 f5d(a,b,i,j) = f5d(a,b,i,j)
+ tmat7(ii,jj)
64. 1 2 3 4 r8 f5d(b,a,i,j) = f5d(b,a,i,j)
- tmat7(ii,jj)
65. 1 2 3 4 r8 f5d(a,b,j,i) = f5d(a,b,j,i)
- tmat7(ii,jj)
66. 1 2 3 4 r8 f5d(b,a,j,i) = f5d(b,a,j,i)
+ tmat7(ii,jj)
67. 1 2 3 4 r8----> end do
68. 1 2 3 4-------> end do
69. 1 2 3---------> end do
70. 1 2-----------> end do
The inner-most loop strides on a slow dimension of each array.
The best the compiler can do is unroll.
Little to no cache reuse.
Poor loop order results in poor striding.
USER / #1.Original Loops
-----------------------------------------------------------------
Time% 55.0%
Time 13.938244 secs
Imb.Time 0.075369 secs
Imb.Time% 0.6%
Calls 0.1 /sec 1.0 calls
DATA_CACHE_REFILLS:
L2_MODIFIED:L2_OWNED:
L2_EXCLUSIVE:L2_SHARED 11.858M/sec 165279602 fills
DATA_CACHE_REFILLS_FROM_SYSTEM:
ALL 11.931M/sec 166291054 fills
PAPI_L1_DCM 23.499M/sec 327533338 misses
PAPI_L1_DCA 34.635M/sec 482751044 refs
User time (approx) 13.938 secs 36239439807 cycles
100.0%Time
Average Time per Call 13.938244 sec
CrayPat Overhead : Time 0.0%
D1 cache hit,miss ratios 32.2% hits 67.8% misses
D2 cache hit,miss ratio 49.8% hits 50.2% misses
D1+D2 cache hit,miss ratio 66.0% hits 34.0% misses
For every L1 cache hit, there are two misses.
Overall, only 2/3 of all references were found in level 1 or 2 cache.
Poor loop order results in poor cache reuse.
75. 1 2-----------< do i = ijmin, ijmax
76. 1 2 jj = 0
77. 1 2 3---------< do a = abmin, abmax
78. 1 2 3 4-------< do j=ijmin, ijmax
79. 1 2 3 4 jj = jj+1
80. 1 2 3 4 ii = 0
81. 1 2 3 4 Vcr2--< do b = abmin, abmax
82. 1 2 3 4 Vcr2 ii = ii+1
83. 1 2 3 4 Vcr2 f5d(a,b,i,j) = f5d(a,b,i,j)
+ tmat7(ii,jj)
84. 1 2 3 4 Vcr2 f5d(b,a,i,j) = f5d(b,a,i,j)
- tmat7(ii,jj)
85. 1 2 3 4 Vcr2 f5d(a,b,j,i) = f5d(a,b,j,i)
- tmat7(ii,jj)
86. 1 2 3 4 Vcr2 f5d(b,a,j,i) = f5d(b,a,j,i)
+ tmat7(ii,jj)
87. 1 2 3 4 Vcr2--> end do
88. 1 2 3 4-------> end do
89. 1 2 3---------> end do
90. 1 2-----------> end do
Now the inner-most loop is stride-1 on both arrays.
Memory accesses happen along the cache line, allowing reuse.
The compiler is able to vectorize and better use SSE instructions.
Reordered loop nest.
USER / #2.Reordered Loops
-----------------------------------------------------------------
Time% 31.4%
Time 7.955379 secs
Imb.Time 0.260492 secs
Imb.Time% 3.8%
Calls 0.1 /sec 1.0 calls
DATA_CACHE_REFILLS:
L2_MODIFIED:L2_OWNED:
L2_EXCLUSIVE:L2_SHARED 0.419M/sec 3331289 fills
DATA_CACHE_REFILLS_FROM_SYSTEM:
ALL 15.285M/sec 121598284 fills
PAPI_L1_DCM 13.330M/sec 106046801 misses
PAPI_L1_DCA 66.226M/sec 526855581 refs
User time (approx) 7.955 secs 20684020425 cycles
100.0%Time
Average Time per Call 7.955379 sec
CrayPat Overhead : Time 0.0%
D1 cache hit,miss ratios 79.9% hits 20.1% misses
D2 cache hit,miss ratio 2.7% hits 97.3% misses
D1+D2 cache hit,miss ratio 80.4% hits 19.6% misses
Runtime was cut nearly in half.
Still, some 20% of all references are cache misses.
Improved striding greatly improved cache reuse.
First loop, partially vectorized and unrolled by 4
95. 1 ii = 0
96. 1 2-----------< do j = ijmin, ijmax
97. 1 2 i---------< do b = abmin, abmax
98. 1 2 i ii = ii+1
99. 1 2 i jj = 0
100. 1 2 i i-------< do i = ijmin, ijmax
101. 1 2 i i Vpr4--< do a = abmin, abmax
102. 1 2 i i Vpr4 jj = jj+1
103. 1 2 i i Vpr4 f5d(a,b,i,j) =
f5d(a,b,i,j) + tmat7(ii,jj)
104. 1 2 i i Vpr4 f5d(a,b,j,i) =
f5d(a,b,j,i) - tmat7(ii,jj)
105. 1 2 i i Vpr4--> end do
106. 1 2 i i-------> end do
107. 1 2 i---------> end do
108. 1 2-----------> end do
109. 1 jj = 0
110. 1 2-----------< do i = ijmin, ijmax
111. 1 2 3---------< do a = abmin, abmax
112. 1 2 3 jj = jj+1
113. 1 2 3 ii = 0
114. 1 2 3 4-------< do j = ijmin, ijmax
115. 1 2 3 4 Vr4---< do b = abmin, abmax
116. 1 2 3 4 Vr4 ii = ii+1
117. 1 2 3 4 Vr4 f5d(b,a,i,j) =
f5d(b,a,i,j) - tmat7(ii,jj)
118. 1 2 3 4 Vr4 f5d(b,a,j,i) =
f5d(b,a,i,j) + tmat7(ii,jj)
119. 1 2 3 4 Vr4---> end do
120. 1 2 3 4-------> end do
121. 1 2 3---------> end do
122. 1 2-----------> end do
Second loop, vectorized and unrolled by 4
USER / #3.Fissioned Loops
-----------------------------------------------------------------
Time% 9.8%
Time 2.481636 secs
Imb.Time 0.045475 secs
Imb.Time% 2.1%
Calls 0.4 /sec 1.0 calls
DATA_CACHE_REFILLS:
L2_MODIFIED:L2_OWNED:
L2_EXCLUSIVE:L2_SHARED 1.175M/sec 2916610 fills
DATA_CACHE_REFILLS_FROM_SYSTEM:
ALL 34.109M/sec 84646518 fills
PAPI_L1_DCM 26.424M/sec 65575972 misses
PAPI_L1_DCA 156.705M/sec 388885686 refs
User time (approx) 2.482 secs 6452279320 cycles
100.0%Time
Average Time per Call 2.481636 sec
CrayPat Overhead : Time 0.0%
D1 cache hit,miss ratios 83.1% hits 16.9% misses
D2 cache hit,miss ratio 3.3% hits 96.7% misses
D1+D2 cache hit,miss ratio 83.7% hits 16.3% misses
Runtime further reduced.
The cache hit/miss ratio improved slightly.
The loopmark file points to better vectorization from the fissioned loops.
Fissioning further improved cache reuse and resulted in better vectorization.
Cache blocking is a combination of strip mining and loop interchange, designed to increase data reuse.
Takes advantage of temporal reuse: re-reference array elements already referenced
Good blocking will take advantage of spatial reuse: work with the cache lines!
Many ways to block any given loop nest
Which loops get blocked?
What block size(s) to use?
Analysis can reveal which ways are beneficial
But trial-and-error is probably faster
2D Laplacian
do j = 1, 8
do i = 1, 16
a = u(i-1,j) + u(i+1,j) &
- 4*u(i,j) &
+ u(i,j-1) + u(i,j+1)
end do
end do
Cache structure for this example:
Each line holds 4 array elements
Cache can hold 12 lines of u data
No cache reuse between outer loop iterations
[Diagram: 16 x 8 grid of u with cache-line boundaries and per-line miss counts.]
Unblocked loop: 120 cache misses
Block the inner loop
do IBLOCK = 1, 16, 4
do j = 1, 8
do i = IBLOCK, IBLOCK + 3
a(i,j) = u(i-1,j) + u(i+1,j) &
- 4*u(i,j) &
+ u(i,j-1) + u(i,j+1)
end do
end do
end do
Now we have reuse of the "j+1" data.
[Diagram: blocked traversal of the grid in 4-wide strips with per-line miss counts.]
One-dimensional blocking reduced misses from 120 to 80
Iterate over 4 x 4 blocks
do JBLOCK = 1, 8, 4
do IBLOCK = 1, 16, 4
do j = JBLOCK, JBLOCK + 3
do i = IBLOCK, IBLOCK + 3
a(i,j) = u(i-1,j) + u(i+1,j) &
- 4*u(i,j) &
+ u(i,j-1) + u(i,j+1)
end do
end do
end do
end do
Better use of spatial locality (cache lines).
[Diagram: 4 x 4 blocked traversal of the grid with per-line miss counts.]
Matrix-matrix multiply (GEMM) is the canonical cache-blocking example
Operations can be arranged to create multiple levels of blocking
Block for register
Block for cache (L1, L2, L3)
Block for TLB
No further discussion here. Interested readers can see:
Any book on code optimization (Sun's Techniques for Optimizing Applications: High Performance Computing contains a decent introductory discussion in Chapter 8; insert your favorite book here)
Gunnels, Henry, and van de Geijn. June 2001. High-performance matrix multiplication algorithms for architectures with hierarchical memories. FLAME Working Note #4, TR-2001-22, The University of Texas at Austin, Department of Computer Sciences. (Develops algorithms and cost models for GEMM in hierarchical memories.)
Goto and van de Geijn. 2008. Anatomy of high-performance matrix multiplication. ACM Transactions on Mathematical Software 34, 3 (May), 1-25. (Description of the GotoBLAS DGEMM.)
"I tried cache-blocking my code, but it didn't help."
You're doing it wrong.
Your block size is too small (too much loop overhead).
Your block size is too big (data is falling out of cache).
You're targeting the wrong cache level (?)
You haven't selected the correct subset of loops to block.
The compiler is already blocking that loop.
Prefetching is acting to minimize cache misses.
Computational intensity within the loop nest is very large, making blocking less important.
Multigrid PDE solver
Class D, 64 MPI ranks
Global grid is 1024 × 1024 × 1024
Local grid is 258 × 258 × 258
Two similar loop nests account for >50% of run time
27-point 3D stencil
There is good data reuse along leading dimension, even without blocking
do i3 = 2, 257
do i2 = 2, 257
do i1 = 2, 257
! update u(i1,i2,i3)
! using 27-point stencil
end do
end do
end do
[Diagram: 27-point stencil neighborhood at (i1, i2, i3), covering i1-1/i1/i1+1, i2-1/i2/i2+1, i3-1/i3/i3+1, overlaid on cache lines along i1.]
Block the inner two loops
Creates blocks extending along i3 direction
do I2BLOCK = 2, 257, BS2
do I1BLOCK = 2, 257, BS1
do i3 = 2, 257
do i2 = I2BLOCK, &
min(I2BLOCK+BS2-1, 257)
do i1 = I1BLOCK, &
min(I1BLOCK+BS1-1, 257)
! update u(i1,i2,i3)
! using 27-point stencil
end do
end do
end do
end do
end do
Block size   Mop/s/process
unblocked    531.50
16 × 16      279.89
22 × 22      321.26
28 × 28      358.96
34 × 34      385.33
40 × 40      408.53
46 × 46      443.94
52 × 52      468.58
58 × 58      470.32
64 × 64      512.03
70 × 70      506.92
Block the outer two loops
Preserves spatial locality along i1 direction
do I3BLOCK = 2, 257, BS3
do I2BLOCK = 2, 257, BS2
do i3 = I3BLOCK, &
min(I3BLOCK+BS3-1, 257)
do i2 = I2BLOCK, &
min(I2BLOCK+BS2-1, 257)
do i1 = 2, 257
! update u(i1,i2,i3)
! using 27-point stencil
end do
end do
end do
end do
end do
Block size   Mop/s/process
unblocked    531.50
16 × 16      674.76
22 × 22      680.16
28 × 28      688.64
34 × 34      683.84
40 × 40      698.47
46 × 46      689.14
52 × 52      706.62
58 × 58      692.57
64 × 64      703.40
70 × 70      693.87
70 × 70 693.87
( 53) void mat_mul_daxpy(double *a, double *b, double *c, int rowa,
int cola, int colb)
( 54) {
( 55) int i, j, k; /* loop counters */
( 56) int rowc, colc, rowb; /* sizes not passed as arguments */
( 57) double con; /* constant value */
( 58)
( 59) rowb = cola;
( 60) rowc = rowa;
( 61) colc = colb;
( 62)
( 63) for(i=0;i<rowc;i++) {
( 64) for(k=0;k<cola;k++) {
( 65) con = *(a + i*cola +k);
( 66) for(j=0;j<colc;j++) {
( 67) *(c + i*colc + j) += con * *(b + k*colb + j);
( 68) }
( 69) }
( 70) }
( 71) }
mat_mul_daxpy:
66, Loop not vectorized: data dependency
Loop not vectorized: data dependency
Loop unrolled 4 times
C pointers: C pointers don't carry the same rules as Fortran arrays.
The compiler has no way to know whether *a, *b, and *c overlap or are referenced differently elsewhere.
The compiler must assume the worst, thus a false data dependency.
( 53) void mat_mul_daxpy(double* restrict a, double* restrict b,
double* restrict c, int rowa, int cola, int colb)
( 54) {
( 55) int i, j, k; /* loop counters */
( 56) int rowc, colc, rowb; /* sizes not passed as arguments */
( 57) double con; /* constant value */
( 58)
( 59) rowb = cola;
( 60) rowc = rowa;
( 61) colc = colb;
( 62)
( 63) for(i=0;i<rowc;i++) {
( 64) for(k=0;k<cola;k++) {
( 65) con = *(a + i*cola +k);
( 66) for(j=0;j<colc;j++) {
( 67) *(c + i*colc + j) += con * *(b + k*colb + j);
( 68) }
( 69) }
( 70) }
( 71) }
C pointers, restricted: C99 introduces the restrict keyword, which allows the programmer to promise not to reference the memory via another pointer.
If you declare a restricted pointer and break the rules, behavior is undefined by the standard.
66, Generated alternate loop with no peeling - executed if loop count <= 24
Generated vector sse code for inner loop
Generated 2 prefetch instructions for this loop
Generated vector sse code for inner loop
Generated 2 prefetch instructions for this loop
Generated alternate loop with no peeling and more aligned moves -
executed if loop count <= 24 and alignment test is passed
Generated vector sse code for inner loop
Generated 2 prefetch instructions for this loop
Generated alternate loop with more aligned moves - executed if loop
count >= 25 and alignment test is passed
Generated vector sse code for inner loop
Generated 2 prefetch instructions for this loop
• This can also be achieved with the PGI safe pragma and -Msafeptr compiler option, or the PathScale -OPT:alias option
GNU malloc library: malloc, calloc, realloc, free calls
Fortran dynamic variables
Malloc library system calls: mmap, munmap (for larger allocations); brk, sbrk (increase/decrease heap)
Malloc library is optimized for low system memory use; can result in system calls/minor page faults
Detecting "bad" malloc behavior
Profile data => "excessive system time"
Correcting "bad" malloc behavior
Eliminate mmap use by malloc; increase the threshold to release heap memory
Use environment variables to alter malloc:
MALLOC_MMAP_MAX_ = 0
MALLOC_TRIM_THRESHOLD_ = 536870912
Possible downsides:
Heap fragmentation; the user process may call mmap directly; the user process may launch other processes
PGI's -Msmartalloc does something similar for you at compile time
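The same two knobs can also be set from within the program using glibc's mallopt, which is roughly equivalent to the environment variables above (a sketch; call it before the first allocation):

#include <malloc.h>

int main(void)
{
    /* Equivalent of MALLOC_MMAP_MAX_=0: never satisfy malloc via mmap. */
    mallopt(M_MMAP_MAX, 0);

    /* Equivalent of MALLOC_TRIM_THRESHOLD_=536870912: keep up to 512 MB of
       freed memory on the heap instead of returning it to the kernel. */
    mallopt(M_TRIM_THRESHOLD, 512 * 1024 * 1024);

    /* ... rest of the application ... */
    return 0;
}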
Google created a replacement “malloc” library
“Minimal” TCMalloc replaces GNU malloc
Limited testing indicates TCMalloc as good or better than GNU malloc
Environment variables not required
TCMalloc almost certainly better for allocations in OpenMP parallel regions
There’s currently no pre-built tcmalloc for Cray XT, but some users have successfully built it.
Linux has a "first touch" policy for memory allocation
*alloc functions don't actually allocate your memory
Memory gets allocated when "touched"
Problem: a code can allocate more memory than is available
Linux assumes "swap space"; we don't have any
Applications won't fail from over-allocation until the memory is finally touched
Problem: memory will be placed local to the core of the "touching" thread
Only a problem if thread 0 allocates all memory for a node
Solution: always initialize your memory immediately after allocating it
If you over-allocate, it will fail immediately, rather than at a strange place in your code
If every thread touches its own memory, it will be allocated on the proper socket
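A minimal sketch of the "initialize it where you will use it" rule with OpenMP (illustrative; use the same loop schedule for the initialization as for the later compute loops):

#include <stdlib.h>

double *alloc_and_touch(size_t n)
{
    double *x = malloc(n * sizeof *x);
    if (!x) return NULL;             /* over-allocation fails here, not later */

    /* First touch: each thread initializes the pages it will later compute on,
       so those pages are placed in that thread's local memory. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++)
        x[i] = 0.0;

    return x;
}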
Short Message Eager Protocol
The sending rank "pushes" the message to the receiving rank
Used for messages of MPICH_MAX_SHORT_MSG_SIZE bytes or less
The sender assumes that the receiver can handle the message:
a matching receive is posted, or
there are available event queue entries (MPICH_PTL_UNEX_EVENTS) and buffer space (MPICH_UNEX_BUFFER_SIZE) to store the message
Long Message Rendezvous Protocol
Messages are "pulled" by the receiving rank
Used for messages greater than MPICH_MAX_SHORT_MSG_SIZE bytes
The sender sends a small header packet with information for the receiver to pull over the data
Data is sent only after the matching receive is posted by the receiving rank
MPT Eager Protocol: data "pushed" to the receiver (MPICH_MAX_SHORT_MSG_SIZE bytes or less); MPI_RECV is posted prior to the MPI_SEND call.
[Diagram: Step 1 - MPI_RECV call posts a match entry (ME) to Portals; Step 2 - MPI_SEND call; Step 3 - Portals DMA PUT of the incoming message on the SeaStar. Also shown: MPI unexpected buffers (MPICH_UNEX_BUFFER_SIZE), unexpected message queue, unexpected event queue (MPICH_PTL_UNEX_EVENTS), other event queue (MPICH_PTL_OTHER_EVENTS), and the eager short-message, rendezvous long-message, and application match entries posted by MPI to handle unexpected messages.]
MPT Eager Protocol: MPI_RECV is not posted prior to the MPI_SEND call.
[Diagram: Step 1 - MPI_SEND call; Step 2 - Portals DMA PUT into the MPI unexpected buffers (MPICH_UNEX_BUFFER_SIZE), with an entry in the unexpected event queue (MPICH_PTL_UNEX_EVENTS); Step 3 - MPI_RECV call (no Portals ME); Step 4 - memcpy of the data to the application buffer on the SeaStar.]
MPT Rendezvous Protocol: data is not sent until MPI_RECV is issued.
[Diagram: Step 1 - MPI_SEND call, a Portals ME is created; Step 2 - Portals DMA PUT of the header; Step 3 - MPI_RECV call triggers a GET request; Step 4 - the receiver issues the GET request to match the sender's ME; Step 5 - Portals DMA of the data on the SeaStar.]
MPICH_MAX_SHORT_MSG_SIZE controls the message sending protocol:
Message sizes <= MSG_SIZE: use EAGER
Message sizes > MSG_SIZE: use RENDEZVOUS
Increasing this variable may require that MPICH_UNEX_BUFFER_SIZE be increased
Increase MPICH_MAX_SHORT_MSG_SIZE if the application sends large messages and receives are pre-posted:
Can reduce messaging overhead via the EAGER protocol
Can reduce network contention
Decrease MPICH_MAX_SHORT_MSG_SIZE if:
The application sends lots of smaller messages and receives are not pre-posted, exhausting unexpected buffer space
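"Receives are pre-posted" simply means the MPI_Irecv calls are issued before the matching sends arrive, so eager messages land directly in user buffers instead of the unexpected buffer. A hedged sketch (buffer layout, tag, and neighbor list are illustrative):

#include <mpi.h>

/* Post all receives for this exchange phase up front, then send.
   Eager-protocol messages arriving afterwards match immediately. */
void exchange(double *rbuf, double *sbuf, int count,
              int nneigh, const int *neighbors, MPI_Comm comm)
{
    MPI_Request reqs[64];            /* assumes nneigh <= 32 (illustrative) */

    for (int i = 0; i < nneigh; i++)
        MPI_Irecv(rbuf + (long)i * count, count, MPI_DOUBLE,
                  neighbors[i], 99, comm, &reqs[i]);

    for (int i = 0; i < nneigh; i++)
        MPI_Isend(sbuf + (long)i * count, count, MPI_DOUBLE,
                  neighbors[i], 99, comm, &reqs[nneigh + i]);

    MPI_Waitall(2 * nneigh, reqs, MPI_STATUSES_IGNORE);
}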
If set => Disables Portals matching
Matching happens on the Opteron
Requires extra copy for EAGER protocol
Reduces MPI_Recv Overhead
Helpful for latency-sensitive applications
Large # of small messages
Small message collectives (<1024 bytes)
When can this be slower?
When extra copy time longer than post-to-Portals time
Pre-posted Receives can slow it down
For medium to larger messages (16k-128k range)
Not beneficial for Gemini
The default ordering can be changed using the following environment variable:
MPICH_RANK_REORDER_METHOD
These are the different values that you can set it to:
0: Round-robin placement – sequential ranks are placed on the next node in the list; placement starts over with the first node upon reaching the end of the list.
1: SMP-style placement – sequential ranks fill up each node before moving to the next.
2: Folded-rank placement – similar to round-robin, except that each pass over the node list is in the opposite direction of the previous pass.
3: Custom ordering – the ordering is specified in a file named MPICH_RANK_ORDER.
When is this useful?
When point-to-point communication consumes a significant fraction of program time and a load imbalance is detected.
It has also been shown to help collectives (alltoall) on subcommunicators (GYRO).
And to spread I/O out across nodes (POP).
One can also use the CrayPat performance measurement tools to generate a suggested custom ordering.
Available if MPI functions are traced (-g mpi or -O apa).
pat_build -O apa my_program (see the Examples section of the pat_build man page)
pat_report options:
mpi_sm_rank_order: uses message data from tracing MPI to generate a suggested MPI rank order. Requires the program to be instrumented using the pat_build -g mpi option.
mpi_rank_order: uses time in user functions, or alternatively any other metric specified with the -s mro_metric options, to generate a suggested MPI rank order.
module load xt-craypat
Rebuild your code
pat_build -O apa a.out
Run a.out+pat
pat_report -O mpi_sm_rank_order a.out+pat+…sdt/ > pat.report
Creates a MPICH_RANK_ORDER.x file
Then set the env var MPICH_RANK_REORDER_METHOD=3 AND
link the file MPICH_RANK_ORDER.x to MPICH_RANK_ORDER
Rerun the code
Table 1: Suggested MPI Rank Order
Eight cores per node: USER Samp per node

Rank   Max        Max/    Avg        Avg/    Max Node
Order  USER Samp  SMP     USER Samp  SMP     Ranks
d      17062       97.6%  16907      100.0%  832,328,820,797,113,478,898,600
2      17213       98.4%  16907      100.0%  53,202,309,458,565,714,821,970
0      17282       98.8%  16907      100.0%  53,181,309,437,565,693,821,949
1      17489      100.0%  16907      100.0%  0,1,2,3,4,5,6,7
This suggests that:
1. The custom ordering “d” might be the best.
2. Folded-rank is next best.
3. Round-robin is third best.
4. The default ordering is last.
GYRO 8.0, B3-GTC problem with 1024 processes
Run with alternate MPI orderings. Custom: profiled with -O apa and used the reordering file MPICH_RANK_ORDER.d
Reorder method         Comm. time
Default                11.26s
0 – round-robin         6.94s
2 – folded-rank         6.68s
d – custom from apa     8.03s

CrayPat's suggestion was almost right!
TGYRO 1.0
Steady state turbulent transport code using GYRO, NEO, TGLF components
ASTRA test case
Tested MPI orderings at large scale
Originally testing weak-scaling, but found reordering very useful
TGYRO wall time (min)
Reorder method    20480   40960   81920
Default           99m     104m    105m
Round-robin       66m     63m     72m

Huge win!
Application data is in a 3D space, X x Y x Z.
Communication is nearest-neighbor.
The default ordering results in a 12x1x1 block of ranks on each node.
A custom reordering is now generated: 3x2x2 blocks per node, resulting in more on-node communication.
Rank Reordering Case Study
% pat_report -O mpi_sm_rank_order -s rank_grid_dim=8,6 ...
Notes for table 1:
To maximize the locality of point to point communication,
specify a Rank Order with small Max and Avg Sent Msg Total Bytes
per node for the target number of cores per node.
To specify a Rank Order with a numerical value, set the environment
variable MPICH_RANK_REORDER_METHOD to the given value.
To specify a Rank Order with a letter value 'x', set the environment
variable MPICH_RANK_REORDER_METHOD to 3, and copy or link the file
MPICH_RANK_ORDER.x to MPICH_RANK_ORDER.
Table 1: Sent Message Stats and Suggested MPI Rank Order

Communication Partner Counts
Number of   Rank
Partners    Count   Ranks
2           4       0 5 42 47
3           20      1 2 3 4 ...
4           24      7 8 9 10 ...
Four cores per node: Sent Msg Total Bytes per node

Rank   Max          Max/    Avg          Avg/    Max Node
Order  Total Bytes  SMP     Total Bytes  SMP     Ranks
g      121651200     73.9%   86400000     62.5%  14,20,15,21
h      121651200     73.9%   86400000     62.5%  14,20,21,15
u      152064000     92.4%  146534400    106.0%  13,12,10,4
1      164505600    100.0%  138240000    100.0%  16,17,18,19
d      164505600    100.0%  142387200    103.0%  16,17,19,18
0      224640000    136.6%  207360000    150.0%  1,13,25,37
2      241920000    147.1%  207360000    150.0%  7,16,31,40
% $CRAYPAT_ROOT/sbin/grid_order -c 2,2 -g 8,6
# grid_order -c 2,2 -g 8,6
# Region 0: 0,0 (0..47)
0,1,6,7
2,3,8,9
4,5,10,11
12,13,18,19
14,15,20,21
16,17,22,23
24,25,30,31
26,27,32,33
28,29,34,35
36,37,42,43
38,39,44,45
40,41,46,47
This script will also handle the case that cells do not
evenly partition the grid.
% $CRAYPAT_ROOT/sbin/mgrid_order -H -g 8,6
# mgrid_order -H -g 8,6
[Output: the rank ordering for the 8x6 grid (ranks 0-47) generated along a Hilbert curve, shown alongside a picture of the grid layout.]
Hilbert curve ordering works best when the grid has a power-of-two (2^n) side.
X X o o
X X o o
o o o o
o o o o
Nodes marked X heavily use a shared resource
If memory bandwidth, scatter the X's
If network bandwidth to others, again scatter
If network bandwidth among themselves, concentrate
I/O is simply data migration: memory <-> disk.
I/O is a very expensive operation.
It involves interactions with data in memory and on disk.
It must get the kernel involved.
How is I/O performed?
I/O Pattern
Number of processes and files.
File access characteristics.
Where is I/O performed?
Characteristics of the computational system.
Characteristics of the file system.
There is no “One Size Fits All” solution to the I/O problem.
Many I/O patterns work well for some range of parameters.
Bottlenecks in performance can occur in many locations. (Application and/or File system)
Going to extremes with an I/O pattern will typically lead to problems.
The best performance comes when the data is accessed contiguously both in memory and on disk.
This facilitates large operations and minimizes latency.
Commonly, data access is contiguous in memory but noncontiguous on disk, or vice versa, usually because a global data structure is being reconstructed via parallel I/O.
Spokesperson
One process performs I/O.
Data Aggregation or Duplication
Limited by single I/O process.
Pattern does not scale.
Time increases linearly with amount of data.
Time increases with number of processes.
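A minimal sketch of the spokesperson pattern (file name and data sizes are illustrative): every rank sends its block to rank 0, which writes the whole file by itself.

/* spokesperson.c - sketch of the single-writer I/O pattern. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs, n = 1 << 20;                    /* doubles per rank (example) */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double *local = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) local[i] = rank;      /* example data */

    double *global = NULL;
    if (rank == 0) global = malloc((size_t)n * nprocs * sizeof(double));

    /* Aggregate everything on rank 0 ... */
    MPI_Gather(local, n, MPI_DOUBLE, global, n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* ... and only the spokesperson touches the file. */
    if (rank == 0) {
        FILE *fp = fopen("output.dat", "wb");
        fwrite(global, sizeof(double), (size_t)n * nprocs, fp);
        fclose(fp);
        free(global);
    }

    free(local);
    MPI_Finalize();
    return 0;
}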
File per process
All processes perform I/O to individual files.
Limited by file system.
Pattern does not scale at large process counts.
Number of files creates bottleneck with metadata operations.
Number of simultaneous disk accesses creates contention for file system resources.
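A minimal sketch of the file-per-process pattern (the file naming is illustrative): each rank writes its own file, so no coordination is needed, but the number of files grows with the job.

/* file_per_process.c - sketch: every rank writes its own file. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, n = 1 << 20;                  /* doubles per rank (example) */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *local = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) local[i] = rank;

    char fname[64];
    snprintf(fname, sizeof(fname), "output.%06d.dat", rank);   /* one file per rank */
    FILE *fp = fopen(fname, "wb");
    fwrite(local, sizeof(double), n, fp);
    fclose(fp);

    free(local);
    MPI_Finalize();
    return 0;
}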
Shared File
Each process performs I/O to a single file which is shared.
Performance
Data layout within the shared file is very important.
At large process counts contention can build for file system resources.
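A minimal sketch of a single shared file written collectively with MPI-IO (file name and block size are illustrative): each rank writes a disjoint, contiguous region at a rank-dependent offset, so the data layout within the file is explicit.

/* shared_file.c - sketch: all ranks write disjoint blocks of one shared file. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, n = 1 << 20;                  /* doubles per rank (example) */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *local = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) local[i] = rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank writes to its own region; the collective call lets the
       library coordinate the access (e.g. collective buffering). */
    MPI_Offset offset = (MPI_Offset)rank * n * sizeof(double);
    MPI_File_write_at_all(fh, offset, local, n, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(local);
    MPI_Finalize();
    return 0;
}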
Subset of processes which perform I/O.
Aggregates a group of processes' data; serializes I/O within the group.
Each I/O process may access an independent file, which limits the number of files accessed (see the sketch below).
Alternatively, a group of processes performs parallel I/O to a shared file:
increases the number of shared files, to increase file system usage;
decreases the number of processes which access each shared file, to decrease file system contention.
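A minimal sketch of one variant of this pattern, in which each group's leader writes an independent file (the group size and file names are illustrative): ranks are split into groups, each group's data is gathered to its leader, and only the leaders perform I/O.

/* io_subset.c - sketch: only one writer per group of ranks. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, n = 1 << 18, ranks_per_group = 32;      /* example sizes */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Split the world into groups of ranks_per_group consecutive ranks. */
    MPI_Comm group;
    MPI_Comm_split(MPI_COMM_WORLD, rank / ranks_per_group, rank, &group);
    int grank, gsize;
    MPI_Comm_rank(group, &grank);
    MPI_Comm_size(group, &gsize);

    double *local = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) local[i] = rank;

    double *agg = NULL;
    if (grank == 0) agg = malloc((size_t)n * gsize * sizeof(double));
    MPI_Gather(local, n, MPI_DOUBLE, agg, n, MPI_DOUBLE, 0, group);

    if (grank == 0) {                                 /* one writer per group */
        char fname[64];
        snprintf(fname, sizeof(fname), "output.group%04d.dat", rank / ranks_per_group);
        FILE *fp = fopen(fname, "wb");
        fwrite(agg, sizeof(double), (size_t)n * gsize, fp);
        fclose(fp);
        free(agg);
    }

    free(local);
    MPI_Comm_free(&group);
    MPI_Finalize();
    return 0;
}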
[Plot: file-per-process write performance. 128 MB per file and a 32 MB transfer size. Write rate (MB/s, 0-12000) vs. number of processes or files (0-9000), for 1 MB and 32 MB stripe sizes.]
[Plot: single-shared-file write performance. 32 MB per process, 32 MB transfer size and stripe size. Write rate (MB/s, 0-8000) vs. number of processes (0-9000), for POSIX, MPI-IO, and HDF5.]
Lustre
Minimize contention for file system resources.
A process should not access more than one or two OSTs.
Performance
Performance is limited for single process I/O.
Parallel I/O utilizing a file-per-process or a single shared file is limited at large scales.
A potential solution is to utilize multiple shared files, or a subset of processes which perform I/O.
Standard Output and Error streams are effectively serial I/O.
All STDIN, STDOUT, and STDERR I/O serializes through aprun.
Disable debugging messages when running in production mode.
“Hello, I’m task 32000!”
“Task 64000, made it through loop.”
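A minimal sketch of keeping such messages out of production runs (the DEBUG macro is illustrative): debug output is compiled in only on request and printed by a single rank.

/* debug_print.c - sketch: rank-guarded, compile-time-guarded debug output. */
#include <mpi.h>
#include <stdio.h>

#ifndef DEBUG
#define DEBUG 0                 /* build with -DDEBUG=1 only while debugging */
#endif

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (DEBUG && rank == 0)     /* at most one rank prints, and only in debug builds */
        printf("Hello, I'm task %d!\n", rank);

    MPI_Finalize();
    return 0;
}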
Advantages
Aggregates smaller read/write operations into larger operations.
Examples: OS Kernel Buffer, MPI-IO Collective Buffering
Disadvantages
Requires additional memory for the buffer.
Caution
Frequent buffer flushes can adversely affect performance.
A particular code both reads and writes a 377 GB file. Runs on 6000 cores.
Total I/O volume (reads and writes) is 850 GB.
Utilizes parallel HDF5
Default Stripe settings: count 4, size 1M, index -1.
1800 s run time (~ 30 minutes)
Stripe settings: count -1 (use all available OSTs), size 1M, index -1.
625 s run time (~ 10 minutes)
Results
66% decrease in run time.
Included in the Cray MPT library.
Environment variables used to help MPI-IO optimize I/O performance.
MPICH_MPIIO_CB_ALIGN environment variable (default 2).
MPICH_MPIIO_HINTS environment variable:
Can set striping_factor and striping_unit for files created with MPI-IO.
If writes and/or reads utilize collective calls, collective buffering can be utilized (romio_cb_read/write) to approximately stripe-align I/O within Lustre.
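A minimal sketch of passing such hints through an MPI_Info object at file-creation time (the particular values are examples, not recommendations); the same hints can alternatively be supplied via the MPICH_MPIIO_HINTS environment variable.

/* mpiio_hints.c - sketch: Lustre striping and collective-buffering hints. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "16");      /* stripe count for the new file (example) */
    MPI_Info_set(info, "striping_unit", "1048576");   /* 1 MB stripe size (example) */
    MPI_Info_set(info, "romio_cb_write", "enable");   /* use collective buffering on writes */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "hinted.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    /* ... collective writes, e.g. MPI_File_write_at_all(...), as usual ... */
    MPI_File_close(&fh);

    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}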
[Plot: rate in MB/s (0-1800). MPI-IO API, non-power-of-2 blocks and transfers; in this case blocks and transfers both of 1M bytes and a strided access pattern. Tested on an XT5 with 32 PEs, 8 cores/node, 16 stripes, 16 aggregators, 3220 segments, 96 GB file.]
[Plot: rate in MB/s (0-160). MPI-IO API, non-power-of-2 blocks and transfers; in this case blocks and transfers both of 10K bytes and a strided access pattern. Tested on an XT5 with 32 PEs, 8 cores/node, 16 stripes, 16 aggregators, 3220 segments, 96 GB file.]
[Plot: rate in MB/s (0-4000). On 5107 PEs, and by application design, a subset of the PEs (88) do the writes. With collective buffering, this is further reduced to 22 aggregators (cb_nodes) writing to 22 stripes. Tested on an XT5 with 5107 PEs, 8 cores/node.]
[Plot: dump time in seconds (log scale, 1-1000) vs. PEs, without collective buffering and with CB=0, CB=1, CB=2. Total file size 6.4 GiB. Mesh of 64M bytes, 32M elements, with work divided amongst all PEs. The original problem scaled very poorly; for example, without collective buffering, 8000 PEs take over 5 minutes to dump. Note that disabling data sieving was necessary. Tested on an XT5, 8 stripes, 8 cb_nodes.]
Do not open a lot of files all at once (Metadata Bottleneck)
Use a simple ls (without color) instead of ls -l (OST Bottleneck)
Remember to stripe files
Small, individual files => Small stripe counts
Large, shared files => Large stripe counts
Never set an explicit starting OST for your files (Filesystem Balance)
Open Files as Read-Only when possible
Limit the number of files per directory
Stat files from just one process
Stripe-align your I/O (Reduces Locks)
Read small, shared files once and broadcast the data (OST Contention)
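For the last point above, a minimal sketch (the file name is illustrative): rank 0 reads the small shared file once and broadcasts its contents, so thousands of ranks do not all open the same file.

/* read_bcast.c - sketch: read a small shared file once, broadcast to all ranks. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;
    long len = 0;
    char *buf = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                          /* only rank 0 touches the file */
        FILE *fp = fopen("input.cfg", "rb");  /* example file name */
        if (!fp) MPI_Abort(MPI_COMM_WORLD, 1);
        fseek(fp, 0, SEEK_END);
        len = ftell(fp);
        rewind(fp);
        buf = malloc(len);
        fread(buf, 1, len, fp);
        fclose(fp);
    }

    /* Everyone else gets the contents over the network instead of the OSTs. */
    MPI_Bcast(&len, 1, MPI_LONG, 0, MPI_COMM_WORLD);
    if (rank != 0) buf = malloc(len);
    MPI_Bcast(buf, (int)len, MPI_BYTE, 0, MPI_COMM_WORLD);

    /* ... every rank now parses buf ... */
    free(buf);
    MPI_Finalize();
    return 0;
}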