Review of XT6 Architecture
AMD Opteron
Cray Networks
Lustre Basics
Programming Environment
PGI Compiler Basics
The Cray Compiler Environment
Cray Scientific Libraries
Cray Performance Analysis Tools
Optimizations
CPU
Communication
I/O
AMD CPU Architecture
Cray Architecture
Lustre Filesystem Basics
AMD Opteron evolution:
2003: AMD Opteron (K8 core), 130nm SOI, 1MB L2 / no L3, HyperTransport 3x 1.6GT/s, 2x DDR1 300
2005: AMD Opteron (K8 core), 90nm SOI, 1MB L2 / no L3, HyperTransport 3x 1.6GT/s, 2x DDR1 400
2007: "Barcelona" (Greyhound core), 65nm SOI, 512kB L2 / 2MB L3, HyperTransport 3x 2GT/s, 2x DDR2 667
2008: "Shanghai" (Greyhound+ core), 45nm SOI, 512kB L2 / 6MB L3, HyperTransport 3x 4.0GT/s, 2x DDR2 800
2009: "Istanbul" (Greyhound+ core), 45nm SOI, 512kB L2 / 6MB L3, HyperTransport 3x 4.8GT/s, 2x DDR2 800
2010: "Magny-Cours" (Greyhound+ core), 45nm SOI, 512kB L2 / 12MB L3, HyperTransport 4x 6.4GT/s, 4x DDR3 1333
12 cores, 1.7-2.2 GHz, 105.6 Gflops
8 cores, 1.8-2.4 GHz, 76.8 Gflops
Power (ACP): 80 Watts
Stream: 27.5 GB/s
Cache: 12x 64 KB L1, 12x 512 KB L2, 12 MB L3
[Diagram: Magny-Cours die pair, 12 cores (Core 0 to Core 11), each with a private L2 cache, a shared L3 cache per die, two memory controllers, and multiple HyperTransport links.]
A cache line is 64B
Cache is a “victim cache”
All references go to L1 immediately and get evicted down the caches
A cache line is usually only in one level of cache
Hardware prefetcher detects forward and backward strides through memory
Each core can perform a 128b add and 128b multiply per clock cycle
This requires SSE, packed instructions
“Stride-one vectorization”
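To make this concrete, here is a small illustrative C sketch (not from the original slides; array names and sizes are hypothetical): the first loop is stride-one and is a good candidate for packed SSE code, while the second wastes most of each 64B cache line and typically will not vectorize profitably.

#define N 1024

void axpy_stride1(double *restrict y, const double *restrict x, double a)
{
    /* Unit stride: consecutive iterations touch consecutive memory,
       so the compiler can emit packed (SSE) adds and multiplies. */
    for (int i = 0; i < N; i++)
        y[i] += a * x[i];
}

void axpy_strided(double *restrict y, const double *restrict x, double a, int stride)
{
    /* Non-unit stride: each iteration jumps 'stride' elements apart,
       defeating packed loads/stores and wasting cache-line bandwidth. */
    for (int i = 0; i < N; i++)
        y[i * stride] += a * x[i * stride];
}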
SeaStar (XT-series)
Gemini (XE-series)
Microkernel on Compute PEs, full featured Linux on Service PEs.
Service PEs specialize by function
Software Architecture eliminates OS “Jitter”
Software Architecture enables reproducible run times
Large machines boot in under 30 minutes, including filesystem
Service Partition: specialized Linux nodes
Compute PE
Login PE
Network PE
System PE
I/O PE
[Diagram: system overview showing compute nodes, login nodes, network nodes with 10 GigE links, boot/syslog/database nodes, I/O and metadata nodes connected over Fibre Channel to a RAID subsystem, the SMW, and the X/Y/Z torus dimensions.]
Cray XT5 systems ship with the SeaStar2+ interconnect
Custom ASIC
Integrated NIC / Router
MPI offload engine
Connectionless Protocol
Link Level Reliability
Proven scalability to 225,000 cores
[Diagram: SeaStar2+ block diagram with HyperTransport interface, memory, PowerPC 440 processor, DMA engine, 6-port router, and blade control processor interface. Now scaled to 225,000 cores.]
Processor        Frequency (GHz)   Peak (Gflops)   Bandwidth (GB/sec)   Balance (bytes/flop)
Istanbul (XT5)   2.6               62.4            12.8                 0.21
MC-8             2.0               64              42.6                 0.67
MC-8             2.3               73.6            42.6                 0.58
MC-8             2.4               76.8            42.6                 0.55
MC-12            1.9               91.2            42.6                 0.47
MC-12            2.1               100.8           42.6                 0.42
MC-12            2.2               105.6           42.6                 0.40
[Diagram: Cray XT6 node, 6.4 GB/sec direct-connect HyperTransport to the Cray SeaStar2+ interconnect, 83.5 GB/sec direct-connect memory.]
Characteristics:
Number of Cores: 16 or 24 (MC), 32 (IL)
Peak Performance, MC-8 (2.4 GHz): 153 Gflops/sec
Peak Performance, MC-12 (2.2 GHz): 211 Gflops/sec
Memory Size: 32 or 64 GB per node
Memory Bandwidth: 83.5 GB/sec
[Diagram: two Magny-Cours multi-chip modules, four 6-core dies ("Greyhound"), each with 6 MB of L3 cache and two DDR3 channels, fully connected with HT3 links; one HT1/HT3 link to the interconnect.]
2 Multi-Chip Modules, 4 Opteron Dies
8 Channels of DDR3 Bandwidth to 8 DIMMs
24 (or 16) Computational Cores, 24 MB of L3 cache
Dies are fully connected with HT3
Snoop Filter Feature Allows 4-Die SMP to scale well
Without the snoop filter, a streams test shows 25 GB/sec out of a possible 51.2 GB/sec, or 48% of peak bandwidth.
With the snoop filter, a streams test shows 42.3 GB/sec out of a possible 51.2 GB/sec, or 82% of peak bandwidth.
This feature will be key for two-socket Magny-Cours nodes, which are architecturally the same.
New compute blade with 8 AMD Magny-Cours processors
Plug-compatible with XT5 cabinets and backplanes
Initially ships with the SeaStar interconnect as the Cray XT6
Upgradeable to the Gemini interconnect as the Cray XE6
Upgradeable to AMD's "Interlagos" series
XT6 systems will continue to ship with the current SIO blade
First customer ship: March 31st
Supports 2 Nodes per ASIC
168 GB/sec routing capacity
Scales to over 100,000 network endpoints
Link Level Reliability and Adaptive Routing
Advanced Resiliency Features
Provides global address space
Advanced NIC designed to efficiently support
MPI
One-sided MPI
Shmem
UPC, Coarray FORTRAN
[Diagram: Gemini ASIC with two HyperTransport 3 links to two processor nodes, NIC 0 and NIC 1, a netlink block, and a 48-port YARC router; 10 12X Gemini channels (each Gemini acts like two nodes on the 3-D torus).]
Cray Baker Node Characteristics:
Number of Cores: 16 or 24
Peak Performance: 140 or 210 Gflops/s
Memory Size: 32 or 64 GB per node
Memory Bandwidth: 85 GB/sec
High-radix YARC router with adaptive routing, 168 GB/sec capacity
[Diagram: compute module with SeaStar vs. module with Gemini; torus dimensions X, Y, Z.]
FMA (Fast Memory Access): mechanism for most MPI transfers; supports tens of millions of MPI requests per second.
BTE (Block Transfer Engine): supports asynchronous block transfers between local and remote memory, in either direction; used for large MPI transfers that proceed in the background (see the sketch below).
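The BTE offload is what makes it worthwhile to overlap large transfers with computation. A minimal MPI sketch of that pattern (illustrative only; buffer names, counts, and neighbor ranks are assumptions, not from the slides):

#include <mpi.h>

/* Post large non-blocking transfers, compute on independent data while the
   offload engine moves the messages, then wait for completion. */
void exchange_and_compute(double *sendbuf, double *recvbuf, int count,
                          int left, int right, MPI_Comm comm)
{
    MPI_Request req[2];

    MPI_Irecv(recvbuf, count, MPI_DOUBLE, left,  0, comm, &req[0]);
    MPI_Isend(sendbuf, count, MPI_DOUBLE, right, 0, comm, &req[1]);

    /* ... compute on data that does not depend on recvbuf ... */

    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
}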
[Diagram: Gemini NIC block diagram with HT3 cave, FMA, BTE, CQ, NPT, RMT, ORB, RAT, NAT, AMO, SSID, LB ring, netlink block, and router tiles, connected by request/response paths on virtual channels vc0/vc1.]
Two Gemini ASICs are packaged on a pin-compatible mezzanine card
Topology is a 3-D torus
Each lane of the torus is composed of 4 Gemini router "tiles"
Systems with SeaStar interconnects can be upgraded by swapping this card
100% of the 48 router tiles on each Gemini chip are used
Like SeaStar, Gemini has a DMA offload engine allowing large transfers to proceed asynchronously
Gemini provides low-overhead OS-bypass features for short transfers: MPI latency is targeted at ~1 us, and the NIC provides for many millions of MPI messages per second
"Hybrid" programming is not a requirement for performance
RDMA provides a much improved one-sided communication mechanism
AMOs provide a faster synchronization method for barriers
Gemini supports adaptive routing, which reduces problems with network hot spots and allows MPI to survive link failures
Globally addressable memory provides efficient support for UPC, Co-Array Fortran, SHMEM and Global Arrays; the Cray Programming Environment will target this capability directly
Pipelined global loads and stores allow for fast irregular communication patterns
Atomic memory operations provide the fast synchronization needed for one-sided communication models
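For flavor, here is a tiny one-sided SHMEM sketch of the kind of communication this globally addressable memory supports (classic SHMEM spellings start_pes/_my_pe/_num_pes are used here; newer OpenSHMEM spells these shmem_init/shmem_my_pe/shmem_n_pes; everything else is illustrative):

#include <shmem.h>
#include <stdio.h>

#define N 8
static long src[N], dst[N];          /* static storage is symmetric on every PE */

int main(void)
{
    start_pes(0);                    /* classic SHMEM initialization */
    int me   = _my_pe();
    int npes = _num_pes();

    for (int i = 0; i < N; i++)
        src[i] = me * 100 + i;

    /* One-sided put: write my src[] into dst[] on the next PE;
       the target PE makes no receive call. */
    shmem_long_put(dst, src, N, (me + 1) % npes);
    shmem_barrier_all();

    printf("PE %d received dst[0]=%ld\n", me, dst[0]);
    return 0;
}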
Gemini will represent a large improvement over SeaStar in terms of reliability and serviceability
Adaptive routing: multiple paths to the same destination
Allows mapping around bad links without rebooting
Supports warm-swap of blades
Prevents hot spots
Reliable transport of messages
Packet-level CRC carried from start to finish
Large blocks of memory protected by ECC
Can better handle failures on the HT link: discards packets instead of putting backpressure into the network
Supports end-to-end reliable communication (used by MPI)
Improved error reporting and handling
The low-overhead error reporting allows the programming model to replay failed transactions
Performance counters allow tracking of app-specific packets
[Diagram: cabinet airflow, high-velocity and low-velocity airflow paths through the evaporators.]
The hot air stream passes through the evaporator and rejects heat to R134a via liquid-vapor phase change (evaporation).
R134a absorbs energy only in the presence of heated air.
Phase change is 10x more efficient than pure water cooling.
Liquid in, liquid/vapor mixture out; cool air is released into the computer room.
[Diagram: R134a piping, inlet and exit evaporators.]
Lustre
32 MB per OST (32 MB - 5 GB) and 32 MB transfer size: unable to take advantage of file system parallelism; access to multiple disks adds overhead which hurts performance.
[Chart: Single Writer Write Performance, Write (MB/s) vs. stripe count (1-160), for 1 MB and 32 MB stripe sizes.]
Lustre
[Chart: Single Writer Transfer vs. Stripe Size, Write (MB/s) vs. stripe size (1-128 MB), for 32 MB, 8 MB, and 1 MB transfer sizes.]
Single OST, 256 MB file size: performance can be limited by the process (transfer size) or by the file system (stripe size).
Use the lfs command, libLUT, or MPI-IO hints to adjust your stripe count and possibly stripe size
lfs setstripe -c -1 -s 4M <file or directory> (160 OSTs, 4 MB stripe)
lfs setstripe -c 1 -s 16M <file or directory> (1 OST, 16 MB stripe)
export MPICH_MPIIO_HINTS='*: striping_factor=160'
Files inherit striping information from the parent directory; this cannot be changed once the file is written
Set the striping before copying in files
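The MPI-IO hints mentioned above can also be set from inside the program when the file is created. A hedged C sketch (striping_factor/striping_unit are the usual ROMIO hint names for Lustre; the values here are purely illustrative):

#include <mpi.h>

/* Open (create) a file with the requested Lustre striping via MPI-IO hints.
   Striping hints only take effect at file creation time. */
MPI_File open_striped(const char *path, MPI_Comm comm)
{
    MPI_Info info;
    MPI_File fh;

    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "160");      /* stripe count */
    MPI_Info_set(info, "striping_unit",   "4194304");  /* 4 MB stripe size */

    MPI_File_open(comm, (char *)path,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    MPI_Info_free(&info);
    return fh;
}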
PGI Compiler
Cray Compiler Environment
Cray Scientific Libraries
Cray XT/XE supercomputers come with compiler wrappers to simplify building parallel applications (similar to mpicc/mpif90)
Fortran Compiler: ftn
C Compiler: cc
C++ Compiler: CC
Using these wrappers ensures that your code is built for the compute nodes and linked against important libraries
Cray MPT (MPI, Shmem, etc.)
Cray LibSci (BLAS, LAPACK, etc.)
…
The underlying compiler is chosen via the PrgEnv-* modules; do not call the PGI, Cray, etc. compilers directly.
Always load the appropriate xtpe-<arch> module for your machine
Enables proper compiler target
Links optimized math libraries
Traditional (scalar) optimizations are controlled via -O# compiler flags
Default: -O2
More aggressive optimizations (including vectorization) are enabled with the -fast or -fastsse metaflags
These translate to: -O2 -Munroll=c:1 -Mnoframe -Mlre -Mautoinline
-Mvect=sse -Mscalarsse -Mcache_align -Mflushz -Mpre
Interprocedural analysis allows the compiler to perform whole-program optimizations. This is enabled with -Mipa=fast
See man pgf90, man pgcc, or man pgCC for more information about compiler options.
Compiler feedback is enabled with -Minfo and -Mneginfo
This can provide valuable information about what optimizations were or were not done and why.
To debug an optimized code, the -gopt flag will insert debugging information without disabling optimizations
It’s possible to disable optimizations included with -fast if you believe one is causing problems
For example: -fast -Mnolre enables -fast and then disables loop redundant optimizations
To get more information about any compiler flag, add -help with the flag in question
pgf90 -help -fast will give more information about the -fast flag
OpenMP is enabled with the -mp flag
Some compiler options may affect both performance and accuracy. Lower accuracy often means higher performance, but the options below can also be used to enforce stricter accuracy.
-Kieee: All FP math strictly conforms to IEEE 754 (off by default)
-Ktrap: Turns on processor trapping of FP exceptions
-Mdaz: Treat all denormalized numbers as zero
-Mflushz: Set SSE to flush-to-zero (on with -fast)
-Mfprelaxed: Allow the compiler to use relaxed (reduced) precision to speed up some floating point optimizations
Some other compilers turn this on by default; PGI chooses to favor accuracy over speed by default.
Cray has a long tradition of high performance compilers on Cray platforms (Traditional vector, T3E, X1, X2)
Vectorization
Parallelization
Code transformation
More…
Investigated leveraging an open source compiler called LLVM
First release December 2008
[Diagram: Cray Compiling Environment architecture, Fortran front end and C/C++ front end feed interprocedural analysis, optimization and parallelization, then an X86 or Cray X2 code generator produces the object file.]
The C and C++ front end is supplied by the Edison Design Group, with Cray-developed code for extensions and interface support.
X86 code generation comes from open-source LLVM, with additional Cray-developed optimizations and interface support.
The remainder is Cray Inc. compiler technology.
Standard-conforming languages and programming models: Fortran 2003, UPC & Co-Array Fortran
Fully optimized and integrated into the compiler
No preprocessor involved
Target the network appropriately:
GASNet with Portals
DMAPP with Gemini & Aries
Ability and motivation to provide high-quality support for custom Cray network hardware
Cray technology focused on scientific applications
Takes advantage of Cray's extensive knowledge of automatic vectorization
Takes advantage of Cray's extensive knowledge of automatic shared memory parallelization
Supplements, rather than replaces, the available compiler choices
Make sure it is available
module avail PrgEnv-cray
To access the Cray compiler
module load PrgEnv-cray
To target the various chips
module load xtpe-[barcelona,shanghai,istanbul]
Once you have loaded the module, "cc" and "ftn" are the Cray compilers
Recommend just using the default options
Use -rm (Fortran) and -hlist=m (C) to find out what happened
man crayftn
Focus:
Excellent vectorization: vectorizes more loops than other compilers
OpenMP 3.0: task and nesting
PGAS: functional UPC and CAF available today
C++ support
Automatic parallelization: modernized version of the Cray X1 streaming capability; interacts with OMP directives
Cache optimizations: automatic blocking, automatic management of what stays in cache, prefetching, interchange, fusion, and much more
Loop-based optimizations: vectorization, OpenMP, autothreading, interchange, pattern matching, cache blocking / non-temporal / prefetching
Fortran 2003 standard; working on 2008
PGAS (UPC and Co-Array Fortran): some performance optimizations available in 7.1
Optimization feedback: Loopmark
Cray compiler supports a full and growing set of directives and pragmas
!dir$ concurrent
!dir$ ivdep
!dir$ interchange
!dir$ unroll
!dir$ loop_info [max_trips] [cache_na] ... Many more
!dir$ blockable
man directives
man loop_info
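In C the equivalent hints are spelled as pragmas. A small sketch, assuming the usual Cray C spelling #pragma _CRI (consult man directives for the authoritative list); the function and its arguments are made up for illustration:

void scale_gather(double *restrict y, const double *restrict x,
                  const int *idx, double a, int n)
{
    /* Assert there is no loop-carried dependence through the gathered
       subscripts, so the compiler is free to vectorize this loop. */
#pragma _CRI ivdep
    for (int i = 0; i < n; i++)
        y[idx[i]] += a * x[i];
}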
The compiler can generate a <filename>.lst file containing an annotated listing of your source code, with letters indicating the important optimizations performed
%%% L o o p m a r k   L e g e n d %%%
Primary Loop Type:
A - Pattern matched
C - Collapsed
D - Deleted
E - Cloned
I - Inlined
M - Multithreaded
P - Parallel/Tasked
V - Vectorized
W - Unwound
Modifiers:
a - vector atomic memory operation
b - blocked
f - fused
i - interchanged
m - streamed but not partitioned
p - conditional, partial and/or computed
r - unrolled
s - shortloop
t - array syntax temp used
w - unwound
• ftn -rm … or cc -hlist=m …
29. b-------< do i3=2,n3-1
30. b b-----< do i2=2,n2-1
31. b b Vr--< do i1=1,n1
32. b b Vr u1(i1) = u(i1,i2-1,i3) + u(i1,i2+1,i3)
33. b b Vr > + u(i1,i2,i3-1) + u(i1,i2,i3+1)
34. b b Vr u2(i1) = u(i1,i2-1,i3-1) + u(i1,i2+1,i3-1)
35. b b Vr > + u(i1,i2-1,i3+1) + u(i1,i2+1,i3+1)
36. b b Vr--> enddo
37. b b Vr--< do i1=2,n1-1
38. b b Vr r(i1,i2,i3) = v(i1,i2,i3)
39. b b Vr > - a(0) * u(i1,i2,i3)
40. b b Vr > - a(2) * ( u2(i1) + u1(i1-1) + u1(i1+1) )
41. b b Vr > - a(3) * ( u2(i1-1) + u2(i1+1) )
42. b b Vr--> enddo
43. b b-----> enddo
44. b-------> enddo
ftn-6289 ftn: VECTOR File = resid.f, Line = 29
A loop starting at line 29 was not vectorized because a recurrence was found on "U1" between lines 32 and 38.
ftn-6049 ftn: SCALAR File = resid.f, Line = 29
A loop starting at line 29 was blocked with block size 4.
ftn-6289 ftn: VECTOR File = resid.f, Line = 30
A loop starting at line 30 was not vectorized because a recurrence was found on "U1" between lines 32 and 38.
ftn-6049 ftn: SCALAR File = resid.f, Line = 30
A loop starting at line 30 was blocked with block size 4.
ftn-6005 ftn: SCALAR File = resid.f, Line = 31
A loop starting at line 31 was unrolled 4 times.
ftn-6204 ftn: VECTOR File = resid.f, Line = 31
A loop starting at line 31 was vectorized.
ftn-6005 ftn: SCALAR File = resid.f, Line = 37
A loop starting at line 37 was unrolled 4 times.
ftn-6204 ftn: VECTOR File = resid.f, Line = 37
A loop starting at line 37 was vectorized.
-hbyteswapio
Link time option
Applies to all unformatted Fortran I/O
Assign command
With the PrgEnv-cray module loaded do this:
setenv FILENV assign.txt
assign -N swap_endian g:su
assign -N swap_endian g:du
Can use assign to be more precise
OpenMP is ON by default
Optimizations controlled by -Othread#
To shut it off use -Othread0, -xomp, or -hnoomp
Autothreading is NOT on by default; use -hautothread to turn it on
Modernized version of the Cray X1 streaming capability
Interacts with OMP directives
If you do not want to use OpenMP and have OMP directives in the code, make sure to do a run with OpenMP shut off at compile time
Traditional model
Tuned general purpose codes
Only good for dense
Not problem sensitive
Not architecture sensitive
Goal of scientific libraries: improve productivity at optimal performance
Cray uses four concentrations to achieve this:
Standardization: use standard or "de facto" standard interfaces whenever available
Hand tuning: use extensive knowledge of the target processor and network to optimize common code patterns
Auto-tuning: automate code generation and a huge number of empirical performance evaluations to configure software to the target platforms
Adaptive libraries: make runtime decisions to choose the best kernel/library/routine
Three separate classes of standardization, each with a corresponding definition of productivity:
1. Standard interfaces (e.g., dense linear algebra): bend over backwards to keep everything the same despite increases in machine complexity; innovate "behind the scenes". Productivity -> innovation to keep things simple.
2. Adoption of near-standard interfaces (e.g., sparse kernels): assume near-standards and promote those; out-mode alternatives; innovate "behind the scenes". Productivity -> innovation in the simplest areas (requires the same innovation as #1 also).
3. Simplification of non-standard interfaces (e.g., FFT): productivity -> innovation to make things simpler than they are.
Algorithmic tuning (LAPACK, ScaLAPACK): increased performance by exploiting algorithmic improvements such as sub-blocking and new algorithms
Kernel tuning (BLAS, FFT): improve the numerical kernel performance in assembly language
Parallel tuning (ScaLAPACK, P-CRAFFT): exploit Cray's custom network interfaces and MPT
Dense: BLAS, LAPACK, ScaLAPACK, IRT
Sparse: CASK, PETSc, Trilinos
FFT: CRAFFT, FFTW, P-CRAFFT
IRT - Iterative Refinement Toolkit
CASK - Cray Adaptive Sparse Kernels
CRAFFT - Cray Adaptive FFT
Serial and parallel versions of sparse iterative linear solvers
Suites of iterative solvers: CG, GMRES, BiCG, QMR, etc.
Suites of preconditioning methods: IC, ILU, diagonal block (ILU/IC), Additive Schwarz, Jacobi, SOR
Support for the block sparse matrix data format for better performance
Interface to external packages (ScaLAPACK, SuperLU_DIST)
Fortran and C support
Newton-type nonlinear solvers
Large user community: DoE labs, PSC, CSCS, CSC, ERDC, AWE and more
http://www-unix.mcs.anl.gov/petsc/petsc-as
Cray provides state-of-the-art scientific computing packages to strengthen the capability of PETSc
Hypre: scalable parallel preconditioners
AMG (very scalable and efficient for a specific class of problems)
2 different ILU (general purpose)
Sparse approximate inverse (general purpose)
ParMetis: parallel graph partitioning package
MUMPS: parallel multifrontal sparse direct solver
SuperLU: sequential version of SuperLU_DIST
To use Cray-PETSc, load the appropriate module:
module load petsc
module load petsc-complex
(no need to load a compiler-specific module)
Treat the Cray distribution as your local PETSc installation
The Trilinos Project http://trilinos.sandia.gov/
“an effort to develop algorithms and enabling technologies within an object-oriented software framework for the solution of large-scale, complex multi-physics engineering and scientific problems”
A unique design feature of Trilinos is its focus on packages.
Very large user-base and growing rapidly. Important to DOE.
Cray’s optimized Trilinos released on January 21
Includes 50+ Trilinos packages
Optimized via CASK
Any code that uses Epetra objects can access the optimizations
Usage :
module load trilinos
CASK is a product developed at Cray using the Cray Auto-tuning Framework (Cray ATF)
The CASK concept:
Analyze the matrix at minimal cost
Categorize the matrix against internal classes
Based on offline experience, find the best CASK code for that particular matrix
Apply the previously assigned "best" compiler flags to the CASK code
Assign the best CASK kernel and perform Ax
CASK silently sits beneath PETSc on Cray systems
Trilinos support coming soon
Released with PETSc 3.0 in February 2009
Generic and blocked CSR formats
[Diagram: software stack. The large-scale application (highly portable, user controlled) sits on PETSc / Trilinos / Hypre (highly portable, user controlled; available on all systems), which sit on CASK (XT4 & XT5 specific and tuned; invisible to the user; Cray only).]
[Chart: speedup of parallel SpMV on 8 cores for 60 different matrices; speedup (1.0-1.4) vs. matrix ID#.]
[Charts: performance of CASK vs. PETSc, GFlops vs. number of cores (0-1024), N = 65,536 to 67,108,864; SpMV (MatMult-CASK vs. MatMult-PETSc) and Block Jacobi preconditioning (BlockJacobi-IC(0)-CASK vs. BlockJacobi-IC(0)-PETSc).]
[Charts: CASK vs. original Trilinos, MFlops by matrix name and MFlops vs. number of vectors (1-8); geometric mean over 80 sparse matrix instances from the U. of Florida collection.]
In FFTs, the problems are:
Which library to choose?
How to use complicated interfaces (e.g., FFTW)
Standard FFT practice:
Do a plan stage: deduce machine and system information and run micro-kernels; select the best FFT strategy
Do an execute
Our system knowledge can remove some of this cost!
CRAFFT is designed with simple-to-use interfaces
Planning and execution stages can be combined into one function call
Underneath the interfaces, CRAFFT calls the appropriate FFT kernel
CRAFFT provides both offline and online tuning
Offline tuning:
Which FFT kernel to use
Pre-computed PLANs for common-sized FFTs
No expensive plan stages
Online tuning is performed as necessary at runtime as well
At runtime, CRAFFT adaptively selects the best FFT kernel to use based on both offline and online testing (e.g., FFTW, custom FFT)
Transform size:  128x128   256x256   512x512
FFTW plan:       74        312       2758
FFTW exec:       0.105     0.97      9.7
CRAFFT plan:     0.00037   0.0009    0.00005
CRAFFT exec:     0.139     1.2       11.4
1. Load module fftw/3.2.0 or higher.
2. Add a Fortran statement "use crafft"
3. call crafft_init()
4. Call the crafft transform using none, some, or all of the optional arguments
In-place, implicit memory management:
call crafft_z2z3d(n1,n2,n3,input,ld_in,ld_in2,isign)
In-place, explicit memory management:
call crafft_z2z3d(n1,n2,n3,input,ld_in,ld_in2,isign,work)
Out-of-place, explicit memory management:
call crafft_z2z3d(n1,n2,n3,input,ld_in,ld_in2,output,ld_out,ld_out2,isign,work)
Note: the user can also control the planning strategy of CRAFFT using the CRAFFT_PLANNING environment variable and the do_exe optional argument; please see the intro_crafft man page.
As of December 2009, CRAFFT includes distributed parallel transforms
Uses the CRAFFT interface prefixed by “p”, with optional arguments
Can provide performance improvement over FFTW 2.1.5
Currently implemented
complex-complex
Real-complex and complex-real
3-d and 2-d
In-place and out-of-place
Upcoming
C language support for serial and parallel
1. Add "use crafft" to the Fortran code
2. Initialize CRAFFT using crafft_init
3. Assume MPI is initialized and data is distributed (see the manpage)
4. Call crafft, e.g.:
2-d complex-complex, in-place, internal memory management:
call crafft_pz2z2d(n1,n2,input,isign,flag,comm)
2-d complex-complex, in-place with no internal memory:
call crafft_pz2z2d(n1,n2,input,isign,flag,comm,work)
2-d complex-complex, out-of-place, internal memory management:
call crafft_pz2z2d(n1,n2,input,output,isign,flag,comm)
2-d complex-complex, out-of-place, no internal memory:
call crafft_pz2z2d(n1,n2,input,output,isign,flag,comm,work)
Each routine above has a manpage. Also see the 3-d equivalent:
man crafft_pz2z3d
[Chart: 2D FFT (N x N, transposed) on 128 cores; MFlops vs. size N from 128 to 65536, parallel CRAFFT (pcrafft) vs. FFTW.]
Solves linear systems in single precision while obtaining solutions accurate to double precision, for well-conditioned problems
Serial and parallel versions of LU, Cholesky, and QR
Two usage methods:
IRT benchmark routines: uses IRT "under the covers" without changing your code; simply set an environment variable; useful when you cannot alter source code
Advanced IRT API: if greater control of the iterative refinement process is required; allows condition number estimation, return of error bounds, minimization of either forward or backward error, "fall back" to full precision if the condition number is too high; the max number of iterations can be altered by users
"High Power Electromagnetic Wave Heating in the ITER Burning Plasma"
RF heating in a tokamak: Maxwell-Boltzmann equations, FFT, dense linear system, calculation of the quasi-linear operator
Courtesy of Richard Barrett
[Chart: performance relative to theoretical peak.]
Decide if you want to use the advanced API or the benchmark API
Benchmark API: setenv IRT_USE_SOLVERS 1
Advanced API:
1. Locate the factor and solve in your code (LAPACK or ScaLAPACK)
2. Replace factor and solve with a call to the IRT routine
e.g. dgesv -> irt_lu_real_serial
e.g. pzgesv -> irt_lu_complex_parallel
e.g. pzposv -> irt_po_complex_parallel
3. Set advanced arguments
Forward error convergence for the most accurate solution
Condition number estimate
"Fall back" to full precision if the condition number is too high
LibSci 10.4.2 (February 18th, 2010)
OpenMP-aware LibSci
Allows calling of BLAS inside or outside a parallel region (see the sketch after this list)
A single library is supported: no separate multi-threaded and single-threaded libraries (-lsci and -lsci_mp)
Performance not compromised
(there were some usage restrictions with this version)
LibSci 10.4.3 (April 2010)
Parallel CRAFFT improvements
Fixes the usage restrictions of 10.4.2
OMP_NUM_THREADS required (not GOTO_NUM_THREADS)
Upcoming:
PETSc 3.1.0 (May 20)
Trilinos 10.2 (May 20)
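As a sketch of what "calling BLAS inside a parallel region" looks like in practice (illustrative C, not from the slides; dgemm_ is the standard Fortran BLAS symbol, so scalars are passed by reference and matrices are column-major):

#include <omp.h>

/* Fortran BLAS prototype. */
void dgemm_(const char *ta, const char *tb, const int *m, const int *n,
            const int *k, const double *alpha, const double *a, const int *lda,
            const double *b, const int *ldb, const double *beta,
            double *c, const int *ldc);

/* Each OpenMP thread multiplies its own block; an OpenMP-aware LibSci can
   safely run these calls single-threaded inside the parallel region. */
void block_multiply(int nblocks, int n, double **A, double **B, double **C)
{
    const double one = 1.0, zero = 0.0;
    #pragma omp parallel for
    for (int b = 0; b < nblocks; b++)
        dgemm_("N", "N", &n, &n, &n, &one, A[b], &n, B[b], &n, &zero, C[b], &n);
}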
CrayPAT
Assist the user with application performance analysis and optimization
Help the user identify important and meaningful information from potentially massive data sets
Help the user identify problem areas instead of just reporting data
Bring optimization knowledge to a wider set of users
Focus on ease of use and intuitive user interfaces: automatic program instrumentation, automatic analysis
Target scalability issues in all areas of tool development: data management (storage, movement, presentation)
Supports traditional post-mortem performance analysis
Automatic identification of performance problems
Indication of causes of problems
Suggestions of modifications for performance improvement
CrayPat
pat_build: automatic instrumentation (no source code changes needed)
run-time library for measurements (transparent to the user)
pat_report for performance analysis reports
pat_help: online help utility
Cray Apprentice2
Graphical performance analysis and visualization tool
CrayPat
Instrumentation of optimized code
No source code modification required
Data collection transparent to the user
Text-based performance reports
Derived metrics
Performance analysis
Cray Apprentice2
Performance data visualization tool
Call tree view
Source code mappings
When performance measurement is triggered:
External agent (asynchronous): sampling
Timer interrupt
Hardware counter overflow
Internal agent (synchronous): code instrumentation
Event based
Automatic or manual instrumentation
How performance data is recorded:
Profile ::= summation of events over time; run-time summarization (functions, call sites, loops, …)
Trace file ::= sequence of events over time
Millions of lines of code: automatic profiling analysis
Identifies top time-consuming routines
Automatically creates an instrumentation template customized to your application
Lots of processes/threads: load imbalance analysis
Identifies computational code regions and synchronization calls that could benefit most from load balance optimization
Estimates savings if the corresponding section of code were balanced
Long-running applications: detection of outliers
Important performance statistics:
Top time consuming routines
Load balance across computing resources
Communication overhead
Cache utilization
FLOPS
Vectorization (SSE instructions)
Ratio of computation versus communication
No source code or makefile modification required
Automatic instrumentation at group (function) level
Groups: mpi, io, heap, math SW, …
Performs link-time instrumentation
Requires object files
Instruments optimized code
Generates stand-alone instrumented program
Preserves original binary
Supports sample-based and event-based instrumentation
Analyze the performance data and direct the user to meaningful information
Simplifies the procedure to instrument and collect performance data for novice users
Based on a two phase mechanism
1. Automatically detects the most time consuming functions in the application and feeds this information back to the tool for further (and focused) data collection
2. Provides performance information on the most significant parts of the application
Performs data conversion
Combines information from binary with raw performance data
Performs analysis on data
Generates text report of performance results
Formats data for input into Cray Apprentice2
Craypat / Cray Apprentice2 5.0 released September 10, 2009
New internal data format
FAQ
Grid placement support
Better caller information (ETC group in pat_report)
Support larger numbers of processors
Client/server version of Cray Apprentice2
Panel help in Cray Apprentice2
Access performance tools software
% module load xt-craypat apprentice2
Build application keeping .o files (CCE: -h keepfiles)
% make clean
% make
Instrument application for automatic profiling analysis; you should get an instrumented program a.out+pat
% pat_build -O apa a.out
Run application to get the top time-consuming routines; you should get a performance file ("<sdatafile>.xf") or multiple files in a directory <sdatadir>
% aprun … a.out+pat (or qsub <pat script>)
Generate report and .apa instrumentation file
% pat_report -o my_sampling_report [<sdatafile>.xf | <sdatadir>]
Inspect the .apa file and sampling report
Verify whether additional instrumentation is needed
# You can edit this file, if desired, and use it
# to reinstrument the program for tracing like this:
#
# pat_build -O mhd3d.Oapa.x+4125-401sdt.apa
#
# These suggested trace options are based on data from:
#
# /home/crayadm/ldr/mhd3d/run/mhd3d.Oapa.x+4125-401sdt.ap2, /home/crayadm/ldr/mhd3d/run/mhd3d.Oapa.x+4125-401sdt.xf
# ----------------------------------------------------------------------
# HWPC group to collect by default.
-Drtenv=PAT_RT_HWPC=1 # Summary with instructions metrics.
# ----------------------------------------------------------------------
# Libraries to trace.
-g mpi
# ----------------------------------------------------------------------
# User-defined functions to trace, sorted by % of samples.
# Limited to top 200. A function is commented out if it has < 1%
# of samples, or if a cumulative threshold of 90% has been reached,
# or if it has size < 200 bytes.
# Note: -u should NOT be specified as an additional option.
# 43.37% 99659 bytes
-T mlwxyz_
# 16.09% 17615 bytes
-T half_
# 6.82% 6846 bytes
-T artv_
# 1.29% 5352 bytes
-T currenh_
# 1.03% 25294 bytes
-T bndbo_
# Functions below this point account for less than 10% of samples.
# 1.03% 31240 bytes
# -T bndto_
. . .
# ----------------------------------------------------------------------
-o mhd3d.x+apa # New instrumented program.
/work/crayadm/ldr/mhd3d/mhd3d.x # Original program.
biolib Cray Bioinformatics library routines
blacs Basic Linear Algebra communication subprograms
blas Basic Linear Algebra subprograms
caf Co-Array Fortran (Cray X2 systems only)
fftw Fast Fourier Transform library (64-bit only)
hdf5 manages extremely large and complex data collections
heap dynamic heap
io includes stdio and sysio groups
lapack Linear Algebra Package
lustre Lustre File System
math ANSI math
mpi MPI
netcdf network common data form (manages array-oriented scientific data)
omp OpenMP API (not supported on Catamount)
omp-rtl OpenMP runtime library (not supported on Catamount)
portals Lightweight message passing API
pthreads POSIX threads (not supported on Catamount)
scalapack Scalable LAPACK
shmem SHMEM
stdio all library functions that accept or return the FILE* construct
sysio I/O system calls
system system calls
upc Unified Parallel C (Cray X2 systems only)
0  Summary with instruction metrics
1  Summary with TLB metrics
2  L1 and L2 metrics
3  Bandwidth information
4  Hypertransport information
5  Floating point mix
6  Cycles stalled, resources idle
7  Cycles stalled, resources full
8  Instructions and branches
9  Instruction cache
10 Cache hierarchy
11 Floating point operations mix (2)
12 Floating point operations mix (vectorization)
13 Floating point operations mix (SP)
14 Floating point operations mix (DP)
15 L3 (socket-level)
16 L3 (core-level reads)
17 L3 (core-level misses)
18 L3 (core-level fills caused by L2 evictions)
19 Prefetches
Regions, useful to break up long routines
int PAT_region_begin (int id, const char *label)
int PAT_region_end (int id)
Disable/Enable Profiling, useful for excluding initialization
int PAT_record (int state)
Flush buffer, useful when program isn’t exiting cleanly
int PAT_flush_buffer (void)
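A minimal usage sketch of these calls (assumptions: the API header is named pat_api.h and a region id plus label are all that is needed; the work inside the region is a stand-in):

#include <pat_api.h>   /* assumed CrayPat API header name */

static double acc = 0.0;
static void solver_step(void) { for (int i = 0; i < 1000; i++) acc += i * 1e-9; }

int main(void)
{
    /* Label the expensive phase so it appears as its own entry in pat_report. */
    PAT_region_begin(1, "solver");
    for (int it = 0; it < 100; it++)
        solver_step();
    PAT_region_end(1);

    /* Push out buffered measurement data in case the program exits abnormally. */
    PAT_flush_buffer();
    return 0;
}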
Instrument application for further analysis (a.out+apa)
% pat_build -O <apafile>.apa
Run application
% aprun … a.out+apa (or qsub <apa script>)
Generate text report and visualization file (.ap2)
% pat_report -o my_text_report.txt [<datafile>.xf | <datadir>]
View report in text and/or with Cray Apprentice2
% app2 <datafile>.ap2
MUST run on Lustre ( /work/… , /lus/…, /scratch/…, etc.)
Number of files used to store raw data
1 file created for program with 1 – 256 processes
√n files created for program with 257 – n processes
Ability to customize with PAT_RT_EXPFILE_MAX
Full trace files show transient events but are too large
Current run-time summarization misses transient events
Plan to add ability to record:
Top N peak values (N small)
Approximate std dev over time
For time, memory traffic, etc.
During tracing and sampling
Call graph profile
Communication statistics
Time-line view
Communication
I/O
Activity view
Pair-wise communication statistics
Text reports
Source code mapping
Cray Apprentice2 is targeted to help identify and correct:
Load imbalance
Excessive communication
Network contention
Excessive serialization
I/O Problems
Switch Overview display
[Screenshot: distribution display showing min, avg, and max values with -1/+1 standard deviation marks.]
[Screenshot: Cray Apprentice2 call-graph profile.]
Function List
Load balance overview: height = max time, middle bar = average time, lower bar = min time; yellow represents imbalance time
Zoom: height = exclusive time, width = inclusive time
DUH button: provides hints for performance tuning
Filtered nodes or sub-trees
[Screenshot: call-graph view with the function list off.]
Right mouse click on a node: node menu (e.g., hide/unhide children)
Sort options: % Time, Time, Imbalance %, Imbalance time
Right mouse click: view menu (e.g., Filter)
[Screenshot: display showing min, avg, and max values with -1/+1 standard deviation marks.]
Cray Apprentice2 panel help
pat_help – interactive help on the Cray Performance toolset
FAQ available through pat_help
intro_craypat(1)
Introduces the craypat performance tool
pat_build
Instrument a program for performance analysis
pat_help
Interactive online help utility
pat_report
Generate performance report in both text and for use with GUI
hwpc(3)
describes predefined hardware performance counter groups
papi_counters(5)
Lists PAPI event counters
Use papi_avail or papi_native_avail utilities to get list of events when running on a specific architecture
pat_report: Help for -O option:
Available option values are in left column, a prefix can be specified:
ct -O calltree
defaults Tables that would appear by default.
heap -O heap_program,heap_hiwater,heap_leaks
io -O read_stats,write_stats
lb -O load_balance
load_balance -O lb_program,lb_group,lb_function
mpi -O mpi_callers
---
callers Profile by Function and Callers
callers+hwpc Profile by Function and Callers
callers+src Profile by Function and Callers, with Line Numbers
callers+src+hwpc Profile by Function and Callers, with Line Numbers
calltree Function Calltree View
calltree+hwpc Function Calltree View
calltree+src Calltree View with Callsite Line Numbers
calltree+src+hwpc Calltree View with Callsite Line Numbers
...
Interactive by default, or use a trailing '.' to just print a topic
New FAQ in craypat 5.0.0
Has counter and counter group information
% pat_help counters amd_fam10h groups .
The top level CrayPat/X help topics are listed below. A good place to start is:
overview
If a topic has subtopics, they are displayed under the heading "Additional topics", as below. To view a subtopic, you need only enter as many initial letters as required to distinguish it from other items in the list. To see a table of contents including subtopics of those subtopics, etc., enter:
toc
To produce the full text corresponding to the table of contents, specify "all", but preferably in a non-interactive invocation:
pat_help all . > all_pat_help
pat_help report all . > all_report_help
Additional topics:
API          execute
balance      experiment
build        first_example
counters     overview
demos        report
environment  run
pat_help (.=quit ,=back ^=up /=top ~=search) =>
CPU Optimizations
Optimizing Communication
I/O Best Practices
55. 1 ii = 0
56. 1 2-----------< do b = abmin, abmax
57. 1 2 3---------< do j=ijmin, ijmax
58. 1 2 3 ii = ii+1
59. 1 2 3 jj = 0
60. 1 2 3 4-------< do a = abmin, abmax
61. 1 2 3 4 r8----< do i = ijmin, ijmax
62. 1 2 3 4 r8 jj = jj+1
63. 1 2 3 4 r8 f5d(a,b,i,j) = f5d(a,b,i,j)
+ tmat7(ii,jj)
64. 1 2 3 4 r8 f5d(b,a,i,j) = f5d(b,a,i,j)
- tmat7(ii,jj)
65. 1 2 3 4 r8 f5d(a,b,j,i) = f5d(a,b,j,i)
- tmat7(ii,jj)
66. 1 2 3 4 r8 f5d(b,a,j,i) = f5d(b,a,j,i)
+ tmat7(ii,jj)
67. 1 2 3 4 r8----> end do
68. 1 2 3 4-------> end do
69. 1 2 3---------> end do
70. 1 2-----------> end do
The inner-most loop strides on a slow dimension of each array.
The best the compiler can do is unroll.
Little to no cache reuse.
Poor loop order results in poor striding.
USER / #1.Original Loops
-----------------------------------------------------------------
Time% 55.0%
Time 13.938244 secs
Imb.Time 0.075369 secs
Imb.Time% 0.6%
Calls 0.1 /sec 1.0 calls
DATA_CACHE_REFILLS:
L2_MODIFIED:L2_OWNED:
L2_EXCLUSIVE:L2_SHARED 11.858M/sec 165279602 fills
DATA_CACHE_REFILLS_FROM_SYSTEM:
ALL 11.931M/sec 166291054 fills
PAPI_L1_DCM 23.499M/sec 327533338 misses
PAPI_L1_DCA 34.635M/sec 482751044 refs
User time (approx) 13.938 secs 36239439807 cycles
100.0%Time
Average Time per Call 13.938244 sec
CrayPat Overhead : Time 0.0%
D1 cache hit,miss ratios 32.2% hits 67.8% misses
D2 cache hit,miss ratio 49.8% hits 50.2% misses
D1+D2 cache hit,miss ratio 66.0% hits 34.0% misses
For every L1 cache hit, there are two misses.
Overall, only 2/3 of all references were found in level 1 or 2 cache.
Poor loop order results in poor cache reuse.
75. 1 2-----------< do i = ijmin, ijmax
76. 1 2 jj = 0
77. 1 2 3---------< do a = abmin, abmax
78. 1 2 3 4-------< do j=ijmin, ijmax
79. 1 2 3 4 jj = jj+1
80. 1 2 3 4 ii = 0
81. 1 2 3 4 Vcr2--< do b = abmin, abmax
82. 1 2 3 4 Vcr2 ii = ii+1
83. 1 2 3 4 Vcr2 f5d(a,b,i,j) = f5d(a,b,i,j)
+ tmat7(ii,jj)
84. 1 2 3 4 Vcr2 f5d(b,a,i,j) = f5d(b,a,i,j)
- tmat7(ii,jj)
85. 1 2 3 4 Vcr2 f5d(a,b,j,i) = f5d(a,b,j,i)
- tmat7(ii,jj)
86. 1 2 3 4 Vcr2 f5d(b,a,j,i) = f5d(b,a,j,i)
+ tmat7(ii,jj)
87. 1 2 3 4 Vcr2--> end do
88. 1 2 3 4-------> end do
89. 1 2 3---------> end do
90. 1 2-----------> end do
Now the inner-most loop is stride-1 on both arrays.
Memory accesses happen along the cache line, allowing reuse.
The compiler is able to vectorize and better use SSE instructions.
Reordered loop nest.
USER / #2.Reordered Loops
-----------------------------------------------------------------
Time% 31.4%
Time 7.955379 secs
Imb.Time 0.260492 secs
Imb.Time% 3.8%
Calls 0.1 /sec 1.0 calls
DATA_CACHE_REFILLS:
L2_MODIFIED:L2_OWNED:
L2_EXCLUSIVE:L2_SHARED 0.419M/sec 3331289 fills
DATA_CACHE_REFILLS_FROM_SYSTEM:
ALL 15.285M/sec 121598284 fills
PAPI_L1_DCM 13.330M/sec 106046801 misses
PAPI_L1_DCA 66.226M/sec 526855581 refs
User time (approx) 7.955 secs 20684020425 cycles
100.0%Time
Average Time per Call 7.955379 sec
CrayPat Overhead : Time 0.0%
D1 cache hit,miss ratios 79.9% hits 20.1% misses
D2 cache hit,miss ratio 2.7% hits 97.3% misses
D1+D2 cache hit,miss ratio 80.4% hits 19.6% misses
Runtime was cut nearly in half.
Still, some 20% of all references are cache misses.
Improved striding greatly improved cache reuse.
First loop, partially vectorized and unrolled by 4
95. 1 ii = 0
96. 1 2-----------< do j = ijmin, ijmax
97. 1 2 i---------< do b = abmin, abmax
98. 1 2 i ii = ii+1
99. 1 2 i jj = 0
100. 1 2 i i-------< do i = ijmin, ijmax
101. 1 2 i i Vpr4--< do a = abmin, abmax
102. 1 2 i i Vpr4 jj = jj+1
103. 1 2 i i Vpr4 f5d(a,b,i,j) =
f5d(a,b,i,j) + tmat7(ii,jj)
104. 1 2 i i Vpr4 f5d(a,b,j,i) =
f5d(a,b,j,i) - tmat7(ii,jj)
105. 1 2 i i Vpr4--> end do
106. 1 2 i i-------> end do
107. 1 2 i---------> end do
108. 1 2-----------> end do
109. 1 jj = 0
110. 1 2-----------< do i = ijmin, ijmax
111. 1 2 3---------< do a = abmin, abmax
112. 1 2 3 jj = jj+1
113. 1 2 3 ii = 0
114. 1 2 3 4-------< do j = ijmin, ijmax
115. 1 2 3 4 Vr4---< do b = abmin, abmax
116. 1 2 3 4 Vr4 ii = ii+1
117. 1 2 3 4 Vr4 f5d(b,a,i,j) =
f5d(b,a,i,j) - tmat7(ii,jj)
118. 1 2 3 4 Vr4 f5d(b,a,j,i) =
f5d(b,a,i,j) + tmat7(ii,jj)
119. 1 2 3 4 Vr4---> end do
120. 1 2 3 4-------> end do
121. 1 2 3---------> end do
122. 1 2-----------> end do
Second loop, vectorized and unrolled by 4
USER / #3.Fissioned Loops
-----------------------------------------------------------------
Time% 9.8%
Time 2.481636 secs
Imb.Time 0.045475 secs
Imb.Time% 2.1%
Calls 0.4 /sec 1.0 calls
DATA_CACHE_REFILLS:
L2_MODIFIED:L2_OWNED:
L2_EXCLUSIVE:L2_SHARED 1.175M/sec 2916610 fills
DATA_CACHE_REFILLS_FROM_SYSTEM:
ALL 34.109M/sec 84646518 fills
PAPI_L1_DCM 26.424M/sec 65575972 misses
PAPI_L1_DCA 156.705M/sec 388885686 refs
User time (approx) 2.482 secs 6452279320 cycles
100.0%Time
Average Time per Call 2.481636 sec
CrayPat Overhead : Time 0.0%
D1 cache hit,miss ratios 83.1% hits 16.9% misses
D2 cache hit,miss ratio 3.3% hits 96.7% misses
D1+D2 cache hit,miss ratio 83.7% hits 16.3% misses
Runtime further reduced.
The cache hit/miss ratio improved slightly.
The loopmark file points to better vectorization from the fissioned loops.
Fissioning further improved cache reuse and resulted in better vectorization.
Cache blocking is a combination of strip mining and loop interchange, designed to increase data reuse.
Takes advantage of temporal reuse: re-reference array elements already referenced
Good blocking will take advantage of spatial reuse: work with the cache lines!
Many ways to block any given loop nest
Which loops get blocked?
What block size(s) to use?
Analysis can reveal which ways are beneficial
But trial-and-error is probably faster
2D Laplacian
do j = 1, 8
do i = 1, 16
a = u(i-1,j) + u(i+1,j) &
- 4*u(i,j) &
+ u(i,j-1) + u(i,j+1)
end do
end do
Cache structure for this example:
Each line holds 4 array elements
Cache can hold 12 lines of u data
No cache reuse between outer loop iterations
[Diagram: 16 x 8 grid of u with cache-line boundaries and per-line miss counts.]
Unblocked loop: 120 cache misses
Block the inner loop
do IBLOCK = 1, 16, 4
do j = 1, 8
do i = IBLOCK, IBLOCK + 3
a(i,j) = u(i-1,j) + u(i+1,j) &
- 4*u(i,j) &
+ u(i,j-1) + u(i,j+1)
end do
end do
end do
Now we have reuse of the "j+1" data.
[Diagram: blocked traversal of the grid in 4-wide strips with per-line miss counts.]
One-dimensional blocking reduced misses from 120 to 80
Iterate over 4 x 4 blocks
do JBLOCK = 1, 8, 4
do IBLOCK = 1, 16, 4
do j = JBLOCK, JBLOCK + 3
do i = IBLOCK, IBLOCK + 3
a(i,j) = u(i-1,j) + u(i+1,j) &
- 4*u(i,j) &
+ u(i,j-1) + u(i,j+1)
end do
end do
end do
end do
Better use of spatial locality (cache lines).
[Diagram: 4 x 4 blocked traversal of the grid with per-line miss counts.]
Matrix-matrix multiply (GEMM) is the canonical cache-blocking example
Operations can be arranged to create multiple levels of blocking
Block for register
Block for cache (L1, L2, L3)
Block for TLB
No further discussion here. Interested readers can see:
Any book on code optimization (Sun's Techniques for Optimizing Applications: High Performance Computing contains a decent introductory discussion in Chapter 8; insert your favorite book here)
Gunnels, Henry, and van de Geijn. June 2001. High-performance matrix multiplication algorithms for architectures with hierarchical memories. FLAME Working Note #4, TR-2001-22, The University of Texas at Austin, Department of Computer Sciences. (Develops algorithms and cost models for GEMM in hierarchical memories.)
Goto and van de Geijn. 2008. Anatomy of high-performance matrix multiplication. ACM Transactions on Mathematical Software 34, 3 (May), 1-25. (Description of the GotoBLAS DGEMM.)
"I tried cache-blocking my code, but it didn't help."
You're doing it wrong.
Your block size is too small (too much loop overhead).
Your block size is too big (data is falling out of cache).
You're targeting the wrong cache level (?)
You haven't selected the correct subset of loops to block.
The compiler is already blocking that loop.
Prefetching is acting to minimize cache misses.
Computational intensity within the loop nest is very large, making blocking less important.
Multigrid PDE solver
Class D, 64 MPI ranks
Global grid is 1024 × 1024 × 1024
Local grid is 258 × 258 × 258
Two similar loop nests account for >50% of run time
27-point 3D stencil
There is good data reuse along leading dimension, even without blocking
do i3 = 2, 257
do i2 = 2, 257
do i1 = 2, 257
! update u(i1,i2,i3)
! using 27-point stencil
end do
end do
end do
[Diagram: 27-point stencil neighborhood at (i1, i2, i3), covering i1-1/i1/i1+1, i2-1/i2/i2+1, i3-1/i3/i3+1, overlaid on cache lines along i1.]
Block the inner two loops
Creates blocks extending along i3 direction
do I2BLOCK = 2, 257, BS2
do I1BLOCK = 2, 257, BS1
do i3 = 2, 257
do i2 = I2BLOCK, &
min(I2BLOCK+BS2-1, 257)
do i1 = I1BLOCK, &
min(I1BLOCK+BS1-1, 257)
! update u(i1,i2,i3)
! using 27-point stencil
end do
end do
end do
end do
end do
Block size   Mop/s/process
unblocked    531.50
16 × 16      279.89
22 × 22      321.26
28 × 28      358.96
34 × 34      385.33
40 × 40      408.53
46 × 46      443.94
52 × 52      468.58
58 × 58      470.32
64 × 64      512.03
70 × 70      506.92
Block the outer two loops
Preserves spatial locality along i1 direction
do I3BLOCK = 2, 257, BS3
do I2BLOCK = 2, 257, BS2
do i3 = I3BLOCK, &
min(I3BLOCK+BS3-1, 257)
do i2 = I2BLOCK, &
min(I2BLOCK+BS2-1, 257)
do i1 = 2, 257
! update u(i1,i2,i3)
! using 27-point stencil
end do
end do
end do
end do
end do
Block size   Mop/s/process
unblocked    531.50
16 × 16      674.76
22 × 22      680.16
28 × 28      688.64
34 × 34      683.84
40 × 40      698.47
46 × 46      689.14
52 × 52      706.62
58 × 58      692.57
64 × 64      703.40
70 × 70      693.87
70 × 70 693.87
( 53) void mat_mul_daxpy(double *a, double *b, double *c, int rowa,
int cola, int colb)
( 54) {
( 55) int i, j, k; /* loop counters */
( 56) int rowc, colc, rowb; /* sizes not passed as arguments */
( 57) double con; /* constant value */
( 58)
( 59) rowb = cola;
( 60) rowc = rowa;
( 61) colc = colb;
( 62)
( 63) for(i=0;i<rowc;i++) {
( 64) for(k=0;k<cola;k++) {
( 65) con = *(a + i*cola +k);
( 66) for(j=0;j<colc;j++) {
( 67) *(c + i*colc + j) += con * *(b + k*colb + j);
( 68) }
( 69) }
( 70) }
( 71) }
mat_mul_daxpy:
66, Loop not vectorized: data dependency
Loop not vectorized: data dependency
Loop unrolled 4 times
C pointers: C pointers don't carry the same rules as Fortran arrays.
The compiler has no way to know whether *a, *b, and *c overlap or are referenced differently elsewhere.
The compiler must assume the worst, thus a false data dependency.
( 53) void mat_mul_daxpy(double* restrict a, double* restrict b,
double* restrict c, int rowa, int cola, int colb)
( 54) {
( 55) int i, j, k; /* loop counters */
( 56) int rowc, colc, rowb; /* sizes not passed as arguments */
( 57) double con; /* constant value */
( 58)
( 59) rowb = cola;
( 60) rowc = rowa;
( 61) colc = colb;
( 62)
( 63) for(i=0;i<rowc;i++) {
( 64) for(k=0;k<cola;k++) {
( 65) con = *(a + i*cola +k);
( 66) for(j=0;j<colc;j++) {
( 67) *(c + i*colc + j) += con * *(b + k*colb + j);
( 68) }
( 69) }
( 70) }
( 71) }
C pointers, restricted: C99 introduces the restrict keyword, which allows the programmer to promise not to reference the memory via another pointer.
If you declare a restricted pointer and break the rules, behavior is undefined by the standard.
66, Generated alternate loop with no peeling - executed if loop count <= 24
Generated vector sse code for inner loop
Generated 2 prefetch instructions for this loop
Generated vector sse code for inner loop
Generated 2 prefetch instructions for this loop
Generated alternate loop with no peeling and more aligned moves -
executed if loop count <= 24 and alignment test is passed
Generated vector sse code for inner loop
Generated 2 prefetch instructions for this loop
Generated alternate loop with more aligned moves - executed if loop
count >= 25 and alignment test is passed
Generated vector sse code for inner loop
Generated 2 prefetch instructions for this loop
• This can also be achieved with the PGI safe pragma and -Msafeptr compiler option, or the PathScale -OPT:alias option
GNU malloc library: malloc, calloc, realloc, free calls
Fortran dynamic variables
Malloc library system calls: mmap, munmap (for larger allocations); brk, sbrk (increase/decrease heap)
Malloc library is optimized for low system memory use; can result in system calls/minor page faults
Detecting "bad" malloc behavior
Profile data => "excessive system time"
Correcting "bad" malloc behavior
Eliminate mmap use by malloc; increase the threshold to release heap memory
Use environment variables to alter malloc:
MALLOC_MMAP_MAX_ = 0
MALLOC_TRIM_THRESHOLD_ = 536870912
Possible downsides:
Heap fragmentation; the user process may call mmap directly; the user process may launch other processes
PGI's -Msmartalloc does something similar for you at compile time
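The same two knobs can also be set from within the program using glibc's mallopt, which is roughly equivalent to the environment variables above (a sketch; call it before the first allocation):

#include <malloc.h>

int main(void)
{
    /* Equivalent of MALLOC_MMAP_MAX_=0: never satisfy malloc via mmap. */
    mallopt(M_MMAP_MAX, 0);

    /* Equivalent of MALLOC_TRIM_THRESHOLD_=536870912: keep up to 512 MB of
       freed memory on the heap instead of returning it to the kernel. */
    mallopt(M_TRIM_THRESHOLD, 512 * 1024 * 1024);

    /* ... rest of the application ... */
    return 0;
}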
Google created a replacement “malloc” library
“Minimal” TCMalloc replaces GNU malloc
Limited testing indicates TCMalloc as good or better than GNU malloc
Environment variables not required
TCMalloc almost certainly better for allocations in OpenMP parallel regions
There’s currently no pre-built tcmalloc for Cray XT, but some users have successfully built it.
Linux has a "first touch" policy for memory allocation
*alloc functions don't actually allocate your memory
Memory gets allocated when "touched"
Problem: a code can allocate more memory than is available
Linux assumes "swap space"; we don't have any
Applications won't fail from over-allocation until the memory is finally touched
Problem: memory will be placed local to the core of the "touching" thread
Only a problem if thread 0 allocates all memory for a node
Solution: always initialize your memory immediately after allocating it
If you over-allocate, it will fail immediately, rather than at a strange place in your code
If every thread touches its own memory, it will be allocated on the proper socket
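A minimal sketch of the "initialize it where you will use it" rule with OpenMP (illustrative; use the same loop schedule for the initialization as for the later compute loops):

#include <stdlib.h>

double *alloc_and_touch(size_t n)
{
    double *x = malloc(n * sizeof *x);
    if (!x) return NULL;             /* over-allocation fails here, not later */

    /* First touch: each thread initializes the pages it will later compute on,
       so those pages are placed in that thread's local memory. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++)
        x[i] = 0.0;

    return x;
}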
Short Message Eager Protocol
The sending rank "pushes" the message to the receiving rank
Used for messages of MPICH_MAX_SHORT_MSG_SIZE bytes or less
The sender assumes that the receiver can handle the message:
a matching receive is posted, or
there are available event queue entries (MPICH_PTL_UNEX_EVENTS) and buffer space (MPICH_UNEX_BUFFER_SIZE) to store the message
Long Message Rendezvous Protocol
Messages are "pulled" by the receiving rank
Used for messages greater than MPICH_MAX_SHORT_MSG_SIZE bytes
The sender sends a small header packet with information for the receiver to pull over the data
Data is sent only after the matching receive is posted by the receiving rank
MPT Eager Protocol: data "pushed" to the receiver (MPICH_MAX_SHORT_MSG_SIZE bytes or less); MPI_RECV is posted prior to the MPI_SEND call.
[Diagram: Step 1 - MPI_RECV call posts a match entry (ME) to Portals; Step 2 - MPI_SEND call; Step 3 - Portals DMA PUT of the incoming message on the SeaStar. Also shown: MPI unexpected buffers (MPICH_UNEX_BUFFER_SIZE), unexpected message queue, unexpected event queue (MPICH_PTL_UNEX_EVENTS), other event queue (MPICH_PTL_OTHER_EVENTS), and the eager short-message, rendezvous long-message, and application match entries posted by MPI to handle unexpected messages.]
MPT Eager Protocol: MPI_RECV is not posted prior to the MPI_SEND call.
[Diagram: Step 1 - MPI_SEND call; Step 2 - Portals DMA PUT into the MPI unexpected buffers (MPICH_UNEX_BUFFER_SIZE), with an entry in the unexpected event queue (MPICH_PTL_UNEX_EVENTS); Step 3 - MPI_RECV call (no Portals ME); Step 4 - memcpy of the data to the application buffer on the SeaStar.]
MPT Rendezvous Protocol: data is not sent until MPI_RECV is issued.
[Diagram: Step 1 - MPI_SEND call, a Portals ME is created; Step 2 - Portals DMA PUT of the header; Step 3 - MPI_RECV call triggers a GET request; Step 4 - the receiver issues the GET request to match the sender's ME; Step 5 - Portals DMA of the data on the SeaStar.]
MPICH_MAX_SHORT_MSG_SIZE controls the message sending protocol:
Message sizes <= MSG_SIZE: use EAGER
Message sizes > MSG_SIZE: use RENDEZVOUS
Increasing this variable may require that MPICH_UNEX_BUFFER_SIZE be increased
Increase MPICH_MAX_SHORT_MSG_SIZE if the application sends large messages and receives are pre-posted:
Can reduce messaging overhead via the EAGER protocol
Can reduce network contention
Decrease MPICH_MAX_SHORT_MSG_SIZE if:
The application sends lots of smaller messages and receives are not pre-posted, exhausting unexpected buffer space
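"Receives are pre-posted" simply means the MPI_Irecv calls are issued before the matching sends arrive, so eager messages land directly in user buffers instead of the unexpected buffer. A hedged sketch (buffer layout, tag, and neighbor list are illustrative):

#include <mpi.h>

/* Post all receives for this exchange phase up front, then send.
   Eager-protocol messages arriving afterwards match immediately. */
void exchange(double *rbuf, double *sbuf, int count,
              int nneigh, const int *neighbors, MPI_Comm comm)
{
    MPI_Request reqs[64];            /* assumes nneigh <= 32 (illustrative) */

    for (int i = 0; i < nneigh; i++)
        MPI_Irecv(rbuf + (long)i * count, count, MPI_DOUBLE,
                  neighbors[i], 99, comm, &reqs[i]);

    for (int i = 0; i < nneigh; i++)
        MPI_Isend(sbuf + (long)i * count, count, MPI_DOUBLE,
                  neighbors[i], 99, comm, &reqs[nneigh + i]);

    MPI_Waitall(2 * nneigh, reqs, MPI_STATUSES_IGNORE);
}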
If set => Disables Portals matching
Matching happens on the Opteron
Requires extra copy for EAGER protocol
Reduces MPI_Recv Overhead
Helpful for latency-sensitive applications
Large # of small messages
Small message collectives (<1024 bytes)
When can this be slower?
When extra copy time longer than post-to-Portals time
Pre-posted Receives can slow it down
For medium to larger messages (16k-128k range)
Not beneficial for Gemini
The default ordering can be changed using the following environment variable:
MPICH_RANK_REORDER_METHOD
These are the different values that you can set it to:
0: Round-robin placement – sequential ranks are placed on the next node in the list; placement starts over with the first node upon reaching the end of the list.
1: SMP-style placement – sequential ranks fill up each node before moving to the next.
2: Folded-rank placement – similar to round-robin, except that each pass over the node list is in the opposite direction of the previous pass.
3: Custom ordering – the ordering is specified in a file named MPICH_RANK_ORDER.
When is this useful?
When point-to-point communication consumes a significant fraction of program time and a load imbalance is detected.
It has also been shown to help collectives (alltoall) on subcommunicators (GYRO).
And to spread I/O out across nodes (POP).
One can also use the CrayPat performance measurement tools to generate a suggested custom ordering.
Available if MPI functions are traced (-g mpi or -O apa).
pat_build -O apa my_program (see the Examples section of the pat_build man page)
pat_report options:
mpi_sm_rank_order: uses message data from tracing MPI to generate a suggested MPI rank order. Requires the program to be instrumented using the pat_build -g mpi option.
mpi_rank_order: uses time in user functions, or alternatively any other metric specified with the -s mro_metric options, to generate a suggested MPI rank order.
module load xt-craypat
Rebuild your code
pat_build -O apa a.out
Run a.out+pat
pat_report -O mpi_sm_rank_order a.out+pat+…sdt/ > pat.report
Creates a MPICH_RANK_ORDER.x file
Then set the env var MPICH_RANK_REORDER_METHOD=3 AND
link the file MPICH_RANK_ORDER.x to MPICH_RANK_ORDER
Rerun the code
Table 1: Suggested MPI Rank Order
Eight cores per node: USER Samp per node

Rank   Max        Max/    Avg        Avg/    Max Node
Order  USER Samp  SMP     USER Samp  SMP     Ranks
d      17062       97.6%  16907      100.0%  832,328,820,797,113,478,898,600
2      17213       98.4%  16907      100.0%  53,202,309,458,565,714,821,970
0      17282       98.8%  16907      100.0%  53,181,309,437,565,693,821,949
1      17489      100.0%  16907      100.0%  0,1,2,3,4,5,6,7
This suggests that:
1. The custom ordering “d” might be the best.
2. Folded-rank is next best.
3. Round-robin is third best.
4. The default ordering is last.
GYRO 8.0, B3-GTC problem with 1024 processes
Run with alternate MPI orderings. Custom: profiled with -O apa and used the reordering file MPICH_RANK_ORDER.d
Reorder method         Comm. time
Default                11.26s
0 – round-robin         6.94s
2 – folded-rank         6.68s
d – custom from apa     8.03s

CrayPat's suggestion was almost right!
TGYRO 1.0
Steady state turbulent transport code using GYRO, NEO, TGLF components
ASTRA test case
Tested MPI orderings at large scale
Originally testing weak-scaling, but found reordering very useful
TGYRO wall time (min)
Reorder method    20480   40960   81920
Default           99m     104m    105m
Round-robin       66m     63m     72m

Huge win!
Application data is in a 3D space, X x Y x Z.
Communication is nearest-neighbor.
The default ordering results in a 12x1x1 block of ranks on each node.
A custom reordering is now generated: 3x2x2 blocks per node, resulting in more on-node communication.
Rank Reordering Case Study
% pat_report -O mpi_sm_rank_order -s rank_grid_dim=8,6 ...
Notes for table 1:
To maximize the locality of point to point communication,
specify a Rank Order with small Max and Avg Sent Msg Total Bytes
per node for the target number of cores per node.
To specify a Rank Order with a numerical value, set the environment
variable MPICH_RANK_REORDER_METHOD to the given value.
To specify a Rank Order with a letter value 'x', set the environment
variable MPICH_RANK_REORDER_METHOD to 3, and copy or link the file
MPICH_RANK_ORDER.x to MPICH_RANK_ORDER.
Table 1: Sent Message Stats and Suggested MPI Rank Order

Communication Partner Counts
Number of   Rank
Partners    Count   Ranks
2           4       0 5 42 47
3           20      1 2 3 4 ...
4           24      7 8 9 10 ...
Four cores per node: Sent Msg Total Bytes per node

Rank   Max          Max/    Avg          Avg/    Max Node
Order  Total Bytes  SMP     Total Bytes  SMP     Ranks
g      121651200     73.9%   86400000     62.5%  14,20,15,21
h      121651200     73.9%   86400000     62.5%  14,20,21,15
u      152064000     92.4%  146534400    106.0%  13,12,10,4
1      164505600    100.0%  138240000    100.0%  16,17,18,19
d      164505600    100.0%  142387200    103.0%  16,17,19,18
0      224640000    136.6%  207360000    150.0%  1,13,25,37
2      241920000    147.1%  207360000    150.0%  7,16,31,40
% $CRAYPAT_ROOT/sbin/grid_order -c 2,2 -g 8,6
# grid_order -c 2,2 -g 8,6
# Region 0: 0,0 (0..47)
0,1,6,7
2,3,8,9
4,5,10,11
12,13,18,19
14,15,20,21
16,17,22,23
24,25,30,31
26,27,32,33
28,29,34,35
36,37,42,43
38,39,44,45
40,41,46,47
This script will also handle the case that cells do not
evenly partition the grid.
% $CRAYPAT_ROOT/sbin/mgrid_order -H -g 8,6
# mgrid_order -H -g 8,6
[Output: the rank ordering for the 8x6 grid (ranks 0-47) generated along a Hilbert curve, shown alongside a picture of the grid layout.]
Hilbert curve ordering works best when the grid has a power-of-two (2^n) side.
X X o o
X X o o
o o o o
o o o o
Nodes marked X heavily use a shared resource
If memory bandwidth, scatter the X's
If network bandwidth to others, again scatter
If network bandwidth among themselves, concentrate
I/O is simply data migration: memory <-> disk.
I/O is a very expensive operation.
It involves interactions with data in memory and on disk.
It must get the kernel involved.
How is I/O performed?
I/O Pattern
Number of processes and files.
File access characteristics.
Where is I/O performed?
Characteristics of the computational system.
Characteristics of the file system.
There is no “One Size Fits All” solution to the I/O problem.
Many I/O patterns work well for some range of parameters.
Bottlenecks in performance can occur in many locations. (Application and/or File system)
Going to extremes with an I/O pattern will typically lead to problems.
The best performance comes when the data is accessed contiguously both in memory and on disk.
This facilitates large operations and minimizes latency.
Commonly, data access is contiguous in memory but noncontiguous on disk, or vice versa, usually because a global data structure is being reconstructed via parallel I/O.
Spokesperson
One process performs I/O.
Data Aggregation or Duplication
Limited by single I/O process.
Pattern does not scale.
Time increases linearly with amount of data.
Time increases with number of processes.
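A minimal sketch of the spokesperson pattern (file name and data sizes are illustrative): every rank sends its block to rank 0, which writes the whole file by itself.

/* spokesperson.c - sketch of the single-writer I/O pattern. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs, n = 1 << 20;                    /* doubles per rank (example) */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double *local = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) local[i] = rank;      /* example data */

    double *global = NULL;
    if (rank == 0) global = malloc((size_t)n * nprocs * sizeof(double));

    /* Aggregate everything on rank 0 ... */
    MPI_Gather(local, n, MPI_DOUBLE, global, n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* ... and only the spokesperson touches the file. */
    if (rank == 0) {
        FILE *fp = fopen("output.dat", "wb");
        fwrite(global, sizeof(double), (size_t)n * nprocs, fp);
        fclose(fp);
        free(global);
    }

    free(local);
    MPI_Finalize();
    return 0;
}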
File per process
All processes perform I/O to individual files.
Limited by file system.
Pattern does not scale at large process counts.
Number of files creates bottleneck with metadata operations.
Number of simultaneous disk accesses creates contention for file system resources.
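A minimal sketch of the file-per-process pattern (the file naming is illustrative): each rank writes its own file, so no coordination is needed, but the number of files grows with the job.

/* file_per_process.c - sketch: every rank writes its own file. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, n = 1 << 20;                  /* doubles per rank (example) */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *local = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) local[i] = rank;

    char fname[64];
    snprintf(fname, sizeof(fname), "output.%06d.dat", rank);   /* one file per rank */
    FILE *fp = fopen(fname, "wb");
    fwrite(local, sizeof(double), n, fp);
    fclose(fp);

    free(local);
    MPI_Finalize();
    return 0;
}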
Shared File
Each process performs I/O to a single file which is shared.
Performance
Data layout within the shared file is very important.
At large process counts contention can build for file system resources.
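A minimal sketch of a single shared file written collectively with MPI-IO (file name and block size are illustrative): each rank writes a disjoint, contiguous region at a rank-dependent offset, so the data layout within the file is explicit.

/* shared_file.c - sketch: all ranks write disjoint blocks of one shared file. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, n = 1 << 20;                  /* doubles per rank (example) */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *local = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) local[i] = rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank writes to its own region; the collective call lets the
       library coordinate the access (e.g. collective buffering). */
    MPI_Offset offset = (MPI_Offset)rank * n * sizeof(double);
    MPI_File_write_at_all(fh, offset, local, n, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(local);
    MPI_Finalize();
    return 0;
}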
Subset of processes which perform I/O.
Aggregates a group of processes' data; serializes I/O within the group.
Each I/O process may access an independent file, which limits the number of files accessed (see the sketch below).
Alternatively, a group of processes performs parallel I/O to a shared file:
increases the number of shared files, to increase file system usage;
decreases the number of processes which access each shared file, to decrease file system contention.
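A minimal sketch of one variant of this pattern, in which each group's leader writes an independent file (the group size and file names are illustrative): ranks are split into groups, each group's data is gathered to its leader, and only the leaders perform I/O.

/* io_subset.c - sketch: only one writer per group of ranks. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, n = 1 << 18, ranks_per_group = 32;      /* example sizes */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Split the world into groups of ranks_per_group consecutive ranks. */
    MPI_Comm group;
    MPI_Comm_split(MPI_COMM_WORLD, rank / ranks_per_group, rank, &group);
    int grank, gsize;
    MPI_Comm_rank(group, &grank);
    MPI_Comm_size(group, &gsize);

    double *local = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) local[i] = rank;

    double *agg = NULL;
    if (grank == 0) agg = malloc((size_t)n * gsize * sizeof(double));
    MPI_Gather(local, n, MPI_DOUBLE, agg, n, MPI_DOUBLE, 0, group);

    if (grank == 0) {                                 /* one writer per group */
        char fname[64];
        snprintf(fname, sizeof(fname), "output.group%04d.dat", rank / ranks_per_group);
        FILE *fp = fopen(fname, "wb");
        fwrite(agg, sizeof(double), (size_t)n * gsize, fp);
        fclose(fp);
        free(agg);
    }

    free(local);
    MPI_Comm_free(&group);
    MPI_Finalize();
    return 0;
}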
[Plot: file-per-process write performance. 128 MB per file and a 32 MB transfer size. Write rate (MB/s, 0-12000) vs. number of processes or files (0-9000), for 1 MB and 32 MB stripe sizes.]
[Plot: single-shared-file write performance. 32 MB per process, 32 MB transfer size and stripe size. Write rate (MB/s, 0-8000) vs. number of processes (0-9000), for POSIX, MPI-IO, and HDF5.]
Lustre
Minimize contention for file system resources.
A process should not access more than one or two OSTs.
Performance
Performance is limited for single process I/O.
Parallel I/O utilizing a file-per-process or a single shared file is limited at large scales.
A potential solution is to utilize multiple shared files, or a subset of processes which perform I/O.
Standard Output and Error streams are effectively serial I/O.
All STDIN, STDOUT, and STDERR I/O serializes through aprun.
Disable debugging messages when running in production mode.
“Hello, I’m task 32000!”
“Task 64000, made it through loop.”
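A minimal sketch of keeping such messages out of production runs (the DEBUG macro is illustrative): debug output is compiled in only on request and printed by a single rank.

/* debug_print.c - sketch: rank-guarded, compile-time-guarded debug output. */
#include <mpi.h>
#include <stdio.h>

#ifndef DEBUG
#define DEBUG 0                 /* build with -DDEBUG=1 only while debugging */
#endif

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (DEBUG && rank == 0)     /* at most one rank prints, and only in debug builds */
        printf("Hello, I'm task %d!\n", rank);

    MPI_Finalize();
    return 0;
}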
Advantages
Aggregates smaller read/write operations into larger operations.
Examples: OS Kernel Buffer, MPI-IO Collective Buffering
Disadvantages
Requires additional memory for the buffer.
Caution
Frequent buffer flushes can adversely affect performance.
A particular code both reads and writes a 377 GB file. Runs on 6000 cores.
Total I/O volume (reads and writes) is 850 GB.
Utilizes parallel HDF5
Default Stripe settings: count 4, size 1M, index -1.
1800 s run time (~ 30 minutes)
Stripe settings: count -1 (use all available OSTs), size 1M, index -1.
625 s run time (~ 10 minutes)
Results
66% decrease in run time.
Included in the Cray MPT library.
Environment variables used to help MPI-IO optimize I/O performance.
MPICH_MPIIO_CB_ALIGN environment variable (default 2).
MPICH_MPIIO_HINTS environment variable:
Can set striping_factor and striping_unit for files created with MPI-IO.
If writes and/or reads utilize collective calls, collective buffering can be utilized (romio_cb_read/write) to approximately stripe-align I/O within Lustre.
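A minimal sketch of passing such hints through an MPI_Info object at file-creation time (the particular values are examples, not recommendations); the same hints can alternatively be supplied via the MPICH_MPIIO_HINTS environment variable.

/* mpiio_hints.c - sketch: Lustre striping and collective-buffering hints. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "16");      /* stripe count for the new file (example) */
    MPI_Info_set(info, "striping_unit", "1048576");   /* 1 MB stripe size (example) */
    MPI_Info_set(info, "romio_cb_write", "enable");   /* use collective buffering on writes */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "hinted.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    /* ... collective writes, e.g. MPI_File_write_at_all(...), as usual ... */
    MPI_File_close(&fh);

    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}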
[Plot: rate in MB/s (0-1800). MPI-IO API, non-power-of-2 blocks and transfers; in this case blocks and transfers both of 1M bytes and a strided access pattern. Tested on an XT5 with 32 PEs, 8 cores/node, 16 stripes, 16 aggregators, 3220 segments, 96 GB file.]
[Plot: rate in MB/s (0-160). MPI-IO API, non-power-of-2 blocks and transfers; in this case blocks and transfers both of 10K bytes and a strided access pattern. Tested on an XT5 with 32 PEs, 8 cores/node, 16 stripes, 16 aggregators, 3220 segments, 96 GB file.]
[Plot: rate in MB/s (0-4000). On 5107 PEs, and by application design, a subset of the PEs (88) do the writes. With collective buffering, this is further reduced to 22 aggregators (cb_nodes) writing to 22 stripes. Tested on an XT5 with 5107 PEs, 8 cores/node.]
[Plot: dump time in seconds (log scale, 1-1000) vs. PEs, without collective buffering and with CB=0, CB=1, CB=2. Total file size 6.4 GiB. Mesh of 64M bytes, 32M elements, with work divided amongst all PEs. The original problem scaled very poorly; for example, without collective buffering, 8000 PEs take over 5 minutes to dump. Note that disabling data sieving was necessary. Tested on an XT5, 8 stripes, 8 cb_nodes.]
Do not open a lot of files all at once (Metadata Bottleneck)
Use a simple ls (without color) instead of ls -l (OST Bottleneck)
Remember to stripe files
Small, individual files => Small stripe counts
Large, shared files => Large stripe counts
Never set an explicit starting OST for your files (Filesystem Balance)
Open Files as Read-Only when possible
Limit the number of files per directory
Stat files from just one process
Stripe-align your I/O (Reduces Locks)
Read small, shared files once and broadcast the data (OST Contention)
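For the last point above, a minimal sketch (the file name is illustrative): rank 0 reads the small shared file once and broadcasts its contents, so thousands of ranks do not all open the same file.

/* read_bcast.c - sketch: read a small shared file once, broadcast to all ranks. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;
    long len = 0;
    char *buf = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                          /* only rank 0 touches the file */
        FILE *fp = fopen("input.cfg", "rb");  /* example file name */
        if (!fp) MPI_Abort(MPI_COMM_WORLD, 1);
        fseek(fp, 0, SEEK_END);
        len = ftell(fp);
        rewind(fp);
        buf = malloc(len);
        fread(buf, 1, len, fp);
        fclose(fp);
    }

    /* Everyone else gets the contents over the network instead of the OSTs. */
    MPI_Bcast(&len, 1, MPI_LONG, 0, MPI_COMM_WORLD);
    if (rank != 0) buf = malloc(len);
    MPI_Bcast(buf, (int)len, MPI_BYTE, 0, MPI_COMM_WORLD);

    /* ... every rank now parses buf ... */
    free(buf);
    MPI_Finalize();
    return 0;
}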