Posted on 11-Nov-2014
Programming the Memory Hierarchy with Sequoia
Michael Bauer, Alex Aiken
1
Stanford University
Outline
The Sequoia Programming Model
Compiling Sequoia
Automated Tuning with Sequoia
Performance Results
Extensions for Irregular Parallelism (Time Permitting)
Areas of Future Research
2
The Sequoia Programming Model
3
Sequoia
4
Language: stream programming for machines with deep memory hierarchies
Idea: expose abstract memory hierarchy to the programmer
Implementation: benchmarks run well on many multi-level machines
SMP, CMP, Cluster of CMPs, GPU, Cluster of GPUs, Disk
The key challenge in high-performance programming is communication (not parallelism): managing latency and bandwidth through LOCALITY.
5
Streaming
6
Streaming involves structuring algorithms as collections of independent [locality cognizant] computations with well defined working sets.
This structuring can be done at many scales:
Keep temporaries in registers
Cache/scratchpad blocking
Message passing on a cluster
Out-of-core algorithms
Streaming
7
Streaming involves structuring algorithms as collections of independent [locality cognizant] computations with well defined working sets.
Efficient programs exhibit this structure at many scales.
8
Facilitate development of hierarchy-aware stream programs
Provide constructs that can be implemented efficiently:
Place computation and data in the machine
Explicit parallelism and communication
Large bulk transfers
Locality in Programming Languages
9
Local (private) vs. global (remote) addresses: UPC, Titanium
Domain distributions (map array elements to locations): HPF, UPC, ZPL, X10, Fortress, Chapel
Focus on communication between nodes; ignore hierarchy within a node
Locality in Programming Languages
10
Streams and kernels: stream data off chip, kernel data on chip. StreamC/KernelC, BrookGPU, shading languages (Cg, HLSL)
Architecture specific; only represent two levels (except CUDA and PMH)
Abstract Machine Model
Tree of independent address spaces
Each level is progressively smaller, but computationally more powerful
Arbitrary branching factor
Arbitrary number of levels
11
Hierarchical Memory
Real machines as trees of memories
[Figure: a dual-core PC as a tree — main memory over a shared L2 cache, over two L1 caches, each feeding ALUs. A 4-node cluster of PCs — an aggregate cluster memory (virtual level) over four node memories, each with its own L2 cache, L1 cache, and ALUs.]
12
Hierarchical Memory
13
[Figure: a single GPU as a tree — CPU main memory over GPU main memory, over shared memories, each over warps with registers.]
Hierarchical Memory
14
[Figure: an MPI cluster of CMPs with GPU accelerators — an aggregate cluster memory (virtual level) over CMP main memories, each over GPU main memories, shared memories, and warps with registers.]
Example: Blocked Matrix Multiply
void matmul_L1( int M, int N, int T,
                float A[M][T], float B[T][N], float C[M][N] )
{
  for (int i=0; i<M; i++)
    for (int j=0; j<N; j++)
      for (int k=0; k<T; k++)
        C[i][j] += A[i][k] * B[k][j];
}

C += A x B
[Figure: matmul_L1 — a 32x32 matrix mult on blocks A, B, C.]
15
Example: Blocked Matrix Multiply
C += A x B

void matmul_L2( int M, int N, int T,
                float A[M][T], float B[T][N], float C[M][N] )
{
  // Perform a series of L1 matrix multiplications.
}

[Figure: matmul_L2 — a 256x256 matrix mult decomposed into many 32x32 matmul_L1 matrix mults.]
16
Example: Blocked Matrix Multiply

void matmul( int M, int N, int T,
             float A[M][T], float B[T][N], float C[M][N] )
{
  // Perform a series of L2 matrix multiplications.
}

[Figure: matmul — a large matrix mult decomposed into 256x256 matmul_L2 matrix mults, each decomposed further into 32x32 matmul_L1 matrix mults.]
17
Sequoia Tasks
Special functions called tasks are the building blocks of Sequoia programs

task matmul::leaf( in float A[M][T], in float B[T][N], inout float C[M][N] )
{
  for (int i=0; i<M; i++)
    for (int j=0; j<N; j++)
      for (int k=0; k<T; k++)
        C[i][j] += A[i][k] * B[k][j];
}
18
Sequoia Tasks
Task arguments and temporaries define a working set
The task working set is resident at a specific location in the abstract machine model
Tasks are assigned locations in the memory hierarchy
Maintain call-by-value-result (CBVR) semantics

task matmul::leaf( in float A[M][T], in float B[T][N], inout float C[M][N] )
{
  for (int i=0; i<M; i++)
    for (int j=0; j<N; j++)
      for (int k=0; k<T; k++)
        C[i][j] += A[i][k] * B[k][j];
}
19
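The CBVR semantics above can be sketched in plain Python. This is a toy model, not Sequoia's implementation: the `run_task` helper is hypothetical, and stands in for the runtime's copy-in/copy-out of a task's working set.

```python
# A minimal sketch of call-by-value-result (CBVR) task semantics,
# assuming a toy run_task() helper (not part of Sequoia itself).
import copy

def run_task(task, in_args, inout_args):
    """Run `task` on private copies of its arguments (copy-in),
    then write `inout` results back to the caller (copy-out)."""
    local_in = copy.deepcopy(in_args)        # copy-in: `in` params
    local_inout = copy.deepcopy(inout_args)  # copy-in: `inout` params
    task(*local_in, *local_inout)            # task sees only its working set
    for dst, src in zip(inout_args, local_inout):
        dst[:] = src                         # copy-out: result written back

def axpy(a, x, y):  # y += a*x, a leaf-style computation
    for i in range(len(y)):
        y[i] += a[0] * x[i]

a, x, y = [2.0], [1.0, 2.0], [10.0, 20.0]
run_task(axpy, [a, x], [y])
# y now reflects the task's result: [12.0, 24.0]
```

During the task's execution the caller's data is untouched; only the `inout` write-back makes the result visible, which is what lets a task's working set live in a different level of the memory hierarchy.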
Task Hierarchies

task matmul::inner( in float A[M][T], in float B[T][N], inout float C[M][N] )
{
  tunable int P, Q, R;
  // Recursively call the matmul task on submatrices
  // of A, B, and C of size PxQ, QxR, and PxR.
}

task matmul::leaf( in float A[M][T], in float B[T][N], inout float C[M][N] )
{
  for (int i=0; i<M; i++)
    for (int j=0; j<N; j++)
      for (int k=0; k<T; k++)
        C[i][j] += A[i][k] * B[k][j];
}
20
Task Hierarchies

task matmul::inner( in float A[M][T], in float B[T][N], inout float C[M][N] )
{
  tunable int P, Q, R;

  mappar( int i=0 to M/P, int j=0 to N/R ) {
    mapseq( int k=0 to T/Q ) {
      matmul( A[P*i:P*(i+1);P][Q*k:Q*(k+1);Q],
              B[Q*k:Q*(k+1);Q][R*j:R*(j+1);R],
              C[P*i:P*(i+1);P][R*j:R*(j+1);R] );
    }
  }
}

task matmul::leaf( in float A[M][T], in float B[T][N], inout float C[M][N] )
{
  for (int i=0; i<M; i++)
    for (int j=0; j<N; j++)
      for (int k=0; k<T; k++)
        C[i][j] += A[i][k] * B[k][j];
}

[Figure: variant call graph — matmul::inner calls itself recursively or matmul::leaf.]
21
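The decomposition matmul::inner expresses can be sketched in plain Python: the mappar iterations become independent loop nests over output blocks, mapseq becomes the ordered reduction over k, and P, Q, R play the role of the tunables. This is a sequential sketch of the blocking, not Sequoia's parallel execution.

```python
# Two-level blocked matrix multiply mirroring matmul::inner / matmul::leaf.
def matmul_leaf(A, B, C):
    for i in range(len(C)):
        for j in range(len(C[0])):
            for k in range(len(B)):
                C[i][j] += A[i][k] * B[k][j]

def matmul_inner(A, B, C, P, Q, R):
    M, T, N = len(A), len(B), len(B[0])
    for i in range(M // P):          # mappar: independent output blocks
        for j in range(N // R):
            for k in range(T // Q):  # mapseq: ordered accumulation into C
                Ablk = [row[Q*k:Q*(k+1)] for row in A[P*i:P*(i+1)]]
                Bblk = [row[R*j:R*(j+1)] for row in B[Q*k:Q*(k+1)]]
                Cblk = [row[R*j:R*(j+1)] for row in C[P*i:P*(i+1)]]
                matmul_leaf(Ablk, Bblk, Cblk)
                # copy-out the block result (CBVR-style write-back)
                for r, row in enumerate(Cblk):
                    C[P*i+r][R*j:R*(j+1)] = row

A = [[float(i + j) for j in range(4)] for i in range(4)]
I4 = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
C = [[0.0]*4 for _ in range(4)]
matmul_inner(A, I4, C, P=2, Q=2, R=2)   # multiply by the identity
```

Each block slice is a private copy of the working set, and the write-back step mirrors the copy-out that CBVR semantics require.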
Summary: Sequoia Tasks
Single abstraction for:
Isolation / parallelism
Explicit communication / working sets
Expressing locality
Sequoia programs describe hierarchies of tasks:
Mapped onto the memory hierarchy
Parameterized for portability
22
Compiling Sequoia
23
Sequoia Compiler
Source-to-source compilation
Three inputs: a source file, a machine file, and a mapping file
Compilation works on hierarchical programs
Many standard optimizations, done at all levels of the hierarchy, which greatly increases the leverage of optimization
[Figure: source.sq, source.mp, and machine.m feed the Sequoia compiler (sq++), which emits machine-specific source code.]
24
Inter-Level Copy Elimination (1)
25
Copy elimination near the root removes not one instruction, but thousands or millions
[Figure: chained copies of A through B to C across memory levels Mi and Mi+1 are collapsed into a single copy from A to C.]
Inter-Level Copy Elimination (2)
26
Copy elimination near the root removes not one instruction, but thousands or millions
[Figure: a second redundant copy between memory levels Mi and Mi+1 is eliminated, leaving a single copy from A to B.]
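The chained-copy case above can be sketched as a tiny rewrite pass. This is a toy IR (a list of src/dst transfer pairs), not the Sequoia compiler's actual representation; `read_counts` is an assumed side table recording other uses of each buffer.

```python
# A toy sketch of inter-level copy elimination: chained copy ops A->B, B->C
# collapse into A->C whenever the intermediate buffer B has no other readers.
def eliminate_copies(copies, read_counts):
    """copies: list of (src, dst) transfer ops in program order.
    read_counts: reads of each buffer other than these copies."""
    out = list(copies)
    changed = True
    while changed:
        changed = False
        for i in range(len(out) - 1):
            s1, d1 = out[i]
            s2, d2 = out[i + 1]
            if d1 == s2 and read_counts.get(d1, 0) == 0:
                out[i:i + 2] = [(s1, d2)]  # forward A->B->C into A->C
                changed = True
                break
    return out

# Two chained transfers through an otherwise-unused staging buffer B:
plan = eliminate_copies([("A", "B"), ("B", "C")], {"B": 0})
```

Because transfers near the root of the machine tree move whole working sets, removing one such op removes an enormous amount of actual data movement, which is the point the slide makes.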
Software Pipelining
27
Scheduling: prefetch a batch of data, compute on it, then initiate the write of results
Overlap communication and computation
[Figure: pipelined timeline — while iteration i computes, input i+1 is read and output i-1 is written.]
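The overlap in the timeline above can be sketched with a background prefetch thread. The helpers `read_block` and `process` are hypothetical stand-ins for a bulk transfer and a compute kernel; the structure (issue read i+1 before computing on block i) is the software-pipelining idea.

```python
# A minimal double-buffering sketch: overlap the read of block i+1 with the
# compute of block i using a one-worker I/O pool.
from concurrent.futures import ThreadPoolExecutor

def read_block(data, i):      # stands in for a DMA / disk read
    return list(data[i])

def process(block):           # stands in for the compute kernel
    return [x * 2 for x in block]

def pipeline(data):
    results = []
    with ThreadPoolExecutor(max_workers=1) as io:
        nxt = io.submit(read_block, data, 0)              # prefetch block 0
        for i in range(len(data)):
            block = nxt.result()                          # wait for block i
            if i + 1 < len(data):
                nxt = io.submit(read_block, data, i + 1)  # overlap read i+1
            results.append(process(block))                # compute on block i
    return results

out = pipeline([[1, 2], [3, 4]])
```

With a second worker for write-back, this becomes the three-stage read/compute/write pipeline shown in the figure.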
Sequoia Runtime
Uniform scheme for explicitly describing memory hierarchies
Capture common traits important for performance
Allow composition of memory hierarchies
Simple, portable API for many parallel machines
Mechanism independence for communication and management of parallel resources
SMP, CMP, MPI Cluster, CUDA, Disk, Cell (deprecated), Scalar (debugging), OpenCL (future)
28
Graphical Runtime Representation
[Figure: a runtime connects the memory and CPU at level i+1 to the memories and CPUs of children 1 through N at level i.]
29
Runtime Design
Uniform API to support many devices
Manages basic program tasks: data allocation and naming, setup of parallel resources, synchronization
Greatly simplifies and modularizes the implementation: the compiler generates code for one API, not many machines; each runtime is isolated from all others; runtimes can be implemented separately and composed freely
Makes some basic assumptions: software has control over memory resources; persistence of data for software-controlled memory
30
Automatic Tuning
31
Autotuner
32
Many parameters to tune: Sequoia codes are parameterized by tunables, plus the choice of task variants at different call sites
The tuning framework sets these parameters: it is search-based, and the programmer defines the search space
Software-Managed Memory
33
[Figure: tuning search space for software-managed memory — smooth, with high-frequency components due to alignment.]
Hierarchical Search
34
[Figure: machine levels M0, M1, M2 searched bottom-up, with a set of tunables (S0, S1) per level.]
Search Algorithm
35
A pyramid search: a greedy search is performed at each level
Achieves good performance quickly because the space is smooth
Start with a coarse grid; refine the grid spacing when no further progress can be made
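The coarse-to-fine greedy search can be sketched in one dimension. This is a toy version under stated assumptions: the real tuner searches several tunables per machine level, and the cost function here is a made-up smooth stand-in for measured runtime.

```python
# Greedy grid search with refinement: move to the best neighbor on the
# current grid; when stuck, halve the grid spacing.
def grid_search(cost, lo, hi, min_step=1):
    best = lo
    step = (hi - lo) // 4 or 1          # start with a coarse grid
    while True:
        candidates = [x for x in (best - step, best, best + step)
                      if lo <= x <= hi]
        nxt = min(candidates, key=cost)  # greedy step on the current grid
        if nxt != best:
            best = nxt
        elif step > min_step:
            step = max(min_step, step // 2)  # refine when no progress
        else:
            return best

# Smooth cost with a minimum at 96 (think: an ideal block size):
best = grid_search(lambda x: (x - 96) ** 2, 1, 256)
```

On a smooth space like the one shown two slides back, this converges quickly; the high-frequency alignment ripples are what the coarse initial grid steps over.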
Performance Results
36
Sequoia Benchmarks
37
Linear Algebra: BLAS Level 1 SAXPY, Level 2 SGEMV, and Level 3 SGEMM
Conv2D: 2D single-precision convolution with 9x9 support (non-periodic boundary constraints)
FFT3D: complex single-precision FFT
Gravity: 100 time steps of an N-body (N^2) stellar dynamics simulation, single precision
HMMER: fuzzy protein string matching using HMM evaluation (Horn et al., SC2005)
Single Runtime Configurations
38
Scalar: 2.4 GHz Intel Pentium 4 Xeon, 1GB
8-way SMP: 4 dual-core 2.66 GHz Intel P4 Xeons, 8GB
Disk: 2.4 GHz Intel P4, 160GB disk, ~50MB/s from disk
Cluster: 16 Intel 2.4 GHz P4 Xeons, 1GB/node, Infiniband interconnect (780MB/s)
Cell: 3.2 GHz IBM Cell blade (1 Cell, 8 SPEs), 1GB
PS3: 3.2 GHz Cell in a Sony Playstation 3 (6 SPEs), 256MB (160MB usable)
Single Runtime Configurations - GFLOPS
39
          Scalar  SMP   Disk   Cluster  Cell  PS3
SAXPY     0.3     0.7   0.007  4.9      3.5   3.1
SGEMV     1.1     1.7   0.04   12       12    10
SGEMM     6.9     45    5.5    91       119   94
CONV2D    1.9     7.8   0.6    24       85    62
FFT3D     0.7     3.9   0.05   5.5      54    31
GRAVITY   4.8     40    3.7    68       97    71
HMMER     0.9     11    0.9    12       12    7.1
SGEMM Performance
40
Cluster: Intel Cluster MKL 101 GFlop/s; Sequoia 91 GFlop/s
SMP: Intel MKL 44 GFlop/s; Sequoia 45 GFlop/s
FFT3D Performance
41
Cell: Mercury Computer 58 GFlop/s; FFTW 3.2 alpha 2 35 GFlop/s; Sequoia 54 GFlop/s
Cluster: FFTW 3.2 alpha 2 5.3 GFlop/s; Sequoia 5.5 GFlop/s
SMP: FFTW 3.2 alpha 2 4.2 GFlop/s; Sequoia 3.9 GFlop/s
Best Known Implementations
42
HMMER: ATI X1900XT 9.4 GFlop/s (Horn et al. 2005); Sequoia Cell 12 GFlop/s; Sequoia SMP 11 GFlop/s
Gravity: Grape-6A 2 billion interactions/s (Fukushige et al. 2005); Sequoia Cell 4 billion interactions/s; Sequoia PS3 3 billion interactions/s
Multi-Runtime System Configurations
43
Cluster of SMPs: four 2-way 3.16 GHz Intel Pentium 4 Xeons connected via GigE (80MB/s peak)
Disk + PS3: a Sony Playstation 3 bringing data from disk (~30MB/s)
Cluster of PS3s: two Sony Playstation 3s connected via GigE (60MB/s peak)
SMP vs. Cluster of SMPs (GFLOPS)
44

          Cluster of SMPs  SMP
SAXPY     1.9              0.7
SGEMV     4.4              1.7
SGEMM     48               45
CONV2D    4.8              7.8
FFT3D     1.1              3.9
GRAVITY   50               40
HMMER     14               11

Same number of total processors.
Compute-limited applications are agnostic to the interconnect.
Disk+PS3 Comparison (GFLOPS)
45

          Disk+PS3  PS3
SAXPY     0.004     3.1
SGEMV     0.014     10
SGEMM     3.7       94
CONV2D    0.48      62
FFT3D     0.05      31
GRAVITY   66        71
HMMER     8.3       7.1

Some applications have the computational intensity to run from disk with little slowdown.
Extensions for Irregular Parallelism
46
Regular vs. Irregular Parallelism
Regular computations: statically known working sets, statically known communication patterns, predictable running times
Irregular computations: dynamically determined working sets, dynamically determined communication patterns, unpredictable running times
Regular applications provide scalability, but large applications still have irregular components that need to be parallelized
47
Spawn Statement
Spawn launches an unbounded number of tasks
Continue launching until the termination condition is met

task<inner> void performWork()
{
  // ...
  // spawn(task to be run, termination condition)
  spawn(performWork(this), workQueue.isEmpty());
  // ...
}
48
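The spawn-until-done behavior can be sketched with a toy scheduler. This is not Sequoia's runtime: `run_spawn` is a hypothetical helper that keeps launching a task until the termination condition holds (with a cap as a safety net).

```python
# Launch tasks repeatedly until the termination condition is met.
from collections import deque

def run_spawn(task, done, max_tasks=1000):
    launched = 0
    while not done() and launched < max_tasks:  # spawn until condition met
        task()
        launched += 1
    return launched

work = deque(range(5))

def perform_work():
    if work:
        work.popleft()          # each task consumes one unit of work

n = run_spawn(perform_work, lambda: len(work) == 0)
```

The count of launched tasks is not known statically; it depends on how much work the queue holds, which is exactly the irregular-parallelism case spawn targets.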
Parent Pointers and Call-Up
Parent pointers provide a child with a way to name the parent address space
A call-up is a task that runs atomically in the parent's address space
Maintains the abstract machine model

task<leaf> void handleWork(parent WorkList *wl)
{
  // Perform call-up to retrieve work
  vector<Work> localWorkQueue = wl->getWork();

  /* Perform Work... */

  // Call-up to add back extra work
  wl->addWork(localWorkQueue);
}
49
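The call-up pattern can be sketched with threads standing in for child address spaces. The `WorkList` class here is hypothetical (modeled on the slide's getWork/addWork), and a lock models the atomicity of call-ups into the parent.

```python
# Children atomically call up into a parent work list under a lock.
import threading

class WorkList:
    def __init__(self, items):
        self.items = list(items)
        self.lock = threading.Lock()    # call-ups run atomically

    def get_work(self, n):
        with self.lock:
            grabbed, self.items = self.items[:n], self.items[n:]
            return grabbed

    def add_work(self, items):
        with self.lock:
            self.items.extend(items)

def child(wl, results):
    while True:
        batch = wl.get_work(2)          # call-up: retrieve work
        if not batch:
            return
        results.extend(x * x for x in batch)

wl, results = WorkList(range(8)), []
threads = [threading.Thread(target=child, args=(wl, results)) for _ in range(3)]
for t in threads: t.start()
for t in threads: t.join()
```

Serializing only the call-up (not the work itself) is what keeps the abstract machine model intact while still allowing dynamic work distribution.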
Case Study: Boolean Satisfiability (SAT)
SAT is useful in many industrial applications:
CAD tools
Model checkers / static analyses
SAT is a search problem
SAT is an irregular application: dynamic working set (partial assignments), dynamic communication (learned clauses), unknown running times (solving partial assignments)
(x1 + ¬x3 + x4) ^ (¬x2 + x3 + x5) ^ (x4 + ¬x5 + ¬x6)
50
Parallel SAT Solving
Spawn many sequential SAT solvers [1]
Periodically call up to update assumptions
[1] We use the Mini-Sat sequential solver
[Figure: level 0 holds the formula, e.g. (x1 + x2 + ¬x3); multiple sequential solvers run on subproblems at level 1.]
51
Performance Results for SAT
53
Speedups over the Mini-SAT sequential solver (2008 SAT Champion)
[Chart: speedup over sequential vs. number of processors.]
Case Study: Parallel Sorting
[Figure: an unsorted array — 17, 2, 16, 29, 8, 21, 5 — being partitioned.]
Generalized quicksort algorithm
Sorting is an irregular application: dynamic working set (partitions), dynamic communication (swizzle), unknown execution time (partition sort)
54
Sorting Performance
Sort 2^27 integers (vs. STL sort)
Compare to the original Sequoia
55
Dynamic Load Balancing with Spawn
Employ a re-spawn heuristic
Dynamic load balancing with enough tasks
56
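The re-spawn heuristic can be sketched with a task pool: each quicksort partition is a task, and a task that is still large re-spawns its halves instead of sorting them itself, so there are always enough tasks to balance. This is a toy sequential model of the scheduling, with an assumed `grain` cutoff.

```python
# Quicksort partitions as re-spawned tasks in a shared pool.
from collections import deque

def parallel_quicksort(data, grain=4):
    tasks = deque([(0, len(data))])     # (lo, hi) partitions as tasks
    while tasks:
        lo, hi = tasks.popleft()        # any idle worker could take this
        if hi - lo <= grain:
            data[lo:hi] = sorted(data[lo:hi])   # small: sort in place
            continue
        pivot = data[lo]
        left = [x for x in data[lo:hi] if x < pivot]
        mid = [x for x in data[lo:hi] if x == pivot]
        right = [x for x in data[lo:hi] if x > pivot]
        data[lo:hi] = left + mid + right
        tasks.append((lo, lo + len(left)))            # re-spawn both halves
        tasks.append((lo + len(left) + len(mid), hi))
    return data

out = parallel_quicksort([17, 2, 16, 29, 8, 21, 5, 3])
```

Because partition sizes are data-dependent, the number and size of tasks is unknown in advance, which is why a static assignment would load-balance poorly here.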
Case Study: Sparse Matrix Multiply
57
A sparse matrix multiplied with a dense vector
No optimizations for specific sparsity patterns
Dynamically allocate chunks of rows to children, to handle the case where some rows have more non-zero elements than others: children have an initial assignment of rows and use work stealing when finished
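The dynamic row-chunking scheme can be sketched as follows. The CSR layout and chunk scheduler here are toy stand-ins, not Sequoia's implementation: idle children grab the next unclaimed chunk of rows from a shared counter, so a few dense rows do not stall one worker.

```python
# SpMV with dynamic assignment of row chunks to worker threads.
import itertools, threading

def spmv_dynamic(indptr, indices, vals, x, n_workers=4, chunk=2):
    n = len(indptr) - 1
    y = [0.0] * n
    counter = itertools.count(0)        # next chunk index; next() is atomic
                                        # in CPython (an assumption we rely on)
    def worker():
        while True:
            c = next(counter)           # claim the next chunk of rows
            lo = c * chunk
            if lo >= n:
                return
            for row in range(lo, min(lo + chunk, n)):
                acc = 0.0
                for k in range(indptr[row], indptr[row + 1]):
                    acc += vals[k] * x[indices[k]]
                y[row] = acc            # each row written by one worker only

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads: t.start()
    for t in threads: t.join()
    return y

# 3x3 matrix [[1,0,2],[0,3,0],[4,0,5]] in CSR form, times x = [1,1,1]:
y = spmv_dynamic([0, 2, 3, 5], [0, 2, 1, 0, 2],
                 [1., 2., 3., 4., 5.], [1., 1., 1.])
```

Rows are disjoint, so no synchronization is needed on the output; only chunk claiming is shared, which keeps the scheme cheap even when row lengths vary wildly.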
Performance Results for SpMV
58
Speedups over sequential OSKI code (excluding OSKI's tuning time)
Similar to other results for SpMV; inherently memory bound
[Chart: speedup over sequential vs. number of processors.]
Areas of Future Research
59
Memory Management
What abstractions are presented for each address space? Stack? Heap? GC? Persistence?
How to represent distributed data structures and arrays in virtual levels? A major problem for clusters with distributed memory
How to handle pointer data structures that need to be partitioned?
Better mechanisms for communicating locality?
60
Sequoia: DSL Compiler Target?
61
Can DSL compilers perform domain-specific optimizations in a machine-agnostic context?
[Figure: DSL A and DSL B compilers, each with domain-specific optimizations, target Sequoia; the Sequoia compiler applies machine optimizations and targets MPI, P-Threads, CUDA, OpenCL, and Disk.]
Conclusions
Programming to an abstract memory hierarchy provides both locality information and portability
Sequoia provides a general framework for autotuning in deep memory hierarchies
Constructs for irregular parallelism are important for obtaining good performance
62
Questions?http://sequoia.stanford.edu
63
Back-up Slides
64
Locality Aware Programming
Specify functionally independent tasks
Call-by-value-result semantics
Couple locality information with explicit parallelism
Provide tunable variables for machine independence
65
Tunables
Provide a mechanism for specifying machine-dependent variables
Two flavors: integer tunables for control, and task tunables for task variants
Specified in the mapping file
66
Sequoia Rough Edges
Memory Management: what abstractions are presented for each address space?
Abstract Machine Model: is it too abstract? What operations would be allowed if it were slightly less abstract?
Pipeline Parallelism: could Sequoia support something like a GRAMPS graph of queues?
67
Case Study: Fluid Simulation
Fluidanimate application from PARSEC [1]
3-D fluid flow simulated by particles; space is partitioned into cells
Fluid is an irregular application: dynamic working set (cells per grid), dynamic communication (ghost cells), unknown running time (particles per cell)
[1] PARSEC benchmark suite: parsec.cs.princeton.edu
68
The Ghost Cell Problem
All copies of a ghost cell must be reduced sequentially; different ghost cells can be reduced in parallel
Call-up serializes these parallel reductions
Future research: prove independence of call-ups at compile time to allow concurrent call-up execution
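The independence structure above can be sketched with per-cell locks. This is a toy model under assumed names (`reduce_ghost_cells`, string cell ids): contributions to the same ghost cell are combined one at a time, while distinct cells proceed in parallel — the concurrency a single serializing call-up gives up.

```python
# Per-cell locks: serialize same-cell updates, parallelize distinct cells.
import threading

def reduce_ghost_cells(contributions):
    """contributions: list of (cell_id, value) pairs from all children."""
    totals, locks = {}, {}
    for cell, _ in contributions:
        locks.setdefault(cell, threading.Lock())    # one lock per cell
        totals.setdefault(cell, 0.0)

    def add(cell, value):
        with locks[cell]:              # only same-cell updates serialize
            totals[cell] += value

    threads = [threading.Thread(target=add, args=c) for c in contributions]
    for t in threads: t.start()
    for t in threads: t.join()
    return totals

totals = reduce_ghost_cells([("g0", 1.0), ("g0", 2.0), ("g1", 5.0)])
```

Proving at compile time that two call-ups touch disjoint cells is what would let the runtime use this finer-grained scheme safely, which is the research direction the slide names.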
69