Date post: | 21-Dec-2015 |
Category: |
Documents |
View: | 216 times |
Download: | 0 times |
April, 2011 Cartesian Computations
Cartesian Computations
Larry Carter
with thanks to
Bowen Alpern, Jeanne Ferrante,
Kang Su Gatlin, Nick Mitchell, Michelle Strout
Cartesian Computations2
Outline
• Architectural trends
– Computation is (almost) free
– Data movement and memory aren’t
• Cartesian computations – Examples– Static, tidy and lazy implementations
• Analysis of Cartesian computations– Computation density and temp data ratio– Some theorems
• Application to examples
Cartesian Computations3
Technology trends
Everything’s getting smaller
Energy consumption limits computation speed
Need to keep chip from melting
Electricity costs are non-trivial!
Energy is dominated by cost to move data
In 32 nm technology (available today): 0.1 pJ (picoJoule) to add two 32-bit numbers
2 pJ to move a 32-bit number 1 mm on chip
320 pJ to move a 32-bit number onto or off a chipthese numbers may be reduced by 10x with heroic methods, e.g. very low
power&speed
m mm µm nm Å
chip light atomfeature width
Cartesian Computations4
Computation is (almost) free
On a 20 x 20 mm chip, you can fit:
1,000,000 32-bit adders, … or …
256 MB fast memory (SRAM)
Memory is what takes up area
For 100 Watts of power, you can
Perform 1015 additions per second (one petaOp), … or …
Read data from DRAM at 128 GB/sec
Memory and moving data is what consumes power
Cartesian Computations5
(Super)computer architecture
Core = CPU + caches
Chip = bunch of coresHigh bandwidth interconnect
Node = bunch of chips + DRAMTb/sec bandwidth between chips
Less bandwidth to memory
System = bunch of nodesSwitch or grid connection
OK bandwidth for block transfer
ProcL1
L2
P
L1
L2
P
L1
L2
P
L1
L2
P
L1
L2
DRAM
Cartesian Computations6
Models of computation
RAM
PRAM
Proc
Memory
Memory
Proc Proc Proc Proc
Memory
Cartesian Computations7
The point being …
(P)RAM analysis counts arithmetic operations, ignoring cost of memory accesses and data movement
Real costs, both time and energy, are dominated by memory accesses and data movementThese costs vary by over 100x depending on where data
goes
Algorithm analysis should focus on data movement!
The rest of this talk will study data movement for a common class of scientific computation kernels
not even considering disk
Cartesian Computations8
Outline• Architectural trends
– Computation is (almost) free
– Data movement and memory aren’t
• Cartesian computations – Examples– Static, tidy and lazy implementations
• Analysis of Cartesian computations– Computation density and temp data ratio– Some theorems
• Application to examples
Cartesian Computations9
An incomplete taxonomy of kernels
Compute intensive E.g. well-tuned matrix multiply, seismic migration, …Runtime is dominated by core speed
Streaming applications E.g. simple encoding, scanning, one pass of quicksort, …Runtime dominated by sequential access to memory
Dynamic data structures E.g. binary search trees, quadtrees, graph algorithmsRuntime dominated by random access to memory
and …
Cartesian Computations10
Cartesian computation
Given large data structures A and B
A generates temporary data T, which updates B
For each chunk of A and B, there ’s a “tile” of computation
There may be data dependences, but we’ll assume that reordering is possible.
In a uniform Cartesian computation, for all tiles of a given size, the amount of computation is roughly the same.
A
B
Cartesian Computations11
A messy detail
For very small tiles, the uniformity assumption fails
How small is “very small” depends on the computation
Most very small tiles will have no temp data or work
If we only execute non-empty very small tiles, there is a savings in communication costs.
But using very small tiles requires much more communication than using large tiles
• In this talk, we assume that such a strategy is too expensive!
• We also leave the open the problem of non-uniform distributions.
Cartesian Computations12
Sparse matrix-vector product y = Mx
A = input vector (x), B = output vector (y)
T might be pairs of the form {i, Mij xj}
Matrix M could be:
known implicitly (e.g. regular grid), or
streamed in from local memory
stored in order of use
A tile is a rectangular subset of M
The memory characteristics of y = Mx
depend on the density & distribution of nonzeros
depend on the implementation
y
x
** *
**
M
Cartesian Computations13
A different (?) example
Histogram (from NAS Integer Sort Benchmark)
For i=1 to numkeys Tally[Key[i]]++
Key is A; Tally is B;
Temp data is indices into tile’s subset of tally
Transpose, Bit Reversal (used in FFTs), and other Permutations are similar
0 1 2 3 4 5 6 7 8 9
Tally
3
1
4
1
5
9
2
6
5
3
5
8
9
7
5
65
Key
Cartesian Computations14
Generalization: Map-Reduce
Input A is a set of (key,value) pairs
Distributed by “key” in big computer
“Map” takes each (k,v) to a list {(k1,v1), (k2,v2), … }
Resulting data is sorted according to new keys
Sj = {v | (kj, v) comes from Map}
This is intermediate data of Cartesian computation
“Reduce” is applied to each Sj, producing a new pair (kj, Reduce(Sj))
(or more generally, any number of new pairs)
This is output set B
Cartesian Computations15
In general …
In a uniform Cartesian computation,
– chunks of A are used to generate temporary datamay need extra “one-time” data too
– chunks of B consume the temporary dataN.B. the intermediate values of B are not called temp data
– every chunk of A generates data for every chunk of B
Program for executing Cartesian computations:
for each “architectural unit” (cache, core, chip, node, …), choose
– tile size and shape
– whether to move temp data or to move chunks of A and/or B
– how to use multiple processors and share memory
Cartesian Computations16
Implementing Cartesian computationsTidy Implementation
– Temp data is consumed as soon as it is generated• relevant chunks of both A and B must be in local memory
– A and B chunks are copied or moved so each is together at some time
Temp data doesn’t move; but each bit of A and B may move many times
• particularly if the tiles are small
Lazy Implementation– Bring bits of A into a processor; generate temp data
– Move temp either:• To another processor, or• To some level of cache memory
– Consume temp data where or when appropriate chunk of B is available
Temp data moves, but A and B may move less than for tidy program
Cartesian Computations17
Two types of lazy implementation
1. Static partitioning:
– Partition B among memory of all “architectural units”B must fit in totality of these memories
– Stream A in a few bits at a time, generate temp data and send it to unit owning appropriate chunk of B
A and B only moved once Temp data may need to travel long distance
Aside: we could swap the roles of A and B.
2. Bucket tiling:• Stream A in, generate & store temp data in local memory
Store A in bucket depending on which chunk of B will consume it
• Later, bring B in a chunk at a time, consume temp data
• Repeat the above until done
A and/or B may move in & out many times .
Cartesian Computations18
Bucket Tiling Example: histogram
For i=1 to numkeys Tally[Key[i]]++
becomes … For ii=0 to numKeys by chunk_size // Break set of Keys into
chunks
Bend[*] = 0
For i=ii to ii + chunk_size // Put keys into buckets
k = Key[i]>>16 // (indexed by high bits)
Temp[k][Bend[k]++] = Key[i] && mask
For k = 0 to numBuckets // Process each bucket
For i = 0 to Bend[k]
Tally[(k<<16)+Temp[k][i]]++
Choose chunk_size so that Temp fits in L2 cache;
Choose “16” so random accesses to Tally are in L1 cache
Code ran 3x faster (very old experiment)
Cartesian Computations19
Hybrid implementations
Many variants are possible:
Different strategies at different levels of granularity
E.g. tidy implementation within node; lazy within chip
If a chunk of A (or B) is needed in several places, it can either move from place or there can be duplicate copies
If B has copies, they need to be combined at end.
Different strategies for different regions
E.g., if (non-uniform) problem has dense regions, use static partitioning but duplicate denser parts of A and/or B to reduce communication
Cartesian Computations20
Outline
• Architectural trends
– Computation is (almost) free
– Data movement and memory aren’t
• Cartesian computations – Examples– Static, tidy and lazy implementations
• Analysis of Cartesian computations– Computation density and temp data ratio– Some theorems
• Application to examples
Cartesian Computations21
Data movement in Cartesian computations
Focus on moving of A, B, and temp data …
ignore other data
… into an architectural component
cache, core, chip, node
Same analysis for sequential or parallel execution.
Useful metrics:
Bits moved into component gives approximate energy or power requirement (power =
energy/time)
ignores difference between moving from DRAM vs. another chip
ignores difference between near and distant network moves
Computation/communication ratio= ops/sec ÷ bits/sec = speed ÷ bandwidth
DRAM
Cartesian Computations22
Quantifying data movement
Measure chunks of A and B in bits
Tiles area is measured in “square bits”
In a uniform problem, tile area is proportional to work
Define compute density = work/bits2 (e.g. floatOps/bit2)
Measure temp data in bits
Suppose an a x b tile generates t bits temp data.
Define (temp) data density d = t/ab (d is “bits/bit2”)
In a uniform problem, d is independent of tile choice
Suppose all temp data is moved off-chip (or off-core or off-node):
(almost happen in static partitioning – occasionally B is local)
bits moved = abd
compute speed / bandwidth ≤ /d
Cartesian Computations23
y = Mx example
Suppose x and y are 64-bit floats.
Suppose 1/400 of matrix entries are non-zerosimilar to NAS cg class B benchmark
= 2 floatOps/(400 x 64 bits x 64 bits)
≈ 1.2 x 10-6 floatOp/bit2
Suppose temp data is {64 bit float, 32-bit index}
d = 96 bits/(400 x 64 x 64 bit2)
≈ 58 x 10-6 bit-1
Note: /d = 1 floatOp/48 bits
Typical supercomputer has only 4 bit DRAM bandwidth per floatOp
y
x
** *
**
M
≈ compute/communication for static partitioning
Cartesian Computations24
Lower bound on tidy implementationRecall: the intermediate data is consumed immediately in the
same component as it is produced.
And, every square bit of A x B co-resides in memory at some time
Theorem: If a machine component with memory capacity c executes an a x b tile tidily, a>c and b>c, then ab/c bits must move into the component during the computation, even if the memory contained data at the start.
Note: This is better than abd (static partitioning) if 1/c < d
Corollary: computation speed ≤ x memory capacity x bandwidth
Cartesian Computations25
Proof (if memory is initially empty):
Assign a penny to each square bit (i,j) in tile total of ab pennies
The first time i and j are in the same core, give penny to whichever load operation brought i or j in later
no load can get more than c pennies
Thus, there must be at least ab/c load operations.
Proof (general case):
Messy.
Cartesian Computations26
Upper bound on tidy implementations
Tiling an a x b Cartesian problem:
For all chunks Bi of B of size c(1-)
Move Bi into local memory
Stream through all nibbles of A of size
Execute tile nibble x Bi
B moves into component once b bits of B are moved
At most b/c(1-) + 1 chunks of B ab/c(1-) + a bits of A moved
Temp data is always local.
Total: ab/c(1-) + a + b bits
For large a and b and small , this is about ab/c
matching the lower bound
Cartesian Computations27
Hierarchical tiling example
Partition A among DRAM of nodes
Each cache-sized chunk of B makes a tour of all nodes
When at a node,
chunk resides in cache of one core
local part of A streams into core
relevant tile is executed
Data movement of B between nodes is ab / DRAM sizecorresponds to horizontal lines in picture
Movement of A into cores is ab / cache sizevertical lines in picture
DRAMsize
Cachesize
B
A
Cartesian Computations28
Lower bound for bucket tiling
Recall: Bucket tiling has only two operations:
move bits of A into component, generate temp data
move bits of B into component, consume temp data
Theorem: A bucket tiled uniform Cartesian computation, where temp data is stored in a component with memory capacity c, moves at least ab(2d/c)½ bits of A and B into the component.
Proof: See Carter & Gatlin, FOCS ’98
But (d/c)½ is the geometric mean of d and 1/c, so
this is worse than abd (static partitioning) or ab/c (tidy) or both.
Possible advantage over static partitioning:
If B doesn’t fit in sum of local memories, can’t static partition
Bucket tiling communication is to nearest neighbor
tilingstatic partitioning
Cartesian Computations29
Upper bound on bucket tilingPartition space into m x n tiles
Entire a x b problem needs ab/mn tiles
Tile generates mnd bits of temp data; must be ≤ c
For each tile:
Move m bits of A into node and create temp data from it
Consume temp data by moving and updating n bits of B m+n bits of A and B moved per tile (temp data stays in
cache)
Data movement is minimized if m = n = (c/d)½
so total communication = ab/(c/d) x 2(c/d) ½ = 2ab (d/c)½
6x6 tile needs 12 bits moved(3 bits/point)
4x9 tile needs 13 bits moved
Cartesian Computations30
(Theoretically) better implementation
Form triangular tile with side m
m(m-1)d/2 temp data; must be ≤ c; so m (2c/d)½
Slide tile along by deleting one column, placing one row
communication = ab (2d/c)½
sqrt(2) times better (4 vs 3 points/bit moved in example)matches lower bound
delete generate
Cartesian Computations31
Which implementation is best?
blahblahblah Method
Bits of cache to shared memory
Bits to nearest neighbor node
Bits to random node
Static partitioning
(relatively small)
none ~abd
Bucket tiling abd 2ab(d/c2)½ none
Tidy ab/c1 ab/c2 none
Each core has cache of size c1
Node has memory of size c2 >> c1
a = |A| > c2, b = |B| > c2
d = temp data ratio
NetworkDRAM
Cartesian Computations32
Sparse Matrix-Vector example
blahblahblah Method
Bits of cache to shared memory
Bits to nearest neighbor node
Bits to random node
Static partitioning
(relatively small)
none abd = ab/2KB Worst!
Bucket tiling abd 2ab(d/c2)½ =
ab×.015/c2½
none
Tidy ab/c1
c1 > 2KB
ab/c2
c2 >> 2 KB
none
Each core has cache of size c1
Node has memory of size c2 >> c1
a = |A| > c2, b = |B| > c2
d = 58 x 10-6 (2KB)-1
Cartesian Computations33
Histogram example
Suppose A is 64-bit integers and B is 16-bit counters
There’s one increment operation per row
An a x b tile (a and b are bits) represents a/64 rows
In a uniform problem, tile contains b/|B| of indices
so tile has (a/64)x(b/|B|) = ab/64|B| “hits”
= work/bit2 = (ab/64|B|)/ab = 1 / 64|B| adds/bit2
Suppose we need 32-bit indices for temp data
(fewer bits are needed since index is limited in tile)
d = t/ab = (ab/64|B|) x 32/ab = 1 / (2|B|)
Note: /d = 1 add/32 bits (not at all surprising) ≈ compute/communication for static partitioning
|B| is size of B in bits
5
65
Key
Tally
Cartesian Computations34
Histogram example
blahblahblah Method
Bits of cache to shared memory
Bits to nearest neighbor node
Bits to random node
Static partitioning
(relatively small)
none abd = a/2 BEST!
Bucket tiling abd = a/2 2ab(d/c2)½
= a (2b/c2)½ > a
none
Tidy ab/c1
b/c1 >> 1
ab/c2
b/c2 > 1
none
Each core has cache of size c1
Node has memory of size c2 >> c1
a = |A| > c2, b = |B| > c2
d = 1/(2b)
Cartesian Computations35
Conclusions, etc
Computation is free
A theory for data movement (analogous to a theory of computation) is needed for today’s architectures
For Cartesian computations, we can relate the three orthogonal aspects of computers:computation speedbandwidthmemory
and choose algorithms accordingly
Results are needed for other classes of computations !
And architectural features (e.g. sequential access)!
Cartesian Computations36
Backup slides
Cartesian Computations37
Implementing Cartesian computations(put some words here)
P1
P1
P1
P2
P2
P2
P3
P3
P3
P4
P4
P4
Static partitioning Tidy (tiling)
B
A2->1
1->4
2->3
1->2 1->3
etc.
Cartesian Computations38
Algorithmic vs. memory analysis
• MergeSort or QuickSort
– O(N lg N) operations
– Sequential memory access
• BucketSort or CountSort
– O(N) operations
– Random memory access
Cartesian Computations39
Algorithmic vs. memory analysis• Sandia study of Linear Algebra algorithms
Millions of floating point ops needed for sample problem
– 1950: 1,000,000,000,000 Mops (Cramer’s rule)
– 1965: 10,000,000 Mops (Gaussian Elimination)
– 1975: 300 Mops (Gauss-Seidel)
– 1985: 8 Mop (Conjugate Gradient)
• Each algorithmic improvement results in less locality in memory references.
Cartesian Computations40
Integer Sort Example• Loop from NAS IS benchmarkFor i=1 to numkeys Tally[Key[i]]++ Random access to Tally (BAD)
•
Bucket-tiling*
For i=1 to numkeys // Count keys per bucket Bcount[Key[i]>>12]++ Bcount in cache (GOOD); Key is sequential (OK)For k=1 to maxKey>>12 // Find bucket end points Bend[k] = Bend[k-1]+Bcount[k] Bend fits in cache (GOOD)For i=1 to numkeys // Put keys into buckets Temp[Bend[Key[i]>>12]++] = Key[i] Bend in cache; Key and Temp are sequential (OK)For i=1 to numkeys // Count keys Tally[Temp[i]]++ Active parts of Tally stay in cache (GOOD); Temp OK Bottom Line: Code is over 2 times faster (on IBM RS6000).*Alpern & Carter ‘94, “Towards a Model for Portable Parallel Performance”
Cartesian Computations41
Inspector – Executor paradigm
• Introduced by Chaos Group at U. Md. (led by Joel Saltz) e.g. 1995 TPDS paper
• “Inspector” is runtime code that:
– Iterates through loops without executing, records information about data access order
– Reorders data in arrays or chooses new execution order
– Generates new index arrays
• “Executor” is (modified) code
– Does the actual computation
– Uses reordered data and index arrays