© 2002 IBM Corporation
Template release: Oct 02. For the latest, go to http://w3.ibm.com/ibm/presentations
Cell Broadband Engine
PACT, Tuesday, September 20th, 2005
Optimizing Compiler for the Cell Processor
Alexandre Eichenberger, Kathryn O’Brien, Kevin O’Brien, Peng Wu, Tong Chen, Peter Oden, Daniel Prener, Janice Shepherd, Byoungro So, Zehra Sura, Amy Wang, Tao Zhang, Peng Zhao, and Michael Gschwind
www.research.ibm.com/cellcompiler/compiler.htm
Optimizing Compiler for a Cell Processor Alexandre Eichenberger
Cell Broadband Engine
Multiprocessor on a chip
  241M transistors, 235 mm2
  200 GFlops (SP) @ 3.2GHz
  200 GB/s bus (internal) @ 3.2GHz
Power Proc. Element (PPE)
  general purpose
  running full-fledged OSs
Synergistic Proc. Element (SPE)
  optimized for compute density
Cell Broadband Engine Overview
[Figure: block diagram of the Cell Broadband Engine. Eight SPEs and the Power Processor Element (PPE, with L1 and L2) connect to the Element Interconnect Bus (96 Bytes/cycle), which also leads to external memory and external IO. Link widths shown: 16 Bytes (one dir), 128 Bytes (one dir), 8 Bytes (per dir).]
Heterogeneous, multi-core engine
  1 multi-threaded power processor
  up to 8 compute-intensive-ISA engines
Local Memories
  fast access to 256KB local memories
  globally coherent DMA to transfer data
Pervasive SIMD
  PPE has VMX
  SPEs are SIMD-only engines
High bandwidth
  fast internal bus (200GB/s)
  dual XDR(TM) controller (25.6GB/s)
  two configurable interfaces (76.8GB/s)
  numbers based on 3.2GHz clock rate
Supporting a Broad Range of Expertise to Program Cell
From highest performance with help from programmers, to highest productivity with fully automatic compiler technology:
PROGRAMS
  Multiple-ISA hand-tuned programs
  Automatic tuning for each ISA
SIMD
  Explicit SIMD coding
  SIMD/alignment directives
  Automatic simdization
PARALLELIZATION
  Explicit parallelization with local memories
  Shared memory, single program abstraction
  Automatic parallelization
Outline
Part 1: Automatic SPE tuning
Part 2: Automatic simdization
Part 3: Shared memory & single program abstraction
Architecture: Relevant SPE Features
[Figure: Synergistic Processing Element (SPE) block diagram. Even pipe (floating/fixed point) and odd pipe (branch, memory, permute), fed by the dual-issue instruction logic and an instruction buffer (3.5 x 32 instr); register file (128 x 16-byte registers); local store (256 KByte, single ported); globally coherent DMA. Link widths: 16 bytes (one dir), 128 bytes (one dir), 8 bytes (per dir). Numbered paths: branch: 1,2; branch hint: 1,2; instr. fetch: 2; dma request: 3.]
SIMD-only functional units
  16-byte register/memory accesses
Simplified branch architecture
  no hardware branch predictor
  compiler-managed hint/predication
Dual-issue for instructions
  full dependence check in hardware
  must be parallel & properly aligned
Single-ported local memory
  aligned accesses only
  contentions alleviated by compiler
Feature 1: Single-Ported Local Memory
[Figure: SPE block diagram as on the previous slide, with the Ifetch path out of the local store labeled. Link widths: 16 Bytes (one dir), 128 Bytes (one dir), 8 Bytes (per dir).]
Local store is single ported
  less expensive hardware
Asymmetric port
  16 bytes for load/store ops
  128 bytes for IFETCH/DMA
Static priority
  DMA > MEM > IFETCH
If we are not careful, we may starve for instructions
Instruction Starvation Situation
[Figure: instruction buffers filled with FP/MEM op pairs feeding the dual-issue instruction logic; a marker shows "initiate refill after half empty".]
There are 2 instruction buffers
  up to 64 ops along the fall-through path
First buffer is half-empty
  can initiate refill
When MEM port is continuously used
  starvation occurs (no ops left in buffers)
Instruction Starvation Prevention
[Figure: instruction buffer filled with FP/MEM op pairs feeding the dual-issue instruction logic. Markers: "initiate refill after half empty", "refill IFETCH latency", and "before it is too late to hide latency", delimiting the window in which the IFETCH op must be placed.]
SPE has an explicit IFETCH op
  which initiates an instruction fetch
Scheduler monitors starvation situation
  when MEM port is continuously used
  insert IFETCH op within the (red) window
Compiler design
  scheduler must keep track of code layout
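The starvation mechanism and its prevention can be sketched with a toy model. This is only an illustration under stated assumptions: the arbiter encodes the static priority quoted earlier (DMA > MEM > IFETCH), and `insert_ifetch_slots` is a hypothetical scheduler pass that opens one slot for instruction fetch after a fixed number of consecutive load/store cycles. The real scheduler tracks buffer occupancy and code layout, which this sketch does not model.

```c
#include <assert.h>
#include <stddef.h>

/* Requesters of the single local-store port, highest priority first
 * (DMA > MEM > IFETCH, as stated on the slides). */
enum port_user { PORT_DMA, PORT_MEM, PORT_IFETCH, PORT_IDLE };

/* Static-priority arbiter: grants the port to the highest-priority requester.
 * An instruction fetch only wins when neither DMA nor a load/store asks. */
enum port_user arbitrate(int dma_req, int mem_req, int ifetch_req)
{
    if (dma_req)    return PORT_DMA;
    if (mem_req)    return PORT_MEM;
    if (ifetch_req) return PORT_IFETCH;
    return PORT_IDLE;
}

/* Toy scheduler pass (hypothetical): mem[i] is 1 when cycle i issues a
 * load/store. After 'window' consecutive MEM cycles, emit one slot marked 2
 * (an explicit IFETCH op) before the next load/store, so that the fetch can
 * use the port. Returns the new schedule length; 'out' must have room for
 * n + n / window entries. */
size_t insert_ifetch_slots(const int *mem, size_t n, size_t window, int *out)
{
    assert(window > 0);
    size_t len = 0, run = 0;
    for (size_t i = 0; i < n; i++) {
        if (mem[i] && run == window) {
            out[len++] = 2;          /* explicit IFETCH slot */
            run = 0;
        }
        out[len++] = mem[i];
        run = mem[i] ? run + 1 : 0;  /* length of the current MEM run */
    }
    return len;
}
```

With eight back-to-back load/store cycles and a window of 3, the pass emits two IFETCH slots, one after each run of three memory operations.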
Feature #2: Software-Assisted Branch Architecture
[Figure: SPE block diagram as on the "Architecture: Relevant SPE Features" slide. Numbered paths: branch: 1,2; branch hint: 1,2; instr. fetch: 2; dma request: 3.]
Branch architecture
  no hardware branch-predictor, but
  compare/select ops for predication
  software-managed branch-hint
  one hint active at a time
Lowering overhead by
  predicating small if-then-else
  hinting predictably taken branches
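Predicating a small if-then-else with compare/select ops can be illustrated in portable C. The sketch below is an assumption-laden model, not SPE code: `cmp_gt` stands in for a compare op that yields an all-ones or all-zeros mask, `select_bits` for a bitwise select, and all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Compare: all-ones mask (-1) when a > b, all-zeros otherwise.
 * Models a compare op that writes a mask instead of setting a flag. */
static int32_t cmp_gt(int32_t a, int32_t b)
{
    return -(int32_t)(a > b);
}

/* Select: bits of 'if_true' where the mask is 1, bits of 'if_false' elsewhere. */
static int32_t select_bits(int32_t if_false, int32_t if_true, int32_t mask)
{
    return (if_true & mask) | (if_false & ~mask);
}

/* Original form: a small if-then-else with a data-dependent branch. */
int32_t max_branch(int32_t a, int32_t b)
{
    if (a > b)
        return a;
    else
        return b;
}

/* Predicated form: compute the mask, then select. No branch remains,
 * so there is nothing to predict or hint. */
int32_t max_select(int32_t a, int32_t b)
{
    return select_bits(b, a, cmp_gt(a, b));
}

/* Both forms must agree for any inputs. */
void check_max(int32_t a, int32_t b)
{
    assert(max_select(a, b) == max_branch(a, b));
}
```

The predicated form always executes both the compare and the select, which is why it pays off mainly for small if-then-else bodies.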
Hinting Branches & Instruction Starvation Prevention
[Figure: instruction buffers and a HINT buffer, all filled with FP/MEM op pairs, feeding the dual-issue instruction logic. The code sequence "HINT br, target ... BRANCH if true ... target" is annotated: the hint fetches ops from target and needs a min of 15 cycles and 8 intervening ops. The IFETCH window and the refill latency are marked.]
SPE provides a HINT operation
  fetches the branch target into HINT buffer
  no penalty for correctly predicted branches
  compiler inserts hints when beneficial
Impact on instruction starvation
  after a correctly hinted branch, IFETCH window is smaller
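The hint placement constraint quoted on the slide (a minimum of 15 cycles and 8 intervening ops between the hint and its branch) can be expressed as a simple check. This is a minimal sketch: the `sched_op` representation and the function name are hypothetical, and a real scheduler would also account for the rule that only one hint is active at a time.

```c
#include <assert.h>
#include <stdbool.h>

/* Figures quoted on the slide: a hint needs at least 15 cycles and
 * 8 intervening ops before its branch. */
enum { HINT_MIN_CYCLES = 15, HINT_MIN_OPS = 8 };

/* One op in a straight-line schedule (hypothetical representation). */
struct sched_op {
    int issue_cycle;   /* cycle at which the op issues */
};

/* Returns true when a hint placed at index 'hint' is early enough for the
 * branch at index 'branch' in the schedule 'ops'. */
bool hint_is_timely(const struct sched_op *ops, int hint, int branch)
{
    assert(hint < branch);
    int cycles  = ops[branch].issue_cycle - ops[hint].issue_cycle;
    int between = branch - hint - 1;   /* ops strictly between hint and branch */
    return cycles >= HINT_MIN_CYCLES && between >= HINT_MIN_OPS;
}
```

For a perfectly dual-issued schedule (two ops per cycle), a hint at op 0 is too late for a branch at op 20 (10 cycles apart) but early enough for a branch at op 30 (15 cycles apart).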
SPE Optimization Results
[Figure: bar chart. Y-axis: relative reductions in execution time (0.4 to 1.0). Series: Original, +Bundle, +Branch Hint, +Ifetch.]
Single SPE performance, optimized, simdized code (avg 1.00 → 0.78)
Outline
Part 1: Automatic SPE tuning
Part 2: Automatic simdization
Part 3: Shared memory & single program abstraction
Single Instruction Multiple Data (SIMD) Computation
Process multiple “b[i]+c[i]” data per operation
[Figure: arrays b (b0 ... b10) and c (c0 ... c10) in memory, with 16-byte boundaries marked. Register R1 = (b0, b1, b2, b3) and register R2 = (c0, c1, c2, c3) are added into R3 = (b0+c0, b1+c1, b2+c2, b3+c3).]
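As a concrete illustration of the picture above, the following sketch models a 16-byte register as four floats in plain C. The type `vec4f` and the function names are hypothetical stand-ins, not the VMX or SPE intrinsics; the inner loop in `vadd` represents what the hardware does in a single operation.

```c
#include <assert.h>
#include <stddef.h>

/* A 16-byte SIMD register modeled as four 32-bit floats. */
typedef struct { float lane[4]; } vec4f;

/* One SIMD add: four additions per operation (R3 = R1 + R2). */
static vec4f vadd(vec4f r1, vec4f r2)
{
    vec4f r3;
    for (int k = 0; k < 4; k++)
        r3.lane[k] = r1.lane[k] + r2.lane[k];
    return r3;
}

/* a[i] = b[i] + c[i] for n elements, processed four elements at a time.
 * Assumes n is a multiple of 4 and that the arrays are viewed as vec4f,
 * so every access starts on a 16-byte boundary. */
void add_arrays(vec4f *a, const vec4f *b, const vec4f *c, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n / 4; i++)
        a[i] = vadd(b[i], c[i]);
}
```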
Successful Simdization
Extract Parallelism
  loop level: for (i=0; i<256; i++) a[i] =
  basic-block level: a[i+0] = , a[i+1] = , a[i+2] = , a[i+3] =
  entire short loop: for (i=0; i<8; i++) a[i] =
Satisfy Constraints
  multiple targets: GENERIC, VMX, SPE
  data size conversion: SHORT to INT 1 and INT 2 (ops shown: load b[i], unpack; load a[i], add, store; load a[i+4], add, store)
  alignment constraints: vload b[1] gives (b0 b1 b2 b3), vload b[5] gives (b4 b5 b6 b7), vpermute gives (b1 b2 b3 b4)
[Figure: the SIMD add diagram from the previous slide, surrounded by the panels listed above.]
Example of SIMD-Parallelism Extraction
[Figure: the three extraction levels: loop level (for (i=0; i<256; i++) a[i] =), basic-block level (a[i+0] = ... a[i+3] =), entire short loop (for (i=0; i<8; i++) a[i] =).]
Loop level
  SIMD for a single statement across consecutive iterations
  successful at:
    efficiently handling misaligned data
    pattern recognition (reduction, linear recursion)
    leveraging loop transformations in most compilers
[Bik et al, IJPP 2002]
[VAST compiler, 2004]
[Eichenberger et al, PLDI 2004] [Wu et al, CGO 2005]
[Naishlos, GCC Developer’s Summit 2004]
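The reduction pattern mentioned under loop-level extraction can be sketched as follows. This is an illustrative model in plain C, with a hypothetical `vec4i` type standing in for a SIMD register; integer data is used so that reassociating the additions does not change the result (with floating point it can).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A 16-byte SIMD register modeled as four 32-bit integers. */
typedef struct { int32_t lane[4]; } vec4i;

/* Scalar reduction: 'sum' carries a dependence across iterations. */
int32_t sum_scalar(const int32_t *b, size_t n)
{
    int32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += b[i];
    return sum;
}

/* Simdized reduction: four partial sums, one per lane, combined once
 * after the loop. Assumes n is a multiple of 4. */
int32_t sum_simd(const int32_t *b, size_t n)
{
    assert(n % 4 == 0);
    vec4i acc = {{0, 0, 0, 0}};
    for (size_t i = 0; i < n; i += 4)
        for (int k = 0; k < 4; k++)      /* models one 4-wide vector add */
            acc.lane[k] += b[i + k];
    /* Horizontal sum of the four partial results. */
    return acc.lane[0] + acc.lane[1] + acc.lane[2] + acc.lane[3];
}
```

Recognizing the pattern is what lets the compiler break the loop-carried dependence on the accumulator into four independent lanes.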
Example of SIMD Constraints
Alignment in SIMD units matters: consider “b[i+1] + c[i+0]”
[Figure: arrays b and c in memory with 16-byte boundaries marked. vload b[1] returns the aligned vector R1 = (b0, b1, b2, b3), and c is loaded as R2 = (c0, c1, c2, c3). Adding them gives (b0+c0, b1+c1, b2+c2, b3+c3): this is not b[1] + c[0].]
Example of SIMD Constraints (cont.)
Alignment in SIMD units matters
  when alignments within inputs do not match
  must realign the data
[Figure: arrays b and c in memory with 16-byte boundaries marked. A vpermute produces R1 = (b1, b2, b3, b4), which lines up with R2 = (c0, c1, c2, c3). The add then gives (b1+c0, b2+c1, b3+c2, b4+c3).]
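The realignment step can be made concrete with a small model. The sketch below assumes element indices in place of byte addresses (a 16-byte boundary is every fourth `int`), and `vload` and `vpermute` are hypothetical C stand-ins named after the operations in the figure, not actual intrinsics.

```c
#include <assert.h>
#include <stddef.h>

/* A 16-byte SIMD register modeled as four ints. */
typedef struct { int lane[4]; } vec4i;

/* Aligned vector load: the low bits of the index are ignored, so the load
 * always starts at a 4-element (16-byte) boundary, like vload in the figure. */
static vec4i vload(const int *base, size_t idx)
{
    vec4i v;
    const int *p = base + (idx & ~(size_t)3);
    for (int k = 0; k < 4; k++)
        v.lane[k] = p[k];
    return v;
}

/* Permute: extract four consecutive lanes starting at 'shift' from the
 * concatenation of 'lo' and 'hi'. */
static vec4i vpermute(vec4i lo, vec4i hi, int shift)
{
    vec4i v;
    for (int k = 0; k < 4; k++) {
        int s = k + shift;
        v.lane[k] = (s < 4) ? lo.lane[s] : hi.lane[s - 4];
    }
    return v;
}

/* a[i] = b[i+1] + c[i] for i in [0, n), n a multiple of 4.
 * b must have at least n + 4 readable elements because loads are whole
 * vectors, although only n + 1 of them contribute to the result. */
void add_shifted(int *a, const int *b, const int *c, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        vec4i b_lo = vload(b, i + 1);         /* (b[i]   .. b[i+3]) */
        vec4i b_hi = vload(b, i + 5);         /* (b[i+4] .. b[i+7]) */
        vec4i bv   = vpermute(b_lo, b_hi, 1); /* (b[i+1] .. b[i+4]) */
        vec4i cv   = vload(c, i);             /* (c[i]   .. c[i+3]) */
        for (int k = 0; k < 4; k++)           /* models one vector add + store */
            a[i + k] = bv.lane[k] + cv.lane[k];
    }
}
```

Without the `vpermute`, adding `b_lo` to `cv` directly would compute b[i] + c[i], which is the wrong result shown on the previous slide.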
Automatic Simdization for Cell
[Figure: the overview from the "Successful Simdization" slide: loop level, basic-block level, entire short loop, multiple targets (GENERIC, VMX, SPU, BG/L), data size conversion, alignment constraints.]
Integrated Approach
  extract at multiple levels
  satisfy all SIMD constraints
  use “virtual SIMD vector” as glue
Minimize alignment overhead
  lazily insert data reorganization
  handle compile time & runtime alignment
  simdize prologue/epilogue for SPEs
  memory accesses are always safe on SPE
Full throughput computations
  even in presence of data conversions
  manually unrolled loops...
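One way to picture a simdized prologue and epilogue is a partial store built from an aligned load, a select, and an aligned store, so that a loop whose bounds do not fall on 16-byte boundaries still uses only whole-vector accesses. The sketch below is an assumption about how this might look, not the compiler's actual code generation; `vec4i`, `store_partial`, and `fill_range` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* A 16-byte SIMD register modeled as four ints. */
typedef struct { int lane[4]; } vec4i;

/* Store lanes [first, last) of 'val' into the aligned vector at 'dst',
 * leaving the other lanes of memory unchanged: load, select, store. */
void store_partial(vec4i *dst, vec4i val, int first, int last)
{
    assert(0 <= first && first <= last && last <= 4);
    vec4i old = *dst;                        /* aligned load of current contents */
    for (int k = 0; k < 4; k++)              /* models one select op */
        old.lane[k] = (k >= first && k < last) ? val.lane[k] : old.lane[k];
    *dst = old;                              /* aligned store */
}

/* a[i] = v for element indices i in [lo, hi), where 'a' is an array of
 * aligned vectors. The first and last vectors touched may be partial. */
void fill_range(vec4i *a, size_t lo, size_t hi, int v)
{
    assert(lo <= hi);
    if (lo == hi)
        return;
    vec4i splat = {{v, v, v, v}};
    for (size_t vec = lo / 4; vec * 4 < hi; vec++) {
        size_t base = vec * 4;
        int first = (lo > base) ? (int)(lo - base) : 0;      /* prologue */
        int last  = (hi < base + 4) ? (int)(hi - base) : 4;  /* epilogue */
        store_partial(&a[vec], splat, first, last);
    }
}
```

Filling elements 2 through 8 of a 12-element array touches three vectors: a partial one (lanes 2 and 3), a full one, and a partial one (lane 0).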
SPE Simdization Results
[Figure: bar chart. Y-axis: speedup factors (0 to 30). X-axis benchmarks: Linpack, Swim-l2, FIR, Autcor, DotProduct, Checksum, AlphaBlending, Saxpy, Mat Mult, Average. Values printed on the bars, in the order extracted: 2.4, 2.5, 2.9, 2.9, 7.5, 8.1, 11.4, 25.3, 26.2, 9.9.]
Single SPE, optimized, automatic simdization vs. scalar code
Outline
Part 1: Automatic SPE tuning
Part 2: Automatic simdization
Part 3: Shared memory & single program abstraction
Cell Memory & DMA Architecture
Local stores are mapped in global address space
  PPE can access/DMA memory
  set access rights
SPE can initiate DMAs to any global addresses, including local stores of others
  translation done by MMU
Note: all elements may be masters, there are no designated slaves
[Figure: the mapped global address space holds main memory*, IO devices, QofS*/L3*, TLBs and MFC regs, and aliases to local stores #1 through #8. The PPE (with L1 and L2) issues memory requests; SPE #1 through SPE #8, each with an SPU, a local store, and an MMU, issue DMA. * = external.]
Competing for the SPE Local Store
[Figure: the local store shared by code, regular data, and irregular data.]
Local store is fast, need support when full.
Provided compiler support:
SPE code too large
  compiler partitions code
  partition manager pulls in code as needed
Data with regular accesses is too large
  compiler stages data in & out using static buffering
  can hide latencies by using double buffering
Data with irregular accesses is present
  e.g. indirection, runtime pointers...
  use a software cache approach to pull the data in & out (last resort solution)
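The double-buffering idea for regularly accessed data can be sketched as below. This is a structural illustration only: `dma_get` and `dma_put` are hypothetical stand-ins that copy immediately with `memcpy`, whereas on real hardware the transfer would be started asynchronously and waited on later, which is what lets the copy of the next chunk overlap the computation on the current one. The chunk size of 100 elements is illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { CHUNK = 100 };   /* elements staged per transfer (illustrative) */

/* Stand-ins for asynchronous DMA (hypothetical): here they copy at once. */
static void dma_get(float *local, const float *global, size_t n)
{
    memcpy(local, global, n * sizeof *local);
}

static void dma_put(float *global, const float *local, size_t n)
{
    memcpy(global, local, n * sizeof *local);
}

/* A[i] = x * B[i] for i in [0, n), n a multiple of CHUNK, using two input
 * buffers: chunk k+1 is requested before chunk k is processed. */
void scale_double_buffered(float *A, const float *B, float x, size_t n)
{
    assert(n % CHUNK == 0);
    float in[2][CHUNK], out[CHUNK];      /* buffers living in "local store" */
    size_t chunks = n / CHUNK;
    if (chunks == 0)
        return;

    dma_get(in[0], B, CHUNK);            /* prime the first buffer */
    for (size_t k = 0; k < chunks; k++) {
        size_t cur = k % 2, nxt = (k + 1) % 2;
        if (k + 1 < chunks)              /* request the next chunk early */
            dma_get(in[nxt], B + (k + 1) * CHUNK, CHUNK);
        for (size_t j = 0; j < CHUNK; j++)   /* compute on the current chunk */
            out[j] = x * in[cur][j];
        dma_put(A + k * CHUNK, out, CHUNK);  /* stage the results back out */
    }
}
```

A fuller version would double-buffer the output as well, so that the write-back of one chunk can overlap the computation of the next.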
Software Cache for Irregular Accesses
Data with irregular accesses
  cannot reside permanently in the SPE’s local memory (typically)
  thus resides in global memory
  when accessed,
    must translate the global address into a local store address
    must pull the data in/out when it is not already present
Use a software cache
  managed by the SPEs in the local store
  generate DMA requests to transfer data to/from global memory
  use a 4-way set associative cache to naturally use the SIMD units of the SPE
Software Cache Architecture
[Figure: a cache directory indexed by set (0, 1, ..., x). Each set holds four tags (e.g. set 0: a2 c2 e3 h4; set 1: b2 c3 f4 i3; set x: c4 f6 a1 j5), data pointers (e.g. d2), dirty bits, and so on. The data pointers refer to lines d1, d2, ..., dn in a separate data array; line d2 holds the value 42.]
Cache directory
  128-set, 4-way set associative
  pointers to data lines
  uses 16 KByte of data
Data in a separate structure
  512 x 128B lines
  uses 64 KByte of data
Software Cache Access
Translate this global address: addr a1
  extract & load set (a subset of addr is used by tags): set x holds tags c4 f6 a1 j5
  splat addr: a1 a1 a1 a1
  compare ‘=’ (SIMD comparison): 0000 0000 FFFF 0000
  locate data ptr. (when successful: hit): d2
  compute addr: data ptr. + addr offset, pointing into line d2 of the data array
Hit latency: ~ 20 extra cycles
[Figure: the directory and data array from the previous slide, with the lookup steps above drawn as a data flow.]
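The lookup steps above can be sketched in scalar C. The geometry (128 sets, 4 ways, 128-byte lines, hence 512 lines or 64 KByte of data) is taken from the slides; everything else is an assumption made for illustration: the tag encoding, the round-robin replacement, the use of `memcpy` in place of DMA, and the omission of dirty bits and write-back. The directory layout here is also simpler than the 16 KByte structure the slides describe.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { LINE = 128, SETS = 128, WAYS = 4 };   /* geometry from the slides */

struct cache_set {
    uint32_t tag[WAYS];    /* line-aligned global address + 1; 0 means empty */
    uint8_t *data[WAYS];   /* pointer to the line in the data array */
};

static struct cache_set dir[SETS];
static uint8_t lines[SETS * WAYS][LINE];     /* 512 x 128 B = 64 KByte */
static unsigned next_victim[SETS];           /* round-robin replacement (assumed) */

/* Translate a global address (an offset into 'global_mem', which stands in
 * for main memory and must cover the whole line) into a pointer inside the
 * local data array. Sets *hit to 1 on a hit and 0 on a miss. */
uint8_t *cache_lookup(const uint8_t *global_mem, uint32_t addr, int *hit)
{
    assert(global_mem != NULL && hit != NULL);
    uint32_t offset    = addr % LINE;             /* addr offset within the line */
    uint32_t line_addr = addr - offset;
    uint32_t tag       = line_addr + 1;
    unsigned set_index = (line_addr / LINE) % SETS;   /* extract set */
    struct cache_set *s = &dir[set_index];            /* load set */

    /* Compare the tag against all four ways. On the SPE this is one SIMD
     * comparison of the splatted address, giving a mask such as
     * 0000 0000 FFFF 0000 that locates the data pointer. */
    for (int w = 0; w < WAYS; w++) {
        if (s->tag[w] == tag) {
            *hit = 1;
            return s->data[w] + offset;           /* compute addr */
        }
    }

    /* Miss: choose a way and pull the line in (memcpy stands in for DMA). */
    unsigned w = next_victim[set_index];
    next_victim[set_index] = (w + 1) % WAYS;
    s->tag[w]  = tag;
    s->data[w] = lines[set_index * WAYS + w];
    memcpy(s->data[w], global_mem + line_addr, LINE);
    *hit = 0;
    return s->data[w] + offset;
}
```

Choosing four ways means the four tags of a set fill exactly one 16-byte register, which is the sense in which the slides say the associativity naturally uses the SIMD units.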
“Single Source” Compiler, using OpenMP
Code with OpenMP directives:
    #pragma omp parallel for
    for (i = 0; i < 10000; i++)
        A[i] = x * B[i];
Outline omp region:
    void foo()
        #pragma omp parallel for
        for (i = 0; i < 10000; i++)
            A[i] = x * B[i];
Clone for PPE:
    void foo_PPE()
        init_omp_rte();
        for (i = LB; i < UB; i++)
            A[i] = x * B[i];
Clone for SPE:
    void foo_SPE()
        for (k = LB; k < UB; k++)
            DMA 100 B elements into B´
            for (j = 0; j < 100; j++)
                A´[j] = cache_lookup(x) * B´[j];
            DMA 100 A elements out of A´
Runtime (PPE clone):
  initialize OpenMP runtime
  compute its own work
Runtime (SPE clone):
  DMA in/out array, lookup software cache
  compute its own work
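The outlining step can be illustrated with a sequential stand-in for the runtime. In this sketch the loop body is outlined into a function that takes its bounds (LB and UB on the slide) as parameters, and a hypothetical static block partition hands each worker its range. On Cell, each call would run on a different processor element, with the SPE clone additionally staging data through DMA; neither the partitioning policy nor the function names below come from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Outlined body of the parallel loop: each worker runs it on [lb, ub). */
static void foo_outlined(float *A, const float *B, float x, size_t lb, size_t ub)
{
    for (size_t i = lb; i < ub; i++)
        A[i] = x * B[i];
}

/* Static block partition of [0, n) among 'workers' (hypothetical policy). */
static void bounds(size_t n, size_t workers, size_t id, size_t *lb, size_t *ub)
{
    size_t chunk = (n + workers - 1) / workers;      /* ceiling division */
    *lb = (id * chunk < n) ? id * chunk : n;
    *ub = (*lb + chunk < n) ? *lb + chunk : n;
}

/* Sequential stand-in for the runtime: every worker computes its own
 * bounds and then its own share of the iterations. */
void foo(float *A, const float *B, float x, size_t n, size_t workers)
{
    assert(workers > 0);
    for (size_t id = 0; id < workers; id++) {
        size_t lb, ub;
        bounds(n, workers, id, &lb, &ub);
        foo_outlined(A, B, x, lb, ub);
    }
}
```

Calling `foo(A, B, 3.0f, 10000, 9)` models one PPE plus eight SPEs sharing the 10000 iterations of the example loop.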
Single Source Compiler Results
[Figure: bar chart. Y-axis: speedup with 8 SPEs (0 to 12). Legend text: softcache, optimization.]
Baseline: execution on one single PPE
Results for Swim, Mgrid, & some of their kernels
Conclusions
Cell Broadband Engine architecture
  heterogeneous parallelism
  dense compute architecture
Present the application writer with a wide range of tools
  from support to extract maximum performance
  to support to achieve maximum productivity with automatic tools
Shown respectable speedups
  using automatic tuning, simdization, and support for shared-memory abstraction
Questions
For additional info:
www.research.ibm.com/cellcompiler/compiler.htm
Extra
Compiler Support For Single Source Program
[Figure: build flow. SPE Source goes through the SPE Compiler to an SPE Object, then through the SPE Linker (with SPE Libraries) to an SPE Exec, then through the SPE Embedder to a PPE Object holding the SPE code and data. PPE Source goes through the PPE Compiler to a PPE Object. The PPE Objects and PPE Libraries go through the PPE Linker.]