CBE Architecture Overview
Martin Kreibe: [email protected]
Matthew Longley: [email protected]
Paul Snyder: [email protected]
What is CBE?
A new interpretation of multi-core processors
Development motivated by heavy graphics-based applications
Game consoles
Graphics rendering applications
Developed by a collaboration between Sony, Toshiba, and IBM (known as STI) in 2001.
Architecture Components
PPE: main processing unit; controls SPE units
EIB: communication bus
SPEs: non-control processor elements; 8 on chip
BEI: engine interface
Cell Broadband Engine Architecture
CBE Endian-ness
The CBE Architecture is big endian
[Figure: byte ordering within a quadword; byte addresses 0 through f, grouped as byte, halfword, word, doubleword, and quadword]
PowerPC Processor Element
64-bit, dual-thread PowerPC architecture
32 KB L1 cache size
512 KB L2 cache size
Instruction set extensions:
Vector/SIMD multimedia (“Altivec”)
PPU to SPU communication
Classic CPU Architecture
Synergistic Processor Elements
Operations must be allocated by PPU
“[O]ptimized for data-rich operations”, Programming Tutorial (DRAFT)
RISC core
256 KB Local Store (“LS”, holds both Instructions and Data)
Unified 128-bit, 128-entry register file.
Manual branch hinting
Special SIMD instruction set
Vector operations
DMA control
Interprocessor messaging and synchronization
“[N]ot intended to run an operating system.”, Programming Tutorial (DRAFT)
SPU Registers
General Purpose Registers (GPR): GPR 0, GPR 1, …, GPR 127
Floating-Point Status and Control Register (FPSCR)
SPU Latencies
Simple fixed point - 2 cycles
Complex fixed point - 4 cycles
Load - 6 cycles
Single-precision (ER) float - 6 cycles
Integer multiply - 7 cycles
Branch miss (no penalty for correct hint) - 20 cycles
DP (IEEE) float (partially pipelined) - 13 cycles
Enqueue DMA Command - 20 cycles
Element Interconnect Bus
Peak bandwidth: 96 bytes/cycle
Four 16-byte data rings
100+ outstanding DMA requests
Source: http://cag.csail.mit.edu/ps3/lectures/6.189-lecture2-cell.pdf
Platform Details
Cross-Unit Communication
Mailbox mechanism for synchronization
32-bit messages between SPEs
Signal Notification (inbound)
32-bit signal notification register
PPU and SPUs can retrieve data from main memory into an SPU via DMA
DMA loads are asynchronous
Environment
Hardware
PS3 (may have dead SPUs)
Multi-processor blades
Workstations and accelerator cards
Simulator
Cycle-accurate emulation of SPUs
TCL and GUI interfaces
Modified Linux environment
Using the environment
Project development was performed using a GNU GCC-based cross-compilation toolchain
Executables were tested on both the IBM Cell simulator and on PS3s running Yellow Dog Linux
Simulator is slow but functional
Code ran smoothly on PS3s
Thanks to the University of Delaware for providing access to their PS3s
Toolset
Dual GNU binutils/gcc toolchains (for PPU and SPU)
IBM XLC++ compiler (automatic vectorization)
Currently generates poorly optimized code
Static and dynamic analysis tools
Multithreaded debugger (gdb)
Cell Simulator and toolchain are provided only for Fedora Linux
We used VMware virtual machines to ease installation
A Gentoo installation package exists, but is poorly supported
Toolchain Challenges
Cell SDK Makefiles rely on custom include footers
These interface POORLY with GNU Autotools
Spiral-WHT uses GNU Autotools
MUCH time was spent analyzing the operation of the Cell Makefiles and mixing this functionality with the Autotools compilation framework
Considered trying to drop Autotools for this project but:
(1) This is just as much work as trying to go the other way, and
(2) Ideally, Cell target can be rolled into the Spiral-WHT package, so this way the porting effort is not wasted
More Toolchain Challenges
Best course was to analyze commands run by Cell Makefiles, then add those to the Automake configuration
Initially, scripts were used to munge the Makefiles
Later, Cell-specific options were added to Autoconf frameworks
SPU uses separate toolchain; our current implementation is hackish
Still more work needed to implement cleanly
Architectural Challenges
Keep SPUs processing at capacity.
PPU needs to run the OS and allocate jobs to SPUs
Exploit multiple levels of parallelism
Vector (SIMD) operations
PPE + 8 SPEs
Dual pipelines
Multiple processors on a blade
Multiple blades!
Exploit data locality
More Architectural Challenges
Distributed architecture basics
Shared memory
Message passing
Synchronization
Manual DMA Scheduling for
Vectorization Issues
PPU and SPUs have different vector intrinsics
Most operators have a direct mapping between SSE and SPU/PPU intrinsics. Exceptions: Shuffle and permutations (due to endianness)
Implementation Strategy
Utilize Vector Constructs
SPUs allow vectorization of doubles as well as floats; PPU is single-precision only
Implement a distributed ‘split’ across SPUs: splitcell[]
Use reference ‘d_split’ code as implementation guide
Vector Intrinsics
Vector Integer
vector arithmetic, compare, logical, rotate, and shift
Vector Floating-Point
floating-point arithmetic, multiply/add, rounding and conversion, compare, and estimate instructions.
Vector Load and Store
basic integer and floating-point load and store instructions. No update forms of load and store
Vector Permutation and Formatting
vector pack, unpack, merge, splat, permute, select, and shift
Vectorizing Details
Conversion from SSE vectors to SIMD style vectors is non-trivial
PPU and SPU have different vector intrinsics
Many SSE intrinsics do have a SIMD intrinsic except for memory interactions and permutations
Care must be taken to maintain the correct endian model
splitcell[] Strategy
Pairing transpose blocks by flipping upper and lower address halves
Limited to 2^(2×n) blocks; each cell will calculate 2 blocks. Values n = 1, 2
[Figure: block address xx…x yy…y maps to yy…y xx…x. If xx…x ≠ yy…y: move blocks. If xx…x = yy…y: don’t move blocks.]
splitcell[] Mapping Visualized
n = 1
Matrix to cell mapping (block addresses):
00 01
10 11
Cell #0: Block 0, Block 3
Cell #1: Block 1, Block 2
n = 2
Matrix to cell mapping (block addresses):
0000 0001 0010 0011
0100 0101 0110 0111
1000 1001 1010 1011
1100 1101 1110 1111
Cell #0: Block 0, Block 15
Cell #1: Block 1, Block 4
Cell #2: Block 2, Block 8
Cell #3: Block 3, Block 12
Cell #4: Block 5, Block 10
Cell #5: Block 6, Block 9
Cell #6: Block 7, Block 13
Cell #7: Block 11, Block 14
More Intrinsics
Processor Control
read and write the vector status and control register (VSCR)
Memory Control
instructions for managing caches (user-level and supervisor-level)
Some DMA Memory Interaction
tag = mfc_tag_reserve(); // reserve a single tag for exclusive use
mfc_get(ls, ea, size, tag, tid, rid); // move data from main memory to local store
mfc_write_tag_mask(mask); // select which tag groups to wait on (e.g. 1 << tag)
mfc_put(ls, ea, size, tag, tid, rid); // move data from local store to main memory
mfc_read_tag_status_all(); // wait for all commands in the selected tag groups to finish
ls – local store address
ea – effective main memory address
tag – tag group ID used to track the memory operations
tid – transfer class ID
rid – replacement class ID
Vector Intrinsics Mapping
SSE intrinsic  -> PPE intrinsic; SPE intrinsic
---------------------------------------------
_mm_add_ps     -> vec_add;  spu_add
_mm_sub_ps     -> vec_sub;  spu_sub
_mm_load_ps    -> vec_ld;   (no SPE equiv)
_mm_store_ps   -> vec_st;   (no SPE equiv)
_mm_shuffle_ps -> vec_perm; spu_shuffle
(both require custom macro for permutation mask)
Starting SPU Programs
#include <libspe2.h>
#include <pthread.h>
spe_context_ptr_t ctx; // the SPU context
pthread_t thd; // the thread construct
extern spe_program_handle_t spuProgram; // the binary SPU program handle
void* thdFunction(void* arg); // thread function that runs the SPU context
int main() {
    ctx = spe_context_create(0, NULL); // create the context
    if (ctx == NULL) return 1; // context creation failed
    if (spe_program_load(ctx, &spuProgram) == 0) { // load the SPU program (0 on success)
        if (pthread_create(&thd, NULL, &thdFunction, &ctx) == 0) { // spawn the thread (0 on success)
            pthread_join(thd, NULL); // wait for the SPU program to finish
        }
    }
    spe_context_destroy(ctx); // clean up the context
    return 0;
}
SPU Threads and Programs
// Thread function
void* thdFunction(void* arg) {
    spe_context_ptr_t ctx = *(spe_context_ptr_t*)arg; // the SPU context passed in by main
    void* spuArg = NULL; // argument pointer to pass to the SPU
    unsigned int entry = SPE_DEFAULT_ENTRY;
    spe_context_run(ctx, &entry, 0, spuArg, NULL, NULL); // blocks until the SPU program stops
    pthread_exit(NULL);
}
// SPU program; this will be linked in as ‘spuProgram’
int main(unsigned long long spuId, unsigned long long argp, unsigned long long envp) {
    // SPU code…
    return 0;
}
Combining PPU and SPU code
PPU and SPU code are compiled as separate object files using separate compilers
ppu-embedspu is used to embed SPU-compiled object code into PPU ELF binaries
Unfortunately, the technical challenges of integrating this into Spiral-WHT were not overcome within our timeframes
Key problem: embedded SPU code has to be dynamically linked, while our Makefile hackery was using static libraries
This is very poorly documented, and was finally diagnosed too late to allow us to resolve the problem
Results
Spiral-WHT successfully ported to Cell platform
Implemented codelets for PPU and SPU
Initial modifications to codelet generator
The most difficult issues were toolchain-related, and these limited our ability to generate empirical performance data
…So, what sort of performance improvement should we expect when the bugs are ironed out?
Expected Performance Gains
CBE has eight SPUs; on a PS3, six are available for use
Thus, we would hope for a 6-8x performance increase over using just the PPU
Williams et al. 2006 project a 12.7x speedup for 2-D FFTs over a 64-bit Intel CPU
With some minor architectural modifications, they project a 20x speedup!
Future Work and Alternate Strategies
Implementing split/splitddl on the SPU
Use multi-buffered DMA scheduling to maximize throughput
PPU can handle two simultaneous hardware threads
Allow the PPU to run codelets in parallel with SPUs
Additional levels of parallelization: splitting over multiple CBE processors
Cell Blades have 2 CBE processors each
References
http://www.ibm.com/developerworks/power/cell/index.html
IBM Full-System Simulator User’s Guide
Cell Broadband Engine Programming Handbook Version 1.1
Programming Tutorial (DRAFT)
S. Williams, J. Shalf, L. Oliker, S. Kamil, P. Husbands, K. Yelick, “The Potential of the Cell Processor for Scientific Computing”, CF06
http://www.lbl.gov/Science-Articles/Archive/sabl/2006/Jul/CellProcessorPotential.pdf
Links to these and other useful Cell programming resources are on our group’s website:
http://www.cs.drexel.edu/~pls29/cell/
Questions?