CBE Architecture Overview

Date post: 28-Jan-2016
Page 1: CBE Architecture Overview

CBE Architecture Overview

Martin Kreibe: [email protected]

Matthew Longley: [email protected]

Paul Snyder: [email protected]

Page 2: CBE Architecture Overview

What is CBE?

A new interpretation of multi-core processors

Development motivated by graphics-heavy applications:

  Game consoles

  Graphics rendering applications

Developed by a collaboration between Sony, Toshiba, and IBM (known as STI) in 2001.

Page 3: CBE Architecture Overview

Architecture Components

PPE: main processing unit; controls the SPE units

EIB: communication bus

SPEs: non-control processor elements; 8 on chip

BEI: engine interface

Page 4: CBE Architecture Overview

Cell Broadband Engine Architecture

CBE Endianness

The CBE Architecture is big endian

[Figure: byte ordering for byte, halfword, word, doubleword, and quadword operands; byte addresses 0 through f]
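As a small illustration of what "big endian" means for the byte addresses in the figure, the following sketch stores a 32-bit word so that its most significant byte lands at the lowest address. The function names `store_be32`, `load_be32`, and `host_is_big_endian` are made up for this example and are not part of the Cell SDK; the code is portable C and runs on any host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store a 32-bit word in big-endian order:
   the most significant byte goes to the lowest address. */
void store_be32(uint8_t *dst, uint32_t value) {
    assert(dst != NULL);
    dst[0] = (uint8_t)(value >> 24);
    dst[1] = (uint8_t)(value >> 16);
    dst[2] = (uint8_t)(value >> 8);
    dst[3] = (uint8_t)value;
}

/* Read a 32-bit word that was stored in big-endian order. */
uint32_t load_be32(const uint8_t *src) {
    assert(src != NULL);
    return ((uint32_t)src[0] << 24) | ((uint32_t)src[1] << 16) |
           ((uint32_t)src[2] << 8)  |  (uint32_t)src[3];
}

/* Returns 1 when the host stores the most significant byte first. */
int host_is_big_endian(void) {
    const uint32_t probe = 0x01020304u;
    return *(const uint8_t *)&probe == 0x01;
}
```

On a big-endian machine such as the CBE, `host_is_big_endian()` would be expected to return 1, and a plain word store would produce the same byte layout as `store_be32`.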

Page 5: CBE Architecture Overview

PowerPC Processor Element

64-bit, dual-thread PowerPC architecture

32 KB L1 cache size

512 KB L2 cache size

Instruction set extensions:

  Vector/SIMD multimedia (“AltiVec”)

  PPU to SPU communication

Classic CPU architecture

Page 6: CBE Architecture Overview

Synergistic Processor Elements

Operations must be allocated by PPU

“[O]ptimized for data-rich operations” (Programming Tutorial (DRAFT))

RISC core

256 KB Local Store (“LS”, holds both Instructions and Data)

Unified 128-bit, 128-entry register file.

Manual branch hinting

Special SIMD instruction set

Vector operations

DMA control

Interprocessor messaging and synchronization

“[N]ot intended to run an operating system.” (Programming Tutorial (DRAFT))

Page 7: CBE Architecture Overview

SPU Registers

General Purpose Registers (GPR) 0 through 127

Floating-Point Status and Control Register (FPSCR)

Page 8: CBE Architecture Overview

SPU Latencies

Simple fixed point - 2 cycles

Complex fixed point - 4 cycles

Load - 6 cycles

Single-precision (ER) float - 6 cycles

Integer multiply - 7 cycles

Branch miss (no penalty for correct hint) - 20 cycles

DP (IEEE) float (partially pipelined) - 13 cycles

Enqueue DMA Command - 20 cycles

Page 9: CBE Architecture Overview

Element Interconnect Bus

Peak bandwidth: 96 bytes/cycle

Four 16-byte data rings

100+ outstanding DMA requests

Source: http://cag.csail.mit.edu/ps3/lectures/6.189-lecture2-cell.pdf

Page 10: CBE Architecture Overview

Platform Details

Cross-Unit Communication

  Mailbox mechanism for synchronization: 32-bit messages between SPEs

  Signal notification (inbound): 32-bit signal notification register

DMA

  PPU and SPUs can retrieve data from main memory into an SPU

  DMA loads are asynchronous
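To make the mailbox idea concrete, here is a minimal single-threaded software model of a queue of 32-bit messages with non-blocking write and read. It is only a sketch of the behaviour: real mailboxes are hardware queues accessed through the SDK, and the names `mailbox_t`, `mbox_try_write`, `mbox_try_read`, and the depth `MBOX_DEPTH` are assumptions made for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MBOX_DEPTH 4  /* queue depth assumed for this sketch */

typedef struct {
    uint32_t slots[MBOX_DEPTH];
    unsigned head;   /* index of the oldest queued message */
    unsigned count;  /* number of queued messages */
} mailbox_t;

void mbox_init(mailbox_t *mb) {
    assert(mb != NULL);
    mb->head = 0;
    mb->count = 0;
}

/* Non-blocking write: returns 1 on success, 0 if the mailbox is full. */
int mbox_try_write(mailbox_t *mb, uint32_t msg) {
    if (mb->count == MBOX_DEPTH) return 0;
    mb->slots[(mb->head + mb->count) % MBOX_DEPTH] = msg;
    mb->count++;
    return 1;
}

/* Non-blocking read: returns 1 and stores the oldest message in *msg,
   or returns 0 if the mailbox is empty. */
int mbox_try_read(mailbox_t *mb, uint32_t *msg) {
    if (mb->count == 0) return 0;
    *msg = mb->slots[mb->head];
    mb->head = (mb->head + 1) % MBOX_DEPTH;
    mb->count--;
    return 1;
}
```

A sender that sees `mbox_try_write` return 0 has to retry or do other work, which is the basic way a small queue of 32-bit messages can be used for synchronization.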

Page 11: CBE Architecture Overview

Environment

Hardware

PS3 (may have dead SPUs)

Multi-processor blades

Workstations and accelerator cards

Simulator

Cycle-accurate emulation of SPUs

TCL and GUI interfaces

Modified Linux environment

Page 12: CBE Architecture Overview

Using the Environment

Project development was performed using a GNU GCC-based cross-compilation toolchain

Executables were tested on both the IBM Cell simulator and on PS3s running Yellow Dog Linux

The simulator is slow but functional

Code ran smoothly on PS3s

Thanks to the University of Delaware for providing access to their PS3s

Page 13: CBE Architecture Overview

Toolset

Dual GNU binutils/gcc toolchains (for PPU and SPU)

IBM XLC++ compiler (automatic vectorization)

  Currently generates poorly-optimized code

Static and dynamic analysis tools

Multithreaded debugger (gdb)

Cell Simulator and toolchain are provided only for Fedora Linux

We used VMware virtual machines to ease installation

A Gentoo installation package exists, but is poorly supported

Page 14: CBE Architecture Overview

Toolchain Challenges

Cell SDK Makefiles use custom include footers

These interface POORLY with GNU Autotools

  Spiral-WHT uses GNU Autotools

MUCH time was spent analyzing the operation of the Cell Makefiles and mixing this functionality with the Autotools compilation framework

We considered dropping Autotools for this project, but:

  (1) This is just as much work as going the other way, and

  (2) Ideally, the Cell target can be rolled into the Spiral-WHT package, so this way the porting effort is not wasted

Page 15: CBE Architecture Overview

More Toolchain Challenges

The best course was to analyze the commands run by the Cell Makefiles, then add those to the Automake configuration

Initially, scripts were used to munge the Makefiles

Later, Cell-specific options were added to the Autoconf framework

The SPU uses a separate toolchain; our current implementation is hackish

Still more work is needed to implement this cleanly

Page 16: CBE Architecture Overview

Architectural Challenges

Keep SPUs processing at capacity

PPU needs to run the OS and allocate jobs to SPUs

Exploit multiple levels of parallelism

Vector (SIMD) operations

PPE + 8 SPEs

Dual pipelines

Multiple processors on a blade

Multiple blades!

Exploit data locality

Page 17: CBE Architecture Overview

More Architectural Challenges

Distributed architecture basics

Shared memory

Message passing

Synchronization

Manual DMA Scheduling for

Vectorization Issues

PPU and SPUs have different vector intrinsics

Most operators have a direct mapping between SSE and SPU/PPU intrinsics. Exceptions: Shuffle and permutations (due to endianness)

Page 18: CBE Architecture Overview

Implementation Strategy

Utilize vector constructs

SPUs allow vectorization of doubles as well as floats; PPU is single-precision only

Implement a distributed ‘split’ across SPUs: splitcell[]

Use reference ‘d_split’ code as implementation guide

Page 19: CBE Architecture Overview

Vector Intrinsics

Vector Integer

vector arithmetic, compare, logical, rotate, and shift

Vector Floating-Point

floating-point arithmetic, multiply/add, rounding and conversion, compare, and estimate instructions.

Vector Load and Store

basic integer and floating-point load and store instructions. No update forms of load and store

Vector Permutation and Formatting

vector pack, unpack, merge, splat, permute, select, and shift

Page 20: CBE Architecture Overview

Vectorizing Details

Conversion from SSE vectors to Cell SIMD-style vectors is non-trivial

PPU and SPU have different vector intrinsics

Many SSE intrinsics have a Cell SIMD counterpart, except for memory interactions and permutations

Care must be taken to maintain the correct endian model

Page 21: CBE Architecture Overview

splitcell[] Strategy

Pair transpose blocks by flipping the upper and lower halves of the block address

Limited to 2^(2×n) blocks; each cell calculates 2 blocks. Values: n = 1, 2

[Figure: block addresses written as an upper half xx…x and a lower half yy…y. When xx…x ≠ yy…y, the block is exchanged with the block whose address has the two halves swapped ("Move Blocks"). When xx…x = yy…y, the block stays in place ("Don’t Move Blocks").]

Page 22: CBE Architecture Overview

splitcell[] Mapping Visualized

n = 1

Matrix to cell mapping (block addresses in binary):

00 01

10 11

Cell #0: Block 0, Block 3

Cell #1: Block 1, Block 2

n = 2

Matrix to cell mapping (block addresses in binary):

0000 0001 0010 0011

0100 0101 0110 0111

1000 1001 1010 1011

1100 1101 1110 1111

Cell #0: Block 0, Block 15

Cell #1: Block 1, Block 4

Cell #2: Block 2, Block 8

Cell #3: Block 3, Block 12

Cell #4: Block 5, Block 10

Cell #5: Block 6, Block 9

Cell #6: Block 7, Block 13

Cell #7: Block 11, Block 14
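The pairing shown in the mapping can be reproduced with a few bit operations. The sketch below is our own illustration rather than the project's splitcell[] code: it treats a block number as a 2n-bit address and swaps its upper and lower n-bit halves. The helper names `swap_halves` and `block_moves` are hypothetical.

```c
#include <assert.h>

/* Swap the upper and lower n-bit halves of a 2n-bit block address. */
unsigned swap_halves(unsigned block, unsigned n) {
    assert(n >= 1 && 2 * n < 8 * sizeof(unsigned));
    unsigned mask  = (1u << n) - 1u;
    unsigned upper = (block >> n) & mask;
    unsigned lower = block & mask;
    return (lower << n) | upper;
}

/* A block has to move only when its two address halves differ. */
int block_moves(unsigned block, unsigned n) {
    return swap_halves(block, n) != block;
}
```

For n = 2, block 1 (0001) pairs with block 4 (0100) and block 11 (1011) pairs with block 14 (1110), matching Cell #1 and Cell #7 in the mapping. Blocks 0, 5, 10, and 15 have equal halves and stay in place, which is why the mapping groups them with each other.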

Page 23: CBE Architecture Overview

More Intrinsics

Processor Control

read and write the vector status and control register (VSCR)

Memory Control

instructions for managing caches (user-level and supervisor-level)

Page 24: CBE Architecture Overview

Some DMA Memory Interaction

tag = mfc_tag_reserve(); // reserve a single tag for exclusive use

mfc_get(ls, ea, size, tag, tid, rid); // move data from main memory to local store

mfc_write_tag_mask(mask); // select the tag group(s) to wait on, e.g. mask = 1 << tag

mfc_put(ls, ea, size, tag, tid, rid); // move data from local store to main memory

mfc_read_tag_status_all(); // wait for all commands in the selected tag group(s) to finish

ls - local store address

ea - effective address in main memory

tag - tag group ID used to track the status of memory operations

tid - transfer class ID

rid - replacement class ID

Page 25: CBE Architecture Overview

Vector Intrinsics Mapping

SSE intrinsic    ->  PPE intrinsic;  SPE intrinsic
--------------------------------------------------
_mm_add_ps       ->  vec_add;        spu_add
_mm_sub_ps       ->  vec_sub;        spu_sub
_mm_load_ps      ->  vec_ld;         (no SPE equiv)
_mm_store_ps     ->  vec_st;         (no SPE equiv)
_mm_shuffle_ps   ->  vec_perm;       spu_shuffle

(both require a custom macro for the permutation mask)
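To show why a permutation mask has to be built by hand, here is a portable C model of a byte-level permute in the style of vec_perm: each mask byte selects one of the 32 bytes of the two inputs taken together. It is an emulation for illustration only, not the intrinsic itself, and it ignores the special mask values that spu_shuffle uses to generate constants. The names `byte_permute` and `make_word_mask` are hypothetical, and words are numbered from the lowest address (big-endian order).

```c
#include <assert.h>
#include <stdint.h>

/* Model of a byte permute: out[i] is byte (mask[i] & 0x1F) of the
   32-byte concatenation of a followed by b. */
void byte_permute(const uint8_t a[16], const uint8_t b[16],
                  const uint8_t mask[16], uint8_t out[16]) {
    for (int i = 0; i < 16; i++) {
        uint8_t sel = mask[i] & 0x1F;
        out[i] = (sel < 16) ? a[sel] : b[sel - 16];
    }
}

/* Build a byte mask from four word selectors:
   0-3 pick a word from a, 4-7 pick a word from b. */
void make_word_mask(unsigned w0, unsigned w1, unsigned w2, unsigned w3,
                    uint8_t mask[16]) {
    const unsigned w[4] = { w0, w1, w2, w3 };
    for (int i = 0; i < 4; i++) {
        assert(w[i] < 8);
        for (int j = 0; j < 4; j++)
            mask[4 * i + j] = (uint8_t)(4 * w[i] + j);
    }
}
```

For example, `make_word_mask(3, 2, 1, 0, mask)` yields a mask that reverses the four words of `a`. An SSE shuffle immediate numbers elements from the least significant end, so a port has to translate selectors as well as expand them to bytes.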

Page 26: CBE Architecture Overview

Starting SPU Programs

#include <libspe2.h>
#include <pthread.h>

spe_context_ptr_t ctx;                   // the SPU construct
pthread_t thd;                           // the thread construct
extern spe_program_handle_t spuProgram;  // the binary SPU program handle
void* thdFunction(void* arg);            // thread function, defined on the next page

int main() {
    ctx = spe_context_create(0, NULL);                  // create the context
    if (ctx == NULL) return 1;                          // context creation failed
    if (spe_program_load(ctx, &spuProgram) == 0) {      // load the SPU program (0 = success)
        if (pthread_create(&thd, NULL, &thdFunction, &ctx) == 0) {  // spawn the thread (0 = success)
            pthread_join(thd, NULL);                    // wait for the SPU program to finish
        }
    }
    spe_context_destroy(ctx);                           // clean up the context
    return 0;
}

Page 27: CBE Architecture Overview

SPU Threads and Programs

// Thread function
void* thdFunction(void* arg) {
    spe_context_ptr_t ctx = *(spe_context_ptr_t*)arg;  // the SPU context passed in by main()
    void* spuArg = NULL;                               // argument pointer to pass to the SPU
    unsigned int entry = SPE_DEFAULT_ENTRY;

    spe_context_run(ctx, &entry, 0, spuArg, NULL, NULL);  // blocks until the SPU program stops
    pthread_exit(NULL);
}

// SPU program; this will be linked in as ‘spuProgram’
int main(unsigned long long spuId, unsigned long long argv) {
    // SPU code…
    return 0;
}

Page 28: CBE Architecture Overview

Combining PPU and SPU Code

PPU and SPU code are compiled as separate object files using separate compilers

ppu-embedspu is used to embed SPU-compiled object code into PPU ELF binaries

Unfortunately, the technical challenges of integrating this into Spiral-WHT were not overcome within our timeframe

Key problem: embedded SPU code has to be dynamically linked, while our Makefile hackery was using static libraries

This is very poorly documented, and was finally diagnosed too late to allow us to resolve the problem

Page 29: CBE Architecture Overview

Results

Spiral-WHT successfully ported to Cell platform

Implemented codelets for PPU and SPU

Initial modifications to codelet generator

The most difficult issues were toolchain-related, and these limited our ability to generate empirical performance data

…So, what sort of performance improvement should we expect when the bugs are ironed out?

Page 30: CBE Architecture Overview

Expected Performance Gains

The CBE has eight SPUs; on a PS3, six are available for use

Thus, we would hope for a 6-8x performance increase over using just the PPU

Williams et al. (2006) project a 12.7x speedup for 2-D FFTs over a 64-bit Intel CPU

With some minor architectural modifications, they project a 20x speedup!

Page 31: CBE Architecture Overview

Future Work and Alternate Strategies

Implementing split/splitddl on the SPU

Use multi-buffered DMA scheduling to maximize throughput

PPU can handle two simultaneous hardware threads

Allow the PPU to run codelets in parallel with SPUs

Additional levels of parallelization: splitting over multiple CBE processors

Cell Blades have 2 CBE processors each
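The multi-buffered DMA idea mentioned above can be sketched as a loop that alternates between two buffers. The example is portable C and only models the control flow. `fetch_chunk` is a hypothetical stand-in that uses a synchronous `memcpy` where SPU code would issue an asynchronous DMA get, and the chunk size `CHUNK` is chosen arbitrarily, so no real overlap of transfer and computation happens here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 4  /* elements per buffer; kept tiny so the sketch is easy to trace */

/* Stand-in for a DMA "get": copies one chunk from main memory to a local buffer. */
static void fetch_chunk(float *local, const float *main_mem, size_t n) {
    memcpy(local, main_mem, n * sizeof(float));
}

/* Sums `total` floats (a multiple of CHUNK) using two alternating buffers. */
float sum_double_buffered(const float *main_mem, size_t total) {
    assert(total % CHUNK == 0);
    float buf[2][CHUNK];
    float sum = 0.0f;
    size_t chunks = total / CHUNK;
    if (chunks == 0) return sum;

    fetch_chunk(buf[0], main_mem, CHUNK);        /* prime the first buffer */
    for (size_t c = 0; c < chunks; c++) {
        int cur = (int)(c & 1);
        if (c + 1 < chunks)                      /* start fetching the next chunk */
            fetch_chunk(buf[cur ^ 1], main_mem + (c + 1) * CHUNK, CHUNK);
        /* With real DMA, wait here for the transfer into buf[cur] to complete. */
        for (size_t i = 0; i < CHUNK; i++)       /* process the current chunk */
            sum += buf[cur][i];
    }
    return sum;
}
```

With asynchronous transfers, the fetch of chunk c + 1 would proceed while chunk c is being summed, which is where the throughput gain would be expected to come from.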

Page 32: CBE Architecture Overview

References

http://www.ibm.com/developerworks/power/cell/index.html

IBM Full-System Simulator User’s Guide

Cell Broadband Engine Programming Handbook, Version 1.1

Programming Tutorial (DRAFT)

S. Williams, J. Shalf, L. Oliker, S. Kamil, P. Husbands, K. Yelick, “The Potential of the Cell Processor for Scientific Computing”, CF06

http://www.lbl.gov/Science-Articles/Archive/sabl/2006/Jul/CellProcessorPotential.pdf

Links to these and other useful Cell programming resources are on our group’s website:

http://www.cs.drexel.edu/~pls29/cell/

Page 33: CBE Architecture Overview

Questions?

