Kevin Eady, Ben Plunkett, Prateeksha Satyamoorthy
History
• Jointly designed by Sony, Toshiba, and IBM (STI)
• Design began in March 2001
• First used in Sony's PlayStation 3
• IBM's Roadrunner cluster contains over 12,000 Cell processors
[Figure: IBM Roadrunner cluster]
Cell Broadband Engine
• Nine cores
• One Power Processing Element (PPE)
– The main processor
• Eight Synergistic Processing Elements (SPEs)
– Fully functional co-processors
– Each comprises a Synergistic Processing Unit (SPU) and a Memory Flow Controller (MFC)
• Designed for stream processing
Power Processing Element
• In-order, dual-issue design
• 64-bit Power Architecture
• Two 32 KB L1 caches (instruction and data), one 512 KB L2 cache
• Instruction Unit: instruction fetch, decode, branch, issue, and completion
– Fetches 4 instructions per cycle per thread into a buffer
– Dispatches instructions from the buffer
– Dual-issues them to the Execution Unit
• Branch prediction: 4 KB x 2-bit branch history table
• Pipeline depth: 23 stages
Synergistic Processing Element
• Implements a new instruction-set architecture
• Each SPU contains a dedicated DMA management queue
• 256 KB local store memory
– Stores instructions and data
– Data is transferred via DMA between local store and system memory
• No data load or branch prediction
– Relies on "prepare-to-branch" instructions to pre-fetch instructions
– Loads at least 17 instructions at the branch target address
• Two instructions per cycle
– 128-bit SIMD
– In-order, dual-issue, statically scheduled
On-chip Interconnect: Element Interconnect Bus (EIB)
• Provides the internal connection for 12 'units':
– The PPE
– The 8 SPEs
– The Memory Interface Controller (MIC)
– 2 off-chip I/O interfaces
• Each unit has one 16 B read port and one 16 B write port
• Circular ring: four 16-byte-wide unidirectional channels that counter-rotate in pairs
• Includes an arbitration unit that functions as a set of traffic lights
• Runs at half the system clock rate
• Peak instantaneous EIB bandwidth: 96 B per system clock
– 12 concurrent transactions x 16 bytes wide / 2 system clocks per transfer
• An EIB channel is not permitted to convey data requiring more than six steps
• Each unit on the EIB can simultaneously send and receive 16 B of data every bus cycle
• Maximum data bandwidth of the entire EIB is limited by the maximum rate at which addresses are snooped across all units in the system
• Theoretical peak data bandwidth on the EIB at 3.2 GHz: 128 B x 1.6 GHz = 204.8 GB/s
• Actual peak data bandwidth achieved: 197 GB/s
David Krolak explains: “Well, in the beginning, early in the development process, several people were pushing for a crossbar switch, and the way the bus is designed, you could actually pull out the EIB and put in a crossbar switch if you were willing to devote more silicon space on the chip to wiring. We had to find a balance between connectivity and area, and there just was not enough room to put a full crossbar switch in. So we came up with this ring structure which we think is very interesting. It fits within the area constraints and still has very impressive bandwidth.”
Multi-threading Organization
• The PPE is an in-order, 2-way Simultaneous Multi-Threading (SMT) core
• Each SPU is a vector accelerator targeted at the execution of SIMD code
• All architectural state is duplicated to permit interleaved instruction issue
• Asynchronous DMA transfers
– Setting up a DMA takes the SPE only a few cycles, whereas a cache miss on a conventional system can stall the CPU for up to thousands of cycles
– SPEs can perform other calculations while waiting for data
Scheduling Policy
• Two classes of threads are defined:
– PPU threads: run on the PPU
– SPU tasks: run on the SPUs
• PPU threads are managed by the Completely Fair Scheduler (CFS)
• The SPU scheduler supports time-sharing in multi-programmed workloads and allows preemption of SPU tasks
• Cell-based systems allow only one active application to run at a time, to avoid performance degradation
Completely Fair Scheduler
• Runnable tasks are ranked by virtual runtime (vruntime); the task that has received the least CPU time so far runs next
Consider an example with two users, A and B, who are running jobs on a machine. User A has just two jobs running, while user B has 48 jobs running. Group scheduling enables CFS to be fair to users A and B, rather than being fair to all 50 jobs running in the system. Both users get a 50-50 share. B would use his 50% share to run his 48 jobs and would not be able to encroach on A's 50% share.