Edgar Gabriel
COSC 6385
Computer Architecture
- Multi-Processors (II)
The IBM Cell, Intel Larrabee and
Nvidia G80 processors
Edgar Gabriel
Fall 2008
COSC 6385 – Computer Architecture
Edgar Gabriel
References
• Intel Larrabee:
[1] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins,
A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juan, P. Hanrahan:
“Larrabee: a many-core x86 architecture for visual computing”,
ACM Trans. Graph., Vol. 27, No. 3. (August 2008), pp. 1-15.
http://softwarecommunity.intel.com/UserFiles/en-us/File/larrabee_manycore.pdf
• IBM Cell processor:
[2] C. R. Johns, D. A. Brokenshire
“Introduction to the Cell Broadband Engine Architecture”,
IBM Journal of Research and Development, vol. 51, no. 5, pp. 503-519
http://www.research.ibm.com/journal/rd/515/johns.pdf
[3] M. Kistler, M. Perrone, F. Petrini,
“Cell Multiprocessor Communication Network: Built for Speed”
IEEE Micro, vol. 26, no. 3, pp. 10-23
http://hpc.pnl.gov/people/fabrizio/papers/ieeemicro-cell.pdf
• Nvidia G80
[4] Scott Wasson, “Nvidia GeForce 8800 graphics processor”
http://techreport.com/articles.x/11211/1
Larrabee Motivation
• Comparison of two architectures with the same number
of transistors
– The simplified in-order core delivers half the single-stream
performance
– 20x increase in vector throughput for multi-stream execution
(160 vs. 8 per clock)
                     2 out-of-order cores   10 in-order cores
Instruction issue    4                      2
VPU per core         4-wide SSE             16-wide
L2 cache size        4 MB                   4 MB
Single stream        4 per clock            2 per clock
Vector throughput    8 per clock            160 per clock
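The vector-throughput figures in the comparison can be reproduced as core count times VPU width. The following is a minimal sketch of that arithmetic; the function name is illustrative only and not taken from [1].

```python
# Peak vector throughput = number of cores x VPU lanes per core,
# using the figures from the comparison of the two designs.
def peak_vector_throughput(cores, vpu_width):
    """Vector elements processed per clock across all cores."""
    return cores * vpu_width

out_of_order = peak_vector_throughput(cores=2, vpu_width=4)   # 4-wide SSE
in_order = peak_vector_throughput(cores=10, vpu_width=16)     # 16-wide VPU

print(out_of_order, in_order, in_order // out_of_order)  # 8 160 20
```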
Larrabee Overview
• Many-core visual computing architecture
• Based on x86 CPU cores
– Extended version of the regular x86 instruction set
– Supports subroutines and page faulting
• Number of x86 cores can vary depending on the
implementation and processor version
• Fixed-function units for texture filtering
– Other graphics operations such as rasterization or
post-shader blending are done in software
Larrabee Overview (II)
Image Source: [1]
Overview of a Larrabee Core (I)
Image Source: [1]
Overview of a Larrabee Core (II)
• x86 core derived from the Pentium processor
– No out-of-order execution
• Standard Pentium instruction set with the addition of
– 64 bit instructions
– Instructions for pre-fetching data into L1 and L2 cache
– Support for 4 simultaneous threads, separate registers for
each thread
• Each core is augmented with a wide vector processor
(VPU)
• 32 KB L1 instruction cache, 32 KB L1 data cache
• 256 KB of ‘local subset’ of the L2 cache
– Coherent L2 cache across all cores
Vector Processing Unit in Larrabee
• 16-wide VPU executing integer, single-precision, and
double-precision floating-point operations
• VPU supports gather-scatter operations
– The 16 elements can be loaded from or stored to up to
16 different addresses
• Support for predicated instructions using a mask control
register (if-then-else statements)
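Gather-scatter and mask-based predication can be illustrated with a small software model. This is only a sketch in plain Python: the function names, the memory contents, and the address pattern are made up for illustration and do not correspond to actual Larrabee instructions.

```python
VPU_WIDTH = 16  # lanes per vector register, as stated on the slide

def gather(memory, addresses):
    """Load one element per lane, each from its own address."""
    return [memory[a] for a in addresses]

def scatter(memory, addresses, values, mask):
    """Store only the lanes whose mask bit is set, each to its own address."""
    for a, v, m in zip(addresses, values, mask):
        if m:
            memory[a] = v

def masked_select(mask, then_vals, else_vals):
    """Predication: both branches are computed, the mask picks per lane."""
    return [t if m else e for m, t, e in zip(mask, then_vals, else_vals)]

memory = list(range(100))                       # hypothetical memory contents
addresses = [3 * i for i in range(VPU_WIDTH)]   # 16 non-contiguous addresses
x = gather(memory, addresses)

# if (x % 2 == 0) x = x * 10; else x = -x;  evaluated for all 16 lanes
mask = [v % 2 == 0 for v in x]
result = masked_select(mask, [v * 10 for v in x], [-v for v in x])
scatter(memory, addresses, result, [True] * VPU_WIDTH)
```

Both sides of the if-then-else are evaluated for every lane; the mask only decides which result is kept, so no branch is needed.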
Inter-Processor Ring Network
• Bi-directional ring network
• 512 bits wide per direction
• Routing decisions done before injecting message into
the network
Larrabee Programming Models
• Most applications can be executed without modification
due to the full support of the x86 instruction set
• Support for POSIX threads to create multiple threads
– API extended by thread affinity parameters
• Recompiling code with Larrabee’s native compiler
automatically generates code that uses the VPUs
• Alternative parallel approaches
– Intel Threading Building Blocks
– Larrabee-specific OpenMP directives
Larrabee Performance
Image Source: [1]
IBM Cell Overview (I)
• Cell Broadband Engine Architecture (CBEA) defined by a
consortium of IBM, Sony, and Toshiba
• Originally targeting the multimedia industry
– E.g. PlayStation 3, Toshiba HDTV, etc.
• Also sold by IBM as regular compute blades
– IBM QS20, QS21, QS22
• Main idea: heterogeneous microprocessor consisting of
– one (or more) general purpose processor element (PPE)
and
– (one or) more synergistic processor elements (SPEs)
Cell Architecture block diagram
Image Source: [2]
• Two generations available so far:
– Cell BE:
• 204.8 GFLOPS single precision peak performance
• 14.6 GFLOPS double precision peak performance
– PowerXCell 8i (2008):
• 204.8 GFLOPS single precision peak performance
• 102.4 GFLOPS double precision peak performance
– Both have 1 PPE and 8 SPEs
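The peak numbers can be reproduced with a short calculation over the 8 SPEs. The sketch below rests on assumptions that are not stated on the slide: a 3.2 GHz clock, one fused multiply-add (counted as two operations) per SIMD lane, and, for the first-generation Cell BE, a double-precision unit that can only start a new instruction every 7 cycles.

```python
CLOCK_GHZ = 3.2      # assumption: clock rate, not stated on the slide
NUM_SPES = 8         # from the slide
FLOPS_PER_FMA = 2    # assumption: a fused multiply-add counts as two operations

def peak_gflops(lanes, issue_interval_cycles=1):
    """Peak GFLOPS over all SPEs for a given SIMD width and issue interval."""
    per_spe = lanes * FLOPS_PER_FMA / issue_interval_cycles * CLOCK_GHZ
    return NUM_SPES * per_spe

sp = peak_gflops(lanes=4)                 # four 32-bit lanes per register
dp_powerxcell = peak_gflops(lanes=2)      # two 64-bit lanes, fully pipelined
dp_cell_be = peak_gflops(lanes=2, issue_interval_cycles=7)  # assumed 7-cycle interval

print(round(sp, 1), round(dp_powerxcell, 1), round(dp_cell_be, 1))
```

Under these assumptions the three values come out at roughly 204.8, 102.4, and 14.6 GFLOPS, matching the figures listed for the two generations.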
General Purpose Processor (PPE)
• Based on the IBM PowerPC processor
– Supports multiple simultaneous operating environments
(virtualization)
– E.g. can execute an instance of a real-time operating
system and an instance of a non-real-time operating
system
• Performs management and application control
functions
Synergistic Processor Element (SPE)
• SIMD processor used for offloading compute-intensive,
data parallel operations from the PPE
• Each SPE has its own local storage and can access data
only from the local storage
– Current versions of the Cell processors: 256 KB of local storage
• The local storage is connected to the main memory
through a Memory Flow Controller (MFC)
– MFC moves data from main memory to local storage or
between two SPEs.
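Because an SPE can only access its local storage, a computation over a large array has to be split into chunks that are copied in, processed, and copied back. The following sketch models that pattern in plain Python; `dma_get` and `dma_put` are hypothetical stand-ins for MFC transfers, not the actual Cell SDK calls, and the 32-bit element size and the buffer budget are assumptions.

```python
LOCAL_STORE_BYTES = 256 * 1024   # local storage size given on the slide

def dma_get(main_memory, offset, count):
    """Hypothetical stand-in for an MFC transfer: main memory -> local storage."""
    return main_memory[offset:offset + count]

def dma_put(main_memory, offset, local_buffer):
    """Hypothetical stand-in for an MFC transfer: local storage -> main memory."""
    main_memory[offset:offset + len(local_buffer)] = local_buffer

def spe_scale(main_memory, factor, chunk_elems):
    """Process a large array in chunks small enough to fit in local storage."""
    for offset in range(0, len(main_memory), chunk_elems):
        local = dma_get(main_memory, offset, chunk_elems)   # copy in
        local = [factor * v for v in local]                 # compute locally
        dma_put(main_memory, offset, local)                 # copy out

ELEM_BYTES = 4                                   # assumption: 32-bit elements
chunk = LOCAL_STORE_BYTES // (4 * ELEM_BYTES)    # assumption: a quarter of the store per buffer
data = list(range(100_000))
spe_scale(data, 2, chunk)
```

The chunk size is kept well below the full 256 KB because the local storage also has to hold the SPE program itself and any additional buffers.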
MFC commands
Image Source: [2]
Synergistic Processor Element (SPE) (II)
• Each SPE has 128 registers
• Each register is 128 bits wide and can hold
– Sixteen 8-bit integers or
– Eight 16-bit integers or
– Four 32-bit integers or single-precision floating-point
numbers or
– Two 64-bit integers or double-precision floating-point
numbers
• Most instructions supported by the synergistic processor
unit operate on all elements of a register -> SIMD
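The lane counts follow from dividing the 128-bit register width by the element width. A small sketch using Python's `struct` module shows the same 16 bytes interpreted in two of these ways, followed by an element-wise add as a model of a SIMD instruction; the variable names are illustrative only.

```python
import struct

REGISTER_BITS = 128

# Lane counts for each element width listed on the slide.
lanes = {bits: REGISTER_BITS // bits for bits in (8, 16, 32, 64)}

# The same 16 bytes viewed as four floats or as sixteen 8-bit integers.
register = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)
as_bytes = struct.unpack("<16B", register)

# One SIMD add touches every lane of both operands.
a = struct.unpack("<4f", register)
b = (10.0, 20.0, 30.0, 40.0)
simd_sum = tuple(x + y for x, y in zip(a, b))

print(lanes)      # {8: 16, 16: 8, 32: 4, 64: 2}
print(simd_sum)   # (11.0, 22.0, 33.0, 44.0)
```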
Simplified representation of a current
Cell processor
Image Source: [3]
Element Interconnect Bus
• PPE and SPEs communicate through the Element
Interconnect Bus
– Contains a shared command bus
• Sets up end-to-end transactions
• Used for coherence protocols
– Point-to-point data interconnect
• Four 16-byte-wide rings, two used for clockwise data
transfers, two for counter-clockwise data transfers
• Each ring transfers 128-byte packets (= cache block
size of an SPE)
• Communication costs between two SPEs can vary
between 1 hop and 6 hops
– Overall bandwidth: 204.8 GB/s
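The bandwidth figure and the hop range can be reproduced with a short calculation. This is a sketch under assumptions that are not stated on the slide: the bus runs at 1.6 GHz (half of a 3.2 GHz core clock), the shared command bus starts at most one 128-byte transfer per bus cycle, and 12 elements are attached to the rings.

```python
BUS_CLOCK_GHZ = 1.6   # assumption: half of a 3.2 GHz core clock
PACKET_BYTES = 128    # packet size given on the slide
NUM_ELEMENTS = 12     # assumption: PPE, 8 SPEs, memory controller, 2 I/O interfaces

# Assumption: at most one 128-byte transfer is started per bus cycle.
peak_gb_per_s = PACKET_BYTES * BUS_CLOCK_GHZ

def hops(src, dst, n=NUM_ELEMENTS):
    """Shortest distance on a ring that can be traversed in both directions."""
    d = (dst - src) % n
    return min(d, n - d)

all_hops = [hops(0, dst) for dst in range(1, NUM_ELEMENTS)]
print(round(peak_gb_per_s, 1), min(all_hops), max(all_hops))
```

With clockwise and counter-clockwise rings available, no transfer has to travel more than halfway around, which gives the upper bound of 6 hops for 12 attached elements.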
Comparison of IBM Cell and Intel Larrabee
• Both use a large number of small and simple cores
• Both use a high-bandwidth ring bus to communicate
between the cores
• Intel Larrabee is homogeneous, while IBM Cell is a
heterogeneous processor (difference between PPE and
SPE)
• IBM Cell requires data to be moved explicitly to the
‘local store’, while Larrabee can address any memory
area
– Programs for the Cell have to be written taking the
limited amount of memory available to an SPE into
account
Nvidia G80
• Parallel Stream Processor
– Each green block is a stream processor
– 16 stream processors are grouped and connected by an L1 cache
– Each G80 has 8 groups with 16 SPs = 128 SPs total
– Each SP is a general-purpose processor running at 1.35 GHz
– Each SP operates on a single element (scalar)
– Groups are connected by a crossbar-style switch, which also connects them to six
ROPs
– Each ROP has its own L2 cache and a 64-bit-wide interface to graphics memory
(frame buffer)
– 6 * 64 bits = 384-bit path to memory
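The totals on this slide follow directly from the per-group and per-ROP figures. A minimal sketch of the arithmetic is given below; the aggregate rate in the last line assumes one scalar operation per SP per clock, which is an assumption and not a figure from [4].

```python
GROUPS = 8
SPS_PER_GROUP = 16
ROPS = 6
BITS_PER_ROP = 64
SP_CLOCK_GHZ = 1.35

total_sps = GROUPS * SPS_PER_GROUP      # stream processors on the chip
bus_bits = ROPS * BITS_PER_ROP          # combined width of the memory path
bus_bytes = bus_bits // 8               # bytes moved per memory transfer

# Assumption: each SP completes one scalar operation per clock.
scalar_gops = total_sps * SP_CLOCK_GHZ

print(total_sps, bus_bits, bus_bytes)   # 128 384 48
```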
Nvidia G80 (I)
Performance comparison G80 to IBM
Cell
Source: http://gametomorrow.com/blog/index.php/2007/09/05/cell-vs-g80/
• Ray Tracing Application