IBM Systems & Technology Group
Cell/BE
Cell Programming Workshop 04/19/23 © 2007 IBM Corporation
Cell Broadband Engine
Introduction & architecture
Francesco Bertagnolli
System & Technology Group
Agenda
– Cell introduction
– Cell architecture
– SDK 3.0
– Linux on PS3
– Cell basic programming
– Hands-on
– Cell applications
Cell History
– IBM, SCEI/Sony, Toshiba alliance formed in 2000
– Austin-based Design Center opened in March 2001
– Single Cell BE operational in spring 2004
– February 7, 2005: First technical disclosures
– November 9, 2005: Open source SDK & simulator published
– February 8, 2006: IBM announced Cell Blade
– July 2006: SDK 1.1 available
– Sep 2006: GA of IBM BladeCenter QS20
– Dec 2006: SDK 2.0 available
– Oct 2007: SDK 3.0 available
– Oct 2007: QS21 available
– May 2008: QS22 available
[Figure, two charts. Left: clock speed (MHz, log scale from 10^2 to 10^4) versus year, 1990 to 2010, for Intel processors and IBM processors (Power3 through Power6 families, RS64-4, BlueGene/L, BlueGene/P), with Intel's 2003 roadmap. Right: single-precision floating point (Mflops, log scale) versus year, 1994 to 2006, for game processors and PC processors, with the 1 GFlop and 1 TFlop levels marked.]
Introduction
The CBE processor is the first implementation of a new multiprocessor family conforming to the Cell Broadband Engine Architecture (CBEA)
The CBEA and the CBE processor are the result of a collaboration between Sony, Toshiba, and IBM known as STI, formally begun in early 2001
The Cell Broadband Engine Architecture has been designed to support a very broad range of applications (commercial, scientific fields...)
Although the CBE processor is initially intended for applications in media-rich consumer-electronics devices such as game consoles and high-definition televisions, the architecture has been designed to enable fundamental advances in processor performance.
Overview of the Cell Broadband Engine Processor
Cell Competitive Roadmap (2006 to 2010)
Two tracks: performance enhancements/scaling and cost reduction
– Cell BE (1+8), 90nm SOI
– Cell BE (1+8), 65nm SOI
– Enhanced Cell BE (1+8 eDP SPE), 65nm SOI
– Next Gen (2 PPE’ + 32 SPE’), 45nm SOI, ~1 TFlop (est.)
All future dates and specifications are estimations only; subject to change without notice. Dashed outlines indicate concept designs.
Cell Broadband Engine Architecture Blades: IBM BladeCenter QS20 and beyond
Timeline: 2006, 2007, 2008, 2009-2010 (each entry is marked either Committed or Concept)
BladeCenter QS20 (GA September 2006)
• 2 Cell/B.E. processors
• 1 PPE + 8 SPE
• SP: 460 GFLOPS per Cell blade
• DP: 42 GFLOPS per Cell blade
• 1 GB memory
BladeCenter QS2X (target availability: 4Q07)
• 2 Cell/B.E. processors
• 1 PPE + 8 SPE
• SP: 460 GFLOPS per Cell blade
• DP: 42 GFLOPS per Cell blade
• Next generation I/O chip
• 2 GB memory
BladeCenter QS2Y (target availability: 1H08)
• 2 CBEA-compliant processors
• 1 PPE + 8 eDP SPE
• SP: 460 GFLOPS per blade
• eDP: 217 GFLOPS per blade
• Up to 32 GB memory
• PCI Express™ x16 slots
BladeCenter QS2Z (target availability: 1H10)
• First CBEA teraflop processor
• 2 PPE’ + 32 eSPE
• Power Architecture compliant
• ~2 TFLOPS SP per blade
• ~1 TFLOPS DP per blade
• Next generation memory technology
SDK releases
• SDK 1.1: available July 2006
• SDK 2.1: available March 07
• SDK 3.0: target release September 07
• SDK 4.0: target release 08
• SDK 5.0: target release December 08
Cell Basic Design Concept
Cache
Deep Pipelining
Out-of-Order Processing
Where Have All the Transistors Gone …?
Add Performance … and Inefficiency
Power Wall
– Hard limit to acceptable system power
Memory Wall
– Processor frequency vs. DRAM memory latency
Frequency Wall
Three Major Limiters to Processor Performance
Cell Concept
Increased efficiency and performance
– Non-homogeneous coherent chip multiprocessor
• Allows an attack on the “Frequency Wall”
– DMA architecture attacks the “Memory Wall”
– Design and low operating voltage attack the “Power Wall”
Example dual core: 349 mm², 3.4 GHz @ 150 W
2 cores, ~54 SP GFlops
Cell/B.E.: 3.2 GHz
9 cores, ~230 SP GFlops
Cell/B.E.: ½ the space & power vs traditional approaches
Note that on any traditional processor, the ratio of cores to cache shown here remains ~50% of area.
Why Cell ? (1)
Cell/BE: General Purpose…
Flexibility
Multi-level parallelism
Stream processing
Dual pipeline in the SPEs
Statically scheduled pipeline: no buffer
Storage hierarchy
Simple hardware LS
SPEs independent & synergistic: a cluster of 8
Several systems: game, HDTV, blades, supercomputing, cluster computing, mainframes, etc.
– Structure is not fixed
MFC, DMA
Registers: 128 x 128 bits (4 x 32 bits each)
Why Cell ? (2)
Technology 90-65-45.. nm
State of art
Software development support
Low power consumption
Flaws? No, it’s RISC.
FLEXIBILITY
Why Cell ? (3)
Cell/B.E. enables scalable, shared architecture with full consumer to professional potential
Market range: Consumer, Professional, Business, High Perf Computing
– SCE PS3 (Cell/B.E. + GPU)
– Sony Cell/B.E. Computing Unit (Cell/B.E. + GPU + AV I/O)
– Mercury Cell/B.E. PCI Card (Cell/B.E. + Network)
– IBM Cell/B.E. Blade (2 Cell/B.E.s)
– IBM Roadrunner (16,000 Cell/B.E.s + AMD)
Common Operating Systems, Infrastructure, Tools, Libraries, Code…
Challenges of Digital Future – System integration and flexibility
Integration of offload engines and accelerators into processor
– Simpler system structure
Integration of bridge functionality
– More efficient I/O designs
Cell Hardware components & performance
Hardware Environment
The Processor Elements
Element Interconnect Bus
Memory Interface Controller
Cell Broadband Engine Interface Unit
Block diagram of the CBE processor hardware
[Block diagram: Synergistic Processor Elements, PowerPC Processor Element, Memory Interface Controller, Element Interconnect Bus, Cell Broadband Engine Interface Unit]
Power Processor Elements: PPE
[Diagram: the PPE (64-bit Power Architecture with VMX), made up of the PPU (PXU and L1) and the L2, attached to the EIB]
The PowerPC Processor Element (PPE) features:
a general-purpose 64-bit RISC processor
conforms to the PowerPC Architecture
dual-threaded
with vector/SIMD multimedia extensions
PPE responsibilities:
o is responsible for overall control of a CBE system
o runs the operating system
It has:
32 KB level-1 (L1) instruction and data caches
512 KB level-2 (L2) unified (instruction and data) cache
The PPE supports the standard PowerPC Architecture instructions and the vector/SIMD multimedia extensions
PPE Registers
32 General-Purpose Registers (GPRs)—Fixed-point instructions operate on the full 64-bit width of the GPRs.
32 Floating-Point Registers (FPRs), 64 bits wide. The internal format of floating-point data is the IEEE 754 double-precision format. Single-precision results are maintained internally in the double-precision format.
64-bit LR - to hold the effective address of a branch target.
64-bit CTR - to hold either a loop counter or the effective address of a branch target.
64-bit XER - contains the carry and overflow bits and the byte count for the move-assist instructions.
32 128-bit-wide VMRs - serve as source and destination registers for all vector instructions.
To software, the PPE appears to provide two independent instruction-processing units.
The threads appear to be independent because the PPE provides each thread with a copy of architectural state (registers), but the threads are not completely independent because many execution resources are shared by the threads to reduce the hardware cost of multithreading.
To software, the PPE implementation of multithreading looks similar to a multiprocessor implementation, but there are several important differences
PPE multithreading
It has duplicate sets of the PowerPC and vector user-state register files (one set for each thread)
The PPE hardware supports two simultaneous threads of execution
PPE Multithreading vs Multi-Core Implementations
Table compares the PPE multithreading implementation to a conventional dual-core microprocessor
[Block diagram. Instruction fetch: Pre-Decode, L1 Instruction Cache, Microcode, Branch Scan, Fetch Control, L2 Interface. Dispatch and issue: SMT Dispatch (Queue), Decode, Dependency, Issue, VMX/FPU Issue (Queue). Execution: VMX Load/Store/Permute, VMX Arith./Logic Unit, FPU Load/Store, FPU Arith/Logic Unit, Load/Store Unit, Branch Execution Unit, Fixed-Point Unit, L1 Data Cache. Completion: FPU Completion, VMX Completion, Completion/Flush. Per-thread resources are labelled Thread A and Thread B.]
PPE Block Diagram
Synergistic Processor Elements: SPEs
Each SPE:
– RISC core
– 256 KB SRAM Local Store for instructions and data
– 128 x 128-bit register file
– supports a special SIMD instruction set
[Diagram: several SPEs attached to the EIB. Each SPE contains an SPU (SPU Core (SXU), Channel Unit, Local Store) and an MFC (DMA Unit) that connects to the Element Interconnect Bus.]
DMA Unit: transfers data between Local Store and Main Memory
Synergistic Processor Element (SPE)
An SPE is not optimized for running an operating system
The SPEs are independent processor elements, each running their own individual application programs or threads
The SPEs are designed to be programmed in high-level languages, such as C/C++
They support a rich instruction set that includes extensive SIMD functionality
However, use of SIMD data types is preferred, not mandatory
The eight identical SPEs are single-instruction, multiple-data (SIMD) processor elements that are optimized for data-rich operations allocated to them by the PPE.
SPU Organization
SPE Registers
128 General-Purpose Registers (GPRs), each 128 bits wide, that can be used to store all data types
The Floating-Point Status and Control Register (FPSCR) records information about the result and any associated exceptions.
One Difference between PPE and SPEs ...
The most significant difference between the SPE and PPE lies in how they access memory
The PPE accesses main storage with load and store instructions that move data between main storage and a private register file, the contents of which may be cached
The SPEs, in contrast, access main storage with direct memory access (DMA) commands that move data and instructions between main storage and a private local memory, called a local store or local storage (LS). The LS has no associated cache
This 3-level organization of storage (register file, LS, main storage) is a radical break from conventional architecture and programming models
CHIP CELL BE: storage hierarchy
– System memory
– 512 kB L2 cache (4 x 128 kB L2-cache sub-arrays)
– 32 kB L1 data cache and 32 kB L1 instruction cache
– 256 kB Local Store (16 x 16 kB sub-arrays)
Memory Interface Controller - MIC
[Diagram: the MIC sits between the EIB and dual XDR™ memory]
The MIC provides the interface between the EIB and physical memory
It supports one or two Rambus extreme data rate (XDR) memory interfaces (which together support between 64 MB and 64 GB of XDR DRAM memory)
XDR DRAM is ECC-protected, with multi-bit error detection and optional single-bit error correction
Memory interface: 16 B/cycle, 25.6 GB/s (@ 1.6 GHz)
Element Interconnect Bus - EIB
– On-chip coherent bus
– 96 B/cycle bandwidth
– 2 rings in each direction
Cell Broadband Engine Interface Unit (BEI)
– I/O interface (FlexIO™)
– Can be coherent
– 16 B/cycle x 2
Cell performance
>100 GFLOPs DP in 65nm
Cell is not a collection of different processors, but a synergistic whole
Source: Cell Broadband Engine Architecture and its first implementation – A performance view, http://www-128.ibm.com/developerworks/library/pa-cellperf/
Key Performance Characteristics
Cell's performance is about an order of magnitude better than a general-purpose processor (GPP) for media and other applications that can take advantage of its SIMD capability
– Performance of its simple PPE is comparable to that of a traditional GPP
– Each SPE performs about the same as, or better than, a GPP with SIMD running at the same frequency
– The key performance advantage comes from its 8 decoupled SPE SIMD engines with dedicated resources, including large register files and DMA channels
Cell can cover a wide range of application space with its capabilities in
– floating point operations
– integer operations
– data streaming / throughput support
– real-time support
Cell Blade
The First Generation Cell Blade
– Cell processors
– 1 GB XDR memory
– IBM BladeCenter interface
– BladeCenter network interface
[Diagram: two Cell processors, each with its own XDRAM and South Bridge; two IB 4X links and two GbE links]
BladeCenter-H: 14 blades
- 2 Cell chips per QS21 blade
- 14 QS21 blades per BladeCenter
- 60 watts per Cell
Peak Performance
Up to 460 GFLOPS per blade
Up to 6.4 TFLOPS in a single BladeCenter H chassis
Up to 25.8 TFLOPS in a standard 42U rack
[Diagram: workstations (Thinkpad T60 laptops and PCs) connected through an Ethernet switch and an InfiniBand-to-Ethernet bridge to IBM BladeCenter chassis holding QS20 and QS21 blades, which are linked by InfiniBand]
Server architecture
IBM BladeCenter QS21
Announcement: August 28, 2007
IBM BladeCenter QS22
IBM BladeCenter QS22: specifications
Where to get more Cell BE information?
Cell Resource
Cell resource center at developerWorks
– http://www-128.ibm.com/developerworks/power/cell/
Cell developer's corner at power.org
– http://www.power.org/resources/devcorner/cellcorner/
The cell project at IBM Research
– http://www.research.ibm.com/cell/
The Cell BE at IBM alphaWorks
– http://www.alphaworks.ibm.com/topics/cell
Cell BE at IBM Engineering & Technical Services
– http://www-03.ibm.com/technology/
IBM Power Architecture
– http://www-03.ibm.com/chips/power/
Cell BE documentation at IBM Microelectronics
– http://www-306.ibm.com/chips/techlib/techlib.nsf/products/Cell_Broadband_Engine
Cell Linux info at the Barcelona Supercomputing Center website
– http://www.bsc.es/projects/deepcomputing/linuxoncell/
Cell Education
Online courses at IBM Education Assistant
– http://publib.boulder.ibm.com/infocenter/ieduasst/stgv1r0/index.jsp
Online courses at IBM Learning
– http://ibmlearning.ibm.com/index.html
Podcasts at power.org
– http://www.power.org
Onsite classes at IBM Innovation Center
– https://www-304.ibm.com/jct09002c/isv/spc/events/cbea.html
Cell BE Documentation
The following documents define the Cell Broadband Engine architecture, programming using the SDK, the new IBM BladeCenter QS20, XL C/C++ compiler, Full-System Simulator, and the PowerPC base architecture.
Cell Broadband Engine
– Cell Broadband Engine Architecture V1.01 (updated)
– Cell Broadband Engine Programming Handbook V1.0
– Cell Broadband Engine Registers V1.4 (updated)
– SPU C/C++ Language Extensions V2.2.1 (updated)
– Synergistic Processor Unit (SPU) Instruction Set Architecture V1.11 (updated)
– SPU Application Binary Interface Specification V1.5.1 (updated)
– SPU Assembly Language Specification V1.4 (updated)
Cell BE Documentation
Cell Broadband Engine Programming using the SDK
– Cell Broadband Engine SDK Installation Guide V2.0 (updated)
– Cell Broadband Engine SDK Programmer's Guide V1.0 (new)
– Cell Broadband Engine Programming Tutorial V2.0 (updated)
– Cell Broadband Engine Linux Reference Implementation Application Binary Interface Specification V1.1 (updated)
– SPE Runtime Management library documentation V1.2 (updated)
– SPE Runtime Management library documentation V2.0 (new)
– Cell Broadband Engine SIMD Math Library Specification V1.0 (new)
– Accelerator Library Framework Programming Guide and API Reference V1.0 (new)
– Sample Library documentation V2.0 (updated)
– IDL Compiler for Remote Procedure Calls
– Post-link Optimization Utility (new)
Cell BE Documentation
IBM BladeCenter QS20
– IBM BladeCenter QS20 Datasheet
– IBM BladeCenter QS20 Installation and User's Guide
– IBM BladeCenter QS20 Problem Determination and Service Guide
IBM XL C/C++ Compiler
– Getting Started with IBM XL C/C++ Compiler (new)
– IBM XL C/C++ Compiler Language Reference (new)
– IBM XL C/C++ Compiler Programming Guide (new)
– IBM XL C/C++ Compiler Reference (new)
– IBM XL C/C++ Compiler Installation Guide (new)
Cell BE Documentation
IBM Cell Broadband Engine Full-System Simulator
– IBM Full-System Simulator Users Guide (updated)
– IBM Full-System Simulator Command Reference (updated)
– Performance Analysis with the IBM Full-System Simulator
– IBM Full-System Simulator BogusNet HowTo (updated)
PowerPC Base
– PowerPC Architecture Book, Version 2.02
• Book I: PowerPC User Instruction Set Architecture
• Book II: PowerPC Virtual Environment Architecture
• Book III: PowerPC Operating Environment Architecture
– PowerPC Microprocessor Family
• Vector/SIMD Multimedia Extension Technology Programming Environments Manual Version 2.06c
Cell BE Technical Articles
Real-time Ray Tracing
Papers from the Fall Processor Forum 2005: Unleashing the Cell Broadband Engine Processor: The Element Interconnect Bus
Papers from the Fall Processor Forum 2005: Unleashing the power of the Cell Broadband Engine: A programming model approach
Cell Broadband Engine Architecture and its first implementation
Introduction to the Cell Broadband Engine
Introduction to the Cell Multiprocessor
Maximizing the power of the Cell Broadband Engine processor: 25 tips to optimal application performance
Terrain Rendering Engine (TRE): Cell Broadband Engine Optimized Real-time Ray-caster
An Implementation of the Feldkamp Algorithm for Medical Imaging on Cell Broadband Engine
Cell Broadband Engine Support for Privacy, Security, and Digital Rights Management Applications
A Programming Example: Large FFT on the Cell Broadband Engine