Carnegie Mellon
High Performance Computing on the Cell Broadband Engine
Vas Chellappa, Electrical & Computer Engineering, Carnegie Mellon University
Dec 3, 2008
Designing “faster” processors
- Need for speed
- Parallelism: forms
  - Superscalar
  - Pipelining
  - Vector
  - Multi-core
  - Multi-node
Designing “faster” processors
- Need for speed
- Parallelism: forms (and limitations)
  - Superscalar (power density)
  - Pipelining (latch overhead: frequency scaling, branching)
  - Vector (programming burden; numeric-only)
  - Multi-core (memory wall, programming)
  - Multi-node (interconnects, reliability)
Multi-core Parallelism
- The future is definitely multi-core parallelism. But what problems and limitations do multi-cores have?
  - Increased programming burden
  - Scaling issues: power, interconnects, etc.
The Cell BE Approach
- Frequency wall: many simple, in-order cores
- Power wall: vectorized, in-order, arithmetic cores
- Memory wall: the Memory Flow Controller handles programmer-driven DMA in the background
[Diagram: Cell BE chip: PPE and 8 SPEs, each with a Local Store (LS), connected by the EIB, with an interface to main memory]
Presentation Overview
Cell Broadband Engine: Design
Programming on the Cell
Exercise: implement addition of vectors
Wrap-up
Cell Broadband Engine
- Designed for high-density floating-point computation (PlayStation 3, IBM Roadrunner)
- Compute:
  - Heterogeneous multi-core (1 PPE + 8 SPEs)
  - 204 Gflop/s (SPEs only)
  - High-speed on-chip interconnect
- Memory system:
  - Explicit scratchpad-type “local store”
  - DMA-based programming
- Challenges:
  - Parallelization, vectorization, explicit memory
  - New design: new programming paradigm
Cell BE Processor: A Closer Look
- Power Processing Element (PPE)
- Synergistic Processing Elements (SPE) x8
- Local Stores (LS)
Power Processing Element (PPE)
- Purpose: operating system, program control
- Uses the POWER Instruction Set Architecture
- 2-way multithreaded
- Cache: 32KB L1-I, 32KB L1-D, 512KB L2
- AltiVec SIMD
- System functions: virtualization, address translation/protection, exception handling
Synergistic Processing Element (SPE)
- SPU = execution unit + Local Store; SPE = SPU + Memory Flow Controller
- Components: Synergistic Processing Unit (SPU), Local Store (LS), Memory Flow Controller (MFC)
Synergistic Processing Unit (SPU)
- The number cruncher: 4-way (single) / 2-way (double) vectorization
- Peak performance (per SPE):
  - 25.6 Gflop/s (single precision): 3.2 GHz x 4-way (vector) x 2 (FMA)
  - <2 Gflop/s (double precision): not fully pipelined
  - eDP version: full-speed double precision (12.8 Gflop/s)
- 128 vector registers, each 128 bits wide
- Even and odd pipelines
- In-order, shallow pipelines
- No branch prediction (branch hinting instead)
- Completely deterministic timing
Local Stores (LS) and Memory Flow Controller (MFC)
- Local Stores:
  - Each SPU contains a 256KB LS (instead of a cache)
  - Explicit reads/writes (programmer issues DMAs)
  - Extremely fast (6-cycle load latency to the SPU)
- Memory Flow Controller:
  - Co-processor that handles DMAs in the background
  - 8/16 command-queue entries
  - Handles DMA lists (scatter/gather)
  - Barriers, fences, tag groups, etc.
  - Mailboxes, signals
Element Interconnect Bus (EIB)
- 4 data rings, 16B wide each: 2 clockwise, 2 counter-clockwise
- Supports multiple concurrent data transfers
- Data ports: 25.6 GB/s per direction
- 204.8 GB/s sustained peak
Direct Memory Access (DMA)
- Programmer-driven
- Packet sizes: 1B to 16KB
- Several alignment constraints (violations cause bus errors!)
- Packet size vs. performance trade-off
- DMA lists
- Get and put are SPE-centric: “get” reads main memory into the LS, “put” writes back
- Mailboxes/signals are also implemented as DMAs
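The alignment rules above can be captured in a small check. This is a portable sketch, not SDK code: `dma_alloc` and `dma_ok` are illustrative names, and `posix_memalign` stands in for the `__attribute__((aligned(128)))` buffers normally used on the SPE.

```c
#include <stdlib.h>
#include <stdint.h>

/* Hypothetical helper: allocate a DMA-friendly buffer. Transfers of
 * 16 bytes or more must be 16-byte aligned on both ends; 128-byte
 * alignment matches the EIB/cache-line granularity and performs best. */
static void *dma_alloc(size_t bytes)
{
    void *p = NULL;
    if (posix_memalign(&p, 128, bytes) != 0)
        return NULL;
    return p;
}

/* Check the constraints a get/put must satisfy to avoid a bus error:
 * 1B-16KB packets; small transfers are naturally aligned powers of
 * two; 16B+ transfers need 16B alignment and a multiple-of-16 size. */
static int dma_ok(const void *ls_addr, uint64_t ea, size_t size)
{
    if (size == 0 || size > 16384)
        return 0;
    if (size >= 16)
        return ((uintptr_t)ls_addr % 16 == 0) &&
               (ea % 16 == 0) && (size % 16 == 0);
    return (size == 1 || size == 2 || size == 4 || size == 8) &&
           ((uintptr_t)ls_addr % size == 0) && (ea % size == 0);
}
```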
Systems using the Cell
- Sony PlayStation 3:
  - 6 available SPEs (7th: hypervisor; 8th: disabled for yield)
  - Can run Linux (Fedora / Yellow Dog Linux)
  - Various PS3-cluster projects
- IBM BladeCenter QS20/QS22:
  - Two Cell processors
  - InfiniBand/Ethernet
IBM Roadrunner
- Supercomputer at Los Alamos National Lab (NM)
- Main purpose: modeling the decay of the US nuclear arsenal
- Performance:
  - World’s fastest [TOP500.org]
  - Peak: 1.7 petaflop/s; first to top 1.0 petaflop/s on Linpack
- Design: hybrid
  - Dual-core 64-bit AMD Opterons at 1.8 GHz (6,480 Opterons)
  - One Cell at 3.2 GHz attached to each Opteron core (12,960 Cells)
- Design hierarchy:
  - QS22 blade = 2 PowerXCell 8i
  - TriBlade = LS21 Opteron blade + 2x QS22 Cell blades (PCIe x8)
  - Connected Unit = 180 TriBlades (InfiniBand)
  - Cluster = 18 CUs (InfiniBand)
Presentation Overview
Cell Broadband Engine: Design
Programming on the Cell
Exercise: implement addition of vectors
Wrap-up
Programming on the Cell: Philosophy
- Major differences from traditional processors:
  - Not designed for scalar performance
  - Explicit memory access
  - Heterogeneous multi-core
- Using the SPEs:
  - SPMD model (Single Program, Multiple Data)
  - Streaming model
Programming Tips
- What kind of code is good/bad for the SPEs?
  - Avoid branching (no prediction); use branch hinting
  - Avoid scalar code (no hardware support)
  - Use intrinsics for vectorization and DMA
- Context switches are expensive: program and data reside in the LS and must be swapped in/out
- DMA code: alignment, alignment, alignment!
- Libraries are available to emulate a software-managed cache
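Since the SPU has no branch predictor, data-dependent branches are usually replaced by branch-free selects, which is what the real `spu_sel` intrinsic does on whole 128-bit vectors. A portable scalar sketch of the same idea (`select_no_branch` and `max_no_branch` are illustrative names):

```c
#include <stdint.h>

/* Branch-free select: returns a when take_b is 0, b otherwise.
 * mask becomes all-ones or all-zeros, so no conditional jump is
 * ever emitted for the data-dependent choice. */
static int32_t select_no_branch(int32_t a, int32_t b, int32_t take_b)
{
    int32_t mask = -(int32_t)(take_b != 0);
    return (a & ~mask) | (b & mask);
}

/* Example use: max without a branch. */
static int32_t max_no_branch(int32_t a, int32_t b)
{
    return select_no_branch(a, b, b > a);
}
```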
DMA Programming
- Main idea: hide memory accesses with multibuffering
  - Compute on one buffer in the LS
  - Write back / read in other batches of data in parallel
  - Like a completely controlled cache
- Inter-chip communication: mailboxes, signals, DMA
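The multibuffering idea can be sketched as a double-buffered loop. On the SPE the "fetch" and "put" steps would be tagged `mfc_get`/`mfc_put` DMAs waited on per buffer; here plain `memcpy` stands in for the DMA so only the overlap structure is shown, and `CHUNK` and `process()` are illustrative choices, not SDK API.

```c
#include <string.h>

#define CHUNK 4   /* elements per buffer; real code would use KB-sized chunks */

/* Compute step on one local buffer (stand-in workload: double each value). */
static void process(float *buf, int n)
{
    for (int i = 0; i < n; i++)
        buf[i] *= 2.0f;
}

/* While computing on buf[cur], the next chunk is already being
 * fetched into buf[nxt]; results are put back before switching. */
static void stream_double_buffered(float *dst, const float *src, int n)
{
    float buf[2][CHUNK];
    int chunks = n / CHUNK;

    memcpy(buf[0], src, CHUNK * sizeof(float));              /* prefetch #0 */
    for (int c = 0; c < chunks; c++) {
        int cur = c & 1, nxt = cur ^ 1;
        if (c + 1 < chunks)                                  /* fetch next */
            memcpy(buf[nxt], src + (c + 1) * CHUNK, CHUNK * sizeof(float));
        process(buf[cur], CHUNK);                            /* compute */
        memcpy(dst + c * CHUNK, buf[cur], CHUNK * sizeof(float)); /* put */
    }
}
```

With real DMAs the fetch returns immediately, so the compute step genuinely overlaps the memory traffic instead of running after it.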
Tools for Cell Programming
- IBM’s Cell SDK 3.0:
  - spu-gcc, ppu-gcc, xlc compilers
  - Simulator
  - libspe: SPE runtime management library
- Other tools:
  - Assembly visualizer (useful because the SPEs are in-order)
  - Single-source compilers (no OpenMP right now)
  - Offerings from RapidMind, Mercury, etc.
Program Design
- Use knowledge of the architecture to build a model
- Back-of-the-envelope calculations:
  - Cost of processing? Cost of communication?
  - Trends? Limits?
- How close is the model?
- What programming improvements can be made to fit the architecture better?
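A back-of-the-envelope model for a kernel like X[] += Y[] * Z[] is a simple roofline: it moves 16 bytes per element (three reads, one write) for 2 flops (one FMA), so the achievable rate is capped by bandwidth times arithmetic intensity. The helper below is an illustrative sketch using the deck's own per-SPE numbers, not part of any SDK.

```c
/* Roofline estimate: achievable Gflop/s is the smaller of the compute
 * peak and (bandwidth in GB/s) * (flops per byte moved). */
static double roofline_gflops(double peak_gflops, double bw_gbs,
                              double flops_per_byte)
{
    double mem_bound = bw_gbs * flops_per_byte;
    return mem_bound < peak_gflops ? mem_bound : peak_gflops;
}
```

For X[] += Y[] * Z[] at 2 flops / 16 bytes and 25.6 GB/s, the model predicts about 3.2 Gflop/s per SPE: heavily memory-bound, which is why hiding DMA matters more than raw vectorization here.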
Presentation Overview
Cell Broadband Engine: Design
Programming on the Cell
Exercise: implement addition of vectors
Wrap-up
Creating the PPE Program and SPE Threads
- Each program consists of PPE and SPE sections
- The program starts on the PPE; the PPE creates the SPE threads (a pthreads-like implementation, not full pthreads)
- PPE data structure to keep track of SPE threads
- PPE/SPE shared data structure for argument passing:
  - X, Y, Z addresses
  - Thread id
  - Returned cycle count
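The shared argument block described above might look like the following. Field names are illustrative, not from the course skeleton: effective addresses are kept as 64-bit integers (the SPE reaches main memory only via DMA), and the struct is aligned and padded to a multiple of 16 bytes so the whole block can be fetched with one aligned DMA.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical PPE/SPE shared control block for the exercise.
 * The PPE fills in the addresses and chunk size; the SPE writes
 * back its measured cycle count when done. */
typedef struct {
    uint64_t ea_x, ea_y, ea_z;   /* effective addresses of X, Y, Z */
    uint32_t n;                  /* elements in this SPE's chunk   */
    uint32_t thread_id;          /* 0..3                           */
    uint64_t cycles;             /* returned cycle count           */
} __attribute__((aligned(16))) spe_args_t;
```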
DMA Access

spu_writech(MFC_WrTagMask, -1);            /* wait on all tag groups */
spu_mfcdma64(ls_dest_address,              /* local-store destination (GET) */
             src_high_address, src_low_address,  /* 64-bit effective address */
             size_in_bytes,
             tag_id, MFC_GET_CMD);
spu_mfcstat(MFC_TAG_UPDATE_ALL);           /* block until the transfer completes */

Note the argument order: the local-store address comes first, then the high and low words of the effective address. Use my DMA_BL_GET, DMA_BL_PUT macros.
Compiling
- Compile the PPE and SPE programs separately
- Details: specify the SPE program name, call it from the PPE
- 32/64-bit (watch out for pointer sizes, etc.)
- The Cell SDK has sample Makefiles; we will use a simple Makefile
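A minimal sketch of such a split build. Tool names (`spu-gcc`, `ppu-gcc`, `ppu-embedspu`) are from IBM's Cell SDK 3.0; the file names are hypothetical and the exact embed step and flags may differ between SDK versions.

```makefile
# Build the SPE program, embed it into a PPE object, link the PPE binary.
all: addmul

spe_addmul: spe_addmul.c            # SPE-side program (spu-gcc)
	spu-gcc -O3 -o $@ $<

spe_addmul_csf.o: spe_addmul        # embed the SPE ELF into a PPE object
	ppu-embedspu spe_addmul $< $@

addmul: ppe_addmul.c spe_addmul_csf.o
	ppu-gcc -O3 -o $@ $^ -lspe2 -lpthread
```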
Performance Evaluation: Timing
- Performance measures: runtime, Gflop/s
- Timing:
  - Each SPE has its own decrementer
  - It decrements at an independent, lower frequency (79.8 MHz on the PS3; see cat /proc/cpuinfo)
  - Reset the counter to its highest value, then read the elapsed ticks
  - Measure on each SPE? Average? Min? Max? Which one fits the real-world scenario best?
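Converting decrementer readings to Gflop/s is simple arithmetic, sketched below. The decrementer counts down, so elapsed ticks are start minus stop; `timebase_hz` is whatever /proc/cpuinfo reports on the target (79.8 MHz on the PS3 per the slide above). The helper name is illustrative.

```c
#include <stdint.h>

/* Gflop/s from two decrementer readings. The counter counts DOWN at
 * the timebase frequency, not at the 3.2 GHz core clock. */
static double gflops(uint32_t dec_start, uint32_t dec_stop,
                     double timebase_hz, double flop_count)
{
    double seconds = (double)(dec_start - dec_stop) / timebase_hz;
    return flop_count / seconds / 1e9;
}
```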
Exercise 1: Add/Mul Two Arrays
- Goal: X[] += Y[] * Z[]
- Part 1: infrastructure; understand the skeleton code
- Part 2: parallelization and vectorization (easy)
- Part 3: hiding memory access costs
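The exercise kernel, written out as the scalar reference every later version (parallel, vectorized, multibuffered) should be checked against:

```c
/* Scalar reference for the exercise: X[] += Y[] * Z[].
 * One fused multiply-add per element. */
static void addmul_ref(float *x, const float *y, const float *z, int n)
{
    for (int i = 0; i < n; i++)
        x[i] += y[i] * z[i];
}
```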
Part 1
- Goals:
  - Understand the skeleton code
  - Get the infrastructure up and running (compiler, basic code)
  - Evaluate scalar, sequential code performance
- PPU’s tasks:
  - Initialize the vectors in main memory
  - Start up a thread for each SPU and let them run
  - Verify/print results and performance
- Use only a single SPU. The SPU’s task:
  - Get (DMA) all 3 arrays from main memory
  - Perform the computation
  - Put (DMA) the result back to main memory
  - Write the time back to the PPU
- Your tasks: compile, transform the code, add timer code
Part 2
- Goals:
  - Parallelize across 4 SPEs (easy with the skeleton code)
  - Vectorize X[] += Y[] * Z[] (easy)
- Evaluate: parallel code performance; vectorized parallel code performance
- PPU: start up 4 SPU threads; performance evaluation: how?
- SPU: DMA-get, compute, DMA-put only its own chunk; 4-way single-precision vectorization
- Your tasks: parallelize, vectorize, measure performance

vector float d = spu_madd(a, b, c);
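The "own chunk" split can be sketched portably as follows. `NUM_SPES`, `chunk_bounds`, and `addmul_chunk` are illustrative names, not SDK API; on a real SPE the inner loop would run over `vector float` quantities with `spu_madd` doing 4 elements per iteration, and n would be kept a multiple of 16 floats so every chunk stays DMA-aligned.

```c
#define NUM_SPES 4

/* Contiguous chunk [lo, hi) owned by one SPE thread; the last thread
 * absorbs any remainder. */
static void chunk_bounds(int thread_id, int n, int *lo, int *hi)
{
    int per = n / NUM_SPES;
    *lo = thread_id * per;
    *hi = (thread_id == NUM_SPES - 1) ? n : *lo + per;
}

/* Each SPE runs the kernel only on its own chunk (scalar stand-in for
 * the spu_madd vector loop). */
static void addmul_chunk(float *x, const float *y, const float *z,
                         int thread_id, int n)
{
    int lo, hi;
    chunk_bounds(thread_id, n, &lo, &hi);
    for (int i = lo; i < hi; i++)
        x[i] += y[i] * z[i];
}
```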
Presentation Overview
Cell Broadband Engine: Design
Programming on the Cell
Exercise: implement addition of vectors
Wrap-up
Exercise Debriefing
- How effectively did we use the architecture?
  - Parallelization and vectorization are mandatory!
  - Memory overlapping makes a big difference
- Do our optimizations work over a large size range? Smaller sizes: lower packet sizes?
- Real-world problems (Fourier transform, WHT):
  - Rarely embarrassingly parallel
  - Additional complexities?
WHT on the Cell
- Vectorization: as before
- Parallelization: must be locality-aware!
- Explicit memory access (code provided): multibuffering? How?
- Inter-SPE data exchange:
  - Algorithms that generate large packet sizes? Overlap?
  - Fast barrier
DMA Issues
- External multibuffering (streaming)
- Strategies by problem size:
  - Small/medium: data exchange on-chip, streaming
  - Large: trickier; break down into parts
- Use all memory banks
Cell Philosophy
- Do the Cell’s philosophies extend to other systems? Yes: the fundamental problems are the same
- Distributed-memory computing (clusters, supercomputers):
  - Processing is faster than interconnects
  - Higher interconnect bandwidth with larger packets
- Multicore processors:
  - Trend: NUMA, even on-chip
  - Locality-aware parallelism
Wrap-Up
- Programming the Cell BE for high-performance computing
- Cell: a chip multiprocessor designed for HPC, with applications from video gaming to supercomputers
- Programming burden is a factor in performance: parallelization, vectorization, memory handling
- Automated tools yield limited performance; for performance (especially on the Cell), programmers must understand the microarchitecture and its tradeoffs