Programming the Cell Multiprocessor

Işıl ÖZ

Outline

Cell processor
– Objectives
– Design and architecture

Programming the Cell
– Programming models

CellSs

Cell Processor

Cell Broadband Engine Architecture
– Cell BE

Developed by the STI (SCEI-Toshiba-IBM) design center
– STI formed in 2000
– STI design center opened in 2001
– Introduced in 2005
– 65 nm in 2007, 45 nm in 2008

Cell Processor Objectives

Outstanding performance, especially on game/multimedia applications
– Memory latency
– Power efficiency
– Processor frequency and pipeline depth

Real-time response to the user and the network
Applicable to a wide range of platforms
Support for introduction in 2005

Cell Architecture

A 64-bit Power processor element (PPE)
8 synergistic processor elements (SPEs)
Memory controller
Bus-interface controller
Element interconnect bus

Power Processor Element

PPE
– Power core
– First-level cache (L1)
– Second-level cache (L2)

PPE Major Units

Synergistic Processor Elements

SPEs
– DMA (Direct Memory Access Unit)
– LS (Local Store Memory)
– SXUs (Execution Units)

SPE Organization

Controllers

Memory Interface Controller
– Interfaces to the Rambus XDR I/O unit, which communicates directly with the DRAM modules

Bus Interface Controller
– Interfaces to Rambus FlexIO, which provides communication with other system components

Element Interconnect Bus

EIB
– Coherent, on-chip bus
– Connects the processing elements, memory and I/O devices

Programming the Cell

Local store memory in SPEs (256 KB)
SIMD nature of dataflows
Size of the register file (128 registers, each 128 bits wide)
Single program context
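To make the local store constraint concrete, the sketch below checks whether a task's working set fits in 256 KB. The local store holds both code and data, so the budget covers both. The helper names, the block dimensions, and the code size are hypothetical values chosen for illustration, not figures from the slides.

```c
#include <stdbool.h>
#include <stddef.h>

/* SPE local store size, as stated on the slide. */
enum { LOCAL_STORE_BYTES = 256 * 1024 };

/* Bytes occupied by one square block of single-precision floats. */
static size_t block_bytes(size_t block_dim)
{
    return block_dim * block_dim * sizeof(float);
}

/* True if num_blocks data blocks plus code_bytes of program text
 * fit in the local store at the same time. */
static bool fits_in_local_store(size_t block_dim, size_t num_blocks,
                                size_t code_bytes)
{
    return num_blocks * block_bytes(block_dim) + code_bytes
           <= LOCAL_STORE_BYTES;
}
```

With 4-byte floats, three 64 x 64 blocks take 48 KB and leave ample room for code, while four 128 x 128 blocks already fill the entire local store with no space left for the program itself.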

Programming Models

Function offload model
Device extension model
Computational acceleration model
Streaming models
Shared-memory multiprocessor model
Asymmetric thread runtime model

A Programming Model: CellSs

Cell superscalar
– Simple and flexible
– Automatic parallelization of a sequential program
– Task scheduling and data handling

CellSs Structure

Based on
– Code annotations
– C language

Composed of
– Source-to-source compiler
– Runtime library

CellSs Compilation Environment

CellSs Compiler

Source-to-source compiler
– Functions (tasks) to be executed in the SPEs
– Function parameter directions
– Parameters that are arrays and their lengths

No pointers!

Parallelism on CellSs

Annotated code
Generated code for the PPE
Generated code for the SPE

CellSs Syntax

Three types of pragmas
– Initialization and finalization: css start and css finish
– Task: css task [input inout output]
– Synchronization: css wait

Example CellSs Source Code

start/finish

task

wait for task
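A minimal sketch of what annotated source could look like, using the three pragma types listed on the syntax slide. The function names, the block size `N`, and the `on(total)` clause of the wait pragma are assumptions made for this illustration. The start/finish pair would normally bracket the main program; here it sits in a helper function to keep the sketch short. A regular C compiler ignores the unknown pragmas, so the same file still runs sequentially.

```c
#include <stddef.h>

#define N 4  /* elements per block; hypothetical size for illustration */

/* task: a is only read (input), b is read and updated (inout).
 * Parameters are arrays with explicit lengths, not bare pointers. */
#pragma css task input(a) inout(b)
void accumulate(float a[N], float b[N])
{
    for (size_t i = 0; i < N; i++)
        b[i] += a[i];
}

/* Adds `count` blocks of `data` element-wise into `total`. */
void sum_blocks(float data[][N], size_t count, float total[N])
{
/* start/finish: initialize and shut down the runtime */
#pragma css start
    for (size_t k = 0; k < count; k++)
        accumulate(data[k], total);  /* each call is a candidate task */

/* wait for task: block until the tasks writing total are done */
#pragma css wait on(total)
#pragma css finish
}
```

Because every call updates `total` as an inout parameter, the calls form a chain of true dependencies and would execute in order even under the runtime.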

CellSs Runtime

Execute function
– Add a node to the task graph
– Data dependency analysis (RaW, WaR, WaW)
– Parameter renaming
– Task submission
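The dependency analysis step can be sketched as a classification of two accesses to the same data by parameter direction. This is a simplified model, not the CellSs runtime's code: the type and function names are hypothetical, and renaming is assumed to remove every WaR and WaW hazard by giving the later writer a fresh copy, so that only RaW dependencies become edges in the task graph.

```c
#include <stdbool.h>

/* Parameter directions from the task pragma. */
typedef enum { DIR_INPUT, DIR_OUTPUT, DIR_INOUT } Direction;

/* Hazard kinds, combinable as a bitmask. */
enum { DEP_NONE = 0, DEP_RAW = 1, DEP_WAR = 2, DEP_WAW = 4 };

static bool dir_reads(Direction d)
{
    return d == DIR_INPUT || d == DIR_INOUT;
}

static bool dir_writes(Direction d)
{
    return d == DIR_OUTPUT || d == DIR_INOUT;
}

/* Hazards between an earlier and a later access to the same data. */
unsigned classify(Direction earlier, Direction later)
{
    unsigned deps = DEP_NONE;
    if (dir_writes(earlier) && dir_reads(later))
        deps |= DEP_RAW;  /* read after write: true dependency */
    if (dir_reads(earlier) && dir_writes(later))
        deps |= DEP_WAR;  /* write after read: false dependency */
    if (dir_writes(earlier) && dir_writes(later))
        deps |= DEP_WAW;  /* write after write: false dependency */
    return deps;
}

/* Under the simplified renaming assumption, only a RaW hazard
 * forces the later task to wait for the earlier one. */
bool needs_edge_after_renaming(Direction earlier, Direction later)
{
    return (classify(earlier, later) & DEP_RAW) != 0;
}
```

For example, an output followed by an input on the same array is a RaW dependency and orders the two tasks, while an input followed by an output is only a WaR hazard that renaming can remove.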

CellSs Runtime Behavior

Middleware for the Cell

Task scheduling
– Task control buffer
– Task grouping
– Dynamic scheduling

Locality Aware Task Scheduling

Tracing

A tracing component embedded in the CellSs runtime generates Paraver trace files, recording events
– When the main program enters or exits
– When an annotated function is called in the main program
– When a task is started or finished

Performance Analysis

Matmul
– Block matrix multiplication

TSP
– Recursive implementation of the Traveling Salesman Problem

Cholesky
– Block matrix Cholesky factorization
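As an illustration of the Matmul benchmark, a block matrix multiplication can be written as a loop nest over blocks whose body is a single annotated task. This is a sketch under assumptions: the function names and the tiny sizes `NB` and `BS` are hypothetical, and the code is not taken from the benchmark itself. A regular C compiler ignores the pragma and runs the loop nest sequentially.

```c
#define NB 2  /* blocks per matrix dimension (hypothetical) */
#define BS 2  /* elements per block dimension (hypothetical) */

/* Task: c += a * b on one BS x BS block. */
#pragma css task input(a, b) inout(c)
void block_multiply_add(float a[BS][BS], float b[BS][BS],
                        float c[BS][BS])
{
    for (int i = 0; i < BS; i++)
        for (int j = 0; j < BS; j++)
            for (int k = 0; k < BS; k++)
                c[i][j] += a[i][k] * b[k][j];
}

/* C += A * B, with each matrix stored as NB x NB blocks. */
void blocked_matmul(float A[NB][NB][BS][BS], float B[NB][NB][BS][BS],
                    float C[NB][NB][BS][BS])
{
    for (int i = 0; i < NB; i++)
        for (int j = 0; j < NB; j++)
            for (int k = 0; k < NB; k++)
                block_multiply_add(A[i][k], B[k][j], C[i][j]);
}
```

The tasks that update the same `C[i][j]` block form a dependency chain through the inout parameter, while tasks for different output blocks are independent and could run on different SPEs.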

Performance Analysis

TSP
– No data dependencies

Cholesky
– Highly connected data dependency graph

Performance Analysis

x-axis: timeline
y-axis: a thread of the application
green: events
yellow: communications

Performance Analysis

yellow: SPE thread DMA transfer
brown: SPE executing the task

Pros and Cons

Annotations
– Simple
– But limited

Data transfer is transparent to the user code
Task dependency analysis

Relies on other compilers for
– Code vectorization (SPE performance)
– Lower-level code optimization

Related Work

OpenMP
Accelerated Library Framework (ALF)
Thread-level synchronization
Sequoia
RapidMind
Ohara
Graphics Processor Units (GPUs)


