Date post: 26-Dec-2015
Cell/B.E.

Jiří Dokulil

Introduction

Cell Broadband Engine

developed by Sony, Toshiba and IBM

64-bit PowerPC

PowerPC Processor Element (PPE): runs the OS

Synergistic Processor Element (SPE): 8x, SIMD, computations, no OS

big endian

Architecture

Memory access

PPE: load & store, cache

SPE: DMA, up to 16 concurrent per SPE

no direct access to memory

no need for out-of-order processing, no speculation

local storage, no cache

PPE

PowerPC Processor Element: PPU (PowerPC Processor Unit) and PPSS (PowerPC Processor Storage Subsystem)

64-bit, dual-thread PowerPC Architecture RISC core

2x 32 KB L1 (instructions and data), 512 KB L2 (unified)

PowerPC instruction set with vector/SIMD extensions, different from the SPE instruction set

32x 128-bit vector registers

SPE

Synergistic Processor Element: SPU (Synergistic Processor Unit) and MFC (Memory Flow Controller)

RISC, SIMD: Synergistic Processor Unit Instruction Set Architecture

support for DMA and interprocessor messaging

256 KB LS (local storage), 128x 128-bit register file

DMA access to main memory uses the segment and page tables of the PPE

channels in the MFC: unidirectional message-passing interfaces

memory-mapped I/O (MMIO) registers and queues

EIB

Element Interconnect Bus

four 16-byte-wide data rings

transfers 128 bytes at a time (one PPE cache line)

internal bandwidth of 96 bytes per clock cycle

latency depends on the number of hops, because the bus is a ring

runs at half the frequency of the SPU

DMA

MFCs support naturally aligned DMA transfer sizes of 1, 2, 4, or 8 bytes, and multiples of 16 bytes

maximum transfer size of 16 KB per transfer

DMA list commands can initiate up to 2048 transfers

peak transfer performance is reached if both the effective addresses and the LS addresses are 128-byte aligned and the size of the transfer is an even multiple of 128 bytes

SMM (Synergistic Memory Management) unit: processes address translation and access-permission information supplied by the PPE operating system
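The size and alignment rules above can be written down as a few predicates. This is a minimal sketch in portable C, not SDK code: the function names are made up for illustration, a zero-length transfer is treated as invalid as a simplifying assumption, and "naturally aligned" is read as "aligned to the transfer size, capped at 16 bytes".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DMA_MAX_SIZE (16u * 1024u) /* 16 KB limit for a single transfer */

/* True if size is 1, 2, 4, 8, or a nonzero multiple of 16 up to 16 KB.
   (Hypothetical helper; size 0 is rejected as a simplifying assumption.) */
bool dma_size_ok(uint32_t size)
{
    if (size == 1 || size == 2 || size == 4 || size == 8)
        return true;
    return size != 0 && size % 16 == 0 && size <= DMA_MAX_SIZE;
}

/* True if the size is legal and both addresses are naturally aligned. */
bool dma_transfer_ok(uint32_t ls, uint64_t ea, uint32_t size)
{
    if (!dma_size_ok(size))
        return false;
    uint32_t align = size < 16 ? size : 16;
    assert(align >= 1 && align <= 16);
    return ls % align == 0 && ea % align == 0;
}

/* True if the transfer also meets the peak-performance conditions:
   128-byte aligned addresses and a size that is a multiple of 128. */
bool dma_peak_ok(uint32_t ls, uint64_t ea, uint32_t size)
{
    return dma_transfer_ok(ls, ea, size)
        && ls % 128 == 0 && ea % 128 == 0 && size % 128 == 0;
}
```

For instance, a 16 KB transfer to LS address 0x510 is legal under these checks but misses the peak-performance conditions, because 0x510 is only 16-byte aligned.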

SIMD example

// 16 iterations of a loop

int rolled_sum(unsigned char bytes[16])

{

int i;

int sum = 0;

for (i = 0; i < 16; ++i)

{

sum += bytes[i];

}

return sum;

}

SIMD example cont.

// Vectorized for vector/SIMD multimedia extension

int vectorized_sum(unsigned char bytes[16])

{

vector unsigned char vbytes;

union {

int i[4];

vector signed int v;

} sum;

vector unsigned int zero = (vector unsigned int){0};

// Perform a misaligned vector load of the 16 bytes.

vbytes = vec_perm(vec_ld(0, bytes), vec_ld(16, bytes), vec_lvsl(0, bytes));

// Sum the 16 bytes of the vector

sum.v = vec_sums((vector signed int)vec_sum4s(vbytes, zero), (vector signed int)zero);

// Extract the sum and return the result.

return (sum.i[3]);

}
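To make the two reduction steps in the vectorized version easier to follow, here is a scalar emulation in portable C. It is only a sketch of what vec_sum4s and vec_sums compute when the accumulator operand is zero; the function names are illustrative, and saturation is ignored because 16 bytes can sum to at most 4080.

```c
#include <assert.h>
#include <stdint.h>

/* Step 1, like vec_sum4s with a zero accumulator: add each group of four
   adjacent bytes into one 32-bit partial sum. */
void sum_across_quarters(const unsigned char bytes[16], uint32_t partial[4])
{
    for (int w = 0; w < 4; ++w) {
        partial[w] = 0;
        for (int b = 0; b < 4; ++b)
            partial[w] += bytes[4 * w + b];
    }
}

/* Step 2, like vec_sums with a zero accumulator: add the four partial sums.
   The vector instruction leaves this total in element 3 of the result. */
int sum_across_all(const uint32_t partial[4])
{
    int total = 0;
    for (int w = 0; w < 4; ++w)
        total += (int)partial[w];
    return total;
}

/* Scalar stand-in for vectorized_sum. */
int emulated_vectorized_sum(const unsigned char bytes[16])
{
    uint32_t partial[4];
    sum_across_quarters(bytes, partial);
    assert(partial[0] <= 4 * 255); /* four bytes cannot exceed 1020 */
    return sum_across_all(partial);
}
```

This explains why the vectorized code returns sum.i[3]: on the big-endian PPE, element 3 is where vec_sums places the total.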

Communication

DMA: 2 command queues per SPE, one for commands issued by the SPE and one for commands issued by the PPE and other SPEs

commands have tags (32 different), used for status queries

a command is one transfer or a list

mailboxes: for each SPE, communication with the PPE; 2 outgoing (1 message each), 1 incoming (4 messages)

signals: 2 inbound channels

DMA

put, get

SPE or PPE initiated

tag: 5 bits

ordering: out of order by default

barrier: maintains order (within tag group)

fence: runs after all previous commands (within tag group)

simple transfers or lists

lists are stored in LS (8 bytes per item), so SPE only

up to 2048 transfers of 16 KB each, 32 MB in total; compare to the 256 KB LS size
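The list limits above can be checked with a small sketch that splits a large transfer into list elements of at most 16 KB. The list_element struct is a simplified, hypothetical 8-byte layout (a size and the low 32 bits of the effective address); the real element format in the SDK headers carries additional control bits. build_dma_list is likewise an illustrative name, not an SDK function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DMA_MAX_SIZE       (16u * 1024u) /* bytes per list element */
#define DMA_LIST_MAX_ELEMS 2048u         /* elements per list command */

/* Simplified stand-in for one 8-byte list element. */
typedef struct {
    uint32_t size; /* transfer size in bytes, at most 16 KB */
    uint32_t eal;  /* low 32 bits of the effective address */
} list_element;

_Static_assert(sizeof(list_element) == 8, "list items are 8 bytes each");

/* Fill list with chunks covering total_bytes starting at eal.
   Returns the number of elements used, or 0 if the request is empty or
   does not fit into capacity (or into the 2048-element limit). */
size_t build_dma_list(list_element *list, size_t capacity,
                      uint32_t eal, size_t total_bytes)
{
    assert(total_bytes % 16 == 0); /* keep every chunk a multiple of 16 */
    size_t n = 0;
    while (total_bytes > 0) {
        if (n == capacity || n == DMA_LIST_MAX_ELEMS)
            return 0;
        uint32_t chunk = total_bytes > DMA_MAX_SIZE ? DMA_MAX_SIZE
                                                    : (uint32_t)total_bytes;
        list[n].size = chunk;
        list[n].eal = eal;
        eal += chunk;
        total_bytes -= chunk;
        ++n;
    }
    return n;
}
```

A full list of 2048 elements occupies 2048 x 8 bytes = 16 KB of local storage and can describe 2048 x 16 KB = 32 MB of transfers, far more than the 256 KB LS can hold at once.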

DMA – PPE raw access

MFC registers mapped to virtual address space

void *ps = get_ps(); //get the problem state – must be mapped by privileged software

unsigned int ls = 0x500;

unsigned long long ea = 0x10000000;

unsigned int size = 0x4000;

unsigned int tag = 5;

unsigned int classid = 0;

unsigned int cmd = MFC_GET_CMD;

unsigned int cmd_status;

do {

*((volatile unsigned int *)(ps + MFC_LSA)) = ls;

*((volatile unsigned long long *)(ps + MFC_EAH)) = ea;

*((volatile unsigned int *)(ps + MFC_Size)) = (size << 16) | tag;

*((volatile unsigned int *)(ps + MFC_ClassID)) = (classid << 16) | cmd;

/* Read MFC_CMDStatus to enqueue command and check enqueue success. */

cmd_status = *((volatile unsigned int *)(ps + MFC_CMDStatus)) & 0x3;

} while (cmd_status); /* Attempt to enqueue until success */

only enqueues the command

DMA – PPE raw access cont.

test for completion (poll tag group status)

void *ps = get_ps();

unsigned int tag_mask = 1 << 5;

unsigned int tag_status;

*((volatile unsigned int *)(ps + Prxy_QueryMask)) = tag_mask;

__asm__("eieio"); /* force write to Prxy_QueryMask to complete */

do {

tag_status = *((volatile unsigned int *)(ps + Prxy_TagStatus));

} while (!tag_status);

more tag groups

unsigned int tag_mask = (1u<<5)|(1u<<14)|(1u<<31);
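The bit manipulation in the PPE examples can be sketched as two small helpers: one builds a tag-group mask from a list of tag ids, the other packs size and tag the way the example writes them to MFC_Size. Both function names are hypothetical. Unsigned constants are used for the shifts, since shifting a signed 1 into bit 31 is undefined behavior in C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Build a tag-group mask with one bit set per tag id (0-31). */
uint32_t tag_mask_of(const unsigned *tags, size_t count)
{
    uint32_t mask = 0;
    for (size_t i = 0; i < count; ++i) {
        assert(tags[i] < 32); /* tags are 5 bits wide */
        mask |= UINT32_C(1) << tags[i];
    }
    return mask;
}

/* Pack transfer size (upper 16 bits) and tag (lower bits) into one word,
   mirroring the (size << 16) | tag expression in the enqueue example. */
uint32_t pack_size_tag(uint32_t size, uint32_t tag)
{
    assert(size <= 0x4000 && tag < 32);
    return (size << 16) | tag;
}
```

With tags 5, 14 and 31 the mask is 0x80004020, and a 16 KB transfer with tag 5 packs to 0x40000005.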

DMA – SPE

no direct access to the virtual address space, only by DMA

direct access to own command channels

wrch assembly instruction

extern void dma_transfer(volatile void *lsa, // local storage address

unsigned int eah, // high 32-bit effective address

unsigned int eal, // low 32-bit effective address

unsigned int size, // transfer size in bytes

unsigned int tag_id, // tag identifier (0-31)

unsigned int cmd); // DMA command

in assembler:

wrch $MFC_LSA, $3

wrch $MFC_EAH, $4

wrch $MFC_EAL, $5

wrch $MFC_Size, $6

wrch $MFC_TagID, $7

wrch $MFC_Cmd, $8

in C intrinsic: spu_mfcdma64(lsa, eah, eal, size, tag_id, cmd);

DMA – SPE cont.

poll for completion

# Set tag group mask

wrch $MFC_WrTagMask, $0

# Set up for immediate tag status update.

il $1, 0

repeat:

wrch $MFC_WrTagUpdate, $1

rdch $1, $MFC_RdTagStat

brz $1, repeat

OR

#include <spu_intrinsics.h>

#include <spu_mfcio.h>

unsigned int tag_id = 0;

unsigned int tag_mask = 1 << tag_id;

spu_writech(MFC_WrTagMask, tag_mask);

do {

}while(!spu_mfcstat(MFC_TAG_UPDATE_IMMEDIATE)); /* poll for update */

DMA – SPE cont.

wait for completion (stall SPE)

# Set tag group mask

wrch $MFC_WrTagMask, $0

# 0x1 for any tag, 0x2 for all tags.

il $1, 0x1

# Wait for conditional tag status update (stall the SPU).

wrch $MFC_WrTagUpdate, $1

rdch $1, $MFC_RdTagStat

OR

#include <spu_intrinsics.h>

#include <spu_mfcio.h>

unsigned int tag_id = 0;

unsigned int tag_mask = 1 << tag_id;

spu_writech(MFC_WrTagMask, tag_mask);

/* Wait for all ids in tag group to complete (stall the SPU) */

spu_mfcstat(MFC_TAG_UPDATE_ALL);

DMA – SPE cont.

completion of DMA: the source buffer can be reused

the data may not yet have been written to main storage

a mailbox notification can reach the PPE before the data

the SPE can do mfcsync

the PPE can do lwsync (more efficient)

the SPE can notify via DMA; mfceieio must be used between the DMAs for ordering

Mailboxes

32-bit messages

blocking for the SPE (stalls the SPE): reading of an empty inbound mailbox, writing of a full outbound mailbox

the SPE can poll the number of messages

non-blocking for the PPE (and other devices): reading an empty mailbox returns zeros, writing to a full one overwrites the last message
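The inbound mailbox semantics described above can be modeled with a tiny queue. This is a simulation in portable C, not hardware access: in_mbox and the three functions are hypothetical names, the depth of 4 comes from the slides, and the SPE-side stall on an empty mailbox is replaced by an assertion.

```c
#include <assert.h>
#include <stdint.h>

#define IN_MBOX_DEPTH 4 /* the inbound mailbox holds 4 messages */

typedef struct {
    uint32_t slot[IN_MBOX_DEPTH];
    unsigned count; /* number of queued messages */
} in_mbox;

/* PPE-side write: never blocks. If the mailbox is full, the newest
   queued message is overwritten. */
void ppe_write_in_mbox(in_mbox *m, uint32_t value)
{
    if (m->count == IN_MBOX_DEPTH)
        m->slot[IN_MBOX_DEPTH - 1] = value;
    else
        m->slot[m->count++] = value;
}

/* Free slots, the quantity the PPE polls before writing. */
unsigned in_mbox_free_slots(const in_mbox *m)
{
    return IN_MBOX_DEPTH - m->count;
}

/* SPE-side read of the oldest message. A real SPE would stall on an
   empty mailbox; the model requires the caller to check first. */
uint32_t spe_read_in_mbox(in_mbox *m)
{
    assert(m->count > 0);
    uint32_t value = m->slot[0];
    for (unsigned i = 1; i < m->count; ++i)
        m->slot[i - 1] = m->slot[i];
    --m->count;
    return value;
}
```

Writing the values 1 to 5 without checking for free slots leaves 1, 2, 3, 5 in the model: the fifth write replaces the fourth message, which is the overrun a PPE writer has to avoid by polling first.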

Mailboxes – SPE

send (stalling)

wrch $SPU_WrOutMbox, $1

or

spu_writech(SPU_WrOutMbox, mb_value);

send (active waiting)

repeat:

rchcnt $2, $SPU_WrOutMbox

brz $2, repeat

wrch $SPU_WrOutMbox, $1

or

do {

/* Do other useful work while waiting. */

} while (!spu_readchcnt(SPU_WrOutMbox));

spu_writech(SPU_WrOutMbox, mb_value);

Mailboxes – SPE cont.

read (stalling)

rdch $1, $SPU_RdInMbox

or

mb_value = spu_readch(SPU_RdInMbox);

read (active waiting)

repeat:

rchcnt $1, $SPU_RdInMbox

brz $1, repeat

rdch $2, $SPU_RdInMbox

or

do {

/* Do other useful work while waiting. */

} while (!spu_readchcnt(SPU_RdInMbox));

mb_value = spu_readch(SPU_RdInMbox);

Mailboxes – PPE

read the SPE’s outbound mailbox

void *ps = get_ps();

unsigned int mb_status;

unsigned int new;

unsigned int mb_value;

do {

mb_status = *((volatile unsigned int *)(ps + SPU_Mbox_Stat));

new = mb_status & 0x000000FF;

} while ( new == 0 );

mb_value = *((volatile unsigned int *)(ps + SPU_Out_Mbox));

Mailboxes – PPE cont.

writing to the SPE’s inbound mailbox

problem of overrunning a full mailbox

//send four messages without overrunning the mailbox

void *ps = get_ps();

unsigned int j,k = 0;

unsigned int mb_status;

unsigned int slots;

unsigned int mb_value[4] = {0x1, 0x2, 0x3, 0x4};

do {

/* Poll the Mailbox Status Register until the SPU_In_Mbox_Count field indicates there is at least one slot available in the SPU Read Inbound Mailbox. */

do {

mb_status = *((volatile unsigned int *)(ps + SPU_Mbox_Stat));

slots = (mb_status & 0x0000FF00) >> 8;

} while ( slots == 0 );

for (j=0; j<slots && k < 4; j++) {

*((volatile unsigned int *)(ps + SPU_In_Mbox)) = mb_value[k++];

}

} while ( k < 4 );

CELL SDK 3.1: http://www.ibm.com/developerworks/power/cell/

Cell BE Programming Handbook Including PowerXCell 8i: http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/1741C509C5F64B3300257460006FD68D

SPE Runtime Management Library: http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/1DFEF31B3211112587257242007883F3

PPU & SPU C/C++ Language Extension Specification: http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/30B3520C93F437AB87257060006FFE5E

libspe & libspe2

low level APIs to access Cell from C/C++

new threading model in libspe2

use the threading library of your choice and use libspe2 from there – no “SPE threads”

create e.g. a pthread thread and launch SPE code from that – the call returns after the SPE finishes

Compilation

PPE object: g++ [-m64] -c -Ox

SPE object: spu-gcc -Ox

no -m64, LS addresses are always 32-bit

ppu-embedspu [-m64] <symbol> <object> <output>

link: g++ [-m64] <spe object> <ppe object> -lspe -lspe2

Referencing SPE code from PPE code

extern spe_program_handle_t <symbol>;

spe_program_load(spe_context,&<symbol>);

Launching SPE code (libspe2)

struct thread_data {
    spe_context_ptr_t context;
    program_data* pd;
};

void *ppu_pthread_function(void *arg) {
    thread_data td = *(thread_data *) arg;
    spe_context_ptr_t context = td.context;
    unsigned int entry = SPE_DEFAULT_ENTRY;
    /* Runs the SPE program; returns only after the SPE finishes. */
    spe_context_run(context, &entry, 0, td.pd, NULL, NULL);
    pthread_exit(NULL);
}

spe_context_ptr_t context;
pthread_t pthread;
thread_data td;
context = spe_context_create(0, NULL);
spe_program_load(context, &spe_prg);
td.context = context; /* td.pd must also point to the program data before the thread starts */
pthread_create(&pthread, NULL, &ppu_pthread_function, &td);
pthread_join(pthread, NULL);
spe_context_destroy(context);

SPE code

#include <spu_mfcio.h>

int main(unsigned long long spe_id, unsigned long long program_data_ea, unsigned long long env)
{
    program_data pd __attribute__((aligned(16)));
    int tag_id = 1;
    /* Fetch the program data from main storage into local storage. */
    mfc_get(&pd, program_data_ea, sizeof(pd), tag_id, 0, 0);
    mfc_write_tag_mask(1<<tag_id);
    mfc_read_tag_status_any();
    …
}

Program data

structure shared by SPE and PPE code

unsigned long long for 64-bit pointers, because void* is 32-bit on the SPE and 32/64-bit on the PPE

be careful with the alignment: DMA cannot handle misaligned transfers

size padded to 16 bytes
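A shared structure that follows these rules might look like the sketch below. The fields of program_data are not shown in the slides, so input_ea, output_ea and count are purely hypothetical, as are the helper functions. C11 _Alignas is used here to get both the 16-byte alignment and the padding of the size to a multiple of 16.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical shared layout: pointers travel as 64-bit integers so the
   structure has the same size on the PPE (32- or 64-bit) and the SPE. */
typedef struct {
    _Alignas(16) uint64_t input_ea;  /* effective address of the input */
    uint64_t output_ea;              /* effective address of the output */
    uint32_t count;                  /* number of items to process */
} program_data;

/* Aligning the first member aligns the struct and pads its size. */
_Static_assert(_Alignof(program_data) == 16, "must be 16-byte aligned");
_Static_assert(sizeof(program_data) % 16 == 0, "size must be a multiple of 16");

/* Convert a pointer to a 64-bit effective address. */
uint64_t ea_of(const void *p)
{
    return (uint64_t)(uintptr_t)p;
}

/* Fill in the shared structure on the PPE side. */
void program_data_init(program_data *pd, const void *in, void *out,
                       uint32_t count)
{
    assert(ea_of(pd) % 16 == 0); /* DMA needs an aligned source */
    pd->input_ea = ea_of(in);
    pd->output_ea = ea_of(out);
    pd->count = count;
}
```

The three fields need 20 bytes, and the alignment requirement pads the structure to 32, so sizeof(pd) is a legal DMA size when the SPE fetches it.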

DMA – SPE side

(void) mfc_put(volatile void *ls, uint64_t ea, uint32_t size, uint32_t tag, uint32_t tid, uint32_t rid)

initiates a transfer from LS

tag is a number (e.g. 5)

mfc_putb, mfc_putf: put with barrier, put with fence

DMA – SPE side cont.

(void) mfc_get(volatile void *ls, uint64_t ea, uint32_t size, uint32_t tag, uint32_t tid, uint32_t rid)

mfc_getb, mfc_getf

DMA status – SPE side

(void) mfc_write_tag_mask (uint32_t mask)

tag mask (e.g. 1<<5)

(uint32_t) mfc_read_tag_status_any(void)

blocks until any of the specified tag groups has no outstanding operations

(uint32_t) mfc_read_tag_status_all(void)

blocks until all of the specified tag groups have no outstanding operations

Mailboxes – SPE side

(uint32_t) spu_read_in_mbox(void)

(uint32_t) spu_stat_in_mbox(void)

(void) spu_write_out_mbox(uint32_t data)

(uint32_t) spu_stat_out_mbox(void)

Mailboxes – PPE side

int spe_out_mbox_read (spe_context_ptr_t spe, unsigned int *mbox_data, int count)

int spe_out_mbox_status (spe_context_ptr_t spe)

int spe_in_mbox_write (spe_context_ptr_t spe, unsigned int *mbox_data, int count, unsigned int behavior)

SPE_MBOX_ALL_BLOCKING: blocks until all are sent

SPE_MBOX_ANY_BLOCKING: blocks until at least one message is sent

SPE_MBOX_ANY_NONBLOCKING: sends as many as possible without blocking

int spe_in_mbox_status (spe_context_ptr_t spe)

PPE direct access to SPE

void* spe_ls_area_get (spe_context_ptr_t spe)

less efficient than DMA

int spe_ls_size_get (spe_context_ptr_t spe)

void* spe_ps_area_get (spe_context_ptr_t spe, enum ps_area area)

enum ps_area: SPE_MFC_COMMAND_AREA (MFC registers), SPE_CONTROL_AREA (mailboxes)

the get_ps function used in the examples from the first part

