Brief Introduction to Parallella

Post on 12-Jul-2015

Parallella

Presented by: Somnath Mazumdar

University of Siena, Italy

This presentation was held on 10th Dec 2014.

Place: Ericsson Research Lab, Lund, Sweden

This work is licensed under a Creative Commons Attribution 4.0 International License.

Outline

Introduction

Architecture

System View

Programming

Conclusion

Genesis

Influenced by open source hardware design projects: Arduino, BeagleBone

Inspired by: Raspberry Pi, ZedBoard

The board is open source hardware*

*https://github.com/parallella/parallella-hw

In the News: “Smallest Supercomputer in the World”

Adapteva A-1

• Launched at ISC'14*

• It has 2,112 RISC cores

• Based on the 64-core Epiphany board

• Power consumption: 200 W

• Performance: 16 GFLOPS per watt

*http://primeurmagazine.com/weekly/AE-PR-07-14-104.html

Image Source: https://twitter.com/StreamComputing/media

Adapteva (Zynq + Epiphany III)

• Based on the Epiphany™ architecture (multi-core MIMD architecture)

• SoC: fully programmable Xilinx Zynq with a dual-core ARM Cortex-A9 CPU

• 16/64-core microprocessor/coprocessor:
    - No cache
    - 32-bit cores
    - Max clock speed 1 GHz (600 MHz)
    - Peak performance: 32 GFLOPS
    - Supports fused multiply-add (FMA) operations
    - Superscalar floating-point (IEEE-754) RISC CPU core
    - Two floating-point operations per clock cycle

• Supports static dual-issue scheduling

• IALU: a single 32-bit integer operation per clock cycle

• FPU: a single floating-point instruction per clock cycle

• 64 general-purpose registers

• Program sequencer supports all standard program flows

• Branching costs 3 cycles

• No hardware support for:
    - Integer multiply
    - Floating-point divide
    - Double-precision floating-point operations

eCore CPU(1)
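
The peak-performance figure quoted on the slide can be reproduced as cores × clock × floating-point operations per cycle. The following is a minimal sketch of that arithmetic; the function name `peak_gflops` is hypothetical, and reading the "(600 MHz)" in parentheses as the clock actually used on the board is an assumption.

```c
#include <assert.h>

/* Peak floating-point throughput in GFLOPS.
 *   cores           - number of eCores (16 or 64 on the slide)
 *   clock_ghz       - core clock in GHz
 *   flops_per_cycle - 2 on the slide (two floating-point operations per cycle)
 * Hypothetical helper for illustration only. */
double peak_gflops(unsigned cores, double clock_ghz, unsigned flops_per_cycle)
{
    assert(clock_ghz > 0.0);
    return (double)cores * clock_ghz * (double)flops_per_cycle;
}
```

With 16 cores at 1 GHz and two operations per cycle this gives the 32 GFLOPS stated above. At 600 MHz the same formula gives about 19.2 GFLOPS.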

Epiphany Architecture(1)

Every router in the mesh is connected to North, East, West, South, and to a mesh node.

The router at every node contains round-robin arbiters. The routing latency is 1.5 clock cycles per hop.
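
As a rough illustration of the per-hop latency above, the sketch below estimates the routing latency between two mesh nodes. It assumes that a packet takes a shortest (Manhattan) path through the 2D mesh and that there is no contention at the arbiters; both are simplifying assumptions, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Number of router-to-router hops between two mesh nodes,
 * assuming a shortest path on the 2D grid (assumption). */
unsigned mesh_hops(int src_row, int src_col, int dst_row, int dst_col)
{
    assert(src_row >= 0 && src_col >= 0 && dst_row >= 0 && dst_col >= 0);
    return (unsigned)(abs(dst_row - src_row) + abs(dst_col - src_col));
}

/* Estimated routing latency in clock cycles:
 * 1.5 cycles per hop (figure from the slide), no contention. */
double mesh_latency_cycles(int src_row, int src_col, int dst_row, int dst_col)
{
    return 1.5 * (double)mesh_hops(src_row, src_col, dst_row, dst_col);
}
```

For example, going from node (0,0) to node (3,3) takes 6 hops, so the estimate is 9 clock cycles.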

Interconnects

• eCores are connected by a 2D low-latency NoC (eMesh):
    - rMesh for reads
    - xMesh for off-chip writes
    - cMesh for on-chip writes

• eMesh has only nearest-neighbor direct connections.

• Each routing link can transfer up to 8 bytes of data on every clock cycle.

Network-On-Chip Overview (1)

Network Topology(1)
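
The per-link figure of 8 bytes per clock cycle translates into a peak link bandwidth once a clock frequency is chosen. A minimal sketch of that arithmetic follows; the function name is hypothetical and the result is a theoretical peak, not a measured rate.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes moved per link per clock cycle (figure from the slide). */
enum { LINK_BYTES_PER_CYCLE = 8 };

/* Theoretical peak bandwidth of one routing link in bytes per second. */
uint64_t link_peak_bytes_per_second(uint64_t clock_hz)
{
    assert(clock_hz <= UINT64_MAX / LINK_BYTES_PER_CYCLE); /* avoid overflow */
    return clock_hz * LINK_BYTES_PER_CYCLE;
}
```

At a 600 MHz clock this gives 4.8 GB/s per link, and at 1 GHz it gives 8 GB/s.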

• The network completes transactions in a single clock cycle because of spatial locality and short point-to-point on-chip wires.

• Each mesh node has a globally addressable ID (6-bit row ID and 6-bit column ID)
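
The 12-bit node ID leaves 20 bits of a 32-bit address for the offset inside a node. The sketch below packs and unpacks such an address. The exact bit layout (row in bits 31..26, column in bits 25..20, offset in bits 19..0) is an assumption based on the Epiphany Architecture Reference, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Build a 32-bit global address from a mesh node ID and a local offset.
 * Assumed layout: [31:26] row ID, [25:20] column ID, [19:0] local offset. */
uint32_t make_global_addr(unsigned row, unsigned col, uint32_t offset)
{
    assert(row < 64 && col < 64);   /* 6-bit row and column IDs */
    assert(offset < (1u << 20));    /* 20-bit local offset */
    return ((uint32_t)row << 26) | ((uint32_t)col << 20) | offset;
}

/* Decode the three fields again. */
unsigned addr_row(uint32_t addr)    { return (addr >> 26) & 0x3Fu; }
unsigned addr_col(uint32_t addr)    { return (addr >> 20) & 0x3Fu; }
uint32_t addr_offset(uint32_t addr) { return addr & 0xFFFFFu; }
```

For example, under this layout offset 0x2000 in the node with row ID 32 and column ID 8 maps to the global address 0x80802000.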

Memory

• Shared memory (32-bit-wide, flat, and unprotected)

• Primary memory: 1GB (DDR3 SDRAM)

• Flash memory: 128Mb (boot code)

• Little-endian memory architecture

• A single, flat address space consisting of 2^32 8-bit bytes (equivalently, 2^30 32-bit words)

• SRAM distribution:

    Chip Core   Start Address   End Address   Size
    (0,0)       00000000        00007FFF      32KB

• On every clock cycle 64 bits of data / instructions can be exchanged between memory and CPU’s register file, network interface or local DMA.

• Dual channel DMA engine

• Memory Mapped Registers

• Each eCore has 32KB of local memory (4 sub-banks × 8KB)

• eCPU has a variable-length instruction pipeline that depends on the type of instruction being executed.
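
To make the 4 × 8KB split concrete, the sketch below maps a local offset to a sub-bank index. It assumes the four sub-banks occupy consecutive 8KB ranges of the 32KB local memory (0x0000 to 0x7FFF); the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { BANK_SIZE = 8 * 1024, NUM_BANKS = 4 };

/* The four banks together must cover the 32KB range 0x0000..0x7FFF. */
static_assert(BANK_SIZE * NUM_BANKS == 0x8000, "local memory is 32KB");

/* Return the sub-bank (0..3) that holds a local offset,
 * or -1 if the offset lies outside the 32KB local memory.
 * Assumes banks are laid out back to back (assumption). */
int local_bank(uint32_t offset)
{
    if (offset >= (uint32_t)BANK_SIZE * NUM_BANKS)
        return -1;
    return (int)(offset / BANK_SIZE);
}
```

A helper like this is mainly useful when planning where to place code and data, for example to keep buffers that are accessed at the same time in different sub-banks.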

Memory Architecture(2)

Memory: Read-Write Transactions

• Read transactions are non-blocking.

• Read/write (RW) transactions from local memory follow a strong memory-order model.

• RW transactions that access non-local memory follow a weak memory-order model.

• Solution: use run-time synchronization calls with order-dependent memory sequences.

• Less inter-node communication

Scalability

• It has four identical source-synchronous, bidirectional off-chip eLinks.

• eLink is non-blocking

• Optimal bandwidth is achieved when a large number of incrementally numbered 64-bit data packets are sent consecutively

FPGA eLink Integration(1)

360 Degree View(front)

Image Source : http://www.parallella.org/board/

360 Degree View(back)

Image Source : http://www.parallella.org/board/

PEC: Parallella Expansion Connector

How to get started

1. Create a Parallella micro-SD card: http://www.parallella.org/create-sdcard/

2. Connect the wires as described in the quick-start guide: http://www.parallella.org/quick-start/

3. Power on

4. Go...

Epiphany Host Library (eHAL)

• Encapsulates low-level Epiphany functionality (Epiphany device driver)

• Library interface is defined in “e-hal.h”.

• Steps to write a program:

1. Prepare the system:

    e_init(NULL);                     // initialize the system
    e_reset_system();                 // reset the platform
    e_get_platform_info(&platform);   // get the actual system parameters

2. Allocate memory (optional):

    e_mem_t emem;                     // object of type e_mem_t
    char emsg[Size];
    e_alloc(&emem, <BufOffset>, <BufferSize>);   // allocate a buffer in shared external memory

3. Open a workgroup:

    e_open(&dev, 0, 0, platform.rows, platform.cols);   // open all cores, OR
    e_open(&dev, 0, 0, 1, 1);         // open a single core; arguments are (row, col, rows, cols)
    e_reset_group(&dev);              // soft reset

4. Load the program:

    e_load("program", &dev, 0, 0, E_TRUE);

5. Wait, then print the message from the buffer:

    usleep(time);
    e_read(&emem, 0, 0, 0x0, emsg, _BufSize);
    fprintf(stderr, "\"%s\"\n", emsg);

6. Close every connection:

    e_close(&dev);
    e_free(&emem);
    e_finalize();

Epiphany Hardware Utility Library (eLib)

• Provides functions for configuring and querying eCores.

• Also automates many common programming tasks in eCores.

• Steps to write an eCore program:

Step 1: Declare shared memory:

    char outbuf[128] SECTION("shared_dram");

Step 2: Query the eCore ID:

    e_coreid_t coreid;
    coreid = e_get_coreid();

Step 3: Print “Hello World” with the core ID

Step 4: Exit

Hello World

Host side:

    #include <stdio.h>
    #include <unistd.h>
    #include <e-hal.h>

    #define _BufSize 128   // assumed: matches outbuf[128] on the eCore side
    // _BufOffset is not shown on the slide; it must be defined as the
    // offset of the "shared_dram" buffer in external memory.

    int main(int argc, char *argv[])
    {
        e_platform_t platform;
        e_epiphany_t dev;
        e_mem_t emem;
        char emsg[_BufSize];

        e_init(NULL);
        e_reset_system();
        e_get_platform_info(&platform);
        e_alloc(&emem, _BufOffset, _BufSize);
        e_open(&dev, 0, 0, 1, 1);
        e_load("e_core.srec", &dev, 0, 0, E_TRUE);
        usleep(10000);
        e_read(&emem, 0, 0, 0x0, emsg, _BufSize);
        fprintf(stderr, "\"%s\"\n", emsg);
        e_close(&dev);
        fflush(stdout);
        e_free(&emem);
        e_finalize();
        return 0;
    }

eCore side:

    #include <stdio.h>     // for sprintf
    #include "e-lib.h"

    char outbuf[128] SECTION("shared_dram");

    int main(void)
    {
        e_coreid_t coreid;
        coreid = e_get_coreid();

        sprintf(outbuf, "Hello World from core 0x%03x!", coreid);

        return 0;
    }

Epiphany Program Build Flow(2)

Where to put the code

• Three different Linker Description Files (LDF):

• internal.ldf: stores data and instructions in internal SRAM (limit 32KB).

• fast.ldf: user code, data, and stack in internal SRAM; standard libraries in external DRAM. Good for a few large library functions.

• legacy.ldf: everything stored in external DRAM (limit 1MB). Slower than internal and fast.
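
The choice between the three linker description files can be read as a simple decision rule on code size. The sketch below encodes that rule using only the two limits stated on the slide (32KB of internal SRAM, 1MB for the external-DRAM layout). It is an illustrative heuristic, not part of the SDK: all names are hypothetical, and stack usage and alignment are ignored.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { LDF_INTERNAL, LDF_FAST, LDF_LEGACY, LDF_NONE } ldf_choice_t;

static const size_t SRAM_LIMIT = 32u * 1024u;    /* internal SRAM per eCore */
static const size_t DRAM_LIMIT = 1024u * 1024u;  /* limit quoted for legacy */

/* Pick a linker description file from the sizes (in bytes) of the
 * user code/data and of the standard-library code it pulls in. */
ldf_choice_t choose_ldf(size_t user_bytes, size_t lib_bytes)
{
    /* Bounded inputs keep the sum below from overflowing. */
    assert(user_bytes <= DRAM_LIMIT && lib_bytes <= DRAM_LIMIT);
    size_t total = user_bytes + lib_bytes;

    if (total <= SRAM_LIMIT)
        return LDF_INTERNAL;   /* everything fits in internal SRAM */
    if (user_bytes <= SRAM_LIMIT)
        return LDF_FAST;       /* user part in SRAM, libraries in DRAM */
    if (total <= DRAM_LIMIT)
        return LDF_LEGACY;     /* everything in external DRAM */
    return LDF_NONE;           /* does not fit any of the three layouts */
}
```

The rule prefers the fastest layout that fits, which matches the ordering on the slide: internal first, then fast, then legacy.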

Synchronization(eCores)

http://www.linuxplanet.org/blogs/?cat=2359

Barrier for synchronizing threads that execute in parallel

1. Setup: e_barrier_init(bar_array[], tgt_bar_array[])

2. Call function

3. Wait for sync: e_barrier(bar_array[], tgt_bar_array[])

Mutex (blocking and non-blocking)

1. Setup: e_mutex_init(0, 0, s_mutex, mutex_attr)

2. Gain access: e_mutex_lock(0, 0, s_mutex)

3. Call function

4. Release access: e_mutex_unlock(0, 0, s_mutex)

Image Source: http://xkcd.com/1445/

Synchronization between the ARM and eCores: use a flag

Because: eMesh writes from an individual Epiphany core to the external shared DRAM update the DRAM in the same order as they were sent. However, if multiple cores are writing to external DRAM, the order in which their writes reach the DRAM relative to each other is not guaranteed.

Solution:

1. Set a flag

2. Use the software barrier function e_barrier() (time consuming)

3. Use the experimental hardware barrier opcode
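
The "set a flag" option can be sketched as a mailbox with one writer per flag: the producer writes its data first and raises the flag last, and the consumer reads the data only after it has seen the flag. The code below is a host-only illustration using C11 atomics as a stand-in for the ordered writes of a single core. The `mailbox_t` type and all function names are hypothetical; on the board the mailbox would live in shared DRAM and be accessed through the eHAL and eLib read/write calls instead.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mailbox: one producer, one consumer, one flag. */
typedef struct {
    char payload[128];
    atomic_int ready;   /* 0 = empty, 1 = payload is valid */
} mailbox_t;

void mailbox_init(mailbox_t *mb)
{
    memset(mb->payload, 0, sizeof mb->payload);
    atomic_init(&mb->ready, 0);
}

/* Producer side: write the data first, then raise the flag. */
void mailbox_publish(mailbox_t *mb, const char *msg)
{
    assert(strlen(msg) < sizeof mb->payload);
    strcpy(mb->payload, msg);
    atomic_store_explicit(&mb->ready, 1, memory_order_release);
}

/* Consumer side: returns 1 and copies the payload if the flag is set,
 * returns 0 (leaving out untouched) if nothing has been published. */
int mailbox_try_consume(mailbox_t *mb, char *out, size_t out_size)
{
    if (atomic_load_explicit(&mb->ready, memory_order_acquire) == 0)
        return 0;
    assert(out_size >= sizeof mb->payload);
    memcpy(out, mb->payload, sizeof mb->payload);
    atomic_store_explicit(&mb->ready, 0, memory_order_relaxed); /* mark empty */
    return 1;
}
```

Giving every writing core its own mailbox and flag avoids relying on any ordering between writes from different cores, which is the case the slide warns about.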

My Understanding

Useful for sync

eCore-side read and write:

    e_write(remote, Src, row, col, Dst, Byte_size);
    e_read(remote, Dst, row, col, Src, Byte_size);

The remote parameter must be either e_group_config, if the remote is a workgroup core, or e_emem_config, if the remote is an external memory buffer.

Conclusion

• Fast and power efficient

• Power needed 5V/2A (0.3A -1.5A)

• Fully-featured ANSI-C/C++ and OpenCL programming environments

• Large application domain support

• But:
    - Needs an improved SDK (on the way)
    - A cache might improve performance (a software cache is on the way)
    - Synchronization and randomness are a big issue

Reference

1. Epiphany Architecture Reference: http://www.adapteva.com/docs/epiphany_arch_ref.pdf

2. Epiphany SDK Reference: http://adapteva.com/docs/epiphany_sdk_ref.pdf

3. eSDK GitHub: https://github.com/adapteva/epiphany-sdk

4. Reading: http://www.adapteva.com/all-documents/