
Brief Introduction to Parallella

Date post: 12-Jul-2015
Upload: somnath-mazumdar
Parallella Presented By: Somnath Mazumdar University of Siena, Italy
Transcript
Page 1: Brief Introduction to Parallella

Parallella

Presented By: Somnath Mazumdar

University of Siena, Italy

This presentation was held on 10th Dec 2014.

Place: Ericsson Research Lab, Lund, Sweden

This work is licensed under a Creative Commons Attribution 4.0 International License.


Outline

Introduction

Architecture

System View

Programming

Conclusion


Genesis

Influenced by open source hardware design projects: Arduino, BeagleBone

Inspired by: Raspberry Pi, ZedBoard

The board is open source hardware*

*https://github.com/parallella/parallella-hw

In the News: “Smallest Supercomputer in the World”

Adapteva A-1...

• Launched at ISC'14*

• It has 2,112 RISC cores

• Based on the 64-core Epiphany board

• Power consumption: 200 W

• Performance: 16 GFLOPS per watt

*http://primeurmagazine.com/weekly/AE-PR-07-14-104.html

Image Source: https://twitter.com/StreamComputing/media
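As a quick sanity check on the A-1 figures above, the sketch below derives the number of 64-core chips implied by 2,112 cores and the aggregate throughput implied by 16 GFLOPS per watt at 200 W. The function names are hypothetical, and the results are plain products of the slide's numbers, not measured values.

```c
#include <assert.h>

/* Number of chips needed to reach a total core count (rounded up). */
unsigned chips_needed(unsigned total_cores, unsigned cores_per_chip)
{
    assert(cores_per_chip > 0);
    return (total_cores + cores_per_chip - 1) / cores_per_chip;
}

/* Aggregate GFLOPS implied by an efficiency figure and a power budget. */
double implied_gflops(double gflops_per_watt, double watts)
{
    return gflops_per_watt * watts;
}
```

With the slide's numbers, chips_needed(2112, 64) gives 33 chips and implied_gflops(16.0, 200.0) gives 3200 GFLOPS, that is, roughly 3.2 TFLOPS for the whole machine.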

Adapteva (Zynq + Epiphany III)

• Based on the Epiphany™ architecture (multi-core MIMD architecture)

• SoC: fully programmable Xilinx Zynq with a dual-core ARM Cortex-A9 CPU

• 16/64-core microprocessor/coprocessor:
  • No cache
  • 32-bit cores
  • Max clock speed 1 GHz (600 MHz)
  • Peak performance: 32 GFLOPS
  • Supports fused multiply-add (FMA) operations
  • Superscalar floating-point (IEEE-754) RISC CPU core
  • Two floating-point operations per clock cycle

• Supports static dual-issue scheduling
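The 32 GFLOPS peak figure is consistent with cores x clock x floating-point operations per cycle. The sketch below is a minimal check of that arithmetic, assuming the 16-core part and the two operations per cycle quoted above; peak_mflops is a hypothetical helper name, not part of any SDK.

```c
#include <assert.h>

/* Peak throughput in MFLOPS: cores x clock (MHz) x floating-point ops per cycle. */
unsigned long peak_mflops(unsigned cores, unsigned clock_mhz, unsigned flops_per_cycle)
{
    assert(cores > 0 && clock_mhz > 0 && flops_per_cycle > 0);
    return (unsigned long)cores * clock_mhz * flops_per_cycle;
}
```

peak_mflops(16, 1000, 2) returns 32000 MFLOPS (32 GFLOPS), matching the slide; at 600 MHz the same formula gives 19200 MFLOPS (19.2 GFLOPS).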

• IALU: single 32-bit integer operation per clock cycle

• FPU: single floating-point instruction per clock cycle

• 64 general-purpose registers

• Program sequencer supports all standard program flows

• Branching costs 3 cycles

• No hardware support for:
  • Integer multiply
  • Floating-point divide
  • Double-precision floating-point operations

Figure: eCore CPU (1)

Epiphany Architecture (1)

Every router in the mesh is connected to its North, East, West, and South neighbors and to a mesh node.

The router at every node contains round-robin arbiters. Routing hop latency is 1.5 clock cycles.
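The per-hop latency can be turned into a rough estimate of routing delay between two nodes. The sketch below assumes dimension-ordered routing on the 2D mesh, so the path length equals the Manhattan distance, and it ignores arbitration stalls; both function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Hops between two mesh nodes, assuming the path length equals the
   Manhattan distance (an assumption, not stated on the slide). */
unsigned mesh_hops(int src_row, int src_col, int dst_row, int dst_col)
{
    assert(src_row >= 0 && src_col >= 0 && dst_row >= 0 && dst_col >= 0);
    return (unsigned)(abs(dst_row - src_row) + abs(dst_col - src_col));
}

/* Routing latency in clock cycles at 1.5 cycles per hop (slide figure). */
double routing_latency_cycles(unsigned hops)
{
    return 1.5 * hops;
}
```

For example, going from core (0,0) to core (3,3) on a 4x4 mesh takes 6 hops, or about 9 clock cycles of routing latency under these assumptions.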

Interconnects

• eCores are connected by a 2D low-latency NoC (eMesh):
  • rMesh for reads
  • xMesh for off-chip writes
  • cMesh for on-chip writes

• eMesh has only nearest-neighbor direct connections.

• Each routing link can transfer up to 8 bytes of data on every clock cycle.

Figure: Network-On-Chip Overview (1)

Figure: Network Topology (1)

• The network completes transactions in a single clock cycle because of spatial locality and short point-to-point on-chip wires.

• Each mesh node has a globally addressable ID (6-bit row ID and 6-bit column ID).
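The 6-bit row and 6-bit column IDs combine into a 12-bit core ID. The sketch below shows how such an ID can form a global address, assuming the layout described in the Epiphany Architecture Reference (1): the core ID occupies the upper 12 bits of the 32-bit address and the lower 20 bits are the offset inside that core. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* 12-bit core ID: 6-bit row in the upper half, 6-bit column in the lower half. */
uint32_t core_id(unsigned row, unsigned col)
{
    assert(row < 64 && col < 64);
    return (uint32_t)((row << 6) | col);
}

/* Global 32-bit address: core ID in bits 31..20, local offset in bits 19..0
   (assumed layout, see lead-in). */
uint32_t global_address(unsigned row, unsigned col, uint32_t local_offset)
{
    assert(local_offset < (1u << 20));
    return (core_id(row, col) << 20) | local_offset;
}
```

Under this layout, a core at mesh coordinates (32, 8) has ID 0x808 and its local memory starts at global address 0x80800000.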

Memory

• Shared memory (32-bit wide, flat, and unprotected)

• Primary memory: 1 GB (DDR3 SDRAM)

• Flash memory: 128 Mb (boot code)

• Little-endian memory architecture

• A single, flat address space consisting of 2^32 8-bit bytes (equivalently, 2^30 32-bit words)

• SRAM distribution:

Chip Core   Start Address   End Address   Size
(0,0)       00000000        00007FFF      32KB

• On every clock cycle, 64 bits of data or instructions can be exchanged between memory and the CPU's register file, the network interface, or the local DMA.

• Dual-channel DMA engine

• Memory-mapped registers

• Each eCore has 32KB of local memory (4 sub-banks x 8KB)

• The eCPU has a variable-length instruction pipeline that depends on the type of instruction being executed.
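To make the 4 x 8KB sub-bank split of the 32KB local memory concrete, the sketch below maps a local offset to a bank index. It assumes the banks are laid out contiguously in address order (bank 0 at 0x0000, bank 1 at 0x2000, and so on); the names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define BANK_SIZE  0x2000u   /* 8KB per sub-bank */
#define BANK_COUNT 4u        /* 4 sub-banks = 32KB of local memory */

/* Index of the sub-bank that holds a given local offset
   (assumes contiguous banks in address order). */
unsigned local_bank(uint32_t local_offset)
{
    assert(local_offset < BANK_SIZE * BANK_COUNT);
    return local_offset / BANK_SIZE;
}
```

With this layout, offsets 0x0000 to 0x1FFF fall in bank 0 and the last byte of local memory, 0x7FFF, falls in bank 3.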


Memory Architecture(2)

Memory: Read-Write Transactions

• Read transactions are non-blocking.

• RW transactions from local memory follow a strong memory-order model.

• RW transactions that access non-local memory follow a weak memory-order model.

• Soln: Use run-time synchronization calls with order-dependent memory sequences.

• Less inter-node communication

Scalability

• It has four identical source-synchronous bidirectional off-chip eLinks.

• eLink is non-blocking.

• Optimal bandwidth is achieved when a large number of incrementally numbered 64-bit data packets are sent consecutively.

Figure: FPGA eLink Integration (1)

360 Degree View (front)

Image Source: http://www.parallella.org/board/

360 Degree View (back)

Image Source: http://www.parallella.org/board/

PEC: Parallella Expansion Connector

How to get started

1. Create a Parallella micro-SD card (*1)

2. Connect the wires mentioned in (*2)

3. Power on

4. Go...

*1 http://www.parallella.org/create-sdcard/

*2 http://www.parallella.org/quick-start/

Epiphany Host Library (eHAL)

• Encapsulates low-level Epiphany functionality (Epiphany device driver)

• Library interface is defined in "e-hal.h".

• Steps to write a program:

1. Prepare the system:

    e_init(NULL);                     // initialize the system
    e_reset_system();                 // reset the platform
    e_get_platform_info(&platform);   // get the actual system parameters

2. Allocate memory (optional):

    e_mem_t emem;                                // object of type e_mem_t
    char emsg[Size];
    e_alloc(&emem, <BufOffset>, <BufferSize>);   // allocate a buffer in shared external memory

3. Open workgroup:

    e_open(&dev, 0, 0, platform.rows, platform.cols);   // open all cores (OR)
    e_open(&dev, 0, 0, 1, 1);                           // core coordinates relative to the workgroup
    e_reset_group(&dev);                                // soft reset

4. Load program:

    e_load("program", &dev, 0, 0, E_TRUE);

5. Wait and then print the message from the buffer:

    usleep(time);
    e_read(&emem, 0, 0, 0x0, emsg, _BufSize);
    fprintf(stderr, "\"%s\"\n", emsg);

6. Close every connection:

    e_close(&dev);
    e_free(&emem);
    e_finalize();

Epiphany Hardware Utility Library (eLib)

• Provides functions for configuring and querying eCores.

• Also automates many common programming tasks in eCores.

• Steps to write an eCore program:

Step 1: Declare shared memory:

    char outbuf[128] SECTION("shared_dram");

Step 2: Query the eCore ID:

    e_coreid_t coreid;
    coreid = e_get_coreid();

Step 3: Print "Hello World" with the core ID

Step 4: Exit

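Hello World

Host side:

    #include <stdio.h>
    #include <unistd.h>
    #include "e-hal.h"

    /* _BufSize and _BufOffset are not defined on the slide; they must be
       defined to match the buffer used on the eCore side. */

    int main(int argc, char *argv[])
    {
        e_platform_t platform;
        e_epiphany_t dev;
        e_mem_t emem;
        char emsg[_BufSize];

        e_init(NULL);                                // initialize the system
        e_reset_system();                            // reset the platform
        e_get_platform_info(&platform);              // get the actual system parameters

        e_alloc(&emem, _BufOffset, _BufSize);        // buffer in shared external memory
        e_open(&dev, 0, 0, 1, 1);                    // open a single-core workgroup
        e_load("e_core.srec", &dev, 0, 0, E_TRUE);   // load the program and start it

        usleep(10000);                               // give the eCore time to run
        e_read(&emem, 0, 0, 0x0, emsg, _BufSize);    // read the message back
        fprintf(stderr, "\"%s\"\n", emsg);

        e_close(&dev);
        fflush(stdout);
        e_free(&emem);
        e_finalize();
        return 0;
    }

eCore side:

    #include <needed .h files>
    #include "e-lib.h"

    char outbuf[128] SECTION("shared_dram");   // message buffer in shared DRAM

    int main(void)
    {
        e_coreid_t coreid;

        coreid = e_get_coreid();               // query this core's ID
        sprintf(outbuf, "Hello World from core 0x%03x!", coreid);
        return 0;
    }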


Epiphany Program Build Flow(2)

Where to put the code

• 3 different Linker Description Files (LDF):

• internal.ldf: stores data and instructions in internal SRAM (limit 32KB).

• fast.ldf: user code, data, and stack in internal SRAM; standard libraries in external DRAM. Good for a few large library functions.

• legacy.ldf: everything stored in external DRAM (limit 1MB). Slower than internal and fast.

Synchronization (eCores)

http://www.linuxplanet.org/blogs/?cat=2359

Barrier for synchronizing threads executing in parallel:

1. Setup: e_barrier_init(bar_array[], tgt_bar_array[])

2. Call function

3. Wait for sync: e_barrier(bar_array[], tgt_bar_array[])

Mutex (blocking and non-blocking):

1. Setup: e_mutex_init(0, 0, s_mutex, mutex_attr)

2. Gain access: e_mutex_lock(0, 0, s_mutex)

3. Call function

4. Release access: e_mutex_unlock(0, 0, s_mutex)


Image Source: http://xkcd.com/1445/

My Understanding

Synchronization between the ARM and eCores: use a flag

Because: eMesh writes from an individual Epiphany core to the external shared DRAM update the DRAM in the same order as they were sent. However, if multiple cores are writing to external DRAM, the order in which their writes reach the DRAM is not guaranteed.

Soln:

1. Set a flag

2. Use the software barrier function e_barrier() (time consuming)

3. Use the experimental hardware barrier opcode
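The flag approach relies on the ordering guarantee for a single writer: a core writes its payload first and the flag last, and the reader trusts the payload only after it sees the flag. The sketch below illustrates that pattern in portable C11 so it can be compiled on any host. It is not Epiphany SDK code: mailbox_t and both functions are hypothetical, and the C11 atomics stand in for the ordered writes to shared DRAM.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>

typedef struct {
    char msg[128];      /* payload, written first */
    atomic_int ready;   /* flag, written last */
} mailbox_t;

/* Writer side: store the payload, then raise the flag. */
void mailbox_publish(mailbox_t *box, const char *text)
{
    strncpy(box->msg, text, sizeof box->msg - 1);
    box->msg[sizeof box->msg - 1] = '\0';
    atomic_store_explicit(&box->ready, 1, memory_order_release);
}

/* Reader side: copy the payload only if the flag is set, then clear it. */
bool mailbox_try_read(mailbox_t *box, char *out, size_t out_size)
{
    assert(out_size > 0);
    if (atomic_load_explicit(&box->ready, memory_order_acquire) == 0)
        return false;   /* nothing published yet */
    strncpy(out, box->msg, out_size - 1);
    out[out_size - 1] = '\0';
    atomic_store_explicit(&box->ready, 0, memory_order_relaxed);
    return true;
}
```

On the board, the flag and payload would live in a shared DRAM buffer and the host would poll the flag instead of sleeping for a fixed time; with several writing cores, each core would need its own flag, since ordering across cores is not guaranteed.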

Useful for Sync

eCore-side read and write:

    e_write(remote, Src, row, col, Dst, Byte_size);
    e_read(remote, Dst, row, col, Src, Byte_size);

The remote parameter must be either:

• e_group_config if the remote is a workgroup core, or

• e_emem_config if the remote is an external memory buffer

Conclusion

• Fast and power efficient

• Power needed: 5 V / 2 A (0.3 A - 1.5 A)

• Fully-featured ANSI-C/C++ and OpenCL programming environments

• Large application domain support

• But:
  • Needs an improved SDK (on the way)
  • A cache might improve performance (a software cache is on the way)
  • Synchronization and randomness are a big issue

References

1. Epiphany Architecture Reference: http://www.adapteva.com/docs/epiphany_arch_ref.pdf

2. Epiphany SDK Reference: http://adapteva.com/docs/epiphany_sdk_ref.pdf

3. eSDK GitHub: https://github.com/adapteva/epiphany-sdk

4. Reading: http://www.adapteva.com/all-documents/

