Exploring Task Parallelism for Heterogeneous Systems Using Multicore Task Management API

EuroPAR 2016 | ROME Workshop

Suyang Zhu (1), Sunita Chandrasekaran (2), Peng Sun (1), Barbara Chapman (1), Marcus Winter (3), Tobias Schuele (4)

(1) Dept. of Computer Science, University of Houston
(2) Dept. of Computer and Information Sciences, University of Delaware
(3) Hypnolords Gbr
(4) Siemens Corporate Technology

Unrestricted © Siemens AG 2016. All rights reserved
Introduction

Current Trends in Embedded Systems

Embedded systems are everywhere:

Industrial automation

Energy production and distribution

Healthcare / medical imaging

Transportation and traffic control

Consumer electronics

In-field data analytics

Industry 4.0

Autonomous driving

Augmented / virtual reality

(Image sources: Siemens, Bosch)

Requirements and key characteristics:

Real-time capability (progress guarantees, nonblocking operations)

Resource awareness (no dynamic memory allocation during operation)

Portability / platform independence

Energy efficiency

Fine-grained control over hardware

Heterogeneous systems

Introduction

Benefits of Multi-/Manycore in Embedded Systems

High computing power: processing of more data in the same time; additional functionality and new features

High energy efficiency: battery-powered devices; passive cooling (no fans)

Low material costs: functional consolidation; integration of previously separate hardware

(Image source: Siemens)

Introduction

“In 2022, multicore will be everywhere.” (IEEE CS)

Most frameworks for parallel programming (e.g., Parallel Patterns Library, Threading Building Blocks, Open MPI) target desktop / server / HPC applications.

Not suitable for embedded systems

Top challenges for multicore (IEEE CS 2022 Report) [1]:

Hard real-time architectures with local memory and their programming

Low-power scalable homogeneous and heterogeneous architectures

[1] H. Alkhatib, P. Faraboschi, E. Frachtenberg, H. Kasahara, D. Lange, P. Laplante, A. Merchant, D. Milojicic, and K. Schwan. IEEE CS 2022 Report. IEEE Computer Society, 2014. www.computer.org/cms/Computer.org/ComputingNow/2022Report.pdf

Introduction

Heterogeneous Systems

Heterogeneous architectures provide high performance at low power consumption by incorporating specialized processing units to handle particular tasks.

Processor manufacturers integrate general purpose processors together with accelerators like GPUs and FPGAs on the same chip.

Examples: Xilinx Zynq UltraScale MPSoC, Nvidia Tegra K1, Qualcomm Snapdragon 810

Increased complexity at silicon and system level

Proprietary interfaces and tool-chains

Long time-to-market, lack of portability

Programming Model

Multicore Task Management API (MTAPI)

MTAPI

Standardized API for task-parallel programming on a wide range of hardware architectures

Developed and driven by practitioners of market-leading companies

Part of Multicore-Association’s ecosystem (MRAPI, MCAPI, SHIM, OpenAMP, …)

Contributing members: [logos not reproduced; one member is marked as working group lead]

Key concepts: tasks, queues, heterogeneous systems (shared memory, distributed memory, different instruction set architectures)

The Multicore Association develops and promotes open specifications for multicore product development.

Programming Model

MTAPI for Heterogeneous Systems

Heterogeneous systems are modelled using MTAPI nodes and domains.

[Figure: An MTAPI application issues MTAPI tasks to four nodes that together form a domain. The nodes comprise CPU cores, a GPU, and a DSP with their memories, running on a scheduler / library or on operating systems (OS 1, OS 2). The MTAPI runtime system (optionally using MCAPI / MRAPI) spans all nodes.]

Programming Model

MTAPI Terms in a Nutshell

MTAPI distinguishes between jobs, actions, and tasks:

Job: A piece of processing implemented by an action. Each job has a unique identifier.

Action: Implementation of a job, may be hardware or software-defined.

Task: Execution of a job resulting in the invocation of an action implementing the job associated with some data to be processed.

[Figure: A task accomplishes a job; a job is implemented by one or more actions (Action 1, Action 2, ..., Action n).]
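The relationship between jobs, actions, and tasks can be illustrated with a small data model. The following is a minimal sketch in plain C and not the MTAPI API: the types `job_t` and `task_t`, the functions `job_add_action` and `task_run`, and the action signature are all hypothetical names chosen for illustration. It assumes a job may have several interchangeable actions and that running a task means invoking exactly one of them on the task's data.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ACTIONS 4  /* fixed capacity: no dynamic allocation */

/* An action is one implementation of a job (hypothetical signature). */
typedef int (*action_fn)(int input);

/* A job has a unique id and lists the actions that implement it. */
typedef struct {
    unsigned  id;
    action_fn actions[MAX_ACTIONS];
    size_t    num_actions;
} job_t;

/* A task is one execution of a job on some data. */
typedef struct {
    const job_t *job;
    int          input;
    int          result;
    int          done;
} task_t;

/* Register another implementation for a job; returns 0 on success. */
int job_add_action(job_t *job, action_fn fn) {
    assert(job != NULL && fn != NULL);
    if (job->num_actions == MAX_ACTIONS) return -1;
    job->actions[job->num_actions++] = fn;
    return 0;
}

/* Run a task by invoking one action of its job. The selection policy
   (index modulo count) is a placeholder for whatever a real runtime's
   scheduler would decide. */
int task_run(task_t *task, size_t preferred) {
    assert(task != NULL && task->job != NULL);
    if (task->job->num_actions == 0) return -1;
    action_fn fn = task->job->actions[preferred % task->job->num_actions];
    task->result = fn(task->input);
    task->done = 1;
    return 0;
}

/* Two interchangeable implementations of the same job (doubling a value). */
int double_by_add(int x) { return x + x; }
int double_by_mul(int x) { return x * 2; }
```

Registering both `double_by_add` and `double_by_mul` for one job and calling `task_run` with either index yields the same result, which mirrors the idea that the caller names the job while the runtime picks the action.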

Programming Model

MTAPI for Heterogeneous Systems (cont.)

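Example for the usage of MTAPI in heterogeneous systems:

[Figure: The application on Node 1 (CPU) starts Task 1, Task 2, and Task 3, which refer to Job A and Job B (labelled "Matrix mult." and "FFT"). The jobs are implemented by Action I, Action II, and Action III, distributed across Node 1 (CPU), Node 2 (GPU), and Node 3 (DSP).]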

Programming Model

MTAPI for Heterogeneous Systems (cont.)

// Define actions (parameter lists are elided on the slide)
void Action_I(...)   { CUDA_Kernel(arg->A, arg->B, arg->C, arg->n); }    // CUDA implementation
void Action_II(...)  { OpenCL_Kernel(arg->A, arg->B, arg->C, arg->n); }  // OpenCL implementation
void Action_III(...) { CPP_Kernel(arg->A, arg->B, arg->C, arg->n); }     // C++ implementation

// Create actions and associate them with jobs
// (JOB_A has two alternative implementations, JOB_B has one)
mtapi_action_create(JOB_A, Action_I, ...);
mtapi_action_create(JOB_A, Action_II, ...);
mtapi_action_create(JOB_B, Action_III, ...);

// Start tasks (each task refers to a job, not to a specific action)
mtapi_task_hndl_t task[3];
task[0] = mtapi_task_start(0, JOB_A, args0, ...);
task[1] = mtapi_task_start(0, JOB_A, args1, ...);
task[2] = mtapi_task_start(0, JOB_B, args2, ...);

// Wait for task completion
mtapi_task_wait(task[0], MTAPI_INFINITE, ...);
mtapi_task_wait(task[1], MTAPI_INFINITE, ...);
mtapi_task_wait(task[2], MTAPI_INFINITE, ...);

Example with two MTAPI jobs, three actions, and three tasks
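The slide code accesses `arg->A`, `arg->B`, `arg->C`, and `arg->n` but elides the argument type and the kernels. The following is a minimal sketch of what a CPU-side action might look like, assuming square row-major matrices. The names `matmul_args_t`, `cpu_matmul`, and `matmul_action` are hypothetical, and the two-parameter wrapper signature is a simplification for illustration, not the action signature defined by the MTAPI specification.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical argument bundle matching the fields used on the slide
   (arg->A, arg->B, arg->C, arg->n); the real type is elided there. */
typedef struct {
    const float *A;  /* n x n input, row-major */
    const float *B;  /* n x n input, row-major */
    float       *C;  /* n x n output, row-major */
    int          n;
} matmul_args_t;

/* Plain CPU kernel computing C = A * B. */
void cpu_matmul(const float *A, const float *B, float *C, int n) {
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < n; ++k)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
    }
}

/* Action-style wrapper: receives an opaque argument pointer and its size,
   validates them, and forwards to the kernel. Returns 0 on success. */
int matmul_action(const void *args, size_t args_size) {
    if (args == NULL || args_size != sizeof(matmul_args_t)) return -1;
    const matmul_args_t *arg = (const matmul_args_t *)args;
    assert(arg->n > 0);
    cpu_matmul(arg->A, arg->B, arg->C, arg->n);
    return 0;
}
```

Passing the arguments as an opaque pointer plus size keeps the wrapper independent of the kernel, so a GPU or DSP variant could accept the same bundle and be registered for the same job.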

Implementation

MTAPI Flow Chart

[Figure: MTAPI flow chart with the following elements.
Local node: START; initialize MTAPI environment; create job, action, queue, group; create task; wait for task; task complete? (yes: task end; no: switch to another task).
Worker team and scheduler: get a task from scheduler; process task; child task? (yes / no); task complete; switch to another task. Tasks are either local or remote.
Communication layer: sender and receiver, connecting local node and remote node.
Remote node: ready to receive task from neighbor nodes; receive task; process task; send task back to its origin node.]

Implementation

MTAPI Scheduling

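Example for scheduling MTAPI tasks in heterogeneous systems:

[Figure: A scheduler distributes tasks to two nodes. Node 0 (CPU): Core 0 runs worker thread 0 with queues Q00, Q01, Q02, and Core 1 runs worker thread 1 with queues Q10, Q11, Q12; the workers balance load by work stealing. Node 1 (DSP): Unit 0 runs bare metal with a single queue Q0 and receives tasks by work dealing.]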
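Work stealing among CPU workers can be sketched as follows. This is a deliberately simplified, single-threaded illustration with no synchronization; it is not the scheduler of EMB² or UH-MTAPI, and a real runtime would need atomic or lock-free queue operations. All names (`task_queue_t`, `queue_push`, `queue_pop`, `queue_steal`, `worker_next_task`) are hypothetical. The sketch assumes each worker takes its newest task from its own queue and, when idle, steals the oldest task from another worker's queue.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_CAPACITY 8  /* fixed size: no allocation during operation */

/* Bounded double-ended queue of task ids, stored in a ring buffer. */
typedef struct {
    int    items[QUEUE_CAPACITY];
    size_t head;   /* index of the oldest task */
    size_t count;  /* number of queued tasks */
} task_queue_t;

/* Owner side: append a task at the tail; returns 0 on success. */
int queue_push(task_queue_t *q, int task) {
    assert(q != NULL);
    if (q->count == QUEUE_CAPACITY) return -1;
    q->items[(q->head + q->count) % QUEUE_CAPACITY] = task;
    q->count++;
    return 0;
}

/* Owner side: take the most recently pushed task (tail). */
int queue_pop(task_queue_t *q, int *task) {
    assert(q != NULL && task != NULL);
    if (q->count == 0) return -1;
    q->count--;
    *task = q->items[(q->head + q->count) % QUEUE_CAPACITY];
    return 0;
}

/* Thief side: take the oldest task (head). */
int queue_steal(task_queue_t *q, int *task) {
    assert(q != NULL && task != NULL);
    if (q->count == 0) return -1;
    *task = q->items[q->head];
    q->head = (q->head + 1) % QUEUE_CAPACITY;
    q->count--;
    return 0;
}

/* A worker first serves its own queue and only steals when idle.
   Returns 0 if a task was obtained, -1 if both queues are empty. */
int worker_next_task(task_queue_t *own, task_queue_t *victim, int *task) {
    if (queue_pop(own, task) == 0) return 0;
    return queue_steal(victim, task);
}
```

Taking from opposite ends keeps owner and thief away from each other in the common case. Work dealing, as labelled for the DSP node, would instead have the sending side call something like `queue_push` on the target's queue, since a bare-metal unit may not be able to steal on its own.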

Performance Evaluation

MTAPI Implementations

Embedded Multicore Building Blocks (EMB²) [1]

Open source library and runtime platform for embedded multicore systems

Easy parallelization of existing code using high-level patterns

Real-time capability, resource awareness

Fine-grained control over core usage (task priorities, affinities)

Lock-/wait-free implementation

[Figure: EMB² layers, top to bottom: Application; Dataflow, Algorithms, Containers; Task management (MTAPI); Base library (abstraction layer); Operating system / hypervisor; Hardware.]

UH-MTAPI [2]

MTAPI implementation developed at the Universities of Houston / Delaware

Utilizes MCAPI for inter-node communication and MRAPI for resource management

Has been used as runtime system for OpenMP programs

[1] https://github.com/siemens/embb

[2] https://github.com/MCAPro2015/OpenMP_MCA_Project

Performance Evaluation

Testbed and Benchmarks

Reference platform: NVIDIA Jetson TK1 development kit

Tegra K1 SoC, which contains:

NVIDIA 4-Plus-1 Quad-Core ARM Cortex-A15 processor

Kepler GPU with 192 CUDA cores

Compiler: GCC 4.8.4, NVCC V6.5.30

Benchmarks:

Rodinia: Accelerating Compute-Intensive Applications with Accelerators [1]

Barcelona OpenMP Task Suite (BOTS) [2]

[1] https://www.cs.virginia.edu/~skadron/wiki/rodinia/index.php/Rodinia:Accelerating_Compute-Intensive_Applications_with_Accelerators

[2] https://pm.bsc.es/projects/bots

(Image source: Nvidia)

Performance Evaluation

SparseLU and Heartwall

MTAPI implementations and OpenMP perform comparably well

Heartwall benchmark does not scale linearly (memory bound)

[Figure: performance of the MTAPI implementations and OpenMP relative to the sequential implementation]

Performance Evaluation

Matrix Multiplication

[Figure: normalized execution times for UH-MTAPI and Siemens MTAPI (EMB²)]

MTAPI-ARM faster than MTAPI-GPU for small matrices due to overhead for data copying

MTAPI-GPU faster than MTAPI-ARM-GPU for larger matrices due to load imbalance

MTAPI-ARM-GPU-Opt always fastest due to asynchronous transfers and variable block sizes
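The load imbalance mentioned above arises when CPU and GPU receive equally sized blocks although they process them at different speeds. The slides do not say how the variable block sizes were chosen; the following is one plausible static scheme, shown as a minimal sketch. The type `row_split_t`, the function `split_rows`, and the parameter `gpu_speedup` (an assumed, separately measured ratio of GPU to CPU throughput) are hypothetical.

```c
#include <assert.h>

/* Row ranges [begin, end) assigned to each processing unit. */
typedef struct {
    int cpu_begin, cpu_end;
    int gpu_begin, gpu_end;
} row_split_t;

/* Split n matrix rows so that the GPU share is proportional to its
   relative speed. gpu_speedup is the number of rows the GPU completes
   in the time the CPU completes one (assumed to be measured elsewhere). */
row_split_t split_rows(int n, double gpu_speedup) {
    assert(n >= 0 && gpu_speedup >= 0.0);
    /* Round to the nearest whole row. */
    int gpu_rows = (int)(n * gpu_speedup / (1.0 + gpu_speedup) + 0.5);
    row_split_t s;
    s.cpu_begin = 0;
    s.cpu_end   = n - gpu_rows;
    s.gpu_begin = s.cpu_end;
    s.gpu_end   = n;
    return s;
}
```

With a ratio of 3, the GPU would receive three quarters of the rows, so both units finish at roughly the same time under the stated assumption. A split like this ignores transfer cost, which is why it would be combined with asynchronous transfers in practice.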

Summary and Outlook

Existing frameworks for parallel programming often not suitable for embedded systems

SW development for heterogeneous systems-on-a-chip (SoCs) challenging due to proprietary interfaces / tools

MTAPI provides standard API for leveraging task parallelism on embedded devices with multicore processors:

designed for homogeneous and heterogeneous systems

support for shared and distributed memory

can even be used bare metal (w/o OS)

may serve as a basis for higher level programming models

Experimental results show competitive performance

Outlook:

Improved scheduling algorithms for heterogeneous and real-time systems

Support for further accelerators such as DSPs and FPGAs

