MulticoreWare Heterogeneous Multicore Task Scheduler (HMTS)
– A Parallel Building Block Library
Wen-Mei Hwu, Lihua Zhang, Peng Gao, Qisheng Yang, Jigui Ma, Hui Huang
6/15/11 Copyright (C) 2011 and Confidential,
MulticoreWare Inc
Agenda
• What is HMTS?
• Why HMTS?
• HMTS Overview
• HMTS Parallel_for
• HMTS Pipeline
• HMTS Grid
• Suggestions for users
• Ongoing development
• Q&A
What is HMTS?
• A user-level library for creating task-based multithreaded/multicore applications and dynamically balancing the workload across the whole heterogeneous system
– Scheduling is based on task dependencies and priorities
– The goal is to help developers exploit the maximum capability of the system
• It simplifies the way developers design task-based applications
– With HMTS, developers only need to think about their algorithms in terms of parallel tasks and dependencies; the mapping of these tasks onto the heterogeneous computing devices in the system is handled by HMTS
– HMTS automatically handles new hardware configurations without modification to the user code
What is HMTS?
• It provides interfaces for popular parallelism patterns, such as parallel_for and pipeline
• It integrates with other MulticoreWare tools (e.g. PPA & GMAC) to provide a comprehensive development toolset for heterogeneous computing
• It targets both discrete CPU/GPU and APU-accelerated systems, for OpenCL and non-OpenCL applications
– Users will be able to configure HMTS to optimize the system for performance, power, or balanced profiles
Why HMTS? – Challenges of Heterogeneous Computing
• Hard to achieve optimal performance without a proper scheduling strategy and trade-offs
• Hard to make the complexity of the system the application runs on transparent to developers
• Hard to implement and optimize the scheduling of an arbitrary number of synchronous and asynchronous parallel/sequential tasks over a complex system with an arbitrary number of devices and cores
• Hard to dynamically balance the workload on the fly based on all available information about the computing tasks in the application
Why HMTS? – Benefits & Key Features of HMTS
• Provides popular parallelism patterns in both data-parallel and task-parallel forms
– TBB-like interfaces let developers quickly port their code
– HMTS extensions enable heterogeneous computing:
• Grid and grid partitioning across multiple OpenCL devices
• Pipeline filters supported on both CPUs and GPUs
• A "super device" concept makes the underlying system transparent to developers
– Porting or upgrading hardware requires no code modifications
• An adaptive algorithm regulates the scheduling strategy at runtime, based on dynamic instrumentation of the computing workload on each device/core
Why HMTS? – Benefits & Key Features of HMTS
• Automatic trade-off between system performance and energy efficiency
– HMTS decides whether a device is used or not, depending on the scheduling policy chosen by the user
• A non-blocking task scheduling mechanism minimizes CPU overhead
• Rich functional modules:
– Workload manager and balancing algorithm
– Optimized partitioners for data-parallel tasks
– Seamless integration with PPA
• Multiple applications can share workload information and resources, so that all the available HPC power of a heterogeneous system can be balanced and fully utilized
HMTS Overview
• The task is the essential concept and the unit of work operated on by HMTS
– Task-level parallelization is based purely on task dependencies
– A low-overhead callback mechanism is used internally to manage task dependencies and help reduce the central scheduler's workload
– Some task scheduling, especially data-parallel operations such as parallel_for() partitioning, is recursively distributed across cores on the fly, which further reduces the load on the central scheduler
• There are three major types of tasks, all derived from the TaskBase class
• User tasks that inherit directly from the three types above are called simple tasks, which HMTS can schedule directly according to their dependencies. Code sample for simple tasks:

Task taskA, taskB, taskC, taskD, taskE, taskF, taskG, taskH, taskI;
SubmitTask(&taskA);
taskB.SetPreDependency(&taskA);
SubmitTask(&taskB);
taskC.SetPreDependency(&taskA);
SubmitTask(&taskC);
taskD.SetPreDependency(&taskA);
SubmitTask(&taskD);
taskE.SetPreDependency(&taskA);
SubmitTask(&taskE);
taskF.SetPreDependency(&taskB);
SubmitTask(&taskF);
WaitForTaskComplete(&taskF);
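The dependency mechanism sketched above (predecessors gate successors, and completing a task releases the successors that depend on it) can be modeled in a few lines of plain C++. This is an illustrative single-threaded sketch, not the HMTS implementation; `Task`, `SetPreDependency`, and `RunAll` here are simplified stand-ins for the real APIs.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <vector>

// Each task counts its unfinished predecessors; completing a task
// decrements its successors' counts and enqueues any that reach zero.
struct Task {
    std::function<void()> body;
    int unmet_deps = 0;
    std::vector<Task*> successors;
};

inline void SetPreDependency(Task& t, Task& pre) {
    t.unmet_deps++;
    pre.successors.push_back(&t);
}

inline void RunAll(std::vector<Task*> tasks) {
    std::queue<Task*> ready;
    for (Task* t : tasks)
        if (t->unmet_deps == 0) ready.push(t);   // roots run first
    while (!ready.empty()) {
        Task* t = ready.front(); ready.pop();
        t->body();
        for (Task* s : t->successors)            // "callback" on completion
            if (--s->unmet_deps == 0)            // last predecessor done:
                ready.push(s);                   // successor becomes ready
    }
}
```

A real scheduler would pop ready tasks from multiple worker threads; the counting logic stays the same.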
HMTS parallel_for
• Data-level parallelism: the iteration range is partitioned by grain size
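Grain-size partitioning can be illustrated with a short sketch: a range is split in half recursively until each piece is no larger than the grain size, and each leaf then becomes one schedulable task. `Range` and `Split` below are hypothetical names for illustration, not HMTS types.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A half-open 1-D range [begin, end) with a partition threshold.
struct Range { std::size_t begin, end, grain; };

// Recursively bisect until each subrange is no larger than the grain
// size; each leaf would be handed to the scheduler as one task.
inline void Split(const Range& r, std::vector<Range>& leaves) {
    if (r.end - r.begin <= r.grain) {   // small enough: one task
        leaves.push_back(r);
        return;
    }
    std::size_t mid = r.begin + (r.end - r.begin) / 2;
    Split({r.begin, mid, r.grain}, leaves);
    Split({mid, r.end, r.grain}, leaves);
}
```

For the Mandelbrot demo's settings (width 1280, row grain size 4), this yields subranges of at most 4 rows each, covering the whole image in order.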
Mandelbrot by HMTS parallel_for
The Mandelbrot set M is defined by a family of complex quadratic polynomials given by

  P_c(z) = z^2 + c

where c is a complex parameter. For each c, one considers the behavior of the sequence

  0, P_c(0), P_c(P_c(0)), ...

obtained by iterating P_c(z) starting at the critical point z = 0, which either escapes to infinity or stays within a disk of some finite radius. The Mandelbrot set is defined as the set of all points c such that the above sequence does not escape to infinity.
The top-right figure shows the Mandelbrot set rendered using parallel_for from the HMTS library. The width of the figure is 1280 and the row grain size is 4: parallel_for recursively splits the range into subranges until each subrange is smaller than 4, and then invokes the HMTS Mandelbrot function on it. All of these tasks are scheduled dynamically for load balancing.
range2d<size_t> r(0,WIDTH,xgrainsize,0,HEIGHT,ygrainsize);
//tm parallel
TaskBase * tp = parallel_for(r,&(tmMandelbrot));
WaitForTaskComplete(tp);
where

void tmMandelbrot(const range2d<size_t> &r2d)
{
    COMPLEX z, c;
    for (size_t x = r2d.rows().begin(); x != r2d.rows().end(); ++x)
    {
        // ... per-pixel Mandelbrot iteration (elided in the slide)
    }
}
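The per-point computation behind the demo is the classic escape-time iteration. The standalone function below (`EscapeCount` is an illustrative name, not part of HMTS) counts how many steps the sequence z_{n+1} = z_n^2 + c, starting from z_0 = 0, stays within |z| <= 2; points that never escape within the iteration budget are treated as members of the set.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>

// Escape-time kernel for one point c: returns the iteration count at
// which |z| first exceeds 2, or max_iter if it never does.
inline std::size_t EscapeCount(std::complex<double> c, std::size_t max_iter) {
    std::complex<double> z(0.0, 0.0);                // critical point z = 0
    for (std::size_t n = 0; n < max_iter; ++n) {
        if (std::abs(z) > 2.0) return n;             // escaped: c outside M
        z = z * z + c;                               // z_{n+1} = z_n^2 + c
    }
    return max_iter;                                 // treated as inside M
}
```

A parallel_for body such as tmMandelbrot would apply this per pixel, mapping each (x, y) in its subrange to a value of c.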
HMTS Pipeline
• Each filter is an HMTS task, scheduled by the HMTS task scheduler
• Adaptive token control according to current system performance/load
• Provides blocking and non-blocking APIs for checking the completion of a filter
[Diagram: a three-filter pipeline (Filter1 serial, Filter2 parallel, Filter3 serial). Data flows through sequential Itasks, then Ttasks (which can run in parallel), then sequential Otasks, all coordinated by the pipeline scheduler.]
• Code sample for a simple pipeline with CPU filters only:
void RunTest_TM(bool firstFilter, bool secondFilter, bool thirdFilter, PointAndColor* dataSrc)
{
MyFirstFilter filter1(firstFilter, dataSrc);
MySecondFilter filter2(secondFilter);
MyThirdFilter filter3(thirdFilter);
Pipeline pipeline("process_image");
pipeline.AddFilter(filter1);
pipeline.AddFilter(filter2);
pipeline.AddFilter(filter3);
pipeline.Run(MyFirstFilter::n_buffer);
}
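The token mechanism mentioned above can be pictured with a minimal single-threaded model: tokens cap how many items may be in flight at once, which bounds the buffering between filters. `MiniPipeline` below is an illustrative sketch only, not the HMTS pipeline (which runs filters concurrently and adapts the token count at runtime).

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <vector>

// A toy pipeline over int items: at most `tokens` items are admitted
// into the in-flight queue at once (tokens must be >= 1); each item
// then passes through every filter in order.
struct MiniPipeline {
    std::vector<std::function<int(int)>> filters;
    std::size_t tokens;

    std::vector<int> Run(const std::vector<int>& input) {
        std::vector<int> out;
        std::deque<int> in_flight;
        std::size_t next = 0;
        while (next < input.size() || !in_flight.empty()) {
            // admit new items only while tokens are available
            while (next < input.size() && in_flight.size() < tokens)
                in_flight.push_back(input[next++]);
            int item = in_flight.front();
            in_flight.pop_front();
            for (auto& f : filters) item = f(item);  // apply each stage
            out.push_back(item);                     // token is released
        }
        return out;
    }
};
```

With concurrent filters, a larger token count raises throughput until the slowest serial filter saturates, which is why HMTS tunes it dynamically rather than trusting a user-supplied constant.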
A complicated pipeline example – Viewdle
[Diagram: the Viewdle face-detection pipeline. A serial filter loads a batch of images from Face.dat into memory; a parallel Detect filter runs on each batch in either C++ code or OpenCL code; a serial Process filter then extracts characteristics based on the algorithm.]
• Based on the TBB pipeline and highly optimized to utilize both CPU & GPU
• Filter 1 (serial): c_CreateImagesFilter : public tbb::filter
– Fills in a batch of 4 images from the image files on disk
• Filter 2 (parallel): c_DetectFilter : public tbb::filter
– Runs the appropriate detector (either the CPU C routine or OpenCL on the GPU)
void* c_DetectFilter::operator() (void *x_item)
{
    if (detect_mode == multi)
    {
        if (/* got an OpenCL device */)
        {
            detectImageBatchOpencl(…);
        }
        else
        {
            detectImageBatchCcode(…);
        }
    }
    if (detect_mode == gpuOcl)
    {
        // spin until a device is acquired
        while (!tryGetADevice())
            ;
        detectImageBatchOpencl(…);
    }
}
• Filter 3 (serial): c_ProcessRects : public tbb::filter
– Processes the detection results
• Note: to be able to run TBB filters on multiple devices, Viewdle has to implement a complicated OpenCL device-management module to manage all the devices and their running environments
Original Viewdle
• Parallel filter 2: c_DetectFilter : public TM::Filter
– Runs the appropriate detector (the CPU C routine or OpenCL on the GPU); HMTS chooses which overload to invoke
void* c_DetectFilter::operator() (void *x_item)
{
detectImageBatchCcode(…);
}
void* c_DetectFilter::operator() (void* x_item, TM::OpenclEnviron myEnviron)
{
detectImageBatchOpencl(…, myEnviron);
}
Viewdle based on HMTS
facedetect app from AMD (detect-mode: gpuOcl)

compute device                 | tokens | distribution of batches | detect images/sec
6 cores : 1 Barts              | 15     | 20 : 30                 | 8.81
                               | 10     | 25 : 25                 | 7.01
                               | 1      | 0 : 50                  | 6.95
6 cores : 1 Barts : 1 RedWood  | 15     | 11 : 28 : 11            | 11.18
                               | 12     | 15 : 24 : 11            | 8.66
                               | 2      | 0 : 34 : 17             | 7.52
4 cores : 1 Barts              | 12     | 11 : 39                 | 8.9
                               | 10     | 14 : 36                 | 7.13
                               | 1      | 0 : 50                  | 6.95
4 cores : 1 Barts : 1 RedWood  | 12     | 8 : 30 : 12             | 8.76
                               | 10     | 9 : 28 : 11             | 7.12
                               | 2      | 0 : 33 : 17             | 7.52

facedetect app with TM (detect-mode: multi)

compute device                 | initial tokens | distribution of batches | detect images/sec
6 cores : 1 Barts              | 15             | 20 : 30                 | 9.28
                               | 6              | 20 : 30                 | 9.13
6 cores : 1 Barts : 1 RedWood  | 15             | 12 : 27 : 11            | 11.45
                               | 6              | 12 : 27 : 11            | 11.1
4 cores : 1 Barts              | 12             | 11 : 39                 | 9.22
                               | 6              | 11 : 39                 | 9.21
4 cores : 1 Barts : 1 RedWood  | 12             | 8 : 30 : 12             | 9.13
                               | 6              | 8 : 30 : 12             | 9.18
Results & Conclusions
• The HMTS version is consistently faster (by up to 30%) than the original Viewdle, and, unlike the TBB version, it is not sensitive to the user-supplied number of tokens
– This is because the HMTS pipeline can dynamically adjust the number of tokens to achieve the best system utilization
• In addition, the HMTS pipeline significantly simplifies coding for heterogeneous computing
– HMTS has a multi-filter implementation in which the user implements two virtual functions, one for C and one for OpenCL. The user does not need to write any device detection, management, or scheduling code; HMTS automatically calls one of the two functions to achieve optimal load balancing
• Similar to how parallel_for partitions CPU tasks, Grid provides the capability to partition kernel tasks across multiple OpenCL devices
– All available OpenCL devices in the system are viewed as one super device
– HMTS dynamically partitions a grid task's workload across the super device according to each device's computing capability, and balances the workload on the fly
– A device's computing capability can be instrumented at runtime and used to guide further partitioning
– Users write OpenCL code as normal, except that the global work size and data range are supplied by HMTS at runtime
class Grid
{
private:
    // Total number of work-items for the kernel task
    cl::NDRange ndrNumElements;
    // Work-group size, specified by the user and constrained by the device
    cl::NDRange ndrLocalWorkSize;
    // Partition threshold
    cl::NDRange ndrGrainSize;
    // ... (remaining members elided in the slide)
};
HMTS Grid
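The capability-proportional partitioning idea can be sketched as follows: given a throughput score per device (which HMTS would refine from profiled runs), split the global work size into per-device shares. `PartitionBySpeed` is a hypothetical helper for illustration, not an HMTS API, and assumes at least one device.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Split `global_size` work-items across devices in proportion to each
// device's measured speed; the remainder goes to the last device so the
// shares always sum to the global size exactly.
inline std::vector<std::size_t>
PartitionBySpeed(std::size_t global_size, const std::vector<double>& speed) {
    double total = 0.0;
    for (double s : speed) total += s;
    std::vector<std::size_t> share(speed.size(), 0);
    std::size_t assigned = 0;
    for (std::size_t i = 0; i + 1 < speed.size(); ++i) {
        share[i] = static_cast<std::size_t>(global_size * (speed[i] / total));
        assigned += share[i];
    }
    share.back() = global_size - assigned;  // remainder to the last device
    return share;
}
```

Re-running this after each profiled execution, with updated speed scores, is one simple way to converge toward the balanced split seen in the 2nd-run numbers below.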
1 Grid, 2 GPUs Sample

                    | 1st run          | 2nd run         | non-partition
Execution time      | 822              | 387             | 478
vs. non-partition   | 822/478 = 171.9% | 387/478 = 80.9% | 100%

Performance improvement (2nd run vs. non-partition): 19.1%
• The first run is slow due to clBuildProgram calls; later runs need no rebuild, since HMTS can cache the built program
• The 2nd run achieves optimal load balancing thanks to dynamic compute-capability instrumentation based on real task profiling from the 1st run!
• Minimize the use of blocking wait APIs such as clFinish() inside task::clRun(), so CPU threads can be freed up as soon as possible
– If a blocking wait must be used, the task can be split into two tasks at the wait point; provided the dependency is set properly, HMTS guarantees the 2nd task will not start until the 1st has completed
• In the main program, one can still explicitly wait for task completion with the HMTS API WaitForTaskComplete(task). Alternatively, the user can call RegisteCallBack(t_uint eventid, pfc_task_event_cb cb, void* param) to register a callback that is invoked as soon as the task completes
• Give proper task workload hints (linear/non-linear, heavy or light load size, etc.) to help HMTS choose the right scheduling strategy from the very beginning, even though HMTS is capable of dynamic load balancing
Suggestions for users
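The "split at the wait point" advice can be sketched in C++: rather than one task that blocks mid-way, the work before and after the wait become two pieces, and completion of the asynchronous operation triggers the second piece via a callback. `TwoPhaseTask` and its members are illustrative names only, not HMTS APIs.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Model of a task split at a blocking wait: `before_wait` runs first,
// then `async_op` (e.g. an enqueued kernel with an event callback) is
// handed a continuation that runs `after_wait` when the result is
// ready, so no CPU thread is parked on a blocking wait in between.
struct TwoPhaseTask {
    std::function<void()> before_wait;                    // work up to the wait
    std::function<void()> after_wait;                     // work after the result
    std::function<void(std::function<void()>)> async_op;  // invokes continuation

    void Run() {
        before_wait();
        async_op([this] { after_wait(); });  // completion fires phase 2
    }
};
```

In HMTS terms, the two phases would be two tasks with a dependency between them, so the scheduler can run other work on the freed CPU thread in the meantime.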
• GMAC integration to automate data movements
• PPA integration to visualize dynamic task scheduling & monitor overall system status on the fly
• Other data-parallelism patterns, such as parallel_do
Ongoing development
• Visit http://www.multicorewareinc.com to request access to the closed beta
• Contact email: [email protected]
Disclaimer & Attribution
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited
to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product
differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no
obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to
make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.
NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO
RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS
INFORMATION.
ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY
DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL
OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF
EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in
this presentation are for informational purposes only and may be trademarks of their respective owners.
The contents of this presentation were provided by individual(s) and/or company listed on the title page. The information and
opinions presented in this presentation may not represent AMD’s positions, strategies or opinions. Unless explicitly stated, AMD is
not responsible for the content herein and no endorsements are implied.