MulticoreWare Heterogeneous Multicore Task Scheduler (HMTS)
– A Parallel Building Block Library
Wen-Mei Hwu, Lihua Zhang, Peng Gao, Qisheng Yang, Jigui Ma, Hui Huang
6/15/11 Copyright (C) 2011 and Confidential,
MulticoreWare Inc
Agenda
• What is HMTS?
• Why HMTS?
• HMTS Overview
• HMTS Parallel_for
• HMTS Pipeline
• HMTS Grid
• Suggestions for users
• Ongoing development
• Q&A
What is HMTS?
• A user-level library for creating task-based multithreaded/multicore applications and dynamically balancing the workload across the whole heterogeneous system
– Scheduling is based on task dependencies and priorities
– The goal is to help developers exploit the maximum capability of the system
• It simplifies the way developers design task-based applications
– With HMTS, developers only need to think about their algorithms in terms of parallel tasks and dependencies; the mapping of these tasks onto the heterogeneous computing devices in the system is handled by HMTS
– HMTS automatically handles new hardware configurations without modification to the user code
What is HMTS?
• It provides interfaces for popular parallelism patterns, such as parallel_for and pipeline
• It integrates with other MulticoreWare tools (e.g. PPA & GMAC) to provide a comprehensive development toolset for heterogeneous computing
• It targets both discrete CPU/GPU and APU-accelerated systems, for OpenCL and non-OpenCL applications
– Users will be able to configure HMTS to optimize the system for performance, power, or balanced profiles
Why HMTS? – Challenges of Heterogeneous Computing
• Hard to achieve optimal performance without a proper scheduling strategy and trade-offs
• Hard to make the complexity of the system the application runs on transparent to developers
• Hard to implement and optimize the scheduling of an arbitrary number of synchronous and asynchronous parallel/sequential tasks over a complex system with an arbitrary number of devices and cores
• Hard to dynamically balance the workload on the fly based on all available information about the computing tasks in the application
Why HMTS? – Benefits & Key Features of HMTS
• Provides popular parallelism patterns in both data-parallel and task-parallel forms
– TBB-like interfaces let developers quickly port their code
– HMTS extensions enable heterogeneous computing:
• Grid and grid partitioning across multiple OpenCL devices
• Pipeline filters supported on both CPUs and GPUs
• A "super device" concept makes the underlying system transparent to developers
– Porting or upgrading hardware requires no code modifications
• An adaptive algorithm regulates the scheduling strategy at runtime, based on dynamic instrumentation of the computing workload on each device/core
Why HMTS? – Benefits & Key Features of HMTS
• Automatic trade-off between system performance and energy efficiency
– HMTS decides whether a device is used or not, depending on the scheduling policy chosen by the user
• A non-blocking task scheduling mechanism minimizes CPU overhead
• Rich functional modules:
– Workload manager and balancing algorithm
– Optimized partitioners for data-parallel tasks
– Seamless integration with PPA
• Multiple applications can share workload information and resources, so that all the available HPC power of a heterogeneous system can be balanced and fully utilized
HMTS Overview
• The task is the essential concept and the unit of work operated on by HMTS
– Task-level parallelization is based purely on task dependencies
– A low-overhead callback mechanism is used internally to manage task dependencies and help reduce the central scheduler's workload
– Some task scheduling, especially data-parallel operations such as parallel_for() partitioning, is recursively distributed across cores on the fly, which further reduces the load on the central scheduler
• There are three major types of tasks, all derived from the TaskBase class
• User tasks that inherit directly from the three types above are called simple tasks, which HMTS can schedule directly according to their dependencies. Code sample for simple tasks:

Task taskA, taskB, taskC, taskD, taskE, taskF, taskG, taskH, taskI;
SubmitTask(&taskA);
taskB.SetPreDependency(&taskA);
SubmitTask(&taskB);
taskC.SetPreDependency(&taskA);
SubmitTask(&taskC);
taskD.SetPreDependency(&taskA);
SubmitTask(&taskD);
taskE.SetPreDependency(&taskA);
SubmitTask(&taskE);
taskF.SetPreDependency(&taskB);
SubmitTask(&taskF);
WaitForTaskComplete(&taskF);
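The dependency mechanism sketched above (predecessors gate successors, and completing a task releases the successors that depend on it) can be modeled in a few lines of plain C++. This is an illustrative single-threaded sketch, not the HMTS implementation; `Task`, `SetPreDependency`, and `RunAll` here are simplified stand-ins for the real APIs.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <vector>

// Each task counts its unfinished predecessors; completing a task
// decrements its successors' counts and enqueues any that reach zero.
struct Task {
    std::function<void()> body;
    int unmet_deps = 0;
    std::vector<Task*> successors;
};

inline void SetPreDependency(Task& t, Task& pre) {
    t.unmet_deps++;
    pre.successors.push_back(&t);
}

inline void RunAll(std::vector<Task*> tasks) {
    std::queue<Task*> ready;
    for (Task* t : tasks)
        if (t->unmet_deps == 0) ready.push(t);   // roots run first
    while (!ready.empty()) {
        Task* t = ready.front(); ready.pop();
        t->body();
        for (Task* s : t->successors)            // "callback" on completion
            if (--s->unmet_deps == 0)            // last predecessor done:
                ready.push(s);                   // successor becomes ready
    }
}
```

A real scheduler would pop ready tasks from multiple worker threads; the counting logic stays the same.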
HMTS parallel_for
• Data-level parallelism: the iteration range is partitioned by grain size
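Grain-size partitioning can be illustrated with a short sketch: a range is split in half recursively until each piece is no larger than the grain size, and each leaf then becomes one schedulable task. `Range` and `Split` below are hypothetical names for illustration, not HMTS types.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A half-open 1-D range [begin, end) with a partition threshold.
struct Range { std::size_t begin, end, grain; };

// Recursively bisect until each subrange is no larger than the grain
// size; each leaf would be handed to the scheduler as one task.
inline void Split(const Range& r, std::vector<Range>& leaves) {
    if (r.end - r.begin <= r.grain) {   // small enough: one task
        leaves.push_back(r);
        return;
    }
    std::size_t mid = r.begin + (r.end - r.begin) / 2;
    Split({r.begin, mid, r.grain}, leaves);
    Split({mid, r.end, r.grain}, leaves);
}
```

For the Mandelbrot demo's settings (width 1280, row grain size 4), this yields subranges of at most 4 rows each, covering the whole image in order.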
Mandelbrot by HMTS parallel_for
The Mandelbrot set M is defined by a family of complex quadratic polynomials given by

  P_c(z) = z^2 + c

where c is a complex parameter. For each c, one considers the behavior of the sequence

  0, P_c(0), P_c(P_c(0)), ...

obtained by iterating P_c(z) starting at the critical point z = 0, which either escapes to infinity or stays within a disk of some finite radius. The Mandelbrot set is defined as the set of all points c such that the above sequence does not escape to infinity.
The top-right figure shows the Mandelbrot set rendered using parallel_for from the HMTS library. The width of the figure is 1280 and the row grain size is 4: parallel_for recursively splits the range into subranges until each subrange is smaller than 4, and then invokes the HMTS Mandelbrot function on it. All of these tasks are scheduled dynamically for load balancing.
range2d<size_t> r(0,WIDTH,xgrainsize,0,HEIGHT,ygrainsize);
//tm parallel
TaskBase * tp = parallel_for(r,&(tmMandelbrot));
WaitForTaskComplete(tp);
where

void tmMandelbrot(const range2d<size_t> &r2d)
{
    COMPLEX z, c;
    for (size_t x = r2d.rows().begin(); x != r2d.rows().end(); ++x)
    {
        // ... per-pixel Mandelbrot iteration (elided in the slide)
    }
}
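The per-point computation behind the demo is the classic escape-time iteration. The standalone function below (`EscapeCount` is an illustrative name, not part of HMTS) counts how many steps the sequence z_{n+1} = z_n^2 + c, starting from z_0 = 0, stays within |z| <= 2; points that never escape within the iteration budget are treated as members of the set.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>

// Escape-time kernel for one point c: returns the iteration count at
// which |z| first exceeds 2, or max_iter if it never does.
inline std::size_t EscapeCount(std::complex<double> c, std::size_t max_iter) {
    std::complex<double> z(0.0, 0.0);                // critical point z = 0
    for (std::size_t n = 0; n < max_iter; ++n) {
        if (std::abs(z) > 2.0) return n;             // escaped: c outside M
        z = z * z + c;                               // z_{n+1} = z_n^2 + c
    }
    return max_iter;                                 // treated as inside M
}
```

A parallel_for body such as tmMandelbrot would apply this per pixel, mapping each (x, y) in its subrange to a value of c.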
HMTS Pipeline
• Each filter is an HMTS task, scheduled by the HMTS task scheduler
• Adaptive token control according to current system performance/load
• Provides blocking and non-blocking APIs for checking the completion of a filter
[Diagram: a three-filter pipeline (Filter1 serial, Filter2 parallel, Filter3 serial). Data flows through sequential Itasks, then Ttasks (which can run in parallel), then sequential Otasks, all coordinated by the pipeline scheduler.]
• Code sample for a simple pipeline with CPU filters only:
void RunTest_TM(bool firstFilter, bool secondFilter, bool thirdFilter, PointAndColor* dataSrc)
{
MyFirstFilter filter1(firstFilter, dataSrc);
MySecondFilter filter2(secondFilter);
MyThirdFilter filter3(thirdFilter);
Pipeline pipeline("process_image");
pipeline.AddFilter(filter1);
pipeline.AddFilter(filter2);
pipeline.AddFilter(filter3);
pipeline.Run(MyFirstFilter::n_buffer);
}
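The token mechanism mentioned above can be pictured with a minimal single-threaded model: tokens cap how many items may be in flight at once, which bounds the buffering between filters. `MiniPipeline` below is an illustrative sketch only, not the HMTS pipeline (which runs filters concurrently and adapts the token count at runtime).

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <vector>

// A toy pipeline over int items: at most `tokens` items are admitted
// into the in-flight queue at once (tokens must be >= 1); each item
// then passes through every filter in order.
struct MiniPipeline {
    std::vector<std::function<int(int)>> filters;
    std::size_t tokens;

    std::vector<int> Run(const std::vector<int>& input) {
        std::vector<int> out;
        std::deque<int> in_flight;
        std::size_t next = 0;
        while (next < input.size() || !in_flight.empty()) {
            // admit new items only while tokens are available
            while (next < input.size() && in_flight.size() < tokens)
                in_flight.push_back(input[next++]);
            int item = in_flight.front();
            in_flight.pop_front();
            for (auto& f : filters) item = f(item);  // apply each stage
            out.push_back(item);                     // token is released
        }
        return out;
    }
};
```

With concurrent filters, a larger token count raises throughput until the slowest serial filter saturates, which is why HMTS tunes it dynamically rather than trusting a user-supplied constant.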
A complicated pipeline example – Viewdle
[Diagram: the Viewdle face-detection pipeline. A serial filter loads a batch of images from Face.dat into memory; a parallel Detect filter runs on each batch in either C++ code or OpenCL code; a serial Process filter then extracts characteristics based on the algorithm.]
• Based on the TBB pipeline and highly optimized to utilize both CPU & GPU
• Filter 1 (serial): c_CreateImagesFilter : public tbb::filter
– Fills in a batch of 4 images from the image files on disk
• Filter 2 (parallel): c_DetectFilter : public tbb::filter
– Runs the appropriate detector (either the CPU C routine or OpenCL on the GPU)
void* c_DetectFilter::operator() (void *x_item)
{
    if (detect_mode == multi)
    {
        if (/* got an OpenCL device */)
        {
            detectImageBatchOpencl(…);
        }
        else
        {
            detectImageBatchCcode(…);
        }
    }
    if (detect_mode == gpuOcl)
    {
        // spin until a device is acquired
        while (!tryGetADevice())
            ;
        detectImageBatchOpencl(…);
    }
}
• Filter 3 (serial): c_ProcessRects : public tbb::filter
– Processes the detection results
• Note: to be able to run TBB filters on multiple devices, Viewdle has to implement a complicated OpenCL device-management module to manage all the devices and their running environments
Original Viewdle
• Parallel filter 2: c_DetectFilter : public TM::Filter
– Runs the appropriate detector (the CPU C routine or OpenCL on the GPU); HMTS chooses which overload to invoke
void* c_DetectFilter::operator() (void *x_item)
{
detectImageBatchCcode(…);
}
void* c_DetectFilter::operator() (void* x_item, TM::OpenclEnviron myEnviron)
{
detectImageBatchOpencl(…, myEnviron);
}
Viewdle based on HMTS
facedetect app from AMD (detect-mode: gpuOcl)

compute device                 | tokens | distribution of batches | detect images/sec
6 cores : 1 Barts              | 15     | 20 : 30                 | 8.81
                               | 10     | 25 : 25                 | 7.01
                               | 1      | 0 : 50                  | 6.95
6 cores : 1 Barts : 1 RedWood  | 15     | 11 : 28 : 11            | 11.18
                               | 12     | 15 : 24 : 11            | 8.66
                               | 2      | 0 : 34 : 17             | 7.52
4 cores : 1 Barts              | 12     | 11 : 39                 | 8.9
                               | 10     | 14 : 36                 | 7.13
                               | 1      | 0 : 50                  | 6.95
4 cores : 1 Barts : 1 RedWood  | 12     | 8 : 30 : 12             | 8.76
                               | 10     | 9 : 28 : 11             | 7.12
                               | 2      | 0 : 33 : 17             | 7.52

facedetect app with TM (detect-mode: multi)

compute device                 | initial tokens | distribution of batches | detect images/sec
6 cores : 1 Barts              | 15             | 20 : 30                 | 9.28
                               | 6              | 20 : 30                 | 9.13
6 cores : 1 Barts : 1 RedWood  | 15             | 12 : 27 : 11            | 11.45
                               | 6              | 12 : 27 : 11            | 11.1
4 cores : 1 Barts              | 12             | 11 : 39                 | 9.22
                               | 6              | 11 : 39                 | 9.21
4 cores : 1 Barts : 1 RedWood  | 12             | 8 : 30 : 12             | 9.13
                               | 6              | 8 : 30 : 12             | 9.18
Results & Conclusions
• The HMTS version is consistently faster (by up to 30%) than the original Viewdle, and, unlike the TBB version, it is not sensitive to the user-supplied number of tokens
– This is because the HMTS pipeline can dynamically adjust the number of tokens to achieve the best system utilization
• In addition, the HMTS pipeline significantly simplifies coding for heterogeneous computing
– HMTS has a multi-filter implementation in which the user implements two virtual functions, one for C and one for OpenCL. The user does not need to write any device detection, management, or scheduling code; HMTS automatically calls one of the two functions to achieve optimal load balancing
• Similar to how parallel_for partitions CPU tasks, Grid provides the capability to partition kernel tasks across multiple OpenCL devices
– All available OpenCL devices in the system are viewed as one super device
– HMTS dynamically partitions a grid task's workload across the super device according to each device's computing capability, and balances the workload on the fly
– A device's computing capability can be instrumented at runtime and used to guide further partitioning
– Users write OpenCL code as normal, except that the global work size and data range are supplied by HMTS at runtime
class Grid
{
private:
    // Total number of work-items for the kernel task
    cl::NDRange ndrNumElements;
    // Work-group size, specified by the user and constrained by the device
    cl::NDRange ndrLocalWorkSize;
    // Partition threshold
    cl::NDRange ndrGrainSize;
    // ... (remaining members elided in the slide)
};
HMTS Grid
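The capability-proportional partitioning idea can be sketched as follows: given a throughput score per device (which HMTS would refine from profiled runs), split the global work size into per-device shares. `PartitionBySpeed` is a hypothetical helper for illustration, not an HMTS API, and assumes at least one device.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Split `global_size` work-items across devices in proportion to each
// device's measured speed; the remainder goes to the last device so the
// shares always sum to the global size exactly.
inline std::vector<std::size_t>
PartitionBySpeed(std::size_t global_size, const std::vector<double>& speed) {
    double total = 0.0;
    for (double s : speed) total += s;
    std::vector<std::size_t> share(speed.size(), 0);
    std::size_t assigned = 0;
    for (std::size_t i = 0; i + 1 < speed.size(); ++i) {
        share[i] = static_cast<std::size_t>(global_size * (speed[i] / total));
        assigned += share[i];
    }
    share.back() = global_size - assigned;  // remainder to the last device
    return share;
}
```

Re-running this after each profiled execution, with updated speed scores, is one simple way to converge toward the balanced split seen in the 2nd-run numbers below.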
1 Grid, 2 GPUs Sample

                    | 1st run          | 2nd run         | non-partition
Execution time      | 822              | 387             | 478
vs. non-partition   | 822/478 = 171.9% | 387/478 = 80.9% | 100%

Performance improvement (2nd run vs. non-partition): 19.1%
• The first run is slow due to clBuildProgram calls; later runs need no rebuild, since HMTS can cache the built program
• The 2nd run achieves optimal load balancing thanks to dynamic compute-capability instrumentation based on real task profiling from the 1st run!
• Minimize the use of blocking wait APIs such as clFinish() inside task::clRun(), so CPU threads can be freed up as soon as possible
– If a blocking wait must be used, the task can be split into two tasks at the wait point; provided the dependency is set properly, HMTS guarantees the 2nd task will not start until the 1st has completed
• In the main program, one can still explicitly wait for task completion with the HMTS API WaitForTaskComplete(task). Alternatively, the user can call RegisteCallBack(t_uint eventid, pfc_task_event_cb cb, void* param) to register a callback that is invoked as soon as the task completes
• Give proper task workload hints (linear/non-linear, heavy or light load size, etc.) to help HMTS choose the right scheduling strategy from the very beginning, even though HMTS is capable of dynamic load balancing
Suggestions for users
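The "split at the wait point" advice can be sketched in C++: rather than one task that blocks mid-way, the work before and after the wait become two pieces, and completion of the asynchronous operation triggers the second piece via a callback. `TwoPhaseTask` and its members are illustrative names only, not HMTS APIs.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Model of a task split at a blocking wait: `before_wait` runs first,
// then `async_op` (e.g. an enqueued kernel with an event callback) is
// handed a continuation that runs `after_wait` when the result is
// ready, so no CPU thread is parked on a blocking wait in between.
struct TwoPhaseTask {
    std::function<void()> before_wait;                    // work up to the wait
    std::function<void()> after_wait;                     // work after the result
    std::function<void(std::function<void()>)> async_op;  // invokes continuation

    void Run() {
        before_wait();
        async_op([this] { after_wait(); });  // completion fires phase 2
    }
};
```

In HMTS terms, the two phases would be two tasks with a dependency between them, so the scheduler can run other work on the freed CPU thread in the meantime.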
• GMAC integration to automate data movements
• PPA integration to visualize dynamic task scheduling & monitor overall system status on the fly
• Other data-parallelism patterns, such as parallel_do
Ongoing development
• Visit http://www.multicorewareinc.com to request access to the closed beta
• Contact email: [email protected]
Disclaimer & Attribution
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited
to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product
differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no
obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to
make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.
NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO
RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS
INFORMATION.
ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY
DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL
OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF
EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in
this presentation are for informational purposes only and may be trademarks of their respective owners.
The contents of this presentation were provided by individual(s) and/or company listed on the title page. The information and
opinions presented in this presentation may not represent AMD’s positions, strategies or opinions. Unless explicitly stated, AMD is
not responsible for the content herein and no endorsements are implied.