
Threads: either under- or over-utilisedcas.ee.ic.ac.uk/.../hpce/hpce-lec7-tbb-continued.pdf ·...

Date post: 03-Aug-2020

Threads: either under- or over-utilised

• Underutilised: limited by creation speed of work

– Cannot exploit all the CPUs even though there is more work

• Overutilised: losing performance due to context switches

– There is overhead when switching between OS threads

– Each thread needs to warm up cache again

– Increases memory pressure

• Worst case: continual slow-down

– The cost of creating threads is partially borne by kernel

– User code may slow down more than kernel code under load

– Number of workers slowly goes up; completion rate goes down

HPCE / dt10/ 2015 / 7.1


Solving under-utilisation

#include <thread>

// Precondition: begin < end (an empty range would never reach the base case)
template<class TI, class TF>
void parallel_for(const TI &begin, const TI &end, const TF &f)
{
    if(begin+1 == end){
        f(begin);               // A single iteration: run it directly
    }else{
        TI mid=(begin+end)/2;
        std::thread left(       // Spawn the left half on a new thread
            [&](){ parallel_for(begin, mid, f); }
        );
        // Perform the right segment on our thread
        parallel_for(mid, end, f);
        // Wait for the left to finish
        left.join();
    }
}


Creation of work using trees

• Tree starts on one thread

• Create thread to branch

• Execute function at leaves

• Join back up to the root

[Diagram: fork/join tree produced by the recursive parallel_for over [0..4). The root [0..4) spawns a thread for [0..2) and continues with [2..4); each half spawns once more, giving the leaves [0..1), [1..2), [2..3) and [3..4), which execute f(0), f(1), f(2) and f(3). Three joins then merge the leaves back into [0..2) and [2..4), and finally into [0..4).]


Properties of fork / join trees

• Recursively creating trees of work is very efficient

– We are not limited to one thread creating all tasks

– Exponential rather than linear growth of threads with time

• Problem solved?

• Growth of threads is exponential with time

• Can put significant pressure on the OS thread scheduler

– Context switching 1000s of threads is very inefficient

• Each thread requires significant resources

– Need kernel handles, stack, thread-info block, ...

– Can’t allocate more than a few thousand threads per process
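To make the resource cost concrete, the sketch below reuses the recursive std::thread structure of parallel_for and counts how many threads it creates. The counter g_threads_created and the names counting_parallel_for and threads_needed are assumptions added for illustration; they are not part of the lecture code.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical counter (not in the lecture code): number of std::thread objects created.
static std::atomic<int> g_threads_created{0};

// Same recursive structure as the slide's parallel_for, plus the counter.
template<class TI, class TF>
void counting_parallel_for(TI begin, TI end, const TF &f)
{
    assert(begin < end);                 // an empty range would never reach the base case
    if(begin + 1 == end){
        f(begin);
    }else{
        TI mid = begin + (end - begin) / 2;
        ++g_threads_created;             // one new thread per internal node of the tree
        std::thread left([&](){ counting_parallel_for(begin, mid, f); });
        counting_parallel_for(mid, end, f);
        left.join();
    }
}

// Runs n iterations and reports how many threads were created along the way.
inline int threads_needed(int n)
{
    g_threads_created = 0;
    std::atomic<int> calls{0};
    counting_parallel_for(0, n, [&](int){ ++calls; });
    assert(calls == n);                  // every iteration ran exactly once
    return g_threads_created;
}
```

Under these assumptions n iterations create n - 1 threads, so a loop of a few thousand iterations already approaches the per-process limit mentioned above.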


Re-examining the goals

• What we want is parallel_for:

“Iterations may execute in parallel”

• std::thread gives us something different:

“The new thread will execute in parallel”

• Our thread based strategy is too eager to go parallel

• We want to go just parallel enough, then stay serial
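One possible reading of "just parallel enough, then stay serial", still using only std::thread, is to spawn threads for the first few levels of the tree and run everything below that serially. The names budgeted_parallel_for, depth_for_hardware and parallel_sum_to, and the spawn_depth parameter, are hypothetical; this is a sketch of the idea, not how TBB decides.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: spawn threads only for the first `spawn_depth` levels
// of the tree, then run the remaining iterations serially.
template<class TI, class TF>
void budgeted_parallel_for(TI begin, TI end, const TF &f, unsigned spawn_depth)
{
    if(begin >= end) return;                       // empty range: nothing to do
    if(spawn_depth == 0 || begin + 1 == end){
        for(TI i = begin; i < end; ++i) f(i);      // stay serial
        return;
    }
    TI mid = begin + (end - begin) / 2;
    std::thread left([&](){ budgeted_parallel_for(begin, mid, f, spawn_depth - 1); });
    budgeted_parallel_for(mid, end, f, spawn_depth - 1);
    left.join();
}

// Picks a depth so that roughly one serial chunk per hardware thread is produced.
inline unsigned depth_for_hardware()
{
    unsigned cpus = std::thread::hardware_concurrency();  // may be 0 if unknown
    if(cpus == 0) cpus = 1;
    unsigned depth = 0;
    while((1u << depth) < cpus) ++depth;
    return depth;
}

// Usage sketch: sums 0..n-1 with the budgeted loop.
inline long parallel_sum_to(int n)
{
    assert(n >= 0);
    std::atomic<long> total{0};
    budgeted_parallel_for(0, n, [&](int i){ total += i; }, depth_for_hardware());
    return total;
}
```

The number of threads now depends on the machine rather than on the iteration count, but the caller still has to pick the budget by hand, which is the decision the task-based approach hands to a run-time.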


Tasks versus threads

• A task is a chunk of work that can be executed

– A task may execute in parallel with other tasks

– A task will eventually be executed, but no guarantee on when

• Tasks are scheduled and executed by a run-time (TBB)

– Maintain a list of tasks which are ready to run

– Have one thread per CPU for running tasks

– If a thread is idle, assign a task from the ready queue to it

– No limit on number of tasks which are ready to run

– (OS is still responsible for mapping threads to CPUs)

• TBB has a number of high-level ways to use tasks

– But there is a single low-level underlying task primitive
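The run-time described above (a ready queue, a fixed set of worker threads, idle workers taking the next ready task) can be sketched in standard C++. ToyTaskPool and run_counting_tasks are hypothetical names and this is not TBB's implementation; the real scheduler is more sophisticated, for example in how it distributes tasks between threads.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Toy run-time (hypothetical, not TBB): one ready queue, a fixed set of workers.
class ToyTaskPool {
public:
    explicit ToyTaskPool(unsigned workers)
    {
        assert(workers > 0);
        for(unsigned i = 0; i < workers; ++i)
            threads_.emplace_back([this](){ worker_loop(); });
    }

    // Lets the workers drain the queue, then stops and joins them.
    ~ToyTaskPool()
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_all();
        for(std::thread &t : threads_) t.join();
    }

    // Adds a task to the ready queue; it runs eventually, on some worker.
    void run(std::function<void()> task)
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            ready_.push_back(std::move(task));
        }
        wake_.notify_one();
    }

private:
    void worker_loop()
    {
        for(;;){
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this](){ return stopping_ || !ready_.empty(); });
                if(ready_.empty()) return;       // stopping and nothing left to do
                task = std::move(ready_.front());
                ready_.pop_front();
            }
            task();                               // run the task outside the lock
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::deque<std::function<void()>> ready_;
    bool stopping_ = false;
    std::vector<std::thread> threads_;
};

// Usage sketch: many tiny tasks, few workers.
inline int run_counting_tasks(unsigned workers, int tasks)
{
    std::atomic<int> done{0};
    {
        ToyTaskPool pool(workers);
        for(int i = 0; i < tasks; ++i) pool.run([&done](){ ++done; });
    }   // pool destroyed here: queue drained, workers joined
    return done;
}
```

The number of queued tasks is limited only by memory, while the number of OS threads stays fixed at the worker count.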


Overview of task groups

• A task group collects together a number of child tasks

– The task creating the group is called the parent

– One or more child tasks are created and run() by the parent

– Child tasks may execute in parallel

– Parent task must wait() for all child tasks before returning


parallel_for using tbb::task_group

#include "tbb/task_group.h"

template<class TI, class TF>
void parallel_for(const TI &begin, const TI &end, const TF &f)
{
    if(begin+1 == end){
        f(begin);
    }else{
        auto left=[&](){ parallel_for(begin, (begin+end)/2, f); };
        auto right=[&](){ parallel_for((begin+end)/2, end, f); };
        // Spawn the two tasks in a group
        tbb::task_group group;
        group.run(left);
        group.run(right);
        group.wait(); // Wait for both to finish
    }
}


Overview of task groups


• Some important differences between tasks and threads

– Threads must execute in parallel

– A thread may continue after its creator exits

– Threads must be joined individually


More patterns: tbb::parallel_invoke

template<typename Func0, typename Func1>

void parallel_invoke(const Func0& f0, const Func1& f1);

template<typename Func0, typename Func1, typename Func2>

void parallel_invoke(const Func0& f0, const Func1& f1, const Func2& f2);

• Takes two or more functions and may run them in parallel

– Overloaded for different numbers of arguments

– No overload for 1 argument for obvious reasons

• Interface is very clean, but also quite simple

– Decision about number of tasks is completely static

– You can’t add more tasks once some have started

– No choice about when to synchronise with tasks
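The static shape of this interface can be imitated with std::async for the two-function case. invoke_both and invoke_both_demo are hypothetical names; this is a stand-in to show the fixed task count and the single synchronisation point, not TBB's implementation of parallel_invoke.

```cpp
#include <cassert>
#include <future>

// Hypothetical stand-in for a two-function parallel_invoke, built on std::async.
template<class F0, class F1>
void invoke_both(const F0 &f0, const F1 &f1)
{
    // std::launch::async forces f0 onto another thread in this sketch.
    std::future<void> other = std::async(std::launch::async, [&f0](){ f0(); });
    f1();          // run the second function on the calling thread
    other.get();   // the only synchronisation point: both are done on return
}

// Usage sketch: each function writes its own variable, so there is no data race.
inline int invoke_both_demo()
{
    int a = 0, b = 0;
    invoke_both([&a](){ a = 20; }, [&b](){ b = 22; });
    assert(a == 20 && b == 22);
    return a + b;
}
```

The number of functions is fixed by the signature and the caller cannot wait earlier or later than the return, which matches the limitations listed above.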


parallel_invoke using task_group

• parallel_invoke can be implemented using task_group

– task_group supports a super-set of the functionality

template<typename Fc0, typename Fc1, typename Fc2>

void parallel_invoke(const Fc0& f0, const Fc1& f1, const Fc2& f2)

{

tbb::task_group group;

group.run(f0);

group.run(f1);

group.run(f2);

group.wait();

}


Can’t do task_group using parallel_invoke

• task_group is intrinsically dynamic

– Decide how much work to add at run-time

– Can add work even while tasks are running in the group

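// f and g are assumed to be declared elsewhere
void my_function(int n, float *x)
{
    tbb::task_group group;
    for(unsigned i=0;i<n;i++){
        if(x[i]==0)
            group.run([=](){ f(i); });      // Which task is added depends on run-time data
        else
            group.run([=](){ g(x[i]); });
    }
    group.wait();   // One wait covers however many tasks were added
}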


The underlying primitive: tbb::task

• TBB has a basic primitive called tbb::task

• This is the raw unit of scheduling understood by the lib.

– Other high-level wrappers create tasks internally

– The TBB run-time takes tasks and schedules them to a CPU

• Tasks are very flexible, with a lot of power

– Can express complicated dependency graphs

– Build non-local synchronisation and barriers

• With power comes responsibility

– They allow you to make mistakes

– Possible (though not likely) to mess up the TBB run-time

• Better to create wrappers on top that hide tasks

– parallel_for, parallel_invoke, task_group, parallel_reduce, ...


#include "tbb/task.h"
#include "tbb/task_group.h"

// Version 1: raw tbb::task
// (cond(), DoSomethingFirst() and DoSomethingElse() are placeholders)
class MyTask
    : public tbb::task
{
    int start, end;
public:
    MyTask(int _start, int _end)
    { start=_start; end=_end; }

    tbb::task * execute()
    {
        if(cond())
            return 0;
        set_ref_count(3);   // Two children, plus one for wait_for_all
        MyTask &t1=*new(allocate_child()) MyTask(start,(start+end)/2);
        spawn(t1);
        MyTask &t2=*new(allocate_child()) MyTask((start+end)/2, end);
        spawn(t2);
        DoSomethingFirst();
        wait_for_all();
        DoSomethingElse();
        return 0;
    }
};

void CreateTasks(int start, int end)
{
    MyTask &root=*new(tbb::task::allocate_root()) MyTask(start,end);
    tbb::task::spawn_root_and_wait(root);
}

// Version 2: the same structure written with tbb::task_group
void MyTask(int start, int end)
{
    if(cond())
        return;
    tbb::task_group group;
    group.run([=](){MyTask(start,(start+end)/2); });
    group.run([=](){MyTask((start+end)/2,end); });
    DoSomethingFirst();
    group.wait();
    DoSomethingElse();
}


Life-cycle of a task

• Life-cycle of a task is due to interaction between the task and the run-time

– Individual task calls spawn, wait_for_all (sync), return

– TBB run-time will keep track of a task’s children (dependencies)


Scheduling through reference counts

• Each task has a reference count and a successor task

• The reference count identifies whether a task is blocked

– If the reference count is zero then the task could be run

– But only if it has been given to the task scheduler

– Legal to create a task and not give it to the scheduler

– Note the difference: “reference count” vs “C++ reference”

• Successor task identifies the task blocked by this task

– Generally the successor task is the creator, or parent

– When a task completes it decrements the count of its successor
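The two fields described above can be modelled in a few lines of standard C++. ToyTask and finish are hypothetical names, not part of tbb::task, and the model counts only outstanding children; any extra count that a real run-time reserves for a waiting parent is left out.

```cpp
#include <atomic>
#include <cassert>

// Toy model (hypothetical, not the tbb::task class): only the two fields
// discussed on the slide.
struct ToyTask {
    std::atomic<int> ref_count{0};   // > 0 means blocked on that many children
    ToyTask *successor = nullptr;    // task blocked by this one, usually the parent
};

// Called when `t` completes: decrement the successor's count.
// Returns true if that decrement made the successor runnable (count hit zero).
inline bool finish(ToyTask &t)
{
    if(t.successor == nullptr) return false;     // root task: nobody is waiting
    int remaining = --t.successor->ref_count;
    assert(remaining >= 0);                      // negative: the count was set wrongly
    return remaining == 0;
}
```

With a parent whose ref_count is 2 and two children pointing at it, the first call to finish returns false and the second returns true, at which point the parent could be handed back to the scheduler.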


Life-cycle walk-through (animation frames 7.21 to 7.35)

Each frame shows, for every task, its refcount, its successor and its state (Created, Ready, Running, Blocked or Finished). The code being traced:

void CreateTasks(int start, int end)
{
    MyTask &root=*new(tbb::task::allocate_root()) MyTask(start,end);
    tbb::task::spawn_root_and_wait(root);
}

tbb::task * MyTask::execute()
{
    if(cond())
        return 0;
    set_ref_count(3);
    MyTask &t1=*new(allocate_child()) MyTask(start,(start+end)/2);
    MyTask &t2=*new(allocate_child()) MyTask((start+end)/2, end);
    spawn(t1);
    spawn(t2);
    DoSomethingFirst();
    wait_for_all();
    DoSomethingElse();
    return 0;
}

– 7.21: CreateTasks allocates the root MyTask: refcount 0, no successor, state Created

– 7.22: spawn_root_and_wait starts the root: state Running

– 7.23: the root enters execute(): refcount 0, Running

– 7.24: set_ref_count(3): the root's refcount becomes 3

– 7.25: t1 is allocated: refcount 0, successor is the root, state Created

– 7.26: t2 is allocated in the same way: both children Created

– 7.27: spawn(t1) and spawn(t2): both children Ready

– 7.28: one child is Running while the root is still Running; the other child is Ready

– 7.29 to 7.30: the root reaches wait_for_all(): refcount 2, state Blocked

– 7.31: the second child is also Running

– 7.32: one child is Finished: the root's refcount drops to 1

– 7.33: the other child is Finished: the root's refcount drops to 0 and the root is Running again

– 7.34: the root continues with DoSomethingElse()

– 7.35: execute() returns: the root is Finished and control is back in CreateTasks

Managing reference counts

• What happens if we get the reference count wrong?

• Finishing task calls decrement_ref_count on successor

– Automatically returns task to scheduler if count becomes zero


Two ways to get the reference count wrong (animation frames 7.37 to 7.40, using the same MyTask::execute() with one line changed)

• Count set too low: set_ref_count(1) instead of set_ref_count(3)

– 7.37: the parent is Running with refcount 1

– 7.38: the first child is Finished and has decremented the parent to 0, while the parent is still Running and the second child is only Ready

• Count set too late: set_ref_count(3) moved after spawn(t1) and spawn(t2)

– 7.39: the parent is Running with refcount 0 while a child is already Running

– 7.40: a child is Finished before the count has been set and has decremented the parent to -1; the other child is still Ready

Some help is available

• TBB library comes in two forms: debug and release

– release library does no error checking – all about speed

– debug library will check reference counts at many points

• Choose library version at compilation and link stages

– Debug: define TBB_USE_DEBUG=1 when compiling

• On Microsoft compilers it will automatically link the correct library

• On other compilers use “-ltbb” vs “-ltbb_debug”

– Usually maintain different release and debug settings

• Debug: /DTBB_USE_DEBUG=1 /MDd

• Release: /DNDEBUG=1 /O2

– Can set up in Visual Studio or in a makefile

• Generally: try to avoid raw tasks if possible


Many design patterns are built on tasks

• Iteration in various forms

– parallel_for, parallel_for_each

• Reduction and accumulation

– parallel_reduce

• Data-dependent looping and queue processing

– parallel_do

• Support for heterogeneous tasks

– parallel_invoke, task_group

• Heterogeneous tasks and token-based data-flow

– parallel_pipeline

• Goal: turn design patterns into concrete functions
