Heterogeneous Work-stealing across CPU and DSP...

Date post: 08-Aug-2020
Transcript
Heterogeneous Work-stealing across CPU and DSP cores
Slides: vivkumar.github.io/papers/slides_hpec2015.pdf

Vivek Kumar (1), Alina Sbîrlea (1), Ajay Jayaraj (2), Zoran Budimlić (1), Deepak Majeti (1), and Vivek Sarkar (1)

(1) Rice University
(2) Texas Instruments

Heterogeneous Work-stealing across CPU and DSP cores | Kumar et. al. 2

Accelerators in Top500

Lower power-to-performance ratio = $$

Source: http://www.slideshare.net/top500/top500-list-november-2014?related=1

Outline

•  Background and motivation
•  Our contributions
•  Implementation
•  Results
•  Summary

if ( ! (GPGPU || CoProcessor) ) ?

DSP
•  Continuous analog signals as electric pulses of varying amplitude
•  Digital signals in binary format
•  Real-time processing

DSP and HPC… Really?

•  Traditionally, DSPs existed in two different flavors
   –  Fixed-point operations
      •  Integer arithmetic
      •  Low cost
   –  Floating-point operations
      •  Usage restricted to research, avionics
      •  High cost

Source: http://www.ti.com/lit/wp/spry061/spry061.pdf

And Out of Thin Air an HPC Engine is Born…

•  High-capacity data transmission
•  Signal processing (e.g., MIMO)
•  High performance and accuracy
•  Floating-point operations

November 2010: TI’s C66x DSPs integrated both fixed- and floating-point capabilities in the same DSP

Source: http://www.ti.com/lit/wp/spry147/spry147.pdf

TI Keystone II SoC

[Figure: block diagram of the SoC]
•  4 × ARM A15 cores, 4MB ARM shared L2
•  8 × C66x DSP cores, 1MB L2 per C66x core
•  Multicore shared memory controller, 6MB MSMC SRAM
•  2 × DDR3, total 2GB

Existing Programming Model

•  No special programming language required for DSP
   –  Supports C language (TI’s C66x)
•  Parallel programming using OpenMP
   –  Stotzer et al., OpenMP on the low-power TI Keystone-II ARM/DSP system-on-chip, IWOMP 2013
   –  ARM dispatches OpenMP kernels to DSPs and waits for completion
      •  Idle ARM cores

Contributions

✔ HC-K2H programming model
   Task-parallel programming model for TI’s ARM+DSP SoC, which abstracts away hardware complexities from the user

✔ Hybrid work-stealing runtime
   Performs load balancing of tasks across ARM and DSP cores

✔ Detailed performance study
   Using standard work-stealing benchmarks

✔ Results
   Show that the HC-K2H runtime can even outperform DSP-only execution

HC-K2H Parallel Programming Model

    // Task T0 (Parent)
    start_finish( );
        async (STMT1);    // T1 (Child)
        STMT2;            // T0
    end_finish( );
    STMT3;                // T3
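
One way to read the finish scope above is as a counter of outstanding child tasks: each async adds one, each completed child removes one, and end_finish returns only once the counter is back at zero. The following is a minimal single-threaded sketch of that bookkeeping, not the HC-K2H implementation. The names `finish_t`, `sketch_start_finish`, `sketch_async` and `sketch_end_finish` are hypothetical, and children are simply deferred and run at end_finish, whereas the real runtime runs them on ARM and DSP workers.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TASKS 16

typedef void (*task_fn)(void *);

/* Hypothetical finish scope: deferred child tasks plus a counter. */
typedef struct {
    task_fn fn[MAX_TASKS];
    void   *arg[MAX_TASKS];
    size_t  queued;       /* children spawned so far in this scope */
    size_t  next;         /* next child to run */
    int     outstanding;  /* spawned but not yet completed */
} finish_t;

static void sketch_start_finish(finish_t *f) {
    f->queued = 0;
    f->next = 0;
    f->outstanding = 0;
}

/* async: record the child and count it; the parent continues at once. */
static int sketch_async(finish_t *f, task_fn fn, void *arg) {
    if (f->queued == MAX_TASKS) return -1;
    f->fn[f->queued] = fn;
    f->arg[f->queued] = arg;
    f->queued++;
    f->outstanding++;
    return 0;
}

/* end_finish: run children until none is outstanding, then return. */
static void sketch_end_finish(finish_t *f) {
    while (f->outstanding > 0) {
        size_t i = f->next++;
        f->fn[i](f->arg[i]);
        f->outstanding--;
    }
}

/* Trace buffer that records the order in which statements ran. */
static char   trace[8];
static size_t trace_len;

static void stmt1(void *unused) { (void)unused; trace[trace_len++] = '1'; }

/* Mirrors the slide: async(STMT1); STMT2; end_finish(); STMT3; */
const char *run_example(void) {
    finish_t f;
    trace_len = 0;
    sketch_start_finish(&f);
    sketch_async(&f, stmt1, NULL);  /* T1 (child), deferred */
    trace[trace_len++] = '2';       /* STMT2 runs in the parent T0 */
    sketch_end_finish(&f);          /* returns only after T1 has run */
    trace[trace_len++] = '3';       /* STMT3 follows the finish scope */
    trace[trace_len] = '\0';
    return trace;
}
```

In this sketch STMT2 may run before or after STMT1 depending on the scheduler (here it runs first), but STMT3 always comes last, which is the guarantee the finish scope provides.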

HC-K2H Parallel Programming Model

    // Task T0 (Parent)
    start_finish( );
        async (STMT1);       // T1 (Child)
        forasync (STMT2);    // T2 (Child)
    end_finish( );
    STMT3;                   // T3

[Figure: task graph in which T1 is spawned through async and T2 through forasync, with STMT3 following both.]

Compiling and Running

    main ( ) { }

User main replaced using macros

DSP version

    main ( ) {
        initialize_runtime ( );
        finalize_runtime ( );
    }
    _user_main ( ) { }

ARM version

    main ( ) {
        initialize_runtime ( );
        _user_main ( );
        finalize_runtime ( );
    }
    _user_main ( ) { }

    $ ARM_WORKERS=X DSP_WORKERS=Y ./executable <command line args>
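
The macro trick named on this slide can be sketched in plain C. This is only an illustration under stated assumptions: the slide does not show the actual macro, so `#define main _user_main` is a guess at its shape, the bodies of `initialize_runtime` and `finalize_runtime` are stand-ins, and `arm_entry` / `dsp_entry` are hypothetical names used because one C file cannot hold both generated `main` functions.

```c
#include <assert.h>

/* Stand-ins for the runtime calls named on the slide. */
static int runtime_active = 0;
static int user_main_calls = 0;
static void initialize_runtime(void) { runtime_active = 1; }
static void finalize_runtime(void)   { runtime_active = 0; }

int _user_main(void);

/* ARM version: the generated entry point runs the user's code
 * between runtime start-up and shutdown. */
int arm_entry(void) {
    initialize_runtime();
    int rc = _user_main();
    finalize_runtime();
    return rc;
}

/* DSP version: as on the slide, the generated entry point starts and
 * stops the runtime and never calls the user's main directly. */
int dsp_entry(void) {
    initialize_runtime();
    finalize_runtime();
    return 0;
}

/* What a runtime header could do so that the user's main is renamed. */
#define main _user_main

/* ---- user code, written as an ordinary main ---- */
int main(void) {
    assert(runtime_active);  /* user code only runs with the runtime up */
    user_main_calls++;
    return 0;
}

#undef main  /* keep the rename local to this sketch */
```

The user program is compiled once per target; only the generated entry point differs, so the same source can serve both the ARM and the DSP binary.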

Hybrid Work-Stealing Implementation

HC-K2H Task Data-Structure

    ARMfunPtr | DSPfunPtr | ARM_finish | DSP_finish | args [SIZE]

    // ARM has access to symbol table manager
    // which helps function pointer mapping between
    // ARM and DSP
    ARMfunPtr = lookup(DSPfunPtr);
    DSPfunPtr = lookup(ARMfunPtr);
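
The task record and the address translation can be sketched as a C struct plus a small table of address pairs. The field names follow the slide, but everything else here is an assumption made for illustration: the field types, the value of `SIZE`, the table layout, the made-up addresses, and the split of the slide's `lookup` into two hypothetical helpers `lookup_dsp` and `lookup_arm`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SIZE 64  /* assumed capacity of the inline argument buffer */

/* Field names follow the slide; the field types are assumptions. */
typedef struct {
    uint32_t ARMfunPtr;   /* address of the task body in the ARM binary */
    uint32_t DSPfunPtr;   /* address of the same function in the DSP binary */
    void    *ARM_finish;  /* finish scope on the ARM side */
    void    *DSP_finish;  /* finish scope on the DSP side */
    char     args[SIZE];  /* task arguments stored inline */
} task_t;

/* Hypothetical symbol table: one row per function built for both targets. */
typedef struct { uint32_t arm_addr, dsp_addr; } sym_pair_t;

static const sym_pair_t symtab[] = {
    { 0x00010400u, 0x0C001000u },  /* made-up addresses */
    { 0x00010480u, 0x0C001200u },
};
static const size_t symtab_len = sizeof symtab / sizeof symtab[0];

/* Map an ARM address to its DSP counterpart; 0 means "not found". */
uint32_t lookup_dsp(uint32_t arm_addr) {
    for (size_t i = 0; i < symtab_len; i++)
        if (symtab[i].arm_addr == arm_addr) return symtab[i].dsp_addr;
    return 0;
}

/* Map a DSP address to its ARM counterpart; 0 means "not found". */
uint32_t lookup_arm(uint32_t dsp_addr) {
    for (size_t i = 0; i < symtab_len; i++)
        if (symtab[i].dsp_addr == dsp_addr) return symtab[i].arm_addr;
    return 0;
}
```

Carrying both function addresses in the task lets either kind of core run it without another translation once the missing address has been filled in.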

Work-stealing Properties

      | WS data-structure            | Task synchronization (task counter at each finish scope)              | Push         | Pop         | Steal
ARM   | Software deques at each core | Atomic operations                                                     | Tail         | Tail        | Head
DSP   | Hardware queues at each core | Using single hardware semaphore (total hardware semaphores = 32 only) | Head or tail | Tail (only) | == Pop
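
The ARM row of the table (push and pop at the tail, steal at the head) can be illustrated with a small bounded deque. This is a simplified sketch, not the runtime's data structure: a C11 `atomic_flag` spinlock guards every operation to keep the code short, the names `deque_t`, `deq_push`, `deq_pop` and `deq_steal` are hypothetical, and the DSP hardware queues are not modelled at all.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define DEQ_CAP 8

typedef struct {
    void       *slot[DEQ_CAP];
    size_t      head;  /* index of the oldest task (steal end) */
    size_t      tail;  /* one past the newest task (owner end) */
    atomic_flag lock;  /* spinlock; a simplification of this sketch */
} deque_t;

void deq_init(deque_t *d) {
    d->head = 0;
    d->tail = 0;
    atomic_flag_clear(&d->lock);
}

static void deq_lock(deque_t *d) {
    while (atomic_flag_test_and_set_explicit(&d->lock, memory_order_acquire)) {
        /* spin until the flag is released */
    }
}

static void deq_unlock(deque_t *d) {
    atomic_flag_clear_explicit(&d->lock, memory_order_release);
}

/* Owner: push a new task at the tail. Returns 0, or -1 if full. */
int deq_push(deque_t *d, void *task) {
    int rc = -1;
    deq_lock(d);
    if (d->tail - d->head < DEQ_CAP) {
        d->slot[d->tail % DEQ_CAP] = task;
        d->tail++;
        rc = 0;
    }
    deq_unlock(d);
    return rc;
}

/* Owner: pop the newest task from the tail. NULL if empty. */
void *deq_pop(deque_t *d) {
    void *t = NULL;
    deq_lock(d);
    if (d->tail != d->head) {
        d->tail--;
        t = d->slot[d->tail % DEQ_CAP];
    }
    deq_unlock(d);
    return t;
}

/* Thief: steal the oldest task from the head. NULL if empty. */
void *deq_steal(deque_t *d) {
    void *t = NULL;
    deq_lock(d);
    if (d->tail != d->head) {
        t = d->slot[d->head % DEQ_CAP];
        d->head++;
    }
    deq_unlock(d);
    return t;
}
```

Because the owner works at the tail and thieves work at the head, the two only meet on the same element when the deque is nearly empty.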

Hybrid Work-Stealing Design

[Figure: workers ARM_0 to ARM_3 with software deques and DSP_0 to DSP_7 with hardware queues, together with three shared queues SQ1, SQ2 and SQ3. Queue ends are labeled Tail and Head.]

Hybrid Work-Stealing Design

[Figure: workers ARM_0 to ARM_3 and DSP_0 to DSP_7 with shared queues SQ1, SQ2 and SQ3. Queue ends are labeled Tail and Head, and arrows mark Push, Pop, steals within ARM cores and steals within DSP cores. The code annotations on the figure are listed below.]

Task handed to idle DSPs through SQ1, completion reported through SQ3:

    if (Idle_DSPs) { task->DSPfunPtr = lookup(ARMfunPtr); increment(task->ARM_finish); push(task, SQ1); }
    Steal(task, SQ1); start_new_DSP_finish( );
    end_DSP_new_finish( ); push(task, SQ3);
    Steal(task, SQ3); decrement(task->ARM_finish);

Task stolen from the DSPs, completion reported through SQ2:

    Steal(task, DSPs); task->ARMfunPtr = lookup(DSPfunPtr); start_new_ARM_finish( );
    end_ARM_new_finish( ); push(task, SQ2);
    Steal(task, SQ2); decrement(task->DSP_finish);
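
The SQ1/SQ3 round trip annotated on this slide can be traced with a small sequential simulation. It is a sketch under several assumptions: which core runs each step is inferred from the queue and counter names, each shared queue is reduced to a single slot, `lookup` is a made-up mapping, and the helper names `arm_offload`, `dsp_run_one` and `arm_collect` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int pending; } finish_t;  /* outstanding tasks of a finish scope */

typedef struct {
    int       arm_fn, dsp_fn;  /* stand-ins for ARMfunPtr / DSPfunPtr */
    finish_t *ARM_finish;
    finish_t *DSP_finish;
    int       done;            /* set once the task body has run */
} task_t;

/* One-slot stand-in for a shared queue such as SQ1 or SQ3. */
typedef struct { task_t *slot; } shared_queue_t;

static int sq_push(shared_queue_t *q, task_t *t) {
    if (q->slot != NULL) return -1;
    q->slot = t;
    return 0;
}

static task_t *sq_steal(shared_queue_t *q) {
    task_t *t = q->slot;
    q->slot = NULL;
    return t;
}

static int lookup(int arm_fn) { return arm_fn + 1000; }  /* made-up mapping */

/* Step 1 (assumed ARM side): hand a task to idle DSPs through SQ1. */
int arm_offload(task_t *t, shared_queue_t *sq1, int idle_dsps) {
    if (!idle_dsps || sq1->slot != NULL) return 0;
    t->dsp_fn = lookup(t->arm_fn);
    t->ARM_finish->pending++;      /* increment(task->ARM_finish) */
    return sq_push(sq1, t) == 0;
}

/* Steps 2-3 (assumed DSP side): steal from SQ1, run the task inside a
 * fresh DSP finish scope, then report completion through SQ3. */
int dsp_run_one(shared_queue_t *sq1, shared_queue_t *sq3) {
    task_t *t = sq_steal(sq1);
    if (t == NULL) return 0;
    finish_t dsp_scope = { 0 };    /* start_new_DSP_finish() */
    t->done = 1;                   /* task body elided */
    assert(dsp_scope.pending == 0);/* end_DSP_new_finish(): no children left */
    return sq_push(sq3, t) == 0;
}

/* Step 4 (assumed ARM side): collect the finished task from SQ3. */
int arm_collect(shared_queue_t *sq3) {
    task_t *t = sq_steal(sq3);
    if (t == NULL) return 0;
    t->ARM_finish->pending--;      /* decrement(task->ARM_finish) */
    return 1;
}
```

The ARM finish counter stays raised for the whole time the task is away on the DSP, so the enclosing finish scope on ARM cannot complete before the offloaded task has been reported back.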

Experimental Evaluation

•  Work-stealing performance
   –  DSP-only vs. hybrid

[Figure: DSP-only work-stealing performance. Y axis: speedup over serial elision (ARM), 0 to 14. X axis: total workers (at DSP only), 1 to 8. Benchmarks: Matmul, NBody, Series.]

[Figure: Hybrid work-stealing performance. Y axis: speedup over serial elision (ARM), 0 to 14. X axis: total workers (ARM + DSP), 1 to 12. Benchmarks: Matmul, NBody, Series.]

Experimental Evaluation

•  Work-stealing performance
   –  Understanding anomalies

[Figure: Total hybrid steals (% of total steals). Y axis: 0 to 80. X axis: total workers (ARM + DSP), 5 to 12. Benchmarks: Matmul, NBody, Series.]

[Figure: Ratio of cache flushes at ARM to total async tasks generated at ARM. Y axis: 0 to 0.7. X axis: total workers (ARM + DSP), 5 to 12. Benchmarks: Matmul, NBody, Series.]

Summary

•  Current work
   –  Parallel programming model for TI Keystone II SoC
      •  Abstracts away hardware complexities
   –  Hybrid work-stealing across ARM and DSP
   –  Detailed experimental evaluation
      •  Optimal load balancing, even outperforming DSP-only executions
•  Future work
   –  Shared memory allocations from DSP cores
   –  More benchmarks

Backup Slides

HC-K2H Parallel Programming Constructs

•  Asynchronous tasks

    // Execute only on DSP cores
    asyncDSP (void* funPtr, void* args, int args_size);

    // Can execute on ARM or DSP core
    async (void* funPtr, void* args, int args_size);

    // Can execute on ARM or DSP core
    forasync (void* funPtr, void* args, int args_size,
              int loop_dimension, loop_domain_t * domain, int mode);

•  Synchronization over asynchronous tasks

    // start a finish scope
    // variable argument function call
    start_finish ( int total_shared_vars, shared_pointer_i, … );

    // end a finish scope
    end_finish ( );

HClib Programming Model

    #include "hclib.h"
    int size, *A, *B, *C;

    main ( ) {
        arm_init (N);
        dsp_init ( );
        parallel_sum ( );
    }

    void arm_init (int N) {
        size = N;
        A = ws_malloc(N * sizeof(int));
        initialize_A( );
        ws_cacheWbInv (A);
        /* similarly for B and C */
    }

    void dsp_init ( ) {
        /* pointer translation */
        int in[ ] = {N, ws_dspPtr(A), ws_dspPtr(B), ws_dspPtr(C)};
        start_finish (0);
        /* DSP only async task */
        /* dsp_init_func( ) → A = (int*) in [1]; ………… */
        asyncDSP (dsp_init_func, in, sizeof(in));
        end_finish ( );
    }

    void parallel_sum ( ) {
        loop_domain_t loop = {low, high, stride, tile};
        ws_args_t t1 = {C}; /* result array */
        start_finish(1, &t1);
        forasync(kernel, NULL, 0, 1, &loop, WS_RECURSION);
        end_finish( );
    }

    void kernel(void* args, int i) {
        C[i] = A[i] + B[i];
    }

Parallel array addition in HClib
More information: http://habanero-rice.github.io/hclib/

HC-K2H Programming Model

    #include "hc-k2h.h"
    int size, *A, *B, *C;

    main ( ) {
        arm_init (N);
        dsp_init ( );
        parallel_sum ( );
    }

    void arm_init (int N) {
        size = N;
        A = ws_malloc(N * sizeof(int));
        initialize_A( );
        ws_cacheWbInv (A);
        /* similarly for B and C */
    }

    void dsp_init ( ) {
        /* pointer translation */
        int in[ ] = {N, ws_dspPtr(A), ws_dspPtr(B), ws_dspPtr(C)};
        start_finish (0);
        /* DSP only async task */
        /* dsp_init_func( ) → A = (int*) in [1]; ………… */
        asyncDSP (dsp_init_func, in, sizeof(in));
        end_finish ( );
    }

    void parallel_sum ( ) {
        loop_domain_t loop = {low, high, stride, tile};
        ws_args_t t1 = {C}; /* result array */
        start_finish(1, &t1);
        forasync(kernel, NULL, 0, 1, &loop, WS_RECURSION);
        end_finish( );
    }

    void kernel(void* args, int i) {
        C[i] = A[i] + B[i];
    }

Parallel array addition in HC-K2H
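
The slides do not say how the `WS_RECURSION` mode divides the loop domain. A plausible reading, offered here only as an assumption, is recursive halving of the iteration range until a chunk holds at most `tile` iterations. The sketch below shows that idea sequentially; `forasync_recursive` is a hypothetical helper, the struct layout is inferred from the initializer `{low, high, stride, tile}`, and in a work-stealing runtime each half would be a separate stealable task rather than a plain function call.

```c
#include <assert.h>
#include <stddef.h>

/* Field order follows the slide's initializer {low, high, stride, tile}. */
typedef struct { int low, high, stride, tile; } loop_domain_t;

typedef void (*kernel_fn)(void *args, int i);

/* Halve [low, high) until a chunk has at most `tile` iterations,
 * then run the kernel over that chunk. */
void forasync_recursive(kernel_fn k, void *args, loop_domain_t d) {
    if (d.stride <= 0) return;      /* this sketch handles forward loops only */
    if (d.tile < 1) d.tile = 1;     /* guard against endless splitting */
    int iters = (d.high - d.low + d.stride - 1) / d.stride;
    if (iters <= 0) return;
    if (iters <= d.tile) {
        for (int i = d.low; i < d.high; i += d.stride) k(args, i);
        return;
    }
    /* Split on the stride grid so both halves visit the original indices. */
    int mid = d.low + (iters / 2) * d.stride;
    loop_domain_t left  = { d.low, mid, d.stride, d.tile };
    loop_domain_t right = { mid, d.high, d.stride, d.tile };
    forasync_recursive(k, args, left);
    forasync_recursive(k, args, right);
}

enum { N = 100 };
static int A[N], B[N], C[N];

/* Same body as the kernel on the slide. */
static void kernel(void *args, int i) { (void)args; C[i] = A[i] + B[i]; }

/* Returns 1 when every element of C holds the expected sum. */
int run_parallel_sum(void) {
    for (int i = 0; i < N; i++) { A[i] = i; B[i] = 2 * i; C[i] = -1; }
    loop_domain_t loop = { 0, N, 1, 32 };
    forasync_recursive(kernel, NULL, loop);
    int ok = 1;
    for (int i = 0; i < N; i++) ok &= (C[i] == 3 * i);
    return ok;
}
```

With 100 iterations and a tile of 32, the range splits into four chunks of 25 iterations, each small enough to run as one unit.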

Avoiding False Sharing

•  ARM cache line
   –  64 bytes
•  DSP cache line
   –  128 bytes

Allocate writable shared buffers with sizes in multiples of 128 bytes

    // Specifying information on for-loop in forasync task
    loop_domain_t loop_info = { lowBound, highBound, stride, tile_size };

    uint32_t writable_shared_array[1024]; /* Option: use tile_size = 32 */
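
The suggested tile size follows from the larger of the two cache lines: 128 bytes divided by 4 bytes per `uint32_t` gives 32 elements, so a 32-element tile covers exactly one DSP line and two ARM lines. The helpers below work through that arithmetic; their names are hypothetical and they only check sizes, on the assumption that the buffer itself starts on a cache-line boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ARM_CACHE_LINE  64u  /* bytes, from the slide */
#define DSP_CACHE_LINE 128u  /* bytes, from the slide */

/* Round a byte count up to the next multiple of the larger (DSP) line. */
size_t round_up_to_line(size_t bytes) {
    return (bytes + DSP_CACHE_LINE - 1u) / DSP_CACHE_LINE * DSP_CACHE_LINE;
}

/* Smallest tile, in elements, whose byte size covers a whole DSP line. */
size_t min_tile_elems(size_t elem_size) {
    return (DSP_CACHE_LINE + elem_size - 1u) / elem_size;
}

/* A tile cannot share a line with its neighbours when its byte size is
 * a multiple of the line size and the array starts on a line boundary. */
int tile_is_line_multiple(size_t tile_elems, size_t elem_size) {
    return (tile_elems * elem_size) % DSP_CACHE_LINE == 0u;
}
```

A 16-element tile of `uint32_t` would fill one ARM line but only half a DSP line, which is why the DSP line size is the one that has to drive the choice.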

Multicore ARM+DSP TI’s Keystone-II SoC

•  Software configuration
   –  ARM
      •  Standard Linux
   –  DSP
      •  Custom real-time OS called SYS/BIOS™
         –  Support for inter-processor communication
         –  Custom runtime library to support thread management, scheduling and synchronization
         –  Supports C99 C language
         –  No pthread libraries or GCC built-in atomic functions

