A Programmable Processing Array Architecture Supporting Dynamic Task Scheduling and Module-Level Prefetching

Junghee Lee*, Hyung Gyu Lee*, Soonhoi Ha†, Jongman Kim*, and Chrysostomos Nicopoulos‡

Presented by Junghee Lee
Introduction

• Single core → multi core → many core
• Programmable hardware accelerator, e.g., GPGPU
• Fusion: powerful cores + hardware accelerator in a single die, e.g., AMD Fusion
• Massively Parallel Processing Array (MPPA)
MPPA as Hardware Accelerator

[Figure: an array of identical core tiles attached through a host CPU interface to a host system (CPUs and I/O) and to device memory.]

Challenges:
• Expressiveness
• Debugging
• Memory hierarchy design
Related Works

|                    | Expressiveness       | Debugging                           | Memory                          |
|--------------------|----------------------|-------------------------------------|---------------------------------|
| GPGPU / AMD Fusion | SIMD                 | Multiple debuggers, event graph     | Scratch-pad memory, cache       |
| Tilera             | Multi-threading      | Multiple debuggers                  | Coherent cache                  |
| Rigel              | Multi-threading      | Not addressed                       | Software-managed cache          |
| Ambric             | Kahn process network | Formal model                        | Scratch-pad memory              |
| Proposed MPPA      | Event-driven model   | Inter-module and intra-module debug | Scratch-pad memory, prefetching |
Contents

• Introduction
• Execution Model
• Hardware Architecture
• Evaluation
• Conclusion
Execution Model

• Specification
  – Module = (b, Pi, Po, C, F)
    • b = behavior of the module
    • Pi = input ports
    • Po = output ports
    • C = sensitivity list
    • F = prefetch list
  – Signal
  – Net = (d, K)
    • d = driver port
    • K = set of sink ports
• Semantics
  – A module is triggered when any signal connected to its sensitivity list C changes
  – Function calls and memory accesses are limited to within a module
  – Non-blocking writes and blocking reads
  – The specification can be modified at run-time
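The specification and trigger semantics above can be sketched in a few lines of code. This is a minimal illustration, not the paper's implementation; the names (`Module`, `Net`, `Kernel`) and the toy "doubler" module are assumptions made for the example.

```python
# Minimal sketch of the event-driven execution model: a module is
# triggered whenever a signal on its sensitivity list (C) changes.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Module:
    behavior: Callable        # b: invoked when the module is triggered
    inputs: List[str]         # Pi: input port names
    outputs: List[str]        # Po: output port names
    sensitivity: List[str]    # C: signals whose change triggers the module
    prefetch: List[str]       # F: data to prefetch before execution

@dataclass
class Net:
    driver: str               # d: driver port
    sinks: List[str]          # K: set of sink ports

class Kernel:
    """Tiny event-driven kernel: writes are non-blocking and fan out
    to every module sensitive to the written signal."""
    def __init__(self):
        self.modules: List[Module] = []
        self.signals: Dict[str, int] = {}

    def write(self, signal: str, value: int) -> None:
        changed = self.signals.get(signal) != value
        self.signals[signal] = value
        if changed:
            for m in self.modules:
                if signal in m.sensitivity:
                    m.behavior(self, {p: self.signals.get(p) for p in m.inputs})

kernel = Kernel()
log = []
doubler = Module(
    behavior=lambda k, ins: log.append(ins["x"] * 2),
    inputs=["x"], outputs=["y"], sensitivity=["x"], prefetch=["x"],
)
kernel.modules.append(doubler)
kernel.write("x", 21)   # change on "x" triggers the module; log becomes [42]
```

Writing the same value twice does not re-trigger the module, which matches the "triggered when a signal *changes*" semantics.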
Example

• Quick sort
  – A pivot is selected
  – The given array is partitioned so that
    • the left segment contains elements smaller than the pivot
    • the right segment contains elements larger than the pivot
  – The left and right segments are partitioned recursively
• Specifying quick sort
  – Multi-threading: works, but hard to debug
  – SIMD: inefficient due to input dependency
  – Kahn process network: impossible due to the dynamic nature of the computation
Specify Quick Sort with the Event-driven Model

• Partition module
  – b (behavior): select a pivot, partition the input array, and instantiate another Partition module if necessary
  – Pi (input ports): input array and its position
  – Po (output ports): left and right segments and their positions
  – C (sensitivity list): input array
  – F (prefetch list): input array
• Collection module
  – b (behavior): collect sorted segments
  – Pi (input ports): sorted segments and intermediate result
  – Po (output ports): final result and intermediate result
  – C (sensitivity list): sorted segments
  – F (prefetch list): sorted segments and intermediate result

[Figure: Partition modules are instantiated dynamically and feed sorted segments into a Collection module, which assembles the final result from the input array via an intermediate result.]
Contents

• Introduction
• Execution Model
• Hardware Architecture
• Evaluation
• Conclusion
MPPA Microarchitecture

[Figure: an array of identical core tiles connected to a host CPU interface and device memory; one tile runs the execution engine (E).]

• Core tiles
  – Identical; each consists of a uCPU, scratch-pad memory, and peripherals that support the execution model
  – One core tile is designated as the execution engine
• Execution engine
  – Software running on a core tile
  – Consists of a scheduler, signal storage, and an interconnect directory
  – Supports the execution model
  – If necessary, it is split into multiple instances running on different core tiles
Core Tile Architecture

[Figure: core tile block diagram — uCPU, scratch-pad memory (one half for the current module, the other for the next), prefetcher, context manager, input and output signal queues, message queue, message handler, and network interface.]

• uCPU: a generic small processor, treated as a black box
• Scratch-pad memory: software-managed on-chip SRAM; double-buffered, with one buffer for the current module and the other for the next module to be prefetched
• Prefetcher: prefetches the code and data of the next module while the current module is running on the uCPU
• Message handler: counterpart of the prefetcher; sends data to the requester
• Context manager: switches the context when the current module finishes and the next module is ready; stores information about the modules
• Input signal queue: stores the input data; the actual data resides in the SPM while its bookkeeping information is managed here
• Output signal queue: stores the output data; notifies the interconnect directory of an update event when an output is updated
• Message queue: handles the system messages
• Network interface: NoC router
Execution Engine

• Most of its functionality is implemented in software, while the hardware facilitates communication; the software implementation gives flexibility in the number and location of execution engines
• One way to visualize the MPPA is to regard the execution engine as an event-driven simulation kernel
• The execution engine interacts with modules running on other core tiles through messages:

| Type             | From                     | To                     | Payload                           |
|------------------|--------------------------|------------------------|-----------------------------------|
| REQ_FETCH_MODULE | Prefetcher               | Scheduler              | Request for a new module          |
| RES_FETCH_MODULE | Scheduler                | Prefetcher             | Module ID and list of input ports |
| MODULE_INSTANCE  | Scheduler                | Prefetcher             | Code of the module                |
| REQ_SIGNAL       | Prefetcher               | Interconnect directory | Port ID                           |
| RES_SIGNAL       | Signal storage or a node | Prefetcher             | Data                              |
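The fetch handshake in the table can be sketched as follows. The message type names mirror the table; everything else (function names, the sample module and signal) is assumed for illustration and is not the paper's implementation.

```python
# Sketch of the module-fetch handshake: the prefetcher asks the
# scheduler for work (REQ_FETCH_MODULE / RES_FETCH_MODULE), then
# pulls each input signal (REQ_SIGNAL / RES_SIGNAL).
from dataclasses import dataclass

@dataclass
class Message:
    type: str          # one of the types in the table above
    payload: object

ready_modules = [("partition0", ["in_array"])]   # held by the scheduler
signal_values = {"in_array": [3, 1, 2]}          # held by signal storage

def scheduler_handle(msg):
    # Scheduler side: answer REQ_FETCH_MODULE with RES_FETCH_MODULE.
    if msg.type == "REQ_FETCH_MODULE" and ready_modules:
        module_id, ports = ready_modules.pop(0)
        return Message("RES_FETCH_MODULE", (module_id, ports))
    return None

def prefetch_next():
    # Prefetcher side: request a module, then fetch its input signals.
    res = scheduler_handle(Message("REQ_FETCH_MODULE", None))
    if res is None:
        return None
    module_id, ports = res.payload
    inputs = {p: Message("RES_SIGNAL", signal_values[p]).payload for p in ports}
    return module_id, inputs

print(prefetch_next())   # → ('partition0', {'in_array': [3, 1, 2]})
```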
Components of the Execution Engine

• Scheduler
  – Keeps track of the status and location of modules
  – Maintains three queues: wait, ready, and run
• Signal storage
  – Stores signal values in the device memory
  – If a signal is updated but its value is still stored in a node, the signal storage invalidates its copy and records the location of the latest value
• Interconnect directory
  – Keeps track of the connectivity of signals and ports
  – Maintains the sensitivity list
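The scheduler's three-queue bookkeeping can be sketched as below. This is an illustrative model under the assumption that modules move wait → ready → run; the class and method names are made up, not the paper's API.

```python
# Sketch of the scheduler: modules wait for inputs, become ready when
# their signals arrive, and enter the run queue when dispatched to a tile.
from collections import deque

class Scheduler:
    def __init__(self):
        self.wait = deque()    # modules whose inputs are not yet available
        self.ready = deque()   # modules ready to execute
        self.run = deque()     # modules currently running on core tiles

    def submit(self, module_id):
        self.wait.append(module_id)

    def signal_arrived(self, module_id):
        # All inputs available: promote from wait to ready.
        if module_id in self.wait:
            self.wait.remove(module_id)
            self.ready.append(module_id)

    def dispatch(self):
        # Called when a core tile requests a new module.
        if self.ready:
            module_id = self.ready.popleft()
            self.run.append(module_id)
            return module_id
        return None

    def finished(self, module_id):
        self.run.remove(module_id)

sched = Scheduler()
sched.submit("partition0")
sched.signal_arrived("partition0")
assert sched.dispatch() == "partition0"   # now tracked in the run queue
```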
Module-Level Prefetching

• Hides the overhead of dynamic scheduling
• Prefetches the next module while the current module is running

[Figure: message sequence among the uCPU, prefetcher, scheduler, interconnect directory, signal storage, and another node — the prefetcher's memory accesses overlap with the uCPU executing a module.]
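The benefit of overlapping the fetch with execution can be shown with simple cycle arithmetic. The timings below are made-up illustrative numbers, not measurements from the paper.

```python
# Back-of-the-envelope model of module-level prefetching: with double
# buffering, only the first fetch is exposed; every later fetch overlaps
# with the execution of the current module (assuming fetch <= execute).
EXEC_TIME = 5     # cycles to execute one module (illustrative)
FETCH_TIME = 3    # cycles to fetch a module into the SPM (illustrative)

def total_cycles(n_modules, prefetch):
    if not prefetch:
        # Fetch and execution serialize for every module.
        return n_modules * (FETCH_TIME + EXEC_TIME)
    # Only the first fetch is on the critical path.
    return FETCH_TIME + n_modules * EXEC_TIME

print(total_cycles(10, prefetch=False))   # → 80
print(total_cycles(10, prefetch=True))    # → 53
```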
Illustrative Example

[Figure: three core tiles (each with a uCPU, prefetcher, output signal queue, and message handler) interact with the execution engine's interconnect directory, signal storage, and scheduler (wait, ready, and run queues) as the Partition and Collection modules of the quick-sort example are scheduled.]
Contents

• Introduction
• Execution Model
• Hardware Architecture
• Evaluation
• Conclusion
Benchmark

• Recognition, Mining, and Synthesis (RMS) benchmarks
• Fine-grained parallelism: dominated by short tasks
  – Small memory footprint
  – High run-time scheduling overhead
• Task-level parallelism: exhibits dependencies
  – Hard to implement on a GPGPU

| Benchmark                    | Min  | Max   | Average |
|------------------------------|------|-------|---------|
| Forward Solve (FS)           | 26   | 646   | 336.00  |
| Backward Solve (BS)          | 42   | 569   | 305.50  |
| Cholesky Factorization (CF)  | 151  | 11800 | 789.35  |
| Canny Edge Detection (CED)   | 330  | 5011  | 669.68  |
| Binomial Tree (BT)           | 117  | 4506  | 462.71  |
| Octree Partitioning (OP)     | 1441 | 6679  | 2678.70 |
| Quick Sort (QS)              | 88   | 47027 | 683.70  |
Simulator

• In-house cycle-level simulator
• Parameters:

| Parameter            | Value                                                    |
|----------------------|----------------------------------------------------------|
| Number of core tiles | 32                                                       |
| Memory access time   | 1 cycle (scratch-pad memory), 100 cycles (device memory) |
| Memory size          | 8 KB scratch-pad memory, 32 MB device memory             |
| Communication delay  | 4 cycles per hop                                         |
Utilization

[Figure: core utilization (0 to 1.0) for FS, BS, CF, CED, BT, OP, and QS, with and without prefetching.]
Scalability

[Figure: core utilization (0 to 1.0) and execution time (cycles) versus the number of core tiles (24 to 56), plotted as Util (1), Util (3), Execution time (1), and Execution time (3).]
Conclusion

• This paper proposes a novel MPPA architecture that employs an event-driven execution model
  – Handles dependencies through dynamic scheduling
  – Hides the dynamic scheduling overhead through module-level prefetching
• Future work
  – Support applications that require a larger memory footprint
  – Adjust the number of execution engines dynamically
  – Support inter-module debugging
Questions?

Contact info:
Junghee Lee
[email protected]
Electrical and Computer Engineering
Georgia Institute of Technology

Thank you!