Exascale Programming Models Lecture Series 06/12/2014

What is OCR?

TG Team (presenter: Romain Cledat)
June 12, 2014

https://xstackwiki.modelado.org/Traleika_Glacier/

This research was, in part, funded by the U.S. Government, DOE and DARPA. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government.

OCR

• OCR
  – Open Community Runtime
  – Developed collaboratively with partners (mainly Rice University and Reservoir Labs)
• The term ‘OCR’ is used to refer to way too many concepts
  – A programming model
  – A user-level API
  – A runtime framework
  – One of a multitude of reference runtime implementations

Reviewer note (Seshasayee, Bala): Might want to de-emphasize this. We want ‘OCR’ to mean just the programming model (which we'll cover along with *our* runtime framework), and to steer away from the other usages.

TG X-Stack project goals

• Design a software stack to meet Exascale goals
  – Target a strawman architecture
  – Provide a programming model, API, reference implementation and tools
• Concerns
  – Extreme hardware parallelism
  – Data locality
  – Fine grained resource management
  – Resiliency
  – Power and energy and not just performance
  – Platform independence

Dataflow programming model

[Figure: a data-flow graph for a Fibonacci computation. mainEdt starts fibIterEdt for N, which spawns fibIterEdt for N-1 and N-2; their results feed sumEdt, followed by finishEdt. Legend: EDT, Datablock, Create, Event. Caption: "Runtime maps the constructed data-flow graph to architecture". The architecture is drawn as blocks of 8 PEs plus a Service Core with 1MB L2, grouped under a Shared LLC and joined by an Interconnect.]

Reviewer note (Seshasayee, Bala): Emphasize the "separation of concerns" in comments, which will smoothly transition to the next slide...
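The Fibonacci graph in the figure can be mimicked with a small, single-threaded toy. This is only a sketch of the data-flow idea, not OCR code: `ToyTask`, `spawn`, `satisfy`, and `run_fib` are hypothetical names, and a plain queue stands in for the runtime's scheduler. Each task becomes runnable only when all of its input slots have been filled, and tasks extend the graph while they run.

```python
from collections import deque


class ToyTask:
    """Hypothetical stand-in for an EDT: a function plus N input slots."""

    def __init__(self, func, num_slots):
        self.func = func
        self.args = [None] * num_slots
        self.missing = num_slots


def run_fib(n):
    ready = deque()   # tasks whose input slots are all satisfied
    result = []

    def spawn(func, num_slots):
        task = ToyTask(func, num_slots)
        if num_slots == 0:
            ready.append(task)   # nothing to wait for
        return task

    def satisfy(task, slot, value):
        task.args[slot] = value
        task.missing -= 1
        if task.missing == 0:
            ready.append(task)

    def fib_iter(k, out_task, out_slot):
        def body():
            if k < 2:
                satisfy(out_task, out_slot, k)
                return
            # Join task (the "sumEdt" role): adds two inputs, forwards the sum.
            sum_task = spawn(lambda a, b: satisfy(out_task, out_slot, a + b), 2)
            spawn(fib_iter(k - 1, sum_task, 0), 0)
            spawn(fib_iter(k - 2, sum_task, 1), 0)
        return body

    finish = spawn(result.append, 1)      # the "finishEdt" role
    spawn(fib_iter(n, finish, 0), 0)      # the "mainEdt" role starts the graph

    while ready:                          # trivial stand-in for the scheduler
        task = ready.popleft()
        task.func(*task.args)
    return result[0]
```

Calling `run_fib(10)` walks the dynamically built graph to completion. A real runtime would be free to place the ready tasks on different cores, which is the separation of concerns the slide points at.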

OCR level of abstraction

TBB user-friendly API

    void ParallelAverage( float* output, const float* input, size_t n ) {
        Average avg;
        avg.input = input;
        avg.output = output;
        parallel_for( blocked_range<int>( 1, n ), avg );
    }

hides…

    if(!range.empty()) {
        start_for& a = *new(task::allocate_root()) start_for(range,body,partitioner);
        task::spawn_root_and_wait(a);
    }

hides…

    void generic_scheduler::local_spawn_root_and_wait( task& first, task*& next ) {
        internal::reference_count n = 0;
        for( task* t=&first; ; t=t->prefix().next ) {
            ++n;
            t->prefix().parent = &dummy;
            if( &t->prefix().next==&next ) break;
        }
        dummy.prefix().ref_count = n+1;
        if( n>1 )
            local_spawn( *first.prefix().next, next );
        local_wait_for_all( dummy, &first );
    }

hides…

OCR’s level of abstraction is at the very bottom

OCR concepts

• Common
  – All objects are globally and uniquely identifiable, and relocatable
• Computation
  – Event Driven Task (EDT)
  – Does not perform synchronization
  – Distinct from the notion of thread or core
• Data
  – Data-block (DB)
  – Relocatable contiguous chunk of data
• Synchronization, links
  – Events
  – Runtime-visible
• Slots
  – Positional end-points for dependences

OCR concepts: building blocks

• EDT
  – N pre slots (N known at creation time)
  – Optional attached “completion event”
• Data (data-block)
  – No pre slots
  – Post slot always “satisfied”
• Evt (event)
  – N pre slots (N fixed by the type of event, NOT determined by the user)
  – Post slot initially “unsatisfied”
• Slot is:
  – Connected (attached to another slot) or unconnected
  – Satisfied (user-triggered or runtime-triggered) or unsatisfied

[Figure: an EDT and an event (Evt), each drawn with pre slots numbered 0 to N and a post slot, next to a data-block (Data). Labels: "Pre slots", "Post slots (multiple connections)".]

Reviewer note (Seshasayee, Bala): This slide needs to be redone slightly to reduce confusion. The next 2 slides need to be redone heavily as they're difficult to grasp.

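OCR concepts: add dependence

[Figure: add dependence takes two arguments. Argument 1 is a data-block (Data) OR an event (Evt, slots 0..N). Argument 2 is an EDT (slots 0..N) OR an event (Evt, slots 0..N). Result: the two are connected; the example drawn connects an Evt to a slot of an EDT. This is 1 of 4 possible combinations.]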

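OCR concepts: satisfy

[Figure: satisfy takes two arguments. Argument 1 is an EDT (slots 0..N) OR an event (Evt, slots 0..N). Argument 2 is a data-block (Data) OR NULL. Result: the targeted slot is satisfied/triggered; the example drawn satisfies a slot of an EDT with Data. This is 1 of 4 possible combinations.]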

OCR execution model for EDTs

• EDTs
  – 0..N in/out pre-slots
    • Slots are initially “unconnected” and “unsatisfied”
    • At creation time, the number of incoming slots must be known
  – An EDT executes after all pre slots are “satisfied”
    • Satisfaction of pre slots can happen in any order
  – An EDT can access memory:
    • Data-blocks:
      – passed in through one of its in/out slots (the EDT gets a C pointer)
      – created by the EDT
    • Stack and ephemeral heap (local)
    • NO global memory
  – An EDT, during its execution, can at any time:
    • Write to any accessible data-blocks
    • Manipulate the dependence graph for future (not yet runnable) EDTs by adding dependences, satisfying events, etc.
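The firing rule for EDTs (run only once every pre slot is satisfied, in whatever order that happens) can be illustrated with a minimal sketch. `ToyEdt` and `run_with_order` are hypothetical names, not part of the OCR API, and the toy runs the task immediately on the last satisfy, whereas a real runtime would merely mark it runnable and schedule it later.

```python
class ToyEdt:
    """Toy EDT: the number of pre slots is fixed at creation time."""

    def __init__(self, func, num_pre_slots):
        self.func = func
        self.payloads = [None] * num_pre_slots     # what arrived on each slot
        self.satisfied = [False] * num_pre_slots   # slots start "unsatisfied"
        self.ran = False
        self.result = None

    def satisfy(self, slot, payload=None):
        if self.satisfied[slot]:
            raise ValueError(f"pre slot {slot} is already satisfied")
        self.payloads[slot] = payload
        self.satisfied[slot] = True
        if all(self.satisfied):
            # Simplification: execute right away instead of scheduling.
            self.ran = True
            self.result = self.func(*self.payloads)


def run_with_order(order):
    """Satisfy the three slots in the given order; report result and early runs."""
    edt = ToyEdt(lambda a, b, c: a + b * c, num_pre_slots=3)
    values = {0: 2, 1: 3, 2: 4}
    ran_early = False
    for slot in order:
        ran_early = ran_early or edt.ran   # must still be waiting here
        edt.satisfy(slot, values[slot])
    return edt.result, ran_early
```

Because payloads are bound to slot positions rather than to arrival order, `run_with_order([0, 1, 2])` and `run_with_order([2, 0, 1])` compute the same value.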

Example 1: Producer/Consumer

• Dynamic dependence construction
• Producer and consumer never know about each other
• Focus on minimum needed for placement and scheduling

[Figure: two panels. "Concept": ProducerEDT, Data, ConsumerEDT. "OCR": ProducerEDT, Data, Evt, ConsumerEDT, with the calls labeled (1) dbCreate, (2) edit Data, (3) satisfy, and (*) addDep. Legend: who executes call, data dependence, control dependence.]
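The producer/consumer steps in the figure can be sketched as follows. This is a toy under stated assumptions, not the OCR API: `ToyEdt`, `ToyEvent`, `add_dependence`, and `run_producer_consumer` are hypothetical names, a Python list stands in for a data-block, and the event forwards its payload as soon as both the dependence and the satisfaction exist, whichever comes first.

```python
class ToyEdt:
    """Toy EDT: runs func once all of its pre slots are satisfied."""

    def __init__(self, func, num_pre_slots):
        self.func = func
        self.payloads = [None] * num_pre_slots
        self.missing = num_pre_slots

    def satisfy(self, slot, payload=None):
        self.payloads[slot] = payload
        self.missing -= 1
        if self.missing == 0:
            self.func(*self.payloads)


class ToyEvent:
    """Toy pass-through event sitting between producer and consumer."""

    def __init__(self):
        self.waiters = []        # (edt, slot) pairs attached to the post slot
        self.satisfied = False
        self.payload = None

    def add_dependence(self, edt, slot):
        if self.satisfied:
            edt.satisfy(slot, self.payload)   # late connection: forward now
        else:
            self.waiters.append((edt, slot))

    def satisfy(self, payload=None):
        self.satisfied = True
        self.payload = payload
        for edt, slot in self.waiters:
            edt.satisfy(slot, payload)
        self.waiters = []


def run_producer_consumer(connect_first=True):
    received = []
    evt = ToyEvent()

    def producer():
        data = [0] * 4                 # (1) "dbCreate": allocate a data-block
        for i in range(4):
            data[i] = i * i            # (2) edit the data
        evt.satisfy(data)              # (3) satisfy the event with the data

    consumer = ToyEdt(lambda data: received.append(sum(data)), 1)

    if connect_first:
        evt.add_dependence(consumer, 0)   # (*) addDep before the producer runs
        producer()
    else:
        producer()
        evt.add_dependence(consumer, 0)   # (*) addDep after the producer ran
    return received
```

Neither function refers to the other; both only know the event, which is the sense in which producer and consumer "never know about each other".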

Example 2: Simple synchronization

• A control dependence is no different from a data dependence

[Figure: two panels. "Concept": Step 1 EDT, followed by Step 2-a EDT and Step 2-b EDT. "OCR": Step 1 EDT, Evt, Step 2-a EDT, Step 2-b EDT, with the calls labeled (1) satisfy (with NULL) and (*) addDep.]

OCR execution model for events

• Events
  – 0..N pre slots
    • Slots are initially “unconnected” and “unsatisfied”
  – Events have a “trigger” rule that determines when their post slot transitions to “satisfied” and what gets connected to it
    • Simple event (pass-through)
      – 1 pre slot
      – When: satisfy post slot on incoming slot satisfaction
      – What: whatever is on incoming slot (pass GUID)
    • Latch event (multi-party synchronization)
      – 2 pre slots; “waiting-on” count and current count
      – When: satisfy outgoing slot when number of satisfies on both pre slots matches (similar to reference count in TBB)
      – What: NULL (incoming data-blocks are ignored)
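The two trigger rules can be contrasted in a few lines. The sketch below is an assumption-laden toy rather than OCR's implementation: the class names and the `WAITING_ON` / `CURRENT` slot constants are hypothetical, and the latch compares its two counters only after a satisfy, so it cannot fire while both are still zero.

```python
class ToySimpleEvent:
    """Pass-through: post slot carries whatever satisfied the one pre slot."""

    def __init__(self):
        self.satisfied = False
        self.payload = None

    def satisfy(self, payload=None):
        self.satisfied = True
        self.payload = payload        # "What": forward the incoming payload


class ToyLatchEvent:
    """Two pre slots treated as counters; fires when the counts match."""

    WAITING_ON = 0   # hypothetical slot index: parties checking in
    CURRENT = 1      # hypothetical slot index: parties checking out

    def __init__(self):
        self.counts = [0, 0]
        self.satisfied = False
        self.payload = None           # "What": always NULL (None here)

    def satisfy(self, slot, payload=None):
        if self.satisfied:
            raise RuntimeError("latch has already fired")
        self.counts[slot] += 1        # the incoming payload is ignored
        if self.counts[0] == self.counts[1]:
            self.satisfied = True


def latch_demo(parties):
    """Check in all parties, then check them out one at a time."""
    latch = ToyLatchEvent()
    for _ in range(parties):
        latch.satisfy(ToyLatchEvent.WAITING_ON)
    fired_after = []
    for _ in range(parties):
        latch.satisfy(ToyLatchEvent.CURRENT, payload="ignored")
        fired_after.append(latch.satisfied)
    return fired_after, latch.payload
```

Note that with this rule all parties need to check in before any of them checks out; otherwise the counts would match early and the latch would fire before the last party arrived.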

Example 3: In place parallel update

[Figure: two panels. "Concept": SetupEDT, Data, Parallel_1EDT and Parallel_2EDT, Data, WrapupEDT. "OCR": SetupEDT, Data, Parallel_1EDT, Parallel_2EDT, FinishEDT, WrapupEDT, with the calls labeled (1) dbCreate, (1) edtCreate (twice), (2) addDep (twice), (3) edtCreate (twice), (4) addDep.]

Example 4: Single assignment update

[Figure: two panels. "Concept": SetupEDT, Data, Parallel_1EDT and Parallel_2EDT, Data1 and Data2, WrapupEDT. "OCR": SetupEDT, Data, Parallel_1EDT, Parallel_2EDT, Data1, Data2, Evt1, Evt2, WrapupEDT, with the calls labeled (1) dbCreate, (1) edtCreate (twice), (1) evtCreate, (2) addDep, (3) addDep, (4) dbCreate (twice), (5) satisfy (twice).]
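A single-assignment update along the lines of this example might look like the sketch below. It is a toy, not OCR code: `ToyEdt`, `ToyEvent`, and `run_single_assignment` are hypothetical, tuples stand in for data-blocks that are never modified after creation, and the event here only supports dependences added before it is satisfied. Each parallel task reads the shared input, creates a fresh output block, and satisfies its own event; the wrap-up task depends on both events.

```python
class ToyEdt:
    """Toy EDT: runs func once every pre slot has been satisfied."""

    def __init__(self, func, num_pre_slots):
        self.func = func
        self.payloads = [None] * num_pre_slots
        self.missing = num_pre_slots
        self.result = None

    def satisfy(self, slot, payload=None):
        self.payloads[slot] = payload
        self.missing -= 1
        if self.missing == 0:
            self.result = self.func(*self.payloads)


class ToyEvent:
    """Toy pass-through event (dependences must be added before satisfy)."""

    def __init__(self):
        self.targets = []

    def add_dependence(self, edt, slot):
        self.targets.append((edt, slot))

    def satisfy(self, payload=None):
        for edt, slot in self.targets:
            edt.satisfy(slot, payload)


def run_single_assignment():
    source = (1, 2, 3, 4)              # Setup's data-block, read-only afterwards
    evt1, evt2 = ToyEvent(), ToyEvent()

    wrapup = ToyEdt(lambda d1, d2: list(d1) + list(d2), 2)
    evt1.add_dependence(wrapup, 0)
    evt2.add_dependence(wrapup, 1)

    def make_parallel(lo, hi, evt):
        def body(data):
            fresh = tuple(x * 10 for x in data[lo:hi])   # new data-block
            evt.satisfy(fresh)                           # publish it
        return ToyEdt(body, 1)

    p1 = make_parallel(0, 2, evt1)
    p2 = make_parallel(2, 4, evt2)

    # Start the second task first to show that completion order is irrelevant.
    p2.satisfy(0, source)
    p1.satisfy(0, source)
    return source, wrapup.result
```

The input block is left untouched, so no ordering between the two parallel tasks is needed; the cost is one extra block and one extra event per task compared with updating in place.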

OCR ecosystem

[Figure: layered view of the OCR ecosystem.
 – Programming platforms: C, Array DSL; CnC; Hero Code; HC; HTA; PIL
 – Tools shown: CnC Translator, HC Compiler, R-Stream
 – OCR API + Tuning Annotations
 – Open Community Runtime
 – OCR implementations: OCR targeting TG, OCR targeting x86
 – Low-level compilers: LLVM, GCC
 – Platforms / evaluation platforms: FSim - TG Architecture, x86, Cluster]

Take-aways

• OCR API is at the “assembly” level; other tools are meant to sit between it and programmers
• Few simple concepts, multiple ways to use them
  – Interested in determining “best” use
• Dependence graph built on the fly:
  – Complicates the writing of the program
  – Scalable approach

Preliminary results

• On some code, OCR matches or bests OMP
• Simple scheduler, no data-blocks (very preliminary but promising)

Areas of investigation

• Development of a specification:
  – Memory model
• Tuning hints and annotations
• More expressive support for collectives

Backup

Strawman architecture

• Heterogeneous
• Hierarchical architecture
• Tapered memory bandwidth
• Global, shared address space
• Software managed non-coherent memories
• Functional simulator available

[Figure: strawman architecture hierarchy (slide marked “Intel Confidential / Internal Use Only”).
 – Execution Engine (XE): application specific; DP FP FMAC, 32KB I$, 64KB SP, RF?
 – Control Engine (CE): system SW; GP Int, 32KB I$, 64KB SP, RF?
 – Block (8 XE + CE): 1MB shared L2
 – Cluster (16 Blocks): 8MB Shared LLC, Interconnect
 – Processor Chip (16 Clusters): 64MB Shared LLC, Interconnect
 In the cluster and chip views each block is drawn as 8 PEs plus a Service Core with 1MB L2.]
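Multiplying out the hierarchy gives a feel for the scale of one chip. The figures below are read off the diagram labels (8 XE + 1 CE and 1MB L2 per block, 16 blocks and an 8MB LLC per cluster, 16 clusters and a 64MB LLC per chip); the constant and function names are made up for this sketch, and the totals are simple arithmetic rather than numbers stated on the slide.

```python
# Values taken from the diagram labels; names are illustrative only.
XE_PER_BLOCK = 8
CE_PER_BLOCK = 1
BLOCKS_PER_CLUSTER = 16
CLUSTERS_PER_CHIP = 16
L2_MB_PER_BLOCK = 1
LLC_MB_PER_CLUSTER = 8
LLC_MB_PER_CHIP = 64
SCRATCHPAD_KB_PER_ENGINE = 64   # "64KB SP" on both the XE and the CE


def chip_totals():
    """Aggregate per-chip counts implied by the hierarchy."""
    blocks = BLOCKS_PER_CLUSTER * CLUSTERS_PER_CHIP
    engines = blocks * (XE_PER_BLOCK + CE_PER_BLOCK)
    return {
        "blocks": blocks,
        "execution_engines": blocks * XE_PER_BLOCK,
        "control_engines": blocks * CE_PER_BLOCK,
        "l2_mb_total": blocks * L2_MB_PER_BLOCK,
        "cluster_llc_mb_total": CLUSTERS_PER_CHIP * LLC_MB_PER_CLUSTER,
        "chip_llc_mb": LLC_MB_PER_CHIP,
        "scratchpad_mb_total": engines * SCRATCHPAD_KB_PER_ENGINE / 1024,
    }
```

Under this reading a chip holds 256 blocks, so the execution engines number in the thousands while each level of shared memory is comparatively small, which fits the "tapered memory bandwidth" and locality concerns listed earlier.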

OCR vs other solutions

                         CnC            MPI                       OCR                 OpenMP         TBB
Execution model          Tasks          Bulk Sync                 Fine-grained tasks  Bulk Sync      Tasks
Memory model             Shared memory  Explicit message passing  Explicit; global    Shared memory  Shared memory
Separation of concerns?  Yes            No                        Yes                 No             Yes (but can dig deeper)

