LLNL Summer School 07/08/2014
What is OCR?
TG Team (presenters: Romain Cledat & Bala Seshasayee)
July 8, 2014
https://xstack.exascale-tech.com/wiki/
This research was, in part, funded by the U.S. Government, DOE and DARPA. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either
expressed or implied, of the U.S. Government.

OCR
• OCR
  – Open Community Runtime
  – Developed collaboratively with partners (mainly Rice University and Reservoir Labs)
• The term ‘OCR’ is used to refer to
  – A programming model
  – A user-level API
  – A runtime framework
  – One of several reference runtime implementations
• In this talk
  – Presentation of the programming model
  – Presentation of the API and implementations through demos

TG X-Stack project goals
• Design a software stack to meet Exascale goals
  – Target a strawman architecture
  – Provide a programming model, API, reference implementation and tools
• Concerns
  – Extreme hardware parallelism
  – Data locality
  – Fine-grained resource management
  – Resiliency
  – Power and energy, not just performance
  – Platform independence

Dataflow programming model
[Figure: Fibonacci data-flow graph with nodes mainEdt, fibIterEdt (three instances), sumEdt and doneEdt; edges are labeled N, N-1, N-2, Fib(N-1), Fib(N-2) and Fib(N).]
[Figure: target architecture made of groups of PEs, each with a Service Core and a 1MB L2, connected through an interconnect to a shared LLC.]
Runtime maps the constructed data-flow graph to architecture
Legend:
• EDT: A non-blocking unit of work. Runnable once all pre-slots are satisfied.
• Datablock: Data shared between EDTs
• Creation link: Source EDT creates destination
• Event/Data link: Source EDT provides data to the destination
• Both creation and event/data link
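The Fibonacci graph on this slide can be imitated with a few lines of Python. This is only a toy sketch: the `Edt` class, `satisfy`, `fib_iter` and `run_fib` are hypothetical names made up here and are not the OCR API; only the EDT names (mainEdt, fibIterEdt, sumEdt, doneEdt) come from the figure. It assumes a single-threaded ready queue in place of a real scheduler.

```python
from collections import deque

ready = deque()   # EDTs whose pre-slots are all satisfied
executed = []     # names of EDTs, in the order the toy scheduler ran them

class Edt:
    """Toy EDT: becomes runnable once every pre-slot has been satisfied."""
    def __init__(self, name, func, n_slots):
        self.name, self.func = name, func
        self.slots = [None] * n_slots
        self.pending = n_slots
        if n_slots == 0:          # nothing to wait for
            ready.append(self)

def satisfy(edt, slot, value):
    """Provide data on one pre-slot; schedule the EDT when none are pending."""
    edt.slots[slot] = value
    edt.pending -= 1
    if edt.pending == 0:
        ready.append(edt)

def fib_iter(n, out_edt, out_slot):
    """Create a fibIterEdt that will deliver Fib(n) to out_edt's pre-slot."""
    def body():
        if n < 2:
            satisfy(out_edt, out_slot, n)
            return
        # sumEdt waits on two pre-slots: Fib(n-1) and Fib(n-2).
        sum_edt = Edt("sumEdt", lambda a, b: satisfy(out_edt, out_slot, a + b), 2)
        fib_iter(n - 1, sum_edt, 0)
        fib_iter(n - 2, sum_edt, 1)
    return Edt("fibIterEdt", body, 0)

def run_fib(n):
    result = []
    done_edt = Edt("doneEdt", result.append, 1)
    Edt("mainEdt", lambda: fib_iter(n, done_edt, 0), 0)
    while ready:                  # drain the ready queue until the graph is done
        edt = ready.popleft()
        executed.append(edt.name)
        edt.func(*edt.slots)
    return result[0]

fib_10 = run_fib(10)
```

No task ever blocks: each fibIterEdt only creates further tasks and returns, and the sums happen when the data arrives.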

OCR level of abstraction
TBB user-friendly API
    void ParallelAverage( float* output, const float* input, size_t n ) {
        Average avg;
        avg.input = input;
        avg.output = output;
        parallel_for( blocked_range<int>( 1, n ), avg );
    }
hides…
    if( !range.empty() ) {
        start_for& a = *new(task::allocate_root()) start_for(range,body,partitioner);
        task::spawn_root_and_wait(a);
    }
hides…
    void generic_scheduler::local_spawn_root_and_wait( task& first, task*& next ) {
        internal::reference_count n = 0;
        for( task* t=&first; ; t=t->prefix().next ) {
            ++n;
            t->prefix().parent = &dummy;
            if( &t->prefix().next==&next ) break;
        }
        dummy.prefix().ref_count = n+1;
        if( n>1 )
            local_spawn( *first.prefix().next, next );
        local_wait_for_all( dummy, &first );
    }
hides…
OCR’s level of abstraction is at the very bottom

OCR concepts
• Common
  – All objects are globally and uniquely identifiable, and relocatable
• Computation
  – Event Driven Task (EDT)
  – Does not perform synchronization
  – Distinct from the notion of thread or core
• Data
  – Data-block (DB)
  – Relocatable consecutive chunk of data
• Synchronization, links
  – Events
  – Runtime-visible
• Slots
  – Positional end-points for dependences

Outline
• Simplest OCR concepts: EDTs, datablocks
• Example: producer/consumer
• Clarifying concepts: what EDTs can/can’t do, what DBs are/aren’t
• Example: simple synchronization
• More concepts: events, slots
• Example: complex synchronization
• More concepts: latch events

OCR concepts: 3 core building blocks
• Event Driven Task (EDT)
• N pre-slots (known at creation time)
• Available states on a slot:
  – Connected (attached to another slot) or unconnected
  – Satisfied or unsatisfied
[Figure: an EDT with pre-slots numbered 0 to N]
• A globally visible name space of data blocks
  – Explicitly created
  – EDTs can only access data either created by them or passed through their pre-slots
[Figure: data blocks]
• EDT1 creates EDT2
• EDT1 provides data on EDT2’s pre-slot
  – Possibly through an indirection chain
[Figure: EDT1 linked to EDT2]

OCR execution model for EDTs
• EDTs
  – 0..N in/out pre-slots
    • Slots are initially “unconnected” and “unsatisfied”
    • At creation time, the number of incoming slots must be known
  – An EDT executes after all pre-slots are “satisfied”
    • Satisfaction of pre-slots can happen in any order
  – An EDT can access memory:
    • Data-blocks:
      – passed in through one of its in/out slots (the EDT gets a C pointer)
      – created by the EDT
    • Stack and ephemeral heap (local)
    • NO global memory
  – An EDT, during its execution, can at any time:
    • Write to any accessible data-blocks
    • Manipulate the dependence graph for future (not yet runnable) EDTs
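The rule that an EDT executes only after all pre-slots are satisfied, in whatever order that happens, can be sketched as follows. This is a toy model, not the OCR API: the `Edt` class and its `satisfy` method are assumptions made for illustration, and the body runs immediately on the last satisfy instead of being handed to a scheduler.

```python
class Edt:
    """Toy EDT: a function plus a fixed number of pre-slots."""
    def __init__(self, func, n_slots):
        self.func = func
        self.slots = [None] * n_slots        # data seen through each pre-slot
        self.satisfied = [False] * n_slots   # all slots start "unsatisfied"
        self.ran = False

    def satisfy(self, slot, data):
        if self.satisfied[slot]:
            raise ValueError("pre-slot %d is already satisfied" % slot)
        self.slots[slot] = data
        self.satisfied[slot] = True
        if all(self.satisfied):              # last slot filled: EDT is runnable
            self.ran = True
            self.func(*self.slots)

results = []
edt = Edt(lambda a, b, c: results.append(a + b + c), 3)
edt.satisfy(2, 30)        # slots may be satisfied in any order
edt.satisfy(0, 1)
ran_early = edt.ran       # still False: slot 1 is unsatisfied
edt.satisfy(1, 200)       # last pre-slot: the EDT body runs now
```

The number of slots is fixed in the constructor, mirroring the requirement that the number of incoming slots is known at creation time.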

Example 1: Producer/Consumer
• Dynamic dependence construction
• Producer and consumer never know about each other
• Focus on minimum needed for placement and scheduling
[Figure: Concept vs. OCR views of Producer EDT, Data and Consumer EDT. Legend: creation link; event/data link; both creation & event.]

Example 2: Simple synchronization
• Control dependence is no different than a data dependence
[Figure: Concept vs. OCR views of Step 1 EDT, Step 2-a EDT and Step 2-b EDT. In the OCR view the links carry no data (Ø).]

OCR execution model for runtime EDTs
• Runtime EDTs
  – Created by the runtime to handle more complex synchronization situations
  – 0..N pre-slots
    • Slots are initially “unconnected” and “unsatisfied”
  – Runtime EDTs have a “trigger” rule that determines when they “satisfy” their outgoing edges and what gets propagated
• Latch runtime EDT (multi-party synchronization)
  – 2 pre-slots; “waiting-on” count and current count
  – When: satisfy outgoing edges when the number of satisfies on both pre-slots matches (similar to reference count in TBB)
  – What: NULL (incoming data-blocks are ignored)
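The latch trigger rule (fire when the satisfy counts on the two pre-slots match, propagate NULL) can be illustrated with a toy class. Everything here is an assumption for illustration: `Latch`, the `INCR`/`DECR` slot names and the `listeners` list are made up and are not taken from the OCR API.

```python
class Latch:
    """Toy latch: fires once the satisfy counts on its two pre-slots match."""
    INCR, DECR = 0, 1             # slot names are assumptions for this sketch

    def __init__(self):
        self.counts = [0, 0]      # satisfies seen on each pre-slot
        self.fired = False
        self.listeners = []       # callables attached to the outgoing edge

    def satisfy(self, slot, data=None):
        if self.fired:
            return
        self.counts[slot] += 1    # incoming data is ignored, only counted
        if self.counts[self.INCR] == self.counts[self.DECR]:
            self.fired = True
            for listener in self.listeners:
                listener(None)    # "What": NULL is propagated

log = []
latch = Latch()
latch.listeners.append(lambda data: log.append(("wrapup", data)))

for _ in range(3):                # three parties check in
    latch.satisfy(Latch.INCR)
for worker in range(3):           # each party checks out when done
    log.append(("worker", worker))
    latch.satisfy(Latch.DECR)
```

In this sketch all parties check in before any checks out; if check-ins and check-outs interleave, the counts can match early, which is why the "waiting-on" count has to be established first.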

Example 3: In place parallel update
[Figure: Concept vs. OCR views of Setup EDT, Parallel_1 EDT, Parallel_2 EDT, Wrapup EDT and their Data. The OCR view adds a Sync runtime EDT (Sync REDT); the links into and out of it carry no data (Ø).]

OCR ecosystem
[Figure: layered OCR ecosystem. Labels recoverable from the figure:]
• Programming platforms: C, Array DSL, CnC, Hero Code, HC, HTA, PIL
• CnC Translator, HC Compiler, R-Stream
• OCR API + Tuning Annotations
• Open Community Runtime
• OCR implementations: OCR targeting TG, OCR targeting x86
• Low-level compilers: LLVM, GCC
• Evaluation platforms: FSim - TG Architecture, x86, Cluster

Take-aways
• OCR API is at the “assembly” level; other tools are meant to sit between it and programmers
• Few simple concepts, multiple ways to use them
  – Interested in determining “best” use
• Dependence graph built on the fly:
  – Complicates the writing of the program
  – Scalable approach

Preliminary results
• On some code, OCR matches or beats OpenMP
• Simple scheduler, no data-blocks (very preliminary but promising)

Areas of investigation
• Development of a specification:
  – Memory model
• Tuning hints and annotations
• More expressive support for collectives
Case Study: FFT in OCR

Background
• Final-year undergraduate project at Oregon State University
• OCR implementation of Fast Fourier Transform
  – Cooley-Tukey algorithm
  – Evolution from serial version
  – OCR behavior

Algorithm
• Divide-and-conquer
• Data-flow friendly
[Figure. Source: Wikimedia Commons]

Versions
• (1) Serial implementation: 1 EDT running the entire program
• (2) Naive parallelization: division of the DFT is carried out by EDTs recursively; combination of outputs is done by 1 EDT for each step of recursion
• (3) Bounded parallelization: both stages of the butterfly are parallelized, but only up to a user-specified block size (to minimize scheduling overhead)
• (4) Bounded parallelization with datablocks: previous implementations operated on a single datablock; this one uses 3 datablocks (input, output real terms, output imaginary terms)
• Scope for better parallelism
  – Finer datablocks
  – Staggered creation of EDTs in the combination phase
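The idea behind bounded parallelization, splitting the work into tasks only while a sub-problem is larger than a block size, can be sketched on a recursive radix-2 Cooley-Tukey FFT. This is not the project's code: `bounded_fft`, `serial_fft`, `combine` and the `stats` counter are hypothetical names, plain function calls stand in for EDTs, and the count tracks only the divide step (the real versions also create EDTs for the combination phase).

```python
import cmath

def combine(even, odd):
    """Butterfly step: merge the DFTs of the even and odd samples."""
    half = len(even)
    n = 2 * half
    out = [0j] * n
    for k in range(half):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + half] = even[k] - t
    return out

def serial_fft(x):
    """Plain recursive radix-2 FFT; len(x) must be a power of two."""
    if len(x) == 1:
        return list(x)
    return combine(serial_fft(x[0::2]), serial_fft(x[1::2]))

def bounded_fft(x, block_size, stats):
    """Split into tasks only while the input is larger than block_size."""
    stats["tasks"] += 1               # one would-be task per call
    if len(x) <= block_size:
        return serial_fft(x)          # small enough: finish serially
    even = bounded_fft(x[0::2], block_size, stats)
    odd = bounded_fft(x[1::2], block_size, stats)
    return combine(even, odd)

signal = [1, 0, 0, 0, 0, 0, 0, 0]     # unit impulse: every DFT bin is 1
fine = {"tasks": 0}
coarse = {"tasks": 0}
spectrum_fine = bounded_fft(signal, 1, fine)      # recurse down to single samples
spectrum_coarse = bounded_fft(signal, 4, coarse)  # stop splitting at blocks of 4
```

Both runs produce the same spectrum, but the block size of 4 needs 3 tasks where the block size of 1 needs 15, which is the trade-off between task count and task size.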

Behavior
Version                          No. of EDTs   Mean EDT longevity (us)   Load variance across cores (%)   Running time (s)
Serial                           2             1673420                   70.7                             3.36
Naïve parallel                   12582913      253                       5.1                              877.0
Bounded parallel                 1793          1982                      2.7                              0.46
Bounded parallel w/ datablocks   1793          1946                      2.9                              0.45
• OCR x86 running FFT on a 2^32-sized dataset
  – 2.9GHz Xeon, 16 cores; 8 cores made available to OCR
• Balance to be achieved between number and size of EDTs
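The running times in the table can be turned into speedups over the serial version with a few lines of arithmetic. The sketch below uses only the table's numbers; the efficiency figure additionally assumes that the serial version effectively used one core and the parallel versions all 8, which the slide does not state.

```python
CORES = 8                         # cores made available to OCR
serial_time = 3.36                # seconds, from the table
parallel_times = {                # seconds, from the table
    "Naive parallel": 877.0,
    "Bounded parallel": 0.46,
    "Bounded parallel w/ datablocks": 0.45,
}

# Speedup is serial time divided by parallel time.
speedup = {name: serial_time / t for name, t in parallel_times.items()}
# Efficiency is speedup per core (assumes 1 core serial, 8 cores parallel).
efficiency = {name: s / CORES for name, s in speedup.items()}

for name in parallel_times:
    print("%-32s speedup %8.3f  efficiency %6.3f"
          % (name, speedup[name], efficiency[name]))
```

The bounded versions come out a little above 7x faster than serial, while the naive version, with its millions of very short EDTs, is far slower than serial.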

Strawman architecture
• Heterogeneous
• Hierarchical architecture
• Tapered memory bandwidth
• Global, shared address space
• Software managed non-coherent memories
• Functional simulator available
[Figure: strawman architecture hierarchy. Labels recoverable from the figure:]
• Execution Engine (XE): application specific; DP FP FMAC, 32KB I$, 64KB SP, RF?
• Control Engine (CE): system SW; GP Int, 32KB I$, 64KB SP, RF?
• Block (8 XE + CE): 1MB shared L2
• Cluster (16 Blocks): 8MB Shared LLC, interconnect
• Processor Chip (16 Clusters): 64MB Shared LLC, interconnect
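The hierarchy in the figure implies some simple per-chip totals. The sketch below multiplies out the counts printed on the slide; attributing the 8MB LLC to each cluster and the 64MB LLC to the chip is a reading of the figure layout, not something the slide states in words.

```python
XE_PER_BLOCK = 8          # "Block (8 XE + CE)"
CE_PER_BLOCK = 1
BLOCKS_PER_CLUSTER = 16   # "Cluster (16 Blocks)"
CLUSTERS_PER_CHIP = 16    # "Processor Chip (16 Clusters)"

L2_PER_BLOCK_MB = 1       # "1MB shared L2"
LLC_PER_CLUSTER_MB = 8    # "8MB Shared LLC" (assumed to be per cluster)
LLC_PER_CHIP_MB = 64      # "64MB Shared LLC" (assumed to be per chip)

blocks_per_chip = BLOCKS_PER_CLUSTER * CLUSTERS_PER_CHIP
xe_per_chip = XE_PER_BLOCK * blocks_per_chip
ce_per_chip = CE_PER_BLOCK * blocks_per_chip

# Total shared memory at each level of the hierarchy, per chip, in MB.
shared_memory_mb = {
    "block L2": L2_PER_BLOCK_MB * blocks_per_chip,
    "cluster LLC": LLC_PER_CLUSTER_MB * CLUSTERS_PER_CHIP,
    "chip LLC": LLC_PER_CHIP_MB,
}
```

Under these assumptions one chip holds 256 blocks, 2048 XEs and 256 CEs.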

OCR vs other solutions
                          CnC             MPI                        OCR                  OpenMP          TBB
Execution model           Tasks           Bulk Sync                  Fine-grained tasks   Bulk Sync       Tasks
Memory model              Shared memory   Explicit message passing   Explicit; global     Shared memory   Shared memory
Separation of concerns?   Yes             No                         Yes                  No              Yes (but can dig deeper)

OCR concepts: building blocks
• EDT (pre-slots 0..N)
  – N pre-slots (N known at creation time)
  – Optional attached “completion event”
• Data
  – No pre-slots
  – Post-slot always “satisfied”
• Evt (pre-slots 0..N)
  – N pre-slots (N fixed by type of event, NOT determined by user)
  – Post-slot initially “unsatisfied”
• Slot is:
  – Connected (attached to another slot) or unconnected
  – Satisfied (user-triggered or runtime-triggered) or unsatisfied
[Figure legend: pre-slots; post-slots (multiple connections)]

OCR concepts: add dependence
[Figure: add dependence takes two arguments. Argument 1 is Data OR an Evt; Argument 2 is an EDT OR an Evt. The result (=>) is a connected pair, for example an Evt connected to a pre-slot of an EDT: 1 of 4 possible combinations.]

OCR concepts: satisfy
[Figure: satisfy takes two arguments. Argument 1 is an EDT OR an Evt; Argument 2 is Data OR NULL. The result (=>) is a satisfied/triggered slot, for example an EDT pre-slot satisfied with Data: 1 of 4 possible combinations.]
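How add dependence and satisfy interact can be sketched with three toy classes. All names (`DataBlock`, `Event`, `Edt`, `add_dependence`, `satisfy`) and their argument order are assumptions for illustration and are not the OCR API signatures; the sketch also runs an EDT immediately when its last pre-slot is satisfied instead of scheduling it.

```python
class DataBlock:
    """Toy data-block: no pre-slots, post-slot always satisfied."""
    def __init__(self, payload):
        self.payload = payload

class Event:
    """Toy event: post-slot starts unsatisfied and may feed many pre-slots."""
    def __init__(self):
        self.satisfied = False
        self.data = None
        self.links = []           # (destination, slot) pairs on the post-slot

class Edt:
    """Toy EDT: runs once all of its pre-slots are satisfied."""
    def __init__(self, func, n_slots):
        self.func = func
        self.slots = [None] * n_slots
        self.pending = n_slots

def satisfy(dest, data, slot=0):
    """Argument 1 is an EDT or an event; argument 2 is a data-block or None."""
    if isinstance(dest, Event):
        dest.satisfied, dest.data = True, data
        for linked, linked_slot in dest.links:
            satisfy(linked, data, linked_slot)
    else:
        dest.slots[slot] = data
        dest.pending -= 1
        if dest.pending == 0:
            dest.func(*dest.slots)

def add_dependence(source, dest, slot=0):
    """Argument 1 is a data-block or an event; argument 2 is an event or an EDT."""
    if isinstance(source, DataBlock):
        satisfy(dest, source, slot)        # a data-block is always available
    elif source.satisfied:
        satisfy(dest, source.data, slot)   # event already triggered
    else:
        source.links.append((dest, slot))  # connected, still unsatisfied

seen = []
edt = Edt(lambda db, signal: seen.append((db.payload, signal)), 2)
evt = Event()
add_dependence(DataBlock([1, 2, 3]), edt, 0)   # data-block -> EDT
add_dependence(evt, edt, 1)                    # event -> EDT
before = list(seen)                            # EDT has not run yet
satisfy(evt, None)                             # NULL satisfy: pure control
```

Connecting a data-block satisfies the destination slot right away, whereas connecting an event only links the slots; the EDT runs when the event is later satisfied, here with NULL.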

Example 1: Producer/Consumer
• Dynamic dependence construction
• Producer and consumer never know about each other
• Focus on minimum needed for placement and scheduling
[Figure: Concept vs. OCR views of Producer EDT, Data and Consumer EDT. The OCR view adds an event (Evt). Annotated calls: (*) addDep, (1) dbCreate, (2) edit Data, (3) satisfy. Legend: who executes call; data dependence; control dependence.]
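The producer/consumer steps annotated in the figure ((*) addDep, (1) dbCreate, (2) edit Data, (3) satisfy) can be walked through in a toy model. The `Event` class and both EDT functions are hypothetical stand-ins, a Python list plays the role of the data-block, and the consumer runs inside `satisfy` here, whereas a real runtime would schedule it separately.

```python
class Event:
    """Toy once-event: remembers waiters and forwards the data that satisfies it."""
    def __init__(self):
        self.waiters = []

    def add_dep(self, callback):       # (*) addDep: connect a consumer
        self.waiters.append(callback)

    def satisfy(self, data):           # (3) satisfy: trigger the consumers
        for callback in self.waiters:
            callback(data)

trace = []

def consumer_edt(db):
    trace.append(("consumer read", list(db)))

def producer_edt(evt):
    db = [0] * 4                       # (1) dbCreate: a list stands in for a data-block
    for i in range(len(db)):           # (2) edit Data
        db[i] = i * i
    trace.append(("producer wrote", list(db)))
    evt.satisfy(db)                    # (3) satisfy: hand the data-block to the event

evt = Event()
evt.add_dep(consumer_edt)              # (*) set up before the producer runs
producer_edt(evt)
```

The producer only knows the event and the consumer only receives a data-block, so neither refers to the other.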

Example 2: Simple synchronization
• Control dependence is no different than a data dependence
[Figure: Concept vs. OCR views of Step 1 EDT, Step 2-a EDT and Step 2-b EDT. The OCR view adds an event (Evt). Annotated calls: (*) addDep, (1) satisfy with NULL.]

Example 3: In place parallel update
[Figure: Concept vs. OCR views of Setup EDT, Parallel_1 EDT, Parallel_2 EDT, Wrapup EDT and their Data. The OCR view adds a Finish EDT. Annotated calls: (1) dbCreate, (1) edtCreate (three times), (2) addDep (twice), (3) edtCreate (twice), (4) addDep.]

Example 4: Single assignment update
[Figure: Concept vs. OCR views of Setup EDT, Parallel_1 EDT, Parallel_2 EDT, Wrapup EDT and Data. The views show data-blocks Data1 and Data2, and events Evt1 and Evt2. Annotated calls: (1) dbCreate, (1) edtCreate (three times), (1) evtCreate, (2) addDep, (3) addDep, (4) dbCreate (twice), (5) satisfy (twice).]