Exascale Programming Models Lecture Series 06/12/2014
What is OCR?
TG Team (presenter: Romain Cledat), June 12, 2014
https://xstackwiki.modelado.org/Traleika_Glacier/
This research was, in part, funded by the U.S. Government, DOE and DARPA. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either
expressed or implied, of the U.S. Government.
OCR
• OCR
  – Open Community Runtime
  – Developed collaboratively with partners (mainly Rice University and Reservoir Labs)
• The term ‘OCR’ is used to refer to way too many concepts
  – A programming model
  – A user-level API
  – A runtime framework
  – One of a multitude of reference runtime implementations
TG X-Stack project goals
• Design a software stack to meet Exascale goals
  – Target a strawman architecture
  – Provide a programming model, API, reference implementation and tools
• Concerns
  – Extreme hardware parallelism
  – Data locality
  – Fine grained resource management
  – Resiliency
  – Power and energy and not just performance
  – Platform independence
Dataflow programming model
[Figure: a data-flow graph for a Fibonacci computation. mainEdt feeds fibIterEdt for N, which spawns fibIterEdt for N-1 and N-2; their results flow into sumEdt and then finishEdt. Legend: EDT, Datablock, Event, Create.]
Runtime maps the constructed data-flow graph to architecture
[Figure: blocks of PEs, each with a service core and 1MB L2, grouped around a shared LLC and linked by an interconnect.]
OCR level of abstraction
TBB user-friendly API:
    void ParallelAverage( float* output, const float* input, size_t n ) {
        Average avg;
        avg.input = input;
        avg.output = output;
        parallel_for( blocked_range<int>( 1, n ), avg );
    }
hides…
    if( !range.empty() ) {
        start_for& a = *new(task::allocate_root()) start_for(range,body,partitioner);
        task::spawn_root_and_wait(a);
    }
hides…
    void generic_scheduler::local_spawn_root_and_wait( task& first, task*& next ) {
        internal::reference_count n = 0;
        for( task* t=&first; ; t=t->prefix().next ) {
            ++n;
            t->prefix().parent = &dummy;
            if( &t->prefix().next==&next ) break;
        }
        dummy.prefix().ref_count = n+1;
        if( n>1 )
            local_spawn( *first.prefix().next, next );
        local_wait_for_all( dummy, &first );
    }
hides…
OCR’s level of abstraction is at the very bottom
OCR concepts
• Common
  – All objects are globally and uniquely identifiable, and relocatable
• Computation
  – Event Driven Task (EDT)
  – Does not perform synchronization
  – Distinct from the notion of thread or core
• Data
  – Data-block (DB)
  – Relocatable contiguous chunk of data
• Synchronization, links
  – Events
  – Runtime-visible
• Slots
  – Positional end-points for dependences
OCR concepts: building blocks
• EDT
  – N pre slots (N known at creation time)
  – Optional attached “completion event”
• Data
  – No pre slots
  – Post slot always “satisfied”
• Evt
  – N pre slots (N fixed by type of event, NOT determined by user)
  – Post slot initially “unsatisfied”
• Slot is:
  – Connected (attached to another slot) or unconnected
  – Satisfied (user-triggered or runtime-triggered) or unsatisfied
[Figure legend: pre slots are numbered 0..N; post slots accept multiple connections.]
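The slot rules above can be captured in a small data model. This is a toy Python sketch, not the OCR API: the class names `Slot`, `DataBlock`, `Event` and `Edt`, their fields, and the event type names "simple" and "latch" are assumptions made for illustration (the pre-slot counts for the two event types come from the later slide on events).

```python
class Slot:
    """One positional dependence end-point."""
    def __init__(self, satisfied=False):
        self.connected = False      # attached to another slot?
        self.satisfied = satisfied  # user-triggered or runtime-triggered?


class DataBlock:
    """No pre slots; the post slot is always satisfied."""
    def __init__(self, size):
        self.data = bytearray(size)
        self.pre = []
        self.post = Slot(satisfied=True)


class Event:
    """The pre-slot count is fixed by the event type, not chosen by the user."""
    PRE_SLOTS = {"simple": 1, "latch": 2}

    def __init__(self, kind):
        self.pre = [Slot() for _ in range(self.PRE_SLOTS[kind])]
        self.post = Slot()          # initially unsatisfied


class Edt:
    """N pre slots, with N known at creation; optional completion event."""
    def __init__(self, n_pre, with_completion_event=False):
        self.pre = [Slot() for _ in range(n_pre)]
        self.completion = Event("simple") if with_completion_event else None


edt = Edt(n_pre=3, with_completion_event=True)
db = DataBlock(size=16)
```

Every slot starts unconnected and unsatisfied except the post slot of a data-block, which is why a data-block can be used as a dependence source at any time.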
OCR concepts: add dependence
[Figure: add dependence takes two arguments. Argument 1 is a Data OR an Evt; argument 2 is a pre slot (0..N) of an EDT OR of an Evt. The result is that the two are connected. The figure draws 1 of the 4 possible combinations.]
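The four combinations can be enumerated with a toy model. This is a sketch under stated assumptions, not OCR code: `Node` and `add_dependence` are hypothetical names, and the validation rules simply restate the argument types shown on the slide.

```python
class Node:
    """Toy stand-in for a data-block, an event or an EDT."""
    def __init__(self, kind, n_pre=0):
        assert kind in ("data", "event", "edt")
        self.kind = kind
        self.pre = [None] * n_pre   # pre slot i holds its connected source
        self.post = []              # a post slot allows multiple connections


def add_dependence(source, dest, slot):
    """Connect the post slot of source to pre slot `slot` of dest."""
    if source.kind not in ("data", "event"):
        raise ValueError("argument 1 must be a data-block or an event")
    if dest.kind not in ("edt", "event"):
        raise ValueError("argument 2 must be an EDT or an event")
    dest.pre[slot] = source
    source.post.append((dest, slot))


data = Node("data")
evt = Node("event", n_pre=1)
edt = Node("edt", n_pre=2)

add_dependence(data, edt, 0)   # data  -> EDT
add_dependence(evt, edt, 1)    # event -> EDT
add_dependence(data, evt, 0)   # data  -> event (event -> event is the fourth case)
```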
OCR concepts: satisfy
[Figure: satisfy takes two arguments. Argument 1 is a pre slot (0..N) of an EDT OR an Evt; argument 2 is a Data OR NULL. The result is that the slot is satisfied/triggered and carries the Data. The figure draws 1 of the 4 possible combinations.]
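A minimal sketch of satisfy, again as a toy model and not the OCR API. The class name `Target` and its methods are hypothetical; `None` stands in for NULL, so a private sentinel is needed to tell "satisfied with NULL" apart from "not yet satisfied".

```python
class Target:
    """Toy EDT or event with numbered pre slots."""
    _UNSATISFIED = object()   # sentinel, so that None can mean a NULL payload

    def __init__(self, n_pre):
        self.pre = [self._UNSATISFIED] * n_pre

    def satisfy(self, slot, payload=None):
        # payload is a data-block, or None for a pure control dependence.
        self.pre[slot] = payload

    def is_satisfied(self, slot):
        return self.pre[slot] is not self._UNSATISFIED

    def all_satisfied(self):
        return all(p is not self._UNSATISFIED for p in self.pre)


edt, evt = Target(n_pre=2), Target(n_pre=1)
block = bytearray(b"result")

edt.satisfy(0, block)   # EDT slot <- data
edt.satisfy(1)          # EDT slot <- NULL
evt.satisfy(0, block)   # event    <- data (event <- NULL is the fourth case)
```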
OCR execution model for EDTs
• EDTs
  – 0..N in/out pre slots
    • Slots are initially “unconnected” and “unsatisfied”
    • At creation time, the number of incoming slots must be known
  – An EDT executes after all pre slots are “satisfied”
    • Satisfaction of pre slots can happen in any order
  – An EDT can access memory:
    • Data-blocks:
      – passed in through one of its in/out slots (the EDT gets a C pointer)
      – created by the EDT
    • Stack and ephemeral heap (local)
    • NO global memory
  – An EDT, during its execution, can at any time:
    • Write to any accessible data-blocks
    • Manipulate the dependence graph for future (not yet runnable) EDTs by adding dependences, satisfying events, etc.
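The firing rule (run once all pre slots are satisfied, in whatever order that happens) can be sketched as follows. This is a toy, serial Python model with hypothetical names; a real runtime would schedule the EDT rather than call it inline, and this sketch does not handle an EDT with zero pre slots.

```python
_EMPTY = object()   # marks an unsatisfied pre slot


class Edt:
    """Toy EDT: runs its function once, as soon as every pre slot is satisfied."""
    def __init__(self, func, n_pre):
        self.func = func
        self.pre = [_EMPTY] * n_pre   # slot count is fixed at creation time
        self.result = _EMPTY

    def satisfy(self, slot, payload=None):
        self.pre[slot] = payload
        ready = all(p is not _EMPTY for p in self.pre)
        if ready and self.result is _EMPTY:
            # The function only sees what arrived on its slots.
            self.result = self.func(*self.pre)

    @property
    def has_run(self):
        return self.result is not _EMPTY


def add(a, b):
    return a[0] + b[0]


edt = Edt(add, n_pre=2)
edt.satisfy(1, [40])      # slots may be satisfied in any order
ran_early = edt.has_run   # False: slot 0 is still unsatisfied
edt.satisfy(0, [2])       # last slot satisfied, so the EDT runs
```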
Example 1: Producer/Consumer
• Dynamic dependence construction
• Producer and consumer never know about each other
• Focus on minimum needed for placement and scheduling
[Figure. Concept: Producer EDT passes Data to Consumer EDT. OCR: Producer EDT, Data, Evt and Consumer EDT, with call labels (1) dbCreate, (2) edit Data, (3) satisfy and (*) addDep. Legend: who executes call, data dependence, control dependence.]
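The numbered calls in the figure can be traced in a toy model. This is a sketch, not OCR code: `Event`, `Edt` and `add_dependence` are hypothetical names, and the sketch assumes the dependence is added before the event is satisfied.

```python
class Event:
    """Toy pass-through event: forwards its payload to every connected slot."""
    def __init__(self):
        self.listeners = []           # (edt, slot) pairs on the post slot

    def satisfy(self, payload):
        for edt, slot in self.listeners:
            edt.satisfy(slot, payload)


class Edt:
    """Toy EDT that runs when all of its pre slots have been satisfied."""
    def __init__(self, func, n_pre):
        self.func, self.pre, self.missing = func, [None] * n_pre, n_pre

    def satisfy(self, slot, payload):
        self.pre[slot] = payload
        self.missing -= 1
        if self.missing == 0:
            self.func(*self.pre)


def add_dependence(event, edt, slot):
    event.listeners.append((edt, slot))


received = []


def consumer(data):
    received.append(bytes(data))


def producer(out_event):
    data = bytearray(4)         # (1) dbCreate
    data[:] = b"ping"           # (2) edit Data
    out_event.satisfy(data)     # (3) satisfy


evt = Event()
add_dependence(evt, Edt(consumer, n_pre=1), 0)   # (*) addDep
producer(evt)                                    # the producer only sees the event
```

Neither function refers to the other: both are wired to the event, which is the point of the second bullet above.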
Example 2: Simple synchronization
• Control dependence is no different than a data dependence
[Figure. Concept: Step 1 EDT precedes Step 2-a EDT and Step 2-b EDT. OCR: Step 1 EDT, Evt, Step 2-a EDT and Step 2-b EDT, with call labels (*) addDep and (1) satisfy; the payload is NULL.]
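A control dependence is then just a satisfy that carries no data-block. The following toy sketch uses hypothetical names and runs serially; `None` plays the role of NULL.

```python
log = []


class Event:
    """Toy event whose post slot fans out to several one-slot EDTs."""
    def __init__(self):
        self.waiters = []

    def add_dependence(self, edt_func):      # (*) addDep
        self.waiters.append(edt_func)

    def satisfy(self, payload=None):         # None plays the role of NULL
        for edt_func in self.waiters:
            edt_func(payload)


def step_1(done):
    log.append("step 1")
    done.satisfy()                           # (1) satisfy with no data-block


def step_2a(dep):
    log.append(("step 2-a", dep))


def step_2b(dep):
    log.append(("step 2-b", dep))


done = Event()
done.add_dependence(step_2a)
done.add_dependence(step_2b)
step_1(done)
```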
OCR execution model for events
• Events
  – 0..N pre slots
    • Slots are initially “unconnected” and “unsatisfied”
  – Events have a “trigger” rule that determines when their post slot transitions to “satisfied” and what gets connected to it
    • Simple event (pass-through)
      – 1 pre slot
      – When: satisfy post slot on incoming slot satisfaction
      – What: whatever is on incoming slot (pass GUID)
    • Latch event (multi-party synchronization)
      – 2 pre slots; “waiting-on” count and current count
      – When: satisfy outgoing slot when number of satisfies on both pre slots matches (similar to reference count in TBB)
      – What: NULL (incoming data-blocks are ignored)
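The latch trigger rule can be sketched as a pair of counters. This is a toy model with hypothetical names (`LatchEvent`, `INCR`, `DECR`), assuming one pre slot is used to register parties and the other to check them out.

```python
class LatchEvent:
    """Toy latch: fires once both pre slots have seen the same number of satisfies."""
    INCR, DECR = 0, 1     # hypothetical names for the two pre slots

    def __init__(self, on_trigger):
        self.counts = [0, 0]
        self.on_trigger = on_trigger
        self.triggered = False

    def satisfy(self, slot, payload=None):
        # Incoming data-blocks are ignored; only the counts matter.
        self.counts[slot] += 1
        if not self.triggered and self.counts[0] == self.counts[1]:
            self.triggered = True
            self.on_trigger(None)     # the post slot always carries NULL


fired = []
latch = LatchEvent(fired.append)

for _ in range(3):                    # register three parties
    latch.satisfy(LatchEvent.INCR)
latch.satisfy(LatchEvent.DECR)
latch.satisfy(LatchEvent.DECR)
fired_early = bool(fired)             # counts are 3 vs 2, so not yet triggered
latch.satisfy(LatchEvent.DECR, payload=b"ignored")
```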
Example 3: In place parallel update
[Figure. Concept: Setup EDT produces Data; Parallel_1 EDT and Parallel_2 EDT update it in place; Wrapup EDT consumes Data. OCR: Setup EDT, Finish EDT, Parallel_1 EDT, Parallel_2 EDT, Wrapup EDT and Data. Call labels, in order: (1) dbCreate, (1) edtCreate, (1) edtCreate; (2) addDep, (2) addDep; (3) edtCreate, (3) edtCreate; (4) addDep.]
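One plausible reading of the figure is that the Finish EDT creates the two parallel EDTs and that the Wrapup EDT runs only after both have ended. The toy sketch below models that reading; `FinishScope` and `double_range` are hypothetical names, and the exact behaviour of a finish EDT is an assumption here, not something the slide spells out.

```python
class FinishScope:
    """Toy finish scope: calls on_done once every EDT created inside it has ended."""
    def __init__(self, on_done):
        self.pending = 0
        self.on_done = on_done

    def create_edt(self, func, *args):
        self.pending += 1
        return lambda: self._run(func, args)

    def _run(self, func, args):
        func(*args)
        self.pending -= 1
        if self.pending == 0:
            self.on_done()


def double_range(data, lo, hi):
    for i in range(lo, hi):
        data[i] *= 2      # in place: both EDTs write the same data-block


data = [1, 2, 3, 4]                                   # (1) dbCreate
totals = []
scope = FinishScope(on_done=lambda: totals.append(sum(data)))   # stands in for Wrapup
children = [scope.create_edt(double_range, data, 0, 2),         # (3) edtCreate
            scope.create_edt(double_range, data, 2, 4)]         # (3) edtCreate
for child in reversed(children):                      # execution order is arbitrary
    child()
```

The two children touch disjoint halves of the same data-block, so no copy is made and the wrap-up step sees the updated contents.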
Example 4: Single assignment update
[Figure. Concept: Setup EDT produces Data; Parallel_1 EDT and Parallel_2 EDT each produce a new data-block (Data1, Data2); Wrapup EDT consumes them. OCR: the same EDTs plus Data, Data1, Data2 and events Evt1 and Evt2. Call labels, in order: (1) dbCreate, (1) edtCreate, (1) edtCreate, (1) evtCreate; (2) addDep; (3) addDep; (4) dbCreate, (4) dbCreate; (5) satisfy, (5) satisfy.]
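In the single assignment variant each parallel EDT leaves its input untouched, creates a fresh data-block and hands it over through its own event. A toy sketch with hypothetical names (tuples stand in for immutable data-blocks):

```python
_EMPTY = object()   # marks an unsatisfied pre slot


class Edt:
    """Toy EDT that runs when all of its pre slots have been satisfied."""
    def __init__(self, func, n_pre):
        self.func, self.pre = func, [_EMPTY] * n_pre

    def satisfy(self, slot, payload):
        self.pre[slot] = payload
        if all(p is not _EMPTY for p in self.pre):
            self.func(*self.pre)


class Event:
    """Toy pass-through event."""
    def __init__(self):
        self.listeners = []

    def satisfy(self, payload):
        for edt, slot in self.listeners:
            edt.satisfy(slot, payload)


results = []


def wrapup(data1, data2):
    results.append(data1 + data2)


def parallel(src, out_event, lo, hi):
    fresh = tuple(x * 2 for x in src[lo:hi])   # (4) dbCreate
    out_event.satisfy(fresh)                   # (5) satisfy


source = (1, 2, 3, 4)              # (1) dbCreate, never modified afterwards
evt1, evt2 = Event(), Event()      # (1) evtCreate
wrap = Edt(wrapup, n_pre=2)        # (1) edtCreate
evt1.listeners.append((wrap, 0))   # (3) addDep
evt2.listeners.append((wrap, 1))   # (3) addDep
parallel(source, evt2, 2, 4)       # either completion order works
parallel(source, evt1, 0, 2)
```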
OCR ecosystem
[Figure: layered diagram, top to bottom.
 Programming platforms: C, Array DSL, CnC, Hero Code, HC, HTA, PIL, with R-Stream, CnC Translator and HC Compiler
 OCR API + Tuning Annotations
 Open Community Runtime
 OCR implementations: OCR targeting TG, OCR targeting x86
 Low-level compilers: LLVM, GCC
 Platforms and evaluation platforms: FSim - TG Architecture, x86, Cluster]
Take-aways
• OCR API is at the “assembly” level; other tools are meant to sit between it and programmers
• Few simple concepts, multiple ways to use them
  – Interested in determining “best” use
• Dependence graph built on the fly:
  – Complicates the writing of the program
  – Scalable approach
Preliminary results
• On some code, OCR matches or bests OMP
• Simple scheduler, no data-blocks (very preliminary but promising)
Areas of investigation
• Development of a specification:
  – Memory model
• Tuning hints and annotations
• More expressive support for collectives
Strawman architecture
• Heterogeneous
• Hierarchical architecture
• Tapered memory bandwidth
• Global, shared address space
• Software managed non-coherent memories
• Functional simulator available
[Figure: hierarchy of the strawman chip.
 Execution Engine (XE): application specific; DP FP FMAC, 32KB I$, 64KB SP, RF?
 Control Engine (CE): system SW; GP Int, 32KB I$, 64KB SP, RF?
 Block: 8 XE + CE, 1MB shared L2
 Cluster: 16 blocks, 8MB shared LLC, interconnect
 Processor chip: 16 clusters, 64MB shared LLC, interconnect]
OCR vs other solutions
                        | CnC           | MPI                      | OCR                | OpenMP        | TBB
Execution model         | Tasks         | Bulk Sync                | Fine-grained tasks | Bulk Sync     | Tasks
Memory model            | Shared memory | Explicit message passing | Explicit; global   | Shared memory | Shared memory
Separation of concerns? | Yes           | No                       | Yes                | No            | Yes (but can dig deeper)