
Copyright INTERTWinE Consortium 2017

Definition of the Directory/Cache API for distributed memory runtime systems

Deliverable 4.2

Due date M18

Submission date 30/MAR/2017

Project start date 01/OCT/2015

Project duration 36 months

Deliverable lead organization BSC

Version 1.0

Status Published

Author(s) Tiberiu Rotaru, Olivier Aumage, Xavier Teruel, Vicenç Beltran Querol, Nick Brown, Bernd Lörwald, Jakub Sistek, David Stevens

Reviewer(s) Dan Holmes (EPCC)

Luis Cebamanos (EPCC)

Dissemination level PU Public

Project Acronym INTERTWinE

Project Title Programming Model INTERoperability ToWards Exascale

Project Number 671602

Instrument Horizon 2020

Thematic Priority FETHPC


Version History

Version 0.1 (09/Jan/17): Initial draft. Tiberiu Rotaru.

Version 0.2 (09/Feb/17): Draft updated, including the task-based runtime descriptions. Tiberiu Rotaru, Olivier Aumage, Xavier Teruel.

Version 0.3 (24/Feb/17): Updated API description, modified figures. Tiberiu Rotaru, Bernd Lörwald.

Version 0.4 (01/Mar/17): Added PaRSEC description (from Jakub), added Jacobi use case sketch, corrected text. Tiberiu Rotaru.

Version 0.5 (03/Mar/17): Extended executive summary and small modifications to the OmpSs description. Vicenç Beltran Querol.

Version 0.6 (03/Mar/17): Updated text of document for grammar and extended the Jacobi example section. Nick Brown.

Version 0.7 (07/Mar/17): Updated text, corrected misspellings, changed picture, added subsection about segments, updated Jacobi use case. Tiberiu Rotaru, Vicenç Beltran Querol, Nick Brown, Jakub Sistek, David Stevens.

Version 0.8 (08/Mar/17): Slightly updated the Jacobi use case, minor corrections added. Tiberiu Rotaru.

Version 0.9 (21/Mar/17): Included reviewers' corrections. Updated text and figures following reviewers' comments. Dan Holmes, Luis Cebamanos, Tiberiu Rotaru.

Version 1.0 (30/Mar/17): Version approved by Exec Group. Tiberiu Rotaru.


Table of Contents

1 EXECUTIVE SUMMARY
2 INTRODUCTION
2.1 CONTEXT
2.2 PURPOSE
2.3 GLOSSARY OF ACRONYMS
3 TASK-BASED PROGRAMMING MODELS
3.1 OMPSS
3.2 STARPU
3.3 PARSEC
4 DIRECTORY/CACHE API DEFINITION
4.1 GOAL
4.2 ARCHITECTURE
4.3 MEMORY SEGMENTS
4.4 DATA MANAGEMENT
4.4.1 Global memory
4.4.2 Local memory
4.5 CLIENT INTERFACE DATA TYPES
4.5.1 Node identifiers
4.5.2 Segments
4.5.3 Data identifiers
4.5.4 Global ranges
4.5.5 Local caches
4.5.6 Local ranges
4.6 DIRECTORY/CACHE CLIENT PUBLIC API
4.6.1 Distributed shared memory
4.6.2 Local memory
4.6.3 Operations with ranges
4.6.4 Executing operations with ranges
4.6.5 Transfer costs
4.6.6 Data locality description
4.7 DIRECTORY/CACHE SERVER PUBLIC API
4.7.1 Segment factory
4.7.2 Segment description
4.7.3 Segment interface
4.7.4 Data description
5 DIRECTLY USING THE DIRECTORY/CACHE: JACOBI USE CASE EXAMPLE
6 COUPLING THE DIRECTORY/CACHE WITH TASK-BASED RUNTIME SYSTEMS
7 REFERENCES

Index of Figures

Figure 1: StarPU's task dependence graph can be used to schedule tasks on heterogeneous, accelerated nodes.
Figure 2: Directory/Cache client-server architecture.
Figure 3: The task-based runtime systems can use or extend the Directory/Cache basic client interface, and new segment types can be added.
Figure 4: Global Address Space Programming Interface (GPI-2).
Figure 5: Hierarchical data representation.
Figure 6: Local copies of the same global data may be referenced by multiple clients.
Figure 7: StarPU-DSM big picture, compared to the StarPU-MPI layer.


1 Executive Summary

Task-based programming models have been successfully used in conjunction with message passing libraries to orchestrate and exploit intra-node and inter-node parallelism, respectively. However, the resulting applications are more complex to develop and maintain, as two very different approaches are used to exploit the different levels of parallelism. In recent years, several task-based runtime systems have been extended to directly support inter-node parallelism without exposing any message passing concepts to the programmer. These distributed task-based runtime systems use message passing libraries under the hood to move data across nodes as required by the application execution flow. To that end, distributed task-based runtimes have to implement complex Directory/Cache services to efficiently move and cache data. However, an efficient and scalable implementation of these services remains a challenging task.

This document defines an application programming interface for a Directory/Cache service. The aim is to design an API that enables the efficient sharing of data in distributed memory systems. The document describes the architecture of the Directory/Cache and proposes a generic, abstract API that is intended to be used primarily by distributed task-based runtime systems. Such an API allows the runtimes to be completely independent of the physical representation of data, as well as of the type of storage used. The API relies on a generic, abstract multilevel data representation, promoting asynchronous data transfers and automatic caching.


2 Introduction

2.1 Context

The transition to Exascale is expected to bring more complex architectures that are inherently highly parallel (distributed, heterogeneous and multi-core). Such systems may have a huge number of multi-core nodes, deep memory hierarchies, and complex interconnect topologies. Programming them efficiently requires new, adequate programming models, and task-based models are currently perceived as a very promising candidate to start with; they abstract the notion of parallelism away from the application developer and are better suited for efficiently exploiting heterogeneous architectures.

As traditional programming models will no longer fit future architectures, new alternatives must be found in order to reach the expected performance and scalability targets. Since a single common API that addresses all architectural levels is considered unachievable in the near future, it is reasonable, as a first step, to adopt a programming model that combines different APIs at different system levels.

2.2 Purpose

One of the tasks defined in the INTERTWinE project is the definition of a common, generic API for a Directory/Cache service for task-based runtime systems (such as OmpSs [1], StarPU [2] and PaRSEC [7]). The main goal is to achieve a programming model that hides most of the complexities associated with distributed systems, but also enables an efficient utilization of remote compute and memory resources. The present document proposes an API for a Directory/Cache (a kind of virtual memory manager) that is sufficiently abstract and generic, promoting one-sided, asynchronous communication and providing automatic caching.

2.3 Glossary of Acronyms

API Application Programming Interface

GASPI Global Address Space Programming Interface

GPI-2 Open source implementation of the GASPI standard

MPI Message Passing Interface

TBRS Task-Based Runtime System

DSM Distributed Shared Memory

DAG Directed Acyclic Graph


3 Task-Based Programming Models

One of the biggest concerns with many recent codes is that, despite scaling well on current architectures, they may not be capable of exploiting future Exascale systems and scaling to the level of concurrency that these future machines are likely to provide. One of the main causes lies in the underlying programming model, with many codes still relying on bulk synchronous parallel (BSP) programming paradigms using a combination of MPI and OpenMP, and on simple resource management and scheduling policies. Task-based programming models are currently considered a promising starting point for designing new Exascale programming models. They abstract the notion of parallelism away from the application developers and are more likely to work well on future (heterogeneous) architectures. Task parallelism offers better prospects for dealing with aspects such as dynamic load balancing, latency hiding and fault tolerance, which are crucial for guaranteeing performance in large distributed systems.

Numerous task-based runtime systems have emerged recently (such as OmpSs [1], StarPU [2] and PaRSEC [7]). Below, some of the most representative task-based systems that are also involved in the INTERTWinE project are presented.

3.1 OmpSs

OmpSs [1] was initially an effort to integrate features from the OpenMP tasking model and the different implementations of the StarSs programming model family into a single programming model based exclusively on tasks. The current OmpSs implementation is developed at the Barcelona Supercomputing Center (BSC) using the Mercurium compiler and the Nanos 6 runtime system.

In OmpSs, the task construct allows the annotation of function declarations or definitions. When a function is annotated with the task construct, each invocation of that function creates a task. Note that only the execution of the function itself is part of the task, not the evaluation of the task arguments. In OmpSs, task granularity usually varies from a few microseconds up to several seconds per task.

The task construct allows the programmer to express data dependencies among tasks using the in, out and inout clauses (standing for input, output and input/output, respectively). This allows the programmer to specify, for each task in the program, what data the task waits for and what data it produces (signalling its readiness). Note that whether the task really uses that data in the specified way is the programmer's responsibility.

Each time a new task is created, its in, out, and inout parameters are matched against those of the existing tasks. If a data-dependency is found, the task becomes a successor of the corresponding task(s). This process creates a task dependency graph at runtime. Tasks with dependencies are scheduled for execution as soon as their entire predecessor set in the graph has finished.

Data dependencies can also be annotated with the weak modifier, which indicates that the task does not itself perform any action that requires the enforcement of the dependency. Instead, these actions can be performed by any of its deeply nested subtasks. Any subtask that may directly perform these actions will include the dependency in its non-weak variant.

Unlike the OpenMP semantics, when the dependency is specified over an array section, OmpSs understands that this information affects multiple elements of the array (or pointee data). The runtime system will be responsible for computing the overlap of these array sections among tasks. OmpSs also permits the use of shaping expressions in the provided dependence parameters. This mechanism allows the recasting of pointers in order to recover the size of dimensions that could have been lost across calls.


The taskwait construct may specify the on clause. This clause allows waiting only on the tasks that produce some data, in the same way as the in clause does. It suspends the current task until all previous tasks with an out dependency over the expression have completed their execution.

One of the most relevant features of OmpSs is its ability to handle architectures with disjoint address spaces. By disjoint address spaces we refer to those architectures where the memory of the system is not contained in a single address space (e.g. clusters of SMPs or heterogeneous systems built around accelerators with private memory). A set of directives (copy_in, copy_out and copy_inout) allows programmers to specify which data will be used by a given task. The OmpSs runtime is responsible for guaranteeing that this data will be available (even across multiple address spaces, if present) when the task executes. To that end, the OmpSs runtime implements its own Directory/Cache service to track and move the required data across disjoint address spaces. A brief illustrative sketch of these constructs follows.
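As a minimal sketch of the clauses described above, the fragment below uses OmpSs-style task annotations; the pragma prefix (oss), the array-section syntax and the function names are illustrative assumptions and may differ between OmpSs versions.

// Hedged sketch of OmpSs task annotations; names and exact pragma
// spelling are illustrative only.
#pragma oss task in(a[0;n]) out(b[0;n])
void scale (const double *a, double *b, long n);

#pragma oss task inout(b[0;n])
void accumulate (double *b, long n);

void compute (const double *a, double *b, long n)
{
  scale (a, b, n);       // each invocation creates a task; producer of b
  accumulate (b, n);     // becomes a successor of scale() via the dependency on b
  #pragma oss taskwait on(b[0;n])   // wait only for tasks with an out over b
}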

3.2 StarPU  

The StarPU runtime system, developed by the STORM Team, provides a framework for task scheduling on heterogeneous platforms, together with an API for implementing various classes of scheduling algorithms. This scheduling framework works jointly with a distributed shared-memory manager to automatically optimize data transfers and overlap communications with computations.

Heterogeneous multi-core platforms, mixing regular cores and dedicated accelerators, are becoming increasingly popular. To fully tap into the potential of these hybrid machines, both in terms of computation efficiency and power saving, the StarPU runtime system is capable of driving task-based applications over heterogeneous cluster nodes equipped with NVIDIA CUDA devices, Intel Xeon Phi devices and/or OpenCL accelerators. The StarPU environment features a virtual shared memory, to keep track of data replicates on discrete accelerators, and a data prefetching engine to overlap transfers with computation. Such facilities, together with a database of self-tuned per-task performance models, enable StarPU to greatly improve the quality of scheduling policies in the field of parallel linear algebra algorithms.

Instead of rewriting the entire code, programmers encapsulate existing kernels within codelet objects. Codelets may be enriched with additional kernel implementations for all supported architectures. StarPU then schedules those codelets as efficiently as possible over the entire machine, dynamically selecting adequate implementations as execution unfolds. In order to relieve programmers from the burden of explicit data transfers, a high-level data management library enforces memory coherency over the machine.


Figure 1: StarPU’s task dependence graph can be used to schedule tasks on heterogeneous, accelerated nodes.


Before a codelet starts, all the data it requires is transparently made available on the compute resource. Given its expressive interface and portable scheduling policies, StarPU obtains portable performance by efficiently and easily using all computing resources at the same time. StarPU takes advantage of the heterogeneity of a machine, using scheduling strategies and auto-tuned performance models to feed each unit with the kernels on which it performs best.
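As a rough sketch of the codelet pattern described above (based on the public StarPU C API; exact constants and field names are recalled from memory and may vary slightly between StarPU versions), a vector-scaling kernel could be registered and submitted as follows:

#include <starpu.h>

/* CPU implementation of the kernel; the registered vector is passed
   through the first buffer descriptor. */
static void scal_cpu (void *buffers[], void *cl_arg)
{
  struct starpu_vector_interface *v =
    (struct starpu_vector_interface *) buffers[0];
  double *data = (double *) STARPU_VECTOR_GET_PTR (v);
  unsigned n = STARPU_VECTOR_GET_NX (v);
  double factor = *(double *) cl_arg;
  for (unsigned i = 0; i < n; ++i)
    data[i] *= factor;
}

static struct starpu_codelet scal_cl;

int main (void)
{
  double vec[1024];
  for (int i = 0; i < 1024; ++i) vec[i] = i;
  double factor = 2.0;

  starpu_init (NULL);

  /* Codelet: one CPU implementation, one buffer accessed in read-write mode. */
  scal_cl.cpu_funcs[0] = scal_cpu;
  scal_cl.nbuffers = 1;
  scal_cl.modes[0] = STARPU_RW;

  /* Register the vector with the data management library; any transfer to
     an accelerator is then handled transparently before the codelet runs. */
  starpu_data_handle_t handle;
  starpu_vector_data_register (&handle, STARPU_MAIN_RAM,
                               (uintptr_t) vec, 1024, sizeof (double));

  /* Submit the task; StarPU picks the best available processing unit. */
  starpu_task_insert (&scal_cl,
                      STARPU_RW, handle,
                      STARPU_VALUE, &factor, sizeof (factor),
                      0);

  starpu_task_wait_for_all ();
  starpu_data_unregister (handle);
  starpu_shutdown ();
  return 0;
}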

3.3 PaRSEC

PaRSEC [7] is a generic framework for architecture-aware scheduling and management of micro-tasks on distributed many-core heterogeneous architectures. PaRSEC is the underlying runtime and dynamic scheduling engine of the DPLASMA numerical linear algebra library for dense matrices, among others.

Perhaps the main advantage of PaRSEC comes from expressing algorithms as parametrised graphs of tasks with dependencies. Unlike the standard (non-parametrised) Directed Acyclic Graph (DAG), PaRSEC does not need access to the global DAG representation, and it is able to unfold the dependencies from any point of the graph. This feature is extremely important for maintaining scalability.

An additional complexity of runtime systems on distributed memory machines comes from the fact that local tasks often require access to remote data. PaRSEC currently employs a dataflow model that constructs an internal 'data' object. This object encapsulates the data itself and can flow throughout the distributed machine, as deduced from the data dependencies specified within the DAG. To enhance data locality across different compute nodes, and within the local memory of any accelerators, various copies of each 'data' object are created and maintained. The 'data' object keeps track of its version and owner in order to manage consistency. In PaRSEC, the execution time of each task depends on the application and typically starts from a few microseconds. There is no bound on the size of the data to be moved other than the system's characteristics. That said, a rule followed, for example, by the dense linear algebra library DPLASMA is to pass sub-matrices that fit into the processors' caches.

PaRSEC tries to overlap communication and computation as much as possible. To achieve this, it employs non-blocking two-sided communication functions from the MPI library. Ideally, a local copy of the data is ready before a task operating on it is launched. The functionality for resolving remote data dependencies is well isolated in PaRSEC into a single file (remote_dep_mpi.c). This separation should allow the development of a version based on the Directory/Cache system of INTERTWinE.


4 Directory/Cache API Definition

4.1 Goal

The main purpose of the Directory/Cache is to provide a set of services that support task-based runtime systems in efficiently running distributed applications, while consistently managing data stored in distributed memory or in local caches. The Directory/Cache API allows runtimes to be completely independent of the physical representation of data and of the type of storage used, facilitating access through the same interface to an extendable list of memory segment implementations built on different communication libraries (such as GPI-2 [5] and MPI [4]). Moreover, applications may also use the Directory/Cache API directly.

4.2 Architecture

The Directory/Cache relies on the client-server architecture depicted in Figure 2.

Figure 2: Directory/Cache client-server architecture.

This architecture assumes that one Directory/Cache server instance is deployed on each node in a distributed environment. A server instance can coordinate with other instances to carry out operations on memory ranges (local or global) in a consistent way. A local server may create and manage multiple caches. The original data is stored in one or more segments across several nodes, and copies of global memory regions are stored in local caches. Multiple clients may connect to a Directory/Cache server, and each client may attach to specific local caches, which may be shared between numerous clients. The clients can be started either within the same process as the corresponding local server or in a distinct process. The second approach offers the advantage of supporting the coupling of external programs with an already running Directory/Cache service. Moreover, running clients in different processes from the server makes the Directory/Cache tolerant to client failures: in this case the Directory/Cache continues to work as normal with its other connected clients. This architecture is flexible and extensible, supporting numerous different types of memory segments, which may coexist and can be used simultaneously.


The architecture also decouples the client API from the server API, which means that modifications can be made to the client interface for specific runtimes without impacting the underlying Directory/Cache server. The direct users of the Directory/Cache are most likely to be task-based runtime systems, but application developers may also use it directly if they wish.

The flexibility of the Directory/Cache means that it can be used directly by any application relying on traditional programming models, but the benefits are increased when it is used in conjunction with a task-based programming paradigm. When integrated with a task-based runtime, the runtime should hide the Directory/Cache calls behind its specific exposed API in order to ease the application developers' work. Applications already running on a task-based runtime system should work with minimal (ideally no) modifications after coupling the runtime with the Directory/Cache service. The main benefit of this coupling is access, through a unified interface, to a mixture of memory types and segment implementations that can be used simultaneously, at the cost of only minimal changes to the applications.

The task-based runtime systems can either directly use the basic interface or extend it by adding specific functionality. At the same time, new segment implementations can be added straightforwardly without requiring modifications to the client or the server implementation, as shown in Figure 3.

 

4.3 Memory Segments

A segment is a contiguous space of globally accessible memory that can be addressed in terms of offsets into the data. The Directory/Cache is designed to work with various types of segments simultaneously. Of major interest for application developers are the GASPI [3] and MPI [4] segments, due to the increasing number of applications using MPI or GPI-2 [5] as data transport layers in distributed environments. Applications may benefit from using the Directory/Cache directly or indirectly, taking advantage of an efficient implementation of these segment types based on asynchronous, one-sided communication with either GPI-2 or MPI RMA.

MPI is a standardized message-passing system for distributed memory applications. GASPI (Global Address Space Programming Interface) is a partitioned global address space API. GPI-2 is an implementation of the GASPI standard as a low latency communication library and runtime system for scalable real-time parallel applications running on distributed systems, with the architecture as depicted in Figure 4.

Figure 3: The task-based runtime systems can use or extend the Directory/Cache basic client interface and new segment types can be added.


Figure 4: Global Address Space Programming Interface (GPI-2).


4.4 Data Management

4.4.1 Global memory

The Directory/Cache unifies the access to segments and abstracts the distributed hardware. It assumes a multilevel representation of data, as shown in Figure 5.

The main idea behind this multilevel representation is to decouple the end users from the hardware representation of data, allowing them to focus more on the application’s logic. It is intended that the end users will only work with data abstractions (global ranges), safely assuming that the data with which they interact is contiguous and linearly addressable, without being concerned about its physical distribution.

The first layer in this multilevel representation is the physical segment. This level offers detailed information about data allocations and distributions over segments and nodes, requiring knowledge about the segment type and hardware. This information is mainly used internally by the Directory/Cache server and is not directly exposed to the clients. The second level is represented by the logical segment. This abstracts the notion of physical segment, offering a linear view of a user’s data. The third level of abstraction exposes the allocations in global memory to the end user as contiguous regions, even though they may physically span over numerous segment parts across different nodes. An allocation is a linear view on (distributed) parts of a segment. A subrange of an allocation is referred to as a global range.

The end user can choose among different segment implementations, distribution types and factories that use different communication libraries to realize memory transfers in the background.

4.4.2 Local memory

The Directory/Cache clients work with copies of global data stored in global memory. These copies are stored in one or more local caches and are referred to as local ranges. A local range describes a memory region in a local cache. A local range contains a pointer to that region and its size in bytes. Multiple local ranges corresponding to a single global range may coexist. Additionally, multiple clients may use copies of the same data, stored in either the same cache or in different caches.

Figure 5: Hierarchical data representation.



In Figure 6, Client 1 and Client 2 share a copy of the global data B in the same cache, while Client 2 and Client 3 use copies of the same global data B but situated in different local caches. Client 3 uses copies of global data A and B, stored in different local caches.

Figure 6: Local copies of the same global data may be referenced by multiple clients.

4.5 Client Interface Data Types

4.5.1 Node identifiers

Each instance of a Directory/Cache server has an associated node identifier, which is similar to a rank. However, these node identifiers may differ from the ranks generated by the communication libraries used by server factories (such as GPI-2 [5] and MPI [4]).

4.5.2 Segments

The global memory layer is organized into memory segments that can span multiple nodes. The Directory/Cache can be configured to work with a variable number of segment implementations (predefined or user defined). The runtime (or end programmer, if used directly) may specify, upon instantiation, the type of segments to use without having to care about the implementation details, due to the generic interface we provide for accessing the global memory. The segments are identified by objects of type segment_id_t. This type is serializable and uniquely identifies a segment. Serializable means that objects of this type can be translated into a format that can be stored (for example, in a file or memory buffer, or transmitted across a network connection) and reconstructed later in the same or another computer environment.

A segment identifier type is defined as below:

struct segment_id_t
{
  bool operator== (segment_id_t const&) const;

  template<typename Archive>
    void serialize (Archive& ar, unsigned int);
};
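As an illustration of what "serializable" means in practice, the sketch below assumes a Boost.Serialization-style archive (the serialize member above follows that convention); the archive type, the pack/unpack helpers and the default-constructibility of segment_id_t are assumptions made for this example only.

#include <sstream>
#include <string>
#include <boost/archive/text_oarchive.hpp>
#include <boost/archive/text_iarchive.hpp>

// Hypothetical helper: turn a segment identifier into bytes, e.g. to
// transmit it to another process.
std::string pack (segment_id_t const& id)
{
  std::ostringstream buffer;
  {
    boost::archive::text_oarchive archive (buffer);
    archive << id;               // invokes segment_id_t::serialize
  }
  return buffer.str();
}

// Hypothetical helper: reconstruct the identifier on the receiving side.
segment_id_t unpack (std::string const& bytes)
{
  std::istringstream buffer (bytes);
  boost::archive::text_iarchive archive (buffer);
  segment_id_t id;               // assumes a default constructor exists
  archive >> id;
  return id;
}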

4.5.3 Data identifiers

A piece of data stored in the global memory is identified by an object of type data_id_t. A data identifier is generated as a result of a successful global memory allocation operation and uniquely identifies the data across the Directory/Cache. An object of this type is serializable and is defined as follows:

struct data_id_t
{
  bool operator== (data_id_t const& other) const;

  template<typename Archive>
    void serialize (Archive& ar, unsigned int);
};

4.5.4 Global ranges

A generic range denotes a contiguous region of memory that starts at a specific offset relative to an allocation and has a certain size. The generic range type range_t is therefore defined as a structure with two integral members: offset and size. A global range is a serializable type that describes a memory range in the global memory. It is constructed from a data identifier (the result of an allocation operation) and an object of the generic range type (range_t). A global range is defined as below:

struct global_range_t
{
  global_range_t (data_id_t allocation, range_t range);
  global_range_t() = default;

  bool operator== (global_range_t const&) const;

  template<typename Archive>
    void serialize (Archive& ar, unsigned int);
};
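For illustration, a global range covering bytes 256 to 511 of an existing allocation could be built as follows; the data identifier is assumed to come from a prior allocate call (see section 4.6.1), and the member names offset and size follow the description of range_t above.

// 'id' is assumed to be the result of a successful allocate() call (see 4.6.1).
range_t range;
range.offset = 256;   // offset within the allocation, in bytes
range.size = 256;     // length of the range, in bytes

global_range_t gr (id, range);   // describes bytes [256, 512) of the allocation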

4.5.5 Local caches

The Directory/Cache supports the existence of multiple local caches managed by a local server. A client may request the local server to create one or more local caches, which may eventually be shared between multiple Directory/Cache clients. A local cache may be created within the same process space as the local Directory/Cache server, in which case it is referred to as an in-process cache, or in a different process, in which case it is referred to as an inter-process cache. A local cache is identified by an object of type cache_id_t or shareable_cache_id_t, which is the result of a cache creation operation.

4.5.6 Local ranges

A local range is a contiguous region of memory allocated within a local cache. The local ranges are managed by a range manager responsible for reserving and freeing ranges within caches.

A local range is defined as a structure containing a member of type generic range, with the offset relative to the beginning of the cache. A local range can be queried for the physical address in the local cache. There are two types of local ranges: constant local ranges that cannot be modified by a client, of type const_local_range_t, and mutable local ranges that may be modified, of type mutable_local_range_t. A constant local range type is defined as follows:

struct const_local_range_t
{
  range_t range;

  void const* pointer() const;
  bool operator== (const_local_range_t const&) const;
};

A mutable local range type is defined as follows:

struct mutable_local_range_t
{
  range_t range;

  void* pointer() const;
  bool operator== (mutable_local_range_t const& rhs) const;
};

4.6 Directory/Cache Client Public API

As explained in subsection 4.4, a Directory/Cache client works with data abstractions in terms of ranges. A local range identifies data stored in a local cache. Global ranges are a linear representation of global data distributed over segments.

A Directory/Cache client can be instantiated within the same process where the local server is running, in which case it is referred to as an in-process client. Alternatively, a client can be instantiated within a process that is different from the one where the local server is running, in which case it is referred to as an inter-process client.

While Directory/Cache clients should generally work only with local and global ranges, and are not granted direct access to the global data stored in segment parts residing on a node, exceptions can be made for trusted clients.

4.6.1 Distributed shared memory

• Segment creation

To create a segment within the global memory, the API method segment_create, with the following signature, should be invoked:

template<typename SegmentDescription>
  segment_id_t segment_create ( size_t
                              , SegmentDescription
                              , bool do_tagging = false
                              );

This method requires the following parameters: the total size of the segment, the segment description and a flag signalling whether the tagging mechanism should be activated. The second parameter is an object describing the type of segment to be created (such as a GASPI segment or MPI segment) and the policy to be applied when allocating segments. The segment types can be defined independently and do not require modifications to the client or the server implementation. The developers of segment types may define and implement their own allocation policies (such as equal distribution, prefix-fill, percentage, etc.).

On success, this function returns a valid segment identifier; otherwise it throws an exception and rolls back any modifications already performed.

• Segment deletion

For deleting a segment, the API provides the method segment_delete. This requires as argument the identifier of the segment to be deleted and returns no value. It has the following signature:

void segment_delete (segment_id_t);

• Data allocation

Once a segment is created in the distributed shared memory, one may allocate memory within it using the API function allocate. The signature of this method is:

template<typename DataDescription>
  data_id_t allocate ( size_t
                     , segment_id_t
                     , DataDescription
                     );

This allocation method requires as parameters the size of the memory to be allocated, the identifier of the segment where the allocation should occur and the allocation policy to be applied (data description). Directory/Cache users are allowed to define and implement their own policies for allocating data within segments (such as equally distributed, prefix-fill, percentage, etc.). On success this method returns a data identifier object of type data_id_t; otherwise it throws an exception and the modifications performed are rolled back. The physical space allocated by this function is not necessarily contiguous and may span multiple nodes; however, the user's view of the data is contiguous at their level of abstraction. The global range provides the view of a linear memory allocation, representing the physically distributed memory.

• Data deallocation

The deallocation of global data may be achieved by using the API function free. This takes as argument the data identifier of the global data to be deleted. The method returns no value and has the following signature:

void free (data_id_t);

An important point to note here is that the operations defined above are neither collective nor purely local (i.e. they are executed on a single client only, but involve communication between server instances) and that they do not perform reference counting (i.e. there must be exactly one free per data allocation and one delete per segment). Clients therefore need to ensure that published resources are (implicitly) unpublished before a free or segment_delete, in order to prevent race conditions resulting from the interleaving of operations initiated by different clients. Task-based runtime systems do not require creation and deletion to be collective operations, because they can control the lifetime of such objects through data dependencies in their task workflows. The normal client API does not provide collective operations. However, for directly coupling the Directory/Cache with applications, the client API can be extended with collective variants of the mentioned methods to compensate for the functionality or infrastructure normally provided by a task-based runtime system. Although directly coupling the Directory/Cache with applications is possible, efficient use is best achieved by programming the applications on top of a task-based runtime system that uses the Directory/Cache under the hood. The sketch below strings the calls of this subsection together.
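To make the lifecycle concrete, the following hedged sketch combines the calls above. The client object, the gaspi_segment_description_t and equal_distribution_t policy types, and the way the client is obtained are assumptions made for illustration; only segment_create, allocate, free and segment_delete are part of the API described in this section.

// Hedged sketch of the distributed-shared-memory lifecycle.
std::size_t const segment_size = 1 << 30;   // 1 GiB of global memory

segment_id_t const segment
  = client.segment_create ( segment_size
                          , gaspi_segment_description_t{}  // hypothetical segment type + policy
                          , /* do_tagging */ false
                          );

// Allocate 8 MiB inside the segment; the data may physically span several
// nodes but is exposed to the user as one contiguous allocation.
data_id_t const data
  = client.allocate (8 << 20, segment, equal_distribution_t{});

// ... get/put operations on global ranges of 'data' (see 4.6.3) ...

// Exactly one free per allocation and one delete per segment
// (no reference counting is performed by the Directory/Cache).
client.free (data);
client.segment_delete (segment);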

4.6.2 Local memory

• For creating a new local cache, the API provides the methods cache_create and shareable_cache_create, with the following signatures:

cache_id_t cache_create (size_t);
shareable_cache_id_t shareable_cache_create (size_t);

These methods should be called with an argument representing the size of the cache to be created. On success, a valid cache identifier is returned (non-shareable or shareable, respectively); otherwise an exception is thrown.

• A client may request the deletion of an existing local cache by invoking the method cache_delete provided by the API. This method requires as argument the identifier of the cache to be deleted, returns no value and has the following signature:


void cache_delete (cache_id_t);

4.6.3 Operations with ranges

In this subsection, the main operations with ranges that can be performed by a Directory/Cache client are enumerated.

• A client may request the allocation of a specified amount of memory in a local cache. This operation requires as input parameters the memory size to allocate and the identifier of the local cache in which this should occur. The result of the execution of this operation is a mutable local range (of type mutable_local_range_t). This operation requires the local cache to have already been created and can be defined as a structure, as follows:

struct allocate_t
{
  allocate_t (size_t, cache_id_t);
  using result_type = mutable_local_range_t;
};

• A Directory/Cache client may inform the corresponding local server that some local range is no longer required and can therefore be released. The effect of this operation is to decrement a reference counter maintained by the local server for this range. When the reference counter becomes zero, the corresponding space may eventually be freed, or the local range can be kept in the cache and reused in subsequent get operations (to avoid future data copies, as long as it has not been invalidated). Cache eviction policies may be specified by the user. The release operation release_t applies to local ranges and is defined as a structure whose constructor takes either a const_local_range_t or a mutable_local_range_t object as parameter, as below:

struct release_t
{
  release_t (const_local_range_t);
  release_t (mutable_local_range_t);
  using result_type = void_t;
};

• For retrieving, in read-only mode, a valid copy of a global range from the global memory into local memory, the API provides the get_const_t operation. This operation requires two parameters: the global range for which a copy is requested and the identifier of the cache where the copy is to be stored. The result of the execution of an operation of this type is an object of type const_local_range_t, which cannot be modified by a client. This operation can be defined as a structure, as below:

struct get_const_t
{
  get_const_t (global_range_t, cache_id_t);
  using result_type = const_local_range_t;
};

The semantics associated with this operation is as follows: if a valid copy of the global range already exists in the specified cache, then the returned local range object simply points to this existing memory. Otherwise, a transfer of the global range from the global memory into the local cache is performed, the data is created in the local cache, and the returned object points to it.


In order to facilitate data coherence and consistency, this method performs additional bookkeeping operations in the background, such as managing data copies and the associated reference counters.
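As an illustration of how a client might drive these operation objects, the sketch below assumes a hypothetical client.execute() entry point that runs an operation and returns its result_type (the actual execution mechanism is described in section 4.6.4); the process() helper and the reuse of 'data' from the earlier allocation sketch are also assumptions.

// Hedged sketch; 'client.execute', 'process' and 'data' are assumed names.
cache_id_t const cache = client.cache_create (64 << 20);   // 64 MiB local cache

global_range_t const gr (data, range_t{0, 4096});           // first 4 KiB of 'data'

// Fetch a read-only copy; if a valid copy already exists in the cache,
// no transfer is performed and the returned range points to it.
const_local_range_t const local = client.execute (get_const_t (gr, cache));

process (local.pointer(), local.range.size);                 // read the local copy

// Decrement the reference counter; the copy may be kept in the cache for reuse.
client.execute (release_t (local));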

• For retrieving a global range from the global memory into a local cache, in write mode, the API provides the get_mutable_t operation. The purpose of this operation is to provide an exclusive local copy of the global range. The semantics is similar to that of the const variant, with the exception that a client is allowed to modify the data pointed to by the returned local range. As in the case of the const variant, the implementation will perform additional bookkeeping operations in the background relating to managing data copies and their associated reference counters. This operation requires the global range for which a copy is requested and the identifier of the cache where the copy is to be stored. The result is an object of type mutable_local_range_t corresponding to either the existing or the newly created local range. This operation can be defined as a structure, as follows:

struct get_mutable_t
{
  get_mutable_t (global_range_t, cache_id_t);
  using result_type = mutable_local_range_t;
};

Additionally, several variants of get_const and get_mutable operations can be defined for various reasons such as optimization and further abstraction of data management. In the following, some of them are enumerated.

• To avoid copying data from the global memory into a local cache when the global range is located entirely in a segment part residing on the node hosting the local server, the client can be granted direct access to that segment part in the global memory. For this case, the API provides the following versions of the get_const_t and get_mutable_t operations: get_const_preferring_direct_segment_memory_access_t and get_mutable_preferring_direct_segment_memory_access_t, respectively. These operations are defined as follows:

struct get_const_preferring_direct_segment_memory_access_t
{
  get_const_preferring_direct_segment_memory_access_t
    (global_range_t, cache_id_t);

  using result_type = const_local_range_t;
};

and

struct get_mutable_preferring_direct_segment_memory_access_t
{
  get_mutable_preferring_direct_segment_memory_access_t
    (global_range_t, cache_id_t);

  using result_type = mutable_local_range_t;
};

The semantics of these operations is similar to that of get_const_t and get_mutable_t, respectively. They behave differently when the global range resides entirely in a segment part stored on the server's host node: in this case the address of the local range points directly to a location in the segment (global memory) and not into the local cache.


However, when this condition is not met, these methods have the same behaviour as get_const_t and get_mutable_t, respectively. In that case, creating a copy in the local cache is still necessary in order to store a copy of a global range that spans multiple nodes as a contiguous space in the local cache.

• Other variants of the get_const_t and get_mutable_t operations are related to adding one or more tags with respect to the involved global range. These operations behave similarly to those mentioned above, with the only difference that a tagging mechanism is additionally used. These operations wait until the involved global range is tagged with the given tags before returning a result. They require as parameters the global range describing the location in the global memory, the identifier of the local cache in which to store the copy and a tag or a collection of tags, respectively. These operations return a result of type const_local_range_t or mutable_local_range_t, respectively, to the client. They can be defined as below:

struct get_const_with_tag_t
{
  get_const_with_tag_t ( global_range_t
                       , cache_id_t
                       , size_t tag
                       );

  get_const_with_tag_t ( global_range_t
                       , cache_id_t
                       , std::map<global_range_t, size_t> tags
                       );

  using result_type = const_local_range_t;
};

struct get_mutable_with_tag_t
{
  get_mutable_with_tag_t ( global_range_t
                         , cache_id_t
                         , size_t tag
                         );

  get_mutable_with_tag_t ( global_range_t
                         , cache_id_t
                         , std::map<global_range_t, size_t> tags
                         );

  using result_type = mutable_local_range_t;
};

• Other variants of the get methods can be defined by adding tagging to those preferring direct access to the segment memory, namely get_const_preferring_direct_segment_memory_access_t and get_mutable_preferring_direct_segment_memory_access_t. The new operations behave similarly to those mentioned above, thus allowing direct access to memory in segments when possible, while additionally having a tagging mechanism in place for the global ranges. The new operations are defined as structures, as follows:


struct get_const_preferring_direct_segment_memory_access_with_tag_t
{
  get_const_preferring_direct_segment_memory_access_with_tag_t
    (global_range_t, cache_id_t, size_t tag);

  get_const_preferring_direct_segment_memory_access_with_tag_t
    ( global_range_t
    , cache_id_t
    , std::map<global_range_t, size_t> tags
    );

  using result_type = const_local_range_t;
};

struct get_mutable_preferring_direct_segment_memory_access_with_tag_t
{
  get_mutable_preferring_direct_segment_memory_access_with_tag_t
    (global_range_t, cache_id_t, size_t tag);

  get_mutable_preferring_direct_segment_memory_access_with_tag_t
    ( global_range_t
    , cache_id_t
    , std::map<global_range_t, size_t> tags
    );

  using result_type = mutable_local_range_t;
};

• For storing data into the global memory, the Directory/Cache API provides the put_t operation. This operation writes the data described by a local range into the global memory, at a location described by a given global range. This operation will invalidate all existing local copies before effectively executing the memory transfers. The execution of this operation does not produce any result, and it throws an exception in case of failure. In the latter case, rolling back the modifications is not guaranteed, as is also the case with all other variants of the put_t operation. This operation requires as parameters the source local range (const or mutable) and the global range describing the location in the global memory at which to store the data. It can be defined as a structure, as below:

struct put_t
{
  put_t (const_local_range_t, global_range_t);
  put_t (mutable_local_range_t, global_range_t);
  using result_type = void_t;
};
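Continuing the earlier sketch, and again assuming the hypothetical client.execute() entry point and a fill() helper, a read-modify-write cycle on a global range could look as follows:

// Hedged sketch; 'client.execute', 'fill', 'gr' and 'cache' are assumed
// to come from the previous sketches.
// Fetch an exclusive, writable copy of the global range into the cache.
mutable_local_range_t const local = client.execute (get_mutable_t (gr, cache));

fill (local.pointer(), local.range.size);        // modify the local copy

// Write the copy back to global memory (invalidating other local copies),
// then release the local range so the cache slot can be reused.
client.execute (put_t (local, gr));
client.execute (release_t (local));

// put_and_release_t (described next) performs both steps atomically.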

• An atomic composition of the put_t and release_t operations, which first copies a local range from the local cache into the global memory at the location described by a global range and afterwards releases the mutable local range, allows the local range to be reused from the cache in a subsequent get call. For this purpose, the Directory/Cache API provides the operation put_and_release_t, which atomically executes a put_t and a release_t operation, returns no result and is defined as below:

struct put_and_release_t
{
  put_and_release_t ( const_local_range_t
                    , global_range_t
                    );

  put_and_release_t ( mutable_local_range_t
                    , global_range_t
                    );

  using result_type = void_t;
};

As in the case of put_t, this operation throws an exception in the case of failure.

• The put operations, like the get operations, can be tagged. The semantics of the tagged put operation are identical to those of put_t, except that a tagging mechanism is used. These tags can be used, similarly to tags in MPI's point-to-point communication, for matching gets and puts. The Directory/Cache API provides such an operation, named put_and_set_tag_t, which behaves like put_t and uses tagging for global ranges. It is defined as below:

struct put_and_set_tag_t
{
  put_and_set_tag_t ( mutable_local_range_t
                    , global_range_t
                    , size_t tag
                    );

  put_and_set_tag_t ( mutable_local_range_t
                    , global_range_t
                    , std::map<global_range_t, size_t> tags
                    );

  using result_type = void_t;
};

Compared to the put_t operation, put_and_set_tag_t requires an additional parameter, which is either a single tag or a collection of tags. As in the case of put_t, this operation throws an exception in the case of failure.
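For illustration, the following hedged sketch shows how a tagged put on one rank could be matched by a tagged get on another rank. It assumes an in-process client object named client, a cache identifier cache, a mutable local range local_range and a global range global_data_range, all created beforehand in the way shown in the use case of section 5; the tag value 42 and the use of the op namespace mirror that example and are illustrative only.

// Producer rank: publish the local range under tag 42 (sketch only).
std::vector<operation_t> producer_ops
  { op::put_and_set_tag_t (local_range, global_data_range, 42) };
client.execute_sync (producer_ops);

// Consumer rank: the get is matched against the put carrying the same
// tag on the same global range before the data is copied into the cache.
std::vector<operation_t> consumer_ops
  { op::get_const_with_tag_t (global_data_range, cache, 42) };
auto const results (client.execute_sync (consumer_ops));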

• The Directory/Cache API provides a method that atomically combines the put_t and release_t operations, with the same semantics as put_and_release_t but additionally invoking a tagging mechanism. In addition to the mutable local range to copy into the global memory at the location indicated by the second parameter (a global range), this API method requires a third parameter, which is either a single tag or a collection of tags. This operation is named put_and_release_and_set_tag_t and is defined as follows:

struct put_and_release_and_set_tag_t
{
  put_and_release_and_set_tag_t ( mutable_local_range_t
                                , global_range_t
                                , size_t tag
                                );

  put_and_release_and_set_tag_t ( mutable_local_range_t
                                , global_range_t
                                , std::map<global_range_t, size_t> tags
                                );

  using result_type = void_t;
};

4.6.4 Executing operations with ranges

The Directory/Cache API provides a method for requesting that the Directory/Cache perform a series, or bunch, of operations. This method has the following signature:

void execute_bunch
  ( std::vector<operation_t>
  , std::function<void (std::vector<operation_result_t>)>
  , std::function<void (error::multiple<>)>
  );

The first parameter is a vector of operations with ranges, similar to those enumerated in the previous subsection. The remaining parameters are callback functions invoked on success or failure, respectively. If one of the operations fails, the Directory/Cache attempts to cancel and roll back the modifications produced by the still running or already completed operations. However, whilst rolling back the modifications produced by allocations is always possible, this is not guaranteed to be supported for other types of operations. Due to the technical difficulty involved, an implementation is not required to roll back partially performed remote modifications. The local server may try to cancel outstanding operations and wait until all of them complete. The API provides variations of this method, such as the following one, which returns a future holding either a vector of operation results or the accumulated errors:

std::future<boost::variant< std::vector<operation_result_t>
                          , error::multiple<>
                          >>
  execute_bunch (std::vector<operation_t>);

Another variant of this method that waits for all operations to complete is the following:

std::vector<operation_result_t>
  execute_sync (std::vector<operation_t> operations);

The design and implementation of these API methods should follow the Command design pattern described by Gamma et al. in [8].
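As a hypothetical usage sketch of the callback-based variant, assuming a client object named client, two previously constructed operations op1 and op2, and that operation_result_t is exposed alongside operation_t, the call could look as follows:

std::vector<operation_t> ops {op1, op2};

client.execute_bunch
  ( ops
  // success callback: receives one result per submitted operation
  , [] (std::vector<operation_result_t> results)
    {
      std::cout << results.size() << " operations completed" << std::endl;
    }
  // failure callback: receives the accumulated errors of the bunch
  , [] (error::multiple<> /* errors */)
    {
      std::cerr << "bunch execution failed" << std::endl;
    }
  );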

4.6.5 Transfer costs

The clients of a Directory/Cache may request the transfer costs associated with a list of operations. For this purpose, the API provides the method transfer_costs, with the following signature:

std::vector<std::pair<rank_t, double>>
  transfer_costs (std::vector<operation_t>);

This function takes a vector of operations as argument and returns a vector of rank-cost pairs, sorted in ascending order of cost. The first pair in the returned vector therefore contains the rank of the node on which executing the given operations would incur the minimal memory transfer cost. This function is intended to help task-based runtime systems make cost-driven scheduling decisions that minimize the execution time of applications.
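A minimal sketch of how a scheduler could consume this information is given below; client, task_operations, runtime, schedule_task_on and task are hypothetical names introduced for this example, and only the first (cheapest) entry of the sorted result is used.

// Query the estimated transfer costs for the operations of a task.
std::vector<std::pair<rank_t, double>> const costs
  (client.transfer_costs (task_operations));

// The result is sorted by ascending cost, so the first entry names the
// rank on which executing the operations would be cheapest.
rank_t const cheapest_rank (costs.front().first);

// Hypothetical runtime call: place the task on that rank.
runtime.schedule_task_on (cheapest_rank, task);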


4.6.6 Data locality description

Alternatively, a Directory/Cache client may request information about the physical location of the ranges involved in a given list of operations. The Directory/Cache API provides the data_locality function for this purpose. Compared to the transfer_costs function, this call provides more detailed information about the local copies of a global range and their location. It may be invoked by trusted clients and is intended to help task-based runtime systems make scheduling decisions that exploit data locality, for example scheduling a list of operations on a rank where most copies of the involved ranges already exist. The function has the following prototype:

std::vector<data_locality_description>
  data_locality (std::vector<global_range_t>);

It takes a vector of global ranges as input and returns a vector of data locality description objects. The data locality description type is defined as follows:

struct data_locality_description
{
  global_range_t _range;
  rank_t _home;
  std::set<rank_t> _copies;
};

A data locality description structure has three data members: a global range, the rank of the home node where this range is stored in the global memory, and the set of ranks on which copies of parts of this range are stored.
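For instance, a runtime could use these descriptions to count, per rank, how many of a task's input ranges are already present there and prefer such a rank when scheduling. The sketch below is illustrative only and assumes a client object named client and a vector of global ranges named task_inputs:

std::vector<data_locality_description> const descriptions
  (client.data_locality (task_inputs));

// Count, for each rank, how many of the task's input ranges are
// already present there, either as home location or as a copy.
std::map<rank_t, std::size_t> ranges_per_rank;
for (data_locality_description const& description : descriptions)
{
  ++ranges_per_rank[description._home];
  for (rank_t const copy : description._copies)
  {
    ++ranges_per_rank[copy];
  }
}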

4.7 Directory/Cache Server Public API

4.7.1 Segment factory

The Directory/Cache server can be configured to work with a variable number of segment factories and descriptions. A segment factory type contains all the information needed to configure and initialize a communication infrastructure for the middleware used (such as GPI-2 and MPI) and to implement the operations specified in the segment interface. Segment types can be defined and implemented independently of the Directory/Cache infrastructure. In order to let the server know which segment type to use, the segment developer should create a segment description that contains the factory type to use for creating the segment. Each segment factory SegmentFactory should define a structure params that contains the specific parameters supplied by the user. A segment factory can be constructed from a user-defined set of parameters (of type SegmentFactory::params), the list of nodes on which the Directory/Cache runs, and the rank of the local server. A generic factory type SegmentFactory should overload the API function

SegmentFactory::create (size_t, SegmentDescription)

for those segment descriptions for which it can create a usable pointer to the segment interface. The design and implementation of segment factories should follow the Factory design pattern described by Gamma et al. in [8]. To instantiate a Directory/Cache server, one needs to specify the segment factories and segment descriptions to use, as in the example below, where a Directory/Cache server using the GPI-2 and MPI factories and segment descriptions is declared:


server_templ
  < factories<gaspi::context, mpi::context>
  , descriptions< gaspi::equally_distributed_segment_description
                , mpi::equally_distributed_segment_description
                >
  > dc_server;

4.7.2 Segment description

The Directory/Cache relies on a flexible architecture that allows working with an extensible list of segment types. New segment types can be developed independently and integrated in a straightforward manner, without requiring modifications to other Directory/Cache components. A segment description type should be serializable and contain information about the segment factory type to use for constructing segments. For a given segment description type SegmentDescription, the type SegmentDescription::factory_type should be a segment factory type that can be used for creating segments corresponding to this segment description.
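As an illustration of these requirements, the hedged sketch below outlines what a user-defined segment description might look like; the name block_cyclic_segment_description, the block size parameter and the serialization hook are purely illustrative, and reusing gaspi::context as factory type merely mirrors the server instantiation example above.

// Sketch of a user-defined segment description (illustrative names only).
struct block_cyclic_segment_description
{
  // the factory type the server uses to create segments of this kind
  using factory_type = gaspi::context;

  // user-supplied layout parameter
  size_t _block_size;

  // serialization hook (e.g. for Boost.Serialization), since segment
  // descriptions must be serializable
  template<typename Archive>
  void serialize (Archive& ar, unsigned int /* version */)
  {
    ar & _block_size;
  }
};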

4.7.3 Segment interface

All segment types must implement a minimal set of functionality through a common interface. For this reason, the Directory/Cache provides a segment interface class from which all segments should inherit. This is a pure abstract class exposing a number of methods for which any derived segment type must provide a specific implementation. These methods are enumerated below:

• get and put: they should implement low-level memory transfers from or into a segment using knowledge provided by the factory specified in the segment description.

• direct_const_access and direct_mutable_access: where possible, these methods return a pointer to the global data stored in the segment memory; otherwise they return a null pointer.

• home_nodes: this method should return, for a given list of generic ranges, a map type object that maps each rank to the list of subranges stored on the node having that rank.

• transfer_costs_get and transfer_costs_put: they should return a vector in which each element represents the cost of transferring the given list of ranges from/to the node whose rank equals the index of that element.

The segment_interface base class is defined as follows:

struct segment_interface
{
  virtual void get (range_t, void*) = 0;
  virtual void put (void const*, range_t) = 0;

  virtual void const* direct_const_access (range_t);
  virtual void* direct_mutable_access (range_t);

  virtual std::unordered_map< rank_t
                            , std::vector<range_t>
                            >
    home_nodes (std::vector<range_t>) const = 0;

  virtual std::vector<double>
    transfer_costs_get (std::vector<range_t>) const = 0;
  virtual std::vector<double>
    transfer_costs_put (std::vector<range_t>) const = 0;
};
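To illustrate how a new segment type plugs into this interface, the hedged sketch below outlines a derived class; the transport, layout and cost-model members and their calls are placeholders invented for this example, not part of the API, and the assumption that the non-pure accessors default to returning a null pointer follows the description above.

// Sketch of a segment type built on top of a hypothetical transport layer.
struct my_segment : segment_interface
{
  virtual void get (range_t range, void* buffer) override
  {
    _transport.read (range, buffer);        // placeholder transport call
  }

  virtual void put (void const* buffer, range_t range) override
  {
    _transport.write (buffer, range);       // placeholder transport call
  }

  virtual std::unordered_map<rank_t, std::vector<range_t>>
    home_nodes (std::vector<range_t> ranges) const override
  {
    return _layout.split_by_rank (ranges);  // placeholder layout query
  }

  virtual std::vector<double>
    transfer_costs_get (std::vector<range_t> ranges) const override
  {
    return _cost_model.estimate (ranges);   // placeholder cost estimate
  }

  virtual std::vector<double>
    transfer_costs_put (std::vector<range_t> ranges) const override
  {
    return _cost_model.estimate (ranges);   // placeholder cost estimate
  }

  // direct_const_access and direct_mutable_access are not overridden here,
  // assuming the base class default returns a null pointer when direct
  // access to the segment memory is not possible.

private:
  my_transport_t  _transport;   // hypothetical members
  my_layout_t     _layout;
  my_cost_model_t _cost_model;
};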


4.7.4 Data description

New global data can be created in the global memory by invoking the API method allocate, with the size to allocate and a data description as parameters. As with segment description types, a data description type can be defined independently and added without modifying the already implemented Directory/Cache components. This type typically describes the allocation policy to use for storing data within a list of free slots in the global memory (such as prefix-filled or equally-distributed). The data description is a parameter supplied by the client when requesting the creation of global data. A data description type must be a serializable type that is capable of determining the actual distribution of data in the logical segment. For this purpose, the API specifies the method determine_distribution, which a data description type must implement; it takes as parameters the amount of memory requested and the list of free slots in the segment, and has the following signature:

distribution_t determine_distribution
  ( size_t requested
  , std::map<offset_t, size_t> const& free_segment_slots
  ) const;

The method returns an array of ranges allocated within the corresponding logical segment. More complex data description types may require knowledge about segments.
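A hedged sketch of such a method for a simple prefix-filled policy is given below; it assumes that distribution_t is a sequence of range_t values, each constructible from an offset and a size, which may differ from the types used by an actual implementation, and other required members of the data description (such as serialization) are omitted.

// Sketch: fill the free slots in offset order until the request is satisfied.
struct prefix_filled_data_description
{
  distribution_t determine_distribution
    ( size_t requested
    , std::map<offset_t, size_t> const& free_segment_slots
    ) const
  {
    distribution_t distribution;

    for (auto const& slot : free_segment_slots)
    {
      if (requested == 0)
      {
        break;
      }
      // take as much of this slot as is still needed
      size_t const taken (std::min (requested, slot.second));
      distribution.emplace_back (range_t {slot.first, taken});
      requested -= taken;
    }

    if (requested != 0)
    {
      throw std::runtime_error ("not enough free memory in the segment");
    }

    return distribution;
  }
};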


5 Directly Using the Directory/Cache: Jacobi Use Case Example

This section intends to further illustrate the API as described in section 4 with an example. Many numerical applications involve the solving of a PDE, such as the Poisson equation [6]. In two dimensions, with $u$, $f$, $g$ functions defined on a real bidimensional coordinate space and the Dirichlet boundary condition $u = g$, this equation requires finding solutions for

$$\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = f$$

When $f$ is the null function this is also known as the Laplace equation. Solving this equation indirectly, for instance using an iterative method such as Jacobi, is commonplace. On a two dimensional mesh, solving the Laplace equation with this method involves calculating the values $u_{i,j}$ at each step from the values $u_{i+1,j}$, $u_{i-1,j}$, $u_{i,j+1}$, $u_{i,j-1}$ of the neighbouring points in the previous step:

while (residual > ε) {
    for (i = 1; i <= N; i++) {
        for (j = 1; j <= N; j++) {
            u^{(k)}_{i,j} = 1.0/4 × ( u^{(k-1)}_{i+1,j} + u^{(k-1)}_{i-1,j}
                                    + u^{(k-1)}_{i,j+1} + u^{(k-1)}_{i,j-1} );
        }
    }
    residual = compute_residual();
}

A common way to parallelize this algorithm in two dimensions is to decompose the initial $N \times N$ matrix of values into $P \times Q$ submatrices (where $P$ and $Q$ are the dimensions of the rectangular mesh of processes used); each process then performs computations on its local submatrix. This parallelization approach requires each process to store, in addition to its local submatrix, the bordering lines or columns of its neighbours. With a stencil of depth one (as is the case for the Laplace equation) this involves providing the last column of the local matrix of the left neighbour, the first column of the right neighbour, the first line of the lower neighbour and the last line of the upper neighbour. For this purpose, each process allocates an extended matrix consisting of the original local submatrix and two additional rows and columns (called halos) for storing the borders. At each step, before carrying out the local computation, each process must exchange its borders with its neighbours, an operation commonly known as a halo swap. In the following, we assume that the submatrix assigned to each process has n_local_rows rows and n_local_cols columns. As an even distribution of the initial matrix over the processes is not always possible, some processes may have one additional row or column compared to the others. We denote by max_n_local_rows and max_n_local_cols the maximum over all processes of the local number of rows and columns, respectively.

Implementing this algorithm directly using the Directory/Cache is straightforward, as the application developer has only to deal with memory ranges and is not concerned with


explicitly programming the communication between processes. At program startup, each process must instantiate a local client that connects to a corresponding local Directory/Cache server. A segment of appropriate size must be created using the API function segment_create. Each client should then create a local cache of appropriate size, using one of the described API functions for creating caches. This is illustrated in the code snippet below, where an MPI segment equally distributed amongst the nodes is created. Using a GPI-2 segment instead is straightforward: one replaces the MPI segment description with a GASPI segment description (commented out in the code snippet). Caches, local to each node, with enough space allocated for storing the halos are also created.

vmem::server server (……);
vmem::in_process_client client (server);

vmem::size_t segment_size
  (2 * nranks * (max_n_local_rows + max_n_local_cols) * sizeof (double));

vmem::segment_id_t const segment
  ( collective.segment_create
      ( segment_size
      , vmem::mpi::equally_distributed_segment_description()
      //, vmem::gaspi::equally_distributed_segment_description()
      )
  );

vmem::size_t cache_size
  (2 * (max_n_local_rows + max_n_local_cols) * sizeof (double));

vmem::cache_id_t const cache
  (client.in_process_cache_create (cache_size));

Due to the non-contiguous nature of the data in one dimension, this simple example uses a separate memory area for communicating the halo data, as it might need to be packed on the source neighbour and unpacked on the target neighbour. Therefore, after the segment has been defined, each process creates four global data items in the global memory using the API's collective allocate method; these store the halos of a neighbouring process in each of the four directions, as in the code snippet below. Here nranks is the number of server instances, max_n_local_rows and max_n_local_cols are the maximum number of local rows and columns assigned to any process (as explained above), and segment is the identifier of the segment in which the allocation should be done:

collective.allocate  // from left neighbour
  ( vmem::size_t (nranks * max_n_local_rows * sizeof (double))
  , segment
  , vmem::prefix_filled_data()
  );

collective.allocate  // from right neighbour
  ( vmem::size_t (nranks * max_n_local_rows * sizeof (double))
  , segment
  , vmem::prefix_filled_data()
  );

collective.allocate  // from up neighbour
  ( vmem::size_t (nranks * max_n_local_cols * sizeof (double))
  , segment
  , vmem::prefix_filled_data()
  );

collective.allocate  // from down neighbour
  ( vmem::size_t (nranks * max_n_local_cols * sizeof (double))
  , segment
  , vmem::prefix_filled_data()
  );

In addition, each process should initially allocate memory in the local cache for storing the (mutable) local ranges corresponding to the borders shared with the neighbouring processes, using allocate_t range operations. This is illustrated in the code snippet below, where enough cache space to hold n_local_rows double-precision values for the left and right neighbours and n_local_cols double-precision values for the up and down neighbours is requested:

std::vector<vmem::in_process_client::operation_t> req
  { vmem::op::allocate_t (n_local_rows * sizeof (double), cache)
  , vmem::op::allocate_t (n_local_rows * sizeof (double), cache)
  , vmem::op::allocate_t (n_local_cols * sizeof (double), cache)
  , vmem::op::allocate_t (n_local_cols * sizeof (double), cache)
  };

client.execute_sync (req);  // execute the allocation operations

The Jacobi iteration itself then proceeds in a loop: borders are exchanged between neighbours (the halo swap) at the start of each iteration, and local computations are then performed, which involves progressing the current solution and calculating the relative residual used for termination once a specific accuracy has been achieved. To implement the halo swap, each process performs at most four put_and_release_t operations from the local cache into the global memory (assuming boundary conditions are not periodic, processes on the boundaries of the global domain have fewer neighbours). Each process packs the appropriate rows/columns of its submatrix into the cache, and the put_and_release_t operations then update the global ranges corresponding to the neighbours. For brevity, the code snippet below illustrates these calls for a single neighbour only (up), where local_range corresponds to the local range for that specific neighbour, for which memory is assumed to have already been allocated in the cache. Data is copied into the local cache from the src matrix and the target location in the global range is determined by the offset. A request is then made to copy the local cache data into the global range (and to release it in the cache); these requests are stored in the puts vector, and once all requests have been made the client waits for their completion via the execute_sync call.

std::vector<vmem::in_process_client::operation_t> puts;

memcpy
  ( local_range.pointer()
  , &src[local_row_size + 1], n_local_cols * sizeof (double)
  );

vmem::offset_t offset
  (max_n_local_cols * rank_up_neighbour * sizeof (double));

puts.emplace_back
  ( vmem::op::put_and_release_t
      ( local_range
      , {global_data[UP], {offset, local_range.range.size}}
      )
  );

………
client.execute_sync (puts);

Once all neighbours have packed their data into the target locations in the global ranges, each process performs a get_mutable_t operation in order to retrieve the data from the global memory into the local cache and unpack it into the halos of the local submatrix. The code snippet below again illustrates this for one neighbour only. The gets vector holds the data retrieval requests; for each neighbour a global range is created that corresponds to the target data in the global memory, and a request is made to retrieve it into the cache via a get_mutable_t operation. The client then waits for all requests in the gets vector to complete before copying the newly arrived data from the cache into its submatrix and proceeding with the computation.

std::vector<vmem::in_process_client::operation_t> gets;

vmem::offset_t offset
  {client.rank() * max_n_local_cols * sizeof (double)};

intertwine::vmem::global_range_t const global_range
  { global_data[DOWN]
  , {offset, vmem::size_t {n_local_cols * sizeof (double)}}
  };

gets.emplace_back
  (vmem::op::get_mutable_t (global_range, cache));

……
client.execute_sync (gets);

Once the Jacobi algorithm has completed, the cache is freed and the segment deleted, as in the code snippet below.

client.cache_delete (cache);
collective.segment_delete (segment);

From this example it can be seen that the programmer works with a model of global, contiguous memory, which entirely abstracts away the lower-level details of data retrieval and movement. As has been discussed, an important interoperability advantage of the Directory/Cache is that changing the segment type, and thus the underlying transport layer, is trivial: one simply provides the appropriate segment description as argument to the segment_create function. In the example above, the user may choose either GPI-2 or MPI (RMA) as the underlying transport layer by passing a segment description of type vmem::gaspi::equally_distributed_segment_description (using GPI-2 as the transport layer for the data movement) or vmem::mpi::equally_distributed_segment_description (using MPI (RMA) for the data movement), preserving the application logic and without having to explicitly program the communication.


6 Coupling the Directory/Cache with Task-Based Runtime Systems

As already stated in the previous sections, the primary "users" of the Directory/Cache are considered to be the task-based runtime systems. After coupling with the Directory/Cache service, the users of the task-based runtime systems should be able to run their applications with little or no modification. The main benefit is in providing end users with simultaneous access to various types of memory and segment implementations, at the expense of minimal modifications to their applications. This involves hiding most of the Directory/Cache calls behind the runtimes' APIs. However, replacing the currently used communication patterns, which rely on explicit library calls, may require significant refactoring on the part of the runtime systems. Below (Figure 7), some aspects related to coupling StarPU with the Directory/Cache service are described.

Interfacing StarPU with INTERTWinE's Directory/Cache involves redesigning its distributed API. The current distributed shared-memory (DSM) implementation shipped with StarPU is strongly MPI oriented, making heavy use of MPI's send/receive model and of MPI features such as user-defined datatypes. Thus, a new implementation of StarPU's DSM layer is in preparation to better fit the model exposed by the Directory/Cache service, built on the notions of segments and caches.

In contrast to MPI, which transfers data under the sole control of StarPU, the Directory/Cache service itself plays an active role in data management, data locality and consistency management. Thus, one of the goals for the redesigned implementation of the StarPU DSM layer is to ensure that the StarPU view of data management and the Directory/Cache view remain coherent with each other. Moreover, some language binding is required, since the Directory/Cache service is developed using a templated C++11/Boost software design approach, while StarPU only uses plain C internally and for its main API. The new StarPU DSM implementation therefore aims to provide such C bindings.

Figure 7: StarPU-DSM big picture, compared to the StarPU-MPI layer.


7 References

[1] OmpSs Specification Document. Accessed on February 13, 2017. Available at: https://pm.bsc.es/ompss-docs/specs/

[2] StarPU Handbook. Accessed on February 22, 2017. Available at: http://starpu.gforge.inria.fr/doc/html/

[3] GASPI Specification. Accessed on February 22, 2017. Available at: http://www.gaspi.de/

[4] Message Passing Interface Standard. Accessed on February 22, 2017. Available at: http://mpi-forum.org/docs/

[5] GPI-2 Documentation. Accessed on February 22, 2017. Available at: http://www.gpi-site.com/gpi2/

[6] Discrete Poisson Equation. Accessed on February 28, 2017. Available at: https://www.cfd-online.com/Wiki/Discrete_Poisson_equation

[7] PaRSEC: Parallel Runtime Scheduling and Execution Controller. Accessed on March 1, 2017. Available at: http://icl.utk.edu/parsec/

[8] Erich Gamma, Richard Helm, Ralph Johnson, John Vlissides: Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley, 1995, ISBN 0-201-63361-2.

