
PDR.02.01 Compute Platform Element Subsystem Design

Document number: SKA-TEL-SDP-0000018
Context: COMP
Revision: 1.0
Author: P.C. Broekema
Release Date: 2015-02-09
Document Classification: Unrestricted
Status: Final


Name: Chris Broekema
Designation: COMP Team Lead
Affiliation: ASTRON
Signature & Date: P.C. Broekema (Feb 9, 2015)
Email: [email protected]

Name: Paul Alexander
Designation: SDP Project Lead
Affiliation: University of Cambridge
Signature & Date: Paul Alexander (Feb 9, 2015)
Email: [email protected]

Version   Date of Issue   Prepared by      Comments
1.0       2015-02-09      P. C. Broekema

ORGANISATION DETAILS
Name: Science Data Processor Consortium


Table of Contents

List of Figures
List of Tables
References
    Applicable documents
    Reference documents
Introduction
    Purpose of this document
    Scope of this document
    Assumptions made in this document
    Functional decomposition
SDP requirements and constraints
    Computational requirements dictated by science objectives
    Constraints
        Capital constraints
        Power constraints
    L1 and L2 requirements
SDP architecture
    Architectural design principles
    Top-level architecture
    Data flow model
    Compute Island
    SDP scaling
    Science Archive
Computational efficiency
Roll-out schedule
Data transport
    Data transport bandwidth requirements
    Top-level network architecture
    Bulk data transport network design
    Ingest Processing
    High-performance, low-latency interconnect architecture


    Management, monitoring and control network
    Data reordering
        In-network reordering
        Intra-island reordering
        Inter-island reordering
    Software-defined networking
    Combining Bulk Data Network with Low-latency Network
Compute node model
    From performance model to design characteristics
    Baseline model - current-day technology
Storage model
    Intermediate buffer
    Science Archive
    Mirror science archive
Software stack
    Operating system
    Middleware
        Messaging layer
        Logging system
        Platform management system
        System optimisation
    Archive HSM Software
    Application development environment and software development kit
    Scheduler
SDP infrastructure
Data delivery platform hardware
LMC system hardware architecture
Suitability and scalability of the architecture
Sub-element risks
Requirement traceability


List of Figures

Figure 1: SDP Hardware compute platform product sub-tree
Figure 2: SDP Software compute platform product sub-tree [AD10]
Figure 3: The Platform Management function is the only function assigned to the compute platform.
Figure 4: SDP compute platform context diagram.
Figure 5: Top-level overview of the SKA Science Data Processor functions.
Figure 6: Top-level Logical Data Flow Diagram for the SDP pipelines, showing dependencies and interactions between the different pipelines.
Figure 7: The SDP data flow.
Figure 8: The SDP Compute Island concept, showing the various components in an island.
Figure 9: SDP scaling. Each telescope SDP consists of a number of Compute Islands that are built up from a number of Compute Nodes.
Figure 10: SDP compute distribution for the three SKA telescopes.
Figure 11: cuFFT performance on Nvidia Tesla K40c [RD12].
Figure 12: Performance analysis of John Romein's gridding algorithm, from [RD10].
Figure 13: The public Nvidia roadmap up to 2016 [RD13].
Figure 14: Preliminary timeline for SDP construction for the three telescopes [AD12].
Figure 15: Top-level SDP network design.
Figure 16: A potential SDP compute node model implementation using current-day hardware.
Figure 17: Double buffering in an SDP Compute Island.
Figure 18: The SDP software compute platform middleware product subtree.
Figure 19: Overview of the SDP Hierarchical Storage Manager.

List of Tables

Table 1: Energy budgets for the three telescope SDPs.
Table 2: Computational FFT efficiency on both CPU and GPU, from [RD09].
Table 3: Hardware specifications of the platforms used in the gridding analysis presented in [RD10].
Table 4: Input data transport bandwidth requirements for the three telescopes.
Table 5: Performance requirements for the SKA1 baseline, including baseline-dependent averaging.
Table 6: Performance requirements per achieved TFLOPS.
Table 7: SKA1 node characteristics, assuming 700 GFLOPS achieved computational capacity.


References

Applicable documents

The following documents are applicable to the extent stated herein. In the event of conflict

between the contents of the applicable documents and this document, the applicable

documents shall take precedence.

Reference Number Reference

PDR.01 / [AD01] SKA.TEL.SDP-0000002 - SKA preliminary SDP architecture and System Description

PDR.05 / [AD02] SKA-TEL-SDP-0000003 – SDP Performance Models – PDR.05

PDR.02 / [AD03] Sub-element design: Data Delivery

PDR.02 / [AD04] Sub-element design document: LMC

PDR.04 / [AD05] Interface Requirements (Ext ICDs) - PDR.04

[AD06] SKA-TEL.SDP.SE-TEL.CSP.SE-ICD-001 SKA1 Interface Control document SDP to CSP

[AD07] SKA-TEL.SADT.SE-TEL.SDP.SE-ICD-001 Interface Control document SADT to SDP

PDR.01.01 / [AD08] SKA-TEL-SDP-0000014 ASSUMPTIONS AND NON-CONFORMANCE

[AD09] SKA-TEL-SKO-0000035 SKA1 POWER BUDGET

PDR.03 / [AD10] Requirements Analysis & Allocations

PDR11 / [AD11] Preliminary Element Integrated Logistics Support Plan

PDR.08 / [AD12] PRELIMINARY PLAN FOR CONSTRUCTION

[AD13] SKA-TEL.SDP.SE-TEL.INFRA.SE-ICD-001 SKA1 Interface Control document SDP to INFRA-AUS and INFRA-SA

[AD14] SKA-TEL-SDP-0000027 Pipelines Element Subsystem Design


[AD15] SKA-TEL-SDP-0000054, SKA-TEL-SDP-0000053 Prototyping and Development Plans

[AD16] SKA-TEL-SDP-0000028 Parametric Modelling of the ingest pipeline

Reference documents

The following documents are referenced in this document. In the event of conflict between the

contents of the referenced documents and this document, this document shall take

precedence.

Reference Number Reference

[RD01] SKA-TEL-SDP-0000007-glossary-PDR16

[RD02] SKA-TEL-SDP-COMP-MEMO-010 Networking in LOFAR and how a software-defined network may improve robustness and flexibility

[RD03] SKA-TEL-CSP-0000113 SKA-TEL.CSP.CBF.SUR Sub-element Prototype Test Report (ProtoTestReport-SUR) – Part 1 Data Compression

[RD04] SKA-TEL-SDP-0000019 Compute platform: Hardware alternatives and developments

[RD05] SKA-TEL-SDP-0000020 Compute platform: Software stack developments and considerations

[RD06] SKA-TEL-SDP-0000021 Improving sensor network robustness and flexibility using software-defined networks

[RD07] SKA-TEL-SDP-0000022 Compute platform: Standardisation

[RD08] Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems - Amar Phanishayee et al. http://www.cs.cmu.edu/~dga/papers/incast-fast2008

[RD09] FFT Analysis - Stefano Salvini

[RD10] An Efficient Work-Distribution Strategy for Gridding Radio-Telescope Data on GPUs - John W. Romein


[RD11] SKA SDP Performance Model - https://github.com/SKA-ScienceDataProcessor/sdp-par-model

[RD12] https://developer.nvidia.com/cuFFT (23-01-2015)

[RD13] http://www.anandtech.com/show/7900/nvidia-updates-gpu-roadmap-unveils-pascal-architecture-for-2016 (23-01-2015)

[RD14] http://en.wikipedia.org/wiki/List_of_Intel_CPU_microarchitectures

[RD15] http://en.wikipedia.org/wiki/Skylake_%28microarchitecture%29

[RD16] http://en.wikipedia.org/wiki/DDR4_SDRAM

[RD17] http://www.extremetech.com/extreme/171678-intel-unveils-72-core-x86-knights-landing-cpu-for-exascale-supercomputing

[RD18] http://goparallel.sourceforge.net/intel-reveals-details-of-next-gen-xeon-phis/

[RD19] http://www.infinibandta.org/content/pages.php?pg=technology_overview

[RD20] http://www.ieee802.org/3/

[RD21] http://en.wikipedia.org/wiki/Phase-change_memory

[RD22] http://en.wikipedia.org/wiki/Memristor

[RD23] http://investors.micron.com/releasedetail.cfm?ReleaseID=692563

[RD24] http://ark.intel.com/products/75272/Intel-Xeon-Processor-E5-2660-v2-25M-Cache-2_20-GHz

[RD25] http://www.nvidia.com/object/tesla-servers.html

[RD26] http://www.hgst.com/solid-state-storage/enterprise-ssd/sas-ssd/ultrastar-ssd1600mr

[RD27] http://www.intel.com/content/www/us/en/network-adapters/converged-network-adapters/ethernet-x520.html

[RD28] http://www.mellanox.com/page/products_dyn?product_family=119&mtag=connectx_3_vpi

[RD29] https://perf.wiki.kernel.org/index.php/Main_Page

[RD30] https://www.docker.com/

[RD31] http://www.wdc.com/wdproducts/library/SpecSheet/ENG/2879-771442.pdf


[RD32] SKA-TEL-SDP-0000046 Costs: Basis of Estimate


Introduction

Purpose of this document

This document is part of the preliminary design review (PDR) documentation for the Square Kilometre Array Science Data Processor (SKA SDP) element. It provides the compute platform sub-system design, which includes all hardware and all software required to efficiently use and develop for that hardware (i.e. operating systems, middleware, deployment software and the development environment).

In terms of the SDP product tree, this document describes the hardware compute platform (C.1)

and the software compute platform (C.2), both of which are shown below.

Figure 1: SDP Hardware compute platform product sub-tree

Figure 2: SDP Software compute platform product sub-tree [AD10]

This document consists of two major components. In the first few chapters we specify the top-level compute platform architecture; in the second half of the document we verify the validity of this architecture by describing a baseline implementation using current-day hardware.

Scope of this document

This document implements the high-level architectural design presented in [AD01]. As a basis

for a detailed design, it uses the performance models and their derived design equations that

are elaborated in [AD02]. Several supporting documents accompany this document, describing

in more detail software-defined networks and the application of these in a radio telescope


[RD06], alternative hardware implementations and expected future developments [RD04],

software considerations and developments [RD05] as well as standardisation [RD07].

Assumptions made in this document

This document follows many of the assumptions made in [AD08], the most important of which are:

● No drastic measures are taken to ensure availability. The assumption is that the availability budget is unrealistic and will change.

● For SDP sizing we assume 25% efficiency and continued Moore’s law scaling for compute, storage, memory, and so on (arguably Kryder’s law, Koomey’s law, etc.).

● We assume that the initial reordering of data for ingest/flagging can largely be done in-network.

● There are no L1 or L2 requirements for SDP to stay within a given energy, capital, or operational budget. We assume the guidance given by the SKAO is in fact a requirement. Note that SDP has deliberately been left out of the SKA1 power budget [AD09].

This document introduces an architecture that is capable of fulfilling the requirements of the

SKA SDP. At the time of writing, the SDP performance models indicate that the current baseline

design exceeds the allocated power and capital budgets. It is very likely that instead the SDP

will have to be built to a capital or energy budget. We therefore introduce a scalable and flexible

architecture that can accommodate changes in budget or capability that may be required to

meet such constraints.

Functional decomposition

The vast majority of the compute platform consists of services provided to major functional components that are covered by other parts of the system. The only function the compute platform itself provides is platform management, which includes platform state management and deployment (see the figure below). Platform management contains functions to deploy the large number of nodes in the SDP and provides platform state information to be included in the observation metadata.
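As a purely illustrative example of the kind of platform state information that could accompany the observation metadata (the field names and structure below are hypothetical, not an agreed SDP schema):

```python
# Hypothetical example only: per-node platform state that might be attached to
# the observation metadata. Field names and structure are illustrative, not an
# agreed SDP schema.
platform_state = {
    "island": 17,
    "node": "ci17-n042",                   # hypothetical node identifier
    "deployed_image": "sdp-node-2015.02",  # software image rolled out by deployment
    "status": "online",
    "buffer_utilisation": 0.63,            # fraction of the intermediate buffer in use
}


def summarise(states: list) -> dict:
    """Aggregate node states into the summary recorded with an observation."""
    online = sum(1 for s in states if s["status"] == "online")
    return {"nodes_reporting": len(states), "nodes_online": online}


print(summarise([platform_state]))  # {'nodes_reporting': 1, 'nodes_online': 1}
```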


Figure 3: The Platform Management function is the only function assigned to the compute

platform.

SDP requirements and constraints

Computational requirements dictated by science objectives

While the total computational load is important, the required capacity of the SDP is better expressed as a set of ratios between its components: for each byte of input data, n double-precision floating-point operations are needed, requiring m MB of memory. The scale of the Science Data Processor is determined by the total computational capacity required, while these ratios drive the design and selection of the components that make up the SDP. The scaling considerations are discussed in more detail below.

Constraints

Capital constraints

No capital budget has yet been assigned for the three telescope SDPs, but once one is assigned, it is important to note that it will be shared between hardware, software (both domain-specific and generic) and staff. Although in our current cost model the vast majority of the SDP budget is spent on hardware, this is not a realistic scenario, as explained in [AD08]. Based on best practice, it is expected that less than 50% of this budget will be spent on hardware, with most of the remainder consumed by software development and staff.

Power constraints

The SKAO has proposed an electrical power cap for each of the three telescopes’ SDP (e-mail

communication, 14 Aug 2014), shown in the table below. Two figures are given for each: a likely

power limit and a “best case” power limit. The “likely” power limit is based on an overall power

budget that can be considered realistic given the current state of the design. The “best case”

power limit is the absolute highest power that the SDP will have available if all current unknowns

are replaced by the most favourable assumptions. These power limits are those measured at


the building entrance, i.e. including cooling, losses, auxiliaries, etc. It is important to note that many of the auxiliary components that consume energy, such as cooling, are outside the scope of the Science Data Processor; these are instead the responsibility of the INFRA consortium.

                             SKA1 Mid   SKA1 Low   SKA1 Survey
Likely Power Limit (MW)        2.5        0.75        2
Best Case Power Limit (MW)     5          1.5         4

Table 1: Energy budgets for the three telescope SDPs.

L1 and L2 requirements

The L1 and L2 requirements assigned to the compute platform are listed at the end of this

document. Traceability is shown by referencing the relevant sections of this document or, where applicable, other PDR documents.

SDP architecture

The figure below shows the compute platform context, including both the hardware and software.


Figure 4: SDP compute platform context diagram.

Architectural design principles

To achieve the scalability required in our system, we adopt a highly modular design approach.

The intention is that the Science Data Processor concept should hold for any scale, within

reasonable limits. Scaling is discussed in more detail below, in a dedicated section.

The primary requirement on the SDP is to perform a specific job: turn correlator products into

science-ready data. While a number of different observation modes have been defined, these

follow broadly the same processing model. Contrary to conventional HPC systems, which need to support a wide range of applications with differing requirements, we can adopt a highly workload-optimised system design approach, tailoring the SDP design to our specific application. This is expected to enable more efficient use of the system and to reduce both capital and operational costs. This tailored approach drives the design decisions made in this document.


Since the SDP is defined by its data flow, we adopt this data flow, and in particular the efficient

and affordable way to handle that data flow, as our primary design consideration. To ensure

sufficient parallelism we allow significant reordering of data.

Top-level architecture

The SDP is required to perform a number of distinct processing steps in order to produce high-quality science data from raw CSP data. It performs these steps under the auspices of TM, which interacts with the LMC component to steer the computation and data flow. This document provides a potential realisation of an SDP computational environment, based on parametric modelling, which has been used to produce a “Costed Hardware Concept” [RD32]. Where appropriate we suggest alternatives to the solutions described here, which will be analysed through to CDR.

The main stages of the SDP are illustrated in Figure 5 below. Five distinct processing stages are identified, namely:

● Ingest

● Buffer

● Pipeline process

● Archive

● Delivery


Figure 5: Top-level overview of the SKA Science Data Processor functions.

The processing to be performed by the SDP is defined by a series of pipelines [AD14]. Each Pipeline consists of multiple Components, and Components may be part of multiple Pipelines. The implementation of a Component may potentially differ depending on the type of Pipeline. Different Pipelines may run in series and/or in parallel, and the output from one Pipeline may serve as input for another; see Figure 6. This allows, for example, for Commensal Observations, or External Calibration Observations, where Calibration solutions from one Observation are applied to another.

The interface between different components of the data processing pipelines will always go through the data layer and can take two physical forms:


● Non-streaming (using disk buffers), e.g. for the continuum pipeline and the spectral line pipeline.

● Streaming, e.g. for the ingest pipeline and the fast transient pipeline.

The components will be unaware of this difference, because the data layer will abstract it away. During the Data Flow setup stage (using directives from the pipelines), it is determined which form of communication will be used. Thereafter the processing software is agnostic to it.
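As a minimal sketch of this abstraction, assuming a hypothetical DataLayer interface (the class and method names below are illustrative only and do not represent the SDP data layer API):

```python
# Minimal, illustrative sketch only: a hypothetical data-layer abstraction that
# hides whether a component talks to a disk buffer or to a stream.
from abc import ABC, abstractmethod


class DataLayer(ABC):
    """Hypothetical interface: components read and write data without knowing the transport."""

    @abstractmethod
    def write(self, chunk: bytes) -> None: ...

    @abstractmethod
    def read(self) -> bytes: ...


class BufferedDataLayer(DataLayer):
    """Non-streaming form: chunks pass through a disk buffer."""

    def __init__(self, path: str):
        self.path = path

    def write(self, chunk: bytes) -> None:
        with open(self.path, "ab") as f:
            f.write(chunk)

    def read(self) -> bytes:
        with open(self.path, "rb") as f:
            return f.read()


class StreamingDataLayer(DataLayer):
    """Streaming form: chunks are handed directly to the next component."""

    def __init__(self):
        self._chunks = []

    def write(self, chunk: bytes) -> None:
        self._chunks.append(chunk)

    def read(self) -> bytes:
        return b"".join(self._chunks)


def component(data: DataLayer) -> int:
    """A pipeline component only sees the DataLayer interface."""
    return len(data.read())


# The data-flow setup stage decides which concrete form is wired in.
stream = StreamingDataLayer()
stream.write(b"visibilities")
print(component(stream))  # 12
```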

The figure below shows the interactions and dependencies between the various pipelines that are currently foreseen. These pipelines are defined in the sections below. In order to keep the diagrams clear, the interaction with Local Monitoring and Control (LMC) is not shown. In principle, however, each pipeline component will have a bi-directional interaction with LMC for controlling and monitoring the component. Data Quality components are also not shown. In practice Data Quality will be part of every Pipeline, with the measure of Data Quality, as defined by the required metrics, fed back into LMC.

Figure 6: Top level Logical Data Flow Diagram for the SDP pipelines. Shown are dependencies

and interactions between the different pipelines.


From Figure 6 we can see the following (a small illustrative sketch of this ordering is given after the list):

● The Ingest Pipeline (distinct from ingest per se) takes the uv-data from CSP and metadata from TM and delivers visibility data in various resolutions, depending on the Pipeline that follows up on it. This means that potentially the same data is written multiple times in different resolutions.

● The Spectral Line Pipeline will always run AFTER the Continuum Pipeline, because of the dependency on the Calibration solutions.

● The Slow Transients Pipeline is defined such that it runs AFTER the Real-time Calibration Pipeline, and again after the Ingest Pipeline.

● Science Analysis is not yet further detailed, but consists of Components like Source Finding, RM-Synthesis, Stacking, etc.

● The Ingest Pipeline, the Real-time Calibration Pipeline, and the Fast Imaging (for Slow Transients) Pipeline all have to run in real-time.
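As a purely illustrative sketch of the ordering implied by Figure 6 (the pipeline names and dependencies are paraphrased from the list above; this is not an SDP scheduling interface):

```python
# Illustrative only: pipeline dependencies implied by Figure 6.
# A pipeline runs after every pipeline listed as its dependency.
PIPELINE_DEPENDENCIES = {
    "ingest": [],
    "real_time_calibration": ["ingest"],
    "fast_imaging_slow_transients": ["ingest", "real_time_calibration"],
    "continuum": ["ingest"],
    "spectral_line": ["continuum"],  # needs the continuum calibration solutions
    "science_analysis": ["continuum", "spectral_line"],  # assumed, not yet detailed
}


def run_order(deps: dict) -> list:
    """A simple topological ordering; an illustration, not a scheduler."""
    order, done = [], set()
    while len(done) < len(deps):
        for name, requirements in deps.items():
            if name not in done and all(r in done for r in requirements):
                order.append(name)
                done.add(name)
    return order


print(run_order(PIPELINE_DEPENDENCIES))
```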

The properties of the pipelines, in terms of their computational needs and data requirements, will drive the characteristics of the SDP. In general, we observe that the SDP workload is highly parallel in nature. Experience with precursor and pathfinder instruments has shown that using frequency as the primary parallelisation dimension results in a highly independent, embarrassingly parallel system for the vast majority of applications. This observation forms the basis for our workload-optimised system design, although we do, as mentioned above, indicate where further analysis may lead to alternative solutions or refinements. In particular, further analysis of the pipelines will provide information on the trade-offs between memory size and performance, between interconnect performance and dimensioning (ingest-to-buffer versus compute-node-to-compute-node), and between floating-point performance and power. Such considerations will form part of the prototyping and development plan [AD15] through to CDR.

Data flow model

Moving data costs significant amounts of energy. We therefore design the SDP to minimise the

(inherently large) flow of data. Data flow is directed so that all subsequent processing requires

little or no additional (long-range) communication.

Data from the CSP is handled by the SDP switch stack at ingest. The switch stack will be an

interconnected, but probably not fully non-blocking Ethernet system, and will distribute data to

the SDP Compute Islands (see below). The switch stack increases the SDP’s capital cost but

adds flexibility and resilience, since we can route data around failed nodes, should the need

arise. The switch stack also allows in-network reordering of data, which is an essential

component of our architectural design principles. Finally, the switch infrastructure allows

Software Defined Networking, described in more detail below and in [RD02], to offer

unprecedented flexibility in our data flow.

Each SDP Compute Island is a collection of one or more highly interconnected nodes. An island is capable of handling end-to-end processing of a chunk of data, without having to


communicate with neighbours.

Figure 7: The SDP data flow.

Re-distribution of the data streaming from the CSP is the responsibility of the Data Flow Manager (described in [AD04]). At a high level, Compute Islands can be seen as subscribing to data flows from CSP correlator entities. Every “entity” produces a number of data streams, each representing a fixed chunk of uvw-space. Each Compute Island is responsible for a (potentially different) subset of uvw-space by subscribing to these CSP streams, as directed by the Data Flow Manager.
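A minimal sketch of such a subscription scheme is shown below; the stream identifiers and the round-robin assignment are illustrative assumptions, not the Data Flow Manager mechanism defined in [AD04]:

```python
# Illustrative sketch only: Compute Islands subscribing to CSP streams, each
# stream carrying a fixed chunk of uvw-space. Stream identifiers and the
# round-robin policy are hypothetical.
from collections import defaultdict


def assign_streams(num_islands: int, stream_ids: list) -> dict:
    """Round-robin assignment of uvw-space chunks (streams) to islands."""
    subscriptions = defaultdict(list)
    for i, stream in enumerate(stream_ids):
        subscriptions[i % num_islands].append(stream)
    return dict(subscriptions)


# Example: 4 islands subscribing to 8 hypothetical correlator-entity streams.
streams = [f"csp-entity-0/uvw-chunk-{n}" for n in range(8)]
print(assign_streams(4, streams))
```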

The figure above provides a schematic view of the SDP data flow. While it shows switches at

the ingress and egress points of the Compute Islands, these may not be physically distinct

components. A cost-saving measure may be to have one shared network handling both ingress

and egress. Although the highly unidirectional nature of our data flow does allow this, a shared

infrastructure may cause performance issues. This in turn may cause the data flow from CSP to

drop packets.


Compute Island

A Compute Island (which is an overloaded term in HPC, and may need renaming) is the basic

replicable unit in the SKA SDP. A Compute Island is a self-contained, independent collection of

compute nodes.

Ideally, a Compute Island only processes data that is contained in the island itself. Some applications, such as multi-frequency synthesis, require a number of gathers to be performed before end products can be combined. The initial analysis, in the case of the continuum pipeline for example [AD14], indicates that inter-island traffic will need to be supported. For this purpose the Compute Islands will be interconnected using a high-bandwidth interconnect, orthogonal to the ingest-to-buffer network. The tolerance of this network to over-subscription, and hence the possible reductions in cost and complexity, is currently under review.

As mentioned before, a Compute Island consists of several interconnected compute nodes.

Each Compute Island has associated infrastructure and facilities such as shared file systems,

management network and master node(s). This makes each Compute Island largely

independent of the rest of the system. The size of the SDP can be expressed by the number of

Compute Islands it contains - a parameter that is freely scalable due to the Compute Islands’

independent nature. Most of the infrastructure will be similar between the three SDPs, but it is

conceivable that the size of an island (e.g. the number of compute nodes within an island) or the

compute node design itself differs between SDPs. This could be the case when the desired

compute to I/O ratios differ between the three telescopes.

Within a Compute Island, a fully non-blocking interconnect is provided, with a per-node bandwidth far in excess of the per-node ingest rate. This is primarily used for reordering data between processing steps, ideally within a single island. The same interconnect facilitates communication between islands for inter-island reordering or global processing, but in this case bandwidth will be much more limited and end-to-end transfers may require several hops. The total bisection bandwidth and over-subscription of the global interconnect network may easily become a cost driver. Therefore a careful analysis of the requirements, and an effort towards minimising global data transport, will continue to be a design priority.

The file system and/or storage model used by the islands is yet to be determined. A small,

island-wide, parallel file system, or single file system node, are among the options. The buffer

storage spread over the island nodes may also be exposed as a single unified file system in

some way yet to be determined.

The figure below shows an overview of the Compute Island concept. Note that although a

Compute Island is represented by a single rack of hardware in this figure, this is only illustrative.

The actual size of the Compute Island may span multiple racks, or be limited to a fraction of a

rack, depending on various parameters discussed in more detail in the section on scaling.


Figure 8: The SDP Compute Island concept, showing the various components in an island.

While operational concerns drive a desire for a high degree of standardisation of Compute Islands and components, the self-contained nature of the Compute Islands allows for partial upgrades. A heterogeneous SDP, with Compute Islands of various ages and potentially specialised configurations, is also possible. However, the efficient utilisation of such a system may require additional effort on the part of LMC and the scheduler. The optimal combination of flexibility and standardisation will need to be determined on the road to CDR.

SDP scaling

While the total useful capacity of the Science Data Processor depends on many components,

we identify three defining characteristics that we will use to scale the system:

● Total capacity

● Capacity per Compute Island

● Characteristics per node

The total capacity is defined by the number of available Compute Islands. This top-level

number, the aggregate peak performance (Rpeak) expressed in PFLOPS, is defined by the

number of Compute Islands that make up the Science Data Processor and the capacity per


Compute Island. While this number is a useful way to express the size of the system, its

usefulness is limited since it does not take computational efficiency into account. Ideally, the

total capacity of the system would be defined by the science or system requirements, but

considering the constraints discussed above, it is more likely that total capacity will be defined

by the available budgets (energy, capital or operational).

Figure 9: SDP scaling. Each telescope SDP consists of a number of Compute Islands that are built up from a number of Compute Nodes.

Capacity per Compute Island is defined by the number of nodes per island and the characteristics of these nodes. This capacity is expressed in terms of peak computational capacity, i.e. TFLOPS, but it is likely that computational capacity will not drive the sizing of the Compute Islands. Island capacity is instead defined by the most demanding application, in terms of the required memory, network bandwidth, or compute capacity that requires a high-capacity interconnect. Our current analysis of the ingest pipeline [AD02] shows that a “frequency group” of between 64 and 256 frequency channels is needed for RFI flagging, giving an upper bound on the total number of Compute Islands per telescope (1000 to 4000, depending on the eventual number of frequency channels in a frequency group).
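As a rough illustration of how the frequency-group size bounds the island count (the total channel count used below is an assumption chosen only to reproduce the 1000 to 4000 range quoted above, not a figure from the baseline design):

```python
# Sketch: upper bound on the number of Compute Islands set by the requirement
# that each island holds at least one complete frequency group for RFI flagging.
# total_channels is an assumed, illustrative value, not a baseline-design number.
def max_islands(total_channels: int, channels_per_group: int) -> int:
    return total_channels // channels_per_group


total_channels = 256 * 1024  # assumption for illustration only
for group in (64, 256):
    print(group, "channels/group ->", max_islands(total_channels, group), "islands")
# With this assumption: 64 -> 4096 islands, 256 -> 1024 islands, consistent
# with the 1000 to 4000 range quoted above.
```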

The basic building block of a Compute Island is the compute node. The characteristics of these

nodes are defined by the design equations in [AD02] but within these bounds a vast number of

valid node designs can be identified. Considering the timeframe of the SDP roll-out, which

extends well beyond the available industry roadmaps, the node definition is perhaps the least

well understood component of the SDP design. The SDP parametric model defines a number of

ratio rules that describe suitable node designs. Within the bounds of these rules, cost and


energy efficiency and maintainability are considerations that may be used to select optimal node

implementations.

There is one key requirement that a compute node needs to satisfy: if used to ingest data, only

a very small percentage of that data may be lost. In other words, these nodes need to be scaled

such that they comfortably satisfy the ingest real-time requirements, and a sufficient number of

these nodes need to be available to receive all CSP data.
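A trivial sizing sketch for this requirement is shown below; the bandwidth figures are placeholders for illustration only, not values taken from the baseline design or Table 4:

```python
import math


def ingest_nodes(total_ingest_gbps: float, per_node_gbps: float, headroom: float = 0.7) -> int:
    """Nodes needed so that no node runs above `headroom` of its ingest capacity."""
    return math.ceil(total_ingest_gbps / (per_node_gbps * headroom))


# Placeholder numbers for illustration only: 4 Tbit/s aggregate ingest, 40 Gbit/s NICs.
print(ingest_nodes(total_ingest_gbps=4000, per_node_gbps=40))  # -> 143 nodes
```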

One interesting consideration is whether or not all three SDPs will be standardised on a single

node design. Answering this question requires an interesting trade-off between the

standardisation of components on the one hand, and workload optimisation of those same

components on the other hand. Operational costs, in particular energy versus deployment and

maintenance cost, will also play a key role in this decision. It is clear that this decision cannot be

made until more information is available on the likely technology options available for nodes.

Science Archive

The processed data products are forwarded from the Compute Islands to the local Science

Archive. The Science Archive is part of the SDP and is the end point for SDP data-products.

The primary design goals for the Science Archive are:

● Provide secure storage for data products for the telescope lifetime

● Facilitate distribution of science data products to Regional Centres

The Science Archive acts as an interface to the wider science community by distributing the

science data to potential Regional Science Centres and by providing access to the data via a

number of interfaces and APIs.

Considering the long design lifetime of the SKA instrument, careful analysis of the total cost-of-

ownership of the various archive technologies is critical. Work on this is ongoing. Possible

architectures span the entire range from a carefully balanced tiered mix of fast to slow storage

media (solid state, spinning disk and tape), to a disk-only solution. It is important to note that

there is no stringent requirement on the security or safety of the data, although the lifetime

requirement on the archive is 50 years [SKA1-SYS_REQ-2363] and the SDP is required to

maintain a mirror [SKA1-SYS_REQ-2350]. For the purposes of this document we assume that

the combination of the Science Archive and Regional Science Centres fulfils the role of the

Science Archive mirror.

At this stage we do not consider the Science Archive a high-risk item. Industry is heavily

focussed on Big Data, and archive sites of the sizes required for SKA1 are already feasible.

Later on in this document we will discuss various implementation options for the Science

Archive in more detail.

Computational efficiency

The required aggregate capacity of the Science Data Processor depends on three factors (combined in the relation sketched after this list):


1. Input bandwidth (bytes/s)

2. Total computational intensity of the pipelines (FLOP/byte)

3. Computational efficiency (% of Rpeak)
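These three factors combine as follows (a sketch using the notation of this section, with $B_{\mathrm{in}}$ the input bandwidth, $I$ the computational intensity and $\eta$ the computational efficiency):

$$
R_{\mathrm{peak}} \;=\; \frac{B_{\mathrm{in}}\,[\mathrm{byte/s}] \times I\,[\mathrm{FLOP/byte}]}{\eta}
$$

For example, at the 25% efficiency assumed for sizing, the required peak capacity is four times the sustained FLOP rate given by the first two factors.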

The input bandwidth is given in the baseline design. Computational intensity can be estimated

using the performance models in [AD02]. Other considerations, like mixed precision

computations, in which double precision and single precision operations are intermingled, also

have an impact on computational intensity. Of these factors, computational efficiency is

arguably the most difficult to estimate.

Computational efficiency depends on many factors, such as:

● choice of algorithm

● target platform

● implementation

● data access patterns

● programmer talent

Many of these are platform- (i.e. hardware-) dependent, which makes it difficult, if not

impossible, to model or estimate computational efficiency for future hardware generations.

It is, however, possible to estimate the required number of floating point operations for the

various SDP tasks. Several modelling efforts, culminating in [AD02] and [RD11], have resulted

in a good overview of the computational requirements for the SDP, in terms of sustained

performance. In order to translate these into a hardware architecture, we estimate the

computational efficiency of the most intensive components, for the hardware we can expect to

procure, based on current-day numbers.

It is important to note that the discussion below is highly speculative. The expected procurement

period is well beyond the timeframe of industry roadmaps, and we can only speculate on the

available hardware. Instead we concentrate on currently available hardware and the bottlenecks

we can identify in these. Based on this, and the expected developments in terms of compute

characteristics, we estimate the computational efficiency of current-day hardware and

extrapolate it to the SKA1 roll-out timeframe.


Figure 10: SDP compute distribution for the three SKA telescopes.

Figure 10, taken from the parametric model as implemented in iPython [RD11], shows that the

vast majority of the SDP compute requirement is taken up by gridding and FFTs. The most

power-efficient way to compute either of these today is on accelerator hardware, so we will

concentrate our analysis on these.

Fast Fourier transforms on Nvidia Tesla GPUs will most likely use cuFFT. Performance numbers for this library are publicly available and are summarised in Figure 11. The Tesla K40c shown in this graph has a peak performance of 4290 GFLOPS single precision and 1430 GFLOPS double precision. This shows that the maximum achieved computational efficiency, as a percentage of peak performance, is currently 16.3% for single-precision floating-point numbers, and a slightly higher 19% for double precision. However, this applies to small transform sizes. The SDP will most likely use transform sizes of the order of 2^13 to 2^16, reducing the efficiency to 9% and 12%, respectively.
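As a simple check on these percentages (the achieved-GFLOPS values below are approximate figures read off the public cuFFT plot [RD12] and are used for illustration only):

```python
# Computational efficiency = achieved FLOP rate / peak FLOP rate (Tesla K40c peaks).
# Achieved figures are approximate values read off the cuFFT plot [RD12].
PEAK_GFLOPS = {"single": 4290.0, "double": 1430.0}


def efficiency(achieved_gflops: float, precision: str) -> float:
    return 100.0 * achieved_gflops / PEAK_GFLOPS[precision]


print(round(efficiency(700.0, "single"), 1))  # ~16.3 %, best case, small transforms
print(round(efficiency(270.0, "double"), 1))  # ~18.9 %, best case, small transforms
print(round(efficiency(385.0, "single"), 1))  # ~9 %, SDP-relevant transform sizes
print(round(efficiency(170.0, "double"), 1))  # ~11.9 %, SDP-relevant transform sizes
```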

The distinct dip in performance for large transforms is due to the data no longer fitting in the fast

local memory. There is a continuing trend in hardware development to increase the amount of

fast local memory (caches) to bridge the widening memory bandwidth gap. This means that

ever larger transforms will fit in fast local memory, widening the distinct peak in efficiency shown

today and potentially improving efficiency for the transform sizes required by the SDP.


Figure 11: cuFFT performance on Nvidia Tesla K40c [RD12].

Furthermore, it is expected that 3D stacking of memory will dramatically increase memory

bandwidth in the next two generations of hardware, resulting in a corresponding increase in

computational efficiency. This is, of course, speculation and developments need to be closely

tracked to establish the actual computational efficiency.

There are more detailed analysis results available on FFT performance, using both GPU and

CPU hardware [RD09]. The observed computational efficiencies are summarised below. The

results agree with the analysis presented above, although the high end (16-19% of Rpeak) of the

efficiency range for double precision is not achieved in this experimental setup.

                                       Single Precision   Double Precision
CPU efficiency (multithreaded)         8 – 15 %           8 – 15 %
GPU efficiency (data on GPU)           <10 – 15 %         10 – 15 %
GPU efficiency (incl. data transfer)   ~1 %               ~1 %

Table 2: Computational FFT efficiency on both CPU and GPU, from [RD09].


The most computationally efficient gridding implementation we know of today was developed by

John Romein [RD10]. [RD10] presents detailed performance measurements, including several

on GPUs. In this paper, achieved performance is measured in Giga grid-point additions per

second, which equals 8 GigaFLOPS, since each complex multiply-add requires four real

multiplications and four real additions.

Figure 12: Performance analysis of John Romein's gridding algorithm, from [RD10].

For relatively small convolution matrix sizes (32x32), the maximum achieved efficiency of the

algorithm is around 23% (Nvidia GTX680). This increases to approximately 25% for larger

convolution matrix sizes on AMD Radeon HD7970. For completeness, the salient hardware

specifications are shown in Table 3.

                    Rpeak (GFLOPS)   Memory bw (GB/s)   Max Power (Watt)
Nvidia GTX680            3090              192                195
AMD HD7970               3789              264                230
2x Intel E5-2680          343              102                260

Table 3: Hardware specifications of the platforms used in the gridding analysis presented in [RD10].
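A small sketch tying the grid-point-addition rate to these efficiency figures (the achieved rates below are approximate values consistent with the percentages quoted above, not numbers copied from [RD10]):

```python
# One grid-point addition is one complex multiply-add: 4 real multiplications
# plus 4 real additions, i.e. 8 FLOP.
FLOP_PER_GRIDPOINT_ADD = 8.0


def gridding_efficiency(giga_additions_per_s: float, rpeak_gflops: float) -> float:
    """Fraction of peak FLOP rate achieved by the gridder."""
    return giga_additions_per_s * FLOP_PER_GRIDPOINT_ADD / rpeak_gflops


# Approximate, illustrative rates consistent with the ~23% / ~25% quoted above.
print(round(100 * gridding_efficiency(89.0, 3090.0), 1))   # Nvidia GTX680, ~23 %
print(round(100 * gridding_efficiency(118.0, 3789.0), 1))  # AMD HD7970, ~24.9 %
```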


Figure 13: The public Nvidia roadmap up to 2016 [RD13].

In March 2014, Nvidia publicly announced its roadmap leading up to 2016, including the “Pascal” architecture as shown in Figure 13. The Pascal cards are expected to have 3D-stacked memory, with 1 terabyte/s bandwidth and with the same power consumption per bit transferred as today. The peak FLOPS capability of these cards has not been explicitly announced but can be inferred from the graph in Figure 13. This shows that Pascal will achieve about 2.5 times higher FLOPS/W performance than the Kepler family of GPUs. Since for a similar packaging the total power envelope must remain constant, and since the peak Kepler performance is around 1.3 TFLOPS (e.g. the Kepler K20X), this implies that the peak performance of Pascal will be around 3.3 TFLOPS.

This estimate gives a memory bandwidth to peak FLOP ratio for Pascal in 2016 of 0.30 bytes /

FLOP in contrast to about 0.19 bytes / FLOP in the current generation K20x. Our analysis

indicates that both of the algorithms mentioned above (FFT and gridding) are mainly memory-

bandwidth bound. The relative increase in memory bandwidth per peak FLOP indicates a

modest corresponding expected increase in computational efficiency for that generation.

However, this is a single revolutionary step forward in memory bandwidth per FLOP, without a

further improvement in sight. How these developments continue post-Pascal is unclear.
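The arithmetic behind these ratios, as a sketch (the K20X memory bandwidth of 250 GB/s is our assumption; the other inputs are the figures quoted above):

```python
# Derivation of the Pascal estimate quoted above (illustrative arithmetic only).
kepler_peak_tflops = 1.3          # Kepler K20X peak, as quoted above
kepler_mem_bw_gbs = 250.0         # assumed K20X memory bandwidth, GB/s
pascal_flops_per_watt_gain = 2.5  # inferred from the public roadmap (Figure 13)
pascal_mem_bw_gbs = 1000.0        # 1 TB/s 3D-stacked memory, as quoted above

# Constant power envelope: peak performance scales with the FLOPS/W improvement.
pascal_peak_tflops = kepler_peak_tflops * pascal_flops_per_watt_gain
print(pascal_peak_tflops)  # 3.25, i.e. "around 3.3 TFLOPS" as quoted above

# Bytes of memory bandwidth per peak FLOP for both generations.
print(round(pascal_mem_bw_gbs / (pascal_peak_tflops * 1e3), 3))  # 0.308 -> ~0.30 byte/FLOP
print(round(kepler_mem_bw_gbs / (kepler_peak_tflops * 1e3), 3))  # 0.192 -> ~0.19 byte/FLOP
```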

For costing and sizing we currently assume an overall computational efficiency of 25% of peak performance, which, based on the analysis above, is optimistic. As mentioned above, this number is highly speculative and may change based on further prototyping and research.

Roll-out schedule

The SDP roll-out schedules are determined by three major considerations:


1. Roll-out of compute equipment should be delayed as much as possible, so that we can

take advantage of Moore’s law and to maximise the operational usefulness of the

procured equipment.

2. Roll-out should follow the just-in-time principle, such that integration of equipment

coincides with Array Releases (AR) of the various instruments.

3. Compute requirements increase dramatically with increased baseline length, which

means that the full-scale Science Data Processor will not be required until very late in

the roll-out schedule.

The SDP preliminary plan for construction [AD12] describes these considerations in more detail.

Figure 14: Preliminary timeline for SDP construction for the three telescopes [AD12].

Figure 14 shows the SDP roll-out timeline in the SDP preliminary plan for construction. This document ignores the milli- and centi-SDP, as the scale of these is more or less trivial and they are not required to be maximally efficient in terms of FLOPS or energy. Instead, these precursors only need to support commissioning of elements and early science as cost-effectively as possible.

The roll-out of the full-scale SDP will most likely happen in the final stages of the SKA system

build-up, during the first half of 2021. These systems must be fully integrated during the second


half of 2021, ready for the fifth and sixth array releases (AR5 and AR6 in Figure 14) of SKA1-MID and SKA1-LOW and the fourth array release of SKA1-SURVEY.

Based on this roll-out schedule, we can conclude that the SDP will use technology that will only

become available after 2020. This timeframe is well beyond established roadmaps, and puts us

into the era where only technology concepts are available.

While this makes our detailed design highly speculative, we feel this is not a risk at this stage of

the project. The high-level concepts do not change and our high-level architecture is capable of

supporting the expected technological changes.

To demonstrate the feasibility of the design, we introduce a baseline implementation below that

is based on current day technology. Evolutionary scaling to SKA1 timeframes shows one

possible implementation option for SKA1. This evolutionary development is unlikely to occur, but

it gives us a solid basis for costing and shows one valid implementation of our design.

The supporting material discusses various hardware developments that are expected to occur

[RD04]. All of these are expected to be capable of fulfilling the SDP requirements, although

some may be more efficient than others. Selection of the optimal node architecture will have to

wait until more information is available on the possible hardware solutions, but also until more

detailed performance models are available. However, it is important to stress that this does not

impact our high-level design.

In particular, the independence of the island implementation is important, since we are trying to

design an architecture with a potential life-span of fifty years. It is impossible to predict what sort

of computational resources will become available during the life-time of the instrument; we

therefore have to provide a high-level architecture that is capable of supporting a wide range of

technology options and is ideally agnostic to the eventual detailed implementation.

Data transport

The SDP is a data throughput machine, and data flow is intended to drive the design of its architecture.

The data transport system is therefore an extremely important part of the system design. We

separate the data transport system into three different, physically separate, networks, each with

their own requirements:

● Bulk data transport

● Island data transport

● Management, monitoring and control

The bulk data transport network may be two physically separated networks, used to receive

data from the CSP and export data to the Regional Science Centres.

Page 31: PDR.02.01 Compute Platform Element Subsystem Designbroekema/papers/SDP-PDR... · PDR.01.01 / [AD08] SKA-TEL -SDP-0000014 ASSUMPTIONS AND NON -CONFORMANCE [AD09] SKA-TEL -SKO -0000035

Document Number: SKA-TEL-SDP-0000018 Unrestricted

Revision: 1.0 Author: C. Broekema

Release Date: 2015-02-09 Page 31 of 51

Within each Compute Island, a high-performance, low-latency network is provided to facilitate

the reordering of data. We separate the highly predictable and static bulk data network from the more dynamically loaded island interconnect to ensure the real-time performance of the SDP ingest.

Data transport bandwidth requirements

The table below shows the input bandwidth requirements for the three receiver types, based on

the updated baseline design. The number of required input ports is estimated using a protocol

overhead of 2% [AD06]. Based on operational experience we limit occupancy per port to 90% to

ensure no packets are dropped and the receiving node can achieve real-time performance. The

total number of top-level network ports is at least double this number.

Instrument     Raw data rate (TB/s)   Estimated number of      Estimated number of
                                      ports (40 GbE)           ports (100 GbE)

SKA1-low       9.1                    ~2030                    ~810
SKA1-mid       4.21                   ~940                     ~380
SKA1-survey    5.81                   ~1300                    ~520

Table 4: Input data transport bandwidth requirements for the three telescopes.
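The port counts in Table 4 follow from the raw data rate, the assumed 2% protocol overhead and the 90% occupancy limit. A minimal sketch of that estimate; the exact values depend on rounding conventions and may differ slightly from the table:

    import math

    def input_ports(raw_rate_tb_s: float, link_gb_s: float,
                    overhead: float = 0.02, max_occupancy: float = 0.9) -> int:
        """Estimate the number of input ports needed for a given raw data rate.

        raw_rate_tb_s : raw data rate in terabytes per second
        link_gb_s     : link speed in gigabits per second (e.g. 40 or 100)
        """
        rate_gbit_s = raw_rate_tb_s * 8e3 * (1.0 + overhead)   # payload plus overhead
        usable_per_port = link_gb_s * max_occupancy            # 90% occupancy limit
        return math.ceil(rate_gbit_s / usable_per_port)

    # SKA1-low at 9.1 TB/s: roughly two thousand 40 GbE ports,
    # or roughly eight hundred 100 GbE ports (cf. Table 4).
    print(input_ports(9.1, 40), input_ports(9.1, 100))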

Compression of input data is being investigated within the CSP consortium, which may lead to a

reduction of the required number of input ports by as much as 30% [RD03]. This work is still

ongoing, and needs to carefully consider the required compute capacity and development time

against cost savings, both in terms of capital investment and energy consumption in the data

transport.

Top-level network architecture

The top-level SDP network architecture is shown in Figure 15. A three stage oversubscribed

switch stack is shown that receives data from CS, through the Ingest Layer, and delivers it to

the appropriate Compute Island. A second switch stack, which may share hardware with the

input switches, is responsible for connecting the storage components within each compute

island into a single virtual science archive and the distribution of science data to the world

beyond the Science Data Processor.


Figure 15: Top-level SDP network design.

Bulk data transport network design

The bulk data transport network is responsible for three distinct steps:

● Ingress (i.e. receive data from CSP, as per [AD06] )

● Egress (i.e. move science-ready data to Regional Science Centres)

● Science Archive (i.e. interconnect storage components in the Compute Islands)

All of these streams are Ethernet based, although the egress data stream is not formally part of

the project and has no formal ICD.

The ingress data stream from CSP can be described as:

● A continuous data stream

● UDP/IP on IEEE 802.3 Ethernet frames

● Maximum Transmission Unit (MTU) as large as possible while still maintaining

compatibility with COTS networking equipment (jumbo frames)

The bulk data network must be able to:

● receive a high-bandwidth, but fairly static, uni-directional data flow

● forward data to the Compute Islands


● receive this data without losing or dropping packets (latency and out-of-order packets

are acceptable)

In addition, the system is designed around the concept of a software-defined network infrastructure; support for this is therefore also essential.

Data from the CSP is expected to arrive via long-haul fibre using Dense Wavelength-division

multiplexing (DWDM), therefore the bulk data transfer network also needs to support large

numbers of optics, conforming to IEEE P802.3bm [AD07].

The high data rates and the continuous nature of the data flow, coupled with the stringent

requirement on data loss, mean that relatively large switch buffers will be required. While this

has been observed in pathfinder instruments, in particular LOFAR, it has not yet been fully

investigated. It is interesting to note that similar effects occur in more conventional applications

as well [RD08].
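Receiving a continuous UDP stream without loss largely comes down to generously sized buffers at every stage. A minimal sketch of a receiving socket with an enlarged kernel receive buffer; the port number and buffer size are illustrative, and the actual ingest software will be far more elaborate:

    import socket

    RECV_PORT = 52000                     # illustrative port number
    KERNEL_RCVBUF = 64 * 1024 * 1024      # request a large kernel receive buffer

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # Ask the kernel for a large socket receive buffer; the value actually granted
    # is capped by the system-wide limit (net.core.rmem_max on Linux).
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, KERNEL_RCVBUF)
    sock.bind(("0.0.0.0", RECV_PORT))

    granted = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    print("receive buffer granted:", granted, "bytes")

    while True:
        # Jumbo frames: read up to 9000 bytes per datagram.
        packet, sender = sock.recvfrom(9000)
        # Hand the packet off to the ingest pipeline here.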

The egress data stream, which ties the storage components in each Compute Island together

into the virtual Science Archive and transports science-ready data to the Regional Science

Centres, is less well defined. This data stream can be characterised as:

● lower bandwidth

● a fairly static traffic pattern, although much less so than the ingress stream

● direct export to the Regional Science Centres

● reliable protocols

● not quite uni-directional, but still highly imbalanced

While in Figure 15 the ingress and egress data networks are drawn as separate entities,

prototyping will have to show if these two highly unidirectional data streams can co-exist in a

single network without loss of performance or data. The unreliable nature of the ingress data

stream makes this not immediately obvious, but significant cost savings may be achieved.

Ingest Processing

The purpose of the ingest pipeline (see Figure 2) is to receive data from the CSP element,

merge it with metadata from TM and to apply conditioning functions prior to integration over time

and frequency before sending it to a number of other pipelines downstream. Currently it is

envisaged that this pipeline forms part of the local SDP function. However, as data movement is

always a high cost, consideration is being given to reducing the overall ingest rate by applying baseline-dependent averaging [AD16]. The details of the implementation in the SDP are under review, but this technique could have a favourable impact on the size of the bulk data transport and buffer if co-located with the CSP.

High-performance, low-latency interconnect architecture

Within each Compute Island, a high-performance, low-latency interconnect is available. This

network is used to reorder data between ingest and buffer. This interconnect will be fully non-blocking and have a per-node bandwidth that is much higher than the per-node input bandwidth. This high degree of over-dimensioning is a deliberate design decision to facilitate extensive reordering.


The Compute Island networks are themselves interconnected as well, albeit with a certain

degree of over-subscription to be determined when the detailed design is considered.

Alternatively, islands may be interconnected using a ring or n-dimensional torus structure with

similar results. This interconnection of islands allows for limited global data reordering,

depending on requirements, although it is the intention that this is avoided as much as possible.

The total inter-island bandwidth depends greatly on the necessity and expected characteristics

of global processing and reordering.

Management, monitoring and control network

A dedicated network will be available for management, monitoring and control. We will not

design this network in detail at this stage, considering the modest requirements in terms of

bandwidth and latency. It is considered likely that basic landed-on-motherboard hardware will be

sufficient for this purpose. Similarly, a simple and cheap switch infrastructure is expected to fulfil

the requirements for this role.

This network will be the interface with Local Monitoring and Control, and through LMC to the

SKA Telescope Manager. The external (SDP-TM) network definitions and requirements are still

TBD, but will be described in [AD08]. The following components will be connected through this

network, most likely sharing physical hardware:

● Island management network

● Island Lights-out-manager network

● Network out-of-band control network.

Data reordering

To maximise available data parallelism, we intend to allow significant data reordering at the

SDP ingest. In this section we analyse the possible implications of reordering. The main goal is

to establish if reordering of data, in any dimension, is feasible in-network. If this is not the case,

a fully non-blocking interconnect, covering the entirety of the Science Data Processor, may be

needed to allow reordering in any dimension.

There are three possible reordering grades. They are discussed in the following sections, from

lowest to highest cost options (in terms of capital investment, required resources and energy

consumption).

In-network reordering

At least three possible hardware configurations may support an in-network data reordering at

SDP ingest:

● A single very large full non-blocking switch per SDP (up to several thousand ports)

◦ very expensive

◦ will probably have advanced features which we don't require (i.e. this will be a layer 3 router, which is not necessarily what we require)

● Interconnected set of smaller switches, possibly over-subscribed to some degree.

Several topologies are possible:

◦ Fat tree structure


◦ Dragonfly

◦ Torus (n-dimensional)

● Reordering in transit (in-network) between CSP and SDP

◦ Using a software-defined network, we may be able to dynamically re-configure an otherwise static Ethernet network, allowing more flexible and extensive reordering than is possible with the previous two options.

◦ Independent of the architecture choice, the switch firmware needs to support some form of software-defined networking protocol (e.g. OpenFlow).

◦ This option is somewhat orthogonal to the previous two, since it is not really a hardware configuration; it still requires either of the two options mentioned above to run on.

The current ICD with CSP [AD06] states that the data shall be packetised into UDP/IP jumbo

frames of 9000 bytes. Each visibility will be two 32 bit single precision floating point numbers.

There will be four cross-polarisations. The full dimensionality of the CSP output is:

N_beam × N_channel × N_vis × N_pol

A fully polarised visibility takes up 256 bits. Up to 280 of these visibilities fit into a jumbo frame.

Everything beyond the UV-plane in a packet is routable in the network (see section 2.1.1.6 of

[AD06]), although we are assuming that routing consecutive correlator dumps to different

destinations is more challenging (though not impossible, and possibly useful for round-robin

scheduling).
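The packing arithmetic behind these numbers is straightforward. A minimal sketch; the per-packet header allowance is an assumption used only for illustration:

    VISIBILITY_COMPONENT_BITS = 32      # single precision real or imaginary part
    COMPONENTS_PER_VISIBILITY = 2       # complex sample: real + imaginary
    POLARISATIONS = 4                   # four cross-polarisations

    bits_per_full_visibility = (VISIBILITY_COMPONENT_BITS *
                                COMPONENTS_PER_VISIBILITY *
                                POLARISATIONS)            # 256 bits = 32 bytes

    JUMBO_FRAME_BYTES = 9000
    PACKET_HEADER_BYTES = 40            # assumed allowance for headers and metadata

    visibilities_per_frame = (JUMBO_FRAME_BYTES - PACKET_HEADER_BYTES) // \
                             (bits_per_full_visibility // 8)
    print(bits_per_full_visibility, visibilities_per_frame)   # 256 bits, 280 visibilities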

Intra-island reordering

Not all data can be reordered in-network. For this purpose, a low-latency, high-bandwidth interconnect is provided in each Compute Island to allow data reordering within the islands. It is expected that the bisection bandwidth available in this intra-island network will greatly exceed

the input bandwidth from the correlator per island.

This network is also required for the reordering of data after ingest. Our analysis of the ingest

and subsequent pipelines shows that an intra-island reordering is required between these

components [AD02]. Ingest requires a number of frequency channels (a frequency group) for a

single baseline, while subsequent pipelines require all baselines for a single frequency channel.

While this is a significant task, it can be kept within a Compute Island, provided that:

1. a frequency group is kept within a single island -- satisfied by the Compute Island scaling

2. subsets of the visibility hierarchy can be routed to individual nodes to maintain

parallelism -- satisfied by [AD06].
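The reordering between ingest and the subsequent pipelines is essentially a corner turn: ingest holds a block of frequency channels per baseline, while the imaging pipelines want all baselines for a single frequency channel. A minimal sketch of that change of ordering, expressed as an in-memory transpose; in practice this is realised as an exchange over the intra-island interconnect, and the array sizes below are illustrative, not SKA1 numbers:

    import numpy as np

    # Illustrative sizes for one Compute Island.
    N_BASELINES = 6
    N_CHANNELS = 4          # one frequency group
    N_POL = 4

    # Ingest ordering: per baseline, a block of frequency channels.
    ingest_order = np.zeros((N_BASELINES, N_CHANNELS, N_POL), dtype=np.complex64)

    # Pipeline ordering: per frequency channel, all baselines.
    pipeline_order = np.transpose(ingest_order, (1, 0, 2))

    print(ingest_order.shape, "->", pipeline_order.shape)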

Inter-island reordering

The low-latency, high-bandwidth intra-island interconnects are themselves interconnected into a

single inter-island interconnect. This network is available for a final stage reordering of data that

cannot be accommodated by the previous two stages. Point-to-point communication may

require several hops (up to nine in a Fat-Tree) to reach the destination.


At present, we have only costed Fat-Tree topologies with varying degrees of pollarding (over-

subscription). The level of over-subscription is a design decision to be considered later. We

currently assume that most data reordering can be achieved with a combination of in-transit and

intra-island shuffling, and that consequently the over-subscription rate in the current architecture

is very high. However, as mentioned above, further analysis is required to determine the

appropriate level of over-subscription that can be tolerated.

Software-defined networking

Experience with Ethernet-based precursor instruments, such as LOFAR, has shown that such

infrastructures are static and fairly difficult to maintain. The classic split between network and

compute systems, in design, procurement, and maintenance, does not fit well in our data-flow

driven design philosophy. Since the data flow is the defining characteristic of the SKA Science

Data Processor, network and compute systems must both be considered integral parts of one

and the same system.

In addition to this, a classic Ethernet-based network imposes a very strong coupling between

sending and receiving peers, in this case the CSP-based correlator, and the SDP ingest. Any

change in the data flow needs to be carefully negotiated between sender and receiver, which

may be hundreds of kilometres apart.

We propose to build a software-defined network infrastructure, which will become an integral

part of the SDP workflow, and will fall under the direct control of the Data Flow Manager. This

means that the network is no longer a static piece of infrastructure, but may dynamically change

configurations to suit the work-flow requirements. Such a software-defined network also allows

an effective decoupling of sending and receiving nodes. In this model, the sending peers

effectively send to a virtual receiving node, which may or may not physically exist. Receiving

nodes subscribe to data flows from the CSP, as directed by the data flow manager. A network

controller handles the physical data flow by modifying Ethernet headers in transit to match

receiving peers: a classic publish-subscribe model, implemented in a network.

This is a novel approach to building a sensor network that needs to be prototyped. A more in-

depth discussion on the relative merits is given in [RD02].
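A minimal sketch of the publish-subscribe idea, with the network controller modelled as a table that maps the virtual destination advertised to the CSP onto the physical receivers that have subscribed. All names and the header-rewrite representation are purely illustrative; a real implementation would install equivalent rules in OpenFlow-capable switches rather than rewrite packets in Python:

    # Conceptual sketch only: a "controller" that rewrites the destination of
    # flows addressed to a virtual ingest node, so senders never need to know
    # which physical node currently receives their data.
    from itertools import cycle

    class FlowController:
        def __init__(self, virtual_dst: str):
            self.virtual_dst = virtual_dst
            self.subscribers = []          # physical receiving nodes
            self._next = None

        def subscribe(self, node_addr: str) -> None:
            """A receiving node subscribes to the CSP data flow."""
            self.subscribers.append(node_addr)
            self._next = cycle(self.subscribers)

        def rewrite(self, packet: dict) -> dict:
            """Rewrite the destination header of a packet in transit."""
            if packet["dst"] == self.virtual_dst and self.subscribers:
                packet = dict(packet, dst=next(self._next))
            return packet

    controller = FlowController(virtual_dst="ingest-virtual")
    controller.subscribe("island-03-node-17")
    print(controller.rewrite({"dst": "ingest-virtual", "payload": b"..."}))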

Combining Bulk Data Network with Low-latency Network

Currently the SDP relies on two distinct, orthogonal networks: on the one hand uni-directional Ethernet from ingest, and on the other a bi-directional low-latency network for extensive re-ordering or gathering of data within specific pipelines. The combination of these network functions into a single unified network has not, up to now, been considered, given the lack of QoS capabilities for the bulk data network and its real-time requirement. Mixing traffic patterns may well lead to significant unpredictability of the SDP as a whole, as well as affecting the overall availability of the system and its resilience. In view of the discussion above on ingest processing, in which data rates will be reduced and, depending on implementation, the


protocol from ingest could be changed, an analysis should be performed to understand whether a certain degree of network sharing can be tolerated to further reduce cost.

Compute node model

To provide a solid basis for costing, we define a detailed potential compute node design, based

on current-day technology extrapolated to the SKA1 timeframe. We emphasise that this does

not describe the final Compute Island implementation. As mentioned before, the SDP

architecture should be mostly implementation agnostic. To validate this claim, we will show a

baseline compute node model in some detail, based on today’s technology extrapolated to 2017

and beyond. This model is used for costing, since it is the only solution we have accurate data

for.

In addition, we show that several other technologies, from low-power alternatives leveraging the

mobile and internet of things revolution, to reconfigurable systems with workload-optimised

accelerators, can also be used to implement Compute Islands. Since many of these

technologies are only available in concept form, these are described in less detail, and no

costing is done.

From performance model to design characteristics

Based on the performance models described in [AD02], we slightly rewrite the design equations

to give a ratio of various key components per achieved unit of double precision compute

capacity.

                        SKA1_low    SKA1_mid    SKA1_survey

Compute requirement     25 PFLOPS   52 PFLOPS   72 PFLOPS
Buffer                  240 PB      30 PB       90 PB
Input bandwidth         9.1 TB/s    4.21 TB/s   5.81 TB/s

Table 5: Performance requirements for the SKA1 baseline, including baseline-dependent averaging.

                           SKA1_low    SKA1_mid    SKA1_survey

Buffer / TFLOPS            9.6 TB      0.58 TB     1.25 TB
Input bandwidth / TFLOPS   2.91 Gb/s   0.65 Gb/s   0.65 Gb/s

Table 6: Performance requirements per achieved TFLOPS.

Baseline model - current-day technology

The processing model within a Compute Island is similar to the one currently employed in the

new GPU-based LOFAR correlator, with the addition of a large amount of buffer storage to


facilitate iterative calibration and imaging. This system design is also used in the Wilkes cluster

at the University of Cambridge.

Taking a modernised and extended LOFAR correlator system with two K40 GPUs as a basis,

and assuming that the Nvidia K40 GPU achieves an Rmax of 350 GFLOPS double precision

(25% of 1.43 TFLOPS Rpeak), we require per node:

                          SKA1_low    SKA1_mid    SKA1_survey

DRAM ([AD02])             370 GB      1 TB        500 GB
Working memory ([AD02])   8 GB        8 GB        1.2 GB
Buffer                    13.7 TB     0.82 TB     1.79 TB
Input bandwidth           4.16 Gb/s   0.93 Gb/s   0.92 Gb/s

Table 7: SKA1 node characteristics, assuming 700 GFLOPS achieved computational capacity.

A possible compute node design, based on current technologies would be (see Figure 15):

● Dual Intel Xeon E5-2660v2 CPU (10 cores @2.2GHz each) [RD24]

● 1024 GB DDR3 main memory @1866MHz

● 2x Nvidia Tesla K40 accelerator [RD25]

○ PCIe v3 x16; 12 GB GDDR5; 4.29 TFLOPS peak SP; 1.43 TFLOPS peak DP

● Intel X520 10 GbE Ethernet NIC (PCIe v2 x8) [RD27]

● Mellanox ConnectX-3 FDR Infiniband HCA (PCIe v3 x8) [RD28]

● HGST Ultrastar SSD1600MR 1.6TB Enterprise MLC SSD [RD26]

● SKA1_low only: 4-6x 3 TB Western Digital RED WD30EFRX [RD31]

Many of the chosen components have alternative options, and the list above should not

be seen as anything more than an illustration that suitable SKA1 SDP nodes can be built using

components available today.

Note that the SSD chosen is rated for two Drive Writes Per Day (DWPD) for five years, which is

just within the expected usage for our buffer (double-buffered, six-hour observations). A more

detailed analysis of both the endurance of SSDs and our expected usage is essential.

The amounts of both main memory and buffer are subject to further analysis, as is required

memory bandwidth (although limited memory bandwidth is implicitly included in the efficiency

percentage), but the system configuration above is readily available. Further storage, solid state

or spinning disk, may be added, since devices with higher capacity are readily available in the

market and bandwidth requirements are relatively modest. Both the Ethernet and Infiniband

networks are heavily over-dimensioned in this design, which is an intentional decision to

facilitate extensive data reordering.


Figure 16: A potential SDP compute node model implementation using current-day hardware.

Depending on the timescales involved, we can assume that the SKA1 SDP will use technology

available on the market in 2020 or later. When extrapolating to these timeframes, we are of

course limited to information made available by industry. The information mentioned below is all

publicly available, but only extends to about 2016. Likewise, Figure 13 presents schematics of

the public Nvidia roadmap, which shows that they are at least confident of increased

performance per Watt up until 2016, although it is interesting to note that previous versions of

the roadmap showed achieved double precision GFLOPS/Watt, not normalised SGEMM/Watt

(SGEMM is a BLAS-provided matrix multiply-add operation). If we take a look at the expected

future development of the components mentioned above up until ~2016, we find the following:

● Intel Skylake or Cannonlake based CPU [RD14][RD15] (note that these are not formally

announced):

○ Mostly evolutionary development, with at least one, perhaps two changes in

micro-architecture and production process.

● DDR4 main memory [RD16]:

○ Currently available and just entering mass market, expected to increase in clock

rate.

● Nvidia Pascal or Volta based accelerators [RD13]:

○ Although mostly evolutionary developments, we do expect the introduction of

NVlink to give a significant boost to the device-host bandwidth.

○ In addition, the introduction of 3D-stacked memory on these devices will

dramatically increase memory bandwidth per FLOP.

● Next generation Intel Xeon Phi (Knights Landing, Knights Hill) [RD17] [RD18]:


○ Based on new micro-architecture, using modified Atom cores, expected in 2015

○ Evolutionary developments after this

● Interconnect based on HDR or NDR Infiniband or similar [RD19]:

○ Evolutionary development of currently available technology

○ Many alternative technologies currently under development

● 100 GbE or 40 GbE NIC, depending on market availability and cost per port [RD20]:

○ Mostly available, expected to significantly reduce in cost per port

○ This much bandwidth may not be necessary; if so, 10 GbE or 25 GbE could be used instead

○ Consider using whatever industry lands on motherboard

● Solid state storage based on phase change memory [RD21] or memristor [RD22]

technology or similar connected through PCIe or a specialised memory bus:

○ Both memristor and PCM are in prototype stage (although memristor production

has apparently started [RD23], no products have appeared)

○ Spinning disk may be cheaper per PB, but need to consider operational costs

(energy, replacing broken disks, continuous rebuilding)

○ Flash-based storage is a viable alternative, should new solid state storage

technologies not be available or remain too expensive

○ The endurance of NAND flash is an issue; an analysis of endurance requirements compared to the expected buffer usage will need to be carried out.

Technology development post-2017 becomes much more uncertain. For the purposes of our

initial costing, we assume that Moore’s law will continue to hold for the foreseeable future.

Indications from industry seem to show that the number of transistors per unit of die area will

indeed continue to rise at least until ~2020. There is a risk that this increase will not translate

into easily achieved additional performance. For this reason, a very high contingency has been

added to the hardware costing model. It is important to note that this risk is well understood by

industry.

This extrapolation does assume that the high-level structure of a node, in particular the device-

host model, does not change. In other words, we still have a host processor, supporting a highly

specialised accelerator. This is by no means certain. Indeed, the recent release of hybrid

CPU/GPU packages with unified memory, such as AMD’s Kaveri based APUs and

Nvidia's Tegra K1, seem to indicate that hybrid Systems-on-Chip (SoC) are a definite possibility.

Likewise, while NVlink offers a PCIe-like programming model, it seems likely that Nvidia is

aiming more for a very high bandwidth mezzanine-like connector, or even a socketable solution.

Whatever the case may be, it seems that the device-host model that we know today will be

replaced by something else before SKA1 becomes operational.

For our system design, this is most likely a positive development. The current device-host model

significantly limits the bandwidth to the accelerator. In addition, the explicit communication of

data to the accelerator is tedious, and the limited space on the add-in boards limits the amount

of memory available to the accelerator.


Storage model

Conceptually there are three storage systems in the Science Data Processor: the high-performance intermediate buffer, the Science Archive and the mirror archive.

Intermediate buffer

To facilitate iterative calibration and imaging algorithms, SDP requires a buffer to store

observations before they can be processed. This buffer conveniently also marks the boundary

between the near real-time and more conventional batch processing. Since an entire completed

observation is required for calibration and imaging, the buffer will conceptually double-buffer the

data: while one observation is running on (and stored into) one buffer region, the previous

observation is being processed using a second buffer region (see Figure 17). It is not expected that this buffer will store data for extended periods. The buffer will be configured using Compute Islands and Data Objects to store data and then start the batch processing. The buffer also permits

the straightforward implementation of (re-)processing data from the archive by allowing data to

be moved from the archive to the buffer and the use of the same processing architecture.

Figure 17: Double buffering in an SDP Compute Island.
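A minimal sketch of the double-buffering idea at the level of buffer regions; the region names and the processing call are placeholders, not part of the actual design:

    # Two buffer regions per island: one fills with the running observation
    # while the other holds the previous observation for batch processing.
    regions = {"A": [], "B": []}

    def run_observation_cycle(obs_id: int, fill: str, process: str) -> None:
        # In reality filling happens in near real time and processing is a batch
        # job on the Compute Island; here both are simple placeholders.
        regions[fill].append(f"visibilities of observation {obs_id}")
        if regions[process]:
            print(f"processing {regions[process].pop()} from region {process}")

    fill, process = "A", "B"
    for obs in range(4):
        run_observation_cycle(obs, fill, process)
        fill, process = process, fill    # swap roles after each observation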

To facilitate iterative imaging and calibration, each node will require a significant amount of

storage to buffer intermediate data. This buffer is likely to be local to each node, although buffer

capacity may be exposed to other nodes within an island. No technology choice is made, but for

costing we consider three options: spinning disks, solid state (non-volatile) storage, and DRAM.

The high-performance buffer storage may consist of a combination of:

● DRAM


● Solid state storage

● Spinning disk storage

Science Archive

The SDP Science Archive can be characterised by the following features:

● receives science-ready products from the Compute Islands

● interfaces with the Regional Science Centres

● provides API-driven access to the users of the data

● needs to provide data security for an archive lifetime of fifty years

Many technologies are available to provide these functions, ranging from a “sea of disks” to

conventional physically distinct and tiered storage solutions. It is important to note that classic

SAN-based storage solutions are designed for applications with much higher data security

requirements, with associated high costs. The absence of such stringent requirements in our case means we have some flexibility.

The Big Data revolution, and, perhaps more importantly for our application, the advent of what

Jim Gray termed the fourth paradigm: the era of Data-Intensive Scientific Discovery, has also

given rise to a host of technologies that allow massive data stores to be built cost-effectively.

Where traditional storage technologies often require capital investments into raw media (disks),

Big Data or cloud storage technologies are often much cheaper. This class of storage,

characterised by massive quantities of low-cost and (relatively) low-performance hardware,

derives its performance from software, rather than hardware. This is exemplified by the

approach to data security: where traditional storage relies on parity calculations in dedicated

hardware and N+x redundancy of data, Big Data or cloud storage systems simply duplicate

data, with the number of duplicates depending on the requirement on data security. While this

obviously adds additional required storage capacity, the total cost of ownership of such, much

simpler, systems may be lower.

For our application, this simplicity, coupled with the massively parallel nature of such storage

systems, provides additional advantages.

Moving storage system complexity to software potentially allows highly efficient system designs

to be implemented. We could envision the Science Archive storage hardware integrated in the

Compute Islands. Exporting of science-ready data now stays within an island, significantly

reducing the data transport distance. At the egress point of the SDP, the physically separated

storage pools are unified in software. It is currently unclear whether existing clustered object store systems, such as Ceph, are capable of providing such functionality. This is being investigated (see also http://www.slideshare.net/Inktank_Ceph/scaling-ceph-at-cern for CERN's experience with Ceph).

By integrating Science Archive hardware into the Compute Island, we also align the

replacement cycles of these hardware components, simplifying operations. Note that the

duplication of data in such systems removes the need to migrate data to new archive systems,


provided the duplicates are stored on nodes with a different replacement date and data is re-

duplicated to new hardware.

Mirror science archive

For the physical backups of the science data, which are required by L1 requirement SKA1-SYS_REQ-2350, we adopt the cloud model. Science data will be duplicated among Science

Centres which are presumed to be in a secure location and offsite of the SDP. Taking this

approach, rather than designing a dedicated mirror facility, means that any mirrored data will

itself be useful, and not just be cold data.

Software stack

Operating system

The operating system forms the basis of the software stack and is the interface with the

hardware compute platform. The operating system needs to support all hardware conceivably

deployed in the Science Data Processor, be extremely scalable and, as experience with

precursor and pathfinder experiments has shown, highly tunable. We also intend to expose

information from hardware performance counters to LMC, so the OS needs to support user

space access for those as well.

Linux is the dominant operating system today, both in high-performance computing and in radio

astronomy, and this matches well with our requirements for the SKA1. Developments in

exascale operating systems will be tracked for suitability, although it should be mentioned that

most of these are based on Linux as well.

Middleware

The SDP middleware contains the software that provides services and APIs to the software

components of the other SDP work packages. In general, this middleware acts as the interface

between the hardware and the rest of the SDP, with the notable exception of LMC that has a

direct link with the Lights-out-manager to allow startup and shutdown from cold or broken state.

It is notable that the middleware layer may end up being extremely thin, if the containerisation

concept is taken to the extreme and the data layer interface is a full-fledged operating system

container image running on a bare bones compute platform core OS.


Figure 18: The SDP software compute platform middleware product subtree.

Messaging layer

The messaging layer provides communication services to the upper software layers. Several

communication protocols and methods are to be supported, having differing characteristics in

terms of (energy) cost, latency, programming model, reliability and throughput:

● Reliable messaging bus to facilitate communication between components, similar to an

Enterprise Service Bus.

● Bulk data transport services within an island

● Bulk data transport services between islands

● Bulk data transport services into SDP from CSP, UDP/IP over Ethernet

● Bulk data transport services from SDP to Regional Science Centres

A lot of experience has been gained in the pathfinder and precursor instruments with a variety of

messaging systems, ranging from raw Ethernet sockets to ZeroMQ, ICE and various flavours of

MPI.
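As an example of the kind of lightweight messaging already used in pathfinders, a minimal ZeroMQ publish-subscribe sketch; the endpoint and topic are illustrative, and this is one candidate among the options listed above, not a selected technology:

    import time
    import zmq

    ENDPOINT = "tcp://127.0.0.1:5556"    # illustrative endpoint

    ctx = zmq.Context()

    # Publisher side, e.g. a component emitting status messages onto the bus.
    pub = ctx.socket(zmq.PUB)
    pub.bind(ENDPOINT)

    # Subscriber side, e.g. another component interested in ingest messages.
    sub = ctx.socket(zmq.SUB)
    sub.connect(ENDPOINT)
    sub.setsockopt_string(zmq.SUBSCRIBE, "ingest")

    time.sleep(0.2)                      # allow the subscription to propagate
    pub.send_string("ingest island-03 started observation 42")
    print(sub.recv_string())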

The containerised interface between COMP and DATA allows DATA the option to provide its

own middleware layer per container, using just the kernel of the hosting platform. Alternatively,

the container can be limited to just the application and associated libraries, using the

middleware layer provided by the platform. Either model will work, but a number of middleware

services mentioned above may be integrated into the DATA containerised application.

Logging system

The logging system will collect, aggregate and analyse logs from all SDP components. This is to

be a hierarchical system, where node logs are aggregated on the island level with a subset of

these, for instance only messages from a particular severity upwards, communicated to a

central log store for analysis and dissemination. Machine learning algorithms may be employed

to model the system and predict failure states before they occur, which can be used by the

scheduler component described later on to estimate the availability of unreliable components.


In addition, the logging system is responsible for collecting and handling any events that occur

in the system.
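A minimal sketch of the hierarchical severity filtering described above; the threshold, record structure and example messages are illustrative:

    import logging

    ISLAND_FORWARD_LEVEL = logging.WARNING   # only WARNING and above leave the island

    def island_aggregator(records, forward):
        """Keep all records locally, forward only sufficiently severe ones."""
        local_store = []
        for rec in records:
            local_store.append(rec)                   # full log kept per island
            if rec["level"] >= ISLAND_FORWARD_LEVEL:
                forward(rec)                          # subset sent to central store
        return local_store

    central_store = []
    island_aggregator(
        [{"level": logging.INFO, "msg": "node 12 temperature nominal"},
         {"level": logging.ERROR, "msg": "node 12 ECC error threshold exceeded"}],
        forward=central_store.append,
    )
    print(central_store)   # only the ERROR record reaches the central log store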

Platform management system

Although the eventual scale of the SDP is to be determined, it is clear that a highly automated

platform management system is essential for its efficient operation. Although the SDP itself is

rather unique in its requirements, it borrows many of its characteristics from HPC and cloud

systems. Since cloud providers routinely operate data centres of the scale we envision for the

SDP, we intend to heavily rely on existing cloud platform management solutions. The highly

modular nature of the SDP design makes this feasible.

Provisioning and deployment of software will be based on containerised images, using for

instance Docker [RD30], a light-weight and powerful open source container virtualisation

technology, simplifying the efforts required to keep software consistent over a large number of

nodes considerably. The relatively small size of (application) containers would allow the entire

container used for processing to be attached to the Science Archive as a piece of meta data.

Whether this is useful is still under consideration, but the detailed state and versioning of the

software stack needs to be exposed to the application and added to the meta data in some way.

These containers are the primary interface with the data layer.
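A minimal sketch of recording the software-stack provenance alongside a data product; the image name and metadata fields are illustrative, and whether the full image or only its identifier is archived is, as noted above, still under consideration:

    import json
    import subprocess

    def container_provenance(image: str) -> dict:
        """Record which container image (and hence software stack) produced a product.

        Uses `docker image inspect` to obtain the immutable image identifier; the
        image name passed in is purely illustrative.
        """
        out = subprocess.run(["docker", "image", "inspect", image],
                             capture_output=True, text=True, check=True)
        info = json.loads(out.stdout)[0]
        return {"image": image, "id": info["Id"], "created": info["Created"]}

    # Attached to the science product's metadata when it is written to the archive:
    # metadata["processing_container"] = container_provenance("sdp-pipeline:1.0")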

It is interesting to note that the high data rates involved in the SDP may require operating system

level optimisations that fall outside the scope of the Linux containers used to deploy our

applications. While this is a challenging issue, it is expected that the optimisations involved will

be system wide and static over all observation modes.

Apart from the provisioning and deployment of hardware and software, the platform

management system also provides system health monitoring information to the LMC. This

covers the range from processor load and memory capacity used, to temperatures and energy

consumed at component level. We intend to leverage the current trend of heavily instrumenting

processors and exposing these tools to the programmer via the kernel [RD29].

System optimisation

While not strictly part of the software stack, experience with the precursor and pathfinder

instruments has shown that optimisation of various components of the software stack, and of

the system in general, is essential to achieve optimal performance. This requires both highly

specific tooling to monitor system-wide performance and instrumentation of our specific

application and system, and highly skilled and specialised people with extensive knowledge of

the intricacies of the SDP system. This task will initially explore open-source simulation

and behavioural modelling tools and their suitability for the SDP. These environments

will be refined during operation. There will be a close relation between this work and the

modelling work in the logging system component, although with a different specific goal.


Archive HSM Software

The archive Hierarchical Storage Manager (HSM) software automates the vertical movement of

data across various storage tiers. High-performance storage, with associated high energy

consumption is, in our design, specific to the Compute Island, while the lower performance tiers

are associated with a global namespace, although the hardware may be co-located with the

Compute Islands to minimise data transport distances.

Figure 19: Overview of the SDP Hierarchical Storage Manager.

HSM functionality may either be integrated into the Data layer, or be part of an integrated

storage platform to be selected and evaluated. Whatever the case may be, this part of the

software stack is under very active development in both industry and academia and we expect

to be able to follow rather than set the trend.

We do need to address the non-uniform nature of our storage tiers, where high-performance

storage resources have a distinct locality associated with them in terms of Compute Islands.

While there is no direct requirement to integrate these high-performance devices into the HSM,

this may well be an efficient way to reduce programmer overhead.
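A minimal sketch of the kind of policy an HSM applies, demoting data to lower tiers as it ages; the tier names and age thresholds are illustrative, not part of the design:

    from dataclasses import dataclass

    TIERS = ["island_ssd", "bulk_disk", "archive"]     # fast/local -> slow/global
    DEMOTE_AFTER_DAYS = [7, 90]                        # illustrative thresholds per tier

    @dataclass
    class DataObject:
        name: str
        age_days: float
        tier: int = 0          # index into TIERS, starts on the fastest tier

    def apply_policy(obj: DataObject) -> DataObject:
        """Move an object down the storage hierarchy once it outlives its tier."""
        while obj.tier < len(DEMOTE_AFTER_DAYS) and obj.age_days > DEMOTE_AFTER_DAYS[obj.tier]:
            obj.tier += 1
        return obj

    print(TIERS[apply_policy(DataObject("obs-0042-cube", age_days=120)).tier])   # archive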

Application development environment and software development kit

The highly distributed nature of the project, as well as the complexity of the system we are trying

to build, necessitates a strict and formalised software development methodology. A test-driven

approach, based on the agile and scrum principles, has been successfully used in LOFAR, and

a similar approach will have to be used during the SKA1 development period. To support this,

the development environment must provide the necessary tools and hardware, including:


● automatic test and integration toolkits

● issue tracking

● code repository

● representative test and development systems

● support for early roll-out and version tracking

● release, containerisation and packaging of software products.

This also requires a strict software development policy, but this is outside the scope of this

document.

The adoption of containers as the de-facto distribution method should allow for both an easy

and convenient way to distribute a standardised development platform, including base libraries,

and a way to roll out versions of code quickly on a standardised base operating system without

having to worry about library incompatibilities.

Scheduler

The Science Data Processor Scheduler is responsible for the interface between Local

Monitoring and Control and the compute platform. It is responsible for allocating hardware

resources to jobs that need to be carried out on the platform, for working around failed or

unstable hardware and for taking into account external factors to adjust the rate at which the

system can operate, in particular due to thermal and/or energy constraints. In addition, the

scheduler will provide LMC with resource requirement estimates upon request, used for coarse-

grained scheduling of observations, based on hardware availability. These estimates may be

based on timed calibration runs of the standard pipelines.

The scheduler design assumes that:

● We can modify an existing open source high-performance computing scheduler,

● The functionality is largely shared between it and the Local Monitoring and Control

component it interfaces with.

While we are confident that existing modular schedulers, like SLURM, will develop sufficiently

for us to be able to modify them successfully, there are some requirements that are unique to

our application, such as the ability to estimate required resources and runtime beforehand, based on

a-priori knowledge.
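A minimal sketch of the a-priori estimate mentioned above, scaling a timed calibration run of a standard pipeline to the parameters of a requested observation; the linear scaling and the reference figures are illustrative assumptions, and real estimates would come from the SDP performance models:

    # Estimate resources for a requested observation from a timed calibration run
    # of the standard pipeline (reference figures are illustrative).
    REFERENCE = {"data_tb": 10.0, "runtime_h": 2.0, "nodes": 50}

    def estimate(requested_data_tb: float, available_nodes: int) -> dict:
        scale = requested_data_tb / REFERENCE["data_tb"]
        nodes = min(int(REFERENCE["nodes"] * scale), available_nodes)
        # Fewer nodes than ideal stretches the runtime proportionally.
        runtime = REFERENCE["runtime_h"] * scale * (REFERENCE["nodes"] * scale / max(nodes, 1))
        return {"nodes": nodes, "runtime_h": round(runtime, 1)}

    print(estimate(requested_data_tb=25.0, available_nodes=100))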

SDP infrastructure

The compute platform interfaces with the Local Infrastructure component, which is responsible

for energy provisioning to the hardware components, delivery of a conditioned cooling medium

(either air or fluid), local routing of cabling and rack space. Local infrastructure provides metrics

on consumed power per unit (rack or outlet), temperature and such to LMC.

SDP infrastructure interfaces with the infrastructure consortium, which provides the bulk energy

delivery, the building, and cooling solutions [AD13].


Data delivery platform hardware

While the data delivery platform hardware is nominally part of the hardware compute platform,

the requirements for this are described in the data delivery sub-element design document

[AD03].

LMC system hardware architecture

Like the data delivery, the local monitoring and control hardware is nominally part of the

compute platform. The LMC sub-element design document [AD04] provides some requirements

on hardware reliability and failover, but in general the hardware design is expected to be

straightforward, and we don't provide any more details here; more detail will be added at CDR.

Suitability and scalability of the architecture

While the concept of the Compute Island is obviously extremely scalable, it has scaling limits

that we have not yet adequately explored. The size of the Compute Island is limited by the

affordability of the fully non-blocking interconnect at one extreme and the storage capacity per node at the other. In terms of SDP system scaling, there is an obvious limit in the bulk data network, since this scales superlinearly with the number of Compute

Islands. A more detailed analysis of the scaling limitations of this concept will be carried out on

the road to CDR.

In addition, the switch infrastructure needs to be carefully analysed for fault-tolerance. There is

a significant cost associated with redundancy in the network, but a single switch failure may

cause a sizable chunk of observational data to be lost. Careful analysis and design may allow

the impact of such a loss to be minimised.

While we currently see no observational modes that do not work with the current SDP compute

platform design, we have only carried out data flow analysis of individual components or

pipelines. On the road to CDR we intend to do a more system-wide analysis of the data flow,

which should adequately prove the suitability of the Compute Island concept for the SKA SDP.

Finally, a careful system-level data flow analysis needs to explore the trade-off between

hardware capital investment in data communication for reordering versus science results. It may

be possible to significantly reduce hardware cost with a limited science impact by using less

than optimal data distributions for some of the pipeline components.

Sub-element risks

● The COMP element design is optimised for imaging within Compute Islands. It may not be as suitable for:

a. Real-time calibration

b. Multi-scale, multi-frequency synthesis (may require SDP-wide communication)


c. Global solver

d. Any other observations that do not fit within a single island

Mitigation: increase island size or add additional bandwidth between the islands.

● Technology developments difficult to predict.

Mitigation: prototyping of cutting edge hardware with an emphasis on exploring

component characteristics rather than pure performance analysis.

● A late roll-out of the full SDP may impact the software development timeline.

Mitigation: on the one hand, the milli- and centi-SDP implementations; on the other, we may also use general-purpose HPC facilities for scaling experiments. In general, the embarrassingly parallel nature of our applications should ease scaling issues.


Requirement traceability

ID                    Name                                       Trace (section title)

SDP_REQ-301           SDP ingest data rate                       SDP scaling; Bulk data transport network design
SDP_REQ-372           Early science processing capability        Roll-out schedule
SDP_REQ-375           SDP platform management                    Platform management system
SDP_REQ-376           Platform management interface to LMC       Scheduler; Platform management system
SDP_REQ-377           System health monitoring                   detailed design @CDR
SDP_REQ-378           Deployment system                          Platform management system
SDP_REQ-379           Scheduler                                  Scheduler
SDP_REQ-380           Scheduler Interface                        Scheduler
SDP_REQ-381           Scheduler input                            Scheduler
SDP_REQ-382           Component system consistency               Platform management system
SDP_REQ-597           Component system state information         Platform management system
SKA1-SYS_REQ-2425     SADT to SDP interface                      Bulk data transport network design
SKA1-SYS_REQ-2657     Processing capability                      Roll-out schedule
SKA1-SYS_REQ-2566     Materials list                             SDP_REQ-363 [AD10]
SKA1-SYS_REQ-2567     Hazardous Materials list                   SDP_REQ-363 [AD10]
SKA1-SYS_REQ-2568     Parts list                                 SDP_REQ-363 [AD10]
SKA1-SYS_REQ-2569     Process list                               SDP_REQ-363 [AD10]
SKA1-SYS_REQ-2570     Parts availability                         SDP_REQ-363 [AD10]
SKA1-SYS_REQ-2571     Long lead time items                       SDP_REQ-363 [AD10]
SKA1-SYS_REQ-2572     Material environmental rule compliance     SDP_REQ-361 [AD10]
SKA1-SYS_REQ-2573     Serial number                              SDP_REQ-361 [AD10]
SKA1-SYS_REQ-2574     Drawing numbers                            SDP_REQ-361 [AD10]
SKA1-SYS_REQ-2575     Marking method                             SDP_REQ-361 [AD10]
SKA1-SYS_REQ-2576     Electronically readable or scannable ID    SDP_REQ-361 [AD10]
SKA1-SYS_REQ-2577     Package part number marking                SDP_REQ-361 [AD10]
SKA1-SYS_REQ-2578     Package serial number marking              SDP_REQ-361 [AD10]
SKA1-SYS_REQ-2579     Hazard warning marking                     SDP_REQ-361 [AD10]
SKA1-SYS_REQ-2580     LRU electrostatic warnings                 SDP_REQ-361 [AD10]
SKA1-SYS_REQ-2581     Packaging electrostatic warnings           SDP_REQ-361 [AD10]
SKA1-SYS_REQ-2583     Cable identification                       SDP_REQ-361 [AD10]
SKA1-SYS_REQ-2584     Connector plates                           SDP_REQ-361 [AD10]
SKA1-SYS_REQ-2711     Component obsolescence plan                ILS plan [AD11]
SKA1-SYS_REQ-2716     Telescope availability                     Non-conformant [AD08]; SDP_REQ-195 [AD10]
SKA1-SYS_REQ-2718     Availability budgets                       Non-conformant [AD08]; SDP_REQ-195 [AD10]
