SKA Science Data Processing – Storage Requirements

Bojan Nikolic
Astrophysics Group, Cavendish Lab, University of Cambridge
Email: [email protected]
Project Engineer, SKA Science Data Processor Consortium

BigStorage Initial Workshop, 3rd March 2016
We are hiring! PostDocs + PhD students with a computing background.
Outline
• Introduction to the Square Kilometre Array
• SKA Science Data Processor Storage
Requirements
• SDP Architecture
• (SKA Science)
• Q&A
Scientific Context – a partner to ALMA, EELT, JWST
Credit: A. Marinkovic/XCam/ALMA (ESO/NAOJ/NRAO)
Credit: Northrop Grumman (artist's impression)
Credit: ESO/L. Calçada (artist's impression)
Credit: SKA Organisation (artist's impression)
European ELT:
• ~40 m optical telescope
• Completion ~2025
• Budget ~1.1 bn EUR
JWST:
• 6.5 m space near-infrared telescope
• Launch 2018
• Budget ~8 bn USD
ALMA:
• 66 high-precision sub-mm antennas
• Completed in 2013
• Budget ~1.5 bn USD
Square Kilometre Array:
• Two next-generation low-frequency arrays
• Completion ~2022 for Phase 1
• Budget ~0.65 bn EUR for Phase 1 construction
What will the Square Kilometre Array (SKA) be?
Radio Telescope:
• Makes images of the sky at radio (5 m – 3 cm) wavelengths
• ~100x faster than current telescopes
• Complements ALMA, JWST (successor to Hubble), and E-ELT
Currently in Design:
• Construction begins 2018
• Full operations expected at end of 2022
• Good fraction of funds already committed by participating countries
Major Engineering Project:
• Two remote desert sites
• >100k receiving elements
Major ICT Project:
• Subject of this talk!
The SKA Observatory – Phase 1: "SKA1"
• Receptors: aperture arrays and dishes
• Digital signal processing: FPGAs + (maybe) ASICs & GPUs
• Data processing: general computing
• People, buildings, roads, ground works, communications, electrical power, maintenance, transportation, catering at desert sites…
SKA Context Diagram
[Diagram of SKA1 components: SKA1 Low (Low Frequency Aperture Array) with the LFAA Correlator/Beam Former, Pulsar Search Processor (Australia) and Science Data Processor implementation (Australia); SKA1 Mid (dish antennas with single-pixel feeds) with the SKA1 Mid Correlator/Beam Former, Pulsar Search Processor (South Africa) and Science Data Processor implementation (South Africa); Monitor and Control; Astronomers.]
The Science Data Processors are off-site! (In Perth & Cape Town)
Large “D” – vs – Large “N”
[Photos: GBT 100-m diameter telescope; SKA LFAA prototype array]
• Aim no. 1: collect as many photons as possible
• Aim no. 2: maximum separation of collectors → achieve high angular resolution
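The trade-off can be summarised with the standard diffraction relation (a sketch; the numerical values are assumed for illustration only, not SKA design figures):

```latex
% Collecting area (sensitivity) grows with the number/size of collectors
% ("large N" or "large D"); angular resolution is set by their maximum
% separation B_max:
\theta \approx \frac{\lambda}{B_{\max}},
\qquad \text{e.g. } \lambda = 1\,\mathrm{m},\; B_{\max} = 65\,\mathrm{km}
\;\Rightarrow\; \theta \approx 1.5\times10^{-5}\,\mathrm{rad} \approx 3''.
```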
SDP Top-level Components & Key Performance Requirements – SKA Phase 1
Interfaces: Telescope Manager; CSP (~1 TeraByte/s in); Regional Centres & Astronomers
SDP Local Monitor & Control
Data Processor:
• High performance: ~100 PetaFLOPS
• Data intensive: ~100 PetaBytes/observation (job)
• Partially real-time: ~10 s response time
• Partially iterative: ~10 iterations/job (~3 hours)
Data Preservation:
• High volume & high growth rate: ~100 PetaBytes/year
• Infrequent access: ~a few times/year max
Delivery System:
• Data distribution: ~100 PetaBytes/year from Cape Town & Perth to the rest of the world
• Data discovery: visualisation of 100k × 100k × 100k voxel cubes
Goal is to extract information from the data and then discard the data.
Key Characteristics of SKA Data Processing
• Very large data volumes: all data are processed in each observation
• Noisy data: accumulating many visibilities improves the signal/noise
• Sparse and incomplete sampling: corrected for by deconvolution using iterative algorithms (~10 iterations)
• Corrupted measurements: corrected by jointly solving for the sky brightness distribution and for the slowly changing corruption effects, using iterative algorithms
• Multiple dimensions of data parallelism: loosely coupled tasks, so a large degree of parallelism is inherently available
Radio telescopes make noisy measurements
[Figure: simulated visibilities – individual rows are noisy, but in the sums of rows, and in the sum of all rows, one can clearly see the fringes! Illustration only – not real data.]
Radio interferometers sample sparsely & irregularly
Sampling:
• Each pair of telescopes results in a measured "visibility"
• Loosely equivalent to a sample in the Fourier domain
• Inevitably there are areas without any samples:
  – at high radius → limit of attainable resolution
  – at small radius → limit of largest detectable structure
• A regular distribution of telescopes leads to a regular distribution of samples in the uv plane
Illustrative example for the 27-antenna JVLA!
Radio interferometers sample sparsely & irregularly
Sampling:
• Earth rotation effectively moves the sampling point of each pair of telescopes
• This limits the possible integration time duration
• Fills in the gaps between the previous samples (but not the central gap, and nothing outside the limit)
Illustrative example for the 27-antenna JVLA!
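As a minimal sketch of how Earth rotation traces out uv samples (the baseline vector and observation length below are illustrative, not real array parameters), the standard coordinate transform gives one elliptical uv track per antenna pair:

```python
# Sketch: uv track swept out by one baseline as the Earth rotates.
import numpy as np

def uv_track(bx, by, bz, dec, hour_angles):
    """uv coordinates of one baseline (bx, by, bz, in wavelengths) for a
    source at declination dec, over an array of hour angles (radians)."""
    h = hour_angles
    u = bx * np.sin(h) + by * np.cos(h)
    v = (-bx * np.cos(h) + by * np.sin(h)) * np.sin(dec) + bz * np.cos(dec)
    return u, v

# One baseline, source at declination 45 deg, ~6 hours of observation:
h = np.linspace(-np.pi / 4, np.pi / 4, 200)
u, v = uv_track(1000.0, 500.0, 0.0, np.radians(45.0), h)
# With N antennas there are N*(N-1)/2 such tracks
# (351 for the 27-antenna JVLA of the illustration).
```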
Radio interferometers sample sparsely & irregularly
Sampling:
• More rotation…
Illustrative example for the 27-antenna JVLA!
Radio interferometers sample sparsely & irregularly & redundantly!
Redundant + sparse sampling: with SKA1 there will be very many individual visibilities in each observation:
• Accumulation of visibilities improves the signal/noise
• Fills in most of the uv plane
In general, however, to retain the best signal/noise, significant unevenness in the sampling of the uv plane must be accepted → deconvolution.
Illustrative example for the 27-antenna JVLA!
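A minimal sketch of the kind of iterative deconvolution meant here, in the style of Högbom CLEAN (array sizes, the loop gain and iteration count are illustrative, not SDP parameters):

```python
# Sketch: iteratively subtract scaled, shifted copies of the point spread
# function (the Fourier transform of the uv sampling pattern) from the
# dirty image, building up a model of the sky.
import numpy as np

def clean(dirty, psf, gain=0.1, niter=10):
    """dirty and psf: 2-D arrays of the same shape, psf peak at its centre."""
    residual = dirty.copy()
    model = np.zeros_like(dirty)
    centre = np.array(psf.shape) // 2
    for _ in range(niter):
        peak = np.unravel_index(np.argmax(np.abs(residual)), residual.shape)
        flux = gain * residual[peak]
        model[peak] += flux
        # Shift the PSF so its centre sits on the peak, then subtract.
        # (np.roll wraps at the edges - acceptable for a sketch.)
        shifted = np.roll(np.roll(psf, peak[0] - centre[0], axis=0),
                          peak[1] - centre[1], axis=1)
        residual -= flux * shifted
    return model, residual
```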
Measurements are imperfect – corrupted by slowly changing mechanical, electrical & atmospheric effects
[Figure: uncalibrated image vs "offset"-calibrated image]
Source: Rick Perley & Oleg Smirnov, "High Dynamic Range Imaging", www.astron.nl/gerfeest/presentations/perley.pdf
Iterative & joint solving for the image of the sky & calibration
[Figure: "self-calibration" and "closure-error" calibration results]
Source: Rick Perley & Oleg Smirnov, "High Dynamic Range Imaging", www.astron.nl/gerfeest/presentations/perley.pdf
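A minimal sketch of the self-calibration loop described above: alternately solve for the sky and for the slowly varying corruptions. All the helper functions (initial_gains, apply_gains, image, predict, solve_gains) are hypothetical placeholders, not an actual SDP or library API:

```python
# Sketch: joint, iterative solution for sky model and antenna gains.
def self_calibrate(visibilities, n_iter=10):
    gains = initial_gains()                            # hypothetical: unit gains
    sky_model = None
    for _ in range(n_iter):
        corrected = apply_gains(visibilities, gains)   # remove current gain estimate
        sky_model = image(corrected)                   # imaging + deconvolution step
        predicted = predict(sky_model)                 # model visibilities from the sky
        gains = solve_gains(visibilities, predicted)   # least-squares gain solution
    return sky_model, gains
```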
Data-parallelism schemes
[Diagram: visibility data, split by frequency and by time & baseline, distributed to processing nodes; buffered uv data feed grid/de-grid and FFT stages.]
• Exploit frequency independence
• Data parallelism dominated by frequency
• Provides the dominant scaling
• Nothing more is needed if each processing node can manage the complete processing of a frequency channel
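A minimal sketch of this first scheme, assuming (hypothetically) that process_channel does the complete grid + FFT + deconvolution for one channel; because channels are independent, the workers need no communication:

```python
# Sketch: one worker per frequency channel, no inter-worker communication.
from concurrent.futures import ProcessPoolExecutor

def process_channel(vis):
    """Hypothetical placeholder: grid, FFT and deconvolve one channel."""
    raise NotImplementedError

def process_all_channels(channel_data, n_workers=64):
    # channel_data: mapping channel_id -> that channel's visibilities.
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(process_channel, channel_data.values()))
```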
Data-parallelism schemes
[Diagram: sort and distribute visibility data and target grids by frequency and by time & baseline; gather target grids after the grid/de-grid and FFT stages.]
• Exploit frequency independence
• Further data parallelism from locality in UVW-space
• Use it to balance memory bandwidth per node
• Some overlap regions on the target grids are needed
• UV data buffered either on a locally shared object store or locally on each node
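A sketch of the overlap bookkeeping this scheme needs (tile size and kernel support are illustrative assumptions, not SDP parameters): a visibility near a tile edge must be routed to every tile its gridding kernel touches.

```python
# Sketch: route a visibility to the uv-plane tile(s) it contributes to,
# including neighbouring tiles when it lies within the convolution
# kernel's support of a tile edge.
def tiles_of(u, v, tile_size, support):
    tiles = set()
    for du in (-support, 0.0, support):
        for dv in (-support, 0.0, support):
            tiles.add((int((u + du) // tile_size),
                       int((v + dv) // tile_size)))
    return tiles
```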
Data-parallelism schemes
[Diagram: distribute visibility data across nodes and duplicate the target grid; gather and accumulate target grids after the grid/de-grid and FFT stages.]
• Exploit frequency independence
• To manage total I/O from the buffer/bus, distribute visibility data for the same target grid across nodes, with the target grid duplicated
• Duplication of the target provides fall-over protection
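A minimal sketch of the final gather step under this scheme, assuming each node returns a NumPy array holding its duplicated copy of the target grid (its own visibilities' contribution); losing one copy loses only that node's contribution:

```python
# Sketch: sum the partial target grids gridded independently on each node.
import numpy as np

def accumulate(partial_grids):
    """Element-wise sum of the partial grids returned by the nodes."""
    return np.sum(np.stack(partial_grids), axis=0)
```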
The challenges?
• Power efficiency
  – Funding agencies are more tolerant of cap-ex than of power op-ex
• Cost of hardware and complexity of software
• Scalability of software
  – Hardware roadmaps indicate h/w will reach the requirements
  – Demonstrated software scaling is only 1/1000th of the requirement
• Project risks
  – Inevitable significant interaction between software engineers and idiosyncratic domain-specific knowledge
  – Software project: extensibility, system scalability, maintainability
  – SKA1 is the first "milestone" – expecting significant expansion in the 2020s
  – 50-year observatory lifetime
SDP Storage Functions / Use Cases
Temporary storage of raw data:
• Large volume
• Write once, read ~10 times
• Predictable read patterns and locality with respect to processing
• Rewritten every ~12 hours
Temporary storage of intermediate data products:
• Objective is to reduce working memory size
• At least one read per write (perhaps a few on average)
Temporary persistence of key data / program state:
• Objective is failure recovery
• Write many times, read once
Archiving of science data:
• Write once, read a few times
• Most data will be permanently stored
• Long-term reliability important
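A quick back-of-envelope check of how the 12-hour rewrite cycle ties to the endurance figure quoted on the hardware slide below (both numbers are the slides' own order-of-magnitude values):

```python
# Sketch: raw-data buffer rewritten every ~12 hours.
hours_per_year = 365 * 24
rewrite_period_hours = 12
cycles_per_year = hours_per_year / rewrite_period_hours
print(cycles_per_year)  # 730.0 - the origin of the ~1000 write cycles/year
                        # endurance requirement quoted later.
```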
Radio Astronomy Parallelisation – Current Best Practice
• Routinely achieve ~4000-way parallelism:
  – ~100 separate observations analysed in parallel
  – Each on a ~40-core shared-memory system
  – Often with frequent human interaction in each ("the human pipeline")
• Currently limited by the number of people with expertise in astronomical radio interferometry
• For SKA we think we will be limited by the cost of storing the raw data for long enough
Current baseline: double-buffered operation
Double-buffering processing scheme for batch processing (flow of time left to right):
• Observation A: visibilities from the correlator and telescope state from TM flow through ingest processing (near-real-time processing) into Buffer 1; batch processing of Observation A iterates on the buffered data and writes results to the archive.
• Observation B: meanwhile, visibilities from the correlator and telescope state from TM flow through ingest processing (near-real-time processing) into Buffer 2; batch processing of Observation B then iterates on that buffer and writes results to the archive.
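A minimal sketch of the scheduling logic, with ingest of the current observation running concurrently with batch processing of the previous one on the other buffer; Buffer, ingest and batch_process are hypothetical placeholders, not SDP code:

```python
# Sketch: double-buffered ingest + batch processing.
import itertools
from concurrent.futures import ThreadPoolExecutor

def run(observations, buffer_a, buffer_b):
    with ThreadPoolExecutor(max_workers=2) as pool:
        batch_job = None
        for obs, buf in zip(observations, itertools.cycle((buffer_a, buffer_b))):
            ingest_job = pool.submit(ingest, obs, buf)   # near-real-time processing
            if batch_job is not None:
                batch_job.result()          # previous batch must finish before
            ingest_job.result()             # its buffer can be reused
            batch_job = pool.submit(batch_process, buf)  # iterate, then archive
        if batch_job is not None:
            batch_job.result()
```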
Storage H/W-level requirements
• Average aggregate write of ~1 TeraByte/s and average aggregate read of ~10 TeraByte/s
  – Large block sizes (>1 MB) are fine
• Capacity minimum 100 PetaBytes
• Power and cost balance compatible with computing doing ~5000 FLOPS/byte of I/O
• High endurance: ~1000 write cycles/year
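A back-of-envelope consistency check on these figures (all values are the slides' own order-of-magnitude numbers, not a design):

```python
# Sketch: compute implied by the I/O rates and FLOPS/byte ratio above.
write_rate = 1e12        # bytes/s  (~1 TeraByte/s aggregate write)
read_rate = 10e12        # bytes/s  (~10 TeraByte/s aggregate read)
flops_per_byte = 5000
print((write_rate + read_rate) * flops_per_byte / 1e15)
# ~55 PetaFLOPS - the same order as the ~100 PetaFLOPS Data Processor.
```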
Storage S/W-layer requirements
We need:
• High read/write throughput efficiency for large blocks
• Sophisticated management of read latency – explicit pre-fetch probably necessary
• Both local and remote read of data – local write probably sufficient
• Modest resilience for a subset of the data
We don't think we need:
• POSIX filesystem, locking, concurrent access
• Remote read between arbitrary pairs of end-points – "island"-level remote read probably sufficient
• Cross-node striping
• Any security, permissions, or ownership mechanisms
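A minimal sketch of the explicit pre-fetch idea: issue the read for the next block before processing the current one, so compute overlaps the fetch latency. read_block and process are hypothetical placeholders for the storage API and the compute step:

```python
# Sketch: single-stream explicit pre-fetch against a high-latency store.
from concurrent.futures import ThreadPoolExecutor

def stream(block_ids):
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(read_block, block_ids[0])
        for next_id in block_ids[1:]:
            data = pending.result()                   # block we need now
            pending = io.submit(read_block, next_id)  # pre-fetch the next one
            process(data)                             # overlaps with the fetch
        process(pending.result())
```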
In contrast:
• HPC simulation checkpointing is almost diametrically opposed:
  – Write many, read once
  – Resilience/recovery after a crash is the main use case
  – Weak interaction with processing
• Traditional "big data":
  – Small objects, high information content
  – Diverse algorithms re-run on the same dataset
  – Multi-tenant datacentres, data protection, security
Design Ideas
• Mix storage and processing throughout the system, perhaps in each node
• Enable data-locality by distributing the raw data so that the processing load is roughly balanced
• Distributed object storage system
• Dataflow programming model, allowing scheduling of both computation and data movement
Questions
• Cost & power models for disk and solid state
• Are the benefits of data locality worth the software complexity?
• How to integrate with the dataflow programming model?
• Integration with interconnect fabrics
• What is the filesystem/object layer?
• Can we replace (large) message passing with put/get?
• Will the traditional driver/kernel/libc/application stack limit performance?
Programming model
• Hybrid programming model:
  – Dataflow at coarse-grained level:
    • About 1 million tasks/s max over the whole processor (→ ~10s–100s millisecond tasks), consuming ~100 MegaBytes each
    • Static scheduling at the coarsest level (down to "data island")
      – Static partitioning of the large-volume input data
    • Dynamic scheduling within a data island: failure recovery, dynamic load-balancing
    • Data driven (all data will be used)
  – Shared-memory model at fine-grained level, e.g. threads/OpenMP/SIMT-like:
    • ~100s of active threads per shared-memory space
    • Allows manageable working memory size and computational efficiency
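A minimal sketch of the coarse-grained dataflow level using plain futures as a stand-in for whatever dataflow runtime is eventually chosen; grid_channel, fft_image and accumulate are hypothetical placeholders:

```python
# Sketch: statically partitioned data islands, dynamically scheduled tasks
# (~10-100 ms, ~100 MB each) within an island.
from concurrent.futures import ThreadPoolExecutor

def run_island(island_channels, pool):
    # Static partitioning placed these channels on this data island;
    # within the island, the pool's work queue gives dynamic scheduling.
    grids = [pool.submit(grid_channel, ch) for ch in island_channels]
    images = [pool.submit(fft_image, g.result()) for g in grids]
    return accumulate(im.result() for im in images)
```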
Why?
• Shared-memory model essential at fine grain to control working memory requirements
• Dataflow (but all of these are still to be proven in our application):
  – Maintainability
  – Load-balancing
  – Minimisation of data movement
  – Handling failure
  – Adaptability to different system architectures
Fault Tolerance – "Non-Precious" Data
• Classify arcs in the dataflow graph as precious or non-precious
• Precious data are treated in the usual way – failover, restart, RAID, etc.
• Non-precious data can be dropped:
  – If they are input to a map-type operation, then there is no output
  – If they are input to a reduction, then the result is computed without them
• Stragglers outputting non-precious data can be terminated after a relatively short time-out
The non-precious data concept – illustration
[Diagram 1: the full dataset is split by frequency (FREQ 1 … FREQ NF); each channel is gridded and FFT'd into a single-frequency image; pairs are reduced by pixel-wise addition into 2-frequency images, then into a combined image, which is normalised by 4.]
[Diagram 2: the same graph with one branch dropped – one reduction now receives a 2-frequency image and only a 1-frequency image; the combined image is normalised by 3 instead of 4.]
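A minimal sketch of the renormalising reduction shown in the illustration: sum whichever single-frequency images arrived and divide by the number of contributors, so a dropped non-precious branch changes the noise level, not the normalisation of the answer:

```python
# Sketch: reduction tolerant to dropped non-precious inputs.
import numpy as np

def reduce_images(images):
    """images: per-channel results, with None for branches that were dropped."""
    arrived = [im for im in images if im is not None]
    if not arrived:
        raise RuntimeError("all inputs lost - treat as a precious-data failure")
    return np.sum(np.stack(arrived), axis=0) / len(arrived)
```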
Data Ingress and Interconnect Concept
[Diagram: CSP nodes feed an ingest layer of 1st- and 2nd-stage Bulk Data Network (BDN) switches into an array of compute islands; 1st- and 2nd-stage Low-Latency Network (LLN) switches and a 3rd-stage switch interconnect the islands; storage pods attach alongside the islands; science archive switches connect out to delivery. Labelled elements: High Performance Buffer, Medium Performance Buffer, Long Term Storage, Low-Latency Network, Bulk Data Network, Archive Network.]
Pulsar Surveys: Testing Gravity
• The SKA will detect around 30,000 pulsars in our own galaxy, including ~2,000 millisecond pulsars (accurate clocks)
• Relativistic binaries give unprecedented strong-field tests of gravity – expect ~100 of them
• Timing net of millisecond pulsars to detect gravitational waves via timing residuals
• Expect timing accuracy to improve by ~100×
Pulsar timing array:
• Nano-Hertz range of frequencies
• MBH–MBH binaries: resolved objects and stochastic background
• Cosmic strings and other exotic phenomena
• Timing residuals of ~10s of ns need millisecond pulsars
Finding the unexpected
[Figures: Hubble Deep Field (HDF), ~3,000 galaxies; Very Large Array observation of the HDF, ~15 radio sources]