Date post: 21-May-2020

LAWRENCE BERKELEY NATIONAL LABORATORY

FUTURE TECHNOLOGIES GROUP

Network Centric Runtime Implementation

Costin Iancu Khaled Ibrahim

The Stampede to Exascale

• Hardware is changing to satisfy power constraints
  • O(10^3) cores per node/socket, O(10^6) nodes
  • Nodes are hybrid, asymmetric
  • Network is under-provisioned (tapered = 0.1 bytes/flop, not fully connected)
• New application domains are emerging
  • Some regular, embarrassingly parallel (Bioinformatics)
  • Some are irregular, hard to parallelize (ExaCT)
• Programming models are changing to reflect hardware and apps
  • Shared memory, Global Address Spaces
  • Fine grained asynchrony – parcels, activities, fibers…
  • Unstructured parallelism – finish/async, phasers…

Integrator Needed

• Intra-node programming is the CHALLENGE!
  • Focus on shared memory programming
  • Multiple paradigms, projects, languages, runtimes
    • Data parallel (CUDA, OpenCL, OpenMP…)
    • Dynamic tasking (OpenMP, X10, Chapel, HPX, SWARM)
    • Work stealing (Habanero, X10, Cilk, Intel TBB)
• Integration with the network and efficient whole-system utilization is another CHALLENGE!
  • Node architecture/programming tackled by industry/academia/HPC
  • HPC networking is a niche market
  • Areas that require progress
    • System software support
    • Performance models
    • Dynamic optimizations

Integration Goals

• Efficiency and productivity layer for heavily threaded/asynchronous applications
• Productivity
  • Decouple application/runtime level concurrency from runtime concurrency
  • Manage asynchrony for clients
• Performance portability = optimal throughput for
  • Any implementation (pthreads, procs…)
  • Any hardware architecture (asymmetric, heterogeneous)
  • Any message mix
  • Any source, target

Optimization Goals

[Figure: node diagram with 1K cores, 100 GB fast memory, 1 TB slow memory and a NIC; links labelled 0.4 TB/s, 4 TB/s and 0.4 TB/s]

Node: 1000 cores, 100 GB memory

• 1000 messages, each 500 B – 20 KB
• Aggregation: 500 KB message
  • Programmer
  • Runtime

Small message throughput AND dynamic message coalescing + large messages
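
The aggregation figure follows from simple arithmetic: 1000 messages of 500 B each add up to one 500 KB message. A minimal sketch of runtime-side coalescing, assuming a hypothetical `Coalescer` class that buffers small messages per destination and flushes once a byte threshold is reached (the class, its method names and the threshold policy are illustrative assumptions, not THOR's actual API):

```python
from collections import defaultdict


class Coalescer:
    """Hypothetical per-destination buffer that merges small messages."""

    def __init__(self, send, threshold_bytes=500_000):
        self.send = send                   # callback: send(dest, payload)
        self.threshold = threshold_bytes   # flush once this many bytes are buffered
        self.buffers = defaultdict(list)   # dest -> pending payloads
        self.sizes = defaultdict(int)      # dest -> buffered byte count

    def put(self, dest, payload: bytes):
        """Buffer one small message; flush when the threshold is reached."""
        self.buffers[dest].append(payload)
        self.sizes[dest] += len(payload)
        if self.sizes[dest] >= self.threshold:
            self.flush(dest)

    def flush(self, dest):
        """Send everything buffered for `dest` as one large message."""
        if self.buffers[dest]:
            self.send(dest, b"".join(self.buffers[dest]))
        self.buffers[dest] = []
        self.sizes[dest] = 0


def demo():
    """1000 messages of 500 B to one destination become a single send."""
    sent = []
    c = Coalescer(lambda dest, data: sent.append((dest, len(data))))
    for _ in range(1000):
        c.put(7, bytes(500))
    return sent
```

Calling `demo()` records one send of 500,000 bytes instead of 1000 separate sends. A real runtime would also need a timeout-based flush so that a partially filled buffer does not delay delivery indefinitely.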

Runtime Components

[Figure: system stack with layers Application, Compiler, Runtime, Operating System and Network]

• Runtime Scheduler
  • Inject/retire independent messages
  • Message re-ordering
  • Match network concurrency with core concurrency
  • Flow control
• Dynamic communication optimizations
  • Dynamic program analysis & representation
  • Performance models
  • Optimization methodology

Networks and Message Throughput

InfiniBand Performance

[Figure: InfiniBand performance by runtime implementation]

• Pthreads: Qthreads, Chapel, ParalleX, X10
• Hybrid: hierarchical locales, BUPC
• Processes: optimal, MPI, BUPC

Performance: Implementation Matters!

[Figure: the same InfiniBand plot (Pthreads, Hybrid, Processes), annotated with 5X, 3X and 2X]

• Very poor performance portability
• Non-scalable throughput

Throughput and Core Concurrency

[Figure: BUPC/GASNet on InfiniBand]

Serializing communication using 16 cores is 40% faster than using 32 cores (expected: 2x slower)

Throughput and Message Concurrency

[Figure: Cray UPC on Cray XE6 (Gemini)]

Limiting the number of outstanding messages provides a 5X speedup (expected: 32X slower)
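
One way to limit the number of outstanding messages is a counting semaphore placed in front of the injection path. The sketch below is an illustration only: `ThrottledInjector` and `run_demo` are hypothetical names, the `send` callback stands in for a blocking network put, and the limit of 4 is an arbitrary placeholder, not a value taken from the measurements above.

```python
import threading


class ThrottledInjector:
    """Hypothetical admission gate: at most `limit` messages in flight."""

    def __init__(self, limit):
        self._slots = threading.BoundedSemaphore(limit)
        self._lock = threading.Lock()
        self.in_flight = 0
        self.max_in_flight = 0
        self.completed = 0

    def inject(self, send, msg):
        self._slots.acquire()          # block while `limit` messages are outstanding
        with self._lock:
            self.in_flight += 1
            self.max_in_flight = max(self.max_in_flight, self.in_flight)
        try:
            send(msg)                  # stand-in for a blocking network put
        finally:
            with self._lock:
                self.in_flight -= 1
                self.completed += 1
            self._slots.release()      # retire the message and free its slot


def run_demo(num_threads=32, limit=4, msgs_per_thread=10):
    """32 threads inject messages, but never more than `limit` at once."""
    gate = ThrottledInjector(limit)

    def worker():
        for i in range(msgs_per_thread):
            gate.inject(lambda m: None, i)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return gate
```

After `run_demo()` all 320 messages have completed and `max_in_flight` never exceeds the limit, regardless of how many threads are issuing communication.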

Throughput Oriented Runtime for Large Scale Manycore Systems (THOR)

New Performance Metrics

• Current runtimes optimized for single core latency and bandwidth
  • Design and implementation
  • Micro-benchmarks and evaluation
• I want to optimize for throughput
  • Benchmarks
  • Metrics – is it msgs/sec or need delay guarantees too?
• Analytical model
  • Have talked before about LogGP for multicore
  • Or just empirical?
  • Is there a roofline?
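
For reference, the textbook single-link LogGP model (latency L, per-message overhead o, inter-message gap g, per-byte gap G) fits in a few lines. This is the standard model, not the multicore extension the slide alludes to, and the `LogGP` class and any parameter values passed to it are placeholders rather than measurements.

```python
from dataclasses import dataclass


@dataclass
class LogGP:
    L: float  # network latency (seconds)
    o: float  # CPU overhead per message at sender and at receiver (seconds)
    g: float  # minimum gap between consecutive message injections (seconds)
    G: float  # gap per byte for long messages (seconds per byte)

    def message_time(self, k: int) -> float:
        """End-to-end time of one k-byte message: o + (k - 1) * G + L + o."""
        return 2 * self.o + self.L + (k - 1) * self.G
```

The model is built around single-message latency and bandwidth, which is the kind of metric the slide argues current runtimes are already tuned for; the open question raised above is what the throughput-oriented counterpart should look like.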

Software Architecture

[Figure: software stack]

• Programming models (SPMD, task and data parallel): UPC, Chapel
• Runtimes: BUPC, Qthreads, Habanero
• THOR
  • Admission Control Layer
  • Optimization Layer
  • Scheduling Layer
• GASNet/MPI

Driven by runtime analysis and performance models

THOR Layers

• Admission Control Layer
  • Congestion Avoidance
  • Flow Control
  • Concurrency matching
  • Memory Consistency/Ordering
  • Dispatch to Optimization Services
• Optimization Layer
  • Coalescing
  • Aggregation
  • Reordering
• Scheduling Layer
  • Integrate communication with tasking
  • Instantiate and Retire Communication to Network
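
The three layers can be pictured as a pipeline: messages are admitted in bounded batches, each batch is coalesced and reordered, and the result is handed to the network. The sketch below is a toy model under that assumption; `AdmissionControl`, `optimize`, `schedule` and `run_pipeline` are hypothetical names and do not reflect how THOR is actually structured internally.

```python
from collections import defaultdict, deque


class AdmissionControl:
    """Admits at most `window` messages per round (a stand-in for flow control)."""

    def __init__(self, window):
        self.window = window
        self.pending = deque()

    def submit(self, dest, payload):
        self.pending.append((dest, payload))

    def admit(self):
        n = min(self.window, len(self.pending))
        return [self.pending.popleft() for _ in range(n)]


def optimize(batch):
    """Optimization layer: coalesce admitted messages by destination."""
    merged = defaultdict(bytes)
    for dest, payload in batch:
        merged[dest] += payload
    return sorted(merged.items())      # reordering: here simply by destination


def schedule(batch, network_send):
    """Scheduling layer: hand each coalesced message to the network."""
    for dest, payload in batch:
        network_send(dest, payload)


def run_pipeline(messages, window):
    """Push (dest, payload) pairs through all three layers; return what hit the wire."""
    wire = []
    ac = AdmissionControl(window)
    for dest, payload in messages:
        ac.submit(dest, payload)
    while ac.pending:
        schedule(optimize(ac.admit()), lambda d, p: wire.append((d, p)))
    return wire
```

With a window of 3, four small messages to two destinations leave the pipeline as three sends: the first batch merges the two messages bound for the same destination, and the leftover message goes out in the next round.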

Congestion Avoidance

• Throttle traffic to OPTIMAL concurrency
  • Use micro-benchmarks to explore space
• Proactive Management instructed by Declarative Behavior
  • Catalogue of known “patterns”
  • Intuitive descriptions (e.g. all2all), annotated by compilers/humans
• Integrated communication and task scheduling
  • Inline: mechanisms implemented in a distributed manner
  • Proxy: servers acting on behalf of clients
• Open loop control for scalability
  • With as little “global” state as possible
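
A micro-benchmark sweep of this kind can be reduced to "measure throughput at several concurrency levels and keep the best one per pattern". The sketch below assumes hypothetical helpers `best_concurrency` and `build_catalogue`; `synthetic_throughput` is a made-up curve used only so the example runs, not measured data.

```python
def best_concurrency(measure, candidates):
    """Return the candidate concurrency level with the highest measured throughput."""
    return max(candidates, key=measure)


def build_catalogue(patterns, candidates):
    """Map each declared pattern name (e.g. "all2all") to its best concurrency limit."""
    return {name: best_concurrency(measure, candidates)
            for name, measure in patterns.items()}


def synthetic_throughput(c, knee=8):
    """Made-up curve: throughput rises up to `knee`, then degrades under congestion."""
    return c if c <= knee else knee * knee / c
```

Because the resulting table is computed offline and looked up by pattern name, the runtime can throttle without exchanging global state at execution time, which is in the spirit of the open loop control bullet.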

Initial Results

• Prototype
  • BUPC/GASNet/InfiniBand
  • Cray UPC/DMAPP/Gemini
• Admission Control + Scheduling Layer
  • Not well tuned yet
• Results:
  • 4X performance improvement for all-to-all
  • 70% improvement on GUPS/HPCC RA
  • 17% on NAS Parallel Benchmarks

All-to-all InfiniBand 1024 Cores

• Speedup over GASNet tuned all-to-all: 2x
• Performance portable: single implementation

All-to-all Gemini 768 Cores

• Speedup over Cray UPC all-to-all: 4x
• Performance portable: single implementation

Plans/Projects

• Extend THOR and combine with Habanero
  • Dynamic optimizations
  • Increase communication concurrency
• Performance Models and Metrics
  • Throughput Roofline
  • Metrics other than msg/s

Dynamic Communication Optimizations

Case Study: Performance Portable Message Vectorization

• Compile-time and runtime analysis of UPC loop nests
  • Compiler analyses loop nest and generates templates/stubs annotated with information about behavior (memory region access – LMAD)
  • Runtime analysis decides structure of the transformed code and communication optimizations
  • Communication optimizations are performed using performance models

Performance Portable Optimizations for Loops Containing Communication Operations. Iancu, Chen, Yelick. ICS 2008
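
Message vectorization replaces per-iteration remote accesses with one bulk transfer hoisted out of the loop. The sketch below illustrates only that end result, not the compiler and runtime analysis from the paper; `RemoteArray` is a hypothetical stand-in for a remote shared array that simply counts how many messages each access style would generate.

```python
class RemoteArray:
    """Hypothetical stand-in for a remote shared array that counts messages."""

    def __init__(self, data):
        self._data = list(data)
        self.messages = 0

    def get(self, i):
        self.messages += 1             # one network message per element
        return self._data[i]

    def get_bulk(self, start, stop):
        self.messages += 1             # one message for the whole contiguous region
        return self._data[start:stop]


def sum_fine_grained(remote, n):
    """Original loop shape: a remote read in every iteration."""
    return sum(remote.get(i) for i in range(n))


def sum_vectorized(remote, n):
    """Transformed loop: communication hoisted out, computation on a local copy."""
    local = remote.get_bulk(0, n)
    return sum(local)
```

For a 100-element array both versions compute the same sum, but the fine-grained loop issues 100 messages and the vectorized one issues 1. Whether the bulk transfer actually wins on a given system is the decision the slide assigns to the performance models.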

Challenge: Program Representation

• Optimizer friendly program representation
  • Experience describing memory regions and flow control
  • What about unstructured parallelism? (DAGs)
  • What about resource requirements/usage?

Challenge: Optimization Strategy

[Figure: performance versus load, with labels "Models, Asymptotic", "Optimizations, Instantaneous" and "Flow Control, Fairness"; annotated "> 2X"]

• Global view optimizations enabled by SPMD
• Local view optimizations to achieve global optimum

Using Network Performance Models

• Most approaches measure asymptotic values; optimizations need instantaneous values
• Existing “time accurate” performance models do not account well for system scale OR wide SMP nodes
• Qualitative models: which is faster, not how fast! (PPoPP’07, ICS’08, PACT’08)
• Not time accurate, understand errors and model robustness, allow for imprecision/noise, preserve order, be pessimistic
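
A qualitative model only has to rank alternatives correctly. One possible reading of "preserve order, be pessimistic" is sketched below: switch strategies only when the predicted ordering survives the model's assumed error, and otherwise keep a safe default. The function `pessimistic_choice` and the symmetric relative error bound are illustrative assumptions, not the method of the cited papers.

```python
def pessimistic_choice(est_a, est_b, rel_error, default="a"):
    """Pick the faster of two strategies only when the ordering survives model error.

    est_a, est_b: predicted times (any consistent unit) for strategies "a" and "b".
    rel_error:    assumed relative error bound of the model, e.g. 0.2 for +/-20%.
    """
    lo_a, hi_a = est_a * (1 - rel_error), est_a * (1 + rel_error)
    lo_b, hi_b = est_b * (1 - rel_error), est_b * (1 + rel_error)
    if hi_a < lo_b:
        return "a"       # a is faster even in its worst case
    if hi_b < lo_a:
        return "b"       # b is faster even in its worst case
    return default       # intervals overlap: the ordering is not trustworthy
```

With a 20% error bound, predictions of 10 and 20 pick "a" and predictions of 20 and 10 pick "b", while 10 and 12 fall back to the default because the error intervals overlap.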

Building an Optimizer

• Build catalogue of representative scenarios/codes (e.g. all-to-all)
  • Spatial-temporal exploration of network performance
    • Short and large time scales – account for variability and system noise
    • Small and large system scales – SMP node, full system
• Understand worst case behavior – BUILD REPELLER
• Develop optimized implementations of representatives using local knowledge
• Develop program analysis, representations and dynamic classification schemes to map programs to representatives (pattern matching)
• Develop statistical/empirical approaches for optimizations using local knowledge
  • E.g. combinations of small and large messages

My DEGAS ToDo List

• Efficient execution at Exascale requires a network centric approach
  • Message throughput
  • Dynamic communication optimizations
• Providing message throughput requires
  • Better OS support
  • Dynamic end-point concurrency control
• Dynamic optimizations require (Years 2-3)
  • Better program representations that capture resource usage
  • Different performance models
  • Optimization algorithms using local knowledge

Thank You!

Backup…

Hardware Trends

• NIC on die (memory controller)
  • Faster injection is bad for throughput
• Acceleration (IBM BG/Q progress thread, Mellanox FCA)
  • NIC still has to match core level of parallelism
• Tapered networks (asymmetric, not fully connected)
  • Smaller bisection means lower throughput
• Where’s the flow control?
• And NO, hybrid programming won’t solve the problem!
  • Hardware can be really fast, still have to implement high level semantics
