ProOnE: A General-Purpose Protocol Onload Engine for Multi- and Many-Core Architectures P. Lai, P. Balaji, R. Thakur and D. K. Panda Computer Science and Engg., Ohio State University Math. and Computer Sci., Argonne National Laboratory
Transcript
Page 1: ProOnE: A General-Purpose Protocol Onload Engine for Multi- and Many- Core Architectures P. Lai, P. Balaji, R. Thakur and D. K. Panda Computer Science.

ProOnE: A General-Purpose Protocol Onload Engine for Multi- and Many-Core Architectures

P. Lai, P. Balaji, R. Thakur and D. K. Panda

Computer Science and Engg., Ohio State University

Math. and Computer Sci., Argonne National Laboratory


Hardware Offload Engines

• Specialized Engines
  – Widely used in High-End Computing (HEC) systems for accelerating task processing
  – Built for specific purposes
    • Not easily extendable or programmable
    • Serve a small niche of applications
• Trends in HEC systems: Increasing size and complexity
  – Difficult for hardware to deal with complexity
    • Fault tolerance (understanding application and system requirements is complex and too environment specific)
    • Multi-path communication (optimal solution is NP-complete)

Pavan Balaji, Argonne National Laboratory

ISC (06/23/2009)


General Purpose Processors

• Multi- and Many-core Processors
  – Quad- and hex-core processors are commodity components today
  – Intel Larrabee will have 16 cores; Intel Terascale will have 80 cores
  – Simultaneous Multi-threading (SMT or Hyperthreading) is becoming common
• Expected future
  – Each physical node will have a massive number of processing elements (Terascale on the chip)

[Figure: Future multicore systems]


Benefits of Multi-core Architectures

• Cost
  – Cheap: Moore’s law will continue to drive costs down
  – HEC is a small market; Multi-cores != HEC
• Flexibility
  – General purpose processing units
  – A huge number of tools already exist to program and utilize them (e.g., debuggers, performance measuring tools)
  – Extremely flexible and extendable


Multi-core vs. Hardware Accelerators

• Will multi-core architectures eradicate hardware accelerators?
  – Unlikely: Hardware accelerators have their own benefits
  – Hardware accelerators provide two advantages:
    • More processing power, better on-board memory bandwidth
    • Dedicated processing capabilities
      – They run compute kernels in a dedicated manner
      – Do not deal with shared mode processing like CPUs
  – But more complex machines need dedicated processing for more things
    • More powerful hardware offload techniques possible, but decreasing returns!


ProOnE: General Purpose Onload Engine

• Hybrid hardware-software engines
  – Utilize hardware offload engines for low-hanging fruit
    • Where return on investment is maximum
  – Utilize software “onload” engines for more complex tasks
    • Can imitate some of the benefits of hardware offload engines, such as dedicated processing
• Basic Idea of ProOnE
  – Dedicate a small subset of processing elements on a multi-core architecture for “protocol onloading”
  – Different capabilities added to ProOnE as plugins
  – Application interacts with ProOnE for dedicated processing


ProOnE: Basic Overview

• ProOnE does not try to take over tasks performed well by hardware offload engines
• It only supplements their capabilities

[Figure labels: Software middleware (e.g., MPI); ProOnE (shown three times); Basic network hardware; Network with basic acceleration; Network with advanced acceleration]


Presentation Layout

• Introduction and Motivation

• ProOnE: A General Purpose Protocol Onload Engine

• Experimental Results and Analysis

• Conclusions and Future Work


ProOnE Infrastructure

• Basic Idea:
  – Onload a part of the work to the ProOnE daemons
  – Application processes and ProOnE daemons communicate through intra-node and inter-node communication

[Figure: Node 1 and Node 2, each with its own shared memory, connected by a network; legend: ProOnE, Application process]


Design Components

• Intra-node Communication
  – Each ProOnE process allocates a shared memory segment for communication with the application processes
  – Application processes use this shared memory to communicate requests, completions, signals, etc., to and from ProOnE
    • Use queues and hash functions to manage the shared memory
• Inter-node Communication
  – Utilize any network features that are available
  – Build a logical all-to-all connectivity on the available communication infrastructure
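The intra-node path described above (requests and completions managed with queues and hash functions) can be pictured with a small in-process sketch. It is only an illustration under assumptions: `RequestRing`, `Segment`, and their methods are hypothetical names, an ordinary Python list stands in for the shared memory segment, and the slides do not describe the real layout.

```python
class RequestRing:
    """Fixed-capacity FIFO, standing in for a queue laid out in a shared segment."""

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.head = 0   # next slot to read
        self.tail = 0   # next slot to write
        self.count = 0

    def push(self, item):
        if self.count == len(self.slots):
            return False  # queue full; the caller has to retry later
        self.slots[self.tail] = item
        self.tail = (self.tail + 1) % len(self.slots)
        self.count += 1
        return True

    def pop(self):
        if self.count == 0:
            return None
        item = self.slots[self.head]
        self.slots[self.head] = None
        self.head = (self.head + 1) % len(self.slots)
        self.count -= 1
        return item


class Segment:
    """Hypothetical per-daemon segment: a request queue plus a completion table."""

    def __init__(self, capacity=4):
        self.requests = RequestRing(capacity)
        self.completions = {}  # hash table: request id -> completion status

    def post(self, req_id, kind, payload):
        """Application side: enqueue a request for the daemon."""
        return self.requests.push((req_id, kind, payload))

    def daemon_step(self):
        """Daemon side: serve one request and record its completion."""
        req = self.requests.pop()
        if req is None:
            return False
        req_id, kind, _payload = req
        self.completions[req_id] = "CMPLT:" + kind
        return True

    def poll(self, req_id):
        """Application side: fetch (and clear) a completion, or None if not ready."""
        return self.completions.pop(req_id, None)
```

The application never blocks in this sketch: it posts a request, keeps computing, and polls the completion table when it needs the result.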


Design Components (contd.)

• Initialization Infrastructure
  – ProOnE processes
    • Create their own shared memory region
    • Attach to the shared memory created by other ProOnE processes
    • Connect to remote ProOnE processes
    • Listen and accept connections from other processes (ProOnE processes and application processes)
  – Application processes
    • Attach to ProOnE shared memory segments
    • Connect to remote ProOnE processes


ProOnE Capabilities

• Can “onload” tasks for any component in the application executable:
  – Application itself
    • Useful for progress management (master-worker models) and work-stealing based applications (e.g., mpiBLAST)
  – Communication middleware
    • MPI stack can utilize it to offload certain tasks that can take advantage of dedicated processing
    • Assuming that the CPU is shared with other processes makes many things inefficient
  – Network Protocol stacks
    • E.g., Greg Regnier’s TCP Onload Engine


Case Study: MPI Rendezvous Protocol

• Used for large messages to avoid copies and/or large unexpected data

[Figure: Sender (MPI_Send) and Receiver (MPI_Recv) exchanging RTS, CTS and DATA messages]
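As a rough illustration of the handshake, the sketch below replays the three message types in the usual rendezvous order, reading RTS as "request to send" and CTS as "clear to send" (the slide does not expand the acronyms). The function `rendezvous` and its event encoding are hypothetical; the point is only that the payload moves after both sides have posted their calls, so a large message never sits in an unexpected-message buffer.

```python
def rendezvous(events, payload):
    """Replay a rendezvous transfer.

    events: order in which the two sides post their calls,
            e.g. ["send", "recv"] or ["recv", "send"].
    Returns (trace, delivered): the control/data messages exchanged and
    the bytes that reached the receive buffer (None if nothing was delivered).
    """
    trace = []
    rts_waiting = False   # sender has announced the message
    recv_posted = False   # receiver has a buffer ready
    delivered = None
    for ev in events:
        if ev == "send":
            trace.append("RTS")   # small header only, no payload yet
            rts_waiting = True
        elif ev == "recv":
            recv_posted = True
        if rts_waiting and recv_posted and delivered is None:
            trace.append("CTS")   # receiver confirms the buffer is ready
            trace.append("DATA")  # payload goes straight to the user buffer
            delivered = bytes(payload)
    return trace, delivered
```

If only the send has been posted, the trace stops at RTS and no payload is transferred, which is how the protocol avoids buffering large unexpected data.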


Issues with MPI Rendezvous

• Communication Progress in MPI Rendezvous
  – Delay in detecting control messages
  – Hardware support for One-sided communication useful, but not perfect
    • Many cases are expensive to implement in hardware
  – Bad computation/communication overlap!

[Figure: timeline of Sender (MPI_Isend, computation, MPI_Wait) and Receiver (MPI_Irecv, computation, MPI_Wait) exchanging RTS, CTS and DATA; annotations: “Delayed!”, “Wait for data”, “No overlap!”]


MPI Rendezvous with ProOnE

• ProOnE is a dedicated processing engine
  – No delay caused by the receiver being “busy” with something else
• Communication progress is similar to earlier
  – Completion notifications included as appropriate

[Figure: message flow between the MPI sender, Shmem 0, ProOnE 0, ProOnE 1, Shmem 1 and the MPI receiver, for two cases: “Receiver arrives earlier” and “Receiver arrives later”. Labeled steps: SEND request, RECV request, read request, “Try to match a posted RECV request”, CTS (matched), RTS (no match), DATA, CMPLT.]
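The two cases in the figure (“Receiver arrives earlier” and “Receiver arrives later”) can be sketched as a daemon that keeps two lists and matches whichever side shows up second. This is one plausible reading of the figure, not the authors' implementation: the slide does not say exactly which daemon performs the match, and `OnloadDaemon`, its method names, and the log format are all hypothetical.

```python
class OnloadDaemon:
    """Hypothetical dedicated matcher: it is never busy with application work."""

    def __init__(self):
        self.posted_recvs = []  # RECV requests read from shared memory
        self.pending_rts = []   # RTS messages that found no match yet
        self.log = []

    def on_recv_request(self, key):
        """A local application process posted a receive."""
        if key in self.pending_rts:
            self.pending_rts.remove(key)
            self._complete(key)
        else:
            self.posted_recvs.append(key)

    def on_rts(self, key):
        """An RTS arrived on behalf of a remote sender."""
        if key in self.posted_recvs:
            self.posted_recvs.remove(key)
            self._complete(key)
        else:
            self.pending_rts.append(key)
            self.log.append(("RTS queued", key))

    def _complete(self, key):
        # Matched: answer with CTS, let the data flow, then notify completion.
        self.log.append(("CTS", key))
        self.log.append(("DATA", key))
        self.log.append(("CMPLT", key))
```

Because the daemon runs on its own core, the match happens as soon as the second party arrives, regardless of whether the application process is in the middle of a computation.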


Design Issues and Solutions

• MPICH2’s internal message matching uses the three-tuple for matching
  – (src, comm, tag)
  – Issue: out-of-order messages (even when the communication sub-system is ordered)
  – Solution: Add a sequence number to the matching tuple

MPI Rank 0
  Send1: (src=0, tag=0, comm=0, len=1M)
  Send2: (src=0, tag=0, comm=0, len=1K)
MPI Rank 1
  Recv1: (src=0, tag=0, comm=0, len=1M)
  Recv2: (src=0, tag=0, comm=0, len=1K)

MPI Rank 0
  Send1: (src=0, tag=0, comm=0, len=1M)
  Send2: (src=0, tag=0, comm=0, len=1K)
MPI Rank 1
  Recv1: (src=0, tag=0, comm=0, len=1M)
  Recv2: (src=0, tag=0, comm=0, len=1K)
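A small sketch of why the sequence number helps, using the Send1/Send2 example above. It assumes, for illustration only, that the matcher happens to see Send2 before Send1 (for instance because the two messages take different paths); the `match` function and the `seq` field are hypothetical names, not MPICH2 internals.

```python
def match(arrivals, recvs, use_seq):
    """Pair each posted receive with the first pending arrival whose key is equal."""

    def key(m):
        base = (m["src"], m["comm"], m["tag"])
        return base + (m["seq"],) if use_seq else base

    pending = list(arrivals)
    pairs = []
    for r in recvs:
        for m in pending:
            if key(m) == key(r):
                pending.remove(m)
                pairs.append((r["name"], m["name"]))
                break
    return pairs


send1 = {"name": "Send1", "src": 0, "comm": 0, "tag": 0, "seq": 0, "len": 1 << 20}
send2 = {"name": "Send2", "src": 0, "comm": 0, "tag": 0, "seq": 1, "len": 1 << 10}
recv1 = {"name": "Recv1", "src": 0, "comm": 0, "tag": 0, "seq": 0, "len": 1 << 20}
recv2 = {"name": "Recv2", "src": 0, "comm": 0, "tag": 0, "seq": 1, "len": 1 << 10}

out_of_order = [send2, send1]  # assumed arrival order at the matcher
three_tuple = match(out_of_order, [recv1, recv2], use_seq=False)
with_seq = match(out_of_order, [recv1, recv2], use_seq=True)
```

With the plain three-tuple, `three_tuple` pairs Recv1 with Send2, which breaks MPI's ordering guarantee; with the sequence number, `with_seq` pairs Recv1 with Send1 and Recv2 with Send2 even though the arrivals were reordered.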


Design Issues and Solutions (contd.)

• ProOnE is “shared” between all processes on the node
  – It cannot distinguish requests to different ranks
  – Solution: Add a destination rank to the matching tuple
• Shared memory lock contention
  – Shared memory divided into 3 segments: (1) SEND/RECV requests, (2) RTS messages and (3) CMPLT notifications
  – (1) and (2) use the same lock to avoid mismatches
• Memory Mapping
  – ProOnE needs access to application memory to avoid extra copies
  – Kernel direct-copy used to achieve this (knem, LiMIC)
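The locking rule above can be sketched as follows: the request segment and the RTS segment are checked and updated under one lock, so a RECV request and an RTS for the same key can never both be stored as "unmatched", while completions use a separate lock. The class `SharedRegion`, its fields, and the key layout `(src, dest, comm, tag, seq)` are assumptions made for this illustration, with Python threads standing in for separate processes.

```python
import threading


class SharedRegion:
    """Hypothetical stand-in for the three shared-memory segments."""

    def __init__(self):
        self.requests = {}  # segment 1: posted RECV requests, by key
        self.rts = {}       # segment 2: unmatched RTS messages, by key
        self.cmplt = []     # segment 3: completion notifications
        self.match_lock = threading.Lock()  # guards segments 1 and 2 together
        self.cmplt_lock = threading.Lock()  # guards segment 3 only

    def post_recv(self, key):
        """Store a RECV request, or match it against a waiting RTS."""
        with self.match_lock:
            matched = self.rts.pop(key, None) is not None
            if not matched:
                self.requests[key] = "RECV"
        if matched:
            self._notify(key)
        return matched

    def post_rts(self, key):
        """Store an RTS, or match it against a posted RECV request."""
        with self.match_lock:
            matched = self.requests.pop(key, None) is not None
            if not matched:
                self.rts[key] = "RTS"
        if matched:
            self._notify(key)
        return matched

    def _notify(self, key):
        with self.cmplt_lock:
            self.cmplt.append(key)
```

Including the destination rank in the key keeps requests for different ranks on the same node apart: an RTS for `(0, 1, 0, 0, 0)` does not match a receive posted as `(0, 2, 0, 0, 0)`.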


Experimental Setup

• Experimental Testbed
  – Dual Quad-core Xeon Processors
  – 4GB DDR SDRAM
  – Linux kernel 2.6.9.34
  – Nodes connected with Mellanox InfiniBand DDR adapters
• Experimental Design
  – Use one core on each node to run ProOnE
  – Compare the performance of ProOnE-enabled MPICH2 with vanilla MPICH2, marked as “ProOnE” and “Original”


Overview of the MPICH2 Software Stack

• High Performance and Widely Portable MPI
  – Supports MPI-1, MPI-2 and MPI-2.1
  – Supports multiple network stacks (TCP, MX, GM)
  – Commercial support by many vendors
    • IBM (integrated stack distributed by Argonne)
    • Microsoft, Intel (in process of integrating their stacks)
  – Used by many derivative implementations
    • MVAPICH2, BG, Intel-MPI, MS-MPI, SC-MPI, Cray, Myricom
• MPICH2 and its derivatives support many Top500 systems
  – Estimated at more than 90%
  – Available with many software distributions
  – Integrated with ROMIO MPI-IO and the MPE profiling library


Computation/Communication Overlap

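• Sender Overlap
    One process:   MPI_Isend(array); Compute(); MPI_Wait();
    Peer process:  MPI_Recv(array);
• Receiver Overlap
    One process:   MPI_Irecv(array); Compute(); MPI_Wait();
    Peer process:  MPI_Send(array);
• Computation/Communication Ratio: W/T
    (W and T label time intervals in the original figure)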
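The W/T ratio can be explored with a toy timing model. The model is an assumption made for illustration, not the authors' measurement method: it takes W to be the time spent in Compute() and T the time from posting the non-blocking call until MPI_Wait returns, and it considers only the two extremes of perfect overlap and no overlap. The function names are hypothetical.

```python
def total_time(compute, comm, overlapped):
    """T under an idealized model: max(...) with full overlap, sum without."""
    if overlapped:
        return max(compute, comm)
    return compute + comm


def overlap_ratio(compute, comm, overlapped):
    """W/T, where W is the compute time and T the modeled total time."""
    return compute / total_time(compute, comm, overlapped)
```

Under these assumptions, 4 ms of computation against a 2 ms transfer gives a ratio of 1.0 when the transfer is fully hidden and about 0.67 when it only starts inside MPI_Wait; a ratio close to 1 therefore indicates that communication progressed during the computation.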


Sender-side Overlap

[Figure: two plots of Overlap Ratio (y-axis, 0 to 1.2) versus computation time, comparing “ProOnE” and “Original”. First plot: Message Size 1MB, Computation (ms). Second plot: Message Size 256KB, Computation (us).]


Receiver-side Overlap

[Figure: two plots of Overlap Ratio versus Computation (ms), comparing “ProOnE” and “Original”. First plot: Message Size 1MB. Second plot: Message Size 256KB.]


Impact of Process Skew

[Figure: “Receiver-side Overlap (1MB)”: Total Time at Sender (ms, 0 to 120) versus Computation (msec, 5 to 50); legend entry recovered: ProOnE]

• Benchmark

    Process 0:
      for many loops
        Irecv(1MB, rank 1)
        Send(1MB, rank 1)
        Computation()
        Wait()

    Process 1:
      Irecv()
      for many loops
        Computation()
        Wait()
        Irecv(1MB, rank 0)
        Send(1MB, rank 0)


Sandia Application Availability Benchmark

[Figure: two plots versus Message Size (Bytes, 1 to 1M), comparing “ProOnE” and “Original”. First plot, “Overhead”: Overhead (us, 0 to 3500). Second plot, “Availability”: Application availability (%).]


Matrix Multiplication Performance

[Figure: two plots of Execution Time (ms), comparing “ProOnE” and “Original”. First plot, “With varying Problem Size”: Problem Size 128x128, 256x256, 512x512. Second plot, “With Varying System Config”: System Size (#nodes x #cores) 16x1, 8x2, 4x4.]


2D Jacobi Sweep Performance

[Figure: “With Varying Boundary Data Size”: Time for WaitAll (us, 0 to 1200) versus Boundary Data Size (8K, 16K, 32K, 64K), comparing “ProOnE” and “Original”]


Concluding Remarks and Future Work

• Proposed, designed and evaluated a general purpose Protocol Onload Engine (ProOnE)
  – Utilize a small subset of the cores to onload complex tasks
• Presented detailed design of the ProOnE infrastructure
• Onloaded MPI Rendezvous protocol as a case study
  – ProOnE-enabled MPI provides significant performance benefits for benchmarks as well as applications
• Future Work:
  – Study performance and scalability on large-scale systems
  – Onload other complex tasks including application kernels


Acknowledgments

• The Argonne part of the work was funded by:
  – NSF Computing Processes and Artifacts (CPA)
  – DOE ASCR
  – DOE Parallel Programming models
• The Ohio State part of the work was funded by:
  – NSF Computing Processes and Artifacts (CPA)
  – DOE Parallel Programming models


Thank You!

Contacts:

{laipi, panda} @ cse.ohio-state.edu

{balaji, thakur} @ mcs.anl.gov

Web Links:

MPICH2: http://www.mcs.anl.gov/research/projects/mpich2

NBCLab: http://nowlab.cse.ohio-state.edu

