Page 1: An Analysis of 10-Gigabit Ethernet Protocol Stacks in Multi-core Environments G. Narayanaswamy, P. Balaji and W. Feng Dept. of Comp. Science Virginia Tech.

An Analysis of 10-Gigabit Ethernet Protocol Stacks in Multi-core Environments

G. Narayanaswamy, P. Balaji and W. Feng

Dept. of Comp. Science, Virginia Tech
Mathematics and Comp. Science, Argonne National Laboratory

Page 2: An Analysis of 10-Gigabit Ethernet Protocol Stacks in Multi-core Environments G. Narayanaswamy, P. Balaji and W. Feng Dept. of Comp. Science Virginia Tech.

High-end Computing Trends

• High-end Computing (HEC) Systems

– Continue to increase in scale and capability

– Multicore architectures

• A significant driving force for this trend

• Quad-core processors from Intel/AMD

• IBM Cell, Sun Niagara, Intel Terascale processor

– High-speed Network Interconnects

• 10-Gigabit Ethernet (10GE), InfiniBand, Myrinet, Quadrics

• Different stacks use different amounts of hardware support

• How do these two components interact with each other?

Page 3: An Analysis of 10-Gigabit Ethernet Protocol Stacks in Multi-core Environments G. Narayanaswamy, P. Balaji and W. Feng Dept. of Comp. Science Virginia Tech.

Multicore Architectures

• Multi-processor vs. multicore systems
– Not all of the processor hardware is replicated for multicore systems
– Hardware units such as the cache might be shared between the different cores
– Multiple processing units are embedded on the same processor die, so inter-core communication is faster than inter-processor communication

• On most architectures (Intel, AMD, Sun), all cores are equally powerful, which makes scheduling easier (a topology-inspection sketch follows)
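As a concrete illustration (an assumption, not from the slides), here is a small Linux-only C sketch that reports, for each core, its physical package and which cores share its last-level cache, i.e., the kind of sharing the bullets above refer to:

```c
/* Minimal sketch: walk Linux sysfs and print each core's package and
 * last-level-cache sharing group. Paths are standard Linux sysfs, but
 * which cache index is the LLC varies by CPU (see comment below). */
#include <stdio.h>
#include <string.h>

/* Read the first line of a sysfs file into buf; return 0 on success. */
static int read_line(const char *path, char *buf, size_t len)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(buf, (int)len, f))
        buf[0] = '\0';
    buf[strcspn(buf, "\n")] = '\0';
    fclose(f);
    return 0;
}

int main(void)
{
    char path[256], pkg[64], llc[256];

    for (int cpu = 0; ; cpu++) {
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu%d/topology/physical_package_id",
                 cpu);
        if (read_line(path, pkg, sizeof pkg) != 0)
            break;                              /* no such CPU: done */

        /* index3 is typically the last-level cache on recent CPUs; older
         * parts may stop at index2, so this path is an assumption. */
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu%d/cache/index3/shared_cpu_list",
                 cpu);
        if (read_line(path, llc, sizeof llc) != 0)
            snprintf(llc, sizeof llc, "(no index3 cache)");

        printf("cpu%-3d package %-3s shares LLC with cpus %s\n", cpu, pkg, llc);
    }
    return 0;
}
```

Cores that print the same shared_cpu_list sit behind one cache; communication between them avoids the inter-processor path the slide contrasts against.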

Page 4: An Analysis of 10-Gigabit Ethernet Protocol Stacks in Multi-core Environments G. Narayanaswamy, P. Balaji and W. Feng Dept. of Comp. Science Virginia Tech.

Interactions of Protocols with Multicores

• Depending on how the stack works, different protocols have different interactions with multicore systems

• Study based on host-based TCP/IP and iWARP

• TCP/IP has significant interaction with multicore systems
– Large impact on application performance

• The iWARP stack itself does not interact directly with multicore systems
– Software libraries built on top of iWARP DO interact (buffering of data, copies)
– Interaction similar to other high-performance protocols (InfiniBand, Myrinet MX, QLogic PSM)

Page 5: An Analysis of 10-Gigabit Ethernet Protocol Stacks in Multi-core Environments G. Narayanaswamy, P. Balaji and W. Feng Dept. of Comp. Science Virginia Tech.

TCP/IP Interaction vs. iWARP Interaction

[Diagrams: with host-based TCP/IP, packet arrival and packet processing go through the TCP/IP stack on the host before data reaches the applications; with iWARP, packet arrival and processing are offloaded to the network adapter, and each application talks to it through its own library.]

• TCP/IP is in some ways more asynchronous or "centralized" with respect to host processing than iWARP (or other high-performance software stacks)

• TCP/IP: host processing is independent of the application process and statically tied to a single core (see the /proc/interrupts sketch below)

• iWARP: host processing is closely tied to the application process
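On Linux, one way to see this static tie is to look at which core services the NIC's interrupts. A minimal sketch (an assumption, not part of the slides) that prints the per-CPU interrupt counts for lines in /proc/interrupts matching an interface-name substring:

```c
/* Minimal sketch: show per-CPU interrupt counts for the NIC.
 * Assumption: the NIC's lines in /proc/interrupts contain the interface
 * name (e.g. "eth0"); pass a different substring as argv[1] if needed. */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    const char *nic = (argc > 1) ? argv[1] : "eth";   /* substring to match */
    char line[1024];
    FILE *f = fopen("/proc/interrupts", "r");

    if (!f) {
        perror("/proc/interrupts");
        return 1;
    }
    if (fgets(line, sizeof line, f))
        printf("%s", line);                 /* header row: CPU0 CPU1 ... */
    while (fgets(line, sizeof line, f))
        if (strstr(line, nic))
            printf("%s", line);             /* per-CPU counts for the NIC */
    fclose(f);
    return 0;
}
```

If one column's count grows while the others stay flat, that core is absorbing the TCP/IP protocol processing regardless of where the application process runs.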

Page 6: An Analysis of 10-Gigabit Ethernet Protocol Stacks in Multi-core Environments G. Narayanaswamy, P. Balaji and W. Feng Dept. of Comp. Science Virginia Tech.

Presentation Layout

• Introduction and Motivation

• Treachery of Multicore Architectures

• Application Process to Core Mapping Techniques

• Conclusions and Future Work

Page 7: An Analysis of 10-Gigabit Ethernet Protocol Stacks in Multi-core Environments G. Narayanaswamy, P. Balaji and W. Feng Dept. of Comp. Science Virginia Tech.

MPI Bandwidth over TCP/IP

[Charts: MPI bandwidth (Mbps) over TCP/IP versus message size (1 byte to 4 MB) on the Intel and AMD platforms, one curve per core the process is bound to (Core 0 to Core 3).]
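The slides do not say which benchmark produced these curves; a windowed MPI bandwidth test along the following lines is one common way to obtain bandwidth-versus-message-size numbers like these (a sketch only, with WINDOW and the message-size sweep chosen to mirror the plot's x-axis):

```c
/* Minimal sketch of a windowed MPI bandwidth test (not the benchmark named
 * by the slides). Rank 0 streams WINDOW messages of each size to rank 1,
 * waits for a 1-byte ack, and reports Mbps.
 * Build: mpicc bw.c -o bw     Run: mpirun -np 2 ./bw                     */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define WINDOW   64
#define MAX_SIZE (4 * 1024 * 1024)

int main(int argc, char **argv)
{
    int rank;
    char *buf = malloc(MAX_SIZE);
    MPI_Request req[WINDOW];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int size = 1; size <= MAX_SIZE; size *= 4) {   /* 1 B ... 4 MB */
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();

        if (rank == 0) {
            for (int w = 0; w < WINDOW; w++)
                MPI_Isend(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &req[w]);
            MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
            MPI_Recv(buf, 1, MPI_CHAR, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            double secs = MPI_Wtime() - t0;
            printf("%8d bytes : %8.1f Mbps\n", size,
                   (double)size * WINDOW * 8 / secs / 1e6);
        } else if (rank == 1) {
            for (int w = 0; w < WINDOW; w++)
                MPI_Irecv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &req[w]);
            MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
            MPI_Send(buf, 1, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    free(buf);
    return 0;
}
```

Running the same test with the process pinned to Core 0, 1, 2, or 3 in turn would produce one curve per core, as in the plots above.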

Page 8: An Analysis of 10-Gigabit Ethernet Protocol Stacks in Multi-core Environments G. Narayanaswamy, P. Balaji and W. Feng Dept. of Comp. Science Virginia Tech.

MPI Bandwidth over iWARP

[Charts: MPI bandwidth (Mbps) over iWARP versus message size (1 byte to 4 MB) on the Intel and AMD platforms, one curve per core (Core 0 to Core 3).]

Page 9: An Analysis of 10-Gigabit Ethernet Protocol Stacks in Multi-core Environments G. Narayanaswamy, P. Balaji and W. Feng Dept. of Comp. Science Virginia Tech.

TCP/IP Interrupts and Cache Misses

[Charts: hardware interrupts per message (log scale) and the percentage difference in L2 cache misses, versus message size (1 byte to 4 MB), one curve per core (Core 0 to Core 3).]

Page 10: An Analysis of 10-Gigabit Ethernet Protocol Stacks in Multi-core Environments G. Narayanaswamy, P. Balaji and W. Feng Dept. of Comp. Science Virginia Tech.

MPI Latency over TCP/IP (Intel Platform)

[Charts: small-message latency (1 byte to 4 KB) and large-message latency (128 KB to 4 MB) over TCP/IP on the Intel platform, one curve per core (Core 0 to Core 3).]

Page 11: An Analysis of 10-Gigabit Ethernet Protocol Stacks in Multi-core Environments G. Narayanaswamy, P. Balaji and W. Feng Dept. of Comp. Science Virginia Tech.

Presentation Layout

• Introduction and Motivation

• Treachery of Multicore Architectures

• Application Process to Core Mapping Techniques

• Conclusions and Future Work

Page 12: An Analysis of 10-Gigabit Ethernet Protocol Stacks in Multi-core Environments G. Narayanaswamy, P. Balaji and W. Feng Dept. of Comp. Science Virginia Tech.

Application Behavior Pre-analysis

• A four-core system is effectively a 3.5-core system
– A part of a core has to be dedicated to communication
– Interrupts, cache misses

• How do we schedule 4 application processes on 3.5 cores?

• If the application is exactly synchronized, there is not much we can do

• Otherwise, we have an opportunity!

• Study with GROMACS and LAMMPS

Page 13: An Analysis of 10-Gigabit Ethernet Protocol Stacks in Multi-core Environments G. Narayanaswamy, P. Balaji and W. Feng Dept. of Comp. Science Virginia Tech.

GROMACS Overview

• Developed at Groningen University

• Simulates the molecular dynamics of biochemical particles

• The root distributes a "topology" file corresponding to the molecular structure

• Simulation time is broken down into a number of steps
– Processes synchronize at each step

• Performance is reported as the number of nanoseconds of molecular interactions that can be simulated each day

                 Core 0  Core 1  Core 2  Core 3  Core 4  Core 5  Core 6  Core 7
Combination A       0       4       2       6       7       3       5       1
Combination B       0       2       4       6       5       1       3       7
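One way to realize such a combination (an assumption: the slides do not say how the mapping was enforced, and taskset or the MPI runtime's binding options would work equally well) is to have each MPI process bind itself to the core listed for its slot. A minimal Linux sketch, reading the table's Combination A as "rank i runs on core map[i]":

```c
/* Minimal sketch: bind each MPI rank to a core from a combination table.
 * The map[] values copy Combination A above; the rank->slot interpretation
 * and the use of sched_setaffinity() are illustrative assumptions. */
#define _GNU_SOURCE
#include <mpi.h>
#include <sched.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int map[8] = {0, 4, 2, 6, 7, 3, 5, 1};   /* Combination A */
    int rank, core;
    cpu_set_t set;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    core = map[rank % 8];          /* assumes ranks fill node-local slots */
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    if (sched_setaffinity(0, sizeof set, &set) != 0)   /* 0 = this process */
        perror("sched_setaffinity");
    else
        printf("rank %d bound to core %d\n", rank, core);

    /* ... GROMACS-style time steps would run here ... */

    MPI_Finalize();
    return 0;
}
```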

Page 14: An Analysis of 10-Gigabit Ethernet Protocol Stacks in Multi-core Environments G. Narayanaswamy, P. Balaji and W. Feng Dept. of Comp. Science Virginia Tech.

GROMACS: Random Scheduling

[Charts: GROMACS LZM application performance (ns/day) over TCP/IP and iWARP for Combinations A and B, alongside the percentage breakdown of Computation, MPI_Wait, and other MPI calls on cores 0-3 of Machine 1 and Machine 2.]

Page 15: An Analysis of 10-Gigabit Ethernet Protocol Stacks in Multi-core Environments G. Narayanaswamy, P. Balaji and W. Feng Dept. of Comp. Science Virginia Tech.

GROMACS: Selective Scheduling

[Charts: GROMACS LZM application performance (ns/day) over TCP/IP and iWARP for Combinations A, B, A', and B', alongside the percentage breakdown of Computation, MPI_Wait, and other MPI calls on cores 0-3 of Machine 1 and Machine 2.]

Page 16: An Analysis of 10-Gigabit Ethernet Protocol Stacks in Multi-core Environments G. Narayanaswamy, P. Balaji and W. Feng Dept. of Comp. Science Virginia Tech.

LAMMPS Overview

• Molecular dynamics simulator developed at Sandia

• Uses spatial decomposition techniques to partition the simulation domain into smaller 3-D subdomains (see the sketch below)
– Each subdomain is allotted to a different process
– Interaction is required only between neighboring subdomains, which improves scalability

• Used the Lennard-Jones liquid simulation within LAMMPS

[Diagram: two four-core machines (Core 0 to Core 3 each) connected by the network.]
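A minimal sketch of the spatial-decomposition idea (not LAMMPS source code): build a 3-D process grid with MPI's Cartesian topology routines so that each rank owns one subdomain and knows its face neighbors.

```c
/* Minimal sketch: split the ranks into a 3-D process grid; each rank owns
 * one subdomain and only exchanges data with its face neighbors. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int nprocs, cart_rank, left, right;
    int dims[3] = {0, 0, 0}, periods[3] = {1, 1, 1}, coords[3];
    MPI_Comm cart;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    MPI_Dims_create(nprocs, 3, dims);              /* e.g. 8 ranks -> 2x2x2 */
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &cart);
    MPI_Comm_rank(cart, &cart_rank);
    MPI_Cart_coords(cart, cart_rank, 3, coords);

    /* face neighbors along dimension 0; dimensions 1 and 2 are analogous */
    MPI_Cart_shift(cart, 0, 1, &left, &right);

    printf("rank %d owns subdomain (%d,%d,%d); x-neighbors are ranks %d and %d\n",
           cart_rank, coords[0], coords[1], coords[2], left, right);

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}
```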

Page 17: An Analysis of 10-Gigabit Ethernet Protocol Stacks in Multi-core Environments G. Narayanaswamy, P. Balaji and W. Feng Dept. of Comp. Science Virginia Tech.

LAMMPS: Random Scheduling

[Charts: LAMMPS communication time (seconds) over TCP/IP and iWARP for Combinations A and B, alongside the percentage breakdown of MPI_Wait, MPI_Send, and other MPI calls on cores 0-3 of Machine 1 and Machine 2.]

Page 18: An Analysis of 10-Gigabit Ethernet Protocol Stacks in Multi-core Environments G. Narayanaswamy, P. Balaji and W. Feng Dept. of Comp. Science Virginia Tech.

LAMMPS: Intended Communication Pattern

[Diagram: in each timestep, the two communicating processes finish their computation, each issues MPI_Send() for its boundary data, posts MPI_Irecv() for the partner's data, and completes it with MPI_Wait() before the next step, so the processes advance in lock-step.]
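A minimal sketch of that per-step exchange between two partnered processes (details such as buffer sizes, tags, and the pairing of ranks are illustrative assumptions; the receive is posted before the send so the sketch does not rely on eager delivery):

```c
/* Minimal sketch of the per-timestep neighbor exchange the slide depicts
 * (not LAMMPS source). Each process sends its boundary data, receives the
 * partner's data with a non-blocking receive, and waits on it. */
#include <mpi.h>
#include <stdio.h>

#define N 4096                       /* illustrative boundary-buffer size */

int main(int argc, char **argv)
{
    int rank, nprocs, partner;
    double sendbuf[N], recvbuf[N];
    MPI_Request rreq;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    partner = rank ^ 1;                      /* pair ranks 0-1, 2-3, ... */
    for (int i = 0; i < N; i++)
        sendbuf[i] = rank;

    for (int step = 0; step < 10 && partner < nprocs; step++) {
        /* ... computation on the local subdomain ... */

        MPI_Irecv(recvbuf, N, MPI_DOUBLE, partner, step, MPI_COMM_WORLD, &rreq);
        MPI_Send(sendbuf, N, MPI_DOUBLE, partner, step, MPI_COMM_WORLD);
        MPI_Wait(&rreq, MPI_STATUS_IGNORE);

        /* If one core is effectively "slower" (part of it is consumed by
         * protocol processing), its send lags and the faster partner stalls
         * here in MPI_Wait -- the out-of-sync pattern on the next slide. */
    }

    MPI_Finalize();
    return 0;
}
```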

Page 19: An Analysis of 10-Gigabit Ethernet Protocol Stacks in Multi-core Environments G. Narayanaswamy, P. Balaji and W. Feng Dept. of Comp. Science Virginia Tech.

LAMMPS: Actual Communication Pattern

[Diagram: message flow between a "slower" core (one whose capacity is partly consumed by protocol processing) and a faster core, through the MPI buffer, socket send buffer, socket receive buffer, and application receive buffer; the faster core's MPI_Send() completes into these intermediate buffers, after which it sits in MPI_Wait() for the slower partner.]

"Out-of-sync" communication between processes

Page 20: An Analysis of 10-Gigabit Ethernet Protocol Stacks in Multi-core Environments G. Narayanaswamy, P. Balaji and W. Feng Dept. of Comp. Science Virginia Tech.

LAMMPS: Selective Scheduling

[Charts: LAMMPS communication time (seconds) over TCP/IP and iWARP for Combinations A, B, A', and B', alongside the percentage breakdown of MPI_Wait, MPI_Send, and other MPI calls on cores 0-3 of Machine 1 and Machine 2.]

Page 21: An Analysis of 10-Gigabit Ethernet Protocol Stacks in Multi-core Environments G. Narayanaswamy, P. Balaji and W. Feng Dept. of Comp. Science Virginia Tech.

Presentation Layout

• Introduction and Motivation

• Treachery of Multicore Architectures

• Application Process to Core Mapping Techniques

• Conclusions and Future Work

Page 22: An Analysis of 10-Gigabit Ethernet Protocol Stacks in Multi-core Environments G. Narayanaswamy, P. Balaji and W. Feng Dept. of Comp. Science Virginia Tech.

Concluding Remarks and Future Work

• Multicore architectures and high-speed networks are becoming prominent in high-end computing systems
– The interaction of these components is important and interesting!
– For TCP/IP, the scheduling order drastically impacts performance
– For iWARP, the scheduling order has no overhead
– Scheduling processes in a more intelligent manner allows significantly improved application performance
– It does not impact iWARP and other high-performance stacks, making the approach portable as well as efficient

• Dynamic process-to-core scheduling!

Page 23: An Analysis of 10-Gigabit Ethernet Protocol Stacks in Multi-core Environments G. Narayanaswamy, P. Balaji and W. Feng Dept. of Comp. Science Virginia Tech.

Thank You

Contacts:

Ganesh Narayanaswamy: [email protected]

Pavan Balaji: [email protected]

Wu-chun Feng: [email protected]

For More Information:

http://synergy.cs.vt.edu

http://www.mcs.anl.gov/~balaji

Page 24: An Analysis of 10-Gigabit Ethernet Protocol Stacks in Multi-core Environments G. Narayanaswamy, P. Balaji and W. Feng Dept. of Comp. Science Virginia Tech.

Backup Slides

Page 25: An Analysis of 10-Gigabit Ethernet Protocol Stacks in Multi-core Environments G. Narayanaswamy, P. Balaji and W. Feng Dept. of Comp. Science Virginia Tech.

MPI Latency over TCP/IP (AMD Platform)

[Charts: small-message latency (1 byte to 4 KB) and large-message latency (128 KB to 4 MB) over TCP/IP on the AMD platform, one curve per core (Core 0 to Core 3).]

