HPS Switch and Adapter Architecture, Design & Performance
Rama K Govindaraju
IBM Systems & Tech. Group
HiPC Conference, Bangalore, India, December 19-22, 2004
Copyright by IBM

Team
• Architecture
  – Peter Hochschild, Don Grice, Kevin Gildea, Rama Govindaraju
• Hardware
  – Carl A Bender, Jay Herring, Piyush Chaudhary, Steven Martin, Jason Goscinski, John Houston, …
• Software
  – Chulho Kim, Robert Blackmore, Rajeev Sivaram, Hanhong Xue, …
• And many others contributed to this effort

Outline

• What is HPS?

• Example HPS customers

• Interconnect Historical Performance

• HPS switch architecture

• HPS adapter architecture

• HPS software architecture

• Transport Modes

• HPS Performance

• Lessons Learned and Future Work

What is HPS?
• HPS (High Performance Switch)
  – 4th generation switch and adapter to interconnect IBM’s Power processor based nodes (Power 4 and 5)
  – To be used in many of the world’s fastest supercomputers
    • 20 of the top 100 today use HPS
  – Addressing requirements of
    • HPC labs, DOE, and others
    • Weather Forecasting, Petroleum sector, Automotive and Aerospace sector
    • NSA and DOD
  – Core infrastructure for the 100TF ASCI Purple system to be delivered in June 2005

Example HPS Customers
• More than 30 and growing
• Several over 1000 CPUs
• Total: over 200TF

Historical Interconnect Performance
IBM developed Switch Interconnects and Adapters

                      1993      1996      1998        2000        2004
Adapter               TB2       TB3       TBMX        Colony      HPS
Switch                HPS       TBS       TBS         SP-Switch2  HPS
Processor             Power 2   Power 2   Power PC/3  Power 3     Power 4
Peak link bandwidth   40MB/s    150MB/s   150MB/s     500MB/s     2GB/s
MPI bandwidth         35MB/s    110MB/s   135MB/s     375MB/s     1.8-14GB/s
MPI latency           40us      24us      21us        17us        <4.2us
Links/node server     1         1         1           1,2         2,4,6,8

HPS Switch Fabric

[Figure: diagram of the HPS switch fabric. Labels: Power4 and Power5 based servers; GX bus; HPS Adapters (Canopus, RAM, LDC); LDC chip; link drivers (copper driver, fiber optics driver); Agilent optics; switch board; HPS switch chip. Cable labels: "12 meter copper cables or 80 meter fiber cables" and "12 meter copper cables or 40 meter fiber cables".]

4K end points, 59ns latency, 2GB/s bandwidth per link per direction

HPS Adapter Microcode Model

[Figure: block diagram of the adapter microcode engine between the server interface and the fabric interface. Labels:]
• 8M bytes SRAM
• General Parameters: 256 byte
• Channel Buffer: 512 byte
• Packet Header: 128 byte
• Packet Data: 256 - 2K byte
• 16K byte total
• Formatter: mask, rotate, merge
• Formatter RAM: 256 entries
• ALU: parallel mask, shift, arithmetic & branch
• Instruction RAM: 4K entries of 64 bits
• Program Counter
• General Registers: 16 entries of 64 bits
• Task Registers: 16 entries of 64 bits
• control - status
• IAMover, PacketMover, DataMover
• Remaining labels in extraction order (label-to-number pairing is not recoverable): Format, 4, MMIO, 32+64, memory fetch, 16, memory store, 16, SRAM, 16, IAread, 8, IAwrite, 8, PM0, 16, PM 1

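HPS Software Architecture

[Figure: software stack diagram divided into User Space and Kernel Space. Labels: APPLICATION, IBM’s MPI, Parallel ESSL, ESSL, LAPI, HAL, GPFS, VSD, SOCKETS, TCP, UDP, IP, IF_LS, DD, HYP, HMC, FNM, Service Processor, HPS Adapter, HPS Switch Fabric. The fragments "L", "LC", "SM" also appear in the extracted text.]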

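FIFO versus RDMA models

[Figure: data paths through the software stack, divided into User Space and Kernel Space. Labels: MPI, LAPI, HAL, User Buffer, HAL Buffers, Sockets, UDP, TCP, IP Interface, Interface Layer, Federation Adapter. Path labels: FIFO copy, FIFO DMA, RDMA.]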

Supported Communication Modes
• FIFO Mode
  – Message chopped into 2K packet chunks on the host and copied by CPU
  – Memory bus crossing depends on caching. At least 1 IO bus crossing
• RDMA enablement
  – No slave side protocol
  – CPU offload
  – Enhanced Programming model
  – 1 IO bus crossing

[Figure: data paths between User Buffer, CPU, Network FIFO, and Adapter; arrows labeled Ld/St, Ld/St, DMA, and RDMA.]
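
The FIFO-mode fragmentation described above can be sketched in a few lines of Python. This is only an illustration: the 2K chunk size comes from the slide, while the names `fragment` and `fifo_send` and the use of a plain list as the "network FIFO" are assumptions made for this example, not part of the HPS software.

```python
PACKET_PAYLOAD = 2048  # 2K packet chunk size quoted on the slide


def fragment(message, payload=PACKET_PAYLOAD):
    """Chop a message into fixed-size chunks; the last chunk may be shorter."""
    return [message[i:i + payload] for i in range(0, len(message), payload)]


def fifo_send(message, fifo):
    """Copy each chunk into a hypothetical network FIFO (a plain list here).

    The explicit bytes() copy stands in for the host CPU copying every
    packet, which is the cost that RDMA mode avoids.
    """
    packets = fragment(message)
    for chunk in packets:
        fifo.append(bytes(chunk))  # models the per-packet CPU copy
    return len(packets)
```

Every byte of the message passes through the host CPU once per packet in this model, which is why the slide contrasts it with the CPU offload of RDMA.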

RDMA value proposition
• Possible overlap of computation and communication
  – Fragmentation/reassembly offloaded to the adapter
  – Minimize packet arrival interrupts
  – Requires application to be written to take advantage of overlap
• One sided programming model
• Zero copy transport and reduced memory subsystem load
• Striping advantage
• KEY DIFFERENTIATOR: reliable RDMA protocol over unreliable datagram transport
  – Allows striping across multiple paths
  – Out of order arrival
  – Reduces hot spotting and contention
• Cons
  – Pinned memory usage
  – Resource management and fairness issues

Federation Performance
• Summary:
  – Latency: Power 4, 1.9GHz, HPS
    • MPI latency 4.34us
    • Interrupt latency: adds 10us
    • 8 task latency: adds 1us
  – Bandwidth: Power 4, 1.9GHz, HPS
    • FIFO mode:
      – Unidirectional bandwidth: ~1.8GB/s
      – Bidirectional bandwidth: 2.1GB/s
    • RDMA mode:
      – Unidirectional bandwidth: ~1.8GB/s
      – Bidirectional bandwidth: ~3.0GB/s
      – Linear striping performance up to 8 links
        » Unidirectional: 14GB/s, Bidirectional: 24GB/s
• These are preliminary measurements
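
As a quick arithmetic check of the "linear striping" claim, the sketch below multiplies the per-link RDMA figures from this slide by the link count and compares the result with the quoted 8-link numbers. The helper names are hypothetical, and ideal linear scaling is an assumption of the model, not a measured result.

```python
def ideal_striped_bandwidth(per_link_gbs, links):
    """Ideal aggregate bandwidth if every link contributes its full rate."""
    return per_link_gbs * links


def striping_efficiency(measured_gbs, per_link_gbs, links):
    """Fraction of the ideal aggregate that the measured figure reaches."""
    return measured_gbs / ideal_striped_bandwidth(per_link_gbs, links)


# Per-link RDMA figures from the slide: ~1.8 GB/s uni, ~3.0 GB/s bidirectional.
uni_ideal = ideal_striped_bandwidth(1.8, 8)   # about 14.4 GB/s
bidi_ideal = ideal_striped_bandwidth(3.0, 8)  # 24.0 GB/s
uni_eff = striping_efficiency(14.0, 1.8, 8)   # roughly 0.97
```

The quoted 14GB/s and 24GB/s are therefore consistent with near-linear scaling of the approximate single-link rates across 8 links.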

HPS: MPI Latency

Machine Type      Latency Measurement
1.9GHz, p690+     4.34us
1.7GHz, p690+     4.72us
1.7GHz, p655+     4.70us
1.5GHz, p690+     5.15us
1.3GHz, p690      5.5us

• All measurements were made using IBM’s thread safe MPI libraries
• 8 task latency adds approximately 1 additional microsecond
• Interrupt latency adds approximately 10-12 microseconds
• All measurements are preliminary

Unidirectional Bandwidth Peak

Machine Type      Peak Uni-dir Bandwidth
1.9GHz, p690+     1.800GB/s
1.7GHz, p690+     1.686GB/s
1.7GHz, p655+     1.800GB/s
1.5GHz, p690+     1.470GB/s
1.3GHz, p690      1.170GB/s

All measurements are preliminary

Unidirectional Bandwidth Profile

[Figure: Bandwidth (MB/s, axis 0 to 2000) versus Message Size (bytes).]

P655, 1.7GHz based system
M1/2 = 32K, M3/4 = 128K

Bidirectional Bandwidth Profile

[Figure: Bandwidth (MB/s, axis 0 to 2500) versus Message Size (bytes).]

P655, 1.7GHz based system
M1/2 = 16K, M3/4 = 64K

Striping Options

[Figure: communication time by thread/task (T1, T2, T3) under three models: a) Asynchronous Model, b) Synchronous Model, c) Aggregate Comm Thread Model.]

Striping Models

[Figure: two software stacks, each with an MPI Layer on top and ADAPTERS at the bottom. "Multiple threads doing copies model": MPI Layer, three LAPI Layer instances, three HAL instances, ADAPTERS. "Single Thread with Pipelined RDMA model": MPI Layer, one LAPI Layer, three HAL instances, ADAPTERS.]

Second approach:
– More elegant failover model
– Less synchronization issues and CPU contention via RDMA

RDMA Unidirectional Bandwidth

[Figure: Preliminary RDMA Unidirectional BW. Bandwidth (MB/s, axis 0 to 16000) versus Message Size (16384 to 2.68E+08), with one curve each for Single Link, Two Links, Four Links, and Eight Links.]

RDMA Bidirectional Bandwidth

[Figure: Preliminary RDMA Bidirectional BW. Bandwidth (MB/s, axis 0 to 25000) versus Message Size (16384 to 2.68E+08), with one curve each for Single Link, Two Links, Four Links, and Eight Links.]

How can users exploit RDMA?
• Overlap computation and communication
  – Non blocking calls
  – Reuse communication buffers if possible
  – User exposed RDMA in 11/05
• Minimize interrupts for large transfers
• Reduce contention for memory
• Better raw bandwidth for messages over 80KB
• Possibility of overlapping collectives better (via striping)
• IP transport much more efficient (translates to improved GPFS performance)
• Select striping when sending large messages
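
The size-based guidance above could be expressed as a simple selection rule. The sketch below is hypothetical: the 80KB crossover comes from the slide, but treating 1KB as 1024 bytes, reusing the same threshold as the definition of a "large" message for striping, and the function name `choose_transport` are all assumptions made for illustration.

```python
# "messages over 80KB" from the slide; 1KB = 1024 bytes is an assumption.
RDMA_THRESHOLD_BYTES = 80 * 1024


def choose_transport(message_bytes, links_available=1):
    """Return (mode, links_to_use) for a message of the given size.

    Small messages stay on the FIFO path over a single link; messages above
    the threshold use RDMA and are striped across all available links.
    """
    if message_bytes > RDMA_THRESHOLD_BYTES:
        return ("RDMA", links_available)
    return ("FIFO", 1)
```

A production library would likely tune the crossover per system, since the bandwidth tables in this deck vary with processor speed.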

Future Work
• Enabling HPS for Power 5 based nodes
• Exploit SMT in Power 5 processor for FIFO mode
• Further attack MPI latency
• Use RDMA to improve MPI collectives performance
• Parallel file systems (GPFS) further exploitation of IP over RDMA
• Take lessons learned into the Percs project

