Cray User Group Meeting 24-28 May 1999, Minneapolis, USA
High Speed Supercomputer Communications in Broadband Networks
[email protected] et al.
High Speed Supercomputer Communications in Broadband Networks
Ralph Niederberger
Research Center Jülich GmbH
Helmut Grund, Ferdinand Hommes, Eva Pless
GMD - German National Research Center for Information Technology
Introduction
• Introduction
• GTB West
– Goals, Projects, Timeframes and Configuration
– Super Computer Impediments and Solutions
• Status of Cray Super Computer Communications
• Future Tests
• Summary
Introduction
• New kinds of microprocessors and larger internal storage lead to new kinds of supercomputing systems, each best suited to different kinds of problems.
• The two best-known types of supercomputers are massively parallel systems and vector systems.
• A new kind of supercomputer is the metacomputer.
• A metacomputer distributes an application onto two or more identical or different machines which are coupled dynamically via an external network.
• This distribution may be done by quality (functional distribution) or by quantity.
Introduction
Distributing an application onto more than one system is only recommended if computation time can be decreased significantly.
This depends on:
• degree of parallelization and
• time of communication between processes
Communication time depends on:
• communication medium and protocol
• length of communication link
• number of intermediate systems
• performance of communicating systems (CPU, internal communication, ...)
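The trade-off on this slide can be illustrated with a minimal sketch. It assumes a simple Amdahl-style cost model that is not part of the talk: the parallelizable fraction of the run time is divided evenly across the machines, and a fixed communication time is added on top. The function names and all numbers are hypothetical.

```python
def distributed_time(t_single, parallel_fraction, machines, comm_time):
    """Estimated run time when the work is spread over several machines.

    Assumed model (not from the talk): the serial part is unchanged, the
    parallel part is divided evenly, and comm_time is added on top.
    """
    serial = t_single * (1.0 - parallel_fraction)
    parallel = t_single * parallel_fraction / machines
    return serial + parallel + comm_time


def speedup(t_single, parallel_fraction, machines, comm_time):
    """Ratio of single-machine time to distributed time (> 1 means faster)."""
    return t_single / distributed_time(
        t_single, parallel_fraction, machines, comm_time
    )
```

With a hypothetical 100 s job, a parallel fraction of 0.9 and two machines, 5 s of communication gives a speedup of about 1.67, while 50 s of communication gives a value below 1, so distribution would not pay off in that case.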
GTB - West
Project sponsored by BMBF and DFN with financial participation of the project partners
Partners:
Research Center Jülich GmbH http://www.fz-juelich.de
GMD - German National Research Center for Information Technology http://www.gmd.de
Deutsches Klimarechenzentrum http://www.dkrz.de
Alfred Wegener Institute for Polar and Marine Research http://www.awi.de
Pallas GmbH http://www.pallas.de
o.tel.o http://www.o-tel-o.de
Runtime: 1 Aug 1997 - 31 Jan 2000
More info: http://www.fz-juelich.de/gigabit
GTB - West Projects
• Giganet - Configuration, Management and Performance Analysis of the Gigabit Testbed
• Methods and Tools, Software Support
• Solute Transport in Ground Water
• Algorithmic Analysis of Magnetoencephalography Data
• Complex Visualization over a Gigabit WAN
• Multimedia applications in a Gigabit WAN
• Distributed calculations of climate and weather models
• Porting Parallel and Distributed Applications from CEC CISPAR Project
GTB West - Goals
• Demonstrate the usefulness of high speed wide-area communication networks for scientific computing
• Engage in selected applications which are known to need very high communication bandwidth
• Major objective:
– coupling of architecturally different supercomputers, i.e. vector computers and massively parallel computers, to build a new kind of metacomputer
• Strengthen the know-how in
– high speed computer communications
– metacomputing in LAN and WAN environments
– coupling of the supercomputer centers in Germany
Status (Phase 1)
• Installation on top of o.tel.o high-tension cables
• Most problems at the last mile
– at GMD, underground workings were necessary
– at Research Center Jülich, fiber cables were installed together with the hot water supply
• o.tel.o offers SDH infrastructure and uses Lucent technology
• No major problems have been found using the o.tel.o trunk lines
Trunk lines (Phase 1)
[Figure: trunk lines, showing a repeater]
Status
• 622 Mbit/s link stable for one year
• CRAY T3E supercomputer connected with 155 Mbit/s ATM
• FZJ - GMD link upgrade (622 Mbit/s to 2.4 Gbit/s): end of July 1998
• 5 Aug. 1998:
– first ATM WAN connection with 2.4 Gbit/s user data (8 workstations with 155 / 622 Mbit/s interfaces)
– 96.4% (TCP)
– 99.97% (UDP) (high packet loss)
• Beta test of FORE ASX-4000 ended
• Beta test of HiPPI to ATM gateway (SUN and SGI) ongoing
• Throughput and delay measurements ongoing
• Monitoring and accounting of trunk line with HP OpenView at GMD
IBM SP2 internals
[Figure: IBM SP2 internals. SP2 nodes in Frame 1, Frame 2 and Frame 3 with three HP-Switches; internal Ethernet and external Ethernet (10 Mbps); a HIPPI switch (800 Mbps) and an ATM switch (155 Mbps, 622 Mbps) connecting external machines.]
Cray T3E internals
[Figure: Cray T3E internals. T3E processors (groups of 4 proc) in the T3E 3D torus; communication nodes (4 proc) attached via GigaRings to Ethernet (10 Mbps), an FDDI ring (100 Mbps), a HIPPI switch (800 Mbps) and an ATM switch (155 Mbps), each connecting external machines.]
Cray T3E internals (2)
[Figure: Cray T3E I/O structure. User PEs, support/OS PEs and device driver PEs; three I/O controllers, each on a GigaRing, leading to D-MPN, MPN and HPN nodes; network attachments shown: Ethernet, ATM, FDDI, ATM. MPN: Sbus system with 200 MB.]
Impediments
Current problem: communication throughput within and between supercomputers differs extremely.
Example: Cray T3E with an internal communication throughput of 500 MB/s bidirectional in three dimensions (3D torus)
High speed external connections:
• (Fast) Ethernet (10-100 Mb/s)
• FDDI (100 Mb/s)
• HiPPI (800 Mb/s - 1600 Mb/s), Super HiPPI (6400 Mb/s)
• ATM (155 Mb/s, 622 Mb/s - 2.4 Gb/s)
• Gigabit Ethernet (1 Gb/s)
Cray Systems Network Environment
[Figure: Cray systems network environment. CRAY/T3E 256, CRAY/T3E 512, CRAY/T90, CRAY/J90 file server and CRAY/J90 compute server; attachments shown: ATM (155 Mb/s), Essential HiPPI EPS1004, FDDI concentrator; two Cisco routers leading to JuNet and the worldwide Internet.]
Connecting a Cray system with n systems requires 2 * n PVC entries.
PVC configuration (not recommended)

Host         Alias     IP address
sp2l18-a2              192.168.110.18
sp2l19-a2              192.168.110.19
sp2l20-a2              192.168.110.20
sp2l21-a2              192.168.110.21
sp2l22-a2              192.168.110.22
sp2l23-a2              192.168.110.23
sp2l24-a2              192.168.110.24
sp2l25-a2              192.168.110.25
fzj256                 192.168.110.30
fzj512                 192.168.110.31
fzjt90                 192.168.110.32
fzjtst-1     zam449    192.168.110.33
fzjtst-2     zam448    192.168.110.34
fzjtst-3               192.168.110.35
fzjtst-4               192.168.110.36
fzjtst-5     zam169    192.168.110.37
fzjtst-6               192.168.110.38
fzjgate      zam002    192.168.110.39
gigaclipsrv  zam304e1  192.168.110.40

Host         PVCs
sp2l18-a2    118 110 134
sp2l19-a2    119 111 135
sp2l20-a2    120 112 136
sp2l21-a2    121 113 137
sp2l22-a2    122 114 138
sp2l23-a2    123 115 139
sp2l24-a2    124 116 140
sp2l25-a2    125 117 141
fzj256       118 119 120 121 122 123 124 125 101 102 103 104 105 106 107 108 109
fzj512       110 111 112 113 114 115 116 117 101 126 127 128 129 130 131 132 133
fzjt90       134 135 136 137 138 139 140 141 102 126 142 143 144 145 146 147 148
fzjtst-1     103 127 142
fzjtst-2     104 128 143
fzjtst-3     105 129 144
fzjtst-4     106 130 145
fzjtst-5     107 131 146
fzjtst-6     108 132 147
fzjgate      109 133 148
gigaclipsrv  (none listed)
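The growth of the PVC configuration can be sketched as follows. This is a minimal illustration, assuming that "2 * n PVC entries" means one configuration entry at each end of every PVC, and that the three Cray systems are linked to each other and to each of the 15 other hosts that list PVCs, while those hosts are not linked among themselves. Both function names are hypothetical.

```python
def pvc_entries_for_host(n_peers):
    """Configuration entries for one host with PVCs to n_peers systems.

    Assumption: one entry at each end of every PVC, hence 2 * n.
    """
    return 2 * n_peers


def pvc_count(crays, others):
    """Distinct PVCs when every Cray is linked to every other Cray and to
    each of the other hosts, and the other hosts are not interlinked."""
    return crays * others + crays * (crays - 1) // 2
```

For the table on this slide, `pvc_count(3, 15)` gives 48, which matches the PVC numbers 101 to 148, and a Cray with 17 peers needs `pvc_entries_for_host(17)`, i.e. 34 entries. Every additional host adds one PVC per Cray, which is why this static configuration is not recommended.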
High speed communication
Alternatives for communication between CRAY/T3E and IBM/SP2:
• raw HiPPI (800 Mb/s)
– HiPPI tunneling (622 Mb/s, currently MTU 9180)
– HiPPI SONET extender (currently 155 Mb/s or 932 Mb/s)
• TCP/IP via HiPPI (622 Mb/s, currently MTU 9180 because of routing)
• native ATM (155 Mb/s, 622 Mb/s) (Hardware?, Software?)
• TCP/IP via ATM (155 Mb/s, 622 Mb/s) (Hardware?)
Throughput considerations
• Transmission time in fiber optic cables:
tt = length of medium / (0.66 * c), with c = 300,000 km/s
plus additional delays in routers, switches etc.
ttopt = 100 km / (0.66 * 300,000 km/s) ≈ 1/2000 s = 0.5 ms
• Use path MTU discovery
• Adapt socket buffers to the bandwidth-delay product
– BDP = B * RTT = 622 Mb/s * 0.5 ms ≈ 311 kb ≈ 40 kB
(0.5 ms is the one-way delay over 100 km; with the full round trip of 1 ms both values double)
• Use setsockopt to set:
– SO_SNDBUF and SO_RCVBUF: 1 MB
– TCP_NODELAY=1 and TCP_WINSHIFT=4
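The arithmetic and the socket settings on this slide can be sketched as follows. This is a minimal illustration, not the configuration used in the testbed: the function names are hypothetical, the delay ignores routers and switches, and TCP_WINSHIFT is left out because Python's standard socket module does not expose it.

```python
import socket

C_KM_PER_S = 300_000.0  # speed of light, as used on the slide
FIBER_FACTOR = 0.66     # propagation speed in fiber relative to c


def propagation_delay_s(length_km, factor=FIBER_FACTOR):
    """One-way propagation delay in seconds, without equipment delays."""
    return length_km / (factor * C_KM_PER_S)


def bdp_bytes(bandwidth_bit_s, delay_s):
    """Bandwidth-delay product in bytes for the given delay."""
    return bandwidth_bit_s * delay_s / 8


def tuned_socket(buf_bytes=1 << 20):
    """TCP socket with 1 MB buffers requested and Nagle's algorithm off.

    The operating system may clamp the buffer sizes to its own limits.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, buf_bytes)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, buf_bytes)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return s
```

`bdp_bytes(622e6, 0.5e-3)` gives roughly 39 kB, the slide's "40 kB"; a 1 MB buffer leaves ample headroom for longer paths and for delays added by intermediate systems.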
Throughput considerations (2)
                          155 Mbit/s   622 Mbit/s   2.4 Gbit/s   %
Capacity                  155.520      622.080      2488.320     100
SDH overhead              5.760        23.040       92.160       3.7
ATM overhead              14.128       56.513       226.053      9.1
ATM user data             135.632      542.527      2170.107     87.2
ATM cells/s               353,208      1,412,830    5,651,321
CIP TCP 9140 overhead     1.118        4.474        17.896       0.7
CIP TCP 9140 user data    134.513      538.053      2152.211     86.5
CIP TCP 9140 packets/s    1,840        7,358        29,434
LANE TCP 1460 overhead    6.711        26.844       107.375      4.3
LANE TCP 1460 user data   128.921      515.683      2062.732     82.9
LANE TCP 1460 packets/s   11,038       44,151       176,604
Values in Mbit/s, except cells/s and packets/s
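The figures in this table can be reproduced with a short calculation. The sketch below assumes 53-byte ATM cells with 48 payload bytes and AAL5 framing with an 8-byte trailer padded to whole cells. The per-packet header sizes are assumptions, not taken from the slides: 40 bytes of TCP/IP headers plus 8 bytes of LLC/SNAP for Classical IP (CIP), and 40 bytes of TCP/IP plus a 14-byte Ethernet header and a 2-byte LANE header for LANE.

```python
import math

CELL_BYTES = 53          # ATM cell size
CELL_PAYLOAD_BYTES = 48  # payload bytes per ATM cell
AAL5_TRAILER_BYTES = 8   # AAL5 trailer; the frame is padded to whole cells

CIP_HEADERS = 40 + 8        # TCP/IP + LLC/SNAP (assumed encapsulation)
LANE_HEADERS = 40 + 14 + 2  # TCP/IP + Ethernet + LANE header (assumed)


def cells_per_second(line_mbit, sdh_overhead_mbit):
    """ATM cells per second that fit into the SDH payload."""
    return (line_mbit - sdh_overhead_mbit) * 1e6 / (CELL_BYTES * 8)


def tcp_goodput(line_mbit, sdh_overhead_mbit, tcp_payload, header_bytes):
    """Return (packets per second, TCP user data in Mbit/s)."""
    frame = tcp_payload + header_bytes + AAL5_TRAILER_BYTES
    cells_per_packet = math.ceil(frame / CELL_PAYLOAD_BYTES)
    pps = cells_per_second(line_mbit, sdh_overhead_mbit) / cells_per_packet
    return pps, pps * tcp_payload * 8 / 1e6
```

Under these assumptions a 9140-byte TCP segment occupies 192 cells and a 1460-byte segment 32 cells, which yields the packet rates and user data rates listed for the 155 Mbit/s column.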
Supercomputer - Impediments
CRAY T3E communication throughput measured:
• Maximum of 115 Mb/s via TCP/IP over ATM (MTU 9180, the default MTU from the standard)
• Maximum of 430 Mb/s via TCP/IP over HiPPI (MTU 64 KB, the limit imposed by the IP header fields)
• Maximum of 530 Mb/s via raw HiPPI (no real MTU limitation)
For comparison, Netperf between SUN Ultra 60 and SGI Origin 200: maximum of 535 Mb/s user data via 622 Mb/s ATM
Net topology GMD - FZJ
[Figure: network topology between FZJ and GMD. Systems shown: CRAY T3E (256 proc), CRAY T3E (512 proc), CRAY T90 (16 proc), IBM SP2, SUN Enterprise 4000, SUN Ultra 60 (2 proc), SUN Ultra 2 (2 proc), SGI O200 and further SUN workstations with atm-fore and atm-sun interfaces. Network equipment shown: two HIPPI switches, Fore ASX-4000 and Fore ASX-1000 ATM switches, Cisco LS1010, Cisco A 100. Legend of link speeds: 155 Mbit/s, 622 Mbit/s, 800 Mbit/s, 2.4 Gbit/s.]
Gigabit Testbed West Network Layout
[Figure: FZJ and GMD connected over 110 km by the Gigabit Testbed West, a 2.4 Gb/s ATM link between two ASX4000 switches. Attached systems: CRAY/T3E and IBM/SP2 via HiPPI (800 Mb/s, MTU 64 K); SUN (HiPPI/Sbus) and SGI/SUN (HiPPI/PCI) with ATM 622 Mb/s (64K MTU); Cisco routers; ATM 155 / 622 Mb/s with 9K MTU.]
Gigabit Tests, 30 July 1998
[Figure: test setup at GMD. filou (SUN Enterprise 5000) and baloo (SUN Ultra 60), attached at 622 Mbps to an ATM switch with a 2.4 Gbit interface; 2.4 Gbps ATM/SDH link.]
Gigabit Speed Record, 5 August 1998
[Figure: record setup between FZJ and GMD. Two ATM switches joined by the 2.4 Gbps ATM/SDH link; workstations attached with 622 Mbps and 155 Mbps interfaces, including a group labeled 3 * 622 Mbps.]
Gigabit Testbed West: Connecting CRAY T3E and IBM SP2 via separate network
Problem: Interrupt rate of CRAY/T3E systems
Solution:
Create two logical networks on one physical network
• network 1 with 64k MTU between gateway systems (exact MTU 65280), as specified for CRAY systems on HiPPI networks
• network 2 with MTU 9180 between directly connected ATM systems
Advantage:
Path MTU discovery on the end systems will find the maximum value to use.
MTU values shown in the figure: 9180, 4356, 1500, 9180, 65280
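Why the larger MTU helps can be sketched in a few lines. The path MTU is the smallest MTU of any hop on the path, and a larger MTU means fewer packets for the same throughput. The sketch assumes roughly one interrupt per packet, which the slides do not state, and both function names are hypothetical.

```python
def path_mtu(hop_mtus):
    """Largest packet size that passes every hop: the minimum hop MTU."""
    return min(hop_mtus)


def packets_per_second(throughput_bit_s, mtu_bytes):
    """Packets per second needed to sustain a throughput with full-size packets."""
    return throughput_bit_s / (mtu_bytes * 8)
```

At 430 Mb/s, `packets_per_second(430e6, 9180)` is about 5855, whereas `packets_per_second(430e6, 65280)` is about 823, roughly a sevenfold reduction in packet rate. A path that includes any hop with MTU 1500, for example `path_mtu([65280, 9180, 1500])`, falls back to 1500, which is why the gateway systems get their own 64k network.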
Status
CRAY HiPPI Testbed configuration
[Figure: CRAY/T3E 512, CRAY/T3E 256, CRAY/T90 and two CRAY/J90 systems, each with an HPN1, attached through parallel and serial HiPPI cards to a HiPPI switch (ports 0 to 15, Ethernet module); SGI O200 (192.168.115.9) and SUN Ultra 60 (192.168.115.5) connected to Fore ASX4000 switches; remote systems gmdsp2 (192.168.115.26) and gmdsun. Further IP addresses shown: 134.94.72.1 to 134.94.72.5, 192.168.115.6, 192.168.115.10, 192.168.115.25, 192.168.110.3, 192.168.110.36, 192.168.110.49, 192.168.116.3, 192.168.116.36, 192.168.116.49.]
Communication nominal and real throughput
[Figure: path from the Cray systems at FZJ (CRAY T90, CRAY T3E/256, CRAY T3E/512) via HIPPI switch, H/A-router and ATM switch, over the ATM/SDH link, to ATM switch, H/A-router, HIPPI switch and IBM SP2 at GMD. Throughput per segment, in the order shown:]
Segment    1         2         3         4         5         6         7
Nominal:   800 Mbps  800 Mbps  622 Mbps  2.4 Gbps  622 Mbps  800 Mbps  800 Mbps
Real:      430 Mbps  430 Mbps  530 Mbps  530 Mbps  530 Mbps  370 Mbps  370 Mbps
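The gap between nominal and real throughput can be quantified with a small sketch. The lists repeat the seven values from this slide in the order shown, with 2.4 Gbps written as 2400 Mbps; the function names are hypothetical.

```python
NOMINAL_MBPS = [800, 800, 622, 2400, 622, 800, 800]
REAL_MBPS = [430, 430, 530, 530, 530, 370, 370]


def utilization(real, nominal):
    """Fraction of the nominal rate actually achieved on each segment."""
    return [r / n for r, n in zip(real, nominal)]


def bottleneck(real):
    """Index and rate of the first slowest segment, which bounds the path."""
    idx = min(range(len(real)), key=real.__getitem__)
    return idx, real[idx]
```

`bottleneck(REAL_MBPS)` points at the 370 Mbps segments on the HiPPI side, not at the 2.4 Gbps trunk, which is used at only about 22% of its nominal rate.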
Gigabit Testbed West TCP-Gateway-Layout (Beta-Tests in Jülich)
[Figure: CRAY/T3E (256) and CRAY/T3E (512) attached via parallel HiPPI and serial HiPPI (800 Mb/s, MTU 64 K) to the HiPPI switch (ports 0 to 15, Ethernet module); SUN (HiPPI/PCI) and SGI (HiPPI/PCI) gateways with ATM 622 Mb/s (MTU 9180 or 64 K). Throughput labels shown on the links: 250, 270 (gate), 315, 320, 340 (gate), 350, 350 (direct), 370, 380, 415, 430, 430 (direct), 440, 535.]
Future Tests
CRAY HiPPI Testbed configuration
• Solve the HiPPI problem: using large MTU sizes (65280 bytes) does not work correctly
• Test the other Cray systems with the HiPPI to ATM gateway (T90, J90)
• Test different configurations when the testbed is available
– using 2 HPN1
– using 2 communication nodes within CRAY/T3E
– using one gateway for more than one machine
– using the same HiPPI device for local and remote communication
– using multiple HiPPI devices for higher throughput
Possible future test scenario
[Figure: two machines with nodes M1 ... Mm and N1 ... Nn, each connected via multiple HiPPI links to a gateway, via ATM 622 Mb/s and 4 * ATM 622 Mb/s to an ASX4000, and via ATM 2.4 Gb/s between the two ASX4000 switches. Protocols along the path: IP over HiPPI, IP over ATM, IP over HiPPI.]
Internal communication:
M1 ... Mm, N1 ... Nn
External communication:
Mm-k+1, Mm-k+2, ..., Mm (multiplex of k HiPPI interfaces)
Nn-j, Nn-j+1, ..., Nn (multiplex of j HiPPI interfaces)
Summary
• No problems left with the 2.4 Gbit/s ATM/SDH trunk line
• Workstation systems can generate and transfer data streams saturating a 622 Mbit/s ATM link
• Coupling of supercomputer systems over WANs with high bandwidth is currently only possible with a HiPPI to ATM gateway solution and special configuration
But the time is ripe for gigabit transmissions.
Summary (2)
• Applications are capable of using gigabit networks.
• Metacomputing may become reality in LAN as well as in WAN environments.
• Therefore supercomputer system designers have to equip their systems with gigabit communication interfaces.
"The net is the computer and the computer is the net"
((SuperComputer) Communications) != (Super (Computer Communications))