ORNL is managed by UT-Battelle, LLC for the US Department of Energy
Summer 2019 - Assessing the Facilities and Systems Supporting HPC at ORNL
Jim Rogers
Director, Computing and Facilities
National Center for Computational Sciences
Oak Ridge National Laboratory
ORNL Leadership Class Systems 2004 - 2018
2004 Cray X1E Phoenix: 18.5 TF
2005 Cray XT3 Jaguar: 25 TF
2006 Cray XT3 Jaguar: 54 TF
2007 Cray XT4 Jaguar: 62 TF
2008 Cray XT4 Jaguar: 263 TF
2008 Cray XT5 Jaguar: 1 PF
2009 Cray XT5 Jaguar: 2.5 PF
2012 Cray XK7 Titan: 27 PF
From 2004 – 2018, HPC systems relied on chiller-based cooling (5.5°C supply) with annualized PUEs down to ~1.4
ORNL’s Transition to Warmer Facility Supply Temperatures
2012 Cray XK7 Titan: 27 PF
2018/2019 IBM Summit: 200 PF
2021/2022 Frontier: ~1.5 EF
Titan: Refrigerant-based per-rack cooling with direct rejection of heat to cold 5.5°C water
• Below dewpoint
• 100% use of chillers
Summit: A combination of direct on-package cooling and RDHX with 21°C supply is >95% room-neutral.
• Above dewpoint
• Contribution by chillers ~20% of the hours of the year
Frontier: Custom mechanical packaging is >95% room-neutral with a 30°C supply.
• ~100% Evaporative Cooling, with supplemental HVAC for parasitic loads
Oak Ridge National Laboratory’s Cray XK7 Titan
GTC'15 Session S5566: GPU Errors on HPC Systems
Operational from November 2012 through August 1, 2019
18,688 compute nodes – 1 AMD Opteron + 1 NVIDIA Kepler/node
27PF (peak); 17.59PF (HPL); 9.5MW (peak)
Delivered > 27B compute hours over its life
Titan entered as the #1 supercomputer in the world in November 2012, and was still #12 on the Top500 list at the time of its decommissioning
Perspective on Titan as an Air-cooled Supercomputer
ORNL’s Cray XK7 “Titan”
• 200 cabinets, each with a 3,000 CFM fan
• 600,000 CFM (air volume)
• Dry air has a specific volume of 13.076 cubic feet per pound
• Titan moves 600,000/13.076/60 -> 765 lbs/sec
Airbus A340
• At takeoff, each of (4) engines generates 140kN of force and consumes 1000 lbs/sec of air
• At cruise, the A340’s engines each produce ~29kN and consume ~200 lbs/sec of air.
Wait. What?
Titan moves as much air as a long-range Airbus A340 at cruising altitude.
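The airflow arithmetic on this slide can be reproduced with a short sketch. The figures come from the slide; the variable and function names are illustrative only, and the A340 total assumes all four engines draw the quoted ~200 lbs/sec at cruise.

```python
# Figures taken from the slide; names are illustrative.
CABINETS = 200
CFM_PER_CABINET = 3_000               # cubic feet per minute, per cabinet fan
SPECIFIC_VOLUME_FT3_PER_LB = 13.076   # dry air, as quoted on the slide


def mass_flow_lbs_per_sec(total_cfm, specific_volume=SPECIFIC_VOLUME_FT3_PER_LB):
    """Convert volumetric air flow (ft^3/min) to mass flow (lb/s)."""
    return total_cfm / specific_volume / 60.0


total_cfm = CABINETS * CFM_PER_CABINET
titan_lbs_per_sec = mass_flow_lbs_per_sec(total_cfm)

# A340 at cruise: four engines at roughly 200 lbs/sec each (slide figures).
a340_cruise_lbs_per_sec = 4 * 200

print(f"Titan total airflow: {total_cfm:,} CFM")
print(f"Titan mass flow:     {titan_lbs_per_sec:.0f} lbs/sec")
print(f"A340 at cruise:      {a340_cruise_lbs_per_sec} lbs/sec")
```

Under these assumptions the two mass flows land within about 5% of each other, which is the basis for the comparison.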
Motivation for “Warmer” Cooling Solutions Serving HPC Centers
• Reduced Cost, Both CAPEX and OPEX
– Reduce or eliminate the need for traditional chillers.
  • No chillers, no ozone-depleting refrigerants (GREEN)
– Oak Ridge calculates an annualized PUE for air-cooled devices of no better than 1.4 (ASHRAE Zone 4A – Mixed Humid; Oak Ridge, TN)
– HPC power budgets continue to grow. Summit has a design point for >12MW (HPC-only). Minimizing PUE/ITUE is critical to the budget.
• Easier, more reliable design
– Design is reduced to pumps, evaporative cooling, heat exchangers.
– Traditional chilled water may not be necessary at all (NREL, NCAR, et al)
OLCF Facilities Supporting Summit
• Titan – 9MW @ heavy load
• Sitting on 250 pounds/ft² raised floor
• Uses 42°F water and special CDUs (XDPs)
• Summit – 256 compute cabinets on-slab
• 100% room-neutral design uses RDHX
• 20MW warm-water cooling plant using centralized CDU/secondary loop
21°C/20MW/7700 ton Facility System Design
• Summit
– Demand: 3-10MW
• Secondary Loop
– Supply 3300GPM (12,500 liters/min) @ 21°C; Return @ 29-33°C
– CPUs and GPUs use cold plates
– DIMMs and parasitic loads use RDHX
– Storage and Network use RDHX
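As a rough consistency check on the secondary-loop figures, the heat carried away by the water can be estimated from flow rate and temperature rise (Q = ṁ · cp · ΔT). This is a sketch, not the facility's design calculation: the water density and specific heat below are textbook approximations, and the function name is hypothetical.

```python
# Assumptions (not from the slide): approximate properties of water.
WATER_DENSITY_KG_PER_L = 1.0    # close enough between 21 and 33 °C
WATER_CP_KJ_PER_KG_K = 4.18     # specific heat of liquid water


def heat_removed_mw(flow_l_per_min, supply_c, return_c):
    """Estimate heat carried by the loop in MW: Q = m_dot * cp * delta_T."""
    mass_flow_kg_s = flow_l_per_min * WATER_DENSITY_KG_PER_L / 60.0
    delta_t_k = return_c - supply_c
    return mass_flow_kg_s * WATER_CP_KJ_PER_KG_K * delta_t_k / 1000.0


# Slide figures: 12,500 liters/min supply at 21 °C, return at 29-33 °C.
low_mw = heat_removed_mw(12_500, 21.0, 29.0)
high_mw = heat_removed_mw(12_500, 21.0, 33.0)
print(f"Heat removed at full flow: {low_mw:.1f} to {high_mw:.1f} MW")
```

At full flow the estimate spans roughly 7 to 10.5 MW, which is in line with the upper end of the 3-10MW demand range quoted for Summit.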
Cooling Design
Primary Loop uses Evaporative Cooling Towers (~80% of the hours of the year)
When the MTW RETURN is above the 21°C setpoint, use a second set of Trim HX (with 5.5°C on the other side) to drive MTW to the 21°C setpoint.
The trim loop is needed about 20% of the hours in the year, and can ramp 0-100% to meet the setpoint back to Summit
[Diagram labels: Existing chilled water cooling loop; New primary cooling loop]
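To give a feel for the size of the trim duty, the sketch below estimates how much heat the trim heat exchangers would need to remove when the evaporative towers alone leave the water above the 21°C setpoint. It is illustrative only: it assumes the 12,500 liters/min loop flow quoted elsewhere in this deck, textbook water properties, and a hypothetical post-tower temperature. It does not represent the actual plant control logic.

```python
HOURS_PER_YEAR = 8760
WATER_CP_KJ_PER_KG_K = 4.18   # assumption: approximate specific heat of water


def trim_load_kw(temp_after_towers_c, setpoint_c=21.0, flow_l_per_min=12_500.0):
    """Heat (kW) the trim HX must remove to bring the water down to the setpoint.

    Returns 0 when the towers alone already meet the setpoint.
    """
    excess_k = max(0.0, temp_after_towers_c - setpoint_c)
    mass_flow_kg_s = flow_l_per_min / 60.0   # assumes ~1 kg per liter
    return mass_flow_kg_s * WATER_CP_KJ_PER_KG_K * excess_k


# Hypothetical operating points, for illustration only.
print(f"Towers deliver 20 °C: {trim_load_kw(20.0):.0f} kW of trim")
print(f"Towers deliver 23 °C: {trim_load_kw(23.0):.0f} kW of trim")

# "About 20% of the hours in the year" expressed in hours.
trim_hours = 0.20 * HOURS_PER_YEAR
print(f"Trim loop active roughly {trim_hours:.0f} hours/year")
```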
Benefits of Warm Water + Operating Dashboards
• Warm Water allows annualized PUE of 1.1
– ~$1M cost per MW-year for consumption on Summit
– ~$100k cost per MW-year for waste heat management
• Integration with PLC allows us to tune water flow
– Better delta(t); less pumping energy
• Integration with IBM’s OpenBMC allows us to protect these 40k components from inadequate flow across the cold plates
• Integration with the scheduler allows us to correlate power and temperature data with individual applications.
• Additional data streams to be added, most from the Facility PLC
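The two cost figures in the first bullet are related through PUE: the facility overhead is (PUE - 1) times the IT energy. A minimal sketch, assuming overhead energy is priced the same as IT energy (the function name is hypothetical, and the 1.4 comparison point is the air-cooled PUE quoted elsewhere in this deck):

```python
IT_COST_PER_MW_YEAR = 1_000_000   # USD, slide figure (~$1M per MW-year)


def overhead_cost_per_mw_year(pue, it_cost=IT_COST_PER_MW_YEAR):
    """Facility overhead cost per MW-year of IT load, given a PUE."""
    return (pue - 1.0) * it_cost


warm_water = overhead_cost_per_mw_year(1.1)   # warm-water plant
air_cooled = overhead_cost_per_mw_year(1.4)   # chiller-based, air-cooled

print(f"PUE 1.1 overhead: ${warm_water:,.0f} per MW-year")
print(f"PUE 1.4 overhead: ${air_cooled:,.0f} per MW-year")
print(f"Difference:       ${air_cooled - warm_water:,.0f} per MW-year")
```

Under that assumption, a PUE of 1.1 reproduces the ~$100k per MW-year figure, and the gap to a PUE of 1.4 is about $300k per MW-year of IT load.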
Comparing Energy Performance - Titan to Summit
• Demonstrated Performance on Titan (Oct 2012)
• 17.59 PF; 8.9MW peak; 8.3MW average
• Demonstrated Performance on Summit (Oct 2018)
• 143.5 PF; 11,065kW peak; 9,783kW average
8.16x performance increase; 17.9% avg. power increase
Titan: ~2.1 GFLOPs/Watt
Summit: 14.66 GFLOPs/Watt
~7x increase in energy efficiency
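The ratios on this slide follow directly from the HPL results and average power. The sketch below recomputes them; it assumes the GFLOPs/Watt figures are based on average (not peak) power, which is what reproduces the slide's values. Function and variable names are illustrative.

```python
def gflops_per_watt(hpl_pflops, avg_power_kw):
    """Energy efficiency: HPL result (PF) over average power (kW)."""
    return (hpl_pflops * 1e6) / (avg_power_kw * 1e3)


# Slide figures.
titan_pf, titan_avg_kw = 17.59, 8_300
summit_pf, summit_avg_kw = 143.5, 9_783

titan_eff = gflops_per_watt(titan_pf, titan_avg_kw)
summit_eff = gflops_per_watt(summit_pf, summit_avg_kw)

perf_ratio = summit_pf / titan_pf
power_increase_pct = (summit_avg_kw / titan_avg_kw - 1.0) * 100.0
efficiency_ratio = summit_eff / titan_eff

print(f"Titan:  {titan_eff:.2f} GFLOPs/Watt")
print(f"Summit: {summit_eff:.2f} GFLOPs/Watt")
print(f"Performance increase: {perf_ratio:.2f}x")
print(f"Average power increase: {power_increase_pct:.1f}%")
print(f"Efficiency increase: {efficiency_ratio:.2f}x")
```

With these inputs the efficiency ratio comes out at about 6.9x, so the headline improvement is best read as roughly sevenfold.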
[Figure: Summit MTW Cooling Loads and Temperatures, HPL Run 5/24/19, Duration: 5:48. Axes: kW and Temperature (°F) vs. Time. Series: MTW kW (cooling), CHW kW (cooling), MTW Return Temp, MTW Supply Temp, K100 Avg Space Temp, Outdoor Air Wet Bulb Temp, IT kW]
PUE during HPL Run = 1.081
IT Load follows a traditional HPL profile
Total power load on the energy plant reflects storage and other items
A small portion of the total load required the use of CHW (trim RDHX)
OAWBT remained at/above the supply target, affecting ECT
Supply temperature to Summit stayed rock-solid at 70°F.
[Figure: OLCF-4 Summit HPL Power Measurements, 10-25-18, 03:31:48 to 09:32:00. Axes: kW and accumulated kW-hours vs. measurement (1/second)]
Average power: 9,783 kW
Max power: 11,065 kW
Total IT Load: 58,730 kW-hours
Total Mech Load: 1,449 kW-hours
PUE 1.0246
21,612 seconds; 151,284 measurements
Idle: 2.97MW
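The PUE for this run follows from the two energy totals, and the average power follows from the IT energy and the run duration. A minimal sketch using only the slide's figures (variable names are illustrative; PUE is taken here as total energy over IT energy, with the mechanical load as the only overhead counted):

```python
from datetime import datetime

# Slide figures for the 10-25-18 HPL run.
it_kwh = 58_730      # Total IT Load
mech_kwh = 1_449     # Total Mech Load

start = datetime(2018, 10, 25, 3, 31, 48)
end = datetime(2018, 10, 25, 9, 32, 0)
duration_s = (end - start).total_seconds()

# PUE as total energy divided by IT energy.
pue = (it_kwh + mech_kwh) / it_kwh

# Average IT power implied by the energy total and the duration.
avg_kw = it_kwh / (duration_s / 3600.0)

print(f"Duration: {duration_s:.0f} seconds")
print(f"PUE: {pue:.4f}")
print(f"Average IT power: {avg_kw:.0f} kW")
```

The duration matches the 21,612 seconds on the slide, the average power matches the quoted 9,783 kW, and the PUE agrees with the slide's 1.0246 to within rounding.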
[Figure: OLCF-4 Summit HPL Power Measurements, 10-25-18, 03:31:48 to 09:32:00. Axes: kW vs. measurement (1/second). Series labels: June 2018, 122.3PF; Oct 2018, 143.5PF]
Cooling System Performance - PUE
[Figure labels: 20°C, 21°C, 22°C, 23°C, 24°C, 25°C, 26°C, 27°C]
[Figure labels: 26°C, 19°C]
Caution – Secondary (Closed) Loop Concerns Worsen as Temperatures Rise…
IBM’s Water Quality Requirements
• All metals less than or equal to 0.10 ppm
• Calcium less than or equal to 1.0 ppm
• Magnesium less than or equal to 1.0 ppm
• Manganese less than or equal to 0.10 ppm
• Phosphorus less than or equal to 0.50 ppm
• Silica less than or equal to 1.0 ppm
• Sodium less than or equal to 0.10 ppm
• Bromide less than or equal to 0.10 ppm
• Nitrite less than or equal to 0.50 ppm
• Chloride less than or equal to 0.50 ppm
• Nitrate less than or equal to 0.50 ppm
• Sulfate less than or equal to 0.50 ppm
• Conductivity less than or equal to 10.0 μS/cm
• pH 6.5 – 8.0
• Turbidity (NTU) less than or equal to 1
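One way to put these limits to work is to encode them as data and check lab results against them. The sketch below is illustrative: the limits are transcribed from the slide, while the key names, the `out_of_spec` function, and the sample values are hypothetical and not part of any IBM tooling.

```python
# Limits transcribed from the slide (ppm). Key names are illustrative;
# "all_metals" stands in for the slide's aggregate "All metals" entry.
LIMITS_PPM = {
    "all_metals": 0.10,
    "calcium": 1.0,
    "magnesium": 1.0,
    "manganese": 0.10,
    "phosphorus": 0.50,
    "silica": 1.0,
    "sodium": 0.10,
    "bromide": 0.10,
    "nitrite": 0.50,
    "chloride": 0.50,
    "nitrate": 0.50,
    "sulfate": 0.50,
}
MAX_CONDUCTIVITY_US_PER_CM = 10.0
PH_MIN, PH_MAX = 6.5, 8.0
MAX_TURBIDITY_NTU = 1.0


def out_of_spec(sample):
    """Return the names of measurements in `sample` that violate a limit.

    Measurements missing from `sample` are skipped rather than flagged.
    """
    problems = [name for name, limit in LIMITS_PPM.items()
                if name in sample and sample[name] > limit]
    if sample.get("conductivity_us_per_cm", 0.0) > MAX_CONDUCTIVITY_US_PER_CM:
        problems.append("conductivity_us_per_cm")
    if "ph" in sample and not (PH_MIN <= sample["ph"] <= PH_MAX):
        problems.append("ph")
    if sample.get("turbidity_ntu", 0.0) > MAX_TURBIDITY_NTU:
        problems.append("turbidity_ntu")
    return problems


# Hypothetical lab results, for illustration only.
sample = {"calcium": 0.4, "sodium": 0.25, "ph": 8.4,
          "conductivity_us_per_cm": 6.0}
print(out_of_spec(sample))
```

Keeping the limits in a single table makes it straightforward to swap in another supplier's requirements if a common specification emerges.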
Allowable Wetted Materials
• Lead-free copper alloys with less than 15% zinc.
• Stainless steels
• EPDM
• Polypropylene
Water Chemistry - Biocides
• The choice of biocide depends on whether you are chasing anaerobic bacteria, aerobic bacteria, fungi, and/or algae
EEHPCWG is looking for common ground among the HPC suppliers for water quality
Fungi might not be detected in the water, even though they can grow and cause blockage of cooling channels in cold plates
Accelerated Corrosion
System Blockages
Reduced System Efficiency