A 1.7 Petaflops Warm-Water-Cooled System: Operational Experiences and Scientific Results
Łukasz Flis , Karol Krawentek, Marek Magryś
ACC Cyfronet AGH-UST
• established in 1973
• part of AGH University of Science and Technology in Krakow, Poland
• provides free computing resources for scientific institutions
• centre of competence in HPC and Grid Computing
• IT service management expertise (ITIL, ISO 20k)
• member of PIONIER consortium
• operator of Krakow MAN
• home for supercomputers
International projects
PL-Grid infrastructure
• Polish national IT infrastructure supporting e-Science
– based upon the resources of the most powerful academic resource centres
– compatible and interoperable with the European Grid
– offering grid and cloud computing paradigms
– coordinated by Cyfronet
• Benefits for users
– unified infrastructure from 5 separate compute centres
– unified access to software, compute and storage resources
– non-trivial quality of service
• Challenges
– unified monitoring, accounting, security
– creating an environment of cooperation rather than competition
• Federation – the key to success
Competence Centre in the Field of Distributed Computing Grid Infrastructures
• Duration: 01.01.2014 – 30.11.2015
• Project Coordinator: Academic Computer Centre CYFRONET AGH
The main objective of the project is to support the development of ACC Cyfronet AGH as a specialized competence centre in the field of distributed computing infrastructures, with particular emphasis on grid technologies, cloud computing and infrastructures supporting computations on big data.
PLGrid Core project
ZEUS
374 TFLOPS, #211 on the TOP500 list, #1 in Poland
Zeus usage
[Pie chart: Zeus usage by scientific field – chemistry, physics, medicine, technical sciences, astronomy, biology, computer science, electronics and telecommunications, metallurgy, mathematics, other; three largest shares: 44.84%, 41.45%, 7.87%]
Why upgrade?
• Job size growth
• Users hate waiting for resources
• New projects, new requirements
• Follow the advances in HPC
• Power costs
New building
Requirements for the new system
• Petascale system
• Low TCO
• Energy efficiency
• Density
• Expandability
• Good MTBF
• Hardware:
– core count
– memory size
– network topology
– storage
Requirements: Liquid Cooling
• Water: up to 1000x more efficient heat exchange than air
• Less energy needed to move the coolant
• Hardware (CPUs, DIMMs) can handle ~80°C
• Challenge: cool 100% of HW with liquid
– network switches
– PSUs
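As a rough cross-check of the coolant argument, a minimal back-of-envelope sketch of the water flow required, assuming a hypothetical 10 K inlet/outlet temperature rise and using the ~680 kW peak power quoted later in the deck:

```python
# Back-of-envelope coolant flow: how much warm water must circulate to
# carry away ~680 kW at an assumed 10 K inlet/outlet temperature rise.
c_p = 4186.0     # specific heat of water, J/(kg*K)
power = 680e3    # W, peak power of the full system (quoted later)
delta_t = 10.0   # K, assumed temperature rise (hypothetical figure)

flow_kg_s = power / (c_p * delta_t)   # kg/s; for water roughly = L/s
print(f"{flow_kg_s:.1f} L/s")         # about 16.2 L/s for the whole machine
```

Modest flow rates like this, versus the enormous air volumes needed for the same heat load, are the core of the "less energy to move the coolant" point.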
Requirements: MTBF
• The less movement the better
– fewer pumps
– fewer fans
– fewer HDDs
• Example
– pump MTBF: 50 000 hrs
– fan MTBF: 50 000 hrs
– 1800-node system MTBF: 7 hrs
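The example's arithmetic follows from treating every moving part as an independent, exponentially distributed failure source, so failure rates add. A minimal sketch; the 4-moving-parts-per-node count is an assumption chosen to reproduce the quoted 7 hrs:

```python
def system_mtbf(component_mtbfs):
    # Independent exponential failures: rates add, so the system MTBF
    # is the reciprocal of the summed per-component failure rates.
    return 1.0 / sum(1.0 / m for m in component_mtbfs)

nodes = 1800
parts_per_node = 4  # assumption: e.g. 1 pump + 3 fans per node
mtbf = system_mtbf([50_000.0] * (nodes * parts_per_node))
print(f"{mtbf:.1f} hrs")  # ~6.9 hrs: some moving part fails every few hours
```

With thousands of 50 000-hour parts, a failure somewhere in the machine every few hours is unavoidable, which is why removing pumps, fans and disks from the nodes matters.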
Requirements: Compute
• Max job size ~10k cores
• Fastest CPUs, but compatible with old codes
– Two-socket nodes
– No accelerators at this point
• Newest memory
– At least 4 GB/core
• Fast interconnect
– InfiniBand FDR
– No need for a full CBB fat tree
Requirements: Topology
[Topology diagram: core IB switches connect one service island (service nodes and storage nodes) and four compute islands of 576 nodes each]
Why Apollo 8000?
• Most energy efficient
• The only solution with 100% warm-water cooling
• Highest density
• Lowest TCO
Even more Apollo
• Focuses also on the ‘1’ in PUE!
– Power distribution
– Fewer fans
– Detailed monitoring
• ‘Energy to solution’
• Dry node maintenance
• Fewer cables
• Prefabricated piping
• Simplified management
Prometheus
• HP Apollo 8000
• 13 m2, 15 racks (3 CDU, 12 compute)
• 1.65 PFLOPS
• PUE <1.05, 680 kW peak power
• 1728 nodes, Intel Haswell E5-2680v3
• 41472 cores, 13824 per island
• 216 TB DDR4 RAM
• System prepared for expansion
• CentOS 7
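The quoted Rpeak can be sanity-checked from the CPU specs. A minimal sketch, assuming the E5-2680 v3's 2.5 GHz base clock and 16 double-precision FLOPs per cycle under AVX2 FMA:

```python
nodes = 1728
cores = nodes * 24          # two 12-core E5-2680 v3 sockets per node
ghz = 2.5                   # base clock, assumed for the Rpeak figure
flops_per_cycle = 16        # AVX2: 2 FMA units x 4 DP lanes x 2 ops/FMA

rpeak_pflops = cores * ghz * 1e9 * flops_per_cycle / 1e15
print(f"{rpeak_pflops:.2f} PFLOPS")  # ~1.66, matching the quoted 1.65
```

The core count also checks out: 1728 nodes x 24 cores = 41 472, and 41 472 / 3 islands = 13 824 per island.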
Prometheus storage
• Diskless compute nodes
• Separate tender for storage
– Lustre-based
– 2 file systems:
• Scratch: 120 GB/s, 5 PB usable space
• Archive: 60 GB/s, 5 PB usable space
– HSM-ready
• NFS for home directories and software
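For a sense of scale, the peak bandwidths above bound how fast the file systems could ever be filled; a quick idealized check, ignoring metadata overhead and contention:

```python
def fill_hours(capacity_pb, bandwidth_gb_s):
    # Minimum time to stream the whole file system once at peak bandwidth.
    return capacity_pb * 1e6 / bandwidth_gb_s / 3600  # PB -> GB, s -> h

print(f"scratch: {fill_hours(5, 120):.1f} h")  # ~11.6 h
print(f"archive: {fill_hours(5, 60):.1f} h")   # ~23.1 h
```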
Deployment timeline
• Day 0 - Contract signed (20.10.2014)
• Day 23 - Installation of the primary loop starts
• Day 35 - First delivery (service island)
• Day 56 - Apollo piping arrives
• Day 98 - 1st and 2nd islands delivered
• Day 101 - 3rd island delivered
• Day 111 - Basic acceptance ends
• Official launch event on 27.04.2015
Facility preparation
• Primary loop installation took 5 weeks
• Secondary (prefabricated) loop took just 1 week
• Upgrade of the raised floor done “just in case”
• Additional pipes for leakage/condensation drain
• Water dam with emergency drain
• A lot of space needed for the hardware deliveries (over 100 pallets)
Secondary loop
Challenges
• Power infrastructure being built in parallel
• Boot over InfiniBand
– UEFI, high-frequency port flapping
– OpenSM overloaded with port events
• BIOS settings being lost occasionally
• Node location in APM is tricky
• 5 dead IB cables (2‰)
• 8 broken nodes (4‰)
• 24h work during weekends
Solutions
• Boot to RAM over IB, image distribution over HTTP
– Whole machine boots up in 10 min with just 1 boot server
• Hostname/IP generator based on a MAC collector
– Data automatically collected from APM and iLO
• Graphical monitoring of power, temperature and network traffic
– SNMP data source
– GUI allows easy problem location
– Now synced with SLURM
• Spectacular iLO LED blinking system developed for the official launch
• 24h work during weekend
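The hostname/IP generator might look something like the sketch below; the naming scheme, address range, and location encoding are all hypothetical, assuming MACs arrive tagged with rack/slot locations collected from APM and iLO:

```python
import ipaddress

def assign(mac_to_location, base_net="10.1.0.0/16"):
    """Hypothetical mapping of collected MACs to hostnames and IPs.

    mac_to_location: {mac: (rack, slot)} as gathered from APM/iLO.
    """
    net = ipaddress.ip_network(base_net)
    table = {}
    for i, (mac, (rack, slot)) in enumerate(sorted(mac_to_location.items())):
        hostname = f"p{rack:02d}{slot:02d}"        # e.g. p0103: rack 1, slot 3
        table[mac] = (hostname, str(net[i + 10]))  # skip the first addresses
    return table

hosts = assign({"aa:bb:cc:00:00:01": (1, 3),
                "aa:bb:cc:00:00:02": (1, 4)})
```

Deriving names deterministically from physical location, rather than hand-maintaining a node database, is what makes diskless boot of 1728 nodes manageable from a single boot server.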
System expansion
• Prometheus expansion already ordered
• 4th island
– 432 regular nodes (2 CPUs, 128 GB RAM)
– 72 nodes with GPGPUs (2x Nvidia Tesla K40XL)
• Installation to begin in September
• 2.4 PFLOPS total performance (Rpeak)
• 2232 nodes, 53568 CPU cores, 279 TB RAM
Future plans
• Push the system to its limits
• Further improvements of the monitoring tools
• Continue to move users from the previous system
• Detailed energy and temperature monitoring
• Energy-aware scheduling
• Survive the summer and measure performance
• Collect the annual energy and PUE
• HP-CAST 25 presentation?