Date post: | 01-Apr-2015 |
Trends and Perspectives for HPC infrastructures
Carlo Cavazzoni, CINECA
Outline
- HPC resources in Europe (PRACE)
- Today's HPC architectures
- Technology trends
- Cineca roadmaps (toward 50 PFlops)
- EuroExa project
The PRACE RI provides access to distributed, persistent, pan-European, world-class HPC computing and data management resources and services. Expertise in the efficient use of these resources is available through participating centers throughout Europe. Available resources are announced for each Call for Proposals.
Peer reviewed open access:
- PRACE Projects (Tier-0)
- PRACE Preparatory (Tier-0)
- DECI Projects (Tier-1)
Tier 0: European
Tier 1: National
Tier 2: Local
TIER-0 System, PRACE regular calls
- CURIE (GENCI, Fr): BULL cluster, Intel Xeon, Nvidia cards, InfiniBand network
- FERMI (CINECA, It) & JUQUEEN (Juelich, D): IBM BGQ, Power processors, custom 5D torus network
- HERMIT (HLRS, D): Cray XE6, AMD processors, custom 3D torus network, 1 PFlops
- MARENOSTRUM (BSC, S): IBM iDataPlex, Intel Xeon nodes, InfiniBand network
- SuperMUC (LRZ, D): IBM iDataPlex, Intel Xeon nodes, InfiniBand network
TIER-1 Systems, DECI calls
DECI site | Machine name | System type | Chip | Peak performance (Tflops) | GPU cards
Bulgaria (NCSA) | EA "ECNIS" | IBM BG/P | PowerPC 450 | 27 |
Czech Republic (VSB-TUO) | Anselm | Bull Bullx | Intel Sandy Bridge-EP | 66 | 23 nVIDIA Tesla, 4 Intel Xeon Phi P5110
Finland (CSC) | Sisu | Cray XC30 | Intel Sandy Bridge | 244.9 |
France (CINES) | Jade | SGI ICE EX8200 | Intel Quad-Core E5472/X5560 | 267.88 |
France (IDRIS) | Babel | IBM BG/P | PowerPC 450 | 139 |
Germany (Jülich) | JuRoPA | Intel cluster | Intel Xeon X5570 | 207 |
Germany (RZG) | Genius | IBM BG/P | PowerPC 450 | 54 |
Germany (RZG) | | iDataPlex | Intel Sandy Bridge | 200 |
Ireland (ICHEC) | Stokes | SGI ICE 8200EX | Intel Xeon E5650 | |
Italy (CINECA) | PLX | iDataPlex | Intel Westmere | 293 | 548 nVIDIA Tesla M2070/M2070Q
Norway (SIGMA) | Abel | MegWare cluster | Intel Sandy Bridge | 260 |
Poland (WCSS) | Supernova | Cluster | Intel Westmere-EP | 51.58 |
Poland (PSNC) | chimera | SGI UV1000 | Intel Xeon E7-8837 | 21.8 |
Poland (PSNC) | cane | Cluster AMD&GPU | AMD Opteron™ 6234 | 224.3 | 334 NVIDIA Tesla M2050
Poland (ICM) | boreasz | IBM Power 775 (Power7) | IBM Power7 | 74.5 |
Poland (Cyfronet) | Zeus-gpgpu | Linux Cluster | Intel Xeon X5670/E5645 | 136.8 | 48 M2050 / 160 M2090
Spain (BSC) | MinoTauro | Bull CUDA Cluster | Intel Xeon E5649 | 182 | 256 nVIDIA Tesla M2090
Sweden (PDC) | Lindgren | Cray XE6 | AMD Opteron | 305 |
Switzerland (CSCS) | Monte Rosa | Cray XE6 | AMD Opteron | 402 |
The Netherlands (SARA) | Huygens | IBM pSeries 575 | Power 6 | 65 |
Turkey (UYBHM) | Karadeniz | HP Cluster | Intel Xeon 5550 | 2.5 |
UK (EPCC) | HECToR | Cray XE6 | AMD Opteron | 829.03 |
UK (ICE-CSE) | ICE Advance | IBM BG/Q | PowerPC A2 | 1250 |
HPC Architectures
Two models
Hybrid:
- Server class processors: server class nodes, special purpose nodes
- Accelerator devices: Nvidia, Intel, AMD, FPGA
Homogeneous:
- Server class nodes: standard processors
- Special purpose nodes: special purpose processors
Networks
Standard/switched: InfiniBand
Special purpose/topology: BGQ, CRAY, TOFU (Fujitsu), TH Express-2 (Tianhe-2)
Programming Models
Fundamental paradigms: message passing, multi-threading
- Consolidated standards: MPI & OpenMP
- New task-based programming models
Special purpose for accelerators: CUDA, Intel offload directives, OpenACC, OpenCL, etc.
- NO consolidated standard
Scripting: Python
Roadmap to Exascale (architectural trends)
Dennard scaling law (downscaling)
$L' = L / 2$
$V' = V / 2$
$F' = 2F$
$D' = 1 / {L'}^{2} = 4D$
$P' = P$
does not hold anymore!
The power crisis!
$L' = L / 2$
$V' \approx V$
$F' \approx 2F$
$D' = 1 / {L'}^{2} = 4D$
$P' = 4P$
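The two sets of scaling relations above can be checked with a few lines of arithmetic. The sketch below assumes the usual CMOS dynamic power model, power per unit area proportional to D * C * V^2 * F, with the capacitance per transistor C shrinking linearly with L. The capacitance term and the function name `scaled_power` are assumptions made for illustration; they do not appear on the slide.

```python
def scaled_power(density_factor, capacitance_factor, voltage_factor, frequency_factor):
    """Return P'/P for power per unit area modeled as D * C * V^2 * F."""
    return (density_factor * capacitance_factor
            * voltage_factor ** 2 * frequency_factor)

# One technology generation: L' = L / 2, hence D' = 4 * D.
# Assumption (not on the slide): capacitance per transistor scales as C' = C / 2.
dennard = scaled_power(4.0, 0.5, 0.5, 2.0)       # V' = V / 2, F' = 2 * F
post_dennard = scaled_power(4.0, 0.5, 1.0, 2.0)  # V' ~ V,     F' ~ 2 * F

print(f"Dennard scaling holds: P'/P = {dennard}")
print(f"Voltage no longer scales: P'/P = {post_dennard}")
```

Under these assumptions the power density stays constant while the voltage can be halved, and grows fourfold once the voltage stops scaling, which is the power crisis the slide refers to.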
Increase the number of cores to keep the evolution of architectures on Moore’s law
Programming crisis!
Core frequency and performance no longer grow following Moore’s law
new VLSI gen.
old VLSI gen.
Moore’s Law
Economic and market law
From WSJ
Stacy Smith, Intel’s chief financial officer, later gave some more detail on the economic benefits of staying on the Moore’s Law race.
The cost per chip “is going down more than the capital intensity is going up,” Smith said, suggesting Intel’s profit margins should not suffer because of heavy capital spending. “This is the economic beauty of Moore’s Law.”
And Intel has a good handle on the next production shift, shrinking circuitry to 10 nanometers. Holt said the company has test chips running on that technology. “We are projecting similar kinds of improvements in cost out to 10 nanometers,” he said.
So, despite the challenges, Holt could not be induced to say there’s any looming end to Moore’s Law, the invention race that has been a key driver of electronics innovation since first defined by Intel’s co-founder in the mid-1960s.
It is all about the number of chips per Si wafer!
But!
Si lattice
0.54 nm
There are still 4 to 6 cycles (technology generations) left before we reach 11 to 5.5 nm technologies, where we hit the downscaling limit, in some year between 2020 and 2030 (H. Iwai, IWJT 2008).
300 atoms!
14nm VLSI
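The count of remaining generations quoted above can be sanity-checked with a small sketch. It rests on two assumptions that are not stated on the slide: a starting node of 45 nm (the leading node around 2008, when the cited talk was given) and a linear shrink factor of about 0.7 per generation. The function name `generations_to_reach` is purely illustrative.

```python
def generations_to_reach(start_nm, target_nm, shrink=0.7):
    """Count technology generations needed to shrink from start_nm to target_nm.

    Assumes each generation multiplies the feature size by `shrink`
    (0.7 is an assumed, commonly quoted value, not taken from the slide).
    """
    generations = 0
    node = start_nm
    while node > target_nm:
        node *= shrink
        generations += 1
    return generations

# Assumed starting point: 45 nm.
to_11nm = generations_to_reach(45.0, 11.0)
to_5_5nm = generations_to_reach(45.0, 5.5)
print(f"Generations to reach 11 nm:  {to_11nm}")
print(f"Generations to reach 5.5 nm: {to_5_5nm}")
```

With these assumed inputs the two counts bracket the range of 4 to 6 generations given on the slide.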
What about Applications?
In a massively parallel context, an upper limit for the scalability of parallel applications is determined by the fraction of the overall execution time spent in non-scalable operations (Amdahl's law).
maximum speedup tends to 1 / ( 1 − P )
P = parallel fraction
1,000,000 cores
P = 0.999999
serial fraction= 0.000001
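To make the numbers above concrete, here is a minimal sketch of Amdahl's law for a finite number of cores, speedup = 1 / ((1 − P) + P / N), together with its limit 1 / (1 − P). The function names are illustrative only.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Speedup on `cores` cores when `parallel_fraction` of the work scales perfectly."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)

def amdahl_limit(parallel_fraction):
    """Upper bound on the speedup as the number of cores grows without limit."""
    return 1.0 / (1.0 - parallel_fraction)

P = 0.999999       # parallel fraction from the slide
N = 1_000_000      # core count from the slide

print(f"Speedup on {N} cores: {amdahl_speedup(P, N):.0f}")
print(f"Asymptotic limit:     {amdahl_limit(P):.0f}")
```

Even with a serial fraction of only 0.000001, one million cores deliver about half of the asymptotic speedup, so an application needs an even smaller serial fraction to use such a machine efficiently.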
Architectural trends
Peak performance: Moore's law
FPU performance: Dennard's law
Number of FPUs: Moore + Dennard
Application parallelism: Amdahl's law
HPC Architectures
Two models
Hybrid, but…
Homogeneous, but…
Which 100 PFlops systems will we see? My guess:
- IBM (hybrid): Power8 + Nvidia GPU
- Cray (homo/hybrid): with Intel only!
- Intel (hybrid): Xeon + MIC
- ARM (homo): only ARM chips, but…
- Nvidia/ARM (hybrid): ARM + Nvidia
- Fujitsu (homo): SPARC, high density, low power
- China (homo/hybrid): with Intel only
- Room for AMD console chips
Chip Architecture
Strongly market driven: mobile, TV sets, screens, video/image processing
- Intel: new architecture to compete with ARM; less Xeon, but PHI
- ARM: main focus on low power mobile chips (Qualcomm, Texas Instruments, Nvidia, ST, etc.); new HPC market, server market
- NVIDIA: GPU alone will not last long; ARM+GPU, Power+GPU
- Power: embedded market; Power+GPU, only chance for HPC
- AMD: console market; still some chance for HPC
CINECA Roadmaps
Roadmap 50PFlops
Time line: 2013 2014 2015 2016 2017 2018 2019 2020
Power consumption:
- EURORA 50 kW, PLX 350 kW, BGQ 1000 kW + ENI
- EURORA or PLX upgrade 400 kW; BGQ 1000 kW; data repository 200 kW; - ENI
R&D:
- Eurora
- EuroExa STM / ARM board
- EuroExa STM / ARM prototype
- PCP proto 1 PF in a rack
- EuroExa STM / ARM PF platform
- ETP proto towards exascale board
Deployment:
- Eurora industrial prototype 150 TF
- Eurora or PLX upgrade 1 PF peak, 350 TF scalar
- multi petaflop system
- Tier-0 50 PF
- Tier-1 towards exascale
High-level system requirements
- Electrical power consumption: 400 kW
- Physical size of the system: 5 racks
- System peak performance (CPU+GPU): on the order of 1 PFlops
- System peak performance (CPU only): on the order of 300 TFlops
Tier 1 CINECA
Procurement Q2014
High-level system requirements
- CPU architecture: Intel Xeon Ivy Bridge
- Number of cores per CPU: 8 @ >3 GHz, or 12 @ 2.4 GHz
The choice of frequency and number of cores depends on the socket TDP, the system density, and the cooling capacity.
- Number of servers: 500 - 600 (Peak perf = 600 * 2 sockets * 12 cores * 3 GHz * 8 Flop/clk = 345 TFlops)
The number of servers in the system may depend on the cost or on the geometry of the configuration, in terms of the number of CPU-only nodes and the number of CPU+GPU nodes.
- GPU architecture: Nvidia K40
- Number of GPUs: >500 (Peak perf = 700 * 1.43 TFlops = 1 PFlops)
The number of GPU cards in the system may depend on the cost or on the geometry of the configuration, in terms of the number of CPU-only nodes and the number of CPU+GPU nodes.
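The two peak-performance estimates quoted in these requirements can be reproduced with a short sketch. The input values are the ones given above; the function names and the GFlops-to-TFlops conversion are illustrative choices, not part of the procurement text.

```python
def cpu_peak_tflops(servers, sockets_per_server, cores_per_socket, ghz, flops_per_clock):
    """Theoretical CPU peak in TFlops: total cores * clock (GHz) * Flop per clock."""
    gflops = servers * sockets_per_server * cores_per_socket * ghz * flops_per_clock
    return gflops / 1000.0  # GFlops -> TFlops

def gpu_peak_tflops(gpus, tflops_per_gpu):
    """Theoretical GPU peak in TFlops: number of cards * peak per card."""
    return gpus * tflops_per_gpu

# Values quoted in the requirements: 600 servers, 2 sockets, 12 cores,
# 3 GHz, 8 Flop/clk; 700 GPUs at 1.43 TFlops each.
cpu_peak = cpu_peak_tflops(600, 2, 12, 3.0, 8)
gpu_peak = gpu_peak_tflops(700, 1.43)

print(f"CPU peak: {cpu_peak:.1f} TFlops")
print(f"GPU peak: {gpu_peak:.1f} TFlops")
```

The CPU figure lands near the 345 TFlops quoted above and the GPU figure at about 1 PFlops, consistent with the targets of roughly 300 TFlops for CPUs only and 1 PFlops for the whole system.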
Tier 1 CINECA
High-level system requirements
- Identified vendors: IBM, Eurotech
- DRAM memory: 1 GByte/core
The option of having a subset of nodes with a larger amount of memory will be requested.
- Local non-volatile memory: >500 GByte SSD/HD, depending on the cost and the system configuration
- Cooling: liquid cooling system with free cooling option
- Scratch disk space: >300 TByte (provided by CINECA)
Tier 1 CINECA
Thank you