1
Energy Efficient Computing
Lennart Johnsson, Gilbert Netzer, KTH
2
OUTLINE
• Who we are and what we do
• Scientific and Engineering computation
• Application characterization
• Data center energy challenges
• PRACE
• KTH energy efficient computing projects
• KTH – TI project
• What is next
3
PDC's Mission
Infrastructure
Operation of a high-end infrastructure for HPC, data services, user support and training for Swedish research on behalf of the Swedish National Infrastructure for Computing (SNIC), collaborative international and national consortia, and research groups at KTH and Stockholm University
Research
Conduct world-class research and education in parallel and distributed computing methodologies and tools
4
SNIC
• The Swedish meta-center for large-scale computing and data storage. Formed 2003.
• Organized within the Swedish Research Council with a budget of about 100 MSEK
• Mission:
  - Provide research computing resources for Swedish academic research mainly through six university-based computing centers
  - Coordinate investments and competence across the centers
  - Merit-based resource allocation through the SNIC National Resource Allocations Committee (SNAC) through RFPs every 6 months
  - Fund and coordinate minor development projects
  - Host the Swedish National Graduate School in Scientific Computing (NGSSC)
5
SNIC
• HPC2N (Umeå)
• UPPMAX (Uppsala)
• PDC (Stockholm)
• NSC (Linköping)
• C3SE (Göteborg)
• LUNARC (Lund)
• About 300 user groups (1-50 researchers each)
• Services:
  - A few large-scale computing systems
  - Foundation-level computer systems, storage and user support at all centers
  - Coordinated access to European-level initiatives
  - SweGrid initiated 2003
  - SweStore initiated 2008
  - Advanced user support effort initiated 2010
6
PDC Computing Resources
• Ferlin and SweGrid - Dell Cluster, SNIC Foundation Level Service: 6,120 cores (765 nodes, 2 quad-core Intel), 7 TByte memory, 32 nodes with InfiniBand
• Ekman - Dell PowerEdge Cluster, climate and flow research: 10,144 cores (1,268 nodes, 2 quad-core AMD), 89 TF theoretical peak performance, 20 TByte memory
• Key - HP SMP: 32 cores, 256 GB memory
• Hebb - IBM Blue Gene/L, Stockholm Brain Institute, Mechanics, and INCF: 1,024 nodes, 6 TF theoretical peak performance
• Povel - PRACE prototype (energy efficiency): 4,320 cores (180 nodes, 4 x 6-core AMD), 36 TF theoretical peak performance, 5.76 TByte memory
7
PDC's latest HPC system
• Cray XE6
• 1,516 dual-socket AMD 12-core 2.1 GHz, 32 GB compute nodes (36,384 cores), 305 TF TPP, 237 TF sustained (Linpack)
• Gemini 3D torus network
• SNIC PRACE system
• Would be Nr. 8 in Europe and Nr. 28 worldwide on the November 2010 Top500 list (www.top500.org)
8
PDC’s Computational Resources
System     Cores    TPP
Lindgren   36,384   305 TF
Ekman      10,144    89 TF
Ferlin      5,360    58 TF
SweGrid       744     8 TF
Hebb        2,048     6 TF
Povel       4,320    36 TF
Total      59,000   502 TF
9
Storage
• ~20 TB disk, accessible via AFS
• ~900 TB disk, currently attached to individual systems; Lustre parallel file system, site-wide configuration planned
• IBM tape robot (~2,900 slots, ~2.3 PB), accessible via HSM, TSM, and dCache (planned via NDGF (Nordic Data Grid Facility))
• Large datasets, e.g. brain image database, human proteome data, …
10
PDC System Use
Over 500 time allocations by 400 PIs (past 4 years). Examples of these research areas include:
• Quantum Chemistry
• Climate Modeling
• Neuroinformatics
• Life Sciences
• Physics
• Computational Fluid Dynamics
11
User Support
• Front-line help desk
• Advanced user support
• Experts in parallel computing and specific application domains:
  - Computational Chemistry
  - Molecular Dynamics
  - Computational Fluid Dynamics
  - Neuroinformatics
12
Community Code Development
• Gromacs: GROMACS is a versatile Molecular Dynamics package for simulation of the Newtonian equations of motion for systems with hundreds to millions of particles. Head authors and project leaders: Erik Lindahl and Berk Hess, KTH, and David van der Spoel, Uppsala. http://www.gromacs.org/About_Gromacs
• Dalton: Dalton is a Molecular Electronic Structure package. Members of the KTH Theoretical Chemistry department are active contributors, especially Olaf Vahtras and Hans Agren. http://www.daltonprogram.org/description.html
13
PDC Summer School
Education and Training
PDC Summer School since 1996
For many years now jointly with
Total 1996 – 2010: 834 participants
[Chart: number of participants per year, 1996-2010, by home institution: KTH, CTH, Uppsala, SU, Lund, Linköping, Umeå, Luleå, Stockholm Observatory, KI, FOA, Göteborg University, other]
14
Schools, examples
[Chart: distribution of 2006 and 2007 summer school participants by region: EU, other Europe, Russia, USA, Central America, South America, Africa, Asia other, China, Australasia]
EU share 75% → 44%; total 64 students
15
Schools, examples
• 31 students (target 30)
• 14 PRACE partner countries represented
• 2 non-PRACE countries represented
• Access to Forschungszentrum Juelich's BG/P
• Access to CSC's Cray XT4
First Summer School, August 2008
16
Training Workshops, examples
• 41 participants
• 3 PRACE partner countries represented
• 1 non-PRACE country represented
• Access to AMD/ATI Radeon 5770 and 5870 GPUs
• Access to AMD/ATI Firestream 9270 GPUs
Stream Programming Workshop December 2009
17
Scientific and Engineering Computation
18
21st Century Science and Engineering
• The three-fold way:
  - theory
  - experiment
  - computational simulation
• Supported by:
  - multimodal collaboration systems
  - distributed, multi-petabyte data archives
  - leading edge computing systems
  - distributed experimental facilities
  - internationally distributed multidisciplinary teams
[Diagram: Theory, Experiment, Simulation]
Courtesy Paul Messina
19
Driving Applications - Examples
• Physics: CMS, Atlas, LHCb, ALICE
• Astronomy
• Life Sciences and Medicine: blood pressure, blood glucose, heart rate, temperature
• Engineering
• Weather
20
The Large Hadron Collider Project - four detectors
Physics: CMS, ATLAS; CERN LHC Site
Storage: raw recording rate 0.1 – 1 GBytes/sec, accumulating at 5-8 PetaBytes/year, 10 PetaBytes of disk
Processing: 200,000 of today's fastest PCs
CMS: 1,800 physicists, 150 institutes, 32 countries
21
Source: Thomas Lippert, DEISA Symp, May 2005
22
Astronomy
• The planned Large Synoptic Survey Telescope (LSST) will produce over 30 TB/day when in operation before 2020! It will perform an all-sky survey every few days with a 3.2 billion pixel camera, resulting in the first fine-grain time series of the sky. This is expected to allow for observation of the expansion of the universe and dark matter's bending of light.
Impending Floods of Data
www.lsst.org
23
Telescope Arrays: ALMA, EVLA, LOFAR
24
Source: K Lackner, Max-Planck DEISA Symp May 2005
25
Fusion
Source: David H Bailey, Petaflops Workshop
26
Source: David H. Bailey, LBL Petaflops Workshop
27
Severe Weather Prediction
Hurricanes:
For 1925 - 1995 the US cost was $5 billion/yr for a total of 244 landfalls. But, hurricane Andrew alone caused damage in excess of $27 billion.
The US loss of life has gone down to <20/yr typically. The Galveston “Great Hurricane” year 1900 caused over 6,000 deaths.
Since 1990 the number of landfalls each year is increasing.
Warnings and emergency response costs on average $800 million/yr. Satellites, forecasting efforts and research cost $200 – 225 million/yr.
Ivan, September 14, 2004
Andrew: ~$27B (1992)
Charley: ~$ 7B (2004)
Hugo: ~$ 4B (1989) ($6.2B 2004 dollars)
Frances: ~$ 4B (2004)
28
Weather
Courtesy Kelvin Droegemeier
March 28, 2000, Fort Worth Tornado
29
Severe Weather Prediction
• Must fit the prediction model to the observations (data assimilation/retrieval): about 50-100 times as expensive as the forecast
• Must use high spatial resolution: 1-3 km resolution in sufficiently large domains
• Must quantify forecast uncertainty (ensembles): may need 20-30 forecasts to produce an ensemble each forecast cycle
• Requirements: 10-100 TFLOPS sustained; 0.5 TB memory; 20 TB storage
30
Wildfire Simulation
31
Environmental Studies
Houston, TX
32
Life Sciences
33
Life Sciences: Imaging
Imaging modalities: Synchrotrons, Microscopes, Magnetic Resonance Imagers
Scales: Molecules; Macromolecular Complexes, Organelles, Cells; Organs, Organ Systems, Organisms
34
Life sciences: Imaging
Data Acquisition → Image Restoration (deconvolution, filtering, registration) → Image Reconstruction (3D reconstruction, refinement) → Multidimensional Image Analysis (image segmentation; feature recognition, extraction & modification) → Visualization
Post processing, simulation, other methods; interactions with the experiment
35
Life Sciences: Imaging
JEOL 3000-FEG, liquid He stage, NSF support; 500 Å; structure of the HSV-1 capsid
No. of particles needed for 3-D reconstruction:
Resolution   B = 100 Å²    B = 50 Å²
8.5 Å        6,000         3,000
4.5 Å        5,000,000     150,000
36
Digital Mammography
• About 40 million mammograms/yr (USA) (estimates 32 – 48 million)
• About 250,000 new breast cancer cases detected each year
• Over 10,000 units (analogue)
• Image size: 4k x 6k, about 48 MB
• Images per patient: 4
• Data set size per patient: about 200 MBytes
• Data set per year: about 10 PBytes
• Data set per unit, if digital: 1 TByte/yr, on average
37
Computer Assisted Surgery
http://lyon2003.healthgrid.org/documents/slides_PDF/11_Guy_Lonsdale.pdf
38
Cancer Treatment: Hadron Centers
• In the US (all proton therapy): Harvard (MGH); Loma Linda (California); MD Anderson (Houston), Spr '06
• Heavy Ion Therapy (all international): HIMAC @ NIRS (Chiba, Japan); GSI (Heidelberg, Germany) - under construction; Etoile (Lyon, France) - under construction; Univ. of Pavia, Italy - under construction
39
Lung Simulation
Virtual Lung from PNNL's Virtual Biology Center
NWGrid & NWPhys are designed to simulate coupled fluid dynamics and continuum mechanics in complex geometries using 3-D, hybrid, adaptive, unstructured grids.
• NWGrid - grid generation & setup toolbox
• NWPhys - collection of computational physics solvers
Particle distribution in the flow airways (particle occurs in right branch of bifurcation): membrane wall, airway passage, particle
Pressure contours of the flow field throughout the lung airways; particles occur in every right branch of a bifurcation
Harold Trease, PNNL; U.S. Department of Energy, Pacific Northwest National Laboratory
40
Center for Integrated Turbulence Simulation
Engineering
41
42
Engineering
43
Scheduling
Continental Airlines
44
Application Characterization for System Design
45
Particle Physics 23.5%
Computational Chemistry 22.1%
Condensed Matter Physics 14.2%
CFD 8.6%
Earth & Climate 7.8%
Astronomy & Cosmology 5.8%
Life Sciences 5.3%
Computational Engineering 3.7%
Plasma Physics 3.3%
Other 5.8%
2008 usage of PRACE partner’s major systems measured as aggregated Linpack Equivalent Flops (LEFs)
69 applications surveyed on 24 systems>10TF
46
Relative use of Computational Kernels (Dwarfs) in PRACE Applications based on LEFs
Map reduce methods: 45.1%
Spectral methods: 18.4%
Dense linear algebra: 14.4%
Structured grids: 9.0%
Particle methods: 7.2%
Sparse linear algebra: 3.4%
Unstructured grids: 2.4%
69 applications surveyed on 24 systems>10TF
47
PRACE Large Scale Applications characteristics measured as LEFs used in TF units
Area / Dwarf (TF)          Map reduce  Unstructured  Particle  Sparse LA  Structured  Spectral  Dense LA
Astronomy and Cosmology         0          2.99        5.98      3.59       4.91       0.62      0
Computational Chemistry        12.98       0.53        7.49      3.45       1.80      26.09     15.35
Computational Engineering       2.8        0.53        0         0.53       0.53       0         0
CFD                             0          3.00        0.32      3.05       7.37       1.70      0
Condensed Matter Physics        5.70       0.28        1.76      0.73       1.62      15.07      9.10
Earth and Climate Science       0          0.26        0         1.33       5.83       2.03      0
Life Science                    3.46       0.28        0.94      0.13       0.94       4.72      0
Particle Physics               89.27       0           0.10      0.92       4.59       0        12.50
Plasma Physics                  0.63       0.42        3.55      1.33       1.33       0         0
Other                           0          0           0         0          0          0         0
48
Language use by application codes
Language      No. of applications
Fortran90     50
C90           22
Fortran77     15
C++           10
C99            7
Python         3
Perl           2
Mathematica    1
About 50% use more than one base language
16 out of the 69 application codes combine Fortran with C or C++
49
Application parallelization techniques on PRACE Partner systems
• 1 code is sequential (BLAST)
• 1 code uses OpenMP only (Gaussian)
• 67 codes use MPI:
  - 45 codes use MPI only, one having an MPI-2 version
  - 6 codes have one MPI version and one OpenMP version
  - 3 codes have one MPI version and one SHMEM version
  - 10 codes have hybrid MPI/OpenMP versions
  - 2 codes have hybrid MPI/SHMEM versions
  - 1 code has a hybrid MPI/Posix threads version
50
PRACE systems usage (2008)Job Requirements
[Chart: for each of the 67 surveyed application codes, the fraction of the PRACE partner machine on which it is run [%] and the minimal execution time [h]]
Size: mean 22% of the machine; a quarter to a third of all cores is a common experience for shared systems
Length: mean 37 hrs
51
PRACE partner systems with peak >10 TF surveyed (2008)
Type   Systems (%)   Rpeak GF (%)       Rmax GF (%)        Cores (%)
VEC     1  (4.2)     9,216   (1.0)      8,923   (1.3)      576     (0.3)
MPP     7  (29.2)    425,591 (46.0)     335,491 (49.7)     108,248 (63.9)
FNC     6  (25.0)    110,682 (12.0)     94,118  (13.9)     16,928  (10.0)
TNC    10  (41.7)    380,686 (41.1)     236,882 (35.1)     43,770  (25.8)
Total  24  (100.0)   926,175 (100.0)    675,414 (100.0)    169,522 (100.0)

VEC = Vector systems
MPP = Massively Parallel Processors (BlueGene and Cray XT)
FNC = Fat Node Cluster ("big" SMP nodes)
TNC = Thin Node Cluster
52
53
PRACE Vision and Mission
• Vision: Enable and support European global leadership in public and private research and development.
• Mission: Contribute to the advancement of European competitiveness in industry and research through the provisioning of world leading persistent High-End Computing infrastructure
54
PRACE AISBL
(Interest to join by Belgium and Latvia)
• PRACE AISBL (Association International Sans But Lucratif) is a Belgian legal entity seated in Brussels formed April 23 2010 for providing a persistent pan-European Research Infrastructure for High-End Computing and associated services. Member countries currently (spring 2012) are
• Austria, Bulgaria, Cyprus, Czech Republic, Denmark, Finland, France, Germany, Greece, Hungary, Ireland, Israel, Italy, Netherlands, Norway, Poland, Portugal, Serbia, Slovenia, Spain, Sweden, Switzerland, Turkey, United Kingdom
55
Commitments to PRACE AISBL
• Hosting Partners: Germany, France, Italy, Spain - binding commitments to contribute 100 M€ over 5 years in terms of Tier-0 cycles and services; contribution measured by TCO
• All partners: binding commitment to share PRACE AISBL headquarters costs equally
• EU Commission (expected): 68 M€ (originally 3x20+10 M€) in FP7 for preparatory and implementation phases; grants INFSO-RI-211528 and 261557
• Partner match for EU funds: ~60 million € from PRACE partners (not including Tier-1)
Note: GDP spread among PRACE partners is a factor of ~200
56
The ESFRI Vision for a European HPC service
[Pyramid diagram: capability vs. number of systems - Tier-0 (PRACE), Tier-1 (DEISA/PRACE), Tier-2]
• Ensure the right level of integration in/with the tiers
• Tier-0 - full integration: creation of new high-end resources; single access route; single operational model
• Tier-1: integration of existing national resources enables non-hosting countries to contribute; different funding/governance requires an adapted approach; leverage DEISA successes, like the network and DECI
• Tier-2 / Grids: different funding and usage models, overlapping user groups; cooperate and inter-operate for the benefit of users
(ESFRI = European Strategy Forum on Research Infrastructures; DEISA = Distributed European Infrastructure for Supercomputing Applications)
57
PRACE RI Systems
JSC
• 2010, 1st PRACE System: BG/P by Gauss Center for Supercomputing at Juelich
• 294,912 CPU cores, 144 TB memory
• 1 PFlop/s peak performance
• 825.5 TFlop/s Linpack
• 600 I/O nodes (10GigE), > 60 GB/s I/O
• 2.2 MW power consumption
• 35% for PRACE
58
GENCI
• 2011, 2nd PRACE system: Bull, 1.6 PF, 92,160 cores, 4 GB/core
• Phase 1, December 2010, 105 TF: 360 nodes with four Intel Nehalem-EX 8-core 2.26 GHz CPUs (11,520 cores), QDR Infiniband fat-tree; 800 TB, >30 GB/sec local Lustre file system
• Phase 1.5, Q2 2011: conversion to 90 16-socket, 128-core, 512 GB nodes
• Phase 2, Q4 2011, 1.5 PF: Intel Sandy Bridge; 10 PB, 230 GB/sec file system
GENCI/CEA, <15 MW
59
HLRS
• 2011, 3rd PRACE System: Cray XE6
• Phase 0 - 2010: 10 TF, 84 dual-socket 8-core AMD Magny-Cours CPUs, 1,344 cores in total, 2 GHz, 2 GB/core, Gemini interconnect
• Phase 1 Step 1 - Q3 2011: AMD Interlagos, 16 cores, 1 PF, 2-4 GB/core, 2.7 PB file system, 150 GB/s I/O
• Phase 2 - 2013: Cascade, first order for Cray, 4-5 PF
60
LRZ
• 2011/12, 4th PRACE system: IBM iDataPlex
• >14,000 Intel Sandy Bridge CPUs, 3 PF (~110,000 cores), 384 TB of memory
• 10 PB GPFS file system with 200 GB/sec I/O, 2 PB 10 GB/sec NAS
• LRZ <13 MW
• Innovative hot-water cooling (60°C inlet, 65°C outlet) leading to 40 percent less energy consumption compared to an air-cooled machine
61
CINECA• 2012 5th PRACE System
FERMI, a Blue Gene/Q, is composed of 10,240 PowerA2 sockets running at 1.6 GHz, with 16 cores each, totaling 163,840 compute cores and a system peak performance of 2.1 PFlops. Each processor comes with 16 GByte of RAM (1 GByte per core).
The BG/Q system will be equipped with a high-performance scratch storage system with a capacity of 2 PByte and a bandwidth in excess of 100 GByte/s.
62
BSC
• 2013, 6th PRACE System
• Computing facility: 10 MW, 2013; BSC <20 MW
63
PRACE Tier-1 (continuation/extension of DEISA)
Partner systems and committed core hours:
Clusters: CINECA 2,747,000 + 1,513,728; CINES 1,437,000; FZJ 3,363,840; HLRS 784,000 + 9,000 GPGPU; ICHEC 2,900,000; LRZ 2,000,000; PSNC 850,000 + 140,000 GPGPU; BSC 1,900,000; SARA 880,000
IBM Power6: RZG 1,150,000; CINECA 1,400,000
IBM BG/P: RZG 2,872,000; NCSA (Bulgaria) 2,870,000; IDRIS 6,000,000
Cray XT4/5/6, XE6: KTH (36,484 cores) 12,749,000; EPCC (44,544 cores) 7,800,000; CSC 2,284,000
Total: 55,500,568 core hours + 149,000 GPGPU hrs
64
Accessing the PRACE RI
Peer-review merit-based access model
• Three types of resource allocations:
  - Test / evaluation access
  - Project access - for a specific project, grant period ~1 year
  - Programme access - resources managed by a community
• Free-of-charge
Funding
• Mainly national funding through partner countries
• European contribution
• Access model has to respect national interests (ROI)
65
PRACE Young Investigator Awards
• To stimulate interest and innovation in HPC, PRACE has since its inception in 2008 awarded an annual prize to a European student or young scientist who has carried out outstanding scientific work on High-End Computing.
• The award is based on papers submitted for the competition that are reviewed by three reviewers selected by the PRACE Scientific Steering Committee. Reviewers evaluate novelty, fundamental insights and potential for long-term impact of the research.
• The award and the research are presented at ISC (the International Supercomputing Conference)
66
PRACE Young Investigator Award 2008
• The award was given for “UCHPC – UnConventional High Performance Computing for Finite Element Simulations”, by Stefan Turek, Dominik Goeddeke, Christian Becker, Sven H.M. Buijssen, Hilmar Wobker, Applied Mathematics, Dortmund University of Technology, Germany.
The work addresses use of heterogeneous hardware for Finite Element computations and describes Feast (Finite Element Analysis & Solution Tools), a software toolbox providing Finite Element discretizations and optimized parallel solvers for PDE problems. Feast combines modern numerical techniques with hardware efficient implementations for a wide range of HPC architectures. It contains mechanisms enabling complex simulations to directly benefit from hardware acceleration without having to change application code.
http://www.mathematik.tu-dortmund.de/lsiii/static/showpdffile_TurekGoeddekeBeckerBuijssenWobker2008.pdf
67
PRACE Young Investigator Award 2009
• The award was given for “High Scalability Multipole Method. Solving Half Billion of Unknowns” by J. C. Mouriño, A. Gómez, J. M. Taboada, L. Landesa, J. M. Bértolo, F. Obelleiro and J. L. Rodríguez, Supercomputing Center of Galicia (CESGA), Universidad de Extremadura, Universidad de Vigo, Spain.
“Large electromagnetic simulations are of great interest for the design of complex industrial products that are more and more integrating electronic equipments. They are also very useful for understanding and reducing the impact of electromagnetic fields on human beings. The selected paper for the PRACE award presents an outstanding work on the scalability of the FMM-FFT method (fast multipole method - fast Fourier transform) and its application for solving a challenging problem with 0.5 billion of unknowns and opens the road towards even larger simulations. It will make possible to run simulations with a much higher resolution than before, making possible to improve the accuracy of the simulations and to address new computational challenges in the field of very high frequencies, for example for new car anti-collision systems”, Francois Robin, Award Committee member and CEA Senior Scientist.
http://www.springerlink.com/content/y6tr4q34r2510328
68
PRACE Young Investigator Award 2010
• The Award was given for “Massively Parallel Granular Flow Simulations with Non-Spherical Particles” by Klaus Iglberger, M. Sc. and Prof. Ulrich Rüde, University of Erlangen-Nuremberg, Germany.
“The paper proposed by Iglberger and Rüde addresses successfully the issue of simulating flows of granular materials in a realistic way, that is with particles of different shapes moving in a complex environment. It introduces a new algorithm for this purpose that shows an excellent scalability on a very high number of cores, making possible very large simulations that will be useful in important applications like the design of silos”, says François Robin (GENCI), member of the Award Committee.
“Among the very good papers submitted for the PRACE Award 2010, this paper was the best, both addressing a complex physical problem and implementing a highly scalable method for solving it”, Robin continues.
http://www.springerlink.com/content/yw54328t38r2123p
69
PRACE Young Investigator Award 2011
• The Award was given for “Astrophysical Particle Simulations with Large Custom GPU Clusters on Three Continents” by Rainer Spurzem, Chinese Academy of Sciences & University of Heidelberg; Peter Berczik, Chinese Academy of Sciences & University of Heidelberg; Tsuyoshi Hamada, Nagasaki University; Keigo Nitadori, RIKEN; and Guillermo Marcus, Andreas Kugel, Reinhard Maenner, Ingo Berentzen, Jose Fiestas, Robi Banerjee and Ralf Klessen, University of Heidelberg.
“This paper is an excellent example of what can be achieved through international and interdisciplinary collaboration to exploit new HPC technologies”, says Prof. Richard Kenway, PRACE Scientific Steering Committee Chair.
“Astrophysicists and computer scientists in Germany and China demonstrate nearly linear strong scaling on up to 170 GPUs at a third of peak performance for large-scale simulations of dense star clusters using machines in Europe, China and the USA. The work points the way to exploit exascale technologies for problems at the forefront of science”, Kenway continues. http://www.ari.uni-heidelberg.de/mitarbeiter/fiestas/iscpaper11.pdf
70
1st PRACE User Forum, April 13, 2011, Helsinki (held during the PRACE/DEISA Symposium)
• Open to all scientific and industrial user communities
• Main communication channel between HPC users and PRACE AISBL
• Interaction with members of the PRACE AISBL
• Discussion and issuing recommendations to PRACE AISBL
• Promoting HPC usage
• Fostering collaborations between user communities
71
Education and Training Highlights Petascale training and education needs surveyed Spring 2008
First PRACE Summer School on Peta-scaling, KTH, Stockholm, August 2008. Platforms: IBM Blue Gene/P (FZJ) (65,536 cores) and Cray XT4 (CSC) (10,816 cores)
First PRACE Winter School on Scalable Programming Models and Paradigms, GRNET, Athens, February 2009. Platforms IBM Power 6 (3,328 cores) and IBM Cell (1,152 SPE cores)
Seven Code Porting Workshops in 2009
In total 270 participants in education and training events 2008/2009
72
PRACE Code Porting Workshops
• GPU and hybrid system programming using CUDA and CAPS-HMPP, CEA, Paris, April 2009
• Porting and optimization techniques for PRACE applications, CSC, Helsinki, June 2009
• Porting and optimization techniques for the Cray XT5, CSCS, Manno, Switzerland, July 2009
• Porting and optimization techniques for the Clearspeed/Petapath architecture, NCF/SARA, Amsterdam, October 2009
• Porting and optimization techniques for the NEC SX-9 (HLRS) and IBM BG/P (FZJ), Cyfronet, Cracow, October 2009
• Porting and optimization techniques for the IBM Cell (BSC) and GPGPU systems, BSC, Barcelona, October 2009
• Stream Programming with OpenCL, KTH, Stockholm, December 2009
• …..
73
PRACE Technology Evaluation, Research and Development
• PRACE activities include efforts to assess, through prototyping, the impact of novel architectures, such as stream computing units (GPUs) and digital signal processors (DSPs), interconnection technologies, and programming paradigms on application development, code porting and optimization.
• PRACE also seeks to assess and stimulate the development of energy efficient hardware and software design through prototyping.
74
Challenges
75
The Connection Machine
7 years
Performance doubling period on average: No 1 – 13.64 months, No 500 – 12.90 months
76
Energy efficiency evolution
Source: Assessing Trends in the Electrical Efficiency of Computation over Time, J.G. Koomey, S. Berard, M. Sanchez, H. Wong, Intel, August 17, 2009, http://download.intel.com/pressroom/pdf/computertrendsrelease.pdf
Energy efficiency doubling every 18.84 months on average measured as computation/kWh
77
The Gap: The energy efficiency improvement as determined by Koomey does not match the performance growth of HPC systems as measured by the Top500 list.
The gap indicates a growth rate in energy consumption for HPC systems of about 20%/yr.
• EPA study projections: 14% - 17%/yr
• Uptime Institute projections: 20%/yr
• PDC experience: 20%/yr
“Report to Congress on Server and Data Center Energy Efficiency”, Public Law 109-431, U.S. Environmental Protection Agency, Energy Star Program, August 2, 2007, http://www.energystar.gov/ia/partners/prod_development/downloads/EPA_Datacenter_Report_Congress_Final1.pdf
“Findings on Data Center Energy Consumption Growth May Already Exceed EPA’s Prediction Through 2010!”, K. G. Brill, The Uptime Institute, 2008, http://uptimeinstitute.org/content/view/155/147
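A quick back-of-the-envelope check of the gap, assuming only the doubling periods quoted earlier in this deck (Top500 #1 performance ~13.64 months, computations/kWh ~18.84 months); this few-line C program is illustrative, not from the original slides:

```c
/* Back-of-the-envelope check of "The Gap": performance grows faster than
 * efficiency, and the ratio of the two yearly growth factors is the
 * implied growth in HPC energy consumption. */
#include <math.h>
#include <stdio.h>

int main(void) {
    double perf_growth = pow(2.0, 12.0 / 13.64);  /* ~1.84x per year (Top500 #1) */
    double eff_growth  = pow(2.0, 12.0 / 18.84);  /* ~1.56x per year (Koomey)    */
    double energy_growth = perf_growth / eff_growth;
    printf("Implied HPC energy growth: %.0f%%/yr\n", (energy_growth - 1.0) * 100.0);
    return 0;
}
```

The result, roughly 18%/yr, is consistent with the ~20%/yr figures above.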
78
Evolution of Data Center Energy Costs (US)
Source: Tahir Cader, Energy Efficiency in HPC – An Industry Perspective, High Speed Computing,April 27 – 30, 2009
79
Worldwide Server Installed Base, New Server Spending, and Power and Cooling Expense
Power per rack has increased from a few kW to 30+ kW
Estimate for 2008 purchase: 4 yr cooling cost ~1.5 times cluster cost
80
DOE E3 Report: extrapolation of existing design trends to Exascale in 2016. Estimate: 130 MW
DARPA Study: more detailed assessment of component technologies. Estimate: 20 MW just for memory alone, 60 MW aggregate extrapolated from current design trends
The current approach is not sustainable! A more holistic approach is needed!
Exa-scale Data Centre Challenges
Nuclear power plant: 1 - 1.5 GW
81
DARPA Exascale study
• Last 30 years:
  - "Gigascale" computing first in a single vector processor
  - "Terascale" computing first via several thousand microprocessors
  - "Petascale" computing first via several hundred thousand cores
• Commercial technology, to date:
  - Always shrunk prior "XXX" scale to smaller form factor
  - Shrink, with speedup, enabled next "XXX" scale
• Space/embedded computing has lagged far behind:
  - Environment forced implementation constraints
  - Power budget limited both clock rate & parallelism
• "Exascale" now on the horizon:
  - But beginning to suffer similar constraints as space
  - And technologies to tackle exa challenges are very relevant, especially energy/power
http://www.ll.mit.edu/HPEC/agendas/proc09/Day1/S1_0955_Kogge_presentation.ppt
82
Power fundamentals – Exascale: Processor and Memory
• Modern processors being designed today (for 2010) dissipate about 200 pJ/op total. This is ~200 W/TF in 2010.
• In 2018 we might be able to drop this to 10 pJ/op, i.e. ~10 W/TF.
• This is then 16 MW for a sustained HPL Exaflops.
• This does not include memory, interconnect, I/O, power delivery, cooling or anything else.
• Cannot afford separate DRAM in an Exa-ops machine!
• Propose a MIP machine with aggressive voltage scaling on 8 nm.
• Might get to 40 kW/PF – 60 MW for sustained Exa-ops.
Source: William J Camp, Intel, http://www.lanl.gov/orgs/hpc/salishan/pdfs/Salishan%20slides/Camp2.pdf
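A small arithmetic sketch of the pJ/op figures above (energy per operation times operation rate gives power); the 16 MW quoted for sustained HPL presumably also covers the gap between peak and sustained, which this simple conversion ignores:

```c
/* pJ/op -> power conversion for the figures on this slide. */
#include <stdio.h>

int main(void) {
    double pj_per_op_2010 = 200.0, pj_per_op_2018 = 10.0;
    double teraflop = 1e12, exaflop = 1e18;
    printf("2010: %.0f W per sustained TF\n", pj_per_op_2010 * 1e-12 * teraflop);
    printf("2018: %.0f W per sustained TF\n", pj_per_op_2018 * 1e-12 * teraflop);
    printf("2018: %.0f MW per sustained EF (cores only)\n",
           pj_per_op_2018 * 1e-12 * exaflop / 1e6);
    return 0;
}
```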
83
Power fundamentals - Exascale: Interconnect, I/O, Power and Cooling
• Interconnect: for short distances still Cu, off board Si photonics. Need ~0.1 B/Flop interconnect. Assume (a miracle) 5 mW/Gbit/sec: ~50 MW for the interconnect!
• I/O: optics is the only choice; 10-20 PetaBytes/sec, ~a few MW (a swag)
• Power and cooling: still 30% of the total power budget in 2018!
Total power requirement in 2018: 120 - 200 MW!
Source: William J Camp, Intel, http://www.lanl.gov/orgs/hpc/salishan/pdfs/Salishan%20slides/Camp2.pdf
84
An inefficient truth - ICT impact on CO2 emissions*
• It is estimated that the ICT industry alone produces CO2 emissions equivalent to the carbon output of the entire aviation industry. Direct emissions of Internet and ICT amount to 2-3% of world emissions and are expected to grow to 6+% within a decade.
• ICT emissions growth is the fastest of any sector in society; expected to double every 4 to 6 years with current approaches.
• One small computer server generates as much carbon dioxide as an SUV with a fuel efficiency of 15 miles per gallon.
*An Inefficient Truth: http://www.globalactionplan.org.uk/event_detail.aspx?eid=2696e0e0-28fe-4121-bd36-3670c02eda49
85
1000 Years of CO2 and Global Temperature Change
[Chart: global temperature change (deg F) and CO2 concentration (ppm), year 1000 to 2000]
Source: Jennifer Allen, ACIA 2004; http://alaskaconservationsolutions.com/acs/images/stories/docs/AkCS_current.ppt
86
Arctic Summer Ice Melting Accelerating
Ice Volume km3
Ice area millions km2, September minimum
Source: www.copenhagendiagnosis.org
IPCC = Intergovernmental Panel on Climate Change
87
Source: Iskhaq Iskandar, http://www.jsps.go.jp/j-sdialogue/2007c/data/52_dr_iskander_02.pdf
1/3 due to melting glaciers
2/3 due to expansion from warming oceans
Source: Trenberth, NCAR 2005
88
Source: http://alaskaconservationsolutions.com/acs/images/stories/docs/AkCS_current.ppt
Global Cataclysmic Concerns
89
Ocean Acidification
Animals with calcium carbonate shells -- corals, sea urchins, snails, mussels, clams, certain plankton, and others -- have trouble building skeletons, and shells can even begin to dissolve. “Within decades these shell-dissolving conditions are projected to be reached and to persist throughout most of the year in the polar oceans.” (Monaco Declaration 2008)
• Pteropods (an important food source for salmon, cod, herring, and pollock) likely not able to survive at CO2 levels predicted for 2100 (600 ppm, pH 7.9) (Nature 9/05)
• Coral reefs at serious risk; at doubled CO2 they stop growing and begin dissolving (GRL 2009)
• Larger animals like squid may have trouble extracting oxygen
• Food chain disruptions
[Photos: pteropod, squid, clam - all photos this page courtesy of NOAA; Global Warming: The Greatest Threat © 2006 Deborah L. Williams]
Source: http://alaskaconservationsolutions.com/acs/images/stories/docs/AkCS_current.ppt
Global Cataclysmic Concerns
90
Severe Weather
Hurricanes:
For 1925 - 1995 the US cost was $5 billion/yr for a total of 244 landfalls. But, hurricane Andrew alone caused damage in excess of $27 billion.
The US loss of life has gone down to <20/yr typically. The Galveston “Great Hurricane” year 1900 caused over 6,000 deaths.
Since 1990 the number of landfalls each year is increasing.
Warnings and emergency response costs on average $800 million/yr. Satellites, forecasting efforts and research cost $200 – 225 million/yr.
Katrina, August 28, 2005
Andrew: ~$27B (1992)
Charley: ~$ 7B (2004)
Hugo: ~$ 4B (1989) ($6.2B 2004 dollars)
Frances: ~$ 4B (2004)
Katrina: > $100B (2005)
91
Tornados
http://en.wikipedia.org/wiki/Tornado
Anadarko, Oklahoma
http://en.wikipedia.org/wiki/File:Dimmit_Sequence.jpg
Courtesy Kelvin Droegemeier
www.drjudywood.com/.../spics/tornado-760291.jpg
http://g.imagehost.org/0819/tornado-damage.jpg
http://www.crh.noaa.gov/mkx/document/tor/images/tor060884/damage-1.jpg
http://www.miapearlman.com/images/tornado.jpg
92
Wildfires http://topnews.in/law/files/Russian-fires-control.jpg
http://msnbcmedia1.msn.com/j/MSNBC/Components/Photo/_new/100810-russianFire-vmed-218p.grid-6x2.jpg
Russia Wildfires 2010
http://www.tolerance.ca/image/photo_1281943312664-2-0_94181_G.jpg
img.ibtimes.com/www/data/images/full/2010/08/
http://legacy.signonsandiego.com/news/fires/weekoffire/images/mainimage4.jpg
In a single week, San Diego County wildfires killed 16 people, destroyed nearly 2,500 homes and burned nearly 400,000 acres. Oct 2003
Russia may have lost 15,000 lives already, and $15 billion, or 1% of GDP, according to Bloomberg. The smog in Moscow is a driving force behind the fires' deadly impact, with 7,000 killed already in the city. Aug 10, 2010
93
Floods
UK, June – July 2007: 13 deaths, more than 1 million affected, cost about £6 billion
TN, KY, MS, April 30 – May 7, 2010: 31 deaths. Nashville Mayor Karl Dean estimates the damage from weekend flooding could easily top $1 billion.
China (Bloomberg, Aug 17, 2010): 1,450 deaths through Aug 6; on Aug 7, 1,254 killed in a mudslide with 490 missing
94
Houston, Ozone and Health
95
Houston August 2011 Daily Temperatures
[Chart: daily maximum and minimum temperatures (°F) for August 1-31, 2011, compared to the historic average maximum and minimum]
2011 Avg Max 102.03
Historic Avg Max 94.55
2011 Avg Min 78.65
Historic Avg Min 74.81
Avg Max +7.5F (4.2C)
Avg Min +3.8F (2.1C)
Warmest July – August on record in Texas
Warmest July – August on record of any US state
96
Energy efficiency of HPC systems
97
Power Consumption vs. Load for a typical server (2008/2009)
Luiz Andre Barroso, Urs Hoelzle, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, http://www.morganclaypool.com/doi/pdf/10.2200/s00193ed1v01y200905cac006
CPU power consumption at low load is about 40% of the consumption at full load. Power consumption of all other system components is approximately independent of load. Result: power consumption at low load is about 65% of the consumption at full load.
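A minimal sketch of the load model above; the 40%-at-low-load CPU figure and the load-independence of the other components are from the slide, while the ~58% CPU share of full-load node power is an assumed split chosen so the result reproduces the quoted ~65%:

```c
/* Simple server power-vs-load model. */
#include <stdio.h>

int main(void) {
    double cpu_share = 0.58;        /* assumed CPU fraction of full-load node power */
    double cpu_idle_factor = 0.40;  /* CPU at low load draws ~40% of its full-load power */
    double rest_share = 1.0 - cpu_share;  /* other components, approximately load independent */
    double low_load = cpu_share * cpu_idle_factor + rest_share;
    printf("Power at low load: %.0f%% of full load\n", low_load * 100.0);
    return 0;
}
```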
98
Figure 1. Average CPU utilization of more than 5,000 servers during a six-month period. Servers are rarely completely idle and seldom operate near their maximum utilization, instead operating most of the time at between 10 and 50 percent of their maximum utilization levels.
“The Case for Energy-Proportional Computing”, Luiz André Barroso, Urs Hölzle, IEEE Computer, vol. 40 (2007).http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//pubs/archive/33387.pdf
99
Internet vs HPC Workloads
Google (Internet) vs. KTH/PDC (HPC)
100
What type of Architecture?
Exascale Computing Technology Challenges, John ShalfNational Energy Research Supercomputing Center, Lawrence Berkeley National LaboratoryScicomP / SP-XXL 16, San Francisco, May 12, 2010
101
What type of Architecture?Instruction Set Architecture
Exascale Computing Technology Challenges, John Shalf, National Energy Research Supercomputing Center, Lawrence Berkeley National Laboratory ScicomP / SP-XXL 16, San Francisco, May 12, 2010
102
What type of Architecture?
Exascale Computing Technology Challenges, John ShalfNational Energy Research Supercomputing Center, Lawrence Berkeley National LaboratoryScicomP / SP-XXL 16, San Francisco, May 12, 2010
103
What kind of architecture (core)
http://www.csm.ornl.gov/workshops/SOS11/presentations/j_shalf.pdf
• Cubic power improvement with lower clock rate due to V2F
• Slower clock rates enable use of simpler cores
• Simpler cores use less area (lower leakage) and reduce cost
• Tailor design to application to reduce waste
104
Green500 Rank, June 2011
Rank | MFLOPS/W | Site | Computer | Total Power (kW)
1 | 2097.2 | IBM Thomas J. Watson Research Center | NNSA/SC Blue Gene/Q Prototype 2 | 40.95
2 | 1684.2 | IBM Thomas J. Watson Research Center | NNSA/SC Blue Gene/Q Prototype 1 | 38.8
3 | 1375.9 | Nagasaki University | DEGIMA Cluster, Intel i5, ATI Radeon GPU, Infiniband QDR | 34.24
4 | 958.35 | GSIC Center, Tokyo Institute of Technology | HP ProLiant SL390s G7 Xeon 6C X5670, Nvidia GPU, Linux/Windows | 1243.8
5 | 891.88 | CINECA / SCS - SuperComputing Solution | iDataPlex DX360M3, Xeon 2.4, nVidia GPU, Infiniband | 160
6 | 824.56 | RIKEN Advanced Institute for Computational Science (AICS) | K computer, SPARC64 VIIIfx 2.0GHz, Tofu interconnect | 9898.6
7 | 773.38 | Forschungszentrum Juelich (FZJ) | QPACE SFB TR Cluster, PowerXCell 8i, 3.2 GHz, 3D-Torus | 57.54
8 | 773.38 | Universitaet Regensburg | QPACE SFB TR Cluster, PowerXCell 8i, 3.2 GHz, 3D-Torus | 57.54
9 | 773.38 | Universitaet Wuppertal | QPACE SFB TR Cluster, PowerXCell 8i, 3.2 GHz, 3D-Torus | 57.54
10 | 718.13 | Universitaet Frankfurt | Supermicro Cluster, QC Opteron 2.1 GHz, ATI Radeon GPU, Infiniband | 416.78
11 | 677.12 | Georgia Institute of Technology | HP ProLiant SL390s G7 Xeon 6C X5660 2.8GHz, nVidia Fermi, Infiniband QDR | 94.4
12 | 650.3 | National Institute for Environmental Studies | Asterism ID318, Intel Xeon E5530, NVIDIA C2050, Infiniband | 115.87
13 | 635.15 | National Supercomputing Center in Tianjin | NUDT TH MPP, X5670 2.93Ghz 6C, NVIDIA GPU, FT-1000 8C | 4040
14 | 565.97 | Yukawa Institute for Theoretical Physics (YITP) | Hitachi SR16000 Model XM1/108, Power7 3.3Ghz, Infiniband | 129.6
15 | 555.5 | CSIRO | Supermicro Xeon Cluster, E5462 2.8 Ghz, Nvidia Tesla S2050 GPU, Infiniband | 94.6
16 | 492.64 | National Supercomputing Centre in Shenzhen (NSCS) | Dawning TC3600 Blade, Intel X5650, NVidia Tesla C2050 GPU | 2580
17 | 483.66 | IBM Thomas J. Watson Research Center | Power 750, Power7 3.86 GHz, 10GigE | 120.56
18 | 467.73 | CeSViMa - Centro de Supercomputación y Visualización de Madrid | BladeCenter PS702 Express, Power7 3.3GHz, Infiniband | 154
19 | 458.33 | DOE/NNSA/LANL | BladeCenter QS22/LS21 Cluster, PowerXCell 8i 3.2 Ghz / Opteron DC 1.8 GHz, Infiniband | 276
20 | 444.25 | DOE/NNSA/LANL | BladeCenter QS22/LS21 Cluster, PowerXCell 8i 3.2 Ghz / Opteron DC 1.8 GHz, Voltaire Infiniband | 2345.5
105
Power Usage Effectiveness (PUE)
Slide courtesy Michael K Patterson, Intel, 2nd European Workshop on HPC Centre Infrastructure, Dourdan, France, 2010-10-06--08
106
Power Utilization Efficiency
• Traditionally about 3
• State-of-the-art today 1.05 – 1.2!
• How?
  - "Free" cooling
  - Judicious attention to airflow
  - Improved efficiency in power distribution and supply within the data center
  - Increased data center temperature
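A minimal illustration of what PUE means for the facility bill (total facility power = PUE x IT load); the 800 kW IT load is PDC's approximate figure quoted later in this deck, used here only as an example:

```c
/* Facility power and overhead for two PUE levels at a fixed IT load. */
#include <stdio.h>

static double facility_kw(double it_kw, double pue) { return it_kw * pue; }

int main(void) {
    double it_load_kw = 800.0;  /* assumed IT load, roughly PDC's figure */
    printf("PUE 3.0: %.0f kW total, %.0f kW overhead\n",
           facility_kw(it_load_kw, 3.0), facility_kw(it_load_kw, 3.0) - it_load_kw);
    printf("PUE 1.1: %.0f kW total, %.0f kW overhead\n",
           facility_kw(it_load_kw, 1.1), facility_kw(it_load_kw, 1.1) - it_load_kw);
    return 0;
}
```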
109
Google ….
Google, Dalles, OR; patent filed Dec 30, 2003
Energy terms:
• EUS1: energy consumption for type 1 unit substations feeding the cooling plant, lighting, and some network equipment
• EUS2: energy consumption for type 2 unit substations feeding servers, network, storage, and CRACs
• ETX: medium and high voltage transformer losses
• EHV: high voltage cable losses
• ELV: low voltage cable losses
• ECRAC: CRAC energy consumption
• EUPS: energy loss at UPSes which feed servers, network, and storage equipment
• ENet1: network room energy fed from type 1 unit substation

PUE = (EUS1 + EUS2 + ETX + EHV) / (EUS2 + ENet1 - ECRAC - EUPS - ELV)

Q1 2011:
• Quarterly energy-weighted average PUE: 1.13
• TTM energy-weighted avg. PUE: 1.16
• Individual facility minimum quarterly PUE: 1.09, Data Center E
• Individual facility minimum TTM PUE*: 1.11, Data Center J
• Individual facility maximum quarterly PUE: 1.22, Data Center C
• Individual facility maximum TTM PUE*: 1.21, Data Center C
* Only facilities with at least twelve months of operation are eligible for Individual Facility Trailing Twelve Month (TTM) PUE reporting
http://www.google.com/corporate/datacenter/efficiency-measurements.html
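A direct transcription of Google's PUE formula above into a small helper; the struct fields mirror the energy terms defined on the slide, and the example numbers are made up purely to illustrate use:

```c
/* Component-based PUE as defined on the slide. */
#include <stdio.h>

struct dc_energy {
    double eus1, eus2, etx, ehv, elv, ecrac, eups, enet1; /* energy over the period */
};

static double pue(const struct dc_energy *e) {
    return (e->eus1 + e->eus2 + e->etx + e->ehv) /
           (e->eus2 + e->enet1 - e->ecrac - e->eups - e->elv);
}

int main(void) {
    /* Illustrative (made-up) values in arbitrary energy units. */
    struct dc_energy e = { 120.0, 1000.0, 10.0, 5.0, 8.0, 60.0, 30.0, 15.0 };
    printf("PUE = %.2f\n", pue(&e));
    return 0;
}
```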
110
Google and Clean Energy
• The Hamina data center in Finland (previously the Summa paper mill): cooling water from the Gulf of Finland (no chillers); four new wind turbines built; Telia-Sonera cable
• The Belgian data center was designed without chillers. If the air at the Saint-Ghislain, Belgium, data center gets too hot, Google shifts the data center's compute loads to other facilities.
111
Facebook – Prineville Data Center
Facebook's Prineville, OR, 147,000-square-foot custom data center, with an estimated cost of $188.2 million, was brought into operation in the summer of 2011. The site was chosen because of its very dry and relatively cool climate. For 60 – 70% of the time, cooling will be achieved by using cold air from outside. Excess heat from servers will be used to warm office space in the facility.
PUE 1.07 – 1.08
112
Clean Energy - Facebook
• The 120 MW Lulea Data Center will consist of three server buildings with an area of 28,000 m2 (300,000 ft2). The first building is to be operational within a year and the entire facility is scheduled for completion by 2014.
• The Lulea river has an installed hydroelectric capacity of 4.4 GW and produces on average 13.8 TWh/yr
Read more: http://www.dailymail.co.uk/sciencetech/article/2054168/Facebook-unveils-massive-data-center-Lulea-Sweden.html#ixzz1diMHlYIL
Climate data for Luleå, Sweden (retrieved 2011-10-28)
Month                  Jan      Feb      Mar      Apr     May     Jun     Jul     Aug     Sep     Oct     Nov     Dec      Year
Average high °C (°F)   −8 (18)  −8 (18)  −2 (28)  3 (37)  10 (50) 17 (63) 19 (66) 17 (63) 12 (54) 6 (43)  −1 (30) −5 (23)  5.0 (41.0)
Average low °C (°F)    −16 (3)  −16 (3)  −11 (12) −4 (25) 2 (36)  8 (46)  11 (52) 10 (50) 5 (41)  1 (34)  −7 (19) −13 (9)  −2.5 (27.5)
113
Joint Nordic Proof of Concept Research Computing Center (Denmark, Norway, Sweden)
• Located at the Thor DataCenter, Reykjavik
• Iceland electric energy: 70% hydro, 30% geothermal; carbon free, sustainable
• Free cooling – PUE in the 1.1 – 1.2 range; 1.07 for containerized equipment. All-time high temperature in Reykjavik: 24.8 C, annual average ~5 C.
114
Energy Efficient Computing Projects at PDC
115
SNIC/KTH PRACE Prototype
[Chassis diagram, labels: 1620 W PSUs + cooling fans; PSU dummy when PSU not used; 40 Gb InfiniBand switch with 18 external ports; 1/10 Gb Ethernet switch; 1 Gb Ethernet switch; CMM (Chassis Management Module)]
• New 4-socket blade with 4 DIMMs per socket supporting PCI-Express Gen 2 x16
• Four 6-core 2.1 GHz 55 W ADP AMD Istanbul CPUs, 32 GB/node
• 10 blades in a 7U chassis with 36-port QDR IB switch, new efficient power supplies
• 2 TF/chassis, 12 TF/rack, 30 kW (6 x 4.8)
• 180 nodes, 4,320 cores, full bisection QDR IB interconnect
Network:
• QDR Infiniband
• 2-level fat-tree
• Leaf-level 36-port switches built into chassis
• Five external 36-port switches
116
The SNIC/KTH/PRACE Prototype I 2009 (Povel)
Component                         Power (W)   Percent (%)
CPUs                              2,880       56.8
Memory 1.3 GB/core                800         15.8
PS                                355         7.0
Fans                              350         6.9
Motherboards                      300         5.9
HT3 Links                         120         2.4
IB HCAs                           100         2.0
IB Switch (36 ports)              100         2.0
GigE Switch                       40          0.8
CMM (Chassis Management Module)   20          0.4
Total                             5,065       100.0
Not in prototype nodes
117
SNIC/KTH/PRACE Prototype I
118
Nominal Energy Efficiency of Mobile CPUs, x86 CPUs and GPUs
Chip               GF/W    W      Cores
ATI 9370           ~2.3    225    1600
Intel 6-core       ~0.6    130    6
AMD 12-core        ~0.9    115    12
ATOM               ~0.5    2+2
ARM Cortex-9       ~0.5    ~2     4
ClearSpeed CX700   ~10     10     192
IBM BQC            3.7     55     16
TMS320C6678        ~1.5    4      8
nVidia Fermi       ~2.2    225    512
Very approximate estimates!!
KTH/SNIC/PRACE Prototype II
119
KTH/SNIC/PRACE DSP HPC node
Target: 15 – 20 W, 32 GB, 2.5 GF/W Linpack
[Diagram labels: 4-core host, 50 Gbps interconnect]
120
KTH/SNIC/PRACE DSP HPC nodeDSP Integration Model 1: Accelerator
121
KTH/SNIC/PRACE DSP HPC node
• MPI ranks only on the host CPU
• DSP executes computational kernels
• Data passes between hosts and from host to DSP (the transfer can be optimized)
• Simple model, ARM in control
• Can gradually port applications
• Limited gains due to synchronization and data staging issues
DSP Integration Model 1: Accelerator (a minimal sketch follows below)
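A minimal sketch of the accelerator model, under assumed helper names: the MPI rank lives on the host ARM, stages data to the DSP, launches the kernel and collects the result; dsp_copy_in/dsp_run_kernel/dsp_copy_out are hypothetical stand-ins for whatever offload runtime the real prototype uses:

```c
#include <stddef.h>
#include <mpi.h>

/* Placeholder offload hooks; the real prototype would map these onto its
 * DSP runtime (the names here are hypothetical). */
static void dsp_copy_in(const double *src, size_t n)   { (void)src; (void)n; }
static void dsp_run_kernel(const char *name, size_t n) { (void)name; (void)n; }
static void dsp_copy_out(double *dst, size_t n)        { (void)dst; (void)n; }

/* One step of the accelerator model: the host MPI rank exchanges halos
 * with its neighbours, then hands the bulk work to the DSP. */
void accelerator_step(double *buf, int n, int left, int right, MPI_Comm comm) {
    MPI_Sendrecv_replace(buf, n, MPI_DOUBLE, right, 0, left, 0,
                         comm, MPI_STATUS_IGNORE);   /* host-to-host transfer  */
    dsp_copy_in(buf, (size_t)n);                     /* stage data to the DSP   */
    dsp_run_kernel("stencil", (size_t)n);            /* DSP runs the kernel     */
    dsp_copy_out(buf, (size_t)n);                    /* results back for next exchange */
}
```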
122
KTH/SNIC/PRACE DSP HPC nodeDSP Integration Model 2: “Pure DSP”
123
KTH/SNIC/PRACE DSP HPC node
• MPI processes run only on DSPs
• System calls are forwarded to the host
• Host handles control and system call requests
• DSPs communicate directly with each other
• Application sees a homogeneous machine
• Need to provide many system services
• Alternatively, the DSP also runs some OS
• Need to port lots of software, libraries etc.
DSP Integration Model 2: “Pure DSP”
124
KTH/SNIC/PRACE DSP HPC nodeDSP Integration Model 3: Hybrid
125
KTH/SNIC/PRACE DSP HPC node
• MPI processes on both host and DSPs
• Host processes communicate with the OS and libraries
• DSP processes do the number crunching
• Some system calls (e.g. printf) are forwarded (convenience)
• Communication can bypass the host
• Can use OS and libraries already ported to ARM
• Programmer sees two different codes
DSP Integration Model 3: Hybrid (a minimal sketch follows below)
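A minimal sketch of the hybrid model: MPI ranks exist on both host and DSPs, DSP ranks do the number crunching and communicate directly, and selected system calls are forwarded to the host rank; the rank placement and the forward_to_host() helper are assumptions for illustration only:

```c
#include <string.h>
#include <mpi.h>

#define HOST_RANK 0  /* assumed: rank 0 runs on the ARM host */

/* Hypothetical system-call forwarding: a DSP rank ships the request to
 * the host rank, which performs the real I/O on its behalf. */
static void forward_to_host(const char *msg, MPI_Comm comm) {
    MPI_Send(msg, (int)strlen(msg) + 1, MPI_CHAR, HOST_RANK, 99, comm);
}

void hybrid_step(double *buf, int n, int left, int right, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    if (rank == HOST_RANK) {
        char msg[256];
        for (int i = 1; i < size; ++i)          /* serve one forwarded request per DSP rank */
            MPI_Recv(msg, sizeof msg, MPI_CHAR, MPI_ANY_SOURCE, 99, comm,
                     MPI_STATUS_IGNORE);
    } else {
        MPI_Sendrecv_replace(buf, n, MPI_DOUBLE, right, 0, left, 0,
                             comm, MPI_STATUS_IGNORE); /* DSP-to-DSP, bypassing the host */
        /* ... number crunching on the DSP rank ... */
        forward_to_host("step done", comm);     /* convenience syscall forwarding */
    }
}
```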
126
KTH/SNIC/PRACE DSP HPC node
• Hybrid approach is our choice
• Development can start on a "normal" cluster
• Most applications are already somewhat separated into I/O and computational parts
• IBM BlueGene and Cray are similar
DSP Integration Model: Summary
127
KTH/SNIC/PRACE DSP HPC node
• Control the DSPs (reset, start, download code)
• Manage the interconnect (routing tables, link failures, topology discovery)
• Provide simple system-call forwarding for the DSPs
• Allow debug access to the DSPs
• Execute legacy/OS code (shell, ssh, grep etc.)
• Provide I/O connection (TCP/IP networking, Lustre ...)
Host role in Hybrid approach
128
KTH/SNIC/PRACE DSP prototype
• Attached via HyperLink
• Looks like a memory-mapped I/O device
• Can do DMA into/out of DSP memory
• Can generate DSP interrupts
• Should be simple
• Allow many outstanding transfers
• Produce little jitter on the DSP side
DSP to Interconnect Interface
129
KTH/SNIC/PRACE DSP prototype
• Each top-level block in its own "page"
• One global register space for NIC-wide operation
• Several blocks (e.g. 64) for concurrent communication threads
• Each core can independently use some threads
Register Interface
130
KTH/SNIC/PRACE DSP prototypeTx/Rx Registers
131
KTH/SNIC/PRACE DSP prototype
• Each thread can transmit a single packet at a time
• Can do gather DMA from the DSP address space
• Short messages can be written directly to the Tx descriptor registers (to save DMA latency)
• Each message may get split into fragments if too long
Tx Operations
132
KTH/SNIC/PRACE DSP prototype
• Can match on source/tag
• One-shot or ring buffer operation
• Status can be polled
• Interrupt on packet completion
• Puts fragments into the correct position in the receive buffer
• Tracks status for each fragment (i.e. CRC error)
• Maybe support scatter DMA
(a register-layout sketch follows below)
Rx Operations
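A hypothetical C view of the Tx/Rx register layout described on the two slides above: one global block plus per-thread communication blocks, each in its own page, with descriptor fields for gather-DMA transmit and source/tag-matched receive. Field names and sizes are assumptions, not the actual prototype register map:

```c
#include <stdint.h>

#define NIC_COMM_THREADS 64          /* e.g. 64 concurrent communication threads */
#define NIC_PAGE_SIZE    4096

struct nic_tx_block {                 /* one transmit descriptor per thread */
    volatile uint64_t gather_addr;    /* DSP address for gather DMA              */
    volatile uint32_t length;         /* message length in bytes                 */
    volatile uint32_t dest;           /* destination node                        */
    volatile uint64_t inline_data[4]; /* short messages written directly here    */
    volatile uint32_t doorbell;       /* write to start transmission             */
    uint8_t pad[NIC_PAGE_SIZE - 52];  /* pad block to its own page               */
};

struct nic_rx_block {                 /* one receive context per thread */
    volatile uint64_t ring_base;      /* one-shot or ring buffer base            */
    volatile uint32_t match_src;      /* match on source ...                     */
    volatile uint32_t match_tag;      /* ... and tag                             */
    volatile uint32_t status;         /* polled, or raises interrupt on completion */
    volatile uint32_t frag_status;    /* per-fragment status, e.g. CRC error     */
    uint8_t pad[NIC_PAGE_SIZE - 24];  /* pad block to its own page               */
};

struct nic_regs {                     /* memory-mapped NIC register space */
    volatile uint32_t global[NIC_PAGE_SIZE / 4];   /* NIC-wide operation */
    struct nic_tx_block tx[NIC_COMM_THREADS];
    struct nic_rx_block rx[NIC_COMM_THREADS];
};
```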
133
Energy Measurement Overview
[Diagram: DC/DC converters for the DSP, DDR and AUX rails, each with a power readout (current measurement to be implemented); power samples are reported over UART (P_UART) with timestamps tsb, ts, te, tse marking the sampling and benchmark intervals]
E = ∫ P dt
Pavg = E / (tse – tsb)
EBM = E / (ts – te)
Sampling freq.: ?? Hz; accuracy: ??? %
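A minimal sketch of the measurement above: trapezoidal integration of sampled power gives E = ∫P dt, and dividing by the interval length gives the average power; the sample rate and readings below are placeholders:

```c
#include <stdio.h>

/* Integrate power samples p[0..n-1] (W) taken dt seconds apart -> Joules. */
static double energy_joules(const double *p, int n, double dt) {
    double e = 0.0;
    for (int i = 1; i < n; ++i)
        e += 0.5 * (p[i - 1] + p[i]) * dt;   /* trapezoid between samples */
    return e;
}

int main(void) {
    double samples[] = { 9.8, 10.2, 10.4, 10.1, 9.9 };  /* illustrative W readings */
    int n = sizeof samples / sizeof samples[0];
    double dt = 0.01;                                    /* assumed 100 Hz sampling */
    double e = energy_joules(samples, n, dt);
    printf("E = %.3f J, Pavg = %.2f W\n", e, e / ((n - 1) * dt));
    return 0;
}
```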
134
Early Benchmark Results
Benchmark Performance Energy
STREAM – L1 125.18 GB/s 122 pJ/Byte
STREAM – L2 47.6 GB/s 319 pJ/Byte
STREAM – DDR3 8.9 GB/s 2173 pJ/Byte
FFT 585-696 MFLOP/s 283-333 MFLOP/J
DGEMM 585 MFLOP/s 311 MFLOP/J
Theoretical peaks:
L1 bandwidth: 128 GB/s
L2 bandwidth: 2*64 GB/s
DDR bandwidth: 10.6 GB/s (DDR1333, 64-bit)
FFT: 48 GFLOP/s (4 add, 2 mul per cycle, double precision)
DGEMM: 32 GFLOP/s (2 add, 2 mul per cycle)
SGEMM: 128 GFLOP/s (TI implementation 72 GFLOP/s, 56%)
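A quick fraction-of-peak check of the measured bandwidths against the theoretical peaks above (the same percentages appear on the STREAM slide that follows):

```c
#include <stdio.h>

static void pct(const char *name, double measured, double peak) {
    printf("%-6s %7.2f of %7.2f GB/s = %4.1f %% of peak\n",
           name, measured, peak, 100.0 * measured / peak);
}

int main(void) {
    pct("L1",   125.18, 128.0);
    pct("L2",    47.6,   64.0);    /* per-direction peak, as used on the next slide */
    pct("DDR3",   8.9,   10.664);
    return 0;
}
```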
135
STREAM 6678 bandwidth test, 8 cores
[Chart: measured bandwidth (GB/s) vs. data set size in bytes]
L1: 128 GB/s peak, 125 GB/s measured (98% of peak)
L2: 2x64 GB/s peak, 48 GB/s measured (75% of peak)
DDR 1333 MHz: 10.664 GB/s peak, 8.9 GB/s measured (83% of peak)
(Better than TI's results!! Telecon comment)
136
STREAM 6678 Bandwidth test 8 cores
[Chart: energy (W per GB/s) vs. data set size in bytes]
Energy measured for the entire EVM with the on-board emulator
137
FFT – TI result comp.
Platform                  Time per 1024-point complex-to-complex FFT (single precision), μs   Power (W)   Energy per FFT (μJ)
DSP: TI C6678 @ 1.2 GHz   0.85                                                                 10          8.5
DSP: C6678 @ 1 GHz        317.47                                                               16.78       5,327
138
An aside on PDC Green Data Center Projects
139
Heat Reuse Project
• Background: today around 800 kW used at PDC
• Project started 2009 to re-use this energy
• Goals: save cooling water for PDC, save heating costs for KTH, save the environment
• Use district cooling pipes for heating when no cooling is required
• No heat pumps
• Starting with the Cray
• First phase of the Cray will heat the KTH Chemistry building
140
PDC Energy Recovery Project
[Diagram: cabinet row with existing CRAC. Under-floor temperature: normal 15-16°C (59-60°F), max 17°C (62°F). Cabinet exhaust temperature: normal 35-43°C (95-109°F), max 52°C (126°F). Heat recovery coil: water inlet 18°C (64°F), water outlet 28°C (84°F). Exhaust to room: 22°C (72°F). Heights: 2800 mm (110.2"), 1200 mm (47.2"), 300 mm (11.8")]
141
PDC Energy Recovery Project
142
Immersion Cooling
http://www.grcooling.com
PDC is evaluating this technology
143
What’s Next?
144
Plans
• Hardware:
  - Design and build an FPGA switch for interfacing to TI Hyperlink (50 Gbps) and ARM/Calxeda (10 GigE, XAUI, …)
  - TI has now signed a contract for FPGA – Hyperlink IP
  - ………….
• Assess Advantech 4x6678 PCIe card
• Assess TI telecom ARM+Shannon card (not generally available)
• Assess Calxeda
• Q2 2012
145
Movidius Myriad 65nm Media Processor
Source: David Moloney, http://www.hotchips.org/hc23
180 MHz
Next Generation 28 nm: Estimate 250 – 350 GF/W!
146
Objective
• Evaluate the SoC for HPC Applications
Exceptional nominal energy efficiency at the SoC level (350 GF/W, single-precision, incl. memory)
What energy efficiencies can be achieved at application level?
What amount of memory can be stacked/in package at what performance level at what cost?
What SoC enhancements are desirable and feasible for HPC at what cost?
How best to integrate communication at chip and board level?
Explore the toolchain and software ecosystem
• Influence future upcoming products
147
Movidius 10 PFLOPS Strawman
Compute card (32 mm x 25 mm): $200/board, 615 DP GFLOPS @ 2.8 W (8 x 4 x 16 x 800 MHz), 8 x 128 MB DDR3 @ 1.2 GHz, 76.8 GB/s memory bandwidth (8 x 9.6)
Node card: 9,840 GFLOPS, 16 compute cards, 45 W
Cabinet: 10 petaFLOPS, 1,024 nodes, 46 kW, 40 sq ft
10 PFLOPS in a single BlueGene/L cabinet
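The arithmetic behind the strawman above, scaling one compute card to a 16-card node card and a 1,024-node cabinet:

```c
#include <stdio.h>

int main(void) {
    double card_gflops = 615.0, card_watts = 2.8;   /* per compute card (from the slide) */
    double node_gflops = 16 * card_gflops;          /* ~9,840 GFLOPS */
    double node_watts  = 16 * card_watts;           /* ~45 W         */
    double cab_pflops  = 1024 * node_gflops / 1e6;  /* ~10 PFLOPS    */
    double cab_kw      = 1024 * node_watts  / 1e3;  /* ~46 kW        */
    printf("node card: %.0f GFLOPS, %.1f W\n", node_gflops, node_watts);
    printf("cabinet:   %.2f PFLOPS, %.1f kW\n", cab_pflops, cab_kw);
    return 0;
}
```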
148
10 PFLOPS Comparison
Name             | CPU GHz | Flops/Clock/Core | Pk Core GFLOPS | Cores/Socket | Watts/Sckt | Sub-domains/Sckt | MBytes/Socket | Mem BW GB/s | Pk Bytes/FLOP | Netwk BW (GB/s) | # M Sockets | Tot. Power MW | Tot. Cost $M | Tot. petaFLOPS | $/socket | $/GFLOP
AMD Opteron      | 2.8     | 2                | 5.6            | 2            | 95         | 22.4             | 112           | 6.4         | 0.57          | 0.57            | 0.89        | 179           | 1799.6       | 9.97           | 2022     | 180.54
IBM BG/P PPC440  | 0.7     | 4                | 2.8            | 2            | 15         | 11.2             | 56            | 5.5         | 0.98          | 0.98            | 1.78        | 27            | 2600.6       | 9.97           | 1461     | 260.89
Tensilica Custom | 0.65    | 4                | 2.6            | 32           | 22         | 172.8            | 864           | 51.2        | 0.62          | 0.62            | 0.12        | 2.5           | 75           | 9.98           | 625      | 7.51
Movidius Fragrak | 0.8     | 6                | 4.8            | 128          | 2.8        | 204.8            | 1024          | 76.8        | 0.13          | 0.4             | 0.016       | 0.0455        | 3.25         | 9.98           | 200      | 0.33
http://www.hpcuserforum.com/presentations/Germany/EnergyandComputing_Stgt.pdfhttp://www.lbl.gov/cs/html/greenflash.htmlhttp://www.tensilica.com/uploads/pdf/ieee_computer_nov09.pdf
http://en.wikipedia.org/wiki/FLOPS
149
The Prototype
• Streaming Hybrid Architecture Vector Engine (SHAVE): developed for mobile gaming and video applications; is a processor architecture and development environment; contains elements of RISC, DSP, VLIW & GPU architectures
• Myriad SoC Platform (65 nm): 8 cores on a chip, 17 GFLOP/s @ 0.35 W -> ~50 GFLOP/W; contains a SPARC (LEON) control core; stacked together with a low-power DRAM die
• Fragrak Platform (28 nm, internal testing Q1 2013): possibility to add double-precision floating point support; 16 cores, 250 GFLOP/W DP -> 1e18 FLOP/s in 4 MW
150
Fragrak 28nm Platform
[Block diagram: 16 SHAVE cores in four clusters, each SHAVE with 128 kB CMX (256 kB per cluster), connected by Intra-Cluster Buses (ICB) and a Xtra-Cluster Bus (XCB) over a 64/128-bit main bus to a 512 kB L2 cache, a RISC control core, a stacked 256/512 MB LPDDR3 SDRAM die, and SW-controlled multiplexed I/O (SDIO, SPI, I2C, I2S, UART, USB2 OTG, LCD, MIPI DSI/CSI, JTAG, GPS, flash, timers); Movidius IP; 450 GFLOPS/W (IEEE 754 SP)]
151
16 SHAVEs, 1 LEON, 150 GFLOPS, 350 mW, 3Q 2012. Option: DP FPU
152
Movidius 28nm 64-bit FLOPS
[Diagram: VAU with 64-bit FP add and multiply units operating on the 32x128 VRF (4 FLOPS/cycle); SAU with 64-bit FP add and multiply units operating on the 32x32-bit (16x64-bit) SRF (2 FLOPS/cycle); CMU]
Total: 6 x 64-bit FLOPS/cycle
153
PRACE 1IP DSP prototype
[Diagram: two TI DSP EVMs and an ARM EVM connected via two FPGA EVMs, with power instrumentation]
154
SHAVE 28nm Processor (Fragrak)
[Block diagram: SHAVE variable-length instruction processor with VRF 32x128, SRF 32x32, IRF 32x32, VAU, SAU, IAU, CMU, PEU, BRU, DCU, two LSUs, and an IDC with a 1 kB cache of decoded instructions; 256 kB CMX SRAM per SHAVE; 16 kB L1 cache and 512 kB 2-way L2 cache; 800 MHz Intra-Cluster Bus (ICB); 128-bit AXI Xtra-Cluster Bus (XCB); LPDDR3 controller to a stacked 256 - 512 MB SDRAM die.
Bandwidths: 16x12x800 MHz = 76.8 GB/s; 4x17x800 MHz = 54.4 GB/s; 4x12x800 MHz = 38.4 GB/s; 16x2x800 MHz = 25.6 GB/s; 16x800 MHz = 12.8 GB/s; 8x2x800 MHz = 12.8 GB/s]
155
Movidius 28nm BW Hierarchy
[Bandwidth pyramid: Registers (V/S/IRF) 4,864 GB/s; CMX SRAM / ICB, L1 cache, L2 cache / XCB 115 GB/s; SDRAM 6.4 GB/s (2x DDR3 1600); ratios 42:1 and 18:1]
Bottom line - very high sustainable performance
156
Movidius 28nm BW Hierarchy (Detail)
           VRF     SRF    IRF    LSU    IDC    L1     ICB    L2     XCB     SDRAM
Clk (MHz)  800     800    800    800    800    800    800    800    800     800
Bytes      16      4      4      8      16     8      16     16     16      4
Ports      12      12     17     2      1      1      2      1      8       2
BW (GB/s)  153.6   38.4   54.4   12.8   12.8   6.4    25.6   12.8   102.4   6.4
#SHAVEs    16      16     16     16     16     16     16
Total BW   2457.6  614.4  870.4  204.8  204.8  102.4  409.6
157
PRACE Precompetitive Procurement (PCP)
• Objective: increase European technology industry engagement in HPC
• Scope: energy-efficient HPC systems
• Funding: 5 plus 5+ M€
• Timeframe: procurement late 2012 / early 2013
158
Thank You!