1
Energy Efficient Computing
Lennart Johnsson, Gilbert Netzer, KTH
2
OUTLINE
• Who we are and what we do
• Scientific and Engineering computation
• Application characterization
• Data center energy challenges
• PRACE
• KTH energy efficient computing projects
• KTH – TI project
• What is next
3
PDC's Mission
Infrastructure
Operation of a high-end infrastructure for HPC, data services, user support and training for Swedish research on behalf of the Swedish National Infrastructure for Computing (SNIC), collaborative international and national consortia, and research groups at KTH and Stockholm University
Research
Conduct world-class research and education in parallel and distributed computing methodologies and tools
4
SNIC
• The Swedish meta-center for large-scale computing and data storage. Formed 2003.
• Organized within the Swedish Research Council with a budget of about 100 MSEK
• Mission:
  - Provide research computing resources for Swedish academic research mainly through six university-based computing centers
  - Coordinate investments and competence across the centers
  - Merit-based resource allocation through the SNIC National Resource Allocations Committee (SNAC) through RFPs every 6 months
  - Fund and coordinate minor development projects
  - Host the Swedish National Graduate School in Scientific Computing (NGSSC)
5
SNIC
• HPC2N (Umeå)
• UPPMAX (Uppsala)
• PDC (Stockholm)
• NSC (Linköping)
• C3SE (Göteborg)
• LUNARC (Lund)
• About 300 user groups (1-50 researchers each)
• Services:
  - A few large-scale computing systems
  - Foundation-level computer systems, storage and user support at all centers
  - Coordinated access to European-level initiatives
  - SweGrid initiated 2003
  - SweStore initiated 2008
  - Advanced user support effort initiated 2010
6
PDC Computing Resources
• Ferlin and SweGrid - Dell Cluster, SNIC Foundation Level Service: 6,120 cores (765 nodes, 2 quad-core Intel), 7 TByte memory, 32 nodes with InfiniBand
• Ekman - Dell PowerEdge Cluster, climate and flow research: 10,144 cores (1,268 nodes, 2 quad-core AMD), 89 TF theoretical peak performance, 20 TByte memory
• Key - HP SMP: 32 cores, 256 GB memory
• Hebb - IBM Blue Gene/L, Stockholm Brain Institute, Mechanics, and INCF: 1,024 nodes, 6 TF theoretical peak performance
• Povel - PRACE prototype (energy efficiency): 4,320 cores (180 nodes, 4 x 6-core AMD), 36 TF theoretical peak performance, 5.76 TByte memory
7
PDC's latest HPC system
• Cray XE6
• 1,516 dual-socket AMD 12-core 2.1 GHz, 32 GB compute nodes (36,384 cores), 305 TF TPP, 237 TF sustained (Linpack)
• Gemini 3D torus network
• SNIC PRACE system
• Would be Nr. 8 in Europe and Nr. 28 worldwide on the November 2010 Top500 list (www.top500.org)
8
PDC’s Computational Resources
System     Cores    TPP
Lindgren   36,384   305 TF
Ekman      10,144    89 TF
Ferlin      5,360    58 TF
SweGrid       744     8 TF
Hebb        2,048     6 TF
Povel       4,320    36 TF
Total      59,000   502 TF
9
Storage
• ~20 TB disk, accessible via AFS
• ~900 TB disk, currently attached to individual systems; Lustre parallel file system, site-wide configuration planned
• IBM tape robot (~2,900 slots, ~2.3 PB), accessible via HSM, TSM, and dCache (planned via NDGF (Nordic Data Grid Facility))
• Large datasets, e.g. brain image database, human proteome data, …
10
PDC System Use
Over 500 time allocations by 400 PIs (past 4 years). Examples of these research areas include:
• Quantum Chemistry
• Climate Modeling
• Neuroinformatics
• Life Sciences
• Physics
• Computational Fluid Dynamics
11
User Support
• Front-line help desk
• Advanced user support
• Experts in parallel computing and specific application domains:
  - Computational Chemistry
  - Molecular Dynamics
  - Computational Fluid Dynamics
  - Neuroinformatics
12
Community Code Development
• Gromacs: GROMACS is a versatile Molecular Dynamics package for simulation of the Newtonian equations of motion for systems with hundreds to millions of particles. Head authors and project leaders: Erik Lindahl and Berk Hess, KTH, and David van der Spoel, Uppsala. http://www.gromacs.org/About_Gromacs
• Dalton: Dalton is a Molecular Electronic Structure package. Members of the KTH Theoretical Chemistry department are active contributors, especially Olaf Vahtras and Hans Agren. http://www.daltonprogram.org/description.html
13
PDC Summer School
Education and Training
PDC Summer School since 1996
For many years now jointly with
Total 1996 – 2010: 834 participants
[Chart: number of participants per year, 1996-2010, by home institution: KTH, CTH, Uppsala, SU, Lund, Linköping, Umeå, Luleå, Stockholm Observatory, KI, FOA, Göteborg University, other]
14
Schools, examples
[Chart: distribution of 2006 and 2007 summer school participants by region: EU, other Europe, Russia, USA, Central America, South America, Africa, Asia other, China, Australasia]
EU share 75% → 44%; total 64 students
15
Schools, examples
• 31 students (target 30)
• 14 PRACE partner countries represented
• 2 non-PRACE countries represented
• Access to Forschungszentrum Juelich's BG/P
• Access to CSC's Cray XT4
First Summer School, August 2008
16
Training Workshops, examples
• 41 participants
• 3 PRACE partner countries represented
• 1 non-PRACE country represented
• Access to AMD/ATI Radeon 5770 and 5870 GPUs
• Access to AMD/ATI Firestream 9270 GPUs
Stream Programming Workshop December 2009
17
Scientific and Engineering Computation
18
21st Century Science and Engineering
• The three-fold way:
  - theory
  - experiment
  - computational simulation
• Supported by:
  - multimodal collaboration systems
  - distributed, multi-petabyte data archives
  - leading edge computing systems
  - distributed experimental facilities
  - internationally distributed multidisciplinary teams
[Diagram: Theory, Experiment, Simulation]
Courtesy Paul Messina
19
Driving Applications - Examples
• Physics: CMS, Atlas, LHCb, ALICE
• Astronomy
• Life Sciences and Medicine: blood pressure, blood glucose, heart rate, temperature
• Engineering
• Weather
20
The Large Hadron Collider Project - four detectors
Physics: CMS, ATLAS; CERN LHC Site
Storage: raw recording rate 0.1 – 1 GBytes/sec, accumulating at 5-8 PetaBytes/year, 10 PetaBytes of disk
Processing: 200,000 of today's fastest PCs
CMS: 1,800 physicists, 150 institutes, 32 countries
21
Source: Thomas Lippert, DEISA Symp, May 2005
22
Astronomy
• The planned Large Synoptic Survey Telescope (LSST) will produce over 30 TB/day when in operation before 2020! It will perform an all-sky survey every few days with a 3.2 billion pixel camera, resulting in the first fine-grain time series of the sky. This is expected to allow for observation of the expansion of the universe and dark matter's bending of light.
Impending Floods of Data
www.lsst.org
23
Telescope Arrays: ALMA, EVLA, LOFAR
24
Source: K Lackner, Max-Planck DEISA Symp May 2005
25
Fusion
Source: David H Bailey, Petaflops Workshop
26
Source: David H. Bailey, LBL Petaflops Workshop
27
Severe Weather Prediction
Hurricanes:
For 1925 - 1995 the US cost was $5 billion/yr for a total of 244 landfalls. But, hurricane Andrew alone caused damage in excess of $27 billion.
The US loss of life has gone down to <20/yr typically. The Galveston “Great Hurricane” year 1900 caused over 6,000 deaths.
Since 1990 the number of landfalls each year is increasing.
Warnings and emergency response costs on average $800 million/yr. Satellites, forecasting efforts and research cost $200 – 225 million/yr.
Ivan, September 14, 2004
Andrew: ~$27B (1992)
Charley: ~$ 7B (2004)
Hugo: ~$ 4B (1989) ($6.2B 2004 dollars)
Frances: ~$ 4B (2004)
28
Weather
Courtesy Kelvin Droegemeier
March 28, 2000, Fort Worth Tornado
29
Severe Weather Prediction
• Must fit the prediction model to the observations (data assimilation/retrieval): about 50-100 times as expensive as the forecast
• Must use high spatial resolution: 1-3 km resolution in sufficiently large domains
• Must quantify forecast uncertainty (ensembles): may need 20-30 forecasts to produce an ensemble each forecast cycle
• Requirements: 10-100 TFLOPS sustained; 0.5 TB memory; 20 TB storage
30
Wildfire Simulation
31
Environmental Studies
Houston, TX
32
Life Sciences
33
Life Sciences: Imaging
Imaging modalities: Synchrotrons, Microscopes, Magnetic Resonance Imagers
Scales: Molecules; Macromolecular Complexes, Organelles, Cells; Organs, Organ Systems, Organisms
34
Life sciences: Imaging
Data Acquisition → Image Restoration (deconvolution, filtering, registration) → Image Reconstruction (3D reconstruction, refinement) → Multidimensional Image Analysis (image segmentation; feature recognition, extraction & modification) → Visualization
Post processing, simulation, other methods; interactions with the experiment
35
Life Sciences: Imaging
JEOL 3000-FEG, liquid He stage, NSF support; 500 Å; structure of the HSV-1 capsid
No. of particles needed for 3-D reconstruction:
Resolution   B = 100 Å²    B = 50 Å²
8.5 Å        6,000         3,000
4.5 Å        5,000,000     150,000
36
Digital Mammography
• About 40 million mammograms/yr (USA) (estimates 32 – 48 million)
• About 250,000 new breast cancer cases detected each year
• Over 10,000 units (analogue)
• Image size: 4k x 6k, about 48 MB
• Images per patient: 4
• Data set size per patient: about 200 MBytes
• Data set per year: about 10 PBytes
• Data set per unit, if digital: 1 TByte/yr, on average
37
Computer Assisted Surgery
http://lyon2003.healthgrid.org/documents/slides_PDF/11_Guy_Lonsdale.pdf
38
Cancer Treatment: Hadron Centers
• In the US (all proton therapy): Harvard (MGH); Loma Linda (California); MD Anderson (Houston), Spr '06
• Heavy Ion Therapy (all international): HIMAC @ NIRS (Chiba, Japan); GSI (Heidelberg, Germany) - under construction; Etoile (Lyon, France) - under construction; Univ. of Pavia, Italy - under construction
39
Lung Simulation
Virtual Lung from PNNL's Virtual Biology Center
NWGrid & NWPhys are designed to simulate coupled fluid dynamics and continuum mechanics in complex geometries using 3-D, hybrid, adaptive, unstructured grids.
• NWGrid - grid generation & setup toolbox
• NWPhys - collection of computational physics solvers
Particle distribution in the flow airways (particle occurs in right branch of bifurcation): membrane wall, airway passage, particle
Pressure contours of the flow field throughout the lung airways; particles occur in every right branch of a bifurcation
Harold Trease, PNNL; U.S. Department of Energy, Pacific Northwest National Laboratory
40
Center for Integrated Turbulence Simulation
Engineering
41
42
Engineering
43
Scheduling
Continental Airlines
44
Application Characterization for System Design
45
Particle Physics 23.5%
Computational Chemistry 22.1%
Condensed Matter Physics 14.2%
CFD 8.6%
Earth & Climate 7.8%
Astronomy & Cosmology 5.8%
Life Sciences 5.3%
Computational Engineering 3.7%
Plasma Physics 3.3%
Other 5.8%
2008 usage of PRACE partner’s major systems measured as aggregated Linpack Equivalent Flops (LEFs)
69 applications surveyed on 24 systems>10TF
46
Relative use of Computational Kernels (Dwarfs) in PRACE Applications based on LEFs
Map reduce methods: 45.1%
Spectral methods: 18.4%
Dense linear algebra: 14.4%
Structured grids: 9.0%
Particle methods: 7.2%
Sparse linear algebra: 3.4%
Unstructured grids: 2.4%
69 applications surveyed on 24 systems>10TF
47
PRACE Large Scale Applications characteristics measured as LEFs used in TF units
Area / Dwarf (TF)          Map reduce  Unstructured  Particle  Sparse LA  Structured  Spectral  Dense LA
Astronomy and Cosmology         0          2.99        5.98      3.59       4.91       0.62      0
Computational Chemistry        12.98       0.53        7.49      3.45       1.80      26.09     15.35
Computational Engineering       2.8        0.53        0         0.53       0.53       0         0
CFD                             0          3.00        0.32      3.05       7.37       1.70      0
Condensed Matter Physics        5.70       0.28        1.76      0.73       1.62      15.07      9.10
Earth and Climate Science       0          0.26        0         1.33       5.83       2.03      0
Life Science                    3.46       0.28        0.94      0.13       0.94       4.72      0
Particle Physics               89.27       0           0.10      0.92       4.59       0        12.50
Plasma Physics                  0.63       0.42        3.55      1.33       1.33       0         0
Other                           0          0           0         0          0          0         0
48
Language use by application codes
Language      No. of applications
Fortran90     50
C90           22
Fortran77     15
C++           10
C99            7
Python         3
Perl           2
Mathematica    1
About 50% use more than one base language
16 out of the 69 application codes combine Fortran with C or C++
49
Application parallelization techniques on PRACE Partner systems
• 1 code is sequential (BLAST)
• 1 code uses OpenMP only (Gaussian)
• 67 codes use MPI:
  - 45 codes use MPI only, one having an MPI-2 version
  - 6 codes have one MPI version and one OpenMP version
  - 3 codes have one MPI version and one SHMEM version
  - 10 codes have hybrid MPI/OpenMP versions
  - 2 codes have hybrid MPI/SHMEM versions
  - 1 code has a hybrid MPI/Posix threads version
50
PRACE systems usage (2008)Job Requirements
[Chart: for each of the 67 surveyed application codes, the fraction of the PRACE partner machine on which it is run [%] and the minimal execution time [h]]
Size: mean 22% of the machine; a quarter to a third of all cores is a common experience for shared systems
Length: mean 37 hrs
51
PRACE partner systems with peak >10 TF surveyed (2008)
Type   Systems (%)   Rpeak GF (%)       Rmax GF (%)        Cores (%)
VEC     1  (4.2)     9,216   (1.0)      8,923   (1.3)      576     (0.3)
MPP     7  (29.2)    425,591 (46.0)     335,491 (49.7)     108,248 (63.9)
FNC     6  (25.0)    110,682 (12.0)     94,118  (13.9)     16,928  (10.0)
TNC    10  (41.7)    380,686 (41.1)     236,882 (35.1)     43,770  (25.8)
Total  24  (100.0)   926,175 (100.0)    675,414 (100.0)    169,522 (100.0)

VEC = Vector systems
MPP = Massively Parallel Processors (BlueGene and Cray XT)
FNC = Fat Node Cluster ("big" SMP nodes)
TNC = Thin Node Cluster
52
53
PRACE Vision and Mission
• Vision: Enable and support European global leadership in public and private research and development.
• Mission: Contribute to the advancement of European competitiveness in industry and research through the provisioning of world leading persistent High-End Computing infrastructure
54
PRACE AISBL
(Interest to join by Belgium and Latvia)
• PRACE AISBL (Association International Sans But Lucratif) is a Belgian legal entity seated in Brussels formed April 23 2010 for providing a persistent pan-European Research Infrastructure for High-End Computing and associated services. Member countries currently (spring 2012) are
• Austria, Bulgaria, Cyprus, Czech Republic, Denmark, Finland, France, Germany, Greece, Hungary, Ireland, Israel, Italy, Netherlands, Norway, Poland, Portugal, Serbia, Slovenia, Spain, Sweden, Switzerland, Turkey, United Kingdom
55
Commitments to PRACE AISBL
• Hosting Partners: Germany, France, Italy, Spain - binding commitments to contribute 100 M€ over 5 years in terms of Tier-0 cycles and services; contribution measured by TCO
• All partners: binding commitment to share PRACE AISBL headquarters costs equally
• EU Commission (expected): 68 M€ (originally 3x20+10 M€) in FP7 for preparatory and implementation phases; grants INFSO-RI-211528 and 261557
• Partner match for EU funds: ~60 million € from PRACE partners (not including Tier-1)
Note: GDP spread among PRACE partners is a factor of ~200
56
The ESFRI Vision for a European HPC service
[Pyramid diagram: capability vs. number of systems - Tier-0 (PRACE), Tier-1 (DEISA/PRACE), Tier-2]
• Ensure the right level of integration in/with the tiers
• Tier-0 - full integration: creation of new high-end resources; single access route; single operational model
• Tier-1: integration of existing national resources enables non-hosting countries to contribute; different funding/governance requires an adapted approach; leverage DEISA successes, like the network and DECI
• Tier-2 / Grids: different funding and usage models, overlapping user groups; cooperate and inter-operate for the benefit of users
(ESFRI = European Strategy Forum on Research Infrastructures; DEISA = Distributed European Infrastructure for Supercomputing Applications)
57
PRACE RI Systems
JSC
• 2010, 1st PRACE System: BG/P by Gauss Center for Supercomputing at Juelich
• 294,912 CPU cores, 144 TB memory
• 1 PFlop/s peak performance
• 825.5 TFlop/s Linpack
• 600 I/O nodes (10GigE), > 60 GB/s I/O
• 2.2 MW power consumption
• 35% for PRACE
58
GENCI
• 2011, 2nd PRACE system: Bull, 1.6 PF, 92,160 cores, 4 GB/core
• Phase 1, December 2010, 105 TF: 360 nodes with four Intel Nehalem-EX 8-core 2.26 GHz CPUs (11,520 cores), QDR Infiniband fat-tree; 800 TB, >30 GB/sec local Lustre file system
• Phase 1.5, Q2 2011: conversion to 90 16-socket, 128-core, 512 GB nodes
• Phase 2, Q4 2011, 1.5 PF: Intel Sandy Bridge; 10 PB, 230 GB/sec file system
GENCI/CEA, <15 MW
59
HLRS
• 2011, 3rd PRACE System: Cray XE6
• Phase 0 - 2010: 10 TF, 84 dual-socket 8-core AMD Magny-Cours CPUs, 1,344 cores in total, 2 GHz, 2 GB/core, Gemini interconnect
• Phase 1 Step 1 - Q3 2011: AMD Interlagos, 16 cores, 1 PF, 2-4 GB/core, 2.7 PB file system, 150 GB/s I/O
• Phase 2 - 2013: Cascade, first order for Cray, 4-5 PF
60
LRZ
• 2011/12, 4th PRACE system: IBM iDataPlex
• >14,000 Intel Sandy Bridge CPUs, 3 PF (~110,000 cores), 384 TB of memory
• 10 PB GPFS file system with 200 GB/sec I/O, 2 PB 10 GB/sec NAS
• LRZ <13 MW
• Innovative hot-water cooling (60°C inlet, 65°C outlet) leading to 40 percent less energy consumption compared to an air-cooled machine
61
CINECA• 2012 5th PRACE System
FERMI, a Blue Gene/Q, is composed of 10,240 PowerA2 sockets running at 1.6 GHz, with 16 cores each, totaling 163,840 compute cores and a system peak performance of 2.1 PFlops. Each processor comes with 16 GByte of RAM (1 GByte per core).
The BG/Q system will be equipped with a high-performance scratch storage system with a capacity of 2 PByte and a bandwidth in excess of 100 GByte/s.
62
BSC
• 2013, 6th PRACE System
• Computing facility: 10 MW, 2013; BSC <20 MW
63
PRACE Tier-1 (continuation/extension of DEISA)
Partner systems and committed core hours:
Clusters: CINECA 2,747,000 + 1,513,728; CINES 1,437,000; FZJ 3,363,840; HLRS 784,000 + 9,000 GPGPU; ICHEC 2,900,000; LRZ 2,000,000; PSNC 850,000 + 140,000 GPGPU; BSC 1,900,000; SARA 880,000
IBM Power6: RZG 1,150,000; CINECA 1,400,000
IBM BG/P: RZG 2,872,000; NCSA (Bulgaria) 2,870,000; IDRIS 6,000,000
Cray XT4/5/6, XE6: KTH (36,484 cores) 12,749,000; EPCC (44,544 cores) 7,800,000; CSC 2,284,000
Total: 55,500,568 core hours + 149,000 GPGPU hrs
64
Accessing the PRACE RI
Peer-review merit-based access model
• Three types of resource allocations:
  - Test / evaluation access
  - Project access - for a specific project, grant period ~1 year
  - Programme access - resources managed by a community
• Free-of-charge
Funding
• Mainly national funding through partner countries
• European contribution
• Access model has to respect national interests (ROI)
65
PRACE Young Investigator Awards
• To stimulate interest and innovation in HPC, PRACE has since its inception in 2008 awarded an annual prize to a European student or young scientist who has carried out outstanding scientific work on High-End Computing.
• The award is based on papers submitted for the competition that are reviewed by three reviewers selected by the PRACE Scientific Steering Committee. Reviewers evaluate novelty, fundamental insights and potential for long-term impact of the research.
• The award and the research are presented at ISC (the International Supercomputing Conference)
66
PRACE Young Investigator Award 2008
• The award was given for “UCHPC – UnConventional High Performance Computing for Finite Element Simulations”, by Stefan Turek, Dominik Goeddeke, Christian Becker, Sven H.M. Buijssen, Hilmar Wobker, Applied Mathematics, Dortmund University of Technology, Germany.
The work addresses use of heterogeneous hardware for Finite Element computations and describes Feast (Finite Element Analysis & Solution Tools), a software toolbox providing Finite Element discretizations and optimized parallel solvers for PDE problems. Feast combines modern numerical techniques with hardware efficient implementations for a wide range of HPC architectures. It contains mechanisms enabling complex simulations to directly benefit from hardware acceleration without having to change application code.
http://www.mathematik.tu-dortmund.de/lsiii/static/showpdffile_TurekGoeddekeBeckerBuijssenWobker2008.pdf
67
PRACE Young Investigator Award 2009
• The award was given for “High Scalability Multipole Method. Solving Half Billion of Unknowns” by J. C. Mouriño, A. Gómez, J. M. Taboada, L. Landesa, J. M. Bértolo, F. Obelleiro and J. L. Rodríguez, Supercomputing Center of Galicia (CESGA), Universidad de Extremadura, Universidad de Vigo, Spain.
“Large electromagnetic simulations are of great interest for the design of complex industrial products that are more and more integrating electronic equipments. They are also very useful for understanding and reducing the impact of electromagnetic fields on human beings. The selected paper for the PRACE award presents an outstanding work on the scalability of the FMM-FFT method (fast multipole method - fast Fourier transform) and its application for solving a challenging problem with 0.5 billion of unknowns and opens the road towards even larger simulations. It will make possible to run simulations with a much higher resolution than before, making possible to improve the accuracy of the simulations and to address new computational challenges in the field of very high frequencies, for example for new car anti-collision systems”, Francois Robin, Award Committee member and CEA Senior Scientist.
http://www.springerlink.com/content/y6tr4q34r2510328
68
PRACE Young Investigator Award 2010
• The Award was given for “Massively Parallel Granular Flow Simulations with Non-Spherical Particles” by Klaus Iglberger, M. Sc. and Prof. Ulrich Rüde, University of Erlangen-Nuremberg, Germany.
“The paper proposed by Iglberger and Rüde addresses successfully the issue of simulating flows of granular materials in a realistic way, that is with particles of different shapes moving in a complex environment. It introduces a new algorithm for this purpose that shows an excellent scalability on a very high number of cores, making possible very large simulations that will be useful in important applications like the design of silos”, says François Robin (GENCI), member of the Award Committee.
“Among the very good papers submitted for the PRACE Award 2010, this paper was the best, both addressing a complex physical problem and implementing a highly scalable method for solving it”, Robin continues.
http://www.springerlink.com/content/yw54328t38r2123p
69
PRACE Young Investigator Award 2011
• The Award was given for “Astrophysical Particle Simulations with Large Custom GPU Clusters on Three Continents” by Rainer Spurzem, Chinese Academy of Sciences & University of Heidelberg; Peter Berczik, Chinese Academy of Sciences & University of Heidelberg; Tsuyoshi Hamada, Nagasaki University; Keigo Nitadori, RIKEN; and Guillermo Marcus, Andreas Kugel, Reinhard Maenner, Ingo Berentzen, Jose Fiestas, Robi Banerjee and Ralf Klessen, University of Heidelberg.
“This paper is an excellent example of what can be achieved through international and interdisciplinary collaboration to exploit new HPC technologies”, says Prof. Richard Kenway, PRACE Scientific Steering Committee Chair.
“Astrophysicists and computer scientists in Germany and China demonstrate nearly linear strong scaling on up to 170 GPUs at a third of peak performance for large-scale simulations of dense star clusters using machines in Europe, China and the USA. The work points the way to exploit exascale technologies for problems at the forefront of science”, Kenway continues. http://www.ari.uni-heidelberg.de/mitarbeiter/fiestas/iscpaper11.pdf
70
1st PRACE User Forum, April 13, 2011, Helsinki (held during the PRACE/DEISA Symposium)
• Open to all scientific and industrial user communities
• Main communication channel between HPC users and PRACE AISBL
• Interaction with members of the PRACE AISBL
• Discussion and issuing recommendations to PRACE AISBL
• Promoting HPC usage
• Fostering collaborations between user communities
71
Education and Training Highlights Petascale training and education needs surveyed Spring 2008
First PRACE Summer School on Peta-scaling, KTH, Stockholm, August 2008. Platforms: IBM Blue Gene/P (FZJ) (65,536 cores) and Cray XT4 (CSC) (10,816 cores)
First PRACE Winter School on Scalable Programming Models and Paradigms, GRNET, Athens, February 2009. Platforms IBM Power 6 (3,328 cores) and IBM Cell (1,152 SPE cores)
Seven Code Porting Workshops in 2009
In total 270 participants in education and training events 2008/2009
72
PRACE Code Porting Workshops
• GPU and hybrid system programming using CUDA and CAPS-HMPP, CEA, Paris, April 2009
• Porting and optimization techniques for PRACE applications, CSC, Helsinki, June 2009
• Porting and optimization techniques for the Cray XT5, CSCS, Manno, Switzerland, July 2009
• Porting and optimization techniques for the Clearspeed/Petapath architecture, NCF/SARA, Amsterdam, October 2009
• Porting and optimization techniques for the NEC SX-9 (HLRS) and IBM BG/P (FZJ), Cyfronet, Cracow, October 2009
• Porting and optimization techniques for the IBM Cell (BSC) and GPGPU systems, BSC, Barcelona, October 2009
• Stream Programming with OpenCL, KTH, Stockholm, December 2009
• …..
73
PRACE Technology Evaluation, Research and Development
• PRACE activities include efforts to assess, through prototyping, the impact of novel architectures, such as stream computing units (GPUs) and digital signal processors (DSPs), interconnection technologies, and programming paradigms on application development, code porting and optimization.
• PRACE also seeks to assess and stimulate the development of energy efficient hardware and software design through prototyping.
74
Challenges
75
The Connection Machine
7 years
Performance doubling period on average: No 1 – 13.64 months, No 500 – 12.90 months
76
Energy efficiency evolution
Source: Assessing Trends in the Electrical Efficiency of Computation over Time, J.G. Koomey, S. Berard, M. Sanchez, H. Wong, Intel, August 17, 2009, http://download.intel.com/pressroom/pdf/computertrendsrelease.pdf
Energy efficiency doubling every 18.84 months on average measured as computation/kWh
77
The Gap: The energy efficiency improvement as determined by Koomey does not match the performance growth of HPC systems as measured by the Top500 list.
The gap indicates a growth rate in energy consumption for HPC systems of about 20%/yr.
• EPA study projections: 14% - 17%/yr
• Uptime Institute projections: 20%/yr
• PDC experience: 20%/yr
“Report to Congress on Server and Data Center Energy Efficiency”, Public Law 109-431, U.S. Environmental Protection Agency, Energy Star Program, August 2, 2007, http://www.energystar.gov/ia/partners/prod_development/downloads/EPA_Datacenter_Report_Congress_Final1.pdf
“Findings on Data Center Energy Consumption Growth May Already Exceed EPA’s Prediction Through 2010!”, K. G. Brill, The Uptime Institute, 2008, http://uptimeinstitute.org/content/view/155/147
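A quick back-of-the-envelope check of the gap, assuming only the doubling periods quoted earlier in this deck (Top500 #1 performance ~13.64 months, computations/kWh ~18.84 months); this few-line C program is illustrative, not from the original slides:

```c
/* Back-of-the-envelope check of "The Gap": performance grows faster than
 * efficiency, and the ratio of the two yearly growth factors is the
 * implied growth in HPC energy consumption. */
#include <math.h>
#include <stdio.h>

int main(void) {
    double perf_growth = pow(2.0, 12.0 / 13.64);  /* ~1.84x per year (Top500 #1) */
    double eff_growth  = pow(2.0, 12.0 / 18.84);  /* ~1.56x per year (Koomey)    */
    double energy_growth = perf_growth / eff_growth;
    printf("Implied HPC energy growth: %.0f%%/yr\n", (energy_growth - 1.0) * 100.0);
    return 0;
}
```

The result, roughly 18%/yr, is consistent with the ~20%/yr figures above.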
78
Evolution of Data Center Energy Costs (US)
Source: Tahir Cader, Energy Efficiency in HPC – An Industry Perspective, High Speed Computing,April 27 – 30, 2009
79
Worldwide Server Installed Base, New Server Spending, and Power and Cooling Expense
Power per rack has increased from a few kW to 30+ kW
Estimate for 2008 purchase: 4 yr cooling cost ~1.5 times cluster cost
80
DOE E3 Report: extrapolation of existing design trends to Exascale in 2016. Estimate: 130 MW
DARPA Study: more detailed assessment of component technologies. Estimate: 20 MW just for memory alone, 60 MW aggregate extrapolated from current design trends
The current approach is not sustainable! A more holistic approach is needed!
Exa-scale Data Centre Challenges
Nuclear power plant: 1 - 1.5 GW
81
DARPA Exascale study
• Last 30 years:
  - "Gigascale" computing first in a single vector processor
  - "Terascale" computing first via several thousand microprocessors
  - "Petascale" computing first via several hundred thousand cores
• Commercial technology, to date:
  - Always shrunk prior "XXX" scale to smaller form factor
  - Shrink, with speedup, enabled next "XXX" scale
• Space/embedded computing has lagged far behind:
  - Environment forced implementation constraints
  - Power budget limited both clock rate & parallelism
• "Exascale" now on the horizon:
  - But beginning to suffer similar constraints as space
  - And technologies to tackle exa challenges are very relevant, especially energy/power
http://www.ll.mit.edu/HPEC/agendas/proc09/Day1/S1_0955_Kogge_presentation.ppt
82
Power fundamentals – Exascale: Processor and Memory
• Modern processors being designed today (for 2010) dissipate about 200 pJ/op total. This is ~200 W/TF in 2010.
• In 2018 we might be able to drop this to 10 pJ/op, i.e. ~10 W/TF.
• This is then 16 MW for a sustained HPL Exaflops.
• This does not include memory, interconnect, I/O, power delivery, cooling or anything else.
• Cannot afford separate DRAM in an Exa-ops machine!
• Propose a MIP machine with aggressive voltage scaling on 8 nm.
• Might get to 40 kW/PF – 60 MW for sustained Exa-ops.
Source: William J Camp, Intel, http://www.lanl.gov/orgs/hpc/salishan/pdfs/Salishan%20slides/Camp2.pdf
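A small arithmetic sketch of the pJ/op figures above (energy per operation times operation rate gives power); the 16 MW quoted for sustained HPL presumably also covers the gap between peak and sustained, which this simple conversion ignores:

```c
/* pJ/op -> power conversion for the figures on this slide. */
#include <stdio.h>

int main(void) {
    double pj_per_op_2010 = 200.0, pj_per_op_2018 = 10.0;
    double teraflop = 1e12, exaflop = 1e18;
    printf("2010: %.0f W per sustained TF\n", pj_per_op_2010 * 1e-12 * teraflop);
    printf("2018: %.0f W per sustained TF\n", pj_per_op_2018 * 1e-12 * teraflop);
    printf("2018: %.0f MW per sustained EF (cores only)\n",
           pj_per_op_2018 * 1e-12 * exaflop / 1e6);
    return 0;
}
```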
83
Power fundamentals - Exascale: Interconnect, I/O, Power and Cooling
• Interconnect: for short distances still Cu, off board Si photonics. Need ~0.1 B/Flop interconnect. Assume (a miracle) 5 mW/Gbit/sec: ~50 MW for the interconnect!
• I/O: optics is the only choice; 10-20 PetaBytes/sec, ~a few MW (a swag)
• Power and cooling: still 30% of the total power budget in 2018!
Total power requirement in 2018: 120 - 200 MW!
Source: William J Camp, Intel, http://www.lanl.gov/orgs/hpc/salishan/pdfs/Salishan%20slides/Camp2.pdf
84
An inefficient truth - ICT impact on CO2 emissions*
• It is estimated that the ICT industry alone produces CO2 emissions equivalent to the carbon output of the entire aviation industry. Direct emissions of Internet and ICT amount to 2-3% of world emissions and are expected to grow to 6+% within a decade.
• ICT emissions growth is the fastest of any sector in society; expected to double every 4 to 6 years with current approaches.
• One small computer server generates as much carbon dioxide as an SUV with a fuel efficiency of 15 miles per gallon.
*An Inefficient Truth: http://www.globalactionplan.org.uk/event_detail.aspx?eid=2696e0e0-28fe-4121-bd36-3670c02eda49
85
1000 Years of CO2 and Global Temperature Change
[Chart: global temperature change (deg F) and CO2 concentration (ppm), year 1000 to 2000]
Source: Jennifer Allen, ACIA 2004; http://alaskaconservationsolutions.com/acs/images/stories/docs/AkCS_current.ppt
86
Arctic Summer Ice Melting Accelerating
Ice Volume km3
Ice area millions km2, September minimum
Source: www.copenhagendiagnosis.org
IPCC = Intergovernmental Panel on Climate Change
87
Source: Iskhaq Iskandar, http://www.jsps.go.jp/j-sdialogue/2007c/data/52_dr_iskander_02.pdf
1/3 due to melting glaciers
2/3 due to expansion from warming oceans
Source: Trenberth, NCAR 2005
88
Source: http://alaskaconservationsolutions.com/acs/images/stories/docs/AkCS_current.ppt
Global Cataclysmic Concerns
89
Ocean Acidification
Animals with calcium carbonate shells -- corals, sea urchins, snails, mussels, clams, certain plankton, and others -- have trouble building skeletons, and shells can even begin to dissolve. “Within decades these shell-dissolving conditions are projected to be reached and to persist throughout most of the year in the polar oceans.” (Monaco Declaration 2008)
• Pteropods (an important food source for salmon, cod, herring, and pollock) likely not able to survive at CO2 levels predicted for 2100 (600 ppm, pH 7.9) (Nature 9/05)
• Coral reefs at serious risk; at doubled CO2 they stop growing and begin dissolving (GRL 2009)
• Larger animals like squid may have trouble extracting oxygen
• Food chain disruptions
[Photos: pteropod, squid, clam - all photos this page courtesy of NOAA; Global Warming: The Greatest Threat © 2006 Deborah L. Williams]
Source: http://alaskaconservationsolutions.com/acs/images/stories/docs/AkCS_current.ppt
Global Cataclysmic Concerns
90
Severe Weather
Hurricanes:
For 1925 - 1995 the US cost was $5 billion/yr for a total of 244 landfalls. But, hurricane Andrew alone caused damage in excess of $27 billion.
The US loss of life has gone down to <20/yr typically. The Galveston “Great Hurricane” year 1900 caused over 6,000 deaths.
Since 1990 the number of landfalls each year is increasing.
Warnings and emergency response costs on average $800 million/yr. Satellites, forecasting efforts and research cost $200 – 225 million/yr.
Katrina, August 28, 2005
Andrew: ~$27B (1992)
Charley: ~$ 7B (2004)
Hugo: ~$ 4B (1989) ($6.2B 2004 dollars)
Frances: ~$ 4B (2004)
Katrina: > $100B (2005)
91
Tornados
http://en.wikipedia.org/wiki/Tornado
Anadarko, Oklahoma
http://en.wikipedia.org/wiki/File:Dimmit_Sequence.jpg
Courtesy Kelvin Droegemeier
www.drjudywood.com/.../spics/tornado-760291.jpg
http://g.imagehost.org/0819/tornado-damage.jpg
http://www.crh.noaa.gov/mkx/document/tor/images/tor060884/damage-1.jpg
http://www.miapearlman.com/images/tornado.jpg
92
Wildfires http://topnews.in/law/files/Russian-fires-control.jpg
http://msnbcmedia1.msn.com/j/MSNBC/Components/Photo/_new/100810-russianFire-vmed-218p.grid-6x2.jpg
Russia Wildfires 2010
http://www.tolerance.ca/image/photo_1281943312664-2-0_94181_G.jpg
img.ibtimes.com/www/data/images/full/2010/08/
http://legacy.signonsandiego.com/news/fires/weekoffire/images/mainimage4.jpg
In a single week, San Diego County wildfires killed 16 people, destroyed nearly 2,500 homes and burned nearly 400,000 acres. Oct 2003
Russia may have lost 15,000 lives already, and $15 billion, or 1% of GDP, according to Bloomberg. The smog in Moscow is a driving force behind the fires' deadly impact, with 7,000 killed already in the city. Aug 10, 2010
93
Floods
UK, June – July 2007: 13 deaths, more than 1 million affected, cost about £6 billion
TN, KY, MS, April 30 – May 7, 2010: 31 deaths. Nashville Mayor Karl Dean estimates the damage from weekend flooding could easily top $1 billion.
China (Bloomberg, Aug 17, 2010): 1,450 deaths through Aug 6; on Aug 7, 1,254 killed in a mudslide with 490 missing
94
Houston, Ozone and Health
95
Houston August 2011 Daily Temperatures
[Chart: daily maximum and minimum temperatures (°F) for August 1-31, 2011, compared to the historic average maximum and minimum]
2011 Avg Max 102.03
Historic Avg Max 94.55
2011 Avg Min 78.65
Historic Avg Min 74.81
Avg Max +7.5F (4.2C)
Avg Min +3.8F (2.1C)
Warmest July – August on record in Texas
Warmest July – August on record of any US state
96
Energy efficiency of HPC systems
97
Power Consumption vs. Load for a typical server (2008/2009)
Luiz Andre Barroso, Urs Hoelzle, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, http://www.morganclaypool.com/doi/pdf/10.2200/s00193ed1v01y200905cac006
CPU power consumption at low load is about 40% of the consumption at full load. Power consumption of all other system components is approximately independent of load. Result: power consumption at low load is about 65% of the consumption at full load.
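A minimal sketch of the load model above; the 40%-at-low-load CPU figure and the load-independence of the other components are from the slide, while the ~58% CPU share of full-load node power is an assumed split chosen so the result reproduces the quoted ~65%:

```c
/* Simple server power-vs-load model. */
#include <stdio.h>

int main(void) {
    double cpu_share = 0.58;        /* assumed CPU fraction of full-load node power */
    double cpu_idle_factor = 0.40;  /* CPU at low load draws ~40% of its full-load power */
    double rest_share = 1.0 - cpu_share;  /* other components, approximately load independent */
    double low_load = cpu_share * cpu_idle_factor + rest_share;
    printf("Power at low load: %.0f%% of full load\n", low_load * 100.0);
    return 0;
}
```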
98
Figure 1. Average CPU utilization of more than 5,000 servers during a six-month period. Servers are rarely completely idle and seldom operate near their maximum utilization, instead operating most of the time at between 10 and 50 percent of their maximum utilization levels.
“The Case for Energy-Proportional Computing”, Luiz André Barroso, Urs Hölzle, IEEE Computer, vol. 40 (2007).http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//pubs/archive/33387.pdf
99
Internet vs HPC Workloads
Google (Internet) vs. KTH/PDC (HPC)
100
What type of Architecture?
Exascale Computing Technology Challenges, John ShalfNational Energy Research Supercomputing Center, Lawrence Berkeley National LaboratoryScicomP / SP-XXL 16, San Francisco, May 12, 2010
101
What type of Architecture?Instruction Set Architecture
Exascale Computing Technology Challenges, John Shalf, National Energy Research Supercomputing Center, Lawrence Berkeley National Laboratory ScicomP / SP-XXL 16, San Francisco, May 12, 2010
102
What type of Architecture?
Exascale Computing Technology Challenges, John ShalfNational Energy Research Supercomputing Center, Lawrence Berkeley National LaboratoryScicomP / SP-XXL 16, San Francisco, May 12, 2010
103
What kind of architecture (core)
http://www.csm.ornl.gov/workshops/SOS11/presentations/j_shalf.pdf
• Cubic power improvement with lower clock rate due to V2F
• Slower clock rates enable use of simpler cores
• Simpler cores use less area (lower leakage) and reduce cost
• Tailor design to application to reduce waste
104
Green500 Rank, June 2011
Rank | MFLOPS/W | Site | Computer | Total Power (kW)
1 | 2097.2 | IBM Thomas J. Watson Research Center | NNSA/SC Blue Gene/Q Prototype 2 | 40.95
2 | 1684.2 | IBM Thomas J. Watson Research Center | NNSA/SC Blue Gene/Q Prototype 1 | 38.8
3 | 1375.9 | Nagasaki University | DEGIMA Cluster, Intel i5, ATI Radeon GPU, Infiniband QDR | 34.24
4 | 958.35 | GSIC Center, Tokyo Institute of Technology | HP ProLiant SL390s G7 Xeon 6C X5670, Nvidia GPU, Linux/Windows | 1243.8
5 | 891.88 | CINECA / SCS - SuperComputing Solution | iDataPlex DX360M3, Xeon 2.4, nVidia GPU, Infiniband | 160
6 | 824.56 | RIKEN Advanced Institute for Computational Science (AICS) | K computer, SPARC64 VIIIfx 2.0GHz, Tofu interconnect | 9898.6
7 | 773.38 | Forschungszentrum Juelich (FZJ) | QPACE SFB TR Cluster, PowerXCell 8i, 3.2 GHz, 3D-Torus | 57.54
8 | 773.38 | Universitaet Regensburg | QPACE SFB TR Cluster, PowerXCell 8i, 3.2 GHz, 3D-Torus | 57.54
9 | 773.38 | Universitaet Wuppertal | QPACE SFB TR Cluster, PowerXCell 8i, 3.2 GHz, 3D-Torus | 57.54
10 | 718.13 | Universitaet Frankfurt | Supermicro Cluster, QC Opteron 2.1 GHz, ATI Radeon GPU, Infiniband | 416.78
11 | 677.12 | Georgia Institute of Technology | HP ProLiant SL390s G7 Xeon 6C X5660 2.8GHz, nVidia Fermi, Infiniband QDR | 94.4
12 | 650.3 | National Institute for Environmental Studies | Asterism ID318, Intel Xeon E5530, NVIDIA C2050, Infiniband | 115.87
13 | 635.15 | National Supercomputing Center in Tianjin | NUDT TH MPP, X5670 2.93Ghz 6C, NVIDIA GPU, FT-1000 8C | 4040
14 | 565.97 | Yukawa Institute for Theoretical Physics (YITP) | Hitachi SR16000 Model XM1/108, Power7 3.3Ghz, Infiniband | 129.6
15 | 555.5 | CSIRO | Supermicro Xeon Cluster, E5462 2.8 Ghz, Nvidia Tesla S2050 GPU, Infiniband | 94.6
16 | 492.64 | National Supercomputing Centre in Shenzhen (NSCS) | Dawning TC3600 Blade, Intel X5650, NVidia Tesla C2050 GPU | 2580
17 | 483.66 | IBM Thomas J. Watson Research Center | Power 750, Power7 3.86 GHz, 10GigE | 120.56
18 | 467.73 | CeSViMa - Centro de Supercomputación y Visualización de Madrid | BladeCenter PS702 Express, Power7 3.3GHz, Infiniband | 154
19 | 458.33 | DOE/NNSA/LANL | BladeCenter QS22/LS21 Cluster, PowerXCell 8i 3.2 Ghz / Opteron DC 1.8 GHz, Infiniband | 276
20 | 444.25 | DOE/NNSA/LANL | BladeCenter QS22/LS21 Cluster, PowerXCell 8i 3.2 Ghz / Opteron DC 1.8 GHz, Voltaire Infiniband | 2345.5
105
Power Usage Effectiveness (PUE)
Slide courtesy Michael K Patterson, Intel, 2nd European Workshop on HPC Centre Infrastructure, Dourdan, France, 2010-10-06--08
106
Power Utilization Efficiency
• Traditionally about 3
• State-of-the-art today 1.05 – 1.2!
• How?
  - "Free" cooling
  - Judicious attention to airflow
  - Improved efficiency in power distribution and supply within the data center
  - Increased data center temperature
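A minimal illustration of what PUE means for the facility bill (total facility power = PUE x IT load); the 800 kW IT load is PDC's approximate figure quoted later in this deck, used here only as an example:

```c
/* Facility power and overhead for two PUE levels at a fixed IT load. */
#include <stdio.h>

static double facility_kw(double it_kw, double pue) { return it_kw * pue; }

int main(void) {
    double it_load_kw = 800.0;  /* assumed IT load, roughly PDC's figure */
    printf("PUE 3.0: %.0f kW total, %.0f kW overhead\n",
           facility_kw(it_load_kw, 3.0), facility_kw(it_load_kw, 3.0) - it_load_kw);
    printf("PUE 1.1: %.0f kW total, %.0f kW overhead\n",
           facility_kw(it_load_kw, 1.1), facility_kw(it_load_kw, 1.1) - it_load_kw);
    return 0;
}
```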
109
Google ….
Google, Dalles, OR; patent filed Dec 30, 2003
Energy terms:
• EUS1: energy consumption for type 1 unit substations feeding the cooling plant, lighting, and some network equipment
• EUS2: energy consumption for type 2 unit substations feeding servers, network, storage, and CRACs
• ETX: medium and high voltage transformer losses
• EHV: high voltage cable losses
• ELV: low voltage cable losses
• ECRAC: CRAC energy consumption
• EUPS: energy loss at UPSes which feed servers, network, and storage equipment
• ENet1: network room energy fed from type 1 unit substation

PUE = (EUS1 + EUS2 + ETX + EHV) / (EUS2 + ENet1 - ECRAC - EUPS - ELV)

Q1 2011:
• Quarterly energy-weighted average PUE: 1.13
• TTM energy-weighted avg. PUE: 1.16
• Individual facility minimum quarterly PUE: 1.09, Data Center E
• Individual facility minimum TTM PUE*: 1.11, Data Center J
• Individual facility maximum quarterly PUE: 1.22, Data Center C
• Individual facility maximum TTM PUE*: 1.21, Data Center C
* Only facilities with at least twelve months of operation are eligible for Individual Facility Trailing Twelve Month (TTM) PUE reporting
http://www.google.com/corporate/datacenter/efficiency-measurements.html
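A direct transcription of Google's PUE formula above into a small helper; the struct fields mirror the energy terms defined on the slide, and the example numbers are made up purely to illustrate use:

```c
/* Component-based PUE as defined on the slide. */
#include <stdio.h>

struct dc_energy {
    double eus1, eus2, etx, ehv, elv, ecrac, eups, enet1; /* energy over the period */
};

static double pue(const struct dc_energy *e) {
    return (e->eus1 + e->eus2 + e->etx + e->ehv) /
           (e->eus2 + e->enet1 - e->ecrac - e->eups - e->elv);
}

int main(void) {
    /* Illustrative (made-up) values in arbitrary energy units. */
    struct dc_energy e = { 120.0, 1000.0, 10.0, 5.0, 8.0, 60.0, 30.0, 15.0 };
    printf("PUE = %.2f\n", pue(&e));
    return 0;
}
```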
110
Google and Clean Energy
• The Hamina data center in Finland (previously the Summa paper mill): cooling water from the Gulf of Finland (no chillers); four new wind turbines built; Telia-Sonera cable
• The Belgian data center was designed without chillers. If the air at the Saint-Ghislain, Belgium, data center gets too hot, Google shifts the data center's compute loads to other facilities.
111
Facebook – Prineville Data Center
Facebook's Prineville, OR, 147,000-square-foot custom data center, with an estimated cost of $188.2 million, was brought into operation in the summer of 2011. The site was chosen because of its very dry and relatively cool climate. For 60 – 70% of the time, cooling will be achieved by using cold air from outside. Excess heat from servers will be used to warm office space in the facility.
PUE 1.07 – 1.08
112
Clean Energy - Facebook
• The 120 MW Lulea Data Center will consist of three server buildings with an area of 28,000 m2 (300,000 ft2). The first building is to be operational within a year and the entire facility is scheduled for completion by 2014.
• The Lulea river has an installed hydroelectric capacity of 4.4 GW and produces on average 13.8 TWh/yr
Read more: http://www.dailymail.co.uk/sciencetech/article/2054168/Facebook-unveils-massive-data-center-Lulea-Sweden.html#ixzz1diMHlYIL
Climate data for Luleå, Sweden (retrieved 2011-10-28)
Month                  Jan      Feb      Mar      Apr     May     Jun     Jul     Aug     Sep     Oct     Nov     Dec      Year
Average high °C (°F)   −8 (18)  −8 (18)  −2 (28)  3 (37)  10 (50) 17 (63) 19 (66) 17 (63) 12 (54) 6 (43)  −1 (30) −5 (23)  5.0 (41.0)
Average low °C (°F)    −16 (3)  −16 (3)  −11 (12) −4 (25) 2 (36)  8 (46)  11 (52) 10 (50) 5 (41)  1 (34)  −7 (19) −13 (9)  −2.5 (27.5)
113
Joint Nordic Proof of Concept Research Computing Center (Denmark, Norway, Sweden)
• Located at the Thor DataCenter, Reykjavik
• Iceland electric energy: 70% hydro, 30% geothermal; carbon free, sustainable
• Free cooling – PUE in the 1.1 – 1.2 range; 1.07 for containerized equipment. All-time high temperature in Reykjavik: 24.8 C, annual average ~5 C.
114
Energy Efficient Computing Projects at PDC
115
SNIC/KTH PRACE Prototype
[Chassis diagram, labels: 1620 W PSUs + cooling fans; PSU dummy when PSU not used; 40 Gb InfiniBand switch with 18 external ports; 1/10 Gb Ethernet switch; 1 Gb Ethernet switch; CMM (Chassis Management Module)]
• New 4-socket blade with 4 DIMMs per socket supporting PCI-Express Gen 2 x16
• Four 6-core 2.1 GHz 55 W ADP AMD Istanbul CPUs, 32 GB/node
• 10 blades in a 7U chassis with 36-port QDR IB switch, new efficient power supplies
• 2 TF/chassis, 12 TF/rack, 30 kW (6 x 4.8)
• 180 nodes, 4,320 cores, full bisection QDR IB interconnect
Network:
• QDR Infiniband
• 2-level fat-tree
• Leaf-level 36-port switches built into chassis
• Five external 36-port switches
116
The SNIC/KTH/PRACE Prototype I 2009 (Povel)
Component                         Power (W)   Percent (%)
CPUs                              2,880       56.8
Memory 1.3 GB/core                800         15.8
PS                                355         7.0
Fans                              350         6.9
Motherboards                      300         5.9
HT3 Links                         120         2.4
IB HCAs                           100         2.0
IB Switch (36 ports)              100         2.0
GigE Switch                       40          0.8
CMM (Chassis Management Module)   20          0.4
Total                             5,065       100.0
Not in prototype nodes
117
SNIC/KTH/PRACE Prototype I
118
Nominal Energy Efficiency of Mobile CPUs, x86 CPUs and GPUs
Chip               GF/W    W      Cores
ATI 9370           ~2.3    225    1600
Intel 6-core       ~0.6    130    6
AMD 12-core        ~0.9    115    12
ATOM               ~0.5    2+2
ARM Cortex-9       ~0.5    ~2     4
ClearSpeed CX700   ~10     10     192
IBM BQC            3.7     55     16
TMS320C6678        ~1.5    4      8
nVidia Fermi       ~2.2    225    512
Very approximate estimates!!
KTH/SNIC/PRACE Prototype II
119
KTH/SNIC/PRACE DSP HPC node
Target: 15 – 20 W, 32 GB, 2.5 GF/W Linpack
[Diagram labels: 4-core host, 50 Gbps interconnect]
120
KTH/SNIC/PRACE DSP HPC nodeDSP Integration Model 1: Accelerator
121
KTH/SNIC/PRACE DSP HPC node
• MPI ranks only on the host CPU
• DSP executes computational kernels
• Data passes between hosts and from host to DSP (the transfer can be optimized)
• Simple model, ARM in control
• Can gradually port applications
• Limited gains due to synchronization and data staging issues
DSP Integration Model 1: Accelerator (a minimal sketch follows below)
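A minimal sketch of the accelerator model, under assumed helper names: the MPI rank lives on the host ARM, stages data to the DSP, launches the kernel and collects the result; dsp_copy_in/dsp_run_kernel/dsp_copy_out are hypothetical stand-ins for whatever offload runtime the real prototype uses:

```c
#include <stddef.h>
#include <mpi.h>

/* Placeholder offload hooks; the real prototype would map these onto its
 * DSP runtime (the names here are hypothetical). */
static void dsp_copy_in(const double *src, size_t n)   { (void)src; (void)n; }
static void dsp_run_kernel(const char *name, size_t n) { (void)name; (void)n; }
static void dsp_copy_out(double *dst, size_t n)        { (void)dst; (void)n; }

/* One step of the accelerator model: the host MPI rank exchanges halos
 * with its neighbours, then hands the bulk work to the DSP. */
void accelerator_step(double *buf, int n, int left, int right, MPI_Comm comm) {
    MPI_Sendrecv_replace(buf, n, MPI_DOUBLE, right, 0, left, 0,
                         comm, MPI_STATUS_IGNORE);   /* host-to-host transfer  */
    dsp_copy_in(buf, (size_t)n);                     /* stage data to the DSP   */
    dsp_run_kernel("stencil", (size_t)n);            /* DSP runs the kernel     */
    dsp_copy_out(buf, (size_t)n);                    /* results back for next exchange */
}
```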
122
KTH/SNIC/PRACE DSP HPC nodeDSP Integration Model 2: “Pure DSP”
123
KTH/SNIC/PRACE DSP HPC node
• MPI processes run only on DSPs
• System calls are forwarded to the host
• Host handles control and system call requests
• DSPs communicate directly with each other
• Application sees a homogeneous machine
• Need to provide many system services
• Alternatively, the DSP also runs some OS
• Need to port lots of software, libraries etc.
DSP Integration Model 2: “Pure DSP”
124
KTH/SNIC/PRACE DSP HPC nodeDSP Integration Model 3: Hybrid
125
KTH/SNIC/PRACE DSP HPC node
• MPI processes on both host and DSPs
• Host processes communicate with the OS and libraries
• DSP processes do the number crunching
• Some system calls (e.g. printf) are forwarded (convenience)
• Communication can bypass the host
• Can use OS and libraries already ported to ARM
• Programmer sees two different codes
DSP Integration Model 3: Hybrid (a minimal sketch follows below)
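A minimal sketch of the hybrid model: MPI ranks exist on both host and DSPs, DSP ranks do the number crunching and communicate directly, and selected system calls are forwarded to the host rank; the rank placement and the forward_to_host() helper are assumptions for illustration only:

```c
#include <string.h>
#include <mpi.h>

#define HOST_RANK 0  /* assumed: rank 0 runs on the ARM host */

/* Hypothetical system-call forwarding: a DSP rank ships the request to
 * the host rank, which performs the real I/O on its behalf. */
static void forward_to_host(const char *msg, MPI_Comm comm) {
    MPI_Send(msg, (int)strlen(msg) + 1, MPI_CHAR, HOST_RANK, 99, comm);
}

void hybrid_step(double *buf, int n, int left, int right, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    if (rank == HOST_RANK) {
        char msg[256];
        for (int i = 1; i < size; ++i)          /* serve one forwarded request per DSP rank */
            MPI_Recv(msg, sizeof msg, MPI_CHAR, MPI_ANY_SOURCE, 99, comm,
                     MPI_STATUS_IGNORE);
    } else {
        MPI_Sendrecv_replace(buf, n, MPI_DOUBLE, right, 0, left, 0,
                             comm, MPI_STATUS_IGNORE); /* DSP-to-DSP, bypassing the host */
        /* ... number crunching on the DSP rank ... */
        forward_to_host("step done", comm);     /* convenience syscall forwarding */
    }
}
```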
126
KTH/SNIC/PRACE DSP HPC node
• Hybrid approach is our choice
• Development can start on a "normal" cluster
• Most applications are already somewhat separated into I/O and computational parts
• IBM BlueGene and Cray are similar
DSP Integration Model: Summary
127
KTH/SNIC/PRACE DSP HPC node
• Control the DSPs (reset, start, download code)
• Manage the interconnect (routing tables, link failures, topology discovery)
• Provide simple system-call forwarding for the DSPs
• Allow debug access to the DSPs
• Execute legacy/OS code (shell, ssh, grep etc.)
• Provide I/O connection (TCP/IP networking, Lustre ...)
Host role in Hybrid approach
128
KTH/SNIC/PRACE DSP prototype
• Attached via HyperLink
• Looks like a memory-mapped I/O device
• Can do DMA into/out of DSP memory
• Can generate DSP interrupts
• Should be simple
• Allow many outstanding transfers
• Produce little jitter on the DSP side
DSP to Interconnect Interface
129
KTH/SNIC/PRACE DSP prototype
• Each top-level block in its own "page"
• One global register space for NIC-wide operation
• Several blocks (e.g. 64) for concurrent communication threads
• Each core can independently use some threads
Register Interface
130
KTH/SNIC/PRACE DSP prototypeTx/Rx Registers
131
KTH/SNIC/PRACE DSP prototype
• Each thread can transmit a single packet at a time
• Can do gather DMA from the DSP address space
• Short messages can be written directly to the Tx descriptor registers (to save DMA latency)
• Each message may get split into fragments if too long
Tx Operations
132
KTH/SNIC/PRACE DSP prototype
• Can match on source/tag
• One-shot or ring buffer operation
• Status can be polled
• Interrupt on packet completion
• Puts fragments into the correct position in the receive buffer
• Tracks status for each fragment (i.e. CRC error)
• Maybe support scatter DMA
(a register-layout sketch follows below)
Rx Operations
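A hypothetical C view of the Tx/Rx register layout described on the two slides above: one global block plus per-thread communication blocks, each in its own page, with descriptor fields for gather-DMA transmit and source/tag-matched receive. Field names and sizes are assumptions, not the actual prototype register map:

```c
#include <stdint.h>

#define NIC_COMM_THREADS 64          /* e.g. 64 concurrent communication threads */
#define NIC_PAGE_SIZE    4096

struct nic_tx_block {                 /* one transmit descriptor per thread */
    volatile uint64_t gather_addr;    /* DSP address for gather DMA              */
    volatile uint32_t length;         /* message length in bytes                 */
    volatile uint32_t dest;           /* destination node                        */
    volatile uint64_t inline_data[4]; /* short messages written directly here    */
    volatile uint32_t doorbell;       /* write to start transmission             */
    uint8_t pad[NIC_PAGE_SIZE - 52];  /* pad block to its own page               */
};

struct nic_rx_block {                 /* one receive context per thread */
    volatile uint64_t ring_base;      /* one-shot or ring buffer base            */
    volatile uint32_t match_src;      /* match on source ...                     */
    volatile uint32_t match_tag;      /* ... and tag                             */
    volatile uint32_t status;         /* polled, or raises interrupt on completion */
    volatile uint32_t frag_status;    /* per-fragment status, e.g. CRC error     */
    uint8_t pad[NIC_PAGE_SIZE - 24];  /* pad block to its own page               */
};

struct nic_regs {                     /* memory-mapped NIC register space */
    volatile uint32_t global[NIC_PAGE_SIZE / 4];   /* NIC-wide operation */
    struct nic_tx_block tx[NIC_COMM_THREADS];
    struct nic_rx_block rx[NIC_COMM_THREADS];
};
```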
133
Energy Measurement Overview
[Diagram: DC/DC converters for the DSP, DDR and AUX rails, each with a power readout (current measurement to be implemented); power samples are reported over UART (P_UART) with timestamps tsb, ts, te, tse marking the sampling and benchmark intervals]
E = ∫ P dt
Pavg = E / (tse – tsb)
EBM = E / (ts – te)
Sampling freq.: ?? Hz; accuracy: ??? %
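A minimal sketch of the measurement above: trapezoidal integration of sampled power gives E = ∫P dt, and dividing by the interval length gives the average power; the sample rate and readings below are placeholders:

```c
#include <stdio.h>

/* Integrate power samples p[0..n-1] (W) taken dt seconds apart -> Joules. */
static double energy_joules(const double *p, int n, double dt) {
    double e = 0.0;
    for (int i = 1; i < n; ++i)
        e += 0.5 * (p[i - 1] + p[i]) * dt;   /* trapezoid between samples */
    return e;
}

int main(void) {
    double samples[] = { 9.8, 10.2, 10.4, 10.1, 9.9 };  /* illustrative W readings */
    int n = sizeof samples / sizeof samples[0];
    double dt = 0.01;                                    /* assumed 100 Hz sampling */
    double e = energy_joules(samples, n, dt);
    printf("E = %.3f J, Pavg = %.2f W\n", e, e / ((n - 1) * dt));
    return 0;
}
```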
134
Early Benchmark Results
Benchmark Performance Energy
STREAM – L1 125.18 GB/s 122 pJ/Byte
STREAM – L2 47.6 GB/s 319 pJ/Byte
STREAM – DDR3 8.9 GB/s 2173 pJ/Byte
FFT 585-696 MFLOP/s 283-333 MFLOP/J
DGEMM 585 MFLOP/s 311 MFLOP/J
Theoretical peaks:
L1 bandwidth: 128 GB/s
L2 bandwidth: 2*64 GB/s
DDR bandwidth: 10.6 GB/s (DDR1333, 64-bit)
FFT: 48 GFLOP/s (4 add, 2 mul per cycle, double precision)
DGEMM: 32 GFLOP/s (2 add, 2 mul per cycle)
SGEMM: 128 GFLOP/s (TI implementation 72 GFLOP/s, 56%)
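A quick fraction-of-peak check of the measured bandwidths against the theoretical peaks above (the same percentages appear on the STREAM slide that follows):

```c
#include <stdio.h>

static void pct(const char *name, double measured, double peak) {
    printf("%-6s %7.2f of %7.2f GB/s = %4.1f %% of peak\n",
           name, measured, peak, 100.0 * measured / peak);
}

int main(void) {
    pct("L1",   125.18, 128.0);
    pct("L2",    47.6,   64.0);    /* per-direction peak, as used on the next slide */
    pct("DDR3",   8.9,   10.664);
    return 0;
}
```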
135
STREAM 6678 bandwidth test, 8 cores
[Chart: measured bandwidth (GB/s) vs. data set size in bytes]
L1: 128 GB/s peak, 125 GB/s measured (98% of peak)
L2: 2x64 GB/s peak, 48 GB/s measured (75% of peak)
DDR 1333 MHz: 10.664 GB/s peak, 8.9 GB/s measured (83% of peak)
(Better than TI's results!! Telecon comment)
136
STREAM 6678 Bandwidth test 8 cores
[Chart: energy (W per GB/s) vs. data set size in bytes]
Energy measured for the entire EVM with the on-board emulator
137
FFT – TI result comp.
Platform                  Time per 1024-point complex-to-complex FFT (single precision), μs   Power (W)   Energy per FFT (μJ)
DSP: TI C6678 @ 1.2 GHz   0.85                                                                 10          8.5
DSP: C6678 @ 1 GHz        317.47                                                               16.78       5,327
138
An aside on PDC Green Data Center Projects
139
Heat Reuse Project
• Background: today around 800 kW used at PDC
• Project started 2009 to re-use this energy
• Goals: save cooling water for PDC, save heating costs for KTH, save the environment
• Use district cooling pipes for heating when no cooling is required
• No heat pumps
• Starting with the Cray
• First phase of the Cray will heat the KTH Chemistry building
140
PDC Energy Recovery Project
[Diagram: cabinet row with existing CRAC. Under-floor temperature: normal 15-16°C (59-60°F), max 17°C (62°F). Cabinet exhaust temperature: normal 35-43°C (95-109°F), max 52°C (126°F). Heat recovery coil: water inlet 18°C (64°F), water outlet 28°C (84°F). Exhaust to room: 22°C (72°F). Heights: 2800 mm (110.2"), 1200 mm (47.2"), 300 mm (11.8")]
141
PDC Energy Recovery Project
142
Immersion Cooling
http://www.grcooling.com
PDC is evaluating this technology
143
What’s Next?
144
Plans
• Hardware:
  - Design and build an FPGA switch for interfacing to TI Hyperlink (50 Gbps) and ARM/Calxeda (10 GigE, XAUI, …)
  - TI has now signed a contract for FPGA – Hyperlink IP
  - ………….
• Assess Advantech 4x6678 PCIe card
• Assess TI telecom ARM+Shannon card (not generally available)
• Assess Calxeda
• Q2 2012
145
Movidius Myriad 65nm Media Processor
Source: David Moloney, http://www.hotchips.org/hc23
180 MHz
Next Generation 28 nm: Estimate 250 – 350 GF/W!
146
Objective
• Evaluate the SoC for HPC Applications
Exceptional nominal energy efficiency at the SoC level (350 GF/W, single-precision, incl. memory)
What energy efficiencies can be achieved at application level?
What amount of memory can be stacked/in package at what performance level at what cost?
What SoC enhancements are desirable and feasible for HPC at what cost?
How best to integrate communication at chip and board level?
Explore the toolchain and software ecosystem
• Influence future upcoming products
147
Movidius 10 PFLOPS Strawman
Compute card (32 mm x 25 mm): $200/board, 615 DP GFLOPS @ 2.8 W (8 x 4 x 16 x 800 MHz), 8 x 128 MB DDR3 @ 1.2 GHz, 76.8 GB/s memory bandwidth (8 x 9.6)
Node card: 9,840 GFLOPS, 16 compute cards, 45 W
Cabinet: 10 petaFLOPS, 1,024 nodes, 46 kW, 40 sq ft
10 PFLOPS in a single BlueGene/L cabinet
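The arithmetic behind the strawman above, scaling one compute card to a 16-card node card and a 1,024-node cabinet:

```c
#include <stdio.h>

int main(void) {
    double card_gflops = 615.0, card_watts = 2.8;   /* per compute card (from the slide) */
    double node_gflops = 16 * card_gflops;          /* ~9,840 GFLOPS */
    double node_watts  = 16 * card_watts;           /* ~45 W         */
    double cab_pflops  = 1024 * node_gflops / 1e6;  /* ~10 PFLOPS    */
    double cab_kw      = 1024 * node_watts  / 1e3;  /* ~46 kW        */
    printf("node card: %.0f GFLOPS, %.1f W\n", node_gflops, node_watts);
    printf("cabinet:   %.2f PFLOPS, %.1f kW\n", cab_pflops, cab_kw);
    return 0;
}
```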
148
10 PFLOPS Comparison
Name             | CPU GHz | Flops/Clock/Core | Pk Core GFLOPS | Cores/Socket | Watts/Sckt | Sub-domains/Sckt | MBytes/Socket | Mem BW GB/s | Pk Bytes/FLOP | Netwk BW (GB/s) | # M Sockets | Tot. Power MW | Tot. Cost $M | Tot. petaFLOPS | $/socket | $/GFLOP
AMD Opteron      | 2.8     | 2                | 5.6            | 2            | 95         | 22.4             | 112           | 6.4         | 0.57          | 0.57            | 0.89        | 179           | 1799.6       | 9.97           | 2022     | 180.54
IBM BG/P PPC440  | 0.7     | 4                | 2.8            | 2            | 15         | 11.2             | 56            | 5.5         | 0.98          | 0.98            | 1.78        | 27            | 2600.6       | 9.97           | 1461     | 260.89
Tensilica Custom | 0.65    | 4                | 2.6            | 32           | 22         | 172.8            | 864           | 51.2        | 0.62          | 0.62            | 0.12        | 2.5           | 75           | 9.98           | 625      | 7.51
Movidius Fragrak | 0.8     | 6                | 4.8            | 128          | 2.8        | 204.8            | 1024          | 76.8        | 0.13          | 0.4             | 0.016       | 0.0455        | 3.25         | 9.98           | 200      | 0.33
http://www.hpcuserforum.com/presentations/Germany/EnergyandComputing_Stgt.pdfhttp://www.lbl.gov/cs/html/greenflash.htmlhttp://www.tensilica.com/uploads/pdf/ieee_computer_nov09.pdf
http://en.wikipedia.org/wiki/FLOPS
149
The Prototype
• Streaming Hybrid Architecture Vector Engine (SHAVE): developed for mobile gaming and video applications; is a processor architecture and development environment; contains elements of RISC, DSP, VLIW & GPU architectures
• Myriad SoC Platform (65 nm): 8 cores on a chip, 17 GFLOP/s @ 0.35 W -> ~50 GFLOP/W; contains a SPARC (LEON) control core; stacked together with a low-power DRAM die
• Fragrak Platform (28 nm, internal testing Q1 2013): possibility to add double-precision floating point support; 16 cores, 250 GFLOP/W DP -> 1e18 FLOP/s in 4 MW
150
Fragrak 28nm Platform
[Block diagram: 16 SHAVE cores in four clusters, each SHAVE with 128 kB CMX (256 kB per cluster), connected by Intra-Cluster Buses (ICB) and a Xtra-Cluster Bus (XCB) over a 64/128-bit main bus to a 512 kB L2 cache, a RISC control core, a stacked 256/512 MB LPDDR3 SDRAM die, and SW-controlled multiplexed I/O (SDIO, SPI, I2C, I2S, UART, USB2 OTG, LCD, MIPI DSI/CSI, JTAG, GPS, flash, timers); Movidius IP; 450 GFLOPS/W (IEEE 754 SP)]
151
16 SHAVEs, 1 LEON, 150 GFLOPS, 350 mW, 3Q 2012. Option: DP FPU
152
Movidius 28nm 64-bit FLOPS
[Diagram: VAU with 64-bit FP add and multiply units operating on the 32x128 VRF (4 FLOPS/cycle); SAU with 64-bit FP add and multiply units operating on the 32x32-bit (16x64-bit) SRF (2 FLOPS/cycle); CMU]
Total: 6 x 64-bit FLOPS/cycle
153
PRACE 1IP DSP prototype
[Diagram: two TI DSP EVMs and an ARM EVM connected via two FPGA EVMs, with power instrumentation]
154
SHAVE 28nm Processor (Fragrak)
[Block diagram: SHAVE variable-length instruction processor with VRF 32x128, SRF 32x32, IRF 32x32, VAU, SAU, IAU, CMU, PEU, BRU, DCU, two LSUs, and an IDC with a 1 kB cache of decoded instructions; 256 kB CMX SRAM per SHAVE; 16 kB L1 cache and 512 kB 2-way L2 cache; 800 MHz Intra-Cluster Bus (ICB); 128-bit AXI Xtra-Cluster Bus (XCB); LPDDR3 controller to a stacked 256 - 512 MB SDRAM die.
Bandwidths: 16x12x800 MHz = 76.8 GB/s; 4x17x800 MHz = 54.4 GB/s; 4x12x800 MHz = 38.4 GB/s; 16x2x800 MHz = 25.6 GB/s; 16x800 MHz = 12.8 GB/s; 8x2x800 MHz = 12.8 GB/s]
155
Movidius 28nm BW Hierarchy
[Bandwidth pyramid: Registers (V/S/IRF) 4,864 GB/s; CMX SRAM / ICB, L1 cache, L2 cache / XCB 115 GB/s; SDRAM 6.4 GB/s (2x DDR3 1600); ratios 42:1 and 18:1]
Bottom line - very high sustainable performance
156
Movidius 28nm BW Hierarchy (Detail)
           VRF     SRF    IRF    LSU    IDC    L1     ICB    L2     XCB     SDRAM
Clk (MHz)  800     800    800    800    800    800    800    800    800     800
Bytes      16      4      4      8      16     8      16     16     16      4
Ports      12      12     17     2      1      1      2      1      8       2
BW (GB/s)  153.6   38.4   54.4   12.8   12.8   6.4    25.6   12.8   102.4   6.4
#SHAVEs    16      16     16     16     16     16     16
Total BW   2457.6  614.4  870.4  204.8  204.8  102.4  409.6
157
PRACE Precompetitive Procurement (PCP)
• Objective: increase European technology industry engagement in HPC
• Scope: energy-efficient HPC systems
• Funding: 5 plus 5+ M€
• Timeframe: procurement late 2012 / early 2013
158
Thank You!