
Future farm technologies & architectures John Baines 1.

Date post: 03-Jan-2016
Future farm technologies & architectures

John Baines

Introduction
• What will the HLT farm look like in 2020?
• When & how do we narrow the options?
  – Choice affects software design as well as farm infrastructure
• How do we evaluate cost/benefits?
• When & how do we make the final choices for farm purchases?
• How do we design software now to ensure we can fully exploit the capability of future farm hardware?
• What do we need in the way of demonstrators for specific technologies?
• We can't evaluate all options: what should be the priorities?

Timescales: Framework, Steering & New Technologies (draft version for discussion)

[Timeline chart. Time axis labels: 2014 Q3, Q4, LS 1, Commissioning Run, Run 3. Rows: Framework; New Tech.; Algs & Menus.]

Activities shown:
• Design & prototype
• Implement core functionality
• Extend to full functionality
• Evaluate
• Implement infrastructure
• Exploit new tech. in algorithms
• Speed up code, thread-safety, investigate possibilities for internal parallelisation
• Implement algorithms in new framework
• Prototype with 1 or 2 chains
• Simple menu
• Full menu complete

Milestones shown:
• Framework requirements capture complete
• Design of framework & HLT components complete
• Framework core functionality complete, incl. HLT components & new tech. support
• Narrow h/w choices, e.g. use or not GPU
• Fix PC architecture
• HLT software commissioning complete
• Final software complete (framework & algos.)

Technologies
• CPU: increased core counts
  – currently 12 cores (24 threads), e.g. Xeon E5 2600 v2 series, ~0.5 TFLOPS
  – 18 cores (36 threads) coming soon (Xeon E5 2600 v3 series)
  – Possible trend to many cores with lower memory => cannot continue to run one job per core
• GPU: much bigger core count, e.g. Nvidia K40: 15 SMX, 2880 cores, 12 GB memory, 4.3 (1.4) TFLOPS SP (DP)
• Coprocessor: e.g. Intel Xeon Phi: up to 61 cores, 244 threads, 1.2 TFLOPS

GPU: Towards a cost-benefit analysis

Will need to assess:
• Effort needed to port code to GPU, to maintain it (bug fixing, new hardware…) and to support MC simulation on the Grid
• Speed-up for individual components & full chain
• What can be outsourced to GPU and what is done on CPU
• Integration with Athena (APE)
• Balance of CPU cores to GPU, i.e. sharing of GPU resource between several jobs
• Farm integration issues: packaging, power consumption…
• Financial cost: hardware, installation, commissioning, maintenance…

As an exercise, see what we can learn from studies to date, i.e. the cost-benefit if we were to purchase today.

Demonstrators
• ID (RAL, Edinburgh, Oxford):
  – Complete L2 ID chain ported to CUDA for NVIDIA GPU
  – ID data preparation (bytestream conversion, clustering, space-point formation) additionally ported to OpenCL
• Muon (CERN, Rome):
  – Muon calorimeter-isolation implemented in CUDA
• Jet (Lisbon):
  – Just starting

See: Twiki: TriggerSoftwareUpgrade

Effort needed to port code?
• Porting L2 ID tracking to CUDA: ~2 years @ 0.5 FTE => 1 staff year (for a very experienced expert!)

GPUs

Example of complete L2 ID chain implemented on GPU (Dmitry Emeliyanov)

Time (ms), Tau RoI 0.6x0.6, ttbar events at 2x10^34:

Step               C++ on 2.4 GHz CPU   CUDA on Tesla C2050   Speed-up CPU/GPU
Data Prep.         27                   3                     9
Seeding            8.3                  1.6                   5
Seed ext.          156                  7.8                   20
Triplet merging    7.4                  3.4                   2
Clone removal      70                   6.2                   11
CPU-GPU xfer       n/a                  0.1                   n/a
Total              268                  22                    12

[Figure panels: Data Prep.; L2 Tracking. Annotations: x2.4, x5]

Max. speed-up: x26
Overall speed-up t(CPU)/t(GPU): 12
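As a cross-check of the timing table, the per-step and overall speed-ups can be recomputed from the quoted CPU and GPU times. The following is a minimal sketch; the names `TIMINGS_MS`, `XFER_MS`, `step_speedups` and `overall_speedup` are illustrative and not taken from the slides.

```python
# Timings in ms copied from the table: step -> (CPU time, GPU time).
TIMINGS_MS = {
    "Data Prep.": (27.0, 3.0),
    "Seeding": (8.3, 1.6),
    "Seed ext.": (156.0, 7.8),
    "Triplet merging": (7.4, 3.4),
    "Clone removal": (70.0, 6.2),
}
XFER_MS = 0.1  # CPU-GPU transfer time, paid only on the GPU path


def step_speedups(timings):
    """Return t(CPU)/t(GPU) for every step."""
    return {step: cpu / gpu for step, (cpu, gpu) in timings.items()}


def overall_speedup(timings, xfer_ms):
    """Return total CPU time divided by total GPU time including transfer."""
    cpu_total = sum(cpu for cpu, _ in timings.values())
    gpu_total = sum(gpu for _, gpu in timings.values()) + xfer_ms
    return cpu_total / gpu_total


per_step = step_speedups(TIMINGS_MS)
overall = overall_speedup(TIMINGS_MS, XFER_MS)

for step, factor in per_step.items():
    print(f"{step:16s} x{factor:.1f}")
print(f"{'Overall':16s} x{overall:.1f}")
```

The recomputed values agree with the rounded speed-up column: the overall factor is dominated by the seed-extension step, which accounts for more than half of the CPU time.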

Sharing of GPU resource

• With balanced load on CPU/GPU, several CPU cores can share a GPU, e.g. test with L2 ID tracking with 8 CPU cores sharing one Tesla C2050 GPU

[Plot legend: blue: tracking running on CPU; red: most tracking steps on GPU, final ambiguity solving on CPU. Annotation: x2.4]

Packaging

Examples:
• 1U: 2 x E5-2600 or E5-2600v2, 3 x GPU
• 2U: 2 x E5-2600 or E5-2600v2, 4 x GPU

Components:
• CPU: Intel E5-2697v2: 12 cores, ~0.5 TFLOPS, ~2.3k CHF
• GPU: Nvidia K20: 2496 cores, 13 SMX, 192 cores per SMX, 3.5 (1.1) TFLOPS for SP (DP), ~2.4k CHF

Totals:
• Total for 1027 or 2027 with 2 K20 GPU: ~15k CHF => 12 CPU cores/GPU
• Total for 2027 with 4 K20 GPU: ~20k CHF => 6 CPU cores/GPU

Power & Cooling

SDX racks:
• Max. power: 12 kW
• Usable space: 47 U
• Current power ~300 W per motherboard => max. 40 motherboards per rack
• Compare 2U unit:
  a) 4 motherboards, 8 CPU: 1.1 kW
  b) 1 motherboard, 2 CPU with 2 GPU (750 W) or 4 GPU (1.2 kW)

Based on max. power: K20 GPU: 225 W c.f. E5-2697v2 CPU: 130 W (need to measure typical power)

Illustrative farm configuration: 50 racks

Configuration                                      Total farm nodes   CPU     Cores (max threads)   GPU (SMX)        Required throughput per node (per CPU core)
40 nodes per rack, ~300 W/node                     2,000              4,000   48,000 (96,000)       0                50 Hz (2.1 Hz)
10 nodes per rack, 4 GPU per node, ~1200 W/node    500                1,000   12,000 (24,000)       2,000 (26,000)   200 Hz (8.3 Hz)
16 nodes per rack, 2 GPU per node, ~750 W/node     800                1,600   19,200 (38,400)       1,600 (20,800)   125 Hz (5.2 Hz)

Required throughput per CPU core relative to the CPU-only farm: x4 (4 GPU per node), x2.5 (2 GPU per node)
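The rows of the illustrative farm configuration follow from the 12 kW rack limit and the per-node power. The sketch below reproduces them under stated assumptions: a total input rate of 100 kHz (inferred from 2,000 nodes x 50 Hz, not stated on the slide), 2 CPUs of 12 cores per node, and 13 SMX per GPU (K20). The names `FarmConfig` and `farm_config` are illustrative.

```python
from dataclasses import dataclass

RACKS = 50
RACK_POWER_W = 12_000     # max. power per SDX rack
CPUS_PER_NODE = 2
CORES_PER_CPU = 12        # E5-2697v2
SMX_PER_GPU = 13          # K20
INPUT_RATE_HZ = 100_000   # assumption: inferred from 2,000 nodes x 50 Hz


@dataclass
class FarmConfig:
    nodes: int
    cores: int
    gpus: int
    smx: int
    rate_per_node_hz: float
    rate_per_core_hz: float


def farm_config(node_power_w, gpus_per_node):
    """Size a power-limited farm and the event rate each node must sustain."""
    nodes_per_rack = RACK_POWER_W // node_power_w
    nodes = RACKS * nodes_per_rack
    cores = nodes * CPUS_PER_NODE * CORES_PER_CPU
    gpus = nodes * gpus_per_node
    return FarmConfig(
        nodes=nodes,
        cores=cores,
        gpus=gpus,
        smx=gpus * SMX_PER_GPU,
        rate_per_node_hz=INPUT_RATE_HZ / nodes,
        rate_per_core_hz=INPUT_RATE_HZ / cores,
    )


cpu_only = farm_config(node_power_w=300, gpus_per_node=0)
four_gpu = farm_config(node_power_w=1200, gpus_per_node=4)
two_gpu = farm_config(node_power_w=750, gpus_per_node=2)

for name, cfg in [("CPU only", cpu_only), ("4 GPU/node", four_gpu), ("2 GPU/node", two_gpu)]:
    print(f"{name:11s} nodes={cfg.nodes:5d} cores={cfg.cores:6d} "
          f"per core={cfg.rate_per_core_hz:.1f} Hz")
```

The ratio of `rate_per_core_hz` between a GPU configuration and the CPU-only one gives the x4 and x2.5 factors quoted for the two GPU options.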

Summary

• Current limiting factor is cooling: 12 kW/rack => adding GPU means removing CPU
• Factor 2.5-4 fewer CPUs requires a corresponding increase in throughput per CPU
• Financial cost per motherboard (2U box with 8 CPU versus 2 CPU + 4 GPU): CPU+GPU is a factor ~2 more expensive
• => win with CPU+GPU solution when throughput per CPU is increased by more than a factor 8 => ~90% of work (by CPU time) transferred to GPU
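One way to read the factor 8 is as the product of the CPU-count reduction (x4) and the cost ratio (x2); this reading is an assumption, as is the simple serial model used below, in which the CPU waits while the GPU works. Under that model, the fraction of CPU time that must move to the GPU for a target gain can be solved for directly. The function names are illustrative.

```python
import math


def required_gain(cpu_reduction, cost_ratio):
    """Throughput gain per CPU needed to break even (assumed: simple product)."""
    return cpu_reduction * cost_ratio


def required_gpu_fraction(gain, gpu_speedup=math.inf):
    """Fraction f of CPU time to offload so that 1/((1-f) + f/S) equals gain.

    Returns None when the gain cannot be reached with the given speed-up S.
    """
    if gpu_speedup <= 1:
        return None
    fraction = (1 - 1 / gain) / (1 - 1 / gpu_speedup)
    return fraction if fraction <= 1 else None


target = required_gain(cpu_reduction=4, cost_ratio=2)

ideal = required_gpu_fraction(target)                    # infinitely fast GPU
realistic = required_gpu_fraction(target, gpu_speedup=12)
unreachable = required_gpu_fraction(target, gpu_speedup=5)

print(f"target gain x{target}")
print(f"fraction needed, infinite speed-up: {ideal:.3f}")
print(f"fraction needed, speed-up 12:       {realistic:.3f}")
print(f"fraction needed, speed-up 5:        {unreachable}")
```

With an ideal GPU the required fraction is 87.5%, consistent with the ~90% quoted; with a finite speed-up of 12 it rises to about 95%, and with a speed-up below 8 the target cannot be reached at all.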

Discussion
• Benefits:
  – If we can manage to port the bulk of the time-consuming code to GPU, the benefit is potentially much better scaling with pile-up (p.u.), i.e.
    • No combinatorial code left on CPU => execution times will scale slowly with p.u.
    • Code on GPU is parallel and will scale slowly with p.u.
• Costs:
  – Significant effort needed to port code
  – Need to support different GPU generations with rolling replacements
  – Potential divergence from offline
  – Need to support CPU version of code for simulation
  – Possibly more expensive than CPU-only farm

=> CPU+GPU solution attractive IF CPU-based farm cannot provide enough processing power.

=> However, it currently looks like a CPU-only farm is the least-cost solution

=> Discuss!

CPU

• Coming: e.g. Xeon E5-2699 v3, 18 cores and 36 threads, 3,960 EUR / $5,392

GPU

Price in US $:

Quantity   K40     K20X    K20     M2090   C2050
1          4435    3200    2695    1825    1100
2          8870    9600    8085    5475    2200
4          17740   12800   10780   7300    4400

Increase in Throughput per CPU when GPU added

[Plot labels: speed-up t(CPU)/t(GPU); CPU code serial: waits for GPU completion; fraction defined in terms of execution time on CPU]

If the CPU count is reduced by a factor 4, a factor 4 increase in throughput is needed to break even, i.e. 75% of work moved to GPU.
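The break-even statement can be checked with a simple Amdahl-style model, assumed here from the plot labels: the CPU runs serially and waits for the GPU, and the offloaded fraction is measured in CPU execution time. The function name `throughput_gain` is illustrative.

```python
import math


def throughput_gain(fraction_on_gpu, gpu_speedup):
    """Throughput increase per CPU when a fraction of the work moves to GPU.

    fraction_on_gpu: share of the original CPU time that is offloaded (0..1).
    gpu_speedup:     t(CPU)/t(GPU) for the offloaded part.
    """
    if not 0 <= fraction_on_gpu <= 1:
        raise ValueError("fraction_on_gpu must be between 0 and 1")
    if gpu_speedup <= 0:
        raise ValueError("gpu_speedup must be positive")
    # Remaining CPU time plus the time spent waiting for the GPU,
    # in units of the original CPU-only time.
    new_time = (1 - fraction_on_gpu) + fraction_on_gpu / gpu_speedup
    return 1 / new_time


# Limit of an infinitely fast GPU: moving 75% of the work gives a factor 4.
ideal_gain = throughput_gain(0.75, math.inf)

# 120 ms of a 360 ms job offloaded with a speed-up of 12 (120 ms -> 10 ms).
example_gain = throughput_gain(120 / 360, 12)

for speedup in (5, 12, 26, math.inf):
    gains = [throughput_gain(f, speedup) for f in (0.5, 0.75, 0.9)]
    print(f"speed-up {speedup:>4}: " + "  ".join(f"x{g:.2f}" for g in gains))
```

For a finite speed-up the gain at 75% offload stays below 4, so in practice more than 75% of the work must move to the GPU to break even.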

Speed-up factors

HLT:
• 60% tracking
• 20% Calo
• 10% Muon
• 10% other
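Given this breakdown, the overall gain when individual components are offloaded can be estimated by extending the serial model to several components. This is a sketch under that assumed model; the component speed-ups passed in are hypothetical inputs, not measurements from the slides, and `overall_gain` is an illustrative name.

```python
import math

# HLT time shares from the slide.
HLT_TIME_SHARE = {"tracking": 0.60, "calo": 0.20, "muon": 0.10, "other": 0.10}


def overall_gain(shares, speedups):
    """Overall throughput gain when each component gets its own speed-up.

    shares:   component -> fraction of the original time (must sum to 1).
    speedups: component -> speed-up factor; missing components stay at 1.
    """
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("time shares must sum to 1")
    new_time = sum(share / speedups.get(name, 1.0) for name, share in shares.items())
    return 1 / new_time


# Hypothetical scenario: only tracking is offloaded, with a speed-up of 12.
tracking_only = overall_gain(HLT_TIME_SHARE, {"tracking": 12})

# Upper bound: everything except "other" is offloaded to an ideal GPU.
upper_bound = overall_gain(
    HLT_TIME_SHARE,
    {"tracking": math.inf, "calo": math.inf, "muon": math.inf},
)

print(f"tracking only, x12: gain x{tracking_only:.2f}")
print(f"all but 'other', ideal GPU: gain x{upper_bound:.1f}")
```

In this model the 10% of time that stays on the CPU caps the overall gain at a factor 10, however fast the GPU is.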

• Cost of GPU - cost of CPU
• Cost of effort for online version
• Cost for simulation

CPU#1: 12 CPU cores; 12/24 CPU threads
GPU#1: 15 SMX; 2880 cores
GPU#2: 15 SMX; 2880 cores

[Timing diagram]
• CPU only: 120 ms + 240 ms = 360 ms
• With GPU: 10 ms + 240 ms = 250 ms (69% of the CPU-only time)

CPU time: x0.69; throughput: x1.44

6 jobs per GPU

Data Preparation Code

Power & Cooling

SDX racks:
• Upper level: 27? XPU racks
  – each 47U usable; 9.5 kW
  – 1U: 31, 32, 40 per rack (=> max. 300 W per 1U)
  – Current power consumption 6-9 kW per rack
• Lower level: partially equipped with 10 racks (+6 preseries racks)
  – each 47U (could be 52U with additional reinforcing); 15 kW
  – 2U 4-blade servers, 1100 W, 8 or 10 per rack (9-11 kW)

GPU: C2050: <238 W; K20: <225 W; K40: <235 W c.f. CPU: 130 W (12 core, 2.7 GHz)
=> GPU has ~80% higher max. power consumption than CPU.
=> Adding 1 GPU ~doubles power consumption of node

50 racks:

Configuration                                      Nodes (motherboards)   CPU     Cores (max threads)   GPU (SMX)        Throughput per node (per CPU core)
40 nodes per rack                                  2000                   4000    48000 (96000)         0                50 Hz (2.08 Hz)
10 nodes per rack, 4 GPU per node, ~1200 W/node    500                    1000    12000 (24000)         2000 (30000)     200 Hz (8.33 Hz)
15 nodes per rack, 2 GPU per node, ~800 W/node     750                    1500    18000 (36000)         1500 (22500)     133 Hz (5.55 Hz)

Packaging

• 1U: 2 x E5-2600 or E5-2600v2, 3 x GPU
  – e.g. 2x12 = 24 CPU cores, 3 GPU => 8 CPU cores/GPU
  – + GPU: 3 x $4435
• 2U: 2 x E5-2600 or E5-2600v2, 4 x GPU
  – e.g. 2x12 = 24 CPU cores, 4 GPU => 6 CPU cores/GPU

GPU: e.g. K40: 2880 cores, 15 SMX, 192 cores per SMX, 4.3 (1.4) TFLOPS for SP (DP): $4400

