Page 1: Www.montblanc-project.eu This project has received funding from the European Union's Seventh Framework Programme for research, technological development.

This project has received funding from the European Union's Seventh Framework Programme for research, technological development and demonstration under grant agreement n.610402

http://www.montblanc-project.eu

May 7, 2015, CSW & BR Oslo

Understanding and Addressing the Resiliency Issues for Future Exascale Computing with the Mont-Blanc Prototype

Ferad Zyulkyarov

Barcelona Supercomputing Center

Acknowledgement

• Javier Arias
• Jesus Labarta
• Filippo Mantovani
• Dani Ruiz
• Omer Subasi
• Osman Unsal
• Oriol Vilarrubi
• Gulay Yalcin

About This Presentation

• Focus on memory resiliency

• First ever attempt to characterize the memory reliability of a large system which has no memory ECC

• Relate our numbers to the state-of-the-art

• Our software-based proposals to complement HW ECC

Comparing with the Related Work

State of the art (Cielo and Hopper) compared with Mont-Blanc.

[1] Sridharan et al., "Memory Errors in Modern Systems: The Good, The Bad, and The Ugly", ASPLOS 2015
[2] Sridharan and Liberty, "A study of DRAM failures in the field", SC 2012

Technical Details

Parameter          Cielo              Hopper             Mont-Blanc
Nodes              8,568              6,000              1,080
Cores              137,088            144,000            2,160
Core type          AMD Opteron        AMD Opteron        ARM Cortex-A15
Memory per node    32 GB              32 GB              4 GB
Total memory       268 TB             188 TB             4.3 TB
Memory type        DDR3               DDR3               LPDDR3
ECC                Chipkill-correct   Chipkill-detect    None
Peak performance   1,120 TFLOPS       1,054 TFLOPS       35 TFLOPS
Altitude           2,231 m            13 m               18 m
Location           Los Alamos, NM     Oakland, CA        Barcelona

Mont-Blanc is much smaller than Cielo and Hopper.

This Study

                        Related Work   Mont-Blanc
Node hours (million)    157            0.65
GB hours (million)      11,250         1.90

This is a preliminary study, and the results are not yet statistically strong enough to draw conclusions.

The Mont-Blanc study covers 917 nodes, and only 3 GB per node were scanned.

Memory FIT per 1GB

[Figure: Memory FIT for 1 GB. Y-axis: FIT, 0 to 16,000. Bar values: 14,744, 377, 129, 152, 89, 76, 86.]

LPDDR3 in Mont-Blanc has a very high FIT rate. We are not sure why, but we attribute it to LPDDR3 being a low-end product.
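
As a rough illustration of how a FIT figure of this size can arise, the sketch below computes failures per 10^9 GB-hours from an error count and the scanned GB-hours. It assumes the 1.90 million GB-hours from the "This Study" slide and 28 errors, which is the number of rows in the error log on the Memory Reliability slides. Whether the authors derived their 14,744 exactly this way is an assumption, and the function name is illustrative.

```python
def fit_per_gb(errors: int, gb_hours: float) -> float:
    """Failures In Time: expected failures per 10^9 hours, normalized to 1 GB."""
    return errors / gb_hours * 1e9


# Assumed inputs: 28 logged errors over 1.90 million scanned GB-hours.
mont_blanc_fit = fit_per_gb(errors=28, gb_hours=1.90e6)
print(f"Estimated FIT per GB: {mont_blanc_fit:,.0f}")
```

With these inputs the estimate lands close to the 14,744 shown on the slide; the small difference is consistent with the GB-hours figure being rounded to two decimals.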

MTBF (all faults)

[Figure: Memory faults (corrected and uncorrected) MTBF. Y-axis: 0.0 to 30.0. Bar values: 25.0, 0.3, 0.4, 4.3, 17.9, 10.7, 26.1, 21.4, 26.9.]

We definitely need ECC in hardware; otherwise, a Mont-Blanc system at the scale of Cielo may fail every 20 minutes.

What if Mont-Blanc had the same amount of memory as Cielo?

What if Mont-Blanc had the same amount of memory as Hopper?
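
The 20-minute estimate can be approximated with a back-of-the-envelope calculation. The sketch below assumes a constant failure rate that scales linearly with memory capacity, so MTBF in hours is 10^9 / (FIT per GB × capacity in GB). The function name and the decimal TB-to-GB conversion are illustrative assumptions, not necessarily the authors' method.

```python
def mtbf_hours(fit_per_gb: float, capacity_gb: float) -> float:
    """MTBF in hours, assuming faults scale linearly with memory capacity."""
    return 1e9 / (fit_per_gb * capacity_gb)


MB_FIT_PER_GB = 14_744          # Mont-Blanc FIT per GB, from the FIT slide

# "What if" scenarios: Mont-Blanc memory scaled to Cielo and Hopper capacity.
cielo_like = mtbf_hours(MB_FIT_PER_GB, 268_000)    # 268 TB, taken as 268,000 GB
hopper_like = mtbf_hours(MB_FIT_PER_GB, 188_000)   # 188 TB, taken as 188,000 GB

print(f"Cielo-sized:  {cielo_like:.2f} h ({cielo_like * 60:.0f} min)")
print(f"Hopper-sized: {hopper_like:.2f} h ({hopper_like * 60:.0f} min)")
```

Under these assumptions the Cielo-sized case comes out at roughly a quarter of an hour, the same order of magnitude as the estimate on the slide.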

ECC in Hardware

[Figure: Uncorrectable Memory Faults MTBF. Y-axis: days, 0.0 to 90.0. Bar values: 43.1, 0.4, 0.6, 18.2, 26.3, 0.8, 32.7, 9.5, 85.0. Additional value labels: 394 and 3546.]

SECDED will not be strong enough to keep a large system operable.

Projections for Exascale

[Figure: Exascale projection MTBF for Mont-Blanc DRAM, Chipkill-uncorrectable errors (High FIT is Vendor A). X-axis: 32 PB, 64 PB, 96 PB, 128 PB. Y-axis: days, 0.0 to 12.0. Series: 8 Gbit, 16 Gbit, and 32 Gbit devices, each at High FIT, Low FIT, and MB FIT.]

The MTBF for uncorrectable errors with chipkill in a system with commodity memory like in Mont-Blanc will be between 0.3 and 5.1 days.

Memory Reliability

Date        Time      Host      Address     Expected  Actual  Type          Num Cells
26/02/2015  10:21:25  mb-30-13  0x2b1f1cb4  0         8000    Permanent     1
06/03/2015  08:39:51  mb-38-15  0xb93e160   81F2      81FA    Transient     1
12/03/2014  22:39:05  mb-22-3   0x38d6ea10  10000532  532     Transient     1
14/03/2015  11:43:12  mb-21-10  0x775a4dbc  3723      7723    Intermittent  1
14/03/2015  11:43:22  mb-21-10  0x775a4dbc  3727      7727    Intermittent  1
14/03/2015  11:43:24  mb-21-10  0x775a4dbc  3728      7728    Intermittent  1
14/03/2015  11:43:27  mb-21-10  0x775a4dbc  3729      7729    Intermittent  1
14/03/2015  11:43:29  mb-21-10  0x775a4dbc  372A      772A    Intermittent  1
14/03/2015  11:43:31  mb-21-10  0x775a4dbc  372B      772B    Intermittent  1
14/03/2015  11:43:34  mb-21-10  0x775a4dbc  372C      772C    Intermittent  1
14/03/2015  11:43:36  mb-21-10  0x775a4dbc  372D      772D    Intermittent  1
14/03/2015  11:43:38  mb-21-10  0x775a4dbc  372E      772E    Intermittent  1
14/03/2015  11:43:41  mb-21-10  0x775a4dbc  372F      772F    Intermittent  1
14/03/2015  11:43:43  mb-21-10  0x775a4dbc  3730      7730    Intermittent  1
14/03/2015  11:43:46  mb-21-10  0x775a4dbc  3731      7731    Intermittent  1
14/03/2015  11:43:48  mb-21-10  0x775a4dbc  3732      7732    Intermittent  1
14/03/2015  11:43:51  mb-21-10  0x775a4dbc  3733      7733    Intermittent  1
14/03/2015  11:43:53  mb-21-10  0x775a4dbc  3734      7734    Intermittent  1
14/03/2015  11:43:55  mb-21-10  0x775a4dbc  3735      7735    Intermittent  1
14/03/2015  11:43:58  mb-21-10  0x775a4dbc  3736      7736    Intermittent  1
14/03/2015  11:44:41  mb-21-10  0x775a4dbc  3748      7748    Intermittent  1
14/03/2015  11:44:43  mb-21-10  0x775a4dbc  3749      7749    Intermittent  1
14/03/2015  11:44:46  mb-21-10  0x775a4dbc  374A      774A    Intermittent  1
18/03/2015  06:25:01  mb-24-4   0xb6c069e0  453DA     53DA    Transient     1
18/03/2015  16:00:58  mb-62-6   0xa83cb640  0         2       Transient     1
26/03/2015  12:17:52  mb-26-12  0x405dfc00  6E61      461     Transient     4
26/03/2015  13:32:17  mb-24-11  0xb4f33000  E6006358  58      Transient     9
29/03/2015  09:40:54  mb-35-4   0x3f6eb824  14388     4338    Transient     1

Permanent error. The most significant bit cannot be set to 1.

Transient errors

Intermittent errors at the same address and bit, with annotated intervals of 10 sec, 3 sec, and 2 sec between consecutive errors.

Multi-bit errors
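
One way to see which bits differ in a log entry is to XOR the Expected and Actual values and list the set bits. The sketch below does this for a few rows of the error log; it assumes both columns are hexadecimal values of the same memory word, and the helper name `flipped_bits` is illustrative rather than part of the authors' scanner.

```python
def flipped_bits(expected: int, actual: int) -> list[int]:
    """Return the positions (0 = least significant) of bits that differ."""
    diff = expected ^ actual
    return [bit for bit in range(diff.bit_length()) if (diff >> bit) & 1]


# (expected, actual) pairs copied from the error log, read as hexadecimal.
samples = {
    "permanent, mb-30-13": (0x0, 0x8000),
    "intermittent, mb-21-10": (0x3723, 0x7723),
    "transient, mb-26-12": (0x6E61, 0x461),
    "transient, mb-24-11": (0xE6006358, 0x58),
}

for label, (expected, actual) in samples.items():
    bits = flipped_bits(expected, actual)
    print(f"{label}: {len(bits)} bit(s) differ at positions {bits}")
```

For the two multi-bit rows, the number of differing bits matches the Num Cells column (4 and 9).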

These coincide with major solar flares

Reliability Techniques in Software

• Task checkpoint and restart

• Task replication

• Other ongoing activities

Advantages of Tasks for Fault Tolerance

• Task boundaries explicitly delimit the scope of the checkpoints

• The explicit task input/output declarations decrease the checkpoint state

• Compared to pthread-like parallel programs, checkpointing does not require any complex coordination or synchronization between threads

• The recovery is asynchronous

Checkpoint and Recovery for Tasks

[Diagram labels: Task start T1, Input, Checkpoint, Fault detected, Recover, Recover execution, Task end T1]

• Recover task execution from detected faults.

• Isolate the fault propagation within the task boundaries.

• Inputs are known at runtime through explicit declaration.

• Overheads of checkpointing are minimal.

• Recovery is asynchronous.

Limitations: does not cover the execution outside tasks.
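
The checkpoint-and-recover idea for a single task can be sketched as follows. This is a minimal illustration under stated assumptions, not the runtime described in the talk: `FaultDetected`, `run_task_with_checkpoint`, and the retry budget are hypothetical names and choices, and the fault is simulated.

```python
import copy


class FaultDetected(Exception):
    """Raised by a (hypothetical) detector when a fault is seen inside a task."""


def run_task_with_checkpoint(task, inputs, max_retries=3):
    # Checkpoint only the declared inputs, taken once at task start.
    checkpoint = copy.deepcopy(inputs)
    for _ in range(max_retries + 1):
        try:
            return task(inputs)
        except FaultDetected:
            # Restore the inputs and re-execute; the fault stays inside the task.
            inputs = copy.deepcopy(checkpoint)
    raise RuntimeError("task did not complete within the retry budget")


calls = {"n": 0}


def flaky_scale(data):
    """Doubles every element; the first call simulates a corrupted input."""
    calls["n"] += 1
    if calls["n"] == 1:
        data["x"][0] = -999
        raise FaultDetected("simulated memory fault")
    return [2 * v for v in data["x"]]


result = run_task_with_checkpoint(flaky_scale, {"x": [1, 2, 3]})
print(result, "after", calls["n"], "executions")
```

Because only the task's inputs are saved, the checkpoint stays small, and code running outside `run_task_with_checkpoint` is not protected, which mirrors the limitation noted on the slide.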

Results: Checkpoint and Recovery for Tasks

[Figures: Single-Node Scalability; Multi-Node Scalability; Checkpointing Overheads (y-axis: 0% to 14%; benchmarks: Sparse LU, Cholesky, FFT, Perlin, Stream; series: Baseline impl., Singleton).]

Task Replication

Detect and recover from silent data corruption.

[Diagram labels: Input, Checkpoint, T1, T1', Output, Output, No fault, Fault, Input, T1'', Output]

Re-execute the task one more time and use the two outputs that match as the correct result.

1. The task and its replica execute asynchronously.
2. No synchronization between the task and its replica (only at the end of the task execution).
3. Faults do not limit parallelism; re-execution is also asynchronous.
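
The replicate, compare, and re-execute logic can be sketched as follows. For brevity the replicas run sequentially here, whereas the slide describes them as asynchronous; `run_replicated` and the simulated bit flip are illustrative assumptions, not the authors' implementation.

```python
import copy


def run_replicated(task, inputs):
    # Checkpoint the inputs so that every execution starts from the same state.
    checkpoint = copy.deepcopy(inputs)
    out1 = task(copy.deepcopy(checkpoint))   # T1
    out2 = task(copy.deepcopy(checkpoint))   # T1' (replica)
    if out1 == out2:
        return out1                          # no fault observed
    # Outputs differ: run once more and keep the two results that agree.
    out3 = task(copy.deepcopy(checkpoint))   # T1''
    if out3 == out1:
        return out1
    if out3 == out2:
        return out2
    raise RuntimeError("no two outputs agree")


calls = {"n": 0}


def sum_task(values):
    """Sums the inputs; the second execution simulates a silent bit flip."""
    calls["n"] += 1
    total = sum(values)
    if calls["n"] == 2:
        total ^= 1 << 3
    return total


result = run_replicated(sum_task, [1, 2, 3, 4])
print(result, "after", calls["n"], "executions")
```

In the fault-free case the cost is one extra execution and one comparison; the third execution happens only when the first two outputs disagree.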

Results: Task Replication

[Figures: Single-Node Scalability; Multi-Node Scalability; Overheads of Replication (y-axis: 0% to 9%).]

Other activities

• Software-based ECC
  • To compensate for missing or weak ECC in hardware

• Selective replication
  • To reduce the cost of resource utilization by replicating only the reliability-critical code

• Checkpoint and restart for tasks with MPI calls
  • To provide multi-node checkpoint restart within a task-based programming model

• Hierarchical checkpoint restart with task checkpointing
  • To decrease the checkpoint overheads and recovery time

Summary

• Preliminary memory reliability characterization

• Low-end commodity DRAM devices might be more susceptible to transient faults

• Even strong memory ECC alone may not be sufficient to mitigate transient faults in exascale computing

• SW-based fault tolerance which is coupled to a specific programming model might be a lightweight solution to complement HW-based ECC

Thank you!

Questions?
