Date post: 28-May-2020
Resilient Distributed Concurrent Collections
Cédric Bassem
Promotor: Prof. Dr. Wolfgang De Meuter
Advisor: Dr. Yves Vandriessche

Evolution of Performance in High Performance Computing

(source: http://www.top500.org/statistics/perfdevel/)

Petascale = 10^15 Flop/s

Exascale = 10^18 Flop/s


Evolution of Failures in HPC

Main Source: Hardware Faults (~ 50%)

Source: Franck Cappello (2009)

In Exascale: SMTTI < 30 min

SMTTI = System Mean Time To Interrupt


Resilience

“The collection of techniques for keeping applications running to a correct solution in a timely and efficient manner despite underlying system faults” (Snir et al., 2014)

Resilience = Fault Tolerance (Avizienis et al., 2004)


Coordinated Checkpoint/Restart


Asynchronous Checkpoint/Restart


Requirements for Asynchronous Checkpoint/Restart

Reasoning about state: Self-aware, execution frontier

Safe restart: Deterministic computation

Data race free: Monotonically increasing state


Resilience in CnC

Vrvilo, N. (2014). Asynchronous Checkpoint/Restart for the Concurrent Collections Model (Unpublished master’s thesis). Rice University, Houston, Texas USA.

Focused on shared memory CnC runtimes

CnC Properties:
● Dependency graph
● Provable deterministic computation
● Single assignment data
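The single assignment property is what makes the state grow monotonically: an item is either absent or has its final value. A minimal sketch of such a store follows; `SingleAssignmentStore`, `put` and `get` are hypothetical names chosen for illustration and are not taken from any CnC runtime.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <utility>

// Hypothetical single-assignment store: every key may be written exactly once.
template <typename Key, typename Value>
class SingleAssignmentStore {
public:
    // Adds an item; a second put for the same key is an error, so existing
    // items are never overwritten and the store only ever grows.
    void put(const Key& key, const Value& value) {
        bool inserted = items_.insert(std::make_pair(key, value)).second;
        if (!inserted) throw std::logic_error("single assignment violated");
    }

    // Copies the item into `out` and returns true if it exists.
    bool get(const Key& key, Value& out) const {
        typename std::map<Key, Value>::const_iterator it = items_.find(key);
        if (it == items_.end()) return false;
        out = it->second;
        return true;
    }

    std::size_t size() const { return items_.size(); }

private:
    std::map<Key, Value> items_;
};
```

Because no item ever changes after it is written, a copy of the store taken at any moment holds only final values, which is why a checkpoint does not need to stop the computation to stay consistent.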


The Concurrent Collections Model

[Figure, animated over slides 9 to 16: a CnC graph for Fibonacci. The environment (env) puts tags 0, 1 and 2 into the Tags collection, which prescribes the Fibs steps; the steps put the items 0:0, 1:1 and 2:1 into the Results collection. A Checkpoint box alongside the graph records the tags and the items as they are produced.]
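The Fibonacci walkthrough can be mimicked in a single process. The sketch below is an illustration only: `Checkpoint`, `fibStep`, `run` and `restart` are hypothetical names, the failure is simulated by stopping early, and nothing here uses the API of a real CnC runtime.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <utility>

// Hypothetical checkpoint: a log of the tags and items seen so far.
struct Checkpoint {
    std::set<int> tags;           // control instances
    std::map<int, long> results;  // data instances (tag -> value)
};

// Step body: computes fib(n) from the two previous items.
// Returns false if an input item is not available yet.
bool fibStep(int n, std::map<int, long>& results) {
    if (n < 2) {
        results.insert(std::make_pair(n, static_cast<long>(n)));
        return true;
    }
    auto a = results.find(n - 1);
    auto b = results.find(n - 2);
    if (a == results.end() || b == results.end()) return false;
    results.insert(std::make_pair(n, a->second + b->second));
    return true;
}

// Logs all tags, then runs the missing steps in ascending tag order and logs
// each new item. Stops after `maxSteps` steps to simulate a failure.
void run(const std::set<int>& tags, std::map<int, long>& results,
         Checkpoint& cp, int maxSteps) {
    cp.tags.insert(tags.begin(), tags.end());
    int executed = 0;
    for (int t : tags) {
        if (results.count(t)) continue;           // item already present
        if (executed++ == maxSteps) return;       // simulated crash
        if (fibStep(t, results)) cp.results[t] = results.at(t);
    }
}

// Restart: restore the logged items, then re-run every logged tag whose
// result is missing. Determinism guarantees the same values as before.
std::map<int, long> restart(Checkpoint& cp) {
    std::map<int, long> results = cp.results;
    std::set<int> tags = cp.tags;  // copy, since run() also writes cp.tags
    run(tags, results, cp, static_cast<int>(tags.size()));
    return results;
}
```

With tags 0, 1 and 2 and a simulated failure after two steps, the checkpoint holds 0:0 and 1:1; `restart` then re-executes only the step for tag 2 and adds 2:1.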


Proof of Concept Implementation

Goal: Assessing the viability of Asynchronous C/R in distributed memory CnC runtimes

Resilience Flavour:
● Dedicated checkpoint node
● Fine grained updates
● Uncoordinated restart

Runtime: Intel(R) Concurrent Collections for C++ (Architect: Frank Schlimbach)


Dedicated Checkpoint Node & Fine Grained Updates

[Figure: four nodes and a Checkpoint node]

Updates contain:
● data instances consumed
● data instances produced
● control instances produced
● producers
● consumers
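One way to picture such an update is as a small record sent to the checkpoint node when a step finishes. The layout below is an assumption made for illustration: `InstanceId`, `Update`, `CheckpointState` and `applyUpdate` are hypothetical names, and the producer and consumer information is modelled as maps from an item to the steps that wrote or read it.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical identifier: a collection name plus a tag.
typedef std::pair<std::string, int> InstanceId;

// One fine grained update, describing a single finished step.
struct Update {
    InstanceId step;                           // the step that ran
    std::vector<InstanceId> consumedItems;     // data instances consumed
    std::map<InstanceId, long> producedItems;  // data instances produced
    std::vector<InstanceId> producedTags;      // control instances produced
};

// State kept on the dedicated checkpoint node (assumed layout).
struct CheckpointState {
    std::map<InstanceId, long> items;
    std::set<InstanceId> tags;
    std::map<InstanceId, InstanceId> producerOf;              // item -> producing step
    std::map<InstanceId, std::set<InstanceId> > consumersOf;  // item -> consuming steps
    std::set<InstanceId> finishedSteps;
};

// Merges one update into the checkpoint. Every container is a set or a map
// filled with insert(), so applying the same update twice changes nothing.
void applyUpdate(CheckpointState& cp, const Update& u) {
    cp.finishedSteps.insert(u.step);
    for (const auto& kv : u.producedItems) {
        cp.items.insert(kv);
        cp.producerOf.insert(std::make_pair(kv.first, u.step));
    }
    for (const auto& item : u.consumedItems) {
        cp.consumersOf[item].insert(u.step);
    }
    cp.tags.insert(u.producedTags.begin(), u.producedTags.end());
}
```

Keeping the merge idempotent matters once restarts are uncoordinated, since the same step may then report more than once.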


Restart

[Figure: four nodes, with labels 1 to 4]

Restart simulation ➜ No fault tolerant MPI

Uncoordinated ➜ Step duplication


Memory Management in CnC

Non-trivial: data accessed by dynamic steps

One solution: get-counting method

// Number of steps expected to get the item with tag t.
int getCountFib( FibTag t ) {
    if ( t > 0 ) {
        return 2;
    } else {
        return 1;
    }
}
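To show how a get count drives memory management, here is a minimal sketch of an item store that frees an item once the expected number of gets has happened. `CountedItems` is a hypothetical class written for this illustration, and `FibTag` is assumed to be an integer type.

```cpp
#include <cassert>
#include <map>
#include <utility>

typedef int FibTag;  // assumption: tags are plain integers

// Get count from the slide: item 0 is read once, every other item twice.
int getCountFib(FibTag t) { return t > 0 ? 2 : 1; }

// Hypothetical item store that frees items when their get count reaches zero.
class CountedItems {
public:
    // Stores an item together with its initial get count.
    void put(FibTag t, long value) {
        Entry e = { value, getCountFib(t) };
        items_.insert(std::make_pair(t, e));
    }

    // Reads an item on behalf of one consuming step and lowers its get count.
    // The item is erased when the last expected consumer has read it.
    long get(FibTag t) {
        std::map<FibTag, Entry>::iterator it = items_.find(t);
        assert(it != items_.end());
        long value = it->second.value;
        if (--it->second.remaining == 0) items_.erase(it);
        return value;
    }

    bool contains(FibTag t) const { return items_.count(t) != 0; }

private:
    struct Entry { long value; int remaining; };
    std::map<FibTag, Entry> items_;
};
```

A step that is executed twice would call `get` twice and lower the count once too often, which is the problem that uncoordinated restart introduces.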


Solution

Extra bookkeeping in checkpoint:
➢ Consider steps only once when lowering get counts
  ○ Hashmap of considered steps
➢ Never re-add removed data instances
  ○ Marking data as removed
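The two rules above can be sketched as follows. `CheckpointItems` and its members are hypothetical names; the sketch assumes integer tags and uses a hash set where the slide mentions a hashmap.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical checkpoint-side bookkeeping for one item collection.
class CheckpointItems {
public:
    // Logs a produced item, unless it was already removed earlier.
    void put(int itemTag, long value, int getCount) {
        if (removed_.count(itemTag)) return;  // never re-add removed data
        Entry e = { value, getCount };
        items_.insert(std::make_pair(itemTag, e));
    }

    // Lowers the get counts of everything a step consumed, but only the
    // first time this step is reported.
    void applyStep(int stepTag, const std::vector<int>& consumed) {
        if (!consideredSteps_.insert(stepTag).second) return;  // duplicate
        for (std::size_t i = 0; i < consumed.size(); ++i) lower(consumed[i]);
    }

    bool contains(int itemTag) const { return items_.count(itemTag) != 0; }
    bool wasRemoved(int itemTag) const { return removed_.count(itemTag) != 0; }

private:
    struct Entry { long value; int remaining; };

    // Decrements one get count; erases the item and marks it as removed
    // when the count reaches zero.
    void lower(int itemTag) {
        std::map<int, Entry>::iterator it = items_.find(itemTag);
        if (it == items_.end()) return;
        if (--it->second.remaining == 0) {
            items_.erase(it);
            removed_.insert(itemTag);
        }
    }

    std::map<int, Entry> items_;
    std::unordered_set<int> consideredSteps_;  // steps already counted
    std::unordered_set<int> removed_;          // items marked as removed
};
```

With these two checks a duplicated step can neither lower a get count twice nor bring an already freed item back into the checkpoint.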


Modelling Overhead (Tw/Ts)

Coordinated Checkpoint/Restart (Daly, 2006)
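The equations from this slide are not part of the transcript. As a rough illustration of the coordinated case, the sketch below evaluates the commonly cited form of Daly's model, Tw = M · e^(R/M) · (e^((τ+δ)/M) − 1) · Ts/τ, together with the first-order interval τ ≈ sqrt(2δM) − δ. The function names and all parameter values in the usage note are assumptions; the formulas should be checked against Daly (2006) before being relied on.

```cpp
#include <cassert>
#include <cmath>

// Expected wall clock time under coordinated checkpoint/restart (assumed
// form of Daly's model; all times in the same unit).
//   ts:      solve time without faults or checkpoints
//   tau:     compute time between two checkpoints
//   delta:   time to write one checkpoint
//   restart: time to restart after a failure
//   mtti:    mean time to interrupt (M)
double dalyWallClock(double ts, double tau, double delta,
                     double restart, double mtti) {
    assert(ts > 0.0 && tau > 0.0 && mtti > 0.0);
    return mtti * std::exp(restart / mtti)
                * (std::exp((tau + delta) / mtti) - 1.0) * ts / tau;
}

// First-order estimate of the checkpoint interval, intended for delta much
// smaller than mtti.
double firstOrderInterval(double delta, double mtti) {
    return std::sqrt(2.0 * delta * mtti) - delta;
}

// Overhead factor as defined on the slides: Tw / Ts.
double overheadFactor(double tw, double ts) { return tw / ts; }
```

For example, with hypothetical values of 60 s for both the checkpoint and the restart, `overheadFactor(dalyWallClock(ts, firstOrderInterval(60, 1800), 60, 60, 1800), ts)` gives the modelled overhead at a 30 minute mean time to interrupt; raising the mean time to interrupt lowers the factor.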

Asynchronous Checkpoint/Restart


Evaluating Asynchronous Checkpoint/Restart


Benchmarks - Goals

Assessing overhead factor (φ): Ok if high
Method:
  Measure w/o resilience = Solve time (Ts)
  Measure with resilience = Wall clock time (Tw)
  Overhead factor = Tw/Ts

Assessing restart time (Tr): Should be low
Method:
  Measure time needed to calculate the restart set
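A minimal sketch of the restart time measurement follows. The slides do not define the restart set, so this assumes it is the set of prescribed step tags with no recorded completion; `computeRestartSet` and `timeRestartSet` are hypothetical names.

```cpp
#include <cassert>
#include <chrono>
#include <set>

// Assumption: the restart set holds every prescribed tag whose step is not
// recorded as finished in the checkpoint.
std::set<int> computeRestartSet(const std::set<int>& prescribed,
                                const std::set<int>& finished) {
    std::set<int> restartSet;
    for (int tag : prescribed) {
        if (finished.count(tag) == 0) restartSet.insert(tag);
    }
    return restartSet;
}

// Returns Tr in seconds: the time spent calculating the restart set.
// The set itself is written to `out`.
double timeRestartSet(const std::set<int>& prescribed,
                      const std::set<int>& finished,
                      std::set<int>& out) {
    std::chrono::steady_clock::time_point start =
        std::chrono::steady_clock::now();
    out = computeRestartSet(prescribed, finished);
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

The overhead factor is measured the same way around whole runs: time one run without resilience for Ts, one with resilience for Tw, and divide.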


Number of Steps

[Plots: Fibonacci and Mandelbrot benchmarks]

Overhead factor (φ): Increases with number of steps


Restart Time

[Plot: Fibonacci: Restart Time]

Restart Time (Tr): Low

Optimization: Shifting some of the complexity to the overhead factor


Future Work

Distributed Checkpoint:
➢ Overhead high but constant
➢ Restart time?

Tag-only logging:
➢ Less communication
➢ Complex restart


Conclusion

Asynchronous C/R in a distributed memory CnC runtime
➢ Analyzing different cases
➢ Proof of concept implementation

Asynchronous C/R is viable for systems with low SMTTI
➢ Model
➢ Proof of concept implementation


References

Daly, J. T. (2006). A Higher Order Estimate of the Optimum Checkpoint Interval for Restart Dumps. Future Generation Computer Systems, 22(3), 303–312.

Avizienis, A., Laprie, J., Randell, B., & Landwehr, C. E. (2004). Basic Concepts and Taxonomy of Dependable and Secure Computing. IEEE Transactions on Dependable and Secure Computing, 1(1), 11–33.

Snir, M., Wisniewski, R. W., Abraham, J. A., Adve, S. V., Bagchi, S., Balaji, P., . . . Hensbergen, E. V. (2014). Addressing Failures in Exascale Computing. International Journal of High Performance Computing Applications, 28(2), 129–173.

Cappello, F. (2009). Fault Tolerance in Petascale/Exascale Systems: Current Knowledge. International Journal of High Performance Computing Applications, 23(1), 212–226.

Vrvilo, N. (2014). Asynchronous Checkpoint/Restart for the Concurrent Collections Model (Unpublished master’s thesis). Rice University, Houston, Texas USA.
