Office of Science

U.S. Department of Energy

Evaluating Checkpoint/Restart on the IBM SP

Jay [email protected]

Outline

• Motivation for Checkpoint/Restart (CPR)
• CPR considerations
• CPR on the IBM SP
• Evaluation of CPR on the IBM SP
• Results
• Putting CPR into production

Motivation for Checkpoint/Restart

• Large HPC systems typically have large parallel or long running jobs

• To periodically save the running state of large parallel or long running jobs, so that an interruption does not cost too much work

• To decrease the impact of single-node failures on the overall usability of the machine

• To be able to perform maintenance on the system with minimal impact to running jobs

• Better utilization of resources

Checkpoint/Restart considerations

• User initiated (not from within the program)

• System (administrator) initiated

• Use of HPC systems is usually via a batch system (such as LoadLeveler)

• Both serial and parallel jobs are run on the machine

• Parallel jobs use message passing and we should be able to checkpoint these as well

• Use of CPR mechanism internal to code as well as externally

Checkpoint/Restart Users

• System administrators and operators: checkpoint used to clear a node for maintenance work.

• End users of HPC systems (scientists, students, researchers)

• Programmers writing code that uses CPR mechanism internally (or utility programs to use CPR functionality for the system)

Checkpoint/Restart mechanism

For Parallel programs

• Stop and discard mechanism (K. Z. Meth and W. G. Tuel): on receiving a checkpoint request, the task stops sending messages and is checkpointed. In-transit message information is saved, so we know which messages have been sent but not acknowledged. These messages are resent on restart.
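
The stop and discard idea above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not the actual IBM SP implementation: the `Task` class, the list used as a "network", and the rule that delivery acknowledges a message immediately are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """Toy stand-in for one task of a parallel job (hypothetical model)."""
    rank: int
    next_seq: int = 0
    unacked: dict = field(default_factory=dict)   # seq -> (dest, payload)
    received: list = field(default_factory=list)  # payloads delivered so far
    sending_enabled: bool = True

    def send(self, dest, payload, network):
        """Put a message on the network and remember it until acknowledged."""
        if not self.sending_enabled:
            raise RuntimeError("task has stopped sending for a checkpoint")
        seq = self.next_seq
        self.next_seq += 1
        self.unacked[seq] = (dest, payload)
        network.append((self.rank, dest, seq, payload))

    def checkpoint(self):
        """Stop sending, then save state including unacknowledged sends."""
        self.sending_enabled = False
        return {
            "rank": self.rank,
            "next_seq": self.next_seq,
            "unacked": dict(self.unacked),
            "received": list(self.received),
        }


def deliver(network, tasks):
    """Deliver every in-flight message; delivery counts as acknowledgement."""
    for src, dest, seq, payload in network:
        tasks[dest].received.append(payload)
        tasks[src].unacked.pop(seq, None)
    network.clear()


def restart(snapshot, network):
    """Rebuild a task from its snapshot and resend what was never acknowledged."""
    task = Task(
        rank=snapshot["rank"],
        next_seq=snapshot["next_seq"],
        unacked=dict(snapshot["unacked"]),
        received=list(snapshot["received"]),
    )
    for seq, (dest, payload) in sorted(task.unacked.items()):
        network.append((task.rank, dest, seq, payload))
    return task


tasks = {0: Task(0), 1: Task(1)}
network = []
tasks[0].send(1, "a", network)
deliver(network, tasks)           # "a" arrives and is acknowledged
tasks[0].send(1, "b", network)    # "b" is still in transit

snapshots = {rank: task.checkpoint() for rank, task in tasks.items()}
network.clear()                   # in-transit messages are discarded

restart_network = []
restored = {rank: restart(snap, restart_network) for rank, snap in snapshots.items()}
resent = list(restart_network)    # what the restart put back on the wire
deliver(restart_network, restored)
print(restored[1].received)       # ['a', 'b']
```

The point of the sketch is that the in-flight message itself never needs to be captured: the sender's record of unacknowledged sends is enough to reproduce it after restart.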

Checkpoint/Restart methods

• Utility program as part of system software

• CPR API via system calls (ll_init_ckpt, etc.)

• Batch system software can use the API to implement CPR mechanism.

CPR on the IBM SP

• Done via LL command (llckpt)

• Once a process is checkpointed:
1. Process can continue running.
2. Process is killed.

• Within LL:
1. Job can be deleted from the queuing system.
2. Job can be resubmitted for consideration by the scheduler.
3. Job can be resubmitted and “held”.

Checkpoint/Restart on the IBM SP

Job command file keywords:

In order to be able to checkpoint a LL job:
#@ checkpoint = [yes | no | interval]
#@ ckpt_time_limit = [time to checkpoint]
#@ ckpt_dir = [path to checkpoint files]
#@ ckpt_file = [basename of checkpoint files]

In order to be able to restart a LL job:
#@ checkpoint = [yes | no | interval]
#@ ckpt_dir = [path to checkpoint files]
#@ ckpt_file = [basename of checkpoint files]
#@ restart_from_ckpt = [yes | no]
#@ restart_on_same_nodes = [yes | no]
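
As an illustration of how these keywords fit together, here is a small sketch that renders the checkpoint-related lines of a job command file. The helper name `render_ckpt_keywords` and all values (directory, basename, time limit) are hypothetical placeholders. Accepted value formats should be checked against the LoadLeveler documentation, and a complete job command file needs further keywords that are not shown here.

```python
def render_ckpt_keywords(ckpt_dir, ckpt_file, checkpoint="yes",
                         ckpt_time_limit=None, restart_from_ckpt=None,
                         restart_on_same_nodes=None):
    """Return the '#@ keyword = value' lines for the keywords listed above.

    Optional keywords are emitted only when a value is given.
    """
    keywords = [("checkpoint", checkpoint)]
    if ckpt_time_limit is not None:
        keywords.append(("ckpt_time_limit", ckpt_time_limit))
    keywords.append(("ckpt_dir", ckpt_dir))
    keywords.append(("ckpt_file", ckpt_file))
    if restart_from_ckpt is not None:
        keywords.append(("restart_from_ckpt", restart_from_ckpt))
    if restart_on_same_nodes is not None:
        keywords.append(("restart_on_same_nodes", restart_on_same_nodes))
    return "\n".join(f"#@ {name} = {value}" for name, value in keywords)


# Placeholder values, chosen only for illustration.
checkpoint_lines = render_ckpt_keywords(
    "/scratch/ckpt", "myjob", ckpt_time_limit="00:10:00")
restart_lines = render_ckpt_keywords(
    "/scratch/ckpt", "myjob", restart_from_ckpt="yes",
    restart_on_same_nodes="no")

print(checkpoint_lines)
print()
print(restart_lines)
```

Keeping `ckpt_dir` and `ckpt_file` identical between the checkpoint and restart variants is what lets the restarted job find the files written earlier.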

CPR Evaluation on the IBM SP

We evaluated the use of C/R with LoadLeveler on the SP using both a 4-node development system (dev2) and the 416-node production system (seaborg). We evaluated:

(a) System requirements
(b) Configuration changes
(c) Viability/Ease of Use

CPR Evaluation on the IBM SP

2 kinds of programs:
• Serial code that allocates a certain amount of memory as an integer array and initializes the array
• MPI code that starts up a certain number of processes, allocates a certain amount of memory, and does simple message passing

User checkpoint:
• Submit a job using llsubmit, let it run, use llckpt -u to checkpoint, and resume the job using llhold -r
• User can also use llckpt -k and resubmit the job
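
The two user checkpoint sequences described on this slide can be written down as a dry run. The sketch below only builds the command lines and executes nothing. The function name, the job command file name, and the job identifier are hypothetical, and passing the job identifier as the final argument of llckpt and llhold is an assumption to verify against the LoadLeveler command reference.

```python
def user_checkpoint_commands(job_cmd_file, job_id, kill=False):
    """Return the command sequence as argument lists, without running it.

    kill=False: checkpoint with 'llckpt -u', then resume with 'llhold -r'.
    kill=True:  checkpoint with 'llckpt -k', then resubmit the job.
    """
    submit = ["llsubmit", job_cmd_file]
    if kill:
        return [submit, ["llckpt", "-k", job_id], ["llsubmit", job_cmd_file]]
    return [submit, ["llckpt", "-u", job_id], ["llhold", "-r", job_id]]


# Placeholder file name and job identifier, for illustration only.
hold_sequence = user_checkpoint_commands("job.cmd", "JOB_ID")
kill_sequence = user_checkpoint_commands("job.cmd", "JOB_ID", kill=True)

for argv in hold_sequence:
    print(" ".join(argv))
```

In the resubmit case the job command file would presumably need the restart keywords set before the second llsubmit; the sketch does not model that step.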

Results – Dev2

[Figure: Checkpoint time (secs) vs. processes per node; series: 1 node, 2 nodes, 3 nodes, 4 nodes]

Each task uses approximately 200 MB memory
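
For planning disk space, a rough estimate of the data a checkpoint must write can be derived from the per-task memory figure. The sketch below assumes that each task's checkpoint image is about as large as its memory footprint. That is an assumption for illustration, not a measured result, and the function name is hypothetical.

```python
def approx_checkpoint_volume_mb(nodes, tasks_per_node, mem_per_task_mb):
    """Rough total checkpoint volume in MB: one image per task,
    each assumed to be about the size of the task's memory."""
    return nodes * tasks_per_node * mem_per_task_mb


# Largest dev2 configuration: 4 nodes, 8 tasks per node, ~200 MB per task.
estimate_mb = approx_checkpoint_volume_mb(4, 8, 200)
print(estimate_mb)  # 6400
```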

Results – Dev2

[Figure: Time to checkpoint (secs) vs. number of nodes; series: 1 to 8 tasks/node]


Results – Dev2

Serial job

[Figure: Checkpoint time (secs) vs. size of job (MB, log scale); series: 64-bit, 32-bit]

Results – Dev2

Serial job

[Figure: Disk space used (MB) vs. size of job (MB, log scale); series: 64-bit, 32-bit]

Results – Dev2

[Figure: Total checkpoint file sizes (GB) vs. tasks per node]



Results – Dev2

[Figure: MPI data file sizes (MB) vs. tasks per node; series: 1 node, 2 nodes, 3 nodes, 4 nodes]


Results – Seaborg

16 tasks per node; Each task uses approximately 260 MB memory

[Figure: Checkpoint time (secs) vs. number of nodes]

Results – Seaborg

[Figure: Checkpoint time (secs) vs. tasks per node; series: 8 nodes, 12 nodes, 24 nodes]


Using CPR

• What about restart? Times to restart are on the order of the time to checkpoint.

• Disk usage and user quotas: checkpoint files are owned by the job owner.

• The #@ restart = yes keyword is implied if checkpoint = yes.

• Priority issues: checkpointed and held jobs retain their priority.

• Not all jobs can be checkpointed. The list of exceptions is documented in the LL manual.

Acknowledgements:

• NERSC SP Systems Staff (N. Cardo, D. Paul, T. Stone)
• IBM Staff (S. Burrow)
• NERSC USG Staff (D. Skinner)
• NERSC ASG Staff (A. Wong)
