Evaluating Checkpoint/Restart on the IBM SP

Jay Srinivasan, jay@nersc.gov

Office of Science, U.S. Department of Energy

Outline

• Motivation for Checkpoint/Restart (CPR)
• CPR considerations
• CPR on the IBM SP
• Evaluation of CPR on the IBM SP
• Results
• Putting CPR into production

Motivation for Checkpoint/Restart

• Large HPC systems typically run large parallel or long-running jobs

• To periodically save the running state of large parallel or long-running jobs, so that an interruption does not cost too much work

• To decrease the impact of single-node failures on the overall usability of the machine

• To be able to perform maintenance on the system with minimal impact to running jobs

• Better utilization of resources

Checkpoint/Restart considerations

• User initiated (not from within the program)

• System (administrator) initiated

• Use of HPC systems is usually via a batch system (such as LoadLeveler)

• Both serial and parallel jobs are run on the machine

• Parallel jobs use message passing and we should be able to checkpoint these as well

• The CPR mechanism should be usable from inside the code as well as externally

Checkpoint/Restart Users

• System administrators and operators: checkpoint is used to clear a node for maintenance work

• End users of HPC systems (scientists, students, researchers)

• Programmers writing code that uses the CPR mechanism internally (or utility programs that use the system's CPR functionality)

Checkpoint/Restart mechanism

For Parallel programs

• Stop and discard mechanism (K. Z. Meth and W. G. Tuel): On receiving a checkpoint request, the task stops sending messages and is checkpointed. In-transit message information is saved, so we know which messages have been sent but not acknowledged. These messages are resent on restart.
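The bookkeeping described above can be illustrated with a toy model. This is only a sketch of the idea (stop sending, save the record of unacknowledged messages, resend them on restart). It is not the Meth and Tuel implementation, and every name in it (`Task`, `send`, `ack`, `checkpoint`, `restart`) is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """Toy model of one message-passing task (all names are illustrative)."""
    rank: int
    next_seq: int = 0
    # Messages sent but not yet acknowledged, keyed by sequence number.
    unacked: dict = field(default_factory=dict)
    sending_enabled: bool = True

    def send(self, dest, payload):
        """Record an outgoing message until the receiver acknowledges it."""
        if not self.sending_enabled:
            raise RuntimeError("task is stopped for checkpoint")
        seq = self.next_seq
        self.next_seq += 1
        self.unacked[seq] = (dest, payload)
        return seq

    def ack(self, seq):
        """The receiver confirmed delivery, so the record can be dropped."""
        self.unacked.pop(seq, None)

    def checkpoint(self):
        """Stop sending, then save task state plus the in-transit record."""
        self.sending_enabled = False
        return {
            "rank": self.rank,
            "next_seq": self.next_seq,
            "unacked": dict(self.unacked),
        }

    @classmethod
    def restart(cls, image):
        """Rebuild the task and list the messages that must be resent."""
        task = cls(
            rank=image["rank"],
            next_seq=image["next_seq"],
            unacked=dict(image["unacked"]),
        )
        resend = [
            (seq, dest, payload)
            for seq, (dest, payload) in sorted(task.unacked.items())
        ]
        return task, resend


# Usage: two sends, one acknowledged before the checkpoint request arrives.
t = Task(rank=0)
first = t.send(1, "halo-left")
second = t.send(1, "halo-right")
t.ack(first)
image = t.checkpoint()
restored, resend = Task.restart(image)
```

In this sketch only the second message is still unacknowledged at checkpoint time, so it is the only one queued for resending after restart.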

Checkpoint/Restart methods

• Utility program as part of system software

• CPR API via system calls (ll_init_ckpt, etc.)

• Batch system software can use the API to implement CPR mechanism.

CPR on the IBM SP

• Done via LL command (llckpt)

• Once a process is checkpointed:
1. Process can continue running.
2. Process is killed.

• Within LL:
1. Job can be deleted from the queuing system.
2. Job can be resubmitted for consideration by the scheduler.
3. Job can be resubmitted and “held”.

Checkpoint/Restart on the IBM SP

Job command file keywords:

To be able to checkpoint an LL job:
#@ checkpoint = [yes|no|interval]
#@ ckpt_time_limit = [time to checkpoint]
#@ ckpt_dir = [path to checkpoint files]
#@ ckpt_file = [basename of checkpoint files]

To be able to restart an LL job:
#@ checkpoint = [yes|no|interval]
#@ ckpt_dir = [path to checkpoint files]
#@ ckpt_file = [basename of checkpoint files]
#@ restart_from_ckpt = [yes|no]
#@ restart_on_same_nodes = [yes|no]
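As a rough illustration of how these keywords fit together, the sketch below renders the checkpoint-related lines of a job command file. It uses only the keywords listed on this slide, plus the `#@ queue` statement from general LoadLeveler usage. The helper name `job_keywords`, the directory `/scratch/ckpt`, the basename `myjob`, and the `00:10:00` time format are assumptions made for the example, not values from the talk.

```python
def job_keywords(ckpt_dir, ckpt_file, ckpt_time_limit=None,
                 restart=False, same_nodes=False):
    """Return checkpoint-related job command file lines (hypothetical helper)."""
    lines = ["#@ checkpoint = yes"]
    # The slide lists ckpt_time_limit only for the checkpointing case.
    if ckpt_time_limit is not None and not restart:
        lines.append(f"#@ ckpt_time_limit = {ckpt_time_limit}")
    lines.append(f"#@ ckpt_dir = {ckpt_dir}")
    lines.append(f"#@ ckpt_file = {ckpt_file}")
    if restart:
        lines.append("#@ restart_from_ckpt = yes")
        lines.append(f"#@ restart_on_same_nodes = {'yes' if same_nodes else 'no'}")
    # Not on the slide: the statement that ends a job step definition.
    lines.append("#@ queue")
    return "\n".join(lines)


# Placeholder values; the time format is an assumption.
first_run = job_keywords("/scratch/ckpt", "myjob", ckpt_time_limit="00:10:00")
restart_run = job_keywords("/scratch/ckpt", "myjob", restart=True)
print(first_run)
print(restart_run)
```

Keeping `ckpt_dir` and `ckpt_file` identical between the first run and the restart run is what lets the restarted job find its checkpoint files.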

CPR Evaluation on the IBM SP

We evaluated the use of C/R with LoadLeveler on the SP using both a 4-node development system (dev2) and the 416-node production system (seaborg). We evaluated:

(a) System requirements
(b) Configuration changes
(c) Viability/Ease of Use

CPR Evaluation on the IBM SP

Two kinds of programs:
• Serial code that allocates a certain amount of memory (an integer array) and initializes the array
• MPI code that starts up a certain number of processes, allocates a certain amount of memory, and does simple message passing

User checkpoint:
• Submit a job using llsubmit, let it run, use llckpt -u to checkpoint, and resume the job using llhold -r
• User can also use llckpt -k and resubmit the job
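The two user workflows can be written down as command sequences. The sketch below only builds the argument lists and does not run anything. The commands and flags are the ones named on the slide. The function name, the job step identifier `dev2.123.0`, the file names, and the use of llsubmit for the resubmission are assumptions for illustration.

```python
def user_checkpoint_commands(job_command_file, job_step,
                             resume_with_llhold=True,
                             restart_command_file=None):
    """Build (but do not execute) the command sequence for a user checkpoint."""
    commands = [["llsubmit", job_command_file]]
    if resume_with_llhold:
        # Workflow 1: checkpoint with -u, then resume with llhold -r.
        commands.append(["llckpt", "-u", job_step])
        commands.append(["llhold", "-r", job_step])
    else:
        # Workflow 2: checkpoint with -k, then resubmit the job.
        commands.append(["llckpt", "-k", job_step])
        commands.append(["llsubmit", restart_command_file or job_command_file])
    return commands


# Hypothetical job step id and file names.
hold_flow = user_checkpoint_commands("job.cmd", "dev2.123.0")
kill_flow = user_checkpoint_commands(
    "job.cmd", "dev2.123.0",
    resume_with_llhold=False,
    restart_command_file="job_restart.cmd",
)
for cmd in hold_flow + kill_flow:
    print(" ".join(cmd))
```

A separate restart command file is shown for the second workflow because the resubmitted job would presumably need the restart keywords set.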

Results – Dev2

[Figure: checkpoint time (secs, axis 0 to 350) versus processes per node (axis 0 to 8), with one series each for 1, 2, 3, and 4 nodes. Each task uses approximately 200 MB memory.]

Results – Dev2

[Figure: time to checkpoint (secs, axis 0 to 350) versus number of nodes (1 to 4), with one series each for 1 through 8 tasks/node. Each task uses approximately 200 MB memory.]

Results – Dev2

[Figure: serial job. Checkpoint time (secs, axis 0 to 1400) versus size of job (MB, logarithmic axis from 1 to 100000), with one series each for 64-bit and 32-bit.]

Results – Dev2

[Figure: serial job. Disk space used (MB, axis 0 to 18000) versus size of job (MB, logarithmic axis from 1 to 100000), with one series each for 64-bit and 32-bit.]

Results – Dev2

[Figure: total checkpoint file sizes (GB, axis 0 to 7) versus tasks per node (axis 0 to 9), with one series each for 1, 2, 3, and 4 nodes. Each task uses approximately 200 MB memory.]

Results – Dev2

[Figure: MPI data file sizes (MB, axis 0 to 35) versus tasks per node (axis 0 to 9), with one series each for 1, 2, 3, and 4 nodes. Each task uses approximately 200 MB memory.]

Results – Seaborg

[Figure: checkpoint time (secs, axis 0 to 3000) versus number of nodes (axis 0 to 120). 16 tasks per node; each task uses approximately 260 MB memory.]

Results – Seaborg

[Figure: checkpoint time (secs, axis 0 to 600) versus tasks per node (axis 0 to 16), with one series each for 8, 12, and 24 nodes. Each task uses approximately 260 MB memory.]

Using CPR

• What about restart? Times to restart are on the order of the time to checkpoint.

• Disk usage and user quotas matter (checkpoint files are owned by the job owner).

• The #@ restart = yes keyword is implied if checkpoint = yes.

• Priority issues: Checkpointed and held jobs retain their priority.

• Not all jobs can be checkpointed. The list of exceptions is documented in the LL manual.
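Because checkpoint files count against the job owner's quota, a back-of-the-envelope size estimate can be useful before enabling checkpointing. The sketch below assumes that each task writes a checkpoint file roughly the size of its memory; the slides do not state that ratio, so treat it as a planning assumption. The function names and the quota figure are hypothetical.

```python
def checkpoint_footprint_mb(nodes, tasks_per_node, mb_per_task):
    """Estimate total checkpoint size, assuming one memory-sized file per task."""
    return nodes * tasks_per_node * mb_per_task


def fits_quota(nodes, tasks_per_node, mb_per_task, quota_mb):
    """True if the estimated footprint stays within the given quota."""
    return checkpoint_footprint_mb(nodes, tasks_per_node, mb_per_task) <= quota_mb


# Configurations taken from the test setups described in the slides.
dev2_mb = checkpoint_footprint_mb(nodes=4, tasks_per_node=8, mb_per_task=200)
seaborg_mb = checkpoint_footprint_mb(nodes=24, tasks_per_node=16, mb_per_task=260)
print(dev2_mb, seaborg_mb)

# Hypothetical 50 GB quota, expressed in MB.
print(fits_quota(24, 16, 260, quota_mb=50_000))
```

Under this assumption the 4-node dev2 case comes to 6,400 MB and the 24-node seaborg case to 99,840 MB, so the larger job would not fit a 50,000 MB quota.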

Acknowledgements:

• NERSC SP Systems Staff (N. Cardo, D. Paul, T. Stone)
• IBM Staff (S. Burrow)
• NERSC USG Staff (D. Skinner)
• NERSC ASG Staff (A. Wong)

