National Energy Research Scientific Computing Center (NERSC) HPC In a Production Environment Nicholas P. Cardo NERSC Center Division, LBNL November 19, 2003
Page 1: National Energy Research Scientific Computing Center (NERSC) HPC In a Production Environment Nicholas P. Cardo NERSC Center Division, LBNL November 19,

National Energy Research Scientific Computing Center (NERSC)

HPC In a Production Environment

Nicholas P. Cardo

NERSC Center Division, LBNL

November 19, 2003

Page 2:

Scientific Computing

• Climate
• Chemistry
• Physics
• Nano-Science
• Genomics
• Molecular Modeling
• Materials
• Simulation of Large Systems
• Algorithms Development

Page 3:

System Configuration

• 184 Compute Nodes
• 16 GPFS Nodes
• 4 Service Nodes
• 3 Login Nodes
• 1 Network/Admin Node
• 24.7 TB Formatted SSA
• 13 Homes @ ~500 GB
• Scratch @ ~13 TB
• 4 Nodes @ 64 GB
• 64 Nodes @ 32 GB
• 140 Nodes @ 16 GB
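The node counts on this slide can be cross-checked with a little arithmetic. The sketch below sums the nodes by role and by memory size, then totals the memory. The variable names are illustrative, and the slide does not say whether "GB" is decimal or binary, so the total is left in the slide's own units.

```python
# Node counts by role, as listed on the slide.
node_roles = {
    "compute": 184,
    "gpfs": 16,
    "service": 4,
    "login": 3,
    "network_admin": 1,
}

# Memory tiers as listed on the slide: (number of nodes, GB per node).
memory_tiers = [(4, 64), (64, 32), (140, 16)]

total_nodes_by_role = sum(node_roles.values())
total_nodes_by_memory = sum(count for count, _ in memory_tiers)
total_memory_gb = sum(count * gb for count, gb in memory_tiers)

print("nodes by role:  ", total_nodes_by_role)
print("nodes by memory:", total_nodes_by_memory)
print("total memory GB:", total_memory_gb)
```

Both node totals come to 208, which suggests the memory tiers cover every node type rather than only the compute nodes.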

Page 4:

System Utilization (chart; y-axis: Hours)

Page 5:

Job Size Breakdown (chart; y-axis: Hours; annotation: Scaling Efforts)

Page 6:

Large Jobs (chart; y-axis: Percent; annotations: Scaling Efforts, 50%)

Page 7:

System Expanded

• March 2003: The System Doubled
• Difficult Decision:
– Change in operating model: a single large-scale production system
– Cable length limitations required existing hardware to be relocated
– Integration with minimal disruption of service

Page 8:

System Configuration

• 380 Compute Nodes (+106%)
• 20 GPFS Nodes (+25%)
• 8 Service Nodes (+100%)
• 6 Login Nodes (+100%)
• 2 Network/Admin Nodes (+100%)
• 44.7 TB SSA Disk (+80%)
• ~33 TB Scratch (+153%)
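The growth percentages on this slide can be reproduced by comparing these figures with the earlier System Configuration slide. The sketch below does that with a hypothetical helper named `pct_increase`. It assumes the 44.7 TB SSA figure is comparable to the earlier "24.7 TB Formatted SSA" figure, which the slides do not state explicitly.

```python
def pct_increase(before, after):
    """Return the percentage growth from `before` to `after`."""
    return (after - before) / before * 100.0

# Figures from the two System Configuration slides.
before = {"compute": 184, "gpfs": 16, "service": 4, "login": 3,
          "network_admin": 1, "ssa_tb": 24.7, "scratch_tb": 13}
after = {"compute": 380, "gpfs": 20, "service": 8, "login": 6,
         "network_admin": 2, "ssa_tb": 44.7, "scratch_tb": 33}

growth = {name: pct_increase(before[name], after[name]) for name in before}

for name, pct in growth.items():
    print(f"{name:14s} {pct:+.1f}%")
```

The computed values land within about one percentage point of the slide's figures. The small differences suggest the slide truncated rather than rounded, and the scratch sizes are themselves approximate.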

Page 9:

SCSI Disks

• 2 x 36.4 GB SCSI drives
• Mirrored for availability
• 36.4 GB available space

Diagram: rootvg (36.4 GB) on two 36.4 GB drives

Page 10:

SSA Disks

16 drives per drawer

3 Groups per drawer

RAID 5 for RAS

Each node twintailed to five other nodes in the same frame

Diagram labels: Hot Spare, hdiskx, hdisky, hdiskz
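One way to read the drawer layout is a single hot spare plus three equal RAID 5 arrays. The slide does not confirm this, so the one-spare-per-drawer count and the even split in the sketch below are assumptions. The sketch only shows how much of a drawer's raw capacity would be usable under that reading.

```python
DRIVES_PER_DRAWER = 16  # from the slide
GROUPS_PER_DRAWER = 3   # from the slide
HOT_SPARES = 1          # assumption: one hot spare per drawer

# Assumption: the non-spare drives are split evenly across the groups.
drives_per_group = (DRIVES_PER_DRAWER - HOT_SPARES) // GROUPS_PER_DRAWER

# RAID 5 spends one drive's worth of capacity per array on parity.
data_drives_per_group = drives_per_group - 1

usable_fraction = (data_drives_per_group * GROUPS_PER_DRAWER) / DRIVES_PER_DRAWER

print("drives per group:     ", drives_per_group)
print("data drives per group:", data_drives_per_group)
print("usable fraction:      ", usable_fraction)
```

Under these assumptions each group is a 4+1 array and three quarters of the raw drawer capacity holds data. Formatting overhead would reduce the usable figure further.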

Page 11:

Networking (diagram; labels: Login Node, Network Node, Jumbo Frame, Production)

Page 12:

Fun Facts

• 39,936 DIMMs
• 7.7 TB Memory
• 832 SCSI Disks
• 29.6 TB SCSI Disks
• 6,656 Processors
• 35 Miles of Cable
• 30 Gigabit Adapters
• 210 SSA Adapters
• 3,440 SSA Disks
• 65.4 TB raw SSA
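Several of these totals can be cross-checked against the node counts and the SCSI drive size given on earlier slides. The sketch below derives per-node figures by simple division. These per-node values are inferences from the totals and are not stated on the slides, and the 1 TB = 1024 GB conversion is an assumption chosen because it reproduces the 29.6 TB figure.

```python
# Node counts from the expanded System Configuration slide.
nodes = {"compute": 380, "gpfs": 20, "service": 8, "login": 6, "network_admin": 2}
total_nodes = sum(nodes.values())

# Totals from the Fun Facts slide.
processors = 6656
dimms = 39936
scsi_disks = 832
scsi_drive_gb = 36.4  # per-drive size from the SCSI Disks slide

processors_per_node = processors / total_nodes
dimms_per_node = dimms / total_nodes
scsi_disks_per_node = scsi_disks / total_nodes

# Assumption: 1 TB = 1024 GB.
scsi_total_tb = scsi_disks * scsi_drive_gb / 1024

print("total nodes:        ", total_nodes)
print("processors per node:", processors_per_node)
print("DIMMs per node:     ", dimms_per_node)
print("SCSI disks per node:", scsi_disks_per_node)
print("SCSI capacity (TB): ", round(scsi_total_tb, 1))
```

The two SCSI disks per node agree with the mirrored pair described on the SCSI Disks slide.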

Page 13:

System Utilization (chart; y-axis: Hours)

Page 14:

Job Size Breakdown (chart; y-axis: Hours)

Page 15:

New Batch Configuration

Class Of Service: premium, regular, low, interactive, debug

Job Class: pre_128, pre_32, pre_1, reg_128, reg_32, reg_1, reg_1l, interactive, debug, low

Priority: high to low
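The relationship between classes of service and job classes can be sketched as a simple lookup. The mapping below is inferred from the class names alone (`pre_` for premium, `reg_` for regular) and is an assumption, not the site's actual scheduler configuration. The meaning of the numeric suffixes is not given on the slide, so the sketch does not interpret them.

```python
SERVICE_CLASSES = ["premium", "regular", "low", "interactive", "debug"]

JOB_CLASSES = ["pre_128", "pre_32", "pre_1",
               "reg_128", "reg_32", "reg_1", "reg_1l",
               "interactive", "debug", "low"]

# Assumption: the job-class prefix identifies the class of service.
PREFIX_TO_SERVICE = {"pre": "premium", "reg": "regular"}

def service_for(job_class):
    """Guess the class of service for a job class from its name."""
    prefix = job_class.split("_", 1)[0]
    # Classes without a known prefix (low, interactive, debug) map to themselves.
    return PREFIX_TO_SERVICE.get(prefix, job_class)

by_service = {service: [] for service in SERVICE_CLASSES}
for job_class in JOB_CLASSES:
    by_service[service_for(job_class)].append(job_class)

for service, classes in by_service.items():
    print(f"{service:12s} {', '.join(classes)}")
```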

Page 16:

System Utilization (chart; y-axis: Hours)

Page 17:

Job Size Breakdown (chart; y-axis: Hours)

Page 18:

Large Jobs (chart; y-axis: Percent; annotations: allocation depletion, 50%)

Page 19:

Job Efficiency (chart; y-axis: Hours)

Page 20:

Performance Variation

Performance variation problem detected. Original nodes appeared to perform slower than nodes added into the system.

Hardware swapped between original nodes and new nodes, no improvement.

Accounting showed occurrence of specific commands significantly higher on original nodes.

Four problem management definitions found to be deactivated but still executing constantly on original nodes.

Analysis performed by NERSC’s David Skinner
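The accounting comparison described on this slide can be illustrated with a small sketch. It counts how often each command appears on each node group and flags the commands that run far more often on the original nodes. The records, command names, and threshold are all invented for illustration, and this is not the analysis actually performed.

```python
from collections import Counter

# Hypothetical accounting records: (node_group, command). Invented data.
records = [
    ("original", "cmd_a"), ("original", "cmd_a"), ("original", "cmd_a"),
    ("original", "cmd_b"),
    ("new", "cmd_a"), ("new", "cmd_b"),
]

def command_counts(records, group):
    """Count command occurrences for one node group."""
    return Counter(cmd for g, cmd in records if g == group)

def overrepresented(records, ratio=2.0):
    """Return commands seen at least `ratio` times more often on original nodes."""
    original = command_counts(records, "original")
    new = command_counts(records, "new")
    # max(..., 1) avoids dividing by zero for commands absent on new nodes.
    return sorted(cmd for cmd, n in original.items()
                  if n / max(new[cmd], 1) >= ratio)

print(overrepresented(records))
```

A real comparison would also need to normalize by the number of nodes in each group and by the length of the accounting window. The sketch omits that step.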

Page 21:

FY04 System Utilization (chart; y-axis: Hours)

Page 22:

FY04 Job Size Breakdown (chart; y-axis: Hours)

Page 23:

FY04 Large Jobs (chart; y-axis: Percent; annotation: 50%)

Page 24:

Job Efficiency
