National Energy Research Scientific Computing Center (NERSC)
HPC In a Production Environment
Nicholas P. Cardo
NERSC Center Division, LBNL
November 19, 2003
Scientific Computing
• Climate
• Chemistry
• Physics
• Nano-Science
• Genomics
• Molecular Modeling
• Materials
• Simulation of Large Systems
• Algorithms Development
System Configuration
• 184 Compute Nodes
• 16 GPFS Nodes
• 4 Service Nodes
• 3 Login Nodes
• 1 Network/Admin Node
• 24.7 TB Formatted SSA
• 13 Homes @ ~500 GB
• Scratch @ ~13 TB
• 4 Nodes @ 64 GB
• 64 Nodes @ 32 GB
• 140 Nodes @ 16 GB
System Utilization (chart; axis: Hours)
Job Size Breakdown (chart; axis: Hours)
Scaling Efforts
Large Jobs (chart; axis: Percent; annotations: Scaling Efforts, 50%)
System Expanded
• March 2003: The System Doubled
• Difficult Decision:
– Change in operating model, single large scale production system
– Cable length limitations required existing hardware to be relocated
– Integration with minimal disruption of service
System Configuration
• 380 Compute Nodes (+106%)
• 20 GPFS Nodes (+25%)
• 8 Service Nodes (+100%)
• 6 Login Nodes (+100%)
• 2 Network/Admin Nodes (+100%)
• 44.7 TB SSA Disk (+80%)
• ~33 TB Scratch (+153%)
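The growth percentages on this slide can be reproduced from the two System Configuration slides. The sketch below is only an illustration: the dictionary keys and the helper name `percent_growth` are made up here, and the scratch sizes are the approximate values quoted on the slides. The slide figures appear to be truncated rather than rounded, so the comparison truncates as well.

```python
# Counts taken from the two System Configuration slides.
# Key names are illustrative; scratch sizes are approximate ("~") on the slides.
before = {"compute": 184, "gpfs": 16, "service": 4, "login": 3,
          "network_admin": 1, "ssa_tb": 24.7, "scratch_tb": 13}
after = {"compute": 380, "gpfs": 20, "service": 8, "login": 6,
         "network_admin": 2, "ssa_tb": 44.7, "scratch_tb": 33}


def percent_growth(old, new):
    """Relative growth from old to new, in percent."""
    return (new - old) / old * 100.0


growth = {name: percent_growth(before[name], after[name]) for name in before}

for name, pct in growth.items():
    # int() truncates toward zero, which matches how the slide reports the values.
    print(f"{name:>14}: +{int(pct)}%")
```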
SCSI Disks
• 2 x 36.4 GB SCSI drives
• Mirrored for availability
• 36.4 GB available space
(Diagram: rootvg (36.4 GB) mirrored across two 36.4 GB drives)
SSA Disks
Hot Spare
hdiskx
hdisky
hdiskz
16 drives per drawer
RAID 5 for RAS
Each node twin-tailed to five other nodes in the same frame
3 Groups per drawer
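As a rough illustration of what this drawer layout implies for capacity, the sketch below assumes one hot spare per drawer and equally sized RAID 5 groups. The slide gives only the drive count, the group count, and the presence of a hot spare, so the one-spare figure and the function name `drawer_layout` are assumptions, not the documented NERSC configuration.

```python
DRIVES_PER_DRAWER = 16      # from the slide
GROUPS_PER_DRAWER = 3       # from the slide
HOT_SPARES_PER_DRAWER = 1   # assumption: a single hot spare per drawer


def drawer_layout(drives=DRIVES_PER_DRAWER,
                  groups=GROUPS_PER_DRAWER,
                  spares=HOT_SPARES_PER_DRAWER):
    """Return (drives per RAID 5 group, data-equivalent drives per group)."""
    in_groups = drives - spares
    if in_groups % groups != 0:
        raise ValueError("drives do not divide evenly into groups")
    per_group = in_groups // groups
    # RAID 5 spends one drive's worth of capacity per group on parity.
    return per_group, per_group - 1


per_group, data_per_group = drawer_layout()
usable_fraction = GROUPS_PER_DRAWER * data_per_group / DRIVES_PER_DRAWER

print(per_group, data_per_group, usable_fraction)
```

Under these assumptions each group holds five drives, four of them data-equivalent, so three quarters of a drawer's raw capacity is usable before file system formatting overhead.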
Networking
(Diagram: Login Node and Network Node, each labeled with Jumbo Frame and Production networks)
Fun Facts
• 39,936 DIMMS
• 7.7 TB Memory
• 832 SCSI Disks
• 29.6 TB SCSI Disks
• 6,656 Processors
• 35 Miles of Cable
• 30 Gigabit Adapters
• 210 SSA Adapters
• 3,440 SSA Disks
• 65.4 TB raw SSA
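Several of these figures can be cross-checked against the expanded System Configuration and SCSI Disks slides. The sketch below is a consistency check, not an official inventory: it assumes the slide counts 1 TB as 1024 GB, and the per-node processor and DIMM counts are derived here rather than stated in the deck.

```python
# Node counts from the expanded System Configuration slide.
nodes = 380 + 20 + 8 + 6 + 2            # compute + GPFS + service + login + network/admin

# Each node carries 2 x 36.4 GB SCSI drives (SCSI Disks slide).
scsi_disks = nodes * 2
scsi_tb = scsi_disks * 36.4 / 1024      # assumption: 1 TB = 1024 GB

# Derived values, not stated on the slides.
processors_per_node = 6656 // nodes
dimms_per_node = 39936 // nodes

print(nodes, scsi_disks, round(scsi_tb, 1), processors_per_node, dimms_per_node)
```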
System Utilization (chart; axis: Hours)
Job Size Breakdown (chart; axis: Hours)
New Batch Configuration
• Class Of Service: premium, regular, low, interactive, debug
• Job Class: pre_128, pre_32, pre_1, reg_128, reg_32, reg_1, reg_1l, interactive, debug, low
• Priority: high to low
System Utilization (chart; axis: Hours)
Job Size Breakdown (chart; axis: Hours)
Large Jobs (chart; axis: Percent; annotations: allocation depletion, 50%)
Job Efficiency (chart; axis: Hours)
Performance Variation
Performance variation problem detected: original nodes appeared to perform slower than the nodes added to the system.
Hardware was swapped between original and new nodes, with no improvement.
Accounting showed specific commands occurring significantly more often on the original nodes.
Four problem management definitions were found to be deactivated but still executing constantly on the original nodes.
Analysis performed by NERSC’s David Skinner
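The accounting comparison described above can be sketched as follows. This is not the analysis that was actually run: the record format (node, command), the node names, the command names, and the 5x threshold are all hypothetical, chosen only to show how per-node command rates in two node groups might be compared.

```python
from collections import Counter, defaultdict


def commands_per_node(records, group_of):
    """Average number of executions of each command per node, by node group.

    records  -- iterable of (node, command) pairs (hypothetical format)
    group_of -- function mapping a node name to a group label
    """
    counts = defaultdict(Counter)
    nodes = defaultdict(set)
    for node, command in records:
        group = group_of(node)
        counts[group][command] += 1
        nodes[group].add(node)
    return {group: {cmd: n / len(nodes[group]) for cmd, n in cmds.items()}
            for group, cmds in counts.items()}


def suspicious_commands(rates, ratio=5.0):
    """Commands run at least `ratio` times more often per original node."""
    original = rates.get("original", {})
    new = rates.get("new", {})
    flagged = []
    for cmd, rate in original.items():
        baseline = new.get(cmd, 0.0)
        if baseline == 0.0 or rate / baseline >= ratio:
            flagged.append(cmd)
    return sorted(flagged)


# Made-up sample data: two original nodes and two new nodes.
ORIGINAL_NODES = {"n001", "n002"}
records = (
    [("n001", "probe")] * 40 + [("n002", "probe")] * 40   # frequent on original nodes
    + [("n201", "probe")] * 2 + [("n202", "probe")] * 2   # rare on new nodes
    + [(n, "ls") for n in ("n001", "n002", "n201", "n202")] * 3
)

rates = commands_per_node(
    records, lambda node: "original" if node in ORIGINAL_NODES else "new")
flagged = suspicious_commands(rates)
print(flagged)
```

Normalizing by the number of nodes in each group keeps the comparison fair when the two groups differ in size, as they did after the expansion.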
FY04 System Utilization (chart; axis: Hours)
FY04 Job Size Breakdown (chart; axis: Hours)
FY04 Large Jobs (chart; axis: Percent; annotation: 50%)
Job Efficiency