Date post: 28-May-2020
Health Check on DGX-1
Page 1: Health Check on DGX-1 · • NVIDIA® System Management (NVSM) is a software framework for monitoring NVIDIA DGX™ nodes in a data center. It includes active health monitoring, system

Health Check on DGX-1

Info: Running nvsm

• NVIDIA® System Management (NVSM) is a software framework for monitoring NVIDIA DGX™ nodes in a data center. It includes active health monitoring, system alerts, and log generation.

• It provides notification of fluctuations in system health, faults, and potential failures.

• Run nvsm show health after any software or hardware update or replacement.

• The "dump health" command produces a health report file suitable for attaching to support tickets.

$ sudo nvsm dump health
Writing output to /tmp/nvsm-health-dgx-1-20180907085048.tar.xz
Done

“sudo nvsm show health”
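When the dump is collected from a script, the report path can be read back from the command's output instead of being retyped. The following is a minimal sketch, assuming the output contains a "Writing output to ..." line like the one shown above; the helper name `parse_dump_path` is hypothetical and not part of NVSM.

```python
import re

# Assumption: the report path is announced as "Writing output to <path>.tar.xz".
# The non-greedy match stops at the first ".tar.xz" so trailing text is ignored.
DUMP_PATH_RE = re.compile(r"Writing output to (\S+?\.tar\.xz)")


def parse_dump_path(output):
    """Return the report path printed by `nvsm dump health`, or None if absent."""
    match = DUMP_PATH_RE.search(output)
    return match.group(1) if match else None


# Sample text copied from the slide above.
SAMPLE = "Writing output to /tmp/nvsm-health-dgx-1-20180907085048.tar.xz\nDone\n"

if __name__ == "__main__":
    print(parse_dump_path(SAMPLE))
```

In practice the text would come from something like `subprocess.run(["sudo", "nvsm", "dump", "health"], capture_output=True, text=True).stdout`.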

Info: nvsm show health

• Base OS Version
• BIOS Version
• BMC Revision
• GPU Status
• NVLINK Status
• CPU Status
• Memory Status
• Networking Status
• Raid Status
• Drive Status
• Disk errors

List Of Checks By nvsm show health

Info: Interpreting ‘unhealthy’

• If nvsm show health reports an overall status of ‘Unhealthy’, review the list of checks that were run and find those that did not pass (marked Unhealthy).

• Provide this information and the log file to NVIDIA Enterprise Services to continue troubleshooting the system.

lab@psg-xpl-evt-23:~$ sudo nvsm show health
[sudo] password for lab:

Info
----
Timestamp: Tue Aug 7 17:00:18 2018 -0700
Version: 18.06-3

Checks
------
DGX BaseOS Version [4.0.0]...........................................
BIOS Version [0.010].................................................
DGX Serial Number [To be filled by O.E.M.]...........................
Verify installed DIMM memory sticks.................................. Healthy
BMC Firmware Revision [0.70].........................................
Check BMC sensor thresholds.......................................... Healthy
Number of logical CPU cores [80]..................................... Unhealthy
Observed 80 logical CPU cores when 96 cores were expected
...

Health Summary
--------------
203 out of 205 checks are Healthy
2 out of 205 checks are Unhealthy
Overall system status is Unhealthy

Problem detected.
Please visit the ESP portal: https://nvid.nvidia.com/dashboard/
And create a support ticket with the log file attached.
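With more than 200 checks in the output, finding the failed ones by eye is slow. The sketch below filters the text for checks marked Unhealthy and keeps the detail lines printed under each one. It assumes each check is one line of the form "name, dotted leader, optional status" and that detail lines follow the failed check directly; the function name `unhealthy_checks` is hypothetical.

```python
import re

# Assumption: a check line is "<name>....<dots>.... [Healthy|Unhealthy]".
CHECK_RE = re.compile(r"^(?P<name>.+?)\.{3,}\s*(?P<status>Healthy|Unhealthy)?\s*$")


def unhealthy_checks(output):
    """Map each Unhealthy check name to the detail lines printed beneath it."""
    results = {}
    current = None  # name of the Unhealthy check whose details are being read
    for raw in output.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line in ("Health Summary", "Summary"):
            break  # the checks section is over
        if set(line) == {"."}:
            current = None  # an elision such as "..." ends the current check
            continue
        match = CHECK_RE.match(line)
        if match:
            current = None
            if match.group("status") == "Unhealthy":
                current = match.group("name")
                results[current] = []
        elif current is not None:
            results[current].append(line)
    return results


# Sample lines taken from the output above.
SAMPLE = """\
DGX BaseOS Version [4.0.0]...........................................
Verify installed DIMM memory sticks.................................. Healthy
Number of logical CPU cores [80]..................................... Unhealthy
Observed 80 logical CPU cores when 96 cores were expected
...
Health Summary
--------------
203 out of 205 checks are Healthy
"""

if __name__ == "__main__":
    for name, details in unhealthy_checks(SAMPLE).items():
        print(name)
        for detail in details:
            print("   ", detail)
```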

Interpreting nvsm show health Output

Health Summary (end of nvsm show health output)

● Healthy output:

Summary
-------
94 out of 94 checks are Healthy
Overall system status is Healthy

● Unhealthy output:

Summary
-------
9 out of 11 checks are Healthy
2 out of 11 checks are Unhealthy
Overall system status is Unhealthy

Problem detected.
Please visit the ESP portal: https://nvid.nvidia.com/enterpriselogin
And create a support ticket with the log file attached.
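For automated monitoring, the summary block is the simplest part of the output to act on. This is a minimal sketch, assuming the summary wording matches the two examples above; `parse_summary` and `exit_code` are hypothetical helpers, not NVSM functions.

```python
import re

# Assumption: the summary uses exactly the phrasing shown in the slides.
COUNT_RE = re.compile(r"(\d+) out of (\d+) checks are (Healthy|Unhealthy)")
STATUS_RE = re.compile(r"Overall system status is (Healthy|Unhealthy)")


def parse_summary(output):
    """Extract per-status counts, the total, and the overall status."""
    counts = {"Healthy": 0, "Unhealthy": 0}
    total = None
    for number, out_of, label in COUNT_RE.findall(output):
        counts[label] = int(number)
        total = int(out_of)
    status = STATUS_RE.search(output)
    return {
        "healthy": counts["Healthy"],
        "unhealthy": counts["Unhealthy"],
        "total": total,
        "overall": status.group(1) if status else None,
    }


def exit_code(summary):
    """Return 0 only when the overall status is Healthy, else 1."""
    return 0 if summary["overall"] == "Healthy" else 1


# The unhealthy example from the slide above.
SAMPLE = """\
Summary
-------
9 out of 11 checks are Healthy
2 out of 11 checks are Unhealthy
Overall system status is Unhealthy
"""

if __name__ == "__main__":
    summary = parse_summary(SAMPLE)
    print(summary, exit_code(summary))
```

Treating a missing status line as a failure (exit code 1) is a deliberate choice here: output that cannot be parsed should not be reported as healthy.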

Scenario: Missing GPU

● Missing GPU might be caused by:

○ Failed GPU tray upgrade

○ Hardware failure

$ sudo nvsm show health
Info
----
Timestamp: Thu May 16 21:02:15 2019 -0700
Version: 19.01.8

Checks
------
Quick health check of GPU using DCGM................................. Healthy
DGX BaseOS Version [4.0.6]...........................................
Verify installed DIMM memory sticks.................................. Healthy
...
Verify Ethernet controllers.......................................... Healthy
Verify installed GPU's............................................... Unhealthy
Checking output of 'lspci' for expected GPU's
Missing GPU at PCI address "07:00.0"
Verify installed InfiniBand controllers.............................. Healthy
Verify PCIe switches................................................. Healthy
...

Scenario: Missing PCIe Switch

● Missing PCIe switch might be caused by:

○ Hardware failure

$ sudo nvsm show health
...

Checks
------
BIOS Revision [5.11].................................................
...
Verify Ethernet controllers.......................................... Healthy
Verify installed GPU's............................................... Unhealthy
Checking output of 'lspci' for expected GPU's
Missing GPU at PCI address "06:00.0"
Missing GPU at PCI address "07:00.0"
Missing GPU at PCI address "0a:00.0"
Missing GPU at PCI address "0b:00.0"
Verify installed InfiniBand controllers.............................. Unhealthy
Checking output of 'lspci' for expected InfiniBand controllers
Missing InfiniBand controller at PCI address "0c:00.0"
Missing InfiniBand controller at PCI address "05:00.0"
Verify PCIe switches................................................. Unhealthy
Checking output of 'lspci' for expected PCIe switches
Missing PCIe switch at PCI address "03:00.0"
Missing PCIe switch at PCI address "08:00.0"
...
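The missing-GPU and missing-PCIe-switch scenarios differ mainly in how many device types are reported missing at once. The sketch below groups the "Missing ... at PCI address" lines by device type so that pattern is visible at a glance. It assumes the message wording shown in these two scenarios; `missing_devices` is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Assumption: only these three device types appear, worded as in the slides.
MISSING_RE = re.compile(
    r'Missing (GPU|InfiniBand controller|PCIe switch) at PCI address "([0-9a-fA-F:.]+)"'
)


def missing_devices(output):
    """Group the PCI addresses reported missing by device type."""
    found = defaultdict(list)
    for kind, address in MISSING_RE.findall(output):
        found[kind].append(address)
    return dict(found)


# Detail lines copied from the missing PCIe switch scenario.
SAMPLE = """\
Missing GPU at PCI address "06:00.0"
Missing GPU at PCI address "07:00.0"
Missing GPU at PCI address "0a:00.0"
Missing GPU at PCI address "0b:00.0"
Missing InfiniBand controller at PCI address "0c:00.0"
Missing InfiniBand controller at PCI address "05:00.0"
Missing PCIe switch at PCI address "03:00.0"
Missing PCIe switch at PCI address "08:00.0"
"""

if __name__ == "__main__":
    for kind, addresses in missing_devices(SAMPLE).items():
        print(f"{kind}: {', '.join(addresses)}")
```

A result that contains only one GPU address resembles the first scenario, while GPUs, InfiniBand controllers, and PCIe switches missing together resemble the second.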

Scenario: Missing DIMM

● Missing DIMM might be caused by:

○ Improper DIMM installation

○ Unseated during transport

○ DIMM failure

○ DIMM slot failure

$ sudo nvsm show health
Info
----
Timestamp: Sat Dec 16 16:26:32 2017 -0800
Version: 17.12-5

Checks
------
BIOS Version [S2W_3A08].................................................
DGX Serial Number [YSY72800016]......................................
Verify installed DIMM memory sticks.................................. Unhealthy
Checking output of 'dmidecode' for expected DIMM's
"Memory Device (DIMM_F1)" is missing
"Memory Device (DIMM_G0)" is missing
BMC Firmware Revision [3.27]..................................... Healthy
...

Scenario: Unsupported DIMM

● Unsupported DIMM might be caused by:

○ Unknown DIMM vendor/part number installed

○ Unexpected DIMM size installed

$ sudo nvsm show health
Info
----
Timestamp: Sat Dec 16 16:26:32 2017 -0800
Version: 17.12-5

Checks
------
BIOS Version [5.11].................................................
DGX Serial Number [YSY72800016]......................................
Verify installed DIMM memory sticks.................................. Unhealthy
Checking output of 'dmidecode' for expected DIMM's
"Memory Device (DIMM_D1) -> size" has value "8192 MB" when "32 GB" was expected
Number of logical CPU cores [80]..................................... Healthy
...
Verify DIMM vendors.................................................. Unhealthy
Comparison: Unknown DIMM vendor "G-Skill"
...
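The two DIMM scenarios produce three kinds of detail message: a missing slot, a size mismatch, and an unknown vendor. The following sketch sorts the detail lines into those three buckets. It assumes the exact message wording shown in the slides; `dimm_findings` is a hypothetical helper.

```python
import re

# Assumption: the patterns mirror the detail lines shown in the two DIMM scenarios.
MISSING_RE = re.compile(r'"Memory Device \((\w+)\)" is missing')
SIZE_RE = re.compile(
    r'"Memory Device \((\w+)\) -> size" has value "([^"]+)" when "([^"]+)"\s*was expected'
)
VENDOR_RE = re.compile(r'Unknown DIMM vendor "([^"]+)"')


def dimm_findings(output):
    """Classify DIMM problems into missing slots, size mismatches, and vendors."""
    return {
        "missing": MISSING_RE.findall(output),
        # Each size mismatch is (slot, observed size, expected size).
        "wrong_size": SIZE_RE.findall(output),
        "unknown_vendors": VENDOR_RE.findall(output),
    }


# Detail lines copied from the missing and unsupported DIMM scenarios.
SAMPLE = """\
"Memory Device (DIMM_F1)" is missing
"Memory Device (DIMM_G0)" is missing
"Memory Device (DIMM_D1) -> size" has value "8192 MB" when "32 GB" was expected
Comparison: Unknown DIMM vendor "G-Skill"
"""

if __name__ == "__main__":
    for category, items in dimm_findings(SAMPLE).items():
        print(category, items)
```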

Scenario: Missing SSD

● Missing SSD might be caused by:

○ Unsupported SSD size installed (e.g. larger SSD installed by customer)

○ Unseated during transport

○ SSD hardware failure

○ SSD not installed

$ sudo nvsm show health
Info
----
Timestamp: Sat Dec 16 16:26:32 2017 -0800
Version: 17.12-5

Checks
------
BIOS Revision [5.11].................................................
DGX Serial Number [YSY72800016]......................................
Verify installed DIMM memory sticks.................................. Healthy
Number of logical CPU cores [80]..................................... Healthy
...
Verify installed MegaRAID disks...................................... Unhealthy
Checking output of 'smartctl' for expected disks
Found 3 disk(s) with capacity "1.92 TB" when 4 disk(s) were expected
No disks of capacity "480 GB" were found
Verify DIMM vendors.................................................. Healthy
...

Scenario: Unsupported System SKU

● nvsm show health is currently supported only on DGX-1 and DGX-2 hardware

● Support for nvsm show health on DGX Station is coming soon

$ sudo nvsm show health
ERROR: Unknown product name "DGX Station"
ERROR: nvhealth could not determine system SKU

Please ensure that nvsm show health is running on a supported NVIDIA system.

If this problem persists, please visit the ESP portal:
https://nvid.nvidia.com/enterpriselogin

And create a support ticket with the output attached.

● Override the system SKU with the --system-sku flag

$ sudo nvsm show health --system-sku dgx-1-p100
Info
----
Timestamp: Sat Dec 16 16:26:32 2017 -0800
Version: 17.12-5

Checks
------
BIOS Revision [5.11].................................................
DGX Serial Number [YSY72800016]......................................
Verify installed DIMM memory sticks.................................. Healthy
...
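A wrapper script could detect the SKU error and build the retry command with the override shown above. This is only a sketch: the `--system-sku` flag and the `dgx-1-p100` value come from the slide, while the helper names `build_health_command` and `sku_not_detected` are hypothetical, and whether an override is appropriate for a given system is a decision for the operator.

```python
def build_health_command(system_sku=None):
    """Build the argument list for `nvsm show health`, with an optional SKU override."""
    command = ["sudo", "nvsm", "show", "health"]
    if system_sku:
        # The flag and its value are passed as separate arguments.
        command += ["--system-sku", system_sku]
    return command


def sku_not_detected(output):
    """Return True when the output contains the SKU detection error shown above."""
    return "could not determine system SKU" in output


# Error text copied from the slide above.
SAMPLE_ERROR = (
    'ERROR: Unknown product name "DGX Station"\n'
    "ERROR: nvhealth could not determine system SKU\n"
)

if __name__ == "__main__":
    if sku_not_detected(SAMPLE_ERROR):
        print(" ".join(build_health_command("dgx-1-p100")))
```

Returning a list rather than a single string lets the command be passed to `subprocess.run` without a shell.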

Challenge: DGX-1 Health Check

1. Run nvsm show health on your assigned DGX-1

2. Where can you find the output TAR log from the nvsm dump health run?

3. Is anything wrong with your system?

_Team Challenge_

• Output is saved in /tmp/*.tar.xz

• sudo tar -xf <file>

• sudo ls ./<file>

dgxuser@psg-dgx1-02:~$ sudo nvsm show health
[sudo] password for dgxuser:
Info
----
Timestamp: Wed Apr 4 10:03:07 2018 -0700
Version: 18.03-2
(snip)
dgxuser@psg-dgx1-02:~$ sudo nvsm dump health
Writing output to /tmp/*.tar.xz
dgxuser@psg-dgx1-02:~$
dgxuser@psg-dgx1-02:/tmp$ sudo tar -xf nvsm-health-dgx1-18-04-20190516213243.tar.xz
dgxuser@psg-dgx1-02:/tmp$ sudo ls ./nvsm-health-dgx1-18-04-20190516213243.tar.xz

output is reproduced

Make the tar log file readable

Solution: DGX-1 Health Check
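The report can also be located and inspected without unpacking it. The sketch below finds the most recent archive in a directory and lists its members using Python's standard `tarfile` module. The `nvsm-health-*.tar.xz` file name pattern is an assumption based on the sample paths in these slides, and the helper names are hypothetical. Because the dump is written under sudo, reading it may require elevated permissions.

```python
import glob
import os
import tarfile


def newest_report(directory="/tmp", pattern="nvsm-health-*.tar.xz"):
    """Return the most recently modified matching archive, or None if none exist."""
    paths = glob.glob(os.path.join(directory, pattern))
    return max(paths, key=os.path.getmtime) if paths else None


def list_report_members(path):
    """Return the member names stored in an xz-compressed tar archive."""
    with tarfile.open(path, mode="r:xz") as archive:
        return archive.getnames()


if __name__ == "__main__":
    report = newest_report()
    if report is None:
        print("No health report found")
    else:
        for name in list_report_members(report):
            print(name)
```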

Info: nvsm

• Software framework for monitoring NVIDIA DGX™ nodes in a data center.

• Documentation: https://docs.nvidia.com/dgx/nvsm-user-guide/index.html

Challenge: Using NVSM

● Use nvsm to check fan(s) status

_Team Challenge_

Example: Using NVSM

$ sudo nvsm show fans
/chassis/localhost/thermal/fans/FAN10_F
Properties:
Status_State = Enabled
Status_Health = OK
Name = FAN10_F
MemberId = 19
ReadingUnits = RPM
LowerThresholdNonCritical = 5046.000
Reading = 9802 RPM
LowerThresholdCritical = 3596.000
...
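The "Key = Value" properties in this output are easy to load into a dictionary and compare against the reported thresholds. This is a minimal sketch, assuming one property per line in the form shown above and a reading whose first token is the numeric value; `parse_properties` and `reading_is_low` are hypothetical helpers.

```python
def parse_properties(text):
    """Collect "Key = Value" lines into a dictionary of strings."""
    properties = {}
    for line in text.splitlines():
        key, separator, value = line.partition("=")
        if separator:  # skip lines without "=", such as the target path
            properties[key.strip()] = value.strip()
    return properties


def reading_is_low(properties):
    """Return True when the reading is below the lower non-critical threshold."""
    reading = float(properties["Reading"].split()[0])  # "9802 RPM" -> 9802.0
    return reading < float(properties["LowerThresholdNonCritical"])


# Sample text copied from the fan output above.
SAMPLE = """\
/chassis/localhost/thermal/fans/FAN10_F
Properties:
Status_State = Enabled
Status_Health = OK
Name = FAN10_F
MemberId = 19
ReadingUnits = RPM
LowerThresholdNonCritical = 5046.000
Reading = 9802 RPM
LowerThresholdCritical = 3596.000
"""

if __name__ == "__main__":
    fan = parse_properties(SAMPLE)
    print(fan["Name"], fan["Status_Health"], reading_is_low(fan))
```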
