
DYNAMIC PAGE RETIREMENT

vR440 | May 2020

Technical Note


TABLE OF CONTENTS

Chapter 1. Overview
Chapter 2. Implementation
    2.1. Use Case 1: DBE Detected
    2.2. Use Case 2: Two SBEs Detected at the Same Location
Chapter 3. Blacklisting and ECC Error Recovery
    3.1. Verifying Retired Pages are Pending
    3.2. Stopping GPU Clients
    3.3. Reattaching the GPU
Chapter 4. Availability
Chapter 5. Visibility
    5.1. XIDs
    5.2. NVML
Chapter 6. Caveats
Chapter 7. FAQ
    7.1. RMA Eligibility
    7.2. Memory Impact
    7.3. Configuration
    7.4. App Behavior
    7.5. App Performance
    7.6. Driver Behavior
Chapter 8. Glossary


Chapter 1. OVERVIEW

The NVIDIA® driver supports "retiring" framebuffer pages that contain bad memory cells. This is called "dynamic page retirement" and is done automatically for cells that are degrading in quality. This feature can improve the longevity of an otherwise good board and is thus an important resiliency feature on supported products, especially in HPC and enterprise environments.


Chapter 2. IMPLEMENTATION

The marking of a page for exclusion is called "retiring", while the actual act of excluding that page from subsequent memory allocations is called "blacklisting". The NVIDIA® driver will retire a page once it has experienced a single Double Bit ECC Error (DBE), or 2 Single Bit ECC Errors (SBE) on the same address. These addresses are stored in the InfoROM. When each GPU is attached and initialized, the driver will retrieve these addresses from the InfoROM, then have the framebuffer manager set these pages aside, such that they cannot be used by the driver or user applications.

Retiring of pages may only occur when ECC is enabled. However, once a page has been retired it will always be blacklisted by the driver, even if ECC is later disabled.

Ideally, the NVIDIA® driver will catch weakening cells at the 2 SBE point and retire the page, before the cell degrades to the point of a DBE and disrupts an application.
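
The retirement rule described above can be illustrated with a small model. This is a hypothetical Python sketch, not driver code: the names `RetirementTracker` and `page_of` are invented for illustration, and mapping a byte address to a page by dividing by the 64 KiB page size (quoted in the FAQ) is an assumption about the address format.

```python
from collections import Counter

PAGE_SIZE = 64 * 1024  # 64 KiB per retired page, per the FAQ in this note


def page_of(address: int) -> int:
    """Hypothetical mapping from a byte address to its page index."""
    return address // PAGE_SIZE


class RetirementTracker:
    """Toy model of the rule: one DBE, or two SBEs at the same address."""

    def __init__(self) -> None:
        self.sbe_counts = Counter()  # SBE count per exact address
        self.retired = {}            # page index -> cause of retirement

    def record_sbe(self, address: int) -> bool:
        """Record an SBE; return True if this event retires a page."""
        self.sbe_counts[address] += 1
        page = page_of(address)
        if self.sbe_counts[address] >= 2 and page not in self.retired:
            self.retired[page] = "Single Bit ECC"
            return True
        return False

    def record_dbe(self, address: int) -> bool:
        """Record a DBE; return True if this event retires a page."""
        page = page_of(address)
        if page not in self.retired:
            self.retired[page] = "Double Bit ECC"
            return True
        return False
```

Note that the SBE counter is keyed by exact address, not by page: two SBEs at different addresses within one page do not retire it in this model, which matches the Driver Behavior FAQ.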

2.1. Use Case 1: DBE Detected

 1. The NVIDIA driver detects a DBE and reports that a DBE occurred.
 2. Applications will receive a DBE event notification for graceful exit, and no further context will be created on the GPU until the DBE is mapped out.
 3. The NVIDIA driver logs the DBE count and address in the InfoROM.

    Page retirement occurs and the nvidia-smi Retired Pages 'Double Bit ECC' field is incremented.

    The nvidia-smi 'Pending Page Blacklist' status becomes 'YES'.
 4. The NVIDIA driver logs, in a separate list, that the page containing the DBE is to be retired.
 5. Upon the next reattachment of the GPU, the page is mapped out of usage.

    Page blacklisting occurs and the nvidia-smi 'Pending Page Blacklist' status becomes 'NO'.

The DBE addresses and counts are preserved across driver reloads.


2.2. Use Case 2: Two SBEs Detected at the Same Location

 1. The NVIDIA driver detects an SBE and reports that an SBE occurred.
 2. The NVIDIA driver logs the SBE count and address in the InfoROM.
 3. If the SBE occurs more than once at a particular address, the driver logs, in a separate list, that the page containing that address is to be retired.

    Page retirement occurs and the nvidia-smi Retired Pages 'Single Bit ECC' field is incremented.

    The nvidia-smi 'Pending Page Blacklist' status becomes 'YES'.
 4. Upon the next reattachment of the GPU, the page is mapped out of usage.

    Page blacklisting occurs and the nvidia-smi 'Pending Page Blacklist' status becomes 'NO'.

Unlike the DBE case, applications continue to run uninterrupted.

The SBE addresses and counts are preserved across driver reloads.
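
The two use cases share the same lifecycle: a page is retired immediately (counter incremented, status pending), and it is blacklisted only at the next reattachment. The following Python sketch models just that bookkeeping; the class `GpuPageState` and its method names are hypothetical and only mirror the fields that nvidia-smi reports.

```python
class GpuPageState:
    """Toy model of the nvidia-smi fields that change in both use cases."""

    def __init__(self) -> None:
        self.retired_sbe = 0      # Retired Pages 'Single Bit ECC'
        self.retired_dbe = 0      # Retired Pages 'Double Bit ECC'
        self.pending = set()      # retired, not yet blacklisted
        self.blacklisted = set()  # excluded from allocations

    def retire(self, page: int, cause: str) -> None:
        """Retire a page; cause is 'SBE' or 'DBE'."""
        if page in self.pending or page in self.blacklisted:
            return  # already retired, nothing new to count
        if cause == "DBE":
            self.retired_dbe += 1
        else:
            self.retired_sbe += 1
        self.pending.add(page)

    def reattach(self) -> None:
        """On GPU reattachment, all pending pages become blacklisted."""
        self.blacklisted |= self.pending
        self.pending.clear()

    @property
    def pending_page_blacklist(self) -> str:
        return "Yes" if self.pending else "No"
```

The retired counters never decrease on reattachment; only the pending status flips back to "No".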


Chapter 3. BLACKLISTING AND ECC ERROR RECOVERY

Pages that have been previously retired are blacklisted for all future allocations of the framebuffer, provided that the target GPU has been properly reattached and initialized. This chapter presents a procedure for ensuring that retired pages are blacklisted and all GPUs have recovered from the ECC error.

This procedure requires the termination of all clients on the target GPU. It is not possible to blacklist a new page while clients remain active.

Blacklisting: Procedure Overview:

‣ Verify that there are pending retired pages.
‣ Determine GPUs that are related and must be reattached together.
‣ Stop applications and verify there are none left running.
‣ Reinitialize the GPU, or reboot the system.
‣ Verify that blacklisting has occurred.
‣ Restart applications.

3.1. Verifying Retired Pages are Pending

When pages are retired but have not yet been blacklisted, the retired pages are marked as pending for that GPU. This can be seen through nvidia-smi:

    $ nvidia-smi -i <target gpu> -q -d PAGE_RETIREMENT
    ...
        Retired pages
            Single Bit ECC              : 2
            Double Bit ECC              : 0
            Pending Page Blacklist      : Yes
    ...

If Pending Page Blacklist shows "No", then all retired pages have already been blacklisted.


If Pending Page Blacklist shows "Yes", then at least one of the counted retired pages is not yet blacklisted. Note that the exact count of pending pages is not shown.

The retired pages count increments immediately when a page is retired, not on the next driver reload when the page is blacklisted.
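
For scripted checks, the three fields above can be pulled out of the human-readable output. The sketch below is a minimal Python example that assumes the output for a single GPU has exactly the "key : value" layout shown in this section; the function name `parse_page_retirement` is hypothetical, and real output may differ (for example on GPUs that do not support the feature).

```python
SAMPLE = """
    Retired pages
        Single Bit ECC              : 2
        Double Bit ECC              : 0
        Pending Page Blacklist      : Yes
"""


def parse_page_retirement(text: str) -> dict:
    """Extract retired-page counts and pending status from nvidia-smi text."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # keep only "key : value" lines
            fields[key.strip()] = value.strip()
    return {
        "single_bit": int(fields["Single Bit ECC"]),
        "double_bit": int(fields["Double Bit ECC"]),
        "pending": fields["Pending Page Blacklist"].lower() == "yes",
    }


result = parse_page_retirement(SAMPLE)
```

In practice the text would come from running the nvidia-smi command shown above, for instance through `subprocess.run(..., capture_output=True, text=True)`.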

3.2. Stopping GPU Clients

Before the NVIDIA® driver can reattach the GPUs, clients that are using those GPUs must first be stopped and the GPU must be unused.

All applications that are using the GPUs should first be stopped. Use nvidia-smi to list processes that are actively using the GPUs. In the example below, a TensorFlow Python program is using both GPUs 0 and 1. Both processes will need to be stopped.

    $ nvidia-smi
    ...
    +---------------------------------------------------------------------+
    | Processes:                                               GPU Memory |
    |  GPU       PID   Type   Process name                     Usage      |
    |=====================================================================|
    |    0      8962      C   python                           15465MiB   |
    |    1      8963      C   python                           15467MiB   |
    +---------------------------------------------------------------------+

Once all applications are stopped, nvidia-smi should show no processes found:

    $ nvidia-smi
    ...
    +---------------------------------------------------------------------+
    | Processes:                                               GPU Memory |
    |  GPU       PID   Type   Process name                     Usage      |
    |=====================================================================|
    |  No running processes found                                         |
    +---------------------------------------------------------------------+

On Linux systems, additional software infrastructure can hold the GPU open and prevent the GPU from being detached by the driver. This includes nvidia-persistenced and version 1 of nvidia-docker. nvidia-docker version 2 does not need to be stopped.

A list of open processes using the driver can be verified on Linux with the lsof command:

    $ sudo lsof /dev/nvidia*
    COMMAND   PID USER                FD  TYPE DEVICE  NODE NAME
    nvidia-pe 941 nvidia-persistenced 2u  CHR  195,255 453  /dev/nvidiactl
    nvidia-pe 941 nvidia-persistenced 3u  CHR  195,0   454  /dev/nvidia0
    nvidia-pe 941 nvidia-persistenced 4u  CHR  195,254 607  /dev/nvidia-modeset
    nvidia-pe 941 nvidia-persistenced 5u  CHR  195,1   584  /dev/nvidia1
    nvidia-pe 941 nvidia-persistenced 6u  CHR  195,254 607  /dev/nvidia-modeset


Once all clients of the GPU are stopped, lsof should return no entries:

    $ sudo service nvidia-persistenced stop
    $ sudo lsof /dev/nvidia*
    $
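
To automate the "no clients left" check, the lsof listing can be reduced to a map of process IDs and the device nodes they hold. This Python sketch assumes whitespace-separated columns in the order shown in this section (PID second, device node last); the function name `gpu_holders` is hypothetical.

```python
def gpu_holders(lsof_output: str) -> dict:
    """Map each PID to the set of /dev/nvidia* nodes it holds open."""
    holders = {}
    for line in lsof_output.splitlines():
        parts = line.split()
        # Skip blank lines, the header row, and anything without a numeric PID.
        if len(parts) < 3 or not parts[1].isdigit():
            continue
        holders.setdefault(int(parts[1]), set()).add(parts[-1])
    return holders


def gpu_is_idle(lsof_output: str) -> bool:
    """True when lsof reports no process holding an NVIDIA device node."""
    return not gpu_holders(lsof_output)
```

An empty result corresponds to the empty lsof listing above, which is the state required before the GPU can be reattached.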

3.3. Reattaching the GPU

Reattaching the GPU, to blacklist pending retired pages, can be done in several ways. In order of cost, from low to high:

‣ Re-attach the GPUs (persistence mode disabled only)
‣ Reset the GPUs
‣ Reload the kernel module (nvidia.ko)
‣ Reboot the machine (or VM)

Reattaching the GPU is the least invasive solution. The detachment process occurs automatically a few seconds after the last client terminates on the GPU, as long as persistence mode is not enabled. The next client that targets the GPU will trigger the driver to reattach and blacklist all marked pages.

If persistence mode is enabled, the preferred solution is to reset the GPU using nvidia-smi. To reset an individual GPU:

    $ nvidia-smi -i <target GPU> -r

Or to reset all GPUs together:

    $ nvidia-smi -r

These operations reattach the GPU as a step in the larger process of resetting all GPU SW and HW state.

Reloading the NVIDIA® kernel module triggers reattachment of all GPUs on the machine, and thus requires the termination of all clients on all GPUs.

Finally, rebooting the machine will effectively reattach the GPUs as the driver is reloaded and reinitialized during reboot. While rebooting isn't required and is highly invasive, it might simplify the recovery action in some operating environments.
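
The ordering of these options can be written down as a small decision helper. This is only an illustrative Python sketch of the text above: the function name `reattach_options` and its two boolean parameters are hypothetical, and it returns descriptions rather than executing anything.

```python
def reattach_options(persistence_mode_enabled: bool,
                     can_stop_clients_on_all_gpus: bool) -> list:
    """List applicable reattach methods, cheapest first."""
    options = []
    if not persistence_mode_enabled:
        # Detach happens automatically after the last client exits.
        options.append("re-attach the GPU")
    options.append("reset the GPU: nvidia-smi -i <target GPU> -r")
    if can_stop_clients_on_all_gpus:
        # Reloading nvidia.ko reattaches every GPU on the machine.
        options.append("reload the kernel module (nvidia.ko)")
    options.append("reboot the machine (or VM)")
    return options
```

With persistence mode enabled, the first entry returned is the GPU reset, matching the preferred solution described above.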


Chapter 4. AVAILABILITY

Dynamic page retirement is supported on the following products and environments:

‣ Drivers: R319 and newer
‣ OSes: All standard driver-supported Linux and Windows TCC platforms
‣ GPUs:
    ‣ K20 and newer Tesla products, including the Tesla V100 and T4
    ‣ Quadro GV100 and newer products
    ‣ Quadro Virtual Data Center Workstation (Quadro vDWS) and NVIDIA vComputeServer (starting with NVIDIA Virtual GPU Software v9.0)
    ‣ No GeForce products are currently supported


Chapter 5. VISIBILITY

Three main mechanisms provide visibility into page retirement: XID errors in system logs, the NVML API and the nvidia-smi command line tool.

5.1. XIDs

XID errors are driver errors that are logged to the system error log. Please see the XID Whitepaper for general info on XIDs.

There are three main XIDs related to dynamic page retirement:

‣ XID 48: A DBE has occurred.
‣ XID 63: A page has successfully been retired.
‣ XID 64: A page has failed retirement due to an error.

In the system log these XIDs show up in the following forms:

‣ XID 48: "XID 48 An uncorrectable double bit error (DBE) has been detected on GPU (<id>)"
‣ XID 63: "XID 63 Dynamic Page Retirement: New retired page, reload the driver to activate. (<address>)"
‣ XID 64: "XID 64 Dynamic Page Retirement: Fatal error, unable to retire page (<address>)"
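
A log scan for these three XIDs might look like the following Python sketch. The regular expression targets the message forms quoted in this section; actual system logs may prefix or format XID lines differently, so the pattern is an assumption that would likely need adjusting. The names `XID_PATTERN` and `find_page_retirement_xids` are hypothetical.

```python
import re

XID_MEANING = {
    48: "A DBE has occurred",
    63: "A page has successfully been retired",
    64: "A page has failed retirement due to an error",
}

# Matches "XID 48", "XID 63" or "XID 64" as whole tokens.
XID_PATTERN = re.compile(r"\bXID\s+(48|63|64)\b")


def find_page_retirement_xids(log_lines):
    """Return (xid, meaning) for each log line mentioning XID 48, 63 or 64."""
    events = []
    for line in log_lines:
        match = XID_PATTERN.search(line)
        if match:
            xid = int(match.group(1))
            events.append((xid, XID_MEANING[xid]))
    return events
```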

5.2. NVML

The NVIDIA® Management Library (NVML) is a public C-based library for GPU monitoring and management. It includes APIs that report the status and count of retired pages. Please see the NVML API docs for general info on the library.

The set of currently retired pages, and their addresses, can be retrieved using:

    nvmlReturn_t nvmlDeviceGetRetiredPages (nvmlDevice_t device,
                                            nvmlPageRetirementCause_t cause,
                                            unsigned int* pageCount,
                                            unsigned long long* addresses)


Driver versions 410.72 or later provide a newer API that also returns the page retirement timestamp:

    nvmlReturn_t nvmlDeviceGetRetiredPages_v2 (nvmlDevice_t device,
                                               nvmlPageRetirementCause_t cause,
                                               unsigned int* pageCount,
                                               unsigned long long* addresses,
                                               unsigned long long* timestamps)

For both APIs, the nvmlPageRetirementCause_t passed is one of:

‣ NVML_PAGE_RETIREMENT_CAUSE_MULTIPLE_SINGLE_BIT_ECC_ERRORS
‣ NVML_PAGE_RETIREMENT_CAUSE_DOUBLE_BIT_ECC_ERROR

The current state of the driver (whether any pages are pending retirement) can be retrieved using:

    nvmlReturn_t nvmlDeviceGetRetiredPagesPendingStatus (nvmlDevice_t device,
                                                         nvmlEnableState_t* isPending)
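
The three queries above combine naturally into one summary. The Python sketch below takes an `nvml` object as a parameter and assumes it exposes functions and constants with the same names as the C API, returning the retired addresses as a sequence and the pending status as a truthy value. That calling convention is an assumption for illustration (Python NVML bindings are not described in this note), and `retired_page_summary` is a hypothetical name.

```python
def retired_page_summary(nvml, device) -> dict:
    """Collect retired page addresses per cause, plus the pending status.

    `nvml` is assumed to be a binding object mirroring the C API names;
    `device` is whatever device handle that binding uses.
    """
    sbe_pages = nvml.nvmlDeviceGetRetiredPages(
        device,
        nvml.NVML_PAGE_RETIREMENT_CAUSE_MULTIPLE_SINGLE_BIT_ECC_ERRORS,
    )
    dbe_pages = nvml.nvmlDeviceGetRetiredPages(
        device,
        nvml.NVML_PAGE_RETIREMENT_CAUSE_DOUBLE_BIT_ECC_ERROR,
    )
    pending = nvml.nvmlDeviceGetRetiredPagesPendingStatus(device)
    return {
        "single_bit": list(sbe_pages),
        "double_bit": list(dbe_pages),
        "pending": bool(pending),
    }
```

Passing the binding in as an argument keeps the function easy to exercise with a stand-in object on machines without a supported GPU.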

5.3. nvidia-smi

nvidia-smi is a public command line interface for GPU monitoring and management. It implements most of the NVML APIs and supports reporting the status and count of retired pages. Please see the nvidia-smi man page for general info on the tool.

The nvidia-smi tool provides

‣ the ability to list the number of retired pages, sorted by cause of retirement, and indicate whether any pages are pending retirement, and
‣ the ability to list all retired page addresses.

To view the number of retired pages and the page retirement state of the driver in human readable form:

    $ nvidia-smi -i <target gpu> -q -d PAGE_RETIREMENT
    ...
        Retired pages
            Single Bit ECC              : 2
            Double Bit ECC              : 0
            Pending Page Blacklist      : No
    ...

The nvidia-smi "Pending Page Blacklist" field indicates whether a page has been recently retired and will be blacklisted on the next system reboot or driver load.

The retired page count increments immediately when a page is retired, not on the next driver reload when the page is blacklisted.

If pages have been retired, the affected addresses can be viewed through nvidia-smi's scriptable outputs, in either XML format:

    $ nvidia-smi -i <target gpu> -q -x
    ...
    <retired_pages>
        <multiple_single_bit_retirement>
            <retired_count>2</retired_count>
            <retired_page_addresses>
                <retired_page_address>0xABC123</retired_page_address>
                <retired_page_address>0xDEF456</retired_page_address>
            </retired_page_addresses>
        </multiple_single_bit_retirement>
        <double_bit_retirement>
            <retired_count>0</retired_count>
            <retired_page_addresses></retired_page_addresses>
        </double_bit_retirement>
        <pending_retirement>No</pending_retirement>
    </retired_pages>
    ...

or CSV format:

    $ nvidia-smi -i <target gpu> --query-retired-pages=gpu_uuid,retired_pages.address,retired_pages.cause --format=csv
    ...
    gpu_uuid, retired_pages.address, retired_pages.cause
    GPU-d73c8888-9482-7d65-c95c-4b58c7d9eb4c, 0xABC123, Double Bit ECC
    GPU-d73c8888-9482-7d65-c95c-4b58c7d9eb4c, 0xDEF456, Double Bit ECC
    GPU-d73c8888-9482-7d65-c95c-4b58c7d9eb4c, 0x123ABC, Single Bit ECC
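
The CSV form is the easiest to post-process. The following Python sketch groups retired addresses by cause, assuming the header row and comma-plus-space layout shown above; the function name `retired_addresses_by_cause` is hypothetical.

```python
import csv
import io


def retired_addresses_by_cause(csv_text: str) -> dict:
    """Group retired page addresses (as integers) by retirement cause."""
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    grouped = {}
    for row in reader:
        cause = row["retired_pages.cause"].strip()
        address = int(row["retired_pages.address"].strip(), 16)
        grouped.setdefault(cause, []).append(address)
    return grouped
```

`skipinitialspace=True` handles the space that follows each comma in the sample output, so the column names match the query fields exactly.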


Chapter 6. CAVEATS

There exists a race condition between logging errors to the InfoROM and ending a CUDA™ job while in persistence mode. This race condition is most often hit when shutting down in response to a DBE. The effect of this condition is that a page may fail to retire in certain corner cases.

Temporarily exiting persistence mode before rebooting the system will forcibly flush any pending writes to the InfoROM. If XID 48 is seen and XID 63 is not seen, it is recommended to exit persistence mode via the command:

    $ nvidia-smi -i <target GPU> -pm 0

At this point, the XID 63 should be seen and the NVML query can be used to verify the page was written to the InfoROM.

There are no current plans to fix the race condition in persistence mode, as persistence mode is replaced by the persistence daemon. The persistence daemon is not susceptible to this race condition. To let you know that your platform is susceptible to this race condition, the kernel driver will print out a warning in the dmesg log files to indicate a persistence daemon is not being used.
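
The "XID 48 seen, XID 63 not seen" condition is simple to express in code. This Python sketch assumes the XID numbers for one GPU have already been collected from the system log; the function name `should_exit_persistence_mode` is hypothetical, and the check deliberately ignores ordering and per-page matching.

```python
def should_exit_persistence_mode(xids_seen) -> bool:
    """True when a DBE (XID 48) was logged without a retirement (XID 63)."""
    seen = set(xids_seen)
    return 48 in seen and 63 not in seen
```

When this returns True, the recommendation above applies: temporarily exit persistence mode so that pending InfoROM writes are flushed.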


Chapter 7. FAQ

7.1. RMA Eligibility

How many pages can be mapped out before the GPU should be returned for repair?

If a board is found to exhibit 5 or more retired pages within 30 days or 10 or more retired pages over the period of warranty, it can be evaluated for an RMA. Please track the page retirement information for RMA application.

Please contact your NVIDIA representative or your local distributor/reseller for Warranty and Support.
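
The two thresholds can be checked against a list of retirement timestamps. The Python sketch below makes two assumptions that this note does not state: "within 30 days" is read as any sliding 30-day window, and every timestamp passed in is taken to fall inside the warranty period. The function name `meets_rma_thresholds` is hypothetical, and a True result only means the board can be evaluated, not that an RMA is granted.

```python
from datetime import datetime, timedelta


def meets_rma_thresholds(retirement_times, window_days=30,
                         window_limit=5, total_limit=10) -> bool:
    """Check the 5-in-30-days and 10-in-warranty retired page thresholds."""
    times = sorted(retirement_times)
    if len(times) >= total_limit:
        return True
    window = timedelta(days=window_days)
    # Slide over every run of `window_limit` consecutive retirements.
    for i in range(len(times) - window_limit + 1):
        if times[i + window_limit - 1] - times[i] <= window:
            return True
    return False


# Five retirements on consecutive days in May 2020.
example = [datetime(2020, 5, day) for day in range(1, 6)]
print(meets_rma_thresholds(example))  # prints True
```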

7.2. Memory Impact

Is available memory reduced by retired pages?

Each retired page decreases the total memory available to applications. However, the total maximum size of memory retired in this way is only on the order of 4 MiB. This is insignificant relative to other factors, such as natural fluctuations in memory allocated internally by the NVIDIA® driver during normal operation.

What is the size of each retired page?

64KiB

How many pages can be retired?

A combined total of 64 pages can be mapped out, or retired. This can be any combination of DBE and SBE pages.

How many addresses can be stored in the InfoROM?

The InfoROM can store at least 192 different addresses. Different GPU models and InfoROM formats may extend this beyond 192 addresses up to a maximum of 600.
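
The 4 MiB figure follows directly from the page size and the page limit given in this section. As a worked check in Python (the constant and function names are hypothetical):

```python
PAGE_SIZE_KIB = 64      # size of each retired page
MAX_RETIRED_PAGES = 64  # combined DBE + SBE limit


def retired_memory_kib(retired_pages: int) -> int:
    """Memory removed from use by a given number of retired pages, in KiB."""
    if not 0 <= retired_pages <= MAX_RETIRED_PAGES:
        raise ValueError("retired page count out of range")
    return retired_pages * PAGE_SIZE_KIB


# Worst case: 64 pages * 64 KiB = 4096 KiB = 4 MiB.
worst_case_mib = retired_memory_kib(MAX_RETIRED_PAGES) / 1024
```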


7.3. Configuration

Is ECC enabled on my GPU?

nvidia-smi can be used to show if the GPU currently has ECC enabled, and what mode is pending to be activated following a reboot.

    $ nvidia-smi -q
    ...
        Ecc Mode
            Current                 : Disabled
            Pending                 : Enabled

Can page retirement be disabled?

No, page retirement is an important reliability feature and cannot be disabled for either SBEs or DBEs. Any pages already marked as retired will continue to be excluded in all future allocations. Note though that if ECC is disabled no new memory errors will be detected and thus no new pages will be retired for future blacklisting.

ECC can be disabled through nvidia-smi:

    $ nvidia-smi -e 0
    Disabled ECC support for GPU 00000000:06:00.0.
    All done.
    Reboot required.

Or re-enabled, similarly:

    $ sudo nvidia-smi -e 1
    Enabled ECC support for GPU 00000000:06:00.0.
    All done.
    Reboot required.

Is the SBE recurrence threshold for triggering retirement configurable?

No

Can I disable DMA protected range (DPR) for SBEs or DBEs?

No, NVIDIA does not support disabling DPR, either entirely or for just SBEs or DBEs.

7.4. App Behavior

Is application behavior affected?

No, applications behave the same. Since pages are blacklisted only after the driver has been restarted, the act of excluding a page occurs outside the lifetime of any GPU process or application. An application running on a GPU with pages that are retired but still pending blacklisting will continue to see those pages in its memory space, though any page retired due to a double bit error (DBE) will necessarily cause an application to terminate. This is true even without page retirement.

7.5. App Performance

Is application performance affected?

No, application performance is unaffected by either the retirement of pages or their subsequent blacklisting. Retirement is the only act taken during application execution, while the actual blacklisting event happens only after the application has terminated. As noted in the first FAQ question above, the memory impact of retired pages is also negligible.

Is memory fragmentation due to page retirement expected to impact app performance?

Fragmentation is not expected to affect performance.

7.6. Driver Behavior

Must multiple SBEs be located at the same address to trigger retirement?

Yes, multiple SBEs must be located at the exact same location (address). Multiple SBEs at different locations within the same page will not trigger page retirement.

Are "stuck" bits rewritten by the driver, or corrected on each read?

On Kepler-class GPUs and later, the driver rewrites the data to avoid stuck bits.

Are all SBE and DBE addresses tracked indefinitely?

SBE and DBE addresses are tracked indefinitely, up to the maximum number of addresses that can be stored (see Memory Impact). Additional addresses beyond the maximum are dropped.


Chapter 8. GLOSSARY

Term                  Definition

Aggregate DBEs        The total count of DBEs encountered over the life of the board. This count is stored in the InfoROM.

Blacklisted pages     Retired pages that have been excluded from memory allocations

DBE                   Double-bit ECC error

ECC error             Error that has been corrected by the error-correcting code

Retired pages         Pages that have been marked for exclusion or blacklisting

Retired Pages (DBE)   The number of pages retired due to double-bit errors. This does not apply to pages with single-bit ECC errors (SBE) that have been retired proactively as a precaution against weak memory cells that might cause DBEs in the future.

Retired page count    Retired (pending) + blacklisted pages

SBE                   Single-bit ECC error

Uncorrectable ECC     DBE that the error-correcting code (ECC) algorithm is not able to correct.

Volatile DBE          Double-bit errors (DBEs) encountered by the actively running driver instance. Volatile DBEs are generally the same as uncorrectable ECC errors.


ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, "MATERIALS") ARE BEING PROVIDED "AS IS." NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE.

Information furnished is believed to be accurate and reliable. However, NVIDIA Corporation assumes no responsibility for the consequences of use of such information or for any infringement of patents or other rights of third parties that may result from its use. No license is granted by implication or otherwise under any patent rights of NVIDIA Corporation. Specifications mentioned in this publication are subject to change without notice. This publication supersedes and replaces all other information previously supplied. NVIDIA Corporation products are not authorized as critical components in life support devices or systems without express written approval of NVIDIA Corporation.

NVIDIA and the NVIDIA logo are trademarks or registered trademarks of NVIDIA Corporation in the U.S. and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.

© 2013-2020 NVIDIA Corporation. All rights reserved.

www.nvidia.com

