Venugopala Madumbu, NVIDIA
GTC 2017 – 210D
S7105 – ADAS/AD CHALLENGES: GPU SCHEDULING & SYNCHRONIZATION
ADVANCED DRIVING ASSIST SYSTEMS (ADAS) & AUTONOMOUS DRIVING (AD)
High Compute Workloads Mapped to the GPU
ADAS/AD – Requirements & Challenges

Real-Time Behavior
• Determinism
• Freedom from Interference
• Priority of Functionalities

Performance
• Maximum Throughput
• Minimal Latency

Compute resources: Multi-Core CPU, GPU/DSP/HWA
ADAS/AD WORKLOADS – Challenges Illustrated

Scenario #1 – Standalone Execution: the GL workload alone takes X msec
Scenario #2 – Standalone Execution: the CUDA workload alone takes Y msec
Scenario #3 – Concurrent Execution: the GL and CUDA workloads time-share the GPU and take > (X+Y) msec in total

With time-shared GPU execution, how do we
• Achieve determinism
• Achieve freedom from interference
• Prioritize one workload over the other
while also having
• Maximum throughput
• Minimum latency
GPU IN TEGRA – High Level Tegra SoC Block Diagram

[Block diagram: CPU, GPU (with its Host engines), and other clients (ISP, Display, etc.) share DRAM through the Memory Controller; the GPU connects via the GPU Memory Interface]

• The CPU submits jobs/work to the GPU
• The GPU runs asynchronously to the CPU
• The GPU has its own hardware scheduler (Host)
• It switches between workloads without CPU involvement
GPU SCHEDULING – Concepts

Channel – an independent stream of work on the GPU
Command Push Buffer – a command buffer written by software and read by hardware
Channel Switching – save/restore of GPU state on a channel switch
Semaphores/Syncpoints – synchronization mechanism for events within the GPU
Time Slice – how long the GPU executes commands of a channel before a channel switch
Run-list – an ordered list of channels that software wants the GPU to execute
GPU SCHEDULING – Timesharing by Channel Switching

Channel switching occurs when any ONE of the following happens:
• The time slice expires
• The engine runs out of work (no more commands)
• The channel blocks on a semaphore

Channel Switch time = Drain Time + Save/Restore Time
Preemption can reduce channel switch times drastically

[Timeline: the GPU is occupied by App1, App2, App3, App4, ... in a timesliced round-robin]
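The switch-cost relation above can be illustrated numerically (the millisecond figures below are invented for illustration, not measured values):

```python
def channel_switch_ms(drain_ms, save_restore_ms):
    # Channel Switch time = Drain Time + Save/Restore Time
    return drain_ms + save_restore_ms

# Without preemption the pipeline must drain naturally (hypothetical figures):
no_preempt = channel_switch_ms(drain_ms=3.0, save_restore_ms=0.5)
# Preemption stops in-flight work early, shrinking the drain component:
with_preempt = channel_switch_ms(drain_ms=0.25, save_restore_ms=0.5)
print(no_preempt, with_preempt)  # → 3.5 0.75
```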
GPU SCHEDULING – Channel Switching with Time Slice Scenarios

1. Channel finishes before its time slice expires
• Context switch to the next channel

2. Channel preemption (time slice expires, then channel switch timeout)
• Stop all commands in the pipeline
• Wait for the engines to idle
• Higher context switch time

3. Channel reset
• The engine could not idle and the context could not be saved before the channel switch timeout
• Callback to notify the kernel of the channel reset event
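The three outcomes can be restated as a small decision function (a toy summary of the scenarios above, not actual driver logic):

```python
def channel_switch_outcome(finished_before_slice, idled_before_timeout,
                           saved_before_timeout):
    """Classify what happens at the end of a channel's turn on the GPU."""
    if finished_before_slice:
        # 1. Channel finished within its slice: simply context-switch
        #    to the next channel in the run-list.
        return "context switch"
    if idled_before_timeout and saved_before_timeout:
        # 2. Preemption: commands are stopped, engines idle, and state
        #    is saved (at the cost of a higher context switch time).
        return "preemption"
    # 3. The engine could not idle or the context could not be saved
    #    before the timeout: the channel is reset and the kernel notified.
    return "channel reset"

print(channel_switch_outcome(False, True, True))   # → preemption
```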
CHALLENGE REVISITED – How Can We Achieve Both?

Real-Time Behavior:
• Determinism
• Freedom from Interference
• Priority of Functionalities

Performance:
• Maximum Throughput
• Minimal Latency
GPU SYNCHRONIZATION & SCHEDULING – Software Control

1. User Driver Level (GPU Synchronization Approach)
• Syncpoints/semaphores for synchronization
• Exposed through EGLStreams, EGLSync, etc.

2. Kernel Driver Level (GPU Priority Scheduling Approach)
• Run-list engineering
• How long a channel runs
• Order of channel execution
GPU SYNCHRONIZATION APPROACH – No Synchronization Case

[Timeline, 0–35 msec: CPU tasks launch kernels; the priority GPU task and another GPU task run concurrently, and the priority task suffers latency due to concurrent execution]
GPU SYNCHRONIZATION APPROACH – Synchronization on CPU: Not Good for the GPU

[Timeline, 0–35 msec: the CPU waits on a GPU semaphore before launching the next kernel; the GPU tasks no longer overlap, but the GPU sits idle during each CPU round-trip]
GPU SYNCHRONIZATION APPROACH – Synchronization on GPU: No Context Switches

[Timeline, 0–35 msec: the lower-priority GPU task waits on a GPU semaphore (delayed start) while the priority GPU task runs first; no context switches are needed]

Achieves:
• Determinism
• Freedom from Interference
• Priority of Functionalities
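The three timeline slides can be compared with a toy cost model (all millisecond figures and the ordering assumptions are invented for illustration; real behavior depends on the workloads and driver):

```python
def sync_scheme_costs(scheme, prio_ms=10, other_ms=10, cpu_roundtrip_ms=2):
    """Return (priority_task_latency_ms, gpu_idle_ms) under a toy model
    of the three synchronization cases."""
    if scheme == "none":
        # Tasks time-share the GPU concurrently: the priority task is
        # stretched by interference, though the GPU itself never idles.
        return prio_ms + other_ms, 0
    if scheme == "cpu":
        # The CPU waits on a GPU semaphore before launching the priority
        # kernel: the priority task is protected from interference, but
        # the GPU idles for a CPU round-trip between tasks.
        return prio_ms + cpu_roundtrip_ms, cpu_roundtrip_ms
    if scheme == "gpu":
        # The other task's channel waits on the semaphore *on the GPU*
        # (delayed start): priority work runs first, back to back, with
        # no context switches and no idle gap.
        return prio_ms, 0
    raise ValueError(scheme)

for s in ("none", "cpu", "gpu"):
    print(s, sync_scheme_costs(s))
```

In this model only GPU-side synchronization gives both the lowest priority-task latency and zero GPU idle time, which is the slide's argument.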
GPU PRIORITY SCHEDULING APPROACH – Hypothetical Example

TASK  PRIORITY         FPS  WORST-CASE EXECUTION TIME (WCET)
H1    High             60   9 ms
M1    Medium           30   4 ms
M2    Medium           30   4 ms
L1    Low/Best Effort  30   10 ms
GPU PRIORITY SCHEDULING APPROACH – Engineered Run-List and Time Slice Ensuring FPS and Latency

Run-list entries and time slices:
H1 (Max Exec Time = 9 ms), Time slice = 9 ms
M1 (Max Exec Time = 4 ms), Time slice = 3 ms
M2 (Max Exec Time = 4 ms), Time slice = 3 ms
L1 (Max Exec Time = 10 ms), Time slice = 1 ms

[Timeline: work on the GPU cycles through H1, M1, M2 and L1 according to the run-list]

The run-list cycle is ensured not to exceed 16 ms for 60 fps operation
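The engineered figures above can be checked arithmetically (reproducing the slide's numbers):

```python
# Per-cycle time slices from the engineered run-list (ms).
slices = {"H1": 9, "M1": 3, "M2": 3, "L1": 1}
cycle_ms = sum(slices.values())
frame_60fps_ms = 1000 / 60                # ~16.67 ms

# H1 gets a full 9 ms slot every cycle, so one cycle must fit a 60 fps frame.
assert cycle_ms <= frame_60fps_ms         # 16 <= 16.67: 60 fps is ensured

# M1/M2 run at 30 fps, so they see two cycles (2 x 3 ms = 6 ms) per frame,
# which covers their 4 ms worst-case execution time.
assert 2 * slices["M1"] >= 4
assert 2 * slices["M2"] >= 4

# L1 is best effort: 2 ms per 30 fps frame cannot cover its 10 ms WCET.
assert 2 * slices["L1"] < 10
print(cycle_ms)  # → 16
```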
GPU PRIORITY SCHEDULING APPROACH – Reduce Latency for GPU Work Completion

• Ensure the time slice is long enough to complete the work
• Ensure work is continually submitted, well ahead of time, to avoid:
• GPU idle time
• Unnecessary context switches
GPU SCHEDULING – Best Practices to Keep the GPU Busy

• Submit work in advance
• So the GPU has work to execute at any point in time
• Try to reduce/eliminate work dependencies
• Have a contingency plan for work overload
• If feedback shows the frame is over budget, submit work a few frames ahead and spread it out
• Plan for the worst-case scenario
• Deal with the GPU reset case, especially for low-priority work
• Use the GL robustness extensions
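The "submit work in advance" practice can be sketched as a toy frame pacer (illustrative only; real submission goes through the driver, and the class and queue depth here are invented):

```python
from collections import deque

class FramePacer:
    """Keep at least `depth` frames of work queued so the GPU never starves."""
    def __init__(self, depth=2):
        self.depth = depth
        self.queue = deque()
        self.next_frame = 0

    def top_up(self):
        """Submit frames ahead of time until the queue is `depth` deep."""
        submitted = []
        while len(self.queue) < self.depth:
            self.queue.append(self.next_frame)
            submitted.append(self.next_frame)
            self.next_frame += 1
        return submitted

    def gpu_consumes(self):
        """The GPU finishes one queued frame (None means the GPU idles)."""
        return self.queue.popleft() if self.queue else None

pacer = FramePacer(depth=2)
pacer.top_up()                # frames 0 and 1 queued before the GPU needs them
print(pacer.gpu_consumes())   # → 0
pacer.top_up()                # refill: frame 2 queued while frame 1 is pending
print(pacer.gpu_consumes())   # → 1
```

Topping up after every completion keeps the queue non-empty, so the GPU always has work and avoids the idle gaps and extra context switches called out above.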
CONCLUSION – GPU Synchronization & Scheduling Approaches

Real-Time Behavior:
• Determinism
• Freedom from Interference
• Priority of Functionalities

Performance:
• Maximum Throughput
• Minimal Latency
ACKNOWLEDGEMENTS

• Scott Whitman, NVIDIA
• Vladislav Buzov, NVIDIA
• Amit Rao, NVIDIA
• Yogesh Kini, NVIDIA

GTC Instructor-Led Lab:
L7105 – EGLSTREAMS: INTEROPERABILITY OF CAMERA, CUDA AND OPENGL
11th May 2017, 9:30–11:30 AM, LL21D