Date post: 25-Jul-2020

Optimizing Texture Transfers

Shalini Venkataraman

Senior Applied Engineer, NVIDIA

[email protected]

Outline

Definitions

— Upload : Host (CPU) -> Device (GPU)

— Readback: Device (GPU) -> Host (CPU)

Focus on OpenGL graphics

— Implementing various transfer methods

— Multi-threading and Synchronization

— Debugging transfers

— Best Practices & Results

Applications

Streaming videos/time varying geometry or volumes

— Broadcast, real-time fluid simulations etc

Level of detailing

— Out of core image viewers, terrain engines

— Bricks paged in as needed

Parallel rendering

— Fast communication between multiple GPUs for scaling data/render

Remoting Graphics

— Read back GPU results quickly and stream them over the network

[Diagram: CPU with RAM and GPU with graphics memory, connected over PCIe. Bandwidth labels on the diagram: 8GB/s, 100GB/s, 5-10GB/s.]

OpenGL Graphics – Streaming Data

Previous approaches

— Synchronous – CPU and GPU idle during transfer

— CPU Asynchronous

GPU and CPU Asynchronous with Copy Engines

— Application layout

— Use cases

— Results

Synchronous Transfers

Straightforward

— Upload texture every frame

— Driver does all copy

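Copy, download and draw are sequential

[Diagram: bricks pData[nBricks] ([0], [1], [2], ...) are loaded from disk into main memory and copied into the texture texID in graphics memory with glTexSubImage.]

[Timeline over three frames: the CPU row shows Upload (glTexSubImage), the bus row shows Copy, the GPU row shows Draw; the three steps of a frame do not overlap. Other labels: Frame Draw, Other work.]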

CPU Asynchronous Transfers

Non CPU-blocking transfer using Pixel Buffer Objects (PBO)

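— Ping-pong PBOs for optimal throughput

— Data must be in GPU native format

[Diagram: bricks pData[nBricks] ([0], [1], [2], ...) are loaded from disk into main memory. Data_next is copied with memcpy into one PBO in OpenGL-controlled memory while Data_cur is transferred from the other PBO (PBO0/PBO1) into the texture texID in graphics memory with glTexSubImage.]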

Example – 3D texture + Ping-Pong PBOs

GLuint pbo[2];            // ping-pong PBOs, generated and allocated ahead of time
unsigned int curPBO = 0;

// Bind the current PBO and copy the next brick into it (app -> PBO)
glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, pbo[curPBO]);
GLubyte* ptr = (GLubyte*)glMapBufferRange(GL_PIXEL_UNPACK_BUFFER_ARB, 0, size,
                                          GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT);
memcpy(ptr, pData[curBrick], xdim*ydim*zdim);
glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER_ARB);

// Copy pixels from the other PBO (filled on the previous frame) to the texture object
glBindTexture(GL_TEXTURE_3D, texId);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, pbo[1-curPBO]);
// With a PBO bound, the last argument is a byte offset into the buffer, not a pointer
glTexSubImage3D(GL_TEXTURE_3D, 0, 0, 0, 0, xdim, ydim, zdim, GL_LUMINANCE, GL_UNSIGNED_BYTE, 0);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, 0);
glBindTexture(GL_TEXTURE_3D, 0);

curPBO = 1-curPBO;        // swap roles for the next frame

// Call drawing code here
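
The ping-pong pattern can be tried without a GL context. The sketch below is a CPU-only stand-in, not the talk's code: two plain byte vectors play the role of the PBOs and a third plays the role of the texture. The type name PingPongStager and its members are hypothetical. It illustrates one consequence of the scheme: the texture always receives the brick staged on the previous frame, so the very first frame uploads an empty buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// CPU-only model of ping-pong staging. No OpenGL calls are made:
// staging[0] and staging[1] stand in for the two PBOs, texture for texId.
struct PingPongStager {
    std::vector<unsigned char> staging[2];
    std::vector<unsigned char> texture;
    unsigned cur = 0;

    explicit PingPongStager(std::size_t bytes) : texture(bytes, 0) {
        staging[0].assign(bytes, 0);
        staging[1].assign(bytes, 0);
    }

    // One frame: stage the new brick in staging[cur] (the memcpy into the
    // mapped PBO), copy the brick staged last frame from staging[1 - cur]
    // into the texture (the glTexSubImage3D step), then swap roles.
    void frame(const unsigned char* brick) {
        assert(brick != nullptr);
        std::memcpy(staging[cur].data(), brick, staging[cur].size());
        texture = staging[1 - cur];
        cur = 1 - cur;
    }
};
```

Calling frame() with brick A and then brick B leaves the texture holding A after the second call, which is the one-frame latency the ping-pong scheme trades for not blocking the CPU.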

CPU Async - Execution Timeline

[Timeline: the CPU alternates uploads between the two PBOs (Upload t0:PBO0, Upload t1:PBO1, Upload t2:PBO0); the bus copies are Copy t0:PBO0, Copy t1:PBO1, Copy t2:PBO0; the GPU draws are Draw t0, Draw t1, Draw t2.]

CPU Async: Analysis with GPUView (http://graphics.stanford.edu/~mdfisher/GPUView.html)

[GPUView trace. Labels: GPU, CPU, GLDriver, App.]

Results – Synchronous vs CPU Async

[Chart: PBO vs Synchronous uploads - Quadro 6000. X axis: texture size, 16^3 (4KB), 32^3 (32KB), 64^3 (256KB), 128^3 (2MB), 256^3 (16MB). Y axis: bandwidth (MB/s), 0 to 4500. Series: PBO (MB/s), TexSubImage (MB/s).]

- Transfers only

- Adding rendering will reduce bandwidth: the GPU can’t do both at once

- Ideally we want to sustain this bandwidth while rendering, which needs GPU overlap
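
The texture sizes on the chart axis follow from one byte per voxel, and a bandwidth figure in MB/s is bytes moved divided by elapsed time. The helpers below are a minimal sketch of that arithmetic; the function names are hypothetical, 1 MB is taken as 2^20 bytes, and any timings passed in are the caller's own measurements, not figures from the talk.

```cpp
#include <cassert>
#include <cstdint>

// Bytes in an n*n*n brick at one byte per voxel
// (the 8-bit luminance format used in the PBO example).
constexpr std::uint64_t brickBytes(std::uint64_t n) { return n * n * n; }

// Sustained bandwidth in MB/s (1 MB = 2^20 bytes) for `count` transfers
// of `bytes` each that together took `seconds`.
inline double bandwidthMBps(std::uint64_t bytes, std::uint64_t count, double seconds) {
    assert(seconds > 0.0);
    const double totalBytes = static_cast<double>(bytes) * static_cast<double>(count);
    return totalBytes / (1024.0 * 1024.0) / seconds;
}
```

For example, brickBytes(128) is 2 MB, so 1000 such uploads completing in one second would correspond to 2000 MB/s.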

Achieving Overlap - Copy Engines

Fermi and later GPUs have copy engines (CEs)

— GeForce, low-end Quadro: 1 CE

— Quadro 4000 and above: 2 CEs

Allows copy-to-host, compute, and copy-to-device to overlap

Graphics/OpenGL

— Using PBOs in multiple threads

— Handle synchronization

GPU Asynchronous Transfers

Downloads/uploads in separate thread

— Using OpenGL PBOs

ARB_sync used for context synchronization

[Timeline, Using PBO: CPU uploads (Upload t0:PBO0, Upload t1:PBO1, Upload t2:PBO0), bus copies (Copy t0:PBO0, Copy t1:PBO1, Copy t2:PBO0) and GPU draws (Draw t0, Draw t1, Draw t2).]

[Diagram, Using CE. Labels: Init, Main App Thread, Upload, Draw, Readback, Shared textures.]

Upload–Render : Application Layout

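[Diagram: bricks pData[nBricks] ([0], [1], [2], ...) are loaded from disk into main memory. The upload thread (context uploadGLRC) copies Data_next into a PBO with memcpy and transfers Data_cur from the other PBO (PBO0/PBO1, in OpenGL-controlled memory) into srcTex[numTextures] in graphics memory with glTexSubImage. The render thread (context mainGLRC) binds the textures with glBindTexture.]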

Multi-threaded Context Creation

Sharing textures between multiple contexts

— Don’t use wglShareLists

— Use WGL_ARB_create_context / GLX_ARB_create_context instead

— Set OpenGL debug on

static const int contextAttribs[] =
{
    WGL_CONTEXT_FLAGS_ARB, WGL_CONTEXT_DEBUG_BIT_ARB,
    0
};

mainGLRC = wglCreateContextAttribsARB(winDC, 0, contextAttribs);
wglMakeCurrent(winDC, mainGLRC);
glGenTextures(numTextures, srcTex);

// uploadGLRC now shares all its textures with mainGLRC
uploadGLRC = wglCreateContextAttribsARB(winDC, mainGLRC, contextAttribs);

// Create the upload thread
// Do the same for readback if using it

Synchronization using ARB_SYNC

OpenGL commands are asynchronous

— When glDrawXXX returns, it does not mean the command has completed

Sync object GLsync (ARB_sync) is used for multi-threaded apps that need sync

— E.g. rendering a texture waits for upload completion

A fence is inserted in the unsignaled state and changes to signaled once the commands before it have completed.

// Upload thread
glTexSubImage(texID,..)
GLsync fence = glFenceSync(..)      // fence starts unsignaled

// Render thread
glWaitSync(fence);                  // proceeds once the fence is signaled
glBindTexture(.., texID);

Upload-Render Synchronization

Need an additional CPU event to coordinate waiting for the GPU sync: one thread must have created the fence before the other thread can call glWaitSync on it.

GLsync startUpload[MAX_BUFFERS], endUpload[MAX_BUFFERS]; //GPU fence sync objects
HANDLE startUploadValid, endUploadValid; //CPU events to coordinate the wait for GPU sync

// Upload thread (working on srcTex[2])
WaitForSingleObject(startUploadValid)
glWaitSync(startUpload[2])
glBindTexture(srcTex[2])
glTexSubImage(..)
endUpload[2] = glFenceSync(…)
SetEvent(endUploadValid)

// Render thread (working on srcTex[0])
WaitForSingleObject(endUploadValid)
glWaitSync(endUpload[0])
glBindTexture(srcTex[0])
//Draw
startUpload[0] = glFenceSync(…)
SetEvent(startUploadValid);
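
The CPU side of this handshake can be modelled without OpenGL or Win32. The sketch below is an analogy under stated assumptions, not the talk's implementation: counting semaphores built from std::mutex and std::condition_variable stand in for the two CPU events, a plain integer write stands in for glTexSubImage plus glFenceSync, and a plain read stands in for glWaitSync, glBindTexture and the draw. The names Semaphore and runUploadRender are hypothetical.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Small counting semaphore; a portable stand-in for the CPU events.
class Semaphore {
public:
    explicit Semaphore(int initial) : count_(initial) { assert(initial >= 0); }
    void release() {
        { std::lock_guard<std::mutex> lock(m_); ++count_; }
        cv_.notify_one();
    }
    void acquire() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return count_ > 0; });
        --count_;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    int count_;
};

// Runs an upload thread and a render loop over a ring of numBuffers slots
// and returns the frame numbers in the order the render loop saw them.
std::vector<int> runUploadRender(int frames, int numBuffers) {
    assert(frames >= 0 && numBuffers > 0);
    std::vector<int> srcTex(numBuffers, -1);  // stands in for srcTex[]
    Semaphore startUploadValid(numBuffers);   // every slot is free at start
    Semaphore endUploadValid(0);              // nothing uploaded yet
    std::vector<int> drawn;

    std::thread upload([&] {
        for (int f = 0; f < frames; ++f) {
            startUploadValid.acquire();       // wait until the slot may be overwritten
            srcTex[f % numBuffers] = f;       // "glTexSubImage" + "glFenceSync"
            endUploadValid.release();         // tell the renderer the slot is ready
        }
    });

    for (int f = 0; f < frames; ++f) {        // render loop on the calling thread
        endUploadValid.acquire();             // wait for the upload of this slot
        drawn.push_back(srcTex[f % numBuffers]); // "glWaitSync" + bind + draw
        startUploadValid.release();           // slot may be reused by the uploader
    }
    upload.join();
    return drawn;
}
```

With numBuffers greater than one the uploader can run ahead of the renderer by up to that many slots, which is what lets upload and render overlap; with a single buffer the two threads run in strict lockstep.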

Analysis with GPUView

Upload and Render in separate threads

— Map to distinct hardware queues on GPU

— Executed concurrently

— Will serialize on pre-Fermi hardware

Adding Readback

[Diagram: the render thread (context mainGLRC) attaches resultTex[numTextures] ([0] to [3], in graphics memory) as the draw target with glFramebufferTexture and draws. The readback thread (context readbackGLRC) reads Frame_cur from a result texture into a PBO (PBO0/PBO1, in OpenGL-controlled memory) with glGetTexImage and copies Frame_prev from the other PBO into Images[nFrames] ([0], [1], [2], ...) in main memory with memcpy.]

Use glGetTexImage, not glReadPixels, between threads

Render-Readback Synchronization

GLsync startReadback[MAX_BUFFERS], endReadback[MAX_BUFFERS]; //GPU fence sync objects
HANDLE startReadbackValid, endReadbackValid; //CPU events to coordinate the wait for GPU sync

// Render thread (working on resultTex[2])
WaitForSingleObject(endReadbackValid)
glWaitSync(endReadback[2])
glFramebufferTexture(resultTex[2])
//Draw
startReadback[2] = glFenceSync(…)
SetEvent(startReadbackValid)

// Readback thread (working on resultTex[0])
WaitForSingleObject(startReadbackValid)
glWaitSync(startReadback[0])
glGetTexImage(resultTex[0])
//Read pixels to ping-pong PBO
endReadback[0] = glFenceSync(…)
SetEvent(endReadbackValid);

GeForce vs Quadro Readbacks

Readbacks on GeForce are 3x slower than Quadro

[Chart: Render-Download Bandwidth for Quadro vs GeForce. X axis: texture size (256K, 1MB, 8MB, 32MB). Y axis: PCI-e bandwidth (MB/s), 0 to 5000. Series: GeForce GTX 570, Quadro 6000.]

Upload-Render-Readback pipeline

// Capture thread
// Wait for signal to start upload
CPUWait(startUploadValid);
glWaitSync(startUpload[2]);
// Bind texture object
BindTexture(capTex[2]);
// Upload
glTexSubImage(texID…);
// Signal upload complete
endUpload[2] = glFenceSync(…);
CPUSignal(endUploadValid);

// Render thread
// Wait for download to complete
CPUWait(endDownloadValid);
glWaitSync(endDownload[3]);
// Wait for upload to complete
CPUWait(endUploadValid);
glWaitSync(endUpload[0]);
// Bind render target
glFramebufferTexture(playTex[3]);
// Bind video capture source texture
BindTexture(capTex[0]);
// Draw
// Signal next upload
startUpload[0] = glFenceSync(…);
CPUSignal(startUploadValid);
// Signal next download
startDownload[3] = glFenceSync(…);
CPUSignal(startDownloadValid);

// Playout thread
CPUWait(startDownloadValid);
glWaitSync(startDownload[2]);
// Readback
glGetTexImage(playTex[2]);
// Read pixels to PBO
// Signal download complete
endDownload[2] = glFenceSync(…);
CPUSignal(endDownloadValid);

True, S0328 – Best Practices in GPU-based Video Processing, GTC 2012 Proceedings

[Diagram: two rows of buffer slots [0] to [3], one for capTex and one for playTex.]

GPUView trace showing 3-way overlap

[Two GPUView traces, each showing Readback, Render and Upload rows over one frame time. Captions: "Balanced render, upload and readback times" and "Render time larger than upload and readback". Annotation: "Copy Engines are idle".]
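
The two cases in the trace can be reasoned about with a simple steady-state model: if the three stages run one after another the frame time is their sum, and if they overlap fully it is the longest stage. The sketch below is an idealized model under those assumptions (no synchronization overhead, no bus contention); the struct and function names are hypothetical and the numbers in the usage note are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>

// Per-frame stage durations in any consistent time unit.
struct StageTimes {
    double upload;
    double render;
    double readback;
};

// Frame time when upload, render and readback run back to back.
inline double serializedFrameTime(const StageTimes& t) {
    return t.upload + t.render + t.readback;
}

// Frame time when all three stages overlap: the longest stage dominates.
inline double overlappedFrameTime(const StageTimes& t) {
    return std::max({t.upload, t.render, t.readback});
}

// Ideal speedup of full overlap over serialized execution.
inline double overlapSpeedup(const StageTimes& t) {
    const double overlapped = overlappedFrameTime(t);
    assert(overlapped > 0.0);
    return serializedFrameTime(t) / overlapped;
}

// Fraction of each overlapped frame during which a transfer stage of the
// given duration leaves its copy engine idle.
inline double copyEngineIdleFraction(double transferTime, const StageTimes& t) {
    const double overlapped = overlappedFrameTime(t);
    assert(overlapped > 0.0 && transferTime >= 0.0 && transferTime <= overlapped);
    return 1.0 - transferTime / overlapped;
}
```

With balanced stages of 4, 4 and 4 units the model gives a 3x speedup and no idle time; with render at 8 units and transfers at 1 unit each it gives 1.25x, and each copy engine sits idle for 87.5% of the frame.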

Debugging Transfers

Some OGL calls may not overlap between the transfer and render threads

— E.g. non-transfer-related OGL calls in the transfer thread

— Driver generates debug message “Pixel transfer is synchronized with 3D rendering”

— Application uses ARB_debug_output to check the OGL debug log

— OpenGL 4.0 and above

GL_ARB_debug_output - http://www.opengl.org/registry/specs/ARB/debug_output.txt
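
One way to act on that message is to test each debug string inside the callback the application registers through ARB_debug_output (glDebugMessageCallbackARB). The helper below is a sketch of only the string check, so it compiles without GL headers. The function name is hypothetical, and matching on a substring assumes the driver text contains the phrase exactly as quoted on the slide, which may vary between driver versions.

```cpp
#include <cassert>
#include <cstring>

// Returns true if a driver debug message contains the serialization
// warning quoted on the slide. Intended to be called from the
// application's debug-output callback with the message text it receives.
inline bool isTransferSerializedWarning(const char* message) {
    assert(message != nullptr);
    return std::strstr(message,
                       "Pixel transfer is synchronized with 3D rendering") != nullptr;
}
```

A callback could count the matches per frame and report which GL call in the transfer thread preceded them, which narrows down the call that forces the transfer and render queues to serialize.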

Copy Engine Results – Best Case

[Chart: Performance Scaling from CPU Asynchronous Transfers, Quadro 6000. X axis: texture size (256KB, 1MB, 8MB, 32MB). Y axis: scaling factor, 0 to 2, with reference lines labelled No Scaling and Perfect Scaling. Series: Upload-Render Scaling, Render-Download Scaling. Data labels on the chart: 4.2 GB/s, 3.2 GB/s, 1.4 GB/s, 900 MB/s.]

Conclusion

Presented different transfer methods

Keep the transfer method simple

— Look at your application transfer needs and render times

— Tradeoff in scaling vs application complexity

Future

— Debugging multi-threaded transfers made much easier with Nsight Visual Studio Edition (http://developer.nvidia.com/nvidia-nsight-visual-studio-edition)

References

Venkataraman, Fermi Asynchronous Texture Transfers, OpenGL Insights, 2012

— Source code (around SIGGRAPH 2012) – https://github.com/organizations/OpenGLInsights

Related GTC Talks

— S0328, Thomas True, Best Practices in GPU-based video processing

— S0049, Alina Alt & Tom True, Using the GPU Direct for Video API

— S0353, S Venkataraman, Programming multi-gpus for scalable rendering

