Siggraph Asia 2012 - Wil Braithwaite – NVIDIA Applied Engineering
Using CUDA contexts in Maya
• CUDA context
• We must create one, but where?
• Choices:
1. Share existing context.
2. Save & Restore any previous CUDA context (using driver API).
3. Create our own “persistent” thread with dedicated CUDA context.
Sharing the CUDA context – Code
// Initialize plug-in...
void Plugin::_initialize()
{
cudaSetDevice(_cudaDeviceIndex);
}
// On every update of the plug-in...
void Plugin::_update()
{
// do update here...
}
// Clean-up plug-in...
void Plugin::_destroy()
{
// Oh dear! This could be bad...
cudaDeviceReset();
}
Sharing the CUDA context
• Since CUDA 4.0, Runtime-API uses Primary CUDA Contexts.
• Primary contexts created once per device, per process.
• Other plug-ins (or “future” Maya) might change context configuration.
• cudaDeviceSynchronize()/cudaDeviceReset() affect everyone!
• cudaSetDevice(i+1) will switch the primary context for all plug-ins.
• cudaDeviceSet*(...) changes configuration for every plug-in.
• If assert() fails in a kernel then the context is tainted.
• Vigilant error-checking required!
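One common way to stay vigilant is to wrap every API call in a checking macro. A minimal sketch of the pattern, using a stand-in status type so it compiles without the CUDA toolkit (a real plug-in would check cudaError_t and report with cudaGetErrorString):

```cpp
#include <cstdio>

// Stand-in for cudaError_t / cudaGetErrorString so this sketch is
// self-contained; a real plug-in would use the CUDA runtime types.
typedef int statusCode;
static const statusCode kSuccess = 0;
static const char* statusString(statusCode e) { return e ? "error" : "success"; }

// Check every call; on failure, report where it happened and bail out of
// the current function instead of leaving a tainted context undetected.
#define CHECK(call)                                                  \
    do {                                                             \
        statusCode _e = (call);                                      \
        if (_e != kSuccess) {                                        \
            fprintf(stderr, "%s failed at %s:%d: %s\n",              \
                    #call, __FILE__, __LINE__, statusString(_e));    \
            return false;                                            \
        }                                                            \
    } while (0)

static statusCode fakeApiCall(bool ok) { return ok ? kSuccess : 1; }

bool doWork(bool ok)
{
    CHECK(fakeApiCall(ok)); // wrap every call, including post-launch error polls
    return true;
}
```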
Save & Restore CUDA contexts
• Use the CUDA Driver-API.
• Context is sandboxed, but...
• Every push/pop requires a context switch.
• Each context uses host and device memory, and spawns host threads.
• If a device's compute-mode is “exclusive” then only one context can be active on that device at a time.
• Using only Driver-API means no Runtime-API DLL when deploying.
Save & Restore CUDA contexts - Code
// Initialize plug-in...
void Plugin::_initialize()
{
    CUdevice dev;
    cuDeviceGet(&dev, _cudaDeviceIndex);
    cuGLCtxCreate(&_cuCxt, 0, dev); // _cuCxt is a member, so _update/_destroy can reuse it.
    cuCtxPopCurrent(0); // pop our new context, restoring whatever was current before.
}
// On every update of the plug-in...
void Plugin::_update()
{
    cuCtxPushCurrent(_cuCxt);
    // do update here...
    cuCtxPopCurrent(0);
}
// Clean-up plug-in...
void Plugin::_destroy()
{
    cuCtxPushCurrent(_cuCxt);
    cuCtxDestroy(_cuCxt);
}
Save & Restore CUDA contexts
• Using the CUDA Driver-API:
• cuFuncSetCacheConfig(...)
• cuFuncSetSharedMemConfig(...)
• ... will only affect this context.
• Stay away from dangerous cudaDevice*() calls.
• Since CUDA 4.1, you can use Runtime-API with a context created by the Driver-API.
Persistent CUDA context thread
• Create a thread for each CUDA device.
• Jobs are dispatched from the host to a thread-safe queue.
• All my plug-ins share my “Multi-GPU Job Dispatcher”.
• Use different streams for each plug-in and use cudaStreamSynchronize().
• If we want GL interoperability, Maya’s GL context must be passed into our persistent thread via OpenGL context sharing.
• This may reduce performance on GeForce...
• ... but Quadro works much better with multithreaded workstation apps.
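The dispatcher's thread-safe job queue can be sketched in a few lines; JobQueue is our illustrative name here, not the actual dispatcher API:

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>

// Minimal thread-safe job queue: the Maya thread pushes jobs, the
// persistent CUDA thread pops and runs them in order.
class JobQueue {
public:
    void push(std::function<void()> job) {
        { std::lock_guard<std::mutex> lk(_m); _q.push(std::move(job)); }
        _cv.notify_one();
    }
    // Blocks until a job is available; returns false once shut down and drained.
    bool pop(std::function<void()>& job) {
        std::unique_lock<std::mutex> lk(_m);
        _cv.wait(lk, [this]{ return !_q.empty() || _done; });
        if (_q.empty()) return false;
        job = std::move(_q.front()); _q.pop();
        return true;
    }
    void shutdown() {
        { std::lock_guard<std::mutex> lk(_m); _done = true; }
        _cv.notify_all();
    }
private:
    std::queue<std::function<void()>> _q;
    std::mutex _m;
    std::condition_variable _cv;
    bool _done = false;
};
```

The persistent thread's loop is then simply `std::function<void()> j; while (q.pop(j)) j();`, exiting once shutdown() has been called and the queue drains.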
Persistent CUDA context thread – Code (win7)
// Clean-up from within Maya thread.
void Plugin::_destroy()
{
launchJob(jobExit);
killThread(_hdl);
}
// Initialize from within Maya thread.
void Plugin::_initialize()
{
HDC display = getDisplay();
HGLRC mayaGlCxt = getGlContext();
HGLRC glCxt = wglCreateContext(display);
wglShareLists(mayaGlCxt, glCxt);
_hdl = runThread(_threadFn, display,
glCxt,
_cudaDeviceIndex);
launchJob(jobInit);
wglMakeCurrent(display, mayaGlCxt);
}
// Update from within Maya thread.
void Plugin::_update()
{
launchJob(jobUpdate);
}
Persistent CUDA context thread – Code (win7)
void jobInit(Data* )
{
// initialize CUDA things...
}
void jobUpdate(Data*)
{
// update CUDA things...
}
void jobExit(Data*)
{
// clean-up CUDA things...
}
// Our persistent thread function.
void _threadFn(HDC display, HGLRC glCxt, int i)
{
wglMakeCurrent(display, glCxt);
CUdevice dev;
cuDeviceGet(&dev, i);
CUcontext cuCxt;
cuGLCtxCreate(&cuCxt, 0, dev);
while (_isRunning) {
// pop & run jobs from thread-safe queue.
}
wglMakeCurrent(0, 0);
wglDeleteContext(glCxt);
cuCtxDestroy(cuCxt);
}
Our Particle pipeline
Particle Emission
• Upload emitted data using nParticles as offset.
• nParticles += nEmitted
• We use fast host-to-device transfer for emission data.
• i.e. page-locked & write-combined host memory.
• For large emissions we can double-buffer and use Async API.
[Diagram: emitted attributes Eattr[0..nEmitted) are appended into the particle buffer Pattr at offset nParticles.]
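A host-side model of the append-at-offset scheme (the real code performs this as a host-to-device transfer into Pattr + nParticles):

```cpp
#include <vector>

struct Particles {
    std::vector<float> attr; // Pattr: one float per particle for simplicity
    int nParticles = 0;
};

// Append nEmitted new values at offset nParticles, as the device-side
// upload does, then advance the particle count.
void emit(Particles& p, const std::vector<float>& eattr)
{
    int nEmitted = (int)eattr.size();
    p.attr.resize(p.nParticles + nEmitted);
    for (int i = 0; i < nEmitted; ++i)
        p.attr[p.nParticles + i] = eattr[i]; // copy Eattr into Pattr + nParticles
    p.nParticles += nEmitted;
}
```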
• Maya plug-in callbacks:
• Plugin::compute(...) (called during data computation)
• Upload & Simulate the particles using CUDA.
• Blit simulated data to VBO.
• Plugin::draw(...) (called during Maya’s scene-graph traversal)
• Render particle VBO.
The App Pipeline - (Autodesk Maya)
[Timeline: each frame, Maya calls every plug-in's compute (A.compute, B.compute, C.compute) and then OpenGL draw (A.draw, B.draw, C.draw); the compute order and draw order can differ from frame to frame.]
Particle Initialization
• Simulation & Rendering initialization
struct Data
{
int nParticles; // number of particles
GLuint vboPositions; // OpenGL VBO of particle positions
cudaGraphicsResource* vboRes; // CUDA registered VBO
cudaStream_t stream; // CUDA stream
float4* d_positions; // CUDA buffer of particle positions
};
void jobInit(Data* d)
{
    cudaStreamCreate(&d->stream);
    cudaMalloc((void**)&d->d_positions, d->nParticles * sizeof(float4));
    glGenBuffers(1, &d->vboPositions);
    glBindBuffer(GL_ARRAY_BUFFER, d->vboPositions);
    glBufferData(GL_ARRAY_BUFFER, d->nParticles * sizeof(float4), 0, GL_DYNAMIC_DRAW);
    glBindBuffer(GL_ARRAY_BUFFER, 0);
    cudaGraphicsGLRegisterBuffer(&d->vboRes, d->vboPositions, cudaGraphicsRegisterFlagsNone);
    cudaGraphicsResourceSetMapFlags(d->vboRes, cudaGraphicsMapFlagsWriteDiscard);
}
Particle Simulation
• CUDA kernel operates on CUDA buffers.
void jobUpdate(Data* d)
{
// ...
somethingCrazyKernel<<<dimGrid, dimBlock, 0, d->stream>>>(d->nParticles, d->d_positions);
}
Particle Render Blit
• Blit CUDA simulation data into the GL render buffers.
void jobBlit(Data* d)
{
void* vboPtr = 0;
size_t nBytes = 0;
cudaGraphicsMapResources(1, &d->vboRes, d->stream);
cudaGraphicsResourceGetMappedPointer(&vboPtr, &nBytes, d->vboRes);
cudaMemcpyAsync(vboPtr, d->d_positions, nBytes, cudaMemcpyDeviceToDevice,
d->stream);
cudaGraphicsUnmapResources(1, &d->vboRes, d->stream);
cudaStreamSynchronize(d->stream); // now we can be sure the data is copied.
}
Particle Render
void Plugin::draw()
{
waitForAllJobsToFinish(); // this stalls our main thread.
Data* d = &_data;
glBindBuffer(GL_ARRAY_BUFFER, d->vboPositions);
glVertexPointer(4, GL_FLOAT, 0, 0);
glEnableClientState(GL_VERTEX_ARRAY);
glDrawArrays(GL_POINTS, 0, d->nParticles);
glBindBuffer(GL_ARRAY_BUFFER, 0);
glDisableClientState(GL_VERTEX_ARRAY);
}
Example 1: Synchronous CUDA thread
• Sync with CUDA thread before we draw the buffers.
• use a semaphore to signal when the stream is synchronized...
• ... or synchronize the CUDA thread so all the jobs are complete.
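The semaphore option can be sketched as a small auto-reset signal (SyncSignal is our illustrative name): the CUDA thread signals after cudaStreamSynchronize() returns, and draw() waits on it before touching the VBO.

```cpp
#include <condition_variable>
#include <mutex>

// Auto-reset binary semaphore: the CUDA thread calls signal() once the
// stream is synchronized; draw() calls wait() before binding the VBO.
class SyncSignal {
public:
    void signal() {
        { std::lock_guard<std::mutex> lk(_m); _ready = true; }
        _cv.notify_one();
    }
    void wait() {
        std::unique_lock<std::mutex> lk(_m);
        _cv.wait(lk, [this]{ return _ready; });
        _ready = false; // auto-reset for the next frame
    }
private:
    std::mutex _m;
    std::condition_variable _cv;
    bool _ready = false;
};
```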
[Timeline, frames 1-2: Maya thread runs compute1, compute2; CUDA thread runs sim1, blit1, sim2; OpenGL runs render1, draw1. draw waits for blit to finish.]
Example 1: Synchronous CUDA thread
• Sync with CUDA thread before we draw the buffers.
• What about a slow GL render?
[Timeline, frames 2-3: Maya thread runs compute2, compute3; CUDA thread runs sim2, blit2, sim3; OpenGL runs render1, render2, draw2. Blit may affect current buffers.]
Example 1: Synchronous CUDA thread
• Sync with CUDA thread before we draw the buffers.
• CUDA GL-interop will wait until GL is complete.
[Timeline, frames 2-3: Maya thread runs compute2, compute3; CUDA thread runs sim2, sim3, blit2; OpenGL runs render1, render2, draw2. The blit's "unmap" call waits for the GL context to finish.]
Particle Render Blit – Explicit GL sync
• Wait for GL to finish rendering from the vbo.
void jobBlit(Data* d)
{
d->blitMutex.claim();
glClientWaitSync(d->glVboSync, GL_SYNC_FLUSH_COMMANDS_BIT, GLuint64(-1)); // wait "forever"
void* vboPtr = 0;
size_t nBytes = 0;
cudaGraphicsMapResources(1, &d->vboRes, d->stream);
cudaGraphicsResourceGetMappedPointer(&vboPtr, &nBytes, d->vboRes);
cudaMemcpyAsync(vboPtr, d->d_positions, nBytes, cudaMemcpyDeviceToDevice, d->stream);
cudaGraphicsUnmapResources(1, &d->vboRes, d->stream);
cudaStreamSynchronize(d->stream);
d->blitMutex.release();
}
Particle Render – Explicit GL sync
• Explicitly wait for GL to finish rendering from the vbo.
void Plugin::draw()
{
Data* d = &_data;
d->blitMutex.claim();
glBindBuffer(GL_ARRAY_BUFFER, d->vboPositions);
glVertexPointer(4, GL_FLOAT, 0, 0);
glEnableClientState(GL_VERTEX_ARRAY);
glDrawArrays(GL_POINTS, 0, d->nParticles);
glBindBuffer(GL_ARRAY_BUFFER, 0);
glDisableClientState(GL_VERTEX_ARRAY);
d->glVboSync = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
d->blitMutex.release();
}
Example 1: Synchronous CUDA thread
• Advantages:
• Simple to implement.
• sand-boxed context protects our plug-in / host app.
• Disadvantages:
• We are relying on Maya calling our update at the start, and our render at the end!
• Allow latency and overlap simulation with host app.
• Ping-pong into double-buffered VBOs.
• (We still require sync objects if render[0] overlaps with blit[2]!)
• or use triple-buffering!
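The ping-pong indexing for double (or triple) buffering can be sketched as a small ring of buffer indices; BufferRing is our illustrative name:

```cpp
// N-buffered ring of VBO indices: the CUDA thread blits into write(),
// draw() renders from read(); advance() flips after each blit.
struct BufferRing {
    int n;           // 2 = double-buffered, 3 = triple-buffered
    int current = 0;
    int write() const { return current; }
    int read()  const { return (current + n - 1) % n; } // previous frame's buffer
    void advance()    { current = (current + 1) % n; }
};
```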
Example 2: Double-buffered CUDA thread
[Timeline, frames 2-3: Maya thread runs compute2, compute3; CUDA thread runs sim2, blit2, sim3; OpenGL runs render0, render1, draw1. draw1 draws the previous frame; the blit waits for the previous frame to complete.]
Particle Render Blit - double-buffered
void jobBlit(Data* d)
{
d->blitMutex.claim();
glClientWaitSync(d->glVboSync[d->current], GL_SYNC_FLUSH_COMMANDS_BIT, GLuint64(-1)); // wait "forever"
void* vboPtr = 0;
size_t nBytes = 0;
cudaGraphicsMapResources(1, &d->vboRes[d->current], d->stream);
cudaGraphicsResourceGetMappedPointer(&vboPtr, &nBytes, d->vboRes[d->current]);
cudaMemcpyAsync(vboPtr, d->d_positions, nBytes, cudaMemcpyDeviceToDevice, d->stream);
cudaGraphicsUnmapResources(1, &d->vboRes[d->current], d->stream);
d->current = (d->current + 1) % 2; // ping-pong
d->blitMutex.release();
}
Particle Render - double-buffered
void Plugin::draw()
{
Data* d = &_data;
d->blitMutex.claim();
glBindBuffer(GL_ARRAY_BUFFER, d->vboPositions[d->current]);
glVertexPointer(4, GL_FLOAT, 0, 0);
glEnableClientState(GL_VERTEX_ARRAY);
glDrawArrays(GL_POINTS, 0, d->nParticles);
glBindBuffer(GL_ARRAY_BUFFER, 0);
glDisableClientState(GL_VERTEX_ARRAY);
d->glVboSync[d->current] = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
d->blitMutex.release();
}
Particle Render Blit – Aux-CUDA method
• Create auxiliary CUDA-context in the main-thread on the GL-device.
• Register VBO with the auxiliary-context, and blit from pinned host mem.
void jobBlit(Data* d)
{
cudaMemcpyAsync(d->h_positions[d->current], d->d_positions, d->nPositions*sizeof(float4),
cudaMemcpyDeviceToHost, d->stream);
...
}
void Plugin::_blit(Data* d)
{
...
void* vboPtr = 0;
size_t nBytes = 0;
cudaGraphicsMapResources(1, &d->vboRes);
cudaGraphicsResourceGetMappedPointer(&vboPtr, &nBytes, d->vboRes);
cudaMemcpy(vboPtr, d->h_positions[(d->current+1)%2], nBytes, cudaMemcpyHostToDevice);
cudaGraphicsUnmapResources(1, &d->vboRes);
}
Asynchronous CUDA streams
• Batching the data means…
• We hide the cost of data transfer between device and host.
• Extension to Multi-GPU is now trivial.
• NB. Not all algorithms can batch data.
• Each batch’s stream requires resource allocation.
• Be sure to do this up-front, (before OpenGL gets it all!)
• Number of batches can be chosen based on available resources.
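Splitting the particle range into per-stream batches can be sketched as follows (makeBatches is an illustrative helper, not the actual dispatcher API):

```cpp
#include <vector>

struct Batch { int offset; int count; };

// Split n particles into nBatches nearly equal batches, one stream each;
// the remainder is spread across the first batches.
std::vector<Batch> makeBatches(int n, int nBatches)
{
    std::vector<Batch> batches;
    int base = n / nBatches, rem = n % nBatches, offset = 0;
    for (int i = 0; i < nBatches; ++i) {
        int count = base + (i < rem ? 1 : 0);
        batches.push_back({offset, count});
        offset += count;
    }
    return batches;
}
```

Each batch then gets its own stream, so its host-device transfer overlaps the previous batch's kernel; the same offsets extend naturally to multiple GPUs.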
NVIDIA Maximus
• Quadro K5000 & Tesla K20
• Uses GL-interop for efficient data movement to the Quadro GPU
• OpenGL Insights, Patrick Cozzi, Christophe Riccio, 2012. ISBN 1439893764.
• www.openglinsights.com
• DCC apps like Maya already use many GPU resources.
• Share the data between two cards allowing for larger simulations.
• Multi-GPU is scalability for the future.
• Batch simulations on your workstation.
Demo