© Copyright Khronos Group 2016 - Page 142
Swapchains Unchained! (What you need to know about Vulkan WSI)
Alon Or-bach, Chair, Vulkan Window System Integration Sub-Group – March 2016
Intro to Vulkan Window System Integration
• Explicit control for acquisition and presentation of images
- Designed to fit the Vulkan API and today’s compositing window systems
• Not all extensions are supported by every platform
- You MUST check and enable the extensions your app/engine uses!!!
• Today’s presentation should help you get presentation working
- Learn how to present through a swapchain
- Overview of Vulkan objects used by the WSI extensions
WSI Jargon Buster
• Platform
- Our terminology for an OS / window system, e.g. Android, Windows, Wayland, X11 via XCB
• Presentation Engine
- The platform’s compositor or display engine
• Application
- Your app or game engine
How many WSI extensions are there?
• Two cross-platform instance extensions
- VK_KHR_surface
- VK_KHR_display
• Six (platform) instance extensions
- VK_KHR_android_surface
- VK_KHR_mir_surface
- VK_KHR_wayland_surface
- VK_KHR_win32_surface
- VK_KHR_xcb_surface
- VK_KHR_xlib_surface
• Two cross-platform device extensions
- VK_KHR_swapchain
- VK_KHR_display_swapchain
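The "check and enable" advice can be sketched as a plain name lookup. This is a minimal illustration, not the loader's API: the helper `extensions_supported` is hypothetical, and in a real application the `available` names would be copied out of the results of vkEnumerateInstanceExtensionProperties (or vkEnumerateDeviceExtensionProperties for device extensions).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if every name in `required` appears in
   `available`, else 0. If `missing` is non-NULL it receives the first
   required name that was not found. */
static int extensions_supported(const char *const *available, size_t available_count,
                                const char *const *required, size_t required_count,
                                const char **missing)
{
    for (size_t r = 0; r < required_count; ++r) {
        int found = 0;
        for (size_t a = 0; a < available_count; ++a) {
            if (strcmp(required[r], available[a]) == 0) {
                found = 1;
                break;
            }
        }
        if (!found) {
            if (missing)
                *missing = required[r];
            return 0;
        }
    }
    return 1;
}
```

Only names that pass this check should be listed when creating the instance or device; reporting the missing name makes the failure easier to diagnose on an unfamiliar platform.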
Vulkan Surfaces
• VkSurfaceKHR
- Vulkan’s way to encapsulate a native window / surface
• Platform-independent surface queries
- Find out crucial information about your surface’s properties, e.g., if presentation is supported by a particular queue on a particular device
- Some platforms provide additional queries
• An implementation may support multiple platforms
- e.g., both xlib and xcb
[Diagram: Physical Device A (Queue Families 0, 1, 2), Physical Device B (Queue Families 0, 1) and Physical Device C (Queue Families 0, 1), shown against Platform X, Platform Y and a surface from Platform X]
Vulkan Swapchains: VK_KHR_swapchain
• Array of presentable images associated with a surface
- Application requests a minimum number of presentable images
- Implementation creates at least that number
- Implementation may have a limit
• Upfront allocation of presentable images
- No allocation hitching at crucial moment
- Pre-record fixed content command buffers
• Present mode determines behavior
- FIFO support mandatory
- Platforms can offer mailbox, immediate, FIFO relaxed
const VkSwapchainCreateInfoKHR createInfo =
{
    VK_STRUCTURE_TYPE_SWAPCHAIN_CREATE_INFO_KHR, // sType
    NULL,                                        // pNext
    0,                                           // flags
    mySurface,                                   // surface
    desiredNumberOfPresentableImages,            // minImageCount
    surfaceFormat,                               // imageFormat
    surfaceColorSpace,                           // imageColorSpace
    myExtent,                                    // imageExtent
    1,                                           // imageArrayLayers
    VK_IMAGE_USAGE_COLOR_ATTACHMENT_BIT,         // imageUsage
    VK_SHARING_MODE_EXCLUSIVE,                   // imageSharingMode
    0,                                           // queueFamilyIndexCount
    NULL,                                        // pQueueFamilyIndices
    surfaceProperties.currentTransform,          // preTransform
    VK_COMPOSITE_ALPHA_INHERIT_BIT_KHR,          // compositeAlpha
    swapchainPresentMode,                        // presentMode
    VK_TRUE,                                     // clipped
    VK_NULL_HANDLE                               // oldSwapchain
};
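Two of the values fed into that structure, the image count and the present mode, depend on what the surface reports. The following is a rough sketch of how they might be chosen. The `PresentMode` enum and both helpers are hypothetical stand-ins so the example compiles without the Vulkan headers; a real application would use VkPresentModeKHR with vkGetPhysicalDeviceSurfacePresentModesKHR and the limits in VkSurfaceCapabilitiesKHR.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the VkPresentModeKHR values. */
typedef enum {
    MODE_IMMEDIATE,
    MODE_MAILBOX,
    MODE_FIFO,
    MODE_FIFO_RELAXED
} PresentMode;

/* Return `wanted` if the platform offers it, otherwise fall back to FIFO,
   the one mode whose support is mandatory. */
static PresentMode choose_present_mode(const PresentMode *available,
                                       size_t count, PresentMode wanted)
{
    for (size_t i = 0; i < count; ++i)
        if (available[i] == wanted)
            return wanted;
    return MODE_FIFO;
}

/* Clamp the requested image count to the surface limits. A max_count of 0
   is treated as "no upper limit", as for
   VkSurfaceCapabilitiesKHR::maxImageCount. */
static uint32_t clamp_image_count(uint32_t desired,
                                  uint32_t min_count, uint32_t max_count)
{
    if (desired < min_count)
        desired = min_count;
    if (max_count != 0 && desired > max_count)
        desired = max_count;
    return desired;
}
```

The implementation may still create more images than requested, so the count should be read back from the swapchain rather than assumed.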
Vulkan Swapchains: They’re good!
• Application knows which image within a swapchain it is presenting
- Content of image preserved between presents
• Application is responsible for explicitly recreating swapchains - no surprises
- Platform informs app if current swapchain is:
  - Suboptimal: e.g. after window resize, swapchain still usable for present via image scaling
  - Surface Lost: swapchain no longer usable for present
- Application is responsible for creating a new swapchain
Vulkan Swapchains: They’re jolly good!
• Presenting and acquiring are separate operations
- No need to submit a new image to acquire another one, unless presentation engine cannot release it
• Application must only modify presentable images it has acquired
• Presentation engine must only display presentable images that have been presented!
Stalls in frame loop are very bad!
Steps to set up your presentable images
1 – Create a native window/surface (platform-specific APIs)
2 – Create a Vulkan surface (VK_KHR_<platform>_surface)
3 – Query information about your surface (VK_KHR_surface)
4 – Create a Vulkan swapchain (VK_KHR_swapchain)
5 – Get your presentable images (VK_KHR_swapchain)
Vulkan Frame Loop – as easy as 1-2-3! (VK_KHR_swapchain)
0 – Create your swapchain
1 – Acquire the next presentable image
2 – Submit command buffer(s) for that image
3 – Present the image
Legend: Setup; Steady-state; Response to suboptimal / surface_lost
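The loop and its response path can be sketched as a small simulation. Everything here is illustrative: `WsiStatus` stands in for the results a real application would get back from vkAcquireNextImageKHR and vkQueuePresentKHR (VK_SUCCESS, VK_SUBOPTIMAL_KHR, VK_ERROR_SURFACE_LOST_KHR), and `run_frames` replays a scripted list of results instead of talking to a presentation engine.

```c
/* Hypothetical stand-ins for the WSI result codes named in the slides. */
typedef enum { WSI_OK, WSI_SUBOPTIMAL, WSI_SURFACE_LOST } WsiStatus;

typedef enum {
    FRAME_CONTINUE,       /* steady state */
    FRAME_RECREATE_SOON,  /* still presentable, but recreate when convenient */
    FRAME_RECREATE_NOW    /* swapchain can no longer be used for present */
} FrameAction;

static FrameAction action_for(WsiStatus status)
{
    switch (status) {
    case WSI_OK:           return FRAME_CONTINUE;
    case WSI_SUBOPTIMAL:   return FRAME_RECREATE_SOON;
    case WSI_SURFACE_LOST: return FRAME_RECREATE_NOW;
    }
    return FRAME_RECREATE_NOW;
}

/* Replays `frames` scripted present results and returns how many swapchains
   were created in total, counting the initial one from step 0. */
static unsigned run_frames(const WsiStatus *results, unsigned frames)
{
    unsigned swapchains_created = 1;  /* step 0: create your swapchain */
    int recreate = 0;
    for (unsigned i = 0; i < frames; ++i) {
        if (recreate) {               /* response to suboptimal / surface lost */
            swapchains_created++;
            recreate = 0;
        }
        /* steps 1 to 3 (acquire, submit, present) would run here */
        if (action_for(results[i]) != FRAME_CONTINUE)
            recreate = 1;
    }
    return swapchains_created;
}
```

This sketch recreates at the start of the next frame in both cases; an engine might instead defer the suboptimal case until, say, a resize has settled.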
Vulkan Displays: VK_KHR_display
• Vulkan’s way to discover display devices (screens, panels) outside a window system
- Reminder: Not supported on all platforms
• Defines VkDisplayKHR and VkDisplayModeKHR objects
- Represent the display devices and the modes they support connected to a VkPhysicalDevice
- Determine if a display supports multiple planes that are blended together
• Enables creation of a VkSurfaceKHR to represent a display plane
[Diagram: a Physical Device with Display 0 (Planes 0, 1, 2; Display Modes 0, 1) and Display 1 (Display Modes 0, 1), and a Surface]
VK_KHR_display_swapchain
• Extends the information provided at vkQueuePresentKHR
- What region to present from the swapchain image
- What region to present to on the display
- Whether the display should persist the image
• Adds ability to create a shared swapchain
- Swapchain that takes multiple VkSwapchainCreateInfoKHR structs
- Allows multiple displays to be presented to simultaneously
- No guarantee that presents are atomic ...presently!
Any questions?
[email protected]
@alonorbach (disclaimers apply!)
LunarG® SDK for Vulkan®
Karen Ghavam, CEO
Karl Schultz, Principal Engineer
Jon Ashburn, Principal Engineer
Enter the Raffle for your prize!
Congratulations!
You are the recipient of the Vulkan Programming Guide, courtesy of LunarG!
Is your OpenGL Programming Guide getting lonely? Well, it will soon have a companion. In August 2016, when the Vulkan Programming Guide becomes available, LunarG will ship it directly to you!
In the meantime, visit LunarXchange (Vulkan.lunarg.com) for the LunarG SDK for Vulkan, and accept this book bag, anxiously awaiting its Vulkan Programming Guide.
LunarG SDK
• Loader Binary
• Validation Layer Libraries
• Vulkan trace and replay tools
- vktrace
- vkreplay
• SPIR-V Tools
- GLSL Validator
- SPIR-V Disassembler and Assembler
- SPIR-V Remapper
• RenderDoc*
• Sample Programs
*For a detailed demonstration of RenderDoc don’t miss: Practical Development for Vulkan (presented by Valve Software). Thursday, 12:45 – 1:45, Room 3009, West Hall
Download the LunarG SDK for Vulkan at LunarXchange: vulkan.lunarg.com
Version 1.0.5.0 now available!
The Power of a Layered Ecosystem
[Diagram: Development path: Vulkan application, Loader, Validation layer, Debug layer, Other layers, Installable Client Driver. Production path: Vulkan application, Loader, Installable Client Driver]
Layers: Fully Integrated
Programmatic Approach
Application supplies list of layers
Layers report “results” as messages
Application handles messages in callback
[Diagram: Vulkan application, Debug Report Callback, Loader, Layer, Installable Client Driver]
Layers: Externally Activated
“Ad-hoc” Approach
User sets environment variables: VK_INSTANCE_LAYERS=“layer name”
Default Debug Report writes to output stream
Layers report “results” as messages
[Diagram: Vulkan application, Debug Report Callback, Loader, Layer, Layer Settings File, Installable Client Driver]
Demo We’ll Be Using
“Hologram” by Chia-I Wu (olv)
• Well-written Vulkan demo
• Simulation of 5000 moving objects
• Demonstrates multi-threaded command buffer recording
• Can be found in: https://github.com/LunarG/VulkanSamples
Demo!
Watch the demo for a minute or so
A Few Hologram Internals – Object Data
5000 ShaderParamBlocks
struct ShaderParamBlock {
    float light_pos[4];
    float light_color[4];
    float model[4 * 4];
    float view_projection[4 * 4];
};
One ShaderParamBlock per Object
For Each Frame and For Each Object:
• Modify ShaderParamBlock
• BindDescriptorSet
Two Frames of Object Data
Modify Demo
Let’s add code to modulate the transparency of each object, independently, as a function of time.
To do this, we need to:
1. Add a parameter to the ShaderParamBlock: “per-object” alpha
2. Modify the shader program to apply the per-object alpha
3. Modify the Simulation to change the transparency of each object over time
Start with Step 1!
struct ShaderParamBlock {
    float light_pos[4];
    float light_color[4];
    float model[4 * 4];
    float view_projection[4 * 4];
    float alpha;
};
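The slides do not say what goes wrong when the modified demo is re-run, but one thing Step 1 certainly changes is the size of each block, and with it any per-object offset computed from that size. The sketch below only illustrates that arithmetic. The struct names with `Old`/`New` suffixes and the helper `align_up` are hypothetical, and the 256-byte alignment in the usage note is an assumed value; a real application would read VkPhysicalDeviceLimits::minUniformBufferOffsetAlignment.

```c
#include <stddef.h>

/* The block before and after Step 1 (names suffixed for comparison only). */
struct ShaderParamBlockOld {
    float light_pos[4];
    float light_color[4];
    float model[4 * 4];
    float view_projection[4 * 4];
};

struct ShaderParamBlockNew {
    float light_pos[4];
    float light_color[4];
    float model[4 * 4];
    float view_projection[4 * 4];
    float alpha;
};

/* Round `size` up to the next multiple of `alignment` (alignment > 0). */
static size_t align_up(size_t size, size_t alignment)
{
    return (size + alignment - 1) / alignment * alignment;
}
```

For example, `align_up(sizeof(struct ShaderParamBlockNew), 256)` gives the per-object stride under an assumed 256-byte offset alignment; a stride taken directly from `sizeof` would no longer be a multiple of 16 bytes once `alpha` is added.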
Let’s See What Happens
Change the code and re-run demo
More Information
• Layer Documentation
- LunarXchange website (https://vulkan.lunarg.com/app/docs/latest/layers)
- More details on validation and other layers
• Screenshot Layer
- Good for showing someone else what is wrong
- Also can be used for before/after image-compare testing
• Vktrace/Vkreplay
- Useful for sending someone a trace file in lieu of setting up a reproduction scenario
A next gen Engine design on a next gen API
Dan Baker
Graphics Architect, Oxide Games
Nitrous design philosophies
• Job based threading
• Message based systems
• Redundant, shallow state design
• Always evaluate – opposite of Lazy Evaluation
• Efficient memory streaming
• Asynchronous systems
Data driven design
[Diagram: Unit AI System, Message Dispatcher, and queues: Message Queue, Physics Queue, FOW queue, Minimap queue]
Relating to Graphics Stack
• Collection of messages and systems extends into graphics
• Dozens of independent systems can operate in parallel
• Big systems parallelize internally (e.g. particles, unit rendering)
A modern API
• Concept of message based, asynchronous design well matched
Exposure of asynchronous nature of a GPU is the key design difference of Vulkan over OpenGL/D3D11
A contract between App and API
• Application will not make conflicting calls on the same objects (e.g. writing one object while another is reading it)
• Driver will generally not lock or serialize any API call
– Context information is embedded on the object being operated on
– With the exception of occasional CPU-side memory allocation (which should be a rare occurrence on create calls)
Application runs parallel to GPU
[Diagram: Application and GPU side by side; Even Command Buffers and Odd Command Buffers, each with its own Delete Queue, plus a Flush Queue]
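One way to read the per-frame delete queues in this diagram is as deferred destruction: a resource released while recording frame N is only destroyed once the GPU can no longer be using it. The sketch below is an interpretation, not Nitrous code. All names and the capacity are hypothetical, and it assumes the caller has already waited on the fence that guards the slot being reused.

```c
#include <stddef.h>

#define FRAMES_IN_FLIGHT 2   /* even / odd, as in the diagram */
#define MAX_PENDING 64       /* hypothetical capacity per queue */

typedef void (*DestroyFn)(void *resource);

typedef struct {
    void *resource[MAX_PENDING];
    DestroyFn destroy[MAX_PENDING];
    size_t count;
} DeleteQueue;

typedef struct {
    DeleteQueue queues[FRAMES_IN_FLIGHT];
    unsigned frame;          /* index of the frame currently being recorded */
} FrameContext;

/* Queue a resource for destruction instead of destroying it immediately.
   Returns 1 on success, 0 if the current queue is full. */
static int defer_delete(FrameContext *ctx, void *resource, DestroyFn fn)
{
    DeleteQueue *q = &ctx->queues[ctx->frame % FRAMES_IN_FLIGHT];
    if (q->count == MAX_PENDING)
        return 0;
    q->resource[q->count] = resource;
    q->destroy[q->count] = fn;
    q->count++;
    return 1;
}

/* Advance to the next frame and flush the delete queue of the slot being
   reused. The caller must first have waited for that slot's GPU work. */
static void begin_next_frame(FrameContext *ctx)
{
    ctx->frame++;
    DeleteQueue *q = &ctx->queues[ctx->frame % FRAMES_IN_FLIGHT];
    for (size_t i = 0; i < q->count; ++i)
        q->destroy[i](q->resource[i]);
    q->count = 0;
}

/* Example destroy callback: counts calls through an int pointer. */
static void count_destroy(void *counter)
{
    ++*(int *)counter;
}
```

With two frames in flight, a resource queued during an even frame is destroyed when the next even frame begins.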
Review
• When we say Vulkan is free threaded, we mean
– Most API function calls are operators. They write only to the data passed into them as output, and only read the data passed into them as input
– API function calls are transparent for thread safety: valid to call so long as there are no read/write or write/write hazards. It is the app’s responsibility to manage them
– GPU/CPU hazards are explicitly exposed. GPUs are read operators on data, therefore read/write hazards between CPU and GPU must also be managed by the application
– In general, API function calls will not have locks in them
• With the exception of calls which must allocate some types of memory
Old way
[Timeline diagram, current frame: Sim, AI, Game and Physics jobs scheduled across Cores 1 to 6; Graphics (Opaque, in driver); Dead time; “???GPU Fence, or CPU wait???”]
Old Way
Driver related cores. Missing time due to thread accounting and system level synchronization primitives
Lots of unused CPU space! Engine is just waiting for driver to be done
Powerful New model
[Timeline diagram, current frame and next frame: Sim, AI and Game jobs alongside Vulkan CMD jobs across Cores 1 to 6; Vk present job on Core 6; GPU Fence End of Frame]
New way
Vulkan simulation using a modified Mantle build to simulate infinitely fast GPU
Difficult part of Vulkan
• Need to have a strategy for rendering up front, not lazy eval
• Before you can set up a shader, you need to understand bindings; before bindings, you need to understand descriptors
– Probably need to know these even before a descriptor is created
• The more you can know about a render job at compile time, the easier Vulkan will be
Setting up the Engine
• Pipelines created up front, combination(s) specified in shader language
• No concept of individual shader stages – Vertex/Fragment considered one block
• 64 MB temp buffer created for each frame
– Shader constants
– No buffers are updated directly
– Any updates are dumped into staging buffer and copied
– When 64 MB is exceeded, slow allocation path is used, typically only during initialization
• Internal command format that can be built in parallel
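The per-frame temp buffer described above behaves like a linear (bump) allocator that is reset every frame. The following is a minimal sketch under that reading, not Nitrous code: `FrameArena` and its functions are hypothetical, and a NULL return stands for "fall back to the slow allocation path".

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t *base;     /* start of the per-frame buffer */
    size_t capacity;   /* 64 MB in the slides; any size works here */
    size_t used;       /* bytes handed out so far this frame */
} FrameArena;

/* Hand out `size` bytes aligned to `alignment` (a power of two).
   Returns NULL when the buffer is exhausted, so the caller can take
   the slow allocation path instead. */
static void *arena_alloc(FrameArena *arena, size_t size, size_t alignment)
{
    size_t start = (arena->used + alignment - 1) & ~(alignment - 1);
    if (start > arena->capacity || size > arena->capacity - start)
        return NULL;
    arena->used = start + size;
    return arena->base + start;
}

/* Called once per frame, after the GPU has finished with the buffer. */
static void arena_reset(FrameArena *arena)
{
    arena->used = 0;
}
```

Because allocation is just an offset bump, it is cheap enough to call for every constant update; with several recording threads, each would need its own arena or an atomic offset.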
Shader Combos
• Large, monolithic blocks with a lot of state folded in
– Shaders
– Alpha state
– MSAA state
– Depth State
• Managing combinatorics is major challenge
Shader Combos
• Very unlikely that hardware actually needs to create a unique pipeline object
– The problem is that each piece of hardware has different state that might require a new shader
• Vulkan has bulk shader create
– Give a bunch of shader combinations at once to the driver
– Most likely the driver only has to create a few actual shaders
• Nitrous does group creates
– 20-40 combinations of a pipeline that might get used. A little bit of pruning for the shader author
Pipeline serialization
• Major problem with D3D12
• Serialization context is passed into shader create
– Needed because most pipelines are not unique
• Driver will use this as a database to store compiled pipeline objects
• Can serialize the whole database
Texture Sets
• Nitrous eliminates individual shader bindings
• Textures must be part of groups
• Maps to a descriptor set
Bind Vector
[Diagram: a Batch references a Shader Set, a Primitive (vertices) and a Bind Vector; each Bind Vector is drawn as a row of Texture Sets followed by a row of Constant Sets]
• Becomes a Layout in Vulkan
• Layouts are specified during the shader creation stage
• Nitrous uses only 1 master layout
• Most engines will use multiple
• Switching layouts has a cost
• Can easily sort off redundant changes, only call bind descriptor when something needs changing
Managing hazards
• The trickiest part of Vulkan
• Must manage any time a resource will be used differently
– Cache Flush
– Operator barrier
– Decompression
• USE THE VALIDATOR
– Could get correct results on current hardware only to see problems on future hardware
– No different than multi-threaded coding
• Consider having an engine layer automatically calculate some of the barriers
– Good design should do a good job
– Nitrous is 100% explicit right now, but will likely switch to a partially automatic system
General performance
• Shader auto recompiling won’t happen automatically
– Constant folding
– But no frame stutters due to recompiles
• Memory barriers can introduce stalls
– Need to plan out
• Changing pipelines, layouts frequently
Threading/Command buffers
• Best idea is to have many command buffers, but 1 allocator per thread per frame queued
• Command buffer allocation can cause memory bloat
• Nitrous sorts command buffers from estimated size, largest first, down to smallest
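The largest-first ordering in the last bullet can be sketched with a standard sort. The slides do not give the reason for the ordering; starting the longest recording jobs first is a common load-balancing choice, which is an assumption here. `CmdJob` and its fields are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    int id;                 /* hypothetical job identifier */
    size_t estimated_size;  /* estimated command buffer size */
} CmdJob;

/* qsort comparator: larger estimated_size sorts first. */
static int by_size_descending(const void *a, const void *b)
{
    const CmdJob *x = a;
    const CmdJob *y = b;
    if (x->estimated_size < y->estimated_size) return 1;
    if (x->estimated_size > y->estimated_size) return -1;
    return 0;
}

static void sort_largest_first(CmdJob *jobs, size_t count)
{
    qsort(jobs, count, sizeof *jobs, by_size_descending);
}
```

The comparator uses explicit comparisons rather than subtraction so that large `size_t` values cannot wrap.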
Questions
twitter: dankbaker, oxidegames
Performance Lessons from Porting Source 2 to Vulkan
Dan Ginsburg
Overview
Dota 2 Vulkan Performance Results
Performance Lessons Learned
Source 2 Overview
OpenGL, Direct3D 9, Direct3D 11, Vulkan
Windows, Linux, Mac
Dota 2 Reborn
Dota 2 Performance Results - Disclaimer
Not an ideal showcase for Vulkan
Source 2 renderer is multithreaded, but…
Dota 2 is only ~1500 draw calls per frame
Allows DX/GL a frame of latency to avoid being
renderthread bound
Does not (yet!) take advantage of:
Baking descriptors
Command buffer resubmission
Still very pleased with results!
Dota 2 Vulkan Performance – DX9 Latency
[Timeline capture with markers: Frame Start, Frame End, Present Issued]
DX9 Latency: 3.8ms
Dota 2 Vulkan Performance – Vulkan Latency
[Timeline capture with markers: Frame Start, Frame End, Present Issued]
Vulkan Latency: 0.4ms (!)
Dota 2 Vulkan – Latency Reduction
Renderthread no longer a bottleneck
Reduces “wallclock” time of frame
Time from end of frame to present reduced by 3.4ms
Really important for:
Latency sensitive games (eSports)
VR
Dota 2 Vulkan - Framerate
Two timedemos:
Typical Dota 2 Match
High Drawcall Battle Scene
Test system:
NVIDIA TITAN X 356.45
i7-3770k @ 3.50GHz
Test settings:
Resolution: 640x480 (CPU Perf)
Highest Rendering Quality
Vulkan/GL/DX9/DX11
Dota 2 Timedemo – Typical Dota 2 Match
NVIDIA TITAN X i7 3770k 640x480 356.45 - HQ
API      FPS
Vulkan   182.95
OpenGL   170.55
DX9      188.5
DX11     128.1
Dota 2 Timedemo – Battle Scene
Dota 2 – High Drawcall Timedemo
NVIDIA TITAN X i7 3770k 640x480 356.45 - HQ
API      FPS
Vulkan   85.3
OpenGL   75.15
DX9      75.65
DX11     67.5
Dota 2 Vulkan Performance - Overall
Significant latency reduction
Improved framerate in heavy scenes
Only going to get better…
Overview
Dota 2 Vulkan Performance Results
Performance Lessons Learned
Command Buffer Recycling
Command Buffer Batching
Redundant Call Filtering
Updating Descriptors
Pipeline Cache Usage
Command Buffer Recycling Overview
At least one VkCommandPool per thread
Recycling options:
vkResetCommandPool – resets all command buffers in
pool
vkResetCommandBuffer – reset single command buffer
Reset can either recycle or release resources
Command Buffer Recycling
Source 2 recycles individual command buffers after completion
vkBeginCommandBuffer costly
Using VK_COMMAND_BUFFER_RESET_RELEASE_RESOURCES_BIT
Driver reallocates resources
Done to reduce memory footprint, but came at perf cost
Fast Command Buffer Recycling
vkCreateCommandPool
Use VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT
vkResetCommandBuffer( pCmdBuffer, 0 )
flags == 0, keeps resources for reuse
Downside: memory growth
Source 2 strategy for handling memory growth:
Destroy command buffers no longer needed
Heuristic to destroy command buffers
Command Buffer Batching
vkQueueSubmit implies a flush
Also has CPU costs – memory residency
Important to batch submits
Command Buffer Batching
[Timeline captures comparing the two cases]
Batched submit: ~0.7ms / frame
Unbatched submits: ~4.5ms / frame
Source 2 Command Buffer Batching
Gather command buffers on renderthread
Up to a threshold, needed during load time
Wait for present request
Issue single submit with all batched command buffers
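The gather-then-submit pattern above can be sketched as follows. This is an illustration rather than Source 2 code: the integer handles stand in for VkCommandBuffer, `batch_flush` stands in for a single vkQueueSubmit covering everything gathered, and the threshold of 4 is an arbitrary assumption (the slides give no number).

```c
#include <stddef.h>

#define BATCH_THRESHOLD 4   /* hypothetical; the slides give no number */

typedef struct {
    int pending[BATCH_THRESHOLD];  /* stand-ins for VkCommandBuffer handles */
    size_t count;                  /* command buffers gathered so far */
    unsigned submits;              /* how many submit calls were issued */
} SubmitBatch;

/* Stand-in for one vkQueueSubmit covering all gathered command buffers.
   Called on a present request, or early when the threshold is reached. */
static void batch_flush(SubmitBatch *batch)
{
    if (batch->count == 0)
        return;
    batch->submits++;
    batch->count = 0;
}

/* Gather a command buffer; submit early only if the batch is full. */
static void batch_add(SubmitBatch *batch, int cmd)
{
    if (batch->count == BATCH_THRESHOLD)
        batch_flush(batch);
    batch->pending[batch->count++] = cmd;
}
```

In steady state this yields one submit per frame; the early flush matters mainly when many command buffers pile up, which the slides say happens during load time.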
Redundant Call Filtering
Your job now!
Vulkan drivers may not (should not!) filter calls
If we don’t do it, we will force IHVs to
Driver-side filtering hurts the good apps for the sake of the bad
Examples from Source 2:
vkCmdBindIndexBuffer
vkCmdBindVertexBuffers
vkCmdBindPipeline
Dynamic render state
vkCmdSet*
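Application-side filtering usually amounts to remembering the last value passed to each call and skipping repeats. The sketch below shows the idea for pipeline binds only. `BindCache` is a hypothetical structure, the 64-bit handle stands in for VkPipeline, and the comment marks where a real renderer would call vkCmdBindPipeline.

```c
#include <stdint.h>

typedef struct {
    uint64_t bound_pipeline;  /* last pipeline handle issued, 0 = none */
    unsigned api_calls;       /* calls that actually reached the driver */
} BindCache;

/* Bound state does not carry over between command buffers, so the cache
   must be cleared whenever recording of a new command buffer begins. */
static void bind_cache_reset(BindCache *cache)
{
    cache->bound_pipeline = 0;
}

/* Returns 1 if the bind was issued, 0 if it was filtered as redundant. */
static int bind_pipeline(BindCache *cache, uint64_t pipeline)
{
    if (cache->bound_pipeline == pipeline)
        return 0;
    /* A real renderer would call vkCmdBindPipeline here. */
    cache->bound_pipeline = pipeline;
    cache->api_calls++;
    return 1;
}
```

The same pattern extends to index buffers, vertex buffers and dynamic state, with one cached value per piece of state.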
Updating Descriptors
vkUpdateDescriptorSets #1 hotspot
vkCmdBindDescriptorSets #2 hotspot
Source 2 approach:
Single pipeline layout shared across all pipelines
Descriptor sets will have unused entries
Update/bind descriptor set per draw
Not efficient!
Updating Descriptors – The Right Way
In shaders, organize descriptor sets by update
frequency
Bake descriptor sets up front
Use compatible pipeline layouts to simplify descriptor
allocation
…we plan to do this in the future. Will help perf a lot.
Pipeline Creation
vkCreateShaderModule is relatively fast
Loads in the SPIR-V, no heavy compilation
~0.01ms in Dota 2
vkCreateGraphicsPipelines is expensive
Driver performs shader compile here
0.2 – 152ms in Dota 2 before cache is warmed
Vulkan Pipeline Cache
Serialize compiled pipelines to disk
Preload to remove first-time stutters
Header contains VendorID/DeviceID/UUID
Otherwise opaque format
Avoid unnecessary shader compiles
Driver de-duplicates
Only driver knows when recompile is needed based on
state
Pipeline cache should contain only unique pipelines
Allows compilation on multiple threads
Merge later using vkMergePipelineCaches
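Before preloading a cache file from disk, an application can compare the header fields mentioned above against the current device. The sketch assumes the version-one header layout from the Vulkan specification (four little-endian 32-bit fields: header length, header version, vendor ID, device ID, followed by a 16-byte UUID); `DeviceIdentity` and `cache_blob_matches` are hypothetical names, and the identity would be filled in from VkPhysicalDeviceProperties.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_UUID_SIZE 16   /* same size as VK_UUID_SIZE */

/* Hypothetical mirror of the fields in VkPhysicalDeviceProperties
   that identify which device and driver produced a cache. */
typedef struct {
    uint32_t vendor_id;
    uint32_t device_id;
    uint8_t uuid[CACHE_UUID_SIZE];
} DeviceIdentity;

static uint32_t read_u32_le(const uint8_t *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Returns 1 if the blob's header matches the device, else 0. */
static int cache_blob_matches(const uint8_t *blob, size_t size,
                              const DeviceIdentity *device)
{
    const size_t header_size = 16 + CACHE_UUID_SIZE;
    if (blob == NULL || size < header_size)
        return 0;
    if (read_u32_le(blob) < header_size)               /* declared length */
        return 0;
    if (read_u32_le(blob + 4) != 1)                    /* header version one */
        return 0;
    if (read_u32_le(blob + 8) != device->vendor_id)
        return 0;
    if (read_u32_le(blob + 12) != device->device_id)
        return 0;
    return memcmp(blob + 16, device->uuid, CACHE_UUID_SIZE) == 0;
}
```

A mismatch is not an error as such; it just means the file should be discarded and the cache rebuilt for this device.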
Summary
Dota 2 Vulkan Performance Results
Reduced latency
Improved framerate in expensive scenes
Performance Lessons Learned
Command Buffer Recycling
Command Buffer Batching
Redundant Call Filtering
Updating Descriptors
Pipeline Cache Usage
Questions?
Vulkan Does Retro
A Vulkan Use-Case Study with RetroArch and libretro
Hans-Kristian Arntzen – GDC 2016
Background
• Me
- Multimedia programming since 2009
- Co-founder of RetroArch project in 2010-2011
- Working at ARM hacking on the Mali GPUs since 2014
- Contributed Vulkan backend on launch day
• RetroArch / libretro
- Multi-platform system optimized for enjoying retro content
- Plugin abstraction to support many different systems
- Strong focus on portability and performance
Problem
• Retro content usually needs to render on CPU
- Emulators of classic consoles in particular are a prime example
• Get software rendered images to screen fast and reliably
- Blazing fast texture uploads part of the equation
[Diagram: CPU, GPU magic]
Streaming with Vulkan
• Vulkan exposes VK_IMAGE_TILING_LINEAR
- Finally! For some reason, never added to OpenGL
• GPUs can sample from these textures
- At least on the Vulkan drivers I have tested ...
- No reason to copy from linear to optimal layout (used once!)
• Vulkan supports persistently mapped memory
- Finally, us GLES folks can do it right
• Combine this to a dream scenario
- Persistently map a ring buffer of linear textures
- Let libretro core render directly into HOST_VISIBLE memory or use pure memcpy()
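The "ring buffer of linear textures" idea can be sketched with plain memory standing in for the mapped images. This is an illustration, not RetroArch code: `TextureRing` and `ring_upload` are hypothetical, the slot count of 3 is an assumption, and in a real backend each pointer would come from vkMapMemory (mapped once and kept) with the row pitch taken from vkGetImageSubresourceLayout.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RING_SLOTS 3   /* hypothetical number of textures in the ring */

typedef struct {
    uint8_t *mapped[RING_SLOTS];  /* persistently mapped pointer per texture */
    size_t row_pitch;             /* bytes between rows in the mapped image */
    unsigned next;                /* slot to write on the next upload */
} TextureRing;

/* Copy a tightly packed frame into the next slot, row by row, so that the
   destination row pitch is respected. Returns the slot that was written.
   The caller must make sure the GPU is no longer reading that slot. */
static unsigned ring_upload(TextureRing *ring, const uint8_t *src,
                            size_t width_bytes, size_t height)
{
    unsigned slot = ring->next;
    uint8_t *dst = ring->mapped[slot];
    for (size_t y = 0; y < height; ++y)
        memcpy(dst + y * ring->row_pitch, src + y * width_bytes, width_bytes);
    ring->next = (slot + 1) % RING_SLOTS;
    return slot;
}
```

If the emulator core can render with the destination pitch directly, the copy disappears and the core writes straight into the mapped slot.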
Caveats
• Vulkan doesn’t require support for sampling linear textures
- Might need fallback
• Linear textures might not be DEVICE_LOCAL
- Mostly a desktop thing
- Might need same fallback as before ...
• Memory might not be cached
- Fallback to copy if we want to blend on the surface
• Simple, vendor-neutral fallbacks
- If we hit either case, copy linear texture to DEVICE_LOCAL
- Might as well copy to OPTIMAL tiling layout
- vkCmdCopyImage (or vkCmdCopyBufferToImage)
The various ways to copy ...
• Ring buffered textures with glTexSubImage appears to be best
- We already did the hard part for the driver
- Texture is not in use by GPU, should allow optimal path
- Only way in pure GLES2
• Classic async PBO uploads have extra overhead on all drivers
- After all, have to copy to PBO, then copy to texture
- Doesn’t accomplish anything over plain SubImage in our case
• AZDO-style PBO seems interesting ... but
- Observed bizarre 10x performance dips in TexSubImage
- So much for that ...
• On Raspberry Pi 1, things got weirder ...
- Optimal path was uploading to OpenVG texture
- Share image with GLES via EGL ...
Benchmark
• NES video from Nestopia libretro core
- 256x240 resolution @ 32 bpp
• Ran through RetroArch’s Vulkan and GL backends
• Measurements
- Time to copy texture from CPU to texture
- Time spent overall to submit frame
- Measured on Linux
OpenGL results
• Sure, we’re measuring in microseconds
- We can do so much better!
• (*) GL calls were blocking mid-frame
- Probably rate-limiting waiting for older frames
CPU                     GPU                    Copy OpenGL (µs)   Frame OpenGL (µs)
i5-5257U @ 2.70 GHz     Intel HD 6100 (Mesa)   130                N/A (*)
i7 920 @ 2.66 GHz       nVidia GTX 760         272                302
Cortex-A17 @ 1.8 GHz    Mali T-764             585                806
Vulkan delivers!
• Copy time essentially a memcpy() benchmark
• Overall frame times way better than the GL texture upload!
- Great uplifts across the board
- Still room for improvement
CPU                     GPU                    Copy Vulkan (µs)   Frame Vulkan (µs)   Copy uplift
i5-5257U @ 2.70 GHz     Intel HD 6100 (Mesa)   27                 122                 352 %
i7 920 @ 2.66 GHz       nVidia GTX 760         46                 69                  491 %
Cortex-A17 @ 1.8 GHz    Mali T-764             80                 215                 631 %
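The slides do not define "Copy uplift". One definition that reproduces the second and third rows is the relative reduction in copy time, (OpenGL − Vulkan) / Vulkan, expressed as a percentage and truncated. The sketch below assumes that definition; `uplift_percent` is a hypothetical helper.

```c
/* Percentage by which `before_us` exceeds `after_us`, truncated to an
   integer. Requires after_us > 0 and before_us >= after_us. */
static unsigned uplift_percent(unsigned before_us, unsigned after_us)
{
    return (before_us - after_us) * 100u / after_us;
}
```

With the OpenGL and Vulkan copy times from the two tables, this gives 491 for the GTX 760 row (272 and 46 µs) and 631 for the Mali T-764 row (585 and 80 µs). For the Intel row the same formula applied to the listed 130 and 27 µs gives 381 rather than the listed 352 %; the slides do not explain the difference.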
Conclusion
• Even humble 2D applications can gain from Vulkan
• Not reserved for the highest-end engine developers
• Vulkan provides a far more direct and simple path to perf
• Fast paths are more obvious than before
• Going from good to great is much simpler in Vulkan
THANKS!
@themaister
github.com/Themaister
github.com/libretro/RetroArch
Porting Cinder to Vulkan
Hai Nguyen, Google
GFXBench 5 - Aztec Ruins
Benchmarking Vulkan
Gergely Juhasz, Lead Gfx Engineer @Kishonti
GFXBench 5 in a nutshell
• Concept
- Working title: Aztec Ruins
• Entirely new rendering engine
- In-house render API for Vulkan, Metal, DX12
- Also on OpenGL 4.3+, ES 3.2, DX11 for comparison
- Algorithmic and workload parity across different backends
• High-end graphics features
- Real time dynamic GI
- Complex shading and advanced post-effects
• State
- Near to Beta
- Gold version expected by Q3
Actual engine footage
Render pipeline – Direct lights
Render pipeline – Dynamic shadows
Render pipeline – Global illumination
Render pipeline – Post-process
Global illumination
• Probes capture the lighting conditions
• SH is generated for every probe
• Final scene is shaded by deferred irradiance lights
• Fits well with Vulkan’s subpass concept
Subpass 1 – Geometry
Subpass 2 – Lighting
Final step – Post effects
Multi-threaded command recording 1
Pipeline consists of several render jobs
[Diagram: a render job comprises render targets, render states and drawcalls; dependency graph of render jobs A, B, C, D, E, F]
Command buffer
Command buffer
Command buffer
Command buffer
Main thread Command queue
Main rendering thread submits the command buffers according to the dependency graph
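Submitting "according to the dependency graph" means picking an order in which every render job comes after the jobs it depends on. The slides do not show the engine's scheduler or the actual edges between jobs A to F, so the sketch below is a generic repeated-scan topological ordering over a hypothetical `JobGraph`, not Kishonti's implementation.

```c
#include <stddef.h>

#define MAX_JOBS 8   /* hypothetical upper bound on render jobs */

typedef struct {
    size_t job_count;
    /* depends_on[i][j] != 0 means job i must be submitted after job j */
    unsigned char depends_on[MAX_JOBS][MAX_JOBS];
} JobGraph;

/* Writes a valid submission order into `order` and returns how many jobs
   were ordered. A result smaller than job_count indicates a cycle. */
static size_t submission_order(const JobGraph *graph, size_t order[MAX_JOBS])
{
    unsigned char done[MAX_JOBS] = {0};
    size_t ordered = 0;
    int progressed = 1;

    while (ordered < graph->job_count && progressed) {
        progressed = 0;
        for (size_t i = 0; i < graph->job_count; ++i) {
            if (done[i])
                continue;
            int ready = 1;
            for (size_t j = 0; j < graph->job_count; ++j) {
                if (graph->depends_on[i][j] && !done[j]) {
                    ready = 0;
                    break;
                }
            }
            if (ready) {
                done[i] = 1;
                order[ordered++] = i;
                progressed = 1;
            }
        }
    }
    return ordered;
}
```

Recording can still happen on worker threads in any order; only the final submission by the main thread needs to follow the computed sequence.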
Future development plans
• Planned rendering features
- Indirect specular highlights and shadows by GI
- Deferred decals
- Animated vegetation
- Compute based motion blur
- Atmospheric effects, particles
• VR
Comparing Vulkan to OpenGL (ES)
Barthold Lichtenbelt
March 16, 2016
Beneficial Vulkan Scenarios
[Flowchart: from “start”, a “yes” at each of the following leads to “Vulkan friendly”]
• Is your graphics work CPU bound?
• Can your graphics creation be parallelized?
• Your graphics platform is fixed
• You’ll do what it takes to squeeze out max perf.
• You put a premium on avoiding hitches
• You can manage your graphics resource allocations
Unlikely to Benefit
Scenarios to reconsider coding to Vulkan
1. Need for compatibility to pre-Vulkan platforms
2. Heavily GPU-bound application
3. Heavily CPU-bound application due to non-graphics work
4. Single-threaded application, unlikely to change
5. App can target middle-ware engine, avoiding 3D graphics API dependencies
• Consider using an engine targeting Vulkan, instead of coding Vulkan yourself
Comparing OpenGL, AZDO, and Vulkan
Issue                                            Naïve GL   AZDO      Vulkan
Deterministic state validation/pre-compilation   no         no        yes
Improved single thread performance               no         yes       yes
Multi-threaded work creation                     no         partial   yes
Multi-threaded work submission (to driver)       no         no        yes
GPU based work creation                          no         partial   partial (through MDI)
Ability to re-use created work                   no         partial   yes
Multi-threaded resource updates                  no         yes       yes
Learning curve                                   low        high      significant
Effort                                           low        high      significant
Fish demo
• Vulkan and OpenGL ES 3.1
• Can change
- # of schools of fish
- # of fish per school
- # of fish per drawcall
• Worker threads create command buffers in Vulkan mode
• Reports
- Drawcalls/sec
- FPS
- CPU time per thread
- GPU time
• Android and Windows
• Source code will be available soon
200K Fishies, 100 fish per draw call
[Bar chart: drawcalls / sec (axis 0 to 200,000), OpenGL ES vs. Vulkan, on GeForce GTX 980, SHIELD Android TV and SHIELD Tablet K1; speedup labels: 7x, 1.5x, 1.2x]
200K Fishies, 1 fish per draw call
[Bar chart: drawcalls / sec (axis 0 to 18,000,000), OpenGL ES vs. Vulkan, on GeForce GTX 980, SHIELD Android TV and SHIELD Tablet K1; speedup labels: 6x, 5x, 19x]
FISH DEMO
Porting Cinder to Vulkan
Learning to Follow Rules
Hai Nguyen
Creative Technology Lead
Art Copy & Code Project
Vulkan: Lots of rules and no mercy.
~Joseph Campbell (paraphrased)
Introducing Cinder
● What’s creative coding?
○ Programming with aesthetic intent
● What platforms does Cinder run on?
○ Android, Linux, Windows, iOS and OS X
● Open source under Simplified BSD
C++ Creative Coding Framework | https://libcinder.org
Cinder: Who/What/Where?
● Who is Cinder’s target audience?
○ Creative coders
● What is Cinder used for?
○ Apps: mobile to desktop to Times Square
● Where has Cinder been used?
Audience and Projects
Projects That Use Cinder
Grove | Simon Geilfus
Planetary | BLOOM.io
SCAD Museum | Pentagram
IBM THINK | Mirada
Samsung CenterStage | TBG
Dia Lights | Kollision
Audi Urban Future | Kollision
Androidify | Red Paper Heart
Taxi, Taxi! | Robert Hodgin
The Road To Glory
● Vulkanizing Cinder
● Crossing Vendor Implementations
● Speed Bumps
Vulkanizing Cinder
● Added RendererVk to Cinder
○ Cinder rendering architecture is modular
● Wrapped Vulkan in C++
○ Created idiomatic layer for expression
● Created high level graphics classes
○ Textures, vertex buffers, render targets, etc.
Getting to the First Triangle
Vulkanizing Cinder
● Initial port on Windows: ~3 weeks
○ Included updating GLSL to Vulkan convention
● Android and Linux port: ~3 hours (each)
○ Added platform WSI calls
○ Added platform swapchain creation
● Everything else stayed the same
○ Including GLSL shader code used in demos and tests
Going Cross Platform
Crossing Vendor Implementations
● Vendor implementations follow the spec
○ Conformance tested
● Slightly different behaviors
○ Image layout transitions in render passes
● Varying GPU limits/features
○ Found in VkPhysicalDeviceLimits
Implementation Details Will Vary
Speed Bump: Image Layout Transitions
● Initial platform allowed image layouts to be LAYOUT_GENERAL○ Made it easy to get up and going
● Seemed to work on other GPUs - until one didn’t○ Why? Vendor had stricter adherence to spec
● Checked spec and added logic for transitions○ Had to rework a good bit of code
Dad Said Yes But Mom Said No
Whooops...
YAY!
Speed Bump: Not Paying Attention to Limits
● Not adhering to limits often results in crashes
● Mishandled vkCmdBindDescriptorSets
○ Exceeded maxBoundDescriptorSets
● Tried to multithread on device with 1 queue
○ Failed to check queue family’s queue count
VkPhysicalDeviceLimits / VkQueueFamilyProperties
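Both mistakes above come down to comparing a requested value against a number the device reports. A minimal sketch of those two checks follows. The helper names `descriptorSetsFit` and `usableQueueCount` are hypothetical, and the limits are passed in as plain integers so the example stands alone. In a real application they would be read from `VkPhysicalDeviceLimits::maxBoundDescriptorSets` and `VkQueueFamilyProperties::queueCount`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper: true if binding `setCount` descriptor sets starting at
// index `firstSet` stays within `maxBoundSets` (assumed to hold the value of
// VkPhysicalDeviceLimits::maxBoundDescriptorSets).
bool descriptorSetsFit(std::uint32_t firstSet, std::uint32_t setCount,
                       std::uint32_t maxBoundSets) {
    // Equivalent to firstSet + setCount <= maxBoundSets, written so the
    // unsigned addition cannot overflow.
    return firstSet <= maxBoundSets && setCount <= maxBoundSets - firstSet;
}

// Hypothetical helper: how many queues can actually be used from one family
// when `wantedQueues` threads each want their own queue. `queueCount` is
// assumed to hold VkQueueFamilyProperties::queueCount for that family.
std::uint32_t usableQueueCount(std::uint32_t wantedQueues, std::uint32_t queueCount) {
    return std::min(wantedQueues, queueCount);
}
```

On a device that exposes a single queue, `usableQueueCount(4, 1)` yields 1, which signals that submissions from the four threads have to be funneled through one queue.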
No More Black Box / Fewer Black Screens
● Vulkan Specification
○ Clear about requirements and expectations (mostly)
● Check Device Limits / Features at Run Time
○ Easy to query in Vulkan
● Validation Layers Are Your Friends
○ Turn on at day 1 - leave on until shipped
Help Vulkan Help You
Antoine Labour, E. Greg Daniel, Jesse Hall, Shannon Woods, Daniel Koch, Jeff Bolz, Mathias Heyer, Piers Daniell, Tristan Lorach, John McDonald, Dominik Witczak
Special Thanks
Thank You!
Hai Nguyen
https://libcinder.org
GFXBench 5 - Aztec Ruins
Benchmarking Vulkan
Gergely Juhasz, Lead Gfx Engineer @Kishonti
GFXBench 5 in a nutshell
• Concept
- Working title: Aztec Ruins
• Entirely new rendering engine
- In-house render API for Vulkan, Metal, DX12
- Also on OpenGL 4.3+, ES 3.2, DX11 for comparison
- Algorithmic and workload parity across different backends
• High-end graphics features
- Real-time dynamic GI
- Complex shading and advanced post-effects
• State
- Close to beta
- Gold version expected by Q3
Actual engine footage
Render pipeline – Direct lights
Render pipeline – Dynamic shadows
Render pipeline – Global illumination
Render pipeline – Post-process
Global illumination
• Probes capture the lighting conditions
• Spherical harmonics (SH) are generated for every probe
• Final scene is shaded by deferred irradiance lights
• Fits well with Vulkan’s subpass concept
Subpass 1 – Geometry
Subpass 2 – Lighting
Final step – Post effects
Multi-threaded command recording 1
[Diagram: a render job consists of render targets, render states, and drawcalls. Render jobs A to F are arranged in a dependency graph.]
Pipeline consists of several render jobs
Multi-threaded command recording 2
[Diagram: four command buffers are passed to the main thread, which feeds the command queue.]
Main rendering thread submits the command buffers according to the dependency graph
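The scheme on these two slides can be sketched roughly as follows: each render job is recorded on its own thread, and the main thread then walks the dependency graph to decide submission order. This is an illustrative sketch, not the GFXBench engine's code. `RenderJob`, `recordCommands`, and `recordAndOrder` are hypothetical names, and command buffers are modeled as strings so the example needs no Vulkan headers. The ordering uses Kahn's topological sort, which is one reasonable choice and is not something the slides specify.

```cpp
#include <cstddef>
#include <queue>
#include <string>
#include <thread>
#include <vector>

// Hypothetical render job: a name plus the indices of the jobs it depends on.
struct RenderJob {
    std::string name;
    std::vector<std::size_t> dependsOn;
};

// Stand-in for recording a command buffer. A real engine would record draw
// commands into a command buffer owned by the recording thread.
std::string recordCommands(const RenderJob& job) {
    return "cmd(" + job.name + ")";
}

// Records every job on its own thread, then returns the recorded buffers in an
// order that respects the dependency graph. Jobs that are part of a dependency
// cycle are never emitted.
std::vector<std::string> recordAndOrder(const std::vector<RenderJob>& jobs) {
    std::vector<std::string> recorded(jobs.size());
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < jobs.size(); ++i) {
        // Each thread writes only its own slot, so no locking is needed.
        workers.emplace_back([&jobs, &recorded, i] { recorded[i] = recordCommands(jobs[i]); });
    }
    for (std::thread& worker : workers) {
        worker.join();
    }

    // Main thread: count unmet dependencies and build the reverse edges.
    std::vector<std::size_t> pending(jobs.size());
    std::vector<std::vector<std::size_t>> dependents(jobs.size());
    for (std::size_t i = 0; i < jobs.size(); ++i) {
        pending[i] = jobs[i].dependsOn.size();
        for (std::size_t dep : jobs[i].dependsOn) {
            dependents[dep].push_back(i);
        }
    }

    std::queue<std::size_t> ready;
    for (std::size_t i = 0; i < jobs.size(); ++i) {
        if (pending[i] == 0) ready.push(i);
    }

    std::vector<std::string> submitOrder;
    while (!ready.empty()) {
        std::size_t job = ready.front();
        ready.pop();
        submitOrder.push_back(recorded[job]);  // a real engine would submit to the queue here
        for (std::size_t next : dependents[job]) {
            if (--pending[next] == 0) ready.push(next);
        }
    }
    return submitOrder;
}
```

Recording touches no shared state, so it parallelizes freely, and only the final submission step is serialized on the main thread.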
Future development plans
• Planned rendering features
- Indirect specular highlights and shadows by GI
- Deferred decals
- Animated vegetation
- Compute based motion blur
- Atmospheric effects, particles
- VR
Comparing Vulkan to OpenGL (ES)
Barthold Lichtenbelt
March 16, 2016
Beneficial Vulkan Scenarios
[Flowchart: beginning at "start", each item below has a "yes" branch leading toward "Vulkan friendly".]
• Is your graphics work CPU bound?
• Can your graphics creation be parallelized?
• Your graphics platform is fixed
• You’ll do what it takes to squeeze out max perf.
• You put a premium on avoiding hitches
• You can manage your graphics resource allocations
Unlikely to Benefit
Scenarios to reconsider coding to Vulkan
1. Need for compatibility with pre-Vulkan platforms
2. Heavily GPU-bound application
3. Heavily CPU-bound application due to non-graphics work
4. Single-threaded application, unlikely to change
5. App can target middleware engine, avoiding 3D graphics API dependencies
• Consider using an engine targeting Vulkan, instead of coding Vulkan yourself
Comparing OpenGL, AZDO, and Vulkan
Issue                                           Naïve GL  AZDO     Vulkan
Deterministic state validation/pre-compilation  no        no       yes
Improved single thread performance              no        yes      yes
Multi-threaded work creation                    no        partial  yes
Multi-threaded work submission (to driver)      no        no       yes
GPU based work creation                         no        partial  partial (through MDI)
Ability to re-use created work                  no        partial  yes
Multi-threaded resource updates                 no        yes      yes
Learning curve                                  low       high     significant
Effort                                          low       high     significant
Fish demo
• Vulkan and OpenGL ES 3.1
• Can change
- # of schools of fish
- # of fish per school
- # of fish per drawcall
• Worker threads create command buffers in Vulkan mode
• Reports
- Drawcalls/sec
- FPS
- CPU time per thread
- GPU time
• Android and Windows
• Source code will be available soon
200K Fishies, 100 fish per draw call
[Bar chart: drawcalls / sec, OpenGL ES vs. Vulkan, on GeForce GTX 980, SHIELD Android TV, and SHIELD Tablet K1. Y-axis runs from 0 to 200,000. Speedup labels on the chart: 7x, 1.5x, 1.2x.]
200K Fishies, 1 fish per draw call
[Bar chart: drawcalls / sec, OpenGL ES vs. Vulkan, on GeForce GTX 980, SHIELD Android TV, and SHIELD Tablet K1. Y-axis runs from 0 to 18,000,000. Speedup labels on the chart: 6x, 5x, 19x.]
FISH DEMO