AnySL: Efficient and Portable Multi-Language Shading
Philipp Slusallek, Sebastian Hack, Ralf Karrenberg, Dmitri Rubinstein
German Research Center for Artificial Intelligence (DFKI)
Intel Visual Computing Institute
Saarland University
Monday, August 15, 2011
Saarbrücken
Saarland Campus
Computer Science at the Saarland Campus
Multimodal Computing and Interaction
Shaders
● Programmable Shading
  – Allows for controlling core rendering features
    ● Material properties, light emission, participating media, …
  – Today: Many different shading languages
    ● HLSL, GLSL, Cg, RenderMan, MetaSL, OSL, OpenRL, many C++ dialects, …
    ● Mostly the same features, expressed differently
  – We need a portable way to exchange materials
    ● Common specification of shading features
    ● Ease implementation for different renderers and HW
● Here: Efficient and Portable Implementation
Shaders
● A plug-in for the innermost loops
  – From one-liners to thousands of lines of code
  – Run for every new ray, surface hit, light sample, …
    ● Sometimes, once for every MADD along ray
● Efficient implementation
  – Low overhead interface to renderer
    ● Ideally works directly on internal data structures
  – Highly optimized code for specific HW architectures
    ● Use of SIMD (SSE, AVX, PTX, …)
Implementation Choices
[Diagram labels: Renderer (Data, Code), Glue Code, C/C++ API/ABI, C++ Shader, Shader DSO/DLL]
● Shader code in C++
  – API specifies interface to renderer
  – Separate C/C++ compilation to DLL/DSO
  – API gets mapped directly to platform-specific ABI
    ● Predefined data layout, function call overhead
    ● No optimization options in interface
Implementation Choices
[Diagram labels: Renderer (Data, Code), Gen. API/ABI, SL Shader, Shader DSO/DLL]
● Using a Shading Language Compiler
  – Compiler can transform and optimize shader code
    ● E.g. use of renderer internal APIs: No glue code
    ● Transform shaders to SIMD
  – Requires renderer and language specific compiler
    ● Most renderers support only one shading language
  – Renderer-specific code gets embedded in result
Implementation Choices
● AnySL: Embedded SL Compiler
  – Any language compiled into portable format
  – Types, data layout, interface not fixed yet
  – Renderer supplies implementations at runtime
  – Embedded compiler links and optimizes code
[Diagram labels: SL Shader, API, Renderer (Data, Code), Glue Code, Compiler (LLVM), Optimized Shader]
AnySL: Portable Shading
• “Any” Shading Language Supported
  • Currently: RenderMan, C++ dialects, JavaScript, …
• Common Intermediate Format
  • Independent of renderer and HW architecture
• Easy Implementation by Renderer
  • Need only supply the glue code
• Different Backends
  • Ray Tracing: PBRT, Manta, RTfact, …
  • Rasterization: Deferred shading (with RTT)
  • HW: x86, SSE, AVX, PTX, OpenCL, GLSL, …
AnySL & XML3D: Interactive RenderMan in Your Web Browser
AnySL
Implementation
AnySL: Implementation
● Designing an Interpreter: Options
  − Many OP-codes with large switch() statement
  − Replace OP-codes with function calls
● “Subroutine Threaded Code”
  − Long list of function calls
    ● Even for control flow (“if”) and types (allocate a “float”)
  − Nice for portability, implementations can be replaced
    ● E.g.: use predication for “if” or substitute own “float” type
  − Can be directly encoded in compiled code
    ● Use LLVM bitcode for representation → Efficiency
Subroutine Threaded Code
[Code figures; panel labels:]
● Original shader code
● Conversion to Threaded Code
● Its implementation (supplied by renderer)
● Handling control flow: RM illuminance loop
● Mapping to Threaded Code
● Possible implementation (supplied by renderer)
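The idea of subroutine threaded code can be sketched in a few lines. This is only an illustration in Python, not the LLVM bitcode representation the slides name, and every identifier (`ScalarOps`, `PredicatedOps`, `diffuse_shader`) is hypothetical rather than part of AnySL. The shader is a flat sequence of calls, and the renderer decides what each call means.

```python
class ScalarOps:
    """One possible renderer-supplied implementation using plain floats."""

    def const(self, x):
        return float(x)

    def less(self, a, b):
        return a < b

    def select(self, cond, then_value, else_value):
        # Control flow ("if") is also just a call the renderer implements.
        return then_value if cond else else_value

    def mul(self, a, b):
        return a * b


class PredicatedOps(ScalarOps):
    """Alternative implementation: "if" is replaced by predication."""

    def less(self, a, b):
        # The condition becomes a 0.0/1.0 weight instead of a branch.
        return 1.0 if a < b else 0.0

    def select(self, cond, then_value, else_value):
        # Both sides are evaluated and blended by the weight.
        return cond * then_value + (1.0 - cond) * else_value


def diffuse_shader(ops, n_dot_l, albedo):
    """Scalar original would read: albedo * (0 if n_dot_l < 0 else n_dot_l).

    In threaded form every step is a call into `ops`.
    """
    zero = ops.const(0.0)
    is_back_facing = ops.less(n_dot_l, zero)
    clamped = ops.select(is_back_facing, zero, n_dot_l)
    return ops.mul(albedo, clamped)
```

Calling `diffuse_shader(ScalarOps(), 0.5, 0.8)` and `diffuse_shader(PredicatedOps(), 0.5, 0.8)` runs the same shader text against two different implementations of the operations.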
But Interpreters are Slow?!?
● STC is used for portable representation only
  − Eliminated at runtime with embedded compiler
● “Type Replacement”
  − Substitute own types and operators
  − Inline all interpreter calls
  − Perform all usual scalar optimizations
● Can be used for special shader functionality
  − Taking derivatives of arbitrary expressions
  − Bounding the result of shader over intervals
    ● E.g. using Affine Arithmetic [Heidrich et al., 1998]
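Type replacement for bounding can be illustrated with a small sketch. It uses plain interval arithmetic, which is simpler than the affine arithmetic cited above, ignores floating-point rounding, and relies on Python's operator overloading instead of an embedded compiler. The names `Interval` and `stripe_shader` are made up for this example.

```python
class Interval:
    """Closed interval [lo, hi]; a hypothetical stand-in for a replaced "float"."""

    def __init__(self, lo, hi=None):
        self.lo = lo
        self.hi = lo if hi is None else hi

    @staticmethod
    def lift(x):
        # Promote plain numbers to degenerate intervals [x, x].
        return x if isinstance(x, Interval) else Interval(x)

    def __add__(self, other):
        o = Interval.lift(other)
        return Interval(self.lo + o.lo, self.hi + o.hi)

    __radd__ = __add__

    def __mul__(self, other):
        o = Interval.lift(other)
        # The extremes of a product lie among the four endpoint products.
        products = (self.lo * o.lo, self.lo * o.hi,
                    self.hi * o.lo, self.hi * o.hi)
        return Interval(min(products), max(products))

    __rmul__ = __mul__


def stripe_shader(u):
    """Toy shader written once; works for float or for a replaced type."""
    return 0.5 * u * u + 0.25


point = stripe_shader(0.5)                    # ordinary evaluation
bound = stripe_shader(Interval(-1.0, 2.0))    # bounds over u in [-1, 2]
```

Here `bound` is the interval [-0.75, 2.25], which contains the true range [0.25, 2.25] but is wider, because interval arithmetic treats the two occurrences of `u` as independent. That over-estimation is the motivation for tracking dependencies on the inputs, as affine arithmetic does.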
How it All Fits Together
Special Functionality
● Derivatives of arbitrary expressions
  − Implemented through “Automatic Differentiation”
    ● Each type stores and maintains (2) derivatives
    ● Each operation updates value and derivatives
    ● Input provides initial derivatives (e.g. w.r.t. screen space)
● Bounding the value of a shader over an interval
  − Implemented through Interval or Affine Arithmetic
    ● Each type stores and maintains value plus interval
      − AA: plus terms for linear dependencies on (all) input values
    ● Each operation updates value and interval
    ● Input provides initial interval (e.g. w.r.t. parameter space)
● All maps nicely to Type Replacement
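A minimal sketch of the automatic differentiation case, assuming forward-mode dual numbers with two derivative slots (for example the two screen-space directions). The type `Dual2`, the `sin` overload, and `wave_shader` are hypothetical names for illustration and do not come from AnySL.

```python
import math


class Dual2:
    """Value plus two partial derivatives, updated by every operation."""

    def __init__(self, value, dx=0.0, dy=0.0):
        self.value, self.dx, self.dy = value, dx, dy

    @staticmethod
    def lift(x):
        # Constants have zero derivatives.
        return x if isinstance(x, Dual2) else Dual2(x)

    def __add__(self, other):
        o = Dual2.lift(other)
        return Dual2(self.value + o.value, self.dx + o.dx, self.dy + o.dy)

    __radd__ = __add__

    def __mul__(self, other):
        o = Dual2.lift(other)
        # Product rule applied to each derivative slot.
        return Dual2(self.value * o.value,
                     self.dx * o.value + self.value * o.dx,
                     self.dy * o.value + self.value * o.dy)

    __rmul__ = __mul__


def sin(x):
    """sin() with the chain rule for Dual2, math.sin for plain floats."""
    if isinstance(x, Dual2):
        c = math.cos(x.value)
        return Dual2(math.sin(x.value), c * x.dx, c * x.dy)
    return math.sin(x)


def wave_shader(s, t):
    """Toy procedural pattern, written without any knowledge of derivatives."""
    return sin(s * t) + 2.0 * s


# The inputs provide the initial derivatives: s varies along x, t along y.
result = wave_shader(Dual2(1.0, dx=1.0), Dual2(2.0, dy=1.0))
```

After the call, `result.dx` and `result.dy` hold the partial derivatives of the pattern with respect to the two input directions, computed alongside `result.value` without changing the shader body.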
Results
● Automatic differentiation for anti-aliasing
[Images: Point sampling; Analytic AA: Blend to average near Nyquist]
Optimization: Packet-Based Shading
● Modern ray tracers shoot packets of rays
● Exploit SIMD instructions of modern CPUs
  − Can execute instruction on k ≤ n floats at once
  − Current architectures:
    ● SSE (4), AVX (8), KNF (16), GPU (32)
● Shader function has to shade n hit points at once
AnySL: Packetized Shaders
● Writing packetized shaders is REALLY HARD
  − Not an option for any application
● You may not want to do this by hand:
AnySL: Packetized Shaders
● Given:
  − A shader is given by a control-flow graph of scalar instructions
● Needed:
  − A packetized shader is a new shader that executes k instances of the original shader at once
● Control flow of instances can diverge!
Main Issues: Control Flow
● Diverging control flow of a shader
  − Need to efficiently merge flows again!
● Shaders are nested in a deep recursion
  − Must handle closures and reordering of packets
Packetized Shaders
● Approach
  − Program transformation
  − Flatten control flow
  − Every instance executes all instructions
  − Mask out wrong results
  − Loops are iterated until last instance is done
  − Already exited instances are invalidated
  − Simulate what GPUs do in HW
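The approach listed above can be sketched by hand for one small function. The example emulates SIMD lanes with Python lists, so it shows only the masking logic and not the actual LLVM transformation or its performance. The functions `scalar_shader`, `packet_shader`, and `select` are hypothetical.

```python
def scalar_shader(x):
    """Original scalar code with a data-dependent branch and loop."""
    if x < 0.0:
        x = -x
    steps = 0
    while x > 1.0:
        x = x * 0.5
        steps += 1
    return x, steps


def select(mask, a, b):
    """Per-lane blend: take a where the mask is set, otherwise b."""
    return [ai if m else bi for m, ai, bi in zip(mask, a, b)]


def packet_shader(xs):
    """Flattened version: every lane executes every instruction."""
    # "if" becomes: compute the result for all lanes, then mask.
    negative = [x < 0.0 for x in xs]
    xs = select(negative, [-x for x in xs], xs)

    steps = [0] * len(xs)
    active = [x > 1.0 for x in xs]
    # The loop keeps iterating until the last lane is done.
    while any(active):
        xs = select(active, [x * 0.5 for x in xs], xs)
        steps = select(active, [s + 1 for s in steps], steps)
        # Lanes that have exited stay masked out for the remaining iterations.
        active = [a and x > 1.0 for a, x in zip(active, xs)]
    return xs, steps


inputs = [-3.0, 0.5, 8.0, 1.0]
packet_values, packet_steps = packet_shader(inputs)
```

For these inputs the packet version returns the same per-lane values and step counts as four separate calls to `scalar_shader`, even though the lanes leave the loop after 2, 0, 3, and 0 iterations.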
AnySL: Dealing With Data Divergence
● SSE has no gather/scatter support
  − Data must be in multiples of four and properly aligned
● Need to resort to serial load/store
  − Extract individual values from SSE vector
  − Load/Store
  − Merge/blend results back into SSE register
  − Very expensive (lots of dependencies)
● Calling non-packetized functions
  − Essentially, the same as scatter/gather
  − E.g. hand-crafted SSE noise() function
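The serial fallback can be sketched as follows, again with Python lists standing in for SSE registers. The helper names `gather` and `call_scalar_per_lane` are hypothetical, and skipping inactive lanes via a mask is an assumption of this sketch, not something the slides state.

```python
def gather(table, indices, mask, fallback=0.0):
    """Emulated gather: one scalar load per active lane, merged into a packet."""
    result = []
    for lane, index in enumerate(indices):   # extract each lane's index
        if mask[lane]:
            result.append(table[index])      # serial scalar load
        else:
            result.append(fallback)          # inactive lane: no load at all
    return result                            # merged back into one packet


def call_scalar_per_lane(fn, packet, mask, fallback=0.0):
    """Calling a non-packetized function has the same shape: one call per lane."""
    return [fn(x) if m else fallback for x, m in zip(packet, mask)]


table = [10.0, 20.0, 30.0, 40.0, 50.0]
loaded = gather(table, [4, 0, 2, 2], [True, True, False, True])
absolute = call_scalar_per_lane(abs, [-1.0, 2.0, -3.0, 4.0], [True] * 4)
```

Each lane's load or call depends on extracting a value from the packet first, which is the chain of dependencies that makes this path expensive.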
Packetized Shader Results
● Packet size of 4 (SSE)
  − Completely automated (LLVM)
  − Shaders are packetized automatically
  − On average 3.2x speedup for complete rendering
  − Not specific to graphics
  − Can be used wherever data parallelism is available
AnySL Results
Applications Beyond Graphics
● Whole Function Vectorization
  − Transform a function over one or more scalar parameters into a function over SIMD parameters
  − Maintaining semantics within each SIMD lane
  − Application to shader code & packet ray tracing
  − OpenCL-Compiler
    ● Simply add an OpenCL-Frontend
    ● Re-use existing AnySL backends
    ● Currently fastest OpenCL compiler for CPUs & GPUs
AnyDSL
● Vision
  − Language, enabling domain-specific environments
    ● A new base language (others are too complex already)
    ● New environments can be written in AnyDSL
    ● Think libraries of types, code, syntax, etc.
  − Meta programming
    ● Ensures predictable performance
    ● Programmer directly controls which parts of a program are evaluated at compile time
    ● Convenient syntax, no special templates
  − Implicit support for parallelism
  − Based on continuation passing style
ECOUSS Project
● “Efficient and Open Compiler Environment for Semantically Annotated Parallel Simulations”
● German National Project
  − Application Partners
    ● Supercomputing Center HLRS, Stuttgart
    ● Cray Computer
    ● BMW Group
    ● Boehringer Ingelheim (pharmaceuticals)
  − Research Partners
    ● Intel Visual Computing Institute
    ● German Research Center for Artificial Intelligence (DFKI)
    ● Karlsruhe Institute of Technology
Conclusions
● AnySL
  − Shaders are compiled to platform-independent code
    ● Can be produced from any shading language
  − Reduces work for the renderer implementer
    ● Need only supply renderer-specific code and link to AnySL
  − Highly optimizing JIT compiler within the renderer
    ● Eliminates interfaces and optimizes code
  − High performance through packetization
    ● Significant speedup on benchmarks (~3.2x)
    ● Eliminates the need for SIMD shader coding
  − Many applications beyond graphics