Post on 19-Dec-2015
Real-time Mesh Simplification Using the GPU
Christopher DeCoro
Natasha Tatarchuk
3D Application Research Group
2
Introduction
• Implement Mesh Decimation in real-time
• Utilizes new Geometry Shader stage of GPU
• Achieves a 20x speedup over CPU
3
Project Motivation
• Massive Increases in submitted geometry
  • Geometry rendered per shadow map (6x for cubemap!)
  • Not always needed at highest resolution
• Geometry not always known at build-time
  • Dynamically-skinned objects only finalized at run-time
  • May be customized to user's machine based on capabilities, would need to be adapted at program load time
  • Could be dynamically generated per level, would need to be adapted at level load time
• Simplification therefore needs to be fast (or even real-time)
Also, just as importantly…
• We want applications that exercise & stress GS/GPU
  • Evaluate new capabilities of the GPU
  • Learn how to adapt previously CPU-bound algorithms
  • Develop GPU-centric methodologies
• Identify future feature set for GS/GPU as a whole
  • Limitations still exist – which should be addressed?
4
Contributions
• Mapping of Decimation to GPU
  • 20x speedup vs. CPU
  • Enables load-time or real-time usage
• Detail Preservation by Non-linear Warping
  • Also applicable to CPU out-of-core decimation
• General-purpose GPU Octree
  • Adaptive decimation w/ constant memory
  • Applications not limited to simplification: collision detection, frustum culling, etc.
5
Outline
• Project Introduction and Motivation
• Background
  • Decimation with Vertex Clustering
  • Geometry Shaders in Direct3D 10
• Geometry Shader-based Vertex Clustering
• Adaptive Simplification w/ Non-linear Warps
• Probabilistic Octrees on the GPU
6
Vertex Clustering
• Reduces mesh resolution
  • High-res mesh as input
  • Low-res as output
• All implemented on the GPU
  • Ideal for processing streamed out data
  • Useful when rendering multiple times (e.g. shadows)
  • Can handle enormous models from scanned data
• Based on “Out-of-Core Simplification of Large Polygonal Models,” P. Lindstrom, 2000
Figure from [Lindstrom 2000]
7
Previous Rendering Pipeline
• Vertex Shaders and Pixel Shaders
• Limits 1 output per 1 input
  • No culling of triangles for decimation
• Fixed destination for each stage
  • Result meshes cannot be (easily) saved and reused
8
DirectX10 Rendering Pipeline
• Geometry Shader in between VS & PS
  • Called for each primitive (usually triangle)
• Able to access all vertices of a primitive
  • Can compute per-face quantities
• Breaks 1:1 input-output limitation
  • Allows triangles to be culled from pipeline
• Allows stream-out of processed geometry
  • Decimated meshes can easily be saved and reused
9
Outline
• Project Introduction and Motivation
• Background
• Geometry Shader-based Vertex Clustering
  • Overview
  • Quadric Generation
  • Optimal Position Computation
  • Final Clustering
• Adaptive Simplification w/ Non-linear Warps
• Probabilistic Octrees on the GPU
10
Algorithm Overview
• Start with the input mesh
  • Shown divided into clusters
• Pass 1: Compute the quadric map from mesh
  • Use GS to compute quadric
  • Accumulate in cluster map, an RT used as large array
• Pass 2: For each cluster, compute optimal position
  • Solves a linear system given by quadrics
• Pass 3: Collapse each vertex to representative
  • 9x9x9 grid shown
Model Courtesy of Stanford Graphics Lab
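All three passes depend on mapping a vertex position to its cluster. A minimal CPU-side sketch of that mapping in Python is given below. It is not the shader code from the talk: the names `cluster_id` and `map_address`, the bounding-box parameters, and the row-major texel layout are assumptions made for illustration, and positions are assumed to lie inside the bounding box.

```python
def cluster_id(pos, bbox_min, bbox_max, grid):
    """Return the linear id of the uniform-grid cell that contains pos.

    pos, bbox_min, bbox_max are 3-tuples; grid is (nx, ny, nz).
    """
    cell = []
    for p, lo, hi, n in zip(pos, bbox_min, bbox_max, grid):
        t = (p - lo) / (hi - lo)             # normalise the coordinate to [0, 1]
        cell.append(min(int(t * n), n - 1))  # clamp the upper boundary into the last cell
    ix, iy, iz = cell
    return ix + grid[0] * (iy + grid[1] * iz)


def map_address(cid, map_width):
    """Map a linear cluster id to a (column, row) texel of a 2D cluster map.

    Row-major layout is an assumption of this sketch.
    """
    return cid % map_width, cid // map_width
```

With the 9x9x9 grid shown on the slide there are 729 clusters, which fit exactly into a 27x27 cluster map.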
11
Vertex Clustering Pipeline
• Pass 1: Create Quadric Map
  • Input: Original Mesh
  • Computation:
    • Determine plane equation, face quadrics for triangle
    • Compute the cluster and address of each vertex
    • Pack quadric into RT at appropriate address
  • Output: Render Targets representing clusters with packed quadrics and average positions
12
Quadric Map Implementation
//Map a point to its location in the cluster map array
//(clusterId() and expand() are helpers defined elsewhere, not HLSL intrinsics)
float2 writeAddr( float3 vPos )
{
    uint iX = clusterId(vPos) / iClusterMapSize.x;
    uint iY = clusterId(vPos) % iClusterMapSize.y;
    return expand( float2(iX,iY)/float(iClusterMapSize.x) ) + 1.0/iClusterMapSize.x;
}

[maxvertexcount(3)]
void main( triangle ClipVertex input[3], inout PointStream<FragmentData> stream )
{
    //For the current triangle, compute the area and normal
    float3 vNormal = (cross( input[1].vWorldPos - input[0].vWorldPos, input[2].vWorldPos - input[0].vWorldPos ));
    //length(cross)/2 is the triangle area; dividing by 6 gives each of the 3 vertices a third of it
    float fArea = length(vNormal)/6;
    vNormal = normalize(vNormal);

    //Then compute the distance of plane to the origin along the normal
    float fDist = -dot(vNormal, input[0].vWorldPos);

    //Compute the components of the face quadrics using the plane coefficients
    //(outer() is a helper that builds the 3x3 outer product, not an HLSL intrinsic)
    float3x3 qA = fArea*outer(vNormal, vNormal);
    float3 qb = fArea*vNormal*fDist;
    float qc = fArea*fDist*fDist;

    //Loop over each vertex in input triangle primitive
    for(int i=0; i<3; i++)
    {
        //Assign the output position in the quadric map
        FragmentData output;
        output.vPos = float4(writeAddr(input[i].vPos),0,1);

        //Write the quadric to be accumulated in the quadric map
        packQuadric( qA, qb, qc, output );
        stream.Append( output );
    }
}
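As a cross-check of the face quadric built in the shader above, the same quantities can be computed on the CPU. The following Python sketch mirrors the shader's weighting (one third of the triangle area per vertex) and evaluates the resulting error at a point. The function names are illustrative and not from the talk, and the triangle is assumed to be non-degenerate.

```python
import math


def sub(a, b):
    return [a[i] - b[i] for i in range(3)]


def dot(a, b):
    return sum(a[i] * b[i] for i in range(3))


def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def face_quadric(p0, p1, p2):
    """Return (A, b, c) for the plane of a non-degenerate triangle."""
    n = cross(sub(p1, p0), sub(p2, p0))
    length = math.sqrt(dot(n, n))
    weight = length / 6.0            # a third of the triangle area, as in the shader
    n = [x / length for x in n]      # unit normal
    d = -dot(n, p0)                  # plane equation: dot(n, v) + d = 0
    A = [[weight * n[i] * n[j] for j in range(3)] for i in range(3)]
    b = [weight * d * n[i] for i in range(3)]
    c = weight * d * d
    return A, b, c


def quadric_error(q, v):
    """Evaluate v^T A v + 2 b^T v + c, the weighted squared distance to the plane."""
    A, b, c = q
    Av = [dot(A[i], v) for i in range(3)]
    return dot(v, Av) + 2.0 * dot(b, v) + c
```

Because quadrics from different faces simply add, accumulating them per cluster with additive blending yields the sum of weighted squared distances to all faces that touch the cluster.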
13
Vertex Clustering Pipeline
• Pass 2: Find Optimal Positions
  • Input: Cluster Map Render Targets, Full-screen Quad
  • Computation:
    • Determine if we can solve for optimal position
    • If not, fall back to vertex average
  • Output: Render Targets representing clusters with optimal position of representative vertex
14
Optimal Positions
• For each cell, need representative
• Naïve solution: Use averages
  • Looks very blocky
  • Does not consider the original faces, only vertices
• Implemented solution: Use quadrics
  • Quadrics are a measure of surface
  • We can solve for optimal position
Original Mesh
Simplified w/ Averages
Simplified w/ Quadrics
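For reference, the quadric-based solve can be written out as follows. This is the standard quadric error formulation, stated with the same sign conventions as the shader listings in this talk; the symbols $w_f$, $n_f$ and $d_f$ (per-face weight, unit normal and plane offset) are notation introduced here for illustration.

```latex
Q(v) = v^{T} A\, v + 2\, b^{T} v + c,
\qquad
A = \sum_f w_f\, n_f n_f^{T},\quad
b = \sum_f w_f\, d_f\, n_f,\quad
c = \sum_f w_f\, d_f^{2}

\nabla Q(v) = 2 A v + 2 b = 0
\;\;\Longrightarrow\;\;
v^{*} = -A^{-1} b \quad \text{(when } A \text{ is invertible)}
```

When $A$ is singular or nearly so, for example when all faces in a cell are coplanar, the minimizer is not unique and the vertex average is used instead.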
15
Optimal Positions Implementation
float3 optimalPosition(float2 vTexcoord)
{
    float3 vPos = float3(0,0,0);
    float4 dataWorld, dataA0, dataB, dataA1;

    //Read the vertex average from the cluster map
    dataWorld = tClusterMap0.SampleLevel( sClusterMap0, vTexcoord, 0 );
    int iCount = dataWorld.w;

    //Only compute optimal position if there are vertices in this cluster
    if( iCount != 0 )
    {
        //Read all the data from the clustermap to reconstruct the quadric
        dataA0 = tClusterMap1.SampleLevel( sClusterMap1, vTexcoord, 0 );
        dataA1 = tClusterMap2.SampleLevel( sClusterMap2, vTexcoord, 0 );
        dataB = tClusterMap3.SampleLevel( sClusterMap3, vTexcoord, 0 );

        //Then reassemble the quadric (A is symmetric, so only 6 values are stored)
        float3x3 qA = { dataA0.x, dataA0.y, dataA0.z,
                        dataA0.y, dataA0.w, dataA1.x,
                        dataA0.z, dataA1.x, dataA1.y };
        float3 qB = dataB.xyz;
        float qC = dataA1.z;

        //Determine if inverting A is stable, if so, compute optimal position
        //If not, default to using the average position
        //(inverse() is a helper defined elsewhere, not an HLSL intrinsic)
        const float SINGULAR_THRESHOLD = 1e-11;
        if( determinant(qA) > SINGULAR_THRESHOLD )
            vPos = -mul( inverse(qA), qB );
        else
            vPos = dataWorld.xyz / dataWorld.w;
    }
    return vPos;
}
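The same solve-or-fall-back logic can be exercised on the CPU. The Python sketch below solves A x = -b with Cramer's rule when the determinant exceeds the threshold and otherwise returns the vertex average. The function names and the argument layout (`pos_sum`, `count`) are assumptions of this sketch, not part of the original shader.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def optimal_position(A, b, pos_sum, count, threshold=1e-11):
    """Return the representative position for one cluster.

    A, b are the accumulated quadric terms; pos_sum and count are the
    accumulated vertex positions and the number of vertices in the cluster.
    """
    if count == 0:
        return [0.0, 0.0, 0.0]                   # empty cluster, nothing to place
    d = det3(A)
    if d > threshold:
        rhs = [-x for x in b]                    # solve A x = -b
        result = []
        for col in range(3):
            m = [row[:] for row in A]
            for r in range(3):
                m[r][col] = rhs[r]               # Cramer's rule: replace one column
            result.append(det3(m) / d)
        return result
    return [s / count for s in pos_sum]          # unstable: fall back to the average
```

For example, the summed quadrics of the three planes x=1, y=2 and z=3 give A equal to the identity and b = (-1, -2, -3), so the solve returns their intersection point.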
16
Vertex Clustering Pipeline
• Pass 3: Decimate Mesh
  • Input: Cluster Map Render Targets, Input Mesh
  • Computation:
    • Find clusters, Remap vertices to representative
    • Determine if triangle becomes degenerate
    • If not, stream output new triangle at new positions
  • Output: Low-resolution Mesh
17
Final Clustering Implementation
[maxvertexcount(3)]
void main( triangle ClipVertex input[3], inout TriangleStream<StreamoutVertex> stream )
{
    //Only emit a triangle if all three vertices are in diff. clusters
    if( all_different(clusterId(input[0].vPos),
                      clusterId(input[1].vPos),
                      clusterId(input[2].vPos)) )
    {
        for(int i=0; i<3; i++)
        {
            //Lookup optimal position of vertex i's cluster in the RT computed in Pass 2
            //(StreamoutVertex is assumed here to have a float4 vPos member)
            StreamoutVertex output;
            output.vPos = tClusterMap3.SampleLevel( sClusterMap3, readAddr(input[i].vPos), 0 );

            //Output vertex to stream out
            stream.Append( output );
        }
    }
    return;
}
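The degenerate-triangle test is the core of this pass, and it can be illustrated independently of the GPU. The Python sketch below keeps a triangle only when its three vertices fall into three different clusters and moves each kept vertex to its cluster's representative. The `cluster_of` callback and the `representative` dictionary are stand-ins invented for this sketch.

```python
def decimate(vertices, triangles, cluster_of, representative):
    """Collapse vertices to cluster representatives and drop degenerate triangles.

    vertices:       list of positions
    triangles:      list of 3-tuples of vertex indices
    cluster_of:     function mapping a position to a cluster id
    representative: dict mapping a cluster id to its representative position
    """
    out = []
    for tri in triangles:
        ids = [cluster_of(vertices[i]) for i in tri]
        # A triangle with two or more vertices in one cluster collapses
        # to an edge or a point, so it is discarded.
        if len(set(ids)) == 3:
            out.append([representative[c] for c in ids])
    return out
```

Note that each output vertex is looked up with its own cluster id, which is why the shader must index the input with the loop variable.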
18
Vertex Clustering Pipeline
• Alternate Pass 2: Downsample RTs
  • Input and Output as before
  • Computation:
    • Collapse 8 adjacent cells by adding cluster quadrics
    • Compute optimal position for 2x larger cell
• Create multiple lower levels of detail without repeatedly incurring Pass 1 overhead (~75%)
  • Pass 3 can use previous streamed-out mesh
  • Lower levels of detail almost free
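Downsampling works because quadrics are additive: the quadric of a coarse cell is the sum of the quadrics of the cells it covers. A minimal Python sketch of this step follows. Representing the cluster map as a dictionary from integer cell coordinates to a flat list of accumulated values is an assumption made here for brevity.

```python
def downsample(cells):
    """Merge each 2x2x2 block of cells into one coarse cell by summation.

    cells: dict mapping (ix, iy, iz) to a list of accumulated values
           (quadric terms, position sum, vertex count, ...).
    """
    coarse = {}
    for (ix, iy, iz), data in cells.items():
        key = (ix // 2, iy // 2, iz // 2)      # parent cell at half resolution
        if key in coarse:
            coarse[key] = [a + b for a, b in zip(coarse[key], data)]
        else:
            coarse[key] = list(data)
    return coarse
```

The optimal position of each coarse cell is then computed from the summed values exactly as in the regular Pass 2.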
19
Timing Results
• Recorded Time Spent in Decimation
  • GPU: AMD/ATI XXX
  • CPU: 3 GHz Intel P4
• Significant Improvement over CPU
  • Averages ~20x speedup on large models
  • Scales linearly
20
More Results
• Models shown at varying resolutions
Models Courtesy of Stanford Graphics Lab
Buddha, 45x130x45 grid
Bunny, 90x90x90 grid
Dragon, 100x60x20 grid
21
More Results
• Models shown at varying resolutions
Buddha, 20x70x20 grid
Bunny, 60x60x60 grid
Dragon, 50x25x10 grid
22
More Results
• Models shown at varying resolutions
Buddha, 10x40x10 grid
Bunny, 20x20x20 grid
Dragon, 30x15x6 grid
23
Outline
• Project Introduction and Motivation
• Background
• Geometry Shader-based Vertex Clustering
• Adaptive Simplification w/ Non-linear Warps
  • View-dependent Simplification
  • Region-of-interest Simplification
• Probabilistic Octrees on the GPU
24
View-dependent Simplification
• Standard simplification does not consider view
  • Preserves uniform amount of detail all over
• Simplify in post-projection space to use view
  • Preserves more detail closer to viewer (left)
View Direction
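The effect of clustering in post-projection space can be seen with a small Python sketch. It uses a deliberately simplified perspective projection (camera at the origin looking down +z) and a uniform 2D grid on the projected x and y coordinates. The function names, the projection, and the `cells_per_unit` parameter are assumptions for illustration only.

```python
import math


def project(p, near=1.0):
    """Simplified perspective projection: divide x and y by depth."""
    x, y, z = p
    return (near * x / z, near * y / z, 1.0 - near / z)


def cell_2d(p, cells_per_unit):
    """Uniform grid cell (x and y only) of a point after projection."""
    px, py, _ = project(p)
    return (math.floor(px * cells_per_unit), math.floor(py * cells_per_unit))
```

Two points that are 0.5 units apart land in different cells when they are close to the viewer but in the same cell when they are ten times farther away, so distant geometry is clustered more aggressively.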
25
Arbitrary Warping Functions
• View Transform special case of nonlinear warp
  • Can use arbitrary warp for adaptive simplification
• Regular grids allow data-independence, parallelism
  • Constant time mapping from position to grid cell
  • Maps well onto GPU render targets
  • Forces uniform resolution throughout output mesh
• Irregular geometry grids allow non-uniform output
  • Cells can be larger/smaller in certain regions
  • Corresponds to lower/greater output triangle density
  • We lose constant-time mapping of position to cell
• Solution: apply inverse warp to vertices
  • Equivalent to applying forward warp to grid cells
  • Clustering still performed in uniform grid
  • Flexibility of irregular geometry w/ speed of regular
  • One proposal: Gaussian weighting functions
26
Region-of-Interest Specification
• Importance specified w/ biased Gaussian
  • Highest preservation at mean
  • Width of region given by sigma
  • Bias prevents falloff to zero
• Integrate to produce corresponding warp function
(Derivation given in paper)
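One possible form of such a warp, for a single coordinate normalized to [0, 1], is sketched below in Python. The importance function bias + (1 - bias) * exp(-(x - mu)^2 / (2 sigma^2)) and the normalization to [0, 1] are assumptions of this sketch; the exact derivation in the paper may differ.

```python
import math


def warp(x, mu, sigma, bias):
    """Integral of a biased Gaussian importance function, rescaled to map [0, 1] onto [0, 1]."""

    def antiderivative(t):
        # The integral of exp(-(t - mu)^2 / (2 sigma^2)) is expressed with erf.
        gauss = sigma * math.sqrt(math.pi / 2.0) * math.erf((t - mu) / (sigma * math.sqrt(2.0)))
        return bias * t + (1.0 - bias) * gauss

    lo, hi = antiderivative(0.0), antiderivative(1.0)
    return (antiderivative(x) - lo) / (hi - lo)
```

The warp is steepest at the mean, so a region around `mu` is stretched over more uniform grid cells and keeps more detail. Because the bias keeps the slope strictly positive, the warp stays monotonic and regions far from the mean are compressed rather than collapsed.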
27
Region-of-Interest Specification
• Warping allows non-uniform/adaptive level of detail
• Head has most semantic importance
• Detail lost in uniform simplification
• We can warp first to expand center
• Equivalent to grid density increasing
• Adaptive simplification preserves head detail
28
Outline
• Project Introduction and Motivation
• Background
• Geometry Shader-based Vertex Clustering
• Adaptive Simplification w/ Non-linear Warps
• Probabilistic Octrees on the GPU
  • Motivation
  • Probabilistic Storage
  • Adaptive Simplification
  • Randomized Construction
  • Results
29
Octrees - Motivation
• Basic grid
  • regular geometry, regular topology
  • Limitations as we discussed
• Warped grid
  • irregular geometry, regular topology
  • Much improved; however, we can do better
  • May be difficult to know required detail a priori
• CPU Solution: Multi-resolution grid (e.g. octree)
  • Irregular topology (irregular geometry w/ warping)
  • Store grid at many levels of detail
  • Measure error at each level, use coarse as possible
  • Efficiency requires dynamic memory, storage O(L^3)
  • Requires O(L) writes to produce correct tree
30
GPU Solution – Probabilistic Octrees
• Proposal
  • Successful storage not guaranteed, w/ Prob. <= 1
  • However, storage failure detected on read
• Assumptions allow much flexibility
  • We can have unlimited depth tree (but lim P=0)
  • Sparse storage of data
• Require conservative algorithms for task
  • Vertex clustering (conveniently!) is such an example
  • So are collision detection and frustum culling
• Only studied briefly in this paper; we would like to analyze it further in future work
31
Implementation Details
• Storage: Spatial Hashes
  • Map (position, level) to cell, cell hashed to index
  • Additive blending for quadric accumulation (app-specific)
  • Max blending to store (key, -key) with data (i.e. min_key, max_key)
• Retrieval:
  • Again map (position, level) to index
  • Retrieve key value from data, collision iff min_key != max_key
  • On collision, use parent level, which will have higher storage probability
• Usage for Adaptive Simplification
  • For each vertex, find maximum error level below some threshold
  • Use this as the representative vertex
  • Can perform binary search along path
  • Conservative, because we can maintain validity even when using parent of optimal node (just adds some error)
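The storage and retrieval scheme above can be simulated on the CPU to show how collisions are detected. In the Python sketch below, `max` over (key, -key) stands in for max blending and `+=` for additive blending. The class and function names, the modulo hash, and the node numbering are assumptions of this sketch rather than the implementation from the talk.

```python
class ProbabilisticStore:
    """Fixed-size hash table whose writes may collide; collisions are detected on read."""

    def __init__(self, size):
        self.size = size
        self.max_key = [float("-inf")] * size      # max blending of key
        self.max_neg_key = [float("-inf")] * size  # max blending of -key, i.e. -min_key
        self.value = [0.0] * size                  # additively blended payload

    def write(self, key, value):
        slot = key % self.size
        self.max_key[slot] = max(self.max_key[slot], key)
        self.max_neg_key[slot] = max(self.max_neg_key[slot], -key)
        self.value[slot] += value

    def read(self, key):
        slot = key % self.size
        min_key, max_key = -self.max_neg_key[slot], self.max_key[slot]
        # Different keys shared this slot (or nothing matching was stored): report failure.
        if min_key != max_key or max_key != key:
            return None
        return self.value[slot]


def node_key(cell, level):
    """Unique integer key for an octree node; level 0 is the root."""
    ix, iy, iz = cell
    n = 1 << level                     # cells per axis at this level
    offset = (8 ** level - 1) // 7     # number of nodes in all coarser levels
    return offset + ix + n * (iy + n * iz)


def lookup(store, cell, level):
    """Read a node, falling back to ancestors while reads fail."""
    while level >= 0:
        value = store.read(node_key(cell, level))
        if value is not None:
            return level, value
        cell = tuple(c >> 1 for c in cell)   # parent cell
        level -= 1
    return None
```

Falling back to the parent is what makes the structure usable only by conservative algorithms: the result is still valid, just coarser than the best available node.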
33
Probabilistic Octree Results
• Adaptive simplification shown on bunny (~4K tris)
  • Preserves detail around leg, eyes and ears
  • Simplifies significantly on large, flat regions
• Using 8% of storage of total tree, we have < 10% collisions
  • Only ~20% performance hit vs. standard grids
34
Conclusions
• GS is a powerful tool for interactive graphics
• Amplification and decimation are important applications of GS
35
Geometry Shaders and Other Feature Wish-List
• Bring back the Point fill mode
  • Important for scatter in GPGPU applications
• Data amplification improvements with indexed stream out
  • Avoiding triangle soups very non-trivial
• Efficient indexable temps
36
Thanks a lot!
• Various people here…
37
Questions?