Chapter 1
Introduction
Software applications in various fields, such as games, simulation, medicine, design, and engineering, need to display highly detailed images. These images frequently need to be dynamic and sometimes even interactive. One of the main goals of the computer graphics field is to
find ways to display such highly detailed, dynamic images at interactive frame rates.
If we look at another device that shows dynamic images – the television – we can see that it
displays 24 to 30 frames per second. At frame rates lower than 24, the viewer might
detect popping effects and discontinuities in the displayed images.
The television displays at each frame the image it receives from an input source such as
cable, satellite, antenna, or a video camera. In contrast, a computer usually has to create
these images at run-time. For a given model, it computes the viewed image from the position
and direction of a viewer or a camera. This process is called rendering.
The graphics hardware renders models that usually consist of computer graphics
primitives, such as vertices, edges, and polygons. The rendering time of a given model is
mainly determined by the number of polygons sent to the graphics hardware, and the
computational power of this hardware. This implies low frame rates when rendering large
models, such as terrains, industrial designs, and weather simulations.
To overcome such limitations, two approaches can be taken. The first approach is to use
more powerful graphics hardware. This solution is not sufficient, because even
the most powerful graphics hardware available has a limit on the number of polygons it can
render at interactive frame rates. Furthermore, the software has little control over these
limitations; the best it can do is use the pre-computed acceleration techniques
provided by the APIs (OpenGL or DirectX) to harness the computational
power of the hardware as fully as possible.
The second approach is to selectively reduce the number of polygons sent to the graphics
hardware by simplifying the geometry of the model. Here, the software can control this number
according to the size of the model and the computational power of the hardware. However,
there is a tradeoff, because sending only part of the model to the hardware implies that
the displayed image will be less detailed. If the model is not downsized appropriately, the result might
even be an inaccurate display that does not represent the original model faithfully.
This is why many algorithms for geometric simplification have been introduced in
the last decade. A large portion of these algorithms deals with view-dependent level-of-detail
rendering. These view-dependent algorithms reduce the number of rendered polygons by
reducing the resolution (the number of polygons per unit area) of the model. The level of detail
of each region of the model is selected according to the position of the region with respect to
the viewpoint. Regions close to the viewpoint remain at a high level of detail, while regions
far from the viewpoint are rendered at a much lower level of detail. These
algorithms reduce the size of the model, mostly in its less important areas; the result
is a higher frame rate with minimal damage to the quality and detail of the displayed image.
Level-of-detail algorithms depend mainly on the CPU, which is relatively slow and
often overloaded; these limitations make the CPU a bottleneck. In contrast, the graphics
hardware is usually faster than the CPU and less loaded. Another bottleneck occurs in the
communication between the CPU and the graphics hardware, due to the huge amount of data
sent to the hardware at each frame. Recent advances in the programmability of graphics
hardware allow us to relieve the CPU of some of the workload, and to reduce the
communication load as well. In this work we introduce two algorithms that harness the growing
power of current graphics hardware.
The first algorithm caches geometric data in the on-board memory of the graphics
hardware, thus reducing the data traffic between the CPU and the graphics hardware and helping
to relieve the communication bottleneck.
The second algorithm uses the enhanced programmability of current graphics hardware to
relieve the CPU of almost all level-of-detail computations. Most of the computations are done
in the graphics hardware, which implies better load balancing between the CPU and the
graphics hardware.
1.1 Graphic Hardware Background
1.1.1 What is a GPU?
GPU stands for "Graphics Processing Unit". The term was introduced by NVIDIA [21] in
the late 1990s, when the older terms were no longer an accurate description of the graphics
hardware in a PC.
A GPU is a specialized single-chip processor designed to draw 3D graphics. As such, it is
much faster than the CPU for typical tasks involving 3D graphics. It creates lighting effects
and transforms objects every time a 3D scene is redrawn. These are mathematically intensive
tasks which would otherwise put quite a strain on the CPU. Lifting this burden from the
CPU frees up cycles that can be used for other jobs.
1.1.2 The Potential of GPUs
Over the past five years, GPU technology has advanced at an incredible pace. The rendering
rate, measured in pixels per second, has approximately doubled every six months
during that period. Considering the heavy workload that CPUs already bear due to their
general-purpose usage, it makes sense to balance the load by letting
the GPU do more of the work.
The recent advances in GPU programmability and precision (32 bit floating point
throughout the pipeline) enable us to offload work from the CPU to the GPU, resulting in an
overall speedup in typical applications.
1.1.2.1 Computational Power
In 2004, Buck and Purcell [3] demonstrated that a fragment program running on the NVIDIA
GeForce FX 5900 achieved over 20 GFLOPS (giga floating-point operations per second);
compared with the theoretical 6 GFLOPS of a 3 GHz Pentium 4, it is clear that GPUs are
already faster than CPUs. Keeping in mind that CPU technology struggles to keep up
with Moore's law [20] (the doubling of transistor counts every couple of years), the doubling of
the rendering rate every six months suggests that GPUs have not only passed CPUs
in performance, but will also continue to outpace them in the future. This observation is
not surprising, for two major reasons.
The first reason is the specialized nature of GPUs, which makes it easier to use additional
transistors for their computations. Generating images is a highly parallel problem – graphics
hardware designers can repeatedly split the problem of creating realistic images into more
chunks of work that are smaller and easier to tackle. Hardware engineers can then arrange, in
parallel, the ever-greater number of transistors available to execute all these various chunks
of work.
The second reason is purely economic – the multi-billion dollar video game market is a
pressure cooker that drives innovation in this field.
1.1.2.2 Programmability
The dominant trend in graphics hardware design today is the effort to expose more
programmability within the GPU. Units of the graphics hardware pipeline that were
once at most configurable are becoming more and more programmable. The Vertex
Processor and the Fragment Processor units, which will be explained later in this chapter, are
already programmable, and their programmability increases with each generation of
GPUs.
Apart from the ability to program some of the units in the graphics hardware pipeline,
which is important in itself, the programming environment is very important too – diving
into the world of raw graphics hardware instructions is not much fun. In the last few years, several
languages for programming the GPU were introduced, such as Sh and Cg [11]. These
languages offer a friendly high-level environment that translates the user's programs into a
form that the GPU hardware can execute.
1.1.3 Historical Background
Prior to the introduction of GPUs, graphics hardware was specialized and expensive. Many
concepts, such as vertex transformation and texture mapping, were introduced in that era,
making those systems very important to the historical development of computer graphics; but
because they were so expensive, they never achieved mass-market success.
In the late 1990s the first generation of GPUs was introduced. When running most 3D and
2D applications, these GPUs completely relieved the CPU from updating individual pixels.
However, GPUs of that generation suffered from two clear limitations. First, they lacked the
ability to transform the vertices of 3D objects, so vertex transformations had to be done on the
CPU instead. Second, they had a limited set of math operations for combining textures to compute
the final color of the pixels.
In 1999 the second generation of GPUs was introduced. Fast vertex transformation was
the main improvement of these GPUs. The set of math operations for combining
textures and coloring pixels was also expanded, which made this generation more
configurable, but still not truly programmable.
In 2001 the third generation of GPUs was introduced. These GPUs let the application
specify a sequence of instructions for processing vertices, thereby providing vertex
programmability rather than merely offering more configurability. Considerably more pixel-
level configurability was available, but these modes were not powerful enough to be
considered truly programmable. Because these GPUs supported vertex programmability but
lacked true pixel programmability, this generation was a transitional one.
In 2002 the fourth and current generation of GPUs was introduced. These GPUs provide
both vertex-level and pixel-level programmability. This level of programmability opens up
the possibility of offloading complex vertex transformation and pixel-shading operations
from the CPU to the GPU.
All new graphics cards support Shader Model 3.0, which provides cutting-edge
programmability for both the vertex and pixel processors.
1.1.4 The Graphics Hardware Pipeline
Figure 1.1: The graphics hardware pipeline.
1.1.4.1 Vertex Transformation
The Vertex Transformation stage (sometimes referred to as T&L – Transform and Lighting)
performs a sequence of math operations on each vertex. Its input is the list of vertices
received from the software running on the CPU. Each vertex has a position, and usually
several other attributes such as a color, a secondary color, texture coordinates and a normal
vector.
The operations performed on each vertex in this stage include transforming the vertex
position into an image position, generating texture coordinates for texturing, and lighting the
vertex to determine its color.
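As a rough illustration of this stage, the sketch below applies a combined model-view-projection matrix to a vertex position and computes a simple diffuse lighting term. The function names and data layout are ours for illustration only; they are not part of any graphics API, and real hardware performs these operations in dedicated units.

```python
# Sketch of the Vertex Transformation stage: each vertex position is
# multiplied by a model-view-projection matrix, and a simple Lambertian
# term lights the vertex. Illustrative only.

def mat_vec(m, v):
    """Multiply a 4x4 matrix (row-major nested lists) by a 4-vector."""
    return [sum(m[i][j] * v[j] for j in range(4)) for i in range(4)]

def transform_vertex(mvp, position):
    """Transform an object-space position (x, y, z) to clip space."""
    x, y, z = position
    return mat_vec(mvp, [x, y, z, 1.0])

def diffuse_light(normal, light_dir):
    """Lambertian diffuse intensity, clamped to [0, 1]."""
    dot = sum(n * l for n, l in zip(normal, light_dir))
    return max(0.0, min(1.0, dot))

# Example: an identity MVP matrix leaves positions unchanged.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
clip_pos = transform_vertex(identity, (1.0, 2.0, 3.0))
```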
Figure 1.2: Transformation of vertex positions to image positions, and lighting of the vertices.
1.1.4.2 Primitive Assembly
The Primitive Assembly stage assembles vertices into geometric primitives. Its input is the
list of transformed vertices output by the Vertex Transformation stage, along with the
vertex connectivity information received from the software running on the CPU.
This stage assembles the transformed vertices into geometric primitives based on the
primitive batching information that accompanies the sequence of original
vertices. The result is a sequence of triangles, lines, and points.
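The grouping performed by this stage can be sketched as follows: a flat index list plus a batching mode yields a list of primitives. The mode names and the dropping of incomplete trailing primitives are illustrative assumptions, not the behavior of any specific API.

```python
# Sketch of the Primitive Assembly stage: transformed vertices plus
# connectivity (an index list with a batching mode) are grouped into
# primitives. Illustrative only; real hardware does this in fixed units.

def assemble(indices, mode):
    """Group a flat index list into primitives.

    mode: 'points' -> 1 index each, 'lines' -> 2, 'triangles' -> 3.
    An incomplete trailing primitive is dropped.
    """
    size = {'points': 1, 'lines': 2, 'triangles': 3}[mode]
    return [tuple(indices[i:i + size])
            for i in range(0, len(indices) - size + 1, size)]

tris = assemble([0, 1, 2, 0, 2, 3], 'triangles')  # two triangles of a quad
```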
Figure 1.3: Primitive Assembly stage.
1.1.4.3 Rasterization
The Rasterization stage determines the set of pixels or fragments covered by a geometric
primitive. Its input is a triangle, line, or point output by the Primitive Assembly stage.
First, the primitive may require clipping to the view frustum, i.e., discarding a primitive that
is completely outside the field of view, or truncating a primitive that is only partially inside it.
After clipping, a primitive may also be discarded in a process known as backface
culling, in which a polygon is discarded if it faces away from the viewpoint.
A primitive that survives the clipping and culling steps is then rasterized. Polygons, lines,
and points are each rasterized according to rules specified for that type of primitive. The
result of rasterization is a set of pixel positions as well as a set of fragments. The pixel
positions of a primitive are the pixels that will actually be lit on the screen if the primitive
is displayed. A fragment is exactly the size of a pixel, but it has additional attributes
associated with it, such as a depth value, a color, a secondary color, and texture coordinates.
These parameters of a fragment are derived from the transformed vertices that make up the
geometric primitive used to generate the fragment. A fragment is in fact a potential pixel:
if it passes various tests in the Raster Operations stage, the fragment updates a pixel in the
frame buffer.
Figure 1.4: Rasterization stage. Note that before the Rasterization actually occurred, the back face of the pyramid (the ABD face) was discarded in the backface culling process.
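A minimal way to picture triangle rasterization is the edge-function coverage test sketched below: a pixel center is covered when it lies on the same side of all three edges. The sampling at pixel centers and the symmetric inside test are simplifying assumptions; hardware rasterizers follow precise fill rules that we do not model here.

```python
# Sketch of triangle rasterization via edge functions. Illustrative only.

def edge(a, b, p):
    """Signed area term; its sign tells which side of edge ab point p is on."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def rasterize(v0, v1, v2, width, height):
    """Return the set of integer pixels whose centers lie in the triangle."""
    covered = set()
    for y in range(height):
        for x in range(width):
            p = (x + 0.5, y + 0.5)  # sample at the pixel center
            w0, w1, w2 = edge(v1, v2, p), edge(v2, v0, p), edge(v0, v1, p)
            # covered if the center is on the same side of all three edges
            if (w0 >= 0 and w1 >= 0 and w2 >= 0) or \
               (w0 <= 0 and w1 <= 0 and w2 <= 0):
                covered.add((x, y))
    return covered

pixels = rasterize((0, 0), (4, 0), (0, 4), 8, 8)
```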
1.1.4.4 Fragment Interpolation, Texturing, and Coloring
The Fragment Interpolation, Texturing, and Coloring stage determines the final color of
each fragment. Its input is a fragment output by the Rasterization stage.
The fragment's parameters are interpolated as necessary, and then a sequence of texturing and
math operations is performed to determine the final color of the fragment. In addition, this
stage may also determine a new depth for the fragment, or even discard the fragment.
Figure 1.5: Interpolation and coloring of the fragments.
1.1.4.5 Raster Operations
The Raster Operations stage performs a final sequence of operations before the frame buffer
is updated. Its inputs are the finalized fragments.
During this stage, a series of tests is performed on each fragment. If any test fails, the
stage discards the fragment without updating the pixel's color value. These tests include the
scissor, alpha, stencil, and depth tests; the latter eliminates hidden surfaces according to
their depth.
After the tests, a blending operation combines the final color of the fragment with the
corresponding pixel's color value.
Finally, a frame buffer write operation replaces the pixel's color with the new blended
color.
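The depth test, blending, and frame buffer write described above can be sketched for a single fragment as follows. The dictionary-based buffers and the simple alpha blend are our own illustrative choices, not the actual hardware data layout.

```python
# Sketch of the Raster Operations stage for one fragment: depth test,
# alpha blending, then a frame buffer write. Illustrative only.

def raster_op(frame, depth, frag, alpha=1.0):
    """Apply the depth test and alpha blending for one fragment.

    frag: (x, y, z, color) with color an (r, g, b) tuple in [0, 1].
    Returns True if the fragment updated the frame buffer.
    """
    x, y, z, color = frag
    if z >= depth[(x, y)]:           # depth test: keep only closer surfaces
        return False                  # fragment discarded, pixel untouched
    old = frame[(x, y)]
    frame[(x, y)] = tuple(alpha * c + (1.0 - alpha) * o
                          for c, o in zip(color, old))
    depth[(x, y)] = z                 # record the new, closer depth
    return True

frame = {(0, 0): (0.0, 0.0, 0.0)}    # one black pixel
depth = {(0, 0): 1.0}                # far plane
raster_op(frame, depth, (0, 0, 0.5, (1.0, 0.0, 0.0)))  # passes, writes red
```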
Figure 1.6: The finalized fragments are turned into pixels and are written to the frame buffer.
1.1.5 GPU Programmability
Over the years, the programmability of the GPU has steadily increased. The third
generation of GPUs introduced the programmable Vertex Processor, whereas the fourth (and
current) generation of GPUs introduced the programmable Fragment Processor, and both
processors gain more programmability with each new Shader Model that reaches the
market.
The task of programming the GPU is also getting easier with the help of newly developed
high-level languages specially designed for GPUs, such as Cg [11].
1.1.5.1 Vertex Processor
The Vertex Processor, also known as the Vertex Shader, corresponds to the Vertex
Transformation stage of the graphics pipeline. The third generation of GPUs introduced the
Vertex Processor, which made programmable the basic transformation and lighting
operations that were previously only configurable.
Each vertex's attributes, such as position, color, normal, and texture coordinates, are
loaded into the Vertex Processor. The Vertex Processor then repeatedly fetches instructions
from the vertex program. The instructions access a set of registers that contain vector values,
such as position, normal, or color. These registers are read-only, but the results of the
computations can be written to the output registers, which are write-only. When the vertex
program terminates, the output registers contain the newly transformed vertex. Intermediate
results can be read from and written to a set of temporary registers also available
in the Vertex Processor.
The new Shader Model 3.0 [10] allows vertex programs many more instructions
(65,535 instead of only 256), dynamic flow control, geometry instancing, and vertex texture
fetch, which enables displacement mapping and vertex texturing.
1.1.5.2 Fragment Processor
The Fragment Processor, also known as the Fragment Shader, corresponds to the Fragment
Interpolation, Texturing, and Coloring stage of the graphics pipeline. The fourth generation of
GPUs introduced the Fragment Processor, which made programmable the basic
interpolation, texturing, and coloring operations that were previously only configurable.
Each fragment has parameters that are derived and interpolated from the parameters of the
vertices of that fragment's primitive. These parameters are stored in the input registers, which
are read-only for the fragment program. The Fragment Processor repeatedly fetches
instructions from the fragment program. The instructions access the input registers, and use a
set of temporary registers to hold intermediate results. The instructions also include
texture fetches, and the textures can be both read from and written to. The final color, and
optionally a new depth value for the fragment, are stored in the output registers, which are
write-only. A fragment can also be discarded by the fragment program.
The new Shader Model 3.0 [10] allows fragment programs many more instructions
(65,535 instead of only 96), dynamic flow control, and the use of loops, branches, and
subroutines, but almost all of the new features are costly and may decrease performance.
1.1.5.3 Cg
The two programmable processors in the GPU require the application programmer to supply
a program for each processor to execute. Cg [11] provides a language and a compiler that
translate the user's shading algorithm into a form that the GPU's hardware can execute.
The Cg (C for graphics) language closely resembles C [19] and follows C's
philosophy, in that it is a hardware-oriented, general-purpose language rather than an
application-specific shading language. It supports both of the major 3D graphics APIs:
OpenGL and Direct3D. This general-purpose nature is the most interesting aspect of the
language, and it can be used to achieve unconventional goals that a traditional
shading language is not capable of.
1.1.6 Bottlenecks
“A chain is only as strong as its weakest link” – as with chains, the same rule applies to
computer hardware: the speed of a computer is determined by its slowest component.
When dealing with computer graphics we can refer to three major components – the CPU, the GPU,
and the communication between them. Finding the bottlenecks among these components,
and within each of them, might help us remove them, thus increasing the overall speed of our
computer graphics applications.
1.1.6.1 CPU
As mentioned previously, GPUs already have more computational power than CPUs, and
the gap between them continues to grow. However, because of the specialized nature of
GPUs, many computer graphics related tasks are still done on the CPU. This is probably the
biggest bottleneck in computer graphics. The increased programmability and general-purpose
nature of the new GPUs suggest that more and more tasks can be moved from the CPU
to the GPU, thereby helping to remove the biggest bottleneck – the CPU.
1.1.6.2 GPU
While the GPU is clearly not the main bottleneck, it can still have smaller bottlenecks in its
inner components. Detecting and removing those bottlenecks helps increase the GPU's
speed, and the overall speed of the system. The graphics hardware (GPU) pipeline consists of
five stages:
Vertex Transformation
Primitive Assembly
Rasterization
Fragment Interpolation, Texturing, and Coloring
Raster Operations
It might appear easy to find which of these five stages takes most of the computation time.
Once the problematic stage is found, its efficiency could be improved by the hardware
manufacturers. Alternatively, software programmers could design their applications to
do less work in that particular stage. However, this task is not as easy as it
seems. Each of these stages has its own hardware component in the GPU, and all the
components run in parallel. Because of that, we cannot quite put our finger on a clear
general-case bottleneck in the GPU. Nevertheless, we can find bottlenecks in
specific applications. For instance, when playing games like Doom 3 or Half-Life 2, there
will very likely be a very large polygon count in every frame, so the first two stages
will be quite stressed. On the other hand, in a flight simulator, where the polygon
count is usually lower but the screen is still filled with fragments, the stages involving
fragments will probably have more work than the others. These examples indicate that a
clear general bottleneck cannot be found within the GPU.
1.1.6.3 CPU/GPU Communication
When a lot of data is transferred from the CPU to the GPU, the communication
between them easily becomes a bottleneck. This is exactly why Intel introduced
the Accelerated Graphics Port (AGP) back in 1996 [1].
AGP is a high-performance connection between a designated chipset and the graphics
controller used to enhance graphics performance for 3D applications. AGP relieves the
communication bottleneck by adding a dedicated high-speed interface directly between the
chipset and the graphics controller.
Figure 1.7: AGP connection. Courtesy of [1].
AGP uses the main PC memory to hold 3D data sets, a scheme that allows AGP to
use an “unlimited” amount of texture memory. To speed up the data transfer, Intel designed the
port as a direct path to the PC's main memory, so AGP is in fact a point-to-point connection
between the graphics card, the system memory, and the CPU. This enables storing very large
textures in main system memory instead of in the limited on-board texture memory of
the graphics card.
Because AGP is very fast and enables large amounts of data to be stored, it is often used as
a caching device for graphics applications. Not only texture data, but even geometric data
can be cached via AGP. However, although this caching scheme gives much better results,
the cache is still not in the on-board memory of the graphics card. This
means that no matter how fast the AGP connection is, a large amount of data passing over it
might still create a bottleneck.
1.2 Level-of-Detail Rendering
A 3D scene that needs to be rendered may contain millions of polygons. A polygon count of
this magnitude exceeds what current graphics hardware can render at
reasonable frame rates. With each new generation of hardware this limit grows, but so does
the size of the scenes to be rendered; no matter how fast graphics hardware capabilities
grow, the hunger for bigger models grows too. A solution to this problem is to simplify
the complexity of the scene by reducing the number of polygons sent to the graphics
hardware to match its rendering capability.
Many algorithms construct off-line several levels of detail for each object in the scene,
meaning the same object is represented several times, each time at a different resolution. At
run-time one of these levels is chosen based on parameters such as view position and
angle. Objects close to the viewer are rendered at a high level of detail, while objects far from
the viewer are rendered at a low level of detail.
Two main approaches of level of detail (LOD) rendering have been developed – discrete
and continuous.
1.2.1 Discrete Levels of Detail
Discrete levels of detail are obtained by generating off-line a fixed number of distinct levels
of detail for each object. At run-time the most appropriate level of detail is selected for each
object in the scene, and the polygons representing the chosen level are sent to the
graphics hardware for rendering.
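The run-time selection step can be sketched as a simple distance-based lookup: the object's distance from the viewer determines which precomputed level is rendered. The threshold values and level counts below are arbitrary illustrative assumptions.

```python
# Sketch of run-time selection among discrete levels of detail.
# Thresholds are hypothetical switch distances, not values from any paper.

def select_lod(distance, thresholds):
    """Return the LOD index for a distance; 0 is the most detailed level.

    thresholds: increasing distances at which the next coarser level starts.
    """
    for level, limit in enumerate(thresholds):
        if distance < limit:
            return level
    return len(thresholds)  # coarsest level beyond the last threshold

levels = [10.0, 50.0, 200.0]         # hypothetical switch distances
lod_near = select_lod(5.0, levels)    # a close object: finest level
lod_far = select_lod(500.0, levels)   # a distant object: coarsest level
```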
Figure 1.8: Discrete levels of detail for the Armadillo model: a) 249,924 triangles, b) 62,480 triangles, c) 7,809 triangles, d) 975 triangles. Courtesy of [5].
One way to generate the various levels of detail for an object off-line is the vertex
removal technique introduced by Schroeder, Zarge, and Lorensen [25]. Their technique
removes a vertex along with its adjacent triangles, and then triangulates the resulting hole.
Starting from the original representation of the object, which is the highest level of detail,
vertices are removed one by one until all requested levels of detail have been created.
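A minimal sketch of a single vertex removal step follows, using an indexed triangle list. Closing the hole with a naive triangle fan is our simplifying assumption; Schroeder et al. triangulate the hole far more carefully to control the geometric error.

```python
# Sketch of one vertex removal step: drop the vertex and its incident
# triangles, then close the hole with a fan from one ring vertex.
# Illustrative only; real implementations triangulate the hole carefully.

def remove_vertex(triangles, v, ring):
    """Remove vertex v, given its ring of neighbors in boundary order."""
    kept = [t for t in triangles if v not in t]
    # naively fan-triangulate the hole from the first ring vertex
    hole = [(ring[0], ring[i], ring[i + 1]) for i in range(1, len(ring) - 1)]
    return kept + hole

# vertex 4 sits inside ring 0-1-2-3; four triangles touch it
tris = [(0, 1, 4), (1, 2, 4), (2, 3, 4), (3, 0, 4)]
simplified = remove_vertex(tris, 4, [0, 1, 2, 3])
```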
Figure 1.9: Vertex removal technique: a) before vertex removal, b) after vertex v is removed.
A detailed overview of various discrete level-of-detail rendering algorithms is
given by Cignoni, Montani, and Scopigno [6].
These algorithms are suitable for complex scenes that consist of many objects. However,
if an object is highly detailed and viewed from close range, the algorithm will have to
choose a high level-of-detail representation of that object, implying little or even no
simplification of the scene. If the algorithm chooses a lower level of detail for that close,
highly detailed object, the result is a poor representation of the object.
Another problem with these algorithms is that adjacent objects can be represented at
different levels of detail. This difference is often visible to the viewer, undermining the
plausibility of the entire scene.
These clear drawbacks have made discrete level-of-detail rendering algorithms obsolete.
1.2.2 Continuous Levels of Detail
Continuous level-of-detail rendering algorithms are designed to deal with the problems that
the discrete algorithms introduced. These algorithms allow various levels of detail
to co-exist in different regions of the same object.
The changes in the rendered model between frames are very subtle due to the continuous
nature of these algorithms. Also, the simplification operator must have a dual operator,
ensuring that a model that has been continuously simplified can also be continuously refined
until it reaches its original geometric structure.
Hoppe [14] introduced the progressive meshes scheme, which uses the edge collapse
operation to continuously simplify the model. The edge collapse operation unites a
chosen pair of adjacent vertices into one new vertex, thus removing the edge between them.
The dual operation of the edge collapse is the vertex split. A vertex split separates a vertex into
two vertices and reinserts the edge that was originally between them.
Figure 1.10: Edge collapse and vertex split operations.
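These two operations can be sketched on an indexed triangle list as follows. For simplicity our collapse record stores the whole previous mesh, whereas a real progressive mesh stores only the local change; the data layout is an illustrative assumption.

```python
# Sketch of the edge collapse operation and its dual, the vertex split.
# Collapsing edge (u, v) merges v into u; triangles that degenerate
# (lose a distinct vertex) disappear. Illustrative only.

def edge_collapse(triangles, u, v):
    """Collapse edge (u, v). Returns (simplified mesh, record for the dual)."""
    record = (u, v, list(triangles))   # naive record: the whole prior mesh
    new_tris = []
    for t in triangles:
        t2 = tuple(u if i == v else i for i in t)
        if len(set(t2)) == 3:          # drop triangles that degenerate
            new_tris.append(t2)
    return new_tris, record

def vertex_split(record):
    """The dual operation: restore the mesh as it was before the collapse."""
    u, v, old_tris = record
    return old_tris

mesh = [(0, 1, 2), (1, 3, 2), (1, 4, 3)]
simplified, rec = edge_collapse(mesh, 1, 3)   # merge vertex 3 into vertex 1
restored = vertex_split(rec)                  # refine back to the original
```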
These two dual operators enable continuous changes in the level of detail of the model.
Whenever simplification of the model is desired, a sequence of edge collapse operations takes
place; when the model needs to be refined back, these edges are restored in reverse order by
matching vertex split operations. The collapsing edges are chosen carefully to avoid edge
collapse operations that generate geometric errors such as foldovers.
1.2.3 View-Dependent Rendering
In the previous section we described a scheme that achieves continuous LOD rendering.
However, we did not explain how to choose the regions to be simplified. View-
dependent rendering algorithms choose the appropriate level of detail of each region in the
model with respect to the view parameters, in real time. Most of these algorithms rely on an off-
line construction of the continuous levels of detail. At run-time, an adaptive level is selected
for each region according to some or all of the following parameters:
Distance – the distance between the object and the viewer. Regions close to the
viewer are represented by a higher resolution than those farther from the viewer.
Visibility – back-facing polygons and polygons that are outside the view frustum
have a very coarse representation.
Illumination – illuminated regions are more detailed than regions in shadow.
Silhouette – vertices on the silhouette are very important for the reliability of an
image. Therefore, the silhouette is represented with a very high resolution.
Screen-space projection – objects that contribute only a few pixels to the final image
are represented in a much lower resolution than objects that cover most of the image.
Figure 1.11: A view-dependent representation of a sphere. Its adaptive levels of detail are based on the distance, visibility, and silhouette parameters. Courtesy of [26].
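A crude way to combine some of these parameters is to map them to a single detail value per region, as in the sketch below. The particular formula, weights, and thresholds are entirely our own illustrative assumptions; published algorithms use more principled error metrics.

```python
# Sketch of a per-region view-dependent detail score combining distance,
# view-frustum visibility, and screen-space projection. The formula and
# constants are hypothetical, for illustration only.

def desired_detail(distance, in_frustum, projected_pixels):
    """Return a detail value in [0, 1]; higher means finer resolution."""
    if not in_frustum or projected_pixels == 0:
        return 0.0                       # out of view: coarsest level
    closeness = 1.0 / (1.0 + distance)   # closer regions get more detail
    coverage = min(1.0, projected_pixels / 10000.0)
    return closeness * coverage

near = desired_detail(0.0, True, 20000)    # close and fully visible
far = desired_detail(99.0, True, 20000)    # same coverage, far away
```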
These parameters usually do not change drastically between consecutive frames; therefore
view-dependent rendering exhibits significant coherence. Many view-dependent rendering
algorithms exploit this coherence to calculate only the changes between frames, avoiding
recalculation of the levels of detail of the entire model in each frame.
1.2.4 Vertex Hierarchy
As mentioned in the previous section, view-dependent rendering algorithms usually rely on
an off-line construction of the levels of detail. Most of these algorithms use a hierarchical
data structure called a vertex hierarchy. The vertex hierarchy is in fact a tree of vertices: the
vertices close to the root correspond to a coarse representation, while vertices farther
from the root represent a higher level of detail. The leaves of the vertex hierarchy are
the vertices of the original model. At run-time, a cut through the vertex hierarchy defines the
vertices that will be rendered in each frame. These vertices are called the active nodes.
Figure 1.12: Active nodes in a vertex hierarchy.
The view-dependence trees presented by El-Sana and Varshney [9] are a good example of a
vertex hierarchy. Usually several trees are used to represent a complex model, where
each tree represents a single object or a group of objects. A view-dependence tree is
constructed bottom-up by recursively applying the edge collapse operation, starting from the
unsimplified original representation of the object.
Figure 1.13: Edge collapse and vertex split operations on view-dependence trees for the mesh in figure 1.10.
At run-time the active nodes of the tree are chosen. A list of active vertices, represented by
the chosen active nodes, is sent to the graphics hardware for rendering, along with a list of
active triangles used to create the image. Due to the coherent nature of this algorithm, the
active nodes are not recalculated every frame. Instead, the algorithm traverses the active-nodes
cut of the previous frame. For each active node it decides whether the node needs to be
simplified or refined, based on the viewing parameters introduced in the previous section.
If the level of detail of a node needs to be reduced, its parent is added to the active nodes.
On the other hand, if refinement is needed, the node's children are added to the active nodes.
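The frame-to-frame update of the active-nodes cut can be sketched as below: each active node is kept, replaced by its parent (simplify), or replaced by its children (refine). The tree layout and the `decide` callback are illustrative assumptions standing in for the view-parameter tests described above.

```python
# Sketch of the coherent per-frame update of the active-nodes cut in a
# vertex hierarchy. Illustrative data structures only.

def update_cut(active, parent, children, decide):
    """Advance the active-nodes cut by one step per node.

    decide(node) returns 'simplify', 'refine', or 'keep' (a stand-in
    for the view-dependent tests on distance, visibility, and so on).
    """
    new_active = set()
    for node in active:
        action = decide(node)
        if action == 'simplify' and node in parent:
            new_active.add(parent[node])        # move up: coarser detail
        elif action == 'refine' and node in children:
            new_active.update(children[node])   # move down: finer detail
        else:
            new_active.add(node)
    return new_active

# tiny hierarchy: 0 is the root with children 1 and 2; 1 has leaves 3, 4
parent = {1: 0, 2: 0, 3: 1, 4: 1}
children = {0: {1, 2}, 1: {3, 4}}
cut = update_cut({1, 2}, parent, children,
                 lambda n: 'refine' if n == 1 else 'keep')
```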
1.2.5 Terrain Rendering
Terrain models are usually bigger and more complex than other types of models. Their size
forces aggressive simplification in order to reach interactive frame rates when
rendering them. The advantage of terrain models is that they are not full 3D models and
are sometimes referred to as 2.5D models, because a terrain can be represented as a 2D
elevation map. This unique representation enables the use of special simplification
algorithms that cannot be applied to a general 3D model.
One such algorithm is ROAM (Real-time Optimally Adapting Meshes), presented by
Duchaineau et al. [7]. ROAM treats a terrain model as a rectilinear elevation
map. The algorithm builds off-line a triangle binary tree data structure, in which each node
represents a triangle on the elevation map. The two children of a triangle in the binary tree
are formed by splitting the triangle at its base edge.
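The off-line construction of such a triangle binary tree can be sketched as follows: each triangle is bisected at the midpoint of its base edge, doubling the number of leaves at each level. The vertex ordering convention and the uniform-depth recursion are our own simplifications of the structure described in the paper.

```python
# Sketch of the ROAM-style triangle binary tree: each node is a right
# triangle and splitting bisects it at the midpoint of its base edge
# (taken here as the edge between 'left' and 'right'). Illustrative only.

def split(apex, left, right):
    """Split a triangle at the midpoint of its base edge (left, right)."""
    mid = ((left[0] + right[0]) / 2.0, (left[1] + right[1]) / 2.0)
    # the two children share the midpoint as their new apex
    return (mid, apex, left), (mid, right, apex)

def bintree(apex, left, right, depth):
    """Recursively build the tree to a uniform depth; return the leaves."""
    if depth == 0:
        return [(apex, left, right)]
    a, b = split(apex, left, right)
    return bintree(*a, depth - 1) + bintree(*b, depth - 1)

# one right triangle of a unit map square, refined two levels deep
leaves = bintree((0.0, 0.0), (0.0, 1.0), (1.0, 0.0), 2)
```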
At run-time the preprocessed binary tree is used to build the adaptive triangle mesh for
each frame. The ROAM algorithm takes the distance of a region from the view point as a
parameter for defining the level of detail that this region will be presented in. It also uses the
planarity of the region as a parameter. A flat region will be represented more coarsely than a
region with a rough neighborhood.
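The off-line construction of the triangle binary tree can be sketched as below. A triangle is stored as (apex, left, right) with its base edge running from left to right; the child vertex ordering follows one common convention and is an assumption, not taken from [7].

```python
def midpoint(a, b):
    """Midpoint of the base edge, the vertex inserted by a split."""
    return ((a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0)

def build_bintree(apex, left, right, depth):
    """Recursively split a triangle at its base edge to the given depth."""
    node = {"tri": (apex, left, right), "children": None}
    if depth > 0:
        m = midpoint(left, right)
        node["children"] = (
            build_bintree(m, apex, left, depth - 1),   # left child
            build_bintree(m, right, apex, depth - 1),  # right child
        )
    return node

def count_leaves(node):
    if node["children"] is None:
        return 1
    return sum(count_leaves(c) for c in node["children"])
```

Each split doubles the triangle count, so a tree of depth d holds 2^d leaf triangles, from which the run-time stage assembles the adaptive mesh.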
Figure 1.14: ROAM terrain. Courtesy of [7].
1.2.6 Cluster Hierarchy
Level-of-detail rendering algorithms often fail to select the appropriate level of detail for
very large datasets within the span of one frame. Such limitations occur due to the often
overloaded CPU, which becomes the main bottleneck with traditional level-of-detail
rendering algorithms. To overcome this problem researchers have developed aggressive
refinement operators based on cluster hierarchy.
The Quick-VDR algorithm introduced by Yoon et al. [30] represents the dataset as a
clustered hierarchy of progressive meshes. The cluster hierarchy is used for coarse-grained
selective refinement, whereas fine-grained local refinement is obtained using progressive
meshes [14]. Using cluster hierarchy compared to vertex hierarchy reduces the refinement
cost for view-dependent rendering by more than an order of magnitude.
Figure 1.15: The right inset image shows clusters in color from a 64K cluster decomposition of Michelangelo’s St. Matthew model. Courtesy of [30].
Chapter 2
Related Work
Level-of-detail rendering algorithms appeared in the 1990s. Because of the very limited
capabilities of graphics hardware at the time, these algorithms perform all their work on the
CPU and treat the hardware as a “black box”. They are built on the now-obsolete assumption
that graphics hardware is not programmable, and therefore their most important task is to
reduce the amount of geometry sent to the hardware. That is why there are very few works
that combine level-of-detail rendering with hardware considerations.
In the late 1990s, after the AGP [1] was introduced by Intel, the possibility of geometry
caching appeared. A few algorithms that combine level-of-detail rendering with geometry
caching using the AGP have been introduced since then. In 2004, after vertex texture fetches
were enabled, displacement mapping in the vertex shaders became practical. Because this
ability is so new, very little work has been done to combine displacement mapping with LOD
rendering. We will review the limited number of algorithms that combine level-of-detail
rendering with either geometry caching or displacement mapping.
2.1 Geometry Caching
Little work exists on geometry caching, due to the restricted nature of the hardware until
recently. With the introduction of AGP, caching geometry became practical. We will review a
work that covers techniques for accelerating real-time graphics; this establishes the
effectiveness of caching using AGP. Next, we will see three strategies for managing cached
geometry, and finally we will review an algorithm for level-of-detail rendering that harnesses
the ability to cache geometry.
2.1.1 Accelerating Real-Time Graphics
In his work, Perrson [24] reviewed several acceleration methods for real-time graphics.
Vertex Arrays is the first method introduced in this work. A block of vertex data (vertex
coordinates, texture coordinates, normals, RGBα colors, color indices, and edge flags) may
be stored in an array and then used to specify multiple geometric primitives through the
execution of a single OpenGL command. This simple method proves to be up to 3 times
faster than the immediate mode that uses the glBegin and glEnd functions. Several OpenGL
commands can be preprocessed and packed together into a Display List in order to achieve
improved efficiency. However, when packing Vertex Arrays into Display Lists the overall
efficiency hardly improves, and it even declines in some cases due to the small overhead
implied when using Display Lists.
Another acceleration method is the VBO (Vertex Buffer Object) extension introduced by
NVIDIA. These Vertex Buffer Objects use high-performance graphics memory, mainly
AGP, instead of standard memory that uses the regular bus. VBO actually caches the vertices
in AGP memory, and by doing so it not only reduces the per-frame memory operations, but
also uses a faster bus (AGP) to transfer the data.
The tests appearing in Perrson’s [24] work show that VBO performs up to 15 times faster
than the immediate mode, and about 5 times faster than the Vertex Array method.
Figure 2.1: Megavertices rendered per second with immediate mode, VA, and VBO. Courtesy of [24]
2.1.2 Cached Geometry Manager
Caching geometric data using the AGP is, as we have seen, feasible and effective. However,
sometimes the geometric data is too large to fit into the fast-access memory. For that reason
Lario, Pajarola, and Tirado [16] have introduced the Cached Geometry Manager (CGM).
New vertices need to be displayed whenever the level of detail of some part of the model
changes. Some of these vertices might not be cached already, and therefore they must be
transferred from main memory. When AGP memory is full, a removal operation is needed to
free up space for new vertices. Inactive but cached vertices are the prime candidates for
removal, and three strategies are given to choose the best vertices to remove.
The first and simplest strategy is First-Available (FA). The cached memory is used as a
linear list of slots, and when a slot is needed the first available slot is chosen. An available
slot is a slot that stores a vertex that was not used in the current or last frame. The problem
with this strategy is that it does not consider the number of frames the vertex was not used.
The second strategy handles an ordinary Least-Recently-Used (LRU) list, where the
vertex chosen to be removed from cache is the one that was used least recently among all the
vertices stored in cache at the moment.
The third strategy (LRU + Error-PriorityQueue) uses the hierarchical nature of vertices in
level-of-detail algorithms. Statistics show that vertices belonging to a coarse level of detail
have a greater chance of being used than vertices belonging to a finer level of detail. This fact
is used to maintain a priority queue in place of the last 10% of the LRU list. When needed,
the queue takes the least recently used vertex from the list. Of all the vertices in the priority
queue, the one with the greatest priority, representing the finest level of detail, is chosen for
removal.
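The plain LRU strategy (the second one above) can be sketched with an ordered dictionary. The slot interface and capacity handling are illustrative assumptions; a real manager would also pin vertices used in the current or previous frame, as the First-Available strategy requires.

```python
from collections import OrderedDict

class LRUVertexCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = OrderedDict()  # vertex id -> slot data, oldest first

    def touch(self, vid, data=None):
        """Mark a vertex as used this frame, caching it if necessary.
        Returns the id of the evicted vertex, if any."""
        if vid in self.slots:
            self.slots.move_to_end(vid)   # now most recently used
            return None
        evicted = None
        if len(self.slots) >= self.capacity:
            evicted, _ = self.slots.popitem(last=False)  # least recently used
        self.slots[vid] = data
        return evicted
```

For example, with a capacity of two slots, touching vertices a, b, a and then c evicts b, since a was used more recently than b.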
The three strategies were tested on two different view-dependent level-of-detail
frameworks. The FastMesh [22] framework was used for arbitrary meshes, while QuadTIN
[23] was used to achieve better results with terrain models.
2.1.3 Cached Aggregated Binary Triangle Trees
Caching data is always possible, but if large portions of the data remain constant through
several frames, the caching becomes much more effective. When using VBO on a regular
level-of-detail terrain algorithm, we find improved results due to the caching of the vertices.
Nevertheless, we still have continuous changes to the list of displayed vertices, and therefore
we do not fully harness the clear advantages of caching.
Levenberg [17] introduced the Cached Aggregated Binary Triangle Trees (CABTT)
algorithm for that reason. The CABTT algorithm uses the same binary tree data structure as
the ROAM [7] algorithm. The difference is that instead of using a single triangle in each
node of the binary tree, CABTT uses a cluster of geometry called aggregate triangle. CABTT
uses a fixed triangulation for every cluster or aggregate triangle. Each cluster edge is divided
into 2^N segments. This number of segments per cluster edge ensures that no T-junctions
appear when aggregate triangles of different levels are adjacent. The best results were found
when each cluster edge was divided into 16 segments. This particular triangulation yields 206
triangles per cluster.
Figure 2.2: A 2049 X 2049 height field rendered with CABTT. An example of an aggregate triangle is highlighted. Courtesy of [17].
Because each triangle is replaced by a fixed triangulation of 206 triangles, the binary
triangle tree becomes much shallower, hence less work is done on the CPU for a given level
of detail. This of course comes with a little cost in the precision of the triangles in each
cluster. When there is a change in the level of detail, instead of individual triangles changing,
there are clusters of triangles moving out of cached memory and different clusters replacing
them. This scheme enables very good caching using the AGP. The cached geometry manager
shown before can be used to handle the caching.
2.2 Displacement Mapping with Various Levels of Detail
Displacement mapping [28] is a technique that modifies the vertices of an object so that
during the rendering process, the object's geometry is altered to create a bumpy surface.
Unlike regular bump mapping, the raised features are actual geometry and can cast shadows.
The values of a two-dimensional texture control the degree to which the geometry of the
object is displaced. Displacement mapping only alters the object's geometry in the rendered
image and not in the scene itself, so highly complex objects can be created without having to
actually model them.
Figure 2.3: Example of displacement mapping: (a) original mesh, (b) displacement map, (c) mesh with displacement. Courtesy of [28].
For years displacement mapping was a peculiarity of high-end rendering systems, while
real-time APIs, like OpenGL and DirectX, lacked this capability. One of the reasons for this
absence is that the original implementation of the displacement mapping required an adaptive
tessellation of the surface in order to obtain micro-polygons whose size matched the size of a
pixel on the screen.
With the newest generation of graphics hardware displacement mapping can be
interpreted as vertex texture mapping, where the values of the texture map do not alter the
pixel color, but change the position of the vertex instead. Displacement mapping can in this
way produce a genuine rough surface. It has to be used in conjunction with adaptive
tessellation techniques that increase the number of rendered polygons according to the
current viewing settings. This is done in order to produce highly detailed meshes, and to give
a more 3D feel and a greater sense of depth and detail to textures that displacement mapping
is applied to.
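The core idea of vertex texture displacement can be illustrated on the CPU side: each vertex of a flat grid samples the height map at its (u, v) coordinate and is moved along its normal (here simply the z axis). The names and the nearest-neighbour sampling are illustrative assumptions, not the behavior of any particular shader.

```python
def sample_height(height_map, u, v):
    """Nearest-neighbour lookup into a 2D height map, u and v in [0, 1]."""
    rows, cols = len(height_map), len(height_map[0])
    i = min(int(v * rows), rows - 1)
    j = min(int(u * cols), cols - 1)
    return height_map[i][j]

def displace_grid(grid, height_map, scale=1.0):
    """Lift each (x, y, u, v) grid vertex to (x, y, scale * height)."""
    return [(x, y, scale * sample_height(height_map, u, v))
            for (x, y, u, v) in grid]
```

On the GPU the same lookup happens per vertex in the vertex shader, which is why the latency of the texture fetch matters so much.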
By using these tessellation techniques various levels of detail could be achieved for the
displacement maps. The great latency of fetching vertex textures in current hardware
suggests that the use of displacement maps for level-of-detail rendering is not a very efficient
scheme. However, since displacement mapping avoids almost all CPU computation, the
overall effectiveness will not decrease, and the results may even improve. Another clear
advantage of using displacement mapping is that
highly complex objects can be created without actually being modeled. Therefore, we
conclude that the combination of displacement maps with level-of-detail rendering will
become more and more popular.
Taking into account that the latency of fetching vertex textures should decrease
dramatically with every new graphics card that enters the market, we understand that this
scheme will be even more useful in the near future. Moreover, the height maps used for the
displacement mapping can be easily compressed and decompressed, so huge terrains can be
rendered with real-time results, without using special out-of-core techniques.
Because displacement mapping is a very new feature for real-time APIs, only one algorithm
that combines it with level-of-detail rendering has been published to date. We will
review a prior version of this algorithm that is still very CPU focused, and its second version
which is much more GPU-based.
2.2.1 Geometry Clipmaps
The geometry clipmap framework introduced by Losasso and Hoppe [18] treats the terrain as
a 2D height map, pre-filtering it into a pyramid of m levels. The levels represent nested grids
centered over the viewer at successive power-of-two resolutions. Starting from a square that
represents the closest level to the view-point, each level is represented by a hollow ring
wrapping the smaller square of its preceding level. This actually creates view-dependent
levels of detail, where the finer levels of detail are in the grids closest to the camera.
Figure 2.4: Clipmap levels of a terrain rendered using a coarse geometry clipmap. Courtesy of [18].
The grids are stored as vertex buffers in AGP memory, and are incrementally refilled as
the viewpoint moves. For each level only a thin L-shaped region is changed per frame,
therefore there is great coherency and the caching is effective. To prevent cracks and popping
effects along the boundaries of different levels, zero-area triangles and transition regions are
added respectively.
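The nested-grid layout can be sketched as follows: every level covers a square region centred on the viewer, each coarser level spanning twice the extent of the previous one. The grid size n and the snapping of centres to the level's own resolution are assumptions in the spirit of [18], not their exact scheme.

```python
def clipmap_extents(viewer_x, viewer_y, levels, n=255):
    """Return (xmin, ymin, xmax, ymax, spacing) for each clipmap level."""
    extents = []
    for lvl in range(levels):
        spacing = 2 ** lvl                 # sample spacing doubles per level
        half = (n // 2) * spacing
        # snap the centre to this level's grid so the rings nest exactly
        cx = (viewer_x // spacing) * spacing
        cy = (viewer_y // spacing) * spacing
        extents.append((cx - half, cy - half, cx + half, cy + half, spacing))
    return extents
```

Because each centre is snapped to its own level's spacing, a small viewer movement shifts only the finest grids, which is what keeps the per-frame update down to thin L-shaped regions.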
The height maps used by the algorithm are calculated on the CPU rather than on the much
faster GPU. Nevertheless, the algorithm provides a good rendering rate, and it is simple and
coherent. The height maps also have a coherent nature and therefore enable compression of
the base dataset by a factor of up to 100.
2.2.2 GPU-Based Geometry Clipmaps
The geometry clipmaps algorithm uses height maps, but calculates them in the CPU and
sends them to the GPU as vertex buffers. With the new developments in graphics hardware,
the height maps can be easily moved to the GPU and perform as displacement maps in the
vertex processor.
Asirvatham and Hoppe [2] presented a GPU-based implementation of geometry clipmaps
that uses the new vertex texture fetch ability to move the calculation of the height maps to the
GPU. In their GPU-based implementation, the height maps perform as displacement maps in
the vertex processor. Consequently, constant vertex and index buffers are sent by the CPU,
therefore relieving the CPU from most of its workload.
In the original implementation each level was represented by a hollow ring. To take full
advantage of the GPU’s computational power, this version of the algorithm divides each ring
into 12 smaller square blocks. The small squares reduce memory costs, because displacement
mapping works with full squares. If the entire clipmap was used as a displacement map, then
the inner part of the ring would have been passed to the GPU along with the rest of the ring.
In addition, the small square blocks also enable view frustum culling.
Figure 2.5: Top view of a terrain, showing each nested grid is composed of 12 square blocks.Courtesy of [2].
Figure 2.6: View frustum culling. Courtesy of [2].
Chapter 3
Motivation
As the title of the thesis states, the goal of this work is to introduce several approaches to
level-of-detail rendering. These approaches should leverage the great computational power of
the GPU, and overcome the limitations of the system.
In this chapter we will first glance at what a GPU-based level-of-detail rendering scheme
could look like in an ideal world, meaning a world where a GPU can do everything
that a CPU does, but much faster of course. Next, we will see how the architecture of the
GPU prevents reaching this utopian goal. Finally, we will show what the GPU enables us to
achieve despite these limitations.
The overall motivation of our work is derived from these insights: we see what we want to
achieve, we understand why we cannot fully achieve it, and we set our goals to make use of
all the relevant features that current GPUs do allow us.
3.1 Utopia
Consider the possibility that a GPU is capable of doing anything a CPU can, and on top of
that it does everything faster. If this was the case, then for each frame the CPU would only
have to send the new position and angle of the camera to the GPU. The GPU, being so
versatile, could easily get these changed details of the camera and calculate the new levels of
detail by itself. The GPU could run any level-of-detail rendering algorithm designed for the
CPU, just much faster. All the geometric data would be kept in the internal memory of the
GPU, or in the worst case only part of it would be cached on the GPU, depending on the size
of the GPU’s internal memory. With minimal or even no geometric data at all transmitted to
the GPU each frame, the overall efficiency of level-of-detail rendering algorithms would
grow by orders of magnitude.
3.2 Limitations
A GPU is not a CPU: it is a processing unit designed for rendering images upon receiving a
stream of vertices. Due to its specialized nature, the GPU must get a list of vertices as its
input. We can send the GPU a list of “blank” vertices on which it will run its LOD algorithm,
but some problems show up when doing that.
If we look back at the important operations that should be done in a level-of-detail
algorithm, we can detect two problematic issues – changing number of vertices and
neighboring geometry.
A level-of-detail algorithm changes the number of vertices; this is in fact the main goal of
LOD algorithms. In order to have an LOD algorithm run on the GPU, the GPU must generate
or discard vertices to reach the desired level of detail. Current vertex shaders do not possess
these abilities. These problems can actually be bypassed, but at a significant cost:
- In order to discard a vertex we can set its position to be the same as one of its
neighboring vertices. This way a degenerate polygon is created, and the vertex will not
be seen.
- We can bypass the generation of new vertices by starting with the maximal number of
vertices that could theoretically be chosen by the LOD algorithm, meaning the highest
level of detail.
Both these bypasses are quite wasteful. The discarding of a vertex does not actually discard
it: the vertex still goes through the entire GPU pipeline, relieving none of the GPU’s
workload. Furthermore, the degenerate polygons created by this solution might obstruct the
rendering process and create artifacts in the resulting image. The solution for creating
vertices is also problematic because it forces the entire model to start at the highest level of
detail, which is very costly in time and completely precludes the possibility of coherency.
In order to decide on the best level of detail, the GPU must have information about
neighboring geometry. Current vertex shaders do not have any information about their
neighboring geometry. This kind of information can be theoretically stored in texture
memory, so the vertex shader could calculate the level of detail of its vertex using this data.
In this case the vertex shader will be forced to perform numerous accesses to the texture
memory for each vertex. The problem is that fetching data from the texture memory is an
operation that comes with great latency on current hardware, not to mention numerous
operations of this sort for each vertex.
Taking into account all these limitations, we understand that full level-of-detail rendering
within the GPU cannot be achieved unless significant structural and conceptual changes are
made to the graphics hardware.
3.3 Reality
We have seen that the designated nature of the GPU prevents us from achieving level-of-
detail rendering implemented solely on the GPU side. However, we can try and move more
work to the GPU within its restrictions. The new Shader Model 3.0 [10] introduces some
major improvements to both vertex and fragment shaders. One of the new features introduced
in the vertex shader is the ability to perform vertex texture fetches [12]. This is the most
interesting feature, because it allows the vertex processor to use memory in the form of
texture memory.
We use this new feature to cache geometric data in the GPU’s on-board texture memory.
Geometric data is usually smaller than texture data, so geometric data caching is actually a
realistic goal considering the new texture fetch ability and the growing size of texture
memory in current graphics hardware.
We use the vertex texture fetching ability to cache geometric data, but this is not what this
feature was meant to do in the first place. The reason graphics cards manufacturers enabled
this feature was to support displacement mapping in the GPU. We will introduce an
algorithm that uses displacement mapping to achieve level of detail rendering with almost all
the work done within the GPU.
The caching scheme along with the displacement mapping algorithm are two different
approaches with one common goal - leveraging the computational power of the GPU to
improve the performance of level-of-detail rendering.
Chapter 4
Caching Data on the GPU On-Board Texture Memory
Current level-of-detail rendering frameworks that use geometry caching have very little
control over how the data is actually cached. These algorithms use extensions provided by
the APIs, such as VBO, to cache their data. However, they do not control where the data is
physically cached, and usually these API extensions cache the data in AGP memory rather
than in on-board graphics memory. This limited caching ability, although it significantly
improves performance, does not fully cache the data, because the geometry still has to be
passed each frame from AGP memory to the GPU.
As seen in the related works section, some algorithms use these extensions to achieve
better results for level-of-detail rendering and some even try to manage the cached data
within the restrictions of the extensions.
The newest generation of graphics hardware allows fetching vertex textures [12] in the
vertex processor. While this ability was originally intended to enable displacement mapping
in the vertex shader, it actually turned the on-board texture memory into a general-purpose
memory that the vertex shader can use. We will describe a scheme that uses this
memory to cache geometric data on the GPU itself, while maintaining full control of the
caching management.
4.1 Storing Geometric Data in Textures
The geometric data sent from the CPU to the GPU for each vertex contains its position
coordinates (x,y,z) and usually its color (RGBα). Some applications may also need the
normal information of the vertex (x,y,z) and possibly some texture coordinates (x,y) for the
texture mapping stage later in the fragment shader. The position coordinates consist of three
floats, the color is a single float (each color component occupying one byte), normals are
three floats, and texture coordinates are two floats.
If we sum up all of the above data, a graphics application may need to send up to 9 floats
per vertex from the CPU to the GPU. Our technique cuts the amount of data sent to the GPU
by up to 75%.
To achieve that, we store all the needed data for a vertex in a texture. This texture consists
of the geometric data of all the vertices. The level-of-detail algorithm running on the CPU
stores for each vertex some additional data. This additional data consists of two vertex
texture coordinates that map the vertex to the place in the vertex texture where its data is
stored. When the level-of-detail algorithm decides on a vertex to be sent to the GPU, it sends
the two vertex texture coordinates instead of its position x,y coordinates. The z coordinate
can be ignored by sending 2D vertices, or can alternately contain any other kind of data. The
GPU’s vertex shader fetches all the data belonging to this vertex from the texture upon the
arrival of the two vertex texture coordinates. We show several storage scenarios, so every
application can use its optimal scenario depending on the number of floats it has to send for
each vertex.
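The per-vertex bookkeeping can be sketched as a mapping between a linear slot in the vertex texture and the pair of texture coordinates sent in place of the position. The texture width and the normalized, texel-centre addressing are illustrative assumptions.

```python
def slot_to_texcoords(slot, width=4096):
    """Map a linear texture slot to (u, v) coordinates in [0, 1)."""
    row, col = divmod(slot, width)
    # address the centre of the texel to make the fetch unambiguous
    return ((col + 0.5) / width, (row + 0.5) / width)

def texcoords_to_slot(u, v, width=4096):
    """Inverse mapping, as the vertex shader would resolve the fetch."""
    col = int(u * width)
    row = int(v * width)
    return row * width + col
```

The CPU stores the (u, v) pair with each vertex at preprocessing time and sends only these two floats at run-time; the shader inverts the mapping with a single texture fetch.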
4.1.1 4 Floats Scenario
We first show the simplest scenario where our technique halves the data size sent each frame.
We consider a model with just the three position coordinates and a color for each vertex.
Without our caching technique, for each vertex chosen by the level-of-detail algorithm 4
floats of data are sent to the GPU. Instead, we store all these 4 floats of data for each vertex
in the texture, and just send two floats per vertex, one float for each vertex texture
coordinate.
The vertex shader can fetch from the texture up to 4 floats at a time, therefore for each
vertex, all the data can be retrieved from the texture with a single fetch operation. This is
extremely important due to the great latency that vertex texture fetches imply in current
graphics hardware.
The 4 floats scenario implies that normals cannot be used. Without normals, real-time
lighting calculation is not possible. However, pre-calculated lighting can be used in the case
that the light source is static throughout the scene.
Figure 4.1: 4 floats scenario without texture caching.
Figure 4.2: 4 floats scenario with texture caching.
4.1.2 8 Floats Scenario
In a different scenario our technique can reduce the data size sent from the CPU to the GPU
by 75%. We consider a model with three position coordinates, three normal directions and
two texture coordinates for the fragment shader. Without our caching technique, for each
vertex chosen by the level-of-detail algorithm 8 floats of data are sent to the GPU. Instead,
we store these 8 floats of data for each vertex in two groups of 4 floats that are kept in two
different textures, at the same coordinates, so just two floats are sent per vertex, one float for
each vertex texture coordinate.
The lighting problem of the 4 floats scenario does not arise here, because the normal data for
each vertex is cached in the vertex textures. The vertex shader can calculate the lighting in
real time using this data.
Because we use two textures instead of a single texture as in the 4 floats scenario, we need
two texture fetches for each vertex with this scenario. Some applications may need several
sets of texture coordinates for the fragment shader, or any other additional data. With our
technique we can store 12, 16 or more floats for each vertex to support these applications.
Note that the more floats are needed for each vertex, the more our technique reduces the
communication between the CPU and the GPU.
Figure 4.3: 8 floats scenario without texture caching.
Figure 4.4: 8 floats scenario with texture caching.
4.1.3 9 Floats Scenario
Due to the great latency that fetching vertex textures implies, we prefer storing multiples of 4
floats for each vertex.
Consider the scenario introduced in the beginning of the section - three position
coordinates, three normal directions, two texture coordinates and a color together sum to 9
floats per vertex, which is not a multiple of 4 floats. In this case we can send one of the 9
floats as the z coordinate instead of omitting the z coordinate, so in the textures we store only
8 floats for each vertex. Now the CPU sends 3D vertices, where the x,y coordinates of each
vertex are the vertex texture coordinates, and the z coordinate stores one of the floats, for
example the color information. By sending 3 floats instead of 9 floats per vertex, we reduce
the CPU/GPU communication by two thirds in this scenario.
Figure 4.5: 9 floats scenario without texture caching.
Figure 4.6: 9 floats scenario with texture caching.
4.2 Caching Management
We have seen how the geometric data is being stored in the on-board texture memory, but we
have not yet shown how it actually gets there. By turning the texture memory into a cache
memory, we take on the responsibility of managing it.
The maximal texture in current graphics hardware contains 4096 X 4096 pixels. Each of
these pixels can be represented in the form of RGBα, where each color component is a float.
Therefore each pixel can hold 4 floats, which is exactly the storage size we need for each
vertex when using our texture caching technique. Having 4K X 4K pixels means we have
storage for 2^24 vertices, which is over 16 million vertices. With the 8 floats scenario we use
two different textures, therefore this limit stays the same and does not become any stricter.
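These capacity figures can be checked with a short back-of-the-envelope computation, assuming a 4096 x 4096 RGBA float texture with 4 bytes per float:

```python
def vertex_texture_capacity(side=4096, floats_per_texel=4, bytes_per_float=4):
    """Vertices that fit in one caching texture, and its size in MB."""
    texels = side * side                          # one vertex per texel
    total_bytes = texels * floats_per_texel * bytes_per_float
    return texels, total_bytes // (1024 * 1024)

vertices, megabytes = vertex_texture_capacity()
# 4096 * 4096 = 2^24 = 16,777,216 vertices occupying 256 MB
```

This is exactly the 256 Megabytes mentioned below, which matches the total texture memory of some current graphics hardware.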
Consider the 8 floats scenario for a static model: the two textures containing the vertices of
the entire model are created at preprocessing time. Before the rendering begins, they are
uploaded once to the texture memory, and from that point no additional management is
required. However, when the model has more vertices than the capacity of a texture, or when
the model is dynamic, some extra management is required in the form of switching vertices
in the texture. Even when the vertices of the entire model can be stored in a texture, there
still might be a need for vertex switching, because the total size of texture memory is limited
too and the application might want to use the texture memory to store color textures for
texture mapping in the fragment processor. The texture we use for caching in the 4 floats
scenario may contain up to 2^24 vertices with 4 floats per vertex. Summing this up, we get
exactly 256 Megabytes of memory, which is the total size of texture memory in some current
graphics hardware. In this case, there is no texture memory left for other textures. This
problem might seem to be solved by the fact that the newest graphics hardware available
today has double the amount of texture memory, up to 512 Megabytes, but then the 8 floats
scenario leads to the same problem. Moreover, the need for other textures may leave us no
choice but to use a smaller caching texture. This again leads to the conclusion that caching
management is a problem that cannot be ignored.
We can use any of the caching strategies introduced in the Cached Geometry Manager
(CGM) [16]. For instance, LRU is a strategy that achieves good results with almost every
framework. Whenever the CGM wishes to cache a new vertex, it just chooses the vertex in
the texture to be removed according to the LRU strategy, and places the data of the new
vertex instead of the data of the obsolete vertex. This can be done easily because with the
texture caching technique we have full control of the memory.
4.3 Level-of-Detail Rendering using Cached Geometry
A level-of-detail rendering algorithm usually holds a vertex hierarchy data structure that
contains about twice the number of vertices of the original model. On the one hand, this
doubled number of vertices might fit entirely in the caching textures. This is the simple case
of uploading the textures once before the rendering begins. On the other hand, there might
not be enough texture memory available for the doubled number of vertices. This can be the case
when the doubled number of vertices is combined with any other reason for insufficient
texture memory as described in the previous section.
Using a level-of-detail framework that supports geometry caching is the best solution for
these cases, when there is insufficient texture memory. The CABTT algorithm [17] is a good
example of such a framework. The clusters of triangles created in the CABTT algorithm can
be cached effectively, with fewer changes in the caching textures than with a framework that
is not designed to cache geometry. CABTT can run a CGM [16] to decide which
cluster to replace whenever a new cluster should be cached. The best CGM strategy in this
case is the LRU + Error-PriorityQueue strategy.
Figure 4.7: Data structure for the LRU + Error-PriorityQueue strategy. Courtesy of [16].
This particular CGM strategy uses the fact that vertices belonging to a coarse level-of-detail
have a greater chance of being displayed than vertices belonging to a finer level-of-detail.
This strategy uses a priority queue instead of the last 10% of the LRU’s list. When a new
cluster is needed, the queue takes the least recently used cluster from the list. From the
clusters in the queue, the one with the greatest priority, meaning the finest level-of-detail, is
chosen for removal.
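The LRU + Error-PriorityQueue idea can be sketched as follows: the least recently used tail of the LRU list feeds a priority queue ordered by level of detail, and eviction takes the finest-level entry from that queue. The 10% tail fraction and the level key come from the text above; everything else is an illustrative assumption.

```python
import heapq

def choose_victim(lru_list, level_of, tail_fraction=0.1):
    """lru_list is ordered most- to least-recently used; return the entry
    in the least-recent tail with the finest (highest) level of detail."""
    tail_len = max(1, int(len(lru_list) * tail_fraction))
    tail = lru_list[-tail_len:]
    # heapq is a min-heap, so negate the level: finer levels pop first
    heap = [(-level_of(c), i, c) for i, c in enumerate(tail)]
    heapq.heapify(heap)
    return heap[0][2]
```

Evicting the finest-level cluster among the least recently used ones reflects the statistics above: coarse clusters are more likely to be needed again and so are kept cached longer.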
4.2 Optimizations
4.2.1 Triangle Strips
A triangle strip is a series of connected triangles; therefore, the application does not have to repeatedly specify all three vertices for each triangle. Instead, it can use the fact that every pair of connected triangles shares two vertex references to reduce the overall number of references.
Figure 4.8: A triangulation to be triangle stripped.
For example, the above triangulation consists of 5 triangles, so without the use of triangle strips it is represented in the following way:
ABCBCDCDEDEFEFG
Each triangle is represented by 3 vertex references, so 15 references are needed to represent this triangulation without triangle strips. In contrast, only seven vertex references are needed to define the triangle strip of these same 5 triangles:
ABCDEFG
A model that is triangle-stripped therefore uses less memory. Triangle strips are also supported by the APIs and the hardware, hence processing time also improves when using them. For these reasons, most objects in current 3D scenes are composed of triangle strips.
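The reference counting above can be checked with a small sketch. The expandStrip helper is a hypothetical name, and for simplicity it ignores the alternating winding order that real strips use to keep a consistent triangle orientation:

```cpp
#include <string>
#include <vector>

// Expands a triangle strip into individual triangles: strip element i
// together with elements i+1 and i+2 forms one triangle, so k triangles
// need only k + 2 vertex references instead of 3k.
std::vector<std::string> expandStrip(const std::string& strip) {
    std::vector<std::string> triangles;
    for (size_t i = 0; i + 2 < strip.size(); ++i)
        triangles.push_back(strip.substr(i, 3));
    return triangles;
}
```

Applied to the strip ABCDEFG, this yields the five triangles ABC, BCD, CDE, DEF, and EFG from seven references instead of fifteen.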
The biggest problem with triangle strips, however, is their creation. Creating triangle strips from an arbitrary mesh is an NP-complete problem. Therefore, a heuristic algorithm is needed to create the triangle strips in reasonable time. We can use Terdiman's [27] stripifier to create efficient triangle strips.
While this pre-processing stripping works fine for static meshes, it is irrelevant for dynamic meshes whose topology changes over time. Such meshes are used in level-of-detail rendering schemes, so a run-time stripping solution is needed. Skip Strips [8] provide such a solution by updating triangle strips on-the-fly for dynamic meshes.
As we can see, triangle strips are a significant optimization for any 3D rendering scheme, and can therefore be used with our algorithm. Furthermore, triangle strips optimize our algorithm even more than other algorithms, because the biggest setback of our texture caching algorithm is the high latency of fetching vertex textures on current hardware. Triangle strips reduce the number of vertex references by a factor of up to almost three, leading to almost three times fewer vertex texture fetches.
4.2.2 Geometry Instancing
Geometry instancing is a scheme for efficiently rendering the same object multiple times with only small differences such as position, color, and orientation.
The Sanjusangendo temple in eastern Kyoto, Japan is a good example of the need for geometry instancing. This particular temple houses 1001 virtually identical Buddha statues of Kannon, the goddess of mercy. In such a case, there is no need to store a model that consists of 1001 identical statues. Instead, a model of a single statue can be instanced 1001 times.
Figure 4.9: The Sanjusangendo temple in eastern Kyoto, Japan.
Display lists can be used to instance the same object several times, each in a different world-space position. However, all the instances have the exact same color, since display lists use identical commands for each instance with only the world-space projection changing.
Our texture caching algorithm can also serve as a form of geometry instancing. Using our algorithm, once an object is cached in the vertex textures, it can be instanced multiple times. This can be achieved if each vertex carries an additional instance index parameter.
If a certain object is to be instanced several times, we send along with each of its vertices an additional parameter, the instance index. This is actually a case of the 9 floats scenario of our algorithm. Say the vertex texture holds 8 floats for each vertex (position, normals, and texture coordinates for the fragment shader). Now, instead of sending only the two vertex texture coordinates, we also add a z coordinate that stores the instance index parameter. This parameter lets the vertex shader know to which instance the particular vertex belongs. Based on this index, the vertex shader can change, for example, the position or the color of the vertex. If the entire model to be rendered consists solely of multiple instances of the same object, then the vertex processor can handle all the instancing by itself.
For example, several copies of the same object can be obtained by adding a multiple of the instance index to the position of each vertex. This way, a row of instances of the same object is displayed.
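As a rough CPU-side sketch of this packing, assume the hypothetical helpers buildInstances and instancedX; the latter stands in for the per-vertex offset that the vertex shader would compute from the instance index:

```cpp
#include <array>
#include <vector>

// Each vertex is sent as (u, v, instance): u,v address the cached vertex
// texture, while the z component carries the instance index.
struct PackedVertex { float u, v, instance; };

// CPU-side stand-in for the shader math: shift each instance along x.
float instancedX(float fetchedX, float instance, float spacing) {
    return fetchedX + instance * spacing;
}

// Replicate one object's texture coordinates for several instances,
// tagging each copy with its instance index in the z component.
std::vector<PackedVertex> buildInstances(
        const std::vector<std::array<float, 2>>& texCoords, int instanceCount) {
    std::vector<PackedVertex> out;
    for (int i = 0; i < instanceCount; ++i)
        for (const auto& tc : texCoords)
            out.push_back({tc[0], tc[1], float(i)});
    return out;
}
```

Only one copy of the geometry lives in the vertex texture; the per-instance data sent over the bus is just the tiny (u, v, instance) triplet.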
If the instances of the object do not represent the entire model, or if any instance needs changes that cannot be derived from the instance index parameter, then the Vertex Constants Instancing method introduced by Carucci [4] can be used. This method uses the vertex constants available in current vertex shaders to store the instancing data. The problem with this method is that the number of vertex constants in current graphics hardware is limited to 256; therefore, the number of instances of the object is limited too.
This method can be used along with our texture caching technique. The data shared by all
the instances can be cached in the vertex textures, while the unique data for each instance can
be stored in the vertex constants. Both types of data will be derived at run-time by the vertex
shader, while the CPU does very little work in this process.
Figure 4.10: Texture caching combined with vertex constants instancing.
In the example in figure 4.10, the texture caches the data as a regular 8 floats scenario, while the vertex constants hold the X,Y offsets for each instance, and its color. The final position of each vertex is obtained by adding, in the vertex shader, the X,Y offsets to the X,Y position coordinates derived from the texture. The combination of texture caching with vertex constants instancing helps achieve load balancing between the CPU and GPU, and fights the biggest bottleneck: the CPU.
4.2.3 Short Coordinates
Each vertex texture coordinate sent from the CPU to the GPU with our texture caching
algorithm can hold only up to 4096 different values, which is the limit on texture size in
current graphics hardware. The meaning of this limit is that we only need 12 bits to hold a
texture coordinate data, but we actually send a float of 32 bits for each coordinate.
Alternately we can use a short instead of a float for every coordinate, and halve the data sent
for each vertex. A short has 16 bits of data, so possible future texture size growth of up to 16
times per axis will still enable the use of a short. By using short coordinates, we only need 2
shorts instead of 8 floats in the 8 floats scenario, therefore reducing the data size sent for each
vertex by 87.5%.
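The arithmetic above can be verified with a small sketch. The ShortCoord struct and savings helper are hypothetical names used for illustration only:

```cpp
#include <cstdint>

// Texture coordinates never exceed 4095 (the current texture size limit),
// so two 16-bit shorts suffice per vertex in place of eight 32-bit floats.
struct ShortCoord { uint16_t u, v; };

ShortCoord pack(unsigned u, unsigned v) {
    return ShortCoord{ static_cast<uint16_t>(u), static_cast<uint16_t>(v) };
}

// Fraction of per-vertex upload saved: 1 - 4 bytes / 32 bytes = 0.875.
constexpr double savings() {
    return 1.0 - double(sizeof(ShortCoord)) / (8 * sizeof(float));
}
```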
4.3 Implementation
We have implemented our algorithm in C++ with OpenGL as our API, and Cg as the
programming language for the GPU shaders. All the geometry is sent using the VBO
extension of the API. This way, the results of our algorithm can be compared to the fastest
rendering scheme available.
The code for the vertex shader in the 4 floats scenario:
Listing 4.1: Vertex shader code for the 4 floats scenario.
The code for the vertex shader in the 8 floats scenario:
Listing 4.2: Vertex shader code for the 8 floats scenario.
Our fragment shaders in all the scenarios simply set the color of the fragment according to the data received from the vertex shader. We implemented them to prevent any additional work that might be done by the default fragment shaders.
All the optimizations introduced in the previous section were implemented. The implementations of the triangle strips and the short coordinates do not change the code of the vertex shader. However, to use the geometry instancing optimization, an addition (listing 4.3) must be made to the vertex shader. In order to change the X position of each instance, as for example in one line of Buddhas in the Sanjusangendo temple, the code for calculating the position is changed:
// Compute the position, adding the instance index to the X position
v_out.oPosition = mul(modelViewProj, fetchedVec1 + float4(instance, 0, 0, 0));
Listing 4.3: Addition of geometry instancing to the vertex shader code.
All the scenarios were implemented with full-mesh rendering, with no level-of-detail rendering. The reason for this was to examine the caching scheme itself, without it being affected by the implementation of any level-of-detail rendering scheme. Nevertheless, our caching scheme can be used with any level-of-detail rendering due to its generalized nature. Results concerning all the scenarios and optimizations are shown in the next section.
4.4 Results
We have tested our implementation on a Pentium IV 3 GHz machine with 1 GB RAM and an Nvidia GeForce 6600 GT (128 MB texture memory) graphics card. We compare the results of our algorithm with a fully-optimized rendering scheme using the VBO extension, which caches the data in AGP memory. We chose VBO because it is currently the fastest way of rendering, and comparing our scheme to the fastest one available yields a trustworthy comparison. Moreover, our algorithm runs on top of the VBO extension anyway, so it would be misleading to compare it with anything but VBO.
We compare the results of our texture caching algorithm (TC) to the results of the VBO extension for both the 4 floats and 8 floats scenarios. All these tests are made on three models of various sizes. The shark model is the smallest model tested, the Buddha model is over 12 times bigger than the shark, and the teeth model is about 4 times bigger than the Buddha.
Model      Vertices    Triangles    TC-4     TC-8     VBO-4    VBO-8
Shark         2,560        5,116     100      100      100      100
Buddha       32,328       67,240     100       50      100      100
Teeth       131,685      263,350    33.38    16.68     100       50
Table 4.1: Frames per second (FPS) of the Texture Caching (TC) and VBO schemes in the 4 floats and 8 floats scenarios for various models.
We can see that for small models consisting of a few thousand triangles, such as the shark model, both the texture caching and VBO schemes obtain 100 frames per second, which is the frame rate limit of the screen.
Figure 4.11: Buddha model with texture caching: a) 4 floats scenario with no lighting, b) 4 floats scenario with pre-calculated lighting, c) 8 floats scenario with real-time lighting.
For bigger models such as the Buddha model, both schemes reach the 100 frames per second limit in the 4 floats scenario. However, the texture caching scheme fails to reach the 100 FPS limit in the 8 floats scenario, achieving only half that rate. The drop occurs because the 8 floats scenario needs two fetches per vertex instead of the single fetch of the 4 floats scenario. This result reflects the relatively high latency of fetching vertex textures, as noted in Kilgariff and Fernando's [15] review of the Shader Model 3.0 programming model.
To see a meaningful comparison between the schemes, a larger model should be examined. The teeth model consists of over a quarter of a million triangles, so even the VBO scheme fails to reach the 100 FPS limit in the 8 floats scenario when rendering it; the 4 floats scenario just reaches the 100 FPS limit. Comparing these results to the texture caching scheme yields a three-to-one ratio in rendering speed in favor of the VBO scheme, which is not surprising given the latency of fetching vertex textures.
To verify that this latency is the main factor slowing our algorithm, we implemented the short coordinates optimization. Although this optimization halves the CPU/GPU communication, no change at all is detected in the overall rendering speed. Such a result can only occur when a process has a clear bottleneck that virtually cancels the effect of increased efficiency in other parts of the process. The only operation that might cause such a bottleneck is fetching a vertex texture, because it is the only operation that appears in the texture caching scheme and not in the traditional VBO scheme. However, according to the graphics hardware manufacturers, this latency will decrease greatly in future graphics hardware.
To show the results of the geometry instancing optimization we use instancing of 300 Buddha models, positioned similarly to one line of Buddhas in the Sanjusangendo temple. To do that, we use only one Buddha model that is cached in the textures. The position of each Buddha instance is shifted according to its instance number.
Figure 4.12: 300 Buddhas rendered using instancing in the 8 floats scenario.
We also added the triangle strips optimization to the 300 Buddhas. We triangle-strip the Buddha model, keeping the same vertex and triangle counts. This way, a true comparison can be made between a regular model and a triangle-stripped model (TS).
Model            Vertices      Triangles      TC-4     TC-8     VBO-4    VBO-8
300 Buddhas      ~10 million   ~20 million    0.479    0.234    1.39     0.61
300 TS Buddhas   ~10 million   ~20 million    1.43     0.704    4        1.69
Table 4.2: Frames per second (FPS) of the Texture Caching (TC) and VBO schemes in the 4 floats and 8 floats scenarios for 300 Buddhas, with and without triangle strips.
The results still point to the fact that with the current latency of fetching vertex textures,
the traditional VBO scheme achieves better results than our texture caching scheme.
However, the ratio between VBO and texture caching rendering speeds is reduced when
using instancing and triangle strips.
Figure 4.13: FPS ratio between VBO and texture caching in the 4 floats and 8 floats scenarios for the teeth model, the 300 Buddhas instancing, and the 300 triangle-stripped Buddhas instancing.
Triangle strips reduce the number of vertices going through the graphics pipeline each frame by a factor of almost three. With fewer vertices per frame, we also reduce the number of vertex texture fetch operations. These fetches imply the greatest latency in the 8 floats scenario, which is why the best ratio is achieved with the 300 triangle-stripped Buddhas in the 8 floats scenario.
Chapter 5
GPU-Based Terrain Level of Detail Rendering using Displacement Mapping
We present a novel GPU-based algorithm for run-time rendering of large terrain models. Similarly to the GPU-based version of geometry clipmaps [2], our algorithm sends constant vertex and polygon lists from the CPU to the GPU. Displacement mapping is used to derive the elevation data from 2D vertex textures representing height maps. As with the geometry clipmaps, our algorithm implies very little computation on the CPU side, with most of the work done in the faster and more powerful GPU. However, contrary to the geometry clipmaps, our algorithm does not have to use zero-area triangles and transition regions to deal with problems such as cracks and popping effects. It overcomes these limitations by using progressive levels of detail, as opposed to the discrete levels of detail used in the geometry clipmaps algorithm. Our algorithm is based on extracting the elevation data of each vertex from a displacement map that resides in the texture memory of the GPU.
At the beginning of each frame, the CPU part of our algorithm calculates the intersections of the terrain with the view frustum with respect to the position and angle of the camera (the viewpoint). We refer to the surface between these intersections as the frustum surface. The CPU sends to the GPU four points that define the frustum surface, together with a constant rectilinear grid consisting of constant vertex and polygon lists. The elevation data of the entire terrain might be too big to fit in one texture, therefore the CPU also has to manage an out-of-core scheme that resolves this problem.
The rectilinear grid received from the CPU is mapped by the GPU to the frustum surface.
The x and y coordinates of each vertex in the grid received by the GPU are mapped to their
relative position in the frustum surface using simple algebraic calculations. The new x,y
coordinates are also used to extract the elevation value of the vertex from the displacement
map.
Figure 5.1: Mapping a rectilinear grid of 9 × 9 vertices to the frustum surface. The mapped grid on the right is not triangulated for clarity.
In the frustum surface, the area closer to the camera is narrower than the areas further from the camera. Mapping the rectilinear grid to such a surface results in vertices near the camera being closer to each other, thus denser. A higher resolution of vertices closer to the camera yields a higher level of detail near the camera. The level of detail progressively decreases for areas of the frustum surface that are further away from the camera.
Such a framework displays a constant number of vertices in continuous and progressive levels of detail. The algorithm ensures a constant frame rate regardless of the size or complexity of the terrain.
5.1 CPU
Level-of-detail rendering algorithms usually rely on per-vertex computations in the CPU. Even when the computations are simple, they are repeated for each vertex, which makes the overall calculation a burden on the CPU. Our algorithm relieves the CPU of most of this workload by moving all per-vertex calculations to the GPU. The only tasks the CPU has to perform each frame are defining the current frustum surface and sending a rectilinear grid to the GPU.
Defining the current frustum surface involves some calculations, in the form of intersections of vectors with planes. It is done only once at the beginning of each frame, therefore its influence on the CPU's workload is negligible.
Sending the rectilinear grid to the GPU is an even simpler task in terms of CPU workload. The grid is constant, therefore no calculations at all are made in the CPU for this task. Because the grid is constant, it can also be efficiently cached in AGP memory using the API's VBO extension. Therefore, it does not overload the CPU/GPU communication.
The out-of-core version of the algorithm does, however, need to partition the grid into several sub-grids. Nevertheless, the vertices of the grid still remain constant, so the partitioning only slightly affects the overall rendering time.
5.1.1 Frustum Surface
Defining the frustum surface is performed by a few algebraic calculations, similar to the view frustum calculation. The intersection points of the view frustum's top and bottom planes with the terrain plane are the four points needed to define the frustum surface.
As with the lens of a hand-held video camera, the computer graphics camera (viewpoint) also has a virtual screen in front of it, on which the image is displayed. This virtual screen is called the viewport, and its exact position and size are calculated using parameters such as the focal length of the camera. The viewport can also be thought of as the window on which the image is displayed.
To find the intersection points of the view frustum with the terrain plane, we shoot rays (or vectors) from the viewpoint to the four corners of the viewport. The ray going to the bottom-left corner of the viewport is called ray A, the ray going to the bottom-right corner is called ray B, and the rays going to the top corners are called rays C and D respectively. The four rays continue until they intersect the terrain plane in four points, named A, B, C, and D respectively.
Figure 5.2: Defining the frustum surface.
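The ray-plane intersection behind each corner point can be sketched as follows. This is an illustrative helper, not the thesis code; it assumes the terrain plane is z = 0 and that intersectTerrainPlane is a hypothetical name:

```cpp
struct Vec3 { double x, y, z; };

// Intersects the ray origin + t * dir with the terrain plane z = 0.
// Assumes dir.z != 0, i.e. the ray is not parallel to the terrain plane
// (the parallel case is handled by repositioning, as described below).
Vec3 intersectTerrainPlane(Vec3 origin, Vec3 dir) {
    double t = -origin.z / dir.z;  // parameter where the z component vanishes
    return { origin.x + t * dir.x, origin.y + t * dir.y, 0.0 };
}
```

Running this once per viewport corner yields the four points A, B, C, and D of the frustum surface.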
These four intersection points usually define the frustum surface, but in some cases some of the points A, B, C, and D should be repositioned. Such cases occur when the points C and D fall behind the viewpoint, or when the points A and B are too far from the viewpoint. Next we review these cases.
When the horizon is visible, the top plane of the view frustum does not intersect the terrain plane in front of the camera. In this case the top plane of the view frustum actually intersects the terrain plane behind the viewpoint, opposite to the viewing direction of the camera. In such a scenario the points C and D should be repositioned to the far side of the view frustum. The far side of the view frustum is a parameter that can be defined according to the application or the scene, but it must be far enough from the camera to ensure that no visible geometry is
culled. When rays C and D are almost parallel to the terrain plane and intersect it farther than the far side of the view frustum, the points C and D are again repositioned to the far side of the view frustum.
Figure 5.3: Repositioning of points C and D: a) the top plane of the view frustum intersects the terrain plane behind the viewpoint, b) the top plane of the view frustum intersects the terrain plane very far from the viewpoint, c) points C and D repositioned according to the far plane of the view frustum for both the (a) and (b) scenarios.
In some cases, rays A and B can be problematic. When calculating the frustum surface we refer only to the base terrain plane, completely ignoring the elevation data, so a high mountain right in front of the camera could be missed. We should avoid involving the elevation data in the frustum surface calculation, because that would force the CPU to perform per-vertex calculations and thus contradict the GPU-based nature of our algorithm.
To prevent such cases, we have to choose the points A and B much closer to the camera, so that a high mountain right in front of the camera is rendered rather than ignored. However, this is view- and scene-dependent; for instance, in a flight simulator that flies high above the terrain, such a scenario will hardly occur. Therefore, the points A and B can be positioned anywhere between directly beneath the camera and the original intersection points of rays A and B with the terrain plane. This is controlled by a pre-calculated near parameter that changes according to the application or the scene.
The near parameter can also be calculated at run-time using heuristics that rely on pre-computed data, to avoid extensive computations in the CPU. These heuristics must guarantee that the parameter changes smoothly, with no jumps, to ensure that no popping effects occur. If height averages over constant-sized regions are calculated in preprocessing, a heuristic that determines the parameter based on these pre-computed values can be used.
Figure 5.4: Mount Everest, Nepal: a) rays A and B go right through Everest, but the mountain is left outside the frustum surface and hence not rendered, b) points A and B repositioned, so the mountain enters the frustum surface and is rendered correctly.
Figure 5.5: Side view of figure 5.4: a) the mountain is outside the frustum surface, b) points A and B repositioned, so the mountain top enters the frustum surface.
5.1.2 Constant Rectilinear Grid
After defining the frustum surface the CPU has to send the geometry to the GPU. The CPU
just creates a constant rectilinear vertex grid which is sent to the GPU. The grid implies
constant vertex and polygon lists therefore it can be efficiently cached on AGP memory
using the VBO extension.
The size of the grid is determined once at the beginning of rendering, and from that point the exact same constant grid is sent every frame to the GPU, relieving the CPU from almost any per-frame work. The size of the grid is mainly derived from the performance of the GPU. A bigger grid yields a better representation of the model, but a slower frame rate. Therefore, the size of the grid is chosen as the maximal size that still allows an interactive frame rate of at least 24 frames per second.
5.1.3 Grid Partitioning
As stated in the previous section, the vertex grid remains constant throughout the running of our algorithm. However, due to out-of-core considerations, the grid sometimes has to be partitioned. The grid is only horizontally partitioned, along horizontal lines of vertices, so that the resulting sub-grids do not have any connectivity problems. This way, triangle strips can be used freely, without the risk that a strip ends in the middle of a row of triangles because of the partitioning.
In some cases, mainly when using the out-of-core version of the algorithm, a single displacement map (vertex texture) is not enough to render the entire terrain. In these situations, different areas of the grid need to use different vertex textures, therefore these textures must be unbound and bound while the grid is being sent to the GPU. In order to switch the vertex textures, the CPU must first horizontally partition the grid into several sub-grids, such that the texture bind/unbind operations occur between the sending of two sub-grids.
Horizontally partitioning the grid seems like a simple and safe task, but this is not exactly the case, because cracks may appear on the borderlines between two sub-grids. This problem occurs because each vertex in the rectilinear grid is sent to the GPU with two lines of triangles: first as a top vertex for the triangle(s) beneath it, and later as a bottom vertex for the triangle(s) above it. A problem arises when the vertex texture is changed between the two lines of triangles, meaning that two different elevation values could be fetched by the GPU for practically the same vertex, resulting in a crack.
To solve this problem, the CPU can send zero-area triangles along the sub-grid borderlines, as done in the GPU-based version of geometry clipmaps [2]. There, the zero-area triangles are essential because they deal with T-junctions created by borders of different clipmap sizes. In our algorithm, we can solve this problem in a much more elegant way, by using two bound textures at a time: the main displacement map and an auxiliary one. A simple branching mechanism in the vertex processor chooses the auxiliary displacement map for a borderline vertex, and the main displacement map for the rest of the vertices. In such a scheme, the vertices of the first line of each sub-grid have the same elevation as the same vertices in the last line of the previous sub-grid, thus resolving the cracks problem.
Branching is normally a time-costly operation on the GPU, since it stalls the GPU pipeline and contradicts the parallel SIMD (Single Instruction, Multiple Data) nature of the GPU's components. However, as stated in the work by Harris and Buck [13], SIMD branching is very useful in cases where the branch conditions are fairly spatially coherent. In our solution this is exactly the case, since the path to the auxiliary texture is taken only for borderline vertices, implying that the great majority of the vertices take the ordinary path in a very spatially coherent way.
Another issue that arises is that the grid is no longer constant throughout the running of our algorithm. However, this is not a critical problem, because the vertices of the grid still remain constant, therefore the partitioning only slightly affects the overall rendering time. Moreover, the sending time of the relatively small grid is not very significant in the first place.
5.2 GPU
At the beginning of each frame, the GPU receives four points from the CPU. These points define the four corners of the frustum surface. Then, the GPU expects a constant rectilinear grid from the CPU. All the vertices in the grid go through the vertex shader, where they are mapped to their relative place in the frustum surface. For each vertex, the vertex shader also fetches the elevation data of the vertex from the displacement map.
Fetching a texture implies high latency, but it is nevertheless worthwhile to fetch a texture for every vertex in the grid. The superior computational power of the GPU compared to the CPU, expressed in the mapping operation performed for each vertex, compensates for the fetching latency. Furthermore, this calculation is performed by the GPU during the idle time implied by the texture fetch, so it does not slow the GPU down beyond the fetch latency itself. In addition, it is possible to use regular RGB textures along with the vertex textures; the RGB textures are used by the fragment processor.
5.2.1 Mapping the Rectilinear Grid to the Frustum Surface
The first operation performed by the vertex shader is repositioning each received vertex, which is done by mapping the position of the vertex in the constant rectilinear grid to its position in the frustum surface.
Consider an m × n rectilinear grid, and a frustum surface defined by four points A, B, C, and D. The 0,0 vertex in the grid is mapped to point A, the 0,n-1 vertex is mapped to point B, the m-1,0 vertex is mapped to point C, and the m-1,n-1 vertex is mapped to point D. Each inner vertex in the grid is mapped to its corresponding place in the frustum surface by applying a set of calculations.
Figure 5.6: Mapping the corners of the rectilinear grid to the corners of the frustum surface.
Consider the x,y vertex in the grid. We calculate its corresponding position in the frustum surface using vectors, because vector calculations are more suitable for the GPU. The points A, B, C, and D are treated as the vectors A, B, C, and D respectively. The vector AC represents the relative position of the vertex along the A-C edge. It is calculated by adding the fraction x/(m-1) of C - A to the vector A:

AC = A + (x / (m-1)) * (C - A)    (Equation 5.1)

Similarly, the vector BD that represents the relative position of the vertex along the B-D edge is calculated:

BD = B + (x / (m-1)) * (D - B)    (Equation 5.2)
If we refer to the AC-BD edge created by equations 5.1 and 5.2, then the final position P represents the relative position of the vertex along this edge:

P = AC + (y / (n-1)) * (BD - AC)    (Equation 5.3)
The vertex shader repositions each vertex using these three equations, so eventually the entire constant rectilinear grid is mapped to the frustum surface.
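The three mapping equations can be condensed into a small C++ sketch. The function name mapToFrustumSurface is hypothetical, and 2D points stand in for the full vectors; note that the corner cases reduce to the four points A, B, C, and D exactly:

```cpp
struct Vec2 { double x, y; };

// Linear interpolation between two points: a + t * (b - a).
Vec2 lerp(Vec2 a, Vec2 b, double t) {
    return { a.x + t * (b.x - a.x), a.y + t * (b.y - a.y) };
}

// Maps grid vertex (x, y) of an m x n grid to the frustum surface
// spanned by corner points A, B, C, and D.
Vec2 mapToFrustumSurface(int x, int y, int m, int n,
                         Vec2 A, Vec2 B, Vec2 C, Vec2 D) {
    double s = double(x) / (m - 1);
    Vec2 AC = lerp(A, C, s);                    // Equation 5.1
    Vec2 BD = lerp(B, D, s);                    // Equation 5.2
    return lerp(AC, BD, double(y) / (n - 1));   // Equation 5.3
}
```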
5.2.2 Extracting Elevation Data from the Displacement Map
After a vertex is mapped to the frustum surface, it has the correct real-world x,y coordinates, but it still lacks the elevation data, which is extracted from the displacement map. An x,y point in the displacement map contains the elevation data of the vertex located at the same x,y point in the real world, i.e. the frustum surface. This means that the x,y coordinates derived from the vector P (equation 5.3) are exactly the coordinates in the displacement map where the elevation data of the vertex is stored. The result of equation 5.3 can therefore be used to extract the elevation data from the displacement map using the vertex texture fetch operation.
There is a problem, though, when the x,y coordinates derived from the vector P fall outside the terrain model, and therefore outside the displacement map. In such a case, the x coordinate of the displacement map is clamped to the closest value available in the map, and so is the y coordinate. A coordinate is clamped only when that coordinate itself is outside the displacement map; otherwise it retains its value as derived from the vector P. This clamping implies that the elevation data of vertices outside the terrain model is copied from the elevation at the edges of the displacement map, creating an inaccurate representation of the terrain.
In order to overcome this limitation, the values on the frame of the displacement map are set to zero. With this approach, every vertex outside the terrain receives a zero elevation value, and the terrain is accurately represented.
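The clamped fetch and the zero frame can be emulated on the CPU as follows. The DisplacementMap class is a hypothetical stand-in for the vertex texture, not the shader code itself:

```cpp
#include <algorithm>
#include <vector>

// Emulates the displacement-map fetch: coordinates outside the map are
// clamped per axis to the nearest valid texel. Because the one-texel
// frame of the map is zero, any vertex outside the terrain receives
// a zero elevation.
class DisplacementMap {
public:
    DisplacementMap(int w, int h) : w_(w), h_(h), data_(w * h, 0.0f) {}

    void set(int x, int y, float elevation) { data_[y * w_ + x] = elevation; }

    float fetch(int x, int y) const {
        x = std::max(0, std::min(x, w_ - 1));  // clamp each coordinate
        y = std::max(0, std::min(y, h_ - 1));  // separately
        return data_[y * w_ + x];
    }

private:
    int w_, h_;
    std::vector<float> data_;
};
```

A fetch at (-3, 1), for instance, clamps to the zeroed left border of the map and returns a zero elevation, exactly as the scheme requires.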
5.3 Out-of-Core
Out-of-core techniques are used to efficiently support view-dependent simplification for datasets much larger than main memory. In our framework, out-of-core addresses a more fundamental limit than main memory capacity, because the maximal texture size is much smaller than main memory. On current hardware the maximal texture size is 4096 X 4096; therefore, running our algorithm in-core, i.e., with a single displacement map, limits the terrain to a maximum of 16,777,216 (16M) vertices. Some datasets are much larger than this limit.
We introduce an out-of-core technique to support large datasets in our framework. At
preprocessing we create a height map pyramid, where each level of the pyramid covers the entire dataset at its corresponding level of detail. The base of the pyramid stores the original
height map, whereas the top of the pyramid stores the coarsest height map. At run-time, the
constant rectilinear grid is partitioned into smaller sub-grids, where each sub-grid is
associated with a different level of the height map pyramid based on geometry clipmaps. The
clipmaps are incrementally updated as the viewpoint moves.
5.3.1 Height Map Pyramid
We construct a pyramid of height maps from the original height map of the entire dataset.
The base of the pyramid is the original height map. The pyramid construction process builds
the rest of the pyramid level by level. Each new level uses the height map of its previous
level and a geometric approximation metric to select a coarser data representation. The data
in each level is coarser than in its previous level, since it consists of only half the control points along each axis, and therefore requires a quarter of the memory.
Figure 5.7: Height map pyramid (unscaled).
Each point in a height map corresponds to the elevation value of a vertex at the level of detail matching that height map's level in the pyramid. During the construction of a particular pyramid level, the elevation value of every vertex in that level is calculated from the elevation values of vertices in the previous level. A single vertex value in the new pyramid level is based on a 5 X 5 vertex matrix representing its adjacent geometry in the previous pyramid level. Any naive metric, such as averaging, can be used here, although we use a novel metric to achieve better results. The basic idea of our metric is to find a 3 X 3 vertex matrix that approximates the 5 X 5 matrix with minimal geometric error. The returned value of the metric is the elevation value of the central vertex in the 3 X 3 matrix.
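The level-by-level construction can be sketched as follows; plain 2 X 2 averaging stands in for our 5 X 5 to 3 X 3 approximation metric, which is not reproduced here:

```cpp
#include <vector>

// Build the next (coarser) pyramid level from the current one by keeping
// half the control points along each axis.  The thesis uses a novel
// 5 x 5 -> 3 x 3 approximation metric; simple averaging of each 2 x 2
// footprint stands in for it in this sketch.
std::vector<float> coarsenLevel(const std::vector<float>& fine, int w, int h) {
    int cw = w / 2, ch = h / 2;
    std::vector<float> coarse(cw * ch);
    for (int y = 0; y < ch; ++y)
        for (int x = 0; x < cw; ++x) {
            float sum = fine[(2 * y) * w + 2 * x]
                      + fine[(2 * y) * w + 2 * x + 1]
                      + fine[(2 * y + 1) * w + 2 * x]
                      + fine[(2 * y + 1) * w + 2 * x + 1];
            coarse[y * cw + x] = sum / 4.0f;
        }
    return coarse; // a quarter of the memory of the finer level
}
```

Repeating this until a single coarse map remains yields the full pyramid.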
5.3.2 Grid Partitioning based on Clipmaps
When rendering huge terrains, a single displacement map is not able to hold all the elevation
data of the terrain. To cover the entire terrain, we must therefore use a coarser representation. To maintain a fairly high resolution for the elevation data of the
areas near the viewpoint, and at the same time cover the entire terrain, we must use height
maps of various levels of detail. For that purpose we use a simple version of geometry
clipmaps [18]. Contrary to the original geometry clipmaps, our clipmaps are in the shape of
full rectangles without the holes corresponding to the clipmaps of previous levels.
We use the intersection points of the clipmaps with the frustum surface to determine the
vertex grid partition. Our algorithm starts from the finest clipmap that is placed around the
viewpoint, and finds the maximal sub-grid that is fully covered by that clipmap. We continue
this process until the entire frustum surface is covered by clipmaps. Since our partitioning is
only horizontal, some vertices might be covered by an outer and coarser clipmap even though
they are placed within the area of a finer clipmap. This is the reason why our clipmaps are
full rectangles with no holes in their centers, contrary to the ring-shaped geometry clipmaps
[2].
Figure 5.8: Grid partitioning based on clipmaps.
The clipmaps are also used by the GPU as the displacement maps from which the elevation data is fetched. At any given time there are exactly two bound clipmaps – one clipmap as the main displacement map, and another clipmap as the auxiliary displacement map.
Following the partitioning of the grid, the resulting sub-grids are sent to the GPU. After a
particular sub-grid is sent, its associated clipmap is re-bound as the auxiliary displacement
map, while the previous auxiliary displacement map is unbound. The clipmap associated with
the next sub-grid is bound as the main displacement map.
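Since the partitioning is horizontal only, the assignment of grid rows to clipmap levels can be sketched as follows (an illustrative sketch; the per-level extents and the row-wise view are assumptions, not the thesis's exact data structures):

```cpp
#include <vector>
#include <cstdlib>

// Horizontal grid partitioning (sketch): each row of the mapped grid is
// assigned the finest clipmap that fully covers it.  Clipmap level L is
// centered on the viewpoint row and covers extent[L] rows to each side;
// extents grow with coarseness.  Rows covered by no finer clipmap fall
// back to the coarsest level.  Contiguous runs of equal level form the
// sub-grids that are sent to the GPU one by one.
std::vector<int> partitionRows(int rows, int viewRow,
                               const std::vector<int>& extent) {
    std::vector<int> level(rows);
    int coarsest = (int)extent.size() - 1;
    for (int r = 0; r < rows; ++r) {
        int l = coarsest;
        for (int c = 0; c <= coarsest; ++c)
            if (std::abs(r - viewRow) <= extent[c]) { l = c; break; }
        level[r] = l;
    }
    return level;
}
```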
5.3.3 Updating Clipmaps
As the viewpoint moves, each clipmap translates within its pyramid level in order to remain
centered about the viewpoint. Since the motion of the viewpoint is usually coherent, only a
small L-shaped region of the window needs to be incrementally processed in each frame.
Furthermore, the relative motion decreases exponentially at coarser levels, therefore coarse
level clipmaps seldom require updating.
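The size of the incremental update can be sketched as follows (an illustrative calculation, assuming a square clipmap window):

```cpp
#include <cstdlib>
#include <algorithm>

// Number of texels in the L-shaped region that enters a clipmap window of
// size s x s when its origin moves from (ox, oy) to (nx, ny).  Only this
// region needs to be re-rendered into the clipmap texture each frame.
int lRegionTexels(int ox, int oy, int nx, int ny, int s) {
    int dx = std::min(std::abs(nx - ox), s); // new columns entering
    int dy = std::min(std::abs(ny - oy), s); // new rows entering
    // vertical strip + horizontal strip, minus their shared corner
    return dx * s + dy * s - dx * dy;
}
```

For coherent motion dx and dy are small, so the update touches only a thin strip; a jump of a full window degenerates to updating all s * s texels.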
Figure 5.9: L-shaped region created between sequential frames (t and t+1) within a clipmap.
Rendering to the textures (clipmaps) using the fragment shader enables modifying the L-shaped regions of the clipmaps. Rendering to textures is sometimes a costly operation, but because we update only a thin L-shaped region for each clipmap, the overall updating time does not overload the system.
5.4 Optimizations
5.4.1 Terrain Compression
Storing the geometric data as images in the form of height maps (or displacement maps), instead of in traditional API structures, enables the use of image compression techniques. Moreover, height maps are remarkably coherent in practice, significantly more than typical color images, and thus offer a huge opportunity for compression. For example, Losasso and Hoppe [18] managed to compress the 40.4GB U.S. dataset to just 355MB, while maintaining a very small error.
5.4.2 RGB Textures
When using only vertex textures, the coloring of the terrain is done automatically by the fragment processor, which interpolates vertex colors. Such a method yields poor results, since sharp changes in color disturb the human eye much more than geometrical changes do. For that reason, RGB textures are used by the fragment processor. RGB textures are usually more detailed than vertex textures (height maps), so a 1K X 1K height map will usually be accompanied by a matching 2K X 2K RGB texture. Both types of textures reside in texture memory, and therefore together they must not surpass the texture memory limit.
Bear in mind that when using RGB textures with the out-of-core version of our algorithm,
a pyramid of RGB data will also be used along with the height map pyramid. The RGB
pyramid is built using a metric that calculates each color component separately. In addition,
whenever the L-regions of the clipmaps (displacement maps) are updated due to movement
of the viewpoint, the corresponding RGB textures are updated accordingly.
5.4.3 Linear Sampling
Taking the elevation data of a vertex on the frustum surface from a single value in the displacement map may result in popping effects. This happens when the frustum surface moves slightly, forcing the elevation data to be taken from a different texel in the vertex texture. The problem can be solved by using linear sampling when fetching the elevation data from the displacement map.
Linear sampling interpolates the elevation data from the four texels in the vertex texture that surround the position of the vertex, instead of just taking the data from the closest single texel. Such a method ensures smooth changes instead of popping effects when the elevation data is suddenly taken from a different texel in the vertex texture.
Figure 5.10: Linear sampling of vertex P between texels P1, P2, P3, and P4.
The interpolated elevation data of P is calculated using the following equation, where x’ and y’ are the fractional offsets of P from P1:
P = (1 - x’)(1 - y’)·P1 + x’(1 - y’)·P2 + (1 - x’)y’·P3 + x’y’·P4 (Equation 5.4)
Each vertex has to fetch the elevation data of the four adjacent texels. This yields four vertex texture fetch operations per vertex, which is costly on current hardware. However, a single fetch operation can retrieve four floats of data, so if we place the
elevation data of all the neighbors in each texel of the vertex texture, then we can obtain all
the needed elevation data with just a single fetch. Therefore, the first float in each texel
contains the elevation value of the texel itself, the second float contains the value from the
texel to the right, the third float contains the value from the texel below, and the fourth float
contains the value from the bottom-right texel. Now a texel in the vertex texture is actually a
quad of texels, where the top-left corner of the quad is the texel itself, and the rest are
neighboring texels.
To ensure that the vertex fetches the correct quad of texels, we must make sure that the closest texel in the vertex texture is its top-left texel. When fetching the texel, we move the vertex half a texel left and half a texel up to ensure that the correct quad is fetched.
Figure 5.11: Vertex P moves half a texel left and half a texel up to ensure that P1 is the closest texel, thus the correct quad is fetched.
Keep in mind that linear sampling implies at least double the texture memory footprint, since each texel has to store the data of adjacent texels in addition to its own. Linear sampling for RGB textures is done by changing a texture parameter (through an API call) when the texture is uploaded.
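The packing and the resulting single-fetch interpolation can be sketched on the CPU as follows (an illustrative sketch; the Texel struct stands in for the four-float RGBA texel, and the integer flooring stands in for the half-texel shift):

```cpp
#include <vector>
#include <cmath>
#include <algorithm>

struct Texel { float self, right, below, belowRight; }; // packed quad

// Pack each texel with its right, below, and bottom-right neighbors so
// that one fetch yields all four values needed for bilinear interpolation.
std::vector<Texel> packQuads(const std::vector<float>& h, int w, int ht) {
    auto at = [&](int x, int y) {
        x = std::min(x, w - 1); y = std::min(y, ht - 1); // clamp at edges
        return h[y * w + x];
    };
    std::vector<Texel> packed(w * ht);
    for (int y = 0; y < ht; ++y)
        for (int x = 0; x < w; ++x)
            packed[y * w + x] = { at(x, y), at(x + 1, y),
                                  at(x, y + 1), at(x + 1, y + 1) };
    return packed;
}

// Bilinear sample at continuous coordinates (x, y) using a single fetch
// of the packed quad whose top-left texel is the floor of the position.
float sampleLinear(const std::vector<Texel>& packed, int w,
                   float x, float y) {
    int ix = (int)std::floor(x);
    int iy = (int)std::floor(y);
    float fx = x - ix, fy = y - iy;
    const Texel& t = packed[iy * w + ix];
    float top    = t.self  * (1 - fx) + t.right      * fx;
    float bottom = t.below * (1 - fx) + t.belowRight * fx;
    return top * (1 - fy) + bottom * fy; // Equation 5.4
}
```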
5.4.4 Explicit Level-of-Detail Coarsening
Linear sampling helps solve popping effects when two consecutive elevation data fetches of a vertex come from neighboring texels in the vertex texture. However, if the displacement map in use is too detailed, two consecutive fetches might hit non-neighboring texels, in which case linear sampling may not help. Therefore, in such cases we should use a coarser displacement map, in order to explicitly lower the level of detail even when the texture size allows a higher level of detail.
When using the out-of-core version of our algorithm, the clipmaps implicitly solve this problem, because the level of detail of each clipmap depends on its distance from the viewpoint. We need to explicitly coarsen the level of detail when the entire terrain is stored in a single displacement map, or when we want finer changes in the level of detail than the clipmaps dictate.
Both vertex textures and RGB textures may use explicit level-of-detail coarsening, since both pyramids were built using “smart” metrics that use neighborhood data. Therefore, choosing a texel of a lower level of detail ensures that a larger area was considered when the value of the texel (both elevation and RGB data) was calculated.
5.4.5 Utilizing Parallelism
The GPU is a stream processor, not a serial processor like the CPU [3]. A serial processor, also known as a von Neumann architecture, executes instructions sequentially, updating the memory as it goes. In contrast, a stream processor executes a function (for example, a vertex shader) on a set of input records (vertices) in parallel, producing a set of output records (transformed vertices). This kind of processor is also known as SIMD (Single Instruction, Multiple Data).
Not only can the same vertex shader instructions be executed on several vertices in parallel, the shader can also perform algebraic operations on vectors of four floats and on 4 X 4 matrices in parallel [29].
We can take advantage of this fact by uniting the computations of equations 5.1 and 5.2 into one matrix-vector multiplication. First, we can calculate the two direction vectors used by equations 5.1 and 5.2 once in the CPU, because their values remain the same for all the vertices in a particular frame. Since we do not deal with elevation data in these calculations, each vector can be represented by its x and y components, Vx and Vy. The following matrix-vector multiplication returns a vector of four floats, where the first two floats are the components of the result of equation 5.1, and the remaining floats are the components of the result of equation 5.2:
(Equation 5.5)
The results are used by equation 5.3 to compute the final Fx and Fy coordinates in the
frustum surface.
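The idea of pairing the two computations can be illustrated as follows, assuming equations 5.1 and 5.2 have the linear form the surrounding text suggests (the names here are illustrative, not the thesis's notation):

```cpp
#include <array>

using vec4 = std::array<float, 4>;

// Two independent 2D computations packed into one 4-float operation:
// floats 0-1 hold the result along one frustum edge (equation 5.1),
// floats 2-3 the result along the other (equation 5.2).  On the GPU this
// maps to a single vector (or matrix-vector) instruction instead of two
// separate 2-component ones.
vec4 packedLerp(const vec4& start, const vec4& dir, float t) {
    return { start[0] + t * dir[0], start[1] + t * dir[1],
             start[2] + t * dir[2], start[3] + t * dir[3] };
}
```

The four resulting floats are then consumed by equation 5.3 to produce the final position on the frustum surface.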
5.5 Implementation
We have implemented our algorithm in C++ with OpenGL as our API, and Cg as the
programming language for the GPU shaders.
The basic version of our algorithm is implemented along with the RGB textures
optimization, where all the geometry is sent using the VBO extension of the API. The
algorithm was tested on a terrain with a 1K X 1K height map and a 2K X 2K RGB texture map, and also on a terrain with a 2K X 2K height map and a 4K X 4K RGB texture map.
Our vertex shader utilizes the parallelism of the GPU by using operations on two-float
vectors to map the vertices from the rectilinear grid to the frustum surface. The code of the
vertex shader is listed at the end of this section. The fragment shader just uses the same
texture coordinates as the vertex shader to fetch RGB color data.
We have also implemented a CPU-based version of the algorithm for performance
comparisons. The CPU-based version maps the rectilinear grid to the frustum surface in the
CPU, thus no displacement mapping or any other special operation is performed in the vertex
shader. The mapped vertices are sent to the GPU in a straightforward way using Vertex
Arrays. VBO cannot be used in the CPU-based version, because the vertices change dynamically every frame, so no caching whatsoever is possible in such a case.
To compare the results of our algorithm to a naive scheme, we have also implemented a
VBO-based straightforward version that renders the full terrain each frame.
The code for the vertex shader:
Listing 5.1: Vertex shader code.
5.6 Results
We have tested our implementation on a Pentium-IV 3GHz machine with 1GB RAM and an NVIDIA GeForce 6600 GT (128MB texture memory) graphics card. We compare the results of
our basic algorithm with the results of the CPU-based version for constant rectilinear grids of
various sizes from 128 X 128 vertices up to 600 X 600 vertices. All grid sizes were tested on
two terrains – one with a 1K X 1K height map and a 2K X 2K RGB texture map, and another
terrain with a 2K X 2K height map and a 4K X 4K RGB texture map. Even though RGB textures were added to our algorithm, and not to the CPU-based and naive versions, our algorithm still outperforms them both in all the test cases.
Algorithm           Grid size    Vertices in grid   FPS (1K X 1K terrain)   FPS (2K X 2K terrain)
Our algorithm       128 X 128    16,384             100 (max)               100 (max)
                    256 X 256    65,536             100 (max)               100 (max)
                    400 X 400    160,000            50                      50
                    512 X 512    262,144            34                      34
                    600 X 600    360,000            25                      25
CPU-based version   128 X 128    16,384             100 (max)               100 (max)
                    256 X 256    65,536             46                      46
                    400 X 400    160,000            24                      24
                    512 X 512    262,144            21                      21
                    600 X 600    360,000            18                      18
Naive version       -            -                  22                      Up to 5.5 (1.0 using VA)
with VBO
Table 5.1: Results of our algorithm, the CPU-based version, and the naive version for various sizes of the terrain and the rectilinear grid.
Figure 5.12: A rendered terrain using our algorithm with a 128 X 128 rectilinear grid: a) wire frame, b) with RGB texture.
Figure 5.13: A rendered terrain using our algorithm with a 400 X 400 rectilinear grid: a) wire frame, b) with RGB texture.
Figure 5.14: A rendered terrain using our algorithm with a 600 X 600 rectilinear grid: a) wire frame, b) with RGB texture.
Figure 5.15: Frustum surface defined by the rays shot through the corners of the viewport.
The results show that the size of the terrain does not affect our algorithm; as expected, the frame rates remain the same for both terrain sizes even though the second terrain is four times bigger than the first. The size of the large terrain exceeds the AGP memory of the machine. Contrary to our algorithm, the naive version reacts drastically to the size of the terrain, and as a result it fails to use VBO for the large terrain and falls back to Vertex Arrays performance. Dividing the terrain into patches can solve this problem and enable at least partial use of VBO. Nevertheless, the rendering speed does not exceed 5.5 frames per second.
In our algorithm, the latency of vertex texture fetches practically dictates the frame rates. However, this is not so bad, because the fetch latency “hides” all the other work done by the GPU due to its parallel nature. Consequently, adding RGB textures did not change the frame rates. Removing the calculation that maps each vertex from the rectilinear grid to the frustum surface did not change the frame rates either; therefore we suspect that even more GPU calculations can be “hidden” by the fetch latency. This is the reason why we only partially harness the parallelism of the GPU in our implementation, using operations on two-float vectors instead of matrix operations. We did this for code clarity, but if fetch latency
reduces drastically in future technologies, we can use matrix operations that fully utilize the
GPU’s parallelism.
Whereas the mapping calculations in the GPU did not change the frame rates, in the CPU-based version the same calculation took about as long as rendering the resulting vertices, which gives an idea of how fast and parallel the GPU is compared to the CPU.
For huge terrains we still render the same number of vertices, so the rendering time is not affected at all by the size of the terrain. For the out-of-core version we have to create the height map pyramid at preprocessing, whereas the run-time addition is the clipmap updating. With this addition there is a correlation between the terrain size and the frame rates, since a larger terrain means additional clipmaps to update. However, this correlation is only logarithmic, because even when the terrain size grows four times, the algorithm only needs to update one additional outer clipmap. Outer clipmaps usually need even less updating than clipmaps near the viewpoint, because their L-shaped region is usually so thin that no updates are needed in most frames.
Our clipmaps are updated exactly as the original geometry clipmaps introduced by
Asirvatham and Hoppe [2], so we can refer to the updating times from their implementation.
They tested worst-case updating times, when the entire clipmap is updated instead of just a
thin L-shaped region as in the average case. The updating time for a 255 X 255 clipmap is
1.6 ms without terrain compression, and 9.6 ms with on-the-fly decompression of the height
map.
On top of the updating time, they also have the rendering time, which grows logarithmically with the terrain size, while our rendering time is completely constant. Our algorithm also adapts well to the abilities of the machine, because the quality of our rendered image is directly derived from the size of the rectilinear grid. The size can be adjusted per machine to give maximal image quality at an interactive frame rate.
It is important to emphasize that our algorithm cannot be outperformed by the GPU-based geometry clipmaps [2], because the clipmap updating in both algorithms is precisely the same, while the rendering time of both algorithms is derived from the latency of the single vertex texture fetch made for each vertex.
The geometry clipmaps algorithm uses zero-area triangles to prevent T-junctions in the
borderline between two clipmap levels. Such triangles may result in visual artifacts, therefore
they should be avoided. In contrast, we use progressive levels of detail; therefore no zero-area triangles are created with our algorithm.
Chapter 6
Conclusions and Future Work
We have presented two novel approaches for utilizing the growing power of GPUs for level-
of-detail rendering. The first approach enables caching of geometry data on the texture
memory of the GPU, providing full control on the caching management. The second
approach allows view-dependent level-of-detail rendering of terrain models with adaptive performance that takes maximal advantage of the machine’s computational capabilities.
Our texture caching technique achieves reasonable results and enables the rendering of small models at interactive frame rates. However, it falls behind the VBO extension of the APIs, which performs the caching over the AGP memory. The main reason for trailing behind VBO is the latency involved in fetching vertex textures on current graphics hardware. The ability to fetch vertex textures is very new, and therefore this latency is expected to decrease drastically in the future. Nevertheless, until this happens in practice, texture caching remains inefficient.
Our terrain rendering algorithm enables view-dependent rendering of terrains with continuous and progressive levels of detail. It ensures a constant and interactive frame rate regardless of the size or complexity of the terrain, by channeling all the computational power of the hardware into rendering just the vertices that are inside the view frustum. The main bottleneck of the algorithm is again the latency of the vertex texture fetches used by the GPU for each vertex. Nevertheless, in contrast to the texture caching technique, the terrain rendering algorithm achieves results that do not fall behind similar algorithms, and even outperform them.
We see scope for future work in improving the quality of the displayed image, by using an improved technique for mapping each vertex to the frustum surface. In our current algorithm, we map the vertices evenly throughout the frustum surface. This way, the vertices become implicitly denser near the camera. By using a function that relies on the ratio between the near and the far edges of the frustum surface, we can map the vertices so that they become explicitly denser near the camera.
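One possible such mapping function (an assumption for illustration, not part of the current algorithm) spaces the grid rows so that the step size grows geometrically with the far-to-near ratio r:

```cpp
#include <cmath>

// Remap an even parameter t in [0,1] between the near and far edges so
// that row spacing grows geometrically by the ratio r = far/near,
// placing explicitly more rows near the camera.  f(0)=0 and f(1)=1.
float densify(float t, float r) {
    return (std::pow(r, t) - 1.0f) / (r - 1.0f);
}
```

For r = 4, the midpoint t = 0.5 maps to 1/3, i.e., half of the grid rows cover only the nearest third of the frustum surface.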
Our current out-of-core scheme often uses a coarse texture (clipmap) where a finer texture could be used, because of the horizontal partitioning of the grid. By selecting the clipmap level of each vertex in the GPU shaders, this loss of quality can be avoided. Moreover, with this method the grid no longer has to be partitioned, resulting in an efficiency improvement as well. Selecting the clipmap level within the shaders can be done using mipmapping.
Future improvements in the efficiency of vertex texture fetching will drastically improve the results of both our approaches. This work is a call to the graphics hardware vendors to cut down the latency of fetching vertex textures, and relieve our approaches, and many other algorithms, of their main bottleneck.
Bibliography
[1] Anonymous. AGP 8X A Closer Look. In Dev Hardware Website, October 2003.
http://www.devhardware.com/c/a/Video-Cards/AGP-8X-A-Closer-Look.
[2] A. Asirvatham and H. Hoppe. Terrain rendering using GPU-based geometry clipmaps. In
GPU Gems 2, edited by M. Pharr and R. Fernando, pages 27-45. Addison-Wesley, March
2005.
[3] I. Buck and T. Purcell. A toolkit for computation on GPUs. In GPU Gems, edited by R.
Fernando, pages 621-636. Addison-Wesley, March 2004.
[4] F. Carucci. Inside Geometry Instancing. In GPU Gems 2, edited by M. Pharr and R.
Fernando, pages 47-67. Addison-Wesley, March 2005.
[5] UNC Chapel Hill. Armadillo model, 1998. http://www.cs.unc.edu/~geom/APS.
[6] P. Cignoni, C. Montani, and R. Scopigno. A comparison of mesh simplification
algorithms. In Computers & Graphics, 22(1):37-54, February 1998.
[7] M. Duchaineau, M. Wolinsky, D. E. Sigeti, M. C. Miller, C. Aldrich, and M. B. Mineev-
Weinstein. ROAMing terrain: real-time optimally adapting meshes. In IEEE Visualization
’97 Proceedings, pages 81-88. ACM/SIGGRAPH Press, October 1997.
[8] J. El-Sana, E. Azanli, and A. Varshney. Skip strips: maintaining triangle strips for view-dependent rendering. In Proceedings IEEE Visualization ‘99, pages 131-138. IEEE Computer Society and ACM, October 1999.
[9] J. El-Sana and A. Varshney. Generalized view-dependent simplification. In Computer
Graphics Forum, volume 18, pages C83-C94. Eurographics Association and Blackwell
Publishers Ltd, 1999.
[10] R. Fernando. Shader model 3.0 unleashed. In SIGGRAPH 2004 Presentation, August
2004.
http://download.nvidia.com/developer/presentations/2004/SIGGRAPH/Shader_Model_3
_Unleashed.pdf.
[11] R. Fernando and M. J. Kilgard. The Cg tutorial: the definitive guide to programmable
real-time graphics. Addison-Wesley, February 2003.
[12] P. Gerasimov, R. Fernando, and S. Green. Shader model 3.0: using vertex textures.
NVIDIA white paper DA-01373-001_v00, June 2004.
[13] M. Harris and I. Buck. GPU flow-control idioms. In GPU Gems 2, edited by M. Pharr
and R. Fernando, pages 547-555. Addison-Wesley, March 2005.
[14] H. Hoppe. Progressive meshes. In Proceedings SIGGRAPH ’96, pages 99-108. ACM
SIGGRAPH, ACM Press, August 1996.
[15] E. Kilgariff and R. Fernando. The GeForce 6 series GPU architecture. In GPU Gems 2,
edited by M. Pharr and R. Fernando, pages 471-491. Addison-Wesley, March 2005.
[16] R. Lario, R. Pajarola, and F. Tirado. Cached Geometry Manager for view-dependent
LOD rendering. In WSCG 2005 Conference Proceedings, pages 9-16. UNION Agency –
Science Press, February 2005.
[17] J. Levenberg. Fast view-dependent level-of-detail rendering using cached geometry. In
IEEE Visualization 2002, pages 259-266. IEEE Computer Society, November 2002.
[18] F. Losasso and H. Hoppe. Geometry clipmaps: terrain rendering using nested regular
grids. In ACM Transactions on Graphics: Proceedings SIGGRAPH 2004,
23(3):769-776, August 2004.
[19] W. R. Mark, R. S. Glanville, K. Akeley, and M. J. Kilgard. Cg: a system for
programming graphics hardware in a C-like language. In ACM Transaction on
Graphics: Proceedings ACM SIGGRAPH, 22(3):896-907, July 2003.
[20] G. Moore. Cramming more components onto integrated circuits. In Electronics, 38(8),
1965.
[21] NVIDIA website. http://www.nvidia.com.
[22] R. Pajarola. FastMesh: Efficient view-dependent meshing. In Pacific Graphics 2001,
pages 22-30. IEEE Computer Society, October 2001.
[23] R. Pajarola, M. Antonijuan, and R. Lario. QuadTIN: quadtree based triangulated
irregular networks. In IEEE Visualization 2002, pages 395-402. IEEE Computer
Society, November 2002.
[24] E. Persson. Accelerating real-time graphics with high level shading languages. Master’s
thesis, Lulea University of Technology, Computer Science and Electrical Engineering
Department, November 2003.
[25] W. J. Schroeder, J. A. Zarge, and W. E. Lorensen. Decimation of triangle meshes. In
Computer Graphics: Proceedings SIGGRAPH ’92, 26(2):65-70, July 1992.
[26] N. Sokolovsky. Combining occlusion culling within the framework of view-dependent
rendering. Master’s thesis, Ben-Gurion University of the Negev, Department of
Computer Science, April 2002.
[27] P. Terdiman. Creating efficient triangle strips. In Coder Corner Website, 2000.
http://www.codercorner.com/Strips.htm.
[28] Wikipedia. Displacement mapping. In Wikipedia the Free Encyclopedia Website,
August 2005. http://en.wikipedia.org/wiki/Displacement_mapping.
[29] C. Woolley. GPU program optimization. In GPU Gems 2, edited by M. Pharr and R.
Fernando, pages 557-571. Addison-Wesley, March 2005.
[30] S. E. Yoon, B. Salomon, R. Gayle, and D. Manocha. Quick-VDR: interactive view-
dependent rendering of massive models. In Proceedings IEEE Visualization 2004, pages
131-138, October 2004.