Parallel Rendering Technologies for HPC Clusters

By Li Ou, Ph.D.; Yung-Chin Fang; Onur Celebioglu; and Victor Mashayekhi, Ph.D.

Reprinted from Dell Power Solutions, November 2007. Copyright © 2007 Dell Inc. All rights reserved.

Using parallel rendering technologies with clusters of high-performance computing (HPC) workstations configured with high-end graphics processors helps scale out graphics capabilities by exploiting and coordinating distributed computing resources. This article discusses parallel rendering architectures and highlights open source utilities that can help meet rendering requirements for large-scale data sets.
Supercomputers and high-performance computing
(HPC) clusters enable demanding software—such
as real-time simulation, animation, virtual reality,
and scientific visualization applications—to generate high-resolution data sets at sizes that were not previously feasible. However, efficiently rendering these large,
dynamic data sets, especially those with high-resolution dis-
play requirements, can be a significant challenge.
Rendering is the process of converting an abstract
description of a scene (a data set) to an image. For com-
plex data sets or high-resolution images, the rendering
process can be highly compute intensive, and applica-
tions with requirements for rapid turnaround time and
human perception place additional demands on process-
ing power. State-of-the-art graphics hardware can signifi-
cantly enhance rendering performance, but a single piece
of hardware is often limited by processor performance
and amount of memory. If very high resolution is required,
the rendering task can simply be too large for one piece
of hardware to handle.
Exploiting multiple processing units—a technique known
as parallel rendering—can provide the necessary computa-
tional power to accelerate these rendering tasks. This article
discusses the major architectures and methodologies for
parallel rendering with HPC workstation clusters and
describes how open source utilities such as Chromium and
Distributed Multihead X (DMX) can help meet large-scale
rendering requirements.
Understanding parallel rendering techniques

There are two different ways to build parallel architectures
for high-performance rendering. The first method is to use
a large symmetric multiprocessing computer with extremely
high-end graphics capabilities. The downside of this
approach is its cost—these systems can be prohibitively
expensive.
The second method is to utilize the aggregate perfor-
mance of commodity graphics accelerators in clusters of HPC
workstations. The advantages of this architecture include
the following:
• Cost-effectiveness: Commodity graphics hardware and
workstations remain far less expensive than high-end
parallel rendering computers, and some PC graphics
accelerators can provide performance levels comparable
to those of high-end graphics hardware.
• Scalability: As long as the network is not saturated, the
aggregate hardware capacity of a visualization cluster
grows linearly as the number of HPC workstations
increases.
• Flexibility: The performance of commodity
graphics hardware can increase rapidly, and
its development cycles are typically much
shorter than those of custom-designed,
high-end parallel hardware. In addition,
open interfaces for hardware, such as PCI
Express (PCIe), and open interfaces for soft-
ware, such as Open Graphics Library
(OpenGL), allow organizations to easily take
advantage of new hardware to help increase
cluster performance.
Temporal and data parallelism

Two common approaches to parallel rendering
are temporal parallelism and data parallelism.
Temporal parallelism divides up work into
single sequential frames that are assigned to
systems and rendered in order; data parallel-
ism divides a large data set into subsets of
work that are rendered by multiple systems and
then recombined.
In temporal parallelism, the basic unit of
work is the rendering of a single complete image
or frame, and each processor is assigned a
number of frames to render in sequence (see
Figure 1). Because this method increases throughput for an entire sequence rather than speeding up any individual image, the film industry often uses it for animation and similar applications, in which the time required to render a single frame matters less than the overall time required to render all frames.
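To make the idea concrete, the following Python sketch (illustrative only; the node names, frame count, and render_frame placeholder are hypothetical) assigns whole frames to rendering nodes in round-robin fashion, so that each node renders complete images independently:

```python
# Minimal sketch of temporal parallelism: each node renders whole frames.
# Node names, frame counts, and render_frame() are illustrative placeholders.

def assign_frames_round_robin(num_frames, nodes):
    """Map each frame index to the node that will render it."""
    schedule = {node: [] for node in nodes}
    for frame in range(num_frames):
        schedule[nodes[frame % len(nodes)]].append(frame)
    return schedule

def render_frame(frame):
    """Placeholder for the actual renderer invocation on a node."""
    return f"image_{frame:04d}.png"

if __name__ == "__main__":
    nodes = ["node01", "node02", "node03", "node04"]
    schedule = assign_frames_round_robin(240, nodes)
    for node, frames in schedule.items():
        first = render_frame(frames[0]) if frames else None
        print(node, "renders", len(frames), "frames, starting with", first)
```

Because every frame is rendered entirely on one node, no inter-node composition step is needed, and throughput grows with the number of nodes as long as the frames are independent.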
The basic concept of data parallelism, on the
other hand, is to divide and conquer. Data paral-
lelism decomposes a large data set into many
small subsets, then uses multiple workstations
to render these subsets simultaneously (see
Figure 2). High-performance interconnects route
the data subsets between the processing work-
stations, and one or more controlling units syn-
chronize the distributed rendering tasks. When
the rendering process completes, the final image
can be compiled from the subsets on each work-
station for display. Data parallelism is widely
used by research industries and in software such
as real-time simulation, virtual reality, virtual
environment simulation, and scientific visualiza-
tion applications.
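A minimal sketch of this divide-and-conquer pattern follows, assuming the data set can be split into independent subsets; the renderer and recombination step are placeholders, and Python's multiprocessing module stands in for a cluster of workstations:

```python
# Sketch of data parallelism: split one data set, render the subsets in
# parallel, then recombine the partial results into a final image.
# render_subset() and combine() are illustrative stand-ins; a real cluster
# would distribute the subsets over a high-performance interconnect.
from multiprocessing import Pool

def split_dataset(dataset, num_subsets):
    """Partition the data set into roughly equal subsets."""
    chunk = (len(dataset) + num_subsets - 1) // num_subsets
    return [dataset[i:i + chunk] for i in range(0, len(dataset), chunk)]

def render_subset(subset):
    """Placeholder renderer: produces a 'partial image' for one subset."""
    return sum(subset)

def combine(partial_images):
    """Placeholder composition step that merges the partial images."""
    return sum(partial_images)

if __name__ == "__main__":
    dataset = list(range(1_000_000))                 # stand-in for scene data
    subsets = split_dataset(dataset, num_subsets=8)
    with Pool(processes=8) as pool:
        partial_images = pool.map(render_subset, subsets)  # simultaneous rendering
    print(combine(partial_images))                   # final image assembled here
```

The controlling process here plays the role of the synchronizing unit described above: it distributes the subsets, waits for all partial results, and performs the final composition.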
Object decomposition and image decomposition

A key step in data parallelism is decomposing
large data sets, a step that can utilize one of two
major approaches: object decomposition and
image decomposition. In object decomposition,
tasks are formed by partitioning the geometric
description of the scene. Individual worksta-
tions partition and render subsets of the geo-
metric data in parallel, producing pixels that
must be integrated later into a final image.
Image decomposition, in contrast, forms tasks
by partitioning the image space: each task ren-
ders only the geometric objects that contribute
to the pixels that physically belong to the space
assigned to the task.
Figure 3 illustrates how a data set could be
partitioned by these two approaches. In object
decomposition, each workstation renders a
single object in the data set: one renders the
rectangle, and the other renders the circle. In
image decomposition, each workstation renders
half of the final image: one renders the left side,
and the other renders the right side.
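The difference between the two strategies can be sketched in a few lines of Python; the toy scene below (a rectangle and a circle, echoing Figure 3), its bounding boxes, and the two-workstation split are all illustrative assumptions:

```python
# Illustrative comparison of object decomposition and image decomposition
# for a toy scene containing a rectangle and a circle (cf. Figure 3).
scene = [
    {"name": "rectangle", "bbox": (0.10, 0.30, 0.55, 0.70)},  # x0, x1, y0, y1
    {"name": "circle",    "bbox": (0.45, 0.90, 0.20, 0.60)},
]

def object_decomposition(scene, num_workstations):
    """Each workstation receives whole objects; the rendered pixels
    must be integrated into a final image afterwards."""
    tasks = [[] for _ in range(num_workstations)]
    for i, obj in enumerate(scene):
        tasks[i % num_workstations].append(obj["name"])
    return tasks

def image_decomposition(scene, num_workstations):
    """Each workstation owns a vertical strip of the image and renders
    only the objects whose bounding boxes overlap that strip."""
    strip = 1.0 / num_workstations
    tasks = []
    for w in range(num_workstations):
        left, right = w * strip, (w + 1) * strip
        tasks.append([obj["name"] for obj in scene
                      if obj["bbox"][0] < right and obj["bbox"][1] > left])
    return tasks

print(object_decomposition(scene, 2))  # [['rectangle'], ['circle']]
print(image_decomposition(scene, 2))   # [['rectangle', 'circle'], ['circle']]
```

Note that in the image decomposition case the circle straddles both strips and therefore appears in both task lists, which previews the loss of spatial coherence discussed below.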
There are no absolute guidelines when choos-
ing between object decomposition and image
decomposition. Generally, object decomposition
is suitable for applications with large-scale data
sets, while image decomposition is suitable for
applications requiring high resolution and a
large image, such as a tiled display integrating
multiple screens into a single display device.
Object decomposition can provide better load balancing and scalability than image decomposition, because objects can be distributed evenly among processors with relatively little preprocessing. However, it does require a post-
composition process to integrate the image
subsets, because objects assigned to different
processors may map to the same screen space.
For example, in Figure 3, after rendering the
circle and the rectangle individually, this post-
composition step determines how they overlap
to form the final image. With large numbers of
partitions, this step can place heavy demands on communication networks and require a huge amount of computation power for the composition units.

Figure 1. Frames rendered across multiple systems using temporal parallelism

Figure 2. Data subsets rendered across multiple systems using data parallelism

Figure 3. Parallel rendering processes using object decomposition and image decomposition
Image decomposition helps eliminate the
complexity of image integration by only requiring
a final composition step to physically map the
image parts together. However, this approach
may have a potential side effect: loss of spatial
coherence. This loss can occur because in image
decomposition, a single geometric object may
map to multiple regions in the image space,
which requires such objects to be shared by
multiple independent processors.
Sort-first and sort-last algorithms

In computer graphics, object space data
includes geometric descriptions of the scene,
such as polygons in 3D models. One of the chal-
lenges in parallel rendering is mapping this
data from object space to image space. Because
the original data set is partitioned and distrib-
uted to multiple processors, the basic parallel
rendering algorithms must be established as
sort-first or sort-last, depending on where the
function mapping the object space to the image
space occurs.
Sort-first is an initial preprocessing step for
assigning objects to the appropriate processors
based on a cross-space mapping policy. Sort-first
algorithms perform the space mapping early in
the rendering process, and the sort operation
happens before the data set is partitioned and
distributed. Sort-last algorithms are less sensitive to the distribution of objects within the image than sort-first algorithms, because the workload is divided according to the objects themselves rather than their positions on the screen.
The mapping of the object space to the image
space in sort-last algorithms occurs during com-
position, when pixels from individual proces-
sors are integrated into a final image. Sort-first
algorithms are typically combined with image
decomposition, and sort-last algorithms are
typically combined with object decomposition
(see Figures 4 and 5).

Figure 4. Parallel rendering process for display on tiled screens using sort-first algorithms with image decomposition

Figure 5. Parallel rendering process for display on a single screen using sort-last algorithms with object decomposition
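As an illustration of the sort-last composition step, the following sketch merges two partial renderings produced by object decomposition, keeping at each pixel the fragment closest to the viewer; the buffer sizes, colors, and depth values are invented for the example, and NumPy is assumed to be available:

```python
# Illustrative sort-last composition: each rendering node produces a color
# buffer and a depth buffer for its own objects, and the compositor keeps,
# per pixel, the fragment with the smallest depth value (nearest the viewer).
import numpy as np

def depth_composite(color_a, depth_a, color_b, depth_b):
    """Merge two partial renderings using a per-pixel depth test."""
    a_is_closer = depth_a <= depth_b
    color = np.where(a_is_closer[..., None], color_a, color_b)
    depth = np.minimum(depth_a, depth_b)
    return color, depth

if __name__ == "__main__":
    h, w = 4, 4
    # Node A rendered one object (red), node B another (blue); placeholder data.
    color_a = np.zeros((h, w, 3)); color_a[..., 0] = 1.0
    color_b = np.zeros((h, w, 3)); color_b[..., 2] = 1.0
    depth_a = np.full((h, w), 0.4)
    depth_b = np.full((h, w), 0.6)
    depth_b[1:3, 1:3] = 0.2            # B's object is nearer in the center
    final_color, final_depth = depth_composite(color_a, depth_a, color_b, depth_b)
    print(final_depth)
```

A sort-first pipeline avoids this per-pixel merge entirely: because each server owns a fixed region of the screen, the final step only has to place the finished tiles side by side.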
Enhancing HPC cluster graphics with open source utilities

Parallel rendering typically requires a special
software layer to exploit and coordinate distrib-
uted computational resources. Chromium and
DMX, two popular open source utilities, can
provide these capabilities.1
Chromium, developed by Lawrence Livermore
National Laboratory, Stanford University, the
University of Virginia, and Tungsten Graphics, is
an open source software stack for parallel render-
ing on clusters of workstations. It runs on the
Microsoft® Windows®, Linux®, IBM® AIX, Solaris,
and IRIX operating systems, and is designed to
increase three aspects of graphics scalability:
• Data scalability: Chromium can process increasingly large data sets by distributing workloads across increasingly large clusters.
• Rendering performance: Chromium can
scale out rendering performance by aggre-
gating commodity graphics hardware.
• Display performance: Chromium helps the system output large, high-resolution images, such as those for a tiled display.

1 For more information, see chromium.sourceforge.net and dmx.sourceforge.net.
Chromium enhances HPC cluster graphics
capabilities by supporting sort-first and sort-last
rendering as well as hybrid parallel rendering,
which combines both sort-first and sort-last
algorithms. Chromium configurations are written as Python scripts, which helps increase system flexibility. For example, on a single hardware architecture, one script can configure the system with a sort-first image decomposition policy to support large tiled displays with high image resolution, while a modified script can apply a sort-last object decomposition algorithm to render a large data set in parallel.
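The configuration sketch below is not the actual Chromium scripting API; it is a hypothetical Python fragment, with invented class and function names, intended only to show how a script rather than the hardware selects the parallel rendering policy:

```python
# Hypothetical configuration sketch (NOT the real Chromium API): it only
# illustrates that a script, not the hardware, chooses the rendering policy.

class ClusterConfig:
    def __init__(self, policy, render_nodes, tiles=None, compositor=None):
        self.policy = policy              # "sort-first" or "sort-last"
        self.render_nodes = render_nodes  # hostnames of rendering servers
        self.tiles = tiles                # (columns, rows) of a tiled display
        self.compositor = compositor      # node that merges partial images

def sort_first_tiled_display(nodes):
    """Image decomposition: one rendering server per display tile."""
    return ClusterConfig("sort-first", nodes, tiles=(4, 3))

def sort_last_large_dataset(nodes):
    """Object decomposition: every server renders; one node composites."""
    return ClusterConfig("sort-last", nodes, compositor=nodes[0])

nodes = [f"render{i:02d}" for i in range(12)]
config = sort_first_tiled_display(nodes)   # or sort_last_large_dataset(nodes)
print(config.policy, config.tiles or config.compositor)
```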
Flexible interfaces are a key feature of Chromium and allow it to support applications with a range of behaviors and requirements. One of these interfaces is a library that exposes standard OpenGL application programming interfaces (APIs) but dispatches the OpenGL calls into Chromium's processing chain for parallel rendering.
This mechanism is transparent to applications,
offering an easy way to deploy traditional
OpenGL applications to parallel rendering envi-
ronments, particularly for applications requiring
large tiled displays.
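The transparent interception mechanism can be sketched generically; the proxy below is illustrative only and does not reflect Chromium's internal implementation, but it shows how an application can keep issuing ordinary API calls while a thin library forwards them into a downstream processing chain:

```python
# Generic sketch of transparent call interception (illustrative, not the
# Chromium implementation): the application calls what looks like a normal
# graphics API, and the proxy forwards each call downstream instead of
# rendering locally.

class InterceptingAPI:
    def __init__(self, downstream):
        self._downstream = downstream     # stand-in for a network stream

    def __getattr__(self, name):
        def forward(*args):
            self._downstream.append((name, args))   # package and pass the call on
        return forward

command_stream = []                       # stand-in for the processing chain
gl = InterceptingAPI(command_stream)

# Unmodified application code keeps calling the usual entry points.
gl.glClear("GL_COLOR_BUFFER_BIT")
gl.glBegin("GL_TRIANGLES")
gl.glVertex3f(0.0, 0.0, 0.0)
gl.glEnd()

print(command_stream)
```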
Rendering very large, complex models
requires applications with native parallel coding
techniques. Chromium provides the necessary
data scalability to perform these tasks with a
set of parallel APIs that synchronize rendering
processes of multiple entities. These APIs can
integrate seamlessly with standard OpenGL
interfaces to support truly parallel rendering
applications on clusters.
Using Chromium with Linux or UNIX® operat-
ing systems to utilize large display walls with
the X Window System introduces a minor prob-
lem: although the large images can span mul-
tiple screens, each screen is still managed by
an independent X server. Integrating Chromium
with DMX can help solve this problem. DMX
allows a single X server to run across a cluster
of systems such that the X display or desktop
can present images across many physical dis-
plays. For example, a cluster of 12 workstations
running one X server can provide an image to a
large tiled display in a 4 × 3 screen configura-
tion. Working with DMX, Chromium can render
data sets and output large images to a unified
X server, which controls multiple graphics cards
connected to the physical displays and allows
logical windows to cross the display’s physical
boundaries.
Open source utilities such as Chromium and
DMX help simplify the deployment of parallel ren-
dering on high-performance workstation clusters.
Figure 6 shows an example software stack for
Linux and UNIX operating systems. Each worksta-
tion in a cluster requires the three bottom layers—
the graphics card, the driver, and the X Window
System and OpenGL. Chromium and DMX create
another layer to provide parallel rendering by uti-
lizing the rendering resources of individual work-
stations. Adding a layer containing toolkits such
as the open source Visualization Toolkit (VTK) and
OpenGL Utility Toolkit (GLUT) can help applications
utilize the bottom layers. When running on
Windows-based systems, DMX and the X Window
System are not required; the system can use OS
services to provide the necessary functionality.

Figure 6. Example software stack for parallel rendering on Linux- and UNIX-based clusters (bottom to top): graphics card; driver; X Window System and OpenGL; Chromium and DMX; VTK and GLUT; applications
Creating scalable HPC architectures for parallel rendering

Deploying parallel rendering technologies
can help increase the cost-effectiveness, flexi-
bility, and scalability of HPC architectures.
Organizations can take advantage of common
approaches such as temporal parallelism and
data parallelism to help streamline data set
rendering, as well as open source utilities such
as Chromium and DMX to present graphics on
high-resolution displays and provide the neces-
sary data scalability.
The combination of HPC cluster architec-
tures and parallel rendering can also accelerate
the pace of research projects—for example, by
allowing aerospace researchers to visualize the
heat generated by wind on an airplane shell to
help them increase airplane safety and effi-
ciency, enabling geologists to see through seis-
mic zones and enhance crude oil yield from
existing wells, and letting pharmaceutical
researchers see how human genes interact with
medicines to help accelerate drug development.
By using HPC clusters and parallel rendering in
these and other industries, organizations can
successfully address large-scale problems on
high-resolution data sets.
Li Ou, Ph.D., is a systems engineer in the
Scalable Systems Group at Dell. He has a B.S.
in Electrical Engineering and an M.S. in
Computer Science from the University of
Electronic Science and Technology of China,
and a Ph.D. in Computer Engineering from
Tennessee Technological University.
Yung-Chin Fang is a senior consultant in the
Scalable Systems Group at Dell. He has pub-
lished more than 30 articles on HPC and cyber-
infrastructure management, and participates in
HPC cluster–related open source groups as a
Dell representative.
Onur Celebioglu is an engineering manager in
the Scalable Systems Group at Dell. He has an
M.S. in Electrical and Computer Engineering
from Carnegie Mellon University.
Victor Mashayekhi, Ph.D., is the engineering
manager for the Scalable Systems Group at Dell,
and is responsible for product development for
HPC clusters, remote computing, unified com-
munication, virtualization, custom solutions,
and solutions advisors. Victor has a B.A., M.S.,
and Ph.D. in Computer Science from the
University of Minnesota.