Parallel Rendering Technologies for HPC Clusters

By Li Ou, Ph.D.; Yung-Chin Fang; Onur Celebioglu; and Victor Mashayekhi, Ph.D.

Reprinted from Dell Power Solutions, November 2007. Copyright © 2007 Dell Inc. All rights reserved.

Using parallel rendering technologies with clusters of high-performance computing (HPC) workstations configured with high-end graphics processors helps scale out graphics capabilities by exploiting and coordinating distributed computing resources. This article discusses parallel rendering architectures and highlights open source utilities that can help meet rendering requirements for large-scale data sets.
Supercomputers and high-performance computing
(HPC) clusters enable demanding software—such
as real-time simulation, animation, virtual reality,
and scientific visualization applications—to generate high-resolution data sets at sizes that were not previously feasible. However, efficiently rendering these large,
dynamic data sets, especially those with high-resolution dis-
play requirements, can be a significant challenge.
Rendering is the process of converting an abstract
description of a scene (a data set) to an image. For com-
plex data sets or high-resolution images, the rendering
process can be highly compute intensive, and applica-
tions with requirements for rapid turnaround time and
human perception place additional demands on process-
ing power. State-of-the-art graphics hardware can signifi-
cantly enhance rendering performance, but a single piece
of hardware is often limited by processor performance
and amount of memory. If very high resolution is required,
the rendering task can simply be too large for one piece
of hardware to handle.
Exploiting multiple processing units—a technique known
as parallel rendering—can provide the necessary computa-
tional power to accelerate these rendering tasks. This article
discusses the major architectures and methodologies for
parallel rendering with HPC workstation clusters and
describes how open source utilities such as Chromium and
Distributed Multihead X (DMX) can help meet large-scale
rendering requirements.
Understanding parallel rendering techniques

There are two different ways to build parallel architectures
for high-performance rendering. The first method is to use
a large symmetric multiprocessing computer with extremely
high-end graphics capabilities. The downside of this
approach is its cost—these systems can be prohibitively
expensive.
The second method is to utilize the aggregate perfor-
mance of commodity graphics accelerators in clusters of HPC
workstations. The advantages of this architecture include
the following:
• Cost-effectiveness: Commodity graphics hardware and
workstations remain far less expensive than high-end
parallel rendering computers, and some PC graphics
accelerators can provide performance levels comparable
to those of high-end graphics hardware.
• Scalability: As long as the network is not saturated, the
aggregate hardware capacity of a visualization cluster
grows linearly as the number of HPC workstations
increases.
• Flexibility: The performance of commodity
graphics hardware can increase rapidly, and
its development cycles are typically much
shorter than those of custom-designed,
high-end parallel hardware. In addition,
open interfaces for hardware, such as PCI
Express (PCIe), and open interfaces for soft-
ware, such as Open Graphics Library
(OpenGL), allow organizations to easily take
advantage of new hardware to help increase
cluster performance.
Temporal and data parallelism

Two common approaches to parallel rendering
are temporal parallelism and data parallelism.
Temporal parallelism divides up work into
single sequential frames that are assigned to
systems and rendered in order; data parallel-
ism divides a large data set into subsets of
work that are rendered by multiple systems and
then recombined.
In temporal parallelism, the basic unit of
work is the rendering of a single complete image
or frame, and each processor is assigned a
number of frames to render in sequence (see
Figure 1). Because this method increases throughput for an entire sequence rather than speeding up any individual image, the film industry often uses it for animation and similar applications, in which the time required to render a single frame matters less than the overall time required to render all frames.
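To make the idea concrete, the following Python sketch (illustrative only; the node names, frame count, and render_frame placeholder are hypothetical) assigns whole frames to rendering nodes in round-robin fashion, so that each node renders complete images independently:

```python
# Minimal sketch of temporal parallelism: each node renders whole frames.
# Node names, frame counts, and render_frame() are illustrative placeholders.

def assign_frames_round_robin(num_frames, nodes):
    """Map each frame index to the node that will render it."""
    schedule = {node: [] for node in nodes}
    for frame in range(num_frames):
        schedule[nodes[frame % len(nodes)]].append(frame)
    return schedule

def render_frame(frame):
    """Placeholder for the actual renderer invocation on a node."""
    return f"image_{frame:04d}.png"

if __name__ == "__main__":
    nodes = ["node01", "node02", "node03", "node04"]
    schedule = assign_frames_round_robin(240, nodes)
    for node, frames in schedule.items():
        first = render_frame(frames[0]) if frames else None
        print(node, "renders", len(frames), "frames, starting with", first)
```

Because every frame is rendered entirely on one node, no inter-node composition step is needed, and throughput grows with the number of nodes as long as the frames are independent.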
The basic concept of data parallelism, on the
other hand, is to divide and conquer. Data paral-
lelism decomposes a large data set into many
small subsets, then uses multiple workstations
to render these subsets simultaneously (see
Figure 2). High-performance interconnects route
the data subsets between the processing work-
stations, and one or more controlling units syn-
chronize the distributed rendering tasks. When
the rendering process completes, the final image
can be compiled from the subsets on each work-
station for display. Data parallelism is widely
used by research industries and in software such
as real-time simulation, virtual reality, virtual
environment simulation, and scientific visualiza-
tion applications.
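A minimal sketch of this divide-and-conquer pattern follows, assuming the data set can be split into independent subsets; the renderer and recombination step are placeholders, and Python's multiprocessing module stands in for a cluster of workstations:

```python
# Sketch of data parallelism: split one data set, render the subsets in
# parallel, then recombine the partial results into a final image.
# render_subset() and combine() are illustrative stand-ins; a real cluster
# would distribute the subsets over a high-performance interconnect.
from multiprocessing import Pool

def split_dataset(dataset, num_subsets):
    """Partition the data set into roughly equal subsets."""
    chunk = (len(dataset) + num_subsets - 1) // num_subsets
    return [dataset[i:i + chunk] for i in range(0, len(dataset), chunk)]

def render_subset(subset):
    """Placeholder renderer: produces a 'partial image' for one subset."""
    return sum(subset)

def combine(partial_images):
    """Placeholder composition step that merges the partial images."""
    return sum(partial_images)

if __name__ == "__main__":
    dataset = list(range(1_000_000))                 # stand-in for scene data
    subsets = split_dataset(dataset, num_subsets=8)
    with Pool(processes=8) as pool:
        partial_images = pool.map(render_subset, subsets)  # simultaneous rendering
    print(combine(partial_images))                   # final image assembled here
```

The controlling process here plays the role of the synchronizing unit described above: it distributes the subsets, waits for all partial results, and performs the final composition.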
Object decomposition and image decomposition

A key step in data parallelism is decomposing
large data sets, a step that can utilize one of two
major approaches: object decomposition and
image decomposition. In object decomposition,
tasks are formed by partitioning the geometric
description of the scene. Individual worksta-
tions partition and render subsets of the geo-
metric data in parallel, producing pixels that
must be integrated later into a final image.
Image decomposition, in contrast, forms tasks
by partitioning the image space: each task ren-
ders only the geometric objects that contribute
to the pixels that physically belong to the space
assigned to the task.
Figure 3 illustrates how a data set could be
partitioned by these two approaches. In object
decomposition, each workstation renders a
single object in the data set: one renders the
rectangle, and the other renders the circle. In
image decomposition, each workstation renders
half of the final image: one renders the left side,
and the other renders the right side.
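The difference between the two strategies can be sketched in a few lines of Python; the toy scene below (a rectangle and a circle, echoing Figure 3), its bounding boxes, and the two-workstation split are all illustrative assumptions:

```python
# Illustrative comparison of object decomposition and image decomposition
# for a toy scene containing a rectangle and a circle (cf. Figure 3).
scene = [
    {"name": "rectangle", "bbox": (0.10, 0.30, 0.55, 0.70)},  # x0, x1, y0, y1
    {"name": "circle",    "bbox": (0.45, 0.90, 0.20, 0.60)},
]

def object_decomposition(scene, num_workstations):
    """Each workstation receives whole objects; the rendered pixels
    must be integrated into a final image afterwards."""
    tasks = [[] for _ in range(num_workstations)]
    for i, obj in enumerate(scene):
        tasks[i % num_workstations].append(obj["name"])
    return tasks

def image_decomposition(scene, num_workstations):
    """Each workstation owns a vertical strip of the image and renders
    only the objects whose bounding boxes overlap that strip."""
    strip = 1.0 / num_workstations
    tasks = []
    for w in range(num_workstations):
        left, right = w * strip, (w + 1) * strip
        tasks.append([obj["name"] for obj in scene
                      if obj["bbox"][0] < right and obj["bbox"][1] > left])
    return tasks

print(object_decomposition(scene, 2))  # [['rectangle'], ['circle']]
print(image_decomposition(scene, 2))   # [['rectangle', 'circle'], ['circle']]
```

Note that in the image decomposition case the circle straddles both strips and therefore appears in both task lists, which previews the loss of spatial coherence discussed below.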
There are no absolute guidelines when choos-
ing between object decomposition and image
decomposition. Generally, object decomposition
is suitable for applications with large-scale data
sets, while image decomposition is suitable for
applications requiring high resolution and a
large image, such as a tiled display integrating
multiple screens into a single display device.
Object decomposition can provide better load balancing and scalability than image decomposition, because objects can be distributed evenly among processors with relatively little preprocessing. However, it does require a post-
composition process to integrate the image
subsets, because objects assigned to different
processors may map to the same screen space.
For example, in Figure 3, after rendering the
circle and the rectangle individually, this post-
composition step determines how they overlap
to form the final image. With large numbers of
partitions, this step can place heavy demands on communication networks and require a huge amount of computation power for the composition units.

Figure 1. Frames rendered across multiple systems using temporal parallelism

Figure 2. Data subsets rendered across multiple systems using data parallelism

Figure 3. Parallel rendering processes using object decomposition and image decomposition
Image decomposition helps eliminate the
complexity of image integration by only requiring
a final composition step to physically map the
image parts together. However, this approach
may have a potential side effect: loss of spatial
coherence. This loss can occur because in image
decomposition, a single geometric object may
map to multiple regions in the image space,
which requires such objects to be shared by
multiple independent processors.
Sort-first and sort-last algorithms

In computer graphics, object space data
includes geometric descriptions of the scene,
such as polygons in 3D models. One of the chal-
lenges in parallel rendering is mapping this
data from object space to image space. Because
the original data set is partitioned and distrib-
uted to multiple processors, the basic parallel
rendering algorithms must be established as
sort-first or sort-last, depending on where the
function mapping the object space to the image
space occurs.
Sort-first is an initial preprocessing step for
assigning objects to the appropriate processors
based on a cross-space mapping policy. Sort-first
algorithms perform the space mapping early in
the rendering process, and the sort operation
happens before the data set is partitioned and
distributed. Sort-last algorithms are less sensitive to the distribution of objects within the image than sort-first algorithms, because the workload is divided according to the objects themselves rather than their positions on the screen.
The mapping of the object space to the image
space in sort-last algorithms occurs during com-
position, when pixels from individual proces-
sors are integrated into a final image. Sort-first
algorithms are typically combined with image
decomposition, and sort-last algorithms are
typically combined with object decomposition
(see Figures 4 and 5).

Figure 4. Parallel rendering process for display on tiled screens using sort-first algorithms with image decomposition

Figure 5. Parallel rendering process for display on a single screen using sort-last algorithms with object decomposition
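As an illustration of the sort-last composition step, the following sketch merges two partial renderings produced by object decomposition, keeping at each pixel the fragment closest to the viewer; the buffer sizes, colors, and depth values are invented for the example, and NumPy is assumed to be available:

```python
# Illustrative sort-last composition: each rendering node produces a color
# buffer and a depth buffer for its own objects, and the compositor keeps,
# per pixel, the fragment with the smallest depth value (nearest the viewer).
import numpy as np

def depth_composite(color_a, depth_a, color_b, depth_b):
    """Merge two partial renderings using a per-pixel depth test."""
    a_is_closer = depth_a <= depth_b
    color = np.where(a_is_closer[..., None], color_a, color_b)
    depth = np.minimum(depth_a, depth_b)
    return color, depth

if __name__ == "__main__":
    h, w = 4, 4
    # Node A rendered one object (red), node B another (blue); placeholder data.
    color_a = np.zeros((h, w, 3)); color_a[..., 0] = 1.0
    color_b = np.zeros((h, w, 3)); color_b[..., 2] = 1.0
    depth_a = np.full((h, w), 0.4)
    depth_b = np.full((h, w), 0.6)
    depth_b[1:3, 1:3] = 0.2            # B's object is nearer in the center
    final_color, final_depth = depth_composite(color_a, depth_a, color_b, depth_b)
    print(final_depth)
```

A sort-first pipeline avoids this per-pixel merge entirely: because each server owns a fixed region of the screen, the final step only has to place the finished tiles side by side.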
Enhancing HPC cluster graphics with open source utilities

Parallel rendering typically requires a special
software layer to exploit and coordinate distrib-
uted computational resources. Chromium and
DMX, two popular open source utilities, can
provide these capabilities.1
Chromium, developed by Lawrence Livermore
National Laboratory, Stanford University, the
University of Virginia, and Tungsten Graphics, is
an open source software stack for parallel render-
ing on clusters of workstations. It runs on the
Microsoft® Windows®, Linux®, IBM® AIX, Solaris,
and IRIX operating systems, and is designed to
increase three aspects of graphics scalability:
• Data scalability: Chromium can process increasingly large data sets by distributing workloads across increasingly large clusters.
• Rendering performance: Chromium can
scale out rendering performance by aggre-
gating commodity graphics hardware.
• Display performance: Chromium helps the system output large, high-resolution images, such as those for a tiled display.

1 For more information, see chromium.sourceforge.net and dmx.sourceforge.net.
Chromium enhances HPC cluster graphics
capabilities by supporting sort-first and sort-last
rendering as well as hybrid parallel rendering,
which combines both sort-first and sort-last
algorithms. Chromium configurations are written as Python scripts, which helps increase system flexibility. For example, on a single hardware architecture, one script can configure the system with a sort-first image decomposition policy to support large tiled displays with high image resolution, while a modified script can apply a sort-last object decomposition algorithm to render a large data set in parallel.
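The configuration sketch below is not the actual Chromium scripting API; it is a hypothetical Python fragment, with invented class and function names, intended only to show how a script rather than the hardware selects the parallel rendering policy:

```python
# Hypothetical configuration sketch (NOT the real Chromium API): it only
# illustrates that a script, not the hardware, chooses the rendering policy.

class ClusterConfig:
    def __init__(self, policy, render_nodes, tiles=None, compositor=None):
        self.policy = policy              # "sort-first" or "sort-last"
        self.render_nodes = render_nodes  # hostnames of rendering servers
        self.tiles = tiles                # (columns, rows) of a tiled display
        self.compositor = compositor      # node that merges partial images

def sort_first_tiled_display(nodes):
    """Image decomposition: one rendering server per display tile."""
    return ClusterConfig("sort-first", nodes, tiles=(4, 3))

def sort_last_large_dataset(nodes):
    """Object decomposition: every server renders; one node composites."""
    return ClusterConfig("sort-last", nodes, compositor=nodes[0])

nodes = [f"render{i:02d}" for i in range(12)]
config = sort_first_tiled_display(nodes)   # or sort_last_large_dataset(nodes)
print(config.policy, config.tiles or config.compositor)
```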
Flexible interfaces are a key feature of Chromium and allow it to support applications with a range of behaviors and requirements. One of these interfaces is a library that exposes standard OpenGL application programming interfaces (APIs) but dispatches the OpenGL calls into Chromium's processing chain for parallel rendering.
This mechanism is transparent to applications,
offering an easy way to deploy traditional
OpenGL applications to parallel rendering envi-
ronments, particularly for applications requiring
large tiled displays.
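The transparent interception mechanism can be sketched generically; the proxy below is illustrative only and does not reflect Chromium's internal implementation, but it shows how an application can keep issuing ordinary API calls while a thin library forwards them into a downstream processing chain:

```python
# Generic sketch of transparent call interception (illustrative, not the
# Chromium implementation): the application calls what looks like a normal
# graphics API, and the proxy forwards each call downstream instead of
# rendering locally.

class InterceptingAPI:
    def __init__(self, downstream):
        self._downstream = downstream     # stand-in for a network stream

    def __getattr__(self, name):
        def forward(*args):
            self._downstream.append((name, args))   # package and pass the call on
        return forward

command_stream = []                       # stand-in for the processing chain
gl = InterceptingAPI(command_stream)

# Unmodified application code keeps calling the usual entry points.
gl.glClear("GL_COLOR_BUFFER_BIT")
gl.glBegin("GL_TRIANGLES")
gl.glVertex3f(0.0, 0.0, 0.0)
gl.glEnd()

print(command_stream)
```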
Rendering very large, complex models
requires applications with native parallel coding
techniques. Chromium provides the necessary
data scalability to perform these tasks with a
set of parallel APIs that synchronize rendering
processes of multiple entities. These APIs can
integrate seamlessly with standard OpenGL
interfaces to support truly parallel rendering
applications on clusters.
Using Chromium with Linux or UNIX® operat-
ing systems to utilize large display walls with
the X Window System introduces a minor prob-
lem: although the large images can span mul-
tiple screens, each screen is still managed by
an independent X server. Integrating Chromium
with DMX can help solve this problem. DMX
allows a single X server to run across a cluster
of systems such that the X display or desktop
can present images across many physical dis-
plays. For example, a cluster of 12 workstations
running one X server can provide an image to a
large tiled display in a 4 × 3 screen configura-
tion. Working with DMX, Chromium can render
data sets and output large images to a unified
X server, which controls multiple graphics cards
connected to the physical displays and allows
logical windows to cross the display’s physical
boundaries.
Open source utilities such as Chromium and
DMX help simplify the deployment of parallel ren-
dering on high-performance workstation clusters.
Figure 6 shows an example software stack for
Linux and UNIX operating systems. Each worksta-
tion in a cluster requires the three bottom layers—
the graphics card, the driver, and the X Window
System and OpenGL. Chromium and DMX create
another layer to provide parallel rendering by uti-
lizing the rendering resources of individual work-
stations. Adding a layer containing toolkits such
as the open source Visualization Toolkit (VTK) and
OpenGL Utility Toolkit (GLUT) can help applications
utilize the bottom layers. When running on
Windows-based systems, DMX and the X Window
System are not required; the system can use OS
services to provide the necessary functionality.

Figure 6. Example software stack for parallel rendering on Linux- and UNIX-based clusters (bottom to top): graphics card; driver; X Window System and OpenGL; Chromium and DMX; VTK and GLUT; applications
Creating scalable HPC architectures for parallel rendering

Deploying parallel rendering technologies
can help increase the cost-effectiveness, flexi-
bility, and scalability of HPC architectures.
Organizations can take advantage of common
approaches such as temporal parallelism and
data parallelism to help streamline data set
rendering, as well as open source utilities such
as Chromium and DMX to present graphics on
high-resolution displays and provide the neces-
sary data scalability.
The combination of HPC cluster architec-
tures and parallel rendering can also accelerate
the pace of research projects—for example, by
allowing aerospace researchers to visualize the
heat generated by wind on an airplane shell to
help them increase airplane safety and effi-
ciency, enabling geologists to see through seis-
mic zones and enhance crude oil yield from
existing wells, and letting pharmaceutical
researchers see how human genes interact with
medicines to help accelerate drug development.
By using HPC clusters and parallel rendering in
these and other industries, organizations can
successfully address large-scale problems on
high-resolution data sets.
Li Ou, Ph.D., is a systems engineer in the
Scalable Systems Group at Dell. He has a B.S.
in Electrical Engineering and an M.S. in
Computer Science from the University of
Electronic Science and Technology of China,
and a Ph.D. in Computer Engineering from
Tennessee Technological University.
Yung-Chin Fang is a senior consultant in the
Scalable Systems Group at Dell. He has pub-
lished more than 30 articles on HPC and cyber-
infrastructure management, and participates in
HPC cluster–related open source groups as a
Dell representative.
Onur Celebioglu is an engineering manager in
the Scalable Systems Group at Dell. He has an
M.S. in Electrical and Computer Engineering
from Carnegie Mellon University.
Victor Mashayekhi, Ph.D., is the engineering
manager for the Scalable Systems Group at Dell,
and is responsible for product development for
HPC clusters, remote computing, unified com-
munication, virtualization, custom solutions,
and solutions advisors. Victor has a B.A., M.S.,
and Ph.D. in Computer Science from the
University of Minnesota.