DISTRIBUTED INTERACTIVE RAY TRACING FOR LARGE VOLUME VISUALIZATION

Dave DeMarle

May 1 2003

Thesis:

It is possible to visualize multi-gigabyte datasets interactively using ray tracing on a cluster.

Outline

Background.

Related work.

Communication.

Ray tracing with replicated data.

Distributed shared memory.

Ray tracing large volumes.

Ray Tracing

For every pixel, cast a ray from the viewpoint into the scene and test it for intersection with every object.

Take the color of the nearest hit object for the pixel.

Shadows, reflections, refractions, and other photorealistic effects simply require more rays.
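The following is a minimal Python sketch of this loop. It assumes a scene made only of spheres and a pinhole camera at the origin looking down -z. `Sphere`, `intersect`, `cast_ray`, and `render` are hypothetical names chosen for illustration; they are not *-Ray's classes. Only primary rays are cast. Shadows or reflections would cast further rays from each hit point.

```python
import math
from dataclasses import dataclass


@dataclass
class Sphere:
    center: tuple   # (x, y, z)
    radius: float
    color: tuple    # (r, g, b)


def intersect(sphere, origin, direction):
    """Return the nearest positive hit distance t along a unit-length ray, or None."""
    ox, oy, oz = (origin[i] - sphere.center[i] for i in range(3))
    b = ox * direction[0] + oy * direction[1] + oz * direction[2]
    c = ox * ox + oy * oy + oz * oz - sphere.radius ** 2
    disc = b * b - c            # quadratic with a == 1 because direction is unit length
    if disc < 0:
        return None
    root = math.sqrt(disc)
    for t in (-b - root, -b + root):
        if t > 1e-6:
            return t
    return None


def cast_ray(scene, origin, direction, background=(0, 0, 0)):
    """Test the ray against every object and keep the color of the nearest hit."""
    nearest_t, color = math.inf, background
    for obj in scene:
        t = intersect(obj, origin, direction)
        if t is not None and t < nearest_t:
            nearest_t, color = t, obj.color
    return color


def render(scene, width, height):
    """One primary ray per pixel from a camera at the origin looking down -z."""
    image = []
    for j in range(height):
        row = []
        for i in range(width):
            x = (i + 0.5) / width * 2 - 1
            y = 1 - (j + 0.5) / height * 2
            length = math.sqrt(x * x + y * y + 1)
            row.append(cast_ray(scene, (0.0, 0.0, 0.0), (x / length, y / length, -1 / length)))
        image.append(row)
    return image
```

Each pixel's ray is independent of every other pixel's ray. That independence is what lets an image-parallel renderer hand different pixels to different CPUs.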

Interactive Ray Tracing

1998: *-Ray

Image-parallel renderer optimized for the SGI Origin shared-memory supercomputer.

My work moves this program to a cluster to make it less expensive.

[Figure: ray-traced forest scene showing the task distribution among CPU 1, CPU 2, CPU 3, and CPU 4.]

Cluster Computing

Connect inexpensive machines.

Advantages: cheaper, and a faster growth curve in the commodity market.

Disadvantages: slower network, and separate memory on each node.

                          Ray                    Nebula
Cost                      ~$1.5 million          ~$150 thousand
CPUs                      32 R12K, 0.39 GHz      2x32 Xeon, 1.7 GHz
RAM                       16 GB (shared)         32 GB (1 GB per node)
Network                   NUMA hypercube         Switched Gbit Ethernet
Avg. round-trip latency   335 ns                 34,000 ns
Bandwidth                 12.8 Gbit/s            0.6 Gbit/s
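These numbers show what moving data costs on each machine. Assume a transfer takes one average round trip t_rt plus the message size S divided by the bandwidth B. This simple model ignores protocol and software overhead and treats a Gbit as 10^9 bits. A 4 KB (32,768-bit) block then costs:

```latex
\begin{aligned}
T(S) &\approx t_{\mathrm{rt}} + \frac{S}{B} \\
T_{\mathrm{Ray}} &\approx 0.335\,\mu\mathrm{s} + \frac{32768\,\mathrm{bit}}{12.8\,\mathrm{Gbit/s}} \approx 0.335\,\mu\mathrm{s} + 2.56\,\mu\mathrm{s} \approx 2.9\,\mu\mathrm{s} \\
T_{\mathrm{Nebula}} &\approx 34\,\mu\mathrm{s} + \frac{32768\,\mathrm{bit}}{0.6\,\mathrm{Gbit/s}} \approx 34\,\mu\mathrm{s} + 54.6\,\mu\mathrm{s} \approx 88.6\,\mu\mathrm{s}
\end{aligned}
```

Under this model the cluster takes roughly 30 times as long to move the same block.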

Related Work

2001: Saarland Renderer

Trace 4 rays with SIMD operations.

Obtain data from a central server.

Limited to triangular data.

My work keeps *-Ray’s flexibility and uses distributed data ownership.

Related Work

1993: Corrie and Mackeras

Volume rendering on a Fujitsu AP1000.

My work uses recent hardware, and multithreading on each node, to achieve interactivity.

Communication

Legion

Goal 1: reduce library overhead.

Built on top of TCP.

Goal 2: reduce wait time. A dedicated communication thread handles incoming traffic.

Inbound: select(), read the header, call the handler function.

Outbound: sends are protected with a mutex for thread safety.

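[Figure: on Node 0, compute threads Comp Thread 1 to Comp Thread T send through Communicator::send(). The Communicator Thread waits in select() on the network and dispatches incoming messages to handler_1() through handler_h().]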
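The following is a minimal Python sketch of the same pattern. It uses the standard `selectors` module in place of a raw select() loop. The 6-byte header (handler id, payload length) is a hypothetical layout, and so are the method names `register`, `start`, and `stop`; this is not Legion's actual interface. One mutex serializes all outbound sends.

```python
import selectors
import socket
import struct
import threading

# Hypothetical wire header: 2-byte handler id + 4-byte payload length, network order.
HEADER = struct.Struct("!HI")


def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes from a blocking socket."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf.extend(chunk)
    return bytes(buf)


class Communicator:
    def __init__(self, peers):
        self.handlers = {}                 # handler id -> callable(payload)
        self.send_lock = threading.Lock()  # outbound: one sender at a time
        self.selector = selectors.DefaultSelector()
        for sock in peers:
            self.selector.register(sock, selectors.EVENT_READ)
        self.running = True
        self.thread = threading.Thread(target=self._loop, daemon=True)

    def register(self, handler_id, fn):
        self.handlers[handler_id] = fn

    def start(self):
        self.thread.start()

    def stop(self):
        self.running = False
        self.thread.join()
        self.selector.close()

    def send(self, sock: socket.socket, handler_id: int, payload: bytes):
        """Called from compute threads; the mutex keeps messages from interleaving."""
        with self.send_lock:
            sock.sendall(HEADER.pack(handler_id, len(payload)) + payload)

    def _loop(self):
        """Communicator thread: select(), read the header, call the handler."""
        while self.running:
            for key, _ in self.selector.select(timeout=0.1):
                try:
                    handler_id, size = HEADER.unpack(recv_exact(key.fileobj, HEADER.size))
                    payload = recv_exact(key.fileobj, size)
                except ConnectionError:
                    self.selector.unregister(key.fileobj)
                    continue
                self.handlers[handler_id](payload)
```

Handlers run on the communicator thread. Compute threads therefore never block waiting for input; they only contend for the send mutex.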

Distributed Ray Tracer Implementation

Image-parallel ray tracer.

Supervisor/Workers program structure.

Each node runs a multithreaded application.

Replicate data if it fits in each node’s memory.

Use Distributed Shared Memory (DSM) for larger volumetric data.

[Figure: the Supervisor, which shows the image to the user, coordinates Workers 1 to 3. Each worker runs RenderThread 1 and RenderThread 2.]

Supervisor Program

[Figure, Node 0: Communicator, Scene State, Frame State, Task State, Display Thread, Aux. Dpy Threads, and Image.]

Worker Program

[Figure, Node N: Communicator, Scene State, Frame State, TaskManager with its TaskQueue, Render Thread 1 to Render Thread N, Scene, and ViewManager.]

Render State

Data that *-Ray communicated by reference between functional units is now transferred over the network.

SceneState – constant over a session. Acceleration structure type, number of workers…

FrameState – can change each frame. Camera position, image resolution…

TaskState – changes during a frame. Pixel tile assignments.
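Because this state now crosses the network, each structure must be flattened into bytes. Below is a minimal sketch for FrameState using Python's `struct` module. The field list (camera eye, look-at point, up vector, image width and height) and the wire layout are hypothetical illustrations; the real fields and format are not given here.

```python
import struct
from dataclasses import dataclass


@dataclass
class FrameState:
    """Hypothetical per-frame state; the field list is an illustrative guess."""
    eye: tuple       # camera position (x, y, z)
    lookat: tuple    # point the camera looks at
    up: tuple        # camera up vector
    width: int       # image resolution in pixels
    height: int

    # Not a dataclass field (no annotation): 9 doubles + 2 unsigned ints, network byte order.
    WIRE = struct.Struct("!9d2I")

    def pack(self) -> bytes:
        """Flatten the state into bytes for the network."""
        return self.WIRE.pack(*self.eye, *self.lookat, *self.up, self.width, self.height)

    @classmethod
    def unpack(cls, data: bytes) -> "FrameState":
        """Rebuild the state on the receiving node."""
        v = cls.WIRE.unpack(data)
        return cls(v[0:3], v[3:6], v[6:9], v[9], v[10])
```

SceneState is constant over a session, so it would be sent once. A FrameState like this one goes out whenever the camera or the resolution changes.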

TaskManager keeps a local queue of tasks.

Two semaphores guard the queue.

[Figure: the Supervisor sends tiles to the TaskManager on Worker 1. Its TaskQueue feeds Render Thread 1 and Render Thread 2, and the rendered tiles form the Image.]
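The slides do not say what each semaphore counts. Here is a minimal sketch under one plausible reading: a binary semaphore protects the queue, and a counting semaphore tracks how many tiles are waiting, so render threads sleep instead of spinning when the queue is empty. The class and method names are illustrative.

```python
import threading
from collections import deque


class TaskManager:
    """Local tile queue guarded by two semaphores (one plausible arrangement):
    `mutex` (binary) protects the deque, `available` counts queued tiles."""

    def __init__(self):
        self.queue = deque()
        self.mutex = threading.Semaphore(1)
        self.available = threading.Semaphore(0)

    def push(self, tile):
        """Called when the supervisor assigns a tile to this worker."""
        self.mutex.acquire()
        try:
            self.queue.append(tile)
        finally:
            self.mutex.release()
        self.available.release()      # wake one waiting render thread

    def pop(self):
        """Called by render threads; blocks until a tile is queued."""
        self.available.acquire()
        self.mutex.acquire()
        try:
            return self.queue.popleft()
        finally:
            self.mutex.release()
```

A render thread would loop on pop(). Pushing one sentinel per thread, such as None, is one way to shut the threads down.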

Network Limitation

The maximum frame rate is determined by the network.

19 μs per tile (queuing), 600 Mbit/s bandwidth.

[Plot: frames/sec (0 to 80) versus CPUs (1 to 31) for 32x32, 16x16, 8x8, and 4x4 pixel tiles, each shown with its frame-rate limit.]
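Here is a back-of-the-envelope model of why tile size caps the frame rate. It assumes a 512x512 image with 3 bytes per pixel, and that the 19 μs per-tile queuing cost and the pixel transfer at 600 Mbit/s simply add up within a frame. This is an illustrative bound and not necessarily how the plotted limits were computed; `max_frame_rate` is a hypothetical helper.

```python
# Per-tile cost and bandwidth from the slide; the rest of the model is assumed.
PER_TILE_S = 19e-6        # queuing cost per tile, in seconds
BANDWIDTH_BPS = 600e6     # network bandwidth, in bits per second


def max_frame_rate(width, height, tile, bytes_per_pixel=3):
    """Rough upper bound on frames/s, assuming per-tile queuing and the transfer
    of the finished pixels are paid one after the other each frame."""
    tiles = ((width + tile - 1) // tile) * ((height + tile - 1) // tile)
    queuing_s = tiles * PER_TILE_S
    transfer_s = width * height * bytes_per_pixel * 8 / BANDWIDTH_BPS
    return 1.0 / (queuing_s + transfer_s)


if __name__ == "__main__":
    for tile in (32, 16, 8, 4):
        print(f"{tile}x{tile} tiles: {max_frame_rate(512, 512, tile):.1f} f/s")
```

Under these assumptions the bound falls from roughly 65 f/s with 32x32 tiles to about 3 f/s with 4x4 tiles. The per-tile cost grows with the number of tiles while the pixel payload stays fixed.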

Replicated Comparison

[Plot: Machine Comparison with Replicated Data. Frames/sec (0 to 10) versus CPUs (1 to 31) for 16x16 and 8x8 tiles on the SGI and on the cluster.]

Large Volumes

Richtmyer-Meshkov instability simulation from Lawrence Livermore National Laboratory.

1920x2048x2048 voxels x 8 bits (8,053,063,680 bytes, or 7.5 GiB, per timestep)

Legion’s DSM

DataServer class: compute threads call acquire to obtain blocks of memory. The DataServer finds and returns the requested block. Compute threads call release to let the DataServer reuse the space.

The DataServer uses Legion to transfer blocks over the network. Each node owns the blocks in its resident_set area and caches remotely owned blocks in its local_cache area.

Five DataServer flavors: single-threaded, multithreaded direct-mapped, associative, mmap from disk, and writable.

[Figure: three nodes, each with a Communicator Thread and a DataServer. Each DataServer holds a resident_set of owned blocks (0 3 6, 1 4 7, and 2 5 8) and a local_cache of copies of remote blocks. Comp. Thread 1 on Node 0 calls get_data() and release_data().]
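Here is a single-threaded sketch of the acquire/release protocol. It assumes round-robin ownership (block b is owned by node b mod N), a direct-mapped local_cache, and a pin count that keeps a cached block from being evicted while a compute thread still holds it. `fetch_remote` is a hypothetical stand-in for the Legion transfer. `CacheSlot`, `owner`, `hits`, and `misses` are illustrative names; only acquire, release, resident_set, and local_cache come from the slides.

```python
from dataclasses import dataclass


@dataclass
class CacheSlot:
    block_id: int = -1     # -1 means the slot is empty
    data: bytes = b""
    pins: int = 0          # outstanding acquires; a pinned slot cannot be evicted


class DataServer:
    """Single-threaded, direct-mapped DSM sketch (hypothetical, for illustration)."""

    def __init__(self, node_id, num_nodes, resident, cache_slots, fetch_remote):
        self.node_id = node_id
        self.num_nodes = num_nodes
        self.resident_set = dict(resident)          # blocks this node owns
        self.local_cache = [CacheSlot() for _ in range(cache_slots)]
        self.fetch_remote = fetch_remote            # stands in for a Legion transfer
        self.hits = 0
        self.misses = 0

    def owner(self, block_id):
        return block_id % self.num_nodes            # round-robin ownership (assumed)

    def acquire(self, block_id):
        if self.owner(block_id) == self.node_id:
            return self.resident_set[block_id]
        slot = self.local_cache[block_id % len(self.local_cache)]
        if slot.block_id == block_id:
            self.hits += 1
        else:
            if slot.pins > 0:
                # A real server would wait or choose another slot; the sketch just fails.
                raise RuntimeError(f"slot holding block {slot.block_id} is still in use")
            self.misses += 1
            data = self.fetch_remote(self.owner(block_id), block_id)
            slot.block_id, slot.data = block_id, data
        slot.pins += 1
        return slot.data

    def release(self, block_id):
        if self.owner(block_id) == self.node_id:
            return
        slot = self.local_cache[block_id % len(self.local_cache)]
        assert slot.block_id == block_id and slot.pins > 0
        slot.pins -= 1
```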

Large Volumes

Use distributed versions of *-Ray’s templated volume classes, which treat the DataServer as a 3D array.

[Figure: DISOVolume and DMIPVolume are built on DBrickArray3, which is built on the DataServer. An access to Data(x,y,z) becomes an access to Block Q at Offset R.]
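Here is a sketch of the translation from a voxel index to a DataServer block and an offset within it, for one level of bricking. It assumes 1 byte per voxel, cubic bricks of `brick` voxels per side, and bricks numbered in x-fastest order. `brick_address` is a hypothetical helper, and DBrickArray3 may order bricks and voxels differently.

```python
def brick_address(x, y, z, dims, brick):
    """Return (block Q, offset R) for voxel (x, y, z).

    dims  -- volume size (nx, ny, nz) in voxels
    brick -- brick edge length in voxels; partial bricks at the edges are padded
    """
    nbx = (dims[0] + brick - 1) // brick   # bricks along x
    nby = (dims[1] + brick - 1) // brick   # bricks along y
    bx, by, bz = x // brick, y // brick, z // brick
    ix, iy, iz = x % brick, y % brick, z % brick
    q = (bz * nby + by) * nbx + bx          # which DataServer block
    r = (iz * brick + iy) * brick + ix      # byte offset inside the block
    return q, r
```

Neighboring voxels usually land in the same block, so a ray sampling a small region tends to touch only a few blocks.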

[Figure: isosurface of the Visible Female, showing data ownership.]

Optimized Data Access for Large Volumes

Use three-level bricking for memory coherence: 64-byte cache line, 4 KB OS page, and 4 KB × L³ network transfer size.

Third-level bricks are the DataServer blocks.

Use a macrocell hierarchy to reduce the number of accesses.
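Here is a sketch of the macrocell idea for isosurfacing. It stores the minimum and maximum value of each macrocell and skips any macrocell whose range excludes the isovalue, so that macrocell's voxels are never fetched through the DSM. It shows a single level of the hierarchy on a nested-list volume; the function names are hypothetical.

```python
def build_macrocells(volume, m):
    """Return {(cx, cy, cz): (min, max)} for macrocells of m x m x m cells.

    volume is indexed volume[z][y][x]. A cell spans voxels i..i+1 on each axis,
    so a macrocell's range includes the voxels on its far faces too.
    """
    nz, ny, nx = len(volume), len(volume[0]), len(volume[0][0])

    def voxel_span(c, n):
        # voxels touched by cells c*m .. c*m+m-1, clipped to the last cell (n - 2)
        return range(c * m, min(c * m + m, n - 1) + 1)

    def count(n):
        # number of macrocells along an axis with n voxels (n - 1 cells)
        return (n - 1 + m - 1) // m

    table = {}
    for cz in range(count(nz)):
        for cy in range(count(ny)):
            for cx in range(count(nx)):
                vals = [volume[z][y][x]
                        for z in voxel_span(cz, nz)
                        for y in voxel_span(cy, ny)
                        for x in voxel_span(cx, nx)]
                table[(cx, cy, cz)] = (min(vals), max(vals))
    return table


def may_contain_isosurface(table, key, isovalue):
    """A macrocell can only contain the isosurface if min <= isovalue <= max."""
    lo, hi = table[key]
    return lo <= isovalue <= hi
```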

Results with Distributed Data

Hit time of 6.86 μs or higher; the associative DataServer takes longer.

Miss time of 390 μs or higher; larger bricks take longer.

Empirically, if the local cache is larger than 10% of the data size, hit rates exceed 95% for isosurfacing and MIPing.

Investigated techniques to increase the hit rate and reduce the number of accesses.
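A simple average-access-time model shows how the two costs trade off as the hit rate h changes. It uses the lower bounds above and assumes DSM accesses do not overlap with computation:

```latex
\begin{aligned}
\bar{t} &= h\,t_{\mathrm{hit}} + (1 - h)\,t_{\mathrm{miss}} \\
\bar{t}\,\big|_{h = 0.95} &\approx 0.95 \times 6.86\,\mu\mathrm{s} + 0.05 \times 390\,\mu\mathrm{s} \approx 6.5\,\mu\mathrm{s} + 19.5\,\mu\mathrm{s} \approx 26\,\mu\mathrm{s} \\
h^{*} &= \frac{t_{\mathrm{miss}}}{t_{\mathrm{hit}} + t_{\mathrm{miss}}} = \frac{390}{396.86} \approx 0.983
\end{aligned}
```

Here h* is the hit rate at which the hit and miss terms contribute equally. Above roughly 98% the hit term dominates the average. At that point, reducing the hit time or the number of accesses matters more than making misses faster.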

Consolidated Access

Hit time is usually the limiting factor.

Reduce the number of DSM accesses.

Eliminate redundant accesses.

When a ray needs data, sort its accesses so that all the data it needs inside a brick is obtained with one DSM access.
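Here is a minimal sketch of the idea. It assumes a hypothetical `locate` function that maps a voxel to (block, offset), plus DSM `acquire`/`release` callables. A ray's pending accesses are grouped by block, and each block is acquired once and released after all its voxels are read, instead of once per voxel.

```python
from collections import defaultdict


def consolidated_access(samples, locate, acquire, release):
    """locate(voxel) -> (block_id, offset). Group the accesses by block so each
    block is acquired once, instead of once per voxel."""
    by_block = defaultdict(list)
    for voxel in samples:
        q, r = locate(voxel)
        by_block[q].append((voxel, r))
    values = {}
    for q in sorted(by_block):
        block = acquire(q)                 # one DSM access per distinct block
        try:
            for voxel, r in by_block[q]:
                values[voxel] = block[r]
        finally:
            release(q)
    return values
```

Redundant requests for voxels in an already-acquired block cost no extra DSM calls.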

[Figure: a macrocell spanning Brick 1, Brick 2, and Brick 3.]

[Plot, 2 GB: frames/sec (0 to 8) and acquires/node/frame (0 to 500,000) for Access 1, Access 8, and Access X.]

Machine Comparison

Use the Richtmyer-Meshkov data set to compare the distributed ray tracer with *-Ray.

To determine how data sharing affects the cluster program.

[Plot: frames/sec versus frame number. Ray, 31 CPUs: 4.7 f/s. Nebula, 62 CPUs: 1.7 f/s. Nebula, 32 CPUs: 1.1 f/s.]

Traffic

When the entire volume is in view, it takes a few frames for the caches to load, which slows down the renderer.

When only a portion is in view, the working set is small and network traffic is not an issue.

[Plots: MB/node and frames/sec as the isovalue and viewpoint change; frames/sec versus frame number for the recorded rate and the loaded rate.]

Images

Treepot scene, 2 million polygons

512x512, 1 hard shadow, ~1 f/s

CPU-bound, not network-bound

Images

Richtmyer-Meshkov, timestep 270, 1920x2048x2048

512x512, 1 to 2 f/s with 1 hard shadow

CPU-bound or network-bound, depending on the viewpoint.

Images

Focusing in…

Conclusion

Confirmed that interactive ray tracing on a cluster is possible.

Scaling and the ultimate frame rate are limited by latency, and the number of tasks in the image determines the maximum frame rate.

With reasonably complex scenes the renderer is CPU-bound, even with 62 processors.

With tens of processors, the cluster is comparable to the supercomputer.

Conclusion

Data sets that exceed the memory space of any one node can be managed with a DSM.

For isosurfacing and MIPing, hit time, not network time, is the limiting factor.

The longer data access time makes the cluster slower than the supercomputer, but it is still interactive.

Future Work

Faster rendering, so realistic images can be produced interactively:

Faster network layer.

Faster DSM.

Faster ray tracing.

Direct volume rendering.

Distributed polygonal data sets.

Acknowledgments

NSF Grants 9977218, 9978099.

DOE Views.

NIH Grants.

My committee: Steve, Chuck, and Pete.

Patti DeMarle.

Thanks to everyone else, for making this a great place to live and work!
