DISTRIBUTED INTERACTIVE RAY TRACING FOR LARGE VOLUME VISUALIZATION
Dave DeMarle
May 1 2003
Thesis:
It is possible to visualize multi-Gigabyte datasets interactively using ray tracing on a cluster.
Outline
Background.
Related work.
Communication.
Ray tracing with replicated data.
Distributed shared memory.
Ray tracing large volumes.
Ray Tracing
For every pixel, compute a ray from a viewpoint into space, and test for intersection with every object.
Take the nearest hit object’s color for the pixel.
Shadows, reflections, refractions and photorealistic effects simply require more rays.
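The per-pixel loop can be sketched in a few lines. This is a minimal illustration, not *-Ray's code: the sphere-only scene, the pinhole camera at the origin, and the names hit_sphere, trace and render are all assumptions made for the example.

```python
import math

def hit_sphere(origin, direction, center, radius):
    """Distance along a normalized ray to a sphere, or None on a miss.
    Assumes the ray origin lies outside the sphere."""
    oc = [o - c for o, c in zip(origin, center)]
    b = sum(d * v for d, v in zip(direction, oc))
    c = sum(v * v for v in oc) - radius * radius
    disc = b * b - c
    if disc < 0:
        return None
    t = -b - math.sqrt(disc)
    return t if t > 0 else None

def trace(origin, direction, spheres, background="black"):
    """Test the ray against every object and keep the nearest hit's color."""
    nearest_t, color = math.inf, background
    for center, radius, sphere_color in spheres:
        t = hit_sphere(origin, direction, center, radius)
        if t is not None and t < nearest_t:
            nearest_t, color = t, sphere_color
    return color

def render(width, height, spheres):
    """One primary ray per pixel through an image plane at z = -1."""
    eye = (0.0, 0.0, 0.0)
    image = []
    for j in range(height):
        row = []
        for i in range(width):
            x = (i + 0.5) / width * 2 - 1
            y = 1 - (j + 0.5) / height * 2
            n = math.sqrt(x * x + y * y + 1)
            row.append(trace(eye, (x / n, y / n, -1 / n), spheres))
        image.append(row)
    return image
```

Shadows or reflections would add further calls to trace from each hit point, which is why those effects cost more rays.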
Interactive Ray Tracing
1998: *-Ray
Image Parallel renderer optimized for SGI-Origin shared memory supercomputer.
My work moves this program to a Cluster, in order to make it less expensive.
Figure: Ray-traced forest scene showing task distribution across CPU 1, CPU 2, CPU 3 and CPU 4.
Cluster Computing
Connect inexpensive machines.
Advantages: Cheaper. Faster growth curve in commodity market.
Disadvantages: Slower network. Separate memory.
                          Ray                      Nebula
Cost                      ~$1.5 million            ~$150 thousand
CPUs                      32 0.39 GHz R12K         2x32 1.7 GHz Xeon
RAM                       16 GB (shared)           32 GB (1 GB per node)
Network                   NUMA hypercube           Switched Gbit Ethernet
Avg round trip latency    335 ns                   34000 ns
Bandwidth                 12.8 Gbit/sec            0.6 Gbit/sec
Related Work
2001: Saarland Renderer
Trace 4 rays with SIMD operations.
Obtain data from a central server.
Limited to triangular data.
My work keeps *-Ray’s flexibility, and uses distributed ownership.
Related Work
1993: Corrie and Mackerras
Volume rendering on a Fujitsu AP1000.
My work uses recent hardware, and multithreading on each node, to achieve interactivity.
Communication
Legion
Goal 1: reduce library overhead. Built on top of TCP.
Goal 2: reduce wait time. A dedicated communication thread handles incoming traffic.
Inbound: select(), read the header, call the handler function.
Outbound: protect with a mutex for thread safety.
Figure: Node 0. Comp Threads 1 to T call Communicator::send() to reach the net; the Communicator Thread waits in select() and dispatches to handler_1() … handler_h().
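The inbound and outbound paths can be sketched as follows. This is a toy stand-in rather than Legion itself: the 8-byte header layout, the Communicator method names other than send, and the use of a local socket pair in place of a TCP connection are assumptions for illustration.

```python
import select
import socket
import struct
import threading

# Hypothetical header: (handler id, payload length), both 32-bit unsigned.
HEADER = struct.Struct("!II")

class Communicator:
    def __init__(self, sock):
        self.sock = sock
        self.handlers = {}
        self.send_lock = threading.Lock()

    def register(self, handler_id, fn):
        self.handlers[handler_id] = fn

    def send(self, handler_id, payload):
        # Outbound: any compute thread may call this, so writes are serialized.
        with self.send_lock:
            self.sock.sendall(HEADER.pack(handler_id, len(payload)) + payload)

    def _read_exact(self, n):
        buf = b""
        while len(buf) < n:
            chunk = self.sock.recv(n - len(buf))
            if not chunk:
                raise ConnectionError("peer closed")
            buf += chunk
        return buf

    def serve_once(self, timeout=1.0):
        # Inbound: wait in select(), read the header, call the handler.
        ready, _, _ = select.select([self.sock], [], [], timeout)
        if not ready:
            return False
        handler_id, length = HEADER.unpack(self._read_exact(HEADER.size))
        self.handlers[handler_id](self._read_exact(length))
        return True

def demo():
    a, b = socket.socketpair()
    node0, node1 = Communicator(a), Communicator(b)
    received = []
    node1.register(1, received.append)
    node0.send(1, b"tile 7")
    node1.serve_once()
    a.close()
    b.close()
    return received
```

In a real node, serve_once would run in a loop on the dedicated communicator thread, so compute threads never block on incoming traffic.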
Distributed Ray Tracer Implementation
Image Parallel Ray Tracer.
Supervisor/Workers program structure.
Each node runs a multithreaded application.
Replicate data if it fits in each node’s memory.
Use Distributed Shared Memory (DSM) for larger volumetric data.
Figure: The user interacts with the Supervisor, which holds the image. Workers 1, 2 and 3 each run RenderThread 1 and RenderThread 2.
Supervisor Program
Figure: Supervisor program on Node 0, made up of the Communicator, Scene State, Frame State, Task State, Display Thread, Aux. Dpy Threads, and Image.
Worker Program
Figure: Worker program on Node N, made up of the Communicator, Scene State, Frame State, TaskManager, TaskQueue, ViewManager, Scene, and Render Threads 1 … N.
Render State
Data that *-Ray communicated by reference between functional units is now transferred over the network.
SceneState – constant over a session. Acceleration structure type, number of workers…
FrameState – can change each frame. Camera Position, image resolution…
TaskState – changes during a frame. Pixel tile assignments.
TaskManager keeps a local queue of tasks.
Two semaphores guard the queue.
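One common way for two semaphores to guard a queue is the bounded-buffer pattern: one semaphore counts free slots, the other counts waiting tiles. The sketch below assumes that pattern; the slides do not say how the TaskManager uses its semaphores, and the class and function names here are illustrative.

```python
import threading
from collections import deque

class TaskQueue:
    def __init__(self, capacity):
        # deque.append and deque.popleft are thread-safe in CPython.
        self._tiles = deque()
        self._free = threading.Semaphore(capacity)  # free slots
        self._filled = threading.Semaphore(0)       # tiles waiting

    def put(self, tile):
        self._free.acquire()      # block while the queue is full
        self._tiles.append(tile)
        self._filled.release()

    def get(self):
        self._filled.acquire()    # block while the queue is empty
        tile = self._tiles.popleft()
        self._free.release()
        return tile

def render_all(tiles, num_threads=2, capacity=4):
    """Feed tiles to render threads; None is a shutdown marker."""
    queue = TaskQueue(capacity)
    done = []

    def render_thread():
        while True:
            tile = queue.get()
            if tile is None:
                return
            done.append(tile)  # stand-in for tracing the tile's pixels

    threads = [threading.Thread(target=render_thread) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for tile in tiles:
        queue.put(tile)
    for _ in threads:
        queue.put(None)
    for t in threads:
        t.join()
    return sorted(done)
```

Keeping a few tiles queued locally lets a render thread start its next tile without waiting on a network round trip to the supervisor.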
Figure: Tiles flow from the Supervisor to Worker 1's TaskManager and TaskQueue, are rendered by Render Thread 1 and Render Thread 2, and return to the Image.
Network Limitation
Max frame rate determined by network.
19 μs per tile (queuing), 600Mbit/sec bandwidth.
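A rough model of this limit, using the two figures above. Several things here are assumptions not stated on the slide: that queuing cost and pixel transfer time simply add, that the image is 512x512, and that pixels are 24 bits. max_frame_rate is a hypothetical helper.

```python
import math

def max_frame_rate(width, height, tile, queue_s=19e-6,
                   bandwidth_bps=600e6, bits_per_pixel=24):
    """Upper bound on frames/sec if the network were the only cost."""
    tiles = math.ceil(width / tile) * math.ceil(height / tile)
    queuing = tiles * queue_s                                   # per-tile overhead
    transfer = width * height * bits_per_pixel / bandwidth_bps  # pixel data
    return 1.0 / (queuing + transfer)

if __name__ == "__main__":
    for tile in (32, 16, 8, 4):
        print(f"{tile}x{tile} tiles: {max_frame_rate(512, 512, tile):.1f} frames/sec")
```

Under these assumptions smaller tiles lower the ceiling, because the fixed per-tile cost is paid more often per frame.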
Figure: Frames/sec versus number of CPUs (1, 8, 12, 16, 31), showing measured rates and network limits for 32x32, 16x16, 8x8 and 4x4 tiles.
Replicated Comparison
Figure: Machine comparison with replicated data. Frames/sec versus number of CPUs (1, 8, 16, 24, 31) for 16x16 and 8x8 tiles on the SGI and on the cluster.
Large Volumes
Richtmyer-Meshkov instability simulation from Lawrence Livermore National Labs.
1920x2048x2048 x 8 bit
Legion’s DSM
DataServer class.
Compute threads call acquire to obtain blocks of memory.
The DataServer finds and returns the requested block.
Compute threads call release to let the DataServer reuse the space.
The DataServer uses Legion to transfer blocks over the network.
Each node owns the blocks in its resident_set area, and caches remotely owned blocks in its local_cache area.
5 DataServer flavors: single threaded, multithreaded direct mapped, associative, mmap from disk, and writable.
Figure: Three nodes, each with a DataServer (resident_set and local_cache), a Communicator Thread, and Comp. Thread 1 calling get_data() and release_data(). Node 0 owns blocks 0, 3, 6; Node 1 owns blocks 1, 4, 7; Node 2 owns blocks 2, 5, 8. Each local_cache holds copies of blocks owned by the other nodes.
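A single-threaded sketch of the acquire/release cycle with a direct mapped cache. The names acquire, release, resident_set and local_cache come from the slides. The ownership rule (block b lives on node b mod the node count) is read off the diagram's block numbers. The pin counts, the fetch_remote callback standing in for a Legion transfer, and the 4-byte dummy blocks are assumptions.

```python
class DataServer:
    def __init__(self, node, num_nodes, num_blocks, cache_slots, fetch_remote):
        self.num_nodes = num_nodes
        # Assumed ownership rule: block b is resident on node b % num_nodes.
        self.resident_set = {b: bytes([b]) * 4
                             for b in range(num_blocks) if b % num_nodes == node}
        self.local_cache = [None] * cache_slots  # each slot: (block id, data)
        self.fetch_remote = fetch_remote         # hypothetical network fetch
        self.pins = {}                           # block id -> acquires outstanding
        self.hits = 0
        self.misses = 0

    def _lookup(self, block):
        if block in self.resident_set:
            self.hits += 1
            return self.resident_set[block]
        slot = block % len(self.local_cache)     # direct mapped
        cached = self.local_cache[slot]
        if cached is not None and cached[0] == block:
            self.hits += 1
            return cached[1]
        if cached is not None and self.pins.get(cached[0], 0) > 0:
            raise RuntimeError("cache slot still in use")
        self.misses += 1
        data = self.fetch_remote(block % self.num_nodes, block)
        self.local_cache[slot] = (block, data)
        return data

    def acquire(self, block):
        data = self._lookup(block)
        self.pins[block] = self.pins.get(block, 0) + 1
        return data

    def release(self, block):
        # Once unpinned, the block's cache slot may be reused.
        self.pins[block] -= 1

def demo():
    servers = []

    def fetch(owner, block):
        return servers[owner].resident_set[block]

    servers.extend(DataServer(n, 3, 9, 4, fetch) for n in range(3))
    node0 = servers[0]
    for block in (0, 1, 1, 4):
        node0.acquire(block)
        node0.release(block)
    return node0.hits, node0.misses
```

In demo, block 0 is resident on node 0, the first request for block 1 is a miss, the second is served from local_cache, and block 4 is another miss.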
Large Volumes
Use distributed versions of *-Ray’s templated volume classes, which treat the DataServer as a 3D array.
Figure: DISOVolume and DMIPVolume are built on DBrickArray3, which is built on the DataServer. Data(x,y,z) maps to Block Q, Offset R.
Figure: Isosurface of the visible female, showing data ownership.
Optimized Data access for Large Volumes
Use 3 level bricking for memory coherence: 64 byte cache line. 4KB OS page. 4KB * L^3 Network transfer size.
3rd level bricks = DataServer blocks.
Use macrocell hierarchy to reduce number of accesses.
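The mapping from Data(x,y,z) to Block Q, Offset R can be sketched as below. The brick edge lengths are assumptions chosen so that 8-bit voxels match the sizes above: 4x4x4 voxels fill a 64 byte cache line, 4x4x4 of those fill a 4KB page, and L^3 pages form one DataServer block. The function name locate and the x-fastest block ordering are also illustrative.

```python
def locate(x, y, z, blocks_per_axis, L=2, B1=4, B2=4):
    """Return (block Q, byte offset R) for voxel (x, y, z), 8-bit voxels."""
    side = B1 * B2 * L                   # voxels along one block edge
    nbx, nby, _ = blocks_per_axis
    q = ((z // side) * nby + (y // side)) * nbx + (x // side)

    def split(v):
        # Position inside the block: (page index, line index, voxel index).
        v %= side
        return v // (B1 * B2), (v // B1) % B2, v % B1

    (px, lx, vx), (py, ly, vy), (pz, lz, vz) = split(x), split(y), split(z)
    page = (pz * L + py) * L + px        # which 4KB page inside the block
    line = (lz * B2 + ly) * B2 + lx      # which 64 byte line inside the page
    voxel = (vz * B1 + vy) * B1 + vx     # which byte inside the line
    r = (page * B2 ** 3 + line) * B1 ** 3 + voxel
    return q, r
```

With this layout, voxels that are close in 3D tend to share a cache line, then a page, then a block, so a ray marching through the volume touches few distinct blocks.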
Results with Distributed Data
Hit time of 6.86 μs or higher. Associative DataServer takes longer.
Miss time of 390 μs or higher. Larger bricks take longer.
Empirically, if local cache is >10% of data size, get >95% hit rates for isosurfacing, MIPing.
Investigated techniques to increase hit rate, reduce number of accesses.
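To see how hit rate trades off against the two costs, the mean access time can be estimated from the lower-bound figures above (6.86 μs hit, 390 μs miss). The helper names are hypothetical, and real timings vary with DataServer flavor and brick size.

```python
def mean_access_us(hit_rate, hit_us=6.86, miss_us=390.0):
    """Expected time per DSM access, in microseconds."""
    return hit_rate * hit_us + (1.0 - hit_rate) * miss_us

def break_even_hit_rate(hit_us=6.86, miss_us=390.0):
    """Hit rate at which hits and misses contribute equal total time."""
    return miss_us / (hit_us + miss_us)

if __name__ == "__main__":
    print(f"mean access at 95% hits: {mean_access_us(0.95):.2f} us")
    print(f"break-even hit rate: {break_even_hit_rate():.4f}")
```

With these lower bounds the break-even point falls between 98% and 99%; above it, the time spent on hits outweighs the time spent on misses.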
Consolidated Access
Hit time is usually the limiting factor.
Reduce the number of DSM accesses.
Eliminate redundant accesses.
When a ray needs data, sort its accesses so that all the data needed inside a brick is fetched with one DSM access.
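A minimal sketch of the idea: group a ray's voxel reads by brick, then acquire each brick once. The function name, the brick_of callback, and the dictionary-style brick data are assumptions for illustration; acquire and release stand in for the DataServer calls.

```python
from collections import defaultdict

def consolidated_read(samples, brick_of, acquire, release):
    """Read every sample while acquiring each brick only once."""
    by_brick = defaultdict(list)
    for i, voxel in enumerate(samples):
        by_brick[brick_of(voxel)].append((i, voxel))

    values = [None] * len(samples)
    for brick in sorted(by_brick):        # one DSM access per brick
        data = acquire(brick)
        for i, voxel in by_brick[brick]:
            values[i] = data[voxel]
        release(brick)
    return values
```

Four samples that fall in two bricks then cost two acquires instead of four, and results come back in the original sample order.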
Figure: A macrocell overlapping a grid of bricks (Bricks 1 to 3 above Bricks 4 to 6).
Figure: 2 GB data set. Frames/sec and acquires/node/frame for Access 1, Access 8 and Access X.
Machine Comparison
Use the Richtmyer-Meshkov data set to compare the distributed ray tracer with *-Ray, to determine how data sharing affects the cluster program.
Figure: Frames/sec versus frame number. Ray, 31 CPUs: 4.7 f/s. Nebula, 62 CPUs: 1.7 f/s. Nebula, 32 CPUs: 1.1 f/s.
Traffic
When entire volume is in view it takes a few frames for the caches to load, which slows down the renderer.
When only a portion is in view, the working set is small and network traffic is not an issue.
Figure: MB/node and frames/sec as the isovalue and viewpoint change.
Figure: Frames/sec versus frame number, recorded rate and loaded rate.
Images
Treepot scene, 2 million polygons.
512x512, 1 hard shadow, ~1 f/s.
CPU bound, not network bound
Images
Richtmyer-Meshkov, timestep 270, 1920x2048x2048.
512x512, 1..2 f/s w/ 1 hard shadow.
CPU or network bound, depending on the viewpoint.
Images
Focusing in…
Conclusion
Confirmed that interactive ray tracing on a cluster is possible.
Scaling and the ultimate frame rate are limited by latency; the number of tasks in the image determines the max frame rate.
With reasonably complex scenes the renderer is CPU bound, even with 62 processors.
With tens of processors, the cluster is comparable to the supercomputer.
Conclusion
Data sets that exceed the memory space of any one node can be managed with a DSM.
For isosurfacing and MIPing, hit time is the limiting factor, not network time.
The longer data access time makes the cluster slower than the supercomputer, but it is still interactive.
Future Work
Faster, for realistic images interactively.
Faster network layer.
Faster DSM.
Faster ray tracing.
Direct volume rendering.
Distributed polygonal data sets.
Acknowledgments
NSF Grants 9977218, 9978099.
DOE Views.
NIH Grants.
My Committee, Steve, Chuck and Pete.
Patti DeMarle.
Thanks to everyone else, for making this a great place to live and work!