Does Your Software Scale?
Multi-GPU Scaling for Large Data Visualization
Thomas Ruge, NVIDIA
© 2008 NVIDIA Corporation.
Agenda
• Multi-GPU Scaling for Large Data Visualization (Thomas Ruge, NVIDIA)
– Multi-GPU: why?
– Different multi-GPU rendering methods, focus on database decomposition
– System analysis of depth compositing and its impact on multi-GPU rendering
– Performance results of multi-GPU rendering with NVSG and OSG
• NVIDIA’s Multi-GPU SDK (Thomas Volk, NVIDIA)
– How to use NVIDIA’s MGPU SDK to get scalability in your application
• NVSG-Scale (Subu Krishnamoorthy, NVIDIA)
– Multi-GPU implemented in NVSG
• Technical Demos (with Horn Thorolf Tonjum, Stormfjord)
• Q&A
Multi-GPU: Why?
• Demand for large data visualization beyond the capacity of 1 GPU
– Boeing 777 (350 MTriangles, between 4.2 and 29 GB of graphics data)
– Visible Human: 16 GB of 3D textures
• Single-GPU technology approaches a cliff (like CPUs do)
– Power consumption is growing
– Cooling problems with high heat output
– High production costs per chip for dies with large footprints
– G80: 681 M transistors, 90 nm, 484 mm²; GT200: 1 bn transistors, 65 nm, 576 mm²
• Higher performance than the fastest single GPU on the market
– MGPU can give you a time advantage: get tomorrow’s single-GPU performance today with a multi-GPU system
Multi-GPU: Goal
• Goal: get linear scalability with the number of GPUs in
– Rendering performance
  • Triangle throughput
  • Fill rate
– Data size
– Image resolution (not a topic of this presentation)
• Steps to get scalability:
– Distribute the workload/rendering tasks to all GPUs
– Collect the rendering results from all GPUs
– Assemble the rendering results into the final image
Multi-GPU: Distributing the Load
• Pixel decomposition (e.g. SLI Antialiasing)
– Assigns different pixels or subpixels to each GPU
– Helps with fill-rate-bound apps, good approach for AA
• Decomposition in time (e.g. SLI AFR)
– Different GPUs render different frames
– Scales well for vertex-processing-bound and fill-rate-bound applications
Multi-GPU: Screen Decomposition
• Sort-First or Screen Decomposition (e.g. SLI SFR or SLI Mosaic)
– Different GPUs render different portions of the screen
– Fill-rate-bound applications scale well
– Good method for displays with very high resolutions (e.g. mersive Displays or Sony 4k)
• Compositing schemes:
– None (tiled walls or multi-input displays)
– Simple stitching (e.g. SLI SFR)
[Figure: GPU 1 to GPU 4 each render one portion of the final image; alternatively each GPU drives one of Display 1 to Display 4]
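As a rough illustration of sort-first decomposition, the sketch below splits a screen into a grid of tiles, one per GPU. The `Rect` type and the `tileForGpu` helper are hypothetical names invented for this example; they are not part of SLI or of any NVIDIA SDK.

```cpp
#include <cassert>

// Hypothetical helper type for illustration only (not an NVIDIA API).
struct Rect {
    int x, y, width, height;
};

// Returns the screen region that GPU `gpu` renders when the screen is split
// into a cols x rows grid, numbering tiles row by row from the top left.
// The last column and row absorb any remainder pixels.
Rect tileForGpu(int gpu, int cols, int rows, int screenW, int screenH) {
    assert(cols > 0 && rows > 0 && gpu >= 0 && gpu < cols * rows);
    const int col = gpu % cols;
    const int row = gpu / cols;
    const int tileW = screenW / cols;
    const int tileH = screenH / rows;
    Rect r;
    r.x = col * tileW;
    r.y = row * tileH;
    r.width = (col == cols - 1) ? screenW - r.x : tileW;
    r.height = (row == rows - 1) ? screenH - r.y : tileH;
    return r;
}
```

With "simple stitching", the final image is just these tiles placed side by side, so no per-pixel merge is needed.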
Multi-GPU: Database Decomposition
• Sort-Last or Database Decomposition
– Different GPUs render different portions of the dataset
– Scales well in vertex-processing-bound, fill-rate-bound and graphics-card-memory-bound applications
• Compositing schemes
– Depth or z-buffer compositing (needs RGB and depth buffer from the rendering GPUs)
– Alpha compositing (needs only RGBA buffer)
=> Database decomposition is the method that gives us more graphics card memory
[Figure: GPU 1 to GPU 4 each render a part of the dataset; the partial images are combined into the final image]
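Depth compositing can be sketched on the CPU as a per-pixel comparison of two partial images. This is only an illustration of the principle: later slides state that the actual compositing runs in a fragment shader, and the `PartialImage` type and `depthComposite` function are names made up for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One GPU's partial rendering: a color and a depth value per pixel.
struct PartialImage {
    std::vector<std::uint32_t> color;  // packed BGRA, one entry per pixel
    std::vector<float> depth;          // smaller value = closer to the viewer
};

// Merges `src` into `dst` pixel by pixel, keeping the fragment that is
// closer to the viewer (the usual "less than" depth test).
void depthComposite(PartialImage& dst, const PartialImage& src) {
    assert(dst.color.size() == dst.depth.size());
    assert(src.color.size() == src.depth.size());
    assert(dst.color.size() == src.color.size());
    for (std::size_t i = 0; i < dst.depth.size(); ++i) {
        if (src.depth[i] < dst.depth[i]) {
            dst.depth[i] = src.depth[i];
            dst.color[i] = src.color[i];
        }
    }
}
```

Because the merge needs both buffers from every rendering GPU, depth compositing moves twice as much data per pixel as a color-only scheme.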
Compositing: Performance
All decomposition schemes require compositing to create the final image
This can be done:
• with special-purpose hardware
– very fast, typically very low additional latency and performance impact
– tied to specific graphics hardware (can’t mix/match different cards)
– only a subset of all possible composition schemes is possible (e.g. SLI doesn’t support z- or alpha-compositing)
– no general-purpose compositing hardware available (but several attempts in the past)
• with commodity hardware and some software
– adds additional latency; performance impact depends on final image size and compositing mode
– not very graphics-card dependent (mix/match possible)
– can cover all known compositing schemes
– new HW developments (e.g. PCIe Gen 2) reduce the impact of SW compositing
A System Analysis of Compositing with Commodity Hardware
System analysis of database decomposition with depth compositing
• Goal:
– Understand the impact of depth compositing on multi-GPU rendering
– Means to estimate performance benefits
– Valid on today’s HW
• Assumptions:
– single shared-memory system (no clusters)
– only one displaying GPU (no powerwalls)
– buffer transports go through host memory (no specific HW tricks)
• Tasks:
– Distribute workload to rendering GPUs
– Download resulting 2D buffers (from the GPUs’ framebuffers) to host memory
– Upload 2D buffers to the displaying GPU
– Composite 2D buffers into the final image
[Figure: GPU 1, GPU 2, ..., GPU n connected through the host to the displaying GPU]
Compositing with 2 GPUs
Start analysis with 2 GPUs
System description:
– 2 rendering GPUs, 1 displaying GPU
– Optimization: the displaying GPU is also a rendering GPU => reduced number of buffer transports
Depth compositing needs RGB and Z-buffer
E.g. 1920x1200 pixels: sx = 1920, sy = 1200, bpp = 8 (BGRA + DEPTH_STENCIL)
18,432,000 bytes per frame buffer
73,728,000 bytes transported through the system per frame; @ 60 Hz = 4,423,680,000 bytes, i.e. > 4 GB per second!
=> High load on the system bus
[Figure: GPU 2 and GPU 1 connected through host memory and CPU]
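The arithmetic on this slide can be checked with a few lines of code. The factor of four transports per frame is inferred from the slide's own numbers (73,728,000 / 18,432,000 = 4); the function names are illustrative only.

```cpp
#include <cassert>
#include <cstdint>

// Bytes in one frame buffer of sx * sy pixels with bytesPerPixel bytes each
// (8 bytes per pixel = 4 bytes BGRA + 4 bytes DEPTH_STENCIL).
std::uint64_t bufferBytes(std::uint64_t sx, std::uint64_t sy,
                          std::uint64_t bytesPerPixel) {
    assert(bytesPerPixel > 0);
    return sx * sy * bytesPerPixel;
}

// Bytes moved through the system per second when each frame's buffer is
// transported `transportsPerFrame` times.
std::uint64_t busBytesPerSecond(std::uint64_t bufBytes,
                                std::uint64_t transportsPerFrame,
                                std::uint64_t framesPerSecond) {
    return bufBytes * transportsPerFrame * framesPerSecond;
}
```

For example, `busBytesPerSecond(bufferBytes(1920, 1200, 8), 4, 60)` reproduces the 4,423,680,000 bytes per second quoted above.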
Depth Compositing Performance on 790i Ultra
System: 790i Ultra, 2 FX 3700 (PCIe 2.0)
Specs relevant for compositing:
• Buffer downloads: 2.8 GB/s, i.e. 360 ms/GB
• Host memory copy: 4.1 GB/s, i.e. 240 ms/GB
• Buffer uploads: 3.7 GB/s, i.e. 270 ms/GB
• Execution time of the actual compositing (fragment shader) is negligible
[Figure: data path from GPU 2 to the landing zone in host memory (color: 2.8 GB/s, depth: 2.8 GB/s), host copy to the launching pad in host memory (4.1 GB/s), upload to GPU 1 (3.7 GB/s)]
Depth Compositing Performance
Time to depth-composite 1 MPixel (worst – best case)
buffer size = sx·sy·bpp = 1024·1024·8 = 8 MByte
• Compositing performance for sequential execution: 870 ms/GB, 143 fps
  worst case (no concurrency): ttotal = tdn + tmc + tup
• Compositing performance for parallel execution: 360 ms/GB, 350 fps
  best case (full concurrency): ttotal = max(tdn, tmc, tup)
• Compositing performance measured: 550 ms/GB, 227 fps (some concurrency)
tdc = time to download BGRA buffer [s/GB]
tdd = time to download DEPTH_STENCIL buffer [s/GB]
tdn = (tdc + tdd)/2
tmc = time to copy from host mem to host mem [s/GB]
tup = time to upload [s/GB]
[Figure: GPU 2 to host memory to GPU 1, measured effective throughput 1.8 GB/s]
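The worst-case and best-case bounds follow directly from the three per-stage costs. The sketch below recomputes them; the struct and function names are invented for this example, and the frame-rate figures on the slide are reproduced when the 8 MByte buffer is taken as 8·10^6 bytes (an assumption about how the slide rounded).

```cpp
#include <algorithm>
#include <cassert>

// Per-stage transport costs in ms per GB (10^9 bytes), as listed for the
// 790i Ultra system: download 360, host copy 240, upload 270.
struct TransportCost {
    double downloadMsPerGB;
    double hostCopyMsPerGB;
    double uploadMsPerGB;
};

// No concurrency: the three stages run one after another.
double worstCaseMsPerGB(const TransportCost& c) {
    return c.downloadMsPerGB + c.hostCopyMsPerGB + c.uploadMsPerGB;
}

// Full concurrency: the slowest stage limits the pipeline.
double bestCaseMsPerGB(const TransportCost& c) {
    return std::max({c.downloadMsPerGB, c.hostCopyMsPerGB, c.uploadMsPerGB});
}

// Compositing-limited frame rate for a buffer of `bufferBytes` bytes.
double framesPerSecond(double msPerGB, double bufferBytes) {
    assert(msPerGB > 0.0 && bufferBytes > 0.0);
    const double msPerFrame = msPerGB * bufferBytes / 1e9;
    return 1000.0 / msPerFrame;
}
```

The measured 550 ms/GB lies between the two bounds and corresponds to about 4.4 ms per 1 MPixel frame.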
Single GPU Performance: Triangle Throughput and Data Size
Rendering performance measured in triangles/sec depending on data size on a graphics card with limited memory
[Figure: rendering performance [triangles/s] vs. data size [triangles]; plateau at 1/tin, single GPU cliff, then decline towards 1/tout]
Example, rendering performance on Quadro FX 5600:
bytes per triangle = 3 vertices (36 bytes) + 3 normals (36 bytes) = 72 bytes/triangle
NVSG test program renders individual triangles in VBOs:
– pin = 143 Mtriangles/s for in-memory data => tin = 7 ns
– pout = 25 Mtriangles/s for out-of-memory data => tout = 40 ns
– n0 = 20 Mtriangles
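The per-triangle times and the size of the dataset at the cliff follow from the measured values. A small sketch with illustrative function names:

```cpp
#include <cassert>
#include <cstdint>

// Average time per triangle in nanoseconds for a given throughput.
double nanosecondsPerTriangle(double trianglesPerSecond) {
    assert(trianglesPerSecond > 0.0);
    return 1e9 / trianglesPerSecond;
}

// Bytes of vertex data for `triangles` individual (unshared) triangles.
std::uint64_t datasetBytes(std::uint64_t triangles,
                           std::uint64_t bytesPerTriangle) {
    return triangles * bytesPerTriangle;
}
```

At 72 bytes per triangle, n0 = 20 Mtriangles corresponds to 1.44 GB of vertex data. That would be consistent with the cliff sitting near the card's memory capacity, although the slides do not state the capacity, so treat that link as an inference.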
Multi-GPU Rendering: 2 GPUs
– Compositing impact is known
– Rendering performance on a single GPU is known
Let’s estimate what performance gain we can expect over a growing dataset
Assumptions:
- Execution of rendering and compositing happens sequentially (will be changed in the future)
- Rendering happens concurrently w/o overhead on both GPUs
- Fixed buffer size = 1k x 1k = 1 Mpixel => tcompositing = 4.4 ms
single GPU total time per frame:
- ttotal,1 GPU = trender(n) = n0·tin + (n-n0)·tout (for n > n0; n·tin otherwise)
dual GPU total time per frame:
- ttotal,2 GPU = trender(n/2) + tcompositing
⇒ performance gain s2GPU = ttotal,1 GPU / ttotal,2 GPU
peak performance to be expected at n = 2·n0:
s(n=2·n0) = (n0·tin + n0·tout)/(n0·tin + tc) = (tin + tout)/(tin + tc/n0)
improve peak performance by:
- reducing tc (e.g. optimize buffer transports)
- increasing n0 (e.g. pack more triangles onto the card, like tristrips instead of individual triangles)
- Theoretical peak: smax = (tin + tout)/tin
- applied to FX 5600: smax = (7 ns + 40 ns)/7 ns = 6.7
[Figure: Rendering Performance. Render performance [triangles/s] vs. triangles [millions]; curves: triangle throughput 1 GPU, triangle throughput 2 GPU no compositing, triangle throughput 2 GPU compositing; annotations: Single GPU Cliff, Dual GPU Cliff]
[Figure: 2 GPU relative to 1 GPU. Scale factor over single GPU vs. triangles [millions]; curves: scale factor 2 GPUs no compositing, scale factor 2 GPU compositing; annotation: area of super-linear scalability]
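To see how much of the theoretical peak survives the compositing cost, the peak formula can be evaluated with the FX 5600 numbers from this deck (tin = 7 ns, tout = 40 ns, n0 = 20 Mtriangles, tc = 4.4 ms). This is a worked check of the slide's formula, not an additional measurement:

```latex
\begin{align*}
s(n = 2 n_0) &= \frac{n_0 t_{in} + n_0 t_{out}}{n_0 t_{in} + t_c}
  = \frac{0.14\,\mathrm{s} + 0.80\,\mathrm{s}}{0.14\,\mathrm{s} + 0.0044\,\mathrm{s}}
  \approx 6.5 \\
s_{max} &= \frac{t_{in} + t_{out}}{t_{in}}
  = \frac{47\,\mathrm{ns}}{7\,\mathrm{ns}} \approx 6.7
\end{align*}
```

Under these assumptions the compositing step costs only about 3% of the peak scale factor.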
Multi-GPU Rendering: 4 GPUs
Generalize the 2-GPU formula to k GPUs:
Assumption: buffer transport time grows linearly with the number of GPUs
k GPUs total time per frame:
- ttotal,k GPU = trender(n/k) + (k-1)·tcompositing
⇒ performance gain skGPU = ttotal,1 GPU / ttotal,k GPU
peak performance to be expected at n = k·n0:
s(n=k·n0) = (n0·tin + (k-1)·n0·tout)/(n0·tin + (k-1)·tc) = (tin + (k-1)·tout)/(tin + (k-1)·tc/n0)
- Theoretical peak: smax = (tin + (k-1)·tout)/tin = 1 + (k-1)·tout/tin
E.g. FX 5600:
- k = 4: smax = (7 ns + 3·40 ns)/7 ns = 18.1
- k = 8: smax = (7 ns + 7·40 ns)/7 ns = 41
[Figure: Rendering Performance. Render performance [triangles/s] vs. triangles [millions]; curves: triangle throughput 1 GPU, triangle throughput 2 GPU compositing, triangle throughput 4 GPUs]
[Figure: multi-GPU relative to 1 GPU. Scale factor over single GPU vs. triangles [millions]; curves: scale factor 2 GPU compositing, scale factor 4 GPU compositing]
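The model on these slides is small enough to evaluate directly. The following sketch implements the piecewise render time and the k-GPU frame time exactly as written above; the `Model` struct and the function names are hypothetical, and the numbers in the usage note are the FX 5600 values quoted in this deck.

```cpp
#include <cassert>

// Parameters of the analytic model (all times in seconds).
struct Model {
    double tIn;   // time per triangle while the data fits in graphics memory
    double tOut;  // time per triangle for data beyond n0
    double n0;    // number of triangles that fit on one card
    double tC;    // time for one compositing step
};

// Single-GPU render time: n*tIn up to n0, then tOut per extra triangle.
double renderTime(const Model& m, double n) {
    assert(n >= 0.0);
    if (n <= m.n0) return n * m.tIn;
    return m.n0 * m.tIn + (n - m.n0) * m.tOut;
}

// Frame time with k GPUs: each renders n/k triangles, followed by
// (k-1) sequential compositing steps.
double frameTime(const Model& m, double n, int k) {
    assert(k >= 1);
    return renderTime(m, n / k) + (k - 1) * m.tC;
}

// Scale factor of k GPUs over a single GPU for n triangles.
double scaleFactor(const Model& m, double n, int k) {
    return frameTime(m, n, 1) / frameTime(m, n, k);
}
```

With `Model m{7e-9, 40e-9, 20e6, 4.4e-3}`, `scaleFactor(m, 40e6, 2)` is about 6.5 and `scaleFactor(m, 80e6, 4)` about 16.6, while `scaleFactor(m, 1e6, 2)` drops below 1 because the compositing step outweighs the rendering time saved on a small model.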
Frame Rates
Application classes by frame rate
>= 60 Hz: Visual Simulation (e.g. professional flight simulators)
>= 30 Hz, < 60 Hz: Entertainment (gaming)
>= 15 Hz, < 30 Hz: VR, Design Review
>= 5 Hz, < 15 Hz: Modeling, Large CAD Data Visualization, Seismic Interpretation, …
[Figure: frame rate [1/s] vs. triangles [millions] for 1 GPU, 2 GPUs and 4 GPUs, with bands > 60 Hz, 30 – 60 Hz, 15 – 30 Hz and 5 – 15 Hz; a second panel zooms in on frame rates below 20 Hz]
Best scalability for large models (low impact by compositor)
=> very good fit for large data visualization like seismic interpretation
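The frame-rate bands above can be written as a simple lookup. This is only a restatement of the slide's table; the function name and the label returned below 5 Hz are assumptions made for this sketch.

```cpp
#include <cassert>
#include <string>

// Maps a frame rate in Hz to the application class named on the slide.
std::string applicationClass(double hz) {
    assert(hz >= 0.0);
    if (hz >= 60.0) return "Visual Simulation";
    if (hz >= 30.0) return "Entertainment";
    if (hz >= 15.0) return "VR, Design Review";
    if (hz >= 5.0) return "Modeling, Large CAD Data Visualization, Seismic Interpretation";
    return "below 5 Hz";  // not classified on the slide
}
```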
Results System Analysis
System analysis shows:
• Performance scales with the number of GPUs
=> Higher frame rate
• Graphics card memory scales linearly with the number of GPUs
=> Larger models
• Small models don’t scale very well
=> Multi-GPU rendering (database decomposition) is not useful for small models
• Peak scalability far beyond linear (larger cache effect)
Multi-GPU rendering for large models works!
Results Multi-GPU Rendering: OSG
OSG = Open Scene Graph
Open-source scene graph widely used in academia, simulation, Oil&Gas and other industries
We extended OSG to render in multiple threads (one thread per GPU) and added the MGPU SDK
In this test OSG used around 140 bytes per triangle
[Figure: Triangle Performance QP IV. Performance [triangles/s] vs. data size [primitives]; curves: Standalone, 2 GPU no compositing, 4 GPU no compositing, 2 GPU with compositing, 4 GPU with compositing]
[Figure: Performance multi GPU vs. Standalone. Performance rel. SA [%] vs. data size [primitives]; curves: 2 GPUs, 4 GPUs]
Results Multi-GPU Rendering: NVSG
NVSG = NVIDIA’s Scene Graph
Very efficient scene graph used in automotive, simulation and academia
NVSG used 72 bytes per triangle
[Figure: Triangle Performance QP IV. Performance [triangles/s] vs. data size [primitives]; curves: Standalone, 2 GPU with compositing, 4 GPU with compositing]
[Figure: Performance multi GPU vs. Standalone. Performance rel. SA [%] vs. data size [primitives]; curves: 2 GPUs, 4 GPUs]
MGPU – SDK Introduction
Thomas Volk
Agenda
• Scope of the SDK
• SDK features
• Programming Guide for the SDK
Scope of the MGPU SDK
• Multi-GPU set-up and addressing
– Affinity context/screen per GPU
– SDK utility class for context handling and off-screen rendering
• Application parallelism
– Multi-threading of rendering loop: already solved in lots of scene graph implementations
– Multi-process and out-of-core/cluster solutions: highly specialized solutions already available
• Load decomposition and balancing: needs to be addressed in SG for sort last (see NVSGScale)
– Sample/reference implementations
– Utility classes
• Image compositing with huge data transfer (5 GB/s) and low latency: SDK compositor
SDK features
• OpenGL based
• Professional applications
– QuadroPlex only
– With current systems, 2 to 8 GPUs, e.g. 2xD2s, 2xD4s
– Up to 16 GB of addressable video memory
• Platforms: win32/64, linux 64
• In comparison to SLI, non-transparent to the application
• Utility functions to create MGPU-aware GL contexts and drawables
– Platform independent
– Stereo
– AA
• Image compositor for sort-first and sort-last based applications
– Easy-to-use abstract interface for compositing
– Compositor implementation based on latest technologies, no migration effort for applications (next-gen hardware will provide faster transport)
– Multi-threaded, shared memory
– Configurable: 1-1, n-1, hierarchical, hybrid
– Screen tiling, alpha and depth based compositing
Compositor Types and Configurations
[Figure: four compositor types, each combining partial images 1 and 2: screen tiling, alpha compositing, depth compositing, hierarchical compositing]
System Overview
[Figure: three application threads, each with its own scene graph, attached to the MGPU SDK compositor through one Dst node and two Src nodes; each node has a GL context and drawable on its own GPU (GPU 1, GPU 2, GPU 3)]
SDK Classes
[Class diagram]
ICFactory (creates ICCompositor): +createAlphaCompositor(), +createDepthCompositor()
ICCompositor: +setLayout(), +getSrcNode(), +getDstNode()
ICNode (n per ICCompositor): +initialize(), +terminate(), +composite()
ICSrcNode, ICDstNode (specializations of ICNode)
RenderArea: +initialize(), +terminate(), +resize(), +makeCurrent(), +releaseCurrent(), +showFBContents(), +hideFBContents()
Code Example
//init in control/main thread
ICDepthCompositor* c = CompositorFactory::getInstance()->getDepthCompositor();
c->setLayout(IC_ALLNODES, IC_RECT(width, height));
ICNode* myCompositorNode[MAX_NODES];
myCompositorNode[0] = c->getSrcNode(idx);

//init in every thread and context
//create affinity context and drawable
RenderArea ra;
ra.initialize();
ra.makeCurrent();
//initialize GL resources of compositor
myCompositorNode[i]->initialize();

//in every render-thread
void drawFrame(int myID)
{
    drawPartialScene(myID);
    //trigger image composition
    myCompositorNode[myID]->composite();
}
Sample Sequence Diagram
[Figure: sequence diagram with participants AppThread, Compositor, RenderThread (SrcNode) and RenderThread (DstNode); messages: init, setLayout, resize, paint, drawFrame, composite]
MGPU Threading
[Figure: two timing diagrams showing the App thread (simul), the Render src threads (draw, ic) and the Render dst thread (draw, ic) over successive frames]
Wrap up
• Current status
– Beta release next week
– No stereo and AA, Linux only for beta
• Freely available
– Release mid Oct
• Future features
– Better hardware support
– Access to frame buffers
– Utility classes for load balancing
– Performance feedback
– Extensible shader compositor
– Additional color formats
Multi-GPU Rendering with NVIDIA Scene Graph
The World’s Fastest Scene Graph On Steroids
Subu Krishnamoorthy
Distributing Scene Objects
• Audience – specifies which GPUs are responsible for processing a scene graph object
• DistributionTraverser – assigns Audiences to balance load on each GPU
• GLDistributedTraverser – respects Audiences and prunes traversal when its GPU is excluded
Distribution Scheme
• Opaque objects
Least loaded GPU included, others excluded
• Translucent and overlay objects
Primary GPU included, others excluded
• Implies use of Depth Compositor
Image composited before translucent and overlay objects are drawn
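The distribution scheme can be sketched as a greedy assignment. This is not NVSG's DistributionTraverser; the `SceneObject` type, the `assignAudiences` function and the use of a single cost value as the load metric are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a scene graph object.
struct SceneObject {
    double cost;       // assumed load metric, e.g. triangle count
    bool translucent;  // true for translucent or overlay objects
};

// Returns, for each object, the index of the one GPU that draws it.
// Opaque objects go to the currently least loaded GPU; translucent and
// overlay objects always go to the primary GPU. Only opaque objects are
// counted towards the load, which is a simplification of this sketch.
std::vector<std::size_t> assignAudiences(const std::vector<SceneObject>& objects,
                                         std::size_t gpuCount,
                                         std::size_t primaryGpu = 0) {
    assert(gpuCount > 0 && primaryGpu < gpuCount);
    std::vector<double> load(gpuCount, 0.0);
    std::vector<std::size_t> owner;
    owner.reserve(objects.size());
    for (const SceneObject& obj : objects) {
        std::size_t gpu = primaryGpu;
        if (!obj.translucent) {
            gpu = 0;
            for (std::size_t g = 1; g < gpuCount; ++g) {
                if (load[g] < load[gpu]) gpu = g;
            }
            load[gpu] += obj.cost;
        }
        owner.push_back(gpu);
    }
    return owner;
}
```

Keeping translucent objects on the primary GPU is what allows them to be drawn after the opaque partial images have been depth-composited.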
The Components
• GLAffinityArea creates a thread that drives an associated GLDistributedTraverser
• GLDistributedRenderArea manages the collection of N-1 GLAffinityAreas
Parallel Rendering
Ease of Use
• Derive client RenderArea from GLDistributedRenderArea (instead of nvui::RenderArea)
class wxGLDistributedRenderArea : public wxGLCanvas,
                                  public nvui::GLDistributedRenderArea
{
    ...
};
• Override GLDistributedRenderArea::init()
– Create primary GL context (affinity context) and make it current
– Invoke base implementation
bool wxGLDistributedRenderArea::init(nvui::RenderArea *shareArea)
{
    ...
    m_glContext = new wxGLAffinityContext(this, shareArea);
    wxGLCanvas::SetCurrent(*m_glContext);
    ...
    GLDistributedRenderArea::init(shareArea);
    ...
}
Opaque Object Distribution
Depth Composited Image
Spatial Distribution
• Audience assigned based on object bounding box
• Implies use of Alpha Compositor
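Alpha compositing combines layers in visibility order rather than by per-pixel depth. A minimal sketch of the standard "over" operator on premultiplied colors follows; the `Rgba` type and `over` function are illustrative and not taken from the MGPU SDK.

```cpp
#include <cassert>

// Color with premultiplied alpha, components in [0, 1].
struct Rgba {
    float r, g, b, a;
};

// Standard "over" operator for premultiplied colors:
// result = front + (1 - front.a) * back.
Rgba over(const Rgba& front, const Rgba& back) {
    assert(front.a >= 0.0f && front.a <= 1.0f);
    const float k = 1.0f - front.a;
    return Rgba{front.r + k * back.r,
                front.g + k * back.g,
                front.b + k * back.b,
                front.a + k * back.a};
}
```

Since "over" is order dependent, the partial images have to be combined in a consistent front-to-back or back-to-front order, which a spatial split of the scene makes possible.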
Summary
• Powerful new feature of NVSG
• Visualize large models at interactive frame rates
Visual Computing In The Oil Sector, Horn Thorolf Tonjum
The Industrial GPU Revolution
Demos
Dinosaurs in the Museum
Model characteristics:
• Model designed in Maya, AutoDesk
• 217 Million Triangles
Rendering Hardware:
• HP xw8600 with 32 GB System Memory
• 2xD2’s (4xGT200 with 4 GB FB memory) with 16 GB total FB memory
Rendering Performance:
1 GPU = 0.7 fps
2 GPUs = 3 ½ fps = 5 times faster than 1 GPU
4 GPUs = 6 fps = 8 ½ times faster than 1 GPU
=> Total max. triangle throughput = 1.3 GTri/s
Multi-GPU rendering of Kristin, a detailed off-shore platform
Model characteristics:
• Kristin is an off-shore platform in the North Sea
• Model designed in MicroStation, Bentley
• 230 Million Triangles (individual triangles, no triangle strips)
Rendering Hardware:
• HP xw8600 with 32 GB System Memory
• 2xD2’s (4xGT200 with 4 GB FB memory) with 16 GB total FB memory
Rendering Performance:
1 GPU = 0.4 fps
2 GPUs = 2 ½ fps = 6 times faster than 1 GPU
4 GPUs = 4 ½ fps = 11 times faster than 1 GPU
=> Total max. triangle throughput = 1 GTri/s
With the friendly permission of Stormfjord
Summary
• System analysis of multi-GPU rendering
• How to do multi-GPU rendering with NVIDIA’s Multi-GPU SDK
• Ease of use of multi-GPU rendering with NVSG
• Large CAD data visualization in the Oil&Gas industry
=> Database decomposition together with depth compositing is a strong tool to give you tomorrow’s graphics performance today
The End
Questions?