A Scalable and Reconfigurable Shared-Memory Graphics Architecture
Michael Manzke∗, Ross Brennan, Keith O'Conor, John Dingliana, Carol O'Sullivan
Trinity College Dublin
∗e-mail: [email protected]
Current scalable high-performance graphics systems are either constructed from special-purpose graphics acceleration hardware or built as a cluster of commodity components with a software infrastructure that exploits multiple graphics cards [Humphreys et al. 2002]. Both these solutions are used in application domains where computational demand cannot be met by a single commodity graphics card, e.g., large-scale scientific visualisation. The former approach tends to provide the highest performance but is expensive, because it requires frequent redesign of the special-purpose graphics acceleration hardware in order to maintain a performance advantage over the commodity graphics hardware used in the cluster approach. The latter approach, while more affordable and scalable, has intrinsic performance drawbacks due to computationally expensive communication between the individual graphics pipelines.
Figure 1: The first prototype of the custom-built high-performance graphics cluster node. The figure shows how a commodity graphics card interfaces with the cluster node. It also depicts the four SCI cables that will interconnect the custom-built GPU interface boards and the PC cluster in a 2D torus topology.
In this sketch we propose a scalable, tightly coupled cluster of custom-built boards that provide an AGP interface for commodity graphics accelerators. This hybrid solution aims to bridge the gap between the two current approaches, offering a minimal custom-built hardware component together with a novel and efficient shared-memory infrastructure that exploits cutting-edge consumer graphics hardware. The boards are supplied with rendering instructions by a cluster of commodity PCs that execute OpenGL graphics applications. All the commodity PCs and custom-built boards are interconnected with an implementation of the IEEE 1596-1992 Scalable Coherent Interface (SCI) standard. This technology provides the system with a high-bandwidth, low-latency, point-to-point interconnect. Our design allows for the implementation of a 2D torus topology with good scalability properties and excellent suitability for parallel rendering. Most importantly, the interconnect implements a Distributed Shared Memory (DSM) architecture in hardware.

Figure 2 shows how local memories on the custom-built boards and the PCs become part of the system-wide DSM through the SCI interconnect. Figure 2 also depicts Field Programmable Gate Arrays (FPGAs) on the custom-built boards. These reconfigurable components assist the SCI implementation and provide substantial additional computational resources that may be used to control the commodity graphics accelerators and to perform operations associated with a parallel rendering infrastructure. Beyond these uses, we envision the FPGAs performing other graphics-related computation, e.g., ray tracing. These reconfigurable components are an integral part of the scalable shared-memory graphics cluster and consequently increase the programmability of the parallel rendering system, just as vertex and pixel shaders increased the programmability of graphics pipelines.
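As a minimal sketch of how a PC node might use the single address space to supply a board with rendering instructions, the C fragment below assumes that a global DSM address carries the owning node's ID in its upper bits and a local offset in the remainder, and that a board exports a small command region that the interconnect maps into the writer's address space. The field widths, the board_cmd_region structure and the push_command helper are illustrative assumptions for this sketch, not the prototype's actual driver interface.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative decomposition of a global DSM address into the node that
 * owns the memory and the offset within that node's local memory.  The
 * 16-bit node-ID / 48-bit offset split is an assumption of this sketch. */
#define NODE_ID_SHIFT 48

static inline uint16_t dsm_node(uint64_t global_addr)   { return (uint16_t)(global_addr >> NODE_ID_SHIFT); }
static inline uint64_t dsm_offset(uint64_t global_addr) { return global_addr & ((1ULL << NODE_ID_SHIFT) - 1); }

/* Hypothetical command region exported by a custom-built GPU board and
 * mapped into a PC's address space through the SCI interconnect.        */
struct board_cmd_region {
    volatile uint32_t head;        /* advanced by the producing PC        */
    volatile uint32_t tail;        /* advanced by the FPGA on the board   */
    volatile uint8_t  ring[4096];  /* serialised rendering commands       */
};

/* Push one rendering command into a remote board's ring buffer.  Because
 * the region lives in the shared address space, this is an ordinary store
 * that the interconnect turns into remote memory traffic.                */
static int push_command(struct board_cmd_region *r, const void *cmd, uint32_t len)
{
    uint32_t head = r->head;
    if (head + len > sizeof(r->ring))     /* no wrap-around handling in this sketch */
        return -1;
    memcpy((void *)&r->ring[head], cmd, len);
    r->head = head + len;                 /* publish; real code would add a memory fence */
    return 0;
}
```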
Figure 2: Shared memory system. Each custom-built board (FPGAs and a GPU card) and each PC board (CPU) contributes its local shared memory to a distributed shared memory with a single address space, implemented through the high-speed interconnect; both the boards and the PCs scale.
In this sketch, we describe the design of a tightly coupled, scalable Non-Uniform Memory Access (NUMA) architecture of distributed FPGAs, GPUs and memory that may be constructed with a limited amount of custom-built hardware. A first prototype of the custom-built boards, shown in Figure 1, has been manufactured and is currently being debugged; a second revision will resolve the outstanding problems. We expect this hardware DSM cluster to communicate data at 500 Mbytes/s with low latency (< 1.5 µs). This hard real-time-capable parallel rendering cluster will be connected through the same high-speed interconnect to a commodity PC cluster that executes the graphics application. We have introduced this novel architecture and estimate, based on the arguments presented, that it could outperform pure commodity implementations without increased hardware cost, while maintaining its adaptability to the most recent generation of commodity graphics accelerators and target applications. Later prototypes will incorporate PCI Express to be compatible with the latest commodity graphics accelerators.
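As a rough check that these interconnect figures leave room for parallel rendering, the small C program below computes the time to move one colour buffer across the link. The 1024x768 RGBA frame size and the 60 Hz target are illustrative assumptions; only the 500 Mbytes/s bandwidth and the 1.5 µs latency are quoted from the text above.

```c
#include <stdio.h>

/* Back-of-the-envelope check of the interconnect budget quoted above.
 * The frame size and refresh rate are assumptions for illustration,
 * not measurements from the prototype.                                */
int main(void)
{
    const double bandwidth    = 500e6;             /* bytes per second (quoted)    */
    const double latency      = 1.5e-6;            /* seconds per message (quoted) */
    const double frame_bytes  = 1024 * 768 * 4.0;  /* assumed RGBA colour buffer   */
    const double frame_budget = 1.0 / 60.0;        /* assumed 60 Hz target         */

    double transfer = frame_bytes / bandwidth + latency;
    printf("frame transfer: %.2f ms of a %.2f ms budget (%.0f%%)\n",
           transfer * 1e3, frame_budget * 1e3, 100.0 * transfer / frame_budget);
    return 0;
}
```

Under these assumptions a full frame costs roughly 6 ms, well within a 16.7 ms frame budget, which is what makes the DSM approach attractive compared with pipelining frames over a conventional cluster network.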
References
HUMPHREYS, G., HOUSTON, M., NG, R., FRANK, R., AHERN, S., KIRCHNER, P. D., AND KLOSOWSKI, J. T. 2002. Chromium: a stream-processing framework for interactive rendering on clusters. In SIGGRAPH, 693–702.