An FPGA-based Computing Platform for Real-time 3D Medical Imaging and its Application to Cone-beam CT Reconstruction

Jianchun Li1, Christos Papachristou1 and Raj Shekhar2

1Department of Electrical Engineering and Computing Science, Case Western Reserve University, Cleveland, OH 44106, USA

2Department of Diagnostic Radiology, University of Maryland, Baltimore, MD 21201

Corresponding Author: Raj Shekhar, Ph.D.

Department of Diagnostic Radiology University of Maryland 22 S. Greene Street Baltimore, MD 21201 USA

Phone: 410-706-8714 Fax: 410-328-0641 Email: [email protected]

ABSTRACT

Real-time three-dimensional (3D) medical imaging requires both high memory bandwidth and high computational bandwidth because of the massive amount of data to be processed. Application-specific systems have been designed to speed up a few selected 3D imaging algorithms; when applied to other algorithms, such systems require complicated redesign or deliver lower performance than expected. In this article, we present an FPGA-based computing platform that can be used to accelerate a broad range of 3D medical imaging algorithms dominated by local operations. Its generality and reconfigurability make it easy to customize for differing algorithm requirements. The platform architecture exploits the intrinsic parallelism of 3D imaging algorithms to achieve high computational bandwidth. Along with the architecture, a new caching scheme called brick caching was designed to dynamically partition the input data and buffer them in distributed internal caches, providing multiple simultaneous memory accesses. The caching scheme exploits locality of reference in three dimensions, thus greatly reducing the total data flow from external system memory to the internal caches. It is also a deterministic caching method, which allows the input data to be pre-fetched into the processor before they are processed. We also present an application of the platform to FDK (Feldkamp-Davis-Kress) cone-beam computed tomography (CT) image reconstruction to demonstrate its potential in real-time 3D medical imaging.

Keywords: 3D imaging, computing platform, brick caching scheme, cone-beam reconstruction

1. INTRODUCTION

Real-time three-dimensional (3D) imaging represents a developing trend in medical imaging. Currently, 3D medical images are acquired from a variety of imaging modalities such as computed tomography (CT), magnetic resonance imaging (MRI), ultrasound, positron emission tomography (PET), and single photon emission computed tomography (SPECT). Compared to two-dimensional (2D) images, 3D images represent an anatomic structure in a more realistic manner, are intuitive to work with and can provide more accurate information for image-based diagnoses and treatments. However, most 3D medical imaging algorithms are computationally intensive, which makes real-time 3D imaging expensive and impractical for most clinical applications.1,2,3

The primary reason these algorithms are time-consuming is the massive amount of data to be processed. For example, a 512×512×512 8-bit image translates to 128 Mbytes of data, and such images often need to be processed repeatedly because many algorithms are iterative in nature. Another reason is the complicated data addressing and access patterns in these algorithms. 3D image data are usually stored sequentially in the system memory; however, many algorithms require nonsequential or even random access to the data. Therefore, traditional word-line and interleaved caching methods cannot efficiently fetch and buffer image data. Random access to the system memory (usually DRAM), on the other hand, slows down the entire computation.

These two factors make 3D medical imaging algorithms time-consuming and expensive. Two alternative solutions could be used to address this problem. The first is to design an application-specific integrated circuit (ASIC) to accelerate a particular algorithm. This is the most efficient way to accelerate an algorithm; however, ASICs lack the flexibility to adapt to changes in the algorithm and are difficult to reuse in other applications. Moreover, designing and developing an ASIC incurs high development or non-recurring engineering (NRE) costs. The second solution is to use a homogeneous multiple-processor system,4,5 such as a general supercomputer or a workstation cluster. This is the most flexible solution. However, the traditional von Neumann architecture used in general-purpose processors and the communication overhead between the processors limit the potential speedup per processor and make it hard to achieve the desired performance without large-scale multi-processor systems. Presently, this makes the solution unacceptably expensive for clinical deployment.

Compared to the above two solutions, a field-programmable gate array (FPGA)-based computing system offers a good trade-off between performance and flexibility, and is more attractive and promising for real-time 3D medical imaging. Modern high-end FPGAs integrate hard-wired processor cores, SRAM blocks, multipliers, digital clock managers and massive reconfigurable logic resources in a single chip.6 By combining these resources and customizing the computing architecture inside the FPGA, dedicated systems can be designed to accelerate different algorithms. The system architecture inside the FPGA can also be reconfigured to adapt to the changing demands of different algorithms, thus avoiding the need to design totally new hardware. Several FPGA-based systems have been designed to accelerate specific 3D medical imaging algorithms.7-10 However, a disadvantage of FPGA-based systems is that a large amount of time and effort is required to redesign the computing architecture inside the FPGA in order to maintain high performance for different algorithms. Also, it is difficult to manage the data flow in algorithms in which random data access occurs.

To make the high-performance FPGA-based solution easy to use in different 3D medical imaging applications, we have designed a computing platform with a generalized architecture and a new caching scheme called brick caching. These two concepts are aimed at accelerating a broad range of 3D medical imaging algorithms dominated by local operations. A local operation can be defined as an operation for which only a small subset of the input data is required to compute a single output value. Trilinear interpolation is an example of a local operation: it needs eight input samples to calculate one output value. This property provides the intrinsic parallelism for partitioning the overall computation and executing the partitions concurrently. It is also the basis for our brick caching scheme described in Section 2.
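To make the notion of a local operation concrete, the following is a minimal software sketch of trilinear interpolation (our own Python illustration; the platform itself implements such operations as hardware pipelines):

```python
def trilinear(volume, x, y, z):
    """Trilinearly interpolate volume[z][y][x] at fractional (x, y, z).

    Only the 8 voxels surrounding the sample point are read, which is
    what makes this a local operation: a small input subset per output.
    """
    x0, y0, z0 = int(x), int(y), int(z)       # lower-corner voxel
    fx, fy, fz = x - x0, y - y0, z - z0       # fractional offsets
    acc = 0.0
    for dz in (0, 1):
        for dy in (0, 1):
            for dx in (0, 1):
                w = ((fx if dx else 1 - fx) *
                     (fy if dy else 1 - fy) *
                     (fz if dz else 1 - fz))
                acc += w * volume[z0 + dz][y0 + dy][x0 + dx]
    return acc
```

Because each output value depends on only these eight inputs, many such computations can run concurrently on disjoint output regions.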

The architecture is targeted to Xilinx Virtex-II Pro FPGAs, which contain up to four embedded PowerPC processors, a large number of SRAM blocks (several hundred), multipliers and other logic resources.6 The generality of this architecture allows it to be easily modified to map different local operations-based imaging algorithms. To achieve high computational performance, intrinsic algorithm parallelisms are exploited in the architecture, such as multiple data brick (block) caching, parallel data-stream processing and deep pipeline cascades.

The brick caching scheme we present takes full advantage of locality of reference in three dimensions to reduce the total data flow from the main memory to the caches. It is a deterministic caching method, meaning that there are no cache misses during computation and data are always pre-fetched into the caches before they are needed. High-speed multiple data accesses are supported by configuring the SRAM blocks in the FPGA into multiple independent caches and copying the same input data block into each of them.

In section 2 of this paper, we introduce our brick caching scheme. In section 3, we present the architecture of the platform. The application of the platform to FDK (Feldkamp-Davis-Kress) cone-beam reconstruction1 is demonstrated in section 4. We discuss our simulation results and future objectives in section 5.

2. BRICK CACHING SCHEME

3D medical imaging involves processing massive amounts of image data; hence, DRAM is typically used as the main system memory for data storage. The hardware structure of DRAM is such that peak performance is achieved only when accessing sequential data, and performance drops dramatically when accessing data randomly. Unfortunately, operations that occur frequently in most 3D imaging algorithms, such as 3D interpolation and 3D finite impulse response (FIR) filtering, require multiple random data accesses. For example, trilinear interpolation requires access to eight neighbouring input data samples that do not appear sequentially in main memory, thus requiring random memory accesses.

To address the memory access problem, several strategies have been suggested. For example, de Boer et al.11 proposed a scheme for eight simultaneous memory accesses by implementing eight independent external memory banks, which is especially useful for trilinear interpolation. Doggett and Meißner12 presented a cubic addressing scheme for real-time volume rendering, in which the input image is partitioned into fixed-size cubes that are buffered inside the processor one at a time. Whenever a traced ray passes through the boundary of the cube, a new one is fetched into the cache.

Unlike these caching methods, our proposed caching scheme, called brick caching, partitions the image data in the output image space rather than the input image space. It exploits the locality of reference in three dimensions to significantly reduce the total data flow from the system memory to the internal caches. Also, our scheme is a deterministic caching method, since the input data are pre-fetched into the FPGA before they are actually needed, thus eliminating cache misses. The FPGA SRAM blocks are configured into multiple independent caches for multiple simultaneous data accesses. Our brick caching scheme is described next.

2.1. Brick partitioning

The first step in brick partitioning involves partitioning the output data space (i.e., the output of the algorithm) into small m × n × k cubes, called output bricks (Figure 1c). The coordinates of the eight vertices of an output brick are denoted (XOi, YOi, ZOi), i = 1…8.

Subsequently, the output brick is mapped back to locate the corresponding data subset in the input space according to the data addressing procedure of the algorithm (Figure 1b). The located subset in the input space contains all the data required to compute the output results in the output brick. The vertices of the data subset are denoted (XIi, YIi, ZIi), i = 1…8. The dataset in the input space usually has an irregular orientation and shape (Figure 1b).

The third step in brick partitioning is to find the boundary of the input brick, which marks another dataset in the input space with regular orientation and shape that completely encloses the required input dataset. Suppose vertex A (Figure 1a) of the input brick is (XBmin, YBmin, ZBmin) and vertex B of the brick is (XBmax, YBmax, ZBmax); then:

XBmin = ⌊min(XIi)⌋, YBmin = ⌊min(YIi)⌋, ZBmin = ⌊min(ZIi)⌋ (1)

XBmax = ⌊max(XIi)⌋ + 1, YBmax = ⌊max(YIi)⌋ + 1, ZBmax = ⌊max(ZIi)⌋ + 1 (2)

where ⌊x⌋ is the floor function, which returns the largest integer not greater than x, and min(xi) and max(xi) return the smallest and largest values in the dataset xi, respectively. The sizes of the input brick along the three dimensions X, Y and Z are:

Sx = XBmax − XBmin + 1, Sy = YBmax − YBmin + 1, Sz = ZBmax − ZBmin + 1 (3)
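Equations (1)-(3) translate directly into code. The following Python sketch (our illustration; function and variable names are ours) computes the input brick corners and sizes from the mapped vertices:

```python
import math

def input_brick_bounds(vertices):
    """Axis-aligned input brick enclosing the mapped output-brick vertices.

    vertices: list of (XI, YI, ZI) coordinates, generally non-integer.
    Returns the min corner per eq. (1), the max corner per eq. (2),
    and the brick sizes (Sx, Sy, Sz) per eq. (3).
    """
    mins = tuple(math.floor(min(v[d] for v in vertices)) for d in range(3))
    maxs = tuple(math.floor(max(v[d] for v in vertices)) + 1 for d in range(3))
    sizes = tuple(maxs[d] - mins[d] + 1 for d in range(3))
    return mins, maxs, sizes
```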

The data in the input brick will be fetched from the system memory to the caches sequentially. If the offset from the beginning of the cache is Coff, then the position Cp of the data with coordinates (X, Y, Z) in the cache can be calculated as:

Cp = Coff + (Z − ZBmin)·Sx·Sy + (Y − YBmin)·Sx + (X − XBmin) (4)
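As a quick numeric check of equation (4), a small Python helper (illustrative names, our own sketch):

```python
def cache_position(X, Y, Z, brick_min, sizes, c_off=0):
    """Linear cache address of voxel (X, Y, Z) per equation (4)."""
    XBmin, YBmin, ZBmin = brick_min
    Sx, Sy, _ = sizes
    return c_off + (Z - ZBmin) * Sx * Sy + (Y - YBmin) * Sx + (X - XBmin)
```

With brick corner (2, 3, 4) and sizes (5, 4, 3), the corner voxel maps to address 0 and voxel (3, 4, 5) maps to 20 + 5 + 1 = 26.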

The size of the output bricks (m × n × k) determines the size of the input bricks. Suppose the volume of input brick i is Vi and the total data-flow volume from the system memory to the caches is Vtotal; then both Vi and Vtotal are functions of m, n and k. That is:

Vtotal = Σi Vi = f(m, n, k) (5)

Assuming the bandwidth from system memory to the cache is W and the total algorithm processing time is T, then:

T ≥ Vtotal / W (6)

When T = Vtotal / W, the system memory bandwidth is at least one of the bottlenecks limiting the processing speed. Thus, Vtotal should be minimized by optimizing the partition of the output space. This involves selecting optimized output brick sizes m, n and k. One important constraint in this optimization is that Vi, the volume of any input brick, should not exceed the available cache size. Numerical methods that simulate the computing procedure of the algorithm can be used for this optimization.
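Such a numerical optimization can be sketched as a brute-force search. In the sketch below (our illustration), `input_volume_of` is a hypothetical, algorithm-specific callback that returns the input brick volumes produced by a candidate output brick size:

```python
def optimize_brick_size(input_volume_of, cache_size, candidates):
    """Pick the output brick size (m, n, k) minimizing total data flow.

    input_volume_of(m, n, k) must return the list of input brick
    volumes V_i for that partition (supplied by the user). Candidates
    whose largest input brick exceeds cache_size are rejected.
    """
    best, best_total = None, float("inf")
    for m, n, k in candidates:
        volumes = input_volume_of(m, n, k)
        if max(volumes) > cache_size:
            continue                      # would not fit in a cache bank
        total = sum(volumes)              # V_total = sum of V_i, eq. (5)
        if total < best_total:
            best, best_total = (m, n, k), total
    return best, best_total
```

For a local operation whose input brick is the output brick plus a fixed halo, larger output bricks amortize the halo and reduce Vtotal, up to the cache-size limit.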

2.2. Multiple memory access

To support multiple parallel memory accesses, the SRAM blocks in the FPGA are configured as multiple independent caches. These caches have two data ports: one port connects to the broadcast data bus from the system memory and the other connects to the data processing pipeline for feeding data. A mask register controls whether and when a given cache accepts data from the main memory (Figure 2). More details about cache configuration are discussed in section 3.

For trilinear interpolation, de Boer et al.11 proposed indexing the eight neighboring voxels (numbered 0 to 7) and storing them in eight independent memory banks so that they can be accessed simultaneously in one clock cycle. A similar approach could be used in the proposed brick caching scheme. However, the disadvantage of this approach is that the addressing is very complicated: not only the data position within the memory bank, but also the bank index, must be calculated for each access. To address this problem, we use data duplication, which involves simply copying the same input data brick into multiple cache banks (Figure 2) so that neighboring data samples can be accessed simultaneously. Although data duplication carries a memory utilization overhead, it is acceptable here because a large number of independent SRAM blocks are available in modern FPGA chips. For example, the Virtex-II Pro XC2VP125 from Xilinx has 556 18-Kbit dual-port SRAM blocks inside the chip.6 Data duplication simplifies the cache structure design and saves computing resources.
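Data duplication can be modeled very simply: the same brick is copied into every bank, and each of the samples needed in one cycle is served by a different bank. A toy software model (our own sketch; the real banks are FPGA SRAM blocks):

```python
class DuplicatedCache:
    """Toy model of duplicated cache banks.

    Every bank holds an identical copy of the input brick, so up to
    n_banks data samples can be read "simultaneously" without computing
    a bank index per access, at the cost of n_banks times the storage.
    """
    def __init__(self, brick, n_banks=8):
        self.banks = [list(brick) for _ in range(n_banks)]  # duplicate

    def read_parallel(self, addresses):
        # One address per bank, as in one hardware clock cycle.
        assert len(addresses) <= len(self.banks)
        return [bank[a] for bank, a in zip(self.banks, addresses)]
```

With eight banks, the eight neighbor addresses of a trilinear interpolation can all be issued in the same cycle.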

2.3. Brick pre-fetching

Brick caching is a deterministic caching method in which a data block can be pre-fetched into the caches before the data are actually required by the processing units. After finishing the processing of one input brick, the system is able to process the next one immediately, with no pipeline stall while data fetching is active, so the whole system can work smoothly at full speed. If, on the other hand, the next brick were not in the cache, the system pipeline would stall while waiting for it to be fetched. The brick caching scheme prevents such pipeline stalls by pre-fetching data bricks.

Specifically, every bank of the cache system is partitioned into two sections, as shown in Figure 3. Each of them can hold one input brick. The caches act as a FIFO (first-in-first-out), except that the moving unit is a single data brick. One data port of the dual-port cache banks is configured as the input port connected to system memory, and the other is the output port connected to the processing pipeline. Data fetching and data feeding are executed simultaneously.
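The two-section (ping-pong) operation can be sketched in software as follows (our illustration; `fetch` and `compute` stand in for the MEM-2 controller and the MPE unit, which in hardware run concurrently):

```python
def process_with_prefetch(bricks, fetch, compute):
    """Double-buffered brick processing sketch.

    While one cache section holds the brick being processed, the next
    brick is fetched into the other section, so processing never stalls
    waiting for data.
    """
    results = []
    sections = [None, None]            # the two cache sections, S1 and S2
    sections[0] = fetch(bricks[0])     # prime the first section
    for i in range(len(bricks)):
        nxt = (i + 1) % 2
        if i + 1 < len(bricks):
            # In hardware this fetch overlaps with the compute below.
            sections[nxt] = fetch(bricks[i + 1])
        results.append(compute(sections[i % 2]))
    return results
```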

3. ARCHITECTURE

The architecture is designed as a high-level abstract platform onto which a broad range of local operations-based 3D imaging algorithms can be mapped. To accelerate the execution of the algorithms, the architecture takes advantage of the brick caching scheme and exploits the intrinsic parallelisms present in the algorithms. The major parallelism in local operations-based 3D imaging algorithms is the independence of the calculation of each output value, i.e., the result of one output is not determined or affected by other outputs. Therefore, a multiple deep-pipeline execution unit in the architecture is designed to perform the massive computational tasks. The architecture is designed to be implemented in a single FPGA chip, except for the system memories.

The system (Figure 4) is composed of five subsystems: (a) PowerPC-based Controller (PPC); (b) Input Brick Fetching unit (IBF); (c) Multiple Pipeline Execution unit (MPE); (d) Output Brick Storing unit (OBS); and (e) Central Pipeline Controller (CPC). Details of the function, configuration and execution of the subsystems are described below.

3.1. The subsystems

The PPC (Figure 5A) is used to calculate the parameters of the data bricks, such as the origin and size of the input and output data bricks. It also controls the execution of the whole system, which involves operations such as initiating or stopping computation. The parameters are sent through the device control register (DCR) bus to a register file (REG File), which can be accessed by the CPC.

The PPC is composed of a PowerPC processor, on-chip instruction memory and data memory, a register file and a memory controller connecting to an external memory, MEM-1. The PowerPC processor is one of the hardwired processor cores embedded in Xilinx Virtex-II Pro FPGA chips. The on-chip instruction memory (I-MEM) and data memory (D-MEM) are configured with SRAM blocks inside the FPGA; their data bus is 32 bits wide and their capacity is 32 KByte. Critical parts of the program running on the PowerPC processor are compiled and loaded into the on-chip instruction memory at the time of FPGA configuration. The on-chip data memory (D-MEM) holds the computational results from the processor. The data memory is configured as a dual-port memory: one port is connected to the processor and the other is used by the other subsystems to access the data in the memory. MEM-1 is an external DRAM that provides extra data and instruction space for the processor; it is controlled by the MEM-1 controller.

The Input Brick Fetching unit (IBF, Figure 5B) is the architectural implementation of the brick caching scheme. It contains multiple dual-port input caches, a mask register and a memory controller, the MEM-2 controller. The dual-port caches have one port connected to the data broadcast bus from the MEM-2 controller and the other port connected to the pipelines in the MPE unit. The MEM-2 controller accepts parameters of input bricks from the CPC unit, then fetches input bricks and broadcasts them to the input brick caches. The mask register, set by the CPC, controls which caches accept the incoming data. Each of the input brick caches is partitioned into two sections, S1 and S2. While one section (say, S1) is accepting data from the MEM-2 controller, the other section (S2) can provide data to the MPE unit at the same time, and vice versa.

The Multiple Pipeline Execution unit (MPE, Figure 5C) is the subsystem that performs the major computational tasks. Multiple independent deep pipelines, identical in function and architecture, are configured in the unit. These pipelines accept multiple data streams from the input brick caches. The outputs of the pipelines are combined and sent to the output brick caches. All the pipelines are data-driven, i.e., each function unit (FU) operates only when its input data are ready. The configuration of the MPE is algorithm-dependent because different algorithms need different functional architectures.

The Output Brick Storing unit (OBS, Figure 5D) is used to save the final results from the MPE and send them out to the external memory MEM-3. It contains two output brick caches and a memory controller, the MEM-3 controller. The output brick caches are dual-port caches with one port connected to the MPE unit and the other connected to the MEM-3 controller. While one cache is accepting results from the pipelines, the other passes computed data to the MEM-3 controller for data dumping. The directions of the data flows in the OBS are controlled with a demultiplexer and a multiplexer.

The Central Pipeline Controller (CPC, Figure 5E) coordinates the operation of the whole system. Initially, it communicates with the PPC, requesting execution parameters such as the origin and size of the input and output bricks. After receiving these parameters, the CPC sends the information about the input brick to the IBF, after which the IBF begins to fetch the input brick. When the IBF finishes the fetching, the CPC initiates the execution of the MPE. After the MPE completes computation of one output brick, the CPC directs the OBS to save the results to the external memory MEM-3. The communications between the CPC and the other four subsystems occur simultaneously, and the executions of the four subsystems overlap.

3.2. Parallelism

To achieve high computational bandwidth, the system is designed to operate at three parallel levels. The first level is what we call the brick operation level. At this level, the algorithm is divided into four stages (Figure 6): parameter generation, input brick fetching, multi-pipeline processing and output brick storing. Parameter generation, performed by the PPC, calculates the parameters required for the execution of the other three subsystems. The IBF performs input brick fetching, the MPE performs the multi-pipeline processing and the OBS takes care of the output brick storing. All four stages run simultaneously as a system-level pipeline, forming the first level of parallelism.

The second level of parallelism is multiple data-stream processing. In the MPE unit, multiple independent data-stream processing pipelines run simultaneously. The number of pipelines depends on the algorithm and the available FPGA resources. If the number of pipelines is M, then M results emerge from the MPE unit on each clock cycle.

The third level of parallelism is the deep pipeline architecture in the MPE unit. The computation is partitioned into a large number of small stages and implemented on the deep pipeline architecture at a high clock frequency.

The computational bandwidth is set by the slowest stage at the brick operation level. Thus, it is important to balance the workload of the four stages in the first parallelism level to ensure the best possible performance.

3.3. Generality to different algorithms

The whole computing platform is designed as a high-level platform for easy adaptation to different algorithms. The architectures of four subsystems, PPC, CPC, IBF and OBS, are designed to be generic, so that implementing a different algorithm requires only changing the configuration parameters held in the configuration registers. Users are able to use these units as primitives in their own system designs. Only the architecture of the MPE is algorithm-specific and should be customized according to the data-flow of the algorithm. Many strategies have been proposed to automatically map the data-flow of an algorithm into pipeline architectures.13,14,15 Our future goal is to develop a high-level user interface that automatically configures the architecture of the MPE from a user description of the algorithm data-flow. Since the user could then focus on the architecture design of the MPE unit instead of the whole system, this would greatly reduce redesign time and cost.

4. APPLICATION TO CONE-BEAM RECONSTRUCTION

Cone-beam reconstruction algorithms reconstruct a 3D CT image from a set of 2D projections. These algorithms fall into two classes: analytic and algebraic. Algebraic algorithms demand more computing resources than analytic algorithms. In practice, the FDK algorithm, an analytic algorithm,1 is widely implemented because of its computational simplicity and compatibility with 2D fan-beam reconstruction algorithms.

The FDK algorithm requires a long execution time on modern computers. One reason is the large size of the datasets that need to be processed, including both the reconstructed image and the input projection data. Another reason is the large number of backprojection iterations involved. In this section, we provide a brief description of the FDK algorithm, analyze its computational complexity and describe how to speed up the time-consuming backprojection procedure with our computing platform.

4.1. The algorithm

Unlike the 2D fan-beam tomography approach, which reconstructs one image slice at a time, the cone-beam method uses a 2D array of detectors to collect projection data and reconstructs the entire volume at once. The geometry of the system configuration for cone-beam CT imaging is shown in Figure 7.

In the system, an X-ray cone beam is emitted by the X-ray source. After transmission through the imaged specimen, the attenuated intensities of the X-rays are detected by the 2D detector array. There are two coordinate systems, (x, y, z) and (u, v), shown in Figure 7. The origin of the (u, v) coordinates is at the center of the detector array, and the origin O of the (x, y, z) coordinate system is at the center of the object. The (x, y, z) system is stationary with respect to the object, whereas the (u, v) system moves with the detector. The X-ray source rotates about the z axis at equal angular intervals during scanning, and the detector rotates opposite the source in the same plane; the X-ray source and the two origins always lie on a straight line. The acquired projection data form a 3D dataset. We denote the attenuated X-ray intensity at position (u, v), with the source at projection angle β, as I(u, v, β).

The FDK algorithm has two major computational steps. The first step is called weighted filtering, in which the projection data I(u, v, β) are weighted by the factor R/√(R² + u² + v²), where R is the distance from the source to the center of the object. Next, the projection data are filtered along the u direction with a 1D FIR filter. Many choices exist for this filter;16,17,18 we used the Ram-Lak filter, which is frequently used in filtered backprojection algorithms.17 The Ram-Lak filter is given by:

V(s) = (Ω²/(8π))·(sinc(Ωs) − (1/2)·sinc²(Ωs/2)) (7)

where Ω is the bandwidth of the image and s is the spatial variable. This filtering is a one-dimensional (1D) convolution and can be calculated with the Fast Fourier Transform (FFT). Its complexity for the entire volume is O(N³·log(N)), where N is the size of the reconstructed image along one dimension.
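In software, the weighting and row-wise filtering steps might be sketched as follows (a pure-Python illustration under our own coordinate and naming assumptions; a production implementation would perform the convolution with the FFT, and `kernel` stands in for a sampled Ram-Lak kernel):

```python
import math

def weight_projection(proj, R, du=1.0, dv=1.0):
    """Apply the FDK weight R / sqrt(R^2 + u^2 + v^2) to one projection.

    proj is a 2D list indexed [v][u]; (u, v) are detector coordinates
    measured from the detector center, with sample spacings du and dv.
    """
    nv, nu = len(proj), len(proj[0])
    cv, cu = (nv - 1) / 2.0, (nu - 1) / 2.0
    out = []
    for iv in range(nv):
        v = (iv - cv) * dv
        row = []
        for iu in range(nu):
            u = (iu - cu) * du
            row.append(proj[iv][iu] * R / math.sqrt(R * R + u * u + v * v))
        out.append(row)
    return out

def filter_rows(proj, kernel):
    """Direct 1D convolution of each detector row along u.

    kernel is an odd-length sampled filter kernel; samples beyond the
    row edges are treated as zero.
    """
    half = len(kernel) // 2
    out = []
    for row in proj:
        n = len(row)
        filt = []
        for i in range(n):
            s = 0.0
            for j, k in enumerate(kernel):
                idx = i + j - half
                if 0 <= idx < n:
                    s += k * row[idx]
            filt.append(s)
        out.append(filt)
    return out
```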

The second step in the algorithm is backprojection. For each individual image voxel with coordinates (x, y, z), the intensity value M(x, y, z) is calculated by accumulating, over every projection angle β, the filtered projection I(u, v, β) times a weighting factor W(x, y, β). The formula is:

M(x, y, z) = Σβ W(x, y, β)·I(u, v, β) (8)

For a given projection angle β, the corresponding u, v and W(x, y, β) for M(x, y, z) are calculated as:

u = R·(y·cos(β) − x·sin(β)) / (R − x·cos(β) − y·sin(β)) (9)

v = R·z / (R − x·cos(β) − y·sin(β)) (10)

W(x, y, β) = (R / (R − x·cos(β) − y·sin(β)))² (11)

Since the calculated u and v are usually not integers, bilinear interpolation over the four neighboring samples is needed to compute the desired I(u, v, β). The complexity of the backprojection step for the entire volume is O(N⁴), where N is the size of the reconstructed image.

Comparing the complexities of these two steps shows that the backprojection takes approximately O(N/log(N)) times longer than the weighted filtering step. In practice, the weighted filtering step can be finished on the order of 10 seconds on a single modern computer. In contrast, the execution time for the backprojection step is on the order of half an hour. Therefore, our objective was to significantly reduce the backprojection time with the computing platform described above.

4.2. Data Partitioning

To use the brick caching scheme, we partition the reconstructed 3D image (the output space) into

small cubes (the output bricks). According to our numerical simulations, the optimal brick size is 16×16×16 voxels within the limits of the available FPGA resources. Next, the output bricks are

mapped back to the input space of the filtered projection data, to find the input bricks. The filtered

projection data consist of a set of 2D images. Thus, when mapped back to the input space, an output brick is projected into a series of hexagons (Figure 7) on different projection planes at different projection angles β. The input bricks are determined as the bounding rectangles of these hexagons. In the cone-beam reconstruction algorithm, therefore, the input bricks are rectangles, not cuboids.


Denoting the vertices of an output brick by (xi, yi, zi), i = 1…8, and the projection angle by β, the corresponding projected coordinates (ui, vi) are:

ui = (yi·cos(β) − xi·sin(β)) · R / (R − xi·cos(β) − yi·sin(β))    (12)

vi = zi · R / (R − xi·cos(β) − yi·sin(β))    (13)

where R is the distance from the source to the center of the object. The upper-left corner (umin, vmin)

and the bottom-right corner (umax, vmax) of the resulting 2D input brick are:

umin = ⌊min(ui)⌋,  vmin = ⌊min(vi)⌋    (14)

umax = ⌊max(ui)⌋ + 1,  vmax = ⌊max(vi)⌋ + 1    (15)

The computation for mapping the output brick to the input brick is executed by the PPC, and the

calculated parameters are sent to the IBF for input brick fetching.
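The per-brick computation the PPC performs can be sketched as follows. The function name and the voxel-unit, corner-anchored brick parameterization are illustrative assumptions; any detector-center offset the real controller would add is omitted.

```python
import numpy as np

def input_brick(origin, size, beta, R):
    """Bounding rectangle (eqs. 12-15) of an output brick's projection.

    origin: (x, y, z) of the brick's lowest corner, in voxel units;
    size: brick edge length (16 in our configuration);
    beta: projection angle; R: source-to-center distance.
    """
    x0, y0, z0 = origin
    cb, sb = np.cos(beta), np.sin(beta)
    us, vs = [], []
    for dx in (0, size):            # enumerate the eight vertices
        for dy in (0, size):
            for dz in (0, size):
                x, y, z = x0 + dx, y0 + dy, z0 + dz
                denom = R - x * cb - y * sb
                us.append((y * cb - x * sb) * R / denom)  # eq. (12)
                vs.append(z * R / denom)                  # eq. (13)
    u_min, v_min = int(np.floor(min(us))), int(np.floor(min(vs)))          # eq. (14)
    u_max, v_max = int(np.floor(max(us))) + 1, int(np.floor(max(vs))) + 1  # eq. (15)
    return u_min, v_min, u_max, v_max
```

The floor and +1 in equations (14)-(15) guarantee that the rectangle encloses every sample the bilinear interpolation can touch.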

4.3. Mapping Algorithm to Architecture

Mapping the algorithm to the architecture is straightforward. Specifically, the PPC executes the

calculation described in section 4.2 for locating the input brick, the IBF fetches the input bricks into

the caches, the MPE performs the backprojection computation, and the OBS saves the reconstructed

image data into MEM-3. The configurations of the subsystems are as follows.

The PPC (Figure 5A) runs at 300 MHz and its program is stored in the I-MEM. The program

running on the PPC is optimized and contains only about 600 instructions. A look-up table stored in the D-MEM provides the values of cos(β) and sin(β) in equations (12) and (13). The processed


results are passed to the register file, which is configured to have ten 32-bit registers. These results provide the position and size of the input and output bricks, the projection angle β, and some control signals for the CPC.

The memory controller in the IBF (Figure 5B), the MEM-2 controller, is a double data rate (DDR) SDRAM controller running at 133 MHz with a 64-bit data bus connecting to the external memory MEM-2. It also has another 64-bit data bus for broadcasting the input data into the caches. There are 32 small dual-port caches in the IBF. Each cache is configured with four 2-KB dual-port SRAM blocks and partitioned into two sections with a capacity of 4 KB per section. Every four caches form a group that provides four input samples simultaneously to one pipeline for bilinear interpolation.
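The paper does not spell out how projection samples are distributed across the four caches of a group. One common arrangement that makes such a grouping work, shown here purely as an illustration, is parity interleaving: the four bilinear neighbors of any sample point always fall into four distinct (u-parity, v-parity) banks, so one pipeline can read all four samples in a single cycle.

```python
def bank(u, v):
    """Bank index (0-3) from the parities of integer detector coordinates."""
    return (u & 1) | ((v & 1) << 1)

def neighbour_banks(u0, v0):
    """Banks holding the four bilinear neighbours of a point in cell (u0, v0)."""
    return {bank(u0, v0), bank(u0 + 1, v0),
            bank(u0, v0 + 1), bank(u0 + 1, v0 + 1)}

# For every cell (u0, v0), the four neighbours cover all four banks
# exactly once, so four banks never see a same-cycle access conflict.
```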

The MPE (Figure 5C) is the algorithm-dependent unit, designed according to the backprojection dataflow of the FDK cone-beam reconstruction algorithm. For parallel computation, it has eight independent pipelines running at 150 MHz. All the pipelines have the same architecture and contain 64 pipeline stages. Every stage is registered to achieve the highest running frequency. The data path in the pipelines is 16 bits wide.

The OBS (Figure 5D) has two output brick caches, each configured as a dual-port 8-KB cache for holding the reconstructed image data. One port of each cache connects to a 128-bit data bus to accept the combined eight 16-bit results from the MPE, and the other port connects to the MEM-3 controller via a 128-bit data bus. The MEM-3 controller is a DDR SDRAM controller connected to the external SDRAM MEM-3 through a 64-bit, 133 MHz data bus.

The CPC controls the execution of the whole system. It communicates with the PPC through the

register file and it controls the operations of the other subsystems with FIFOs and mailbox registers.

The CPC runs at 150 MHz.


5. RESULTS AND DISCUSSION

5.1. Performance

The performance of our computing platform in accelerating the backprojection procedure of the FDK algorithm was simulated with the SystemC platform model we constructed. The input data were the filtered projection data of the Shepp-Logan phantom. The projections were generated from 300 projection angles, and the size of the detector array was assumed to be 512×512. The projection data were represented in 16-bit fixed-point format and had a total size of 150 MB. With the same input data, we measured the time required to reconstruct three 3D images with sizes of 256³, 512³ and 1024³ voxels, respectively. The results are shown in Table 1. According to the simulation results, the backprojection time is linearly proportional to the image size (the total number of voxels), implying that the system always runs at its highest speed. The speed is limited by one of the four stages in the brick operation cycle described in section 3.2.
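The backprojection times in Table 1 can be cross-checked against the peak rate of the MPE. Assuming, as an idealization, one voxel update per pipeline per clock cycle, eight 150 MHz pipelines deliver 1.2 × 10⁹ voxel updates per second; the sketch below (with our own function name) computes the fraction of that peak implied by each row of Table 1.

```python
def mpe_efficiency(n, seconds, n_proj=300, pipelines=8, clock_hz=150e6):
    """Fraction of the MPE's peak update rate implied by a measured time.

    An n^3 reconstruction from n_proj projections needs one voxel
    update per voxel per projection: n**3 * n_proj updates in total.
    """
    return (n ** 3 * n_proj) / seconds / (pipelines * clock_hz)

# Rows of Table 1: (image size per side, backprojection time in seconds)
for n, t in [(256, 4.2), (512, 33.5), (1024, 267.7)]:
    print(f"{n}^3: {mpe_efficiency(n, t):.1%} of peak")
```

All three rows come out at essentially 100% of this idealized peak, consistent with the observation that the pipelines stay saturated and the MPE is the bottleneck.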

Figure 8 shows the time required by the four stages to finish one brick operation. It is clear that

the multi-pipeline processing stage takes the longest time. Therefore, the MPE unit is the computational bottleneck under the current configuration. To improve the system's performance, the number of pipelines in the MPE can be increased, which is limited only by the available FPGA resources. When there are not sufficient FPGA resources in one chip, multiple FPGA chips can be used to further parallelize the computation.

5.2. Accuracy

In the software implementation of the imaging algorithms, the data and the computation are

usually in the floating-point format, while in our proposed computing platform, the data and the


computation are all in 16-bit fixed-point format, except for the accumulation operations and the accumulated data, which use 32-bit fixed-point format. The fixed-point format helps reduce the memory storage requirement and the execution time, but it has limited precision. We compared the image reconstructed with the floating-point software implementation (Figure 9B) and with our fixed-point hardware implementation (Figure 9C). The results indicate that the difference between the two images is small and acceptable in the region of interest (ROI). When compared to the original phantom (Figure 9A), the floating-point image has a noise level of 3.4% and a contrast loss of 13.6% in the three small ellipsoids, while the fixed-point image has a noise level of 5.6% and a contrast loss of 18.7% in the three ellipsoids. The noise level is measured by comparing the voxel intensities of the reconstructed images with those of the phantom. If higher precision is required, the bit width of the fixed-point data can be extended, though this will demand more computing resources.
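The precision trade-off can be illustrated with a small fixed-point model. The choice of 14 fraction bits below is ours for illustration only; the paper states the 16-bit data and 32-bit accumulator widths but not the exact bit allocation.

```python
import numpy as np

def to_fixed(x, frac_bits=14):
    """Quantise to 16-bit signed fixed point (frac_bits is illustrative)."""
    q = np.round(np.asarray(x) * (1 << frac_bits)).astype(np.int32)
    return np.clip(q, -32768, 32767).astype(np.int16)

def fixed_accumulate(values, frac_bits=14):
    """Sum 16-bit fixed-point samples in a 32-bit accumulator, as the MPE does."""
    acc = np.int32(0)
    for v in to_fixed(values, frac_bits):
        acc = np.int32(acc + np.int32(v))
    return acc / float(1 << frac_bits)
```

Each quantization introduces an error of at most half a least-significant bit (about 3 × 10⁻⁵ with 14 fraction bits), and the 32-bit accumulator keeps the 300-term sums of equation (8) from overflowing or discarding those low-order bits.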

5.3. Discussion

The goal of our work was to design a general computing platform to accelerate a broad range of

local operations-based 3D medical imaging algorithms. Our strategy was to design a new caching

scheme as a solution to the memory access bottleneck in 3D imaging algorithms and to exploit the

intrinsic parallelism of the algorithms in the hardware architecture design. Our simulated implementations of the FDK cone-beam reconstruction algorithm and of other 3D imaging algorithms (in preliminary work we accelerated mutual information-based 3D image registration) showed that our platform can achieve high computing performance when applied to 3D imaging algorithms.


Our FPGA-based computing platform can be reconfigured to adapt to different algorithms;

however, our effort to limit the reconfigurable part of the system to only one execution unit

alleviates the burden of the design procedure. Instead of redesigning the whole system, users can

focus on the MPE pipeline architecture design according to the dataflow of the algorithms. Our

objective is to make the design procedure more automatic and further extend the application range

of our computing platform in the real-time 3D imaging field.

ACKNOWLEDGEMENTS

We would like to thank Carlos R. Castro-Pareja (University of Maryland) and Vivek Walimbe

(The Ohio State University) for their valuable comments and proofreading.


REFERENCES

1. L.A. Feldkamp, L.C. Davis and J.W. Kress, Practical cone-beam algorithm, Journal of the Optical Society of America, A6: 612 (1984).

2. W.M. Wells, P. Viola, H. Atsumi, S. Nakajima and R. Kikinis, Multi-modal volume registration by maximization of mutual information, Medical Image Analysis, 1: 35 (1996).

3. C. Xu and J.L. Prince, Generalized gradient vector flow external forces for active contours, Signal Processing, 71: 131 (1998).

4. T. Rohlfing and C.R. Maurer, Non-rigid image registration in shared-memory multiprocessor environments with application to brains, breasts, and bees, IEEE Transactions on Information Technology in Biomedicine, 7: 16 (2003).

5. D.A. Reimann, V. Chaudhary, M.J. Flynn and I.K. Sethi, Cone beam tomography using MPI on heterogeneous workstation clusters, Proceedings of MPI Developer's Conference, 2: 142 (1996).

6. Virtex-II Pro Complete Data Sheet, Xilinx Inc. (2002).

7. I. Goddard and M. Trepanier, High-speed cone-beam reconstruction: an embedded systems approach, Proceedings of SPIE, 4681: 483 (2002).

8. G. Hampson and A. Paplinski, Hardware implementation of an ultrasonic beamformer, Proceedings of IEEE, 1: 227 (1997).

9. G. Knittel, A PCI-compatible FPGA-coprocessor for 2D/3D image processing, IEEE Symposium on FPGAs for Custom Computing Machines, 136 (1996).

10. C.R. Castro-Pareja, J.M. Jagadeesh and R. Shekhar, FAIR: A hardware architecture for real-time 3-D image registration, IEEE Transactions on Information Technology in Biomedicine, 7: 426 (2003).

11. M. de Boer, A. Gropl, J. Hesser and R. Manner, Latency- and hazard-free volume memory architecture for direct volume rendering, Eurographics Workshop on Graphics Hardware, 109 (1996).

12. M. Doggett and M. Meißner, A memory addressing and access design for real time volume rendering, IEEE International Symposium on Circuits and Systems, 344 (1999).


13. D. Cronquist, P. Franklin, C. Fisher, M. Figueroa and C. Ebeling, Architecture design of reconfigurable pipelined datapaths, Twentieth Anniversary Conference on Advanced Research in VLSI (1999).

14. B. Draper, W. Bohm, J. Hammes, W. Najjar, R. Beveridge, C. Ross, M. Chawathe, M. Desai and J. Bins, Compiling SA-C programs to FPGAs: Performance results, International Conference on Vision Systems, 220 (2001).

15. J. Frigo, M. Gokhale and D. Lavenier, Evaluation of the Streams-C C-to-FPGA compiler: an applications perspective, Proceedings of the 2001 ACM/SIGDA Ninth International Symposium on Field Programmable Gate Arrays, 134 (2001).

16. L.T. Chang and G.T. Herman, A scientific study of filter selection for a fan-beam convolution algorithm, SIAM J. Appl. Math., 39: 83 (1980).

17. G.N. Ramachandran and A.V. Lakshminarayanan, Three-dimensional reconstruction from radiographs and electron micrographs: Application of convolutions instead of Fourier transforms, Proc. Nat. Acad. Sci. USA, 68: 2236 (1971).

18. K.T. Smith and F. Keinert, Mathematical foundations of computed tomography, Appl. Optics, 24: 3950 (1985).


FIGURE AND TABLE CAPTIONS:

Figure 1: Mapping of an output brick to an input brick.

Figure 2: Data duplication diagram.

Figure 3: Partition of one cache bank into two sections.

Figure 4: Block diagram of the system architecture: (a) PPC: PowerPC based Controller; (b) IBF: Input Brick Fetching unit; (c) MPE: Multiple Pipeline Execution unit; (d) OBS: Output Brick Storing unit; (e) CPC: Central Pipeline Controller.

Figure 5: Detailed system architecture: (A) PowerPC based Controller (PPC); (B) Input Brick Fetching unit (IBF); (C) Multiple Pipeline Execution unit (MPE); (D) Output Brick Storing unit (OBS); (E) Central Pipeline Controller (CPC).

Figure 6: The four operation stages in the brick operation level.

Figure 7: Geometry of the cone-beam CT system and schematic of the brick caching scheme for cone-beam reconstruction.

Figure 8: Time needed for the four stages at the first parallelism level to finish one brick operation.

Figure 9: Comparison of the Shepp-Logan phantom with the images reconstructed from the floating-point implementation and the fixed-point implementation. The projection data are from 300 projections with a detector resolution of 256×256 and the reconstruction grid is 256³. All images are windowed to 0.95–1.05. (A) The original phantom and its intensity profile across the three small ellipsoids; (B) the image reconstructed from the floating-point implementation; (C) the image reconstructed from the 16-bit fixed-point implementation.

Table 1: Backprojection times for different image sizes

Image size (voxels)            256³    512³    1024³
Backprojection time (seconds)  4.2     33.5    267.7

Table 1: Backprojection times for different image sizes

Figure 1: Mapping of an output brick to an input brick.

Figure 2: Data duplication diagram.

Figure 3: Partition of one cache bank into two sections.

Figure 4: Block diagram of the system architecture: (a) PPC: PowerPC based Controller; (b) IBF: Input Brick Fetching unit; (c) MPE: Multiple Pipeline Execution unit; (d) OBS: Output Brick Storing unit; (e) CPC: Central Pipeline Controller.

Figure 5: Detailed system architecture: (A) PowerPC based Controller (PPC); (B) Input Brick Fetching unit (IBF); (C) Multiple Pipeline Execution unit (MPE); (D) Output Brick Storing unit (OBS); (E) Central Pipeline Controller (CPC).

Figure 6: The four operation stages in the brick operation level.

Figure 7: Geometry of the cone-beam CT system and schematic of the brick caching scheme for cone-beam reconstruction.

Figure 8: Time needed for the four stages at the first parallelism level to finish one brick operation.

Figure 9: Comparison of the Shepp-Logan phantom with the images reconstructed from the floating-point implementation and the fixed-point implementation. The projection data are from 300 projections with a detector resolution of 256×256 and the reconstruction grid is 256³. All images are windowed to 0.95–1.05. (A) The original phantom and its intensity profile across the three small ellipsoids; (B) the image reconstructed from the floating-point implementation; (C) the image reconstructed from the 16-bit fixed-point implementation.