Preliminary Proceedings of the First International Workshop on HyperTransport Research and Applications (WHTRA2009)

Run-Time Reconfiguration for HyperTransport coupled FPGAs using ACCFS

Jochen Strunk*, Andreas Heinig*, Toni Volkmer*, Wolfgang Rehm* and Heiko Schick†

*Chemnitz University of Technology, Computer Architecture Group
Email: {sjoc, heandr, tovo, rehm}@cs.tu-chemnitz.de

†IBM Deutschland Research & Development GmbH
Email: [email protected]

In this paper we present a solution where only one FPGA is needed in a host coupled system, in which the FPGA can be reconfigured by a user application during run-time without losing the host link connection. A hardware infrastructure on the FPGA and the software framework ACCFS (ACCelerator File System) on the host system are provided to the user, which allow easy handling of reconfiguration and communication between the host and the FPGA. Such a system can be used for offloading compute kernels to the FPGA in high performance computing, or for exchanging functionality in highly available systems during run-time without losing the host link during reconfiguration. The implementation was done for a HyperTransport coupled FPGA. The design of a HyperTransport cave was extended in such a way that it provides an infrastructure for run-time reconfigurable (RTR) modules.

1. Introduction

With the emergence of dynamically and partially reconfigurable (DPR) FPGAs, the possibility to reconfigure partially reconfigurable regions (PRR) with run-time reconfigurable modules has appeared. This feature enables FPGA customers to change the design of a certain region of the FPGA during run-time while maintaining the full functionality of the remaining part. This new degree of freedom also enables system designers to develop single FPGA chip solutions where additionally required hardware, e.g. a peripheral interconnect, is also located inside the FPGA.

For host coupled FPGA systems, solutions are conceivable where a static part of the FPGA covers the host interface core and the remainder of the device can be reconfigured during run-time with one or more user specific application modules.

Such an FPGA system would offer continuous host link connectivity during the time of partial reconfiguration and would not depend on exclusive hot plug solutions, where the board, the BIOS and the operating system must support hot plug functionality, which is currently not the case for standard motherboards with operating systems like Linux and Windows.

Two distinct options exist for connecting FPGA accelerators to a host system: either via a peripheral bus (e.g. PCI Express) or via the processor bus. AMD CPUs are well suited for directly processor bus coupled FPGA systems because of the open standard and low latency HyperTransport (HT) protocol.

Sharing the resources of a single FPGA between users is also imaginable. In a multi user or multi process environment, several modules could be run simultaneously on the same FPGA if resources are sufficient.

Partial reconfiguration offers the chance of reducing the implementation time of FPGA designs (rapid prototyping) if supported by the FPGA synthesis tools. Already functional parts could be left on the FPGA and only the functionality under test is exchanged. It should be noted that this requires a strictly modular overall design.

For highly available and real-time processing systems with host connection, run-time reconfiguration makes it possible to exchange or add functionality during system operation.

In the field of high performance computing nodes with FPGAs used as accelerators, run-time reconfiguration can be utilized to change offload compute kernels and to share FPGA device capacity.

Using FPGAs for acceleration, by creating specialized processing engines that exploit the highly parallel nature of FPGAs, can lead to a significant reduction of compute time. A speedup of more than 50 compared to a CPU was achieved by Woods et al. [1] accelerating a Quasi-Monte Carlo financial simulation.


Zhang et al. [2] gained a speedup of 25 for another Monte-Carlo simulation.

To run such compute kernels on a single chip FPGA solution making use of run-time reconfigurability, three main components have to be provided to the user. The first one is the operational infrastructure for running run-time reconfigurable modules (RTRM) alongside a host interface on the same FPGA. The second part consists of a framework which allows users to build their own RTRMs. Last but not least, a generic interface must be provided to the user which offers functions for reconfiguration and communication between the host and the RTRM located inside the FPGA.

The rest of the paper is organized as follows:

Section 2 is devoted to related work. In section 3, the capabilities of run-time reconfigurable FPGAs and the principles of creating partial configuration bit stream files are shown.

Section 4 describes the run-time reconfiguration support for a FPGA directly connected to AMD's processor bus. The enhancement of a HyperTransport cave implemented as host interconnect is shown. The infrastructure needed on the FPGA for the support of run-time reconfigurable modules and their creation is presented.

The software framework provided to the user is based on ACCFS (Accelerator File System), which is explained in section 5.

As proof of concept we have implemented two distinct compute kernel offload functions as run-time reconfigurable modules, described in section 6. The first RTR module acts as an offload function which finds patterns in a bit stream (pattern matcher), and the second module, a Mersenne Twister, generates pseudo random numbers at a high output rate.

Section 7 concludes the paper.

2. Related Work

Utilizing the RTR capabilities of FPGAs and building CPU coupled systems have been proposed under various aspects. Some works deal with internal communication structures while others concentrate more on system integration.

A tool-flow for a homogeneous communication infrastructure for RTR capable FPGAs was presented by Hagemeyer et al. [3], built upon the Xilinx design flow. In contrast, Koch et al. [4] designed a framework named ReCoBus-builder without applying Xilinx's partial reconfiguration flow. Only Virtex-II and Spartan-3 FPGAs are supported by the builder so far. Switch architectures with routers between RTR modules have also been examined in [5], [6].

Figure 1. Schematic view of the Xilinx XC4VFX60 FPGA (DSP and BRAM columns, BUFGs and clock regions)

On the matter of integration of FPGA modules or threads for embedded systems, different models have been proposed. ReconOS [7], a real time operating system implemented with static FPGA threads, is based on memory mapping and is used in embedded systems. Another model, BORPH [8], is based on the UNIX IPC mechanism and utilizes the integrated PowerPC as host.

For the integration of host coupled accelerators we proposed and implemented the Accelerator File System (ACCFS) [9]. This framework is based on the concept of a virtual file system. We have already shown the integration of the Cell/B.E. processor. In this paper we will show that ACCFS is well suited for the integration of FPGAs, even RTR capable FPGAs, into a host system.

3. Run-Time Reconfiguration on FPGAs

This section addresses the conditions which must be fulfilled when using the feature of run-time reconfiguration on Xilinx FPGAs. These are important for the implementation of a HT cave which supports run-time reconfigurable modules.

3.1. Dynamic Partial Reconfiguration for FPGAs

This subsection is devoted to the dynamic partial reconfiguration (DPR) of Virtex-4 and Virtex-5


Figure 2. Infrastructure of HT cave with RTR support (HTX host connection, HT Cave Core and HT Packet Engine as host interface specific parts; Internal Routing Unit and RTRMs as host interface independent parts)

FPGAs [10] from Xilinx, which is one of the few manufacturers offering DPR. The granularity of a partially reconfigurable region (PRR) is directly related to the configuration frames [11], which describe the function or contents of the slices containing LUTs or block RAM, for example. The granularity in the height of a PRR matches the height of a clock region for Virtex-4 (16 CLBs) and Virtex-5 (20 CLBs). In the horizontal direction a PRR must begin with an even and end with an odd slice number. Figure 1 is a schematic view of the Virtex-4 XC4VFX60 FPGA used for the implementation of a HyperTransport cave supporting RTRMs described in the next sections. Note that we have a total of 16 clock regions available. For run-time reconfiguration three different interfaces are available, which are able to read the configuration bit stream of a RTRM. One of these is the JTAG port, which is a bidirectional serial host-clocked link. It is generally used for prototyping and debugging, working up to a speed of 24 MHz with available JTAG programmers. Another mode is SelectMAP, which works on a parallel interface connected to physical IO pins of the FPGA, achieving high throughputs. The third variant is the internal configuration access port (ICAP). It is an internal version of the external SelectMAP, working at a clock speed of up to 100 MHz at 32 bits width. For host coupled systems it is best suited, because it does not depend on external IOs and allows the shortest reconfiguration time.

3.2. RTR Modules and Design Flow

In this subsection the design flow for the creation of run-time reconfigurable modules is introduced. It also covers challenges and limitations of dealing with RTRMs. The design flow is based on "Module based Partial Reconfiguration" [12] for Xilinx FPGAs. As a first step the HDL sources must be assigned either to the static part, which is constantly available during

run-time, or to the dynamic part. All communication between the two distinct parts has to go through hard macros, also known as bus macros. Clock resources and the hard macros must be instantiated in the HDL source and need to be assigned to a fixed location inside the FPGA. The run-time reconfigurable module itself is only instantiated as a black box, whose interface (entity) cannot be changed during run-time. This means that a common interface must be created if other modules should be loaded into the partially reconfigurable region (PRR). The location and size of a PRR must be specified for the place and route process using the "AREA_GROUP" constraint. It should be noted that for a standard static design, neither PRRs nor bus macros need to be specified. The same applies to the definition of the location of clock resources. To conduct the partial reconfiguration flow, a patch provided by Xilinx must be applied to the standard synthesis tools.

4. Run-Time Reconfiguration Support for a HyperTransport Cave

For a single FPGA chip solution connected to a host utilizing HyperTransport as interconnect, it is essential not to lose the link during the time of the reconfiguration of a RTRM. This implies that the HyperTransport IP-core implementing a HT cave must be kept inside the FPGA as static part. Hot plugging is not supported so far by off the shelf systems. Even if the hardware is capable of handling such requests, most operating systems do not support this. Other RTRMs inside the FPGA would also suffer from the link loss. For that reason the HyperTransport cave is kept in the static region. In this section the enhancement of a HT cave is shown which provides an infrastructure for dealing with RTRMs.


entity rtrm is
  port (
    crq_c2m_addr     : in  STD_LOGIC_VECTOR(31 downto 0);
    crq_c2m_data     : in  STD_LOGIC_VECTOR(31 downto 0);
    crq_c2m_rq_valid : in  STD_LOGIC;
    crq_c2m_stop     : in  STD_LOGIC;
    crq_m2c_data     : out STD_LOGIC_VECTOR(31 downto 0);
    crq_m2c_rp_valid : out STD_LOGIC;
    crq_m2c_stop     : out STD_LOGIC;
    crq_c2m_wr_rd    : in  STD_LOGIC;
    mrq_m2c_addr     : out STD_LOGIC_VECTOR(31 downto 0);
    mrq_c2m_data     : in  STD_LOGIC_VECTOR(31 downto 0);
    mrq_c2m_rp_valid : in  STD_LOGIC;
    mrq_c2m_stop     : in  STD_LOGIC;
    mrq_m2c_data     : out STD_LOGIC_VECTOR(31 downto 0);
    mrq_m2c_rq_valid : out STD_LOGIC;
    mrq_m2c_stop     : out STD_LOGIC;
    mrq_m2c_wr_rd    : in  STD_LOGIC;
    c2m_clk          : in  STD_LOGIC;
    c2m_res_n        : in  STD_LOGIC;
    m2c_intr         : out STD_LOGIC
  );
end rtrm;

    Figure 3. Entity of RTRM

4.1. RTR Infrastructure

A run-time reconfigurable infrastructure for a HyperTransport cave has to provide a communication mechanism between the host and the RTRM, and perhaps between RTRMs themselves. It also has to comply with the rules of partial reconfiguration and the partial design flow. To ease porting the infrastructure to other interconnects, e.g. PCI Express, the functionality which must be implemented for a RTR infrastructure should be divided into two parts. One covers the host interconnect specific functions and the other the host interconnect independent portions.

The infrastructure designed for a HyperTransport cave supporting RTRMs consists of two host interface specific parts, i.e. the HT Cave Core and the HT Packet Engine, and four host independent parts, an Internal Routing Unit, a RTRM Controller, a Reconfig Unit and one or more RTRMs. The design of this infrastructure of a HT cave with RTR support is depicted in Figure 2. The HT cave design for the HyperTransport interconnect originates from [13]. The task of the HT Packet Engine is to decode the HT packets coming from the host and to convert these into appropriate actions targeting the units inside the FPGA. This includes the creation of responses to requests from the host by injecting valid packets into the HT Cave Core. The Internal Routing Unit routes requests to and from internal units, e.g. the RTRM Controller and the Reconfig Unit. For fast run-time reconfiguration of RTRMs it is recommended to make use of an internal reconfiguration port. This is done by the Reconfig Unit which controls the internal

configuration access port (ICAP) of Xilinx FPGAs. The Reconfig Unit itself is controlled by the vendor specific driver on the host, which validates whether requests concerning the creation of new RTRMs can be served. The allocation of RTRMs to available RTR regions is also decided by the host system.

4.2. RTRM

Each RTRM has its own virtual address space, which is implemented 32 bits wide. This means that a global address space is not divided between the RTRMs using fixed addresses. It would be very difficult to resolve a request when two RTRMs demand the same fixed physical address for their memory regions, which are exported to the user application using an entry in the virtual file system implemented on the host system. The interface (entity) of a RTRM serves as an interconnect to the RTRM controller. Communication in both directions, i.e. controller requests (crq) and module requests (mrq), is possible using a stop and valid protocol. The entity of a RTRM in VHDL is shown in Figure 3.

4.3. RTRM Controller

The RTRM controller handles requests coming from the HT Core, originated by the user application, or from the RTRM itself. It converts physical addresses used for directly accessing the RTRM, e.g. through direct load and store operations from the host, to virtual RTRM addresses. The controller can also be used for RTRM to RTRM communication if desired.

4.4. Framework for a HT Cave supporting RTR

For generating the static part, i.e. the HT cave with RTR support, and the dynamic RTR modules, scripts are provided. The intention is to ease the creation of RTRMs for application developers who are not so familiar with FPGA IP-core designs and run-time reconfiguration.

The top VHDL module is synthesized with the instantiated HT core, the HT packet engine, the internal routing unit, the RTRM controller and the Reconfig Unit by the build_static script. The RTRM module is only instantiated as a black box module. Then the static part is implemented with the partial flow option. While the user constraints file (ucf) normally contains location (LOC) constraints for external IO pins, this file must also contain additional LOC constraints for the PR flow covering all clock resources, in particular


(Figure 4 diagram: layered structure from the applications through the syscall API layers, i.e. process management, virtual file system, virtual memory and sockets, down to the hardware layer with an HT cave or a PCIe endpoint.)

    (a) Without ACCFS (b) ACCFS: Common Generalized Interface

    Figure 4. ACCFS - Layered Structure

clock buffers and digital clock managers (DCMs). The resulting placed and routed design represents the basis for creating the dynamic configuration bit stream.

For the dynamic part, the user must supply an interface-compliant RTR module with the top entity name "rtrm" and a description of the file entries which should be exported by ACCFS. This description consists of the type, the size and the virtual address, which are essential to export the functionality to the user application. This additional information is added later to the final ACCFS configuration bit stream as part of the header.

Using the build_dynamic script, the user-supplied RTRM module is implemented with the partial flow option. Next, the Xilinx tools PR_verify and PR_assemble are used to build the partial bit stream file. Then the ACCFS RTRM bit stream file is created by adding header information containing the HT cave version, the FPGA board version and the user-supplied module description. Due to this header information, it is possible to transfer ACCFS RTRM bit stream files to other hosts which contain the same FPGA accelerator board and use the identical HT cave version.
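The header contents described above (HT cave version, FPGA board version and the exported file entries with their type, size and virtual address) could be laid out, for example, as below. The field names, widths and ordering are assumptions made for illustration; the paper does not specify the binary format.

```c
#include <stdint.h>

/* Hypothetical layout of the ACCFS bit stream header; field names,
 * widths and ordering are assumptions, not the tool chain's actual
 * format. */
struct accfs_bs_entry {        /* one exported file entry */
    uint32_t type;             /* kind of entry, e.g. register vs. memory */
    uint32_t size;             /* size of the exported region in bytes */
    uint32_t vaddr;            /* address inside the RTRM's 32-bit space */
};

struct accfs_bs_header {
    uint32_t ht_cave_version;  /* must match the static design on the target */
    uint32_t board_version;    /* FPGA accelerator board revision */
    uint32_t entry_count;      /* number of accfs_bs_entry records following */
};
```

A receiving host would compare ht_cave_version and board_version against its own static design before programming the device, which is what makes the bit stream files transferable between identical boards.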

5. ACCFS for Host System Integration

Different solutions exist for the operating system integration of a FPGA. For example, BORPH [8] or ReconOS [7] provide a hardware process/thread abstraction which coexists beside "normal" software processes. However, deep modifications of the Linux kernel are necessary to implement them. Furthermore,

it is required to run Linux on the processing unit of the FPGA.

Due to the mentioned disadvantages we proposed and implemented the Accelerator File System (ACCFS) [9]. In this section we describe the major aspects of ACCFS for the integration of FPGAs into a host system. We start with a brief overview in subsection 5.1. Subsection 5.2 depicts the concepts of ACCFS. Thereafter, we present the integration steps for the HT-coupled Virtex-4 card in subsection 5.3.

5.1. Overview

ACCFS is an open generic system interface for the integration of different accelerator types into the Linux operating system. It is based on SPUFS (Synergistic Processing Unit File System) [14], which is used to access the Synergistic Processing Units of the Cell/B.E. processor. The goal of ACCFS is to replace the different character device based interfaces (cf. Figure 4a) with a generic file system based interface (cf. Figure 4b).

In the case of character devices, the hardware functionalities are usually exported through the ioctl system call. However, this system call has the disadvantage of a non-standardized interface. Hence, the usage differs from one vendor to another.

In contrast, ACCFS defines a well structured, ioctl-free interface based on a Virtual File System (VFS) approach. In Figure 4b the parts of ACCFS are shown as gray boxes. To be customizable when integrating new hardware, ACCFS was split into two parts. Part one ("accfs") provides the user interface,


    and the other parts ("device handlers") integrate thehardware.

Device vendors as well as library programmers benefit from ACCFS. Only the lowest abstraction levels have to be implemented inside the device handlers. The whole user interface is already provided by accfs. Thus integrating a new accelerator requires less device driver programming effort. The library programmer benefits from the basic design concepts introduced in the next subsection.

5.2. Basic Concepts

In the previous subsection we already described the concept of functionality separation, which eases the integration of new hardware. Another concept is the use of a VFS which maps the accelerator to normal files. This enables us to implement an ioctl-free and hence nearly standards-conformant approach. All supported file I/O operations are POSIX conformant, with some exceptions. For example, it is not possible to write beyond the end of a file or to change the position of the current file pointer on some files.

ACCFS is designed to support the virtualization of the accelerators. We abstract the physical accelerator with an accelerator context. The context is the operational data set of the accelerator. It includes all information which is necessary to describe the current hardware state in such a way that the operation can be interrupted and resumed later without data loss. During the interruption another context is able to utilize the physical hardware. Virtualization optimizes the resource usage of the accelerators. Contexts which do not make use of the hardware at a given time are not scheduled on the physical accelerator.

Each context is bound to a directory inside the VFS under the ACCFS mount point. The files inside this directory represent the functionalities of the accelerator. To support reconfigurable hardware the file set is dynamically exported and can change during runtime. For example, an additional memory can be exported due to reconfiguration of the FPGA with a new RTR module.

To interact with the accelerator several methods are feasible. One is simple memory mapped IO with standard load/store machine instructions. In this direct memory access method the host is the active part which issues a read/write for every memory access. Another method is DMA bulk transfer. Here the accelerator needs a DMA unit capable of moving the data asynchronously to the host processor execution. In cases where the accelerator is able to initiate these transfers by itself, the DMA unit has to handle virtual

struct accfs_vendor {
    int      vendor_id;
    int      (*create)(...);
    int      (*destroy)(...);
    int      (*run)(...);
    ssize_t  (*memory_sdma)(...);
    ssize_t  (*config_read)(...);
    ssize_t  (*config_write)(...);
};

    Figure 5. struct accfs_vendor

memory management issues, too. However, not every accelerator supports virtual memory. For this reason we restrict our solution to host initiated DMA, where the host sets up the memory management unit and initiates the data transfer. The actual data movement is done asynchronously by the accelerator.

Finally, ACCFS supports asynchronous context execution based on an explicit synchronization primitive. This concept eases software development because multi-threading is not required when using multiple accelerator units. Every context runs asynchronously to the host system. The finish status can be read through a "status" file.

5.3. FPGA Support

To support HyperTransport coupled FPGA boards within ACCFS, a new device handler has to be written. This device handler has to provide the structure accfs_vendor (cf. Figure 5). The first four entries have to be set and the others are optional. For example, if the callback function for the DMA bulk transfer is not set (memory_sdma), accfs will use its internal routines to copy the data from/to the FPGA.
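A device handler then essentially boils down to filling this structure. The sketch below paraphrases struct accfs_vendor from Figure 5 (the argument lists are elided in the paper and therefore left unspecified here too); the callback names and the vendor ID are hypothetical.

```c
#include <stddef.h>
#include <sys/types.h>

/* Paraphrase of struct accfs_vendor (cf. Figure 5); parameter lists
 * are elided in the paper, so they are left unspecified here. */
struct accfs_vendor {
    int vendor_id;
    int (*create)();
    int (*destroy)();
    int (*run)();
    ssize_t (*memory_sdma)();
    ssize_t (*config_read)();
    ssize_t (*config_write)();
};

/* Hypothetical callbacks for an HT-coupled FPGA board. */
static int ht_create(void)  { return 0; }
static int ht_destroy(void) { return 0; }
static int ht_run(void)     { return 0; }

/* The first four entries are mandatory; memory_sdma stays NULL, so
 * accfs would fall back to its internal copy routines as described. */
static struct accfs_vendor ht_fpga_handler = {
    .vendor_id = 0x1014,          /* assumed example ID */
    .create    = ht_create,
    .destroy   = ht_destroy,
    .run       = ht_run,
};
```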

Further details of the device handler implementation are described in the rest of this subsection with the help of the typical FPGA usage model shown in Figure 6. An example code fragment using this model is shown in Figure 7 of section 6, where the case study is conducted.

5.3.1. Create Context. ACCFS enforces an accelerator based programming model. The main program runs on the host system and executes the compute kernel on the accelerator. To outsource such a kernel the application has to create a context by invoking the acc_create system call.

Currently our device handler does not support virtualization, hence we can only provide the FPGA exclusively to one application.


Figure 6. FPGA usage model. The application's calls map to device handler actions: sys_acc_create establishes the context (FPGA available?, first initialization, returns a context descriptor); configuring the context programs the FPGA (validate bit stream, space available?, program device); data exchange requests are validated; executing the design triggers a state transition; waiting for finish and destroying the context both wait for 'STOP'.

5.3.2. Configuration. Loading the design is triggered by a write system call on the "config" file. The data has to be a valid ACCFS bit stream. To ensure that the RTRM matches the RTR infrastructure, we provide a tool chain which generates such a bit stream file by writing a special header before the bit stream data. The header contains all necessary information describing the bit stream, such as the RTR capable core and FPGA board version. If the validation is successful, the FPGA is programmed with the configuration bit stream file using the internal reconfiguration port (ICAP for Xilinx FPGAs) or through an external JTAG programming device, e.g. the Xilinx USB platform cable. After a successful configuration the exported memories of the FPGA design are visible in the context directory.
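The ICAP figures quoted in section 3 (32 bits per cycle at up to 100 MHz, i.e. a peak of 400 MB/s) give a lower bound on the duration of this configuration step. The helper below computes that bound; the example bit stream size in the usage note is an assumed value, not a measurement from the paper.

```c
/* Lower bound on ICAP reconfiguration time, using the interface
 * parameters from section 3: 32 bits per cycle at up to 100 MHz,
 * i.e. a peak of 400e6 bytes/s. Real designs are slower, since the
 * ICAP is rarely fed at its full rate. */
double min_reconfig_ms(double bitstream_bytes)
{
    const double icap_bytes_per_sec = 100e6 * 4.0;   /* 100 MHz x 4 bytes */
    return bitstream_bytes / icap_bytes_per_sec * 1e3;
}
```

For an assumed partial bit stream of 400 KB this yields 1 ms, which illustrates why the text recommends the internal port over the serial 24 MHz JTAG link for host coupled systems.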

5.3.3. Data Exchange. Access to FPGA memory is possible with the read and write system calls. In a later development stage these calls will start a host initiated DMA bulk transfer. If the memory is exported as memory mapped IO, the mmap system call will map the memory into the address space of the application.

The "data exchange" operation is always possible after the configuration, no matter whether the context is in execution or not.

5.3.4. Execute Design. To start the RTR module the application has to invoke the acc_run system call. The execution happens asynchronously, meaning that acc_run returns immediately. This enables the application to execute more than one context in parallel without using threads.

When the application needs to check the execution status, e.g. whether the FPGA has finished its work, the "status" file can be read. Unless this file was opened with O_NONBLOCK, the read system call will block until the RTRM inside the FPGA has finished its task.

5.3.5. Destroy Context. When the application closes the file handle returned by acc_create, the context gets destroyed.

6. Case Study of RTRMs for a HT Cave supporting RTR

6.1. Overview

As proof of concept we designed two different compute kernels as RTRMs for a HyperTransport coupled Xilinx Virtex-4 FPGA plug-in card [15]. The user program, using the virtual file system ACCFS, is able to configure and access the two RTRMs consecutively during its run-time at the time when they are needed. The first RTRM acts as an offload function which finds patterns in a byte stream (pattern matcher) and the second module, a Mersenne Twister, generates pseudo random numbers at a high output rate. For generating the appropriate partial bit stream files of the RTRMs, the framework presented in subsection 4.4 is applied.

As hardware for the host system, an Iwill DK8-HTX motherboard with two Opteron processors is utilized. The pre-installed BIOS is replaced by a customized LinuxBIOS version to get the HTX card enumerated by the host system. The FPGA on the HTX card is a Xilinx Virtex-4 XC4VFX60.

6.2. RTRMs: Pattern Matcher and Mersenne Twister

Two RTRMs have been implemented, which are described in this subsection: a pattern matcher and a Mersenne Twister based on the MT19937 algorithm [16].

The latter uses the MT32 [17] implementation, which is able to provide a new 32-bit pseudo random number each clock cycle. When the host performs a


int matcher_run(void *search_db_in, int db_size,
                void *patterns_in, int pattern_count,
                void *results_out, int results_size)
{
    int ret;
    char bufstatus[12];
    // create context of our static FPGA design
    int fd_ctx = (int) acc_create("example", V_ID, D_ID, 0750, NULL);
    // configure the design
    int fd_cfg = openat(fd_ctx, "config", O_WRONLY);
    configure_fpga(fd_cfg, MATCHER_RTRM_BITSTREAM);
    // open memory and status
    int fd_mem = openat(fd_ctx, "memory/FPGA_MEM1", O_RDWR);
    int fd_status = openat(fd_ctx, "status", O_RDONLY);
    // fill memory with data (DMA bulk transfer)
    pwrite(fd_mem, search_db_in, db_size, DB_OFFSET);
    pwrite(fd_mem, patterns_in, 4 * pattern_count, PATTERN_OFFSET);
    // start the matcher
    acc_run(fd_ctx, 0);
    // check status (wait until context execution finished)
    read(fd_status, bufstatus, 12);
    // read results of operation (DMA bulk transfer)
    ret = pread(fd_mem, results_out, results_size, RESULTS_OFFSET);
    // close files
    close(fd_mem); close(fd_status); close(fd_cfg);
    return ret;
}

    Figure 7. Pattern matcher user program

read request on an arbitrary RTRM address, a new 32-bit number is provided.
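The MT19937 algorithm that the MT32 core implements can be stated compactly in software. The following C version is the standard reference algorithm (624-word state, period 2^19937-1); it is shown for illustration only and is unrelated to the VHDL implementation of the core.

```c
#include <stdint.h>

/* Reference software implementation of MT19937: 624-word state,
 * twist with parameters m = 397 and a = 0x9908b0df, followed by the
 * standard tempering. The MT32 hardware core produces one such word
 * per clock cycle. */
#define MT_N 624
#define MT_M 397
static uint32_t mt[MT_N];
static int mti = MT_N;

void mt_seed(uint32_t s)
{
    mt[0] = s;
    for (int i = 1; i < MT_N; i++)
        mt[i] = 1812433253u * (mt[i-1] ^ (mt[i-1] >> 30)) + (uint32_t)i;
    mti = MT_N;
}

uint32_t mt_next(void)
{
    if (mti >= MT_N) {                       /* regenerate all 624 words */
        for (int i = 0; i < MT_N; i++) {
            uint32_t y = (mt[i] & 0x80000000u)
                       | (mt[(i + 1) % MT_N] & 0x7fffffffu);
            mt[i] = mt[(i + MT_M) % MT_N] ^ (y >> 1)
                  ^ ((y & 1u) ? 0x9908b0dfu : 0u);
        }
        mti = 0;
    }
    uint32_t y = mt[mti++];                  /* tempering */
    y ^= y >> 11;
    y ^= (y << 7)  & 0x9d2c5680u;
    y ^= (y << 15) & 0xefc60000u;
    y ^= y >> 18;
    return y;
}
```

Seeded with the conventional default 5489, the 10000th output is the well-known check value 4123659995, which makes a software model like this useful for verifying the hardware core's output stream.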

The RTRM pattern matcher simultaneously compares several 32-bit patterns against a search database. The module consists of a finite state machine (FSM), four 32-bit comparators for each pattern, one control register, one status register as well as dual-port block RAMs for the search database, the search patterns and the results. Additionally, a 56-bit window is superimposed over the search database.

The registers and memories are mapped into the lower 27 bits of the RTRM's address space and can be accessed by the host.

After the host has set the start bit in the control register, the FSM reads the search patterns from the pattern memory, the window is set to the beginning of the search database and the comparators are enabled.

Then, the first comparator of each search pattern tests the first 32 bits of the window, the second one tests 32 bits shifted by one byte, the third one 32 bits shifted by two bytes and the fourth one the last 32 bits of the window against the search pattern. Hereby, the window can be

int run_compute_kernel(double *results_out, int results_count)
{
    // create context of our FPGA design
    int fd_ctx = (int) acc_create("example", V_ID, D_ID, 0750, NULL);
    // configure the design
    int fd_cfg = openat(fd_ctx, "config", O_WRONLY);
    configure_fpga(fd_cfg, MERSENNE_RTRM_BITSTREAM);
    // open memory
    int fd_mem = openat(fd_ctx, "memory/FPGA_MEM1", O_RDWR);
    // allocate buffer
    int32_t *buffer = (int32_t *) mmap(NULL, MEM_SIZE,
                                       PROT_READ | PROT_WRITE,
                                       MAP_SHARED, fd_mem, 0);
    int32_t *mt32_numbers = buffer + NUMBERS_OFFSET;
    // start the Mersenne twister MT32
    acc_run(fd_ctx, 0);
    // example C function that uses the random numbers
    c_kernel_function(results_out, results_count, mt32_numbers);
    // unmap buffer
    munmap((void *) buffer, MEM_SIZE);
    // close files
    close(fd_mem); close(fd_cfg);
    return 0;
}

    Figure 8. Example that uses MT32 pseudo ran-dom numbers

    shifted by 32 bits each clock cycle.When the end of the search database has beenreached, the results are written to the results memory.

    Afterwards, the 'finished' bit is set in the status reg-ister. Next, the host can read the matcher results fromthe results memory.

6.3. User Application accessing RTRMs

The user function matcher_run (cf. Figure 7) demonstrates the usage of the RTRM pattern matcher. First, this function creates a new context and partially reconfigures the FPGA by the function configure_fpga. Then, the search database and search patterns are written to the RTRM's database and patterns memory using the pwrite system call. Next, the matcher is started using acc_run and the user function waits until the execution has finished. After that, the results are read from the FPGA into the buffer results_out by the pread system call.

The user function run_compute_kernel (cf. Figure 8) uses the pseudo random numbers generated by the RTRM Mersenne twister for the computation kernel c_kernel_function. This RTRM is initialized using the same functions as in the previous example. In contrast to the previous one, the random numbers are not read using file handles, but can be accessed by the computation kernel via the memory-mapped buffer mt32_numbers.

Figure 9. Placed and routed design of the HT cave with RTR support and the pattern matcher RTRM

6.4. Results of Case Study

The infrastructure for RTR modules based on the HT cave with RTR support was successfully implemented and verified. Furthermore, the virtual file system ACCFS was utilized for the integration and management of RTR modules on a HyperTransport plug-in card with a Xilinx Virtex-4 FPGA, using two example RTR modules which can be loaded onto the FPGA during run-time. For the implementation of the HT cave with RTR support, at least 4 clock regions have to be reserved as the static part.

The first RTR module, acting as an offload function which finds patterns in a byte stream (pattern matcher), consists of 290 pattern matcher units, resulting in a total of up to 116 billion 32-bit comparisons per second.

This module occupies nearly all slices available within the clock regions designated for the RTRM. The placement is shown in Figure 9.

The second module implemented is a Mersenne Twister which generates pseudo random numbers at a high output rate.

For generating the partial bitstream file, the framework presented in subsection 4.4 was applied.

7. Conclusion

By using the ability of run-time reconfiguration of FPGAs, it is possible to build a single FPGA chip solution as a host-coupled accelerator without losing the host link connection during the reconfiguration of RTR modules. The design of an RTR infrastructure inside the FPGA was shown, which allows RTR modules to be managed during run-time. The implementation was done for FPGAs coupled directly to the HyperTransport processor bus of the host system. The concepts provided are applicable to other processor- and peripheral-bus-coupled FPGAs. The software framework ACCFS, based on a virtual file system, provides a generic interface to user applications which is able to satisfy the demands of run-time reconfigurable computing.

8. Future Work

To speed up high-throughput communication between the host and an RTRM, a memory transfer controller supporting bulk transfers between the different address spaces of the host and the RTRM should be implemented.

9. Acknowledgment

The project is performed in collaboration with the Center of Advanced Study Böblingen, IBM Research & Development GmbH, Germany.