Date post: 23-May-2018
Reconfigurable Computing

David Boland 1, Chung-Kuan Cheng 2, Andrew B. Kahng 2, Philip H.W. Leong 1

1 School of Electrical and Information Engineering, The University of Sydney, Australia 2006
2 Dept. of Computer Science and Engineering, University of California, La Jolla, California

Abstract:

Reconfigurable computing is the application of adaptable fabrics to solve computational problems, often taking advantage of the flexibility available to produce problem-specific architectures that achieve high performance because of customization. Reconfigurable computing has been successfully applied to fields as diverse as digital signal processing, cryptography, bioinformatics, logic emulation, CAD tool acceleration, scientific computing, and rapid prototyping.

Although Estrin first proposed the idea of a reconfigurable system in the form of a fixed plus variable structure computer in 1960 [1], it has only been in recent years that reconfigurable fabrics, in the form of field-programmable gate arrays (FPGAs), have reached sufficient density to make them a compelling implementation platform for high performance applications and embedded systems. In this article, intended for the non-specialist, we describe some of the basic concepts, tools and architectures associated with reconfigurable computing.

Keywords:

reconfigurable computing; adaptable fabrics; application integrated circuits; field programmable gate arrays (FPGAs); system architecture; runtime
Page 1: Reconfigurable Computing - University of Sydney · Reconfigurable Computing David Boland1, Chung-Kuan Cheng2, Andrew B. Kahng2, Philip H.W. Leong1 1School of Electrical and Information



1 Introduction

Although reconfigurable fabrics can in principle be constructed from any type of technology, in practice, most contemporary designs are made using commercial field programmable gate arrays (FPGAs). An FPGA is an integrated circuit containing an array of logic gates in which the connections can be configured by downloading a bitstream to its memory. FPGAs can also be embedded in integrated circuits as intellectual property cores. More detailed surveys on reconfigurable computing are available in the literature [2-6].

Microprocessors offer an easy-to-use, powerful and flexible implementation medium for digital systems. Their utility in computing applications makes them an overwhelming first choice. Moreover, it is relatively easy to find software developers, and microprocessors are widely supported by operating systems, software engineering tools, and libraries. However, in the last decade, power constraints have limited the performance of serial computation on microprocessors. This has led to the development of multi-core processors and an increasing importance placed on the pursuit of parallel computation [7]. Unfortunately, multi-core processors are rarely the most efficient method to perform parallel computation. This inefficiency stems from the fact that each core must be general enough to support an entire instruction set. As a result, the majority of energy is used in decoding the instruction and fetching data instead of performing actual computation [8].

Hardware accelerators such as graphics processing units (GPUs) and FPGAs are parallel computational architectures that have demonstrated substantial performance and energy efficiency improvements over traditional multi-core processor designs by moving the focus back to computation [9, 10]. In terms of energy efficiency, the GPU architecture, which consists of thousands of parallel floating-point units, is best suited to so-called embarrassingly parallel computation or computationally expensive problems. However, many algorithms will not fall into this problem category. In contrast, using an FPGA or Application-Specific Integrated Circuit (ASIC), it is possible to create a fully customised datapath for a given algorithm, meaning it is possible to achieve even greater energy efficiency using these devices.

Application-specific integrated circuits (ASICs) and FPGAs achieve greater levels of parallelism than a microprocessor by arranging computations in a spatial rather than temporal fashion. This can result in performance improvements of several orders of magnitude. Also, the absence of caches and instruction decoding can result in the same amount of work being done with less chip area and lower power consumption [11]. Notable examples of application domains include cryptography, NP-hard optimization problems, pattern matching, machine learning, and molecular dynamics [6].


An example involving the implementation of a finite impulse response (FIR) filter is shown in Fig. 1. The reconfigurable computing solution is significantly more parallel than the microprocessor-based one. In addition, it should be apparent that the reconfigurable solution avoids the overheads associated with instruction decoding, caching, and register files. Furthermore, speculative execution, unnecessary data transfers and control hardware can be omitted.

Figure 1. Illustration of a microprocessor-based FIR filter vs. a reconfigurable computing solution. In the microprocessor, operations are performed in the ALU sequentially. Furthermore, instruction decoding, caching, speculative execution, control generation and so on are required. For the reconfigurable computing approach using an FPGA, spatial composition is used to increase the degree of parallelism. The FPGA implementation can be further parallelized through pipelining.
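The contrast between sequential and spatial computation of an FIR filter can be sketched in software. The following is only an illustrative model, not the circuit of Fig. 1: the function names are hypothetical, and the "spatial" version merely mimics a datapath with one multiplier per tap fed by a shift register, whereas a real FPGA would evaluate all products in the same clock cycle.

```python
def fir_sequential(samples, taps):
    """One multiply-accumulate at a time, as a single ALU would execute it."""
    out = []
    for n in range(len(samples)):
        acc = 0
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * samples[n - k]
        out.append(acc)
    return out


def fir_spatial(samples, taps):
    """Software model of a datapath with one multiplier per tap.

    Each loop iteration stands for one clock cycle: the delay line shifts,
    every tap multiplier fires together, and an adder tree sums the products.
    """
    delay_line = [0] * len(taps)
    out = []
    for x in samples:
        delay_line = [x] + delay_line[:-1]                    # shift register
        products = [h * d for h, d in zip(taps, delay_line)]  # all multipliers
        out.append(sum(products))                             # adder tree
    return out


if __name__ == "__main__":
    print(fir_sequential([1, 2, 3, 4], [1, 1]))
    print(fir_spatial([1, 2, 3, 4], [1, 1]))
```

Both functions compute the same result; the difference is that the inner loop of the sequential version costs one step per tap, while the spatial model produces one output per modelled cycle regardless of the number of taps.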

Compared with ASICs, FPGAs offer very low non-recurring engineering (NRE) costs, which is often a more important factor than the fact that FPGAs have higher unit costs. This is because many applications do not have the extremely high volumes required to make ASICs a cheaper proposition. As integrated circuit feature sizes continue to decrease, the NRE costs associated with ASICs continue to escalate, increasing the volume at which it becomes cheaper to use an ASIC (see Fig. 2). Being more specialized, ASICs offer area, power and speed advantages over FPGAs, although this gap is reduced as more hard blocks are employed [12]. Moving forward, reconfigurable computing will be used in an increasing number of applications, as ASICs become cost effective only for the highest performance or highest volume applications.


Figure 2. Cost of technology vs. volume. The crossover volume for which ASIC technology is cheaper than FPGAs increases as feature size is reduced because of increased non-recurring engineering costs.
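The crossover behaviour in Fig. 2 follows from a simple linear cost model, total cost = NRE + unit cost × volume. The sketch below assumes that model; the function names and all dollar figures are hypothetical placeholders chosen only to show the trend, not data from the article.

```python
def total_cost(nre, unit_cost, volume):
    """Linear cost model: one-off engineering cost plus per-unit cost."""
    return nre + unit_cost * volume


def crossover_volume(nre_asic, unit_asic, nre_fpga, unit_fpga):
    """Volume above which the ASIC becomes the cheaper option."""
    if unit_fpga <= unit_asic:
        raise ValueError("model assumes the FPGA has the higher unit cost")
    return (nre_asic - nre_fpga) / (unit_fpga - unit_asic)


if __name__ == "__main__":
    # Hypothetical numbers: doubling the ASIC NRE doubles the crossover volume.
    print(crossover_volume(1_000_000, 5, 0, 55))
    print(crossover_volume(2_000_000, 5, 0, 55))
```

Under this model, a rise in ASIC NRE with unit costs held fixed moves the crossover point to a proportionally higher volume, which is the effect the figure illustrates.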

Additional benefits of reconfigurable computing are that its technology provides a shorter time to market than ASICs (associated FPGA fabrication time is essentially zero), making many fabrication iterations within a single day possible. This benefit allows more complex algorithms to be deployed and makes problem-specific customizations of designs possible. FPGA-based designs are inherently less risky in terms of technical feasibility and cost, as shorter design times and lower upfront costs are involved. As their name suggests, FPGAs also offer the possibility of modifications to the design in the field, which can be used to provide bug fixes, to adapt to changing standards, or to add functionality, all of which can be achieved by downloading a new bitstream to an existing reconfigurable computing platform. Reconfiguration can even take place while the system is running; this is known as runtime reconfiguration (e.g., [13]).

In the next section, we introduce the basic architecture of common reconfigurable fabrics, followed by a discussion of applications of reconfigurable computing and system architectures. Runtime reconfiguration and design methods are then covered. Finally, we discuss multichip systems and end with a conclusion.

[Figure 2 labels: axes COST and VOLUME; curves Current ASIC, Older ASIC, Older FPGA, Current FPGA; annotation: crossover volume increases with decreasing feature size.]


2 Reconfigurable Fabrics

A block diagram illustrating a generic fine-grained island-style FPGA is given in Fig. 3 [14]. Products from companies such as Xilinx [15], Altera [16], and Microsemi [17] are commercial examples. The FPGA consists of a number of logic cells that can be interconnected to other logic and input/output (I/O) cells via programmable routing resources. Logic cells and routing resources are configured via bit-level programming data, which is stored in memory cells in the FPGA. A logic cell consists of user-programmable combinatorial elements, with an optional register at the output. They are often implemented as lookup tables (LUTs) with a small number of inputs, 4-input LUTs being shown in Fig. 3. Using such an architecture, subject to FPGA-imposed limitations on the circuit's speed and density, an arbitrary circuit can be implemented. The complete design is described via the configuration bitstream which specifies the logic and I/O cell functionality, and their interconnection.
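To make the role of configuration bits concrete, the following sketch models a 4-input LUT as a 16-entry truth table. It is a simplified software illustration only: the helper names are hypothetical, the bit ordering is an arbitrary choice, and real devices add registers, carry logic and routing that are not modelled here.

```python
def program_lut(func):
    """Derive the 16 configuration bits of a 4-input LUT from a Boolean function.

    Bit i holds the output for the input pattern whose binary encoding is i
    (input a is the least significant bit; this ordering is an assumption).
    """
    bits = []
    for i in range(16):
        a, b, c, d = i & 1, (i >> 1) & 1, (i >> 2) & 1, (i >> 3) & 1
        bits.append(func(a, b, c, d) & 1)
    return bits


def make_lut(config_bits):
    """Return a function that evaluates the LUT by indexing its truth table."""
    if len(config_bits) != 16:
        raise ValueError("a 4-input LUT needs exactly 16 configuration bits")

    def lut(a, b, c, d):
        return config_bits[a | (b << 1) | (c << 2) | (d << 3)]

    return lut


if __name__ == "__main__":
    xor4 = make_lut(program_lut(lambda a, b, c, d: a ^ b ^ c ^ d))
    and4 = make_lut(program_lut(lambda a, b, c, d: a & b & c & d))
    print(xor4(1, 0, 0, 0), and4(1, 1, 1, 1))
```

The same `make_lut` hardware model implements either function; only the 16 stored bits differ, which is the sense in which the bitstream defines the logic.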

Figure 3. Architecture of a basic island-style FPGA with four-input logic cells. The logic cells, shown as gray rectangles, are connected to programmable routing resources (shown as wires, dots, and diagonal switch boxes) (source: References [14] and [18]).


Current trends are to incorporate additional embedded blocks so that designers can integrate entire systems on a single FPGA device. Apart from density, cost, and board area benefits, this process also improves performance because more specialized logic and routing can be used and all components are on the same chip. A contemporary FPGA commonly has features such as carry chains to enable fast addition; wide decoders; tristate buffers; blocks of on-chip memory and multipliers; programmable I/O standards in the input/output cells; delay locked loops; phase locked loops for clock de-skewing, phase shifting and multiplication; multi-gigabit transceivers (MGTs); and embedded microprocessors. Embedded microprocessors can be implemented either as soft cores using the internal FPGA resources or as hardwired cores.

In addition to the architectural features described, intellectual property (IP) cores, implemented using the logic cell resources of the FPGA, are available from vendors and can be incorporated into a design. These cores include bus interfaces, networking components, memory interfaces, signal processing functions, microprocessors and so on, and can significantly reduce development time and effort.

The bit-level organization of the logic and routing resources in island-style FPGAs is extremely flexible but has high implementation overhead as a result. Tradeoffs exist in the granularity of the logic cells and routing resources. Fine-grained devices have the best flexibility; however, coarse-grained elements can trade some flexibility for higher performance and density [19].

With modern technologies, the speed of the routing resource is a limiting factor. Trends have been to increase the functionality of the logic cells, e.g., to use logic cells with larger numbers of inputs which can also be configured as smaller LUTs [20], and to add pipeline registers to the routing fabric [21]. For datapath-oriented applications such as digital signal processing, coarse-grained architectures [22] such as PipeRench [23] and RaPID [24] employ bus-based routing and word-based functional units to utilize silicon resources more efficiently.


3 Applications

Reconfigurable computing has found widespread application in the form of “custom computing machines” to accelerate computation over algorithms implemented on a CPU. Application domains include high-energy physics [25], genome analysis [26], signal processing [27, 28], computer vision [29], cryptography [30, 31], financial engineering [10, 32], scientific computing [33], machine learning [34] and security [35].

In many of these problem domains, a general purpose GPU has also demonstrated considerable acceleration over a CPU and will often outperform an FPGA as well in terms of raw performance. This is because it is a parallel architecture with many hardened floating-point units and substantially greater memory bandwidth, meaning it is an ideal architecture provided that algorithms can be broken down into a large number of parallel threads. However, outright performance is no longer the only benchmark; energy consumption is also important. In terms of high performance computing, supercomputing clusters and data centers now consume vast amounts of energy, not only on computation, but also on cooling in order to maintain performance and reliability. It follows that reducing energy consumption provides both environmental and economic benefits. Energy minimization is also important for embedded applications; for example, reducing power consumption on smartphones or other battery-powered devices is desirable from an end user perspective. As a result, FPGA and GPU vendors are focusing their engineering efforts towards making future architectures more energy efficient. This is reflected in the most recent FPGA and GPU architectures: Nvidia’s P100 claims a peak performance of 10.6 TFLOPs (single precision) with a TDP of only 300 W [36], while Altera claims performance of up to 9.3 TFLOPs (single precision) at 80 GFLOPs/watt is achievable on their upcoming Stratix 10 device [16].
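The two vendor claims can be put on a common footing with a little arithmetic. The sketch below uses only the figures quoted above; it treats TDP as if it were power drawn at peak throughput and assumes both Altera figures hold at the same operating point, neither of which the source states, so the results are indicative at best. The function names are hypothetical.

```python
def gflops_per_watt(peak_tflops, power_watts):
    """Convert a peak TFLOPs figure and a power figure into GFLOPs per watt."""
    return peak_tflops * 1000.0 / power_watts


def implied_power_watts(peak_tflops, efficiency_gflops_per_watt):
    """Power implied if peak throughput and quoted efficiency coincide."""
    return peak_tflops * 1000.0 / efficiency_gflops_per_watt


if __name__ == "__main__":
    # Quoted claims: P100 at 10.6 TFLOPs and 300 W TDP; Stratix 10 at
    # up to 9.3 TFLOPs and 80 GFLOPs/watt.
    print(round(gflops_per_watt(10.6, 300.0), 1))      # about 35.3 GFLOPs/W
    print(round(implied_power_watts(9.3, 80.0), 2))    # about 116 W
```

On these peak figures the claimed FPGA efficiency is a little over twice that of the GPU, though peak vendor numbers say little about sustained behaviour on a real workload.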

Many experimental studies have been performed comparing the energy efficiency of FPGAs, GPUs and CPUs; a recent survey is provided in [37]. Both FPGAs and GPUs typically outperform CPUs according to this metric. GPUs have been shown to be more energy efficient than FPGAs for certain applications such as matrix multiplication. To some extent, this is a result of the decision to optimize the GPU architecture for this problem [38]. However, the flexibility of an FPGA has seen it outperform a GPU, in terms of energy efficiency or performance per watt, across a broader spectrum of applications. Examples include 2-D FIR (finite impulse response) filters, Viola-Jones face detection, K-means clustering, Monte Carlo options pricing, random number generation, Smith-Waterman, and 3-D ultrasound computer tomography [37]. Energy efficiency gains using FPGAs have also been claimed on commercial systems. For example, Microsoft reported a 3x energy efficiency gain, and a reduced latency, when using FPGAs instead of GPUs on their Catapult machine [39], which is discussed later in Section 4.

While most of these performance comparisons have been performed using IEEE standard single or double precision arithmetic, this is not necessarily the most energy-efficient design possible on an FPGA. This is because FPGAs have the freedom to implement any precision, so it may be possible to create a working design using a custom (reduced precision) fixed or floating-point number format that is sufficient to satisfy a design specification. This will avoid unnecessary computation and can improve the energy efficiency and performance of an FPGA implementation dramatically [40, 41].
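As a rough illustration of choosing a custom precision, the sketch below quantizes values to a signed fixed-point format and searches for the smallest number of fractional bits that keeps the error within a tolerance. The function names, the saturating behaviour and the brute-force search are assumptions made for illustration; the word-length optimization methods cited above are considerably more sophisticated.

```python
def to_fixed(x, frac_bits, total_bits):
    """Quantize x to a signed fixed-point value with the given bit widths.

    The result is returned as a float so the rounding error is easy to
    inspect. Out-of-range values saturate (an assumed overflow policy).
    """
    scale = 1 << frac_bits
    raw = round(x * scale)
    lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    raw = max(lo, min(hi, raw))
    return raw / scale


def min_frac_bits(values, tolerance, total_bits=32):
    """Smallest fractional width keeping every value within the tolerance."""
    for frac_bits in range(total_bits):
        if all(abs(v - to_fixed(v, frac_bits, total_bits)) <= tolerance
               for v in values):
            return frac_bits
    return None


if __name__ == "__main__":
    print(to_fixed(0.3, 4, 8))
    print(min_frac_bits([0.75, 0.5], 0.0))
```

Every bit removed from the format shrinks the adders and multipliers that process it, which is where the area and energy savings of a reduced-precision FPGA datapath come from.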

To a degree, the flexibility of an FPGA is even beyond that possible in an ASIC. For example, in an FPGA-based implementation of RSA cryptography [30], a different hardware modular multiplier for each prime modulus was employed (i.e., the modulus was hardwired in the logic equations of the design). Such an approach would not be practical in an ASIC as the design effort and cost is too high to develop a different chip for different moduli. This led to greatly reduced hardware and improved performance, the implementation being an order of magnitude faster than any reported implementation in any technology at the time.

Another important application is logic emulation [42, 43], where reconfigurable computing is not only used for simulation acceleration, but also for prototyping of ASICs and in-circuit emulation. In-circuit emulation allows the possibility of testing prototypes at full or near-full speed, allowing more thorough testing of time-dependent applications such as networks. It also removes many of the dependencies between ASIC and firmware development, allowing them to proceed in parallel and hence shortening development time. As an example, it was used in [44] for the development of a two-million-gate ASIC containing an IEEE 802.11 medium access controller and IEEE 802.11 a/b/g physical layer processor. Using a reconfigurable prototype of the ASIC on a commodity FPGA board, the ASIC went through one complete pass of real-time beta testing before tape-out.

Digital logic, of course, maps extremely well to fine-grained FPGA devices. The main design issues for such systems lie in partitioning of a design among multiple FPGAs and dealing with the interconnect bottleneck between chips. The Cadence Protium Rapid Prototyping Platform [45] is a commercial example of a logic emulation system and has 100 million-gate logic capacity and fast compilation and partitioning algorithms. Further discussion of interconnect time-multiplexing and system decomposition is given later in this article. Some examples of applications accelerated using early multiple FPGA systems are discussed below.


Hoang [26] implemented algorithms to find minimum edit distances for protein and DNA sequences on the Splash 2 architecture. Splash 2 can be modeled in terms of both bidirectional and unidirectional systolic arrays. In the bidirectional algorithm, the source character stream is fed to the leftmost processing element (PE), whereas the target stream is fed to the rightmost PE. Comparing two sequences of length m and n requires at least 2 × max(m + 1, n + 1) processors, and the number of steps required to compute the edit distance is proportional to the size of the array. The unidirectional algorithm is suited for comparing a single source sequence against multiple target sequences. The source sequence is first loaded as in the bidirectional case, and the target sequences are fed in one after the other and processed as they pass through the PEs (which results in virtually 100% utilization of processors, so that the unidirectional model is better suited for large database searches).
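The reason edit distance suits a systolic array can be seen from its dynamic programming recurrence: every cell on an anti-diagonal depends only on the two previous anti-diagonals, so those cells can be computed at the same time by separate PEs. The sketch below is a plain software rendering of that wavefront order with unit insertion, deletion and substitution costs; it is not Hoang's implementation, and the cost scheme and function name are assumptions.

```python
def edit_distance_wavefront(source, target):
    """Edit distance computed one anti-diagonal at a time.

    All cells (i, j) with i + j == k are mutually independent, so in a
    systolic array each could be handled by its own processing element
    during step k. Here they are simply visited in a loop.
    """
    m, n = len(source), len(target)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete all of source[:i]
    for j in range(n + 1):
        d[0][j] = j                      # insert all of target[:j]
    for k in range(2, m + n + 1):        # one wavefront step per diagonal
        for i in range(max(1, k - n), min(m, k - 1) + 1):
            j = k - i
            cost = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match or substitution
    return d[m][n]


if __name__ == "__main__":
    print(edit_distance_wavefront("kitten", "sitting"))
```

The outer loop runs m + n - 1 times, which mirrors the statement above that the number of steps grows with the size of the array rather than with the product of the sequence lengths.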

A common application domain for reconfigurable computing is in real-time data acquisition and signal processing. The BEE2 system [27], described in the next section, was applied to the radio astronomy signal processing domain, which included development of a billion-channel spectrometer, a 1024-channel polyphase filter bank, and a two-input, 1024-channel correlator. The FPGA-based system used a 130-nm technology FPGA and performance was compared with 130- and 90-nm DSP chips as well as a 90-nm microprocessor. Performance in terms of computational throughput per chip was found to be a factor of 10 to 34 over the DSP chip in 130-nm technology and 4 to 13 times better than the microprocessor. In terms of power efficiency, the FPGA was one order of magnitude better than the DSP and two orders of magnitude better than the microprocessor. Compute throughput per unit chip cost was 20–307% better than the 90-nm DSP and 50–500% better than the microprocessor.

One final emerging application domain is machine learning. Reconfigurable implementations show great promise for addressing its heavy computational demands, and reconfigurable computing is particularly strong in embedded and low-precision scenarios. Tridgell et al. demonstrated regression, classification and novelty detection using online kernel methods. Their fully pipelined implementation could process continuous data at rates higher than 1 Gbps and perform simultaneous learning and prediction with a latency of 100 ns [46]. Zhang et al. applied a roofline model to balance resource utilization and memory bandwidth in the acceleration of a deep convolutional neural network (CNN). They achieved 62 GFLOPS on a single Xilinx Virtex VC707 board, this being a 4.8X speedup over a 16-thread implementation on an Intel Xeon E5-2430 processor [47].
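The roofline model mentioned above bounds attainable throughput by the lesser of the compute peak and the product of memory bandwidth and operational intensity (operations per byte moved). A minimal sketch follows; the function name and the numbers in the example are hypothetical and are not taken from Zhang et al.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_per_s, ops_per_byte):
    """Roofline bound: limited by compute or by memory traffic, whichever is lower."""
    return min(peak_gflops, bandwidth_gb_per_s * ops_per_byte)


if __name__ == "__main__":
    # Hypothetical design points on a platform with a 100 GFLOPS peak
    # and 10 GB/s of off-chip bandwidth.
    for intensity in (2, 10, 50):
        print(intensity, attainable_gflops(100, 10, intensity))
```

In this setting a design with low operational intensity is memory bound no matter how many arithmetic units it instantiates, which is why balancing logic usage against bandwidth matters when mapping a CNN to an FPGA.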


4 System Architectures

Reconfigurable computing machines are constructed by utilizing one or more FPGAs. Most systems include other elements, such as microprocessors and storage, and can be treated as processing elements and memory that are interconnected. Obviously, the arrangement of these elements affects the system performance and routability, and some examples are given in this section.

The Avnet Zedboard is a development board which integrates a single Xilinx Zynq XC7Z020 FPGA (which contains FPGA logic and a dual-core ARM Cortex-A9 processor), DDR memory, SD card, Ethernet, USB and video interfaces. This single board computer can run the Linux operating system, and it provides a low-cost entry point for teaching and research in reconfigurable computing [48].

The simplest topology for connecting multiple FPGAs involves a ring, mesh, or other fixed pattern. FPGAs serve as both logic and interconnect, providing direct communication between adjacent devices. Such an architecture is predicated on locality in the circuit design and further assumes that the circuit design maps well to the planar mesh. This architecture fits well for applications with regular local communications [49]. However, in general, high performance is hard to obtain for arbitrary communication patterns because the architecture only provides direct communications between neighboring FPGAs and two distant FPGAs may need many other devices as “hops” to communicate, resulting in long and widely variable delays. Furthermore, FPGAs, when used as interconnects, often result in poor timing characteristics.
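The hop penalty of a fixed mesh can be quantified with a small model. The sketch assumes a 2-D mesh without wraparound links and shortest-path routing, so the hop count is the Manhattan distance between grid coordinates; both function names are hypothetical.

```python
def mesh_hops(src, dst):
    """Number of FPGA-to-FPGA links on a shortest path in a 2-D mesh."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def intermediate_devices(src, dst):
    """FPGAs that must be passed through, excluding source and destination."""
    return max(0, mesh_hops(src, dst) - 1)


if __name__ == "__main__":
    print(mesh_hops((0, 0), (0, 1)), intermediate_devices((0, 0), (0, 1)))
    print(mesh_hops((0, 0), (3, 3)), intermediate_devices((0, 0), (3, 3)))
```

In a 4 × 4 mesh, neighbours communicate directly while opposite corners need six links and five pass-through devices, which illustrates the wide spread of delays described above.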

Figure 4 depicts the SPLASH 2 architecture [50] published in 1990. Each board contains 16 FPGAs, X1 through X16. The blocks M1 through M16 are local memories of the FPGAs. A simplified 36-bit bus crossbar, with no permutation of the bit-lines within each bus, interconnects the 16 FPGAs. Another 36-bit bus connects the FPGAs in a linear systolic fashion. The local memories are dual ported with one port connecting to the FPGAs and the other port connecting to the external bus. It is interesting to note that the crossbar was added to the SPLASH 2 machine, the original SPLASH 1 machine only having the linear connections. SPLASH 2 has been successfully used for custom computing applications such as search in genetic databases and string matching [26].


Figure 4. SPLASH 2 architecture. Each board contains 16 FPGAs, X1 through X16. The blocks M1 through M16 are local memories of the FPGAs. A simplified 36-bit bus crossbar, with no permutation of the bit-lines within each bus, interconnects the 16 FPGAs. Another 36-bit bus connects the FPGAs in daisy-chain fashion. The local memories are dual ported with one port connecting to the FPGAs and the other port connecting to the external bus.

Other designs have used a hierarchy of interconnect schemes, differing in performance. The use of multi-gigabit transceivers (MGTs) available on contemporary FPGAs allows high bandwidth interconnection using commodity components. An example is the Berkeley Emulation Engine 2 (BEE2) [27], designed for reconfigurable computing and illustrated in Fig. 5. Each compute module consists of five FPGAs (Xilinx XC2VP70) connected to four double data rate 2 (DDR2) dual inline memory modules (DIMMs) with a maximum capacity of 4 GB per FPGA. Four FPGAs are used for computation and one for control. Each FPGA has two PowerPC 405 processor cores. A local mesh connects the computation FPGAs in a 2-D grid using low-voltage CMOS (LVCMOS) parallel signaling. Off-module communications are via 18 (two from the control FPGA and four from each of the compute FPGAs) Infiniband 4X channel-bonded 2.5-Gbps connectors that operate full-duplex, which corresponds to a 180-Gbps off-module full-duplex communication bandwidth. Modules can be interconnected in different topologies including tree, 3-D mesh, or crossbar. The use of standard interfaces allows standard network switches such as Infiniband and 10-Gigabit Ethernet to be used. Finally, a 100base-T Ethernet connection to the control FPGA is present for out-of-band communications, monitoring, and control.
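The 180-Gbps figure can be checked from the numbers given in the paragraph, reading "4X" as four bonded 2.5-Gbps lanes per connector. The short sketch below just reproduces that arithmetic; the function and parameter names are hypothetical.

```python
def off_module_bandwidth_gbps(control_links=2, links_per_compute_fpga=4,
                              compute_fpgas=4, lanes_per_link=4,
                              lane_rate_gbps=2.5):
    """Return (connector count, aggregate raw bandwidth in Gbps per direction)."""
    connectors = control_links + links_per_compute_fpga * compute_fpgas
    return connectors, connectors * lanes_per_link * lane_rate_gbps


if __name__ == "__main__":
    print(off_module_bandwidth_gbps())
```

This gives 18 connectors and 18 × 4 × 2.5 = 180 Gbps, in agreement with the text; it is a raw signalling rate, so usable payload bandwidth would be lower once line coding and protocol overheads are accounted for.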


Figure 5. BEE2 Compute Module block diagram. Compute modules can be interconnected via the Infiniband IB4X connectors, either directly or via a 10-Gigabit Ethernet switch. The 100-BaseT Ethernet can be used for control, monitoring, or data archiving.

Commercial machines, such as the Maxeler MPC-X2000 system [51], have a similar interconnect structure to the BEE2 in that they are parallel machines employing high performance microprocessors tightly coupled to a relatively small number of FPGA devices per node. The MPC-X2000 is a 1U server with eight large FPGAs, called dataflow engines (DFEs), interconnected in a ring arrangement. A total of 384 GB of dynamic RAM is supported and multiple host processors can communicate with each DFE via a high-speed Infiniband switched interconnect network. Such machines can have orders of magnitude performance improvement over conventional architectures and switching topologies can be altered via configuration of the switching fabric.

Microsoft took a different approach in their Catapult machine, choosing a single daughter card per server over multi-FPGA boards for the reasons of scalability, capacity, power, space and reliability [52]. Each FPGA card operates under 20 W, is hosted by a server via PCI Express and contains 8 GB of dynamic RAM. The FPGA boards are organized in a 24U arrangement of 48 half-width 1U servers, directly connected together with SAS cables. A test system containing 1,632 servers was shown to reduce the tail latency of the Microsoft Bing search engine by 29% and improve ranking throughput of each server by 95%.

The Intel-Altera Heterogeneous Architecture Research Platform (HARP) utilizes Intel Quickpath Interconnect (QPI) in a dual socket motherboard with the processor and FPGA each occupying a socket [53]. This offers higher bandwidth and lower latency than conventional daughter cards. A coherent shared memory between the processor and FPGA gives the promise of a greatly simplified programming model and tighter processor-FPGA coupling, which will benefit irregular data access patterns.


5 Runtime Reconfiguration

A reconfigurable computing system can have its functionality updated during execution, resulting in reduced resource requirements. A runtime reconfigurable system partitions a design temporally so that the entire design does not need to be resident in the FPGA at any given moment [54, 55]. Instead, the FPGA fabric is time-shared between specialized hardware accelerators at runtime. Using this technique, designs larger than the available hardware resources can be realized, or alternatively, an existing design may be implemented on a smaller or cheaper device. Furthermore, energy efficiency can be increased because the entire fabric can be used more effectively.

Single context architectures, multiple context architectures and partially reconfigurable FPGAs have been developed. In a single context system, any changes to the functionality of the FPGA involve reloading the entire bitstream; early FPGAs were of this type. This scheme has the disadvantage of long reconfiguration time. Multiple context or time-sharing architectures lie at the other extreme. These allow a number of complete configurations to be stored in the fabric simultaneously and thus reconfiguration can be achieved in a small number of cycles. These architectures were also proposed for early FPGAs. As an example, an architecture named Dharma was proposed that contains a functional block and an interconnect network [56]. By breaking a large design into levels in a folded pipeline, the logic modules and interconnect can be time-shared by dynamically reconfiguring each level. This topology simplifies the architecture and provides predictable interconnect delay (Fig. 6). Multiple context architectures, such as NEC's Dynamically Reconfigurable Processor (DRP) [57], were later developed. Such architectures have the shortest context switch time; however, a larger area overhead is associated with implementation of this scheme.


Figure 6. Dynamic Architecture for FPGA-based systems. The architecture contains a functional block and an interconnect network. The interconnect and the logic can be time shared. The emulated design topology is levelized in a folded pipeline manner. The levelized topology simplifies the architecture with predictable interconnect delay.

Partially reconfigurable FPGAs, as supported by the major FPGA vendors in Xilinx Virtex [15] and Altera Stratix [16] architectures, have begun to dominate the market. These architectures allow portions of the FPGA to be changed via a memory mapped scheme whilst the other portions of the FPGA continue functioning. In comparison to a single context scheme, there is some area overhead associated with providing this feature; ideally this is compensated for by more efficient use of the fabric.

Many tools have been developed to help support runtime reconfiguration. Commercial tools, provided by the main FPGA vendors Xilinx and Altera, aim to abstract the low level implementation details from the engineer. However, other open source tools have been developed to enable more flexible systems. For example, ReCoBus-Builder introduced a simple interface for communication between the static part of a system and the dynamic modules, as well as the ability to place and route partial modules separately, before linking these compiled bitstreams at run-time [58]. This makes modules interchangeable and speeds up the compilation process. The GoAhead tool takes this further, allowing the FPGA fabric to be separated into different regions, with individual modules compiled to fit into one or more of these regions [59]. It also provides support for modules to communicate between or across regions. This improves flexibility in placement of modules and promotes sharing across regions. Tools to help determine the optimum number of regions have also been developed [60].


The aforementioned tools are also able to support hierarchical designs, where a partial region does not need to be fully reconfigured; instead a smaller region within this area can be reconfigured. This has multiple advantages. Firstly, storing a few different modules at each hierarchical level provides a huge amount of flexibility, saving significant configuration memory in comparison to storing all different modules at the highest level. Furthermore, since only a small region needs to be reconfigured, the reconfiguration time is reduced [61].

Tools also exist to help overlap reconfiguration and computation to maximize the performance of the device. For example, ZyCAP, which is based on the Xilinx Zynq architecture with an embedded ARM CPU, provides software drivers to help reconfiguration be overlapped with computation by controlling all the reconfiguration processes [62]. It can alert the software that configuration is complete, and also manages how partial bitstreams are stored in memory. This is important to help maximize performance; for example, this tool offers the ability to cache partial bitstreams in DRAM to speed up the reconfiguration process. Finally, there are also efforts to verify that the partial bitstreams perform the desired functionality [63].

There are many examples of run-time reconfiguration, with the logical unit of reconfiguration ranging from application-level down to a sub-instruction. These are discussed below:

At the application level, examples include adapting the bitstream according to changes in environmental conditions. For example, Claus et al. discussed how hardware accelerators may be needed for real-time video processing, but in the context of driver assistance, adapting them according to changing light conditions could improve performance. They demonstrated that this can prove worthwhile since modules can be quickly reconfigured between frames [64].

Task level reconfiguration is common for software defined radio, for example when switching between encoding schemes. The trade-offs between full or partial reconfiguration in this problem domain are discussed by Delahaye et al. [65]. Similarly, Feillen et al. discussed how different stages of digital video decoding do not need to operate concurrently, meaning the same hardware could be re-used in this example [66]. Task level reconfiguration for an operating system has also been proposed [67]. Under control of software running on a microprocessor, task circuits can be scheduled online and placed in a suitable free space in a hardware task area. Communications between tasks and I/O are done through a task communication bus, and termination of a task frees the reconfigurable resources used. It was shown that hardware in the hardware task area can be shared by tasks and that the overheads associated with its implementation on a partially configurable platform were acceptably low. This helps improve scheduling of real-time tasks.

Instruction level reconfiguration has been demonstrated for hardware accelerated database queries. Different hardware modules for SQL queries could be dynamically configured to improve performance [68] and energy efficiency [69]. A CPU system with custom instructions is another common candidate for instruction level reconfiguration. An early example is the Dynamic Instruction Set Computer (DISC) [70], which supported demand-driven modification of the instruction set through partial reconfiguration. The commercial Stretch processor [71] combines reconfigurable fabric with a processor to support the execution of custom instructions implemented on a reconfigurable fabric. Furthermore, the fabric can be reconfigured at runtime and the design environment is software-centric, with programming of the processor being in Stretch C.

Finally, partial reconfiguration has also been shown for sub-instructions. For example, a pipeline stage could be a convenient unit of reconfiguration, as demonstrated by incremental pipeline reconfiguration [72]. Assume an FPGA that has enough silicon area for N physical pipeline stages, but the design contains M pipeline stages (where M >> N). Through adding one pipeline stage and removing the trailing pipeline stage in each step of the computation, reconfiguration and computation can be overlapped. Such a circuit will implement a pipeline of depth N and fully utilize the FPGA at any given point in time. Runtime reconfiguration can be done at even lower levels. Examples include those supporting hierarchical designs for a CPU with greater numbers of custom instructions [61] and a crossbar switch which employs runtime reconfiguration of the FPGA's routing resources [73]. By partially reconfiguring routing multiplexers, this scheme was able to achieve density, switch update latency and performance better than is possible using conventional means.
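The sliding window of resident stages described above can be sketched in a few lines of Python. This is only an illustration: the function name, the zero-based stage numbering and the one-stage-per-step schedule are assumptions made here, not details taken from [72].

```python
from collections import deque


def resident_windows(num_virtual, num_physical):
    """Yield the virtual stages held on the fabric after each reconfiguration step."""
    window = deque()
    for stage in range(num_virtual):
        if len(window) == num_physical:
            window.popleft()      # remove the trailing (oldest) stage
        window.append(stage)      # configure the next stage while the others compute
        yield tuple(window)


if __name__ == "__main__":
    # M = 5 virtual stages mapped onto N = 3 physical stages (hypothetical sizes).
    for step, stages in enumerate(resident_windows(5, 3)):
        print(step, stages)
```

Once the window has filled, every step keeps exactly N stages resident, which is the sense in which the fabric stays fully utilized.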

6 Design methods

Hardware description languages (HDLs) such as the Very High Speed Integrated Circuit Hardware Description Language (VHDL) and Verilog are commonly used to specify the logic of a reconfigurable system. Descriptions in these languages have the advantage of being vendor neutral, so the same description can be synthesized for different targets such as different FPGA devices, different FPGA vendors, and ASICs. For this reason, these languages are often the target language for higher level tools that offer higher levels of abstraction.

Module generators and libraries are commonly deployed to promote reuse. For example, vendors such as Altera and Xilinx have parameterized libraries of components that can be used in a design. These libraries are generated so that a circuit optimized for the particular application can be produced. As an example, a parameterized floating point library might allow the wordlength of the exponent and significand to be specified as well as whether denormalized numbers are supported. The module generator then generates a netlist or VHDL-based floating point adder that can be included in a design. Open source libraries such as FloPoCo also provide vendor neutral alternatives for generating many key components [74].

To help make FPGAs more mainstream, considerable effort has been put into high-level synthesis, which is the process of compiling a traditional high level language down to a netlist or HDL. The use of traditional programming languages improves productivity as low level details are handled by the compiler. This is analogous to C versus assembly language for software development. Another difference with potentially large implications is that, using these tools, software developers can also design reconfigurable computing applications.

As an early example, Luk and Page described a simple compilation process [75,76] from a high level language with explicit parallel extensions to a register transfer language (RTL) description. Parallel execution of statements is implemented via parallel processes, and these can communicate via channels through which a single-word message can be passed. Variables in the user program are mapped to registers, all expressions are implemented as combinational logic, and multiplexers are used in the case a register has multiple sources. A datapath that matches the dataflow graph of the input source description is generated using this strategy. The clocking scheme employed is a global, synchronous one, and a convention that each assignment takes exactly one clock cycle is followed. A start signal is used to feed the clock and to enable each register that corresponds to a variable, and a finish signal is generated for the assignment in the following clock cycle. To execute statements sequentially, the start and finish signals of adjacent statements are simply connected together, creating a one-hot distributed control scheme. Conditional statements and loops are formed by asserting one of several possible start signals that correspond to alternative basic blocks in a program. Completion of conditional or loop constructs and synchronization of parallel blocks are implemented by combining relevant finish signals using the appropriate combinatorial logic. An example showing the translation of a simple code fragment to control and datapath is shown in Fig. 7.

Figure 7. Hardware compilation example. The C program is translated into a datapath (top) and control (bottom). Execution of statements in the while loop is controlled by s1 and s2; s0 and s3 correspond to the start signals of the statements before and after the while loop.
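To make the one-hot control convention concrete, the following Python sketch simulates the start signals s0 to s3 cycle by cycle. The source program of Fig. 7 is not reproduced in the text, so the fragment used here (shown in the docstring), the register names and the function name are all hypothetical; only the conventions of one assignment per clock cycle and one active start signal at a time are taken from the description above.

```python
def run_one_hot(limit):
    """Simulate one-hot control for a hypothetical fragment:

        s0: x = 0
        while (x < limit) { s1: x = x + 1;  s2: y = y + x; }
        s3: done = 1
    """
    regs = {"x": 0, "y": 0, "done": 0}
    start = [True, False, False, False]    # start signals s0..s3, exactly one active
    cycles = 0
    while any(start):
        nxt = dict(regs)                   # registers load on the clock edge
        if start[0]:
            nxt["x"] = 0
        if start[1]:
            nxt["x"] = regs["x"] + 1
        if start[2]:
            nxt["y"] = regs["y"] + regs["x"]
        if start[3]:
            nxt["done"] = 1
        regs = nxt
        cycles += 1                        # every assignment takes one cycle
        # The finish of s0 or s2 triggers the loop test, which selects s1 or s3.
        loop_test = start[0] or start[2]
        cond = regs["x"] < limit
        start = [False, loop_test and cond, start[1], loop_test and not cond]
    return regs, cycles


if __name__ == "__main__":
    print(run_one_hot(3))
```

The list `start` plays the role of the distributed control flip-flops: the finish of one statement is simply the start of the next, and the loop test steers the token to either s1 or s3.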

High-level synthesis tools have since moved beyond simply translating a high-level language to a hardware design; instead they focus on creating an optimized hardware design. Straightforward examples may include extracting parallelism through loop unrolling or creating deeply pipelined designs to maximize clock frequency. However, finer-grained optimizations are also possible. For example, since moving data onto the FPGA can be expensive, storing data locally on the chip and re-using the data can have substantial performance implications [77]. While the idea is similar to that of caching on a microprocessor, on an FPGA the local buffers can be sized according to the needs of a particular algorithm, saving resources. Alternatively, the memory can be arranged in a fashion that provides a large memory bandwidth, which may be necessary to feed a parallel datapath. Moreover, the interface with the DRAM can also be controlled to ensure that memory reads occur faster, for example by ‘activating’ rows that are soon to be read [78]. Another optimization is to fine tune the precision used throughout computations, i.e. to use as little precision as is necessary to meet the design specification. Unlike a CPU implementation, an FPGA design has the freedom to implement any precision. Since arithmetic operators with less precision use less silicon area, using the minimum precision necessary frees resources, allowing for greater parallel performance. Tools that support this design methodology, in both fixed and floating point arithmetic, are available [79,80].
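As a rough illustration of precision tuning, the sketch below searches for the smallest number of fractional bits whose rounding error stays within a tolerance for a set of sample values. It is a brute-force stand-in written for this article, not the analysis performed by the tools in [79,80]; the function names, the sample values and the tolerance are assumptions.

```python
def quantize(value, frac_bits):
    """Round value to a fixed-point grid with the given number of fractional bits."""
    scale = 1 << frac_bits
    return round(value * scale) / scale


def min_frac_bits(values, tolerance, max_bits=32):
    """Return the fewest fractional bits keeping every rounding error within tolerance."""
    for bits in range(max_bits + 1):
        worst = max(abs(v - quantize(v, bits)) for v in values)
        if worst <= tolerance:
            return bits
    raise ValueError("tolerance cannot be met within max_bits")


if __name__ == "__main__":
    coefficients = [0.1, 0.25, 0.7]        # hypothetical filter coefficients
    print(min_frac_bits(coefficients, tolerance=0.01))
```

A real flow would also track how rounding errors propagate through the arithmetic, which this per-value check deliberately ignores.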

Various other issues with the mapping of algorithms to hardware are more generally discussed by Isshiki and Dai [81], who focus on the differences between implementing bit-serial versus bit-parallel modules (e.g., adders and multipliers) on FPGA architectures. Although latency is larger for bit-serial modules, the reduction in area frequently makes area-time products significantly lower for such implementations. More specifically, such advantages as the following can be obtained: 1) For bit-parallel modules, the I/O pin limitation is a major problem, and the large size of the module cluster can result in unused space and underutilized logic resources; 2) bit-serial modules are easier to partition as cell-to-cell connections are sparse and do not cause I/O problems; and 3) high fanout nets can impair routability of bit-parallel modules. Leong and Leong [82] generalized further with a design methodology that can translate a dataflow description with signals of different wordlengths to a digit serial design.
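The area versus latency trade-off can be seen in a small behavioural model of a bit-serial adder, sketched here in Python. One full adder and one carry flip-flop are reused for every bit, so the hardware cost is roughly constant while the latency grows with the word length. The function name and interface are illustrative assumptions, not taken from [81].

```python
def bit_serial_add(a, b, width):
    """Add two unsigned integers one bit per clock cycle, least significant bit first."""
    carry = 0                      # the single carry flip-flop
    result = 0
    for cycle in range(width):     # one loop iteration models one clock cycle
        a_bit = (a >> cycle) & 1
        b_bit = (b >> cycle) & 1
        sum_bit = a_bit ^ b_bit ^ carry
        carry = (a_bit & b_bit) | (carry & (a_bit ^ b_bit))
        result |= sum_bit << cycle
    return result, carry           # final carry is the overflow bit


if __name__ == "__main__":
    print(bit_serial_add(5, 3, 4))
```

A bit-parallel adder would compute the same result in one cycle but needs a full adder per bit and a wire per bit on every port, which is where the I/O and routing pressure mentioned above comes from.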

Commercial tools that can compile standard programming languages such as Java, C, or C++ (e.g., [76]) are available. Examples include Xilinx’s Vivado HLS [15], Maxeler’s MaxCompiler [83] and Catapult C from Mentor Graphics [84]. Domain-specific languages such as MATLAB/Simulink offer even greater improvements in productivity because they are interactive, include a large library of primitive routines and toolkits, and have good graphing capabilities. Indeed, many designs for communications and signal processing are first prototyped in MATLAB and then converted to other languages for implementation. Tools such as the MATCH compiler [85], Xilinx System Generator [15], Altera DSP Builder [16] and MathWorks’ HDL Coder [86] can translate a subset of MATLAB/Simulink directly to an FPGA design. There is also interest in supporting more parallel C-to-gates flows. Support for more recent parallel programming languages is gaining traction, for example the Altera SDK for OpenCL [16] and efforts to support NVIDIA’s CUDA using FPGAs [87].

Due to the difficulty in creating a full-custom design, there is also support for creating hardware/software co-designs. The availability of embedded operating systems such as Linux for microprocessors on an FPGA provides a familiar software development environment for programmers, greatly facilitating program development through the availability of a large range of open-source libraries as well as high quality development tools. Such tools can greatly speed up the development time and improve the quality of embedded systems. For example, Altera's Nios II C-to-Hardware acceleration compiler enables time-critical functions in a C program to be converted to a hardware accelerator that is tightly coupled to a microprocessor within the FPGA [88]. These tools support soft processors, such as the Altera Nios or Xilinx MicroBlaze, and embedded processors, such as those on the Xilinx Zynq or Altera SoCs. With the latter, for optimal performance, the parts of an algorithm that are easily parallelizable should make use of the parallel FPGA fabric, whereas serial parts of the algorithm should be run on a processor [89].

A final design approach to allow for fast FPGA prototyping is the use of overlay architectures. These are coarse-grained architectures with software-like programmability, with the aim of sacrificing some performance in exchange for ease of implementation. For example, VectorBlox extends the hardware-software paradigm by using the FPGA fabric to provide parallel vector instructions that can be easily executed [90]. A typical design flow using this technology would be to create an initial software design, add vector instructions within a software-style development environment to obtain some acceleration and finally create custom hardware instructions for the most time consuming parts of an algorithm. This may provide a faster time to market. Many overlay architectures have been created, including some for specific applications, such as for efficient network on chip (NOC) interconnections of processors [91] or dataflow graphs [92], and some designed to make use of specific hardened components on FPGAs such as DSPs [93].

7 Multichip Systems

Special care must be taken in the design of large and multichip reconfigurable systems. In this section, we describe some theoretic results relevant to the major architectural issues associated with such designs.

7.1 Interconnect Organization

A classic Clos network [94] contains three stages: inputs, intermediate switches, and outputs, as shown in Fig. 8. It can be used to interconnect pins in a reconfigurable computing system, and its input and output stages are symmetric. Suppose the first stage has r crossbar switches of size n × m, the second stage has m switches of size r × r, and the third stage has r switches of size m × n; let us denote the network as c(n, m, r). For any two-pin net interconnect requirement, the network c(n, m, r) can achieve complete routability if m is not less than n. The routing method can be described by recursive operations [95]. In the first iteration, we reduce the network to c(n-1, m-1, r). In the ith iteration, we reduce the network to c(n-i, m-i, r). When n-i = 1, we have r switches of size 1 × (m-n+1) in the first stage, m-n+1 switches of size r × r in the second stage, and r switches of size (m-n+1) × 1 in the third stage. In other words, only one input exists in each first-stage switch and one output in each third-stage switch. In this case, one second-stage r × r switch is enough to route the r inputs of r first-stage switches to the r outputs of r third-stage switches, thus completing the interconnect.

Figure 8. Clos network. A Clos network contains three stages: inputs, intermediate switches, and outputs. The input and output stages are symmetric. In the figure, the first stage has r switches of size n × m, the second stage has m switches of size r × r, and the third stage has r switches of size m × n.

The reduction from c(n-i, m-i, r) to c(n-i-1, m-i-1, r) can be derived by a maximum matching algorithm. The matching algorithm selects disjoint signals from different input switches to different output switches. One second-stage switch is then used to route the selected signals. From Hall's theorem, the maximum matching and routing can always reduce the network to c(n-i-1, m-i-1, r).

Conceptually, the routing problem can also be formulated as edge coloring on a bipartite graph G(V1, V2, E) [96]. The node sets V1 and V2 represent the switches in the input and output stages, respectively. An edge in E represents a two-pin net interconnect requirement between the corresponding input and output switches. In Reference [96], Chan and Schlag assigned colors to the edges of the bipartite graph. Edges of the same color are bundled into one group and the corresponding set of nets is routed by one switch in the second stage. The work of Reference [97] was then used to find a minimum edge coloring solution in O(|E| log n) time.
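The edge-coloring view of Clos routing can be sketched as follows. The code uses the textbook alternating-path method for bipartite edge coloring, which is simple but slower than the O(|E| log n) algorithm of [97]; it is not the implementation of [96]. The function name and the representation of nets as (input switch, output switch) pairs are assumptions made for illustration.

```python
def route_clos(nets, num_middle):
    """Assign every two-pin net (input_switch, output_switch) a second-stage switch.

    Returns a list whose k-th entry is the middle switch (colour) used by nets[k].
    """
    at_in, at_out = {}, {}   # switch -> {colour: index of the net using it}
    color = [None] * len(nets)

    def free_color(table, switch):
        used = table.setdefault(switch, {})
        for c in range(num_middle):
            if c not in used:
                return c
        raise ValueError("a switch carries more nets than there are middle switches")

    for k, (u, v) in enumerate(nets):
        a = free_color(at_in, u)    # colour unused at input switch u
        b = free_color(at_out, v)   # colour unused at output switch v

        # Collect the path that starts at v and alternates between colours a and b.
        path, table, switch, want = [], at_out, v, a
        while want in table.get(switch, {}):
            e = table[switch][want]
            path.append(e)
            if table is at_out:
                table, switch = at_in, nets[e][0]
            else:
                table, switch = at_out, nets[e][1]
            want = b if want == a else a

        # Swap a and b along the path so that a becomes free at v as well.
        for e in path:
            del at_in[nets[e][0]][color[e]]
            del at_out[nets[e][1]][color[e]]
        for e in path:
            color[e] = b if color[e] == a else a
            at_in[nets[e][0]][color[e]] = e
            at_out[nets[e][1]][color[e]] = e

        color[k] = a
        at_in[u][a] = k
        at_out[v][a] = k
    return color


if __name__ == "__main__":
    # Hypothetical demand on a c(2, 2, 2) network: at most n = 2 nets per switch.
    print(route_clos([(0, 0), (1, 0), (1, 1)], num_middle=2))
```

Nets that share an input or output switch always receive different colours, so each second-stage switch only ever has to realize a partial permutation, which a crossbar can do.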

The three-stage Clos network can be folded into a two-stage network (Fig. 9) so that the inputs and outputs are mixed in the first stage. Thus, the corresponding bipartite graph G(V1, V2, E) constructed above for edge coloring is also folded with V1 and V2 merged into one set.

Figure 9. Folded Clos network. The three-stage Clos network is folded into a two-stage network so that the inputs and outputs are mixed in the first stage.

To find the routing assignment, the folded edge coloring graph can be unfolded back to a bipartite graph using an Euler path search. The Euler path traverses every edge exactly once and defines the edge direction according to the direction of the traversal. We then recover the original bipartite graph by splitting the node set back into two sets V1 and V2 and unfolding the edges such that all edges are directed from V1 to V2. We can find the minimum edge coloring solution of the unfolded bipartite graph and apply the solution back to the folded routing problem.

In practice, the first-level crossbar of the Clos network is replaced with FPGAs to save board space (Fig. 10). Routability is then worse than in an ideal Clos network. Even with a true Clos network, complete routability of multipin nets is not guaranteed, which is an important practical consideration because in microelectronic design, many multipin nets typically exist.

Figure 10. Variations of the Clos network. The first level crossbar of the Clos network is replaced with FPGAs to save board space. Routability is worse than an ideal Clos network.

In an attempt to solve the multipin net and routability problem, we can introduce extra connections among FPIDs as shown in Fig. 11. However, extra FPID interconnections also incur extra delay. We can also expand the fanout width of FPGAs so that each FPGA I/O pin is connected to more than one FPIC [98,99]. The fanout width expansion improves routability without significant additional delay. The multiple appearances of I/O pins increase the probability that a signal connection can be made in a single stage, which is especially critical for multipin nets. However, the additional fanouts increase the needed pin count of FPICs. Thus, we need to find a balanced fanout distribution that reduces the interconnect delay with a minimal pin requirement.

Figure 11. Variations of Clos network. The fanout width of FPGAs is expanded so that each FPGA I/O pin is connected to more than one FPIC. The fanout width expansion improves routability without significant additional delay.

A tree-structured network can simplify the mapping process for certain applications. In Reference [100], an example of a tree-structured network is illustrated for a Very Large Scale Simulator (VLSS). The VLSS tree structure has all logic components located at the leaves and interconnect switches at the internal nodes. The machine covers a capacity of eight million gates. Each branch is an 8-bit bus. The higher up the level of the tree, the less parallelism the signal distribution can achieve. Therefore, a partitioning process is designed to minimize the high level interconnect and maximize the parallel operation.

7.2 Interconnect Multiplexing

Time multiplexing is an effective method for tackling the scalability problem in interconnecting large designs. The time-sharing method can be extended from traditional bus organization [42,100] to network sharing [101] and further to function block sharing [56].

Interconnect can be time shared as a bus [42,100]. If n communication lines exist between two FPGAs, they can be reduced to a single line by storing logical outputs in shift registers and time-multiplexing the communication in phases. Such a scheme was employed in the virtual wires logic emulation system [42], which is efficient because interconnects are normally capable of being clocked at much higher rates than the critical path of the rest of the system, and not all logical wires are simultaneously active. This scheme can greatly reduce the complexity of the interconnecting network or printed circuit board in a multi-FPGA system.
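A minimal behavioural sketch of this idea in Python: n logical outputs are loaded into a shift register and sent over one physical wire, one bit per fast interconnect phase, then reassembled at the receiver. The function names are hypothetical and the model ignores the scheduling and pipelining of the real virtual wires system [42].

```python
def send_over_one_wire(logical_outputs):
    """Serialize n logical signals onto a single physical wire, one per phase."""
    shift_register = list(logical_outputs)       # parallel load at the sending FPGA
    trace = []
    while shift_register:
        trace.append(shift_register.pop(0))      # shift out one bit per fast clock phase
    return trace


def receive_from_wire(trace, n):
    """Rebuild the n logical signals at the receiving FPGA."""
    shift_register = []
    for bit in trace[:n]:
        shift_register.append(bit)               # shift in one bit per phase
    return shift_register


if __name__ == "__main__":
    signals = [1, 0, 1, 1]                       # four logical wires, one physical pin
    trace = send_over_one_wire(signals)
    print(len(trace), receive_from_wire(trace, len(signals)))
```

The cost is n interconnect phases per emulated clock cycle, which is acceptable when the physical wire can be clocked much faster than the emulated design.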

Li and Cheng [101] proposed that a dynamic network be viewed as L conventional FPICs overlapped together and sharing the same I/O pins. A dynamic routing architecture can increase the routability and shorten interconnect length. Each switching network is a full crossbar, which can be reconfigured to provide any connections among I/O pins. The select lines are used to activate only one switching network at a time; thus the I/O pins are dynamically connected according to the configuration of this active switching network. By dynamically reconfiguring the FPICs, L logic signals can time-share the same interconnect resources.

7.3 Memory Allocation

Interconnect schemes should also consider how memory is connected to the FPGAs. Although combining memory with logic in the same FPGA is the most desirable method for reducing routing congestion and signal delay, separate components can supply much larger capacity at higher density and lower price. Figure 12 demonstrates three different ways of allocating the memories in a Clos network [96,102]. The memory may be attached directly to a local FPGA (Fig. 12a), attached to the second-stage switches of the Clos network via a host interface (Fig. 12b), or attached to the first-stage switches of the Clos network (Fig. 12c). The first method provides good performance for local memory access. However, for the case of nonlocal memory access, the routability and delay are concerns. The second method is slower than the first method for local memory accesses but provides better routability. The third is the most flexible as the memory is attached to the network and the routability is high. However, every logic-to-memory communication must go through the second interconnect stage.

Figure 12. Memory organization. (a) Memory is attached directly to a local FPGA. (b) Memory is attached to the second-stage switches of the Clos network via a host interface. (c) Memory is attached to the first-stage switches of the Clos network.

7.4 Bus Buffer Insertion

In FPGAs, signal propagation is inherently slow because of the programmable interconnect. However, the delay of long routing wires can be drastically reduced by buffer insertion. The principle at work is that by inserting buffers we can decouple capacitive effects of components and interconnect driven by the buffers and thereby improve RC delay.

Given a routing topology for a net and timing requirements for its sinks, an efficient optimal buffer insertion algorithm was proposed in [103]. Experimental results show dramatic improvement versus the unbuffered solution. Thus, it is advantageous to have abundant buffers in FPGAs. However, each possible buffer and its programmable switch adds capacitance to the wires, which in turn will contribute to delay. Thus, a balance point needs to be identified to trade off between the additional delay and capacitance of the buffers versus the improvement they can provide.
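The trade-off can be illustrated with a first-order delay estimate. The Python sketch below assumes the Elmore RC model for a uniform wire with evenly spaced buffers; this model, the function names and all parameter values (arbitrary units) are assumptions chosen for illustration and are not taken from [103].

```python
def segment_delay(r_drive, length, c_load, r_wire=1.0, c_wire=1.0):
    """Elmore delay of a driver, a uniform wire of the given length, and a load."""
    wire_c = c_wire * length
    wire_r = r_wire * length
    return r_drive * (wire_c + c_load) + wire_r * (wire_c / 2 + c_load)


def buffered_delay(length, n_buffers, r_drive=1.0, c_load=1.0,
                   r_buf=1.0, c_buf=1.0, t_buf=1.0):
    """Total delay of a wire split into equal segments by n_buffers buffers."""
    segment = length / (n_buffers + 1)
    total = 0.0
    drive = r_drive
    for i in range(n_buffers + 1):
        last = i == n_buffers
        load = c_load if last else c_buf     # each buffer input adds capacitance
        total += segment_delay(drive, segment, load)
        if not last:
            total += t_buf                   # intrinsic buffer delay
            drive = r_buf
    return total


if __name__ == "__main__":
    for wire_length in (1, 10):
        print(wire_length, buffered_delay(wire_length, 0), buffered_delay(wire_length, 1))
```

Under these assumed values a buffer helps on the long wire, where the quadratic wire term dominates, and hurts on the short one, where its own delay and input capacitance outweigh the gain.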

For a multisourced bus, the problem of buffer insertion becomes more complicated, because the optimization for one source may sacrifice the delay of others. Furthermore, the direction of the buffer needs to be arbitrated by a controller. Instead of using such a controller, a novel approach is to use a patented open collector bus repeater [104]. When idle, the two ends of the repeater are set to high. When the repeater senses the pull-down action on one side, it presents the signal on the other side until the pull-down action is released from the originated signal. The bus repeater eliminates the need for a direction control signal, resulting in a simpler design and better use of resources.

7.5 System Decomposition

To decompose a system into multiple devices, Yeh et al. [105] proposed an algorithm based on the relationship between uniform multi-commodity flow and min-cut partitioning. Yeh et al. constructed a flow network wherein each net initially corresponded to an edge with flow cost one. Two random modules in the network were chosen and the shortest path (i.e., path with lowest cost) between them was computed. A constant Δ < 1 was added to the flow for each net in the shortest path, and the cost for every net in the path was incremented. Adjusting the cost penalizes paths through congested areas and forces alternative shortest paths. This random shortest path computation is repeated until every path between the chosen pair of modules passes through at least one “saturated” net. The set of saturated nets induces a multi-way partitioning in which two modules belong to the same cluster if and only if there is a path of unsaturated nets between them.

For each of these clusters, the flux (defined as the cut size between the cluster and its complement, divided by the size of the cluster) is computed and the clusters are sorted based on their flux value. Yeh et al. began with a single cluster equal to the entire netlist, and then peeled off the clusters with lowest flux. This approach was attractive because the saturated nets are good candidates to be cut in a partitioning solution. As peeled clusters can be very small, a second phase may be used to make the multi-way partitioning more balanced. This approach, with its subsequent speedup by Yeh [106], is well-suited for large-scale multi-way partitioning instances.
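The flux measure itself is easy to state in code. The Python sketch below computes it for clusters of a small netlist and sorts the clusters by flux; the netlist, the module names and the function name are hypothetical, and the flow-based saturation step that produces the clusters is not modelled.

```python
def flux(cluster, nets):
    """Cut size between a cluster and its complement, divided by the cluster size."""
    members = set(cluster)
    cut = 0
    for net in nets:
        pins = set(net)
        if pins & members and pins - members:    # net has pins inside and outside
            cut += 1
    return cut / len(members)


if __name__ == "__main__":
    # Hypothetical netlist: each net is the set of modules it connects.
    nets = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "d", "e")]
    clusters = [("a", "b"), ("c", "d"), ("e",)]
    for cluster in sorted(clusters, key=lambda c: flux(c, nets)):
        print(cluster, flux(cluster, nets))
```

Clusters with low flux are weakly connected to the rest of the design relative to their size, which is why they are the first to be peeled off.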

The system prototyping phase may also explore netlist transformations such as logic replication and retiming to minimize cut size (I/O usage) or system cycle time. Such transformations are needed as inter-device delays can be relatively large and because devices are often I/O-limited. In Reference [107], Liu et al. proposed a partitioning algorithm that permits logic replication to minimize both cut size and clock cycle of sequential circuits. Given a netlist G = (V, E), their approach chooses two modules as seeds s and t, then constructs a “replication graph” that is twice the size of the original circuit. This graph has the special property that a type of directed minimum cut yields the replication cut (i.e., a decomposition of V into S, T, and R = V - S - T, where s ∈ S, t ∈ T and R is the replicated logic) that is optimal. A directed version of the Fiduccia-Mattheyses algorithm is used to find a heuristic directed minimum cut in the replication graph. Cong et al. [108] present an efficient algorithm for the performance-driven multi-way circuit partitioning problem that considers the different local and global interconnect delay introduced by the partitioning.

Alpert and Kahng [109] survey the FPGA partitioning literature in the context of major graph partitioning paradigms. The current partitioning problems are (i) low usage rate of FPGA gate capacity because of I/O pin limits, (ii) low clock rate because of interconnect delay between multiple FPGAs and (iii) long CPU time for the mapping process.

7.6 System Planning and Design Changes

For a given system decomposition to be implemented on a multi-FPGA prototyping architecture, all connections within each device and between devices must be routable. Chan et al. [110] invoke much literature on routability prediction in gate arrays, as well as theoretical concepts, such as the Rent parameter, to obtain a fast routability estimate for arbitrary netlists and FPGA architectures. Their method ascribes one of three levels of routability (easily routable, marginally routable, or unroutable) to a netlist based on various parameters. Specifically, combining a wirelength estimator due to Feuer, the average number of pins-per-cell, and the estimated Rent parameter yields a relatively accurate routability predictor. The utility of these parameters is contrasted with that of other criteria such as El Gamal's channel width requirement [111] or the average pins-per-net ratio.

In addition to routability, connections must also meet system timing constraints. Selvidge et al. [112] extend the original virtual wires [42] concept in their TIERS (Topology-IndEpendent Routing and Scheduling) approach. The problem formulation assumes that an assignment from a multiple-FPGA partitioning (i.e., a design graph) to a target topology graph has already been made. The objective is to assign “links” (i.e., signal nets) to channels between devices; as with the Virtual Wires concept, specific time slices for a channel can be assigned to multiple links as long as no two links need to transmit signals at the same time. The TIERS algorithm uses a greedy method to order the links and then routes each link in the scheduled order while reserving channel resources; factors of up to 2.5 improvement in system cycle time are achieved.

Chang et al. [113] address the combined issues of routability and system timing by applying layout-driven logic resynthesis techniques. For a given wire that cannot be routed, “alternative wires” and alternative functions are identified, such that the given unroutable wire can be removed from the circuit and replaced with a new wire (or wires) or new logic without affecting functionality. Cheng et al. estimate that between 30% and 50% of wires have so-called “triple-wire alternatives” (i.e., replacements consisting of three or fewer wires). Their method first routes the wires that do not have any alternatives, then replaces any unroutable wire with available alternatives. System timing can be improved by replacing long wires with shorter alternatives.

8 Conclusions

Reconfigurable computing offers a middle ground between software-based systems and ASIC implementations, and is often able to combine important benefits of both. Implementations are able to avoid overheads such as the unnecessary data transfers, decoding and control that are mandatory in microprocessors, and designs can be optimized on a basis specific to an application, a problem instance or even an execution. Using this technology, it is possible to achieve size, performance, cost, or power improvements over more conventional computing technologies.

9 Acknowledgments

The authors would like to thank Y.M. Lam for his help in preparing this manuscript and Prof. Wayne Luk (Imperial College) for his proofreading of this article.

Bibliography

[1] G. Estrin, "Reconfigurable Computer Origins: The UCLA Fixed-plus-variable (F+V) Structure computer," IEEE Ann. Hist. Comput., vol. 24, pp. 3-9, 2002.

[2] S. Hauck, "The roles of FPGAs in reprogrammable systems," Proc. IEEE, vol. 86, pp. 615-639, 1998.

[3] K. Compton and S. Hauck, "Reconfigurable computing: a survey of systems and software," ACM Comput. Surveys (CSUR), vol. 34, pp. 171-210, 2002.

[4] K. Bondalapati and V. K. Prasanna, "Reconfigurable computing systems," Proc. IEEE, vol. 90, pp. 1201-1217, 2002.

[5] T. J. Todman, G. A. Constantinides, S. J. E. Wilton, O. Mencer, W. Luk, and P. Y. K. Cheung, "Reconfigurable computing: architectures and design methods," IEE Proc. Computers and Digital Techniques, vol. 152, pp. 193-205, 2005.

[6] R. Tessier, K. Pocek, and A. DeHon, "Reconfigurable Computing Architectures," Proc. IEEE, vol. 103, pp. 332-354, 2015.

[7] H. Sutter and J. Larus, "Software and the concurrency revolution," Queue, vol. 3, pp. 54-62, 2005.

[8] J. Cong, M. A. Ghodrat, M. Gill, B. Grigorian, K. Gururaj, and G. Reinman, "Accelerator-Rich Architectures," in Proc. Design Automation Conference, pp. 1-6, 2014.

[9] J. Fowers, G. Brown, P. Cooke, and G. Stitt, "A performance and energy comparison of FPGAs, GPUs, and multicores for sliding-window applications," in Proc. ACM/SIGDA Int. Symp. on Field Programmable Gate Arrays, pp. 47-56, 2012.

[10] D. B. Thomas, L. Howes, and W. Luk, "A comparison of CPUs, GPUs, FPGAs, and massively parallel processor arrays for random number generation," Proc. Int. Symp. on Field programmable gate arrays, pp. 63-72, 2009.

[11] A. DeHon, "The density advantage of configurable computing," IEEE Computer, vol. 33, pp. 41-49, 2000.

[12] I. Kuon and J. Rose, "Measuring the Gap Between FPGAs and ASICs," IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 26, pp. 203-215, 2007.

[13] J. Liang, R. Tessier, and D. Goeckel, "A Dynamically-Reconfigurable, Power-Efficient Turbo Decoder," in Proc. Int. Symp. on Field-Programmable Custom Computing Machines, pp. 91-100, 2004.

[14] V. Betz, J. Rose, and A. Marquardt, "Architecture and CAD for Deep-Submicron FPGAs," Dordrecht, the Netherlands: Kluwer Academic Publishers, 1999.

[15] Xilinx, "http://www.xilinx.com," (accessed 2016).

[16] Altera, "http://www.altera.com," (accessed 2016).

[17] Microsemi, "http://www.microsemi.com," (accessed 2016).

[18] M. P. Leong, "FPGA Design Methodologies for High Performance Applications," The Chinese University of Hong Kong, 2001.

[19] E. Ahmed and J. Rose, "The effect of LUT and cluster size on deep-submicron FPGA performance and density," in Proc. ACM/SIGDA Int. Symp. on Field programmable gate arrays, pp. 3-12, 2000.

[20] D. Lewis, A. Lee, P. Leventis, S. Marquardt, C. McClintock, K. Padalia, et al., "The Stratix II logic and routing architecture," in Proc. Int. Symp. on Field-programmable gate arrays, pp. 14-20, 2005.

[21] D.Lewis,G.Chiu,J.Chromczak,D.Galloway,B.Gamsa,V.Manohararajah,etal.,"TheStratix™10HighlyPipelinedFPGAArchitecture,"inProc.Int.Symp.onField-ProgrammableGateArrays,pp.159-168,2016.

[22] R. Hartenstein, "Coarse grain reconfigurable architecture (embedded tutorial)," in Proc. Conf. on Asia South Pacific Design Automation, 2001.

[23] S. C. Goldstein, H. Schmit, M. Budiu, S. Cadambi, M. Moe, and R. R. Taylor, "PipeRench: a reconfigurable architecture and compiler," Computer, vol. 33, pp. 70-77, 2000.

[24] C. Ebeling, D. C. Cronquist, and P. Franklin, "RaPiD - Reconfigurable pipelined datapath," in Proc. Int. Workshop on Field-Programmable Logic, Smart Applications, New Paradigms and Compilers, pp. 126-135, 1996.

[25] L. Moll, J. Vuillemin, and P. Boucard, "High-energy physics on DECPeRLe-1 programmable active memory," in Proc. ACM Int. Symp. on Field-Programmable Gate Arrays, pp. 47-52, 1995.

[26] D. T. Hoang, "Searching genetic databases on Splash 2," in Proc. IEEE Workshop on FPGAs for Custom Computing Machines, pp. 185-191, 1993.

[27] C. Chang, J. Wawrzynek, and R. W. Brodersen, "BEE2: A High-End Reconfigurable Computing System," IEEE Des. Test Comput., vol. 22, pp. 114-125, 2005.

[28] L.-K. Ting, R. Woods, and C. F. N. Cowan, "Virtex FPGA implementation of a pipelined adaptive LMS predictor for electronic support measures receivers," IEEE Trans. on Very Large Scale Integration (VLSI) Systems, vol. 13, pp. 86-95, 2005.

[29] M. Pohl, M. Schaeferling, and G. Kiefer, "An efficient FPGA-based hardware framework for natural feature extraction and related Computer Vision tasks," in Proc. Int. Conf. on Field Programmable Logic and Applications, pp. 1-8, 2014.

[30] M. Shand and J. Vuillemin, "Fast implementations of RSA cryptography," in Proc. IEEE Symp. on Computer Arithmetic, pp. 252-259, 1993.

[31] K. H. Tsoi, K. H. Lee, and P. H. W. Leong, "A massively parallel RC4 key search engine," in Proc. Int. Symp. on Field-Programmable Custom Computing Machines, pp. 13-21, 2002.

[32] G. L. Zhang, P. H. W. Leong, C. H. Ho, K. H. Tsoi, C. C. C. Cheung, D. Lee, et al., "Reconfigurable acceleration for Monte Carlo based financial simulation," in Proc. Int. Conf. on Field-Programmable Technology, pp. 215-222, 2005.

[33] D. Boland, "Reducing Memory Requirements for High-Performance and Numerically Stable Gaussian Elimination," in Proc. ACM/SIGDA Int. Symp. on Field-Programmable Gate Arrays, pp. 244-253, 2016.

[34] N. J. Fraser, D. J. M. Moss, L. JunKyu, S. Tridgell, C. T. Jin, and P. H. W. Leong, "A fully pipelined kernel normalised least mean squares processor for accelerated parameter optimisation," in Proc. Int. Conf. on Field Programmable Logic and Applications, pp. 1-6, 2015.

[35] J. E. Vuillemin, P. Bertin, D. Roncin, M. Shand, H. H. Touati, and P. Boucard, "Programmable active memories: reconfigurable systems come of age," IEEE Trans. on Very Large Scale Integration (VLSI) Systems, vol. 4, pp. 56-69, 1996.

[36] Nvidia, "http://www.nvidia.com," (accessed 2016).

[37] S. Mittal and J. S. Vetter, "A Survey of Methods for Analyzing and Improving GPU Energy Efficiency," ACM Comput. Surv., vol. 47, pp. 1-23, 2014.

[38] Nvidia, "NVIDIA Tesla® K20-K20X GPU Accelerators Benchmarks Application Performance Technical Brief," http://www.nvidia.com/docs/IO/122874/K20-and-K20X-application-performance-technical-brief.pdf (accessed 2016).

[39] K. Ovtcharov, O. Ruwase, J.-Y. Kim, K. Strauss, and E. Chung, "Accelerating Deep Convolutional Neural Networks Using Specialized Hardware," Microsoft Research, 2015.

[40] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, "Deep Learning with Limited Numerical Precision," in Int. Conf. on Machine Learning, pp. 1337-1345, 2013.

[41] J. L. Jerez, G. A. Constantinides, and E. C. Kerrigan, "Fixed Point Lanczos: Sustaining TFLOP-equivalent Performance in FPGAs for Scientific Computing," in Proc. Int. Symp. on Field-Programmable Custom Computing Machines, pp. 53-60, 2012.

[42] J. Babb, R. Tessier, M. Dahl, S. Z. Hanono, D. M. Hoki, and A. Agarwal, "Logic emulation with virtual wires," IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 16, pp. 609-626, 1997.

[43] J. Varghese, M. Butts, and J. Batcheller, "An efficient logic emulation system," IEEE Trans. on Very Large Scale Integration (VLSI) Systems, vol. 1, pp. 171-174, 1993.

[44] L. de Souza, P. Ryan, J. Crawford, K. Wong, G. Zyner, and T. McDermott, "Prototyping for the Concurrent Development of an IEEE 802.11 Wireless LAN Chipset," in Proc. Int. Conf. on Field Programmable Logic and Application, pp. 51-60, 2003.

[45] Cadence, "Protium Rapid Prototyping Platform," https://www.cadence.com/content/cadence-www/global/en_US/home/tools/system-design-and-verification/fpga-basedprototyping/protium-rapid-prototyping-platform.html (accessed 2016).

[46] D. M. Stephen Tridgell, Nicholas J. Fraser, and Philip H. W. Leong, "Braiding: a scheme for resolving hazards in NORMA," in Proc. Int. Conf. on Field Programmable Technology, pp. 136-143, 2015.

[47] P. L. Chen Zhang, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong, "Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks," in Proc. ACM/SIGDA Int. Symp. on Field-Programmable Gate Arrays, pp. 161-170, 2015.

[48] R. A. E. Louise H. Crockett, Martin A. Enderwitz, and Robert W. Stewart, The Zynq Book: Embedded Processing with the Arm Cortex-A9 on the Xilinx Zynq-7000 all Programmable UK, 2014.

[49] P. Bertin, D. Roncin, and J. Vuillemin, "Introduction to Programmable Active Memories," DEC Memo 3, pp. 1-9, 1989.

[50] J. M. Arnold, D. A. Buell, and E. G. Davis, "Splash 2," in Proc. ACM Symp. on Parallel Algorithms and Architectures, 1992.

[51] Maxeler, "https://www.maxeler.com/products/mpc-xseries/," (accessed 2016).

[52] A. Putnam, A. M. Caulfield, E. S. Chung, D. Chiou, K. Constantinides, J. Demme, et al., "A reconfigurable fabric for accelerating large-scale datacenter services," in Int. Symp. on Computer Architecture (ISCA), 2014.

[53] Y.-k. Choi, J. Cong, Z. Fang, Y. Hao, G. Reinman, and P. Wei, "A quantitative analysis on microarchitectures of modern CPU-FPGA platforms," in Proc. Design Automation Conference, 2016.

[54] J. Villasenor and W. H. Mangione-Smith, "Configurable Computing," Scientif. Amer., vol. 276, pp. 66-71, 1997.

[55] J. Becker and M. Hübner, "Run-time reconfigurabilility and other future trends," in Proc. Symp. on Integrated Circuits and Systems Design, pp. 9-11, 2006.

[56] N. B. C. Bhat, K.; Kuh, E. S, "Performance-oriented Fully Routable Dynamic Architecture for a Field-programmable Logic Device," Memorandum No. UCB/ERL M93/42, Electronics Research Lab., College of Engineering, UC Berkeley, pp. 1-21, 1993.

[57] M. Motomura, "A Dynamically Reconfigurable Processor Architecture," Microprocessor Forum, 2002.

[58] D. Koch, C. Beckhoff, and J. Teich, "ReCoBus-Builder - A novel tool and technique to build statically and dynamically reconfigurable systems for FPGAs," in Proc. Int. Conf. on Field Programmable Logic and Applications, pp. 119-124, 2008.

[59] C. Beckhoff, D. Koch, and J. Torresen, "GoAhead: A Partial Reconfiguration Framework," in Proc. Int. Symp. on Field-Programmable Custom Computing Machines, pp. 37-44, 2012.

[60] K. Vipin and S. A. Fahmy, "Efficient region allocation for adaptive partial reconfiguration," in Proc. Int. Conf. on Field-Programmable Technology, pp. 1-6, 2011.

[61] D. Koch and C. Beckhoff, "Hierarchical reconfiguration of FPGAs," in Proc. Int. Conf. on Field Programmable Logic and Applications, pp. 1-8, 2014.

[62] K. Vipin and S. A. Fahmy, "ZyCAP: Efficient Partial Reconfiguration Management on the Xilinx Zynq," IEEE Embedded Systems Letters, vol. 6, pp. 41-44, 2014.

[63] L. Gong and O. Diessel, Functional Verification of Dynamically Reconfigurable FPGA-based Systems, 1st ed. Springer International Publishing, 2015.

[64] C. Claus, R. Ahmed, F. Altenried, and W. Stechele, "Towards Rapid Dynamic Partial Reconfiguration in Video-Based Driver Assistance Systems," in Proc. Int. Symp. on Reconfigurable Computing: Architectures, Tools and Applications, pp. 55-67, 2010.

[65] G. G. Jean-Philippe Delahaye, Christian Roland, Pierre Bomel, "Software radio and dynamic reconfiguration on a DSP/FPGA platform," Frequenz, Journal of Telecommunications, pp. 152-159, 2004.

[66] M. Feilen, M. Ihmig, C. Schwarzbauer, and W. Stechele, "Efficient DVB-T2 decoding accelerator design by time-multiplexing FPGA resources," in Proc. Int. Conf. on Field Programmable Logic and Applications, pp. 75-82, 2012.

[67] C. Steiger, H. Walder, and M. Platzner, "Operating systems for reconfigurable embedded platforms: online scheduling of real-time tasks," IEEE Trans. on Computers, vol. 53, pp. 1393-1407, 2004.

[68] C. Dennl, D. Ziener, and J. Teich, "On-the-fly Composition of FPGA-Based SQL Query Accelerators Using a Partially Reconfigurable Module Library," in Proc. Int. Symp. on Field-Programmable Custom Computing Machines (FCCM), pp. 45-52, 2012.

[69] A. Becher, F. Bauer, D. Ziener, and J. Teich, "Energy-aware SQL query acceleration through FPGA-based dynamic partial reconfiguration," in Proc. Int. Conf. on Field Programmable Logic and Applications, pp. 1-8, 2014.

[70] M. J. Wirthlin and B. L. Hutchings, "A dynamic instruction set computer," in Proc. IEEE Symp. on FPGAs for Custom Computing Machines, pp. 99-107, 1995.

[71] Stretch, "http://www.stretchinc.com/," (accessed 2016).

[72] H. Schmit, "Incremental reconfiguration for pipelined applications," in Proc. 5th Annual IEEE Symp. on Field-Programmable Custom Computing Machines, pp. 47-55, 1997.

[73] S. Young, P. Alfke, C. Fewer, S. McMillan, B. Blodget, and D. Levi, "A high I/O reconfigurable crossbar switch," in Proc. Int. Symp. on Field-Programmable Custom Computing Machines, pp. 3-10, 2003.

[74] F. de Dinechin and B. Pasca, "Designing custom arithmetic datapaths with FloPoCo," IEEE Design & Test of Computers, vol. 28, pp. 18-27, 2011.

[75] W. Luk and I. Page, "Compiling Occam into FPGAs," EE&CS Books, pp. 271-283, 1991.

[76] I. Page, "Constructing hardware-software systems from a single description," VLSI Signal Processing, vol. 12, pp. 87-107, 1996.

[77] Q. Liu, G. A. Constantinides, K. Masselos, and P. Y. K. Cheung, "Automatic On-chip Memory Minimization for Data Reuse," in Proc. Int. Symp. on Field-Programmable Custom Computing Machines, pp. 251-260, 2007.

[78] S. Bayliss and G. A. Constantinides, "Optimizing SDRAM bandwidth for custom FPGA loop accelerators," in Proc. ACM/SIGDA Int. Symp. on Field Programmable Gate Arrays, pp. 195-204, 2012.

[79] D. U. Lee, A. A. Gaffar, R. C. C. Cheung, O. Mencer, W. Luk, and G. A. Constantinides, "Accuracy-Guaranteed Bit-Width Optimization," IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 25, pp. 1990-2000, 2006.

[80] D. Boland and G. A. Constantinides, "Bounding Variable Values and Round-Off Effects Using Handelman Representations," IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 30, pp. 1691-1704, 2011.

[81] T. Isshiki and W. W. Dai, "High-Level Bit-Serial Datapath Synthesis for Multi-FPGA Systems," in Proc. Int. Workshop on FPGAs, pp. 161-174, 1995.

[82] M. P. Leong and P. H. W. Leong, "A variable-radix digit-serial design methodology and its application to the discrete cosine transform," IEEE Trans. on Very Large Scale Integration (VLSI) Systems, vol. 11, pp. 90-104, 2003.

[83] Maxeler Technologies, "MaxCompiler" (white paper), 2011.

[84] Mentor Graphics, "Catapult High-Level Synthesis," https://www.mentor.com/hls-lp/catapult-high-level-synthesis/c-systemc-hls (accessed 2016).

[85] M. Haldar, A. Nayak, A. Choudhary, and P. Banerjee, "A system for synthesizing optimized FPGA hardware from Matlab," in IEEE/ACM Int. Conf. on Computer Aided Design, pp. 314-319, 2001.

[86] Mathworks, "http://www.mathworks.com/products/hdl-coder/," (accessed 2016).

[87] A. Papakonstantinou, K. Gururaj, J. A. Stratton, D. Chen, J. Cong, and W. M. W. Hwu, "FCUDA: Enabling efficient compilation of CUDA kernels onto FPGAs," in Symp. on Application Specific Processors, pp. 35-42, 2009.

[88] D. Lau, O. Pritchard, and P. Molson, "Automated Generation of Hardware Accelerators with Direct Memory Access from ANSI/ISO Standard C Functions," in Proc. Int. Symp. on Field-Programmable Custom Computing Machines, pp. 45-56, 2006.

[89] A. G. Weisz, A. J. Melber, A. Y. Wang, A. K. Fleming, A. E. Nurvitadhi, and A. J. C. Hoe, "A Study of Pointer-Chasing Performance on Shared-Memory Processor-FPGA Systems," in Proc. ACM/SIGDA Int. Symp. on Field-Programmable Gate Arrays, pp. 264-273, 2016.

[90] Vectorblox, "http://vectorblox.com/," (accessed 2016).

[91] N. Kapre and J. Gray, "Hoplite: Building austere overlay NoCs for FPGAs," in Proc. Int. Conf. on Field Programmable Logic and Applications, pp. 1-8, 2015.

[92] D. Capalija and T. S. Abdelrahman, "A high-performance overlay architecture for pipelined execution of data flow graphs," in Int. Conf. on Field Programmable Logic and Applications, pp. 1-8, 2013.

[93] A. K. Jain, X. Li, P. Singhai, D. L. Maskell, and S. A. Fahmy, "DeCO: A DSP Block Based FPGA Accelerator Overlay With Low Overhead Interconnect," in Proc. Int. Symp. on Field-Programmable Custom Computing Machines, pp. 1-8, 2016.

[94] C. Clos, "A Study of Non-Blocking Switching Networks," Bell System Technical Journal, vol. 32, pp. 406-424, 1953.

[95] V. E. Beneš, "Mathematical Theory of Connecting Networks and Telephone Traffic," New York: Academic Press, 1965.

[96] P. K. Chan and M. D. F. Schlag, "Architectural tradeoffs in field-programmable-device-based computing systems," in Proc. IEEE Workshop on FPGAs for Custom Computing Machines, pp. 152-161, 1993.

[97] R. Cole and J. Hopcroft, "On Edge Coloring Bipartite Graphs," SIAM J. Comput., vol. 11, pp. 540-546, 1982.

[98] G. Richards and F. Hwang, "A Two-Stage Rearrangeable Broadcast Switching Network," IEEE Trans. Communications, vol. 33, pp. 1025-1035, 1985.

[99] I-Cube, "Using FPID Devices in FPGA-based Prototyping," Application Note, pp. 1-11, 1994.

[100] Y. C. Wei, C. K. Cheng, and Z. Wurman, "Multiple-level partitioning: an application to the very large-scale hardware simulator," IEEE J. Solid-State Circuits, vol. 26, pp. 706-716, 1991.

[101] J. Li and C. K. Cheng, "Routability improvement using dynamic interconnect architecture," in Proc. IEEE Symp. on FPGAs for Custom Computing Machines, pp. 2-7, 1995.

[102] P. K. Chan, M. D. F. Schlag, and M. Martin, "BORG: A Reconfigurable Prototyping Board Using Field-programmable Gate Arrays," in Int. Workshop on FPGA, pp. 47-51, 1992.

[103] J. Lillis, C. K. Cheng, and T. T. Y. Lin, "Optimal wire sizing and buffer insertion for low power and a generalized delay model," in Proc. Int. Conf. on Computer Aided Design (ICCAD), pp. 138-143, 1995.

[104] W. J. Hsieh, Y. C. Jenq, C. S. Horng, and K. Lofstrom, "Input/output (I/O) Bidirectional Buffer for Interfacing I/O Ports of a Field Programmable Interconnection Device with Array Ports of a Cross-point Switch," US Patent 5,428,800, 1992.

[105] C.-W. Yeh, C.-K. Cheng, and T.-T. Y. Lin, "A probabilistic multicommodity-flow solution to circuit clustering problems," in Proc. IEEE/ACM Int. Conf. on Computer-Aided Design, pp. 428-431, 1992.

[106] Y. Ching-Wei, "On the acceleration of flow-oriented circuit clustering," IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 14, pp. 1305-1308, 1995.

[107] L.-T. Liu, M.-t. Kuo, C.-K. Cheng, and T. C. Hu, "Performance-Driven Partitioning Using a Replication Graph Approach," in Proc. Design Automation Conference, pp. 206-210, 1995.

[108] J. Cong, S. K. Lim, and C. Wu, "Performance driven multi-level and multiway partitioning with retiming," in Proc. Design Automation Conf., pp. 274-279, 2000.

[109] C. J. Alpert and A. B. Kahng, "Recent directions in netlist partitioning: a survey," Integration, the VLSI Journal, vol. 19, pp. 1-81, 1995.

[110] P. K. Chan, M. D. F. Schlag, and J. Y. Zien, "On routability prediction for Field-Programmable Gate Arrays," in Proc. Design Automation Conference, pp. 326-330, 1993.

[111] A. E. Gamal, "Two-dimensional stochastic model for interconnections in master slice integrated circuits," IEEE Trans. Circuits Syst., vol. 28, pp. 127-138, 1981.

[112] C. Selvidge, A. Agarwal, M. Dahl, and J. Babb, "TIERS: Topology Independent Pipelined Routing and Scheduling," in Proc. ACM Int. Symp. on Field-Programmable Gate Arrays, pp. 25-31, 1995.

[113] S.-C. Chang, K.-T. Cheng, N.-S. Woo, and M. Marek-Sadowska, "Layout driven logic synthesis for FPGAs," in Proc. Design Automation Conference, pp. 308-313, 1994.

