Reconfigurable Computing
David Boland1, Chung-Kuan Cheng2, Andrew B. Kahng2, Philip H.W. Leong1
1 School of Electrical and Information Engineering, The University of Sydney, Australia 2006
2 Dept. of Computer Science and Engineering, University of California, La Jolla, California
Abstract:
Reconfigurable computing is the application of adaptable fabrics to solve computational problems, often taking advantage of the flexibility available to produce problem-specific architectures that achieve high performance because of customization. Reconfigurable computing has been successfully applied to fields as diverse as digital signal processing, cryptography, bioinformatics, logic emulation, CAD tool acceleration, scientific computing, and rapid prototyping.
Although Estrin first proposed the idea of a reconfigurable system in the form of a fixed plus variable structure computer in 1960 [1], it has only been in recent years that reconfigurable fabrics, in the form of field-programmable gate arrays (FPGAs), have reached sufficient density to make them a compelling implementation platform for high performance applications and embedded systems. In this article, intended for the non-specialist, we describe some of the basic concepts, tools and architectures associated with reconfigurable computing.
Keywords:
reconfigurable computing; adaptable fabrics; application integrated circuits; field programmable gate arrays (FPGAs); system architecture; runtime
1 Introduction
Although reconfigurable fabrics can in principle be constructed from any type of technology, in practice, most contemporary designs are made using commercial field programmable gate arrays (FPGAs). An FPGA is an integrated circuit containing an array of logic gates in which the connections can be configured by downloading a bitstream to its memory. FPGAs can also be embedded in integrated circuits as intellectual property cores. More detailed surveys on reconfigurable computing are available in the literature [2-6].
Microprocessors offer an easy-to-use, powerful and flexible implementation medium for digital systems. Their utility in computing applications makes them an overwhelming first choice. Moreover, it is relatively easy to find software developers, and microprocessors are widely supported by operating systems, software engineering tools, and libraries. However, in the last decade, power constraints have limited the performance of serial computation on microprocessors. This has led to the development of multi-core processors and an increasing importance placed on the pursuit of parallel computation [7]. Unfortunately, multi-core processors are rarely the most efficient method to perform parallel computation. This inefficiency stems from the fact that each core must be general enough to support an entire instruction set. As a result, the majority of energy is used in decoding the instruction and fetching data instead of performing actual computation [8].
Hardware accelerators such as graphics processing units (GPUs) and FPGAs are parallel computational architectures that have demonstrated substantial performance and energy efficiency improvements over traditional multi-core processor designs by moving the focus back to computation [9, 10]. In terms of energy efficiency, the GPU architecture, which consists of thousands of parallel floating-point units, is best suited to so-called embarrassingly parallel computation or computationally expensive problems. However, many algorithms do not fall into this problem category. In contrast, using an FPGA or Application-Specific Integrated Circuit (ASIC), it is possible to create a fully customised datapath for a given algorithm, meaning it is possible to achieve even greater energy efficiency using these devices.
Application-specific integrated circuits (ASICs) and FPGAs achieve greater levels of parallelism than a microprocessor by arranging computations in a spatial rather than temporal fashion. This can result in performance improvements of several orders of magnitude. Also, the absence of caches and instruction decoding can result in the same amount of work being done with less chip area and lower power consumption [11]. Notable examples of application domains include cryptography, NP-hard optimization problems, pattern matching, machine learning, and molecular dynamics [6].
An example involving the implementation of a finite impulse response (FIR) filter is shown in Fig. 1. The reconfigurable computing solution is significantly more parallel than the microprocessor-based one. In addition, it should be apparent that the reconfigurable solution avoids the overheads associated with instruction decoding, caching, and register files. Furthermore, speculative execution, unnecessary data transfers and control hardware can be omitted.
Figure 1. Illustration of a microprocessor based FIR filter vs. a reconfigurable computing solution. In the microprocessor, operations are performed in the ALU sequentially. Furthermore, instruction decoding, caching, speculative execution, control generation and so on are required. For the reconfigurable computing approach using an FPGA, spatial composition is used to increase the degree of parallelism. The FPGA implementation can be further parallelized through pipelining.
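As a rough software analogy of Fig. 1, the following sketch contrasts a sequential multiply-accumulate loop with a spatial formulation in which all tap products are formed together and summed by an adder tree. It is not taken from the article: the function names are illustrative, and the adder tree is only one plausible way of composing the taps spatially.

```python
def fir_sequential(coeffs, window):
    """Microprocessor style: one multiply-accumulate per loop iteration."""
    acc = 0
    for c, x in zip(coeffs, window):
        acc += c * x
    return acc


def fir_spatial(coeffs, window):
    """FPGA style: one multiplier per tap, then layers of parallel adders.

    Returns (result, adder_tree_depth).
    """
    level = [c * x for c, x in zip(coeffs, window)]  # all products "at once"
    depth = 0
    while len(level) > 1:
        # Each pass models one layer of adders that would operate in parallel.
        pairs = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            pairs.append(level[-1])  # odd element passes through unchanged
        level = pairs
        depth += 1
    return level[0], depth


def fir_filter(coeffs, samples):
    """Stream samples through a delay line that starts out filled with zeros."""
    taps = [0] * len(coeffs)
    out = []
    for s in samples:
        taps = [s] + taps[:-1]  # shift the delay line by one sample
        out.append(fir_spatial(coeffs, taps)[0])
    return out
```

For a four-tap filter the sequential version needs four multiply-accumulate steps per output, whereas the spatial version has an adder-tree depth of two; registers between the layers would correspond to the pipelining mentioned in the caption.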
Compared with ASICs, FPGAs offer very low non-recurring engineering (NRE) costs, which is often a more important factor than the fact that FPGAs have higher unit costs. This is because many applications do not have the extremely high volumes required to make ASICs a cheaper proposition. As integrated circuit feature sizes continue to decrease, the NRE costs associated with ASICs continue to escalate, increasing the volume at which it becomes cheaper to use an ASIC (see Fig. 2). Being more specialized, ASICs offer area, power and speed advantages over FPGAs, although this gap is being reduced as more hard blocks are employed [12]. Moving forward, reconfigurable computing will be used in an increasing number of applications, as ASICs become cost effective only for the highest performance or highest volume applications.
Figure 2. Cost of technology vs. volume. The crossover volume for which ASIC technology is cheaper than FPGAs increases as feature size is reduced because of increased non-recurring engineering costs.
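The crossover in Fig. 2 can be made concrete with a simple linear cost model, total cost = NRE + unit cost × volume. The sketch below assumes that model; the function name and all dollar figures in the usage note are hypothetical and are not taken from the article.

```python
def crossover_volume(nre_asic, unit_asic, nre_fpga, unit_fpga):
    """Volume above which the ASIC becomes cheaper under a linear cost model.

    Solves nre_asic + unit_asic * v == nre_fpga + unit_fpga * v for v.
    """
    if unit_fpga <= unit_asic:
        raise ValueError("model assumes the FPGA has the higher unit cost")
    return (nre_asic - nre_fpga) / (unit_fpga - unit_asic)


# Hypothetical figures, for illustration only.
older_node = crossover_volume(nre_asic=1_000_000, unit_asic=5, nre_fpga=0, unit_fpga=55)
newer_node = crossover_volume(nre_asic=5_000_000, unit_asic=5, nre_fpga=0, unit_fpga=55)
```

With these made-up numbers, a fivefold rise in ASIC NRE moves the crossover from 20,000 to 100,000 units, which is the trend the figure describes.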
An additional benefit of reconfigurable computing is that its technology provides a shorter time to market than ASICs (the associated FPGA fabrication time is essentially zero), making many fabrication iterations within a single day possible. This benefit allows more complex algorithms to be deployed and makes problem-specific customizations of designs possible. FPGA-based designs are inherently less risky in terms of technical feasibility and cost, as shorter design times and lower upfront costs are involved. As their name suggests, FPGAs also offer the possibility of modifications to the design in the field, which can be used to provide bug fixes, to adapt to changing standards, or to add functionality, all of which can be achieved by downloading a new bitstream to an existing reconfigurable computing platform. Reconfiguration can even take place while the system is running, this being known as runtime reconfiguration (e.g., [13]).
In the next section, we introduce the basic architecture of common reconfigurable fabrics, followed by a discussion of applications of reconfigurable computing and system architectures. Runtime reconfiguration and design methods are then covered. Finally, we discuss multichip systems and end with a conclusion.
[Figure 2 plot: cost (vertical axis) versus volume (horizontal axis) for older and current ASIC and FPGA technologies. The crossover volume increases with decreasing feature size.]
2 Reconfigurable Fabrics
A block diagram illustrating a generic fine-grained island-style FPGA is given in Fig. 3 [14]. Products from companies such as Xilinx [15], Altera [16], and Microsemi [17] are commercial examples. The FPGA consists of a number of logic cells that can be interconnected to other logic and input/output (I/O) cells via programmable routing resources. Logic cells and routing resources are configured via bit-level programming data, which is stored in memory cells in the FPGA. A logic cell consists of user-programmable combinatorial elements, with an optional register at the output. They are often implemented as lookup tables (LUTs) with a small number of inputs, 4-input LUTs being shown in Fig. 3. Using such an architecture, subject to FPGA-imposed limitations on the circuit's speed and density, an arbitrary circuit can be implemented. The complete design is described via the configuration bitstream, which specifies the logic and I/O cell functionality and their interconnection.
Figure 3. Architecture of a basic island-style FPGA with four-input logic cells. The logic cells, shown as gray rectangles, are connected to programmable routing resources (shown as wires, dots, and diagonal switch boxes) (source: References [14] and [18]).
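To illustrate how a lookup table realises arbitrary combinatorial logic from configuration bits, here is a minimal software model of a 4-input LUT. The class and function names and the bit ordering (input `a` as the least significant index bit) are assumptions made for this sketch; real devices define their own bitstream layout.

```python
class LUT4:
    """Software model of a 4-input lookup table holding 16 configuration bits."""

    def __init__(self, config_bits):
        if not 0 <= config_bits < (1 << 16):
            raise ValueError("a 4-input LUT holds exactly 16 configuration bits")
        self.config_bits = config_bits

    def evaluate(self, a, b, c, d):
        # The four inputs form an address into the 16-entry truth table.
        index = a | (b << 1) | (c << 2) | (d << 3)
        return (self.config_bits >> index) & 1


def lut_config_from_function(fn):
    """Derive the 16 configuration bits that make a LUT4 compute fn(a, b, c, d)."""
    bits = 0
    for index in range(16):
        a, b, c, d = (index >> 0) & 1, (index >> 1) & 1, (index >> 2) & 1, (index >> 3) & 1
        if fn(a, b, c, d):
            bits |= 1 << index
    return bits


xor_config = lut_config_from_function(lambda a, b, c, d: a ^ b ^ c ^ d)
and_config = lut_config_from_function(lambda a, b, c, d: a & b & c & d)
xor_lut = LUT4(xor_config)
```

Changing only the 16 stored bits turns the same cell from a 4-input XOR into a 4-input AND, which is the sense in which the configuration bitstream specifies logic cell functionality.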
Current trends are to incorporate additional embedded blocks so that designers can integrate entire systems on a single FPGA device. Apart from density, cost, and board area benefits, this process also improves performance because more specialized logic and routing can be used and all components are on the same chip. A contemporary FPGA commonly has features such as carry chains to enable fast addition; wide decoders; tristate buffers; blocks of on-chip memory and multipliers; programmable I/O standards in the input/output cells; delay locked loops; phase locked loops for clock de-skewing, phase shifting and multiplication; multi-gigabit transceivers (MGTs); and embedded microprocessors. Embedded microprocessors can be implemented either as soft cores using the internal FPGA resources or as hardwired cores.
In addition to the architectural features described, intellectual property (IP) cores, implemented using the logic cell resources of the FPGA, are available from vendors and can be incorporated into a design. These cores include bus interfaces, networking components, memory interfaces, signal processing functions, microprocessors and so on, and can significantly reduce development time and effort.
The bit-level organization of the logic and routing resources in island-style FPGAs is extremely flexible but has high implementation overhead as a result. Tradeoffs exist in the granularity of the logic cells and routing resources. Fine-grained devices have the best flexibility; however, coarse-grained elements can trade some flexibility for higher performance and density [19].
With modern technologies, the speed of the routing resource is a limiting factor. Trends have been to increase the functionality of the logic cells, e.g., to use logic cells with larger numbers of inputs which can also be configured as smaller LUTs [20], and to add pipeline registers to the routing fabric [21]. For datapath-oriented applications such as digital signal processing, coarse-grained architectures [22] such as PipeRench [23] and RaPID [24] employ bus-based routing and word-based functional units to utilize silicon resources more efficiently.
3 Applications
Reconfigurable computing has found widespread application in the form of “custom computing machines” to accelerate computation over algorithms implemented on a CPU. Application domains include high-energy physics [25], genome analysis [26], signal processing [27, 28], computer vision [29], cryptography [30, 31], financial engineering [10, 32], scientific computing [33], machine learning [34] and security [35].
In many of these problem domains, a general purpose GPU has also demonstrated considerable acceleration over a CPU and will often outperform an FPGA as well in terms of raw performance. This is because it is a parallel architecture with many hardened floating-point units and substantially greater memory bandwidth, meaning it is an ideal architecture for algorithms that can be broken down into a large number of parallel threads. However, outright performance is no longer the only benchmark; energy consumption is also important. In terms of high performance computing, supercomputing clusters and data centers now consume vast amounts of energy, not only on computation, but also on cooling in order to maintain performance and reliability. It follows that reducing energy consumption provides both environmental and economic benefits. Energy minimization is also important for embedded applications; for example, reducing power consumption on smartphones or other battery-powered devices is desirable from an end user perspective. As a result, FPGA and GPU vendors are focusing their engineering efforts towards making future architectures more energy efficient. This is reflected in the most recent FPGA and GPU architectures: Nvidia’s P100 claims a peak performance of 10.6 TFLOPs (single precision) with a TDP of only 300 W [36], while Altera claims performance of up to 9.3 TFLOPs (single precision) at 80 GFLOPs/watt is achievable on their upcoming Stratix 10 device [16].
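The vendor figures quoted above can be put on a common performance-per-watt scale with simple arithmetic. The sketch below does only that; it treats the quoted TDP as if it were power drawn at peak throughput, which is an assumption, and the variable names are illustrative.

```python
def gflops_per_watt(peak_tflops, power_watts):
    """Convert a peak TFLOPs figure and a power figure to GFLOPs per watt."""
    return peak_tflops * 1000.0 / power_watts


# Figures quoted in the text (vendor peak claims, not measurements).
p100_efficiency = gflops_per_watt(10.6, 300.0)

# Power implied by the quoted Stratix 10 claim of 9.3 TFLOPs at 80 GFLOPs/watt.
stratix10_implied_watts = 9.3 * 1000.0 / 80.0
```

On these claims the P100 works out to roughly 35 GFLOPs/watt, and the Stratix 10 figure implies a power of about 116 W at peak. Both are peak numbers, so they say little about efficiency on any particular application.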
Many experimental studies have been performed comparing the energy efficiency of FPGAs, GPUs and CPUs; a recent survey is provided in [37]. Both FPGAs and GPUs typically outperform CPUs according to this metric. GPUs have been shown to be more energy efficient than FPGAs for certain applications such as matrix multiplication. To some extent, this is a result of the decision to optimize the GPU architecture for this problem [38]. However, the flexibility of an FPGA has seen it outperform a GPU, in terms of energy efficiency or performance-per-watt, across a broader spectrum of applications. Examples include 2-D FIR (finite-impulse response) filters, Viola-Jones face detection, K-means clustering, Monte-Carlo options pricing, random number generation, Smith-Waterman, and 3-D ultrasound computer tomography [37]. Energy efficiency gains using FPGAs have also been claimed on commercial systems. For example, Microsoft reported a 3x energy efficiency gain, and a reduced latency, when using FPGAs instead of GPUs on their Catapult machine [39], which is discussed later in Section 4.
While most of these performance comparisons have been performed using IEEE standard single or double precision arithmetic, this is not necessarily the most energy-efficient design possible on an FPGA. This is because FPGAs have the freedom to implement any precision, so it may be possible to create a working design using a custom (reduced precision) fixed or floating-point number format that is sufficient to satisfy a design specification. This will avoid unnecessary computation and can improve the energy efficiency and performance of an FPGA implementation dramatically [40, 41].
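A custom fixed-point format can be explored in software before committing to hardware. The sketch below is a generic signed fixed-point quantizer with saturation, not the method of [40, 41]; the function name, the round-to-nearest choice and the saturating behaviour are assumptions.

```python
def quantize_fixed(value, frac_bits, total_bits):
    """Round value to a signed fixed-point format and return it as a float.

    total_bits is the full word length including the sign bit;
    frac_bits of those bits sit to the right of the binary point.
    Out-of-range values saturate to the nearest representable value.
    """
    scale = 1 << frac_bits
    raw = round(value * scale)                # round to the nearest step
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    raw = max(lowest, min(highest, raw))      # saturate instead of wrapping
    return raw / scale


approx = quantize_fixed(0.3, frac_bits=4, total_bits=8)      # 5/16
clipped = quantize_fixed(100.0, frac_bits=4, total_bits=8)   # saturates at 127/16
```

Sweeping `frac_bits` while checking the error against a design specification is one simple way to find the smallest word length that still meets it; narrower words generally translate into fewer logic resources.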
To a degree, the flexibility of an FPGA goes even beyond that possible in an ASIC. For example, in an FPGA-based implementation of RSA cryptography [30], a different hardware modular multiplier for each prime modulus was employed (i.e., the modulus was hardwired in the logic equations of the design). Such an approach would not be practical in an ASIC, as the design effort and cost are too high to develop a different chip for different moduli. This led to greatly reduced hardware and improved performance, the implementation being an order of magnitude faster than any reported implementation in any technology at the time.
Another important application is logic emulation [42, 43], where reconfigurable computing is not only used for simulation acceleration, but also for prototyping of ASICs and in-circuit emulation. In-circuit emulation allows the possibility of testing prototypes at full or near-full speed, allowing more thorough testing of time-dependent applications such as networks. It also removes many of the dependencies between ASIC and firmware development, allowing them to proceed in parallel and hence shortening development time. As an example, it was used in [44] for the development of a two-million-gate ASIC containing an IEEE 802.11 medium access controller and IEEE 802.11a/b/g physical layer processor. Using a reconfigurable prototype of the ASIC on a commodity FPGA board, the ASIC went through one complete pass of real-time beta testing before tape-out.
Digital logic, of course, maps extremely well to fine-grained FPGA devices. The main design issues for such systems lie in partitioning of a design among multiple FPGAs and dealing with the interconnect bottleneck between chips. The Cadence Protium Rapid Prototyping Platform [45] is a commercial example of a logic emulation system and has 100 million-gate logic capacity and fast compilation and partitioning algorithms. Further discussion of interconnect time-multiplexing and system decomposition is given later in this article. Some examples of applications accelerated using early multiple FPGA systems are discussed below.
Hoang [26] implemented algorithms to find minimum edit distances for protein and DNA sequences on the Splash 2 architecture. Splash 2 can be modeled in terms of both bidirectional and unidirectional systolic arrays. In the bidirectional algorithm, the source character stream is fed to the leftmost processing element (PE), whereas the target stream is fed to the rightmost PE. Comparing two sequences of length m and n requires at least 2 × max(m+1, n+1) processors, and the number of steps required to compute the edit distance is proportional to the size of the array. The unidirectional algorithm is suited for comparing a single source sequence against multiple target sequences. The source sequence is first loaded as in the bidirectional case, and the target sequences are fed in one after the other and processed as they pass through the PEs (which results in virtually 100% utilization of processors, so that the unidirectional model is better suited for large database searches).
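The reason edit distance maps well onto a systolic array is that all cells on one anti-diagonal of the dynamic-programming table depend only on the two previous anti-diagonals, so they can be computed concurrently. The sketch below is a plain software model of that wavefront order, not the Splash 2 implementation of [26]; unit insertion, deletion and substitution costs and the function name are assumptions.

```python
def edit_distance_wavefront(source, target):
    """Edit distance computed anti-diagonal by anti-diagonal.

    Returns (distance, steps), where steps counts the anti-diagonals, i.e. the
    number of time steps an array with one PE per cell on a diagonal would need.
    """
    m, n = len(source), len(target)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    steps = 0
    for k in range(m + n + 1):                       # one wavefront per step
        # Cells (i, k - i) on this wavefront are mutually independent.
        for i in range(max(0, k - n), min(m, k) + 1):
            j = k - i
            if i == 0:
                d[i][j] = j
            elif j == 0:
                d[i][j] = i
            else:
                cost = 0 if source[i - 1] == target[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,        # deletion
                              d[i][j - 1] + 1,        # insertion
                              d[i - 1][j - 1] + cost) # match or substitution
        steps += 1
    return d[m][n], steps
```

The step count grows as m + n + 1 rather than as m × n, which is consistent with the statement above that the number of steps is proportional to the size of the array.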
A common application domain for reconfigurable computing is real-time data acquisition and signal processing. The BEE2 system [27], described in the next section, was applied to the radio astronomy signal processing domain, which included development of a billion-channel spectrometer, a 1024-channel polyphase filter bank, and a two-input, 1024-channel correlator. The FPGA-based system used a 130-nm technology FPGA, and performance was compared with 130- and 90-nm DSP chips as well as a 90-nm microprocessor. Performance in terms of computational throughput per chip was found to be a factor of 10 to 34 better than the DSP chip in 130-nm technology and 4 to 13 times better than the microprocessor. In terms of power efficiency, the FPGA was one order of magnitude better than the DSP and two orders of magnitude better than the microprocessor. Compute throughput per unit chip cost was 20–307% better than the 90-nm DSP and 50–500% better than the microprocessor.
One final emerging application domain is machine learning. Reconfigurable implementations show great promise for addressing its heavy computational demands, and reconfigurable computing is particularly strong in embedded and low-precision scenarios. Tridgell et al. demonstrated regression, classification and novelty detection using online kernel methods. Their fully pipelined implementation could process continuous data at rates higher than 1 Gbps and perform simultaneous learning and prediction with a latency of 100 ns [46]. Zhang et al. applied a roofline model to balance resource utilization and memory bandwidth in the acceleration of a deep convolutional neural network (CNN). They achieved 62 GFLOPS on a single Xilinx Virtex VC707 board, this being a 4.8X speedup over a 16 thread implementation on an Intel Xeon E5-2430 processor [47].
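The roofline model mentioned above bounds attainable throughput by the smaller of the compute ceiling and the memory ceiling. The sketch below states that standard bound in code; the function names and the numbers in the usage lines are hypothetical and are not the figures from [47].

```python
def attainable_gflops(peak_gflops, bandwidth_gb_per_s, flops_per_byte):
    """Roofline bound: limited either by compute or by memory traffic."""
    return min(peak_gflops, bandwidth_gb_per_s * flops_per_byte)


def ridge_point(peak_gflops, bandwidth_gb_per_s):
    """Operational intensity (FLOPs/byte) above which a design is compute bound."""
    return peak_gflops / bandwidth_gb_per_s


# Hypothetical design points on a platform with 100 GFLOPS peak and 10 GB/s.
memory_bound = attainable_gflops(100.0, 10.0, flops_per_byte=2.0)
compute_bound = attainable_gflops(100.0, 10.0, flops_per_byte=50.0)
```

In this made-up setting a design with little data reuse is capped at 20 GFLOPS however many arithmetic units it instantiates, which is why balancing logic resources against memory bandwidth matters for accelerator design.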
4 System Architectures
Reconfigurable computing machines are constructed by utilizing one or more FPGAs. Most systems include other elements, such as microprocessors and storage, and can be treated as processing elements and memory that are interconnected. Obviously, the arrangement of these elements affects the system performance and routability, and some examples are given in this section.
The Avnet Zedboard is a development board which integrates a single Xilinx Zynq XC7Z020 FPGA (which contains FPGA logic and a dual-core ARM Cortex-A9 processor), DDR memory, SD card, Ethernet, USB and video interfaces. This single board computer can run the Linux operating system, and it provides a low-cost entry point for teaching and research in reconfigurable computing [48].
The simplest topology for connecting multiple FPGAs involves a ring, mesh, or other fixed pattern. FPGAs serve as both logic and interconnect, providing direct communication between adjacent devices. Such an architecture is predicated on locality in the circuit design and further assumes that the circuit design maps well to the planar mesh. This architecture fits well for applications with regular local communications [49]. However, in general, high performance is hard to obtain for arbitrary communication patterns because the architecture only provides direct communications between neighboring FPGAs, and two distant FPGAs may need many other devices as “hops” to communicate, resulting in long and widely variable delays. Furthermore, FPGAs, when used as interconnects, often result in poor timing characteristics.
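The variability in delay can be seen by counting hops under shortest-path routing. The sketch below assumes idealised topologies (a 2-D mesh without wraparound and a bidirectional ring); the function names and coordinates are illustrative only.

```python
def mesh_hops(a, b):
    """Links traversed between FPGAs at grid positions a and b in a 2-D mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def ring_hops(a, b, size):
    """Links traversed between positions a and b on a bidirectional ring."""
    d = abs(a - b) % size
    return min(d, size - d)


def intermediate_devices(hops):
    """FPGAs that merely forward the signal on a path of the given length."""
    return max(0, hops - 1)


neighbour = mesh_hops((0, 0), (0, 1))    # adjacent devices
corner_to_corner = mesh_hops((0, 0), (3, 3))
```

In a 4 × 4 mesh, adjacent devices are one link apart, whereas opposite corners are six links apart with five intermediate FPGAs acting purely as interconnect, a sixfold spread in path length.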
Figure 4 depicts the SPLASH 2 architecture [50], published in 1990. Each board contains 16 FPGAs, X1 through X16. The blocks M1 through M16 are local memories of the FPGAs. A simplified 36-bit bus crossbar, with no permutation of the bit-lines within each bus, interconnects the 16 FPGAs. Another 36-bit bus connects the FPGAs in a linear systolic fashion. The local memories are dual ported, with one port connecting to the FPGAs and the other port connecting to the external bus. It is interesting to note that the crossbar was added in the SPLASH 2 machine; the original SPLASH 1 machine had only the linear connections. SPLASH 2 has been successfully used for custom computing applications such as search in genetic databases and string matching [26].
Figure 4. SPLASH 2 architecture. Each board contains 16 FPGAs, X1 through X16. The blocks M1 through M16 are local memories of the FPGAs. A simplified 36-bit bus crossbar, with no permutation of the bit-lines within each bus, interconnects the 16 FPGAs. Another 36-bit bus connects the FPGAs in daisy-chain fashion. The local memories are dual ported, with one port connecting to the FPGAs and the other port connecting to the external bus.
Other designs have used a hierarchy of interconnect schemes, differing in performance. The use of multi-gigabit transceivers (MGTs) available on contemporary FPGAs allows high bandwidth interconnection using commodity components. An example is the Berkeley Emulation Engine 2 (BEE2) [27], designed for reconfigurable computing and illustrated in Fig. 5. Each compute module consists of five FPGAs (Xilinx XC2VP70) connected to four double data rate 2 (DDR2) dual inline memory modules (DIMMs) with a maximum capacity of 4 GB per FPGA. Four FPGAs are used for computation and one for control. Each FPGA has two PowerPC 405 processor cores. A local mesh connects the computation FPGAs in a 2-D grid using low-voltage CMOS (LVCMOS) parallel signaling. Off-module communications are via 18 (two from the control FPGA and four from each of the compute FPGAs) Infiniband 4X channel-bonded 2.5-Gbps connectors that operate full-duplex, which corresponds to a 180-Gbps off-module full-duplex communication bandwidth. Modules can be interconnected in different topologies including tree, 3-D mesh, or crossbar. The use of standard interfaces allows standard network switches such as Infiniband and 10-Gigabit Ethernet to be used. Finally, a 100Base-T Ethernet connection to the control FPGA is present for out-of-band communications, monitoring, and control.
Figure 5. BEE2 Compute Module block diagram. Compute modules can be interconnected via the Infiniband IB4X connectors, either directly or via a 10-Gigabit Ethernet switch. The 100Base-T Ethernet can be used for control, monitoring, or data archiving.
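The 180-Gbps figure for the BEE2 module follows from the connector counts given in the text. The arithmetic is spelled out below; reading "4X" as four channel-bonded lanes per connector is an assumption of this sketch, and the constant names are illustrative.

```python
CONTROL_FPGA_CONNECTORS = 2        # from the text
COMPUTE_FPGAS = 4                  # from the text
CONNECTORS_PER_COMPUTE_FPGA = 4    # from the text
LANES_PER_CONNECTOR = 4            # assumption: "4X" means four bonded lanes
GBPS_PER_LANE = 2.5                # from the text

total_connectors = (CONTROL_FPGA_CONNECTORS
                    + COMPUTE_FPGAS * CONNECTORS_PER_COMPUTE_FPGA)
gbps_per_connector = LANES_PER_CONNECTOR * GBPS_PER_LANE
off_module_gbps = total_connectors * gbps_per_connector
```

This gives 18 connectors at 10 Gbps each, i.e. 180 Gbps in each direction for the whole module.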
Commercial machines, such as the Maxeler MPC-X2000 system [51], have a similar interconnect structure to the BEE2 in that they are parallel machines employing high performance microprocessors tightly coupled to a relatively small number of FPGA devices per node. The MPC-X2000 is a 1U server with eight large FPGAs, called dataflow engines (DFEs), interconnected in a ring arrangement. A total of 384 GB of dynamic RAM is supported, and multiple host processors can communicate with each DFE via a high-speed Infiniband switched interconnect network. Such machines can have orders of magnitude performance improvement over conventional architectures, and switching topologies can be altered via configuration of the switching fabric.
Microsoft took a different approach in their Catapult machine, choosing a single daughter card per server over multi-FPGA boards for reasons of scalability, capacity, power, space and reliability [52]. Each FPGA card operates under 20 W, is hosted by a server via PCI Express and contains 8 GB of dynamic RAM. The FPGA boards are organized in a 24U arrangement of 48 half-width 1U servers, directly connected together with SAS cables. A test system containing 1,632 servers was shown to reduce the tail latency of the Microsoft Bing search engine by 29% and improve ranking throughput of each server by 95%.
The Intel-Altera Heterogeneous Architecture Research Platform (HARP) utilizes the Intel QuickPath Interconnect (QPI) in a dual socket motherboard, with the processor and FPGA each occupying a socket [53]. This offers higher bandwidth and lower latency than conventional daughter cards. A coherent shared memory between the processor and FPGA gives the promise of a greatly simplified programming model and tighter processor-FPGA coupling, which will benefit irregular data access patterns.
5 Runtime Reconfiguration
A reconfigurable computing system can have its functionality updated during execution, resulting in reduced resource requirements. A runtime reconfigurable system partitions a design temporally so that the entire design does not need to be resident in the FPGA at any given moment [54, 55]. Instead, the FPGA fabric is time-shared between specialized hardware accelerators at runtime. Using this technique, designs larger than the available hardware resources can be realized, or alternatively, an existing design may be implemented on a smaller or cheaper device. Furthermore, energy efficiency can be increased because the entire fabric can be used more effectively.
Single context, multiple context, and partially reconfigurable FPGA architectures have been developed. In a single context system, any changes to the functionality of the FPGA involve reloading the entire bitstream; early FPGAs were of this type. This scheme has the disadvantage of long reconfiguration time. Multiple context or time-sharing architectures lie at the other extreme. These allow a number of complete configurations to be stored in the fabric simultaneously, and thus reconfiguration can be achieved in a small number of cycles. These architectures were also proposed for early FPGAs. As an example, an architecture named Dharma was proposed that contains a functional block and an interconnect network [56]. By breaking a large design into levels in a folded pipeline, the logic modules and interconnect can be time-shared by dynamically reconfiguring each level. This topology simplifies the architecture and provides predictable interconnect delay (Fig. 6). Multiple context architectures, such as NEC's Dynamically Reconfigurable Processor (DRP) [57], were later developed. Such architectures have the shortest context switch time; however, a larger area overhead is associated with implementation of this scheme.
Figure 6. Dynamic Architecture for FPGA-based systems. The architecture contains a functional block and an interconnect network. The interconnect and the logic can be time shared. The emulated design topology is levelized in a folded pipeline manner. The levelized topology simplifies the architecture with predictable interconnect delay.
Partially reconfigurable FPGAs, as supported by the major FPGA vendors in Xilinx Virtex [15] and Altera Stratix [16] architectures, have begun to dominate the market. These architectures allow portions of the FPGA to be changed via a memory mapped scheme whilst the other portions of the FPGA continue functioning. In comparison to a single context scheme, there is some area overhead associated in providing this feature; ideally this is compensated for by more efficient use of the fabric.
Many tools have been developed to help support runtime reconfiguration. Commercial tools, provided by the main FPGA vendors Xilinx and Altera, aim to abstract the low level implementation details from the engineer. However, open source tools have also been developed to enable more flexible systems. For example, ReCoBus-Builder introduced a simple interface for communication between the static part of a system and the dynamic modules, as well as the ability to place and route partial modules separately, before linking these compiled bitstreams at run-time [58]. This makes modules interchangeable and speeds up the compilation process. The GoAhead tool takes this further, allowing the FPGA fabric to be separated into different regions, with individual modules compiled to fit into one or more of these regions [59]. It also provides support for modules to communicate between or across regions. This improves flexibility in placement of modules and promotes sharing across regions. Tools to help determine the optimum number of regions have also been developed [60].
The aforementioned tools are also able to support hierarchical designs, where a partial region does not need to be fully reconfigured; instead a smaller region within this area can be reconfigured. This has multiple advantages. Firstly, storing a few different modules at each hierarchical level provides a huge amount of flexibility, saving significant configuration memory in comparison to storing all different modules at the highest level. Furthermore, since only a small region needs to be reconfigured, the reconfiguration time is reduced [61].
Tools also exist to help overlap reconfiguration and computation to maximize the performance of the device. For example, ZyCAP, which is based on the Xilinx Zynq architecture with an embedded ARM CPU, provides software drivers to help reconfiguration be overlapped with computation by controlling all the reconfiguration processes [62]. It can alert the software that configuration is complete, and also manages how partial bitstreams are stored in memory. This is important to help maximize performance; for example, this tool offers the ability to cache partial bitstreams in DRAM to speed up the reconfiguration process. Finally, there are also efforts to verify that the partial bitstreams perform the desired functionality [63].
There are many examples of run-time reconfiguration, with the logical unit of reconfiguration ranging from application level down to a sub-instruction. These are discussed below:
At the application level, examples include adapting the bitstream according to changes in environmental conditions. For example, Claus et al. discussed how hardware accelerators may be needed for real-time video processing, but in the context of driver assistance, adapting them according to changing light conditions could improve performance. They demonstrated that this can prove worthwhile since modules can be quickly reconfigured between frames [64].
Task level reconfiguration is common for software defined radio, for example when switching between encoding schemes. The trade-offs between full or partial reconfiguration in this problem domain are discussed by Delahaye et al. [65]. Similarly, Feillen et al. discussed how different stages of digital video decoding do not need to operate concurrently, meaning the same hardware could be re-used in this example [66]. Task level reconfiguration for an operating system has also been proposed [67]. Under control of software running on a microprocessor, task circuits can be scheduled online and placed in a suitable free space in a hardware task area. Communications between tasks and I/O are done through a task communication bus, and termination of a task frees the reconfigurable resources used. It was shown that hardware in the hardware task area can be shared by tasks and that the overheads associated with its implementation on a partially configurable platform were acceptably low. This helps improve scheduling of real-time tasks.
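The idea of placing task circuits into free space and releasing that space on termination can be sketched with a toy allocator. The model below is an assumption made for illustration and is not the scheme of [67]: the hardware task area is treated as a one-dimensional row of columns, placement is first-fit, and the class and method names are hypothetical.

```python
class HardwareTaskArea:
    """Toy 1-D model of a reconfigurable region shared by hardware tasks."""

    def __init__(self, columns):
        self.owner = [None] * columns   # which task occupies each column

    def place(self, task, width):
        """First-fit placement; returns the start column or None if no space."""
        if width < 1:
            raise ValueError("a task must occupy at least one column")
        run = 0
        for col, occupant in enumerate(self.owner):
            run = run + 1 if occupant is None else 0
            if run == width:
                start = col - width + 1
                for c in range(start, col + 1):
                    self.owner[c] = task
                return start
        return None

    def terminate(self, task):
        """Free every column held by the task so later tasks can reuse it."""
        self.owner = [None if o == task else o for o in self.owner]


area = HardwareTaskArea(columns=8)
start_a = area.place("a", 3)
start_b = area.place("b", 4)
rejected = area.place("c", 2)      # only one free column remains
area.terminate("a")
start_c = area.place("c", 2)       # reuses the space freed by task "a"
```

A real system would also have to account for reconfiguration time, fragmentation and the task communication bus; those aspects are left out here.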
Instruction level reconfiguration has been demonstrated for hardware accelerated database queries. Different hardware modules for SQL queries could be dynamically configured to improve performance [68] and energy efficiency [69]. A CPU system with custom instructions is another common candidate for instruction level reconfiguration. An early example is the Dynamic Instruction Set Computer (DISC) [70], which supported demand-driven modification of the instruction set through partial reconfiguration. The commercial Stretch processor [71] combines reconfigurable fabric with a processor to support the execution of custom instructions implemented on a reconfigurable fabric. Furthermore, the fabric can be reconfigured at runtime and the design environment is software-centric, with programming of the processor being in Stretch C.
Finally, partial reconfiguration has also been shown for sub-instructions. For example, a pipeline stage could be a convenient unit of reconfiguration, as demonstrated by incremental pipeline reconfiguration [72]. Assume an FPGA that has enough silicon area for N physical pipeline stages, but the design contains M pipeline stages (where M >> N). By adding one pipeline stage and removing the trailing pipeline stage in each stage of the computation, reconfiguration and computation can be overlapped. Such a circuit will implement a pipeline of depth N and fully utilize the FPGA at any given point in time. Runtime reconfiguration can be done at even lower levels. Examples include those supporting hierarchical designs for a CPU with greater numbers of custom instructions [61] and a crossbar switch which employs runtime reconfiguration of the FPGA's routing resources [73]. By partially reconfiguring routing multiplexers, this scheme was able to achieve density, switch update latency and performance higher than possible using conventional means.
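The sliding-window behaviour of incremental pipeline reconfiguration can be sketched as a schedule of which virtual stages are resident at each step. This is a simplified model rather than the mechanism of [72]: it ignores data movement and configuration latency, and the function name is hypothetical.

```python
from collections import deque


def pipeline_schedule(num_virtual, num_physical):
    """List the virtual stages resident in the fabric after each step.

    Each step configures one new virtual stage; once all physical stages are
    occupied, the oldest (trailing) stage is evicted to make room.
    """
    resident = deque(maxlen=num_physical)  # a full deque drops its oldest item
    history = []
    for stage in range(num_virtual):
        resident.append(stage)
        history.append(list(resident))
    return history


schedule = pipeline_schedule(num_virtual=5, num_physical=3)
```

After the initial fill, every step holds exactly N resident stages, matching the statement that the circuit implements a pipeline of depth N while the M-stage design is streamed through it.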
6 Design methods
Hardware description languages (HDLs) such as the Very High Speed Integrated Circuit Hardware Description Language (VHDL) and Verilog are commonly used to specify the logic of a reconfigurable system. Descriptions in these languages have the advantage of being vendor neutral, so the same description can be synthesized for different targets such as different FPGA devices, different FPGA vendors, and ASICs. For this reason, these languages are often the target language for higher level tools that offer higher levels of abstraction.
Module generators and libraries are commonly deployed to promote reuse. For example, vendors such as Altera and Xilinx have parameterized libraries of components that can be used in a design. These libraries are generated so that a circuit optimized for the particular application can be produced. As an example, a parameterized floating point library might allow the word length of the exponent and significand to be specified, as well as whether denormalized numbers are supported. The module generator then generates a netlist or VHDL-based floating point adder that can be included in a design. Open source libraries such as FloPoCo also provide vendor neutral alternatives for generating many key components [74].
In an effort to help make FPGAs more mainstream, efforts have been placed into high-level synthesis, which is the process of compiling a traditional high level language down to a netlist or HDL. The use of traditional programming languages improves productivity, as low level details are handled by the compiler. This is analogous to C versus assembly language for software development. Another difference with potentially large implications is that, using these tools, software developers can also design reconfigurable computing applications.
As an early example, Luk and Page described a simple compilation process [75, 76] from a high level language with explicit parallel extensions to a register transfer language (RTL) description. Parallel execution of statements is implemented via parallel processes, and these can communicate via channels through which a single-word message can be passed. Variables in the user program are mapped to registers, all expressions are implemented as combinational logic, and multiplexers are used in the case a register has multiple sources. A datapath that matches the dataflow graph of the input source description is generated using this strategy. The clocking scheme employed is a global, synchronous one, and a convention that each assignment takes exactly one clock cycle is followed. A start signal is used to feed the clock and to enable each register that corresponds to a variable, and a finish signal is generated for the assignment in the following clock cycle. To execute statements sequentially, the start and finish signals of adjacent statements are simply connected together, creating a one-hot distributed control scheme. Conditional statements and loops are formed by asserting one of several possible start signals that correspond to alternative basic blocks in a program. Completion of conditional or loop constructs and synchronization of parallel blocks are implemented by combining relevant finish signals using the appropriate combinatorial logic. An example showing the translation of a simple code fragment to control and datapath is shown in Fig. 7.
Figure 7. Hardware compilation example. The C program is translated into a datapath (top) and control
(bottom). Execution of the statements in the while loop is controlled by s1 and s2; s0 and s3 correspond to
the start signals of the statements before and after the while loop.
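The one-hot control scheme can be illustrated with a small software simulation. The sketch below is not the Luk and Page compiler; it assumes a hypothetical four-statement program (x = 0; a while loop containing x = x + 1 and y = y + x; then done = 1) and models the start signals s0 to s3 as a single token that advances one statement per clock cycle. The function name and register names are illustrative only.

```python
def run_one_hot(limit=3, max_cycles=100):
    """Simulate one-hot token control for the assumed program:
         s0: x = 0
         while x < limit:
             s1: x = x + 1
             s2: y = y + x
         s3: done = 1
    Each assignment takes exactly one clock cycle.
    """
    regs = {"x": 0, "y": 0, "done": 0}   # program variables map to registers
    start = [1, 0, 0, 0]                 # one-hot start signals s0..s3
    trace = []                           # which statement was active each cycle
    for _ in range(max_cycles):
        assert sum(start) == 1           # exactly one start signal is asserted
        active = start.index(1)
        trace.append(active)
        nxt = [0, 0, 0, 0]
        if active == 0:
            regs["x"] = 0
            # finish of s0 is steered by the loop condition to s1 or s3
            nxt[1 if regs["x"] < limit else 3] = 1
        elif active == 1:
            regs["x"] += 1
            nxt[2] = 1                   # finish of s1 wired to start of s2
        elif active == 2:
            regs["y"] += regs["x"]
            # finish of s2 re-evaluates the loop condition
            nxt[1 if regs["x"] < limit else 3] = 1
        else:
            regs["done"] = 1
            return regs, trace
        start = nxt
    raise RuntimeError("program did not finish within max_cycles")
```

The length of the returned trace is the cycle count, which makes the one-assignment-per-cycle convention directly visible.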
High-level synthesis tools have since moved beyond simply translating a high-level language to a
hardware design; instead they focus on creating an optimized hardware design. Straightforward
examples include extracting parallelism through loop unrolling or creating deeply pipelined designs
to maximize clock frequency. However, finer-grained optimizations are also possible. For example, since
moving data onto the FPGA can be expensive, storing data locally on the chip and reusing it can
have substantial performance implications [77]. While the idea is similar to that of caching on a
microprocessor, on an FPGA the local buffers can be sized according to the needs of a particular
algorithm, saving resources. Alternatively, the memory can be arranged in a fashion that provides a
large memory bandwidth, which may be necessary to feed a parallel datapath. Moreover, the interface
with the DRAM can also be controlled to ensure that memory reads occur faster, for example by
‘activating’ rows that are soon to be read [78]. Another optimization is to fine-tune the precision used
throughout computations, i.e., to use as little precision as is necessary to meet the design specification.
Unlike a CPU implementation, an FPGA design has the freedom to implement any precision. Since
arithmetic operators with less precision use less silicon area, using the minimum precision necessary
frees resources, allowing for greater parallel performance. Tools that support this design methodology,
in both fixed-point and floating-point arithmetic, have been developed [79, 80].
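To make the idea of precision tuning concrete, the following sketch searches for the smallest number of fractional bits at which a fixed-point dot product stays within a given error tolerance. It is a brute-force illustration on a single test vector, not the analytical methods of [79, 80], which bound the error over all inputs; the function names are hypothetical.

```python
def quantize(value, frac_bits):
    """Round value to the nearest multiple of 2**-frac_bits.

    Python's round() sends exact ties to the even neighbour.
    """
    scale = 1 << frac_bits
    return round(value * scale) / scale


def dot_fixed(a, b, frac_bits):
    """Dot product with operands and accumulator quantized at every step."""
    acc = 0.0
    for x, y in zip(a, b):
        product = quantize(x, frac_bits) * quantize(y, frac_bits)
        acc = quantize(acc + product, frac_bits)
    return acc


def min_frac_bits(a, b, tolerance, max_bits=32):
    """Smallest fractional bit-width whose error is within tolerance.

    Returns None if no width up to max_bits is sufficient.
    """
    exact = sum(x * y for x, y in zip(a, b))
    for bits in range(1, max_bits + 1):
        if abs(dot_fixed(a, b, bits) - exact) <= tolerance:
            return bits
    return None
```

A looser tolerance yields a narrower datapath, which on an FPGA translates into smaller operators and room for more parallel copies.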
Various other issues with the mapping of algorithms to hardware are discussed more generally
by Isshiki and Dai [81], who focus on the differences between implementing bit-serial and bit-parallel
modules (e.g., adders and multipliers) on FPGA architectures. Although latency is larger for bit-serial
modules, the reduction in area frequently makes area-time products significantly lower for such
implementations. More specifically, bit-serial designs offer the following advantages: 1) for bit-parallel
modules, the I/O pin limitation is a major problem, and the large size of the module cluster can result in
unused space and underutilized logic resources; 2) bit-serial modules are easier to partition, as cell-to-cell
connections are sparse and do not cause I/O problems; and 3) high-fanout nets can impair
routability of bit-parallel modules. Leong and Leong [82] generalized further with a design methodology
that can translate a dataflow description with signals of different word lengths to a digit-serial design.
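The area versus latency trade-off of bit-serial arithmetic can be seen in a minimal model of a bit-serial adder: a single full adder and one carry flip-flop are reused for every bit, so the hardware is roughly one bit wide but the addition takes as many cycles as the word length. The sketch below is a behavioural illustration only, with a hypothetical function name.

```python
def bit_serial_add(a, b, width):
    """Add two unsigned width-bit integers one bit per clock cycle, LSB first.

    Returns (sum modulo 2**width, carry out, cycles used).
    """
    carry = 0                              # the single carry flip-flop
    result = 0
    for cycle in range(width):
        abit = (a >> cycle) & 1
        bbit = (b >> cycle) & 1
        sum_bit = abit ^ bbit ^ carry      # full-adder sum output
        carry = (abit & bbit) | (carry & (abit ^ bbit))  # full-adder carry
        result |= sum_bit << cycle
    return result, carry, width
```

A bit-parallel adder would compute the same result in one cycle using `width` full adders, which is the comparison behind the area-time argument above.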
Commercial tools that can compile standard programming languages such as Java, C, or C++
(e.g., [76]) are available. Examples include Xilinx's Vivado HLS [15], Maxeler's MaxCompiler [83] and
Catapult C from Mentor Graphics [84]. Domain-specific languages such as MATLAB/Simulink offer even
greater improvements in productivity because they are interactive, include a large library of primitive
routines and toolkits, and have good graphing capabilities. Indeed, many designs for communications
and signal processing are first prototyped in MATLAB and then converted to other languages for
implementation. Tools such as the MATCH compiler [85], Xilinx System Generator [15], Altera DSP
Builder [16] and MathWorks' HDL Coder [86] can translate a subset of MATLAB/Simulink directly to an
FPGA design. There is also interest in supporting more parallel C-to-gates flows. Support for more
recent parallel programming languages is gaining traction, for example the Altera SDK for OpenCL [16] and
efforts to support NVIDIA's CUDA using FPGAs [87].
Because of the difficulty of creating a full-custom design, there is also support for creating
hardware/software co-designs. The availability of embedded operating systems such as Linux for
microprocessors on an FPGA provides a familiar software development environment for programmers,
greatly facilitating program development through the availability of a large range of open-source
libraries as well as high-quality development tools. Such tools can greatly shorten development
time and improve the quality of embedded systems. For example, Altera's Nios II C-to-Hardware
acceleration compiler enables time-critical functions in a C program to be converted to a hardware
accelerator that is tightly coupled to a microprocessor within the FPGA [88]. These tools support soft
processors, such as the Altera Nios or Xilinx MicroBlaze, and embedded processors, such as those on
the Xilinx Zynq or Altera SoCs. With the latter, for optimal performance, the parts of an algorithm that
are easily parallelizable should make use of the parallel FPGA fabric, whereas serial parts of the
algorithm should be run on a processor [89].
A final design approach that allows for fast FPGA prototyping is the use of overlay architectures.
These are coarse-grained architectures with software-like programmability, which sacrifice
some performance in exchange for ease of implementation. For example, VectorBlox extends the
hardware-software paradigm by using the FPGA fabric to provide parallel vector instructions that can be
easily executed [90]. A typical design flow using this technology would be to create an initial software
design, add vector instructions within a software-style development flow to obtain some acceleration, and
finally create custom hardware instructions for the most time-consuming parts of an algorithm. This may
provide a faster time to market. Many overlay architectures have been created, including some for
specific applications, such as efficient network-on-chip (NoC) interconnection of processors [91] or
dataflow graphs [92], and some designed to make use of specific hardened components on FPGAs such
as DSPs [93].
7 Multichip Systems
Special care must be taken in the design of large and multichip reconfigurable systems. In this section,
we describe some theoretical results relevant to the major architectural issues associated with such
designs.
7.1 Interconnect Organization
A classic Clos network [94] contains three stages: inputs, intermediate switches, and outputs, as shown
in Fig. 8. It can be used to interconnect pins in a reconfigurable computing system, and its input and
output stages are symmetric. Suppose the first stage has r crossbar switches of size n × m, the second stage has
m switches of size r × r, and the third stage has r switches of size m × n; we denote this network as c(n, m, r). For any
two-pin net interconnect requirement, the network c(n, m, r) can achieve complete routability if m is
not less than n. The routing method can be described by recursive operations [95]. In the first iteration,
we reduce the network to c(n-1, m-1, r). In the ith iteration, we reduce the network to c(n-i, m-i, r).
When n-i = 1, we have r switches of size 1 × (m-n+1) in the first stage, m-n+1 switches of size r × r in the
second stage, and r switches of size (m-n+1) × 1 in the third stage. In other words, only one input exists in
each first-stage switch and one output in each third-stage switch. In this case, one second-stage r × r switch
is enough to route the r inputs of the r first-stage switches to the r outputs of the r third-stage switches,
thus completing the interconnect.
Figure 8. Clos network. A Clos network contains three stages: inputs, intermediate switches, and
outputs. The input and output stages are symmetric. In the figure, the first stage has r switches of size n × m,
the second stage has m switches of size r × r, and the third stage has r switches of size m × n.
The reduction from c(n-i, m-i, r) to c(n-i-1, m-i-1, r) can be derived by a maximum matching
algorithm. The matching algorithm selects disjoint signals from different input switches to different
output switches. One second-stage switch is then used to route the selected signals. By Hall's
theorem, the maximum matching and routing can always reduce the network to c(n-i-1, m-i-1, r).
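The matching-based reduction can be sketched in software as follows. This is a minimal illustration, not the procedure of [95]: it assumes each request is a pair (input switch, output switch), pads the request set with dummy nets so that every switch carries exactly n nets, and then repeatedly extracts a perfect matching with a simple augmenting-path search, assigning each matching to one second-stage switch. All function and variable names are hypothetical.

```python
def perfect_matching(edges, remaining, r):
    """Return owner[o] = index of the edge matched to output switch o."""
    adj = [[] for _ in range(r)]
    for e in remaining:
        adj[edges[e][0]].append(e)
    owner = [None] * r

    def augment(i, seen):
        # Try to match input switch i, re-routing earlier choices if needed.
        for e in adj[i]:
            o = edges[e][1]
            if o in seen:
                continue
            seen.add(o)
            if owner[o] is None or augment(edges[owner[o]][0], seen):
                owner[o] = e
                return True
        return False

    for i in range(r):
        if not augment(i, set()):
            raise ValueError("no perfect matching; graph is not regular")
    return owner


def route_clos(requests, n, r):
    """Assign each two-pin request to one of n second-stage switches."""
    in_deg, out_deg = [0] * r, [0] * r
    for i, o in requests:
        in_deg[i] += 1
        out_deg[o] += 1
    assert max(in_deg + out_deg, default=0) <= n, "a switch has more than n nets"

    # Pad with dummy nets so that every switch carries exactly n nets.
    edges = list(requests)
    i = o = 0
    while True:
        while i < r and in_deg[i] == n:
            i += 1
        while o < r and out_deg[o] == n:
            o += 1
        if i == r:
            break
        edges.append((i, o))
        in_deg[i] += 1
        out_deg[o] += 1

    remaining = set(range(len(edges)))
    assignment = [None] * len(edges)
    for middle in range(n):               # each round removes one matching
        for e in perfect_matching(edges, remaining, r):
            assignment[e] = middle
            remaining.discard(e)
    return assignment[:len(requests)]     # drop the dummy nets
```

Because each round removes exactly one net from every input and output switch, the remaining requests always form a smaller regular instance, mirroring the step from c(n-i, m-i, r) to c(n-i-1, m-i-1, r).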
Conceptually, the routing problem can also be formulated as edge coloring on a bipartite graph
G(V1, V2, E) [96]. The node sets V1 and V2 represent the switches in the input and output stages,
respectively. An edge in E represents a two-pin net interconnect requirement between the
corresponding input and output switches. In Reference [96], Chan and Schlag assigned colors to the
edges of the bipartite graph. Edges of the same color are bundled into one group, and the corresponding
set of nets is routed by one switch in the second stage. The work of Reference [97] was then used to
find a minimum edge coloring solution in O(|E| log n) time.
The three-stage Clos network can be folded into a two-stage network (Fig. 9) so that the inputs
and outputs are mixed in the first stage. Thus, the corresponding bipartite graph G(V1, V2, E) constructed
above for edge coloring is also folded, with V1 and V2 merged into one set.
Figure 9. Folded Clos network. The three-stage Clos network is folded into a two-stage network so that
the inputs and outputs are mixed in the first stage.
To find the routing assignment, the folded edge coloring graph can be unfolded back to a
bipartite graph using an Euler path search. The Euler path traverses every edge exactly once and defines
the edge direction according to the direction of the traversal. We then recover the original bipartite
graph by splitting the node set back into two sets V1 and V2 and unfolding the edges such that all edges are
directed from V1 to V2. We can find the minimum edge coloring solution of the unfolded bipartite graph
and apply the solution back to the folded routing problem.
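The unfolding step can be illustrated with a short sketch that assigns a direction to every edge of the folded graph. It assumes every node has even degree and, for brevity, walks closed trails rather than splicing them into a single Euler circuit; the resulting orientation has the same property that matters here, namely that each node has as many outgoing as incoming edges. The function name is hypothetical.

```python
def orient_balanced(num_nodes, edges):
    """Orient undirected edges so every node has in-degree == out-degree.

    edges: list of (u, v) pairs; every node must have even degree.
    Returns a list of directed (tail, head) pairs, one per input edge.
    """
    adj = [[] for _ in range(num_nodes)]
    for idx, (u, v) in enumerate(edges):
        adj[u].append((idx, v))
        adj[v].append((idx, u))
    if any(len(neighbours) % 2 for neighbours in adj):
        raise ValueError("all node degrees must be even")

    used = [False] * len(edges)
    directed = [None] * len(edges)
    for start in range(num_nodes):
        node = start
        # Follow unused edges; with even degrees the walk can only get
        # stuck back at its starting node.
        while adj[node]:
            idx, other = adj[node].pop()
            if used[idx]:
                continue                  # already traversed from the other end
            used[idx] = True
            directed[idx] = (node, other)
            node = other
    return directed
```

Each directed pair (tail, head) is then read as an edge from the V1 copy of the tail node to the V2 copy of the head node, which gives the unfolded bipartite graph on which edge coloring is performed.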
In practice, the first-level crossbar of the Clos network is replaced with FPGAs to save board
space (Fig. 10). Routability is then worse than that of an ideal Clos network. Even with a true Clos network,
complete routability of multipin nets is not guaranteed, which is an important practical consideration
because many multipin nets typically exist in microelectronic designs.
Figure 10. Variations of the Clos network. The first-level crossbar of the Clos network is replaced with
FPGAs to save board space. Routability is worse than that of an ideal Clos network.
In an attempt to solve the multipin net and routability problem, we can introduce extra
connections among FPIDs as shown in Fig. 11. However, extra FPID interconnections also incur extra
delay. We can also expand the fanout width of FPGAs so that each FPGA I/O pin is connected to more
than one FPIC [98, 99]. The fanout width expansion improves routability without significant additional
delay. The multiple appearances of I/O pins increase the probability that a signal connection can be
made in a single stage, which is especially critical for multipin nets. However, the additional fanouts
increase the needed pin count of FPICs. Thus, we need to find a balanced fanout distribution that
reduces the interconnect delay with a minimal pin requirement.
Figure 11. Variations of the Clos network. The fanout width of FPGAs is expanded so that each FPGA I/O pin
is connected to more than one FPIC. The fanout width expansion improves routability without significant
additional delay.
A tree-structured network can simplify the mapping process for certain applications. In
Reference [100], an example of a tree-structured network is illustrated for a Very Large Scale Simulator
(VLSS). The VLSS tree structure has all logic components located at the leaves and interconnect switches
at the internal nodes. The machine has a capacity of eight million gates. Each branch is an 8-bit bus.
The higher the level in the tree, the less parallelism the signal distribution can achieve. Therefore, a
partitioning process is designed to minimize the high-level interconnect and maximize parallel
operation.
7.2 Interconnect Multiplexing
Time multiplexing is an effective method for tackling the scalability problem in interconnecting large
designs. The time-sharing method can be extended from traditional bus organization [42, 100] to
network sharing [101] and further to function block sharing [56].
Interconnect can be time-shared as a bus [42, 100]. If n communication lines exist between two
FPGAs, they can be reduced to a single line by storing logical outputs in shift registers and
time-multiplexing the communication in phases. Such a scheme was employed in the virtual wires logic
emulation system [42], which is efficient because interconnects are normally capable of being clocked at
much higher rates than the critical path of the rest of the system, and not all logical wires are
simultaneously active. This scheme can greatly reduce the complexity of the interconnecting network or
printed circuit board in a multi-FPGA system.
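A toy model of this time-multiplexing scheme is sketched below. It is not the virtual wires implementation of [42]: it simply loads n logical values into a shift register, sends them over one physical wire in n phases, and reassembles them at the receiver. The helper that estimates the achievable emulation clock is a deliberate simplification that ignores logic delay and scheduling; all names are hypothetical.

```python
def send_multiplexed(logical_values):
    """Serialize n logical wire values over one physical wire, one per phase.

    Returns (values reassembled at the receiver, number of phases used).
    """
    shift_reg = list(logical_values)      # parallel load at the sending FPGA
    received = []
    phases = 0
    while shift_reg:
        bit = shift_reg.pop(0)            # shift one value onto the wire
        received.append(bit)              # receiver shifts it in
        phases += 1
    return received, phases


def max_emulation_clock_mhz(link_clock_mhz, n_wires):
    """Upper bound on the emulated clock when n wires share one link.

    Simplifying assumption: one full round of n phases per emulated cycle.
    """
    return link_clock_mhz / n_wires
```

The second function shows why the scheme relies on the physical link running much faster than the emulated design: sharing one pin among more logical wires lowers the bound proportionally.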
Li and Cheng [101] proposed that a dynamic network be viewed as L conventional
FPICs overlapped together but sharing the same I/O pins. A dynamic routing architecture can increase routability
and shorten interconnect length. Each switching network is a full crossbar, which can be reconfigured to
provide any connections among I/O pins. The select lines are used to activate only one switching
network at a time; thus the I/O pins are dynamically connected according to the configuration of the
active switching network. By dynamically reconfiguring the FPICs, L logic signals can time-share the same
interconnect resources.
7.3 Memory Allocation
Interconnect schemes should also consider how memory is connected to the FPGAs. Although
combining memory with logic in the same FPGA is the most desirable method for reducing routing
congestion and signal delay, separate components can supply much larger capacity at higher density and
lower price. Figure 12 shows three different ways of allocating the memories in a Clos network
[96, 102]. The memory may be attached directly to a local FPGA (Fig. 12a), attached to the second-stage
switches of the Clos network via a host interface (Fig. 12b), or attached to the first-stage switches of the
Clos network (Fig. 12c). The first method provides good performance for local memory access. However,
for nonlocal memory access, routability and delay are concerns. The second method is
slower than the first method for local memory accesses but provides better routability. The third is the
most flexible, as the memory is attached to the network and the routability is high. However, every
logic-to-memory communication must go through the second interconnect stage.
Figure 12. Memory organization. (a) Memory is attached directly to a local FPGA. (b) Memory is
attached to the second-stage switches of the Clos network via a host interface. (c) Memory is attached
to the first-stage switches of the Clos network.
7.4 Bus Buffer Insertion
In FPGAs, signal propagation is inherently slow because of the programmable interconnect.
However, the delay of long routing wires can be drastically reduced by buffer insertion. The principle at
work is that by inserting buffers we can decouple the capacitive effects of components and interconnect
driven by the buffers and thereby improve RC delay.
Given a routing topology for a net and timing requirements for its sinks, an efficient optimal
buffer insertion algorithm was proposed in [103]. Experimental results show dramatic improvement
over the unbuffered solution. Thus, it is advantageous to have abundant buffers in FPGAs. However,
each possible buffer and its programmable switch add capacitance to the wires, which in turn
contributes to delay. Thus, a balance point needs to be identified to trade off the additional
delay and capacitance of the buffers against the improvement they can provide.
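The trade-off can be made concrete with a first-order Elmore delay model of a uniform wire divided into equal segments by k buffers. This is a textbook approximation under assumed, unitless parameter values, not the algorithm of [103]: r and c are wire resistance and capacitance per unit length, rb and cb are the buffer output resistance and input capacitance, and tb is the intrinsic buffer delay. The driver and the final load are assumed to look like a buffer.

```python
def wire_delay(length, k, r=1.0, c=1.0, rb=10.0, cb=1.0, tb=5.0):
    """Elmore delay of a wire of the given length with k evenly spaced buffers.

    All default parameter values are illustrative assumptions.
    """
    seg = length / (k + 1)                       # length of each segment
    per_stage = (rb * (c * seg + cb)             # driver charging wire and load
                 + r * seg * (c * seg / 2 + cb)) # distributed wire RC
    return (k + 1) * per_stage + k * tb          # k buffers add intrinsic delay


def best_buffer_count(length, max_k=50, **params):
    """Number of buffers in 0..max_k that minimizes the modelled delay."""
    return min(range(max_k + 1), key=lambda k: wire_delay(length, k, **params))
```

The unbuffered delay grows quadratically with wire length, whereas each added buffer shortens the segments but contributes its own resistance, capacitance and intrinsic delay, so the modelled delay has a minimum at an intermediate buffer count.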
For a multisource bus, the problem of buffer insertion becomes more complicated, because
the optimization for one source may sacrifice the delay of others. Furthermore, the direction of the
buffer needs to be arbitrated by a controller. Instead of using such a controller, a novel approach is to
use a patented open-collector bus repeater [104]. When idle, the two ends of the repeater are set to
high. When the repeater senses a pull-down action on one side, it presents the signal on the other
side until the pull-down action is released by the originating side. The bus repeater eliminates the
need for a direction control signal, resulting in a simpler design and better use of resources.
7.5 System Decomposition
To decompose a system into multiple devices, Yeh et al. [105] proposed an algorithm based on the
relationship between uniform multi-commodity flow and min-cut partitioning. Yeh et al. construct a flow
network in which each net initially corresponds to an edge with flow cost one. Two random modules in
the network are chosen and the shortest path (i.e., the path with lowest cost) between them is
computed. A constant Δ < 1 is added to the flow for each net in the shortest path, and the cost for
every net in the path is incremented. Adjusting the cost penalizes paths through congested areas and
forces alternative shortest paths. This random shortest path computation is repeated until every path
between the chosen pair of modules passes through at least one “saturated” net. The set of saturated
nets induces a multi-way partitioning in which two modules belong to the same cluster if and only if
there is a path of unsaturated nets between them.
For each of these clusters, the flux (defined as the cut size between the cluster and its
complement, divided by the size of the cluster) is computed and the clusters are sorted by their
flux value. Yeh et al. begin with a single cluster equal to the entire netlist, and then peel off the
clusters with lowest flux. This approach is attractive because the saturated nets are good candidates
to be cut in a partitioning solution. As peeled clusters can be very small, a second phase may be used to
make the multi-way partitioning more balanced. This approach, with its subsequent speedup by Yeh
[106], is well suited for large-scale multi-way partitioning instances.
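The flux measure defined above is straightforward to compute. The sketch below assumes a netlist represented as a list of nets, each net being a set of module names, and counts a net as cut when it has modules both inside and outside the cluster. The representation and function names are assumptions made for illustration.

```python
def flux(nets, cluster):
    """Cut size between cluster and its complement, divided by cluster size.

    nets: iterable of sets of module names (one set per net).
    cluster: non-empty set of module names.
    """
    cut = sum(1 for net in nets if (net & cluster) and (net - cluster))
    return cut / len(cluster)


def sort_by_flux(nets, clusters):
    """Order clusters from lowest to highest flux (candidates to peel first)."""
    return sorted(clusters, key=lambda cl: flux(nets, cl))
```

Clusters at the front of the sorted list are weakly connected to the rest of the netlist relative to their size, which is why they are peeled off first.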
The system prototyping phase may also explore netlist transformations such as logic replication
and retiming to minimize cut size (I/O usage) or system cycle time. Such transformations are needed because
inter-device delays can be relatively large and because devices are often I/O-limited. In Reference [107],
Liu et al. proposed a partitioning algorithm that permits logic replication to minimize both cut size and
clock cycle of sequential circuits. Given a netlist G = (V, E), their approach chooses two modules as seeds
s and t, then constructs a “replication graph” that is twice the size of the original circuit. This graph has
the special property that a type of directed minimum cut yields the replication cut (i.e., a decomposition
of V into S, T, and R = V - S - T, where s ∈ S, t ∈ T, and R is the replicated logic) that is optimal. A directed
version of the Fiduccia-Mattheyses algorithm is used to find a heuristic directed minimum cut in the
replication graph. Cong et al. [108] present an efficient algorithm for the performance-driven multi-way
circuit partitioning problem that considers the different local and global interconnect delays introduced
by the partitioning.
Alpert and Kahng [109] survey the FPGA partitioning literature in the context of major graph
partitioning paradigms. The current partitioning problems are (i) low utilization of FPGA gate capacity
because of I/O pin limits, (ii) low clock rate because of interconnect delay between multiple FPGAs, and (iii)
long CPU time for the mapping process.
7.6 System Planning and Design Changes
For a given system decomposition to be implemented on a multi-FPGA prototyping architecture, all
connections within each device and between devices must be routable. Chan et al. [110] draw on the
extensive literature on routability prediction in gate arrays, as well as theoretical concepts such as the Rent
parameter, to obtain a fast routability estimate for arbitrary netlists and FPGA architectures. Their
method ascribes one of three levels of routability (easily routable, marginally routable, or unroutable) to a
netlist based on various parameters. Specifically, combining a wirelength estimator due to Feuer, the
average number of pins per cell, and the estimated Rent parameter yields a relatively accurate
routability predictor. The utility of these parameters is contrasted with that of other criteria such as El
Gamal's channel width requirement [111] or the average pins-per-net ratio.
In addition to routability, connections must also meet system timing constraints. Selvidge et al.
[112] extend the original virtual wires [42] concept in their TIERS (Topology-IndEpendent Routing and
Scheduling) approach. The problem formulation assumes that an assignment from a multiple-FPGA
partitioning (i.e., a design graph) to a target topology graph has already been made. The objective is to
assign “links” (i.e., signal nets) to channels between devices; as with the virtual wires concept, specific
time slices for a channel can be assigned to multiple links as long as no two links need to transmit signals
at the same time. The TIERS algorithm uses a greedy method to order the links and then routes each link
in the scheduled order while reserving channel resources; improvements in system cycle time by factors
of up to 2.5 are achieved.
Chang et al. [113] address the combined issues of routability and system timing by applying
layout-driven logic resynthesis techniques. For a given wire that cannot be routed, “alternative wires”
and alternative functions are identified, such that the given unroutable wire can be removed from the
circuit and replaced with a new wire (or wires) or new logic without affecting functionality. The authors
estimate that between 30% and 50% of wires have so-called “triple-wire alternatives” (i.e.,
replacements consisting of three or fewer wires). Their method first routes the wires that do not have
any alternatives, then replaces any unroutable wire with available alternatives. System timing can be
improved by replacing long wires with shorter alternatives.
8 Conclusions
Reconfigurable computing offers a middle ground between software-based systems and ASIC
implementations, and is often able to combine important benefits of both. Implementations are able to
avoid overheads such as the unnecessary data transfers, decoding and control that are mandatory in
microprocessors, and designs can be optimized for a specific application, problem instance
or even execution. Using this technology, it is possible to achieve size, performance, cost, or power
improvements over more conventional computing technologies.
9 Acknowledgments
The authors would like to thank Y. M. Lam for his help in preparing this manuscript and Prof. Wayne Luk
(Imperial College) for his proofreading of this article.
Bibliography
[1] G.Estrin,"ReconfigurableComputerOrigins:TheUCLAFixed-plus-variable(F+V)Structurecomputer,"IEEEAnn.Hist.Comput,vol.24,pp.3--9,2002.
[2] S.Hauck,"TherolesofFPGAsinreprogrammablesystems,"Proc.IEEE,vol.86,pp.615-639,1998.
[3] K.ComptonandS.Hauck,"Reconfigurablecomputing:asurveyofsystemsandsoftware,"ACMComput.Surveys(CSUR),vol.34,pp.171-210,2002.
[4] K.BondalapatiandV.K.Prasanna,"Reconfigurablecomputingsystems,"Proc.IEEE,vol.90,pp.1201-1217,2002.
[5] T.J.Todman,G.A.Constantinides,S.J.E.Wilton,O.Mencer,W.Luk,andP.Y.K.Cheung,"Reconfigurablecomputing:architecturesanddesignmethods,"IEEProc.ComputersandDigitalTechniques,vol.152,pp.193-205,2005.
[6] R.Tessier,K.Pocek,andA.DeHon,"ReconfigurableComputingArchitectures,"Proc.IEEE,vol.103,pp.332-354,2015.
[7] H.SutterandJ.Larus,"Softwareandtheconcurrencyrevolution,"Queue,vol.3,pp.54--62,2005.
[8] J.Cong,M.A.Ghodrat,M.Gill,B.Grigorian,K.Gururaj,andG.Reinman,"Accelerator-RichArchitectures,"inProc.DesignAutomationConference,pp.1--6,2014.
[9] J.Fowers,G.Brown,P.Cooke,andG.Stitt,"AperformanceandenergycomparisonofFPGAs,GPUs,andmulticoresforsliding-windowapplications,"inProc.ACM/SIGDAInt.Symp.onFieldProgrammableGateArrayspp.47–56,2012.
[10] D.B.Thomas,L.Howes,andW.Luk,"AcomparisonofCPUs,GPUs,FPGAs,andmassivelyparallelprocessorarraysforrandomnumbergeneration,"Proc.Int.Symp.onFieldprogrammablegatearrays,pp.63-72,2009.
[11] A.DeHon,"Thedensityadvantageofconfigurablecomputing,"IEEEComputer,vol.33,pp.41-49,2000.
[12] I.KuonandJ.Rose,"MeasuringtheGapBetweenFPGAsandASICs,"IEEETrans.onComputer-AidedDesignofIntegratedCircuitsandSystems,vol.26,pp.203-215,2007.
[13] J.Liang,R.Tessier,andD.Goeckel,"ADynamically-Reconfigurable,Power-EfficientTurboDecoder,"inProc.Int.Symp.onField-ProgrammableCustomComputingMachines,pp.91--100,2004.
[14] V.Betz,J.Rose,andA.Marquardt,"ArchitectureandCADforDeep-SubmicronFPGAS,"ed.Dordrecht,theNetherlands:KluwerAcademicPublisher,1999.
[15] Xilinx, "http://www.xilinx.com," (accessed 2016).
[16] Altera, "http://www.altera.com," (accessed 2016).
[17] Microsemi, "http://www.microsemi.com," (accessed 2016).
[18] M. P. Leong, "FPGA Design Methodologies for High Performance Applications," The Chinese University of Hong Kong, 2001.
[19] E. Ahmed and J. Rose, "The effect of LUT and cluster size on deep-submicron FPGA performance and density," in Proc. ACM/SIGDA Int. Symp. on Field programmable gate arrays, pp. 3-12, 2000.
[20] D. Lewis, A. Lee, P. Leventis, S. Marquardt, C. McClintock, K. Padalia, et al., "The Stratix II logic and routing architecture," in Proc. Int. Symp. on Field-programmable gate arrays, pp. 14-20, 2005.
[21] D.Lewis,G.Chiu,J.Chromczak,D.Galloway,B.Gamsa,V.Manohararajah,etal.,"TheStratix™10HighlyPipelinedFPGAArchitecture,"inProc.Int.Symp.onField-ProgrammableGateArrays,pp.159-168,2016.
[22] R.Hartenstein,"Coarsegrainreconfigurablearchitecture(embeddedtutorial),"inProc.conf.onAsiaSouthPacificdesignautomation,2001.
[23] S.C.Goldstein,H.Schmit,M.Budiu,S.Cadambi,M.Moe,andR.R.Taylor,"PipeRench:areconfigurablearchitectureandcompiler,"Computer,vol.33,pp.70-77,2000.
[24] C.Ebeling,D.C.Cronquist,andP.Franklin,"RaPiD—Reconfigurablepipelineddatapath,"inProc.Int.WorkshoponField-ProgrammableLogic,SmartApplications,NewParadigmsandCompilers,pp.126-135,1996.
[25] L.Moll,J.Vuillemin,andP.Boucard,"High-energyphysicsonDECPeRLe-1programmableactivememory,"inProc.ACMInt.Symp.onField-programmablegatearrays,pp.47-52,1995.
[26] D.T.Hoang,"SearchinggeneticdatabasesonSplash2,"inProc.IEEEWorkshoponFPGAsforCustomComputingMachinespp.185-191,1993.
[27] C.Chen,J.Wawrzynek,andR.W.Brodersen,"BEE2AHigh-EndReconfigurableComputingSystem,"IEEEDes.Test.Comput.,vol.22,pp.114-125,2005.
[28] L.-K.Ting,R.Woods,andC.F.N.Cowan,"VirtexFPGAimplementationofapipelinedadaptiveLMSpredictorforelectronicsupportmeasuresreceivers,"IEEETrans.onVeryLargeScaleIntegration(VLSI)Systems,vol.13,pp.86-95,2005.
[29] M.Pohl,M.Schaeferling,andG.Kiefer,"AnefficientFPGA-basedhardwareframeworkfornaturalfeatureextractionandrelatedComputerVisiontasks,"inProc.Int.Conf.onFieldProgrammableLogicandApplications,pp.1-8,2014.
[30] M.ShandandJ.Vuillemin,"FastimplementationsofRSAcryptography,"inProc.IEEESymp.onComputerArithmetic,pp.252-259,1993.
[31] K.H.Tsoi,K.H.Lee,andP.H.W.Leong,"AmassivelyparallelRC4keysearchengine,"inProc.Int.Symp.onField-ProgrammableCustomComputingMachines,pp.13-21,2002.
[32] G.L.Zhang,P.H.W.Leong,C.H.Ho,K.H.Tsoi,C.C.C.Cheung,D.Lee,etal.,"ReconfigurableaccelerationforMonteCarlobasedfinancialsimulation,"inProc.Int.Conf.onField-ProgrammableTechnology,2005.,pp.215-222,2005.
[33] D.Boland,"ReducingMemoryRequirementsforHigh-PerformanceandNumericallyStableGaussianElimination,"Proc.ACM/SIGDAInt.Symp.onField-ProgrammableGateArrays,pp.244-253,2016.
[34] N.J.Fraser,D.J.M.Moss,L.JunKyu,S.Tridgell,C.T.Jin,andP.H.W.Leong,"Afullypipelinedkernelnormalisedleastmeansquaresprocessorforacceleratedparameteroptimisation,"inProc.Int.Conf.onFieldProgrammableLogicandApplications,pp.1--6,2015.
[35] J.E.Vuillemin,P.Bertin,D.Roncin,M.Shand,H.H.Touati,andP.Boucard,"Programmableactivememories:reconfigurablesystemscomeofage,"IEEETrans.onVeryLargeScaleIntegration(VLSI)Systems,vol.4,pp.56-69,1996.
[36] Nvidia, "http://www.nvidia.com," (accessed 2016).
[37] S. Mittal and J. S. Vetter, "A Survey of Methods for Analyzing and Improving GPU Energy Efficiency," ACM Comput. Surv., vol. 47, pp. 1-23, 2014.
[38] Nvidia, "NVIDIA Tesla® K20-K20X GPU Accelerators Benchmarks Application Performance Technical Brief, http://www.nvidia.com/docs/IO/122874/K20-and-K20X-application-performance-technical-brief.pdf," (accessed 2016).
[39] K.Ovtcharov,O.Ruwase,J.-Y.Kim,K.Strauss,andE.Chung,AcceleratingDeepCconvolutionalNeuralNetworksUsingSpecializedHardware:MicrosoftResearch,2015.
[40] S.Gupta,A.Agrawal,K.Gopalakrishnan,andP.Narayanan,"DeepLearningwithLimitedNumericalPrecision,"inInt.Conf.onMachineLearning,pp.1337–1345,2013.
[41] J.L.Jerez,G.A.Constantinides,andE.C.Kerrigan,"FixedPointLanczos:SustainingTFLOP-equivalentPerformanceinFPGAsforScientificComputing,"inProc.Int.Symp.onField-ProgrammableCustomComputingMachines,pp.53-60,2012.
[42] J.Babb,R.Tessier,M.Dahl,S.Z.Hanono,D.M.Hoki,andA.Agarwal,"Logicemulationwithvirtualwires,"IEEETrans.Computer-AidedDesignofIntegratedCircuitsandSystems,vol.16,pp.609-626,1997.
[43] J.Varghese,M.Butts,andJ.Batcheller,"Anefficientlogicemulationsystem,"IEEETrans.onVeryLargeScaleIntegration(VLSI)Systems,vol.1,pp.171-174,1993.
[44] L.deSouza,P.Ryan,J.Crawford,K.Wong,G.Zyner,andT.McDermott,"PrototypingfortheConcurrentDevelopmentofanIEEE802.11WirelessLANChipset,"inProc.Int.Conf.onFieldProgrammableLogicandApplication,ed,2003,pp.51-60.
[45] Cadence,"ProtiumRapidPrototypingPlatformhttps://www.cadence.com/content/cadence-www/global/en_US/home/tools/system-design-and-verification/fpga-basedprototyping/protium-rapid-prototyping-platform.html,"(accessed2016).
[46] D.M.StephenTridgell,NicholasJ.Fraser,andPhilipH.W.Leong,"Braiding:aschemeforresolvinghazardsinNORMA,"inProc.Int.Conf.onFieldProgrammableTechnology,pp.136–143,2015.
[47] P.L.ChenZhang,GuangyuSun,YijinGuan,BingjunXiaoandJasonCong,"OptimizingFPGA-basedAcceleratorDesignforDeepConvolutionalNeuralNetworks,"inProc.ACM/SIGDAInt.Symp.onField-ProgrammableGateArrays,pp.161-170,2015.
[48] R.A.E.LouiseH.Crockett,MartinA.Enderwitz,andRobertW.Stewart.,TheZynqBook:EmbeddedProcessingwiththeArmCortex-A9ontheXilinxZynq-7000allProgrammableUK,2014.
[49] P.Bertin,D.Roncin,andJ.Vuillemin,"IntroductiontoProgrammableActiveMemories,"ed:DECMemo3,1989,pp.1-9.
[50] J.M.Arnold,D.A.Buell,andE.G.Davis,"Splash2,"inProc.ACMsymp.onParallelalgorithmsandarchitectures,1992.
[51] Maxeler, "https://www.maxeler.com/products/mpc-xseries/," (accessed 2016).
[52] A. Putnam, A. M. Caulfield, E. S. Chung, D. Chiou, K. Constantinides, J. Demme, et al., "A reconfigurable fabric for accelerating large-scale datacenter services," in Int. Symp. on Computer Architecture (ISCA), 2014.
[53] Y.-k.Choi,J.Cong,Z.Fang,Y.Hao,G.Reinman,andP.Wei,"AquantitativeanalysisonmicroarchitecturesofmodernCPU-FPGAplatforms,"inProc.DesignAutomationConference,2016.
[54] J.VillasenorandW.H.Mangione-Smith,"ConfigurableComputing,"Scientif.Amer.,vol.276,pp.66-71,1997.
[55] J.BeckerandM.Hübner,"Run-timereconfigurabililityandotherfuturetrends,"inProc.symp.onIntegratedcircuitsandsystemsdesign,pp.9-11,2006.
[56] N.B.C.Bhat,K.;Kuh,E.S,"Performance-orientedFullyRoutableDynamicArchitectureforaField-programmableLogicDevice,"MemorandumNo.UCB/ERLM93/42,ElectronicsResearchLab.,CollegeofEngineering,UCBerkeley,pp.1-21,1993.
[57] M.Motomura,"ADynamicallyReconfigurableProcessorArchitecture,,"MicroprocessorForum,2002.
[58] D.Koch,C.Beckhoff,andJ.Teich,"ReCoBus-Builder-AnoveltoolandtechniquetobuildstaticallyanddynamicallyreconfigurablesystemsforFPGAS,"inProc.Int.Conf.onFieldProgrammableLogicandApplications,pp.119-124,2008.
[59] C.Beckhoff,D.Koch,andJ.Torresen,"GoAhead:APartialReconfigurationFramework,"inProc.Int.Symp.onField-ProgrammableCustomComputingMachines,pp.37-44,2012.
[60] K.VipinandS.A.Fahmy,"Efficientregionallocationforadaptivepartialreconfiguration,"inProc.Int.Conf.onField-ProgrammableTechnology,pp.1-6,2011.
[61] D.KochandC.Beckhoff,"HierarchicalreconfigurationofFPGAs,"inProc.Int.Conf.onFieldProgrammableLogicandApplications,pp.1-8,2014.
[62] K.VipinandS.A.Fahmy,"ZyCAP:EfficientPartialReconfigurationManagementontheXilinxZynq,"IEEEEmbeddedSystemsLetters,vol.6,pp.41-44,2014.
[63] L.GongandO.Diessel,FunctionalVerificationofDynamicallyReconfigurableFPGA-basedSystems,1ed.:SpringerInternationalPublishing,2015.
[64] C.Claus,R.Ahmed,F.Altenried,andW.Stechele,"TowardsRapidDynamicPartialReconfigurationinVideo-BasedDriverAssistanceSystems,"inProc.Int.SymponReconfigurableComputing:Architectures,ToolsandApplications,ed,2010,pp.55-67.
[65] G.G.Jean-PhilippeDelahaye,ChristianRoland,PierreBomel,"Softwareradioanddynamicreconfigurationonadsp/fpgaplatform,"Frequenz,journaloftelecommunications,pp.152-159,2004.
[66] M.Feilen,M.Ihmig,C.Schwarzbauer,andW.Stechele,"EfficientDVB-T2decodingacceleratordesignbytime-multiplexingFPGAresources,"inProc.Int.Conf.onFieldProgrammableLogicandApplications,pp.75-82,2012.
[67] C.Steiger,H.Walder,andM.Platzner,"Operatingsystemsforreconfigurableembeddedplatforms:onlineschedulingofreal-timetasks,"IEEETrans.onComputers,vol.53,pp.1393-1407,2004.
[68] C.Dennl,D.Ziener,andJ.Teich,"On-the-flyCompositionofFPGA-BasedSQLQueryAcceleratorsUsingaPartiallyReconfigurableModuleLibrary,"inProc.Int.Symp.onField-ProgrammableCustomComputingMachines(FCCM),pp.45-52,2012.
[69] A.Becher,F.Bauer,D.Ziener,andJ.Teich,"Energy-awareSQLqueryaccelerationthroughFPGA-baseddynamicpartialreconfiguration,"inProc.Int.Conf.onFieldProgrammableLogicandApplications,pp.1-8,2014.
[70] M.J.WirthlinandB.L.Hutchings,"Adynamicinstructionsetcomputer,"inProc.IEEESymp.onFPGAsforCustomComputingMachines,pp.99–107,1995.
[71] Stretch, "http://www.stretchinc.com/," (accessed 2016).
[72] H. Schmit, "Incremental reconfiguration for pipelined applications," in Proc. 5th Annual IEEE Symp. on Field-Programmable Custom Computing Machines, pp. 47-55, 1997.
[73] S. Young, P. Alfke, C. Fewer, S. McMillan, B. Blodget, and D. Levi, "A high I/O reconfigurable crossbar switch," in Proc. Int. Symp. on Field-Programmable Custom Computing Machines, pp. 3-10, 2003.
[74] F.d.D.a.B.Pasca,"DesigningcustomarithmeticdatapathswithFloPoCo,"IEEEDesign&TestofComputers,vol.28,pp.18--27,2011.
[75] W. Luk and I. Page, "Compiling Occam into FPGAs," ed: EE&CS books, 1991, pp. 271-283.
[76] I. Page, "Constructing hardware-software systems from a single description," VLSI Signal Processing, vol. 12, pp. 87-107, 1996.
[77] Q. Liu, G. A. Constantinides, K. Masselos, and P. Y. K. Cheung, "Automatic On-chip Memory Minimization for Data Reuse," in Proc. Int. Symp. on Field-Programmable Custom Computing Machines, pp. 251-260, 2007.
[78] S. Bayliss and G. A. Constantinides, "Optimizing SDRAM bandwidth for custom FPGA loop accelerators," in Proc. ACM/SIGDA Int. Symp. on Field Programmable Gate Arrays, pp. 195-204, 2012.
[79] D. U. Lee, A. A. Gaffar, R. C. C. Cheung, O. Mencer, W. Luk, and G. A. Constantinides, "Accuracy-Guaranteed Bit-Width Optimization," IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 25, pp. 1990-2000, 2006.
[80] D. Boland and G. A. Constantinides, "Bounding Variable Values and Round-Off Effects Using Handelman Representations," IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 30, pp. 1691-1704, 2011.
[81] T. Isshiki and W. W. Dai, "High-Level Bit-Serial Datapath Synthesis for Multi-FPGA Systems," in Proc. Int. Workshop on FPGAs, pp. 161-174, 1995.
[82] M. P. Leong and P. H. W. Leong, "A variable-radix digit-serial design methodology and its application to the discrete cosine transform," IEEE Trans. on Very Large Scale Integration (VLSI) Systems, vol. 11, pp. 90-104, 2003.
[83] Maxeler Technologies, "MaxCompiler" (white paper), 2011.
[84] Mentor Graphics, "Catapult High-Level Synthesis https://www.mentor.com/hls-lp/catapult-high-level-synthesis/c-systemc-hls," (accessed 2016).
[85] M. Haldar, A. Nayak, A. Choudhary, and P. Banerjee, "A system for synthesizing optimized FPGA hardware from Matlab," in IEEE/ACM Int. Conf. on Computer Aided Design, pp. 314-319, 2001.
[86] Mathworks, "http://www.mathworks.com/products/hdl-coder/," (accessed 2016).
[87] A. Papakonstantinou, K. Gururaj, J. A. Stratton, D. Chen, J. Cong, and W. M. W. Hwu, "FCUDA: Enabling efficient compilation of CUDA kernels onto FPGAs," in Symp. on Application Specific Processors, pp. 35-42, 2009.
[88] D. Lau, O. Pritchard, and P. Molson, "Automated Generation of Hardware Accelerators with Direct Memory Access from ANSI/ISO Standard C Functions," in Proc. Int. Symp. on Field-Programmable Custom Computing Machines, pp. 45-56, 2006.
[89] G. Weisz, J. Melber, Y. Wang, K. Fleming, E. Nurvitadhi, and J. C. Hoe, "A Study of Pointer-Chasing Performance on Shared-Memory Processor-FPGA Systems," in Proc. ACM/SIGDA Int. Symp. on Field-Programmable Gate Arrays, pp. 264-273, 2016.
[90] Vectorblox, "http://vectorblox.com/," (accessed 2016).
[91] N. Kapre and J. Gray, "Hoplite: Building austere overlay NoCs for FPGAs," in Proc. Int. Conf. on Field Programmable Logic and Applications, pp. 1-8, 2015.
[92] D. Capalija and T. S. Abdelrahman, "A high-performance overlay architecture for pipelined execution of data flow graphs," in Int. Conf. on Field Programmable Logic and Applications, pp. 1-8, 2013.
[93] A. K. Jain, X. Li, P. Singhai, D. L. Maskell, and S. A. Fahmy, "DeCO: A DSP Block Based FPGA Accelerator Overlay With Low Overhead Interconnect," in Proc. Int. Symp. on Field-Programmable Custom Computing Machines, pp. 1-8, 2016.
[94] C. Clos, "A Study of Non-Blocking Switching Networks," Bell System Technical Journal, vol. 32, pp. 406-424, 1953.
[95] V. E. Beneš, "Mathematical Theory of Connecting Networks and Telephone Traffic," ed. New York: Academic Press, 1965.
[96] P. K. Chan and M. D. F. Schlag, "Architectural tradeoffs in field-programmable-device-based computing systems," in Proc. IEEE Workshop on FPGAs for Custom Computing Machines, pp. 152-161, 1993.
[97] R. Cole and J. Hopcroft, "On Edge Coloring Bipartite Graphs," SIAM J. Comput., vol. 11, pp. 540-546, 1982.
[98] G. Richards and F. Hwang, "A Two-Stage Rearrangeable Broadcast Switching Network," IEEE Trans. Communications, vol. 33, pp. 1025-1035, 1985.
[99] I-Cube, "Using FPID Devices in FPGA-based Prototyping," ed: Application Note, 1994, pp. 1-11.
[100] Y. C. Wei, C. K. Cheng, and Z. Wurman, "Multiple-level partitioning: an application to the very large-scale hardware simulator," IEEE J. Solid-State Circuits, vol. 26, pp. 706-716, 1991.
[101] J. Li and C. K. Cheng, "Routability improvement using dynamic interconnect architecture," in Proc. IEEE Symp. on FPGAs for Custom Computing Machines, pp. 2-7, 1995.
[102] P. K. Chan, M. D. F. Schlag, and M. Martin, "BORG: A Reconfigurable Prototyping Board Using Field-programmable Gate Arrays," in Int. Workshop on FPGA, pp. 47-51, 1992.
[103] J. Lillis, C. K. Cheng, and T. T. Y. Lin, "Optimal wire sizing and buffer insertion for low power and a generalized delay model," in Proc. Int. Conf. on Computer Aided Design (ICCAD), pp. 138-143, 1995.
[104] W. J. Hsieh, Y. C. Jenq, C. S. Horng, and K. Lofstrom, "Input/output I/O Bidirectional Buffer for Interfacing I/O Parts of a Field Programmable Interconnection Device with Array Ports of a Cross-point Switch," US Patent 5,428,800, 1992.
[105] C.-W. Yeh, C.-K. Cheng, and T.-T. Y. Lin, "A probabilistic multicommodity-flow solution to circuit clustering problems," in Proc. IEEE/ACM Int. Conf. on Computer-Aided Design, pp. 428-431, 1992.
[106] C.-W. Yeh, "On the acceleration of flow-oriented circuit clustering," IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 14, pp. 1305-1308, 1995.
[107] L.-T. Liu, M.-T. Kuo, C.-K. Cheng, and T. C. Hu, "Performance-Driven Partitioning Using a Replication Graph Approach," in Proc. Design Automation Conference, pp. 206-210, 1995.
[108] J. Cong, S. K. Lim, and C. Wu, "Performance driven multi-level and multiway partitioning with retiming," in Proc. Design Automation Conf., pp. 274-279, 2000.
[109] C. J. Alpert and A. B. Kahng, "Recent directions in netlist partitioning: a survey," Integration, the VLSI Journal, vol. 19, pp. 1-81, 1995.
[110] P. K. Chan, M. D. F. Schlag, and J. Y. Zien, "On routability prediction for Field-Programmable Gate Arrays," in Proc. Design Automation Conference, pp. 326-330, 1993.
[111] A. E. Gamal, "Two-dimensional stochastic model for interconnections in master slice integrated circuits," IEEE Trans. Circuits Syst., vol. 28, pp. 127-138, 1981.
[112] C. Selvidge, A. Agarwal, M. Dahl, and J. Babb, "TIERS: Topology Independent Pipelined Routing and Scheduling," in Proc. ACM Int. Symp. on Field-Programmable Gate Arrays, pp. 25-31, 1995.
[113] S.-C. Chang, K.-T. Cheng, N.-S. Woo, and M. Marek-Sadowska, "Layout driven logic synthesis for FPGAs," in Proc. Design Automation Conference, pp. 308-313, 1994.