Reconfigurable Computing - University of Sydney · Reconfigurable Computing David Boland1,...

Post on 23-May-2018


Reconfigurable Computing

David Boland1, Chung-Kuan Cheng2, Andrew B. Kahng2, Philip H.W. Leong1

1 School of Electrical and Information Engineering, The University of Sydney, Australia 2006
2 Dept. of Computer Science and Engineering, University of California, La Jolla, California

Abstract:

Reconfigurable computing is the application of adaptable fabrics to solve computational problems, often taking advantage of the flexibility available to produce problem-specific architectures that achieve high performance because of customization. Reconfigurable computing has been successfully applied to fields as diverse as digital signal processing, cryptography, bioinformatics, logic emulation, CAD tool acceleration, scientific computing, and rapid prototyping.

Although Estrin first proposed the idea of a reconfigurable system in the form of a fixed plus variable structure computer in 1960 [1], it has only been in recent years that reconfigurable fabrics, in the form of field-programmable gate arrays (FPGAs), have reached sufficient density to make them a compelling implementation platform for high performance applications and embedded systems. In this article, intended for the non-specialist, we describe some of the basic concepts, tools and architectures associated with reconfigurable computing.

Keywords:

reconfigurable computing; adaptable fabrics; application integrated circuits; field programmable gate arrays (FPGAs); system architecture; runtime

1 Introduction

Although reconfigurable fabrics can in principle be constructed from any type of technology, in practice, most contemporary designs are made using commercial field programmable gate arrays (FPGAs). An FPGA is an integrated circuit containing an array of logic gates in which the connections can be configured by downloading a bitstream to its memory. FPGAs can also be embedded in integrated circuits as intellectual property cores. More detailed surveys on reconfigurable computing are available in the literature [2-6].

Microprocessors offer an easy-to-use, powerful and flexible implementation medium for digital systems. Their utility in computing applications makes them an overwhelming first choice. Moreover, it is relatively easy to find software developers, and microprocessors are widely supported by operating systems, software engineering tools, and libraries. However, in the last decade, power constraints have limited the performance of serial computation on microprocessors. This has led to the development of multi-core processors and an increasing importance placed on the pursuit of parallel computation [7]. Unfortunately, multi-core processors are rarely the most efficient method to perform parallel computation. This inefficiency stems from the fact that each core must be general enough to support an entire instruction set. As a result, the majority of energy is used in decoding the instruction and fetching data instead of performing actual computation [8].

Hardware accelerators such as graphics processing units (GPUs) and FPGAs are parallel computational architectures that have demonstrated substantial performance and energy efficiency improvements over traditional multi-core processor designs by moving the focus back to computation [9, 10]. In terms of energy efficiency, the GPU architecture, which consists of thousands of parallel floating-point units, is best suited to so-called embarrassingly parallel computation or computationally expensive problems. However, many algorithms do not fall into this problem category. In contrast, using an FPGA or application-specific integrated circuit (ASIC), it is possible to create a fully customised datapath for a given algorithm, and hence to achieve even greater energy efficiency.

Application-specific integrated circuits (ASICs) and FPGAs achieve greater levels of parallelism than a microprocessor by arranging computations in a spatial rather than temporal fashion. This can result in performance improvements of several orders of magnitude. Also, the absence of caches and instruction decoding can result in the same amount of work being done with less chip area and lower power consumption [11]. Notable examples of application domains include cryptography, NP-hard optimization problems, pattern matching, machine learning, and molecular dynamics [6].

An example involving the implementation of a finite impulse response (FIR) filter is shown in Fig. 1. The reconfigurable computing solution is significantly more parallel than the microprocessor-based one. In addition, it should be apparent that the reconfigurable solution avoids the overheads associated with instruction decoding, caching, and register files. Furthermore, speculative execution, unnecessary data transfers and control hardware can be omitted.

Figure 1. Illustration of a microprocessor based FIR filter vs. a reconfigurable computing solution. In the microprocessor, operations are performed in the ALU sequentially. Furthermore, instruction decoding, caching, speculative execution, control generation and so on are required. For the reconfigurable computing approach using an FPGA, spatial composition is used to increase the degree of parallelism. The FPGA implementation can be further parallelized through pipelining.
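
The contrast drawn in Fig. 1 can be sketched in software. The following Python fragment is only an illustration under our own assumptions (the function names and the list-based delay line are not from the figure): the first function performs one multiply-accumulate at a time, as an ALU would, whereas the second models a tapped delay line in which every tap has its own multiplier and the products feed an adder tree.

```python
def fir_sequential(samples, taps):
    """Temporal execution: one multiply-accumulate per step, as in an ALU."""
    out = []
    for n in range(len(samples)):
        acc = 0
        for k, coeff in enumerate(taps):
            if n - k >= 0:
                acc += coeff * samples[n - k]
        out.append(acc)
    return out


def fir_spatial(samples, taps):
    """Spatial model: a delay line with one multiplier per tap.

    In hardware the products below would all be formed in the same clock
    cycle; Python evaluates them one by one, so this only models the
    structure, not the timing.
    """
    delay_line = [0] * len(taps)
    out = []
    for x in samples:
        delay_line = [x] + delay_line[:-1]                      # shift register
        products = [c * d for c, d in zip(taps, delay_line)]   # parallel multipliers
        out.append(sum(products))                               # adder tree
    return out
```

Both functions compute the same output sequence; the difference lies in how many operations a hardware realisation could perform per clock cycle.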

Compared with ASICs, FPGAs offer very low non-recurrent engineering (NRE) costs, which is often a more important factor than the fact that FPGAs have higher unit costs. This is because many applications do not have the extremely high volumes required to make ASICs a cheaper proposition. As integrated circuit feature sizes continue to decrease, the NRE costs associated with ASICs continue to escalate, increasing the volume at which it becomes cheaper to use an ASIC (see Fig. 2). Being more specialized, ASICs offer area, power and speed advantages over FPGAs, although this gap is being reduced as more hard blocks are employed [12]. Moving forward, reconfigurable computing will be used in an increasing number of applications, as ASICs become cost effective only for the highest performance or highest volume applications.

Figure 2. Cost of technology vs. volume. The crossover volume for which ASIC technology is cheaper than FPGAs increases as feature size is reduced because of increased non-recurrent engineering costs.
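
The crossover in Fig. 2 follows from a simple linear cost model, sketched below in Python. The model (total cost equals NRE plus unit cost times volume) and all monetary figures are assumptions chosen for illustration only; they are not taken from the article or from any vendor.

```python
def total_cost(nre, unit_cost, volume):
    """Linear cost model: a fixed NRE charge plus a per-unit cost."""
    return nre + unit_cost * volume


def crossover_volume(asic_nre, asic_unit, fpga_nre, fpga_unit):
    """Volume at which ASIC and FPGA total costs are equal.

    Assumes asic_nre > fpga_nre and fpga_unit > asic_unit, which is the
    situation described in the text.
    """
    return (asic_nre - fpga_nre) / (fpga_unit - asic_unit)


# Hypothetical figures: the ASIC NRE grows as feature size shrinks.
older = crossover_volume(asic_nre=1_000_000, asic_unit=5, fpga_nre=10_000, fpga_unit=50)
newer = crossover_volume(asic_nre=5_000_000, asic_unit=5, fpga_nre=10_000, fpga_unit=50)
```

With these made-up numbers the crossover volume rises when the ASIC NRE rises, which is the trend the figure illustrates.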

Additional benefits of reconfigurable computing are that its technology provides a shorter time to market than ASICs (associated FPGA fabrication time is essentially zero), making many fabrication iterations within a single day possible. This benefit allows more complex algorithms to be deployed and makes problem-specific customizations of designs possible. FPGA-based designs are inherently less risky in terms of technical feasibility and cost, as shorter design times and lower upfront costs are involved. As their name suggests, FPGAs also offer the possibility of modifications to the design in the field, which can be used to provide bug fixes, modifications to adapt to changing standards, or to add functionality, all of which can be achieved by downloading a new bitstream to an existing reconfigurable computing platform. Reconfiguration can even take place while the system is running, this being known as runtime reconfiguration (e.g., [13]).

In the next section, we introduce the basic architecture of common reconfigurable fabrics, followed by a discussion of applications of reconfigurable computing and system architectures. Runtime reconfiguration and design methods are then covered. Finally, we discuss multichip systems and end with a conclusion.

[Figure 2 plot labels: cost (vertical axis) against volume (horizontal axis), with curves for older and current ASIC and FPGA technologies; the crossover volume increases with decreasing feature size.]

2 Reconfigurable Fabrics

A block diagram illustrating a generic fine-grained island-style FPGA is given in Fig. 3 [14]. Products from companies such as Xilinx [15], Altera [16], and Microsemi [17] are commercial examples. The FPGA consists of a number of logic cells that can be interconnected to other logic and input/output (I/O) cells via programmable routing resources. Logic cells and routing resources are configured via bit-level programming data, which is stored in memory cells in the FPGA. A logic cell consists of user-programmable combinatorial elements, with an optional register at the output. They are often implemented as lookup tables (LUTs) with a small number of inputs, 4-input LUTs being shown in Fig. 3. Using such an architecture, subject to FPGA-imposed limitations on the circuit's speed and density, an arbitrary circuit can be implemented. The complete design is described via the configuration bitstream which specifies the logic and I/O cell functionality, and their interconnection.

Figure 3. Architecture of a basic island-style FPGA with four-input logic cells. The logic cells, shown as gray rectangles, are connected to programmable routing resources (shown as wires, dots, and diagonal switch boxes) (source: References [14] and [18]).
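
To make the idea of a LUT-based logic cell concrete, the following Python sketch treats a 4-input LUT as a 16-bit truth table indexed by its inputs. The bit ordering and function names are our own assumptions; real devices define their own bitstream encodings, which are not described in this article.

```python
def make_lut4(func):
    """Build the 16 configuration bits of a 4-input LUT from a Boolean function.

    Bit `addr` of the result holds the output for the input combination
    whose binary encoding is `addr` (input a is the most significant bit).
    """
    bits = 0
    for addr in range(16):
        a, b, c, d = (addr >> 3) & 1, (addr >> 2) & 1, (addr >> 1) & 1, addr & 1
        if func(a, b, c, d):
            bits |= 1 << addr
    return bits


def eval_lut4(bits, a, b, c, d):
    """Look up the output: the inputs simply select one stored bit."""
    addr = (a << 3) | (b << 2) | (c << 1) | d
    return (bits >> addr) & 1


# Example: configure the LUT as (a AND b) XOR (c OR d).
config = make_lut4(lambda a, b, c, d: (a & b) ^ (c | d))
```

Changing the 16 stored bits changes the logic function without changing the hardware, which is the sense in which the configuration bitstream defines the circuit.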

Current trends are to incorporate additional embedded blocks so that designers can integrate entire systems on a single FPGA device. Apart from density, cost, and board area benefits, this process also improves performance because more specialized logic and routing can be used and all components are on the same chip. A contemporary FPGA commonly has features such as carry chains to enable fast addition; wide decoders; tristate buffers; blocks of on-chip memory and multipliers; programmable I/O standards in the input/output cells; delay locked loops; phase locked loops for clock de-skewing, phase shifting and multiplication; multi-gigabit transceivers (MGTs); and embedded microprocessors. Embedded microprocessors can be implemented either as soft cores using the internal FPGA resources or as hardwired cores.

In addition to the architectural features described, intellectual property (IP) cores, implemented using the logic cell resources of the FPGA, are available from vendors and can be incorporated into a design. These cores include bus interfaces, networking components, memory interfaces, signal processing functions, microprocessors and so on, and can significantly reduce development time and effort.

The bit-level organization of the logic and routing resources in island-style FPGAs is extremely flexible but has high implementation overhead as a result. Tradeoffs exist in the granularity of the logic cells and routing resources. Fine-grained devices have the best flexibility; however, coarse-grained elements can trade some flexibility for higher performance and density [19].

With modern technologies, the speed of the routing resource is a limiting factor. Trends have been to increase the functionality of the logic cells, e.g., to use logic cells with larger numbers of inputs which can also be configured as smaller LUTs [20], and to add pipeline registers to the routing fabric [21]. For datapath oriented applications such as digital signal processing, coarse-grained architectures [22] such as PipeRench [23] and RaPID [24] employ bus-based routing and word-based functional units to utilize silicon resources more efficiently.

3 Applications

Reconfigurable computing has found widespread application in the form of “custom computing machines” to accelerate computation over algorithms implemented on a CPU. Application domains include high-energy physics [25], genome analysis [26], signal processing [27, 28], computer vision [29], cryptography [30, 31], financial engineering [10, 32], scientific computing [33], machine learning [34] and security [35].

In many of these problem domains, a general purpose GPU has also demonstrated considerable acceleration over a CPU and will often outperform an FPGA as well in terms of raw performance. This is because it is a parallel architecture with many hardened floating-point units and substantially greater memory bandwidth, meaning it is an ideal architecture provided algorithms can be broken down into a large number of parallel threads. However, outright performance is no longer the only benchmark; energy consumption is also important. In terms of high performance computing, supercomputing clusters and data centers now consume vast amounts of energy, not only on computation, but also on cooling in order to maintain performance and reliability. It follows that reducing energy consumption provides both environmental and economic benefits. Energy minimization is also important for embedded applications; for example, reducing power consumption on smartphones or other battery-powered devices is desirable from an end user perspective. As a result, FPGA and GPU vendors are focusing their engineering efforts towards making future architectures more energy efficient. This is reflected in the most recent FPGA and GPU architectures: Nvidia’s P100 claims a peak performance of 10.6 TFLOPs (single precision) with a TDP of only 300 W [36], while Altera claims performance of up to 9.3 TFLOPs (single precision) at 80 GFLOPs/watt is achievable on their upcoming Stratix 10 device [16].
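
The vendor figures quoted above can be put on a common footing with a little arithmetic, sketched below in Python. This is only a back-of-the-envelope comparison under stated assumptions: it divides claimed peak throughput by TDP for the GPU, and derives the power implied by the FPGA claim; neither number is a measured result.

```python
def gflops_per_watt(peak_tflops, watts):
    """Convert a peak TFLOPs claim and a power figure to GFLOPs per watt."""
    return peak_tflops * 1000.0 / watts


# Nvidia P100 claim: 10.6 TFLOPs at a 300 W TDP, i.e. roughly 35 GFLOPs/W.
p100_efficiency = gflops_per_watt(10.6, 300.0)

# Altera Stratix 10 claim: 9.3 TFLOPs at 80 GFLOPs/W implies roughly 116 W.
stratix10_implied_watts = 9.3 * 1000.0 / 80.0
```

Because both inputs are peak claims from vendors, the comparison indicates only the order of magnitude of the gap, not what a given application would achieve.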

Many experimental studies have been performed comparing the energy efficiency of FPGAs, GPUs and CPUs; a recent survey is provided in [37]. Both FPGAs and GPUs typically outperform CPUs according to this metric. GPUs have been shown to be more energy efficient than FPGAs for certain applications such as matrix multiplication. To some extent, this is a result of the decision to optimize the GPU architecture for this problem [38]. However, the flexibility of an FPGA has seen it outperform a GPU, in terms of energy-efficiency or performance-per-watt, across a broader spectrum of applications. Examples include 2-D FIR (finite-impulse response) filters, Viola-Jones face detection, K-means clustering, Monte-Carlo options pricing, random number generation, Smith-Waterman, and 3-D ultrasound computer tomography [37]. Energy efficiency gains using FPGAs have also been claimed on commercial systems. For example, Microsoft reported a 3x energy efficiency gain, and a reduced latency, when using FPGAs instead of GPUs on their Catapult machine [39], which is discussed later in Section 4.

While most of these performance comparisons have been performed using IEEE standard single or double precision arithmetic, this is not necessarily the most energy-efficient design possible on an FPGA. This is because FPGAs have the freedom to implement any precision, so it may be possible to create a working design using a custom (reduced precision) fixed or floating-point number format that is sufficient to satisfy a design specification. This will avoid unnecessary computation and can improve the energy-efficiency and performance of an FPGA implementation dramatically [40, 41].
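
As an illustration of what a custom reduced-precision format means in practice, the Python sketch below quantises values to a fixed-point representation with a chosen number of fractional bits and computes a dot product using integer arithmetic only. The helper names and the choice of 8 fractional bits are hypothetical; a real design would select the word length from its accuracy specification.

```python
def to_fixed(x, frac_bits):
    """Quantise x to an integer code with `frac_bits` fractional bits."""
    return round(x * (1 << frac_bits))


def from_fixed(code, frac_bits):
    """Convert an integer code back to a real value."""
    return code / (1 << frac_bits)


def dot_fixed(xs, ws, frac_bits):
    """Dot product on integer codes, as narrow hardware multipliers would do.

    Each product carries 2 * frac_bits fractional bits, so the accumulated
    sum is rescaled once at the end.
    """
    acc = sum(to_fixed(x, frac_bits) * to_fixed(w, frac_bits)
              for x, w in zip(xs, ws))
    return acc / (1 << (2 * frac_bits))
```

The quantisation error of each value is at most half of the least significant bit, so fewer fractional bits mean smaller multipliers and adders at the price of a larger, but bounded, error.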

To a degree, the flexibility of an FPGA is even beyond that possible in an ASIC. For example, in an FPGA-based implementation of RSA cryptography [30], a different hardware modular multiplier for each prime modulus was employed (i.e., the modulus was hardwired in the logic equations of the design). Such an approach would not be practical in an ASIC as the design effort and cost is too high to develop a different chip for different moduli. This led to greatly reduced hardware and improved performance, the implementation being an order of magnitude faster than any reported implementation in any technology at the time.

Another important application is logic emulation [42, 43] where reconfigurable computing is not only used for simulation acceleration, but also for prototyping of ASICs and in-circuit emulation. In-circuit emulation allows the possibility of testing prototypes at full or near-full speed, allowing more thorough testing of time-dependent applications such as networks. It also removes many of the dependencies between ASIC and firmware development, allowing them to proceed in parallel and hence shortening development time. As an example, it was used in [44] for the development of a two-million-gate ASIC containing an IEEE 802.11 medium access controller and IEEE 802.11a/b/g physical layer processor. Using a reconfigurable prototype of the ASIC on a commodity FPGA board, the ASIC went through one complete pass of real-time beta testing before tape-out.

Digital logic, of course, maps extremely well to fine-grained FPGA devices. The main design issues for such systems lie in partitioning of a design among multiple FPGAs and dealing with the interconnect bottleneck between chips. The Cadence Protium Rapid Prototyping Platform [45] is a commercial example of a logic emulation system and has 100 million-gate logic capacity and fast compilation and partitioning algorithms. Further discussion of interconnect time-multiplexing and system decomposition is given later in this article. Some examples of applications accelerated using early multiple FPGA systems are discussed below.

Hoang [26] implemented algorithms to find minimum edit distances for protein and DNA sequences on the Splash 2 architecture. Splash 2 can be modeled in terms of both bidirectional and unidirectional systolic arrays. In the bidirectional algorithm, the source character stream is fed to the leftmost processing element (PE), whereas the target stream is fed to the rightmost PE. Comparing two sequences of length m and n requires at least 2 × max(m+1, n+1) processors, and the number of steps required to compute the edit distance is proportional to the size of the array. The unidirectional algorithm is suited for comparing a single source sequence against multiple target sequences. The source sequence is first loaded as in the bidirectional case, and the target sequences are fed in one after the other and processed as they pass through the PEs (which results in virtually 100% utilization of processors, so that the unidirectional model is better suited for large database searches).
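
The reason edit distance maps well onto a systolic array is that all cells on one anti-diagonal of the dynamic programming table are independent of each other. The Python sketch below computes the table in that wavefront order. It is not Hoang's implementation: unit costs for insertion, deletion and substitution are assumed here, and the inner loop that hardware would execute in parallel is executed sequentially.

```python
def edit_distance(source, target):
    """Minimum edit distance, swept one anti-diagonal (wavefront) at a time."""
    m, n = len(source), len(target)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete all of source[:i]
    for j in range(n + 1):
        d[0][j] = j                      # insert all of target[:j]
    for s in range(2, m + n + 1):        # s = i + j identifies a wavefront
        # Cells on the same wavefront depend only on earlier wavefronts,
        # so an array of processing elements could update them together.
        for i in range(max(1, s - n), min(m, s - 1) + 1):
            j = s - i
            cost = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match or substitution
    return d[m][n]
```

The number of wavefronts is m + n - 1, which is consistent with the statement that the number of steps grows with the size of the array rather than with the product of the sequence lengths.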

A common application domain for reconfigurable computing is in real-time data acquisition and signal processing. The BEE2 system [27], described in the next section, was applied to the radio astronomy signal processing domain, which included development of a billion-channel spectrometer, a 1024-channel polyphase filter bank, and a two-input, 1024-channel correlator. The FPGA-based system used a 130-nm technology FPGA and performance was compared with 130- and 90-nm DSP chips as well as a 90-nm microprocessor. Performance in terms of computational throughput per chip was found to be a factor of 10 to 34 over the DSP chip in 130-nm technology and 4 to 13 times better than the microprocessor. In terms of power efficiency, the FPGA was one order of magnitude better than the DSP and two orders of magnitude better than the microprocessor. Compute throughput per unit chip cost was 20–307% better than the 90-nm DSP and 50–500% better than the microprocessor.

One final emerging application domain is machine learning. Reconfigurable implementations show great promise for addressing its heavy computational demands, and reconfigurable computing is particularly strong in embedded and low-precision scenarios. Tridgell et al. demonstrated regression, classification and novelty detection using online kernel methods. Their fully pipelined implementation could process continuous data at rates higher than 1 Gbps and perform simultaneous learning and prediction with a latency of 100 ns [46]. Zhang et al. applied a roofline model to balance resource utilization and memory bandwidth in the acceleration of a deep convolutional neural network (CNN). They achieved 62 GFLOPS on a single Xilinx Virtex VC707 board, this being a 4.8X speedup over a 16-thread implementation on an Intel Xeon E5-2430 processor [47].
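
The roofline model mentioned above bounds attainable throughput by the lesser of the compute peak and the product of memory bandwidth and operational intensity. The Python sketch below states this bound; the numbers in the example are hypothetical and are not those of the design in [47].

```python
def roofline(peak_gflops, bandwidth_gbytes_per_s, flops_per_byte):
    """Attainable GFLOPs under the roofline model.

    flops_per_byte is the operational intensity of the kernel: floating
    point operations performed per byte moved to or from external memory.
    """
    return min(peak_gflops, bandwidth_gbytes_per_s * flops_per_byte)


# Hypothetical accelerator: 100 GFLOPs compute peak, 10 GB/s memory bandwidth.
memory_bound = roofline(100.0, 10.0, 2.0)    # limited by bandwidth
compute_bound = roofline(100.0, 10.0, 50.0)  # limited by compute resources
```

A design is balanced when increasing data reuse (operational intensity) moves it to the point where the two limits meet; beyond that point extra on-chip buffering brings no further gain.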

4 System Architectures

Reconfigurable computing machines are constructed by utilizing one or more FPGAs. Most systems include other elements, such as microprocessors and storage, and can be treated as processing elements and memory that are interconnected. Obviously, the arrangement of these elements affects the system performance and routability, and some examples are given in this section.

The Avnet Zedboard is a development board which integrates a single Xilinx Zynq XC7Z020 FPGA (which contains FPGA logic and a dual-core ARM Cortex-A9 processor), DDR memory, SD card, Ethernet, USB and video interfaces. This single board computer can run the Linux operating system, and it provides a low-cost entry point for teaching and research in reconfigurable computing [48].

The simplest topology for connecting multiple FPGAs involves a ring, mesh, or other fixed pattern. FPGAs serve as both logic and interconnect, providing direct communication between adjacent devices. Such an architecture is predicated on locality in the circuit design and further assumes that the circuit design maps well to the planar mesh. This architecture fits well for applications with regular local communications [49]. However, in general, high performance is hard to obtain for arbitrary communication patterns because the architecture only provides direct communications between neighboring FPGAs and two distant FPGAs may need many other devices as “hops” to communicate, resulting in long and widely variable delays. Furthermore, FPGAs, when used as interconnects, often result in poor timing characteristics.
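
The variability of delay in such fixed topologies can be quantified by counting links on the shortest path. The Python sketch below does this for a 2-D mesh and a ring; the coordinate scheme and function names are assumptions made for illustration, and each link traversed beyond the first implies one intermediate FPGA acting as a hop.

```python
def mesh_links(src, dst):
    """Links on a shortest path between two FPGAs in a 2-D mesh.

    src and dst are (row, column) positions; without wrap-around links the
    shortest path length is the Manhattan distance.
    """
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def ring_links(src, dst, size):
    """Links on a shortest path between two positions in a ring of `size` FPGAs."""
    d = abs(src - dst) % size
    return min(d, size - d)


# Neighbours communicate directly; opposite corners of a 4 x 4 mesh do not.
adjacent = mesh_links((0, 0), (0, 1))
corner_to_corner = mesh_links((0, 0), (3, 3))
```

In this example the path between opposite corners crosses six links and therefore five intermediate devices, which illustrates why arbitrary communication patterns see long and uneven delays.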

Figure 4 depicts the SPLASH 2 architecture [50] published in 1990. Each board contains 16 FPGAs, X1 through X16. The blocks M1 through M16 are local memories of the FPGAs. A simplified 36-bit bus crossbar, with no permutation of the bit-lines within each bus, interconnects the 16 FPGAs. Another 36-bit bus connects the FPGAs in a linear systolic fashion. The local memories are dual ported with one port connecting to the FPGAs and the other port connecting to the external bus. It is interesting to note that the crossbar was added to the SPLASH 2 machine, the original SPLASH 1 machine only having the linear connections. SPLASH 2 has been successfully used for custom computing applications such as search in genetic databases and string matching [26].

Figure 4. SPLASH 2 architecture. Each board contains 16 FPGAs, X1 through X16. The blocks M1 through M16 are local memories of the FPGAs. A simplified 36-bit bus crossbar, with no permutation of the bit-lines within each bus, interconnects the 16 FPGAs. Another 36-bit bus connects the FPGAs in daisy-chain fashion. The local memories are dual ported with one port connecting to the FPGAs and the other port connecting to the external bus.

Other designs have used a hierarchy of interconnect schemes, differing in performance. The use of multi-gigabit transceivers (MGTs) available on contemporary FPGAs allows high bandwidth interconnection using commodity components. An example is the Berkeley Emulation Engine 2 (BEE2) [27], designed for reconfigurable computing and illustrated in Fig. 5. Each compute module consists of five FPGAs (Xilinx XC2VP70) connected to four double data rate 2 (DDR2) dual inline memory modules (DIMMs) with a maximum capacity of 4 GB per FPGA. Four FPGAs are used for computation and one for control. Each FPGA has two PowerPC 405 processor cores. A local mesh connects the computation FPGAs in a 2-D grid using low-voltage CMOS (LVCMOS) parallel signaling. Off-module communications are via 18 (two from the control FPGA and four from each of the compute FPGAs) Infiniband 4X channel-bonded 2.5-Gbps connectors that operate full-duplex, which corresponds to a 180-Gbps off-module full-duplex communication bandwidth. Modules can be interconnected in different topologies including tree, 3-D mesh, or crossbar. The use of standard interfaces allows standard network switches such as Infiniband and 10-Gigabit Ethernet to be used. Finally, a 100base-T Ethernet connection to the control FPGA is present for out-of-band communications, monitoring, and control.

Figure 5. BEE2 Compute Module block diagram. Compute modules can be interconnected via the Infiniband IB4X connectors, either directly or via a 10-Gigabit Ethernet switch. The 100-BaseT Ethernet can be used for control, monitoring, or data archiving.
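
The 180-Gbps figure quoted for the BEE2 module follows directly from the connector counts given in the text, as the short Python calculation below shows. The constant names are ours, and the only assumption is that "4X" denotes four bonded lanes of 2.5 Gbps each.

```python
CONTROL_FPGA_LINKS = 2        # connectors on the control FPGA
COMPUTE_FPGAS = 4             # computation FPGAs per module
LINKS_PER_COMPUTE_FPGA = 4    # connectors on each computation FPGA
LANES_PER_LINK = 4            # assumed meaning of "4X" channel bonding
LANE_RATE_GBPS = 2.5          # signalling rate per lane

links = CONTROL_FPGA_LINKS + COMPUTE_FPGAS * LINKS_PER_COMPUTE_FPGA
bandwidth_gbps = links * LANES_PER_LINK * LANE_RATE_GBPS
```

This yields 18 connectors and an aggregate of 180 Gbps in each direction, matching the values stated for the module.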

Commercial machines, such as the Maxeler MPC-X2000 system [51], have a similar interconnect structure to the BEE2 in that they are parallel machines employing high performance microprocessors tightly coupled to a relatively small number of FPGA devices per node. The MPC-X2000 is a 1U server with eight large FPGAs, called dataflow engines (DFEs), interconnected in a ring arrangement. A total of 384 GB of dynamic RAM is supported and multiple host processors can communicate with each DFE via a high-speed Infiniband switched interconnect network. Such machines can have orders of magnitude performance improvement over conventional architectures and switching topologies can be altered via configuration of the switching fabric.

Microsoft took a different approach in their Catapult machine, choosing a single daughter card per server over multi-FPGA boards for reasons of scalability, capacity, power, space and reliability [52]. Each FPGA card operates under 20 W, is hosted by a server via PCI Express and contains 8 GB of dynamic RAM. The FPGA boards are organized in a 24U arrangement of 48 half-width 1U servers, directly connected together with SAS cables. A test system containing 1,632 servers was shown to reduce the tail latency of the Microsoft Bing search engine by 29% and improve ranking throughput of each server by 95%.

The Intel-Altera Heterogeneous Architecture Research Platform (HARP) utilizes Intel Quickpath Interconnect (QPI) in a dual socket motherboard with the processor and FPGA each occupying a socket [53]. This offers higher bandwidth and lower latency than conventional daughter cards. A coherent shared memory between the processor and FPGA gives the promise of a greatly simplified programming model and tighter processor-FPGA coupling which will benefit irregular data access patterns.

5 Runtime Reconfiguration

A reconfigurable computing system can have its functionality updated during execution, resulting in reduced resource requirements. A runtime reconfigurable system partitions a design temporally so that the entire design does not need to be resident in the FPGA at any given moment [54, 55]. Instead, the FPGA fabric is time-shared between specialized hardware accelerators at runtime. Using this technique, designs larger than the available hardware resources can be realized, or alternatively, an existing design may be implemented on a smaller or cheaper device. Furthermore, energy efficiency can be increased because the entire fabric can be used more effectively.

Single context architectures, multiple context architectures and partially reconfigurable FPGAs have been developed. In a single context system, any changes to the functionality of the FPGA involve reloading the entire bitstream; early FPGAs were of this type. This scheme has the disadvantage of long reconfiguration time. Multiple context or time-sharing architectures lie at the other extreme. These allow a number of complete configurations to be stored in the fabric simultaneously and thus reconfiguration can be achieved in a small number of cycles. These architectures were also proposed for early FPGAs. As an example, an architecture named Dharma was proposed that contains a functional block and an interconnect network [56]. By breaking a large design into levels in a folded pipeline, the logic modules and interconnect can be time-shared by dynamically reconfiguring each level. This topology simplifies the architecture and provides predictable interconnect delay (Fig. 6). Multiple context architectures, such as NEC's Dynamically Reconfigurable Processor (DRP) [57], were later developed. Such architectures have the shortest context switch time; however, a larger area overhead is associated with implementation of this scheme.

Figure 6. Dynamic Architecture for FPGA-based systems. The architecture contains a functional block and an interconnect network. The interconnect and the logic can be time shared. The emulated design topology is levelized in a folded pipeline manner. The levelized topology simplifies the architecture with predictable interconnect delay.

Partially reconfigurable FPGAs, as supported by the major FPGA vendors in Xilinx Virtex [15] and Altera Stratix [16] architectures, have begun to dominate the market. These architectures allow portions of the FPGA to be changed via a memory mapped scheme whilst the other portions of the FPGA continue functioning. In comparison to a single context scheme, there is some area overhead associated with providing this feature; ideally this is compensated for by more efficient use of the fabric.

Many tools have been developed to help support runtime reconfiguration. Commercial tools, provided by the main FPGA vendors Xilinx and Altera, aim to abstract the low level implementation details from the engineer. However, other open source tools have been developed to enable more flexible systems. For example, ReCoBus-Builder introduced a simple interface for communication between the static part of a system and the dynamic modules, as well as the ability to place and route partial modules separately, before linking these compiled bitstreams at run-time [58]. This makes modules interchangeable and speeds up the compilation process. The GoAhead tool takes this further, allowing the FPGA fabric to be separated into different regions, with individual modules compiled to fit into one or more of these regions [59]. It also provides support for modules to communicate between or across regions. This improves flexibility in placement of modules and promotes sharing across regions. Tools to help determine the optimum number of regions have also been developed [60].

The aforementioned tools are also able to support hierarchical designs, where a partial region does not need to be fully reconfigured; instead a smaller region within this area can be reconfigured. This has multiple advantages. Firstly, storing a few different modules at each hierarchical level provides a huge amount of flexibility, saving significant configuration memory in comparison to storing all different modules at the highest level. Furthermore, since only a small region needs to be reconfigured, the reconfiguration time is reduced [61].

Tools also exist to help overlap reconfiguration and computation to maximize the performance of the device. For example, ZyCAP, which is based on the Xilinx Zynq architecture with an embedded ARM CPU, provides software drivers to help reconfiguration be overlapped with computation by controlling all the reconfiguration processes [62]. It can alert the software that configuration is complete, and also manages how partial bitstreams are stored in memory. This is important to help maximize performance; for example, this tool offers the ability to cache partial bitstreams in DRAM to speed up the reconfiguration process. Finally, there are also efforts to verify that the partial bitstreams perform the desired functionality [63].

There are many examples of run-time reconfiguration, with the logical unit of reconfiguration ranging from application level down to a sub-instruction. These are discussed below.

At the application level, examples include adapting the bitstream according to changes in environmental conditions. For example, Claus et al. discussed how hardware accelerators may be needed for real-time video processing, but in the context of driver assistance, adapting them according to changing light conditions could improve performance. They demonstrated that this can prove worthwhile since modules can be quickly reconfigured between frames [64].

Task level reconfiguration is common for software defined radio, for example when switching between encoding schemes. The trade-offs between full or partial reconfiguration in this problem domain are discussed by Delahaye et al. [65]. Similarly, Feillen et al. discussed how different stages of digital video decoding do not need to operate concurrently, meaning the same hardware could be re-used in this example [66]. Task level reconfiguration for an operating system has also been proposed [67]. Under control of software running on a microprocessor, task circuits can be scheduled online and placed in a suitable free space in a hardware task area. Communications between tasks and I/O are done through a task communication bus, and termination of a task frees the reconfigurable resources used. It was shown that hardware in the hardware task area can be shared by tasks and the overheads associated with its implementation on a partially configurable platform were acceptably low. This helps improve scheduling of real-time tasks.

Instruction level reconfiguration has been demonstrated for hardware accelerated database queries. Different hardware modules for SQL queries could be dynamically configured to improve performance [68] and energy efficiency [69]. A CPU system with custom instructions is another common candidate for instruction level reconfiguration. An early example is the Dynamic Instruction Set Computer (DISC) [70], which supported demand-driven modification of the instruction set through partial reconfiguration. The commercial Stretch processor [71] combines reconfigurable fabric with a processor to support the execution of custom instructions implemented on a reconfigurable fabric. Furthermore, the fabric can be reconfigured at runtime and the design environment is software-centric, with programming of the processor being in Stretch C.

Finally, partial reconfiguration has also been shown for sub-instructions. For example, a pipeline stage could be a convenient unit of reconfiguration, as demonstrated by incremental pipeline reconfiguration [72]. Assume an FPGA that has enough silicon area for N physical pipeline stages, but the design contains M pipeline stages (where M >> N). Through adding one pipeline stage and removing the trailing pipeline stage in each stage of the computation, reconfiguration and computation can be overlapped. Such a circuit will implement a pipeline of depth N and fully utilize the FPGA at any given point in time. Runtime reconfiguration can be done at even lower levels. Examples include those supporting hierarchical designs for a CPU with greater numbers of custom instructions [61] and a crossbar switch which employs runtime reconfiguration of the FPGA's routing resources [73]. By partially reconfiguring routing multiplexers, this scheme was able to achieve density, switch update latency and performance higher than possible using conventional means.
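
The incremental pipeline reconfiguration scheme described earlier in this paragraph can be pictured as a window of N stages sliding over the M virtual stages. The Python sketch below is a simplified model of a single pass through the pipeline under our own assumptions (one stage configured per step, physical slots reused round-robin); it is not a description of the implementation in [72].

```python
def resident_stages(step, n_physical):
    """Virtual stages held in the fabric after configuration step `step`.

    Step 0 loads virtual stage 0, step 1 loads stage 1, and so on; once the
    fabric is full, loading a new stage evicts the oldest (trailing) one.
    """
    oldest = max(0, step - n_physical + 1)
    return list(range(oldest, step + 1))


def physical_slot(virtual_stage, n_physical):
    """Physical slot used by a virtual stage when slots are reused round-robin."""
    return virtual_stage % n_physical


N, M = 3, 8  # hypothetical sizes with M > N
schedule = [resident_stages(step, N) for step in range(M)]
```

After the first N steps every entry of the schedule contains exactly N stages, each in a distinct physical slot, so the fabric remains fully occupied while one slot at a time is rewritten.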

6 Design Methods

Hardware description languages (HDLs) such as the Very High Speed Integrated Circuit Hardware Description Language (VHDL) and Verilog are commonly used to specify the logic of a reconfigurable system. Descriptions in these languages have the advantage of being vendor neutral, so the same description can be synthesized for different targets such as different FPGA devices, different FPGA vendors, and ASICs. For this reason, these languages are often the target language for higher level tools that offer higher levels of abstraction.

Module generators and libraries are commonly deployed to promote reuse. For example, vendors such as Altera and Xilinx have parameterized libraries of components that can be used in a design. These libraries are generated so that a circuit optimized for the particular application can be produced. As an example, a parameterized floating point library might allow the wordlength of the exponent and significand to be specified as well as whether denormalized numbers are supported. The module generator then generates a netlist or VHDL-based floating point adder that can be included in a design. Open source tools such as the FloPoCo library also provide vendor neutral alternatives for generating many key components [74].

To help make FPGAs more mainstream, considerable effort has been put into high-level synthesis, which is the process of compiling a traditional high level language down to a netlist or HDL. The use of traditional programming languages improves productivity as low level details are handled by the compiler. This is analogous to C versus assembly language for software development. Another difference with potentially large implications is that, using these tools, software developers can also design reconfigurable computing applications.

As an early example, Luk and Page described a simple compilation process [75, 76] from a high level language with explicit parallel extensions to a register transfer language (RTL) description. Parallel execution of statements is implemented via parallel processes, and these can communicate via channels through which a single-word message can be passed. Variables in the user program are mapped to registers, all expressions are implemented as combinational logic, and multiplexers are used in the case a register has multiple sources. A datapath that matches the dataflow graph of the input source description is generated using this strategy. The clocking scheme employed is a global, synchronous one, and a convention that each assignment takes exactly one clock cycle is followed. A start signal is used to feed the clock and to enable each register that corresponds to a variable, and a finish signal is generated for the assignment in the following clock cycle. To execute statements sequentially, the start and finish signals of adjacent statements are simply connected together, creating a one-hot distributed control scheme. Conditional statements and loops are formed by asserting one of several possible start signals that correspond to alternative basic blocks in a program. Completion of conditional or loop constructs and synchronization of parallel blocks are implemented by combining relevant finish signals using the appropriate combinatorial logic. An example showing the translation of a simple code fragment to control and datapath is shown in Fig. 7.

Figure 7. Hardware compilation example. The C program is translated into a datapath (top) and control (bottom). Execution of statements in the while loop is controlled by s1 and s2; s0 and s3 correspond to the start signals of the statements before and after the while loop.
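The one-hot control convention described above can be mimicked in software. The sketch below is a simplified behavioural model, not the compiler of [75,76]: it assumes a hypothetical four-statement program (s0 initializes a counter, s1 and s2 form the body of a while loop, s3 follows the loop), charges one clock cycle per assignment, and passes a single active start signal from statement to statement. The program itself is an assumption and is not the fragment of Fig. 7.

```python
def run(n):
    """Behavioural model of one-hot control for a hypothetical program:

        s0: i = 0
        while i < n:
            s1: acc = acc + i
            s2: i = i + 1
        s3: done = 1
    """
    regs = {"i": 0, "acc": 0, "done": 0}   # one register per program variable
    active = "s0"                          # the single asserted start signal
    cycles = 0
    trace = []
    while active is not None:
        trace.append(active)
        cycles += 1                        # each assignment takes one clock cycle
        if active == "s0":
            regs["i"] = 0
        elif active == "s1":
            regs["acc"] = regs["acc"] + regs["i"]
        elif active == "s2":
            regs["i"] = regs["i"] + 1
        else:                              # s3
            regs["done"] = 1
        # The finish signal of this statement selects the next start signal.
        if active in ("s0", "s2"):         # loop test is combinational logic
            active = "s1" if regs["i"] < n else "s3"
        elif active == "s1":
            active = "s2"
        else:
            active = None
    return regs, cycles, trace
```

For `n = 3` the model visits s0, then s1 and s2 three times each, then s3, so the cycle count equals the number of assignments executed.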

High-level synthesis tools have since moved beyond simply translating a high-level language to a hardware design; instead they focus on creating an optimized hardware design. Straightforward examples include extracting parallelism through loop unrolling or creating deeply pipelined designs to maximize clock frequency. However, finer-grained optimizations are also possible. For example, since moving data onto the FPGA can be expensive, storing data locally on the chip and re-using the data can have substantial performance implications [77]. While the idea is similar to that of caching on a microprocessor, on an FPGA the local buffers can be sized according to the needs of a particular algorithm, saving resources. Alternatively, the memory can be arranged in a fashion that provides a large memory bandwidth, which may be necessary to feed a parallel datapath. Moreover, the interface with the DRAM can also be controlled to ensure that memory reads occur faster, for example by ‘activating’ rows that are soon to be read [78]. Another optimization is to fine-tune the precision used throughout computations, i.e., to use as little precision as is necessary to meet the design specification. Unlike a CPU implementation, an FPGA design has the freedom to implement any precision. Since arithmetic operators with less precision use less silicon area, using the minimum precision necessary frees resources, allowing for greater parallel performance. Tools that support this design methodology, in both fixed and floating point arithmetic, are being developed [79,80].
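As a much simplified illustration of precision tuning, the sketch below picks the smallest number of fractional bits for which a single fixed-point rounding stays within a given error tolerance. It only bounds the error of quantizing one value; the tools cited above analyse how such errors propagate through a whole computation, which this sketch does not attempt. The names `min_frac_bits` and `quantize` are illustrative.

```python
def min_frac_bits(tolerance):
    """Smallest fractional word length whose rounding error is <= tolerance."""
    # Round-to-nearest with f fractional bits has worst-case error 2**-(f + 1).
    f = 0
    while 2.0 ** -(f + 1) > tolerance:
        f += 1
    return f


def quantize(x, frac_bits):
    """Round x to the nearest multiple of 2**-frac_bits."""
    scale = 2 ** frac_bits
    return round(x * scale) / scale


bits = min_frac_bits(1e-3)
errors = [abs(quantize(x, bits) - x) for x in (0.1, 0.7071, 3.14159, -2.5)]
```

Every additional fractional bit halves the worst-case rounding error, so loosening the tolerance translates directly into narrower, and therefore smaller, arithmetic operators.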

Various other issues with the mapping of algorithms to hardware are more generally discussed by Isshiki and Dai [81], who focus on the differences between implementing bit-serial versus bit-parallel modules (e.g., adders and multipliers) on FPGA architectures. Although latency is larger for bit-serial modules, the reduction in area frequently makes area-time products significantly lower for such implementations. More specifically, the comparison favors bit-serial modules in three ways: 1) for bit-parallel modules, the I/O pin limitation is a major problem, and the large size of the module cluster can result in unused space and underutilized logic resources; 2) bit-serial modules are easier to partition as cell-to-cell connections are sparse and do not cause I/O problems; and 3) high fanout nets can impair routability of bit-parallel modules. Leong and Leong [82] generalized further with a design methodology that can translate a dataflow description with signals of different word lengths to a digit-serial design.
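The latency versus area trade-off of bit-serial arithmetic can be seen in a behavioural model of a bit-serial adder: a single full adder and one carry flip-flop are reused for every bit position, so a `width`-bit addition takes `width` clock cycles. This is a generic textbook construction given as a sketch, not the specific module design of [81] or [82].

```python
def bit_serial_add(a, b, width):
    """Add two unsigned integers one bit per clock cycle, LSB first.

    Returns (sum modulo 2**width, carry out, cycles used).
    """
    carry = 0          # the single carry flip-flop
    result = 0
    for cycle in range(width):
        a_bit = (a >> cycle) & 1
        b_bit = (b >> cycle) & 1
        # One full adder, reused on every cycle.
        sum_bit = a_bit ^ b_bit ^ carry
        carry = (a_bit & b_bit) | (carry & (a_bit ^ b_bit))
        result |= sum_bit << cycle
    return result, carry, width
```

A bit-parallel adder of the same width would need `width` full adders and `width`-bit operand buses, whereas the serial version needs one full adder and one-bit connections, which is why its modules are easier to place and partition.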

Commercial tools that can compile standard programming languages such as Java, C, or C++ (e.g., [76]) are available. Examples include Xilinx’s Vivado HLS [15], Maxeler’s MaxCompiler [83] and Catapult C from Mentor Graphics [84]. Domain-specific languages such as MATLAB/Simulink offer even greater improvements in productivity because they are interactive, include a large library of primitive routines and toolkits, and have good graphing capabilities. Indeed, many designs for communications and signal processing are first prototyped in MATLAB and then converted to other languages for implementation. Tools such as the MATCH compiler [85], Xilinx System Generator [15], Altera DSP Builder [16] and MathWorks’ HDL Coder [86] can translate a subset of MATLAB/Simulink directly to an FPGA design. There is also interest in supporting more parallel C-to-gates flows. Support for more recent parallel programming languages is gaining traction, for example the Altera SDK for OpenCL [16] and efforts to support NVIDIA’s CUDA using FPGAs [87].

Due to the difficulty in creating a full-custom design, there is also support for creating hardware/software co-designs. The availability of embedded operating systems such as Linux for microprocessors on an FPGA provides a familiar software development environment for programmers, greatly facilitating program development through the availability of a large range of open-source libraries as well as high quality development tools. Such tools can greatly shorten development time and improve the quality of embedded systems. For example, Altera's Nios II C-to-Hardware acceleration compiler enables time-critical functions in a C program to be converted to a hardware accelerator that is tightly coupled to a microprocessor within the FPGA [88]. These tools support soft processors, such as the Altera Nios or Xilinx MicroBlaze, and embedded processors, such as those on the Xilinx Zynq or Altera SoCs. With the latter, for optimal performance, the parts of an algorithm that are easily parallelizable should make use of the parallel FPGA fabric, whereas serial parts of the algorithm should be run on a processor [89].

A final design approach to allow for fast FPGA prototyping is the use of overlay architectures. These are coarse-grained architectures with software-like programmability, with the aim of sacrificing some performance in exchange for ease of implementation. For example, VectorBlox extends the hardware-software paradigm by using the FPGA fabric to provide parallel vector instructions that can be easily executed [90]. A typical design flow using this technology would be to create an initial software design, add vector instructions within a software-style development flow to obtain some acceleration, and finally create custom hardware instructions for the most time-consuming parts of an algorithm. This may provide a faster time to market. Many overlay architectures have been created, including some for specific applications, such as for efficient network-on-chip (NoC) interconnections of processors [91] or dataflow graphs [92], and some designed to make use of specific hardened components on FPGAs such as DSPs [93].

7 Multichip Systems

Special care must be taken in the design of large and multichip reconfigurable systems. In this section, we describe some theoretical results relevant to the major architectural issues associated with such designs.

7.1 Interconnect Organization

A classic Clos network [94] contains three stages: inputs, intermediate switches, and outputs, as shown in Fig. 8. It can be used to interconnect pins in a reconfigurable computing system, and its input and output stages are symmetric. Suppose the first stage has r crossbar switches of size n×m, the second stage has m switches of size r×r, and the third stage has r switches of size m×n; we denote this network as c(n,m,r). For any two-pin net interconnect requirement, the network c(n,m,r) can achieve complete routability if m is not less than n. The routing method can be described by recursive operations [95]. In the first iteration, we reduce the network to c(n-1,m-1,r). In the ith iteration, we reduce the network to c(n-i,m-i,r). When n-i=1, we have r switches of size 1×(m-n+1) in the first stage, m-n+1 switches of size r×r in the second stage, and r switches of size (m-n+1)×1 in the third stage. In other words, only one input exists in each first-stage switch and one output in each third-stage switch. In this case, one second-stage r×r switch is enough to route the r inputs of the r first-stage switches to the r outputs of the r third-stage switches, thus completing the interconnect.

Figure 8. Clos network. A Clos network contains three stages: inputs, intermediate switches, and outputs. The input and output stages are symmetric. In the figure, the first stage has r switches of size n×m, the second stage has m switches of size r×r, and the third stage has r switches of size m×n.

The reduction from c(n-i,m-i,r) to c(n-i-1,m-i-1,r) can be derived by a maximum matching algorithm. The matching algorithm selects disjoint signals from different input switches to different output switches. One second-stage switch is then used to route the selected signals. From Hall's theorem, the maximum matching and routing can always reduce the network to c(n-i-1,m-i-1,r).
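The recursive reduction can be sketched as follows. Each two-pin net is given as a pair (input switch, output switch); the sketch pads the demand with dummy nets until every switch carries exactly n nets, and then repeatedly extracts a perfect matching with a standard augmenting-path search, assigning each matching to one second-stage switch. The padding step, the function name `route_clos` and the use of augmenting paths are choices made here for illustration and are not taken from [95].

```python
def route_clos(nets, n, m, r):
    """Assign each two-pin net (in_switch, out_switch) to a middle switch.

    Assumes m >= n and that no switch appears in more than n nets.
    Returns a list giving the middle-switch index chosen for each net.
    """
    if m < n:
        raise ValueError("complete routability needs m >= n")
    deg_in, deg_out = [0] * r, [0] * r
    for i, j in nets:
        deg_in[i] += 1
        deg_out[j] += 1
    if max(deg_in + deg_out) > n:
        raise ValueError("a switch carries more than n nets")

    # Pad with dummy nets (index None) so that the demand graph is n-regular.
    edges = [(i, j, k) for k, (i, j) in enumerate(nets)]
    i = j = 0
    while True:
        while i < r and deg_in[i] == n:
            i += 1
        while j < r and deg_out[j] == n:
            j += 1
        if i == r:          # total deficits are equal, so j == r as well
            break
        edges.append((i, j, None))
        deg_in[i] += 1
        deg_out[j] += 1

    assignment = [None] * len(nets)
    remaining = set(range(len(edges)))
    for middle in range(n):                 # one reduction step per middle switch
        adj = [[] for _ in range(r)]
        for e in remaining:
            adj[edges[e][0]].append(e)
        match_out = [None] * r              # output switch -> matched edge

        def augment(src, seen):
            for e in adj[src]:
                dst = edges[e][1]
                if dst in seen:
                    continue
                seen.add(dst)
                if match_out[dst] is None or augment(edges[match_out[dst]][0], seen):
                    match_out[dst] = e
                    return True
            return False

        for src in range(r):
            if not augment(src, set()):     # cannot happen for a regular graph
                raise RuntimeError("no perfect matching found")
        for e in match_out:
            remaining.discard(e)
            if edges[e][2] is not None:
                assignment[edges[e][2]] = middle
    return assignment
```

Because the padded demand graph is regular, Hall's theorem guarantees that every round finds a perfect matching, and after n rounds every net has been assigned to one of the first n middle switches.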

Conceptually, the routing problem can also be formulated as edge coloring on a bipartite graph G(V1,V2,E) [96]. The node sets V1 and V2 represent the switches in the input and output stages, respectively. An edge in E represents a two-pin net interconnect requirement between the corresponding input and output switches. In Reference [96], Chan and Schlag assigned colors to the edges of the bipartite graph. Edges of the same color are bundled into one group and the corresponding set of nets is routed by one switch in the second stage. The work of Reference [97] was then used to find a minimum edge coloring solution in O(|E| log n) time.

The three-stage Clos network can be folded into a two-stage network (Fig. 9) so that the inputs and outputs are mixed in the first stage. Thus, the corresponding bipartite graph G(V1,V2,E) constructed above for edge coloring is also folded with V1 and V2 merged into one set.

Figure 9. Folded Clos network. The three-stage Clos network is folded into a two-stage network so that the inputs and outputs are mixed in the first stage.

To find the routing assignment, the folded edge coloring graph can be unfolded back to a bipartite graph using an Euler path search. The Euler path traverses every edge exactly once and defines the edge direction according to the direction of the traversal. We then recover the original bipartite graph by splitting the node set back into two sets V1 and V2 and unfolding the edges such that all edges are directed from V1 to V2. We can find the minimum edge coloring solution of the unfolded bipartite graph and apply the solution back to the folded routing problem.
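The unfolding step can be sketched with an Euler-style traversal that gives every edge of the folded graph a direction. The sketch below assumes the folded graph is given as an undirected edge list over first-stage switches; to cope with switches of odd degree it adds a temporary dummy node joined to every odd-degree switch, which is an implementation convenience introduced here rather than part of the published method. After orientation, the tail of each edge is placed in V1 and the head in V2.

```python
def euler_orient(num_nodes, edges):
    """Orient undirected edges by walking closed trails (Euler traversal).

    Returns a list of (tail, head) pairs in the same order as `edges`;
    at every node, in-degree and out-degree differ by at most one.
    """
    dummy = num_nodes
    deg = [0] * (num_nodes + 1)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    # Join odd-degree nodes to a dummy node so that all degrees become even.
    all_edges = list(edges) + [(u, dummy) for u in range(num_nodes) if deg[u] % 2]
    adj = [[] for _ in range(num_nodes + 1)]
    for e, (u, v) in enumerate(all_edges):
        adj[u].append(e)
        adj[v].append(e)

    used = [False] * len(all_edges)
    oriented = [None] * len(all_edges)
    for start in range(num_nodes + 1):
        stack = [start]
        while stack:
            u = stack[-1]
            while adj[u] and used[adj[u][-1]]:
                adj[u].pop()                 # discard edges already traversed
            if not adj[u]:
                stack.pop()
                continue
            e = adj[u].pop()
            used[e] = True
            a, b = all_edges[e]
            v = b if a == u else a
            oriented[e] = (u, v)             # direction of traversal
            stack.append(v)
    return oriented[:len(edges)]             # drop the dummy edges
```

Since every trail walked in an even-degree graph closes on its starting node, each node is entered as often as it is left, so the unfolded bipartite graph has roughly half the maximum degree of the folded one.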

In practice, the first-level crossbar of the Clos network is replaced with FPGAs to save board space (Fig. 10). Routability is then worse than that of an ideal Clos network. Even with a true Clos network, complete routability of multipin nets is not guaranteed, which is an important practical consideration because many multipin nets typically exist in microelectronic designs.

Figure 10. Variations of the Clos network. The first-level crossbar of the Clos network is replaced with FPGAs to save board space. Routability is worse than that of an ideal Clos network.

In an attempt to solve the multipin net and routability problem, we can introduce extra connections among FPIDs as shown in Fig. 11. However, extra FPID interconnections also incur extra delay. We can also expand the fanout width of FPGAs so that each FPGA I/O pin is connected to more than one FPIC [98,99]. The fanout width expansion improves routability without significant additional delay. The multiple appearances of I/O pins increase the probability that a signal connection can be made in a single stage, which is especially critical for multipin nets. However, the additional fanouts increase the needed pin count of FPICs. Thus, we need to find a balanced fanout distribution that reduces the interconnect delay with a minimal pin requirement.

Figure 11. Variations of the Clos network. The fanout width of FPGAs is expanded so that each FPGA I/O pin is connected to more than one FPIC. The fanout width expansion improves routability without significant additional delay.

A tree-structured network can simplify the mapping process for certain applications. In Reference [100], an example of a tree-structured network is illustrated for a Very Large Scale Simulator (VLSS). The VLSS tree structure has all logic components located at the leaves and interconnect switches at the internal nodes. The machine has a capacity of eight million gates. Each branch is an 8-bit bus. The higher the level in the tree, the less parallelism the signal distribution can achieve. Therefore, a partitioning process is designed to minimize the high-level interconnect and maximize parallel operation.

7.2 Interconnect Multiplexing

Time multiplexing is an effective method for tackling the scalability problem in interconnecting large designs. The time-sharing method can be extended from traditional bus organization [42,100] to network sharing [101] and further to function block sharing [56].

Interconnect can be time shared as a bus [42,100]. If n communication lines exist between two FPGAs, they can be reduced to a single line by storing logical outputs in shift registers and time-multiplexing the communication in phases. Such a scheme was employed in the virtual wires logic emulation system [42], which is efficient because interconnects are normally capable of being clocked at much higher rates than the critical path of the rest of the system, and not all logical wires are simultaneously active. This scheme can greatly reduce the complexity of the interconnecting network or printed circuit board in a multi-FPGA system.
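A behavioural sketch of this idea is given below: n logical outputs are loaded in parallel into a shift register on the sending FPGA and shifted across one physical wire, one value per phase. The helper `max_logical_rate` uses a deliberately simplistic model, assumed here for illustration, in which every logical wire is sent once per emulated clock cycle and no phases are lost to scheduling.

```python
from collections import deque


def transmit(logical_values):
    """Send n logical wire values over one physical wire, one per phase."""
    shift_register = deque(logical_values)   # parallel load on the sender
    received = []
    phases = 0
    while shift_register:
        received.append(shift_register.popleft())   # one bit crosses per phase
        phases += 1
    return received, phases


def max_logical_rate(link_clock_hz, wires_per_pin):
    """Upper bound on the emulated clock rate under the simplistic model."""
    return link_clock_hz / wires_per_pin
```

For instance, four logical wires sharing one pin need four phases per emulated cycle, so a 100 MHz link clock bounds the emulated clock at 25 MHz in this model.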

Li and Cheng [101] proposed that a dynamic network be viewed as L conventional FPICs overlapped together and sharing the same I/O pins. A dynamic routing architecture can increase the routability and shorten interconnect length. Each switching network is a full crossbar, which can be reconfigured to provide any connections among I/O pins. The select lines are used to activate only one switching network at a time; thus the I/O pins are dynamically connected according to the configuration of this active switching network. By dynamically reconfiguring the FPICs, L logic signals can time-share the same interconnect resources.

7.3 Memory Allocation

Interconnect schemes should also consider how memory is connected to the FPGAs. Although combining memory with logic in the same FPGA is the most desirable method for reducing routing congestion and signal delay, separate components can supply much larger capacity at higher density and lower price. Figure 12 demonstrates three different ways of allocating the memories in a Clos network [96,102]. The memory may be attached directly to a local FPGA (Fig. 12a), attached to the second-stage switches of the Clos network via a host interface (Fig. 12b), or attached to the first-stage switches of the Clos network (Fig. 12c). The first method provides good performance for local memory access. However, for nonlocal memory access, routability and delay are concerns. The second method is slower than the first method for local memory accesses but provides better routability. The third is the most flexible as the memory is attached to the network and the routability is high. However, every logic-to-memory communication must go through the second interconnect stage.

Figure 12. Memory organization. (a) Memory is attached directly to a local FPGA. (b) Memory is attached to the second-stage switches of the Clos network via a host interface. (c) Memory is attached to the first-stage switches of the Clos network.

7.4 Bus Buffer Insertion

In FPGAs, signal propagation is inherently slow because of the programmable interconnect. However, the delay of long routing wires can be drastically reduced by buffer insertion. The principle at work is that by inserting buffers we can decouple the capacitive effects of the components and interconnect driven by the buffers and thereby improve RC delay.

Given a routing topology for a net and timing requirements for its sinks, an efficient optimal buffer insertion algorithm was proposed in [103]. Experimental results show dramatic improvement versus the unbuffered solution. Thus, it is advantageous to have abundant buffers in FPGAs. However, each possible buffer and its programmable switch add capacitance to the wires, which in turn contributes to delay. Thus, a balance point needs to be identified that trades off the additional delay and capacitance of the buffers against the improvement they can provide.
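The benefit of a buffer on a long wire can be illustrated with the Elmore delay model, a standard first-order RC estimate that is used here as an assumption; it is not the algorithm of [103]. The wire is modelled as a uniform distributed RC line, and all component values below are made-up numbers chosen only to show the trend (resistance in ohms, capacitance in fF, length in µm, delay in fs).

```python
def wire_delay(r_drive, length, c_load, r_unit, c_unit):
    """Elmore delay of a driver, a uniform RC wire and a lumped load."""
    c_wire = c_unit * length
    r_wire = r_unit * length
    # The driver sees all the capacitance; the wire sees half of its own plus the load.
    return r_drive * (c_wire + c_load) + r_wire * (c_wire / 2 + c_load)


def buffered_delay(position, length, r_drive, c_load,
                   r_unit, c_unit, r_buf, c_buf, t_buf):
    """Delay with one buffer inserted `position` along the wire."""
    first = wire_delay(r_drive, position, c_buf, r_unit, c_unit)
    second = wire_delay(r_buf, length - position, c_load, r_unit, c_unit)
    return first + t_buf + second


# Hypothetical values: ohm, fF and um, giving delays in fs.
params = dict(r_drive=100.0, c_load=10.0, r_unit=0.1, c_unit=0.2)
unbuffered = wire_delay(length=10000.0, **params)
buffered = buffered_delay(5000.0, 10000.0, r_buf=100.0, c_buf=10.0,
                          t_buf=20000.0, **params)
```

The wire term grows with the square of the length, so splitting the wire roughly halves it; the buffer's intrinsic delay `t_buf` and input capacitance `c_buf` are the costs that must be weighed against that gain.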

For a multisourced bus, the problem of buffer insertion becomes more complicated, because the optimization for one source may sacrifice the delay of others. Furthermore, the direction of the buffer needs to be arbitrated by a controller. Instead of using such a controller, a novel approach is to use a patented open collector bus repeater [104]. When idle, the two ends of the repeater are set to high. When the repeater senses a pull-down action on one side, it presents the signal on the other side until the pull-down action is released on the originating side. The bus repeater eliminates the need for a direction control signal, resulting in a simpler design and better use of resources.

7.5 System Decomposition

To decompose a system into multiple devices, Yeh et al. [105] proposed an algorithm based on the relationship between uniform multi-commodity flow and min-cut partitioning. Yeh et al. constructed a flow network in which each net initially corresponded to an edge with flow cost one. Two random modules in the network were chosen and the shortest path (i.e., path with lowest cost) between them was computed. A constant Δ<1 was added to the flow for each net in the shortest path, and the cost for every net in the path was incremented. Adjusting the cost penalizes paths through congested areas and forces alternative shortest paths. This random shortest path computation is repeated until every path between the chosen pair of modules passes through at least one “saturated” net. The set of saturated nets induces a multi-way partitioning in which two modules belong to the same cluster if and only if there is a path of unsaturated nets between them.

For each of these clusters, the flux (defined as the cut size between the cluster and its complement, divided by the size of the cluster) is computed and the clusters are sorted based on their flux value. Yeh et al. began with a single cluster equal to the entire netlist, and then peeled off the clusters with lowest flux. This approach was attractive because the saturated nets are good candidates to be cut in a partitioning solution. As peeled clusters can be very small, a second phase may be used to make the multi-way partitioning more balanced. This approach, with its subsequent speedup by Yeh [106], is well-suited for large-scale multi-way partitioning instances.
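The flux measure defined above is simple to compute. In the sketch below a netlist is represented as a list of nets, each net being a tuple of the module names it connects; a net counts toward the cut if it has pins both inside and outside the cluster, and cluster size is taken to be the number of modules. This representation and the toy netlist are assumptions made for illustration.

```python
def flux(cluster, nets):
    """Cut size between a cluster and its complement, divided by cluster size."""
    cluster = set(cluster)
    cut = 0
    for net in nets:
        pins = set(net)
        inside = len(pins & cluster)
        if 0 < inside < len(pins):      # the net crosses the cluster boundary
            cut += 1
    return cut / len(cluster)


# Toy netlist with four modules and four two-pin nets.
nets = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "c")]
clusters = [{"d"}, {"a", "b", "c"}]
by_flux = sorted(clusters, key=lambda c: flux(c, nets))
```

Clusters with low flux are weakly connected to the rest of the design relative to their size, which is why they are peeled off first.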

The system prototyping phase may also explore netlist transformations such as logic replication and retiming to minimize cut size (I/O usage) or system cycle time. Such transformations are needed as inter-device delays can be relatively large and because devices are often I/O-limited. In Reference [107], Liu et al. proposed a partitioning algorithm that permits logic replication to minimize both cut size and clock cycle of sequential circuits. Given a netlist G=(V,E), their approach chooses two modules as seeds s and t, then constructs a “replication graph” that is twice the size of the original circuit. This graph has the special property that a type of directed minimum cut yields the replication cut (i.e., a decomposition of V into S, T, and R=V-S-T where s ∈ S, t ∈ T and R is the replicated logic) that is optimal. A directed version of the Fiduccia-Mattheyses algorithm is used to find a heuristic directed minimum cut in the replication graph. Cong et al. [108] present an efficient algorithm for the performance-driven multi-way circuit partitioning problem that considers the different local and global interconnect delays introduced by the partitioning.

Alpert and Kahng [109] survey the FPGA partitioning literature in the context of major graph partitioning paradigms. The current partitioning problems are (i) low usage rate of FPGA gate capacity because of I/O pin limits, (ii) low clock rate because of interconnect delay between multiple FPGAs and (iii) long CPU time for the mapping process.

7.6 System Planning and Design Changes

For a given system decomposition to be implemented on a multi-FPGA prototyping architecture, all connections within each device and between devices must be routable. Chan et al. [110] draw on the extensive literature on routability prediction in gate arrays, as well as theoretical concepts such as the Rent parameter, to obtain a fast routability estimate for arbitrary netlists and FPGA architectures. Their method ascribes one of three levels of routability (easily routable, marginally routable, or unroutable) to a netlist based on various parameters. Specifically, combining a wirelength estimator due to Feuer, the average number of pins per cell, and the estimated Rent parameter yields a relatively accurate routability predictor. The utility of these parameters is contrasted with that of other criteria such as El Gamal's channel width requirement [111] or the average pins-per-net ratio.
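For readers unfamiliar with the Rent parameter, the usual textbook statement of Rent's rule is reproduced here as background; the symbols are the conventional ones and are not taken from Reference [110].

```latex
T = t \, g^{p}, \qquad 0 \le p \le 1
```

Here T is the number of external terminals of a block containing g cells, t is the average number of terminals per cell, and p is the Rent exponent (the Rent parameter). A larger p means that the number of connections leaving a block grows faster with block size, which generally makes a netlist harder to route and to partition under pin limits.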

In addition to routability, connections must also meet system timing constraints. Selvidge et al. [112] extend the original virtual wires [42] concept in their TIERS (Topology-IndEpendent Routing and Scheduling) approach. The problem formulation assumes that an assignment from a multiple-FPGA partitioning (i.e., a design graph) to a target topology graph has already been made. The objective is to assign “links” (i.e., signal nets) to channels between devices; as with the virtual wires concept, specific time slices for a channel can be assigned to multiple links as long as no two links need to transmit signals at the same time. The TIERS algorithm uses a greedy method to order the links and then routes each link in the scheduled order while reserving channel resources; improvements in system cycle time of up to a factor of 2.5 are achieved.

Chang et al. [113] address the combined issues of routability and system timing by applying layout-driven logic resynthesis techniques. For a given wire that cannot be routed, “alternative wires” and alternative functions are identified, such that the given unroutable wire can be removed from the circuit and replaced with a new wire (or wires) or new logic without affecting functionality. The authors estimate that between 30% and 50% of wires have so-called “triple-wire alternatives” (i.e., replacements consisting of three or fewer wires). Their method first routes the wires that do not have any alternatives, then replaces any unroutable wire with available alternatives. System timing can be improved by replacing long wires with shorter alternatives.

8 Conclusions

Reconfigurable computing offers a middle ground between software-based systems and ASIC implementations, and is often able to combine important benefits of both. Implementations are able to avoid overheads such as unnecessary data transfers, decoding and control mandatory in microprocessors, and designs can be optimized on a basis specific to an application, a problem instance or even an execution. Using this technology, it is possible to achieve size, performance, cost, or power improvements over more conventional computing technologies.

9 Acknowledgments

The authors would like to thank YM. Lam for his help in preparing this manuscript and Prof. Wayne Luk (Imperial College) for his proofreading of this article.

Bibliography

[1] G. Estrin, "Reconfigurable Computer Origins: The UCLA Fixed-plus-variable (F+V) Structure computer," IEEE Ann. Hist. Comput., vol. 24, pp. 3-9, 2002.

[2] S. Hauck, "The roles of FPGAs in reprogrammable systems," Proc. IEEE, vol. 86, pp. 615-639, 1998.

[3] K. Compton and S. Hauck, "Reconfigurable computing: a survey of systems and software," ACM Comput. Surveys (CSUR), vol. 34, pp. 171-210, 2002.

[4] K. Bondalapati and V. K. Prasanna, "Reconfigurable computing systems," Proc. IEEE, vol. 90, pp. 1201-1217, 2002.

[5] T. J. Todman, G. A. Constantinides, S. J. E. Wilton, O. Mencer, W. Luk, and P. Y. K. Cheung, "Reconfigurable computing: architectures and design methods," IEE Proc. Computers and Digital Techniques, vol. 152, pp. 193-205, 2005.

[6] R. Tessier, K. Pocek, and A. DeHon, "Reconfigurable Computing Architectures," Proc. IEEE, vol. 103, pp. 332-354, 2015.

[7] H. Sutter and J. Larus, "Software and the concurrency revolution," Queue, vol. 3, pp. 54-62, 2005.

[8] J. Cong, M. A. Ghodrat, M. Gill, B. Grigorian, K. Gururaj, and G. Reinman, "Accelerator-Rich Architectures," in Proc. Design Automation Conference, pp. 1-6, 2014.

[9] J. Fowers, G. Brown, P. Cooke, and G. Stitt, "A performance and energy comparison of FPGAs, GPUs, and multicores for sliding-window applications," in Proc. ACM/SIGDA Int. Symp. on Field Programmable Gate Arrays, pp. 47-56, 2012.

[10] D. B. Thomas, L. Howes, and W. Luk, "A comparison of CPUs, GPUs, FPGAs, and massively parallel processor arrays for random number generation," Proc. Int. Symp. on Field programmable gate arrays, pp. 63-72, 2009.

[11] A. DeHon, "The density advantage of configurable computing," IEEE Computer, vol. 33, pp. 41-49, 2000.

[12] I. Kuon and J. Rose, "Measuring the Gap Between FPGAs and ASICs," IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 26, pp. 203-215, 2007.

[13] J. Liang, R. Tessier, and D. Goeckel, "A Dynamically-Reconfigurable, Power-Efficient Turbo Decoder," in Proc. Int. Symp. on Field-Programmable Custom Computing Machines, pp. 91-100, 2004.

[14] V. Betz, J. Rose, and A. Marquardt, "Architecture and CAD for Deep-Submicron FPGAS," ed. Dordrecht, the Netherlands: Kluwer Academic Publisher, 1999.

[15] Xilinx, "http://www.xilinx.com," (accessed 2016).

[16] Altera, "http://www.altera.com," (accessed 2016).

[17] Microsemi, "http://www.microsemi.com," (accessed 2016).

[18] M. P. Leong, "FPGA Design Methodologies for High Performance Applications," The Chinese University of Hong Kong, 2001.

[19] E. Ahmed and J. Rose, "The effect of LUT and cluster size on deep-submicron FPGA performance and density," in Proc. ACM/SIGDA Int. Symp. on Field programmable gate arrays, pp. 3-12, 2000.

[20] D. Lewis, A. Lee, P. Leventis, S. Marquardt, C. McClintock, K. Padalia, et al., "The Stratix II logic and routing architecture," in Proc. Int. Symp. on Field-programmable gate arrays, pp. 14-20, 2005.

[21] D. Lewis, G. Chiu, J. Chromczak, D. Galloway, B. Gamsa, V. Manohararajah, et al., "The Stratix™ 10 Highly Pipelined FPGA Architecture," in Proc. Int. Symp. on Field-Programmable Gate Arrays, pp. 159-168, 2016.

[22] R. Hartenstein, "Coarse grain reconfigurable architecture (embedded tutorial)," in Proc. conf. on Asia South Pacific design automation, 2001.

[23] S. C. Goldstein, H. Schmit, M. Budiu, S. Cadambi, M. Moe, and R. R. Taylor, "PipeRench: a reconfigurable architecture and compiler," Computer, vol. 33, pp. 70-77, 2000.

[24] C. Ebeling, D. C. Cronquist, and P. Franklin, "RaPiD - Reconfigurable pipelined datapath," in Proc. Int. Workshop on Field-Programmable Logic, Smart Applications, New Paradigms and Compilers, pp. 126-135, 1996.

[25] L. Moll, J. Vuillemin, and P. Boucard, "High-energy physics on DECPeRLe-1 programmable active memory," in Proc. ACM Int. Symp. on Field-programmable gate arrays, pp. 47-52, 1995.

[26] D. T. Hoang, "Searching genetic databases on Splash 2," in Proc. IEEE Workshop on FPGAs for Custom Computing Machines, pp. 185-191, 1993.

[27] C. Chen, J. Wawrzynek, and R. W. Brodersen, "BEE2: A High-End Reconfigurable Computing System," IEEE Des. Test. Comput., vol. 22, pp. 114-125, 2005.

[28] L.-K. Ting, R. Woods, and C. F. N. Cowan, "Virtex FPGA implementation of a pipelined adaptive LMS predictor for electronic support measures receivers," IEEE Trans. on Very Large Scale Integration (VLSI) Systems, vol. 13, pp. 86-95, 2005.

[29] M. Pohl, M. Schaeferling, and G. Kiefer, "An efficient FPGA-based hardware framework for natural feature extraction and related Computer Vision tasks," in Proc. Int. Conf. on Field Programmable Logic and Applications, pp. 1-8, 2014.

[30] M. Shand and J. Vuillemin, "Fast implementations of RSA cryptography," in Proc. IEEE Symp. on Computer Arithmetic, pp. 252-259, 1993.

[31] K. H. Tsoi, K. H. Lee, and P. H. W. Leong, "A massively parallel RC4 key search engine," in Proc. Int. Symp. on Field-Programmable Custom Computing Machines, pp. 13-21, 2002.

[32] G. L. Zhang, P. H. W. Leong, C. H. Ho, K. H. Tsoi, C. C. C. Cheung, D. Lee, et al., "Reconfigurable acceleration for Monte Carlo based financial simulation," in Proc. Int. Conf. on Field-Programmable Technology, pp. 215-222, 2005.

[33] D. Boland, "Reducing Memory Requirements for High-Performance and Numerically Stable Gaussian Elimination," Proc. ACM/SIGDA Int. Symp. on Field-Programmable Gate Arrays, pp. 244-253, 2016.

[34] N. J. Fraser, D. J. M. Moss, L. JunKyu, S. Tridgell, C. T. Jin, and P. H. W. Leong, "A fully pipelined kernel normalised least mean squares processor for accelerated parameter optimisation," in Proc. Int. Conf. on Field Programmable Logic and Applications, pp. 1-6, 2015.

[35] J. E. Vuillemin, P. Bertin, D. Roncin, M. Shand, H. H. Touati, and P. Boucard, "Programmable active memories: reconfigurable systems come of age," IEEE Trans. on Very Large Scale Integration (VLSI) Systems, vol. 4, pp. 56-69, 1996.

[36] Nvidia, "http://www.nvidia.com," (accessed 2016).

[37] S. Mittal and J. S. Vetter, "A Survey of Methods for Analyzing and Improving GPU Energy Efficiency," ACM Comput. Surv., vol. 47, pp. 1-23, 2014.

[38] Nvidia, "NVIDIA Tesla® K20-K20X GPU Accelerators Benchmarks Application Performance Technical Brief," http://www.nvidia.com/docs/IO/122874/K20-and-K20X-application-performance-technical-brief.pdf (accessed 2016).

[39] K. Ovtcharov, O. Ruwase, J.-Y. Kim, K. Strauss, and E. Chung, Accelerating Deep Convolutional Neural Networks Using Specialized Hardware: Microsoft Research, 2015.

[40] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, "Deep Learning with Limited Numerical Precision," in Int. Conf. on Machine Learning, pp. 1337-1345, 2013.

[41] J. L. Jerez, G. A. Constantinides, and E. C. Kerrigan, "Fixed Point Lanczos: Sustaining TFLOP-equivalent Performance in FPGAs for Scientific Computing," in Proc. Int. Symp. on Field-Programmable Custom Computing Machines, pp. 53-60, 2012.

[42] J. Babb, R. Tessier, M. Dahl, S. Z. Hanono, D. M. Hoki, and A. Agarwal, "Logic emulation with virtual wires," IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, vol. 16, pp. 609-626, 1997.

[43] J. Varghese, M. Butts, and J. Batcheller, "An efficient logic emulation system," IEEE Trans. on Very Large Scale Integration (VLSI) Systems, vol. 1, pp. 171-174, 1993.

[44] L. de Souza, P. Ryan, J. Crawford, K. Wong, G. Zyner, and T. McDermott, "Prototyping for the Concurrent Development of an IEEE 802.11 Wireless LAN Chipset," in Proc. Int. Conf. on Field Programmable Logic and Application, pp. 51-60, 2003.

[45] Cadence, "Protium Rapid Prototyping Platform," https://www.cadence.com/content/cadence-www/global/en_US/home/tools/system-design-and-verification/fpga-basedprototyping/protium-rapid-prototyping-platform.html (accessed 2016).

[46] D. M. Stephen Tridgell, Nicholas J. Fraser, and Philip H. W. Leong, "Braiding: a scheme for resolving hazards in NORMA," in Proc. Int. Conf. on Field Programmable Technology, pp. 136-143, 2015.

[47] P. L. Chen Zhang, Guangyu Sun, Yijin Guan, Bingjun Xiao and Jason Cong, "Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks," in Proc. ACM/SIGDA Int. Symp. on Field-Programmable Gate Arrays, pp. 161-170, 2015.

[48] R. A. E. Louise H. Crockett, Martin A. Enderwitz, and Robert W. Stewart, The Zynq Book: Embedded Processing with the Arm Cortex-A9 on the Xilinx Zynq-7000 all Programmable UK, 2014.

[49] P. Bertin, D. Roncin, and J. Vuillemin, "Introduction to Programmable Active Memories," ed: DEC Memo 3, 1989, pp. 1-9.

[50] J. M. Arnold, D. A. Buell, and E. G. Davis, "Splash 2," in Proc. ACM symp. on Parallel algorithms and architectures, 1992.

[51] Maxeler, "https://www.maxeler.com/products/mpc-xseries/," (accessed 2016).

[52] A. Putnam, A. M. Caulfield, E. S. Chung, D. Chiou, K. Constantinides, J. Demme, et al., "A reconfigurable fabric for accelerating large-scale datacenter services," in Int. Symp. on Computer Architecture (ISCA), 2014.

[53] Y.-k. Choi, J. Cong, Z. Fang, Y. Hao, G. Reinman, and P. Wei, "A quantitative analysis on microarchitectures of modern CPU-FPGA platforms," in Proc. Design Automation Conference, 2016.

[54] J. Villasenor and W. H. Mangione-Smith, "Configurable Computing," Scientif. Amer., vol. 276, pp. 66-71, 1997.

[55] J. Becker and M. Hübner, "Run-time reconfigurabilility and other future trends," in Proc. symp. on Integrated circuits and systems design, pp. 9-11, 2006.

[56] N. B. C. Bhat, K.; Kuh, E. S, "Performance-oriented Fully Routable Dynamic Architecture for a Field-programmable Logic Device," Memorandum No. UCB/ERL M93/42, Electronics Research Lab., College of Engineering, UC Berkeley, pp. 1-21, 1993.

[57] M. Motomura, "A Dynamically Reconfigurable Processor Architecture," Microprocessor Forum, 2002.

[58] D. Koch, C. Beckhoff, and J. Teich, "ReCoBus-Builder - A novel tool and technique to build statically and dynamically reconfigurable systems for FPGAS," in Proc. Int. Conf. on Field Programmable Logic and Applications, pp. 119-124, 2008.

[59] C. Beckhoff, D. Koch, and J. Torresen, "GoAhead: A Partial Reconfiguration Framework," in Proc. Int. Symp. on Field-Programmable Custom Computing Machines, pp. 37-44, 2012.

[60] K. Vipin and S. A. Fahmy, "Efficient region allocation for adaptive partial reconfiguration," in Proc. Int. Conf. on Field-Programmable Technology, pp. 1-6, 2011.

[61] D. Koch and C. Beckhoff, "Hierarchical reconfiguration of FPGAs," in Proc. Int. Conf. on Field Programmable Logic and Applications, pp. 1-8, 2014.

[62] K. Vipin and S. A. Fahmy, "ZyCAP: Efficient Partial Reconfiguration Management on the Xilinx Zynq," IEEE Embedded Systems Letters, vol. 6, pp. 41-44, 2014.

[63] L. Gong and O. Diessel, Functional Verification of Dynamically Reconfigurable FPGA-based Systems, 1st ed.: Springer International Publishing, 2015.

[64] C. Claus, R. Ahmed, F. Altenried, and W. Stechele, "Towards Rapid Dynamic Partial Reconfiguration in Video-Based Driver Assistance Systems," in Proc. Int. Symp. on Reconfigurable Computing: Architectures, Tools and Applications, pp. 55-67, 2010.

[65] G. G. Jean-Philippe Delahaye, Christian Roland, Pierre Bomel, "Software radio and dynamic reconfiguration on a dsp/fpga platform," Frequenz, journal of telecommunications, pp. 152-159, 2004.

[66] M. Feilen, M. Ihmig, C. Schwarzbauer, and W. Stechele, "Efficient DVB-T2 decoding accelerator design by time-multiplexing FPGA resources," in Proc. Int. Conf. on Field Programmable Logic and Applications, pp. 75-82, 2012.

[67] C. Steiger, H. Walder, and M. Platzner, "Operating systems for reconfigurable embedded platforms: online scheduling of real-time tasks," IEEE Trans. on Computers, vol. 53, pp. 1393-1407, 2004.

[68] C. Dennl, D. Ziener, and J. Teich, "On-the-fly Composition of FPGA-Based SQL Query Accelerators Using a Partially Reconfigurable Module Library," in Proc. Int. Symp. on Field-Programmable Custom Computing Machines (FCCM), pp. 45-52, 2012.

[69] A. Becher, F. Bauer, D. Ziener, and J. Teich, "Energy-aware SQL query acceleration through FPGA-based dynamic partial reconfiguration," in Proc. Int. Conf. on Field Programmable Logic and Applications, pp. 1-8, 2014.

[70] M. J. Wirthlin and B. L. Hutchings, "A dynamic instruction set computer," in Proc. IEEE Symp. on FPGAs for Custom Computing Machines, pp. 99-107, 1995.

[71] Stretch, "http://www.stretchinc.com/," (accessed 2016).

[72] H. Schmit, "Incremental reconfiguration for pipelined applications," in Proc. 5th Annual IEEE Symp. on Field-Programmable Custom Computing Machines, pp. 47-55, 1997.

[73] S. Young, P. Alfke, C. Fewer, S. McMillan, B. Blodget, and D. Levi, "A high I/O reconfigurable crossbar switch," in Proc. Int. Symp. on Field-Programmable Custom Computing Machines, pp. 3-10, 2003.

[74] F. de Dinechin and B. Pasca, "Designing custom arithmetic data paths with FloPoCo," IEEE Design & Test of Computers, vol. 28, pp. 18-27, 2011.

[75] W. Luk and I. Page, "Compiling Occam into FPGAs," EE&CS Books, 1991, pp. 271-283.

[76] I. Page, "Constructing hardware-software systems from a single description," VLSI Signal Processing, vol. 12, pp. 87-107, 1996.

[77] Q. Liu, G. A. Constantinides, K. Masselos, and P. Y. K. Cheung, "Automatic On-chip Memory Minimization for Data Reuse," in Proc. Int. Symp. on Field-Programmable Custom Computing Machines, pp. 251-260, 2007.

[78] S. Bayliss and G. A. Constantinides, "Optimizing SDRAM bandwidth for custom FPGA loop accelerators," in Proc. ACM/SIGDA Int. Symp. on Field Programmable Gate Arrays, pp. 195-204, 2012.

[79] D. U. Lee, A. A. Gaffar, R. C. C. Cheung, O. Mencer, W. Luk, and G. A. Constantinides, "Accuracy-Guaranteed Bit-Width Optimization," IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 25, pp. 1990-2000, 2006.

[80] D. Boland and G. A. Constantinides, "Bounding Variable Values and Round-Off Effects Using Handelman Representations," IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 30, pp. 1691-1704, 2011.

[81] T. Isshiki and W. W. Dai, "High-Level Bit-Serial Datapath Synthesis for Multi-FPGA Systems," in Proc. Int. Workshop on FPGAs, pp. 161-174, 1995.

[82] M. P. Leong and P. H. W. Leong, "A variable-radix digit-serial design methodology and its application to the discrete cosine transform," IEEE Trans. on Very Large Scale Integration (VLSI) Systems, vol. 11, pp. 90-104, 2003.

[83] Maxeler Technologies, "MaxCompiler" (white paper), 2011.

[84] Mentor Graphics, "Catapult High-Level Synthesis," https://www.mentor.com/hls-lp/catapult-high-level-synthesis/c-systemc-hls (accessed 2016).

[85] M. Haldar, A. Nayak, A. Choudhary, and P. Banerjee, "A system for synthesizing optimized FPGA hardware from Matlab," in IEEE/ACM Int. Conf. on Computer Aided Design, pp. 314-319, 2001.

[86] Mathworks, http://www.mathworks.com/products/hdl-coder/ (accessed 2016).

[87] A. Papakonstantinou, K. Gururaj, J. A. Stratton, D. Chen, J. Cong, and W. M. W. Hwu, "FCUDA: Enabling efficient compilation of CUDA kernels onto FPGAs," in Symp. on Application Specific Processors, pp. 35-42, 2009.

[88] D. Lau, O. Pritchard, and P. Molson, "Automated Generation of Hardware Accelerators with Direct Memory Access from ANSI/ISO Standard C Functions," in Proc. Int. Symp. on Field-Programmable Custom Computing Machines, pp. 45-56, 2006.

[89] G. Weisz, J. Melber, Y. Wang, K. Fleming, E. Nurvitadhi, and J. C. Hoe, "A Study of Pointer-Chasing Performance on Shared-Memory Processor-FPGA Systems," in Proc. ACM/SIGDA Int. Symp. on Field-Programmable Gate Arrays, pp. 264-273, 2016.

[90] Vectorblox, http://vectorblox.com/ (accessed 2016).

[91] N. Kapre and J. Gray, "Hoplite: Building austere overlay NoCs for FPGAs," in Proc. Int. Conf. on Field Programmable Logic and Applications, pp. 1-8, 2015.

[92] D. Capalija and T. S. Abdelrahman, "A high-performance overlay architecture for pipelined execution of data flow graphs," in Proc. Int. Conf. on Field Programmable Logic and Applications, pp. 1-8, 2013.

[93] A. K. Jain, X. Li, P. Singhai, D. L. Maskell, and S. A. Fahmy, "DeCO: A DSP Block Based FPGA Accelerator Overlay With Low Overhead Interconnect," in Proc. Int. Symp. on Field-Programmable Custom Computing Machines, pp. 1-8, 2016.

[94] C. Clos, "A Study of Non-Blocking Switching Networks," Bell System Technical Journal, vol. 32, pp. 406-424, 1953.

[95] V. E. Beneš, "Mathematical Theory of Connecting Networks and Telephone Traffic," New York: Academic Press, 1965.

[96] P. K. Chan and M. D. F. Schlag, "Architectural tradeoffs in field-programmable-device-based computing systems," in Proc. IEEE Workshop on FPGAs for Custom Computing Machines, pp. 152-161, 1993.

[97] R. Cole and J. Hopcroft, "On Edge Coloring Bipartite Graphs," SIAM J. Comput., vol. 11, pp. 540-546, 1982.

[98] G. Richards and F. Hwang, "A Two-Stage Rearrangeable Broadcast Switching Network," IEEE Trans. Communications, vol. 33, pp. 1025-1035, 1985.

[99] I-Cube, "Using FPID Devices in FPGA-based Prototyping," Application Note, 1994, pp. 1-11.

[100] Y. C. Wei, C. K. Cheng, and Z. Wurman, "Multiple-level partitioning: an application to the very large-scale hardware simulator," IEEE J. Solid-State Circuits, vol. 26, pp. 706-716, 1991.

[101] J. Li and C. K. Cheng, "Routability improvement using dynamic interconnect architecture," in Proc. IEEE Symp. on FPGAs for Custom Computing Machines, pp. 2-7, 1995.

[102] P. K. Chan, M. D. F. Schlag, and M. Martin, "BORG: A Reconfigurable Prototyping Board Using Field-programmable Gate Arrays," in Int. Workshop on FPGA, pp. 47-51, 1992.

[103] J. Lillis, C. K. Cheng, and T. T. Y. Lin, "Optimal wire sizing and buffer insertion for low power and a generalized delay model," in Proc. Int. Conf. on Computer Aided Design (ICCAD), pp. 138-143, 1995.

[104] W. J. Hsieh, Y. C. Jenq, C. S. Horng, and K. Lofstrom, "Input/output I/O Bidirectional Buffer for Interfacing I/O Parts of a Field Programmable Interconnection Device with Array Ports of a Cross-point Switch," US Patent 5,428,800, 1992.

[105] C.-W. Yeh, C.-K. Cheng, and T.-T. Y. Lin, "A probabilistic multicommodity-flow solution to circuit clustering problems," in Proc. IEEE/ACM Int. Conf. on Computer-Aided Design, pp. 428-431, 1992.

[106] C.-W. Yeh, "On the acceleration of flow-oriented circuit clustering," IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 14, pp. 1305-1308, 1995.

[107] L.-T. Liu, M.-T. Kuo, C.-K. Cheng, and T. C. Hu, "Performance-Driven Partitioning Using a Replication Graph Approach," in Proc. Design Automation Conference, pp. 206-210, 1995.

[108] J. Cong, S. K. Lim, and C. Wu, "Performance driven multi-level and multiway partitioning with retiming," in Proc. Design Automation Conf., pp. 274-279, 2000.

[109] C. J. Alpert and A. B. Kahng, "Recent directions in netlist partitioning: a survey," Integration, the VLSI Journal, vol. 19, pp. 1-81, 1995.

[110] P. K. Chan, M. D. F. Schlag, and J. Y. Zien, "On routability prediction for Field-Programmable Gate Arrays," in Proc. Design Automation Conference, pp. 326-330, 1993.

[111] A. E. Gamal, "Two-dimensional stochastic model for interconnections in master slice integrated circuits," IEEE Trans. Circuits Syst., vol. 28, pp. 127-138, 1981.

[112] C. Selvidge, A. Agarwal, M. Dahl, and J. Babb, "TIERS: Topology Independent Pipelined Routing and Scheduling," in Proc. ACM Int. Symp. on Field-Programmable Gate Arrays, pp. 25-31, 1995.

[113] S.-C. Chang, K.-T. Cheng, N.-S. Woo, and M. Marek-Sadowska, "Layout driven logic synthesis for FPGAs," in Proc. Design Automation Conference, pp. 308-313, 1994.