A Reconfigurable Silicon-Photonic Network with Improved Channel Sharing for Multicore Architectures

Sai Vineel Reddy Chittamuru, Srinivas Desai, Sudeep Pasricha Department of Electrical and Computer Engineering Colorado State University, Fort Collins, CO, U.S.A.

{sai.chittamuru, srinivas.desai, sudeep}@colostate.edu

ABSTRACT

On-chip communication is widely considered to be one of the major performance bottlenecks in contemporary chip multiprocessors (CMPs). With recent advances in silicon nanophotonics, photonic-based networks-on-chip (NoCs) are being considered as a viable option for communication in emerging CMPs as they can enable higher bandwidth and lower power dissipation compared to traditional electrical NoCs. In this paper, we present UltraNoC, a novel reconfigurable silicon-photonic NoC architecture that features improved channel sharing and supports dynamic re-prioritization and exchange of bandwidth between clusters of cores running multiple applications, to increase channel utilization and performance. Experimental results show that UltraNoC improves throughput by up to 9.8× while reducing latency by up to 55% and energy-delay product by up to 90% over state-of-the-art solutions.

Categories and Subject Descriptors C.2.1 [Computer-Communication Networks]: Network Architecture and Design—Network topology; C.4 [Performance of Systems]: design studies Keywords Network-on-chip (NoC), photonic channel sharing, arbitration

1. INTRODUCTION

Advances in technology scaling over the past decade have enabled the integration of billions of transistors on a single die. Such a massive number of transistors has allowed multiple processing cores and more memory to be integrated on a chip, to meet rapidly growing performance demands of modern applications. With hundreds of on-chip cores expected to become a reality in the near future, traditional electrical network-on-chip (NoC) communication fabrics [1], [2] are projected to suffer from cripplingly high power dissipation and severely reduced performance [7]. Crosstalk and electromagnetic interference in metallic interconnects will further worsen the performance and reliability of NoCs with technology scaling.

Recent developments in silicon photonics have enabled the integration of photonic devices with CMOS circuits. On-chip photonic links provide several advantages over their metallic counterparts, including light-speed transfers, high bandwidth density through dense wavelength division multiplexing (DWDM) [3], low power dissipation [4], and low crosstalk [7]. Thus silicon photonics is being considered as an exciting new option for future NoCs. Several photonic devices such as microring resonators, waveguides and photodetectors have already been fabricated and demonstrated at the chip level [5], [6]. These devices have been used as a foundation for several photonic network architectures [8]-[11].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. GLSVLSI’15, May 20–22, 2015, Pittsburgh, PA, USA. Copyright © 2015 ACM 978-1-4503-3474-7/15/05…$15.00. http://dx.doi.org/10.1145/2742060.2742067

A few prior works emphasized the importance of network resource contention in photonic channels and proposed arbitration techniques to resolve contention [8], [10]. However, a limitation of these approaches is that they do not fully exploit available network bandwidth. Another limitation is that these approaches typically target only workloads consisting of a single parallel application. In emerging multicore systems where multiple applications execute simultaneously on unique subsets of cores, there is significantly greater variation in the temporal and spatial characteristics of the traffic injected into the network. For example, cores running memory intensive tasks can require more network bandwidth than cores running compute intensive tasks [24].

To overcome these shortcomings, we propose a novel photonic NoC architecture called UltraNoC that utilizes multiple-write-multiple-read (MWMR) photonic waveguides in a crossbar topology, and supports dynamic performance adaptation to aggressively utilize network bandwidth and meet diverse application demands. We compare UltraNoC against architectures with the best-known arbitration mechanisms, for multi-threaded PARSEC [16] workloads on CMP platform sizes ranging from 64 cores to 256 cores. The novel contributions of this paper are:

A flexible on-chip photonic network architecture (UltraNoC) that facilitates selective and reconfigurable prioritization of applications based on their time-varying performance goals;

A concurrent token stream arbitration that provides multiple simultaneous tokens and increases channel utilization;

A dynamic bandwidth transfer technique with low overhead, to transfer unused bandwidth among clusters of cores;

A mechanism to monitor traffic being injected into the network by different co-running applications, to facilitate dynamic arbitration wavelength injection rate modulation.

2. RELATED WORK

A considerable amount of work has focused on the area of photonic NoC design. Several efforts have explored high-radix low-diameter photonic crossbar architectures that provide non-blocking connectivity, e.g., [8], [10], [15] and [21]; whereas silicon photonic implementations of low-radix high-diameter NoCs have also been investigated in [14], [17], [18], and [20]. Prior work has shown that photonic crossbars are extremely promising architectures to meet future on-chip bandwidth demands, but they can suffer from (i) large power dissipation and (ii) high contention for resources, especially when using inefficient token-based arbitration schemes. A few techniques to reduce the power overhead of photonic NoCs have been proposed in the literature. For example, an effective policy for runtime management of the laser source is proposed in [11]. We build on that work to manage power in photonic crossbar NoCs. To reduce contention issues in crossbars, a few improved arbitration techniques have been proposed in [10], [22] that use time division multiplexing (TDM), so that a single data waveguide can be shared by more than one node in different time slots.


In Flexishare [10], a token stream arbitration scheme is proposed. The scheme requires wavelengths corresponding to each data waveguide to be injected serially into different time slots of an arbitration waveguide. A node writes on the data waveguide only when it gets access to the corresponding arbitration wavelength. Subsequently, the node cannot send data again till its arbitration wavelength is injected into the arbitration waveguide, which takes N cycles for N data waveguides. The scheme leads to channel under-utilization, and performs even worse as the number of nodes and waveguides increase. In [22], the token ring arbitration scheme from Corona [8] was improved with the token channel and token slot arbitration techniques for multiple-write-single-read (MWSR) crossbars. Token slot arbitration uses TDM and improves upon token channel arbitration by dividing the arbitration waveguide into fixed-size, back-to-back slots, with destination nodes circulating tokens in one-to-one correspondence to slots. A limitation of this approach is that a fixed time gap is required between two arbitration slots to set up data for transmission, which reduces the available time slots to send data.

UltraNoC improves upon these prior works by utilizing a more powerful concurrent token stream arbitration strategy, together with reconfigurable cluster prioritization and bandwidth re-allocation to improve MWMR photonic channel utilization.

Figure 1: Layout of MWMR crossbar used in UltraNoC along with the arrangement of cores and their respective gateway interfaces.

3. ULTRANOC PHOTONIC NOC OVERVIEW

3.1 Architecture and Terminology

The baseline UltraNoC architecture is targeted to a 64-core CMP, as shown in figure 1. We also extend the baseline architecture to a 256-core CMP for the purposes of scalability analysis. Each core has a private L1 and shared L2 cache, with a MOESI protocol to maintain cache coherence. In a 64-core CMP, each group of 8 cores has access to main memory via a dedicated memory controller, whereas in a 256-core CMP, each group of 16 cores has a dedicated memory controller. A node is defined as an entity consisting of one and four cores for the 64-core and 256-core CMP, respectively. Every node in UltraNoC is attached to a gateway interface (GI) module that facilitates transfers between the CMOS electrical layer and a photonic layer (with photonic waveguides, modulators, detectors, etc.). The entire chip is divided into four clusters (C0, C1, C2, C3), where each cluster contains 16 cores in a 64-core CMP and 64 cores in a 256-core CMP.

A detailed layout of the UltraNoC architecture is shown in figure 1, where 64 nodes are arranged in an 8×8 mesh.
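The node and cluster terminology above can be summarized in a short sketch. It assumes that cores are grouped into nodes in index order and that node IDs are 1-based with 16 consecutive nodes per cluster; the text does not spell out the exact core-to-node placement, so `cores_per_node`, `node_of_core` and `cluster_of_node` are hypothetical helpers used only for illustration.

```python
# Illustrative mapping of cores -> nodes -> clusters.
# The numbering scheme is an assumption, not taken from the paper.
NUM_NODES = 64                                   # nodes in both CMP sizes
NUM_CLUSTERS = 4                                 # clusters C0..C3
NODES_PER_CLUSTER = NUM_NODES // NUM_CLUSTERS    # 16 nodes per cluster


def cores_per_node(num_cores: int) -> int:
    """One core per node in the 64-core CMP, four in the 256-core CMP."""
    if num_cores not in (64, 256):
        raise ValueError("only the 64-core and 256-core platforms are described")
    return num_cores // NUM_NODES


def node_of_core(core: int, num_cores: int) -> int:
    """1-based node ID for a 0-based core index (assumes consecutive grouping)."""
    return core // cores_per_node(num_cores) + 1


def cluster_of_node(node: int) -> int:
    """Cluster index 0..3 for a 1-based node ID (assumes consecutive grouping)."""
    return (node - 1) // NODES_PER_CLUSTER
```

Under these assumptions each cluster holds 16 nodes in both platforms, which corresponds to 16 cores per cluster in the 64-core CMP and 64 cores per cluster in the 256-core CMP, as stated above.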

Communication between cores within a node for the 256-core CMP uses an electrical 5×5 NoC router, where four of its input and output port pairs are connected to four cores and the fifth input/output port pair is connected to a GI module. A round-robin arbitration scheme is used within each node for communication between cores and the GI.

Inter-node transfers are facilitated by dual-coiled MWMR waveguide groups, where each group has four MWMR waveguides by default, with each waveguide supporting 64 wavelength DWDM, to facilitate simultaneous transfer of a total of 512 bits of data (cache line) with data modulation at both clock edges. Each MWMR waveguide group in UltraNoC passes every node twice in the dual coiled structure to enable a two pass inter-node data communication. A node has the ability to write on the first pass using its ring modulators and read from the waveguide group using its ring detectors in the second pass.

As all nodes are capable of modulating (writing) in an MWMR waveguide group during the first pass, there is a need for arbitration (see Section 3.2 for more details) between sending nodes to ensure that the data of different senders does not destructively overlap on the shared waveguide group. Throughout this paper this first pass portion of the waveguide group is referred to as the modulating and arbitration waveguide group. In the second pass of the MWMR waveguide group all nodes receive data through their respective ring detectors; hence this portion of the waveguide group is referred to as the receiving waveguide group. Additionally there is a power waveguide that runs in parallel with the other waveguides, and carries arbitration wavelengths. This waveguide facilitates our bandwidth transfer and priority adaptation techniques (see Sections 3.3, 3.4).

Figure 1 depicts an expanded view of the collection of GIs for four nodes, which shows the modulating and arbitration, receiving, and power waveguide groups, along with their connection to GIs. There are 64 wavelengths that are utilized for data communication in each MWMR waveguide within each waveguide group, so each ring modulator and detector group has 256 ring modulators and 256 ring detectors, respectively. Each of the 64 wavelengths is assigned to a unique receiving node, such that whenever a receiver detects its corresponding wavelength during a clock cycle, it switches its detectors “on” to receive data in the next clock cycle. All other receivers keep their detectors turned off to save power. We also utilize four arbitration wavelengths, each corresponding to a cluster. For powering the waveguides, we use a broadband off-chip laser source with a laser power controller (LSWC). The LSWC has groups of ring modulators capable of injecting different wavelengths in different clock cycles. As we use 68-wavelength DWDM in each MWMR waveguide, there are 68 ring modulators in each ring modulator group in the LSWC. The laser controller also has control logic that can alter the rate of injection of arbitration wavelengths into each waveguide group. More details about LSWC are presented in the next subsections and overhead analysis is presented in Section 4.1.

3.2 MWMR Concurrent Token Stream Arbitration

In UltraNoC all of the cores on a chip are partitioned into four clusters (C0 – C3) and each cluster is assigned a dedicated arbitration wavelength (λ0 – λ3). Each MWMR waveguide group is divided into a fixed number of time slots, based on the time taken by light to traverse the waveguide on a die, which is 8 cycles in our architecture. Thus we divide each MWMR waveguide group into 8 time slots. The time slots are classified into three types: arbitration slot, receiver prediction slot, and data slot. For simplicity, figure 2(a) shows an example of the distribution of time slots across 4 MWMR waveguide groups, even though a minimum of 8 MWMR waveguide groups are used in our architecture.

In the arbitration slot, the LSWC injects arbitration wavelengths of clusters, selectively using a modulator group to dedicate the arbitration slot to a particular cluster. After a sending node grabs an arbitration wavelength in the arbitration slot, it gets access to the next receiver prediction slot to release the corresponding wavelength of its receiving node. Subsequently, in the next data slot, the sending node modulates data on the 64 wavelengths in each waveguide group assigned for data transfer.

As an example, suppose N1 in cluster C0 needs to send data to N32 that has a corresponding receiver prediction wavelength w32. N1 first grabs arbitration wavelength λ0 which is dedicated to cluster C0, in the arbitration slot. N1 then modulates in the next receiver prediction slot, such that only w32 (the dedicated wavelength for receiver prediction of N32) is made available by removing all the wavelengths except w32 (using its ring modulators) in that receiver prediction slot. On the receiving end at N32, only the detector for wavelength w32 is switched on. Once w32 is detected, N32 prepares to receive data in the next data slot by switching on the remaining detectors in that node. Figure 2(b) shows the position of different slots in the MWMR waveguide group W1 at time cycle 3 for the example in figure 2(a). As our architecture divides each MWMR waveguide group into 8 slots, each slot covers 8 nodes in a particular time instance. The stream of tokens on concurrent slots in waveguide groups allows multiple nodes to inject packets simultaneously, resulting in extremely high channel utilization in the MWMR waveguides. A round-robin arbiter is used within each cluster to resolve contention among the 16 nodes in a cluster, and avoid starvation.

Figure 2: (a) Timing diagram of arbitration in UltraNoC, which shows distribution of arbitration (Ai), receiver prediction (Ri) and data slots (Di) across four MWMR waveguide groups (W1–W4); (b) distribution of different slots within MWMR waveguide group W1 at time cycle 3.

3.3 Inter-cluster Bandwidth Exchange

UltraNoC supports inter-cluster bandwidth transfers to further improve channel utilization and overall performance. As an example, if cluster C0 does not need to transfer data, it transfers its bandwidth to a subsequent cluster C1. Similarly any cluster can transfer its unused bandwidth to subsequent clusters. Figure 3 presents an overview of the bandwidth transfer technique. The last node in each cluster is provisioned with an extra modulator that is capable of injecting an arbitration wavelength of the next cluster. Whenever a ring detector of the last node of a cluster detects its own arbitration wavelength, and if this node does not have data to transfer, this is a case where the cluster has not used its bandwidth.

To improve channel utilization, the cluster can transfer its unused bandwidth to the next cluster, by having the last node of the cluster inject the arbitration wavelength of the next cluster into the same arbitration slot. To simultaneously remove and inject arbitration wavelengths at the last node of cluster, the arbiter uses information about the current slot of the waveguide and the cluster it belongs to. This information can change across interval windows based on the data received from LSWC (Section 3.4).

Figure 3 illustrates an example of bandwidth transfer. Clusters C0–C3 are assigned arbitration wavelengths highlighted with green, yellow, blue, and red respectively. The last node in the first three clusters (node16, node32, and node48) is shown with an extra ring modulator to facilitate the injection of the arbitration wavelength of the next cluster. For this example, nodes in C0 do not need to transfer any data in the current cycle. Then node16, which is the last node in C0 removes its cluster’s arbitration wavelength (green) from the arbitration slot and injects the arbitration wavelength of C1 (yellow) so that nodes in C1 can use this arbitration slot for sending data in the next available data slot. Figure 3 also shows counters at nodes 16, 32, 48, and 64 that are used to count the arbitration wavelength conversions over a time interval. The next subsection presents details about these counters.

Figure 3: Bandwidth transfer technique: a cluster can transfer its unused bandwidth to the next cluster by absorbing its own arbitration wavelength and releasing arbitration wavelength of next cluster.

3.4 Cluster Priority Adaptation with LSWC Reconfiguration

UltraNoC also supports runtime alteration of allocated bandwidth to each cluster, to closely track changing application bandwidth needs, by altering the number of arbitration slots dedicated to each cluster (i.e., cluster priority). This is essential because while our bandwidth transfer technique can transfer unused arbitration slots from one cluster to another in the direction of concurrent arbitration token stream flow (e.g., C0 to C1), it lacks the ability to transfer bandwidth in the opposite direction (e.g., C1 to C0). To overcome this limitation, we design a cluster priority adaptation mechanism to more comprehensively manage cluster bandwidth allocations over time. This mechanism also helps to minimize laser power by intelligently reducing the total number of injected arbitration slots for scenarios with low bandwidth requirements. The cluster priority adaptation technique consists of three main steps, as discussed below:

Step 1: Determination of wavelength conversion count: Each cluster C0 – C3 has an associated weight w0 – w3, which determines the proportion of arbitration slots (and consequently bandwidth or priority) assigned to the cluster. Initially, these weights can be set to be equal, i.e., 0.25 each. At runtime, whenever the last node in a cluster performs a wavelength conversion from its current cluster arbitration wavelength to the next cluster arbitration wavelength, a counter (shown in figure 3) is incremented. This conversion event represents the case where an unused arbitration slot (bandwidth) is transferred from one cluster to another. Over a time interval T, the recorded wavelength conversion counts WCC0 – WCC3 from each cluster are then used to determine unused bandwidth of each cluster.

Step 2: Calculation of excess arbitration slots: The wavelength conversion count values of different clusters show the aggregated number of excess arbitration slots, which includes excess arbitration slots of the present cluster along with the excess arbitration slots of predecessor clusters. Therefore the excess arbitration slots for the ith cluster (ESi) are calculated using equation (1) shown below, by subtracting the cluster wavelength conversion count of the predecessor cluster (WCCi-1) from the wavelength conversion count of the cluster under consideration (WCCi). ESi values can also be negative, when a cluster consumes a greater number of arbitration slots (made available by predecessor clusters) than its allocated arbitration slots. Such a cluster has a deficit of arbitration slots.

$$ES_i = \begin{cases} WCC_i, & i = 0 \\ WCC_i - WCC_{i-1}, & i > 0 \end{cases} \tag{1}$$

Step 3: Setting new weight (priority) for each cluster: Based on the estimation of excesses and deficits in arbitration slots assigned across clusters, this final step attempts to adjust weight values of each cluster to eliminate the excesses and deficits. To determine the new weight of the ith cluster wi(next) for the upcoming time interval, we must subtract the excess weight EWi of the cluster from its current weight wi(current). We can calculate EWi by dividing the excess arbitration slots of the ith cluster (ESi), calculated in equation (1), by the total number of arbitration slots released in the time interval T, which we denote as K. The equations below show these calculations:

$$EW_i = \frac{ES_i}{K} \tag{2}$$

$$w_i(\text{next}) = w_i(\text{current}) - EW_i \tag{3}$$

Based on the values of the new weights, the LSWC changes the distribution of arbitration wavelengths injected for the next time interval T, such that a cluster with a higher weight will receive more arbitration wavelengths, and has more opportunities to use the waveguides for data transfer. The weight values are also communicated to all clusters, so that arbiters can adjust their local counters to match the new arbitration slot profile in waveguides.
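The three steps above can be sketched in a few lines of Python. This is a minimal illustration of equations (1) to (3) only, not the authors' LSWC control logic; the function names `excess_slots` and `next_weights`, the conversion counts and the value K = 1000 are made up for the example.

```python
from typing import List


def excess_slots(wcc: List[int]) -> List[int]:
    """Equation (1): ES_0 = WCC_0, and ES_i = WCC_i - WCC_{i-1} for i > 0."""
    return [wcc[i] if i == 0 else wcc[i] - wcc[i - 1] for i in range(len(wcc))]


def next_weights(weights: List[float], wcc: List[int], k: int) -> List[float]:
    """Equations (2) and (3): w_i(next) = w_i(current) - ES_i / K."""
    if k <= 0:
        raise ValueError("K (slots released in the interval) must be positive")
    es = excess_slots(wcc)
    return [w - e / k for w, e in zip(weights, es)]


# Hypothetical interval: equal starting weights, K = 1000 arbitration slots.
weights = [0.25, 0.25, 0.25, 0.25]
wcc = [100, 60, 80, 80]        # made-up conversion counts for C0..C3
es = excess_slots(wcc)         # [100, -40, 20, 0]
new_w = next_weights(weights, wcc, k=1000)
```

In this made-up interval C0 left 100 slots unused and loses weight, while C1 shows a deficit of 40 slots (a negative ES value) and gains weight. Because the ES values sum to the last cluster's conversion count, the new weights add up to less than 1 whenever some slots pass through every cluster unused, which appears consistent with the stated aim of injecting fewer arbitration slots when bandwidth demand is low.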

4. EXPERIMENTS

4.1 Experimental Setup

To evaluate our proposed UltraNoC architecture, we compared it to an electrical mesh (EMesh) based NoC as well as to two state-of-the-art photonic crossbar NoCs that use smart arbitration techniques: Flexishare with token stream arbitration [10] and Corona with an enhanced token-slot arbitration [22]. We modeled and simulated the architectures at a cycle-accurate granularity with a SystemC-based NoC simulator, for two CMP platform complexities: 64-core and 256-core. We used the PARSEC benchmark suite [16] to create multi-application workloads, with clusters running parallelized versions of different benchmarks.

Table 1 shows the PARSEC benchmarks we considered, classified into three categories according to their memory intensities. Compute intensive benchmarks spend most of the time computing and less time communicating with memory; whereas memory intensive applications spend a larger portion of their execution time communicating with memory and less time computing within cores. Hybrid intensity benchmarks demonstrate both compute and memory intensive phases. We created 12 multi-application workloads from these benchmarks. Each workload combines 4 benchmarks, and the memory intensity of the workloads varies across the spectrum, from compute intensive to memory intensive. As an example, the SC-BT-BS-VI workload combines parallelized implementations of Streamclusters (SC), Bodytrack (BT), Blackscholes (BS) and Vips (VI), and executes them in clusters C0, C1, C2, and C3, respectively. Each parallelized benchmark is executed on a group of 16 cores and 64 cores, in the 64-core and 256-core CMP platforms, respectively. Full-system simulation of parallelized PARSEC benchmarks using gem5 [25] was used to generate traces that were fed into our cycle-accurate network simulator. We set a “warm-up” period of 100M instructions and captured traces for 1B instructions.

Table 1: Memory intensity classification of PARSEC benchmarks

| Application | Representation | Workload Type |
|---|---|---|
| Blackscholes | BS | Compute intensive |
| Bodytrack | BT | Compute intensive |
| Vips | VI | Compute intensive |
| Dedup | DU | Compute intensive |
| Freqmine | FQ | Hybrid |
| Ferret | FR | Hybrid |
| Fluidanimate | FA | Hybrid |
| X264 | X264 | Hybrid |
| Streamclusters | SC | Memory intensive |
| Canneal | CA | Memory intensive |
| Facesim | FS | Memory intensive |
| Swaptions | SW | Memory intensive |

We targeted 32nm and 22nm process technologies for the 64-core and 256-core CMPs, respectively. Based on a geometric calculation of the waveguide layout for a 20mm×20mm chip, we estimated the time needed for light to travel from the first to the last node in a single pass of the MWMR waveguide group in UltraNoC as 8 cycles at a 5 GHz clock frequency. The same clock frequency and 8-cycle round-trip time also apply to the waveguides in the Flexishare and Corona photonic crossbar NoCs. Throughout our analysis we use a flit size of 64 bits for EMesh and a total packet size of 512 bits for all photonic NoC architectures. We consider data modulation at both clock edges to enable simultaneous transfer of 512 bits in a single cycle in the UltraNoC, Flexishare, and Corona architectures.
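The timing figures above can be checked with a short calculation. This is only a sketch of the arithmetic: the even split of a packet across the two clock edges and the notion of a "peak rate" for a channel that carries one packet per cycle are our assumptions for illustration, and the variable names are hypothetical.

```python
CLOCK_HZ = 5e9          # clock frequency stated in the text
TRAVERSAL_CYCLES = 8    # first-to-last-node traversal time in cycles
PACKET_BITS = 512       # photonic packet size in bits

# Duration of one clock cycle in nanoseconds.
cycle_time_ns = 1e9 / CLOCK_HZ

# Time for light to traverse the waveguide in a single pass.
traversal_ns = TRAVERSAL_CYCLES * cycle_time_ns

# Assumption: with modulation at both clock edges, half of the packet
# is sent on each edge.
bits_per_edge = PACKET_BITS // 2

# Assumption: a channel that moves one full packet every cycle.
peak_bits_per_s = PACKET_BITS * CLOCK_HZ

print(f"cycle time      : {cycle_time_ns:.2f} ns")
print(f"traversal time  : {traversal_ns:.2f} ns")
print(f"bits per edge   : {bits_per_edge}")
print(f"peak rate       : {peak_bits_per_s / 1e12:.2f} Tb/s")
```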

The static and dynamic energy consumption of electrical routers is based on results obtained from the open-source DSENT tool. The energy consumption of the photonic components for all the photonic NoC architectures is adopted from photonic device characterizations in line with state-of-the-art proposals [19], [23], and is shown in Table 2. Here Edynamic is the energy/bit for modulators and photodetectors, and Elogic−dyn is the energy/bit for the driver circuits of modulators and photodetectors. PMWMR and PMWSR are the static power consumption of an MWMR and an MWSR waveguide group, respectively, which include the power overhead of ring resonator thermal tuning. We consider a ring heating power of 15 µW per ring and a detector responsivity of 0.8 A/W [19]. To compute laser power consumption, we calculated the photonic loss in components, which sets the photonic laser power budget and correspondingly the electrical laser power.
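The loss-to-laser-power step can be sketched as follows, using the common link-budget relation in which the required optical power (in dBm) is the detector sensitivity plus the total path loss (in dB). The per-component losses are the values listed in Table 2; the path geometry, the detector sensitivity, and the laser wall-plug efficiency below are hypothetical values chosen only for illustration and are not taken from the paper.

```python
# Per-component losses in dB (values listed in Table 2).
LOSS_DB = {
    "microring_through": 0.02,   # per ring passed
    "propagation_per_cm": 1.0,   # per cm of waveguide
    "coupler_splitter": 0.5,     # per coupler/splitter
    "bend_90deg": 0.005,         # per 90-degree bend
}


def path_loss_db(length_cm, rings_passed, couplers, bends_90deg):
    """Total optical loss (dB) along one worst-case path."""
    return (
        length_cm * LOSS_DB["propagation_per_cm"]
        + rings_passed * LOSS_DB["microring_through"]
        + couplers * LOSS_DB["coupler_splitter"]
        + bends_90deg * LOSS_DB["bend_90deg"]
    )


def laser_power_mw(loss_db, detector_sensitivity_dbm, wall_plug_efficiency):
    """Return (optical, electrical) laser power in mW for one wavelength."""
    optical_dbm = detector_sensitivity_dbm + loss_db
    optical_mw = 10 ** (optical_dbm / 10)
    electrical_mw = optical_mw / wall_plug_efficiency
    return optical_mw, electrical_mw


# Hypothetical path: 8 cm of waveguide, 100 rings passed, 2 couplers, 4 bends.
loss = path_loss_db(length_cm=8.0, rings_passed=100, couplers=2, bends_90deg=4)
# Hypothetical detector sensitivity (-20 dBm) and wall-plug efficiency (30%).
optical_mw, electrical_mw = laser_power_mw(loss, -20.0, 0.3)
print(f"path loss: {loss:.2f} dB")
print(f"optical power per wavelength: {optical_mw:.4f} mW")
print(f"electrical power per wavelength: {electrical_mw:.4f} mW")
```

Because the loss enters as an exponent, every additional 3 dB of path loss roughly doubles the required laser power, which is why the loss budget dominates the static power estimate.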

Finally, based on our gate-level analysis, the area overhead due to electrical circuitry (e.g., adders, multipliers) in the LSWC is estimated to be 0.011 mm² at 32nm. We set the reconfiguration delay overhead in UltraNoC to 20 cycles to account for the time to transfer wavelength conversion counter values from each cluster to the LSWC, the time to determine the new priority weights of each cluster, and the time to update these values in the arbiters in each cluster. We set the reconfiguration time interval window size in UltraNoC to 300 cycles, to balance reconfiguration overhead and performance.
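As a rough check on the balance between overhead and interval length, the sketch below computes the fraction of each interval taken by reconfiguration. It assumes the 20-cycle delay is fully exposed (not overlapped with data transfer), which the text does not state, so the result should be read as an upper bound under that assumption.

```python
RECONFIG_DELAY_CYCLES = 20   # reconfiguration delay stated in the text
INTERVAL_CYCLES = 300        # reconfiguration interval window size
CLOCK_HZ = 5e9               # clock frequency from the experimental setup

# Fraction of each interval spent on reconfiguration, assuming the delay
# is not hidden behind ongoing data transfers.
overhead_fraction = RECONFIG_DELAY_CYCLES / INTERVAL_CYCLES

# Wall-clock length of one reconfiguration interval in nanoseconds.
interval_ns = INTERVAL_CYCLES / CLOCK_HZ * 1e9

print(f"overhead per interval: {overhead_fraction:.1%}")
print(f"interval length: {interval_ns:.0f} ns")
```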


Table 2: Energy and losses for photonic devices [19], [23]

| Energy consumption type | Energy |
|---|---|
| Edynamic | 0.42 pJ/bit |
| Elogic−dyn | 0.18 pJ/bit |

| Static power per waveguide group | Power |
|---|---|
| PMWMR (with 68 DWDM) | 3.86 W |
| PMWMR (with 64 DWDM) | 3.73 W |
| PMWSR (with 64 DWDM) | 2.35 W |

| Photonic loss type | Loss (in dB) |
|---|---|
| Microring through | 0.02 |
| Waveguide propagation per cm | 1 |
| Waveguide coupler/splitter | 0.5 |
| Waveguide bending loss | 0.005 per 90° |

4.2 Experimental Results

Our experiments target a 64-core CMP platform and compare network throughput, average packet latency, and energy-delay product (EDP) of UltraNoC with the electrical mesh (EMesh), Flexishare with token stream arbitration [10], and Corona with token-slot arbitration [22]. We consider two variants of our architecture: UltraNoC-8 which uses 8 waveguide groups and UltraNoC-16 which uses 16 waveguide groups. Figures 4(a)-(c) show the results of this study, with all results normalized with respect to the EMesh results. From the throughput comparison in Figure 4(a), it can be observed that, not surprisingly, all photonic NoCs provide better throughput than EMesh, due to the presence of higher bandwidth photonic links. UltraNoC with 8 MWMR waveguide groups (UltraNoC-8) has nearly 3× greater throughput compared to Flexishare, with the same number of MWMR waveguides. Even though Flexishare uses MWMR waveguides and time division multiplexing (TDM) as in UltraNoC, there are significant differences between the architectures.

In Flexishare, arbitration wavelengths corresponding to MWMR data waveguides are injected serially into the arbitration waveguide and a node that grabs a token in the arbitration waveguide gets exclusive access to the corresponding MWMR data waveguide; whereas in UltraNoC, multiple arbitration, reservation and data slots are available concurrently in an MWMR waveguide, such that each MWMR waveguide can be accessed simultaneously by multiple nodes. Moreover, unlike UltraNoC, Flexishare does not support priority reconfiguration and bandwidth exchange mechanisms.

UltraNoC-8 also provides higher throughput than Corona. This is because of instances in Corona, where multiple sender nodes attempt to communicate with a single receiver node (e.g., memory controller). Such instances result in the sender nodes attempting to access the single MWSR waveguide connected to the receiver, creating a significant imbalance among MWSR waveguides, with the other waveguides being underutilized while packets get queued waiting for the waveguide connected to the receiver. UltraNoC avoids such an imbalance with its use of more efficient MWMR waveguides and improved arbitration.

The UltraNoC configuration with 16 waveguide groups (UltraNoC-16) with approximately twice the photonic hardware of UltraNoC-8 provides even better throughput than the other architectures. On average UltraNoC-8 has 2.8×, 1.6×, and 3.8× higher throughputs compared to Flexishare, Corona and EMesh, respectively. UltraNoC-16 has 5.3×, 3.2×, and 7.7× higher throughputs compared to Flexishare, Corona and EMesh, respectively. The improvement is somewhat higher for memory intensive workloads than for compute intensive workloads. The large throughput improvement for UltraNoC is a direct consequence of avoiding unused bandwidth by transferring it to cores that need it the most, using the bandwidth transfer and priority alteration mechanisms at runtime. These mechanisms also improve the average packet latency in UltraNoC, as shown in Figure 4(b), by reducing the time spent waiting for access to the photonic waveguides. On average UltraNoC-8 has 17%, 32% and 47% and UltraNoC-16 has 25%, 40% and 51% lower average packet delay over Flexishare, Corona and EMesh, respectively for the different multi-application workloads.

Figure 4: (a) throughput (b) latency (c) EDP comparison of UltraNoC-8 and UltraNoC-16 with other architectures for a 64-core CMP. Results are normalized wrt EMesh.

Figure 4(c) shows the energy-delay product (EDP) comparison between the architectures. It can be observed that on average UltraNoC-8 has 90%, 70% and 10% lower EDP compared to Corona, EMesh, and Flexishare respectively. Further UltraNoC-16 has 79% and 45% lower EDP compared to Corona and EMesh. Most of the energy in the photonic architectures was consumed in the form of static energy. A comparison of the photonic hardware between the network architectures (detailed results for modulator and detector ring counts are omitted due to lack of space) shows that UltraNoC-8 and UltraNoC-16 have 75% and 50% less photonic hardware compared to Corona but UltraNoC-16 uses more hardware than Flexishare. This explains why UltraNoC-16 has a higher EDP than Flexishare, because UltraNoC-16 uses more photonic hardware (and thus consumes more static energy) than Flexishare, even though the average packet delay for UltraNoC-16 is lower than Flexishare.

Our final set of experiments explores architecture scalability. We considered a larger 256-core CMP platform by increasing the core concentration in each tile to four to enable higher traffic injection into the network for this scalability study. We evaluated NoC throughput, average packet latency, and EDP for all photonic NoCs. In addition to UltraNoC-8 and UltraNoC-16, we also considered UltraNoC-32, a variant of our architecture with 32 waveguide groups. Figures 5(a)-(c) show the results of these experiments. From the throughput results it can also be inferred that on average UltraNoC-8 has 3×, 2.8× and 2.6×, UltraNoC-16 has 5.7×, 5.3× and 5.7×, and UltraNoC-32 has 9.8×, 9× and 9.7× greater throughput compared to EMesh, Corona and Flexishare. The improvements in throughput for UltraNoC over Flexishare and Corona are even better for the 256-core CMP, than in the 64-core CMP case. From Figure 5(b), it can be seen that on an average UltraNoC-8 has 23%, 25% and 48%, UltraNoC-16 has 29%, 30% and 53% and UltraNoC-32 has 34%, 36% and 55% lower latency compared to Corona, Flexishare and EMesh architectures respectively. Finally, Figure 5(c) shows that on average UltraNoC-8 has 88%, 87% and 20% lower EDP compared to EMesh, Corona and Flexishare respectively. Further UltraNoC-16 has 77% and 73% and UltraNoC-32 has 55% and 51% lower energy delay product compared to EMesh and Corona respectively. The lower EDP of Flexishare compared to UltraNoC-16 and UltraNoC-32 is due to the lower photonic hardware used in Flexishare, which lowers its power footprint, but also reduces its throughput and increases its latency.

Figure 5: (a) throughput (b) latency (c) EDP comparison of UltraNoC-8, UltraNoC-16 and UltraNoC-32 with other architectures for a 256-core CMP. Results are normalized wrt EMesh.

From the above results, we can summarize that there are variants of UltraNoC that can achieve better performance with less hardware compared to existing photonic NoCs, for CMP platforms with low as well as high core counts. The results are a good indicator of the scalability of our proposed UltraNoC architecture.

5. CONCLUSIONS

In this work we presented the UltraNoC photonic NoC architecture that features improved channel sharing among cores by using an aggressive concurrent token stream-based arbitration strategy. UltraNoC also supports the ability to dynamically transfer bandwidth between clusters of cores and re-prioritize multiple co-running applications to further improve channel utilization and adapt to time-varying application performance goals. UltraNoC improves throughput by up to 9.8×, latency by up to 55% and EDP by up to 90% over traditional electrical and state-of-the-art photonic NoC architectures with the best-known arbitration mechanisms. UltraNoC also scales well with increasing core counts on a chip.

6. ACKNOWLEDGMENTS

This research is supported by grants from SRC, NSF (CCF-1252500, CCF-1302693), and AFOSR (FA9550-13-1-0110).

REFERENCES

[1] R. Ho et al., “The future of wires,” in Proc. IEEE, Apr. 2001.
[2] W. J. Dally, B. Towles, “Route packets, not wires,” in DAC, 2001.
[3] S. Bahirat et al., “OPAL: A Multi-Layer Hybrid Photonic NoC for 3D ICs,” in ASPDAC, Yokohama, Japan, Jan. 2011.
[4] D. A. B. Miller, “Device requirements for optical interconnects to silicon chips,” in IEEE, Special Issue on Silicon Photonics, 2009.
[5] Q. Xu et al., “12.5gbit/s carrier-injection-based silicon micro-ring silicon modulators,” in Optics Express, vol. 15, no. 2, Jan. 2007.
[6] S. J. Koester et al., “Ge-on soi-detector/si-cmos-amplifier receivers for high-performance optical communication applications,” in JLT, vol. 25, no. 1, pp. 46–57, Jan. 2007.
[7] J. D. Owens et al., “Research challenges for on-chip interconnection networks,” in IEEE Micro, September-October 2007.
[8] D. Vantrease et al., “Corona: System implications of emerging nanophotonic technology,” in ISCA, 2008.
[9] Y. Pan et al., “Firefly: Illuminating future network-on-chip with nanophotonics,” in ISCA, 2009.
[10] Y. Pan, J. Kim, G. Memik, “Flexishare: Channel sharing for an energy efficient nanophotonic crossbar,” in HPCA, 2010.
[11] C. Chen, A. Joshi, “Runtime management of laser power in silicon-photonic multibus NoC architecture,” in IEEE JQE, 2013.
[12] V. Almeida et al., “All-optical switching on a silicon chip,” in Optics Letters, 29:2867–2869, 2004.
[13] C. A. Barrios et al., “Low-power-consumption short-length and high-modulation-depth silicon electro-optic modulator,” in JLT, 2003.
[14] R. Morris et al., “Power-efficient and high-performance multilevel hybrid nanophotonic interconnect for multicores,” in NOCS, 2010.
[15] J. Psota et al., “ATAC: On-chip optical networks for multicore processors,” Boston Area Archit. Workshop, pp. 107–108, Jan. 2007.
[16] C. Bienia et al., “The PARSEC Benchmark Suite: Characterization and Architectural Implications,” in PACT, Oct. 2008.
[17] M. J. Cianchetti et al., “Phastlane: a rapid transit optical routing network,” in ISCA, 2009.
[18] N. Kirman et al., “A power-efficient all-optical on-chip interconnect using wavelength-based oblivious routing,” in ASPLOS, 2010.
[19] X. Zheng et al., “Ultra-efficient 10Gb/s hybrid integrated silicon photonic transmitter and receiver,” in Opt. Express, Mar. 2011.
[20] M. Petracca et al., “Design exploration of optical interconnection networks for chip multiprocessors,” in HOTI, Aug. 2008.
[21] A. Joshi et al., “Silicon-Photonic Clos Networks for Global On-Chip Communication,” in NOCS, 2009.
[22] D. Vantrease et al., “Light speed arbitration and flow control for nanophotonic interconnects,” in IEEE/ACM MICRO, Dec. 2009.
[23] P. Grani and S. Bartolini, “Design Options for Optical Ring Interconnect in Future Client Devices,” in ACM JETC, May 2014.
[24] T. Pimpalkhute et al., “An application-aware heterogeneous prioritization framework for NoC based chip multiprocessors,” in ISQED, 2014.
[25] N. Binkert et al., “The gem5 Simulator,” in CA News, May 2011.
