Sai Vineel Reddy Chittamuru, Srinivas Desai, Sudeep Pasricha Department of Electrical and Computer Engineering Colorado State University, Fort Collins, CO, U.S.A.
{sai.chittamuru, srinivas.desai, sudeep}@colostate.edu
ABSTRACT On-chip communication is widely considered to be one of the major performance bottlenecks in contemporary chip multiprocessors (CMPs). With recent advances in silicon nanophotonics, photonic-based networks-on-chip (NoCs) are being considered as a viable option for communication in emerging CMPs as they can enable higher bandwidth and lower power dissipation compared to traditional electrical NoCs. In this paper, we present UltraNoC, a novel reconfigurable silicon-photonic NoC architecture that features improved channel sharing and supports dynamic re-prioritization and exchange of bandwidth between clusters of cores running multiple applications, to increase channel utilization and performance. Experimental results show that UltraNoC improves throughput by up to 9.8× while reducing latency by up to 55% and energy-delay product by up to 90% over state-of-the-art solutions.
Categories and Subject Descriptors C.2.1 [Computer-Communication Networks]: Network Architecture and Design—Network topology; C.4 [Performance of Systems]: design studies Keywords Network-on-chip (NoC), photonic channel sharing, arbitration
1. INTRODUCTION
Advances in technology scaling over the past decade have enabled the integration of billions of transistors on a single die. Such a massive number of transistors has allowed multiple processing cores and more memory to be integrated on a chip, to meet rapidly growing performance demands of modern applications. With hundreds of on-chip cores expected to become a reality in the near future, traditional electrical network-on-chip (NoC) communication fabrics [1], [2] are projected to suffer from cripplingly high power dissipation and severely reduced performance [7]. Crosstalk and electromagnetic interference in metallic interconnects will further worsen the performance and reliability of NoCs with technology scaling.
Recent developments in the area of silicon photonics have enabled the integration of photonic devices with CMOS circuits. On-chip photonic links provide several advantages over their metallic counterparts, including light speed transfers, high bandwidth density by using dense wavelength division multiplexing (DWDM) [3], low power dissipation [4], and low crosstalk [7]. Thus silicon photonics is being considered as an exciting new option for future NoCs. Several photonic devices such as microring resonators, waveguides and photodetectors have already been fabricated and demonstrated at the chip level [5], [6].
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. GLSVLSI’15, May 20–22, 2015, Pittsburgh, PA, USA. Copyright © 2015 ACM 978-1-4503-3474-7/15/05…$15.00. http://dx.doi.org/10.1145/2742060.2742067
These devices have been used as a foundation for several photonic network architectures [8]-[11].
A few prior works emphasize the importance of network resource contention in photonic channels and propose arbitration techniques to resolve contention [8], [10]. However, a limitation of these approaches is that they do not fully exploit available network bandwidth. Another limitation is that these approaches typically target only workloads consisting of a single parallel application. In emerging multicore systems where multiple applications execute simultaneously on unique subsets of cores, there is significantly greater variation in the temporal and spatial characteristics of traffic injected into the network. For example, cores running memory intensive tasks can require more network bandwidth than cores running compute intensive tasks [24].
To overcome these shortcomings, we propose a novel photonic NoC architecture called UltraNoC that utilizes multiple-write-multiple-read (MWMR) photonic waveguides in a crossbar topology, and supports dynamic performance adaptation to aggressively utilize network bandwidth and meet diverse application demands. We compare UltraNoC against architectures with the best-known arbitration mechanisms, for multi-threaded PARSEC [16] workloads on CMP platform sizes ranging from 64 to 256 cores. The novel contributions of this paper are:
A flexible on-chip photonic network architecture (UltraNoC) that facilitates selective and reconfigurable prioritization of applications based on their time-varying performance goals;
A concurrent token stream arbitration that provides multiple simultaneous tokens and increases channel utilization;
A dynamic bandwidth transfer technique with low overhead, to transfer unused bandwidth among clusters of cores;
A mechanism to monitor traffic being injected into the network by different co-running applications, to facilitate dynamic arbitration wavelength injection rate modulation.
2. RELATED WORK
A considerable amount of work has focused on the area of photonic NoC design. Several efforts have explored high-radix low-diameter photonic crossbar architectures that provide non-blocking connectivity, e.g., [8], [10], [15] and [21]; whereas silicon photonic implementations of low-radix high-diameter NoCs have also been investigated in [14], [17], [18], and [20]. Prior work has shown that photonic crossbars are extremely promising architectures to meet future on-chip bandwidth demands, but they can suffer from (i) large power dissipation and (ii) high contention for resources, especially when using inefficient token-based arbitration schemes. A few techniques to reduce the power overhead of photonic NoCs have been proposed in the literature. For example, an effective policy for runtime management of the laser source is proposed in [11]. We build on that work to manage power in photonic crossbar NoCs. To reduce contention issues in crossbars, a few improved arbitration techniques have been proposed in [10], [22] that use time division multiplexing (TDM), so that a single data waveguide can be used by more than one node in different time slots.
A Reconfigurable Silicon-Photonic Network with Improved Channel Sharing for Multicore Architectures
In Flexishare [10], a token stream arbitration scheme is proposed. The scheme requires wavelengths corresponding to each data waveguide to be injected serially into different time slots of an arbitration waveguide. A node writes on the data waveguide only when it gets access to the corresponding arbitration wavelength. Subsequently, the node cannot send data again till its arbitration wavelength is injected into the arbitration waveguide, which takes N cycles for N data waveguides. The scheme leads to channel under-utilization, and performs even worse as the number of nodes and waveguides increase. In [22], the token ring arbitration scheme from Corona [8] was improved with the token channel and token slot arbitration techniques for multiple-write-single-read (MWSR) crossbars. Token slot arbitration uses TDM and improves upon token channel arbitration by dividing the arbitration waveguide into fixed-size, back-to-back slots, with destination nodes circulating tokens in one-to-one correspondence to slots. A limitation of this approach is that a fixed time gap is required between two arbitration slots to set up data for transmission, which reduces the available time slots to send data.
UltraNoC improves upon these prior works by utilizing a more powerful concurrent token stream arbitration strategy, together with reconfigurable cluster prioritization and bandwidth re-allocation to improve MWMR photonic channel utilization.
Figure 1: Layout of MWMR crossbar used in UltraNoC along with the arrangement of cores and their respective gateway interfaces.
3. ULTRANOC PHOTONIC NOC OVERVIEW
3.1 Architecture and Terminology
The baseline UltraNoC architecture is targeted to a 64-core CMP, as shown in Figure 1. We also extend the baseline architecture to a 256-core CMP for the purposes of scalability analysis. Each core has a private L1 and shared L2 cache, with a MOESI protocol to maintain cache coherence. In a 64-core CMP, each group of 8 cores has access to main memory via a dedicated memory controller, whereas in a 256-core CMP, each group of 16 cores has a dedicated memory controller. A node is defined as an entity consisting of one and four cores for the 64-core and 256-core CMP, respectively. Every node in UltraNoC is attached to a gateway interface (GI) module that facilitates transfers between the CMOS electrical layer and a photonic layer (with photonic waveguides, modulators, detectors, etc.). The entire chip is divided into four clusters (C0, C1, C2, C3), where each cluster contains 16 cores in a 64-core CMP and 64 cores in a 256-core CMP.
A detailed layout of the UltraNoC architecture is shown in Figure 1, where 64 nodes are arranged in an 8×8 mesh. Communication between cores within a node for the 256-core CMP uses an electrical 5×5 NoC router, where four of its input and output port pairs are connected to four cores and the fifth input/output port pair is connected to a GI module. A round-robin arbitration scheme is used within each node for communication between cores and the GI.
Inter-node transfers are facilitated by dual-coiled MWMR waveguide groups, where each group has four MWMR waveguides by default, with each waveguide supporting 64 wavelength DWDM, to facilitate simultaneous transfer of a total of 512 bits of data (cache line) with data modulation at both clock edges. Each MWMR waveguide group in UltraNoC passes every node twice in the dual coiled structure to enable a two pass inter-node data communication. A node has the ability to write on the first pass using its ring modulators and read from the waveguide group using its ring detectors in the second pass.
As all nodes are capable of modulating (writing) in an MWMR waveguide group during the first pass, there is a need for arbitration (see Section 3.2 for more details) between sending nodes to ensure that the data of different senders does not destructively overlap on the shared waveguide group. Throughout this paper this first pass portion of the waveguide group is referred to as the modulating and arbitration waveguide group. In the second pass of the MWMR waveguide group all nodes receive data through their respective ring detectors; hence this portion of the waveguide group is referred to as the receiving waveguide group. Additionally there is a power waveguide that runs in parallel with the other waveguides, and carries arbitration wavelengths. This waveguide facilitates our bandwidth transfer and priority adaptation techniques (see Sections 3.3, 3.4).
Figure 1 depicts an expanded view of the collection of GIs for four nodes, which shows the modulating and arbitration, receiving, and power waveguide groups, along with their connection to GIs. There are 64 wavelengths that are utilized for data communication in each MWMR waveguide within each waveguide group, so each ring modulator and detector group has 256 ring modulators and 256 ring detectors, respectively. Each of the 64 wavelengths is assigned to a unique receiving node, such that whenever a receiver detects its corresponding wavelength during a clock cycle, it switches its detectors "on" to receive data in the next clock cycle. All other receivers keep their detectors turned off to save power. We also utilize four arbitration wavelengths, each corresponding to a cluster. For powering the waveguides, we use a broadband off-chip laser source with a laser power controller (LSWC). The LSWC has groups of ring modulators capable of injecting different wavelengths in different clock cycles. As we use 68 DWDM in each MWMR waveguide, there are 68 ring modulators in each ring modulator group in the LSWC. The laser controller also has control logic that can alter the rate of injection of arbitration wavelengths into each waveguide group. More details about LSWC are presented in the next subsections and overhead analysis is presented in Section 4.1.
3.2 MWMR Concurrent Token Stream Arbitration
In UltraNoC all of the cores on a chip are partitioned into four clusters (C0 – C3) and each cluster is assigned a dedicated arbitration wavelength (λ0 – λ3). Each MWMR waveguide group is divided into a fixed number of time slots, based on the time taken by light to traverse the waveguide on a die, which is 8 cycles in our architecture. Thus we divide each MWMR waveguide group into 8 time slots. The time slots are classified into three types: arbitration slot, receiver prediction slot, and data slot. For simplicity, Figure 2(a) shows an example of the distribution of time slots across 4 MWMR waveguide groups, even though a minimum of 8 MWMR waveguide groups are used in our architecture.
In the arbitration slot, the LSWC injects arbitration wavelengths of clusters, selectively using a modulator group to dedicate the arbitration slot to a particular cluster. After a sending node grabs an arbitration wavelength in the arbitration slot, it gets access to the next receiver prediction slot to release the corresponding wavelength of its receiving node. Subsequently, in the next data slot, the sending node modulates data on the 64 wavelengths in each waveguide group assigned for data transfer.
As an example, suppose N1 in cluster C0 needs to send data to N32 that has a corresponding receiver prediction wavelength w32. N1 first grabs arbitration wavelength λ0, which is dedicated to cluster C0, in the arbitration slot. N1 then modulates in the next receiver prediction slot, such that only w32 (the dedicated wavelength for receiver prediction of N32) is made available by removing all the wavelengths except w32 (using its ring modulators) in that receiver prediction slot. On the receiving end at N32, only the detector for wavelength w32 is switched on. Once w32 is detected, N32 prepares to receive data in the next data slot by switching on the remaining detectors in that node. Figure 2(b) shows the position of different slots in the MWMR waveguide group W1 at time cycle 3 for the example in Figure 2(a). As our architecture divides each MWMR waveguide group into 8 slots, each slot covers 8 nodes in a particular time instance. The stream of tokens on concurrent slots in waveguide groups allows multiple nodes to inject packets simultaneously, resulting in extremely high channel utilization in the MWMR waveguides. A round-robin arbiter is used within each cluster to resolve contention among the 16 nodes in a cluster, and avoid starvation.
Figure 2: (a) Timing diagram of arbitration in UltraNoC, which shows distribution of arbitration (Ai), receiver prediction (Ri) and data slots (Di) across four MWMR waveguide groups (W1–W4); (b) distribution of different slots within MWMR waveguide group W1 at time cycle 3.
3.3 Inter-cluster Bandwidth Exchange
UltraNoC supports inter-cluster bandwidth transfers to further improve channel utilization and overall performance. As an example, if cluster C0 does not need to transfer data, it transfers its bandwidth to a subsequent cluster C1. Similarly any cluster can transfer its unused bandwidth to subsequent clusters. Figure 3 presents an overview of the bandwidth transfer technique. The last node in each cluster is provisioned with an extra modulator that is capable of injecting an arbitration wavelength of the next cluster. Whenever a ring detector of the last node of a cluster detects its own arbitration wavelength, and if this node does not have data to transfer, this is a case where the cluster has not used its bandwidth.
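The forward hand-off of an unused arbitration slot can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions, not the hardware mechanism itself: the function name `route_arbitration_slot` and the per-cluster `has_data` flags are hypothetical, the round-robin arbitration among the 16 nodes of a cluster is abstracted into one flag per cluster, and the sketch assumes that the last cluster (C3) has no successor to hand a slot to.

```python
from typing import List, Optional, Tuple

NUM_CLUSTERS = 4  # clusters C0..C3, as in the text


def route_arbitration_slot(owner: int, has_data: List[bool]) -> Tuple[Optional[int], List[int]]:
    """Follow one arbitration slot that was injected for cluster `owner`.

    has_data[c] is True if some node in cluster c wants to send (assumption:
    the intra-cluster round-robin arbiter is abstracted into this flag).
    Returns (cluster that finally uses the slot, or None if nobody does,
             list of clusters whose last node converted the wavelength).
    """
    conversions: List[int] = []
    cluster = owner
    while cluster < NUM_CLUSTERS:
        if has_data[cluster]:
            # A node in this cluster grabs the arbitration wavelength.
            return cluster, conversions
        if cluster == NUM_CLUSTERS - 1:
            # Assumption: no cluster follows C3, so the slot stays unused.
            break
        # The last node removes its own arbitration wavelength and injects
        # the arbitration wavelength of the next cluster into the same slot.
        conversions.append(cluster)
        cluster += 1
    return None, conversions
```

For example, `route_arbitration_slot(0, [False, True, False, False])` hands the slot of C0 to C1 after one conversion, while a slot injected for C2 can never reach C0 or C1, because transfers only follow the direction of the token stream.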
To improve channel utilization, the cluster can transfer its unused bandwidth to the next cluster, by having the last node of the cluster inject the arbitration wavelength of the next cluster into the same arbitration slot. To simultaneously remove and inject arbitration wavelengths at the last node of a cluster, the arbiter uses information about the current slot of the waveguide and the cluster it belongs to. This information can change across interval windows based on the data received from LSWC (Section 3.4).
Figure 3 illustrates an example of bandwidth transfer. Clusters C0–C3 are assigned arbitration wavelengths highlighted with green, yellow, blue, and red respectively. The last node in the first three clusters (node16, node32, and node48) is shown with an extra ring modulator to facilitate the injection of the arbitration wavelength of the next cluster. For this example, nodes in C0 do not need to transfer any data in the current cycle. Then node16, which is the last node in C0, removes its cluster's arbitration wavelength (green) from the arbitration slot and injects the arbitration wavelength of C1 (yellow) so that nodes in C1 can use this arbitration slot for sending data in the next available data slot. Figure 3 also shows counters at nodes 16, 32, 48, and 64 that are used to count the arbitration wavelength conversions over a time interval. The next subsection presents details about these counters.
Figure 3: Bandwidth transfer technique: a cluster can transfer its unused bandwidth to the next cluster by absorbing its own arbitration wavelength and releasing arbitration wavelength of next cluster.
3.4 Cluster Priority Adaptation with LSWC Reconfiguration
UltraNoC also supports runtime alteration of allocated bandwidth to each cluster, to closely track changing application bandwidth needs, by altering the number of arbitration slots dedicated to each cluster (i.e., cluster priority). This is essential because while our bandwidth transfer technique can transfer unused arbitration slots from one cluster to another in the direction of concurrent arbitration token stream flow (e.g., C0 to C1), it lacks the ability to transfer bandwidth in the opposite direction (e.g., C1 to C0). To overcome this limitation, we design a cluster priority adaptation mechanism to more comprehensively manage cluster bandwidth allocations over time. This mechanism also helps to minimize laser power by intelligently reducing the total number of injected arbitration slots for scenarios with low bandwidth requirements. The cluster priority adaptation technique consists of three main steps, as discussed below:
Step 1: Determination of wavelength conversion count: Each cluster C0 – C3 has an associated weight w0 – w3, which determines the proportion of arbitration slots (and consequently bandwidth or priority) assigned to the cluster. Initially, these weights can be set to be equal, i.e., 0.25 each. At runtime, whenever the last node in a cluster performs a wavelength conversion from its current cluster arbitration wavelength to the next cluster arbitration wavelength, a counter (shown in Figure 3) is incremented. This conversion event represents the case where an unused arbitration slot (bandwidth) is transferred from one cluster to another. Over a time interval T, the recorded wavelength conversion counts WCC0 – WCC3 from each cluster are then used to determine unused bandwidth of each cluster.
Step 2: Calculation of excess arbitration slots: The wavelength conversion count values of different clusters show the aggregated number of excess arbitration slots, which includes excess arbitration slots of the present cluster along with the excess arbitration slots of predecessor clusters. Therefore the excess arbitration slots for the ith cluster (ESi) are calculated using equation (1) shown below, by subtracting the cluster wavelength conversion count of the predecessor cluster (WCCi-1) from the wavelength conversion count of the cluster under consideration (WCCi). ESi values can also be negative, when a cluster consumes a greater number of arbitration slots (made available by predecessor clusters) than its allocated arbitration slots. Such a cluster has a deficit of arbitration slots.
$$ES_i = \begin{cases} WCC_i, & i = 0 \\ WCC_i - WCC_{i-1}, & i > 0 \end{cases} \qquad (1)$$
Step 3: Setting new weight (priority) for each cluster: Based on the estimation of excesses and deficits in arbitration slots assigned across clusters, this final step attempts to adjust weight values of each cluster to eliminate the excesses and deficits. To determine the new weight of the ith cluster wi(next) for the upcoming time interval, we must subtract the excess weight EWi of the cluster from its current weight wi(current). We can calculate EWi by dividing the excess arbitration slots of the ith cluster (ESi), calculated in equation (1), by the total number of arbitration slots released in the time interval T, which we denote as K. The equations below show these calculations:
$$EW_i = ES_i / K \qquad (2)$$
$$w_i(\text{next}) = w_i(\text{current}) - EW_i \qquad (3)$$
Based on the values of the new weights, the LSWC changes the distribution of arbitration wavelengths injected for the next time interval T, such that a cluster with a higher weight will receive more arbitration wavelengths, and has more opportunities to use the waveguides for data transfer. The weight values are also communicated to all clusters, so that arbiters can adjust their local counters to match the new arbitration slot profile in waveguides.
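The three steps above can be sketched directly from equations (1)-(3). The snippet below is a minimal illustration, not the LSWC control logic: the function names `excess_slots` and `update_weights` and the sample counter values are hypothetical, and no clamping or renormalization of weights is applied because the text does not describe any.

```python
from typing import List


def excess_slots(wcc: List[int]) -> List[int]:
    """Equation (1): ES_0 = WCC_0 and ES_i = WCC_i - WCC_{i-1} for i > 0."""
    return [wcc[0]] + [wcc[i] - wcc[i - 1] for i in range(1, len(wcc))]


def update_weights(weights: List[float], wcc: List[int], k: int) -> List[float]:
    """Equations (2) and (3): w_i(next) = w_i(current) - ES_i / K.

    weights: current per-cluster weights w_0..w_3
    wcc:     wavelength conversion counts WCC_0..WCC_3 recorded over interval T
    k:       total number of arbitration slots released in interval T
    """
    es = excess_slots(wcc)
    return [w - e / k for w, e in zip(weights, es)]


# Illustrative (made-up) interval: K = 100 slots, equal initial weights.
# C0 leaves 20 slots unused, C1 and C2 consume more than their share,
# and C3 leaves 10 slots unused.
new_weights = update_weights([0.25, 0.25, 0.25, 0.25], [20, 5, 0, 10], 100)
```

With these sample counts the excess slots are 20, -15, -5 and 10, so the new weights are approximately 0.05, 0.40, 0.30 and 0.15. Because the sum in equation (1) telescopes, the new weights add up to the old total minus WCC3/K (0.90 here), which is consistent with the stated goal of injecting fewer arbitration slots when demand is low.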
4. EXPERIMENTS 4.1 Experimental Setup
To evaluate our proposed UltraNoC architecture, we compared it to an electrical mesh (EMesh) based NoC as well as to two state-of-the-art photonic crossbar NoCs that use smart arbitration techniques: Flexishare with token stream arbitration [10] and Corona with an enhanced token-slot arbitration [22]. We modeled and simulated the architectures at a cycle-accurate granularity with a SystemC-based NoC simulator, for two CMP platform complexities: 64-core and 256-core. We used the PARSEC benchmark suite [16] to create multi-application workloads, with clusters running parallelized versions of different benchmarks.
Table 1 shows the PARSEC benchmarks we considered, classified into three categories according to their memory intensities. Compute intensive benchmarks spend most of the time computing and less time communicating with memory; whereas memory intensive applications spend a larger portion of their execution time communicating with memory and less time computing within cores. Hybrid intensity benchmarks demonstrate both compute and memory intensive phases. We created 12 multi-application workloads from these benchmarks. Each workload
combines 4 benchmarks, and the memory intensity of the workloads varies across the spectrum, from compute intensive to memory intensive. As an example, the SC-BT-BS-VI workload combines parallelized implementations of Streamclusters (SC), Bodytrack (BT), Blackscholes (BS) and Vips (VI), and executes them in clusters C0, C1, C2, and C3, respectively. Each parallelized benchmark is executed on a group of 16 cores and 64 cores, in the 64-core and 256-core CMP platforms, respectively. Full-system simulation of parallelized PARSEC benchmarks using gem5 [25] was used to generate traces that were fed into our cycle-accurate network simulator. We set a “warm-up” period of 100M instructions and captured traces for 1B instructions.
Table 1: Memory intensity classification of PARSEC benchmarks
Application       Representation   Workload Type
Blackscholes      BS               Compute intensive
Bodytrack         BT               Compute intensive
Vips              VI               Compute intensive
Dedup             DU               Compute intensive
Freqmine          FQ               Hybrid
Ferret            FR               Hybrid
Fluidanimate      FA               Hybrid
X264              X264             Hybrid
Streamclusters    SC               Memory intensive
Canneal           CA               Memory intensive
Facesim           FS               Memory intensive
Swaptions         SW               Memory intensive
We targeted 32nm and 22nm process technologies for the 64-core and 256-core CMPs, respectively. Based on the geometric calculation of the waveguides for a 20mm×20mm chip dimension, we estimated the time needed for light to travel from the first to the last node in a single pass of the MWMR waveguide group in UltraNoC as 8 cycles at 5 GHz clock frequency. The same clock and 8 cycle round trip time is also applicable to the waveguides in the Flexishare and Corona photonic crossbar NoCs. Throughout our analysis we use a flit size of 64 bits for EMesh and a total packet size of 512 bits for all photonic NoC architectures. We consider data modulation at both clock edges to enable simultaneous transfer of 512 bits in a single cycle, in the UltraNoC, Flexishare, and Corona architectures.
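The figures quoted above can be cross-checked with a short calculation. This is a sketch for the UltraNoC waveguide group only, using the four waveguides per group and 64 data wavelengths per waveguide given in Section 3.1; the constant and function names are illustrative, and the peak bandwidth is an upper bound that assumes every data slot is filled.

```python
# Parameters quoted from the text; the constant names are illustrative only.
WAVEGUIDES_PER_GROUP = 4             # MWMR waveguides per group (default)
DATA_WAVELENGTHS_PER_WAVEGUIDE = 64  # data wavelengths (DWDM) per waveguide
BITS_PER_WAVELENGTH_PER_CYCLE = 2    # modulation at both clock edges
CLOCK_GHZ = 5                        # clock frequency in GHz
SINGLE_PASS_CYCLES = 8               # first-to-last node, single pass


def bits_per_cycle() -> int:
    """Bits one waveguide group carries in a clock cycle."""
    return (WAVEGUIDES_PER_GROUP
            * DATA_WAVELENGTHS_PER_WAVEGUIDE
            * BITS_PER_WAVELENGTH_PER_CYCLE)


def single_pass_time_ns() -> float:
    """Traversal time of a single pass: cycles divided by cycles per ns."""
    return SINGLE_PASS_CYCLES / CLOCK_GHZ


def peak_group_bandwidth_tbps() -> float:
    """Upper bound on bandwidth of one waveguide group in Tb/s."""
    return bits_per_cycle() * CLOCK_GHZ / 1000
```

Under these numbers a waveguide group carries 4 × 64 × 2 = 512 bits per cycle, which matches the 512-bit packet size, and a single pass takes 8 cycles, i.e. about 1.6 ns at 5 GHz.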
The static and dynamic energy consumption of electrical routers is based on results obtained from the open-source DSENT tool. Energy consumption values of the various photonic components for all the photonic NoC architectures are adopted from photonic device characterizations in line with state-of-the-art proposals [19], [23] and are shown in Table 2. Here Edynamic is the energy/bit for modulators and photodetectors and Elogic−dyn is the energy/bit for the driver circuits of modulators and photodetectors. PMWMR and PMWSR are the static power consumption of an MWMR and an MWSR waveguide group, respectively, and include the power overhead of ring resonator thermal tuning. We consider a ring heating power of 15 µW per ring and detector responsivity of 0.8 A/W [19]. To compute laser power consumption, we calculated photonic loss in components, which sets the photonic laser power budget and correspondingly the electrical laser power.
Finally, based on our gate-level analysis, area overhead due to electrical circuitry (e.g., adders, multipliers) in LSWC is estimated to be 0.011 mm² at 32nm. We set the reconfiguration delay overhead in UltraNoC to be 20 cycles to account for the time to transfer wavelength conversion counter values from each cluster to LSWC, time to determine new priority weights of each cluster, and to update these values in arbiters in each cluster. We set reconfiguration time interval window size in UltraNoC to 300 cycles, to balance reconfiguration overhead and performance.
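As a rough worked example of this trade-off, the relative cost of one reconfiguration can be written out using the 5 GHz clock stated earlier in this section. Whether traffic is stalled during the 20-cycle delay or the delay is overlapped with normal operation is not stated, so the ratio below is only an upper bound on the time lost per window; the symbols T_reconfig and T_window are introduced here for illustration.

```latex
\frac{T_{\text{reconfig}}}{T_{\text{window}}} = \frac{20\ \text{cycles}}{300\ \text{cycles}} \approx 6.7\%,
\qquad
T_{\text{window}} = \frac{300}{5\ \text{GHz}} = 60\ \text{ns},
\qquad
T_{\text{reconfig}} = \frac{20}{5\ \text{GHz}} = 4\ \text{ns}
```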
Table 2: Energy and losses for photonic devices [19], [23]
Energy consumption type              Energy
Edynamic                             0.42 pJ/bit
Elogic−dyn                           0.18 pJ/bit
Static power per waveguide group     Power
PMWMR (with 68 DWDM)                 3.86 W
PMWMR (with 64 DWDM)                 3.73 W
PMWSR (with 64 DWDM)                 2.35 W
Photonic loss type                   Loss (in dB)
Microring through                    0.02
Waveguide propagation per cm         1
Waveguide coupler/splitter           0.5
Waveguide bending loss               0.005 per 90°
4.2 Experimental Results
Our experiments target a 64-core CMP platform and compare network throughput, average packet latency, and energy-delay product (EDP) of UltraNoC with the electrical mesh (EMesh), Flexishare with token stream arbitration [10], and Corona with token-slot arbitration [22]. We consider two variants of our architecture: UltraNoC-8 which uses 8 waveguide groups and UltraNoC-16 which uses 16 waveguide groups. Figures 4(a)-(c) show the results of this study, with all results normalized with respect to the EMesh results. From the throughput comparison in Figure 4(a), it can be observed that, not surprisingly, all photonic NoCs provide better throughput than EMesh, due to the presence of higher bandwidth photonic links. UltraNoC with 8 MWMR waveguide groups (UltraNoC-8) has nearly 3× greater throughput compared to Flexishare, with the same number of MWMR waveguides. Even though Flexishare uses MWMR waveguides and time division multiplexing (TDM) as in UltraNoC, there are significant differences between the architectures.
In Flexishare, arbitration wavelengths corresponding to MWMR data waveguides are injected serially into the arbitration waveguide and a node that grabs a token in the arbitration waveguide gets exclusive access to the corresponding MWMR data waveguide; whereas in UltraNoC, multiple arbitration, reservation and data slots are available concurrently in an MWMR waveguide, such that each MWMR waveguide can be accessed simultaneously by multiple nodes. Moreover, unlike UltraNoC, Flexishare does not support priority reconfiguration and bandwidth exchange mechanisms.
UltraNoC-8 also provides higher throughput than Corona. This is because of instances in Corona where multiple sender nodes attempt to communicate with a single receiver node (e.g., memory controller). Such instances result in the sender nodes attempting to access the single MWSR waveguide connected to the receiver, creating a significant imbalance among MWSR waveguides, with the other waveguides being underutilized while packets get queued waiting for the waveguide connected to the receiver. UltraNoC avoids such an imbalance with its use of more efficient MWMR waveguides and improved arbitration.
The UltraNoC configuration with 16 waveguide groups (UltraNoC-16), with approximately twice the photonic hardware of UltraNoC-8, provides even better throughput than the other architectures. On average UltraNoC-8 has 2.8×, 1.6×, and 3.8× higher throughputs compared to Flexishare, Corona and EMesh, respectively. UltraNoC-16 has 5.3×, 3.2×, and 7.7× higher throughputs compared to Flexishare, Corona and EMesh, respectively. The improvement is somewhat higher for memory intensive workloads than for compute intensive workloads. The large throughput improvement for UltraNoC is a direct consequence of avoiding unused bandwidth by transferring it to cores that need it the most, using the bandwidth transfer and priority alteration mechanisms at runtime.
These mechanisms also improve the average packet latency in UltraNoC, as shown in Figure 4(b), by reducing the time spent waiting for access to the photonic waveguides. On average UltraNoC-8 has 17%, 32% and 47% and UltraNoC-16 has 25%, 40% and 51% lower average packet delay over Flexishare, Corona and EMesh, respectively, for the different multi-application workloads.
Figure 4: (a) throughput (b) latency (c) EDP comparison of UltraNoC-8 and UltraNoC-16 with other architectures for a 64-core CMP. Results are normalized wrt EMesh.
Figure 4(c) shows the energy-delay product (EDP) comparison between the architectures. It can be observed that on average UltraNoC-8 has 90%, 70% and 10% lower EDP compared to Corona, EMesh, and Flexishare, respectively. Further, UltraNoC-16 has 79% and 45% lower EDP compared to Corona and EMesh. Most of the energy in the photonic architectures was consumed in the form of static energy. A comparison of the photonic hardware between the network architectures (detailed results for modulator and detector ring counts are omitted due to lack of space) shows that UltraNoC-8 and UltraNoC-16 have 75% and 50% less photonic hardware compared to Corona but UltraNoC-16 uses more hardware than Flexishare. This explains why UltraNoC-16 has a higher EDP than Flexishare, because UltraNoC-16 uses more photonic hardware (and thus consumes more static energy) than Flexishare, even though the average packet delay for UltraNoC-16 is lower than Flexishare.
Our final set of experiments explores architecture scalability. We considered a larger 256-core CMP platform by increasing the core concentration in each tile to four to enable higher traffic injection into the network for this scalability study. We evaluated NoC throughput, average packet latency, and EDP for all photonic NoCs.
67
cwethhgTacUacrUEherUha
F8c
ohwin
considered Ultrawaveguide grouexperiments. Frohat on average
has 5.7×, 5.3× angreater throughpThe improvemenand Corona are ecore CMP case. FUltraNoC-8 has and 53% and Ultcompared to Cespectively. Fi
UltraNoC-8 has EMesh, Corona ahas 77% and 73energy delay espectively. Th
UltraNoC-16 anhardware used inalso reduces its th
Figure 5: (a) throu8, UltraNoC-16 acore CMP. Result
From the aboveof UltraNoC thhardware comparwith low as wendicator of the s
aNoC-32, a varups. Figures 5(aom the throughpUltraNoC-8 hasnd 5.7×, and Ul
put compared tonts in throughpeven better for thFrom figure 5(b)23%, 25% and 4traNoC-32 has 3Corona, Flexishinally, Figure
88%, 87% anand Flexishare r3% and UltraNproduct comp
he lower EDPnd UltraNoC-32n Flexishare, whhroughput and in
(
(
ughput (b) latencnd UltraNoC-32
ts are normalized
e results, we canhat can achievred to existing p
ell as high corescalability of our
riant of our arca)-(c) show theput results it cans 3×, 2.8× and 2ltraNoC-32 has o EMesh, Coronut for UltraNoChe 256-core CM), it can be seen 48%, UltraNoC-34%, 36% and 5hare and EM5(c) shows t
d 20% lower Erespectively. Fur
NoC-32 has 55%pared to EMeP of Flexisha2 is due to theich lowers its poncreases its laten
(a)
(b)
(c)
cy (c) EDP compawith other archiwrt EMesh.
n summarize thate better perfor
photonic NoCs, fe counts. The rer proposed Ultra
chitecture with e results of then also be inferr
2.6×, UltraNoC-9.8×, 9× and 9.7
na and FlexisharC over Flexisha
MP, than in the 6that on an avera-16 has 29%, 3055% lower laten
Mesh architecturthat on averaEDP compared rther UltraNoC-
% and 51% lowesh and Coroare compared e lower photonower footprint, bncy.
arison of UltraNoitectures for a 25
t there are varianrmance with lefor CMP platformesults are a go
aNoC architectur
32 ese red 16 7× re. are 64-age 0% ncy res age
to 16
wer ona
to nic but
oC-56-
nts ess ms od
re.
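The comparisons reported above are all expressed relative to a baseline (EMesh), either as a normalized ratio or as a "percent lower" figure. The following is a minimal sketch of that bookkeeping, not the authors' evaluation scripts. It assumes EDP is computed as total network energy multiplied by average packet latency; the `NocResult` type, the helper functions, the architecture name `ArchA`, and every number are hypothetical placeholders rather than measured data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NocResult:
    """Raw metrics for one architecture (units are illustrative only)."""
    throughput: float  # delivered traffic per unit time
    latency: float     # average packet latency
    energy: float      # total network energy

    @property
    def edp(self) -> float:
        # Assumption: EDP = energy x average packet latency.
        return self.energy * self.latency


def normalize(results, baseline):
    """Express every metric as a ratio to the baseline architecture."""
    base = results[baseline]
    return {
        name: {
            "throughput": r.throughput / base.throughput,
            "latency": r.latency / base.latency,
            "edp": r.edp / base.edp,
        }
        for name, r in results.items()
    }


def percent_lower(value, reference):
    """Return how much lower `value` is than `reference`, in percent."""
    return 100.0 * (1.0 - value / reference)


# Placeholder inputs, NOT results from the paper.
results = {
    "EMesh": NocResult(throughput=1.0, latency=100.0, energy=10.0),
    "ArchA": NocResult(throughput=2.0, latency=50.0, energy=4.0),
}

normalized = normalize(results, baseline="EMesh")
edp_gain = percent_lower(results["ArchA"].edp, results["EMesh"].edp)
print(normalized["ArchA"], round(edp_gain, 1))
```

Under this convention a normalized EDP below 1.0 means the architecture improves on the baseline, and a design with lower latency can still show a higher EDP if its energy term (for example, static energy from additional photonic hardware) grows faster than its latency shrinks.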
5. CONCLUSIONS
In this work we presented the UltraNoC photonic NoC architecture that features improved channel sharing among cores by using an aggressive concurrent token stream-based arbitration strategy. UltraNoC also supports the ability to dynamically transfer bandwidth between clusters of cores and re-prioritize multiple co-running applications to further improve channel utilization and adapt to time-varying application performance goals. UltraNoC improves throughput by up to 9.8×, latency by up to 55% and EDP by up to 90% over traditional electrical and state-of-the-art photonic NoC architectures with the best-known arbitration mechanisms. UltraNoC also scales well with increasing core counts on a chip.

6. ACKNOWLEDGMENTS
This research is supported by grants from SRC, NSF (CCF-1252500, CCF-1302693), and AFOSR (FA9550-13-1-0110).

REFERENCES
[1] R. Ho, et al., “The future of wires,” in Proc. IEEE, Apr. 2001.
[2] W. J. Dally, B. Towles, “Route packets, not wires,” in DAC, 2001.
[3] S. Bahirat et al., “OPAL: A Multi-Layer Hybrid Photonic NoC for 3D ICs,” in ASPDAC, Yokohama, Japan, Jan. 2011.
[4] D. A. B. Miller, “Device requirements for optical interconnects to silicon chips,” in IEEE, Special Issue on Silicon Photonics, 2009.
[5] Q. Xu et al., “12.5gbit/s carrier-injection-based silicon micro-ring silicon modulators,” in Optics Express, vol. 15, no. 2, Jan. 2007.
[6] S. J. Koester et al., “Ge-on soi-detector/si-cmos-amplifier receivers for high-performance optical communication applications,” in JLT, vol. 25, no. 1, pp. 46–57, Jan. 2007.
[7] J. D. Owens, et al., “Research challenges for on-chip interconnection networks,” in IEEE Micro, September-October 2007.
[8] D. Vantrease et al., “Corona: System implications of emerging nanophotonic technology,” in ISCA, 2008.
[9] Y. Pan et al., “Firefly: Illuminating future network-on-chip with nanophotonics,” in ISCA, 2009.
[10] Y. Pan, J. Kim, G. Memik, “Flexishare: Channel sharing for an energy efficient nanophotonic crossbar,” in HPCA, 2010.
[11] C. Chen, A. Joshi, “Runtime management of laser power in silicon-photonic multibus NoC architecture,” in IEEE JQE, 2013.
[12] V. Almeida et al., “All-optical switching on a silicon chip,” in Optics Letters, 29:2867–2869, 2004.
[13] C. A. Barrios, et al., “Low-power-consumption short-length and high-modulation-depth silicon electro-optic modulator,” in JLT, 2003.
[14] R. Morris et al., “Power-efficient and high-performance multilevel hybrid nanophotonic interconnect for multicores,” in NOCS, 2010.
[15] J. Psota et al., “ATAC: On-chip optical networks for multicore processors,” Boston Area Archit. Workshop, pp. 107–108, Jan. 2007.
[16] C. Bienia et al., “The PARSEC Benchmark Suite: Characterization and Architectural Implications,” in PACT, Oct. 2008.
[17] M. J. Cianchetti et al., “Phastlane: a rapid transit optical routing network,” in ISCA, 2009.
[18] N. Kirman et al., “A power-efficient all-optical on-chip interconnect using wavelength-based oblivious routing,” in ASPLOS, 2010.
[19] X. Zheng et al., “Ultra-efficient 10Gb/s hybrid integrated silicon photonic transmitter and receiver,” in Opt. Express, Mar. 2011.
[20] M. Petracca, et al., “Design exploration of optical interconnection networks for chip multiprocessors,” in HOTI, Aug. 2008.
[21] A. Joshi et al., “Silicon-Photonic Clos Networks for Global On-Chip Communication,” in NOCS, 2009.
[22] D. Vantrease et al., “Light speed arbitration and flow control for nanophotonic interconnects,” in IEEE/ACM MICRO, Dec. 2009.
[23] P. Grani and S. Bartolini, “Design Options for Optical Ring Interconnect in Future Client Devices,” in ACM JETC, May 2014.
[24] T. Pimpalkhute et al., “An application-aware heterogeneous prioritization framework for NoC based chip multiprocessors,” in ISQED, 2014.
[25] N. Binkert et al., “The gem5 Simulator,” in CA News, May 2011.