
Aergia: A Network-on-Chip Exploiting Packet Latency Slack

Reetuparna Das, Pennsylvania State University
Onur Mutlu, Carnegie Mellon University
Thomas Moscibroda, Microsoft Research
Chita R. Das, Pennsylvania State University

0272-1732/11/$26.00 © 2011 IEEE. Published by the IEEE Computer Society.

A traditional Network-on-Chip (NoC) employs simple arbitration strategies, such as round robin or oldest first, which treat packets equally regardless of the source applications' characteristics. This is suboptimal because packets can have different effects on system performance. We define slack as a key measure for characterizing a packet's relative importance. Aergia introduces new router prioritization policies that exploit interfering packets' available slack to improve overall system performance and fairness.

Network-on-Chips (NoCs) are widely viewed as the de facto solution for integrating the many components that will comprise future microprocessors. It is therefore foreseeable that on-chip networks will become a critical shared resource in many-core systems, and that such systems' performance will depend heavily on the on-chip network's resource sharing policies. Devising efficient and fair scheduling strategies is particularly important (but also challenging) when diverse applications with potentially different requirements share the network.

An algorithmic question governing application interactions in the network is the NoC router's arbitration policy: that is, which packet to prioritize if two or more packets arrive at the same time and want to take the same output port. Traditionally, on-chip network router arbitration policies have been simple heuristics such as round-robin and age-based (oldest-first) arbitration. These arbitration policies treat all packets equally, irrespective of a packet's underlying application's characteristics. In other words, these policies arbitrate between packets as if each packet had exactly the same impact on application-level performance. In reality, however, applications can have unique and dynamic characteristics and demands. Different packets can have a significantly different importance to their applications' performance. In the presence of memory-level parallelism (MLP),1,2 although the system might have multiple outstanding load misses, not every load miss causes a bottleneck (or is critical).3 Assume, for example, that an application issues two concurrent network requests, one after another, first to a remote node in the network and then to a close-by node. Clearly, the packet going to the close-by node is less critical; even if it's delayed for several cycles in the network, its latency will be hidden from the application by the packet going to the distant node, which would take more time. Thus, each packet's delay tolerance can have a different impact on its application's performance.

In our paper for the 37th Annual International Symposium on Computer Architecture (ISCA),4 we exploit packet diversity in criticality to design higher-performance and more application-fair NoCs.



We do so by differentiating packets according to their slack, a measure that captures the packet's importance to the application's performance. In particular, we define a packet's slack as the number of cycles the packet can be delayed in the network without affecting the application's execution time. So, a packet is relatively noncritical until its time in the network exceeds the packet's available slack. In comparison, increasing the latency of packets with no available slack (by deprioritizing them during arbitration in NoC routers) will stall the application.

Building off of this concept, we develop Aergia, an NoC architecture that contains new router prioritization mechanisms to accelerate the critical packets with low slack values by prioritizing them over packets with larger slack values. (We name our architecture after Aergia, the female spirit of laziness in Greek mythology, after observing that some packets in the NoC can afford to slack off.) We devise techniques to efficiently estimate a packet's slack dynamically. Before a packet enters the network, Aergia tags the packet according to its estimated slack. We propose extensions to routers to prioritize lower-slack packets at the contention points within the router (that is, buffer allocation and switch arbitration). Finally, to ensure forward progress and starvation freedom from prioritization, Aergia groups packets into batches and ensures that all earlier-batch packets are serviced before later-batch packets. Experimental evaluations on a cycle-accurate simulator show that our proposal effectively increases overall system throughput and application-level fairness in the NoC.

Motivation

The premise of our research is the existence of packet latency slack. We thus motivate Aergia by explaining the concept of slack, characterizing the diversity of slack in applications, and illustrating the advantages of exploiting slack with a conceptual example.

The concept of slack

Modern microprocessors employ several memory latency tolerance techniques (such as out-of-order execution5 and runahead execution2,6) to hide the penalty of load misses. These techniques exploit MLP1 by issuing several memory requests in parallel with the hope of overlapping future load misses with current load misses. In the presence of MLP in an on-chip network, the existence of multiple outstanding packets leads to packet latency overlap, which introduces slack cycles (the number of cycles a packet can be delayed without significantly affecting performance). We illustrate the concept of slack cycles with an example. Consider the processor execution timeline in Figure 1a. In the instruction window, the first load miss causes a packet (Packet0) to be sent into the network to service the load miss, and the second load miss generates the next packet (Packet1). In Figure 1a's execution timeline, Packet1 has lower network latency than Packet0 and returns to the source earlier. Nonetheless, the processor can't commit Load0 and stalls until Packet0 returns. Thus, Packet0 is the bottleneck packet, which allows the earlier-returning Packet1 some slack cycles. The system could delay Packet1 for the slack cycles without causing significant application-level performance loss.

Diversity in slack and its analysis

Sufficient diversity in interfering packets' slack cycles is necessary to exploit slack. In other words, it's only possible to benefit from prioritizing low-slack packets if the network has a good mix of both high- and low-slack packets at any time. Fortunately, we found that realistic systems and applications have sufficient slack diversity.

Figure 1b shows a cumulative distribution function of the diversity in slack cycles for 16 applications. The x-axis shows the number of slack cycles per packet. The y-axis shows the fraction of total packets that have at least as many slack cycles per packet as indicated by the x-axis. Two trends are visible. First, most applications have a good spread of packets with respect to the number of slack cycles, indicating sufficient slack diversity within an application. For example, 17.6 percent of Gems's packets have, at most, 100 slack cycles, but 50.4 percent of them have more than 350 slack cycles. Second, applications have different slack characteristics.


[Figure 1a shows the execution timeline: Load miss 0 sends Packet0 and Load miss 1 sends Packet1; Packet1 returns earlier, so the processor stall is determined by Packet0's latency and Packet1 has slack. Figure 1b plots, for 16 applications (Gems, omnet, tpcw, mcf, bzip2, sjbb, sap, sphinx, deal, barnes, astar, calculix, art, libquantum, sjeng, h264ref), the percentage of all packets (y-axis) against slack in cycles (x-axis, 0 to 500).]

Figure 1. Processor execution timeline demonstrating the concept of slack (a); the resulting packet distribution of 16 applications based on slack cycles (b).


For example, the packets of art and libquantum have much lower slack in general than those of tpcw and omnetpp. We conclude that each packet's number of slack cycles varies within application phases as well as between applications.

Advantages of exploiting slack

We show that a NoC architecture that's aware of slack can lead to better system performance and fairness. As we mentioned earlier, a packet's number of slack cycles indicates that packet's importance or criticality to the processor core. If the NoC knew about a packet's remaining slack, it could make arbitration decisions that would accelerate packets with a small slack, which would improve the system's overall performance.

Figure 2 shows a motivating example. Figure 2a depicts the instruction windows of two processor cores: Core A at node (1, 2) and Core B at node (3, 7). The network consists of an 8 × 8 mesh (see Figure 2b). Core A generates two packets (A-0 and A-1). The first packet (A-0) isn't preceded by any other packet and is sent to node (8, 8); therefore, its latency is 13 hops and its slack is 0 hops. In the next cycle, the second packet (A-1) is injected toward node (3, 1). This packet has a latency of 3 hops, and because it's preceded (and thus overlapped) by the 13-hop packet (A-0), it has a slack of 10 hops (13 hops minus 3 hops). Core B also generates two packets (B-0 and B-1). We can calculate Core B's packets' latency and slack similarly. B-0 has a latency of 10 hops and slack of 0 hops, while B-1 has a latency of 4 hops and slack of 6 hops.

We can exploit slack when interference exists between packets with different slack values. In Figure 2b, for example, packets A-0 and B-1 interfere. The critical question is which packet the router should prioritize and which it should queue (delay).

Figure 2c shows the execution timeline of the cores with the baseline slack-unaware prioritization policy that possibly prioritizes packet B-1, thereby delaying packet A-0. In contrast, if the routers knew the packets' slack, they would prioritize packet A-0 over B-1, because the former has a smaller slack (see Figure 2d). Doing so would reduce Core A's stall time without significantly increasing Core B's stall time, thereby improving overall system throughput (see Figure 2). The critical observation is that delaying a higher-slack packet can improve overall system performance by allowing a lower-slack packet to stall its core less.

Online estimation of slack

Our goal is to identify critical packets with low slack and accelerate them by deprioritizing packets with high slack cycles.

Defining slack

We can define slack locally or globally.7 Local slack is the number of cycles a packet can be delayed without delaying any subsequent instruction. Global slack is the number of cycles a packet can be delayed without delaying the last instruction in the program. Computing a packet's global slack requires critical path analysis of the entire program, rendering it impractical in our context. So, we focus on local slack.

A packet's ideal local slack could depend on instruction-level dependencies, which are hard to capture in the NoC. So, to keep the implementation in the NoC simple, we conservatively consider slack only with respect to outstanding network transactions. We define the term predecessor packets, or simply predecessors, for a given packet P as all packets that are still outstanding and that have been injected by the same core into the network earlier than P. We formally define a packet's available local network slack (in this article, simply called slack) as the difference between the maximum latency of its predecessor packets (that is, any outstanding packet that was injected into the network earlier than this packet) and its own latency:

\[
\mathrm{Slack}(\mathrm{Packet}_i) = \max_{k \in \{0,\ldots,\mathrm{NumberOfPredecessors}\}} \mathrm{Latency}(\mathrm{Packet}_k) \;-\; \mathrm{Latency}(\mathrm{Packet}_i) \qquad (1)
\]
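As a concrete illustration of Equation 1, the following sketch (not from the original paper; the `Packet` type, `local_slack` helper, and the example latencies are ours) computes a packet's local slack from the latencies of its outstanding predecessors:

```python
from dataclasses import dataclass

@dataclass
class Packet:
    inject_cycle: int   # cycle at which the packet entered the network
    latency: int        # (estimated) network latency in cycles

def local_slack(packet: Packet, outstanding: list[Packet]) -> int:
    """Slack per Equation 1: max predecessor latency minus own latency.

    Predecessors are outstanding packets injected earlier by the same core.
    We clamp at zero (our choice): a packet whose latency already exceeds
    all of its predecessors' latencies is on the critical path.
    """
    predecessors = [p for p in outstanding if p.inject_cycle < packet.inject_cycle]
    if not predecessors:
        return 0
    max_pred_latency = max(p.latency for p in predecessors)
    return max(0, max_pred_latency - packet.latency)

# Hypothetical numbers in the spirit of Figure 1: Packet0 (latency 20) precedes
# Packet1 (latency 8), so Packet1 could be delayed ~12 cycles without lengthening the stall.
p0 = Packet(inject_cycle=0, latency=20)
p1 = Packet(inject_cycle=1, latency=8)
print(local_slack(p1, [p0, p1]))  # -> 12
```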

The key challenge in estimating slack is to accurately predict the latencies of predecessor packets and the current packet being injected. Unfortunately, it's difficult to exactly predict network latency in a realistic system. So, instead of predicting the exact slack in terms of cycles, we aim to categorize


[Figure 2a shows the instruction windows of Core A and Core B, each with two load misses generating packets A-0/A-1 and B-0/B-1. Figure 2b shows the 8 × 8 mesh: A-0 (latency = 13 hops, slack = 0 hops), A-1 (latency = 3 hops, slack = 10 hops), B-0 (latency = 10 hops, slack = 0 hops), and B-1 (latency = 4 hops, slack = 6 hops), with A-0 and B-1 interfering. Figures 2c and 2d contrast the slack-unaware and slack-aware (Aergia) execution timelines, showing the cycles saved for Core A when A-0 is prioritized over B-1.]

Figure 2. Conceptual example showing the advantage of incorporating slack into prioritization decisions in the Network-on-Chip (NoC). The instruction window for two processor cores (a); interference between two packets with different slack values (b); the processor cores' slack-unaware execution timeline (c); prioritization of the packets according to slack (d).


and quantize slack into different priority levels according to indirect metrics that correlate with the latency of predecessors and the packet being injected. To accomplish this, we characterize slack as it relates to various indirect metrics. We found that the most important factors impacting packet criticality (hence, slack) are the number of predecessors that are level-two (L2) cache misses, whether the injected packet is an L2 cache hit or miss, and the number of a packet's extraneous hops in the network (compared to its predecessors).

Number of miss predecessors. We define a packet's miss predecessors as its predecessors that are L2 cache misses. We found that a packet is likely to have high slack if it has high-latency predecessors or a large number of predecessors; in both cases, there's a high likelihood that its latency is overlapped (that is, that its slack is high). This is intuitive because the number of miss predecessors tries to capture the first term of the slack equation (that is, Equation 1).

L2 cache hit/miss status. We observe that whether a packet is an L2 cache hit or miss correlates with the packet's criticality and, thus, the packet's slack. If the packet is an L2 miss, it likely has high latency owing to dynamic RAM (DRAM) access and extra NoC transactions to and from memory controllers. Therefore, the packet likely has a smaller slack because other packets are less likely to overlap its long latency.

Slack in terms of number of hops. Our third metric captures both terms of the slack equation, and is based on the distance traversed in the network. Specifically:

\[
\mathrm{Slack}_{\mathrm{hops}}(\mathrm{Packet}_i) = \max_{k \in \{0,\ldots,\mathrm{NumberOfPredecessors}\}} \mathrm{Hops}(\mathrm{Packet}_k) \;-\; \mathrm{Hops}(\mathrm{Packet}_i) \qquad (2)
\]

Slack priority levels

We combine these three metrics to form the slack priority level of a packet in Aergia. When a packet is injected, Aergia computes these three metrics and quantizes them to form a three-tier priority (see Figure 3). Aergia tags the packet's head flit with these priority bits. We use 2 bits for the first tier, which is assigned according to the number of miss predecessors. We use 1 bit for the second tier to indicate if the packet being injected is (predicted to be) an L2 hit or miss. And we use 2 bits for the third tier to indicate a packet's hop-based slack. We then use the combined slack-based priority level to prioritize between packets in routers.
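To make the tagging concrete, here is a minimal sketch of packing the three tiers into the 5-bit priority carried by the head flit. The field widths (2 + 1 + 2 bits) follow the article; the quantization thresholds, bit ordering, and function name are illustrative assumptions:

```python
def slack_priority(miss_predecessors: int, predicted_l2_miss: bool, slack_hops: int) -> int:
    """Pack the three-tier slack estimate into a 5-bit priority tag.

    Lower values mean lower slack, i.e., higher urgency in the router.
    Quantization thresholds and bit ordering are illustrative, not from the paper.
    """
    tier1 = min(miss_predecessors, 3)           # 2 bits: number of miss predecessors
    tier2 = 0 if predicted_l2_miss else 1       # 1 bit: predicted L2 miss -> more critical
    tier3 = min(max(slack_hops, 0) // 4, 3)     # 2 bits: quantized hop-based slack
    return (tier1 << 3) | (tier2 << 2) | tier3  # 5-bit tag carried in the head flit

# A packet with no miss predecessors, a predicted L2 miss, and zero hop slack
# gets the lowest (most urgent) priority value: 0.
assert slack_priority(0, True, 0) == 0
```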

Slack estimation

In Aergia, assigning a packet's slack priority level requires estimation of the three metrics when the packet is injected.

Estimating the number of miss predecessors. Every core maintains a list of the outstanding L1 load misses (the predecessor list). The miss status handling registers (MSHRs) limit the predecessor list's size.8 Each L1 miss is associated with a corresponding L2 miss status bit. At the time a packet is injected into the NoC, its actual L2 hit/miss status is unknown because the shared L2 cache is distributed across the nodes. Therefore, an L2 hit/miss predictor is consulted and sets the L2 miss status bit accordingly. We compute the miss predecessor slack priority level as the number of outstanding L1 misses in the predecessor list whose L2 miss status bits are set (to indicate a predicted or actual L2 miss). Our implementation records L1 misses issued in the last 32 cycles and sets the maximum number of miss predecessors to 8. If a prediction error occurs, the L2 miss status bit is updated to the correct value when the data response packet returns to the core.

[Figure 3 fields: number of miss predecessors (2 bits); is the packet an L2 cache hit or miss? (1 bit); slack in number of hops (2 bits).]

Figure 3. Priority structure for estimating a packet's slack. A lower priority level corresponds to a lower slack value.


In addition, if a packet that is predicted to be an L2 hit actually results in an L2 miss, the corresponding L2 cache bank notifies the requesting core that the packet actually resulted in an L2 miss so that the requesting core updates the corresponding L2 miss status bit accordingly. To reduce control-packet overhead, the L2 bank piggybacks this information on another packet traveling through the requesting core as an intermediate node.
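A minimal sketch of this bookkeeping follows (our own illustration; the 32-cycle window and the cap of 8 come from the text, but the data structure and names are assumptions):

```python
from collections import deque

class PredecessorTracker:
    """Tracks recent outstanding L1 misses and counts predicted L2-miss predecessors."""

    WINDOW = 32    # only L1 misses issued in the last 32 cycles are considered
    MAX_PRED = 8   # cap on the number of miss predecessors

    def __init__(self):
        self.predecessors = deque()  # entries: (issue_cycle, l2_miss_status_bit)

    def record_l1_miss(self, cycle: int, predicted_l2_miss: bool) -> int:
        """Called when a new L1 miss injects a packet; returns its miss-predecessor count."""
        # Drop predecessors that fall outside the 32-cycle window.
        while self.predecessors and cycle - self.predecessors[0][0] > self.WINDOW:
            self.predecessors.popleft()
        # Count predecessors whose (predicted or known) L2 miss status bit is set.
        count = min(sum(bit for _, bit in self.predecessors), self.MAX_PRED)
        self.predecessors.append((cycle, int(predicted_l2_miss)))
        return count
```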

Estimating whether a packet will be an L2 cache hit or miss. At the time a packet is injected, it's unknown whether it will hit in the remote L2 cache bank it accesses. We use an L2 hit/miss predictor in each core to guess each injected packet's cache hit/miss status, and we set the second-tier priority bit accordingly in the packet header. If the packet is predicted to be an L2 miss, its L2 miss priority bit is set to 0; otherwise, it is set to 1. This priority bit within the packet header is corrected when the actual L2 miss status is known after the packet accesses the L2.

We developed two types of L2 miss predictors. The first is based on the global branch predictor.9 A shift register records the hit/miss values for the last M L1 load misses. We then use this register to index a pattern history table (PHT) containing 2-bit saturating counters. The accessed counter indicates whether the prediction for that particular pattern history was a hit or a miss. The counter is updated when a packet's hit/miss status is known. The second predictor, called the threshold predictor, uses the insight that misses occur in bursts. This predictor updates a counter if the access is a known L2 miss. The counter resets after every M L1 misses. If the number of known L2 misses in the last M L1 misses exceeds a threshold T, the next L1 miss is predicted to be an L2 miss.

The global predictor requires more storage bits (for the PHT) and has marginally higher design complexity than the threshold predictor. For correct updates, both predictors require an extra bit in the response packet indicating whether the transaction was an L2 miss.
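A sketch of the simpler threshold predictor (our own illustration; the article leaves M and T unspecified, so the values below are placeholders):

```python
class ThresholdL2MissPredictor:
    """Predicts an L2 miss if the recent window of L1 misses contained many known L2 misses."""

    def __init__(self, window_m: int = 16, threshold_t: int = 4):
        self.window_m = window_m        # reset the counters after every M L1 misses
        self.threshold_t = threshold_t  # predict miss once known misses exceed T
        self.l1_misses_seen = 0
        self.known_l2_misses = 0
        self.predict_miss = False

    def predict(self) -> bool:
        """Prediction for the next L1 miss: True means 'predicted L2 miss'."""
        return self.predict_miss

    def update(self, was_l2_miss: bool) -> None:
        """Called when a response returns and the true L2 hit/miss status is known."""
        self.l1_misses_seen += 1
        if was_l2_miss:
            self.known_l2_misses += 1
        self.predict_miss = self.known_l2_misses > self.threshold_t
        if self.l1_misses_seen >= self.window_m:  # restart the window after M L1 misses
            self.l1_misses_seen = 0
            self.known_l2_misses = 0
```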

Estimating hop-based slack. We calculate a packet's hop count by adding the X and Y distances between its source and destination. We then calculate the packet's hop-based slack using Equation 2.
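For example, hop counts and hop-based slack on a 2D mesh can be computed as follows (a minimal sketch; the coordinate convention and function names are ours):

```python
def hops(src: tuple[int, int], dst: tuple[int, int]) -> int:
    """Manhattan (X + Y) distance between two mesh nodes."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

def hop_slack(packet_hops: int, predecessor_hops: list[int]) -> int:
    """Equation 2: max predecessor hop count minus this packet's hop count."""
    if not predecessor_hops:
        return 0
    return max(0, max(predecessor_hops) - packet_hops)

# Figure 2 example: A-1 goes from (1, 2) to (3, 1), i.e., 3 hops, while its
# predecessor A-0 travels 13 hops, giving A-1 a hop-based slack of 10.
assert hops((1, 2), (3, 1)) == 3
assert hop_slack(3, [13]) == 10
```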

Aergia NoC architecture

Our NoC architecture, Aergia, uses the slack priority levels that we described for arbitration. We now describe the architecture. Some design challenges accompany the prioritization mechanisms, including starvation avoidance (using batching) and mitigating priority inversion (using multiple network interface queues).

Baseline

Figure 4 shows a generic NoC router architecture. The router has P input channels connected to P ports and P output channels connected to the crossbar; typically, P = 5 for a 2D mesh (one from each direction and one from the network interface). The routing computation unit determines the next router and the virtual channel (V) within the next router for each packet.

[Figure 4 elements: P input channels and input ports, each holding virtual channels VC 1 to VC v with a VC identifier; routing computation, VC arbiter, and switch arbiter feeding a scheduler; a P x P crossbar driving output channels 1 to P; and credit-out signals.]

Figure 4. Baseline router microarchitecture. The figure shows the different parts of the router: input channels connected to the router ports; buffers within ports organized as virtual channels; the control logic, which performs routing and arbitration; and the crossbar connected to the output channels.


Dimension-ordered routing (DOR) is the most common routing policy because of its low complexity and deadlock freedom. Our baseline assumes XY DOR, which first routes packets in the X direction, followed by the Y direction. The virtual channel arbitration unit arbitrates among all packets requesting access to the same output virtual channel in the downstream router and decides on winners. The switch arbitration unit arbitrates among all input virtual channels requesting access to the crossbar and grants permission to the winning packets and flits. The winners can then traverse the crossbar and be placed on the output links. Current routers use simple, local arbitration policies (such as round-robin or oldest-first arbitration) to decide which packet to schedule next (that is, which packet wins arbitration). Our baseline NoC uses the round-robin policy.
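As an aside, XY dimension-ordered routing reduces to a simple per-hop decision; a sketch of that decision (our own illustration, not code from the paper; port names are arbitrary):

```python
def xy_dor_next_port(cur: tuple[int, int], dst: tuple[int, int]) -> str:
    """XY dimension-ordered routing: fully resolve the X dimension before turning to Y."""
    x, y = cur
    dx, dy = dst
    if dx > x:
        return "EAST"
    if dx < x:
        return "WEST"
    if dy > y:
        return "NORTH"
    if dy < y:
        return "SOUTH"
    return "LOCAL"  # arrived: eject to the network interface

assert xy_dor_next_port((1, 2), (3, 1)) == "EAST"  # X is resolved first, even though Y also differs
```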

Arbitration

In Aergia, the virtual channel arbitration and switch arbitration units prioritize packets with lower slack priority levels. Thus, low-slack packets get first preference for buffers as well as the crossbar and are thereby accelerated in the network. In wormhole switching, only the head flit arbitrates for and reserves the virtual channel; thus, only the head flit carries the slack priority value. Aergia uses this header information during the virtual channel arbitration stage for priority arbitration. In addition to the state that the baseline architecture maintains, each virtual channel has an additional priority field, which is updated with the head flit's slack priority level when the head flit reserves the virtual channel. The body flits use this field during the switch arbitration stage for priority arbitration.

Without adequate countermeasures, prioritization mechanisms can easily lead to starvation in the network. To prevent starvation, we combine our slack-based prioritization with a batching mechanism similar to the one from our previous work.10 We divide time into intervals of T cycles, which we call batching intervals. Packets inserted into the network during the same interval belong to the same batch (that is, they have the same batch priority value). We prioritize packets belonging to older batches over those from younger batches. Only when two packets belong to the same batch do we prioritize them according to their available slack (that is, the slack priority levels). Each head flit carries a 5-bit slack priority value and a 3-bit batch number. Using adder delays, we estimate the delay of an 8-bit priority arbiter (P = 5, V = 6) to be 138.9 picoseconds and that of a 16-bit priority arbiter to be 186.9 picoseconds at 45-nanometer technology.
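Putting batching and slack together, the arbitration winner can be chosen with a lexicographic key: oldest batch first, then lowest slack priority within a batch. A sketch under those assumptions (the wrap-around handling of the 3-bit batch number is simplified away; type and function names are ours):

```python
from dataclasses import dataclass

@dataclass
class HeadFlit:
    batch: int           # 3-bit batch number (older batches have smaller values in this sketch)
    slack_priority: int  # 5-bit slack priority tag (lower value = less slack = more urgent)

def arbitrate(candidates: list[HeadFlit]) -> HeadFlit:
    """Pick the winner: oldest batch first, then lowest slack priority within a batch."""
    return min(candidates, key=lambda f: (f.batch, f.slack_priority))

# A low-slack packet from the current batch still loses to any packet from an older batch,
# which is what guarantees forward progress and starvation freedom.
old = HeadFlit(batch=2, slack_priority=31)
new = HeadFlit(batch=3, slack_priority=0)
assert arbitrate([new, old]) is old
```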

Network interface

For effectiveness, prioritization is necessary not only within the routers but also at the network interfaces. We split the monolithic injection queue at a network interface into a small number of equal-length queues. Each queue buffers packets whose slack priority levels fall within a different programmable range. Packets are guided into a particular queue on the basis of their slack priority levels. Queues belonging to higher-priority packets take precedence over those belonging to lower-priority packets. Such prioritization reduces priority inversion and is especially needed at memory controllers, where packets from all applications are buffered together at the network interface and where monolithic queues can cause severe priority inversion.
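The queue selection itself is just a range lookup over the 5-bit slack priority; a small sketch (the queue count and ranges are illustrative assumptions, not from the article):

```python
def select_injection_queue(slack_priority: int,
                           ranges=((0, 7), (8, 15), (16, 23), (24, 31))) -> int:
    """Map a packet's slack priority into one of a few per-range injection queues.

    Lower-numbered queues hold lower-slack (more urgent) packets and are
    drained first by the network interface.
    """
    for idx, (lo, hi) in enumerate(ranges):
        if lo <= slack_priority <= hi:
            return idx
    return len(ranges) - 1  # clamp out-of-range values into the last queue

assert select_injection_queue(3) == 0    # urgent packet goes to the highest-priority queue
assert select_injection_queue(29) == 3   # high-slack packet waits in the lowest-priority queue
```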

Comparison to application-aware scheduling and fairness policies

In our previous work,10 we proposed Stall-Time Criticality (STC), an application-aware coordinated arbitration policy to accelerate network-sensitive applications. The idea is to rank applications at regular intervals based on their network intensity (L1-MPI, or L1 miss rate per instruction), and to prioritize all packets of a nonintensive application over all packets of an intensive application. Within applications belonging to the same rank, STC prioritizes packets using the baseline round-robin order in a slack-unaware manner. Packet batching is used for starvation avoidance. STC is a coarse-grained approach that identifies critical applications (from the NoC perspective) and prioritizes them over noncritical ones. By focusing on application-level intensity, STC can't capture fine-grained packet-level


system performance and network fairness over various workloads, whether used independently or together with application-aware prioritization techniques.

To our knowledge, our ISCA 2010 paper4 is the first work to make use of the criticality of memory accesses for packet scheduling within NoCs. We believe our proposed slack-aware NoC design can influence many-core processor implementations and future research in several ways.

Other work has proposed and explored the instruction slack concept7 (for more information, see the "Research related to Aergia" sidebar). Although prior research has exploited instruction slack to improve performance-power trade-offs for single-threaded uniprocessors, instruction slack's impact on shared resources in many-core processors provides new research opportunities. To our knowledge, our ISCA 2010 paper4 is the first work that attempts to exploit and understand how instruction-level slack impacts the interference of applications within a many-core processor's shared resources, such as caches, memory controllers, and on-chip networks. We define, measure, and devise easy-to-implement schemes to estimate the slack of memory accesses in the context of many-core processors. Researchers can use our slack metric for better management of other shared resources, such as request scheduling at memory controllers, managing shared caches, saving energy in the shared memory hierarchy, and providing QoS. Shared resource management will become even more important in future systems (including data centers, cloud computing, and mobile many-core systems) as they employ a larger number of increasingly diverse applications running together on a many-core processor. Thus, using slack for better resource management could become more beneficial as the processor industry evolves toward many-core processors. We hope that our work inspires such research and development.

Managing shared resources in a highly parallel system is a fundamental challenge. Although application interference is relatively well understood for many important shared resources (such as shared last-level caches12 or memory bandwidth13-16), less is known about application interference behavior in NoCs or how this interference impacts applications' execution times. One reason why analyzing and managing multiple applications in a shared NoC is challenging is that application interactions in a distributed system can be complex and chaotic, with numerous first- and second-order effects (such as queueing delays, different MLP, burstiness, and the effect of the spatial location of cache banks) and hard-to-predict interference patterns that can significantly impact application-level performance.

We attempt to understand applications' complex interactions in NoCs using application-level metrics and systematic experimental evaluations. Building on this understanding, we've devised low-cost, simple-to-implement, and scalable policies to control and reduce interference in NoCs. We've shown that employing coarse-grained and fine-grained application-aware prioritization can significantly improve performance. We hope our techniques inspire other novel approaches to reduce application-level interference in NoCs.

Finally, our work exposes the instruction behavior inside a processor core to the NoC. We hope this encourages future research on the integrated design of cores and NoCs (as well as other shared resources). This integrated design approach will likely foster future research on overall system performance instead of individual components' performance: by codesigning the NoC and the core to be aware of each other, we can achieve significantly better performance and fairness than designing each alone.

We therefore hope Aergia opens new research opportunities beyond packet scheduling in on-chip networks. MICRO

Research related to Aergia

To our knowledge, no previous work has characterized slack in packet latency and proposed mechanisms to exploit it for on-chip network arbitration. Here, we discuss the most closely related previous work.

Instruction criticality and memory-level parallelism (MLP)

Much research has been done to predict instruction criticality or slack1-3 and to prioritize critical instructions in the processor core and caches. MLP-aware cache replacement exploits cache-miss criticality differences caused by differences in MLP.4 Parallelism-aware memory scheduling5 and other rank-based memory scheduling algorithms6 reduce application stall time by improving each thread's bank-level parallelism. Such work relates to ours only in the sense that we also exploit criticality to improve system performance. However, our methods (for computing slack) and mechanisms (for exploiting it) are very different owing to on-chip networks' distributed nature.

Prioritization in on-chip and multiprocessor networks

We've extensively compared our approach to state-of-the-art local arbitration, quality-of-service-oriented prioritization (such as Globally-Synchronized Frames, or GSF7), and application-aware prioritization (such as Stall-Time Criticality, or STC8) policies in Networks-on-Chip (NoCs). Bolotin et al. prioritize control packets over data packets in the NoC but don't distinguish packets on the basis of available slack.9 Other researchers have also proposed frameworks for quality of service (QoS) in on-chip networks,10-12 which can be combined with our approach. Some other work proposed arbitration policies for multichip multiprocessor networks and long-haul networks.13-16 They aim to provide guaranteed service or fairness, while we aim to exploit slack in packet latency to improve system performance. In addition, most of the previous mechanisms statically assign priorities and bandwidth to different flows in off-chip networks to satisfy real-time performance and QoS guarantees. In contrast, our proposal dynamically computes and assigns priorities to packets based on slack.

Batching

We use packet batching in the NoC for starvation avoidance, similar to our previous work.8 Other researchers have explored using batching and frames in disk scheduling,17 memory scheduling,5 and QoS-aware packet scheduling7,12 to prevent starvation and provide fairness.

References

1. S.T. Srinivasan and A.R. Lebeck, "Load Latency Tolerance in Dynamically Scheduled Processors," Proc. 31st Ann. ACM/IEEE Int'l Symp. Microarchitecture, IEEE CS Press, 1998, pp. 148-159.

2. B. Fields, S. Rubin, and R. Bodík, "Focusing Processor Policies via Critical-Path Prediction," Proc. 28th Ann. Int'l Symp. Computer Architecture (ISCA 01), ACM Press, 2001, pp. 74-85.

3. B. Fields, R. Bodík, and M. Hill, "Slack: Maximizing Performance under Technological Constraints," Proc. 29th Ann. Int'l Symp. Computer Architecture, IEEE Press, 2002, pp. 47-58, doi:10.1109/ISCA.2002.1003561.



4. M. Qureshi et al., "A Case for MLP-Aware Cache Replacement," Proc. 33rd Ann. Int'l Symp. Computer Architecture (ISCA 06), IEEE CS Press, 2006, pp. 167-178.

5. O. Mutlu and T. Moscibroda, "Parallelism-Aware Batch Scheduling: Enhancing Both Performance and Fairness of Shared DRAM Systems," Proc. 35th Ann. Int'l Symp. Computer Architecture (ISCA 08), IEEE CS Press, 2008, pp. 63-74.

6. Y. Kim et al., "ATLAS: A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers," Proc. IEEE 16th Int'l Symp. High Performance Computer Architecture (HPCA 10), IEEE Press, 2010, doi:10.1109/HPCA.2010.5416658.

7. J.W. Lee, M.C. Ng, and K. Asanovic, "Globally-Synchronized Frames for Guaranteed Quality-of-Service in On-Chip Networks," Proc. 35th Ann. Int'l Symp. Computer Architecture (ISCA 08), IEEE CS Press, 2008, pp. 89-100.

8. R. Das et al., "Application-Aware Prioritization Mechanisms for On-Chip Networks," Proc. 42nd Ann. IEEE/ACM Int'l Symp. Microarchitecture, ACM Press, 2009, pp. 280-291.

9. E. Bolotin et al., "The Power of Priority: NoC Based Distributed Cache Coherency," Proc. 1st Int'l Symp. Networks-on-Chip (NOCS 07), IEEE Press, 2007, pp. 117-126, doi:10.1109/NOCS.2007.42.

10. E. Bolotin et al., "QNoC: QoS Architecture and Design Process for Network on Chip," J. Systems Architecture, vol. 50, nos. 2-3, 2004, pp. 105-128.

11. E. Rijpkema et al., "Trade-offs in the Design of a Router with Both Guaranteed and Best-Effort Services for Networks on Chip," Proc. Conf. Design, Automation and Test in Europe (DATE 03), vol. 1, IEEE CS Press, 2003, pp. 10350-10355.

12. B. Grot, S.W. Keckler, and O. Mutlu, "Preemptive Virtual Clock: A Flexible, Efficient, and Cost-Effective QOS Scheme for Networks-on-Chip," Proc. 42nd Ann. IEEE/ACM Int'l Symp. Microarchitecture, ACM Press, 2009, pp. 268-279.

13. K.H. Yum, E.J. Kim, and C. Das, "QoS Provisioning in Clusters: An Investigation of Router and NIC Design," Proc. 28th Ann. Int'l Symp. Computer Architecture (ISCA 01), ACM Press, 2001, pp. 120-129.

14. A.A. Chien and J.H. Kim, "Rotating Combined Queueing (RCQ): Bandwidth and Latency Guarantees in Low-Cost, High-Performance Networks," Proc. 23rd Ann. Int'l Symp. Computer Architecture (ISCA 96), ACM Press, 1996, pp. 226-236.

15. A. Demers, S. Keshav, and S. Shenker, "Analysis and Simulation of a Fair Queueing Algorithm," Symp. Proc. Comm. Architectures & Protocols (Sigcomm 89), ACM Press, 1989, pp. 1-12, doi:10.1145/75246.75248.

16. L. Zhang, "Virtual Clock: A New Traffic Control Algorithm for Packet Switching Networks," Proc. ACM Symp. Comm. Architectures & Protocols (Sigcomm 90), ACM Press, 1990, pp. 19-29, doi:10.1145/99508.99525.

17. H. Frank, "Analysis and Optimization of Disk Storage Devices for Time-Sharing Systems," J. ACM, vol. 16, no. 4, 1969, pp. 602-620.



Acknowledgments

This research was partially supported by National Science Foundation (NSF) grant CCF-0702519, NSF Career Award CCF-0953246, the Carnegie Mellon University CyLab, and GSRC.

References

1. A. Glew, "MLP Yes! ILP No! Memory Level Parallelism, or Why I No Longer Care about Instruction Level Parallelism," ASPLOS Wild and Crazy Ideas Session, 1998; http://www.cs.berkeley.edu/~kubitron/asplos98/abstracts/andrew_glew.pdf.

2. O. Mutlu, H. Kim, and Y.N. Patt, "Efficient Runahead Execution: Power-Efficient Memory Latency Tolerance," IEEE Micro, vol. 26, no. 1, 2006, pp. 10-20.

3. B. Fields, S. Rubin, and R. Bodík, "Focusing Processor Policies via Critical-Path Prediction," Proc. 28th Ann. Int'l Symp. Computer Architecture (ISCA 01), ACM Press, 2001, pp. 74-85.

4. R. Das et al., "Aergia: Exploiting Packet Latency Slack in On-Chip Networks," Proc. 37th Ann. Int'l Symp. Computer Architecture (ISCA 10), ACM Press, 2010, pp. 106-116.

5. R.M. Tomasulo, "An Efficient Algorithm for Exploiting Multiple Arithmetic Units," IBM J. Research and Development, vol. 11, no. 1, 1967, pp. 25-33.

6. O. Mutlu et al., "Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors," Proc. 9th Int'l Symp. High-Performance Computer Architecture (HPCA 03), IEEE Press, 2003, pp. 129-140.

7. B. Fields, R. Bodík, and M. Hill, "Slack: Maximizing Performance under Technological Constraints," Proc. 29th Ann. Int'l Symp. Computer Architecture, IEEE Press, 2002, pp. 47-58, doi:10.1109/ISCA.2002.1003561.

8. D. Kroft, "Lockup-Free Instruction Fetch/Prefetch Cache Organization," Proc. 8th Ann. Symp. Computer Architecture (ISCA 81), IEEE CS Press, 1981, pp. 81-87.

9. T.Y. Yeh and Y.N. Patt, "Two-Level Adaptive Training Branch Prediction," Proc. 24th Ann. Int'l Symp. Microarchitecture, ACM Press, 1991, pp. 51-61.

10. R. Das et al., "Application-Aware Prioritization Mechanisms for On-Chip Networks," Proc. 42nd Ann. IEEE/ACM Int'l Symp. Microarchitecture, ACM Press, 2009, pp. 280-291.

11. J.W. Lee, M.C. Ng, and K. Asanovic, "Globally-Synchronized Frames for Guaranteed Quality-of-Service in On-Chip Networks," Proc. 35th Ann. Int'l Symp. Computer Architecture (ISCA 08), IEEE CS Press, 2008, pp. 89-100.

12. L.R. Hsu et al., "Communist, Utilitarian, and Capitalist Cache Policies on CMPs: Caches as a Shared Resource," Proc. 15th Int'l Conf. Parallel Architectures and Compilation Techniques (PACT 06), ACM Press, 2006, pp. 13-22.

13. O. Mutlu and T. Moscibroda, "Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors," Proc. 40th Ann. IEEE/ACM Int'l Symp. Microarchitecture, IEEE CS Press, 2007, pp. 146-160.

14. O. Mutlu and T. Moscibroda, "Parallelism-Aware Batch Scheduling: Enhancing Both Performance and Fairness of Shared DRAM Systems," Proc. 35th Ann. Int'l Symp. Computer Architecture (ISCA 08), IEEE CS Press, 2008, pp. 63-74.

15. Y. Kim et al., "ATLAS: A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers," Proc. IEEE 16th Int'l Symp. High Performance Computer Architecture (HPCA 10), IEEE Press, 2010, doi:10.1109/HPCA.2010.5416658.

16. Y. Kim et al., "Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior," Proc. 43rd Ann. IEEE/ACM Int'l Symp. Microarchitecture, ACM Press, 2010.

Reetuparna Das is a research scientist at Intel Labs. Her research interests include computer architecture, especially interconnection networks. She has a PhD in computer science and engineering from Pennsylvania State University.

Onur Mutlu is an assistant professor in the Electrical and Computer Engineering


Department at Carnegie Mellon University. His research interests include computer architecture and systems. He has a PhD in electrical and computer engineering from the University of Texas at Austin. He's a member of IEEE and the ACM.

Thomas Moscibroda is a researcher in the distributed systems research group at Microsoft Research. His research interests include distributed computing, networking, and algorithmic aspects of computer architecture. He has a PhD in computer science from ETH Zurich. He's a member of IEEE and the ACM.

Chita R. Das is a distinguished professor in the Department of Computer Science and Engineering at Pennsylvania State University. His research interests include parallel and distributed computing, performance evaluation, and fault-tolerant computing. He has a PhD in computer science from the University of Louisiana, Lafayette. He's a fellow of IEEE.

Direct questions and comments to Reetuparna Das, Intel Labs, Intel Corp., SC-12, 3600 Juliette Ln., Santa Clara, CA; [email protected].
