
Aergia: A Network-on-Chip Exploiting Packet Latency Slack

Reetuparna Das, Pennsylvania State University
Onur Mutlu, Carnegie Mellon University
Thomas Moscibroda, Microsoft Research
Chita R. Das, Pennsylvania State University

0272-1732/11/$26.00 © 2011 IEEE. Published by the IEEE Computer Society.

A traditional Network-on-Chip (NoC) employs simple arbitration strategies, such as round robin or oldest first, which treat packets equally regardless of the source applications' characteristics. This is suboptimal because packets can have different effects on system performance. We define slack as a key measure for characterizing a packet's relative importance. Aergia introduces new router prioritization policies that exploit interfering packets' available slack to improve overall system performance and fairness.

Network-on-Chips (NoCs) are widely viewed as the de facto solution for integrating the many components that will comprise future microprocessors. It is therefore foreseeable that on-chip networks will become a critical shared resource in many-core systems, and that such systems' performance will depend heavily on the on-chip network's resource sharing policies. Devising efficient and fair scheduling strategies is particularly important (but also challenging) when diverse applications with potentially different requirements share the network.

An algorithmic question governing application interactions in the network is the NoC router's arbitration policy: that is, which packet to prioritize if two or more packets arrive at the same time and want to take the same output port. Traditionally, on-chip network router arbitration policies have been simple heuristics such as round-robin and age-based (oldest-first) arbitration. These arbitration policies treat all packets equally, irrespective of a packet's underlying application's characteristics. In other words, these policies arbitrate between packets as if each packet had exactly the same impact on application-level performance. In reality, however, applications can have unique and dynamic characteristics and demands. Different packets can have a significantly different importance to their applications' performance. In the presence of memory-level parallelism (MLP),1,2 although the system might have multiple outstanding load misses, not every load miss causes a bottleneck (or is critical).3 Assume, for example, that an application issues two concurrent network requests, one after another, first to a remote node in the network and then to a close-by node. Clearly, the packet going to the close-by node is less critical; even if it's delayed for several cycles in the network, its latency will be hidden from the application by the packet going to the distant node, which would take more time. Thus, each packet's delay tolerance can have a different impact on its application's performance.

In our paper for the 37th Annual International Symposium on Computer Architecture (ISCA),4 we exploit packet diversity in criticality to design higher-performance and more application-fair NoCs.



We do so by differentiating packets according to their slack, a measure that captures the packet's importance to the application's performance. In particular, we define a packet's slack as the number of cycles the packet can be delayed in the network without affecting the application's execution time. So, a packet is relatively noncritical until its time in the network exceeds the packet's available slack. In comparison, increasing the latency of packets with no available slack (by deprioritizing them during arbitration in NoC routers) will stall the application.

Building off of this concept, we develop Aergia, an NoC architecture that contains new router prioritization mechanisms to accelerate the critical packets with low slack values by prioritizing them over packets with larger slack values. (We name our architecture after Aergia, the female spirit of laziness in Greek mythology, after observing that some packets in the NoC can afford to slack off.) We devise techniques to efficiently estimate a packet's slack dynamically. Before a packet enters the network, Aergia tags the packet according to its estimated slack. We propose extensions to routers to prioritize lower-slack packets at the contention points within the router (that is, buffer allocation and switch arbitration). Finally, to ensure forward progress and starvation freedom from prioritization, Aergia groups packets into batches and ensures that all earlier-batch packets are serviced before later-batch packets. Experimental evaluations on a cycle-accurate simulator show that our proposal effectively increases overall system throughput and application-level fairness in the NoC.

Motivation

The premise of our research is the existence of packet latency slack. We thus motivate Aergia by explaining the concept of slack, characterizing the diversity of slack in applications, and illustrating the advantages of exploiting slack with a conceptual example.

The concept of slack

Modern microprocessors employ several memory latency tolerance techniques (such as out-of-order execution5 and runahead execution2,6) to hide the penalty of load misses. These techniques exploit MLP1 by issuing several memory requests in parallel with the hope of overlapping future load misses with current load misses. In the presence of MLP in an on-chip network, the existence of multiple outstanding packets leads to packet latency overlap, which introduces slack cycles (the number of cycles a packet can be delayed without significantly affecting performance). We illustrate the concept of slack cycles with an example. Consider the processor execution timeline in Figure 1a. In the instruction window, the first load miss causes a packet (Packet0) to be sent into the network to service the load miss, and the second load miss generates the next packet (Packet1). In Figure 1a's execution timeline, Packet1 has lower network latency than Packet0 and returns to the source earlier. Nonetheless, the processor can't commit Load0 and stalls until Packet0 returns. Thus, Packet0 is the bottleneck packet, which allows the earlier-returning Packet1 some slack cycles. The system could delay Packet1 for the slack cycles without causing significant application-level performance loss.

Diversity in slack and its analysis

Sufficient diversity in interfering packets' slack cycles is necessary to exploit slack. In other words, it's only possible to benefit from prioritizing low-slack packets if the network has a good mix of both high- and low-slack packets at any time. Fortunately, we found that realistic systems and applications have sufficient slack diversity.

Figure 1b shows a cumulative distribution function of the diversity in slack cycles for 16 applications. The x-axis shows the number of slack cycles per packet. The y-axis shows the fraction of total packets that have at least as many slack cycles per packet as indicated by the x-axis. Two trends are visible. First, most applications have a good spread of packets with respect to the number of slack cycles, indicating sufficient slack diversity within an application. For example, 17.6 percent of Gems's packets have, at most, 100 slack cycles, but 50.4 percent of them have more than 350 slack cycles. Second, applications have different slack characteristics.


[Figure 1a shows the execution timeline: Load miss 0 sends Packet0 and Load miss 1 sends Packet1; Packet1 returns earlier, so the processor stall is determined by Packet0's latency and Packet1 has slack. Figure 1b plots, for 16 applications (Gems, omnet, tpcw, mcf, bzip2, sjbb, sap, sphinx, deal, barnes, astar, calculix, art, libquantum, sjeng, h264ref), the percentage of all packets (y-axis) against slack in cycles (x-axis, 0 to 500).]

Figure 1. Processor execution timeline demonstrating the concept of slack (a); the resulting packet distribution of 16 applications based on slack cycles (b).


For example, the packets of art and libquantum have much lower slack in general than those of tpcw and omnetpp. We conclude that each packet's number of slack cycles varies within application phases as well as between applications.

Advantages of exploiting slack

We show that a NoC architecture that's aware of slack can lead to better system performance and fairness. As we mentioned earlier, a packet's number of slack cycles indicates that packet's importance or criticality to the processor core. If the NoC knew about a packet's remaining slack, it could make arbitration decisions that would accelerate packets with a small slack, which would improve the system's overall performance.

Figure 2 shows a motivating example. Figure 2a depicts the instruction windows of two processor cores: Core A at node (1, 2) and Core B at node (3, 7). The network consists of an 8 × 8 mesh (see Figure 2b). Core A generates two packets (A-0 and A-1). The first packet (A-0) isn't preceded by any other packet and is sent to node (8, 8); therefore, its latency is 13 hops and its slack is 0 hops. In the next cycle, the second packet (A-1) is injected toward node (3, 1). This packet has a latency of 3 hops, and because it's preceded (and thus overlapped) by the 13-hop packet (A-0), it has a slack of 10 hops (13 hops minus 3 hops). Core B also generates two packets (B-0 and B-1). We can calculate Core B's packets' latency and slack similarly. B-0 has a latency of 10 hops and slack of 0 hops, while B-1 has a latency of 4 hops and slack of 6 hops.

We can exploit slack when interference exists between packets with different slack values. In Figure 2b, for example, packets A-0 and B-1 interfere. The critical question is which packet the router should prioritize and which it should queue (delay).

Figure 2c shows the execution timeline of the cores with the baseline slack-unaware prioritization policy that possibly prioritizes packet B-1, thereby delaying packet A-0. In contrast, if the routers knew the packets' slack, they would prioritize packet A-0 over B-1, because the former has a smaller slack (see Figure 2d). Doing so would reduce Core A's stall time without significantly increasing Core B's stall time, thereby improving overall system throughput (see Figure 2). The critical observation is that delaying a higher-slack packet can improve overall system performance by allowing a lower-slack packet to stall its core less.

Online estimation of slack

Our goal is to identify critical packets with low slack and accelerate them by deprioritizing packets with high slack cycles.

Defining slack

We can define slack locally or globally.7 Local slack is the number of cycles a packet can be delayed without delaying any subsequent instruction. Global slack is the number of cycles a packet can be delayed without delaying the last instruction in the program. Computing a packet's global slack requires critical path analysis of the entire program, rendering it impractical in our context. So, we focus on local slack.

A packet's ideal local slack could depend on instruction-level dependencies, which are hard to capture in the NoC. So, to keep the implementation in the NoC simple, we conservatively consider slack only with respect to outstanding network transactions. We define the term predecessor packets, or simply predecessors, for a given packet P as all packets that are still outstanding and that have been injected by the same core into the network earlier than P. We formally define a packet's available local network slack (in this article, simply called slack) as the difference between the maximum latency of its predecessor packets (that is, any outstanding packet that was injected into the network earlier than this packet) and its own latency:

\[
\mathrm{Slack}(\mathrm{Packet}_i) = \max_{k \in \{0,\ldots,\mathrm{NumberOfPredecessors}\}} \mathrm{Latency}(\mathrm{Packet}_k) \;-\; \mathrm{Latency}(\mathrm{Packet}_i) \qquad (1)
\]
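As a concrete illustration of Equation 1, the following sketch (not from the original paper; the `Packet` type, `local_slack` helper, and the example latencies are ours) computes a packet's local slack from the latencies of its outstanding predecessors:

```python
from dataclasses import dataclass

@dataclass
class Packet:
    inject_cycle: int   # cycle at which the packet entered the network
    latency: int        # (estimated) network latency in cycles

def local_slack(packet: Packet, outstanding: list[Packet]) -> int:
    """Slack per Equation 1: max predecessor latency minus own latency.

    Predecessors are outstanding packets injected earlier by the same core.
    We clamp at zero (our choice): a packet whose latency already exceeds
    all of its predecessors' latencies is on the critical path.
    """
    predecessors = [p for p in outstanding if p.inject_cycle < packet.inject_cycle]
    if not predecessors:
        return 0
    max_pred_latency = max(p.latency for p in predecessors)
    return max(0, max_pred_latency - packet.latency)

# Hypothetical numbers in the spirit of Figure 1: Packet0 (latency 20) precedes
# Packet1 (latency 8), so Packet1 could be delayed ~12 cycles without lengthening the stall.
p0 = Packet(inject_cycle=0, latency=20)
p1 = Packet(inject_cycle=1, latency=8)
print(local_slack(p1, [p0, p1]))  # -> 12
```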

The key challenge in estimating slack is to accurately predict the latencies of predecessor packets and the current packet being injected. Unfortunately, it's difficult to exactly predict network latency in a realistic system. So, instead of predicting the exact slack in terms of cycles, we aim to categorize


[Figure 2a shows the instruction windows of Core A and Core B, each with two load misses generating packets A-0/A-1 and B-0/B-1. Figure 2b shows the 8 × 8 mesh: A-0 (latency = 13 hops, slack = 0 hops), A-1 (latency = 3 hops, slack = 10 hops), B-0 (latency = 10 hops, slack = 0 hops), and B-1 (latency = 4 hops, slack = 6 hops), with A-0 and B-1 interfering. Figures 2c and 2d contrast the slack-unaware and slack-aware (Aergia) execution timelines, showing the cycles saved for Core A when A-0 is prioritized over B-1.]

Figure 2. Conceptual example showing the advantage of incorporating slack into prioritization decisions in the Network-on-Chip (NoC). The instruction window for two processor cores (a); interference between two packets with different slack values (b); the processor cores' slack-unaware execution timeline (c); prioritization of the packets according to slack (d).


and quantize slack into different priority levels according to indirect metrics that correlate with the latency of predecessors and the packet being injected. To accomplish this, we characterize slack as it relates to various indirect metrics. We found that the most important factors impacting packet criticality (hence, slack) are the number of predecessors that are level-two (L2) cache misses, whether the injected packet is an L2 cache hit or miss, and the number of a packet's extraneous hops in the network (compared to its predecessors).

Number of miss predecessors. We define a packet's miss predecessors as its predecessors that are L2 cache misses. We found that a packet is likely to have high slack if it has high-latency predecessors or a large number of predecessors; in both cases, there's a high likelihood that its latency is overlapped (that is, that its slack is high). This is intuitive because the number of miss predecessors tries to capture the first term of the slack equation (that is, Equation 1).

L2 cache hit/miss status. We observe that whether a packet is an L2 cache hit or miss correlates with the packet's criticality and, thus, the packet's slack. If the packet is an L2 miss, it likely has high latency owing to dynamic RAM (DRAM) access and extra NoC transactions to and from memory controllers. Therefore, the packet likely has a smaller slack because other packets are less likely to overlap its long latency.

Slack in terms of number of hops. Our third metric captures both terms of the slack equation, and is based on the distance traversed in the network. Specifically:

\[
\mathrm{Slack}_{\mathrm{hops}}(\mathrm{Packet}_i) = \max_{k \in \{0,\ldots,\mathrm{NumberOfPredecessors}\}} \mathrm{Hops}(\mathrm{Packet}_k) \;-\; \mathrm{Hops}(\mathrm{Packet}_i) \qquad (2)
\]

Slack priority levels

We combine these three metrics to form the slack priority level of a packet in Aergia. When a packet is injected, Aergia computes these three metrics and quantizes them to form a three-tier priority (see Figure 3). Aergia tags the packet's head flit with these priority bits. We use 2 bits for the first tier, which is assigned according to the number of miss predecessors. We use 1 bit for the second tier to indicate if the packet being injected is (predicted to be) an L2 hit or miss. And we use 2 bits for the third tier to indicate a packet's hop-based slack. We then use the combined slack-based priority level to prioritize between packets in routers.
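To make the tagging concrete, here is a minimal sketch of packing the three tiers into the 5-bit priority carried by the head flit. The field widths (2 + 1 + 2 bits) follow the article; the quantization thresholds, bit ordering, and function name are illustrative assumptions:

```python
def slack_priority(miss_predecessors: int, predicted_l2_miss: bool, slack_hops: int) -> int:
    """Pack the three-tier slack estimate into a 5-bit priority tag.

    Lower values mean lower slack, i.e., higher urgency in the router.
    Quantization thresholds and bit ordering are illustrative, not from the paper.
    """
    tier1 = min(miss_predecessors, 3)           # 2 bits: number of miss predecessors
    tier2 = 0 if predicted_l2_miss else 1       # 1 bit: predicted L2 miss -> more critical
    tier3 = min(max(slack_hops, 0) // 4, 3)     # 2 bits: quantized hop-based slack
    return (tier1 << 3) | (tier2 << 2) | tier3  # 5-bit tag carried in the head flit

# A packet with no miss predecessors, a predicted L2 miss, and zero hop slack
# gets the lowest (most urgent) priority value: 0.
assert slack_priority(0, True, 0) == 0
```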

Slack estimation

In Aergia, assigning a packet's slack priority level requires estimation of the three metrics when the packet is injected.

Estimating the number of miss predecessors. Every core maintains a list of the outstanding L1 load misses (the predecessor list). The miss status handling registers (MSHRs) limit the predecessor list's size.8 Each L1 miss is associated with a corresponding L2 miss status bit. At the time a packet is injected into the NoC, its actual L2 hit/miss status is unknown because the shared L2 cache is distributed across the nodes. Therefore, an L2 hit/miss predictor is consulted and sets the L2 miss status bit accordingly. We compute the miss predecessor slack priority level as the number of outstanding L1 misses in the predecessor list whose L2 miss status bits are set (to indicate a predicted or actual L2 miss). Our implementation records L1 misses issued in the last 32 cycles and sets the maximum number of miss predecessors to 8. If a prediction error occurs, the L2 miss status bit is updated to the correct value when the data response packet returns to the core.

[Figure 3 fields: number of miss predecessors (2 bits); is the packet an L2 cache hit or miss? (1 bit); slack in number of hops (2 bits).]

Figure 3. Priority structure for estimating a packet's slack. A lower priority level corresponds to a lower slack value.


In addition, if a packet that is predicted to be an L2 hit actually results in an L2 miss, the corresponding L2 cache bank notifies the requesting core that the packet actually resulted in an L2 miss so that the requesting core updates the corresponding L2 miss status bit accordingly. To reduce control-packet overhead, the L2 bank piggybacks this information on another packet traveling through the requesting core as an intermediate node.
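A minimal sketch of this bookkeeping follows (our own illustration; the 32-cycle window and the cap of 8 come from the text, but the data structure and names are assumptions):

```python
from collections import deque

class PredecessorTracker:
    """Tracks recent outstanding L1 misses and counts predicted L2-miss predecessors."""

    WINDOW = 32    # only L1 misses issued in the last 32 cycles are considered
    MAX_PRED = 8   # cap on the number of miss predecessors

    def __init__(self):
        self.predecessors = deque()  # entries: (issue_cycle, l2_miss_status_bit)

    def record_l1_miss(self, cycle: int, predicted_l2_miss: bool) -> int:
        """Called when a new L1 miss injects a packet; returns its miss-predecessor count."""
        # Drop predecessors that fall outside the 32-cycle window.
        while self.predecessors and cycle - self.predecessors[0][0] > self.WINDOW:
            self.predecessors.popleft()
        # Count predecessors whose (predicted or known) L2 miss status bit is set.
        count = min(sum(bit for _, bit in self.predecessors), self.MAX_PRED)
        self.predecessors.append((cycle, int(predicted_l2_miss)))
        return count
```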

Estimating whether a packet will be an L2 cache hit or miss. At the time a packet is injected, it's unknown whether it will hit in the remote L2 cache bank it accesses. We use an L2 hit/miss predictor in each core to guess each injected packet's cache hit/miss status, and we set the second-tier priority bit accordingly in the packet header. If the packet is predicted to be an L2 miss, its L2 miss priority bit is set to 0; otherwise, it is set to 1. This priority bit within the packet header is corrected when the actual L2 miss status is known after the packet accesses the L2.

We developed two types of L2 miss predictors. The first is based on the global branch predictor.9 A shift register records the hit/miss values for the last M L1 load misses. We then use this register to index a pattern history table (PHT) containing 2-bit saturating counters. The accessed counter indicates whether the prediction for that particular pattern history was a hit or a miss. The counter is updated when a packet's hit/miss status is known. The second predictor, called the threshold predictor, uses the insight that misses occur in bursts. This predictor updates a counter if the access is a known L2 miss. The counter resets after every M L1 misses. If the number of known L2 misses in the last M L1 misses exceeds a threshold T, the next L1 miss is predicted to be an L2 miss.

The global predictor requires more storage bits (for the PHT) and has marginally higher design complexity than the threshold predictor. For correct updates, both predictors require an extra bit in the response packet indicating whether the transaction was an L2 miss.
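A sketch of the simpler threshold predictor (our own illustration; the article leaves M and T unspecified, so the values below are placeholders):

```python
class ThresholdL2MissPredictor:
    """Predicts an L2 miss if the recent window of L1 misses contained many known L2 misses."""

    def __init__(self, window_m: int = 16, threshold_t: int = 4):
        self.window_m = window_m        # reset the counters after every M L1 misses
        self.threshold_t = threshold_t  # predict miss once known misses exceed T
        self.l1_misses_seen = 0
        self.known_l2_misses = 0
        self.predict_miss = False

    def predict(self) -> bool:
        """Prediction for the next L1 miss: True means 'predicted L2 miss'."""
        return self.predict_miss

    def update(self, was_l2_miss: bool) -> None:
        """Called when a response returns and the true L2 hit/miss status is known."""
        self.l1_misses_seen += 1
        if was_l2_miss:
            self.known_l2_misses += 1
        self.predict_miss = self.known_l2_misses > self.threshold_t
        if self.l1_misses_seen >= self.window_m:  # restart the window after M L1 misses
            self.l1_misses_seen = 0
            self.known_l2_misses = 0
```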

Estimating hop-based slack. We calculate a packet's hop count by adding the X and Y distances between its source and destination. We then calculate the packet's hop-based slack using Equation 2.
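For example, hop counts and hop-based slack on a 2D mesh can be computed as follows (a minimal sketch; the coordinate convention and function names are ours):

```python
def hops(src: tuple[int, int], dst: tuple[int, int]) -> int:
    """Manhattan (X + Y) distance between two mesh nodes."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

def hop_slack(packet_hops: int, predecessor_hops: list[int]) -> int:
    """Equation 2: max predecessor hop count minus this packet's hop count."""
    if not predecessor_hops:
        return 0
    return max(0, max(predecessor_hops) - packet_hops)

# Figure 2 example: A-1 goes from (1, 2) to (3, 1), i.e., 3 hops, while its
# predecessor A-0 travels 13 hops, giving A-1 a hop-based slack of 10.
assert hops((1, 2), (3, 1)) == 3
assert hop_slack(3, [13]) == 10
```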

Aergia NoC architecture

Our NoC architecture, Aergia, uses the slack priority levels that we described for arbitration. We now describe the architecture. Some design challenges accompany the prioritization mechanisms, including starvation avoidance (using batching) and mitigating priority inversion (using multiple network interface queues).

Baseline

Figure 4 shows a generic NoC router architecture. The router has P input channels connected to P ports and P output channels connected to the crossbar; typically, P = 5 for a 2D mesh (one from each direction and one from the network interface). The routing computation unit determines the next router and the virtual channel (V) within the next router for each packet.

[Figure 4 elements: P input channels and input ports, each holding virtual channels VC 1 to VC v with a VC identifier; routing computation, VC arbiter, and switch arbiter feeding a scheduler; a P x P crossbar driving output channels 1 to P; and credit-out signals.]

Figure 4. Baseline router microarchitecture. The figure shows the different parts of the router: input channels connected to the router ports; buffers within ports organized as virtual channels; the control logic, which performs routing and arbitration; and the crossbar connected to the output channels.


Dimension-ordered routing (DOR) is the most common routing policy because of its low complexity and deadlock freedom. Our baseline assumes XY DOR, which first routes packets in the X direction, followed by the Y direction. The virtual channel arbitration unit arbitrates among all packets requesting access to the same output virtual channel in the downstream router and decides on winners. The switch arbitration unit arbitrates among all input virtual channels requesting access to the crossbar and grants permission to the winning packets and flits. The winners can then traverse the crossbar and be placed on the output links. Current routers use simple, local arbitration policies (such as round-robin or oldest-first arbitration) to decide which packet to schedule next (that is, which packet wins arbitration). Our baseline NoC uses the round-robin policy.
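As an aside, XY dimension-ordered routing reduces to a simple per-hop decision; a sketch of that decision (our own illustration, not code from the paper; port names are arbitrary):

```python
def xy_dor_next_port(cur: tuple[int, int], dst: tuple[int, int]) -> str:
    """XY dimension-ordered routing: fully resolve the X dimension before turning to Y."""
    x, y = cur
    dx, dy = dst
    if dx > x:
        return "EAST"
    if dx < x:
        return "WEST"
    if dy > y:
        return "NORTH"
    if dy < y:
        return "SOUTH"
    return "LOCAL"  # arrived: eject to the network interface

assert xy_dor_next_port((1, 2), (3, 1)) == "EAST"  # X is resolved first, even though Y also differs
```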

Arbitration

In Aergia, the virtual channel arbitration and switch arbitration units prioritize packets with lower slack priority levels. Thus, low-slack packets get first preference for buffers as well as the crossbar and are thereby accelerated in the network. In wormhole switching, only the head flit arbitrates for and reserves the virtual channel; thus, only the head flit carries the slack priority value. Aergia uses this header information during the virtual channel arbitration stage for priority arbitration. In addition to the state that the baseline architecture maintains, each virtual channel has an additional priority field, which is updated with the head flit's slack priority level when the head flit reserves the virtual channel. The body flits use this field during the switch arbitration stage for priority arbitration.

Without adequate countermeasures, prioritization mechanisms can easily lead to starvation in the network. To prevent starvation, we combine our slack-based prioritization with a batching mechanism similar to the one from our previous work.10 We divide time into intervals of T cycles, which we call batching intervals. Packets inserted into the network during the same interval belong to the same batch (that is, they have the same batch priority value). We prioritize packets belonging to older batches over those from younger batches. Only when two packets belong to the same batch do we prioritize them according to their available slack (that is, the slack priority levels). Each head flit carries a 5-bit slack priority value and a 3-bit batch number. Using adder delays, we estimate the delay of an 8-bit priority arbiter (P = 5, V = 6) to be 138.9 picoseconds and that of a 16-bit priority arbiter to be 186.9 picoseconds at 45-nanometer technology.
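Putting batching and slack together, the arbitration winner can be chosen with a lexicographic key: oldest batch first, then lowest slack priority within a batch. A sketch under those assumptions (the wrap-around handling of the 3-bit batch number is simplified away; type and function names are ours):

```python
from dataclasses import dataclass

@dataclass
class HeadFlit:
    batch: int           # 3-bit batch number (older batches have smaller values in this sketch)
    slack_priority: int  # 5-bit slack priority tag (lower value = less slack = more urgent)

def arbitrate(candidates: list[HeadFlit]) -> HeadFlit:
    """Pick the winner: oldest batch first, then lowest slack priority within a batch."""
    return min(candidates, key=lambda f: (f.batch, f.slack_priority))

# A low-slack packet from the current batch still loses to any packet from an older batch,
# which is what guarantees forward progress and starvation freedom.
old = HeadFlit(batch=2, slack_priority=31)
new = HeadFlit(batch=3, slack_priority=0)
assert arbitrate([new, old]) is old
```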

Network interface

For effectiveness, prioritization is necessary not only within the routers but also at the network interfaces. We split the monolithic injection queue at a network interface into a small number of equal-length queues. Each queue buffers packets whose slack priority levels fall within a different programmable range. Packets are guided into a particular queue on the basis of their slack priority levels. Queues belonging to higher-priority packets take precedence over those belonging to lower-priority packets. Such prioritization reduces priority inversion and is especially needed at memory controllers, where packets from all applications are buffered together at the network interface and where monolithic queues can cause severe priority inversion.
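The queue selection itself is just a range lookup over the 5-bit slack priority; a small sketch (the queue count and ranges are illustrative assumptions, not from the article):

```python
def select_injection_queue(slack_priority: int,
                           ranges=((0, 7), (8, 15), (16, 23), (24, 31))) -> int:
    """Map a packet's slack priority into one of a few per-range injection queues.

    Lower-numbered queues hold lower-slack (more urgent) packets and are
    drained first by the network interface.
    """
    for idx, (lo, hi) in enumerate(ranges):
        if lo <= slack_priority <= hi:
            return idx
    return len(ranges) - 1  # clamp out-of-range values into the last queue

assert select_injection_queue(3) == 0    # urgent packet goes to the highest-priority queue
assert select_injection_queue(29) == 3   # high-slack packet waits in the lowest-priority queue
```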

Comparison to application-aware scheduling and fairness policies

In our previous work,10 we proposed Stall-Time Criticality (STC), an application-aware coordinated arbitration policy to accelerate network-sensitive applications. The idea is to rank applications at regular intervals based on their network intensity (L1-MPI, or L1 miss rate per instruction), and to prioritize all packets of a nonintensive application over all packets of an intensive application. Within applications belonging to the same rank, STC prioritizes packets using the baseline round-robin order in a slack-unaware manner. Packet batching is used for starvation avoidance. STC is a coarse-grained approach that identifies critical applications (from the NoC perspective) and prioritizes them over noncritical ones. By focusing on application-level intensity, STC can't capture fine-grained packet-level


system performance and network fairness over various workloads, whether used independently or together with application-aware prioritization techniques.

To our knowledge, our ISCA 2010 paper4 is the first work to make use of the criticality of memory accesses for packet scheduling within NoCs. We believe our proposed slack-aware NoC design can influence many-core processor implementations and future research in several ways.

Other work has proposed and explored the instruction slack concept7 (for more information, see the "Research related to Aergia" sidebar). Although prior research has exploited instruction slack to improve performance-power trade-offs for single-threaded uniprocessors, instruction slack's impact on shared resources in many-core processors provides new research opportunities. To our knowledge, our ISCA 2010 paper4 is the first work that attempts to exploit and understand how instruction-level slack impacts the interference of applications within a many-core processor's shared resources, such as caches, memory controllers, and on-chip networks. We define, measure, and devise easy-to-implement schemes to estimate the slack of memory accesses in the context of many-core processors. Researchers can use our slack metric for better management of other shared resources, such as request scheduling at memory controllers, managing shared caches, saving energy in the shared memory hierarchy, and providing QoS. Shared resource management will become even more important in future systems (including data centers, cloud computing, and mobile many-core systems) as they employ a larger number of increasingly diverse applications running together on a many-core processor. Thus, using slack for better resource management could become more beneficial as the processor industry evolves toward many-core processors. We hope that our work inspires such research and development.

Managing shared resources in a highly parallel system is a fundamental challenge. Although application interference is relatively well understood for many important shared resources (such as shared last-level caches12 or memory bandwidth13-16), less is known about application interference behavior in NoCs or how this interference impacts applications' execution times. One reason why analyzing and managing multiple applications in a shared NoC is challenging is that application interactions in a distributed system can be complex and chaotic, with numerous first- and second-order effects (such as queueing delays, different MLP, burstiness, and the effect of the spatial location of cache banks) and hard-to-predict interference patterns that can significantly impact application-level performance.

We attempt to understand applications' complex interactions in NoCs using application-level metrics and systematic experimental evaluations. Building on this understanding, we've devised low-cost, simple-to-implement, and scalable policies to control and reduce interference in NoCs. We've shown that employing coarse-grained and fine-grained application-aware prioritization can significantly improve performance. We hope our techniques inspire other novel approaches to reduce application-level interference in NoCs.

Finally, our work exposes the instruction behavior inside a processor core to the NoC. We hope this encourages future research on the integrated design of cores and NoCs (as well as other shared resources). This integrated design approach will likely foster future research on overall system performance instead of individual components' performance: by codesigning the NoC and the core to be aware of each other, we can achieve significantly better performance and fairness than designing each alone.

We therefore hope Aergia opens new research opportunities beyond packet scheduling in on-chip networks. MICRO

Research related to Aergia

To our knowledge, no previous work has characterized slack in packet latency and proposed mechanisms to exploit it for on-chip network arbitration. Here, we discuss the most closely related previous work.

Instruction criticality and memory-level parallelism (MLP)

Much research has been done to predict instruction criticality or slack1-3 and to prioritize critical instructions in the processor core and caches. MLP-aware cache replacement exploits cache-miss criticality differences caused by differences in MLP.4 Parallelism-aware memory scheduling5 and other rank-based memory scheduling algorithms6 reduce application stall time by improving each thread's bank-level parallelism. Such work relates to ours only in the sense that we also exploit criticality to improve system performance. However, our methods (for computing slack) and mechanisms (for exploiting it) are very different owing to on-chip networks' distributed nature.

Prioritization in on-chip and multiprocessor networks

We've extensively compared our approach to state-of-the-art local arbitration, quality-of-service-oriented prioritization (such as Globally-Synchronized Frames, or GSF7), and application-aware prioritization (such as Stall-Time Criticality, or STC8) policies in Networks-on-Chip (NoCs). Bolotin et al. prioritize control packets over data packets in the NoC but don't distinguish packets on the basis of available slack.9 Other researchers have also proposed frameworks for quality of service (QoS) in on-chip networks,10-12 which can be combined with our approach. Some other work proposed arbitration policies for multichip multiprocessor networks and long-haul networks.13-16 They aim to provide guaranteed service or fairness, while we aim to exploit slack in packet latency to improve system performance. In addition, most of the previous mechanisms statically assign priorities and bandwidth to different flows in off-chip networks to satisfy real-time performance and QoS guarantees. In contrast, our proposal dynamically computes and assigns priorities to packets based on slack.

Batching

We use packet batching in the NoC for starvation avoidance, similar to our previous work.8 Other researchers have explored using batching and frames in disk scheduling,17 memory scheduling,5 and QoS-aware packet scheduling7,12 to prevent starvation and provide fairness.

References

1. S.T. Srinivasan and A.R. Lebeck, "Load Latency Tolerance in Dynamically Scheduled Processors," Proc. 31st Ann. ACM/IEEE Int'l Symp. Microarchitecture, IEEE CS Press, 1998, pp. 148-159.

2. B. Fields, S. Rubin, and R. Bodík, "Focusing Processor Policies via Critical-Path Prediction," Proc. 28th Ann. Int'l Symp. Computer Architecture (ISCA 01), ACM Press, 2001, pp. 74-85.

3. B. Fields, R. Bodík, and M. Hill, "Slack: Maximizing Performance under Technological Constraints," Proc. 29th Ann. Int'l Symp. Computer Architecture, IEEE Press, 2002, pp. 47-58, doi:10.1109/ISCA.2002.1003561.



4. M. Qureshi et al., "A Case for MLP-Aware Cache Replacement," Proc. 33rd Ann. Int'l Symp. Computer Architecture (ISCA 06), IEEE CS Press, 2006, pp. 167-178.

5. O. Mutlu and T. Moscibroda, "Parallelism-Aware Batch Scheduling: Enhancing Both Performance and Fairness of Shared DRAM Systems," Proc. 35th Ann. Int'l Symp. Computer Architecture (ISCA 08), IEEE CS Press, 2008, pp. 63-74.

6. Y. Kim et al., "ATLAS: A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers," Proc. IEEE 16th Int'l Symp. High Performance Computer Architecture (HPCA 10), IEEE Press, 2010, doi:10.1109/HPCA.2010.5416658.

7. J.W. Lee, M.C. Ng, and K. Asanovic, "Globally-Synchronized Frames for Guaranteed Quality-of-Service in On-Chip Networks," Proc. 35th Ann. Int'l Symp. Computer Architecture (ISCA 08), IEEE CS Press, 2008, pp. 89-100.

8. R. Das et al., "Application-Aware Prioritization Mechanisms for On-Chip Networks," Proc. 42nd Ann. IEEE/ACM Int'l Symp. Microarchitecture, ACM Press, 2009, pp. 280-291.

9. E. Bolotin et al., "The Power of Priority: NoC Based Distributed Cache Coherency," Proc. 1st Int'l Symp. Networks-on-Chip (NOCS 07), IEEE Press, 2007, pp. 117-126, doi:10.1109/NOCS.2007.42.

10. E. Bolotin et al., "QNoC: QoS Architecture and Design Process for Network on Chip," J. Systems Architecture, vol. 50, nos. 2-3, 2004, pp. 105-128.

11. E. Rijpkema et al., "Trade-offs in the Design of a Router with Both Guaranteed and Best-Effort Services for Networks on Chip," Proc. Conf. Design, Automation and Test in Europe (DATE 03), vol. 1, IEEE CS Press, 2003, pp. 10350-10355.

12. B. Grot, S.W. Keckler, and O. Mutlu, "Preemptive Virtual Clock: A Flexible, Efficient, and Cost-Effective QOS Scheme for Networks-on-Chip," Proc. 42nd Ann. IEEE/ACM Int'l Symp. Microarchitecture, ACM Press, 2009, pp. 268-279.

13. K.H. Yum, E.J. Kim, and C. Das, "QoS Provisioning in Clusters: An Investigation of Router and NIC Design," Proc. 28th Ann. Int'l Symp. Computer Architecture (ISCA 01), ACM Press, 2001, pp. 120-129.

14. A.A. Chien and J.H. Kim, "Rotating Combined Queueing (RCQ): Bandwidth and Latency Guarantees in Low-Cost, High-Performance Networks," Proc. 23rd Ann. Int'l Symp. Computer Architecture (ISCA 96), ACM Press, 1996, pp. 226-236.

15. A. Demers, S. Keshav, and S. Shenker, "Analysis and Simulation of a Fair Queueing Algorithm," Symp. Proc. Comm. Architectures & Protocols (Sigcomm 89), ACM Press, 1989, pp. 1-12, doi:10.1145/75246.75248.

16. L. Zhang, "Virtual Clock: A New Traffic Control Algorithm for Packet Switching Networks," Proc. ACM Symp. Comm. Architectures & Protocols (Sigcomm 90), ACM Press, 1990, pp. 19-29, doi:10.1145/99508.99525.

17. H. Frank, "Analysis and Optimization of Disk Storage Devices for Time-Sharing Systems," J. ACM, vol. 16, no. 4, 1969, pp. 602-620.



Acknowledgments

This research was partially supported by National Science Foundation (NSF) grant CCF-0702519, NSF Career Award CCF-0953246, the Carnegie Mellon University CyLab, and GSRC.

References

1. A. Glew, "MLP Yes! ILP No! Memory Level Parallelism, or Why I No Longer Care about Instruction Level Parallelism," ASPLOS Wild and Crazy Ideas Session, 1998; http://www.cs.berkeley.edu/~kubitron/asplos98/abstracts/andrew_glew.pdf.

2. O. Mutlu, H. Kim, and Y.N. Patt, "Efficient Runahead Execution: Power-Efficient Memory Latency Tolerance," IEEE Micro, vol. 26, no. 1, 2006, pp. 10-20.

3. B. Fields, S. Rubin, and R. Bodík, "Focusing Processor Policies via Critical-Path Prediction," Proc. 28th Ann. Int'l Symp. Computer Architecture (ISCA 01), ACM Press, 2001, pp. 74-85.

4. R. Das et al., "Aergia: Exploiting Packet Latency Slack in On-Chip Networks," Proc. 37th Ann. Int'l Symp. Computer Architecture (ISCA 10), ACM Press, 2010, pp. 106-116.

5. R.M. Tomasulo, "An Efficient Algorithm for Exploiting Multiple Arithmetic Units," IBM J. Research and Development, vol. 11, no. 1, 1967, pp. 25-33.

6. O. Mutlu et al., "Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors," Proc. 9th Int'l Symp. High-Performance Computer Architecture (HPCA 03), IEEE Press, 2003, pp. 129-140.

7. B. Fields, R. Bodík, and M. Hill, "Slack: Maximizing Performance under Technological Constraints," Proc. 29th Ann. Int'l Symp. Computer Architecture, IEEE Press, 2002, pp. 47-58, doi:10.1109/ISCA.2002.1003561.

8. D. Kroft, "Lockup-Free Instruction Fetch/Prefetch Cache Organization," Proc. 8th Ann. Symp. Computer Architecture (ISCA 81), IEEE CS Press, 1981, pp. 81-87.

9. T.Y. Yeh and Y.N. Patt, "Two-Level Adaptive Training Branch Prediction," Proc. 24th Ann. Int'l Symp. Microarchitecture, ACM Press, 1991, pp. 51-61.

10. R. Das et al., "Application-Aware Prioritization Mechanisms for On-Chip Networks," Proc. 42nd Ann. IEEE/ACM Int'l Symp. Microarchitecture, ACM Press, 2009, pp. 280-291.

11. J.W. Lee, M.C. Ng, and K. Asanovic, "Globally-Synchronized Frames for Guaranteed Quality-of-Service in On-Chip Networks," Proc. 35th Ann. Int'l Symp. Computer Architecture (ISCA 08), IEEE CS Press, 2008, pp. 89-100.

12. L.R. Hsu et al., "Communist, Utilitarian, and Capitalist Cache Policies on CMPs: Caches as a Shared Resource," Proc. 15th Int'l Conf. Parallel Architectures and Compilation Techniques (PACT 06), ACM Press, 2006, pp. 13-22.

13. O. Mutlu and T. Moscibroda, "Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors," Proc. 40th Ann. IEEE/ACM Int'l Symp. Microarchitecture, IEEE CS Press, 2007, pp. 146-160.

14. O. Mutlu and T. Moscibroda, "Parallelism-Aware Batch Scheduling: Enhancing Both Performance and Fairness of Shared DRAM Systems," Proc. 35th Ann. Int'l Symp. Computer Architecture (ISCA 08), IEEE CS Press, 2008, pp. 63-74.

15. Y. Kim et al., "ATLAS: A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers," Proc. IEEE 16th Int'l Symp. High Performance Computer Architecture (HPCA 10), IEEE Press, 2010, doi:10.1109/HPCA.2010.5416658.

16. Y. Kim et al., "Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior," Proc. 43rd Ann. IEEE/ACM Int'l Symp. Microarchitecture, ACM Press, 2010.

Reetuparna Das is a research scientist at Intel Labs. Her research interests include computer architecture, especially interconnection networks. She has a PhD in computer science and engineering from Pennsylvania State University.

Onur Mutlu is an assistant professor in the Electrical and Computer Engineering


Department at Carnegie Mellon University. His research interests include computer architecture and systems. He has a PhD in electrical and computer engineering from the University of Texas at Austin. He's a member of IEEE and the ACM.

Thomas Moscibroda is a researcher in the distributed systems research group at Microsoft Research. His research interests include distributed computing, networking, and algorithmic aspects of computer architecture. He has a PhD in computer science from ETH Zurich. He's a member of IEEE and the ACM.

Chita R. Das is a distinguished professor in the Department of Computer Science and Engineering at Pennsylvania State University. His research interests include parallel and distributed computing, performance evaluation, and fault-tolerant computing. He has a PhD in computer science from the University of Louisiana, Lafayette. He's a fellow of IEEE.

Direct questions and comments to Reetuparna Das, Intel Labs, Intel Corp., SC-12, 3600 Juliette Ln., Santa Clara, CA; [email protected].
