
Coherence Domain Restriction on Large Scale Systems

Yaosheng Fu
Princeton University
Princeton, NJ
[email protected]

Tri M. Nguyen
Princeton University
Princeton, NJ
[email protected]

David Wentzlaff
Princeton University
Princeton, NJ
[email protected]

ABSTRACT

Designing massive scale cache coherence systems has been an elusive goal. Whether it be on large-scale GPUs, future thousand-core chips, or across million-core warehouse scale computers, having shared memory, even to a limited extent, improves programmability. This work sidesteps the traditional challenges of creating massively scalable cache coherence by restricting coherence to flexible subsets (domains) of a system's total cores and home nodes. This paper proposes Coherence Domain Restriction (CDR), a novel coherence framework that enables the creation of thousand to million core systems that use shared memory while maintaining low storage and energy overhead. Inspired by the observation that the majority of cache lines are only shared by a subset of cores either due to limited application parallelism or limited page sharing, CDR restricts the coherence domain from global cache coherence to VM-level, application-level, or page-level. We explore two types of restriction, one which limits the total number of sharers that can access a coherence domain and one which limits the number and location of home nodes that partake in a coherence domain. Each independent coherence domain only tracks the cores in its domain instead of the whole system, thereby removing the need for a coherence scheme built on top of CDR to scale. Sharer Restriction achieves constant storage overhead as core count increases while Home Restriction provides localized communication enabling higher performance. Unlike previous systems, CDR is flexible and does not restrict the location of the home nodes or sharers within a domain. We evaluate CDR in the context of a 1024-core chip and in the novel application of shared memory to a 1,000,000-core warehouse scale computer. Sharer Restriction results in significant area savings, while Home Restriction in the 1024-core chip and 1,000,000-core system increases performance by 29% and 23.04x respectively when comparing with global home placement. We implemented the entire CDR framework in a 25-core processor taped out in IBM's 32nm SOI process and present a detailed area characterization.

MICRO-48, December 05-09, 2015, Waikiki, HI, USA. Copyright is held by the owner/author(s). Publication rights licensed to ACM. ACM 978-1-4503-4034-2/15/12. DOI: http://dx.doi.org/10.1145/2830772.2830832

Categories and Subject Descriptors

B.3.2 [Memory Structures]: Shared Memory

Keywords

Shared Memory, Cache Coherence, Home Placement

1. INTRODUCTION

The number of cores integrated within a single manycore chip, GPU, and even data center continues to increase. Programming systems with shared memory has been popular due to the ease of programming, similarity to single-threaded programming, easy mapping of certain classes of algorithm, and the widespread availability of applications based on POSIX Threads [1] and OpenMP [2]. Unfortunately, creating massively scalable cache coherence protocols has proven problematic. In this work, we address the challenge of massively (thousands to millions of caches) scalable cache coherence not by creating a new cache coherence scheme, but rather by creating a hardware framework that can be used by other coherence schemes to sidestep the challenges of scaling directory storage and communication cost. We study this framework in two primary use cases: future thousand-core chips and current day warehouse scale computers with shared memory. In addition, we see our solution being applicable to future GPU systems which are adopting shared memory.

As we advance to the future manycore era with hundreds or thousands of cores on a single chip [3, 4, 5, 6, 7, 8], the directory which tracks sharers in directory-based cache coherence systems presents a storage scalability challenge. The quadratic overhead of full-map directories [9] makes them unsuitable for manycore designs. For example, assuming a 64-byte cache line, the overhead for a full-map directory in a 64-core system is 12.5%, but this grows to 200% for a 1024-core system, with the bookkeeping requiring twice the storage of the data. Much effort has gone into alleviating the directory storage overhead [10, 11, 12, 13, 14, 15]. Most of these efforts focused on trading sharing pattern storage for additional time needed to transition the coherence state of a cache line. Unfortunately, a cache coherence protocol which can scale to a very large number of caches while having efficient cache state transition time and sharing meta-data storage has yet to be discovered.
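To make the arithmetic behind these percentages explicit (counting only the per-line sharer bits of a full-map entry against a 64-byte line, and ignoring the few additional state bits a real directory entry carries):

    % Full-map directory overhead relative to a 64-byte (512-bit) cache line.
    \[
    \text{overhead}(n) = \frac{n \ \text{sharer bits}}{64 \times 8 \ \text{data bits}},
    \qquad
    \text{overhead}(64) = \frac{64}{512} = 12.5\%,
    \qquad
    \text{overhead}(1024) = \frac{1024}{512} = 200\%.
    \]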

The emergence of large data centers along with warehouse scale computers [16] has driven the development of applications such as whole web search, large-scale data mining, and hosted web applications. Contemporary data centers scale to hundreds of thousands of servers which collectively have millions of cores. The typical data center only has rigid pockets of cache coherence, usually within a single server of a few cores, requiring programmers to use a dual-programming environment where shared memory is used within a server while messaging is used between servers. In contrast, our work is the foundation for economically expanding shared memory across current day data centers. We introduce warehouse scale shared memory, where limited shared memory can flexibly span a data center containing millions of cores. Such a design can be used to break down the boundaries that exist in current day data centers, where virtual machines do not span servers, applications do not share memory across the data center, and programmers need to use a hybrid programming environment combining shared memory with messaging. In this paper we focus on addressing the scalability issue of warehouse scale shared memory. There are other important challenges for warehouse scale shared memory besides scalability, such as fault tolerance, resource management, and software infrastructure. Since implementing shared memory at warehouse scale is a radical change and requires significant modification and re-design in both hardware and software, we do not expect to address all those issues in this paper and leave them to future work.

In this paper, we propose Coherence Domain Restriction (CDR), a novel coherence framework that is capable of enabling systems to scale to thousands or millions of cores, while keeping constant storage overhead and high performance. CDR is inspired by the observation that today's shared memory applications are limited in parallelism by Amdahl's law and serializing critical sections [17, 18, 19, 20], therefore causing them to use only a limited number of cores (but still larger than the core count of most current processors). While the total scalability of a shared memory application may be limited, this does not mean that the total number of cores in a system, total capacity of sharable memory, placement of sharers or home nodes, or the ability to quickly and flexibly rearrange the cores which are sharing memory should be restricted. We investigate two types of Coherence Domain Restriction: Sharer Restriction and Home Restriction. Sharer Restriction limits the number of sharers that can be tracked for a given coherence domain while Home Restriction limits the number and location of home nodes in a particular coherence domain. These two ideas are decoupled and can be used independently or together, as we will show.

Figure 1: V-SR: Three applications run on (a) a manycore chip or (b) a warehouse scale shared memory computer of 4-core machines, each requiring 6, 4, and 16 cores respectively.

Sharer Restriction significantly reduces the storage overhead by selectively tracking cache line sharers on a

per-VM, per-application, or per-page basis rather than across all cores. Sharer Restriction is not a coherence scheme itself, but rather provides a mapping layer to decouple physical cores from coherence schemes. It is compatible with and beneficial to most existing coherence schemes. In Sharer Restriction, coherence protocols only operate on a small set of logical sharers which then get mapped into a much larger space of physical cores. Therefore, the selected coherence protocol itself does not need to scale to the entire system.

In contrast, Home Restriction places directory information (home nodes) near the cores that use it, thereby reducing the network communication distance and energy. In addition, Home Restriction can alleviate hotspots at directory controllers. Home Restriction can act at either the VM/application level or the page level. Page-level Home Restriction leads to better communication locality than VM/application-level because of the restriction at a finer granularity.

CDR builds upon ideas from previous work which attempted to restrict coherence domains either statically, such as in Scale-Out processors [21] and the Intel SCC [7], or dynamically, like in Virtual Hierarchies [22]. CDR extends the static coherence domain work by breaking down the restrictions that rigid coherence boundaries have. While restricting homes has been investigated in Virtual Hierarchies, we extend this idea further by enabling dynamic resizing of home domains at runtime. We also evaluate how to utilize dynamic coherence domains in 1000+ core chips and warehouse scale shared memory systems, which were not considered in previous work. Some major differences in our work are that CDR addresses the scalability of directory storage, the creation of page-level domains, and runtime modification of domain size, all of which Virtual Hierarchies did not address.

The following are our key contributions:

• The Sharer Restriction framework and how it achieves constant storage overhead with core scaling.

• The Home Restriction framework and how it increases performance and decreases communication energy by localizing communication between cores.

• Investigation of coarse-grain VM/application-level and fine-grain page-level as granularities for both Sharer Restriction and Home Restriction.

• Discussion of the software-hardware interface and system software (OS/Hypervisor) techniques to efficiently support CDR.

• We evaluate CDR in the context of a future, tiled, 1024-core chip and a 100,000-machine (10 cores per machine) warehouse scale system. We find that Sharer Restriction achieves significant area savings with 0.3% performance loss on average and that Home Restriction speeds up performance by 29% in the former system and 23.04x in the latter system versus global home placement.

• We implement the entire CDR framework in a 25-core processor taped out in IBM's 32nm SOI process and provide a detailed area analysis.

2. COHERENCE DOMAIN RESTRICTION

Coherence Domain Restriction (CDR) is a novel cache coherence framework that enables shared memory in systems that scale to thousands or even millions of cores. The key insight of CDR is that instead of keeping a single global cache coherence domain, we provide flexible fine-grain coherence domains for the subset of cores that are used by a VM/application or physical page.

CDR leverages two independent techniques: Sharer Restriction (SR) and Home Restriction (HR). The basic idea of Sharer Restriction is to restrict the sharer domain of coherence protocols to a subset of cores accessed by a VM, application, or page. By doing this, we can greatly reduce the hardware overhead introduced by tracking all cores across the system. Home Restriction applies a similar idea by restricting the home domain of a VM, application, or page to be within a subset of home nodes to provide localized communication between request caches and home nodes.

2.1 Sharer Restriction

Figure 1 introduces the basic idea of VM/application-level Sharer Restriction (V-SR) applied to a manycore chip (1a) and a warehouse scale computer with shared memory (1b). Each application runs only on a small subset of cores; thus it is not necessary to track sharers beyond those cores. V-SR provides isolation between physical cores and the coherence scheme by mapping logical sharers used in the coherence scheme into physical cores in the system. Therefore, the coherence scheme only needs to keep track of a small number of logical sharers that get mapped into the subset of physical cores used by the application. The mapping mechanism can be re-configured at runtime to add new cores into or remove old cores from the sharer domain. The cores in a sharer domain do not need to be contiguous, but rather can be any set of cores in the system. Moreover, sharer domains can overlap on the same physical core without affecting each other.

We further explore Sharer Restriction at a finer granularity at the page level, which we call page-level Sharer Restriction (P-SR). In P-SR, each physical page shared by a subset of cores can form its own sharer domain. In effect, this brings the benefit of even smaller sharer domains, especially for private pages which only have one sharer. Page-level granularity provides a convenient location to manage the mapping information due to the widespread use of page tables and TLBs.

                            Full-map        Coarse Vector (2:1)   Limited Pointers (4 ptrs)
    Asymptotic     No SR    O(n)            O(n)                  O(log(n))
                   SR       O(s) = O(1)     O(s) = O(1)           O(log(s)) = O(1)
    1024-core      No SR    1024 bits       512 bits              40 bits
                   SR       64 bits         32 bits               24 bits
    100,000-node   No SR    100,000 bits    50,000 bits           68 bits
                   SR       8 bits          4 bits                12 bits

Table 1: Asymptotic storage cost (top) for coherence schemes with and without Sharer Restriction. Storage comparison of sharer list size for a 1024-core chip (middle) with a max sharer domain of 64 cores and a 100,000-machine warehouse scale shared memory cluster (bottom) with a max sharer domain of 8 machines.

In order to take advantage of fine-grain sharer domains, we purposefully limit the maximum sharer domain size (s), which dictates the storage overhead of cache coherence in the system. This number is independent of the total number of cores (n); rather, it is set to track the total expected number of cores that will need to share data, i.e., the maximum application or page scalability. Depending on the coherence scheme used, the number of bits needed to store the sharer list changes. For instance, in a full-map directory [9], s bits are needed per cache line in the directory, in contrast to n bits being needed if the full-map directory can address the entire system. Because the maximum sharer domain size (s) is set independently of the total number of cores (n) and can stay constant as system size grows, we treat it as a constant in asymptotic analysis. Other coherence schemes also see storage savings when combined with Sharer Restriction. For instance, the storage analysis for coarse vector and limited pointer schemes with and without Sharer Restriction is shown in Table 1.
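As an illustration only (not the authors' tooling), the sharer-list widths in Table 1 can be reproduced with a few lines of arithmetic; the group size of 2 for the coarse vector and the 4-pointer limit come straight from the table's column headings:

    import math

    def sharer_list_bits(n_nodes, scheme):
        """Bits needed per directory entry to name sharers among n_nodes."""
        if scheme == "full_map":          # one presence bit per node
            return n_nodes
        if scheme == "coarse_vector_2":   # one bit per group of 2 nodes
            return math.ceil(n_nodes / 2)
        if scheme == "limited_4ptr":      # four pointers, each log2(n_nodes) bits
            return 4 * math.ceil(math.log2(n_nodes))
        raise ValueError(scheme)

    # Without Sharer Restriction the list must cover every node in the system;
    # with SR it only covers the maximum sharer domain (64 cores / 8 machines).
    for n, s in [(1024, 64), (100_000, 8)]:
        for scheme in ("full_map", "coarse_vector_2", "limited_4ptr"):
            print(n, scheme, sharer_list_bits(n, scheme), "->", sharer_list_bits(s, scheme))

Running this prints exactly the middle and bottom rows of Table 1 (e.g., 1024 -> 64 bits for full-map, 40 -> 24 bits for limited pointers).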

The maximum sharer domain size should be judiciously chosen such that the number of sharers stays below it. As evidenced by numerous PARSEC studies [23, 18], typical applications today have limited scalability, on the order of tens of cores. Many shared memory programs only exhibit simple communication patterns such as producer-consumer or nearest-neighbor communication [24], which results in a very small number of sharers for each page. In the results section, we choose 64 cores as a conservative maximum sharer domain size for a manycore chip and 8 machines for a warehouse scale computer (each machine contains 10 cores with on-chip coherence, but SR is applied only to the inter-machine coherence scheme). In the rare case that a domain does expand beyond the maximum limit, the directory scheme is switched to inaccurate tracking or broadcast on a per-line basis. The details are explained later in Section 3.1.3.

2.2 Home Restriction

Figure 2: NoC communication showing a cache miss which requires a remote invalidation. On the left, the home node is distant (round trip distance of 50 hops) even though the core that needs to be invalidated is local. On the right, Home Restriction is used to localize home nodes and invalidations (round trip distance of 6 hops).

On shared memory systems with directory protocols, each directory access results in a round-trip message exchange between the request node and the home node (or directory) if they are not the same. In addition, if cache line invalidations are needed, additional round trip communication is required. In order to map an address to a home node (or directory) in a directory protocol, it is typical to interleave the home location of cache lines across the entire system, evenly distributing the load. The downside to such a solution is that when cache misses occur, they may need to communicate with a home node (or directory) on the other side of the system, incurring large communication cost.

In order to reduce the average communication distance, Home Restriction (HR) restricts the home domain of a VM/application or physical page to be within a subset of home nodes (or directories) for shorter communication distance. HR uses a mapping mechanism that maps logical home locations to physical home locations. The simplest mapping scheme maps the home locations across the same nodes that the VM/application or page is running on, thereby localizing request-cache-to-home-node communication versus being spread across the system. This implies that the home domain and sharer domain will be the same if both SR and HR are applied. Similarly, the home domain is adjusted at runtime to match the subset of cores used by a VM/application or page. Nevertheless, home nodes and request nodes do not have to be the same, and home nodes can be mapped to arbitrary locations. More advanced mapping strategies can be adopted to locate home nodes at better locations to minimize communication cost. For example, if the request nodes are distributed, the best home node location may be at the center of mass of the communicating cores, which may not be a request node. Similar to SR, HR limits the maximum home domain size (h) to be constant.
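As a sketch of the center-of-mass idea mentioned above (our own illustration; the evaluation in this paper uses the simpler co-located mapping), one could pick as the physical home the candidate tile closest to the Manhattan centroid of the requesting cores:

    def center_of_mass_home(requesters, candidate_homes):
        """requesters / candidate_homes: lists of (x, y) tile coordinates on a 2D mesh."""
        cx = sum(x for x, _ in requesters) / len(requesters)
        cy = sum(y for _, y in requesters) / len(requesters)
        # Choose the candidate minimizing Manhattan distance to the centroid.
        return min(candidate_homes, key=lambda p: abs(p[0] - cx) + abs(p[1] - cy))

    # Requesters spread over a corner of the mesh; the best home need not be a requester.
    requesters = [(0, 0), (0, 4), (4, 0), (4, 4)]
    print(center_of_mass_home(requesters, candidate_homes=[(2, 2), (0, 0), (7, 7)]))  # (2, 2)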

In general, the asymptotic worst-case communication cost without HR on a manycore chip with a 2D mesh on-chip network is O(√n), while with HR, assuming the cores in an application or utilizing a page are co-located, it is O(√h) = O(1) with a maximum home domain size (h). An example of this can be seen in Figure 2, in which a network request results in a round trip distance of 50 hops without HR, while only 6 hops are required with HR since both request nodes and home nodes are located in the same subset of cores. In warehouse scale computers, HR can achieve even better performance benefits for two reasons: 1) inter-machine communication latency is much longer than on-chip communication and 2) warehouse scale computers usually contain a much larger number of homes.
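A minimal way to see these asymptotics, assuming a square mesh with one hop per link and Manhattan (X-Y) routing as in the NoC modeled later:

    % Worst-case one-way Manhattan distance in a square mesh of n tiles
    % versus a co-located home domain of at most h tiles.
    \[
    d_{\max}(n) = 2\left(\sqrt{n}-1\right) = O(\sqrt{n}),
    \qquad
    d_{\max}^{\mathrm{HR}}(h) = 2\left(\sqrt{h}-1\right) = O(\sqrt{h}) = O(1),
    \]
    % since the home-domain bound h is a constant independent of the system size n.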

Home Restriction can be applied either at the page level (P-HR) or the VM/application level (V-HR). Page-level Home Restriction (P-HR) achieves better communication locality than VM/application-level Home Restriction (V-HR) at the cost of a larger number of concurrent home domains. P-HR is ideal for private pages, as no network traffic is generated because the request node and home node (or directory) are co-located.

3. IMPLEMENTATION

The key design of the CDR framework is the mapping between logical cores and physical cores. By adding a flexible mapping layer that abstracts logical cores from physical cores, we expose a much smaller logical domain space to coherence protocols and homing policies. One primary challenge with CDR is how to translate logical cores efficiently to physical cores and vice versa.

3.1 Implementation of Sharer Restriction

In our implementation of Sharer Restriction, each sharer domain is assigned a unique sharer domain ID (SDID) and each physical core in the domain is assigned a logical sharer ID (LSID). Within each sharer domain, logical sharer IDs are exposed to the coherence protocol instead of actual physical core IDs, and they are then translated back to physical core IDs for network routing when coherence messages are sent to those sharers. Since the number of logical sharers in a domain is limited by the maximum domain size, each logical sharer can be represented with fewer bits in the directory compared with directly storing the physical core ID. The coherence protocol is effectively dealing with a smaller system with a limited number of sharers. A physical core can participate in multiple sharer domains with multiple pairs of sharer domain ID and logical sharer ID, so different sharer domains can overlap on the same physical cores. A physical core is uniquely determined given a pair of sharer domain ID and logical sharer ID. We utilize an OS-maintained sharer map table (SMT) to translate logical sharer IDs to physical core IDs. The table is cached for fast look-up using sharer map caches (SMC), similar to a TLB caching a page table.
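A software model of this translation path might look as follows; the class and method names are ours, not the paper's, and a real SMC is a small hardware structure rather than a Python dict:

    class SharerMapTable:
        """OS-maintained table: (sharer domain ID, logical sharer ID) -> physical core ID."""
        def __init__(self):
            self.entries = {}          # {(sdid, lsid): physical_core_id}

        def bind(self, sdid, lsid, physical_core):
            self.entries[(sdid, lsid)] = physical_core

        def lookup(self, sdid, lsid):
            return self.entries[(sdid, lsid)]

    class SharerMapCache:
        """Hardware cache of SMT entries, analogous to a TLB caching the page table."""
        def __init__(self, smt, capacity=256):
            self.smt, self.capacity, self.lines = smt, capacity, {}

        def translate(self, sdid, lsid):
            key = (sdid, lsid)
            if key not in self.lines:                         # miss: refill from memory
                if len(self.lines) >= self.capacity:
                    self.lines.pop(next(iter(self.lines)))    # crude eviction for the sketch
                self.lines[key] = self.smt.lookup(sdid, lsid)
            return self.lines[key]

    # Example: domain 7 maps logical sharers 0..3 onto scattered physical cores.
    smt = SharerMapTable()
    for lsid, core in enumerate([5, 129, 130, 1001]):
        smt.bind(sdid=7, lsid=lsid, physical_core=core)
    smc = SharerMapCache(smt)
    assert smc.translate(7, 2) == 130   # directory names sharer 2; the network needs core 130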

3.1.1 Hardware Modification for Sharer Restriction

Figure 3: Architectural implementation of the request node for (a) VM/Application-level Sharer Restriction (V-SR) and (b) Page-level Sharer Restriction (P-SR).

Figure 4: Architectural implementation of the home node with Sharer Restriction.

Figure 3a shows the implementation details of V-SR at the request node. We add an extra register to store the current pair of sharer domain ID and logical sharer ID, which is updated upon context switches. When the core needs to send any coherence message to the home, the ID pair is appended. Figure 3b shows the implementation for P-SR. Instead of having a special register that is swapped out on every context switch, the ID pairs are stored in TLB entries associated with physical pages. Each time a TLB access occurs, the ID pair is read out along with the physical page number. The hardware needed for P-SR is a superset of V-SR: we can emulate V-SR by setting all physical pages of a VM/application to have the same sharer domain ID.
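A sketch of the request-node state this implies; field names are illustrative, the 4KB page size is our assumption, and the V-SR case is emulated exactly as described, by stamping every page of a VM/application with the same sharer domain ID:

    from dataclasses import dataclass

    @dataclass
    class TLBEntry:
        vpn: int                 # virtual page number
        ppn: int                 # physical page number
        sdid: int                # sharer domain ID (P-SR: per page; V-SR: same for all pages)
        lsid: int                # this core's logical sharer ID within the domain

    def coherence_request(tlb_entry: TLBEntry, line_offset_in_page: int):
        """Build the ID pair appended to a coherence message sent to the home node."""
        physical_address = (tlb_entry.ppn << 12) | line_offset_in_page   # assumes 4KB pages
        return physical_address, (tlb_entry.sdid, tlb_entry.lsid)

    # V-SR emulation on P-SR hardware: all pages of one application share sdid=3,
    # and this core is logical sharer 1 inside that domain.
    pages = [TLBEntry(vpn=v, ppn=0x400 + v, sdid=3, lsid=1) for v in range(4)]
    print(coherence_request(pages[2], 0x40))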

Figure 4 shows the modification at the home node (or directory) in order to support SR. As mentioned above, coherence messages destined for the home node (or directory) contain the sharer domain ID and logical sharer ID, which are exposed to the coherence protocol instead of the physical core ID. For example, in a 1024-core chip with a full-map directory scheme and a maximum sharer domain of 64 cores, each directory entry only needs to maintain a sharer vector of 64 bits representing logical sharer IDs from 0 to 63. If invalidation or downgrade of sharers is needed, the logical sharer IDs for those sharers along with the sharer domain ID are fed into the sharer map cache to look up the corresponding physical core IDs. Otherwise, no access to the sharer map cache is needed. Accesses to the sharer map cache can be pipelined to reduce the performance impact when sending out multiple invalidation messages for a widely shared cache line.
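At the home node, the directory entry therefore only has to be as wide as the maximum sharer domain, and translation back to physical cores happens only on the invalidation/downgrade path. A self-contained sketch (a toy dict stands in for the sharer map cache lookup):

    MAX_DOMAIN = 64   # maximum sharer domain size s

    class DirectoryEntry:
        """Sharer vector over logical sharer IDs 0..63 for one cache line."""
        def __init__(self, sdid):
            self.sdid = sdid
            self.sharer_vector = 0            # 64-bit vector, one bit per logical sharer

        def add_sharer(self, lsid):
            assert lsid < MAX_DOMAIN
            self.sharer_vector |= (1 << lsid)

        def invalidation_targets(self, smc_translate):
            """Translate set bits back to physical core IDs via the sharer map cache."""
            return [smc_translate(self.sdid, lsid)
                    for lsid in range(MAX_DOMAIN)
                    if self.sharer_vector & (1 << lsid)]

    smc = {(7, 0): 5, (7, 1): 129, (7, 2): 130}   # toy translation table
    entry = DirectoryEntry(sdid=7)
    entry.add_sharer(1)
    entry.add_sharer(2)
    print(entry.invalidation_targets(lambda sdid, lsid: smc[(sdid, lsid)]))  # [129, 130]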

A special case for Sharer Restriction is the self-eviction of dirty lines from a write-back cache, which seems to be problematic because the cache line being evicted does not contain information about its sharer domain ID and logical sharer ID. However, this is easily solved because at the home node (or directory), processing an evicted dirty line does not need the sharer domain ID or logical sharer ID: the line is exclusively owned and the owner is known.

3.1.2 VM/Application-level and Page-level Sharer Restriction Interaction

Figure 5: An inter-domain communication example.

Considering that V-SR already results in constant directory storage, it is usually not worthwhile to further restrict sharer domains to the page level at the cost of more complex domain management and larger sharer map cache size. However, P-SR does provide more flexibility than V-SR. For instance, V-SR does not allow shared data between two VMs/applications because there is no global cache coherence, but there is no such constraint for P-SR. P-SR operates on physical pages rather than virtual pages; therefore a page shared by two VMs/applications naturally forms a unique domain. As already mentioned, a P-SR implementation can fully emulate V-SR, so V-SR and P-SR can co-exist in one system with the hardware support for P-SR.

An example of combining V-SR and P-SR is shown in Figure 5. In the beginning, only V-SR is used (emulated by P-SR hardware) and all pages in the same VM/application are assigned the same sharer domain ID. When inter-domain communication is needed, a shared physical page is allocated and mapped to two virtual pages in two VMs/applications, similar to inter-process communication. The shared physical page is switched from V-SR to P-SR by assigning a new domain ID to it, while all other non-shared physical pages keep the previous domain ID.

3.1.3 Runtime Modification of Sharer Domains

Over the lifetime of a program, the number of sharers in a domain might change. For example, a VM/application might have only one core during a startup or I/O phase, but when the program transitions to a computationally intensive phase, the number of cores may grow to the maximum available. In order to track all sharers, the sharer domain needs to be adjusted at runtime. When a VM/application finishes, all relevant private cache lines as well as sharer map cache lines are flushed, and the sharer domain IDs are recycled. By flushing all relevant cache lines, we avoid losing sharer information when sharer domains are recycled.

Figure 6: Architectural implementation of the request node for (a) VM/Application-level Home Restriction (V-HR) and (b) Page-level Home Restriction (P-HR).

Adding a new sharer to a sharer domain is very simple: just assign a new pair of sharer domain ID and logical sharer ID to the new sharer and update the sharer map table. Removing an existing sharer requires additional invalidation of all relevant cache lines on that sharer core, but it is unnecessary to remove sharers unless the maximum domain size is reached. We expect most sharer domains to stay within the limit of the maximum domain size, but if a domain does need to scale beyond the limit, a simple solution is to stop accurately tracking additional sharers and switch to broadcast or alternative inaccurate directory schemes, such as coarse vectors, on a per-line basis during an invalidation. Broadcast or inaccurate tracking is only required for cache lines that gain new sharers beyond the tracking scope, because there is no extra bit in the directory entry to keep track of them; other lines still have accurate sharer information and therefore do not need to switch to broadcast or inaccurate directory schemes. We adopt the broadcast scheme in the 25-core processor to simplify the hardware design.
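One way to picture the per-line fallback described above: a directory entry keeps exact logical-sharer tracking until a sharer arrives whose logical ID cannot be represented, and only then marks that single line as requiring broadcast. This is our own sketch of that policy, not the chip's RTL (the 25-core processor uses broadcast; coarse vectors would be an alternative):

    MAX_DOMAIN = 64

    class TrackedLine:
        def __init__(self):
            self.sharers = set()      # exact logical sharer IDs while tracking is accurate
            self.broadcast = False    # set once a sharer beyond the tracking scope appears

        def add_sharer(self, lsid):
            if lsid >= MAX_DOMAIN:
                self.broadcast = True        # this line alone loses precise tracking
            elif not self.broadcast:
                self.sharers.add(lsid)

        def invalidate_targets(self, all_cores_in_domain):
            # Broadcast to the whole (over-sized) domain only for imprecise lines.
            return all_cores_in_domain if self.broadcast else set(self.sharers)

    line = TrackedLine()
    for lsid in (3, 12, 70):          # logical ID 70 exceeds the 64-entry tracking scope
        line.add_sharer(lsid)
    print(line.invalidate_targets(all_cores_in_domain=set(range(72))))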

3.2 Implementation of Home Restriction

Home Restriction (HR) decouples logical home nodes (or directories) from physical homes (or directories) by providing a mapping layer between them. Similar to Sharer Restriction, we maintain a home domain ID (HDID) for each home domain and a logical home ID (LHID) for each home node inside a home domain. Homing policies only generate logical home IDs, and the actual mapping from logical home IDs to physical home IDs is managed by HR. This enables the OS to cooperate with homing policies and optimize home locations for better performance. Home map caches (HMC) that are backed by a home map table (HMT) are adopted to translate logical home IDs to physical ones.

3.2.1 Hardware Modification for Home Restriction

Home Restriction only requires hardware modification to the request node/CPU/core. Figures 6a and 6b show the implementation of the V-HR and P-HR frameworks, respectively. Similar to Sharer Restriction, the home domain ID for V-HR is stored in a special register while in P-HR it is stored in the TLB. The domain core count (the current size of the domain) is also kept to determine the home mapping. The home allocation unit is a combinational logic block that performs home allocation for home domains. It takes the low-order bits of the physical address (above the cache line offset) along with the domain core count to compute the logical home ID. This logical home ID along with the home domain ID indexes the home map cache to determine the physical core ID of the home. The home allocation unit and home map cache only need to be accessed upon cache misses on the request node. Moreover, the extra access latency can be overlapped with the data cache pipeline.

Figure 7: Home Allocation. (a) A home domain expanded from 1 core to 5 cores. (b) The home allocation algorithm:

    Input:  x = low-order bits of the address, y = home domain core count
    Output: z = logical home ID (0 ≤ z ≤ y − 1)
    mask_high ← (1 << ⌈log2 y⌉) − 1
    mask_low  ← (1 << (⌈log2 y⌉ − 1)) − 1
    x_masked  ← x & mask_high
    if x_masked < y then
        return x_masked
    else
        return x_masked & mask_low
    end if

The purpose of the home allocation unit is to distribute physical addresses as evenly as possible across all home nodes for an arbitrary domain core count. For a given domain core count, the mapping rule from addresses to homes is fixed. Figure 7a shows an example of home allocation for a home domain that grows from 1 to 5 cores (or directories). Each time a new home node is added, half of the address space managed by a single previous home is reallocated to it, while all other homes remain unchanged. The detailed re-homing process is explained later in Section 3.2.2. Because only one existing home node needs to be re-homed, growing a home domain is fast. Removing the home node with the largest logical home ID in the domain is exactly the reverse of adding one; removing any other home node requires it to first swap its logical home ID with the largest logical home ID, and the address spaces assigned to both home nodes must then be redistributed to match the home allocation rule, so two home nodes are modified instead of one. Removing a home therefore has a higher cost than adding a home in general, but it is not necessary in most cases. The address space is evenly distributed when the domain core count is a power of two. The detailed home allocation algorithm is described in Figure 7b, and the logic is easy to implement in hardware.
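The allocation rule in Figure 7b is simple enough to check in a few lines. The sketch below is our own transcription into Python (assuming the reconstruction of the shifts and ceilings above is correct), plus a small check of the re-homing behavior discussed in Section 3.2.2:

    import math

    def logical_home_id(x, y):
        """Map low-order address bits x to a logical home ID in [0, y) for a
        home domain of y cores, following the algorithm in Figure 7b."""
        if y == 1:
            return 0                                   # guard: everything homes to node 0
        mask_high = (1 << math.ceil(math.log2(y))) - 1
        mask_low = (1 << (math.ceil(math.log2(y)) - 1)) - 1
        x_masked = x & mask_high
        return x_masked if x_masked < y else x_masked & mask_low

    # Growing the domain from 4 to 5 homes: only addresses that used to map to
    # logical home 0 move (to the new home 4); homes 1-3 keep their address space.
    moved = {(logical_home_id(x, 4), logical_home_id(x, 5))
             for x in range(1 << 10) if logical_home_id(x, 4) != logical_home_id(x, 5)}
    print(moved)            # {(0, 4)}

    # Growing from 16 to 17 homes re-homes 1/32 of the address space, matching the
    # flush example given in Section 3.2.2.
    frac = sum(logical_home_id(x, 16) != logical_home_id(x, 17)
               for x in range(1 << 12)) / (1 << 12)
    print(frac)             # 0.03125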

Eviction of dirty data is interesting for Home Restriction. The cache needs to determine the appropriate home for the dirty, evicted cache line. Unfortunately, in a standard cache, neither the home location nor the home domain ID is known. We propose several solutions. First, we can store the home domain ID in the tag or data array of the request node's private cache, at the expense of every cache line carrying additional information. A second solution, which works for V-HR, is to write back dirty data when there is a context switch to an application with a different home domain ID; conveniently, this does not affect thread switches. A third solution is to simply use write-through caches.

We have implemented the first solution in hardware and evaluated it in the results section in order to support both V-HR and P-HR with write-back caches.

3.2.2 Runtime Modification of Home Domains

Home domains can be modified at runtime to minimize communication distance or balance directory request load across different home nodes. As mentioned in Section 2.2, we choose a simple home mapping scheme that maps home nodes to the same nodes that the VM/application or page is running on. This scheme sets the home domain to be the same as the sharer domain if SR is also adopted. Therefore, home domains also need to be adjusted at runtime as VMs/applications/pages expand or shrink. For V-HR, domains do not frequently change size, as the size is linked to the number of cores used in a VM/program. In contrast, P-HR requires potentially frequent domain size changes as the number of cores accessing a shared page changes. Modifying home domains can bring a performance benefit but also causes the sharing information in the directories to become out of date and incorrectly distributed due to re-allocation of home nodes. In order to add or remove a home node in a pre-existing home domain, the system software has to guarantee that the directories contain up-to-date sharing information. Our home allocation algorithm in Figure 7b guarantees that at most two of the existing homes are affected by each modification; the others remain unchanged.

The runtime modification for V-HR requires freezing all relevant cores and updating information, but this is not necessary for P-HR modification, which proceeds as shown below (a sketch of this flow follows the list):

1. Invalidate the TLB entry for the page in all relevant cores and set the page as poisoned, which means TLB misses for this page will be delayed until the poisoned state is cleared.

2. Flush relevant directory entries and corresponding cache lines in the to-be-modified homes. Given that the subset of addresses that need to be re-homed is known, we can selectively flush them from the directory without affecting other entries. For example, if a home domain grows from 16 cores to 17 cores, we only need to flush 1/32 of a page from logical home 0.

3. Update the home map table and invalidate the corresponding entries in relevant home map caches.

4. Clear the poisoned state of the page.
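A high-level sketch of this sequence, where simple dicts and sets stand in for the hardware and OS structures involved; all names here are ours, and the modulo homing function passed in is a simplified stand-in for the Figure 7b rule:

    tlbs      = {core: set() for core in range(4)}   # per-core TLB: set of mapped page IDs
    hmcs      = {core: {} for core in range(4)}      # per-core home map caches: HDID -> mapping
    hmt       = {}                                   # in-memory home map table: HDID -> [cores]
    dir_cache = {}                                   # directory entries keyed by line address
    poisoned  = set()                                # pages whose TLB refills are being delayed

    def rehome_page(page, hdid, line_addrs, old_homes, new_home, home_of):
        """Grow the page's home domain by one node, following steps 1-4 above."""
        # Step 1: shoot down the page's TLB entries and poison it.
        for core in old_homes:
            tlbs[core].discard(page)
        poisoned.add(page)
        # Step 2: flush only the lines whose logical home changes with the new count.
        y_old, y_new = len(old_homes), len(old_homes) + 1
        for addr in line_addrs:
            if home_of(addr, y_old) != home_of(addr, y_new):
                dir_cache.pop(addr, None)
        # Step 3: update the home map table and invalidate stale home map cache entries.
        hmt[hdid] = old_homes + [new_home]
        for core in hmt[hdid]:
            hmcs[core].pop(hdid, None)
        # Step 4: clear the poisoned state so delayed TLB misses may proceed.
        poisoned.discard(page)

    rehome_page(page=0, hdid=9, line_addrs=range(64),
                old_homes=[0, 1, 2], new_home=3, home_of=lambda a, y: a % y)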

3.2.3 Home Domain Merging

Modern applications may access a large number of pages, which creates a large number of home domains if P-HR is used. This introduces pressure on both software and hardware management, and could potentially lead to significant overhead. In order to reduce the total number of home domains, we adopt domain merging at runtime to merge domains with the same mapping. When a new page is allocated or a P-HR page expands to an additional core, we merge it into an existing domain if they are accessed by the same set of cores and

have the same mapping from logical homes to physical homes. A previous study [24] has shown that for modern workloads most of the shared data is either accessed by all cores or by a small subset of cores. Based on this observation, we further merge all pages whose domain size grows beyond a threshold into a single home domain containing all cores used by the application. We choose 3 as the merging threshold during our evaluation. Simulation results in Section 4.4.2 show that after home domain merging, the total number of home domains is greatly reduced while still achieving good performance.
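As an illustration of the merging policy only (the threshold of 3 and the merge-on-identical-mapping rule come from the text; the data structures are ours):

    MERGE_THRESHOLD = 3

    def assign_home_domain(page_cores, page_mapping, app_wide_domain, existing_domains):
        """Pick a home domain for a page given the set of cores touching it.

        existing_domains: dict mapping a (cores, mapping) signature -> domain ID
        app_wide_domain:  the single domain holding all cores of the application
        """
        if len(page_cores) > MERGE_THRESHOLD:
            # Widely shared pages collapse into one application-wide home domain.
            return app_wide_domain
        signature = (frozenset(page_cores), tuple(sorted(page_mapping.items())))
        if signature in existing_domains:
            return existing_domains[signature]          # merge with an identical domain
        new_id = len(existing_domains) + 1
        existing_domains[signature] = new_id
        return new_id

    domains = {}
    d1 = assign_home_domain({4, 5}, {0: 4, 1: 5}, app_wide_domain=0, existing_domains=domains)
    d2 = assign_home_domain({4, 5}, {0: 4, 1: 5}, app_wide_domain=0, existing_domains=domains)
    assert d1 == d2   # same cores and same logical->physical mapping merge into one domain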

3.3 OS Management of the CDR Framework

Based on the above implementation details, we see that SR and HR have a similar software and hardware flow. Therefore, sharer domains and home domains can be managed similarly. For V-SR or V-HR, the OS maintains a table for mapping VMs/applications to domains. This table is read out upon each context switch between VMs/applications. At the page level (P-SR, P-HR), the mapping information is stored along with page table entries and is filled into the TLB on a TLB miss.

The OS also needs to maintain the sharer/home maptables which provide the mapping from logical IDs tophysical IDs. These tables can be statically allocatedto a piece of memory with fixed size. For a 1024-corechip with maximum domain size of 64, each domain re-quires 64 entries, each of which is 2 bytes (10 bits forthe physical core ID and 6 bits for meta-data). There-fore, each domain only needs 128 bytes of memory. Ifthe OS limits the maximum number of sharer/home do-mains to be 16384, the total memory requirement forthe sharer map table and home map table is 4MB. Awarehouse scale computer may require a larger num-ber of domains, but the sharer/home map table canbe distributed across machines to amortize the storageoverhead. Since the sharer/home map tables are pre-allocated with fixed size, it is very convenient to sup-port hardware refill of sharer/home map caches with-out triggering interrupt handlers from the core. Upon asharer/home map cache miss, the refill address can becalculated based on the missed domain ID and logicalID, so a single memory look-up is required for the refill.
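Spelling out the sizing above (one sharer map table plus one home map table, both statically allocated):

    % Map-table footprint for the 1024-core configuration described above.
    \[
    \underbrace{64 \ \text{entries}}_{\text{max domain size}} \times 2\,\text{B} = 128\,\text{B per domain},
    \qquad
    16384 \ \text{domains} \times 128\,\text{B} \times 2 \ \text{tables (SMT + HMT)} = 4\,\text{MB}.
    \]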

The OS itself can also fit into the CDR framework by assigning one or more pre-allocated domains to it. Managing the memory for the OS is an interesting use case for CDR. Since we do not maintain global cache coherence under Sharer Restriction, it is likely not efficient to run an OS that utilizes a large amount of global shared memory, as SR switches to inaccurate tracking or broadcast when there are more sharers than can be accurately tracked. However, there are many examples of distributed OSes that do not require global shared memory, such as Barrelfish [25], fos [26, 27], Exokernel [28], Corey [29], and Disco [30, 31]. Home Restriction can easily be turned off for a traditional OS that wants to use all directories, without affecting correctness.

4. EVALUATION

In this section we first present the area overhead of implementing Coherence Domain Restriction (CDR), and then evaluate Sharer Restriction and Home Restriction separately. For SR, we present an analysis of storage savings and the performance impact of SR. For HR, we simulate multi-threaded workloads on both a manycore chip and a warehouse scale computer with shared memory. We then measure the performance benefits of HR over other frameworks.

Table 2: Main parameters of modeled systems

(a) The manycore chip
    Processor:            1024 cores, x86-64 ISA, in-order, IPC=1 except for memory accesses, one thread per core, 2GHz
    L1 Caches:            8KB, private, 4-way set associative, 1-cycle latency
    L2 Caches:            64KB, private, 4-way set associative, 5-cycle latency
    Directory Caches:     Distributed, shared by all cores, no backup storage, 4-way set associative, 5-cycle latency
    SMC/HMC:              256 entries, direct mapped, 1-cycle hit, 100-cycle miss
    NoC:                  Tiled 32x32 2D mesh network, X-Y routing, 64-bit flits and links, 1 cycle per hop
    Coherence Protocol:   Directory-based MESI protocol, homing by cache line interleaving
    Memory:               100-cycle latency

(b) The warehouse scale computer with shared memory
    System:               100,000 machines with a 10-core processor each
    Processor:            10 cores, x86-64 ISA, in-order, IPC=1 except for memory accesses, one thread per core, 2.5GHz
    L1 Caches:            32KB, private, 8-way set associative, 1-cycle latency
    L2 Caches:            256KB, private, 8-way set associative, 5-cycle latency
    L3 Caches:            25MB, shared by all cores, 24-way set associative, 10-cycle latency
    Directory Caches:     Distributed, shared by all machines, no backup storage, 24-way set associative, 10-cycle latency
    SMC/HMC:              256 entries, direct mapped, 1-cycle hit, 120-cycle miss
    Inter-node Network:   3D mesh network, X-Y-Z routing, 16GB/s each direction, 200 cycles per hop
    Inter-node Protocol:  Directory-based MESI protocol, homing by page interleaving for the baseline and cache line interleaving for other schemes
    Memory:               120-cycle latency

4.1 Methodology

We use the PriME simulator [32] to simulate the modeled systems, as detailed in Table 2. For the manycore system described in Table 2a, each core adopts a simple in-order architecture with the x86-64 ISA, running at 2GHz. L1 and L2 are private caches, with 8KB and 64KB of storage respectively. A directory-based MESI protocol is maintained through distributed shared directory caches. Each directory cache contains twice as many directory entries as L2 cache lines. We assume 1 cycle of extra latency for accessing the sharer/home map cache, although the latency can be fully overlapped in many pipelined designs. Every miss in the sharer/home map cache takes one memory look-up, which is 100 cycles in the modeled system. Of course, a technology with a smaller feature size is required in order to actually fit 1024 cores on a chip. The warehouse scale computer contains 1,000,000 cores in total across 100,000 machines. It is described in Table 2b; CDR is applied to the inter-machine coherence protocol operating on a per-machine basis. The processors are modeled after the Intel Xeon E5-2670v2 [33].

The three frameworks that we evaluate CDR against are global home placement (Global-home), Scale-Out processors [21] (Scale-out) and Virtual Hierarchies [22] (VH). Global-home is the baseline framework where home nodes are interleaved across the entire system on a per cache line basis and memory is kept coherent among all cores in the system. Scale-Out processors statically partition all of the cores in the system into small groups to ensure low storage overhead and communication cost. These groups, called pods, are similar to sharer/home domains, but are rigid in size and shape and do not enable inter-pod communication. Virtual Hierarchies provides VM-level dynamic home domains but does not explore runtime modification of home domains, so the domain size for one VM is fixed once assigned.

Figure 8: Area Analysis of CDR. (a) The area breakdown of a core: Core (non-memory system) 53.5%, L2 & directory cache 25.7%, L1.5 8.8%, L1 7.1%, HR 3.0%, SR 1.9%. (b) The area overhead of CDR in a post-layout core.

In our studies on the manycore system, Scale-out uses a fixed pod size of 64 cores; Virtual Hierarchies uses the same VM size as the maximum number of threads of each benchmark; and CDR uses a maximum home/sharer domain size of 64, where the domain size can change at runtime. For the warehouse scale computer, the maximum pod/domain size is set to 8 machines, which corresponds to 80 cores. We do not vary the coherence protocol, and all frameworks use the same directory scheme (full-map) with directory caches [12].

We utilize the 13 multi-threaded benchmarks from the PARSEC benchmark suite [23]. Applications are simulated with small input sets, and we simulate the entire benchmark. For simulations on the manycore system, PARSEC benchmarks are executed with different thread count parameters (1, 2, 4, 8, and 16), but this parameter only specifies the minimum number of threads rather than the actual number. In particular, for ferret the actual number of threads is more than 64 when a thread count parameter of 16 is used. Since our simulator assumes one thread per core and both Scale-out and CDR have a maximum domain size of 64 in our configuration, we either omit this benchmark configuration or simulate with a parameter of 8 instead for compatibility. For the boundary evaluation of SR, we configure each benchmark to run with the thread count closest to 72 (and more than 64 threads). Unfortunately, facesim, swaptions and x264 cannot meet this constraint, so we omit them. Simulations on the warehouse scale computer set the actual number of threads to 10, 20 and 40 in order to execute on 1, 2 and 4 machines. We omit swaptions and x264 for the same reason.

4.2 Area Overhead of CDR

Figure 9: Performance degradation of Full-map with V-SR running PARSEC on the 1024-core system. Results are normalized to Full-map without V-SR. Benchmarks are simulated with a thread count parameter of 16, except ferret, which uses 8.

Figure 10: Cache miss rates (left) and absolute number of cache misses (right) of sharer map caches for Full-map with V-SR running PARSEC on the 1024-core system.

We have implemented the entire CDR framework in Verilog in a 25-core processor that was taped out in IBM's 32nm SOI process. We have received chips back from the foundry (Oct. 2015) and they are under test. All the architectural modifications for page-level Sharer Restriction and Home Restriction discussed in Section 3 are seamlessly integrated with the memory system of the processor (the VM/application-level design can be fully emulated by the page-level design, so there is no need to implement them separately). We have leveraged the OpenSPARC T1 core in the chip we built, but in this paper we evaluate a more standard x86-64 design. The logic to implement CDR is the same across different ISAs as it only affects the cache system; therefore we present logic area based on our OpenSPARC T1 design. Each core of the 25-core processor contains an 8KB private write-through L1 data cache, an 8KB private write-back L1.5 data cache, and a 64KB distributed shared L2 cache with integrated directories. We present the area breakdown of a core in the 25-core processor from the post-layout design in IBM's 32nm SOI process in Figure 8. Sharer Restriction costs 1.9% of the post-layout area, which is mainly due to the sharer map cache. Home Restriction costs 3.0% of the area because we store the home domain ID per cache line in the private L1.5, as discussed in Section 3.2.1; this is not necessary if write-through caches are used at the L1.5. The total area overhead of CDR is 4.9% within a core. Since we leverage smaller OpenSPARC T1 cores and caches in our processor compared to a standard x86-64 design, the area overhead of CDR is expected to be lower in the x86-64 case.

4.3 Sharer Restriction Evaluation

4.3.1 Storage Analysis

Both Scale-out and Full-map with Sharer Restriction achieve constant overhead scaling as the number of cores increases due to constant pod size and maximum sharer domain size. In contrast, Full-map without SR is not scalable with more cores/nodes, incurring around 14x and 1440x storage overhead respectively when scaling to 1024 and 100,000 cores/nodes.

Figure 11: Performance degradation of Full-map with V-SR and P-SR running PARSEC on the 1024-core system. Results are normalized to Full-map without SR. Benchmarks are simulated with the thread count closest to 72.

4.3.2 Performance

Figure 9 presents the performance degradation of Full-map with V-SR on the 1024-core system. Since there is no violation of the maximum domain size, there is no need to use P-SR for finer-grain domains in this case. As shown in Figure 9, all benchmarks are within 3% performance degradation, and the average performance loss is 0.3%. Figure 10 shows both the overall sharer map cache (SMC) miss rates and the total number of cache misses for all benchmarks, with the miss rate calculated by dividing the total number of misses by the total number of accesses across all sharer map caches in the system. Since each application only maintains one sharer domain with at most 64 cores, its mapping entries can all fit in one sharer map cache, so all misses are cold misses. Considering that the number of accesses to the sharer map caches is already a very small portion of memory accesses, an average SMC miss rate of ∼1% explains the negligible performance impact of V-SR. The blackscholes application has a higher SMC miss rate than other applications because it has far fewer SMC accesses, resulting in a higher cold miss rate. Since the total number of SMC accesses varies across benchmarks, the absolute number of cache misses gives a more direct indication of the performance degradation than the SMC miss rate. Applying V-SR on the warehouse scale computer leads to similar results because the sharer domain size is independent of scale, so we do not present this scenario.

In order to evaluate the performance of SR with violations of the maximum domain size, we simulate the performance degradation of Full-map with V-SR and P-SR on the 1024-core system in Figure 11. Benchmarks are configured with the thread count closest to 72. Although sharer domains larger than 64 cores can cause broadcasts, both V-SR and P-SR still result in a low performance penalty (3.3% and 1.2% on average) because broadcasts are triggered on a per-line basis, as mentioned in Section 3.1.3. Overall, P-SR has better performance than V-SR due to fewer broadcasts with finer-grain sharer domains, but V-SR achieves better performance than P-SR for a few benchmarks because P-SR introduces more domains and leads to more SMC misses.

Figure 12: Performance gain of Scale-out, Virtual Hierarchies, V-HR and P-HR normalized to Global-home running PARSEC on the 1024-core system. Each benchmark is simulated with multiple thread count parameters: 1, 2, 4, 8, 16. Ferret uses more than 64 threads when running with a parameter of 16, therefore we omit this result.

Figure 13: Performance gain of Scale-out, Virtual Hierarchies, V-HR and P-HR normalized to Global-home running PARSEC benchmarks on the 100,000-machine system. Each benchmark is simulated on 1, 2 and 4 machines.

4.4 Home Restriction Evaluation

4.4.1 Performance

In the first experiment, we run individual PARSEC benchmarks with varying thread counts on a simulated system with 1024 cores (the ferret benchmark with a thread parameter of 16 is omitted), and the performance results of the different schemes are shown in Figure 12. Because cache-missing request nodes and home nodes are in the same home domain, V-HR shortens the average communication distance and achieves a performance gain. P-HR localizes communication even further at the page level, resulting in the best performance among all schemes. Compared to Global-home, a 24% average performance gain is achieved by V-HR and 29% by P-HR. Scale-out achieves a similar goal by statically dividing the system into separate pods, but because pod size is fixed, a mismatch between pod size and application size results in less gain, especially for applications with fewer threads. On average, Scale-out has an 18% performance gain over Global-home. Virtual Hierarchies performs better than Scale-out because the VM size of each application is adjusted to match the maximum application size (the number of cores used by the application). However, since it cannot adjust VM size at runtime to match application needs like V-HR, nor can it have finer-grain domains at the page level like P-HR, Virtual Hierarchies performs worse than HR.

Similarly, we run individual PARSEC benchmarks with varying thread counts executing on a varying number of machines on a warehouse scale shared memory computer with 100,000 machines. Because of the longer inter-machine communication latency and larger number of homes, localized communication results in a much larger performance impact, as shown in Figure 13. Global-home results in much worse performance than all other schemes because the home nodes are distributed across all 100,000 machines.

Figure 14: Performance gain of P-HR due to home domain merging.

Figure 15: Overall cache miss rates of home map caches for P-HR without and with home domain merging.

P-HR and V-HR lead to larger performance gains than the other schemes as the number of execution machines increases. On average, P-HR, V-HR, Virtual Hierarchies and Scale-out achieve 23.04x, 16.35x, 13.43x and 6.07x performance gains, respectively, compared to Global-home.

4.4.2 Home Domain Merging of P-HR
As mentioned in Section 3.2.3, we adopt home domain merging by default to reduce the total number of home domains for P-HR. To investigate its performance impact, we simulate P-HR both with and without home domain merging in the 1024-core system. Figure 14 presents the performance gain of P-HR due to home domain merging, which can be up to ∼30% for canneal and is 7% on average across all benchmarks. The benefit stems from two main sources: 1) fewer home domains result in fewer runtime modification operations, and 2) fewer home domains lead to a lower home map cache (HMC) miss rate, as shown in Figure 15. After home domain merging, the average HMC miss rate drops significantly from 25% to less than 1%.
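As a rough sketch of what merging involves (our own simplification; the domain representation and the overlap threshold are assumptions, not the implemented policy), page-level home domains whose home-node sets largely overlap could be collapsed into one, shrinking the set of entries the home map cache must hold:

```python
# Illustrative sketch of page-level home domain merging.
# Assumption: a home domain is modeled as a set of home-node IDs shared by a
# group of pages; real hardware/OS structures will differ.
def merge_home_domains(page_to_domain, overlap_threshold=0.75):
    """Greedily merge domains whose home-node sets overlap heavily.
    page_to_domain: dict mapping page number -> frozenset of home-node IDs."""
    merged = {}
    canonical = []   # surviving distinct home domains (sets of node IDs)
    for page, nodes in page_to_domain.items():
        for i, existing in enumerate(canonical):
            overlap = len(nodes & existing) / len(nodes | existing)
            if overlap >= overlap_threshold:
                canonical[i] = existing | nodes   # widen the surviving domain
                merged[page] = i
                break
        else:
            canonical.append(set(nodes))
            merged[page] = len(canonical) - 1
    return merged, canonical

# Fewer distinct domains means fewer home map cache entries and fewer misses.
pages = {0: frozenset({1, 2, 3, 4}), 1: frozenset({1, 2, 3}), 2: frozenset({9, 10})}
mapping, domains = merge_home_domains(pages)
print(len(domains))  # 2 domains instead of 3
```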

5. RELATED WORK
Directory-based protocols are motivated by bus-based protocols' inability to scale beyond a few tens of cores. Early proposals include organizations like duplicate-tag [34, 35] or sparse directories [12], with sharer storage schemes like the full-map sharer vector [9], coarse-grain vectors [12, 36], or limited pointers [10, 11]. While these techniques help solve the bandwidth scalability challenge, they are inadequate for manycore systems and warehouse scale shared memory computers because their directory storage grows too quickly with core count. To address the storage challenge, researchers have recently proposed the use of Bloom filters [13, 37], highly-associative caches [15, 38], and deduplication of directory entries and sharer tags [14, 37, 15]. These generalized solutions are orthogonal to our approach and can be applied together with it to increase the scalability of the system. Another solution is to use hierarchical coherence on a manycore chip [39]. In contrast, we have demonstrated that CDR works well in a 1024-core chip with constant storage overhead and negligible performance loss. Compared to a flat system, a hierarchical design introduces extra traffic and directory lookups when two sharers from different subtrees communicate with each other. In addition, multi-level coherence protocols are harder to design and verify.
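To illustrate the storage pressure, the short calculation below (with illustrative field widths we chose, not figures from any of the cited schemes) contrasts the per-entry sharer-tracking bits of a global full-map directory with those of an entry restricted to a 64-core sharer domain:

```python
# Illustrative arithmetic: sharer-tracking bits per directory entry.
# A global full-map vector needs one bit per core in the whole system, while a
# sharer-restricted entry only needs to cover the cores inside its domain
# (plus a few bits to name the domain); exact field widths are assumptions.
def fullmap_bits(total_cores):
    return total_cores                       # one presence bit per core

def restricted_bits(domain_size, domain_id_bits=16):
    return domain_size + domain_id_bits      # domain-local bits + domain tag

for cores in (64, 1024, 1_000_000):
    print(cores, fullmap_bits(cores), restricted_bits(64))
# Full-map grows with system size; the restricted entry stays constant at 80 bits.
```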

Another direction researchers have explored is to not keep directories at all. Non-uniform cache architecture (NUCA) [40] and its variants (D-NUCA, R-NUCA [24]) maintain only one copy per cache line on-chip, thereby obviating the need to track sharers. Software-managed coherence [41, 42, 43, 44, 45, 46, 47, 48] off-loads coherence tracking from hardware to software. Message-passing systems, like Intel's SCC chip [7, 49], also do not need directories. The above schemes all have inherent drawbacks compared to traditional directories: the NUCA approach, while simpler, gives up spatial-temporal locality performance; software-managed coherence can have poor performance; and message-passing and software-managed coherence can necessitate changes to the programming model.

The Scale-Out architecture [21] restricts the coherent space by physically partitioning the cores into fixed-size pods. Communication is only possible among cores within the same pod. Our motivation is similar: to reduce the scope of coherence. However, our work completely decouples sharer restriction from home restriction, achieving more flexibility in creating and maintaining coherence domains at runtime and resulting in better performance. CDR can flexibly locate cores and home nodes and hence does not suffer from the unused cores that can hinder Scale-Out designs. CDR's inter-domain communication is also realizable through combining V-SR with P-SR, a feature not available in the Scale-Out architecture.

In Virtual Hierarchies (VH) [22], Marty et al. proposed a malleable two-level coherence hierarchy to facilitate spatial locality for communicating cores, which shares some similarity with V-HR in our work. However, V-HR supports runtime modification of home domains for performance optimization, which is not implemented in VH. VH also does not consider the storage overhead problem, which we address with SR. In VH, each directory entry contains a global full-map bit-vector to track sharers, whereas in CDR we restrict the sharer set to reduce storage overhead.

The problem of cache line placement in NUCA is also similar to home-node allocation, as only one copy of a cache line is allowed to exist. In static NUCA, the location of the cache line is statically mapped. In dynamic NUCA, a line can exist in multiple possible locations and moves around as dictated by the performance policy; because the location of the cache line is not exact, the requester needs to perform multiple searches. Finally, in reactive NUCA, entries are classified as either private or shared. If classified as private, a line is statically mapped within a fixed, small cluster; if classified as shared, the mapping reverts to a static, system-wide placement. Recent proposals have also explored replication [50, 51], coherence-protocol-based optimization [52, 53, 54], and software-configurable policies [55, 56], trading implementation complexity for performance. With regard to cache line placement, this paper explores runtime modification of the home node and how to support it at the software-hardware interface.

Warehouse scale shared memory has much in common with distributed shared memory systems in high-performance computing [57, 58, 59, 60]. We see CDR as a potential mechanism to expand the scale of cache coherent systems for most use cases beyond current designs to a warehouse scale, while keeping hardware costs in check. Accessing large amounts of shared DRAM across the data center has also recently been popularized by the RAMCloud project [61] and key-value stores [62, 63]. CDR addresses some of these use cases, such as enabling access to memory on other nodes, but does not enable fully scalable, globally coherent data access by massive numbers of cores. While focusing on the scalability issue, CDR does not address other important challenges in warehouse scale shared memory such as fault tolerance, resource management, and software.

6. CONCLUSION
CDR explores the insight that many shared memory applications only use a subset of cores in a large shared memory system. By dynamically creating multiple, arbitrary coherence domains that restrict sharers or homes on a per-VM/application or per-page basis, storage and communication distance can both be reduced. As a result, CDR achieves constant storage overhead as core count increases with SR, and with HR achieves 29% and 23.04x performance gains on a 1024-core system and a 1,000,000-core warehouse scale computer, respectively, compared to global home placement using a full-map directory. The entire CDR framework has been implemented in Verilog in a 25-core processor taped out in IBM's 32nm process with reasonable area overhead.

7. ACKNOWLEDGEMENTS
This work was supported by the NSF under Grant CCF-1217553, AFOSR under Grant FA9550-14-1-0148, and DARPA under Grants N66001-14-1-4040 and HR0011-13-2-0005. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of our sponsors.

8. REFERENCES
[1] F. Mueller, "A library implementation of POSIX threads under UNIX," in Proceedings of the USENIX Conference, pp. 29–41, 1993.
[2] L. Dagum and R. Menon, "OpenMP: an industry standard API for shared-memory programming," Computational Science Engineering, IEEE, vol. 5, no. 1, pp. 46–55, 1998.
[3] D. Johnson, M. Johnson, J. Kelm, W. Tuohy, S. S. Lumetta, and S. Patel, "Rigel: A 1,024-core single-chip accelerator architecture," Micro, IEEE, vol. 31, pp. 30–41, July 2011.
[4] D. Wentzlaff, P. Griffin, H. Hoffmann, L. Bao, B. Edwards, C. Ramey, M. Mattina, C.-C. Miao, J. F. Brown, and A. Agarwal, "On-chip interconnection architecture of the tile processor," Micro, IEEE, vol. 27, no. 5, pp. 15–31, 2007.
[5] D. Wentzlaff, C. J. Jackson, P. Griffin, and A. Agarwal, "Configurable fine-grain protection for multicore processor virtualization," in Proceedings of the 39th Annual International Symposium on Computer Architecture, ISCA '12, pp. 464–475, 2012.
[6] C. Ramey, "TILE-Gx ManyCore Processor: Acceleration Interfaces and Architecture," in Hot Chips 23, 2011.
[7] J. Howard et al., "A 48-Core IA-32 message-passing processor with DVFS in 45nm CMOS," in Solid-State Circuits Conference Digest of Technical Papers (ISSCC), IEEE International, pp. 108–109, 2010.
[8] G. Chrysos, "Knights Corner, Intel's first Many Integrated Core (MIC) Architecture Product," in Hot Chips 24, 2012.
[9] L. M. Censier and P. Feautrier, "A new solution to coherence problems in multicache systems," IEEE Transactions on Computers, vol. 100, no. 12, pp. 1112–1118, 1978.
[10] A. Agarwal, R. Simoni, J. Hennessy, and M. Horowitz, "An evaluation of directory schemes for cache coherence," in ACM SIGARCH Computer Architecture News, vol. 16, pp. 280–298, IEEE Computer Society Press, 1988.
[11] D. Chaiken, J. Kubiatowicz, and A. Agarwal, "LimitLESS Directories: A Scalable Cache Coherence Scheme," in Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS IV, pp. 224–234, ACM, 1991.
[12] A. Gupta, W.-D. Weber, and T. Mowry, "Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes," in International Conference on Parallel Processing, pp. 312–321, 1990.
[13] J. Zebchuk, V. Srinivasan, M. K. Qureshi, and A. Moshovos, "A tagless coherence directory," in Microarchitecture (MICRO), 42nd Annual IEEE/ACM International Symposium on, pp. 423–434, IEEE, 2009.
[14] H. Zhao, A. Shriraman, and S. Dwarkadas, "SPACE: sharing pattern-based directory coherence for multicore scalability," in Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, PACT '10, pp. 135–146, ACM, 2010.
[15] D. Sanchez and C. Kozyrakis, "SCD: A Scalable Coherence Directory with Flexible Sharer Set Encoding," in Proceedings of the 18th International Symposium on High Performance Computer Architecture (HPCA-18), 2012.
[16] X. Fan, W.-D. Weber, and L. A. Barroso, "Power Provisioning for a Warehouse-Sized Computer," in Proceedings of the 34th Annual International Symposium on Computer Architecture, ISCA '07, pp. 13–23, 2007.

[17] M. D. Hill and M. R. Marty, "Amdahl's law in the multicore era," Computer, vol. 41, no. 7, pp. 33–38, 2008.
[18] M. Bhadauria, V. M. Weaver, and S. A. McKee, "Understanding PARSEC performance on contemporary CMPs," in Workload Characterization (IISWC), IEEE International Symposium on, pp. 98–107, IEEE, 2009.
[19] P. D. Bryan, J. G. Beu, T. M. Conte, P. Faraboschi, and D. Ortega, "Our many-core benchmarks do not use that many cores," in Workshop on Duplicating, Deconstructing, and Debunking (WDDD), vol. 6, p. 8, 2009.
[20] N. Ioannou and M. Cintra, "Complementing user-level coarse-grain parallelism with implicit speculative parallelism," in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 284–295, ACM, 2011.
[21] P. Lotfi-Kamran, B. Grot, M. Ferdman, S. Volos, O. Kocberber, J. Picorel, A. Adileh, D. Jevdjic, S. Idgunji, E. Ozer, and B. Falsafi, "Scale-out processors," in Computer Architecture (ISCA), 39th Annual International Symposium on, pp. 500–511, June 2012.
[22] M. R. Marty and M. D. Hill, "Virtual hierarchies to support server consolidation," in Proceedings of the 34th Annual International Symposium on Computer Architecture, ISCA '07, pp. 46–56, 2007.
[23] C. Bienia, Benchmarking Modern Multiprocessors. PhD thesis, Princeton University, January 2011.
[24] N. Hardavellas, M. Ferdman, B. Falsafi, and A. Ailamaki, "Reactive NUCA: Near-optimal Block Placement and Replication in Distributed Caches," in Proceedings of the 36th Annual International Symposium on Computer Architecture, ISCA '09, pp. 184–195, 2009.
[25] A. Baumann, P. Barham, P.-E. Dagand, T. Harris, R. Isaacs, S. Peter, T. Roscoe, A. Schupbach, and A. Singhania, "The Multikernel: A New OS Architecture for Scalable Multicore Systems," in Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, SOSP '09, pp. 29–44, ACM, 2009.
[26] D. Wentzlaff and A. Agarwal, "Factored operating systems (fos): the case for a scalable operating system for multicores," SIGOPS Oper. Syst. Rev., vol. 43, no. 2, pp. 76–85, 2009.
[27] D. Wentzlaff, C. Gruenwald, III, N. Beckmann, K. Modzelewski, A. Belay, L. Youseff, J. Miller, and A. Agarwal, "An operating system for multicore and clouds: Mechanisms and implementation," in Proceedings of the 1st ACM Symposium on Cloud Computing, SoCC '10, pp. 3–14, 2010.
[28] D. R. Engler, M. F. Kaashoek, and J. O'Toole, Jr., "Exokernel: An Operating System Architecture for Application-level Resource Management," in Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles, SOSP '95, pp. 251–266, ACM, 1995.
[29] S. Boyd-Wickizer, H. Chen, R. Chen, Y. Mao, F. Kaashoek, R. Morris, A. Pesterev, L. Stein, M. Wu, Y. Dai, Y. Zhang, and Z. Zhang, "Corey: An operating system for many cores," in Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, OSDI '08, pp. 43–57, USENIX Association, 2008.
[30] E. Bugnion, S. Devine, and M. Rosenblum, "Disco: Running commodity operating systems on scalable multiprocessors," in Proceedings of the ACM Symposium on Operating System Principles, pp. 143–156, 1997.
[31] K. Govil, D. Teodosiu, Y. Huang, and M. Rosenblum, "Cellular Disco: Resource management using virtual clusters on shared-memory multiprocessors," in Proceedings of the ACM Symposium on Operating System Principles, pp. 154–169, 1999.

[32] Y. Fu and D. Wentzlaff, "PriME: A parallel and distributed simulator for thousand-core chips," in Performance Analysis of Systems and Software (ISPASS), 2014 IEEE International Symposium on, pp. 116–125, March 2014.
[33] "Intel® Xeon® Processor E5-2670 v2 (25M Cache, 2.50 GHz)." http://ark.intel.com/products/75275/Intel-Xeon-Processor-E5-2670-v2-25M-Cache-2_50-GHz.
[34] P. Kongetira, K. Aingaran, and K. Olukotun, "Niagara: A 32-way multithreaded SPARC processor," Micro, IEEE, vol. 25, no. 2, pp. 21–29, 2005.
[35] L. A. Barroso, K. Gharachorloo, R. McNamara, A. Nowatzyk, S. Qadeer, B. Sano, S. Smith, R. Stets, and B. Verghese, "Piranha: a scalable architecture based on single-chip multiprocessing," ACM SIGARCH Computer Architecture News, vol. 28, no. 2, pp. 282–293, 2000.
[36] R. T. Simoni, Cache coherence directories for scalable multiprocessors. PhD thesis, Department of Electrical Engineering, Stanford University, 1992.
[37] H. Zhao, A. Shriraman, S. Dwarkadas, and V. Srinivasan, "SPATL: Honey, I Shrunk the Coherence Directory," in Parallel Architectures and Compilation Techniques (PACT), International Conference on, pp. 33–44, 2011.
[38] M. Ferdman, P. Lotfi-Kamran, K. Balet, and B. Falsafi, "Cuckoo directory: A scalable directory for many-core systems," in High Performance Computer Architecture (HPCA), 2011 IEEE 17th International Symposium on, pp. 169–180, IEEE, 2011.
[39] M. M. K. Martin, M. D. Hill, and D. J. Sorin, "Why on-chip cache coherence is here to stay," Communications of the ACM, vol. 55, pp. 78–89, July 2012.
[40] C. Kim, D. Burger, and S. W. Keckler, "An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches," in Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS X, pp. 211–222, ACM, 2002.
[41] P. Keleher, A. L. Cox, and W. Zwaenepoel, "Lazy release consistency for software distributed shared memory," in Proceedings of the 19th Annual International Symposium on Computer Architecture, ISCA '92, pp. 13–21, 1992.
[42] P. Keleher, A. L. Cox, S. Dwarkadas, and W. Zwaenepoel, "TreadMarks: distributed shared memory on standard workstations and operating systems," in Proceedings of the USENIX Winter 1994 Technical Conference, WTEC '94, 1994.
[43] X. Zhou, H. Chen, S. Luo, Y. Gao, S. Yan, W. Liu, B. Lewis, and B. Saha, "A case for software managed coherence in manycore processors," poster session presented at the 2nd USENIX Workshop on Hot Topics in Parallelism, 2010.
[44] C. Fensch and M. Cintra, "An OS-based alternative to full hardware coherence on tiled CMPs," in High Performance Computer Architecture (HPCA), IEEE 14th International Symposium on, pp. 355–366, IEEE, 2008.
[45] J. H. Kelm, D. R. Johnson, M. R. Johnson, N. C. Crago, W. Tuohy, A. Mahesri, S. S. Lumetta, M. I. Frank, and S. J. Patel, "Rigel: An Architecture and Scalable Programming Interface for a 1000-core Accelerator," in Proceedings of the 36th Annual International Symposium on Computer Architecture, ISCA '09, pp. 140–151, 2009.
[46] B. Choi, R. Komuravelli, H. Sung, R. Smolinski, N. Honarmand, S. Adve, V. Adve, N. Carter, and C.-T. Chou, "DeNovo: Rethinking the Memory Hierarchy for Disciplined Parallelism," in Parallel Architectures and Compilation Techniques (PACT), 2011 International Conference on, pp. 155–166, Oct 2011.
[47] H. Sung, R. Komuravelli, and S. V. Adve, "DeNovoND: Efficient Hardware Support for Disciplined Non-determinism," in Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '13, pp. 13–26, ACM, 2013.

[48] H. Sung and S. V. Adve, "DeNovoSync: Efficient Support for Arbitrary Synchronization Without Writer-Initiated Invalidations," in Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '15, pp. 545–559, ACM, 2015.
[49] T. G. Mattson, M. Riepen, T. Lehnig, P. Brett, W. Haas, P. Kennedy, J. Howard, S. Vangal, N. Borkar, G. Ruhl, et al., "The 48-core SCC Processor: the programmer's view," in Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–11, 2010.
[50] G. Kurian, O. Khan, and S. Devadas, "The locality-aware adaptive cache coherence protocol," in Proceedings of the 40th Annual International Symposium on Computer Architecture, ISCA '13, pp. 523–534, ACM, 2013.
[51] P. Foglia, C. A. Prete, M. Solinas, and G. Monni, "Re-NUCA: Boosting CMP performance through block replication," in Digital System Design: Architectures, Methods and Tools (DSD), 13th Euromicro Conference on, pp. 199–206, IEEE, 2010.
[52] A. Ros, M. E. Acacio, and J. M. García, "Direct coherence: Bringing together performance and scalability in shared-memory multiprocessors," in Proceedings of the International Conference on High Performance Computing, pp. 147–160, 2007.
[53] A. Ros, M. E. Acacio, and J. M. García, "DiCo-CMP: Efficient cache coherency in tiled CMP architectures," in Parallel and Distributed Processing, IEEE International Symposium on, pp. 147–160, April 2008.
[54] H. Hossain, S. Dwarkadas, and M. C. Huang, "POPS: Coherence Protocol Optimization for Both Private and Shared Data," in Parallel Architectures and Compilation Techniques (PACT), International Conference on, pp. 45–55, Oct 2011.
[55] J. Zhou and B. Demsky, "Memory management for many-core processors with software configurable locality policies," in Proceedings of the 2012 International Symposium on Memory Management, ISMM '12, pp. 3–14, ACM, 2012.
[56] M. Schuchhardt, A. Das, N. Hardavellas, G. Memik, and A. Choudhary, "The impact of dynamic directories on multicore interconnects," Computer, vol. 46, pp. 32–39, October 2013.
[57] J. Laudon and D. Lenoski, "The SGI Origin: A ccNUMA Highly Scalable Server," in Computer Architecture, 1997. Conference Proceedings. The 24th Annual International Symposium on, pp. 241–251, June 1997.
[58] SGI Corporation, "White Paper: Technical Advances in the SGI® UV Architecture," June 2012.
[59] C. Dubnicki, A. Bilas, K. Li, and J. Philbin, "Design and Implementation of Virtual Memory-mapped Communication on Myrinet," in Proceedings of the International Parallel Processing Symposium, pp. 388–396, 1997.
[60] R. Kessler and J. Schwarzmeier, "Cray T3D: a new dimension for Cray Research," in Compcon Spring '93, Digest of Papers, pp. 176–182, Feb 1993.
[61] J. Ousterhout, P. Agrawal, D. Erickson, C. Kozyrakis, J. Leverich, D. Mazieres, S. Mitra, A. Narayanan, D. Ongaro, G. Parulkar, M. Rosenblum, S. M. Rumble, E. Stratmann, and R. Stutsman, "The Case for RAMCloud," Communications of the ACM, vol. 54, pp. 121–130, July 2011.
[62] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels, "Dynamo: Amazon's highly available key-value store," in Proceedings of Twenty-first ACM SIGOPS Symposium on Operating Systems Principles, SOSP '07, pp. 205–220, 2007.
[63] R. Nishtala, H. Fugal, S. Grimm, M. Kwiatkowski, H. Lee, H. C. Li, R. McElroy, M. Paleczny, D. Peek, P. Saab, D. Stafford, T. Tung, and V. Venkataramani, "Scaling Memcache at Facebook," in The 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13), pp. 385–398, USENIX, 2013.

