Pay Migration Tax to Homeland: Anchor-based Scalable Reference Counting for Multicores

Seokyong Jung, Jongbin Kim, Minsoo Ryu, Sooyong Kang, and Hyungsoo Jung, Hanyang University

https://www.usenix.org/conference/fast19/presentation/jung

This paper is included in the Proceedings of the 17th USENIX Conference on File and Storage Technologies (FAST ’19), February 25–28, 2019, Boston, MA, USA. ISBN 978-1-939133-09-0.

Pay Migration Tax to Homeland: Anchor-based Scalable Reference Counting for Multicores

Seokyong Jung, Jongbin Kim, Minsoo Ryu, Sooyong Kang, Hyungsoo Jung∗

Hanyang University
{syjung, jongbinkim, msryu, sykang, hyungsoo.jung}@hanyang.ac.kr

Abstract

The operating system community has been combating scalability bottlenecks for the past decade, with victories for all the then-new multicore hardware. File systems, however, are still in the midst of turmoil. One of the culprits behind performance degradation is reference counting, widely used for managing data and metadata; scalability is badly impacted under load with little or no logical contention, precisely where the capability is desperately needed. To address this, we propose PAYGO, a reference counting technique that combines a per-core hash of local reference counters with an anchor counter. PAYGO imposes the restriction that a decrement must be performed on the original local counter where the increment occurred, so that zero-valued local counters can be reclaimed immediately. To this end, we force migrated processes running on different cores to update the anchor counter associated with the original local counter. We implemented PAYGO in the Linux page cache, so our implementation is transparent to the file system. Experimental evaluation with various underlying file systems (i.e., ext4, F2FS, btrfs, and XFS) demonstrated that PAYGO scales file systems better than other state-of-the-art techniques.

1 Introduction

Reference counting is a general technique, originally introduced by Collins [8] almost six decades ago, to determine the liveness of an object for automatic storage reclamation. Since the early versions of the UNIX kernel used reference counting to manage data (e.g., page cache) and metadata (e.g., inode), reference counting has gained widespread acceptance in the systems community, e.g., in file systems, HBase [21], RocksDB [22], and MariaDB [23].

However, a recent study by Min et al. [17] found that reference counters, among many other factors, in modern file systems are not scalable, thus leading file systems to suffer performance degradation on multicore hardware, even with applications with little or no logical contention. For example, the traditional way of referencing (let us call it the ‘traditional reference counter’), which is currently used in the Linux operating system for the page cache, uses a single shared atomic counter. By using atomic operations, an object can be safely referenced even when multiple threads update the counter at the same time. Traditional reference counters, however, degrade the performance of applications on multicores due to excessive atomic operations on a shared counter.

∗Contact author and principal investigator
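The traditional scheme can be sketched in plain C11 with stdatomic (the kernel's actual atomic_t API differs; names here are illustrative). Simplicity and cheap zero detection come at the cost of every core contending on one cache line:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* A single shared atomic counter per object: simple and space-optimal,
 * but every REF/UNREF from every core hits the same cache line. */
struct object {
    atomic_long refcount;
};

void obj_ref(struct object *o)
{
    atomic_fetch_add(&o->refcount, 1);   /* REF: one atomic RMW */
}

void obj_unref(struct object *o)
{
    atomic_fetch_sub(&o->refcount, 1);   /* UNREF: one atomic RMW */
}

/* Zero detection is a single atomic load: cheap query overhead. */
bool obj_unused(struct object *o)
{
    return atomic_load(&o->refcount) == 0;
}
```

Under heavy concurrent REF/UNREF traffic, the read-modify-write operations serialize on the shared line, which is exactly the counting-overhead bottleneck described above.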

In order to be a good reference counter for concurrent applications, there are important properties to consider: 1) updates on reference counters must be scalable, 2) reading an accurate (zero or positive) counter value should be cheap, 3) reference counters should be space-efficient, and 4) all of these should be guaranteed without incurring extra delay to manage reference counters. We denote the overheads associated with these four properties as counting overhead, query overhead, space overhead, and time overhead, respectively.

Counting overhead. The counting overhead, which is the most important property for scalable counting, indicates the cost of updating (REF/UNREF) a reference counter itself when there is a heavy load on referencing an object. Since the counting overhead is a crucial hurdle for achieving scalability, all reference counting techniques strive hard to eliminate it first. The traditional reference counter, which uses a single shared counter, has the highest counting overhead due to the hardware-based synchronization bottleneck [13].

Query overhead. The query overhead measures the cost of the query operation, which checks whether the reference counter of an object is zero so that the object can be safely reclaimed from memory. The traditional technique can detect zero by reading a single atomic counter.

Space overhead. The space overhead indicates how much space a technique uses for reference counting. In terms of space overhead, the traditional reference counter is a (de facto) optimal technique since it does not require any data structure other than one atomic counter per object.

Time overhead. The time overhead represents any delay, other than the counting overhead, introduced by a reference counting technique to manage all the data structures it maintains. The traditional reference counter has minimal time overhead since it maintains only per-object atomic counters. However, some reference counting techniques that exploit distributed local reference caches have synchronization overhead between local counters and a global counter. This synchronization plays two roles: 1) the global counter becomes ready (i.e., up-to-date) for zero detection and 2) the local counter, if it resides in a hash table, can be reclaimed. We generally denote this type of overhead as the time overhead.

Our analysis of prior proposals (§2.1) suggests that it is challenging to achieve all four properties, possibly due to tradeoffs between different properties. In this work, we propose pay-as-you-go (PAYGO) reference counting, which ensures scalable counting and space efficiency with negligible time overhead. Although based on the well-known per-core hash technique, PAYGO introduces a novel concept of an anchor counter that enables the immediate reclamation of zero-valued local counter entries, which is pivotal to reducing the forceful eviction of conflicting hash entries when the number of objects accessed on a core becomes large. This instant reclamation is a critical feature for escaping performance degradation that may otherwise occur, due in large part to the heavy cost of operations for resolving collisions, including forceful evictions.

We implemented PAYGO and applied it to the page cache in Linux, leaving existing file systems almost intact. To demonstrate the applicability of PAYGO to user applications, we also implemented new PAYGO system calls that can be used for reference counting user-level objects. Evaluation results with various underlying file systems (i.e., ext4, F2FS, btrfs, and XFS) demonstrated that PAYGO shows substantial improvements over state-of-the-art reference counting techniques.

2 Related Work and Motivation

2.1 Related Work

There have been many proposals attempting to address some of the properties introduced in §1, and the techniques available so far utilize at least one of the following features:

Contention distribution. One of the major factors impeding the scalability of a reference counter is cache line contention: updating the same reference counter atomically from many threads results in high contention. SNZI [14] mitigates the contention by dispersing the reference counters at compile time. It manages distributed counters using a binary tree of fixed depth. While it shows better scalability than the traditional reference counter, it is still slower than other techniques due to possible contention on a particular counter that changes frequently. However, it can perform zero detection in constant time by checking the indicator of the root node in the binary tree, although determining the global count value is impossible. Other techniques [1, 18] alleviate the contention problem by distributing reference counters according to the degree of contention at runtime, but they judge the degree of contention empirically, so they cannot relieve the contention under general workloads where the degree of contention can hardly be predicted. Carrefour [12] also distributes contention dynamically, but hardware profiling is required to monitor memory traffic. Proposals in this category still rely on atomic instructions for updating reference counters, so they seldom achieve linear scalability.

Cache affinity. Another factor that hinders the scalability of reference counters is cache misses. To reduce cache misses, a local reference counter is used: an object has per-core local counters, and updates are made to the local counters nonatomically. The downside of this approach is the overhead of summing all local counter values to obtain the global count. To alleviate this side effect, the global count can be obtained in advance and stored in a central counter [9, 5, 10], which incurs extra time overhead. The sloppy counter [9, 5] updates only the local counter if the updated value is less than a certain threshold. If the value exceeds the threshold, the local counter value is transferred to the central counter; the central counter is therefore an approximation of the global count. Before transferring the local counter value, the technique acquires the global lock for the central counter, which incurs extra time overhead. percpu_ref [10], a variant of the sloppy counter implemented in Linux for managing memory objects in several device drivers, also primarily changes the local counter. Techniques exploiting cache affinity have the counting-query tradeoff: nonatomic updates on local reference counters earn good scalability in exchange for a longer query time to read the global count by collecting the sum of the local counters. They also have poor space efficiency due to the per-object, per-core local counters.
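The sloppy-counter idea described above can be sketched as follows (a user-space model with illustrative names and threshold, not the Linux implementation):

```c
#include <pthread.h>

#define NCORES    8
#define THRESHOLD 16   /* spill point; illustrative value */

/* Sloppy-counter sketch: each core mostly updates its private local
 * counter; only when the local value crosses THRESHOLD is it
 * transferred to the central counter under a global lock. */
struct sloppy {
    long            central;        /* approximation of the global count */
    pthread_mutex_t lock;           /* protects central */
    long            local[NCORES];  /* per-core, updated nonatomically */
};

void sloppy_add(struct sloppy *s, int core, long delta)
{
    s->local[core] += delta;
    if (s->local[core] >= THRESHOLD || s->local[core] <= -THRESHOLD) {
        pthread_mutex_lock(&s->lock);   /* the extra time overhead */
        s->central += s->local[core];
        s->local[core] = 0;
        pthread_mutex_unlock(&s->lock);
    }
}

/* Reading an exact global count must sum central + all locals:
 * the counting-query tradeoff in action. */
long sloppy_read(struct sloppy *s)
{
    long sum;
    int c;

    pthread_mutex_lock(&s->lock);
    sum = s->central;
    for (c = 0; c < NCORES; c++)
        sum += s->local[c];
    pthread_mutex_unlock(&s->lock);
    return sum;
}
```

Updates are cheap and core-local most of the time, but an exact read costs O(M) in the number of local counters, matching the query overhead listed in Figure 1.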

Per-core hash. To improve on the space overhead of cache affinity-based techniques, recent years have seen attempts to use per-core hashes of reference caches that fulfill the main duty of reference counting with much less space overhead [7, 4, 3]. They substantially reduce the space overhead by using a per-core hash that keeps local reference counters only for those objects in use. Techniques based on per-core hash inevitably face the problem of reclaiming a hash table entry whose local counter is zero (i.e., the corresponding object is not in use). Existing techniques address this using quiescent period-based synchronization, which is widely used in Linux to reclaim objects, as in read-copy-update (RCU) [16]. Reference counting algorithms exploiting per-core hash with quiescent period-based synchronization cannot avoid the space-time tradeoff: they achieve better space efficiency in exchange for time overheads, not only in synchronization between local and global counters but also in hash entry reclamation. RefCache [7], one of the quiescent period-based techniques, manages its local counters in per-core hash, and the counter values are flushed

(Figure 1 also includes a chart plotting the four overheads (counting, query, space, time) for Ideal, PAYGO, RefCache [7], Sloppy counter [5], SNZI [14], and Traditional, annotated with the counting-query and space-time tradeoffs.)

                  Traditional∗   SNZI           Sloppy counter    RefCache                  PAYGO
Counting          atomic ops.    atomic ops.‡   global lock       −                         −
overhead
Space overhead†   O(N)           O(M·N)         O(M·N)            O(M·C+N)                  O(M·C+N)
Query overhead§   O(1)           O(1)           O(M)              O(1) + 2·epoch            O(M)§§
Time overhead     −              −              every threshold   every epoch and collision −

∗ A single atomic reference counter.
† N: # of objects, M: # of local counters per object, C: # of hash entries.
‡ SNZI recursively updates the counter of the parent node whenever the counter of the child node changes from 0 to 1 and vice versa.
§ Time to determine if the reference counter of a single object is zero or not.
§§ PAYGO has practically less query overhead than Sloppy counter (§3.4).

Figure 1: A comparison of reference counting techniques under workloads accessing a shared counter.

into a central counter every epoch. OpLog [4] generalized RefCache's idea by using operation logs for the shared data structure with a local timestamp; per-core logs are applied to the data structure in descending order of timestamp.

2.2 Motivation

The design of PAYGO is motivated by two observations:

Observation 1. Our analysis of the existing algorithms in §2.1 is summarized in Figure 1. Noticeable is the observation that attaining counting scalability (i.e., low counting overhead) demands sacrificing two other properties, due to the counting-query and space-time tradeoffs. By escaping those tradeoffs we can attain more desirable properties; for example, escaping the space-time tradeoff enables us to keep both space and time overheads low while achieving scalability.

Observation 2. Another, more important observation is that excessive time overhead may eventually incur severe performance degradation. As described in §2.1, techniques based on per-core hash sacrifice time overhead to reduce space overhead. The time overhead under consideration in such techniques is the overhead of reclaiming hash entries: when the number of objects accessed on a core becomes large, frequent hash collisions occur, and forceful evictions of the conflicting hash entries must be exercised to make room for newly accessed objects. The eviction of a hash entry needs to flush the local counter value to the global counter and therefore causes additional synchronization overhead between local and central counters.

For example, RefCache [7], designed for a new virtual memory system, is perfectly scalable when n threads repeatedly perform mmap/munmap on a single shared physical page (see Figure 8 in [7]), since the forceful eviction of hash entries due to collisions seldom occurs and so may not be a serious design consideration in virtual memory systems. However, if we use RefCache in page caches under file system benchmarks that may access far more objects, frequent evictions, which internally acquire/release object locks to protect the flush of local counter values to the central counters, may result in serious time overhead, leading to performance degradation. To confirm this, we conducted a preliminary experiment that measures the throughput of RefCache for page caches. The experimental environment is described in §6.1, and we ran the FXMARK microbenchmark so that 96 threads read 64 bytes (or 4 KiB) from a shared file, with a sequential access pattern and varying file sizes. Figure 2 shows the result, which confirms our conjecture. The throughput decreases as the file size (i.e., the number of objects) gets bigger, due to increased hash collisions triggering more forceful evictions. We found that hash collisions start occurring from the point where the file size is 1 MiB (i.e., 256 objects) and become excessive as the file size increases. Hence, reclaiming garbage hash entries in a timely manner is critical for avoiding the performance degradation of per-core hash-based reference counting techniques.

Figure 2: Performance of RefCache: hash table size = 4,096 entries (default size), ext4 file system. (Plot: throughput in Mops/sec vs. shared file size from 1 MiB to 64 MiB, for read units of 64 B and 4 KiB.)

The above observations lead us to conclude that escaping the space-time tradeoff is crucial for scalable reference counting techniques. By escaping the space-time tradeoff, we can achieve truly scalable counting while keeping both space and time overheads low, which is the design goal of PAYGO, whose comparative properties are depicted in Figure 1.


3 PAYGO

Of the counting-query and space-time tradeoffs, we aim at escaping the space-time tradeoff while embracing the other. PAYGO achieves this by using a per-core hash-based reference cache with a new technique called anchoring. PAYGO is designed on the following assumptions: (i) objects are referenced and unreferenced by the same process, and (ii) the lifetime of references is reasonably short, so as not to put the static size of the per-core hash in jeopardy (see §3.5).

3.1 Design Overview

Design rationale. To escape the space-time tradeoff, per-core hash-based techniques must ensure the safety condition that a local reference cache entry can be reclaimed immediately upon the release of all references to an object, all without sacrificing other properties. In this respect, RefCache earned counting scalability in exchange for the increased time overhead required for reclaiming obsolete cache entries. To address this issue, the main design rationale behind PAYGO lies in a simple goal: we make a local reference cache zero (i.e., ready to be reclaimed) right after all references are released. To this end, we enforce the restriction that a process, once it references an object, must be anchored to the original reference cache and must report any unreferencing of the object there, irrespective of which core the process runs on. For this purpose, PAYGO's per-core hash entry consists of a local counter field and an anchor counter field, and the sum of the two represents the local count for an object initially accessed on that core.

Access rules. First, we establish ground rules for accessing a pair of local and anchor counters to preserve correctness, that is, ‘never miscount’. The access rules for REF and UNREF are as follows. For the REF operation, a process always accesses a local reference cache and updates the local counter field nonatomically. At this point, the process is logically anchored to this core (its homeland), and the anchor core ID is recorded in its task struct. For the UNREF operation, whether to act on the local or the anchor counter depends on whether the process has migrated between REF and UNREF: if the process remains on the same core (the homeland), UNREF is done on the same local counter nonatomically. Otherwise, the migrated process atomically updates the anchor counter of the original reference cache on the homeland core. The use of an atomic operation on an anchor counter ensures correct counting even with multiple processes in concurrent environments.

We summarize the access rules in Table 1; they ensure that a local reference cache becomes zero upon the completion of REF/UNREF operations. This allows PAYGO to reclaim zero-valued local reference caches immediately from the hash, thus retaining the hash space efficiency without time overhead. The common rule governing both REF and UNREF is that we disable preemption while performing the two operations in

Table 1: Access rules for REF and UNREF from homeland and foreign land. (X: nonatomic, X©: atomic, ×: no-op)

                  local counter        anchor counter
Type              REF       UNREF      REF       UNREF
Homeland          X         X          ×         ×
Foreign land      ×         ×          ×         X©
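The rules in Table 1 can be sketched in C as follows (a simplified model with illustrative names; the actual kernel code appears in Figure 4, and preemption would be disabled around both operations):

```c
#include <stdatomic.h>

/* Sketch of the Table 1 access rules; names are illustrative. */
struct pg_entry {
    long        local_counter;   /* homeland only, nonatomic */
    atomic_long anchor_counter;  /* shared with migrated processes */
};

/* REF: always bump the running core's local counter and remember the
 * homeland core ID (recorded in the task struct in the real system). */
void pg_ref(struct pg_entry *homeland, int *anchor_core, int this_core)
{
    homeland->local_counter++;           /* nonatomic (X) */
    *anchor_core = this_core;
}

/* UNREF: decrement the local counter if still on the homeland core;
 * otherwise atomically decrement the homeland entry's anchor counter. */
void pg_unref(struct pg_entry *homeland, int anchor_core, int this_core)
{
    if (this_core == anchor_core)
        homeland->local_counter--;                      /* nonatomic (X) */
    else
        atomic_fetch_sub(&homeland->anchor_counter, 1); /* atomic (X©) */
}
```

Either way, the sum local_counter + anchor_counter of the homeland entry returns to zero once all references are released, which is what makes immediate reclamation possible.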

(Figure 3 diagram: each per-core PAYGO entry packs an object pointer, a local counter (LC), an anchor counter (AC), and anchor core IDs into a cache line; per-process anchor info is kept in the task struct. Shown states: REF at core 0 sets (LC, AC, ID) = (1, 0, 0); (i) UNREF at the same core yields (0, 0, 0); (ii) UNREF at a different core yields (1, −1, 0).)

Figure 3: An overall structure of PAYGO and the state of the structure when referencing and unreferencing an object.

order to prevent harmful data races on a local counter. Of course, there are other ways of doing this, such as kernel spin locks (i.e., spinlock_t), but disabling/enabling preemption is by far the fastest method we found suitable for our purpose, and it has been used in prior work [7, 10]. An in-depth performance analysis is presented in §6.2.5.

Overall structure of PAYGO. Next, we describe the overall structure of PAYGO. Figure 3 shows the structure of PAYGO and the state of the data when an object is referenced and unreferenced. For each core, there is a per-core hash of reference caches, each entry of which consists of an object pointer and two counters: a local counter and an anchor counter. The space overhead for this hash table is much smaller than that of the Linux sloppy counter and larger than that of the traditional one, but similar to that of RefCache. Given the hash of reference caches, the UNREF operation atomically decreases the anchor counter of the anchored core only when process migration occurs, per the access rules in Table 1. To do this, each process stores anchor information that bookkeeps the core IDs on which an object was referenced. The anchor information internally maintains multiple anchor core IDs to deal with the case where a process references an object multiple times on different cores without unreferencing it. The matching anchor core ID is removed after UNREF is done on the corresponding core. Unlike the hash, PAYGO requires extra memory for storing anchor information in the task structure, and this is surely regarded as additional memory overhead.

The right side of Figure 3 shows the state of the data when a process references and unreferences an object. When a process references an object at core 0, it raises the local counter of core 0 and records core 0 in the process's anchor
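The structures just described might look roughly like the following (field and type names are our own, chosen for illustration; the kernel packs the entry into a cache line as shown in Figure 3):

```c
#define HASH_ENTRIES 4096   /* illustrative table size */
#define MAX_ANCHORS  8      /* illustrative bound on outstanding anchors */

/* One reference-cache entry: object pointer plus the two counters. */
struct paygo_entry {
    void *obj;              /* object this entry counts */
    long  local_counter;    /* nonatomic, homeland core only */
    long  anchor_counter;   /* atomic in the real code (migrated UNREFs) */
};

/* One hash table of reference caches per core. */
struct paygo_percore {
    struct paygo_entry hash[HASH_ENTRIES];
};

/* Per-process anchor information kept in the task structure. */
struct anchor_info {
    void *obj;              /* which object was referenced */
    int   core;             /* anchor core ID recorded at REF time */
};

struct task_anchors {
    struct anchor_info anchors[MAX_ANCHORS];
    int n;                  /* number of outstanding anchors */
};

/* An entry is reclaimable exactly when its two counters cancel out. */
int entry_is_free(const struct paygo_entry *e)
{
    return e->local_counter + e->anchor_counter == 0;
}
```

The key invariant is in entry_is_free: because UNREF always lands on the homeland entry (via the local or the anchor counter), the entry's sum returns to zero as soon as the last reference is dropped.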


information. When the process unreferences the object, it first searches for the current core ID in its anchor information. If found, the process decreases the local counter of the current core; otherwise, it atomically decreases the anchor counter of a core recorded in the anchor information.

3.2 PAYGO Operations

PAYGO has three operations: the REF/UNREF operations to increase/decrease a reference counter, and the READ-ALL operation to read the global value of the reference counter, which is equivalent to the query operation.

REF operation. When a REF operation on an object is invoked, it finds the PAYGO entry for the object in the hash of the current core. If the PAYGO entry is found, its local counter is increased. If the PAYGO entry is not found, a new PAYGO entry is created in the hash, the local counter is increased, and the current core ID is stored in the process's anchor information. The REF operation is executed while preemption is disabled, to prevent multiple processes from updating the same hash entry concurrently.

UNREF operation. When an UNREF operation on an object is invoked, it first checks the anchor information of the process. If the core ID stored in the anchor information is the same as the current core, the process finds the PAYGO entry for the object in the hash of the current core and decreases the local counter. If the process has migrated to another core, it finds the PAYGO entry for the object in the hash of the anchored core and atomically decreases the anchor counter. The UNREF operation is also performed with preemption disabled, for the same reason as the REF operation.

READ-ALL operation. When a READ-ALL operation is invoked, it finds all the PAYGO entries for the object in all per-core hash tables and computes the sum of the local and anchor counters of all valid PAYGO entries. The READ-ALL operation is performed while preemption is disabled, in order to prevent any scheduling delays that may slow down the process. Nevertheless, this does not guarantee reading the correct sum, since REF and UNREF operations may modify the counters during the READ-ALL operation. Object reclamation therefore needs a delicate design (§3.3).
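A simplified model of READ-ALL follows (illustrative names and a toy hash function; the real code runs with preemption disabled and, as noted above, may still race with concurrent REF/UNREF):

```c
#include <stdint.h>

#define NCORES     4      /* illustrative core count */
#define TABLE_SIZE 4096

/* Per-core reference-cache entries, as in Figure 3. */
struct cache_entry {
    void *obj;
    long  local_counter;
    long  anchor_counter;
};

struct cache_entry percore_hash[NCORES][TABLE_SIZE];

/* Toy stand-in for the hash function H() used in Figure 4. */
unsigned long H(void *obj)
{
    return ((unsigned long)(uintptr_t)obj >> 4) % TABLE_SIZE;
}

/* READ-ALL: sum local + anchor counters of the object's entry in every
 * core's hash table. */
long read_all(void *obj)
{
    long sum = 0;

    for (int c = 0; c < NCORES; c++) {
        struct cache_entry *e = &percore_hash[c][H(obj)];
        if (e->obj == obj)
            sum += e->local_counter + e->anchor_counter;
        /* A per-core subtotal is never negative (§3.4), so a pure
         * zero-detection query may stop early on any positive sum. */
    }
    return sum;
}
```

The cost is O(M) in the number of per-core tables, which is the query overhead PAYGO pays (Figure 1) in exchange for immediate entry reclamation.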

3.3 Object Reclamation

Objects are the target of reference counting, and operating systems often reclaim objects that are not referenced by any process in order to keep memory pressure under control. Once an object is chosen to be reclaimed, the reclaiming process should prevent any additional references to the object and check again that the reference counter is zero. In traditional reference counting, this can be done atomically by comparing the shared atomic counter with zero and swapping it to a negative value. The synchronization used in the

(a) REF operation (local counter: 0 → 1):

    rcu_read_lock();
    while ((rcu_pagep = radix_tree_lookup_slot())) {
        if (!(page = radix_tree_deref_slot(rcu_pagep)))
            break;
        preempt_disable();
        this_cpu->hash[H(page)]->local_counter++;
        add_anchor_info(current_task, page, this_cpu);
        preempt_enable();
        if (get_flag(page) || is_object_removed(page)) {
            UNREF(page);
            continue;
        }
    }
    rcu_read_unlock();

(b) UNREF operation (anchor counter: 0 → −1):

    /* (migrated to another core) */
    preempt_disable();
    anchor = find_last_anchor(current_task, page);
    if (anchor.cpu == this_cpu)
        this_cpu->hash[H(page)]->local_counter--;
    else
        atomic_dec(&anchor.cpu->hash[H(page)]->anchor_counter);
    delete_anchor_info(current_task, anchor);
    preempt_enable();

(c) READ-ALL operation (reclaimer):

    ...
    set_flag(page);
    if (READ_ALL(page)) {
        clear_flag(page);
        return fail;
    }
    delete_object(page);
    clear_flag(page);
    ...

Figure 4: Code snippets of how PAYGO's REF, UNREF and READ-ALL are implemented and used in the Linux page cache, where H() is a hash function.

traditional method is based on an atomic read-modify-write operation (e.g., CMPXCHG).

PAYGO needs more steps to handle this case correctly. Since the READ-ALL operation cannot acquire the sum in one snapshot, it uses a flag to indicate its commencement, which helps prevent additional references to the object. The synchronization method we use here is based on the read-after-write (RAW) pattern [2]. Important to notice is the invariant that at least one of a reclaiming process and the referencing processes, if they run concurrently, must detect both events and then retreat for safety, thus never allowing a harmful data race. We enforce these checking conditions at the end of the REF and READ-ALL routines.

Figure 4 shows the code snippets of how the REF, UNREF and READ-ALL operations are implemented in the Linux page cache, with a special flag indicating that the current page is being accessed for reclamation. Accessing the special flag may


cause contention only if the same page is repeatedly reclaimed (or flushed, in the Linux page cache) while many processes read it, which we seldom, if ever, witnessed in Linux. In Figure 4a, the code executed while preemption is disabled constitutes the REF operation. In Figure 4b, the whole snippet is the UNREF operation. The READ-ALL operation, which iterates over all cores' hashes, finds the PAYGO entries and collects their sum, again with preemption disabled, is shown only as a function call in Figure 4c. Notice that there is additional code around the REF and READ-ALL operations for the correct implementation of reclaiming page caches.

As shown in Figure 4a, the entire routine for referencing a page is protected by rcu_read_lock() and rcu_read_unlock(). The REF operation starts by obtaining an RCU reference to the page. Once it obtains the RCU reference, it retrieves the page object and then performs the REF operation. After that, the flag is checked to see if a reclaim is being attempted. If the flag is clear, the page is then checked for whether it has been removed; this makes sure that the page was not already reclaimed before the flag was checked. Only if both conditions pass is the page safely referenced. If the flag is set, the process retries until the reclaiming process clears the flag; if the page was already removed, the referencing process fails. For the reclaiming process, the READ-ALL operation is performed after setting the flag. If the page is not referenced by any thread, the page object is safely removed and the flag is cleared. If the page is already referenced by another thread, it is not removed and the flag is cleared, thus failing to reclaim the page object.

3.4 Anchoring in Action

Reference counting techniques exploiting per-core hash, such as RefCache [7], allow processes to update the nonatomic local counters of the running core. This means that when a process at core 0 increases the local counter of core 0, migrates to core 1, and decreases the local counter of core 1, we end up with two local counters with values of 1 and -1, respectively. Such spread-out local counters are problematic if per-core hash is used to reduce space overhead. As an example, RefCache uses background threads to flush the local counters every epoch, which inevitably delays the reclamation of zero-valued reference cache entries.

The anchoring technique in PAYGO forces the REF and UNREF operations to act on the same PAYGO entry, thus guaranteeing that the sum of its local and anchor counters eventually becomes zero. Any zero-valued PAYGO entry can be recycled immediately when another REF operation accesses the same hash bucket. Figure 5 shows an example of an object accessed by multiple threads in a system with four cores. At core 0, a red thread references and unreferences the object by increasing and decreasing the local counter of core 0. At core 1, a blue thread followed by a green thread reference the object. Then, a yellow thread also references the object at core 1. Since the yellow thread is using core 1, the blue thread and the green thread have to migrate to other cores (namely, core 2 and core 3, respectively), and unreference the object using the same anchor counter of core 1. As shown in this example, an anchor counter runs the risk of being modified by multiple threads in parallel, so we use an atomic operation.

(Figure 5 diagram: four cores; local counters are updated with plain addl/subl, while the migrated threads decrement core 1's anchor counter with an atomic lock; subl.)

Figure 5: Usage of an anchor counter.

The anchoring technique gives us another opportunity to reduce the query overhead. Since the sum of the local and anchor counters on a core can never be negative, during zero detection (i.e., a query), upon seeing a positive per-core sum we can immediately stop the zero detection, safely concluding that the object is currently being used by at least one process.

Discussion. Since decreasing an anchor counter uses an atomic operation, there is a performance concern when many processes all access the same anchor counter, thus hitting the hardware-based synchronization bottleneck. This worst case can occur when processes are migrated frequently between REF and UNREF. But the general design rationale of OS schedulers usually inhibits such frequent process migration unless there are compelling reasons, such as severe load imbalance.

Nonetheless, the chance of a process migrating increases as the interval between REF and UNREF grows. Even then, atomic operations on anchor counters would not noticeably hurt performance, since the price of process migration is much larger than the pure cost of reference counting itself. Hence, the performance degradation caused by atomic operations is negligible (see anchoring overhead in §6.2.3). To alleviate any possible bottleneck on the same anchor counter, the OS scheduler can give temporary CPU affinity to processes that are in between REF and UNREF, preventing process migration.

One may raise a concern about the overhead of searching for the matching core ID in anchor information when a process references an object multiple times, or references numerous objects, without unreferencing. Since PAYGO stores the same anchor ID in anchor information even when the same object is referenced again, this would surely hurt performance due to the search cost, but we have not yet discovered such cases inside file systems or data management systems. If such a case is found, augmenting an additional search structure would be necessary.

84 17th USENIX Conference on File and Storage Technologies USENIX Association


3.5 Table Overflow

The table overflow problem of hash tables is a fundamental issue that per-core hash-based counting techniques should address. In the context of reference counting, table overflow occurs when there are a large number of live objects. For instance, if a process opens many files, then the corresponding dentry objects stay alive in the per-core hash until closed. Conventional methods, such as table doubling, are hard to use or to design efficiently, due mainly to high concurrency. We deal with overflow in a way similar to how Linux swap space is managed. First, an object that uses PAYGO has a list, called an overflow counter list, protected by an object lock. When a live object needs to be evicted from the per-core hash, we acquire the object lock, evict the entry from the hash, add the evicted counter information to the overflow counter list, and then release the lock. Later, the owner process of the evicted entry can reload the evicted counter information from the overflow list while holding the object lock. Further improvements can be made to the shared overflow list, but we hold off until it really matters since 'premature optimization is the root of all evil' [15]. What really matters here is the lifetime of the referenced object, and the concerned place (i.e., the page cache) suffering bottlenecks has short-lived objects whose lifetimes begin and end inside read/write system calls. PAYGO scales file operations well under such conditions.
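A minimal sketch of this eviction/reload protocol, assuming a hypothetical `evicted_counter` record and a test-and-set flag standing in for the kernel's object lock (the field names are ours, not PAYGO's actual data structures):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Counter state carried by an evicted per-core hash entry. */
struct evicted_counter {
    int  core;                     /* core whose slot was evicted */
    long local;                    /* its local counter value     */
    long anchor;                   /* its anchor counter value    */
    struct evicted_counter *next;
};

/* Each counted object keeps an overflow list guarded by a lock,
 * similar in spirit to how Linux manages swap space. */
struct counted_object {
    atomic_flag lock;              /* stand-in for the object lock */
    struct evicted_counter *overflow;
};

static void obj_lock(atomic_flag *l)   { while (atomic_flag_test_and_set(l)) ; }
static void obj_unlock(atomic_flag *l) { atomic_flag_clear(l); }

/* Evict: move counter state out of the per-core hash into the list. */
static void evict_to_overflow(struct counted_object *o,
                              int core, long local, long anchor)
{
    struct evicted_counter *ec = malloc(sizeof(*ec));
    ec->core = core; ec->local = local; ec->anchor = anchor;
    obj_lock(&o->lock);
    ec->next = o->overflow;        /* push onto the shared list */
    o->overflow = ec;
    obj_unlock(&o->lock);
}

/* Reload: the owner later pulls its state back into the hash. */
static int reload_from_overflow(struct counted_object *o, int core,
                                long *local, long *anchor)
{
    int found = 0;
    obj_lock(&o->lock);
    for (struct evicted_counter **p = &o->overflow; *p; p = &(*p)->next) {
        if ((*p)->core == core) {
            struct evicted_counter *ec = *p;
            *local = ec->local; *anchor = ec->anchor;
            *p = ec->next;         /* unlink and free the entry */
            free(ec);
            found = 1;
            break;
        }
    }
    obj_unlock(&o->lock);
    return found;
}
```

Since eviction is expected to be rare (it matters only for long-lived objects), taking a per-object lock here does not reintroduce the fast-path contention PAYGO removes.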

4 PAYGO Implementation

We implemented PAYGO in Linux kernel version 4.12.5 and applied it to the page cache, which affects the many concrete file systems suffering scalability issues. For the experiments, we took the code base implementing RefCache and SNZI from sv6 [6] and adapted it to the Linux page cache. Notably, we observed that other latent contention often arises after PAYGO eliminates contention on reference counters.

The Linux page cache is implemented using a radix tree, and its operations are made lockless for performance benefits [20]. However, read system calls using the page cache still have scalability issues, such as the use of an atomic reference counter to synchronize between reading a page from the page cache (REF/UNREF) and flushing the page from memory to storage (READ-ALL). Therefore, threads trying to read the same page contend on the same reference counter and perform poorly under such loads [17].

Figure 6: User-level PAYGO in Linux. (User threads invoke sys_ref(void* obj) or sys_unref(void* obj); in kernel mode, the call references or unreferences the object via REF(obj, pid)/UNREF(obj, pid) bracketed by preempt_disable()/preempt_enable().)

In the original implementation, the reference counter in a page (_refcount) has two purposes. First, it is used as a status variable. If the value is zero, the page is unused. If the value is two, the page is active and stored in the page cache, but not referenced by any threads. A _refcount value above two is used as a reference counter. For example, a _refcount value of three indicates that one process is referencing the page. Here, we leave _refcount to be used as a status variable and use PAYGO to replace the referencing part of _refcount.
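The status-variable interpretation can be summarized by a small hypothetical helper (illustrative only; the kernel consults the field in place rather than through such a function):

```c
#include <assert.h>

/* Interpreting the page _refcount as described above:
 *   0      -> page is unused
 *   2      -> page is cached but not referenced by any thread
 *   2 + n  -> page is cached and referenced by n processes
 * Returns the number of current users implied by the count. */
static int page_users(long refcount)
{
    if (refcount <= 2)
        return 0;              /* unused or cached-only: no users */
    return (int)(refcount - 2);
}
```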

5 User-Level PAYGO

PAYGO, although motivated by pressing concerns in file systems and intended to address them, can easily be extended into a user-level reference counting method for applications above the kernel. Scalable user-level reference counting is indeed in demand, since there are many latent use cases where contention may arise once the present performance matters are all cleared away. For example, managed language runtimes (such as the Java runtime) use reference counting for collecting garbage objects, and the same is true in the database land; popular database systems, such as HBase [21], RocksDB [22] and MariaDB [23], also use reference counting for managing memory objects. To the best of our knowledge, they use either hardware-based atomic operations or lock-based synchronization primitives to safely orchestrate concurrent accesses to the shared counter variables, but both methods are vulnerable to performance bottlenecks in highly concurrent environments.

To let applications benefit from the better scalability of PAYGO, we implement three system calls, sys_ref(), sys_unref() and sys_readall(), which enable applications to exploit core kernel-level PAYGO operations without difficulty for user-level reference counting (Figure 6). Despite the inherent overhead of switching between user mode and kernel mode, reference counting on user-level objects through PAYGO system calls enables applications to achieve far better scalability than their legacy reference counting techniques. Furthermore, PAYGO exhibits much less overhead for managing garbage entries in the per-core hash, a required feature especially when applications hold a large number of live references.

Figure 7: Scalability comparison under strongly contending workloads: the Linux page cache. (Throughput in Mops/sec over 1-96 threads for Vanilla, SNZI, RefCache and PAYGO on ext4, btrfs, F2FS and XFS.)

Enabling applications to directly exploit the reference counting technique in the kernel via system calls poses two nontrivial issues. First, the system call overhead should be sufficiently minimized to benefit from the original performance of the kernel-level reference counting technique. To this end, we make PAYGO system calls lightweight such that they just wrap core kernel-level PAYGO routines with preempt_disable()/preempt_enable() executed beforehand/afterward. The wrapped routines here basically refer to the code segments bounded by preempt_disable() and preempt_enable() in Figure 4. One subtle matter is that we have to transform the virtual address of a user object into a unique one inside the per-core hash, by combining it with the pid of the user process. Second, since applications are not as reliable as the kernel, the abnormal termination of an application may leave the kernel data structures for reference counting incorrect. When an application terminates after referencing an object but before unreferencing it, the corresponding counter in the kernel can never become zero. To resolve the problem, when a process terminates, its task_struct needs to be checked to detect any left-over counter values in the corresponding PAYGO entries in the per-core hash tables. Any such left-over counters, if found, must be decreased.
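The address-plus-pid transformation might be sketched as follows; we assume a splitmix64-style mixing function for bucket dispersion, and the helper name `paygo_user_key` is ours (the kernel's actual key derivation is not specified in the paper):

```c
#include <assert.h>
#include <stdint.h>

/* User-level PAYGO must key its per-core hash by something unique
 * across processes: the same virtual address can name different
 * objects in different address spaces. One hypothetical scheme
 * mixes the object address with the caller's pid. */
static uint64_t paygo_user_key(const void *obj, uint64_t pid)
{
    uint64_t k = (uint64_t)(uintptr_t)obj ^ ((pid << 48) | pid);
    /* splitmix64-style finalizer to spread keys across buckets */
    k ^= k >> 30; k *= 0xbf58476d1ce4e5b9ULL;
    k ^= k >> 27; k *= 0x94d049bb133111ebULL;
    k ^= k >> 31;
    return k;
}
```

Two processes mapping the same virtual address thus land on distinct hash entries, while repeated references by one process deterministically find the same entry.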

6 Evaluation

In this section, we measure the overall performance and scalability of PAYGO, especially in the page cache, and compare it with other reference counting techniques, including RefCache, SNZI and the traditional reference counter, under various file systems. To analyze the performance of user-level PAYGO, we compare PAYGO with existing user-level reference counting techniques.

6.1 Experimental Setup

We perform all of the experiments in Linux kernel version 4.12.5 on our 96-core system equipped with four 24-core Intel Xeon E7-8890 v4 CPUs and 1 TiB of DDR4 DRAM. We run the FXMARK microbenchmark [17] with a RAM disk and filebench [11, 25] with a Samsung SM1725 NVMe SSD. To show the general applicability of PAYGO, we conduct experiments under four different file systems (i.e., ext4, btrfs, F2FS and XFS). In ext4, we used the default journaling mode and did not see the lock contention in the journaling subsystem observed in the prior study [17]. Page structures cached in memory are freed before every experiment, and the Linux security module is turned off to avoid unrelated performance degradation.

6.2 Scalability

This section explores the multicore scalability of the concerned file systems under file system benchmarks, with the degree of contention varied from strong to weak. Our evaluation methodology follows approaches similar to [17]; the important metric is the number of REF/UNREF operations (i.e., file reads), with the degree of contention on reference counters controlled by the size of the files accessed by benchmark threads. Experiments under this controlled environment may reveal weaknesses and strengths of the tested schemes that were overlooked at the time they were proposed.

6.2.1 Strongly Contending Workloads

To evaluate the performance of file operations under strongly contending scenarios, we ran the shared block read workload (i.e., DRBH) in FXMARK, a microbenchmark intended to stress file systems. For the evaluation, a varying number of threads repeatedly read the same 4 KiB data block, thus stressing the reference counting part enormously. This workload is known to reveal the contention resilience of any reference counting approach, since stock Linux suffers the most here. Figure 7 shows the results. With this workload, all file systems under consideration in stock Linux (i.e., vanilla) undergo severe scalability bottlenecks arising from contention on the same reference counter. SNZI shows slight improvements over the vanilla scheme that uses the traditional reference counter. PAYGO and RefCache perfectly scale the throughput of ext4, btrfs and F2FS. The main reason for the slightly lower performance of PAYGO compared to RefCache is that PAYGO executes slightly more instructions than RefCache. By profiling clock cycles, we obtained a cycle difference that matches the performance difference observed here.

Interesting is the performance degradation consistently observed in XFS, primarily due to contention on the semaphore inside an inode structure, which completely renders all reference counting methods useless. Although further investigation is needed, it is worthwhile to redesign this coarse-grained locking so that XFS can reap performance benefits from better counting techniques.

6.2.2 Weakly Contending Workloads

To evaluate the performance of file systems under weakly contending scenarios, we used filebench, a benchmark that can flexibly add and test workloads for file systems and storage. Before running filebench, we modified its code to experiment with more flexibility on multicores. Originally, filebench is implemented with a lock for each file, and only one thread can access a file at a time. We eliminated the file lock so that multiple threads can access a file concurrently. For the evaluation, we ran the randomread workload with participating threads performing 64-byte random reads from one of ten 128 MiB files. Since weakly contending workloads disperse contention on reference counters, they may expose any latent overhead (or downside) of the given counting techniques that has been overlooked in exchange for resolving the high contention arising under strongly contending scenarios.

Figure 8: Scalability comparison under weakly contending workloads: the Linux page cache. (Throughput in Mops/sec over 1-96 threads for Vanilla, SNZI, RefCache and PAYGO on ext4, btrfs, F2FS and XFS.)

The throughput results are shown in Figure 8. Strikingly, the vanilla scheme deployed in the stock Linux page cache performs well once hotspot contention is reduced; it outclasses SNZI all the time and sometimes outperforms RefCache by a slight margin. As the thread count increases, the throughput gap between PAYGO and RefCache widens due to a large number of garbage entries that increase hash collisions in RefCache's per-core hash, which were not observed under strongly contending workloads.

Again, none of the tested counting techniques scales XFS, due to bottlenecks inside the file system itself; this will be discussed in detail in the following section.

6.2.3 In-Depth Analysis

To reveal detailed information about various system activities, we perform an in-depth analysis with moderately contending workloads profiled over different metrics.

Stressing the page cache. We first ran the randomread workload of the filebench microbenchmark, varying the number of files whose size is set to 32 MiB. We chose the moderately contending workload as a good proxy for stressing reference counting schemes with a reasonable balance between contention and the number of objects referenced. Figure 9 shows the throughput and the CPU breakdown of the benchmark.

First, the in-depth profiling gives clear explanations for two strange observations in XFS and SNZI. The first observation is the poor scalability of XFS. The main culprit is severe lock contention inside the file system: xfs_ilock() and xfs_iunlock() on the inode of a file. Lock contention mainly depends on the number of files, not the file size. High contention on the inode lock indeed leads to severe performance degradation regardless of the reference counting scheme. This perhaps needs attention from our community. The second observation is the poor scalability of SNZI, which has a similar culprit; it scales poorly regardless of the file system this time. Since the only publicly available code base for SNZI is the one in sv6 [6], we show the results as is.

The vanilla scheme scales the performance of ext4, btrfs and F2FS quite well as the thread count increases. Although the overhead of atomic instructions grows in proportion to the thread count, the dispersed contention cancels out the negative impact of atomic operations seen in Figure 7. With 72 threads, the vanilla scheme performs almost on a par with RefCache. An in-depth analysis of performance over different contention levels is given in the next experiment.

RefCache shows worse core scalability than PAYGO, mainly because of the increased overhead of handling hash collisions (i.e., atomic lock operations) in RefCache. Also noticeable is the slight performance degradation of vanilla, RefCache and PAYGO as the file count increases. This is due to the increased memory access overhead of reading files larger than cache memory, which is not observed in experiments with the same number of smaller files, although those results are omitted here due to space limitations.

Performance spectrum over degree of contention. We further investigate the performance spectrum over different contention levels to fully grasp the nature of the space-time tradeoff. For this evaluation, we modified the FXMARK benchmark so that 96 threads perform 64-byte sequential reads per 4 KiB page on a single file whose size is varied from 1 MiB to 64 MiB, with ext4 mounted. Figure 10 shows the performance spectrum of the three concerned schemes. The most noticeable result is the sharp throughput decrease in RefCache as the file size grows, which clearly shows the negative effect of a large time overhead on scalability and thus the necessity of instant reclamation of hash entries. PAYGO effectively addresses the problem and incurs no performance overhead for that issue. The gradual degradation of throughput in PAYGO is due to the file data overflowing cache memory, resulting in increased memory access overhead, which also occurs in the other schemes.



Figure 9: The performance and the CPU breakdown of file systems (i.e., column labels (a)-(d): ext4, btrfs, f2fs and xfs) with different reference counting schemes (i.e., row labels: Vanilla, SNZI, RefCache and PAYGO) under moderately contending workloads: the Linux page cache. (Each panel plots throughput in Mops/sec and CPU utilization broken down into User, Library, Misc., Kernel, Refcnt, Lock and Atomic, over 1-96 threads and 1/5/10 files.)

As we analyzed earlier, the vanilla scheme is ill-suited for the strongly contending condition (i.e., 1 MiB). But its performance rebounds quickly as soon as the degree of contention is alleviated, and it outperforms RefCache once it passes a break-even point (i.e., a 16 MiB file in our case). In-depth profiling reveals that the acquire/release of an object lock in RefCache to handle hash collisions incurs more overhead than the atomic operations in the vanilla scheme when hash collisions occur frequently due to a large number of objects accessed. After the break-even point, the throughput of the vanilla scheme starts to decrease because the increased memory access overhead due to file data overflowing cache memory outweighs the merit of dispersed contention.

Anchoring overhead. Since the anchor counter can be contended only by migrated threads, the frequency of thread migration determines the anchoring overhead. As described in §3.4, the design rationale for the OS scheduler usually inhibits frequent process migration. To confirm this, we ran the openfiles workload of filebench on all cores, which could cause thread migration between REF and UNREF operations, and counted the number of migrations. For this experiment, we created 2,000 threads running the openfiles workload on 36 physical cores (disabling 60 cores), which should cause frequent thread migration due to load imbalance. However, during the 60-second experiment, fewer than 10,000 migrations occurred.

Moreover, regardless of how the Linux scheduler is implemented, the more frequently thread migration occurs, the less effective the CPU time becomes, due to the long latency of context switches. The latency of a context switch can be as short as 1 microsecond [24, 19], which is still much larger than the overhead of an atomic operation [13]. Recently, there has been an effort to reduce the latency of a context switch to several tens of nanoseconds by emulating a thread at the user level [24], but no such study has been found at the kernel level. Therefore, there is practically no reduction in system throughput due to frequent changes of the anchor counter in PAYGO.

Figure 10: Performance spectrum of the vanilla, RefCache and PAYGO over varying contention levels on ext4. (Throughput in Mops/sec and CPU utilization breakdown over file sizes from 1 MiB to 64 MiB.)

Figure 11: Throughput under the strongly contending workload (user-level reference counting). (Throughput in Mops/s, log scale, for FAA, CAS and PAYGO over 1-96 threads.)

6.2.4 Scalability of User-level PAYGO

Next, we evaluate the performance of the user-level PAYGO system calls to see their applicability. For the evaluation, we use a microbenchmark in which a varying number of threads repeatedly (un)reference user-level objects. For comparison, we implement two methods based on our observations. The first uses atomic fetch_add and fetch_sub for reference counting. We call it FAA; this is a typical implementation widely adopted in many systems. Note that this technique does not show performance collapses, but it cannot scale performance, mainly due to hardware-based synchronization bottlenecks. The second implements what is used in the Linux page cache, which is based on the atomic compare-and-swap instruction. We call this CAS.
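The two baselines can be sketched with C11 atomics. The names below are ours: `cas_ref_not_zero` illustrates the general "increment unless zero" CAS-loop pattern, not the page cache's exact code:

```c
#include <assert.h>
#include <stdatomic.h>

/* FAA: the common fetch-and-add reference counter. */
static void faa_ref(atomic_long *cnt)   { atomic_fetch_add(cnt, 1); }
static void faa_unref(atomic_long *cnt) { atomic_fetch_sub(cnt, 1); }

/* CAS: a compare-and-swap loop that takes a reference only if the
 * count is still nonzero, so a dying object is never revived. */
static int cas_ref_not_zero(atomic_long *cnt)
{
    long old = atomic_load(cnt);
    while (old != 0) {
        if (atomic_compare_exchange_weak(cnt, &old, old + 1))
            return 1;          /* won the race, reference taken */
        /* on failure, old was reloaded with the current value */
    }
    return 0;                  /* count already zero: refuse */
}
```

Both variants funnel every thread through one cache line, which is exactly the hardware synchronization bottleneck that keeps them from scaling in the experiment below.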

Figure 11 shows the throughput (i.e., the number of fetch_add/fetch_sub and REF/UNREF operations per second) of the three schemes as we increase the number of threads, all of which access a single shared user-level object. As expected, the mode switch overhead of user-level PAYGO is quite noticeable, considering the performance of FAA and CAS up to 2 threads in our system. The performance FAA and CAS achieve with 1 thread, however, is the peak obtainable for reference counting a single shared object. Beyond 2 threads, both FAA and CAS are either saturated or slowly degrading. Meanwhile, our user-level PAYGO scales the performance with no contention overhead. Figure 12 shows the performance spectrum of the three methods as we increase the number of referenced objects with 96 threads. As shown in the figure, FAA and CAS suffer from synchronization bottlenecks initially, when all 96 threads access a small number of objects, but they slowly gain throughput up to a certain point as contention is dispersed. We believe that the saturation point observed here (i.e., 137 Mops/s) reaches the maximum capacity that our 96-core server can support. On the other hand, our user-level PAYGO sustains the maximum throughput regardless of the number of objects.

Figure 12: Performance spectrum of FAA, CAS and user-level PAYGO with a varying number of objects. (Throughput in Mops/s, log scale, over 1-512 reference-count objects.)

Impact on application performance. As demonstrated above, user-level PAYGO may have a profound impact on application performance, especially on multicore hardware. We have been conducting an in-depth code-level analysis of latent bottlenecks caused by reference counting in MongoDB, MariaDB, Boost.Asio, etc. What we have learnt from our preliminary study of such systems is that many applications using user-level reference counting mostly suffer performance bottlenecks that start occurring much earlier, before reference counting becomes responsible for severe performance degradation. For example, the database systems we analyzed have recently undergone major changes to enhance their multicore scalability. As the systems community battles these pressing concerns, the contention around reference counting will soon appear as a primary bottleneck to achieving scalable performance.



Figure 13: Query overhead comparison. (Completion time in seconds for Vanilla, SNZI, RefCache and PAYGO, with 95 background readers and without background readers.)

6.2.5 Comparing preempt_disable() and spin_lock()

As we briefly mentioned in §3.1, the use of preempt_disable(), instead of kernel spin locks or something similar, to prevent data races in REF/UNREF needs concrete justification. Hence, we compared the overheads of both methods by measuring the average clock cycles per function pair (i.e., preempt_disable()/preempt_enable() and spin_lock()/spin_unlock()), iterating each up to a billion times. The results show that the clock cycles for preempt_disable() and spin_lock() converge to 14 cycles and 50 cycles, respectively. Throughout the experiments, the overhead of preempt_disable() remains a constant fraction (∼30%) of that of spin_lock(), regardless of the number of iterations performed. The main reason for the higher overhead of spin_lock() is that it internally invokes preempt_disable() and executes additional code segments, including an atomic instruction for cross-core communication supporting mutual exclusion on a global object. This is undoubtedly overkill for our case, where an atomic operation is only needed to safely decrease an anchor counter from remote cores. In conclusion, preempt_disable(), whose usefulness has been shown in prior work, is a fast and safe method for preventing data races on local counters in our REF/UNREF implementations.

6.3 Query Overhead

In this section, we evaluate the performance of the READ-ALL operation. To evaluate the query overhead of PAYGO, we measure the time to flush a 4 GiB file from the Linux page cache with and without background readers on ext4. The experiment first reads the entire file so that all file blocks are loaded in the page cache. Then, it measures the time taken to drop the file from the page cache. Since the cached pages are all clean (i.e., unmodified), dropping them consists of pure CPU activity. Figure 13 shows the completion time of the different reference counting techniques. With background readers, RefCache surprisingly outpaced all other competitors, since RefCache may safely read a batch of global counters for multiple pages if their hash entries were flushed two epochs ago and no referencing occurred in between. Although PAYGO has less query overhead than RefCache for a single reference counter, the benefit of syncing the entire hash of dirty reference caches to global counters outweighs the time overhead of two epochs when a large number of objects is involved. Meanwhile, PAYGO exhibits the overhead of reading a large number of local counters for each page, and the vanilla scheme suffers contention due to atomic operations. Without background readers, the vanilla scheme is better than RefCache, but PAYGO still shows the same overhead of reading local counters. Nevertheless, the query overhead of PAYGO does not grow in proportion to the number of cores, owing to its early detection of positive reference counter values (§3.4).

7 Limitations and Future Work

The limitations of PAYGO can be summarized as follows. First and foremost, PAYGO is not completely free from the counting-query tradeoff, and we do not know whether escaping it entirely is possible; a proposal achieving low overhead in all dimensions would be a major breakthrough in systems research. Second, the way we handle table overflow is rather naive, and one may find practical use cases that stress PAYGO such that the overflow counter list becomes a bottleneck. Our ongoing work is to apply user-level PAYGO to language runtime systems, which would surely benefit from it.

8 Conclusion

Reference counting in modern file systems is not scalable on multicores, even under workloads with little or no logical contention. Through an in-depth survey of existing reference counting techniques designed for scaling file I/O operations, we found that there are space-time and query-counting tradeoffs in designing scalable reference counting techniques. In this paper, we have presented a novel reference counting scheme, PAYGO, that escapes the space-time tradeoff by using an anchor counter. PAYGO provides scalable counting and space efficiency with negligible time delay for the reclamation of hash entries. We have implemented PAYGO in the Linux page cache. Our evaluation with different file system benchmarks demonstrated that PAYGO is practically useful in addressing the severe contention arising in other reference counting techniques.

Acknowledgements. We would like to thank our shepherd, Vijay Chidambaram, and the anonymous reviewers for helping us improve this paper. This work was supported by the National Research Foundation of Korea grant (2017R1A2B4006134) and by the Ministry of Science and ICT (MSIT), Korea (R0114-16-0046, Software Black Box for Highly Dependable Computing; 2016-0-00023, National Program for Excellence in SW), supervised by the Institute for Information and communications Technology Promotion (IITP), Korea.



References[1] ACAR, U. A., BEN-DAVID, N., AND RAINEY, M. Contention in

structured concurrency: Provably efficient dynamic non-zero indica-tors for nested parallelism. In Proceedings of the 22nd ACM SIG-PLAN Symposium on Principles and Practice of Parallel Program-ming (2017), ACM, pp. 75–88.

[2] ATTIYA, H., GUERRAOUI, R., HENDLER, D., KUZNETSOV, P.,MICHAEL, M. M., AND VECHEV, M. Laws of order: Expensive syn-chronization in concurrent algorithms cannot be eliminated. In Pro-ceedings of the 38th Annual ACM SIGPLAN-SIGACT Symposium onPrinciples of Programming Languages (New York, NY, USA, 2011),POPL ’11, ACM, pp. 487–498.

[3] BHAT, S. S., EQBAL, R., CLEMENTS, A. T., KAASHOEK, M. F.,AND ZELDOVICH, N. Scaling a file system to many cores using anoperation log. In Proceedings of the Twenty-Sixth ACM Symposiumon Operating Systems Principles (2017), ACM.

[4] BOYD-WICKIZER, S. Optimizing Communication Bottlenecks inMultiprocessor Operating System Kernels. PhD thesis, MassachusettsInstitute of Technology, 2014.

[5] BOYD-WICKIZER, S., CLEMENTS, A. T., MAO, Y., PESTEREV, A.,KAASHOEK, M. F., MORRIS, R., ZELDOVICH, N., ET AL. An analy-sis of linux scalability to many cores. In OSDI (2010), vol. 10, pp. 86–93.

[6] CLEMENTS, A., ZELDOVICH, N., ET AL. sv6: Posix-like scalablemulticore research os kernel. https://github.com/aclements/

sv6, 2014.

[7] CLEMENTS, A. T., KAASHOEK, M. F., AND ZELDOVICH, N.Radixvm: Scalable address spaces for multithreaded applications. InProceedings of the 8th ACM European Conference on Computer Sys-tems (2013), ACM, pp. 211–224.

[8] COLLINS, G. E. A method for overlapping and erasure of lists. Commun. ACM 3, 12 (Dec. 1960), 655–657.

[9] CORBET, J. The search for fast, scalable counters. https://lwn.net/Articles/170003/, 2006.

[10] CORBET, J. Per-CPU reference counts. https://lwn.net/Articles/557478/, 2013.

[11] CORBET, J. Filebench. https://github.com/filebench/filebench/wiki, 2017.

[12] DASHTI, M., FEDOROVA, A., FUNSTON, J., GAUD, F., LACHAIZE, R., LEPERS, B., QUEMA, V., AND ROTH, M. Traffic management: a holistic approach to memory placement on NUMA systems. In ACM SIGPLAN Notices (2013), vol. 48, ACM, pp. 381–394.

[13] DAVID, T., GUERRAOUI, R., AND TRIGONAKIS, V. Everything you always wanted to know about synchronization but were afraid to ask. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (2013), ACM, pp. 33–48.

[14] ELLEN, F., LEV, Y., LUCHANGCO, V., AND MOIR, M. SNZI: Scalable nonzero indicators. In Proceedings of the Twenty-Sixth Annual ACM Symposium on Principles of Distributed Computing (2007), ACM, pp. 13–22.

[15] KNUTH, D. E. Structured programming with go to statements. ACM Comput. Surv. 6, 4 (Dec. 1974), 261–301.

[16] MCKENNEY, P. E., BOYD-WICKIZER, S., AND WALPOLE, J. RCU usage in the Linux kernel: One decade later. Technical report (2013).

[17] MIN, C., KASHYAP, S., MAASS, S., AND KIM, T. Understanding manycore scalability of file systems. In USENIX Annual Technical Conference (2016), pp. 71–85.

[18] NARULA, N., CUTLER, C., KOHLER, E., AND MORRIS, R. Phase reconciliation for contended in-memory transactions. In OSDI (2014), vol. 14, pp. 511–524.

[19] PETER, S., LI, J., ZHANG, I., PORTS, D. R., WOOS, D., KRISHNAMURTHY, A., ANDERSON, T., AND ROSCOE, T. Arrakis: The operating system is the control plane. ACM Transactions on Computer Systems (TOCS) 33, 4 (2016), 11.

[20] PIGGIN, N. A lockless page cache in Linux. In Proceedings of the Linux Symposium (2006), vol. 2.

[21] Apache HBase. https://github.com/apache/hbase/blob/rel/2.1.0/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/bucket/BucketCache.java#L484.

[22] Facebook RocksDB. https://github.com/facebook/rocksdb/blob/v5.14.3/utilities/persistent_cache/hash_table_evictable.h#L62.

[23] MariaDB Server. https://github.com/MariaDB/server/blob/10.3/storage/innobase/buf/buf0buf.cc#L4350.

[24] SEO, S., AMER, A., BALAJI, P., BORDAGE, C., BOSILCA, G., BROOKS, A., CASTELLO, A., GENET, D., HERAULT, T., JINDAL, P., ET AL. Argobots: A lightweight threading/tasking framework. Tech. rep., Argonne National Laboratory (ANL), 2016.

[25] TARASOV, V., ZADOK, E., AND SHEPLER, S. Filebench: A flexible framework for file system benchmarking. USENIX ;login: 41 (2016).

Notes

1. PAYGO: pay migration tax (i.e., the anchoring overhead) as you go to another core.
