
BTS: An Accelerator for Bootstrappable Fully Homomorphic Encryption

Sangpyo Kim† Jongmin Kim† Michael Jaemin Kim† Wonkyung Jung†

Minsoo Rhu‡ John Kim‡ Jung Ho Ahn†

Seoul National University†, KAIST‡

{vnb987, jongmin.kim, michael604, jungwk, gajh}@snu.ac.kr, {mrhu, jjk12}@kaist.edu

Abstract—Homomorphic encryption (HE) enables secure offloading of computation to the cloud by providing computation on encrypted data (ciphertexts). HE is based on noisy encryption schemes such that noise accumulates as we apply more computation to the data. The limited number of operations applicable to the data prevents practical applications from exploiting HE. Bootstrapping enables an unlimited number of operations, or fully HE (FHE), by refreshing the ciphertext. Unfortunately, bootstrapping requires a significant amount of additional computation and memory bandwidth. Prior works have proposed hardware accelerators for computation primitives of FHE. However, to the best of our knowledge, this is the first work to propose a hardware FHE accelerator tailored to support bootstrapping efficiently.

In particular, we propose BTS, a Bootstrappable, Technology-driven, Secure accelerator architecture for FHE. We identify the challenges of supporting bootstrapping in the accelerator and analyze the off-chip memory bandwidth and computation required. In particular, given the limitations of modern memory technology, we identify the HE parameter sets that are efficient for FHE acceleration. Based on the insights from our analysis, we propose BTS, which effectively exploits the parallelism innate in HE operations by arranging a massive number of processing elements in a grid. We present the design and microarchitecture of BTS, including the network-on-chip that exploits the deterministic communication pattern. BTS shows 5,556× and 1,306× improved execution time on ResNet-20 and logistic regression over CPU, using a 373.6mm² chip area and up to 133.8W of power.

I. INTRODUCTION

Homomorphic encryption (HE) allows computation on encrypted data, or ciphertexts (cts). In the era of machine-learning-as-a-service (MLaaS), HE is being spotlighted as an enabler for privacy-preserving cloud computing, as it allows safe offloading of private data. Because HE schemes are based on the learning-with-errors (LWE) [63] problem, they are noisy in nature. Noise accumulates as we apply a sequence of computations on cts. This limits the number of computations that can be performed and hinders the applicability of HE to practical applications. To overcome this limitation, fully HE (FHE) [34] was proposed, featuring an operation (op) called bootstrapping that "refreshes" a ct and hence permits an unlimited number of computations on it. Among the multiple HE schemes that support FHE, CKKS [20] is one of the prime candidates as it supports fixed-point real-number arithmetic.

One of the main barriers to adopting HE has been its high computational and memory overhead. New schemes [13], [14], [20], [23], [33] and algorithmic optimizations [2], [12], [37], such as using the residue number system (RNS) [6], [18], have been suggested to reduce this overhead, resulting in an over-1,000,000× speedup [12] compared to the first HE implementation [35]. However, even with such efforts, HE ops still suffer slowdowns of tens of thousands of times compared to unencrypted ops [43]. Tackling this, prior works have sought hardware solutions to accelerate HE ops, including CPU extensions [11], [43], GPUs [1]–[3], [42], FPGAs [47], [48], [64], [65], and ASICs [66].

However, there still remains ample room for improvement when it comes to accelerating FHE. While more than 1,000 bootstrapping ops are necessary for a single ResNet-20 inference [51], each bootstrapping op takes dozens of seconds on the state-of-the-art CPU implementation [32]. GPUs perform better, but a single bootstrapping still takes hundreds of milliseconds [42]. Furthermore, prior works on custom hardware acceleration [47], [48], [64], [65] do not support HE parameters that allow bootstrapping. F1 [66] demonstrated the bootstrapping time of CKKS but had limited throughput because its design was not focused on the FHE setting with frequent bootstrapping. These limitations prevent them from being adopted for complex applications.

We propose BTS, a bootstrapping-oriented FHE accelerator that is Bootstrappable, Technology-driven, and Secure. First, we identify the limitations imposed on HE accelerator design by contemporary fabrication technology, analyzing the implications of various conflicting requirements for the performance and security of FHE under this constrained design space. This allows us to pinpoint the appropriate optimization targets and requirements for designing an FHE accelerator. Second, we establish a design principle of populating a massive number of small processing elements (PEs) in a grid instead of a few large PEs (Section IV). We base our principle on the observation that the balance between the two types of parallelism (residue-polynomial-wise and coefficient-wise) changes under FHE-optimal parameters. Finally, we optimize the BTS microarchitecture by i) increasing the utilization of compute units via coarse-grained pipelining among different HE functions and ii) exploiting the computation patterns of HE functions for intelligent data mapping among the PEs with minimized NoC traffic (Section V).

Through these detailed studies, BTS achieves a 5,714× speedup in multiplicative throughput against F1, the state-of-the-art ASIC implementation, when bootstrapping is properly considered. Also, BTS significantly reduces the training time of logistic regression [36] compared to CPU (by 1,306×) and GPU (by 27×) implementations, and can execute a ResNet-20 inference 5,556× faster than prior CPU implementations [51].

TABLE I
LIST OF SYMBOLS USED TO DESCRIBE CKKS [20].

Symbol                 Definition
Q                      (Prime) moduli product = ∏_{i=0}^{L} q_i
q_0, ..., q_L          (Prime) moduli
Q_0, ..., Q_{dnum−1}   Modulus factors
P                      Special (prime) moduli product = ∏_{i=0}^{k−1} p_i
p_0, ..., p_{k−1}      Special (prime) moduli
evk_mult               Evaluation key (evk) for HMult
evk_rot^(r)            evk for HRot with a rotation amount of r
N                      The degree of a polynomial
L                      Maximum (multiplicative) level
ℓ                      Current (multiplicative) level of a ciphertext
L_boot                 Levels consumed at bootstrapping
k                      The number of special prime moduli
dnum                   Decomposition number
λ                      Security parameter of a given CKKS instance

In this paper, we make the following key contributions:
• We provide a detailed analysis of the interplay of HE parameters impacting the performance of FHE accelerators.
• We propose BTS, a novel accelerator architecture equipped with massively parallel compute units and NoCs tailored to the mathematical traits of FHE ops.
• BTS is the first accelerator targeting practical bootstrapping, enabling unbounded multiplicative depth, which is essential for complex workloads.

II. BACKGROUND

We provide a brief overview of HE and CKKS [20] in particular. Table I summarizes the key parameters and notations we use in this paper.

A. Homomorphic Encryption (HE)

HE enables direct computation on encrypted data, referred to as ciphertexts (cts), without decryption. There are two types of HE. Somewhat HE (SHE) supports a limited number of operations (ops) on a ct due to the noise accumulated by the ops. In contrast, Fully HE (FHE) allows an unlimited number of ops on cts through bootstrapping [34], which "refreshes" a ct and lowers the impact of noise. There are popular FHE schemes [13], [14], [20], [23], [33] that differ in the types of data that can be encrypted and the kinds of ops they support. While other schemes support integer [13], [14], [33] or gate-level ops [23], CKKS [20] supports fixed-point complex (real) numbers. Since many real-world applications require arithmetic with real numbers, CKKS has become one of the most important algorithms among cutting-edge HE schemes [13], [14], [20], [33]. In this paper, we focus on accelerating CKKS ops; however, our proposed architecture is applicable to other popular HE schemes (e.g., BGV [13] and BFV [6], [14], [33]) as they share similar core ops.

B. CKKS: an emerging HE scheme

CKKS first encodes a message, a vector of complex numbers, into a plaintext m(X) = Σ_{i=0}^{N−1} c_i·X^i, which is a polynomial in a cyclotomic polynomial ring R_Q = Z_Q[X]/(X^N + 1). The coefficients {c_i} are integers modulo Q and the number of coefficients (or degree) is up to N, where N is a power-of-two integer, typically ranging from 2^10 to 2^18. For a given N, a message with up to N/2 complex numbers can be packed into a single plaintext in CKKS. Each element within a packed message is referred to as a slot. After encoding (or packing), element-wise multiplication (mult) and addition between two messages can be done through polynomial operations between plaintexts. CKKS then encrypts a plaintext m(X) ∈ R_Q into a ct ∈ R_Q^2 based on the following equation:

ct = (b(X), a(X)) = (a(X)·s(X) + m(X) + e(X), a(X))

where s(X) ∈ R_Q is a secret key, a(X) ∈ R_Q is a random polynomial, and e(X) is a small error polynomial whose coefficients follow a discrete Gaussian distribution. CKKS decrypts ct by computing m′(X) = ct · (1, −s(X)) = m(X) + e(X), which is approximately the same as m(X), up to a small error.

The main bottleneck in HE is the high computational complexity of polynomial ops. As each coefficient of a polynomial is a large integer (having up to 1,000s of bits) and the degree is high (even surpassing 100,000), an op between two polynomials has high compute and data-transfer costs. To reduce the computational complexity, variants of HE schemes [6], [18] have been proposed that use the residue number system (RNS). For example, Full-RNS CKKS [18] sets Q as the product of word-sized (prime) moduli {q_i}_{0≤i≤L}, where Q = ∏_{i=0}^{L} q_i for a given integer L, called the maximum (multiplicative) level. Using the Chinese Remainder Theorem (Eq. 1), we represent a polynomial in R_Q with residue polynomials in {R_{q_i}}_{0≤i≤L}, whose coefficients are residues obtained by performing modulo q_i (represented as [·]_{q_i}) on the big coefficients:

[a(X)]_Q ↦ ([a(X)]_{q_0}, ..., [a(X)]_{q_L}), where Q = ∏_i q_i   (1)

Then, we can convert an op involving two polynomials into ops between residue polynomials with small coefficients (< 64 bits) corresponding to the same prime modulus q_i, avoiding costly big-integer arithmetic with carry propagation. Although Full-RNS CKKS was shown to provide about an 8× improvement in performance (execution time) over the original CKKS [18], the performance overhead still remains very high [1], [12], [42], [64]. In this paper, we leverage Full-RNS CKKS as our CKKS implementation, which represents a polynomial in R_Q as an N × (L+1) matrix of residues, and a ct as a pair of such matrices.
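To make the RNS representation concrete, the following minimal Python sketch applies Eq. 1 to a single coefficient, using toy primes rather than the word-sized NTT-friendly primes a real CKKS library would use.

```python
# A minimal sketch of RNS arithmetic (Eq. 1), assuming toy prime moduli
# rather than the 40- to 60-bit primes CKKS uses in practice.
from math import prod

qs = [97, 193, 769]          # toy prime moduli q_0..q_L (L = 2)
Q = prod(qs)

def to_rns(x):
    """Decompose a big coefficient x in Z_Q into its residues [x]_{q_i}."""
    return [x % q for q in qs]

def from_rns(residues):
    """Reconstruct x from its residues via the Chinese Remainder Theorem."""
    x = 0
    for r, q in zip(residues, qs):
        q_hat = Q // q                       # product of the other moduli
        x += r * q_hat * pow(q_hat, -1, q)   # CRT basis element
    return x % Q

a, b = 123456, 654321
# Multiplication happens residue-wise, with no big-integer carries:
prod_rns = [(ra * rb) % q for ra, rb, q in zip(to_rns(a), to_rns(b), qs)]
assert from_rns(prod_rns) == (a * b) % Q
```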

C. Primitive operations (ops) of CKKS

Here we introduce the primitive HE ops of CKKS, which can be combined to create more complex HE ops such as linear transformation and convolution. Given two ciphertexts ct_0 and ct_1, where ct_i = (b_i(X), a_i(X)) and b_i(X) = a_i(X)·s(X) + m_i(X), the HE ops can be summarized as follows.

HAdd performs an element-wise addition of ct_0 and ct_1:

ct_add = (b_0(X) + b_1(X), a_0(X) + a_1(X))   (2)

HMult consists of two steps: tensor product and key-switching. The tensor product first creates a tuple of polynomials (d_0(X), d_1(X), d_2(X)):

d_0(X) = b_0(X) · b_1(X)
d_1(X) = a_0(X) · b_1(X) + a_1(X) · b_0(X)   (3)
d_2(X) = a_0(X) · a_1(X)

By computing (d_0(X), d_1(X), d_2(X)) · (1, −s(X), s(X)^2), we recover m_0(X)·m_1(X), albeit with error terms. Key-switching recombines the tensor-product result to be decryptable with (1, −s(X)) using a public key, called an evaluation key (evk). An evk is a ct in R_{PQ}^2 with a larger modulus PQ, where P = (∏_{i=0}^{k−1} p_i) ≥ Q for the given special (prime) moduli p_0, ..., p_{k−1}. We express an evk as a pair of N × (k+L+1) matrices. HMult is then computed using Eq. 4, where the second term is the key-switching with the evk for mult, evk_mult:

ct_mult = (d_0(X), d_1(X)) + P^{−1} · (d_2(X) · evk_mult)   (4)

HRot circularly shifts a message vector by slots. When a ct encrypts a message vector z = (z_0, ..., z_{N/2−1}), after applying HRot with a rotation amount r, the rotated ciphertext ct_rot encrypts z^(r) = (z_r, ..., z_{N/2−1}, z_0, ..., z_{r−1}). HRot consists of automorphism and key-switching. ct = (b(X), a(X)) is mapped to ct′ = (b(X^{5^r}), a(X^{5^r})) by automorphism. This moves the coefficients of a polynomial through a mapping i ↦ σ_r(i), where i is the index of a coefficient c_i and σ_r is:

σ_r : i ↦ i · 5^r mod N   (i = 0, 1, ..., N − 1)   (5)

Similar to HMult, key-switching brings ct′, which automorphism left decryptable only with (1, −s(X^{5^r})), back to being decryptable with (1, −s(X)). Each HRot with a different rotation amount requires a separate evk, evk_rot^(r). HRot is computed as follows:

ct_rot = (b(X^{5^r}), 0) + P^{−1} · (a(X^{5^r}) · evk_rot^(r))   (6)
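As an illustration of the mapping σ_r in Eq. 5, the sketch below (at a toy N; the sign corrections that arise in the negacyclic ring Z_Q[X]/(X^N + 1) are omitted here) checks that σ_r is a permutation of coefficient indices, which is all that the automorphism step does to the residues.

```python
# A sketch of the sigma_r index mapping of Eq. 5 at toy scale (N = 16);
# sign corrections from the negacyclic ring X^N + 1 are omitted.
N, r = 16, 3
five_r = pow(5, r, N)

def sigma(i):
    return (i * five_r) % N          # destination index of coefficient c_i

# 5^r is odd, so multiplication by it is a bijection modulo a power of two:
assert sorted(sigma(i) for i in range(N)) == list(range(N))

coeffs = list(range(100, 100 + N))   # stand-in coefficients of b(X)
rotated = [0] * N
for i in range(N):
    rotated[sigma(i)] = coeffs[i]    # automorphism just relocates residues
```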

HE applications require other HE ops, such as the addition or mult of a ct with a scalar (CAdd, CMult) or a polynomial (PAdd, PMult) of unencrypted, constant values. Additions are performed by adding the scalar or polynomial to b(X), and mults are performed by multiplying the scalar or polynomial with both b(X) and a(X).

D. Multiplicative level and HE bootstrapping

The error included in a ct is amplified during HE ops; in particular, HMult multiplies the error e(X) with other terms (e.g., m_0(X) and m_1(X)) and can result in an explosion of the error if not treated properly. CKKS performs HRescale to mitigate this explosion and keeps the error tolerable by dividing the ct by the last prime modulus q_L [18]. After HRescale, the q_L residue polynomial is no longer used, and the ct is reduced in size. The ct continues losing the residues of q_{L−1}, ..., q_1 with each HRescale during the execution of an HE application, until only one residue polynomial is left and no additional HMult can be performed on the ct. L, the maximum multiplicative level, determines the maximum number of HMult ops that can be performed without bootstrapping, and the current (multiplicative) level ℓ is the number of remaining HMult operations that can be performed on the ct. Thus, a ct at level ℓ is represented as a pair of N × (ℓ+1) matrices.

FHE features a bootstrapping op that restores the multiplicative level (ℓ) of a ct to enable more ops. For practical usage of HE with a complex sequence of HE ops, bootstrapping must be performed routinely. A bootstrapping itself consists of hundreds of primitive HE ops and consumes L_boot levels. Therefore, L should be larger than L_boot, and having a larger L is beneficial since it requires less frequent bootstrapping to execute an HE application with a fixed multiplicative depth. L_boot depends on the bootstrapping algorithm and typically ranges from 10 to 20; a larger L_boot permits using more precise and faster bootstrapping algorithms, but at the cost of more frequent bootstrapping [12], [16], [37], [52]. The bootstrapping algorithm we use in this paper is based on [37], with updates to meet the latest security and precision requirements [12], [19], [54], and has an L_boot of 19. Another CKKS-specific constraint is that the moduli q_i's and the special moduli p_i's must be large enough to tolerate the error accumulated during bootstrapping; their typical values range from 2^40 to 2^60 [22], [32].

E. Modern algorithmic optimizations in CKKS & Tmult,a/slot

The level of security of an HE scheme is determined by the parameter λ, as it sets the minimum logarithmic time complexity of an attack [19] that deduces the message from a ct without the secret key. In this work, we target a λ of 128 bits, similar to recent HE studies [12], [52], [54] and libraries [32], [60]. A prior study, F1 [66], provided a substandard [4] level of security under 80 bits for CKKS bootstrapping and used smaller cts, which simplifies the microarchitecture. λ is a strictly increasing function of N/log PQ [29].

Key-switching is an expensive function, taking most of the time in HRot and HMult [42]. We adopt the state-of-the-art generalized key-switching technique [37], which balances L, computational cost, and λ. [37] factorizes the moduli product Q into Q = Q_0 · ... · Q_{dnum−1} (see Eq. 7) for a given integer dnum (decomposition number). It decomposes a ct into dnum slices, each consisting of the residue polynomials corresponding to the prime moduli (q_i's) that together compose the modulus factor Q_j. We perform key-switching on each slice in R_{Q_j} and later accumulate the results. The special moduli product P only has to satisfy P ≥ Q_j for each Q_j; thus, we can choose a smaller P, leading to a higher λ. Because P has decreased, we can instead keep the same λ with a higher Q and, accordingly, a higher L, to apply more HE ops between bootstrappings.

Q = Q_0 · Q_1 · ... · Q_{dnum−1}, where Q_j = q_{j·(L+1)/dnum} · ... · q_{(j+1)·(L+1)/dnum − 1}   (7)

A major challenge of generalized key-switching is that a different evk (evk_0, ..., evk_{dnum−1}) must be prepared for each factor Q_j, where each evk is a pair of N × (k+L+1) matrices and k is set to (L+1)/dnum. Thus, the aggregate evk size becomes 2·N·(L+1)·(dnum+1) words, increasing linearly with dnum. The overall computational complexity of a single HE op also increases with dnum. Therefore, choosing an appropriate dnum is crucial for performance.
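The sketch below illustrates the partition of Eq. 7 and the aggregate evk size, assuming 8-byte machine words and, for concreteness, the (L, dnum) = (39, 2) instance that appears later in Table III.

```python
# A sketch of the Eq. 7 decomposition and the aggregate evk size,
# assuming 8-byte words and the (L, dnum) = (39, 2) instance of Table III.
N, L, dnum = 2**17, 39, 2
k = (L + 1) // dnum                        # number of special moduli
moduli = list(range(L + 1))                # index stand-ins for q_0..q_L
factors = [moduli[j*k:(j+1)*k] for j in range(dnum)]  # Q_0..Q_{dnum-1}
assert sum(factors, []) == moduli          # the Q_j partition the q_i's

# dnum evks, each a pair of N x (k+L+1) matrices of 8-byte residues:
total_words = dnum * 2 * N * (k + L + 1)
assert total_words == 2 * N * (L + 1) * (dnum + 1)    # the closed form
print(total_words * 8 / 2**20, "MiB of evks")         # -> 240.0 MiB
```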

Changing the HE parameter set has mixed effects on the performance of HE ops. Decreasing N reduces computational complexity and memory usage. However, we should then lower L and Q to sustain security, which requires more frequent bootstrapping. Also, since a ct with degree N can encode only up to N/2 message slots by packing, throughput degrades.

Jung et al. [42] introduced a metric called amortized mult time per slot (Tmult,a/slot), which is calculated as follows:

Tmult,a/slot = (T_boot + Σ_{ℓ=1}^{L−L_boot} T_mult(ℓ)) / (L − L_boot) · 2/N   (8)

where T_boot is the bootstrapping time and T_mult(ℓ) is the time to perform HMult at level ℓ. This metric first calculates the average cost of a mult, including the overhead of bootstrapping, and then divides it by the number of slots in a ct (N/2). Thus, Tmult,a/slot effectively captures the reciprocal throughput of a CKKS instance (the CKKS scheme with a certain parameter set).
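A small sketch of Eq. 8 follows; the T_boot and T_mult(ℓ) values here are hypothetical stand-ins purely for illustration, as the real values depend on the implementation being measured.

```python
# A sketch of Eq. 8 with made-up timings (T_boot and t_mult are
# hypothetical, not measured values).
N, L, L_boot = 2**17, 27, 19

def t_mult(level):            # hypothetical per-level HMult time (seconds)
    return 1e-3 * (level + 1) / (L + 1)

T_boot = 0.1                  # hypothetical bootstrapping time (seconds)

usable = L - L_boot           # levels available between bootstrappings
T_amort = (T_boot + sum(t_mult(l) for l in range(1, usable + 1))) / usable
T_per_slot = T_amort * 2 / N  # divide by the N/2 slots of a ct
print(f"Tmult,a/slot = {T_per_slot * 1e9:.1f} ns")
```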

III. TECHNOLOGY-DRIVEN PARAMETER SELECTION OF BOOTSTRAPPABLE ACCELERATORS

A. Technology trends regarding memory hierarchy

Domain-specific architectures (e.g., deep-learning accelerators [41], [49], [55]) are often based on custom logic and optimized dataflow to provide high compute throughput. In addition, the memory capacity/bandwidth requirements of the applications are exploited in the design of the memory hierarchy. Recently, on-chip SRAM capacity has scaled significantly [5] such that hundreds of MBs of on-chip SRAM are feasible, providing tens of TB/s of SRAM bandwidth [41], [49], [62]. While the bandwidth of main memory has also increased, its aggregate throughput is still more than an order of magnitude lower than on-chip SRAM bandwidth [59], achieving a few TB/s even with high-bandwidth memory (HBM).

Similar to other domain-specific architectures [17], [41], HE applications also follow deterministic computational graphs [66], and the locality of the input and output cts of HE ops can be maximized through software scheduling [30]. Thus, cts can be reused by exploiting the large amount of on-chip SRAM enabled by technology scaling. However, even with the increasing on-chip SRAM capacity, we observe that the size of on-chip SRAM is still insufficient to store evks, rendering the off-chip memory bandwidth a crucial bottleneck for modern CKKS that supports bootstrapping. In the following sections, we identify the importance of bootstrapping to the overall performance and analyze how different CKKS parameters impact the amount of data movement during bootstrapping and the final throughput.

B. Interplay between primary CKKS parameters

Selecting one parameter of a CKKS instance has a multifaceted effect on the others. First, λ is lowered when Q is higher, and is raised when N is higher. Considering that a bootstrappable CKKS instance requires a high L (> L_boot), and with the sizes of the prime moduli q_i and p_i chosen around 2^50 and 2^60 under the 64-bit machine word size, log PQ exceeds 500. To support 128b security when log PQ exceeds 500, N has to be larger than 2^14 [54].

Second, when log PQ is set from a fixed λ and N, a larger dnum leads to a higher L at the cost of a larger evk size. Considering that k equals (L+1)/dnum, the Q : P ratio is close to dnum : 1. Therefore, when log PQ is fixed, a larger dnum means a larger Q and finally a larger L. However, the evk size also increases linearly with dnum (see Fig. 1). Because the increase in L achieved by raising dnum saturates quickly, choosing a proper dnum is important.

Fig. 1. (a) L and (b) a single evk size vs. dnum for four different N (polynomial degree) values and a fixed 128b security target. Normalized dnum of 0 means dnum = 1 and normalized dnum of 1 means dnum = max (i.e., k = 1); the maximum dnum is 121, 60, 29, and 14 for N = 2^18, 2^17, 2^16, and 2^15, respectively. Interpolated results are used for points with non-integer dnum values. The dotted line in (a) represents the minimum required level of 11 for bootstrapping.

Fig. 2. λ and the minimum bound of Tmult,a/slot of an HE accelerator simulated for different CKKS instances. Results are measured for all possible integer dnum values, including 1 and the max, for each (N, L) pair. The red-colored points represent the CKKS instances with (N, L, dnum) = (2^17, 27, 1), (2^17, 39, 2), and (2^17, 44, 3).

C. Realistic minimum bound of HE accelerator execution time

When calculating Tmult,a/slot, bootstrapping is significant as it can take 60× longer than a single HMult execution on a CPU [32] and 66× longer on a GPU [42]. However, bootstrapping itself is a combination of hundreds of HMult and HRot ops, which account for more than 77% of the bootstrapping time [32]. Thus, it is crucial to accelerate HMult and HRot to improve Tmult,a/slot. A prior GPU-based implementation [42] identified that both HMult and HRot are memory-bound ops highly dependent on the on-chip storage capacity. Given today's technology with cheap logic and high-density on-chip SRAMs, the performance of both ops can be improved significantly with an HE accelerator.


However, despite such an increase in on-chip storage, evks, each of which can take up several hundreds of MBs (see Fig. 1), cannot easily be stored on-chip. In particular, the evk_rot^(r)'s for HRot typically have low locality during bootstrapping due to the long sequence of multiple HRots applied with different r's. Because on-chip storage cannot hold all evks, which can amount to GBs of storage, evks must be stored off-chip and be loaded in a streaming fashion at every HMult/HRot. Therefore, even if all temporary data and cts with high locality are assumed to be stored on chip with massive on-chip storage, the load time of an evk becomes the minimum execution time for HMult/HRot given the limited off-chip bandwidth.

D. Desirable target parameter sets for HE accelerators

To understand the impact of CKKS parameters, we simulate the Tmult,a/slot of multiple design points, sweeping the N, L, and dnum values. Assuming 1TB/s of memory bandwidth, a bootstrapping algorithm that consumes 19 levels, and the simulation methodology explained in Section VI-B, we add two simplifying assumptions based on Section III-C: 1) the computation time of HE ops can be fully hidden by the memory latency of evks, and 2) all cts of HE ops are stored in on-chip SRAM and reused. Fig. 2 reports the results. The x-axis shows λ based on N/log PQ [29], calculated using an estimation tool [68]. The y-axis shows Tmult,a/slot for different N's, L's, and dnum's.

We make two key observations. First, when the other parameters are fixed, Tmult,a/slot decreases as N increases, even with the higher memory pressure from larger cts and evks, because the available level (L − L_boot) increases. However, this effect saturates after N = 2^17. Around our target security level of 128b in Fig. 2, the gain from 2^16 to 2^17 is 3.8× (111.4ns to 29.1ns), whereas that from 2^17 to 2^18 is 1.3×. Second, while a higher dnum can help smaller N's reach our target 128b security, it comes at the cost of a superlinear increase in Tmult,a/slot due to the increasing evk size and the saturation of the additional L.

These key observations suggest that a bootstrappable HE accelerator should target CKKS instances with high polynomial degrees (N ≥ 2^17) and low dnum. In particular, our BTS targets the HE parameter sets with N = 2^17 highlighted in Fig. 2. With these parameter sets, the simulated HE accelerator achieves Tmult,a/slot of 27.7ns, 19.9ns, and 22.1ns with (L, dnum) of (27, 1), (39, 2), and (44, 3), respectively. The CKKS instances with comparable λ and higher dnum show higher (worse) Tmult,a/slot and thus are not considered as targets. Using even larger N's may bring a marginal performance gain, but it requires significantly more on-chip resources as the size of a ct increases; therefore, BTS is not optimized for such parameters.

In this paper, we use the CKKS instance with N = 2^17, L = 27, and dnum = 1 as a running example. With the 64-bit machine word size, a ct at the maximum level has a size of 56MB, and an evk has a size of 112MB.
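These sizes follow directly from the matrix dimensions; a quick sanity check, assuming 8-byte words:

```python
# Checking the running example's sizes, assuming 8-byte (64-bit) words.
N, L, dnum = 2**17, 27, 1
k = (L + 1) // dnum                  # with dnum = 1, k = L + 1 = 28
ct_bytes  = 2 * N * (L + 1) * 8      # a ct: a pair of N x (L+1) matrices
evk_bytes = 2 * N * (k + L + 1) * 8  # an evk: a pair of N x (k+L+1) matrices
assert ct_bytes  == 56  * 2**20      # 56MB
assert evk_bytes == 112 * 2**20      # 112MB
```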

IV. ARCHITECTING BTS

We explore the organization of BTS, our HE accelerator architecture, which targets the CKKS instances mentioned above. Section III-D derived the optimality of such CKKS instances assuming that an HE accelerator can hide all the computation time within the loading time of an evk. BTS exploits the massive parallelism innate in HE ops to indeed satisfy that optimality requirement, with enough, but not an excess of, functional units (FUs). We first dissect HE ops, key-switching in particular, to identify what kind of FUs and how much throughput from these FUs are required. Then we discuss how to organize them in a chip with hundreds of MBs of on-chip SRAM to store temporary values and cts.

A. Computational breakdown of HE operations

We first break down key-switching, which appears in both HMult and HRot, the two dominant HE ops for bootstrapping and general HE workloads. Fig. 3(a) shows the computational flow of key-switching, and Fig. 3(b) shows its computational complexity breakdown. We focus on three functions, NTT, iNTT, and BConv, which take up most of the computation.

Number Theoretic Transform (NTT): Polynomial mult between polynomials in R_Q translates to a negacyclic convolution of their coefficients. NTT is a variant of the Discrete Fourier Transform (DFT) in R_Q. Similar to DFT, NTT transforms the convolution between two sets of coefficients into an element-wise mult, while inverse NTT (iNTT) is applied to obtain the final result, as shown below (⊗ meaning element-wise mult):

a_1(X) · a_2(X) = iNTT(NTT(a_1(X)) ⊗ NTT(a_2(X)))

By applying the well-known Fast Fourier Transform (FFT) algorithms [27], the computational complexity of (i)NTT reduces from O(N^2) to O(N log N). The computation is divided into log N stages, where the N data elements are paired into N/2 pairs in a strided manner and a butterfly operation is applied to each pair per stage; the stride value changes every stage. The butterfly operations of (i)NTT are as follows:

Butterfly_NTT(X, Y, W) → X′ = X + W·Y, Y′ = X − W·Y
Butterfly_iNTT(X, Y, W^{−1}) → X′ = X + Y, Y′ = (X − Y)·W^{−1}

where W (a twiddle factor) is an odd power (up to 2N−1) of the primitive 2N-th root of unity ξ. In total, N twiddle factors are needed per prime modulus. CKKS concurrently applies NTT to each residue polynomial (in R_{q_i}) of a ct.
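The following sketch demonstrates the negacyclic NTT-based mult identity above at toy scale (N = 8, q = 97, ψ = 8, where ψ^8 ≡ −1 mod q). It uses a recursive radix-2 formulation with negacyclic "twisting" by powers of ψ rather than the strided in-place stages a hardware NTTU would implement, and it checks the result against schoolbook negacyclic convolution.

```python
# Toy-scale negacyclic NTT sketch: a1(X)*a2(X) = iNTT(NTT(a1) (x) NTT(a2)).
def ntt(v, q, w):
    """Radix-2 Cooley-Tukey NTT; w is a primitive len(v)-th root mod q."""
    n = len(v)
    if n == 1:
        return v[:]
    even = ntt(v[0::2], q, w * w % q)
    odd = ntt(v[1::2], q, w * w % q)
    out, wk = [0] * n, 1
    for k in range(n // 2):
        t = wk * odd[k] % q
        out[k] = (even[k] + t) % q           # butterfly: X' = X + W*Y
        out[k + n // 2] = (even[k] - t) % q  # butterfly: Y' = X - W*Y
        wk = wk * w % q
    return out

def negacyclic_mult(a, b, q, psi):
    """a(X)*b(X) mod (X^n + 1, q), via twisting by powers of psi."""
    n = len(a)
    w = psi * psi % q                        # primitive n-th root of unity
    at = ntt([x * pow(psi, i, q) % q for i, x in enumerate(a)], q, w)
    bt = ntt([x * pow(psi, i, q) % q for i, x in enumerate(b)], q, w)
    ct = [x * y % q for x, y in zip(at, bt)]         # element-wise mult
    c = ntt(ct, q, pow(w, -1, q))            # iNTT = NTT with w^-1 ...
    n_inv = pow(n, -1, q)                    # ... scaled by n^-1
    return [x * n_inv % q * pow(psi, -i, q) % q for i, x in enumerate(c)]

def schoolbook(a, b, q):
    """Reference negacyclic convolution: X^n = -1."""
    n, c = len(a), [0] * len(a)
    for i in range(n):
        for j in range(n):
            s = -1 if i + j >= n else 1
            c[(i + j) % n] = (c[(i + j) % n] + s * a[i] * b[j]) % q
    return c

q, psi = 97, 8                               # psi^8 = -1 (mod 97)
a, b = [1, 2, 3, 4, 5, 6, 7, 8], [3, 1, 4, 1, 5, 9, 2, 6]
assert negacyclic_mult(a, b, q, psi) == schoolbook(a, b, q)
```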

Base Conversion (BConv): BConv [6] converts a set of residue polynomials into another set whose prime moduli differ from the former's. A ct at level ℓ has two polynomials, each comprising (ℓ+1) residue polynomials corresponding to the prime moduli {q_0, ..., q_ℓ}. We denote this modulus set as C_ℓ, called the polynomial's base (or base in short).

Key-switching involves mult between the input polynomial d_2(X) with the base C_ℓ and the evk with the base B ∪ C_ℓ, where B = {p_0, ..., p_{k−1}}. Because we can compute on two polynomials only when they have the same base, key-switching uses BConv to generate the residue polynomials corresponding to {p_i} ∈ B from the ones with the base C_ℓ (BConv.d2 in Fig. 3(a)). After multiplying the evk with d_2(X), we bring the result with the base B ∪ C_ℓ back to the original base C_ℓ by performing BConv from B to C_ℓ on the B part of the resultant polynomial (BConv.ax, BConv.bx). We subtract the converted result from the C_ℓ part and finally scale by a constant P^{−1}. This procedure is called ModDown. HRescale is a special case of ModDown, where the base is changed from C_ℓ to C_{ℓ−1}.

Fig. 3. (a) Computational flow of the key-switching inside HMult and (b) computational complexity breakdown of HMult for cts with different levels on the CKKS instance of N = 2^17, L = 27, and dnum = 1. The number of required integer ops is measured using Lattigo [32] (see Section VI-B for details).

Fig. 4. Data access patterns in HE functions: residue-polynomial-wise, coefficient-wise, and element-wise accesses over the N × (ℓ+1) residue matrices of a ct's two polynomials.

The computation of BConv is shown in Eq. 9, where q̂_j refers to ∏_{i≠j} q_i over C_ℓ:

BConv_{C_ℓ→B}([a(X)]_{C_ℓ}) = { [ Σ_{j=0}^{ℓ} [[a(X)]_{q_j} · q̂_j^{−1}]_{q_j} · q̂_j ]_{p_i} }_{0≤i<k}   (9)

Because BConv cannot be performed on polynomials after NTT (i.e., while they are in the NTT domain), iNTT is performed to bring the polynomials back to the RNS domain. BTS keeps polynomials in the NTT domain by default and brings them back to the RNS domain only for BConv. Thus, a sequence of iNTT→BConv→NTT is a common pattern in CKKS.
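The sketch below evaluates Eq. 9 on a single coefficient with toy moduli. Note that this fast base conversion is approximate: it reconstructs a + uQ for a small integer u (0 ≤ u ≤ ℓ), an error that the subsequent ModDown scaling tolerates.

```python
# A sketch of BConv (Eq. 9) on one coefficient, with toy moduli.
from math import prod

C = [97, 193, 769]                    # base C_l (q_0..q_l)
B = [257, 12289]                      # target base B (p_0..p_{k-1})
Q = prod(C)
a = 123456789 % Q
res_C = [a % q for q in C]            # RNS representation of a in base C

q_hat = [Q // q for q in C]           # q_hat_j = product of the other moduli
terms = [r * pow(qh, -1, q) % q       # [[a]_{q_j} * q_hat_j^{-1}]_{q_j}
         for r, qh, q in zip(res_C, q_hat, C)]
res_B = [sum(t * (qh % p) for t, qh in zip(terms, q_hat)) % p for p in B]

# The value actually reconstructed is a + u*Q for a small u (0 <= u <= l):
exact = sum(t * qh for t, qh in zip(terms, q_hat))
assert exact % Q == a and (exact - a) // Q <= len(C) - 1
assert res_B == [exact % p for p in B]
```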

B. Exploiting two parallelism types in primary HE functions

We can categorize the primary HE functions into three groups according to their data access patterns (see Fig. 4). Residue-polynomial-wise functions, including (i)NTT and automorphism, involve all N residues of a residue polynomial to produce an output residue polynomial. Coefficient-wise functions (e.g., BConv) involve all (ℓ+1) residues of a single coefficient to produce an output residue. Element-wise functions, such as CAdd, CMult, and PMult, only involve residues at the same position over multiple residue polynomials.

We can exploit two types of data parallelism, residue-polynomial-level parallelism (rPLP) and coefficient-level parallelism (CLP), in parallelizing an HE op over multiple FUs. We exploit rPLP by distributing the (ℓ+1) residue polynomials, and CLP by distributing the N coefficients, across multiple FUs.

HE functions with different access patterns can employ different parallelism. Applying rPLP to residue-polynomial-wise, CLP to coefficient-wise, and rPLP or CLP to element-wise functions is trivial. Parallelism can be applied even if the access pattern differs; e.g., CLP can be applied to NTT by parallelizing the butterfly ops. However, applying rPLP to coefficient-wise functions or CLP to residue-polynomial-wise functions induces an all-to-all data exchange among FUs.

As L < 100 in practice, rPLP has a limited degree of parallelism (DoP), which further drops as the ℓ of a ct decreases. F1 [66] exploits rPLP by dedicating the computation of a residue polynomial to each of the 16 clusters it features and exploits CLP inside each cluster to parallelize (i)NTT. As ℓ drops, F1 must execute multiple HE functions simultaneously to utilize all the clusters; however, this is not always possible due to the dependences between HE functions, incurring underutilization.

Thus, BTS only exploits CLP for (i)NTT; our target parameter sets with large N provide a huge DoP. For element-wise and coefficient-wise functions (e.g., BConv), all the residues of a coefficient are allocated to the same FU. We design a distributed memory system that places the relevant data near the FUs to minimize data movement.

C. Arrangement of (i)NTT functional units

Prior HE acceleration studies [8], [64]–[66] identified (i)NTT as the paramount acceleration target and placed multiple NTT units (NTTUs) that can perform both Butterfly_NTT and Butterfly_iNTT to exploit CLP and rPLP. F1, which targets an HE parameter set with N = 2^14 in particular, places (√N/2)·log N = 2^6 · 14 = 896 NTTUs per cluster and uses 16 clusters, populating a total of 14,336 NTTUs. It allocates a residue polynomial to each cluster to exploit rPLP, and CLP is exploited inside a cluster to perform (i)NTT in parallel with the massive number of NTTUs.

Fig. 5. The overview of BTS. Each PE in the grid is denoted as (column index, row index). PEs interconnect through the PE-PE NoC composed of xbarv and xbarh. The BrU is the broadcast unit; the BrU and main memory communicate with the PEs through separate NoCs. A PE consists of a scratchpad, an NTTU for performing NTT/iNTT, a BConvU for BConv, a modular multiplier (ModMult), and a modular adder (ModAdd). The BConvU consists of a ModMult and an MMAU.

We observe that such massive use of NTTUs [66] is wasteful for bootstrappable CKKS instances, where off-chip bandwidth becomes the dominant factor in overall performance. We analyze how many fully-pipelined NTTUs an HE accelerator requires to finish an HMult or HRot within the evk loading time for our target CKKS instances. We define the minimum required number of NTTUs (minNTTU) as the number of butterflies per HE op divided by the operating frequency, over the size of an evk divided by the main-memory bandwidth. When we assume a nominal operating frequency of 1.2GHz for NTTUs and HBM with 1TB/s of aggregate bandwidth, considering prior works in 7nm process nodes [24], [41], [49], minNTTU is:

minNTTU = ((dnum+2) · (k+ℓ+1) · (N/2) · log N / 1.2GHz) / (2 · dnum · (k+ℓ+1) · N · 8B / 1TB/s)

The value of minNTTU is maximized for a given N when dnum is 1. For N = 2^17, the value becomes 1,328. We utilize 2,048 NTTUs in BTS to provide some margin for other operations.
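This back-of-the-envelope number can be reproduced directly; a sketch assuming the stated 1.2GHz clock, 1TB/s bandwidth, and 8-byte residues (note that the (k+ℓ+1) factors cancel out):

```python
# Reproducing the minNTTU estimate under the stated assumptions.
from math import log2

def min_nttu(N, L, dnum, freq=1.2e9, bw=1e12):
    l = L                                  # worst case: ct at maximum level
    k = (L + 1) // dnum
    butterfly_time = (dnum + 2) * (k + l + 1) * (N / 2) * log2(N) / freq
    evk_load_time = 2 * dnum * (k + l + 1) * N * 8 / bw
    return butterfly_time / evk_load_time

print(round(min_nttu(2**17, 27, 1)))       # -> 1328
```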

BTS primarily targets supporting N = 2^17; when evenly distributing the 2^17 residues of a residue polynomial across 2,048 processing elements (PEs, one fully-pipelined NTTU per PE), each NTTU processes 2^6 residues. Then we can perform 6 out of the 17 NTT stages within a PE without communication between PEs. We minimize the data exchange between the 2,048 PEs by adopting 3D-NTT. We regard the N = 2^17 residues of a residue polynomial as a 3D data structure of size 2^6 × 2^5 × 2^6. Then, each PE performs a sequence of 2^6-, 2^5-, and 2^6-point NTTs, interleaved with just two rounds of inter-PE data exchange. Splitting the NTT in a more fine-grained manner requires more rounds of data exchange, so it is less energy-efficient.

V. BTS MICROARCHITECTURE

We devise a massively parallel architecture that distributes PEs in a grid. A PE consists of FUs and an SRAM scratchpad. Within a PE, the NTTU exploits CLP by handling a portion of the residues of a residue polynomial during (i)NTT. We allocate the residues with the same coefficient index to the same PE, for every residue polynomial. Thus, coefficient-wise and element-wise functions can be computed within a PE without any inter-PE data exchange.

Fig. 5 depicts the high-level overview of BTS. We arrange 2,048 (nPE) PEs in a grid with a vertical height of 32 (nPEver) and a horizontal width of 64 (nPEhor). PEs are interconnected via dimension-wise crossbars, 32×32 vertical crossbars (xbarv) and 64×64 horizontal crossbars (xbarh), topologically similar to a 2D flattened-butterfly network [45]. Each PE has an NTTU, a BConv unit (BConvU), a modular adder (ModAdd) and a modular multiplier (ModMult) for element-wise functions, and a scratchpad. We populate a central constant memory, storing precomputed values including the twiddle factors for (i)NTT and q̂_j, q̂_j^{−1} for BConv. A broadcast unit (BrU) delivers the precomputed values to the PEs at the required moments. Memory controllers are located at the top and bottom sides, each connecting an HBM stack. BTS receives instructions and necessary data from the host via a PCIe interface.

A. Datapath for (i)NTT

BTS maps the coefficients of a polynomial to the PEs in a way suited to 3D-NTT. We view the N residues of a residue polynomial as an (Nx, Ny, Nz) = (nPEhor, nPEver, N/nPE) cube. Then, in the RNS domain, the residue at coefficient index i (the coefficient of X^i) is at position (x, y, z) in this cube, where i = x + Nx·y + Nx·Ny·z. We allocate the residues at positions (x′, y′, z′), z′ ∈ [0, Nz), of such a cube to the PE at the (x′, y′) coordinate of the PE grid. 3D-NTT is broken down into 5 steps in BTS. First, we conduct i) NTTz inside a single PE, which corresponds to the NTT along the z-axis of the cube. Next, ii) a data exchange between vertically aligned PEs is executed, which corresponds to nPEhor yz-plane-parallel transpositions of residues in the cube. iii) NTTy along the z-axis follows. iv) A data exchange between horizontally aligned PEs is executed, which corresponds to nPEver xz-plane-parallel transpositions of residues in the cube. Finally, v) NTTx along the z-axis is carried out.
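A sketch of this coefficient-to-PE mapping, assuming the 64×32 PE grid and N = 2^17:

```python
# Coefficient-to-PE mapping for 3D-NTT: i = x + Nx*y + Nx*Ny*z.
from collections import Counter

Nx, Ny = 64, 32                  # nPE_hor, nPE_ver
N = 2**17
Nz = N // (Nx * Ny)              # residues held per PE (= 64)

def cube_pos(i):
    """Coefficient index i -> (x, y, z) position in the logical cube."""
    return i % Nx, (i // Nx) % Ny, i // (Nx * Ny)

def pe_of(i):
    x, y, _ = cube_pos(i)
    return (x, y)                # PE grid coordinate holding coefficient i

# Each PE holds exactly Nz = 64 residues of a residue polynomial:
counts = Counter(pe_of(i) for i in range(N))
assert len(counts) == Nx * Ny and all(c == Nz for c in counts.values())
```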

An NTTU supports both NTT and iNTT by using logic circuits similar to [72]–[75]. We allocate separate register files (RFNTT) to reuse data between the different (i)NTT stages. An NTTU executes the necessary NTTx, NTTy, and NTTz by decomposing them into radix-2 NTTs. It is fully pipelined and performs one butterfly op per clock. An input pair is fed to, and an output pair is stored from, the NTTU each cycle, provided by 2 pairs of RFNTTs. An RFNTT has a 64-bit read port and a write port.

We hide the time for the vertical and horizontal data exchanges of 3D-NTT (steps ii) and iv)) through coarse-grained, epoch-based pipelining. As steps i), iii), and v) are executed with the same NTTU, we set the length of an epoch to the time needed to perform these three steps (N·log N / (2·nPE) cycles). Within the r-th epoch, we time-multiplex step i) of the (r+2)-th, step iii) of the r-th, and step v) of the (r−2)-th residue polynomials, while concurrently exchanging data for step ii) of the (r+1)-th and step iv) of the (r−1)-th residue polynomials. Thus, the (i)NTT of one residue polynomial finishes every epoch.

A single (i)NTT on a residue polynomial requires N different twiddle factors. Because each prime modulus needs different twiddle factors, the total size of the twiddle factors for (i)NTT on a ciphertext reaches dozens of MBs for our target CKKS instances. We reduce the storage for the twiddle factors by decomposing them, using the on-the-fly twiddling (OT) method [46]. OT replaces the N-sized precomputed twiddle-factor table with two tables: a higher-digit table of ξ_{2N}^{mj}, where 1 ≤ j < (N−1)/m, and a lower-digit table of ξ_{2N}^{i}, where 1 ≤ i < m. We can compose any twiddle factor ξ_{2N}^{k} by multiplying the two twiddle factors ξ_{2N}^{i} and ξ_{2N}^{mj} that satisfy k = mj + i. OT reduces the memory usage to 2/m of the original. BTS stores the lower-digit tables of the prime moduli in the PEs (each PE having different entries), while storing the higher-digit tables in the BrU (all PEs sharing the entries). The BrU broadcasts the higher-digit table for a prime modulus to the PEs every (i)NTT epoch.
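A sketch of OT at toy scale, reusing the N = 8, q = 97, ψ = 8 parameters from the NTT sketch in Section IV-A, with a hypothetical digit size m = 4 (for simplicity, the tables below also include the j = 0 and i = 0 identity entries):

```python
# On-the-fly twiddling (OT): compose xi^k from two small tables, k = m*j + i.
q, N, psi = 97, 8, 8             # psi is a primitive 2N-th root mod q
m = 4

low  = [pow(psi, i, q) for i in range(m)]                         # xi^i
high = [pow(psi, m * j, q) for j in range((2 * N + m - 1) // m)]  # xi^{mj}

def twiddle(k):
    """Compose xi^k on the fly with one modular mult."""
    j, i = divmod(k, m)
    return high[j] * low[i] % q

assert all(twiddle(k) == pow(psi, k, q) for k in range(2 * N))
```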

B. Base Conversion Unit (BConvU)

BConv consists of two parts. The first part multiplies residue polynomials with [q̂_j^{−1}]_{q_j}, and the second part multiplies them with [q̂_j]_{p_i} and accumulates the results. The second part exhibits a coefficient-wise access pattern because it accumulates residues of the same coefficient index across all residue polynomials.

A BConvU, with a modular multiplier (ModMult) for the first part and a modular multiply-accumulate unit (MMAU) for the second part, is placed in each PE. BConv strongly depends on the preceding iNTT (see Fig. 3). Since iNTT is a residue-polynomial-wise function, whereas the second part of BConv is a coefficient-wise function, the MMAU would have to wait until iNTT finished on all residue polynomials. We mitigate this by partially overlapping iNTT and BConv, modifying the right-hand side of Eq. 9 to:

{ Σ_{j1=0}^{(ℓ+1)/lsub − 1} [ Σ_{j2=j1·lsub}^{(j1+1)·lsub − 1} [[a(X)]_{q_{j2}} · q̂_{j2}^{−1}]_{q_{j2}} · q̂_{j2} ]_{p_i} }_{0≤i<k}   (10)
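The sketch below checks this restructuring on a single coefficient and a single target modulus, with toy moduli and lsub = 2: the grouped accumulation of Eq. 10 matches the flat sum of Eq. 9.

```python
# Eq. 10 sketch: accumulate the BConv sum in groups of l_sub residue
# polynomials, so work can start before iNTT finishes on all of them.
from math import prod

C = [97, 193, 769, 257]          # toy base C_l with l+1 = 4 moduli
p = 12289                        # one target special modulus p_i
Q = prod(C)
a = 987654321 % Q
q_hat = [Q // q for q in C]
terms = [(a % q) * pow(qh, -1, q) % q for q, qh in zip(C, q_hat)]

l_sub = 2
flat = sum(t * (qh % p) for t, qh in zip(terms, q_hat)) % p   # Eq. 9

acc = 0                          # Eq. 10: per-group partial sums
for g in range(0, len(C), l_sub):
    partial = sum(t * (qh % p) for t, qh in
                  zip(terms[g:g + l_sub], q_hat[g:g + l_sub]))
    acc = (acc + partial) % p    # running sum kept in the scratchpad

assert acc == flat
```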

Now, the second part starts once the preceding iNTT and the first part of BConv are finished on lsub (= 4 in BTS) residue polynomials and the results are stored in RFMMAU. The MMAU computes the partial sum (the inner sum of Eq. 10) of them and accumulates it onto the previous results (the outer sum), which are loaded from and stored to the scratchpad, inducing a read and a write every cycle. Temporal registers and a FIFO minimize the bandwidth pressure on RFMMAU and transpose the data into the correct orientation to feed the lsub lanes of the MMAU. The precomputed values of [q̂_j^{−1}]_{q_j} and [q̂_j]_{p_i} (BConv tables) are loaded into the dedicated RFBT1 and RFBT2, respectively, from the BrU when needed.

We also leverage the MMAU for other operations. The subtraction, 1/P scaling, and d0/d1 addition at the end of key-switching (Fig. 3) can be expressed as [d2′.ax]_{Q_ℓ} × (1/P) + [d2′.ax]_{P→Q_ℓ} × (−1/P) + d1 × 1 + 0 × 0, so we fuse these three operations to compute on the MMAU. We call this fusion subtraction-scaling-addition (SSA).

C. Scratchpad

The per-PE scratchpad has three purposes. First, it stores the temporary data generated in the course of HE ops. The size of the temporary data during key-switching can be large (e.g., a single (i)NTT or BConv can produce 28MB at ℓ+1 = 28 and N = 2^17). If such data did not reside on chip, the additional off-chip accesses would cause severe performance degradation.

Second, the scratchpad stores the prefetched evk. To hide the evk load time, an evk has to be prefetched beforehand. As an evk is not consumed right after being loaded on-chip, it occupies a portion of the scratchpad until used.

Third, the scratchpad functions as a cache for cts, controlled explicitly by software (SW caching). cts often show high temporal locality during a sequence of HE ops; for instance, during bootstrapping, a ct is commonly subjected to multiple HRots. Moreover, as HE ops form a deterministic computational flow and the granularity of cache management is as large as a ct, SW control is manageable.

The scratchpad bandwidth demand of the BConvU is high (detailed later in Fig. 8) due to the accesses involved in updating the partial sums. Considering that the partial-sum size is only proportional to k in Eq. 10, and that it is loaded (ℓ+1)/lsub times, the bandwidth pressure can be relieved by increasing lsub. However, doing so would also require increasing the number of lanes in the MMAU (and hence the size of RFMMAU), resulting in a trade-off.

D. Network-on-Chip (NoC) design

BTS has three types of on-chip communication: 1) off-chip memory traffic to the PEs (PE-Mem NoC), 2) distribution of precomputed constants to the PEs (BrU NoC), and 3) inter-PE data exchange for (i)NTT and automorphism (PE-PE NoC). BTS has a large number of nodes (over 2K endpoints) and requires high bandwidth. Given the unique communication characteristics of each type, BTS provides 3 separate NoCs instead of sharing a single NoC, enabling deterministic communication while minimizing NoC overhead.

PE-Mem NoC: Since data is distributed evenly across the PEs, the off-chip memory (i.e., HBM2e) is placed at the top and bottom, and each HBM only needs to communicate with the half of the PEs placed nearby. The PE grid placement is exploited by separating the PEs into 32 regions and connecting each HBM pseudo-channel to a single PE region (an HBM2e stack supports 16 pseudo-channels [56]; thus, the upper half of the PEs forms 16 regions and the lower half another 16 regions, with each region consisting of 64 PEs).

BrU NoC: BrU data is globally shared and must be broadcast to all PEs. Given the large number of PEs, the BrU is organized hierarchically: 128 local BrUs are placed, where each local BrU provides the higher-digit tables of twiddle factors and the BConv tables to 16 PEs. The global BrU is loaded with all precomputed values before an HE application starts and sends data to the local BrUs, which serve as temporary storage/repeaters.

PE-PE NoC: The PE-PE NoC must support the highest bandwidth because of the data exchanges between the PEs. The communication pattern is symmetric (i.e., each PE sends and receives the same amount of data), and no single PE is oversubscribed. In addition, since the traffic pattern is known (e.g., all-to-all or a fixed permutation traffic), the NoC can be greatly simplified. BTS implements a logical 2D flattened butterfly [45] since communication is limited to other PEs within each row and within each column. However, instead of having a router at each PE, a single "router" xbarh (respectively, xbarv) is shared by all PEs within each row (column); it is placed at the center of each row (column) as shown in Fig. 5 and used for the horizontal (vertical) data exchange steps of (i)NTT (steps ii), iv)). Each xbarh (xbarv) does not require any allocation since the traffic pattern is known ahead of time and can be scheduled through predetermined arbitration.

E. Automorphism

We identify that BTS can handle the automorphism for HRots efficiently. All residues mapped to a single PE always move to another single destination PE under BTS' PE-coefficient mapping; i.e., the inter-PE communication of automorphism exhibits a permutation pattern. A PE at the (x′, y′) PE-grid coordinate holds the residues at positions (x′, y′, z′), z′ ∈ [0, Nz), corresponding to coefficient indices i = x′ + Nx·y′ + Nx·Ny·z′ (Section V-A). These i's in binary format differ only in the higher bit-field (the Nx·Ny·z′ term), so the automorphism destination indices (i·5^r in Eq. 5) also differ only in the higher bit-field; the residues are thus mapped to the same destination PE, determined by the lower bit-field (x′′ + Nx·y′′).

We can decompose such a permutation pattern into 3 steps to fit the PE-PE NoC structure of BTS: an intra-PE permutation (z-axis), a vertical permutation (y-axis), and a horizontal permutation (x-axis). Each step gradually updates the i's to i·5^r, from the higher to the lower bit-fields. The intra-PE permutation does not use the NoC; the vertical and horizontal permutations are handled by xbarv and xbarh. The PE-PE NoC can support an HRot with an arbitrary rotation amount (r) without data contention, a property similar to that of 3D-NTT.
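This single-destination property can be verified exhaustively; a sketch assuming the 64×32 grid and N = 2^17:

```python
# Check: under sigma_r, all Nz residues of one PE land on one destination PE.
Nx, Ny = 64, 32
N = 2**17
M = Nx * Ny                       # lower bit-field size (2^11)

def pe_of(i):
    return (i % Nx, (i // Nx) % Ny)

r = 5
five_r = pow(5, r, N)
for x in range(Nx):
    for y in range(Ny):
        src = [x + Nx * y + M * z for z in range(N // M)]  # PE (x,y)'s indices
        dests = {pe_of(i * five_r % N) for i in src}
        assert len(dests) == 1    # exactly one destination PE per source PE
```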

TABLE II
THE AREA AND THE PEAK POWER OF COMPONENTS IN BTS.

Component            Area (µm²)   Power (mW)   Freq (GHz)
Scratchpad SRAM      114,724      9.86         1.2
RFs                  12,479       2.29         Various
NTTU                 9,501        12.17        1.2
ModMult (BConvU)     4,070        0.56         0.3
MMAU (BConvU)        9,511        8.42         1.2
Exchange unit        421          1.03         1.2
ModMult              3,833        1.35         0.6
ModAdd               325          0.08         0.6
1 PE                 154,863      35.75        -

Component            Area (mm²)   Power (W)    Freq (GHz)
2048 PEs             317.2        73.21        -
Inter-PE NoC         3.06         45.93        1.2
Global BrU + NoC     0.42         0.10         0.6
128 local BrUs       3.69         0.04         0.6
HBM NoC              0.10         6.81         1.2
2 HBM controllers    29.6 [41]    2.36 [59]    -
PCIe5x16 interface   19.6 [41]    5.37 [10]    -
Total                373.6        133.8        -

VI. EVALUATION

A. Hardware modeling of BTS

We used the ASAP7 [25], [26] design library to synthesize the logic units and datapath components in a 7nm technology node. Modular reduction units use Barrett reduction [9]. We simulated RFs and scratchpads using FinCACTI [67] due to the absence of a public 7nm memory compiler. We updated the analytic models and technology constants of FinCACTI to match ASAP7 and the IRDS roadmap [39]. We validated the RTL synthesis and SRAM simulation results against published information [5], [15], [40], [41], [58], [69], [71].

BTS uses single-ported, 128-bit wide, 1.2GHz SRAMs for the scratchpads, providing a total capacity of 512MB and a bandwidth of 38.4TB/s chip-wide. RFs are implemented as single-ported SRAMs with variable sizes, port widths, and operating frequencies following the FUs' needs. 22MB of RFs are used chip-wide, providing 292TB/s. Crossbars in the PE-PE NoC have 12-bit wide ports and run at 1.2GHz, providing a bisection bandwidth of 3.6TB/s. Two 360nm-pitch metal layers [58] are dedicated to the NoC and are shared with the HBM and BrU NoCs. We analyzed the cost of wires and crossbars using FinCACTI and prior works [7], [39], [57], [61]. The peak power and area estimation results are shown in Table II. BTS occupies 373.6mm² and consumes up to 133.8W. The peak power is estimated for the worst-case power scenario, where HMult and HRot are performed continuously. The NoC wires are routed over other components [61].

B. Experimental setup

We developed an in-house simulator modeling the compute capability, latency, and bandwidth of the FUs and memory composing BTS. Their utilization was collected and combined with the power model to calculate the energy. We prioritized the scratchpad usage for the temporary data during HE ops and for prefetched evks. The remaining capacity was used for caching cts in a greedy manner, with an LRU policy on a capacity miss.

TABLE III
THE CKKS INSTANCES USED FOR EVALUATION.

Parameter   N      L    dnum   log PQ   λ       Temp data
BTS-1       2^17   27   1      3090     133.4   183MB
BTS-2       2^17   39   2      3210     128.7   304MB
BTS-3       2^17   44   3      3160     130.8   365MB

We compared the performance of BTS with prior works using applications requiring bootstrapping, logistic regression and CNN inference, the most complex workloads currently available for CKKS. We evaluated logistic regression using HELR [36], which trains a binary classification model for the MNIST data [31] labeled 3 and 8. A mini-batch containing 1,024 14×14-pixel images was selected per iteration, and the average time over 30 iterations was measured.

Lee et al. [51] developed a CNN model based on ResNet-20 [38] with a 98b-secure CKKS instance, showing 92.43% accuracy on CIFAR-10 classification [50]. They approximated ReLU with a high-degree polynomial function because comparing encrypted data is expensive in HE schemes [21], [53]. This network consumes hundreds of levels during inference, requiring frequent bootstrapping. We ran inference on the same network with BTS. We additionally used the channel packing method proposed in [44] to pack all the feature-map channels into a single ct to further improve performance.

We compared BTS with the state-of-the-art implementations on CPU (Lattigo [32]), GPU (100x [42]), and ASIC (F1 [66]) for the bootstrapping and logistic-regression workloads. We ran Lattigo on a system with an Intel Skylake CPU (Xeon Platinum 8160) and 256GB of DDR4-2666 memory. We used a 128b-secure CKKS instance preset of Lattigo. For 100x and F1, we used the execution times reported in each paper. We used the CKKS instances shown in Table III for evaluating BTS. They all have the same degree and satisfy 128b security, but have different L and dnum. As dnum and L increase, the temporary data grows, requiring more scratchpad space.

C. Performance and efficiency of BTS

Amortized mult time per slot: BTS outperforms the state-of-the-art CPU/GPU/ASIC implementations by tens to thousands of times in the throughput of HMult. Fig. 6 shows Tmult,a/slot of Lattigo, 100x, F1, and BTS-x. BTS-2 performs the best, achieving 45.5ns in Tmult,a/slot, 2,237× better than Lattigo. F1 is even 2.5× slower than Lattigo because F1 only supports single-slot bootstrapping.1 Tmult,a/slot of 100x is 743ns, the best among the prior works. However, this is for a 97b-secure parameter set; when using a 173b-secure CKKS instance, 100x reported an 8µs Tmult,a/slot.

1We call a ct sparsely-packed if its corresponding message occupies far fewer slots than the maximum available (N/2). Bootstrapping a sparsely-packed ct reduces computational complexity and consumes fewer levels [16]. In the extreme single-slot case, this effect is maximized. F1 supports only single-slot bootstrapping because it lacks the multiplicative levels, as it targets small parameter sets.
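For reference, the metric itself can be computed with the following accounting (a sketch of our reading of the Section III definition: one bootstrapping plus the mults it enables, amortized over the restored levels and the N/2 slots; treat the exact accounting, and all numbers below, as illustrative assumptions rather than measured BTS values):

    def t_mult_amortized_per_slot(t_boot, t_mult_per_level, levels_restored, n_slots):
        """Amortize one bootstrapping plus the mults it enables over levels x slots."""
        total = t_boot + sum(t_mult_per_level)
        return total / (levels_restored * n_slots)

    n_slots = 2**16                       # N/2 slots for N = 2^17
    t = t_mult_amortized_per_slot(
        t_boot=40e6,                      # 40 ms bootstrapping (ns), hypothetical
        t_mult_per_level=[0.5e6] * 20,    # 20 usable levels at 0.5 ms each, hypothetical
        levels_restored=20,
        n_slots=n_slots,
    )
    print(f"T_mult,a/slot = {t:.1f} ns")  # ~38 ns for these toy inputs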

[Figure 6 plots the amortized mult time per slot (µs, log scale) against the target security level λ (70 to 190) for Lattigo, 100x, F1, and BTS-1/2/3, with speedup callouts of 2,237×, 16.3×, 5,714×, and 176×.]

Fig. 6. Comparison of Tmult,a/slot between BTS and other prior works: Lattigo [32], 100x [42], and F1 [66]. BTS-x refers to the performance of BTS measured using the CKKS instances specified in Table III.

[Figure 7(a) plots the amortized mult time per slot (ns) for BTS-1/2/3 under the minimum bound, a 512MB scratchpad, and a 2GB scratchpad; Figure 7(b) shows stacked bars (0-100%) of the bootstrapping vs. other-op time share for Tmult,a/slot, HELR, and ResNet-20.]

Fig. 7. (a) Comparison of the minimum bound of Tmult,a/slot (Section III) and the actual Tmult,a/slot using 512MB and 2GB of scratchpad for BTS-x, and (b) the portion of bootstrapping time for each application on BTS-1.


The performance of BTS-x differs from the minimum-bound performance shown in Fig. 2 because cts are not always on the scratchpad, which has limited capacity. Fig. 7(a) shows the minimum and actual performance using 512MB and 2GB of scratchpad for BTS-x. BTS-2 always performs best. However, BTS-1 performs better than BTS-3 with a 512MB scratchpad because the former requires less temporary data, leading to a higher hit rate for cts. With a large enough (albeit impractical) scratchpad capacity of 2GB, cts mostly reside in the scratchpad, reaching performance close to the minimum bound.

Logistic regression: BTS retains its performance superiority over prior works in logistic regression classifier training. Table IV reports the average training time per iteration. As opposed to bootstrapping, for which we used the implementation already available in the Lattigo repository, we implemented HELR for Lattigo ourselves. We refer to the reported execution time of 100x [42]. Due to the limited parameter set F1 supports, we assumed that 1,024 images were trained over 4 iterations and applied 14×14 = 196 single-slot bootstrappings without considering the cost of packing/unpacking cts for bootstrapping (giving favor to F1). The execution time of BTS-2 is 28.4ms, 1,306× and 36.1× better than Lattigo and F1, respectively.

ResNet-20: BTS performs up to 5,556× faster than the prior work [51]. Table V shows the execution times of [51] and BTS-x on ResNet-20. BTS-1 without channel packing is 311× faster than [51]. By adopting the channel-packing method [44], which exploits the abundant slots of our target CKKS instances, we reduced the working set and improved the throughput, resulting in an additional 17.8× performance gain and achieving 1.91s of ResNet-20 inference latency on an encrypted image.


[Figure 8 shows the HMult computation timeline in BTS-1 (iNTT, BConv, NTT, SSA, and d2 ⊗ evk stages of the d2, ax, and bx polynomials overlapped with loads of evk.ax/evk.bx over P and Q) against execution time (ns, 0 to ~120,000), together with on-chip scratchpad usage (MB) and scratchpad bandwidth utilization (%).]

Fig. 8. Timeline, on-chip scratchpad usage change, and scratchpad bandwidth utilization change when BTS-1 performs HMult.

TABLE IV
COMPARISON OF PERFORMANCE BETWEEN BTS AND OTHER PRIOR WORKS [32], [42], [66] FOR LOGISTIC REGRESSION TRAINING [36].

                      Lattigo   100x   F1      BTS-1   BTS-2    BTS-3
Execution time (ms)   37,050    775    1,024   39.9    28.4     43.5
Speedup vs. Lattigo   1×        48×    36×     929×    1,306×   852×

TABLE V
COMPARING [51] WITH BTS FOR RESNET-20 INFERENCE.

                      w/o channel packing     w/ channel packing
                      [51]       BTS-1        BTS-1    BTS-2    BTS-3
Execution time (s)    10,602     34.0         1.91     2.02     3.09
Speedup (vs. [51])    1×         311×         5,556×   5,240×   3,427×

[Figure 9 breaks down the bootstrapping time (ms) into CMult/CAdd, HRescale, HAdd, PMult, HRot, and HMult, and overlays the EDAP (J·s·mm2), across scratchpad SRAM sizes from 192MB to 1,024MB.]

Fig. 9. The bootstrapping time and Energy-Delay Area Product (EDAP) of BTS-1 at various scratchpad SRAM sizes.


As opposed to bootstrapping and logistic regression, BTS-1 performs best on ResNet-20. This is because HE ops other than bootstrapping are more expensive in BTS-2 and BTS-3, and ResNet-20 spends a larger portion of its time processing these non-bootstrapping ops (see Fig. 7(b)). Higher dnum values allow for additional levels and a better Tmult,a/slot, but also increase the ct size in every HE op and thus the off-chip access (see the size sketch below).

PE resource utilization over time: The resources populated in PEs are highly utilized while processing HE ops. Fig. 8 depicts the detailed timeline of HMult on BTS-1 when cts are on the scratchpad. HBM achieves 98% of its peak bandwidth (1TB/s) reading the evk. NTTUs are busy processing the (i)NTT of three intermediate polynomials (d2, ax, and bx) 76% of the time. BConv is partially pipelined with iNTT and has a strong dependency on the following NTT; thus, it occupies the BConvU for 33% of the time. The scratchpad bandwidth requirement of BConv is high because it has to load the partial sums for all p_i's in Eq. 10 within l_sub epochs. The BConvU runs SSA while not occupied by BConv.
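The ct- and evk-size growth with dnum noted above can be made concrete with a back-of-the-envelope model (a sketch under our simplifying assumptions about RNS-CKKS sizing; the limb counts, 8B words, and the evk structure below are assumptions, not values from the text):

    import math

    def ct_mb(N, L):
        """Fresh ct: 2 polynomials x (L+1) RNS limbs x N words of 8 B."""
        return 2 * (L + 1) * N * 8 / 2**20

    def evk_mb(N, L, dnum):
        """evk: dnum pairs of polynomials over PQ with ~(L+1) + ceil((L+1)/dnum) limbs."""
        limbs = (L + 1) + math.ceil((L + 1) / dnum)
        return dnum * 2 * limbs * N * 8 / 2**20

    N = 2**17
    for name, L, dnum in (("BTS-1", 27, 1), ("BTS-2", 39, 2), ("BTS-3", 44, 3)):
        print(f"{name}: ct = {ct_mb(N, L):6.1f} MB, evk = {evk_mb(N, L, dnum):6.1f} MB")

Under this model the evk grows from roughly 112MB (BTS-1) to 360MB (BTS-3), illustrating why higher dnum inflates the off-chip traffic per HE op.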

Bandwidth and capacity utilization of the scratchpad fluctuate over time while remaining properly provisioned to meet the requirement. The average bandwidth usage was 58.6% over time, while it peaked at 90% when processing a BConv. The demanded capacity was also highest at BConv.ax, with 183MB.

Impact of scratchpad size on performance and EDAP: The performance and energy efficiency of BTS improve as we deploy a larger scratchpad, but saturate once the scratchpad holds most of the HE ops' working sets. Fig. 9 shows the execution-time breakdown and energy-delay-area product (EDAP [70]) for the bootstrapping of BTS-1 with various scratchpad sizes. We increased the scratchpad size from 192MB (close to the temporary data size for HMult) in 64MB steps, up to 1GB.

With a 192MB scratchpad, BTS has to frequently load cts from the off-chip memory due to capacity misses. At this point, HMult/HRot, which used to be dominant (77% of the bootstrapping time for Lattigo) due to their high computational complexity, only take 24% of the execution time. The rest consists of PMult, HAdd, HRescale, and CMult/CAdd. While BTS greatly reduces the computation time of HMult/HRot with its abundant PEs, the ct load time, which any HE op incurs on an SW cache miss, now dominates.

As the scratchpad size increases, the portion of HMult/HRot in bootstrapping increases. This is because the SW cache hit rate of cts for every HE op gradually increases; with a 512MB scratchpad, the hit rates are 65.6%, 98.8%, 93.7%, 98.6%, 97.5%, and 47.8% for HMult, HRot, PMult, HAdd, HRescale, and CMult/CAdd, respectively. The execution time of HMult/HRot is lower-bounded by the evk load time, even on SW cache hits. However, the other HE ops, which do not require an evk, can take significantly less time, by the ratio of on-chip to off-chip bandwidth (>10), when the necessary cts are located in the scratchpad.

As the scratchpad gets larger, the bootstrapping time of BTS decreases with higher SW cache hit rates, leading to lower (better) EDAP values. Then the cache hit rates and the bootstrapping time saturate. The EDAP is minimized when the scratchpad size is 704MB. Considering the superlinear relationship between fabrication cost and die area, populating a 512MB scratchpad for BTS is appropriate.
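EDAP here is the product of energy, delay, and area [70]; the helper below shows the accounting applied at each scratchpad size (the power, time, and area values in the sweep are placeholders, not the measured numbers behind Fig. 9):

    def edap(avg_power_w: float, exec_time_s: float, area_mm2: float) -> float:
        """Energy-Delay-Area Product in J*s*mm^2: (P*t) * t * A."""
        energy_j = avg_power_w * exec_time_s
        return energy_j * exec_time_s * area_mm2

    # placeholder sweep: larger scratchpads shrink bootstrapping time but add area
    for size_mb, t_ms, area in ((192, 38.0, 340.0), (512, 21.0, 373.6), (1024, 19.5, 430.0)):
        print(f"{size_mb:4d} MB: EDAP = {edap(100.0, t_ms / 1e3, area):.2f} J*s*mm^2")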


VII. RELATED WORK

There have been a number of prior works that tackled HE acceleration using CPUs, GPUs, FPGAs, and ASICs.

CPU acceleration: [28] spawns multiple threads to parallelize HE ops over multi-core CPUs. [11], [43] leverage short-SIMD support. [32] exploits the algorithmic analysis of [12] for an efficient bootstrapping implementation. However, other platforms outperform CPUs at processing HE ops.

GPU acceleration: The GPU is a prime candidate for accelerating HE ops as it is equipped with a massive number of integer units and abundant memory bandwidth. However, a majority of prior works did not handle bootstrapping [1]-[3], [43]. [42] is the first work that supports CKKS bootstrapping on a GPU. It optimized HE computation by fusing GPU kernels, saving off-chip accesses and hence leading to 242× faster bootstrapping over a CPU. However, the deficiency of on-chip storage forces some kernels to stay unfused [46]. BTS allows all temporary data to reside on-chip, minimizing the off-chip access.

FPGA/ASIC acceleration: A different set of works accelerates HE using FPGAs or ASICs, but most of them do not consider bootstrapping [47], [48], [64], [65]. HEAX [64] dedicated hardware to CKKS mult on an FPGA, reaching a 200× performance gain over a CPU. However, its design is fixed to a limited set of parameters and does not consider bootstrapping. F1 [66] is the first ASIC design to support bootstrapping. It is a programmable accelerator supporting multiple FHE schemes, including CKKS and BGV. F1 achieves impressive performance on various FHE applications as it provides tailored high-throughput computation units and stores all evks on-chip, minimizing the off-chip access. However, F1 targets parameter sets with a low degree N; thus, it supports only non-packed (single-slot) bootstrapping, whose throughput is greatly degraded compared to BTS.

VIII. CONCLUSION

We have proposed an accelerator architecture for fully homomorphic encryption (FHE), which is primarily optimized for the throughput of bootstrapping encrypted data. By comprehensively analyzing the impact of selecting key parameter values on the bootstrapping performance of CKKS, the emerging HE scheme, we devised the design principles of bootstrappable HE accelerators. Based on these principles, we suggested BTS, which distributes massively-parallel processing elements (PEs) connected through a network-on-chip tailored to the unique, deterministic traffic patterns of the number theoretic transform (NTT) and automorphism, the critical functions of HE operations. We designed BTS to balance the off-chip memory access, on-chip data reusability, and computation required for bootstrapping. With BTS, we obtained a speedup of 5,714× in HE multiplication throughput and 5,556× in CNN inference compared to the state-of-the-art ASIC design and CPU implementation, respectively.

REFERENCES

[1] A. Al Badawi, L. Hoang, C. F. Mun, K. Laine, and K. M. M. Aung, "PrivFT: Private and Fast Text Classification with Homomorphic Encryption," IEEE Access, vol. 8, 2020.
[2] A. Al Badawi, Y. Polyakov, K. M. M. Aung, B. Veeravalli, and K. Rohloff, "Implementation and Performance Evaluation of RNS Variants of the BFV Homomorphic Encryption Scheme," IEEE Transactions on Emerging Topics in Computing, 2019.
[3] A. Al Badawi, B. Veeravalli, C. F. Mun, and K. M. M. Aung, "High-Performance FV Somewhat Homomorphic Encryption on GPUs: An Implementation Using CUDA," IACR Transactions on Cryptographic Hardware and Embedded Systems, vol. 2018, no. 2, 2018.
[4] M. R. Albrecht, M. Chase, H. Chen, J. Ding, S. Goldwasser, S. Gorbunov, S. Halevi, J. Hoffstein, K. Laine, K. E. Lauter, S. Lokam, D. Micciancio, D. Moody, T. Morrison, A. Sahai, and V. Vaikuntanathan, "Homomorphic Encryption Standard," IACR Cryptology ePrint Archive, no. 939, 2019.
[5] C. Auth, A. Aliyarukunju, M. Asoro, D. Bergstrom, V. Bhagwat, J. Birdsall, N. Bisnik, M. Buehler, V. Chikarmane, G. Ding, Q. Fu, H. Gomez, W. Han, D. Hanken, M. Haran, M. Hattendorf, R. Heussner, H. Hiramatsu, B. Ho, S. Jaloviar, I. Jin, S. Joshi, S. Kirby, S. Kosaraju, H. Kothari, G. Leatherman, K. Lee, J. Leib, A. Madahavan, K. Marla, H. Meyer, T. Mule, C. Parker, S. Parthasarathy, C. Pelto, L. Pipes, I. Post, M. Prince, A. Rahman, S. Rajamani, A. Saha, J. Dacuna Santos, M. Sharma, V. Sharma, J. Shin, P. Sinha, P. Smith, M. Sprinkle, A. St. Amour, C. Staus, R. Suri, D. Towner, A. Tripathi, A. Tura, C. Ward, and A. Yeoh, "A 10nm High Performance and Low-Power CMOS Technology Featuring 3rd Generation FinFET Transistors, Self-Aligned Quad Patterning, Contact over Active Gate and Cobalt Local Interconnects," in IEEE International Electron Devices Meeting, 2017.
[6] J. Bajard, J. Eynard, M. A. Hasan, and V. Zucca, "A Full RNS Variant of FV Like Somewhat Homomorphic Encryption Schemes," in Selected Areas in Cryptography, vol. 10532, 2016.
[7] K. Banerjee and A. Mehrotra, "A Power-Optimal Repeater Insertion Methodology for Global Interconnects in Nanometer Designs," IEEE Transactions on Electron Devices, vol. 49, no. 11, 2002.
[8] U. Banerjee, T. S. Ukyab, and A. P. Chandrakasan, "Sapphire: A Configurable Crypto-Processor for Post-Quantum Lattice-based Protocols," IACR Transactions on Cryptographic Hardware and Embedded Systems, vol. 2019, no. 4, 2019.
[9] P. Barrett, "Implementing the Rivest Shamir and Adleman Public Key Encryption Algorithm on a Standard Digital Signal Processor," in Annual International Conference on the Theory and Application of Cryptographic Techniques, 1986.
[10] M. Bichan, C. Ting, B. Zand, J. Wang, R. Shulyzki, J. Guthrie, K. Tyshchenko, J. Zhao, A. Parsafar, E. Liu, A. Vatankhahghadim, S. Sharifian, A. Tyshchenko, M. De Vita, S. Rubab, S. Iyer, F. Spagna, and N. Dolev, "A 32Gb/s NRZ 37dB SerDes in 10nm CMOS to Support PCI Express Gen 5 Protocol," in IEEE Custom Integrated Circuits Conference, 2020.
[11] F. Boemer, S. Kim, G. Seifu, F. D. M. de Souza, and V. Gopal, "Intel HEXL: Accelerating Homomorphic Encryption with Intel AVX512-IFMA52," in Workshop on Encrypted Computing & Applied Homomorphic Cryptography, 2021.
[12] J. Bossuat, C. Mouchet, J. R. Troncoso-Pastoriza, and J. Hubaux, "Efficient Bootstrapping for Approximate Homomorphic Encryption with Non-sparse Keys," in Annual International Conference on the Theory and Applications of Cryptographic Techniques, vol. 12696, 2021.
[13] Z. Brakerski, C. Gentry, and V. Vaikuntanathan, "(Leveled) Fully Homomorphic Encryption without Bootstrapping," ACM Transactions on Computing Theory, vol. 6, no. 3, 2014.
[14] Z. Brakerski and V. Vaikuntanathan, "Efficient Fully Homomorphic Encryption from (Standard) LWE," SIAM Journal on Computing, vol. 43, no. 2, 2014.
[15] J. Chang, Y. Chen, W. Chan, S. P. Singh, H. Cheng, H. Fujiwara, J. Lin, K. Lin, J. Hung, R. Lee, H. Liao, J. Liaw, Q. Li, C. Lin, M. Chiang, and S. Wu, "12.1 A 7nm 256Mb SRAM in High-K Metal-Gate FinFET Technology with Write-Assist Circuitry for Low-VMIN Applications," in IEEE International Solid-State Circuits Conference, 2017.
[16] H. Chen, I. Chillotti, and Y. Song, "Improved Bootstrapping for Approximate Homomorphic Encryption," in Annual International Conference on the Theory and Applications of Cryptographic Techniques, 2019.
[17] Y. Chen, J. Emer, and V. Sze, "Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks," in ISCA, 2016.
[18] J. H. Cheon, K. Han, A. Kim, M. Kim, and Y. Song, "A Full RNS Variant of Approximate Homomorphic Encryption," in Selected Areas in Cryptography, vol. 11349, 2018.
[19] J. H. Cheon, M. Hhan, S. Hong, and Y. Son, "A Hybrid of Dual and Meet-in-the-Middle Attack on Sparse and Ternary Secret LWE," IEEE Access, vol. 7, 2019.
[20] J. H. Cheon, A. Kim, M. Kim, and Y. S. Song, "Homomorphic Encryption for Arithmetic of Approximate Numbers," in International Conference on the Theory and Applications of Cryptology and Information Security, vol. 10624, 2017.
[21] J. H. Cheon, D. Kim, and D. Kim, "Efficient Homomorphic Comparison Methods with Optimal Complexity," in International Conference on the Theory and Application of Cryptology and Information Security, 2020.
[22] J. H. Cheon, Y. Son, and D. Yhee, "Practical FHE Parameters Against Lattice Attacks," IACR Cryptology ePrint Archive, no. 39, 2021.
[23] I. Chillotti, N. Gama, M. Georgieva, and M. Izabachene, "TFHE: Fast Fully Homomorphic Encryption Over the Torus," Journal of Cryptology, vol. 33, no. 1, 2020.
[24] J. Choquette, W. Gandhi, O. Giroux, N. Stam, and R. Krashinsky, "NVIDIA A100 Tensor Core GPU: Performance and Innovation," IEEE Micro, vol. 41, no. 2, 2021.
[25] L. T. Clark, V. Vashishtha, D. M. Harris, S. Dietrich, and Z. Wang, "Design Flows and Collateral for the ASAP7 7nm FinFET Predictive Process Design Kit," in IEEE International Conference on Microelectronic Systems Education, 2017.
[26] L. T. Clark, V. Vashishtha, L. Shifren, A. Gujja, S. Sinha, B. Cline, C. Ramamurthy, and G. Yeric, "ASAP7: A 7-nm FinFET Predictive Process Design Kit," Microelectronics Journal, vol. 53, 2016.
[27] J. W. Cooley and J. W. Tukey, "An Algorithm for the Machine Calculation of Complex Fourier Series," Mathematics of Computation, vol. 19, no. 90, 1965.
[28] CryptoLab Inc., "HEAAN v2.1," Sep 2018. [Online]. Available: https://github.com/snucrypto/HEAAN
[29] B. R. Curtis and R. Player, "On the Feasibility and Impact of Standardising Sparse-secret LWE Parameter Sets for Homomorphic Encryption," in ACM Workshop on Encrypted Computing & Applied Homomorphic Cryptography, 2019.
[30] R. Dathathri, B. Kostova, O. Saarikivi, W. Dai, K. Laine, and M. Musuvathi, "EVA: An Encrypted Vector Arithmetic Language and Compiler for Efficient Homomorphic Computation," in ACM SIGPLAN International Conference on Programming Language Design and Implementation, 2020.
[31] L. Deng, "The MNIST Database of Handwritten Digit Images for Machine Learning Research," IEEE Signal Processing Magazine, vol. 29, no. 6, 2012.
[32] EPFL-LDS, "Lattigo v2.3.0," Oct 2021. [Online]. Available: https://github.com/ldsec/lattigo
[33] J. Fan and F. Vercauteren, "Somewhat Practical Fully Homomorphic Encryption," IACR Cryptology ePrint Archive, no. 144, 2012.
[34] C. Gentry, "Fully Homomorphic Encryption Using Ideal Lattices," in ACM Symposium on Theory of Computing, 2009.
[35] C. Gentry and S. Halevi, "Implementing Gentry's Fully-Homomorphic Encryption Scheme," in Annual International Conference on the Theory and Applications of Cryptographic Techniques, 2011.
[36] K. Han, S. Hong, J. H. Cheon, and D. Park, "Logistic Regression on Homomorphic Encrypted Data at Scale," in AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019.
[37] K. Han and D. Ki, "Better Bootstrapping for Approximate Homomorphic Encryption," in Cryptographers' Track at the RSA Conference, 2020.
[38] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," in IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[39] IEEE, "International Roadmap for Devices and Systems: 2018," IEEE IRDS, Tech. Rep., 2018. [Online]. Available: https://irds.ieee.org/editions/2018/
[40] W. Jeong, S. Maeda, H. Lee, K. Lee, T. Lee, D. Park, B. Kim, J. Do, T. Fukai, D. Kwon, K. Nam, W. Rim, M. Jang, H. Kim, Y. Lee, J. Park, E. Lee, D. Ha, C. Park, H. Cho, S. Jung, and H. Kang, "True 7nm Platform Technology featuring Smallest FinFET and Smallest SRAM cell by EUV, Special Constructs and 3rd Generation Single Diffusion Break," in IEEE Symposium on VLSI Technology, 2018.
[41] N. P. Jouppi, D. H. Yoon, M. Ashcraft, M. Gottscho, T. B. Jablin, G. Kurian, J. Laudon, S. Li, P. C. Ma, X. Ma, T. Norrie, N. Patil, S. Prasad, C. Young, Z. Zhou, and D. A. Patterson, "Ten Lessons From Three Generations Shaped Google's TPUv4i: Industrial Product," in ISCA, 2021.
[42] W. Jung, S. Kim, J. Ahn, J. H. Cheon, and Y. Lee, "Over 100x Faster Bootstrapping in Fully Homomorphic Encryption through Memory-centric Optimization with GPUs," IACR Transactions on Cryptographic Hardware and Embedded Systems, vol. 2021, no. 4, 2021.
[43] W. Jung, E. Lee, S. Kim, J. Kim, N. Kim, K. Lee, C. Min, J. H. Cheon, and J. Ahn, "Accelerating Fully Homomorphic Encryption Through Architecture-Centric Analysis and Optimization," IEEE Access, vol. 9, 2021.
[44] C. Juvekar, V. Vaikuntanathan, and A. Chandrakasan, "GAZELLE: A Low Latency Framework for Secure Neural Network Inference," in USENIX Security Symposium, 2018.
[45] J. Kim, J. Balfour, and W. Dally, "Flattened Butterfly Topology for On-Chip Networks," in MICRO, 2007.
[46] S. Kim, W. Jung, J. Park, and J. Ahn, "Accelerating Number Theoretic Transformations for Bootstrappable Homomorphic Encryption on GPUs," in IEEE International Symposium on Workload Characterization, 2020.
[47] S. Kim, K. Lee, W. Cho, J. H. Cheon, and R. A. Rutenbar, "FPGA-based Accelerators of Fully Pipelined Modular Multipliers for Homomorphic Encryption," in International Conference on ReConFigurable Computing and FPGAs, 2019.
[48] S. Kim, K. Lee, W. Cho, Y. Nam, J. H. Cheon, and R. A. Rutenbar, "Hardware Architecture of a Number Theoretic Transform for a Bootstrappable RNS-based Homomorphic Encryption Scheme," in IEEE International Symposium on Field-Programmable Custom Computing Machines, 2020.
[49] S. Knowles, "Graphcore," in IEEE Hot Chips 33 Symposium, 2021.
[50] A. Krizhevsky and G. Hinton, "Learning Multiple Layers of Features from Tiny Images," University of Toronto, Tech. Rep., 2009.
[51] J. Lee, H. Kang, Y. Lee, W. Choi, J. Eom, M. Deryabin, E. Lee, J. Lee, D. Yoo, Y. Kim, and J. No, "Privacy-Preserving Machine Learning with Fully Homomorphic Encryption for Deep Neural Network," arXiv preprint arXiv:2106.07229, 2021.
[52] J. Lee, E. Lee, Y. Lee, Y. Kim, and J. No, "High-Precision Bootstrapping of RNS-CKKS Homomorphic Encryption Using Optimal Minimax Polynomial Approximation and Inverse Sine Function," in Annual International Conference on the Theory and Applications of Cryptographic Techniques, vol. 12696, 2021.
[53] J. Lee, E. Lee, J. Lee, Y. Kim, Y. Kim, and J. No, "Precise Approximation of Convolutional Neural Networks for Homomorphically Encrypted Data," arXiv preprint arXiv:2105.10879, 2021.
[54] Y. Lee, J. Lee, Y. Kim, H. Kang, and J. No, "High-Precision and Low-Complexity Approximate Homomorphic Encryption by Error Variance Minimization," IACR Cryptology ePrint Archive, no. 1549, 2020.
[55] E. Medina and E. Dagan, "Habana Labs Purpose-Built AI Inference and Training Processor Architectures: Scaling AI Training Systems Using Standard Ethernet With Gaudi Processor," IEEE Micro, vol. 40, no. 2, 2020.
[56] Micron Technology, Inc., "8GB/16GB HBM2E with ECC," Micron Technology, Inc., Tech. Rep., 2020, CCM005-1412786195-10301 - Rev. D 08/2020 EN. [Online]. Available: https://media-www.micron.com/-/media/client/global/documents/products/data-sheet/dram/hbm2e/8gb_and_16gb_hbm2e_dram.pdf?rev=dbfcf653271041a497e5f1bef1a169ca
[57] P. Moon, V. Chikarmane, K. Fischer, R. Grover, T. A. Ibrahim, D. Ingerly, K. J. Lee, C. Litteken, T. Mule, and S. Williams, "Process and Electrical Results for the On-die Interconnect Stack for Intel's 45nm Process Generation," Intel Technology Journal, vol. 12, no. 2, 2008.
[58] S. Narasimha, B. Jagannathan, A. Ogino, D. Jaeger, B. Greene, C. Sheraw, K. Zhao, B. Haran, U. Kwon, A. K. M. Mahalingam, B. Kannan, B. Morganfeld, J. Dechene, C. Radens, A. Tessier, A. Hassan, H. Narisetty, I. Ahsan, M. Aminpur, C. An, M. Aquilino, A. Arya, R. Augur, N. Baliga, R. Bhelkar, G. Biery, A. Blauberg, N. Borjemscaia, A. Bryant, L. Cao, V. Chauhan, M. Chen, L. Cheng, J. Choo, C. Christiansen, T. Chu, B. Cohen, R. Coleman, D. Conklin, S. Crown, A. da Silva, D. Dechene, G. Derderian, S. Deshpande, G. Dilliway, K. Donegan, M. Eller, Y. Fan, Q. Fang, A. Gassaria, R. Gauthier, S. Ghosh, G. Gifford, T. Gordon, M. Gribelyuk, G. Han, J. Han, K. Han, M. Hasan, J. Higman, J. Holt, L. Hu, L. Huang, C. Huang, T. Hung, Y. Jin, J. Johnson, S. Johnson, V. Joshi, M. Joshi, P. Justison, S. Kalaga, T. Kim, W. Kim, R. Krishnan, B. Krishnan, K. Anil, M. Kumar, J. Lee, R. Lee, J. Lemon, S. Liew, P. Lindo, M. Lingalugari, M. Lipinski, P. Liu, J. Liu, S. Lucarini, W. Ma, E. Maciejewski, S. Madisetti, A. Malinowski, J. Mehta, C. Meng, S. Mitra, C. Montgomery, H. Nayfeh, T. Nigam, G. Northrop, K. Onishi, C. Ordonio, M. Ozbek, R. Pal, S. Parihar, O. Patterson, E. Ramanathan, I. Ramirez, R. Ranjan, J. Sarad, V. Sardesai, S. Saudari, C. Schiller, B. Senapati, C. Serrau, N. Shah, T. Shen, H. Sheng, J. Shepard, Y. Shi, M. Silvestre, D. Singh, Z. Song, J. Sporre, P. Srinivasan, Z. Sun, A. Sutton, R. Sweeney, K. Tabakman, M. Tan, X. Wang, E. Woodard, G. Xu, D. Xu, T. Xuan, Y. Yan, J. Yang, K. Yeap, M. Yu, A. Zainuddin, J. Zeng, K. Zhang, M. Zhao, Y. Zhong, R. Carter, C. Lin, S. Grunow, C. Child, M. Lagus, R. Fox, E. Kaste, G. Gomba, S. Samavedam, P. Agnello, and D. K. Sohn, "A 7nm CMOS Technology Platform for Mobile and High Performance Compute Application," in IEEE International Electron Devices Meeting, 2017.
[59] M. O'Connor, N. Chatterjee, D. Lee, J. Wilson, A. Agrawal, S. W. Keckler, and W. J. Dally, "Fine-Grained DRAM: Energy-Efficient DRAM for Extreme Bandwidth Systems," in MICRO, 2017.
[60] PALISADE Project, "PALISADE Lattice Cryptography Library (release 1.11.5)," Sep 2021. [Online]. Available: https://palisade-crypto.org/
[61] G. Passas, M. Katevenis, and D. Pnevmatikatos, "Crossbar NoCs are Scalable Beyond 100 Nodes," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, no. 4, 2012.
[62] R. Prabhakar and S. Jairath, "SambaNova SN10 RDU: Accelerating Software 2.0 with Dataflow," in IEEE Hot Chips 33 Symposium, 2021.
[63] O. Regev, "On Lattices, Learning with Errors, Random Linear Codes, and Cryptography," Journal of the ACM, vol. 56, no. 6, 2009.
[64] M. S. Riazi, K. Laine, B. Pelton, and W. Dai, "HEAX: An Architecture for Computing on Encrypted Data," in ASPLOS, 2020.
[65] S. S. Roy, F. Turan, K. Jarvinen, F. Vercauteren, and I. Verbauwhede, "FPGA-Based High-Performance Parallel Architecture for Homomorphic Computing on Encrypted Data," in HPCA, 2019.
[66] N. Samardzic, A. Feldmann, A. Krastev, S. Devadas, R. Dreslinski, C. Peikert, and D. Sanchez, "F1: A Fast and Programmable Accelerator for Fully Homomorphic Encryption," in MICRO, 2021.
[67] A. Shafaei, Y. Wang, X. Lin, and M. Pedram, "FinCACTI: Architectural Analysis and Modeling of Caches with Deeply-Scaled FinFET Devices," in IEEE Computer Society Annual Symposium on VLSI, 2014.
[68] Y. Son, "SparseLWE-estimator," 2021. [Online]. Available: https://github.com/Yongyongha/SparseLWE-estimator
[69] T. Song, J. Jung, W. Rim, H. Kim, Y. Kim, C. Park, J. Do, S. Park, S. Cho, H. Jung, B. Kwon, H. Choi, J. Choi, and J. S. Yoon, "A 7nm FinFET SRAM Using EUV Lithography with Dual Write-Driver-Assist Circuitry for Low-Voltage Applications," in IEEE International Solid-State Circuits Conference, 2018.
[70] S. Thoziyoor, J. Ahn, M. Monchiero, J. B. Brockman, and N. P. Jouppi, "A Comprehensive Memory Modeling Tool and Its Application to the Design and Analysis of Future Memory Hierarchies," in ISCA, 2008.
[71] S. Wu, C. Lin, M. Chiang, J. Liaw, J. Cheng, S. Yang, C. Tsai, P. Chen, T. Miyashita, C. Chang, V. Chang, K. Pan, J. Chen, Y. Mor, K. Lai, C. Liang, H. Chen, S. Chang, C. Lin, C. Hsieh, R. Tsui, C. Yao, C. Chen, R. Chen, C. Lee, H. Lin, C. Chang, K. Chen, M. Tsai, K. Chen, Y. Ku, and S. M. Jang, "A 7nm CMOS Platform Technology Featuring 4th Generation FinFET Transistors with a 0.027um2 High Density 6-T SRAM cell for Mobile SoC Applications," in IEEE International Electron Devices Meeting, 2016.
[72] G. Xin, J. Han, T. Yin, Y. Zhou, J. Yang, X. Cheng, and X. Zeng, "VPQC: A Domain-Specific Vector Processor for Post-Quantum Cryptography Based on RISC-V Architecture," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 67, no. 8, 2020.
[73] G. Xin, Y. Zhao, and J. Han, "A Multi-Layer Parallel Hardware Architecture for Homomorphic Computation in Machine Learning," in IEEE International Symposium on Circuits and Systems, 2021.
[74] Y. Xing and S. Li, "A Compact Hardware Implementation of CCA-secure Key Exchange Mechanism CRYSTALS-KYBER on FPGA," IACR Transactions on Cryptographic Hardware and Embedded Systems, 2021.
[75] Y. Zhang, S. Wang, X. Zhang, J. Dong, X. Mao, F. Long, C. Wang, D. Zhou, M. Gao, and G. Sun, "PipeZK: Accelerating Zero-Knowledge Proof with a Pipelined Architecture," in ISCA, 2021.
