
Breaking the Communication-Privacy-Accuracy Trilemma

Wei-Ning Chen
Department of Electrical Engineering

Stanford [email protected]

Peter Kairouz
Google

[email protected]

Ayfer Özgür
Department of Electrical Engineering

Stanford [email protected]

Abstract

Two major challenges in distributed learning and estimation are 1) preserving the privacy of the local samples; and 2) communicating them efficiently to a central server, while achieving high accuracy for the end-to-end task. While there has been significant interest in addressing each of these challenges separately in the recent literature, treatments that simultaneously address both challenges are still largely missing. In this paper, we develop novel encoding and decoding mechanisms that simultaneously achieve optimal privacy and communication efficiency in various canonical settings. In particular, we consider the problems of mean estimation and frequency estimation under ε-local differential privacy and b-bit communication constraints. For mean estimation, we propose a scheme based on Kashin’s representation and random sampling, with order-optimal estimation error under both constraints. For frequency estimation, we present a mechanism that leverages the recursive structure of Walsh-Hadamard matrices and achieves order-optimal estimation error for all privacy levels and communication budgets. As a by-product, we also construct a distribution estimation mechanism that is rate-optimal for all privacy regimes and communication constraints, extending recent work that is limited to b = 1 and ε = O(1). Our results demonstrate that intelligent encoding under joint privacy and communication constraints can yield a performance that matches the optimal accuracy achievable under either constraint alone.

1 Introduction

The rapid growth of large-scale datasets has been stimulating interest in and demands for distributed learning and estimation, where datasets are often too large and too sensitive to be stored on a centralized machine. When data is distributed across multiple devices, communication cost often becomes a bottleneck of modern machine learning tasks [40]. This is even more so in federated learning type settings, where communication occurs over bandwidth-limited wireless links [32]. Moreover, as more personal data is entrusted to data aggregators, in many applications it carries sensitive individual information, and hence finding ways to protect individual privacy is of crucial importance. In particular, local differential privacy (LDP) [18, 21, 36, 51] is a widely adopted privacy paradigm, which guarantees that the outcome from a privatization mechanism will not release too much individual information statistically. In this paper, we study the relationship between utility (often in the form of accuracy for certain statistical tasks), privacy, and communication jointly.

Preprint. Under review.

arXiv:2007.11707v3 [cs.LG] 20 Apr 2021

At first glance, privacy and communication may seem to be in conflict with each other: achieving privacy requires the addition of noise, therefore increasing the entropy of the data and making it less compressible. For instance, consider the mean estimation problem, which appears as a fundamental subroutine in many distributed optimization tasks, e.g. distributed stochastic gradient descent (SGD). Here, the goal is to estimate the empirical mean of a collection of d-dimensional vectors. If we first privatize each vector via PrivUnit in [13] (which is optimal under LDP constraints) and then quantize via the RandomSampling quantizer in [25] (which is optimal under communication constraints), a tedious but straightforward calculation shows that the resulting $\ell_2$ estimation error grows with $d^2$. However, this is far from matching the error rate under each constraint separately, which has a linear dependence on d. A similar phenomenon happens in the distribution estimation problem, where each client’s data is drawn independently from a discrete distribution p with support size d. One can satisfy both constraints by first perturbing the data via the Subset Selection (SS) mechanism [53] (which is optimal under LDP constraints) and then quantizing the noised data to b bits. Again, it can be shown that under such a strategy, the $\ell_2$ estimation error of p has a quadratic dependence on d. This leaves a huge gap to the lower bounds under each constraint separately, which have a linear dependence on d. See Section A in the appendix for a detailed discussion.

While there has been significant recent progress on understanding how to achieve optimal accuracy under separate privacy [12, 53] and communication [44, 54] constraints, as illustrated above a simple concatenated application of these optimal schemes can yield a highly suboptimal performance. Recent works that attempt to break this communication-privacy-accuracy trilemma have been either limited to specific regimes or, as we show, are far from optimal. For example, [3] provides a 1-bit ε-LDP scheme for distribution estimation which is order-optimal only in the low communication regime (b = O(1)) and high privacy regime (ε = O(1)), while [25] tries to address both constraints in the mean estimation setting, but the error rate achieved under their mechanism is quadratic in d and therefore does not improve on the above baseline. We note that the general privacy regime (i.e. ε = Ω(1)) is also of both theoretical and practical interest. For instance, when n = Ω(d), one can combine LDP with amplification techniques [7, 19, 20] to ensure stronger central differential privacy.

This paper closes the above gaps for any given privacy level ε and communication budget b. Indeed, our results show that the fundamental trade-offs are determined by the more stringent of the two constraints, and with careful encoding we can satisfy the less stringent constraint for free, thus breaking the privacy-communication-accuracy trilemma. For the same privacy level ε, this allows us to achieve the accuracy of existing mechanisms in the literature with a drastically smaller communication budget, or equivalently, for the same communication budget achieve higher privacy. It also explains, for example, why a 1-bit communication budget is sufficient under the high privacy regime [3, 11]. We will demonstrate this phenomenon in various canonical tasks and answer the following question:

“given arbitrary privacy budget ε and communication budget b, what are the fundamental limits for estimation accuracy?” We next formally define the settings and the problem formulations we consider in this paper.

1.1 Problem Formulation

The general distributed statistical tasks we consider in this paper can be formulated as follows: each one of the n clients has local data $X_i \in \mathcal{X}$ and sends a message $Y_i \in \mathcal{Y}$ to the server, who upon receiving $Y^n$ aims to estimate some pre-specified quantity of $X^n$. Note that $X^n$ are not necessarily drawn from some distribution. At client i, the message $Y_i$ is generated via some mechanism (a randomized mapping that possibly uses shared randomness across participating clients and the server) denoted by a conditional probability $Q_i(y|X_i)$ satisfying the following constraints.

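Local differential privacy. Let $(\mathcal{Y}, \mathcal{B})$ be a measurable space, and $Q(\cdot|x)$ be probability measures for all $x \in \mathcal{X}$, with $\{Q(\cdot|x) \mid x \in \mathcal{X}\}$ dominated by some σ-finite measure µ so that the density $Q(y|x)$ exists. A mechanism Q is ε-LDP if

$$\forall x, x' \in \mathcal{X},\ y \in \mathcal{Y}: \quad \frac{Q(y|x)}{Q(y|x')} \le e^{\varepsilon}.$$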

b-bit communication constraint. $\mathcal{Y}$ satisfies the b-bit communication constraint if each of its elements can be described by b bits, i.e. $|\mathcal{Y}| \le 2^b$.
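The two constraints can be checked mechanically for a finite channel. The sketch below is illustrative only: it assumes k-ary randomized response as the mechanism (the RR family is what the paper later uses for privatization), and the helper names `krr_channel` and `max_likelihood_ratio` are hypothetical. It builds the matrix $Q(y|x)$ over $2^b$ outputs and computes the worst-case likelihood ratio, which should equal $e^{\varepsilon}$.

```python
import math

def krr_channel(num_symbols: int, eps: float) -> list[list[float]]:
    """Row x holds Q(y | x) for randomized response over num_symbols outputs."""
    denom = math.exp(eps) + num_symbols - 1
    keep, flip = math.exp(eps) / denom, 1.0 / denom
    return [[keep if y == x else flip for y in range(num_symbols)]
            for x in range(num_symbols)]

def max_likelihood_ratio(channel: list[list[float]]) -> float:
    """Largest Q(y|x) / Q(y|x') over all inputs x, x' and outputs y."""
    worst = 0.0
    for y in range(len(channel[0])):
        column = [row[y] for row in channel]
        worst = max(worst, max(column) / min(column))
    return worst

b, eps = 3, 1.0                     # example values, chosen arbitrarily
Q = krr_channel(2 ** b, eps)        # output alphabet has exactly 2^b elements
ratio = max_likelihood_ratio(Q)     # should match e^eps up to rounding
print(f"|Y| = {len(Q[0])}, max ratio = {ratio:.6f}, e^eps = {math.exp(eps):.6f}")
```

Any channel whose output alphabet has at most $2^b$ symbols and whose ratio stays below $e^{\varepsilon}$ satisfies both constraints; randomized response is just one convenient instance.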


The goal is to jointly design a mechanism (at the clients’ side) and an estimator (at the server side) so that the accuracy of estimating some target function $\sum_{i=1}^{n} f(X_i)$ is maximized. In this paper, we are mainly interested in the distribution-free framework, that is, we do not assume any underlying distribution on $X_i$, but we also demonstrate that our results can be extended to probabilistic settings. To this end, we will focus on the following four canonical tasks.

Mean estimation. For real-valued data, we consider the d-dimensional unit Euclidean ball $\mathcal{X} = B_d(0, 1)$ and are interested in estimating the empirical mean $\bar{X} \triangleq \frac{1}{n}\sum_i X_i$. The goal is to minimize the worst-case $\ell_2$ estimation error defined as

$$r_{ME}(\ell_2, \varepsilon, b) \triangleq \min_{(\hat{\bar{X}}, Q^n)} \ \max_{X^n \in \mathcal{X}^n} \ \mathbb{E}\left[\left\| \hat{\bar{X}} - \bar{X} \right\|_2^2\right], \tag{1}$$

where $Q^n$ satisfies ε-LDP and b-bit communication constraints. When the context is clear, we may omit ε and b in $r_{ME}(\ell, \varepsilon, b)$.

Statistical mean estimation. In the probabilistic version of the mean estimation problem, we assume that the $X_i$’s are drawn from some common but unknown distribution P supported on $B_d(0, 1)$. The goal is to estimate the statistical mean $\theta(P) = \mathbb{E}_P[X_1]$ and to minimize the $\ell_2$ estimation error:

$$r_{SME}(\ell_2, \varepsilon, b) \triangleq \min_{(\hat{\theta}, Q^n)} \ \max_{X^n \in \mathcal{X}^n} \ \mathbb{E}\left[\left\| \hat{\theta}(X^n) - \theta(P) \right\|_2^2\right].$$

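Frequency estimation. When $\mathcal{X}$ consists of categorical data, i.e. $\mathcal{X} = [d] = \{1, ..., d\}$, we are interested in estimating $D_{X^n}(x) \triangleq \frac{1}{n}\sum_i \mathbb{1}_{\{X_i = x\}}$ for $x \in [d]$. With a slight abuse of notation, $D_{X^n}$ is viewed as a vector $(D_{X^n}(1), ..., D_{X^n}(d))$ lying in the d-dimensional probability simplex. The worst-case estimation error is defined by

$$r_{FE}(\ell, \varepsilon, b) \triangleq \min_{(\hat{D}, Q^n)} \ \max_{X^n \in \mathcal{X}^n} \ \mathbb{E}\left[\ell\left(\hat{D}, D_{X^n}\right)\right],$$

where $\ell = \|\cdot\|_\infty$, $\|\cdot\|_1$, or $\|\cdot\|_2^2$ and again $Q^n$ satisfies ε-LDP and b-bit communication constraints.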

Distribution estimation. A closely related setting is that of discrete distribution estimation, where we assume that the $X_i$’s are drawn independently from a discrete distribution p on the alphabet $\mathcal{X} = [d]$, and the goal is to estimate p. In this case, the worst-case error is given by

$$r_{DE}(\ell, \varepsilon, b) \triangleq \inf_{(Q^n, \hat{p})} \ \sup_{p \in \mathcal{P}_d} \ \mathbb{E}\left[\ell(\hat{p}, p)\right],$$

where $\mathcal{P}_d$ is the d-dimensional probability simplex.

We note that these canonical tasks serve as fundamental subroutines in many distributed optimization and learning problems. For instance, the convergence rate of distributed SGD is determined by the $\ell_2$ error of estimating the mean of the local gradient vectors (see [5] for more on this connection). Lloyd’s algorithm [37] for k-means clustering or the power-iteration method for PCA can also be reduced to the mean estimation task.

Remark 1.1 In this work, we generally assume the availability of shared randomness across the participating clients and the server. In this case the encoding functions at each node can be explicitly denoted as $Q_i(y|X_i, U)$ where U is a shared random variable that is independent of the data, referred to as a public coin. U is also available at the server and the estimator implicitly depends on U. In our notation, we suppress this dependence on U for simplicity. The entropy of U is referred to as the amount of shared randomness needed by a scheme. In Section 4, we discuss the amount of shared randomness required by our schemes in order to achieve the optimal estimation error. We point out that in the statistical settings (i.e. statistical mean estimation and distribution estimation), the optimal estimation error can be achieved without shared randomness.


| Scheme | Privacy (order of ε) | Comm. (order of b) | $\ell_2$ error (order) |
|---|---|---|---|
| SQKR (this work, Thm. 2.1) | any ε | any b | $d / (n \min(\varepsilon^2, \varepsilon, b))$ |
| Cross-polytope [25] | 1 | log d | $d^2 / n$ |
| Simplex [25] | log d | log d | $d / n$ |

Table 1: Comparison between our mean estimation scheme and vqSGD [25]. Our scheme applies to general communication and privacy regimes, and achieves optimal estimation error for all scenarios.

1.2 Relation to Prior Work

Previous works in the mean estimation problem [6, 10, 25, 44, 46, 52] mainly focus on reducing communication cost, for instance, by random rotation [44] and sparsification [6, 14, 50, 52]. Among them, [25] considers LDP simultaneously. It proposes vector quantization and takes privacy into account, developing a scheme for ε = Θ(1) and b = Θ(log d) with estimation error $O(d^2/n)$. In contrast, the scheme we develop in Theorem 2.1 achieves an estimation error $O(d/n)$ when ε = Θ(1) and b = Θ(log d). Moreover, our scheme is applicable for any ε and b and achieves the optimal estimation error, which we show by proving a matching information theoretic lower bound. See Table 1 for a comparison of our results with [25]. A key step in our scheme is to pre-process the local data via Kashin’s representation [38]. While various compression schemes, based on quantization, sparsification and dithering have been proposed in the recent literature and Kashin’s representation for communication efficiency [16, 24, 42, 43] has also been explored in a few works, it is particularly powerful in the case of joint communication and privacy constraints as it helps spread the information in a vector evenly in every dimension. In [22], a similar idea based on Kashin’s representation is used to preserve LDP under the context of statistical query models, and although not discussed explicitly in [22], it can be further extended to reduce the communication. This helps mitigate the error due to subsequent noise introduced by privatization and compression.

The recent works of [39, 49] also consider estimating the empirical mean under ε-LDP. They show that if the data is from a d-dimensional unit $\ell_\infty$ ball, i.e. $X_i \in [-1, 1]^d$, then directly quantizing, sampling and perturbing each entry can achieve optimal $\ell_\infty$ estimation error that matches the LDP lower bound in [17], where their privatization steps are based on techniques developed in [13, 17]. Nevertheless, their approach does not yield good $\ell_2$ error in general. Indeed, as in the case of separation schemes discussed in Section A, the $\ell_2$ error of their scheme can grow with $d^2$. We emphasize that in many applications the $\ell_2$ estimation error (i.e. MSE) is a more appropriate measure than $\ell_\infty$. For instance, [5] shows a direct connection between the MSE in mean estimation and the convergence rate of distributed SGD.

Frequency estimation under local differential privacy has been studied in [48], where they propose schemes for estimating the frequency of an individual symbol and minimizing the variance of the estimator. Some of their schemes, while matching the information-theoretic lower bound on $\ell_2$ estimation error under privacy constraints, require large communication. For instance, the scheme Optimal Unary Encoding (OUE), which can be viewed as an asymmetric version of RAPPOR [55], achieves optimal $\ell_2$ estimation error, but the communication required is O(d) bits, which, as we show in this work, can be reduced to $O(\min(\lceil\varepsilon\rceil, \log d))$ bits. We do this by developing a new scheme for frequency estimation under joint privacy and communication constraints. We establish the optimality of our proposed schemes by deriving matching information theoretic lower bounds on $r_{FE}(\ell_2, \varepsilon, b)$.

| Scheme | Loss | Estimation error | Communication |
|---|---|---|---|
| Asymmetric RAPPOR [48, 55] | $\ell_2$ | $\Theta\left(d / (n \min((e^{\varepsilon}-1)^2, e^{\varepsilon}))\right)$ | d bits |
| RHR (this work, Thm. 3.1) | $\ell_2$ | $\Theta\left(d / (n \min((e^{\varepsilon}-1)^2, e^{\varepsilon}))\right)$ | $\min(\lceil\varepsilon\rceil, \log d)$ bits |
| Heavy hitter (Thm. 3.1 and [12]) | $\ell_\infty$ | $\Theta\left(\sqrt{\log d / (n \min(\varepsilon, \varepsilon^2))}\right)$ | $\lceil\varepsilon\rceil$ bits |

Table 2: Comparison of different frequency estimation schemes.

| Scheme | ε ∈ (0, 1) | ε ∈ (1, log d) |
|---|---|---|
| SS [53] | d bits | $\max(d/e^{\varepsilon}, \log d)$ |
| HR [4] | log d bits | log d bits |
| 1bit-HR [3] | 1 bit | - |
| RHR (this work, Thm. 3.2) | 1 bit | $\min(\lceil\varepsilon\rceil, \log d)$ |

Table 3: Comparison between LDP distribution estimation schemes, where blue (or red) color indicates that the accuracy of the corresponding scheme is optimal (or not). Under the same privacy guarantee, our scheme is more communication efficient while achieving the same accuracy.

Frequency estimation is also closely related to heavy hitter estimation [3, 11, 12, 15, 30, 41, 55], where the goal is to discover symbols that appear frequently in a given data set and estimate their frequencies. This can be done if the error of estimating the frequency of each individual symbol can be controlled uniformly (i.e. by a common bound), and thus is equivalent to minimizing the $\ell_\infty$ error of estimated frequencies, i.e. $r_{FE}(\ell_\infty, \varepsilon, b)$. It is shown in [12] that in the high privacy regime ε = O(1), $r_{FE}(\ell_\infty, \varepsilon, b) = \Theta\left(\sqrt{\log d / (n\varepsilon^2)}\right)$, and this rate can be achieved via a 1-bit public-coin scheme that has a runtime almost linear in n [11]. An extension, which we describe in Section E.4 of the appendix, generalizes the achievability in [12] to arbitrary ε and b, achieving $r_{FE}(\ell_\infty, \varepsilon, b) = O\left(\sqrt{\log d / (n \min(\varepsilon^2, \varepsilon, b))}\right)$. We compare our scheme and existing results in Table 2.

If we further assume $X^n$ are drawn from some discrete distribution p, then the problem falls into distribution estimation under local differential privacy [1–4, 17, 31, 47, 53, 55] and limited communication [1, 2, 9, 14, 26, 28, 29, 54]. Tight lower bounds are given separately: for instance [4, 53] show $r_{DE}(\ell_1, \varepsilon, \log d) = \Omega\left(\sqrt{d^2 / (n \min((e^{\varepsilon}-1)^2, e^{\varepsilon}))}\right)$ and [28] shows $r_{DE}(\ell_1, \infty, b) = \Omega\left(\sqrt{d^2 / (n 2^b)}\right)$. We show that these lower bounds can be achieved simultaneously (Theorem 3.2). Our result recovers the result of [3] when b = 1 and ε = O(1) as a special case. See Table 3 for a comparison.

Finally, [12] proposes a generic approach to compress the communication of any ε-LDP scheme into 1 bit by utilizing public randomness. However, this result holds only in the high privacy regime ε = O(1), and as we show in Section 4 it uses much more shared randomness as compared to our schemes. For instance, for mean estimation with ε = O(1), [12] uses O(d) bits of shared randomness, while our scheme SQKR (Theorem 2.1) requires only O(log d) bits to achieve the same performance. Moreover, our schemes extend naturally to statistical settings (i.e. statistical mean estimation and distribution estimation) in which case they do not require shared randomness.

1.3 Our Contributions and Techniques

To summarize, our main technical contributions include:

• For mean estimation, we characterize the optimal $\ell_2$ error $r_{ME}(\ell_2) = \Theta\left(d / (n \min(\varepsilon^2, \varepsilon, b))\right)$, by designing a public-coin scheme, Subsampled and Quantized Kashin’s Response (SQKR), and proving its optimality by deriving matching information theoretic bounds (in Theorem 2.1). Our encoding scheme is based on Kashin’s representation [38] and random sampling, which allow the server to construct an unbiased estimator of each $X_i$ privately and with little communication. This significantly improves on [25], which focuses on the special case ε = Ω(1), b = log d and achieves quadratic dependence on d in that case.

• For frequency estimation, we characterize the optimal $\ell_1$ and $\ell_2$ errors under both constraints (in Theorem 3.1) and propose an order-optimal public-coin scheme called Recursive Hadamard Response (RHR). Our result shows that the accuracy is dominated only by the worst-case constraint, and this implies that one can achieve the less stringent constraint for free. The proposed scheme RHR is based on the Hadamard transform, but unlike previous works using the Hadamard transform, e.g. [11], we crucially leverage the recursive structure of the Hadamard matrix, which allows us to make the estimation error decay exponentially as ε and b grow. RHR is computationally efficient, and the decoding complexity is $O(n + d \log d)$. We establish its optimality by showing matching lower bounds on the performance.

• We show that RHR easily leads to an optimal scheme for distribution estimation [3, 4, 53], in which case it does not require shared randomness and achieves order-optimal $\ell_1$ and $\ell_2$ error for all privacy regimes and communication budgets. We also provide empirical evidence that our scheme requires significantly less communication while achieving the same accuracy and privacy levels as the state-of-the-art approaches. See Section 5 for more results.

2 Mean Estimation

In the mean estimation problem, each client has a d-dimensional vector $X_i$ from the Euclidean unit ball, and the goal is to estimate the empirical mean $\bar{X} = \frac{1}{n}\sum_i X_i$ under ε-LDP and b-bit communication constraints. This problem has applications in private and communication efficient distributed SGD. The following theorem characterizes the optimal $\ell_2$ estimation error for this setting.

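Theorem 2.1 For mean estimation under ε-LDP and b-bit communication constraints, we can achieve

$$r_{ME}(\ell_2, \varepsilon, b) = O\left(\frac{d}{n \min(\varepsilon^2, \varepsilon, b)}\right). \tag{2}$$

Moreover, if $\min(\varepsilon^2, \varepsilon, b) = o(d)$ and $n \cdot \min(\varepsilon^2, \varepsilon, b) > d$, the above error is optimal.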

Note that by taking ε → ∞ for a fixed b, or by taking b → ∞ for a fixed ε, Theorem 2.1 provides the optimal error when we have the corresponding constraint alone. Furthermore, for finite ε and b we see that the optimal error is dictated by the error due to one of these constraints, the one that leads to larger error, and hence the less stringent constraint is satisfied for free. This also implies that to achieve the optimal accuracy under ε-LDP constraints, we do not need more than $\lceil\varepsilon\rceil$ bits. We note that the two conditions for optimality in the theorem are standard and are needed to restrict the problem to the interesting parameter regime.
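To make the "less stringent constraint is free" reading of (2) concrete, the following sketch evaluates the order $d / (n \min(\varepsilon^2, \varepsilon, b))$ for a few parameter pairs. It ignores all constants, the function names are hypothetical, and the values of d and n are arbitrary illustrations.

```python
import math

def binding_term(eps: float, b: int) -> float:
    """min(eps^2, eps, b): the quantity that sets the error order in (2)."""
    return min(eps ** 2, eps, b)

def error_order(d: int, n: int, eps: float, b: int) -> float:
    """d / (n * min(eps^2, eps, b)), with all constants dropped."""
    return d / (n * binding_term(eps, b))

def sufficient_bits(eps: float) -> int:
    """ceil(eps): beyond this many bits the privacy constraint is the binding one."""
    return math.ceil(eps)

d, n = 1_000, 1_000_000             # arbitrary example sizes
for eps, b in [(0.5, 1), (0.5, 32), (4.0, 2), (4.0, 4), (4.0, 32)]:
    print(f"eps={eps}, b={b:2d}: order {error_order(d, n, eps, b):.2e}")
```

For ε = 0.5 the order is the same at b = 1 and b = 32, while for ε = 4 it stops improving once b reaches `sufficient_bits(4.0) = 4`.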

The lower bounds are obtained by connecting the problem to a specific parametric estimation problem with a distribution supported on the unit ball. The lower bounds $\Omega\left(\frac{d}{n\varepsilon^2}\right)$ and $\Omega\left(\frac{d}{nb}\right)$ appear in [17, Prop. 4] and [44, Thm. 5] respectively, and the lower bound $\Omega\left(\frac{d}{n\varepsilon}\right)$ in Theorem 2.1 is new. To match this lower bound, we propose a public-coin scheme, Subsampled and Quantized Kashin’s Response (SQKR), based on Kashin’s representation [38] and random sampling.

2.1 Subsampled and Quantized Kashin’s Response

For each observation $X_i$, we aim to construct an unbiased estimator $\hat{X}_i$ which is ε-LDP, can be described in b bits, and has small variance. Towards this goal, our general strategy is to quantize, subsample, and privatize the data $X_i$. However, before this, it is crucial to pre-process each $X_i$ by a carefully designed mechanism to increase the robustness of the signal to noise introduced by sampling and privatization.

Pre-processing via Kashin’s representation. We first introduce the idea of a tight frame in Kashin’s representation. A tight frame is a set of vectors $\{u_j\}_{j=1}^{N} \subset \mathbb{R}^d$ that satisfy Parseval’s identity, i.e. $\|x\|_2^2 = \sum_{j=1}^{N} \langle u_j, x\rangle^2$ for all $x \in \mathbb{R}^d$. A frame can be viewed as a generalization of the notion of an orthogonal basis in $\mathbb{R}^d$ for N > d. To increase robustness, we wish the information to be spread evenly across different coefficients. Thus, we say that the expansion $x = \sum_{j=1}^{N} a_j u_j$ is a Kashin’s representation of x at level K if $\max_j |a_j| \le \frac{K}{\sqrt{N}} \|x\|_2$ [35]. [38] shows that if $N > (1 + \mu) d$ for some µ > 0, then there exists a tight frame $\{u_j\}_{j=1}^{N}$ such that for any $x \in \mathbb{R}^d$, one can find a Kashin’s representation at level K = Θ(1). This implies that we can represent each $X_i$ with coefficients $\{a_j\}_{j=1}^{N} \in [-c/\sqrt{d}, c/\sqrt{d}]^{c'd}$ for some constants c and c′.
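As a small illustration of the tight-frame definition only, the sketch below builds a toy frame with N = 2d vectors (the standard basis together with the normalized Hadamard basis, each scaled by $1/\sqrt{2}$) and checks Parseval’s identity numerically. This toy frame is an assumption made for the example and is not claimed to be the frame of [38]; the plain inner-product coefficients computed here are also not a Kashin’s representation in general, since finding coefficients at level K = Θ(1) requires a dedicated procedure that is not shown.

```python
import math

def hadamard(d: int) -> list[list[int]]:
    """Sylvester-type Hadamard matrix of size d (d must be a power of two)."""
    H = [[1]]
    while len(H) < d:
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    return H

def toy_tight_frame(d: int) -> list[list[float]]:
    """2d vectors: standard basis and normalized Hadamard rows, scaled by 1/sqrt(2)."""
    s = 1.0 / math.sqrt(2.0)
    eye = [[s if i == j else 0.0 for j in range(d)] for i in range(d)]
    had = [[s * v / math.sqrt(d) for v in row] for row in hadamard(d)]
    return eye + had

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

d = 8
frame = toy_tight_frame(d)
x = [0.3, -0.1, 0.5, 0.0, 0.2, -0.4, 0.1, 0.6]   # arbitrary test vector
coeffs = [dot(u, x) for u in frame]              # <u_j, x> for each frame vector
energy = sum(c * c for c in coeffs)              # should equal ||x||_2^2
recon = [sum(c * u[k] for c, u in zip(coeffs, frame)) for k in range(d)]
print(f"sum <u_j,x>^2 = {energy:.6f}, ||x||^2 = {dot(x, x):.6f}")
```

Because the frame is tight, the same coefficients also reconstruct x exactly, which is what `recon` checks.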


Quantization. Each client i computes the Kashin’s representation $\{a_j\}_{j=1}^{N} \in [-c/\sqrt{d}, c/\sqrt{d}]^{c'd}$ of $X_i$, and then quantizes each $a_j$ into a 1-bit message $q_j \in \{-c/\sqrt{d}, c/\sqrt{d}\}$ with $\mathbb{E}[q_j] = a_j$. This yields an unbiased estimator of $\{a_j\}_{j=1}^{N}$, which can be described in Θ(d) bits in total. Moreover, due to the small range of each $a_j$, the variance of $q_j$ is bounded by O(1/d).
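One standard way to obtain a two-level quantizer with $\mathbb{E}[q_j] = a_j$ is stochastic rounding to the endpoints of the range. The sketch below assumes this rule (the text does not spell out the exact quantizer); `r` stands for $c/\sqrt{d}$ and the function names are hypothetical.

```python
import random

def quantize_1bit(a: float, r: float, rng: random.Random) -> float:
    """Return +r or -r such that E[q] = a, for any a in [-r, r]."""
    p_plus = (a + r) / (2 * r)          # probability of rounding up to +r
    return r if rng.random() < p_plus else -r

def expected_value(a: float, r: float) -> float:
    """Exact mean of the quantizer output."""
    p_plus = (a + r) / (2 * r)
    return p_plus * r + (1 - p_plus) * (-r)

def variance(a: float, r: float) -> float:
    """Exact variance: E[q^2] - a^2 = r^2 - a^2 <= r^2."""
    return r * r - a * a

r, a = 0.5, 0.1                         # arbitrary example: range [-0.5, 0.5]
rng = random.Random(0)
sample = quantize_1bit(a, r, rng)
print(sample, expected_value(a, r), variance(a, r))
```

With $r = c/\sqrt{d}$ the variance is at most $c^2/d$, consistent with the O(1/d) bound stated above.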

Sampling and privatization. To further reduce $\{q_j\}$ to $k = \min(\lceil\varepsilon\rceil, b)$ bits, client i draws k independent samples from $\{q_j\}_{j=1}^{N}$ with the help of shared randomness, and privatizes its k-bit message via the $2^k$-RR mechanism [34, 51], yielding the final privatized report of k bits, which it sends to the server.

Upon receiving the report from client i, the server can construct unbiased estimators $\hat{a}_j$ for each $\{a_j\}_{j=1}^{N}$, and hence reconstruct $\hat{X}_i = \sum_{j=1}^{N} \hat{a}_j u_j$, which yields an unbiased estimator of $X_i$. We show that the variance of $\hat{X}_i$ can be controlled by $O\left(d / \min(\varepsilon^2, \varepsilon, b)\right)$. Therefore $\frac{1}{n}\sum_i \hat{X}_i$ achieves the order-optimal $\ell_2$ estimation error, establishing the upper bound in Theorem 2.1. We provide a detailed description of the scheme and its performance analysis in Section C.
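The bit-level part of the debiasing step can be sketched as follows. Under $2^k$-RR applied to a tuple of k signs, a short calculation shows that each reported sign has mean equal to the true sign times $(e^{\varepsilon} - 1)/(e^{\varepsilon} + 2^k - 1)$, so dividing by that factor removes the bias. This is a sketch of one possible decoder step, not the procedure of Section C: the further scaling by $c/\sqrt{d}$ and by the sampling rate needed to obtain $\hat{a}_j$ is omitted, and all names are hypothetical. The expectation is computed exactly by enumerating every possible report.

```python
import math
from itertools import product

def rr_probability(report: tuple, truth: tuple, eps: float) -> float:
    """2^k-RR: probability of emitting `report` when the true k-tuple is `truth`."""
    K = 2 ** len(truth)
    denom = math.exp(eps) + K - 1
    return (math.exp(eps) if report == truth else 1.0) / denom

def debias_scale(k: int, eps: float) -> float:
    """Factor by which 2^k-RR shrinks the mean of each +/-1 entry."""
    return (math.exp(eps) - 1) / (math.exp(eps) + 2 ** k - 1)

def expected_debiased_bits(truth: tuple, eps: float) -> list[float]:
    """Exact expectation of report / scale, enumerating all 2^k reports."""
    k = len(truth)
    scale = debias_scale(k, eps)
    mean = [0.0] * k
    for report in product((-1, 1), repeat=k):
        p = rr_probability(report, truth, eps)
        for m in range(k):
            mean[m] += p * report[m] / scale
    return mean

truth = (1, -1, -1)                     # signs of k = 3 sampled quantized coefficients
mean = expected_debiased_bits(truth, eps=1.5)
print(mean)                             # matches `truth` up to rounding
```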

Remark 2.1 In order to achieve optimal communication efficiency, SQKR uses public randomness at the sampling step. That being said, we can still turn SQKR into a private scheme by using additional communication. See Section 4 for more details.

At a high level, SQKR resembles vqSGD [25] as both schemes seek a suitably designed representation for $X_i$ before quantizing it. vqSGD represents $X_i$ by a basis $B = \{b_1, ..., b_N\} \subset \mathbb{R}^d$ where B is chosen in such a way that its convex hull contains the unit $\ell_2$ ball. Therefore we can write $X_i = \sum_{j=1}^{N} a_j b_j$ with $\sum_j a_j = 1$. Equivalently, the pre-processing step of vqSGD corresponds to a linear transformation that embeds the d-dim $\ell_2$ unit ball into an N-dim $\ell_1$ ball. In contrast, Kashin’s representation above embeds the d-dim $\ell_2$ unit ball into an N-dim $\ell_\infty$ ball. Therefore, while both schemes have a pre-processing step of a similar flavor, what is achieved by these steps is quite different. The representation of vqSGD is most efficient when it concentrates the information in a few coefficients, while Kashin’s representation spreads the information evenly across different coefficients. The first representation serves us well when we only seek to quantize the signal. However, the quantized signal becomes very sensitive to privatization noise. Therefore vqSGD ends up with $O(d^2)$ error in the case of both privacy and communication constraints, while we can achieve $O(d)$ error.

2.2 Application to statistical mean estimation

For mean estimation, SQKR requires shared randomness so that the server can construct an unbiased estimator. However, in the statistical setting where $X_1, ..., X_n \overset{\text{i.i.d.}}{\sim} P$, we can replace the random sampling with a deterministic partitioning of coordinates among the different clients and circumvent the need for shared randomness. This gives us the following theorem:

Theorem 2.2 For statistical mean estimation under ε-LDP and b-bit communication constraints, we can achieve

$$r_{SME}(\ell_2, \varepsilon, b) = O\left(\frac{d}{n \min(\varepsilon^2, \varepsilon, b, d)}\right), \tag{3}$$

without shared randomness. Moreover, if $\min(\varepsilon^2, \varepsilon, b) = o(d)$, the above error is optimal (even in the presence of shared randomness).

The lower bounds follow from the results of [13] (under LDP constraint) and [54] (under communication constraint), and we leave the formal proof of the achievability to Section D.

3 Frequency Estimation

Recall that in the frequency estimation problem, given $X_1, ..., X_n \in [d]$, we want to estimate the empirical frequency $D_{X^n}(x)$ under ε-LDP and a b-bit communication budget on each $X_i$. The following theorem characterizes the optimal estimation error achievable in this setting.


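Theorem 3.1 For frequency estimation under ε-LDP and b-bit communication constraints, we can achieve

(i) $r_{FE}(\ell_2) = O\left(\dfrac{d}{n \min\{e^{\varepsilon}, (e^{\varepsilon}-1)^2, 2^b, d\}}\right)$, and $r_{FE}(\ell_1) = O\left(\dfrac{d}{\sqrt{n \min\{e^{\varepsilon}, (e^{\varepsilon}-1)^2, 2^b, d\}}}\right)$;

(ii) $r_{FE}(\ell_\infty) = O\left(\sqrt{\dfrac{\log d}{n \min\{\varepsilon^2, \varepsilon, b\}}}\right)$.

Moreover, if $\min(e^{\varepsilon}, (e^{\varepsilon}-1)^2, 2^b) = o(d)$ and $n \min(e^{\varepsilon}, (e^{\varepsilon}-1)^2, 2^b) \ge d^2$, the errors in (i) are order-optimal.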

Note that, similar to Theorem 2.1, Theorem 3.1 shows that for finite ε and b, the error is determined by the error due to one of these constraints, and hence the other less stringent constraint is satisfied for free. It also implies that to achieve the optimal accuracy under ε-LDP constraints, we do not need more than $\min(\lceil \log_2 e \cdot \varepsilon \rceil, \log d)$ bits.

We next overview the scheme that achieves the error in (i) of Theorem 3.1. We call this scheme Recursive Hadamard Response (RHR) as it builds on the recursive structure of the Hadamard matrix. The formal description of the scheme and complete proof of Theorem 3.1 can be found in Section E.

3.1 Recursive Hadamard Response

For notational convenience, we will view $D_{X^n}$ as a d-dimensional vector $(D_{X^n}(1), ..., D_{X^n}(d))$ and assume $X_i$ is one-hot encoded, i.e. $X_i = e_j$ for some $j \in [d]$, so $D_{X^n} = \frac{1}{n}\sum_i X_i$. We further assume, without loss of generality, that $d = 2^m$ for some $m \in \mathbb{N}$. Recall that a Hadamard matrix $H_d \in \{-1, +1\}^{d \times d}$ can be constructed in a recursive fashion as

$$H_d = \begin{bmatrix} H_{d/2} & H_{d/2} \\ H_{d/2} & -H_{d/2} \end{bmatrix},$$

where $H_1 = [1]$. It can be easily shown that $H_d^{-1} = H_d / d$.
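The recursion and the identity $H_d^{-1} = H_d/d$ can be checked directly for small d. The sketch below uses 0-based indices and plain Python lists; it also verifies the closed form in which entry (i, j) equals $(-1)^{\text{popcount}(i \,\&\, j)}$, a standard property of this (Sylvester) construction that is used here purely as a cross-check.

```python
def hadamard(d: int) -> list[list[int]]:
    """Hadamard matrix via the recursion H_d = [[H, H], [H, -H]] with H = H_{d/2}."""
    assert d >= 1 and (d & (d - 1)) == 0, "d must be a power of two"
    if d == 1:
        return [[1]]
    h = hadamard(d // 2)
    top = [row + row for row in h]
    bottom = [row + [-v for v in row] for row in h]
    return top + bottom

def matmul(a: list[list[int]], b: list[list[int]]) -> list[list[int]]:
    """Plain O(d^3) matrix product, adequate for a small check."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

d = 8
H = hadamard(d)
product_matrix = matmul(H, H)           # equals d * I, hence H^{-1} = H / d
closed_form = [[-1 if bin(i & j).count("1") % 2 else 1 for j in range(d)]
               for i in range(d)]
print(product_matrix[0], product_matrix[1])
```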

Instead of directly estimating $D_{X^n}$, our strategy is to first estimate $H_d \cdot D_{X^n}$ and then perform the inverse transform $H_d^{-1}$ to get an estimate for $D_{X^n}$. So each client will transmit information about $Y_i \triangleq H_d \cdot X_i \in \{-1, 1\}^d$ rather than its original data $X_i$.

The 1-bit case. In this case, each client transmits an entry of $Y_i$ chosen uniformly at random via any 1-bit LDP channel (for instance, using the 2-randomized response (RR) scheme [31, 34, 51]). Once it has received all the bits of the clients, the server can construct an unbiased estimator of $Y_i$ (since the randomness is public the server knows which entry is chosen for communication by each client). It turns out that this simple 1-bit scheme achieves optimal $\ell_1$ (and $\ell_2$) error $\Theta\left(\sqrt{d^2 / (n\varepsilon^2)}\right)$ in the high privacy regime ε < 1. This idea is not new and has been used in heavy hitter estimation [11] and distribution estimation [3]. However, a key question remains: how do we minimize the error given an arbitrary communication budget b and privacy level ε?

Moving beyond the 1-bit case. A natural way to extend the 1-bit scheme above to the case when each client can transmit b bits is to have each client communicate b randomly chosen entries of its transformed data $Y_i$ instead of a single entry. This will boost the sample size by a factor of b, equivalently decrease the $\ell_2$ error by a factor of b ($\sqrt{b}$ for $\ell_1$). Instead, we argue next that we can exploit the recursive structure of the Hadamard matrix to boost the sample size by a factor of $2^b$, equivalently decrease the error by an exponential factor.

Consider $b \le \lfloor \log d \rfloor$ and let $B = d/2^{b-1}$. Note that $H_d = H_{2^{b-1}} \otimes H_B$, where ⊗ denotes the Kronecker product. To visualize, for b = 3, $H_d$ has the following structure:

$$Y_i = H_d X_i = \begin{bmatrix} H_B & H_B & H_B & H_B \\ H_B & -H_B & H_B & -H_B \\ H_B & H_B & -H_B & -H_B \\ H_B & -H_B & -H_B & H_B \end{bmatrix} \begin{bmatrix} X_i^{(1)} \\ X_i^{(2)} \\ X_i^{(3)} \\ X_i^{(4)} \end{bmatrix},$$

where for $l = 1, \ldots, 2^{b-1}$, $X_i^{(l)}$ denotes the l’th block of $X_i$ of length $B = d/2^{b-1}$. Therefore, in order to communicate $Y_i$, we can equivalently communicate $H_B X_i^{(l)}$ for $l = 1, \ldots, 2^{b-1}$. Since $H_{2^{b-1}}$ is known, this is sufficient to reconstruct $Y_i$. We next observe that while communicating $Y_i$ requires $d = B \times 2^{b-1}$ bits, communicating $H_B X_i^{(l)}$, $l = 1, \ldots, 2^{b-1}$ requires $B + (b-1)$ bits. This is because $X_i$ is one-hot encoded and all but one of the $2^{b-1}$ vectors $H_B X_i^{(l)}$, $l = 1, \ldots, 2^{b-1}$ are equal to zero. It suffices to communicate the index l of the non-zero vector, by using (b − 1) bits, and its B entries by using additional B bits. This is the key observation that RHR builds on.
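The block observation can be verified on a small instance. The sketch below uses 0-based indices (so a symbol x lies in `range(d)` and the block index in `range(2**(b-1))`), relies on the closed form $(-1)^{\text{popcount}(i \,\&\, j)}$ for Hadamard entries, and uses hypothetical helper names. It compresses a one-hot input to a (block index, length-B vector) pair and then rebuilds the full vector $Y_i = H_d X_i$ from that pair through the Kronecker structure.

```python
def had_entry(i: int, j: int) -> int:
    """Entry (i, j) of the Sylvester Hadamard matrix, 0-based indices."""
    return -1 if bin(i & j).count("1") % 2 else 1

def compress_one_hot(x: int, d: int, b: int) -> tuple[int, list[int]]:
    """Index of the only non-zero block and the vector H_B X^(l) for that block."""
    B = d // 2 ** (b - 1)
    block = x // B
    column = [had_entry(r, x % B) for r in range(B)]   # column (x mod B) of H_B
    return block, column

def expand(block: int, column: list[int], b: int) -> list[int]:
    """Rebuild Y = H_d X using H_d = H_{2^(b-1)} (Kronecker) H_B."""
    return [had_entry(m, block) * v
            for m in range(2 ** (b - 1)) for v in column]

d, b, x = 16, 3, 13                     # arbitrary example: B = 4 blocks of length 4
block, column = compress_one_hot(x, d, b)
y_rebuilt = expand(block, column, b)
y_direct = [had_entry(i, x) for i in range(d)]         # column x of H_d
print(block, column, y_rebuilt == y_direct)
```

Here the pair costs (b − 1) + B = 6 bits, compared with d = 16 bits for sending $Y_i$ entry by entry.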

When each client has only b bits, they cannot communicate sufficient information for fully reconstructing $Y_i$, i.e. all $H_B X_i^{(l)}$, $l = 1, \ldots, 2^{b-1}$. Instead, each client chooses a random index $r_i \in [B]$ and communicates the $r_i$’th row of $H_B X_i^{(l)}$, $l = 1, \ldots, 2^{b-1}$, equivalently $(H_B)_{r_i} X_i^{(l)}$, $l = 1, \ldots, 2^{b-1}$, where $(H_B)_{r_i}$ denotes the $r_i$’th row of $H_B$. Note that as before, only one of the $2^{b-1}$ numbers $(H_B)_{r_i} X_i^{(l)}$, $l = 1, \ldots, 2^{b-1}$ is non-zero and therefore these numbers can be communicated by using b bits, b − 1 bits to represent the index of the non-zero number and a single bit to communicate its value. When there is a privacy constraint, client i perturbs their b bits by a $2^b$-RR mechanism with privacy level ε, and this yields the privatized report of b bits.
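A client-side sketch of this encoding, under stated assumptions: indices are 0-based, the b-bit message packs the block index in the high bits and the sign in the lowest bit (a packing chosen here for illustration), and the $2^b$-RR step is the usual "keep with probability $e^{\varepsilon}/(e^{\varepsilon} + 2^b - 1)$, otherwise pick another message uniformly" rule. The function names are hypothetical, and the paper’s Algorithm 3 may differ in details.

```python
import math
import random

def had_entry(i: int, j: int) -> int:
    """Entry (i, j) of the Sylvester Hadamard matrix, 0-based indices."""
    return -1 if bin(i & j).count("1") % 2 else 1

def rhr_encode(x: int, r: int, d: int, b: int) -> int:
    """b-bit message for symbol x and sampled row r: (block index, sign bit)."""
    B = d // 2 ** (b - 1)
    block = x // B                                  # which block is non-zero
    sign_bit = 0 if had_entry(r, x % B) == 1 else 1 # value of (H_B)_r X^(block)
    return (block << 1) | sign_bit

def randomized_response(message: int, num_messages: int, eps: float,
                        rng: random.Random) -> int:
    """Keep the message w.p. e^eps / (e^eps + M - 1), else a uniform other one."""
    keep = math.exp(eps) / (math.exp(eps) + num_messages - 1)
    if rng.random() < keep:
        return message
    other = rng.randrange(num_messages - 1)
    return other if other < message else other + 1

d, b, eps = 16, 3, 2.0                  # arbitrary example parameters
rng = random.Random(1)
x, r = 13, 3                            # symbol and public-coin row index in range(B)
clean = rhr_encode(x, r, d, b)
report = randomized_response(clean, 2 ** b, eps, rng)
print(clean, report)
```

With b = 1 the block index disappears and the message reduces to a single sign bit, which recovers the 1-bit scheme.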

Upon receiving the reports from clients, the server constructs an unbiased estimator for $Y_i$. To do this, it first constructs an unbiased estimator for $H_B X_i^{(l)}$, $l = 1, \ldots, 2^{b-1}$ and then employs the structure $H_d = H_{2^{b-1}} \otimes H_B$. Note that since the randomness is shared the server knows the index r chosen by each client, and since the clients choose their indices independently and uniformly at random, roughly speaking, they communicate information about different rows of $H_B X_i^{(l)}$, $l = 1, \ldots, 2^{b-1}$. Finally, an unbiased estimator $\hat{Y}_i$ for $Y_i$ yields an unbiased estimator for $X_i$ through the transformation $\hat{X}_i = \frac{1}{d} H_d \cdot \hat{Y}_i$, and due to the orthogonality of $H_d$, it can be shown that the variance of $\hat{X}_i$ is the same as the variance of $\hat{Y}_i$ divided by d.

A subtle issue is that if $e^{\varepsilon} \ll 2^b$, the noise due to the $2^b$-RR mechanism may be too large, so instead of using all b bits, we perform the above encoding and decoding procedure with $b' \triangleq \min(b, \lceil \log_2 e \cdot \varepsilon \rceil)$. We defer the details and the formal proof to Section E.1.

Note that this careful construction based on the recursive structure of the Hadamard matrix is only required in the case when there are joint privacy and communication constraints. When only one constraint is present, the optimal error can be achieved in a much simpler fashion. When there is only a $b$-bit constraint, [28] shows that the optimal error can be achieved by simply having each client communicate a subset of the entries of its data vector $X_i$ (without requiring the Hadamard transform). When there is only a privacy constraint $\varepsilon$, the optimal error can be achieved by a number of schemes, such as subset selection ($2^b$-SS) [53] and Hadamard response (HR) [4].

The encoding mechanism above involves two operations: 1) sampling a random index $r_i$ from $[B]$ at each client with the help of a public coin, and 2) computing $(H_d)_{r_i} \cdot X_i$. Since $X_i$ is one-hot, the encoding complexity is $O(\log d)$. On the other hand, in order to decode efficiently, the server first computes the joint histogram of client $i$'s report and $r_i$ in $O(n)$ time, which in turn allows us to calculate $\frac{1}{n}\sum_i \hat{Y}_i$, and then applies the Fast Walsh-Hadamard transform (FWHT) to obtain the estimator of the empirical frequency in $O(d \log d)$ time. Hence the overall decoding complexity is $O(n + d \log d)$. See Algorithm 3 and Algorithm 4 in Section E for details.
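For reference, the $O(d \log d)$ transform used in the decoding step can be written as a standard iterative butterfly. The sketch below is a generic textbook FWHT in the Sylvester (natural) ordering, not code taken from the authors' repository.

```python
import numpy as np


def fwht(v: np.ndarray) -> np.ndarray:
    """Unnormalized fast Walsh-Hadamard transform of a length-d vector, d a power of two."""
    out = np.array(v, dtype=float)
    d = out.shape[0]
    h = 1
    while h < d:
        # Combine adjacent blocks of length h with the 2 x 2 butterfly [[1, 1], [1, -1]].
        for start in range(0, d, 2 * h):
            top = out[start:start + h].copy()
            bottom = out[start + h:start + 2 * h].copy()
            out[start:start + h] = top + bottom
            out[start + h:start + 2 * h] = top - bottom
        h *= 2
    return out
```

Since $H_d H_d = d \cdot I$, applying `fwht` twice and dividing by $d$ returns the input, which gives the inverse transform for free.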

Remark 3.1 As in mean estimation, RHR requires public randomness to achieve optimal communication efficiency. Indeed, we can show that RHR uses the minimum amount of shared randomness. See Section 4 for more details.

3.2 Application to distribution estimation

As in statistical mean estimation (Section 2.2), for distribution estimation where $X_1, \dots, X_n \overset{\text{i.i.d.}}{\sim} p$, we can replace the random sampling with a deterministic one and avoid the use of shared randomness. This yields the following theorem:


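Theorem 3.2 For distribution estimation under $\varepsilon$-LDP and $b$-bit communication constraints, we can achieve

$$r_{\mathrm{DE}}(\ell_2) \asymp \frac{d}{n \min\left(e^{\varepsilon}, (e^{\varepsilon}-1)^2, 2^b, d\right)}, \quad \text{and} \quad r_{\mathrm{DE}}(\ell_1) \asymp \frac{d}{\sqrt{n \min\left(e^{\varepsilon}, (e^{\varepsilon}-1)^2, 2^b, d\right)}},$$

without shared randomness. Moreover, if $n \cdot \min\left(e^{\varepsilon}, (e^{\varepsilon}-1)^2, 2^b, d\right) \geq d^2$, the above errors are optimal even in the presence of shared randomness.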

The lower bounds follow directly from the results of [53] (under the LDP constraint) and [9, 28] (under the communication constraint). We leave the formal proof of the achievability to Section F.

4 Role of Shared Randomness and How It Benefits Communication

The Amount of Shared Randomness In the achievability part of Theorem 2.1, our proposed scheme SQKR randomly and independently samples $b^*_{\mathrm{ME}} \triangleq \min(\lceil \varepsilon \rceil, b)$ bits from the quantized $d$-dimensional binary vector at each client. These bits are then privatized and communicated to the server. In addition to the values of these bits, the server needs to know the indices of the sampled bits, which corresponds to an additional $b^*_{\mathrm{ME}} \log d$ bits of information that needs to be shared between each client and the server. This information can be shared in two different ways: 1) sampling can be done by using a public coin shared a priori between the client and the server, or 2) sampling can be done by using a private coin at the client side, which is then communicated to the server. We can also combine both 1) and 2) when $b > b^*_{\mathrm{ME}}$: given a communication budget of $b$ bits, SQKR compresses the data to $b^*_{\mathrm{ME}}$ bits, so the client can use the remaining $b - b^*_{\mathrm{ME}}$ bits to communicate the locally generated randomness required at the sampling step. Thus the amount of shared randomness is reduced to $b^*_{\mathrm{ME}} \log d - (b - b^*_{\mathrm{ME}})$ bits. Moreover, by extending [3, Theorem 4], we also obtain a lower bound on the amount of shared randomness required, which we summarize in the following corollary:

Corollary 4.1 Under $\varepsilon$-LDP and $b$-bit communication constraints, SQKR uses $\min\left(b^*_{\mathrm{ME}} \log d, d\right) - (b - b^*_{\mathrm{ME}})$ bits of shared randomness to achieve $r_{\mathrm{ME}}(\ell_2, b, \varepsilon)$, where $b^*_{\mathrm{ME}} \triangleq \min(\lceil \varepsilon \rceil, b)$. Moreover, if $b < \log d - 2$, any $b$-bit consistent mean estimation scheme¹ requires at least $\log d - b - 2$ bits.

We contrast this with the amount of shared randomness needed in the generic scheme of [12], which provides $\varepsilon$-LDP by using 1 bit per client in the high privacy regime $\varepsilon = O(1)$. The shared randomness required by this scheme is $d$ bits per client. In contrast, when $\varepsilon = O(1)$ and $b = 1$, SQKR requires $\log d$ bits of shared randomness.

Similarly, for frequency estimation, it can be seen that RHR requires $\log d - b^*_{\mathrm{FE}}$ bits of shared randomness in the random sampling step, where $b^*_{\mathrm{FE}} \triangleq \min(\lceil \varepsilon \log_2 e \rceil, b)$. Again, the client can communicate $b - b^*_{\mathrm{FE}}$ bits of privately generated randomness to the server, which reduces the required public randomness to $\log d - b$ bits. Furthermore, as in mean estimation, we can show that at least $\log d - b - 2$ bits are needed to get a consistent scheme, so RHR is also optimal in the amount of public randomness it uses. We summarize it in the following corollary:

Corollary 4.2 Under $\varepsilon$-LDP and $b$-bit communication constraints, RHR uses $\log d - b$ bits of shared randomness to achieve $r_{\mathrm{FE}}(\ell_2, b, \varepsilon)$, where $b^*_{\mathrm{FE}} \triangleq \min(\lceil \varepsilon \log_2 e \rceil, b)$. Moreover, if $b < \log d - 2$, any $b$-bit consistent frequency estimation scheme requires at least $\log d - b - 2$ bits of shared randomness. Thus RHR is optimal in the amount of shared randomness it uses for frequency estimation, up to an additive constant.
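The bit counts in Corollary 4.1 and Corollary 4.2 can be evaluated with a few lines of Python. This is only an arithmetic sketch: the function names are hypothetical, and the logarithm is taken to base 2, which is an assumption since the corollaries write $\log d$ without a base.

```python
import math


def sqkr_shared_bits(d: int, eps: float, b: int) -> float:
    """Shared randomness of SQKR: min(b* log d, d) - (b - b*), with b* = min(ceil(eps), b)."""
    b_star = min(math.ceil(eps), b)
    return min(b_star * math.log2(d), d) - (b - b_star)


def rhr_shared_bits(d: int, b: int) -> float:
    """Shared randomness of RHR as stated in Corollary 4.2: log d - b."""
    return math.log2(d) - b


# High-privacy, one-bit regime with d = 1024: SQKR needs log d = 10 bits, versus the
# d = 1024 bits per client quoted in the text for the generic scheme of [12].
sqkr_bits = sqkr_shared_bits(d=1024, eps=1.0, b=1)
rhr_bits = rhr_shared_bits(d=1024, b=3)
```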

The achievability parts of Corollary 4.1 and Corollary 4.2 follow directly from the analysis of SQKR and RHR, and we defer the proof of the converse part to Section G.2. Given an $\varepsilon$-LDP constraint, we summarize the minimum amounts of communication and shared randomness required to achieve the optimal errors $r_{\mathrm{ME}}(\ell_2, \varepsilon, \infty)$ and $r_{\mathrm{FE}}(\ell_2, \varepsilon, \infty)$ in Table 4.

In Figure 1, we plot the achievable region for the minimax frequency estimation error under the $\varepsilon$-LDP constraint (i.e. $r_{\mathrm{FE}}(\ell_2, \varepsilon, \infty)$). Note that the red line in Figure 1 can be achieved by RHR.

¹ A scheme is consistent if it has vanishing estimation error as $n \to \infty$.


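| Scheme | Communication | Shared randomness |
|---|---|---|
| SQKR (Thm. 2.1) | $\lceil \varepsilon \rceil$ bits | $\min(\lceil \varepsilon \rceil \log d, d)$ bits |
| RHR (Thm. 3.1) | $\lceil \log_2 e \cdot \varepsilon \rceil$ bits | $\log d - \lfloor \log_2 e \cdot \varepsilon \rfloor$ bits |

Table 4: The amounts of required shared randomness.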

Figure 1: Achievable region for frequency estimation with public randomness.

Remark 4.1 Note that shared randomness is only needed for distribution-free settings; for distribution estimation and statistical mean estimation, one can achieve the same estimation error with only private randomness, as noted in Theorems 2.2 and 3.2.

Converting public-coin schemes to private-coin schemes As discussed above, we can always replace shared randomness with additional communication by first generating the random bits at the client side and then sending them to the server. Therefore, by Corollary 4.1 and Corollary 4.2, we automatically obtain private-coin SQKR and private-coin RHR by using additional communication. We next state these observations for completeness.

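Corollary 4.3 (Private-coin SQKR) Under $\varepsilon$-LDP and $b$-bit communication constraints with $b > \log d$ and $0 < \varepsilon \leq d$, the $\ell_2$ minimax error for private-coin mean estimation, denoted as $\tilde{r}_{\mathrm{ME}}(\ell_2, \varepsilon, b)$² (to distinguish it from the minimax error $r_{\mathrm{ME}}(\ell_2, \varepsilon, b)$ achieved by public-coin schemes), is characterized as follows:

(i) if $\log d < b < d$, then

$$\tilde{r}_{\mathrm{ME}}(\ell_2, \varepsilon, b) \asymp \frac{d}{n \min\left(\varepsilon^2, \varepsilon, b/\log d, d\right)};$$

(ii) if $b \geq d$, then

$$\tilde{r}_{\mathrm{ME}}(\ell_2, \varepsilon, b) \asymp \frac{d}{n \min\left(\varepsilon^2, \varepsilon, d\right)},$$

and the above errors can be achieved by private-coin SQKR. Therefore private-coin SQKR requires $O\left(\min(\lceil \varepsilon \rceil \log d, d)\right)$ bits of communication to achieve $r_{\mathrm{ME}}(\ell_2, \varepsilon, \infty)$.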

Similarly, the estimation error of private-coin RHR is characterized below:

Corollary 4.4 (Private-coin RHR) Under $\varepsilon$-LDP and $b$-bit communication constraints with $b > \log d$ and $0 < \varepsilon \leq \log d$, the $\ell_2$ minimax error for private-coin frequency estimation, denoted as $\tilde{r}_{\mathrm{FE}}(\ell_2, \varepsilon, b)$, is

$$\tilde{r}_{\mathrm{FE}}(\ell_2, \varepsilon, b) \asymp \frac{d}{n \min\left((e^{\varepsilon}-1)^2, e^{\varepsilon}, d\right)},$$

which can be achieved by private-coin RHR. In words, for any $\varepsilon$, private-coin RHR always uses $\log d$ bits of communication to achieve $r_{\mathrm{FE}}(\ell_2, \varepsilon, \infty)$.

² The definition of $\tilde{r}_{\mathrm{ME}}(\cdot)$ is the same as that of $r_{\mathrm{ME}}(\cdot)$ in (1), except that now the minimum is taken over all private-coin schemes.

Moreover, the following lemma, an extension of [3, Theorem 4], establishes a lower bound on the communication required for consistent private-coin schemes:

Lemma 4.1 Any consistent private-coin scheme for both mean estimation and frequency estimation uses at least $b > \log d - 2$ bits of communication.

This shows that the $\log d$ lower bounds on $b$ in both corollaries are fundamental (within 2 bits). The proof of the lemma is given in Section G.

5 Experiments

In this section, we implement our mean estimation and frequency estimation schemes and present our experimental results³. More detailed results can be found in Section B.

5.1 Mean estimation

We implement our mean estimation scheme Subsampled and Quantized Kashin’s Response (SQKR) as in Section 2 under the private-coin setting and compare it with a baseline, a concatenation of DJW [13, 17] (which is order-optimal under $\varepsilon$-LDP for $\varepsilon = O(1)$) and the quantizer based on Kashin’s representation [38] (which is optimal, up to a logarithmic factor, under a $b$-bit communication constraint).

DJW (Lemma 1 in [17]) samples a vector from the unit sphere with a proper probability density (which depends on $X_i$), and scales it by a factor of $O(\sqrt{d})$ in order to make it unbiased. Although under the public-coin setting one can sample the vector with the help of public randomness and reduce the communication to $\lceil \varepsilon \rceil$ bits [11], in the private-coin model each client has to send a $d$-dimensional vector to the server and hence needs to communicate $\Theta(d)$ bits⁴. To compare with SQKR under the private-coin setting, we use an (order-optimal) quantizer based on Kashin’s representation to further compress the communication to $b \lceil \log d \rceil$ bits. It can be shown that such a direct concatenation results in an $O(d^2)$ error rate (see Section B in the appendix for more details).

Generating the data In order to capture the distribution-free setting, we generate data independently but non-identically; in particular, we set $Z_1, \dots, Z_{n/2} \overset{\text{i.i.d.}}{\sim} N(1, 1)^{\otimes d}$ and $Z_{n/2+1}, \dots, Z_n \overset{\text{i.i.d.}}{\sim} N(10, 1)^{\otimes d}$ (this also makes the data non-central, i.e. $\mathbb{E}\left[\sum Z_i\right] \neq 0$). Since the problem assumes that each sample has bounded $\ell_2$ norm, we normalize each $Z_i$ by setting $X_i = Z_i / \|Z_i\|_2$.
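The data-generation recipe above can be sketched with NumPy as follows. The function name, the seed handling, and the treatment of odd $n$ are assumptions for illustration; the authors' repository may organize this differently.

```python
import numpy as np


def generate_data(n: int, d: int, seed: int = 0) -> np.ndarray:
    """Return an n x d array of unit-norm samples drawn from two Gaussian populations."""
    rng = np.random.default_rng(seed)
    half = n // 2
    Z = np.vstack([
        rng.normal(loc=1.0, scale=1.0, size=(half, d)),       # Z_1, ..., Z_{n/2}
        rng.normal(loc=10.0, scale=1.0, size=(n - half, d)),  # Z_{n/2+1}, ..., Z_n
    ])
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)       # X_i = Z_i / ||Z_i||_2


X = generate_data(n=100, d=20)
```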

Generating the tight frame We construct the tight frame by using the random partial Fourier matrices in [38]. Specifically, we set $N = 2^{\lceil \log_2 d \rceil + 1} = \Theta(d)$, and choose the basis $U \in \left\{1/\sqrt{N}, -1/\sqrt{N}\right\}^{N \times d}$ by selecting the first $d$ rows of $H_N \cdot D$, where $H_N$ is an $N \times N$ Hadamard matrix and $D$ is a random diagonal matrix with each diagonal entry drawn uniformly from $\{+1, -1\}$. It can be shown that the tight frame based on $U$ has Kashin’s level $K = O(1)$.
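A possible NumPy sketch of this construction is given below. It stores the frame as a $d \times N$ array whose columns play the role of the frame vectors $u_j$ (a layout chosen for the example), uses the Sylvester Hadamard matrix, and adds the $1/\sqrt{N}$ scaling explicitly; the function name is hypothetical.

```python
import numpy as np


def random_hadamard_frame(d: int, seed: int = 0) -> np.ndarray:
    """Return a d x N array whose N columns form a tight frame of R^d."""
    N = 2 ** ((d - 1).bit_length() + 1)        # N = 2^(ceil(log2 d) + 1)
    H = np.array([[1.0]])
    while H.shape[0] < N:                      # Sylvester construction of H_N
        H = np.block([[H, H], [H, -H]])
    rng = np.random.default_rng(seed)
    signs = rng.choice([-1.0, 1.0], size=N)    # diagonal of the random sign matrix D
    return (H * signs)[:d, :] / np.sqrt(N)     # first d rows of H_N D, scaled by 1/sqrt(N)


U = random_hadamard_frame(d=5)
x = np.arange(1.0, 6.0)
parseval_gap = abs(np.sum((U.T @ x) ** 2) - x @ x)   # should vanish for a tight frame
```

The rows of $H_N D / \sqrt{N}$ are orthonormal, so any $d$ of them satisfy $U U^\top = I_d$, which is exactly Parseval's identity for the columns.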

In Figure 2, we fix the sample size to $n = 10^5$, fix $\varepsilon$ and $b$, and increase the dimension $d$. From the result, we see that SQKR has a linear dependence on $d$, whereas the baseline (labeled as "Separation" since it is based on the idea of separately coding for privacy and communication efficiency) has a super-linear dependence. Therefore the performance differs drastically when $d$ increases.

³ The code can be found at https://github.com/WeiNingChen/Kashin-mean-estimation (for the SQKR scheme) and https://github.com/WeiNingChen/RHR (for the RHR scheme).

⁴ We remark that after our paper was published, a recent work [23] showed that DJW and its improved version privUnit [13] can be compressed in a more efficient way. We refer the reader to [23] for more details.


Figure 2: $\ell_2$ error with $n = 10^5$ and different dimensions $d$. Note that under the private-coin setting, the communication cost is $b \lceil \log d \rceil$ bits for both schemes. In order to better emphasize the dependence on $d$, on the right-hand side we only plot the $\ell_2$ error of SQKR.

5.2 Frequency estimation

For the frequency estimation problem, we experimentally compare our scheme, Recursive Hadamard Response (RHR), with SS [53], HR [4] and 1-bit HR [3]⁵. We set $d \in \{1000, 10000\}$, $\varepsilon \in \{2, 5\}$, and evaluate the $\ell_1$ estimation errors on the truncated and normalized geometric distribution with $\lambda = 0.8$. For each point (i.e. for each parameter $n$, $\varepsilon$, $d$), we repeat the simulation 30 times and average the $\ell_2$ errors. Figure 3 shows that our scheme can achieve the same performance as HR but is significantly more communication efficient. For instance, in Figure 3 with $d = 10000$, $\varepsilon = 5$, RHR uses only half of the communication budget of HR and achieves better performance. In all settings, SS has the best statistical performance, but this comes with drastically higher communication and computation cost.
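For readers who want to reproduce the test distribution, the sketch below builds a truncated and normalized geometric distribution and measures the $\ell_1$ error of a plain empirical histogram. The parameterization $p_j \propto \lambda^j$ is an assumption (the text only gives $\lambda = 0.8$), and the function names are hypothetical; no privacy mechanism is applied here.

```python
import numpy as np


def truncated_geometric(d: int, lam: float = 0.8) -> np.ndarray:
    """Probability vector on {0, ..., d - 1} with p_j proportional to lam**j (assumed form)."""
    weights = lam ** np.arange(d)
    return weights / weights.sum()


def l1_error(p_hat: np.ndarray, p: np.ndarray) -> float:
    """l1 distance between an estimate and the true distribution."""
    return float(np.abs(p_hat - p).sum())


p = truncated_geometric(d=1000, lam=0.8)
rng = np.random.default_rng(0)
samples = rng.choice(len(p), size=50_000, p=p)
empirical = np.bincount(samples, minlength=len(p)) / len(samples)
baseline_error = l1_error(empirical, p)   # non-private, uncompressed reference point
```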

Figure 3: $\ell_1$ error with $d = 5000$ and $d = 10000$, under (truncated) Geo(0.8) and different $\varepsilon$.

6 Conclusion

We have investigated mean estimation and frequency estimation under $\varepsilon$-LDP and $b$-bit communication constraints. A significant advantage of the approaches we presented is that they achieve the privacy and communication constraints simultaneously at the cost of the harsher one. Many interesting questions remain to be addressed, including investigating if we can reduce the amount of shared randomness, deriving decoding schemes with optimal runtimes, and applying our results to distributed SGD.

⁵ For HR, we use the code from [4] (https://github.com/zitengsun/hadamard_response).


7 Acknowledgments

The authors would like to thank Jakub Konečný for bringing Kashin’s representation to their attention. This was helpful in achieving order-optimality for mean estimation. The authors would also like to thank Vitaly Feldman and Kunal Talwar for pointing out a mistake in the experiments of mean estimation as well as the connection between SQKR and [22]. This work was supported in part by a Stanford Graduate Fellowship, the National Science Foundation, and a Google Research Award.

References

[1] J. Acharya, C. L. Canonne, and H. Tyagi. Inference under information constraints ii: Communication constraints and shared randomness. arXiv preprint arXiv:1905.08302, 2019.

[2] J. Acharya, C. L. Canonne, and H. Tyagi. Inference under information constraints: Lower bounds from chi-square contraction. In Conference on Learning Theory, pages 3–17. PMLR, 2019.

[3] J. Acharya and Z. Sun. Communication complexity in locally private distribution estimation and heavy hitters. In International Conference on Machine Learning, pages 51–60, 2019.

[4] J. Acharya, Z. Sun, and H. Zhang. Hadamard response: Estimating distributions privately, efficiently, and with little communication. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1120–1129, 2019.

[5] N. Agarwal, A. T. Suresh, F. X. X. Yu, S. Kumar, and B. McMahan. cpsgd: Communication-efficient and differentially-private distributed sgd. In Advances in Neural Information Processing Systems, pages 7564–7575, 2018.

[6] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic. Qsgd: Communication-efficient sgd via gradient quantization and encoding. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 1709–1720. Curran Associates, Inc., 2017.

[7] B. Balle, J. Bell, A. Gascón, and K. Nissim. The privacy blanket of the shuffle model. In Annual International Cryptology Conference, pages 638–667. Springer, 2019.

[8] L. P. Barnes, W.-N. Chen, and A. Ozgur. Fisher information under local differential privacy. arXiv preprint arXiv:2005.10783, 2020.

[9] L. P. Barnes, Y. Han, and A. Ozgur. Lower bounds for learning distributions under communication constraints via fisher information, 2019.

[10] L. P. Barnes, H. A. Inan, B. Isik, and A. Ozgur. rtop-k: A statistical estimation approach to distributed sgd, 2020.

[11] R. Bassily, K. Nissim, U. Stemmer, and A. Thakurta. Practical locally private heavy hitters. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, page 2285–2293, Red Hook, NY, USA, 2017. Curran Associates Inc.

[12] R. Bassily and A. Smith. Local, private, efficient protocols for succinct histograms. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, STOC ’15, page 127–135, New York, NY, USA, 2015. Association for Computing Machinery.

[13] A. Bhowmick, J. Duchi, J. Freudiger, G. Kapoor, and R. Rogers. Protection against reconstruction and its applications in private federated learning. arXiv preprint arXiv:1812.00984, 2018.

[14] M. Braverman, A. Garg, T. Ma, H. L. Nguyen, and D. P. Woodruff. Communication lower bounds for statistical estimation problems via a distributed data processing inequality. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 1011–1020, 2016.

[15] M. Bun, J. Nelson, and U. Stemmer. Heavy hitters and the structure of local privacy. In Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, SIGMOD/PODS ’18, page 435–447, New York, NY, USA, 2018. Association for Computing Machinery.

[16] S. Caldas, J. Konecny, H. B. McMahan, and A. Talwalkar. Expanding the reach of federated learning by reducing client resource requirements. arXiv preprint arXiv:1812.07210, 2018.


[17] J. C. Duchi, M. I. Jordan, and M. J. Wainwright. Local privacy and statistical minimax rates. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, pages 429–438. IEEE, 2013.

[18] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Theory of cryptography conference, pages 265–284. Springer, 2006.

[19] Ú. Erlingsson, V. Feldman, I. Mironov, A. Raghunathan, S. Song, K. Talwar, and A. Thakurta. Encode, shuffle, analyze privacy revisited: formalizations and empirical evaluation. arXiv preprint arXiv:2001.03618, 2020.

[20] Ú. Erlingsson, V. Feldman, I. Mironov, A. Raghunathan, K. Talwar, and A. Thakurta. Amplification by shuffling: From local to central differential privacy via anonymity. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2468–2479. SIAM, 2019.

[21] A. Evfimievski, J. Gehrke, and R. Srikant. Limiting privacy breaches in privacy preserving data mining. In Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 211–222, 2003.

[22] V. Feldman, C. Guzman, and S. Vempala. Statistical query algorithms for mean vector estimation and stochastic convex optimization. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1265–1277. SIAM, 2017.

[23] V. Feldman and K. Talwar. Lossless compression of efficient private local randomizers. arXiv preprint arXiv:2102.12099, 2021.

[24] J.-J. Fuchs. Spread representations. In 2011 Conference Record of the Forty Fifth Asilomar Conference on Signals, Systems and Computers (ASILOMAR), pages 814–817. IEEE, 2011.

[25] V. Gandikota, D. Kane, R. K. Maity, and A. Mazumdar. vqsgd: Vector quantized stochastic gradient descent, 2019.

[26] A. Garg, T. Ma, and H. Nguyen. On communication cost of distributed statistical estimation and dimensionality. In Advances in Neural Information Processing Systems, pages 2726–2734, 2014.

[27] Y. Han, J. Jiao, and T. Weissman. Minimax estimation of discrete distributions. In 2015 IEEE International Symposium on Information Theory (ISIT), pages 2291–2295. IEEE, 2015.

[28] Y. Han, P. Mukherjee, A. Ozgur, and T. Weissman. Distributed statistical estimation of high-dimensional and nonparametric distributions. In 2018 IEEE International Symposium on Information Theory (ISIT), pages 506–510. IEEE, 2018.

[29] Y. Han, A. Özgür, and T. Weissman. Geometric lower bounds for distributed parameter estimation under communication constraints. arXiv preprint arXiv:1802.08417, 2018.

[30] J. Hsu, S. Khanna, and A. Roth. Distributed private heavy hitters. In Proceedings of the 39th International Colloquium Conference on Automata, Languages, and Programming - Volume Part I, ICALP’12, page 461–472, Berlin, Heidelberg, 2012. Springer-Verlag.

[31] P. Kairouz, K. Bonawitz, and D. Ramage. Discrete distribution estimation under local privacy. In Proceedings of The 33rd International Conference on Machine Learning, volume 48, pages 2436–2444, New York, New York, USA, 20–22 Jun 2016.

[32] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977, 2019.

[33] P. Kairouz, S. Oh, and P. Viswanath. The composition theorem for differential privacy. In International conference on machine learning, pages 1376–1385. PMLR, 2015.

[34] P. Kairouz, S. Oh, and P. Viswanath. Extremal mechanisms for local differential privacy. The Journal of Machine Learning Research, 17(1):492–542, 2016.

[35] B. Kashin. Section of some finite-dimensional sets and classes of smooth functions (in Russian). Izv. Acad. Nauk. SSSR, 41:334–351, 1977.

[36] S. P. Kasiviswanathan, H. K. Lee, K. Nissim, S. Raskhodnikova, and A. Smith. What can we learn privately? SIAM Journal on Computing, 40(3):793–826, 2011.


[37] S. Lloyd. Least squares quantization in pcm. IEEE transactions on information theory, 28(2):129–137, 1982.

[38] Y. Lyubarskii and R. Vershynin. Uncertainty principles and vector quantization. IEEE Transactions on Information Theory, 56(7):3491–3501, 2010.

[39] T. T. Nguyên, X. Xiao, Y. Yang, S. C. Hui, H. Shin, and J. Shin. Collecting and analyzing data from smart device users with local differential privacy, 2016.

[40] F. Niu, B. Recht, C. Re, and S. J. Wright. Hogwild! a lock-free approach to parallelizing stochastic gradient descent. In Proceedings of the 24th International Conference on Neural Information Processing Systems, NIPS’11, page 693–701, Red Hook, NY, USA, 2011. Curran Associates Inc.

[41] Z. Qin, Y. Yang, T. Yu, I. Khalil, X. Xiao, and K. Ren. Heavy hitter estimation over set-valued data with local differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, CCS ’16, page 192–203, New York, NY, USA, 2016. Association for Computing Machinery.

[42] M. Safaryan, E. Shulgin, and P. Richtárik. Uncertainty principle for communication compression in distributed and federated learning and the search for an optimal compressor. arXiv preprint arXiv:2002.08958, 2020.

[43] C. Studer, W. Yin, and R. G. Baraniuk. Signal representations with minimum $\ell_\infty$-norm. In 2012 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 1270–1277. IEEE, 2012.

[44] A. T. Suresh, F. X. Yu, S. Kumar, and H. B. McMahan. Distributed mean estimation with limited communication. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, page 3329–3337. JMLR.org, 2017.

[45] M. J. Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press, 2019.

[46] H. Wang, S. Sievert, S. Liu, Z. Charles, D. Papailiopoulos, and S. Wright. Atomo: Communication-efficient learning via atomic sparsification. In Advances in Neural Information Processing Systems, pages 9850–9861, 2018.

[47] S. Wang, L. Huang, P. Wang, Y. Nie, H. Xu, W. Yang, X.-Y. Li, and C. Qiao. Mutual information optimally local private discrete distribution estimation, 2016.

[48] T. Wang, J. Blocki, N. Li, and S. Jha. Locally differentially private protocols for frequency estimation. In 26th USENIX Security Symposium (USENIX Security 17), pages 729–745, 2017.

[49] T. Wang, J. Zhao, X. Yang, and X. Ren. Locally differentially private data collection and analysis. arXiv preprint arXiv:1906.01777, 2019.

[50] J. Wangni, J. Wang, J. Liu, and T. Zhang. Gradient sparsification for communication-efficient distributed optimization. In Advances in Neural Information Processing Systems, pages 1299–1309, 2018.

[51] S. L. Warner. Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association, 60(309):63–69, 1965.

[52] W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li. Terngrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in neural information processing systems, pages 1509–1519, 2017.

[53] M. Ye and A. Barg. Optimal schemes for discrete distribution estimation under local differential privacy. In 2017 IEEE International Symposium on Information Theory (ISIT), pages 759–763, June 2017.

[54] Y. Zhang, J. Duchi, M. I. Jordan, and M. J. Wainwright. Information-theoretic lower bounds for distributed statistical estimation with communication constraints. In Advances in Neural Information Processing Systems, pages 2328–2336, 2013.

[55] Ú. Erlingsson, V. Pihur, and A. Korolova. Rappor: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 21st ACM Conference on Computer and Communications Security, Scottsdale, Arizona, 2014.


A Separate Quantization and Privatization Is Strictly Sub-optimal

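Distribution estimation First let us recap the subset selection (SS) scheme proposed by [53]. Assume $X_1, \dots, X_n \overset{\text{i.i.d.}}{\sim} p = (p_1, \dots, p_d)$. Client $i$ maps the local data $X_i$ into $y \in \mathcal{Y}_{d,w} \triangleq \left\{ y \in \{0,1\}^d : \sum_j y_j = w \right\}$ with the transition probability

$$Q_{\mathrm{SS}}(y \mid X = j) = \frac{e^{\varepsilon} y_j + (1 - y_j)}{\binom{d-1}{w-1} e^{\varepsilon} + \binom{d-1}{w}}.$$

The estimator for $p_j$ is defined by

$$\hat{p}_j \triangleq \left( \frac{(d-1)e^{\varepsilon} + \frac{(d-1)(d-w)}{w}}{(d-w)(e^{\varepsilon}-1)} \right) \frac{T_j}{n} - \frac{(w-1)e^{\varepsilon} + d - w}{(d-w)(e^{\varepsilon}-1)}, \tag{4}$$

where $T_j \triangleq \sum_{i=1}^{n} Y_i(j)$. Note that by picking $w = \left\lceil \frac{d}{e^{\varepsilon}+1} \right\rceil$, SS is order-optimal for all privacy regimes.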
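As a sanity check on estimator (4), the following sketch computes the two marginal probabilities of the SS mechanism in closed form and confirms that the affine map in (4) sends them to 1 and 0, which is what unbiasedness requires. The helper names are hypothetical, and the counting argument in the comments is a reconstruction from the transition probability above rather than text from [53].

```python
import math


def _comb(n: int, k: int) -> int:
    """Binomial coefficient that returns 0 outside the valid range."""
    return math.comb(n, k) if 0 <= k <= n else 0


def ss_probabilities(d: int, w: int, eps: float) -> tuple:
    """Return (alpha, beta) = (P(Y(j)=1 | X=j), P(Y(j)=1 | X=k)) for k != j."""
    e = math.exp(eps)
    norm = _comb(d - 1, w - 1) * e + _comb(d - 1, w)
    alpha = _comb(d - 1, w - 1) * e / norm
    # Subsets containing j: those that also contain k carry weight e^eps, the rest weight 1.
    beta = (_comb(d - 2, w - 2) * e + _comb(d - 2, w - 1)) / norm
    return alpha, beta


def ss_estimate(freq_j: float, d: int, w: int, eps: float) -> float:
    """Estimator (4) applied to the empirical frequency T_j / n."""
    e = math.exp(eps)
    scale = ((d - 1) * e + (d - 1) * (d - w) / w) / ((d - w) * (e - 1))
    shift = ((w - 1) * e + d - w) / ((d - w) * (e - 1))
    return scale * freq_j - shift


d, eps = 10, 1.0
w = math.ceil(d / (math.exp(eps) + 1))
alpha, beta = ss_probabilities(d, w, eps)
estimate_if_present = ss_estimate(alpha, d, w, eps)   # expected value when p_j = 1
estimate_if_absent = ss_estimate(beta, d, w, eps)     # expected value when p_j = 0
```

Because $\mathbb{E}[T_j/n] = p_j \alpha + (1 - p_j)\beta$ and the estimator is affine, these two endpoint checks imply unbiasedness for every $p_j$.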

To demonstrate that separating privatization and quantization is strictly sub-optimal, we analyze the estimation error of directly concatenating the $2^b$-SS mechanism with the grouping-based quantization in [28]. Note that both schemes are known to be optimal under the corresponding constraints, privacy and communication respectively. However, their direct combination yields an $\ell_2$ error of order $O(d^2)$, which is far from the optimal accuracy established in Theorem 3.1.

We first group $[d]$ into $s = d/2^b$ equal-sized groups $\mathcal{G}_1, \dots, \mathcal{G}_s$, and each client is only responsible for sending information about one particular group. That is, let $Y_i$ be the outcome of the $2^b$-SS mechanism, i.e. $Y_i \sim Q_{\mathrm{SS}}(\cdot \mid X_i)$, and client $i$ only transmits $\{Y_i(j) \mid j \in \mathcal{G}_{s'}\}$, for some $s' \in [s]$. Since the server estimates each component of $p$ separately as in (4), this grouping strategy reduces the effective sample size from $n$ to $n' = n 2^b / d$. Plugging $n'$ into the $\ell_2$ error (see Proposition III.1 in [53]), we conclude that the error grows as

$$O\left( \frac{d^2}{n 2^b \min\left(e^{\varepsilon}, (e^{\varepsilon}-1)^2\right)} \right).$$

Note that since each $Y_i$ contains exactly $w$ ones, the required communication budget to describe $\{Y_i(j), j \in \mathcal{G}_l\}$ may be larger than $b$ bits. But this is fine since it implies that even given more than $b$ bits, the estimation error still grows with $d^2$. In Theorem 3.2, on the other hand, we show that the optimal $\ell_2$ error is linear in $d$, so this demonstrates that separate quantization and privatization is sub-optimal.

Mean estimation For the mean estimation problem, a straightforward combination is to use the PrivUnit mechanism (see Algorithm 1 in [13]) to perturb the local data $X_i \in \mathcal{B}_d(0, 1)$, and then use the RandomSampling quantization (Theorem 6 in [25]) to compress the perturbed data. Both schemes are known to be optimal under the corresponding constraints, privacy and communication respectively. (Note that in Section 5 we replaced the RandomSampling quantization with a Kashin’s quantizer, since implementing the theoretically optimal RandomSampling quantization is computationally infeasible.)

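By Proposition 4 in [13], the output of PrivUnit, denoted as $Z_i = \mathrm{PrivUnit}(X_i, \varepsilon)$, has $\ell_2$ norm of order $\Theta\left(\sqrt{d / \min(\varepsilon, \varepsilon^2)}\right)$. However, if we further apply RandomSampling to compress it to $b$ bits, by Theorem 6 in [25], the $\ell_2$ estimation error grows as

$$\Theta\left( \|Z_i\|_2^2 \, \frac{d}{n \cdot b} \right) = \Theta\left( \frac{d^2}{n b \min(\varepsilon, \varepsilon^2)} \right),$$

showing a quadratic dependence on $d$. By Theorem 2.1, nevertheless, we can construct a better scheme with $O\left(d / \left(n \min(\varepsilon, \varepsilon^2, b)\right)\right)$ dependence under both constraints.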


B More Experimental Results

B.1 Mean estimation

We generate the data as well as the tight frame as described in Section 5.

Compare to optimal LDP estimation schemes We first compare our scheme SQKR, under the private-coin setting, with 1) privUnit [13], which is order-optimal for all $\varepsilon$, and 2) DJW [17], which is order-optimal for $\varepsilon = O(1)$. Note that although DJW is originally designed for the high-privacy regime $\varepsilon = O(1)$, one can independently and repeatedly apply it with $\varepsilon' = 1$ for $\lfloor \varepsilon \rfloor$ times and return the mean of the $\lfloor \varepsilon \rfloor$ vectors. By the composition theorem [33], the output satisfies $\lfloor \varepsilon \rfloor$-LDP, and the MSE is reduced by a factor of $\lfloor \varepsilon \rfloor$. The repeated version of DJW (denoted as reDJW) is hence asymptotically optimal, and we also compare it with our scheme.

Note that the outcomes of privUnit, DJW and reDJW are $d$-dimensional vectors lying on a sphere of radius $O(\sqrt{d})$, so in general we need $32d$ bits to represent them (where we assume each float requires 32 bits). Figure 4 shows that SQKR achieves similar performance with significantly smaller communication budgets. For instance, under the private-coin model, when $\varepsilon = 5$ and $d = 200$, the communication cost of privUnit is roughly $32 \times 200 \approx 6$K bits, while according to Corollary 4.3, SQKR uses only $5 \times \lceil \log_2 200 \rceil = 40$ bits.
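The arithmetic in this comparison can be reproduced directly. The snippet below only restates the two bit counts quoted above, with hypothetical function names and the 32-bit float assumption taken from the text.

```python
import math


def dense_vector_bits(d: int, bits_per_float: int = 32) -> int:
    """Bits needed to send a d-dimensional float vector (privUnit, DJW, reDJW)."""
    return bits_per_float * d


def private_coin_sqkr_bits(d: int, eps: float) -> int:
    """Bits used by private-coin SQKR: ceil(eps) sampled coordinates, ceil(log2 d) bits each."""
    return math.ceil(eps) * math.ceil(math.log2(d))


privunit_bits = dense_vector_bits(200)            # 32 * 200
sqkr_bits = private_coin_sqkr_bits(200, eps=5.0)  # 5 * ceil(log2 200) = 5 * 8
```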

Figure 4: $\ell_2$ error of privUnit, DJW, reDJW and SQKR with dimension $d = 200$.

Next, under the private-coin setting, we compare SQKR with a combination of DJW and an optimal quantizer.

Baseline: a direct concatenation of DJW, Kashin’s quantizer and sampling For each $X_i$ in the unit $\ell_2$ ball, DJW maps it to a vector $\tilde{X}_i$ with length $\|\tilde{X}_i\|_2 = \Theta\left(\sqrt{d / \min(1, \varepsilon^2)}\right)$. Note that DJW is order-optimal for $\varepsilon = O(1)$ [17]. If we quantize $\tilde{X}_i$ according to its Kashin’s representation and then subsample $b$ bits from it as in Section 2, then the $\ell_2$ error (i.e. variance) will be

$$O\left( \frac{d}{b} \|\tilde{X}_i\|_2^2 \right) = O\left( \frac{d^2}{b \min(1, \varepsilon^2)} \right).$$

Therefore, averaging over $n$ clients, the $\ell_2$ error of estimating the empirical mean is

$$O\left( \frac{d^2}{n \cdot b \min(1, \varepsilon^2)} \right).$$

However, in Theorem 2.1, we see that with a more sophisticated design, we can achieve a smaller $\ell_2$ error

$$O\left( \frac{d}{n \cdot \min(\varepsilon, \varepsilon^2, b)} \right).$$


Setup In the experiment, we mainly focus on the high-privacy low-communication setting where $\varepsilon = b = 1$. Note that since we are under the private-coin setting, the actual communication cost for each setting is $b \cdot \lceil \log_2 d \rceil$.

We first consider different dimensions $d$ and plot the (log-scale) $\ell_2$ estimation error (i.e. mean square error) against the sample size $n$. For each point, i.e. each set of parameters $(\varepsilon, b, d, n)$, we repeat the simulation for 8 iterations and report the average. In Figure 5, we see that SQKR drastically outperforms the baseline (labeled as "Separation" since it is based on the idea of separately coding for privacy and communication efficiency). The gain increases in higher dimensions or with more stringent privacy/communication constraints.

Figure 5: Log-scale $\ell_2$ error with different dimensions $d = 50, 80$ and different privacy and communication budgets.

Next, to better study the dependence on $d$, we fix the sample size to $n = 10^5$ and $\varepsilon = b = 1$, and increase the dimension $d$. In Figure 6, we see that SQKR has a linear dependence on $d$, and Separation has a super-linear dependence. Therefore the performance differs drastically when $d$ increases.

Figure 6: $\ell_2$ error with $n = 10^5$ and different dimensions $d$. In order to better emphasize the dependence on $d$, on the right-hand side we only plot the $\ell_2$ error of SQKR.


B.2 Frequency estimation

For frequency estimation, we compare our scheme, Recursive Hadamard Response (RHR), with SS [53], HR [4] and 1-bit HR [3]. We set $d \in \{1000, 5000, 10000\}$, $\varepsilon \in \{0.5, 2, 5\}$ and $n \in \{50000, 100000, \dots, 500000\}$, and evaluate the $\ell_1$ estimation errors on the uniform distribution and the truncated and normalized geometric distribution with $\lambda = 0.8$. For each point (i.e. for each parameter $n$, $\varepsilon$, $d$), we repeat the simulation 30 times and average the $\ell_2$ errors. Figure 7 and Figure 8 show that RHR can achieve the same performance as HR but is significantly more communication efficient. For instance, in Figure 8 with $d = 10000$, $\varepsilon = 5$, RHR uses only half of the communication budget of HR and achieves better performance. In all settings, $k$-SS has the best statistical performance, but this comes with drastically higher communication and computation cost.

Figure 7: $\ell_1$ error with $d = 1000$. The left panels are Geo(0.8) and the right panels are Uniform.


Figure 8: $\ell_1$ error with $d = 5000$ and $d = 10000$, under (truncated) Geo(0.8) and different $\varepsilon$.

In Figure 9, we record the decoding time for each scheme. The decoding complexity of RHR is similar to that of HR and 1-bit HR, all of which are much more computationally efficient than SS.

Figure 9: Left: time complexity with $d = 1000$, $\varepsilon = 7$; right: time complexity with $d = 5000$, $\varepsilon = 2$.


C Proof of Theorem 2.1

C.1 Achievability

In this section, we prove that Subsampled and Quantized Kashin’s Response (SQKR) achieves optimal $\ell_2$ estimation error. For each observation $X_i$, we will construct an unbiased estimator $\hat{X}_i$ (i.e. $\mathbb{E}[\hat{X}_i \mid X_i] = X_i$), where $\hat{X}_i$ is $\varepsilon$-LDP, can be described by $k$ bits, and has small variance. The encoding scheme consists of three main steps: (1) obtaining a Kashin’s representation for a tight frame [38], (2) subsampling and (3) privatization.

Kashin’s representation We begin by introducing tight frames and Kashin’s representation [38].

Definition C.1 (Tight frame) A tight frame is a set of vectors $\{u_j\}_{j=1}^N \subset \mathbb{R}^d$ that obeys Parseval’s identity

$$\|x\|_2^2 = \sum_{j=1}^N \langle u_j, x \rangle^2, \quad \text{for all } x \in \mathbb{R}^d.$$

A frame can be viewed as a generalization of an orthogonal basis in $\mathbb{R}^d$, which can improve the encoding stability by adding redundancy to the representation system when $N > d$. To increase robustness, we wish the information to spread evenly across the coefficients, which motivates the following definition of a Kashin’s representation:

Definition C.2 (Kashin’s representation) For a set of vectors $\{u_j\}_{j=1}^N$, we say the expansion

$$x = \sum_{j=1}^N a_j u_j, \quad \text{with } \max_j |a_j| \le \frac{K}{\sqrt{N}} \|x\|_2,$$

is a Kashin’s representation of the vector $x$ at level $K$.

Therefore, if we can obtain unbiased estimators $\{\hat{a}_j\}_{j=1}^N$ of the Kashin’s representation of $X$ with respect to a tight frame $\{u_j\}_{j=1}^N$, then the MSE can be controlled by

$$\mathbb{E}\left[\|X - \hat{X}\|_2^2\right] = \mathbb{E}\left\|\sum_{j=1}^N (a_j - \hat{a}_j) u_j\right\|_2^2 \overset{(a)}{\le} \mathbb{E}\sum_{j=1}^N (a_j - \hat{a}_j)^2 = \sum_{j=1}^N \mathrm{Var}(\hat{a}_j), \tag{5}$$

where (a) is due to the Cauchy-Schwarz inequality and the definition of a tight frame. Recall that $X$ is deterministic, so here the expectation is taken with respect to the randomness of $\hat{a}_j$. Notice that the cardinality $N$ of the frame determines the compression (i.e. quantization) rate, and the Kashin’s level $K$ affects the variance. Hence we are interested in constructing tight frames with small $N$ and $K$.
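One way to spell out step (a) in (5) is sketched below. The matrix $U$ (whose columns are the frame vectors) and the coefficient vector $c$ are notation introduced only for this illustration and do not appear in the original argument.

```latex
% Assumption: U = [u_1, ..., u_N] \in \mathbb{R}^{d \times N}, and c \in \mathbb{R}^N is arbitrary.
% Parseval's identity reads \|U^\top x\|_2 = \|x\|_2 for all x \in \mathbb{R}^d.
\begin{align*}
\Big\| \sum_{j=1}^N c_j u_j \Big\|_2
  = \| U c \|_2
  &= \max_{\|x\|_2 = 1} \langle x, U c \rangle
   = \max_{\|x\|_2 = 1} \langle U^\top x, c \rangle \\
  &\le \max_{\|x\|_2 = 1} \| U^\top x \|_2 \, \| c \|_2
   && \text{(Cauchy-Schwarz)} \\
  &= \| c \|_2
   && \text{(tight frame)} .
\end{align*}
% Taking c_j = a_j - \hat{a}_j and squaring both sides gives (a).
```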

By Theorem 3.5 and Theorem 4.1 in [38], we have the following lemma:

Lemma C.1 (Uncertainty principle and Kashin’s representation) For any $\mu > 0$ and $N > (1 + \mu)d$, there exists a tight frame $\{u_j\}_{j=1}^N$ with Kashin’s level $K = O\left(\frac{1}{\mu^3} \log \frac{1}{\mu}\right)$. Moreover, for each $X$, finding the Kashin’s coefficients requires $O(dN \log N)$ computation.

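For our purpose, we choose $\mu$ to be a constant, i.e. $\mu = \Theta(1)$, so $N = \Theta(d)$, $K = \Theta(1)$, and we can obtain a representation $X = \sum_{j=1}^N a_j u_j$ with $|a_j| \le \frac{K}{\sqrt{N}} = \frac{c}{\sqrt{d}}$ for some constant $c$. Therefore, we quantize each $a_j$ as follows:

$$q_j \triangleq \begin{cases} -\dfrac{c}{\sqrt{d}}, & \text{with probability } \dfrac{c/\sqrt{d} - a_j}{2c/\sqrt{d}}, \\[2ex] \dfrac{c}{\sqrt{d}}, & \text{with probability } \dfrac{a_j + c/\sqrt{d}}{2c/\sqrt{d}}. \end{cases} \tag{6}$$

$q \triangleq (q_1, \ldots, q_N)$ yields an unbiased estimator of $a \triangleq (a_1, \ldots, a_N)$ and can be described by $N = \Theta(d)$ bits.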
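The quantizer in (6) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `quantize` and `expected_quantized` are hypothetical, and the values of `c` and `d` used in any call are arbitrary.

```python
import math
import random

def quantize(a, c, d, rng):
    """Randomly round a coefficient a in [-c/sqrt(d), c/sqrt(d)] to an endpoint, as in (6)."""
    hi = c / math.sqrt(d)
    if abs(a) > hi:
        raise ValueError("coefficient outside the Kashin bound c/sqrt(d)")
    p_hi = (a + hi) / (2 * hi)  # probability of the upper endpoint
    return hi if rng.random() < p_hi else -hi

def expected_quantized(a, c, d):
    """Exact mean of the quantizer output, computed from the two probabilities in (6)."""
    hi = c / math.sqrt(d)
    p_hi = (a + hi) / (2 * hi)
    return hi * p_hi - hi * (1 - p_hi)
```

Because the mean is computed in closed form, `expected_quantized(a, c, d)` returns `a` up to floating-point error, which is the unbiasedness property used in the text.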


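Sampling To further reduce the communication cost, we sample $k$ bits uniformly at random from $q$ using public randomness. Let $s_1, \ldots, s_k \overset{\text{i.i.d.}}{\sim} \mathrm{uniform}[N]$ be the indices of the sampled elements, and define the sampled message as

$$Q(q, (s_1, \ldots, s_k)) = (q_{s_1}, \ldots, q_{s_k}) \in \left\{-c/\sqrt{d},\, c/\sqrt{d}\right\}^k. \tag{7}$$

Then $Q$ can be described in $k$ bits, and each $q_{s_m}$ yields an independent and unbiased estimator of $a$:

$$\mathbb{E}\left[N \cdot q_{s_m} \cdot \mathbb{1}_{\{j = s_m\}}\right] = \mathbb{E}\left[\mathbb{E}\left[N \cdot q_{s_m} \cdot \mathbb{1}_{\{j = s_m\}} \,\middle|\, q_1, \ldots, q_N\right]\right] = \mathbb{E}[q_j] = a_j, \quad \forall j \in [N]. \tag{8}$$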

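Privatization Each client then perturbs $Q$ via the $2^k$-RR mechanism (viewing $Q$ as a $k$-bit string):

$$\tilde{Q} = \begin{cases} Q, & \text{with probability } \dfrac{e^{\varepsilon}}{e^{\varepsilon} + 2^k - 1}, \\[2ex] Q' \in \left\{-c/\sqrt{d},\, c/\sqrt{d}\right\}^k \setminus \{Q\}, & \text{with probability } \dfrac{1}{e^{\varepsilon} + 2^k - 1}. \end{cases} \tag{9}$$

Since

$$\sum_{Q' \in \{-c/\sqrt{d},\, c/\sqrt{d}\}^k \setminus \{Q\}} Q' = -Q,$$

it is not hard to see that $\left(\frac{e^{\varepsilon} + 2^k - 1}{e^{\varepsilon} - 1}\right) \tilde{Q}$ yields an unbiased estimator of $Q$. Indeed, if we write $\tilde{Q} = (\tilde{q}_1, \ldots, \tilde{q}_k)$, then

$$\mathbb{E}\left[\left(\frac{e^{\varepsilon} + 2^k - 1}{e^{\varepsilon} - 1}\right) \tilde{q}_m \,\middle|\, q_1, \ldots, q_N, s_1, \ldots, s_k\right] = q_{s_m}, \tag{10}$$

or equivalently

$$\mathbb{E}\left[\left(\frac{e^{\varepsilon} + 2^k - 1}{e^{\varepsilon} - 1}\right) \tilde{Q} \,\middle|\, Q\right] = Q.$$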
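As a sanity check of the debiasing factor, the following Python sketch enumerates the output law of $2^k$-ary randomized response exactly (no sampling) for a small $k$. The helper names `rr_distribution` and `debiased_mean` are hypothetical, and the two quantization levels are passed in as an assumption of the sketch.

```python
import itertools
import math

def rr_distribution(q, eps, levels):
    """Output law of 2^k-RR applied to the k-tuple q with entries in `levels`."""
    k = len(q)
    denom = math.exp(eps) + 2 ** k - 1
    dist = {}
    for out in itertools.product(levels, repeat=k):
        # The true message keeps weight e^eps; every other message gets weight 1.
        dist[out] = (math.exp(eps) if out == tuple(q) else 1.0) / denom
    return dist

def debiased_mean(q, eps, levels):
    """Exact expectation of the rescaled privatized message, coordinate by coordinate."""
    k = len(q)
    scale = (math.exp(eps) + 2 ** k - 1) / (math.exp(eps) - 1)
    dist = rr_distribution(q, eps, levels)
    return [scale * sum(p * out[m] for out, p in dist.items()) for m in range(k)]
```

With symmetric levels such as `(-0.1, 0.1)`, `debiased_mean(q, eps, levels)` reproduces `q` up to floating-point error, and the ratio between the largest and smallest output probability is $e^{\varepsilon}$, which is the $\varepsilon$-LDP condition.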

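Estimation and the $\ell_2$ error Given $\tilde{Q} = (\tilde{q}_1, \ldots, \tilde{q}_k)$, define

$$\hat{a}_j = \frac{N}{k} \cdot \left(\frac{e^{\varepsilon} + 2^k - 1}{e^{\varepsilon} - 1}\right) \sum_{m=1}^k \tilde{q}_m \cdot \mathbb{1}_{\{j = s_m\}}.$$

According to (8) and (10), $\mathbb{E}[\hat{a}_j] = a_j$, and hence $\hat{X}\left(\tilde{Q}, (s_1, \ldots, s_k)\right) \triangleq \sum_{j=1}^N \hat{a}_j u_j$ gives us an unbiased estimator of $X$.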

Claim C.1 The MSE of $\hat{X}$ can be bounded by

$$\mathbb{E}\left[\|\hat{X} - X\|_2^2\right] \le C \left(\frac{e^{\varepsilon} + 2^k - 1}{e^{\varepsilon} - 1}\right)^2 \frac{d}{k}.$$

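Finally, each client encodes its data $X_i$ independently, and the server computes $\frac{1}{n} \sum_i \hat{X}_i$. Since each $\hat{X}_i$ is unbiased, by Claim C.1 we get

$$\mathbb{E}\left\|\frac{1}{n} \sum_{i=1}^n \hat{X}_i - \bar{X}\right\|_2^2 = \frac{1}{n^2} \sum_{i=1}^n \mathbb{E}\left[\|\hat{X}_i - X_i\|_2^2\right] \le C \left(\frac{e^{\varepsilon} + 2^k - 1}{e^{\varepsilon} - 1}\right)^2 \frac{d}{nk}.$$

Finally, picking $k = \min\left(\lceil \varepsilon \log_2 e \rceil, b\right)$ gives us the desired upper bound.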
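To see roughly why this choice of $k$ yields a rate of order $d / (n \min(\varepsilon^2, \varepsilon, b))$, the following worked steps may help. The constants are not optimized and are an assumption of this sketch rather than part of the original proof.

```latex
% Since k \le \lceil \varepsilon \log_2 e \rceil < \varepsilon \log_2 e + 1, we have 2^k < 2 e^{\varepsilon}. Hence
\[
\frac{e^{\varepsilon} + 2^k - 1}{e^{\varepsilon} - 1}
  < \frac{3 e^{\varepsilon} - 1}{e^{\varepsilon} - 1}
  = 3 + \frac{2}{e^{\varepsilon} - 1}
  \le 3 + \frac{2}{\varepsilon},
\]
% using e^{\varepsilon} - 1 \ge \varepsilon in the last step.
%
% Case \varepsilon \ge 1: the factor is at most 5 and k \ge \min(\varepsilon, b), so
\[
\frac{d}{nk}\Big(3 + \frac{2}{\varepsilon}\Big)^2 \le \frac{25\, d}{n \min(\varepsilon, b)} .
\]
% Case \varepsilon < 1: k \ge 1 and 3 + 2/\varepsilon \le 5/\varepsilon, so
\[
\frac{d}{nk}\Big(3 + \frac{2}{\varepsilon}\Big)^2 \le \frac{25\, d}{n \varepsilon^2} .
\]
% In both cases the bound is O\big( d / (n \min(\varepsilon^2, \varepsilon, b)) \big) for b \ge 1.
```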

C.2 Lower Bound of Theorem 2.1

As in the converse part of Theorem 3.1, the lower bound can be obtained by constructing a prior distribution on $X_i$ and analyzing the statistical mean estimation problem. Therefore, we will impose a prior distribution $P$ on $X_1, \ldots, X_n$ and lower bound the $\ell_2$ error of estimating the mean $\theta(P)$, where $P$ is a distribution supported on the $d$-dimensional unit ball.


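For any $\hat{X}$, observe that

$$\begin{aligned} \mathbb{E}_{X^n \overset{\text{i.i.d.}}{\sim} P}\left[\|\hat{X} - \bar{X}\|_2^2\right] &\overset{(a)}{\ge} \mathbb{E}\left[\left(\|\hat{X} - \theta(P)\|_2 - \|\bar{X} - \theta(P)\|_2\right)^2\right] \\ &\ge \mathbb{E}\left[\|\hat{X} - \theta(P)\|_2^2\right] - 2\,\mathbb{E}\left[\|\hat{X} - \theta(P)\|_2\, \|\bar{X} - \theta(P)\|_2\right] \\ &\overset{(b)}{\ge} \mathbb{E}\left[\|\hat{X} - \theta(P)\|_2^2\right] - 2\sqrt{\mathbb{E}\left[\|\hat{X} - \theta(P)\|_2^2\right] \mathbb{E}\left[\|\bar{X} - \theta(P)\|_2^2\right]}, \end{aligned} \tag{11}$$

where (a) and (b) follow from the triangle inequality and the Cauchy-Schwarz inequality respectively. Since $X_i$ and $\theta(P)$ are supported on the unit ball, $\mathbb{E}\left[\|\bar{X} - \theta(P)\|_2^2\right] \preceq 1/n$, so it remains to find a distribution $P^*$ such that

$$\min_{\hat{X}} \mathbb{E}\left[\|\hat{X} - \theta(P^*)\|_2^2\right] \succeq \frac{d}{n \min(\varepsilon^2, \varepsilon, b)}.$$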

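Consider the product Bernoulli model $Y \sim \prod_{j=1}^d \mathrm{Ber}(\theta_j)$. If we set $\Theta = [1/2 - \varepsilon, 1/2 + \varepsilon]^d$ for some $\frac{1}{2} > \varepsilon > 0$, then it can be shown that both the variance and the sub-Gaussian norm of the score function of this model are $\Theta(1)$ [9, Corollary 4]. Therefore, applying [9, Corollary 8] and [8, Proposition 2, Proposition 4] yields

$$\min_{\hat{\theta}} \mathbb{E}\left[\|\hat{\theta} - \theta\|_2^2\right] \succeq \frac{d^2}{n \min(\varepsilon^2, \varepsilon, b)}.$$

Finally, if we set $X_i = Y_i / \sqrt{d}$, then each $X_i$ is supported on the unit ball and $\mathbb{E}[X_i] = \theta / \sqrt{d}$. Therefore

$$\min_{\hat{X}} \mathbb{E}\left[\left\|\hat{X} - \frac{\theta}{\sqrt{d}}\right\|_2^2\right] \succeq \frac{d}{n \min(\varepsilon^2, \varepsilon, b)}.$$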

Plugging into (11), as long as $\min(\varepsilon^2, \varepsilon, b) = o(d)$, the first term dominates and we get the desired lower bound.

D Proof of Theorem 2.2

The lower bounds follow directly from [13] (under the $\varepsilon$-LDP constraint) and [44] (under the $b$-bit communication constraint). For the achievability part, we apply SQKR, except that we replace the random sampling step by deterministic grouping.

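Let $X_i \overset{\text{i.i.d.}}{\sim} P$ with $P$ supported on $B(0, 1)$. First, as in the proof of Theorem 3.1, by Lemma C.1 we can write $X_i = \sum_{j=1}^N A_{ij} u_j$ with $N = c_0 d$ and $|A_{ij}| \le K/\sqrt{d}$, $K = \Theta(1)$. Since $X_i \overset{\text{i.i.d.}}{\sim} P$, if we denote $A_i = [A_{i1}, \ldots, A_{iN}]$, then $A_i \overset{\text{i.i.d.}}{\sim} Q$ for some $Q$ supported on $\left[-\frac{K}{\sqrt{d}}, \frac{K}{\sqrt{d}}\right]^N$.

Now we group the $n$ clients into $m \triangleq N/b^*$ groups $\mathcal{G}_1, \ldots, \mathcal{G}_m$, each with $n b^*/N$ clients, where $b^* \triangleq \min\left(\lceil \varepsilon \log_2 e \rceil, b\right)$. Also, we divide all $N$ coordinates (of $A_i$) into $m$ groups $\mathcal{I}_1, \ldots, \mathcal{I}_m$, and each group of clients is responsible for estimating the corresponding group of coordinates of $\theta(Q) \in \left[-\frac{K}{\sqrt{d}}, \frac{K}{\sqrt{d}}\right]^N$, where $\theta(Q) = \mathbb{E}_Q[A]$ is the mean of $Q$.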

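Quantization If client $i$ belongs to $\mathcal{G}_l$, then it quantizes $A_{ij}$ to $Q_{ij}$ according to

$$Q_{ij} \triangleq \begin{cases} -\dfrac{K}{\sqrt{d}}, & \text{with probability } \dfrac{K/\sqrt{d} - A_{ij}}{2K/\sqrt{d}}, \text{ if } j \in \mathcal{I}_l, \\[2ex] \dfrac{K}{\sqrt{d}}, & \text{with probability } \dfrac{A_{ij} + K/\sqrt{d}}{2K/\sqrt{d}}, \text{ if } j \in \mathcal{I}_l, \\[2ex] 0, & \text{else.} \end{cases} \tag{12}$$

Conditioned on $A_i$, $\{Q_{ij} \mid j \in \mathcal{I}_l\}$ yields an unbiased estimator of $\{A_{ij} \mid j \in \mathcal{I}_l\}$ and can be described by $|\mathcal{I}_l| = b^*$ bits.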


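Privatization Client $i$ then perturbs the $b^*$-bit message $\{Q_{ij} \mid j \in \mathcal{I}_l\}$ into $\{\tilde{Q}_{ij} \mid j \in \mathcal{I}_l\}$ via $2^{b^*}$-RR, as described in (9). Similarly,

$$\left(\frac{e^{\varepsilon} + 2^{b^*} - 1}{e^{\varepsilon} - 1}\right) \left\{\tilde{Q}_{ij} \mid j \in \mathcal{I}_l\right\}$$

yields an unbiased estimator of $\{A_{ij} \mid j \in \mathcal{I}_l\}$.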

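Estimation and the $\ell_2$ error For all $j \in \mathcal{I}_l$, $\hat{A}_{ij} \triangleq \left(\frac{e^{\varepsilon} + 2^{b^*} - 1}{e^{\varepsilon} - 1}\right) \tilde{Q}_{ij}$ yields an unbiased estimator of $\mathbb{E}_Q[A_{ij}]$, and note that $\tilde{Q}_{ij} \in \left[-\frac{K}{\sqrt{d}}, \frac{K}{\sqrt{d}}\right]$, so the variance of $\hat{A}_{ij}$ is controlled by

$$\mathbb{E}_Q\left[\left(\hat{A}_{ij} - \theta(Q)(j)\right)^2\right] \le \left(\frac{e^{\varepsilon} + 2^{b^*} - 1}{e^{\varepsilon} - 1}\right)^2 \left(\frac{2K}{\sqrt{d}}\right)^2 = O\left(\frac{1}{d \min(1, \varepsilon^2)}\right).$$

Since for each coordinate $j \in \mathcal{I}_l$ there are $|\mathcal{G}_l|$ clients (samples) that output independent and unbiased estimators $\hat{A}_{ij}$, the estimator

$$\hat{A}_j \triangleq \frac{1}{|\mathcal{G}_l|} \sum_{i \in \mathcal{G}_l} \hat{A}_{ij}$$

has variance

$$O\left(\frac{1}{d\, |\mathcal{G}_l| \min(1, \varepsilon^2)}\right) = O\left(\frac{1}{n \min(b^*, \varepsilon^2)}\right).$$

Therefore, we arrive at

$$\mathbb{E} \sum_{j=1}^N \left(\hat{A}_j - \mathbb{E}_Q[A_j]\right)^2 = O\left(\frac{d}{n \min(b^*, \varepsilon^2)}\right).$$

Write $\hat{\theta} = \sum_{j=1}^N \hat{A}_j u_j$ and note that $\theta(P) = \sum_{j=1}^N \mathbb{E}_Q[A_j]\, u_j$, so by (5) we conclude that

$$\mathbb{E}_P\left[\|\hat{\theta} - \theta(P)\|_2^2\right] = O\left(\frac{d}{n \min(b^*, \varepsilon^2)}\right) = O\left(\frac{d}{n \min(\varepsilon, \varepsilon^2, b)}\right).$$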

E Proof of Theorem 3.1

E.1 Achieving optimal $\ell_1$ and $\ell_2$ error (part (i) of Theorem 3.1)

In this section, we show that Recursive Hadamard Response (RHR) achieves optimal $\ell_1$ and $\ell_2$ estimation error.

Decomposition of Hadamard matrix Let us set $B = d/2^{k-1}$. Since $H_d = H_{2^{k-1}} \otimes H_B$, for any $j \in [B]$ and $m \in [2^{k-1}]$, if $j' = (m-1)B + j$ (and thus $j \equiv j' \pmod{B}$), we must have $(H_d)_{j'} = (H_{2^{k-1}})_m \otimes (H_B)_j$, where $\otimes$ is the Kronecker product. This allows us to decompose the $j'$-th component of $H_d \cdot X_i$ into

$$(H_d)_{j'} \cdot X_i = \left((H_{2^{k-1}})_m \otimes (H_B)_j\right) \cdot X_i = \sum_{l=1}^{2^{k-1}} (H_{2^{k-1}})_{m,l}\, (H_B)_j \cdot X_i^{(l)}, \tag{13}$$

where $X_i^{(l)}$ is the $l$-th block of $X_i$, i.e. $X_i^{(l)} \triangleq X_i[(l-1)B + 1 : lB]$. Therefore, as long as we know $(H_B)_j \cdot X_i^{(l)}$ for $l = 1, \ldots, 2^{k-1}$, we can reconstruct $(H_d)_{j'} \cdot X_i$ for all $j' \equiv j \pmod{B}$.
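The decomposition (13) can be checked numerically for a one-hot input. The Python sketch below assumes the Sylvester construction of the Walsh-Hadamard matrix and uses 0-based indices (so row $j' = mB + j$ rather than $(m-1)B + j$). The helper names are hypothetical.

```python
def hadamard_entry(i, j):
    """Entry (i, j), 0-indexed, of the Sylvester-type Walsh-Hadamard matrix."""
    return -1 if bin(i & j).count("1") % 2 else 1

def block_inner_products(j, x, B, num_blocks):
    """(H_B)_j . X^(l) for every block l, where X is one-hot with symbol x."""
    block, offset = divmod(x, B)
    # Only the block that contains the one-hot entry gives a non-zero value.
    return [hadamard_entry(j, offset) if l == block else 0 for l in range(num_blocks)]

def reconstruct(m, q):
    """Right-hand side of (13): sum over l of (H_{2^{k-1}})_{m,l} * q[l]."""
    return sum(hadamard_entry(m, l) * q_l for l, q_l in enumerate(q))
```

For example, with $d = 16$ and $k = 3$ (so $B = 4$ and four blocks), `reconstruct(m, block_inner_products(j, x, 4, 4))` agrees with `hadamard_entry(m * 4 + j, x)` for every symbol `x`, which is the statement that one $(k-1)$-bit location plus a sign recovers $2^{k-1}$ Hadamard coordinates.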


Encoding mechanism Let $r_i \sim \mathrm{Uniform}(B)$ be generated from the shared randomness, and consider the following quantizer

$$Q(X_i, r_i) = \left((H_B)_{r_i} \cdot X_i^{(l)}\right)_{l = 1, \ldots, 2^{k-1}} \in \{-1, 0, 1\}^{2^{k-1}}.$$

Since $X_i$ is one-hot encoded, there is exactly one non-zero $X_i^{(l)}$, so $Q(X_i, r_i)$ can be described by a $k$-bit string (with $k - 1$ bits indicating the location of the non-zero entry and 1 bit indicating its sign).

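Given $Q(X_i, r_i)$, by (13) we can recover $2^{k-1}$ coordinates of $Y_i = H_d \cdot X_i$:

$$Y_i(r') = (H_d)_{r'} \cdot X_i = \sum_{l=1}^{2^{k-1}} (H_{2^{k-1}})_{m,l}\, (H_B)_{r_i} \cdot X_i^{(l)} = (H_{2^{k-1}})_m \cdot Q(X_i, r_i), \tag{14}$$

for any $r' = (m-1)B + r_i$. Therefore, if we define $\hat{Y}_i(Q(X_i, r_i), r_i)$ coordinate-wise by

$$\hat{Y}_i(Q(X_i, r_i), r_i)(r') \triangleq \begin{cases} \dfrac{1}{2^{k-1}}\, Y_i(r'), & \text{if } r' \equiv r_i \pmod{B}, \\[1.5ex] 0, & \text{else,} \end{cases} \tag{15}$$

then $\mathbb{E}[\hat{Y}_i] = \frac{1}{d} H_d \cdot X_i$, where the expectation is taken with respect to $r_i$.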

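To protect privacy, client $i$ then perturbs $Q(X_i, r_i)$ via the $2^k$-RR scheme, since $Q$ takes values in an alphabet of size $2^k$, denoted by $\mathcal{Q} = \{\pm e_1, \ldots, \pm e_{2^{k-1}}\}$:

$$\tilde{Q}_i = \begin{cases} Q(X_i, r_i), & \text{w.p. } \dfrac{e^{\varepsilon}}{e^{\varepsilon} + 2^k - 1}, \\[2ex] Q' \in \mathcal{Q} \setminus \{Q(X_i, r_i)\}, & \text{w.p. } \dfrac{1}{e^{\varepsilon} + 2^k - 1}, \end{cases}$$

where $e_l$ denotes the $l$-th coordinate vector in $\mathbb{R}^{2^{k-1}}$.

Client $i$ then sends the $k$-bit report $\tilde{Q}_i$ to the server, and with $\tilde{Q}_i$ the server can compute an estimate of $Q(X_i, r_i)$, since $\mathbb{E}\left[\tilde{Q}_i \,\middle|\, Q(X_i, r_i)\right] = \frac{e^{\varepsilon} - 1}{e^{\varepsilon} + 2^k - 1}\, Q(X_i, r_i)$.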

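Constructing estimator for $D$ For a given $\tilde{Q}_i$, we estimate $Y_i$ by $\hat{Y}_i\left(\frac{e^{\varepsilon} + 2^k - 1}{e^{\varepsilon} - 1} \tilde{Q}_i, r_i\right)$, where $\hat{Y}_i$ is given by (14) and (15), with $Q(X_i, r_i)$ in (14) replaced by $\tilde{Q}_i$.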

Claim E.1 $\hat{Y}_i$ is an unbiased estimator of $Y_i$.

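The final estimator of $D_{X^n} = \frac{1}{n} \sum_i X_i$ is given by

$$\hat{D}\left(\left(\tilde{Q}_i, r_i\right)_{i = 1, \ldots, n}\right) \triangleq \frac{1}{n} \sum_{i=1}^n H_d \cdot \hat{Y}_i\left(\frac{e^{\varepsilon} + 2^k - 1}{e^{\varepsilon} - 1} \tilde{Q}_i, r_i\right). \tag{16}$$

Note that by Claim E.1, $\hat{D}$ is an unbiased estimator of $D_{X^n}$. Finally, picking $k = \min\left(b, \lceil \varepsilon \log_2 e \rceil, \lfloor \log d \rfloor\right)$ yields the following bounds.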

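Claim E.2 The estimator $\hat{D}$ in (16) achieves the optimal $\ell_1$ and $\ell_2$ errors:

$$\mathbb{E}\left[\|\hat{D} - D_{X^n}\|_2^2\right] \preceq \frac{d}{n \min\left\{e^{\varepsilon}, (e^{\varepsilon} - 1)^2, 2^b, d\right\}} \quad \text{and} \quad \mathbb{E}\left[\|\hat{D} - D_{X^n}\|_1\right] \preceq \frac{d}{\sqrt{n \min\left\{e^{\varepsilon}, (e^{\varepsilon} - 1)^2, 2^b, d\right\}}}.$$

This establishes the achievability part of Theorem 3.1.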


E.2 Algorithms

We summarize our proposed RHR scheme below:

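Algorithm 1: Encoding mechanism $\tilde{Q}_i$ (at each client)
Input: client index $i$, observation $X_i$, privacy level $\varepsilon$, alphabet size $d$
Result: Encoded message $(\widetilde{\mathrm{sign}}, \widetilde{\mathrm{loc}})$
Set $D = 2^{\lceil \log d \rceil}$, $k = \min\left(b, \lceil \varepsilon \log_2 e \rceil\right)$, $B = D/2^{k-1}$;
Draw $r_i$ from uniform($B$) using public-coin;
begin
    loc $\leftarrow \lceil X_i / B \rceil$;
    sign $\leftarrow (H_d)_{r_i, X_i}$;
    $(\widetilde{\mathrm{sign}}, \widetilde{\mathrm{loc}}) \leftarrow 2^k\text{-RR}_{\varepsilon}((\mathrm{sign}, \mathrm{loc}))$   /* (sign, loc) as a $k$-bit string */;
end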
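A minimal Python sketch of the client-side encoder is given below. It is an illustration under several assumptions, not the authors' implementation: symbols and indices are 0-based, the Sylvester construction of $H_d$ is assumed, the public coin is simulated by a local random generator, and $k$ is additionally capped by $\log_2 D$ (mirroring the $\lfloor \log d \rfloor$ term used in Section E.1) so that $B \ge 1$. All function names are hypothetical.

```python
import math
import random

def hadamard_entry(i, j):
    """Entry (i, j), 0-indexed, of the Sylvester-type Walsh-Hadamard matrix."""
    return -1 if bin(i & j).count("1") % 2 else 1

def rhr_params(d, b, eps):
    """Return (D, k, B) following the 'Set' line of Algorithm 1."""
    D = 1 << (d - 1).bit_length()           # smallest power of two >= d
    log2_D = D.bit_length() - 1
    # Assumption of this sketch: cap k by log2(D) so that B = D / 2^(k-1) >= 1.
    k = max(1, min(b, math.ceil(eps * math.log2(math.e)), log2_D))
    return D, k, D >> (k - 1)

def quantize_symbol(x, r, B):
    """Noise-free (sign, loc) for symbol x and shared row index r in {0, ..., B-1}."""
    loc = x // B                            # block containing the one-hot entry
    sign = hadamard_entry(r, x % B)         # (H_B)_r applied to that block
    return sign, loc

def rr_perturb(sign, loc, k, eps, rng):
    """2^k-ary randomized response on the k-bit symbol (sign, loc)."""
    size = 1 << k
    s = 2 * loc + (1 if sign > 0 else 0)    # pack (sign, loc) into {0, ..., 2^k - 1}
    if rng.random() >= math.exp(eps) / (math.exp(eps) + size - 1):
        t = rng.randrange(size - 1)         # uniform over the other 2^k - 1 symbols
        s = t if t < s else t + 1
    return (1 if s % 2 else -1), s // 2

def rhr_encode(x, d, b, eps, rng):
    """Return the shared index r and the privatized (sign, loc) for symbol x."""
    D, k, B = rhr_params(d, b, eps)
    r = rng.randrange(B)                    # stands in for the public coin
    sign, loc = quantize_symbol(x, r, B)
    return r, rr_perturb(sign, loc, k, eps, rng)
```

For instance, `rhr_params(16, 3, 2.0)` gives $D = 16$, $k = 3$, $B = 4$, and `quantize_symbol(9, 1, 4)` returns `(-1, 2)`: symbol 9 lies in block 2 at offset 1, and the corresponding Hadamard entry is $-1$.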

Notice that computing any entry of $H_d$ takes $O(\log d)$ Boolean operations, and uniformly sampling a $k$-bit string takes $O(k)$ time. Therefore the computation cost at each client is $O(\log d)$ time. Also note that the encoded message is a $k$-bit binary string, and therefore the communication cost at each client is $k = \min\left(b, \lceil \varepsilon \log_2 e \rceil\right) \le b$.

Once it has received the $k$-bit messages from all clients, the server performs the following operation:

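Algorithm 2: Estimator of $D_{X^n}$ (at the server)
Input: $(\widetilde{\mathrm{sign}}[1:n], \widetilde{\mathrm{loc}}[1:n])$, privacy level $\varepsilon$, alphabet size $d$
Result: $\hat{D}$
Set $D = 2^{\lceil \log d \rceil}$, $k = \min\left(b, \lceil \varepsilon \log_2 e \rceil\right)$, $B = D/2^{k-1}$;
Partition messages into groups $\mathcal{G}_1, \ldots, \mathcal{G}_B$, with message $i$ in $\mathcal{G}_{r_i}$;
forall $j = 1, \ldots, B$ do
    $\mathcal{G}_j^+ \leftarrow \{\widetilde{\mathrm{loc}}(i) \mid i \in \mathcal{G}_j,\ \widetilde{\mathrm{sign}}(i) = +1\}$;
    $\mathcal{G}_j^- \leftarrow \{\widetilde{\mathrm{loc}}(i) \mid i \in \mathcal{G}_j,\ \widetilde{\mathrm{sign}}(i) = -1\}$;
    $\mathrm{Emp}_j \leftarrow \left(\text{empirical distribution}(\mathcal{G}_j^+) - \text{empirical distribution}(\mathcal{G}_j^-)\right) \cdot \frac{e^{\varepsilon} + 2^k - 1}{e^{\varepsilon} - 1}$;
    forall $l = 0, \ldots, 2^{k-1} - 1$ do
        $E[l \cdot B + j] \leftarrow \mathrm{FWHT}(\mathrm{Emp}_j)[l]$   /* fast Walsh-Hadamard transform */
    end
end
$\hat{D} \leftarrow \frac{1}{d} \cdot \mathrm{FWHT}(E)$;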

Partitioning $n$ samples into $B$ groups and computing the empirical distribution of each group takes $O(n)$ time, and the fast Walsh-Hadamard transform can be implemented in $O(d \log d)$ time. Hence the decoding complexity is $O(n + d \log d)$.
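For concreteness, the $O(d \log d)$ transform referred to as FWHT above can be sketched as the standard in-place butterfly below. This is a generic textbook routine in natural (Sylvester) ordering, offered as an assumption about what FWHT denotes rather than as the authors' code.

```python
def fwht(v):
    """Unnormalized fast Walsh-Hadamard transform of a list whose length is a power of two."""
    a = list(v)
    n = len(a)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a positive power of two")
    h = 1
    while h < n:
        # Combine pairs of entries that are h apart, within blocks of size 2h.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                x, y = a[i], a[i + h]
                a[i], a[i + h] = x + y, x - y
        h *= 2
    return a
```

Applying `fwht` twice returns the input scaled by its length, which is why the last step of the decoder divides by $d$.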

E.3 Lower Bound on $\ell_1$ and $\ell_2$ errors in Theorem 3.1

We can bound the error by considering the worst-case Bayesian setting, i.e. by imposing a prior distribution $p$ on $X_1, \ldots, X_n$ and applying the converse part of Theorem 3.2 in Section 3.2.


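Let $X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} p$. Then for any $\hat{D}(X^n)$, we must have

$$\begin{aligned} \max_{X^n \sim p} \mathbb{E}\left[\|\hat{D} - D_{X^n}\|_2^2\right] &\overset{(a)}{\ge} \max_p \mathbb{E}\left[\left(\|\hat{D} - p\|_2 - \|D_{X^n} - p\|_2\right)^2\right] \\ &\ge \max_p \left(\mathbb{E}\left[\|\hat{D} - p\|_2^2\right] - 2\,\mathbb{E}\left[\|\hat{D} - p\|_2\, \|D_{X^n} - p\|_2\right]\right) \\ &\overset{(b)}{\ge} \max_p \left(\mathbb{E}\left[\|\hat{D} - p\|_2^2\right] - 2\sqrt{\mathbb{E}\left[\|\hat{D} - p\|_2^2\right] \mathbb{E}\left[\|D_{X^n} - p\|_2^2\right]}\right), \end{aligned} \tag{17}$$

where (a) and (b) follow from the triangle inequality and the Cauchy-Schwarz inequality respectively. By Theorem 3.2, there exists a worst-case $p^*$ such that

$$c\, \frac{d}{n}\, \frac{1}{\min\left\{e^{\varepsilon}, (e^{\varepsilon} - 1)^2, 2^b\right\}} \le \mathbb{E}\left[\|\hat{D} - p^*\|_2^2\right] \le C\, \frac{d}{n}\, \frac{1}{\min\left\{e^{\varepsilon}, (e^{\varepsilon} - 1)^2, 2^b\right\}}, \tag{18}$$

for some constants $c$ and $C$. On the other hand, the $\ell_2$ convergence of $D(X^n)$ to $p$ is $O(1/n)$ for any $p$, which gives us

$$\mathbb{E}\left[\|D_{X^n} - p^*\|_2^2\right] \le c'\, \frac{1}{n}. \tag{19}$$

Plugging (18) and (19) back into (17) yields

$$\max_{X^n \sim p} \mathbb{E}\left[\|\hat{D} - D_{X^n}\|_2^2\right] \ge C_1\, \frac{d}{n}\, \frac{1}{\min\left\{e^{\varepsilon}, (e^{\varepsilon} - 1)^2, 2^b\right\}} - C_2\, \frac{1}{n} \sqrt{\frac{d}{\min\left\{e^{\varepsilon}, (e^{\varepsilon} - 1)^2, 2^b\right\}}}.$$

Thus as long as $\min\left(e^{\varepsilon}, (e^{\varepsilon} - 1)^2, 2^b\right) = o(d)$, the first term dominates and the desired $\ell_2$ lower bound follows.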

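For the case of $\ell_1$, we similarly have

$$\max_{X^n \sim p} \mathbb{E}\left[\|\hat{D} - D_{X^n}\|_1\right] \ge \max_p \left(\mathbb{E}\left[\|\hat{D} - p\|_1\right] - \mathbb{E}\left[\|D_{X^n} - p\|_1\right]\right). \tag{20}$$

It is well-known that $\mathbb{E}\left[\|D(X^n) - p\|_1\right] \le \sqrt{d/n}$ (for instance, see [27]), and by the converse part of Theorem 3.2

$$\max_p \mathbb{E}\left[\|\hat{D} - p\|_1\right] \ge \sqrt{\frac{d^2}{n \min\left\{e^{\varepsilon}, (e^{\varepsilon} - 1)^2, 2^b\right\}}}.$$

Plugging this into (20) yields the $\ell_1$ lower bound.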

E.4 Achieving optimal $\ell_\infty$ error (part (ii) of Theorem 3.1)

To obtain an upper bound on the $\ell_\infty$ error, we extend the TreeHist protocol in [11], a 1-bit LDP heavy hitter estimation mechanism, to communicate $b$ bits and satisfy a desired privacy level $\varepsilon$. A simpler version of the TreeHist protocol, which is not optimized for computational complexity, is as follows: we first perform the Hadamard transform on $X_i$, and sample one random coordinate with public randomness $r_i$. The 1-bit message is then passed through a binary $\varepsilon$-LDP mechanism. We can show that from the perturbed outcomes, the server can construct an unbiased estimator of $X_i$ with bounded sub-Gaussian norm, and the $\ell_\infty$ error will be $O\left(\sqrt{\log d / (n \varepsilon^2)}\right)$.

To extend this scheme to an arbitrary privacy regime and an arbitrary communication budget of $b$ bits, we independently and uniformly sample the Hadamard transform of $X_i$ for $k = \min(b, \lceil \varepsilon \rceil)$ times. Each 1-bit sample is then perturbed via an $\varepsilon'$-LDP mechanism with $\varepsilon' \triangleq \varepsilon/k$.


Note that under the distribution-free setting, the randomness comes only from the sampling and the privatization steps, so we can view each re-sampled and perturbed message as generated from a fresh copy of $X_i$, since $X_i$ is not random. Equivalently, this boils down to a frequency estimation problem with $n' = nk$ clients under $\varepsilon' = \varepsilon/k$, and gives us the $\ell_\infty$ error

$$O\left(\sqrt{\frac{\log d}{n' (\varepsilon')^2}}\right) = O\left(\sqrt{\frac{\log d}{n \min(\varepsilon^2, \varepsilon, b)}}\right).$$

Below we describe the details.

Encoding mechanism Set $k = \min(b, \lceil \varepsilon \rceil)$. For each $X_i$, we randomly sample $(H_d)_{X_i}$ (i.e. the $X_i$-th column of $H_d$) $k$ times, identically and independently, using the shared randomness. Let $r_i^{(1)}, \ldots, r_i^{(k)}$ be the sampled coordinates, which are known to both the server and node $i$, and let $(H_d)_{X_i, r_i^{(\ell)}}$ be the sampling outcomes. Then due to the orthogonality of $H_d$, for all $j \in [d]$, $\ell \in [k]$,

$$\mathbb{E}\left[(H_d)_{j, r_i^{(\ell)}} \cdot (H_d)_{X_i, r_i^{(\ell)}}\right] = \begin{cases} 1, & \text{if } j = X_i, \\ 0, & \text{if } j \ne X_i, \end{cases} \tag{21}$$

where the expectation is taken over $r_i^{(\ell)}$.

We then pass $\left\{(H_d)_{X_i, r_i^{(\ell)}} \,\middle|\, \ell = 1, \ldots, k\right\}$ through $k$ binary $\varepsilon'$-LDP channels sequentially, with $\varepsilon' \triangleq \varepsilon/k$. By the composition theorem of differential privacy, the privatized outcomes, denoted by $\left\{\widetilde{(H_d)}_{X_i, r_i^{(\ell)}}\right\}$, satisfy $\varepsilon$-LDP.
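The per-sample channel can be illustrated with binary randomized response, a standard choice of binary $\varepsilon'$-LDP mechanism. The source does not name the specific channel, so this choice, together with the function names below, is an assumption of the sketch. The expectation is computed exactly from the two output probabilities.

```python
import math

def per_sample_budget(eps, b):
    """Number of 1-bit samples k = min(b, ceil(eps)) and the per-sample budget eps / k."""
    k = min(b, math.ceil(eps))
    return k, eps / k

def binary_rr_law(bit, eps):
    """Output law of binary randomized response on bit in {-1, +1}."""
    keep = math.exp(eps) / (math.exp(eps) + 1)
    return {bit: keep, -bit: 1 - keep}

def debiased_expectation(bit, eps):
    """Exact mean of the rescaled privatized bit; equals the input bit."""
    scale = (math.exp(eps) + 1) / (math.exp(eps) - 1)
    return scale * sum(value * prob for value, prob in binary_rr_law(bit, eps).items())
```

For example, `per_sample_budget(3.0, 2)` returns `(2, 1.5)`, so two samples are sent, each through a 1.5-LDP channel, and the budgets add up to $\varepsilon = 3$ under sequential composition.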

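Estimation of $D_{X^n}$ Observe that

$$\mathbb{E}\left[\left(\frac{e^{\varepsilon'} + 1}{e^{\varepsilon'} - 1}\right) \widetilde{(H_d)}_{X_i, r_i^{(\ell)}} \,\middle|\, (H_d)_{X_i, r_i^{(\ell)}}\right] = (H_d)_{X_i, r_i^{(\ell)}},$$

where the expectation is with respect to the privatization. Therefore, by (21),

$$\hat{X}_i^{(\ell)}(j) \triangleq \left(\frac{e^{\varepsilon'} + 1}{e^{\varepsilon'} - 1}\right) (H_d)_{j, r_i^{(\ell)}}\, \widetilde{(H_d)}_{X_i, r_i^{(\ell)}}$$

defines an unbiased estimator of $X_i(j)$. Moreover,

$$\left|\hat{X}_i^{(\ell)}(j) - X_i(j)\right| \le \left(\frac{e^{\varepsilon'} + 1}{e^{\varepsilon'} - 1} + 1\right) \quad \text{a.s.},$$

so $\hat{X}_i^{(\ell)}(j)$ has sub-Gaussian norm bounded by

$$\sigma \le 2\, \frac{e^{\varepsilon'} + 1}{e^{\varepsilon'} - 1}. \tag{22}$$

Finally, we estimate $D_{X^n}(j)$ by

$$\hat{D}(j) = \frac{1}{nk} \sum_{i=1}^n \sum_{\ell=1}^k \hat{X}_i^{(\ell)}(j).$$

Observe that

$$\hat{D}(j) - D_{X^n}(j) = \frac{1}{nk} \sum_{i=1}^n \sum_{\ell=1}^k \left(\hat{X}_i^{(\ell)}(j) - X_i(j)\right) \tag{23}$$

has sub-Gaussian norm bounded by $\sigma/\sqrt{nk}$, where $\sigma$ is given by (22).

To bound the $\ell_\infty$ norm, we apply the maximal bound (see, for instance, [45, Chapter 2]) for sub-Gaussian random variables (note that for $j \ne j'$, $\hat{D}(j)$ and $\hat{D}(j')$ are not independent):

$$\mathbb{E}\left[\max_{j \in [d]} \left|\hat{D}(j) - D_{X^n}(j)\right|\right] \le 2\sqrt{\frac{\sigma^2 \log d}{nk}} = 4\sqrt{\left(\frac{e^{\varepsilon'} + 1}{e^{\varepsilon'} - 1}\right)^2 \frac{\log d}{nk}} \overset{(a)}{\preceq} \sqrt{\frac{\log d}{n \min(\varepsilon, \varepsilon^2, k)}}, \tag{24}$$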


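where (a) holds since if $\varepsilon = o(1)$, then $k = 1$ and hence

$$\left(\frac{e^{\varepsilon'} + 1}{e^{\varepsilon'} - 1}\right)^2 \asymp \frac{1}{\varepsilon^2};$$

otherwise $\varepsilon = \Omega(1)$ and $\varepsilon' = \Omega(1)$, so

$$\left(\frac{e^{\varepsilon'} + 1}{e^{\varepsilon'} - 1}\right)^2 \asymp 1.$$

Both cases are upper bounded by (24), so the result follows.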
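The two regimes can be read off from an elementary two-sided bound on the debiasing factor. The worked steps below are a sketch added for illustration and are not part of the original argument.

```latex
% For any \varepsilon' > 0, write y = \varepsilon'/2. Then
\[
\frac{e^{\varepsilon'} + 1}{e^{\varepsilon'} - 1} = \coth(y),
\qquad
\max\Big(1, \frac{1}{y}\Big) \le \coth(y) \le 1 + \frac{1}{y} .
\]
% Lower bound: \coth(y) \ge 1 trivially, and \coth(y) \ge 1/y because \tanh(y) \le y.
% Upper bound: \coth(y) - 1 = \frac{2}{e^{2y} - 1} \le \frac{2}{2y} = \frac{1}{y}, using e^{2y} - 1 \ge 2y.
% Substituting y = \varepsilon'/2:
\[
\max\Big(1, \frac{2}{\varepsilon'}\Big)
  \le \frac{e^{\varepsilon'} + 1}{e^{\varepsilon'} - 1}
  \le 1 + \frac{2}{\varepsilon'} ,
\]
% so the squared factor is of order 1/(\varepsilon')^2 when \varepsilon' \le 1
% and of order 1 when \varepsilon' \ge 1.
```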

Remark E.1 Notice that in the high-privacy regime $\varepsilon = o(1)$, the upper bound matches the lower bound in [12]. For general privacy regimes with limited communication, however, we do not know whether the upper bound is tight. This remains an open question.

F Proof of Theorem 3.2

The construction of the distribution estimation scheme mainly follows Section E.1, except that we replace the random sampling step by a deterministic grouping idea. We will use the same notation as in Section E.1.

Encoding mechanism We group the $n$ samples into $B$ equal-sized groups, each with $n' = n/B$ samples. For a sample $X_i \in \mathcal{G}_j$, we quantize it to a $2^{k-1}$-dimensional $\{1, 0, -1\}$ vector:

$$Q_j(X_i) = \begin{pmatrix} (H_B)_j \cdot X_i^{(1)} \\ (H_B)_j \cdot X_i^{(2)} \\ \vdots \\ (H_B)_j \cdot X_i^{(2^{k-1})} \end{pmatrix} \in \{-1, 0, 1\}^{2^{k-1}}.$$

Since $X_i$ is one-hot encoded, there is only one $l \in \{1, \ldots, 2^{k-1}\}$ such that $(H_B)_j \cdot X_i^{(l)} \ne 0$, so $Q_j(X_i)$ can be described by $k$ bits (1 bit for the sign and $(k-1)$ bits for the location of the non-zero element). Also notice that

$$\mathbb{E}[Q_j(X_i)] = \begin{pmatrix} (H_B)_j \cdot p^{(1)} \\ (H_B)_j \cdot p^{(2)} \\ \vdots \\ (H_B)_j \cdot p^{(2^{k-1})} \end{pmatrix},$$

where $p^{(l)} \triangleq p[(l-1)B + 1 : lB]$. By (13), the estimator $\hat{q}_{j'} = \langle (H_{2^{k-1}})_m, Q_j(X_i) \rangle$ is unbiased for $q_{j'}$ (where $j' = (m-1)B + j$).

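We further perturb $Q_j$ via the $2^k$-RR scheme, since $Q$ takes values in an alphabet of size $2^k$, denoted by $\mathcal{Q} = \{\pm e_1, \ldots, \pm e_{2^{k-1}}\}$:

$$\tilde{Q}_j = \begin{cases} Q_j, & \text{w.p. } \dfrac{e^{\varepsilon}}{e^{\varepsilon} + 2^k - 1}, \\[2ex] Q' \in \mathcal{Q} \setminus \{Q_j\}, & \text{w.p. } \dfrac{1}{e^{\varepsilon} + 2^k - 1}, \end{cases}$$

where $e_l$ denotes the $l$-th coordinate vector in $\mathbb{R}^{2^{k-1}}$. This gives us

$$\mathbb{E}\left[\tilde{Q}_j\right] = \frac{e^{\varepsilon} - 1}{e^{\varepsilon} + 2^k - 1}\, \mathbb{E}[Q_j].$$

Therefore $\frac{e^{\varepsilon} + 2^k - 1}{e^{\varepsilon} - 1} \tilde{Q}_j$ yields an unbiased estimator of

$$\begin{pmatrix} (H_B)_j \cdot p^{(1)} \\ (H_B)_j \cdot p^{(2)} \\ \vdots \\ (H_B)_j \cdot p^{(2^{k-1})} \end{pmatrix}.$$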


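Constructing the estimator for $p$ For each $j' \equiv j \pmod{B}$, we estimate $(H_{2^{k-1}})_m \cdot Q_j(X_i)$, $i \in \mathcal{G}_j$ (recall that $j' = j + (m-1)B$). Define the estimator

$$\hat{q}_{j'}(X_i, i \in \mathcal{G}_j) = \frac{1}{|\mathcal{G}_j|} \sum_{i \in \mathcal{G}_j} (H_{2^{k-1}})_m \cdot \left(\frac{e^{\varepsilon} + 2^k - 1}{e^{\varepsilon} - 1}\right) \tilde{Q}_j(X_i) = \frac{B}{n} \left(\frac{e^{\varepsilon} + 2^k - 1}{e^{\varepsilon} - 1}\right) \sum_{i \in \mathcal{G}_j} (H_{2^{k-1}})_m \cdot \tilde{Q}_j(X_i).$$

The MSE of $\hat{q}_{j'}$ can be obtained by

$$\mathbb{E}\left[(\hat{q}_{j'} - q_{j'})^2\right] \overset{(a)}{=} \mathrm{Var}(\hat{q}_{j'}) \overset{(b)}{=} \frac{d}{n 2^{k-1}} \left(\frac{e^{\varepsilon} + 2^k - 1}{e^{\varepsilon} - 1}\right)^2 \mathrm{Var}\left((H_{2^{k-1}})_m \cdot \tilde{Q}_j(X_i)\right) \overset{(c)}{\le} \frac{d}{n 2^{k-1}} \left(\frac{e^{\varepsilon} + 2^k - 1}{e^{\varepsilon} - 1}\right)^2, \tag{25}$$

where (a) is due to the unbiasedness of $\hat{q}_{j'}$, (b) is due to the independence across $X_i$, and (c) is because $\langle (H_{2^{k-1}})_m, \tilde{Q}_j \rangle$ only takes values in $\{-1, 1\}$.

Finally, let $\hat{p}$ be the inverse Hadamard transform of $\hat{q}$. The MSE is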

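$$\begin{aligned} \mathbb{E}\|\hat{p} - p\|_2^2 &= \mathbb{E}\left[\langle \hat{p} - p, \hat{p} - p \rangle\right] \\ &= \mathbb{E}\left[(\hat{q} - q)^{\top} \left(H_d^{-1}\right)^{\top} H_d^{-1} (\hat{q} - q)\right] \\ &= \frac{1}{d}\, \mathbb{E}\|\hat{q} - q\|_2^2 \\ &\le \frac{d}{n 2^{k-1}} \left(\frac{e^{\varepsilon} + 2^k - 1}{e^{\varepsilon} - 1}\right)^2 = O\left(\frac{d}{n 2^k} \left(\frac{e^{\varepsilon} + 2^k}{e^{\varepsilon} - 1}\right)^2\right), \end{aligned}$$

where the last inequality holds due to (25), applied to each of the $d$ coordinates of $\hat{q}$.

Picking $k = \min\left(b, \lceil \varepsilon \log_2 e \rceil, \lfloor \log d \rfloor\right)$ yields

$$\mathbb{E}\|\hat{p} - p\|_2^2 = O\left(\frac{d}{n \min(2^b, e^{\varepsilon}, d)} \left(\frac{e^{\varepsilon}}{e^{\varepsilon} - 1}\right)^2\right).$$

Observe that if $e^{\varepsilon} = O(2^b)$, then $e^{\varepsilon} \preceq 2^b$, so $\mathbb{E}\|\hat{p} - p\|_2^2 = O\left(\frac{d e^{\varepsilon}}{n (e^{\varepsilon} - 1)^2}\right)$. On the other hand, if $e^{\varepsilon} = \Omega(2^b)$, then $\frac{e^{\varepsilon}}{e^{\varepsilon} - 1} = \Theta(1)$, and $\mathbb{E}\|\hat{p} - p\|_2^2 = O\left(\frac{d}{n \min(2^b, d)}\right)$.

Therefore we conclude that

$$\mathbb{E}\|\hat{p} - p\|_2^2 \preceq \max\left(\frac{d}{n \min(2^b, d)}, \frac{d e^{\varepsilon}}{n (e^{\varepsilon} - 1)^2}\right) \asymp \frac{d}{n}\, \frac{1}{\min\left\{e^{\varepsilon}, (e^{\varepsilon} - 1)^2, 2^b, d\right\}}.$$

Finally, by Jensen’s inequality and the Cauchy-Schwarz inequality, we also have

$$\mathbb{E}\left[\|\hat{p} - p\|_1\right] \le \left(\mathbb{E}\left[\|\hat{p} - p\|_1^2\right]\right)^{\frac{1}{2}} \le \left(d \cdot \mathbb{E}\|\hat{p} - p\|_2^2\right)^{\frac{1}{2}} \preceq \frac{d}{\sqrt{n \min\left\{e^{\varepsilon}, (e^{\varepsilon} - 1)^2, 2^b, d\right\}}},$$

establishing the achievability part of Theorem 3.2.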


F.1 Algorithms and analysis

Each client runs the following algorithm:

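Algorithm 3: Encoding mechanism (at each client)
Input: client index $i$, observation $X_i$, privacy level $\varepsilon$, alphabet size $d$
Result: Encoded message $(\widetilde{\mathrm{sign}}, \widetilde{\mathrm{loc}})$
Set $D = 2^{\lceil \log d \rceil}$. Set $k = \min\left(b, \lceil \varepsilon \log_2 e \rceil\right)$, $B = D/2^{k-1}$;
begin
    $j \leftarrow i \bmod B$   /* assign user $i$ to group $j$ */;
    loc $\leftarrow \lceil X_i / B \rceil$;
    sign $\leftarrow (H_d)_{j, X_i}$;
    $(\widetilde{\mathrm{sign}}, \widetilde{\mathrm{loc}}) \leftarrow 2^k\text{-RR}_{\varepsilon}((\mathrm{sign}, \mathrm{loc}))$;
end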

As in Algorithm 1, the computation cost at each client is $O(\log d)$. Also note that the encoded message is a $k$-bit binary string, and therefore the communication cost at each client is $k = \min\left(b, \lceil \varepsilon \log_2 e \rceil\right) \le b$.

Upon receiving the privatized $k$-bit messages from the clients, the server runs the following algorithm:

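Algorithm 4: Estimation of $p$ (at the server)
Input: $(\widetilde{\mathrm{sign}}[1:n], \widetilde{\mathrm{loc}}[1:n])$, privacy level $\varepsilon$, alphabet size $d$
Result: $\hat{p}$
Set $D = 2^{\lceil \log d \rceil}$, $k = \min\left(b, \lceil \varepsilon \log_2 e \rceil\right)$, $B = D/2^{k-1}$;
Partition messages into groups $\mathcal{G}_1, \ldots, \mathcal{G}_B$, with message $i$ in $\mathcal{G}_j$ if $i \equiv j \pmod{B}$;
forall $j = 1, \ldots, B$ do
    $\mathcal{G}_j^+ \leftarrow \{\widetilde{\mathrm{loc}}(i) \mid i \in \mathcal{G}_j,\ \widetilde{\mathrm{sign}}(i) = +1\}$;
    $\mathcal{G}_j^- \leftarrow \{\widetilde{\mathrm{loc}}(i) \mid i \in \mathcal{G}_j,\ \widetilde{\mathrm{sign}}(i) = -1\}$;
    $D_j \leftarrow \left(\text{empirical distribution}(\mathcal{G}_j^+) - \text{empirical distribution}(\mathcal{G}_j^-)\right) \cdot \frac{e^{\varepsilon} + 2^k - 1}{e^{\varepsilon} - 1}$;
    forall $l = 0, \ldots, 2^{k-1} - 1$ do
        $\hat{q}[l \cdot B + j] \leftarrow \mathrm{FWHT}(D_j)[l]$;
    end
end
$\hat{p} \leftarrow \frac{1}{d} \cdot \mathrm{FWHT}(\hat{q})$;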

Partitioning $n$ samples into $B$ groups and computing the empirical distribution of each group takes $O(n)$ time, and the fast Walsh-Hadamard transform can be performed in $O(d \log d)$ time. Hence the decoding complexity is $O(n + d \log d)$.


G Proofs for Section 4

We start by proving Lemma 4.1. Without access to public randomness, [3] shows that at least $\Theta(d)$ bits of communication are required for heavy hitter estimation in order to obtain a consistent estimator$^6$. We state their result here:

Lemma G.1 ([3], Theorem 4) Let $b \le \log d - 2$. For all private-coin schemes $(Q^n, \hat{D})$ with only private randomness and a communication budget of $b$ bits, there exists a data set $X_1, \ldots, X_n$ with $n > 12(2^b + 1)^2$ such that

$$\mathbb{E}\left[\left\|\hat{D}(Q^n) - D_{X^n}\right\|_\infty\right] \ge \frac{1}{2^{b+2} + 4}.$$

Based on this, we claim that without a public coin, each client needs to transmit at least $\Theta(\log d)$ bits in order to construct consistent schemes for frequency estimation or mean estimation.

G.1 Proof of Lemma 4.1

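Frequency estimation We lower bound the $\ell_1$ and $\ell_2$ errors by the $\ell_\infty$ error and apply Lemma G.1:

$$\mathbb{E}\left[\left\|\hat{D}(Q^n) - D_{X^n}\right\|_1\right] \ge \mathbb{E}\left[\left\|\hat{D}(Q^n) - D_{X^n}\right\|_\infty\right] \ge \frac{1}{2^{b+2} + 4},$$

and

$$\mathbb{E}\left[\left\|\hat{D}(Q^n) - D_{X^n}\right\|_2^2\right] \ge \mathbb{E}\left[\left\|\hat{D}(Q^n) - D_{X^n}\right\|_\infty^2\right] \ge \left(\mathbb{E}\left[\left\|\hat{D}(Q^n) - D_{X^n}\right\|_\infty\right]\right)^2 \ge \left(\frac{1}{2^{b+2} + 4}\right)^2. \tag{26}$$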

This implies that it is impossible to construct consistent schemes with less than $\log d - 2$ bits per client in the absence of public randomness. On the other hand, given $\log d$ bits, one can readily achieve the optimal estimation accuracy without any public randomness, for instance, by using Hadamard response [4] (see also the discussion in [3]). Therefore, the problem of frequency estimation is somewhat trivialized in the absence of public randomness.

Mean estimation Let $X_i \in [d]$ be one-hot encoded, so $X_i \in B_d(0, 1)$. Then (26) implies that the $\ell_2$ error of mean estimation is at least $1/\left(2^{b+2} + 4\right)^2$. Thus with less than $\log d - 2$ bits of communication budget, it is also impossible to construct a consistent scheme for mean estimation.

G.2 Proof of Corollary 4.1 and Corollary 4.1

Notice that since one can always “simulate” the public coin by uplink communication (i.e. each client generates its private random bits and sends them to the server), any $b$-bit public-coin scheme can be cast into a private-coin scheme with an additional $b$ bits of communication. This implies that the above impossibility result (Lemma 4.1) also serves as a valid lower bound on the amount of public randomness: for any public-coin scheme with a communication budget of $b < \log d - 2$ bits, we need at least $\log d - b - 2$ bits of shared randomness in order to obtain a consistent estimate of the empirical mean or empirical frequency.

6 Recall that an estimator is consistent if it has vanishing estimation error as $n$ tends to infinity.


H Proof of Claims

H.1 Proof of Claim C.1

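Proof. According to (5), it suffices to control $\mathrm{Var}(\hat{a}_j)$. To bound the variance, consider

$$\begin{aligned} \mathrm{Var}(\hat{a}_j) &= \frac{N^2}{k^2} \cdot \left(\frac{e^{\varepsilon} + 2^k - 1}{e^{\varepsilon} - 1}\right)^2 \mathrm{Var}\left(\sum_{m=1}^k \tilde{q}_m \cdot \mathbb{1}_{\{j = s_m\}}\right) \\ &\le \frac{N^2}{k^2} \cdot \left(\frac{e^{\varepsilon} + 2^k - 1}{e^{\varepsilon} - 1}\right)^2 \mathbb{E}\left[\left(\sum_{m=1}^k \tilde{q}_m \cdot \mathbb{1}_{\{j = s_m\}}\right)^2\right] \\ &\overset{(a)}{\le} \frac{N^2}{k^2} \cdot \left(\frac{e^{\varepsilon} + 2^k - 1}{e^{\varepsilon} - 1}\right)^2 \left(\frac{c}{\sqrt{d}}\right)^2 \mathbb{E}\left[\left(\sum_{m=1}^k \mathbb{1}_{\{j = s_m\}}\right)^2\right] \\ &\overset{(b)}{\le} C\, \frac{N}{k^2} \cdot \left(\frac{e^{\varepsilon} + 2^k - 1}{e^{\varepsilon} - 1}\right)^2 \left(\frac{k^2}{N^2} + \frac{k}{N}\right) = C \left(\frac{e^{\varepsilon} + 2^k - 1}{e^{\varepsilon} - 1}\right)^2 \left(\frac{1}{N} + \frac{1}{k}\right), \end{aligned}$$

where (a) is due to $|\tilde{q}_m| = \frac{c}{\sqrt{d}}$, and (b) is due to the second moment bound on $\mathrm{Binomial}(k, 1/N)$ and the fact that $N = \Theta(d)$. Therefore by (5),

$$\mathbb{E}\left[\|\hat{X} - X\|_2^2\right] \le C_0 \sum_{j=1}^N \mathrm{Var}(\hat{a}_j) \le C_1 \left(\frac{e^{\varepsilon} + 2^k - 1}{e^{\varepsilon} - 1}\right)^2 \frac{d}{k},$$

establishing the claim.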
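For completeness, the second moment bound used in step (b) can be written out as follows. The symbol $S$ is introduced here only for illustration.

```latex
% Let S = \sum_{m=1}^k \mathbb{1}_{\{j = s_m\}}. Since s_1, ..., s_k are i.i.d. uniform on [N],
% S \sim \mathrm{Binomial}(k, 1/N), and therefore
\[
\mathbb{E}[S^2]
  = \mathrm{Var}(S) + \big(\mathbb{E}[S]\big)^2
  = \frac{k}{N}\Big(1 - \frac{1}{N}\Big) + \frac{k^2}{N^2}
  \le \frac{k}{N} + \frac{k^2}{N^2} .
\]
% Multiplying by N^2 / k^2 gives N/k + 1, and summing the resulting bound over the N coordinates
% (each carrying the factor c^2 / d with N = \Theta(d)) gives a total of order 1 + N/k = O(d/k).
```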

H.2 Proof of Claim E.1

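Proof. $\hat{Y}_i$ yields an unbiased estimator since

$$\begin{aligned} \mathbb{E}\left[\hat{Y}_i\left(\frac{e^{\varepsilon} + 2^k - 1}{e^{\varepsilon} - 1} \tilde{Q}_i, r_i\right)\right] &= \mathbb{E}\left[\mathbb{E}\left[\hat{Y}_i\left(\frac{e^{\varepsilon} + 2^k - 1}{e^{\varepsilon} - 1} \tilde{Q}_i, r_i\right) \,\middle|\, r_i\right]\right] \\ &\overset{(a)}{=} \mathbb{E}\left[\hat{Y}_i\left(\mathbb{E}\left[\frac{e^{\varepsilon} + 2^k - 1}{e^{\varepsilon} - 1} \tilde{Q}_i \,\middle|\, r_i\right], r_i\right)\right] \\ &= \mathbb{E}\left[\hat{Y}_i\left(Q(X_i, r_i), r_i\right)\right] = \frac{1}{d} H_d X_i, \end{aligned} \tag{27}$$

where (a) holds since, conditioned on $r_i$, $\hat{Y}_i(Q, r_i)$ is a linear function of $Q$.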

H.3 Proof of Claim E.2

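Proof. The $\ell_2$ error is

$$\mathbb{E}\left[\|\hat{D} - D_{X^n}\|_2^2\right] = \frac{1}{n^2} \sum_{i=1}^n \mathbb{E}\left[\left\|H_d \hat{Y}_i - H_d\, \mathbb{E}[\hat{Y}_i]\right\|_2^2\right] = \frac{d}{n^2} \sum_{i=1}^n \mathbb{E}\left[\left\|\hat{Y}_i - \mathbb{E}[\hat{Y}_i]\right\|_2^2\right]. \tag{28}$$

It remains to bound $\mathbb{E}\left[\|\hat{Y}_i - \mathbb{E}[\hat{Y}_i]\|_2^2\right]$. Observe that

$$\left|\mathbb{E}[\hat{Y}_i]\right| = \left|\frac{H_d \cdot X_i}{d}\right| = [1/d, \ldots, 1/d]^{\top},$$

and from expression (15), given $r_i$, there are only $2^{k-1}$ non-zero coordinates, each with value bounded by $\left(\frac{e^{\varepsilon} + 2^k - 1}{e^{\varepsilon} - 1}\right) / 2^{k-1}$. Therefore we have

$$\mathbb{E}\left[\left\|\hat{Y}_i - \mathbb{E}[\hat{Y}_i]\right\|_2^2\right] = \mathbb{E}\left[\mathbb{E}\left[\left\|\hat{Y}_i - \mathbb{E}[\hat{Y}_i]\right\|_2^2 \,\middle|\, r_i\right]\right] \le 2\left(d \left(\frac{1}{d}\right)^2 + 2^{k-1} \left(\frac{e^{\varepsilon} + 2^k - 1}{2^{k-1} (e^{\varepsilon} - 1)}\right)^2\right).$$

Plugging this into (28), we arrive at

$$\mathbb{E}\left[\|\hat{D} - D_{X^n}\|_2^2\right] \preceq \frac{d}{n 2^{k-1}} \left(\frac{e^{\varepsilon} + 2^k - 1}{e^{\varepsilon} - 1}\right)^2.$$

Picking $k = \min\left(b, \lceil \varepsilon \log_2 e \rceil, \lfloor \log d \rfloor\right)$ yields

$$\mathbb{E}\left[\|\hat{D} - D_{X^n}\|_2^2\right] = O\left(\frac{d}{n \min(2^b, e^{\varepsilon}, d)} \left(\frac{e^{\varepsilon}}{e^{\varepsilon} - 1}\right)^2\right).$$

Observe that

(i) if $e^{\varepsilon} = O(2^b)$, then $e^{\varepsilon} \preceq 2^b$, so $\mathbb{E}\left[\|\hat{D} - D_{X^n}\|_2^2\right] = O\left(\frac{d e^{\varepsilon}}{n (e^{\varepsilon} - 1)^2}\right)$.

(ii) if $e^{\varepsilon} = \Omega(2^b)$, then $\frac{e^{\varepsilon}}{e^{\varepsilon} - 1} = \Theta(1)$, and $\mathbb{E}\left[\|\hat{D} - D_{X^n}\|_2^2\right] = O\left(\frac{d}{n \min(2^b, d)}\right)$.

Therefore we conclude that

$$\mathbb{E}\left[\|\hat{D} - D_{X^n}\|_2^2\right] \preceq \max\left(\frac{d}{n \min(2^b, d)}, \frac{d e^{\varepsilon}}{n (e^{\varepsilon} - 1)^2}\right) \asymp \frac{d}{n}\, \frac{1}{\min\left\{e^{\varepsilon}, (e^{\varepsilon} - 1)^2, 2^b, d\right\}}.$$

By Jensen’s inequality and the Cauchy-Schwarz inequality, we also have

$$\mathbb{E}\left[\|\hat{D} - D_{X^n}\|_1\right] \le \left(\mathbb{E}\left[\|\hat{D} - D_{X^n}\|_1^2\right]\right)^{\frac{1}{2}} \le \left(d \cdot \mathbb{E}\|\hat{D} - D_{X^n}\|_2^2\right)^{\frac{1}{2}} \preceq \frac{d}{\sqrt{n \min\left\{e^{\varepsilon}, (e^{\varepsilon} - 1)^2, 2^b, d\right\}}}.$$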
