
Proceedings of Machine Learning Research 81:1–15, 2018. Conference on Fairness, Accountability, and Transparency. © 2018 C. Dwork, N. Immorlica, A.T. Kalai & M. Leiserson.

Decoupled Classifiers for Group-Fair and Efficient Machine Learning

Cynthia Dwork [email protected] Harvard University

Nicole Immorlica [email protected] Microsoft Research New England

Adam Tauman Kalai [email protected] Microsoft Research New England

Max Leiserson [email protected] University of Maryland

Editors: Sorelle A. Friedler and Christo Wilson

Abstract

When it is ethical and legal to use a sensitive attribute (such as gender or race) in machine learning systems, the question remains how to do so. We show that the naïve application of machine learning algorithms using sensitive attributes leads to an inherent tradeoff in accuracy between groups. We provide a simple and efficient decoupling technique, which can be added on top of any black-box machine learning algorithm, to learn different classifiers for different groups. Transfer learning is used to mitigate the problem of having too little data on any one group.

1. Introduction

As algorithms are increasingly used to make decisions of social consequence, the social values encoded in these decision-making procedures are the subject of increasing study, with fairness being a chief concern (Pedreschi et al., 2008; Zliobaite et al., 2011; Kamishima et al., 2011; Dwork et al., 2011; Friedler et al., 2016; Angwin et al., 2016; Chouldechova, 2017; Kleinberg et al., 2016; Hardt et al., 2016; Joseph et al., 2016; Kusner et al., 2017; Berk, 2009). Classification and regression algorithms are one particular locus of fairness concerns. Classifiers map individuals to outcomes: applicants to accept/reject/waitlist; adults to credit scores; web users to advertisements; felons to estimated recidivism risk.

Figure 1: No linear classifier can achieve greater than 50% accuracy on both groups. (Axes: $x_1$ horizontal, $x_2 \in \{1, 2\}$ vertical indicating group; labels + and −.)

Informally, the concern is whether individuals are treated “fairly,” however this is defined. Still speaking informally, there are many sources of unfairness, prominent among these being training the classifier on historically biased data and a paucity of data for under-represented groups leading to poor performance on these groups, which in turn can lead to higher risk for those, such as lenders, making decisions based on classification outcomes.

Should ML systems use sensitive attributes, such as gender or race, if available? The legal and ethical factors behind such a decision vary by time, country, jurisdiction, culture, and downstream application. Still speaking informally, it is known that “ignoring” these attributes does not ensure fairness, both because they may be closely correlated with other features in the data and because they provide context for understanding the rest of the data, permitting a classifier to incorporate information about cultural differences between groups (Dwork et al., 2011). Using sensitive attributes may increase accuracy for all groups and may avoid biases where a classifier favors members of a minority group that meet criteria optimized for a majority group, as illustrated visually in Figure 4 of Section 8.

In this paper, we consider how to use a sensitive attribute such as gender or race to maximize fairness and accuracy, assuming that it is legal and ethical. A data scientist wishing to fit, say, a simple linear classifier, may use the raw data, upweight/oversample data from minority groups, or employ advanced approaches to fitting linear classifiers that aim to be accurate and fair. No matter what he does and what fairness criteria he uses, assuming no linear classifier is perfect, he may be faced with an inherent tradeoff between accuracy on one group and accuracy on another. As an extreme illustrative example, consider the two-group setting illustrated in Figure 1, where feature $x_1$ perfectly predicts the binary outcome $y \in \{-1, 1\}$. For people in group 1 (where $x_2 = 1$), the majority group, $y = \mathrm{sgn}(x_1)$, i.e., $y = 1$ when $x_1 > 0$ and $-1$ otherwise. However, for the minority group, where $x_2 = 2$, exactly the opposite holds: $y = -\mathrm{sgn}(x_1)$. Now, if one performed classification without the sensitive attribute $x_2$, the most accurate classifier predicts $y = \mathrm{sgn}(x_1)$, so the majority group would be perfectly classified and the minority group would be classified as inaccurately as possible. However, even using the group membership attribute $x_2$, it is impossible to simultaneously achieve better than 50% (random) accuracy on both groups. This is due to limitations of a linear classifier $\mathrm{sgn}(w_1 x_1 + w_2 x_2 + b)$, since the same $w_1$ is used across groups.
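To make this tradeoff concrete, the following minimal sketch (our illustration, not from the paper; it assumes numpy and scikit-learn are available, and all variable names are ours) builds the two-group data of Figure 1, fits a single coupled linear classifier, and then fits one linear classifier per group:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    # Majority group (x2 = 1): y = sgn(x1). Minority group (x2 = 2): y = -sgn(x1).
    n_maj, n_min = 900, 100
    x1 = rng.uniform(-1, 1, n_maj + n_min)
    x2 = np.array([1.0] * n_maj + [2.0] * n_min)
    y = np.where(x2 == 1.0, x1 > 0, x1 < 0).astype(int)
    X = np.column_stack([x1, x2])

    # One coupled linear classifier, even with access to the attribute x2:
    # it fits the majority pattern and is maximally wrong on the minority.
    coupled = LogisticRegression().fit(X, y)
    for val, name in [(1.0, "majority"), (2.0, "minority")]:
        m = x2 == val
        print(f"coupled accuracy on {name}: {coupled.score(X[m], y[m]):.2f}")

    # Decoupled: a separate linear classifier per group is perfect on each.
    for val, name in [(1.0, "majority"), (2.0, "minority")]:
        m = x2 == val
        per_group = LogisticRegression().fit(X[m], y[m])
        print(f"decoupled accuracy on {name}: {per_group.score(X[m], y[m]):.2f}")

No single weight vector can serve both groups, since the sign of the coefficient on $x_1$ would have to differ across groups; the per-group fits remove that constraint.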

In this paper we define and explore decoupled classification systems, in which a separate¹ classifier is trained on each group. Training a classifier involves minimizing a loss function that penalizes errors; examples include mean squared loss and absolute loss. In decoupled classification systems, one first obtains, for each group separately, a collection of classifiers differing in the numbers of positive classifications returned for the members of the given group. Let this set of outputs for group $k$ be denoted $C_k$, $k = 1, \ldots, K$. The output of the decoupled training step is an element of $C_1 \times \ldots \times C_K$, that is, a single classifier for each group. The output is chosen to minimize a joint loss function that can penalize differences in classification statistics between groups. Thus the loss function can capture group fairness properties relating the treatment of different groups, e.g., the false positive (respectively, false negative) rates are the same across groups; the demographics of the group of individuals receiving positive (negative) classification are the same as the demographics of the underlying population; the positive predictive value is the same across groups.² By pinning down a specific objective, the modeler is forced to make explicit the tradeoff between accuracy and fairness, since often both cannot simultaneously be achieved. Finally, a generalization argument relates fairness properties, captured by the joint loss on the training set, to similar fairness properties on the distribution from which the data were drawn. We broaden our results so as to enable the use of transfer learning to mitigate the problems of low data volume for minority groups.

1. In the case of linear classifiers, training separate classifiers is equivalent to adding interaction terms between the sensitive attributes and all other attributes. More generally, the separate classifiers can equivalently be thought of as a single classifier that branches on the group attribute. The decoupling technique is a simple way to add branching to any type of classifier.

The following observation provides a property essential for efficient decoupling. A profile is a vector specifying, for each group, a number of positively classified examples from the training set. For a given profile $(p_1, \ldots, p_K)$, the most accurate classifier also simultaneously minimizes the false positives and false negatives. It is the choice of profile that is determined by the joint loss criterion. We show that, as long as the joint loss function satisfies a weak form of monotonicity, one can use off-the-shelf classifiers to find a decoupled solution that minimizes joint loss.

The monotonicity requirement is that the joint loss is non-decreasing in error rates, for any fixed profile. This sheds some light on the thought-provoking impossibility results of Chouldechova (2017) and Kleinberg et al. (2016) on the impossibility of simultaneously achieving three specific notions of group fairness (see Observation 1 in Section 4.1).

2. In contrast, individual fairness (Dwork et al., 2011) requires that similar people are treated similarly, which requires a task-specific, culturally-aware similarity metric.

Finally, we present experiments on 47 datasets downloaded from http://openml.org. The experiments are “semi-synthetic” in the sense that the first binary feature was used as a substitute sensitive feature, since we did not have access to sensitive features. We find that on many datasets our algorithm improves performance, while much less often decreasing performance.

Remark. The question of whether or not to use decoupled classifiers is orthogonal to our work, which explores the mathematics of the approach, and a comprehensive treatment of the pros and cons is beyond our expertise. Most importantly, we emphasize that decoupling, together with a “poor” choice of joint loss, could be used unfairly for discriminative purposes. Furthermore, in some jurisdictions, using a different classification method, or even using different weights on attributes for members of demographic groups differing in a protected attribute, is illegal for certain classification tasks, e.g., hiring. Even barring legal restrictions, the assumption that group membership is an input bit is an oversimplification; in reality the information may be obscured, and the definition of the groups may be ambiguous at best. Logically pursuing the idea behind the approach, it is not clear which intersectionalities to consider, or how far to subdivide. Nonetheless, we believe decoupling is valuable and applicable in certain settings and thus merits investigation.

The contributions of this work are: (a) showing how, when using sensitive attributes, the straightforward application of many machine learning algorithms will face inherent tradeoffs between accuracy across different groups, (b) introducing an efficient decoupling procedure that outputs separate classifiers for each group using transfer learning, (c) modeling fair and accurate learning as a problem of minimizing a joint loss function, and (d) presenting experimental results showing the applicability and potential benefit of our approach.

1.1. Related Work

Group fairness has a variety of definitions, including conditions of statistical parity, class balance, and calibration. In contrast to individual fairness, these conditions constrain, in various ways, the dependence of the classifier on the sensitive attributes. The statistical parity condition requires that the assigned label of an individual is independent of sensitive attributes. The condition formalizes the legal doctrine of disparate impact imposed by the Supreme Court in Griggs v. Duke Power Company. Statistical parity can be approximated by either modifying the data set or by designing classifiers subject to fairness regularizers that penalize violations of statistical parity (see Feldman et al. (2015) and references therein). Dwork et al. (2011) propose a “fair affirmative action” methodology that carefully relaxes between-group individual fairness constraints in order to achieve group fairness. Zemel et al. (2013) introduce a representational approach that attempts to “forget” group membership while maintaining enough information to classify similar individuals similarly; this approach also permits generalization to unseen data points. To our knowledge, the earliest work on trying to learn fair classifiers from historically biased data is by Pedreschi et al. (2008); see also (Zliobaite et al., 2011) and (Kamishima et al., 2011).

The class-balanced condition (called error-rate balance by Chouldechova (2017) or equalized odds by Hardt et al. (2016)), similar to statistical parity, requires that the assigned label is independent of sensitive attributes, but only conditional on the true classification of the individual. For binary classification tasks, a class-balanced classifier results in equal false positive and false negative rates across groups. One can also modify a given classifier to be class-balanced while minimizing loss by adding label noise (Hardt et al., 2016).

The well-calibrated condition requires that, conditional on their label, an equal fraction of individuals from each group have the same true classification. A well-calibrated classifier labels individuals from different groups with equal accuracy. Hebert-Johnson et al. (2017) extend calibration to multi-calibration, which requires the classifier to be well calibrated on a collection of sets of individuals, e.g., all those described by circuits of a given size. The class-balanced solution (Hardt et al., 2016) also fails to be well-calibrated. Chouldechova (2017) and Kleinberg et al. (2016) independently showed that, except in cases of perfect predictions or equal base rates of true classifications across groups, there is no class-balanced and well-calibrated classifier.

A number of recent works explore causal approaches to defining and detecting (un)fairness (Nabi and Shpitser, 2017; Kusner et al., 2017; Bareinboim and Pearl, 2016; Kilbertus et al., 2017). See the beautiful primer of Pearl et al. (2016) for an introduction to the central concepts and machinery.

Finally, we mention that sensitive attributes are used in various real-world systems. As one example, Hassidim et al. (2017) describe using such features in an admissions matching system for master's students in Israel.

2. Preliminaries

Let $X = X_1 \cup X_2 \cup \ldots \cup X_K$ be the set of possible examples, partitioned by group. The set of possible labels is $Y$ and the set of possible classifications is $Z$. A classifier is a function $c : X \to Z$. We assume that there is a fixed family $C$ of classifiers. For simplicity, we restrict our analysis to the case of binary classification $Y = Z = \{0, 1\}$, but many of the results extend directly to regression or randomized classification $Y, Z \subseteq \mathbb{R}$.

We suppose that there is a joint distribution $D$ over labeled examples $x, y \in X \times Y$ and we have access to $n$ training examples $(x_1, y_1), \ldots, (x_n, y_n) \in X \times Y$ drawn independently from $D$. We denote by $g(x)$ the group number to which $x$ belongs and $g_i = g(x_i)$, so $x_i \in X_{g_i}$.

Finally, as is common, we consider the loss $\ell_D(c) = \mathbb{E}_{x,y \sim D}[\ell(y, c(x))]$ for an application-specific loss function $\ell : Y \times Z \to \mathbb{R}$, where $\ell(y, z)$ accounts for the cost of classifying as $z$ an example whose true label is $y$. The group-$k$ loss for $D, c$ is defined to be $\ell_{D_k}(c) = \mathbb{E}_D[\ell(y, c(x)) \mid x \in X_k]$, or 0 if $D$ assigns 0 probability to $X_k$. The standard approach in ML is to minimize $\ell_D(c)$ over $c \in C$. Common loss functions include the $L_1$ loss $\ell(y, z) = |y - z|$ and the $L_2$ loss $\ell(y, z) = (y - z)^2$. In Section 4, we provide a methodology for incorporating a range of fairness notions into loss.

3. Decoupling and the cost of coupling

For a vector of $K$ classifiers, $c = (c_1, c_2, \ldots, c_K)$, the decoupled classifier $\gamma_c : X \to Z$ is defined to be $\gamma_c(x) = c_{g(x)}(x)$. The set of decoupled classifiers is denoted $\gamma(C) = \{\gamma_c \mid c \in C^K\}$. Some classifiers, such as decision trees of unbounded size over $X = \{0,1\}^d$, are already decoupled, i.e., $\gamma(C) = C$. As we shall see, however, in high dimensions common families of classifiers in use are coupled to avoid the curse of dimensionality.
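In code, $\gamma_c$ is simply a dispatch on group membership; a minimal sketch (the names `classifiers` and `group_of` are ours, not the paper's):

    from typing import Callable, Sequence

    def make_decoupled(classifiers: Sequence[Callable], group_of: Callable) -> Callable:
        """Build gamma_c(x) = c_{g(x)}(x): route each example to its group's classifier."""
        def gamma(x):
            return classifiers[group_of(x)](x)
        return gamma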

The cost of coupling of a family $C$ of classifiers (with respect to $\ell$) is defined to be the worst-case maximum, over distributions $D$, of the difference between the loss of the most accurate coupled and decoupled classifiers:

\[ \text{cost-of-coupling}(C, \ell) = \max_{D \in \Delta(X \times Y)} \left[ \min_{c \in C} \ell_D(c) - \min_{\gamma_c \in \gamma(C)} \ell_D(\gamma_c) \right]. \]

Here $\Delta(S)$ denotes the set of probability distributions over set $S$. To circumvent measure-theoretic nuisances, we require $C, X, Y$ to be finite sets. Note that numbers on digital computers are all represented using a fixed-precision (bounded number of bits) representation, and hence all these sets may be assumed to be of finite (but possibly exponentially large) size.

We now show that the cost of coupling is related to fairness across groups.

Lemma 1 Suppose cost-of-coupling$(C, \ell) = \epsilon$. Then there is a distribution $D$ such that no matter which classifier $c \in C$ is used, there will always be a group $k$ and a classifier $c' \in C$ whose group-$k$ loss is at least $\epsilon$ smaller than that of $c$, i.e., $\ell_{D_k}(c') \le \ell_{D_k}(c) - \epsilon$.

Proof Let $\gamma_{c'}$ be a decoupled classifier with minimal loss, where $c' = (c'_1, \ldots, c'_K)$. This loss is a weighted average (weighted by demography) of the average loss on each group. Hence, for any $c$, there must be some group $k$ on which the loss of $c'_k$ is $\epsilon$ less than that of $c$.

Hence, if the cost of coupling is positive, then the learning algorithm that selects a classifier faces an inherent tradeoff in accuracy across groups. The following theorem shows that the cost of coupling is large (a constant) for linear classifiers and decision trees; similar arguments exist for other common classifiers. All remaining proofs are deferred to the full version.

Theorem 2 Fix $X = \{0,1\}^d$, $Y = \{0,1\}$, and $K = 2$ groups (encoded by the last bit of $x$). Then the cost of coupling is at least 1/4 for:

1. Linear regression: $Z = \mathbb{R}$, $C = \{w \cdot x + b \mid w \in \mathbb{R}^d, b \in \mathbb{R}\}$, and $\ell(y, z) = (y - z)^2$

2. Linear separators: $Z = \{0,1\}$, $C = \{\mathbb{I}[w \cdot x + b \ge 0] \mid w \in \mathbb{R}^d, b \in \mathbb{R}\}$, and $\ell(y, z) = |y - z|$

3. Bounded-size decision trees: $Z = \{0,1\}$, $C$ the set of binary decision trees with at most $2^s$ leaves, and $\ell(y, z) = |y - z|$

We note that it is straightforward to extend the above theorem to generalized linear models, i.e., functions $c(x) = u(w \cdot x)$ for monotonic functions $u : \mathbb{R} \to \mathbb{R}$, which includes logistic regression as one common special case. It is also possible, though more complex, to provide a lower bound on the cost of coupling of neural networks, regression forests, or other complex families of functions of bounded representation size $s$. In order to do so, one needs simply to show that the size-$s$ functions are sufficiently rich in that there are two different size-$s$ classifiers $c = (c_1, c_2)$ such that $\gamma_c$ has 0 loss (say, over the uniform distribution on $X$) but every single size-$s$ classifier has significant loss.

4. Joint loss and monotonicity

As discussed, the classifications output by an ML classifier are often evaluated by their empirical loss $\frac{1}{n} \sum_i \ell(y_i, z_i)$. To account for fairness, we generalize loss to joint classifications across groups. In particular, we consider an application-specific joint loss $\hat{L} : ([K] \times Y \times Z)^* \to \mathbb{R}$ that assigns a cost to a set of classifications, where $[K] = \{1, 2, \ldots, K\}$ indicates the group number for each example. A joint loss might be, for parameter $\lambda \in [0, 1]$:

\[ \hat{L}\big(\langle g_i, y_i, z_i \rangle_{i=1}^n\big) = \frac{\lambda}{n} \sum_{i=1}^n |y_i - z_i| + \frac{1 - \lambda}{n} \sum_{k=1}^K \left| \sum_{i: g_i = k} z_i - \frac{1}{K} \sum_i z_i \right|. \]

The above $\hat{L}$ trades off accuracy for differences in number of positive classifications across groups. For $\lambda = 1$, this is simply $L_1$ loss, while for $\lambda = 0$, the best classifications would have an equal number of positives in each group. Joint loss differs from a standard ML loss function in two ways. First, joint loss is aware of the sensitive group membership. Second, it depends on the complete labelings and is not simply a sum over labels.
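A direct transcription of this joint loss (a sketch with our own variable names; groups are indexed $0, \ldots, K-1$ here rather than $1, \ldots, K$):

    import numpy as np

    def joint_loss(g, y, z, lam):
        """lam * L1 error plus (1 - lam) times the deviation of each group's
        positive count from the across-group mean, as in the display above."""
        g, y, z = (np.asarray(a) for a in (g, y, z))
        n, K = len(y), int(g.max()) + 1
        accuracy_term = np.abs(y - z).sum() / n
        mean_pos = z.sum() / K
        parity_term = sum(abs(z[g == k].sum() - mean_pos) for k in range(K)) / n
        return lam * accuracy_term + (1 - lam) * parity_term

With lam = 1 this reduces to the plain $L_1$ loss; with lam = 0 only the positive-count parity term remains.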

Even with only $K = 1$ group, this captures situations beyond what is representable by the sum $\sum \ell(y_i, z_i)$. A simple example is when one seeks exactly $P$ positive examples:

\[ \hat{L}\big(\langle g_i, y_i, z_i \rangle_{i=1}^n\big) = \begin{cases} \frac{1}{n} \sum |y_i - z_i| & \text{if } \sum z_i = P \\ 1 & \text{otherwise.} \end{cases} \]

Since $\frac{1}{n} \sum |y_i - z_i| \le 1$, the 1 ensures that the loss minimizer will have exactly $P$ positives, if such a classifier exists in $C$ for the data.

In this section, we denote joint loss $\hat{L}$ with the hat notation indicating that it is an empirical approximation. In the next section we will define joint loss $L$ for distributions. We denote classifications by $z_i$ rather than the standard notation $\hat{y}_i$, which suggests predictions, because, as in the above loss, one may choose classifications $z \ne y$ even with perfect knowledge of the true labels.

For the remainder of our analysis, we henceforth consider binary labels and classifications, $Y = Z = \{0, 1\}$. Our approach is general, however, and our experiments include regression. For a given $\langle x_i, y_i, z_i \rangle_{i=1}^n$, and for any group $k \le K$ and all $(y, z) \in \{0,1\}^2$, recall that the groups are $g_i = g(x_i)$ and define:

\begin{align*}
\text{counts:} \quad & n_k = \big|\{i \mid g_i = k\}\big| \in \{1, 2, \ldots, n\} \\
\text{profile:} \quad & \hat{p}_k = \frac{1}{n} \sum_{i: g_i = k} z_i \in [0, n_k/n] \\
\text{group losses:} \quad & \hat{\ell}_k = \frac{1}{n_k} \sum_{i: g_i = k} |z_i - y_i| \in [0, 1]
\end{align*}

Note that the normalization is such that the standard 0-1 loss is $\sum_k \frac{n_k}{n} \hat{\ell}_k$ and the fraction of positives within any class is $\frac{n}{n_k} \hat{p}_k$.

We note many studied fairness notions, including numerical parity, demographic parity, and false-negative-rate parity, can be represented in a joint loss function. For example, demographic parity is:

\[ \lambda \hat{L}_1 + (1 - \lambda) \sum_k \left| \hat{p}_k \frac{n}{n_k} - \frac{1}{K} \sum_{k'} \hat{p}_{k'} \frac{n}{n_{k'}} \right|. \]

In many applications there is a different cost for false positives, where $(y, z) = (0, 1)$, and false negatives, where $(y, z) = (1, 0)$. The fractions of false positives and negatives are defined below for each group $k$. They can be computed based on the fraction of positive labels in each group, $\pi_k$:

\begin{align}
\pi_k &= \frac{1}{n_k} \sum_{i: g_i = k} y_i \notag \\
FP_k &= \frac{1}{n_k} \sum_{i: g_i = k} z_i (1 - y_i) = \frac{\hat{\ell}_k + \hat{p}_k \frac{n}{n_k} - \pi_k}{2} \tag{1} \\
FN_k &= \frac{1}{n_k} \sum_{i: g_i = k} (1 - z_i) y_i = \frac{\hat{\ell}_k + \pi_k - \hat{p}_k \frac{n}{n_k}}{2} \tag{2}
\end{align}

While minimizing group loss $\hat{\ell}_k = FP_k + FN_k$ in general does not minimize false positives or false negatives on their own, the above implies that for a fixed profile $\hat{p}_k$, the most accurate classifier on group $k$ simultaneously minimizes false positives and false negatives. The above can be derived by adding or subtracting the equations $\hat{\ell}_k = FP_k + FN_k$ (since every error is a false positive or a false negative) and $\frac{n}{n_k} \hat{p}_k = FP_k + (\pi_k - FN_k)$ (since every positive classification is either a false positive or a true positive, and the fraction of true positives from group $k$ is $\pi_k - FN_k$). We also define the false negative rate $FNR_k = FN_k / \pi_k$. False positive rates can be defined similarly.

Equations (1) and (2) imply that, if one desires fewer false positives and false negatives (all other things being fixed), then greater accuracy is better. That is, for a fixed profile, the most accurate classifier simultaneously minimizes false positives and false negatives. This motivates the following monotonicity notion.
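The closed forms (1) and (2) are easy to verify numerically; a small sketch (our code, with random labels and classifications) checks them against direct counts:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 1000
    g = rng.integers(0, 2, n)   # two groups, 0 and 1
    y = rng.integers(0, 2, n)   # true labels
    z = rng.integers(0, 2, n)   # classifications

    for k in range(2):
        m = g == k
        nk = m.sum()
        pk = z[m].sum() / n                  # profile, normalized by n
        lk = np.abs(z[m] - y[m]).mean()      # group loss
        pik = y[m].mean()                    # fraction of positive labels
        fp = (z[m] * (1 - y[m])).mean()      # direct false-positive fraction
        fn = ((1 - z[m]) * y[m]).mean()      # direct false-negative fraction
        assert np.isclose(fp, (lk + pk * n / nk - pik) / 2)  # Eq. (1)
        assert np.isclose(fn, (lk + pik - pk * n / nk) / 2)  # Eq. (2)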

Definition 3 (Monotonicity) Joint loss $\hat{L}$ is monotonic if, for any fixed $\langle g_i, y_i \rangle_{i=1}^n \in ([K] \times Y)^*$, $\hat{L}$ can be written as $c(\langle \hat{\ell}_k, \hat{p}_k \rangle_{k=1}^K)$, where $c : [0,1]^{2K} \to \mathbb{R}$ is a function that is nondecreasing in each $\hat{\ell}_k$, fixing all other inputs to $c$.

That is, for a fixed profile, increasing $\hat{\ell}_k$ can only increase joint loss. To give further intuition behind monotonicity, we give two other equivalent definitions.

Definition 4 (Monotonicity) Joint loss $\hat{L}$ is monotonic if, for any $\langle g_i, y_i, z_i \rangle_{i=1}^n \in ([K] \times Y \times Z)^*$, and any $i, j$ where $g_i = g_j$, $y_i \le y_j$ and $z_i \le z_j$: swapping $z_i$ and $z_j$ can only increase loss, i.e.,

\[ \hat{L}(\langle g_i, y_i, z_i \rangle_{i=1}^n) \le \hat{L}(\langle g_i, y_i, z'_i \rangle_{i=1}^n), \]

where $z'$ is the same as $z$ except $z'_i = z_j$ and $z'_j = z_i$.

We can see that if $y_i = y_j$ then swapping $z_i$ and $z_j$ does not change the loss (because the condition can be used in either order). This means that the loss is “semi-anonymous” in the sense that it only depends on the numbers of true and false positives and negatives for each group. The more interesting case is when $(y_i, y_j) = (0, 1)$, where it states that the loss when $(z_i, z_j) = (0, 1)$ is no greater than the loss when $(z_i, z_j) = (1, 0)$. Finally, monotonicity can also be defined in terms of false positives and false negatives.

Definition 5 (Monotonicity) Joint loss $\hat{L}$ is monotonic if, for any $\langle g_i, y_i, z_i \rangle_{i=1}^n \in ([K] \times Y \times Z)^*$, and any alternative classifications $z'_1, \ldots, z'_n$ that have, in each group $k$, the same profile as $z$ but all smaller or equal false positive rates and all smaller or equal false negative rates, the loss of classifications $z'_i$ is no greater than that of $z_i$.

Lemma 6 Definitions 3, 4, and 5 of monotonicity are equivalent.

One may be tempted to consider a simpler notion of monotonicity, such as requiring the loss with $z_i = y_i$ to be no greater than that of $z_i = 1 - y_i$, fixing everything else. However, this would rule out many natural monotonic joint losses $\hat{L}$, such as demographic parity.

4.1. Discussion: fairness metrics versus objectives

The monotonicity requirement admits a range of different fairness criteria, but not all. We do not mean to imply that monotonicity is necessary for fairness, but rather to discuss the implications of minimizing a non-monotonic loss objective. The following example helps illustrate the boundary between monotonic and non-monotonic.

Observation 1 Fix $K = 2$. The following joint loss is monotonic if and only if $\lambda \le 1/2$:

\[ (1 - \lambda)(\hat{\ell}_1 + \hat{\ell}_2) + \lambda |\hat{\ell}_1 - \hat{\ell}_2|. \]

The loss in the above observation trades off accuracy for differences in loss rates between groups. What we see is that monotonic losses can account, to a limited extent, for differences across groups in fractions of errors, and related statements can be made for combinations of rates of false positives and false negatives, inspired by “equal odds” definitions of fairness. However, when the weight $\lambda$ on the fairness term exceeds 1/2, the loss is non-monotonic and one encounters situations where one group is punished with lower accuracy in the name of fairness. This may still be desirable in a context where equal odds is a primary requirement, and one would rather have random classifications (e.g., a lottery) than introduce any inequity.

What is the goal of an objective function? We argue that a good objective function is one whose optimization leads to favorable outcomes, and it should not be confused with a fairness metric, whose goal is to quantify unfairness. Often, a different function is appropriate for quantifying unfairness than for optimizing it. For example, the difference in classroom performance across groups may serve as a good metric of unfairness, but it may not be a good objective on its own. The root cause of the unfairness may have begun long before the class. Now, suppose that the objective from the above observation was used by a teacher to design a semester-long curriculum with the best intention of increasing the minority group's performance to the level of the majority. If there is no curriculum that in one semester increases one group's performance to the level of another group's performance, then optimizing the above loss for $\lambda > 1/2$ leads to an undesirable outcome: the curriculum would be chosen so as to intentionally misteach the higher-performing group of students so that their loss increases to match that of the other group. This can be seen by rewriting the loss, using the identity $|a - b| = 2\max\{a, b\} - (a + b)$, as follows:

\[ (1 - \lambda)(\hat{\ell}_1 + \hat{\ell}_2) + \lambda |\hat{\ell}_1 - \hat{\ell}_2| = 2\lambda \max\{\hat{\ell}_1, \hat{\ell}_2\} + (1 - 2\lambda)(\hat{\ell}_1 + \hat{\ell}_2). \]

This rewriting illuminates why $\lambda \le 1/2$ is necessary for monotonicity; otherwise there is a negative weight on the total loss. $\lambda = 1/2$ corresponds to maximizing the minimum performance across groups, while $\lambda = 0$ means teaching to the average, and $\lambda$ in between allows interpolation. However, putting too much weight on fairness leads to undesirable punishing behavior.

5. Minimizing joint loss on training data

Here, we show how to use a learning algorithm to find a decoupled classifier in $\gamma(C)$ that is optimal on the training data. In the next section, we show how to generalize this to imperfect randomized classifiers that generalize to examples drawn from the same distribution, potentially using an arbitrary transfer learning algorithm.

Our approach to decoupling uses a learning algorithm for $C$ as a black box. A $C$-learning algorithm $A : (X \times Y)^* \to 2^C$ returns one or more classifiers from $C$ with differing numbers of positive classifications on the training data, i.e., for any two distinct $c, c' \in A(\langle x_i, y_i \rangle_{i=1}^n)$, $\sum_i c(x_i) \ne \sum_i c'(x_i)$. In ML, it is common to simultaneously output classifiers with varying numbers of positive classifications, e.g., in computing ROC or precision-recall curves (Davis and Goadrich, 2006). Also note that a classifier that purely minimizes errors can be massaged into one that outputs different fractions of positive and negative examples by reweighting (or subsampling) the positive- and negative-labeled examples with different weights.

Our analysis will be based on the assumption that the classifier is in some sense optimal, but importantly, note that it makes sense to apply the reduction to any off-the-shelf learner. Formally, we say $A$ is optimal if for every achievable number of positives $P \in \{\sum_i c(x_i) \mid c \in C\}$, it outputs exactly one classifier that classifies exactly $P$ positives, and this classifier has minimal error among all classifiers which classify exactly $P$ positives. Theorem 7 shows that an optimal classifier can be used to minimize any (monotonic) joint loss.

Algorithm 1: Decouple($A$, $\hat{L}$, $\{\langle x_i, y_i \rangle\}$, $\{X_i\}$) — minimize training loss $\hat{L}$ using learner $A$.

1. For $k = 1$ to $K$: $C_k \leftarrow A(\langle x_i, y_i \rangle_{i: x_i \in X_k})$. (The learner outputs a set of classifiers.)

2. Return $\gamma_c$ for the $c \in C_1 \times \ldots \times C_K$ minimizing $\hat{L}(\langle g_i, y_i, \gamma_c(x_i) \rangle_{i=1}^n)$, where $\gamma_c(x_i) = c_{g_i}(x_i)$.

The simple decoupling algorithm partitions data by group and runs the learner on each group. Within each group, the learner outputs one or more classifiers with differing numbers of positives.

Theorem 7 For any monotonic joint loss function $\hat{L}$, any $C$, and any optimal learner $A$ for $C$, the Decouple procedure from Algorithm 1 returns a classifier in $\gamma(C)$ of minimal joint loss $\hat{L}$. For constant $K$, Decouple runs in time linear in the time to run $A$ and polynomial in the number of examples $n$ and the time to evaluate $\hat{L}$ and classifiers $c \in C$.

Implementation notes. If the profile is fixed, as in $\hat{L}_{p^*}$, then one can simply run the learning algorithm once for each group, targeted at $p^*_k$ positives in each group. Otherwise, note that to perform the slowest step, which involves searching over $O(n^K)$ losses of combinations of classifiers, one can pre-compute the error rates and profiles of each classifier. In the “big data” regime of very large $n$, the $O(n^K)$ evaluations of a simple numeric function of profile and losses will not be the rate-limiting step.
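A minimal sketch of the reduction (our illustration, not the authors' code): the candidate set $C_k$ is produced by thresholding a single per-group scorer at every distinct score, giving one classifier per achievable number of positives, and the final tuple is chosen by exhaustive search against the joint loss. An off-the-shelf scorer stands in for the optimal learner the theorem assumes.

    import itertools
    import numpy as np
    from sklearn.linear_model import LinearRegression

    def per_group_classifiers(X, y):
        """One scorer per group; each threshold yields a classifier with a
        different number of positive classifications on the training data."""
        model = LinearRegression().fit(X, y)
        scores = model.predict(X)
        return [lambda A, t=t, m=model: (m.predict(A) >= t).astype(int)
                for t in np.unique(scores)]

    def decouple(X, y, g, L, K=2):
        """Algorithm 1 sketch: return the per-group classifiers minimizing
        the joint loss L(g, y, z) over C_1 x ... x C_K."""
        sets = [per_group_classifiers(X[g == k], y[g == k]) for k in range(K)]
        best, best_loss = None, np.inf
        for combo in itertools.product(*sets):
            z = np.empty(len(y), dtype=int)
            for k, c in enumerate(combo):
                z[g == k] = c(X[g == k])
            loss = L(g, y, z)
            if loss < best_loss:
                best, best_loss = combo, loss
        return best

Passing the $\lambda$-tradeoff loss from Section 4 as L, this search is exactly the $O(n^K)$ step discussed above; pre-computing each candidate's error rate and profile would avoid re-scoring inside the loop.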

6. Generalization and transfer learning

We now turn to the more general randomized classifier model in which $Z = [0, 1]$ but still with $Y = \{0, 1\}$, and we also consider generalization loss as opposed to simply training loss. We will define loss in terms of the underlying joint distribution $D$ over $X \times Y$ from which training examples are drawn independently. We define the true probability, true profile, and true error:

\begin{align*}
\nu_k &= \Pr[x \in X_k] = \mathbb{E}[n_k/n] \\
p_k &= \mathbb{E}\big[z\, \mathbb{I}[x \in X_k]\big] = \mathbb{E}[\hat{p}_k] \\
\ell_k &= \mathbb{E}\big[\,|y - z| \,\big|\, x \in X_k\big] = \mathbb{E}[\hat{\ell}_k \mid n_k > 0]
\end{align*}

Joint loss $L$ is defined on the joint distribution $\mu$ on $g, y, z \in [K] \times Y \times Z$ induced by $D$ and a classifier $c : X \to Z$. A distributional joint loss $L$ is related to empirical joint loss $\hat{L}$ in that $L = \lim_{n \to \infty} \mathbb{E}[\hat{L}]$, i.e., the limit of the empirical joint loss as the number of training data grows without bound (if it exists).

Fixing the marginal distribution over $[K] \times Y$, joint loss $L : [0,1]^{2K} \to \mathbb{R}$ can be viewed as a function of $\ell_1, p_1, \ldots, \ell_K, p_K$ (in addition to group probabilities $\Pr[g(x) = k]$, which are independent of the classification). In addition to requiring monotonicity, namely $L$ being nondecreasing in $\ell_k$ fixing all other parameters, we will assume that $L$ is continuous with a bound on the rate of change of the form:

\[ |L(\ell_1, p_1, \ldots, \ell_K, p_K) - L(\ell'_1, p'_1, \ldots, \ell'_K, p'_K)| \le R \sum_k \big( \nu_k |\ell_k - \ell'_k| + |p_k - p'_k| \big), \tag{3} \]

for parameter $R \ge 0$ and all $\ell_k, \ell'_k, p_k, p'_k \in [0, 1]$. Note that the $\nu_k$ in the above bound is necessary for our analysis because a loss that depends on $\ell_k$ without $\nu_k$ may require exponentially large quantities of data to estimate and optimize over if $\nu_k$ is exponentially small. Of course, alternatively $\nu_k$ could be removed from this assumption by imposing a lower bound on all $\nu_k$.

Many losses, such as $L_1$ and $L^{NP}_\lambda$ above, can be shown to satisfy this continuity requirement, for $R = 1$ and $R = 2$ respectively. We also note that the reduction we present can be modified to address certain discontinuous loss functions. For instance, for a given target allocation (i.e., a fixed fraction of positive classifications for each group), one simply finds the classifier of minimal empirical error for each group which achieves the desired fraction of positives as closely as possible.

A transfer learning algorithm for $C$ is $A : (X \times \{0,1\})^* \times (X \times \{0,1\})^* \to 2^C$, where $A$ takes in-group examples $\langle x_i, y_i \rangle_{i=1}^n$ and out-group examples $\langle x'_i, y'_i \rangle_{i=1}^{n'}$, both from $X \times \{0,1\}$. This is also called supervised domain adaptation. The distribution of out-group examples is different from (but related to) the distribution of in-group samples. The motivation for using the out-group examples is that if one is trying to learn a classifier on a small dataset, accuracy may be increased using related data.


Algorithm 2: G.D.($T$, $\hat{L}$, $\{\langle x_i, y_i \rangle\}$, $\{X_i\}$)

1. For $k = 1$ to $K$:
• $n_k \leftarrow |\{i \le n \mid x_i \in X_k\}|$
• $C_k \leftarrow T(\langle x_i, y_i \rangle_{i: x_i \in X_k}, \langle x_i, y_i \rangle_{i: x_i \notin X_k})$. (Run the transfer learner; the output is a set of classifiers.)

2. For each $k$ and all $c \in C_k$:
• $\hat{p}_k[c] \leftarrow \frac{1}{n} \sum_{i: x_i \in X_k} c(x_i)$ (estimate the profile)
• $\hat{\ell}_k[c] \leftarrow \frac{1}{n_k} \sum_{i: x_i \in X_k} |y_i - c(x_i)|$ (estimate the error rates)

3. Return $\gamma_c$ for $c \in \arg\min_{C_1 \times \ldots \times C_K} \hat{L}\big(\langle \hat{\ell}_i[c_i], \hat{p}_i[c_i] \rangle_{i=1}^K\big)$.

The general decoupling algorithm uses a transfer learning algorithm $T$.

In the next section, we describe and analyze a simple transfer learning algorithm that down-weights samples from the out-group. For that algorithm, we show:

Theorem 8 Suppose that, for any two groups $j, k \le K$ and any classifiers $c, c' \in C$,

\[ |(\ell_j(c) - \ell_j(c')) - (\ell_k(c) - \ell_k(c'))| \le \Delta. \tag{4} \]

For Algorithm 2 with the transfer learning algorithm described in Section 6.1, with probability $\ge 1 - \delta$ over the $n$ iid training data, the algorithm outputs $c$ with $L(c)$ at most

\[ \min_{c \in C} L(c) + 5RK\tau + R \sum_k \min\left( \sqrt{\frac{1}{\nu_k - \tau}},\; \Delta \right), \]

where $\tau = \sqrt{\frac{2}{n} \log\big(8|C|(n+K)/\delta\big)}$. For constant $K$, the run-time of the algorithm is polynomial in $n$ and the runtime of the optimizer over $C$.

The assumption in (4) states that the performance difference between classifiers is similar across different groups, and it is weaker than an assumption of similar classifier performance across groups. Note that it would follow from a simpler but stronger requirement that $|\ell_j(c) - \ell_k(c)| \le \Delta/2$, by the triangle inequality.
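To spell out that step: if $|\ell_j(c) - \ell_k(c)| \le \Delta/2$ for every classifier $c$, then for any $c, c'$,

\[ \big|(\ell_j(c) - \ell_j(c')) - (\ell_k(c) - \ell_k(c'))\big| = \big|(\ell_j(c) - \ell_k(c)) - (\ell_j(c') - \ell_k(c'))\big| \le \frac{\Delta}{2} + \frac{\Delta}{2} = \Delta. \]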

Parameter settings (see Lemma 10) and tighter bounds can be found in the next section. However, we can still see qualitatively that, as $n$ grows, the bound decreases roughly like $O(n^{-1/2})$, as expected. We also note that for groups with large $\nu_k$, as we will see in the next section, the transfer learning algorithm places weight 0 on (and hence ignores) the out-group data. For small³ $\nu_k$, the algorithm will place significant weight on the out-group data.

6.1. A transfer learning algorithm T

In this section, we describe and analyze a simple transfer learning algorithm that down-weights⁴ out-group examples by a parameter $\theta \in [0, 1]$. To choose $\theta$, we can either use cross-validation on an independent held-out set, or $\theta$ can be chosen to minimize a bound, as we now describe. Cross-validation, which we do in our experiments, is appropriate when one does not have bounds at hand on the size of the set of classifiers or the difference between groups, as we shall assume, or when one simply has a black-box learner that does not perfectly optimize over $C$. We now proceed to derive a bound on the error that will yield a parameter choice $\theta$.
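A sketch of the down-weighting transfer learner $T$ (our illustration; sample_weight is scikit-learn's standard weighting mechanism, in line with footnote 4's remark that subsampling can substitute when weighting is unsupported):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    def transfer_fit(X_in, y_in, X_out, y_out, theta):
        """Fit one scorer on in-group data plus out-group data down-weighted
        by theta in [0, 1]: theta = 0 ignores the out-group, theta = 1 pools."""
        X = np.vstack([X_in, X_out])
        y = np.concatenate([y_in, y_out])
        w = np.concatenate([np.ones(len(y_in)), np.full(len(y_out), theta)])
        return LinearRegression().fit(X, y, sample_weight=w)

Thresholding the returned scorer, as in the earlier Decouple sketch, yields the candidate set $C_k$ that Algorithm 2 consumes.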

Consider $k$ to be fixed. For convenience, we write $n_{-k} = n - n_k$ for the number of samples from other groups. Define $\hat{\ell}_{-k}$ and $\ell_{-k}$ analogously to $\hat{\ell}_k$ and $\ell_k$ for out-of-group data $x_i \notin X_k$.

Instead of outputting a set of classifiers, one for each different number of positives within group $k$, it will be simpler to think of the group-$k$ profile $\hat{p}_k = P$ as being specified in advance, and we hence focus our attention on the subset of classifiers

\[ C_{kP} = \left\{ c \in C \;\middle|\; \frac{1}{n} \sum_{i: x_i \in X_k} c(x_i) = P \right\}, \]

which depends on the training data. The bounds in this section will be uninteresting, of course, when $C_{kP}$ is empty (e.g., in the unlikely event that $x_1 = x_2 = \ldots = x_n$, the only realizable $\hat{p}_k$ of interest are 0 and 1). The general algorithm will simply run the subroutine described in this section $n_k + 1 \le n + 1$ times, once for each possible value of $\hat{p}_k$.⁵ Of course, $|C_{kP}| \le |C|$.

3. For very small $\nu_k < \tau$, the term $\nu_k - \tau$ is negative (making the left side of the above min imaginary), in which case we define the min to be the real term on the right.

4. If the learning algorithm doesn't support weighting, subsampling can be used instead.

5. In practice, classification learning algorithms generally learn a single real-valued score and consider different score thresholds.

As before, we will assume that the underlying learner is optimal, meaning that given a weighted set of examples $(w_1, x_1, y_1), \ldots, (w_n, x_n, y_n)$ with total weight $W = \sum w_i$, it returns a classifier $c \in C_{kP}$ that has minimal weighted error $\sum \frac{w_i}{W} |y_i - c(x_i)|$ among all classifiers in $C_{kP}$.

In Appendix A, we derive a closed-form solution for $\theta$, the (approximately) optimal down-weighting of out-group data for our transfer learning algorithm. This solution depends on a bound on the difference in classifier ranking across different groups. For small $\Delta$, the difference in error rates of each pair of classifiers is approximately the same for in-group and out-group data. In this case, we expect generalization to work well and hence $\theta \approx 1$. For large $\Delta$, out-group data doesn't provide much guidance for the optimal in-group classifier, and we expect $\theta \approx 0$.

For a fixed $k$ and $\theta \in [0, 1]$, let $c$ be a classifier that minimizes the empirical loss when out-of-group samples are down-weighted by $\theta$, i.e.,

\[ c \in \arg\min_{c \in C_{kP}} n_k \hat{\ell}_k(c) + \theta n_{-k} \hat{\ell}_{-k}(c), \]

and $c^*$ be an optimal classifier that minimizes the true loss, i.e.,

\[ c^* \in \arg\min_{c \in C_{kP}} \ell_k(c). \]

We would like to choose $\theta$ such that $\ell_k(c)$ is close to $\ell_k(c^*)$. In order to derive a closed-form solution for $\theta$ in terms of $\Delta$, we use concentration bounds to bound the expected error rates of $c$ and $c^*$ in terms of $\Delta$ and $\theta$, and then choose $\theta$ to minimize this expression.

We find that, as long as $n_k < \frac{2}{\Delta^2} \log \frac{2|C|}{\delta}$, the optimal choice of $\theta$ will be strictly between 0 and 1.

7. Experiment

For this experiment, we used data that is “semi-synthetic” in that the 47 datasets are “real” (downloaded from openml.org), but an arbitrary binary attribute was used to represent a sensitive attribute, so $K = 2$. The base classifier was chosen to be least-squares linear regression for its simplicity (no parameters), speed, and reproducibility.

Figure 2: Comparing the joint loss of our decoupled algorithm with the coupled and blind baselines. Each point is a dataset. A ratio less than 1 means that the loss was smaller for the decoupled or coupled algorithm than the blind baseline, i.e., that using the sensitive feature resulted in decreased error. Points above the diagonal represent datasets in which the decoupled algorithm outperformed the coupled one. (Axes: x, decoupled loss/blind loss; y, coupled loss/blind loss.)

In particular, each dataset was a univariate regression problem with balanced loss for squared error, i.e., $L_B = \frac{1}{2}(\hat{\ell}_1 + \hat{\ell}_2)$ where $\hat{\ell}_k = \sum_{i: g_i = k} (y_i - z_i)^2 / n_k$. To gather the datasets, we first selected the problems with twenty or fewer dimensions. Classification problems were converted to regression problems by assigning $y = 1$ to the most common class and $y = 0$ to all other classes. Regression problems were normalized so that $y \in [0, 1]$. Categorical attributes were similarly converted to binary features by assigning 1 to the most frequent category and 0 to others.

The sensitive attribute was chosen to be the first binary feature such that there were at least 100 examples in both groups (both 0 and 1 values). Further, large datasets were truncated so that there were at most 10,000 examples in each group. If there was no appropriate sensitive attribute, then the dataset was discarded. We also discarded a small number of “trivial” datasets in which the data could be perfectly classified (less than 0.001 error) with linear regression. The openml id's and detailed error rates of the 45 remaining datasets are listed in the appendix.

Figure 3: Comparing the joint loss of our decoupled algorithm with and without transfer learning. Each point is a dataset. A ratio less than 1 means that the loss was smaller for the decoupled algorithm than the blind baseline. Points above the diagonal represent datasets in which transfer learning improved performance compared to decoupling without transfer learning. (Axes: x, decoupled loss/blind loss; y, decoupled loss/blind loss without transfer.)

All experiments were done with five-fold cross-validation to provide an unbiased estimate of generalization error on each dataset. Algorithm 2 was implemented, where we further used five-fold cross-validation (within each of the outer folds) to choose the best down-weighting parameter $\theta \in \{0, 2^{-10}, 2^{-9}, \ldots, 1\}$ for each group. Hence, least-squares regression was run $5 \cdot 5 \cdot 11 = 275$ times on each dataset to implement our algorithm.
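A sketch of this protocol for one dataset (our reconstruction of the description above, not the authors' code; transfer_fit is the learner sketched in Section 6.1, and the theta grid follows the text's enumeration):

    import numpy as np
    from sklearn.model_selection import KFold

    THETAS = [0.0] + [2.0 ** -e for e in range(10, -1, -1)]  # {0, 2^-10, ..., 1}

    def balanced_loss(y, z, g):
        """L_B = (1/2)(l_1 + l_2), mean squared error averaged over two groups."""
        return 0.5 * sum(np.mean((y[g == k] - z[g == k]) ** 2) for k in (0, 1))

    def choose_theta(X, y, g, k, fit):
        """Inner 5-fold CV: pick the down-weighting theta for group k.
        Folds are assumed to contain members of group k."""
        best, best_err = None, np.inf
        for theta in THETAS:
            errs = []
            for tr, te in KFold(5, shuffle=True, random_state=0).split(X):
                in_tr, out_tr = tr[g[tr] == k], tr[g[tr] != k]
                model = fit(X[in_tr], y[in_tr], X[out_tr], y[out_tr], theta)
                te_k = te[g[te] == k]
                errs.append(np.mean((y[te_k] - model.predict(X[te_k])) ** 2))
            if np.mean(errs) < best_err:
                best, best_err = theta, np.mean(errs)
        return best

An outer 5-fold split (not shown) around choose_theta and the final per-group fits gives the unbiased loss estimates plotted in Figures 2 and 3.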

Two baselines were considered: the blind baseline is least-squares linear regression that has no access to the sensitive attribute; the coupled baseline is least-squares linear regression that can take the sensitive attribute into account.

Figure 2 compares the loss of our decoupled algorithm (x-axis) and the coupled baseline (y-axis) to that of the blind baseline. In particular, the ratio of the squared errors is plotted, as this quantity is immune to scaling of the $y$ values. Each point is a dataset. Points to the left of 1 ($x < 1$) represent datasets where the decoupled classifier outperformed the blind one. Points below the horizontal line $y = 1$ represent datasets in which the coupled classifier outperformed the blind baseline. Finally, points above the diagonal line $y = x$ represent datasets where the decoupled classifier outperformed the coupled classifier.

Figure 3 compares transfer learning to decoupling without any transfer learning (i.e., just learning on the in-group data, or setting $\theta = 0$). As one can see, on a number of datasets, transfer learning significantly improves performance. In fact, without transfer learning, the coupled classifiers significantly outperform decoupled classifiers on a number of datasets.

8. Image retrieval experiment

In this section, we describe an anecdotal example that illustrates the type of effect the theory predicts, where a classifier biases towards minority data that is typical of the majority group. We hypothesized that standard image classifiers for two groups of images would display bias towards the majority group, and that a decoupled classifier could reduce this bias. More specifically, consider the case where we have a set $X = X_1 \cup X_2$ of images, and want to learn a binary classifier $c : X \to \{0, 1\}$. We hypothesized that a coupled classifier would display a specific form of bias we call majority feature bias, such that images in the minority group would rank higher if they had features of images in the majority group.

We tested this hypothesis by training classifiers to label images as “suit” or “no suit”. We constructed an image dataset by downloading the “suit, suit of clothes” synset as a set of positives, and the “male person” and “female person” synsets as the negatives, from ImageNet (Deng et al., 2009). We manually removed images in the negatives that included suits or were otherwise outliers, and manually classified suits as “male” or “female”, removing suit images that were neither. We used the pre-trained BVLC CaffeNet model – which is similar to the AlexNet model from Krizhevsky et al. (2012) – to generate features for the images and clean the dataset. We used the last fully connected layer (“fc7”) of the CaffeNet model as features, and removed images where the most likely label according to the CaffeNet model was “envelope” (indicating that the image was missing), or “suit, suit of clothes” or “bow tie, bow-tie, bowtie” from the negatives. The dataset included 506 suit images (462 male, 44 female) and 1295 no-suit images (633 male, 662 female).

Figure 4: Differences between image classifications of “suit” using standard linear classifiers and decoupled classifiers (trained using standard neural network image features). The females selected by the linear classifier are wearing a tuxedo and blazer more typical of the majority male group.

We then trained a coupled and a decoupled standard linear support vector classifier (SVC) on this dataset, and provide anecdotal evidence that the decoupled classifier displays less majority feature bias than the coupled classifier. We trained the coupled SVC on all images, and then ranked images according to the predicted class. We trained decoupled SVCs, with one SVC trained on the male positives and all negatives, and the other on the female positives and all negatives. Both classifiers agreed on eight of the top ten “females” predicted as “suit”, and Fig. 4 shows the four images (two per classifier) that differed. One of the images found by the coupled classifier is a woman in a tuxedo (typically worn by men), which may be an indication of majority feature bias; adding a binary gender attribute to the coupled classifier did not change the top ten predictions for “female suit.” We further note that we also tested both the coupled and decoupled classifiers on out-of-sample predictions using 5-fold cross-validation, and that both were highly accurate (both had 94.5% accuracy, with the coupled classifier predicting one additional true positive).

We emphasize that we present this experiment to provide an anecdotal example of the potential advantages of a decoupled classifier, and we do not make any claims on generalizability or effect size on this or other real-world datasets because of the small sample size and the several manual decisions we made.

9. Conclusions

In this paper, we give a simple technical approach for a practitioner using ML to incorporate sensitive attributes. Our approach avoids unnecessary accuracy tradeoffs between groups and can accommodate an application-specific objective, generalizing the standard ML notion of loss. For a certain family of “weakly monotonic” fairness objectives, we give a black-box reduction that can use any off-the-shelf classifier to efficiently optimize the objective. In contrast to much prior work on ML which first requires complete fairness, this work requires the application designer to pin down a specific loss function that trades off accuracy for fairness.

Experiments demonstrate that decoupling can reduce the loss on some datasets for some potentially sensitive features.

References

Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. Machine bias: There's software used across the country to predict future criminals. And it's biased against blacks. ProPublica, May 23, 2016.

Elias Bareinboim and Judea Pearl. Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences, 113(27):7345–7352, 2016.

Richard Berk. The role of race in forecasts of violent crime. Race and Social Problems, 1(4):231, 2009.

Alexandra Chouldechova. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. arXiv, 2017.

Jesse Davis and Mark Goadrich. The relationship between precision-recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning, pages 233–240. ACM, 2006.

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. ITCS, 2011.

Michael Feldman, Sorelle A. Friedler, John Moeller, Carlos Scheidegger, and Suresh Venkatasubramanian. Certifying and removing disparate impact. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015.

Sorelle A. Friedler, Carlos Scheidegger, and Suresh Venkatasubramanian. On the (im)possibility of fairness. arXiv preprint arXiv:1609.07236, 2016.

Moritz Hardt, Eric Price, and Nathan Srebro. Equality of opportunity in supervised learning. NIPS, 2016.

Avinatan Hassidim, Assaf Romm, and Ran I. Shorrer. Redesigning the Israeli psychology master's match. American Economic Review, 107(5):205–09, May 2017. doi: 10.1257/aer.p20171048. URL http://www.aeaweb.org/articles?id=10.1257/aer.p20171048.

U. Hebert-Johnson, M. Kim, O. Reingold, and G. Rothblum. Calibration for the (computationally-identifiable) masses. arXiv:1711.08513v1, 2017.

Matthew Joseph, Michael Kearns, Jamie H. Morgenstern, and Aaron Roth. Fairness in learning: Classic and contextual bandits. In Advances in Neural Information Processing Systems, pages 325–333, 2016.

Toshihiro Kamishima, Shotaro Akaho, and Jun Sakuma. Fairness-aware learning through regularization approach. In Proceedings of the 2011 IEEE 11th International Conference on Data Mining Workshops, ICDMW '11, pages 643–650, Washington, DC, USA, 2011. IEEE Computer Society. ISBN 978-0-7695-4409-0. doi: 10.1109/ICDMW.2011.83. URL http://dx.doi.org/10.1109/ICDMW.2011.83.

Niki Kilbertus, Mateo Rojas-Carulla, Giambattista Parascandolo, Moritz Hardt, Dominik Janzing, and Bernhard Schölkopf. Avoiding discrimination through causal reasoning. arXiv preprint arXiv:1706.02744, 2017.

Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. Inherent trade-offs in the fair determination of risk scores. arXiv, 2016.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.

Matt J. Kusner, Joshua R. Loftus, Chris Russell, and Ricardo Silva. Counterfactual fairness. arXiv preprint arXiv:1703.06856, 2017.

Razieh Nabi and Ilya Shpitser. Fair inference on outcomes. arXiv preprint arXiv:1705.10378, 2017.

Judea Pearl, Madelyn Glymour, and Nicholas P. Jewell. Causal Inference in Statistics: A Primer. John Wiley & Sons, 2016.

Dino Pedreschi, Salvatore Ruggieri, and Franco Turini. Discrimination-aware data mining. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '08, pages 560–568, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-193-4. doi: 10.1145/1401890.1401959. URL http://doi.acm.org/10.1145/1401890.1401959.

R. Zemel, Y. Wu, K. Swersky, T. Pitassi, and C. Dwork. Learning fair representations. Proc. of Intl. Conf. on Machine Learning, 2013.

Indre Zliobaite, Faisal Kamiran, and Toon Calders. Handling conditional discrimination. In Proceedings of the 2011 IEEE 11th International Conference on Data Mining, ICDM '11, pages 992–1001, Washington, DC, USA, 2011. IEEE Computer Society. ISBN 978-0-7695-4408-3. doi: 10.1109/ICDM.2011.72. URL http://dx.doi.org/10.1109/ICDM.2011.72.

Appendix A. Transfer Learning Bounds

We derive a closed-form solution for $\theta$, the (approximately) optimal down-weighting of out-group data for our transfer learning algorithm. This solution depends on a bound $\Delta$ (defined in Theorem 8) on the difference in classifier ranking across different groups. For small $\Delta$, the difference in error rates of each pair of classifiers is approximately the same for in-group and out-group data. In this case, we expect generalization to work well and hence $\theta \approx 1$. For large $\Delta$, out-group data doesn't provide much guidance for the optimal in-group classifier, and we expect $\theta \approx 0$.

Finally, for a fixed $k$ and $\theta \in [0, 1]$, let $c$ be a classifier that minimizes the empirical loss when out-of-group samples are down-weighted by $\theta$, i.e.,

\[ c \in \arg\min_{c \in C_{kP}} n_k \hat{\ell}_k(c) + \theta n_{-k} \hat{\ell}_{-k}(c), \]

and $c^*$ be an optimal classifier that minimizes the true loss, i.e.,

\[ c^* \in \arg\min_{c \in C_{kP}} \ell_k(c). \]

We would like to choose $\theta$ such that $\ell_k(c)$ is close to $\ell_k(c^*)$. In order to derive a closed-form solution for $\theta$ in terms of $\Delta$, we use concentration bounds to bound the expected error rates of $c$ and $c^*$ in terms of $\Delta$ and $\theta$, and then choose $\theta$ to minimize this expression.

Lemma 9 Fix any $k \le K$, $P$, $n_k, n_{-k} \ge 0$, and $\Delta, \theta \ge 0$. Let $\langle x_i, y_i \rangle_{i=1}^n$ be $n = n_k + n_{-k}$ training examples drawn from $D$ conditioned on exactly $n_k$ belonging to group $k$. Let $c \in \arg\min_{c \in C_{kP}} n_k \hat{\ell}_k(c) + \theta n_{-k} \hat{\ell}_{-k}(c)$ be any minimizer of empirical error when the non-group-$k$ examples have been down-weighted by $\theta$. Then,

\[ \Pr\left[ \ell_k(c) \le \min_{c \in C_{kP}} \ell_k(c) + f(\theta, n_k, n_{-k}, \Delta, \delta) \right] \ge 1 - \delta, \]

where the probability is taken over the $n = n_k + n_{-k}$ iid training samples, and $f(\theta, n_k, n_{-k}, \Delta, \delta)$ is defined as:

\[ \frac{1}{n_k + \theta n_{-k}} \left( \sqrt{2(n_k + \theta^2 n_{-k}) \log \frac{2|C|}{\delta}} + \theta n_{-k} \Delta \right). \tag{5} \]

Unfortunately, the minimum value of $f$ is a complicated algebraic quantity that is easy to compute but not easy to directly interpret. Instead, we can see that:

Lemma 10 For $f$ from Equation (5), $g(n_k, n_{-k}, \Delta, \delta) = \min_{\theta \in [0,1]} f(\theta, n_k, n_{-k}, \Delta, \delta)$ is at most

\[ \min\left( \sqrt{\frac{2}{n_k} \log \frac{2|C|}{\delta}},\; \sqrt{\frac{2}{n} \log \frac{2|C|}{\delta}} + \frac{n_{-k}}{n} \Delta \right), \tag{6} \]

with equality if and only if $n_k \ge \frac{2}{\Delta^2} \log \frac{2|C|}{\delta}$, in which case the minimum occurs at $\theta = 0$, where $g(n_k, n_{-k}, \Delta) = \sqrt{\frac{2}{n_k} \log \frac{2|C|}{\delta}}$. Otherwise the minimum occurs at

\[ \theta^* = \sqrt{\frac{\beta^2}{4} + \frac{n_{-k}}{n_k}(1 - \beta)} - \frac{\beta}{2} \in (0, 1), \]

for $\beta = \frac{\Delta^2 n_k}{2 \log(2|C|/\delta)}$.
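The closed form is straightforward to evaluate; a sketch (our code; it relies on the reconstruction of $\beta$ above, so treat the formula as an assumption of this snippet):

    import numpy as np

    def theta_star(n_k, n_out, delta_gap, num_classifiers, fail_prob):
        """Approximately optimal out-group down-weighting theta* (Lemma 10).
        delta_gap is the Delta of Theorem 8; returns 0 when the in-group
        sample alone is large enough."""
        log_term = np.log(2 * num_classifiers / fail_prob)
        if n_k >= (2 / delta_gap ** 2) * log_term:
            return 0.0
        beta = delta_gap ** 2 * n_k / (2 * log_term)  # assumed reconstruction
        return float(np.sqrt(beta ** 2 / 4 + (n_out / n_k) * (1 - beta)) - beta / 2)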

Appendix B. Dataset ids

For reproducibility, the id's and feature names for the 47 openml datasets were as follows: (21, 'buying'), (23, 'Wifes education'), (26, 'parents'), (31, 'checking status'), (50, 'top-left-square'), (151, 'day'), (155, 's1'), (183, 'Sex'), (184, 'white king row'), (292, 'Y'), (333, 'class'), (334, 'class'), (335, 'class'), (351, 'Y'), (354, 'Y'), (375, 'speaker'), (469, 'DMFT.Begin'), (475, 'Time of survey'), (679, 'sleep state'), (720, 'Sex'), (741, 'sleep state'), (825, 'RAD'), (826, 'Occasion'), (872, 'RAD'), (881, 'x3'), (915, 'SMOKSTAT'), (923, 'isns'), (934, 'family structure'), (959, 'parents'), (983, 'Wifes education'), (991, 'buying'), (1014, 'DMFT.Begin'), (1169, 'Airline'), (1216, 'click'), (1217, 'click'), (1218, 'click'), (1235, 'elevel'), (1236, 'size'), (1237, 'size'), (1470, 'V2'), (1481, 'V3'), (1483, 'V1'), (1498, 'V5'), (1557, 'V1'), (1568, 'V1'), (4135, 'RESOURCE'), (4552, 'V1')
