arXiv:2105.06965v3 [cs.CL] 15 Sep 2021

Accepted in CoNLL 2021
Counterfactual Interventions Reveal the Causal Effect of Relative Clause Representations on Agreement Prediction

Shauli Ravfogel*1,2   Grusha Prasad*3   Tal Linzen4   Yoav Goldberg1,2

1 Computer Science Department, Bar Ilan University
2 Allen Institute for Artificial Intelligence
3 Cognitive Science Department, Johns Hopkins University
4 Department of Linguistics and Center for Data Science, New York University

[email protected], [email protected], {shauli.ravfogel, yoav.goldberg}@gmail.com

Abstract

When language models process syntactically complex sentences, do they use their representations of syntax in a manner that is consistent with the grammar of the language? We propose AlterRep, an intervention-based method to address this question. For any linguistic feature of a given sentence, AlterRep generates counterfactual representations by altering how the feature is encoded, while leaving intact all other aspects of the original representation. By measuring the change in a model's word prediction behavior when these counterfactual representations are substituted for the original ones, we can draw conclusions about the causal effect of the linguistic feature in question on the model's behavior. We apply this method to study how BERT models of different sizes process relative clauses (RCs). We find that BERT variants use RC boundary information during word prediction in a manner that is consistent with the rules of English grammar; this RC boundary information generalizes to a considerable extent across different RC types, suggesting that BERT represents RCs as an abstract linguistic category.

1 Introduction

The success of neural language models, both in NLP tasks and as cognitive models, has fueled targeted evaluation of these models' word prediction accuracy on a range of syntactically complex constructions (Linzen et al., 2016; Gauthier et al., 2020; Warstadt et al., 2020; Mueller et al., 2020; Marvin and Linzen, 2018). What are the internal representations that support such sophisticated syntactic behavior? In this paper, we tackle this question using an intervention-based approach (Woodward, 2005). Our method, AlterRep, is designed to study whether a model uses a particular linguistic feature in a manner which is consistent with the grammar of the language. The method involves two steps:

∗ Equal contribution.

first, it generates counterfactual1 contextual word representations by altering the neural network's representation of the linguistic feature under consideration; and second, it characterizes the change in the model's word prediction behaviour that results from replacing the original word representations with their counterfactual variants. If the resulting change in word prediction aligns with predictions from linguistic theory, we can infer that the model uses the feature under consideration in a manner consistent with the grammar of the language.

We demonstrate the utility of AlterRep using relative clauses (RCs). According to the grammar of English, to correctly determine whether the masked verb in (1) should be singular or plural, a model must recognize that the masked verb is outside the RC the officers love, and should therefore agree with the subject of the main clause (the skater, which is singular), rather than with the subject of the RC (the officers, which is plural).

(1) The skater the officers love [MASK] happy.

To investigate whether a neural model uses RC boundary representations as predicted by the grammar of English, we use AlterRep to generate two counterfactual representations of the masked verb: one which encodes (incorrectly) that the verb is inside the RC, and another which encodes (correctly) that the verb is outside the RC. Crucially, the difference between the counterfactual and original representations is minimal: the aspects of the representation which do not encode information about RC boundaries remain unchanged. Therefore, if the model uses RC boundary information as dictated by the grammar of English—and if our method successfully identifies the way in which RC boundary information is represented by the model—we expect the incorrect counterfactual to cause the masked verb to agree, incorrectly, with the noun inside the RC, and the correct counterfactual to cause it to agree, correctly, with the noun outside the RC.

1 We use the word counterfactual as it is used when referring to counterfactual examples (Verma et al., 2020): an altered version of an element that is similar to the original element in all aspects except one.



Figure 1: Causal analysis with counterfactual intervention. Given a representation h of a masked word, we derive two new representations h−, h+ that differ in the information they contain with respect to a specific linguistic property. The predictions of the model over the counterfactual representations are compared with the original prediction Y.


We report experiments applying this logic to BERT variants of different sizes (Devlin et al., 2019; Turc et al., 2019). We found that while all layers of the BERT variants encoded information about RC boundaries, only the information encoded in the middle layers was used in a manner consistent with the grammar of English. This contrast highlights the pitfalls of drawing behavioral conclusions from probing results alone, and motivates causal approaches such as AlterRep.

For BERT-base, we also found that counterfactual representations learned solely from one type of RC influenced the model's predictions in sentences containing other RC types, suggesting that this model encodes information about RC boundaries in an abstract manner that generalizes across different RC types. Going beyond our case study of RC representations in BERT variants, we hope that future work can apply this method to test linguistically motivated hypotheses about a wide range of structures, tasks and models.

2 Background

2.1 Relative clauses (RCs)

An RC is a subordinate clause that modifies a noun. The head of the RC needs to be interpreted twice—once in the main clause, and once inside the RC—but it is omitted from inside the RC, replaced by an unpronounced "gap". For example, in (2), the RC (in bold) describes the subject of the main clause, the books. Since the books is the object of the embedded clause, we say that the gap is in the object position of the RC (indicated by underscores).

(2) The books that my cousin likes __ were interesting. (Object RC)

RCs can structurally differ from the Object RC in (2) in several ways: the overt complementizer that can be excluded, as in (3); the gap can be in the subject instead of the object position of the embedded clause, as in (4); and so on. The five types of RCs we consider in this paper are outlined in Table 1.

(3) The books my cousin likes __ were interesting. (Reduced Object RC)

(4) My cousin that __ likes the books was interesting. (Subject RC)

These differences do not affect the strategy that a system that follows the grammar of English should use to determine the number of the verb: regardless of the internal structure of the RC, a verb outside the RC should agree with the subject of the main clause, whereas a verb inside the RC should agree with the subject of the RC. Thus, a model that does not properly identify the boundaries of the RC will often predict a singular verb where a plural one is required, or vice versa.

2.2 Iterative Null Space Projection (INLP)

INLP (Ravfogel et al., 2020) is a method for selectively identifying and removing user-defined concept features from a contextual representation. Let T be a set of words-in-context, and let H be the set of representations of T, such that \vec{h}_t \in R^d is the contextualized representation of the word t \in T. Let F be a linguistic feature that we hypothesize is encoded in H. Given H and the values f_t of the feature F for each word, INLP returns a set of m linear classifiers, each of which predicts F with above-chance accuracy. Each of these classifiers is a vector in R^d, and corresponds to a direction in the representation space. The m vectors can be arranged in a matrix W \in R^{m \times d}. Since the m classifiers are mutually orthogonal, so are the rows of W. Each linear classifier can be interpreted as defining a separating plane, which is intended to partition the space, as much as possible, according to the values of the feature F. In our case, F can take one of two values—whether or not a given word t is in an RC—and each direction in W is intended to separate words that are inside RCs from words that are outside them.2


Abstract structure            | Example
Unreduced Object RC (ORC)     | The conspiracy that the employee welcomed divided the beautiful country.
Reduced Object RC (ORRC)      | The conspiracy the employee welcomed divided the beautiful country.
Unreduced Passive RC (PRC)    | The conspiracy that was welcomed by the employee divided the beautiful country.
Reduced Passive RC (PRRC)     | The conspiracy welcomed by the employee divided the beautiful country.
Active Subject RC (SRC)       | The employee that welcomed the conspiracy quickly searched the buildings.
P/OR(R)C-matched Coordination | The conspiracy welcomed the employee and divided the beautiful country.
SRC-matched Coordination      | The employee welcomed the conspiracy and quickly searched the buildings.

Table 1: Examples of sentences generated from the 5 RC structures and the 2 coordination structures. Elements which only occur in a subset of the examples are indicated in grey. This table is copied from Prasad et al. (2019).

Figure 2: Generating counterfactual representations. A representation \vec{h}_t of a word outside of an RC is transformed to create counterfactual mirror images \vec{h}^-_t, \vec{h}^+_t with respect to an empirical RC subspace. The RC subspace here is a 1-dimensional line for illustrative purposes; in practice we use an 8-dimensional subspace.


The feature subspace—the space spanned by all the learned directions (R = span(W))—is a subspace of the original representation space that contains information useful to linearly decode F with high accuracy. The orthogonal complement of R (the null space, N) is a subspace in which it is not possible to predict F with high accuracy.
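To make this procedure concrete, here is a minimal sketch of the INLP loop described above, using scikit-learn's LinearSVC (§4.2 reports that the experiments use INLP with SVM classifiers). The function name, the explicit re-orthogonalization step, and the classifier settings are our assumptions, not the authors' released implementation.

```python
import numpy as np
from sklearn.svm import LinearSVC

def inlp(X, y, m):
    """Iterative Nullspace Projection (sketch).

    X: (n, d) word representations; y: (n,) binary labels
    (1 = inside an RC, 0 = outside). Returns W, an (m, d) matrix
    whose rows are mutually orthogonal directions predictive of y.
    """
    X_res = X.copy()                      # residual representation
    directions = []
    for _ in range(m):
        clf = LinearSVC(max_iter=10_000).fit(X_res, y)
        w = clf.coef_[0]
        w /= np.linalg.norm(w)
        # Re-orthogonalize numerically against earlier directions
        # (they are orthogonal in theory, up to floating point).
        for prev in directions:
            w -= (w @ prev) * prev
        w /= np.linalg.norm(w)
        directions.append(w)
        # Project the residual onto the nullspace of w: remove the
        # component of each representation along w.
        X_res = X_res - np.outer(X_res @ w, w)
    return np.stack(directions)           # rows span the feature subspace R
```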

3 AlterRep: Generating Counterfactual Representations

The goal of AlterRep is to generate, based on a model's contextual representations of a set of words, a set of counterfactual representations that modify the encoding of a feature F while leaving all other aspects of the representations intact.3

2 In this paper, we make the simplifying assumption that sentences do not contain RCs that are nested within one another (cf. Lakretz et al. 2021). To accommodate such sentences in future work, an integer feature could be used whose value would be 0 if the word is outside any RC; 1 if it is inside an RC of depth 1; 2 if it is inside an RC of depth 2; and so on. As long as we specify a bound on the embedding depth, this feature would still be categorical, and a variant of our method could still be used.


If swapping these counterfactual representations for the model's original representations changes the model's probability distribution over predicted words in a way that aligns with the feature's linguistic functions, we say that the model uses F for word prediction in a manner that is consistent with the grammar of English.

For our case study, we use a feature with two possible values: '+' if the word is inside an RC and '−' if it is not.4 We generate two counterfactual representations: \vec{h}^+_t, which encodes that the word t is inside an RC—regardless of its actual syntactic position in the sentence—and \vec{h}^-_t, which is similar to \vec{h}^+_t in all respects except that it encodes that t is not in an RC. Our method allows us to generate \vec{h}^+_t and \vec{h}^-_t irrespective of the feature value encoded in the original representation \vec{h}_t. If the model uses this feature appropriately, we expect \vec{h}^+_t and \vec{h}^-_t to lead to different predictions in contexts where correct word predictions depend on determining whether or not the word is inside an RC.

Row-space and Null-space. Recall that INLP defines a feature subspace R where the property of interest is encoded, and a complement subspace N where it is not.

We can project any word representation \vec{h}_t onto the feature subspace (here, the RC subspace) or onto the null space, resulting in the vectors \vec{h}^R_t and \vec{h}^N_t, respectively: \vec{h}^R_t maintains the information needed to predict F from \vec{h}_t, while \vec{h}^N_t maintains all information which is not relevant for predicting F. INLP can be used to generate "amnesic counterfactuals" (Elazar et al., 2021), which do not encode a given property, even if the original representation did encode that property. In the next paragraph we propose a way to use this algorithm to manipulate the value of the feature, rather than remove it.

3 We aim to propose a concrete instantiation that approximates the counterfactual.

4 In the experiments below, we will only apply this procedure to sets of representations of words that are all in a particular type of RC (for example, Object RCs). We do, however, test whether the representations of RC boundaries generalize across RC types; see Prediction 3 in §5.



Generating Counterfactual Representations. We obtain the counterfactual representations \vec{h}^+_t and \vec{h}^-_t as follows. As we discussed in Section 2.2, INLP identifies planes—one for each direction (row) in W—each of which linearly divides the word-representation space into two parts: words that are in an RC and words that are not. From the representation \vec{h}_t of a word t that is not in an RC, we can generate \vec{h}^+_t by pushing \vec{h}_t across the separating plane, towards the representations of words that are inside an RC. Similarly, we can generate \vec{h}^-_t by moving \vec{h}_t further away from that plane (see Figure 2).5

How do we move the representation of a word away from or towards the separating planes? Recall that the feature subspace R and the nullspace N are orthogonal complements, and consequently any vector \vec{v} \in R^d can be represented as a sum of its projections on R and N. Further, by definition, the vector's projection on R is the sum of its projections on the RC directions \vec{w} \in W. Thus, we can decompose \vec{h}_t as follows, where \vec{h}^{\vec{w}}_t is the orthogonal projection of \vec{h}_t on direction \vec{w}:

\vec{h}_t = \vec{h}^N_t + \vec{h}^R_t = \vec{h}^N_t + \sum_{\vec{w} \in W} \vec{h}^{\vec{w}}_t    (1)

For any word t, we expect a positive counterfactual \vec{h}^+_t to be classified as being inside an RC, with high confidence, according to all original RC directions \vec{w} \in W — that is, \forall \vec{w} \in W: \vec{w}^T \vec{h}^+_t > 0. Conversely, we expect a negative counterfactual to be classified as not being in an RC, i.e., \forall \vec{w} \in W: \vec{w}^T \vec{h}^-_t < 0.

To enforce these desiderata, we create positive and negative counterfactuals as follows, where SIGN(x) = 1 if x ≥ 0 and 0 otherwise, and α is a positive scalar hyperparameter that enhances or dampens the effect of the intervention:5

\vec{h}^-_t = \vec{h}^N_t + \alpha \sum_{\vec{w} \in W} (-1)^{\mathrm{SIGN}(\vec{w}^T \vec{h}_t)} \, \vec{h}^{\vec{w}}_t    (2)

\vec{h}^+_t = \vec{h}^N_t + \alpha \sum_{\vec{w} \in W} (-1)^{1 - \mathrm{SIGN}(\vec{w}^T \vec{h}_t)} \, \vec{h}^{\vec{w}}_t    (3)

In both cases, we subtract a direction \vec{h}^{\vec{w}}_t, flipping its sign, if the sign constraints are violated — that is, if \vec{w}^T \vec{h}_t > 0 for \vec{h}^-_t, and if \vec{w}^T \vec{h}_t < 0 for \vec{h}^+_t. Geometrically, flipping the sign of a direction \vec{h}^{\vec{w}}_t in Equations 2 and 3 is equivalent to taking a mirror image with respect to the direction \vec{w} (Figure 2). This enforces the sign constraints: all classifiers \vec{w} predict the negative or positive class, respectively (see Appendix §A.1 for a formal proof).

5 For a word t that is inside an RC, the reverse computations would be required: to generate \vec{h}^+_t we would move \vec{h}_t further away from the separating plane, whereas to generate \vec{h}^-_t we would move \vec{h}_t across the separating plane.
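A sketch of how Equations 2 and 3 can be implemented, assuming an INLP matrix W with orthonormal rows (as produced, for instance, by the inlp sketch in §2.2). With orthonormal rows, flipping the sign of a component \vec{h}^{\vec{w}}_t reduces to forcing the sign of the projection coefficient \vec{w}^T \vec{h}_t; the function and variable names are ours.

```python
import numpy as np

def alter_rep(h, W, alpha=4.0, positive=True):
    """Generate a counterfactual representation (Eqs. 2-3, sketch).

    h: (d,) original representation; W: (m, d) INLP directions with
    orthonormal rows; alpha: intervention magnitude (the paper uses 4).
    positive=True yields h+ ("inside an RC"); False yields h-.
    """
    coeffs = W @ h                 # signed projections w^T h, shape (m,)
    h_R = W.T @ coeffs             # component in the RC subspace R
    h_N = h - h_R                  # component in the nullspace N
    # Force every projection coefficient to the sign of the desired
    # class, so that each classifier w predicts that class.
    target = 1.0 if positive else -1.0
    return h_N + alpha * (W.T @ (target * np.abs(coeffs)))
```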

4 Experimental Setup

Our overall goal is to assess the causal effect of RC boundary representations on our models' agreement behavior when subject-verb dependencies span an RC (that is, where an RC intervenes between the head of the subject and the corresponding verb). We test whether we can modify the representation of the masked verb outside the RC such that, compared to the original representations, the model assigns higher probability to either the correct form (after the negative intervention) or to the incorrect one (after the positive intervention). We first describe the models we use (§4.1), then the dataset we use to obtain RC subspaces and generate counterfactual representations (§4.2), and finally the dataset we use to measure the models' agreement prediction accuracy in sentences containing RCs, before and after the counterfactual intervention (§4.3).

4.1 Models

We use BERT-base (12 layers, 768 hidden units) and BERT-large (24 layers, 1024 hidden units) (Devlin et al., 2019), as well as the smaller BERT models released by Turc et al. (2019): BERT-medium (8 layers, 512 hidden units), BERT-small (4 layers, 512 hidden units), BERT-mini (4 layers, 256 hidden units), and BERT-tiny (2 layers, 128 hidden units). In all experiments, we intervene on a single layer at a time, and continue the forward pass of the original model through the following layers.
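One way to realize "intervene on a single layer and continue the forward pass" is a PyTorch forward hook on the chosen layer; the following is a minimal sketch using the HuggingFace transformers API. The helper names and the single-sentence setup are our assumptions, and the real experiments would add batching, tensor/NumPy conversion for AlterRep, and per-layer sweeps.

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()
tok = BertTokenizer.from_pretrained("bert-base-uncased")

def intervene_at_layer(layer_idx, mask_pos, make_counterfactual):
    """Swap the [MASK] representation at one layer's output and let
    the forward pass continue unchanged through the later layers.
    `make_counterfactual` is assumed to wrap AlterRep (Eqs. 2-3)."""
    def hook(module, inputs, output):
        hidden = output[0].clone()            # (batch, seq_len, hidden)
        hidden[0, mask_pos] = make_counterfactual(hidden[0, mask_pos])
        return (hidden,) + output[1:]
    return model.bert.encoder.layer[layer_idx].register_forward_hook(hook)

enc = tok("The skater the officers love [MASK] happy.", return_tensors="pt")
mask_pos = (enc["input_ids"][0] == tok.mask_token_id).nonzero().item()

handle = intervene_at_layer(6, mask_pos, lambda h: h)   # identity: sanity check
with torch.no_grad():
    logits = model(**enc).logits                        # (1, seq_len, vocab)
handle.remove()                                         # detach the hook
```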

4.2 Generating Counterfactual Representations

Datasets. To create the training data for the INLP classifiers, we used the templates of Prasad et al. (2019) to generate five lexically matched sets of semantically plausible sentences, one for each type of RC outlined in Table 1, as well as two additional sets of sentences without RCs; these included sentences with nearly the same word order and lexical content as the sentences in the other sets. Each set contained 4800 sentences. All verbs in the training sentences were in the past tense; this ensured that the subspaces we identified did not contain information about overt number agreement, making it unlikely that AlterRep will alter agreement-related information that does not concern RCs.

Identifying and Altering RC Subspaces. To identify RC subspaces, we used INLP with SVM classifiers as implemented in scikit-learn. We identified different subspaces for each of the five types of RCs listed in Table 1. For example, in (5), the bolded words were considered to be in the RC.

(5) My cousin that liked the book hated movies.

For the negative examples, we took representations of words outside of the RC, either from outside the bolded region of the same sentence, or from inside or outside the bolded region of the coordination control sentence.

(6) My cousin liked the book and hated movies.

We selected the negative examples in this manner for two reasons: first, to ensure that the same word served as a positive example in some contexts and as a negative example in others (e.g., book in (5) and (6)); and second, to ensure that the same RC sentence included both positive and negative examples (e.g., book and cousin in (5)).

Hyperparameters. INLP has a hyperparameter m which sets the dimensionality of the RC subspace; this parameter trades off exhaustivity against selectivity.6 We set m = 8; in Appendix §A.3 we demonstrate that the trends we observe are not substantially affected by this parameter.

AlterRep has a hyperparameter, α, that determines the magnitude of the counterfactual intervention (§4.2). We use α = 4; in Appendix §A.4 we show that the trends we observe are similar for other values of α.

6 In particular, running INLP for 768 iterations—the dimensionality of BERT representations—yields the original space, which is exhaustive but not useful in distilling RC information.

4.3 Measuring the Effect of the Intervention on Agreement Accuracy

Dataset. We measure the models' agreement prediction accuracy using a subset of the Marvin and Linzen (2018) dataset in which the subject is modified by an RC. The noun inside the RC either matched (7) or mismatched (8) the subject of the matrix clause in number:

(7) The skater that the officer loves is/are happy.

(8) The skater that the officers love is/are happy.

The Marvin and Linzen dataset contains sentences where the intervening RC is either a subject RC or a (reduced or unreduced) object RC. We augmented this dataset with lexically matched sentences with (reduced or unreduced) passive RC interveners, using attribute-varying grammars (Mueller et al., 2020). Finally, we only considered sentences with copular main verbs (is and are) to ensure that both the singular and plural forms of the verb are highly frequent. We used 1750 sentences per construction.

Computing Agreement Accuracy. We performed masked language modeling (MLM) on the dataset described earlier in this section. In each sentence, we masked the copula, started the forward pass, performed the intervention on the representation of the masked copula in the layer of interest, and continued with the forward pass to obtain BERT's distribution over the vocabulary for the masked token. We repeated this process for each layer separately. We then computed the probability of error, normalized within the two copulas is and are (Arehalli and Linzen, 2020):

P(\mathrm{Err}) = \frac{P(\mathrm{Verb}_{\mathrm{Incorrect}})}{P(\mathrm{Verb}_{\mathrm{Incorrect}}) + P(\mathrm{Verb}_{\mathrm{Correct}})}    (4)

In Appendix §A.5, we present results where the metric of success is accuracy, that is, the percentage of cases where the model assigned a higher probability to the verb with the correct number (Marvin and Linzen, 2018). These results are qualitatively similar.
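Given the MLM logits at the masked position (for instance, from the hook sketch in §4.1), Equation 4 amounts to renormalizing over the two copulas; a minimal sketch, with names that are ours:

```python
import torch

def p_err(logits_at_mask, correct_id, incorrect_id):
    """Probability of error normalized over the two copulas (Eq. 4).

    logits_at_mask: (vocab_size,) MLM logits at the [MASK] position.
    correct_id / incorrect_id: vocabulary ids of the grammatical and
    ungrammatical copula, e.g. tok.convert_tokens_to_ids("is") and
    tok.convert_tokens_to_ids("are") for a singular subject.
    """
    probs = torch.softmax(logits_at_mask, dim=-1)
    p_corr, p_inc = probs[correct_id], probs[incorrect_id]
    return (p_inc / (p_inc + p_corr)).item()
```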

5 Predictions

As discussed earlier, a system that computed agreement in accordance with the grammar of English would determine the number of the masked verb in a sentence like (9) based on the number of officers, because both officers and the [MASK] token are outside the RC; the number of the RC-internal noun skater should be ignored.


[Figure 3a: line plot of P(err) after intervention vs. layer; panels: Different RC type, Same RC type.]
(a) RC sentences with attractors. In the right panel, the test sentence included an RC of the type used to generate the counterfactual representations; in the left panel, counterfactual representations were generated based on sentences with different RC types from those in the agreement test sentences.

[Figure 3b: line plot of P(err) after intervention vs. layer; panels: Simple agreement, Sentential complement, Across RC no attractors.]
(b) Sentences without RCs and sentences with an RC but without attractors.

[Figure 3c: line plot of P(err) after intervention vs. layer; panels: Different RC type, Same RC type; y-axis range 0.5–0.8.]
(c) Sentences where, before the intervention, the model assigned a higher probability to the ungrammatical than the grammatical verb. Note that the y-axis differs from the other plots (reflecting the higher original error probability).

[Figure 3d: line plot of P(err) after intervention vs. layer; panels: Simple agreement, Sentential complement, Across RC no attractors, Across RC with attractors.]
(d) Intervention with counterfactual representations generated from 10 random subspaces.

Figure 3: Change in probability of error with negative and positive counterfactual BERT-base representations (red circle and cyan triangle, respectively). Horizontal lines indicate the probability of error with the original representations, without any intervention: the middle line is the mean probability of error across all items prior to intervention, and the upper and lower lines indicate values two standard errors away from that mean. Error bars reflect two standard errors from the mean probability of error after intervention.


(9) The officers that love the skater [MASK] nice.

We can derive the following predictions for applying AlterRep to a system that follows this strategy:

Prediction 1: Impact on Error Probability in RC Sentences with Attractors. In RC sentences where the main clause subject differs in number from the RC subject, error probability will be higher with the counterfactual \vec{h}^+_{MASK}, which encodes (incorrectly) that [MASK] is inside the RC, than with the original representation \vec{h}_{MASK}. Conversely, error probability will be lower with \vec{h}^-_{MASK}, which encodes (correctly) that [MASK] is outside the RC, than with the original \vec{h}_{MASK}.

Prediction 2: No Impact on Other Sentences. We do not expect a difference in error probability between the original and counterfactual representations in any of the other sentences. This should be the case for sentences with RCs where the nouns inside and outside the RC match in number, as in (10):

(10) The officers that love the skaters [MASK] nice.

Since both officers and skaters are plural, most plausible agreement prediction strategies would make the same predictions regardless of whether [MASK] is analyzed as being inside the RC or outside it. Consequently, intervening on the encoding of RC boundaries is not expected to systematically change the model's predictions.

Likewise, since the interventions are designed to modulate the encoding of RC-related properties, we do not expect the interventions to impact number prediction in sentences without RCs, such as (11) and (12):7

(11) The officer [MASK] nice.

(12) The bankers knew the officer [MASK] nice.

Prediction 3: Generalization Across RC Types. If RC boundaries are represented in an abstract way that is shared across different RC types, then the counterfactual representations will affect error probability in the same way regardless of whether they were generated from subspaces estimated from sentences with the same RC type as the target sentences, or from sentences with different RC types.

6 Results

Counterfactual Intervention in the Middle Layers of BERT-base Modulates Agreement Error Rate in RC Sentences with Attractors. We begin by discussing experiments where subspaces were estimated from sentences with the same type of RC as the test sentences with agreement; we report results averaged across the five RC types. Interventions using counterfactual representations generated from the middle layers of BERT-base (5–8 out of 12) resulted in changes in the probability of error which partially aligned with Prediction 1 (Figure 3a). In sentences with attractors, using the positive counterfactual \vec{h}^+_{MASK} resulted in an increase in the probability of error (a maximum increase of 14 percentage points in layer 7). Conversely, using the negative counterfactual \vec{h}^-_{MASK} generated from layers 5 and 6 resulted in a decrease in the probability of error. This decrease was much smaller (a maximum decrease of 2 percentage points in layer 6), and there was overlap in the error bars for the probability of error before and after the intervention.

It is likely that the smaller effects of the negative counterfactual intervention are due to the fact that accuracy before the intervention was very high (95%) and the probability of error very low (8%), leaving very little room for change: in most cases, the original representation already correctly encoded that the verb is outside the RC. In a follow-up analysis, we only considered sentences in which the model originally assigned a higher probability to the ungrammatical than the grammatical form. In these examples the decrease in probability of error was larger (a maximum decrease of 16 percentage points in layer 6; see Figure 3c).

7 If models encoded the boundaries of all embedded clauses similarly, we would expect a change in prediction for (12).


While only RC interventions in the middle layers elicit the expected behavioral outcomes, probing accuracy for RC information was high in all layers (Appendix §A.2), providing further evidence for the dissociation between correlational and causal methods: probing can identify aspects of the representations that do not affect the model's behavior.

Interventions on RC Boundary Representations Generalize Across RC Types, but not Further. In line with Prediction 3, we observed a qualitatively similar pattern of change in error probabilities even when the counterfactuals were generated from subspaces estimated from a different RC type than the RC in the agreement test sentences. The effects were smaller, however. This suggests that while BERT's representation of RC boundaries is partly shared across different RC types, there are also structure-specific RC boundary representations. The effect of the intervention also aligned with Prediction 2: in constructions where we do not expect RC boundaries to affect predictions—sentences without attractors and those without RCs—we did not observe significant changes in error probability (Figure 3b).

Intervention Based on Random Subspaces Does Not Produce Interpretable Results. To tease apart the effect of the RC-targeted intervention from the effect of intervening on any subspace of the same dimensionality, we generated counterfactual representations from 10 random subspaces and repeated our analysis.8 While we observed very small changes in probability of error in some cases, the pattern of changes resulting from this intervention did not align with any of our predictions (see Figure 3d). This suggests that the change in probability of error that resulted from intervening with RC subspaces was not merely a by-product of intervening on a large enough subspace of BERT's original representation space.

8 We generated a random subspace by sampling standard Gaussian vectors instead of the INLP matrix W, and then employing the same procedure described in §3.
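A sketch of this random-subspace baseline; the QR orthonormalization of the sampled Gaussian vectors is our assumption, added so that the decomposition in Equation 1 applies exactly. The result can be passed to the alter_rep sketch from §3 in place of the INLP matrix W.

```python
import numpy as np

def random_subspace(d=768, m=8, seed=0):
    """Random counterpart of the INLP matrix W (footnote 8, sketch)."""
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((d, m))   # m standard Gaussian vectors in R^d
    Q, _ = np.linalg.qr(G)            # (d, m), orthonormal columns
    return Q.T                        # rows play the role of W's rows
```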


[Figure 4: line plots of P(err) after intervention vs. layer, one row per model (BERT-large, BERT-base, BERT-medium, BERT-small, BERT-mini, BERT-tiny) and one column per condition (RC subspace intervention, Random subspace intervention).]
Figure 4: Change caused by counterfactual representations in agreement error probability across RCs with attractors, for different BERT variants. Note that the baseline performance prior to intervention (marked by black horizontal lines) differs between models.

Intervening on the Middle Layers of Other BERT Variants Yielded Qualitatively Similar Results. We repeated the experiments on BERT-large and four smaller versions of BERT, trained on the same amount of data as the BERT-base model (Turc et al., 2019). As with BERT-base, intervening on the middle layers of BERT-large (12–17 out of 24) with the RC subspaces—but not the random subspaces—resulted in the predicted changes in the probability of error. Compared to BERT-base, the smaller models showed a greater change in the probability of error as a result of intervention with counterfactuals generated from random subspaces. However, when the counterfactual representations were generated from particular layers—4 and 5 (out of 8) in BERT-medium, 3 (out of 4) in BERT-mini, and 2 (out of 4) in BERT-small—the change in error probability aligned with Prediction 1 over and above the changes from intervening with random subspaces. In all of these layers, intervening with the positive but not the negative counterfactual resulted in an increase in the probability of error. No such layer was observed for BERT-tiny, which has only 2 layers (see Figure 4).

7 Discussion

We proposed an intervention-based method, AlterRep, to test whether language models use the linguistic information encoded in their representations in a manner that is consistent with the grammar of the language they are trained on. For a given linguistic feature of interest, we generated counterfactual contextual word representations by manipulating the value of the feature in the original representations. Then, by replacing the original representations with these counterfactual variants, we characterized the change in word prediction behaviour. By comparing the resulting change in word prediction with hypotheses from linguistic theory about how specific values of the feature are expected to influence the probabilities over predicted words, we investigated whether the model uses the feature as expected.

As a case study, we applied this method to study whether altering the information encoded about RC boundaries in the contextual representations of masked verbs in different BERT variants influences the verb's number inflection in a manner that is consistent with the grammar of English. We found that while all layers of the BERT variants encoded information about RC boundaries, only the information in the middle layers influenced the masked verb's number inflection as predicted by English grammar. We also found that in BERT-base, counterfactual representations based on subspaces that were learned from sentences with one type of RC influenced the number inflection of the masked verb in sentences with other types of RCs; this suggests that the model encodes information about RC boundaries in an abstract manner that generalizes across the different RC types.

Caveat: Linear Analysis of a Non-linear Network. AlterRep interventions are based on concept subspaces identified using linear classifiers, but most neural network components, including BERT layers, are non-linear. It is possible, then, that subsequent non-linear layers transform the counterfactual representation in a way that is not amenable to analysis using our methods. As such, while we can conclude from a positive result that the feature in question causally affects the model's behavior, negative results should be interpreted more cautiously.

Future Work. Future work can apply this method to test linguistically motivated hypotheses about a wide range of structures and tasks. For example, linguistic theory predicts that information about semantic roles (like agent and patient) is crucial for tasks such as natural language inference (NLI) that require reasoning about sentence meaning. To test whether NLI models use semantic roles as predicted by linguistic theory, we can use AlterRep to replace the original representations with counterfactual representations in which the patient is encoded as the agent (and vice versa), and measure the change in performance on NLI, especially on challenge sets such as HANS (McCoy et al., 2019) that evaluate sensitivity to these properties.

8 Related Work

Probing and Causal Analysis. Behavioral tests of neural models, such as the ability of the model to master agreement prediction (Linzen et al., 2016; Gulordava et al., 2018; Goldberg, 2019), have exposed both impressive capabilities and limitations. These paradigms focus on the model's output, and do not link the behavioral output with the information encoded in its representations. Conversely, probing (Adi et al., 2017; Conneau et al., 2018; Hupkes et al., 2018) does not reveal whether the property recovered by the probe affects the original model's prediction in any way (Hewitt and Liang, 2019; Tamkin et al., 2020; Ravichander et al., 2021). This has sparked interest in identifying the causal factors that underlie the model's behavior (Vig et al., 2020; Feder et al., 2020; Voita et al., 2020; Kaushik et al., 2020; Slobodkin et al., 2021; Pryzant et al., 2021; Finlayson et al., 2021).

Counterfactuals. The relation between counterfactual reasoning and causality is extensively discussed in the social science and philosophy literature (Woodward, 2005; Miller, 2018, 2019). Attempts have been made to generate counterfactual examples (Maudslay et al., 2019; Zmigrod et al., 2019; Ross et al., 2020; Kaushik et al., 2020; Hvilshøj et al., 2021) and, recently, to derive counterfactual representations (Feder et al., 2020; Elazar et al., 2021; Jacovi et al., 2021; Shin et al., 2020; Tucker et al., 2021). Contrary to our approach, previous attempts to generate counterfactual representations were either limited to amnesic operations (i.e., focused on the removal of information rather than on modifying the encoded information) or used gradient-based interventions, which are expressive and powerful, but less controllable. Our linear approach is guided by well-defined desiderata: we want all linear classifiers trained on the original representation to predict a specific class for the counterfactual representations, and we prove that this is the case in Appendix §A.1.

Representations and Behavior. Previous work bridging the gap between representations and behavior includes Giulianelli et al. (2018), who demonstrated that back-propagating an agreement probe into a language model induces behavioral changes and improves predictions. Lakretz et al. (2019) identified individual neurons that causally support agreement prediction. Prasad et al. (2019) used similarity measures between different RC types, extracted using behavioural methods, to investigate the inner organization of information within the model. Closest to our work is Elazar et al. (2021), where the authors applied INLP to "erase" certain distinctions from the representation, and then measured the effect of the intervention on language modeling. We extend INLP to generate flexible counterfactual representations (§3) and use these to instantiate hypotheses about the linguistic factors that guide the model's behavior.

9 Conclusions

We proposed an intervention-based approach to study whether a model uses a particular linguistic feature as predicted by the grammar of the language it was trained on. To do so, we generated counterfactual representations in which the linguistic property under consideration was altered but all other aspects of the representation remained intact. Then, we replaced the original word representation with the counterfactual one and characterized the change in behaviour. Applying this method to BERT, we found that the model uses information about RC boundaries that is encoded in its word representations when inflecting the number of a masked verb, in a manner consistent with the grammar of English. We conclude that AlterRep is an effective tool for testing hypotheses about the function of the linguistic information encoded in the internal representations of neural LMs.

Acknowledgements

This work was supported by United States–Israel Binational Science Foundation award 2018284, and has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme, grant agreement No. 802774 (iEXTRACT). We thank Robert Frank for a fruitful discussion of an early version of this work, and Marius Mosbach, Hila Gonen and Yanai Elazar for their helpful comments.


References

Yossi Adi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, and Yoav Goldberg. 2017. Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net.

Suhas Arehalli and Tal Linzen. 2020. Neural language models capture some, but not all, agreement attraction effects. In Proceedings of the 42nd Annual Conference of the Cognitive Science Society, pages 370–376.

Alexis Conneau, German Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018. What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2126–2136.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics.

Yanai Elazar, Shauli Ravfogel, Alon Jacovi, and Yoav Goldberg. 2021. Amnesic probing: Behavioral explanation with amnesic counterfactuals. Transactions of the Association for Computational Linguistics, 9:160–175.

Amir Feder, Nadav Oved, Uri Shalit, and Roi Reichart. 2020. CausaLM: Causal model explanation through counterfactual language models. CoRR, abs/2005.13407.

Matthew Finlayson, Aaron Mueller, Sebastian Gehrmann, Stuart Shieber, Tal Linzen, and Yonatan Belinkov. 2021. Causal analysis of syntactic agreement mechanisms in neural language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1828–1843, Online. Association for Computational Linguistics.

Jon Gauthier, Jennifer Hu, Ethan Wilcox, Peng Qian, and Roger Levy. 2020. SyntaxGym: An online platform for targeted evaluation of language models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 70–76.

Mario Giulianelli, Jack Harding, Florian Mohnert, Dieuwke Hupkes, and Willem H. Zuidema. 2018. Under the hood: Using diagnostic classifiers to investigate and improve how language models track agreement information. In Proceedings of the Workshop: Analyzing and Interpreting Neural Networks for NLP, BlackboxNLP@EMNLP 2018, Brussels, Belgium, November 1, 2018, pages 240–248. Association for Computational Linguistics.

Yoav Goldberg. 2019. Assessing BERT's syntactic abilities. CoRR, abs/1901.05287.

Kristina Gulordava, Piotr Bojanowski, Edouard Grave, Tal Linzen, and Marco Baroni. 2018. Colorless green recurrent networks dream hierarchically. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, pages 1195–1205.

John Hewitt and Percy Liang. 2019. Designing and interpreting probes with control tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2733–2743, Hong Kong, China. Association for Computational Linguistics.

Dieuwke Hupkes, Sara Veldhoen, and Willem Zuidema. 2018. Visualisation and 'diagnostic classifiers' reveal how recurrent and recursive neural networks process hierarchical structure. Journal of Artificial Intelligence Research, 61:907–926.

Frederik Hvilshøj, Alexandros Iosifidis, and Ira Assent. 2021. ECINN: Efficient counterfactuals from invertible neural networks. CoRR, abs/2103.13701.

Alon Jacovi, Swabha Swayamdipta, Shauli Ravfogel, Yanai Elazar, Yejin Choi, and Yoav Goldberg. 2021. Contrastive explanations for model interpretability. CoRR, abs/2103.01378.

Divyansh Kaushik, Eduard H. Hovy, and Zachary Chase Lipton. 2020. Learning the difference that makes a difference with counterfactually-augmented data. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.

Yair Lakretz, Dieuwke Hupkes, Alessandra Vergallito, Marco Marelli, Marco Baroni, and Stanislas Dehaene. 2021. Mechanisms for handling nested dependencies in neural-network language models and humans. Cognition, page 104699.

Yair Lakretz, German Kruszewski, Theo Desbordes, Dieuwke Hupkes, Stanislas Dehaene, and Marco Baroni. 2019. The emergence of number and syntax units in LSTM language models. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 11–20, Minneapolis, Minnesota. Association for Computational Linguistics.

Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4:521–535.

Rebecca Marvin and Tal Linzen. 2018. Targeted syntactic evaluation of language models. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1192–1202, Brussels, Belgium. Association for Computational Linguistics.

Rowan Hall Maudslay, Hila Gonen, Ryan Cotterell, and Simone Teufel. 2019. It's all in the name: Mitigating gender bias with name-based counterfactual data substitution. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 5266–5274. Association for Computational Linguistics.

Tom McCoy, Ellie Pavlick, and Tal Linzen. 2019. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3428–3448, Florence, Italy. Association for Computational Linguistics.

Tim Miller. 2018. Contrastive explanation: A structural-model approach. CoRR, abs/1811.03163.

Tim Miller. 2019. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence, 267:1–38.

Aaron Mueller, Garrett Nicolai, Panayiota Petrou-Zeniou, Natalia Talmina, and Tal Linzen. 2020. Cross-linguistic syntactic evaluation of word prediction models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Seattle, Washington. Association for Computational Linguistics.

Grusha Prasad, Marten van Schijndel, and Tal Linzen. 2019. Using priming to uncover the organization of syntactic representations in neural language models. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 66–76, Hong Kong, China. Association for Computational Linguistics.

Reid Pryzant, Dallas Card, Dan Jurafsky, Victor Veitch, and Dhanya Sridhar. 2021. Causal effects of linguistic properties. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, pages 4095–4109. Association for Computational Linguistics.

Shauli Ravfogel, Yanai Elazar, Hila Gonen, Michael Twiton, and Yoav Goldberg. 2020. Null it out: Guarding protected attributes by iterative nullspace projection. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 7237–7256. Association for Computational Linguistics.

Abhilasha Ravichander, Yonatan Belinkov, and Eduard H. Hovy. 2021. Probing the probing paradigm: Does probing accuracy entail task relevance? In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, EACL 2021, Online, April 19-23, 2021, pages 3363–3377. Association for Computational Linguistics.

Alexis Ross, Ana Marasović, and Matthew E. Peters. 2020. Explaining NLP models via minimal contrastive editing (MiCE). CoRR, abs/2012.13985.

Seungjae Shin, Kyungwoo Song, JoonHo Jang, Hyemi Kim, Weonyoung Joo, and Il-Chul Moon. 2020. Neutralizing gender bias in word embedding with latent disentanglement and counterfactual generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, EMNLP 2020, Online Event, 16-20 November 2020, pages 3126–3140. Association for Computational Linguistics.

Aviv Slobodkin, Leshem Choshen, and Omri Abend. 2021. Mediators in determining what processing BERT performs first. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, pages 86–93. Association for Computational Linguistics.

Alex Tamkin, Trisha Singh, Davide Giovanardi, and Noah D. Goodman. 2020. Investigating transferability in pretrained language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, EMNLP 2020, Online Event, 16-20 November 2020, pages 1393–1401. Association for Computational Linguistics.

Mycal Tucker, Peng Qian, and Roger Levy. 2021. What if this modified that? Syntactic interventions via counterfactual embeddings. CoRR, abs/2105.14002.

Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Well-read students learn better: The impact of student initialization on knowledge distillation. CoRR, abs/1908.08962.

Sahil Verma, John P. Dickerson, and Keegan Hines. 2020. Counterfactual explanations for machine learning: A review. CoRR, abs/2010.10596.

Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, and Stuart M. Shieber. 2020. Causal mediation analysis for interpreting neural NLP: The case of gender bias. CoRR, abs/2004.12265.

Elena Voita, Rico Sennrich, and Ivan Titov. 2020. Analyzing the source and target contributions to predictions in neural machine translation. CoRR, abs/2010.10907.

Alex Warstadt, Alicia Parrish, Haokun Liu, Anhad Mohananey, Wei Peng, Sheng-Fu Wang, and Samuel R. Bowman. 2020. BLiMP: The benchmark of linguistic minimal pairs for English. Transactions of the Association for Computational Linguistics, 8:377–392.

James Woodward. 2005. Making Things Happen: A Theory of Causal Explanation. Oxford University Press.

Ran Zmigrod, S. J. Mielke, Hanna M. Wallach, and Ryan Cotterell. 2019. Counterfactual data augmentation for mitigating gender stereotypes in languages with rich morphology. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28-August 2, 2019, Volume 1: Long Papers, pages 1651–1661. Association for Computational Linguistics.


A Appendix

A.1 Correctness of the Counterfactual Generation

In this appendix, we prove that the method presented in §3 is guaranteed to achieve its goal: the negative counterfactual \vec{h}^-_t would be classified as belonging to the negative class, and the positive counterfactual \vec{h}^+_t would be classified as belonging to the positive class, according to all the linear classifiers \vec{w} trained on the original representation.

We base our derivation on the decomposition presented in §3:

\vec{h}_t = \vec{h}^N_t + \vec{h}^R_t = \vec{h}^N_t + \sum_{\vec{w} \in W} \vec{h}^{\vec{w}}_t    (5)

where N is the nullspace of the INLP matrix W, R is its rowspace, and \vec{h}^N_t and \vec{h}^R_t are the orthogonal projections of a representation \vec{h}_t onto those subspaces, respectively.

We focus on the negative counterfactual; the proof for the positive counterfactual is similar. In the following discussion, \vec{w}_j \in R^d is an arbitrary linear classifier trained on the jth iteration of INLP (one of the rows of the matrix W). \vec{w}_j predicts a negative or positive class y \in \{0, 1\} according to the decision rule y = \mathrm{SIGN}(\vec{w}_j^T \vec{h}_t).9 We denote by \vec{h}^{\vec{w}}_t the orthogonal projection of \vec{h}_t on a direction \vec{w}, given by (\vec{h}_t^T \vec{w})\vec{w}.

Claim A.1. For the negative counterfactual defined by

\vec{h}^-_t = \vec{h}^N_t + \alpha \sum_{i=1}^{m} (-1)^{\mathrm{SIGN}(\vec{w}_i^T \vec{h}_t)} \vec{h}^{\vec{w}_i}_t,

it holds that \vec{h}^-_t would always be classified to the negative class: \vec{w}_j^T \vec{h}^-_t < 0 for every \vec{w}_j in the original INLP matrix W.

9 For simplicity, we define SIGN(x) = 1 if x ≥ 0 and 0 otherwise; 0 corresponds to the negative class.

Proof.

$$w_j^T \vec{h}_t^- = w_j^T \Big( \vec{h}_t^N + \alpha \sum_{i=0}^{m} (-1)^{\mathrm{SIGN}(w_i^T \vec{h}_t)}\, \vec{h}_t^{w_i} \Big) \quad (6)$$

$$= w_j^T \Big( \alpha \sum_{i=0}^{m} (-1)^{\mathrm{SIGN}(w_i^T \vec{h}_t)}\, \vec{h}_t^{w_i} \Big) \quad (7)$$

$$= \alpha\, w_j^T \Big( (-1)^{\mathrm{SIGN}(w_j^T \vec{h}_t)}\, \vec{h}_t^{w_j} \Big) \quad (8)$$

The transition from (6) to (7) stems from $\vec{h}_t^N$ being in the nullspace of $W$, so $\forall w \in W: w^T \vec{h}_t^N = 0$; the transition from (7) to (8) stems from the mutual orthogonality of the INLP classifiers (proved in Ravfogel et al. (2020)): since $\forall j \neq i,\ w_i^T w_j = 0$, it holds that $w_j^T \vec{h}_t^{w_i} = w_j^T \big( (\vec{h}_t^T w_i) w_i \big) = (\vec{h}_t^T w_i)\, w_j^T w_i = 0$.

Now, we consider two cases.

• Case 1: $w_j^T \vec{h}_t > 0$, that is, the classifier predicted the positive class on the original representation. Then, by (8),

$$w_j^T \vec{h}_t^- = \alpha\, w_j^T \big( (-1)^{\mathrm{SIGN}(w_j^T \vec{h}_t)}\, \vec{h}_t^{w_j} \big) \quad (9)$$

$$= \alpha\, w_j^T (-1)\, \vec{h}_t^{w_j} \quad (10)$$

$$= -\alpha\, w_j^T \vec{h}_t^{w_j} \quad (11)$$

Note that $w_j^T \vec{h}_t^{w_j} = (\vec{h}_t^T w_j)\, w_j^T w_j = (w_j^T \vec{h}_t)\, \|w_j\|^2$, which has the same sign as $w_j^T \vec{h}_t$. Since $\alpha$ is a positive scalar and by assumption $w_j^T \vec{h}_t > 0$, it holds that $w_j^T \vec{h}_t^- < 0$.

• Case 2: $w_j^T \vec{h}_t < 0$, that is, the classifier predicted the negative class on the original representation. Then, by (8),

$$w_j^T \vec{h}_t^- = \alpha\, w_j^T \big( (-1)^{\mathrm{SIGN}(w_j^T \vec{h}_t)}\, \vec{h}_t^{w_j} \big) \quad (12)$$

$$= \alpha\, w_j^T (-1)^0\, \vec{h}_t^{w_j} \quad (13)$$

$$= \alpha\, w_j^T \vec{h}_t^{w_j} \quad (14)$$

Since $\alpha$ is a positive scalar and by assumption $w_j^T \vec{h}_t < 0$ (so $w_j^T \vec{h}_t^{w_j} < 0$ as well), it holds that $w_j^T \vec{h}_t^- < 0$.

We have proved that regardless of the originally predicted label, all INLP classifiers would predict the negative class on the negative counterfactual, which concludes the proof.
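To make the claim tangible, here is a small self-contained NumPy check, again an illustrative sketch of our own rather than the paper's code; it uses random orthonormal directions as stand-ins for trained INLP classifiers:

import numpy as np

def alterrep_negative(h, W, alpha):
    """Negative counterfactual of Claim A.1: keep the nullspace component
    of h and flip every rowspace component to the negative side of its
    classifier, scaled by alpha."""
    proj = W @ h                            # w_i . h for every row of W
    signs = np.where(proj >= 0, -1.0, 1.0)  # (-1)^SIGN(w_i . h)
    h_N = h - W.T @ proj                    # nullspace component
    return h_N + alpha * W.T @ (signs * proj)

rng = np.random.default_rng(0)
d, m = 768, 16
# random orthonormal rows as a stand-in for the INLP matrix W
W = np.linalg.qr(rng.standard_normal((d, m)))[0].T
h = rng.standard_normal(d)
h_neg = alterrep_negative(h, W, alpha=2.0)
assert np.all(W @ h_neg < 0)  # every classifier now predicts the negative class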

A.2 Probing Accuracy

In this appendix, we provide probing results for the task on which we run INLP: detecting whether a representation was taken over a word inside or outside of an RC. As INLP iteratively trains linear probes, this accuracy is equivalent to the accuracy of the first INLP classifier. In all contextualized layers, we observe probing accuracy of over 90% for all RC types (Figure 5). This contrasts with the intervention results in §6: while it is possible to linearly decode the RC boundary in all layers, only in the middle layers do we find that this concept causally influences the model's behavior. In other words, good probing performance does not indicate main-task relevancy.
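For reference, a probe of this kind can be trained in a few lines. The sketch below is illustrative only; the arrays X and y are random placeholders for the real per-layer BERT representations and inside/outside-RC labels described in the paper:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# placeholder stand-ins for the real data: X holds (n, 768) representations
# from one BERT layer; y marks whether the word is inside (1) or outside (0) an RC
rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 768))
y = rng.integers(0, 2, size=2000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"probing accuracy: {probe.score(X_te, y_te):.3f}")
# ~0.5 on this random data; over 90% on real contextualized-layer
# representations, per Figure 5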


Figure 5: Probing accuracy for the presence of words within or outside of RCs, vs. BERT-base layers, for all the different RC types in our experiments.

A.3 Influence of the Dimensionality of the RC Subspace

In this appendix, we analyze the influence of the dimensionality $m$ of the RC subspace. Recall that INLP is an iterative algorithm (§2.2). On the $i$th iteration, the method identifies a single direction $\vec{w}_i$, the parameter vector of a linear classifier, which is predictive of the concept of interest (in our case, RC). The different directions are mutually orthogonal, and after $m$ iterations, the "concept subspace" is the subspace spanned by the rows of the matrix $W = [\vec{w}_1^T, \vec{w}_2^T, \ldots, \vec{w}_m^T]$. In the $i$th iteration of INLP, the subspace identified so far is removed from the representation (by the operation of nullspace projection), and the next classifier $\vec{w}_{i+1}$ is trained to predict the concept over the residual representation. As such, accuracy is expected to decrease with the number of iterations: as the number of iterations increases, the algorithm identifies directions which have a weaker association with the concept. This creates a trade-off between exhaustiveness (identifying all the directions which are at least somewhat predictive of the concept) and selectivity (identifying only directions which have a meaningful association with the concept).
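The iterative procedure can be summarized in code. The sketch below is a simplified, illustrative rendition of INLP under the assumptions stated in the comments; it is not the authors' released implementation (see Ravfogel et al. (2020) for the full algorithm):

import numpy as np
from sklearn.linear_model import LogisticRegression

def inlp(X, y, m):
    """Simplified INLP: returns an (m, d) matrix W of unit-norm, mutually
    orthogonal directions predictive of the binary concept y.
    Assumes a binary task; ignores INLP's multiclass handling."""
    X_res = X.copy()
    rows = []
    for i in range(m):
        clf = LogisticRegression(max_iter=1000).fit(X_res, y)
        print(f"iteration {i}: accuracy {clf.score(X_res, y):.3f}")  # expected to drop
        w = clf.coef_[0]
        for prev in rows:                 # re-orthogonalize against earlier
            w = w - (w @ prev) * prev     # directions to guard against drift
        w = w / np.linalg.norm(w)
        rows.append(w)
        X_res = X_res - np.outer(X_res @ w, w)  # nullspace projection step
    return np.stack(rows)

Running inlp with, say, m = 32 on the probing data of A.2 would yield a 32-dimensional RC subspace of the kind manipulated below.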

Figure 6a presents positive intervention results for different RC-subspace dimensionalities on sentences with agreement across an RC with attractors; Figure 6b presents negative intervention results on sentences on which the model was originally mistaken. Generally, we observe the same trends under all settings, suggesting our method is relatively robust to the dimensionality of the manipulated subspace. In Figure 6c we present the results of intervening on subspaces of different dimensionality for sentences where we do not expect an effect: sentences without attractors, and sentences without RCs. For all contextualized layers we do not see an effect, as expected. For m = 32 and m = 64, we see an effect on the uncontextualized embedding layer. This effect may hint at spurious information encoded in this uncontextualized layer which is used by the model when predicting agreement, but studying this possibility is beyond the scope of this work.

Figure 6: (a) Positive intervention results on sentences with agreement across RC with attractors. (b) Negative intervention results on sentences with agreement across RC on which the model originally predicted incorrectly. (c) Positive intervention results on sentences with simple agreement and sentences with sentential complements.

A.4 Influence of α

In this appendix, we analyze the influence of the parameter α in the AlterRep algorithm (§3) on the BERT-base model. Recall that α dictates the step size one takes when calculating the counterfactual mirror image: α = 1 corresponds to an exact mirror image, while α > 1 over-emphasizes the RC components over which we take the counterfactual mirror image.

Figure 7: Influence of α on the probability of error post intervention. (a) Positive intervention results on sentences with agreement across RC with attractors. (b) Positive intervention results on sentences with simple agreement and sentences with sentential complements. (c) Negative intervention results on sentences with agreement across RC on which the model originally predicted incorrectly.

In Figures 7a and 7b we focus on the positive intervention, which is expected to increase the probability of error, making the model act as if the masked verb is within the RC; and in Figure 7c we focus on the negative intervention on sentences on which the model was originally mistaken, which is expected to decrease the probability of error.

In Figure 7b we present the results on the control sentences: sentences without agreement across an RC. Overall, the trends we observe are similar for different values of α, indicating that AlterRep is relatively robust to the value of this parameter. One exception is the large value α = 8 (and, to a lesser degree, α = 6), which results in some increase in the probability of error also in the control sentences, where we do not expect such an effect (Figure 7b), although this increase is much smaller than the increase on sentences with agreement across an RC. With a large enough α, the new counterfactual representation might diverge too much from the distribution of the original representations. Notice that, compared with gradient-based methods for generating counterfactuals (Tucker et al., 2021), our linear approach has the advantage of controlling the magnitude of the intervention with a single parameter that has a clear geometric interpretation: the extent to which one pushes the representations in one direction or another when taking the mirror image.
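As an illustration of this geometric knob, one can sweep α and watch the counterfactual drift away from the original representation; the snippet below assumes the hypothetical alterrep_negative helper and the h, W arrays from the sketch in A.1 are in scope:

# the displacement from h lives entirely in the rowspace of W
# and grows monotonically with alpha
for alpha in [1, 2, 4, 6, 8]:
    h_neg = alterrep_negative(h, W, alpha)
    print(f"alpha={alpha}: ||h_neg - h|| = {np.linalg.norm(h_neg - h):.2f}")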

A.5 Influence on Accuracy

Figure 8: Influence of the negative intervention on accuracy (the percentage of cases where the model favors the correct form), on sentences on which the model was originally mistaken.

In this appendix, we evaluate the impact of the intervention by its influence on the model's accuracy, calculated as the percentage of cases where the model assigned higher probability to the correct form than to the incorrect form. We focus on the cases on which the model originally predicted incorrectly; thus, the original accuracy on this group of sentences is 0%. We use a negative intervention, pushing the model to act as if the verb is (correctly) outside of the RC, which is expected to increase its accuracy.
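Concretely, each sentence contributes a binary hit depending on which verb form the masked language model ranks higher. The following sketch shows one way to compute such a hit with an off-the-shelf BERT; the sentence is a constructed example of ours, and the bare logit comparison stands in for the full pipeline, which compares predictions after substituting counterfactual representations:

import torch
from transformers import BertForMaskedLM, BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

# constructed example with agreement across an RC (singular main-clause subject)
sent = "the author that the guards like [MASK] tall ."
enc = tok(sent, return_tensors="pt")
mask_pos = (enc.input_ids[0] == tok.mask_token_id).nonzero().item()
with torch.no_grad():
    logits = model(**enc).logits[0, mask_pos]

correct = logits[tok.convert_tokens_to_ids("is")]
incorrect = logits[tok.convert_tokens_to_ids("are")]
hit = bool(correct > incorrect)  # counts toward accuracy if "is" outranks "are"
print(hit)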

In §4.3 we use an alternative measure: probability-of-error. The probability of error is a more sensitive measure, as it might change even when the model's absolute preference for one form over the other has not. However, it is the absolute ranking which eventually dictates the model's top prediction.

Figure 8 presents the results for different dimensionalities of the RC subspace. The trends are similar to the trends shown by the probability-of-error evaluation measure. Notably, in up to 30% of the cases, it is possible to flip the model's preference from the incorrect to the correct form solely by manipulating a low-dimensional subspace within the 768-dimensional representation space.
