
Exploring the Interchangeability of CNN Embedding Spaces

David McNeely-White    Ben Sattelberg    Nathaniel Blanchard    Ross Beveridge

Colorado State University, Fort Collins, Colorado, USA

{david.white, ben.sattelberg, nathaniel.blanchard, ross.beveridge}@colostate.edu

Abstract

CNN feature spaces can be linearly mapped and consequently are often interchangeable. This equivalence holds across variations in architectures, training datasets, and network tasks. Specifically, we mapped between 10 image-classification CNNs and between 4 facial-recognition CNNs. When image embeddings generated by one CNN are transformed into embeddings corresponding to the feature space of a second CNN trained on the same task, their respective image classification or face verification performance is largely preserved. For CNNs trained to the same classes and sharing a common backend-logit (soft-max) architecture, a linear mapping may always be calculated directly from the backend layer weights. However, the case of a closed-set analysis with perfect knowledge of classifiers is limiting. Therefore, empirical methods of estimating mappings are presented for both the closed-set image classification task and the open-set task of face recognition. The results presented expose the essentially interchangeable nature of CNN embeddings for two important and common recognition tasks. The implications are far-reaching, suggesting an underlying commonality between representations learned by networks designed and trained for a common task. One practical implication is that face embeddings from some commonly used CNNs can be compared using these mappings.

1. Introduction

The last decade of research into neural networks might be coarsely categorized into efforts toward: 1) advancing the state-of-the-art as expressed through accuracy, and 2) better understanding and analyzing what networks learn. Efforts toward understanding have been far ranging: from comparisons to human vision [3, 40], to investigations into the quality and content of learned features [17], to detailed breakdowns of what causes networks to fail [9]. Even research areas as distinct as style transfer [11, 14] and transfer learning [31] are all, from a certain vantage, centered around understanding the nature of what CNNs learn and, in turn, how what they learn separates and semantically organizes information. Indeed, such fundamental understanding is crucial for continued improvement.

Figure 1: Distinct CNNs trained to specific tasks, for example object recognition (ImageNet) or face recognition (LFW), create embeddings which are, in many cases, interchangeable within tasks. This paper will present both analytical and empirical findings relating to the existence of linear mappings between embedding spaces. By empirically estimating these mappings we will reveal common cases where embeddings from apparently very different networks are nearly interchangeable.

Our work here focuses specifically on the embeddings generated by different CNNs. To be clear, what we will call an embedding here is the feature vector generated by the network layer just ahead of the final classifier. Often the classifier takes the form of a series of units, one per object class, and the network's final decision is expressed through a winner-take-all competition between these units. However, as we will see in the open-set face recognition task below, an alternative to a final classifier layer for a fixed set of labels is to make direct comparisons between embeddings derived from images.

The question we take up in this work is whether there exists a linear mapping between embedding spaces generated by different CNNs trained to a common task.


What we find is that for the two cases explored in detail, object recognition and face recognition, the answer is yes. This commonality of embedding spaces is illustrated in Figure 1. Because we demonstrate the existence of these linear mappings, and also how to estimate them, we reveal that, in a very basic sense, apparently different and incomparable embeddings from different networks are, to a great extent, interchangeable when the networks are trained to solve a common task.

In revealing a deep commonality between embedding spaces created by very different CNNs, our work is in line with other recent and significant research trends. For example, some within the neural architecture search community have shifted their focus from improving the architecture itself to finding new data augmentation strategies [6, 7, 49]. There is, we believe, a growing sense that different architectures applied to the same data are, in a broad sense, converging upon very similar solutions.

The remainder of this paper is organized as follows. Related work is presented in Section 2. Then, in Section 3 we empirically derive mappings between 10 ImageNet-trained CNNs of varying architecture, demonstrating broad linear correspondence in a closed-set context. Also, in Section 3.4, we identify potential concerns when using accuracy as a metric for closed-set feature space similarity. Section 4 builds upon the closed-set result by establishing linear correspondence in the open-set facial recognition domain. Of particular importance is that the open-set nature of face recognition means some networks we are comparing were trained on disjoint datasets and indeed disjoint sets of people. Finally, in Section 5, we discuss both broad implications and also a very practical consequence for face embeddings. Namely, the fact that mappings between different embedding spaces can be estimated means identities of individuals present in multiple databases may be transferred from one to the other.

2. Related Work

Many have studied neural networks for the purpose of understanding their hidden representations. Techniques range from visualization [10, 30, 47], to semantic interpretation [22, 48], to similarity metrics. The efforts with similarity metrics are most aligned to our goals with this work, and constitute the bulk of our related work.

Li et al. [26] used matching algorithms to align individual units of ImageNet-trained deep CNNs to determine if networks of different initializations converge to similar representations. However, others have argued that drawing semantic meaning from individual bases is problematic, suggesting that semantic meaning is contained within the entire feature space, rather than individual units [44].

Studies on representational similarity have examined the full feature space using canonical correlation analysis (CCA), or variants of CCA such as SVCCA, projection-weighted CCA, or centered kernel alignment. Relevant to our work, these studies have revealed a number of interesting deep neural network properties, like the tendency of layers to be learned in a bottom-up fashion [33], a correspondence between networks' size and width and relative similarity [29], and similarity across and within networks trained on different datasets [23]. The latter work is particularly relevant, as it focuses on the degree of representational similarity between networks which differ in parameterization, much as our own does. CCA methods typically look at a variety of layers and often investigate toy problems. In contrast, our work focuses solely on comparing different architectures' hidden representations using linear regression and relatively large-scale recognition tasks.

To our knowledge, only two other works have used linear regression to compare hidden deep neural network representations. Lenc and Vedaldi [25] previously investigated linear transforms between layers of AlexNet [24], VGG-16 [39], and ResNet-v1-50 [15] on ILSVRC2012. However, these mappings were between spatially-sensitive convolutional layers, and thus required interpolation, which significantly factors into mapping performance. Further, they used supervision in the form of image labels, which, in some sense, amounts to a newly-trained network. In our work, we fit mappings using corresponding pairs of features without interpolation, and independent of semantic image labels.

McNeely-White et al. [28] come closest to our methodology, providing the basis for our first set of experiments. They found linear similarity between feature representations of ILSVRC2012-trained Inception-v4 [41] and ResNet-v2-152 [16] models. We map between those and 8 additional ImageNet CNNs. Further, we consider mappings between feature spaces of CNNs trained using different datasets and labels, and evaluate in an open-set context.

Recently, Roeder et al. [34] established a theoretical basis for the empirical linear correspondence we observe. In the limit of infinite data, they show that a broad class of models, including deep neural networks, will converge to representations which are similar up to a linear transformation. They empirically demonstrate their finding using SVCCA on a variety of tasks, including a toy supervised classification problem. We complement this work by empirically demonstrating this effect for large CNNs of varying architecture and domains.

3. Classifier Based Linear Maps

Our goal is to compare CNNs which vary in architecture, training dataset, or both. We begin with ILSVRC2012-trained networks, where all of our models are trained on the same dataset, but differ in architecture. Each network and its associated weights was obtained from TensorFlow Hub¹, with the exception of Inception-v4, which was obtained from the TensorFlow-Slim GitHub repository². See Table 1 for details of the networks and their respective top-1 single-crop classification accuracies on the 50,000 ILSVRC2012 validation samples as reported by Google and reproduced by us³.

These networks' architectures vary widely, from the simple residual connections of ResNets [15, 16], the hand-built modules of Inception nets [41–43], the resource-constrained design of MobileNet-v2 [36], to the heavily optimized NASNet and PNASNet [27, 50]. Due to the significant methodological differences used for constructing these networks, it seems reasonable to assume they will partition the space used to assign class labels differently.

3.1. Classifier Based Metrics

To better lay out our approach, let us establish notation for features/embeddings and their connection to object labeling.

Each network presented in Table 1 may be considered a function, v : R^(w×h×3) → R^1000. The domain consists of w × h pixel RGB images and the range is the R^1000 label space; each dimension represents a network activation corresponding to one of the 1,000 ILSVRC2012 object labels. The function v can be further divided into two parts:

v(x) = Cf(x).    (1)

Here, f : R^(w×h×3) → R^d is a highly nonlinear function mapping from the input image space to a d-dimensional feature/embedding space. The matrix C ∈ R^(1000×d) is a linear classifier that transforms a feature vector into class label activations. The argmax of these activations is typically used to determine the predicted class. Note that many traditional neural network systems include some form of bias that would appear to complicate this definition, but by mapping to projective space (always appending a 1 to vector f), the last column of C can be used to represent the bias.
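As a concrete and purely illustrative sketch of this decomposition, the snippet below builds f and C in numpy with random stand-ins for trained weights; the bias is folded into the last column of C by appending a 1 to the feature vector, exactly as described above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_classes = 1536, 1000            # a feature width from Table 1 and the ILSVRC2012 label count

W = rng.normal(size=(n_classes, d))  # random stand-in for trained classifier weights
b = rng.normal(size=n_classes)       # random stand-in for the classifier bias

# Fold the bias into the classifier: C is n_classes x (d + 1).
C = np.hstack([W, b[:, None]])

def f(embedding):
    """Stand-in for the nonlinear feature extractor f(x): here it simply takes a
    precomputed d-dimensional embedding and appends a 1 so the bias lives in C."""
    return np.append(embedding, 1.0)

x_feat = rng.normal(size=d)          # pretend output of the penultimate layer for one image
logits = C @ f(x_feat)               # v(x) = C f(x), Equation 1
print(logits.shape, int(np.argmax(logits)))
```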

Given two networks, vA and vB, defined by

vA(x) = CAfA(x)
vB(x) = CBfB(x),    (2)

with features of system vA, fA ∈ R^dA, and features of system vB, fB ∈ R^dB, we are interested in the extent to which the functions fA and fB behave similarly.

¹ https://tfhub.dev/
² https://github.com/tensorflow/models/tree/master/research/slim
³ Note: we use a crop size of 331x331, except in the case of MobileNet-v2, which is only available pretrained with a maximum crop size of 224x224, to minimize the difference in features encoded into each network's feature vectors. This means most networks actually perform slightly better than reported by Google, although MobileNet-v2 performs marginally worse.

These functions are unlikely to have identical output or be directly comparable, but we can investigate the extent to which they have a linear relationship, i.e. does there exist a matrix MA→B ∈ R^(dB×dA) such that

fB(·) ≈ MA→B fA(·).    (3)

Naively, calculating distance between the features fB(x) and fA(x) for each sample in the ILSVRC2012 dataset using something like Euclidean distance appears tempting, but for a variety of reasons it is not helpful. Most obvious is the problem of possible representational permutations. It is well understood that two networks of identical architecture may permute feature dimensions, yet be in all other ways equivalent. Additionally, even when setting aside the permutation problem, it is also known that distances in high dimensional spaces can sometimes obfuscate results [2]. Further, comparing feature spaces is complicated by possible differences in the number of dimensions from one network to the next.

We do want to mention that a promising avenue to work around such problems is to employ methods such as canonical correlation analysis [18, 29]. However, because such methods are invariant to linear transforms, they give insight into similarity while not serving our goal of constructing and understanding the MA→B mappings.

Due to these potential issues with common distance metrics, a useful proxy for us to measure similarity is the top-1 accuracy of network combinations: what accuracy is attained by the system using the linear mapping:

vA→B(x) = CB MA→B fA(x).    (4)
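Assuming embeddings from network A and the classifier of network B are already in hand (the arrays below are hypothetical placeholders, not actual pretrained weights), the accuracy of the hybrid system in Equation 4 reduces to a couple of matrix products:

```python
import numpy as np

def hybrid_top1_accuracy(feats_A, labels, M_AtoB, C_B):
    """Top-1 accuracy of v_{A->B}(x) = C_B M_{A->B} f_A(x) (Equation 4).
    feats_A: (n, d_A) embeddings from network A, labels: (n,) class indices,
    M_AtoB: (d_B, d_A) linear map, C_B: (n_classes, d_B) classifier of network B."""
    logits = feats_A @ M_AtoB.T @ C_B.T          # (n, n_classes)
    return float((logits.argmax(axis=1) == labels).mean())

# Toy usage with random stand-ins; real use would feed ILSVRC2012 validation embeddings.
rng = np.random.default_rng(1)
fa = rng.normal(size=(8, 1536))
y = rng.integers(0, 1000, size=8)
M = rng.normal(size=(2048, 1536))
C_B = rng.normal(size=(1000, 2048))
print(hybrid_top1_accuracy(fa, y, M, C_B))       # near chance, as expected for random weights
```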

We will show in Section 3.4 some conditions such that when CA and CB are known, a matrix MA→B may be calculated directly from them. However, we will choose first to explore the broader case where such knowledge is not available and hence MA→B must be estimated empirically.

3.2. Empirically Calculated Mapping

Although Euclidean distance is potentially a poor measure of similarity between spaces as a whole, it may be used as an optimization metric to efficiently construct empirical MA→B mappings. Our calculation of the mapping is a ridge regression problem over all the 1.3 million images in the training set, X, using the Euclidean norm ||·||_2 and the Frobenius norm ||·||_F:

minimize_{M̃A→B}   Σ_{xi ∈ X} ||M̃A→B fA(xi) − fB(xi)||_2 + ||M̃A→B||_F    (5)

The resulting matrix M̃A→B minimizes total point-wise Euclidean distance between the feature spaces.
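A small sketch of how such a mapping could be estimated is given below, using scikit-learn's multi-output ridge regression as a stand-in for Equation 5 (standard ridge penalizes squared norms, and the arrays here are synthetic placeholders rather than actual CNN embeddings):

```python
import numpy as np
from sklearn.linear_model import Ridge

def fit_linear_map(feats_A, feats_B, alpha=1.0):
    """Estimate M_tilde_{A->B} so that M f_A(x_i) ~ f_B(x_i) for corresponding images.
    feats_A: (n, d_A) and feats_B: (n, d_B) hold embeddings of the same n images."""
    reg = Ridge(alpha=alpha, fit_intercept=False)
    reg.fit(feats_A, feats_B)
    return reg.coef_                           # shape (d_B, d_A)

# Synthetic check: recover a known linear map from noisy feature pairs.
rng = np.random.default_rng(0)
fa = rng.normal(size=(500, 64))
true_M = rng.normal(size=(32, 64))
fb = fa @ true_M.T + 0.01 * rng.normal(size=(500, 32))
M_hat = fit_linear_map(fa, fb)
print(M_hat.shape, float(np.abs(M_hat - true_M).max()))   # small error, up to noise and shrinkage
```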


Table 1: Classification accuracies of the 10 CNNs studied on the ILSVRC2012 validation set, both reported by Google and verified independently (with respective crop sizes). Also included for reference are the number of dimensions used in each CNN's feature space, and the total number of parameters present in the model.

CNN                         Reported   Reported Crop   Ours     Our Crop   Feature Dimension   # params
Inception-v1 [42]           69.8%      224x224         71.1%    331x331    1024                5M
Inception-v2 [43]           73.9%      224x224         73.9%    331x331    1024                11M
MobileNet-v2-1.4-224 [36]   74.9%      224x224         74.6%    224x224    1792                7M
ResNet-v1-152 [15]          76.8%      224x224         78.8%    331x331    2048                60M
ResNet-v2-152 [16]          77.8%      224x224         78.7%    331x331    2048                60M
Inception-v3 [43]           78.0%      299x299         78.9%    331x331    2048                24M
Inception-v4 [41]           80.1%      299x299         80.4%    331x331    1536                43M
Inception-ResNet-v2 [41]    80.4%      299x299         81.2%    331x331    1536                56M
NASNet-Large [50]           82.7%      331x331         82.7%    331x331    4032                89M
PNASNet-Large [27]          82.9%      331x331         82.9%    331x331    4320                86M

We then use M̃A→B for pairs of networks to construct the mapped networks vA→B as defined by Equation 4. The recognition accuracy for the mapped networks is then calculated over the ILSVRC2012 test set. This process is completed for each pairwise combination of the 10 pre-trained CNNs and the results are summarized in Table 2. One thing to emphasize here is that neither the ground truth image labels nor the CA and CB matrices are used when estimating the linear mappings M̃A→B.

One additional control system is also considered. The control is simply the PNASNet-Large network using only the initial randomly generated weights. We add this control to provide some backdrop against which to empirically assess the relative difference between the mapped networks designed and trained to achieve high recognition accuracy.

3.3. Classifier Based Mapping Results

Table 2 presents the performance of all combinations of CNN features and back-end classifiers. All accuracies are reported using the ILSVRC2012 validation set, which was unseen by all CNNs and mappings during training. In addition, a randomly-initialized PNASNet-Large, denoted by *, is also included.

When comparing column-wise (feeding different features into the same classifier), accuracies are sometimes increased when compared to the classifier CNN's own features. That is, even though the mappings were trained to reproduce the classifier CNN's less expressive features, some additional helpful information is passed through the mapping. As an example, see ResNet-v2-152's column and unmapped accuracy of 78.70%. This unmapped accuracy is taking ResNet-v2's features and passing them into ResNet-v2's classifier. When instead fed a linear transformation of PNASNet's features, ResNet-v2's classifier produces 82.46% accuracy. This is an increase in the ResNet-v2 classifier's accuracy of 3.76%! Of course, when analyzing the effects of mappings on the source CNN's performance (i.e. the fashion that cells were shaded in Table 2), no feature vectors actually become more discriminative when mapped. Additionally, the sharpest dropoffs in accuracy occur for the lowest-accuracy networks. This suggests that as architectures become complex enough to solve the ILSVRC2012 dataset well, they are converging to similar solutions as predicted by Roeder et al. [34].

The central question is whether the features learned by each trained CNN studied are equivalent. In every case, the percent change in classification accuracy introduced by using another CNN's features is no worse than -11.6% (and in most cases, much better). Indeed, the median percent change over all mappings is only -1.90%. This strongly suggests that the features learned by one CNN are also learned by every other, though some have learned somewhat superior variations.

One additional item of note is that the PNASNet-Large with randomly initialized weights improves from the baseline random accuracy when mapped to the features of other networks. To be clear, the improvement is dramatic from one perspective, jumping by as much as two orders of magnitude. However, it is lackluster at best from another perspective, reaching only about 5 percent accuracy in the best case. This finding lends credence to the view that even just the CNN architecture itself, independent of learned weights, encodes useful information. Indeed, this property of networks has been explored in more depth in a number of different contexts including fast architecture search; see [12, 21, 32, 37].

3.4. Analytic Mapping

An important caveat to this accuracy-based comparison is that there is a vacuous sense in which the accuracy of the mapped systems can be high while similarity of the feature spaces may be low.


Table 2: Classification accuracies of 121 inter-CNN linear maps. Each cell represents a single instance of Equation 4 with system vA corresponding to the row CNN and system vB corresponding to the column CNN. The number in large font in each cell indicates the accuracy of this hybrid CNN. Diagonal elements correspond to identity mappings, so they are the original network accuracies from Table 1. The number in small font indicates the percent change from the original unmapped row/source CNN (i.e. the value in that row which belongs to the diagonal). The darker the shade of red, the greater the performance penalty introduced by the mapping, relative to the feature extractor's own classifier. The last row and column, gray background, presents comparisons with a random untrained control CNN that is the PNASNet-Large architecture using randomly initialized weights.

Lemma 3.1 Consider a mapping CA ∈ R^(d×dA) and a mapping CB ∈ R^(d×dB) with d < dB and full row rank (i.e. rank d). Then, there exists a matrix MA→B such that

CA = CB MA→B    (6)

where MA→B can be defined as

MA→B = CB⁺ CA.    (7)

Here CB⁺ is the Moore-Penrose inverse of CB.
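For intuition, the analytic construction of Equation 7 is easy to check numerically; the snippet below uses random full-row-rank matrices as stand-ins for the two classifiers:

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_A, d_B = 1000, 1536, 2048          # class count and two feature widths, as in Table 1

C_A = rng.normal(size=(d, d_A))          # stand-ins for two full-row-rank classifiers
C_B = rng.normal(size=(d, d_B))

M_AtoB = np.linalg.pinv(C_B) @ C_A       # Equation 7: M_{A->B} = C_B^+ C_A, shape (d_B, d_A)

# Right-inverse check from Equation 8: C_B M_{A->B} reproduces C_A (up to float error).
print(np.allclose(C_B @ M_AtoB, C_A))    # True
```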

This isn't necessarily an astonishing fact; it is simply a statement that any matrix with full row rank has a right inverse, and as such we can write

CB MA→B = CB CB⁺ CA = I CA = CA.    (8)

However, it has the following unfortunate consequence:

Theorem 3.2 Given two systems,

vA(x) = CAfA(x)
vB(x) = CBfB(x),    (9)

as defined by Equation 1, if CB is of full rank then there exists a matrix M*A→B such that for all x ∈ R^(w×h×3),

CAfA(x) = CB M*A→B fA(x).    (10)

where M*A→B can be defined as

M*A→B = CB⁺ CA.    (11)

Here CB⁺ is the Moore-Penrose inverse of CB.

All ILSVRC2012 networks considered here have classifiers that are of full rank.

A few consequences of this theorem:

5

Page 6: arXiv:2010.02323v1 [cs.CV] 5 Oct 2020David McNeely-White Benjamin Sattelberg Nathaniel Blanchard Ross Beveridge Colorado State University Fort Collins, Colorado, USA fdavid.white,

• For any two linear-classifier based neural networks, there exists a linear mapping such that vA→B has accuracy exactly equal to that of vA.

• This linear mapping and accuracy is independent of fB, and the accuracy is independent of CB.

This means that although high linear similarity in feature space implies high accuracy after mapping, we cannot simply say that high accuracy after a linear mapping implies high linear similarity.

As an example, consider a system which when given an image generates a random feature vector, and which uses a fixed full rank classifier. This system has random performance on ILSVRC2012, and features from a real neural network cannot be mapped to its changing "features" in any meaningful sense. However, when "mapping" to its features using the above analytic mapping, we achieve the performance of the source network.

This does not necessarily mean the analytic mapping fails to transform between feature spaces when given features from two trained networks; it just means that it may fail (and clearly does in certain situations, like the random classifier), and that the accuracy metric we are using does not always tell us when the mapping succeeds or fails.

To avoid this issue in our empirical mappings, we perform our computation of the matrix M̃A→B without knowledge of the classifiers and entirely between the two feature spaces. Additionally, we are optimizing for Euclidean distance between feature spaces, rather than the resulting mapped network accuracy. This means that such a mapping is taking advantage of similarity in feature space when being calculated, rather than accuracy.

Additionally, we have included the random control system described above to highlight the distinction being drawn here. In particular, the poor performance as seen in Table 2 when mapping to this control system demonstrates that the optimization problem we are solving to map between feature spaces likely does not fall prey to the potential issues in the analytic solution shown to exist by Theorem 3.2.

Perhaps the simplest way to summarize what we've seen so far is to recognize the following asymmetry of implications. To start, when Theorem 3.2 is applicable there always exists an exact linear mapping between two CNNs. However, the existence of a linear mapping is not always a sufficient condition for concluding features are similar. The introduction of the random control addresses this latter point. This distinction also helps spur our interest in studying linear mappings between CNNs in contexts where there is no common classifier, and as such Theorem 3.2 does not apply.

4. Open-set Linear Maps

Here we explore how to estimate linear mappings between feature spaces for CNNs constructed to solve open-set problems.

Dataset          # individuals   # images
VGGFace2         9,131           3.31 M
CASIA-WebFace    10,575          494,414
MS-Celeb-1M      100,000         10 M
LFW              5,749           13,233

Table 3: The datasets used in our experiments. The first three datasets were used to train networks used in our experiments. The fourth, LFW, was used to test mappings between feature spaces.

Face recognition has been chosen for these experiments because it is both a compelling application and also because it is a task which, by its very nature, demands generalization to new labels when moving from training to testing. So while a linear classifier may be used as the final layer during training, the classifier is discarded after training. During operation and validation, face verification is rooted in measuring similarity between pairs of face embeddings derived from novel unseen images of people not seen during training [8, 38]. This is in our opinion an intriguing domain in which to seek out and find linear mappings between feature spaces associated with CNNs of differing architecture trained on different data.

Four common face-recognition networks are selected for study here. Two share the same architecture but are trained on different datasets using center loss; both were released by David Sandberg and are available on GitHub⁴ [35]. The other two have different architectures but were trained on the same dataset by Kuan-Yu Huang and are available on GitHub⁵ [20]. All four use multi-task CNNs (MTCNN) to align and crop face images before feeding the result into the network, and all networks normalize their feature outputs to the unit hypersphere. Network details are available in Table 5 and the training datasets are described in Table 3. Additional training details such as learning rate schedules, optimizers, and specific software details are documented at the source repositories. Importantly, the four networks differ from one another by either architecture, training dataset, or both.

These pre-trained models are used to generate pairs of feature embeddings over all 6,000 image pairs specified in the LFW dataset [19]. Accuracy is measured by performing 10-fold cross-validation on aligned image embeddings as specified by the Image Restricted, Labeled Outside Data protocol documented in LFW [19]. All four models we consider were internally validated to within 0.01% of stated accuracies on LFW.


Figure 2: The pipeline used for evaluating the linear mappings between face recognition CNNs. Both networks use separate multi-task CNNs (MTCNN) to detect and align the faces. The networks are then used to generate feature embeddings in R^512. The linear mapping is used to map the feature embedding of the source network (fA) into the feature space of the target network (fB). Finally, the distance d between the embeddings is computed and compared against the matching threshold τ; often distance is simply one minus the cosine of the angle between embedding vectors.

4.1. Open-Set Based Metrics

In contrast to the classifier based systems defined in Equation 1, these open-set networks can be written as v : R^(w×h×3) → R^d; for a given w × h pixel RGB image x ∈ R^(w×h×3), they can be defined as

v(x) = f(x)    (12)

for the associated feature extractor f : R^(w×h×3) → R^d with the condition that ||f(·)||_2 = 1. Typically the unit length of the feature is not enforced explicitly by the network, so the final step of feature extraction is normalizing it to the unit hypersphere. For the four networks studied here, the dimensionality of the feature space, d, is always 512. Two faces x1 and x2 are considered to be of the same person if their cosine distance,

d(x1, x2) = 1 − f(x1) · f(x2),    (13)

is smaller than some established threshold τ.

⁴ https://github.com/davidsandberg/facenet
⁵ https://github.com/peteryuX/arcface-tf2

Name             CNN                      Loss function           Training Dataset     Accuracy on LFW
Model-IC [35]    InceptionResNetV1 [41]   Softmax + Center [45]   CASIA-WebFace [46]   99.03% ± 0.42%
Model-IV [35]    InceptionResNetV1 [41]   Softmax + Center [45]   VGGFace2 [5]         99.47% ± 0.37%
Model-MM [20]    MobileNetV2 [36]         ArcFace [8]             MS-Celeb-1M [13]     98.70% ± 0.50%
Model-RM [20]    ResNet-v2-50 [16]        ArcFace [8]             MS-Celeb-1M [13]     99.40% ± 0.46%

Table 5: Configuration and accuracy of each model. A shortened name is provided for use in other tables. Note these accuracy values are calculated by internal verification and differ from the stated values for each model's source publication, and also that these models are not the official versions associated with the original publications.


For testing in the 10-fold validation setting of LFW, for each fold a cosine distance threshold τ is found which maximizes accuracy (i.e. labels the most image pairs correctly as matched or mismatched) over the other nine folds. Using the remaining validation partition, accuracy is measured using the same threshold. This process is repeated for each of the 10 folds given by the LFW dataset in pairs.txt and averaged to produce a final accuracy measurement.
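The fold-wise threshold search can be sketched as follows (a simplified stand-in for the LFW protocol, with synthetic distances in place of real embedding comparisons):

```python
import numpy as np

def cosine_distance(e1, e2):
    """1 - cosine similarity for unit-normalized embeddings (Equation 13)."""
    return 1.0 - np.sum(e1 * e2, axis=-1)

def best_threshold(distances, same_person):
    """Choose the tau that labels the most training pairs correctly, sweeping the
    observed distances as candidate thresholds."""
    best_tau, best_acc = 0.0, -1.0
    for tau in np.sort(distances):
        acc = np.mean((distances <= tau) == same_person)
        if acc > best_acc:
            best_tau, best_acc = tau, acc
    return best_tau

def fold_accuracy(train_d, train_y, test_d, test_y):
    """One LFW fold: tune tau on the nine training folds, score the held-out fold."""
    tau = best_threshold(train_d, train_y)
    return float(np.mean((test_d <= tau) == test_y))

# Synthetic stand-in: 5,400 training-pair distances and one 600-pair validation fold.
rng = np.random.default_rng(0)
train_d = rng.uniform(0.0, 2.0, 5400); train_y = train_d < 1.0
test_d = rng.uniform(0.0, 2.0, 600);   test_y = test_d < 1.0
print(fold_accuracy(train_d, train_y, test_d, test_y))     # ~1.0 on this separable toy data
```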

As before, we are interested in calculating the extent to which a linear map converts between the features of two networks. For two systems

vA(x) = fA(x)
vB(x) = fB(x),    (14)

as defined in Equation 12, does there exist a matrix MA→B ∈ R^(512×512) such that

fB(x2) ≈ MA→B fA(x1)    (15)

is satisfied for input images x1, x2 ∈ R^(w×h×3) corresponding to the same individual? This approach maps clusters corresponding to individuals in one network to the same clusters in another network, rather than mapping points in feature space specifically. Note that we do not explicitly normalize the result of MA→B fA(x).

Given such a matrix, we can calculate the cosine distances between corresponding image pairs as

dA→B(x1, x2) = 1 − ((M̃A→B fA(x1)) · fB(x2)) / ||M̃A→B fA(x1)||_2.    (16)

The denominator here is due to the fact that we do not normalize MA→B fA(x). The distance dA→B is used to calculate an optimal threshold, τ, which is then used with dA→B to determine if unseen pairs of images are of the same individual as in standard LFW validation. This process is demonstrated in Figure 2. Once a mapping is calculated and threshold τ determined, we use classification accuracy as a measure of linear similarity between feature extractors fA and fB.
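Equation 16 amounts to a handful of numpy operations; the helper below assumes a fitted mapping and a unit-length target embedding are already available:

```python
import numpy as np

def mapped_cosine_distance(M_AtoB, emb_A_x1, emb_B_x2):
    """d_{A->B}(x1, x2) from Equation 16: map network A's embedding of x1 into
    network B's space, divide by its length (the mapping is not length-preserving),
    and compare against network B's unit-length embedding of x2."""
    mapped = M_AtoB @ emb_A_x1                      # (512,) mapped feature
    return 1.0 - float(np.dot(mapped, emb_B_x2)) / float(np.linalg.norm(mapped))
```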

4.2. Empirically Calculated Mapping

We fit mappings in a fashion similar to Subsection 3.2. Instead of fitting linear maps between pairs of embeddings corresponding to the same image, however, we fit mappings during the LFW evaluation process using pairs of embeddings corresponding to the same individual. To be more precise, during the evaluation of one fold of LFW, the set of m matched image pairs that have both images representing the same individual, (xi ∈ X, yi ∈ Y), i = 1, ..., m, from the nine training folds are used as training data for fitting the linear mapping. X and Y do not overlap in specific images. For reference, each LFW fold contains 300 matched and 300 mismatched pairs, so m = 9 × 300 = 2,700.

Table 6: The open-set accuracy of each face recognition CNN when mapped, using LFW. The format of this table is identical to Table 2, including the same color scale.

We calculate mappings by solving the ridge regression problem over these image pairs similarly to Equation 5:

minimize_{M̃A→B}   Σ_{i=1}^{m} ||M̃A→B fA(xi) − fB(yi)||_2 + ||M̃A→B||_F    (17)

This produces a matrix M̃A→B which minimizes the distance between clusters corresponding to individuals in feature space.
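The only difference from the closed-set fit of Equation 5 is the training pairs: each input row is network A's embedding of one image and the target is network B's embedding of a different image of the same person. A minimal sketch, again using squared-penalty ridge regression as a stand-in for the printed objective:

```python
import numpy as np
from sklearn.linear_model import Ridge

def fit_identity_map(emb_A_x, emb_B_y, alpha=1.0):
    """Fit M_tilde_{A->B} from matched LFW pairs (Equation 17): emb_A_x[i] is network A's
    embedding of one image of a person, emb_B_y[i] is network B's embedding of a
    different image of the same person (m = 2,700 pairs per evaluation fold)."""
    reg = Ridge(alpha=alpha, fit_intercept=False)
    reg.fit(emb_A_x, emb_B_y)                  # emb_A_x: (m, 512), emb_B_y: (m, 512)
    return reg.coef_                           # (512, 512) map between face-embedding spaces
```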

We then use these estimated mappings M̃A→B to classify image pairs in the remaining validation fold as illustrated in Figure 2. Validation accuracy scores are averaged over 10 folds to produce a single accuracy value. This process is completed for each ordered pair of the 4 models in Table 5, producing 12 mappings (omitting the 4 identity mappings). Face verification accuracies for these mapped systems are shown in Table 6.

4.3. Open Set Mapping Results

The loss in accuracy introduced by linear mapping is at most 1.0% on LFW. In our experiments, Model-MM is outperformed by the other models by a large margin, which may account for the poor performance of mappings to and from its feature space. Regardless, the ability of one network's features to robustly replicate another's demonstrates a near-linear equivalence of the feature spaces. Although some information does not transfer, the vast majority of information that is encoded by each network is equivalent. This finding is in contrast to previous results showing that methods such as feature fusion increase the accuracy of the networks [1, 4]. However, we note that the small amount of information not encoded by this linear transform may account for the increases in accuracy revealed in those experiments.

One additional check is direct interchangeability between these feature spaces without mapping. As discussed in [26, 44], direct interchangeability between feature spaces is unlikely to exist. We confirm that by using an identity mapping as MA→B. When we do so, we achieve near random accuracy. This confirms that our results demonstrate that empirically estimated linear mappings are able to map reliably between embedding spaces, and consequently suggest one way in which the spaces may be considered largely interchangeable.

5. Conclusion

The work of Roeder et al. [34] mentioned above is highly significant, suggesting a theoretical basis for concluding that neural networks applied to a common task will, in the limit, converge upon a common representation up to a linear transformation. We believe our work nicely complements that of Roeder et al. by demonstrating the existence of linear mappings in two significant object-recognition domains and well-recognized CNNs. We believe future work exploring the presence, or perhaps absence in some cases, of linear mappings using new ML models and more varied task domains will be both interesting and rewarding.

Our work also has some significant practical implications. Let us highlight one relating to the use of face embeddings to encode identity; as part of a large multi-institution federally funded research program, (Name will be added here after double blind reviewing is over), the face-embedding mapping technique introduced in Section 4 was used to reveal the concealed identity of a person of interest in one database using a named instance in an entirely separate database. At least in this experiment, our ability to establish a mapping demonstrates the risk in believing that withheld identities cannot be discovered using cross-database linkages. This is just one example of what can be done once it is understood that: 1) these mappings seem to be common, and 2) they can be uncovered empirically.

Acknowledgements This work was supported by the US Defense Advanced Research Projects Agency (DARPA) and the Army Research Office (ARO) under contracts #W911NF-15-1-0459 and #FA8750-18-2-0016. Thank you to Michelle Wern for designing Figure 1.

References

[1] Ankan Bansal, Rajeev Ranjan, Carlos D Castillo, and Rama Chellappa. Deep features for recognizing disguised faces in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 10–16, 2018.

[2] Richard Bellman. A Markovian decision process. Journal of Mathematics and Mechanics, pages 679–684, 1957.

[3] Nathaniel Blanchard, Jeffery Kinnison, Brandon RichardWebster, Pouya Bashivan, and Walter J Scheirer. A neurobiological evaluation metric for neural network model search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5404–5413, 2019.

[4] Navaneeth Bodla, Jingxiao Zheng, Hongyu Xu, Jun-Cheng Chen, Carlos Castillo, and Rama Chellappa. Deep heterogeneous feature fusion for template-based face recognition. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 586–595. IEEE, 2017.

[5] Qiong Cao, Li Shen, Weidi Xie, Omkar M Parkhi, and Andrew Zisserman. VGGFace2: A dataset for recognising faces across pose and age. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pages 67–74. IEEE, 2018.

[6] Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. AutoAugment: Learning augmentation strategies from data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 113–123, 2019.

[7] Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. RandAugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 702–703, 2020.

[8] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. ArcFace: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4690–4699, 2019.

[9] Samuel Dodge and Lina Karam. Understanding how image quality affects deep neural networks. In 2016 Eighth International Conference on Quality of Multimedia Experience (QoMEX), pages 1–6. IEEE, 2016.

[10] Dumitru Erhan, Yoshua Bengio, Aaron Courville, and Pascal Vincent. Visualizing higher-layer features of a deep network. University of Montreal, 1341(3):1, 2009.

[11] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2414–2423, 2016.

[12] Raja Giryes, Guillermo Sapiro, and Alex M Bronstein. Deep neural networks with random Gaussian weights: A universal classification strategy? IEEE Transactions on Signal Processing, 64(13):3444–3457, 2016.

[13] Yandong Guo, Lei Zhang, Yuxiao Hu, Xiaodong He, and Jianfeng Gao. MS-Celeb-1M: A dataset and benchmark for large-scale face recognition. In European Conference on Computer Vision, pages 87–102. Springer, 2016.

[14] David Hart, Bryan Morse, and Jessica Greenland. Style transfer for light field photography. In The IEEE Winter Conference on Applications of Computer Vision, pages 99–108, 2020.

[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016.

[17] Katherine Hermann, Ting Chen, and Simon Kornblith. The origins and prevalence of texture bias in convolutional neural networks. Advances in Neural Information Processing Systems, 33, 2020.

[18] Harold Hotelling. Relations between two sets of variates. In Breakthroughs in Statistics, pages 162–190. Springer, 1992.

[19] Gary B. Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007.

[20] Kuan-Yu Huang. ArcFace unofficial implemented in TensorFlow 2.0+ (ResNet50, MobileNetV2), June 2020.

[21] Kevin Jarrett, Koray Kavukcuoglu, Marc'Aurelio Ranzato, and Yann LeCun. What is the best multi-stage architecture for object recognition? In 2009 IEEE 12th International Conference on Computer Vision, pages 2146–2153. IEEE, 2009.

[22] Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, and Rory Sayres. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). arXiv preprint arXiv:1711.11279, 2017.

[23] Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited. arXiv preprint arXiv:1905.00414, 2019.

[24] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[25] Karel Lenc and Andrea Vedaldi. Understanding image representations by measuring their equivariance and equivalence. International Journal of Computer Vision, 127(5):456–476, May 2019.

[26] Yixuan Li, Jason Yosinski, Jeff Clune, Hod Lipson, and John E Hopcroft. Convergent learning: Do different neural networks learn the same representations? In FE@NIPS, pages 196–212, 2015.

[27] Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), pages 19–34, 2018.

[28] David McNeely-White, J. Beveridge, and Bruce Draper. Inception and ResNet features are (almost) equivalent. Cognitive Systems Research, 59, October 2019.

[29] Ari Morcos, Maithra Raghu, and Samy Bengio. Insights on representational similarity in neural networks with canonical correlation. In Advances in Neural Information Processing Systems, pages 5727–5736, 2018.

[30] Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. Feature visualization. Distill, 2017. https://distill.pub/2017/feature-visualization.

[31] Maxime Oquab, Leon Bottou, Ivan Laptev, and Josef Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1717–1724, 2014.

[32] Nicolas Pinto, David Doukhan, James J DiCarlo, and David D Cox. A high-throughput screening approach to discovering good forms of biologically inspired visual representation. PLoS Computational Biology, 5(11):e1000579, 2009.

[33] Maithra Raghu, Justin Gilmer, Jason Yosinski, and Jascha Sohl-Dickstein. SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. In Advances in Neural Information Processing Systems, pages 6076–6085, 2017.

[34] Geoffrey Roeder, Luke Metz, and Diederik P Kingma. On linear identifiability of learned representations. arXiv preprint arXiv:2007.00810, 2020.

[35] David Sandberg. Face recognition using TensorFlow, April 2018.

[36] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.

[37] Andrew M Saxe, Pang Wei Koh, Zhenghao Chen, Maneesh Bhand, Bipin Suresh, and Andrew Y Ng. On random weights and unsupervised feature learning. In Proceedings of the 28th International Conference on Machine Learning, pages 1089–1096, 2011.

[38] Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.

[39] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.

[40] Katherine R Storrs, Tim C Kietzmann, Alexander Walther, Johannes Mehrer, and Nikolaus Kriegeskorte. Diverse deep neural networks all predict human IT well, after training and fitting. bioRxiv, 2020.

[41] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

[42] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.

[43] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.

[44] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

[45] Yandong Wen, Kaipeng Zhang, Zhifeng Li, and Yu Qiao. A discriminative feature learning approach for deep face recognition. In European Conference on Computer Vision, pages 499–515. Springer, 2016.

[46] Dong Yi, Zhen Lei, Shengcai Liao, and Stan Z Li. Learning face representation from scratch. arXiv preprint arXiv:1411.7923, 2014.

[47] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833. Springer, 2014.

[48] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2921–2929, 2016.

[49] Barret Zoph, Ekin D Cubuk, Golnaz Ghiasi, Tsung-Yi Lin, Jonathon Shlens, and Quoc V Le. Learning data augmentation strategies for object detection. arXiv preprint arXiv:1906.11172, 2019.

[50] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8697–8710, 2018.
