  • MSc Artificial Intelligence

    Master Thesis

    Few-shot Classification by Learning Disentangled Representations

    by

    Emiel Hoogeboom

    10831428

    June, 2017

    36 ECTS

    January – June, 2017

    Supervisor: Dr. E. Gavves

    Daily Supervisor: Dr. E. Gavves

    Assessor: Prof. Dr. M. Welling

    Faculteit der Natuurkunde, Wiskunde en Informatica

  • Acknowledgements

    I would like to thank Efstratios Gavves for his guidance and help over the past half year. He could truly inspire me to approach a problem differently. He managed to spend a lot of time with me, despite his busy schedule.

    I would also like to thank Jorn Peters, with whom I have had numerous discussions that led to significant insights. Jorn may be one of the smartest guys I know, and I predict that he will one day run his own research lab.

    My gratitude goes out to my committee, consisting of Max Welling and Efstratios, who agreed to read my report on short notice.

    Finally, I would like to thank my parents, and everyone else who helped me with their support and encouragement.

  • Abstract

    Machine learning has improved state-of-the-art performance in numerous domains by using large amounts of data. In reality, labelled data is often not available for the task of interest. A fundamental problem of artificial intelligence is finding a representation that can generalize to never-before-seen classes. In this research, the power of generative models is combined with disentangled representations. The combination is leveraged to learn a representation for content, which generalizes to unseen classes. Potentially, disentangled representations can drastically reduce the number of required training examples, and improve understanding of different factors of variation.

    This is achieved by starting with a known procedure to disentangle representations. By exploring the structure of the content representation, a loss function is composed such that the model learns a few-shot class probability. A mathematical framework is defined that includes a few-shot class probability. This probability ensures that a disentangled representation is learned. A lower bound of the log-likelihood is derived to obtain an objective function that optimizes the log-likelihood conditioned on the support set. The presented method has achieved state-of-the-art performance on the Omniglot dataset at the time of writing.

  • Contents

    Acknowledgements

    Abstract

    Contents

    Introduction

    1 Related Work
        1.1 Few-shot learning
        1.2 Generative Models
        1.3 Disentangling Representation
        1.4 Contribution

    2 Preliminaries
        2.1 Variational Autoencoders
        2.2 Generative Adversarial Networks
        2.3 Kullback Leibler Divergence for Multivariate Normals
        2.4 Squared Euclidean distance between Multivariate Normals
        2.5 Disentangling Factors of Variation

    3 Structure of Disentanglement
        3.1 Understanding Disentanglement
        3.2 Distance Penalty for Content
        3.3 Disentangling with Distance Loss Exclusively

    4 Model: Generative Few-shot Learning
        4.1 Generative Model
        4.2 Class probability
            4.2.1 Embedding Distance in Literature
            4.2.2 Model Class Probability
        4.3 Support Set
            4.3.1 The Support-Conditional Log-likelihood
            4.3.2 Resolving the Posterior
        4.4 Collecting All Components
        4.5 Inference
        4.6 Implementation

    5 Datasets
        5.1 Episodes
        5.2 MNIST
        5.3 Omniglot
        5.4 miniImageNet
        5.5 Quick, Draw

    6 Experiments
        6.1 Setup
            6.1.1 Moving Average Batch Normalization
            6.1.2 Architecture
            6.1.3 Configuration
        6.2 Evaluation
            6.2.1 Omniglot
            6.2.2 miniImageNet
            6.2.3 Quick, Draw
        6.3 Expectation of Support Set
        6.4 Discussion
            6.4.1 Architecture
            6.4.2 Performance
            6.4.3 Model Framework

    Conclusion

    A Derivation Lowerbound for Batches
        A.1 Model definition
        A.2 Log-likelihood Conditioned on Support Content (SS)
        A.3 Lowerbound Conditioned on Support Examples (XS)
        A.4 Intermezzo: Factorizing the Support Set KL Term
        A.5 Collecting Terms
        A.6 Objective Function

  • Introduction

    “Much learning does not teach understanding.”

    Heraclitus of Ephesus

    A deep learning model is a complex function approximator, based on a simple principle that is applied repeatedly. Its complexity makes it incredibly malleable, which allows it to perform tasks such as object classification and detection at high performance levels. This performance is possible given vast amounts of data from the test domain. However, when inputs appear outside this domain, the words of Heraclitus make deep learning models look foolish.

    The human brain is remarkable at object recognition, especially because of object constancy: under different types of illumination, pose or other changes in viewpoint, an object is often easily recognized. Unlike machine learning models, humans can also generalize easily from very few examples. A picture of an apple covered by snow is still an apple. Most people have no problem with this decision, even if this is the first time that an apple is observed in this exact condition.

    “What I cannot create, I do not understand.”

    Richard Feynman

    A promising direction for these problems is generative modelling. Generative modelling is elegantly motivated by the words of Feynman. Deep learning models may simply be cheating by recognizing the sky, when they need to recognize birds. By learning to generate examples, a model is forced to represent the whole image, including the bird. From a practical viewpoint, modelling a generative process has the advantage of being an unsupervised learning problem, and many unlabelled examples are available. However, for a representation to be object constant, it needs to be disentangled for variations in the object and other factors. Disentangled representations are appealing, because a representation suitable for distinguishing cars from trucks should be disentangled from color. This concept is illustrated in Figure 1.

    Figure 1: The concept of disentanglement: multiple attributes characterize an object.


  • Status

    In recent years, machine learning has improved state-of-the-art performance in numerous domains. Notably, deep learning has shown superhuman performance on multiple classification tasks, given extensive amounts of data [1, 2]. In reality, labelled examples may be scarce, which makes it difficult to learn a deep network directly. Moreover, sometimes large quantities of labelled data are available, but not for all classes of interest. The field that tries to classify images with either one or a handful of exemplars is called one-shot or few-shot learning.

    In few-shot learning scenarios, a system is presented with only a few examples per class with known labels. The collection of these examples is called the support set. Another example is then presented to the system, which has to be classified by comparing it with the support set. Early attempts in the field of few-shot learning only inferred directly from the support set. In more recent studies, the field has shifted towards similarity metric learning. First a metric is learned from a subset of classes, and then few-shot classification is tested on another subset with different classes. The underlying assumption is that large quantities of labelled data are available, but these are unavailable for some classes. For the classes of interest, only a few examples are labelled. The goal of few-shot learning is then to learn an embedding that generalizes to unseen classes.

    Deep generative models have been shown to improve classification performance in semi-supervised learning settings. The aim is to have a model for the data generation process, because capturing this process means that the data was understood by the model to some degree. For example, a discriminative model might classify a ship based on the surrounding water, but a generative model will learn an actual representation for a ship. The intuition is that learning the actual representation allows generalization and yields better performance. However, the application of generative models to few-shot learning has thus far been limited.

    Disentanglement

    We define disentanglement as a separation in the representation of the attributes of an object. Let us illustrate this concept with an example. Imagine taking a photo of a car. In the camera the photo is represented with a large number of pixels. These pixels are highly entangled, as changing the color of the car would change a large number of the pixels. A representation would be more disentangled when it generates the same image, but changing a subset of variables changes only a subset of attributes, for example the color of the car. Suppose that an object is completely defined by a set of attributes. A valid representation of an object should represent the complete set. Furthermore, a disentangled representation is defined as a separable representation, such that each part represents an exclusive subset of attributes, and the union of all subsets is the complete set. The choice of attributes is arbitrary, and can be chosen to match meaningful human intuitions.

    Direction

    In this thesis, a mathematical framework to combine generative models, learning disentanglement, and few-shot learning is presented. The framework is designed to learn a disentanglement between two subsets of object attributes, content and style. The framework learns to represent images in content and style variables, where the content variable is used for few-shot classification and reconstruction, while the style variable is only used for reconstruction. To enforce that content and style represent attributes of an example exclusively, priors are placed on these variables. These priors allow the framework to eliminate all redundant information in representations during optimization. In this formulation, the content is defined as all information that is helpful in classifying the image. The style is defined as all other possible sources of variation that are needed for reconstruction. The following hypotheses shall be addressed:

    • Learning disentangled representations can be combined with few-shot classification.

    • Few-shot classification accuracy is improved by using disentangled representations.

  • 1 Related Work

    The related work is organized into three different sections on sub-domains of deep learning. The domains few-shot learning, generative models and disentangling representations are discussed. The last section outlines what differentiates this thesis from existing literature.

    1.1 Few-shot learning

    Few-shot learning is a field where the number of examples is very limited. A key insight by Fei-Fei et al. is that knowledge of previously learned classes can be used, and hence learning does not start from scratch [3]. The union of few-shot learning with deep learning has shifted the field towards a metric learning approach, and this metric (or embedding) has been learned in various manners.

    Siamese networks [4] use the contrastive loss function to learn an embedding on a data set. Another key insight was provided by Vinyals et al., who showed that performance can significantly increase when the train procedure is adapted to match the test procedure closely. In their work, memory networks termed Matching Networks [5] are used to augment the embedding. Ravi et al. improved the meta-learning approach by proposing a recurrent meta-learner that models the updates for the few-shot model [6]. Prototypical Networks [7] use basic components of matching networks, showing that even higher performance can be attained with a relatively simple architecture and procedure, without the need for recurrent networks.

    Instead of presuming a fixed distance metric to measure distance between examples, the metric can be directly learned. Competitive results on some datasets have been achieved by Residual networks with skip connections [8]. Learning the distance metric allows the model to choose a suitable distance measure itself. This demonstrates that a distance function with parameters can be a more suitable choice on some problem instances.

    1.2 Generative Models

    An emerging field within machine learning is deep generative modelling. A common assumption in generative modelling is that some lower-dimensional representation exists. One can propose some low-dimensional representation, and learn a transformation whose output resembles samples from the data distribution.

    Variational Auto-Encoders (VAEs) [9] are derived from a generative process, by introducing a variational distribution that can be recognized as an encoder in traditional auto-encoders. They learn to reconstruct by encoding images into a lower-dimensional latent space, and decoding a reconstruction from that latent space.

    Generative Adversarial Networks (GANs) [10] learn to model the data by defining two competing networks. The discriminator needs to classify which images are real and which are fake, while the generator tries to deceive the discriminator. The reconstruction loss of the VAE is defined explicitly, and is often modelled by a pixel-wise error. That means that a visually perfect reconstruction can still have a high error under small perturbations (e.g. translation of one pixel). The loss of the generator in a GAN is defined implicitly, as the ability to mislead the discriminator. GANs tend to be able to reconstruct crisper and more realistic images.

  • 1.3 Disentangling Representation

    In [11] a combination of VAEs and GANs learns to disentangle variation, separating class information from style into two latent spaces. To disentangle content from style, the labels are chosen to represent content. The training procedure is then formulated such that all other information will be encoded by the style variable. In other work, an unsupervised disentangled representation is learned by maximizing the mutual information between a subset of the latent variables and the observation [12].

    1.4 Contribution

    This thesis combines few-shot learning, generative models and disentangled representations. To the best knowledge of the author, disentangled representations have never before been used for few-shot classification.

  • 2 Preliminaries

    In this section, preliminary techniques are explained that will be used in subsequent sections. The techniques discussed relate to general deep learning models, mathematical derivations of distribution distances, and an application where a disentanglement is learned.

    2.1 Variational Autoencoders

    Auto-encoders are artificial neural networks used for unsupervised representation learning. The dimension of the input is equal to the dimension of the output, and the purpose of the network is to reconstruct the input. The representation that is learned at the bottleneck of the network is called the code or latent space. An auto-encoder can be divided in two distinct modules, the encoder and the decoder. The encoder is a function that maps an input x into some latent representation z, Enc : X → Z. The decoder maps the latent representation to the input space, Dec : Z → X. The objective of the auto-encoder is to minimize some distance loss as defined in Equation 1.

    $$
    \mathrm{Dec}^*, \mathrm{Enc}^* = \arg\min_{\mathrm{Dec}, \mathrm{Enc}} \|x - \mathrm{Dec}(\mathrm{Enc}(x))\|^2 \tag{1}
    $$

    Variational Auto-Encoders (VAEs) [9] assume some generative process from the latent space z to x (depicted in Figure 2). Note that the latent variable z is treated as a random variable.

    Figure 2: Generative process in a graphical model (nodes x and z, parameters θ and φ, plate of size N). This model is the basis for the Variational Autoencoder.

    By introducing a variational distribution qφ(z|x), a lower bound for log pθ(x) can be derived with Jensen’s inequality (Equation 2). In this equation, DKL represents the Kullback-Leibler divergence, the probability distributions of the generative model are parametrized by θ, and the variational distribution is parametrized by φ. The decoder is now defined as the conditional distribution Dec := pθ(x|z). The encoder is defined as the variational distribution Enc := qφ(z|x). A common assumption is to let qφ(z|x) be a multivariate normal distribution with diagonal covariance. Thus, the encoder is defined as Enc := N (z|µφ(x), Iσφ(x)). The prior is then often chosen as p(z) = N (z|0, I).

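    $$
    \begin{aligned}
    \log p_\theta(x) &= \log \int p_\theta(x, z)\, dz \\
    &= \log \int q_\phi(z|x) \frac{p_\theta(x, z)}{q_\phi(z|x)}\, dz \\
    &\geq \int q_\phi(z|x) \log \frac{p_\theta(x, z)}{q_\phi(z|x)}\, dz \\
    &= \mathbb{E}_{z \sim q_\phi(z|x)} \left[ \log p_\theta(x|z) \right] - D_{KL}\left( q_\phi(z|x) \,\|\, p_\theta(z) \right)
    \end{aligned}
    \tag{2}
    $$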

    2.2 Generative Adversarial Networks

    Generative Adversarial Networks (GANs) [10] are a different type of generative model. A GAN consists of two distinct modules: a generator that maps some latent representation to an example, Gen : Z → X, and a discriminator that maps an example to a confidence that signifies how real an example looks, Disc : X → [0, 1]. The two networks are trained as adversaries, in a zero-sum game setting. The value function is given in Equation 3. The generator tries to minimize the value function, while the discriminator tries to maximize it, as expressed in Equation 4.

    $$
    V(\mathrm{Gen}, \mathrm{Disc}) = \mathbb{E}_{x \sim P_{data}} \left[ \log \mathrm{Disc}(x) \right] + \mathbb{E}_{z \sim p(z)} \left[ \log \left( 1 - \mathrm{Disc}(\mathrm{Gen}(z)) \right) \right] \tag{3}
    $$

    $$
    \mathrm{Gen}^*, \mathrm{Disc}^* = \arg\min_{\mathrm{Gen}} \arg\max_{\mathrm{Disc}} V(\mathrm{Gen}, \mathrm{Disc}) \tag{4}
    $$

    Where an auto-encoder uses some defined distance metric to compare the reconstruction to the original, a GAN uses the certainty prediction of a discriminator. Loss functions for reconstruction of high-dimensional data such as images are difficult to define such that sharp images are generated. Instead, a GAN architecture only implicitly defines a loss function, via the discriminator.

    Optimization of the value function leads to a problem for the generator, since the strength of the gradient decreases when the discriminator is certain. In practice, instead of optimizing Equation 3, the generator optimizes a value function that has stronger gradients when the discriminator is certain, defined in Equation 5. When a discriminator is more certain, i.e. Disc(Gen(z)) → 0, the gradient will be stronger, since $\frac{d}{dx} \log f(x) = \frac{1}{f(x)} \frac{df(x)}{dx}$.

    $$
    V_G(\mathrm{Gen}, \mathrm{Disc}) = -\mathbb{E}_{z \sim p(z)} \left[ \log \mathrm{Disc}(\mathrm{Gen}(z)) \right] \tag{5}
    $$
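The effect of switching from Equation 3 to Equation 5 can be checked numerically. The sketch below is only an illustration: it differentiates both generator objectives with respect to the discriminator output d = Disc(Gen(z)), not with respect to network parameters, and the function names are hypothetical.

```python
def grad_saturating(d):
    """Derivative of log(1 - d) w.r.t. d: the generator term of Equation 3."""
    return -1.0 / (1.0 - d)


def grad_non_saturating(d):
    """Derivative of -log(d) w.r.t. d: the integrand of Equation 5."""
    return -1.0 / d


if __name__ == "__main__":
    # As d approaches 0 the discriminator is certain the sample is fake.
    for d in (0.5, 0.1, 0.01):
        print(d, grad_saturating(d), grad_non_saturating(d))
```

For small d the first derivative stays close to -1, while the second grows in magnitude as 1/d, which is the stronger signal the text refers to.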

    2.3 Kullback Leibler Divergence for Multivariate Normals

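    The Kullback-Leibler (KL) divergence is a measure of dissimilarity between probability distributions; it is not symmetric, so it is not a distance in the strict sense. In a previous section, we already saw that the variational autoencoder optimizes a KL divergence between the variational distribution and the prior. By our choice of parametrization, we will only be faced with multivariate normal distributions. In Equation 6 the analytical solution for the KL divergence between two arbitrary normal distributions, q = N (x|µ,Σ) and p = N (x|m,L), is derived. Here D denotes the number of dimensions of x.

    $$
    \begin{aligned}
    D_{KL}(q \| p) &= \mathbb{E}_{\mathcal{N}(x|\mu,\Sigma)} \left[ \log \mathcal{N}(x|\mu,\Sigma) - \log \mathcal{N}(x|m,L) \right] \\
    &= \frac{1}{2} \log \frac{|L|}{|\Sigma|} + \mathbb{E}_{\mathcal{N}(x|\mu,\Sigma)} \left[ -\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu) + \frac{1}{2} (x-m)^T L^{-1} (x-m) \right] \\
    &= \frac{1}{2} \log \frac{|L|}{|\Sigma|} + \frac{1}{2} \mathbb{E}_{\mathcal{N}(x|\mu,\Sigma)} \left[ -\mathrm{Tr}\left( (xx^T + \mu\mu^T - 2x\mu^T) \Sigma^{-1} \right) + \mathrm{Tr}\left( (xx^T + mm^T - 2xm^T) L^{-1} \right) \right] \\
    &= \frac{1}{2} \log \frac{|L|}{|\Sigma|} + \frac{1}{2} \left[ -\mathrm{Tr}\, I + \mathrm{Tr}\left( (\mu\mu^T + \Sigma + mm^T - 2\mu m^T) L^{-1} \right) \right] \\
    &= \frac{1}{2} \left[ \log \frac{|L|}{|\Sigma|} - D + \mathrm{Tr}(\Sigma L^{-1}) + (m-\mu)^T L^{-1} (m-\mu) \right]
    \end{aligned}
    \tag{6}
    $$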

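    If we parametrize multivariate normal distributions such that the covariance matrix only has diagonal entries, the solution can be further simplified. The normal distributions are redefined to q = N (x|µ, I · σ) and p = N (x|m, I · l), where σi and li act as per-dimension standard deviations in Equation 7. The corresponding KL divergence between q and p is shown in Equation 7.

    $$
    D_{KL}(q \| p) = \frac{1}{2} \left[ \sum_{i=1}^{D} \left( 2 \log \frac{l_i}{\sigma_i} + \frac{\sigma_i^2}{l_i^2} + \frac{(m_i - \mu_i)^2}{l_i^2} \right) - D \right] \tag{7}
    $$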
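The step from Equation 6 to Equation 7 can be spelled out term by term. The following is a sketch under the assumption that σ and l hold per-dimension standard deviations, so that Σ = diag(σ₁², …, σ_D²) and L = diag(l₁², …, l_D²); this is the reading under which the two equations agree.

```latex
\begin{aligned}
\log \frac{|L|}{|\Sigma|} &= \sum_{i=1}^{D} \log \frac{l_i^2}{\sigma_i^2} = \sum_{i=1}^{D} 2 \log \frac{l_i}{\sigma_i}, \\
\mathrm{Tr}(\Sigma L^{-1}) &= \sum_{i=1}^{D} \frac{\sigma_i^2}{l_i^2}, \\
(m-\mu)^T L^{-1} (m-\mu) &= \sum_{i=1}^{D} \frac{(m_i - \mu_i)^2}{l_i^2}.
\end{aligned}
```

Substituting these three identities into the last line of Equation 6 gives the sum in Equation 7.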

    The equation can be further simplified if p is a prior with zero mean and variance one. The distribution p is redefined such that p = N (x|0, I). The KL divergence between q and p is presented in Equation 8.

    $$
    D_{KL}(q \| p) = \frac{1}{2} \left[ \sum_{i=1}^{D} \left( -2 \log \sigma_i + \sigma_i^2 + \mu_i^2 \right) - D \right] \tag{8}
    $$
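As a sanity check, the two closed forms can be written as plain Python functions. This is a minimal sketch, not the thesis implementation; it assumes that sigma and l are lists of per-dimension standard deviations, and the function names are hypothetical.

```python
import math


def kl_diag(mu, sigma, m, l):
    """Equation 7: KL(q || p) for q = N(mu, diag(sigma^2)), p = N(m, diag(l^2))."""
    total = 0.0
    for mu_i, s_i, m_i, l_i in zip(mu, sigma, m, l):
        total += 2.0 * math.log(l_i / s_i) + (s_i / l_i) ** 2 + ((m_i - mu_i) / l_i) ** 2
    return 0.5 * (total - len(mu))


def kl_standard(mu, sigma):
    """Equation 8: KL(q || N(0, I)) for q = N(mu, diag(sigma^2))."""
    total = sum(-2.0 * math.log(s) + s * s + u * u for u, s in zip(mu, sigma))
    return 0.5 * (total - len(mu))


if __name__ == "__main__":
    mu, sigma = [0.5, -1.0], [0.8, 1.5]
    # Equation 8 is Equation 7 with m = 0 and l = 1, so both values should agree.
    print(kl_diag(mu, sigma, [0.0, 0.0], [1.0, 1.0]), kl_standard(mu, sigma))
```

Passing identical distributions to `kl_diag` gives zero, which is a quick way to catch sign or indexing mistakes.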

    2.4 Squared Euclidean distance between Multivariate Normals

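    A straightforward measure of distance is the squared Euclidean distance. We define two arbitrary multivariate normal distributions p = N (x|µ,Σ) and q = N (y|m,L), where x and y have an equal number of dimensions. An analytical solution for the expectation of the squared Euclidean distance between two multivariate normal distributions is presented in Equation 9.

    $$
    \begin{aligned}
    \mathbb{E}_{x \sim p, y \sim q} \left( \|x - y\|^2 \right) &= \mathbb{E}_{x \sim p, y \sim q} \left( x^T x + y^T y - 2 x^T y \right) \\
    &= \mathbb{E}_{x \sim p, y \sim q} \mathrm{Tr}\left( xx^T + yy^T - 2xy^T \right) \\
    &= \mathrm{Tr}\left( \mu\mu^T + \Sigma + mm^T + L - 2\mu m^T \right) \\
    &= \mathrm{Tr}(\Sigma + L) + (\mu - m)^T (\mu - m)
    \end{aligned}
    \tag{9}
    $$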
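Equation 9 can be compared against a Monte Carlo estimate. The sketch below assumes diagonal covariances given as standard deviations and assumes that x and y are drawn independently, which the step E[xyᵀ] = µmᵀ in the derivation requires. All names are hypothetical.

```python
import random


def expected_sq_dist(mu, sigma, m, l):
    """Equation 9 for diagonal covariances: Tr(Sigma + L) + ||mu - m||^2."""
    trace = sum(s * s + t * t for s, t in zip(sigma, l))
    return trace + sum((a - b) ** 2 for a, b in zip(mu, m))


def monte_carlo_sq_dist(mu, sigma, m, l, n=100_000, seed=0):
    """Average ||x - y||^2 over n independent draws x ~ p and y ~ q."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        total += sum((rng.gauss(a, s) - rng.gauss(b, t)) ** 2
                     for a, s, b, t in zip(mu, sigma, m, l))
    return total / n


if __name__ == "__main__":
    args = ([1.0, -2.0], [0.5, 1.0], [0.0, 0.5], [1.5, 0.2])
    print(expected_sq_dist(*args), monte_carlo_sq_dist(*args))
```

The two printed values should be close, with the gap shrinking as n grows.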


  • 2.5 Disentangling Factors of Variation

    In 2016, Mathieu et al. learned a disentangled representation by combining a variational auto-encoder with a generative adversarial network [11]. They specify two latent variables, the content s and the style z. The function of the content is to contain all class information, and the style should contain any other information, such as how slanted a letter is written. Together, s and z provide sufficient information to reconstruct the original example x. An encoder (s, (µz, log σz)) = Enc(x) and a decoder x = Dec(s, z) are defined. Furthermore, a discriminator [0, 1] = Disc(x, id) is trained to distinguish real and fake examples. The variable id denotes the label of the example x. In Equations 10 and 11 the losses for the VAE and the GAN are specified. The complete loss can be formulated as in Equation 12, where λ is a scaling factor. Note that the authors chose to include a KL regularization on z, but s is not treated as a random variable.

    $$
    L^{(VAE)} = -\mathbb{E}_{z \sim q(z|x,s)} \log p(x|z, s) + D_{KL}\left( q(z|x, s) \,\|\, p(z) \right) \tag{10}
    $$

    $$
    L^{(GAN)} = \log \mathrm{Disc}(x, id) + \log \left( 1 - \mathrm{Disc}(\mathrm{Gen}(z, s), id) \right) \tag{11}
    $$

    $$
    L = L^{(VAE)} + \lambda L^{(GAN)} \tag{12}
    $$

    The authors propose a training procedure with multiple steps that swaps the latent variables. This training procedure, in combination with the model, ensures that a disentangled representation is learned. A summary of the procedure is given below; please refer to [11] for exact details.

    • Two samples from the same class, x1 and x1′, are drawn. The VAE is trained to maximize p(x1|Dec(s1, z1)) and p(x1|Dec(s1′, z1)). Note that both produce the same reconstruction. This ensures that only content information may flow through s.

    • To prevent the network from ignoring s, a sample from a different class, x2, is drawn. The VAE is trained to minimize the generator GAN loss log Disc(Gen(z2, s1), id(x1)). This ensures that the content information must flow through s, and may not flow through z.

    • Again x1, x1′ and x2 are sampled in a similar fashion. The discriminator is trained to minimize log Disc(Gen(z1, s1), id(x1)) + log (1 − Disc(Gen(z2, s1), id(x1))). Thus the discriminator is trained to detect whether a reconstruction used a style z from an example with another class.
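The sampling that each of the steps above relies on can be sketched in a few lines. This illustrates only the data-drawing part, not the networks or losses; `draw_triplet` and `data_by_class` are hypothetical names, and the exact sampling scheme of [11] may differ.

```python
import random


def draw_triplet(data_by_class, rng):
    """Draw x1 and x1' from one class and x2 from a different class.

    data_by_class maps a label to a list of at least two examples.
    Returns three (example, label) pairs.
    """
    class_a, class_b = rng.sample(sorted(data_by_class), 2)
    x1, x1_prime = rng.sample(data_by_class[class_a], 2)
    x2 = rng.choice(data_by_class[class_b])
    return (x1, class_a), (x1_prime, class_a), (x2, class_b)


if __name__ == "__main__":
    toy = {"3": ["3a", "3b", "3c"], "7": ["7a", "7b"], "9": ["9a", "9b"]}
    print(draw_triplet(toy, random.Random(0)))
```

Sampling without replacement within the class keeps x1 and x1′ distinct, so the content swap in the first step is never a trivial identity.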

    Since it is difficult to express disentanglement in numbers, we follow the procedure of the original authors and display interpolations in latent representations. Some of the reconstruction results are depicted in Figure 3. In the left image, a slanted seven is interpolated to an upright nine. Moving downwards from the top left, the seven gradually appears more upright. Going upwards from the bottom right, the nine becomes increasingly slanted. In the right image, a three is interpolated with a seven.

    Figure 3: Interpolation between content and style. Left and right follow the same procedure with different examples. The top left image is a reconstruction of an image in the dataset. The bottom right image is also a reconstruction of an image in the dataset. Horizontally the content s is linearly interpolated. Vertically the style z is interpolated.

  • 3 Structure of Disentanglement

    Disentangled representations are representations where specific variables of the representation can be modified to change specific components. The method of Mathieu et al. [11] learns a disentanglement of content (class) and style (all other variations), but does not put any constraint on the structure of the disentangled representation. In this section, the structure of the representations is investigated, and modified with additional constraints. Ultimately, the goal is to perform inference for few-shot learning on the disentangled content representation.

    3.1 Understanding Disentanglement

    The work of [11] is taken as a starting point: a combination of a VAE and a GAN with the specified training procedure. With this model, a disentangled representation is learned on MNIST, and all visualizations are obtained with datapoints in the test set. The structure of the high-dimensional content and style representations is visualized with t-distributed stochastic neighbour embedding (t-SNE). Note that z is a distribution, and therefore only µz is visualized. In Figure 4 these embeddings are depicted. Notice that the content s is clustered more strongly, while clustering of the style z is less apparent. This is expected, since content variables from the same class should contain the same information, making clusters very distinct. In contrast, style is often more continuous (how slanted or bold a digit is), and the same style can be shared between different classes.

    Figure 4: Visualization of the high-dimensional latent variables of the model in [11]. All points represent test data. Left: t-SNE plot of content s. Right: t-SNE plot of style µz (z is a distribution). Different colors represent different classes.

    The structure of the content s is not suited for few-shot classification, because multiple clusters exist for the same class. An example that maps to the necessary cluster might be absent. In [11] this was not necessarily a problem, since two different points can be mapped to the same class by the decoder. However, for few-shot classification, ideally the content embedding would have one cluster for each class.

  • 3.2 Distance Penalty for Content

    In the previous section, visualizations showed that the clusters for content s were scattered. It can be advantageous to group examples more tightly when classification is based on the proximity of s.

    Therefore, in addition to the VAE and GAN loss, a simple loss based on the distance between content variables of the same class (Equation 13) is used. In this equation, the subscript notation corresponds to the previously described training procedure: s1 and s1′ are of the same class. In essence, optimizing the distance penalty attracts the content s of examples with the same class. The objective that is optimized is presented in Equation 14.

    $$
    L^{(penalty)} = \|s_1 - s_{1'}\|^2 \tag{13}
    $$

    $$
    L = L^{(VAE)} + \lambda L^{(GAN)} + L^{(penalty)} \tag{14}
    $$

    The content s and style µz are visualized in Figure 5. Clearly, classes are clustered more compactly in the embedding. Furthermore, no class has multiple clusters. As a proof of concept for few-shot learning, a single content s of each class is chosen as the support set. The test set is classified using a nearest neighbour approach on the support set. Classification based on a single example in the content domain has about 99% accuracy. For comparison, the model without a penalty, evaluated with the same procedure, has only about 90% accuracy. Although the model is not classifying examples of an unseen class, this illustrates two important points. Firstly, a disentangled representation of content can be used for few-shot classification. Secondly, an additional restriction (such as a distance penalty) is effective to learn a useful few-shot embedding.
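The nearest neighbour classification used in this proof of concept can be sketched as follows. The content vectors here are made-up two-dimensional toy values rather than encoder outputs, and the function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify(query, support):
    """Return the label whose single support content vector is closest to the query.

    support maps each class label to one content vector s (one-shot setting).
    """
    return min(support, key=lambda label: sq_dist(query, support[label]))


if __name__ == "__main__":
    support = {"three": [0.0, 1.0], "seven": [2.0, -1.0]}  # toy content vectors
    print(classify([0.2, 0.7], support))
```

With one compact cluster per class, a single support vector per class is enough for this rule to work, which is why the distance penalty matters for the accuracy reported above.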

    Figure 5: Visualization of the high-dimensional latent variables of the model that also optimises a penalty on distance between same-class content variables. All points represent test data. Left: t-SNE plot of content s. Right: t-SNE plot of style µz (z is a distribution). Different colors represent different classes.

  • 3.3 Disentangling with Distance Loss Exclusively

    Inspired by the results in the previous section, a new distance loss on s is proposed. We formulate a classification probability for the correct class, based on Euclidean distance. The probability is normalized similarly to a softmax function (Equation 15). Content variables s of examples from the same class should lie close together, and those of different classes should lie far apart. The first term of the loss contracts content variables of the same class, and the second term expands the distance between content variables of different classes.

    Different from previous work, we also choose s to be a random variable, and let the encoder output ((µs, log σs), (µz, log σz)) = Enc(x). The objective function in previous models did not treat s as a random variable, and therefore it was not regularized. Because the new objective does constrain s, the variable is now modeled as a distribution. Experiments showed that without this modification, s encodes all information and z is ignored. The VAE loss is depicted in Equation 16, which now includes s as a random variable. Note that both latent variables are now regularized with their priors. In Equation 17 the objective to optimize is shown.

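    \[
    \mathcal{L}^{(distance)}
      = -\log \left[ \frac{\exp(-\|s_1 - s_{1'}\|^2)}{\sum_{i=1}^{C} \exp(-\|s_1 - s_{i'}\|^2)} \right]
      = \underbrace{\|s_1 - s_{1'}\|^2}_{\text{Contraction term}}
      + \underbrace{\log \sum_{i=1}^{C} \exp(-\|s_1 - s_{i'}\|^2)}_{\text{Expansion term}}
    \tag{15}
    \]

    \[
    \mathcal{L}^{(VAE)} = -E_{z \sim q(z|x),\, s \sim q(s|x)} \log p(x|z, s) + D_{KL}(q(z|x) \,\|\, p(z)) + D_{KL}(q(s|x) \,\|\, p(s))
    \tag{16}
    \]

    \[ \mathcal{L} = \mathcal{L}^{(VAE)} + \lambda \mathcal{L}^{(distance)} \tag{17} \]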
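    As an illustration of Equation 15, the following is a minimal sketch in plain Python, not the implementation used for the experiments. The function names and the convention that the first comparison content is the same-class one are assumptions made for this example.

```python
import math

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def distance_loss(s1, s_primes):
    """Equation 15. s_primes[0] is assumed to be the same-class content s_1';
    the remaining entries are the contents s_2', ..., s_C' of other classes."""
    contraction = squared_distance(s1, s_primes[0])
    # Expansion term: log-sum-exp over all C comparison contents,
    # shifted by the maximum for numerical stability.
    neg_dists = [-squared_distance(s1, sp) for sp in s_primes]
    shift = max(neg_dists)
    expansion = shift + math.log(sum(math.exp(v - shift) for v in neg_dists))
    return contraction + expansion

# Same-class content coincides with s1, other class is far away: loss near 0.
loss_separated = distance_loss([0.0, 0.0], [[0.0, 0.0], [10.0, 10.0]])
# Other class sits on top of the same-class content: loss equals log(2).
loss_collapsed = distance_loss([0.0, 0.0], [[0.0, 0.0], [0.0, 0.0]])
```

    The loss approaches zero only when the same-class contents coincide and all other classes are far away, which matches the contraction and expansion roles of the two terms.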

    Without the adversarial procedure, learning a disentanglement is less explicitly enforced. However, the intuition is that the distance loss will ensure that the classes cluster in the embedding s. To create a reconstruction, the decoder can obtain information through s and z. The encoder needs to send information through the latent space by changing the distribution away from the prior. Changing the distribution of the latent space incurs a penalty via the KL divergence. If class information is already available in s, the model will avoid putting the same information in z, because doing so would incur another penalty.

    The model is trained with the following procedure. Draw two samples from the same class, x1 and x1′. Also draw samples from other classes: x2′, ..., xC′. All gradients for the decoder originate from L(VAE) for x1. The gradient signal for the encoder comes from both L(VAE) and L(distance), with s1 and s1′ in the contraction term, and all s1′, ..., sC′ in the expansion term.

    Interpolations of style and content are depicted in Figure 6, obtained by changing s and z linearly between two examples. Notice that content information is conveyed via s and style via z. Note how a four that is written slanted becomes upright and shaky, in the style of the eight. Thus, the model is able to learn a disentanglement with a Euclidean distance loss, instead of the adversarial procedure.
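    The interpolation grid described here can be sketched as follows. This is an illustrative example only: the helper names are hypothetical, and the resulting (s, z) pairs would still have to be passed through the trained decoder to obtain images.

```python
def lerp(a, b, t):
    """Linear interpolation between vectors a and b for t in [0, 1]."""
    return [(1 - t) * x + t * y for x, y in zip(a, b)]

def interpolation_grid(s_a, z_a, s_b, z_b, steps):
    """grid[row][col] = (s, z): content varies over columns, style over rows."""
    ts = [i / (steps - 1) for i in range(steps)]
    return [[(lerp(s_a, s_b, tc), lerp(z_a, z_b, tr)) for tc in ts] for tr in ts]

# Toy latent codes; real codes would come from the encoder.
grid = interpolation_grid([0.0, 0.0], [1.0], [2.0, 4.0], [3.0], steps=5)
```

    The top-left entry holds the codes of the first example and the bottom-right entry those of the second, mirroring the layout of Figure 6.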


  • Figure 6: Interpolation between content and style, reconstructions created with a VAE trained with distance loss. Left and right follow the same procedure with different examples. The top left image is a reconstruction of an image in the dataset. The bottom right image is also a reconstruction of an image in the dataset. Horizontally the content s is linearly interpolated. Vertically the style z is interpolated.

    In Figure 7, t-SNE visualizations of the content and style variables of test examples are depicted. The content variables are strongly clustered, and the style variables show less structure based on class. Notice that content grouping has become tight, and that style grouping has become less noticeable. Thus, only by restraining the distance of s for images of the same class, a disentanglement can be learned. Furthermore, a continuous representation for content is learned that is tightly clustered.

    Figure 7: Visualization of the high dimensional latent variables of the VAE with MNIST, trained with distance loss. All points represent test data. Left: t-SNE plot of content µs. Right: t-SNE plot of style µz. Different colors represent different classes.


  • 4 Model: Generative Few-shot Learning

    The previous section described how a disentanglement can be learned, and hinted at how a few-shot learning loss may actually aid in learning a disentanglement. In this chapter, a generative model for few-shot learning is formally defined, inspired by disentangling representations.

    Generative models in semi-supervised learning can have x conditioned on some latent variable z and the class variable y. However, in few-shot learning scenarios the number of classes is large, and classes during test time have never been seen before. Therefore, conditioning directly on y is impractical. Instead, the example x is conditioned on content s and style z.

    In this section a lower bound of the conditional log-likelihood will be derived, for a simplified use case. The actual derivation involves a few more terms, which make it notation heavy. Therefore, the complete derivation is presented in appendix A.

    4.1 Generative Model

    A generative model for an example x and its class y in few-shot learning is defined. There are latent variables for content s ∼ p(s) and style z ∼ p(z). The observed class y ∼ p(y|s) is conditionally independent of x given s, and the observed example is conditioned on both content and style, x ∼ p(x|s, z). The corresponding graphical model is depicted in Figure 8. Class information is often encoded in discrete variables, but this formulation allows the content variables to be continuous, which makes generalization for few-shot learning possible.

    Figure 8: Graphical model that shows how content and style influence the example and its label. The example x is conditioned on s and z, and the class y is only conditioned on s.

    Analogous to the derivation of [9], a lower bound for log p(y,x) can be obtained. As depicted in the graphical model, the priors for the content and style are independent, thus p(s, z) = p(s)p(z). In contrast, the posterior p(s, z|x) cannot be factorized. However, we impose that the variational distribution can be factorized to simplify the model, such that q(s, z|x) = q(s|x)q(z|x) (Equation 18).


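  • \[
    \begin{aligned}
    \log p(y, x) &= \iint q(s, z|x) \log p(y, x) \, ds \, dz \\
    &= E_{s,z \sim q(s,z|x)} \left[ \log p(y, x) \right] \\
    &= E_{s,z \sim q(s,z|x)} \left[ \log p(y, x|s, z) - \log \frac{q(s, z|x)}{p(s)p(z)} + \log \frac{q(s, z|x)}{p(s, z|y, x)} \right] \\
    &= E_{s,z \sim q(s,z|x)} \left[ \log p(y, x|s, z) \right] - D_{KL}(q(s, z|x) \,\|\, p(s)p(z)) + D_{KL}(q(s, z|x) \,\|\, p(s, z|y, x)) \\
    &\geq E_{s,z \sim q(s,z|x)} \left[ \log p(y, x|s, z) \right] - D_{KL}(q(s, z|x) \,\|\, p(s)p(z)) \\
    &= E_{s \sim q(s|x),\, z \sim q(z|x)} \left[ \log p(y|s) p(x|s, z) \right] - D_{KL}(q(s|x) \,\|\, p(s)) - D_{KL}(q(z|x) \,\|\, p(z))
    \end{aligned}
    \tag{18}
    \]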

    The terms in this equation can be interpreted as an autoencoder with a classification model. The term p(y|s) is a class probability, p(x|s, z) is a reconstruction probability, and q(s|x) and q(z|x) can be interpreted as encoders. For now, the term DKL(q(s, z|x)||p(s, z|y,x)) is neglected, as it is non-negative.

    4.2 Class probability

    Thus far, we have obtained a lower bound to optimize log p(y,x). Inside the lower bound, the term p(y|s) refers to the class probability. Defining a class probability with some discriminator can be problematic, since classes will be different when tested. Instead, the class probability p(y|s) is defined relative to other examples. This definition is inspired by few-shot learning literature.

    4.2.1 Embedding Distance in Literature

    In the method proposed by Vinyals et al. [5], a modified softmax equation is used to compute the classification prediction (Equation 19). This equation can be modified to output a probability distribution $p(y|x) = \prod_{c=1}^{C} \hat{y}_c^{\,y_c}$, where C is the total number of classes. Variable S denotes the support set, which contains a few examples with labels. The function d can be an arbitrary distance metric, either a basic function such as Euclidean distance, or a complex function modeled by a deep network. f(x) is an embedding of an input vector x. The embedding function can be learned by a deep network. The equation is suitable for few-shot learning, because it makes a prediction for y, and is defined relative to the support set.

    \[
    \hat{y} = \sum_{(x', y') \in S} \frac{\exp(-d(f(x), f(x')))}{\sum_{(x'', y'') \in S} \exp(-d(f(x), f(x'')))} \, y'
    \tag{19}
    \]

    There are two distinct challenges when this method is applied to the generative model. Firstly, instead of a point estimate, the encoder predicts distributions. Not every distance metric may lead to meaningful distribution distances. For instance, in variational autoencoders, the variational distribution is often modeled by a multivariate normal distribution. In few-shot literature, d is often modeled by cosine distance. However, two equally likely samples (1, 1) and (-1, -1) from the normal distribution N (0, I) have a cosine similarity of minus one, meaning very dissimilar. Because the cosine distance changes in a curved space, it does not correspond to the form of the


  • normal distribution. The effect of two commonly used distance metrics is illustrated in Figure 9. Secondly, the classification probability is actually conditioned on the support set S, which has thus far been ignored in the graphical model. The formula p(y|s) needs to be defined relative to the support set, and will therefore be conditioned on the support set. As a result, the class probability will be redefined from p(y|s) to include the support set, p(y|s,SS ,YS).
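    The cosine example with the samples (1, 1) and (-1, -1) can be verified numerically. The sketch below uses only the Python standard library; the helper names are chosen for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def standard_normal_log_density(v):
    """Log density of N(0, I) evaluated at v."""
    return -0.5 * sum(x * x for x in v) - 0.5 * len(v) * math.log(2 * math.pi)

a, b = (1.0, 1.0), (-1.0, -1.0)
# Both points are equally likely under N(0, I) ...
same_density = math.isclose(standard_normal_log_density(a),
                            standard_normal_log_density(b))
# ... yet cosine similarity rates them as maximally dissimilar.
similarity = cosine_similarity(a, b)
```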

    Figure 9: Left: Visualization of the probability landscape of a multivariate normal distribution with mean at (1, 1) and diagonal variances of one. Center: Visualization of the cosine distance between a sample and the point (1, 1). Right: Visualization of the negative squared Euclidean distance between a sample and the point (1, 1).

    4.2.2 Model Class Probability

    The content s is chosen as the embedding for x, as the content variable should contain all necessary information to classify an example. Thus, the conditional probability distribution over y in the model is now defined as in Equation 20. The probability of a class increases when the distance between the content of an example and a support example of that class decreases. To avoid clutter, the normalization constant is written as Z. The variable C denotes the total number of classes in the support set. The subscript c is used to select the c’th component of a vector.

    \[
    p(y|s, S_S, Y_S) = \prod_{c=1}^{C} \left[ \sum_{(s_s, y_s) \in (S_S, Y_S)} \frac{\exp(-d(s, s_s))}{Z} \, y_{s,c} \right]^{y_c}
    \tag{20}
    \]

    For the distance function d, a simple squared Euclidean distance is chosen, d(a, b) = ||a − b||². There are two arguments for this choice. Firstly, an expected Euclidean distance for multivariate normals corresponds to a distance that is intuitively coherent, as depicted in Figure 9. But more importantly, we previously showed that by using Euclidean distances, an actual disentanglement can be learned.
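    A minimal sketch of Equation 20 with the squared Euclidean distance is given below. It is an illustration, not the thesis implementation: integer class indices stand in for the one-hot vectors ys, and the function name is an assumption.

```python
import math

def class_probabilities(s, support_contents, support_labels, n_classes):
    """Per-class probabilities of Equation 20 for a query content s.
    support_labels holds integer class indices instead of one-hot vectors."""
    neg_dists = [-sum((a - b) ** 2 for a, b in zip(s, ss))
                 for ss in support_contents]
    shift = max(neg_dists)                      # for numerical stability
    weights = [math.exp(v - shift) for v in neg_dists]
    z = sum(weights)                            # normalization constant Z
    probs = [0.0] * n_classes
    for w, label in zip(weights, support_labels):
        probs[label] += w / z                   # sum over support examples of class c
    return probs

# 2-way 1-shot toy example: the query lies close to the class-0 support content.
probs = class_probabilities([0.5, 0.0], [[0.0, 0.0], [3.0, 0.0]], [0, 1], 2)
```

    For a one-hot label y, the product over c in Equation 20 simply selects the entry of `probs` that belongs to the true class.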

    In the special case when the support set has only one example per class (i.e. 1-shot), the expectation of the numerator can be derived analytically. Assume that the objective function will include the form $E_{s, S_S}\left[\sum_i y_i \log p(y_i|s, S_S, Y_S)\right]$. Note then, that for the matching class, the numerator $E_{s \sim N(\mu, \Sigma),\, s' \sim N(m, L)}\left[-\log \exp(-d(s, s'))\right]$ can be simplified into $E_{s \sim N(\mu, \Sigma),\, s' \sim N(m, L)}\left[d(s, s')\right]$, where two arbitrary normal distributions are assumed for s and s′.

    Since d is squared Euclidean distance, optimizing the numerator in this special case will minimize the expected squared Euclidean distance between two multivariate normal distributions,

  • $\mathrm{Tr}(\Sigma + L) + (\mu - m)^T(\mu - m)$ (section 2). This term is the combination of the squared distance between means, and the sum of all diagonal variances.
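    The closed form for the expected squared Euclidean distance can be checked with a small Monte Carlo experiment. The sketch below assumes diagonal covariances and independent samples s and s′; the numbers are arbitrary toy values.

```python
import random

def expected_sq_distance(mu, var, m, l_var):
    """Closed form Tr(Sigma + L) + (mu - m)^T (mu - m) for diagonal Sigma, L."""
    return sum(var) + sum(l_var) + sum((a - b) ** 2 for a, b in zip(mu, m))

def monte_carlo_sq_distance(mu, var, m, l_var, n_samples=100000, seed=0):
    """Sample-based estimate of E[||s - s'||^2] with independent s and s'."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        s = [rng.gauss(a, v ** 0.5) for a, v in zip(mu, var)]
        s_prime = [rng.gauss(b, v ** 0.5) for b, v in zip(m, l_var)]
        total += sum((x - y) ** 2 for x, y in zip(s, s_prime))
    return total / n_samples

mu, var = [1.0, -2.0], [0.5, 2.0]      # mean and diagonal of Sigma
m, l_var = [0.0, 1.0], [1.0, 0.25]     # mean and diagonal of L
closed_form = expected_sq_distance(mu, var, m, l_var)
estimate = monte_carlo_sq_distance(mu, var, m, l_var)
```

    The estimate should agree with the closed form up to sampling noise, which illustrates that minimizing the numerator pulls the means together and shrinks the variances.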

    4.3 Support Set

    By specifying the class probability, a new concept was introduced, the support set. To be accurate, the graphical model is adapted to include the support set.

    The model with support set is depicted in Figure 10. Every training example is now connected to its own support set Sn. A support set example has the same generative process as a normal example. The only difference is that the support set is used to classify the example. Note that to optimize the complete likelihood log p(x,y,XS ,YS), all probabilities represented by arrows in the graphical model would need to be defined. This includes the class probability equation p(ys|ss), which was the reason we introduced the support set in the first place. Alternatively, the conditional distribution log p(x,y|XS ,YS) can be optimized and does not require the definition for p(ys|ss).

    Figure 10: Graphical model that includes a support set S. Every example in the dataset is connected to its support set, which defines relatively what class the example belongs to.

    4.3.1 The Support-Conditional Log-likelihood

    Instead of optimizing log p(x,y,XS ,YS), which would maximize the likelihood of all observed data, it is possible to optimize log p(x,y|XS ,YS), where the example is conditioned on the support set. Intuitively, this corresponds with the few-shot learning context, where an example is classified given a support set. To integrate the support set with the previously derived model, first the term p(y|s) from Equation 18 is redefined as p(y|s,SS ,YS), which changes the left-hand side of the log-likelihood to also condition on SS and YS , as shown in Equation 21.

    \[
    \log p(y, x|S_S, Y_S) \geq E_{s \sim q(s|x),\, z \sim q(z|x)} \left[ \log p(y|s, S_S, Y_S) p(x|s, z) \right] - D_{KL}(q(s|x) \,\|\, p(s)) - D_{KL}(q(z|x) \,\|\, p(z))
    \tag{21}
    \]

    This equation is conditioned on support content SS . To obtain the log-likelihood conditioned on support examples XS , p(y,x|SS ,YS) can be marginalized over support content SS with the posterior p(SS |XS). Since the posterior is intractable, a variational distribution is introduced and the log-likelihood is formulated as in Equation 22. Note that the second term still involves


  • the intractable posterior distribution, which will be resolved in the next section. Also note that the variational distribution q(SS |XS) can be factorized as $\prod_{x_s} q(s_s|x_s)$ and shares the same parameters as for an ordinary example x.

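    \[
    \begin{aligned}
    \log p(y, x|X_S, Y_S) &= \log \int p(S_S|X_S, Y_S) \, p(y, x|S_S, Y_S) \, dS_S \\
    &= \log \int q(S_S|X_S) \frac{p(S_S|X_S, Y_S)}{q(S_S|X_S)} \, p(y, x|S_S, Y_S) \, dS_S \\
    &\geq E_{S_S \sim q(S_S|X_S)} \left[ \log p(y, x|S_S, Y_S) - \log \frac{q(S_S|X_S)}{p(S_S|X_S, Y_S)} \right] \\
    &= E_{S_S \sim q(S_S|X_S)} \left[ \log p(y, x|S_S, Y_S) \right] - \underbrace{D_{KL}(q(S_S|X_S) \,\|\, p(S_S|X_S, Y_S))}_{\text{intractable}}
    \end{aligned}
    \tag{22}
    \]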

    4.3.2 Resolving the Posterior

    At first glance, the KL divergence with the posterior seems problematic. Recall that, during the derivation of the probability of an example, the term DKL(q(s, z|x)||p(s, z|y,x)) was neglected.

    In a moment the neglected term will be reintroduced, to cancel the intractable term. To match the neglected term, first the posterior for the support set is redefined so that it includes the style ZS , p(SS ,ZS |XS ,YS). Note that the first term of the lower bound does not need to include an expectation over ZS because the term is independent of ZS , leaving the first term unchanged (Equation 23).

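    \[
    \begin{aligned}
    \log p(y, x|X_S, Y_S) &= \log \iint p(S_S, Z_S|X_S, Y_S) \, p(y, x|S_S, Y_S) \, dS_S \, dZ_S \\
    &\geq E_{S_S \sim q(S_S|X_S)} \left[ \log p(y, x|S_S, Y_S) \right] - \underbrace{D_{KL}(q(S_S, Z_S|X_S) \,\|\, p(S_S, Z_S|X_S, Y_S))}_{\text{intractable}}
    \end{aligned}
    \tag{23}
    \]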

    The intractable term was encountered before in Equation 18, albeit for an ordinary example. Realize that x and XS are identically distributed, as they come from the dataset. The posteriors for an example and a support set should not differ, as these are also identically distributed. And thus, the expected value of the difference between the terms will become zero, as portrayed in Equation 24. For now, the support set has been assumed to consist of only one example.

    \[
    E_{(x, y), (x_s, y_s) \sim P_{data}} \left[ \underbrace{D_{KL}(q(s, z|x) \,\|\, p(s, z|x, y))}_{\text{neglected term}} - \underbrace{D_{KL}(q(s_s, z_s|x_s) \,\|\, p(s_s, z_s|x_s, y_s))}_{\text{intractable term}} \right] = 0
    \tag{24}
    \]

    The expected value is zero when the support set has only one example. In reality however, the support set always has multiple examples. The training procedure of few-shot learning uses


  • batches of queries. Therefore, the expected value of the difference will be greater than or equal to zero, under the condition that |B| ≥ |S| (the size of the batch is greater than or equal to the size of the support set). Because the procedure to derive the lower bound is repetitive and notation heavy, the complete derivation is shown in appendix A.

    4.4 Collecting All Components

    Collecting components from Equations 18, 23 and 24, the formulation for the model is displayed in Equation 25. The approximation is valid in the expectation over data when the number of samples for support set and queries are balanced, or the number of queries is greater. Although the final term is simplified for a single support set example, the same principle applies for larger support sets, as long as the batch is greater.

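    \[
    \begin{aligned}
    \log p(x, y|X_S, Y_S) \geq{} & E_{S_S \sim q(S_S|X_S)} \left[ E_{s \sim q(s|x),\, z \sim q(z|x)} \log p(y|s, S_S, Y_S) p(x|s, z) \right] \\
    & - D_{KL}(q(s|x) \,\|\, p(s)) - D_{KL}(q(z|x) \,\|\, p(z)) \\
    & + D_{KL}(q(s, z|x) \,\|\, p(s, z|x, y)) - D_{KL}(q(S_S, Z_S|X_S) \,\|\, p(S_S, Z_S|X_S, Y_S)) \\
    E_{P_{data}} \left[ \log p(x, y|x_s, y_s) \right] \geq{} & E_{P_{data}} \Big[ E_{s_s \sim q(s_s|x_s),\, s \sim q(s|x),\, z \sim q(z|x)} \left[ \log p(y|s, s_s, y_s) p(x|s, z) \right] \\
    & - D_{KL}(q(s|x) \,\|\, p(s)) - D_{KL}(q(z|x) \,\|\, p(z)) \Big]
    \end{aligned}
    \tag{25}
    \]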

    In summary, the log-likelihood started from a generative process for x. To define class probability, a support set was included. The log-likelihood was updated to condition on the support set, by marginalizing over the posterior. By rewriting the posterior, the expected difference between two intractable terms becomes non-negative. In the last step components were collected, and all terms of the equation can be computed.

    4.5 Inference

    During classification the term p(y|x,XS ,YS) is maximized. Maximizing this term is equivalent to maximizing the joint probability as depicted in Equation 26. Since all other terms in the joint probability are independent of y, practically only the class probability term needs to be maximized. Technically, the expected value should be computed. However, approximating samples with the mean of the distribution did not affect performance significantly.

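    \[
    \begin{aligned}
    \arg\max_y \left[ \log p(y|x, X_S, Y_S) \right] &= \arg\max_y \left[ \log p(y, x|X_S, Y_S) - \log p(x|X_S, Y_S) \right] \\
    &= \arg\max_y \left[ \log p(y, x|X_S, Y_S) \right] \\
    &= \arg\max_y E_{S_S \sim q(S_S|X_S),\, s \sim q(s|x)} \left[ \log p(y|s, S_S, Y_S) \right]
    \end{aligned}
    \tag{26}
    \]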


  • 4.6 Implementation

    A variational autoencoder is defined with two latent variables, s and z, with nlatent values each. The content code s is an input for the reconstruction and the class probability. The style code z is only used to reconstruct an example. The term p(x|s, z) is modeled by the decoder, and represents reconstruction error. This term is modeled with a Bernoulli loss, such that $\log p(x|s, z) = \sum_i x_i \log \hat{x}_i$, where x̂ = Dec(s, z), and the summation is over pixel values. The terms q(s|x) and q(z|x) are modeled by the encoder with multivariate normal distributions that have diagonal covariance matrices, such that s ∼ N (µs(x), Iσs(x)) and z ∼ N (µz(x), Iσz(x)). The DKL terms limit the divergence between the variational and the prior distributions and can be obtained analytically (section 2). Unless mentioned otherwise, expectations for ordinary examples are approximated with a single sample. Expectations for the support set are approximated with the mean of the distribution.
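    The analytic KL term and the single-sample approximation can be sketched as follows. This example assumes a standard normal prior N(0, I) and an encoder that outputs µ and log σ per dimension; the function names are chosen for illustration.

```python
import math
import random

def kl_to_standard_normal(mu, log_sigma):
    """D_KL(N(mu, diag(sigma^2)) || N(0, I)), summed over latent dimensions.
    The standard normal prior is an assumption of this sketch."""
    return sum(0.5 * (math.exp(2 * ls) + m * m - 1.0) - ls
               for m, ls in zip(mu, log_sigma))

def sample_latent(mu, log_sigma, rng):
    """One reparameterized sample mu + sigma * eps with eps ~ N(0, I)."""
    return [m + math.exp(ls) * rng.gauss(0.0, 1.0)
            for m, ls in zip(mu, log_sigma)]

rng = random.Random(0)
s_sample = sample_latent([0.3, -1.2], [-0.5, 0.1], rng)  # ordinary example
s_support = [0.3, -1.2]                                   # support set: use the mean
kl_at_prior = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
kl_shifted = kl_to_standard_normal([1.0], [0.0])
```

    The KL term vanishes when the variational distribution equals the prior and grows as the mean or variance moves away from it, which is the penalty referred to in the next paragraph.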

    In principle, the model is not constrained to learn two completely separate embeddings. The model could use s only for classification and z only for reconstruction. However, making use of the latent variables incurs a small penalty via the Kullback-Leibler divergence between the prior and the variational distribution. The term can be seen as a regularizer on the latent codes. Since the classification pushes content codes s apart, useful information for reconstruction is also available in s. Although the model can theoretically choose to put duplicate information in z, this would incur an additional penalty on the KL divergence of z. And thus, in optimal conditions, the model saves content information in s and style information in z.


  • 5 Datasets

    Few-shot learning scenarios differ from conventional classification tasks in machine learning. The concept of few-shot learning is that a few examples with label information are presented, known as the support set. Also, a different example is presented without the label. The task is to classify the example by using the support set. In general, the support set contains the same number of examples per class, and the class of the example is always in the support set. We define two variables that describe the few-shot learning setting: nway denotes the number of classes in the support set, and nshot denotes how many examples per class are in the support set. For instance, when nshot is 1 and nway is 5, this is a 5-way 1-shot classification problem. Importantly, during evaluation the classes in the support set have never been seen before.

    To evaluate models related to few-shot classification, we use four different datasets, differing in size and complexity. In this section we will first define the few-shot learning episode. Subsequent sections present the details of four different datasets.

    5.1 Episodes

    Suppose we have some pool of training examples Dtrain and test examples Dtest. Each training example belongs to a class. We choose Dtrain and Dtest such that a class is exclusively present in only one of the sets.

    Few-shot learning is comprised of episodes: a few examples are given with label information, and a new example needs to be classified. We describe an episode following the procedure of [5]. For each episode, we take nway different classes L ∼ D. For each class, we sample nshot examples as the support set S ∼ L. Also, we sample a batch B ∼ L with nqueries per class, for the nway different classes. We make sure that S and B are disjoint, i.e. they contain different samples. The task now is to classify B with the information in S. An example of a session is shown in Figure 11.

    The classes in Dtrain are different from the classes in Dtest. Therefore, a successful model will have to effectively use the limited information in the support set to make the correct prediction.
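    The episode construction can be sketched as follows. This is an illustrative example: the dictionary layout `examples_by_class`, the function name and the toy data are assumptions, not part of the original procedure.

```python
import random

def sample_episode(examples_by_class, n_way, n_shot, n_queries, rng):
    """Return (support, queries) as lists of (example, episode_label) pairs.
    Support and query examples of a class are drawn without overlap."""
    classes = rng.sample(sorted(examples_by_class), n_way)
    support, queries = [], []
    for label, cls in enumerate(classes):
        picked = rng.sample(examples_by_class[cls], n_shot + n_queries)
        support += [(x, label) for x in picked[:n_shot]]
        queries += [(x, label) for x in picked[n_shot:]]
    return support, queries

# Toy pool: 10 classes with 20 placeholder examples each.
toy_data = {f"class_{c}": [f"img_{c}_{i}" for i in range(20)] for c in range(10)}
support, queries = sample_episode(toy_data, n_way=5, n_shot=1, n_queries=3,
                                  rng=random.Random(0))
```

    Drawing the support and query examples of a class in a single call without replacement guarantees that S and B are disjoint.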

    Figure 11: Configuration of a few-shot learning session. On the left side the support set is shown. For every example the correct label is known. The right side shows the image that needs to be classified.

    5.2 MNIST

    The MNIST dataset is a well-known standard benchmark for machine learning. The dataset is relatively easy to learn; models can easily achieve 99% classification accuracy. This makes it a


  • practical dataset to test new algorithms and architectures. The training set contains 60000 examples and the test set 10000 examples, with 10 different classes. Images are 28 by 28 pixels. In Figure 12 a random sample of the dataset is shown.

    Figure 12: Random samples from the MNIST dataset.

    MNIST is not particularly well suited to evaluate few-shot learning performance. There is only a limited number of classes, with many examples per class. The MNIST dataset is mainly used to show disentanglements and examine latent variables. Actual few-shot learning evaluation results are presented on more complicated and better suited datasets.

    5.3 Omniglot

    The Omniglot dataset was created by Lake et al. to test algorithms while having only a handful of labelled examples [13]. The dataset contains 1623 different characters from 50 different alphabets. For every character, there are only 20 different examples available. In Figure 13, 20 samples from the dataset are presented. To preprocess the data, the procedure from [5] is followed. Images are resized to 28 by 28 pixels. The first 1200 characters are training data that is augmented with rotations of 0, 90, 180 and 270 degrees. The remaining 423 characters are used for evaluation. In contrast with [7], we do not augment test data unless specified otherwise.

    Figure 13: Random samples from the Omniglot dataset. Images are resized to 28 by 28 pixels and colors are inverted.

    5.4 miniImageNet

    The miniImageNet dataset was created by Ravi et al. to have a more difficult baseline for few-shot classification. Derived from the original ImageNet, the dataset contains only 100 different classes, with 600 examples per class. The dataset is split up in 64 train classes, 16 validation classes and 20 test classes. To preprocess the data, pixel values are rescaled to the range [0, 1], by dividing by 255. Images are resized to 84 by 84 pixels. The train images are rotated by 0, 90, 180 and 270 degrees to create more image classes. In Figure 14 a few random examples from the miniImageNet test set are presented.


  • Figure 14: Random samples from the miniImageNet test dataset.

    5.5 Quick, Draw

    The Quick, Draw dataset has been collected by Google Creative Lab.1 Users were asked to draw a concept, based on a textual description, such as “airplane” or “Eiffel Tower”. This can lead to very different drawings of the same concept. For instance, the description “clock” led some users to draw an analogue clock, and others a digital one.

    While users were drawing a concept, a recurrent neural network was guessing what the user tried to draw. Also, users were limited to 20 seconds within which they had to draw the concept. A session finished either after 20 seconds, or when the network guessed the concept correctly. As a consequence, some images might be incomplete. Both Omniglot and Quick, Draw use a process where users are asked to draw a concept. The difference is that Omniglot users were presented with a visual example, whereas Quick, Draw users were presented with a concept, which can be expressed in many different ways. Therefore, the intra-class variation in Quick, Draw is not only caused by drawing style, but also by the interpretation of the concepts to draw.

    In total there are 345 classes that we split in 275 train, 35 validation and 35 test classes. Each class contains numerous images, but we use only the first 100 images per class. Each image is 28 by 28 pixels in grayscale. A few samples from the dataset are depicted in Figure 15.

    Figure 15: Random samples from the Quick, Draw test dataset.

    1Quick, Draw Dataset https://quickdraw.withgoogle.com/data [Accessed in June 2017]


  • 6 Experiments

    In this section the experiments with the model are discussed. The first part gives an overview of the techniques that were used. The second part presents and analyzes the results.

    6.1 Setup

    In this section the experimental setup is detailed. First a non-standard batch normalization layer is introduced, because samples from the data are not identically distributed. Then the network architectures and hyperparameter configurations are discussed.

    6.1.1 Moving Average Batch Normalization

    Batch normalization [14] has significantly improved deep learning optimization in some instances. However, using batch normalization may be problematic when samples in a batch are not independent and identically distributed. Since a batch is skewed with only nway different classes, we may experience high variance for the first and second moments over different batches. Instead, we propose a simple method resembling [15], where moving averages are used at train and test time. In the pseudocode below the exact mechanism is specified. Note that x is an input and y is the corresponding output. Furthermore, β and γ are parameters trained with backpropagation. The moment variables are not trained, but updated as specified.

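    import numpy as np

    mu, sigma = 0.0, 1.0      # moving moments (initial values assumed); not trained
    beta, gamma = 0.0, 1.0    # shift and scale (initial values assumed); trained with backpropagation

    def moving_average_norm(x, is_training, decay):
        global mu, sigma
        # Normalize with the moving moments at both train and test time.
        # The shift beta is applied in the standard batch normalization form.
        y = gamma * (x - mu) / sigma + beta
        if is_training:
            # Moments of the current batch, computed over the batch axis.
            mu_b, sigma_b = np.mean(x, axis=0), np.std(x, axis=0)
            mu = mu + decay * (mu_b - mu)
            sigma = sigma + decay * (sigma_b - sigma)
        return y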

    6.1.2 Architecture

    The model can be separated into two distinct parts: an encoder for content and style, and the decoder for reconstructions. For computational efficiency, we use only one encoder that outputs both content and style. This means that weights are shared for content and style. The encoder architecture is largely inspired by [7], because their network achieved state of the art performance at the time of writing.

    Different from [7], we choose to have only 3 max pooling layers. Furthermore, we pad feature maps during pooling so that no information is discarded. The last layer is a fully connected layer to output µs, log σs, µz and log σz, corresponding to the distributions s ∼ N (s|µs, Iσs²) and z ∼ N (z|µz, Iσz²). Details are presented in Table 1.


  • Table 1: Encoder architecture, input is an image with 1 channel and 28 by 28 pixels, outputs are µs, log σs, µz, log σz.

    Name        Feature maps     Output size
    input       1                28, 28
    conv1       1 · nfilters     28, 28
    max pool    1 · nfilters     14, 14
    conv2       2 · nfilters     14, 14
    max pool    1 · nfilters     7, 7
    conv3       4 · nfilters     7, 7
    max pool    1 · nfilters     4, 4
    conv4       8 · nfilters     4, 4
    fc          4 · nlatent      1, 1

    The decoder takes s and z as inputs and transforms them with a fully connected layer to a feature map with shape (4, 4, nfilters · 8). In subsequent layers, the resolution is increased by a resizing and then a convolution operation. We choose to increase the resolution with factors of 2, and therefore the final 32 by 32 output needs to be cropped such that we have a 28 by 28 output. In Table 2 the specifics of the decoder are presented.

    Table 2: Decoder architecture, inputs are s, z and output is an image with 28 by 28 pixels

    Name        Feature maps     Output size
    Input       2 · nlatent      1, 1
    fc          8 · nfilters     4, 4
    upsample    8 · nfilters     8, 8
    conv1       8 · nfilters     8, 8
    upsample    4 · nfilters     16, 16
    conv2       4 · nfilters     16, 16
    upsample    2 · nfilters     32, 32
    conv3       2 · nfilters     32, 32
    conv4       1                32, 32
    slice       1                28, 28
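    The spatial sizes in Tables 1 and 2 follow from simple arithmetic, sketched below. The 2 by 2 pooling window with stride 2 and the centred crop are assumptions of this example; the text only states that pooling is padded and that the output is cropped.

```python
import math

def encoder_sizes(size=28, n_pools=3):
    """Spatial size after each padded max pooling (ceil division by 2)."""
    sizes = [size]
    for _ in range(n_pools):
        sizes.append(math.ceil(sizes[-1] / 2))
    return sizes

def decoder_sizes(size=4, n_upsamples=3, target=28):
    """Spatial size after each 2x upsampling and the per-side crop margin
    needed to reach the target size (centred crop assumed)."""
    sizes = [size]
    for _ in range(n_upsamples):
        sizes.append(sizes[-1] * 2)
    margin = (sizes[-1] - target) // 2
    return sizes, margin

enc = encoder_sizes()
dec, margin = decoder_sizes()
```

    Padding makes the 7 by 7 feature map pool to 4 by 4 instead of 3 by 3, which is why the decoder can start from a 4 by 4 map and has to crop 32 by 32 back to 28 by 28.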

    6.1.3 Configuration

    During training, Adam [16] is used to optimize the network. The model is trained for 50000 iterations with a learning rate of 1e-4. Since the reconstruction loss heavily outweighs the few-shot loss, the few-shot term log p(y|s,SS ,YS) is magnified. To match training procedures from literature closely, all other terms in the loss are divided by a factor λ, instead of multiplying the few-shot term with λ. The factor λ was set to 1000 experimentally.

    Every iteration, a support set of nway · nshot samples is drawn. Also nway · nqueries samples are drawn to be classified. The authors of [7] generally observed increased performance when training with a higher nway. Therefore, nway is set to 30. An overview of all additional parameters can be found in Table 3.


  • Table 3: Configuration for experiments during training

    Name             Value
    nfilters         128
    nlatent          32
    learning rate    1e-4
    λ                1000
    nway             30
    nshot            1
    nqueries         15

    6.2 Evaluation

    The model will be evaluated on the Omniglot, miniImageNet and Quick, Draw datasets. This section tests two hypotheses that were stated in the introduction: (1) The model actually learns a disentanglement. This can be tested by visualizing the reconstructions with perturbations of the content and style. (2) If the model successfully learns a disentanglement, few-shot performance will improve. The effect of disentanglement is tested by comparing the model to a deep network with the same architecture, but without the generative loss.

    6.2.1 Omniglot

    The model is trained on the Omniglot dataset with the hyperparameter settings as described in section 6.1. The disentanglement of the model is visualized as in previous sections: indirectly, by reconstructing images of perturbed content and style variables, and directly, by visualizing the structure of the high-dimensional content and style variables with t-distributed stochastic neighbour embedding (t-SNE).

    Interpolations of the content and style variables are depicted in Figure 16. These pictures show to what extent content and style have been disentangled. The interpolation results show that the network has successfully learned to encode the stylistic translation and scale attributes in z, since these tend to change vertically. Also, the content of an image tends to change horizontally, which confirms that content is encoded in s.


  • Figure 16: Interpolations between the latent variables of two images. The upper left and the bottom right are reconstructions from the test set. All other images are linear interpolations over content s and style z. The content variable s changes horizontally, the style z changes vertically.
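The layout of such an interpolation grid can be sketched as below. This is an illustration under assumptions: the function name is hypothetical, the latent codes are random stand-ins with 32 dimensions each, and decoding every grid cell with the trained decoder is left out.

```python
import numpy as np


def interpolation_grid(s_a, z_a, s_b, z_b, steps=5):
    """Return (s_grid, z_grid), each of shape (steps, steps, dim).

    Content s varies along the columns (horizontally),
    style z varies along the rows (vertically).
    """
    t = np.linspace(0.0, 1.0, steps)[:, None]
    s_line = (1.0 - t) * s_a + t * s_b  # (steps, dim_s)
    z_line = (1.0 - t) * z_a + t * z_b  # (steps, dim_z)
    s_grid = np.broadcast_to(s_line[None, :, :], (steps, steps, s_line.shape[1]))
    z_grid = np.broadcast_to(z_line[:, None, :], (steps, steps, z_line.shape[1]))
    return s_grid, z_grid


rng = np.random.default_rng(0)
s_a, s_b = rng.standard_normal(32), rng.standard_normal(32)
z_a, z_b = rng.standard_normal(32), rng.standard_normal(32)
s_grid, z_grid = interpolation_grid(s_a, z_a, s_b, z_b)
print(s_grid.shape, z_grid.shape)  # (5, 5, 32) (5, 5, 32)
```

The upper-left cell holds the codes of the first image and the bottom-right cell those of the second, matching the description in the caption.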

    Reconstructions of examples with interchanged variables are shown in Figure 17. Note that all characters in a column still have the general shape of the original example, which again demonstrates that content is modeled by s. In addition, note that the actual location, size and roundness are modeled by z, as one would expect style to be modeled.

    The pictures also show the limitations of the model: the reconstructions can be blurry, and sometimes lack certain strokes. Recall that Omniglot only has 20 examples per class, and the model has never actually seen the classes in the test set before. Another reason for blurriness is the reconstruction loss, which is formulated pixel-wise. This penalizes the model heavily for reconstructions that have been translated slightly.

    Figure 17: Interchanging the content and style of examples. Reconstructions are generated by taking s from the column and z from the row. Images from the dataset that are used as input are depicted at the sides.


  • The structure of the representation is visualized in Figure 18. The latent variables for content and style are embedded into a 2D manifold. Each test example is represented by a dot, where the color denotes the class. Since the examples are ordered by alphabet, similar colors often correspond to letters in the same alphabet. For every class, precisely one example is shown as an image. Some cluster annotations are provided to make the diagrams more understandable. Note that these are subjective and not necessarily complete.

    The top embedding shows the structure of s, representing content. Images with similar classes lie close together. For instance, a group of ‘o’-shaped characters clusters together. Furthermore, a few box-shaped characters visibly cluster. In general, characters from the same alphabet are grouped together more than others. There seems to be no real pattern for other factors. For instance, the characters in the ‘o’-group have very different sizes, and still their s variables lie close together.

    The bottom embedding depicts the style z. In contrast with the previous embedding, grouping based on location, scale and other factors is now expected. Some clusters are annotated intuitively to demonstrate the different styles. For instance, the top group contains images that are drawn relatively high in the image. Thus, the diagrams illustrate that content is indeed modeled by s, and style is modeled by z.


  • Figure 18: Visualization of the structure of the high-dimensional content and style variables. For every class of the test set, exactly one image is depicted. Top: Embedding of content s. Bottom: Embedding of style z.


  • The classification performance is presented in Table 4. Since the authors of [7] evaluated on augmented test data, performance is reported on both normal and augmented test data. It is worth mentioning that augmenting the test data increases accuracy, but may not be a realistic problem setting. The disentangled VAE outperforms the other models consistently in every setting. The last row denotes the performance of the same architecture without the generative loss, and therefore without disentangled representations. The performance drops consistently for all tasks, for example from 95.9% to 94.8% on the 1-shot task. Thus, learning disentangled representations improves few-shot classification on Omniglot.

    Table 4: 20-way classification performance on the Omniglot dataset. The ‘+’ sign denotes that performance is measured on an augmented test dataset (90 degree rotations).

    Model                       1-shot   1-shot+   5-shot   5-shot+
    Matching Networks [5]       93.8%    -         98.7%    -
    Prototypical Networks [7]   -        96.0%     -        98.9%
    Disentangled VAE            95.9%    97.0%     98.8%    99.1%
    Only few-shot loss          94.8%    96.0%     98.4%    98.9%

    6.2.2 miniImageNet

    The miniImageNet images were resized to 84 by 84 pixels, matching [5, 7]. To handle the increased resolution, an additional max pooling layer is added after conv4 in the encoder. The reconstruction task is simplified by resizing the target to 32 by 32 pixels, and thus the final slicing layer is removed from the decoder.

    Experiments showed that the model performed worse than the baseline models. In limited data settings, the network relies on heavy regularization. With Omniglot, this problem did not really occur, but with a more complex dataset such as miniImageNet, the network quickly overfits. However, tests on Omniglot revealed that removing the fully connected layer significantly degrades the quality of the learned disentanglement. Visualizations of the reconstructions also reveal that the network does not disentangle anything, as the decoder only relies on z (Figure 19). Furthermore, the network learns to create vague reconstructions, showing that the generative model itself is limited.


  • Figure 19: Interchanging the content and style of examples. Reconstructions are generated by taking s from the column and z from the row. Images from the dataset that are used as input are depicted at the sides.

    The miniImageNet dataset is complex compared to Omniglot, and the classes exhibit large variations within a class. The capacity of the VAE is limited when it comes to ImageNet, and the architecture changes for disentangling remove important regularization. To achieve better performance, the model needs to be revised such that these problems are addressed.

    6.2.3 Quick, Draw

    The Quick, Draw dataset is in the same format as Omniglot, and thus the same settings as described in section 6.1 are used. The few-shot classification performance on Quick, Draw is difficult to report, since the validation accuracy of both the baselines and the disentangled VAE did not converge. Even when training had converged, the validation error fluctuated rapidly. Overall, the performance of the baseline is better, because the same problem as with miniImageNet occurs: the network overfits because of the fully connected layer.

    Interestingly, although somewhat vague, a disentanglement is actually learned. Analogous to the previous visualizations, the interpolation and interchanging of s and z are depicted (Figure 20). The pictures show that stylistic attributes such as rotation are modeled by z, while the general shape of an object is modeled by s.

    Quick, Draw resembles Omniglot in certain ways, but there are also important differences. In Omniglot, the variation is caused by the drawing style of the user, while the images of Quick, Draw users also vary because of variation in interpretation. For instance, disentangling the variation in clocks would require the model to encode “analogue” or “digital” in the style z. However, for another class such as airplane, this attribute might not have any meaning. Thus, to some degree, disentangled representations are learned on Quick, Draw, but they may be impeded because the stylistic variations are often limited to a single class.


  • Figure 20: Left: Interpolations between the latent variables of two images. The upper left and the bottom right are reconstructions from the test set. All other images are linear interpolations over content s and style z. The content variable s changes horizontally, the style z changes vertically. Right: Interchanging the content and style of examples. Reconstructions are generated by taking s from the column and z from the row.

    6.3 Expectation of Support Set

    The loss function that is optimized, defined in section 4, contains multiple expectations: one for the example that needs to be classified, and one for the support set. Since optimization is performed in batches, the expectation over the queries is approximated with a single sample. For the support set we test two options:

    1. $\mathbb{E}_{S_S \sim q(S_S \mid X_S)}[\cdot]$ is approximated with a single sample for each support set item, for each query.

    2. $\mathbb{E}_{S_S \sim q(S_S \mid X_S)}[\cdot]$ is approximated by the overconfident estimate $\mu_s$.

    In the end, no significant difference in classification performance or learning was encountered. Also, the learned disentanglement did not look visibly different. All presented results are reported on models trained with the second method, as it is the most straightforward to implement.
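The two options can be contrasted in a short sketch. The function name, the `mode` argument and the log-variance parameterization are assumptions made for illustration. The sketch assumes the encoder outputs a mean and a log-variance per support item, as is common for a diagonal Gaussian posterior.

```python
import numpy as np


def support_content(mu_s, log_var_s, n_queries, mode="mean", rng=None):
    """Return support content codes of shape (n_queries, n_support, dim)."""
    n_support, dim = mu_s.shape
    if mode == "mean":
        # Option 2: every query sees the same point estimate mu_s.
        return np.broadcast_to(mu_s, (n_queries, n_support, dim))
    # Option 1: one reparameterized sample per support item, per query.
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal((n_queries, n_support, dim))
    return mu_s + np.exp(0.5 * log_var_s) * eps


mu = np.zeros((3, 4))
log_var = np.zeros((3, 4))
mean_codes = support_content(mu, log_var, n_queries=5)
sampled_codes = support_content(mu, log_var, n_queries=5, mode="sample",
                                rng=np.random.default_rng(0))
print(mean_codes.shape, sampled_codes.shape)  # (5, 3, 4) (5, 3, 4)
```

The second option needs no extra random draws and lets the class statistics of the support set be computed once per episode, which is consistent with it being the simpler one to implement.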

    6.4 Discussion

    In this section, the results and conclusions of the experiments are summarized. Furthermore, some intuitive insights are provided from empirical observations. The first part discusses results based on architecture changes. The second part discusses performance on different datasets.


  • 6.4.1 Architecture

    The architecture of the encoder is designed so that it matches Prototypical Networks [7] closely, since they achieved state-of-the-art performance at that time on both Omniglot and miniImageNet. However, to make the model suitable for disentangled representation learning, some aspects have to be modified.

    Experiments with different network architectures on Omniglot showed that disentangled representations are learned when a fully connected layer is used in the encoder, between the last convolutional layer and the latent representation. A downside of fully connected layers is that in some instances they easily overfit, which caused the model to perform worse on more complex datasets.

    Prototypical Networks used max-pooling layers to reduce the resolution. Max-pooling layers are insensitive to small translation perturbations, which improves regularization but reduces precision. With generative modelling, ideally the model would be more sensitive to these perturbations. However, strided convolutions impeded performance drastically. Instead, the fourth max pooling layer is removed, to retain some sensitivity. In contrast with Prototypical Networks, the max pooling operations are padded, because this allows them to retain more information.
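The effect of padding and of dropping the fourth pooling layer on the spatial resolution can be illustrated with simple arithmetic. The sketch assumes a 28 by 28 input, 2 by 2 pooling windows with stride 2, and padding that rounds odd sizes up. These are common Omniglot settings but are not stated in this section, and the helper name is hypothetical.

```python
import math


def pooled_sizes(size, n_layers, padded):
    """Spatial size after each 2x2, stride-2 max-pooling layer."""
    sizes = []
    for _ in range(n_layers):
        # Padded pooling rounds odd sizes up, unpadded pooling rounds them down.
        size = math.ceil(size / 2) if padded else size // 2
        sizes.append(size)
    return sizes


print(pooled_sizes(28, 3, padded=True))   # [14, 7, 4]
print(pooled_sizes(28, 4, padded=False))  # [14, 7, 3, 1]
```

Under these assumptions, three padded pooling layers leave a 4 by 4 feature map, whereas four unpadded layers collapse the map to a single spatial position.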

    6.4.2 Performance

    The Disentangled VAE showed a performance increase over existing methods on the Omniglot dataset (Table 4). However, it had more difficulty with miniImageNet and Quick, Draw.

    The hypothesis that learning disentangled representations can be combined with few-shot classification is confirmed by the direct and indirect visualizations of s and z. Visualizations confirm that s models content and z models style on Omniglot. Moreover, few-shot classification performance is improved on Omniglot by utilizing disentangled representations, confirming the second hypothesis.

    The miniImageNet and Quick, Draw datasets are inherently more difficult. The reconstructions from a VAE tend to be vague and without much detail. Generative modelling combined with deep learning is a relatively new area of research, and future developments could play a crucial role in making this method work on more complex datasets. In addition, disentangling representations might be less helpful when style attributes are not shared between classes. Being able to disentangle the property analogue or digital does not improve airplane classification.

    6.4.3 Model Framework

    The mathematically derived model optimizes a lower bound of the conditional log-likelihood. Furthermore, the variational distribution is constrained to be a multivariate normal distribution with diagonal covariance. Nonetheless, the model has learned to disentangle representations, reconstruct images, and perform few-shot classification. Potentially, less restrictive approximations can improve the quality of reconstructions and classification performance.
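For a diagonal normal variational distribution, the KL terms of the objective have a closed form. The sketch below assumes a standard normal prior, which is the usual choice for a VAE but is not restated in this section. The function name and the example values are hypothetical.

```python
import numpy as np


def kl_diag_normal_to_standard(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over the last axis."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)


mu_s, log_var_s = np.array([1.0, 0.0]), np.array([0.0, 0.0])
mu_z, log_var_z = np.array([0.0, 2.0]), np.array([0.0, 0.0])
kl_s = kl_diag_normal_to_standard(mu_s, log_var_s)
kl_z = kl_diag_normal_to_standard(mu_z, log_var_z)
kl_joint = kl_diag_normal_to_standard(np.concatenate([mu_s, mu_z]),
                                      np.concatenate([log_var_s, log_var_z]))
print(kl_s, kl_z, kl_joint)  # 0.5 2.0 2.5
```

Because the covariance is diagonal, the KL of the joint code equals the sum of the content and style KL terms. This is why the objective can carry separate KL terms for s and z.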


  • Conclusion

    The human brain is remarkably good at disentangling illumination, pose and other changes in viewpoint from the actual object of interest. We propose that disentangling representations is key to learning an interpretation that is suitable for generalization. Few-shot classification is a field where such a representation has to be learned. The hypothesis is that learning disentangled representations can be combined with few-shot classification and, furthermore, that few-shot classification accuracy is improved by using disentangled representations.

    In an exploratory study, we demonstrated that the structure of learned disentangled representations can be shaped into a suitable structure for few-shot classification. Furthermore, experiments show that an adversarial network is not necessary to learn a suitable disentangled representation.

    Inspired by intuitions gained from the exploratory study, theory for generative few-shot learning is developed. A graphical model is defined for a single example. The process assumes two latent variables s and z that represent the content and style of an example x. The graphical model is extended to include the support set, and a lower bound of a conditional log-likelihood is mathematically derived.

    The framework is trained on three different datasets. Two datasets reveal opportunities for the model to improve. Experiments with the Omniglot dataset confirm that learning disentangled representations can be combined with few-shot classification. Moreover, state-of-the-art performance is achieved at the time of writing, showing that classification accuracy improves by using disentangled representations. Analysis of the datasets indicates that the framework works particularly well when the stylistic attributes are shared between classes, which effectively gives the generative model more samples to learn stylistic attributes from.

    In summary, we combine learning disentangled representations, few-shot learning, and generative modelling. By combining a few-shot loss with a variational autoencoder, a disentanglement is learned naturally. Experiments demonstrate that disentangled representations can improve few-shot classification performance.

    A few suggestions for interesting further research include:

    • Defining the distance function as a deep learning model $d_\theta(s, s')$, preferably such that $d_\theta(s, s') = d_\theta(s', s)$. This may allow the model to learn a more suitable metric.

    • Exploiting siamese adversarial networks to disentangle representations. This may improve the quality of generated data, and allow disentanglement of more complicated data.

    • Incorporating unsupervised disentangling methods for semi-supervised few-shot classification.
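As an illustration of the first suggestion, one simple way to obtain the symmetry property is to average a learned function over both argument orders. This is a sketch of one possible construction, not a method from the thesis. The helper `symmetrize` and the toy function `d_theta`, with a random matrix standing in for learned parameters, are hypothetical.

```python
import numpy as np


def symmetrize(d):
    """Wrap any pairwise function d(a, b) so that the result is symmetric."""
    def d_sym(a, b):
        return 0.5 * (d(a, b) + d(b, a))
    return d_sym


rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))  # stand-in for learned parameters theta


def d_theta(a, b):
    """Toy asymmetric, non-negative 'learned' distance."""
    return float(np.sum((W @ a - b) ** 2))


d_sym = symmetrize(d_theta)
s, s_prime = rng.standard_normal(4), rng.standard_normal(4)
print(d_sym(s, s_prime) == d_sym(s_prime, s))  # True
```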


  • References

    [1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016.

    [2] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

    [3] Li Fei-Fei, Rob Fergus, and Pietro Perona. One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4):594–611, 2006.

    [4] Gregory Koch. Siamese neural networks for one-shot image recognition. PhD thesis, University of Toronto, 2015.

    [5] Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638, 2016.

    [6] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In Fifth International Conference on Learning Representations, ICLR, 2017.

    [7] Jake Snell, Kevin Swersky, and Richard S Zemel. Prototypical networks for few-shot learning. arXiv preprint arXiv:1703.05175, 2017.

    [8] Akshay Mehrotra and Ambedkar Dukkipati. Generative adversarial residual pairwise networks for one shot learning. arXiv preprint arXiv:1703.08033, 2017.

    [9] Diederik P Kingma and Max Welling. Stochastic gradient vb and the variational auto-encoder. In Second International Conference on Learning Representations, ICLR, 2014.

    [10] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

    [11] Michael F Mathieu, Junbo Jake Zhao, Aditya Ramesh, Pablo Sprechmann, and Yann LeCun. Disentangling factors of variation in deep representation using adversarial training. In Advances in Neural Information Processing Systems, pages 5041–5049, 2016.

    [12] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, 2016.

    [13] Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.

    [14] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 448–456, 2015.

    [15] Sergey Ioffe. Batch renormalization: Towards reducing minibatch dependence in batch-normalized models. arXiv preprint arXiv:1702.03275, 2017.

    [16] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Third International Conference on Learning Representations, ICLR, 2014.


  • A Derivation of the Lower Bound for Batches

    In this section, a lower bound for the conditional log-likelihood is derived. The model is adapted such that an intractable posterior term can be cancelled against another posterior term.

    A.1 Model definition

    During optimization, the queries are sampled in batches. This ensures that a lower bound is optimized. The graphical model that corresponds to this perspective is illustrated in Figure 21.

    [Graphical model with support nodes x_s, y_s, s_s, z_s and batch nodes x_b, y_b, s_b, z_b, with plates S_n, B_n and N.]

    Figure 21: Graphical model that includes a support set S and the batch B. All examples in the batch are classified with the same support set.

    A.2 Log-likelihood Conditioned on Support Content (SS)

    In this section, the log-likelihood for a batch is derived, conditioned on support content $S_S$. The derivation is based on the graphical model in Figure 21. (Equation 27)


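    \begin{equation}
    \begin{aligned}
    \log p(Y_B, X_B \mid S_S, Y_S)
    &= \log \prod_{(x_b, y_b)} p(y_b, x_b \mid S_S, Y_S) \\
    &= \sum_{(x_b, y_b)} \log p(y_b, x_b \mid S_S, Y_S) \\
    &= \sum_{(x_b, y_b)} \iint q(s_b, z_b \mid x_b) \log p(y_b, x_b \mid S_S, Y_S) \, ds_b \, dz_b \\
    &= \sum_{(x_b, y_b)} \mathbb{E}_{s_b, z_b \sim q(s_b, z_b \mid x_b)} \left[ \log p(y_b, x_b \mid S_S, Y_S) \right] \\
    &= \sum_{(x_b, y_b)} \mathbb{E}_{s_b, z_b \sim q(s_b, z_b \mid x_b)} \left[ \log p(y_b, x_b \mid s_b, z_b, S_S, Y_S) - \log \frac{q(s_b, z_b \mid x_b)}{p(s_b) p(z_b)} + \log \frac{q(s_b, z_b \mid x_b)}{p(s_b, z_b \mid x_b, y_b)} \right] \\
    &= \sum_{(x_b, y_b)} \Big( \mathbb{E}_{s_b, z_b \sim q(s_b, z_b \mid x_b)} \left[ \log p(y_b \mid s_b, S_S, Y_S) \, p(x_b \mid s_b, z_b) \right] \\
    &\qquad - D_{KL}\big(q(s_b, z_b \mid x_b) \,\|\, p(s_b) p(z_b)\big) + D_{KL}\big(q(s_b, z_b \mid x_b) \,\|\, p(s_b, z_b \mid x_b, y_b)\big) \Big) \\
    &= \sum_{(x_b, y_b)} \Big( \mathbb{E}_{s_b, z_b \sim q(s_b, z_b \mid x_b)} \left[ \log p(y_b \mid s_b, S_S, Y_S) \, p(x_b \mid s_b, z_b) \right] \\
    &\qquad - D_{KL}\big(q(s_b \mid x_b) \,\|\, p(s_b)\big) - D_{KL}\big(q(z_b \mid x_b) \,\|\, p(z_b)\big) + D_{KL}\big(q(s_b, z_b \mid x_b) \,\|\, p(s_b, z_b \mid x_b, y_b)\big) \Big)
    \end{aligned}
    \tag{27}
    \end{equation}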
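The step that introduces the two log-ratio terms can be spelled out with Bayes' rule. The worked lines below are a sketch, assuming that the prior factorizes as $p(s_b) p(z_b)$ and that the posterior $p(s_b, z_b \mid x_b, y_b)$ is implicitly conditioned on the support set.

```latex
\begin{aligned}
p(s_b, z_b \mid x_b, y_b)
  &= \frac{p(y_b, x_b \mid s_b, z_b, S_S, Y_S) \, p(s_b) p(z_b)}{p(y_b, x_b \mid S_S, Y_S)} \\
\log p(y_b, x_b \mid S_S, Y_S)
  &= \log p(y_b, x_b \mid s_b, z_b, S_S, Y_S) + \log p(s_b) p(z_b) - \log p(s_b, z_b \mid x_b, y_b) \\
  &= \log p(y_b, x_b \mid s_b, z_b, S_S, Y_S)
     - \log \frac{q(s_b, z_b \mid x_b)}{p(s_b) p(z_b)}
     + \log \frac{q(s_b, z_b \mid x_b)}{p(s_b, z_b \mid x_b, y_b)}
\end{aligned}
```

The last line adds and subtracts $\log q(s_b, z_b \mid x_b)$. Taking the expectation under $q$ turns the two log-ratios into the KL terms of Equation 27.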

    A.3 Lowerbound Conditioned on Support Examples (XS)

    In this section, the lower bound of the log-likelihood for a batch is derived, conditioned on the support examples $X_S$. The previously derived term $\log p(X_B, Y_B \mid S_S, Y_S)$ is marginalized over $p(S_S, Z_S \mid X_S, Y_S)$. (Equation 28)

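    \begin{equation}
    \begin{aligned}
    \log p(Y_B, X_B \mid X_S, Y_S)
    &= \log \iint p(S_S, Z_S \mid X_S, Y_S) \prod_{(x_b, y_b)} p(y_b, x_b \mid S_S, Y_S) \, dS_S \, dZ_S \\
    &= \log \iint q(S_S, Z_S \mid X_S) \, \frac{p(S_S, Z_S \mid X_S, Y_S)}{q(S_S, Z_S \mid X_S)} \prod_{(x_b, y_b)} p(y_b, x_b \mid S_S, Y_S) \, dS_S \, dZ_S \\
    &\geq \mathbb{E}_{S_S, Z_S \sim q(S_S, Z_S \mid X_S)} \left[ \sum_{(x_b, y_b)} \log p(x_b, y_b \mid S_S, Y_S) - \log \frac{q(S_S, Z_S \mid X_S)}{p(S_S, Z_S \mid X_S, Y_S)} \right] \\
    &= \mathbb{E}_{S_S \sim q(S_S, Z_S \mid X_S)} \left[ \sum_{(x_b, y_b)} \log p(x_b, y_b \mid S_S, Y_S) \right] - \underbrace{D_{KL}\big(q(S_S, Z_S \mid X_S) \,\|\, p(S_S, Z_S \mid X_S, Y_S)\big)}_{\text{intractable}}
    \end{aligned}
    \tag{28}
    \end{equation}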
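The inequality in this derivation is an instance of Jensen's inequality for the logarithm: the log of an average is at least the average of the logs. A small numerical check, with arbitrary positive numbers standing in for the integrand:

```python
import numpy as np

rng = np.random.default_rng(0)
# Arbitrary positive values; in the derivation these would be the weighted likelihoods.
w = rng.uniform(0.1, 2.0, size=10_000)

log_of_mean = np.log(np.mean(w))
mean_of_log = np.mean(np.log(w))
print(log_of_mean >= mean_of_log)  # True
```

The gap between the two quantities is what makes the result a lower bound rather than an equality.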


  • A.4 Intermezzo: Factorizing the Support Set KL Term

    Note that the distributions of the support set factorize as follows. The variational distribution factorizes such that $q(S_S \mid X_S) = \prod_{(s_s, x_s)} q(s_s \mid x_s)$, and the posterior distribution factorizes as $p(S_S, Z_S \mid X_S, Y_S) = \prod_{(x_s, y_s)} p(s_s, z_s \mid x_s, y_s)$. (Equation 29)

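    \begin{equation}
    \begin{aligned}
    &D_{KL}\big(q(S_S, Z_S \mid X_S) \,\|\, p(S_S, Z_S \mid X_S, Y_S)\big) \\
    &= \iint q(S_S, Z_S \mid X_S) \log \frac{q(S_S, Z_S \mid X_S)}{p(S_S, Z_S \mid X_S, Y_S)} \, dS_S \, dZ_S \\
    &= \int \prod_{x_s} q(s_s, z_s \mid x_s) \log \frac{\prod_{x_s} q(s_s, z_s \mid x_s)}{\prod_{(x_s, y_s)} p(s_s, z_s \mid x_s, y_s)} \prod_{s_s} ds_s \prod_{z_s} dz_s \\
    &= \int \prod_{x_s} q(s_s, z_s \mid x_s) \sum_{(x_s, y_s)} \log \frac{q(s_s, z_s \mid x_s)}{p(s_s, z_s \mid x_s, y_s)} \prod_{s_s} ds_s \prod_{z_s} dz_s \\
    &= \prod_{x_s} \int q(s_s, z_s \mid x_s) \sum_{(x_s, y_s)} \log \frac{q(s_s, z_s \mid x_s)}{p(s_s, z_s \mid x_s, y_s)} \prod_{s_s} ds_s \prod_{z_s} dz_s \\
    &= \sum_{(x_s, y_s)} \iint q(s_s, z_s \mid x_s) \log \frac{q(s_s, z_s \mid x_s)}{p(s_s, z_s \mid x_s, y_s)} \, ds_s \, dz_s \\
    &= \sum_{(x_s, y_s)} D_{KL}\big(q(s_s, z_s \mid x_s) \,\|\, p(s_s, z_s \mid x_s, y_s)\big)
    \end{aligned}
    \tag{29}
    \end{equation}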
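The factorization is easiest to see for a support set with two items. In the sketch below, $q_i$ and $p_i$ are shorthand introduced only for this illustration: $q_i = q(s_i, z_i \mid x_i)$ and $p_i = p(s_i, z_i \mid x_i, y_i)$, with the integrals running over $(s_i, z_i)$.

```latex
\begin{aligned}
\iint q_1 q_2 \left( \log \frac{q_1}{p_1} + \log \frac{q_2}{p_2} \right)
  &= \int q_1 \log \frac{q_1}{p_1} \underbrace{\int q_2}_{=1}
   + \int q_2 \log \frac{q_2}{p_2} \underbrace{\int q_1}_{=1} \\
  &= D_{KL}(q_1 \,\|\, p_1) + D_{KL}(q_2 \,\|\, p_2)
\end{aligned}
```

Each summand depends on a single support item, so all other factors of the product integrate to one. The general case follows in the same way.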

    A.5 Collecting Terms

    In this section we combine terms from Equations 27-29.


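    \begin{equation}
    \begin{aligned}
    &\log p(Y_B, X_B \mid X_S, Y_S) \\
    &\geq \mathbb{E}_{S_S \sim q(S_S, Z_S \mid X_S)} \left[ \sum_{(x_b, y_b)} \log p(x_b, y_b \mid S_S, Y_S) \right] - D_{KL}\big(q(S_S, Z_S \mid X_S) \,\|\, p(S_S, Z_S \mid X_S, Y_S)\big) \\
    &= \mathbb{E}_{S_S \sim q(S_S, Z_S \mid X_S)} \Bigg[ \sum_{(x_b, y_b)} \Big( \mathbb{E}_{s_b, z_b \sim q(s_b, z_b \mid x_b)} \left[ \log p(y_b \mid s_b, S_S, Y_S) \, p(x_b \mid s_b, z_b) \right] \\
    &\qquad - D_{KL}\big(q(s_b \mid x_b) \,\|\, p(s_b)\big) - D_{KL}\big(q(z_b \mid x_b) \,\|\, p(z_b)\big) \\
    &\qquad + D_{KL}\big(q(s_b, z_b \mid x_b) \,\|\, p(s_b, z_b \mid x_b, y_b)\big) \Big) \Bigg] \\
    &\qquad - \sum_{(x_s, y_s)} D_{KL}\big(q(s_s, z_s \mid x_s) \,\|\, p(s_s, z_s \mid x_s, y_s)\big) \\
    &= \mathbb{E}_{S_S \sim q(S_S, Z_S \mid X_S)} \left[ \sum_{(x_b, y_b)} \mathbb{E}_{s_b, z_b \sim q(s_b, z_b \mid x_b)} \left[ \log p(y_b \mid s_b, S_S, Y_S) \, p(x_b \mid s_b, z_b) \right] \right] \\
    &\qquad + \sum_{(x_b, y_b)} \Big( - D_{KL}\big(q(s_b \mid x_b) \,\|\, p(s_b)\big) - D_{KL}\big(q(z_b \mid x_b) \,\|\, p(z_b)\big) \Big) \\
    &\qquad + \sum_{(x_b, y_b)} D_{KL}\big(q(s_b, z_b \mid x_b) \,\|\, p(s_b, z_b \mid x_b, y_b)\big) \\
    &\qquad - \sum_{(x_s, y_s)} D_{KL}\big(q(s_s, z_s \mid x_s) \,\|\, p(s_s, z_s \mid x_s, y_s)\big)
    \end{aligned}
    \tag{30}
    \end{equation}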

    A.6 Objective Function

    The last two terms in Equation 30 can be dropped, because the per-example KL terms involving the posteriors are identically distributed for batch and support examples. The expectation of their difference is therefore greater than or equal to zero, provided that $|B| \geq |S|$ (Equation 31).

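    \begin{equation}
    \mathbb{E}_{(X_B, Y_B), (X_S, Y_S) \sim P_{data}} \left[ \sum_{(x_b, y_b)} D_{KL}\big(q(s_b, z_b \mid x_b) \,\|\, p(s_b, z_b \mid x_b, y_b)\big) - \sum_{(x_s, y_s)} D_{KL}\big(q(s_s, z_s \mid x_s) \,\|\, p(s_s, z_s \mid x_s, y_s)\big) \right] \geq 0
    \tag{31}
    \end{equation}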
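One way to read the condition $|B| \geq |S|$ is sketched below. It assumes that batch and support examples are drawn from the same data distribution, so that every per-example KL term has the same expected value. The symbol $\kappa$ for that value is introduced here for illustration only.

```latex
\begin{aligned}
\kappa &= \mathbb{E}_{(x, y) \sim P_{data}} \left[ D_{KL}\big(q(s, z \mid x) \,\|\, p(s, z \mid x, y)\big) \right] \geq 0 \\
\mathbb{E}\left[ \sum_{(x_b, y_b)} D_{KL}(\cdot) - \sum_{(x_s, y_s)} D_{KL}(\cdot) \right]
  &= |B| \, \kappa - |S| \, \kappa = (|B| - |S|) \, \kappa \geq 0
  \quad \text{if } |B| \geq |S|
\end{aligned}
```

With the training settings of Table 3, $|S| = 30 \cdot 1 = 30$ and $|B| = 30 \cdot 15 = 450$, so the condition holds.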

    In conclusion, the objective function can be written as a simplification of Equation 30, where the last two terms are dropped. The function is presented in Equation 32. The inequality originates from both the lower bound obtained when marginalizing over the posterior of the support set content, and the inequality formulated in Equation 31.


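    \begin{equation}
    \begin{aligned}
    &\mathbb{E}_{(X_B, Y_B), (X_S, Y_S) \sim P_{data}} \left[ \log p(Y_B, X_B \mid X_S, Y_S) \right] \\
    &\geq \mathbb{E}_{P_{data}} \Bigg[ \mathbb{E}_{S_S \sim q(S_S, Z_S \mid X_S)} \left[ \sum_{(x_b, y_b)} \mathbb{E}_{s_b, z_b \sim q(s_b, z_b \mid x_b)} \left[ \log p(y_b \mid s_b, S_S, Y_S) \, p(x_b \mid s_b, z_b) \right] \right] \\
    &\qquad + \sum_{(x_b, y_b)} \Big( - D_{KL}\big(q(s_b \mid x_b) \,\|\, p(s_b)\big) - D_{KL}\big(q(z_b \mid x_b) \,\|\, p(z_b)\big) \Big) \Bigg]
    \end{aligned}
    \tag{32}
    \end{equation}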
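A single-sample estimate of this objective for one episode could be sketched as follows. Several parts are assumptions and not taken from this appendix: the class probability is modelled as a softmax over negative squared Euclidean distances to class means of the support content, in the style of Prototypical Networks; the priors are standard normal; and the decoder log-likelihood `log_px` is passed in as a given array. All function and argument names are hypothetical.

```python
import numpy as np


def log_softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)) per example."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)


def objective(s_b, y_b, log_px, mu_s, lv_s, mu_z, lv_z, support_s, support_y, n_way):
    """Single-sample estimate of the bound, summed over the batch.

    s_b:       (B, D) sampled content codes of the queries
    y_b:       (B,)   integer labels in [0, n_way)
    log_px:    (B,)   log p(x_b | s_b, z_b) from the decoder (assumed given)
    support_s: (S, D) content codes of the support set
    support_y: (S,)   integer labels; every class must occur at least once
    """
    # Assumed class model: softmax over negative squared distances to class means.
    means = np.stack([support_s[support_y == c].mean(axis=0) for c in range(n_way)])
    sq_dist = ((s_b[:, None, :] - means[None, :, :]) ** 2).sum(axis=-1)
    log_py = log_softmax(-sq_dist)[np.arange(len(y_b)), y_b]
    kl = kl_to_standard_normal(mu_s, lv_s) + kl_to_standard_normal(mu_z, lv_z)
    return float(np.sum(log_py + log_px - kl))


# Tiny example: two well-separated classes and one query lying on class 0.
support_s = np.array([[0.0, 0.0], [10.0, 10.0]])
support_y = np.array([0, 1])
zeros = np.zeros((1, 2))
value = objective(s_b=zeros, y_b=np.array([0]), log_px=np.array([0.0]),
                  mu_s=zeros, lv_s=zeros, mu_z=zeros, lv_z=zeros,
                  support_s=support_s, support_y=support_y, n_way=2)
print(value)
```

In training, the negative of this value would be minimized, with the generative terms scaled as described in section 6.1.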
