Can We Train Dogs and Humans at the Same Time? GANs and Hidden Distributions

Bryan Cheong, Hubert Teo
Stanford University

[email protected], [email protected]

Abstract

This paper is a series of experiments on InfoGAN, exploring the extent to which an InfoGAN is able to generate the hidden distribution behind a sparse initial dataset. We observe from our experiments that the InfoGAN prefers to memorise a sparse dataset if it is not sufficiently complex, and even when the dataset is sufficiently complex it will not generalise over unpopulated regions of the data distribution. As a result, the InfoGAN does not extrapolate over mixed distributions.

1. Introduction

In this paper, we aim to home in on the limitations of generative models for image generation. In particular, we evaluate the limitations of generative adversarial networks (GANs) on an image extrapolation task.

Our research question concerns the nature of the types of distributions that can be modelled by GANs. Recently, GANs have seen remarkable empirical success in their application to unsupervised learning, generating sharp and realistic images after training only on unlabelled data [18]. They accomplish this by modelling the underlying data distribution in a way that allows it to be sampled from. However, there is still much to be understood about the kinds of distributions that are modelled well by GANs, as well as their generalizing power. In the context of images, what classes of transformations can be captured by GANs? Can GAN models extrapolate between mixtures of distributions to produce convincing images that interpolate between two datasets? In particular, given a combined dataset of images of dog and human faces, can GANs produce an acceptable morph between dog and human faces? Our experiments are an approach to test the extent to which an unsupervised approach can produce explanatory factors, and extrapolate between these factors, as a complement to supervised representation-learning algorithms [2].

These questions have broad implications for the extent to which GAN-based approaches to image generation are possible. Furthermore, we hope they will yield a better understanding of the nature of the distributions generated thereby.

2. Related Work

Within the family of GANs, there have been many different architectures that reportedly have different properties. We explored the different types of GANs and summarise their developments here. The Conditional GAN or CGAN is a variation on the GAN by Mirza et al. which feeds latent variables that condition the data distribution into the GAN as an additional layer, so that both the G and D distributions are conditioned on the latent variables. Their experiments with the CGAN consisted of generating MNIST numerals conditioned on their class labels, and they concluded that the CGAN was a viable model that could capture multimodal labelling [16].

Since our project explores the limitations of GANs around limited and sparse datasets, we also consulted the literature on DeLiGANs, or Generative Adversarial Networks for Diverse and Limited Data [9]. Gurumurthy et al. implemented an architecture that tried learning a mapping from a simple latent distribution to a more complicated data distribution, in order to train a GAN when the original dataset is limited but has a diverse and sparse modality, by drawing on a reparameterized mixture of Gaussians instead of drawing over the distribution of the latent variable directly (which is a single Gaussian). This is the so-called reparameterization trick [5]. They experimented on datasets such as MNIST and freeform drawn samples, and demonstrated that they are able to actively avoid low-probability regions. The DeLiGAN handles a low-probability void between two modes of high information in a dataset distribution by absorbing the void into its own latent distribution, and produces no samples from this low-probability region. Indeed, much of the literature is concerned with the stability of GANs when presented with non-ideal training data. For example, many slightly different architectures, objective functions or formulations to alleviate GAN training instability have been researched, including the Wasserstein GAN (WGAN) [1], unrolled GANs [15], and even ensemble methods [10]. There is also significant literature exploring best GAN training practices that seek to optimize stability and prevent mode collapse [19] [8]. Our experiments in this paper, however, seek not to avoid covering these low-probability regions, but instead explore how a GAN might treat these low-probability regions when forced to do so.

Apart from GANs, other generative approaches to unsupervised or semi-supervised image generation have also been explored. Siddharth et al. introduced the generalised variational autoencoder (VAE) [20] model that is reportedly able to disentangle representations that encode distinct aspects of the data into separate variables. To this end, they used partially-specified graphical model structures to construct a recognisable disentangled space, and demonstrated their model's ability to do so for faces and multi-MNIST. Similarly to GANs, this general framework also admits many architectural variants such as conditional VAEs [21] [13] and hard-regularization approaches like lossy VAEs [4], to name a few. Crucial to both VAEs and GANs is the idea of the latent variable that implicitly parametrizes the data distribution; understanding the behaviour of, and modifying, the underlying distribution of these latent variables is an area of active research [22] [24]. We wish to explore with our experiments whether a more unsupervised InfoGAN approach can likewise encode a disentangled space within its latent variables, and not only disentangle but also populate the full space between the modes of the dataset distribution.

Since our experiments are essentially an exploration of the 'creativity' of GANs in their ability to construct distributions, we also take care that our GAN implementation does not memorise the original dataset distribution, which would defeat the purpose of our experiments. We consulted the literature on latent geometry and memorisation in GANs by Matt D. Feiszli [6]. We have kept in mind his paper's understanding of how a GAN can learn an output distribution that is concentrated on a finite number of examples, and have chosen our methods and visualisations to catch this memorisation if it does indeed occur.

3. Methods

In this paper, we perform three experiments that share a similar structure. Each of the three experiments begins with a dataset composed of a mix of two sets of images. Then, we train GANs on the dataset and use the trained model to perform various tasks. We keep the GAN model architecture constant (save for hyperparameters and image channels) in order to evaluate its performance across the three datasets.

3.1. Generative Adversarial Networks (GANs)

A generative adversarial network is an unsupervised learning technique that estimates a joint distribution between latent variables z and data x via an adversarial process. Generative models learn distributions not by attempting to estimate the probability that a particular data point is drawn from a distribution, but instead by modelling the distributions themselves. GANs achieve this goal by training a generator and a discriminator adversarially. The generator models p(x|z), approximating the distribution of data given incompressible random variables z as input. The discriminator's goal is instead to distinguish between data points sampled from the generator's distribution and real data points from the dataset. Goodfellow [7] defined the following minimax game that optimizes the generator and discriminator in alternating phases:

\min_G \max_D V(G, D) = \mathbb{E}_{x \sim p_\mathrm{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]

3.2. InfoGAN

An InfoGAN is a GAN variant that makes latent variables c explicitly different from noise variables z [3]. The generator G is allowed to parametrize the data distribution p(x|z, c) with both incompressible noise z and meaningful latent variables c. Another network Q is introduced to approximate the posterior distribution p(c|x). The objective function is also modified to include a mutual information term L_I(G, Q) which gives a lower bound on the true mutual information between the data distribution and the latent variables:

L_I(G, Q) = \mathbb{E}_{c \sim p_c(c),\, x \sim G(c, z)}[\log Q(c|x)]

This mutual information lower bound is then included in the minimax objective so that it is maximized by G:

\min_{G, Q} \max_D V(G, D) - \lambda L_I(G, Q)

In practice, D and Q are made to share the same network up to the last embedding layer before branching off. Hence, we treat λ as a regularizing term that encourages the generator to use the latent variables c meaningfully, such that the auxiliary network Q is able to infer c from its output.
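As a concrete illustration, here is a minimal PyTorch sketch of this weight sharing, assuming a shared convolutional trunk that ends in a 512-channel, 4 × 4 embedding. Treating Q(c|x) as a fixed-variance Gaussian, so that maximizing log Q(c|x) reduces to minimizing a squared error, is a common choice for continuous latent variables rather than something specified here.

```python
import torch
import torch.nn as nn

class DQ(nn.Module):
    """D and Q sharing one trunk, branching at the last embedding layer."""
    def __init__(self, trunk, num_latents):
        super().__init__()
        self.trunk = trunk                                  # shared conv layers
        self.d_head = nn.Conv2d(512, 1, 4, 1, 0)            # real/fake probability
        self.q_head = nn.Conv2d(512, num_latents, 4, 1, 0)  # recovered latents c

    def forward(self, x):
        h = self.trunk(x)
        return torch.sigmoid(self.d_head(h)), self.q_head(h)

def mi_lower_bound(q_out, c):
    # With Q(c|x) a fixed-variance Gaussian centred on q_head's output,
    # log Q(c|x) is a negative squared error up to additive constants.
    return -((q_out.flatten(1) - c.flatten(1)) ** 2).sum(dim=1).mean()
```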

Most relevantly to our research question, Chen et al. concluded that GANs can approximately model the latent variable distribution of a given x drawn from the dataset. In other words, they produce an embedding of the dataset into a meaningful latent variable space, from which images can be generated. Our research is a way to test the generalizing power of this latent variable space by attempting to generate images from outside the support of the latent variable distribution associated with the dataset.

3.3. Architecture

The GAN architecture we use is taken directly from the seminal InfoGAN paper [3], and is essentially a deep convolutional GAN (DCGAN) [18] with an additional mutual information objective.

G                   D                 Q
deconv-512-4-1-0    conv-64-4-1-2
deconv-256-4-2-1    conv-128-4-1-2
deconv-128-4-2-1    conv-256-4-1-2
deconv-64-4-2-1     conv-512-4-1-2
conv-64-3-1-1       conv-σ-1-4-1-0    conv-C-4-1-0
conv-tanh-3-4-2-1

Table 1. GAN architecture

Our implementation is based on the PyTorch DCGAN example [17], but with a slightly modified architecture to produce smaller images for speed (Table 1). deconv-L-K-S-P denotes a deconvolutional layer producing L features with kernel size K, stride S and padding P, followed by batch normalization and LeakyReLU(0.2). conv-L-K-S-P describes an analogous convolutional layer. The first layer of G is fed both the noise and latent variables. Note that G's output volume and D's input volume are both 3 × 64 × 64. The final layers of both G and D skip normalization and use their respective non-linearities. The final layer of Q instead predicts the C latent variables from the last embedding layer it shares with D. To support greyscale images in experiment 3, the network architecture is modified minimally to produce 1 × 64 × 64 images corresponding to a single luminance channel.
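To make Table 1 concrete, the following is a minimal PyTorch sketch of the generator column. It assumes the final tanh layer is a transposed convolution, so that the output reaches the 3 × 64 × 64 volume stated above; the (Z, C) values shown are just one of the configurations used later.

```python
import torch
import torch.nn as nn

Z, C = 16, 2  # illustrative noise/latent sizes; varied per experiment

def up(in_ch, out_ch, k, s, p):
    # deconv-L-K-S-P: transposed conv + batch norm + LeakyReLU(0.2)
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, k, s, p, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.2),
    )

G = nn.Sequential(
    up(Z + C, 512, 4, 1, 0),             # deconv-512-4-1-0: -> 512x4x4
    up(512, 256, 4, 2, 1),               # deconv-256-4-2-1: -> 256x8x8
    up(256, 128, 4, 2, 1),               # deconv-128-4-2-1: -> 128x16x16
    up(128, 64, 4, 2, 1),                # deconv-64-4-2-1:  -> 64x32x32
    nn.Sequential(                       # conv-64-3-1-1:    -> 64x32x32
        nn.Conv2d(64, 64, 3, 1, 1, bias=False),
        nn.BatchNorm2d(64),
        nn.LeakyReLU(0.2),
    ),
    nn.ConvTranspose2d(64, 3, 4, 2, 1),  # final layer skips batch norm
    nn.Tanh(),                           # tanh output:      -> 3x64x64
)

zc = torch.randn(8, Z + C, 1, 1)  # noise and latents, concatenated
assert G(zc).shape == (8, 3, 64, 64)
```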

3.4. Datasets

3.4.1. F Dataset

The objective of this dataset is to be as clean and simple as possible while still being a pure additive mixture of two (potentially orthogonal, potentially overlapping) distributions. We plan to use this clean and small dataset to estimate a set of suitable hyperparameters and a viable training schedule for our GAN on the larger dataset.

To this end, this dataset is synthetic and consists of black, sans-serif capital Fs on a white background, translated (but not rotated) across the 64 × 64 image. There are 128 data points in total, corresponding to 64 Fs translated at different pixel locations along the horizontal axis but vertically centered, and another 64 Fs translated along a centered vertical axis. Notably, there are no Fs translated along a diagonal offset from the center.

Additionally, since the images are entirely synthetic, with latent variables (vertical and horizontal offset) that are completely transparent, we also generated four Fs diagonally offset from the center, corresponding to the four quadrants. These diagonal Fs can be used to probe the trained generative models for their ability to generate them.
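A sketch of how such a dataset might be generated is shown below; the font, glyph anchor, and offset ranges are illustrative assumptions rather than the exact recipe used here.

```python
import numpy as np
from PIL import Image, ImageDraw, ImageFont

FONT = ImageFont.load_default()  # stand-in for a sans-serif face

def render_f(dx, dy):
    # Black "F" on a white 64x64 canvas, offset from the centre.
    img = Image.new("L", (64, 64), 255)
    ImageDraw.Draw(img).text((28 + dx, 26 + dy), "F", fill=0, font=FONT)
    return np.asarray(img, dtype=np.float32) / 255.0

# 64 horizontal and 64 vertical translations (128 images in total),
# plus four diagonal probes, one per quadrant, held out of training.
horizontal = [render_f(dx, 0) for dx in range(-32, 32)]
vertical = [render_f(0, dy) for dy in range(-32, 32)]
dataset = np.stack(horizontal + vertical)
diagonal_probes = [render_f(sx * 16, sy * 16)
                   for sx in (-1, 1) for sy in (-1, 1)]
```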

3.5. Dogs & Humans

This dataset is again a pure additive mixture of two distributions. The motivation for including a more complex distribution is that the two distributions presumably have a larger variance and spread in latent space. Given this more diffuse distribution, we hope that the discriminator will penalize outliers less, giving the generator more leeway to produce images that are far away from both distributions.

Hence, we produced a dataset combining dog faces and human faces. These two distributions have similar structure: both have eyes, noses and mouths, and consist of a roughly circular shape on some background. However, there is no overlap (there are no dog faces that are also human faces). The hope is that the GAN will be able to interpolate between the two distributions while preserving the spatial structure of faces, creating a plausible morph between a dog and a human face.

To create this dataset, we extracted 1517 images from the Stanford Dogs Dataset [11] and hand-aligned them so that the dog faces were roughly registered in the center of the images and with a similar scale relative to the image size. These images were mixed among pre-aligned human faces from the CelebA dataset [14]. In a bid to leverage the well-procured human faces from CelebA while not biasing our GAN towards either distribution, we combined these datasets dynamically at training time. Human faces are randomly sampled at each epoch to match the number of dog faces in our dataset, and the two sets of images are combined so that each batch has an equal number of dog and human faces. In this way, this dataset effectively has a size of 3034 and an equal number of dog and human faces, but the human faces are instantiated from CelebA on demand.
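A sketch of this dynamic mixing, assuming lists of file paths for each source; interleaving the two streams is one simple way to keep every batch balanced:

```python
import random

def epoch_file_list(dog_paths, celeba_paths, rng=random):
    # Each epoch: draw as many human faces as there are dog faces,
    # then interleave so every even-sized batch stays balanced.
    humans = rng.sample(celeba_paths, len(dog_paths))
    dogs = list(dog_paths)
    rng.shuffle(dogs)
    return [path for pair in zip(dogs, humans) for path in pair]
```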

3.6. Experiment 1: Image Extrapolation

In this experiment, we train GANs on the F dataset, varying several critical hyperparameters. Training is performed with minibatch stochastic gradient descent (SGD) using the Adam [12] update rule (β1 = 0.5, β2 = 0.999). In each batch iteration, Z random normal variables z_i ∼ N(0, 1) and C latent variables c_i ∼ U(−1, 1) are sampled as input to the generator, and the discriminator update step is performed before the generator update step. Training proceeds for a total of 1024 epochs.
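A minimal sketch of one such iteration is given below, reusing the G sketch from Section 3.3 and the DQ module and mi_lower_bound from Section 3.2. The non-saturating binary cross-entropy form of the minimax losses, the batch size, and the single optimizer shared by G and Q's head are our assumptions.

```python
import torch
import torch.nn.functional as F

B, Z, C, lam = 64, 16, 2, 10           # batch size is an assumption
lr = 1e-3                              # one of the swept learning rates
dq = DQ(trunk, num_latents=C)          # trunk as in the Section 3.2 sketch
opt_d = torch.optim.Adam(dq.parameters(), lr=lr, betas=(0.5, 0.999))
opt_gq = torch.optim.Adam(list(G.parameters()) + list(dq.q_head.parameters()),
                          lr=lr, betas=(0.5, 0.999))

def train_step(real):
    z = torch.randn(B, Z, 1, 1)        # z_i ~ N(0, 1)
    c = 2 * torch.rand(B, C, 1, 1) - 1  # c_i ~ U(-1, 1)
    fake = G(torch.cat([z, c], dim=1))

    # Discriminator update first, as in the schedule described above.
    d_real, _ = dq(real)
    d_fake, _ = dq(fake.detach())
    d_loss = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator/Q update: adversarial term minus lambda * L_I.
    d_out, q_out = dq(fake)
    g_loss = F.binary_cross_entropy(d_out, torch.ones_like(d_out)) \
             - lam * mi_lower_bound(q_out, c)
    opt_gq.zero_grad(); g_loss.backward(); opt_gq.step()
```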

A total of 27 models were trained, corresponding to a cross product of the following hyperparameter choices: the SGD learning rate (lr = 5 × 10⁻⁴, 1 × 10⁻³, 2 × 10⁻³), the mutual information regularization constant (λ = 1, 10, 100), and three configurations for the number of noise and latent variables ((Z, C) = (2, 2), (16, 2), (16, 16)). These values were chosen because of the relative simplicity of the F dataset, where the optimal number of latent variables is theoretically only 2. The MSE loss over these hyperparameters can be found in Fig. 1.
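Enumerating this sweep is a plain cross product, sketched here with a hypothetical train_model wrapper standing in for the training loop above:

```python
from itertools import product

# The 27 configurations: 3 learning rates x 3 lambdas x 3 (Z, C) pairs.
learning_rates = [5e-4, 1e-3, 2e-3]
lambdas = [1, 10, 100]
variable_configs = [(2, 2), (16, 2), (16, 16)]

for lr, lam, (Z, C) in product(learning_rates, lambdas, variable_configs):
    # train_model is a hypothetical wrapper around the training loop.
    train_model(lr=lr, lam=lam, num_noise=Z, num_latents=C, epochs=1024)
```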

Following training, we solve the following reverse optimization problem to estimate the optimal noise and latent variables for producing a given target image:

\min_{z, c} \|x - G(z, c)\|_2^2

We thus find the generator input that minimizes the L2 norm between the generator output and a target image. The target images are the four diagonal Fs described above. For each model, we then calculate the average loss for the diagonal Fs as a way to measure the model's ability to generalize and produce examples outside of the given data distribution.
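A sketch of this inversion by gradient descent on (z, c) is shown below; the optimizer, step count, and learning rate are illustrative choices.

```python
import torch

def invert(G, x_target, Z, C, steps=2000, lr=1e-2):
    # Optimize (z, c) directly so that G(z, c) matches the target image.
    G.eval()  # use running batch-norm statistics for a single image
    z = torch.randn(1, Z, 1, 1, requires_grad=True)
    c = torch.zeros(1, C, 1, 1, requires_grad=True)
    opt = torch.optim.Adam([z, c], lr=lr)
    for _ in range(steps):
        loss = ((G(torch.cat([z, c], dim=1)) - x_target) ** 2).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach(), c.detach(), loss.item()
```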

3.7. Experiment 2: Image Interpolation

For the next experiment, we trained a GAN with similar hyperparameters on the dogs and human faces dataset, except with 128 noise variables and 128 latent variables on account of the more complicated dataset. These values were selected to match the InfoGAN paper, which also uses more than 200 input variables in its GAN architecture for larger datasets.

Next, we sample a set of random line segments that vary entirely in noise space and interpolate along them. Similarly, we do the same for line segments varying only in latent variable space. By sampling points along these line segments and using the generator network to produce generated images, we obtain interpolated images that characterize the differences between variation in noise space and variation in latent space.
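A minimal sketch of such a segment walk, holding the noise fixed while the latent variables vary (swapping the roles of z and c gives the noise-space version):

```python
import torch

def interpolate_latents(G, z, c0, c1, n=8):
    # Walk a straight line from c0 to c1 in latent space with fixed
    # noise z, generating one image per sampled point.
    steps = []
    for t in torch.linspace(0, 1, n):
        c = (1 - t) * c0 + t * c1
        steps.append(G(torch.cat([z, c], dim=1)))
    return torch.cat(steps)
```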

Furthermore, we also obtain the generator inputs that produce each data point in the dataset via the reverse optimization problem shown above, and linearly interpolate between pairs of them. At each linearly-interpolated point in the latent distribution space, we again use the generator network to produce generated images. Since there is no ground truth for a dog-human face morph, we then evaluate the generated images qualitatively.

3.8. Experiment 3: Greyscale

Finally, we repeat the training and interpolation process for greyscale versions of both of the above datasets. This is to determine whether colour plays a significant role in the ability of the GAN to generalize and extrapolate between mixture distributions.
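One plausible conversion to a single luminance channel is sketched below; the paper does not state the exact weighting, so the Rec. 601 coefficients are an assumption.

```python
import torch

def to_luminance(rgb):
    # rgb: (B, 3, H, W) -> (B, 1, H, W), assumed Rec. 601 weighting.
    w = torch.tensor([0.299, 0.587, 0.114]).view(1, 3, 1, 1)
    return (rgb * w).sum(dim=1, keepdim=True)
```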

4. Results

4.1. Experiment 1: Image Extrapolation

The images shown in these results are drawn from the model quantitatively assessed to have the lowest MSE loss. The distribution over the hyperparameters tested can be seen in Fig. 1. The diagonally translated F image lies so far outside the input-space bounds of the latent variables that we can observe that the InfoGAN effectively does not extrapolate the two translations into the hidden distribution of the diagonal translation (Fig. 2). From our samples of the generated output at epoch 1024 after training the GAN (Fig. 3), we observe that the output generally looks like the original dataset of vertically and horizontally translated F images.

Figure 1. The MSE loss for generating diagonal Fs over the hyperparameters of the translated Fs experiment.

Keeping in mind that the space of translated Fs is relatively small, we were wary of the InfoGAN memorising the output translations and thus not producing well-distributed generated images. To observe this, we interpolated between the latent variables of the InfoGAN (Fig. 4) and also over the noise variables (Fig. 5). There is very little movement of the Fs, and no observable difference between the interpolation over the latent variables and that over the noise. This suggests that this dataset has effectively been memorised by the InfoGAN, which therefore produces output images clustered around only a few examples rather than learning the vertical or horizontal translations. It is no surprise, then, that the diagonal F translations were also not produced, since the vertical and horizontal translations were effectively not learned but instead memorised.

Figure 2. Interpolation between diagonal Fs. The leftmost and rightmost images are the closest generated images to the top-left and bottom-right diagonal Fs. Since they are not in any way off-horizontal or off-vertical, the images are essentially not generated.

Figure 3. Samples of the generated translations of the letter F at epoch 1024 after training the InfoGAN.

Figure 4. The interpolation over the latent variables of the InfoGAN for translated Fs.

4.2. Experiment 2: Image Interpolation

In the case of the facial images of dogs and humans, the InfoGAN was observed not to have memorised the output, producing a good amount of variation in the generated images. We can observe from the output that the faces of both humans and dogs are generalised, but unfortunately, when we interpolate between images generated closer to a dog face and images generated closer to a human face, the facial features are not preserved in the mapping. This can be observed in the figure of our batch approximation of the generated output images (Fig. 7). In fact, much of the mapping between the images seems to be between the colours of the various images, suggesting that the colour information is dominating the learning of the InfoGAN. This is another motivation for our third experiment, in which we repeat this experiment with only the luminance channel of the same dataset.

Figure 5. The interpolation over the noise variables of the InfoGAN for translated Fs.

Even though the target images produced via back-solving for dataset samples are largely imperfect, the generation of dog faces and human faces seems quite robust. The model is able to produce dog and human faces separately, albeit without a plausible structure-preserving correspondence between these two outputs. This is a somewhat surprising result, as the dataset we used to generate these images is very noisy, being a mixture of two different distributions. In this case, we can also infer that mode collapse did not occur even with a clearly bimodal mixture distribution, which is encouraging. This can also be observed in the t-distributed stochastic neighbor embedding (t-SNE) [23] image of our latent variables in Fig. 6.
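For reference, a plot like Fig. 6 can be produced along these lines, assuming the recovered latent variables are stacked into an (N, C) array with matching 0/1 class labels; the scikit-learn call is our choice, not necessarily the authors':

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# latents: (N, C) array of latent variables recovered per image;
# labels: 0 for dogs, 1 for humans, as in the figure's legend.
embedded = TSNE(n_components=2).fit_transform(latents)
plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, cmap="coolwarm")
plt.title("t-SNE of latent variables (0 = dog, 1 = human)")
plt.show()
```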

Another observation that we can make when interpolating between latent variables (Fig. 8) and noise variables (Fig. 9) is that interpolation between latent variables varies details, while varying the noise variables changes large-scale features such as the overall colour distribution of the image. This suggests that in this case the latent variables are indeed capturing the meaningful variation within the dataset (facial features, geometry, texture) as opposed to the irrelevant details (background colour, shading, and face position). This is also strong evidence that the GAN has indeed learned a meaningful representation of the latent distribution space, and is no longer memorising the distribution as it was in the case of the F dataset.

Figure 6. t-SNE visualisation of dogs and humans over latent variables. Dogs are labelled 0 and humans 1.

Figure 7. The batch approximation when interpolating between dogs and humans generated by the InfoGAN.

4.3. Experiment 3: Greyscale

From our batch approximation of the greyscale faces of dogs and humans, this interpolation seems to have a more consistent mapping between the generated output dog and human faces, producing a reasonable transition that largely preserves the features, although there still appear to be some exceptions (Fig. 11). We can observe, then, that by restricting the number of channels we did manage to reduce the dominance of colour over other features of the output. This might be because the colour features of dogs vary much more substantially than those of humans, and so the dominating feature being translated from dogs to humans is their colour, instead of the visual features that we desired. Even in this case, a t-SNE visualisation of the latent variables shows that the space over which the human faces are distributed is smaller than that of dog faces, suggesting that there is indeed more variation in dog faces even when colour information has been removed and the channels have been restricted (Fig. 10). However, like the coloured dog and human faces in experiment 2, most of the interpolated images do not seem to display any translation-like movements, and there seems to be a preference for locally morphing between the initial and final image. This is similar to our experiment with the F dataset, where the GAN was reluctant to learn to translate an identical image horizontally and vertically, instead choosing to memorise all instances of Fs in each offset position.

Figure 8. The interpolation over the latent variables of the InfoGAN for dog and human faces.

Figure 9. The interpolation over the noise variables of the InfoGAN for dog and human faces.

Figure 10. The t-SNE visualisation of greyscale dog and human face latent variables. The red labels (1) are humans, while the blue labels (0) are dogs.

On the other hand, in the case of the translated F images, restricting the number of channels does not appear to have a significant impact on preventing the memorisation of images. We quantified the MSE of the generated output over the number of latent variables that we gave the InfoGAN (Fig. 14), and while we observe that the MSE decreases with an increasing number of latent variables, this is likely not because the InfoGAN is learning anything significant from the distribution. The distribution is so simple that such a large number of latent variables is likely not necessary. Indeed, when we compare the interpolated output images over the latent variables (Fig. 12) and over the noise variables (Fig. 13), we observe essentially no qualitative difference between these interpolations and the output from our first experiment, which had more channels. We can conclude, then, that the dataset in this case is so sparse that changing the number of channels does not affect the way the InfoGAN learns the distribution.

5. Conclusion & Future Work

Our conclusion is that while the InfoGAN is able to disentangle representations, there are limits to this disentanglement. We see a qualitative difference between the distributions captured by the noise variables and the latent variables, even when the GAN is trained on a very noisy data distribution. Sparse datasets with biased gathering that do not represent some parts of the population cannot be re-created through an InfoGAN. If the dataset is sufficiently sparse, the InfoGAN will instead memorise the output distribution, and not accord any significant learning to the latent variables. In this sense, we see that the noisy distribution is actually beneficial to the InfoGAN's generalising power, since the widened support and variance reduce the discriminator's ability to penalise the generator for not exactly matching images in the data distribution. There might, however, also be inherent limitations in our current DCGAN-based framework for applying GANs to the image generation task. As we have seen in all three experiments, DCGANs are reluctant to capture macro-scale image transformations like translation.

Figure 11. The batch approximation interpolating the output generated by the InfoGAN for greyscale dog and human faces.

Figure 12. The interpolation over the latent variables of the InfoGAN for greyscale dog and human faces.

Further work needs to be done to determine exactly how creative GANs are, in the human sense. In particular, it would be beneficial to characterise exactly the class of image-space transformations that are easily modelled by GANs and the class of transformations that are not. Additionally, other GAN variants and even other generative models (based on VAEs or otherwise) should be evaluated according to their ability to interpolate between mixture distributions or extrapolate into regions of low density. This has broader implications for how a GAN may be used to populate datasets that are restricted to certain easier-to-gather modes, or even to infer useful information from sparsely sampled datasets. In essence, these are the boundaries of the creative abilities of generative image models, and the quality of the distributions that they learn.

Figure 13. The interpolation over the noise variables of the InfoGAN for greyscale dog and human faces.

Figure 14. The MSE loss of the greyscale translated F images over the number of latent variables used.

References

[1] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN, 2017.
[2] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, Aug 2013.
[3] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. CoRR, abs/1606.03657, 2016.
[4] X. Chen, D. P. Kingma, T. Salimans, Y. Duan, P. Dhariwal, J. Schulman, I. Sutskever, and P. Abbeel. Variational lossy autoencoder. CoRR, abs/1611.02731, 2016.
[5] L. Devroye. Sample-based non-uniform random variate generation. In Proceedings of the 18th Conference on Winter Simulation, WSC '86, pages 260–265, New York, NY, USA, 1986. ACM.
[6] M. Feiszli. Latent geometry and memorization in generative models, 2017.
[7] I. J. Goodfellow. NIPS 2016 tutorial: Generative adversarial networks. CoRR, abs/1701.00160, 2017.
[8] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of Wasserstein GANs. CoRR, abs/1704.00028, 2017.
[9] S. Gurumurthy, R. K. Sarvadevabhatla, and R. V. Babu. DeLiGAN: Generative adversarial networks for diverse and limited data. In Proceedings of the 2017 Conference on Computer Vision and Pattern Recognition.
[10] F. Juefei-Xu, V. N. Boddeti, and M. Savvides. Gang of GANs: Generative adversarial networks with maximum margin ranking. CoRR, abs/1704.04865, 2017.
[11] A. Khosla, N. Jayadevaprakash, B. Yao, and L. Fei-Fei. Novel dataset for fine-grained image categorization. In First Workshop on Fine-Grained Visual Categorization, CVPR, 2011.
[12] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
[13] D. P. Kingma, D. J. Rezende, S. Mohamed, and M. Welling. Semi-supervised learning with deep generative models. CoRR, abs/1406.5298, 2014.
[14] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), 2015.
[15] L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein. Unrolled generative adversarial networks. CoRR, abs/1611.02163, 2016.
[16] M. Mirza and S. Osindero. Conditional generative adversarial nets. CoRR, abs/1411.1784, 2014.
[17] PyTorch. https://github.com/pytorch/pytorch, 2017.
[18] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR, abs/1511.06434, 2015.
[19] T. Salimans, I. J. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. CoRR, abs/1606.03498, 2016.
[20] N. Siddharth, B. Paige, A. Desmaison, J.-W. van de Meent, F. Wood, N. D. Goodman, P. Kohli, and P. H. S. Torr. Inducing interpretable representations with variational autoencoders, 2016.
[21] K. Sohn, H. Lee, and X. Yan. Learning structured output representation using deep conditional generative models. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 3483–3491. Curran Associates, Inc., 2015.
[22] K. Sun and X. Zhang. Coarse grained exponential variational autoencoders. CoRR, abs/1702.07904, 2017.
[23] L. van der Maaten and G. Hinton. Visualizing high-dimensional data using t-SNE, 2008.
[24] S. Zhao, J. Song, and S. Ermon. Towards deeper understanding of variational autoencoding models. CoRR, abs/1702.08658, 2017.

