
Published as a conference paper at ICLR 2018

UNSUPERVISED CIPHER CRACKING USING DISCRETE GANS

Aidan N. Gomez 1,2
[email protected]

Sicong Huang 1,2
[email protected]

Ivan Zhang 2
[email protected]

Bryan M. Li 1,2
[email protected]

Muhammad Osama 1,2
[email protected]

Łukasz Kaiser 3
[email protected]

1 Department of Computer Science, University of Toronto
2 FOR.ai
3 Google Brain

ABSTRACT

This work details CipherGAN, an architecture inspired by CycleGAN, used for inferring the underlying cipher mapping given banks of unpaired ciphertext and plaintext. We demonstrate that CipherGAN is capable of cracking language data enciphered using shift and Vigenère ciphers to a high degree of fidelity and for vocabularies much larger than previously achieved. We present how CycleGAN can be made compatible with discrete data and trained in a stable way. We then prove that the technique used in CipherGAN avoids the common problem of uninformative discrimination associated with GANs applied to discrete data.

1 INTRODUCTION

Humans have been encoding messages for secrecy since before the ancient Greeks and, for the same amount of time, have been fascinated with trying to crack these codes using brute force, frequency analysis, crib-dragging and even espionage. Simple ciphers have, in the past century, been rendered irrelevant in favor of the more secure encryption schemes enabled by modern computational resources. However, the question of cipher-cracking remains an interesting problem, since it requires an intimate understanding of the structure of a language. Nearly all automated cipher-cracking techniques have had to rely on a human in the loop, grounding the automated techniques in a human's preexisting knowledge of language to clean up the errors made by simple algorithms such as frequency analysis. Across a number of domains, the use of hand-crafted features has often been replaced by automatic feature extraction directly from data using end-to-end learning frameworks (Goodfellow et al., 2016). The question to be addressed is as follows:

Can a neural network be trained to deduce withheld ciphers from unaligned text, without the supplementation of preexisting human knowledge?

The implications of such a general framework would be far-reaching in the field of unsupervised translation, where each language can be treated as an enciphering of the other. The decoding of the Copiale cipher (Knight et al., 2011) stands as an excellent example of the potential for machine learning techniques to decode enciphered texts by treating the problem as language translation. The CycleGAN (Zhu et al., 2017) architecture is extremely general, and we demonstrate that our adaptation, CipherGAN, is capable of cracking ciphers to an extremely high degree of accuracy.

Code available at: github.com/for-ai/ciphergan

CipherGAN requires little or no modification to be applied to plaintext and ciphertext banks generated by the user's cipher of choice.

In addition to presenting a GAN that can crack ciphers, we contribute the following techniques:

• We show how to stabilize CycleGAN training: our CipherGAN achieves good performance in all training runs, compared to approximately 50% of runs for the original CycleGAN.

• We provide a theoretical description and analysis of the uninformative discrimination problem that impacts GANs applied to discrete data.

• We introduce a solution to the above problem by operating in the embedding space, and show that it works in practice.

1.1 SHIFT AND VIGENÈRE CIPHERS

The shift and Vigenère ciphers are well-known historical substitution ciphers. The earliest known record of a substitution cipher is believed to date back to 58 BCE, when Julius Caesar replaced each letter in a message with the letter three places further down the alphabet (Singh, 2000).

Plain alphabet a b c d e f g h i j k l m n o p q r s t u v w x y z

Shifted alphabet D E F G H I J K L M N O P Q R S T U V W X Y Z A B C

Figure 1: An example of a right-shift-3 cipher.

Using Figure 1, the message "attackatdawn" can be encrypted to "DWWDFNDWGDZQ". This message can be easily deciphered by the intended recipient (who is aware of the particular shift number used) but looks meaningless to a third party. The shift cipher ensured secure communication between sender and receiver for centuries, until the ninth-century polymath Al-Kindi introduced the concept of frequency analysis (Singh, 2000). He suggested that it would be possible to crack a cipher simply by analyzing the individual characters' frequencies. For instance, in English the most frequently occurring letters are 'e' (12.7%), 't' (9.1%) and 'a' (8.2%), whereas 'q', 'x' and 'z' each have a frequency of less than 1%. Moreover, the code-breaker can also focus on bigrams of repeated letters; 'ss', 'ee' and 'oo' are the most common in English. This structure in language gives the code-breaker an efficient exploit.
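As a concrete illustration of the shift cipher and the frequency-analysis attack described above, here is a minimal Python sketch (the helper names are ours and are not taken from the paper's released code):

```python
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def shift_encrypt(plaintext, shift=3):
    """Replace each letter with the letter `shift` places further down the alphabet."""
    return "".join(ALPHABET[(ALPHABET.index(c) + shift) % 26].upper() for c in plaintext)

def crack_shift_by_frequency(ciphertext):
    """Guess the shift by assuming the most frequent ciphertext letter maps to 'e'."""
    most_common_letter, _ = Counter(ciphertext.lower()).most_common(1)[0]
    return (ALPHABET.index(most_common_letter) - ALPHABET.index("e")) % 26

print(shift_encrypt("attackatdawn"))  # -> DWWDFNDWGDZQ
```

On sufficiently long English ciphertext the most frequent cipher letter almost always corresponds to 'e', which is why the shift key can be recovered from letter statistics alone.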

Polyalphabetic substitution ciphers, including the Vigenère cipher, were introduced to inhibit the use of n-gram frequency analysis in determining the cipher mapping. Instead, the encrypter further scrambles the message by using a separate shift cipher for each element of a key that is tiled to match the length of the plaintext. Increasing the key length greatly increases the number of possible combinations and thus protects against basic frequency analysis. In the mid-nineteenth century, Charles Babbage recognized that the length of the key used could be determined by counting the repetitions and spacing of sequences of letters in the cipher (Singh, 2000). Using the determined length, frequency analysis can then be applied to each key index of the cipher independently. This method makes it possible to break the Vigenère cipher, but it is very time-consuming and requires strong knowledge of the language itself.
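The tiled-key construction can be stated directly in code. The sketch below is our own illustration; it interprets the key as a sequence of shift amounts, which is one natural reading of the key "345" used later in Section 4:

```python
from string import ascii_lowercase as ALPHABET

def vigenere_encrypt(plaintext, key_shifts):
    """Tile the key over the plaintext and apply a separate shift cipher at each
    position, with the shift given by the corresponding key element."""
    out = []
    for i, ch in enumerate(plaintext):
        shift = key_shifts[i % len(key_shifts)]  # key tiled to the plaintext length
        out.append(ALPHABET[(ALPHABET.index(ch) + shift) % 26])
    return "".join(out)

print(vigenere_encrypt("attackatdawn", [3, 4, 5]))  # -> dxydgpdxidas
```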

There is a rich literature on automated shift-cipher cracking techniques (Ramesh et al., 1993; Forsyth & Safavi-Naini, 1993; Hasinoff, 2003; Knight et al., 2006; Verma et al., 2007; Raju et al., 2010; Knight et al., 2011), many of which achieve excellent results, as one would expect from hand-crafted algorithms targeting specific ciphers and vocabularies. Work on automated cracking of polyalphabetic ciphers (Carroll & Martin, 1986; Toemeh & Arumugam, 2008; Omran et al., 2011) has seen similar success on small vocabularies. It is difficult to compare the results of previous work with our own, as their focus ranges from inferring cipher keys (Carroll & Martin, 1986; Ramesh et al., 1993; Omran et al., 2011), to inferring the mappings given limited quantities of ciphertext (determining unicity distance) (Carroll & Martin, 1986; Ramesh et al., 1993; Hasinoff, 2003; Verma et al., 2007), to analyzing the unicity distance required to solve small percentages of the cipher mappings (i.e. 20% in Carroll & Martin (1986)).

In comparison to these past works, we afford ourselves the advantage of an unconstrained corpus of ciphertext; however, we prescribe ourselves the following constraints: our model is not provided any prior knowledge of vocabulary element frequencies, and no information about the cipher key is provided. Another complexity our work must overcome is our significantly larger vocabulary sizes; all previous work has addressed vocabularies of approximately 26 characters, while our model is capable of solving word-level ciphers with over 200 distinct vocabulary elements. As such, our methodology is notably 'hands-off' in comparison to previous work and can be easily applied to different forms of cipher, different underlying data and unsupervised text alignment tasks.

1.2 GANS AND WASSERSTEIN GANS

Generative Adversarial Networks (GANs) are a class of neural network architectures introduced by Goodfellow et al. (2014) as an alternative to optimizing likelihood under a true data distribution. Instead, GANs balance the optimization of a generator network, which attempts to produce convincing samples from the data distribution, and a discriminator, which is trained to distinguish between samples from the true data distribution and the generator's synthetic samples. GANs have been shown to produce compelling results in the domain of image generation, but comparatively weak performance in domains using discrete data (discussed in Section 2).

The original GAN discriminator objective as introduced in Goodfellow et al. (2014) is:

D∗ = argmax_D  E_{x∼X}[log D(x)] + E_{z∼Z}[log(1 − D(F(z)))]   (1)

where F is the generator network and D is the discriminator network. This loss is vulnerable to the problem of 'mode collapse', where the generator collapses to produce a generating distribution with low diversity. In order to more broadly distribute the mass, the Wasserstein GAN (WGAN) objective (Arjovsky et al., 2017) considers the set of K-Lipschitz discriminator functions D : X → R and minimizes the earth mover's (first Wasserstein) distance. The Lipschitz condition is enforced by clipping the discriminator's weights to fall within a predefined range.

D∗ = argmax_{‖D‖_L ≤ K}  E_{x∼X}[D(x)] − E_{z∼Z}[D(F(z))]   (2)

An improved WGAN objective, introduced by Gulrajani et al. (2017), enforced the Lipschitz condition using a Jacobian regularization term instead of the originally proposed weight-clipping solution. This resulted in more stable training, avoiding capacity under-use and exploding gradients, and improved network performance over weight-clipping.

D∗ = argmax_D  E_{x∼X}[D(x)] − E_{z∼Z}[D(F(z))] + α · E_{x̂∼X̂}[(‖∇_x̂ D(x̂)‖₂ − 1)²]   (3)

Here X̂ are samples taken along lines between the true data distribution X and the generator's data distribution Xg = {F(z) | z ∼ Z}.
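The gradient penalty of Equation 3 is straightforward to implement with automatic differentiation. The following PyTorch-style sketch is our own (the paper's released code is TensorFlow); it samples points on lines between real and generated batches and penalizes deviations of the discriminator's gradient norm from 1:

```python
import torch

def gradient_penalty(discriminator, real, fake, alpha=10.0):
    """Penalize deviations of the discriminator's gradient norm from 1,
    evaluated at random points on lines between real and generated samples."""
    eps = torch.rand(real.size(0), *([1] * (real.dim() - 1)), device=real.device)
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    d_out = discriminator(x_hat)
    grads = torch.autograd.grad(d_out.sum(), x_hat, create_graph=True)[0]
    grad_norm = grads.flatten(start_dim=1).norm(2, dim=1)
    return alpha * ((grad_norm - 1) ** 2).mean()
```

The penalty is added to the discriminator's loss with weight α; the generator's loss is unchanged.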

1.3 CYCLEGAN

CycleGAN (Zhu et al., 2017) is a generative adversarial network designed to learn a mapping between two data distributions without supervision. Three separate works (Zhu et al., 2017; Yi et al., 2017; Liu et al., 2017) share many of the core features we describe below; however, for simplicity we will refer to CycleGAN as the basis for our work, as it is the most similar to our model. It acts on distributions X and Y by using two mapping generators, F : X → Y and G : Y → X, and two discriminators, DX : X → [0, 1] and DY : Y → [0, 1].

CycleGAN optimizes the standard GAN loss LGAN:

LGAN(F, DY, X, Y) = E_{y∼Y}[log DY(y)] + E_{x∼X}[log(1 − DY(F(x)))]   (4)

While also considering a reconstruction loss, or ‘cycle’ loss Lcyc:

Lcyc(F, G, X, Y) = E_{x∼X}[‖G(F(x)) − x‖₁] + E_{y∼Y}[‖F(G(y)) − y‖₁]   (5)

Taken together the losses are balanced using a hyperparameter λ:

L(F, G, DX, DY, X, Y) = LGAN(F, DY, X, Y) + LGAN(G, DX, Y, X) + λ · Lcyc(F, G, X, Y)
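As a small worked example of Equation 5 and the combined objective, the NumPy sketch below treats F, G and the two GAN loss values as placeholders and uses batch means for the expectations (names are ours):

```python
import numpy as np

def cycle_loss(F, G, x_batch, y_batch):
    """Equation 5: E[||G(F(x)) - x||_1] + E[||F(G(y)) - y||_1]."""
    return (np.abs(G(F(x_batch)) - x_batch).sum(axis=-1).mean()
            + np.abs(F(G(y_batch)) - y_batch).sum(axis=-1).mean())

def total_loss(l_gan_xy, l_gan_yx, F, G, x_batch, y_batch, lam=1.0):
    """The combined objective: both GAN losses plus the weighted cycle term."""
    return l_gan_xy + l_gan_yx + lam * cycle_loss(F, G, x_batch, y_batch)
```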

Figure 2: Discriminators trained on the toy example of recognizing the bottom-right corner of a simplex as true data. From left to right, the discriminators were regularized using: nothing; WGAN Jacobian norm regularization; and the relaxed sampling technique.

This leads to the training objectives:

F∗ = argmin_F  Lcyc(F, G, X, Y) + LGAN(F, DY, X, Y)

G∗ = argmin_G  Lcyc(F, G, X, Y) + LGAN(G, DX, Y, X)

D∗X = argmax_{DX}  LGAN(G, DX, Y, X)

D∗Y = argmax_{DY}  LGAN(F, DY, X, Y)

CycleGAN uses Lcyc to avoid mode collapse by preserving reconstruction of mapping inputs from outputs. It has demonstrated excellent results in unpaired image translation between two visually similar categories. Our architecture is the first example of this unsupervised learning framework being successfully applied to discrete data such as language.

2 DISCRETE GANS

Applying GANs to discrete data generation is still an open research problem that has seen great interest and development. The primary difficulty with training discrete data generators in a GAN setting is the lack of a gradient through a discrete node in the computation graph. The alternatives to producing discrete outputs (for instance, generators producing a categorical distribution over discrete elements) are prone to uninformative discrimination (described below), in that the discriminator may use an optimal discrimination criterion that is unrelated to the correctness of the re-discretized generated data. In our example of a continuous distribution over discrete elements, the produced samples all lie within the standard simplex ∆k, with dimension k equal to the number of elements in the distribution. In this case, samples from the true data distribution always lie on a vertex vi of the simplex, while any sub-optimal generator will produce samples within the simplex's interior ∆k \ {v1, ..., vk}. In this example, a discriminator which performs uninformative discrimination might evaluate a sample's membership in the vertices of the simplex as an optimal discrimination criterion, which is entirely uninformative of the correctness of re-discretized samples from the generator.
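To make uninformative discrimination concrete, the toy NumPy check below (our own illustration) shows that a vertex-membership test separates real one-hot samples from a generator's softmax outputs perfectly, even when re-discretizing those outputs recovers the correct tokens every time:

```python
import numpy as np

k = 5                                       # vocabulary size (simplex dimension)
rng = np.random.default_rng(0)
real = np.eye(k)[rng.integers(k, size=8)]   # true data: one-hot vectors (simplex vertices)
fake = 0.9 * real + 0.1 / k                 # a near-perfect generator: correct argmax,
                                            # but samples lie in the simplex interior

def is_vertex(x):
    """Check whether each sample sits exactly on a simplex vertex."""
    return np.isclose(x.max(axis=1), 1.0)

print(is_vertex(real).all(), is_vertex(fake).any())   # True False -> trivially separable
print((fake.argmax(1) == real.argmax(1)).all())       # True -> yet re-discretization is flawless
```

A discriminator that learns this test achieves perfect discrimination while carrying no information about whether the re-discretized generator outputs are correct.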

A number of solutions to training generators with discrete outputs have been proposed: SeqGAN (Yu et al., 2017) uses the REINFORCE gradient estimate to train the generator; boundary-seeking GANs (Hjelm et al., 2017) and maximum-likelihood augmented GANs (Che et al., 2017) proposed a gradient approximation with low bias and variance that resembles the REINFORCE (Williams, 1992) estimator. Gumbel-softmax GANs (Kusner & Hernandez-Lobato, 2016) replace discrete variables in the simplex with continuous relaxations called Concrete (Maddison et al., 2016) or Gumbel-softmax (Jang et al., 2016) variables. WGANs (Arjovsky et al., 2017) were suggested as a remedy to the uninformative discrimination problem by ensuring that the discriminator's rate of change with respect to its input is bounded by some constant.

Our work utilizes both the Wasserstein GAN (Gulrajani et al., 2017; Arjovsky et al., 2017) and relaxations of discrete random variables (such as Concrete/Gumbel-softmax variables). It has been noted multiple times in implementations of CycleGAN, as well as in the original paper itself, that the architecture was sensitive to initialization and required repeat attempts in order to converge to a satisfactory mapping (Bansal & Rathore, 2017; Sari, 2017). Our architecture suffered the same instability before the WGAN Jacobian norm regularization term was added to the discriminator's loss. In addition, we found that having the discriminator operate over embedding space, instead of directly over the softmax vectors produced by our generator, improved performance.

Our hypothesis, which is justified by Proposition 1 below, is that the embedding vectors may act as continuous relaxations of discrete random variables as small, noisy updates are applied throughout training; Proposition 1 asserts that by replacing discrete random variables with continuous ones, our discriminator is prevented from arbitrarily approximating a Dirac delta distribution. Figure 2 shows simple discriminators trained on the toy task of identifying a single vertex of a simplex as true data. It is clear that a lack of regularization leads to the discriminator collapsing to the vertex of the simplex, leaving approximately zero gradient everywhere. The Jacobian regularization of Wasserstein GANs leads to the space covered by lines from the true data vertex to the generated data having a lower rate of change (note that the gradient is still close to zero in the remaining area of the simplex). Finally, replacing the discrete random variables of the true data with continuous samples about the vertex results in a much more gradual transition, which is desirable since it provides a stronger gradient signal from which to learn.

Unique to CycleGAN is the auxiliary cycle loss described in Section 1.3. The effect of this additional objective is that generated samples are regularly forced away from the discriminator's minimum in favor of a mapping that better satisfies reconstruction. For instance, it may be the case that the discriminator favors a particular cipher mapping that is not bijective; in this case, the model will receive a strong signal from the cycle loss pushing it away from the discriminator's minimum. In these cases where the model moves against the gradient it receives from the discriminator, it may be the case that this region has near-zero curvature (as is visually discernible from Figure 2); this is because the WGAN curvature regularization (see Equation 3) has not been applied in this region.

'Curvature' here refers to the curvature of the discriminator's output with respect to its input; this curvature determines the strength of the training signal received by the generator. Low curvature means little information for the generator to improve itself with. This motivates the benefits of having strong curvature globally, as opposed to only linearly between the generator's samples and the true data. Kodali et al. (2017) propose regularizing in all directions about the generated samples, which would likely remedy the vanishing gradient in our case as well; for our experiments, the relaxed sampling technique proved effective. It should also be noted that Luc et al. (2016) propose something similar to relaxed sampling, whereby they replace the ground-truth discrete tokens with a distribution over the vocabulary that distributes some of the mass across the remaining incorrect tokens.

Let us now introduce the definitions needed for the formal presentation of Proposition 1.

Definitions.

(Throughout, we write X̄ for a continuous relaxation of a discrete set X, and x̄ for a point of X̄.)

• (Continuous relaxation of a discrete set). A continuous relaxation of a discrete set X is a proper, path-connected metric space X̄ satisfying X ⊂ X̄.

• (Rediscretization function). A rediscretization function is an injective function R : X̄ → X from a continuous relaxation X̄ of a discrete space X satisfying ∀x ∈ X, ∃ε > 0 s.t. R ≡ x on Bε[x]. Note that R defines an equivalence relation on X̄.

• (Uninformative discrimination). A discriminator D̄_X over a relaxation X̄ is said to perform uninformative discrimination under a rediscretization function R if ∃x ∈ X, x̄ ∈ X̄ s.t. R(x̄) = x and D̄_X(x̄) ≉ D̄_X(x).

• (Continuous relaxation of a function). A continuous relaxation of a function over discrete sets F : X → Y is another function F̄ : X̄ → Ȳ (where X̄, Ȳ are continuous relaxations of X, Y) such that F̄ is continuous and F̄(x) = F(x) ∀x ∈ X.

The following proposition (proved in the Appendix) forms the theoretical basis of the technique.

Proposition 1 (Reliable Fooling Via Relaxation). Given:

• discrete spaces X, Y and continuous relaxations X̄, Ȳ

• generators F : X̄ → Ȳ, G : Ȳ → X̄, bijections satisfying F = G⁻¹

• discrete discriminators D_X, D_Y, both optimal for the fixed F, G

• rediscretization functions R_X, R_Y

Suppose: F is approximately volume-preserving in a small region about each x ∈ X; consequently, the same is true for G about each y ∈ Y.

If: during training, we replace discrete random variables from X, which lie in the continuous metric space X̄, with samples from regions about them.

Then: the optimal relaxed discriminators D̄_X and D̄_Y have a non-empty region about each x ∈ X and y ∈ Y where they are expected to assign values close to D_X(x) and D_Y(y).

Figure 3 compares a model trained with embedding vectors versus one with only the softmax outputs. It becomes clear on a harder task, such as Vigenère, that the embeddings vastly outperform softmax in terms of speed of convergence and final accuracy. However, we found that the simpler task of a shift cipher showed little difference between embeddings and softmax, suggesting that an increase in task complexity increases the benefits provided by the stronger gradient signal of embeddings.

3 METHOD

3.1 CIPHERGAN

GANs applied to text data have yet to produce truly convincing results (Kawthekar et al.). Previous attempts at discrete sequence generation with GANs have generally utilized a generator outputting a probability distribution over the token space (Gulrajani et al., 2017; Yu et al., 2017; Hjelm et al., 2017). This leads to the discriminator receiving a sequence of discrete random variables from the data distribution and a sequence of continuous random variables from the generator distribution, making the task of discrimination trivial and uninformative of the underlying data distribution. In order to avoid such a scenario, we perform all discrimination within the embedding space, by allowing the generator's output distribution to define a convex combination of the corresponding embeddings. This leads to the following losses:

LGAN(F, DY, X, Y) = E_{y∼Y}[log DY(y · Wᵀ_Emb)] + E_{x∼X}[log(1 − DY(F(x · Wᵀ_Emb) · Wᵀ_Emb))]

Lcyc(F, G, X, Y) = E_{x∼X}[‖G(F(x · Wᵀ_Emb) · Wᵀ_Emb) − x‖₁] + E_{y∼Y}[‖F(G(y · Wᵀ_Emb) · Wᵀ_Emb) − y‖₁]

We perform an inner product between the embeddings WEmb and the one-hot vectors in x, as well as between the embeddings and the softmax vectors produced by generators F and G. The former is equivalent to a lookup operation over the table of embedding vectors, while the latter is a convex combination of all vectors in the vocabulary. The embeddings WEmb are trained at each step to minimize Lcyc and maximize LGAN, meaning the embeddings are easily mapped from and are easy to discriminate. As was discussed in Section 2, training with the above loss functions was unstable, with approximately three of every four experiments failing to produce compelling results. This is a problem we observed with the original CycleGAN horse-zebra experiment, and one that has been noted by multiple re-implementations online (Bansal & Rathore, 2017; Sari, 2017). We were able to significantly increase stability by training the discriminator loss along with the Lipschitz conditioning term from the improved Wasserstein GAN (Gulrajani et al., 2017) (see Equation 3 and Fedus et al. (2017)), resulting in the following loss (DualGAN (Yi et al., 2017) opted to use weight-clipping to enforce the Lipschitz condition):

LGAN(F, DY, X, Y) = E_{y∼Y}[DY(y · Wᵀ_Emb)] − E_{x∼X}[DY(F(x · Wᵀ_Emb) · Wᵀ_Emb)] + α · E_{y∼Y}[(‖∇_y DY(y)‖₂ − 1)²]

As a consequence of Proposition 1, discriminators trained on non-stationary embeddings will be unable to approximate Dirac delta distributions to arbitrary accuracy, implying there are dedicated 'safe-zones' about members of X where the generator can reliably fool the discriminator and uninformative discrimination is prevented.
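A minimal NumPy sketch of the two inner products with WEmb described above (shapes and names are ours): one-hot real tokens reduce to an embedding lookup, while generator softmax outputs become convex combinations of embedding vectors, so both discriminator inputs live in the same continuous space.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim, seq = 200, 256, 16
W_emb = rng.normal(size=(vocab, dim))        # trained jointly with the rest of the model

x_onehot = np.eye(vocab)[rng.integers(vocab, size=seq)]   # real data: one-hot tokens
logits = rng.normal(size=(seq, vocab))
softmax = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # generator output

real_embedded = x_onehot @ W_emb   # equivalent to a row lookup into W_emb
fake_embedded = softmax @ W_emb    # convex combination over all embedding vectors

# Both discriminator inputs now have the same continuous form: (seq, dim)
print(real_embedded.shape, fake_embedded.shape)
```

Because each softmax row sums to one, the generated points always lie inside the convex hull of the embedding vectors, near (but not exactly on) the points occupied by real tokens.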

In our experiments, we jointly train the embedding vectors as parameters of the model. The gradient updates applied to these vectors introduce noise between training iterations; we observed that embedding vectors tend to remain in a bounded region after the initial steps of training.

Work                             Ciphertext Length   Accuracy
Hasinoff (2003)                  500                 ~97%
Forsyth & Safavi-Naini (1993)    5000                ~100%
Ramesh et al. (1993)             160                 ~78.5%
Verma et al. (2007)              1000                ~87%

Table 1: Previous results on automated shift cipher cracking with limited ciphertext length.

We found that simply replacing the data with embedding vectors had a similar effect to performing the random sampling described in Proposition 1 (see Figure 3).

4 EXPERIMENTS

4.1 DATA

Our experiments use plaintext natural language samples from the Brown English text dataset (Francis & Kucera, 1979). We generate 2 × batch size plaintext samples; the first half are fed as the CycleGAN's X distribution, and the second half is passed through the cipher of choice and fed as the Y distribution.

For our natural language plaintext data we used the Brown English-language corpus, which consists of over one million words in 57,340 sentences. We experiment with both word-level ("Brown-W") and character-level ("Brown-C") vocabularies. For word-level vocabularies, we control the size of the vocabulary by taking the top k most frequent words and introducing an 'unknown' token, which we use to replace all words that are not within the taken vocabulary. We demonstrate our method's ability to scale to large vocabularies using the word-level vocabularies; more modern enciphering techniques rely on large substitution boxes (S-boxes) with many (often hundreds of) elements.
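A sketch of this data pipeline as we read it (helper names and the toy corpus are ours, not the paper's code): build a top-k word vocabulary with an unknown token, then split each batch into an unpaired plaintext bank X and an enciphered bank Y.

```python
from collections import Counter

UNK = "<unk>"

def build_vocab(sentences, k):
    """Keep the k most frequent words; everything else maps to the unknown token."""
    counts = Counter(w for s in sentences for w in s.split())
    vocab = {w: i for i, (w, _) in enumerate(counts.most_common(k))}
    vocab[UNK] = len(vocab)
    return vocab

def encode(sentence, vocab):
    return [vocab.get(w, vocab[UNK]) for w in sentence.split()]

def make_banks(sentences, vocab, cipher, batch_size):
    """First half of the samples feeds the X (plaintext) distribution;
    the second half is passed through the cipher and feeds the Y distribution."""
    encoded = [encode(s, vocab) for s in sentences[: 2 * batch_size]]
    x_bank = encoded[:batch_size]
    y_bank = [cipher(seq) for seq in encoded[batch_size:]]  # unpaired with x_bank
    return x_bank, y_bank

# Example with a word-level shift ("permutation") cipher over vocabulary indices:
sentences = ["the cat sat", "the dog ran", "a cat ran", "the dog sat"]
vocab = build_vocab(sentences, k=5)
shift_cipher = lambda seq, v=len(vocab): [(i + 3) % v for i in seq]
print(make_banks(sentences, vocab, shift_cipher, batch_size=2))
```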

4.2 TRAINING

As in Zhu et al. (2017), we replace the log-likelihood loss with a squared difference loss, which was originally introduced by Mao et al. (2016). The original motivation for this replacement was improved stability in training and avoidance of the vanishing gradients problem. In this work we found the effect on training stability substantial.

LGAN(F, DY, X, Y) = E_{y∼Y}[(DY(y · Wᵀ_Emb))²] + E_{x∼X}[(1 − DY(F(x · Wᵀ_Emb) · Wᵀ_Emb))²] + α · E_{y∼Y}[(1 − ‖∇_y DY(y)‖₂)²]

Hence, our total loss is:

LTotal(F, G, DX, DY, X, Y) = Lcyc(F, G, X, Y) + LGAN(F, DY, X, Y) + LGAN(G, DX, X, Y)

We adapted the convolutional architecture for the generator and discriminator directly from Zhu et al. (2017). We simply replace all two-dimensional convolutions with the one-dimensional variant and reduce the filter sizes in our generators to 1 (pointwise convolutions). Convolutional neural networks have recently been shown to be highly effective on language tasks and can speed up training significantly (Zhang & LeCun, 2015; Kalchbrenner et al., 2016; Yu et al., 2017). Both our generators and discriminators receive a sequence of vectors in embedding space; our generators produce a softmax distribution over the vocabulary, while our discriminators produce a scalar output. For all our experiments we use a cycle loss with regularization coefficient λ = 1. In order to be compatible with the WGAN we replace batch normalization (Ioffe & Szegedy, 2015) with layer normalization (Ba et al., 2016). We train using the Adam optimizer (Kingma & Ba, 2014) with batch size 64 and learning rate 2e−4, β1 = 0 and β2 = 0.9. Our learning rate is exponentially warmed up to 2e−4 over 2500 steps, and held constant thereafter. We use learned embedding vectors with 256 dimensions.

Figure 3: Left: Comparison of different timing techniques for Brown-C Vigenère. Right: Comparison of embedding vs. raw softmax on Brown-W with a vocab size of 200.

Data          Brown-W    Brown-W    Brown-C    Freq. Analysis (With Key)
Vocab size    10         200        58         58               200

Cipher: Shift/Permutation
Acc.          100%       98.7%      99.8%      80.9%            44.5%

Cipher: Vigenère (Key: "345")
Acc.          99.7%      75.7%      99.0%      9.6% (78.1%)     <0.1% (44.3%)

Table 2: Average proportion of characters correctly mapped in a given sequence. The "Freq. Analysis" column is simple frequency analysis applied to the same corpus our model observes. For Vigenère we also show the score if the key were known (note: the key is left unknown to our model).

The WGAN Lipschitz conditioning parameter was set to α = 10, as prescribed in Gulrajani et al. (2017).

For the Vigenère cipher, positional information is critical to the network being able to perform the mapping. In order to facilitate this, we experimented with adding the timing signal described in Vaswani et al. (2017) ("Transformer Timing" in Figure 3) and found that performance increased relative to no explicit timing signal. We found that the best option was concatenating a learned positional embedding vector specific to each position onto the sequence ("Concat Timing" in Figure 3); this dramatically improved performance, however it means that the architecture cannot generalize to sequences longer than those in the training set. A potential solution to the issue of generalizing to longer sequences would be making a 'soft' choice at each position of which positional embedding vector to concatenate, using a softmax distribution over a set of embedding vectors larger than the expected key length; however, we leave this to future work.

4.3 DISCUSSION

Table 2 shows that CipherGAN was able to solve shift ciphers to near-flawless accuracy, with all three vocabulary sizes being easily decoded by the model. CipherGAN performs extremely well on Vigenère, achieving excellent results on the character-level cipher and strong results on the challenging word-level cipher with a vocabulary size of 200. The vocabulary size of 58 for our character-level data, containing punctuation and special characters, is more than double what has been previously explored. In comparison to the original CycleGAN architecture, we found CipherGAN to be extremely consistent in training and notably insensitive to the random initialization of weights; we attribute this stability to the Jacobian norm regularization term.

For both ciphers, the first mappings to be correctly determined were those of the most frequently occurring vocabulary elements, suggesting that the network does indeed perform some form of frequency analysis to distinguish outlier frequencies in the two banks of text. Another interesting observation concerns the mistakes made by the network: the network would frequently confuse punctuation marks with one another, perhaps suggesting that these vocabulary elements' skip-gram signatures were similar enough to lead to the repeated confusion observed across many training runs.

5 CONCLUSION

CipherGAN is a compelling demonstration of the potential generative adversarial networks hold to act on discrete data and solve difficult tasks that rely on an extremely sensitive and nuanced discrimination criterion. Our work serves to redouble the promise of the CycleGAN architecture for unsupervised alignment tasks across multiple classes of data. CipherGAN presents an algorithm that is both stable and consistent in training, improving upon past implementations of the CycleGAN architecture. Our work theoretically motivates, and empirically confirms, the use of continuous relaxations of discrete variables, not only to facilitate the flow of gradients through discrete nodes, but also to prevent the oft-observed phenomenon of uninformative discrimination. CipherGAN is highly general in its structure and can be directly applied to a variety of unsupervised text alignment tasks without excess burden of adaptation. On the one hand, CipherGAN is an early step towards the goal of unsupervised translation between languages and has shown excellent performance on the simplified task of cipher map inference. On the other hand, the methods we introduce can be used more broadly in the field of text generation with adversarial networks.

ACKNOWLEDGMENTS

Our thanks go to Roger Grosse and Kelvin Shuangjian Zhang for their advice and support throughout. We also thank Otavio Good and Ian Goodfellow for meaningful early discussions and direction, as well as Michal Wiszniewski for his assistance in developing the code upon which the experiments were run. This work was made possible thanks to the AI Grant, which provided generous support throughout.

REFERENCES

Martin Arjovsky, Soumith Chintala, and Leon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

Hardik Bansal and Archit Rathore. Understanding and implementing CycleGAN in TensorFlow. https://hardikbansal.github.io/CycleGANBlog/, 2017.

John M Carroll and Steve Martin. The automated cryptanalysis of substitution ciphers. Cryptologia, 10(4):193–209, 1986.

Tong Che, Yanran Li, Ruixiang Zhang, R Devon Hjelm, Wenjie Li, Yangqiu Song, and Yoshua Bengio. Maximum-likelihood augmented discrete generative adversarial networks. arXiv preprint arXiv:1702.07983, 2017.

William Fedus, Mihaela Rosca, Balaji Lakshminarayanan, Andrew M. Dai, Shakir Mohamed, and Ian Goodfellow. Many paths to equilibrium: GANs do not need to decrease a divergence at every step, 2017.

William S Forsyth and Reihaneh Safavi-Naini. Automated cryptanalysis of substitution ciphers. Cryptologia, 17(4):407–418, 1993.

W Nelson Francis and Henry Kucera. Brown corpus manual. Brown University, 2, 1979.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.

Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of Wasserstein GANs. arXiv preprint arXiv:1704.00028, 2017.

Sam Hasinoff. Solving substitution ciphers. Department of Computer Science, University of Toronto, Tech. Rep., 2003.

R Devon Hjelm, Athul Paul Jacob, Tong Che, Kyunghyun Cho, and Yoshua Bengio. Boundary-seeking generative adversarial networks. arXiv preprint arXiv:1702.08431, 2017.

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456, 2015.

Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.

Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, and Koray Kavukcuoglu. Neural machine translation in linear time. arXiv preprint arXiv:1610.10099, 2016.

Prasad Kawthekar, Raunaq Rewari, and Suvrat Bhooshan. Evaluating generative models for text generation.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014. URL http://arxiv.org/abs/1412.6980.

Kevin Knight, Anish Nair, Nishit Rathod, and Kenji Yamada. Unsupervised analysis for decipherment problems. In Proceedings of the COLING/ACL on Main Conference Poster Sessions, pp. 499–506. Association for Computational Linguistics, 2006.

Kevin Knight, Beata Megyesi, and Christiane Schaefer. The Copiale cipher. In Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web, pp. 2–9. Association for Computational Linguistics, 2011.

Naveen Kodali, Jacob Abernethy, James Hays, and Zsolt Kira. How to train your DRAGAN. arXiv preprint arXiv:1705.07215, 2017.

Matt J Kusner and Jose Miguel Hernandez-Lobato. GANs for sequences of discrete elements with the Gumbel-softmax distribution. arXiv preprint arXiv:1611.04051, 2016.

Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised image-to-image translation networks. arXiv preprint arXiv:1703.00848, 2017.

Pauline Luc, Camille Couprie, Soumith Chintala, and Jakob Verbeek. Semantic segmentation using adversarial networks. arXiv preprint arXiv:1611.08408, 2016.

Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The Concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.

Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. arXiv preprint arXiv:1611.04076, 2016.

SS Omran, AS Al-Khalid, and DM Al-Saady. A cryptanalytic attack on Vigenère cipher using genetic algorithm. In Open Systems (ICOS), 2011 IEEE Conference on, pp. 59–64. IEEE, 2011.

Bhadri Msvs Raju et al. Decipherment of substitution cipher using enhanced probability distribution. International Journal of Computer Applications, 5(8):34–40, 2010.

RS Ramesh, G Athithan, and K Thiruvengadam. An automated approach to solve simple substitution ciphers. Cryptologia, 17(2):202–218, 1993.

Eyyub Sari. tensorflow-cyclegan. https://github.com/Eyyub/tensorflow-cyclegan, 2017.

Simon Singh. The Code Book: The Science of Secrecy from Ancient Egypt to Quantum Cryptography. Anchor, 2000.

Ragheb Toemeh and Subbanagounder Arumugam. Applying genetic algorithms for searching key-space of polyalphabetic substitution ciphers. International Arab Journal of Information Technology (IAJIT), 5(1), 2008.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017.

AK Verma, Mayank Dave, and RC Joshi. Genetic algorithm and tabu search attack on the monoalphabetic substitution cipher in ad hoc networks. Journal of Computer Science. Citeseer, 2007.

Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.

Zili Yi, Hao Zhang, Ping Tan, and Minglun Gong. DualGAN: Unsupervised dual learning for image-to-image translation. arXiv preprint arXiv:1704.02510, 2017.

Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. SeqGAN: Sequence generative adversarial nets with policy gradient. In AAAI, pp. 2852–2858, 2017.

Xiang Zhang and Yann LeCun. Text understanding from scratch. arXiv preprint arXiv:1502.01710, 2015.

Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.

APPENDIX

A ARCHITECTURAL DETAILS

For both the generator and discriminator architectures below, we assume the network is provided with a sequence of N tokens, denoted data, lying in a K-simplex. Therefore data is an N by K matrix.

For simplicity, we define the following notation:

ConvLayer_{fc, fs=1, s=1}(x) = ReLU(LayerNorm(Conv1D_{fc, fs}(x)))

ResBlock_{fc, fs=1}(x) = ReLU(LayerNorm(x + Conv1D_{fc, fs}(Conv1D_{fc, fs}(x))))

ConvStack_{n, fc, fs=1}(x) = ConvLayer_{1, fs} ∘ ConvLayer_{2¹·fc, fs, 2} ∘ ··· ∘ ConvLayer_{2ⁿ·fc, fs, 2} ∘ ConvLayer_{fc, fs, 2}(x)
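One possible PyTorch rendering of ConvLayer and ResBlock (our own sketch; the released implementation is in TensorFlow and may differ in details such as padding and where normalization is applied):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvLayer(nn.Module):
    """ReLU(LayerNorm(Conv1D(x))) for inputs shaped (batch, channels, length)."""
    def __init__(self, in_ch, out_ch, filter_size=1, stride=1):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, filter_size, stride=stride,
                              padding=filter_size // 2)
        self.norm = nn.LayerNorm(out_ch)

    def forward(self, x):
        h = self.conv(x)
        h = self.norm(h.transpose(1, 2)).transpose(1, 2)  # normalize over channels
        return F.relu(h)

class ResBlock(nn.Module):
    """ReLU(LayerNorm(x + Conv1D(Conv1D(x)))) with matching channel counts."""
    def __init__(self, ch, filter_size=1):
        super().__init__()
        self.conv1 = nn.Conv1d(ch, ch, filter_size, padding=filter_size // 2)
        self.conv2 = nn.Conv1d(ch, ch, filter_size, padding=filter_size // 2)
        self.norm = nn.LayerNorm(ch)

    def forward(self, x):
        h = x + self.conv2(self.conv1(x))
        return F.relu(self.norm(h.transpose(1, 2)).transpose(1, 2))
```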

A.1 GENERATOR

We define the following constants:

• fc = 32: The base 'filter count', or number of filters in a convolution layer
• fs = 1: The 'filter size', or width of the weight kernel used in convolution layers
• vs = 1: The 'vocab size', or number of elements in the vocabulary
• s = 1: The stride of the convolution layers
• E = 100: The dimensionality of the embedding vectors
• T = 100: The dimensionality of the concat timing weights

First, we look up the embedding vectors corresponding to the observed tokens. This is performed as a simple inner product between data and a learned weight matrix of embeddings WEmb ∈ R^{K×E}:

x = data · WEmb

Next, a timing signal is added using either:

• The Transformer method:

  signal_{n,2k} = sin(n / 1e5^{2k/K})
  signal_{n,2k+1} = cos(n / 1e5^{2k/K})
  timing(x) = x + signal

• The concat method:

  Wtime ∈ R^{N×T}
  t = timing(x) = [x ‖₂ Wtime]

where Wtime are trained parameters and [· ‖_k ·] denotes concatenation along the k-th axis.
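A NumPy sketch of both timing options as we read the formulas above (shapes and the 1e5 base follow this appendix; variable names are ours):

```python
import numpy as np

def transformer_timing(x, base=1e5):
    """Add sinusoidal position signals: signal[n, 2k] = sin(n / base**(2k / K))."""
    N, K = x.shape
    pos = np.arange(N)[:, None]                 # position n
    idx = np.arange(K)[None, :]                 # channel index
    angles = pos / base ** (2 * (idx // 2) / K)
    signal = np.where(idx % 2 == 0, np.sin(angles), np.cos(angles))
    return x + signal

def concat_timing(x, W_time):
    """Concatenate a learned positional embedding W_time (shape N x T) onto each position."""
    return np.concatenate([x, W_time], axis=1)

x = np.zeros((10, 8))                           # a length-10 sequence of 8-dim vectors
W_time = np.random.default_rng(0).normal(size=(10, 4))
print(transformer_timing(x).shape, concat_timing(x, W_time).shape)  # (10, 8) (10, 12)
```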

Then the generator is defined as follows:

a(x) = ConvLayer_{4·fc}(ConvLayer_{2·fc, fs}(ConvLayer_{fc}(x)))
b(x) = ResBlock^{(5)}_{4·fc}(x)
out(t) = Softmax(ConvLayer_{vs}(b(a(t))))

A.2 DISCRIMINATOR

We define the following constants:

• fc = 32: The base 'filter count', or number of filters in a convolution layer
• fs = 15: The 'filter size', or width of the weight kernel used in convolution layers
• n = 5: The depth of the convolutional stack

Similar to the generator, a timing signal is added first, leading to the following discriminator:

out(t) = ConvStack_{n, fc, fs}(dropout_{0.5}(t))

B PROOF OF PROPOSITION 1

Proposition 1 (Reliable Fooling Via Relaxation). Given:

• discrete spaces X, Y and continuous relaxations X̄, Ȳ

• generators F : X̄ → Ȳ, G : Ȳ → X̄, bijections satisfying F = G⁻¹

• discrete discriminators D_X, D_Y, both optimal for the fixed F, G

• rediscretization functions R_X, R_Y

Suppose: F is approximately volume-preserving in a small region about each x ∈ X; consequently, the same is true for G about each y ∈ Y.

If: during training, we replace discrete random variables from X, which lie in the continuous metric space X̄, with samples from regions about them.

Then: the optimal relaxed discriminators D̄_X and D̄_Y have a non-empty region about each x ∈ X and y ∈ Y where they are expected to assign values close to D_X(x) and D_Y(y).

Proof. We prove one side of the CipherGAN, as the proofs for both sides are similar.

We are given a bijective function between continuous relaxations of X and Y, F : (X̄, d_X̄) → (Ȳ, d_Ȳ), where the discrete sets X and Y contain finite sequences (of length n) of vectors lying on the vertices of the simplex ∆k, and are the supports of the data distributions p_X, p_Y respectively.

Let:

• X̄ = Ȳ = ∆k × ··· × ∆k (n copies), with k equal to the number of elements in our vocabulary;

• the rediscretization function R_X : X̄ → X be defined by R_X(x̄) = argmin_{x∈X} d_X̄(x, x̄); similarly for R_Y.

Now, for each x ∈ X consider the infinite set S_x, with the cardinality of the continuum, constructed according to:

x̄ ∈ S_x ⟺ x̄ ∈ X̄ s.t. R_X(x̄) = x and R_Y(F(x̄)) = F(x)

Equivalently,

S_x = R_X⁻¹(x) ∩ F⁻¹(R_Y⁻¹(F(x)))

Note. S_x is never of cardinality less than the continuum, since the following is implied by the definitions of R_X, S_x and the fact that F is continuous: x ∈ X ⟹ x ∈ S_x, and there exists a closed ball Bε[x] (contained in S_x) with radius

0 < ε < min{ d_X̄(x, z) : z ∈ X̄, R_Y(F(z)) ≠ R_Y(F(x)) }

So, for each element x ∈ X there exists a closed set of points in X̄ which are rediscretized, under R_X, to x. Since S_x is a Borel set, we can sample uniformly from it. Therefore, during training, suppose we replace each element x ∈ X with a sample x̄ ∼ S_x:

We begin with the discrete objective:

∑_{x∈X} p_X(x) log(D_X(x)) + p_G(x) log(1 − D_X(x))

As was noted in Goodfellow et al. (2014), this objective is optimized in D_X : X → [0, 1] when:

D_X = p_X / (p_X + p_G)

which is undesirable, as p_X is a sum of Dirac delta distributions, p_X(x) = ∑_{xᵢ∈X} δ_{xᵢ}(x), and lacks a non-zero gradient to train the generator function with. Instead, let us consider a continuous relaxation D̄_X : X̄ → [0, 1] of the discriminator D_X and observe where it optimizes.

Suppose that ∀x ∈ X, ∃ε_x ≥ 0 such that ∀x̄ ∈ S_x:

(1 − ε_x) ≤ |∇_x̄ F(x̄)| ≤ (1 + ε_x)   (6)

That is, suppose F is approximately volume-preserving within a small region about each x ∈ X.

Lemma 1. ∀y ∈ Y, G(S_y) = S_{G(y)}

Proof of Lemma 1. For all y ∈ Y, with G(y) = x ⟺ y = F(x):

G(S_y) = G[ R_Y⁻¹(y) ∩ G⁻¹(R_X⁻¹(G(y))) ]
       = G(R_Y⁻¹(y)) ∩ G(G⁻¹(R_X⁻¹(G(y))))
       = G(R_Y⁻¹(y)) ∩ R_X⁻¹(G(y))
       = F⁻¹(R_Y⁻¹(F(x))) ∩ R_X⁻¹(x)
       = S_x = S_{G(y)}

Corollary 1.

∫_{x̄∼S_{G(y)}} (1 / |S_{G(y)}|) dx̄ = ∫_{x̄∼G(S_y)} (|∇_x̄ F(x̄)| / |S_y|) dx̄

Proof of Corollary 1.

∫_{x̄∼S_{G(y)}} (1 / |S_{G(y)}|) dx̄ = 1 = ∫_{ȳ∼S_y} (1 / |S_y|) dȳ
                                        = ∫_{x̄∼G(S_y)} (|∇_x̄ G⁻¹(x̄)| / |S_y|) dx̄
                                        = ∫_{x̄∼G(S_y)} (|∇_x̄ F(x̄)| / |S_y|) dx̄

Corollary 2.

1 / ((1 − ε_x)|S_{G(y)}|) ≥ 1 / |S_y| ≥ 1 / ((1 + ε_x)|S_{G(y)}|)

Proof of Corollary 2. By Equation 6 and Corollary 1:

∫_{x̄∼S_{G(y)}} ((1 − ε_x) / |S_y|) dx̄ ≤ ∫_{x̄∼S_{G(y)}} (1 / |S_{G(y)}|) dx̄ ≤ ∫_{x̄∼S_{G(y)}} ((1 + ε_x) / |S_y|) dx̄

⟹ ((1 − ε_x) / |S_y|) |S_{G(y)}| ≤ 1 ≤ ((1 + ε_x) / |S_y|) |S_{G(y)}|

⟹ (1 − ε_x)|S_{G(y)}| ≤ |S_y| ≤ (1 + ε_x)|S_{G(y)}|

⟹ 1 / ((1 − ε_x)|S_{G(y)}|) ≥ 1 / |S_y| ≥ 1 / ((1 + ε_x)|S_{G(y)}|)

Corollary 1 leads to the following:

E_{x∈X}[ E_{x̄∈S_x}[ log(D̄_X(x̄)) ] ] + E_{y∈Y}[ E_{ȳ∈S_y}[ log(1 − D̄_X(G(ȳ))) ] ]

= ∑_{x∈X} p_X(x) ∫_{x̄∼S_x} (1/|S_x|) log(D̄_X(x̄)) dx̄ + ∑_{y∈Y} p_Y(y) ∫_{ȳ∼S_y} (1/|S_y|) log(1 − D̄_X(G(ȳ))) dȳ

= ∑_{x∈X} p_X(x) ∫_{x̄∼S_x} (1/|S_x|) log(D̄_X(x̄)) dx̄ + ∑_{y∈Y} p_Y(y) ∫_{x̄∼G(S_y)} (|∇_x̄ F(x̄)|/|S_y|) log(1 − D̄_X(x̄)) dx̄   (7)

Using Lemma 1 and Corollary 2, we obtain the following lower bound on Equation 7:

≥ ∑_{x∈X} p_X(x) ∫_{x̄∼S_x} (1/|S_x|) log(D̄_X(x̄)) dx̄ + ∑_{y∈Y} p_Y(y) ∫_{x̄∼G(S_y)} ((1 − ε_x)/|S_y|) log(1 − D̄_X(x̄)) dx̄

≥ ∑_{x∈X} [ p_X(x) ∫_{x̄∼S_x} (1/|S_x|) log(D̄_X(x̄)) dx̄ + p_Y(F(x)) ∫_{x̄∼S_{G(F(x))}} ((1 − ε_x)/((1 + ε_x)|S_{G(F(x))}|)) log(1 − D̄_X(x̄)) dx̄ ]

= ∑_{x∈X} ∫_{x̄∼S_x} (1/|S_x|) [ p_X(x) log(D̄_X(x̄)) + ((1 − ε_x)/(1 + ε_x)) p_G(x) log(1 − D̄_X(x̄)) ] dx̄

which is maximal at:

D̄_X(x̄ ∼ S_x) = p_X / ( p_X + ((1 − ε_x)/(1 + ε_x)) p_G )  ≈_{ε_x}  D_X

⟹ E_{x̄∼S_x}[ D̄_X(x̄) ] ≈_{ε_x} D_X

And similarly, we find that Equation 7 is upper-bounded by:

≤ ∑_{x∈X} p_X(x) ∫_{x̄∼S_x} (1/|S_x|) log(D̄_X(x̄)) dx̄ + ∑_{y∈Y} p_Y(y) ∫_{x̄∼G(S_y)} ((1 + ε_x)/|S_y|) log(1 − D̄_X(x̄)) dx̄

≤ ∑_{x∈X} [ p_X(x) ∫_{x̄∼S_x} (1/|S_x|) log(D̄_X(x̄)) dx̄ + p_Y(F(x)) ∫_{x̄∼S_{G(F(x))}} ((1 + ε_x)/((1 − ε_x)|S_{G(F(x))}|)) log(1 − D̄_X(x̄)) dx̄ ]

= ∑_{x∈X} ∫_{x̄∼S_x} (1/|S_x|) [ p_X(x) log(D̄_X(x̄)) + ((1 + ε_x)/(1 − ε_x)) p_G(x) log(1 − D̄_X(x̄)) ] dx̄

which is maximal in D̄_X at:

D̄_X(x̄ ∼ S_x) = p_X / ( p_X + ((1 + ε_x)/(1 − ε_x)) p_G )  ≈_{ε_x}  D_X

⟹ E_{x̄∼S_x}[ D̄_X(x̄) ] ≈_{ε_x} D_X

Hence, asymptotically, as F becomes approximately volume-preserving about each discrete x ∈ X, the two bounds are maximized at the same loss value by the same class of functions, with E_{x̄∼S_x}[D̄_X(x̄)] = D_X; application of the squeeze theorem concludes the proof.
