DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2019

Unsupervised Audio Spectrogram Compression using Vector Quantized Autoencoders

AMUND HANSEN VEDAL

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Unsupervised Audio Spectrogram Compression using Vector Quantized Autoencoders

AMUND HANSEN VEDAL

Master in Machine Learning
Date: September 11, 2019
Supervisor: Alexandre Proutiere, Anders Arpteg
Examiner: Danica Kragic Jensfelt
School of Electrical Engineering and Computer Science
Host company: Peltarion AB
Swedish title: Oövervakad ljudspektrogramkompression med vektor-kvantiserande självkodande neurala nätverk


Abstract

Despite the recent successes of neural networks in a variety of domains, musical audio modeling is still considered a hard task, with features typically spanning tens of thousands of dimensions in input space. By formulating audio data compression as an unsupervised learning task, this project investigates the applicability of vector quantized neural network autoencoders for compressing spectrograms – image-like representations of audio. Using a recently proposed gradient-based method for approximating waveforms from reconstructed (real-valued) spectrograms, the discrete pipeline produces listenable reconstructions of surprising fidelity compared to uncompressed versions, even for out-of-domain examples. The results suggest that the learned discrete quantization method achieves about 9x harder spectrogram compression compared to its continuous counterpart, while achieving similar reconstructions, both qualitatively and in terms of quantitative error metrics.


Sammanfattning

Despite the recent successes of neural networks across a range of domains, musical audio modeling remains a hard task, with characteristic features spanning tens of thousands of dimensions in the input space. By formulating audio data compression as an unsupervised learning task, this project investigates the usefulness of vector quantized neural network autoencoders on spectrograms – an image-like representation of audio. With a recently described gradient-based method for approximating waveforms from reconstructed (real-valued) spectrograms, the discrete pipeline produces listenable reconstructions of surprisingly high fidelity compared to uncompressed versions, even for out-of-domain examples. The results suggest that the learned discrete quantization method achieves roughly nine times harder spectrogram compression than its continuous counterpart, while producing similar reconstructions, both qualitatively and according to quantitative error metrics.


For Elena and my dear family.


Acknowledgements

I would first like to thank Agrin, Anders, Carl, Lars and the rest of the Menagerie at Peltarion for great discussions, guidance and support, and for including me in their research team. You are all a major reason why this project became as enjoyable as it did. I hope we will work together again sometime in the future.

I would also like to thank my KTH supervisor Prof. Alexandre Proutiere for his patient guidance and for sharing his admirable theoretical knowledge. It has been inspiring to work under such standards of quality and thorough understanding.

I would like to take this opportunity to thank Simone, Elisa, Klas, Erifili, Nicola, Deniz and Carolina for being there for me when I needed you the most.

Finally, I thank my friends at KTH for the great support, inspiration and companionship you showed, and for the times we shared these years.

Thank you.


Contents

1 Introduction
  1.1 Motivation

2 Background
  2.1 Variational Autoencoder (VAE)
    2.1.1 Problem statement
    2.1.2 Deriving the lower bound (ELBO) objective
    2.1.3 Discrepancy between log p(x) and L
    2.1.4 Differentiable L through reparametrization trick
    2.1.5 Multivariate Gaussian VAE Objective
  2.2 Vector Quantized Variational Autoencoder (VQVAE)
    2.2.1 The VQVAE Objective
  2.3 Spectrograms
    2.3.1 The Fourier Transform
    2.3.2 Short-Time Fourier Transform (STFT)
    2.3.3 Limitations of STFT resolution

3 Method
  3.1 Metrics
  3.2 Datasets
  3.3 Spectrogram Pipeline
  3.4 Autoencoder Architecture
  3.5 Phase Approximation from Real-valued Spectrograms

4 Related Work

5 Experiments and Results
  5.1 Experiments
    5.1.1 Natural Image Compression
    5.1.2 Audio Spectrogram Compression
    5.1.3 Spectrogram Compression by Naive Image Resizing
  5.2 Results

6 Discussion
  6.1 Discussion
  6.2 Challenges
    6.2.1 Key Take-aways

7 Conclusion and Future Work
  7.1 Conclusion
  7.2 Future Work
  7.3 Ethics and Sustainability
  7.4 Societal Aspects

Appendix A Spectrogram Pipeline Chart
Appendix B Autoencoder Pseudocode


Chapter 1

Introduction

Today's most commonly used audio compression algorithms enable millions of users to enjoy digital musical audio streamed directly over the internet. The well-known MP3 algorithm, and more modern variants such as iTunes' AAC, are designed to retain "hearable" audio content and remove information we cannot perceive [47]. Which audio features are deemed 'unperceivable' by humans is the result of decades of empirical research in psychoacoustic modeling [32].

In parallel, neural network-based methods are dominating an ever-increasing number of research fields. This is partially due to the wide-spread success of convolution-based neural network models for computer vision and a strong increase in the computation speed of hardware components optimized for matrix computations. In the case of digital audio, the main challenge in applying neural network models is the temporal resolution of audio signals [41], typically sampled at 44100 Hz for high-quality audio. Musical features like melodies, harmonies and chord progressions, with lengths typically in the seconds, thus require tens of thousands of samples. This makes feature extraction and modeling of audio difficult, which is especially evident in generated audio, where the sensitivity of the human ear can pick up distortions even at the sample level.

Visual audio features In the last few years, however, some of the world's largest companies (read: Google) managed to develop offline, sample-level convolutional neural networks (e.g. WaveNet [41]) able to extract and recognize both speech and shorter musical features, at the expense of significant computational power for training, hyperparameter tuning and generation. In cases of less available computational power, an alternative to sample-level analysis is analyzing the signal in the frequency domain (as in the original MP3 algorithm and its descendants). A visual representation of the frequency components in a signal, known as a 'spectrogram', can be obtained through a Fourier transformation of the audio, which decomposes the signal into its sinusoidal component frequencies. Here, harmonic series, rhythms and even chord progressions – all common features of popular music – become observable and directly analyzable.

Unsupervised compression of spectrograms Following the trend of using convolutional networks to learn to extract observable features in images, a natural next step would be attempting to apply them to spectrograms in order to extract audio features. In text-to-speech applications, such as Google's end-to-end speech synthesis system Tacotron [63], convolutions are used to extract speech audio features from spectrograms. Other systems such as WaveNet [41] also use spectrograms as supplementary, lower-resolution 'overviews' for long-term dependency conditioning when generating sample-level speech and music, to avoid divergence.

Considering the architecture and learning objective, several representation learning methods for images have proved capable of extracting high-level features and generating variations based on clever constraints on the learned feature space. The variational autoencoder (VAE), a recent neural network architecture for performing feature extraction and image generation, can also be considered a learned compression algorithm, able to extract patterns from its training data without labeled examples. An even more recent extension, the Vector Quantized VAE (VQVAE, [44]), allows for further compression by quantizing the latent space.

Although different types of methods have been used to model similar problems, recent developments in convolutional nets for probability density modeling of images suggest they might be good candidates for modeling the highly non-linear dependencies we would expect between latent audio features and their spectrogram versions.

Research question Based on the assumptions that 1) music often consists of a discrete, comprehensible set of features (e.g. melodies, instruments, chords, timbres, words), and that 2) these features are observable in the spectrogram representation in the same way natural features are observable in natural images, my hypothesis is that digital musical audio consists of discrete features recognizable by a convolutional neural network, such that the audio signal can be compressed and quantized into a discrete latent representation. My central research question is: Is vector quantization a suitable extension to the convolutional autoencoder framework in the context of audio spectrogram compression?

Contributions Through this report, my contributions can be summarized as follows (ordered by section):

• quantitative evaluation of AE and VQVAE spectrogram reconstructions (Sections 5.1.3 and 5.1.2), suggesting that a vector quantization bottleneck can decrease the size of compressed musical audio representations by a factor of 10 compared to a regular AE bottleneck, while retaining a similar quantitative reconstruction error.

• description and implementation of a neural audio compression pipeline, including an exotic, high-quality, gradient-based spectrogram-to-waveform method (Section 3.3)

• reproduction of qualitative results for small-image compression/reconstruction from [44] (Section 5.1.1), including clarifications of architectural choices not mentioned in the original paper (Section 3.4)

Shorthand used:

• (VQ)(V)AE - (Vector Quantized)(Variational) Autoencoder

• SotA: State-of-the-Art

• waveform: digital audio signal, consisting of samples.

• subpixel: single-channel (Red, Green or Blue) values of an image pixel

• posterior collapse: a scenario in training autoencoders where an overly powerful decoder ends up 'ignoring' the latent variables during decoding, described in [7].

• latent vector = quantization vector = embedding vector; the terms are used interchangeably in this work, and describe the learned vectors of neural network weights used for quantization.

• MSE: mean squared error

• convnet: convolutional neural network

• GPU: graphics processing unit. A hardware component optimized for matrix multiplications.


• unsupervised learning algorithm: a learning algorithm which does not require labeled training data.

• receptive field: in neuroscience, the size of the sensory space able to evoke a neural response. In the case of convolutional neural networks, the receptive field is the input effectively covered by a convolutional kernel at a given forward pass, and as such the number of dimensions a single neuron can see "at once".

• L-BFGS: Limited-memory Broyden–Fletcher–Goldfarb–Shanno algorithm. Used as a black-box method for fast approximation of the waveform phase offset for reconstructed audio in this project.

1.1 Motivation

Recommender System Use Case

Large-scale music recommendation systems are often based on collaborative filtering (CF) [51]. CF, in short, recommends content by suggesting songs which users with a similar listening history have already listened to. Thus, the recommendations are based on listening statistics rather than direct digital audio content analysis [11][62].

Compressed spectrograms for content-based music recommender systems CF systems work well in cases where very large amounts of user data are available, such as Spotify's "Discover Weekly" [3]. One problem with CF, however, is that it suffers from the 'cold start' problem of insufficient usage data, meaning that new items (such as new albums) must be consumed before they can be recommended. To tackle this problem, [11] defined a supervised, content-based approach [24, p. 35] where a convolutional neural network ('convnet') was trained to predict the (40-dimensional [24]) CF song embeddings directly from the spectrogram representation of each song.

Although exact public details are scarce due to the company's secrecy policy, Spotify suggests using the approximate nearest-neighbor algorithm "Annoy" for CF [4]. The performance of this algorithm depends on the CF embedding dimensionality. With a catalogue consisting of millions of songs, it is therefore advantageous to devise a representation that is as compact as possible, retaining the inherent features of the data and applying regularization that matches the audio characteristics.


For use cases where companies do not have access to sufficient usage data, a content-based recommender system has been shown to be a viable extension. To base recommendations on a large set of songs, the content of each song must be described as concisely as possible. In such a use case, a vector quantized autoencoder approach has the potential to make the content representation an order of magnitude smaller compared to a regular autoencoder.


Chapter 2

Background

2.1 Variational Autoencoder (VAE)

What is an Autoencoder? In the context of unsupervised learning, a common type of neural network is the autoencoder – a model architecture with equal input and output sizes. Autoencoders have a learning objective to minimize a distance metric between their input and output. This architecture can be further specialized to perform dimensionality reduction by including a lower-dimensional hidden layer (bottleneck). In order to minimize the error between the input data and its reconstructed counterpart, the training objective of this autoencoder effectively causes the model to learn a transformation from the input space to a lower-dimensional manifold and back.
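As a minimal sketch of the idea (not the architecture used in this work, which is convolutional, see Section 3.4), the following Keras snippet builds a fully-connected autoencoder with a low-dimensional bottleneck and trains it to reconstruct its own input; the layer sizes are illustrative assumptions.

import tensorflow as tf

def build_autoencoder(input_dim=784, bottleneck_dim=32):
    inputs = tf.keras.Input(shape=(input_dim,))
    # Encoder: map the input down to the lower-dimensional bottleneck.
    h = tf.keras.layers.Dense(256, activation="relu")(inputs)
    z = tf.keras.layers.Dense(bottleneck_dim, activation="relu", name="bottleneck")(h)
    # Decoder: map the bottleneck code back to input space.
    h = tf.keras.layers.Dense(256, activation="relu")(z)
    outputs = tf.keras.layers.Dense(input_dim, activation="linear")(h)
    model = tf.keras.Model(inputs, outputs)
    # Reconstruction objective: minimize a distance between input and output.
    model.compile(optimizer="adam", loss="mse")
    return model

Training then amounts to model.fit(x, x, ...), i.e. the input is also the target.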

Figure 2.1: Illustration describing the autoencoder neural network architecture

A shortcoming of the general autoencoder definition, particularly in the contexts of representation learning and generative modeling, is the lack of constraints on the latent space. A model trained to convergence could thus potentially have learned an arbitrary latent code, where semantic similarity between inputs is not reflected as small distances between their respective latent representations.

Figure 2.2: Comparison between the regular and the variational autoencoder architectures (left and right respectively). In this particular case, there is a Gaussian prior distribution over the latent variables z. Illustration from https://becominghuman.ai/6d0cfc4eeabd

The Variational Autoencoder (VAE) The VAE is a fairly recent extension to the autoencoder framework proposed by [30][50]. Here, the authors instead formulate a learned approximate Bayesian inference problem where the true data distribution p(x) depends on a lower-dimensional vector of hidden variables z ~ p(z). Their suggested objective is a differentiable lower bound (ELBO) on the true marginal data log-likelihood, which simultaneously minimizes the error between inputs and reconstructions and, through a Kullback-Leibler (KL) regularization term, the divergence between an approximate posterior over latents and the chosen prior over latents p(z). The VAE can be interpreted as an autoencoder where the variational distribution q_φ(z|x), approximating the true posterior distribution p(z|x), is parametrized by a neural network encoder, and the data likelihood distribution p_θ(x|z), modeling the true data likelihood p(x|z), is parametrized by the decoder. Note the naming convention of trainable variational parameters φ and generative parameters θ, emphasizing that, while trained jointly, the two groups of parameters belong to two distinct neural networks (the encoder and the decoder respectively). The VAE framework can be described by a probabilistic graphical model, see Figure 2.3.

Figure 2.3: Variational autoencoder illustrated as a probabilistic graphical model. The observations x are dependent on latent variables z according to the likelihood distribution p_θ(x|z), parametrized by generative parameters θ. The latent variables z are distributed according to the true posterior over latents p(z|x). This distribution is approximated by the approximate posterior distribution q_φ(z|x), parametrized by variational parameters φ, where φ is inferred using approximate Bayesian inference from the observations x and an assumed prior over latent variables p_θ(z) (dashed lines). The square (plate) is standard notation in graphical models, and represents N duplications of the subgraph, where N is the number of observations x. Illustration from [30]

The problem definition of the VAE comes with several benefits, such as the possibility to sample from the learned approximate posterior distribution, "generating" new data from a distribution similar to the true data distribution p(x). In the context of representation learning, a prior could be chosen in such a way that the resulting learned latent representation strongly correlates with abstract semantic data content, such as objects in images.

The VAE, being an approximate inference method, is also less limited than, for example, mean-field variational Bayesian inference, which generally constrains the approximate distribution q(z|x) to be factorizable. It also scales better than, for example, Monte Carlo Expectation-Maximization, since expensive sampling loops are not necessary for training [30].

The following section introduces the Variational Autoencoder (VAE) described in [30], and is divided as follows:

• 2.1.1 contains a statement of the formal problem that the variational autoencoder is designed to solve

• 2.1.2 derives the evidence lower bound (ELBO) as a differentiable objective function

• 2.1.3 shows how minimizing the dissimilarity between the approximate distribution q_φ(z|x) and the true posterior distribution p(z|x) causes the lower bound to approach the true data log-likelihood.

• 2.1.4 describes the reparametrization trick, enabling joint training of encoder and decoder

• 2.1.5 expresses a multivariate Gaussian learning objective (ELBO), estimating the conditional data log-likelihood using Monte Carlo.


Notation

symbol: description
p_θ(x) = p(x | θ): marginal data likelihood given model parameters (weights)
q_φ(z | x) = q(z | x, φ): approximate posterior distribution over latent variables, given data and model parameters
E_{p(z)}[p(x | z)] = ∫ p(x | z) p(z) dz: expected data likelihood with respect to the latent variables, continuous case

2.1.1 Problem statement

The dataset X = {x^{(i)}}_{i=1}^N consists of N i.i.d. samples x^{(i)} ∈ R^D, where the true marginal data distribution p(x) is unknown and assumed to lie in the set of parametrized distributions p_θ(·), θ ∈ R^W. We wish to find the best possible approximation p_{θ*} of p(x) in order to generate (sample) new data. We assume that our samples x^{(i)} come from a random process dependent on a latent random variable z ∈ R^d, d ≪ D, according to the unknown conditional distribution p(x|z), also assumed to lie in the set of parametrized distributions above. For simplicity we assume that z ~ p(z) = N(0, I) and that p(z|x) can be well approximated by q_φ(z|x) = N(μ(x), σ(x)), where φ ∈ R^V.

Our objective L is a lower bound on the marginal data log-likelihood log p_θ(x):

\log p_\theta(x^{(i)}) \geq \mathcal{L}(\theta, \phi; x^{(i)})    (2.1.1)

where

\mathcal{L}(\theta, \phi; x^{(i)}) = -D_{KL}\left[q_\phi(z|x^{(i)}) \,\|\, p(z)\right] + \mathbb{E}_{q_\phi(z|x^{(i)})}\left[\log p_\theta(x^{(i)}|z)\right]    (2.1.2)

is differentiable with respect to the variational parameters φ and the generative parameters θ, enabling joint optimization with, for example, gradient descent. The KL divergence term constrains the approximate posterior to the prior, which in our case is a centered, isotropic Gaussian. The form of this lower bound suggests an encoder-decoder structure, where the encoder q_φ(z|x) and decoder p_θ(x|z) can both be parametrized by neural networks.


2.1.2 Deriving the lower bound (ELBO) objective

The dataset X = {x^{(i)}}_{i=1}^N consists of N i.i.d. samples. Since they are i.i.d., the marginal probability of the dataset can be expressed as the product of the probabilities of each sample:

p_\theta(x^{(1)}, \cdots, x^{(N)}) = \prod_{i=1}^{N} p_\theta(x^{(i)})

It is, however, more numerically convenient to consider the log-probability of the samples. As log(x) increases monotonically with x, we have that argmax_θ f(x) = argmax_θ log f(x), and as such we can rewrite

\log p_\theta(x^{(1)}, \cdots, x^{(N)}) = \sum_{i=1}^{N} \log p_\theta(x^{(i)})

We would like to model the dependencies between our data distribution and a hidden random variable z (NB: from here I drop the (i)-index of x^{(i)} for readability):

\log p_\theta(x) = \log \int p_\theta(x, z)\, dz

To avoid simplifying assumptions that constrain our modeling capacity or algorithm efficiency (see Section 2.1.1), we approach the problem through the variational inference framework by introducing the approximate posterior distribution q_φ(z|x):

\log p_\theta(x) = \log \int \frac{q_\phi(z|x)}{q_\phi(z|x)}\, p_\theta(x, z)\, dz = \log \mathbb{E}_{q_\phi(z|x)}\left[\frac{p_\theta(x, z)}{q_\phi(z|x)}\right]

According to Jensen's inequality, for a concave function f the following holds:

f(\mathbb{E}[X]) \geq \mathbb{E}[f(X)]

and since the log function is concave, moving it into the expectation gives

\log \mathbb{E}_{q_\phi(z|x)}\left[\frac{p_\theta(x, z)}{q_\phi(z|x)}\right] \geq \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p_\theta(x, z)}{q_\phi(z|x)}\right] = \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(z, x) - \log q_\phi(z|x)\right] \triangleq \mathcal{L}(\theta, \phi; x)


which we call the (variational) lower bound (ELBO) of the data log-likelihood (evidence) log p_θ(x). The same lower bound, expressed in the following way, exposes an autoencoder architecture (NB: the index (i) of x^{(i)} is again dropped for readability):

\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(z, x) - \log q_\phi(z|x)\right]
= \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z) + \log p_\theta(z) - \log q_\phi(z|x)\right]
= \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(z) - \log q_\phi(z|x)\right] + \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right]
= -D_{KL}\big[\underbrace{q_\phi(z|x)}_{\text{encoder}} \,\|\, \underbrace{p_\theta(z)}_{\text{prior}}\big] + \mathbb{E}_{q_\phi(z|x)}\big[\log \underbrace{p_\theta(x|z)}_{\text{decoder}}\big]

which is the Variational Autoencoder objective introduced in Eq. 2.1.2. From the information-theoretic perspective, we can recognize q_φ(z|x) as an encoder that produces a distribution over the latent variable z given input data x, and p_θ(x|z) as a decoder that produces a distribution over possible reconstructions given a latent representation. This objective enables joint optimization of the lower bound L(θ, φ; x) with respect to the encoder parameters φ and decoder parameters θ.

2.1.3 Discrepancy between log p(x) and L

We can also show how minimizing the KL divergence between the approximate and true posterior effectively maximizes the data log-likelihood:

D_{KL}\left[q_\phi(z|x) \,\|\, p_\theta(z|x)\right] = \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{q_\phi(z|x)}{p_\theta(z|x)}\right]
= \mathbb{E}_{q_\phi(z|x)}\left[\log q_\phi(z|x) - \log p_\theta(z|x)\right]

by Bayes' rule and the product rule of probability

= \mathbb{E}_{q_\phi(z|x)}\left[\log q_\phi(z|x) - \log \frac{p_\theta(z, x)}{p_\theta(x)}\right]

log p_θ(x) can be moved out of the expectation since it does not depend on z

= \mathbb{E}_{q_\phi(z|x)}\left[\log q_\phi(z|x) - \log p_\theta(z, x)\right] + \log p_\theta(x)
= -\mathcal{L}(\theta, \phi; x) + \log p_\theta(x)


Using Gibbs' inequality, it can be shown that the KL divergence is non-negative (see [33], Section 2.6), and it follows that the lower bound equals the true marginal log-likelihood log p_θ(x) iff the KL divergence is zero. This result also shows that maximizing the ELBO is the same as minimizing the KL divergence, as mentioned in [6].

2.1.4 Differentiable L through reparametrization trick

Figure 2.4: Illustration (from [12]) of the reparametrization trick in the VAE. Left: Without reparametrization, the non-differentiable sampling of z (in red) inhibits training the decoder and encoder jointly. Right: VAE with the reparametrized z̃ = μ(x) + ε ⊙ σ(x), enabling joint optimization of encoder and decoder. Intuitively, the encoder learns to produce distribution parameters (such as μ and σ in the Gaussian case) that are most likely to result in a correct reconstruction of the input.

Assuming that both encoder and decoder are neural networks, we want to optimize the lower bound L by training them using gradient-based learning (SGD). This means L must be made differentiable with respect to the parameters θ and φ.

The major challenge in constructing such a differentiable L, however, is that the sampling operation (see Figure 2.4, Left) is non-differentiable. One method of circumventing this problem, suggested by [30], is to reparametrize


z using a deterministic transformation g_φ(x, ε):

\tilde{z} = g_\phi(x, \epsilon) \quad \text{where} \quad \epsilon \sim p(\epsilon)

according to the independent noise variable ε. A consequence of this 'reparametrization trick' is illustrated in Figure 2.4 (Right) for the special case where q_φ(z|x) is assumed to be Gaussian distributed:

\tilde{z} = g_\phi(x, \epsilon) = \mu(x) + \epsilon \odot \sigma(x)

where ⊙ indicates the element-wise product.
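The reparametrization trick is straightforward to express in code. The sketch below assumes the encoder outputs a mean mu and a log-variance log_var per latent dimension (a common parametrization; the variable names are illustrative):

import tensorflow as tf

def reparametrize(mu, log_var):
    # Sample the noise variable eps ~ N(0, I), independent of the parameters,
    # so gradients can flow through mu and log_var into the encoder.
    eps = tf.random.normal(shape=tf.shape(mu))
    sigma = tf.exp(0.5 * log_var)
    # z_tilde = mu(x) + eps * sigma(x)  (element-wise product)
    return mu + eps * sigma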

2.1.5 Multivariate Gaussian VAE Objective

Assuming that both p(z) and q_φ(z|x) are multivariate Gaussians enables us to express the KL divergence term of Eq. 2.1.2 in closed form [30, Appendix B] as

-D_{KL}\left[q_\phi(z|x^{(i)}) \,\|\, p_\theta(z)\right] = \frac{1}{2} \sum_{j=1}^{J} \left(1 + \log\big((\sigma_j^{(i)})^2\big) - (\mu_j^{(i)})^2 - (\sigma_j^{(i)})^2\right)

The reconstruction error (the second term of Eq. 2.1.2) can be estimated through sampling using Monte Carlo estimation, which approximates the expected value of a function f(z) of a random variable z by averaging over samples from the distribution p(z):

\mathbb{E}_{p(z)}[f(z)] \simeq \frac{1}{L} \sum_{l=1}^{L} f(z^{(l)})

where L is the number of samples. For our reconstruction error term,

\mathbb{E}_{q_\phi(z|x^{(i)})}\left[\log p_\theta(x^{(i)}|z)\right] \simeq \frac{1}{L} \sum_{l=1}^{L} \log p_\theta(x^{(i)}|z^{(i,l)}) \quad \text{where} \quad z^{(i,l)} = g_\phi(\epsilon^{(i,l)}, x^{(i)}), \; \epsilon^{(l)} \sim p(\epsilon)    (2.1.3)

A multivariate Gaussian ELBO objective can thus be expressed as

\hat{\mathcal{L}} = -D_{KL}\left[q_\phi(z|x^{(i)}) \,\|\, p_\theta(z)\right] + \mathbb{E}_{q_\phi(z|x^{(i)})}\left[\log p_\theta(x^{(i)}|z)\right]
= \frac{1}{2} \sum_{j=1}^{J} \left(1 + \log\big((\sigma_j^{(i)})^2\big) - (\mu_j^{(i)})^2 - (\sigma_j^{(i)})^2\right) + \frac{1}{L} \sum_{l=1}^{L} \log p_\theta(x^{(i)}|z^{(i,l)})    (2.1.4)

where [30] suggests L = 1 for training, given a large batch size.
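A minimal sketch of this objective, assuming a Gaussian decoder with fixed variance so that the log-likelihood term reduces (up to constants) to a negative squared error, a single Monte Carlo sample (L = 1), and inputs flattened to shape (batch, dim); names are illustrative:

import tensorflow as tf

def negative_elbo(x, x_hat, mu, log_var):
    # Reconstruction term: one-sample Monte Carlo estimate of E_q[log p(x|z)].
    recon = tf.reduce_sum(tf.square(x - x_hat), axis=-1)
    # Closed-form KL divergence between N(mu, sigma^2) and the N(0, I) prior.
    kl = -0.5 * tf.reduce_sum(1.0 + log_var - tf.square(mu) - tf.exp(log_var), axis=-1)
    # Minimizing this quantity maximizes the lower bound.
    return tf.reduce_mean(recon + kl)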


2.2 Vector Quantized Variational Autoencoder (VQVAE)

The most commonly used variational autoencoder architectures today assume Gaussian distributed latent variables, and thus a continuous latent space, making the model a powerful function approximator. However, when the input data contains discrete features, such as objects in images or phonemes in speech recordings, continuous latent variable models could end up modeling other 'features' that are conceptually difficult for a human to interpret.

To extract such high-level, 'conceptual' features from the input data, some unsupervised representation learning methods instead impose a discrete prior distribution over the latents [20]. When such latents are combined with decoders with a limited receptive field, the 'local' decoder becomes dependent on the latent variable capacity for a broader 'overview' during decoding. Given a powerful enough decoder, this effectively forces the latent variables to model global features, often similar to those perceived by humans [7].

A recent work [44] explores how vector quantization can be used as a method of discretizing the variational autoencoder latent space (see Figure 2.5). The authors show that their proposed model, the Vector Quantized Variational AutoEncoder (VQVAE), learned to encode human speech into discrete, phoneme-like features, separating them from both speech prosody and low-level speech signal information. The VQVAE imposed a categorical prior over latents (vector quantization) combined with a local decoder that, although powerful, was forced to depend on the latent variables for global features. The authors concluded that the latent speech representations were strongly correlated with the human-annotated phonemes of the speech recordings.

The goal behind the VQVAE architecture was to achieve both effective representation learning of conceptual features and high-quality reconstructions, with performance (lower bounds) on par with continuous models. The authors tested the model on reconstructing natural images, speech and video, analyzing the quality of the reconstructions and attempting generation of new data by training generative models on the latent representations of the inputs.

This chapter is a summary of the VQVAE as presented in [44].


Figure 2.5: Illustration of the VQVAE, as presented in [44]. Left of dashed line: The Vector Quantized Variational Autoencoder. Here, the encoder output z_e(x) ∈ R^{4×4×D}, consisting of D feature maps, can be interpreted as 16 separate vectors in R^D (green), matching the length of the embedding vectors e_i (purple, above). Nearest-neighbor vector quantization then entails replacing each of the 16 vectors by the embedding vector e_j closest to it (according to the Euclidean distance). The resulting decoder input z_q(x) (purple) consists solely of the embedding vectors that were found to be nearest neighbors. Note that the nearest-neighbor indexing represents the bottleneck of the model, requiring only 4 × 4 × log2(K) bits, where K is the number of available embedding vectors. Right: Illustration of an example latent space R^{D=2}, mapped to by the VQVAE encoder. The encoder output vector z_e(x) (green) is replaced by its nearest-neighbor embedding vector e_2 during quantization, and will be moved closer to e_2, in the direction of the negative gradient -∇_z L (illustrated by the red arrow).


Notation

symbol: description (dim)
x: input
z_e: feature maps after the last convolutional layer of the encoder neural net (height × width × D)
z: encoder output; categorically distributed according to q_φ(z|x); the depth dimension indexes the nearest embedding (implemented as 1-of-K) (height × width × 1)
z_q: input to the decoder; equals the nearest-neighbor embedding e_k to z_e (height × width × D)
x̃: decoder output (reconstruction); distributed according to p_θ(x̃|z_q(x))
K: number of learned quantization vectors (embeddings)
D: length of each embedding vector; depth dimension of the latent representation
β: regularization weight for L_commit
|| · ||_2: Euclidean (ℓ2) distance
MSE: mean squared error

Note: In accordance with [44], I use a single random variable z to represent the latent variables, to simplify notation.

Discretizing the latent space

Figure 2.6: Alternative illustration of the vector quantized variational autoencoder architecture, from [54]. The 'Nearest Neighbors' step replaces the 'reparametrization' step of the VAE.

To enforce a discretization constraint on the latent space, [44] replaced the reparametrization/sampling operations of the original VAE with a vector quantization step. Each feature of the feature space z_e (the output of the last layer of the encoder neural network) is "quantized" by replacing it with the closest embedding in a set of K latent embedding vectors e ∈ R^{K×D}, each of length D. The closest embedding e_k is chosen using the nearest-neighbor algorithm such that

q(z = k \mid x) = \begin{cases} 1 & \text{for } k = \text{argmin}_j \|z_e - e_j\|_2 \\ 0 & \text{otherwise} \end{cases}    (2.2.1)

where each dimension z_i ∈ z is a categorically distributed (discrete) random variable indexing the dictionary e; thus z contains one index per feature of z_e. To avoid confusion, note that dim(z_e) ≠ dim(z), as z_e signifies the intermediate feature-map output of the last convolutional layer of the encoder before the quantization step. After an embedding is selected for each latent dimension in z_e, the indices (z) are used to extract the corresponding embeddings from the set e,

z_q = e_k, \quad \text{where} \quad k = \text{argmin}_j \|z_e - e_j\|_2    (2.2.2)

resulting in a collection of quantized features z_q to be sent to the decoder. Note that the term "quantization" implies that dim(z_e) = dim(z_q). We can see how this fits into the VAE notation of Section 2.1 by noting that the decoder of Algorithm 1 would be expressed as p(x̃|z_q(x)) in the notation of Eq. 2.1.4.
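A possible vectorized implementation of the nearest-neighbor lookup of Eqs. 2.2.1 and 2.2.2 is sketched below, assuming z_e has shape (batch, height, width, D) and a codebook e of shape (K, D); this is an illustration, not the reference implementation of [44]:

import tensorflow as tf

def quantize(z_e, e):
    # Flatten the encoder output to one D-dimensional vector per spatial position.
    flat = tf.reshape(z_e, [-1, tf.shape(e)[1]])                    # (batch*h*w, D)
    # Squared euclidean distance between every encoder vector and every embedding.
    dists = (tf.reduce_sum(flat ** 2, axis=1, keepdims=True)
             - 2.0 * tf.matmul(flat, e, transpose_b=True)
             + tf.reduce_sum(e ** 2, axis=1))                        # (batch*h*w, K)
    k = tf.argmin(dists, axis=1)                                     # indices z, one per feature
    z_q = tf.nn.embedding_lookup(e, k)                               # replace each vector by e_k
    return tf.reshape(z_q, tf.shape(z_e)), tf.reshape(k, tf.shape(z_e)[:-1])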

2.2.1 The VQVAE Objective

After replacing the reparametrization/sampling step of the common VAE with vector quantization, backpropagating through the model is no longer straightforward. Since the nearest_neighbor operation does not have a defined gradient, the gradient signal is obstructed at the decoder input z_q, thus inhibiting training of both the encoder and the embeddings. To train the model


end-to-end using backpropagation, the gradient at z_e (at the last layer of the encoder) must be approximated. The authors of [44] suggest that, since z_e and z_q share the same D-dimensional feature space, the gradient at z_q should contain useful information on how z_e should be updated to minimize L_recon. Therefore, they approximate the gradient at z_e through a so-called straight-through gradient estimation, copying the gradient ∇_{z_q}L directly to the encoder output z_e and bypassing the non-differentiable quantization step. This operation is expressed as

z = z_e - \text{stop\_gradient}(z_e - z_q)    (2.2.3)

in Algorithm 1, where stop_gradient is an identity operation during the forward pass and zero during the backward pass. Here, z = z_q is defined by an unusual rewrite of z_q in terms of z_e and the quantization error z_e - z_q, where the latter term does not receive gradient updates from L_recon. The embeddings are thus learned separately, through minimizing the quantization-error sum L_embed + L_commit. This sum is split into two terms to enable regularization of the encoder-feature term through the parameter β, which balances the training of the embeddings e and the encoder features z_e, preventing the encoder from training too fast, which would cause the embeddings to grow arbitrarily. The resulting VQVAE training objective is the sum of all error terms:

L = \log p(x|z_q(x)) + \|\text{sg}[z_e(x)] - e\|_2^2 + \beta \|z_e(x) - \text{sg}[e]\|_2^2    (2.2.4)

Refer to Algorithm 1 for further details on implementation.

Note on the VAE KL divergence term

Another variation on the VAE introduced through the VQVAE is the possibility to learn a prior over the latent space. The objective for the common VAE is the ELBO, consisting of a reconstruction term and a KL divergence between the approximate posterior and a chosen prior distribution (Eq. 2.1.4). For the VQVAE, [44] assume a uniform categorical prior p(z), such that p(z_i) = 1/K ∀ i, where K is the number of embeddings. Since the prior is static during training, and the posterior q(z|x) is deterministic 1-of-K, the KL divergence

D_{KL}\left[q(z|x) \,\|\, p(z)\right] = \sum_i q(z_i|x) \log \frac{q(z_i|x)}{p(z_i)} = 1 \cdot \log \frac{1}{1/K} = \log K    (2.2.5)

is constant and can be disregarded during training.


Algorithm 1 VQVAE training algorithm
Require: hyperparameters num_embeddings, num_latent_dim
Initialize decoder weights θ
Initialize embeddings e
Initialize encoder weights φ

for x : 1, ..., N do
    z_e = encoder(x)
    z_q = nearest_neighbor(z_e, e)                ▷ See Section 2.2
    z = z_e - stop_gradient(z_e - z_q)            ▷ Straight-through estimator, see Section 2.2.1
    x̃ = decoder(z)

    L_recon = MSE(x̃, x)
    L_embed = ||z_q - stop_gradient(z_e)||_2^2
    L_commit = β ||z_e - stop_gradient(z_q)||_2^2
    L = L_recon + L_embed + L_commit

    update θ
    update e
    update φ
end for
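The following TensorFlow sketch mirrors the loss terms of Algorithm 1, including the straight-through estimator of Eq. 2.2.3. It reuses the quantize function from the earlier sketch and treats encoder, decoder and the commitment weight beta as given; these names are illustrative assumptions, not the reference code of [44].

import tensorflow as tf

def vqvae_losses(x, encoder, decoder, e, beta=0.25):
    z_e = encoder(x)
    z_q, _ = quantize(z_e, e)
    # Straight-through: forward pass uses z_q, backward pass copies gradients to z_e.
    z = z_e + tf.stop_gradient(z_q - z_e)
    x_hat = decoder(z)

    recon = tf.reduce_mean(tf.square(x_hat - x))                            # L_recon
    embed = tf.reduce_mean(tf.square(z_q - tf.stop_gradient(z_e)))          # L_embed
    commit = beta * tf.reduce_mean(tf.square(z_e - tf.stop_gradient(z_q)))  # L_commit
    return recon + embed + commit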


Speeding up VQVAE convergence

In the appendix of [44], the authors suggest replacing the latent embedding vector update term of the VQVAE objective (Eq. 2.2.4),

L_{embed} = \|\text{sg}[z_e(x)] - e\|_2^2    (2.2.6)

with an exponential moving average update to speed up model convergence. The new update,

e_i^{(t)} = \frac{m_i^{(t)}}{N_i^{(t)}}    (2.2.7)

is an exponential moving average over minibatches, where

N_i^{(t)} = \gamma N_i^{(t-1)} + (1 - \gamma)\, n_i^{(t)}    (2.2.8)

m_i^{(t)} = \gamma m_i^{(t-1)} + (1 - \gamma) \sum_{j}^{n_i^{(t)}} z_{i,j}^{(t)}    (2.2.9)

Here n_i^{(t)} is the number of vectors in the encoder output z_e^{(t)} that are quantized to embedding e_i (out of the product of the first two dimensions of the latent representation, such as 8 × 8 in the case of the natural images described in Section 5.1.1), and γ ∈ [0, 1] is the moving-average decay parameter.
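A sketch of this moving-average codebook update, assuming one-hot nearest-neighbor assignments of shape (num_vectors, K) and flattened encoder outputs of shape (num_vectors, D); the names and the small division guard are illustrative:

import tensorflow as tf

def ema_update(e, N_ema, m_ema, flat_z_e, assignments, gamma=0.99):
    # n_i: number of encoder vectors assigned to each embedding in this minibatch.
    n = tf.reduce_sum(assignments, axis=0)                             # (K,)
    # Sum of the encoder vectors assigned to each embedding.
    summed = tf.matmul(assignments, flat_z_e, transpose_a=True)        # (K, D)
    N_ema = gamma * N_ema + (1.0 - gamma) * n                          # Eq. 2.2.8
    m_ema = gamma * m_ema + (1.0 - gamma) * summed                     # Eq. 2.2.9
    e_new = m_ema / tf.maximum(N_ema[:, None], 1e-5)                   # Eq. 2.2.7
    return e_new, N_ema, m_ema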

2.3 Spectrograms

Recent works [15][14][13][55], among others, have shown that spectrogram representations are still highly relevant as parts of competitive systems for the analysis and generation of music and speech.

The method I have chosen for transforming audio waveforms to log-power spectrogram representations (see Section 2.3.2) has led to promising results in other works [56][8]. I suspect that my results could perhaps be improved by applying music-specific methods (shown to improve results in [55]), such as Constant-Q Transform (CQT) spectrograms and the 'Rainbowgram' coloring scheme [14], or traditional mel-spectrograms [9] (although these are not necessarily better [8]), which I leave for future work.

2.3.1 The Fourier Transform

The Discrete Fourier Transform (DFT) [39, p. 49] is a discrete transformation used in music signal processing to obtain a frequency-domain representation


of an audio waveform. The frequency spectrum describes the waveform by its degree of correlation with sinusoidal components, where frequencies correlating more strongly with the input waveform receive a higher amplitude in the spectrum. This is sometimes used to uncover key frequency content, such as base pitch and overtones, e.g. in pitch detection systems.

In practice, the DFT [39, equation 2.24]

X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-i 2\pi k n / N}    (2.3.1)

where N is the number of samples in the input signal, computes the cross-correlation between the input signal x(n) and the complex sinusoid component e^{-i 2\pi k n / N} for a given frequency bin k. The sinusoidal components become clearer through the rewrite

e^{-i 2\pi k n / N} = \cos(2\pi k n / N) - i \sin(2\pi k n / N)    (2.3.2)

using Euler's formula e^{-i\theta} = \cos(\theta) - i \sin(\theta). The result X(k) is a complex-valued amplitude spectrum, where each dimension represents the correlation between the input signal and a sinusoidal component with index k. It is common to restrict k to the range [0 : K = N/2 + 1], since X(k) is periodic around k = N and mirrors around k = N/2 (the Nyquist frequency, see Section 2.3.3). It is also common to use Fast Fourier Transform (FFT) algorithms to minimize computation time.
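As an illustration of Eq. 2.3.1, the naive DFT can be written as a direct cross-correlation with complex sinusoids and checked against an FFT routine; the snippet below is purely illustrative.

import numpy as np

def dft(x):
    N = len(x)
    n = np.arange(N)
    k = n.reshape(-1, 1)
    # X(k) = sum_n x(n) * exp(-i 2 pi k n / N)
    return np.exp(-2j * np.pi * k * n / N) @ x

x = np.random.randn(256)
X = dft(x)
# Only the first N/2 + 1 bins are unique (up to the Nyquist frequency).
assert np.allclose(X[:129], np.fft.rfft(x))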

2.3.2 Short-Time Fourier Transform (STFT)

A weakness of using the DFT directly is that it computes the sinusoidal component correlations X(k) across the whole input waveform, assuming that the waveform is time-invariant. So, to account for changes occurring over time, a common method known as the discrete Short-Time Fourier Transform (STFT) [39, p. 53] can be used. The discrete STFT extracts successive frames from the waveform using a window function w, and performs the Fourier transform on each frame separately. The discrete STFT can be described as [39, Eq. 2.26]

X(m, k) = \sum_{n=0}^{N-1} x(n + mH)\, w(n)\, e^{-i 2\pi k n / N}    (2.3.3)

where H is the hop size, m is the time-frame index and X(m, k) is the kth complex Fourier coefficient for the mth time frame. As such, there are three relevant


parameters to fully describe the STFT:

• N – frame length (window length); balances the trade-off between temporal and frequency resolution, further discussed in Section 2.3.3

• H – hop size; determines the overlap between frames. A smaller hop size increases computation time and spectrogram size, but decreases distortion.

• w(n) – window function; the choice is usually based on minimizing artifacts for the specific application.

As such, the complex-valued amplitude spectrogram X(m, k) is effectively the concatenation of m different DFT amplitude spectra X(k) along the time axis. From this, a real-valued power spectrogram P(m, k) can be computed as the squared magnitude

P(m, k) = |X(m, k)|^2    (2.3.4)

The power spectrogram representation is common because it is real-valued and interpretable: one can directly calculate which frequencies (in Hz) are present in the sound (Figure 2.7).

Figure 2.7: Example of waveform and spectrogram representations of a sample from the Nsynth dataset

Converting to power spectrograms is a lossy conversion, however, since the phase information of the spectrogram is lost in the conversion between complex and real values. This is necessary to avoid handling complex output from my preprocessing pipeline, but it also requires approximating an inverse to the STFT when estimating a waveform from a reconstructed spectrogram; see Section 3.5.

2.3.3 Limitations of STFT resolution

STFT spectrograms have a fixed resolution which depends on the window size. Shorter windows give better temporal resolution but worse frequency resolution. This trade-off is a consequence of the previously mentioned time-invariance assumption of the DFT. We can see this by analyzing Eq. 2.3.1, which, although computable for all k, is N-periodic and thus starts repeating for k > N. Also, X(k), k ∈ [N/2 + 1 : N] is a mirror of X(k), k ∈ [0 : N/2 + 1], and thus shouldn't be re-computed. By converting between frequency bin index k and physical frequency f according to [39, equation 2.28]

f = \frac{k \cdot f_s}{N}    (2.3.5)

we recognize that k = N/2 corresponds to f = f_s/2, the Nyquist frequency. Any components above it are either mirrored or periodic replicas, so only a total of N/2 coefficients (bins) are unique and need to be computed.


Chapter 3

Method

In the following chapter, I describe the experimental method behind training autoencoder neural network models to compress and reconstruct both natural images and digital audio spectrogram representations. The image and audio datasets are described in Section 3.2. The preprocessing of natural images consisted solely of scaling subpixel values to the range [0, 1]. For audio, the processing pipelines were more involved, consisting of a waveform-to-spectrogram preprocessing pipeline described in Section 3.3, and a spectrogram-to-waveform phase approximation postprocessing step described in Section 3.5.

Implementation The autoencoder architectures (Section 3.4), as well as the spectrogram pre- and post-processing pipelines, were implemented entirely in TensorFlow, a free, open-source software library developed by Google for training large-scale neural networks, available through its Python API [1]. The core idea behind the library is to enable the design and compilation of neural networks at a high level of abstraction, while training details such as gradient computation (automatic differentiation) and hardware interaction (with the GPU) are handled in the background. The key hardware components used for this project were Tesla K80 and P100 GPUs available on Google Cloud, which was provided by Peltarion.


3.1 Metrics

Mean squared error In this project, the main metric for training the models is the mean squared error:

MSE = \frac{1}{N} \sum_{i=1}^{N} (x_i - \hat{x}_i)^2    (3.1.1)

where N is the size of the dataset. For training, I average over all dimensions (subpixels and batches), while during evaluation I report a metric summed over pixels and averaged over batches (per-image MSE).
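In code, the two variants can be expressed as follows, assuming image batches of shape (batch, height, width, channels); the function names are illustrative:

import tensorflow as tf

def training_mse(x, x_hat):
    # Training loss: mean over every subpixel in the batch.
    return tf.reduce_mean(tf.square(x - x_hat))

def per_image_mse(x, x_hat):
    # Evaluation metric: sum over height, width and channels, then average over the batch.
    per_image = tf.reduce_sum(tf.square(x - x_hat), axis=[1, 2, 3])
    return tf.reduce_mean(per_image)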

Perplexity To evaluate the degree of utilization of the latent vectors of the VQVAE, the perplexity metric, common in language modeling, can be used [5]. Rewritten using the VQVAE notation,

PP = \exp\left\{ -\frac{1}{M} \sum_{i=1}^{M} \log q(z_i|x) \right\}    (3.1.2)

where M is the dimensionality of the latent space and q(z_i|x) is the posterior categorical distribution over the latent variables z_i. A larger perplexity score indicates that a larger part of the model capacity (latent vectors) was used to encode the data – a sign that the model avoids posterior collapse.

3.2 Datasets

CIFAR10 [31] consists of 60,000 tiny color images of size (32, 32, 3) pixels, divided into 50,000 training and 10,000 test images. I further separate the training set into 40,000 images for training and 10,000 for validation.

The Nsynth [14] dataset consists of more than 300,000 separate sound clips of various instruments playing a single pitch, where pitch, timbre and envelope are varied. Each clip is four seconds long, where the note attack/sustain covers the first three seconds and the release the last. The clips are monophonic (single-channel), with a sample rate of 16,000 Hz and a bit depth of 16 bits/sample. The clips are annotated with <instrument_family> <instrument_number> <pitch> <velocity> (see [14, Appendix C] for more info).

The following describes my train/validation split: I listed all sound clips of the instrument family keyboard_acoustic from the original Nsynth training set, and extracted 1500 clips at random. I then sampled 12 pitches uniformly from the total pitch range of the dataset (standard MIDI piano range [21 : 108], corresponding to musical pitches A0-C8) to form a validation set:

MIDI pitch:   27   52   56   63   69   71   73   74   91   96   98   104
Musical note: D#1  E3   G#3  D#4  A4   B4   C#5  D5   G6   C7   D7   G#7

while the rest of the pitch range is used for training. The resulting train/validation split is about 80/20, corresponding to 1193/307 sound clips.

3.3 Spectrogram Pipeline

A flow chart describing the spectrogram input pipeline can be found in Appendix A.

The original Nsynth dataset consists of digital audio in the WAVE digital audio format (.wav file extension), sampled at fs = 16000 Hz with a bit depth of 16 bits/sample. In order to be processed by my models, the sound clips must be transformed into spectrograms. The audio clips are initially passed through a .wav decoder to be converted to a list of samples, which are divided into 4.152-second chunks (66,432 samples), adding zero-padding when necessary. After removing silent chunks (containing no sound above an amplitude threshold), the chunks are transformed into spectrograms using the STFT. The STFT specifications are

frame_length = 1024
frame_step = 128
fft_length (NFFT) = 1024
window_fn = hann_window

resulting in 1024/16000 s = 64 ms long STFT frames and a spectrogram of shape (frames, num_unique_fft_bins), where

frames = (samples - frame_length) / frame_step + 1 = 512
num_unique_fft_bins = fft_length // 2 + 1 = 513

Here, the division by 2 comes from omitting non-unique (redundant) frequency bins k > N/2 (explained in Section 2.3.3). I then compute the power spectrogram (Eq. 2.3.4), convert to decibels (log), perform min-max normalization, and finally remove the 513th frequency bin, achieving a normalized (decibel) log-power spectrogram of shape (512, 512); see the example in Figure 2.7.

From Eq. 2.3.5, we can read the frequency resolution of our spectrogram as f_s/N = 16000 Hz / 1024 ≈ 15.6 Hz. Using the formula for computing the time position of a window [39, equation 2.27]

t(m) = \frac{m \cdot H}{f_s}    (3.3.1)

we can compute the time resolution as H/f_s = 128/16000 = 8 ms.
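A sketch of the waveform-to-spectrogram step using tf.signal with the STFT parameters listed above; the normalization is simplified compared to the full pipeline in Appendix A (e.g. silent-chunk removal is omitted):

import tensorflow as tf

def waveform_to_logpower(waveform, eps=1e-10):
    stft = tf.signal.stft(waveform, frame_length=1024, frame_step=128,
                          fft_length=1024, window_fn=tf.signal.hann_window)
    power = tf.abs(stft) ** 2                                         # P(m, k) = |X(m, k)|^2
    log_power = 10.0 * tf.math.log(power + eps) / tf.math.log(10.0)   # convert to decibels
    # Min-max normalize to [0, 1] and drop the 513th frequency bin -> (frames, 512)
    log_power = (log_power - tf.reduce_min(log_power)) / (
        tf.reduce_max(log_power) - tf.reduce_min(log_power))
    return log_power[..., :512]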

3.4 Autoencoder Architecture

Important note: my baseline AE is non-variational. Although I describe the VQVAE in the VAE context by comparing it to a VAE in earlier sections, I ended up selecting a traditional, non-variational AE as the baseline for this work, similar to the "Spectral Autoencoder" baseline in [14, Section 2.2]. The reason for skipping variational models entirely here was that I encountered unexpected challenges when training models with large latent sizes using the variational ELBO objective. I describe these challenges further in Section 6.2.

Encoder and decoder The convolutional autoencoder architecture I used as a base for these experiments was adopted from [44], and a pseudocode description of the architecture can be found in Appendix B. In short, the encoder consists of 2-strided convolutional layers with 4x4 kernels, followed by two residual blocks, each consisting of two convolutional layers with 3x3 kernels. The decoder consists of a 3x3 convolutional layer, two residual blocks and two 2-strided transposed convolutional layers with kernel size 4x4. All layers except the encoder and decoder output layers have ReLU activations and 256 filters. The output layers have linear activations.

To reduce the number of feature maps to D after the encoder, I use a pointwise (1x1) convolutional layer with D outputs, a common dimensionality reduction method popularized by [58, Sec. 2] and others. As for the AE, the output z_q of the VQVAE quantization step has the same shape as z_e.

Before the decoder input, another convolutional layer is needed to increase the number of feature maps before the first decoder residual layer, such that the input size matches the 256-feature-map output size. I use a 3x3 convolutional layer with 256 filters here. These extensions were not described in [44] but are suggested in [61]. Possible shortcomings are discussed in Section 6.1.

Quantization bottleneck For the VQVAE, the following describes the quantization step: the encoder produces an output z_e with shape (h, w, D). In the quantization step, each depth-dimension vector z_e^{(i)} with shape (1, 1, D) is replaced by a quantization vector e_k, also with shape (1, 1, D), which is the closest of the K learned quantization vectors. Here, the nearest-neighbor (integer) indexing from z_e^{(i)} to e_k constitutes the compression bottleneck of the VQVAE: each index can be represented by a log2(K)-bit integer, because the learned quantization vectors are part of the model itself.

Compression ratio calculation example Assume a VQVAE with K = 256 available quantization vectors, input shape (32, 32, 3) and a latent space shape of (8, 8, D). The resulting latent size is 8 · 8 · log2(K = 256) bits. If the input data consists of RGB color images requiring 32 · 32 · 3 · log2(256) bits, the compression ratio of this model is

\frac{32 \cdot 32 \cdot 3 \cdot \log_2(256)}{8 \cdot 8 \cdot \log_2(K = 256)} = 48.

Note on downsampling using strided convolutions Unlike common neural network autoencoders, the architecture I use is fully convolutional, and thus the choice of latent representation size is more constrained than in a common autoencoder. Consequently, the (height, width) dimensions of the latent representation are directly dependent on the strides of the convolutional layers, and thus constrained to be factors of 2 smaller than the input size in my case. Common encoder architectures with a fully-connected layer at the end can instead be configured to any natural-number output size, but also contain more hyperparameters and train more slowly. Given the above VQVAE model architecture, the size of the latent representation can be varied through the parameters D and K.

3.5 Phase Approximation from Real-valued Spectrograms

In order to avoid handling complex values in the neural network (a topic under very active research), it is common to transform the waveform (time domain) into log-power spectrograms (frequency domain). However, this is a lossy transformation, where the complex-valued phase information of the frequency domain is lost. For this reason, the transformation doesn't have an inverse, and in reconstructing the waveform, the phase information needs to be approximated.

Methods for approximating phase The literature describes several methods for estimating missing phase information when reconstructing a waveform from a real-valued spectrogram. I describe two common methods and the method I used below.

A simple method is to save the input (ground-truth) phase and append it to the spectrogram before reconstruction. This has been shown to work well for the reconstruction of music [53][37], but does not work for generation, since no known phase exists.

The Griffin-Lim algorithm (GL) [21] is a common method for approximating the phase of a real-valued spectrogram, used in recent SotA works such as Tacotron 2 [56] and DeepVoice 3 [45]. GL starts by appending a randomly initialized phase to the spectrogram and computing the inverse STFT, producing a waveform. It then computes the STFT of this new waveform, resulting in a new complex-valued spectrogram. From this new spectrogram, it extracts the phase and appends it to the real-valued reconstruction spectrogram again. This (inverse STFT → STFT) loop continues until convergence on a stable phase estimate.

The lesser-known method selected for this work was introduced by Decorsière et al. [10], see Figure 3.1. It is a gradient-based estimation method, and thus requires the waveform → spectrogram transformation (pipeline) to be differentiable. The method works by first randomly initializing variables representing a waveform of the same length as the waveform to be reconstructed. Then, it computes the spectrogram of this "noise" waveform, and measures the error (e.g. MSE) between the noise spectrogram and the input spectrogram. Using an optimization algorithm (Decorsière et al. [10] use L-BFGS), the noise waveform variables are updated using backpropagation to minimize the error between the two spectrograms. After convergence, the result is a waveform whose spectrogram representation is similar to the desired spectrogram reconstruction. The authors of [10] report a lower spectral convergence (distance between target and reconstructed signals in the time-frequency (STFT-magnitude) domain) for this method, but also note that it is slower than GL and therefore better suited for offline use. The gradient-based method was ultimately selected to evaluate vector quantized audio reconstructions qualitatively, thus prioritizing lower distortion over phase approximation speed (see further comments on GL quality in [13]).


Figure 3.1: Diagram describing the gradient-based method used to approximate a waveform from a real-valued spectrogram. At 'Initialization', I initialize a "signal estimate" s0, a random weight vector with the same length (number of samples) as the original audio clip. At "Envelope extraction", I compute the normalized log-power spectrogram Es0 of s0 as in Section 3.3. The "Comparison" step computes the MSE between Es0 and T, where T refers to the autoencoder-reconstructed spectrogram from which we want to approximate a waveform. Then, at the "Update" step, the signal estimate s0 is updated using a gradient-based method ('L-BFGS-B' in [10]). The algorithm terminates after n iterations, or if the error falls below a tolerance threshold. Image borrowed from [10].


Chapter 4

Related Work

VAE, extensions and challenges

The VAE framework has seen a surge of research interest in recent years, as part of a broader field of research on generative models and representation learning [60] for images. Relevant to this work are works focused on improving the neural network architectures of the VAE by using neural network modules with strong modeling capabilities as both encoder and decoder. One such module type is the deep autoregressive model [19], which has demonstrated good modeling capacity when applied to pixel-wise reconstruction or generation of images using masked convolutions (PixelCNN [42]), or sample-wise generation of raw audio waveforms using dilated convolutions (WaveNet [41]).

Applying these modules in the VAE framework comes with its own set of challenges, particularly in learning meaningful latent representations. As reported in [7] [42] and others, VAEs with powerful autoregressive decoders tend to ignore the latent variables, causing an "uninformative" prior. In these cases, special care is required to ensure a meaningful latent representation is learned. One method, which forces global-level features such as outlines in images into the latents, is to design a decoder with a narrow receptive field, forcing it to resort to latent variables for high-level structure [7].

If recurrent (RNN) modules are used instead, the global-level features are naturally separated from the fine-grained detail information through the smaller memory capacity of the recurrent cell [20].

A third option is the recently-introduced vector quantized variational autoencoder (VQVAE, [44]), which learns to quantize and reconstruct its input using a relatively small number of learned quantization vectors. The VQVAE method is also a good example of how autoregressive models can be combined in the VAE framework while avoiding posterior collapse problems, and was recently extended by works such as [25] [54].

Image generation

A strong competitor to the primed VAE framework models are generative adversarial networks (GANs) [18], which are trained to minimize an entirely different objective, making them great for generation [27] but less suited for representation learning and for generating predictable output, because the input of the generative model is traditionally randomly sampled [17]. Flow-based models are another alternative that recently led to impressive results [29]. The recent release of VQVAE-2 [48] shows that vector quantized methods should also be considered a competitor.

Music

The choice of model depends heavily on the formats used to encode the music, such as raw digital audio waveform, spectrogram or MIDI.

MIDI is the most compact format, consisting of descriptions of how to play each digital instrument, similar to classical musical notation, rather than recordings of the actual sounds. Some model architectures such as Google Magenta's PerformanceRNN [57], successful at handling short-term chord progressions, have been improved to handle long-term dependencies (for example melodies) through conditioning on a latent representation learned through the variational framework of MusicVAE [52].

Spectrograms are also used as intermediate representations in some works, for example alongside VAEs to compress [53] or perform style transfer [2] on audio. Recent works based on GANs [13] [15] also successfully generated reasonably high-quality audio through generation of both spectrograms and raw waveforms (SpecGAN, WaveGAN; [13]). Worth noting in this case is that training and generation of spectrograms is generally much faster than sample-by-sample generation with autoregressive (WaveNet) models, since whole spectrograms are generated in parallel [15, Section 6].

Raw audio waveforms can be used for neural network training directly if proper care is taken to consider long-term dependencies (at a sample rate of 16000 Hz, interesting features such as melodies and rhythms typically span thousands of samples in raw audio input space). A few successful recurrent musical audio synthesis models such as SampleRNN [34] and WaveRNN [26] have been proposed, in parallel with competing WaveNet-based (autoregressive) models for generation [43] [14], representation learning [44] and style transfer [38]. These models are often conditional, so as to enable generation of different musical styles or instruments by changing the conditional input. To the best of my knowledge, however, most autoencoder models used for music generation [14] [38] rely on WaveNet encoders, and are not "variational" autoencoders.

Speech synthesis

Spectrogram representations of audio are also used in modern speech synthesis and recognition applications. Although peripheral to my work, it is worth noting that these systems, such as Google's Tacotron2 [56], often depend on spectrogram representations to obtain an overview of the material when performing sample-level generation. This requires a high-level spectrogram feature extractor to complement the sample-level generative method.


Chapter 5

Experiments and Results

5.1 Experiments

The following chapter describes the experiments and results of this project, and the sections are divided as follows:

• Section 5.1.1 is a reproduction of the qualitative CIFAR10 compression/reconstruction results of [44], as a sanity-check of my implementation of the VQVAE model. It also includes further quantitative performance comparisons of AE and VQVAE on small images.

• Section 5.1.2 describes quantitative performance comparisons for AE and VQVAE models on a spectrogram compression task.

• Section 5.1.3 contains further analysis of performance differences on spectrograms, and includes comparisons with naive image resizing baselines.

Unless otherwise noted, the learning rate (step size) was 2e-4, the VQVAE commitment loss scaling was β=0.25 (see Eq. 2.2.4), and the weights of the networks were sampled according to a truncated normal distribution with zero mean and 4e-4 variance. I optimize the network weights using Adam [28]. The neural network models were trained using NVIDIA Tesla K80 and P100 GPUs. The hyperparameters K and D are referred to extensively in this section, as they define the latent representation size: K is the number of VQVAE latent vectors, and thus the VQVAE latent size varies with K. D is the size of the depth dimension of the latent representation (and thus also the length of each learned quantization vector). The AE latent size increases with D, while the VQVAE latent size is constant in D. I did not use batch normalization [23] for these experiments. Further descriptions of the autoencoder model architecture and spectrogram preprocessing pipelines can be found in Sections 3.4 and 3.3 respectively.

5.1.1 Natural Image Compression

Initially, I prepared a sanity-check experiment to ensure that both models could be trained to compress and reconstruct small images in a similar way to [44], while avoiding posterior collapse. To compare compression capabilities, I trained both VQVAE and standard AE models, varying the size of the image compression bottleneck.

The models were trained on the CIFAR10 training set (see 3.2) and compared on the validation set after each epoch (roughly 78 iterations) of training. The error metric was the mean square error (MSE) between input and output images, which I calculated by summing the squared error over subpixels and averaging over all images in the validation set. I used the model architecture described in 3.4, such that the original images (size (32, 32, 3) subpixels) were 4x downsampled to an (8, 8, D) latent representation. For the VQVAE, I devised models with different latent representation sizes by varying the number of trained latent vectors K ∈ {32, 64, 128, ..., 4096}, while keeping the latent depth dimension constant at D = 10 (as suggested in [44]). Note that D is the length of each learned quantization vector. For the AE, D ∈ [1:12]. The learning rate was 2e-4 and the batch size 128, and all models were trained for roughly 60,000 iterations (about 5 hours) each. Figure 5.4 displays the model performance on the CIFAR10 validation set, as well as the number of latent vectors used per epoch (perplexity) of the VQVAE. The quantitative results for increasing latent space size for both VQVAE and AE can be seen in Figure 5.1, where I report the MSE for the models at their best-performing epoch (different for each model), effectively using the 'early stopping' technique. For an example of how to calculate compressions, see Section 3.4. Examples of validation set image reconstructions for increasing latent sizes can be found in Figures 5.2 and 5.3. I observed very small improvements on CIFAR10 with VQVAE when D > 10 (not shown here).
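As a concrete illustration of the latent size calculation, the short sketch below reproduces the sizes reported in Figure 5.1 (the helper names are mine, introduced for illustration only):

import math

def vqvae_latent_bits(height, width, K):
    # Each latent position stores a single index into K learned vectors.
    return height * width * math.ceil(math.log2(K))

def ae_latent_bits(height, width, D, bits_per_float=32):
    # Continuous bottleneck: D float feature maps per latent position.
    return height * width * D * bits_per_float

# CIFAR10 latents with spatial size (8, 8):
vqvae_latent_bits(8, 8, K=1024)   # 640 bits  (80 bytes)
ae_latent_bits(8, 8, D=3)         # 6144 bits (768 bytes)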

5.1.2 Audio Spectrogram Compression

To see whether audio spectrograms contain regularities which could make them benefit from quantization in the same way as natural images, I trained both AE and VQVAE models to compress and quantize audio spectrograms by treating them as single-channel images. The log-power spectrograms were of size (512, 512), representing 4-second clips of single piano key strokes from the NSynth dataset (see 3.2). Explicit details about the preprocessing of the spectrograms can be found in Section 3.3. This experiment had a similar setup to that described in Section 5.1.1, but the latent representation of the spectrograms had size (128, 128, D). VQVAE models with latent depths D = 10 and D = 256 were compared, to make each latent vector more expressive (performance plateaued at D=256 for NSynth spectrograms and D=10 for CIFAR10). I did not test models with higher K-values for this experiment due to memory constraints, and because smaller batch sizes caused more unstable training.

I also trained AE models on the same task for comparison. All models trained for approximately 18,300 iterations (246 epochs), requiring 20-24 hours per model. The results are presented in Figure 5.5.

5.1.3 Spectrogram Compression by Naive Image Resizing

To compare the quality of reconstructions for each frequency bin separately, I compared the VQVAE and AE to two naive image resizing baselines, namely bilinear and nearest-neighbor interpolation [16, p.65]. I performed 4x downsampling (resizing) to shape (128, 128) using the naive methods, and then upsampled them back to (512, 512) pixels, measuring the noise introduced by the compression using the MSE averaged over time. The VQVAE(K=2048,D=256) and AE(D=3) were chosen based on their similar performance in the previous experiment (Figure 5.5). The results and MSE frequency response envelopes are compared on a sound clip from the validation set in Figures 5.8 and 5.9.
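A minimal sketch of the naive baselines and the per-frequency error curve is shown below. It assumes the spectrogram is a (frequency, time) float array and uses Pillow for the resizing; both are implementation assumptions, not the exact code used here.

import numpy as np
from PIL import Image

def naive_roundtrip(spec, factor=4, method=Image.BILINEAR):
    # Down- and upsample a (512, 512) spectrogram treated as a single-channel image.
    height, width = spec.shape
    image = Image.fromarray(spec.astype(np.float32), mode="F")
    small = image.resize((width // factor, height // factor), resample=method)
    restored = small.resize((width, height), resample=method)
    return np.asarray(restored)

def per_frequency_mse(original, reconstruction):
    # Squared error averaged over the time axis: one value per frequency bin,
    # i.e. the "frequency response" curves of Figures 5.8 and 5.9.
    return np.mean((original - reconstruction) ** 2, axis=1)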

5.2 Results

Natural Image Compression

The results of the sanity-check experiment on small natural images (Section 5.1.1) are shown in Figure 5.1, comparing the models' performance as a function of increasing latent size at their highest-performing epoch. Here we can note how the VQVAE achieves impressive results, with VQVAE(K=1024) reaching a quantitative performance similar to the AE(D=3), while requiring only 80/768 ≈ 1/10 of the bits. The performance seems to peak around K=2048 for the VQVAE for this training task, while more powerful AE models quantitatively outperform the VQVAE.

Figure 5.1: MSE on the CIFAR10 validation set as a function of latent representation size (bytes). The error is summed for each image and averaged over all images. Left: VQVAE, with latent size (8, 8, 1) · log2(K) bits, for K = 2^d, d ∈ [5:12] latent vectors. Right: AE, with latent size (8, 8, D) · 32 bits, consisting of D ∈ [1:12] feature maps.

Qualitative proof of model performance on small images can be found in Figures 5.2 (AE) and 5.3 (VQVAE). Here, I compressed 12 randomly chosen images from the CIFAR10 validation set with models of varying compression rate, increasing the latent size from the top to the bottom row. The reconstructions of the best VQVAE models (such as K=1024), while slightly blurrier than the originals, seem on par with or better than those of AE models such as AE(D=3). The AE also seems to suffer from a larger degree of 'checkerboard' artifacts, a common side-effect of (strided) transposed convolutions [40].


Figure 5.2: AE reconstructions of CIFAR10 validation set images with increasing latent depth D.


Figure 5.3: VQVAE reconstructions of CIFAR10 validation set images with the number of latent vectors K increasing row-wise. Here, the vector size D = 10 is constant.

The training development for the small natural images shown in Figure 5.4 further illustrates the model performance for the two autoencoder types and indicates that the models' ability to reproduce their input data increases with latent representation size, with lower error rates for models with larger representations (softer compression). The perplexity plot of the VQVAE shows that the trained model utilizes a major part of its latent modeling capacity, and avoided problems with posterior collapse and ignored latents [7]. Note the tendency towards overfitting in models with smaller latent sizes (particularly the outlier AE(D=1)).



Figure 5.4: CIFAR10 validation set loss for the VQVAE (a) and the traditional AE (b) models over the course of training. The plots are ordered by increasing latent representation size, such that the VQVAE(K=32) with the fewest latent vectors also performs the worst (highest loss) in (a), and the AE(D=1) has the highest loss in (b). Final results are found in the simplified Figure 5.1. The bottom plot (c) shows the number of latent vectors used per epoch for the VQVAE, where a higher number indicates that more of the model capacity is used during reconstruction (perplexity). Similarly to plots (a) and (b), a larger latent capacity leads to improved performance (higher perplexity is better).


Spectrogram Compression

Figure 5.5: MSE loss on the spectrogram validation set as a function of latent representation size (kbits) for the VQVAE and AE models. The error is summed over image dimensions and averaged over images. Left: VQVAE, with latent size (128, 128, 1) · log2(K) bits, where the number of latent vectors K = 2^d, d ∈ [6:11]. Right: AE, with latent size (128, 128, D) · 32 bits, with D ∈ [1:10] feature maps. The original input size was (512, 512, 1) · 32 bits.

The results for audio spectrogram compression presented in Figure 5.5 suggest that VQVAE models such as (K=1024,D=256) perform similarly to AE(D=3), with a 9.5x higher compression rate achieved by introducing the vector quantization bottleneck. For cases with high loss tolerance and high compression, it might also be relevant to compare VQVAE(K=128,D=256), which outperforms AE(D=2) with a similar ratio between latent representation sizes. Increasing the dimensionality of the quantization vectors of the VQVAE from 10 to 256 seems to help the performance of some models. However, note the relatively small improvement from VQVAE(K=2048,D=10) to VQVAE(K=2048,D=256), which might suggest that increasing the number of latent embedding vectors could be an alternative to increasing depth, depending on the purpose. The number of utilized latent vectors is presented in Figure 5.7. I also observed sudden drops in performance when training VQVAEs with D=256 and small K, particularly K=64 and K=128, from which the models' training performance did not recover (not presented here).

Figure 5.6: The perplexity metric measures how many discrete vectors were used to encode the training data. This example plot shows the development of the VQVAE(D=256,K=512) over training iterations for the spectrogram training task, where the model utilizes an increasing fraction of its encoding capacity throughout training. As a relatively large portion of the latent vectors is effectively used, the model seems to avoid posterior collapse, a common problem for autoencoders [7].
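For reference, the sketch below follows the perplexity definition used in the DeepMind reference implementation [61]: the exponential of the entropy of the empirical code-usage distribution, which equals K when all vectors are used uniformly and 1 when only a single vector is used. The helper name is mine.

import numpy as np

def perplexity(code_indices, K):
    # code_indices: integer array of chosen latent-vector indices for a batch.
    counts = np.bincount(np.ravel(code_indices), minlength=K)
    probs = counts / counts.sum()
    nonzero = probs[probs > 0]
    return float(np.exp(-np.sum(nonzero * np.log(nonzero))))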


K       used (D=256)    used (D=10)
64      53              56
128     109             104
256     213             199
512     327             355
1024    711             707
2048    860             1279

Figure 5.7: Number of quantization vectors used by the VQVAE to encode spectrograms during training epoch 246. Models with K>512 had not converged.

Spectrogram Compression by Naive Image Resizing

Figures 5.8 and 5.9 compare reconstructions of the autoencoders to naive 'bilinear' and 'nearest neighbor' resizing baselines. The comparison is made both visually and according to the squared difference between the original and reconstructed spectrograms averaged over time, shown as frequency response spectra. From visual inspection of these frequency responses, the VQVAE(K=512,D=256) and AE(D=3) seem to have similar error magnitudes, while the VQVAE(K=1024,D=256) performs significantly better. The baseline methods cause larger errors over the whole spectrum. Another detail not apparent in the figures is that the VQVAE generally outperforms the baselines in silent areas of the spectrum, such that the total difference grows with the sampling rate.

Example reconstructions of musical audio samples

Please refer to http://amundv.github.io/thesis-audio-samples for examples of reconstructions of 'popular' musical audio clips. Note that these models were only trained on single acoustic keyboard sound clips, as described in Section 3.2.


Figure 5.8: Left column: visual comparison between the original spectrogram and reconstructions. The upper half of the spectrograms was cropped out for illustrative purposes, since most of the energy is in frequency bins 0-250, but the VQVAE performed on par or better on the higher (often silent) frequencies. Right column: the frequency responses of the squared error between the original and reconstruction, averaged over time (MSE). The input sound is the note G4 played on a piano (from the validation set). Please see Figure 5.9 for comparisons with the AE and the naive baselines.


Figure 5.9: See description of Figure 5.8


Chapter 6

Discussion

6.1 Discussion

Bottleneck dimensionality reduction To reduce the number of feature maps output from the encoder before the compression bottleneck, I chose a common dimensionality reduction method where a 1x1 convolution is used to downsample from 256 to D feature maps [58, Sec. 2]. For the VQVAE models, I set D to a high value, such as D=256 for the best spectrogram model, since the VQVAE latent representation size is constant in D while the modeling capacity increases. For the AE, on the other hand, the latent size grows linearly with D, and comparatively small values were required for the models to have comparable latent sizes for the given model architecture. This method of dimensionality reduction might have put the AE models at an unfair disadvantage.

Learning rate and convergence The selected small learning rate of 2e-4 (similar to [44], [54]) contributed to relatively long convergence times, particularly for the AE and the larger VQVAE models. This choice prioritizes model performance over low computation time. Although not reported in the results section, I trained the highest-performing VQVAE(D=256,K=2048) alongside its closest competitor AE(D=3) for 47k iterations to ensure that both would stay competitive closer to convergence. This extended training led to improved performance for both models: MSE 70.2 and 73.5 for the VQVAE and AE respectively. In future work, the remaining models could be trained longer to verify that the results generalise to the remaining latent sizes. Learning rate annealing methods could be used to speed up convergence of the models.



Input redundancy Although they are the result of a lossy transformation, the input spectrograms require significantly more bits than their original WAVE-format audio clip counterparts. While the original clips (4 seconds) with added padding (+0.152 seconds) at a sample rate of 16000 Hz and a bit rate of 16 bits/sample require a total of 4.152 · 16000 · 16 bits ≈ 1.06 megabits, the input size of the spectrogram representation is 512 · 512 · 32 bits ≈ 8.39 megabits – almost a factor of 8 larger. Aside from considering a switch to a 16-bit float spectrogram representation (further discussed in Section 7.2), one way to lessen this size increase could be to increase the STFT hop size. Compared to other works such as [13], my choice of hop size (only 12% of the window size rather than 50%) causes extensive redundancy, which might have reduced the negative upsampling effects of transposed convolutions such as checkerboard artifacts (mentioned in Section 5.2), but also caused a very large input size (16x larger than [13]). Attempting a larger hop size, and effectively harder compression, could be investigated in future work.

Combining audio chunks after reconstruction For applications involving compression over longer timescales than used in this project, the challenge of reconstructing a continuous audio signal from out-of-phase audio chunks is an obstacle. The problem arises from the discarding of phase in the preprocessing pipeline. Since the phase-reconstruction method synthesizes phase separately for each 4-second audio chunk, combining ('glueing') the reconstructed audio chunks becomes non-trivial. In audio editing, glueing such out-of-phase audio chunks is known to cause 'phase distortion' artifacts, requiring manual transitions between chunks using e.g. cross-fading (transitioning between audio clips by successively lowering the amplitude of one while increasing that of the other). To apply vector quantization to longer audio chunks, an appropriate phase approximation method must be selected.
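A minimal sketch of such a cross-fade between two reconstructed chunks is shown below; the linear fade shape and the overlap length are illustrative assumptions.

import numpy as np

def crossfade(chunk_a, chunk_b, overlap):
    # Linear cross-fade over `overlap` samples: chunk_a fades out while
    # chunk_b fades in, masking the phase discontinuity at the seam.
    fade = np.linspace(0.0, 1.0, overlap)
    seam = chunk_a[-overlap:] * (1.0 - fade) + chunk_b[:overlap] * fade
    return np.concatenate([chunk_a[:-overlap], seam, chunk_b[overlap:]])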

Perceptual audio compression The MSE objective function, although general-purpose in neural network training due to its stable gradients, might not correlate well with perceived audio quality in reconstructions. The metric is considered unfit for measuring perceptual quality in other domains such as image compression, because reconstructions with large variations in perceived quality have been shown to give the same MSE [64]. In music compression, commonplace algorithms such as mp3 are designed based on psychoacoustic analysis, where quantitative research on perceived audio quality for humans is the basis of the encoding method [47]. In this process, spectral analysis is an important part, particularly in so-called "variable bitrate" (VBR) mp3 encoding, where the encoding bitrate is varied throughout a piece of music depending on the spectral content of the audio (more bits are required to encode audio with richer spectral content, such as orchestral music). In future work, a perceptual loss objective could be used to optimize for perceived audio quality.

6.2 Challenges

In my attempt to reproduce the log-likelihood lower bound results for the VAE from [44], I performed extensive experimentation with hyperparameter tuning, ad hoc variations to the ELBO loss function such as KL weighting/annealing, as well as various other measures. These results are not reported here. In general, most of this work could have been avoided if the exact VAE objective function, KL weighting and other methodological details had been more clearly reported in the original paper. The following is a detailed description of a few unexpected challenges encountered while training my VAEs with the ELBO objective.

KL divergence magnitude in ELBO

Initially, when comparing the MSE results of variational AEs with VQVAEs, I expected that choosing the minimum number of feature maps (VAE(D=1); latent size of (8, 8, D=1) · 32 float bits = 2048 bits for CIFAR10) was sure to outperform an (8, 8, D=1) · log2(K=256) = 512 bits VQVAE, due to the 4:1 size ratio between their latent sizes. Interestingly, the VAE converged to a higher error value than the VQVAE using the same value of the parameter D for both models, even with a very low number of latent vectors (e.g. K=64). This seemed strange, since the VAE was modeled with float32 while the VQVAE was constrained to choose one of K vector indices for each position in the compressed representation of the image.

Initially, I assumed that this problem was caused by the 1x1 convolution before the compression bottleneck, where the 256 feature maps output from the encoder are reduced to D feature maps before reparametrization (VAE) or discretization (VQVAE). Although a common method for dimensionality reduction and training speed-up in neural networks (see Section 3.4), my intuition was that it gave an unfair disadvantage to the VAE, as the encoder feature maps were constrained to D = 1 in the VAE, while they could be arbitrarily expressive in the VQVAE, whose latent size does not depend on D.


However, increasing the number of feature maps D for the VAE caused only small improvements, and the VQVAE dominated completely in terms of MSE. The main cause of this was that the KL divergence regularization term of Equation 2.1.1, which [30] describes as a sum over latent dimensions (Gaussian case), quickly outgrew and soon dominated the MSE values for the normalized CIFAR10 images as D increased. This caused the training to get stuck in local minima and converge to sub-optimal scores.

The problem of the KL-term magnitude is well known in the literature, and different techniques have been suggested. [22] suggests a tuneable hyperparameter β to scale the KL term, where β=1 recovers the ELBO. The authors qualitatively demonstrate how β>1 helps disentangle latent features, such that each latent dimension more closely corresponds to separate, conceptually meaningful factors of variation in facial images (such as hairstyle) rather than a mix of several factors (hairstyle and emotion), as seen in regular VAE latent space interpolation. An alternative to a constant β is annealing the KL term using a warm-up schedule [20][59][35], such that the modified objective approaches the true ELBO over time.
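A minimal sketch of such a warm-up schedule is given below; the linear shape and the number of warm-up steps are illustrative assumptions, not values used in this work.

def kl_weight(step, warmup_steps=10000, beta_max=1.0):
    # The weight on the KL term grows linearly from 0 to beta_max over
    # warmup_steps iterations, after which the objective equals the
    # (beta-weighted) ELBO.
    return beta_max * min(1.0, step / warmup_steps)

# total_loss = reconstruction_mse + kl_weight(step) * kl_divergence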

These warm-up techniques generally proved to have a good effect on balancing the two loss terms (not reported here). The challenge with these techniques is that a good scaling schedule must be designed separately for each dataset and model architecture. Since generative models are out of the scope of this work, I refer to the current research for recent advances [35][49].

Since the VAEs converged to sub-optimal results due to the KL divergence term, I ended up choosing a simpler, non-variational AE to ensure that the baseline results to which I compared the VQVAE remained unconstrained. This bottleneck, consequently, is similar to that of the baseline described by [14].

6.2.1 Key Take-aways

Keep loss function and metric separate (train on mean, measure sum) In general, one of the biggest lessons learned is that the metric measured need not correspond exactly to the objective function minimized during training. From studying the public code of other reference authors to understand the VAE objective function, I noticed that their way of averaging when estimating the MSE differed between training and metrics, such that their reported lower bound was calculated differently (averaging over the batch dimension) from their training objective (averaging over batch and subpixel dimensions). A potential solution could be keeping the objective function and the metrics separate, particularly in the case of VAEs with large latent sizes (e.g. 8x8x10), where the objective consists of two terms that must be balanced.

Ensure that the sanity-check model is representative In my initial experiments comparing the variational AE and VQVAE, I used a minimal VAE(D=1) for fast iterations, believing the models would improve steadily with D. My error here was that the minimal AE(D=1) represented a strong performance-wise outlier for this particular autoencoder architecture, which made it a bad choice for baseline VAE/AE testing in this project. A lesson learned for creating baselines is to ensure that the model architecture used for initial sanity-checking actually represents a simple case of the relevant models.


Chapter 7

Conclusion and Future Work

7.1 Conclusion

The topic of this project was to compare the performance of discrete and continuous convolutional neural network autoencoders for unsupervised audio spectrogram compression, to understand whether vector quantization is a suitable extension for this task. The project builds on previous works on probabilistic, convolution-based representation learning of images, speech and video, and explores whether audio spectrograms contain recurring sets of neighboring pixels (features) that could be learned and compressed. The proposed method includes an unsupervised compression method, audio spectrogram pre- and postprocessing pipelines, and a final gradient-based phase offset approximation step, attempting to reconstruct the original waveform. The unsupervised method was trained to minimize the MSE between original and reconstructed spectrograms using gradient descent, while the phase approximation method was used to minimize the MSE between the spectrogram of an approximated waveform and the reconstructed spectrogram.

The experimental results suggest that vector quantized autoencoders with a large number of latent vectors (e.g. 2048) perform quantitatively similarly to regular autoencoders on spectrogram reconstruction, while requiring about 11% of their latent size. Vector quantization thus seems to enable harder compression than the continuous autoencoder bottleneck, while retaining its performance.

Inspection of the error distribution in the frequency domain also suggests favorable results for the VQVAE, as the AE produces error peaks around the harmonic partials, which could be averaging artifacts commonly seen in autoencoders.

Finally, manual inspection of the reconstructed waveforms, although highly biased and subjective, further supports the quantitative findings described above, as the reconstructed audio clips from the validation set closely resemble the originals.

Attempts to compress other sorts of musical audio, such as extracts from popular music, also give promising results. Most samples are reconstructed without significant loss of audio quality, and only extreme examples chosen to 'provoke' errors end up severely distorted. This was a surprising result, as the models were trained on a highly homogeneous dataset consisting only of keyboard sounds. This could indicate that the model has learned to extract abstract musical audio features common to various sorts of music, making the quantization method interesting for representation learning in the broader musical audio research field. These are however highly subjective reports, using a waveform-encoding method of high redundancy, and thus extensive quantitative studies are required before drawing robust conclusions.

To conclude, the results suggest that there exist local, quantizable regularities in musical audio which can be learned by vector quantizing the latent feature space, motivating the use of vector quantization for this use case.

7.2 Future Work

Application-dependent Model Improvements

Vector Quantization Improvements Since the release of [44], several works such as [54] [25] have improved the technique through variations on the quantization method for natural images. Attempting these methods could also be beneficial in the audio spectrogram case, and is an interesting direction for future work.

Audio Feature Engineering In this report, I avoid most audio feature engineering in the spectrogram preprocessing pipeline, and instead focus on whether the compression method itself is suitable for audio spectrogram compression. In real-world applications, however, it might be relevant to consider feature engineering approaches to optimize for the domain of musical audio:

• Different spectrogram resolutions, dedicating more bits to the audible frequencies. Mel spectrograms could allow harder compression with limited loss of perceived quality compared to linear-scale spectrograms [62] [56].

Page 64: Unsupervised Audio Spectrogram Compression using Vector ...

CHAPTER 7. CONCLUSION AND FUTURE WORK 63

• 'Rainbowgrams' (constant-Q spectrograms colored by log magnitude) [14] might have improved my results due to their compactness, and since the color coding is a form of musical feature engineering.

• Variations on the shape of the convolutional kernels could encode prior assumptions about certain features into the model architecture, such as a preference towards learning musical timbre (tall kernels) [46] or rhythmical patterns (broad kernels). In this case, a kernel shape should be chosen based on the specific use case.

Half-precision for a more Competitive AE Using half-precision floats (16 bits) rather than single-precision (32 bits) throughout the AE would have halved its latent representation size, which could have made it more competitive. Another interesting possibility could be using half-precision in the bottleneck only, involving the active research area of "mixed precision training" [36].
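A minimal sketch of the bottleneck-only variant, under the assumption that a plain cast of the continuous code is acceptable, could look as follows:

import numpy as np

def half_precision_bottleneck(z_e):
    # Store the continuous bottleneck as float16 and cast back to float32
    # before decoding, halving the AE latent size in bits.
    return z_e.astype(np.float16).astype(np.float32)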

Unsupervised Music Representation Learning

An interesting topic for future exploration would be to analyze the semantic meaning of the trained quantization vectors of the VQVAE close to convergence, exploring whether the model has learned abstract 'conceptual' musical features (stored in latent vectors) which are generalizable to other types of music.

One way to analyze the audio latent vectors could be to train a generative 'prior' model (such as PixelCNN) on the vector quantized latent representations of the training data, and then use the decoder of the VQVAE to decode samples from the generative prior, as was done for natural images in [44] and recently further in [48]. Other explorations of feature space, such as latent space interpolation, are common in the VAE literature, and could provide model insights and a better understanding of discretization as a method of regularization in a musical context.

Reasoning about the problem as a discrete, unsupervised representation learning problem poses new design questions, such as what number of latent vectors K should be chosen for a particular learning problem, or what shape of convolutional kernels is appropriate for learning to recognize the desired features.

Model Performance on Other Audio Data Although not presented in this report, the trained models seemed capable of reconstructing other types of audio clips than those in the dataset with surprisingly high quality. The next step could be a thorough analysis of model performance on other types of audio clips (such as cut-outs from songs) to further determine the usefulness of the model for the use case described in Section 1.1.

7.3 Ethics and Sustainability

When used for musical audio compression and analysis, the method presented in this thesis has no direct implications on ecological sustainability.

From a social perspective, a content-based music analysis tool could aid individuals in finding music that fits their taste, effectively bypassing the influence of sponsored recommendations. This could make the user less dependent on trends set by large record companies, radio stations and influencers, leading to a greater democratization of the music recommendation field, with economic implications.

One societal risk, however, is a hypothetical scenario where the user ends up being recommended only very similar content without realizing that what she sees is being filtered – an effect known as a "filter bubble". This is already a large problem in today's news media, and a content-based system has the potential to amplify this effect when used incorrectly.

Another challenge is the possibility that such a method could strengthen musical trends where musical qualities are sacrificed to optimize popularity and economic profit. A previous example of such a negative trend is the long-lasting increase of perceived audio loudness levels to boost music sales (the "loudness war"). It is debatable whether such negative trends were consequences of the availability of the methods, or of the intentions of the responsible individuals, or both.

7.4 Societal Aspects

The methods and results described in this project could benefit different groups in society, from individual listeners and creative professional musicians to businesses.

For the individual listener, recommender systems based on the musical content rather than listening statistics could be useful in several ways. For example, they could simplify the search for artists with a similar "sound", for example when putting together playlists or sets of songs. They could also help find music which is less popular but still sounds similar to what the user has shown a preference for, which would also benefit lesser-known musical artists.

For a creative professional, the described method could aid in the production of music. For example, an autoencoder trained on professionally mixed/mastered music could potentially be used to remove noise from corrupted audio tracks ("denoise"). It could also be possible to train generative models to generate new music in the latent space (which is low-dimensional and as such allows for more powerful models at a lower computational cost), and then decode it to create completely new music. This was shown to be possible for images and speech in [44].

From a business perspective, the method could enable better and more efficient recommendations of musical content, as previously described in Section 1.1. This could aid in the analysis and understanding of the available content, such as understanding trends, why certain tracks are popular, and which have the potential to be.

Page 67: Unsupervised Audio Spectrogram Compression using Vector ...

Bibliography

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mane, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viegas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv:1603.04467 [cs], Mar. 2016. URL http://arxiv.org/abs/1603.04467. arXiv: 1603.04467.

[2] S. Barry and Y. Kim. "Style" Transfer for Musical Audio Using Multiple Time-Frequency Representations. Feb. 2018. URL https://openreview.net/forum?id=BybQ7zWCb.

[3] E. Bernhardsson. Collaborative Filtering at Spotify, 2013. URL https://www.slideshare.net/erikbern/collaborative-filtering-at-spotify-16182818/62.

[4] E. Bernhardsson. Annoy 1.10 released, with Hamming distance and Windows support, Nov. 2017. URL https://erikbern.com/2017/11/26/annoy-1.10-released-with-hamming-distance-and-windows-support.html.

[5] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.

[6] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational Inference: A Review for Statisticians. Journal of the American Statistical Association, 112(518):859–877, Apr. 2017. ISSN 0162-1459, 1537-274X. doi: 10.1080/01621459.2017.1285773. URL http://arxiv.org/abs/1601.00670. arXiv: 1601.00670.


[7] X. Chen, D. P. Kingma, T. Salimans, Y. Duan, P. Dhariwal, J. Schulman, I. Sutskever, and P. Abbeel. Variational Lossy Autoencoder. arXiv:1611.02731 [cs, stat], Nov. 2016. URL http://arxiv.org/abs/1611.02731. arXiv: 1611.02731.

[8] K. Choi, G. Fazekas, K. Cho, and M. Sandler. A Comparison of Audio Signal Preprocessing Methods for Deep Neural Networks on Music Tagging. arXiv:1709.01922 [cs], Sept. 2017. URL http://arxiv.org/abs/1709.01922. arXiv: 1709.01922.

[9] S. B. Davis and P. Mermelstein. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. Acoustics, Speech and Signal Processing, IEEE Transactions on, pages 357–366, 1980.

[10] R. Decorsière, P. L. Søndergaard, E. N. MacDonald, and T. Dau. Inversion of Auditory Spectrograms, Traditional Spectrograms, and Other Envelope Representations. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(1):46–56, Jan. 2015. ISSN 2329-9290. doi: 10.1109/TASLP.2014.2367821.

[11] S. Dieleman. Recommending music on Spotify with deep learning – Sander Dieleman, 2014. URL http://benanne.github.io/2014/08/05/spotify-cnns.html.

[12] C. Doersch. Tutorial on Variational Autoencoders. arXiv:1606.05908 [cs, stat], June 2016. URL http://arxiv.org/abs/1606.05908. arXiv: 1606.05908.

[13] C. Donahue, J. McAuley, and M. Puckette. Adversarial Audio Synthesis. arXiv:1802.04208 [cs], Feb. 2018. URL http://arxiv.org/abs/1802.04208. arXiv: 1802.04208.

[14] J. Engel, C. Resnick, A. Roberts, S. Dieleman, D. Eck, K. Simonyan, and M. Norouzi. Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders. arXiv:1704.01279 [cs], Apr. 2017. URL http://arxiv.org/abs/1704.01279. arXiv: 1704.01279.

[15] J. Engel, K. K. Agrawal, S. Chen, I. Gulrajani, C. Donahue, and A. Roberts. GANSynth: Adversarial Neural Audio Synthesis. Sept. 2018. URL https://openreview.net/forum?id=H1xQVn09FX.


[16] R. C. Gonzalez, R. E. Woods, and B. R. Masters. Digital Image Processing, Third Edition. Journal of Biomedical Optics, 14(2):029901, 2009. ISSN 10833668. doi: 10.1117/1.3115362. URL http://biomedicaloptics.spiedigitallibrary.org/article.aspx?doi=10.1117/1.3115362.

[17] I. Goodfellow. NIPS 2016 Tutorial: Generative Adversarial Networks. arXiv:1701.00160 [cs], Dec. 2016. URL http://arxiv.org/abs/1701.00160. arXiv: 1701.00160.

[18] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative Adversarial Networks. arXiv:1406.2661 [cs, stat], June 2014. URL http://arxiv.org/abs/1406.2661. arXiv: 1406.2661.

[19] K. Gregor, I. Danihelka, A. Mnih, C. Blundell, and D. Wierstra. Deep AutoRegressive Networks. arXiv:1310.8499 [cs, stat], Oct. 2013. URL http://arxiv.org/abs/1310.8499. arXiv: 1310.8499.

[20] K. Gregor, F. Besse, D. J. Rezende, I. Danihelka, and D. Wierstra. Towards Conceptual Compression. arXiv:1604.08772 [cs, stat], Apr. 2016. URL http://arxiv.org/abs/1604.08772. arXiv: 1604.08772.

[21] D. Griffin and Jae Lim. Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243, Apr. 1984. ISSN 0096-3518. doi: 10.1109/TASSP.1984.1164317. URL http://ieeexplore.ieee.org/document/1164317/.

[22] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner. beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. Nov. 2016. URL https://openreview.net/forum?id=Sy2fzU9gl.

[23] S. Ioffe and C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv:1502.03167 [cs], Feb. 2015. URL http://arxiv.org/abs/1502.03167. arXiv: 1502.03167.

[24] C. Johnson. From Idea to Execution: Spotify's Discover Weekly, 2015. URL https://www.slideshare.net/MrChrisJohnson/from-idea-to-execution-spotifys-discover-weekly/35-3_httpbenannegithubio20140805spotifycnnshtmlDeep_Learning_on_Audio.


[25] Ł. Kaiser, A. Roy, A. Vaswani, N. Parmar, S. Bengio, J. Uszkoreit, and N. Shazeer. Fast Decoding in Sequence Models using Discrete Latent Variables. arXiv:1803.03382 [cs], Mar. 2018. URL http://arxiv.org/abs/1803.03382. arXiv: 1803.03382.

[26] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. v. d. Oord, S. Dieleman, and K. Kavukcuoglu. Efficient Neural Audio Synthesis. arXiv:1802.08435 [cs, eess], Feb. 2018. URL http://arxiv.org/abs/1802.08435. arXiv: 1802.08435.

[27] T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive Growing of GANs for Improved Quality, Stability, and Variation. arXiv:1710.10196 [cs, stat], Oct. 2017. URL http://arxiv.org/abs/1710.10196. arXiv: 1710.10196.

[28] D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs], Dec. 2014. URL http://arxiv.org/abs/1412.6980. arXiv: 1412.6980.

[29] D. P. Kingma and P. Dhariwal. Glow: Generative Flow with Invertible 1x1 Convolutions. arXiv:1807.03039 [cs, stat], July 2018. URL http://arxiv.org/abs/1807.03039. arXiv: 1807.03039.

[30] D. P. Kingma and M. Welling. Auto-Encoding Variational Bayes. arXiv:1312.6114 [cs, stat], Dec. 2013. URL http://arxiv.org/abs/1312.6114. arXiv: 1312.6114.

[31] A. Krizhevsky. Learning Multiple Layers of Features from Tiny Images. 2009.

[32] G. Lovink. Reflections on the MP3 Format: Interview with Jonathan Sterne. Computational Culture, 4, 2014.

[33] D. J. C. MacKay. Information Theory, Inference & Learning Algorithms. Cambridge University Press, New York, NY, USA, 2002. ISBN 0-521-64298-1.

[34] S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio. SampleRNN: An Unconditional End-to-End Neural Audio Generation Model. arXiv:1612.07837 [cs], Dec. 2016. URL http://arxiv.org/abs/1612.07837. arXiv: 1612.07837.


[35] L. Mescheder, S. Nowozin, and A. Geiger. Adversarial Variational Bayes: Unifying Variational Autoencoders and Generative Adversarial Networks. arXiv:1701.04722 [cs], Jan. 2017. URL http://arxiv.org/abs/1701.04722. arXiv: 1701.04722.

[36] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, and H. Wu. Mixed Precision Training. arXiv:1710.03740 [cs, stat], Oct. 2017. URL http://arxiv.org/abs/1710.03740. arXiv: 1710.03740.

[37] M. Miron, J. Janer, and E. Gómez. Monaural Score-Informed Source Separation for Classical Music Using Convolutional Neural Networks. In ISMIR, 2017.

[38] N. Mor, L. Wolf, A. Polyak, and Y. Taigman. A Universal Music Translation Network. arXiv:1805.07848 [cs, stat], May 2018. URL http://arxiv.org/abs/1805.07848. arXiv: 1805.07848.

[39] M. Müller. Fundamentals of Music Processing: Audio, Analysis, Algorithms, Applications. Springer International Publishing, 2016. ISBN 978-3-319-35765-2. URL https://books.google.se/books?id=�BBswEACAAJ.

[40] A. Odena, V. Dumoulin, and C. Olah. Deconvolution and Checkerboard Artifacts. Distill, 2016. doi: 10.23915/distill.00003. URL http://distill.pub/2016/deconv-checkerboard.

[41] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016. URL https://arxiv.org/abs/1609.03499.

[42] A. v. d. Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.

[43] A. v. d. Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. v. d. Driessche, E. Lockhart, L. C. Cobo, F. Stimberg, N. Casagrande, D. Grewe, S. Noury, S. Dieleman, E. Elsen, N. Kalchbrenner, H. Zen, A. Graves, H. King, T. Walters, D. Belov, and D. Hassabis. Parallel WaveNet: Fast High-Fidelity Speech Synthesis. arXiv:1711.10433 [cs], Nov. 2017. URL http://arxiv.org/abs/1711.10433. arXiv: 1711.10433.


[44] A. v. d. Oord, O. Vinyals, and K. Kavukcuoglu. Neural Discrete Representation Learning. Nov. 2017. URL https://arxiv.org/abs/1711.00937v2.

[45] W. Ping, K. Peng, A. Gibiansky, S. O. Arik, A. Kannan, S. Narang, J. Raiman, and J. Miller. Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning. arXiv:1710.07654 [cs, eess], Oct. 2017. URL http://arxiv.org/abs/1710.07654. arXiv: 1710.07654.

[46] J. Pons, O. Slizovskaia, R. Gong, E. Gómez, and X. Serra. Timbre Analysis of Music Audio Signals with Convolutional Neural Networks. arXiv:1703.06697 [cs], Mar. 2017. URL http://arxiv.org/abs/1703.06697. arXiv: 1703.06697.

[47] R. Raissi. The Theory Behind Mp3. 2002.

[48] A. Razavi, A. v. d. Oord, and O. Vinyals. Generating Diverse High-Fidelity Images with VQ-VAE-2. arXiv:1906.00446 [cs, stat], June 2019. URL http://arxiv.org/abs/1906.00446. arXiv: 1906.00446.

[49] D. J. Rezende and F. Viola. Taming VAEs. arXiv:1810.00597 [cs, stat], Oct. 2018. URL http://arxiv.org/abs/1810.00597. arXiv: 1810.00597.

[50] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. arXiv:1401.4082 [cs, stat], Jan. 2014. URL http://arxiv.org/abs/1401.4082. arXiv: 1401.4082.

[51] F. Ricci. Recommender systems handbook. Springer Science+Business Media, New York, NY, 2015. ISBN 978-1-4899-7636-9.

[52] A. Roberts, J. Engel, C. Raffel, C. Hawthorne, and D. Eck. A Hierarchical Latent Vector Model for Learning Long-Term Structure in Music. Mar. 2018. URL https://arxiv.org/abs/1803.05428v4.

[53] F. Roche, T. Hueber, S. Limier, and L. Girin. Autoencoders for music sound synthesis: a comparison of linear, shallow, deep and variational models. arXiv:1806.04096 [cs, eess], June 2018. URL http://arxiv.org/abs/1806.04096. arXiv: 1806.04096.

[54] A. Roy, A. Vaswani, A. Neelakantan, and N. Parmar. Theory and Experiments on Vector Quantized Autoencoders. arXiv:1805.11063 [cs, stat], May 2018. URL http://arxiv.org/abs/1805.11063. arXiv: 1805.11063.


[55] Y. E. K. Shaun Barry. "STYLE" TRANSFER FOR MUSICAL AUDIO USING MULTIPLE TIME-FREQUENCY REPRESENTATIONS. page 12, 2018.

[56] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. J. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis, and Y. Wu. Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. arXiv:1712.05884 [cs], Dec. 2017. URL http://arxiv.org/abs/1712.05884. arXiv: 1712.05884.

[57] I. Simon and S. Oore. Performance RNN: Generating Music with Expressive Timing and Dynamics. 2017. URL https://magenta.tensorflow.org/performance-rnn.

[58] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going Deeper with Convolutions. arXiv:1409.4842 [cs], Sept. 2014. URL http://arxiv.org/abs/1409.4842. arXiv: 1409.4842.

[59] C. K. Sønderby, T. Raiko, L. Maaløe, S. K. Sønderby, and O. Winther. Ladder Variational Autoencoders. arXiv:1602.02282 [cs, stat], Feb. 2016. URL http://arxiv.org/abs/1602.02282. arXiv: 1602.02282.

[60] M. Tschannen, O. Bachem, and M. Lucic. Recent Advances in Autoencoder-Based Representation Learning. arXiv:1812.05069 [cs, stat], Dec. 2018. URL http://arxiv.org/abs/1812.05069. arXiv: 1812.05069.

[61] A. Van Den Oord. VQVAE Deepmind reference implementation, Github. URL https://github.com/deepmind/sonnet/blob/master/sonnet/examples/vqvae_example.ipynb.

[62] A. van den Oord, S. Dieleman, and B. Schrauwen. Deep content-based music recommendation. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2643–2651. Curran Associates, Inc., 2013. URL http://papers.nips.cc/paper/5004-deep-content-based-music-recommendation.pdf.

[63] Y. Wang, R. J. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous. Tacotron: Towards End-to-End Speech Synthesis. arXiv:1703.10135 [cs], Mar. 2017. URL http://arxiv.org/abs/1703.10135. arXiv: 1703.10135.

[64] Z. Wang and A. C. Bovik. Mean squared error: Love it or leave it? A new look at Signal Fidelity Measures. IEEE Signal Processing Magazine, 26(1):98–117, Jan. 2009. ISSN 1053-5888. doi: 10.1109/MSP.2008.930649.


Appendix A

Spectrogram Pipeline Chart

Figure A.1: Flow chart representing the spectrogram preprocessing pipeline. Note the lossy 'abs' step, where the phase of the complex-valued spectrogram is thrown away.


Appendix B

Autoencoder Pseudocode

def residual_block(x_in):
    x = ReLU(x_in)
    x = conv2d(x, filters=256, kernel_size=(3, 3), strides=(1, 1))
    x = ReLU(x)
    x = conv2d(x, filters=256, kernel_size=(1, 1), strides=(1, 1))
    return x + x_in

def network(x, K, D, discrete_bottleneck=True):
    # Encoder: two stride-2 convolutions (4x spatial downsampling) and two residual blocks
    x = conv2d(x, filters=256, kernel_size=(4, 4), strides=(2, 2), activation=ReLU)
    x = conv2d(x, filters=256, kernel_size=(4, 4), strides=(2, 2), activation=ReLU)
    x = residual_block(x)
    x = residual_block(x)

    # 1x1 convolution reducing 256 feature maps to D before the bottleneck
    z_e = conv2d(x, filters=D, kernel_size=(1, 1), strides=(1, 1), activation=Linear)

    # Bottleneck: vector quantization (VQVAE) or identity (AE)
    if discrete_bottleneck:
        z = Quantization(z_e, num_embeddings=K)
    else:
        z = z_e

    x = conv2d(z, filters=256, kernel_size=(3, 3), strides=(1, 1), activation=Linear)

    # Decoder: two residual blocks and two stride-2 transposed convolutions
    x = residual_block(x)
    x = residual_block(x)
    x = conv2d_transpose(x, filters=256, kernel_size=(4, 4), strides=(2, 2), activation=ReLU)
    x = conv2d_transpose(x, filters=1, kernel_size=(4, 4), strides=(2, 2), activation=Linear)
    return x

Figure B.1: Autoencoder architecture. Note that the number of output channels (1) is for the spectrogram case, and should be changed to 3 for CIFAR10 to match the number of RGB channels.
