GCT634/AI613: Musical Applications of Machine Learning (Fall 2020)
Music Generation: Part 1
Juhan Nam
  • GCT634/AI613: Musical Applications of Machine Learning (Fall 2020)

    Music Generation: Part 1

    Juhan Nam

  • Introduction

    ● We have focused on analyzing the input audio and extracting certain information or sources
    ○ Audio-to-label: music genre/mood classification and tagging
    ○ Audio-to-score/MIDI: note transcription, chord recognition, beat tracking
    ○ Audio-to-audio: source separation and audio style transfer

  • Introduction

    ● Now we have a very different problem: generating new musical content from scratch or from a given condition
    ○ Label-to-score: music composition and arrangement
    ○ Score-to-MIDI: expressive performance
    ○ MIDI-to-audio: sound synthesis

    [Diagram: Composer → Performer → Musical Instrument]

  • Introduction

    ● But the generative process can also be conducted directly on performance MIDI or audio
    ○ Label-to-MIDI
    ○ Label-to-Audio

    PerformanceRNN, Music Transformer (trained with the MAESTRO dataset)

    WaveNet, WaveGAN (trained with raw audio waveforms)

  • Music Generation

    ● Symbolic music generation
    ○ Generate music in the form of a music score (but mostly MIDI)
    ○ Take a 1D sequence (note events) as input
    ○ Focus on sequential note generation based on a musical language model
    ○ Leverage advances in natural language processing: RNNs, transformers

    ● Audio generation
    ○ Generate waveforms or spectrograms
    ○ Take spectrograms as 2D images or waveforms as a 1D sequence
    ○ Focus on natural sound synthesis
    ○ Leverage high-quality image generation models such as GANs

  • Symbolic Music Generation

    ● Language model in natural language processing
    ○ Predict what comes next in a sentence

    ● Language model in music
    ○ Predict what comes next in a note sequence

    $p(x_t \mid x_1, \dots, x_{t-1})$, where $x_t$ is the input representation vector

    [Figure: language modeling example "The sky is so ___" with candidate next words "blue", "dark", "beautiful"]

  • Symbolic Generation

    ● Once the language model is trained, the joint probability of a sequence can be calculated
    ○ For a sequence $X = (x_1, x_2, \dots, x_{T-1}, x_T)$

    ● Therefore, we can figure out which sequence is more likely than others
    ○ This is used in speech recognition / automatic music transcription to find more sensible sentences / note sequences among the candidates from acoustic models

    $P(X) = P(x_1, x_2, \dots, x_T) = P(x_1)\,P(x_2 \mid x_1)\,P(x_3 \mid x_1, x_2)\cdots P(x_T \mid x_{T-1}, \dots, x_1) = \prod_{t=1}^{T} P(x_t \mid x_{<t})$
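As a toy illustration of this chain-rule factorization, the sketch below scores two short note sequences with a hypothetical bigram model (the vocabulary, transition table, and probabilities are made up for illustration; a real musical language model conditions on the full history):

```python
import numpy as np

# Toy vocabulary of note events and a hypothetical bigram "language model":
# P(x_t | x_{t-1}) stored as a row-stochastic matrix. Real models condition
# on the full history x_{<t}, but the chain-rule bookkeeping is the same.
vocab = ["C4", "E4", "G4", "rest"]
P_start = np.array([0.7, 0.1, 0.1, 0.1])          # P(x_1)
P_next = np.array([
    [0.10, 0.50, 0.30, 0.10],   # after C4
    [0.20, 0.10, 0.60, 0.10],   # after E4
    [0.40, 0.30, 0.10, 0.20],   # after G4
    [0.25, 0.25, 0.25, 0.25],   # after rest
])

def log_prob(sequence):
    """log P(X) = log P(x_1) + sum_t log P(x_t | x_{t-1})."""
    idx = [vocab.index(s) for s in sequence]
    logp = np.log(P_start[idx[0]])
    for prev, cur in zip(idx[:-1], idx[1:]):
        logp += np.log(P_next[prev, cur])
    return logp

# An arpeggio-like sequence scores higher than a less idiomatic one.
print(log_prob(["C4", "E4", "G4"]), log_prob(["rest", "C4", "rest"]))
```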

  • What’s different in music?

    ● Music is polyphonic
    ○ Melody and accompaniment
    ○ A music score is a 1D sequence with a 2D nature: how do we handle the simultaneous notes in the input representation?

  • What’s different in music?

    ● Music is structured in scale, rhythm, and harmony
    ○ Given a key, notes on the scale are more likely to be played than other notes
    ○ Simultaneous notes are arranged in harmony with a chord
    ○ Successive notes are placed with a rhythm pattern

    [Figure: scale, harmony, and rhythm on a score; the time hierarchy of measure, beat, and tick]

  • What’s different in music?

    ● The majority of music pieces have a form
    ○ Repetitions and variations
    ○ AABA or intro-verse-chorus-outro
    ○ 16-bar blues
    ○ Sonata, rondo

    ● Learning the long-term structure (long-term dependency) is a challenge in music generation!
    ○ Likewise in NLP

    Li et al., “The Clustering of Expressive Timing Within a Phrase in Classical Piano Performances by Gaussian Mixture Models”, 2015

  • Symbolic Input Representations

    ● Piano roll: a 2D image or a 1D sequence of multi-hot vectors
    ○ Easy to understand: visualizes music intuitively
    ○ Easy to handle polyphony
    ○ A note is a line of pixels: generative models handle the pixels but not the note
      ■ The generated output will be musically noisy
    ○ Too much redundancy in time
      ■ Time quantization (e.g., to 16th-note steps) can reduce the redundancy (MusicVAE)
      ■ But the quantization is applicable to score MIDI only

    [Figure: a piano roll and its quantization]

  • Symbolic Input Representations

    ● MIDI event (the Magenta format)
    ○ Event types
      ■ Note-on: 128 MIDI pitches
      ■ Note-off: 128 MIDI pitches
      ■ Set-velocity: 32 quantized velocities
      ■ Time-shift: 100 shifts (10 ms to 1 sec)
    ○ The time-shift event compresses sustained note states into a single event
      ■ Greatly reduces the time redundancy
    ○ Easy to handle polyphony
    ○ Fits performance MIDI, but hard to incorporate score information
    ○ All events are encoded as a 388-dimensional one-hot vector (see the index sketch below)
      ■ A typical 30-sec clip might contain about 1200 such one-hot vectors
      ■ No semantic meaning (as opposed to word embeddings)

    This Time with Feeling: Learning Expressive Musical Performance, Sageev Oore, Ian Simon, Sander Dieleman, Douglas Eck, Karen Simonyan, 2018
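A minimal sketch of how such a 388-class event vocabulary could be laid out (128 note-on + 128 note-off + 32 velocity bins + 100 time-shifts); the exact ordering and binning used by Magenta may differ, so treat the index layout below as illustrative only:

```python
# Hypothetical index layout for the 388-class event vocabulary described above:
# 128 note-on + 128 note-off + 32 velocity bins + 100 time-shift bins = 388.
NOTE_ON, NOTE_OFF, SET_VELOCITY, TIME_SHIFT = range(4)

def event_to_index(kind, value):
    if kind == NOTE_ON:        # value: MIDI pitch 0..127
        return value
    if kind == NOTE_OFF:       # value: MIDI pitch 0..127
        return 128 + value
    if kind == SET_VELOCITY:   # value: quantized velocity bin 0..31
        return 256 + value
    if kind == TIME_SHIFT:     # value: shift bin 0..99 (assumed 10 ms steps, 10 ms to 1 s)
        return 288 + value
    raise ValueError(kind)

def one_hot(index, dim=388):
    vec = [0] * dim
    vec[index] = 1
    return vec

# e.g. note-on for middle C (pitch 60), then roughly a 200 ms time-shift
print(event_to_index(NOTE_ON, 60), event_to_index(TIME_SHIFT, 19))  # 60 307
```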

  • Symbolic Input Representations

    ● Music notation parsing
    ○ A structured text sequence
      ■ Bar, chord, tempo, note, and so on
    ○ This rich information can be useful in generating more musical output
      ■ But it may need manual annotations
    ○ No standard method

    [Figure: a score and its REMI representation]

    Pop Music Transformer: Beat-based Modeling and Generation of Expressive Pop Piano Compositions, Yu-Siang Huang, Yi-Hsuan Yang, 2020

  • Musical Language Model Using RNN

    ● PerformanceRNN
    ○ Uses performance MIDI files from the e-Piano competition dataset (the early version of the MAESTRO dataset)
    ○ Data augmentation: tempo change and key transposition
    ○ The event-based MIDI representation (one-hot vector)
      ■ $x_t \in$ {note-on, note-off, set-velocity, time-shift}
    ○ Trained with three layers of LSTMs and a softmax output
      ■ The loss function is the cross-entropy between the softmax output and the one-hot target
      ■ Teacher forcing: the ground-truth output is used as the next input instead of the predicted output during training (see the training sketch below)

    This Time with Feeling: Learning Expressive Musical Performance, Sageev Oore, Ian Simon, Sander Dieleman, Douglas Eck, Karen Simonyan, 2018

    [Figure: autoregressive prediction $p(x_t \mid x_1, \dots, x_{t-1})$, where the inputs $x_1, x_2, \dots$ are shifted by one step relative to the outputs]
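A compact PyTorch sketch of the training step described above: an LSTM language model over the event vocabulary trained with cross-entropy and teacher forcing. Layer sizes, the embedding layer standing in for one-hot inputs, and the dummy data are placeholders, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

VOCAB = 388  # Magenta-style event vocabulary size

class EventLSTM(nn.Module):
    """Three LSTM layers over MIDI events with a softmax output, roughly in
    the spirit of PerformanceRNN (an embedding stands in for the one-hot input)."""
    def __init__(self, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, hidden)
        self.lstm = nn.LSTM(hidden, hidden, num_layers=3, batch_first=True)
        self.out = nn.Linear(hidden, VOCAB)

    def forward(self, x, state=None):
        h, state = self.lstm(self.embed(x), state)
        return self.out(h), state

model = EventLSTM()
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Teacher forcing: the ground-truth event at step t is fed as the input,
# and the model is trained to predict the event at step t+1.
events = torch.randint(0, VOCAB, (8, 101))        # (batch, T+1) dummy event indices
inputs, targets = events[:, :-1], events[:, 1:]
logits, _ = model(inputs)
loss = loss_fn(logits.reshape(-1, VOCAB), targets.reshape(-1))
optim.zero_grad()
loss.backward()
optim.step()
```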

  • Music Generation Using Musical Language Model

    ● Generating output from the trained MLM
    ○ Sample from the softmax distribution
    ○ The sampled output is used as the input at the next step

    ● Softmax temperature (a sampling sketch follows the formula below)
    ○ $\tau > 1$: $P_\tau$ becomes more uniform
      ■ Thus more diverse outputs are generated
    ○ $\tau < 1$: $P_\tau$ becomes more spiky
      ■ Thus less diverse outputs are generated

    [Figure: sampled events $\hat{x}_2, \hat{x}_3, \dots$ are fed back as inputs; each step samples from the softmax output]

    $P_\tau(w) = \dfrac{\exp(S_w/\tau)}{\sum_{w'} \exp(S_{w'}/\tau)}$

    [Figure: softmax distributions for $\tau > 1$, $\tau = 1$, and $\tau < 1$]
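A small sketch of temperature sampling from the softmax output (function and variable names are mine; `scores` corresponds to the unnormalized scores $S_w$ above):

```python
import numpy as np

def sample_with_temperature(scores, tau=1.0, rng=np.random.default_rng(0)):
    """Sample an event index from softmax(scores / tau).
    tau > 1 flattens the distribution (more diverse output);
    tau < 1 sharpens it (less diverse, more repetitive output)."""
    z = scores / tau
    z = z - z.max()                       # numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return rng.choice(len(p), p=p)

scores = np.array([2.0, 1.0, 0.5, -1.0])  # unnormalized model scores S_w
print([sample_with_temperature(scores, tau) for tau in (0.5, 1.0, 2.0)])
```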

    A funny animation about auto-regressive models: https://twitter.com/i/status/1327775912352493568


  • Evaluating Musical Language Model

    ● Objective evaluation
    ○ Perplexity (PPL): the inverse probability of the corpus, normalized by its length
    ○ Equal to the exponential of the cross-entropy loss (see the sketch below)
    ○ Lower PPL is better in NLP, but is it in music too?

    ● Listening test
    ○ Demo: https://magenta.tensorflow.org/performance-rnn
    ○ The result sounds natural in the short term, but note patterns are not coherent and keep diverging: the long-term dependency issue!
    ○ Need better models capable of learning a wider musical context

    $PPL(X) = \left(\prod_{t=1}^{T} \frac{1}{P_{LM}(x_t \mid x_{<t})}\right)^{1/T}$

    (“more predictable” might be “less creative”?)
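Since PPL is the exponential of the average negative log-likelihood, it can be computed directly from per-event log probabilities, as in this small sketch:

```python
import numpy as np

def perplexity(event_log_probs):
    """PPL = exp(mean negative log-likelihood) = exp(cross-entropy loss),
    where `event_log_probs` holds log P_LM(x_t | x_<t) for each step."""
    return float(np.exp(-np.mean(event_log_probs)))

# A model that assigns probability 0.25 to every event has PPL 4.
print(perplexity(np.log([0.25] * 100)))  # 4.0
```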


  • Generative Model

    ● Given a dataset of examples $X = \{x_i\}$, estimate $p(X)$ and generate new samples from $p(X)$
    ○ Density estimation: a type of unsupervised learning
    ○ Remember that the GMM is a generative model

    Training data $\sim p_{data}(X)$

    $p_{model}(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)$

    Generated samples $\sim p_{model}(X)$
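A minimal NumPy sketch of sampling from a fitted GMM with the density above (the mixture weights, means, and covariances here are arbitrary toy values):

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy 2-D GMM with K = 3 components: p_model(x) = sum_k pi_k N(x | mu_k, Sigma_k)
pi = np.array([0.5, 0.3, 0.2])
mu = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 2.0]])
Sigma = np.stack([np.eye(2), 0.5 * np.eye(2), np.diag([1.0, 0.2])])

def sample_gmm(n):
    """Ancestral sampling: pick component k ~ pi, then draw x ~ N(mu_k, Sigma_k)."""
    ks = rng.choice(len(pi), size=n, p=pi)
    return np.array([rng.multivariate_normal(mu[k], Sigma[k]) for k in ks])

print(sample_gmm(5))
```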

  • Generative Model

    ● Given a dataset of examples $X = \{x_i\}$, estimate $p(X)$ and generate new samples from $p(X)$
    ○ If the data is high-dimensional (image, audio, or a sequence), we need more representational power, so we use a deep neural network

    Training data $\sim p_{data}(X)$   Generated samples $\sim p_{model}(X)$

  • Auto-Encoder

    ● The auto-encoder is an unsupervised learning model that can learn structure within high-dimensional input
    ○ Using an encoder-decoder CNN or an encoder-decoder RNN
    ○ The latent vector can be reconstructed into the high-dimensional data

    ● But can we use the AE as a generative model?
    ○ Randomly sample a vector in the latent space and generate data from it?

    [Figure: an encoder-decoder CNN and an encoder-decoder RNN; can we sample a latent vector and decode it?]

  • Auto-Encoder

    ● It can reconstruct the input, but the latent space may not be continuous
    ○ The distribution is not dense: there are gaps between the clusters
    ○ The output generated from the gaps (by sampling or interpolation) will be unrealistic

    Source: https://towardsdatascience.com/intuitively-understanding-variational-autoencoders-1bfe67eb5daf


  • Variational Auto Encoder (VAE)

    ● Model the latent space using randomly sampled latent vectors with a probabilistic model
    ○ Make the encoder yield two vectors, for the mean and the standard deviation
    ○ Randomly sample a latent vector using the mean and standard deviation
    ○ Reconstruct the input from the random vector

    [Figure: the encoder outputs a mean $\mu$ and a standard deviation $\sigma$; the generator reconstructs the input from a random sample $= \mu + \sigma z$, where $z \sim \mathcal{N}(0, I)$]

  • Variational Auto Encoder (VAE)

    ● Optimize the network using maximum likelihood estimation
    ○ The estimation is intractable, so an approximate method is used:
      ■ Maximize the lower bound of the log-likelihood
    ○ This ends up minimizing two terms: the reconstruction error and the KL divergence between the Gaussian distributions (a sketch of the loss follows below)

    $l(W; x) = \lVert x - \hat{x} \rVert^2 + KL\big(\mathcal{N}(\mu(x), \sigma(x)) \,\|\, \mathcal{N}(0, I)\big)$

    Reconstruction error + KL divergence: the KL term makes the distribution of latent vectors have zero mean and unit variance

    Auto-Encoding Variational Bayes, Diederik Kingma, Max Welling, 2014
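A hedged PyTorch sketch of the two loss terms above, using the closed-form KL divergence between a diagonal Gaussian and the unit Gaussian (the function name and the log-variance parameterization are my choices, not notation from the slides):

```python
import torch

def vae_loss(x, x_hat, mu, log_var):
    """Reconstruction error plus KL(N(mu, sigma^2) || N(0, I)) for a diagonal
    Gaussian posterior, using the standard closed-form KL term."""
    recon = torch.sum((x - x_hat) ** 2)
    kl = 0.5 * torch.sum(mu ** 2 + log_var.exp() - log_var - 1.0)
    return recon + kl

# Sanity check: with mu = 0 and log_var = 0, the KL term vanishes.
x, x_hat = torch.randn(4, 10), torch.randn(4, 10)
mu, log_var = torch.zeros(4, 2), torch.zeros(4, 2)
print(vae_loss(x, x_hat, mu, log_var))
```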

  • Variational Auto Encoder (VAE)

    ● Re-parameterization
    ○ Enables gradient flow by detouring around the sampling process (see the sketch below)

    [Figure: the encoder outputs a mean $\mu$ and a standard deviation $\sigma$; a random sample $z \sim \mathcal{N}(0, I)$ is multiplied by $\sigma$ and added to $\mu$ before the generator]
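A minimal sketch of the re-parameterization step itself, assuming the encoder outputs a mean and a log-variance:

```python
import torch

def reparameterize(mu, log_var):
    """z = mu + sigma * eps with eps ~ N(0, I); sampling eps outside the model
    keeps the path from the encoder outputs to the loss differentiable."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps
```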

  • Variational Auto Encoder (VAE)

    ● Distribution in the latent space
    ○ By using both the KL divergence and the reconstruction error, the space can be discriminative as well as continuous

    [Figure: latent spaces trained with the reconstruction error only, the KL divergence only, and both]
    Source: https://towardsdatascience.com/intuitively-understanding-variational-autoencoders-1bfe67eb5daf

  • Variational Auto Encoder (VAE)

    ● Generate data by taking a random vector from the unit Gaussian
    ○ The data manifold is generated by varying $z$

    [Figure: the generator maps random samples $z \sim \mathcal{N}(0, I)$ from a 2-D latent space to data; one latent dimension varies circle shape / smile, the other varies tilt / pose]

    Auto-Encoding Variational Bayes, Diederik Kingma, Max Welling, 2014

  • Variational Auto Encoder (VAE)

    ● A recurrent VAE is also possible
    ○ The language model in the decoder (generator) is conditioned on the latent vector, which captures dependency within the entire sentence

    [Figure: an encoder RNN reads the sentence “I love you” into a mean $\mu$ and standard deviation $\sigma$; a decoder RNN reconstructs the sentence from the sampled latent vector]

    Generating Sentences from a Continuous Space, Samuel Bowman, Luke Vilnis, Oriol Vinyals, Andrew Dai, Rafal Jozefowicz, Samy Bengio, 2016


  • MusicVAE

    ● Uses the encoder-decoder RNN architecture

    ● Encoder: bidirectional RNN
    ○ The two hidden states at both ends are concatenated

    ● Decoder: hierarchical RNN (a sketch follows the citation below)
    ○ Conductor RNN: learns high-level dependency at the measure level
    ○ Language-model RNN: the condition from the conductor RNN is concatenated with the previous output as the input at the next step

    A Hierarchical Latent Vector Model for Learning Long-Term Structure in Music, Adam Roberts, Jesse Engel, Colin Raffel, Curtis Hawthorne, Douglas Eck, 2018
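The sketch below is a rough approximation of the hierarchical decoder idea (a conductor RNN over measures, a note-level RNN within each measure). All sizes are placeholders, and several details of the published MusicVAE decoder (e.g., how the decoder state is re-initialized per measure) are simplified:

```python
import torch
import torch.nn as nn

class HierarchicalDecoder(nn.Module):
    """Rough sketch of the hierarchical decoder idea: a conductor LSTM emits one
    embedding per measure from the latent vector z, and a note-level LSTM generates
    events within each measure conditioned on that embedding (teacher-forced here)."""
    def __init__(self, z_dim=256, hidden=256, vocab=130,
                 n_measures=16, steps_per_measure=16):
        super().__init__()
        self.n_measures, self.steps = n_measures, steps_per_measure
        self.conductor = nn.LSTM(z_dim, hidden, batch_first=True)
        self.note_rnn = nn.LSTM(vocab + hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, z, prev_events):
        # z: (batch, z_dim); prev_events: (batch, n_measures * steps, vocab)
        measure_emb, _ = self.conductor(z.unsqueeze(1).repeat(1, self.n_measures, 1))
        logits = []
        for m in range(self.n_measures):
            cond = measure_emb[:, m:m + 1].repeat(1, self.steps, 1)
            seg = prev_events[:, m * self.steps:(m + 1) * self.steps]
            h, _ = self.note_rnn(torch.cat([seg, cond], dim=-1))
            logits.append(self.out(h))
        return torch.cat(logits, dim=1)   # (batch, n_measures * steps, vocab)

dec = HierarchicalDecoder()
z = torch.randn(2, 256)
prev = torch.zeros(2, 16 * 16, 130)
print(dec(z, prev).shape)  # torch.Size([2, 256, 130])
```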

  • MusicVAE

    ● Training dataset
    ○ The Lakh MIDI dataset: multi-track score MIDI
    ○ Uses piano roll, but notes are quantized to 16th-note events (a quantization sketch follows below)
    ○ One event is a 130-dimensional vector: 128 pitches, note-off, rest
    ○ The input length of the RNN ($T$) is 256, which corresponds to 16 measures (bars)

    [Figure: quantized score MIDI]
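A tiny sketch of the 16th-note quantization step mentioned above: note onsets in seconds are snapped to 16th-note steps given a tempo, so one RNN step corresponds to one 16th note (the function name and example values are mine):

```python
# Map note onset times (seconds) to 16th-note steps given a tempo, so that one
# RNN step corresponds to one 16th note.
def quantize_onsets(onsets_sec, bpm=120):
    step_sec = 60.0 / bpm / 4            # duration of one 16th note
    return [round(t / step_sec) for t in onsets_sec]

print(quantize_onsets([0.0, 0.26, 0.49, 1.02]))  # [0, 2, 4, 8] at 120 BPM
```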

  • MusicVAE

    ● Learning the latent space of long-term music sequences
    ○ A latent vector corresponds to a “well-structured” music segment
    ○ Beat Blender
      ■ A continuous move in the latent space generates a gradually changing music sequence
    ○ Melody Mixer
      ■ Interpolates between two different melodies

    ● Demo
    ○ https://magenta.tensorflow.org/music-vae

    [Figure: Melody Mixer (interpolation) and Beat Blender (latent space exploration)]


  • Issues with RNN

    ● Sequential computation inhibits parallelization (unlike CNNs)

    ● No explicit modeling of long- and short-range dependencies

    ● Information bottleneck in the encoder

    [Figure: an encoder-decoder RNN translating “I love you”; the fixed-size encoder state forms an information bottleneck, and long- and short-range dependencies are hard to model]

  • Attention Mechanism

    ● Direct connections between words in the encoder and decoder
    ○ A weighted sum of the input is concatenated to each of the output words
    ○ The weights are computed from the one-to-one correspondences
    ○ The alignment between words in the encoder and decoder is obtained for free

    [Figure: attention between the Korean source “난 네가 진짜 좋아” (“I really like you”) and the English output; a dot product, softmax, weighted sum, and concatenation produce each output word, and the attention weights give the word alignment]

  • Self-Attention

    ● Direct connections can be made between elements within a sequence
    ○ Each input element is transformed into a key, query, and value via linear transforms (a sketch follows the citation below)

    [Figure: a self-attention layer over inputs x1…x4; each element yields a key, query, and value; dot products, a softmax, and a weighted sum produce the output]

    $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$, where $\sqrt{d_k}$ is a scaling factor

    Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, 2017
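A NumPy sketch of the scaled dot-product self-attention formula above; the projection matrices are random placeholders and only a single head is shown:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, where Q, K, V are
    linear transforms of the same input sequence X of shape (T, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))   # (T, T) pairwise attention weights
    return weights @ V                          # re-represented sequence, shape (T, d_v)

rng = np.random.default_rng(0)
T, d = 4, 8
X = rng.normal(size=(T, d))
out = self_attention(X, *(rng.normal(size=(d, d)) for _ in range(3)))
print(out.shape)  # (4, 8)
```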

  • Self-Attention

    ● Multi-head attention
    ○ Multiple independent keys, queries, and values that capture different types of dependency in the sequence

    [Figure: two attention heads over inputs x1…x4 whose outputs are concatenated]

  • Self-Attention

    ● “Re-representation” of the input
    ○ Based on interactions between input elements

    ● Constant “path length” between any two positions (unlike RNN)
    ○ Permutation-invariant
    ○ Need to add positional information for sequence modeling

    ● Trivial to parallelize
    ○ Effective use of GPUs

    Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, 2017

  • Transformer

    ● Positional encoding is added to the input
    ○ Self-attention is permutation-invariant

    ● A single module is composed of
    ○ A multi-head attention layer
    ○ A position-wise feed-forward layer
    ○ Skip connections and normalization layers
      ■ The skip connection carries the position information

    ● Masking is added to the attention in the decoder
    ○ For causal self-attention (a mask sketch follows the citation below)

    Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, 2017
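A small sketch of the causal mask used in the decoder: positions above the diagonal are set to negative infinity so that, after the softmax, a position cannot attend to future positions:

```python
import numpy as np

def causal_mask(T):
    """Mask added to the attention logits before the softmax: entries above the
    diagonal are -inf, so position t can only attend to positions <= t."""
    upper = np.triu(np.ones((T, T)), k=1)
    return np.where(upper == 1, -np.inf, 0.0)

print(causal_mask(4))
```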

  • Transformer

    ● Used in the state-of-the-art models in natural language processing
    ○ Machine translation
    ○ Language modeling
    ○ …

    ● Used in computer vision as well
    ○ Image classification
    ○ Image generation
    ○ …

    [Figure: Vision Transformer and Image Transformer]

  • Music Transformer

    ● How about applying transformer to music generation?

    Source: https://magenta.tensorflow.org/music-transformer

    [Audio examples: a primer (“initial input”) continued by PerformanceRNN and by a vanilla Transformer]

  • Music Transformer

    ● What’s wrong?

    [Figure: a primer (“initial input”) continued by PerformanceRNN and a vanilla Transformer; beyond the length the model was trained on, the output at the “unseen” positions goes completely wrong]

    Source: https://magenta.tensorflow.org/music-transformer


  • Music Transformer

    ● Use relative positions instead of absolute positions
    ○ Musical patterns can be translation-invariant (like convolution)

    ● The relative position encoding
    ○ Needs pair-wise distances (2D, compared to 1D for the absolute position)
    ○ A 3D tensor is necessary for the positional encoding: too much memory!

    Music Transformer, Cheng-Zhi Anna Huang, et al, 2019

    Relative position encoding

  • Music Transformer

    ● Skewing to reduce the memory for relative positions (a sketch follows the citation below)

    Music Transformer, Cheng-Zhi Anna Huang, et al, 2019
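Below is a NumPy sketch of the skewing idea as I read it from the paper's figure: the (T, T) matrix of query-relative-embedding products, whose columns are indexed by relative distance, is padded, reshaped, and sliced so that its entries line up with absolute key positions, avoiding a (T, T, T) tensor. The padding/reshape details may differ from the reference implementation:

```python
import numpy as np

def skew(rel_logits):
    """Turn the (T, T) matrix of Q·E_r products, whose columns are indexed by
    relative distance -(T-1)..0, into a matrix indexed by absolute key position.
    Entries above the diagonal are meaningless and are removed by the causal mask."""
    T = rel_logits.shape[0]
    padded = np.pad(rel_logits, [(0, 0), (1, 0)])   # prepend a column of zeros
    return padded.reshape(T + 1, T)[1:, :]          # reshape and drop the first row

Q_Er = np.arange(9.0).reshape(3, 3)   # dummy relative logits for T = 3
print(skew(Q_Er))
# Row i, column j now holds the product for relative distance j - i (for j <= i).
```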

  • Music Transformer

    ● Consistent generation!

    Source: https://magenta.tensorflow.org/music-transformer

    [Audio examples: the same primer (“initial input”) continued by the vanilla Transformer and by the Music Transformer]

    More music examples and a great visualization of real-time self-attention: https://magenta.tensorflow.org/music-transformer
