MELODY EXTRACTION ON VOCAL SEGMENTS USING MULTI-COLUMN DEEP NEURAL NETWORKS
Sangeun Kum, Changhyun Oh, Juhan Nam
[email protected]

Music and Audio Computing Lab., Korea Advanced Institute of Science and Technology

11 Aug. 2016

Thanks for the introduction. Hello, I'm Sangeun Kum. This is the final presentation of ISMIR 2016, and I'll talk about my research.

Melody extraction from polyphonic music
Definition: automatically obtain the f0 curve of the predominant melodic line drawn from multiple sources [1]
[1] Bittner, R. M., Salamon, J., Essid, S., & Bello, J. P. Melody extraction by contour classification. In Proc. ISMIR (pp. 500-506).

Melody extraction automatically obtains the fundamental frequency (f0) curve of the predominant melodic line from polyphonic music. This is an example of melody extraction using our proposed method. Let's follow the pitch line while listening to "Halo." Pretty good, right? As you know, all demos are perfect.

Melody extraction algorithms

[2] Salamon, Justin, et al. "Melody extraction from polyphonic music signals: Approaches, applications, and challenges." IEEE Signal Processing Magazine 31.2 (2014): 118-134.

Salience-based approaches
Source separation-based approaches
Data-driven approaches

This table is from Salamon's review paper [2]. Various algorithms have been proposed so far. They can be broadly classified into three categories:

Salience-based approaches use a salience function to estimate the salience of each possible pitch value. Source separation-based approaches isolate the melody source from the mixture. These two approaches account for the majority of melody extraction algorithms.

On the other hand, the data-driven approach has rarely been attempted.


Support vector machine note classifier

Pitch labels: 60 MIDI notes (G2~F#7)
Resolution = 1 semitone
Loses detailed information about singing styles, e.g., vibrato and transition patterns

Melody extraction algorithms: data-driven approaches
Pipeline: data → Support Vector Machine → posteriorgram (60 MIDI notes, G2~F#7) → HMM
[3] Ellis, Daniel P. W., and Graham E. Poliner. "Classification-based melody transcription." Machine Learning 65.2 (2006).

In 2006, Ellis and Poliner proposed a fully data-driven method using a Support Vector Machine to classify 60 MIDI notes from a spectrogram.

The resolution of the output is one semitone, so it can lose detailed information about singing styles such as vibrato. And last year, Bittner et al. proposed an algorithm that uses a classifier to predict melody contours [1]. However, the data-driven approach is still rarely attempted.


Addressed issues
1. Deep neural network
2. Classification-based approach: high resolution
3. Data augmentation
4. Singing voice detector

Therefore, we addressed several issues. No one had attempted to use deep neural networks to extract melody. Deep learning is a really hot keyword in research these days (although the deep learning session started after the banquet). Deep learning has proven to perform well given sufficient labeled data and computing power, so we tried to use it.



Our deep learning method is based on a classification approach, so we tried to maintain both high accuracy and high resolution.



The third point is that few melody-labeled public datasets are available and manual labeling is laborious. Therefore, it is desirable to augment existing datasets.



The last one is a singing voice detector, to obtain high overall accuracy.

Deep neural networks: Configuration

[Figure: multi-frame spectrogram → input layer → hidden layers (512-512-256) → output layer (D2~F#5)]

Input: multi-frame spectrogram (train: singing voice frames; test: all frames)
Hidden layers: 512-512-256
Output range: D2~F#5; output layer sizes: 41, 81, 161
Nonlinear function: ReLU
Optimizer: RMSprop
Output layer: sigmoid
Dropout: 20%
Implemented with Keras


We configure the DNN as shown. We train the DNNs using only the voiced frames of the training set, and we take a multi-frame spectrogram as input to capture contextual information. The pitch labels cover D2 to F#5 with different resolutions.

For the output layer, we use the sigmoid function instead of the softmax function because the sigmoid worked slightly better in our experiments. We optimize the objective function using RMSprop with 20% dropout on all hidden layers to avoid overfitting to the training set. For fast computation, we run the code using Keras, a deep learning library in Python, on a computer with two GPUs.
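As a rough sketch, one DNN column under this configuration might look like the following (Keras 1.x-style API; the layer sizes, ReLU, sigmoid output, RMSprop, and 20% dropout follow the slide, while the binary cross-entropy loss and the 11 x 256 input size are assumptions):

    # A minimal sketch of one DNN column; loss and input size are assumptions.
    from keras.models import Sequential
    from keras.layers import Dense, Dropout

    def build_column(input_dim=11 * 256, num_labels=41):
        """One column: 41 labels = 1-semitone resolution over D2~F#5."""
        model = Sequential()
        model.add(Dense(512, activation='relu', input_dim=input_dim))
        model.add(Dropout(0.2))
        model.add(Dense(512, activation='relu'))
        model.add(Dropout(0.2))
        model.add(Dense(256, activation='relu'))
        model.add(Dropout(0.2))
        # Sigmoid output: it worked slightly better than softmax in our tests.
        model.add(Dense(num_labels, activation='sigmoid'))
        model.compile(optimizer='rmsprop', loss='binary_crossentropy')
        return model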


Motivation: classification accuracy & pitch resolution
[Figure: (a) DNN columns with resolutions res_1, res_2, res_4; (b) validation accuracy per resolution]
high pitch resolution vs. high classification accuracy (trade-off)
[4] Ciregan, Dan, Ueli Meier, and Jürgen Schmidhuber. "Multi-column deep neural networks for image classification." Computer Vision and Pattern Recognition (CVPR), 2012.
[5] Agostinelli, Forest, Michael R. Anderson, and Honglak Lee. "Adaptive multi-column deep neural networks with application to robust image denoising." Advances in Neural Information Processing Systems, 2013.

Then we checked the validation accuracy. Figure (b) shows the classification accuracy of each DNN with a different resolution. You can see that as the resolution increases, the accuracy drops quite significantly. There is a trade-off between pitch resolution and classification accuracy.

But we need both. So, in order to take advantage of both, we combine the outputs of each DNN.

Motivation: multi-column DNN [4][5]

The MCDNN was originally devised as an ensemble method to improve the performance of DNNs for image classification [4]. Several deep neural columns become experts on inputs in different ways; therefore, by averaging their predictions, we can decrease the errors.

It was applied to image denoising as well [5]. In that approach, each column was trained on a different type of noise, and the outputs were weighted to handle the noise types.

Our proposed model sits halfway between these two approaches.


Proposed method: architecture of the multi-column DNN (MCDNN)

res_1: 1 semitone, e.g., 40, 41, 42, …
res_2: 0.5 semitone, e.g., 40, 40.5, 41, …
res_N: 1/N semitone

This is the architecture of our proposed method. By using the multi-column DNN, our model produces a finer pitch resolution more accurately.

Each DNN column takes multi-frame spectrogram input to capture contextual information from neighboring frames, and each column predicts pitch labels at a different resolution. The lowest resolution is 1 semitone; each subsequent column doubles the resolution. Given the outputs of the columns, we compute the combined posterior probability.

[Figure: column outputs with 41 labels (1 semitone), 81 labels (0.5 semitone), and 161 labels (0.25 semitone), each expanded to 161 labels before combination]

The pitch predictions with lower resolutions are expanded by replicating each element so that the output sizes are the same for all columns. Mathematically, we multiply all the probabilities together, which corresponds to summing the log-likelihoods of the predictions.
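A minimal NumPy sketch of this combination step (replicate-then-multiply is as described above; the exact edge alignment when trimming to the 161-label grid is an assumption):

    import numpy as np

    def combine_posteriors(p1, p2, p4):
        """Combine column posteriors: p1 (T, 41), p2 (T, 81), p4 (T, 161)."""
        # Replicate coarse labels so all columns share the finest grid,
        # then trim to 161 bins (exact alignment is an assumption).
        p1_up = np.repeat(p1, 4, axis=1)[:, :161]
        p2_up = np.repeat(p2, 2, axis=1)[:, :161]
        # Multiplying probabilities = summing the columns' log-likelihoods.
        return p1_up * p2_up * p4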



For temporal smoothing, we use Viterbi decoding, and we choose the singing voice frames using the singing voice detector (SVD). Finally, we obtain the melody pitch contour.

Training Datasets

RWC (100 songs)

RWC +1 semitone

RWC -1 semitone

RWC +2 semitone

RWC -2 semitone

Data augmentation
[Figure: training data / test data → MCDNN → melody]
[6] Goto, Masataka, et al. "RWC Music Database: Popular, Classical and Jazz Music Databases." ISMIR, Vol. 2, 2002.
[7] Bittner, Rachel M., et al. "MedleyDB: A Multitrack Dataset for Annotation-Intensive MIR Research." ISMIR, 2014.


We use the RWC pop music database as our main training set.

Also, in order to enlarge the training set, we augment it by applying pitch shifts of ±1 and ±2 semitones.
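A sketch of this augmentation step; using librosa's pitch shifter is an assumption (the slide does not name a tool), and the ground-truth labels must be shifted by the same amount:

    import librosa
    import numpy as np

    def augment_track(y, sr, midi_labels):
        """Yield the original track plus ±1 and ±2 semitone shifted copies."""
        yield y, np.asarray(midi_labels)
        for n in (-2, -1, 1, 2):
            y_shift = librosa.effects.pitch_shift(y, sr=sr, n_steps=n)
            # Shift the ground-truth pitch labels (in semitones) to match.
            yield y_shift, np.asarray(midi_labels) + n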

Training Datasets

RWC (100 songs) + MedleyDB (60 songs)


The RWC database includes only pop music, and melody contours tend to have different styles depending on genre. So, for genre diversity and to avoid overfitting, we use 60 vocal tracks from the MedleyDB dataset as an additional training set.

Temporal smoothing by HMM: Viterbi decoding


We conduct Viterbi decoding based on a hidden Markov model for temporal smoothing.

Temporal smoothing by HMM: Viterbi decoding

Bayes' theorem: P(x|q) ∝ P(q|x) / P(q), i.e., the DNN posterior divided by the prior gives a scaled likelihood
Viterbi decoding: q* = argmax over paths of ∏_t P(q_t | q_{t-1}) · P(x_t | q_t), i.e., transition × scaled likelihood per frame

[3] Ellis, Daniel P. W., and Graham E. Poliner. "Classification-based melody transcription." Machine Learning 65.2 (2006).

We follow Ellis and Poliner's steps [3]. We estimate the prior and the transition matrix from the ground truth of the training set, and then we use the DNN predictions over whole tracks as posterior probabilities.
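A minimal Viterbi decoder over the frame-wise outputs might look like this, assuming posterior (T, N), prior (N,), and transition (N, N) arrays counted from the training labels; dividing the posterior by the prior gives the scaled likelihood from Bayes' rule:

    import numpy as np

    def viterbi(posterior, prior, transition, eps=1e-10):
        """Decode the most likely pitch-label path from frame-wise posteriors."""
        T, N = posterior.shape
        # Scaled log-likelihood: log posterior minus log prior (Bayes' rule).
        log_lik = np.log(posterior + eps) - np.log(prior + eps)
        log_trans = np.log(transition + eps)
        delta = np.log(prior + eps) + log_lik[0]   # initial scores
        back = np.zeros((T, N), dtype=int)          # backpointers
        for t in range(1, T):
            scores = delta[:, None] + log_trans     # all predecessor scores
            back[t] = scores.argmax(axis=0)
            delta = scores.max(axis=0) + log_lik[t]
        path = np.zeros(T, dtype=int)
        path[-1] = delta.argmax()
        for t in range(T - 2, -1, -1):              # trace back the best path
            path[t] = back[t + 1, path[t + 1]]
        return path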

Singing voice detection: Energy-based approach

[Figure: pipeline with HMM and SVD blocks]
Spectral energy (200~1800 Hz), the band where the singing voice level is high
The sum is normalized by the median energy in the band.


The DNN is trained with only voiced frames for pitch classification. Therefore, a singing voice detection step is necessary for the test phase.

However, singing voice detection is itself a challenging task and not our main concern in this paper, so we use a simple energy-based singing voice detector.
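A sketch of such an energy-based detector, assuming a magnitude spectrogram of shape (bins, frames) with per-bin center frequencies; the decision threshold here is only illustrative:

    import numpy as np

    def detect_voice(spec, freqs, threshold=1.0):
        """Energy-based singing voice detection on a (bins, frames) spectrogram."""
        band = (freqs >= 200) & (freqs <= 1800)        # band with high vocal energy
        energy = spec[band].sum(axis=0)                # per-frame band energy
        energy = energy / (np.median(energy) + 1e-10)  # normalize by the median
        return energy > threshold                      # True = frame counted as voiced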


The classification accuracy: multi-frame spectrogram
[Figure: classification accuracy on the validation set]

Results

This is the results part. As I mentioned, our model takes a multi-frame spectrogram to capture contextual information.

To figure out the optimal size, we experimented with multi-frame inputs to the DNN, where the input data were taken from neighboring spectrogram frames.

The accuracy increases up to 11 frames and then converges to a certain level. This is expected because pitch contours usually have continuous curve patterns, and these temporal features can be captured better by taking multiple frames. For the following experiments, we fix the input size to 11 frames.
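A sketch of how the 11-frame inputs can be assembled from a (bins, frames) spectrogram; the edge-padding strategy is an assumption:

    import numpy as np

    def multiframe_inputs(spec, context=5):
        """Stack each frame with `context` neighbors on each side (11 total)."""
        # Pad the edges so every frame gets a full window.
        padded = np.pad(spec, ((0, 0), (context, context)), mode='edge')
        n_frames = spec.shape[1]
        window = 2 * context + 1
        return np.stack([padded[:, t:t + window].ravel() for t in range(n_frames)])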


- RWC (100 songs)
- RWC (100 songs) + pitch-shifted RWC (100 x 4 songs)
- RWC (100 songs) + pitch-shifted RWC (400 songs) + MedleyDB (60 songs)

Classification accuracy: Data Augmentation


This figure shows the classification accuracy for varying pitch resolutions when the pitch-shifted RWC data and the MedleyDB data are added to the training pool in turn.

Overall, the accuracy increases by 2 to 3% with the additional sets.

Temporal smoothing by HMM: Performance of smoothing


This table shows the results as performance increments after applying Viterbi decoding for the 1-2-4 multi-column DNN on the test sets. It helps capture long-term temporal dependencies.

A case example of melody extraction on an opera song (ADC2004)

SCDNN (res=4)

Here we verify it by illustrating three examples from different models. We selected an opera song from the ADC2004 dataset because it has dynamic pitch motions such as high pitches and strong vibrato.

This one is from the single-column DNN (SCDNN) with a pitch resolution of 4, trained only with the RWC dataset.

A case example of melody extraction on an opera song (ADC2004)
SCDNN (res=4) + data augmentation

This one is from the same SCDNN but trained with the additional data. Compared with the first model, the additional songs help with tracking the vibrato, but the second model still misses the whole excursion.


A case example of melody extraction on an opera song (ADC2004)
1-2-4 MCDNN + data augmentation

The right one is from the 1-2-4 multi-column DNN. With the additional resolutions, the multi-column DNN makes a further improvement, tracking the pitch contours quite precisely.

ADC2004 [8]: 12 songs (rock, R&B, pop, jazz, opera)
MIREX05 [9]: 25 songs in total; 13 songs used (9 vocal songs + 4 instrumental songs)
MIR-1k [10]: 1000 vocal songs
[8,9] http://labrosa.ee.columbia.edu/projects/melody
[10] https://sites.google.com/site/unvoicedsoundseparation/mir-1k

Evaluation: Test Datasets

Evaluation

We evaluate our proposed model on three public datasets using mir_eval.
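Scoring one track with mir_eval looks roughly like this (array names are placeholders; times in seconds, frequencies in Hz, with 0 Hz marking unvoiced frames):

    import mir_eval

    def score_track(ref_time, ref_freq, est_time, est_freq):
        """Standard melody metrics; unvoiced frames are marked with 0 Hz."""
        scores = mir_eval.melody.evaluate(ref_time, ref_freq, est_time, est_freq)
        return (scores['Raw Pitch Accuracy'],
                scores['Raw Chroma Accuracy'],
                scores['Overall Accuracy'])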


Single-column vs. multi-column: raw / chroma accuracy
Test set — Case 1: all songs (including songs without vocals); Case 2: songs with singing voice

[Figure: raw and chroma accuracy bars for Case 1 and Case 2; the voiced frames are assumed to be perfectly detected]

Because our model handles singing voice only, we evaluate it on all songs (Case 1) and on songs with singing voice separately (Case 2).

We also assumed the voiced frames are perfectly detected, to verify the performance of the classifier. As you can see by comparing the blue and yellow bars, the results of the multi-column DNN are better than those of the SCDNN. Also, comparing Cases 1 and 2, the MCDNN increases the accuracy on the sets with singing voice. This indicates that our model is specialized for songs with singing voice.


Comparison to State-of-the-art Methods: ADC2004


We compare our proposed method, using the energy-based SVD, with state-of-the-art algorithms, all of which are based on pitch salience methods.

This is the result on the ADC2004 dataset. Unfortunately, the performance is poor when the test set includes all songs. However, after excluding the non-vocal songs, the result is not bad.

Comparison to State-of-the-art Methods: MIREX05


This is the result on the MIREX05 dataset.

Yes! The accuracies are comparable to some of the algorithms when the test sets are restricted to songs with vocals.


Summary

multi-frame spectrogram
data augmentation
multi-column DNN
HMM-based smoothing

Limitation & Future work

works only for singing voice melody
singing voice detection
replace HMM with RNN

Conclusion

To summarize, in this paper we proposed a novel data-driven melody extraction algorithm using multi-column deep neural networks.

We showed how the data-driven approach can be improved by different settings of the model, such as data augmentation, multi-column DNN, etc.


Limitation & Future work


works only for singing voice melody
singing voice detection
[Table: multi-column DNN (1-2-4) results on songs with vocals vs. songs with vocals & non-vocals]

The limitation of this model is that it works well only for singing voice, because we trained it only with vocal songs. However, this also indicates that our model can be extended to a general melody extractor if a sufficient number of instrumental pieces are included in the training sets.

Since we used a simple energy-based singing voice detector, the performance of our model is limited. However, the results show that, with a better voice detector, our model could approach the performance of the perfect voice detection case.


And we will replace the HMM with an RNN for end-to-end learning.


Thank you!
[email protected]

Thanks for your attention.


Appendix
Resample: 8 kHz
Merge stereo channels into mono
STFT: FFT size 1024 (1 bin = 7.81 Hz), window size 1024 (Hann), hop size 80 (1 frame = 10 ms)
Compress the magnitude by a log scale
Use 256 bins (0~2000 Hz: vocal range)
Multi-frame: 11-frame spectrogram per example

Pre-processing
[3] Ellis, Daniel P. W., and Graham E. Poliner. "Classification-based melody transcription." Machine Learning 65.2 (2006): 439-456.


In the pre-processing step, we resample the audio files to 8 kHz and merge the stereo channels into mono. We then compute the spectrogram with a Hann window and a hop size of 80 samples, and finally compress the magnitude by a log scale.

We use only 256 bins, from 0 to 2 kHz, where the human singing voice has a relatively higher level than the background music. And we use only voiced frames for training.
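Putting the steps together, a librosa sketch of this chain (the library choice and the exact form of the log compression are assumptions; the sample rate, STFT parameters, and bin count are as stated above):

    import numpy as np
    import librosa

    def preprocess(path):
        """8 kHz mono -> 1024-point Hann STFT, hop 80 (10 ms) -> log magnitude."""
        y, sr = librosa.load(path, sr=8000, mono=True)
        spec = np.abs(librosa.stft(y, n_fft=1024, hop_length=80, window='hann'))
        spec = np.log1p(spec)   # log-scale magnitude compression (form assumed)
        return spec[:256]       # keep 256 bins: 256 * 7.81 Hz covers 0~2000 Hz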


Comparison to State-of-the-art Methods: MIR-1k


And this is the result on the MIR-1k dataset, alongside ADC2004 and MIREX05.


MIREX 2016 results


Multi-frame spectrogram

