Timbre and Modulation Features for Music Genre/Mood Classification
J.-S. Roger Jang & Jia-Min Ren
Multimedia Information Retrieval Lab
Dept. of CSIE, National Taiwan University
Outline
- Audio features and modulation spectral analysis
- MIREX 2011 method and its improvement
- Experimental setup and results
- Conclusions and future work
Introduction – music genres/moods
Genres and moods serve as descriptions of music content.
*pictures from www.playonradio.com, brainpickings.org & mpac.ee.ntu.edu.tw
Motivation
- Rapid growth of digital music: Apple iTunes offers 28 million songs; 7digital offers 20 million tracks
- Organizing large collections of audio music is important but challenging; manual labeling by tags is labor-intensive and time-consuming
- Thus, machine learning for classification is called for!
[Diagram: a generic classification pipeline. Training stage: music clips for training → feature extraction (short-term: MFCC, OSC; long-term: beat, tempo, pitch) → classifier training (KNN, GMM, SVM). Test stage: a test music clip → feature extraction → classifiers → evaluation result.]
System overview
[Diagram: for both training and test clips, feature extraction combines frame-based timbre feature extraction and summarization with long-term modulation-based feature extraction; the concatenated features feed SVM training in the training stage, and the trained SVMs classify the test clip to produce the result.]
Performance evaluation
Dataset-dependent criteria for evaluation:
- GTZAN: 10-fold cross-validation
- ISMIR2004Genre: holdout test, same as the one used in the ISMIR 2004 Genre Classification Contest, with 729 clips for training and 729 clips for test
Audio features – short-term timbre features
Statistical spectrum descriptors (SSD):
- Spectral centroid (SC), spectral flux (SF), spectral rolloff (SR), spectral skewness (SS), spectral kurtosis (SK)
MFCC:
- Models the subjective frequency content of audio signals
- 21 dimensions (including energy)
Audio features – short-term timbre features
Spectral contrast & valley (SCV):
- Measures spectral contrast/valley in octave-based subbands; peaks indicate harmonic content, valleys indicate non-harmonic content/noise
- After taking an FFT of each audio frame, compute each subband's peak/valley by averaging the largest/smallest 20% of its spectral magnitudes
- Contrast = peak − valley, capturing the relative spectral distribution
- 8 frequency subbands (Hz): [0,100), [100,200), [200,400), [400,800), [800,1600), [1600,3200), [3200,6400), [6400,11025]
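As a rough illustration, the per-frame SCV computation might look like the following sketch. The 22050 Hz sample rate, the function name, and the frame length are assumptions; the band edges and the 20% averaging fraction come from the slide.

```python
import numpy as np

# Hedged sketch of per-frame spectral contrast/valley (SCV).
# Assumed: 22050 Hz sample rate; band edges treated inclusively
# at the upper end for simplicity.
BAND_EDGES = [0, 100, 200, 400, 800, 1600, 3200, 6400, 11025]  # Hz

def scv_frame(frame, sr=22050, alpha=0.2):
    """Return (peaks, valleys), one value per octave-based subband."""
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    peaks, valleys = [], []
    for lo, hi in zip(BAND_EDGES[:-1], BAND_EDGES[1:]):
        band = np.sort(mag[(freqs >= lo) & (freqs <= hi)])
        k = max(1, int(alpha * len(band)))
        valleys.append(band[:k].mean())   # average of the smallest 20%
        peaks.append(band[-k:].mean())    # average of the largest 20%
    return np.array(peaks), np.array(valleys)
```

The contrast vector is then simply `peaks - valleys`, one value per subband.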
Audio features – short-term timbre features
Spectral flatness measure (SFM):
- Measures the noisiness of the spectrum within a subband
- ≈1: power is distributed similarly across all spectral bins; ≈0: spectral power is concentrated in a relatively small number of bins

$$\mathrm{SFM}_a = \frac{\left(\prod_{i=1}^{N_a} B_{a,i}\right)^{1/N_a}}{\frac{1}{N_a}\sum_{i=1}^{N_a} B_{a,i}}$$

Spectral crest measure (SCM):

$$\mathrm{SCM}_a = \frac{\max_{i=1,\ldots,N_a} B_{a,i}}{\frac{1}{N_a}\sum_{i=1}^{N_a} B_{a,i}}$$

where $B_{a,i}$ is the $i$-th magnitude spectrum in the $a$-th subband and $N_a$ is the number of spectra in the $a$-th subband.
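The two formulas translate directly into code. In this sketch, `band` is the vector of magnitude spectra $B_{a,i}$ in one subband; the small epsilon guarding log(0) and division by zero is an implementation detail, not part of the definitions.

```python
import numpy as np

# Spectral flatness: geometric mean / arithmetic mean of the band.
def sfm(band):
    band = np.asarray(band, dtype=float)
    geo_mean = np.exp(np.mean(np.log(band + 1e-12)))
    return geo_mean / (band.mean() + 1e-12)

# Spectral crest: maximum / arithmetic mean of the band.
def scm(band):
    band = np.asarray(band, dtype=float)
    return band.max() / (band.mean() + 1e-12)
```

A flat, noise-like band gives SFM near 1 and SCM near 1; a single spike among N bins gives SFM near 0 and SCM near N.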
Audio features – short-term timbre features
For each dimension of the frame-based features (SSD, MFCC, SCV, and SFM/SCM computed in octave-based subbands), we compute its mean and standard deviation over all frames.
Total dimensions for short-term timbre features: 2*(5+21+16+16) = 116
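The summarization step is a one-liner; this sketch (the function name is illustrative) assumes the 58 frame-level values (5 SSD + 21 MFCC + 16 SCV + 16 SFM/SCM) are stacked into a matrix:

```python
import numpy as np

# Per-clip mean and standard deviation of each frame-level feature
# dimension, giving the 2*58 = 116-dim short-term timbre vector.
def summarize_timbre(frames):
    """frames: (n_frames, 58) array of frame-level timbre features."""
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])
```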
Modulation spectral analysis
MFCC, OSC, and SFM/SCM capture only short-time spectral properties of audio signals.
Modulation spectral analysis:
- Captures long-term spectral dynamics within audio signals
- Computes the spectrogram, then creates the modulation spectrogram by applying an FFT again along the time axis of the spectrogram
- Low/high modulation frequency ↔ slow/fast spectral change
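The core idea fits in a few lines: a second FFT over each frequency bin's trajectory turns frame-rate dynamics into modulation frequencies, so slow spectral changes land in low modulation bins and fast changes in high ones.

```python
import numpy as np

# Minimal sketch: second FFT along the time axis of a magnitude
# spectrogram yields the modulation spectrogram.
def modulation_spectrogram(spec):
    """spec: (n_bins, n_frames) magnitude spectrogram.
    Returns an (n_bins, n_frames//2 + 1) modulation magnitude array."""
    return np.abs(np.fft.rfft(spec, axis=1))
```

A bin whose energy oscillates k times over the analyzed window peaks at modulation bin k.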
Modulation spectral analysis of timbre features
Flowchart (illustrated with 16-dim OSC features; the same process is applied to MFCC and SFM/SCM):
1. OSC extraction from the music clip (hop size 23 ms)
2. Segmentation into texture windows (256 frames ≈ 6 sec)
3. FFT along the time axis for each feature dimension → modulation spectrum (129 bins per dimension)
4. Averaging over all texture windows
5. Modulation frequency decomposition into 7 subbands (Hz): [0,0.33), [0.33,0.66), [0.66,1.32), [1.32,2.64), [2.64,5.28), [5.28,10.56), [10.56,21.03)
6. Modulation spectral peak/valley (MSP/MSV) computation per subband; MSC (modulation spectral contrast) = MSP − MSV. MSP/MSV reflect the strength of rhythm in music. Each of MSP, MSV, and MSC has a 16-dim profile along the feature dimension and a 7-dim profile along the subband dimension.
7. Mean/std computation along the feature and subband dimensions → 92-dim feature vector (= 16*2 + 7*2 + 16*2 + 7*2)
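Steps 5-7 can be sketched as follows. This is not the authors' code; the assumptions are that the averaged modulation spectrogram has shape (16 features, 129 bins), the modulation frame rate is ~43.5 Hz (23 ms hop), and the 92-dim summary takes the mean/std of MSV and MSC along both dimensions, which is one reading of the flowchart.

```python
import numpy as np

MOD_EDGES = [0, 0.33, 0.66, 1.32, 2.64, 5.28, 10.56, 21.03]  # Hz, from the slide

def msp_msv_msc(mod_spec, frame_rate=1 / 0.023, n_fft=256):
    """mod_spec: (n_feat, n_fft//2 + 1) averaged modulation spectrogram."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / frame_rate)
    msp, msv = [], []
    for lo, hi in zip(MOD_EDGES[:-1], MOD_EDGES[1:]):
        band = mod_spec[:, (freqs >= lo) & (freqs < hi)]
        msp.append(band.max(axis=1))  # modulation spectral peak per feature dim
        msv.append(band.min(axis=1))  # modulation spectral valley per feature dim
    msp = np.array(msp).T             # (n_feat, 7)
    msv = np.array(msv).T
    msc = msp - msv                   # modulation spectral contrast
    parts = []
    for m in (msv, msc):
        parts += [m.mean(axis=1), m.std(axis=1),   # along subband dim -> 16+16
                  m.mean(axis=0), m.std(axis=0)]   # along feature dim -> 7+7
    return np.concatenate(parts)      # 16*2 + 7*2 + 16*2 + 7*2 = 92 dims
```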
Modulation spectral analysis of timbre features
Reference: C.-H. Lee, J.-L. Shih, K.-M. Yu, and H.-S. Lin, "Automatic music genre classification based on modulation spectral analysis of spectral and cepstral features," IEEE Trans. Multimedia, vol. 11, no. 4, pp. 670-682, June 2009.
Proposed joint acoustic frequency and modulation frequency features
Motivation:
- Averaging and mean/std computation smooth out modulation information
Computation of joint frequency features (proposed):
- Compute the modulation spectrogram from an entire music clip
- Compute SCV (spectral contrast/valley) and SFM/SCM (spectral flatness/crest measure) within each joint acoustic-modulation (AM) frequency subband → AMSCV, AMSFM/AMSCM
Audio features used in our study
All possible audio features:
- Extract SSD, MFCC, SCV, and SFM/SCM from audio frames, then compute mean/std → MuStd, dim = 2*(5+21+16+16) = 116
- Perform modulation spectral analysis on MFCC, OSC, SFM/SCM → MMFCC, dim = 2*(21*2+7*2) = 112; MSCV, dim = 2*(16*2+7*2) = 92; MSFM/MSCM, dim = 2*(16*2+7*2) = 92
- Compute SCV and SFM/SCM within acoustic-modulation (AM) frequency subbands → AMSCV, dim = 8*7*2 = 112; AMSFM/AMSCM, dim = 8*7*2 = 112
Audio feature sets and classifier
Audio feature sets:
- MIREX 2011 method: MuStd+MMFCC+MSCV+MSFM/MSCM, dim = 116+112+92+92 = 412
- Improved method: MuStd+MMFCC+AMSCV+AMSFM/AMSCM, dim = 116+112+112+112 = 452
Classifier construction with RBF-kernel SVMs; three-fold inner cross-validation to tune hyper-parameters
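The classifier stage can be sketched with scikit-learn: an RBF-kernel SVM whose C and gamma are tuned by 3-fold cross-validation inside the training set. The grid values below are illustrative only, not the ones used in the study.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Hedged sketch: RBF-kernel SVM with inner 3-fold CV over an
# assumed (C, gamma) grid.
def train_rbf_svm(X, y):
    grid = {"C": [1.0, 10.0, 100.0], "gamma": ["scale", 1e-3, 1e-2]}
    search = GridSearchCV(SVC(kernel="rbf"), grid, cv=3)
    search.fit(X, y)
    return search.best_estimator_
```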
Experimental setup and results of MIREX 2011 genre/mood classification tasks
Datasets:
- Genre classification: 10 genres, 700 30-sec clips in each one
- Mood classification: 5 categories, 120 30-sec clips in each one
Evaluation metric: three-fold cross-validation; classification accuracy
Results (JR1 is ours):
[Bar charts of classification accuracy (30-70%) for all submissions. Genre chart labels: WR1, TCCP4, SSKS2, JR1, SSPK1, WR2, TCCP3, JR2, ES2, ES1, DM1, GDC2, EP2, GKC4. Mood chart labels: JR1, TCCP4, WR1, TCCP3, SSKS2, ES2, SSPK1, WR2, ES1, JR2, DM4, DM1, GDC1, EP2, GDC2, GKC4.]
Experimental results of MIREX 2008-2013 genre/mood classification tasks

Participation | Classification task (year) | Accuracy (%) | Rank (# of submissions)
Wu and Jang | Genre (2013) | 76.23 | 1 (13)
Wu and Jang | Genre (2012) | 76.13 | 1 (16)
Wu and Ren | Genre (2011) | 75.57 | 1 (15)
Our submission | Genre (2011) | 74.23 | 4 (15)
Seyerlehner et al. | Genre (2010) | 73.64 | 1 (24)
Cao and Li | Genre (2009) | 73.33 | 1 (31)
Tzanetakis | Genre (2008) | 67.83 | 1 (13)
Wu and Jang | Mood (2013) | 68.33 | 1 (23)
Panda and Paiva | Mood (2012) | 67.83 | 1 (20)
Our submission | Mood (2011) | 69.50 | 1 (17)
Wang et al. | Mood (2010) | 64.17 | 1 (36)
Cao and Li | Mood (2009) | 65.67 | 1 (33)
Peeters | Mood (2008) | 63.67 | 1 (13)
Extended experiments
Four datasets:

Dataset | Category | # of classes | Min/Max # of clips per class | Total # of clips | Duration of each clip
GTZAN | Genre | 10 | 100/100 | 1,000 | 30 s
Unique | Genre | 14 | 26/766 | 3,115 | ~30 s
Soundtracks | Mood | 6 | 30/30 | 180 | 18 s to 30 s
MIR-Mood | Mood | 4 | 464/619 | 2,223 | ~30 s or ~60 s

Performance evaluation: randomly stratified 10-fold cross-validation, repeated 10 times; results are averaged over the 10 repetitions.
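The evaluation protocol can be sketched with scikit-learn's RepeatedStratifiedKFold: stratified 10-fold cross-validation repeated 10 times, with the final figure averaged over all 100 train/test splits. The k-NN classifier in the test below is only a stand-in for the SVMs used in the study.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Hedged sketch of the repeated stratified CV protocol.
def repeated_cv_accuracy(clf, X, y, n_splits=10, n_repeats=10, seed=0):
    cv = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats,
                                 random_state=seed)
    accs = [clf.fit(X[tr], y[tr]).score(X[te], y[te])
            for tr, te in cv.split(X, y)]
    return float(np.mean(accs))
```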
Extended experiments
Averaged classification accuracy (%) of combining different feature sets on the four datasets
Extended experiments
Comparison of our methods with other recent work
Conclusions
- Timbre & modulation features won 1st place in MIREX 2011 mood classification
- Timbre & improved modulation features improve accuracy by 2.47%/2.08% on GTZAN/Unique, and achieve 2.50%/0.14% higher accuracy than the MIREX 2011 method on Soundtracks/MIR-Mood
Thank you for listening.
Questions & comments welcome!