1
Personalized Music Emotion Recognition via Model Adaptation
Ju-Chiang Wang, Yi-Hsuan Yang, Hsin-Min Wang, and Shyh-Kang Jeng
Academia Sinica, National Taiwan University, Taipei, Taiwan
2
Outline
• Introduction
• The Acoustic Emotion Gaussians (AEG) Model
• Personalization via MAP Adaptation
• Music Emotion Recognition using AEG
• Evaluation and Result
• Conclusion
3
Introduction
• Goal: develop a computational model that comprehends the affective content of musical audio signals, for automatic music emotion recognition and content-based music retrieval
• Emotion perception in music is inherently subjective (fairly user-dependent)
– A general music emotion recognition (MER) system could be insufficient
– Ideally, one's personal device should understand his/her perception of music emotion
– An adaptive MER method should be both efficient and effective
4
Basic Idea
• The UBM-GMM framework for speaker adaptation
– State-of-the-art in speaker recognition systems
– A large background GMM (the universal background model, UBM) represents the speaker-independent distribution of acoustic features
– A speaker-dependent GMM is obtained via model adaptation with the speech data of a specific speaker
• Adaptive MER method for personalization
– A probabilistic background emotion model learns the broad emotion perception of music from general users
– The background emotion model is personalized via model adaptation in an online and dynamic fashion
5
Multi-Dimensional Emotion
• Emotions are considered as numerical values (instead of discrete labels) over two emotion dimensions, i.e., Valence and Arousal (Activation)
• Good visualization, a unified model
Mr. Emo developed by Yang and Chen
6
The Valence-Arousal Annotations
• Different emotions may be elicited from a song
• Assumption: the VA annotations of a song can be drawn from a Gaussian distribution, as observed
• Learn from the multiple annotations and the acoustic features of the corresponding song
• Predict the emotion as a single Gaussian
7
The Acoustic Emotion Gaussians Model
• Represent the acoustic features of a song by a probabilistic histogram vector
• Develop a model to comprehend the relationship between acoustic features and VA annotations
– Wang et al. (2012), "The acoustic emotion Gaussians model for emotion-based music annotation and retrieval," Proc. ACM Multimedia (full paper)
[Figure: acoustic GMM posterior distributions]
8
Construct Feature Reference Model
[Figure: frame-based features are extracted from the music tracks and audio signals of a universal music database; a global set of frame vectors, randomly selected from each track, is used to train an acoustic GMM via EM; each component A1, ..., AK represents a specific acoustic pattern]
9
Represent a Song into Probabilistic Space
[Figure: the frame vectors of a song are evaluated against the acoustic GMM components A1, ..., AK; the resulting posterior probabilities are averaged into a histogram, the acoustic GMM posterior]
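The song-level representation above can be sketched in a few lines of NumPy/SciPy. The GMM parameters and frame features below are toy placeholders, not the acoustic GMM trained in the paper:

```python
import numpy as np
from scipy.stats import multivariate_normal

def acoustic_posterior_histogram(frames, weights, means, covs):
    """Average the per-frame posteriors over the K acoustic GMM
    components into the song-level posterior histogram theta."""
    K = len(weights)
    # per-frame weighted likelihoods under each component: shape (T, K)
    lik = np.column_stack([
        weights[k] * multivariate_normal.pdf(frames, means[k], covs[k])
        for k in range(K)
    ])
    post = lik / lik.sum(axis=1, keepdims=True)  # frame-level posteriors
    return post.mean(axis=0)                     # theta: non-negative, sums to 1

# toy 2-component GMM over 2-D frame features (illustrative values only)
rng = np.random.default_rng(0)
frames = rng.normal(size=(100, 2))
theta = acoustic_posterior_histogram(
    frames,
    weights=np.array([0.5, 0.5]),
    means=[np.zeros(2), np.ones(2) * 3],
    covs=[np.eye(2), np.eye(2)],
)
```

The resulting `theta` is the probabilistic histogram that weights the VA GMM on the following slides.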
10
Generative Process of VA GMM
• Key idea: Each component in acoustic GMM can generate a component VA Gaussian
[Figure: the acoustic GMM components A1, ..., AK, viewed as a set of acoustic codewords over the audio signal of each clip, each generate a component Gaussian, forming a mixture of Gaussians in the VA space]
11
The Likelihood Function of VA GMM
• Each training clip s_i is annotated by multiple users {u_j}, indexed by j
• An annotated corpus: assume each annotation e_ij of clip s_i is generated by a VA GMM weighted by the acoustic GMM posterior {θ_ik}
• Form the corpus-level likelihood and maximize it using the EM algorithm

Annotation-level likelihood, weighted by the acoustic GMM posterior θ_ik (the means and covariances of the latent VA Gaussians are the parameters to learn):

\[
p(\mathbf{e}_{ij} \mid s_i) = \sum_{k=1}^{K} \theta_{ik}\, \mathcal{N}(\mathbf{e}_{ij} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)
\]

Corpus-level likelihood (at the clip level, each user contributes equally):

\[
p(\mathbf{E} \mid \mathcal{S}) = \prod_{i=1}^{N} \prod_{j=1}^{U_i} \sum_{k=1}^{K} \theta_{ik}\, \mathcal{N}(\mathbf{e}_{ij} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)
\]
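The annotation-level likelihood can be sketched as follows, assuming a toy 2-component VA GMM whose parameter values are illustrative, not learned from data:

```python
import numpy as np
from scipy.stats import multivariate_normal

def annotation_likelihood(e, theta, va_means, va_covs):
    """Likelihood of a VA annotation e under the clip's VA GMM:
    a theta-weighted mixture of the K latent VA Gaussians."""
    return sum(theta[k] * multivariate_normal.pdf(e, va_means[k], va_covs[k])
               for k in range(len(theta)))

# toy 2-component VA GMM in the 2-D valence-arousal space (illustrative only)
theta = np.array([0.7, 0.3])  # acoustic GMM posterior of the clip
va_means = [np.array([0.5, 0.5]), np.array([-0.5, -0.5])]
va_covs = [0.1 * np.eye(2), 0.2 * np.eye(2)]
p = annotation_likelihood(np.array([0.4, 0.6]), theta, va_means, va_covs)
```

Taking the product of such terms over all clips and users gives the corpus-level likelihood that EM maximizes.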
12
Personalizing VA GMM via MAP
• Apply Maximum A Posteriori (MAP) adaptation
• Suppose we have a set of personally annotated songs {e_i, θ_i}, i = 1, ..., M
• The posterior probability of each component z_k given e_i:

\[
p(z_k \mid \mathbf{e}_i, \theta_i) = \frac{\theta_{ik}\, \mathcal{N}(\mathbf{e}_i \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{q=1}^{K} \theta_{iq}\, \mathcal{N}(\mathbf{e}_i \mid \boldsymbol{\mu}_q, \boldsymbol{\Sigma}_q)}
\]

• The expected sufficient statistics, weighted by the posterior and e_i:

\[
E_k(\mathbf{e}) = \frac{\sum_{i=1}^{M} p(z_k \mid \mathbf{e}_i, \theta_i)\, \mathbf{e}_i}{\sum_{i=1}^{M} p(z_k \mid \mathbf{e}_i, \theta_i)},
\qquad
E_k(\mathbf{e}\mathbf{e}^{T}) = \frac{\sum_{i=1}^{M} p(z_k \mid \mathbf{e}_i, \theta_i)\, \mathbf{e}_i \mathbf{e}_i^{T}}{\sum_{i=1}^{M} p(z_k \mid \mathbf{e}_i, \theta_i)}
\]
13
MAP for GMM: Parameter Interpolation
• The updated parameters of the personalized VA GMM are derived by interpolating between the expected statistics and the background parameters:

\[
\boldsymbol{\mu}'_k \leftarrow \alpha_k E_k(\mathbf{e}) + (1 - \alpha_k)\, \boldsymbol{\mu}_k,
\]
\[
\boldsymbol{\Sigma}'_k \leftarrow \alpha_k E_k(\mathbf{e}\mathbf{e}^{T}) + (1 - \alpha_k)\left(\boldsymbol{\Sigma}_k + \boldsymbol{\mu}_k \boldsymbol{\mu}_k^{T}\right) - \boldsymbol{\mu}'_k \boldsymbol{\mu}'^{T}_k.
\]

• The effective number of component z_k for the target user:

\[
M_k = \sum_{i=1}^{M} p(z_k \mid \mathbf{e}_i, \theta_i)
\]

• The data-dependent interpolation factors can be set as

\[
\alpha_k = \frac{M_k}{M_k + r},
\]

with r a fixed relevance factor, as in UBM-GMM speaker adaptation
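The whole MAP step (responsibilities, expected statistics, interpolation) can be sketched as below. The relevance factor and all toy parameter values are assumptions for illustration, not values used in the paper:

```python
import numpy as np
from scipy.stats import multivariate_normal

def map_adapt(E, thetas, means, covs, relevance=2.0):
    """MAP-adapt the background VA GMM with a target user's VA
    annotations E (M x 2) and clip-level posteriors thetas (M x K)."""
    M, K = thetas.shape
    # responsibilities p(z_k | e_i, theta_i), shape (M, K)
    lik = np.column_stack([
        thetas[:, k] * multivariate_normal.pdf(E, means[k], covs[k])
        for k in range(K)
    ])
    resp = lik / lik.sum(axis=1, keepdims=True)
    new_means, new_covs = [], []
    for k in range(K):
        Mk = resp[:, k].sum()                        # effective count M_k
        alpha = Mk / (Mk + relevance)                # data-dependent factor
        Ee = resp[:, k] @ E / Mk                     # E_k(e)
        EeeT = (resp[:, k][:, None] * E).T @ E / Mk  # E_k(e e^T)
        mu = alpha * Ee + (1 - alpha) * means[k]
        sigma = (alpha * EeeT
                 + (1 - alpha) * (covs[k] + np.outer(means[k], means[k]))
                 - np.outer(mu, mu))
        new_means.append(mu)
        new_covs.append(sigma)
    return new_means, new_covs

# toy background VA GMM and user annotations clustered near (1, 1)
rng = np.random.default_rng(1)
E = rng.normal(loc=1.0, scale=0.1, size=(20, 2))
thetas = np.tile([0.9, 0.1], (20, 1))
means = [np.zeros(2), 3 * np.ones(2)]
covs = [np.eye(2), np.eye(2)]
new_means, new_covs = map_adapt(E, thetas, means, covs)
```

With enough user data, `Mk` grows and `alpha` approaches 1, so a heavily observed component relies mostly on the personal annotations, while rarely observed components stay close to the background model.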
14
Graphical Interpretation – MAP Adaptation
[Figure: the background VA Gaussians are each shifted toward the user's annotations, by an amount governed by its data-dependent interpolation factor and acoustic GMM posterior]
The personal annotations can come from clips outside the background training set
15
Music Emotion Recognition
• Given the acoustic GMM posterior of a test song, predict the emotion as a single VA Gaussian
[Figure: the acoustic GMM posterior of the test song weights the learned VA GMM, which is then collapsed into the predicted single Gaussian {μ*, Σ*}]

\[
p(\hat{\mathbf{e}} \mid \hat{s}) = \sum_{k=1}^{K} \hat{\theta}_k\, \mathcal{N}(\hat{\mathbf{e}} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)
\]
16
Find the Representative Gaussian
• Minimize the cumulative weighted relative entropy
– The representative Gaussian has the minimal cumulative distance from all the component VA Gaussians
• The optimal parameters of the Gaussian are
\[
\{\boldsymbol{\mu}^{*}, \boldsymbol{\Sigma}^{*}\} = \operatorname*{argmin}_{\{\boldsymbol{\mu}, \boldsymbol{\Sigma}\}} \sum_{k=1}^{K} \hat{\theta}_k\, D_{\mathrm{KL}}\!\left( \mathcal{N}(\mathbf{e} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \,\middle\|\, \mathcal{N}(\mathbf{e} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) \right)
\]

\[
\boldsymbol{\mu}^{*} = \sum_{k=1}^{K} \hat{\theta}_k\, \boldsymbol{\mu}_k,
\qquad
\boldsymbol{\Sigma}^{*} = \sum_{k=1}^{K} \hat{\theta}_k \left( \boldsymbol{\Sigma}_k + (\boldsymbol{\mu}_k - \boldsymbol{\mu}^{*})(\boldsymbol{\mu}_k - \boldsymbol{\mu}^{*})^{T} \right)
\]
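The closed-form collapse of the weighted VA GMM into its representative Gaussian can be sketched as follows; the component parameters below are toy placeholders:

```python
import numpy as np

def representative_gaussian(theta, means, covs):
    """Collapse a theta-weighted VA GMM into the single Gaussian with
    minimal cumulative weighted KL distance to all components."""
    mu = sum(t * m for t, m in zip(theta, means))
    sigma = sum(t * (c + np.outer(m - mu, m - mu))
                for t, c, m in zip(theta, covs, means))
    return mu, sigma

# toy 2-component VA GMM (illustrative values only)
theta = np.array([0.6, 0.4])
means = [np.array([0.5, 0.5]), np.array([-0.5, -0.5])]
covs = [0.1 * np.eye(2), 0.1 * np.eye(2)]
mu_star, sigma_star = representative_gaussian(theta, means, covs)
```

Note that the covariance is the within-component covariance plus the between-component spread, so the representative Gaussian widens when the weighted components disagree.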
17
Evaluation – Dataset and Acoustic Features
• MER60
– 60 music clips, each 30 seconds long
– 99 users in total; each clip annotated by 40 subjects
– 6 users annotated all the clips
– Personalization is evaluated on these 6 users
• Bag-of-frames representation; the emotion analysis is performed at the clip level instead of the frame level
– 70 dimensions: dynamic, spectral, timbre (13 MFCCs, 13 delta MFCCs, and 13 delta-delta MFCCs), and tonal features
18
Evaluation – Incremental Setting
• Incremental adaptation experiment per target user
– Randomly split all the clips (with annotations) into 6 folds
– Perform 6-fold cross-validation:
• Hold out one fold for testing
• The remaining 5 folds: all annotations except the target user's train a background VA GMM
• Add one fold of the target user's annotations to the adaptation pool per iteration (a loop of P = 5 iterations)
– The adaptation pool is used to adapt the background VA GMM
– The prediction performance is evaluated on the test fold
19
Evaluation – Result
• Metric (ALLi): compute the log-likelihood of the target user's ground-truth annotations under the predicted Gaussian
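The metric can be sketched as an average log-density; the predicted Gaussian and ground-truth points below are placeholders, not results from the paper:

```python
import numpy as np
from scipy.stats import multivariate_normal

def average_log_likelihood(annotations, mu, sigma):
    """Average log-density of ground-truth VA annotations under the
    predicted single Gaussian; higher means a better prediction."""
    return multivariate_normal.logpdf(annotations, mu, sigma).mean()

# toy predicted Gaussian and ground-truth annotations (illustrative only)
mu = np.array([0.2, 0.3])
sigma = 0.2 * np.eye(2)
truth = np.array([[0.25, 0.28], [0.15, 0.35]])
score_good = average_log_likelihood(truth, mu, sigma)
score_bad = average_log_likelihood(truth + 2.0, mu, sigma)
```

Shifting the ground truth away from the predicted mean lowers the score, which is what makes the metric sensitive to personalization quality.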
20
Conclusion and Future Work
• The AEG model provides a principled, technically sound probabilistic framework that is flexible for adaptation
• We have presented a novel MAP-based adaptation technique that is very efficient for personalizing the AEG model
• We demonstrated the effectiveness of the proposed method for personalizing MER in an incremental learning manner
• Future work: investigate maximum likelihood linear regression (MLLR), which learns a linear transformation over the parameters of the AEG model