Bachelor’s Thesis
Sensitivity in evaluating classifications
Name H. Ling (Hansen)
Student ID s678825
Title Sensitivity in evaluating classifications
University Tilburg University
Faculty Faculty of Humanities
Major Business Communication and Digital Media
Bachelor or Master Bachelor
Place Tilburg
Date 28 January 2011
Supervisor dr. M.M. van Zaanen
Second reader dr. T. Gaustad
Preface
This Bachelor’s thesis on the evaluation of the automatic mood classification system
has been an exciting and interesting journey. It concludes my studies in the Bachelor Business
Communication and Digital Media at the Faculty of Humanities of Tilburg University.
Many thanks are in order to those who have aided me in writing this thesis. Firstly, I
would especially like to thank my supervisor Menno van Zaanen for his patience, guidance, and
support throughout the entire process. My gratitude also goes to Tanja Gaustad for taking the
time to co-assess this thesis.
I would also like to thank my supervisor for the data from the automatic mood
classifier, a system built by Pieter Kanters and Menno van Zaanen. Furthermore, I thank
Crayon Room for providing the dataset with the Moody Counts that has been used for the
experiment.
Of course, I would also like to thank my family for their enduring support.
Hansen Ling
Tilburg, January 2011
Abstract
This thesis concerns the incorporation of weighted distance into the evaluation of classifiers.
We studied its effects on the evaluation of the automatic mood classification
system by Van Zaanen and Kanters (2010). This classifier assigns a mood to music based on its
lyrics. We focused on two research questions in this study.
The first question concerns incorporating the distance between mood classes into the
evaluation of the system. Currently, the system's evaluation is too strict because it uses a
binary metric: only perfect matches with a song's Moody Tag are counted as a success.
However, wrong predictions are not necessarily entirely wrong, because of
nuances between moods. To incorporate this sensitivity, two standard distance metrics were
considered: the Euclidean and the Taxicab metric. The results show that with the weighted
accuracy we can evaluate the results in more depth and discover a difference between the
retrieval metrics used. Based on the new metrics, it was concluded that the tf*idf metric was more
accurate than the tf+tf*idf retrieval metric, which according to the original evaluation
performed the same.
Secondly, the Moody Tags are the current gold standard. We studied the influence on
the evaluation of the system of replacing this gold standard with data obtained directly
from social tagging, known as the Moody Counts. The Moody Tags are derived
from the Moody Counts and are therefore an indirect source. Results show that the new
standard evaluates in a more fine-grained manner, but did not reveal a significant difference
between the tf*idf and tf+tf*idf features.
Table of contents

Preface  I
Abstract  II
Table of contents  III

1 Introduction  1
1.1 Audio Compression Formats  1
1.2 The story of MP3  2
1.3 Sharing music  2
1.4 Metadata of music files  3
1.5 Music and mood  4

2 Background information  4
2.1 Web 2.0  4
2.1.1 Social Tagging  6
2.1.2 Metadata of MP3  7
2.2 Music collection  8
2.2.1 Creating playlists  9
2.2.2 Automatic playlist generation  10
2.2.3 Mood based playlist  11
2.3 Music, moods, and emotions  11
2.3.1 Thayer’s model of mood  13
2.3.2 Language and mood  14
2.3.3 Music, lyrics and mood  14

3 Background information for research  15
3.1 Moody’s mood framework  15
3.1.2 System’s mood classes  17
3.2 tf*idf weighting  18
3.3 Automatic mood classification system using tf*idf based on lyrics  19
3.4 Confusion Matrix  20
3.5 Distance Metrics  21

4 Research questions  24
4.1 Research purpose  24

5 Methodology  26
5.1 Data  26
5.2 Method  27
5.2.1 Method for RQ1  28
5.2.2 Method for RQ2  30

6 Results  32
6.1 Results RQ1: Distance metrics  32
6.2 Results RQ2: Moody Counts  33

7 Conclusions and discussion  34
7.1 Answer to RQ1  34
7.2 Answer to RQ2  34
7.3 Sensitivity and new standard  35

8 Future research  36
9 References  37
1. Introduction
This Bachelor’s thesis is on incorporating a weighted distance in the evaluation of
classifiers. It is a follow-up study on Van Zaanen and Kanters’ (2010) “Automatic mood
classification system using tf*idf based on lyrics”, henceforth referred to as the system,
with a focus on the evaluation of this classifier. Their system
automatically categorizes music tracks into specific mood classes based on the lingual aspect of
music. This is achieved using word-oriented metrics such as tf*idf and tf+tf*idf. These are
standard metrics taken from the field of information retrieval that are based on term
frequency and inverse document frequency. Their study has shown that the words in lyrics contain
valuable information on the mood that the songwriter or artist wants to convey to the
audience. Building on their study, we can examine the effects of weighted distance.
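As an aside, the tf*idf weighting mentioned above can be sketched in a few lines. This is a standard textbook formulation over a toy corpus; the exact weighting and smoothing used by Van Zaanen and Kanters' system may differ.

```python
import math
from collections import Counter


def tf_idf(term: str, doc: list, corpus: list) -> float:
    """tf  = relative frequency of the term in the document;
    idf = log(total documents / documents containing the term)."""
    tf = Counter(doc)[term] / len(doc)
    df = sum(1 for d in corpus if term in d)
    if df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)


# Toy "lyrics" corpus: each document is a list of word tokens.
corpus = [
    ["love", "you", "love", "me"],
    ["sad", "and", "blue"],
    ["love", "hurts"],
]

print(tf_idf("love", corpus[0], corpus))  # common across documents: lower idf
print(tf_idf("sad", corpus[1], corpus))   # rare across documents: higher weight
```

A word that occurs in many lyrics ("love") is down-weighted relative to a word that is distinctive for one song ("sad"), which is exactly what makes the metric useful for distinguishing mood classes.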
This section gives an introduction from digital music files to the mood in music. It
describes how large digital music collections came about and how we can organize them by
creating playlists that are based on specific criteria such as the mood that music gives people.
1.1 Audio Compression Formats
Nowadays almost everyone has a (portable) device with which they can play digital
audio files to listen to music. There are numerous devices enabling you to enjoy your favorite
songs: for instance a mobile phone, a portable game console, a portable media player,
or a computer.
The introduction of audio compression formats such as Windows Media Audio [WMA],
Advanced Audio Coding [AAC], Vorbis, and the most popular MPEG-1 Audio Layer 3 [MP3] has
allowed people to carry along more music occupying less space than music compact discs [CD].
In addition, it has allowed us to share music more rapidly and conveniently. This is due to the
relatively small file sizes and the convenience of the internet and data sticks. To give an
example, without any compression a four-minute song would be roughly 40 Megabytes [MB] in
size. Converted to a standard-quality MP3, the same four-minute song
would then approximately take up 4 MB of space.
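These figures can be checked with a back-of-the-envelope calculation, assuming CD-quality uncompressed audio (44,100 samples per second, stereo, 16 bits per sample) and a common 128 kbit/s MP3 bitrate:

```python
SAMPLE_RATE = 44_100   # CD quality: 44,100 samples per second
CHANNELS = 2           # stereo
BYTES_PER_SAMPLE = 2   # 16-bit samples


def uncompressed_mb(seconds: float) -> float:
    """Size of raw CD-quality audio in megabytes."""
    return SAMPLE_RATE * CHANNELS * BYTES_PER_SAMPLE * seconds / 1_000_000


def mp3_mb(seconds: float, bitrate_kbps: int = 128) -> float:
    """Size of an MP3 at a given bitrate; 128 kbit/s is a common standard quality."""
    return bitrate_kbps * 1000 / 8 * seconds / 1_000_000


print(f"uncompressed 4-minute song: {uncompressed_mb(240):.1f} MB")  # ~42 MB
print(f"128 kbit/s MP3:             {mp3_mb(240):.1f} MB")           # ~3.8 MB
```

The roughly tenfold reduction matches the "40 MB to 4 MB" example in the text.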
1.2 The story of MP3
The lossy audio coding technique MP3 was developed by the German institute
Fraunhofer IIS (http://www.iis.fraunhofer.de/). The researchers at Fraunhofer IIS found a way
to make audio files about ten times smaller by leaving out auditory elements that cannot, or
can hardly, be perceived by human hearing, a phenomenon known as auditory masking.
According to Fraunhofer IIS, the MP3 standard first appeared as part of MPEG-1 in 1992
and they decided on the name MP3 for layer 3 in 1995. In 1998, the first portable MP3-players
became available: “Rio 100” by United States’ Diamond Multimedia and “MPMAN” by Korea’s
Saehan Information Systems. These players used flash memory to store and play MP3-files.
MP3-files were either downloaded from the internet or encoded from music CD’s. This has
allowed people to start possessing a music collection on their computer, create their own
playlists, and carry music around on MP3-players.
MP3 and MP3-players rapidly gained massive popularity and started to gain preference
over compact discs. CD-players take up more physical space than MP3-players or multimedia
players such as an iPod. Furthermore, CD-players require a music CD in order to play the music
you wish to hear. Unless you want to listen to a single CD, limited to approximately 18 songs, you
will need to carry along extra music CDs, which take up physical space as well. In contrast, MP3-
files can be stored directly on the storage that is usually built into your media
player. This brings much convenience, as you do not have to take out one CD and put in
another to listen to other artists. In addition, a music CD can store about 700 MB of data, which
is 74 minutes worth of music, while the capacity of an MP3-player depends on its storage, which
nowadays can be several Gigabytes [GB] (1 GB ≈ 1000 MB). To complete the comparison,
assume that you have a portable device that can play MP3-files and has a storage capacity of 1 GB;
this equals 250 four-minute songs, totaling 16.7 hours of music. It is therefore not difficult
to conclude that MP3-players are currently preferred over CD-players.
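The storage comparison above amounts to simple arithmetic, under the same assumption of roughly 4 MB per four-minute song:

```python
MB_PER_GB = 1000    # as in the text: 1 GB is approximately 1000 MB
SONG_MB = 4         # a standard-quality four-minute MP3
SONG_MINUTES = 4


def songs_per_gb(gb: float) -> int:
    """How many 4 MB songs fit in the given number of gigabytes."""
    return int(gb * MB_PER_GB // SONG_MB)


def hours_of_music(n_songs: int) -> float:
    """Total playback time in hours for n four-minute songs."""
    return n_songs * SONG_MINUTES / 60


n = songs_per_gb(1)
print(n, "songs,", round(hours_of_music(n), 1), "hours")  # 250 songs, 16.7 hours
```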
1.3 Sharing music
During the introduction of the MP3, most people had analogue internet access via the
telephone line. Dial-up internet access works by establishing a dialed connection
to the internet service provider [ISP]. It did not matter whether you were surfing the World Wide Web
or downloading files: every kilobyte mattered due to limited transfer speeds of 5 to 7
kilobytes per second. File sizes were therefore of great importance during the dial-up
generation for a faster internet experience. In addition, since you were essentially calling your ISP
to gain internet access, you also had to pay telephone charges, which emphasized the
importance of file sizes even more. Transfer speeds have gone up since then with the
introduction of broadband internet access, a high data rate connection, although dial-up users
are still present in many parts of the world. Hence, MP3 was, and still is, a welcome
development.
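To make the importance of file size concrete, a small calculation of download times at dial-up versus broadband speeds (the figures are illustrative only):

```python
def download_minutes(file_mb: float, speed_kb_per_s: float) -> float:
    """Time in minutes to download a file of file_mb megabytes at speed_kb_per_s KB/s."""
    return file_mb * 1000 / speed_kb_per_s / 60


# A ~4 MB MP3 over dial-up (5 KB/s) versus a modest broadband line (500 KB/s).
print(f"dial-up:   {download_minutes(4, 5):.1f} minutes")    # ~13.3 minutes
print(f"broadband: {download_minutes(4, 500):.1f} minutes")  # well under a minute
```

At dial-up speeds a single compressed song already takes over ten minutes, which is why every megabyte saved by compression mattered.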
Since 1999, file sharing systems such as Napster have come into existence (Liebowitz, 2004).
These peer-to-peer file sharing programs linked users’ computers together so that they could
share their files with each other, disregarding any copyright infringement. People now had
access to others’ music collections and could build up a large collection of their own for free. With
the ability to obtain numerous songs via the internet and to share them with others, for instance
through data sticks, the need to buy CDs has declined. Unlike with CDs, you can easily add and
delete songs from your computer or portable media player. This has removed the necessity to
spend money on compact discs that contain songs of your preference alongside songs that you
do not enjoy. Overall, research has shown that purchases of music CDs have decreased since
people started to illegally share copyrighted music in the MP3-format (Liebowitz, 2004).
1.4 Metadata of music files
With songs digitally stored on our computer, we can organize them according to
specific criteria such as artist or genre. In the case of MP3-files, it is possible to store this kind of
information in ID3-tags. This tag is integrated in the MP3-file and contains several text fields in
which you can enter information such as artist, title, album title, and genre: metadata.
Assuming that the ID3-tags of all songs are complete, you can automatically create playlists
according to artist, album, or title.
1.5 Music and mood
It is commonly known that music not only contains but also gives a certain mood and
emotion to the listeners (e.g. Juslin & Sloboda, 2001; Meyer, 1956). Radocy and Boyle (1988)
stated that, within a common culture, people who listen to the same song tend to agree with
one another about the mood the particular song elicits (as cited in Liu, Lu, & Zhang, 2003,
p. 13).
Nowadays, many people have a large music collection stored digitally on their
computer or portable device. According to Voong and Beale (2007), music listeners are
interested in creating playlists that suit their mood. However, it is cumbersome to
remember what mood each song in your music collection elicits. You could tag each
of them manually, but this is a very time-consuming and tedious activity that requires you to have
listened to all of your songs.
2. Background information
This section introduces the Web 2.0 phenomenon that has caused an enormous
increase in interactivity on the World Wide Web. This has also resulted in the collective provision
and exchange of data, such as music tracks’ metadata and music files.
Furthermore, we discuss how we can organize our large music collection now that a
number of sources are available to obtain music from. As Van Zaanen and Kanters’ automatic
classifier categorizes music by mood, we discuss the mood aspect of music.
2.1 Web 2.0
The term Web 2.0 suggests that the World Wide Web received an update in the past. In
contrast, it actually refers to the changed ways of using the World Wide Web by software
developers and end-users. It is a follow-up on Web 1.0, which mainly consisted of static web
pages that provided users with a one-way stream of information. To differentiate the two web
generations, O’Reilly (2005) listed seven main features that describe Web 2.0:
- Providing services with cost-effective scalability, and not limited to packaged software
(O’Reilly, 2005); e.g. Google started out with just the search service that expanded with
services such as Google Analytics, Google’s advertising service, Google Mail, etc.
- Having control over unique data sources that get richer as more users make use of
them (O’Reilly, 2005); e.g. BitTorrent is a system that allows users to share files that are
provided by the users themselves. The more users share a (fragment of a) file, the
more sources you have to obtain the complete file from, and the faster your download
will be. The users provide the bandwidth, the availability of files, and
the files they are willing to share.
- Allowing and treating users as co-developers (O’Reilly, 2005); e.g. open source software,
i.e. software whose source code is free to be used and altered by others.
Software developers are then able to learn from, improve, and/or make use of each
other’s work. This results in software with direct input from users, which is therefore
better suited to the users’ needs.
- Making use of users’ combined intelligence (O’Reilly, 2005). For instance, Flickr
(http://www.flickr.com) is a company that allows people to share pictures online. Users
can share, view, and search for photos. In addition, users can attach related keywords
to a photo; also known as tagging. This allows other users to search and find photos
based on keywords. Another example is Wikipedia, an online encyclopedia on which
entries can be added and edited by different users.
- Benefitting from the power of the smaller websites and users that make up most of the
Web (O’Reilly, 2005), The Long Tail (Anderson, 2004), which are combined through a
service such as Google’s advertising service. This allows advertisers to reach visitors
that are part of The Long Tail.
- Providing software that can be used on different devices (O’Reilly, 2005).
- Using “lightweight user interfaces, development models, and business models” (O’Reilly,
2005, p. 37); simple for users to use, up to the users to decide what to do with the
obtained data, and easy for others to re-use and remix data.
Web 2.0 is therefore seen as the next generation web experience. It shifted from the
static Web 1.0 to the dynamic and interactive Web 2.0 that provides a rich user experience.
Some of these features can be related to the automatic mood classification system. The
system can be introduced as a service that gives music listeners access to music of a specific
mood and, in the process, lets them discover the mass of unknown songs.
Furthermore, the classification system is employable in various settings that give users a new
way of interacting with music; e.g. making recommendations, automatic playlist generation,
and as an organizing tool for large music collections. In addition, the system’s results
are currently evaluated against data acquired through social tagging, which is a valuable way
of gaining data directly from the users. The system should have direct access to this data for its
evaluation, because these data give a realistic view of the mood that people experience.
2.1.1 Social Tagging
Web 2.0 has meant a massive increase in interactivity on the World Wide Web (section
2.1). This new environment has brought new possibilities and techniques for people to “offer,
find, and interact with online music content” (Kanters, 2009, p. 7). One of the techniques is
tagging which is known under a variety of names such as collaborative tagging, social
classification, social indexing, folksonomy, and social tagging (Tonkin et al., 2008; Voss, 2007).
Tagging is when end-users assign keywords to items after which the tags are immediately
available for other users to see and to use. Social tagging can be seen as a way of indexing
done by end-users instead of experts (Voss, 2007). The keywords are freely chosen instead of
using a controlled vocabulary (Tonkin et al., 2008). The objects being tagged usually concern
digital items such as photos, music, videos, blog posts, or documents. This tagging technique
has become a popular feature for a rapidly increasing number of websites such as Flickr, reddit
(http://www.reddit.com/), and del.icio.us (http://www.delicious.com/).
According to Vander Wal (2005), there are two types of folksonomy (a combination of
folk and taxonomy): broad folksonomy and narrow folksonomy. In broad folksonomy, such as
on del.icio.us, someone creates an object and every other user is allowed to freely tag the
object in their own words. In contrast, in a narrow folksonomy, such as that used on Flickr, people
can only tag their own objects (once). In both cases of folksonomy, tags are used to describe
and organize objects which can be retrieved and accessed by using keywords in your search.
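The distinction between the two types can be sketched as a toy data model. These are hypothetical classes for illustration only; real systems such as del.icio.us and Flickr are of course far more involved.

```python
from collections import defaultdict


class BroadFolksonomy:
    """Anyone may tag any object; identical tags accumulate (as on del.icio.us)."""
    def __init__(self):
        self.tags = defaultdict(lambda: defaultdict(int))  # object -> tag -> count

    def tag(self, user: str, obj: str, keyword: str):
        self.tags[obj][keyword] += 1


class NarrowFolksonomy:
    """Only the owner may tag an object, each keyword at most once (as on Flickr)."""
    def __init__(self, owners: dict):
        self.owners = owners           # object -> owner
        self.tags = defaultdict(set)   # object -> set of keywords

    def tag(self, user: str, obj: str, keyword: str):
        if self.owners.get(obj) == user:
            self.tags[obj].add(keyword)


broad = BroadFolksonomy()
broad.tag("alice", "song.mp3", "happy")
broad.tag("bob", "song.mp3", "happy")
print(broad.tags["song.mp3"]["happy"])   # 2: both users' tags count

narrow = NarrowFolksonomy({"photo.jpg": "alice"})
narrow.tag("bob", "photo.jpg", "sunset")    # ignored: bob is not the owner
narrow.tag("alice", "photo.jpg", "sunset")
print(narrow.tags["photo.jpg"])             # {'sunset'}
```

In a broad folksonomy the tag counts themselves carry information (how many users agree on a keyword), which is precisely the kind of signal the Moody Counts provide.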
In the field of music, social tags have become an important information source for
online music recommendation systems (Eck, Lamere, Bertin-Mahieux, & Green, 2007). These
systems allow users to automatically generate playlists according to their input terms. These
terms will be matched against the tags of songs which were given by users. Last.fm is an
example of such a recommendation site that uses social tags as their information source to
recommend music to listeners.
2.1.2 Metadata of MP3
Digital audio files such as those in the popular MP3-format can contain audio-related
information such as related text, song title, graphical information, and genre. This information
is stored in a data container that is implemented in an MP3-file called ID3-tag. When you
playback an audio file, the audio software may read the metadata and display information such
as artist name, title of track, album name, year and genre. Most people are familiar with this
because many audio devices display this kind of information.
When the Fraunhofer IIS developers originally decided on the .mp3 file-name extension
for MPEG Layer 3, the format did not yet contain the ID3-tag. However, people preferred to have
this kind of information. The same preference occurs when you are listening to a music CD and
come across an unknown song or when you would like to know the content of an album. You
do this by scanning the cover on the back of its jewel CD case for the song title behind the
corresponding track number. It was Eric Kemp who had the idea in 1996 to implement
information as a tag at the end of the audio file (http://www.id3.org/). This information would
include title, artist, album, year, genre, and a comment field. Even though this was a nice
addition to audio files, the tag has a fixed size of 128 bytes, which means that at most
128 characters are available for all information combined. In addition, the number of characters
per field is fixed. For instance, “song title” can contain a maximum of 30 characters, as can the
field for “artist”, but “year” can only have 4. This tag is known as the ID3-tag, in its first version:
ID3v1.
Shortly after ID3v1 came a small improvement that was made by Michael Mutschler in
1997 (http://www.id3.org/). He took 2 bytes from the comment field, and added a new field
where the track number is to be stored. This new version of the tag is known as ID3v1.1.
However, despite the appreciation of the ID3-tag, both ID3v1 and ID3v1.1 still had their
limitations such as the low character limit per field and the placement at the end of the file
which makes it the last piece of information to receive when you are streaming the song.
Therefore, Martin Nilsson and Michael Mutschler came up with the idea for ID3v2 in 1998. Their
approach, however, is radically different from ID3v1. ID3v2 is not limited in size or number of
fields and is stored at the beginning of an audio file. Furthermore, it supports Unicode, which
means that foreign characters are possible too. The new ID3-tag has been improved in
such a way that it can even store pictures, lyrics, equalizer presets, and more, because it stores
information in a variable number of frames (http://www.id3.org/). Therefore, as the
automatic mood classifier (Van Zaanen & Kanters, 2010) bases its classification on lyrics, it
could automatically write the mood class into the ID3-tag without searching the World
Wide Web for those lyrics.
ID3-tags can be edited by the users themselves, and shared with others. In other words,
metadata can be found on the internet and automatically assigned to the specific audio files by
audio software or other software provided that sufficient input is given to retrieve the correct
data. It is possible to manually tag songs or edit incorrect tags but there are online music
libraries from which you can get (most of) your information such as All Music
(http://www.allmusic.com) and Last.fm (http://www.last.fm/). In addition, you can add your
own contribution to the database and thereby help other users, which is one of the Web 2.0
characteristics.
With the information in these tags, we can ease the process of retrieving, and
organizing large quantities of music tracks based on criteria such as artist, genre, and mood.
The addition of mood class would increase the granularity and therefore the interaction with
music.
2.2 Music collection
With the introduction of the MP3, the availability of music via the World Wide Web and
file-sharing has increased immensely, whether through legal or illegal means. Record labels have
felt this in their sales (Liebowitz, 2004; Oberholzer-Gee & Strumpf, 2007). An
example of an illegal means is obtaining music via The Pirate Bay (http://thepiratebay.org/),
a website where millions of users freely exchange music, software, movies, and other
files on the internet, thereby infringing copyrights. An example of a legal music resource is
iTunes (http://www.apple.com/itunes/), an online music service where users can preview and
buy digital music files for roughly one US Dollar per song, of which a percentage goes to the
artists. Aside from online sources, music lovers exchange these digital files amongst each other as
well.
These resources enable us to acquire a massive, yet personally selected, music
collection on our computers and portable music players (Voong & Beale, 2007). People’s music
collections have grown due to the increased accessibility of music on and off the internet. It
is therefore important to organize this mass of music in order to retrieve relevant
information in a time-efficient manner.
2.2.1 Creating Playlists
Kanters (2009) states that: “Playlists and music collections have to be organized in the
best possible way in order to find the relevant information effectively.” Much like
arranging your tangible music records, there are various approaches to organizing your files,
but they all require some kind of criterion. Unlike records, digital music files require far less
physical space and they are much easier to rearrange to fit other categories such as artist or
album. Moreover, when your digital collection is loaded into a library system it will be even
easier to sort and retrieve your music files according to different criteria, assuming that the
necessary information is present in filenames or metadata (ID3-tags). The flexibility of today’s
music allows us to arrange them in any order or combination by making a playlist (Andric &
Haus, 2006; Kanters, 2009).
People create and prepare playlists for various occasions such as jogging or driving a car.
In general, music listeners create playlists in a few ways, and usually play digital music on the
computer. The first method is to load your entire music collection, or part of it, into a playlist and
activate the shuffle mode. This is an easy and quick solution, but it is not based on a specific
criterion. Moreover, you have no or only limited control over the contents of the playlist. This
results in a randomly generated list that might not suit your current mood, depending on how
diverse your collection is. Another method is to manually arrange songs in folders or playlists
according to criteria such as artist, song title, album, genre, or mood. Vignoli (2004) found that
music listeners tend to use a hierarchical structure to manually organize their music with
folders and subfolders. He also found that users who own a CD collection tend to organize their
digital music collection according to the way their CD’s are organized. This allows some
flexibility in creating playlists by loading a main group or subgroup(s) into the list. However,
this approach has many disadvantages: especially when you are in possession of a
large music collection, it becomes difficult to navigate your way through the masses. You need
to listen to at least some passages of a music track in order to place the song in a
category, and your personal judgment of the song is required as well. Creating playlists as
previously described is time consuming, tedious, and static. Pauws and Eggen (2002) state: “It
is hard to arrive at an optimal playlist as music has personal appeal to the listener and is judged
on many subjective criteria.” If your criterion changes, you would need to re-arrange your
collection. In addition, the supply of music continues to grow, which makes reliable and
effective information access and retrieval important (Kanters, 2009). This clearly shows a
need for specialized tools that give users the ability to conveniently fulfill their music demands.
These tools must be able to classify and retrieve files according to the user’s input, such as
genre or mood. The result could be a custom-generated playlist or an organized music
collection that suits the user’s criterion (Kanters, 2009; Meyers, 2007).
2.2.2 Automatic playlist generation
As music collections have grown immensely, people tend to forget parts of their
collection or get lost in the masses. Therefore, new tools are required that allow them to, for
instance, reach forgotten music or easily generate specified playlists; a new way to interact
with music (Vignoli & Pauws, 2005). Vignoli and Pauws evaluated a music retrieval system that
sorts songs based on a seed song, and found that it took users less effort and time to make a
more satisfying playlist according to criteria such as mood, rhythm, or artist. Vignoli and
Pauws concluded that “providing users with complete control on their personal definition of
music similarity is found to be more useful and preferred than no control.”
Automatic playlist generators have been available for some years now; for instance, the
Smart Playlist feature in the popular music software iTunes, which automatically creates
playlists based on the user’s criteria. However, the number of criteria is limited, and they require
the metadata to contain the necessary information; e.g. artist, year, album, rating, newest song,
and play count. Moreover, some criteria require users’ input such as “rating” which is a score
that is given by users to a song, and “play count” which is the number of times each song has
been listened to on iTunes. This approach tends to skip the majority of the songs that have no
rating or have never been played before; The Long Tail (Anderson, 2004). You could set the
search to “newest songs” and “no ratings”, but this would still include many useless results.
2.2.3 Mood based playlist
When you create playlists according to criteria such as artist, album, or title, you
require the necessary information on the music which is usually found in its metadata. With
the complete information it becomes easier to sort and create specific playlists. According to
Liu et al. (2003), with the growing music collection on computers and the internet, it has
become apparent how important (semantic) metadata are for easy access to music.
We can also add additional information that specifies the mood of the music track. By
adding mood tags to the metadata we can view our music collection from a new perspective.
Currently, these tags are manually applied by listeners or obtained through social tagging by
using tagging tools such as Moody (http://www.moodyapp.com/). We can identify a mood in
music tracks the same way we do with moods in our daily life. “Mood tagging and tagging in
general is a relatively new way of expressing one’s feelings or thoughts” (Kanters, 2009, p. 15).
These tags can be used to create mood based playlists. If users want to create a list with
joyful songs they can sort the music files according to mood tags, select all joyful-tagged music
tracks and load them in their playlist. Moreover, it is common knowledge that music affects us
in many ways and there has been much research on musical influences (e.g. Koopman & Davies,
2001; Milliman, 1982; Thompson, Schellenberg, & Husain, 2001). As a mood based playlist
contains songs of the same mood and because people are affected by music, it can be used to
change the mood of the listener over a period of time (Kanters, 2009).
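Selecting all tracks that carry a given mood tag, as described above, is then a simple filter over the metadata. The sketch below uses a hypothetical in-memory library structure; in practice the mood tags would live in the ID3 metadata.

```python
def mood_playlist(library: list, mood: str) -> list:
    """Select every track whose metadata carries the requested mood tag."""
    return [track["title"] for track in library if mood in track.get("moods", [])]


# Hypothetical library for illustration.
library = [
    {"title": "Track A", "moods": ["joyful"]},
    {"title": "Track B", "moods": ["sad"]},
    {"title": "Track C", "moods": ["joyful", "calm"]},
]

print(mood_playlist(library, "joyful"))  # ['Track A', 'Track C']
```

Once every song carries a mood tag, whether applied manually, via social tagging, or by an automatic classifier, building a mood-based playlist reduces to exactly this kind of one-line selection.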
2.3 Music, moods, and emotions
As human beings we are constantly facing and being influenced by emotions and
moods in our daily life. Even though the concepts of emotion and mood are related, as both are
feelings, there are differences between them.
As is described by Watson (2000), emotions such as surprise and fear “represent an
organized, highly structured reaction to an event that is relevant to the needs, goals, or
survival of the organism.” Watson furthermore writes that they contain interrelated
components: a prototypical form of expression, a consistent autonomic change, a distinct
feeling state, and an adaptive behavior. These can be best explained with an example. You can
recognize the emotion of anger by observing the person’s facial expression, which is universally
typical for anger (Ekman, 1971): the eyebrows are pulled inward and downward, with wrinkles
above the center of the eyes; the upper and lower eyelids pull together a little, which makes it
appear as though the person is squinting; the lips are pressed tightly together or, in other cases,
open, shaping the mouth into a square with raised lips that may point forward; and the
teeth might show (Ekman, 1971). An angry person usually experiences autonomic changes such
as an increased heart rate, a hot or flushed face, and tensed muscles (Shields, 1984).
This emotion is also known to give the person an annoyed and irritated feeling (Watson, 2000).
Lastly, anger is often an emotional reaction that serves to defend oneself, to
maintain one’s personal integrity, or to correct social injustice, and it is at times a reaction to
protect oneself in dangerous situations in order to survive (Izard, 1991).
Moods and emotions are very much alike because they are periods of feeling that come
and go. However, the episodes of moods have a relatively longer duration compared to
emotions which are usually intense but brief states. Depending on the level of intensity, an
emotion can last from seconds at lower intensity to a few hours when dealing with high
intensity (Izard, 1991). A mood on the other hand, can last for hours or a few days (Thayer,
1989; Watson, 2000). Furthermore, both emotions and moods are influenced by external
events and experiences, but with the difference that internal processes play a role as well
when dealing with moods. In addition, this concept includes all transient feeling states. To
elaborate on the latter, it includes states that are milder versions of emotions, such as
annoyance for anger and nervousness for fear. It also includes states that require low levels of
activation and arousal like fatigue and serenity which occur frequently in daily life. Watson
(2000) finishes the definition with the notion that in order to define the nuances of affective
experience we deal with in our everyday life we need several states that do not clearly point to
one classically defined emotion.
Kanters (2009) assumes in his study that an emotion is part of the mood: a basic state
that someone finds themselves in, within which energetic emotional outbursts can occur.
As previously discussed in this section, emotions are intense but brief states while
moods go on for longer and reside internally. These fluctuations in intensity can also be
experienced in music tracks. We sense an overall feeling (mood) in which emotional outbursts
can occur during the length of the song.
2.3.1 Thayer’s model of mood
Thayer (1989) views mood as an experience of biological as well as psychological nature.
According to Thayer, our moods are made up of energy and tension. These are the dimensions
of his mood model, with valence on the x-axis of the plane, which is derived from tension, and
arousal (energy) on the y-axis. Valence is furthermore divided into negative valence (negative
moods) on the left side and positive valence (positive moods) on the right side. On the vertical
axis (arousal), there is at the top high energy arousal (energetic moods) and at the bottom
there is low energy arousal (calm moods). This divides the Valence-Arousal space into four
quadrants of moods as is shown in figure 1: positive valence and high arousal, positive valence
and low arousal, negative valence and high arousal, and negative valence and low arousal. For
instance, high arousal and negative valence can indicate an angry mood, while high arousal and
positive valence correspond to moods such as happy and excited.
Figure 1: Thayer Mood Model with examples of mood classes. The Valence axis (negative to
positive) and the Arousal axis (low to high energy) divide the plane into four quadrants:
high energy/negative (angry, annoyed, nervous), high energy/positive (excited, happy,
content), low energy/positive (relaxed, satisfied, calm), and low energy/negative (gloomy,
sad, bored).
2.3.2 Language and mood
People experience mood on a regular basis. It is an emotional state people find
themselves in that has a relatively long duration. Certain moods can be expressed through
verbal and non-verbal language such as speech, written texts, body language, and facial
expressions.
Among other studies on the relation between mood and word choice and language use
(Bower, 1981; Stenius, 1967; Teasdale & Russell, 1983), a study by Beukeboom and Semin
(2006) showed how our current mood state is reflected in our language to describe a social
event. People in a negative mood have a different word choice and language use than those
who are in a positive mood. When someone is in a negative mood, they tend to describe an
event to the point, providing concrete information. The same event described by someone in a
positive mood will come with active interpretation and enriched information. Furthermore, it
can also work the other way around: the mood of the speaker can be derived from their
language use. Other cues such as emotional tone of voice and facial expressions also help
express their mood.
People are thus aware of their mood, which can be expressed through language, and
listeners can sense the mood of the speaker through communication. Although aspects
such as body language and tone of voice help determine the mood, language can contain
information on mood as well.
2.3.3 Music, lyrics and mood
We listen to music as it gives us a certain feeling that musical aspects such as melody,
key, and rhythm can emphasize. However, Kanters (2009) states that you can also transfer a
certain mood onto the listeners by using mood related words in the lyrics. Music lyrics are texts
that contain words and phrases that are written by song writers and spoken (sung) by artists.
These can express a certain mood and emotion, just as in normal communication. If you want
to express happiness in your song, you typically use words that carry that emotional load, such
as “happy” and “joyful”, and not “darkness” and “hell”. Furthermore, the intensity
of a song can be achieved through choice of words as well.
The focus of Kanters’ study lies in the linguistic aspects of music. He assumes that lyrics
contain lexical items which express a certain mood that is transferred by the sender (artist or
songwriter) to the receiver (listener or reader). The mood of a song can be detected via its
lyrics without the presence of aspects such as the musical tracks, key, and tempo. However,
these musical aspects do emphasize the emotional load in the lyrics. Even though the vocal
aspect is of importance in detecting the mood, there are words of which the emotional load is
clear in written form such as “joy” or “misery” (Kanters, 2009).
Omar Ali and Peynircioğlu (2006) have shown “that lyrics can influence the overall
emotional valence of music”. Negative emotions (sad and angry music) were more easily
detected by listeners with the presence of lyrics while the absence of lyrics allows music to
more easily express positive emotions (happy and calm music). Furthermore, it was shown that
melodies are more dominant than lyrics in determining the emotional valence of music.
Therefore, lyrics do aid listeners in discovering the mood of music.
Kanters tested in his study whether lyrics provided the necessary information to assign
main moods to music. He stated that even though it is possible for a song to contain several
emotional events, they are all linked together by one overall mood, which is usually expressed in the chorus.
3. Background information for research
This section firstly provides information for a better understanding of the automatic
mood classification system by Van Zaanen and Kanters (2010). Secondly, it provides
information on the distance tools that are going to be used for this study.
3.1 Moody’s mood framework
Moody is a third-party software application for iTunes that is created by the company
Crayon Room (http://www.crayonroom.com/). Moody uses a color scheme that is associated
with mood; known as Moody Colors. Users can assign a mood to a song by selecting a Moody
Color from the framework. This information on mood, known as the Moody Tag, is then stored in
the comment or composer field of ID3-tags. The Moody Tags are also stored in the iTunes
database which allows other users to make use of this information.
A study by Voong and Beale (2007) showed that users found associating mood with color
a useful approach. Moody has sixteen colors to select from and therefore sixteen
different moods can be used to classify a music track. The standard settings of color codes for
moods are displayed in figure 2. From left to right it goes from sad to happy (1 – 4 on the
horizontal axis), and from the bottom to the upper row it goes from calm to intense music (D –
A on the vertical axis). Each color is therefore represented by a letter and a number, together
known as the “Moody Tag”. The tag or coordinate D1 stands for a calm and sad mood, while A4
represents intense and happy.
Figure 2: Moody’s color coding and respective moods (horizontal axis: sad to happy; vertical
axis: calm to intense). Retrieved 8 December 2010, from http://www.moodyapp.com/help/
By using Moody to tag songs in iTunes, users can generate playlists based on their mood
preference. For instance, if you are in a good mood you might feel like listening to a collection
of happy songs. When you have confirmed your mood choice, the application generates a
playlist that consists of the songs that are tagged as happy.
3.1.2 System’s mood classes
Kanters (2009) adopted the Thayer mood model (Thayer, 1989) for the classification
system. Moody’s framework uses the arousal and valence dimensions as well and therefore
integrates well into Thayer’s model as is shown in figure 3. The difference is that Moody uses
hue colors instead of keywords to distinguish moods. Another difference with the Thayer
Model is that instead of four quadrants, there are sixteen different colors to choose from. This
means that the data that Van Zaanen and Kanters received from Crayon Room has that range
of moods as well.
Figure 3: Moody’s framework integrated into Thayer’s mood model (Kanters, 2009)
In order to work with a fine grained set of classes, Van Zaanen and Kanters divided the
Valence-Arousal plane into sixteen parts (Van Zaanen & Kanters, 2010) by dividing each
dimension into four areas. Similar to the Moody framework, the arousal segments are named
A to D and the valence segments 1 to 4. The fine-grained division now resembles Moody’s
framework as is shown in figure 4. However, this division is one of four class divisions that was
studied. The different class divisions were:
- Fine-grained: A1 – D4 (16 classes)
- Arousal: A – D (4 classes)
- Valence: 1 – 4 (4 classes)
- Thayer: the original four quadrants in Thayer’s mood model (4 classes)
         Valence (1-4)
        1    2    3    4
  A    A1   A2   A3   A4
  B    B1   B2   B3   B4
  C    C1   C2   C3   C4
  D    D1   D2   D3   D4

Figure 4: Class divisions as used by Van Zaanen and Kanters (2010). Rows are the arousal
segments (A-D), columns the valence segments (1-4); the + and - marks in the original figure
indicate the positive and negative halves of each dimension.
3.2 tf*idf weighting
The tf*idf metric is a standard information retrieval metric that consists of the
components term frequency [tf] and the inverse document frequency [idf]. The tf measures
the number of occurrences of term t in document d, which is denoted as tft,d.
tfi,j = ni,j / Σk nk,j
The tf-formula shows the importance of term ti in document dj by dividing the number
of occurrences of the specific term in document dj with the total occurrences of all terms (nk,j)
in the document (dj) (Kanters, 2009).
With the term frequency, all terms are viewed as equally important. Therefore there is
the inverse document frequency that measures the importance of term t in a collection of
documents D, denoted as:

idfi = log( |D| / |{dj : ti ∈ dj}| )
The idf formula gives the logarithm of the total number of documents |D| divided by the
number of documents in which a specific term is present (Kanters, 2009). It tells us how
unique a term is in the collection of documents.
In the tf*idf metric the components are multiplied with each other. This metric is a
method for weighting the importance of each term in each document (Manning, Raghavan, &
Schütze, 2009). Van Zaanen and Kanters (2010) also used tf+tf*idf to measure the importance
of words in lyrics in each mood document.
According to Manning, Raghavan, and Schütze, tf*idf-based metrics assign a high
weight to term t in document d when the term occurs frequently in a few documents, such as
certain mood-specific words like “happy” and “heartache”. The weight is lower when term t
occurs fewer times in a document or is found in many documents, such as “not”. The weight
for term t is lowest when the word is present in nearly all documents, as function words are.
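As an illustration, the tf and idf components described above can be computed as follows. This is a minimal sketch in Python; the function names and the toy "mood documents" are my own and not taken from Van Zaanen and Kanters' implementation.

```python
import math

def tf(term, doc):
    # Term frequency: occurrences of the term divided by the total
    # number of terms in the document (here, a list of words).
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Inverse document frequency: log of the number of documents
    # divided by the number of documents containing the term.
    # Assumes the term occurs in at least one document.
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# Toy "mood documents": each stands in for the concatenated lyrics
# of one mood class.
docs = [["happy", "joy", "sun"], ["sad", "rain", "tears"], ["joy", "rain"]]
```

A mood-specific word such as "happy", occurring in only one document, receives a higher tf*idf weight than a word such as "joy" that is spread over several documents.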
3.3 Automatic mood classification system using tf*idf based on lyrics
As is mentioned in the introduction of this thesis, the automatic mood classification
system (Van Zaanen & Kanters, 2010) automatically labels music with a mood based on the
lingual aspect of music tracks. The different mood classes that were used to classify the songs
are discussed in section 3.1.2. The data that Crayon Room provided consisted of the Moody
Tags and to which artist and song title it was assigned. This dataset was used as the gold
standard for Van Zaanen and Kanters’ system.
Furthermore, a set of information retrieval features were selected in order to describe
properties of lyrics. Of these sets of features, word-based features showed the best
performance and to be more specific, the tf*idf and tf+tf*idf metric. This information retrieval
metric is generally used to measure the importance of terms in documents in a large document
collection, as is explained in more detail in section 3.2. Every mood class was represented by a
single document: a combination of all lyrics used in their study that carried the same mood tag.
The combined lyrics would then appear as though they were one document.
Hence, determining the mood depended on looking at lyrics in a word-based manner.
With the tf*idf metric, only words that do not occur in all mood classes retain weight, and
these were valued more because they can indicate the mood class in which they are most
relevant. Hence, words with high tf*idf values are highly important to a mood.
In their study the retrieval metrics were used to show how relevant words in lyrics are
with regard to the mood classes. Their results have shown that the lingual part of music does
provide information on the overall mood of a song (Van Zaanen & Kanters, 2010).
The evaluation part of the automatic mood classification system uses a binary distance
metric to determine the accuracy of the system. This yields values of 0 and 1: a 0 whenever
the system’s prediction did not match the gold standard, and a 1 for a perfect match between
the prediction and the actual mood class of a song. The accuracy was then calculated by
dividing the sum of all values by the total number of elements in the dataset.
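The binary evaluation just described amounts to a simple exact-match count. A minimal sketch (the function name is my own):

```python
def binary_accuracy(predicted, gold):
    # Each exact match contributes 1, every mismatch contributes 0;
    # accuracy is the sum divided by the number of elements.
    matches = sum(1 for p, g in zip(predicted, gold) if p == g)
    return matches / len(gold)
```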
3.4 Confusion Matrix
To get an overview of how many mood classes were tagged correctly, how many were
mistaken, and in which classes it went wrong, a confusion matrix (Kohavi & Provost, 1998) can
provide the necessary information. It is a tool that is used to share information about actual
and predicted classifications done by a classification system such as in machine learning.
Table 1 shows an example of a confusion matrix. It shows that of the nine that should
actually be classed “happy”, eight predictions were correct, and one was classed “angry” and
therefore mistaken. Another way to view this example: of the predictions that were
classed “angry”, only one was wrong. In addition, we can also conclude
that “angry” contains the most mistakes and “sad” the least. The accuracy of the classifier can
therefore also be determined.
Table 1: Example of a confusion matrix (rows: predicted class; columns: gold standard)

                     Gold standard
              Happy   Angry   Sad   Total
  Predicted
  Happy          8       1     0       9
  Angry          1       6     0       7
  Sad            0       2     9      11
  Total          9       9     9      27
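A confusion matrix like table 1 can be built by counting (predicted, actual) pairs. A minimal sketch, with names of my own choosing:

```python
from collections import Counter

def confusion_matrix(predicted, gold, classes):
    # matrix[p][g] counts how often class p was predicted
    # while the gold standard says g.
    counts = Counter(zip(predicted, gold))
    return {p: {g: counts[(p, g)] for g in classes} for p in classes}

def accuracy(matrix):
    # Correct predictions sit on the diagonal of the matrix.
    correct = sum(matrix[c][c] for c in matrix)
    total = sum(n for row in matrix.values() for n in row.values())
    return correct / total
```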
3.5 Distance Metrics
Moods are not necessarily complete opposites of each other because there are levels of
intensity to consider, compare excitement with pleased for instance. If the system labels a
song with a mood that is close to the actual mood, it is not as wrong as if it were classified with
a completely opposite mood. The mood classification system uses two dimensions to classify
mood, namely arousal and valence (Figure 4). With the help of a distance metric, it is possible
to measure how far off the mark a class is from the actual mood class or other classes.
The form of a simple distance metric for the distance between point A and point B is
d(A, B) (Kochanski, 2009). However, this metric has to comply with the following conditions:
- d(A, B) ≥ 0 (non-negativity: the distance from one mood to another cannot be negative)
- d(A, B) = 0 only if A = B (identity of indiscernibles: the distance is zero only if mood A is
  equal to mood B)
- d(A, B) = d(B, A) (symmetry: the distance from mood A to B is the same as from mood B to A)
- d(A, C) ≤ d(A, B) + d(B, C) (triangle inequality: the distance from A to C is always smaller
  than or equal to the distance from A to B and B to C combined)
In this section we discuss the two most common distance metrics: the Taxicab and the
Euclidean metric.
The Taxicab metric is known under various names such as rectilinear distance, city
block distance, and Manhattan distance. As these names suggest, and as is shown
in figure 5, it measures the distance between two points by counting the steps on a city road
grid (Krause, 1986).
[Three panels, 5a-5c, each showing a different grid path from A(0,0) to B(3,3)]
Figure 5: Different ways from point A to B using the Taxicab metric. Each way has a distance of 6 steps.
The number of steps needed to go from coordinate A(0,0) to coordinate B(3,3) is
the same for figures 5a, 5b, and 5c; in this case it is 6 steps. By using a function, the process of
counting the steps can be automated. This distance metric states that the distance between A
and B is the sum of the absolute differences of their coordinates (Krause, 1986).
Hence, |0 − 3| + |0 − 3| = 3 + 3 = 6.
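The step-counting calculation can be automated with a short function. A sketch (the name is mine):

```python
def taxicab_distance(a, b):
    # Manhattan distance: the sum of the absolute differences
    # per coordinate of the two points.
    return sum(abs(x - y) for x, y in zip(a, b))
```

For example, `taxicab_distance((0, 0), (3, 3))` reproduces the 6 steps computed above.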
The Euclidean metric is represented in figure 6. It describes the shortest route available
by drawing a straight line from point A to point B.
Figure 6: Shortest distance from point A to B
The distance can be calculated with the Pythagorean Theorem. This theorem can be
explained with the help of figure 7. It states that in any right-angled triangle, the area of
square C is equal to the sum of the areas of squares A and B. Since the areas are squares, the
length of each side is equal, and therefore area B is equal to b × b, or b². According to the
Pythagorean Theorem, c can be calculated as follows:
C = a² + b²
c = √C
c = √(a² + b²)

The Euclidean metric can be denoted as (Krause, 1986): d(A, B) = √((Ax − Bx)² + (Ay − By)²)
Figure 7: Pythagoras Theorem
By applying this method to figure 6, we can calculate the distance from A to B by
looking at their coordinates, A(0,0) and B(3,3). We can create an imaginary right-angled
triangle with line AB as one of the sides (like c in figure 7). Side a is the difference between A
and B on the horizontal axis, which is |0 − 3| = 3. Side b is also 3, because on the vertical axis
the difference is likewise |0 − 3|. Now we calculate c by filling in the formula:

c = √(a² + b²)
AB = √(3² + 3²)
AB = √(9 + 9)
AB = √18 ≈ 4.24
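The same calculation in function form (a sketch; the name is mine):

```python
import math

def euclidean_distance(a, b):
    # Straight-line distance between two points, via the
    # Pythagorean Theorem applied per coordinate.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

Calling `euclidean_distance((0, 0), (3, 3))` reproduces the √18 ≈ 4.24 worked out above.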
The difference between the Taxicab and the Euclidean metric is that the Taxicab metric
follows the grid horizontally and vertically to go from one point to the other. Furthermore,
with the Taxicab metric it is possible to have various ways to reach a destination, as is shown
in figure 5. The Euclidean distance gives the shortest route by drawing a straight line from
point A to point B, and thus offers only a single way to reach point B. However, both metrics
give a clear indication of distance by using a simple method that can be applied on a
(multi-dimensional) grid.
4. Research questions
The first question we ask concerns the precision of the automatic mood
classification system. Currently, the system evaluates a proposed tag as correct only when it
perfectly matches the Moody Tag that was assigned to a song. The results of Van Zaanen and
Kanters (2010) show that in the fine-grained division there is no difference in accuracy
between the tf+tf*idf and the tf*idf metric, so it is not known which metric is more useful. We
need an overview of which mood classes go wrong and, based on that, a way to indicate how
far off the wrong classifications are.
RQ1: How does incorporating the distance between mood classes affect the evaluation
of the automatic classifier?
Secondly, the Moody Tags are the result of a social tagging process which is recorded in
a dataset to which we refer as the Moody Counts. It was originally not known exactly to Van
Zaanen and Kanters (2010) how the songs received a Moody Tag. It is therefore possible that
the mood range of a music track is broader than just one mood class. For instance, the Moody
Tag A4 may have received the arousal A as a result of 20 votes for arousal A against 19 votes
for B. In that case the difference is too small to say with confidence that the song elicits only
arousal A. Therefore we ask whether the proposed tags are in the range of classes that the
users generally found acceptable.
RQ2: How does replacing the Moody Tags with the Moody Counts affect the evaluation
of the automatic classifier?
4.1 Research purpose
The current system is too strict in its evaluation against Moody’s data because a perfect
match has to occur in order to be counted as a success; it does not incorporate nuances
between mood classes. The main mood of a song does not necessarily have to be described
with one exact class because listeners can have different perceptions or standards. There is also
the issue of judging the level of intensity of an experienced mood. For instance, if the system
detects the mood A1 for a song but the gold standard shows the class A2, it should be a better
assessment than when the outcome D4 would have been proposed. A more fine-grained
approach is necessary to give a more realistic outcome by counting A2 as being partially
correct.
Furthermore, currently the automatic mood classification system evaluates its data
against that of Crayon Room’s Moody Tags. The tag of each song in Moody’s dataset is the
result of the highest frequency counts in arousal and valence: the Moody Counts. The counts are
the result of social tagging and therefore a spread of counts over the arousal and valence
values is very likely. It is therefore unknown how well the remaining arousal and valence values
scored with respect to the Moody Tag. Not every listener of a song might assign the exact
same mood tag to the song as another listener. One person might experience songs differently
or perhaps the mood they were in influenced their judgment. Furthermore, as is discussed in
section 2.3, music consists of a main mood with emotional outbursts. These outbursts can
influence the judging of a song’s mood as well. The system does not consider the possibility
that there is a range of moods into which the users feel the music can be categorized.
With Van Zaanen and Kanters’ system we can classify music tracks based on mood and
therefore have the possibility to create mood-based playlists. However, a more sensitive
approach is preferred: one that treats nuances between mood classes as not being completely
off target. Now that Crayon Room has agreed to provide the raw data (Moody Counts), we can
explore how the system fares when it is evaluated against this raw data. By taking into account
that some mood classes are more closely related than others, we can partially count songs for
which the system’s prediction is closely related to the song’s specified mood class. This could
show that the system is not necessarily wrong in certain cases, but rather that the processed
data should perhaps not be seen as the standard against which the system is evaluated.
This study may provide tools to give the automatic mood classifier a more fine-grained
evaluation approach. This allows music listeners to effectively and efficiently create
mood-based playlists and gives them the ability to reach throughout their music collection,
providing them with a new way to interact with music.
5. Methodology
5.1 Data
As this is a follow-up study on Van Zaanen and Kanters’ (2010) classification system, the
data for this thesis will be the results from their system and the Moody Tags. However, we will
be focusing on the tf+tf*idf and tf*idf results of the fine-grained division. This division gives the
most precise results because it concerns all sixteen mood classes whereas the other class
divisions leave out parts of information. For instance, the arousal division generalizes all
arousal values and does not incorporate the valence dimension.
In addition, we asked Crayon Room for their unprocessed data to see how users
actually tag. The reason is that the first dataset received from Crayon Room [Moody Data
A] is based on the second received set [Moody Counts].
Moody Data A consists of the Moody Tags of songs. The tags were integrated as the
gold standard in the system’s dataset to evaluate the tags that were given by the system. Both
Moody Tag and system tags are needed to create a confusion matrix on which we base our
distance calculations from system tag to Moody Tag.
The second part of this study is to look at the distribution of the Moody Counts. Crayon
Room acquires its data through social tagging, which is a typical Web 2.0 characteristic. Their
system keeps track of how a song is rated by each user. However, it does not register a
selected Moody tag as is (e.g. D4, C1, A3), but separates the value into two dimensions: arousal
and valence. Therefore, the data consists of ten fields. After artist and title, there are eight
fields: four fields for arousal (A, B, C, D) and four fields for valence (1, 2, 3, 4). When for
instance a song is rated C2, the corresponding item receives one added value for C and one
added value for 2. Based on this data, the Moody Tag for each song is determined by
combining the highest rated arousal value with the highest rated valence value. To illustrate
the latter, if the highest arousal value is B and the highest valence is 1, the song will be tagged
as B1 which then results in Moody Data A. Our request to Crayon Room for the Moody Counts
resulted in a dataset with data on 4,503 out of the requested 5,631 songs.
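The derivation of a Moody Tag from the counts can be sketched as follows. This is my own illustration; how Crayon Room resolves ties between counts is not specified, so the sketch simply takes the first maximum.

```python
def moody_tag(arousal_counts, valence_counts):
    # Combine the most-voted arousal letter with the most-voted
    # valence number into a single Moody Tag.
    arousal = max(arousal_counts, key=arousal_counts.get)
    valence = max(valence_counts, key=valence_counts.get)
    return arousal + valence

# The highest arousal count is B and the highest valence count is 1,
# so this song is tagged B1 (as in the example in the text).
tag = moody_tag({"A": 3, "B": 7, "C": 1, "D": 0},
                {"1": 9, "2": 2, "3": 0, "4": 1})
```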
With the Moody Counts, we attempt to see how broad the mood range for each song is
according to the users. Hence, we can compare the system’s results with this data and
conclude whether the system was too strict. In other words, did the system give a tag that was
half right or even accepted by music listeners as an alternative?
5.2 Method
An exploratory study will be carried out with the automatic classification system by
Van Zaanen and Kanters (2010) as the starting point. The Moody dataset consists of 4,503
entries. During the processing of the results, it turned out that some of the Moody Tags
previously used in Kanters’ study had changed over time; specific songs had been given a new
Moody Tag based on new Moody Counts. This meant that the Moody Tags in the system’s
dataset had to be updated to correspond with the latest Moody dataset. The sets had to be
made to correspond so that they contained the same entries, each with the data belonging
to it.
Furthermore, the Moody data showed entries that received a Moody Tag that was
based on null values which was unexpected because these apparently did not receive any tags
and therefore should not have been registered in the Moody Counts. In total, it concerned 38
entries which were automatically classed as A1 and have been removed from the dataset. The
final dataset consisted of 4,465 entries with the fields: Moody Tag, system tag, and Moody
Counts for arousal A to D and valence 1 to 4. The distribution is shown in table 2.
Table 2: Distribution of 4,465 entries

                  Arousal
             A      B      C      D    Total
  Valence
    1       104    199    196    152     719
    2       316    474    474    208    1456
    3       303    530    405    167    1410
    4       198    326    269    144     880
  Total     921   1529   1344    671    4465
Next, the accuracy of the current system needs to be calculated based on the newly
acquired dataset (4,465 entries). The new accuracies of the tf+tf*idf and tf*idf approaches
remain equal to each other: 70.97%. The original study showed an accuracy of 70.89% for both
approaches in the fine-grained division (Van Zaanen & Kanters, 2010). The accuracy is
needed for comparison when we take the distance metrics into consideration for RQ1
and when we substitute the gold standard in RQ2.
5.2.1 Method for RQ1
The first research question requires the system tags and Moody Tags over which the
distance metrics will be applied based on a confusion matrix. For the confusion matrix the
classes given by the system are the predicted values which will be set against the gold standard.
This will tell us how frequently a correct tag was given and, for the errors, which wrong tag
was given and how many times.
Then we use the Euclidean and Taxicab metrics to see how far off each given system tag
was from the actual class. The mood classes will need to be converted to coordinates first after
which we apply them to the distance metrics. To avoid confusion we set A1, which is the top
left corner, equal to coordinate (1,1) and D4 equal to (4,4), the bottom right corner. Both
Moody Tags and system tags will need to be separated into the two dimensions before we
translate them to coordinates. The grid would be as is shown in figure 8.
              Valence
            1    2    3    4
  Arousal
    1      A1   A2   A3   A4
    2      B1   B2   B3   B4
    3      C1   C2   C3   C4
    4      D1   D2   D3   D4

Figure 8: Mood Tags translated to coordinates
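This translation from tag to coordinate can be written as a small function (a sketch; the name is mine):

```python
def tag_to_coords(tag):
    # Arousal letters A-D map to rows 1-4 and the valence digit maps
    # directly, so A1 becomes (1, 1) and D4 becomes (4, 4).
    return ("ABCD".index(tag[0]) + 1, int(tag[1]))
```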
Afterwards, the outcomes will be normalized against the maximum distance a class can
have. For example, the maximum distance for A1 is to D4, and for D2 it is to A4. The normalized
results range from zero to one and indicate the degree of error. The outcome zero
indicates that the system provided the same tag as Moody; when the outcome is one, the
system was completely wrong (maximum distance between classes). The normalization is done
by dividing the distance from system tag to Moody Tag by the maximum distance that the
Moody Tag can have:

normalized distance = d(System tag, Moody Tag) / d(Moody Tag, furthest Moody Tag)
To calculate a weighted accuracy that takes the distance into consideration requires a
weighted distance score, which is computed as 1 − normalized distance. By doing this, we allow
wrong instances to be partially correct; perfect matches get the score 1. For instance, when the
actual mood class is B3 and the system classified the song as B1, it is still partially right because
the system at least predicted the arousal correctly. This score then gets multiplied by the
number of identical instances in the confusion matrix. Lastly, the product is divided by the
total number of elements in the confusion matrix (4,465 entries).
In other words, the weighted accuracy is the sum of:

(elements in class) × (1 − normalized distance) / (total amount of elements)
For instance, if the system’s prediction for a music track is B3 but the gold standard
shows B2, the distance between these two mood classes is 1 with the Taxicab metric and also 1
when we apply the Euclidean metric. The furthest Moody Tag possible from B2 is D4,
which gives a maximum distance of 4 (Taxicab) or 2.83 (Euclidean). When we normalize the
distances, the outcome is 0.25 for the Taxicab method or 0.35 for the Euclidean method.
With this information we can calculate the weighted accuracy for this case. Assume that this
particular error (B3 – B2) has occurred 60 times according to the confusion matrix then its
normalized distance will be multiplied by this frequency. Lastly, the outcome of the
multiplication will be divided by the total amount of entries; 4,465. Hence, with the distance
acquired from using the Taxicab metric, the weighted accuracy in this case is:

60 × (1 − 0.25) / 4465 ≈ 0.010

With the Euclidean distance, the weighted accuracy is:

60 × (1 − 0.35) / 4465 ≈ 0.009
Finally, when all the weighted accuracies are known, the total weighted accuracy is
calculated as the sum of these.
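Putting the steps of this section together, the weighted accuracy could be computed roughly as follows. This is a sketch of my own, shown with the Taxicab metric; it relies on the observation that the furthest possible class from any tag is always one of the four corner classes (A1, A4, D1, D4).

```python
def taxicab(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def tag_to_coords(tag):
    # A1 maps to (1, 1), D4 maps to (4, 4).
    return ("ABCD".index(tag[0]) + 1, int(tag[1]))

CORNERS = [(1, 1), (1, 4), (4, 1), (4, 4)]

def normalized_distance(system_tag, moody_tag, dist=taxicab):
    # Distance from system tag to Moody Tag, divided by the maximum
    # distance the Moody Tag can have (to its furthest corner class).
    s, m = tag_to_coords(system_tag), tag_to_coords(moody_tag)
    max_dist = max(dist(m, c) for c in CORNERS)
    return dist(s, m) / max_dist

def weighted_accuracy(pairs, dist=taxicab):
    # pairs: one (system tag, Moody Tag) pair per song. A perfect match
    # contributes 1, a maximally wrong tag contributes 0.
    return sum(1 - normalized_distance(s, m, dist) for s, m in pairs) / len(pairs)
```

For the worked example above, `normalized_distance("B3", "B2")` gives the 0.25 computed with the Taxicab metric.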
5.2.2 Method for RQ2
For the second research question the system tags, Moody Tags, and Moody Counts are
required. The Moody Tags are based on the Moody Counts which were acquired through social
tagging. The users categorize a song by selecting a Moody Color which is linked to a Moody Tag.
This tag is registered in the Moody Counts as two separate values: arousal and valence. We
want to see whether the system tag is within the range of arousal and valence values that were
inputted by the users.
As is mentioned in section 5.1, the Moody Counts consist of counts for each separate
dimension and its values. Due to the nature of this data, the dimensions of the system tags and
Moody Tags need to be separated as well. This is required in order to compare
the dimensions of the system tag (predicted tag) with the counts of those dimensions in the
Moody Counts of the corresponding Moody Tag (actual tag). To indicate the degree of
relevance, the count of the arousal value that matches the system tag’s arousal is
divided by the count of the arousal value that matches the Moody Tag’s arousal. The same
goes for the valence dimension, and this is done for each of the 4,465 items. Afterwards, the results of both
dimensions will be added together and divided by two. In other words, the mean
relevance is:

mean relevance = (counts system arousal / counts Moody arousal + counts system valence / counts Moody valence) / 2
This weighs the relevance of the system tag's dimensions against the dimensions of the
Moody Tag. The weighted accuracy is then calculated by dividing the sum of all mean
relevances by the total number of elements. For instance, a song with the Moody Tag B2 could
have the distribution of Moody Counts shown in table 3, while the system, based on its
analysis, classified the song as B1.
Table 3: Example of Moody Counts’ distribution for Moody Tag B2
System tag   Moody Tag                Moody Counts
                          Arousal              Valence
                          A   B   C   D        1   2   3   4
B1           B2           11  20  4   0        10  17  8   0
To see how the proposed tag compares to the Moody Counts with regard to the Moody
Tag, we calculate the mean relevance. As the system tag's arousal is B and the Moody Tag's
arousal is B, we divide the respective counts by each other: 20/20. The valences differ in this
case; the system gives valence 1 while Moody gives valence 2, which gives us the division
10/17. Filling in the entire formula then gives the mean relevance for this
example:
mean relevance = (20/20 + 10/17) / 2 ≈ 0.79
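The same calculation can be written as a small Python sketch using the counts from table 3. The data layout and the mean_relevance function name are our own illustration.

```python
# Moody Counts for the song in table 3.
counts = {
    "arousal": {"A": 11, "B": 20, "C": 4, "D": 0},
    "valence": {"1": 10, "2": 17, "3": 8, "4": 0},
}

def mean_relevance(system_tag, moody_tag, counts):
    """Average the per-dimension ratios of system-tag count to Moody-Tag count."""
    arousal = counts["arousal"][system_tag[0]] / counts["arousal"][moody_tag[0]]
    valence = counts["valence"][system_tag[1]] / counts["valence"][moody_tag[1]]
    return (arousal + valence) / 2

print(round(mean_relevance("B1", "B2", counts), 2))  # 0.79
```

The weighted accuracy for the whole dataset is then the sum of these mean relevances over all 4,465 songs, divided by 4,465.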
6. Results
The accuracy of the current system was measured with a binary approach, which
meant that results were either exactly right or wrong. When the system tag was exactly
right, it received the value one; otherwise it received zero. According to the binary metric, the
tf+tf*idf approach resulted in an accuracy of 71.0%, and the tf*idf approach also reached
71.0%. As there is no noticeable difference, both approaches can be seen as equal in
terms of performance.
6.1 Results RQ1: Distance metrics
Table 4 shows the accuracies that were achieved for the tf+tf*idf and the tf*idf
approach when the distance from the proposed class to the actual class was taken into
consideration. When the Euclidean metric was applied to tf+tf*idf, it showed a weighted
accuracy of 91.9% (n=4,465); for tf*idf this was 92.1%. The Taxicab metric resulted in a
92.3% accuracy for tf+tf*idf and 92.4% for tf*idf.
The highest percentage is found when the Taxicab metric is applied to the results of
tf*idf, and the lowest when the Euclidean metric is applied to the results of the
tf+tf*idf approach.
When we look at the columns of table 4, we see that tf+tf*idf has percentages of 91.9%
and 92.3% with the Euclidean and Taxicab metrics, respectively. The tf*idf approach
scored slightly higher with 92.1% and 92.4%.
Table 4: Accuracy for tf+tf*idf and tf*idf results after incorporating distance (n=4,465)
tf+tf*idf tf*idf
Euclidean metric 91.9% 92.1%
Taxicab metric 92.3% 92.4%
6.2 Results RQ2: Moody Counts
When we combine the Moody Counts for each class, as is done in table 5, we
can see that for each Moody Tag the highest counts are found in the corresponding arousal
and valence columns; in general the Moody Tags are chosen correctly. We also see in the results
that there is a distribution of votes across the remaining arousal and valence values.
Table 5: Sum of Moody Counts against Moody Tags
             Arousal                      Valence
Moody Tag    A     B     C     D          1     2     3     4
A1 2737 103 53 57 2729 84 62 75
A2 5960 339 127 90 168 5963 284 101
A3 6531 317 153 57 133 176 6500 247
A4 3399 163 58 30 29 41 142 3438
B1 94 3579 127 75 3510 170 127 68
B2 249 8716 249 150 154 8715 317 178
B3 219 9231 235 90 153 198 9177 247
B4 159 5454 152 70 65 132 200 5438
C1 58 94 3635 185 3642 169 122 39
C2 107 258 8637 195 220 8578 265 134
C3 84 244 6253 135 118 163 6218 217
C4 72 131 4332 96 73 56 207 4294
D1 19 79 183 2870 2809 231 64 47
D2 35 41 182 3147 115 3178 85 27
D3 17 54 135 3170 60 88 3163 66
D4 12 23 64 1683 28 39 52 1664
Table 6 shows the score and accuracy based on an evaluation of the system
tags against the Moody Counts. The score is the sum of all mean relevances. It
shows that tf+tf*idf had a score of 3276.96 and tf*idf had 3278.38. This resulted
in an accuracy of 73.4% for tf+tf*idf and 73.4% for tf*idf. As the scores show,
there is a very small difference in favor of tf*idf, but it goes unnoticed due to the
rounding of the accuracy outcomes.
Table 6: Results based on examination of system tags in comparison to Moody Counts (n=4,465)
tf+tf*idf tf*idf
Score 3276.96 3278.38
Accuracy 73.4% 73.4%
7. Conclusions and discussion
7.1 Answer to RQ1
The original results showed an accuracy of 71.0% for both the tf+tf*idf and the tf*idf
approach. By taking the weighted distance into account, classes that were close to the actual
mood class were now considered partially correct in the evaluation process. Both distance
metrics gave a fine-grained evaluation of the results, which led to a noticeable difference
between tf+tf*idf and tf*idf. It can be concluded that the tf*idf feature performed better
than tf+tf*idf: tf*idf had more predictions that were close to the gold
standard in terms of distance. This contrasts with the original results, where both features
showed no difference in accuracy. The distance metrics provided the necessary information on
the relation between the system tag and the gold standard. However, it is not yet known which
of these evaluation metrics is more suitable or accurate. Furthermore, we did not consider
the strength of the relations between moods. Should the relation between moods A2 and A3 be
considered as strong as the relation between A2 and B2?
7.2 Answer to RQ2
Upon examination of the Moody Counts, the results have shown that the given Moody Tags
were not always accurate in determining the mood of a song. This is due to how the Moody
Tags are generated. The counts have shown that people do not always experience exactly the
same mood: there was a noticeable spread of counts across the dimensions, as can be seen in
table 5, and there are (small) variations in which mood class gets chosen for each music track.
By considering this range of mood classes that each song acquires from the users, we can
conclude that Van Zaanen and Kanters' (2010) system actually performed better than was
presented.
The current gold standard consists of Moody Tags. Our results have shown that the
Moody Tags are not perfectly reliable, because those tags rely on the highest arousal and
valence values in the Moody Counts. When a song receives 23 votes for arousal A and 23 votes
for arousal B, Moody appears to select the first highest value that occurs in that
dimension; in this case arousal A is selected. Consequently, when only null values occur in the
counts, it automatically assigns A1 to the song. Therefore, this study shows that the
unprocessed data is more reliable as a standard.
Even though the new standard allows us to evaluate in a more fine-grained manner, it
did not show a clear difference in accuracy between the tf+tf*idf and the tf*idf approach,
although the scores did show a very slight difference in favor of tf*idf. The usefulness
of the new standard can therefore be questioned, but given the mentioned issues with Moody's
algorithm for deciding Moody Tags, the Moody Counts remain the more reliable source.
7.3 Sensitivity and new standard
The weighted distance has proven to give the system a more fine-grained evaluation, as
it takes more information into account during the process. The distance metrics have given a
better view of the difference in performance of the information retrieval features. In contrast,
the study on the Moody Counts did not clearly show this difference between tf+tf*idf and
tf*idf. However, the counts did provide the range of moods for each song that the users
considered acceptable. If we implement a distance metric and substitute the gold
standard in the current system, we will most likely improve the evaluation of the system as
desired. However, future research is required to determine the weights of the distance
metrics and which metric is more suitable.
Even though this study was carried out using the mood classifier as our starting point,
the distance tools can also be applied in other classification fields with multiple dimensions
where the boundaries are not clear-cut. In this thesis we looked at different mood classes, but
this could just as well have been about e.g. classifying personalities or other multidimensional
classes.
8. Future research
In this study, the two most common distance metrics were used to see the effect of
incorporating distance in the evaluation of the automatic classification system by Van Zaanen
and Kanters (2010). However, other distance metrics could have been used as well, and it is not
clear which metric is more suitable or gives better outcomes. In future research, a user
evaluation could be performed to compare the two distance metrics.
Furthermore, this study does not take into account that some adjacent mood classes may
weigh more than others in relation to the actual mood. For instance, is the relation between
moods A2 and A3 stronger than that between A3 and B3? We would need to study the relations
between mood classes and adjust the weighting of the distance metrics, or possibly introduce
a new metric.
9. References
Anderson, C. (2004). The Long Tail. Wired, 12(10), 1-30.
Andric, A., & Haus, G. (2006). Automatic playlist generation based on tracking user’s listening habits. Multimedia
Tools and Applications, 29(2), 127-151.
Beukeboom, C. J., & Semin, G. R. (2006). How mood turns on language. Journal of Experimental Social Psychology,
42(5), 553-566.
Bower, G. H. (1981). Mood and memory. American Psychologist, 36(2), 129-148.
Confusion matrix. (n.d.). Retrieved 8 November, 2010, from
http://www2.cs.uregina.ca/~dbd/cs831/notes/confusion_matrix/confusion_matrix.html
Daelemans, W., Zavrel, J., Van der Sloot, K., & Van den Bosch, A. (2010). TiMBL: Tilburg Memory-Based Learner,
version 6.3, Reference Guide. ILK Technical Report 10-01, available from
http://ilk.uvt.nl/downloads/pub/papers/ilk.1001.pdf
Eck, D., Lamere, P., Bertin-Mahieux, T., & Green, S. (2007). Automatic Generation of Social Tags for Music
Recommendation. In J. C. Platt, D. Koller, Y. Singer, & S. Roweis, (Eds.), Advances in Neural Information
Processing Systems 20, 20, 1-8. MIT Press.
Ekman, P. (1971). Universals and Cultural Differences in Facial Expressions of Emotion. In J. K. Cole (Ed.), Nebraska
Symposium On Motivation: Vol. 19. (pp. 207-283). Lincoln: University of Nebraska Press.
Izard, C. E. (1977). Human emotions. New York: Plenum Press.
Juslin, P. N., & Sloboda, J. A. (2001). Music and emotion: Theory and research. Oxford: Oxford University Press.
Kanters, P.W.M. (2009). Automated Mood Classification for Music. (Master Thesis, Tilburg University, 2009).
Retrieved from http://arno.uvt.nl/show.cgi?fid=95615
Kochanski, G. (2009). Distance Metrics. Retrieved 8 November, 2010, from
http://kochanski.org/gpk/research/misc/2004/distance-metric/dist.pdf
Kohavi, R., & Provost, F. (1998). Glossary of terms. Editorial for the Special Issue on Applications of Machine
Learning and the Knowledge Discovery Process, 30(2-3).
Koopman, C., & Davies, S. (2001). Musical Meaning in a Broader Perspective. The Journal of Aesthetics and Art
Criticism, 59(3), 261-273.
Krause, E. F. (1986). Taxicab Geometry: An Adventure in Non-Euclidean Geometry. Mineola: Dover
Publications, Inc.
Liebowitz, S. (2004). Will MP3 downloads annihilate the record industry? The evidence so far. Advances in the
Study of Entrepreneurship Innovation and Economic Growth, 15, 229-260.
Liu, D., Lu, L., & Zhang, H. (2003). Automatic Mood Detection from Acoustic Music Data. Proceedings of 4th
International Symposium on Music Information Retrieval, 4, 81-87.
Loomis, E. S. (1968). The Pythagorean Proposition: Its Demonstration Analyzed and Classified and Bibliography of
Sources for Data of the Four Kinds of “Proofs”. Washington D.C.: National Council of Teachers of
Mathematics.
Manning, C. D., Raghavan, P., & Schütze, H. (2009). An introduction to information retrieval. Cambridge University
Press.
Meyer, L. B. (1956). Emotion and Meaning in Music. Chicago: University of Chicago Press.
Meyers, O. C. (2007). A mood-based music classification and exploration system. CiteSeer.
Retrieved 20 November, 2010, from
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.86.3440&rep=rep1&type=pdf
Milliman, R. E. (1982). Using Background Music to Affect the Behavior of Supermarket Shoppers. The Journal of
Marketing, 46(3), 86–91.
O’Reilly, T. (2007). What is Web 2.0: Design Patterns and Business Models for the Next Generation of Software.
Communications & Strategies, 65(1), 17-37. Retrieved from http://ssrn.com/abstract=1008839
Oberholzer-Gee, F., & Strumpf, K. (2007). The Effect of File Sharing on Record Sales: An Empirical Analysis. Journal
of Political Economy, 115(1), 1-42.
Omar Ali, S., & Peynircioğlu, Z. F. (2006). Songs and emotions: are lyrics and melodies equal partners? Psychology
of Music, 34(4), 511-534.
Pauws, S., & Eggen, B. (2002). PATS: Realization and user evaluation of an automatic playlist generator.
Proceeding of the 3rd International Conference on Music Information Retrieval, 222-230.
Shields, S. A. (1984). Reports of bodily change in anxiety, sadness, and anger. Motivation and Emotion, 8, 1-21.
Stenius, E. (1967). Mood and Language-Game. Synthese, 17, 254-274.
Teasdale, J. D., & Russell, M. L. (1983). Differential effects of induced mood on the recall of positive, negative and
neutral words. The British journal of clinical psychology the British Psychological Society, 22(3), 163-171.
Thayer, R.E. (1989). The biopsychology of mood and arousal. New York: Oxford University Press.
The Story of MP3. (n.d.). Retrieved 20 November, 2010, from
http://www.iis.fraunhofer.de/en/bf/amm/mp3geschichte/mp3blicklabor/
Thompson, W. F., Schellenberg, E. G., & Husain, G. (2001). Arousal, mood, and the Mozart effect. Psychological
Science, 12(3), 248-251.
Tonkin, E., Corrado, E. M., Moulaison, H. L., Kipp, M. E., Resmini, A., Pfeiffer, H., & Zhang, Q. (2008). Collaborative
and Social Tagging Networks. Ariadne, 54(54), 1-20.
Van Zaanen, M., & Kanters, P.W.M. (2010). Automatic mood classification using tf*idf based on lyrics. 11th
International Society for Music Information Retrieval Conference, 11, 75-80.
Vander Wal, T. (2005). Explaining and Showing Broad and Narrow Folksonomies. Personal InfoCloud. Retrieved
20 December, 2010, from http://www.personalinfocloud.com/2005/02/explaining_and_.html
Vignoli, F. (2004). Digital Music Interaction concepts: a user study. Proceedings of International Conference on
Music Information Retrieval, 415-420.
Vignoli, F., & Pauws, S. (2005). A music retrieval system based on user-driven similarity and its evaluation.
Proceedings of International Conference on Music Information Retrieval, 272-279.
Voong, M., & Beale, R. (2007). Music organisation using colour synaesthesia. CHI ‘07 extended abstracts on
Human factors in computing systems, 1869-1874.
Voss, J. (2007). Tagging, Folksonomy & Co - Renaissance of Manual Indexing? International Symposium for
Information Science, 10, 1-12.
Watson, D. (2000). Mood and Temperament. New York: The Guilford Press.