QUANTIFYING PIANIST STYLE - AN INVESTIGATION OF
PERFORMER SPACE AND EXPRESSIVE GESTURES FROM AUDIO
RECORDINGS
Submitted in partial fulfillment of the requirements for the
Master of Music in Music Technology
in the Department of Music and Performing Arts Professions
The Steinhardt School
New York University
Advisor: Prof. Juan Pablo Bello
Cheng-i Wang
February 2013
© Copyright by Cheng-i Wang 2013
All Rights Reserved
Acknowledgement
I would like to express my great appreciation to Dr. Juan Bello, my thesis advisor, for
his valuable and constructive suggestions during my course of study in this program and
throughout the development of this thesis. His willingness to share his wisdom, knowledge
and time so generously has been very much appreciated. I would also like to thank Dr.
Agnieszka Roginska for her support, suggestions and guidance in keeping my progress on
the right track and on schedule. My grateful thanks are extended to Dr. Kenneth Peacock,
Dr. John Gilbert, Prof. Dafna Naphtali and Prof. Tom Beyer for their generous academic
advice. I would like to thank all the members of the MARL research group, who
constantly inspired and motivated me with constructive discussions and ideas. I would also
like to thank Mr. Justin Mathew, Mr. Andrew Madden and Mr. Donald Bosley for being
such positive influences and companions for the past two years. I would like to thank my
family for their support and understanding. Finally, I would like to thank Ms. Fanning Chi
for her unconditional support and encouragement throughout my study.
Contents

Acknowledgement

1 Introduction
  1.1 Background
  1.2 Motivation
  1.3 Outline

2 Approach
  2.1 The Dataset
    2.1.1 The CHARM Mazurka project
    2.1.2 Pre-processing
  2.2 Features
    2.2.1 Method
    2.2.2 Results
  2.3 Structures of Performers
    2.3.1 Method
    2.3.2 Results
  2.4 Feature Refinement
    2.4.1 Method
    2.4.2 Experiments
    2.4.3 Results

3 Discussion
  3.1 Discussion
    3.1.1 Dataset
    3.1.2 Approaches
      3.1.2.1 Beat-level Features
      3.1.2.2 Similarity Measurement
      3.1.2.3 Feature Refinement
    3.1.3 Results
  3.2 Future Work
  3.3 Conclusion

Bibliography
List of Tables

1.1 Topics under the study of music performance
1.2 Subareas of measurements of performance
1.3 Mechanisms for explaining music performance
1.4 Computational models of expressive music performance
2.1 Stats for the recordings with metadata
2.2 Beat-by-beat features
2.3 Query list for feature selection
2.4 Top five feature combinations by correct return rate for L2-norm
2.5 Top five feature combinations by correct return rate for cosine distance
2.6 Union set of performers
3.1 Section bar numbers - op.63 no.3
3.2 Comparison between Ashkenazy and Shebanova from a historical perspective
3.3 Comparison between Rubinstein and Milkina from a historical perspective
List of Figures

1.1 Illustration of the performer structure problem
1.2 Diagram of approaches
2.1 Visualization of metadata - Mazurka op.17 no.4 by Ashkenazy
2.2 Normalized dynamics of selected pairs of performers
2.3 Normalized 2nd order derivatives of duration of selected pairs of performers
2.4 Normalized Hall
2.5 Histogram of returns
2.6 Hmp, summarization matrix from mutual proximity
2.7 Normalized query/search returns using mutual proximity
2.8 Histogram of returns (mutual proximity)
2.9 Envelopes of normalized second order derivatives of duration
2.10 Reconstruction of normalized dynamics with fitted polynomials
2.11 Reconstruction of normalized 2nd order derivatives of duration with fitted polynomials
2.12 Comparison of reconstructed curves
2.13 Normalized Hmp-refined using mutual proximity
2.14 Histogram of returns (mutual proximity & feature refinement)
3.1 Comparison of {d̂, t̂″} against the score, section A1 of Chopin Op. 63 no.3
3.2 Comparison of {d̂, t̂″} against the score, section A2 of Chopin Op. 63 no.3
3.3 Comparison of {d̂, t̂″} against the score, section B1 of Chopin Op. 63 no.3
3.4 Comparison of {d̂, t̂″} against the score, section B2 of Chopin Op. 63 no.3
3.5 Comparison of {d̂, t̂″} against the score, section C of Chopin Op. 63 no.3
3.6 Comparison of {d̂, t̂″} against the score, section D of Chopin Op. 63 no.3
3.7 Difference between Aas and Arm in section C, op.63 no.3
3.8 Comparison of {d̂, t̂″} against the score, section A3 of Chopin Op. 63 no.3
3.9 Comparison of {d̂, t̂″} against the score, section A4 of Chopin Op. 63 no.3
3.10 Comparison of {d̂, t̂″} against the score, section A5 of Chopin Op. 63 no.3
3.11 Hierarchical clustering from Hmp
3.12 Normalized Hmp using {t̂, t̂′, t̂″}
3.13 Normalized Hmp using {t̂}
Chapter 1
Introduction
1.1 Background
Music, in the context of Western classical music, consists of three main human components: composers, performers and audiences. Music can also be viewed as an activity, and the corresponding components become composition, performance and listening (Sloboda, 1985). Composers carry out ideas through their compositions, and document these compositions in musical scores. The role of the performer is not just that of a transmitter relaying musical information from the composer to the audience, but that of an interpreter who is responsible for re-creating or creating the evolving structure of the music being played (Rink et al., 2011). Performances are then conveyed to audiences in live concerts or through recordings.
Whether a piece of music is appreciated in a live concert environment or through recordings, performers have a deep influence on what the music sounds like by means of expressive performance. Expressive performance refers to the phenomenon in which performers try to express the intrinsic affect of the composition by varying the aspects of performance under their control. Performances of one piece of music differ from performer to performer, and across different renditions of the same piece by one performer. This difference reflects the fact that each performer has their own way of realizing the composition, and that each realization by the same performer is different. Well-known performers are praised for their ability to execute their aesthetic interpretations with precision and elegance, and to differentiate themselves from other performers (Sloboda, 2000). It is also an established fact that, whether intended consciously by the performers or not, one of the effects of expressive performance is to convey the grouping structure of the composition. Moreover, experienced performers make greater use of expressive variations to enhance the communication of grouping structures than less experienced performers (Sloboda, 1983).
A conclusion can be drawn from the previous discussion: performers play a very important role in the communication of music from composers to audiences. Thus the study of music performance and performers is one of the keys to understanding the mechanisms of music. Studies related to music performance can be traced back to the 18th century (Gabrielsson, 2003). Empirical studies of music performance started around 1900 and focused on timing in musical performance. Seashore's textbook on music psychology, Psychology of Music (Seashore, 1967), can be seen as a summary of the research in the first half of the 20th century.
Table 1.1: Topics under the study of music performance
  Introduction
  Performance planning
  Sight reading
  Improvisation
  Feedback in performance
  Motor processes in performance
  Measurements of performance
  Models of music performance
  Psychological and social factors
  Performance evaluation

Table 1.2: Subareas of measurements of performance
  Timing & Dynamics
  Structure
  Tempo
  Ritardando
  Asynchronization
  Perceptual Effects
  Intonation & Vibrato
  Conductor
  Intention

In the second half of the 20th century, multiple topics under the study of performance began to be investigated; these topics are listed in Table 1.1 (Gabrielsson, 2003). The measurement of performance data was the dominant topic in performance research (Gabrielsson, 2003). A list of the subareas of performance measurement is provided in Table 1.2. Under the topic of measurements of performance, timing and dynamics are the most emphasized subareas, besides intonation and vibrato, which focus on the singing voice and the string family. Piano and keyboard instruments were the main focus in
the study of timing and measurement. Several studies conducted by Repp confirmed and revealed some of the tendencies and phenomena of music performance (Repp, 1990, 1995, 1996, 1997, 1998b,c,a). Some of the findings are listed below:
• Experts and graduate students have similar group-average timing patterns and individual consistency, but experts show much more individual expressivity.

• It is strongly suggested that “articulation” or “touch”, a performance attribute that is very hard to measure and define, also accounts for a large portion of expressive performance, besides timing and dynamics.

• An “average” performance, gathered across students' performances, received the highest score in aesthetic quality but was weak in individuality.

• Some timing and dynamics patterns were extracted by analyzing a collection of performances of Chopin's Etude in E Major, but only a few performers conformed to the patterns.
This research accumulated a large amount of measurement data, which naturally led to the development of models of performance. The study of models of music performance aims at finding rules or patterns that explain and summarize the phenomena extracted from the collected measurement data. Most of the research in the late 20th century considered combining different mechanisms to explain music performance. A list of mechanisms is provided in Table 1.3 (Gabrielsson, 2003). Details regarding the related research up to the 21st century can be found in (Gabrielsson, 2003).
Table 1.3: Mechanisms for explaining music performance
  Listener's learned expectation
  Psycho-acoustical/perceptual factors
  Motor constraints
  Musical structure

Table 1.4: Computational models of expressive music performance
  Model                        Description
  The KTH model                Rule-based system. Analysis-by-synthesis
  The Todd model               Direct link between structure and performance. Analysis-by-measurement
  The Mazzola model            Mathematical music theory. No empirical evaluation
  The machine learning model   Data-driven. Data mining techniques

Besides distinguishing research by the “mechanisms considered”, the “approaches used” provide another way to categorize research. From the beginning of the 21st century in particular, computational or quantitative models have been gaining more and more attention.
Computational models of expressive music performance embody mathematical models which define relationships between variables provided by measured data. Four prominent computational models are listed in Table 1.4 (Widmer and Goebl, 2004). The machine learning model takes advantage of the improvements in hardware power and algorithms developed over the past decade. IMP/ML@OFAI (the Intelligent Music Processing and Machine Learning Group of the Austrian Research Institute for Artificial Intelligence) conducted a series of investigations into the understanding and visualization of expressive piano performance with machine learning approaches (Dixon et al., 2002; Flossmann et al., 2009, 2010; Goebl et al., 2004; Grachten and Widmer, 2007; Grachten et al., 2008; Grachten and Widmer, 2011; Madsen and Widmer, 2006; Pampalk et al., 2003; Stamatatos and Widmer, 2002; Widmer, 2001; Widmer et al., 2003; Widmer and Goebl, 2004; Widmer and Zanon, 2004). Efforts were made to accomplish tasks such as automatic pianist recognition, expressive gesture visualization and artificial piano performance. Successful results were presented in the latter two areas, but no significant progress on automatic pianist recognition was reported.
Research on the relationship between audio recordings and musicology has also investigated the subject of pianist style in a quantitative fashion. The CHARM Mazurka project (Sapp, 2007, 2008) is a result of such investigation. A large number of recordings, together with metadata about performances of Chopin's mazurkas, were collected, and research motivated by finding relationships between performers was conducted.
1.2 Motivation
Computational models based on data-driven approaches make very few assumptions about the data, and extract patterns or information from the data objectively. The construction of an objective framework for expressive music performance based on data-driven approaches therefore provides a tool that has been absent in musical research. With such a framework, musicologists could verify historical and theoretical claims about music performance, and researchers in music cognition or perception could gain further understanding of topics such as expert performance, the mechanisms of interpretation, and expressive variation.
Data-driven approaches rely on the existence of a certain amount of data. In the context of studying expressive music performance, taking piano as an example, the data could either be MIDI recordings (Goebl et al., 2005) from expert performers or audio recordings of performances. MIDI recordings make it possible to analyze the synchronization between the two hands as well as articulation, two tasks that are very difficult given only audio recordings. Although MIDI recordings present themselves as a better choice than audio recordings in terms of the information carried, their availability is a serious problem for data-driven approaches: the amount of MIDI recordings is far smaller than that of audio recordings, making MIDI recordings unsuitable for data-driven performance analysis.
A large part of previous work in computational modeling of expressive music performance has focused on forging low-level representations out of surface-level musical events, such as timings and dynamics (Sapp, 2007; Widmer et al., 2003). The relationship between the surface and its underlying rendering mechanism is a dynamic process which evolves with time (Rink et al., 2011). It has been reported that expert performers usually perform with individuality and do not conform to the average performance (Sloboda, 2000). This finding agrees with the common perception that performers have their own “style” of comprehending, interpreting and performing music. The goal of this thesis is to devise a framework for verifying how this phenomenon reflects itself in audio recordings, from the perspective of signal processing and information retrieval. The goal is two-fold: how recordings are structured with performers as labels, and which features are piece-invariant.
Figure 1.1: Illustration of the performer structure problem
Structure of Performers
Since there is neither any evidence showing that each performer is individually distinguishable, nor any ground truth about how performers should be classified, the structure of how performers relate to each other in the space defined by features extracted from performance measurements should be investigated in an unsupervised manner. This structural problem can be explained using Figure 1.1. In the three subplots of Figure 1.1, every circle represents a different performer and the assigned color represents the 'true' grouping of the performers. Each subplot illustrates one possible realization of the grouping of performers out of many possibilities, and the true distribution is unknown to observers. In other words, the assumption that each performer has their own individual style, and can hence be separated individually in some feature space, should be challenged (Stamatatos and Widmer, 2002; Widmer and Zanon, 2004). This thesis aims at proposing a framework for investigating this issue.
Piece-invariant Features
Performances of different pieces by the same performer cannot be grouped together directly, since the feature sequences representing each performance have different lengths for different pieces. In order to tackle this issue and facilitate the understanding of music performance, piece-invariant features have to be devised. Piece-invariant features should reflect the characteristics of the performers instead of the pieces themselves. To the best of the author's knowledge, there has not yet been any success in devising such features.
1.3 Outline
The workflow is as follows: first, each performance is represented by a vector constructed from performance measurements in the dataset; then a nearest neighbor search is applied to each piece, given a set of queries among the vectors representing performances. A similarity measurement matrix is then constructed by aggregating the normalized query/return counts for each piece. Pairs of query/return in the similarity measurement matrix with high values are selected for feature refinement. The goal of feature refinement is to find characteristics separating groups of performers, and then to design hand-crafted features based on those characteristics. This feature refinement step can be thought of as an iterative process for improving the features representing performances; in this thesis only one iteration of the refinement process is studied. The results of the similarity measurement and the feature refinement are evaluated with musical interpretations.

Figure 1.2: Diagram of approaches

A diagram of the framework is provided in Figure 1.2. The outline of this thesis is as follows: Chapter 2 describes the dataset and the approaches. Discussions, evaluations and future work are covered in Chapter 3.
Chapter 2
Approach
2.1 The Dataset
For the study of expressive performance analysis, the piano is considered a suitable instrument, since the performance attributes that a pianist can control are relatively simple and more easily quantified than those of the brass, wind and string families (Widmer et al., 2003). There is also a large number of piano recordings available, which makes the instrument suitable for data-driven approaches. The advantage of studying expressive performance with Chopin's mazurkas lies in their structure: borrowed from folk models, it makes the mazurkas well structured and repetitive (Rink et al., 2011). As a result, comparing different parts of a piece becomes an easier task.
2.1.1 The CHARM Mazurka project
The dataset is an archive of recordings and metadata of Chopin's mazurkas from the Mazurka Project (http://mazurka.org.uk) conducted at CHARM (the Centre for the History and Analysis of Recorded Music, UK) (Sapp, 2007). The collection has 2,919 recorded performances of 49 mazurkas. Each mazurka has, on average, over 50 renditions by different performers. The recordings collected in the dataset span the period from 1902 to 2006. A complete discography of the recordings can be accessed from the project website.
Besides the collection of recordings, metadata come with the dataset. Two kinds of metadata are available for part of the recordings: the duration in seconds and the dynamics at each beat location. The beat locations are annotated in a semi-automated fashion, and the dynamics are then calculated given these locations. Details of the approach can be found in (Sapp, 2007). From the metadata, the durations and tempos of each beat, and the average values across different performances of the same mazurka, are then derived. The plot in Figure 2.1 shows the beat-level metadata provided in the dataset.

Figure 2.1: Visualization of metadata - Mazurka op.17 no.4 by Ashkenazy
2.1.2 Pre-processing
In order to investigate the quantitative relationships between performances, only recordings with metadata attached are included in the analysis. Since not every recording with beat locations has dynamics information, additional dynamics information is calculated in the same manner mentioned in (Sapp, 2007). In Table 2.1, the numbers in parentheses represent the quantity of dynamics entries added during this research.

Table 2.1: Stats for the recordings with metadata
              # of performances   # of beats within the piece
  Op.17 No.4  62 (32)             396
  Op.24 No.2  64                  360
  Op.30 No.2  33                  192
  Op.63 No.3  82                  228
  Op.68 No.3  49 (27)             180
  Total #     290                 81444
(Numbers in parentheses represent additional entries by this thesis.)

Table 2.2: Beat-by-beat features
            Without normalization   With normalization
  Duration  t, t′, t″               t̂, t̂′, t̂″
  Dynamics                          d̂, d̂′, d̂″
Given the beat-by-beat durations and dynamics of the mazurka performances, basic pre-processing is applied to expand the raw beat-level information: derivatives and normalization are applied to the raw data to expand the number of beat-level features. Duration values are denoted by t and dynamics by d. First and second order derivatives are denoted by ′ and ″. Normalization to zero mean and unit standard deviation, denoted by ˆ, is applied across each performance. Table 2.2 lists the features used. It should be noted that, since the recordings span more than a century, the dynamics data are always normalized, due to the varying dynamic range of recording technology across this period.
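As a sketch of this expansion (the function names and the padding convention are illustrative assumptions, not taken from the thesis code), the derivatives and the normalization can be computed as follows:

    import numpy as np

    def expand(x):
        """Given a per-beat sequence x (durations t or dynamics d), return
        (x, x', x'') using discrete differences, padded with the first value
        so that all three sequences keep the original length."""
        x = np.asarray(x, dtype=float)
        d1 = np.diff(x, n=1, prepend=x[0])
        d2 = np.diff(x, n=2, prepend=[x[0], x[0]])
        return x, d1, d2

    def normalize(x):
        """Zero mean and unit standard deviation: the hat operator of Table 2.2."""
        return (x - np.mean(x)) / np.std(x)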
Since the metadata are stored in Excel files, a Python module was implemented to access, parse and process the metadata from them. The module is available at https://bitbucket.org/ciwang/exper.
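As an illustration, a loader along the following lines could be used; the column names 'duration' and 'dynamics' are a hypothetical simplification, since the layout of the actual spreadsheets may differ.

    import pandas as pd

    def load_beats(path, sheet=0):
        """Read one performance's beat-level metadata from an Excel sheet."""
        df = pd.read_excel(path, sheet_name=sheet)
        return df['duration'].to_numpy(), df['dynamics'].to_numpy()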
2.2 Features
In order to analyze each performance computationally, a numerical representation appropriate for representing a performance must be devised before the analysis is carried out.
2.2.1 Method
With the beat-level measurements extracted from the dataset, it is still unknown which combination of measurements forms the best feature vector to represent a performance. In order to filter out a combination to start working with, an exhaustive search over the space of measurement combinations is implemented, together with a criterion for ranking the combinations. Similar to the evaluation conducted in (Sapp, 2008), since the goal is to devise a feature set that can quantify a pianist's style, one way to devise the criterion is to treat the feature sequence of each performance as a point in a high-dimensional space and check whether performances by the same performer are closer to each other than to those by others. A list of query recordings, shown in Table 2.3, is constructed by choosing performances recorded more than once by the same performer in each of the five mazurkas. Each performance listed in Table 2.3 is used as a query, and a nearest neighbor search returns the three nearest performances.
Table 2.3: Query list for feature selection
  Op. 17, No. 4: Rubinstein1939, Rubinstein1952, Rubinstein1966, Horowitz1971,
    Horowitz1985, Czerny1949, Czerny1949b, Rosenthal1930, Rosenthal1931,
    Rosenthal1931b, Rosenthal1931c, Rosenthal1931d, Uninsky1932, Uninsky1971,
    Zak1937, Zak1951
  Op. 24, No. 2: Rubinstein1939, Rubinstein1952, Rubinstein1966, Richter1960,
    Richter1961, Garcia2007, Garcia2007b
  Op. 30, No. 2: Rubinstein1939, Rubinstein1952, Rubinstein1966
  Op. 63, No. 3: Rubinstein1939, Rubinstein1952, Rubinstein1966, Tsong1993, Tsong2005
  Op. 68, No. 3: Rubinstein1938, Rubinstein1952, Rubinstein1966, Friedman1923, Friedman1930
Two metrics, the L2-norm and cosine distance, were used. The query itself is excluded during the search. If any one of the returns is by the same performer as the query, the query is considered correct; otherwise it is incorrect. If the piece has N beats and the feature vector consists of M measurements, then each performance is a vector of length N × M. An exhaustive search over all the combinations of features in Table 2.2 is conducted. For each measurement combination, the average correct return rate across all queries is calculated, and the combination with the highest correct rate is chosen.
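The evaluation loop can be sketched as follows; the data structures (a dict of per-measurement arrays, a performer label per recording, and the query indices of Table 2.3) are assumptions for illustration, not the thesis code.

    from itertools import combinations
    import numpy as np

    def best_combination(features, performers, queries, k=3):
        """features: dict mapping measurement name -> (n_recordings, n_beats) array.
        Scores every non-empty measurement subset by the rate at which a k-nearest-
        neighbor search returns a recording by the same performer."""
        best_rate, best_combo = 0.0, None
        for r in range(1, len(features) + 1):
            for combo in combinations(features, r):
                X = np.hstack([features[m] for m in combo])  # rows of length N*M
                hits = 0
                for q in queries:
                    dist = np.linalg.norm(X - X[q], axis=1)  # L2-norm to all others
                    dist[q] = np.inf                         # exclude the query itself
                    nearest = np.argsort(dist)[:k]
                    hits += any(performers[i] == performers[q] for i in nearest)
                rate = hits / len(queries)
                if rate > best_rate:
                    best_rate, best_combo = rate, combo
        return best_rate, best_combo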
Table 2.4: Top five feature combinations by correct return rate for L2-norm
  Feature combination   Correct return rate
  d̂, t̂″                 97%
  d̂, t̂′, t″             94%
  d̂, t̂                  91%
  d̂, t̂′                 91%
  d̂, d̂′, t̂              88%

Table 2.5: Top five feature combinations by correct return rate for cosine distance
  Feature combination   Correct return rate
  d̂, t̂″                 97%
  d̂′, t̂, t″             94%
  d̂′, d̂′, t̂′            94%
  d̂, t′, t̂′             92%
  t̂, d̂′, t̂″             88%
2.2.2 Results
The top five results for both metrics are listed in Table 2.4 and Table 2.5. Multiple combinations return the same percentage, but only the combinations with the fewest features are kept in the results. The combination of normalized dynamics and normalized second derivatives of duration gives the best correct return rate. The implications and musical explanation of this feature combination are discussed in Chapter 3. The result is then fed to the next stage to form distance matrices. In Figure 2.2 and Figure 2.3, performances are plotted with the resulting feature set {d̂, t̂″}.
Figure 2.2: Normalized dynamics of selected pairs of performers
Figure 2.3: Normalized 2nd order derivatives of duration of selected pairs of performers
2.3 Structures of Performers
Since each piece has a different number of beats, a straightforward comparison between performance feature sequences of different pieces is unachievable. The relationships between performers within each piece are also unknown. An approach to tackle these two problems is proposed in the following.
2.3.1 Method
With the feature set {d̂, t̂″} from Section 2.2, another query/search task can be used to derive a similarity measurement between pairs of performers. Only the union of the performers of the five mazurkas is considered at this stage, in order to obtain a more general structure across the different pieces. The union set S, with 19 performers, is shown in Table 2.6. The steps are as follows:

1. For each piece, a 19 × 19 matrix Hm is created, with m ∈ {17, 24, 30, 63, 68} representing each piece by its opus number.

2. For each piece m, each performer si in S is used as a query to return the three nearest neighbors sj, i ≠ j, from S using the L2-norm (cosine distance yields almost the same returns). Each of the three cells (i, j) in Hm is then incremented by 1.
Table 2.6: Union set of performers
                Op.17 No.4  Op.24 No.2  Op.30 No.2  Op.63 No.3  Op.68 No.3  Total Counts
  Ashkenazy     1           1           1           1           1           5
  Biret         1           1           1           1           1           5
  Brailowsky    1           1           1           1           1           5
  Chiu          1           1           1           1           1           5
  Cortot        1           1           1           1           1           5
  Fliere        1           1           1           1           1           5
  Francois      1           1           1           1           1           5
  Hatto         1           2           1           2           2           8
  Indjic        1           1           1           1           1           5
  Luisada       1           2           1           1           1           6
  Lushtak       1           1           1           1           1           5
  Magaloff      1           1           1           1           1           5
  Milkina       1           1           1           1           1           5
  Mohovich      1           1           1           1           1           5
  Rangell       1           1           1           1           1           5
  Rubinstein    3           3           3           3           3           15
  Shebanova     1           1           1           1           1           5
  Smith         1           1           1           1           1           5
  Uninsky       1           1           1           2           1           6
  Total Counts  21          23          21          23          22          110
(The number in each cell is the number of performances of each piece by each performer.)
3. A summarizing matrix Hall is obtained by summing all the Hm matrices together; each row in Hall (representing a query and all its returns) is then normalized to unit sum, to balance the differing query counts between performers. The value in cell (i, j) of Hall can be viewed as a (non-symmetric) similarity measure between performer i and performer j.
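A sketch of steps 1-3 is given below; for brevity it assumes one feature matrix per piece with rows ordered by the performers in S (in the actual data some performers contribute several recordings per piece, which the bookkeeping would have to account for).

    import numpy as np

    def summarize_pieces(pieces, n=19, k=3):
        """pieces: list of (n, N*M) feature arrays, one per mazurka."""
        H_all = np.zeros((n, n))
        for X in pieces:
            D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise L2
            np.fill_diagonal(D, np.inf)             # never return the query itself
            for i in range(n):
                for j in np.argsort(D[i])[:k]:      # three nearest neighbors
                    H_all[i, j] += 1                # accumulate H_m into H_all
        return H_all / H_all.sum(axis=1, keepdims=True)  # rows to unit sum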
Hall and the histogram of returns by different performers are shown in Figure 2.4 and Figure 2.5. Observe in Figure 2.4 and Figure 2.5 that some performers dominate the return results in certain mazurkas, such as Milkina in op. 17 no. 4, Uninsky in op. 24 no. 2, and Biret in op. 30 no. 2. Milkina has the highest return counts in op. 17 no. 4, op. 63 no. 3, op. 68 no. 3, and in total. The non-uniform distribution in the return histogram suggests that the 'hubness' of the feature space is relatively strong. Hubness refers to the phenomenon in which, in a high-dimensional space, some points are close to every other point in the space because of the high dimensionality and the features used to construct the space, not because of the characteristics of the points themselves (Flexer et al., 2012). This is a well-known issue in music-similarity applications such as recommendation and search. In the context of this thesis, it is possible that some specific performers possess a style that is similar, or in a sense close, to the average of most of the performers. But since there is no evidence supporting this point of view, a remedy to ease the 'hubness' of the feature space should be applied, to see whether it improves the outcome of the query/search task.
Figure 2.4: Normalized Hall
Figure 2.5: Histogram of returns
The word 'improve' here means that the ideal distribution of the return-count histogram would be uniform, yet peaks still appear in Hall. Mutual proximity (Schnitzer et al., 2011) transforms the distance between a point x and every other point yi in the space into the probability that yi is the nearest neighbor of x, denoted by P(yi → x), by assuming that the distribution of the distances from yi to x is Gaussian. The distance between points x and y is then recalculated as P(y → x) × P(x → y). It has been shown that replacing the original distance matrix with mutual proximity eases the hubness of the feature space (Schnitzer et al., 2011; Flexer et al., 2012). Since the mutual proximities are themselves values with probabilistic properties, the query/search step can be skipped, and the five per-piece mutual proximity matrices can be multiplied element-wise (equivalently, their logarithms summed) to produce a 19 × 19 summarization matrix of pair-wise similarity measurements. In order to apply mutual proximity, steps 2 and 3 are replaced by the following steps; the resulting summarization matrix is denoted by Hmp:

• For each piece m, calculate the 19 × 19 L2-norm pairwise distance matrix Dm, then transform Dm from L2-norm into the mutual proximity matrix Dmp.

• Calculate Hmp as Hmp(i, j) = Σ_{n ∈ m} log Dnp(i, j), where n runs over the five pieces.
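A minimal sketch of this transform under the Gaussian assumption of Schnitzer et al. (2011) is given below; the function name and the aggregation over pieces are illustrative, and edge cases such as the zero diagonal are ignored.

    import numpy as np
    from scipy.stats import norm

    def mutual_proximity(D):
        """Replace each distance D[i, j] with P(j -> i) * P(i -> j), modeling
        the distances from each point as a Gaussian (row-wise mean and std)."""
        mu, sd = D.mean(axis=1), D.std(axis=1)
        # sf[i, j] = probability that a random point lies farther from i than j does
        sf = norm.sf(D, loc=mu[:, None], scale=sd[:, None])
        return sf * sf.T

    # H_mp(i, j) = sum over the five pieces of log D_mp(i, j); `piece_distances`
    # stands for a hypothetical list of the five per-piece L2-norm distance matrices:
    # H_mp = sum(np.log(mutual_proximity(D_m)) for D_m in piece_distances)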
2.3.2 Results
Hmp is plotted in Figure 2.6. Query/search results using mutual proximity are shown in Figure 2.7 and Figure 2.8.
Figure 2.6: Hmp, summarization matrix from mutual proximity
Figure 2.7: Normalized query/search returns using mutual proximity
Figure 2.8: Histogram of returns (mutual proximity)
Comparing them with Figure 2.4 and Figure 2.5, it is clear that the distribution of returns becomes flatter after applying mutual proximity, while both Hmp and the normalized query/return results retain their peaks with greater clarity.
2.4 Feature Refinement
To further investigate how to derive piece-invariant features, the characteristics that differentiate performer groups from each other are examined. The results of this examination are used to refine hand-crafted features.
2.4.1 Method
The goal of feature refinement is to find features capable of separating performers but not pieces. By observing the similarity matrix in Figure 2.7, pairs of query/return showing high similarity with each other are selected as groups of performers to be examined. The criterion of the examination is to find characteristics that separate group from group. Feature refinement is then conducted to achieve a more compact representation of performances.
2.4.2 Experiments
Two pairs of performers are selected as groups to be examined and compared. The first pair is Rubinstein and Milkina and the second is Ashkenazy and Shebanova; they will be referred to as Arm and Aas respectively. Aas shows symmetric peaks in Hmp, which means that each performer appeared as a return multiple times given the other performer as the query. Although the similarity between Rubinstein and Milkina is not symmetric, the pair is chosen because of its strong similarity values given Rubinstein as the query and Milkina as the return.
Refinement based on d̂
Observing the dynamics curves in Figure 2.2, it is obvious that the shapes of d̂ are similar within pairs of performers but dissimilar between pairs. To model these shapes, the dynamics curves are first segmented into sections according to a form analysis of the respective score. Each section is then modeled by polynomial fitting, and the coefficients of the fitted polynomials of each section are taken as the new representation of d̂.
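The section-wise fit can be sketched as follows; 'sections' stands for the (start, end) beat indices obtained from the form analysis (cf. Table 3.1), and the names are illustrative.

    import numpy as np

    def fit_sections(curve, sections, order=1):
        """Fit a polynomial of the given order to each section of a beat-level
        curve and return the concatenated coefficients as the refined feature."""
        coeffs = []
        for start, end in sections:
            x = np.arange(end - start)
            coeffs.append(np.polyfit(x, curve[start:end], order))
        return np.concatenate(coeffs)  # order 1 gives (a, b) per section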
Refinement based on t̂″
For t̂″, the initial observation of Figure 2.3 suggested that the curves exhibit semi-oscillating behavior. It was determined, however, that although the curves do exhibit semi-oscillating behavior, this is not the factor that separates groups of performers from each other. Further investigation of Figure 2.3 suggested that it is the envelope of the curves (Figure 2.9) that separates groups of performers from each other. The envelopes are obtained by full-wave rectification followed by a moving average of window length 3. To model the shape of the envelope, the same approach as for d̂ is adopted; the resulting coefficients of the fitted polynomials of each section become the new representation of t̂″.
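The envelope computation described above amounts to the following (window length 3 as in the text; names are illustrative):

    import numpy as np

    def envelope(t2, win=3):
        """Full-wave rectification followed by a moving average of length win."""
        return np.convolve(np.abs(t2), np.ones(win) / win, mode='same')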
Figure 2.9: Envelopes of normalized second order derivatives of duration
Coefficients of fitted polynomial as features
For the polynomial fitting, an appropriate polynomial order has to be chosen. To choose it, the 1st through 8th orders are tested using the same criterion used for choosing the feature combination in Section 2.2. The best result is 89%, with 1st order for both d̂ and t̂″, which means that the fitted polynomials for each section of d̂ and t̂″ have the form y = ax + b. The reconstructed curves (actually straight lines with slope a and offset b) are plotted in Figure 2.10 and Figure 2.11. A plot comparing the reconstructed curves of Arm and Aas is shown in Figure 2.12.
A further question about the refined feature is: to what extent does the sequential relationship between sections influence the ability to separate groups of performers? If the sequential relationship can be ignored, then it is possible to design piece-invariant features by summarizing over the sequential features. To summarize the refined features without taking sequential relationships into account, the average and standard deviation of the coefficients of each order across the sections are calculated. The resulting feature for each performance is a vector of length (# of orders + 1) × 2 (mean & standard deviation) × 2 (d̂ & t̂″). All 64 combinations of polynomial fitting orders for d̂ and t̂″ from 1 to 8 are tested, using the same criterion as in Section 2.2.1. (The metric used is the L2-norm; since summarizing with the means and standard deviations of the coefficients implies modeling the coefficients with a mixture of Gaussians, an altered KL-divergence measuring the similarity between two mixtures was also tried, but yielded a correct rate of only 11%.) The best combination of orders is still 1st for both d̂ and t̂″, with a correct rate of 44%.
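The order-independent summary can be sketched as follows, assuming the per-section coefficients from the fitting step are stacked into (n_sections, order + 1) arrays:

    import numpy as np

    def summarize(coeffs_d, coeffs_t2):
        """Mean and standard deviation of each coefficient across sections,
        concatenated for d-hat and t-hat''; the resulting length is
        (# of orders + 1) x 2 x 2, independent of the piece."""
        stats = lambda c: np.concatenate([c.mean(axis=0), c.std(axis=0)])
        return np.concatenate([stats(coeffs_d), stats(coeffs_t2)])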
Figure 2.10: Reconstruction of normalized dynamics with fitted polynomials
Figure 2.11: Reconstruction of normalized 2nd order derivatives of duration with fitted polynomials
Figure 2.12: Comparison of reconstructed curves
2.4.3 Results
The similarity measurements using the refined features (the coefficients of the fitted polynomials) are plotted in Figure 2.13 and Figure 2.14. In short, the refined features broaden the histogram of returns (Figure 2.14) more than applying mutual proximity alone, and more nearly symmetric pairs of performers appear in Figure 2.13.
Figure 2.13: Normalized Hmp-refined using mutual proximity
Figure 2.14: Histogram of returns (mutual proximity & feature refinement)
Chapter 3
Discussion
3.1 Discussion
3.1.1 Dataset
The most critical issue affecting the whole thesis concerns the size of the dataset. The Mazurka Project has 2,919 recordings of Chopin mazurkas in its collection, and several research projects in the area of music information retrieval have taken advantage of it (Bello, 2009, 2011; Bello et al., 2012; Nieto et al., 2012). From the perspective of expressive performance, however, only 290 recordings come with metadata. On average, each of the 5 mazurkas has 30 ~ 60 different renderings by different performers, meaning that most of the performers have 1 ~ 4 performances across the 5 pieces. If we consider the task of quantifying pianist style in the context of classification problems (Stamatatos, 2001;
Stamatatos and Widmer, 2002; Widmer and Zanon, 2004), the dataset size for each label (in this case, the performers) is very small (1 ~ 15), which makes it very difficult to adopt pattern recognition techniques to extract meaningful timing/dynamics patterns or discrimination functions for each of the labels. To make things worse, since performances by the same performer are spread over the 5 mazurkas most of the time, it is very difficult to marginalize out the effect of the composition itself. Not only is pattern recognition very difficult given this dataset, but the validation of any findings from this dataset is itself a non-trivial topic.
3.1.2 Approaches
3.1.2.1 Beat-level Features
Examples of the beat-level features {d̂, t̂″} are plotted in Figure 2.2 and Figure 2.3. The examples are performances from Arm and Aas. The shapes of the dynamics curves of the members within each pair are closer than the shapes between pairs. The differences in dynamics shapes between Arm and Aas clearly display different phrasing strategies between these two pairs of performers. To further investigate how {d̂, t̂″} reflect themselves in characterizing the performances, zoomed-in inspections of each section of Op. 63 no. 3 by Arm and Aas are conducted. The inspections are done by comparing {d̂, t̂″} to the score section by section, and then observing their commonalities and differences in detail. The bar numbers corresponding to the sections obtained by form analysis are provided in Table 3.1.
Table 3.1: Section bar numbers - op.63 no.3
  Section   Bar numbers (bar/beat)
  A1        0/3 ~ 8/2
  A2        8/3 ~ 16/1
  B1        16/2 ~ 24/2
  B2        24/3 ~ 32/3
  C         33/1 ~ 40/3
  D         41/1 ~ 48/3
  A3        49/1 ~ 56/3
  A4        57/1 ~ 64/3
  A5        65/1 ~ 76/1
Section A1 & A2
In Figure 3.1 and Figure 3.2, sections A1 and A2 are plotted against {d̂, t̂″} respectively. The first obvious difference between Aas and Arm is at the beginning of the piece and is marked by a yellow box on d̂. Arm begins the piece with a more powerful dynamic and then gradually becomes softer towards the second bar, while Aas begins the piece softly and gradually gets louder towards the end of the first bar. The second difference is from bar 3 to bar 4 and is marked by a purple box on t̂″. From bar 3 to bar 4, Arm exhibits oscillatory behavior in t̂″ while Aas is smoother during the two bars. t̂″ defines a measure of the shape of each group of three points along the curve. The oscillation of Arm shows that the shape of each group of three beats represented by t̂″ changes back and forth between concave and convex at each beat, meaning that during these two bars the duration of each beat changes radically and the direction of change also changes frequently.
Figure 3.1: Comparison of {d̂, t̂″} against the score, section A1 of Chopin Op. 63 no.3

Figure 3.2: Comparison of {d̂, t̂″} against the score, section A2 of Chopin Op. 63 no.3

Figure 3.3: Comparison of {d̂, t̂″} against the score, section B1 of Chopin Op. 63 no.3
In section A2, there is also a difference at the beginning of the section in terms of dynamics, annotated by a yellow box. From beat 3 to 6, Aas is louder relative to the pick-up notes, while Arm stays relatively constant until beat 6.
Section B1 & B2
In Figure 3.3 and Figure 3.4, sections B1 and B2 are plotted against {d̂, t̂″} respectively. For the B sections, although not significantly distinguishable, the d̂ of each group is similar to the other performance in the group but different from the other group. Despite the difference between Arm and Aas in d̂, the d̂ of both groups has a smooth arc shape. The arc shape represents a general phrasing strategy in which the phrase starts softly, gradually increases in volume, and then becomes softer again at the end of the phrase.
Figure 3.4: Comparison of {d̂, t̂″} against the score, section B2 of Chopin Op. 63 no.3
One specific observation about d̂ concerns the diminuendo mark at beats 5 to 7 in section B1 (marked by a yellow box): the d̂ of both Arm and Aas actually becomes louder during those three beats, instead of following the dynamics notation.
Section C & D
In Figure 3.5 and Figure 3.6, sections C and D are plotted against {d̂, t̂″} respectively. For the C section, from beat 8 to 16 (marked by a yellow box), the t̂″ of Arm shows more obvious oscillatory behavior than that of Aas. This difference can be understood as a difference between Arm and Aas in the treatment of local phrase endings. The ∧-∨-∧ shape of Arm means that toward the end of the first sub-phrase in section C (bar 4) the speed slows down, to mark the end of the phrase, and then speeds up at the beginning of the second sub-phrase (bar 5).
Figure 3.5: Comparison of {d̂, t̂″} against the score, section C of Chopin Op. 63 no.3

Figure 3.6: Comparison of {d̂, t̂″} against the score, section D of Chopin Op. 63 no.3
Figure 3.7: Difference between Aas and Arm in section C, op.63 no.3
On the contrary, Aas does not display a drastic change, which implies an interpretation treating the whole section as one phrase. A supplementary plot of t̂ for beats 8 to 16 is provided in Figure 3.7.
In section D, the d̂ of both Aas and Arm acts in correspondence with the crescendo markers placed from beat 15 to 20 (marked with a yellow box).
Figure 3.8: Comparison of {d̂, t̂″} against the score, section A3 of Chopin Op. 63 no.3
For t̂″, the phrasing behavior of Arm towards the middle of the section (the ending of the first sub-phrase and the beginning of the second, from beats 9 to 14) is similar to that in section C.
Section A3 & A4
In Figure 3.8 and Figure 3.9, sections A3 and A4 are plotted against {d̂, t̂″} respectively. The effect of the diminuendo mark at the beginning of the section is not prominent for either group, as can be observed in Figure 3.8. On beats 9 and 10 (section A4), the effect of the crescendo mark does not appear in Figure 3.9. The drastic fluctuation in t̂″ towards the end of section A4 (beats 14 to 21, marked by a yellow box), shared by both Aas and Arm, shows the agreement between the performers on how to approach the sub-phrase attached to the main phrase in this section.
Figure 3.9: Comparison of {d̂, t̂″} against the score, section A4 of Chopin Op. 63 no.3
The d̂ of the two groups are similar within each group but far apart between the groups.
Section A5
The last section of op.63 no.3 is plotted in Figure 3.10 against {d̂, t̂″}. In A5, both groups agree on building up the dynamics toward the end of the piece. The fluctuation in t̂″ at the end of the piece has the same shape as the fluctuation at the end of A4; in fact, the rhythmic patterns of the two endings of A4 and A5 are the same.
Figure 3.10: Comparison of {d̂, t̂″} against the score, section A5 of Chopin Op. 63 no.3
Summary
In general, the d̂ of Aas and Arm are similar within each group but deviate between the groups. The inspection of these two groups of performers shows that the dynamics markings of this piece are often violated by these four performers. Comparing the two groups, it can be said that Arm uses more timing variation than Aas, as the plots show that the magnitudes of t̂″ of Arm are stronger than those of Aas the majority of the time. These variations in timing are sometimes reflected in the emphasis of short phrases which, by contrast, does not appear in Aas.
Comparing t̂ and t̂″, the observations suggest that t̂″ is highly connected to the behavior of t̂ but exaggerates the subtleties of performances, making it a better discriminator between performances.
3.1.2.2 Similarity Measurement
Since there is no ground truth against which to compare the similarity measures obtained in Section 2.3, a qualitative approach is adopted to examine the results. Cases with significant connections are examined in the following discussions.
The Hatto Hoax
The strongest similarity between pairs of performers is the pair consisting of Hatto and Indjic. In fact, their beat-level features are almost identical, which is not a surprising discovery, since it was already discussed in (Cook and Sapp, 2007): it was reported and confirmed that the recordings in the mazurka collection attributed to Hatto were actually performed by Indjic. This finding in the similarity measures does not provide any new insight about the style of performers, but rather serves as a sanity check that the approach reflects basic numerical relations between performances.
Vladimir Ashkenazy and Tatiana Shebanova
The other pair of interest is Ashkenazy and Shebanova, denoted by Aas. They both returned as the nearest performance to each other in op. 24, no. 2 and op. 63, no. 3, two of the five queries in the query/search task (Section 2.3). Aas is also picked for historical reasons: Table 3.2 summarizes their relationship from a historical perspective. Although they received their education in different eras, institutional influences may still have had an impact in shaping their performance styles, as suggested by the similarity measurements.

Table 3.2: Comparison between Ashkenazy and Shebanova from a historical perspective
                        Life dates    Nationality   Institutional education
  Vladimir Ashkenazy    1937 ~        Russian       Central Music School / Moscow Conservatory
  Tatiana Shebanova     1953 ~ 2011   Russian       Central Music School / Moscow Conservatory

Table 3.3: Comparison between Rubinstein and Milkina from a historical perspective
                        Life dates    Nationality   Institutional education
  Arthur Rubinstein     1887 ~ 1982   Polish        None
  Nina Milkina          1919 ~ 2006   Russian       None
Arthur Rubinstein and Nina Milkina
The pair Arm has stronger similarity measures than Aas. This pair has no obvious historical relations between its members; their historical backgrounds are provided in Table 3.3, and no obvious connection can be ascertained from them. Their performances of op.63 no.3 are compared to those of Aas using d̂ and t̂″ in Section 3.1.2.1.
Hierarchical clustering using Hmp
Since Hmp contains pair-wise similarity measurements for the performers included, hierarchical clustering can be applied to examine possible groupings of the performers. In Figure 3.11, a hierarchical clustering using complete linkage and a cut-off of 7.5 is displayed. Hatto was excluded, since her presence blocked Indjic from the other performers. It can be seen that both Aas and Arm are grouped together under these settings. Other groups were also formed by the clustering, such as Lushtak and Fliere, Uninsky and Magaloff, and Smith and Biret. Among these groups, Uninsky and Magaloff are of particular interest, because Uninsky has been described as “... greatly reminiscent of Nikita Magaloff”. Due to the scope of this thesis, however, the groups besides Aas and Arm were not studied.
Some may still argue that these findings are rather arbitrary and that the relationships between performers were superimposed to force the rationale of the findings. This similarity measurement is also sensitive to the feature set chosen for the nearest neighbor search; two examples using the normalized timing features {t̂, t̂′, t̂″} and {t̂} are shown in Figure 3.12 and Figure 3.13. Dominant pairs of peaks such as Arm still hold in these two examples, but the distribution of similarity values between pairs of performers changes to a certain degree. Although not explored in this thesis, some tests on the query/search task showed that the similarity measurement is also sensitive to the metric used.
3.1.2.3 Feature Refinement
The performance on the query/return task (Section 2.2.1) using the refined features dropped from 97% to 89%, and further to 44% when using the averages and standard deviations of the coefficients (Section 2.4.2).
Figure 3.11: Hierarchical clustering from Hmp
Figure 3.12: Normalized Hmp using {t̂, t̂′, t̂″}
Figure 3.13: Normalized Hmp using {t̂}
The degradation of performance on the query/return task using the refined features reflects the fact that the current approach still falls short of expectations. The attempt to extract generalizable discriminators from only two groups is obviously too optimistic. Despite these evident shortcomings, the feature refinement process proposed in this thesis still provides a framework for future development. In the absence of a sufficient number of training samples, this framework provides a way to formalize the study of pianist style that allows experimentation with different algorithms at various stages of the analysis. By observing Figure 2.10, Figure 2.11 and Figure 2.12, and comparing the original curves to the curves reconstructed from the fitted coefficients, one explanation for the degradation on the query/return task can be deduced. For d̂ in op.63 no.3, all reconstructed curves of B1 and B2 from both groups are very close to each other compared to other sections (as can be observed in Figure 2.10 and Figure 2.12), but this does not reflect the difference between Aas and Arm seen when the original curves are compared. This agreement between certain segments under the fitted coefficients might be the reason for the degradation of performance on the query/return task. Further discussion is provided in the next section.
3.1.3 Results
The goal of this thesis is to devise a framework for quantifying performer style from audio recordings and to construct piece-invariant features. The results can be discussed in two parts: the framework and the results themselves.
The framework proposed in this thesis proved robust enough for investigating the implicit structures in the performance space defined by features extracted from audio recordings. The results of the approaches can be assessed either qualitatively or quantitatively. The steps taken in the proposed framework avoid the problem of transforming performances of different pieces, with various lengths, into feature vectors of the same dimensionality. Through the search/query task implemented in Section 2.3, the performance feature space of each piece is explored independently, and the distances between performances are summarized to form an abstract description of how similar performers are to each other given the recordings. Although the framework is intended for the eventual use of unsupervised machine learning techniques, human judgement was necessary to compensate for the minimal size of the available dataset.
In comparison to the scape plot representation proposed in (Sapp, 2007, 2008), the similarity measures devised in Section 2.3 cannot show relationships at different time scales between different performances of the same piece. The similarity measures do, however, enable the visualization of relations between performers across all the pieces under consideration.
With regard to the study of pianist style, one assertion that can be made from this thesis is that there are still no generalizable rules or features that can separate performers from one another. From Section 3.1.2.1, it can be concluded that even similar performances disagree with each other in many aspects, and that differences between groups of performers are not consistent even across a single piece. It can be proposed that the differences between groups do not transfer across different pieces. This reasoning is in line with the argument made in (Rink et al., 2011), which suggested that the connection between performance surface events and the underlying structure is neither straightforward nor simple. Arguments and viewpoints from (Cook and Everist, 1999; Shaffer, 1981; Sloboda, 1985) provide some insights into the discussion of using performance surface events as features, which is what has been done in this thesis.
From the viewpoints of music theory (Cook and Everist, 1999) and psychology (Sloboda, 1985), the structure embedded in the composition plays a very important role in the relationship between the composition and the performance. Internally (within a performance), local decisions have to be made to solve problems arising from the physical constraints of the instrument or the note arrangements as the piece unfolds, while at the same time decisions about higher-level structures are made to reflect the architecture of the composition as perceived by the performer. From an external view (comparing one performance to another), the composition itself has structural ambiguities, since groups, patterns or hierarchies can be formed by various musical units (melodies, harmonies, rhythms, note sets, etc.) and their combinations, and these ambiguities offer different choices of performance interpretation. Thus at the instant of a performance, both hierarchical and temporal decisions are made to shape the outcome of the performance.
To link the above discussion to the study of pianist style, it can be said that the “style” of a pianist may be realized in any of the aspects mentioned above, from local decisions to structural interpretations, from dynamic decisions to hierarchical considerations. The “style” in this context should thus be the attributes that are relatively consistent between performances.
Placing the approaches taken in this thesis into the context discussed above, the beat-level features (Section 2.2) reflect only local decisions made dynamically; through the similarity measurements and feature refinement (Sections 2.3 and 2.4), higher-level features that summarize these local decisions, based on segmentation derived from form analysis, are then extracted from the beat-level features. The poor performance of the refined features can then be explained as follows: the summarization of local decisions did not extract structural or hierarchical information about the performance, but instead lost information during the refinement process. The choice of segmentation derived from form analysis also ignored the fact that different performers may perceive the structure of the piece differently, so the summarization was compromised by irrelevant information. From an information retrieval point of view, everything inevitably has to be built from raw observations, and since the beat-level features in this thesis are fairly effective, it is reasonable to continue along the current path and place more emphasis on how to combine multi-level features. The other issue in the approaches taken is the lack of consideration of temporal evolution, that is, how the performance develops dynamically as it unfolds.
Measurements of temporal evolution should include both indications of how the same musical units are treated in different contexts within the same piece, and how the use of performance gestures changes over the course of the performance (Flossmann et al., 2010). Neither of these indicators was considered in the features used in this study, so there is also no information about how each performance evolved differently. The difficulties facing these considerations are the quantization of performance gestures and the relationship of such gestures to their corresponding musical events. Further discussion of these issues is beyond the scope of this thesis and can be found in (Goebl et al., 2004, 2005; Madsen and Widmer, 2006; Pampalk et al., 2003; Widmer, 2001); a first rough illustration of what such indicators might look like is nonetheless sketched below.
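The function name, inputs, and choice of statistics in this sketch are hypothetical and are not drawn from the cited studies; it merely illustrates, under assumed inputs, one indicator of each kind.

import numpy as np

def temporal_evolution_indicators(tempo_curve, repeat_spans):
    # tempo_curve: 1-D array of beat-level tempo values for one performance.
    # repeat_spans: (start, end) beat indices of two occurrences of the same
    # musical unit, assumed to be of equal length, e.g. [(0, 24), (48, 72)].
    (s1, e1), (s2, e2) = repeat_spans
    a, b = tempo_curve[s1:e1], tempo_curve[s2:e2]
    # Indicator 1: how consistently the same unit is rendered in two contexts.
    consistency = np.corrcoef(a, b)[0, 1]
    # Indicator 2: drift in local tempo variability between the two halves of
    # the performance, a crude proxy for how gesture use changes over time.
    half = len(tempo_curve) // 2
    drift = (np.std(np.diff(tempo_curve[half:]))
             - np.std(np.diff(tempo_curve[:half])))
    return consistency, drift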
3.2 Future Works
Expansion of Dataset
Given the discussion in Section 3.1.1, it is crucial to expand the dataset in order to improve the validity of the approaches proposed in this thesis. Some efforts have been made to improve automatic beat detection for recordings with varying tempo (Grosche et al., 2010; Wu et al., 2011), but in order to obtain robust estimates of expressive subtleties, semi-automatic approaches with manual corrections are still needed at this stage.
Features
One crucial piano performance attribute missing throughout the thesis is articulation. Not only is articulation a very important attribute that performers often manipulate in order to achieve musical expression, but it might also be more relevant to personal style than the other two attributes, timing and dynamics, considered in this thesis, which often reflect more about the structure of the piece or its phrase boundaries. When only audio recordings are available, extracting piano articulation is a non-trivial task and would require background investigation into the acoustics of the piano as well as accurate automatic piano transcription.
In terms of the timing and dynamics features, beat- or bar-level features are still too short to effectively capture performance information related to longer time spans. Thus, more creative ways of deriving performance features from timing and dynamics data should be investigated in order to account for the nature of expressive performance. Removing the dataset's average performance from the raw performance measurements could be a starting point for emphasizing the influence of the performer while at the same time minimizing the influence of the piece.
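As a minimal sketch of this idea, assuming the beat-level tempo values of all performances of one piece are aligned into a single matrix (an assumed data layout, not the thesis's actual format), the per-beat average across performers could be removed as follows:

import numpy as np

def remove_average_performance(tempo_curves):
    # tempo_curves: (n_performances, n_beats) array of beat-level tempo
    # values for one piece, beat-aligned across performances.
    mean_curve = tempo_curves.mean(axis=0)   # the "average performance"
    residuals = tempo_curves - mean_curve    # performer-specific deviations
    # Optionally scale by the per-beat spread so that beats where everyone
    # agrees do not dominate; the epsilon avoids division by zero.
    return residuals / (tempo_curves.std(axis=0) + 1e-9)

The residual curves could then replace the raw measurements as input to feature extraction, so that what remains reflects the performer more than the piece.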
Structure of Performers
Given the results of this thesis, further studies could be conducted in two directions. The first would take advantage of the results displayed in Figure 2.6, the summarization matrix of Hm. The second would investigate each pair of performers more deeply to find possible groupings or structures within each piece.
Feature Refinement
Although the goal of the feature refinement process is to make the framework flexible enough to support improvements based on previous similarity measurement results, only one iteration of feature refinement was carried out in this thesis. The iteration count was limited mainly because there remain unexplored parameters in the approaches taken to improve the original features. For example, in the stage where polynomial fitting is used to describe the shape of the curves, one parameter that could be explored is the segmentation. Instead of using sections obtained from an analysis of the musical form, shorter segmentations such as 2 or 4 bars could be used. The advantage of a shorter, uniform segmentation is that more detail can be captured and the weight of each section becomes uniform. Taking this idea further, analyzing the behavior of shorter segments becomes the study of individual expressive gestures, which a number of previous studies have examined (Goebl et al., 2004; Grachten et al., 2008; Grachten and Widmer, 2011; Madsen and Widmer, 2006; Rink et al., 2011; Stamatatos, 2001; Stamatatos and Widmer, 2002; Widmer and Zanon, 2004). Building on the work in this thesis, instead of forming different expressive gesture clusters for each performance as in (Rink et al., 2011), or the same gesture clusters for the whole performance set as in (Goebl et al., 2004), expressive gestures for pairs of performers could be generated based on the results from Sections 2.3 and 2.4. Another way of treating segmentation differently is to allow variable-length segments: criteria for phrase boundaries based on performance variance (Todd, 1985) could be implemented to segment each performance individually, and the lengths of the segments and their distribution could then be used as an additional performance attribute alongside the segments themselves. A sketch of the fixed-length variant appears below.
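The fixed-length variant could look like the following sketch, which fits a low-order polynomial to the beat-level tempo curve of each segment and uses the coefficients as a gesture descriptor; the segment length, polynomial order, and the assumption of 3 beats per bar (the mazurkas are in 3/4) are illustrative choices, not the thesis's actual settings.

import numpy as np

def gesture_features(tempo_curve, beats_per_bar=3, bars_per_segment=4, order=2):
    # tempo_curve: 1-D array of beat-level tempo values for one performance.
    seg_len = beats_per_bar * bars_per_segment
    n_segments = len(tempo_curve) // seg_len
    x = np.linspace(0.0, 1.0, seg_len)   # normalized time within a segment
    coeffs = []
    for k in range(n_segments):
        segment = tempo_curve[k * seg_len:(k + 1) * seg_len]
        coeffs.append(np.polyfit(x, segment, order))  # (order + 1,) coefficients
    return np.vstack(coeffs)             # shape: (n_segments, order + 1)

Because every segment contributes the same number of coefficients, the segments are weighted uniformly, and the rows of the returned matrix could be clustered into gesture vocabularies per performer pair as suggested above.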
3.3 Conclusion
Future directions stemming from this thesis can be summarized under two topics: the study of expressive gestures and the temporal structure of music performance. The study of expressive gestures can be seen as an extension of the feature refinement used in this thesis. By finding meaningful performance features that either discriminate performers from each other or explain the phenomena involved, the relationship between performance and composition could be explored further.
In parallel with the study of expressive gestures, the temporal structure of music performance should also be examined. The connection between the score and the rendered performance does not remain static; the performance evolves as the piece reveals itself to the performers and the audience (Rink et al., 2011). In order to understand the mechanics of music performance, it is crucial to take this time-variant nature of performance into account, as discussed in Section 3.1. Information-theoretic approaches primarily concerned with how dynamic processes can be described by quantifiable models (Abdallah and Plumbley, 2009) are well suited to the study of the temporal structure of music performance.
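As one hedged illustration of this direction, the sketch below quantizes a tempo curve into a small symbol alphabet, estimates a first-order Markov model from it, and reports the surprisal of each transition. This is only loosely in the spirit of the information dynamics of Abdallah and Plumbley (2009), not a reimplementation of their models, and all names are hypothetical.

import numpy as np

def surprisal_profile(tempo_curve, n_bins=8):
    # Quantize tempo values into n_bins symbols via empirical quantiles.
    edges = np.quantile(tempo_curve, np.linspace(0, 1, n_bins + 1)[1:-1])
    symbols = np.digitize(tempo_curve, edges)   # values in 0 .. n_bins - 1
    # First-order transition model with add-one smoothing.
    trans = np.ones((n_bins, n_bins))
    for a, b in zip(symbols[:-1], symbols[1:]):
        trans[a, b] += 1
    trans /= trans.sum(axis=1, keepdims=True)
    # Surprisal (negative log-probability) of each observed transition, in
    # bits; peaks mark moments where the performance departs from its habits.
    return np.array([-np.log2(trans[a, b])
                     for a, b in zip(symbols[:-1], symbols[1:])])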
In conclusion, the main task facing the study of pianist style is how to separate performance attributes from piece-wise attributes given audio recordings. Two issues have to be dealt with before this main task can be addressed: the extraction of performance measurements (Grosche et al., 2010) and the grouping of pianists. The latter issue was investigated in this thesis. A framework was suggested to enable the comparison of pairwise performer similarities across different pieces. Two things were examined in this framework: first, the features used to group performances by the same performer together; second, the relationships between performers revealed by the similarity measurements. Different sets of features were derived from the low-level features, and evaluation results were reported. The differences between groups of performers were examined qualitatively through musical assessment.
Bibliography
Abdallah, S. and Plumbley, M. (2009). Information dynamics: patterns of expectation and
surprise in the perception of music. Connection Science, 21(2-3):89–117.
Bello, J. (2011). Measuring structural similarity in music. IEEE Transactions on Audio, Speech, and Language Processing, 19(7):2013–2025.
Bello, J., Grosche, P., Müller, M., and Weiss, R. (2012). Analyzing and visualizing repeti-
tive structures in music recordings.
Bello, J. P. (2009). Grouping recorded music by structural similarity. In Proc. ISMIR, pages 531–536.
Cook, N. and Everist, M. (1999). Rethinking music. Oxford University Press, USA.
Cook, N. and Sapp, C. (2007). Purely coincidental? Joyce Hatto and Chopin's mazurkas. Royal Holloway, Univ. of London, London, UK.
Dixon, S., Goebl, W., and Widmer, G. (2002). The performance worm: Real time visualisation of expression based on Langner's tempo-loudness animation. In Proceedings of the International Computer Music Conference (ICMC 2002).
Flexer, A., Schnitzer, D., and Schlüter, J. (2012). A MIREX meta-analysis of hubness in audio music similarity. In 13th International Society for Music Information Retrieval Conference (ISMIR).
Flossmann, S., Grachten, M., Niedermayer, B., and Widmer, G. (2010). The Magaloff project: An interim report. Journal of New Music Research, 39(4):369–377.
Flossmann, S., Grachten, M., and Widmer, G. (2009). Expressive performance rendering: Introducing performance context. In Proceedings of the SMC, pages 155–160.
Gabrielsson, A. (2003). Music performance research at the millennium. Psychology of
music, 31(3):221–272.
Goebl, W., Dixon, S., De Poli, G., Friberg, A., Bresin, R., and Widmer, G. (2005). 'Sense' in expressive music performance: Data acquisition, computational studies, and models. In Cirotteau, D., editor, Sound to Sense, Sense to Sound: A State-of-the-Art, version 0.1. Logos, Berlin.
Goebl, W., Pampalk, E., and Widmer, G. (2004). Exploring expressive performance trajectories: Six famous pianists play six Chopin pieces.
Grachten, M., Goebl, W., Flossmann, S., and Widmer, G. (2008). Phase-plane visualizations of gestural structure in expressive timing. In Proceedings of the Fourth Conference on Interdisciplinary Musicology.
Grachten, M. and Widmer, G. (2007). Towards phrase structure reconstruction from expressive performance data. In Proceedings of the International Conference on Music Communication Science, pages 56–59.
Grachten, M. and Widmer, G. (2011). Explaining musical expression as a mixture of basis
functions. In Proceedings of the 8th Sound and Music Computing Conference (SMC
2011).
Grosche, P., Müller, M., and Sapp, C. (2010). What makes beat tracking difficult? A case study on Chopin mazurkas. In Proceedings of the 11th International Conference on Music Information Retrieval (ISMIR), Utrecht, Netherlands, pages 649–654.
Madsen, S. and Widmer, G. (2006). Exploring pianist performance styles with evolutionary
string matching. International Journal on Artificial Intelligence Tools, 15(04):495–513.
Nieto, O., Humphrey, E., and Bello, J. (2012). Compressing music recordings into audio summaries. In 13th International Society for Music Information Retrieval Conference (ISMIR).
Pampalk, E., Goebl, W., and Widmer, G. (2003). Visualizing changes in the structure of
data for exploratory feature selection. In Proceedings of the ninth ACM SIGKDD inter-
national conference on Knowledge discovery and data mining, pages 157–166. ACM.
Repp, B. (1990). Patterns of expressive timing in performances of a Beethoven minuet by nineteen famous pianists. The Journal of the Acoustical Society of America, 88:622.
Repp, B. (1995). Quantitative effects of global tempo on expressive timing in music per-
formance: Some perceptual evidence. Music Perception, pages 39–57.
Repp, B. (1996). The dynamics of expressive piano performance: Schumann's “Träumerei” revisited. The Journal of the Acoustical Society of America, 100:641.
Repp, B. (1997). The aesthetic quality of a quantitatively average music performance: Two
preliminary experiments. Music Perception, pages 419–444.
Repp, B. (1998a). The detectability of local deviations from a typical expressive timing
pattern. Music Perception, pages 265–289.
Repp, B. (1998b). A microcosm of musical expression. I. Quantitative analysis of pianists' timing in the initial measures of Chopin's Etude in E major. The Journal of the Acoustical Society of America, 104:1085.
Repp, B. (1998c). Variations on a theme by Chopin: Relations between perception and production of timing in music. Journal of Experimental Psychology: Human Perception and Performance, 24(3):791.
Rink, J., Spiro, N., and Gold, N. (2011). Motive, gesture, and the analysis of performance.
New Perspectives on Music and Gesture, pages 267–92.
Sapp, C. (2007). Comparative analysis of multiple musical performances. In Proceedings
of the International Conference on Music Information Retrieval (ISMIR), pages 497–
500.
Sapp, C. (2008). Hybrid numeric/rank similarity metrics for musical performance analysis.
In Bello, J. P., Chew, E., and Turnbull, D., editors, ISMIR, pages 501–506.
Schnitzer, D., Flexer, A., Schedl, M., and Widmer, G. (2011). Using mutual proximity
to improve content-based audio similarity. In Proc. of the 12th Int. Conf. for Music
Information Retrieval (ISMIR-2011).
Seashore, C. (1967). Psychology of music. Dover Publications.
Shaffer, L. H. (1981). Performances of Chopin, Bach, and Bartók: Studies in motor programming. Cognitive Psychology, 13(3):326–376.
Sloboda, J. (1983). The communication of musical metre in piano performance. The
quarterly journal of experimental psychology, 35(2):377–396.
Sloboda, J. (1985). The musical mind: The cognitive psychology of music. Clarendon Press, Oxford.
Sloboda, J. A. (2000). Individual differences in music performance. Trends in Cognitive Sciences, 4(10):397–403.
Stamatatos, E. (2001). A computational model for discriminating music performers. In Proceedings of the MOSART Workshop on Current Research Directions in Computer Music, pages 65–69.
Stamatatos, E. and Widmer, G. (2002). Music performer recognition using an ensemble of simple classifiers. In Proceedings of the 15th European Conference on Artificial Intelligence (ECAI 2002), pages 335–339. IOS Press.
Todd, N. (1985). A model of expressive timing in tonal music. Music Perception, pages
33–57.
Widmer, G. (2001). Machine discoveries: A few simple, robust local expression principles.
Journal of New Music Research, 31:37–50.
Widmer, G., Dixon, S., Goebl, W., Pampalk, E., and Tobudic, A. (2003). In search of the Horowitz factor. AI Magazine, 24(3):111.
Widmer, G. and Goebl, W. (2004). Computational models of expressive music perfor-
mance: The state of the art. Journal of New Music Research, 33:203–216.
Widmer, G. and Zanon, P. (2004). Automatic recognition of famous artists by machine. In Proceedings of the 16th European Conference on Artificial Intelligence (ECAI 2004).
Wu, F., Lee, T., Jang, J., Chang, K., Lu, C., and Wang, W. (2011). A two-fold dynamic
programming approach to beat tracking for audio music with time-varying tempo. In
Proc. ISMIR.