
Université Pierre et Marie Curie
Master 2 internship report

Master Systèmes et Applications Réparties, mention ATIAM - 2006-2007

Louis Matignon

C implementation of a musical instrument recognition system

March - June 2007

Host laboratory: Center for Digital Music - Queen Mary, University of London

Supervisor: Dr Josh Reiss


Acknowledgment

This work has been made possible by Professor Mark Sandler, head of the Center for Digital Music, who welcomed me into his lab. Many thanks to Dr. Mark Plumbley as well for all his advice.

I would especially like to thank Dr. Josh Reiss, who initiated the internship and supervised my work for four months.

I would also like to address a special thank-you to the whole C4DM PhD students' `team', especially to Dr. Matthew D. for his help introducing me to C4DM habits and to the British way of life, and to Yves, who always advised me on my computer problems; thanks to him I now know what a music ontology is and what the words `semantic web' mean (it is very interesting...). Kurt and Enrique, of course, have always been there to discuss... everything, as have Amélie, Chris S. and Beckie. I owe a special thank-you to both Chris (Cannam and Landone), who helped me every time I ran into an `unsolvable but finally not so complicated to solve' programming bug.

Finally, I really want to thank my London cousins, who helped me stay alive in London, and of course my parents, thanks to whom I had the opportunity to live in London for five months.


Contents

Abstract

1 Introduction

2 Context of the internship
  2.1 The EASAIER project
    2.1.1 Origin
    2.1.2 Purpose of the project
    2.1.3 The partners
  2.2 The lab

3 Musical Instrument Identification
  3.1 Timbre Modelling
    3.1.1 Mono-feature systems
    3.1.2 Multi-feature systems
    3.1.3 Data-Mining approaches
  3.2 Instrument Modelling
  3.3 Mixed approach
  3.4 Our approach

4 System Architecture
  4.1 Timbre Descriptors
    4.1.1 Spectral envelope descriptors
    4.1.2 The temporal descriptor: the Zero Crossing Rate
    4.1.3 Summary
  4.2 The classifiers
    4.2.1 The K nearest neighbours
    4.2.2 The Gaussian Mixture Model (GMM)

5 System Evaluation
  5.1 Acoustic Material
  5.2 Implementations
    5.2.1 Pre-processing
    5.2.2 Experiments
  5.3 Results
  5.4 Discussion
  5.5 Improvements - Future work

6 Conclusion

List of Figures

3.1 Recognition performance for 6 systems using mono-timbral excerpts

3.2 Classical taxonomic classification of pitched musical instruments [2]

4.1 General overview of the system

4.2 Example of a desired timeline

4.3 LPC filtering model

4.4 Triangular filterbank used to calculate the MFCC coefficients [22]. The non-uniform weighting comes from the fact that each filter is given unit weight. Each filter has approximately equal bandwidth on the mel-frequency scale

4.5 Representation of a mixture of K Gaussians

5.1 Databases content

5.2 Pre-processing chain

5.3 Recognition performance per instrument depending on the machine learning database (labelled on the x axis). The testbed gathers the samples from the 2 other databases

5.4 Recognition performance for each machine learning database depending on the classifier

5.5 Overall recognition performance depending on the type of classifier

5.6 Recognition performance for each frame length depending on the classifier

5.7 Cross-classification using the k-nn algorithm and a half-window overlap

5.8 Overall recognition rate

Abstract

This report concerns an automatic musical instrument recognition system. In the scope of the EASAIER project, the idea is to implement, in the C language (for a future VAMP¹ plug-in implementation), a computer system able to listen to musical excerpts/samples and recognise which instrument is playing.

The first part of the report outlines the different approaches to automatic instrument recognition found in the literature. Our own approach is then introduced: the system models the instruments with a set of 18 timbre descriptors. The classification of an unknown excerpt is then performed either with a k-nn approach or with a GMM approach. Results using both classifiers are presented.

After a brief overview of the system, each feature and the reason for its choice in the timbre description are introduced. We then present each classifier, and the evaluation of the system is described. Our experimental material is composed of three different databases containing either isolated notes or musical excerpts from commercial recordings.

The system evaluation has been carried out by crossing the databases used for the machine learning and the classification stages. Before each classification, each sample is segmented into blocks of ten representative frames. After eliminating a database from the training stage, the system reported an overall recognition rate between 49% and 63.7%, depending on the training database and the evaluation testbed. An optimisation of the settings has been performed, and the results favoured the k-nn approach.

¹ http://www.vamp-plugins.org/


Foreword

This internship was carried out as the conclusion of my Master 2 SAR, ATIAM track (Acoustics, Signal Processing and Computer Science Applied to Music), followed at IRCAM and under the administrative direction of Université Pierre et Marie Curie - Paris 6. This Master's programme is a collaboration between IRCAM (Institut de Recherche et Coordination Acoustique/Musique), Paris 6, ENST (Ecole Nationale Supérieure des Télécommunications) and the Université d'Aix-Marseille 2.

I therefore worked for four months under the supervision of Dr Joshua Reiss, within the Center for Digital Music, part of the Electronic Engineering department of Queen Mary College, in London. Initially, the subject concerned sound segmentation and the recognition of sound sources. As this subject was relatively vague, we first narrowed it down and placed it within a European programme: the EASAIER project. We therefore started working on audio source separation and musical instrument recognition, with the goal of implementing a plug-in for the project's software platform, the Sonic Visualiser [1]. Not having had the time to work on the whole subject, we concentrated on the automatic recognition of musical instruments applied to mono-instrumental music files.

Having worked exclusively in English during this internship, it seemed more relevant to write this report in English.


Chapter 1

Introduction

The work reported here took place within a two-and-a-half-year European project: EASAIER (Enabling Access to Sound Archives through Integration, Enrichment and Retrieval). The main motivation for this project is to enable access to the content of digital music libraries through integration, enrichment and retrieval. I worked specifically on the musical part of the project. It is also relevant to say that the work was carried out in a Music Information Retrieval (MIR) context, within the MIR community.

The aim of MIR tasks is to characterise an audio recording symbolically. This characterisation is performed with specific frameworks that sort recordings into different classes. The classes group musical pieces/samples with equivalent properties. In the musical field, the properties usually come from music theory, e.g. tempo, rhythm, tonality, instrumental content, etc.

The purpose of our work is to provide an organising and classifying tool for digital sound archives, based on the automatic analysis of their musical content. Indeed, characterising the musical samples according to a defined ontology enables searching and exploring the digital library by musical similarity features, which is one of the most important EASAIER tasks. Within the overall system, extraction of the instrumental content plays an important role in comparing the similarity between two files.

Many systems that try to extract perceptually relevant information from musical instrument sounds and recognise their sources have been built. However, the implemented systems are still far from being applicable to real-world musical signals in general. Most of the systems operate either on isolated notes or on monophonic phrases. Chétry's basic system [2] reported results comparable to what humans can achieve. Essid [3] defined an automatic taxonomy for mixes of 4 instruments. He reported an average recognition rate of 64.6%, but his system proved to be efficient for only 11 instruments out of 18. No generic system has been built so far.

Our task is now to build a fairly generic system with a strong instrument recognition rate. That is what I tried to do during this four-month intern-


ship. The outline of the report is divided into four parts. In the first one, the context of the work is briefly presented, including an introduction to the project and to the C4DM lab. Then a description of the different approaches to instrument recognition systems is given. We then present the overview of the system and the technical tools we used to build it. The last part of the report is dedicated to the experiments we carried out to test the system, the results we obtained, and the future work that needs to be done to improve the efficiency of the system.


Chapter 2

Context of the internship

2.1 The EASAIER project

2.1.1 Origin

The EASAIER project is a two-and-a-half-year European project that began in May 2006. The project is funded by the European Union, Information Society Technologies programme (EU-FP6-IST-033902). To carry out the project, a European consortium has been created, composed of 3 companies and 4 academic institutions.

2.1.2 Purpose of the project

This project addresses access to and preservation of cultural and scientific resources. The main purpose is the creation of a tool for digital sound archives that enables access to digital sound databases through Integration, Enrichment and Retrieval (EASAIER). The implementation of such a tool involves several new approaches:

• The development of powerful multimedia mining techniques, in combination with content extractors, meaningful descriptors, and visualisation tools.

• An efficient and effective retrieval system, grounded in semantic description, similarity and structure, in order to provide relevant functionalities related to the exploration of sound archives.

• Appropriate interaction with and presentation of material for the end users.

• The automatic creation of appropriate metadata in order to organise, share and deliver the archives.


2.1.3 The partners

The consortium created for the project is composed of 7 partners: 4 academic institutions:

• Queen Mary, University of London - England. Coordinator of the project.

• Dublin Institute of Technology - Ireland. DIT is in charge of the musical computations (source separation, instrument recognition, feature extraction, etc.).

• Royal Scottish Academy of Music and Drama - Scotland. They are in charge of the link between the artists and the developers of the project.

• University of Innsbruck, DERI - Austria. They work on the overall ontology.

and 3 companies:

• Applied Logic Laboratory - Hungary. They are working on the speech ontology.

• NICE Systems - Israel. Speech processing and retrieval.

• Silogic - France. They are building the interface of the system, based on the Sonic Visualiser tool [1].

2.2 The lab

My internship took place in the Center for Digital Music (C4DM), which is part of the Electronic Engineering department, directed by Professor Laurie Cuthbert, of Queen Mary College. C4DM research covers the field of music and audio technology.

This work has been carried out under the supervision of Dr Josh Reiss, who is the coordinator of the EASAIER project.


Chapter 3

Musical Instrument Identification

Musical instrument identification systems have been developed for approximately ten years now. Many different systems have been developed and published in the literature. Most of the algorithms studied here belong to the class of supervised learning. The aim of these algorithms is to derive rules from an existing labelled dataset in order to classify an unknown sample. The identities of the instruments, as well as some characteristic sounds, are known a priori. Thus, the purpose is to infer from the available label/sound pairs a mathematical relationship whereby an unknown sample, after the identification process, is assigned a label taken from the machine learning database.

In figure 3.1, typical performance figures for six systems are reported. As can be drawn from their analysis, problems arise when one tries to compare the percentages. The instruments considered, their number, and the duration of the audio samples used for training and testing usually differ. The implementation of the feature extraction functions and/or the classifier may also differ from one system to another. This obviously affects the objectivity of any comparison of the results.

However, general trends in the systems' behaviour can be exploited and used for the design of new classification algorithms. Cross-comparison with experimental studies on timbre perception, both in terms of correct instrument and family identification rates, also contributes to evaluating the performance of a system.

We consider two concurrent approaches in musical instrument identification:

• Timbre model: This approach finds its direct origins in the study of the mechanisms of timbre perception by humans and can be seen as a direct algorithmic transposition of them. The multi-dimensional nature of timbre is transposed into a multi-feature, multi-descriptor system which tries to represent the sensory information received by the ear and processed at the early stages of the perception chain. This multitude of information is then recombined using a machine learning algorithm whose task is to mimic, to a certain extent, the remaining processes yielding the perception of timbre.

Figure 3.1: Recognition performance for 6 systems using mono-timbral excerpts

• Instrument model: The physical mechanisms yielding the production of sound by musical instruments differ by nature. Bowing a string, blowing into a clarinet, or pressing a piano key involves different physical principles. A taxonomy of such physical mechanisms, based on their likely similarities, has been built and can lead to an instrument classification (see figure 3.2).

In this section, we focus on state-of-the-art systems in mono-timbral instrument recognition. Those systems use mono-instrumental isolated notes or musical excerpts extracted from melodic phrases.

3.1 Timbre Modelling

Timbre modelling systems involve a feature extraction stage. This part consists of the choice and computation of relevant acoustic features in order to model the timbre of an instrument sound. These features are then used with a machine learning algorithm to obtain a synthesised and representative model of the instrument (the class). The types of extracted features are generally motivated by the results obtained in studies on timbre perception. One can divide the published timbre modelling systems into three kinds of algorithms: mono-feature systems, multi-feature systems, and data-mining approaches.

3.1.1 Mono-feature systems

In mono-feature systems, one builds the instrument models on a single feature. Although it can be argued that timbre cannot be efficiently modelled by one type of feature, these systems allow a better understanding of the feature/classifier interaction.

Figure 3.2: Classical taxonomic classification of pitched musical instruments [2]

Brown [4] used speaker recognition techniques to classify between oboe and saxophone. Using cepstral coefficients based on a constant-Q transform, 94% correct identification was reported. In a later study [5], her system classified between oboe, saxophone, flute and clarinet. The most successful feature set was the frequency derivative of 22 constant-Q coefficients, measuring the spectral smoothness. A performance of 84% correct identification was reported using a standard GMM classifier. In [6], Marques described a system able to recognise 8 different instruments (bagpipes, clarinet, flute, harpsichord, organ, piano, trombone and violin). Using 16 Mel Frequency Cepstral Coefficients (MFCC) and a Support Vector Machine (SVM) as classifier, 70% correct identification was reported for 0.2-second test samples and 83% for 2-second audio samples. Krishna [7] studied the use of a particular set of linear predictive coefficients: the Line Spectrum Frequencies (LSF). Using a mixture of 54 Gaussians, a performance of 87.3% can be achieved for the classification of one excerpt among 14 instruments. It has further been shown that the LSF performed better than the MFCC for a similar task. Eggink [8] first evaluated the performance of a technique designed to identify instruments in artificial poly-timbral mixtures. Prior to feature extraction, the fundamental frequency f0 is calculated. A binary mask is then determined to select spectral descriptors based on the overtone frequencies. With this system, average instrument identification for tones extracted from the McGill database [9] was 66% for 5 instruments: flute, clarinet, oboe, violin and cello.

3.1.2 Multi-feature systems

Multi-feature systems are a direct extension of the multi-dimensional aspect of timbre. In this approach, timbre is modelled by a mixture of spectral, harmonic and temporal descriptors. In [10], Martin used a large set of 31 features including the pitch, spectral centroid, attack asynchrony, ratio of odd-to-even harmonic energy (based on the first six partials) and the strength of vibrato-tremolo calculated from the output of a log-lag correlogram. A k-Nearest Neighbours (k-NN) classifier was used within a taxonomic hierarchy, after applying a Fisher discriminant analysis [11] to the feature data set in order to reduce the required number of training samples. For 1023 isolated tones over the full pitch range of 14 instruments, 71.6% correct accuracy for the identification of individual instruments has been reported. Agostini described in [12] a system using the mean and the standard deviation of 9 features derived from an STFT, including the spectral centroid, spectral bandwidth, harmonic energy percentage, inharmonicity and harmonic energy skewness. The last three parameters were calculated for the first four partials. The best results have been achieved using a Quadratic Discriminant Analysis (QDA) classifier (92.8% for 27 instruments and a maximum of 95.3% for 20 instruments), followed by an SVM (69.7% for 27 instruments), a Canonical Discriminant Analysis (CDA) (66.7% for 27 instruments) and finally a k-NN classifier (65.7% for 27 instruments).

3.1.3 Data-Mining approaches

These techniques consist of optimising the whole system both at the feature and classifier levels. In essence, a substantial number of features is extracted from the waveforms. Next, the principle is to maximise the system's performance in terms of correct identification rates by selecting, for each class, the feature set allowing the best discrimination from the other classes in the database. These approaches usually involve iterative and trial-and-error procedures. Fujinaga [13] used a Genetic Algorithm (GA) to select the best feature set among 352 descriptors. These descriptors consisted of spectral statistical moments extracted from steady-state segments of musical instrument tones. He then used a GA to find an optimum set of feature weights to build the models. His system allowed 50.3% correct classification of one unknown tone among 39 instruments. In the same vein, Peeters ([14], [15]) used a feature selection technique based on the maximisation of the Fisher discriminant ratio in a GMM framework. It has been shown that performance can be increased by 15% for the identification of one isolated note among 28 instruments. Next, Essid [16] explored the use of class-pairwise feature selection techniques [17], the principle being to automatically select, among a large set of descriptors, the feature set that optimally discriminates between each possible pair of instruments. In [16], a GMM was used to build the instrument models from the selected features. Between 77.8% and 79.1% correct identification was reported using a one-vs-one classification scheme, as opposed to 73.9% when the classical maximum a-posteriori (MAP) rule was used. More recently, a system using Support Vector Machines (SVM) has been described in [18] that helped improve the performance up to 92% for test samples of 5 seconds.

3.2 Instrument Modelling

These techniques focus on the characteristic properties of sound production by musical instruments. Starting from a mathematical assumption about the signal content, the process consists of adapting this model to a training set, evaluating the relevant parameters during a training stage, and using it to identify new excerpts. As an example, a log-power spectrum plus noise model in an independent subspace analysis framework has been used in [19]. There, it is assumed that instruments can play a finite number of notes lying on a semitone scale. The short-term log-power spectra are represented as a non-linear sum of weighted typical note spectra plus background noise. By training the models using isolated notes, 90% correct identification has been achieved for a database of 5 instruments and for test samples of 5 seconds extracted from commercial recordings. Other approaches are concerned with acoustic features extracted from the amplitude envelope (e.g. attack time and energy) or from the output of a sinusoidal analysis stage (e.g. partial frequencies and amplitudes, harmonicity or inharmonicity factors [20] [21]). However, the difficulty of accurately extracting these features from realistic recordings such as melodic phrases limits the extension of these models to the classification of large musical databases.

3.3 Mixed approach

Mixed models combine the two approaches. On the one hand, a prior is set on the mechanisms of sound production and on the signal structure, while on the other hand, features are extracted and used to build the models. Defining this approach, Chétry [2] implemented a system which first models the formant structure and then extracts timbral descriptors, using common classifiers: k-nn, GMM and SVM. By using isolated notes for training the models, 64.4% and 73.3% correct identification rates have been achieved for the K-means and SVM respectively. By training the models using melodic phrases, performance increased by 13%, to 77.5% and 86.4% for the K-means and SVM respectively.

3.4 Our approach

In the scope of EASAIER, the most relevant approach would have been to study multi-instrument recordings. However, we decided to work first on the implementation in C of a robust mono-instrument system. Furthermore, as the implementation and the tests have been done in the C language, we decided to start with a multi-feature approach: we first chose a set of features to characterise the timbre, and then tried to classify the samples using 2 classifiers: k-nn and GMM. No feature selection has been applied to our features. The main problem we had was finding the best initial set of programming parameters in order to build the most generic program.


Chapter 4

System Architecture

This chapter gives an overview of the general system we implemented for automatic source recognition. The system is depicted in figure 4.1.

Figure 4.1: General overview of the system


The system is based on the PhD work and source code of Nicolas Chétry [2], and on the PhD thesis of Slim Essid at ENST [3], through the experimental course I took at ENST directed by Slim Essid. All the sources have been written in C. The programming base comes from Nicolas Chétry's sources. However, the sources have been widely modified: the machine learning and classification stages have been implemented, some new features as well, most of them have been improved, and the general system has been adapted to a future VAMP plug-in implementation.

The music samples we use are excerpts from three databases. Two of them are databases available for research use, and the third one is composed of commercial musical excerpts. The different kinds of files and our choices in the testbeds are explained in section 5.1.

The files are read in a pseudo real-time fashion, thanks to an implemented buffering method and a C open-source library. The pre-processing of the data is computed during the buffering (see figure 5.2). From the buffer, a short frame of the signal is extracted. The feature extraction is then applied to each frame.

After checking that the frame is not silence, the algorithm extracts a set of 18 features detailed in section 4.1. These features are quite common ones; they are among the most used in the literature.

Once the features are extracted, the algorithm consists of two distinct phases: the training phase, which consists in building a classifier model for each individual label; and the testing phase, which consists in classifying unknown excerpts by extracting the features over segmentation blocks (a number of frames), under the same conditions as those used to build the machine learning database, and then classifying them using the Gaussian Mixture Model (GMM) and/or the k nearest neighbours (k-nn).

The output of the program is then a kind of timeline (figure 4.2). The aim of the timeline is to describe, through the labelled blocks, the instrumental content of the musical track. In our first implementation, the length of the blocks is fixed and is a function of the window size (1 block = 10 frames). Eventually, the idea is to make this length a parameter.

Figure 4.2: Example of a desired timeline
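As an illustration of the data produced at this stage, the sketch below shows one possible C representation of such a timeline: a list of fixed-length blocks, each carrying a start time and an instrument label. The type and field names are hypothetical and are not taken from the actual sources.

/* Hypothetical sketch of a block-labelled timeline (not the actual EASAIER code). */
#include <stdio.h>

#define FRAMES_PER_BLOCK 10      /* 1 block = 10 analysis frames, as in the report */

typedef struct {
    double start_s;              /* block start time in seconds               */
    double length_s;             /* block length, a function of the window size */
    int    label;                /* index of the recognised instrument        */
} Block;

/* Print a human-readable timeline such as the one sketched in figure 4.2. */
static void print_timeline(const Block *blocks, int n, const char **names)
{
    for (int i = 0; i < n; i++)
        printf("%7.3f s to %7.3f s : %s\n",
               blocks[i].start_s,
               blocks[i].start_s + blocks[i].length_s,
               names[blocks[i].label]);
}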


4.1 Timbre Descriptors

In this section, the acoustic descriptors we use in our system are detailed. We decided to characterise the timbre of the instruments with a set of 18 of the most common spectral and temporal features:

• The first two coefficients deduced from an LPC analysis.

• The first ten MFCC coefficients, without the DC component c(0).

• The first four statistical moments, also known as spectral centroid, spectral width, spectral asymmetry and spectral flatness.

• The spectral slope.

• The zero crossing rate.

The spectral roll-off has been implemented as well, but was abandoned after the first experiments.

We can distinguish two different kinds of features: the spectral descriptors, which are supposed to describe the timbre by analysing the spectrum, and the temporal features, which are useful to get information about the waveform.

4.1.1 Spectral envelope descriptors

This part of the report is based on [2] and [3].

The spectral characteristics of audio signals remain the foundation of the understanding of musical sounds by humans. The harmonicity or inharmonicity degrees and the spectral envelopes are examples of features attainable from the calculation of the short-term spectral energy distribution.

The linear Predictive Model

The linear predictive model was originally used in speech recognition, and it is especially widely used in speech compression algorithms. We briefly describe in this section the theoretical principles, the method for determining the coefficients, and why we chose these features.

Theoretical principles: In the linear predictive model, we try to estimate the current sample from a weighted sum of the previous samples:

\tilde{s}(n) = \sum_{i=1}^{p} a_i s(n-i)    (4.1)

where \tilde{s}(n) is the estimated sample and p is the order of the prediction.


We define the prediction error (or residual) as the difference between the signal sample and its estimate:

e(n) = s(n) - \tilde{s}(n) = s(n) - \sum_{i=1}^{p} a_i s(n-i)    (4.2)

The purpose of the LPC is to determine the order p and the set of coefficients (a_i)_{i=1 \dots p} which minimise the energy of the prediction error e(n).

By calculating the Z transform of the residual, one obtains the transfer function of a system taking s(n) as input and e(n) as output:

E(z) = S(z) - S(z) \sum_{i=1}^{p} a_i z^{-i} = S(z)\,(1 - P(z))    (4.3)

Defining the transfer function A(z) as A(z) = E(z)/S(z), it follows that:

A(z) = 1 - P(z) = 1 - \sum_{i=1}^{p} a_i z^{-i}    (4.4)

P(z) is called the predictive filter, A(z) the inverse linear predictive filter, and e(n) the prediction error or residual signal. The filtering operation is depicted in figure 4.3.

Figure 4.3: LPC filtering model

Estimation of the parameters: We calculate the filter polynomial coefficients using the `autocorrelation LPC' method. The analysis is performed on each frame. As stated in the theoretical principles, the set of coefficients (a_i)_{i=1 \dots p} is the one that minimises the energy of the residual for each frame and a given order p. The coefficients can be deduced by writing the energy E over the whole frame of length N:

E = \sum_{n=0}^{N-1+p} e(n)^2 = \sum_{n=0}^{N-1+p} \left( s(n) - \sum_{i=1}^{p} a_i s(n-i) \right)^2, \quad \text{with } s(n-i) = 0 \text{ if } n-i < 0    (4.5)

and setting its partial derivatives with respect to the a_j to zero:

\frac{\partial E}{\partial a_j} = 0, \quad j = 1 \dots p    (4.6)

i.e.

\frac{\partial E}{\partial a_j} = -2 \sum_{n} \left( s(n) - \sum_{i} a_i s(n-i) \right) s(n-j) = -2 \left[ \sum_{n} s(n) s(n-j) - \sum_{i} a_i \sum_{n} s(n-i) s(n-j) \right] = 0, \quad j = 1 \dots p    (4.7)

i.e.

\sum_{n} s(n) s(n-j) - \sum_{i} a_i \sum_{n} s(n-i) s(n-j) = 0    (4.8)

Defining the correlation coefficients as:

R_j = \sum_{n} s(n) s(n-j), \quad j = 0 \dots p, \qquad \sum_{n} s(n-i) s(n-j) = R_{i-j}, \quad i, j = 1 \dots p    (4.9)

we can rewrite the expression as:

R_j - \sum_{i=1}^{p} a_i R_{i-j} = 0, \quad j = 1 \dots p    (4.10)

which defines the set of equations known as the Yule-Walker equations, and which can be written in matrix form:

\begin{pmatrix} R_0 & R_1 & \cdots & R_{p-1} \\ R_1 & R_0 & \ddots & \vdots \\ \vdots & \ddots & \ddots & R_1 \\ R_{p-1} & \cdots & R_1 & R_0 \end{pmatrix} \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_p \end{pmatrix} = \begin{pmatrix} R_1 \\ R_2 \\ \vdots \\ R_p \end{pmatrix}    (4.11)

In order to calculate the vector a = (a_i)_{i=1 \dots p}, the Toeplitz matrix R has to be inverted. We implemented the Levinson algorithm to solve this equation and obtain the coefficients.
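As a concrete illustration of this last step, here is a minimal C sketch of the Levinson-Durbin recursion that solves the Toeplitz system (4.11) from the autocorrelation values R_0 ... R_p. It is a generic textbook implementation, not the exact routine used in the thesis sources.

/* Minimal Levinson-Durbin recursion: solves the Yule-Walker system (4.11).
 * R[0..p]: autocorrelation values; a[1..p]: output LPC coefficients
 * (the caller must allocate p+1 doubles for a).
 * Returns the final prediction-error energy, or a negative value on failure. */
#define MAX_ORDER 32

static double levinson(const double *R, double *a, int p)
{
    double tmp[MAX_ORDER + 1];
    double err = R[0];                  /* prediction-error energy */

    if (p > MAX_ORDER || err <= 0.0)    /* degenerate (e.g. all-zero) frame */
        return -1.0;

    for (int i = 1; i <= p; i++) {
        /* reflection coefficient k_i */
        double k = R[i];
        for (int j = 1; j < i; j++)
            k -= a[j] * R[i - j];
        k /= err;

        /* update the coefficient set of order i */
        a[i] = k;
        for (int j = 1; j < i; j++)
            tmp[j] = a[j] - k * a[i - j];
        for (int j = 1; j < i; j++)
            a[j] = tmp[j];

        err *= (1.0 - k * k);           /* new prediction-error energy */
    }
    return err;
}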


LPC in timbre description: We consider that an order-2 LPC gives a coarse estimation of the spectral envelope of the source.

The MFCC coefficients

MFCCs are widely used in speech recognition and speaker verification systems. They constitute the classical feature in audio spectral pattern recognition problems. For this reason, they were the first feature to be studied in a musical instrument identification context [4]. We briefly recall here how the MFCC can be calculated from a frame of audio signal. This procedure is based on the implementation proposed in [22].

For a given frame, the short-term magnitude spectrum is calculated using an FFT. Next, a perceptual triangular filterbank having approximately equal bandwidth on the mel frequency scale is applied in the frequency domain. The filters are spaced linearly in the low frequencies (13 filters), up to roughly 1000 Hz, and logarithmically afterwards (27 filters). The lower and upper frequencies of each filter are the centre frequencies of the adjacent filters, respectively. The filterbank used in [22] is depicted in figure 4.4.

Figure 4.4: Triangular filterbank used to calculate the MFCC coefficients [22]. The non-uniform weighting comes from the fact that each filter is given unit weight. Each filter has approximately equal bandwidth on the mel-frequency scale.

The lowest chosen frequency is 66.6 Hz. The total energies in the 40 bands are then calculated, yielding 40 coefficients, also called the filterbank coefficients. Next, the log-energy outputs are cosine transformed, yielding the mel-cepstral coefficients. In practice, a discrete cosine transform is used. Assuming that (f_j)_{j=1 \dots N_f} are the filterbank coefficients, with N_f the total number of filters, the MFCCs c_i are calculated using:

c_i = \sum_{j=1}^{N_f} \log f_j \, \cos\!\left(\frac{\pi i}{N_f}\left(j - \frac{1}{2}\right)\right), \quad i = 1 \dots (N_f - 1)    (4.12)


In practice, c_0, which represents the average power of the spectrum, is discarded. Moreover, only the first coefficients (1 ... 10 in our case) are usually considered for building the feature vectors.
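To make equation (4.12) concrete, the following C sketch shows the final MFCC step, the cosine transform of the log filterbank energies. It is illustrative only: the function and variable names are not taken from the thesis sources, and the filterbank energies are assumed to be strictly positive.

/* Sketch of the final MFCC step (equation 4.12): cosine transform of the
 * log filterbank energies. c_0 (the average power) is deliberately skipped.
 * fbank[0..nf-1]:  filterbank energies, assumed > 0 (nf = 40 in the report).
 * mfcc[0..ncep-1]: output coefficients c_1 .. c_ncep (ncep = 10 here).      */
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

static void mfcc_from_filterbank(const double *fbank, int nf,
                                 double *mfcc, int ncep)
{
    for (int i = 1; i <= ncep; i++) {
        double c = 0.0;
        for (int j = 1; j <= nf; j++)
            c += log(fbank[j - 1]) * cos((M_PI * i / nf) * (j - 0.5));
        mfcc[i - 1] = c;
    }
}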

The spectral moments

The spectral moments are useful to describe the spectral shape. They have been successfully used, for example, in automatic drum loop transcription [23] and in automatic instrument recognition. Peeters [24] introduced these features in the Cuidado project.

From the definition of the statistical moments,

\mu_i = \frac{\sum_{k=0}^{K-1} (f_k)^i \, a_k}{\sum_{k=0}^{K-1} a_k}    (4.13)

where f_k and a_k are the frequency and magnitude of spectral bin k, we can define our spectral moments as follows.

The spectral centroid: the centre of gravity of the spectrum. It is defined as:

S_c = \mu_1

The spectral width: the spread of the frequency values around the mean value, i.e. the variance of the distribution whose values are the frequencies and whose probabilities are the amplitudes. It is defined as:

S_w = \sqrt{\mu_2 - \mu_1^2}

The spectral asymmetry: this feature indicates the asymmetry of the distribution around the mean value. It is computed from the 3rd moment:

S_a = \frac{2\mu_1^3 - 3\mu_1\mu_2 + \mu_3}{S_w^3}    (4.14)

S_a = 0 indicates a symmetric distribution of the frequencies, while S_a < 0 indicates more energy in the low frequencies and S_a > 0 indicates more energy in the high frequencies.

The spectral flatness: this feature gives a measure of the flatness of the distribution around the mean value. It is computed from the 4th moment:

S_f = \frac{-3\mu_1^4 + 6\mu_1^2\mu_2 - 4\mu_1\mu_3 + \mu_4}{S_w^4} - 3    (4.15)

S_f = 0 indicates a normal distribution, while S_f < 0 indicates a flatter distribution, and S_f > 0 indicates a peakier distribution.


The spectral slope

The spectral slope represents the rate of decrease of the spectral amplitude. It is computed by linear regression of the spectral amplitude against frequency:

S_s = \frac{K \sum_{k=1}^{K} f_k a_k - \sum_{k=1}^{K} f_k \sum_{k=1}^{K} a_k}{K \sum_{k=1}^{K} f_k^2 - \left(\sum_{k=1}^{K} f_k\right)^2}    (4.16)
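The spectral-shape descriptors above all come from the same bin-wise sums, so they can be computed in one pass over the magnitude spectrum. The following C function is a minimal sketch of equations (4.13) to (4.16); the type and variable names are illustrative and not taken from the thesis sources.

/* Sketch of the spectral-shape descriptors of section 4.1.1, computed from a
 * magnitude spectrum (a[k] = magnitude of bin k, f[k] = its frequency).
 * Assumes a non-silent frame (sum of magnitudes and spectral width > 0).    */
#include <math.h>

typedef struct {
    double centroid;    /* S_c */
    double width;       /* S_w */
    double asymmetry;   /* S_a */
    double flatness;    /* S_f */
    double slope;       /* S_s */
} SpectralShape;

static void spectral_shape(const double *a, const double *f, int K,
                           SpectralShape *out)
{
    double sum_a = 0.0, mu[5] = {0.0};
    double sf = 0.0, sff = 0.0, sfa = 0.0;

    for (int k = 0; k < K; k++) {
        sum_a += a[k];
        sf  += f[k];
        sff += f[k] * f[k];
        sfa += f[k] * a[k];
        for (int i = 1; i <= 4; i++)
            mu[i] += pow(f[k], i) * a[k];
    }
    for (int i = 1; i <= 4; i++)
        mu[i] /= sum_a;                              /* equation (4.13) */

    out->centroid  = mu[1];
    out->width     = sqrt(mu[2] - mu[1] * mu[1]);
    out->asymmetry = (2.0 * pow(mu[1], 3) - 3.0 * mu[1] * mu[2] + mu[3])
                     / pow(out->width, 3);           /* equation (4.14) */
    out->flatness  = (-3.0 * pow(mu[1], 4) + 6.0 * mu[1] * mu[1] * mu[2]
                      - 4.0 * mu[1] * mu[3] + mu[4])
                     / pow(out->width, 4) - 3.0;     /* equation (4.15) */
    out->slope     = (K * sfa - sf * sum_a)
                     / (K * sff - sf * sf);          /* equation (4.16) */
}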

4.1.2 The temporal descriptor: the Zero Crossing Rate

The only temporal descriptor we use is the Zero Crossing Rate (ZCR). The ZCR is the rate at which the temporal waveform crosses the zero axis. Its main interest is to distinguish noisy signals, which have a high ZCR, from periodic sounds, which have a low ZCR.
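A minimal C sketch of this descriptor, counting sign changes over one analysis frame, is given below; it is illustrative only and not the routine from the thesis sources.

/* Sketch of a zero crossing rate over one frame: number of sign changes
 * divided by the frame length. */
static double zero_crossing_rate(const float *s, int n)
{
    int crossings = 0;
    for (int i = 1; i < n; i++)
        if ((s[i - 1] >= 0.0f && s[i] < 0.0f) ||
            (s[i - 1] <  0.0f && s[i] >= 0.0f))
            crossings++;
    return (double)crossings / (double)n;
}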

4.1.3 Summary

We extract from each frame a set of features in order to model the timbre. Two of these features come from an LPC analysis and are supposed to approximate the spectral envelope. Ten of the features are MFCCs. Four features are directly derived from the statistical moments and are supposed to model the distribution of the frequencies. One, the spectral slope, gives an idea of the evolution of the spectral amplitudes. And the last one is a temporal descriptor which distinguishes noisy sounds from periodic ones.

All these features are extracted from each frame and averaged, either to build the machine learning database or to classify the audio sample with the k-nn or GMM classifiers.

4.2 The classifiers

The classifiers we implemented and tested are:

• A distance-based classifier: the k-nearest neighbours (k-nn) approach

• A probabilistic classifier: the Gaussian Mixture Model (GMM)

4.2.1 The K nearest neighbours

After extracting the features of all the frames of a training sample, the average value of each feature is calculated. Each training vector is thus represented by a vector of 18 averaged features, and each instrument model is the average of the training vectors obtained for each training sample. The classification is then performed by calculating the distance between the testing vector (extracted from a block) and each model. The class of the closest training vector is taken as the result. We use the Euclidean distance metric in a normalised space.
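The following C sketch shows this decision step, picking the instrument model whose averaged feature vector is closest in Euclidean distance. It assumes the features have already been normalised; the names and the fixed-size array layout are illustrative, not taken from the thesis sources.

/* Sketch of the distance-based classification step described above.
 * models: n_models rows of N_FEATURES averaged features, one row per instrument.
 * x:      averaged feature vector of the block to classify.
 * Returns the index of the closest model. */
#include <float.h>

#define N_FEATURES 18

static int classify_nearest(const double models[][N_FEATURES], int n_models,
                            const double *x)
{
    int best = -1;
    double best_d2 = DBL_MAX;

    for (int m = 0; m < n_models; m++) {
        double d2 = 0.0;
        for (int i = 0; i < N_FEATURES; i++) {
            double diff = x[i] - models[m][i];
            d2 += diff * diff;          /* squared Euclidean distance */
        }
        if (d2 < best_d2) { best_d2 = d2; best = m; }
    }
    return best;
}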


4.2.2 The Gaussian Mixture Model (GMM)

The GMM models the probability density function of an observed n-dimensional feature vector x by a multivariate Gaussian mixture density:

p(x|\Lambda) = \sum_{k=1}^{K} w_k \Phi_k(x)

where K is the number of Gaussian components (18 in our case) and w_k are the mixture weights, with the constraint \sum_{k=1}^{K} w_k = 1. Each component \Phi_k, k = 1 \dots K, is a Gaussian density function of the form:

\Phi_k(x) = \frac{1}{(2\pi)^{n/2} |\Sigma_k|^{1/2}} \exp\!\left(-\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)\right)    (4.17)

The GMM can be depicted as in figure 4.5.

Figure 4.5: Representation of a mixture of K Gaussians

In a recognition system, each instrument in the database is represented by a GMM Λ, entirely defined by the mean vectors \mu_k, covariance matrices \Sigma_k and weights w_k, noted:

\Lambda = (\mu_k, \Sigma_k, w_k), \quad k = 1 \dots K

Practically, the mean vectors contain the means of the different features (from the different files), the covariance matrices are built from the statistical properties of the features, and we consider that all the instruments have the same probability of being classified, i.e. w_k = 1/K.

Furthermore, we assume that each Gaussian in the model has a diagonal covariance matrix. The use of diagonal covariance matrices provides a good compromise between modelling power and algorithmic complexity compared to a full-covariance GMM.

The I instruments in the database are represented by their GMMs \Lambda_1, \Lambda_2, \dots, \Lambda_I. The identity of an unknown excerpt is the one corresponding to the model that maximises the a-posteriori probability for the given observation sequence Y = (y_1, y_2, \dots, y_M). This can be written mathematically as:

I^* = \arg\max_{1 \le i \le I} p(Y|\Lambda_i)

In our case, the observation sequence is the vector of averaged features representing each sample.
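To make the decision rule concrete, the sketch below computes the log-likelihood of one feature vector under a diagonal-covariance GMM, which is the quantity compared across the instrument models Λ_1 ... Λ_I (the excerpt is assigned to the model giving the largest value). It is a generic illustration of equation (4.17) with diagonal Σ_k, not the implementation from the thesis sources.

/* Sketch: log-likelihood of an n-dimensional vector x under a
 * diagonal-covariance GMM with K components. Generic illustration only. */
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

typedef struct {
    int     K, n;        /* number of components, feature dimension */
    double *w;           /* [K]   mixture weights, summing to 1     */
    double *mu;          /* [K*n] component means                   */
    double *var;         /* [K*n] diagonal variances                */
} DiagGMM;

static double gmm_log_likelihood(const DiagGMM *g, const double *x)
{
    double p = 0.0;
    for (int k = 0; k < g->K; k++) {
        double log_phi = -0.5 * g->n * log(2.0 * M_PI);
        for (int i = 0; i < g->n; i++) {
            double d = x[i] - g->mu[k * g->n + i];
            double v = g->var[k * g->n + i];
            log_phi -= 0.5 * (log(v) + d * d / v);
        }
        p += g->w[k] * exp(log_phi);    /* w_k * Phi_k(x) */
    }
    return log(p);
}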


Chapter 5

System Evaluation

This chapter describes the experiments we carried out to evaluate the system with various amounts of data, various parameter settings, and different classification schemes. After introducing the acoustic material and the way it is used, the computed results are presented for the different sets of parameters (hop size, window size, databases), in order to find the most suitable program settings, i.e. those providing the best results.

5.1 Acoustic Material

Livshin [31] emphasised the need to cross databases in order to have a relevant evaluation of an instrument recognition system. Furthermore, the EASAIER software is supposed to be used on very diverse and `unknown' databases; by unknown databases, we mean sounds that have never been classified by the system. Consequently, we chose to test our system by crossing three different databases with different recording properties and content. These three testbeds have been built specifically for research purposes, and specifically for instrument recognition in the case of the last one. We chose the instruments with the most uniform distribution of files between the databases. Thus, four instruments have been selected: violin, piano, clarinet and trumpet.

We detail the databases here:

The RWC database [25] [26]: The interest of the RWC database is the wide variety of recordings and playing styles. For each playing style of an instrument, the musician generally played individual sounds at half-tone intervals over the entire range of tones that could be produced by that instrument. Here are the files we used:

• Piano: 3×12 files played in 4 different styles

• Clarinet: idem


• Violin: 3×19 files played in 11 different styles

• Trumpet: 27 files played in 9 different styles, and 36 played in 12 different styles.

The UIOWA database [27]: Each file is the recording of an isolated note. Each note is approximately 2 seconds long and followed by ambient silence. The files we used are:

• Piano: 29 files between A0 and Gb4

• Clarinet: 14 files from a Bb clarinet between D3 and C7, and 14 files from an Eb clarinet between G3 and C7

• Violin: 36 files between G3 and Eb6

• Trumpet: 24 files from a Bb trumpet between E3 and Eb6

The samples were recorded in the anechoic chamber of the Wendell Johnson Speech and Hearing Center at The University of Iowa, on the following equipment: Neumann KM-84 microphones, a Mackie 1402-VLZ mixer and a Panasonic SV-3800 DAT recorder. The samples were transferred through digital lines to an editing workstation. Three non-normalised dynamic levels are included: piano-pianissimo, mezzo forte, and forte-fortissimo.

ENST's database: part of the database built by S. Essid during his PhD [3]. The files are monophonic (only one instrument) excerpts from commercial audio CDs. The files were converted to MONO (by averaging the right and left channels) in PCM format. We use 15 files per instrument.

Databases   | RWC                 | UIOWA                                  | ENST
            | # files   # styles  | # files             from      to       | # files
Clarinet    | 36        4         | 14 (Bb clarinet)    D3        G7       | 15
            |                     | 14 (Eb clarinet)    G3        C7       |
Piano       | 36        4         | 29                  A0        Gb4      | 15
Violin      | 57        11        | 36                  G3        Eb6      | 15
Trumpet     | 63        12        | 24 (Bb trumpet)     E3        Eb6      | 15

Figure 5.1: Databases content


5.2 Implementations

5.2.1 Pre-processing

Files Standardisation

Before extracting the features, either for classification or for machine learning, the files are first read as MONO files using the libsndfile C library [28]. Files in the databases are 16-bit, sampled at 32 kHz or 44.1 kHz. All the samples are downsampled to 32 kHz before processing. We chose to downsample instead of upsampling in order not to have any empty spectrum between 16 kHz and 22 kHz (as would happen when upsampling from 32 kHz to 44.1 kHz). Furthermore, a 32 kHz sampling frequency can still be considered a high-quality audio sample rate. Even if several instruments can produce partials above 16 kHz, a couple of studies have shown that this quality is sufficient for automatic instrument recognition [29]. The downsampling is performed using the libsamplerate library, a C/C++ open-source library written by Lopo [30].
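The sketch below shows one way these two libraries can be combined for this standardisation step, reading a file as mono and resampling it to 32 kHz. It assumes the standard libsndfile (sf_open, sf_readf_float) and libsamplerate (src_simple) entry points and is not the code from the thesis sources; error handling is kept minimal.

/* Sketch: read an audio file as mono and, if needed, resample it to 32 kHz.
 * Illustrative combination of libsndfile and libsamplerate only. */
#include <stdlib.h>
#include <string.h>
#include <sndfile.h>
#include <samplerate.h>

#define TARGET_RATE 32000

/* Returns a mono 32 kHz buffer (caller frees) and its length, or NULL on error. */
static float *load_mono_32k(const char *path, long *out_len)
{
    SF_INFO info;
    memset(&info, 0, sizeof info);

    SNDFILE *sf = sf_open(path, SFM_READ, &info);
    if (!sf) return NULL;

    float *interleaved = malloc(sizeof(float) * info.frames * info.channels);
    sf_count_t nread = sf_readf_float(sf, interleaved, info.frames);
    sf_close(sf);

    /* Mix down to mono by averaging the channels. */
    float *mono = malloc(sizeof(float) * nread);
    for (sf_count_t i = 0; i < nread; i++) {
        double acc = 0.0;
        for (int c = 0; c < info.channels; c++)
            acc += interleaved[i * info.channels + c];
        mono[i] = (float)(acc / info.channels);
    }
    free(interleaved);

    if (info.samplerate == TARGET_RATE) {      /* already at 32 kHz */
        *out_len = (long)nread;
        return mono;
    }

    /* Downsample (e.g. 44.1 kHz -> 32 kHz) with libsamplerate. */
    double ratio  = (double)TARGET_RATE / info.samplerate;
    long max_out  = (long)(nread * ratio) + 16;
    float *out    = malloc(sizeof(float) * max_out);

    SRC_DATA src;
    memset(&src, 0, sizeof src);
    src.data_in       = mono;
    src.input_frames  = (long)nread;
    src.data_out      = out;
    src.output_frames = max_out;
    src.src_ratio     = ratio;

    if (src_simple(&src, SRC_SINC_FASTEST, 1) != 0) {
        free(mono); free(out);
        return NULL;
    }
    free(mono);
    *out_len = src.output_frames_gen;
    return out;
}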

Pre-processing of the data

In the case of musical signals, high-quality audio recordings can be a problem to deal with. Indeed, the various electronic processing chains used in recording studios introduce significant alterations that are carried over into the acoustic features. Post-processing of the different channels, coming from different microphones, can make the sounds of one instrument exhibit rather different feature distributions. Livshin and Rodet [31] showed that these differences could affect the generalisation of the models, resulting in an unusable system. This problem can be tackled mostly within the learning stage of the classifiers by using the most varied learning database possible. However, various channel normalisation techniques are also commonly used [32] prior to the feature extraction stage. They are used to normalise the different audio recordings as much as possible, by attenuating any bias due to the recording conditions. The first pre-processing step we apply after the buffering stage concerns the removal of a possible DC bias. To remove it, we subtract the long-term mean from the signal:

\tilde{s}(n) = s(n) - \bar{s}(n)

where \tilde{s}(n) is the DC-corrected sample and \bar{s}(n) is the estimated mean over the whole frame. Then, a normalisation of the amplitudes is performed, in order to keep the signal energy within a limited range:

s(n) = \frac{s(n)}{\max_{\text{frame}} |s(n)|}


At the end of the pre-processing stage, a pre-emphasis block is added. The aim of the pre-emphasis is to increase the relative contribution of the high-frequency content (which is particularly useful for the LPC).

The pre-processing stage as a whole is depicted in figure 5.2.

Figure 5.2: Pre-processing chain
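A minimal sketch of this chain applied to one frame is shown below. It assumes a standard first-order pre-emphasis filter y(n) = s(n) - alpha*s(n-1) with alpha = 0.97; the report does not give the exact filter coefficients, so this value is an assumption, and the code is not the thesis implementation.

/* Sketch of the per-frame pre-processing chain of figure 5.2:
 * DC removal, peak normalisation, then pre-emphasis (in place). */
#include <math.h>

static void preprocess_frame(float *s, int n, float alpha)
{
    /* 1. Remove the DC bias: subtract the frame mean. */
    double mean = 0.0;
    for (int i = 0; i < n; i++) mean += s[i];
    mean /= n;
    for (int i = 0; i < n; i++) s[i] -= (float)mean;

    /* 2. Normalise the amplitude by the frame maximum. */
    float peak = 0.0f;
    for (int i = 0; i < n; i++)
        if (fabsf(s[i]) > peak) peak = fabsf(s[i]);
    if (peak > 0.0f)
        for (int i = 0; i < n; i++) s[i] /= peak;

    /* 3. Pre-emphasis, boosting the high-frequency content.
     * Assumed filter: y(n) = s(n) - alpha * s(n-1), alpha close to 1. */
    for (int i = n - 1; i > 0; i--)
        s[i] -= alpha * s[i - 1];
}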

5.2.2 Experiments

In order to choose the best set of parameters, we decided to run experiments with two varying parameters:

• The window length: 32, 64, 128, 256 ms.

• Overlap: 1/2 and 0

For each pair of settings (window size; overlap), a machine learning database has been built and, of course, so have the instrument models for each classifier.

To build the machine learning database, we first calculate the features on each frame of the file, and then we store the average value of each feature. So we store one average vector for each sample of the machine learning database.

A coarse segmentation of the file: What we want now is to build a kind of timeline of the instrumental content (see chapter 4), as fast as possible. Instead of classifying every frame, we decided to process blocks of 10 frames, assuming as a first approximation that the instrumental content does not change during those ten frames. This assumption seems weak, especially in the case of a 256 ms window length. But as we are, for the moment, only dealing with monophonic samples, it is not a problem. Moreover, the most relevant parameters should be 32 and 64 ms with overlapping windows, which implies a processing block length between 176 ms (window length = 32 ms with an overlap of 1/2) and 352 ms. In future work, the idea will be to detect changes of instrumental content along the file, and then adapt the block length to segments with homogeneous instrumental content.

With the same settings as those used to build the machine learning database, we extract the features from the frames of each block. We then compute the average of the 10 calculated feature vectors. The obtained vector is then supposed to describe the instrumental content over the whole length of the block; we assume here that the instrumental content is homogeneous along the block. The classification is then performed on this block vector.
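A minimal C sketch of this block-averaging step is given below; the names and fixed-size layout are illustrative, not taken from the thesis sources.

/* Sketch of the block-averaging step: the 10 per-frame feature vectors of a
 * block are averaged into a single vector, which is what gets classified. */
#define N_FEATURES       18
#define FRAMES_PER_BLOCK 10

static void average_block(const double frames[FRAMES_PER_BLOCK][N_FEATURES],
                          double block_vector[N_FEATURES])
{
    for (int i = 0; i < N_FEATURES; i++) {
        double sum = 0.0;
        for (int f = 0; f < FRAMES_PER_BLOCK; f++)
            sum += frames[f][i];
        block_vector[i] = sum / FRAMES_PER_BLOCK;
    }
}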


5.3 Results

After the experiments, we can extract many different results, depending on what we are looking for. The main difficulty is the massive amount of data, and especially the need to consider the settings one by one or two by two in order to obtain understandable results. Consequently, in this section and in section 5.4, we focus on one or two settings at a time.

Figure 5.8 merges all the results of correctly recognised blocks depending on the machine learning database, the testbed used, and the window size, for each instrument. From this table, we extract different pieces of information.

Instruments: The first result we can gather is the recognition rate per instrument depending on the learning database. Figure 5.3 gives the results. We notice a relevant recognition rate for the piano, with 74.0% of recognised blocks when using the ENST database, a good rate for the trumpet, with 73.2% in the case of learning on the RWC database, and a 67.75% rate for the recognition of the clarinet when modelling the instruments from the UIowa samples. But the most noticeable information comes from the very low rates for the three other instruments in the UIowa case. Indeed, these rates are below the random rate of 25%, so we detect a problem here. The problem can be either in the modelling or in the use of the UIowa database for machine learning.

Figure 5.3: Recognition performance per instrument depending on the machine learning database (labelled on the x axis). The testbed gathers the samples from the 2 other databases.


Machine learning: Following this first result, it is interesting to see the global recognition rate depending on the machine learning database. The results presented in figure 5.4 show quite similar recognition rates when the system learns on the RWC and ENST databases, for each classifier. Using the same settings, we notice a recognition rate lower by between 10% and 15% when using the UIowa database. So we may assume that, for our system, the UIowa database is not relevant for machine learning.

Figure 5.4: Recognition performance for each machine learning database depending on the classifier

Classifier: The efficiency of the classifiers has to be studied as well. One can notice in figure 5.4 that the k-nn classifier seems to be more efficient than the Gaussian Mixture Models. This tendency is confirmed by the overall recognition rate depending on the classifier, given in figure 5.5. This graph gives a recognition rate of 50.0% with overlap, and 51.2% without overlap, when classifying with the k-nn method. On the other side, the GMM correctly classifies 40.2% of the blocks without overlapping windows, and 40.1% with overlap. The k-nn classification seems to be much more efficient in our case.

Figure 5.5: Overall recognition performance depending on the type of classifier

Window length: The last setting we want to study is the window length. Figure 5.6 shows a better recognition rate when the analysis is performed on long frames. The interesting information lies in the recognition rates when the extraction is performed on small frames (32 ms). This rate is between 23% (29.4% versus 52.6% when classifying with the GMM without overlap) and 18% (37.1% versus 55.6% when classifying with the k-nn algorithm with an overlap) lower than the 256 ms rate. Usually, automatic musical instrument recognition systems use small frames [2] [3], corresponding to a compromise between a short-term window (in order to consider a stationary signal) and an acceptable frequency resolution.

Figure 5.6: Recognition performance for each frame length depending on the classifier

5.4 Discussion

The strength of our approach is that everything has been implemented and tested in the C programming language. The main advantage is the frame processing time, which allows a large number of tests to be run. For now, though, the algorithm presented here is not efficient enough to be included in the whole project. However, from the results we can draw some conclusions and discuss future directions.

Even if comparisons between systems are in most cases difficult to make (see Chapter 3), the recognition rates are clearly not sufficient yet to include the recognition system in the EASAIER software. However, the cross-classification results presented in Figure 5.7 show first results in the same range as Livshin [31]. The difference compared with the other studies comes from the fact that, in our case, the learning database and the testbed are rather different, whereas in most of the other studies [2] [32],


Figure 5.6: Recognition performance for each frame length depending on the classifier

                          Training database
  Testbed            ENST        RWC        UIowa
  ENST                 -        49.0%       43.3%
  RWC                49.8%        -         43.8%
  UIowa              58.0%      63.7%         -

Figure 5.7: Cross-classification results using the k-nn algorithm and a half-window overlap

the training data are made from half of each database and the testbed contains the other half.

Figure 5.1 describes the content of each database. Considering the two isolated-note databases, the RWC collection contains more files than the UIowa database, and both databases were built in a similar way. As a consequence, the RWC covers a wider range of instrumental sounds than the UIowa database. This would explain why the RWC gives good results when it is used to train the system and why, conversely, the UIowa database presents the best recognition rate (63.7%) when it is used as a testbed.

In the same way, the ENST database was built from musical excerpts taken from various commercial recordings, so it covers a very wide range of sounds for a given instrument. This explains why it is a better database for training than for testing.


However, the system is a first implementation and still needs some improvements. An interesting next step would be to mix the two most effective databases in order to obtain a more varied one.

Classifier : The overall recognition rate is much better when using k-nn classification than when using a GMM. We can now consider K-nn as the default classifier of the system.
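Since k-nn becomes the default, a minimal sketch of the block classification step by k-nearest neighbours is given below. The feature dimensionality, the value of K and the toy data are purely illustrative and do not correspond to the descriptors or training material actually used by the system.

    /* Illustrative k-NN classification of one feature vector by Euclidean
     * distance and majority vote. Dimensions and data are toy values. */
    #include <stdio.h>
    #include <float.h>

    #define N_FEATURES 3
    #define K 5

    typedef struct {
        double features[N_FEATURES];
        int    label;                 /* instrument class index */
    } TrainingPoint;

    static double squared_distance(const double *a, const double *b)
    {
        double d = 0.0;
        for (int i = 0; i < N_FEATURES; i++) {
            double diff = a[i] - b[i];
            d += diff * diff;
        }
        return d;
    }

    /* Return the most frequent label among the K nearest training points. */
    static int knn_classify(const TrainingPoint *train, int n_train,
                            const double *query, int n_classes)
    {
        double nearest_dist[K];
        int    nearest_label[K];

        for (int k = 0; k < K; k++) {
            nearest_dist[k]  = DBL_MAX;
            nearest_label[k] = -1;
        }

        /* Keep a sorted list of the K smallest distances seen so far. */
        for (int i = 0; i < n_train; i++) {
            double d = squared_distance(train[i].features, query);
            for (int k = 0; k < K; k++) {
                if (d < nearest_dist[k]) {
                    for (int m = K - 1; m > k; m--) {
                        nearest_dist[m]  = nearest_dist[m - 1];
                        nearest_label[m] = nearest_label[m - 1];
                    }
                    nearest_dist[k]  = d;
                    nearest_label[k] = train[i].label;
                    break;
                }
            }
        }

        /* Majority vote among the neighbours. */
        int best_label = -1, best_count = 0;
        for (int c = 0; c < n_classes; c++) {
            int count = 0;
            for (int k = 0; k < K; k++)
                if (nearest_label[k] == c)
                    count++;
            if (count > best_count) {
                best_count = count;
                best_label = c;
            }
        }
        return best_label;
    }

    int main(void)
    {
        /* Two toy clusters standing for two hypothetical instrument classes. */
        TrainingPoint train[] = {
            { {0.10, 0.20, 0.10}, 0 }, { {0.20, 0.10, 0.20}, 0 }, { {0.15, 0.15, 0.10}, 0 },
            { {0.90, 0.80, 0.90}, 1 }, { {0.80, 0.90, 0.80}, 1 }, { {0.85, 0.90, 0.95}, 1 },
        };
        double query[N_FEATURES] = { 0.82, 0.85, 0.90 };

        printf("predicted class: %d\n", knn_classify(train, 6, query, 2));  /* expected: 1 */
        return 0;
    }

In the real system, each block of frames would be summarised by its feature vector and the training points would come from the instrument models built on the learning database.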

5.5 Improvements - Future work

From the literature, we can identify several implementation improvements that would be relevant for the system.

Instrument models : Livshin [31] reported a significant improvement when mixing databases for machine learning, so as to obtain a training database that is as varied as possible. It would therefore be relevant for us to mix the RWC and ENST databases for machine learning.

Features : One of our limitations comes from the fact that we implemented a heuristic choice of features. Most systems now use many more features [24] [3] [32] to describe the timbre. The next stage of the implementation will therefore be to add other features. Implementing a large set of features in turn implies implementing a feature selection algorithm, which would be interesting as well and could solve problems such as the one we observe with the clarinet in the UIowa database.
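One simple possibility, given here only as a sketch (the feature selection methods used in the cited works are more elaborate), is to rank features by a per-feature Fisher ratio, i.e. the squared difference of the class means divided by the sum of the class variances. The dimensionality and the data below are hypothetical.

    /* Hypothetical feature-ranking sketch: score each feature by a two-class
     * Fisher ratio (squared difference of class means over the sum of class
     * variances). Dimensions and data are toy values, not the real features. */
    #include <stdio.h>

    #define N_FEATURES 4

    /* data: n vectors of N_FEATURES values; labels: 0 or 1 for each vector. */
    static void fisher_ratios(double (*data)[N_FEATURES], const int *labels,
                              int n, double *score)
    {
        for (int f = 0; f < N_FEATURES; f++) {
            double sum[2] = { 0.0, 0.0 }, sumsq[2] = { 0.0, 0.0 };
            int count[2] = { 0, 0 };

            for (int i = 0; i < n; i++) {
                int c = labels[i];
                sum[c]   += data[i][f];
                sumsq[c] += data[i][f] * data[i][f];
                count[c]++;
            }

            double mean0 = sum[0] / count[0];
            double mean1 = sum[1] / count[1];
            double var0  = sumsq[0] / count[0] - mean0 * mean0;
            double var1  = sumsq[1] / count[1] - mean1 * mean1;
            double diff  = mean0 - mean1;

            /* A large score means the feature separates the two classes well. */
            score[f] = diff * diff / (var0 + var1 + 1e-12);
        }
    }

    int main(void)
    {
        double data[6][N_FEATURES] = {
            { 1.0, 5.0, 0.2, 3.0 }, { 1.1, 4.0, 0.3, 3.1 }, { 0.9, 6.0, 0.1, 2.9 },  /* class 0 */
            { 3.0, 5.1, 0.2, 3.0 }, { 3.2, 4.9, 0.3, 3.2 }, { 2.8, 5.2, 0.1, 2.8 },  /* class 1 */
        };
        int labels[6] = { 0, 0, 0, 1, 1, 1 };
        double score[N_FEATURES];

        fisher_ratios(data, labels, 6, score);
        for (int f = 0; f < N_FEATURES; f++)
            printf("feature %d: score %.2f\n", f, score[f]);  /* feature 0 ranks first */
        return 0;
    }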

In parallel with the feature implementation, it would also be interesting to implement other classifiers. Most recent studies use a Support Vector Machine (SVM) [3], [2] or neural networks [31].

File segmentation : A new approach to the segmentation can be considered. Assuming we are able to detect changes of instrumental content along the file, we could study the file segments where the instrumental content is homogeneous. We could then adapt the size of the blocks and take a decision based on the classification results of the whole set of blocks constituting each segment. A confidence rate could then be defined.
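A minimal sketch of such a per-segment decision is given below, assuming the per-block classification already exists: the segment label is the majority vote over its blocks and the confidence rate is the fraction of blocks that agree with it. The class count and the toy labels are illustrative only.

    /* Sketch of the proposed per-segment decision: majority vote over the
     * block labels of a homogeneous segment, plus a confidence rate. */
    #include <stdio.h>

    #define N_CLASSES 4   /* e.g. clarinet, piano, violin, trumpet */

    /* block_labels: class index predicted for each block of the segment. */
    static int segment_decision(const int *block_labels, int n_blocks,
                                double *confidence)
    {
        int votes[N_CLASSES] = { 0 };

        for (int i = 0; i < n_blocks; i++)
            if (block_labels[i] >= 0 && block_labels[i] < N_CLASSES)
                votes[block_labels[i]]++;

        int best = 0;
        for (int c = 1; c < N_CLASSES; c++)
            if (votes[c] > votes[best])
                best = c;

        *confidence = n_blocks > 0 ? (double)votes[best] / n_blocks : 0.0;
        return best;
    }

    int main(void)
    {
        int labels[] = { 1, 1, 3, 1, 1, 0, 1, 1 };   /* toy per-block decisions */
        double conf;
        int decision = segment_decision(labels, 8, &conf);
        printf("segment label %d, confidence %.2f\n", decision, conf);  /* 1, 0.75 */
        return 0;
    }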

In order to detect the content changes, a first idea would be to study the evolution of the MFCCs and detect modifications of their coefficients. Indeed, a weakness of MFCC analysis is its lack of robustness to differences in recording conditions between two files. Assuming that the recording conditions stay the same throughout a file, we could compare the modelling of the sound source given by the MFCCs along the sample and detect sudden changes.
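A rough sketch of this idea, under the assumptions above, is to track the distance between consecutive MFCC vectors and flag the frames where it jumps; a real detector would at least smooth the distance curve before picking peaks. The number of coefficients and the threshold below are placeholders.

    /* Rough sketch of the proposed content-change detection: Euclidean
     * distance between consecutive MFCC vectors, thresholded. */
    #include <stdio.h>
    #include <math.h>

    #define N_MFCC 13   /* illustrative number of coefficients */

    /* mfcc: n_frames vectors of N_MFCC coefficients;
     * change: output flags, 1 where a sudden change is detected. */
    static void detect_changes(double (*mfcc)[N_MFCC], int n_frames,
                               double threshold, int *change)
    {
        change[0] = 0;
        for (int t = 1; t < n_frames; t++) {
            double d = 0.0;
            for (int i = 0; i < N_MFCC; i++) {
                double diff = mfcc[t][i] - mfcc[t - 1][i];
                d += diff * diff;
            }
            change[t] = sqrt(d) > threshold;
        }
    }

    int main(void)
    {
        double mfcc[8][N_MFCC] = { { 0.0 } };
        int change[8];

        /* Synthetic coefficients: constant, with a jump at frame 4. */
        for (int t = 4; t < 8; t++)
            for (int i = 0; i < N_MFCC; i++)
                mfcc[t][i] = 2.0;

        detect_changes(mfcc, 8, 1.0, change);
        for (int t = 0; t < 8; t++)
            printf("frame %d: %s\n", t, change[t] ? "change" : "-");
        return 0;
    }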


                          Clarinet                |  Piano                   |  Violin                  |  Trumpet
  Window length (ms)      32    64    128   256   |  32    64    128   256   |  32    64    128   256   |  32    64    128   256

  ENST vs RWC
    GMM,  no overlap      15.3  46.2  26.0  21.4  |  30.0  81.5  91.2  93.6  |  56.5  42.6  20.4  28.2  |  25.5  34.6  60.8  54.3
    GMM,  with overlap    15.1  44.4  23.7  19.6  |  29.9  80.3  90.6  92.2  |  58.2  45.2  18.0  24.2  |  21.7  29.9  59.2  52.6
    K-nn, no overlap      43.0  30.9  32.9  36.4  |  84.5  87.7  88.9  90.2  |  45.0  36.7  22.1  28.0  |  28.3  36.6  51.9  52.2
    K-nn, with overlap    41.6  29.1  29.8  33.0  |  84.1  87.2  88.0  88.7  |  45.4  36.7  18.3  23.8  |  27.5  36.7  51.9  50.6

  ENST vs UIowa
    GMM,  no overlap      19.0  48.3  44.9  32.0  |   2.0  42.6  95.2  96.1  |  21.4  36.7  20.1  51.9  |  51.4  20.1  95.4  82.3
    GMM,  with overlap    19.7  48.1  40.0  30.7  |   2.1  42.4  94.9  95.7  |  21.7  37.4  16.1  50.6  |  50.4  19.4  94.6  77.6
    K-nn, no overlap      46.5  48.4  52.8  55.4  |  22.3  93.5  94.3  93.8  |  23.6  54.8  50.4  50.8  |  43.8  66.1  65.4  64.4
    K-nn, with overlap    46.1  44.2  48.8  54.0  |  22.3  93.3  93.9  93.8  |  23.7  53.0  50.6  48.9  |  42.5  64.5  65.7  61.5

  RWC vs ENST
    GMM,  no overlap       4.8   4.7   8.3  11.9  |  62.4  29.4  61.1  44.4  |  76.8  81.6  35.5  26.6  |  36.8  42.7  83.3  93.3
    GMM,  with overlap     5.0   3.7   8.9  14.2  |  64.1  31.4  60.5  43.3  |  77.2  81.8  37.2  31.1  |  39.0  43.2  81.6  87.7
    K-nn, no overlap      34.8  38.1  32.1  40.4  |  62.6  52.2  41.1  31.1  |  40.8  34.4  38.8  44.4  |  65.3  72.7  77.7  75.5
    K-nn, with overlap    36.4  34.2  36.3  32.1  |  60.9  53.0  45.5  36.6  |  36.7  33.8  35.5  37.7  |  63.6  70.6  75.0  82.2

  RWC vs UIowa
    GMM,  no overlap      33.2  26.8  19.0  21.4  |   9.4  10.0  92.9  93.4  |  23.1  21.1  35.9  63.3  |  48.8  51.3  98.8  99.1
    GMM,  with overlap    33.0  27.0  20.9  19.8  |   9.3  10.3  93.0  92.8  |  23.5  22.1  33.3  61.3  |  48.9  50.5  98.3  98.9
    K-nn, no overlap      47.0  48.8  53.5  60.0  |  23.4  92.4  92.5  92.8  |  14.1  55.8  55.8  51.7  |  59.1  85.1  92.1  94.8
    K-nn, with overlap    45.3  47.3  51.2  57.2  |  23.7  92.3  92.8  92.3  |  13.8  55.0  54.8  51.8  |  60.0  84.3  89.6  91.1

  UIowa vs ENST
    GMM,  no overlap      94.8  94.0  95.2  45.2  |   0.0   0.0   0.0   2.2  |   0.0   0.0   0.0  88.8  |   4.5   5.0   0.0  20.0
    GMM,  with overlap    94.5  94.8  94.0  51.1  |   0.0   0.0   0.5   3.3  |   0.0   0.0   0.0  87.7  |   4.5   4.0   1.6  28.8
    K-nn, no overlap      93.7  75.0  55.9  54.7  |   0.0   5.5  33.3  31.1  |   3.2  31.6  34.4  35.5  |  21.6  66.1  74.4  75.5
    K-nn, with overlap    93.1  74.0  53.5  48.8  |   0.0   6.9  31.6  30.0  |   3.0  29.6  30.0  35.5  |  18.3  66.1  71.6  81.1

  UIowa vs RWC
    GMM,  no overlap      70.8  70.8  73.8  38.6  |   1.0  10.0  23.2  59.5  |  15.0  13.4  13.2  62.8  |   3.1   2.0   0.7  30.2
    GMM,  with overlap    71.0  71.2  73.4  36.4  |   1.1  10.8  26.1  59.4  |  15.2  14.2  11.7  56.9  |   2.7   1.8   0.8  31.0
    K-nn, no overlap      76.4  57.1  48.4  49.5  |   1.5  53.1  81.7  85.4  |   9.5  12.4  14.2  18.5  |  13.1  54.5  58.2  61.2
    K-nn, with overlap    76.1  53.3  42.9  44.4  |   1.5  53.8  81.2  84.1  |  10.1  11.8  11.9  15.5  |  12.8  55.2  57.2  58.7

Figure 5.8: Overall recognition rate (%) per instrument and analysis window length, for each training database / testbed pair (e.g. "ENST vs RWC" means models trained on ENST and evaluated on the RWC testbed) and for each classifier configuration.


Onset Detection : The last improvement we could consider would be to match the frames to an onset detection algorithm [33] [34]. The advantage would be to fit the blocks exactly to the temporal shape of the notes, which would make it possible to extract many more temporal features. Chétry already reported a slight improvement in the recognition rate when extracting features from the onsets rather than from steady-state segments of sound.
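As an illustration of one of the standard detection functions surveyed in [33], the sketch below computes a spectral flux curve (sum of the positive magnitude increases between consecutive frames) followed by naive peak picking. It is not necessarily the detector that would be adopted, and the toy spectra in main are invented for the example.

    /* Illustrative spectral-flux onset detection with naive peak picking. */
    #include <stdio.h>
    #include <stdlib.h>

    /* mag: n_frames spectra of n_bins magnitudes each (row-major);
     * onset: output flags, 1 where an onset is detected. */
    static void spectral_flux_onsets(const double *mag, int n_frames, int n_bins,
                                     double threshold, int *onset)
    {
        double *flux = calloc(n_frames, sizeof(double));
        if (!flux)
            return;

        for (int t = 1; t < n_frames; t++) {
            double f = 0.0;
            for (int k = 0; k < n_bins; k++) {
                double diff = mag[t * n_bins + k] - mag[(t - 1) * n_bins + k];
                if (diff > 0.0)          /* keep only energy increases */
                    f += diff;
            }
            flux[t] = f;
        }

        for (int t = 0; t < n_frames; t++) {
            int is_peak = t > 0 && t < n_frames - 1 &&
                          flux[t] > flux[t - 1] && flux[t] >= flux[t + 1];
            onset[t] = (is_peak && flux[t] > threshold) ? 1 : 0;
        }

        free(flux);
    }

    int main(void)
    {
        /* Toy magnitude spectra: quiet frames, a burst at frame 3, quiet again. */
        double mag[6 * 3] = {
            0.1, 0.1, 0.1,
            0.1, 0.1, 0.1,
            0.1, 0.1, 0.1,
            2.0, 1.5, 1.0,
            0.3, 0.3, 0.3,
            0.2, 0.2, 0.2,
        };
        int onset[6];

        spectral_flux_onsets(mag, 6, 3, 0.5, onset);
        for (int t = 0; t < 6; t++)
            printf("frame %d: %s\n", t, onset[t] ? "onset" : "-");
        return 0;
    }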


Chapter 6

Conclusion

In the scope of the European project EASAIER, we implemented a first version of an instrument recognition system in C. We evaluated the system by crossing different databases: we built instrument models on one database for two different classifiers, the K-nearest neighbours and the Gaussian Mixture Models. From those models, we classified the samples of a testbed made from the two other databases. Our acoustic material was composed of three different databases, all built for research purposes.

For the classification stage, we first segmented the samples into blocks of ten significant frames; the classification was then performed on each block. The results single out two relevant training databases, RWC and ENST, and show an average recognition rate about 10% higher when classifying the blocks with the K-nn method than with the GMM. They report a recognition rate between 49.0% and 63.7% (depending on the databases) when training the system on these two databases.

Those results are not yet sufficient for the system to be integrated into the European project EASAIER. Consequently, improvements in feature extraction, feature selection, file segmentation and instrument modelling have been proposed.

The system is ultimately intended to be implemented as a Vamp plug-in for Sonic Visualiser (the EASAIER development platform).


Bibliography

[1] C. Cannam and M. Sandler, Sonic Visualiser, http://www.sonicvisualiser.org.

[2] N. Chétry, "Computer models for musical instrument identification," Ph.D. dissertation, Center for Digital Music - Queen Mary University of London, London, 2006.

[3] S. Essid, "Classification automatique des signaux audio-fréquences : reconnaissance des instruments de musique," Ph.D. dissertation, Université Pierre et Marie Curie - Paris 6, Paris, December 2005. [Online]. Available: http://www.tsi.enst.fr/ essid/publications.htm

[4] J. C. Brown, "Computer identification of musical instruments using pattern recognition with cepstral coefficients as features," The Journal of the Acoustical Society of America, vol. 105, pp. 1933–1941, March 1999.

[5] J. C. Brown, O. Houix, and S. McAdams, "Feature dependence in the automatic identification of musical woodwind instruments," The Journal of the Acoustical Society of America, vol. 109, pp. 1064–1072, March 2001.

[6] J. Marques and P. Moreno, "A study of musical instrument classification using Gaussian mixture models and support vector machines," Compaq Cambridge Research Laboratory, Cambridge, Massachusetts, Tech. Rep. CRL 99/4, June 1999.

[7] A. G. Krishna and T. Sreenivas, "Music instrument recognition: from isolated notes to solo phrases," in Proc. ICASSP, 2004.

[8] J. Eggink and G. Brown, "A missing feature approach to instrument identification in polyphonic music," in Proc. ICASSP, 2003.

[9] F. Opolko and J. Wapnick, "McGill University Master Samples (CD)," McGill University, 1987.

[10] K. Martin, "Musical instrument identification: A pattern-recognition approach," 1998. [Online]. Available: citeseer.ist.psu.edu/martin98musical.html


[11] G. McLachlan, Discriminant Analysis and Statistical Pattern Recognition. John Wiley and Sons Inc., 1992.

[12] G. Agostini, M. Longari, and E. Pollastri, "Musical instrument timbres classification with spectral features," in Workshop on Multimedia Signal Processing, Cannes, France, 2001, pp. 97–102.

[13] I. Fujinaga, "Machine recognition of timbre using steady-state tone of acoustic musical instruments," 1998. [Online]. Available: citeseer.ist.psu.edu/fujinaga98machine.html

[14] G. Peeters, "Automatic classification of large musical instrument databases," in Proc. AES 115th Convention, 2003.

[15] G. Peeters and X. Rodet, "Hierarchical Gaussian tree with inertia ratio maximization for the classification of large musical instrument databases," in Proc. DAFx, London, UK, 2003.

[16] S. Essid, G. Richard, and B. David, "Musical instrument recognition based on class pairwise feature selection," in Proc. ISMIR, 2004.

[17] T. Hastie and R. Tibshirani, "Classification by pairwise coupling," in Advances in Neural Information Processing Systems, M. I. Jordan, M. J. Kearns, and S. A. Solla, Eds., vol. 10. The MIT Press, 1998. [Online]. Available: citeseer.ist.psu.edu/hastie98classification.html

[18] S. Essid, G. Richard, and B. David, "Musical instrument recognition by pairwise classification strategies," IEEE Transactions on Audio, Speech and Language Processing, vol. 14, pp. 1401–1412, July 2006.

[19] E. Vincent and X. Rodet, "Instrument identification in solo and ensemble music using independent subspace analysis," in Proc. ISMIR, 2004.

[20] K. Jensen, "Perceptual and physical aspects of musical sounds," Journal of Sangeet Research Academy, 2002.

[21] J. Risset and D. Wessel, "Exploration of timbre by analysis and synthesis," The Psychology of Music, 1982.

[22] M. Slaney, "Auditory Toolbox: A MATLAB toolbox for auditory modeling work," Interval Research Corporation, Apple Technical Report, 1998.

[23] O. Gillet and G. Richard, "Automatic transcription of drum loops," in Proc. ICASSP, 2004.

[24] G. Peeters, "A large set of audio features for sound description in the CUIDADO project," Project CUIDADO - Audio Feature Extraction, 2002.

[25] M. Goto, "Development of the RWC database," in Proc. ICA, April 2004.


[26] M. Goto, RWC database website, http://staff.aist.go.jp/m.goto/RWC-MDB/.

[27] University of Iowa, U-Iowa website, http://theremin.music.uiowa.edu/Events.html.

[28] M. D. C. Lopo, "libsndfile library," Tech. Rep. [Online]. Available: http://www.mega-nerd.com/libsndfile/

[29] K. Martin, "Sound source recognition: A theory and computational model," Ph.D. dissertation, Massachusetts Institute of Technology, Massachusetts, 1999.

[30] M. D. C. Lopo, "libsamplerate library," Tech. Rep. [Online]. Available: http://www.mega-nerd.com/libsamplerate/

[31] A. Livshin and X. Rodet, "The importance of cross database evaluation in sound classification," in Proc. ISMIR, 2003.

[32] A. Eronen, "Automatic musical instrument recognition," Master's thesis, Tampere University of Technology, Tampere, 2001.

[33] J. Bello, L. Daudet, S. Abdallah, C. Duxbury, M. Davies, and M. Sandler, "A tutorial on onset detection in music signals," IEEE Transactions on Speech and Audio Processing, vol. 13, pp. 1035–1047, September 2005.

[34] N. Collins, "A comparison of sound onset detection algorithms with emphasis on psychoacoustically motivated detection functions," in Proc. of the 118th AES Convention, Barcelona, Spain, May 2005.
