
Song Form Intelligence for Repairing Streaming Music Across Wireless Bursty Networks

Jonathan P. Doherty, B.Sc. (Hons.)

School of Computing & Intelligent Systems (SCIS)
Faculty of Computing & Engineering

University of Ulster

A thesis submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy

March, 2010


TABLE OF CONTENTS

List of Figures
List of Tables
Acknowledgements
Abstract
List of Abbreviations
Notes On Access To Contents

1 Introduction
  1.1 Objectives of this thesis
  1.2 Outline of this thesis

2 Song Form Intelligence
  2.1 Aspects of audio analysis
  2.2 Digital Signal Processing (DSP)
  2.3 Music Information Retrieval (MIR)
    2.3.1 MELDEX
    2.3.2 Marsyas
    2.3.3 Chord recognition
    2.3.4 Summary of MIR systems
  2.4 Music structure
    2.4.1 Syntax of music
    2.4.2 Semantics of music
    2.4.3 Cognitive representation
  2.5 Frequency and pitch estimation
  2.6 Beat detection
  2.7 Mainstream approaches to packet loss: protocols and standards
    2.7.1 Improvements of real-time traffic using Internet protocols
    2.7.2 VoIP
  2.8 Audio formats and file compression
  2.9 Compression and mp3s
  2.10 Jitter control
  2.11 Streaming media
    2.11.1 Windows Media Encoder
    2.11.2 Icecast and Ices
    2.11.3 GStreamer
  2.12 Streaming audio approaches to packet loss
    2.12.1 Voice communication
    2.12.2 Audio/video streaming
  2.13 Summary

3 Feature Extraction and Audio Analysis
  3.1 MPEG–7
    3.1.1 MPEG–7 descriptors
    3.1.2 Audio Spectrum Envelope (ASE)
    3.1.3 Audio Spectrum Flatness (ASF)
    3.1.4 Audio Spectrum Basis/Projection
    3.1.5 Audio Spectrum Centroid (ASC)
  3.2 Pattern classification and matching
    3.2.1 Sensing
    3.2.2 Segmentation and grouping
    3.2.3 Feature extraction
    3.2.4 Classification
    3.2.5 Pattern matching
  3.3 Mel-Frequency Cepstral Coefficients (MFCCs)
  3.4 Clustering
    3.4.1 Unsupervised classifiers
    3.4.2 Cluster numbers
  3.5 String matching algorithms
    3.5.1 Brute force
    3.5.2 Knuth-Morris-Pratt (KMP)
    3.5.3 Boyer-Moore
    3.5.4 Regular expressions
    3.5.5 Approximate string matching
  3.6 Summary

4 Similarity and Classification of Music Features
  4.1 Visualising structure and repetition in music
  4.2 k-means clustering
    4.2.1 Distance measures
  4.3 String matching
  4.4 Summary

5 Implementation of Song Form Intelligence (SoFI)
  5.1 Architecture of SoFI
  5.2 Server-side feature extraction
    5.2.1 Audio Spectrum Envelope feature extraction
    5.2.2 Clustering the Audio Spectrum Envelope (ASE)
    5.2.3 Similarity measurement
  5.3 Streaming server
    5.3.1 Ices2 and Icecast2
  5.4 Client side audio repair
    5.4.1 gStreamer pipelines and buffers
    5.4.2 Network monitoring
    5.4.3 Masking network dropouts
    5.4.4 SoFI's internal synchronization clock
    5.4.5 SoFI output
  5.5 Technologies used by SoFI
    5.5.1 Feature extraction
    5.5.2 k-means clustering
    5.5.3 Streaming servers and audio players
    5.5.4 Computational requirements of audio analysis
  5.6 Summary

6 Evaluation of Song Form Intelligence (SoFI)
  6.1 Clustering groups
  6.2 String matching large clusters
    6.2.1 Clann Brennan
    6.2.2 Baroque classical music
  6.3 Audio repair
    6.3.1 Quantitative audio comparisons
    6.3.2 Correlation of similar and dissimilar matches
  6.4 Subjective evaluation of SoFI
    6.4.1 Subject listeners
    6.4.2 Evaluation questionnaire
    6.4.3 Audio repair
    6.4.4 Subjective evaluation results
    6.4.5 Subjective ranking of song by test subjects
    6.4.6 Subjective evaluation of baroque classical music repair
    6.4.7 Feedback from subjects
  6.5 Summary

7 Conclusion and future work
  7.1 Summary
  7.2 Relation to other work
    7.2.1 Music similarity and pattern matching
    7.2.2 Packet loss streaming media
  7.3 Future work
  7.4 Conclusion

A MusicXML Output
B 12 Bar Blues Music Score
C MPEG–7 XML Output
D Audio sample music test data
E A similarity comparison of MPEG–7 ASE
F A representation of 6 s. of audio
G SoFI subjective evaluation questionnaire
H Subjective evaluation results
I Similarity Output
J Enya visit to the University of Ulster
  J.1 Enya visit to the Intelligent Systems Research Centre (ISRC)
  J.2 Enya Honorary Doctorate

References


LIST OF FIGURES

2.1 A waveform with rectangle windowing applied
2.2 Embedding an audio stream into a two dimensional similarity matrix
2.3 Parsons code representation of a music score
2.4 A sample music score
2.5 A sample piano roll representation of Figure 2.4
2.6 Music syntax with a hierarchical structure
2.7 Semantic network representation of 'I have a car'
2.8 Hierarchical breakdown of a song in western tonal format
2.9 A standing wave pattern
2.10 Pitch plot of 0.5 seconds of audio
2.11 The TCP/IP model
2.12 The OSI reference model
2.13 A GStreamer media player
2.14 WSOLA loss concealment
3.1 Example MPEG–7 application scenario
3.2 Class hierarchy of MPEG–7 audio low level descriptors
3.3 Example MPEG–7 audio power representation
3.4 Example MPEG–7 audio waveform representation
3.5 Architecture of an audio indexing and retrieval system
3.6 Architecture of spectrum basis projection
3.7 Architecture of fingerprinting application
3.8 Example composite scene
3.9 Example k-means clustering distance
3.10 k-means clustering algorithm flowchart
3.11 Example string matching comparison
3.12 Example of Hamming distance
4.1 Example MPEG–7 Audio Spectrum Envelope representation
4.2 Example audio spectrum envelope differences
4.3 Example MPEG–7 audio spectrum envelope at 5 ms., 10 ms. and 30 ms.
4.4 Example MPEG–7 ASE closeup at 5 ms., 10 ms. and 30 ms.
4.5 MPEG–7 ASE closeup of a song at 5 ms., 10 ms. and 30 ms.
4.6 Example MPEG–7 fundamental frequency of two audio signals
4.7 Example k-means cluster comparison
4.8 k-means distance measures (Manhattan and Minkowski)
4.9 Example k-means distance measures
5.1 Architecture of SoFI
5.2 SoFI's low level extraction modules
5.3 Example MPEG–7 XML output
5.4 Overlapping sampling frames of a waveform
5.5 Example k-means output
5.6 Example k-means cluster representation of a song
5.7 Comparative waveform output
5.8 A backwards string matching search
5.9 Icecast2 administrator web page
5.10 Icecast2 mountpoint detail
5.11 Icecast2 listener detail
5.12 Graphical representation of SoFI client media handler with multiple pipelines
5.13 Flow of control between pipelines
5.14 Example time delay effect during source swapping
5.15 Time delay effect when swapping audio sources
5.16 SoFI swapping audio sources
5.17 SoFI returning to a live Internet radio stream
5.18 End of stream signal event
6.1 A comparison of cluster selection
6.2 A string matching comparison of 1 to 10 s.
6.3 A 5 s. query on only preceding sections
6.4 A 5 s. query from only 30 s. of audio
6.5 A comparison of 1 and 5 s. query strings
6.6 Two 'similar' 5 second segments
6.7 A two channel wave audio file
6.8 Figure 6.7 in cluster representation
6.9 Match ratio for one 5 second segment in Figure 6.8
6.10 Best match ratio for baroque classical period music
6.11 Peak frequency spectrum representation of three similar audio sections
6.12 Demographic data for 16 subjects that participated in the evaluation of SoFI
6.13 Listening habits for 16 subjects that participated in the evaluation of SoFI
6.14 A comparison of average subject scores for song repair types
6.15 User evaluation rank for each song
B.1 A sheet music representation of the 12 Bar Blues music file
E.1 Comparison of two different 5 second segments identified as similar
F.1 Basic spectral representation of three similar audio sections
J.1 Jonathan Doherty demonstrates SoFI to Enya party (1)
J.2 Jonathan Doherty demonstrates SoFI to Enya party (2)
J.3 Enya receives Honorary Doctorate (D.Litt.) from University of Ulster
J.4 Enya graduation with her parents Leo and Baba Brennan

LIST OF TABLES

2.1 A First Order Markov Chain
2.2 Summary of Music Information Retrieval (MIR) systems
2.3 Digital Audio Formats
3.1 Example Euclidean Distance
4.1 ASE Sample Differences
5.1 Example string matching output
6.1 k Cluster computations relative to size of k
6.2 k Cluster computations relative to the size of k
6.3 String comparison results for 1 to 10 s.
6.4 Average match ratio across all song segments
6.5 A comparison of match ratio across all song segments with Orinoco Flow
6.6 Clann Brennan song comparison
6.7 Baroque classical comparison
6.8 A comparison of correlation and mean difference between 3 different audio segments
6.9 Test songs using different approaches to dropout repair
6.10 Summary of subject listening evaluation
6.11 Summary of baroque classical music evaluation
D.1 Songs used in experiments with Western Tonal Format level
H.1 Demographics of subjects
H.2 Subject listening evaluation
H.3 Subject listening rank score


ACKNOWLEDGEMENTS

First, many thanks are due to Dr. Kevin Curran and Prof. Paul Mc Kevitt, since they presented me with the opportunity to work in a very interesting field of research. Furthermore, I want to thank both Kevin and Paul particularly for supervising and directing my work, for the numerous discussions on possible approaches and for their great flexibility. Thanks also go to Dr. Tom Lunney for his valuable comments on my 100-day-review and confirmation reports. I would also like to thank the Heads of the Faculty of Computing and Engineering Graduate School, Prof. Sally McClean and Dr. Philip Morrow.

Special thanks are due to all my colleagues, especially Patrick Dempster, without whom many of the approaches used within this work would not have been possible. His advisory capacity knows no bounds. Additional thanks go to Alan Browne, Neil Glackin, Michael McBride and John Wade, who all deserve a mention, not for contributing to my work but more so for moral support - I feel I know more about neural networks and spiking neural networks than if I had completed my own work in this field. Also, Audrey Hunter, Sheila McCarthy, Julie Wall and Philip Vance, who always ensured that I went to lunch in time, deserve to be mentioned here. I would also like to take this opportunity to extend my appreciation to various members of the Intelligent Systems Research Centre (ISRC) at Magee who have provided feedback, invaluable criticisms and advice on my research over the years, in particular Pawel Herman and Simon Johnston, who provided countless invaluable consultations on areas outside my subject domain knowledge. Thanks also belong to Pat Kinsella, Ted Leath, Paddy McDonough and Bernard McGarry for their consistent technical support.

Finally, I would like to express my most sincere thanks to my wife, Karen, and my daughter, Naoishe (Barbie). Without their continued drive, enthusiasm and motivational support I firmly believe this work would not have been completed. Naoishe especially, through her accomplishments in the face of adversity, shows everyone what anyone can achieve - a shining example to us all.


ABSTRACT

Streaming media across the Internet is still an unreliable and poor quality medium. Services such as audio-on-demand drastically increase the loads on networks; therefore new, robust and highly efficient coding algorithms are necessary. One method overlooked to date, which can work alongside existing audio compression schemes, is that which takes account of the syntax and natural repetition of music in the category of Western Tonal Format (WTF). Similarity detection within polyphonic audio has presented problematic challenges within the field of Music Information Retrieval (MIR). One approach to dealing with bursty errors is to use self-similarity to replace missing segments. Many systems exist that address packet loss and replacement at the network level, but none attempt repairs of large dropouts of 5 seconds or more.

We have developed a server-client framework for automatic detection and replacement of large packet losses on wireless networks when receiving time-dependent streamed audio. SoFI, a self-similarity identification and audio replacement system, has been implemented; when dropouts occur it swaps the audio presented to the listener between the live stream and previous sections of the same audio stored locally. The MPEG–7 Audio Spectrum Envelope (ASE) provides the extracted features, which, combined with k-means clustering, enable self-similarity analysis within polyphonic audio. SoFI uses string matching to identify similarity between large sections of clustered audio.

Objective and subjective evaluations of SoFI give positive results. SoFI is shown to detect high levels of similarity over varying lengths of time within an audio file. On a scale between 0 and 1, with 0 the best, a clear correlation of 0.2491 between similarly identified sections shows successful identification. This is supported by subjective evaluations, in which subjects were presented with other simulated approaches to audio repair together with simulations of replacements identified by SoFI, including repairs of varying lengths. Results show a 200% increase in the level of acceptance of simulated SoFI repairs when compared to other approaches. Future work will include integration of optimal identification of the value of k during the clustering stage, the inclusion of verse/chorus/verse identification and the refinement of the time point at which a swap occurs, i.e., during more salient sections.

Keywords: Audio Spectrum Envelope (ASE), k–means clustering, MPEG–7, Music Information Retrieval (MIR), packet loss, pattern matching, self-similarity, semantics, song form intelligence, SoFI, wireless bursty networks.


LIST OF ABBREVIATIONS

AAC Advanced Audio enCoding
AP Audio Power
ARQ Automatic Repeat reQuest
ASB Audio Spectrum Basis
ASC Audio Spectrum Centroid
ASE Audio Spectrum Envelope
ASF Audio Spectrum Flatness
ASP Audio Spectrum Projection
AW Audio Waveform
BDM Backward DAWG Matching
BIC Bayesian Information Criterion
BNDM Backward NonDeterministic DAWG Matching
C-BRAHMS Content-Based Retrieval and Analysis of Harmony and other Music Structures
CD Compact Disc
DAT Digital Audio Tape
DAWG Directed Acyclic Word Graph
DDL Description Definition Language
DFT Discrete Fourier Transform
DSP Digital Signal Processing
EoS End Of Stream
FEC Forward Error Correction
ICA Independent Component Analysis
JVM Java Virtual Machine
KMP Knuth-Morris-Pratt
KNN K-Nearest Neighbour
LLD Low Level Descriptors
LPC Linear Predictive Coding
LSP Linear Spectral Pairs
MDL Minimum Description Length
MFC Mel-Frequency Cepstrum
MFCC Mel-Frequency Cepstral Coefficients
MIDI Musical Instrument Digital Interface
MIR Music Information Retrieval
MP3 MPEG-1 Layer 3
MPEG Moving Pictures Experts Group
NACK Negative ACKnowledgement
NASE Normalised Audio Spectrum Envelope
NIFF Notation Interchange File Format
OSI Open Systems Interconnect
PCA Principal Components Analysis
PCM Pulse Code Modulation
QoE Quality of Experience
QoS Quality of Service
ReD Replicated and Delayed
RMS Root-Mean-Square
RTP Real-time Transport Protocol
SHS Sub-Harmonic Summation
SMDL Standard Music Description Language
SoFI Song Form Intelligence
STFT Short Time Fourier Transform
SVD Singular Value Decomposition
TCP Transmission Control Protocol
UEP Unequal Error Protection
VBR Variable BitRate
VoIP Voice over Internet Protocol
WAV WAVeform audio format
WMA Windows Media Audio
WSOLA Waveform Similarity OverLap Add
WTF Western Tonal Format
XML eXtensible Markup Language


NOTES ON ACCESS TO CONTENTS

I hereby declare that with effect from the date on which the thesis is deposited in the Library of the University of Ulster, I permit the Librarian of the University to allow the thesis to be copied in whole or in part without reference to me on the understanding that such authority applies to the provision of single copies made for study purposes or for inclusion within the stock of another library. This restriction does not apply to the British Library Thesis Service (which is permitted to copy the thesis on demand for loan or sale under the terms of a separate agreement) nor to the copying or publication of the title and abstract of the thesis. IT IS A CONDITION OF USE OF THIS THESIS THAT ANYONE WHO CONSULTS IT MUST RECOGNISE THAT THE COPYRIGHT RESTS WITH THE AUTHOR AND THAT NO QUOTATION FROM THE THESIS AND NO INFORMATION DERIVED FROM IT MAY BE PUBLISHED UNLESS THE SOURCE IS PROPERLY ACKNOWLEDGED.


CHAPTER ONE

Introduction

Streaming media across the Internet is still an unreliable and poor quality communications medium. Current technologies for streaming media have reached optimum potential in respect of compression (both lossy and lossless) and buffering songs streamed from a web-based server to clients. It is anticipated that the next revolution will be witnessed through telecommunications technology. The communications sector has been one of the few constantly growing sectors, where over the last two decades a number of new services have been created, with digital communications being the primary investment. Services such as audio-on-demand drastically increase the load on networks. The spread of newly created compression standards such as MPEG–4 reflects the current demand for data compression. As these new services become available, the demand for audio services through mobile devices (e.g. phones, personal data assistants (PDAs)) increases. The technology for these services is available but suitable standards are yet to be defined. This is due to the nature of mobile radio channels, which are more limited in terms of bandwidth and bit error rates than the public telephone network. Therefore, new, robust and highly efficient coding algorithms will be necessary. Audio, due to its timely nature, requires guarantees regarding delivery of data that are very different from those of Transmission Control Protocol (TCP) traffic for standard HTTP requests. In addition, audio applications increase the set of requirements in terms of throughput, end-to-end delay, delay jitter and synchronization.

Applications such as Microsoft's Windows Media Player (2009) and Real Audio (2009) have yet to overcome the problems of using a network built upon technology that does not guarantee the order in which data is sent, only the speed at which it travels. Despite seemingly unlimited bandwidth, a Quality of Service protocol in place and high rates of compression, temporal aliasing still occurs, giving the client a poor/unreliable connection where audio playback is patchy when unsynchronized packets arrive.

The focus of this research is to examine new methods for streaming music over bandwidth-constrained networks. One method overlooked to date, which can work alongside existing audio compression schemes, is to take account of the syntax of the music. Songs in general exhibit standard structures that can be used as a forward error correction mechanism. This work implements a system called SoFI (Song Form Intelligence) that detects packet loss and uses previously received portions of the song to determine which already-received section is the best possible match. In turn, this is used in place of the missing packet(s) before the buffer is empty, by applying and improving state-of-the-art theories and techniques in pattern matching with a syntactic, semantic and cognitive approach. A conceptual sketch of this repair loop follows.
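As a toy illustration of the idea (a hypothetical sketch, not SoFI's actual implementation, which is detailed in Chapter 5), suppose each "packet" carries one second of audio and a precomputed table maps each second of the song to its most similar, earlier second:

    # Toy simulation of self-similarity based dropout repair (hypothetical).
    # received[i] is the packet for second i, or None if it was lost.
    # similarity[i] is the index of the most similar earlier second.

    def repair_stream(received, similarity):
        output = []
        buffer = {}
        for i, packet in enumerate(received):
            if packet is not None:
                buffer[i] = packet           # cache everything received so far
                output.append(packet)
            else:
                match = similarity.get(i)    # best earlier match for second i
                output.append(buffer.get(match))  # None if not yet received
        return output

    # Example: seconds 3-4 are lost; similarity says they resemble seconds 0-1.
    received = ["a0", "a1", "b0", None, None, "b1"]
    similarity = {3: 0, 4: 1}
    print(repair_stream(received, similarity))
    # ['a0', 'a1', 'b0', 'a0', 'a1', 'b1']

The key property is that the repair is entirely client side: no resend request is issued, so the repair cost is independent of the length of the dropout.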

Streaming media across networks has been a focus for much research in the area of lossy/lossless file compression and network communications. However, the rapid uptake of wireless communications has led to more recent problems being identified. Traffic on a wireless network can be categorised in the same way as on cabled networks. File transfers cannot tolerate packet loss but can take an indefinite length of time. 'Real-time' traffic can accept packet loss, within limitations, but must arrive at its destination in a given time frame. Forward Error Correction (FEC), which usually involves redundancy built into the packets, and Automatic Repeat Request (ARQ) (Perkins et al., 1998) are two key techniques currently implemented to overcome the problems encountered. However, bandwidth restrictions limit FEC solutions and the 'real-time' constraints limit the effectiveness of ARQ.

The increase in bandwidth across networks should help to alleviate the congestion problem. However, the development of audio compression, including the more popular formats such as Microsoft's Windows Media Audio (WMA) and the MPEG group's mp3 compression schemes, has peaked, and yet end users want higher quality through the use of lossless compression formats on more unstable network topologies. When receiving streaming media over a low bandwidth wireless connection, users can experience not only packet loss but also extended service interruptions. These dropouts can last for as long as 15 to 20 seconds. During this time no packets are received and, if not addressed, these dropped packets cause unacceptable interruptions in the audio stream. A long dropout of this kind may be overcome by ensuring that the buffer at the client is large enough. However, when using fixed bit-rate technologies such as Windows Media Player or Real Audio, a simple packet resend request is the sole method of audio stream repair implemented.

1.1 Objectives of this thesis

The central objective of this thesis is to investigate the use of pattern matching techniques with the aim of replacing missing sections of audio streamed across a bursty wireless network, resulting in a system called SoFI (Song Form Intelligence) that utilises these techniques. The core objectives of this thesis are:

• To develop a Forward Error Correction (FEC) approach to audio repair on wireless bursty networks, i.e., repair of a large network dropout on the client side without the need for a resend request.

• To utilise MPEG–7 Audio Spectrum Envelope (ASE) feature extraction as a method of data reduction prior to similarity analysis of audio in Western Tonal Format (WTF).

• To apply k-means clustering as an unsupervised learning method of similarity identification within an audio file.

• To apply string matching techniques to identify similarity between large sections of the clustered audio analysis.

• To identify incomplete sections and determine replacements from previously received portions of a song, using self-similarity analysis.

• To design, implement and test SoFI, a streaming media repair system that incorporates self-similarity as the FEC approach.

Satisfying these objectives requires investigation into current approaches to packet loss, audio feature extraction, similarity measurement and Music Information Retrieval (MIR).

1.2 Outline of this thesis

This thesis consists of seven chapters. Chapter 2 examines research relating to song form intelligence. First, characteristics of audio and music, with a review of digital signal processing and the purpose of 'windows', are introduced. Music Information Retrieval (MIR) is discussed, detailing systems developed to browse and search music collections and compare audio files, together with different approaches for audio content representation. This is followed by a discussion of the syntax and semantics of music. Next, signal analysis in relation to pitch and frequency is reviewed, with the aim of determining the fundamental frequency of a music signal, along with a discussion of beat detection. Audio file formats and streaming media applications are then discussed. The chapter concludes with a review of streaming audio approaches to packet loss.

Chapter 3 details the use of MPEG–7 as a feature extraction technique for complex audio. How MPEG–7 is composed of hierarchical descriptors at high and low levels is discussed, along with an example scenario. This is followed by a discussion of the low level vector and scalar representations available, including the Audio Spectrum Flatness (ASF), Audio Spectrum Basis (ASB) and Audio Spectrum Envelope (ASE). Pattern classification, matching and clustering are then discussed, identifying differences between the definitions of pattern classification and pattern matching. An overview of varying clustering methods, k-Nearest Neighbour and k-means is given. A section on approximate string matching and differing distance measurement techniques concludes the chapter.

Chapter 4 gives a discussion of music features, the syntax and cognitive representation of music, and the representation of similarity using clustering. Inherent within music is a natural repetition of rhythm and structure which can be formally captured in a cognitive representation. The MPEG–7 Audio Spectrum Envelope (ASE) is shown, and how, as a feature extraction tool, similarity classification can be performed in relation to other MPEG–7 representations. This is followed by an example of clustering as an unsupervised machine learning approach to classification, with investigations into cluster numbers and distance measures within pattern matching.

Next, Chapter 5 details the architecture and implementation of SoFI (Song Form Intelligence), a server/client system that uses self-similarity within audio to perform a best-effort repair of large network dropouts on time-dependent streamed audio. The chapter first details the server side architecture, which utilises MPEG–7 ASE feature extraction. Next, the implementation of clustering of these features is presented, along with the pattern matching for each previous time-point within the audio. The implementation of a streaming audio framework is also discussed, which enables SoFI to control data sent to listening clients. The client side application, implemented using the gStreamer framework, is then demonstrated. This enables audio streams to be controlled based on the similarity output from prior analysis on the server, dynamically swapping between the current live stream and stored portions of the audio previously received, thereby masking network failures and latency from the listener.

Chapter 6 details objective and subjective evaluations of SoFI, with test scenarios presenting different time-points and durations at which a failure may occur. Finally, this thesis is concluded in Chapter 7 with a summary, its relation to other work and discussion of areas for future work.


CHAPTER TWO

Song Form Intelligence

In this chapter a variety of areas relating to song form intelligence are reviewed. First, characteristics of audio and music, with a review of Digital Signal Processing (DSP) and the purpose of 'windows', are introduced. A discussion of the area of Music Information Retrieval (MIR) is presented, detailing systems developed to browse and search music collections and compare audio files, as well as different approaches for audio content representation. This is followed by a discussion of the syntax and semantics of music, and then a review of signal analysis in relation to pitch and frequency, with the aim of determining the fundamental frequency of a music signal, together with a discussion of beat detection. Mainstream approaches to packet loss are then considered, covering protocols, standards and commercial approaches to the problem of network communication of time-dependent packets. Audio file formats are then discussed, leading to a discussion of streaming media applications and implementation. The chapter concludes with a review of streaming audio approaches to packet loss.

2.1 Aspects of audio analysis

Wold et al. (1996) describe music and sound as being measured by four psychologically perceived categories: “Sounds are traditionally described by their pitch, loudness, duration, and timbre” (Wold et al., 1996, p. 208). The first three are well understood and can be accurately modelled by measurable acoustic features. A number of different qualities of audio must be analysed before a complete 'picture' of the music can be gained:

• Loudness: Notes can be played with varying degrees of strength, so the same note when analysed will have very different signal strengths. Loudness is computed as the signal's root-mean-square (RMS) level in decibels (see the sketch after this list).


• Pitch is the perception of the frequency of a note and is often cited as one of the fundamental aspects of music. Pitch is estimated by taking a series of short-time Fourier spectra. For each of these frames, the frequencies and amplitudes of the peaks are measured and an approximate greatest common divisor algorithm is used to calculate an estimate of the pitch.

• Frequency is the physical measurement of vibration and is often confused with pitch. For example, to the human ear the note A in one octave and the note A an octave above it are perceived as the 'same' pitch class, but they are not at the same frequency.

• Brightness is a measure of the higher frequency content of a signal. Brightness is calculated as the centroid of the short-time Fourier magnitude spectra and can change over the same range as pitch. However, it cannot be lower than the pitch at the same interval.

• Timbre: “The quality of a sound by which a listener can tell that two sounds of the same loudness and pitch are dissimilar” (Sonn, 1973). The definition of timbre is greatly debated, but it is generally accepted as the combination of all the remaining attributes of music, i.e. melody, harmony, rhythm and dynamics.
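As a rough illustration of how two of these measures are computed (a minimal NumPy sketch, not code from this thesis), loudness can be taken as the RMS level in decibels and brightness as the spectral centroid:

    import numpy as np

    def loudness_db(frame):
        """RMS level of a frame in decibels relative to full scale."""
        rms = np.sqrt(np.mean(frame ** 2))
        return 20 * np.log10(rms + 1e-12)  # small offset avoids log(0)

    def brightness(frame, sample_rate):
        """Spectral centroid (Hz) of the frame's magnitude spectrum."""
        spectrum = np.abs(np.fft.rfft(frame))
        freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
        return np.sum(freqs * spectrum) / (np.sum(spectrum) + 1e-12)

    # Example: a 440 Hz tone sampled at 44.1 kHz.
    sr = 44100
    t = np.arange(sr) / sr
    tone = 0.5 * np.sin(2 * np.pi * 440 * t)
    print(loudness_db(tone), brightness(tone, sr))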

2.2 Digital Signal Processing (DSP)

Most forms of audio analysis using computers to identify the characteristics of a sound (e.g. amplitude, velocity, wavelength and frequency) involve digital signal processing (DSP). A number of different techniques have been used to analyse audio in respect of different qualities, and the results required determine the type of analysis used. However, almost all forms of DSP are based on one core principle, Fourier analysis (also referred to as harmonic analysis, spectral analysis or frequency analysis).

Fourier analysis is a mathematical technique for describing a series of waves in terms of repeated cycles of components. One of the core principles of Fourier analysis is that it assumes an infinitely repeating signal. The Discrete Fourier Transform (DFT) transforms a series of discrete observations measured over a finite range of time into a discrete frequency-domain spectrum (Williams, 1997); the resultant output of DFT analysis is a spectrum covering all the frequency components of the sampled signal. It should be noted that results from Fourier analysis depend on the sampling interval used, and a large sample interval can lead to information being missed. The fast Fourier transform (FFT) is a discrete Fourier transform algorithm that reduces the number of calculations needed from O(N²) to O(N log N) operations.
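For reference, the DFT of a length-N sequence x_0, ..., x_{N-1} is (standard definition, added here for clarity):

    X_k = \sum_{n=0}^{N-1} x_n \, e^{-i 2\pi k n / N}, \qquad k = 0, 1, \ldots, N-1

The FFT computes exactly these X_k values; the saving comes from recursively splitting the sum into even- and odd-indexed terms rather than evaluating all N sums directly.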

The main drawback of the Fourier transform is that it gives no information about the time at which a frequency component occurs. An approach that can be used to overcome this is the short time Fourier transform (STFT), which gives information on the time resolution of the spectrum. The STFT moves a window over the signal and applies the Fourier transform to the section of the signal under the window at each position.

One of the main problems associated with DFT analysis is leakage. A DFT is calculated over a finite sample using a rectangular window, where abrupt changes at the beginning and end of the window can cause leakage (non-zero values). Windows that reduce leakage more than a rectangular window include the Hann window and the Hamming window, which is similar to the Hann window except raised on a pedestal (Lyons, 2004). A window is applied to both the beginning and the end of the sample interval to smooth out to a single common amplitude value. Figure 2.1 shows a waveform truncated with a rectangular window. By applying a Hann/Hamming window this truncated signal can be smoothed at both ends, bringing the waveform to zero. This is achieved by multiplying the signal samples by the Hamming function: the samples at the centre are 'windowed' by the largest factor, and this factor reduces in a steady sinusoidal fashion further from the centre, finishing by multiplying the samples at each end by zero. The choice of 'window' is signal dependent and varies depending on analysis requirements. The Hann window results in less leakage than the Hamming window, but with a tradeoff in signal-to-noise ratio.

Figure 2.1: A waveform with rectangle windowing applied
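A minimal sketch of windowed analysis with NumPy (illustrative, not the thesis's code): each frame is multiplied by a tapering window before the FFT so that both ends of the frame fall smoothly toward zero:

    import numpy as np

    def windowed_spectrum(frame, window="hann"):
        """Apply a tapering window to a frame and return its magnitude spectrum."""
        n = len(frame)
        w = np.hanning(n) if window == "hann" else np.hamming(n)
        return np.abs(np.fft.rfft(frame * w))

    # The Hann window is zero at both ends; the Hamming window sits on a
    # small pedestal (about 0.08) and never quite reaches zero.
    print(np.hanning(8)[0], np.hamming(8)[0])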

2.3 Music Information Retrieval (MIR)

Research in the field of music information retrieval has gathered momentum over the past decade. With the increase of audio file sharing across heterogeneous networks and streamed audio across the Internet, a need has arisen for more accurate search and retrieval of files. Research in the analysis of audio has led to the development of systems that can browse audio files in much the same way as search engines browse web pages, retrieving relevant data based on specific qualities (Chai & Vercoe, 2003; Gomez et al., 2003; Leman et al., 2002). MIR covers a large area of research topics that range from computational methods for classification, clustering and modelling to research into music perception, cognition and emotions.

Two inherent problems associated with MIR are the complexity of the audio and the complexity of the query (Downie, 2004). Music is a combination of pitch, tempo, timbre and rhythm, making analysis of music more difficult than text. Structuring a query for music is made difficult by the varying representations and interpretations, including natural transitions in music. Monophonic-style queries usually perform better, where simple note matching can be used, whereas polyphonic audio files and queries simply compound the problem. Adding to the complexity of music structure and query structure is the method of audio analysis.

The format of an audio file limits its type of use, since different file formats exist to allow for better reproduction, compression and analysis. Hence it is also true that different digital audio formats lead to different methods of analysis. Musical Instrument Digital Interface (MIDI) files were created to distribute music playable on synthesisers, of both the hardware and software variety, among artists and equipment, and because of its notational style the MIDI format allows analysis of pitch, duration and intensity (Doraisamy & Ruger, 2004). An excellent tool for analysis of MIDI files is the MIDI Toolbox (Eerola & Toiviainen, 2004), which primarily uses symbolic musical data, but signal processing methods can also be applied to cover areas of musical analysis such as geometric representations and short-term memory. As well as rudimentary manipulation and filtering functions, the toolbox can also perform analytic functions that are suitable for context-dependent musical analysis, and it is used as a prerequisite for many music information retrieval applications.

However, reproduction of a MIDI file can vary greatly on different machines, simply from differences between the composer's and listener's equipment, which renders it unsuitable for general audio playback. An alternative to MIDI as a format for analysis is Pulse Code Modulation (PCM), a common method for storing and transmitting uncompressed digital audio. Since it is a generic format, it can be read by most audio applications, similar to the way a plain text file can be read by word-processing applications. Initially developed for storing audio on audio CDs and digital audio tapes (DATs), it has since been adopted for use as an analysis standard. However, owing to the file size, analysis can be time consuming when compared to MIDI files. PCM is generally stored on computers in WAV format, developed jointly by Microsoft and IBM, which has been built into operating systems since the release of Windows 95, making it the de facto standard for sound on PCs.

Recent work within polyphonic music has shown that similarity within different sections of a piece of music can aid both pattern matching for searching large datasets and pattern matching within a single audio file (Foote & Cooper, 2003; Meredith et al., 2001a; Dannenberg & Hu, 2003). Results from analysis of an audio stream are stored in a similarity matrix created by Foote & Cooper (2003), which can be seen in Figure 2.2. The similarity matrix is generated by measuring the difference between row and column for the same data. Data along the main diagonal, from (i0, j0) to (i1, j1), will have an exact similarity, but any comparison 'off' the diagonal gives a measure of how similar the two values are. Analysis is performed using the short time Fourier transform to determine the spectral properties of the segmented audio; this is a variation of the discrete Fourier transform which allows for the influence of time as a factor. Bartsch & Wakefield (2001) used a chroma-based spectrum analysis technique to identify the chorus or refrain of a song by identifying repeated sections of the audio waveform, with the results also being stored in a similarity matrix.

Figure 2.2: Embedding an audio stream into a 2D similarity matrix (Foote & Cooper, 2003)
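A minimal sketch of how such a self-similarity matrix can be built from per-frame feature vectors (illustrative only; Foote & Cooper compute cosine similarity between spectral features):

    import numpy as np

    def similarity_matrix(features):
        """Cosine similarity between every pair of feature frames.

        features has shape (n_frames, n_dims). The result is
        (n_frames, n_frames): the main diagonal is 1.0 (each frame is
        identical to itself), and bright off-diagonal stripes indicate
        repeated sections of the audio.
        """
        norms = np.linalg.norm(features, axis=1, keepdims=True)
        unit = features / np.maximum(norms, 1e-12)
        return unit @ unit.T

    # Example: 4 frames where frame 3 repeats frame 0.
    f = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, 0.0]])
    print(np.round(similarity_matrix(f), 2))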

The following range of applications varies from database search/retrieval applications and indexing systems that allow quicker user browsing, to automatic music replication systems based on a specific composer's style. It should be noted that music recognition is still only in its infancy and has limited accuracy. The recognition of scanned text can have an accuracy of up to 95% and programs using speech recognition achieve 70-80% accuracy, whereas systems for music recognition only claim a 60-70% accuracy rating depending on the audio format, although new research is producing results of up to 90% when using the MIDI format.

2.3.1 MELDEX

MELody inDEX (MELDEX) (McNab et al., 1997) is a 'query-by-humming' application similar to systems developed by Ghias et al. (1995) and Cater & O'Kennedy (2000). MELDEX allows a user to enter notes through a microphone by humming a tune, and then searches a database for a similar match. To match user input with content held in the database, MELDEX primarily uses pitch and the fundamental frequency to process the signal for similarity matching. MELDEX filters the input to remove as many harmonics as possible while preserving the fundamental frequency. The beginnings and ends of notes are identified using a technique primarily found in voice recognition: the user hums each note as 'ta' or 'da', which causes a 60 ms drop in the amplitude of the waveform at each utterance, allowing each note to be more easily identified. MELDEX then uses string matching to compare the user's input with audio held in the database, using approximation to score the results, which are returned in order of accuracy.

Similarly, the MelodyHound melody recognition system (Prechelt & Typke, 2001) was developed by Rainer Typke in 1997. Hosted by the University of Karlsruhe, MelodyHound was initially known as 'Tuneserver'. It was designed as a query-by-whistling system, i.e. it returns the closest match from a list of songs stored in a database based on a whistled query. A more unusual method of input is where the user enters information about a song in the form of Parsons code. This involves coding consecutive notes as "U" ("up") if the second note is higher than the first note, "R" ("repeat") if the pitches are equal, and "D" ("down") otherwise. The first note of any tune is the reference point, so it does not need to be shown in Parsons code and is therefore entered as an asterisk (*). Parsons (1975) showed that this method of encoding music provides enough information to distinguish between a large number of tunes despite ignoring most of the information in the musical signal. Figure 2.3 shows the first line of a "12 Bar Blues" music score with the resultant Parsons code below.

Figure 2.3: Parsons code representation of a music score
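Generating Parsons code from a sequence of pitches is straightforward; a small sketch (pitches given as MIDI note numbers, an assumption made here purely for illustration):

    def parsons_code(pitches):
        """Encode a melody as Parsons code: '*' then U/R/D per note pair."""
        code = "*"  # the first note is only a reference point
        for prev, cur in zip(pitches, pitches[1:]):
            code += "U" if cur > prev else ("R" if cur == prev else "D")
        return code

    # 'Happy Birthday' opening: G G A G C B -> *RUDUD
    print(parsons_code([67, 67, 69, 67, 72, 71]))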

Humdrum is a multifunctional system created to aid music researchers in various types of analysis, and consists of two distinct components: the Humdrum Syntax and the Humdrum Toolkit (Humdrum, 2008). The syntax provides a common framework for representing information in ASCII format; within the syntax, an infinite number of representation formats can be 'user defined'. The Humdrum Toolkit provides a set of more than 70 interconnecting tools. The most commonly used Humdrum tools are described by the following types of operations:

• Visual display: display a score beginning at measure 128.

• Aural display: play the bass trombone part slowly beginning at measure 70.

• Searching: search for instances of a motive.

• Counting: how often do augmented intervals occur in Greek folk songs?

• Editing: change all up-stems to down-stems in measure 88 of the second horn part.

• Transforming or translating between representations: transpose from one key to another.


Themefinder (Kornstadt, 1998; Sapp & Aarden, 2008) identifies common motifs in Western classical music, folksongs, and Latin motifs of the sixteenth century. Using a web-based interface to the Humdrum toolkit, Themefinder indexes musical data for searching (Humdrum, 2008), allowing databases containing musical themes to be searched. One of the most popular Humdrum representations of music is **kern, which provides a combination of core pitch and duration in its representation. An example representation showing the beginning of "Happy Birthday" along with the associated lyrics can be seen below:

    **kern    **lyrics
    8.g       Hap-
    16g       py
    4a        birth-
    4g        day
    4cc       to
    2b        you.
    8.g       Hap-
    16g       py
    4a        birth-
    4g        day
    4dd       to
    2cc       you.
    *-        *-

Time is represented in columns and moves down the column. The first line of the data is a keyword indicating the type of representation, and the audio data is then arranged in columns. Each column is terminated by a "spine-path terminator" (*-). Spines are separated by tabs, and each line represents concurrent activities across all spines. In the **kern format, durations are represented by the multiplicative inverse of numbers, i.e. 8. = dotted eighth, 2 = half, 4 = quarter. The pitch is represented by letter (a-g and A-G), where upper-case letters denote pitches below middle C and lower-case letters denote a pitch above (and including) middle C. Repeated letters are used for each successive octave distance from middle C (e.g. "cc" or "CC"). The syntax of Humdrum is flexible in that users are free to choose any string of ASCII coded characters, provided the representation conforms to the Humdrum Syntax.
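A small sketch of how a **kern note token can be decoded into a duration and an octave offset from middle C (illustrative only, covering just the subset of **kern described above):

    import re

    def parse_kern_note(token):
        """Split a **kern token like '8.g' or '4cc' into duration and pitch."""
        m = re.match(r"(\d+)(\.?)([a-gA-G]+)", token)
        if not m:
            return None
        number, dot, letters = m.groups()
        duration = 1.0 / int(number)          # 4 -> quarter note, 2 -> half
        if dot:
            duration *= 1.5                   # '8.' is a dotted eighth
        octave = len(letters) - 1             # 'cc' is one octave above 'c'
        if letters[0].isupper():
            octave = -(octave + 1)            # upper case lies below middle C
        return duration, letters[0].lower(), octave

    print(parse_kern_note("8.g"))   # (0.1875, 'g', 0)
    print(parse_kern_note("4cc"))   # (0.25, 'c', 1)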

Lemstrom et al. (2003) developed a project called C-BRAHMS (Content-Based Retrieval and Analysis of Harmony and other Music Structures) that aims to retrieve polyphonic music from large scale databases where music is stored in a symbolic coding format. C-Brahms uses a number of different algorithms that allow music to be in various formats, including MIDI, monophonic and polyphonic, and it allows both partial and exact matching approaches. As an example, one form of input that C-Brahms uses can be seen in Figure 2.5. The order of notes is displayed from left to right, with the length of the bar as an indicator of time, and the pitch is represented in numeric format vertically. C-Brahms uses a geometric representation of both the query pattern and the source pattern, allowing a Euclidean measurement of difference.

Figure 2.4: A sample music score

Figure 2.5: A sample piano roll representation of Figure 2.4 (Lemstrom et al., 2003)

CubyHum is a 'Query by Humming' application (Pauws, 2002) that attempts to detect pitches in a sung melody and measures the similarity of these with the symbolic coding of melodies stored within a database. CubyHum estimates the pitch from the music by a technique called sub-harmonic summation (SHS), initially proposed by Hermes (1988). SHS calculates the sum of harmonically compressed spectra in short time frames and selects the highest resultant calculation as the pitch estimate for that time frame. CubyHum then uses standard signal processing methods, including short-time energy, pitch level shifts and amplitude envelopes, to detect note onsets. The resultant data is then combined to describe the pitch and duration of the query, allowing normal transcription to the MIDI format for comparison with songs stored in the database.

notify! Whistle is a query by whistling/humming system for melody retrieval similar to CubyHum, with a similar conversion of user queries to a MIDI format. By using a piano roll representation of the query the user is allowed to change the original input to account for errors (Kurth et al., 2002). However, unlike CubyHum, which uses a string-based approach for comparisons, Kurth et al. (2002) use an index-based approach for pattern matching. Describing songs as sets of notes within documents of the form Di ⊂ N, and queries as Q ⊂ N, allows queries to be performed using set theory, and is an alternative approach to the problem of incorrect notes and mismatches that are common with user-generated input queries.

Muscle Fish's content-based retrieval (CBR) technology searches for audio files on the basis of how they sound (Wold et al., 1996). It can also be used to classify sound files or live sound inputs. An additional feature of Muscle Fish is its ability to cluster sound files according to category and search for sounds that are similar in their features. Muscle Fish analyses sound files for a specific set of psychoacoustic features that include loudness, pitch, bandwidth and harmonicity, with the resultant output presented as a vector containing the requested attributes. Using the Euclidean (Mahalanobis) distance metric, a measure of similarity between a given sound example and all other sound samples is then found. This allows samples to be ranked by distance, with closer distance measures being more similar.
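As an illustration of the final ranking step, the sketch below (not Muscle Fish's code; the sounds and feature values are invented for the example) ranks a small library by Euclidean distance between four-element feature vectors of loudness, pitch, bandwidth and harmonicity:

import math

def euclidean(a, b):
    # Straight-line distance between two feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical feature vectors: (loudness, pitch, bandwidth, harmonicity).
library = {
    "door_slam":  (0.9, 0.1, 0.8, 0.2),
    "flute_note": (0.4, 0.7, 0.2, 0.9),
    "oboe_note":  (0.5, 0.6, 0.3, 0.9),
}
query = (0.45, 0.65, 0.25, 0.9)

# Rank by distance: smaller distances indicate more similar sounds.
for name, vec in sorted(library.items(), key=lambda kv: euclidean(query, kv[1])):
    print(name, round(euclidean(query, vec), 3))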

2.3.2 Marsyas

Marsyas (MusicAl Research SYstem for Analysis and Synthesis) is a collection of tools aimed at audio analysis. The principle behind Marsyas is to allow researchers to utilise a standardised set of analysis tools, allowing them to collaborate and compare results using a level platform. Marsyas uses a semi-automatic approach that combines both manual and fully automatic annotation, giving a necessary degree of flexibility dependent on the research approach. Marsyas combines spectral analysis with pitch and harmonicity, together with techniques using singular value decomposition (SVD) and principal components analysis (PCA)¹ on multidimensional datasets (Jolliffe, 1986).

¹ SVD and PCA are techniques for dimensionality reduction.


Commercial research groups Yahoo Research and the Yahoo Media Group have been using Marsyas to analyse audio signals in a database of over 2 million songs (Slaney & White, 2006). By looking at four main characteristics of songs, spectral centroid, spectral rolloff, spectral flux and the zero crossing rate, Yahoo Research are able to gain a measure of the diversity of a consumer's musical preferences. This can then be used by recommender systems to more accurately identify new material suitable to the consumer.
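These four characteristics are standard frame-level measurements; the sketch below computes them from one frame of samples using their textbook definitions (this is illustrative, not the Marsyas or Yahoo implementation):

import numpy as np

def frame_features(frame, prev_spectrum=None, rolloff_pct=0.85):
    spectrum = np.abs(np.fft.rfft(frame))
    bins = np.arange(len(spectrum))
    total = spectrum.sum() or 1.0
    centroid = (bins * spectrum).sum() / total      # spectral 'centre of mass'
    rolloff = np.searchsorted(np.cumsum(spectrum),  # bin below which 85% of
                              rolloff_pct * total)  # the spectral energy lies
    flux = 0.0
    if prev_spectrum is not None:                   # frame-to-frame change
        flux = float(np.sum((spectrum - prev_spectrum) ** 2))
    zero_crossings = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0
    return centroid, rolloff, flux, zero_crossings, spectrum

# Example: one 1024-sample frame of a 440 Hz tone sampled at 8 kHz.
t = np.arange(1024) / 8000.0
print(frame_features(np.sin(2 * np.pi * 440 * t))[:4])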

2.3.3 Chord recognition

Based on the principle that most songs can be described in the form AABA, where each letter is an instance of a phrase, SAM1 is a set of programs designed to read/record digital audio, extract a pitch contour, compute a similarity matrix, find groups of a similar composition and demonstrate the music in terms of structural relationships (Tanguiane, 1993). Using pitch extraction, SAM1 identifies potential areas where the signal amplitude is low (i.e. signal noise) and areas where there are clear peaks (notes). Similar groups of notes are then identified and a similarity matrix is built. This matrix is then used to identify similar groups of notes within the audio file. One issue identified by SAM1 is that the similarity between groups of notes is not transitive: if group A was found to be similar to group B, and group B was found to be similar to group C, it did not mean that group A was similar to group C. This was because exact pattern matching was not used and limits were set as to how exact the match had to be.

Using probability models to predict the next sequence of notes is one of the most common methods applied when using computers to compose music automatically. In its simplest form, based on the laws of probability, if a series of notes containing the values C3 D3 E3 F3 G3 A3 B3 C4 (the C major scale) is used as a knowledge base, and the note E3 is presented, then a probability of 1.0 is returned that F3 is the next note to be used (Boulanger, 2000).

More complex rules are required for probability systems where the probability of future events depends on one or more past events. The number of past events that are taken into consideration is known as the order of the chain. A Markov Chain (Rabiner, 1989) where only one previous note is used is of the first order; one where two previous notes are used is of the second order, and so forth. A transition matrix is used to indicate the probability of the next note given the previous one. Table 2.1 shows the probability of notes occurring after each respective predecessor; a minimal sampling sketch follows the table.

     A    B    C    D
A   0.2  0.1  0.3  0.4
B   0.5  0.1  0.2  0.2
C   0.5  0.2  0.1  0.2
D   0.2  0.2  0.3  0.2

Table 2.1: A First Order Markov Chain (Miranda, 2001)
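A minimal sketch of composing with such a chain, using the rows of Table 2.1 as the transition matrix, could look as follows (illustrative Python, not taken from Miranda (2001)):

import random

NOTES = ["A", "B", "C", "D"]
TRANSITIONS = {                          # the rows of Table 2.1
    "A": [0.2, 0.1, 0.3, 0.4],
    "B": [0.5, 0.1, 0.2, 0.2],
    "C": [0.5, 0.2, 0.1, 0.2],
    "D": [0.2, 0.2, 0.3, 0.2],
}

def next_note(current):
    # The current note selects a row; the next note is drawn from it.
    return random.choices(NOTES, weights=TRANSITIONS[current])[0]

sequence = ["A"]
for _ in range(7):                       # generate a short 8-note sequence
    sequence.append(next_note(sequence[-1]))
print(" ".join(sequence))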

CAMUS 3D is part of an ongoing research programme at the Centre for Music Technology at Glasgow University. It is an algorithmic composition system which uses cellular automata, a computer modelling technique widely used to model systems in which space and time are discrete, to drive the music generating process (McAlpine et al., 1999).

Within the area of music information retrieval, cellular automata, genetic algorithms and neural networks are primarily used as machine learning and composition tools. They have been used for the analysis of a particular composer's style and then to create/simulate a similar piece of music based on this analysis (Papadopoulos & Wiggins, 1999; Pearce & Wiggins, 2002). Initial results have shown that some (experienced) musicians who are not familiar with a particular composer's work find it difficult to tell the difference between the original and synthesized music. For self-similarity and pattern matching, DSP techniques combined with pattern matching using scalars, vectors and matrices are more common.

Search Inside the Music (Lamere, 2006) is a system that recommends songs to users based on an analysis of the music that they already enjoy. The system categorises a user's collection of music based on a number of attributes: pitch, harmony, tempo, rhythm, and energy. This information is stored as metadata, and then used as search criteria when searching for files with similar attributes. As well as the direct analysis of the audio, metadata concerning the tracks' genre is also included, which allows similarity identification by artist/genre. The application is tailored to help users find music they prefer on future digital music players.

2.3.4 Summary of MIR systems

We have reviewed previous work on Music Information Retrieval systems. Table 2.2 summarises this work and identifies specific features of these systems.


[Table 2.2 compares, for each system: its input (audio or symbolic); its matching approach (audio or symbolic, exact or approximate, polyphonic); its features (audio fingerprints, pitch, note detection, timbre, rhythm, contour, intervals, onset detection, indexing, self-similarity, other); and its collection size. The systems covered are Doraisamy & Ruger (2004), The MIDI Toolbox, Foote & Cooper (2003), SIATEC, SIA(M)ESE, To Catch a Chorus, MELDEX (collection of 9,354), Melodyhound (10,210), Humdrum, Themefinder (21,500), C-Brahms (1,464), CubyHum (unknown), notify! Whistle (unknown), Muscle Fish (409), Marsyas, SAM1, Search Inside the Music, SAR and ESAC; where no size is given, none was reported.]

Table 2.2: Summary of Music Information Retrieval (MIR) systems


One of the key points evident in the table is the collection size. Each work is based on its own collection of data, making comparison of findings difficult if not impossible. The larger collections are mainly for fingerprinting, with pitch the most common feature used for similarity comparisons. Of all the systems, only four use audio as matching criteria and only two of these use polyphonic audio, giving a more complex audio structure to analyse as opposed to monophonic audio, which is a more elementary level of audio input. Finally, the only approach that investigates self-similarity to any extent is SIA(M)ESE (Meredith et al., 2001b; Wiggins et al., 2002), where the use of the previously developed SIA algorithm (Meredith et al., 2002) allows a multidimensional pattern matching problem to be investigated when the input data is a multidimensional dataset.

2.4 Music structure

To define the structure of music is to say that music can be represented in a variety of formats depending on the needs of the user. Research on the structure of music is a body of work that has yet to agree on a definition of the term structure (Wiggins, 1998; Salzer, 1962; Lerdahl & Jackendoff, 1983; West et al., 1991). Different representations for music enable different but salient information to be displayed and stored; for example, the current structure of the well-known music score has only been used since the mid 17th century. The ongoing development of music notation continued through the 20th century. Composers continually find new methods of expression more suited to their specific needs and have developed new means of writing them down. For example, in early 20th century work, methods of indicating microtones were found, and symbols primarily used in the mathematics domain have been used to denote complex relationships within the rhythm of a piece of music. Attempts have been made to invent completely new notational styles/systems such as Klavarscribo (Walker, 1997), particularly in the first half of the 20th century, but these have not been widely adopted as a notational standard.

2.4.1 Syntax of Music

Syntax can be defined as a set of principles governing the combination of discrete structural elements into sequences (Jackendoff, 2002). As shown in Figure 2.6, Jackendoff (2002) presents a hierarchical structure of a classical piece of work by J.S. Bach, showing how local tensing and relaxing motions are embedded into larger scales. Right branching indicates an increase in tension and left branching a decrease, i.e., relaxation.

Figure 2.6: Music syntax with a hierarchical structure (Jackendoff, 2002)

Ockelford (1991) investigates finding common ground among the varying musical structure representations based on repetition, where the conclusion leads to creating yet another structure model, with music represented by a system of variables defined as perspects. This concludes with the common ground assumption that one perspect is deemed to exist in imitation of another, determining a zygonic² theory of music-structural cognition. Since around the beginning of the 17th century, music notation has been defined in a standardised format in the style of music typography called plate engraving. This style of representation is ideal for musicians and composers, but difficulties arise when using a computer to read/interpret the notation. This is compounded by the myriad of computer programs for music notation, making sharing music between them difficult. Different programs use different representation styles: graphical, symbolic, numerical, etc. The reason for this is based on the particular task that the software has to perform, but because of this no one program can do everything equally well.

² Ockelford (1991) uses the adjectival form of the word Zygon, derived from the Greek word for yoke, meaning the union of two similar things.


Music information retrieval is complex: the queries are often "fuzzy" (query by humming or singing), and the data relationships are complicated. Queries that work well with one format are unsuitable for another. Until recently the only music interchange format commonly supported was MIDI. The MIDI notation is ideally suited to performance applications like sequencers, but it is not as suitable for other applications such as music notation. MIDI cannot differentiate between an F-sharp and a G-flat, nor capture many other aspects of music notation. Notation Interchange File Format (NIFF) and Standard Music Description Language (SMDL) have attempted to solve the interchange problem, but they still have their limitations depending on the specific needs of users. NIFF is used to interchange music between scanning and notation applications, whereas SMDL was an attempt to create a formal specification for music, but as yet it has limited implementation, mainly due to its complexity (Good et al., 2001).

Recordare is a manufacturer and retailer of digital sheet music software that has developed the MusicXML format to create an Internet-based method of sharing musical scores, with the aim of providing the same role for interactive sheet music that mp3 files serve for recorded music. By using an XML-style layout, Recordare has developed a standardised notation for the representation of music in a format that can be used by almost any application. Scanning and reading applications can import the content and present it in a graphical format with precise representation, and other systems that previously relied on MIDI files can now import a more accurate conversion of the music. It should be pointed out that MusicXML has been developed with the need for a standardised representation, and because of this it is a verbose representation, where size is of less importance than being application independent. A sample MusicXML representation of the 12 Bar Blues audio file is included in Appendix A. This represents just one quarter of the 130 music notes in the music score. When written in music notation the 12 Bar Blues piece uses just over half a page, as shown in Appendix B, Figure B.1.

2.4.2 Semantics of music

In music, the word 'semantics' does not have a well-defined meaning, and it has been claimed to have no meaning at all (Wiggins, 1998). For example, when we say, "this music means ... to me", we are really saying, "I associate this music with ...", which is not always the same thing. Jurafsky & Martin (2000) point out that meanings of a word come from agreed associations in usage, but we are dependent first of all on common associations amongst speakers; crucially, these associations are then grounded in demonstrable reference to the perceived world. Without meaning there can be difficulties in representation. Steedman (1996) views tonal function as the semantics of music and a means to describe music in grammatical terms by capturing the mental processes of a listener which lead to expectation.

The use of semantics within the context of music has caused some debate (Wiggins, 1998). The term semantics is often used to describe methods of tagging elements of a structure in order to apply some form of meaning, and yet, using the arguments presented by Wiggins, no real 'meaning' can be inferred. Wiggins does, however, accept that the use of semantics can be applied on a more abstract level, i.e., for the study of the relationships between various signs and symbols and what they represent, without the associated meaning used within the context of linguistics. In developing an understanding of the similarities and dissimilarities between the structure and constituents of music and language, Griffith (2002) points out that although a piece of music may be described with grammar it is not defined by its use.

One popular form of reduction is the use of semantics to apply meaning to abstract structures of representation. The term semantics in its true sense is primarily used in linguistics, where structure is used to represent the meaning of sentences. Where syntax defines the rules whereby words or other elements of a sentence are combined to form grammatical sentences, without semantics they have no meaning. Jurafsky & Martin (2000) use as an example a semantic network representation of the sentence 'I have a car' (Figure 2.7), where its meaning is clearly represented in that it is the speaker that has the object car.

Figure 2.7: Semantic network representation of 'I have a car' (Jurafsky & Martin, 2000)


Extraction and classification of audio using semantics is a popular approach to music representation/interpretation (Herrera et al., 2004; Slaney et al., 2002). SAR (Semantic-Audio Retrieval) creates a connection between semantic space and acoustic space using cluster abstraction for higher level representation, similar to the MPEG–7 format, which uses hierarchical semantics for tagging elements of data with Descriptors. SIMAC (Semantic Interaction with Music Audio Contents) (Semantic, 2008) is a research group dedicated to providing meaningful descriptors of musical content to aid in describing music collections, descriptors that are close to emulating the mind's way of organising and understanding its contents.

The vast majority of songs in western nations contain a structure known as a song form. A song form is basically a framework which makes the song listenable (Jackson, 2008). Most songs in western music are structured into a verse (V) and a chorus (C). The purpose of a verse is to tell the story or describe the feeling. The chorus is generally the focal point of the song, the central theme. A bridge (B) is a kind of fresh perspective, a small part that may consist of only music, or of both lyrics and music, usually placed after the second chorus and often varying in major/minor chords. These are the main parts that are used when describing song form.

Often we can describe songs in terms such as having a VCVC form, or a VCBVC form. Not all songs, however, follow a verse, chorus and bridge pattern. In the history of music the oldest song form is often referred to as folk, where the common structure of verse followed by chorus is not found. In describing a folk song form, or any song that has only verses, the song form is VVVV. A large percentage of songs these days, however, follow a type of song form which includes a chorus, so a verse, chorus, verse, chorus type of song is VCVC. There is also a common song form which includes a bridge, so a typical form with a bridge might be VVCVCBC. Other less common forms may include a pre-chorus that is a lead-up to the chorus (labelled L); intros (labelled I) at the very beginning of a song; and extras (labelled E), the lead-outs or endings of a song (Jackson, 2008). There are many variations of these forms, however, with some songs starting with the chorus while others have more than one bridge. Any individual artist may have all kinds of variations. Most songwriters don't start writing by coming up with a song form first; it usually reveals itself as the song is being written. It is, however, a quick and easy language to use when discussing the process with other writers.
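By way of illustration, treating a song form as a simple sequence of labels makes the repetition explicit, the property that song-form-based repair of streaming music relies on. The sketch below is purely illustrative:

# Song form as a sequence of labelled sections (I=intro, V=verse,
# C=chorus, B=bridge, E=extra/ending), e.g. an IVCVCBCE song.
SONG_FORM = list("IVCVCBCE")

def earlier_instance(form, index):
    # Return the index of an earlier section carrying the same label,
    # if one exists; choruses and verses usually repeat.
    label = form[index]
    for i in range(index - 1, -1, -1):
        if form[i] == label:
            return i
    return None

# If the second chorus (index 4) is damaged, reuse the first (index 2).
print(earlier_instance(SONG_FORM, 4))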

The method of applying the hierarchical approach to audio, using its structure to describe the different portions of the audio, is shown in Figure 2.8, where an audio file has a root node of 'song' which is then broken down into sections where the introduction, verse, and chorus are identified. Each section can then be broken into smaller sections again, using the natural refrain of songs in Western Tonal Format.

Figure 2.8: Hierarchical breakdown of a song in western tonal format

2.4.3 Cognitive representation

To provide a cognitive representation, a piece of music can be defined using a form of semantic representation. Experience is not only related to the richness of perception; it also has a role in the construction of knowledge, with meaning being characterised in terms of the experience of the person becoming conscious of the music. As such, it is a basic claim of cognitive semantics (Jackendoff, 1987; Lakoff, 1988) that meaning is an account of reality by people.

The perception of music has been a popular area of cognitive research, with the human mind's methods for interpretation of musical structure and patterns being a key area. Lerdahl & Jackendoff (1983) conducted some of their earlier work to provide investigations into the mind's cognitive approach to grouping and reduction within music. They showed that complex musical structures were reduced to more abstract representations in the human mind: "Reduction Hypothesis: The listener attempts to organise all the pitch-events of a piece into a single coherent structure, such that they are heard in a hierarchy of relative importance" (Lerdahl & Jackendoff, 1983, p. 106).

2.5 Frequency and pitch estimation

One of the more challenging areas of automatic transcription of polyphonic music is the estimation of a song's fundamental frequency. Primarily this is due to the nature of music, with a mixture of various musical instruments that have diverse spectral characteristics. If an instrument is played at one of its natural frequencies, the vibrations it produces create a standing wave pattern within the object. Figure 2.9 shows a frequency representation of a string instrument at a harmonic frequency in relation to the length of the string. Each natural frequency which an instrument produces has its own standing wave pattern (Fletcher & Rossing, 1998). The harmonic frequencies of an instrument are defined as the specific frequencies of vibration at which standing wave patterns are created. If an instrument is played at any frequency other than a harmonic frequency, the resonance vibrations of the instrument are irregular and non-repeating. When musical instruments vibrate in a regular and periodic fashion, the harmonic frequencies are related to each other by simple whole number ratios. It is at these frequencies that instruments sound pleasant. The fundamental frequency (F0) of an instrument is defined as the lowest harmonic frequency produced by that particular instrument.

Figure 2.9: A standing wave pattern
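For an ideal string fixed at both ends, for example, this whole-number relationship takes the standard textbook form

f_n = n × F0 = n · v / (2L),   n = 1, 2, 3, ...

where v is the speed of the wave on the string and L is its length, so the second harmonic sounds at twice the fundamental frequency, the third at three times, and so on.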

Performing autocorrelation yields information about repeating events, such as identifying the fundamental frequency of a signal that does not contain that frequency but implies it through the different harmonic frequencies available. This is particularly common when multiple instruments are combined. Multiplication between the signal and a shifted version of itself results in a graph illustrating peaking patterns. An implementation of the autocorrelation function and its use for pitch detection, as presented in Tolonen & Karjalainen (2000) and Wallach (2004), is shown in Figure 2.10. The plots show the pitch frequencies for a 0.5 second audio sample.

Figure 2.10: Pitch plot of 0.5 seconds of audio
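A much simplified autocorrelation pitch estimator in the spirit of this approach (illustrative only, not the Tolonen & Karjalainen implementation) can be written as:

import numpy as np

def estimate_f0(signal, sample_rate, fmin=50.0, fmax=1000.0):
    signal = signal - np.mean(signal)
    # Correlate the signal with shifted versions of itself at every lag.
    corr = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    lo = int(sample_rate / fmax)            # shortest pitch period considered
    hi = int(sample_rate / fmin)            # longest pitch period considered
    lag = lo + int(np.argmax(corr[lo:hi]))  # strongest repeating period
    return sample_rate / lag

# Example: 0.5 seconds of a 220 Hz tone sampled at 16 kHz.
t = np.arange(int(16000 * 0.5)) / 16000.0
print(estimate_f0(np.sin(2 * np.pi * 220 * t), 16000))  # approximately 220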

2.6 Beat detection

In signal analysis, beat detection is performed using computer software or hardware to detect the beat of an audio signal. The rhythm of music, when processed by the human brain, can be determined by detecting a pseudo-periodical succession of beats. When the audio signal detected by the ear contains a certain energy it is converted into an electrical signal which the brain interprets. The more energy the sound carries, the louder the sound will appear. However, a beat will only be heard in a sound if the current sound has more energy than the sound's energy history. For example, if a monotonous sound containing large energy bursts is received, beats can be detected, but when a continuously loud sound is received beats will not be perceived. Therefore, beats are large variations of sound energy.

There are many methods of beat detection available, but there is always a trade-off between accuracy and speed. Beat detectors are common in music visualisation software such as some media player plugins. For example, Microsoft's Windows Media Player provides visualisations that enable the user to see visual imagery that is synchronised to the sound of the media content as it plays. Measurements based on the energy of the sound provide 'real-time' processing of the audio signal with limited impact on the overall processing needs of the application. Default visualisation effects included with Windows Media Player as standard include bars, spikes and waves.

Foote & Uchihashi (2001) show how automatically characterising music tempo and rhythm can be achieved utilising beat tracking. Applications that have the ability to reliably segment and beat-track audio include the following functional abilities:

• Identifying rhythmic similarity: Rhythmically similar music will have similar beat spectra. By comparing the beat spectra of two audio sources a measure of similarity can be calculated. Retrieval is then performed based on rhythmic similarity. By normalising the beat spectra by the tempo, a rhythmic similarity comparison can be made that is independent of the tempo, as in Wold et al. (1999).

• Segmenting the music by rhythm: Clustering the beat spectrogram allows songs to be segmented.

• Tempo extraction: Identifying the beat times, tempo and song structure permits the synchronisation of external events to the music.

The algorithms used to track beats within an audio file may utilise straightforward statistical models based on the energy of the sound or the frequency of the sound. Sound energy detection determines whether there has been an onset (beat) in a frame by tracking the level of the signal and registering peaks. Frequency energy detection uses the same algorithm but tracks frequency bands, giving more detailed information about where in the spectrum the onset occurred. This is useful for tracking a particular instrument.
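A minimal sketch of the sound energy method (with illustrative thresholds; real detectors use more careful statistics) marks a frame as a beat when its energy clearly exceeds the average of its recent history:

import numpy as np

def detect_beats(samples, frame_size=1024, history=43, threshold=1.3):
    energies, beats = [], []
    for start in range(0, len(samples) - frame_size, frame_size):
        frame = samples[start:start + frame_size]
        energy = float(np.sum(frame ** 2))
        recent = energies[-history:]               # the energy 'history'
        if recent and energy > threshold * (sum(recent) / len(recent)):
            beats.append(start)                    # onset: local energy burst
        energies.append(energy)
    return beats

# Example: one second of silence at 44.1 kHz with two loud clicks.
audio = np.zeros(44100)
audio[11025] = audio[33075] = 1.0
print(detect_beats(audio))                         # two frames flagged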

2.7 Mainstream approaches to packet loss: protocols and standards

Protocols within network standards dictate how traffic is handled and attempt some form of improvement when time-dependent data is sent across networks. Attempts to improve the latency of streaming audio/video include identifying these packets and assigning them priority over 'general' network packets. Packet delay from network congestion has been partially alleviated using routing protocols and application protocols such as the Real-time Transport Protocol (RTP). These have been developed to assign a higher priority to time-dependent data. However, it is also the case that some servers automatically dump packets that are time sensitive, so streaming applications have had to resort to 'masking' the packets by using HTTP port 80 so that packets appear as normal web traffic. The latest additions to network protocols specifically addressing 'real-time' communication include Voice over Internet Protocol, a technology that allows telephone calls using a broadband Internet connection across a packet switched network instead of a regular (analogue) phone line.

It is necessary for standards to be defined for communication across networks. Computers need a specific set of rules and guidelines to communicate, in the same way that humans need language to be able to communicate with each other. Without a predefined set of guidelines one computer wouldn't understand what the other was saying, just as a French person (who doesn't speak Chinese) wouldn't understand a Chinese person when they both talk in their native languages. TCP/IP was the first recognised standard for communication between computers across a network. Protocols are an open set of rules of behaviour that are independent of operating system and architectural differences. They are available to everyone to allow for development and are changed by consensus. These protocols are published as 'Requests For Comments' (Socolofsky & Kale, 2008) and contain the latest versions of the specification of all standard TCP/IP protocols. Each layer can contain any number of protocols that perform specific functions relating to that layer. It should be noted that each layer does not know or care how the layers above and below work, simply that data is passed between them. The main benefit of TCP/IP is that it provides interoperable communications between all types of hardware and operating systems (Stevens, 1993).

Figure 2.11 shows this model with four layers; although between three and five layers can be used to describe the TCP/IP model, the four-layer model is the most commonly used. Each layer does not define a single protocol, but represents a communication function that can be performed by any number of different protocols. However, TCP/IP was not the only standard developed.

Figure 2.12 shows the OSI (Open Systems Interconnect) Reference Model, which was developed by the International Standards Organisation and is widely accepted as a general guide. However, it is not as popular in its implementation, as some of the layers have few (if any) protocols defined. It is for this reason that the TCP/IP model is more widely used.


Figure 2.11: The TCP/IP model (Tanenbaum, 1996)

Figure 2.12: The OSI reference model (Tanenbaum, 1996)


2.7.1 Improvements of real-time traffic using Internet protocols

As a step to improve the Quality of Service (QoS) of time-dependent data sent across networks, the Internet Engineering Task Force proposed an upgrade from the IPv4 Internet Protocol and released IPv6. The main difference between IPv6 and IPv4 is that IPv6 contains a much larger address space within each packet, which allows greater flexibility in assigning addresses. The IPv6 header has also been redesigned to keep header overhead to a minimum, by moving both optional and non-essential fields to extension headers which are then placed after the main IPv6 header (Hagen, 2006). The redesigned IPv6 header is processed more efficiently at intermediate routers. Newly added fields to the IPv6 header dictate how packets are identified and handled. Traffic identification in the IPv6 header allows routers along the route of the packets to identify and provide special handling for individual packets identified as belonging to a series of packets between a specific source and destination. Better QoS can be achieved during transit for streaming applications, since time-dependent packets can be identified in the IPv6 header and given a higher priority.

2.7.2 VoIP

One of the most recent additions to network communication is VoIP (Voice over Internet Protocol). This technology allows users to make telephone calls using a computer to either another computer with an Internet connection, or a telephone, for the cost of a local call. VoIP converts the voice signal from a telephone/microphone into a digital signal that travels over the Internet and is converted back at the receiving computer. One of the driving forces behind VoIP is its cost: state and country borders have no meaning to the Internet, so the traditional charges for local, long-distance and international calls largely disappear, allowing users of the service to save on long-distance and international charges.

VoIP is not without the problems that are associated with real-time traffic across networks: packet delays and losses (Jiang & Schulzrinne, 2002). One of the main issues faced by Internet telephone applications such as Skype and Vonage is the quality and reliability of communication via the Internet. The traditional public switched telephone network sets a high standard for IP telephony to match before mainstream acceptance.

Both IPv6 and VoIP are important changes that have been made to current network communication protocols. Although network reliability and overall QoS are significantly improved, there is still room for improvement. The core idea of the IPv6 header format shows the need for optional information to be included in packet headers. The VoIP protocol is an area of time-dependent communication where error recovery is still in its infancy.

2.8 Audio formats and file compression

The number of audio formats has increased dramatically since the introduction of digital media. Previously there were simply vinyl records (LPs) and cassettes, with audio stored in an analogue format. With the introduction of digital storage a number of different formats have emerged, all serving a particular purpose. Table 2.3 lists the most common of these:

File Extension    Origin/Name                      Remarks
AU, SND, ULW      Sun Microsystems                 Sun Microsystems file format
AIFF, AIF, AIFC   Audio Interchange File Format    File format for storing digital audio on Macintosh computers.
GSM               GSM Audio File                   Audio file used for GSM supported devices.
MIDI, MID, SMF    Musical Instrument Digital       Protocol designed for recording and playback on digital synthesizers.
                  Interface
MP3               Moving Pictures Experts Group    MPEG 1 audio layer 3 compresses files to 1/12 of their size.
WAV, WMA          Microsoft                        Audio file that has become a standard PC audio format with CD level quality.
OGG               Ogg Vorbis                       Ogg Vorbis is an open source general-purpose compressed audio format.

Table 2.3: Digital Audio Formats (Menin, 2002)

The three main formats most commonly used by PCs are:


• WAV: The Waveform Audio File format is the standard audio format for Microsoft Windows, but is now also supported by Macintosh computers.

• MP3: Developed by the Fraunhofer Institute in Germany in 1991, this audio compression format is the most popular audio format currently in use. Mp3 files are able to produce good/reasonable sound quality and yet maintain small file sizes. By removing audio that the human ear cannot hear, file sizes can be reduced to as little as one twelfth of their original size.

• MIDI: This format was developed for communication between electronic musical instruments. The file contains no audio information but instead commands for a series of notes, with information on their length and volume. MIDI files can be played on a MIDI player application with the available "instrument" sounds using a sound card on a PC or any other compatible device.

2.9 Compression and mp3s

Reducing the size of an audio file makes it more manageable, in both overall collection size and the analysis performed on it. File size reduction can be performed in a number of ways. One method is to reduce the sampling frequency (Menin, 2002) of the recording system. However, this can have serious side-effects with regard to sound quality, since most of the high-frequency content of the audio is removed, leading to recordings lacking in brightness and clarity. Mp3 compression uses a number of coding techniques based on the human auditory system to reduce file size and yet maintain quality audio. Using lossy compression, the bitrate of mp3 files determines the level of quality. The compact disc (CD) audio format uses 16-bit samples at a sampling rate of 44.1 kHz (44,100 slices measured every second), which equates to approximately 5.2 MB per minute of recording per channel. The sample rate is the number of times a section of audio is measured per second. Mp3 files can be encoded at the following levels:

• 1411 Kbps - CD quality

• 192 Kbps - good CD-quality mp3 files

• 128 Kbps - near CD quality

• 64 Kbps - FM broadcast quality

• 32 Kbps - AM radio quality

The result in real terms is that mp3 coding shrinks the original audio signal from a CD (PCM format) by a factor of about 12 without sacrificing sound quality, i.e. from a bit rate of 1411.2 Kbps for stereo music (44,100 samples/s × 16 bits × 2 channels) down to 112-128 Kbps.

Most mp3 files are compressed to either 128 Kbps or 64 Kbps. An acceptable loss of quality is permitted for the advantages of the reduced file size. It is for this reason that mp3 files are so popular on the Internet and across networks.

2.10 Jitter control

Streaming audio over a network has one serious problem associated with it: jitter. Jitter occurs when media being played back starts and stops as the packets of the stream arrive inconsistently. Because of the nature of networks, it is possible for packets to arrive in a different order from that in which they were originally sent. The receiving application then has to restructure these into their correct order. In the context of streaming this can be problematic, as portions of audio may arrive too late to be played, leading to sections of the audio being dropped altogether and making the audio sound jittery. This effect is compounded by the quality of the transmission, and high quality audio signals require a large number of packets, which in turn require a larger bandwidth (Bush, 2000).

Jitter control can be managed at hops across the network. At each hop a packet is examined to determine its position relative to the rest of the stream. When a packet is found to be 'lagging behind' it can be forwarded with priority over other packets in the same stream. Similarly, if a packet has managed to jump the queue it can be 'slowed down' to allow the other packets to catch up. Jitter occurs more frequently when streaming audio across wireless networks. The nature of wireless communication and its inconsistencies amplify the effects of packet loss when bursty packet losses occur.

Streaming media players are almost indifferent to the format of an audio file before streaming, but results from analysis vary greatly depending on the format used. The quality of the audio signal received depends greatly on both jitter and file formats.


2.11 Streaming media

When surfing the web, it is common to find embedded audio and video that need additional applications to handle the content. Streaming media is the act of sending audio and/or video data that has been encoded (digitised) into a series of small data packets out across the Internet, where it may then be viewed by an end user in real time using a media player (Windows Media, WinAmp, iTunes, Real Player or Quicktime). Essentially the media player captures, decodes and reorders the data packets for real time viewing. Either a hardware or a software encoder can be used to encode the audio and/or video input source. A hardware encoder is more commonly used when the audio/video input is from an external source: the audio/video is encoded as it is received and then streamed directly. A software encoder, however, can also encode files being played by an internal (software) player. In most cases the nature of the broadcast determines whether a hardware or software encoder is used; for example, a 'live' broadcast of a concert across the Internet will utilise hardware encoders directly encoding the input before broadcast, whereas a radio station using the Internet as a broadcast medium will use software encoders to convert the 'pre-recorded' digital media into audio streams for broadcast.

2.11.1 Windows Media Encoder

Windows Media Encoder is a software-based tool primarily aimed at content producers that need to capture video and audio content using the many unique abilities within Windows Media, including support for mixed-mode voice and music content, high-definition video quality and high-quality multichannel sound. Professional-level codecs and encoding modes enable high definition video quality and multichannel sound. The latest release, Windows Media Encoder 9, enables two-pass encoding to optimise quality for streaming audio and video through live or on-demand webcasting services. Both Variable BitRate (VBR) and true VBR are supported. VBR provides optimum encoding for download-and-play scenarios, and true VBR applied over the entire duration of a high-motion sequence ensures the highest quality playback without sacrificing file size/download speed.


2.11.2 Icecast and Ices

Most streaming media servers consist of two major parts: (1) the component providing the content (i.e., source clients) and (2) the component which handles the broadcasting of the content across the network to listeners. Icecast (Icecast, 2008) is a streaming media server which provides support for most audio file formats, including Ogg Vorbis and mp3 audio streams. The versatility of Icecast allows users to create anything from an Internet radio station to a privately running jukebox and many other applications depending on their needs. It also allows new formats to be added relatively easily and, by supporting open standards for communication and interaction, allows specific tailoring to a user's needs through the use of add-ons. Through a web-based interface the user can manipulate many server features. Icecast allows the administrator to move listeners from one source stream (mountpoint) to another, disconnect connected sources, disconnect connected listeners, gather statistics and many other activities. Each of these functions requires authentication via the <admin-username> and <admin-password> specified in the Icecast configuration file.

The Icecast configuration file is an XML file and usually resides in /etc/icecast.xml (under Linux). The configuration file allows the Icecast settings to be customised. The most important of these include:

• Limits. The limits section of the file allows the administrator to set the maximum number of source streams (encoders) and the maximum number of clients that can connect to the streaming computer at once.

<limits>
    <clients>100</clients>
    <sources>1</sources>
</limits>

• Upload. The upload data rate can be limited based on the following:

– The number of simultaneous users Icecast must support when under maximum load

– Data transmission speed at which listeners connect to the streaming server, considering that listeners will be using various connection speeds (56 Kbps dial-up, cable/DSL, or a LAN)


– Limitations of the server hardware and Internet connection provided by the hosting service

– The encoding rate of the audio and video content to be streamed

As an estimate, if the media server has 100 listeners then the necessary upload speed can be determined by the following calculation:

– Server bandwidth total (SBT):

SBT = L × BR

where L = number of simultaneous listeners and BR = average bitrate of the encoded audio to be transmitted.

Therefore the minimum bandwidth needed would be SBT = 100 × 128 Kbps = 12.8 Mbps.

• Authentication. The authentication section is where usernames and passwords are specified. The admin-user and admin-password allow access to the administration web page. Since Icecast is web based, the default settings must be changed upon setup, otherwise the server is at risk from unauthorised access.

<authentication>
    <admin-user>admin</admin-user>
    <admin-password>hackme</admin-password>
</authentication>

• Hostnames. The hostname is used so that Icecast knows what address to append to the beginning of the links on the web page:

<hostname>SampleRadio.com</hostname>

• The listen-socket allows the administrator to set the port on which Icecast listens; 8000 is the default:

<listen-socket>
    <port>8000</port>
</listen-socket>

Ices is one of many available source clients for a streaming server (e.g. Icecast). The purpose of Ices is to provide an audio stream to a streaming server without any regard for the number of listeners connected or the limitations in place, e.g. the amount of bandwidth or the number of ports available. It is not necessary for Ices to be on the same physical machine as the streaming server (Icecast), since using separate machines helps to alleviate the processing needs. However, it is easier to manage both the server and the source client when they are located on the same machine. The Ices configuration file is also an XML file. It is usually stored in /etc/ices.conf. Within the stream section, the metadata section is where information about the stream is specified.

<metadata>
    <name>REM</name>
    <genre>Rock</genre>
    <description>Album: REM, The Best of</description>
</metadata>

There are a number of possible settings for playlists and scripts. The sample XML below defines the type of playlist used, its location, the ordering of the playlist and whether to repeat the playlist once it has completed one run through.

<input>
    <module>playlist</module>
    <param name="type">basic</param>
    <param name="file">filepath to playlist....</param>
    <!-- random play -->
    <param name="random">0</param>
    <!-- if the playlist gets updated then start at the beginning -->
    <param name="restart-after-reread">0</param>
    <!-- if set to 1, plays once through, then exits. -->
    <param name="once">0</param>
</input>

Of the different configuration settings available, the encode section is the most important with regard to the expected bandwidth. Ices allows the audio to be encoded at differing levels of "quality" depending on the intended audience and the bandwidth limitations of the server. The XML below shows an Ices configuration where audio is encoded at 128 Kbps over 2 channels for stereo output at the listener's machine.

<encode>
    <!-- Use bps. e.g. 64000 for 64 kbps -->
    <nominal-bitrate>128000</nominal-bitrate>
    <samplerate>44100</samplerate>
    <channels>2</channels>
</encode>


2.11.3 GStreamer

GStreamer (Gstreamer, 2008) is a development framework for creating streaming media applications. Using GStreamer as a development tool makes it possible to create a vast array of streaming multimedia applications. One of the principal goals of the GStreamer framework is to make it easier for developers to build applications that handle audio, video or both. One of the most obvious uses of GStreamer is to build a media player. GStreamer includes components for building a media player that can support a very wide variety of formats, including mp3, Ogg/Vorbis, MPEG–1 and 2, AVI, Quicktime, MOD and more. Its main advantage is that the pluggable components can be mixed and matched into arbitrary pipelines, so that it is possible to write a full-fledged video or audio editing application.

GStreamer's core function is to provide a framework for plugins, data flow and media type handling/negotiation. An element is the most important class of object in GStreamer. An element has one specific function, which can be the reading of data from a file, decoding of this data, or outputting this data to a sound card or any other form of output. By chaining together several such elements, a pipeline is created that can perform a specific task, e.g. media playback or capture. GStreamer ships with a large collection of elements by default, making the development of a large variety of media applications possible simply by chaining different elements depending on the needs of the developer. Figure 2.13 shows a basic media player: a pipeline containing the elements, and their source pads, required for basic playback of an audio file encoded in the Ogg format. It should be noted that the output element 'alsasink' is required on Unix/Linux operating systems for soundcard output.

Figure 2.13: A basic Gstreamer media player
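Using GStreamer's 0.10-era Python bindings (pygst), the pipeline of Figure 2.13 can be sketched in a few lines. This is an illustrative sketch under those assumptions ('song.ogg' is a placeholder filename), not a canonical example:

import pygst
pygst.require("0.10")
import gst
import gobject

# Chain the elements of Figure 2.13: file source -> Ogg demuxer ->
# Vorbis decoder -> converter -> ALSA sound card sink.
pipeline = gst.parse_launch(
    "filesrc location=song.ogg ! oggdemux ! vorbisdec "
    "! audioconvert ! alsasink")

pipeline.set_state(gst.STATE_PLAYING)   # start playback
gobject.MainLoop().run()                # run until interrupted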


2.12 Streaming audio approaches to packet loss

Solutions to packet loss, jitter and associated problems within streaming audio have included research into a number of varying techniques. The probability of packet loss across bursty networks has been modelled, where time delay is used to control the flow of packets and measure the difference between the current time and the time the packet arrives (Lee & Chanson, 2004). This technique can be used to predict network behaviour and adjust audio compression based on current network behaviour. Higher compression results in poorer quality audio but reduces network congestion through smaller packets. A variation on this theme has been used to create new protocols that allow scalable media streaming (Mahanti et al., 2003).

Randomising packet order to alleviate the large gaps associated with bursty losses has also been implemented: the problem was reduced by re-ordering the packets before they are sent and reassembling them into the correct order at the receiver (Varadarajan et al., 2002). This reduced the bursty loss effect, since the packets lost were from different time segments. Although nothing is done to replace the missing packets, overall audio quality improved through smaller, albeit more frequent, gaps in the audio.
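The idea can be illustrated with a simple block interleaver (an illustrative sketch, not the scheme of Varadarajan et al. (2002); it assumes the packet count is a multiple of the depth): packets are written row by row and sent column by column, so a burst of consecutive losses on the wire becomes several isolated gaps after de-interleaving:

def interleave(packets, depth=4):
    # Write row-by-row with rows of length `depth`, send column-by-column.
    return [p for c in range(depth) for p in packets[c::depth]]

def deinterleave(packets, depth=4):
    # Invert the interleaving at the receiver.
    rows = len(packets) // depth
    return [p for r in range(rows) for p in packets[r::rows]]

sent = interleave(list(range(8)))
print(sent)                 # [0, 4, 1, 5, 2, 6, 3, 7]
# Losing sent[2] and sent[3] in a burst now removes packets 1 and 5:
# two small, separated gaps instead of two consecutive ones.
print(deinterleave(sent))   # [0, 1, 2, 3, 4, 5, 6, 7]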

A number of techniques have been developed that use some form of redundancy, where repetition replaces lost audio segments. Sending packets containing the same audio segments (but with a lower bit-rate) alongside the high bit-rate encoding increases the likelihood of packet arrival, but at the loss of audio quality, as well as increasing the overall network bandwidth usage (Perkins et al., 1998). Another approach using redundancy in the form of unequal error protection (UEP) has been developed, where improvement is achieved with an acceptable amount of redundancy using advanced audio coding (AAC) (Wang et al., 2003). Segmentation of the audio into different classes, such as drumbeats and onset segments, allows priority to be applied to more important audio segments, with an Automatic Repeat-reQuest (ARQ) applied to high priority segments and a reconstruction technique for the replacement of low priority segments based on the AAC received in previous segments.

One of the more recent methods is the interpolation of low bit-rate coded voice: observing the high correlation of linear predictors within adjacent frames allows descriptions to be inserted using line spectral pairs (LSP), and lost packets to be reconstructed using linear interpolation (Wah & Lin, 2005). This allowed packet-loss replacement without increasing the transmission bandwidth. Wah & Lin (2005) do point out that this approach is a trade-off between the quality of the received packets and the ability to reconstruct lost packets.

2.12.1 Voice communication

Traditional methods of interpolation between lost packets are still popular with Internet telephone applications, where timing is critical and limited signal degradation is acceptable. Research into Forward Error Correction (FEC) includes Waveform Similarity OverLap Add (WSOLA) (Liang et al., 2003), where lost packets of a section of voice are merged based on pitch similarity (Figure 2.14) rather than straightforward interpolation. WSOLA decomposes the input into equal lengths of overlapping segments. When received by the client, these segments are then rearranged and overlaid in their original order to form an audio output of equal and fixed length. Using a windowing technique (see Section 2.2) minimises the changes in signal strength between the two segments. This leads to increased processing overhead, but concealment is possible if packet loss is limited to one or two sections.

Figure 2.14: WSOLA loss concealment (Liang et al., 2003)

FS-CELP (Lin & Wah, 2005) is an implementation of the Federal Standard 1016 Code Excited Linear Prediction, based on the principle of linear predictive coding. Linear Predictive Coding (LPC) is a powerful speech analysis technique, and one of the most common methods for encoding good quality speech at a low bit rate. LPC provides relatively accurate estimates of speech parameters. Using multi-description coding, FS-CELP allows multiple descriptions of the signal to be encoded into two streams: odd and even samples. Reconstruction of the original signal is possible if only one of the two streams is lost. Only when both the odd and even streams are lost is error correction not possible.


Building redundancy into packets is a popular method for FEC. FreePhone (Bolot et al., 1999) uses an adaptive approach where the amount of redundancy encoded depends on the loss characteristics of the network at the time, measured using Real-time Transport Control Protocol (RTCP) feedback. Bolot et al. (1999) justify this by pointing out that there is little point in encoding high levels of redundant information into the packets if there is little chance of it being used. The actual method of FEC used is a simple 'next packet' scheme, where packet n is encoded with not only its own data but also a redundant version of packet n-1. In the event that packet n-1 is lost, it can be reconstructed using the information encoded into packet n. The adaptive approach minimises the extra network bandwidth required to carry the redundant data, thereby reducing the overhead of using FEC.
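
The 'next packet' scheme can be sketched as follows; for simplicity a full-rate copy of packet n-1 stands in for the lower bit-rate redundant version an adaptive encoder would choose:

    def encode_with_redundancy(payloads):
        # Packet n carries its own payload plus a redundant copy of n-1.
        return [(n, p, payloads[n - 1] if n > 0 else None)
                for n, p in enumerate(payloads)]

    def recover(received, total):
        # `received` maps sequence number -> (payload, copy_of_previous).
        stream = [None] * total
        for n, (payload, prev_copy) in received.items():
            stream[n] = payload
            if n > 0 and stream[n - 1] is None and prev_copy is not None:
                stream[n - 1] = prev_copy  # repair packet n-1 from packet n
        return stream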

2.12.2 Audio/video streaming

Many techniques used for error correction when streaming audio can also be applied when video signals are sent. Traditional ARQ (Lin et al., 1984) methods have been improved with the use of gap detection, for cases where a large number of packets in sequence are lost. Detection of large gaps allows a retransmission request to be issued before buffer levels run low, thereby allowing sufficient time for the missing segments to be resent. Another approach is to use timeout detection, where packet loss is detected by estimating the arrival time of packets: if a packet has not arrived by a certain deadline, it is assumed to be lost and an ARQ is sent. Both techniques have their merits under certain conditions, but a combination of the two has been used to improve overall performance (Sze et al., 2001). The main drawback of this approach is that there is no reduction in the buffer size.

Techniques for the correction of video frames are very similar to audio approaches in that temporal concealment methods estimate the missing section based on interpolation between the previous and next blocks of the frame (Pyun et al., 2003). The Bidirectional Motion Vector Tracking (BMVT) system proposed by Pyun et al. (2003) is unique in that it uses previous and future frames to predict the content of the missing segment, rather than preceding and subsequent blocks within the current frame. Based on the principle that an average value can be taken of a similar segment of blocks between the previous and subsequent frames, a best possible match can be used for the missing blocks in the current frame. Where this approach is beneficial to video error correction (based on the similarity


of continuous frames of video), audio does not follow the same linear path. Consider, for example, a sequence of video frames of a snooker player potting a ball. The vast majority of each frame has constant values, i.e., the green cloth of the table and the static coloured balls. The only changes between frames are the moving objects in the scene, i.e., the snooker player and the balls that are 'in play'. Music, and audio in general, is by contrast constantly changing from one 'frame' to the next, making this approach almost impossible to apply.

Other approaches to FEC when streaming video include additional redundancy encoding that is temporally delayed in a multicast video stream. Chan et al. (2006) proposed the use of replicated and delayed (ReD) streams that can be combined with FEC. By replicating the video stream, the server can multicast it in a delayed manner in parallel with the FEC packets; as the ReD packets are the source packets, they are used first to incrementally recover some of the lost packets. Makharia et al. (2008) designed a channel estimation algorithm for a client receiver that is able to dynamically identify which delayed FEC multicast groups to join and to send a negative acknowledgement (NACK) retransmission request when necessary. A novel approach by Shan (2005) uses a cross-layer technique to improve efficiency when determining losses. By combining a novel packetisation scheme within the same layer of the OSI model as the FEC encoding, any errors within received packets can be identified before the packet has reached the application layer, thereby combining the flexibility of the application layer's adaptability with the low delay and bandwidth efficiency of the link layer. However, all of these approaches introduce extra bandwidth usage to facilitate the extra encoding and some form of retransmission, and they provide little benefit when large dropouts occur.


2.13 Summary

In this chapter a variety of areas relating to song form intelligence have been reviewed. First, the characteristics of audio and music were introduced, together with digital signal processing and the purpose of 'windows'. A discussion of the area of Music Information Retrieval was presented, detailing systems developed to browse and search music collections, compare audio files and represent audio content in different ways. This was followed by a discussion of the syntax and semantics of music. Signal analysis of pitch and frequency, with the aim of determining the fundamental frequency of a music signal, was discussed, followed by beat detection. Mainstream approaches to packet loss were then discussed. Audio file formats were considered next, leading to streaming media applications and implementations. The chapter concluded with a review of streaming audio approaches to packet loss.


CHAPTER THREE

Feature Extraction and Audio Analysis

The choice of features that can be extracted from audio depends greatly on the criteria of the analysis performed. As shown in Section 2.3.4, these features can vary from pitch estimation to fingerprinting, depending on the nature of the queries involved. The following sections present the use of MPEG–7 as a feature extraction tool for a variety of extraction needs. This is followed by an introduction to clustering and classification of extracted features, and concluded with a discussion of string matching and distance measurement.

3.1 MPEG–7

One of the most common formats for audio compression is mp3, defined by the Moving Picture Experts Group (MPEG). MPEG–7 is an internationally standardised description of various types of multimedia information (Martínez et al., 2002). Whereas MPEG–4 defines the layout and structure of a file and its codecs, MPEG–7 is a more abstract model that incorporates a markup language to define description schemes and descriptors - the Description Definition Language (DDL). Using a hierarchy of classification allows different granularity in the descriptions. Whilst a Google/AltaVista-style search engine does not exist for audio, many researchers are discovering ways to automatically locate, index, and browse audio using recent advances in technologies such as speech recognition and machine learning. MPEG–7 as a descriptive tool can greatly enhance this area by adding metadata based on the content of the audio as well as standard keyword descriptors.

It should be noted that the MPEG–7 standard only specifies the format for descriptions


of content, and not the algorithms to utilise these descriptions. Developers have only recently begun implementing MPEG–7. MPEG–7 classes and application developments have primarily been in MATLAB, a numerical computing environment and programming language. However, implementations in Java and C++ are available from a number of researchers/institutes, including the MPEG–7 Library by the Joanneum Research group (MPEG–7, 2008) and an MPEG–7 Audio Encoder by Holger Crysandt (Wellhausen & Crysandt, 2003). The MPEG–7 Library is a comprehensive collection of over 800 description schemes and descriptors implemented as C++ classes, which enables developers to use the functionality of MPEG–7 in their own applications. The Java MPEG–7 Audio Encoder is a complete software package for MPEG–7 analysis, with the audio analysis results stored in XML format. It can be launched from the web using a Java Virtual Machine (JVM) or on a local machine as a command line application, and has been widely used as an analysis tool for similarity analysis and pattern recognition (Matushima et al., 2004; Super, 2004; Cho & Choi, 2005).

3.1.1 MPEG–7 descriptors

Through the combination of descriptors, description schemes, and a Description Definition Language (DDL), MPEG–7 can facilitate efficient searching and filtering of files. The roles of descriptors, description schemes and the DDL are:

• Descriptors – these define the syntax and semantics of audio features, assigning specified features to the relevant set of extracted values.

• Description schemes – these define the structure of the relationships between descriptors and, in a hierarchical manner, can define other description schemes.

• Description Definition Language (DDL) – this defines the syntax used to represent the audiovisual description results in XML format.

Extracted audio features can aid application queries based on more than the simple metadata stored within audio files. MPEG–7 descriptors primarily describe low-level features such as colour, texture, motion and audio energy, and attributes of the audio/video such as location, time and quality. Example MPEG–7


descriptors and their output are shown in Appendix C. The resulting output is a compact representation of the analysed audio. Features resulting from MPEG–7 analysis can be stored locally and form the core search criteria. Figure 3.1 shows an example application scenario (Kim et al., 2005). Through specific filters on MPEG–7 descriptors a user can specify the type of audio content he/she wants to listen to, without any requirement to analyse the audio itself. For example, in one method of similarity analysis, Crysandt (2004) used the mean of the AudioPower and the mean of the AudioSpectrumFlatness as criteria for a similarity comparison between songs.

Figure 3.1: Example MPEG–7 application scenario (Kim et al., 2005, p. 4)

There are 17 Low Level Descriptors (LLDs) that can appear in a variety of combinations and that examine either the temporal or the spectral domain of an audio signal, as shown in Figure 3.2.

The two basic descriptors are the AudioWaveform (AW) and AudioPower (AP), presented as temporally sampled scalar values. The AudioWaveform descriptor gives the minimum/maximum values of the signal range within a specified temporal resolution³. The AudioPower descriptor provides a measure of the square of the waveform values, thereby providing a simplified representation of the signal showing peaks where the signal has a higher amplitude. A comparative study of some of the LLDs by Lukasiak et al. (2003) shows that, when facilitating comparison between audio segments, the AW performs better, based on the principle that it has two measurements for each frame as

³ By default, MPEG–7 recommends 10 ms. hops.


Figure 3.2: Class hierarchy of MPEG–7 audio low level descriptors (Chiariglione, 2010)

opposed to a single value from the AP analysis. An example giving the first 18 s. of a guitar playing a rendition of the 12 Bar Blues⁴ is shown in Figures 3.3 and 3.4. When compared, the level of detail in the AW representation of the content of the audio file is higher.

Figure 3.3: Example MPEG–7 audio power representation

⁴ The 12 Bar Blues is described as a series of repeating chords in progressive form.


Figure 3.4: Example MPEG–7 audio waveform representation
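
As a rough illustration of the difference between the two descriptors, the sketch below computes per-frame AW and AP values directly from PCM samples, using the default frame and hop sizes mentioned above (this is an approximation, not the output of a conformant MPEG–7 encoder):

    import numpy as np

    def aw_ap(samples, rate, frame_ms=30, hop_ms=10):
        # AudioWaveform keeps (min, max) per frame - two values;
        # AudioPower keeps the mean squared value - one value per frame.
        frame = int(rate * frame_ms / 1000)
        hop = int(rate * hop_ms / 1000)
        aw, ap = [], []
        for start in range(0, len(samples) - frame + 1, hop):
            x = samples[start:start + frame]
            aw.append((float(x.min()), float(x.max())))
            ap.append(float(np.mean(x ** 2)))
        return aw, ap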

3.1.2 Audio Spectrum Envelope (ASE)

The Audio Spectrum Envelope (ASE) is a log-frequency power spectrum that can facilitate the generation of a reduced spectrum of the original audio. This is performed by summing the energy of the power spectrum within a series of frequency bands. Bands are equally distributed between two frequency edges, loEdge and hiEdge; the default values of 62.5 Hz and 16 kHz correspond to the lower/upper limits of hearing, as shown below:

<AudioDescriptor hiEdge="16000.0" loEdge="62.5" octaveResolution="1/4" xsi:type="AudioSpectrumEnvelopeType">

The spectral resolution r of the frequency bands within these limits can be specified using eight possible values, ranging from 1/16 of an octave to 8 octaves, as shown in Equation 3.1.

r = 2^j octaves (−4 ≤ j ≤ +3)    (3.1)

(Kim et al., 2005)

Each ASE vector is extracted every 10 ms. from a 30 ms. frame (window), which gives a compact representation of the spectrogram of the audio.
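
A minimal reading of Equation 3.1 in code, laying out the band edges between loEdge and hiEdge (a simplification: the full descriptor also defines extra bands below loEdge and above hiEdge):

    import math

    def ase_band_edges(lo=62.5, hi=16000.0, octave_resolution=0.25):
        # Edges spaced octave_resolution octaves apart: each edge is
        # 2**octave_resolution times the previous one.
        n_bands = round(math.log2(hi / lo) / octave_resolution)
        return [lo * 2 ** (octave_resolution * k) for k in range(n_bands + 1)]

    # The defaults span 8 octaves, so 1/4-octave resolution gives 32 bands:
    print(len(ase_band_edges()) - 1)  # -> 32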

3.1.3 Audio Spectrum Flatness (ASF)

Audio Spectrum Flatness (ASF) describes the flatness properties of the spectrum of an audio signal within a given number of frequency bands. The flatness of a


band is defined as the ratio of the geometric mean, i.e., the central tendency of a vector, to the arithmetic mean of the spectral power coefficients within the band. The most proficient use of ASF is for musical instrument identification: Essid et al. (2004) used the ASF as one of the key characteristics for musical instrument recognition. The ASF features, when combined with other feature vectors including the Mel-Frequency Cepstral Coefficients (MFCCs), result in high accuracy recognition for instruments belonging to different families, even over short-term decision lengths.
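
The per-band flatness measure itself is a one-liner, sketched here with a small epsilon added as a numerical-safety assumption:

    import numpy as np

    def spectral_flatness(power_coeffs):
        # Geometric mean over arithmetic mean of the band's power
        # coefficients: near 1 for flat (noise-like) spectra, near 0
        # for peaky (tonal) spectra.
        p = np.asarray(power_coeffs, dtype=float) + 1e-12
        geometric = np.exp(np.mean(np.log(p)))
        return geometric / np.mean(p)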

3.1.4 Audio Spectrum Basis/Projection

The AudioSpectrumBasis descriptor is a container for basis functions used to project a spectrum onto a lower-dimensional sub-space suitable for probability model classifiers (e.g., neural networks and Hidden Markov Models). The reduced basis consists of decorrelated features of the spectrum, with salient information described more efficiently than with the direct spectrum representation. This reduced representation is suited to probability model classifiers, which typically perform best when the input features consist of fewer than 10 dimensions (Casey, 2002); Casey also points out that AudioSpectrumBasis features perform better for sound recognition tasks than for similarity or classification tasks. Kim et al. (2004) evaluate the efficiency of audio indexing and retrieval systems based on a combination of the Audio Spectrum Basis (ASB) and Audio Spectrum Projection (ASP) MPEG–7 audio descriptors. Figure 3.5 illustrates the typical architecture of an audio indexing and retrieval system incorporating MPEG–7 basis projection.


Figure 3.5: Architecture of an audio indexing and retrieval system (Kim et al., 2004)

The feature extraction system incorporating basis projection, shown in Figure 3.6, comprises five steps/functions (sketched in code after the list):

1. Convert to the spectral domain with a short-time Fourier transform

2. Calculate the ASE

3. Normalise the ASE (NASE), which is a log-power of the ASE obtained from the root mean square of the energy envelope

4. Separation into basis constituents through principal component analysis (PCA) or independent component analysis (ICA)

5. Basis projection: obtained by multiplying the NASE with the set of extracted basis functions
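
A compact sketch of steps 3-5 follows, with PCA via a singular value decomposition standing in for the PCA/ICA choice in step 4; the decibel scaling and epsilon terms are assumptions:

    import numpy as np

    def nase_and_projection(ase_frames, n_basis=5):
        # ase_frames: (n_frames, n_bands) matrix of ASE values.
        log_ase = 10 * np.log10(ase_frames + 1e-12)      # step 3: log power
        rms = np.sqrt(np.mean(log_ase ** 2, axis=1, keepdims=True))
        nase = log_ase / (rms + 1e-12)                   # step 3: normalise
        centred = nase - nase.mean(axis=0)
        _, _, vt = np.linalg.svd(centred, full_matrices=False)
        basis = vt[:n_basis].T                           # step 4: basis functions
        return nase, nase @ basis                        # step 5: projection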


Figure 3.6: Architecture of spectrum basis projection (Kim et al., 2004)

3.1.5 Audio Spectrum Centroid (ASC)

The Audio Spectrum Centroid (ASC) is the centre of gravity of a log-frequency power spectrum. Unlike the previous MPEG–7 low-level descriptors, the ASC is a scalar type and provides a high level of dimensional reduction at the cost of high information loss. The ASC provides information on the shape of the power spectrum and indicates whether a power spectrum is dominated by low or high frequencies. It can be regarded as an approximation of the perceptual sharpness of the signal, indicating where the centre of mass of the spectrum lies. Perceptually, it has a robust connection with the impression of brightness⁵ of a sound (Schubert et al., 2004). Seo et al. (2005) applied the ASC to audio fingerprinting. An audio fingerprint is applied to recognising audio in the same way a human fingerprint is applied to identifying an individual. Fingerprints are perceptual features (short summaries) of a multimedia object,

⁵ The brightness of a sound is indicated by the amount of high-frequency content.


and can be useful in search/retrieval queries and copyright detection, as shown in Figure 3.7.

Figure 3.7: Architecture of fingerprinting application

By converting the audio signal to mono, down-sampling it and then transforming it to the frequency domain with an FFT, Seo et al. (2005) were able to create a reliable fingerprint matching system. The audio spectrum obtained was divided into 16 critical bands, and the normalised frequency centroid for each band was calculated; these centroids acted as fingerprints of the audio frame. Seo et al. (2005) showed this approach to be robust to 'quality preserving' signal processing steps, and that it out-performed other commonly used features, such as tonality and MFCCs, in the context of audio fingerprinting.
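
A sketch of the per-band centroid fingerprint follows; equal-width bands stand in for the 16 critical bands used by Seo et al. (2005):

    import numpy as np

    def band_centroid_fingerprint(power, freqs, n_bands=16):
        # Normalised frequency centroid of each band; the vector of
        # centroids is the fingerprint of the frame.
        fp = []
        for p, f in zip(np.array_split(power, n_bands),
                        np.array_split(freqs, n_bands)):
            centroid = np.sum(f * p) / (np.sum(p) + 1e-12)
            fp.append((centroid - f[0]) / (f[-1] - f[0] + 1e-12))
        return np.array(fp)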


3.2 Pattern classification and matching

There are distinct differences between the definitions of pattern classification and pattern matching. Pattern classification aims to classify data based on either statistical information extracted from the patterns or deductive reasoning based on the pattern contents. The patterns to be classified are usually groups of measurements or observations defining points in an appropriate multidimensional space. This is in contrast to pattern matching, where the pattern to be identified is rigidly specified. Pattern matching can become almost byzantine when fixed templates generate multiple variations. For example, in English, sentences often follow the N-VP (Noun - Verb Phrase) pattern, but some knowledge of the English language is required to detect the pattern (Jurafsky & Martin, 2000).

Regardless of whether pattern classification or matching is used, there are three elementary steps to be performed prior to matching/classification: sensing, segmentation/grouping and feature extraction (Duda et al., 2000).

3.2.1 Sensing

The input to a pattern recognition application can take many forms: images, speech/sound and text. The difficulty lies in the limitations of the input, such as image resolution, the volume of data in a sound file, or signal distortion.

3.2.2 Segmentation and grouping

Segmentation is one of the most difficult problems in pattern recognition. For example, in speech recognition an application may need to recognise individual phonemes and combine these to form a word. Consider the words "sheep" and "shop": the speaker will use different lip positions to pronounce the "sh" of each word. When saying "sheep" the speaker will have their lips tight to the face, but when saying "shop" the lips will be in a rounder form in preparation for the "op" portion. This is referred to as anticipatory coarticulation, commonly known as rounding (Bilmes & Bartels, 2005), and it lowers the spectrum of the "sh" when compared to the word "sheep". Another example of the segmentation problem is in the area of image recognition. Lamdan et al. (1988) present the problem of objects in a scene that may be overlapping and partially occluded, as shown in Figure


3.8. Segmentation of the square (a) and the circle (b) proves to be difficult when they are overlapped (as in c).


Figure 3.8: Example composite scene

3.2.3 Feature extraction

The border between feature extraction and classification is difficult to specify. A feature extractor that gave an ideal representation of a subject would make classification almost unnecessary; conversely, an all-powerful classifier would make a feature extractor redundant. The role of a feature extractor is to characterise an object to be recognised by measurements whose values are very similar for objects in the same category and very different for objects in a different category, i.e., ideally to extract distinguishing features. The choice of distinguishing features is a critical step and depends on the characteristics of the problem domain. Prior knowledge can be invaluable when choosing features, though example data for training sets can be equally if not more valuable, depending on the classification method used (Duda et al., 2000). Until recently, the most common features extracted for audio processing were the Mel-Frequency Cepstral Coefficients (MFCCs), more specifically for speaker recognition, sound classification, and segmentation of audio using sound/speaker identification (Kim et al., 2004).

3.2.4 Classification

A typical pattern recognition system consists of a number of modules, but must contain a sensor that gathers the observations to be classified; a feature extraction mechanism by which numeric or symbolic information is


extracted from the observations; and a description or classification function that performs classification or observation description based on the extracted features.

The classification or description scheme is usually based on the availability of a set of patterns that have already been classified or described. This set of patterns is termed the training set, and the resulting learning strategy is characterised as supervised learning. Learning can also be unsupervised, in the sense that the system is not given an a priori labelling of patterns; instead it establishes the classes itself based on the statistical regularities of the patterns.

The classification or description scheme typically uses either a statistical or a syntactic approach. Statistical pattern recognition is based on statistical characterisations of patterns, assuming that the patterns are generated by a probabilistic system. Syntactic pattern recognition is based on the structural interrelationships of features. A wide range of algorithms can be applied to pattern recognition, from very simple Bayesian classifiers to much more powerful neural networks. Typical applications include automatic speech recognition, classification of text (e.g., spam email), automatic recognition of handwriting, and face recognition.

3.2.5 Pattern matching

Pattern matching is the act of looking for the existence of the components of a given pattern. In contrast to pattern recognition/classification, the pattern is rigidly specified. A specified pattern can be seen as having either sequences or tree structures contained within it. Pattern matching tests whether the relevant structure exists in the data, retrieves the matching parts, and can substitute the matching part with something else. A common application of pattern matching is with text/string patterns, where queries are often posed with regular expressions and matched with respective algorithms. Sequences can also be seen as trees branching for each element into the respective element and the rest of the sequence, or as trees that immediately branch into all elements. Pattern matching is of most benefit when the underlying data structures are as simple and flexible as possible.


3.3 Mel-Frequency Cepstral Coefficients (MFCCs)

Mel-Frequency Cepstral Coefficients (MFCCs) (Stevens et al., 1937) are derived from a cepstral representation of an audio clip. The difference between the cepstrum and the Mel-Frequency Cepstrum (MFC) is that the MFC frequency bands are equally spaced on the mel scale; this approach approximates the human auditory system's response more accurately than the linearly spaced frequency bands of the normal cepstrum. The cepstrum can be seen as information about the rate of change in different spectrum bands. MFCCs often feature in speech recognition systems, a common example being the automatic identification of digits spoken into a telephone. They are also a popular choice in the area of speaker recognition, where the aim is to recognise people solely from their voices. MFCCs are increasingly finding application in Music Information Retrieval (MIR) tasks such as genre classification (Tzanetakis & Cook, 2002) and audio similarity measures (Logan & Salomon, 2001). However, as the MFCC is cepstrum based, it is most successful in voice recognition (Logan, 2000). Voice recognition is divided into two classifications, voice recognition and voice identification, and is the method of automatically identifying who is speaking on the basis of individual information embedded in speech waves. Voice recognition is widely applicable in using a speaker's voice to verify their identity, in a similar way to how a fingerprint can identify a person, thereby controlling access to restricted services such as voice mail, security control for protected information areas, database access services, banking by telephone, remote access to computers and other information services.
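
A minimal per-frame MFCC computation can be sketched as: power spectrum, triangular mel filterbank, log, then a DCT-II. The filter and coefficient counts below are common defaults, not values prescribed by any particular system:

    import numpy as np

    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    def mfcc_frame(frame, rate, n_filters=26, n_coeffs=13):
        # Power spectrum of one windowed frame.
        spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
        freqs = np.fft.rfftfreq(len(frame), d=1.0 / rate)
        # Triangular filters whose centres are equally spaced on the mel scale.
        pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(rate / 2.0),
                                    n_filters + 2))
        energies = np.empty(n_filters)
        for i in range(n_filters):
            lo, mid, hi = pts[i], pts[i + 1], pts[i + 2]
            rise = np.clip((freqs - lo) / (mid - lo), 0.0, 1.0)
            fall = np.clip((hi - freqs) / (hi - mid), 0.0, 1.0)
            energies[i] = np.sum(spectrum * np.minimum(rise, fall))
        log_e = np.log(energies + 1e-12)
        # DCT-II of the log filterbank energies gives the cepstral coefficients.
        n = np.arange(n_filters)
        basis = np.cos(np.pi * np.outer(np.arange(n_coeffs), n + 0.5) / n_filters)
        return basis @ log_e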

3.4 Clustering

Pattern classifiers typically fall into one of two categories: supervised or unsupervised. A supervised classifier predicts the value of a function for any valid input object after having seen a number of training examples. To achieve this, the classifier has to generalise from the presented data to unseen situations in a reasonable way. Supervised classifiers typically feature in the following areas:

Artificial neural networks: an abstract simulation of a real nervous system that contains a collection of neuron units communicating with each other via axon connections. Such a model bears a strong resemblance to axons and dendrites


in a nervous system. Rabiner & Juang (1993), Van Rijsbergen (1979), and Frakes & Baeza-Yates (1992) have derived various techniques within the field of information retrieval for pattern recognition.

Bayesian statistics: the main distinguishing feature of a Bayesian approach is that it makes use of more information than the non-Bayesian approaches. Whereas the latter are based on analysis of hard data that is well-structured and well-defined, Bayesian statistics accommodates prior information, which is usually less well specified and can even be subjective. Abdallah et al. (2005) report work on music structure extraction with Bayesian probability methods.

Nearest neighbour algorithm: allows a small number of neighbours to influence the classification of an individual value, and is a non-parametric classifier. It has also been shown that the error rate of the nearest neighbour algorithm is at most twice as large as the best possible Bayesian error rate (Tzanetakis et al., 2003). Chuan & Chew (2004) have successfully implemented a nearest-neighbour algorithm to determine the key of a polyphonic music piece.

Gaussian mixture models: model the distribution of feature vectors. Gaussian mixture models feature widely in the Music Information Retrieval (MIR) community, notably to build timbre models, as reported in Tzanetakis & Cook (2002). In Burred & Lerch (2003) a tree-like structure of Gaussian mixture models models the underlying genre taxonomy: a divide-and-conquer strategy first classifies items on a coarse level and then on successively finer levels.

The k-nearest neighbour (KNN) algorithm is a supervised learning algorithm in which a new instance query is classified based on the majority of its k nearest neighbours. The purpose of the algorithm is to classify a new object based on attributes and training samples. The classifiers do not fit any model and are based only on memory. Given a query point, the k objects (or training points) closest to the query point are found, grouping objects such that the objects in a group are similar to one another and different from the objects in other groups, as shown in Figure 3.9.

Rough sets: other supervised classifiers include decision trees and rough sets, which are a popular alternative to the knowledge based classifiers mentioned previously. The purpose of a decision tree is to map an item unambiguously to a category through the use of branches and leaf nodes originating from a 'root'. Where decision trees have the limitation that an item can only be contained within one category (leaf), rough sets loosen the coupling of objects to categories, allowing an item to belong to more than one set at any one time.


Figure 3.9: Example k-means clustering distance

However, as with the previous classifiers (ANNs, KNNs and Bayesian statistics), prior knowledge of the patterns to be classified needs to be defined, usually with the aid of a training set. Investigations into the possible identification and classification of similar patterns within the MPEG–7 descriptions of audio were performed using two of the more popular mainstream research tools: the Rough Set Exploration System (RSES) (RES2.2, 2008) and ROSETTA (ROSETTA, 2009), both of which allow the automatic and manual creation of rules and reduction sets (also known as 'reducts'). Using a minimal MPEG–7 audio dataset as input resulted in no automatic generation of rules to aid in the creation of reducts, and manually creating rules based on the varying octaves between the MPEG–7 'lo-edge' and 'hi-edge' produced reduced sets with little or no discernibility between them. Since no known pattern can be defined within a single song, and every song is unique (even a cover version of an original song, since vocal frequencies differ and even the tempo can change), no prior knowledge of a specific pattern can be manually or automatically identified or generated. Rough sets may prove more useful at a higher level of abstraction, e.g. when searching for verse/chorus sections or defining the genre based on 'beats per minute'; in these situations a 'set' of rules can be generated based on known patterns such as the tempo of the bass frequencies or the frequency of changes in the audio signal.


3.4.1 Unsupervised classifiers

The debate as to whether or not it is possible in principle to learn anything from unlabelled data depends heavily on the assumptions of the user. Unsupervised classifiers assume that there are no a priori rules/knowledge of the dataset to be analysed. However, Duda et al. (2000) give five reasons why unsupervised learning is of benefit:

1. Collecting and labelling a large data set is costly.

2. A person may wish to reverse direction, i.e., train a classifier with a large (inexpensive) volume of unlabelled data and then use supervision to classify the groupings found.

3. Characteristics of data may change. If the changes can be tracked by a classifier in unsupervised mode then performance is improved.

4. Unsupervised methods can help find features that can then facilitate categorisation.

5. In the early stages of an investigation it can be valuable to perform exploratory analysis in order to gain an insight into the structure/nature of the dataset.

The unsupervised k-means clustering method has been shown to be effective in producing reliable clustering results for many practical applications and is commonly applied to music (Courses & Surveys, 2007; Kamata & Furukawa, 2007; Peeters et al., 2002; Berenzweig et al., 2004; Logan & Salomon, 2001).

The basic steps of k-means clustering are straightforward. First, determine the number of clusters k and assume the centroid or centre of these clusters. Any random objects can be taken as the initial centroids, or the first k objects in sequence can be defined as the initial centroids.

The k-means algorithm then iterates the following three steps until stable, i.e., until no data moves between groups:

1. Determine the centroid coordinate.

2. Determine the distance of each object to the centroids.

3. Group the object based on minimum distance.


Figure 3.10: k-means clustering algorithm flowchart

The steps above can be seen in Figure 3.10.

A more detailed description of the steps shown in Figure 3.10 is as follows (a code sketch follows the list):

1. Begin with a decision on the value of k = number of clusters.

2. Choose any initial partition that classifies the data into k clusters. Centroids can be assigned randomly, or systematically as the first k data points, with each of the remaining (N-k) data points assigned to the cluster with the nearest centroid.

3. Take each sample in sequence and calculate the distance from the sample to the centroid of each of the clusters. If a sample is not currently in the cluster with the closest centroid, move it to that cluster and recalculate the centroids of both the cluster gaining the sample and the cluster losing it.

4. Repeat step 3 until no new assignments are made.
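
A minimal sketch of the loop is given below; random selection of initial centroids is one of the options described in step 2, and Euclidean distance is assumed:

    import numpy as np

    def kmeans(data, k, max_iter=100, seed=0):
        # data: (n_samples, n_features) array.
        rng = np.random.default_rng(seed)
        centroids = data[rng.choice(len(data), k, replace=False)].astype(float)
        for _ in range(max_iter):
            # Steps 2-3: assign each sample to its nearest centroid.
            dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
            assignment = dists.argmin(axis=1)
            # Recalculate each centroid as the mean of its members.
            updated = np.array([data[assignment == j].mean(axis=0)
                                if np.any(assignment == j) else centroids[j]
                                for j in range(k)])
            if np.allclose(updated, centroids):
                break                        # step 4: no assignments changed
            centroids = updated
        return centroids, assignment

Note that a different seed (initial condition) may yield a different clustering, one of the weaknesses discussed below.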


Since the location of each centroid is initialised to a default value at the start, it needs to be adjusted based on the current data. All the data is then assigned to the new centroids. This process is repeated until no data moves to another cluster. The loop can be proved to be convergent; convergence will always occur if the following conditions are satisfied:

1. For each switch in step 3, the sum of distances from each object to that object's group centroid is decreased.

2. There are only finitely many partitions of the objects into k clusters.

One of the advantages of using k-means as a similarity metric is its computational speed in comparison to other approaches. Tao et al. (2004) present a query-by-singing musical retrieval system that utilises k-means clustering. To improve system efficiency, they reorganised their database with a two-stage clustering scheme in both time space and feature space using a k-means algorithm, and reported an increase in accuracy of over 30 percent with a speed-up of more than 16 times in average query time.

Similar to most other algorithms, k-means clustering has a number of weaknesses:

• If the volume of data is small, the initial grouping will determine the clusters significantly.

• The number of clusters, k, must be determined beforehand.

• There is never one 'real' clustering, i.e., using the same data, if the data is input in a different order it may produce a different clustering, depending on the volume of data.

• k-means is sensitive to initial conditions: a different initial condition may produce different results.

• It is impossible to know which attribute contributes more to the grouping process, since it is assumed that each attribute has the same weight.

• Data with extreme distances from the centroid may pull the centroid away from the actual one.

One way to overcome these weaknesses is to use the median instead of the mean, also known as k-median clustering. However, this leads to new complications,


and can often lead to a degradation in performance, as shown by Steinbach et al. (2000). The choice is entirely dependent on the nature and volume of the data being used.

k-means suffers the same problem as almost all learning algorithms - despite being an unsupervised technique, it is very difficult to define the term cluster. What properties/attributes define the data as similar/dissimilar? How can these differences be measured? Within the area of Music Information Retrieval (MIR) this problem is partially overcome by the fact that the data sets are usually numerical, making the measurement of similarity easier. The most common approach to similarity (dissimilarity) is the distance method. If distance is a good measure of dissimilarity then it is reasonable to assume that the distance between samples in the same cluster will be considerably smaller than the distance between samples in different clusters. The type of measurement used is again dependent on the nature of the data. Euclidean distance is the most commonly used distance; it examines the root of the squared differences between the coordinates of a pair of objects, as shown in Equation 3.2:

d_ij = √( Σ_{k=1}^{x} (x_ik − x_jk)² )    (3.2)

In its simplest form this can be seen in the following example:

            Attribute
        1    2    3    4
   A    2    5    6    7
   B    9    8    5    3

Table 3.1: Example Euclidean distance

From the sample data in Table 3.1, point A has values (2, 5, 6, 7) and point B has values (9, 8, 5, 3).

The Euclidean distance between points A and B can be calculated as:

d_AB = √((2 − 9)² + (5 − 8)² + (6 − 5)² + (7 − 3)²)
     = √(49 + 9 + 1 + 16)
     = √75
     = 8.66


Other measures of distance for quantitative data are as follows:

• Hamming distance: the number of bits which differ between two binary strings. The Hamming distance can be interpreted as the number of bits which need to be changed to turn one string into the other.

• Minkowski distance: the distance between two points in a Euclidean space with a fixed two-dimensional coordinate system is the sum of the lengths of the projections of the line segment between the points onto the coordinate axes.

• City block (Manhattan) distance: a special case of the Minkowski distance where the distance between two points is measured along axes at right angles.

• Cosine distance: defined as one minus the cosine of the included angle between the vectors. Holzapfel & Stylianou (2008) used this method of distance measurement successfully to determine the rhythmic similarity of music pieces with a k-nearest neighbour supervised clustering approach.

• Correlation distance: one minus the sample correlation between points treated as sequences of values. Foote (1997) points out that the correlation distance is similar to the cosine distance if the correlation is normalised.

The choice of distance metric is entirely dependent on the nature of the data and what measure of similarity is required. Results vary (Ellis et al., 2002; Berenzweig et al., 2004) and it is subjective at best as to which measurement to use. However, the Euclidean distance measure is one of the most popular metrics used within the field of Music Information Retrieval (MIR).
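
For concreteness, the measures listed above reduce to a few lines each; these are straightforward textbook definitions rather than tuned library code:

    import numpy as np

    def euclidean(a, b):
        return np.sqrt(np.sum((a - b) ** 2))

    def manhattan(a, b):                      # city block
        return np.sum(np.abs(a - b))

    def minkowski(a, b, p):                   # p=1 Manhattan, p=2 Euclidean
        return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

    def cosine_distance(a, b):                # one minus cosine of the angle
        return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def correlation_distance(a, b):           # one minus the sample correlation
        return 1.0 - np.corrcoef(a, b)[0, 1]

    def hamming(a, b):                        # differing positions; equal lengths
        return sum(x != y for x, y in zip(a, b))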

3.4.2 Cluster numbers

The problem of defining k clusters as a starting point has existed since the conception of clustering algorithms for the study of statistical data (MacQueen, 1966), with investigations into the optimal starting k ranging from Forgy (1965) and Jain et al. (1999) to Cheung (2003), Salvador & Chan (2004), Tseng & Wong (2005) and Chiang & Mirkin (2007). The value of k is not known a priori and it can be concluded that there is no definitive value of k. Commonly a heuristic approach is used; for example, typical incremental clustering gradually increases the number of k clusters under the control of a threshold


value. Forgy (1965) concludes that the initial number of clusters should be chosen in close interaction with theory and intuition. Consequently, a computer program prepared for this purpose can involve several modifications of the k-means cluster numbers prior to being initialised.

The number of clusters k that the data is to be grouped into can have a large influence on the classification result. The right number of clusters is not obvious, and choosing k automatically is a hard algorithmic problem (Khan & Ahmad, 2004). Hamerly & Elkan (2003) suggested using a Gaussian⁶ approach to determine k, provided the data follows a Gaussian distribution.

Several algorithms have been proposed to determine k automatically. Most methods are wrappers around k-means or some other type of clustering algorithm for fixed k. Wrapper methods use splitting and/or merging rules for centres to increase or decrease k as the algorithm proceeds.

Pelleg & Moore (2000) have proposed a 'smoothing' framework for learning k, which they call x-means. The algorithm searches over many values of k and scores each clustering model using a Bayesian Information Criterion (BIC) (Kass & Wasserman, 1995); x-means chooses the model with the best BIC score on the data. Aside from the BIC, other scoring functions are also available. One common approach, suggested by Bischof et al. (1999), uses a minimum description length (MDL) framework: how well the data fits the model is determined by a measure of the description length in relation to the model. The algorithm starts with a large value for k and reduces k when a reduction in the description length can be made. Before reducing k, the k-means algorithm is applied to the clusters in order to re-optimise the model to fit the data.

3.5 String matching algorithms

String matching algorithms are a basic component used in the practical implementation of software ranging from operating systems to the search tools on web sites, e.g., Google, Yahoo. The algorithms underlying string matching are not new, but they have been refined over time to provide more efficient algorithms and algorithms better suited to particular purposes. For example, interest in string searching has increased dramatically within the fields of information retrieval and computational biology, owing to the dramatic increase in text/database sizes needing management.

⁶ Gaussian functions are widely used in statistics where they describe normal distributions.


A string can be defined as a sequence of characters over a finite alphabet Σ. The principal objective of string matching is to find all instances of a string p in a large string T of the same alphabet. There are three common approaches to the problem (Navarro & Raffinot, 2002):

1. Read all the characters in the text one after the other and at each step update some variable so as to identify a possible occurrence. This is commonly referred to as the brute force algorithm.

2. Use a sliding window along the text T and within the window search backwards for an occurrence that matches p. The Boyer-Moore algorithm (Boyer & Moore, 1977) takes this approach.

3. The Backward DAWG Matching (BDM) and Backward Non-Deterministic DAWG Matching (BNDM) algorithms by Navarro et al. (1998) are similar to the second approach but more efficient, as they also search for the longest suffix of the window that is also a factor of p.

The following sections discuss the most common search algorithms for exact string matching.

3.5.1 Brute force

The brute force algorithm consists of checking, at all positions in the text between 0 and n-m, whether an occurrence of the pattern starts there or not. Then, after each attempt, it shifts the pattern by exactly one position to the right. The brute force algorithm requires no pre-processing phase, and only a constant amount of extra memory in addition to the pattern and the text. During the searching phase the text character comparisons can be done in any order. The time complexity of this searching phase is O(mn), with an expected number of 2n character comparisons.
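
In code, the brute force search is exactly this check-and-shift loop:

    def brute_force_search(text, pattern):
        # Check every alignment 0..n-m, shifting right by one each time.
        n, m = len(text), len(pattern)
        return [i for i in range(n - m + 1) if text[i:i + m] == pattern]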

3.5.2 Knuth-Morris-Pratt (KMP)

The Knuth-Morris-Pratt (KMP) algorithm (Charras & Lecroq, 2004) turns the search string into a finite state machine and then runs the machine with the string to be searched as its input. Execution time is O(m+n), where m is the length of the search string and n is the length of the string to be searched. The KMP algorithm uses information about the characters in the string being

Page 81: Song Form Intelligence for Repairing Streaming Music ...paulmckevitt.com/phd/dohertythesis.pdfD Audio sample music test data 148 E A similarity comparison of MPEG–7 ASE 150 F A representation

66

searched to determine how far to move along that string after a mismatch occurs. For example, given the strings s1 = 'aaaabaaaabaaaaab' and s2 = 'aaaaa', on the fifth comparison the 'a' of s2 does not match the 'b' of s1, whereas the brute force algorithm would simply move on to the next character, with the mismatch against the 'b' in s1 found again on the fourth comparison. This is avoided in the KMP algorithm by moving to position i + 1 in s1 and beginning the comparison again, as seen in Figure 3.11(c).


Figure 3.11: Example string matching comparison
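
A minimal sketch of KMP follows; the failure table plays the role of the finite state machine, recording how much of the pattern can be re-used after a mismatch:

    def kmp_search(text, pattern):
        # Failure table: length of the longest proper prefix of the pattern
        # that is also a suffix of pattern[:i+1].
        fail = [0] * len(pattern)
        k = 0
        for i in range(1, len(pattern)):
            while k > 0 and pattern[i] != pattern[k]:
                k = fail[k - 1]
            if pattern[i] == pattern[k]:
                k += 1
            fail[i] = k
        # Scan the text once, never moving backwards in it.
        matches, k = [], 0
        for i, ch in enumerate(text):
            while k > 0 and ch != pattern[k]:
                k = fail[k - 1]
            if ch == pattern[k]:
                k += 1
            if k == len(pattern):
                matches.append(i - k + 1)
                k = fail[k - 1]
        return matches

    print(kmp_search('aaaabaaaabaaaaab', 'aaaaa'))  # -> [10]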

3.5.3 Boyer-Moore

The fastest known exact string matching algorithms are based on the Boyer-Moore algorithm (Boyer & Moore, 1977). Such algorithms are, on average, sublinear, in the sense that it is not necessary to check every symbol in the text, as was the case with KMP in the previous section. The larger the alphabet and the longer the pattern, the faster the algorithm works. It compares characters from right to left, starting with the last character in the search pattern. The speed of the Boyer-Moore algorithm is attributed to how well it deals with characters that do not match: when a mismatch is detected, the algorithm checks whether the non-matching character is in the search pattern. If it is not in the pattern, then the pattern is shifted over its entire length.
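
The bad-character shift that gives this family its speed can be sketched with Horspool's simplification of Boyer-Moore (the full algorithm adds a second, good-suffix rule, which is omitted here):

    def horspool_search(text, pattern):
        # Bad-character table: distance from each character's last
        # occurrence in the pattern (excluding the final position) to the end.
        m = len(pattern)
        shift = {ch: m - i - 1 for i, ch in enumerate(pattern[:-1])}
        matches, i = [], 0
        while i <= len(text) - m:
            # Compare right-to-left, starting with the last pattern character.
            j = m - 1
            while j >= 0 and text[i + j] == pattern[j]:
                j -= 1
            if j < 0:
                matches.append(i)
            # Shift by the full pattern length if the window's last
            # character does not occur elsewhere in the pattern.
            i += shift.get(text[i + m - 1], m)
        return matches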


3.5.4 Regular expressions

Regular expressions are incorporated into numerous applications, including programming languages, text editors and other utilities, with the aim of aiding the searching and manipulation of text based on patterns. For example, the programming languages Tcl, Perl and Ruby have an efficient regular expression engine included directly in their syntax, and regular expressions feature in Unix-type systems in the command grep⁷. As an example of the syntax of a grep query, the regular expression \bet can be used to search for all instances of the string et that occur after word boundaries (signified by the \b). Therefore, in the string "Better than Eternity", \bet matches the Et in Eternity

but not in Better, because the et occurs inside a word and not immediately after a word boundary. One of the main differences between regular expressions and other exact string matching algorithms is the use of wildcards. These allow a fixed or unknown number of unknown characters to be ignored. For example, in the string To say that no-one is better than

Eternity in this genre is foolish, the regular expression \*ter will return better and Eternity, as the characters preceding the ter are ignored. A vast library of functions gives regular expressions extensive flexibility, and it is for this reason that they are also popular within text editor applications.

The list of string matching algorithms is endless. The Karp-Rabin algorithm, Shift-Or algorithm, Simon algorithm, Colussi algorithm, Forward-DAWG Matching algorithm and the Horspool algorithm are but a few (Charras & Lecroq, 2004). Each algorithm is best suited to a particular purpose: an algorithm may perform faster depending on the nature of the match, the data being used, the size of the query, or the volume of data to be queried. Charras & Lecroq (2004) provide a detailed description and implementation of these algorithms together with their specific purposes, advantages and disadvantages.

3.5.5 Approximate string matching

The scope of string matching (Boyer & Moore, 1977) is rich in problems with substantial mathematical and algorithmic structure. Quite often, these problems are well-motivated from an application standpoint. In standard string matching the problems typically involve finding all occurrences of a pattern string of size m in a larger text string of size n. In these problems, a text location matches a location in the pattern provided the associated symbols are identical.

⁷ Although grep is not an acronym, its name is taken from global / regular expression / print.


Approximate string matching differs in that it is not always possible to find an exact match to the query string. One of the best studied cases of this problem is the so-called edit distance, which allows the deletion, insertion and substitution of single characters in both strings. The edit distance has received much attention because its generalised version is powerful enough for a wide range of applications.

The edit distance D(A,B) between strings A = a1...am and B = b1...bn, where A,B ∈ Σ* (Σ* denotes the set of all sequences over Σ), is the minimum number of editing operations, including insertions and deletions of characters within the search string, required to transform string A into string B (Crochemore et al., 1994).
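
A minimal sketch of the standard dynamic-programming computation of D(A,B), using unit costs for insertion, deletion and substitution:

    def edit_distance(a, b):
        # dist[i][j] is the minimum number of insertions, deletions and
        # substitutions turning a[:i] into b[:j]; only the previous row
        # is kept in memory.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                cost = 0 if ca == cb else 1
                curr.append(min(prev[j] + 1,         # delete from a
                                curr[j - 1] + 1,     # insert into a
                                prev[j - 1] + cost)) # substitute (or keep)
            prev = curr
        return prev[-1]

    print(edit_distance("pitch", "patch"))  # -> 1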

A popular approach within the area of Music Information Retrieval (MIR) is to use edit distance as a measure of similarity. Work by Hu & Dannenberg (2002) investigating several variations of search algorithms was aimed towards improving search precision. Their aim was not to find the best matching algorithm for searching, but rather to find the best form of music representation, by applying edit distance to similarity retrieval based on symbolic notes, pitch, loudness, and tempo, and combinations of these together with different windowing approaches.

A popular choice of music representation to aid string searches is a monophonic format. Monophonic music can be represented by a one-dimensional string of characters, where each character describes one note or one pair of consecutive notes. Strings can represent interval sequences or sequences of pitches. Lemstrom & Ukkonen (2000) applied an edit distance measure to music comparison and retrieval with a simplified representation of monophonic music that gave only the pitch levels of the notes and ignored note duration.

Hamming distance

Hamming distance is a special case of the edit distance similarity metric. Whereas edit distance enables the insertion and deletion of characters, Hamming distance can only compare strings of equal length, and returns only the number of positions at which the characters differ, i.e., the number of errors that transformed one string into the other, not the total difference in value. For example, an edit distance measurement will return the value of 4 when measuring the distance between strings A and B, where A = 5 and B = 1. However, a Hamming distance measurement will return a value of 1, as the size of the difference between the values is not taken into account, only the number of characters that vary. A common


example of Hamming distance is given through the use of a cube. Figure 3.12 shows the Hamming distance where k=3. Following the red path we see that 010→111 has a distance of 2, and following the blue path shows that 100→011 has a distance of 3. Note that 000 (0 in binary) and 100 (4 in binary) are 4 units apart numerically, but in Hamming space they are only one unit apart and very close.

Figure 3.12: Example of Hamming distance (Cederberg, 2001)

3.6 Summary

This chapter discussed MPEG–7 as a feature extraction tool for a variety of representational purposes. MPEG–7 formats including the Audio Spectrum Flatness (ASF), Audio Spectrum Centroid (ASC) and Audio Spectrum Envelope (ASE) were discussed. This was followed by clustering and classification of datasets and their differences. Pattern matching was presented along with common implementations. Varying clustering techniques for identifying similarity were discussed, including supervised and unsupervised approaches. A detailed discussion of k-means clustering, along with varying distance measurement techniques, was given. The chapter concluded with a discussion of string matching and distance measurement.


CHAPTER FOUR

Similarity and Classification of Music Features

Internet communication is not a perfect medium for time-dependent data. 'Live' audio and video streams with errors are likely to annoy a listener: packet drop-outs leave large, noticeable gaps in the stream that draw the listener's attention to the fault. Forward Error Correction (FEC) is an area that addresses this issue, with the onus of repair placed as much as possible on the listener's device. The following sections in this chapter outline the use of MPEG–7 and k-means clustering as a means of identifying self-similarity within a song, and the use of a best-possible-match approach to determine previous sections of the song as a replacement for the live stream, thereby minimising the impact on the listener. By applying a k-means clustering algorithm to the extracted features, a classified set of features is presented. Finally, the use of string matching as a measure of distance between large sections of clustered audio concludes this chapter.

4.1 Visualising structure and repetition in music

Pre-processing a song and storing the results for future on-the-spot repair is a 'run once, use many times' approach to improving a listener's experience when less than optimum conditions occur. As previously discussed in Sections 2.12 and 2.12.1, Linear Predictive Coding and Linear Interpolation have been popular choices for attempting Forward Error Correction (FEC), but limitations on signal analysis can give a poor overall listening experience. The use of MPEG–7 as an initial representation of an audio signal greatly reduces the preprocessing required whilst retaining as much of the relevant information as possible.


Many applications use MPEG–7 as a metadata description tool to form a database of standardised audio signal content, with emphasis on the low-level physical content, primarily to supplement the metadata already stored. Combining the MPEG–7 metadata with the mp3 metadata can widen the search criteria for current and future applications. This content can facilitate general search queries but can also be useful as a data reduction method for similarity analysis.

The core MPEG–7 representation, the Audio Spectrum Envelope (ASE), is used here as the foundation for similarity analysis. Figure 4.1 shows a 7 s. music sample plotted from the ASE. The original sound file is a digital reproduction of an acoustic guitar with a repeating pattern of notes in the form of the well-known 12 Bar Blues. By using a digitally created piece of music instead of recording a real musician, ambiguities such as timing variations are not introduced. To the human eye the repetition within the audio can be clearly seen, but what is of interest is how the full spectrum is represented. Where an audio waveform only gives the peak height of the power of the frequency, the ASE gives an equal representation across the full spectrum, and repeating patterns become more visible.

Using the correct features and levels of detail is crucial for the classification of sections of audio, since small changes to settings can produce very different results. A classification algorithm will always provide some form of classification of the data, but a poor choice of features will result in a classification that does not reflect the true nature of the underlying data: for example, using the MPEG–7 Spectral Centroid instead of the Temporal Centroid when determining the temporal properties of audio. The choice of features needs to fulfil two basic criteria:

1. Data that can be perceived as being similar must map to nearby points in feature space, and decision boundaries that define different regions should be spaced as far apart as possible. Deshpande et al. (2001) refer to this as either small intra-category scatter or high inter-category scatter.

2. Ideally, the features extracted from the data should retain as much important8 information as possible. Removal of important/pertinent data will have a strong influence on what transforms can then be performed during the feature extraction stage. Ideally, the transform should be reversible to

8Defining what data is important is dependent upon the nature of the classification.


Figure 4.1: Example MPEG–7 Audio Spectrum Envelope Representation


allow the original data to be reconstructed from the transformed data, but this is rarely the case.

The purpose of MPEG–7 feature extraction is to obtain a low-complexity description of the content of the audio source. A balanced trade-off between reducing the dimensionality of the data and retaining maximum information content must be achieved. Too many dimensions cause problems with classification, whilst dimensionality reduction invariably introduces information loss. Our goal is thus to maximize the amount of information per dimension whilst maintaining a minimal sized dataset.

As with all forms of analysis/data reduction performed on time-dependent data, the problem of generalisation occurs. Figure 4.2 shows this in more detail, where the same note is repeated but the extracted signal value for the repeated note varies slightly from the original. This is due to the windowing inherent in DSP, where signals are sampled at discrete time points, and some deviation is expected to occur.

The differences shown are for one frequency band within the ASE, but this effect occurs within all frequency bands of the sample, as shown in Table 4.1, regardless of the hopsize. Although the differences between notes A and B appear to be small, it must be noted that they are represented on a log frequency scale, where small differences can represent very different audio signals.

Frequency Band       Note A           Note B           Difference
(Lo-edge)    1       0.00113820000    0.00160140000    -0.0004632000
             2       0.00667040000    0.00878590000    -0.0021155000
             3       0.16378000000    0.15145000000     0.0123300000
             4       0.03144300000    0.02257200000     0.0088710000
             5       0.02642900000    0.03337200000    -0.0069430000
             6       0.00675590000    0.00690820000    -0.0001523000
             7       0.00550100000    0.00542540000     0.0000756000
             8       0.00080775000    0.00073329000     0.0000744600
             9       0.00005295100    0.00004947100     0.0000034800
(High-edge) 10       0.00000039929    0.00000013699     0.0000002623

Table 4.1: ASE Sample Differences

Figure 4.3 shows the difference in sample rates, i.e., the hopsize. Shown in Figure 4.3(a) is a 5 ms. hop that gives a total of 75,850 samples (in a 7585*10 array). Figure 4.3(b) shows the same sample with a 10 ms. hop size, with a total of 37,930 samples (in a 3793*10 array), and in Figure 4.3(c)


Figure 4.2: Example audio spectrum envelope differences


the same audio sampled at 30 ms. contains 12,650 samples (in a 1265*10 array).

Figure 4.3: Example MPEG–7 audio spectrum envelope 5 ms., 10 ms. and 30ms.

It should be noted that although the repetition is still obvious, the level of detail has been greatly reduced, as can be seen in Figure 4.4. Looking at Figure 4.4(a), clear reverberations of a note, and between the two separate notes, can be seen. Even with a 10 ms. hopsize the reverberations can be clearly seen in 4.4(b). However, using a sample rate of 30 ms. hops, the loss of information is sufficient to give only a rough outline of a single note and nothing more. This shows how the number of samples has a direct effect on the amount of information lost. As noted before, it is important not to reduce the dataset to the point that insufficient information remains to enable a meaningful classification.

To further illustrate the dimensionality reduction effects of hopsize, Figure 4.5 shows the ASE representation of a song segment with a duration of 1/10th of a second. Since this audio contains vocals and more than one instrument, it is clearly not going to have the almost sinusoidal repeating pattern shown in Figure 4.4. But what can be seen is the difference in representation between 5 ms., 10 ms. and 30 ms. Figures 4.5(a) and (b) have some minor differences in values, but when compared to Figure 4.5(c) it becomes clear that so much 'information' is lost that the representation is almost unrecognisable when compared to either Figure 4.5(a) or (b).

Taking the level of abstraction to a more singular level with the use of the


Figure 4.4: Example MPEG–7 ASE closeup at 5 ms., 10 ms. and 30 ms.

Figure 4.5: MPEG–7 ASE closeup of a song at 5 ms., 10 ms. and 30 ms.

MPEG–7 fundamental frequency reaffirms the generalisation problem associated with feature extraction. Figure 4.6 shows two samples of audio with the fundamental frequency extracted for the duration of the signals. Although this form of representation is similar to the audio spectrum centroid discussed in Section 3.1.5, in that it is a data reduction technique that reduces a signal to a singular value for each frame, it is not a reflection of the power of the signal or of the frequency at which it is strongest. The fundamental frequency is most commonly used as a pitch-tracking approach to identify notes/instruments/vocal sections of an audio track, but consideration must be given to the fact that there may be sections of a signal, such as background noise/silence, where no fundamental frequency can be extracted/determined. In addition, the more convoluted the audio is, the more difficult it becomes to obtain usable results. Figure 4.6 shows the fundamental frequencies of two different pieces of music as defined by the MPEG–7 standard. Figure 4.6(a) is the analysis of the 12 Bar Blues used


throughout this work, and Figure 4.6(b) is a similar length of music but from a song in Western Tonal Format (WTF), with at least two repeating sections containing multiple instruments and vocals. Whereas the repeating pattern of the 12 Bar Blues can be seen clearly in Figure 4.6(a), the repetition is lost when multiple instruments and vocals are added, as shown in 4.6(b). The contrast between drums, guitars and vocals can change the fundamental frequency estimation dramatically.

Figure 4.6: Example MPEG–7 fundamental frequency of two audio signals

4.2 k-means clustering

Clustering procedures are essential tools for unsupervised machine learning. Of the array of clustering methods, k-means clustering is one of the more common choices for solving clustering problems. As previously stated in Section 3.4.1, the choice of starting point for the clusters has a direct effect on the outcome. The following example shows a matrix of 10 vectors with three k clusters.

The data in Figure 4.7 consists of a series of vectors randomly positioned along the x/y axes. The starting point in Figure 4.7(a) is different from that in Figure 4.7(c), and results in Figure 4.7(a) having a different choice of clusters than Figure 4.7(d). In Figure 4.7(a) the starting points for the clusters were positioned randomly but biased to the left. This is in contrast to the starting points of the clusters in Figure 4.7(c), where they have been altered to be biased to the right of the


(a) Left-sided starting point (b) Final cluster centroids and cluster groups

(c) Right-sided cluster starting point (d) Different groupings from 4.7(a)

Figure 4.7: Example k-means cluster comparison

clusters. The change in cluster grouping can be seen in Figure 4.7(d), where the data points are now associated with different clusters. There is no optimum initial cluster positioning, but some work has given consideration to this problem with varying outcomes (Chinrungrueng & Sequin, 1995; Bradley & Fayyad, 1998; Zha et al., 2002), and a common rule of thumb, where the initial cluster centroids are initialised evenly across the data, is the most often proposed solution. Our stance here is not to find some unique, definitive grouping of the ASE output, but rather to obtain a qualitative and quantitative understanding of large amounts of N-dimensional MPEG–7 data by finding similarity within it through clustering the audio data.
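To make the procedure concrete, the following C sketch (illustrative only; not SoFI's implementation, and the function names are hypothetical) performs one Lloyd-style k-means iteration on two-dimensional points, first assigning each point to its nearest centroid and then recomputing each centroid as the mean of its assigned points:

#define N 10   /* number of data points */
#define K 3    /* number of clusters    */
#define D 2    /* dimensions per point  */

/* Squared Euclidean distance between two D-dimensional points. */
static double sq_dist(const double *a, const double *b)
{
    double sum = 0.0;
    for (int d = 0; d < D; d++)
        sum += (a[d] - b[d]) * (a[d] - b[d]);
    return sum;
}

/* One k-means iteration: assignment step, then update step. */
void kmeans_iteration(double points[N][D], double centroids[K][D], int labels[N])
{
    /* Assignment: label each point with its nearest centroid. */
    for (int i = 0; i < N; i++) {
        int best = 0;
        double best_d = sq_dist(points[i], centroids[0]);
        for (int k = 1; k < K; k++) {
            double dk = sq_dist(points[i], centroids[k]);
            if (dk < best_d) { best_d = dk; best = k; }
        }
        labels[i] = best;
    }

    /* Update: move each centroid to the mean of its assigned points. */
    double sums[K][D] = {{0}};
    int counts[K] = {0};
    for (int i = 0; i < N; i++) {
        counts[labels[i]]++;
        for (int d = 0; d < D; d++)
            sums[labels[i]][d] += points[i][d];
    }
    for (int k = 0; k < K; k++)
        if (counts[k] > 0)          /* empty clusters keep their centroid */
            for (int d = 0; d < D; d++)
                centroids[k][d] = sums[k][d] / counts[k];
}

Iterating this until the assignments stop changing reproduces the behaviour shown in Figure 4.7: different initial centroid positions can converge to different final groupings.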

4.2.1 Distance measures

Since similarity is fundamental to the definition of a cluster, a measure of the similarity between two patterns drawn from the same feature space is essential to most clustering procedures. Because of the variety of feature types and scales, the distance measure(s) must be chosen carefully. Often the dissimilarity between two patterns using a distance measure defined on the feature space is


calculated.

The most popular metric for continuous features is the Euclidean distance. The Euclidean distance has intuitive appeal, as it is often used to evaluate the proximity of objects in two- or three-dimensional space. It works well when a data set has compact or isolated clusters (Mao et al., 1996). The drawback with direct use of Minkowski metrics is the tendency of the largest-scaled feature to dominate the others. This problem can be seen clearly in Figure 4.8(b), where clustering of the data is grouped at the middle cluster. Solutions to this problem include normalization of the continuous features to a common range or variance, or other weighting schemes that take into account data with large variances9.

(a) Manhattan (b) Minkowski

Figure 4.8: k-means distance measures (Manhattan and Minkowski)
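Both metrics in Figure 4.8 are instances of the Minkowski distance of order p, sketched below in C (illustrative only): p = 1 gives the Manhattan (city block) distance and p = 2 the Euclidean distance.

#include <math.h>

/* Minkowski distance of order p between two n-dimensional vectors.
 * p = 1 -> Manhattan (city block), p = 2 -> Euclidean. */
double minkowski(const double *a, const double *b, int n, double p)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += pow(fabs(a[i] - b[i]), p);
    return pow(sum, 1.0 / p);
}

Because each per-feature difference is raised to the power p before summation, a feature measured on a much larger scale dominates the total, which is precisely the scaling problem noted above and the motivation for normalising features before clustering.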

Applying k-means with 50 clusters to the sample 12 Bar Blues audio, the repeating pattern of notes can still be seen clearly, as shown in Figure 4.9(a). On closer inspection the grouping of each 10 ms. hop can be seen, with different repeating sections highlighted in red in Figure 4.9(b).

4.3 String matching

With the polyphonic audio now in a clustered format, identification of large sections of audio can be performed with various string matching techniques. Section 3.5.5 discussed the various methods of measuring the differences/distance between two fixed-length strings, which are again dependent on the nature of the data. Although the clusters presented in Section 4.2.1 are identified by digits, there is no actual numerical association; the digit acts only as an identifier, hence the clusters are presented on a nominal scale. For example, consider the

9Changing the scale can adversely affect the cluster outcome.


(a) k-means cluster of 50

(b) Similar cluster identification

Figure 4.9: Example k-means cluster of 50 groups (a) and repeating clusters in (b)

sequence of numbers 1, 2, 3. It can be said that 3 is higher than 2 and 1, while 2 is higher than 1. However, during the clustering process, when two similar samples are found they could just as easily be identified using characters or symbols, provided a nominal scale is used. By comparing a string of clusters using the Hamming scale, any metric value is ignored and only the number of differences between the two strings is calculated. However, if a ranking system is applied, then ordinal variables can be transformed into quantitative variables through normalization. To determine the distance between two objects represented by ordinal variables, the ordinal scale needs to be transformed into a ratio scale. This allows the distance to be calculated by treating the ordinal values as quantitative variables and using Euclidean distance, city block


distance, Chebyshev distance, Minkowski distance or the correlation coefficient as distance metrics. Without rank, the most effective measure is the Hamming distance.
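A common form of this ordinal-to-ratio normalization (a standard textbook choice; the symbols r and M are not from the original text) maps the rank r of an ordinal value, out of M ordered values, onto the unit interval:

\[ z = \frac{r - 1}{M - 1}, \qquad r \in \{1, \dots, M\} \]

The resulting z values can then be compared with any of the quantitative distance metrics listed above.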

4.4 Summary

This chapter outlined the use of the MPEG–7 tools as a feature extraction technique, and clustering as a means of identifying self-similarity within a song. A k-means clustering algorithm applied to the extracted features, resulting in a classified set of features, was presented, including the effect of changing the initial parameters of cluster numbers and distance measures. Finally, the use of string matching as a measure of similarity between sections of clustered data concluded the chapter.


CHAPTER FIVE

Implementation of Song Form Intelligence (SoFI)

This chapter discusses Song Form Intelligence (SoFI), an intelligent music repair system that repairs dropouts in broadcast audio streams on bursty networks. Unlike other Forward Error Correction (FEC) approaches that attempt to repair errors at the packet level, SoFI applies self-similarity in masking large bursty errors in an audio stream received by the listener. SoFI utilises the MPEG–7 content markup as a base representation of audio, clusters these representations into similar groups, and compares large groupings for similarity. It is this similarity identification process that is used on the client side to replace dropouts in the audio stream being received.

An overview of SoFI's architecture is discussed first in Section 5.1, and then Section 5.2 details the implementation of the MPEG–7 audio spectrum envelope extraction, the k-means cluster grouping process and similarity measurement of large groupings of audio data. Section 5.3 discusses Icecast2 and Ices2 and their configuration for the implementation of streaming audio across networks. Section 5.4 discusses the client-side aspect of FEC at an application level using gStreamer as a media framework for development. Network monitoring is discussed in Section 5.4.2, followed by Section 5.4.3, which shows how network dropouts are handled. A discussion of the internal clock within SoFI, how it synchronises playback from varying sources, and the output of SoFI to listeners is presented in Sections 5.4.4 and 5.4.5 respectively. This chapter concludes with a brief mention of the patent application based on the technologies SoFI utilises and their requirements.


5.1 Architecture of SoFI

The architecture of SoFI is shown in Figure 5.1, illustrating a client/server approach to audio repair. Figure 5.1 illustrates the pattern identification components on the server and the music stream repair components on the client. On the server side of Figure 5.1 is a representation of the feature extraction process prior to the audio being streamed. The feature extractor analyses the audio from the audio database prior to streaming and creates a results file, which is then stored locally on the server ready for the song to be streamed. The streaming media server then streams the relevant similarity file alongside the audio to the client across the network. On the client side, the client receives the broadcast and monitors the network bandwidth for delays of the time-dependent packets. When the level of the internal buffer of the audio stream becomes critically low, the similarity file is accessed to determine the best previously received portion of the song to use as a replacement until the network can recover. The best matching portion of the song is retrieved from a temporary buffer stored on the client machine specifically for this purpose.

Figure 5.1: Architecture of SoFI


5.2 Server-side feature extraction

In a typical MIR system similarity assessment is performed in three stages:

1. Data reduction

2. Feature extraction

3. Similarity comparisons

One of the key aspects of feature extraction is to maintain as high a level of reduction as possible without the loss of pertinent data. SoFI makes use of MPEG–7 features in the audio spectrum envelope representation, which was introduced in Section 3.1.2. The feature extraction components of SoFI are shown in Figure 5.2. Songs stored in the database are analysed, and the content description generated from the audio is stored in XML format, as shown in Figure 5.3. A more complete XML representation of the ASE output is shown in Appendix C. The actual file is over 487 KB (499,354 bytes) in size and contains over 3700 x 10 samples for a 37 s. long piece of music stored as a wave file. However, the resultant data is now only 6% of its original size. This represents a vast reduction in the volume of information to be classified, but still retains sufficient information for similarity analysis purposes.

Figure 5.2: SoFI’s low level extraction modules


Figure 5.3: Example MPEG–7 XML output

5.2.1 Audio Spectrum Envelope feature extraction

The settings used for extraction can be seen in the XML field <AudioDescriptor> in Figure 5.3. This stipulates low and high edge thresholds set to 62.5 Hz. and 16 kHz. respectively. These settings were discussed in Section 3.1.2 and shown to be the lower and upper bounds of the human auditory system (Pan et al., 1995). Sounds above and below these levels are of little value and present no additional information that can be utilised when extracting the frequencies. Experiments with values above and below these levels produced results with no gain and more detrimental output, as the resultant data was clouded with noise that did not belong to the audio being analysed. It should be noted that the Joanneum Research facility (MPEG–7, 2008) recommends these settings as default values.

Within the low and high frequencies, a resolution of 1 is set for the parameter octaveResolution. This gives a full octave10 resolution of overlapping frequency bands which are logarithmically spaced from the low to high frequency threshold settings. The output of the logarithmic frequency range is the weighted sum of the power spectrum in each logarithmic sub-band. The spectrum according to a logarithmic frequency scale consists of one coefficient representing power between 0 Hz. and low edge, a series of coefficients representing power in

10In music, an octave is the interval between one musical pitch and another with half or double its frequency. Cooper (1973) refers to the natural relationship between octaves as a miracle of music that is common in most music systems.


logarithmically spaced bands between low edge and high edge, and a coefficient representing power above high edge, resulting in 10 samples for each hop of the audio.

The ASE features have been derived using a hopsize of 10 ms. and a frame size of 30 ms. This enables overlapping of the audio signal samples to give a more even representation of the audio as it changes from frame to frame, as shown in Figure 5.4. As a rule, using more overlap will provide more analysis points and therefore smoother results over the length of the audio, but at the expense of computational cost, since more overlap requires more calculations. SoFI generates the ASE descriptions in offline mode in a run-once operation for each audio file stored. Audio files are in .wav format, as discussed in Section 2.8, to ensure that the audio is of the best possible quality. Lossy compression codecs such as .mp3 or .ogg can introduce unnecessary variations in the compressed audio that did not exist in the original, even at a high bitrate level. Investigations into the file format to use also showed no improvement in the time to perform the ASE extraction on .mp3/.ogg files compared to .wav, thereby providing no reasonable gain/justification for their use.

Figure 5.4: Overlapping sampling frames of a waveform
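The relationship between hopsize, frame size and the resulting number of samples can be expressed with a small C sketch (illustrative arithmetic only; the function name is hypothetical):

#include <stdio.h>

/* Number of analysis frames produced when sliding a window of
 * frame_ms milliseconds over audio_ms milliseconds of audio in
 * steps of hop_ms milliseconds. */
int frame_count(int audio_ms, int frame_ms, int hop_ms)
{
    if (audio_ms < frame_ms)
        return 0;
    return 1 + (audio_ms - frame_ms) / hop_ms;
}

int main(void)
{
    /* 37 s. of audio with a 30 ms. frame and 10 ms. hop gives
     * 1 + (37000 - 30) / 10 = 3698 frames of 10 coefficients each,
     * i.e. roughly the 3700 x 10 array noted in Section 5.2. */
    printf("%d frames\n", frame_count(37000, 30, 10));
    return 0;
}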

5.2.2 Clustering the Audio Spectrum Envelope (ASE)

SoFI uses k-means clustering, discussed in Section 4.2, as a method of identifying similarities within different sections of audio. Using a set number of clusters derived from iterative experimentation on the ASE data provides sufficient grouping. The ASE data files contain a varying number of vectors depending


on the length of the audio, but as each vector contains a finite value, in that each sample contains a variable quantity that can be resolved into components, an optimal value of k = 50 clusters is used, a sample output of which is shown in Figure 5.5. This enables a reasonable computational process with the minimum processing power possible whilst maintaining maximum variety. Experiments above this value produced little or no gain and, with processing time increasing exponentially with each increase in cluster number, were considered too computationally expensive.

Figure 5.5: Example k-means output

The k-means output results in an array of numbers of 1 → x, where x is the number of samples in the ASE representation, with cluster identifiers ranging from 1 to 50. A file lasting 30 s. will result in 3,000 clustered samples, and a file of duration 2 minutes, 45 seconds will produce 16,500 clustered samples. At this stage of the similarity computation process, the cognitive representation of music can be construed from the output. Where the human mind automatically detects rhythm and repeating patterns, the clustered output notation can be considered similar in that each sample has been compared to all other ASE samples and grouped accordingly. Whereas Jackendoff (1987) presents a hierarchical tree as a representation/notation, a k-means representation conveys a similar representative meaning but on a more detailed linear scale. This grouping can be seen in


Figure 5.6.

When looking at the k-means audio clusters at a low/mid level it is difficult to see any pattern/characteristic of the audio with the naked eye, but at a high level some aspects/similarities within the audio can be identified. The samples in Figure 5.6(a) represent 0.1 s. of audio, with plot values representing the 10 ms. hops of the ASE extraction. From Figure 5.6(a) the level of detail shows the variations between cluster identifiers 1 and 50. The k-means plot in Figure 5.6(b) shows an expanded time frame window of 20 s., where it becomes more difficult to identify individual clusters, but what is more transparent is how differing sections of the audio are being represented. The final plot of the k-means output, shown in Figure 5.6(c), contains the entire k-means cluster groupings for a full-length audio song. For the human eye it is difficult to see similarities between sections at this level of detail, but what can be clearly seen is the bridge section in the middle. The bridge in this particular audio sample is dissimilar to any other section of the audio, which explains the closeness of the grouping of the clusters within this section. A waveform of the same audio sample is shown in Figure 5.7, where the same bridge section can be clearly seen in relation to the rest of the audio sample. Although the waveform is merely a power frequency representation of the audio, the bridge section is easily identified as the audio power differs greatly in relation to the rest of the audio.

5.2.3 Similarity measurement

Having an audio file classified and clustered into groups is the preliminary step in determining similarity between large sections of the file. Where the ASE is a minimalist data representation/description and the k-means grouping is a cluster representation of similar samples at a granular level, SoFI makes use of a traditional string matching approach to identify large sections of audio. k-means clustering identifies and groups 10 ms. vectors of audio, but this needs to be expanded to a larger window in order to accommodate network dropouts. For example, bursty errors on networks can last for as long as 15 to 20 s. (Yin et al., 2006; Nafaa et al., 2008), which would mean that if SoFI attempted to apply one identified cluster at a time to repair the gap then it would need to perform the following steps up to 2,000 times:

• Determine the time-point of failure

• Analyse the current cluster


(a) 1 s. closeup of sample audio

(b) 20 s. closeup of a sample audio

(c) k-means cluster representation of the full song

Figure 5.6: Example k-means cluster representations of a song


Figure 5.7: Comparative waveform output

• Replace the current section with a suitable previous section

This is not a feasible option in respect of computational costs. Also, jitter would become a major contributing factor in the resultant audio output to the listener. By applying string matching, large sections of the k-means cluster output can be compared for overall similarity and the best-effort match can be stored for reference. This file is then used for reference on the client machine at a later time, when dropouts occur.

To reduce unnecessary computation, SoFI only compares the clusters in previous sections for similarities, as shown in Figure 5.8. This is based on the principle that when attempting a repair SoFI can only use portions of the audio already received. Any sections beyond this have not yet been received by the client and hence cannot be used. This reduces analysis comparisons considerably in early sections of the audio, but as the time-point progresses the number of comparisons increases exponentially.

Figure 5.8: A backwards string matching search
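A minimal C sketch of this backwards search is given below. It is illustrative only (the function name best_backward_match is hypothetical, and SoFI performs this comparison offline on the server): a window of cluster identifiers ending at the current time-point is scored against every earlier window by the fraction of mismatching positions, a normalised Hamming distance consistent with the zero-to-one match results in Table 5.1, and the start of the lowest-scoring window is returned.

/* Given the sequence of cluster identifiers `clusters` (one per 10 ms.
 * hop), compare the window of `win` hops ending at `now` against every
 * earlier non-overlapping window, and return the start index of the
 * best match. The score is the fraction of differing positions:
 * 0 = identical, 1 = completely different. */
long best_backward_match(const int *clusters, long now, long win,
                         double *best_score)
{
    long best_start = -1;
    *best_score = 2.0;            /* sentinel above any real score */

    if (now < 2 * win)            /* not enough history yet */
        return -1;

    const int *target = clusters + (now - win);

    for (long start = 0; start + win <= now - win; start++) {
        long mismatches = 0;
        for (long i = 0; i < win; i++)
            if (clusters[start + i] != target[i])
                mismatches++;
        double score = (double) mismatches / (double) win;
        if (score < *best_score) {
            *best_score = score;
            best_start = start;
        }
    }
    return best_start;
}

Sliding `now` forward one hop at a time would produce one row per hop, in the manner of Table 5.1 below.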

Sample output is given below in Table 5.1 and shows three different values. The left column is the starting point of the frame to search for, the


middle column is the best match time-point of all the previous sections, and the last column is the matching result, i.e., how close the best match is, represented on a scale between zero and one; the closer to zero, the better the match. The layout of the data was initially intended to be in an XML format similar to the MPEG–7 data, but this was considered unnecessary as there is no change in the data layout throughout the entire content of the file. Incorporating XML tags would be to include metadata for song and artist identification, which is already stored in the filename. Incorporating XML tags would also add complexity when parsing the file, increasing the processing requirements of the media application.

Current time point           Matching time point          Match result
7.4130000000000000e+03       5.4400000000000000e+02       7.1199999999999997e-01
7.4140000000000000e+03       5.4500000000000000e+02       7.1199999999999997e-01
7.4150000000000000e+03       5.4600000000000000e+02       7.1199999999999997e-01
7.4160000000000000e+03       5.4700000000000000e+02       7.1199999999999997e-01
7.4170000000000000e+03       5.4800000000000000e+02       7.1199999999999997e-01
7.4180000000000000e+03       5.4900000000000000e+02       7.0999999999999996e-01
7.4190000000000000e+03       5.5000000000000000e+02       7.0799999999999996e-01
7.4200000000000000e+03       5.5100000000000000e+02       7.0799999999999996e-01
7.4210000000000000e+03       5.5200000000000000e+02       7.0799999999999996e-01
7.4220000000000000e+03       5.5300000000000000e+02       7.0799999999999996e-01
7.4230000000000000e+03       5.5400000000000000e+02       7.0799999999999996e-01
7.4240000000000000e+03       5.5500000000000000e+02       7.0999999999999996e-01
7.4250000000000000e+03       5.5600000000000000e+02       7.1199999999999997e-01

Table 5.1: Example string matching output

5.3 Streaming server

SoFI uses the Ogg Vorbis (Vorbis, 2008) audio file format as an audio compression tool for preparing files for broadcast. As with other compression tools, there is no error correction within the stream, and packet loss will result in a loss of signal. With a proprietary media player, when fragmented packets are dropped a resend request is issued using the Real-Time Control Protocol. However, SoFI differentiates between fragmented packets and network traffic congestion. As with any media player, SoFI makes use of the resend request for corrupt individual packets, where one or two packets have time to be resent without affecting the overall audio output. However, when large dropouts of 5, 10 or 15 s. occur, the loss is unrecoverable and the audio output is affected.


It is at this point that SoFI uses the previously received portions of a song in an attempt at masking this error from the listener.

Unlike the MPEG audio format, the Ogg Vorbis libraries are freely released under the BSD license and its tools are released under the GNU General Public License. This enables the following Ices2 (Ices2, 2008) and Icecast2 (Icecast, 2008) to be used to broadcast an audio stream across a network. Both Icecast2 and Ices2 are open source, patent-free streaming applications.

5.3.1 Ices2 and Icecast2

Ices2 (Ices2, 2008) is a program that sends audio data to an Icecast2 server for broadcast to clients. Where Icecast2 handles network connections from clients, Ices2 handles the data to be streamed. Ices2 can either read audio data from disk (Ogg Vorbis files), or sample live audio from a sound card and encode it on the fly. Ices2 has a command-line interface, and once invoked it requires no further interaction from the administrator. Ices2 sends the encoded audio to Icecast2 for streaming. Ices2 and Icecast2 are invoked with the following commands:

ices2 /home/user/ices2/ices-rep.xml

sudo icecast2 -b -c /usr/share/icecast2/icecast.xml

Essentially, both of these commands provide paths to the configuration files when invoked. The icecast2 command requires administrator privileges for network access, hence the sudo command, and is forced to run in the background using the -b option.

The configuration files introduced in Section 2.11.2 enable SoFI to be configured for optimum sound quality in music broadcasts, as bandwidth consumption by listeners is not salient. This is due to the fact that SoFI is a prototype application that is not expected to cater for a large number of listeners. Alongside the initial configuration settings, Icecast2 has a web interface that enables administrators to control mountpoints11, metadata attached to the audio stream and the number of listeners on each mountpoint. Figure 5.9 shows the general administration page of Icecast2, giving statistics including the total number of listeners currently connected, the total number of connections and when the streaming

11Separate audio streams are named mountpoints.


server was invoked.

Figure 5.9: Icecast2 administrator web page

The main administration page also gives a breakdown of mountpoints, detailing the quality of the stream, including the audio bitrate, the number of channels, the audio sample rate and the number of listeners currently connected to that particular mountpoint. This is shown more clearly in Figure 5.10. From the main administration page, navigation to the mountpoints enables the administrator to control listener access to individual mountpoints, move listeners to another mountpoint, update metadata and finally stop (kill) a mountpoint.

Individual listeners are identified and controlled by navigating to the Listeners tab. Details for each mountpoint list each client connected, their IP address, the application clients are using to listen to the stream and an option to 'kick' a listener from the mountpoint, as shown in Figure 5.11.

Ices2 and Icecast2 are essential to SoFI in that audio streams can be controlled directly for development and testing purposes, including the ability to start, kill and restart streams when required. However, any streaming server would suffice, provided that the similarity file either resides on the client machine or is contained in the metadata of the stream prior to commencement of the audio broadcast. The other proprietary music streaming servers detailed in Section 2.11 are all also suitable.


Figure 5.10: Icecast2 mountpoint detail

Figure 5.11: Icecast2 listener detail


5.4 Client side audio repair

When repairing dropouts in a live audio stream, priority lies in the system's ability to maintain continuity of output audio alongside a seamless switch between the real-time stream being received and buffered portions of the audio, whilst monitoring network bandwidth levels and responding accordingly. The following sections discuss how the gStreamer media framework is used to develop SoFI's client-side handling of audio streams and repair of dropouts in the network, using best-effort similarity matching performed on the server prior to broadcast.

5.4.1 gStreamer pipelines and buffers

On the client side, a media application has been developed that performs three key requirements necessary for client-side audio repair when dropouts occur:

1. Monitor the network: the media application needs to be aware of traffic flow to the network buffer so that, in the event that a dropout occurs, a timely swap can be achieved before the internal network buffer fails.

2. Store locally all previously received portions of audio: a local buffer is required to fill the missing section of audio until the network recovers.

3. Play locally stored audio: as well as being able to play network audio, the media player needs to be capable of playing audio stored locally on the client machine.

Using gStreamer as a media framework, three pipelines have been created to perform these three key requirements simultaneously. Pipelines are a top-level bin, which is essentially a container object that can be set to a paused or playing state. Any element contained within the pipeline will assume the stopped/ready/paused/playing pipeline state when the state is set. Once a pipeline is set to playing, it will run in a separate thread until it is manually stopped or the end of the data stream is reached.

Figure 5.12 shows the bin containing the pipelines necessary for the media application to fulfil the three requirements specified above. SoFI's media application is programmed in C, has no GUI and is entirely self-contained. Once it is invoked it requires no interaction from a listener. This reflects the fact that the listener has no control over the audio stream being broadcast or the


order in which songs are played, and pausing/rewinding a live stream is not possible.

Figure 5.12: Graphical representation of SoFI client media handler with multiplepipelines

The media pipeline is the main container/bin and contains three separate pipelines. Each of the inner pipelines performs one of the necessary functions to maintain continuity of the audio being relayed to the listener even when dropouts occur.

• The ir pipeline contains the necessary functions to receive an Internet radio broadcast in an Ogg Vorbis format. Using the GNOME12 virtual file source pad as a receiver, the stream is decoded and passed along the components of the pipeline until it is handled by the alsasink audio output.

• The file pipeline is created to handle swaps to the file stored locally on the client machine in the event that the network fails. It is the media player's ability to perform this function that masks a network failure from the listener. When a dropout occurs, the ir pipeline is paused and playback is initiated from the locally stored file.

12The GNOME desktop environment is an intuitive interface for users of GNU/Linux or UNIX operating systems, and the GNOME development platform is an extensive framework for building applications that integrate into the GNOME desktop.


• Whilst the Internet radio broadcast is being played, the record pipeline receives the same broadcast and stores it locally on the machine as a local buffer for future playback. Only one song is stored at any given time; each time a new song is played, an end-of-stream message is sent to the client application and the last song received is over-written by the new song.

Usually an Internet audio stream simply reports the length of time that it has been connected to the station, not the length of individual songs. SoFI differs in that it resets the GstClock() for each new song. This provides a simple current time-point that enables the media player to determine exactly where in the current song it is, and thereby provides a timestamp as a point of reference when network failure occurs.

As previously mentioned, when the state of a pipeline is changed, any source/sink pads contained within the pipeline are changed. Upon invocation of the media application, the media pipeline is set to playing by default. This sets the contained pipelines to playing where possible. However, the file pipeline

remains in a ready state, as no file has been specified for playing. This allows the other two pipelines to run concurrently. The following program code shows the creation of the ir pipeline and the setting of its state to ready:

100  /* IR_Play elements */
101  GstElement *ir_pipeline, *ir_source, *ir_queue, *ir_parser,
102             *ir_icydemuxer, *ir_decoder, *ir_conv, *ir_sink;
103
104  gboolean setup_ir_play ()
105  {
106
107      unref_ir_pipeline();
108      ir_queue      = gst_element_factory_make ("queue", NULL);
109      ir_source     = gst_element_factory_make ("gnomevfssrc", NULL);
110      ir_icydemuxer = gst_element_factory_make ("icydemux", NULL);
111      ir_parser     = gst_element_factory_make ("oggdemux", NULL);
112      ir_decoder    = gst_element_factory_make ("vorbisdec", NULL);
113      ir_conv       = gst_element_factory_make ("audioconvert", NULL);
114      ...........................................
115      ...........................................
116
117      /* put all elements in the main bin */
118      gst_bin_add_many (GST_BIN (ir_pipeline), ir_source,
119                        ir_queue, ir_parser, ir_decoder,
120                        ir_conv, ir_sink, NULL);
121
122  }

Line 101 shows the pipeline being declared, lines 108 to 113 show each element within the pipeline being created (the NULL argument leaves each element unnamed, so that a name is assigned automatically), and line 118 shows the


newly created elements being added to the ir pipeline bin.

5.4.2 Network monitoring

Built into gStreamer is a message bus that constantly handles internal messages between pipelines and handlers. This message system enables alerts to be raised when unexpected events occur, such as end-of-stream and low internal buffer levels. A watch method is created to monitor the internal buffer of the audio stream, and when a pre-set critical level is reached an underrun message is sent to alert the application of imminent network failure. It should be noted that a network failure here does not mean that the network is completely disconnected from the client machine, but rather a network connection of such poor signal quality, with such low throughput, that traffic flow is reduced to an unacceptable level. For the purposes of testing this was simulated by throttling the bandwidth on the local client machine using the following command under Linux:

sudo tc qdisc add dev eth0 root handle 1:0 netem delay 5000msec 25%

The qdisc function is the major building block for all Linux traffic control. Using qdisc allows the scheduling of packets between the input and output of a particular queue. In the above example, the eth0 network input is delayed by 5,000 ms., with 25% of the incoming traffic suffering from an additional jitter effect. This simulates a typical almost-out-of-range or signal interference scenario attributed to wireless networks, and thereby prevents complete network connection failure whilst at the same time throttling the throughput to almost failure point.
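A minimal sketch of such a watch, assuming the GStreamer 0.10-era C API used elsewhere in this chapter, is shown below. The callback and the swap function it calls (swap_to_local_file) are hypothetical application code, while the queue element's underrun signal and min-threshold-bytes property are standard:

#include <gst/gst.h>

/* Hypothetical application function (see Section 5.4.3) that pauses
 * the ir pipeline and starts playback from the locally stored file. */
extern void swap_to_local_file (void);

/* Emitted by the queue element when the data it holds falls below the
 * configured minimum threshold, signalling imminent network failure. */
static void
on_queue_underrun (GstElement *queue, gpointer user_data)
{
    g_print ("Buffer critically low: swapping to local playback\n");
    swap_to_local_file ();
}

void
setup_buffer_watch (GstElement *ir_queue)
{
    /* Treat fewer than 64 KB of buffered data as critically low. */
    g_object_set (G_OBJECT (ir_queue),
                  "min-threshold-bytes", (guint) 65536, NULL);
    g_signal_connect (ir_queue, "underrun",
                      G_CALLBACK (on_queue_underrun), NULL);
}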

5.4.3 Masking network dropouts

When the ir pipeline is playing, the file pipeline is in a ready state. When a critical buffer level warning is received, the media application must swap the audio input from the network to the locally stored file, which contains audio from its start point to the point at which the network dropout occurred. Figure 5.13 shows the process of controlling which pipeline is active at any one time. A network failure message calls a procedure that notes the current time-point of the stream and uses this to parse the similarity file already received on the client machine when the current song started. This file contains the output results of the similarity identification previously performed on the server. From


this file, the previously identified best-match section of the audio determines the starting point for playback of the local file on the client machine.

Figure 5.13: Flow of control between pipelines

The file pipeline is now given focus over the ir pipeline, with their states being changed to playing and paused respectively. After a predetermined length of time, the buffer level of the ir pipeline is checked to determine whether network traffic has returned to normal; if so, audio output is swapped back to the ir pipeline. Otherwise, file playback continues for the same fixed length of time, repeated as necessary. In the event that playback of the locally stored file reaches the time-point at which the network failed, it is assumed that network traffic levels will not recover, and the application ends audio output and closes the pipelines, waiting for re-initialization by the user.
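The swap itself reduces to two state changes on the inner pipelines. A minimal sketch, assuming the pipeline pointers from Section 5.4.1 (the helper name swap_pipelines is hypothetical), is:

#include <gst/gst.h>

extern GstElement *ir_pipeline;    /* live Internet radio stream  */
extern GstElement *file_pipeline;  /* locally buffered audio copy */

/* Give the file pipeline focus over the ir pipeline, or the reverse
 * once the network buffer recovers. In SoFI the file pipeline is first
 * seeked to the best-match time-point (Section 5.4.4) before the
 * forward swap is performed. */
void swap_pipelines (gboolean to_local)
{
    if (to_local) {
        gst_element_set_state (ir_pipeline, GST_STATE_PAUSED);
        gst_element_set_state (file_pipeline, GST_STATE_PLAYING);
    } else {
        gst_element_set_state (file_pipeline, GST_STATE_PAUSED);
        gst_element_set_state (ir_pipeline, GST_STATE_PLAYING);
    }
}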

5.4.4 SoFI’s internal synchronization clock

Within gStreamer is the GstClock(), which is used to maintain synchronization within pipelines during playback. gStreamer uses a global clock to monitor


and synchronize the pads in each pipeline. The clock time is always measured in nanoseconds and always increases. GstClock() exists to ensure media playback at a specific rate, and this rate is not necessarily the same as the system clock rate. For example, a sound card may play back at 44.1 kHz., but that does not mean that after exactly 1 s. according to the system clock the sound card has played back 44,100 samples; this is only true by approximation. Therefore, pipelines with an audio output use the audiosink as a clock provider. This ensures that 1 s. of audio will be played back at the same rate as the sound card plays back 1 s. of audio.

Whenever a component of the pipeline requires the current clock time, it will be requested from the clock through a call to gst_clock_get_time(). The pipeline that contains all others holds the global clock that all elements in the pipeline use as a base time, which is the clock time at the point where media playback starts from zero. Using GstClock() methods, pipelines within SoFI can calculate the stream time and synchronise the internal pipelines accordingly. This provides an accurate measure of the current playback time in the currently active pipeline. SoFI's media application, through using its own internal clock, can synchronise swapping between the audio stream and the file stored locally. When a network error occurs, the current time-point of the internal clock is used as a reference point when accessing the best match data file, as shown in the following code segment:

guint64 len=0, pos=0, newpos=0;
GstFormat fmt = GST_FORMAT_TIME;
gst_element_query_position (ir_sink, &fmt, &pos);

/* convert nanoseconds to centiseconds */
timepos = (pos / GST_MSECOND) / 10;

f = fopen(datafile, "r");
if (!f)
{
    //Unable to open file!
    return 1;
}
int linepos = 1;
//Lines are in 10 millisecond hops
while (!found)
{
    fgets(s, 98, f);
    if (linepos == timepos)
    {
        printf("%s\n", s);
        strTime = s;
        found = TRUE;
    }
    linepos++;
}

newTime = atof(strTime);

The C code above is not optimised, in that the similarity file must be scanned from the beginning line-by-line until the current line number matches the corresponding current time-point. Initial tests demonstrated a jitter effect when swapping from one source to the other: whilst the file is being read to find the best possible time to seek to, the radio stream continues playing. This means that when swapping to the previous section on the local audio file, the first 0.5 s. of audio is not synchronised with the current time-point of the audio stream, as shown in Figure 5.14.

Figure 5.14: Example time delay effect during source swapping

A partial fix for this involved reading the entire contents of the similarity file into a dynamically created array at the beginning of the song being streamed, using the following code:

while (fgets(s, 98, f) != NULL)
{
    strTime = s;
    newTime = atof(strTime);
    fData[linepos] = newTime;
    linepos++;
}

This enables the time-point to act as a reference pointer into the array fData. Access to memory gives quicker responses than file I/O and a much quicker return of the value held in the similarity file, thereby reducing jitter to a minimum.


At the point of failure, the time-point is used as a reference to be read from the similarity file. Since each comparison in this file is in 10 ms. hops, the current time-point needs to be converted from nanoseconds to centiseconds: for example, 105441634000 nanoseconds converts to 10544 centiseconds, or approximately 105 s.

Since the principle of swapping audio sources in real-time without user intervention, whilst maintaining the flow of audio, is a novel approach within the area of media players, SoFI creates unique adaptations to the flow of control between pipelines. This is achieved through the use of a queue element, which is the thread boundary element through which the application can force the use of threads. It does so with a classic provider/receiver model, as shown in Figure 5.15. The queue element acts both as a means to make data throughput between threads threadsafe and as a buffer between elements. Queues have several GObject properties that can be configured for specific uses. For example, the lower and upper threshold levels of data to be held by the element can be set. If there is less data than the lower threshold it will block output to the following element, and if there is more data than the upper threshold it will block input or drop data from the preceding element. It is through this element that the message bus receives the buffer underrun when incoming network traffic reaches a critically low state.

Figure 5.15: Time delay effect when swapping audio sources
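As an illustration of these GObject properties, the thresholds described above might be configured as follows (a sketch only; the byte values are arbitrary examples, not SoFI's actual settings):

#include <gst/gst.h>

/* Configure the buffering window of the ir_queue element created in
 * Section 5.4.1: block or drop upstream data above ~512 KB, and
 * report an underrun below 64 KB. */
void configure_queue_thresholds (GstElement *ir_queue)
{
    g_object_set (G_OBJECT (ir_queue),
                  "max-size-bytes",      (guint)   524288,  /* upper threshold */
                  "max-size-buffers",    (guint)   0,       /* no buffer-count limit */
                  "max-size-time",       (guint64) 0,       /* no time limit */
                  "min-threshold-bytes", (guint)   65536,   /* lower threshold */
                  NULL);
}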

It is important to note that data flow from the source pad of the element before the queue to the sink pad of the element after the queue is synchronous. As


data is passed, returning messages can be sent: for example, a buffer full notification can be sent back through the queue to notify the file source

to pause the data flow. Within pipelines, scheduling is either push-based or pull-based, depending on which mode is supported by the particular element. If elements support random access to data, such as the gnomevfssrc Internet radio source element, then elements downstream in the pipeline become the entry point of this group, i.e., the element controlling the scheduling of other elements. In the case of the queue element, the entry point pulls data from the upstream gnomevfssrc and pushes data downstream to the codecs. A buffer full message is then passed back upstream from the codecs to the gnomevfssrc, thereby calling data handling functions on either source or sink pads within the element.

5.4.5 SoFI output

SoFI has a command-line interface and no graphical user interface. However, feedback is provided through the use of an application window whilst SoFI is running. This facilitates the notification of events occurring, including end-of-stream, network underrun and seek events. Figure 5.16 shows a screenshot of the SoFI media application running, with three important sections highlighted in red boxes. The highlighted box A in Figure 5.16 shows the text output to the listener in the application window. This window is encapsulated in the KDevelop IDE13, which was used to create and test the SoFI media application. However, when running as a stand-alone application, the terminal window under Linux or command window under Windows contains the same output: text-based feedback of the current stages of playback.

Synchronised to the GstClock() is a callback function within SoFI, and on each cycle of 1 s. a number of statistics are displayed. These include:

• The current level of bytes held in the queue, which provides feedback on the current buffer level of network data

• The actual level of the buffer as a percentage, which can vary from 0% to 100%

• Nanoseconds, providing the precise moment currently being played in the receiving audio stream

13KDevelop is an open source integrated development environment for the KDE desktop environment on Unix/Linux operating systems.


Figure 5.16: SoFI swapping audio sources

• Seconds: a conversion from nanoseconds to a more readable format for the listener, but which also assists in the next statistic

• Callback: the current number of times feedback has been provided. When checked against the output above, Callback can provide valuable information regarding synchronisation problems during playback of audio

Boxes B and C highlighted in Figure 5.16 show the relevant method calls for displaying output to the listener when a source swap occurs during a critical network dropout. Box C presents the current time played, the seek time of the best match held in the similarity file and the start time of the playback from the locally stored audio of the radio stream. At the end of this, a message indicating either seek successful or seek failed is displayed. Box B shows the actual states of the ir pipeline and file pipeline being changed to PAUSED and PLAYING respectively.

Figure 5.17 shows the return of playback from the local file to the live Internet audio stream. The highlighted box A shows the application messages relayed to the user through the application window, and the highlighted box B shows the actual method calls changing the playback states of the file pipeline and the ir pipeline to PAUSED and PLAYING respectively. It should be noted that a Network Error is also shown in box A, but this is a side effect of the return to the live Internet radio stream: whilst playback is from a previously received section of the audio stored locally, the ir pipeline maintains a buffer of received audio from the Internet radio stream, and this buffer has to be cleared to ensure as smooth a transition as possible from local playback back to the live stream. Clearing it causes a network error to be flagged and reported.

Figure 5.17: SoFI returning to a live Internet radio stream
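The swap itself reduces to pausing one pipeline, seeking the other to the best-match time-point and setting it to PLAYING. A minimal sketch follows, assuming pipeline handles ir_pipeline and file_pipeline and a precomputed seek_ns taken from the similarity file; the function names are illustrative, but the state and seek calls are standard GStreamer 0.10 API.

    #include <gst/gst.h>

    /* Swap from the live stream to the locally stored copy at the
     * best-match position (seek_ns, in nanoseconds). */
    static void swap_to_local(GstElement *ir_pipeline,
                              GstElement *file_pipeline,
                              gint64 seek_ns)
    {
        gst_element_set_state(ir_pipeline, GST_STATE_PAUSED);

        if (gst_element_seek_simple(file_pipeline, GST_FORMAT_TIME,
                                    GST_SEEK_FLAG_FLUSH, seek_ns))
            g_print("seek successful\n");
        else
            g_print("seek failed\n");

        gst_element_set_state(file_pipeline, GST_STATE_PLAYING);
    }

    /* Returning to the live stream reverses the two states. */
    static void swap_to_live(GstElement *ir_pipeline,
                             GstElement *file_pipeline)
    {
        gst_element_set_state(file_pipeline, GST_STATE_PAUSED);
        gst_element_set_state(ir_pipeline, GST_STATE_PLAYING);
    }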

The End Of Stream (EOS) message handler deals with the incoming Internet audio stream. When Icecast2 reaches the end of the current song in its playlist it sends an EOS signal to indicate the end of the current song. This enables Icecast2 to change not only the audio being played but also the audio format, i.e., from .ogg to .mp3 or vice versa. The EOS signals SoFI to close current threads, queues and pipelines in preparation for the next song to be broadcast, as shown in Figure 5.18. Figure 5.18 shows that an EOS message is presented to the listener, and a new ir pipeline is created together with the necessary elements being linked dynamically. At this point, the previously stored audio is removed and a new locally stored recording of the new incoming audio stream begins. This allows SoFI to begin anew for the new song being received.

Figure 5.18: End of stream signal event
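EOS and error messages arrive on the pipeline's message bus. A sketch of a bus watch reacting to the EOS described above might look as follows; the GST_MESSAGE_EOS/GST_MESSAGE_ERROR handling is standard GStreamer, while rebuild_ir_pipeline is a hypothetical helper standing in for SoFI's teardown-and-rebuild step.

    #include <gst/gst.h>

    static gboolean bus_cb(GstBus *bus, GstMessage *msg, gpointer data)
    {
        switch (GST_MESSAGE_TYPE(msg)) {
        case GST_MESSAGE_EOS:
            /* Icecast2 signalled the end of the current song: close the
             * current queues/pipelines and build a fresh ir pipeline so a
             * new local recording can begin (hypothetical helper). */
            g_print("EOS: rebuilding pipeline for next song\n");
            /* rebuild_ir_pipeline(); */
            break;
        case GST_MESSAGE_ERROR: {
            GError *err = NULL;
            gchar *dbg = NULL;
            gst_message_parse_error(msg, &err, &dbg);
            g_print("Network Error: %s\n", err->message);
            g_error_free(err);
            g_free(dbg);
            break;
        }
        default:
            break;
        }
        return TRUE;   /* keep watching the bus */
    }

    /* Installed with:
     *   gst_bus_add_watch(gst_pipeline_get_bus(GST_PIPELINE(p)),
     *                     bus_cb, NULL);                           */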

5.5 Technologies used by SoFI

The following sections describe the underlying technologies used by SoFI. Although they are not unique in themselves, they are proven and reliable approaches to the different aspects of audio analysis. It is the combination of feature extraction through MPEG–7 and clustering that facilitates SoFI's approach to audio repair.

5.5.1 Feature extraction

Combining new and well-established technologies allows a unique approach to an existing problem, and the MPEG–7 descriptors became a logical choice as a feature extraction tool. Classification can be performed on raw audio data, but, as mentioned in Sections 2.3.4 and 4.1, the computational cost grows rapidly: the more original data to be analysed, the more computationally expensive the task becomes. Investigations into the most suitable features to be extracted led to the Audio Spectrum Envelope (ASE) being identified as the most suitable, based on a balance between information retention and data reduction. Other MPEG–7 descriptors, such as the Audio Spectrum Spread (ASS) and Audio Spectrum Centroid (ASC), gave positive initial results, but further experiments showed the ASE to retain the most pertinent data.

5.5.2 k-means clustering

When pre-existing knowledge about the data to be classified is available, similarity detection can be better suited to technologies such as neural networks, fuzzy logic and Bayesian networks. Voice recognition and character recognition work well with these technologies because a known pattern to search for can be determined before processing the 'real' data: character, digit and voice recognition systems all expect input of some form, and training sets can be generated to improve recognition accuracy. When searching for similarities within audio where nothing is known in advance about the contents of the file, no two musicians will perform the same piece with exactly the same tempo, pitch or speed, and therefore no 'database of patterns' can exist beforehand. In order to retain as much information about the audio as possible, the features extracted from the audio are stored as a multidimensional array. This layout lends itself naturally to the k-means clustering approach, and comparisons with k-nearest-neighbour showed k-means to be more successful; hence k-means clustering was chosen as the similarity identifier.
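To make the clustering step concrete, a minimal k-means sketch in C over ASE frames is given below. The band count NUM_BANDS, the seeding strategy and the fixed iteration count are illustrative assumptions, not SoFI's actual implementation.

    #include <float.h>

    #define NUM_BANDS 34   /* assumed width of one 10 ms ASE frame */
    #define K         50   /* cluster count chosen in Chapter 4    */

    static double sq_dist(const double *a, const double *b)
    {
        double d = 0.0;
        for (int i = 0; i < NUM_BANDS; i++)
            d += (a[i] - b[i]) * (a[i] - b[i]);
        return d;
    }

    /* frames: n rows of NUM_BANDS values; labels: n entries;
     * centroids: K rows of NUM_BANDS values. */
    void kmeans(const double (*frames)[NUM_BANDS], int n,
                int *labels, double (*centroids)[NUM_BANDS], int iters)
    {
        /* Seed the centroids from evenly spaced frames (illustrative). */
        for (int k = 0; k < K; k++)
            for (int i = 0; i < NUM_BANDS; i++)
                centroids[k][i] = frames[(long)k * n / K][i];

        for (int it = 0; it < iters; it++) {
            /* Assignment: label every frame with its nearest centroid. */
            for (int f = 0; f < n; f++) {
                double best = DBL_MAX;
                for (int k = 0; k < K; k++) {
                    double d = sq_dist(frames[f], centroids[k]);
                    if (d < best) { best = d; labels[f] = k; }
                }
            }
            /* Update: move each centroid to the mean of its members. */
            for (int k = 0; k < K; k++) {
                double sum[NUM_BANDS] = { 0.0 };
                int count = 0;
                for (int f = 0; f < n; f++) {
                    if (labels[f] != k) continue;
                    for (int i = 0; i < NUM_BANDS; i++)
                        sum[i] += frames[f][i];
                    count++;
                }
                if (count > 0)
                    for (int i = 0; i < NUM_BANDS; i++)
                        centroids[k][i] = sum[i] / count;
            }
        }
    }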

5.5.3 Streaming servers and audio players

A number of streaming servers were introduced in Section 2.11, all of which have their own merits. However, Ices and Icecast combined proved to be the most flexible in regards to operating system platform and configuration, as well as being open source and freely available. Proprietary systems such as Windows Media Player, Real Player and Apple's QuickTime all have limited functionality regarding streaming audio. The primary decision for using Ices and Icecast was based on operating system platform. Investigations into creating an audio application capable of receiving streamed audio, 'buffering' the stream locally and swapping between media stored locally and the network stream led to GStreamer as a development framework. Since GStreamer is primarily based in a Unix/Linux environment, it was beneficial to have a server-side streaming application capable of running in the same environment.

5.5.4 Computational requirements of audio analysis

Analysis of the audio files was performed in three stages:

1. Generate the low-level ASE descriptions from a song using the MPEG–7 feature extraction tools.

2. Perform k-means clustering on the ASE description.

3. Identify the 'best previous match' section from the clustered data using a five second string search (a sketch of this search is given after this list).
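A minimal sketch of the five second string search in C is shown below, assuming one cluster label per 10 ms frame (so a 5 s. query is 500 labels) and scoring each candidate window by its fraction of mismatching labels, so that 0.0 is an exact match, in keeping with the match ratios reported in Chapter 6; the function names are illustrative.

    #define FRAMES_PER_SEC 100
    #define QUERY_LEN      (5 * FRAMES_PER_SEC)

    /* Mismatch ratio between the query and the window starting at pos:
     * 0.0 is an exact match, 1.0 no agreement at all. */
    static double mismatch(const int *clusters, int pos, const int *query)
    {
        int misses = 0;
        for (int i = 0; i < QUERY_LEN; i++)
            if (clusters[pos + i] != query[i])
                misses++;
        return (double)misses / QUERY_LEN;
    }

    /* Return the start frame of the best match occurring strictly
     * before the query position. */
    int best_previous_match(const int *clusters, int query_pos)
    {
        const int *query = &clusters[query_pos];
        double best = 1.0;
        int best_pos = 0;

        for (int pos = 0; pos + QUERY_LEN <= query_pos; pos++) {
            double score = mismatch(clusters, pos, query);
            if (score < best) { best = score; best_pos = pos; }
        }
        return best_pos;
    }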

The first two steps can be performed on a standard desktop computer with the following minimal requirements:

• Operating system: Any Unix/Windows XP base.

• Processor: 1.6GHz or better.

• 512MB of RAM.

• 1GB of additional hard disk space.

The MPEG–7 ASE extraction from an audio file with a duration of 3 minutes takes around 5 minutes. A further 10 minutes is needed for clustering the ASE representation using 50 clusters.

The most computationally expensive aspect of the audio analysis is the string search process. Individually comparing the query string with every possible previous section of the audio requires more and more resources the further along the audio the search query progresses, and memory capacity quickly becomes an issue. For this reason this stage of analysis was performed on a 'mainframe' server with the following specifications:

• Eight 3.0GHz Dual Core Xeon processors

• 32GB RAM with a virtual swap memory of 8GB

• Four 500GB RAID hard drives


The additional bandwidth needed to send the metadata during streaming is 10% of the original audio file size. Requirements for running the streaming server and the client receiving application are minimal; the only exception is that they must run under a Unix/Linux based system, as these are more suited to the Ices/Icecast streaming server and the GStreamer media player application.

5.6 Summary

This chapter discussed the implementation of SoFI, a Forward Error Correction approach to repairing packet loss using self-similarity. SoFI's architecture was presented, detailing the overall implementation of SoFI. This was followed by discussion of the implementation of the MPEG–7 content descriptors, the k-means cluster grouping process and the similarity measurement of a number of grouped clusters. The configuration of Icecast2 and Ices2 was detailed. Following this, the client-side aspect of FEC at an application level, using GStreamer as a media framework for development, was discussed, along with how network monitoring is performed and how network dropouts are handled by SoFI. The internal clock within SoFI and how it is used to synchronise playback from varying sources was presented, along with a discussion of SoFI's output window for listeners. A brief mention of the technologies SoFI utilises and their requirements concludes this chapter.


CHAPTER

SIX

Evaluation of Song Form Intelligence (SoFI)

This chapter details the evaluation of Song Form Intelligence (SoFI) using two methodologies:

• Objective evaluation with calculated metrics which can provide unbiased quantifiable feedback

• Subjective evaluation with human subjects indicating a level of user satisfaction, typically by means of a user survey

Through string matching, differing time lengths are evaluated as possible replacement sections, with justification for the use of 5 s. lengths as discussed in Section 6.2, Figures 6.2 and 6.3. Problems with songs that are not strictly in Western Tonal Format are discussed, together with analysis results presenting best-effort matching sections. A visual representation of identified similar sections is given in Section 6.3, where a spectrogram representation gives a view of each frequency across time, and correlation is used to compare similar and dissimilar segments, showing SoFI's best-effort approach.

Subjective evaluation of SoFI is discussed in Section 6.4. Sixteen test subjects listened to and rated simulated dropout scenarios, enabling a comparison of SoFI's simulated performance in relation to other approaches, together with a comparison of SoFI's simulated performance under different 'dropout' scenarios. Subjective evaluation results for SoFI are given in Section 6.4.4. The chapter concludes with a discussion of general feedback provided by test subjects.


6.1 Clustering groups

Initial investigations into identifying an optimal value for k were explored in Section 4.2. As noted previously, the number of clusters is set to 50; values above this offer no gain in the level of detail attained. Figure 6.1 shows a 5 second sample of audio with clusterings of 30, 40 and 50 plotted. The different groupings for each 10 ms sample can be seen in both box A and box B: depending on the number of k clusters specified, each sample will be classified differently. The majority of samples using 30 clusters are shown in green (*), the red samples (+) are for a value of 40 clusters, and the 50 cluster grouping is shown in blue (x). In box A a distinct difference between the values can be seen, where the 30 cluster group has been identified as predominantly between 0 and 5. The 30 cluster grouping in box A also has a high number associated with groups at the high end of clusters, between 25 and 30, which shows a high level of inconsistency between samples.

Figure 6.1: A comparison of cluster selection

Although the number of k clusters is initially arbitrarily defined, consistency between clusters improves as the number of groupings increases. In Figure 6.1, box B shows the k cluster of 50 predominantly classified as the same cluster, whereas the 30 and 40 clusterings produced more varied classifications. Tests involving more than 50 clusters produced similar results but created large increases in processing time.


Table 6.1 shows the number of calculations required based on the chosen value of k. The number of computations does not increase in a linear or exponential manner, but is determined by the complexity of the music and its composition, together with the duration of the audio. Song A requires more calculation due to its composition: song A is the 12 Bar Blues sample audio used as a test bed throughout this work, and since it contains a high level of variation between time frames, centroids and distances need to be re-evaluated more frequently. Songs B, C and D are a random selection from the music collection used within this work14; their basic descriptions are presented in Table 6.2. When k is set above 40 a steady increase in computations can be seen in song A. This does not hold for clusterings below 40, where songs A, B, C and D all produce varying results. This can be partially attributed to the limited number of k clusters that differing values can be assigned to (Zha et al., 2002; Kriegel et al., 2005). More variation with fewer clusters means more comparisons, since one sample with a high value can offset the centroid of the associated cluster, which then needs to be adjusted more frequently.

                           Iterations
k Clusters   Song A   Song B   Song C   Song D
    30         8040     3270     4410     6240
    40         6520    13280     6650    10160
    50         8750    12000    11000    13550
    60        10260    17040    15060    16800
    80        13520    19680    31360    18080
   100        19700    30700    39300    45700

Table 6.1: k Cluster computations relative to size of k

              Audio Properties
Song   Duration (mins.)   Degree of WTF
 A           3.86            Medium
 B           3.86            Medium
 C           3.20            High
 D           4.43            Low

Table 6.2: Audio properties of test songs A to D

14A more complete list of evaluation data, including artists and titles of songs, is given in Appendix D.


6.2 String matching large clusters

Having each 10 ms. sample classified presents the audio in a readable format that enables larger sections to be compared for similarities. Section 5.2.3 introduced the approach of using string matching to compare a string of n clusters at time-point y with a string of n clusters at a preceding time-point x within the same audio file.

Investigations into string length are shown in Figure 6.2, whose eight graphs show the results of a complete search of a file given a specific 'query'. The query in question is a fixed-length string taken from the k-means clustered output, which results in a series of numerical values. Each numerical value is the cluster value for a given time-point, as can be seen in the following sample output:

...6,2,23,17,42,36,35,16,23,11,35,16,2,6,35,16,6,...

...,2,35,41,40,46,42,17,...

A query string of 1 s. in length contains 100 values, and the entire clustered output of an audio file contains over 23,000 identified clusters. An example result is where the query string is taken from a random point in the middle of the file without any preconceptions, i.e., it is not known whether the query string time-point is part of the chorus, a verse or even a bridge. This query string is then compared with the entirety of the clustered file, noting how close a match it is to each segment across all time-points from beginning to end.

The closer the ranked score is to zero, the better the match. Within each graph in Figure 6.2 one clear match value can be seen: this is the time-point at which the original query string was sampled, which will always give an exact match. Other matching points have clearly been shown as the best match in the first quarter of the song, as indicated in each graph. However, the main focus is on the number of matches across the full duration of the audio file that indicate a clear result in regard to other sections of the audio. Although the 'best match' column in Table 6.3 shows an average match ratio of only around 0.7 in relation to the nearest other matches, this is considerably close.

Also given in Table 6.3 is the number of matches found to be below 0.85. Although the best score in the table is 0.6931, it can be clearly seen that other sections have been identified as similar, which gives an indication of the repetitiveness of the audio.


[Figure 6.2 panels: (a) 1 s. search, (b) 2 s. search, (c) 3 s. search, (d) 4 s. search, (e) 5 s. search, (f) 6 s. search, (g) 8 s. search, (h) 10 s. search]

Figure 6.2: A string matching comparison: with a string length equivalent to 1 s. in (a), stepped by 1 s. up to (f) and stepped by 2 s. in (g) and (h)


Search string   Match     No. matches
(seconds)       result    below 0.85
     1          0.6931        6
     2          0.7463        5
     3          0.7110        5
     4          0.6983        4
     5          0.7005        4
     6          0.7006        4
     8          0.7416        3
    10          0.7662        5

Table 6.3: String comparison results for 1 to 10 s.

However, as the initial query string length increases, either the score worsens or the number of matches found decreases, reducing accuracy when determining the best match. Added to this, the need to replace sections of audio when dropouts greater than 1 s. occur eliminates the choice of 1 or 2 s. queries as sample criteria when searching for matches. In graphs (a) and (b) of Figure 6.2 a very close match can be seen marginally to the left of the original query time-point. In theory this could cause problems when trying to repair bursty errors, as it is too close to the live stream time-point and the media application would have only a very limited time frame in which to recover the network. Although shorter queries offer more possible choices, accuracy is reduced; a balance between the extreme query lengths shown in Table 6.3 is achieved by using a 5 s. length15, as shown in Figure 6.2(e).

The purpose of graphs (a) to (h) in Figure 6.2 is to show results of matching sections of audio found throughout the audio file. For the purpose of repairing bursty dropouts, the media section of SoFI can use only previously received portions of the live stream up to the point at which network bandwidth/throughput becomes unstable. Figure 6.3 shows another random time-point, chosen near the end of a different song from the one used in Figure 6.2. Using 5 s. as a query length, a successful match can clearly be seen, as identified by the best match indicator in Figure 6.3. Only one other possible match can be seen, and this match has a relatively low match ratio of 0.87. All other comparisons resulted in ratios near and above 0.95; a match at this level would be considered almost unusable.

15A comparison of the original MPEG–7 query string and best match section is shown in Appendix E, Figure E.1, graphs (a) to (k).

Figure 6.3: A 5 s. query on only preceding sections

The final figure in this section represents a worst-case scenario where a network dropout occurs near the beginning of a song. Figure 6.4 shows the query result for a dropout occurring after only 30 s. of audio have been received. The best match ratio is now only just below 0.89, only marginally better than any of the other samples. It is at this level of best match that the phrase 'best-effort' applies most fully. Using this portion of audio as a starting point to replace the break in the live stream will simply mask the error from the listener, although it will be apparent. At this level the attempted repair merely replaces a complete loss of signal, to minimise the level of distraction caused to the listener. A subjective test by listeners is more a measurement of the success of the replacement than the actual values displayed.

Figure 6.4: A 5 s. query from only 30 s. of audio

Table 6.4 shows the average match for cluster string lengths of between 1 and 20 s. As the time span increases, the accuracy of the match decreases. The interesting point to note in Table 6.4 is the jump between 0.6534 for a 1 s. query string and 0.6994 for a 2 s. query string. This is due to too many false positives of a match result for such a short query string.

Time (s.)   Average match
    1          0.6534
    2          0.6994
    3          0.7224
    4          0.7350
    5          0.7456
    6          0.7521
    7          0.7581
    8          0.7726
    9          0.7864
   10          0.7918
   12          0.8007
   15          0.8121
   20          0.8365

Table 6.4: Average match ratio across all song segments

The problem of too many successful matches can be seen more clearly in Figure 6.5. Both the 1 and 5 s. queries returned the same time-point as the best possible match for the starting time-point of the query. However, additional matches below the best match result using 5 s. were found with only 1 s. of audio. This can lead to sections of audio being used that are not an accurate replacement for dropouts of over 1 s. in length. Using a 5 s. length reduces this possibility whilst increasing the likelihood that the audio following on from the query string time-point is still correct.

Figure 6.5: A comparison of 1 and 5 s. query strings


Of the songs listed in Appendix D, Table D.1, one of the most problematic was by the artist Enya. Although the structure of Enya's songs is repetitive in principle, they do not strictly adhere to the Western Tonal Format (WTF) definition. For example, the song 'Orinoco Flow' follows the verse/chorus/bridge/verse/chorus structure, and yet repeats of the chorus are not composed exactly the same each time. This presents problems when matching chorus sections as well as verses and bridges. The match ratio for a verse is expected to be lower than for a chorus, since the verse usually contains the same underlying music (guitar, drums, piano) repeated in the same manner for different sections while the lyrics can, and do, change for each verse throughout the song, thereby leading to a lower match percentage. In the case of work by Enya, however, not only do the verses change but the chorus differs also. To add to the complexity of an uncertain structure, Enya changes the underlying music but not the lyrics for each repetition of the chorus: for example, the drum and guitar rhythms appear 'out of sync.' compared to other repetitions of the chorus.

Figure 6.6: Two ’similar’ 5 s. segments

In Figure 6.6 a 5 s. segment of the ASE representation of the first chorus of 'Orinoco Flow' can be seen. Its starting time-point is relative to the start of the first lyric in the chorus. When compared to the next time the same lyrics are repeated, as shown in the lower plot in Figure 6.6, an overall difference in the audio composition for the equivalent section can be seen.


Figures 6.7 and 6.8 show the full audio file as a wave representation and in clustered format respectively. Similarities in the overall structure of Enya's music are clearly visible. The bridge section is evident in both figures, and the start and end of the song are alike in that the overall strength of the wave representation is mirrored in the clustered representation. It could be inferred from this that best-effort results would be similar to previous examples. However, Figure 6.9 shows the match ratio result for the time segments used in Figure 6.6, and the corresponding best match is not at the optimal position: the correct time-point is actually 10 s. further on from this point. A high match ratio can also be seen at the beginning of the audio, where no lyrics are performed. This could lead to the conclusion that the underlying music has more influence than the vocals in the song, although this would need further investigation. The reason for the misclassification occurring in the case of this song is that the music is timed differently for each repetition of the lyrics in later sections.

Figure 6.7: A two channel wave audio file

Throughout almost the entirety of 'Orinoco Flow' a repetitive music pattern is played, and as lyrics change the music remains the same during both verse and chorus. The only deviation from this pattern occurs during the bridge section. This continuous repetition produces a best match because the music frequencies are more dominant than the lyrics. This leads to a false positive match where the underlying music is the same but the section matched is not correct.


Figure 6.8: Figure 6.7 as cluster representation

Figure 6.9: Match ratio for one 5 s. segment from Figure 6.8

Table 6.5 shows a comparison of the match ratio for Enya's 'Orinoco Flow' alongside the difference from the average match ratio for durations of 1 to 10 s. The results 'indicate' a better match for time lengths of over 2 s., but many of these matches may be false positives. Table 6.5, together with Figure 6.9, shows how music that is not strictly in WTF can produce what appear to be good match sections, but which in reality are poor substitutes when better sections should have been identified. A possible justification is that Enya's music may follow Eastern styles, which could also explain her high sales in Asia.


Search        Match     Difference
string (s.)   result    from average
     1        0.6737      -0.0194
     2        0.7260      -0.0203
     3        0.7523      +0.0413
     4        0.7680      +0.0697
     5        0.7795      +0.0790
     6        0.7958      +0.0952
     8        0.8145      +0.0729
    10        0.8339      +0.0677

Table 6.5: A comparison of match ratio across all song segments with Orinoco Flow

6.2.1 Clann Brennan

As a comparison highlighting the underlying style of Enya's music16, a summary of song comparisons with other members of Enya's family is given in Table 6.6. Table 6.6 shows that 'Orinoco Flow' has both the worst (highest) average match ratio, where closer to 0 indicates a better match, and the poorest best-possible match within the audio, giving an indication that Enya's music does not follow the same style as other musicians in her family and does not precisely follow the WTF.

Artist    Song                  Duration   Average       Best match
                                (mins.)    match ratio
Moya      Black Night           3.26       0.811322      0.434
Clannad   Dulaman               4.68       0.802523      0.414
Leo       The Wee Crolly Doll   3.03       0.755704      0.508
Enya      Orinoco Flow          4.43       0.859356      0.628

Table 6.6: Clann Brennan song comparison

6.2.2 Baroque classical music

As a comparative to modern music styles, the baroque17 period in the history of classical music is considered an era with regulated chord progression, formal structure and tonal harmony (particularly in the later years of the period), in contrast to the earlier Renaissance period leading up to the baroque era (Bukofzer, 2008, p. 17). Three of the most prominent composers of the later baroque period were chosen, not purely for their renown but most notably for their contribution to the era: Handel, J.S. Bach and Vivaldi are all well known for their contributions to the baroque period.

16Enya, her manager and producer, Nicky Ryan, and her parents, Leo and Baba Brennan, viewed a demonstration of SoFI during a visit to the Intelligent Systems Research Centre (ISRC) on July 10th, 2007. Photographs of the visit are included in Appendix J.

17The baroque period is a style of European classical music from 1600 to 1750.

Using the same clustering and string matching analysis as performed on music in the WTF style, a number of interesting results can be seen. Table 6.7 shows an average match of between 0.6 and 0.8, similar to the average match ratio of the previously tested WTF music. What is of interest here is the 'best match' column, which shows an almost exact copy being found in a previous section of the music for two of the three composers, thereby indicating a highly repetitive structure.

Artist      Piece                 Duration   Average       Best match
                                  (mins.)    match ratio
J.S. Bach   Air on a G String     5.33       0.7641        0.0800
Handel      Water Music Suite     2.5        0.8462        0.4680
            No. 1 in F, Allegro
Vivaldi     The Four Seasons,     3.01       0.6232        0.0300
            Spring

Table 6.7: Baroque classical comparison

The graphs (a) to (c) in Figure 6.10 show the complete similarity analysis result for each composer. Of the three pieces analysed, Vivaldi's 'The Four Seasons', shown in Figure 6.10 (c), shows the most repetition, closely followed by Bach, as seen in Figure 6.10 (a). From the results in Table 6.7 and the graphs in Figure 6.10 it is clear that the repetitive structure of the baroque period can be easily identified using a self-similarity approach such as k-means and string matching. Listener evaluations of how closely sample dropout scenarios match a listener's expectations when listening to a piece of music from the baroque period are discussed in Section 6.4.6.


(a) J.S. Bach: Air on a G String

(b) Handel: Water Music Suite No. 1 in F, Allegro

(c) Vivaldi: The Four Seasons, Spring

Figure 6.10: Best match ratio for baroque classical period music


6.3 Audio repair

This section continues analysis of the best-effort approach that SoFI uses to identify similar segments. Using the ASE representation as a meaningful representation of the audio, and the audio analysis tool Sonic Visualiser to present the same or similar sections of the actual audio in the form of a spectrogram, identified best-effort matches can be displayed more easily.

6.3.1 Quantitative audio comparisons

Section 6.1 discussed the optimum replacement for specific time-points within an audio broadcast. Using Sonic Visualiser, these identified sections can be visually represented not as a match ratio but as a spectrogram18. The graphs (a) to (c) in Figure 6.11 present an audio signal in a peak frequency representation. The standard spectrogram representation, as shown in Appendix F, Figure F.1, is a general overview of the file in which background noise and equalisation trends may be visually evident but musical features contained within the signal are less apparent.

The peak frequency spectrogram presents the precise/dominant frequency of a bin19. Within each frequency bin, a single value representing the bin is displayed as a short horizontal line, as opposed to the whole block/bin. This allows the strongest details to be seen without interference from neighbouring frequencies. Colours represent the power of the particular frequency, ranging from black through to red. Figure 6.11 shows the three similar sections identified in Figure 6.2(e) as a peak frequency representation. From this layout the similarities between the different sections can be seen more clearly.

The most evident repetition among the range of frequencies can be seen at the lower end of the scale, where strong bass tones are more applicable (Olson, 1967). Bass drums and male20 vocals can be more dominant within this range. More evident at this level is the repetition of the frequencies over the fixed temporal length of the sample. Close similarities between duration and power can be seen across graphs (a) to (c) in Figure 6.11, and clear signs of similarity are visible when comparing the lower frequencies of Figure 6.11(a) with Figures 6.11(b) and 6.11(c).

18A spectrogram shows how the spectral density of a signal changes across time.

19A frequency bin is the division of the frequency span of the audio signal by the number of display points, such that each bin contains a segment of the frequency span.

20The vocal range of a typical adult male has a fundamental frequency of from 85 to 155 Hz. and an adult female from 165 to 255 Hz. (Titze & Martin, 1998).


(a) Original query section

(b) Best match at 20 s.

(c) Best match at 2 minutes 45 seconds

Figure 6.11: Peak frequency spectrum representation of three similar audio sections

The lower frequencies of the audio (shown primarily in red) consist of drum and bass guitar and indicate a clear similarity of strength and duration between the original query section and the closest matches found, thereby indicating repetitive sections of the music. These graphs also relate to Figure E.1 in Appendix E, which shows the same sample comparison using the MPEG–7 audio spectrum envelope representation instead.

6.3.2 Correlation of similar and dissimilar matches

Computing the 2D correlation coefficient between two similarly defined matrices produces a high correlation and a low mean difference, as shown in Table 6.8 for matrices a and b. These results can be compared to a segment of audio from a dissimilar match, which produces a low correlation and a proportionately higher mean difference, as shown in the comparison of matrices a and c. It is noted here that correlation measures the strength of a linear relationship between the two variables, and this linear relationship can be false: a correlation coefficient of zero does not mean the two variables are independent, only that there is no linear relationship between them. However, when combined with the mean differences between the vectors and their visual representations, the accuracy between matches can be clearly defined.
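For reference, the 2D correlation coefficient referred to here can be computed with the standard definition for two equally sized matrices A and B (the conventional formula is stated here for completeness; it is an assumption that this exact form was used):

    r = \frac{\sum_{m}\sum_{n}(A_{mn}-\bar{A})(B_{mn}-\bar{B})}
             {\sqrt{\left(\sum_{m}\sum_{n}(A_{mn}-\bar{A})^{2}\right)\left(\sum_{m}\sum_{n}(B_{mn}-\bar{B})^{2}\right)}}

where \bar{A} and \bar{B} denote the means of all elements of A and B respectively. A value of r near 1 indicates a strong linear relationship between the two matrices; a value near 0 indicates none.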

Matrices compared   Correlation   Mean difference
a ∩ b               0.7204        0.000066657
a ∩ c               0.2491        0.000785420

Table 6.8: A comparison of correlation and mean difference between 3 different audio segments

The overall best match found across all audio files was a match ratio of 0.448, yet the above samples in Figures 6.3, 6.2(e) and 6.11 were based on a 0.7 match. This measure of similarity is merely used to obtain the best option for repair. As a representative example, the samples chosen for comparison were arbitrary and no verse or chorus structure was known in advance. The purpose of this work is to repair dropouts with the best possible match from all previously received sections, not to repair a verse dropout with a previous verse section. For this reason, best possible matches with values as high as 0.9 may be used during the first rendition of a verse or chorus and will produce audio quality that can only be described as subjective at best. The following section evaluates SoFI's performance based on subjective feedback on the quality of audio perceived by a listener.

6.4 Subjective evaluation of SoFI

Quantitative evaluations discussed in the previous section demonstrate how accurate replacements can be measured when comparing errors. However, not every dropout occurrence can be accurately repaired, and comparisons to other approaches need to be performed to determine whether an acceptable level of repair is achieved across multiple scenarios. The quality of experience (QoE) is a measure of a customer's experience with a service/product. Unlike the quantitative evaluation, where a product is measured by how well it performs against its design specifications, QoE measures how well the end user perceives the product's performance based on their needs and expectations. In an effort to ascertain the listener's perceived QoE, a questionnaire designed to compare and contrast previous approaches to audio repair, as well as a number of different 'dropout' scenarios, is used. The following sections discuss the subjective evaluations performed on SoFI using feedback from listeners who were presented with a series of test audio files and asked to complete a questionnaire evaluating each type of repair.

6.4.1 Subject listeners

Sixteen subjects were invited to participate in this experiment. The subjective evaluation questionnaire can be found in Appendix G. The statistical data on age, gender and time spent listening to music per week is given in Appendix H, Table H.1; Figure 6.12 shows this data in the form of pie charts. Over 50% of the test subjects were in the age group of 18 to 25, with over 60% being male. The subjects were asked to rate their musical knowledge on a scale of one to ten in order to gain an understanding of their auditory perception skills: a keen musician would be able to discern any irregular changes in a piece of music better than a novice listener. The pie chart in Figure 6.12(c) shows that almost 70% consider themselves average, scoring their abilities at 4, 5 or 6. Only two subjects (13%) considered themselves above average, with listening skills of 7 and 9. None of the subjects considered themselves at a skill level for writing music.

Figure 6.13 shows the subjects' general background listening habits and their experiences of listening to music online. Over 80% use both wired and wireless network connections for Internet access and, of those, over 50% use wireless more than wired. This shows a correlation between Figures 6.13(a) and 6.13(c), where it can be seen that over 80% have experienced dropouts whilst listening to a music broadcast. One statistic not displayed is the number of listeners with hearing problems. This question was included as a precautionary measure to ensure that subjects would not be hindered in their evaluation by low sound quality. Only one subject answered 'yes' to this, but she presented no symptoms that would hinder the evaluation.


(a) Gender (b) Age group

(c) Music knowledge

Figure 6.12: Demographic data for 16 subjects that participated in the evaluation of SoFI

(a) Network medium (b) Hours listening online

(c) Experienced dropouts

Figure 6.13: Listening habits for 16 subjects that participated in the evaluation of SoFI


6.4.2 Evaluation questionnaire

Using a ten point Likert scale21, test subjects were given eight audio files to listen to in succession and asked to rate the repair in a questionnaire. An example of the questionnaire used can be found in Appendix G. Since the tests were to be the same for each listener, the audio repair in each song was simulated and stored locally as a wave file. Using sample time-points identified as similar within the collection of songs presented in Appendix D, specific experiments were executed at predetermined time-points. The point of failure for each song was decided by the particular test, e.g., a verse failure or a failure at the beginning of the song, as was the duration of the dropout. The audio files were then reconstructed based on the identified segments using a wave editor to ensure precise matching of time-points. This removed any possible ambiguity introduced by re-running the audio broadcasts multiple times with varying network characteristics. Each song was evaluated when it had completed and before the next began. Once all songs had been played at least once, a general rank for all the songs was requested from each test subject. This rank was based on an order of preference from one to eight, with one being the most favoured method of repair and eight the least favoured.

6.4.3 Audio repair

Table 6.9 shows the eight different songs used in the evaluation tests. The first three songs include repair approaches comparable to those presented in Chapter 2, Section 2.12, applied to different songs. All eight songs follow the Western Tonal Format (WTF) of intro/verse/chorus/verse/chorus, with an additional bridge included. The first song includes linear interpolation to bridge the missing five second segment; this was performed by determining the overall frequency of the audio signal at time-point x and time-point y and creating a signal to bridge the gap based on these frequencies. Song 2 includes the low-bandwidth redundancy approach, where a second signal is broadcast at a much lower sampling rate; when a dropout occurs this lower quality audio has a higher chance of still arriving at its destination. A down-sampled segment of the audio was used to replace the dropout at a sample rate that equates to just below telephone audio quality: 8 bit mono at 4000 Hz. Song 3 includes the re-ordered packet effect that reduces the dropout effect to appear as jitter; however, as this evaluation was performed 'offline', a simulation was created by re-ordering and repeating small segments of the audio for the duration specified. Songs 4, 5, 7 and 8 were all based on SoFI's best-effort analysis and use varying levels of similar matches as identified during analysis. Although the audio was not broadcast as a live audio stream, the effects of the audio replacement are the same, and performing a simulation provides the opportunity for subjects to evaluate the best and worst-case scenarios that SoFI would be expected to repair. Song 6 acted as a control test of listeners' attention to the audio, to ensure random marks were not supplied as a quick response to the survey.

21A Likert scale is a psychometric scale used frequently in questionnaires. When responding to a Likert questionnaire statement, subjects specify a level of agreement with the statement. The scale is named after Rensis Likert, who published a report describing its use (Likert, 1932).

Song   Repair type            Details of repair                  Duration of repair (s.)
1      Linear interpolation   N/A                                5 s.
2      Down sampling          8 bit mono @ 4000 Hz.              5 s.
3      Jitter                 0.5 s. continuous sections         5 s.
                              repeated or moved
4      5 s. chorus repair     Taken from an early chorus         5 s.
                              section with an average match
                              ratio of 0.720
5      5 s. best match        Best match ratio of all            5 s.
                              sections of the song of 0.409
6      Control audio with     N/A                                N/A
       no repair
7      10 s. repair           Repair a 10 s. dropout using       10 s.
                              a 5 s. match
8      Low match ratio        Use a low match ratio of 0.890     5 s.
       5 s. repair            as replacement

Table 6.9: Test songs using different approaches to dropout repair

6.4.4 Subjective evaluation results

Data collated from the responses to the subjective evaluation questionnaires is given in Appendix H, Tables H.2 and H.3. A summary of these results is presented in Table 6.10, showing the average score given by the subjects for each song, the average score for the three alternative methods that were not performed by SoFI, and the average score for each repair based on the simulated SoFI test-case scenarios given in Table 6.9.

                        Other Repairs          SoFI Repairs
Song                    1      2      3        4      5      7      8
Average Score           1.56   3.38   3.06     5.94   6.38   8.44   7.63

Overall Average         4.55
Average Other Repairs   2.67
Average SoFI Repairs    7.09

Table 6.10: Summary of subject listening evaluation

The average score for each subject is shown in Figure 6.14. The blue column represents the average mark across all repair methods. The orange column represents the average given for songs 1, 2 and 3, which use varying types of repair to mask a bursty network dropout. The final column for each listener represents the average score for the four different simulated attempts made by SoFI to replace varying time-points and durations of dropouts in the signal. We make the following observations:

1. Songs 1, 2 and 3 obtain a lower than average score

2. The average score across all listeners is directly related to the scale of marking on a per-listener level

3. The average scores for songs 4, 5, 7 and 8, which include simulated SoFI repairs, are significantly higher than those for songs 1, 2 and 3.

These results show a consensus of opinion across all subjects that, although some level of change in the audio stream is perceived, listeners indicate an increased level of acceptance of simulated attempts at audio repair by SoFI when compared with current alternatives.

The mean evaluation score for the other methods of repair was 2.67 out of a possible 10, well below the overall mean of 4.55 across all songs. The mean score obtained for the attempted best-effort repairs under differing circumstances was 7.09. This indicates a much higher level of acceptance overall: SoFI's repairs scored, on average, more than two and a half times higher than the other methods.


Figure 6.14: A comparison of average subject scores for song repair types: other approaches, SoFI repairs and overall average

Looking at the demographic data collected from the evaluation questionnaire (given in Appendix G), no correlation could be found between age or gender and the mean scores, or the corresponding variance between different scores.

6.4.5 Subjective ranking of songs by test subjects

The final part of the evaluation questionnaire (given in Appendix G) requested the subjects to list the order of preferred repair based on how successful it was at masking a dropout from the listener. Prior to completing this section, each repair approach was explained to the subjects, who were also allowed to listen again to the specific section where the error occurred in each song. With more attentive perception, listeners were more aware of what type of repair they were listening to and ranked each song accordingly on a scale of 1 to 8. All subjects correctly identified that song 6 had no audio changes and so automatically ranked it top with a score of 8. Figure 6.15 shows the final mean rank for each method of repair across all subjects22.

This correlates closely with the mean scores subjects provided, as discussed in Section 6.4.4. Figure 6.15 shows ranks of between 4 and 7 assigned to songs 4, 5, 7 and 8, whilst songs 1, 2 and 3 were ranked the lowest.

22Table H.3 shows the perceived ranks on a per-listener level as well as the mean score.


Figure 6.15: User evaluation rank for each song

6.4.6 Subjective evaluation of baroque classical music repair

An additional subjective evaluation was performed using five listeners, as a means to evaluate the effectiveness of the audio repair when a simulated dropout occurs while listening to music from the baroque classical period. Using a questionnaire similar to the one presented in Appendix G, listeners were asked to evaluate the three pieces of music from the baroque classical period introduced in Section 6.2.2. Each piece of music had three sections of audio replaced, each with a duration of 5 seconds, using the 'best match' evaluation performed in Section 6.2.2. The first simulated dropout contained a replaced section of audio with the worst match, the second an average match, and the third the best possible match for a time-point near the end of the music. From Table 6.11 a clear level of good user acceptance can be seen when either an average match or best match repair is used. As expected, a poor level of acceptance is shown when the worst match ratio is used as an audio repair. What can also be seen is a direct correlation between the level of user acceptance and the level of self-similarity within the music. Vivaldi, with the best (lowest) average match ratio, as shown in Table 6.7, has the highest overall level of acceptance by listeners, as shown in Table 6.11. Similarly, Handel has the overall worst average match ratio and also scores lowest in the user acceptance evaluations.


                                 Listener ranking
Composer/Music                  Worst    Average   Best     Average
                                match    match     match    ranking
J.S. Bach: Air on a G String    3.2      6.8       7.6      5.86
Handel: Water Music Suite       1.8      6         6.8      4.86
No. 1 in F, Allegro
Vivaldi: The Four Seasons,      3.6      7         7.8      6.13
Spring

Table 6.11: Summary of baroque classical music evaluation

6.4.7 Feedback from subjects

Upon completion of the evaluation questionnaires, subjects were given the opportunity to discuss in general what they had listened to and how much they had understood about each repair approach. During these discussions it became apparent that listeners were unaware of the full extent of the duration of repair23. Subjects were informed that any repair section heard in songs 4, 5, 7 and 8 came from a similar preceding section of the audio and that entire sections of lyrics differed from the original. Every subject found this to be a surprise: they had not noticed any change in lyrics or rhythm, other than the jitter effect when audio was swapped at the start and end time-points of the swap between sections.

The most interesting statistic in all the data from the subjective evaluation is reflected in the rank of repair for each method. Songs 7 and 8 use a match ratio of average and below average respectively, and yet received the highest ranks. One possible justification is that the tempo of song 7 is very high, making it difficult to detect any difference in the audio when swapping between sections; this masks the expected jitter effect better than in songs of a slower tempo. Another possible justification is that the point at which a swap was performed was during a low frequency point, where lyrics and music are minimal, leaving the swap less apparent. Since song 7 demonstrated a 10 s. section repair, subjects did comment that they had almost forgotten that a swap had occurred.

23All subjects were informed of the duration of repair attempts prior to beginning the questionnaire.


6.5 Summary

This chapter discussed the evaluation of SoFI through both objective and subjective evaluation techniques. Objective evaluation included the comparison and identification of optimal k clusters with differing cluster sizes. Through string matching, differing time lengths were evaluated as possible replacement sections, with justification for the use of 5 s. lengths. Problems with songs that are not strictly in Western Tonal Format (WTF) were discussed, together with analysis results presenting best-effort matching sections that could be improved upon. A visual representation of similarly identified sections was given with a spectrogram representation, providing a view of each frequency across time; using correlation to compare similar and dissimilar segments, the success of SoFI's best-effort matching can be seen.

Subjective evaluation with sixteen test subjects listening to and rating simulated dropout scenarios allowed a comparison of SoFI's simulated performance against other simulated approaches, together with a comparison of how successful SoFI's simulated performance was under different dropout scenarios. The subjective evaluation results showed that a greater level of acceptance of packet dropouts is perceived by subjects, with acceptance scores well over twice those of the other methods. A ranking measure provided by subjects shows a higher level of acceptance for simulated audio repairs performed by SoFI and, within these, which scenario was most effective. The chapter concluded with a discussion of general feedback provided by subjects.


CHAPTER

SEVEN

Conclusion and future work

Forward error correction using self-similarity on a wireless bursty network concerns various areas in the field of Music Information Retrieval. In this chapter we conclude by first summarising the thesis. Next, the research is compared with other related work. Finally, there is a discussion of further work and applications of the research.

7.1 Summary

In this thesis we have discussed the problems of, and solutions for, repairing dropouts whilst streaming audio across wireless bursty networks. Previous research in the field of Music Information Retrieval was presented, including query-by-humming, melody recognition, recommender systems, music representation in textual format, music indexing systems for similar music queries and content-based retrieval approaches, along with associated tools. We also investigated the syntax and semantics of music as meaningful textual representations of music content. Frequency and pitch detection were reviewed, along with beat detection, to aid in characterising the rhythm and tempo of music and audio. The problems associated with jitter and streaming media were reviewed alongside systems that attempt repair at a network level.

Having examined the associated problems and previous work on identifying music similarity and audio repair of time-dependent audio broadcasts, we designed an approach that utilises self-similarity identification for automatic repair of bursty dropouts in streamed audio. Feature extraction of audio utilising MPEG–7 low-level descriptors was presented as a method of data reduction that allows pertinent data to be retained whilst minimising the overall volume of data to be analysed. SoFI presents an approach to audio repair that utilises feature extraction and similarity detection, together with distance measures of successful self-similarity detection, to perform receiver-based audio repair when dropouts of time-dependent audio occur. Based on this approach SoFI (Song Form Intelligence), an intelligent media streaming/receiving system that utilises self-similarity to repair bursty errors when receiving time-dependent audio broadcasts, was implemented. Utilising the MPEG–7 audio spectrum envelope features, k-means clustering and categorical measurement of distance, segmentation and classification of audio in Western Tonal Format was performed. Using the resultant output of the similarity identification processes on the client side enabled SoFI to identify network dropouts and use previously received sections of audio stored locally for best-effort replacement.
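To make the client-side step concrete, the following minimal sketch (Python, for illustration only; the function names and file handling are assumptions rather than SoFI's actual implementation) shows how a receiver could map a dropout time-point to a previously received section using the kind of similarity output listed in Appendix I:

def load_similarity_table(path):
    """Read 'query-frame  best-match-frame  score' triples (cf. Appendix I)."""
    table = {}
    with open(path) as f:
        for line in f:
            query, match, score = line.split()
            table[int(float(query))] = (int(float(match)), float(score))
    return table

def replacement_position(table, dropout_frame):
    """On a dropout, return the best-matching earlier frame held in the
    local buffer, or None if no usable replacement was identified."""
    entry = table.get(dropout_frame)
    return entry[0] if entry else None

For example, using the listing in Appendix I, a dropout at frame 15931 would resume playback from frame 5690 of the locally buffered audio until the live stream recovers.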

Finally, evaluation of SoFI using both subjective and objective techniques was performed. Results show a close correlation between similarly identified segments when compared to non-similar sections. Subjective evaluations with test subjects from simulated example scenarios show a greater level of acceptance of the audio repair when compared to alternative approaches. Using a Likert scale, the average score across all listening tests gave an acceptance score of 7.09/10, with the maximum score of 8.44/10 achieved when the longest dropout and subsequent repair was presented. A rank of preferred 'repairs' showed that acceptable repairs can be of a low match depending on the content of the audio.

Specifically, the main findings within this thesis are:

1. Combining the audio spectrum envelope features with k-means clustering enables similar audio frames to be classified. The MPEG–7 audio spectrum envelope representation enables pertinent data to be retained whilst reducing the overall data being analysed to a minimum. Clustering the audio into similarly identified groups enables large sections of audio to be analysed and compared (a sketch of this pipeline is given after the list).

2. Integrating categorical measurement of distance to determine a best possible match in previous sections within an audio file as similar enables automatic swapping between live audio streams and previously received portions of the audio stored locally in a time-dependent manner.

3. Forward Error Correction (FEC) can be performed using previously received audio broadcast across bursty networks without adversely increasing bandwidth, unlike other approaches that attempt to repair network dropouts using 'synthesised' or greatly reduced signal quality as replacements.


4. Providing a novel approach to FEC that combines the latest metadata representation with a classical clustering and string matching technique in an attempt to minimise large dropouts in an audio broadcast on wireless bursty networks.
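As a concrete illustration of findings 1 and 2, the following minimal sketch (Python; it assumes per-frame ASE vectors are already available, for instance parsed from the MPEG–7 XML of Appendix C, and its parameter values are illustrative rather than SoFI's exact settings) clusters the frames with k-means and then searches all earlier positions for the best categorical match:

import numpy as np
from sklearn.cluster import KMeans

def label_frames(ase_vectors, k=30):
    """Cluster the per-frame ASE vectors so each frame gets a symbolic label."""
    return KMeans(n_clusters=k, n_init=10).fit_predict(ase_vectors)

def best_earlier_match(labels, query_start, query_len):
    """Slide the query label-string over all earlier positions and keep the
    position with the highest proportion of matching labels."""
    query = labels[query_start:query_start + query_len]
    best_pos, best_ratio = None, -1.0
    for pos in range(query_start - query_len):
        ratio = np.mean(labels[pos:pos + query_len] == query)
        if ratio > best_ratio:
            best_pos, best_ratio = pos, ratio
    return best_pos, best_ratio

With the 10 ms hop size used in the ASE extraction, a 5 s replacement section corresponds to a query length of 500 frames.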

7.2 Relation to other work

SoFI relates to other work within the same theoretical areas of content based retrieval systems such as query-by-humming and pattern matching systems, as discussed in Chapter 2, Sections 2.3, 2.4.2 and 2.3.4. The problem of audio dropouts is not limited to the field of Music Information Retrieval, and attempts at 'minimising' the effect at packet level have also been investigated by other work. This section gives a comparison of SoFI with both related research areas.

7.2.1 Music similarity and pattern matching

The understanding and analysis of the structure of music requires multiple representations and there is no definitive form that is right for all purposes. Specific applications require specific forms of analysis and representation. In the case of similarity identification within music, the method of representation varies based on what properties are being analysed. Within the area of recommender systems such as Themefinder (Section 2.3.1), audio fingerprints are generated from monophonic queries and regular expressions are used to find similar audio within collections. Beat tracking systems such as those discussed in Section 2.6 present similarity retrieval by rhythmic similarity. Although SoFI uses a different representation, the overall objective of SoFI is similar in that similarity of audio is the main goal, albeit SoFI's main focus is within the same audio file and a closer 'matching' relationship between sections as opposed to how 'similar' two different audio files are. Most of the related work within the field of MIR uses one of the following forms of signal representation: singular value decomposition, principal component analysis, or pitch/frequency detection. However, SoFI retains as much of the original signal as possible in a condensed form through the MPEG–7 audio spectrum envelope. By retaining the important content across the spectrum a more informed comparison is made.

String matching and clustering, discussed in Section 3.2.5 within the field of pattern identification and matching, are shown to be a reliable and accepted approach. Within the field of MIR systems, similarity classifiers present three stages, as given in Section 3.2.4:

• A sensor that gathers the observations

• A feature extraction mechanism that computes numeric or symbolic information from the observations

• A classification scheme that performs the actual classifying

SoFI conforms to these steps through the use of the ASE representation as the sensor, the clustering into groups as a symbolic representation and, by performing string matching comparisons, the classification of the measures of similarity for each frame.

7.2.2 Packet loss streaming media

A union between the similarity-based MIR approach to identifying similar sections and network critical level identification when packet loss dropouts occur provides a unique approach to a problem that has until now only been approached on a packet level. Utilising network monitoring and swapping audio sources gives a similar approach to Lee & Chanson (2004), discussed in Section 2.12. Audio is dynamically changed based on the characteristics of network flow, but listeners are provided with a 'best possible' alternative with the aim of minimising disruption. SoFI, however, does not use compression; through the implementation of swapping between sources it introduces a minimal jitter effect at the start and end time-points of the swap. This jitter effect is presented in Section 2.10 as an associated problem within audio playback. However, it is shown in Section 2.12 how it can be used to reduce potentially large dropouts down to smaller and less noticeable effects. Depending on how accurate a replacement is, and whether either the start or end segments contain lyrics, jitter can be more apparent. However, through the subjective testing of SoFI discussed in Section 6.4 and a comparison with approaches that use this effect, it can be seen that SoFI produces a much more acceptable level of repair even under poor match conditions.

7.3 Future work

During the development of this work a number of potential areas have been identified that could be explored to enhance the listening experience. Although all the quantitative evaluation results were generated from actual SoFI output, the subjective evaluation tests could be performed with SoFI carrying out the replacement itself rather than by simulation. Section 4.2 applies an empirical value through iterative testing and repetition. The identification of an optimal value for k clusters could potentially be determined by the use of statistical criteria or by using clustering validity indices so that the appropriate number of clusters is automatically selected. Milligan & Cooper (1985) presented a comparison study on over thirty validity indices for hierarchical clustering algorithms. This was further investigated in a comparative study by Dimitriadou et al. (2002) of over fifteen validity indexes for the case of binary data. Another variation for cluster choice is suggested by Fraley & Raftery (1998), where a Bayesian Information Criterion (BIC) can be applied as a selection criterion for choosing the optimal number of clusters and their centroids.
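A minimal sketch of such model-based selection (Python; the Gaussian mixture model family follows the setting of Fraley & Raftery (1998) and is an assumption here, not part of SoFI's current implementation) is:

from sklearn.mixture import GaussianMixture

def choose_k_by_bic(ase_vectors, k_range=range(2, 61)):
    """Fit a mixture model for each candidate k and keep the k whose BIC
    (goodness of fit penalised by model complexity) is lowest."""
    best_k, best_bic = None, float("inf")
    for k in k_range:
        gmm = GaussianMixture(n_components=k, covariance_type="diag",
                              random_state=0).fit(ase_vectors)
        bic = gmm.bic(ase_vectors)
        if bic < best_bic:
            best_k, best_bic = k, bic
    return best_k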

Further analysis of the matching sections may provide an alternative method of determining the structure of the audio by identifying different sections, i.e. verse or chorus. Section 6.3.2 gives comparisons based on an unknown structure of audio; if a verse/chorus/verse structure with associated start and end time points is known, a more efficient and possibly more accurate 'similarity match' could be generated.24 Early indications of this can be seen in Appendix I, where three successive sections of the audio are shown. Such changes between differing time-points in the identified matches could be applied on a broader level for verse/chorus identification.

24 Abdallah et al. (2005) discuss a Bayesian framework using Expectation Maximization and Maximum Likelihood approaches to segmentation to identify the music structure of verse and chorus.

On a more detailed level, the subjective feedback by test subjects highlighted jitter problems when swapping between sections of the audio. A potential improvement here would be to perform swaps during silent periods. In particular, lyrics have been identified as the most noticeable if a swap occurs. A possible approach would be to perform the swap between the fall and rise of particular frequencies to account for both male and female vocals. Section 6.3.1 points out that the frequency of male vs. female voice may contribute to the accuracy of the best-effort approach. An investigation comparing female and male artists could be performed to determine whether gender and vocal frequency range affect the match ratio.
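A hedged sketch of this refinement is given below (Python; the band range standing in for the vocal region and the size of the search window are assumptions, and this is not currently implemented in SoFI):

import numpy as np

def quietest_swap_point(ase_vectors, nominal_frame, window=50,
                        vocal_bands=slice(2, 7)):
    """Search +/- window frames around the nominal swap point and return
    the frame whose energy in the assumed vocal bands is lowest."""
    lo = max(0, nominal_frame - window)
    hi = min(len(ase_vectors), nominal_frame + window)
    band_energy = ase_vectors[lo:hi, vocal_bands].sum(axis=1)
    return lo + int(np.argmin(band_energy))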

Early sections of audio provide limited match results but, as shown in Table 6.3, Section 6.2, shorter lengths of time produce reasonably accurate results. Using a smaller query length at the beginning of an audio file and progressing to a larger query 'string' further on in the similarity matching process may produce improved performance overall.

One of the more interesting aspects of the evaluation of this work was the subjective listening tests. Listeners initially found it difficult to comprehend the 'duration' of the repair and this may have caused biased results, as discussed in Section 6.4.7. An additional subjective evaluation with more aware listeners could possibly give a more refined evaluation of different levels of performance based on the accuracy of the match ratio at time-points as well as differing sections of the song. This could also be conducted by selecting the same subjects, since they are now more aware of what they are expecting to hear.

7.4 Conclusion

The central aim of this work is to investigate self-similarity within audio in an attempt to repair network dropouts that primarily occur on wireless bursty networks. Listeners of a live audio stream that relies heavily on time-dependent data are presented with an approach that successfully minimises these dropouts when they occur. To demonstrate this approach, SoFI, a best-effort pattern matching system, was implemented. SoFI applies MIR approaches to similarity identification within audio and classifies similar sections. These classified sections are then used as reference time-points on the client side by the receiving media framework as possible replacements when dropouts occur. This work includes an implementation and combination of string matching, clustering and the use of the MPEG–7 low-level descriptor Audio Spectrum Envelope to provide an approach to client-side repair of bursty audio streams.

Objective and subjective evaluations show how this approach contributes to the area of forward error correction whilst streaming time-dependent audio. Suggestions for future work include: segmentation of audio into intro, verse, chorus, verse, chorus with the aim of improving efficiency and reducing jitter; combining varying lengths of query strings depending on the section of audio being classified; and the introduction of probability to determine an optimum value of k for clusters and their starting points.


APPENDIX A

MusicXML Output

The following XML listing is a sample MusicXML representation of the 12 Bar Blues audio file as produced by the Recordare application. The representation allows music to be described in text format by using tags to identify pertinent information such as midi-channel, midi-instrument, major/minor mode and beats.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE score-partwise PUBLIC "-//Recordare...//DTD-MusicXML-1.0-Partwise//EN" "/musicxml/partwise.dtd">
<score-partwise>
  <work><work-title>12 Bar Blues</work-title></work>
  <identification>
    <encoding>
      <software>Guitar Pro 5</software>
      <encoding-description>MusicXML 1.0</encoding-description>
    </encoding>
    <miscellaneous></miscellaneous>
  </identification>
  <part-list>
    <score-part id="P1">
      <part-name>12 Bar Blues</part-name>
      <score-instrument id="P1-I1">
        <instrument-name></instrument-name>
      </score-instrument>
      <midi-instrument id="P1-I1">
        <midi-channel>1</midi-channel>
        <midi-program>25</midi-program>
      </midi-instrument>
    </score-part>
  </part-list>
  <part id="P1">
    <measure number="1">
      <attributes>
        <divisions>1</divisions>
        <key><fifths>0</fifths><mode>major</mode></key>
        <time><beats>4</beats><beat-type>4</beat-type></time>
        <clef><sign>TAB</sign><line>5</line></clef>
        <staff-details>
          <staff-lines>6</staff-lines>
          <staff-tuning line="6"><tuning-step>E</tuning-step><tuning-octave>5</tuning-octave></staff-tuning>
          <staff-tuning line="5"><tuning-step>B</tuning-step><tuning-octave>4</tuning-octave></staff-tuning>
          <staff-tuning line="4"><tuning-step>G</tuning-step><tuning-octave>4</tuning-octave></staff-tuning>
          <staff-tuning line="3"><tuning-step>D</tuning-step><tuning-octave>4</tuning-octave></staff-tuning>
          <staff-tuning line="2"><tuning-step>A</tuning-step><tuning-octave>3</tuning-octave></staff-tuning>
          <staff-tuning line="1"><tuning-step>E</tuning-step><tuning-octave>3</tuning-octave></staff-tuning>
        </staff-details>
      </attributes>
      <sound pan="8" tempo="120"></sound>
      <barline location="left"><bar-style>heavy-light</bar-style></barline>
      <note>
        <pitch><step>A</step><octave>3</octave></pitch>
        <duration>1</duration><voice>1</voice><type>quarter</type>
        <notations>
          <dynamics><f></f></dynamics>
          <technical><string>5</string><fret>0</fret></technical>
        </notations>
      </note>
      <note>
        <chord></chord>
        <pitch><step>E</step><octave>4</octave></pitch>
        <duration>1</duration><voice>1</voice><type>quarter</type>
        <notations>
          <dynamics><f></f></dynamics>
          <technical><string>4</string><fret>2</fret></technical>
        </notations>
      </note>
      <note>
        <pitch><step>A</step><octave>3</octave></pitch>
        <duration>1</duration><voice>1</voice><type>quarter</type>
        <notations>
          <dynamics><f></f></dynamics>
          <technical><string>5</string><fret>0</fret></technical>
        </notations>
      </note>
      <note>
        <chord></chord>
        <pitch><step>E</step><octave>4</octave></pitch>
        <duration>1</duration><voice>1</voice><type>quarter</type>
        <notations>
          <dynamics><f></f></dynamics>
          <technical><string>4</string><fret>2</fret></technical>
        </notations>
      </note>
      <note>
        <pitch><step>G</step><octave>5</octave></pitch>
        <duration>1</duration><voice>1</voice><type>quarter</type>
        <notations>
          <dynamics><f></f></dynamics>
          <technical><string>1</string><fret>3</fret></technical>
        </notations>
      </note>
    </measure>
    .....................
  </part>
</score-partwise>


APPENDIX B

12 Bar Blues Music Score

A music notation representation of the 12 Bar Blues piece described in Appendix A, showing the difference in size when different representations are used.

Figure B.1: A sheet music representation of the 12 Bar Blues music file


APPENDIX C

MPEG–7 XML Output

The following XML listing is a sample of the MPEG–7 representation of the Audio Spectrum Envelope analysis of an audio file in XML format.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<Mpeg7 xmlns="urn:mpeg:mpeg7:schema:2001"
       xmlns:mpeg7="urn:mpeg:mpeg7:schema:2001"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <Description xsi:type="ContentEntityType">
    <MultimediaContent xsi:type="AudioType">
      <Audio xsi:type="AudioSegmentType">
        <MediaInformation xsi:type="MediaInformationType">
          <MediaProfile xsi:type="MediaProfileType">
            <MediaFormat xsi:type="MediaFormatType">
              <Content href="MPEG7ContentCS" xsi:type="ControlledTermUseType">
                <Name>audio</Name>
              </Content>
              <FileSize>7286444</FileSize>
              <BitRate>393216</BitRate>
              <AudioCoding>
                <AudioChannels>2</AudioChannels>
                <Sample bitsPer="16" rate="48000.0"/>
              </AudioCoding>
            </MediaFormat>
            <MediaInstance xsi:type="MediaInstanceType">
              <InstanceIdentifier xsi:type="UniqueIDType">UniqueID0000000000</InstanceIdentifier>
              <MediaLocator xsi:type="MediaLocatorType">
                <MediaUri>file:/C:/MatWorkFolder/WaveOut.wav</MediaUri>
              </MediaLocator>
            </MediaInstance>
          </MediaProfile>
        </MediaInformation>
        <AudioDescriptor hiEdge="16000.0" loEdge="62.5" octaveResolution="1"
                         xsi:type="AudioSpectrumEnvelopeType">
          <SeriesOfVector hopSize="PT10N1000F" totalNumOfSamples="37930" vectorSize="10">
            <Raw mpeg7:dim="3793 10">
              8.1016106E-4 4.8894755E-8 8.804471E-8 8.127679E-8 4.528862E-8
              2.2494097E-8 1.3204225E-8 1.3050797E-8 8.736412E-9 5.1660733E-9
              8.095515E-4 4.9329174E-8 8.962045E-8 8.201905E-8 4.4893625E-8
              2.2498998E-8 1.3583877E-8 1.3001844E-8 8.298367E-9 5.6540315E-9
              ....................
            </Raw>
          </SeriesOfVector>
        </AudioDescriptor>
      </Audio>
    </MultimediaContent>
  </Description>
</Mpeg7>
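The ASE matrix can be recovered from such a file for further analysis. The following short sketch (Python, based on the element names shown above; it is intended as an illustration rather than the toolchain used in this work) parses the Raw series back into a frames-by-bands array:

import numpy as np
import xml.etree.ElementTree as ET

NS = {"m": "urn:mpeg:mpeg7:schema:2001"}

def load_ase(path):
    """Return the Audio Spectrum Envelope as a (frames x bands) array."""
    root = ET.parse(path).getroot()
    raw = root.find(".//m:Raw", NS)
    rows, cols = (int(x) for x in
                  raw.get("{urn:mpeg:mpeg7:schema:2001}dim").split())
    data = np.array([float(v) for v in raw.text.split()])
    return data.reshape(rows, cols)  # e.g. 3793 frames x 10 log-bands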


APPENDIX D

Audio sample music test data

Table D.1 lists the collection of song test data used for testing and evaluation purposes. Songs are listed alphabetically and categorised by genre. The definition of Degree of WTF is merely an indicator of the repetitive nature of the song in relation to the Western Tonal Format (WTF) of popular chart music, i.e. how often sections are repeated within the song, with no metric of measurement. Most of the songs were released as singles within the United Kingdom, the majority of which reached the UK top 40 singles chart. Some songs are from albums that reached the UK top 40 album chart and contained at least one 'hit single' release.


Song                    Artist           Genre            Duration (m.)   Degree of WTF
12 Bar Blues            N/A              Blues/Jazz       0:37            Medium
All the Right Friends   R.E.M.           Pop/Rock         2:48            High
Anywhere Is             Enya             New Age          3:46            Low
At My Most Beautiful    R.E.M.           Pop/Rock         3:46            Medium
Baby One More Time      Britney Spears   Pop/Electro      3:31            High
Book Of Days            Enya             New Age          2:56            Low
Caribbean Blue          Enya             New Age          3:58            Low
Crazy in Love           Beyonce          Pop/R&B          3:56            High
Daysleeper              R.E.M.           Pop/Rock         3:40            High
Don't Stop The Music    Rihanna          Hip Hop          5:39            Medium
Hole in the Head        Sugababes        Pop              3:38            High
Imitation of Life       R.E.M.           Rock             3:58            Medium
Nine Million Bicycles   Katie Melua      Pop/Jazz/Blues   3:15            High
Orange Crush            R.E.M.           Pop/Rock         3:52            Medium
Orinoco Flow            Enya             New Age          4:26            Low
Stand                   R.E.M.           Pop/Rock         3:12            High

Table D.1: Songs used in experiments with Western Tonal Format (WTF) level


APPENDIX E

A similarity comparison of MPEG–7 ASE

A comparison of the original MPEG–7 representation of the query string and a preceding section identified as most similar for each of the log frequency ranges is shown below.

(a) Low Edge
(b) Log Frequency Band 1
(c) Log Frequency Band 2
(d) Log Frequency Band 3
(e) Log Frequency Band 4
(f) Log Frequency Band 5
(g) Log Frequency Band 6
(h) Log Frequency Band 7
(i) Log Frequency Band 8
(j) High Edge
(k) All Frequency Bands

Figure E.1: Comparison of two different 5 second segments identified as similar


APPENDIX F

A representation of 6 s. of audio

The standard spectrogram representation as shown in Figure F.1 is a general overview of the file where background noise and equalisation trends may be visually evident, but musical features contained within the signal are less apparent. Figure F.1(l) shows the spectral representation of the original query section. Figures F.1(m) and F.1(n) show a spectral representation of similarly identified sections.

(l) Original query section

(m) Best match at 20 s.

(n) Best match at 2 minutes 45 seconds

Figure F.1: Basic spectral representation of three similar audio sections


APPENDIX G

SoFI subjective evaluation questionnaire

The purpose of this questionnaire is to evaluate the level of quality that can be perceived as acceptable by a subject when network traffic becomes an issue in relation to music streamed across a network. The following questionnaire is in two parts:

• The first section is designed to determine a background knowledge of your music awareness and habits.

• The second section is to determine what level of quality you perceive the music presented to be and to compare differing options for audio repair.

Section One: Subject details

Gender: Male o : Female o

Age: 18–25 o : 26–30 o : 31–40 o : 41–50 o : 51+ o

Do you have any hearing problems? Yes o or No o

1. If 1 is considered as someone who never listens to music for pleasure and 10 is someone who composes their own music, on a scale of 1 to 10 how would you rate your musical knowledge?

1——–2——–3——–4——–5——–6——–7——–8——–9——–10

2. How often do you listen to music in general?

<1 hour p/w o : 1–2 hours p/w o : 3–4 hours p/w o : 5–6 hours p/w o : 7+ hours p/w o

3. How often do you listen to music online?

<1 hour p/w o : 1–2 hours p/w o : 3–4 hours p/w o : 5–6 hours p/w o : 7+ hours p/w o

4. How often do you listen to live music streams such as radio or concert broadcasts?

<1 hour p/w o : 1–2 hours p/w o : 3–4 hours p/w o : 5–6 hours p/w o : 7+ hours p/w o

5. Have you ever experienced dropouts in the audio stream when listening to live music? Yes o or No o

6. Do you use a wired network cable or wireless for network/Internet access? Wired o : Wireless o : Both o


Section Two: Subjective listening evaluation

The following 8 songs use a variety of attempts to repair missing segments of a song. For the following questions please use a scale of 1 to 10 to rate the 'repair' of a missing segment of audio, where 1 = Extremely Unacceptable and 10 = Highly Acceptable. Please complete each question after each song is finished.

1. Song 1: Are you familiar with this song? Yes o or No o
Did you perceive any changes in the audio? If so, using the scale of 1 to 10 presented above, how do you rate the 'repair'?
1——–2——–3——–4——–5——–6——–7——–8——–9——–10

2. Song 2: Are you familiar with this song? Yes o or No o
Did you perceive any changes in the audio? If so, using the scale of 1 to 10 presented above, how do you rate the 'repair'?
1——–2——–3——–4——–5——–6——–7——–8——–9——–10

3. Song 3: Are you familiar with this song? Yes o or No o
Did you perceive any changes in the audio? If so, using the scale of 1 to 10 presented above, how do you rate the 'repair'?
1——–2——–3——–4——–5——–6——–7——–8——–9——–10

4. Song 4: Are you familiar with this song? Yes o or No o
Did you perceive any changes in the audio? If so, using the scale of 1 to 10 presented above, how do you rate the 'repair'?
1——–2——–3——–4——–5——–6——–7——–8——–9——–10

5. Song 5: Are you familiar with this song? Yes o or No o
Did you perceive any changes in the audio? If so, using the scale of 1 to 10 presented above, how do you rate the 'repair'?
1——–2——–3——–4——–5——–6——–7——–8——–9——–10

6. Song 6: Are you familiar with this song? Yes o or No o
Did you perceive any changes in the audio? If so, using the scale of 1 to 10 presented above, how do you rate the 'repair'?
1——–2——–3——–4——–5——–6——–7——–8——–9——–10

7. Song 7: Are you familiar with this song? Yes o or No o
Did you perceive any changes in the audio? If so, using the scale of 1 to 10 presented above, how do you rate the 'repair'?
1——–2——–3——–4——–5——–6——–7——–8——–9——–10

8. Song 8: Are you familiar with this song? Yes o or No o
Did you perceive any changes in the audio? If so, using the scale of 1 to 10 presented above, how do you rate the 'repair'?
1——–2——–3——–4——–5——–6——–7——–8——–9——–10


Overall quality

Of the 8 songs presented please provide a rank in order of preferred choice in relation to the quality of repair by placing a tick in the appropriate box for that song. A rank of 1 is the Least Preferred and a rank of 8 is the Most Preferred. Please only give one rank to one song; if you want to change the rank of a song, completely fill the square and tick another on the same row.

          Rank
          1   2   3   4   5   6   7   8
Song 1    o   o   o   o   o   o   o   o
Song 2    o   o   o   o   o   o   o   o
Song 3    o   o   o   o   o   o   o   o
Song 4    o   o   o   o   o   o   o   o
Song 5    o   o   o   o   o   o   o   o
Song 6    o   o   o   o   o   o   o   o
Song 7    o   o   o   o   o   o   o   o
Song 8    o   o   o   o   o   o   o   o

Table G.1: Subjective evaluation table of ranked preference of audio repairsuccess


APPENDIX H

Subjective evaluation results

Tables H.1, H.2 and H.3 show the data gathered from the returned questionnaires, collated into relevant formats. Table H.1 shows the background of the subjects in regard to demographics such as age, gender and their listening habits. Table H.2 presents the subjective evaluation of each attempted repair of a song using different techniques, and Table H.3 shows an overall view of how well each method of repair succeeded when compared to all other methods, with respective rankings.


                         Subjects (1–16)                                  Summary
                         1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Gender                   0  0  0  1  0  0  1  1  0  1  0  1  1  0  0  0  10 M / 6 F
Age                      1  1  1  2  2  1  1  3  4  1  1  2  3  2  1  1  9/4/3/1/0
Hearing difficulty       0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  15/1
Musical knowledge        4  3  5  6  4  4  5  6  7  4  5  5  3  7  3  5  4.75
Listening hours (p/w)    3  5  5  4  4  4  4  5  5  5  5  5  4  4  5  5  0/0/1/7/8
Listening online (p/w)   1  5  5  1  1  3  5  5  5  5  4  1  1  5  4  2  5/1/1/2/7
Music streams            1  4  4  1  1  2  4  5  4  4  2  1  1  4  4  1  6/2/0/7/1
Experienced dropouts     1  1  1  1  0  1  1  1  1  1  1  0  0  1  1  1  4/12
Wired/Wireless/Both      2  2  2  2  0  0  2  2  2  2  2  0  2  2  2  2  3/0/13

Table H.1: Demographics of subjects


Quality of Song          Subjects (1–16)                                                                          Average Score
                          1     2     3     4     5     6     7     8     9    10    11    12    13    14    15    16
1                         4     1     1     1     1     1     1     1     4     2     1     1     1     2     2     1   1.56
2                         7     6     1     2     3     5     6     3     4     3     1     2     2     3     3     3   3.38
3                         7     3     2     5     3     1     7     3     5     2     1     2     2     3     2     1   3.06
4                         8     9     5     5     5     4     8     7     5     5     4     7     4     6     7     6   5.94
5                         8    10     5     4     5     8     7     6     5     6     5     8     5     6     6     8   6.38
6 (a)                     –     –     –     –     –     –     –     –     –     –     –     –     –     –     –     –   –
7                        10     8     8     7     9    10     8    10    10     9     6     9     5    10     8     8   8.44
8                        10    10     7     6     9     3     8    10    10     9     5     9     5     6     8     7   7.63
Overall Average        6.75  5.88  3.63  3.75  4.38  4.00  5.63  5.00  5.38  4.50  2.88  4.75  3.00  4.50  4.50  4.25   4.55
Average for songs
1, 2 and 3             6.00  3.33  1.33  2.67  2.33  2.33  4.67  2.33  4.33  2.33  1.00  1.67  1.67  2.67  2.33  1.67   2.67
Average for SoFI       9.00  9.25  6.25  5.50  7.00  6.25  7.75  8.25  7.50  7.25  5.00  8.25  4.75  7.00  7.25  7.25   7.09

Table H.2: Subject listening evaluation

(a) Song 6 acted as a control and no test subjects perceived any changes and hence scoring is unnecessary.


Song    Subjects (1–16)                                   Average   Rank (Rounded)
         1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
1        1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1    1.00     = 1
2        3  3  3  4  2  3  2  2  2  3  3  3  3  2  3  2    2.69     = 3
3        2  2  2  3  3  2  3  3  3  2  2  2  2  3  2  3    2.44     = 2
4        4  5  4  5  4  6  4  5  5  5  5  5  5  4  4  5    4.69     = 4
5        5  7  5  6  5  5  6  4  4  4  6  4  4  7  5  4    5.06     = 5
6        8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8    8.00     = 8
7        7  4  7  7  7  7  7  7  7  7  7  7  6  6  7  7    6.69     = 7
8        6  6  6  4  6  4  5  6  6  6  4  6  7  5  6  6    5.56     = 6

Table H.3: Subject listening rank score


APPENDIX I

Similarity Output

The following values are the resultant output of the string matching algorithm. The first column presents the original time-point used as the query string. The second column presents the most similar section identified, and the final column shows how close the two segments match. The first section is in bold typeset, followed by a five second duration of the second section using normal font, and a change in sections is shown in bold for the third section. Using these changes between differing time-points as identified matches could be used on a broader level for verse/chorus identification.

1.59270e+04 1.51330e+04 8.53998e-01
1.59280e+04 1.51340e+04 8.53998e-01
1.59290e+04 1.51350e+04 8.53998e-01
1.59300e+04 1.51360e+04 8.55998e-01
1.59310e+04 5.69000e+03 8.57998e-01
1.59320e+04 5.69100e+03 8.55998e-01
1.59330e+04 5.69200e+03 8.55998e-01
1.59340e+04 5.69300e+03 8.55998e-01
1.59350e+04 5.69400e+03 8.55998e-01
1.59360e+04 5.69500e+03 8.55998e-01
1.59370e+04 5.69600e+03 8.55998e-01
1.59380e+04 5.69700e+03 8.55998e-01
1.59390e+04 5.69800e+03 8.55998e-01
1.59400e+04 5.69900e+03 8.55998e-01
1.59410e+04 5.70000e+03 8.55998e-01
1.59420e+04 5.70100e+03 8.55998e-01
1.59430e+04 5.70200e+03 8.55998e-01
1.59440e+04 5.70300e+03 8.55998e-01
1.59450e+04 5.70400e+03 8.55998e-01
...................................


1.59760e+04 5.73500e+03 8.63999e-01
1.59770e+04 5.73600e+03 8.63999e-01
1.59780e+04 5.73700e+03 8.63999e-01
1.59790e+04 5.73800e+03 8.63999e-01
1.59800e+04 5.73900e+03 8.63999e-01
1.59810e+04 5.74000e+03 8.63999e-01
1.59820e+04 5.74100e+03 8.63999e-01
1.59830e+04 5.74200e+03 8.63999e-01
1.59840e+04 7.82900e+03 8.61999e-01
1.59850e+04 7.83000e+03 8.61999e-01
1.59860e+04 7.83300e+03 8.59999e-01
1.59870e+04 7.83400e+03 8.59999e-01
1.59880e+04 7.83500e+03 8.59999e-01
1.59890e+04 7.83600e+03 8.59999e-01
1.59900e+04 7.83700e+03 8.57998e-01
1.59910e+04 7.83800e+03 8.57998e-01
1.59920e+04 2.46700e+03 8.57998e-01
1.59930e+04 2.46800e+03 8.55998e-01
1.59940e+04 2.46900e+03 8.53998e-01
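A short sketch of that broader use (Python; the triples are read in the three-column format described above and the jump threshold is an assumption chosen for illustration):

def section_boundaries(triples, jump=500):
    """triples: (query_frame, match_frame, score) rows as listed above.
    A large discontinuity in the matched frame marks a candidate
    verse/chorus boundary."""
    boundaries = []
    for (q0, m0, s0), (q1, m1, s1) in zip(triples, triples[1:]):
        if abs(m1 - m0) > jump:  # the matched region jumped elsewhere
            boundaries.append(q1)
    return boundaries

Applied to the listing above, this rule would flag the time-points where the best match moves from around frame 5742 to 7829, and again from 7838 to 2467.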


APPENDIX J

Enya visit to the University of Ulster

The following sections relate to the visit of Enya and her party to the Intelligent Systems Research Centre (ISRC) and the award of her Honorary Doctorate on July 10th, 2007 at the University of Ulster.

J.1 Enya visit to the Intelligent Systems Research Centre (ISRC)

The following photographs in Figures J.1 and J.2 show Enya, her manager and producer, Nicky Ryan, and her parents, Leo and Baba Brennan, viewing a demonstration of SoFI during a visit to the Intelligent Systems Research Centre (ISRC). Having extensive musical knowledge and skill, Enya posed some interesting questions in relation to self-similarity within audio and expressed genuine curiosity in the subject matter and results.

Figure J.1: Jonathan Doherty demonstrates SoFI to Enya party (1)


Figure J.2: Jonathan Doherty demonstrates SoFI to Enya party (2)

J.2 Enya Honorary Doctorate

The portrait photographs in Figures J.3 and J.4 show Enya when she received the honorary degree of D.Litt. (Doctor of Letters) in recognition of her services to music and the creative industries at the University of Ulster, Magee, July 10th, 2007.

Figure J.3: Enya receives Honorary Doctorate (D.Litt.) from University of Ulster


Figure J.4: Enya graduation with her parents Leo and Baba Brennan


REFERENCES

S. Abdallah, et al. (2005). ‘Theory and evaluation of a Bayesian music structure extractor’. Proc. ISMIR pp. 420–425.

M. Bartsch & G. Wakefield (2001). ‘To Catch a Chorus: Using Chroma-based Representations for Audio Thumbnailing’. Applications of Signal Processing to Audio and Acoustics, 2001 IEEE Workshop on, pp. 15–18.

A. Berenzweig, et al. (2004). ‘A Large-Scale Evaluation of Acoustic and Subjective Music-Similarity Measures’. Computer Music Journal 28(2):63–76.

J. Bilmes & C. Bartels (2005). ‘Graphical model architectures for speech recognition’.Signal Processing Magazine, IEEE 22(5):89–100.

H. Bischof, et al. (1999). ‘MDL principle for robust vector quantization’. Pattern Analysis & Applications 1:59–72.

J. Bolot, et al. (1999). ‘Adaptive FEC-based Error Control for Internet Telephony’. INFOCOM′99. Eighteenth Annual Joint Conference of the IEEE Computer and Communications Societies. Proceedings. IEEE 3:1453–1460.

R. Boulanger (2000). The Csound Book: Perspectives in Software Synthesis, Sound Design,Signal Processing, and Programming. MIT Press.

R. Boyer & J. Moore (1977). ‘A fast string searching algorithm’. Communications of theACM 20(10):762–772.

P. Bradley & U. Fayyad (1998). ‘Refining Initial Points for K-Means Clustering’. In Proc.15th International Conf. on Machine Learning, vol. 727, pp. 91–99. Morgan Kaufmann,San Francisco, CA.

M. Bukofzer (2008). Music in the Baroque Era: From Monteverdi to Bach. Von Elterlein Press.

J. Burred & A. Lerch (2003). ‘A hierarchical approach to automatic musical genre classification’. Proc. DAFx03 pp. 308–311.

S. Bush (2000). ‘Active Jitter Control’. Intelligence in Services and Networks (ISN)’00, February.


M. Casey (2002). ‘General sound classification and similarity in MPEG-7’. OrganisedSound 6(02):153–164.

A. Cater & N. O’Kennedy (2000). ‘You Hum it, and I’ll Play it’. 11th Conference on Artificial Intelligence and Cognitive Science.

J. Cederberg (2001). A Course in Modern Geometries. Springer.

W. Chai & B. Vercoe (2003). ‘Structural Analysis of Musical Signals for Indexing and Thumbnailing’. Digital Libraries, 2003. Proceedings. 2003 Joint Conference on pp. 27–34.

S. Chan, et al. (2006). ‘Video loss recovery with FEC and stream replication’. IEEE Transactions on Multimedia 8(2):370.

C. Charras & T. Lecroq (2004). Handbook of Exact String Matching Algorithms. King’sCollege Publications.

Y. Cheung (2003). ‘k*-Means: A new generalized k-means clustering algorithm’. PatternRecognition Letters 24(15):2883–2893.

M. Chiang & B. Mirkin (2007). ‘Experiments for the Number of Clusters in K-Means’. Lecture Notes in Computer Science 4874:395.

L. Chiariglione (2010). ‘Description of MPEG-7 Audio Low Level Descriptors’. Siteaccessed on: 18/03/2010.

C. Chinrungrueng & C. Sequin (1995). ‘Optimal adaptive k-means algorithm withdynamic adjustment of learning rate’. Neural Networks, IEEE Transactions on 6(1):157–169.

Y. Cho & S. Choi (2005). ‘Nonnegative features of spectro-temporal sounds for classification’. Pattern Recognition Letters 26(9):1327–1336.

C. Chuan & E. Chew (2004). ‘Polyphonic Audio Key Finding Using the Spiral Array CEG Algorithm’. Proceedings of the International Conference on Multimedia and Expo, Amsterdam, Netherlands.

P. Cooper (1973). Perspectives in music theory; an historical-analytical approach. Dodd,Mead, New York. ID: 690043.

E. Courses & T. Surveys (2007). ‘Unsupervised speech/music classification usingone-class support vector machines’. Information, Communications & Signal Processing,2007 6th International Conference on pp. 1–5.

M. Crochemore, et al. (1994). Text algorithms. Oxford University Press New York.

H. Crysandt (2004). ‘Music Identification with MPEG-7’. Proceedings of SPIE 5307:117–124.

R. Dannenberg & N. Hu (2003). ‘Pattern Discovery Techniques for Music Audio’.Journal of New Music Research 32(2):153–163.

H. Deshpande, et al. (2001). ‘Mugec: Automatic music genre classification’. Tech. rep., Stanford University, June.


E. Dimitriadou, et al. (2002). ‘An examination of indexes for determining the number of clusters in binary data sets’. Psychometrika 67(1):137–159.

S. Doraisamy & S. Ruger (2004). ‘A Polyphonic Music Retrieval System Using N-Grams’. Proceedings of the International Conference on Music Information Retrieval pp. 204–209.

J. Downie (2004). ‘The Scientific Evaluation of Music Information Retrieval Systems:Foundations and Future’. Computer Music Journal 28(2):12–23.

R. Duda, et al. (2000). Pattern Classification. Wiley-Interscience.

T. Eerola & P. Toiviainen (2004). ‘MIR In Matlab: The MIDI Toolbox’. Proceedings of theInternational Conference on Music Information Retrieval pp. 22–27.

D. Ellis, et al. (2002). ‘The quest for ground truth in musical artist similarity’. Proc. International Symposium on Music Information Retrieval ISMIR-2002.

S. Essid, et al. (2004). ‘Efficient musical instrument recognition on solo performance music using basic features’. AES 25th International Conference, London, UK, June.

N. Fletcher & T. Rossing (1998). The Physics of Musical Instruments. Springer Verlag,New York.

J. Foote (1997). ‘A similarity measure for automatic audio classification’. Proc. of the AAAI 1997 Spring Symposium on Intelligent Integration and Use of Text, Image, Video, and Audio Corpora. Stanford, Palo Alto, California.

J. Foote & M. Cooper (2003). ‘Media Segmentation Using Self-Similarity Decomposition’.Proceedings of SPIE 5021:167–175.

J. Foote & S. Uchihashi (2001). ‘The beat spectrum: a new approach to rhythm analysis’.Multimedia and Expo, 2001. ICME 2001. IEEE International Conference on pp. 881–884.

E. Forgy (1965). ‘Cluster analysis of multivariate data: Efficiency vs. interpretability ofclassifications’. Biometrics 21(3):768.

W. Frakes & R. Baeza-Yates (1992). Information retrieval: data structures and algorithms.Prentice-Hall, Inc. Upper Saddle River, NJ, USA.

C. Fraley & A. Raftery (1998). ‘How many clusters? Which clustering method? Answers via model-based cluster analysis’. The Computer Journal 41(8):578–588.

A. Ghias, et al. (1995). ‘Query by Humming: Musical Information Retrieval in an Audio Database’. Proceedings of the third ACM international conference on Multimedia pp. 231–236.

E. Gomez, et al. (2003). ‘Melody Description and Extraction in the Context of Music Content Processing’. Journal of New Music Research 32(1):23–40.

M. Good et al. (2001). ‘MusicXML: An Internet-Friendly Format for Sheet Music’. XML Conference and Expo pp. 03–04.

N. Griffith (2002). ‘Music and language: Metaphor and causation’. Language, Vision, and Music: Selected Papers from the 8th International Workshop on the Cognitive Science of Natural Language Processing, Galway, Ireland, 1999.


Gstreamer (2008). ‘Gstreamer: An open source media framework. Available at: http://gstreamer.freedesktop.org. Site last visited: 24/06/2008’.

S. Hagen (2006). IPv6 Essentials. O’Reilly Media, Inc., Sebastopol, CA, USA.

G. Hamerly & C. Elkan (2003). ‘Learning the k in k-means’. Advances in Neural Information Processing Systems 17.

D. Hermes (1988). ‘Measurement of pitch by subharmonic summation’. The Journal ofthe Acoustical Society of America 83:257.

P. Herrera, et al. (2004). ‘Percussion-Related Semantic Descriptors of Music Audio Files’. Proc. AES 25th International Conference, London.

A. Holzapfel & Y. Stylianou (2008). ‘Rhythmic similarity of music based on dynamic periodicity warping’. Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on pp. 2217–2220.

N. Hu & R. Dannenberg (2002). ‘A comparison of melodic database retrieval techniques using sung queries’. Proceedings of the 2nd ACM/IEEE-CS joint conference on Digital libraries pp. 301–307.

Humdrum (2008). ‘The Humdrum Toolkit: Software for Music Research’. Available at: http://www.musiccog.ohio-state.edu/Humdrum/ Site visited 20/06/2008.

Icecast (2008). ‘Icecast’. Available at: http://www.icecast.org Site accessed on:24/06/2008.

Ices2 (2008). ‘Ices2’. Available at: http://www.icecast.org/ices.php Site accessed on:10/11/2008.

R. Jackendoff (1987). Consciousness and the computational mind. MIT Press Cambridge,Mass.

R. Jackendoff (2002). Foundations of Language: Brain, Meaning, Grammar. OxfordUniversity Press, USA.

I. Jackson (2008). ‘Song Forms and Terms - A Quick Study. Available at http://www.irenejackson.com/form.html Site last visited: 19/05/08’.

A. Jain, et al. (1999). ‘Data Clustering: A Review’. ACM Computing Surveys 31(3).

W. Jiang & H. Schulzrinne (2002). ‘Comparison and optimization of packet loss repair methods on VoIP perceived quality under bursty loss’. Proceedings of the 12th international workshop on Network and operating systems support for digital audio and video pp. 73–81.

I. Jolliffe (1986). Principal component analysis. Springer-Verlag New York.

D. Jurafsky & J. Martin (2000). Speech and Language Processing: An Introduction to NaturalLanguage Processing, Computational Linguistics, and Speech Recognition. Prentice Hall,New Jersey, USA.

M. Kamata & K. Furukawa (2007). ‘Three types of viewers’ favorite music videos’. Proceedings of the international conference on Advances in computer entertainment technology pp. 196–199.


R. Kass & L. Wasserman (1995). ‘A Reference Bayesian Test for Nested Hypotheses andIts Relationship to the Schwarz Criterion’. American Statistical Association 90:928–928.

S. Khan & A. Ahmad (2004). ‘Cluster center initialization algorithm for K-meansclustering’. Pattern Recognition Letters 25(11):1293–1302.

H. Kim, et al. (2004). ‘Audio classification based on MPEG-7 spectral basis representations’. Circuits and Systems for Video Technology, IEEE Transactions on 14(5):716–725.

H. Kim, et al. (2005). MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval.Wiley, West Sussex, England.

A. Kornstadt (1998). ‘Themefinder: A Web-Based Melodic Search Tool’. Computing inMusicology 11:231–236.

H. Kriegel, et al. (2005). ‘Distributed High-Dimensional Data’. In Advances in Knowledge Discovery and Data Mining: 9th Pacific-Asia Conference, PAKDD 2005, Hanoi, Vietnam, May 18-20, 2005. Springer Verlag.

F. Kurth, et al. (2002). ‘Efficient Fault Tolerant Search Techniques for Full-Text Audio Retrieval’. Preprints-Audio Engineering Society.

G. Lakoff (1988). ‘Cognitive semantics’. Meaning and mental representations pp. 119–154.

Y. Lamdan, et al. (1988). ‘Object recognition by affine invariant matching’. Computer Vision and Pattern Recognition, 1988. Proceedings CVPR’88., Computer Society Conference on pp. 335–344.

P. Lamere (2006). ‘Search Inside the Music’. Available at: http://research.sun.com/projects/dashboard.php?id=153 Site last visited 08/06/2008.

K. Lee & S. Chanson (2004). ‘Packet Loss Probability for Bursty Wireless Real-Time Traffic Through Delay Model’. Vehicular Technology, IEEE Transactions on 53(3):929–938.

M. Leman, et al. (2002). ‘Tendencies, Perspectives, and Opportunities of Musical Audio-Mining’. Forum Acusticum Sevilla pp. 16–20.

K. Lemstrom, et al. (2003). ‘The C-BRAHMS Project’. In Proceedings of the 4th International Conference on Music Information Retrieval (ISMIR 2003), pp. 237–238.

K. Lemstrom & E. Ukkonen (2000). ‘Including interval encoding into edit distance based music comparison and retrieval’. In Proceedings of the AISB’2000 Symposium on Creative & Cultural Aspects and Applications of AI & Cognitive Science, pp. 53–60.

F. Lerdahl & R. Jackendoff (1983). A Generative Theory of Tonal Music. MIT Press.

Y. Liang, et al. (2003). ‘Adaptive Playout Scheduling and Loss Concealment for Voice Communication Over IP Networks’. Multimedia, IEEE Transactions on 5(4):532–543.

R. Likert (1932). ‘A technique for the measurement of attitudes.’. Archives of Psychology22(140):1–55.


D. Lin & B. Wah (2005). ‘LSP-Based Multiple-Description Coding for Real-Time Low Bit-Rate Voice Over IP’. In IEEE Transactions On Multimedia 7.

S. Lin, et al. (1984). ‘Automatic-Repeat-Request Error-Control Schemes’. Communications Magazine, IEEE 22(12):5–17.

B. Logan (2000). ‘Mel Frequency Cepstral Coefficients for Music Modeling’. International Symposium on Music Information Retrieval 28.

B. Logan & A. Salomon (2001). ‘A music similarity function based on signal analysis’ pp. 745–748.

J. Lukasiak, et al. (2003). ‘Performance of MPEG-7 low level audio descriptors with compressed data’. Multimedia and Expo, 2003. ICME ’03. Proceedings. 2003 International Conference on 3:III–273–6 vol.3.

R. Lyons (2004). Understanding Digital Signal Processing. Prentice Hall PTR, Upper Saddle River, NJ, USA.

J. MacQueen (1966). ‘Some Methods For Classification And Analysis of Multivariate Observations’. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, vol. 1, pp. 281–297. Western Management Science Inst., Univ. of California, Los Angeles.

A. Mahanti, et al. (2003). ‘Scalable On-Demand Media Streaming with Packet Loss Recovery’. Networking, IEEE/ACM Transactions on 11(2):195–209.

S. Makharia, et al. (2008). ‘Experimental study on wireless multicast scalability using Merged Hybrid ARQ with staggered adaptive FEC’. In 2008 International Symposium on a World of Wireless, Mobile and Multimedia Networks, 2008. WoWMoM 2008., pp. 1–12.

J. Mao, et al. (1996). ‘A self-organizing network for hyperellipsoidal clustering (HEC)’. Neural Networks, IEEE Transactions on 7(1):16–29.

J. Martínez, et al. (2002). ‘MPEG-7: The Generic Multimedia Content Description Standard, Part 1’. IEEE MultiMedia 9(2):78–87.

R. Matushima, et al. (2004). ‘Integrating MPEG-7 Descriptors and Pattern Recognition: An Environment for Multimedia Indexing and Searching’. WebMedia and LA-Web, 2004. Proceedings pp. 125–132.

K. McAlpine, et al. (1999). ‘Making Music with Algorithms: A Case-Study System’.Computer Music Journal 23(2):19–30.

R. McNab, et al. (1997). ‘The New Zealand Digital Library MELody inDEX’. D-LibMagazine 3(5):4–15.

E. Menin (2002). The Streaming Media Handbook. Pearson Education.

D. Meredith, et al. (2002). ‘Algorithms for discovering repeated patterns in multi-dimensional representations of polyphonic music’. Journal of New Music Research 31(4):321–345.


D. Meredith, et al. (2001a). ‘Pattern Induction and Matching in Polyphonic Music and Other Multidimensional Datasets’. Proceedings of the 5th World Multiconference on Systemics, Cybernetics and Informatics (SCI2001), July pp. 22–25.

D. Meredith, et al. (2001b). ‘Pattern induction and matching in polyphonic music and other multidimensional datasets’. In Proceedings of the 5th World Multiconference on Systemics, Cybernetics and Informatics (SCI2001), July, pp. 22–25.

G. Milligan & M. Cooper (1985). ‘An examination of procedures for determining the number of clusters in a data set’. Psychometrika 50(2):159–179.

E. Miranda (2001). Composing Music with Computers. Focal Press.

MPEG–7 (2008). ‘MPEG 7 Library: A Complete API to Manipulate MPEG 7 Documents. Joanneum Research. Available at: http://iis.joanneum.at/mpeg Site visited 08/06/2008.’.

A. Nafaa, et al. (2008). ‘Forward error correction strategies for media streaming over wireless networks’. Communications Magazine, IEEE 46(1):72–79.

G. Navarro & M. Raffinot (2002). Flexible Pattern Matching in Strings: Practical On-LineSearch Algorithms for Texts and Biological Sequences. Cambridge University Press.

G. Navarro, et al. (1998). ‘A Bit-parallel Approach to Suffix Automata: Fast Extended String Matching’. Proceedings of the 9th Annual Symposium on Combinatorial Pattern Matching pp. 14–33.

A. Ockelford (1991). ‘The Role of Repetition in Perceived Musical Structures’. Representing Musical Structure pp. 129–60.

H. Olson (1967). Music, physics and engineering. Dover Publications.

D. Pan, et al. (1995). ‘A tutorial on MPEG/audio compression’. Multimedia, IEEE2(2):60–74.

G. Papadopoulos & G. Wiggins (1999). ‘AI Methods for Algorithmic Composition: A Survey, a Critical View and Future Prospects’. AISB Symposium on Musical Creativity.

D. Parsons (1975). Directory of Tunes and Musical Themes. S. Brown.

S. Pauws (2002). ‘Cubyhum: A fully operational query by humming system’. In ISMIR2002 Conference Proceedings, pp. 187–196.

M. Pearce & G. Wiggins (2002). ‘Aspects of a Cognitive Theory of Creativity in Musical Composition’. Proceedings of the ECAI ‘02 Workshop on Creative Systems pp. 17–24.

G. Peeters, et al. (2002). ‘Toward automatic music audio summary generation from signal analysis’. Proceedings of International Conference on Music Information Retrieval.

D. Pelleg & A. Moore (2000). ‘X-means: Extending K-means with Efficient Estimation of the Number of Clusters’. Proceedings of the Seventeenth International Conference on Machine Learning pp. 727–734.

C. Perkins, et al. (1998). ‘A Survey of Packet Loss Recovery Techniques for Streaming Audio’. Network, IEEE 12(5):40–48.


L. Prechelt & R. Typke (2001). ‘An Interface for Melody Input’. ACM Transactions on Computer-Human Interaction (TOCHI) 8(2):133–149.

J. Pyun, et al. (2003). ‘Robust Error Concealment for Visual Communications in Burst-Packet-Loss Networks’. Consumer Electronics, IEEE Transactions on 49(4):1013–1019.

L. Rabiner (1989). ‘A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition’. Proceedings of the IEEE 77(2):257–286.

L. Rabiner & B. Juang (1993). Fundamentals of speech recognition. Prentice-Hall, Inc., Upper Saddle River, NJ, USA.

Real Audio (2009). ‘Real Audio’. Available at: http://uk.real.com/realplayer/ Site accessed on: 21/02/09.

RES2.2 (2008). ‘Rough Set Exploration System’. Available at: http://logic.mimuw.edu.pl/~rses Site accessed on: 19/11/2009.

ROSETTA (2009). ‘ROSETTA’. Available at: http://www.lcb.uu.se/tools/rosetta/ Site last visited: 20/11/2009.

S. Salvador & P. Chan (2004). ‘Determining the number of clusters/segments in hierarchical clustering/segmentation algorithms’. Tools with Artificial Intelligence, 2004. ICTAI 2004. 16th IEEE International Conference on pp. 576–584.

F. Salzer (1962). Structural Hearing: Tonal Coherence in Music. Dover Publications.

H. D. C. Sapp & B. Aarden (2008). ‘Themefinder’. Available at: http://www.themefinder.org Site last visited: 08/06/2008.

E. Schubert, et al. (2004). ‘Spectral centroid and timbre in complex, multiple instrumental textures’. Proceedings of the 8th International Conference on Music Perception & Cognition. Evanston, Illinois: Society for Music Perception & Cognition.

Semantic (2008). ‘Semantic Interaction with Audio Contents’. Available at: http://www.semanticaudio.org Site last visited: 08/06/2008.

J. Seo, et al. (2005). ‘Audio fingerprinting based on normalized spectral subband centroids’. IEEE International Conference on Acoustics, Speech, and Signal Processing 3:213–216.

Y. Shan (2005). ‘Cross-layer techniques for adaptive video streaming over wireless networks’. EURASIP Journal on Applied Signal Processing 2005(2):220–228.

M. Slaney, et al. (2002). ‘Semantic-Audio Retrieval’. IEEE International Conference on Acoustics, Speech, and Signal Processing, 2002, 4:4108–4111.

M. Slaney & W. White (2006). ‘Measuring playlist diversity for recommendation systems’. Proceedings of the 1st ACM workshop on Audio and music computing multimedia pp. 77–82.

T. Socolofsky & C. Kale (2008). ‘RFC (Request For Comments) – A Tutorial on TCP/IP’.Available at http://www.ietf.org/rfc/rfc1180.txt. Site accessed on 11/06/2008.

M. Sonn (1973). American National Standard: Psychoacoustical Terminology. American National Standards Institute.

M. Steedman (1996). ‘The Blues and the Abstract Truth: Music and Mental Models’.Mental Models in Cognitive Science pp. 305–318.

M. Steinbach, et al. (2000). ‘A comparison of document clustering techniques’. KDD Workshop on Text Mining 34:35.

S. Stevens, et al. (1937). ‘A scale for the measurement of the psychological magnitude pitch’. The Journal of the Acoustical Society of America 8:185.

W. Stevens (1993). TCP/IP Illustrated Volume 1: The Protocols. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA.

B. Super (2004). ‘Learning chance probability functions for shape retrieval or classification’. Proceedings of the IEEE Workshop on Learning in Computer Vision and Pattern Recognition, June.

H. Sze, et al. (2001). ‘A Packet-Loss-Recovery Scheme for Continuous-Media Streaming over the Internet’. Communications Letters, IEEE 5(3):116–118.

A. Tanenbaum (1996). Computer networks. Prentice Hall PTR, Upper Saddle River, NJ,USA.

A. Tanguiane (1993). Artificial Perception and Music Recognition. Springer, Berlin.

D. Tao, et al. (2004). ‘K-BOX: a query-by-singing based music retrieval system’. In MULTIMEDIA ’04: Proceedings of the 12th annual ACM international conference on Multimedia, pp. 464–467, New York, NY, USA. ACM.

I. Titze & D. Martin (1998). ‘Principles of voice production’. The Journal of the Acoustical Society of America 104:1148.

T. Tolonen & M. Karjalainen (2000). ‘A computationally efficient multipitch analysis model’. Speech and Audio Processing, IEEE Transactions on 8(6):708–716.

G. Tseng & W. Wong (2005). ‘Tight Clustering: A Resampling-Based Approach for Identifying Stable and Tight Patterns in Data’. Biometrics 61(1):10–16.

G. Tzanetakis & P. Cook (2002). ‘Musical genre classification of audio signals’. Speechand Audio Processing, IEEE Transactions on 10(5):293–302.

G. Tzanetakis, et al. (2003). ‘Pitch Histograms in Audio and Symbolic Music Information Retrieval’. Journal of New Music Research 32(2):143–152.

C. Van Rijsbergen (1979). Information Retrieval. Butterworth-Heinemann, Newton, MA, USA.

S. Varadarajan, et al. (2002). ‘Error Spreading: A perception-driven Approach to Handling Error in Continuous Media Streaming’. IEEE/ACM Transactions on Networking (TON) 10(1):139–152.

Vorbis (2008). ‘Ogg Vorbis’. Available at: http://www.vorbis.com Site accessed on: 20/11/2008.

B. Wah & D. Lin (2005). ‘LSP-based Multiple-description Coding for Real-Time Low Bit-rate Voice Over IP’. Multimedia, IEEE Transactions on 7(1):167–178.

R. Walker (1997). ‘Visual metaphors as music notations for sung vowel spectra in different cultures’. Journal of New Music Research 26(4):315–345.

H. Wallach (2004). ‘Evaluation Metrics for Hard Classifiers’. Unpublished note (http://www.inference.phy.cam.ac.uk/hmw26/papers/evaluation.ps).

Y. Wang, et al. (2003). ‘Content-based UEP: A New Scheme for Packet Loss Recovery in Music Streaming’. Proceedings of the eleventh ACM international conference on Multimedia pp. 412–421.

J. Wellhausen & H. Crysandt (2003). ‘Temporal Audio Segmentation Using MPEG-7 Descriptors’. Proceedings of SPIE 5021:380.

R. West, et al. (1991). ‘Musical structure and knowledge representation’. Representing Musical Structure pp. 1–30.

G. Wiggins (1998). ‘Music, syntax, and the Meaning of “Meaning”’. Proceedings of the First Symposium on Music and Computers.

G. Wiggins, et al. (2002). ‘SIA(M)ESE: An algorithm for transposition invariant, polyphonic content-based music retrieval’. In Proceedings of the 3rd International Conference on Music Information Retrieval (ISMIR 2002), pp. 283–284.

G. P. Williams (1997). Chaos Theory Tamed. Taylor & Francis, London.

Windows Media Player (2009). ‘Windows Media Player’. Site accessed on: 21/02/09.

E. Wold, et al. (1996). ‘Content-based classification, search, and retrieval of audio’. Multimedia, IEEE 3(3):27–36.

E. Wold, et al. (1999). ‘Classification, search, and retrieval of audio’. Handbook of Multimedia Computing pp. 207–226.

J. Yin, et al. (2006). ‘Impact of bursty error rates on the performance of wireless local area network (WLAN)’. Ad Hoc Networks 4(5):651–668.

H. Zha, et al. (2002). ‘Spectral Relaxation for K-means Clustering’. Advances in Neural Information Processing Systems 2:1057–1064.

