
Analysis of Important Factors for Measuring Similarity of Symbolic Music Using n-gram-Based, Bag-of-Words Approach

Jacek Wołkowicz and Vlado Kešelj

Faculty of Computer Science, Dalhousie University, Halifax NS, Canada

Abstract. In this paper, we evaluate several factors that influence the performance of n-gram-based music similarity algorithms. Those algorithms are derived from textual information retrieval and adapted to operate on music data. The influence of n-gram length, applied feature extraction method, term weighting approach and similarity measure on the final performance of the similarity measure has been analyzed. MIREX 2005 data and the MIREX 2011 evaluation framework for the symbolic music similarity task have been used to measure the impact of each of the factors. The paper concludes that the choice of a proper feature extraction method and n-gram length are more important than the applied similarity measure or term weighting technique.

1 Introduction

It has been shown that bag-of-words methods, which are well established approaches in textual Information Retrieval, work well for retrieving relevant melodies from music corpora. It is noticeable that the main focus has usually been on how to transform the input sequences of notes into a set of features and how to compare (measure the similarity between) different feature sets, yet there has been no sufficient quantitative analysis of how to make proper decisions regarding certain parameters. Usually the parameters are chosen based on music knowledge and intuition rather than experimentation.

In textual information retrieval it has been found that simply using words as features (as feature extraction) and basic similarity measures between terms (like string equality or stemmed string equality) are sufficient. Researchers instead focus on term weighting, i.e., given the documents in a dataset, determining which terms are more important for each document or for the dataset as a whole. Some frequent terms (dubbed stopwords) usually do not take part in the retrieval process at all. This reduces retrieval time and also allows for better ordering of the retrieval results. A similar approach is used in general for other text mining tasks, e.g., classification.

The Music Information Retrieval Evaluation eXchange (MIREX) has been created to compare and evaluate algorithms that operate on music data. It provides a TREC-like collaboration framework for researchers dealing with both audio and symbolic music problems. Since its beginnings, the Symbolic Melodic Similarity (SMS) task has held a well-established position at the core of symbolic music analysis, with recurring editions over the years.



We have used the released 2005 SMS MIREX dataset to evaluate how various design decisions impact the performance of similarity measures, and we have contributed to the 2011 SMS task to further evaluate some aspects of using different term weighting approaches. The 2005 SMS MIREX task used a subset of the larger Repertoire International des Sources Musicales (RISM) collection of music excerpts, while the 2011 task was based on the Essen Folksongs Collection (EFC), which contains more than 5000 monophonic folk melodies.

2 Background

2.1 Previous Work

Existing approaches to the symbolic music similarity task are shared with other tasks dealing with symbolic music. Among them one can distinguish several main streams, which will be described in the following paragraphs. Successful approaches can be found in all of these groups, but string methods, which deal with music in a similar way as text IR deals with written text, seem to play a major role.

Geometric methods usually treat music excerpts as points in a two-dimensional time and pitch space [7, 9, 16]. Since there is usually no notion of succession of notes (it is represented naturally on the time dimension), those methods are usually suitable for both monophonic (with only one note allowed at each time) and polyphonic (where notes can co-occur in time) music. However, it is generally more challenging to implement methods that are robust with regard to small changes that are perceptually not very important but cause the entire geometric representation to bend or stretch significantly.

String methods deal with music data as if it were a linear sequence of symbols and use methods originally developed for text. They do not handle polyphonic music very well, but the SMS MIREX task deals with monophonic queries against a monophonic database, which simplifies the problem for these methods. Among them, the major group employs algorithms based on local or global alignment techniques, created initially to compute the edit distance between two strings [1–3, 8, 17, 18]. This seems to be a natural choice for the kind of data present in the EFC and RISM datasets, since they contain only short melodies. The major drawback of those methods is their computational complexity, being at least O(n²). The other approach utilizes a "bag-of-words" type of analysis, which treats features as a set, without a specific order between them [10, 13]. This approach works best for larger documents, giving the capability of linear processing (O(n)), which means that the RISM and EFC datasets with their short excerpts might not be the perfect environment to show its benefits.

There are also other ways to compute music similarity, based on graph methods [11] or tree representations [12], but they are not as popular as the previous approaches.


2.2 Datasets

We have decided to base our work on two datasets that were previously used in the SMS MIREX tasks. The dataset from 2005 is based on a subset of the Repertoire International des Sources Musicales (RISM) collection and contains two subsets used for training and evaluation, each having 11 unique queries and more than 550 melodies to match. The ordering of the most relevant results, annotated by experts for each query, has been analyzed by Typke et al. [15] and further refined and validated by Urbano et al. [19]. This gives a solid background, allowing for complete evaluation of similarity algorithms, and the fact that the dataset has been released to the broad public makes it suitable for performing one's own analysis. The incipits in the RISM collection are typically 10–40 notes in length.

The second dataset, the Essen Folksongs Collection (EFC), has been used in the MIREX SMS tasks in the 2007, 2010 and 2011 editions. It contains 5274 MIDI-encoded monophonic folksongs from different countries. Typically the documents in the collection are longer than in the RISM dataset, ranging from 15 to 80 notes with an average of 48 notes. For EFC-based tasks, MIREX does not publish any detailed evaluation data, including the queries used to evaluate submitted algorithms (6 base queries, in 30 variants in total); only quantitative performance data (rankings, performance measures) are publicly available. This allows for reuse of the same testing dataset among different releases of the SMS challenge, since the creation of such a dataset is a very labour-intensive process. Therefore we will use results from this dataset only to confirm or reject hypotheses drawn from the RISM dataset.

2.3 MIREX SMS Evaluation Framework

The evaluation of the 2011 SMS MIREX task took place at the ISMIR Conference, as it does every year. Teams submit their algorithms and each team is allowed to submit several variants of their submission. Each system is supposed to return the 10 most similar documents for each query, ordered by their similarity to the query. Results are then combined and presented to human judges. Each time a judge sees a query melody and a single result, they evaluate the similarity between them. The system ensures that each pair of query and candidate is evaluated by a random grader, and receives from them two scores, an ordinal one (Very Similar, Somewhat Similar, Not Similar) and a fine one (a number from 0 to 100), indicating how close those two melodies are. The result lists are then evaluated using several performance measures, including Average Dynamic Recall (ADR), Normalized Recall at Group Boundaries (NRGB), Average Precision (AP), Precision at N Documents (PND), plus a variety of summative measures that aggregate judge ratings for each returned result.

In our testing we will focus on two measures. The first of them is Average Dynamic Recall (ADR). For each query, a union of all results returned by all candidate algorithms is obtained. The designers of the SMS task assumed that if a few documents have very similar relevance ratings for a given query, their order in the perfect result list should not matter; e.g., a result list (equivalence groups indicated by brackets) [ABC]D[EF] should yield the same result as [BCA]D[FE].

Page 4: Analysis of Important Factors for Measuring Similarity of Symbolic ...

Analysis of Important Factors for Measuring Similarity of Symbolic Music 233

To calculate ADR, recall at each item on the result list is calculated (r_i, 1 ≤ i ≤ 10), assuming the results in the same group could occur in a different order, and ADR is the average of all r_i. In brief, ADR gives the average recall over the documents that the user should have seen at any number of retrieved items [14]. It captures both the quality (relevance of items) and the order (how high the most relevant items are) of the resulting ranked lists, and it is the primary measure used in the evaluation of algorithms submitted to the MIREX SMS task. For details on the ADR measure please refer to [14].
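The following Python sketch shows how ADR could be computed for a single query under our reading of the description above and of [14]; the function name and the group representation are illustrative and not part of the MIREX framework.

def adr(ranked, groups):
    """Average Dynamic Recall for one query.

    ranked: retrieved document ids in rank order (e.g. the 10 returned items).
    groups: ground-truth equivalence groups, most relevant first,
            e.g. [["A", "B", "C"], ["D"], ["E", "F"]] for [ABC]D[EF].
    """
    recalls = []
    for i in range(1, len(ranked) + 1):
        # Documents the user "should have seen" after i results: all groups
        # up to and including the group that spans rank i.
        allowed, covered = set(), 0
        for group in groups:
            allowed.update(group)
            covered += len(group)
            if covered >= i:
                break
        recalls.append(sum(1 for doc in ranked[:i] if doc in allowed) / i)
    return sum(recalls) / len(recalls)

groups = [["A", "B", "C"], ["D"], ["E", "F"]]
# [ABC]D[EF] and [BCA]D[FE] receive the same (perfect) score:
print(adr(list("ABCDEF"), groups), adr(list("BCADFE"), groups))  # 1.0 1.0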

The second measure we will evaluate the algorithms against is Fine Precision at 10 (FP10). It is the sum of all the fine ratings of all the items in the result set. It does not say anything about the order of the results, but it does reflect the general quality of all the results returned by the algorithm. Unlike for the other measures, MIREX publishes not only total cumulative values over all queries, but also per-query FP10, allowing for analysis of differences in performance between queries.
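As a companion to the ADR sketch above, FP10 for one query reduces to aggregating the judges' fine ratings of the returned items; this minimal sketch follows the text's description of FP10 as a sum and is ours, not MIREX's code.

def fine_precision_at_10(fine_ratings):
    """Sum of the fine (0-100) ratings of the (up to) 10 returned items."""
    return sum(fine_ratings[:10])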

3 Methodology

The approach to symbolic music retrieval proposed in this paper focuses on evaluating the impact on performance of certain aspects like feature extraction, n-gram length and similarity function. We kept the basic framework of the retrieval process standard. Our goal is to check what kind of results one can obtain with pure, well-established methods known from text information retrieval when one ports them directly to music data.

The main analysis has been conducted using the MIREX 2005 dataset, which consists of a subset of the RISM collection. Based on that, and using average dynamic recall (ADR) as a performance measure, we were able to evaluate the best parameters for each of the proposed approaches and draw general conclusions about the behaviour of the systems.

We have found that the performance depends on the following factors:

– feature extraction method used to retrieve basic components of the melodies in question (the analysis checks for the importance of rhythmic and melodic components separately),

– n-gram size, i.e., the length of the sliding window used to calculate the counts of the features,

– similarity measure applied to calculate similarity between feature vectors, and

– term weighting algorithm.

Our hypothesis was that all of those components play a role in the retrieval of similar melodies, so our goal was to investigate the importance of each of them. Starting from the plain melodic and rhythmic features, there are multiple ways one can generalize the data or smooth out irrelevant differences, which we have indicated above. The only problem is that without experimentation it is not possible to confirm what is and what is not relevant.


Therefore our set of experiments covered various approaches to the quantization of rhythm and melody.

The process of document retrieval starts with extracting features from the corpus documents. The input dataset consists of a set of standard MIDI files, each containing a single track of notes representing a monophonic melody. Since none of the notes are concurrent or overlapping in the datasets we have used, string-based methods can be directly applied to the input documents. Like text documents, which can be seen simply as series of characters, monophonic music pieces are just series of notes. The difference with music files is that text documents are easily separable into basic features, namely words. Since there is no such thing as a clear phrase boundary in music, the usual bag-of-words, or bag-of-terms, approach consists of building n-grams, i.e., substrings of n consecutive tokens (notes), one starting with every note. This process is widely used in bio-informatics (DNA sequence analysis) and in some text processing tasks as well (for authorship attribution [5], or for tasks with languages that have no word boundaries, like Thai [4]).
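A minimal sketch of this n-gram extraction step, assuming the melody has already been reduced to a sequence of unigram tokens; the function name and example values are illustrative only.

def ngrams(tokens, n):
    """All substrings of n consecutive tokens, one starting at every position."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# e.g. a melody reduced to melodic-interval unigrams (in semitones)
intervals = [2, 2, -4, 5, 0, -1]
print(ngrams(intervals, 3))  # [(2, 2, -4), (2, -4, 5), (-4, 5, 0), (5, 0, -1)]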

3.1 Granularity of Melodic and Rhythmic Features

The basic features that make up an n-gram are derived from note events in a note stream. We call them uni-grams. They can contain absolute values representing music features, such as a note's pitch, duration or inter-onset interval (IOI), but in most cases relative (interval) features are used. They allow one to achieve basic transposition and tempo invariance, which is required in this task, i.e., a melody played in a different key and in a different tempo than some other melody should be considered equivalent to it. Therefore we have decided to use melodic intervals and inter-onset interval ratios (IORs). The other question is how one should translate the numbers that represent melodic intervals and IORs into discrete terms, i.e., at which level of granularity they should be dealt with.

Melodic intervals derived from MIDI files give precise, discrete interval classes. We have chosen the following levels of granularity of melodic features to test (a sketch of this quantization follows the list):

– accurate/fine: each feature represents the actual interval between two consecutive notes, in semitones, taken directly from MIDI note events.

– coarse: intervals belonging to the same class are grouped together. We use five classes: same (no pitch change), small jump up (1–3 semitones), small jump down, large jump up (4 semitones and more), large jump down.

– contour: only the direction of the melody matters (stays the same, goes up or goes down).

– identity: always 0, which essentially turns off melodic features.
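A small sketch of these four granularity levels, assuming intervals are given in semitones; the class labels and the function name are ours and purely illustrative.

def quantize_interval(semitones, scheme="fine"):
    """Map a melodic interval (in semitones) to a unigram token."""
    if scheme == "fine":      # exact interval in semitones
        return semitones
    if scheme == "coarse":    # five classes, as described above
        if semitones == 0:
            return "same"
        if 1 <= semitones <= 3:
            return "small_up"
        if -3 <= semitones <= -1:
            return "small_down"
        return "large_up" if semitones > 0 else "large_down"
    if scheme == "contour":   # direction only
        return "same" if semitones == 0 else ("up" if semitones > 0 else "down")
    if scheme == "identity":  # melodic component switched off
        return 0
    raise ValueError(scheme)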

Inter-onset interval ratios (IORs), unlike melodic intervals, give rational numbers. The method used in this submission builds on our previous research results, where rounding (with a 0.2 threshold) was applied to the binary logarithm of the IOR. This gives progressively wider steps as the IOR increases, yet is precise enough to maintain the perception of rhythm changes [20]. With this as a base, we have come up with four IOR granulation schemes (sketched in code after the list):


– accurate/fine: using values of the binary logarithm of the IOR rounded with a precision of 0.2.

– coarse: five classes: the same duration (log2 IOR = 0), the next note being at most twice as fast (−1 ≤ log2 IOR < 0), at most twice as slow (0 < log2 IOR ≤ 1), more than twice as fast (log2 IOR < −1) and more than twice as slow (log2 IOR > 1) as the previous note.

– contour: similarly to melodic contour, with three classes: same duration, slower or faster.

– identity: always 0, which turns off rhythmic features.
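A companion sketch for the IOR schemes, under one plausible reading of the 0.2 threshold (rounding log2(IOR) to the nearest multiple of 0.2); names and labels are again illustrative.

import math

def quantize_ior(ior, scheme="fine"):
    """Map an inter-onset interval ratio to a rhythm token."""
    x = math.log2(ior)
    if scheme == "fine":      # round log2(IOR) to the nearest 0.2
        return round(5 * x) / 5
    if scheme == "coarse":    # five classes, as described above
        if x == 0:
            return "same"
        if -1 <= x < 0:
            return "faster"
        if 0 < x <= 1:
            return "slower"
        return "much_faster" if x < -1 else "much_slower"
    if scheme == "contour":   # same duration, faster or slower
        return "same" if x == 0 else ("faster" if x < 0 else "slower")
    if scheme == "identity":  # rhythmic component switched off
        return 0
    raise ValueError(scheme)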

3.2 n-gram Length

With our testing we were also able to determine the optimal n for each of the proposed algorithms. We have analyzed n-gram lengths from 1 to 10, with peaks of performance usually observed between n = 2 and n = 7. The general rule of thumb, though, is that the more general the features, the larger n should be.

3.3 Similarity Measure

We have also tested the impact of the following similarity measures (a small sketch of all three follows the list):

– Common Features: represents a dot product between the two vectors representing the two documents being compared, which can be denoted as follows:

  sim(x, y) = \sum_i x_i y_i    (1)

  where x_i is the weight of term i in document x.

– Cosine Similarity: the cosine of the angle between both documents in the high-dimensional feature space. In essence it is a dot product between two vectors normalized by their lengths. Unlike the previous approach, this also takes the length of the vectors into consideration:

  sim(x, y) = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2} \sqrt{\sum_i y_i^2}}    (2)

– CNG measure: this method compares n-gram profiles, consisting of the frequencies of each feature, by normalizing frequency differences by their sum (twice their arithmetic mean). The measure has proven to distinguish well between authors of texts [5] as well as composers in the music domain [20], making it an interesting candidate for application to this task. The equivalent formula for the similarity measure is given as follows:

  sim(x, y) = \sum_i \left( 1 - \left( \frac{x_i - y_i}{x_i + y_i} \right)^2 \right)    (3)
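A minimal sketch of the three measures from equations (1)-(3), with documents represented as dictionaries mapping n-grams to weights; the representation and example values are ours.

def common_features(x, y):
    """Eq. (1): dot product of two term-weight dictionaries."""
    return sum(w * y[t] for t, w in x.items() if t in y)

def cosine(x, y):
    """Eq. (2): dot product normalized by the vector lengths."""
    norm = lambda v: sum(w * w for w in v.values()) ** 0.5
    return common_features(x, y) / (norm(x) * norm(y))

def cng(x, y):
    """Eq. (3): summed over the union of n-grams; terms present in only
    one profile contribute 0."""
    terms = set(x) | set(y)
    return sum(1 - ((x.get(t, 0) - y.get(t, 0)) / (x.get(t, 0) + y.get(t, 0))) ** 2
               for t in terms)

a = {(2, 2, -4): 3, (2, -4, 5): 1}
b = {(2, 2, -4): 2, (0, -1, 3): 1}
print(common_features(a, b), cosine(a, b), cng(a, b))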


3.4 Term Weighting Method

One of the goals of this research was to measure how various text-based term weighting measures affect measuring similarity between music documents. We have decided to evaluate four techniques (the tf.idf and BM25 weights are sketched in code after the list):

– binary: the weight is either 0 (if a term, or an n-gram, does not appear in the document) or 1 (if the term appears in the document). With the Common Features similarity measure, this gives a basic number indicating the number of terms two documents have in common.

– frequency: simple term counts (the number of times the term appears in the compared documents), which gives the classical cosine similarity or CNG measure definition.

– tf.idf: a standard term weighting technique where term frequencies are normalized by document frequencies, i.e., the number of documents a term occurs in. This penalizes high-frequency, common terms that occur in most documents. The formula for a term i in document x is given as follows:

  \mathrm{tf.idf}_{x,i} = \frac{count_{x,i}}{\|x\|} \log \frac{\|D\|}{\delta_i}    (4)

  where \delta_i = \|\{d \in D \mid i \in d\}\| is the number of documents containing term i in the entire collection D. This measure is commonly used in textual Information Retrieval for term weighting, so it is interesting to see how it performs on music data.

– Okapi BM25: unlike tf.idf, it is an industry-developed weighting scheme that outperforms classic term weighting measures such as tf.idf. It tries to capture roughly the same concept as the original tf.idf measure but attempts to balance documents with different lengths and different term distributions:

  \mathrm{bm25}_{x,i} = \frac{count_{x,i}\,(k + 1)}{count_{x,i} + k \left(1 - b + b\,\frac{\|x\|}{avgdl}\right)} \cdot \log \frac{\|D\| - \delta_i + 0.5}{\delta_i + 0.5}    (5)

  where avgdl is the average document length. It is parametrized with parameters b and k, and we have used the recommended setting of b = 0.75 and k = 2.
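A small sketch of the two weighting schemes, following our reading of equations (4) and (5); the numbers in the usage line are hypothetical and only illustrate the call.

import math

def tf_idf(count, doc_len, num_docs, doc_freq):
    """Eq. (4): length-normalized term frequency times inverse document frequency."""
    return (count / doc_len) * math.log(num_docs / doc_freq)

def bm25(count, doc_len, num_docs, doc_freq, avgdl, k=2.0, b=0.75):
    """Eq. (5): BM25 weight with the recommended k and b from the text."""
    tf_part = count * (k + 1) / (count + k * (1 - b + b * doc_len / avgdl))
    idf_part = math.log((num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    return tf_part * idf_part

# hypothetical numbers: an n-gram occurring 3 times in a 40-term document,
# appearing in 12 of 558 documents, with an average document length of 35 terms
print(tf_idf(3, 40, 558, 12), bm25(3, 40, 558, 12, 35))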

4 Results

In order to evaluate which factors have the biggest impact on the performance of similarity algorithms, we have reproduced the MIREX 2005 task using the existing document relevance information. At each run of an algorithm, a set of 10 ordered results is returned. For each of the results a relevance score is assumed, such that if a result was previously marked as relevant, the relevance group label used to calculate the ADR score is kept. If the relevance of a result to a given query was not marked in the judgment list, the item is assumed to be not relevant, although it could have received some recognition had it been submitted to the actual MIREX task.


As a result, the final measured ADR score is not better than the actual ADR score that the given result set would receive.

For each similarity measure test, the ADR score is evaluated for each available query and the average ADR among all queries is returned as the result. The performance of each setting is tested over a range of n-gram length values, although one could use other dimensions (e.g., feature granulation) as well; it is an arbitrary choice.

We have found that, regardless of the term weighting algorithm applied, melodic features give better results than rhythmic ones, and that there is no significant difference between using only melodic features and using combined melodic and rhythmic features (see Figure 1). One can observe that the peak performance is achieved with n-gram lengths from n = 2 to n = 3 for the more precise, combined features, and between n = 3 and n = 5 for the somewhat more general melodic features. Using only IOR (rhythmic) features leads to significantly worse results, with the peak performance around n = 5 or n = 6.

The significance of the choice of feature granulation scheme is shown in Figure 2. The conclusion is that the finer (or the more precise) the representation, the better the results that can be achieved with smaller n. Figure 2 shows the performance for melodic features with increasing generalization of intervals. The peak performance is achieved with fine interval values for n = 2 and n = 3. The coarse representation, with only five levels, yields the best results around n = 5, and melodic contour (three levels) performs much worse, with a peak performance around n = 7.

Figure 3 shows ADR scores achieved using fine melodic interval features with different similarity measures (CNG, common features and cosine), and Figure 4 shows scores for the cosine similarity measure with different weighting methods applied (binary, frequency, tf.idf and bm25).

Fig. 1. The comparison of performance of similarity measures using the same cosine similarity method with various feature extraction methods. Using only rhythmic features is easily outperformed by melodic features and by melody combined with rhythm.


Fig. 2. The influence of various quantization approaches for the same feature extraction method (here, melodic intervals). More precise features offer better performance than more general ones.

It turned out, rather surprisingly, that these usually important aspects do not have much influence on the final result; e.g., the number of features in common gives as good results as applying the cosine similarity measure with the bm25 term weighting method. The peak performance for all those methods is achieved between n = 2 and n = 4. This could be the case because the queries and documents are rather short, and those more sophisticated similarity measures were designed to evaluate similarity between larger documents or profiles with hundreds and thousands of features.

4.1 MIREX 2011 Evaluation

Keeping our previous findings in mind, we have chosen to further evaluate how different text-based term weighting methods perform in a different task. To do that we came up with the following 6 setups, with parameters tuned according to our previous analysis. All of our submissions feature the cosine similarity measure with melodic or combined fine features. The main aspect that varies between them is the term weighting approach:

– WK1: uses the binary term weighting approach and fine melodic features with n = 5.

– WK2 and WK3: both use frequency-based weighting, with melodic interval features and n = 4, and with combined features and n = 2, respectively.

– WK4: uses tf.idf weighting with fine melodic features and n = 4.

– WK5 and WK6: feature bm25 weighting, with melodic interval features and n = 4, and with combined features and n = 2, respectively.


Fig. 3. Here the same fine melodic interval features have been used with three different similarity measures. As one can see, there is no significant difference in performance with regards to the similarity function applied to measure similarity between music excerpts.

Fig. 4. Fine melodic interval features have been used with cosine similarity using different weighting schemes. Again, there is no significant difference in performance with regards to the weighting scheme applied to measure similarity between music excerpts.


Our algorithms were evaluated along with 5 other submitted algorithms, reaching similar total scores. Only the UL series of algorithms [18], featuring string alignment techniques, outperformed most of our submissions, yet the difference was in most cases not measured as significant [6]. What drew our attention were the fine results calculated for each query separately (see Table 1). The table consists of FP10 values achieved by each of the algorithms for every base query. The best and the worst performers for each query have been highlighted and all the values colour-coded for clarity. It turned out that a lot depends on the actual query, since our most sophisticated setup, WK6, although it performed rather poorly overall, achieved the best scores in two out of six queries. Since one knows nothing about the actual queries, because these are also kept confidential at MIREX, no meaningful conclusion can be drawn about why this happened, but one can clearly see that the type of the actual query should also play an important role in determining the best algorithm for the task.

Table 1. Results of the SMS 2011 task calculated for each base query separately. The numbers represent FP10 values, in percent.

5 Conclusions

The results of our experiments show that for this simple task even basic retrieval methods yield very satisfactory results. In the end, the simplest of our algorithms submitted to the 2011 MIREX SMS challenge came second overall. On the other hand, we have pointed out the importance of more foundational design decisions, like the use of a proper feature extraction method, and the impact of choosing a proper n-gram length. The per-query FP10 results from MIREX 2011 show that there are other factors, which we simply cannot capture because of the lack of proper evaluation data, and this creates more possibilities for future research.

References

1. Ferraro, P., Hanna, P., Allali, J., Robine, M.: MIREX symbolic music similarity. In: MIREX (2007)

2. Gomez, C., Abad-Mota, S., Ruckhaus, E.: An analysis of the Mongeau-Sankoff algorithm for music information retrieval. In: MIREX (2007)


3. Grachten, M., Arcos, J.L., de Mantaras, R.L.: Melody retrieval using the implication/realization model. In: MIREX (2005)

4. Haruechaiyasak, C., Kongyoung, S., Dailey, M.: A comparative study on Thai word segmentation approaches. In: 5th International Conference on ECTI-CON 2008, vol. 1, pp. 125–128 (May 2008)

5. Kešelj, V., Peng, F., Cercone, N., Thomas, C.: N-gram-based author profiles for authorship attribution. In: Proc. of the PACLING 2003 Conf., pp. 255–264 (2003)

6. International Music Information Retrieval Systems Evaluation Laboratory: MIREX 2011 challenge on symbolic melodic similarity (August 2011)

7. Laitinen, M., Lemström, K.: Geometric algorithms for melodic similarity. In: MIREX (2010)

8. Lemström, K., Mikkilä, N., Mäkinen, V., Ukkonen, E.: String matching and geometric algorithm for melodic similarity. In: MIREX (2005)

9. Lemström, K., Mikkilä, N., Mäkinen, V., Ukkonen, E.: Sweepline and recursive geometric algorithms for melodic similarity. In: MIREX (2006)

10. Orio, N.: Combining multilevel and multi-feature representation to compute melodic similarity. In: MIREX (2005)

11. Pinto, A.: MIREX 2007 - graph spectral method. In: MIREX (2007)

12. Rizo, D., Iñesta, J.M.: Trees and combined methods for monophonic music similarity evaluation. In: MIREX (2010)

13. Suyoto, I.S.H., Uitdenbogerd, A.L.: Simple efficient n-gram indexing for effective melody retrieval. In: MIREX (2005)

14. Typke, R., Veltkamp, R.C., Wiering, F.: A measure for evaluating retrieval techniques based on partially ordered ground truth lists. In: 2006 IEEE International Conference on Multimedia and Expo, pp. 1793–1796 (July 2006)

15. Typke, R., den Hoed, M., de Nooijer, J., Wiering, F., Veltkamp, R.C.: A ground truth for half a million musical incipits. Journal of Digital Information Management 3, 34–39 (2005)

16. Typke, R., Wiering, F., Veltkamp, R.C.: MIREX symbolic melody similarity and query by singing/humming. In: MIREX (2006)

17. Uitdenbogerd, A.L.: N-gram pattern matching and dynamic programming for symbolic melody search. In: MIREX (2007)

18. Urbano, J., Llorens, J., Sanchez-Cuadrado, S.: Sequence alignment with geometric representations. In: MIREX (2011)

19. Urbano, J., Marrero, M., Martín, D., Llorens, J.: Improving the generation of ground truths based on partially ordered lists. In: International Society for Music Information Retrieval Conference, pp. 285–290 (2010)

20. Wołkowicz, J.M.: N-gram-based approach to composer recognition. Master's thesis, Warsaw University of Technology, Warsaw, Poland (2007). Supervisor: Zbigniew Kulka

