Representing Musical Patterns via the Rhythmic Style Histogram Feature

Matthew Prockup
MET-Lab, Drexel University, ECE Dept.
3141 Chestnut St., Philadelphia, PA 19104
[email protected]

Jeffrey Scott
MET-Lab, Drexel University, ECE Dept.
3141 Chestnut St., Philadelphia, PA 19104
[email protected]

Youngmoo E. Kim
MET-Lab, Drexel University, ECE Dept.
3141 Chestnut St., Philadelphia, PA 19104
[email protected]

ABSTRACT
When listening to music, humans often focus on melodic and rhythmic elements to identify specific songs or genres. While these representations may be quite simple, they still capture and differentiate higher level aspects of music such as expressive intent and musical style. In this work we seek to extract and represent rhythmic patterns from a polyphonic corpus of audio encompassing a number of styles. A compact feature is designed that probabilistically models rhythmic activations within musical beat divisions through histograms of Inter-Onset-Intervals (IOI). Onset detection functions are calculated from multiple frequency bands of a perceptually motivated filter bank. This allows for patterns of lower pitched and higher pitched onsets to be described separately. Through a set of supervised and unsupervised experiments, we show that this feature is well suited for a variety of tasks in which quantifying rhythmic style is necessary.

Categories and Subject Descriptors
H.5.5 [Information Interfaces and Presentation]: Sound and Music Computing; I.5 [Computing Methodologies]: Pattern Recognition

General Terms
Design, Experimentation, Evaluation

Keywords
Music-Information Retrieval, Rhythmic Style, Expression

1. INTRODUCTION
Humans identify with the two basic components of melody and rhythm in order to describe and differentiate songs. With these simple components, one can usually recognize higher level concepts such as the style and other expressive components of a piece of music. Previous work has studied both style and expression of a song as a whole, but few efforts focus on deconstructing and quantifying these individual components to discover the specific roles that each of them plays. In the context of this paper, we explore methods to represent rhythmic elements and provide analyses of their ability to distinguish rhythmic style.

There is a large body of previous work on beat-tracking [3, 11], but in extracting only beats, much of the rhythmic subtlety is discarded. There are, however, several systems that use rhythmic information in order to better inform beat tracking algorithms. One such system in [1] uses domain knowledge to improve onset detection, one of the key components in a beat tracking system. The system then trims spurious onsets by retaining only those that make sense rhythmically based on a set of synthesized templates.

In [6] the authors introduce the beat spectrum, a compact feature that represents periodicities in the self-alignment of an audio signal. Capturing a similar, yet more compact, feature is a primary motivation of this work.

Recent efforts have expanded beyond beat tracking to capture elements of rhythm, style and expression [4, 10, 13]. In [13], source separation of harmonic and percussive components is applied and then used to derive multiple sets of rhythmic pattern templates and bassline templates. A compact set of rhythm and bass line features is calculated based on a song's alignment to the learned templates.

We propose a representation that probabilistically models Inter-Onset-Intervals (IOI) at the tatum positions between beat locations. The Rhythmic Style Histogram Feature (RSHF) is designed to be tempo invariant and to capture rhythmic subtleties across different frequency ranges. The following section describes the feature computation in detail.

2. RHYTHMIC FEATURE DESIGN
In order to represent rhythmic style, the RSHF is designed to probabilistically model inter-onset-intervals across multiple frequency bands. The signal flow of the RSHF calculation is shown in Figure 1.

2.1 Deriving the RSHF
The percussive component of Harmonic Percussive Source Separation (Figure 1a) is computed [5] and fed into a beat tracker to extract beat locations. The beat tracker employed is based on dynamic programming. This algorithm is further explained in [3] and publicly available via LibROSA (https://github.com/bmcfee/librosa).
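
As a concrete sketch, this front end maps onto LibROSA's own API; the calls below exist in recent LibROSA releases, and the input filename is a placeholder.

```python
import librosa

# Load audio and isolate the percussive component via
# median-filtering HPSS (FitzGerald [5]).
y, sr = librosa.load('song.wav')
y_percussive = librosa.effects.percussive(y)

# Dynamic-programming beat tracker (Ellis [3]); it also returns a
# tempo estimate, which Section 3 later appends to the RSHF.
tempo, beat_frames = librosa.beat.beat_track(y=y_percussive, sr=sr)
beat_times = librosa.frames_to_time(beat_frames, sr=sr)
```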

Figure 1: RSHF design and calculation. (Block diagram: audio → (a) median-filtering HPSS → (b) LibROSA beat tracking and tatum intuition → (c) coarse Mel-spaced filtering → (d) onset detection and IOI calculation.)

The percussive power spectrum is quantized into twelve equally spaced bins between successive beat locations (Figure 1b). Twelve frames per beat are chosen to allow rhythms with both duple and triple feels to quantize accurately to a tatum position.
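
A minimal sketch of this quantization step, assuming the percussive power spectrogram is mean-pooled over the frames that fall into each tatum bin (the pooling operation is not specified in the paper):

```python
import numpy as np

def tatum_quantize(S, beat_frames, n_tatums=12):
    """Quantize a spectrogram into n_tatums equal bins per beat.

    S: power spectrogram (n_freq x n_frames)
    beat_frames: frame indices of successive beat locations
    """
    columns = []
    for b0, b1 in zip(beat_frames[:-1], beat_frames[1:]):
        # thirteen evenly spaced boundaries -> twelve tatum bins
        edges = np.linspace(b0, b1, n_tatums + 1).astype(int)
        for e0, e1 in zip(edges[:-1], edges[1:]):
            columns.append(S[:, e0:max(e1, e0 + 1)].mean(axis=1))
    return np.stack(columns, axis=1)  # n_freq x (n_beats - 1) * n_tatums
```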

The tatum-aligned spectrum is processed using a coarse Mel-spaced filter bank. The output of each Mel filter becomes an independent onset signal X_f[t], where f is the filter channel and t is the tatum position (Figure 1c). The onset detection signals are then modified by subtracting the output of a moving average filter with a window length w and a tail multiplier m, yielding

    Y_f[t] = X_f[t] − ( Σ_{k = t−mw}^{t+w} X_f[k] ) / (mw + w).    (1)

This process emphasizes large changes in the signal, and the asymmetric window prevents false detection of offsets as onsets. Onsets are defined as the local maxima of the filtered signals Y_f[t] within a window of length w. The calculation of the onset positions is similar to the method described in [1].
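
A sketch of Eq. (1) and the peak picking for a single Mel band follows; the default values of w and m and the positive-peak threshold are assumptions, since the paper does not report them.

```python
import numpy as np
from scipy.signal import find_peaks

def detect_onsets(X, w=3, m=2):
    """Onset tatum positions for one Mel-band signal X.

    Subtracts an asymmetric moving average (w tatums ahead,
    m*w behind, as in Eq. 1), then keeps local maxima spaced
    at least w tatums apart.
    """
    T = len(X)
    Y = np.empty(T)
    for t in range(T):
        lo, hi = max(0, t - m * w), min(T, t + w + 1)
        Y[t] = X[t] - X[lo:hi].sum() / (m * w + w)
    peaks, _ = find_peaks(Y, distance=w)
    return peaks[Y[peaks] > 0]  # assumed threshold: positive peaks only
```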

Using the tatum-aligned onset positions, the IOIs for each Mel-frequency band are calculated (Figure 1d). The raw RSHF feature becomes a stacked set of histograms denoting the empirical probability of the IOI values for each Mel-frequency band. An example of this feature is shown in Figure 2. In the lower frequency range, there is more probability mass at IOIs of 1/4 and 3/4 of the beat. This shows that both 16th notes and dotted 8th notes are common. In this example, it is representative of a repetitive swing-like pattern in the bass drum composed of a dotted 8th note followed by a 16th note. In the higher frequency components there is high probability for an IOI equal to 1/2 of a beat. This means that there is likely a steady 8th-note pattern, played possibly by a hi-hat or ride cymbal.
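
The per-band histograms can then be assembled as sketched below. Measuring IOIs in tatums (three tatums = 1/4 beat on the twelve-per-beat grid) and capping them at one beat are assumptions here; eight Mel bands of twelve IOI bins would be consistent with the 96-dimensional raw feature reported in Section 3, though the exact configuration is not stated.

```python
import numpy as np

def rshf(onsets_per_band, max_ioi=12):
    """Stack per-band IOI histograms into one raw RSHF vector.

    onsets_per_band: sorted arrays of onset tatum indices, one per band
    max_ioi: largest IOI (in tatums) kept in each histogram (assumed)
    """
    hists = []
    for onsets in onsets_per_band:
        ioi = np.diff(onsets)  # inter-onset intervals in tatums
        h, _ = np.histogram(ioi, bins=np.arange(1, max_ioi + 2))
        hists.append(h / max(h.sum(), 1))  # empirical probability
    return np.concatenate(hists)
```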

It is important to note that the RSHF is sensitive to the quality of the onset detection function and the beat tracker employed as part of the front-end system. Errors in the beat tracking may lead to improper binning in the metrical divisions defining the tatum-aligned spectrogram in Figure 1c.

Figure 2: Rhythmic Style Histogram Feature.

2.2 Meaningful Dimensionality Reduction
While the RSHF is an intuitive representation of repetitive rhythmic style, its dimensionality can become quite large depending on the number of tatum divisions and Mel filters desired. This can become prohibitive when applying certain machine learning techniques.

In order to accommodate these more complex methods, a reduced dimensionality is required. However, in reducing the RSHF, we seek to maintain its intuitive meaning. This is done through Non-Negative Matrix Factorization (NMF). The basic goal of NMF is to factorize a matrix V into two matrices (V = W × H). The rows of H are a set of learned basis components over all examples, and the columns of W are the activations of those components [9]. Performing NMF on the RSHF yields a rhythmic tatum basis across all examples. This effectively reduces the dimensionality in a manner that is still representative of the original rhythmic structure.
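
A minimal sketch of this reduction with scikit-learn's NMF (the paper does not name an implementation); the stand-in data and the component count of 10, matching Table 1, are illustrative:

```python
import numpy as np
from sklearn.decomposition import NMF

# Placeholder: V holds one raw (non-negative) RSHF vector per row.
V = np.abs(np.random.default_rng(0).random((698, 96)))

nmf = NMF(n_components=10, init='nndsvda', max_iter=500)
W = nmf.fit_transform(V)   # activations (n_songs x 10): the reduced feature
H = nmf.components_        # learned rhythmic tatum bases (10 x 96)
```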

3. FEATURE SALIENCE EXPERIMENTS
In order to test the salience of the RSHF, a set of both supervised and unsupervised machine learning experiments is performed. These experiments show that the feature has the potential to predict previously known intuitions about rhythm and style, as well as to reveal meaningful correlations learned directly from data that are sometimes difficult to quantify. It is important to note that while the following results are not state of the art in this stand-alone context, the RSHF is clearly salient and has the potential to improve and inform other systems and tasks [13].

A set of classification tasks is performed using the popular Ballroom Dataset [7]. The dataset's audio examples are 30 seconds in length and are labeled with a specific ballroom dance style. This dataset is chosen because its labels apply directly to terms that reference quantifiable attributes of the music, and not to cultural popularity (e.g., genre).

3.1 Supervised Experiments
The goal of the first supervised experiment is to classify the Ballroom Dataset with respect to the given ballroom style label. This is performed using an SVM classifier with a Radial Basis Function (RBF) kernel. The dataset is split into 30% test and 70% train, and the parameters of the model are fit using 10-fold cross-validation. The experiment is repeated with the raw RSHF, NMF, and PCA features. Results for style classification are presented in Table 1. This task has been performed previously by many others [2, 12, 8], with classification results surpassing 90%. The goal of this work is not to solve this task specifically, but to show its salience in this domain. It was also shown that tempo alone is a good descriptor of styles on the Ballroom Dataset [12]. The beat-tracking algorithm inherently includes an estimate of tempo. By including that estimate along with the RSHF, classification is improved.

Feature   Dim.   RSHF Alone      Dim.   RSHF & Tempo
RAW        96    0.562 ± 0.035    96    0.755 ± 0.017
PCA        10    0.574 ± 0.044    26    0.777 ± 0.018
NMF        10    0.568 ± 0.019    17    0.761 ± 0.017

Table 1: Accuracies in the style task for the raw RSHF and the best-performing reductions.
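
The classification setup described above can be sketched with scikit-learn; the hyperparameter grid and the synthetic stand-in data are assumptions, since the paper specifies only the kernel, the split, and the cross-validation scheme.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Placeholders: X holds RSHF vectors (optionally with the tempo
# estimate appended); y holds the eight ballroom style labels.
rng = np.random.default_rng(0)
X = rng.random((698, 97))
y = rng.integers(0, 8, size=698)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y)

# RBF-kernel SVM, parameters fit with 10-fold cross-validation
grid = GridSearchCV(SVC(kernel='rbf'),
                    param_grid={'C': [1, 10, 100],
                                'gamma': ['scale', 0.1, 0.01]},
                    cv=10)
grid.fit(X_train, y_train)
print('test accuracy:', grid.score(X_test, y_test))
```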

In the second experiment we classify whether a given piece has duple or triple feel, using the same experimental setup. In the case of NMF and PCA, the result for the number of components with the highest classification accuracy is shown. In both cases the raw style feature performs well. By adding tempo, classification accuracy is improved once again. These results are shown in Table 2. Additionally, both classification experiments show that by reducing the dimensionality in a meaningful way, such as using activations of learned NMF bases, classification performance is maintained.

Feature   Dim.   RSHF Alone      Dim.   RSHF & Tempo
RAW        96    0.831 ± 0.017    96    0.937 ± 0.033
PCA         2    0.854 ± 0.023     3    0.954 ± 0.018
NMF        13    0.851 ± 0.044     3    0.937 ± 0.014

Table 2: Accuracies in the duple vs. triple task for the raw RSHF and the best-performing reductions.

3.2 Unsupervised Experiments
In certain analysis tasks of expression and style, hard labels are not sufficient to describe certain musical phenomena. It may be necessary for expression and style components to sit in a continuous space, and to employ unsupervised methods to find meaningful correlations and relationships in the data. In order to test the RSHF's effectiveness in this unsupervised domain, we performed a set of simple k-means clustering experiments (k = 2 and k = 4).

In order to explore the cluster space, the RSHF feature is shown along with the annotated ground-truth labels for style and feel in Figure 3. The 96-dimensional feature is visualized using t-distributed Stochastic Neighbor Embedding (t-SNE), a tool for visualizing high-dimensional data by preserving the distance between points in a lower-dimensional space. More on t-SNE can be found in [14]. Notice that while the different classes of style-labeled and feel-labeled data overlap, each occupies a unique area throughout the space.
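
Both analyses can be sketched on the raw feature matrix as follows; all t-SNE and k-means settings beyond k are library defaults rather than values from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

# Placeholder: X is the n_songs x 96 matrix of raw RSHF vectors.
X = np.abs(np.random.default_rng(0).random((698, 96)))

# 2-D t-SNE view of the feature space (as in Figure 3)
embedding = TSNE(n_components=2).fit_transform(X)

# k-means clusterings compared against the style labels (as in Figure 4)
for k in (2, 4):
    cluster_ids = KMeans(n_clusters=k, n_init=10).fit_predict(X)
```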

Figure 3: The projections of the raw RSHF feature into 2 dimensions for (a) duple/triple designation and (b) individual style classifications.

In Figure 3, Quickstep, Samba, and Rumba group together. All of these styles are syncopated in nature, meaning the rhythmic emphasis does not always lie on the downbeat. Additionally, these styles tend to be rhythmically dense, which is another reason for their similarity in the RSHF space. In contrast, Waltz, Viennese Waltz, and ChaChaCha are very straight-forward styles with a heavy downbeat emphasis. Waltz rhythms are also composed with 3 beats per measure, unlike the rest of the styles, which have 4. In their grouping together, this notion of triple feel is captured as well. The Viennese Waltz is quicker and more rhythmically dense than the traditional Waltz. Jive employs a compound duple meter, in that there are four beats per measure, but the eighth notes are usually swung, giving it a fast triple feel as well. Because of this, Jive and Viennese Waltz form their own grouping. Within this grouping, the duple and triple meters are still somewhat separated.

In order to capture these empirical observations organically, k-means clustering is performed on the RSHF. The results of the unsupervised clustering are shown in Figure 4. The percentages of the original style labels contained within a specific cluster are shown for all clusters. In the k = 2 clustering, the somewhat obvious separation of the convex and concave arc structures in the data occurs. The complex and more dense styles cluster apart from the simpler, straight-forward styles. As the number of clusters increases to k = 4, each of the styles starts to separate. Jive and Viennese Waltz separate from the other dense rhythms, which can be attributed to the triple and compound meters of Waltz and Jive versus the simple meters of Quickstep, Rumba, and Samba. The more straight-forward rhythmic styles also start to split. Tango and ChaChaCha still group tightly due to the fact that their rhythms are composed of very similar structures with a heavy emphasis on the beat. Both Waltz styles also start to separate into multiple groups. Some of the Viennese Waltz and Waltz rhythms group together in their own cluster, while others tend to group with other styles of similar or lesser rhythmic density.

Figure 4: The percentage of each style label in each k-means cluster for (a) k = 2 and (b) k = 4.

4. CONCLUSIONS AND FUTURE WORK
The Rhythmic Style Histogram Feature captures musically informative patterns in a compact form, leveraging tatum-aligned IOI histograms over multiple Mel-spaced frequency bands. Through a set of supervised and unsupervised machine learning experiments, we showed that the RSHF is informative for the task of rhythmic style classification and analysis. Additionally, forms of dimensionality reduction such as NMF further reduce the representation of the RSHF while maintaining the underlying musical meaning.

In future work, the RSHF can be improved by adding components that relate to rhythmic style. In its current state, the feature's IOI histograms are based solely on raw counts. It could be more informative to scale the IOIs summed in the histogram by the magnitude of the respective peaks in the onset detection function, making higher onset peaks more meaningful. Additionally, we can incorporate some notion of phase. In the current representation, tatums on the downbeat and tatums on the upbeat are treated the same if their IOIs are the same. Incorporating phase would allow systems to distinguish this phenomenon.

The RSHF also has many future implications for rhythmic style analysis as a whole. One possible use is the learning of musically informed transition probabilities for an HMM to detect onsets (similar to [1]). Because the RSHF is a probabilistic model that intuitively maintains rhythmic components, it can inform the generation of rhythmic styles as well. This generation is also tunable among different frequency bands, allowing it to capture and synthesize rhythms that have contrasting low pitched and high pitched components, such as a kick drum vs. a snare drum.

All in all, one of the best attributes of the RSHF is its ability to adapt to many different applications. It is through this adaptability that the RSHF can aid in a wide variety of tasks within domains that require an analysis of rhythmic style.

5. REFERENCES
[1] N. Degara, M. E. P. Davies, A. Pena, and M. D. Plumbley. Onset event decoding exploiting the rhythmic structure of polyphonic music. J. Sel. Topics Signal Processing, 5(6):1228–1239, 2011.
[2] S. Dixon, F. Gouyon, and G. Widmer. Towards characterisation of music via rhythmic patterns. ISMIR, 2004.
[3] D. Ellis. Beat tracking by dynamic programming. Journal of New Music Research, 36(1):51–60, 2007.
[4] T. V. et al. Automatic genre classification of Latin American music using characteristic rhythmic patterns. In Audio Mostly Conference, page 16, 2010.
[5] D. FitzGerald. Harmonic/percussive separation using median filtering. DAFx, 2010.
[6] J. Foote and S. Uchihashi. The beat spectrum: a new approach to rhythm analysis. ICME, 2001.
[7] F. Gouyon, A. Klapuri, S. Dixon, M. Alonso, G. Tzanetakis, C. Uhle, and P. Cano. An experimental comparison of audio tempo induction algorithms. IEEE Trans. Audio, Speech, and Language Processing, 14(5):1832–1844, Sept. 2006.
[8] A. Holzapfel and Y. Stylianou. Scale transform in rhythmic similarity of music. IEEE Trans. Audio, Speech, and Language Processing, 19(1):176–185, Jan. 2011.
[9] D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, 1999.
[10] M. Leimeister, D. Gaertner, and C. Dittmar. Rhythmic classification of electronic dance music. AES Semantic Audio, Jan. 2014.
[11] J. L. Oliveira et al. IBT: A real-time tempo and beat tracking system. ISMIR, 2010.
[12] B. Schuller, F. Eyben, and G. Rigoll. Fast and robust meter and tempo recognition for the automatic discrimination of ballroom dance styles. ICASSP, Apr. 2007.
[13] E. Tsunoo, G. Tzanetakis, N. Ono, and S. Sagayama. Beyond timbral statistics: Improving music classification using percussive patterns and bass lines. IEEE Trans. Audio, Speech, and Language Processing, 19(4):1003–1014, May 2011.
[14] L. Van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2008.

