A Nonlinear Method for Stochastic Spectrum Estimation in the Modeling of Musical Sounds

Nicola Laurenti, Giovanni De Poli, Daniele Montagner
Abstract
We propose an original technique for separating the spectrum of the noisy component from that of the
sinusoidal, quasi-deterministic one, for the sinusoids + transients + noise (S+T+N) modeling of musical sounds.
It also enables estimation of the time-domain noise envelope and detection of transients with standard techniques.
The algorithm for spectrum separation relies on nonlinear transformations of the amplitude spectrum of the
sampled signal, obtained via the fast Fourier transform (FFT), which make it possible to eliminate the dominant
partials without the need for precisely tuned notch filters. The envelope estimation is performed by calculating the energy of the
signal in the frequency domain, over a sliding time window.
Several transformations (such as pitch shifting, time stretching, etc.) can be performed on the stochastic
spectrum obtained in this way prior to resynthesis. The synthesized sound is built via the inverse fast Fourier transform (IFFT)
with overlap-add method. The performance of the proposed algorithm is assessed on synthetic, instrumental and
natural sounds in terms of different quality measures.
Index Terms
Spectral modeling, nonlinear analysis, sound analysis, sinusoidal modeling, parametric modeling, residual
modeling
I. INTRODUCTION
Spectral analysis of sound produced by musical instruments shows that the spectral energy of the sound
signals can be interpreted as the sum of two main components: a sinusoidal component that is concentrated
around a discrete set of frequencies and a stochastic component that has a broadband characteristic. The
sinusoidal component normally corresponds to the main modes of vibration of the system. The stochastic
residual accounts for the energy produced by the excitation mechanism which is not turned into stationary
vibrations by the system and for any other energy component that is not sinusoidal. Hence, separation of the
two components has a robust physical foundation, and applications that preserve it (from sound transformation
to parametric coding and sound description) are quite effective. We devote our attention to the modeling of
the stochastic component and we focus on estimating its amplitude spectrum and time-domain envelope (from
now on, stochastic spectrum and stochastic envelope, respectively).
The authors are with Dipartimento di Ingegneria dell'Informazione, Università di Padova, via Gradenigo 6/B, 35131 Padova, Italy.
E-mail: {nil,depoli}@dei.unipd.it
[Fig. 1 diagram: analysis path — sn is multiplied by wn, passed through the FFT and |·| to give Sk, which feeds the stochastic spectrum estimation (output Bk) and the stochastic envelope estimation (output rn). Synthesis path — Bk and rn undergo a user-controlled transform yielding Ak and en; Ak is paired with e^{jφk}, passed through the IFFT, multiplied by w_n^syn, overlap-added, and multiplied by en to give ηn.]
Fig. 1. Block diagram of the stochastic analysis and synthesis, using the proposed spectrum separation procedure.
We propose an original technique based on nonlinear transformations of the amplitude spectrum Sk of the
signal sn multiplied by the analysis window wn, to estimate the spectrum Bk and the time envelope rn of the
noisy residual, as illustrated in the top of Fig. 1, and described in sections II and III.
The resynthesis, as in the bottom of Fig. 1, can be preceded by user controlled parametric transformations
of the estimated spectrum and time envelope (most commonly, these would be pitch shifting with preservation
of time structure and time stretching under tonal coherence condition [26]). Then, by generating random,
independent and uniformly distributed phases ϕk for each frequency bin of the signal discrete Fourier transform
(DFT) and pairing them with the corresponding amplitudes Ak we obtain, via IFFT with the synthesis window
w_n^syn, a quasi-stationary colored noise [25], which is finally multiplied by the time envelope en to obtain ηn.
The synthesized sound is perceptually plausible and contains no sinusoidal residual.
The paper has the following structure. We conclude the introduction by reviewing other existing methods.
Then, we separately describe in detail our algorithms for the estimation of the spectrum and time envelope in
sections II and III, respectively. In section IV we carry out a set of measurements for testing the separation
process by using composite synthetic signals, then, in section V, we compare it in performance to other
methods. In section VI we present a further refinement and different potential applications of the method, and
draw conclusions.
A. Review of existing methods
In the digital processing of musical sounds based on time-frequency representations, the sinusoidal quasi-
deterministic component, the noisy stochastic part, and temporal transients are often treated separately, due
to their quite different features. In particular, in the spectrum modeling technique called sinusoids + residual
(S+R) modeling [1], [2], the deterministic component is usually modeled as a sum of stable sinusoids (with slow
amplitude and frequency variations), whereas the noisy component is modeled through time-varying filtering
of stationary white noise. Different approaches for modeling the noisy component were proposed in [3]–[8].
However, the S+R approach, unless coherent structures in the residual are modeled accurately as in [9], has the
drawback of an improper modeling of temporal transients and instrument attacks, which play an important role
in psychoacoustic perception [10].
In the S+T+N model [5], [11], [12], the signal analysis and additive synthesis are extended to three components
in order to get a more general representation of the input sound. Therefore, in the analysis of digitally sampled
sound waveforms, the three components must be carefully separated in order to extract the three sets of
parameters that are needed for their processing and synthesis. On the other hand, since in most cases the
energy of the sinusoidal component is significantly larger than that of the others, the separation process is not
an easy task. Indeed, in some approaches transients are identified and temporally separated before sinusoidal
parameter estimation to avoid interference [13].
Methods for noise separation and estimation can follow two distinct approaches, based on subtraction or on
filtering, and both methods can be implemented either in the time or frequency domain. Time-domain subtraction
methods rely on a precise estimate of the sinusoidal component, then subtract its waveform from the original
sampled sound [1], [2], yielding the residual component. The latter is then processed and resynthesized using
the short time Fourier transform (STFT). The main drawback of such a direct method is the sensitivity of
the relative phase of the synthesized sinusoidal part to analysis parameters such as window and hop sizes [1]. The
residual from subtraction combines the effect of errors in the sinusoidal analysis with that of noise sources.
After subtraction this can give rise to an undesired and unstable “sinusoidal part” in the residual, the energy
of which can be larger than that of the non sinusoidal one. Such effects can result in perceptually annoying
artifacts which render the model unusable for further processing. In “analysis-by-synthesis” systems [14], [15]
the sinusoidal parameters are estimated iteratively, and at each iteration, errors introduced in the residual can
be counteracted by new deterministic components. This limits the introduction of sinusoidal components in
the residual, although it might lead to the introduction of spurious partials in the estimate of the sinusoidal
component.
Subtraction of the complex sinusoidal spectrum on a frame-by-frame basis is used in [16] to derive the
residual; this method is in principle equivalent to time-domain subtraction, and computationally more efficient
as only the few bins around each partial are involved in the calculation. However, in the case of monophonic
sounds, since the analysis is not done in a pitch-synchronous way, this method suffers from a much higher
sensitivity to errors in the estimation of partials parameters and to the window type.
On the other hand, time-domain filtering methods process the original sound through filters that exhibit deep
notches at the partial frequencies. They provide a more realistic noise residual, especially since the amplitude
envelope of the original noise is preserved; however, if the notches are not very selective, the resulting
spectrum turns out to be "anti-harmonic" rather than stochastic. This type of problem is generally solved by
performing filtering through cancellation of the sinusoidal component in the frequency domain with some sort
of curve fitting [17], i.e. finding a function that matches the general contour of the given filtered amplitude
spectrum. For example, the "straight-line approximation" method is used in [18] after eliminating the points in
the amplitude spectrum that are supposed to represent the partials.
In speech analysis, an iterative algorithm operating alternately in the time and frequency domains with erasure
and substitution of the partials was proposed in [19], but it was shown to exhibit convergence problems in
[20], where a different technique was developed. This is based on deriving the partial parameters in a pitch-
synchronous analysis, then subtracting the reconstructed partials from the unwindowed complex spectrum.
Substitution of the partials can then be performed on the power spectrum, but this method degrades in the
[Fig. 2 diagram: Sk → short filter → S′k → reciprocal (·)^{−1} → Rk → averaging filter → R′k → reciprocal (·)^{−1} → Bk]
Fig. 2. Block diagram of the stochastic spectrum estimation procedure.
presence of jitter and shimmer (i.e. random fluctuations in frequency and amplitude).
Further methods for stochastic estimation were developed in the framework of parametric audio coding [7],
[21] and of harmonic/noise and signal/noise ratio estimation [22]–[24], in which time resolution is not an issue, so
that much longer analysis windows can be used. In [22], Qi estimates the stochastic energy by comparing the
original signal in the time domain with an estimate of its sinusoidal component obtained through averaging of
several subsequent pitch periods of the signal. In [23] estimation is performed in the cepstrum domain with the
use of a comb filter and by shifting the spectral baseline. Improved versions of the two methods are compared
in [24], with similar performance.
As for the two basic transformations commonly applied in the resynthesis (pitch shifting and time stretching),
in general pitch shifting is performed only on the sinusoidal part, while the noisy and transient parts are
reproduced unshifted [1], [18]. Nevertheless some instruments (e.g. clarinet) present a tonal stochastic part
which should be taken into account in such a transformation to improve the quality of the output sound [27].
Time stretching is performed both on the sinusoidal and noisy part, whereas transients are only time shifted
in order to preserve psychoacoustic coherence [5], [12], as they would otherwise lose sharpness in their attack
and tend to sound dull after this transformation.
In our opinion, the complexity of the problem allows for further research and investigation of new approaches.
II. STOCHASTIC SPECTRUM ESTIMATION
In this section we describe the stochastic spectrum estimation technique in detail. Given the original sound
signal s(t), modeled as the sum of partials, a wideband stochastic signal g(t) and transient events (both often
with a much lower energy than the sinusoidal part), we are faced with the task of finding a smooth function
B(f) that approximates the time-varying spectrum of the stochastic component. Such a function represents
an average spectrum of the stochastic component realizations and should be updated as we move our analysis
window along the sound samples.
It is evident that by performing a mean filtering in frequency of the signal spectrum, we would only obtain
a spreading of the narrowband partials, since their amplitudes are much larger than the underlying stochastic
spectrum. Also, if we removed the partials with a comb filter before the mean filtering, the effect of the latter
would be to spread the rather wide comb notches, unless the comb filter is very precisely tuned and capable
of tracking the partial frequencies. As we said above, this would give rise to an “anti-harmonic” spectrum.
On the other hand, if we consider the reciprocal of amplitude spectrum R(f) = 1/|S(f)|, then in place of
the highly energetic partials in S(f) we will find deep and selective notches in R(f), which can be eliminated
through a mean filter, whereas the reciprocal of stochastic spectrum will play a prominent role in the averaging
performed by the filter. Once the filtered reciprocal spectrum is obtained, we must in turn take its reciprocal to
recover the required function B(f) approximating the stochastic residual spectrum. We note that when taking the
reciprocal R(f) we might end up with some very high accidental peaks that correspond to zeros of S(f),
due to the use of the instantaneous spectrum. Such peaks would corrupt the result of filtering R(f), in much the
same way that the partials would corrupt the result of filtering |S(f)|. However, since the zeros are much more
isolated and randomly distributed than the partials, they can be cancelled, without substantially altering the
spectrum shape, by passing |S(f)| through a very short mean filter of length ∆f (we choose ∆f = 3 bins)
before the nonlinear transformation, or alternatively by median filtering.
The above technique requires the following steps, as shown in Fig. 2:
1) consider the sequence of sound samples $s_n = s(t_0 + nT_s)$, $n = 0, \ldots, N_t - 1$, taken at rate $F_s = 1/T_s$, and belonging to an analysis frame $N_t$ samples long, starting at $t_0$;
2) calculate the amplitude spectrum of $\{s_n\}$ by taking the absolute value of its $N_t$-point DFT with a suitable window function $w_n$ (e.g. Hann, Hamming or Blackman)
$$S_k = \left| \sum_{n=0}^{N_t-1} s_n w_n e^{-j2\pi nk/N_t} \right|, \quad k = 0, \ldots, N_t - 1 \qquad (1)$$
3) remove incidental zeros in $\{S_k\}$, by replacing it with
$$S'_k = (S_{k-1} + S_k + S_{k+1})/3 \qquad (2)$$
4) calculate the reciprocal spectrum
$$R_k = \frac{1}{S'_k} \qquad (3)$$
5) smooth $R_k$ by cyclic convolution¹ with the $N_f$-point mean filter impulse response
$$R'_k = \frac{1}{N_f} \sum_{h=-\lfloor N_f/2 \rfloor}^{\lceil N_f/2 \rceil - 1} R_{(k-h) \bmod N_t} \qquad (4)$$
6) calculate the reciprocal of $R'_k$, which gives the required approximation to the residual spectrum
$$B_k = \frac{1}{R'_k} \qquad (5)$$
Observe that, as described, the algorithm has two adjustable analysis parameters: the length Nt of the
analysis window in the time domain and the length Nf of the mean filter in the frequency domain. Both
must be set according to the time and frequency variability of the sinusoidal component and of the stochastic
residual spectra. Typically they should be set to rather low values in order to track fast variations in
time and frequency shaping. The spectral shapes obtained from each step of the procedure are plotted in Fig. 3
for a flute sound with pitch at 1780 Hz (A6), by using Nt = 1024 and Nf = 25 with a sampling frequency
Fs = 44.1 kHz. As for the hop size H at the analysis stage, it should meet the condition for reconstruction
of the signal in the analysis/synthesis cascade with the analysis window (e.g. H ≤ Nt/4 for the Hamming
window [28]). Nt should be fixed to the lowest power of two (in samples) that is at least four periods of the
fundamental pitch in a harmonic sound, to avoid overlapping of the harmonics. For the same reason, Nt must
¹Cyclic convolution is the appropriate form of convolution when dealing with periodic signals. As the original signal is sampled, its
two-sided spectrum {Sk}, and likewise {S′k} and {Rk}, are to be considered periodic (in the frequency domain) with period Nt points.
[Fig. 3 panels, each showing amplitude (dB) versus f (kHz) from 0 to 20 kHz: a) original Sk; b) filtered S′k; c) reciprocal Rk; d) filtered R′k; e) noise Bk; f) Bk versus Sk]
Fig. 3. Spectra resulting from the consecutive steps in the estimation procedure.
be increased accordingly in the case of polyphonic sounds. We also note that the algorithm is homogeneous,
i.e. if the input signal {sn} is multiplied by a positive constant (and hence so is its amplitude spectrum {Sk}),
then the estimated noise spectrum {Bk} will turn out multiplied by the same factor. No normalization in the
signal level is therefore required, unless one is worried about the effects of finite precision arithmetic.
The synthesis of the stochastic component is the generation of a noise signal that has the frequency and
amplitude characteristics described by the spectral envelopes of the stochastic representation. In order to reduce
the amount of data for storage and computing transformations, significant parameters can be extracted by using
the Bark Band Noise Modeling proposed by Goodwin [6]. Starting from the spectral envelope or alternatively
from its Bark Band representation we generate random independently and uniformly distributed phases in
(−π, π) for each bin, and pair them with the corresponding amplitudes. The synthesized stochastic signal
is obtained via inverse STFT. After windowing, the resulting waveforms are overlapped (with a 50% overlap
factor), added and multiplied by a normalization constant. The IFFT size, hop size for synthesis and consequently
the normalization factor, may be changed for applying time-scaling effects to the input sound. [18], [25].
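The random-phase IFFT and overlap-add synthesis described above could be sketched as follows. This is a sketch, not the authors' implementation: the helper name `resynthesize_noise` is ours and the normalization constant is omitted.

```python
import numpy as np

def resynthesize_noise(B_frames, w_syn, hop):
    """Overlap-add resynthesis of colored noise from amplitude spectra.

    B_frames : sequence of stochastic amplitude spectra (one per frame);
    w_syn : synthesis window; hop : synthesis hop (50% overlap -> Nt//2).
    """
    Nt = len(w_syn)
    out = np.zeros(hop * (len(B_frames) - 1) + Nt)
    rng = np.random.default_rng()
    for i, B in enumerate(B_frames):
        # random phases, independent and uniform in (-pi, pi), per bin
        phi = rng.uniform(-np.pi, np.pi, Nt)
        spec = B * np.exp(1j * phi)
        # irfft uses the non-negative-frequency half and returns a real frame
        frame = np.fft.irfft(spec[: Nt // 2 + 1], n=Nt)
        out[i * hop : i * hop + Nt] += frame * w_syn
    return out
```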
III. TIME ENVELOPE ESTIMATION AND TRANSIENT DETECTION
To estimate the time envelope of the stochastic component we can make use of the frequency-domain
information obtained in the previous steps. Consider the spectrum separation procedure performed on the
portion of the signal samples within the analysis window [n0, n0 + Nt − 1]. From the noise spectral samples
{Bk} obtained in step 6 we can derive a measure of the energy of the noise process gn within the window. In
fact, since windowing and the DFT are linear operations, we can consider Bk to be a good approximation to
the amplitude spectrum of the windowed noise gnwn. Therefore we must have
$$E_B = \sum_{k=0}^{N_t-1} B_k^2 = N_t \sum_{n=0}^{N_t-1} (g_{n+n_0} w_n)^2 \qquad (6)$$
Assuming gn is a stationary random process within the analysis window, let Mg be its statistical power and
Eg = NtMg its average energy. Then, EB is a random variable with mean
$$m_{E_B} = N_t M_g E_w \qquad (7)$$
where $E_w = \sum_{n=0}^{N_t-1} w_n^2$. Consequently, the noise energy within the window can be estimated as
$$E_g^{[n_0,\,n_0+N_t]} = \frac{1}{E_w} \sum_{k=0}^{N_t-1} B_k^2 \qquad (8)$$
and an estimate of the stochastic envelope at the window midpoint is obtained as
$$r_{n_0+N_t/2} = \sqrt{\frac{E_g^{[n_0,\,n_0+N_t]}}{N_t}} \qquad (9)$$
By progressively shifting the analysis window along the signal sequence by small hops, we can obtain a rather
dense grid of envelope values, which have to be interpolated to yield the required envelope rn.
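The per-frame envelope estimate of (6)–(9) can be sketched as follows; the helper name `envelope_point` is ours:

```python
import numpy as np

def envelope_point(B, w):
    """Envelope estimate at the window midpoint, following (6)-(9).

    B : estimated noise amplitude spectrum for one frame;
    w : analysis window of length Nt.
    """
    Nt = len(w)
    Ew = np.sum(w ** 2)            # window energy, as in (7)
    Eg = np.sum(B ** 2) / Ew       # estimated noise energy, eq. (8)
    return np.sqrt(Eg / Nt)        # r at n0 + Nt/2, eq. (9)
```

Sliding this over the signal with a small hop yields the dense grid of envelope values mentioned in the text.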
However, the envelope obtained by this estimation procedure tends to smooth out the initial transients (note
attacks) and time-localized events, due to the long time windows used in the analysis stage, so these components
have to be treated separately. This phenomenon is clearly illustrated by comparing the original and estimated
envelopes for a purely stochastic time-varying sound, such as a water flow recording [29], in Fig. 4.
For transient detection we can make use of the above procedure, in conjunction with methods of residual
estimation based on time-domain subtraction of the sinusoidal component, such as the one described in [1]. This
residual contains both noise and transients, so that its envelope will be much larger than the estimated stochastic
envelope in the neighborhood of transient events. Following [1], frames in which the peak detection algorithm
does not yield correct points for the peak continuation process are marked as residual (noise + transient). For
each marked frame we compare our estimated stochastic envelope rn with the residual envelope rn,R obtained
after subtraction of the deterministic part from the input sound, similarly to the method used in [1] for correcting
residual envelopes. We can gather the points where rn,R − rn > rth, with rth a suitable threshold value. This allows
a fine localization of impulsive events that have been smoothed out by the nonlinear stochastic estimation. Following
[5], [12], these regions are considered as pure transient (no separation between noise and transients is attempted) and
are copied from the input sound into the resynthesized sound using a cross fade.
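The thresholding step can be sketched as follows (illustrative only; the helper name is ours and the two envelope arrays and rth are assumed given):

```python
import numpy as np

def transient_regions(r_est, r_res, r_th):
    """Boolean mask of candidate transient samples: points where the
    residual envelope r_res exceeds the estimated stochastic envelope
    r_est by more than the threshold r_th."""
    return r_res - r_est > r_th
```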
[Fig. 4 panels: sn (normalized) versus time (top); rn (normalized) versus time, original and estimated (bottom)]
Fig. 4. Original water flow sound (top) and comparison (bottom) between original (gray line) and estimated (black line) envelopes. The
original envelope was obtained by averaging over 64-sample non-overlapping rectangular windows. The estimated envelope was obtained
with a Hamming window of Nt = 1024 samples, hop size H = Nt/4, and averaging over Nf = 11 bins in the frequency domain.
IV. ASSESSMENT OF THE METHOD WITH SYNTHETIC SOUNDS
To evaluate the performance of our algorithm and determine its application range we tested it on different
synthetic sounds obtained by adding stationary noise with a known power spectral density to a purely harmonic
signal. We now compare the estimated stochastic spectrum with the result of directly smoothing the amplitude
spectrum of the noise signal.
In the following we describe the signal generation procedure, the parameters that we used for evaluation,
and show some plots of the results.
A. Test signals
In each test the noise signal u(t) is Gaussian and stationary, generated by filtering a zero mean, unit variance
Gaussian white noise uw(t) with a filter gi(t). In particular, we simulated pink noise (using the filter coefficients
given by the Kasdin algorithm [30]), white noise, and 12 auto-regressive moving-average models whose
spectral shapes are shown in Fig. 5.
Since the amplitudes of the noise DFT are Rayleigh distributed, our estimate will turn out to be biased even
in the ideal condition of absence of the sinusoidal component. The bias factor can be easily calculated for a
rectangular analysis window, considering that three independent samples of the DFT amplitude are averaged in
(2) before reciprocation, as
$$k_1 = \left[ \int\!\!\!\int\!\!\!\int_0^{+\infty} \frac{3\, p(a)\, p(b)\, p(c)}{a+b+c}\, da\, db\, dc \right]^{-1} \simeq 0.90 \simeq -0.9\ \text{dB} \qquad (10)$$
[Fig. 5 panels: Rn(f) = |Gi(f)|² (dB), from −20 to 0 dB, versus f (kHz) from 0 to 20 kHz, for the twelve noise models]
Fig. 5. Noise power spectral densities used in the test signals to assess the algorithm performance.
with
$$p(a) = \frac{\pi}{2}\, a\, e^{-\pi a^2/4}, \quad a > 0 \qquad (11)$$
the unit mean Rayleigh probability density function. For practical windows the above value of k1 is seen to be
a good approximation.
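The value of k1 in (10) can be checked by Monte Carlo simulation with unit-mean Rayleigh samples; this sketch is ours, not part of the original method, and mirrors the 3-bin averaging and double reciprocation in the noise-only case:

```python
import numpy as np

# Monte Carlo check of the bias factor k1 in (10): draw triples of
# independent unit-mean Rayleigh amplitudes, average each triple, and
# take the reciprocal of the mean of the reciprocals.
rng = np.random.default_rng(0)
# Rayleigh(scale) has mean scale*sqrt(pi/2); scale = sqrt(2/pi) gives mean 1
a = rng.rayleigh(scale=np.sqrt(2 / np.pi), size=(200_000, 3))
k1 = 1.0 / np.mean(3.0 / a.sum(axis=1))
# k1 should come out close to the paper's value of 0.90 (about -0.9 dB)
```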
As for the deterministic component, we choose to generate harmonic signals with all harmonics, bandlimited
to (0, 10 kHz). We set the amplitude of the k-th harmonic to be Ak = A1/k, with A1 determined by
the harmonic/noise ratio (HNR) and generate random phases φk as independent and uniformly distributed in
(0, 2π). The pitches were chosen among the B and E notes in the [60 Hz, 4000 Hz] range, and the HNR values
considered are {0, −3 dB, 0 dB, 3 dB, 10 dB, 20 dB}, where HNR = 0 (in linear scale) denotes a pure noise
signal. All signals are generated at rate Fs = 44.1 kHz for a
duration of 0.5 s.
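The harmonic part of a test signal can be sketched under the stated assumptions (Ak = A1/k, random phases, noise normalized to unit power); the helper name `harmonic_test_signal` is ours:

```python
import numpy as np

def harmonic_test_signal(f0, hnr_db, Fs=44100, dur=0.5, f_max=10000):
    """Harmonic component with Ak = A1/k, random phases, bandlimited
    to (0, f_max); A1 scaled so the harmonic/noise energy ratio is
    hnr_db, assuming a unit-power noise component."""
    rng = np.random.default_rng()
    n = np.arange(int(Fs * dur))
    K = int(f_max / f0)                      # number of harmonics kept
    s = np.zeros(len(n))
    for k in range(1, K + 1):
        phase = rng.uniform(0, 2 * np.pi)
        s += (1.0 / k) * np.sin(2 * np.pi * k * f0 * n / Fs + phase)
    # rescale so the harmonic power matches the requested HNR
    P = np.mean(s ** 2)
    return s * np.sqrt(10 ** (hnr_db / 10) / P)
```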
B. Performance measures
We assess the effectiveness of our algorithm against the estimate obtained by smoothing the amplitude
spectrum of the pure noise through averaging over a sliding window of Nf bins in the frequency domain.
This estimate can be expressed as
$$D_k = \frac{1}{N_f} \sum_{h=-\lfloor N_f/2 \rfloor}^{\lceil N_f/2 \rceil - 1} U_{(k-h) \bmod N_t} \qquad (12)$$
where
$$U_k = \left| \sum_{n=0}^{N_t-1} u_n w_n e^{-j2\pi nk/N_t} \right|, \quad k = 0, \ldots, N_t - 1 \qquad (13)$$
and both the values of Nf and the analysis windows {wn} are the same as in the proposed method (1)–(4).
When performed on the pure noise signal this estimate is unbiased and exhibits a lower variance than the
proposed method, but it is not applicable in the presence of the sinusoidal component.
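The reference estimate (12)–(13) can be sketched as follows (illustrative helper, not the authors' code):

```python
import numpy as np

def baseline_spectrum(u, w, Nf):
    """Reference estimate (12)-(13): smoothing of the pure-noise
    amplitude spectrum with a cyclic Nf-bin mean filter."""
    U = np.abs(np.fft.fft(u * w))                # eq. (13)
    h = np.arange(-(Nf // 2), (Nf + 1) // 2)
    return np.mean([np.roll(U, k) for k in h], axis=0)   # eq. (12)
```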
The deviation of our estimate is then measured in two different ways. We consider the average over all
frames of
1) the ratio of energies in a band (f0, f1),
$$m_{(f_0,f_1)} = 10 \log_{10} \frac{\displaystyle\sum_{k F_s/N_t \in (f_0,f_1)} B_k^2}{\displaystyle\sum_{k F_s/N_t \in (f_0,f_1)} D_k^2} \qquad (14)$$
2) the mean absolute log (MAL) spectral difference (in dB) [31]
$$d_1 = \frac{1}{N_t} \sum_{k=-N_t/2}^{N_t/2 - 1} \left| 20 \log_{10} \frac{B_k}{D_k} \right| \qquad (15)$$
We set as limits for an acceptable performance of the proposed method the bounds
$$-3\ \text{dB} < m_{(0,F_s/4)} < 3\ \text{dB}\,, \qquad d_1 < 1.8\ \text{dB} \qquad (16)$$
as a compromise between the considerations expressed in [31]–[34] about the correlation between error measures
and mean opinion score (MOS), and the concept of “just noticeable changes in amplitude” in [10].
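The two measures (14)–(15) can be sketched as follows (helper names are ours):

```python
import numpy as np

def energy_ratio_db(B, D, Nt, Fs, f0, f1):
    """Band energy ratio m_(f0,f1) of (14), in dB."""
    f = np.arange(Nt) * Fs / Nt          # bin center frequencies
    band = (f > f0) & (f < f1)
    return 10 * np.log10(np.sum(B[band] ** 2) / np.sum(D[band] ** 2))

def mal_db(B, D):
    """Mean absolute log spectral difference d1 of (15), in dB; by the
    periodicity of the spectra, the mean over all Nt bins equals the
    symmetric sum in (15)."""
    return np.mean(np.abs(20 * np.log10(B / D)))
```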
C. Results
We now illustrate and discuss the performance of the proposed method and its dependence on several
parameters of the original sound and of the algorithm.
1) Noise spectral shape: The method performance does not exhibit substantial differences from one
spectral shape to another; thus, in the following, we give results in dB (a more perception-related measure
than a linear scale) averaged over all spectral shapes. Some specific considerations do apply to the pink noise
case. As the amplitude spectrum of pink noise exhibits a 1/√f cusp at the origin, any smoothing operation
is bound to remove a large part of the signal energy. The loss in energy at low frequencies can be reduced
by shortening the averaging filter to Nf = 13 bins. The same problem was evident in the frequency-domain
method of [23]. A possible solution could be to exclude a narrow band at the lower end of the spectrum from
the smoothing procedure, and obtain those values from an average over multiple time frames, provided that the
fundamental frequency lies outside the chosen interval. This is justified by the slower variability of the narrowband
information. Another solution could take advantage of an estimate making use of the sound derivative, or
its approximation obtained with a high-pass filter as in [35], since the corresponding amplitude spectrum has a
√f shape around the origin. The effect of combining the two results should be investigated in future studies.
2) Window shape: First we tried different shapes of the analysis window, at different HNR values. For HNR
< 20 dB, Hamming, Hann and Bartlett windows exhibit superior performance with respect to Blackman and
Blackman-Harris windows, thanks to their narrower main lobe. In particular the Hamming window has a further
slight advantage given by its deeper first zero [36], and we have employed it in all the remaining tests. On the
other hand the Blackman and Blackman-Harris windows perform better for HNR > 20 dB, due to their higher
sidelobe attenuation. As the HNR grows, the use of Hamming, Hann or Bartlett windows becomes unviable.
[Fig. 6 panels: d1 (mean absolute log difference) and m(0,Fs/4) (energy ratio) versus fpitch from 125 Hz to 4 kHz, with curves for HNR = 0, −3 dB, 0 dB, 3 dB, 10 dB, 20 dB]
Fig. 6. Performance measures of the algorithm versus pitch frequency for different HNR values. The analysis is done with a Hamming
window of Nt = 1024 samples, hop size H = Nt/4, and averaging over Nf = 21 bins in the frequency domain.
3) Pitch frequency: In Fig. 6 we show the dependence of the algorithm performance on the fundamental
pitch frequency, at different HNR values, with the use of a 1024-sample analysis window. It can be observed
that the algorithm gives a better performance for higher pitch frequencies (the harmonics are more widely
spaced apart), and (of course) lower HNR. For example, at HNR = 10 dB, the pitch frequencies that allow a
satisfactory result in terms of the parameters (14) and (15) as stated in (16) are fpitch > 8/Tw, with Tw the
length of the analysis window. The results are confirmed by tests on 512- and 2048-sample windows. Observe
the nonideal behaviour of the estimate at HNR = 0, which is due to the bias (10).
4) Smoothing of the reciprocal spectrum: Fig. 7 shows how the estimate accuracy improves with increasing
the number of adjacent bins Nf considered in the smoothing of the reciprocal spectrum. The improvement is
noteworthy going from 9 (which is the minimum value for acceptable performance at HNR = 10 dB) to 25
bins, and is far less evident for Nf > 25. The choice of lowpass smoothing shapes other than the rectangular
one gives no improvement and, on the contrary, yields an even worse performance.
5) Spectrum reciprocation: In another set of tests we replaced the reciprocations in (3) and (5) with the
more general
$$R_k = \frac{1}{(S'_k)^\alpha}\,, \qquad B_k = \frac{1}{(R'_k)^{1/\alpha}} \qquad (17)$$
respectively, for different values of α. It should be observed that the combination of the α and 1/α powers in
(17) preserves the homogeneity of the algorithm, i.e. its invariance to multiplication of the input by a positive
[Fig. 7 panels: d1 (mean absolute log difference) and m(0,Fs/4) (energy ratio) versus Nf from 10 to 40 bins, with curves for HNR = 0, 0 dB, 10 dB, 20 dB]
Fig. 7. Performance measures of the algorithm versus length of the averaging filter in bins for different HNR values. The analysis is
done with a Hamming window of Nt = 1024 samples, hop size H = Nt/4, and the fundamental frequency is fpitch = 1046.5 Hz
(corresponding to C6).
constant. In the same ideal hypotheses that led to equation (10), the bias factor would now change to
$$k_\alpha = \left[ \int\!\!\!\int\!\!\!\int_0^{+\infty} \frac{3^\alpha\, p(a)\, p(b)\, p(c)}{(a+b+c)^\alpha}\, da\, db\, dc \right]^{-1/\alpha} \qquad (18)$$
yielding, for example,
$$k_2 \simeq 0.85 \simeq -1.46\ \text{dB}\,, \quad k_3 \simeq 0.78 \simeq -2.14\ \text{dB}\,, \quad k_4 \simeq 0.71 \simeq -3.01\ \text{dB}\,, \quad k_5 \simeq 0.61 \simeq -4.34\ \text{dB} \qquad (19)$$
From Fig. 8, we see that with α > 1 the estimator bias increases for low HNR, confirming the analysis in the
ideal case, while it decreases for high HNR (curves referring to pitch C4) and thus exhibits a stronger rejection
of the harmonic component. This advantage vanishes for high pitch frequencies (curves referring to pitch C7),
when the spacing between consecutive harmonics allows a good rejection also for α = 1. Higher values of
the reciprocation index also lead to a higher variance of the estimates, due to a sharper nonlinearity: indeed,
considering again the ideal hypotheses of (18), the variance even becomes infinite for α ≥ 3. However, a refinement of
the estimate that makes use of α > 1 will be seen in Section VI.
6) Tremolo and vibrato effects: We have investigated the effect of tremolo (amplitude modulation) and
vibrato (frequency modulation) in the harmonic component on the algorithm performance. In order to gain a
better insight we analyzed the two effects separately.
In the tests for tremolo we have used a 100% sinusoidal modulation of the amplitude with frequencies
ranging from 1 to 20 Hz. As can be expected, the average performance of the algorithm is little influenced by
[Fig. 8 panels: d1 (mean absolute log difference) and m(0,Fs/4) (energy ratio) versus HNR from −20 dB to 30 dB, with curves for α = 1, 2, 3 at pitches C4 (261 Hz) and C7 (2.093 kHz)]
Fig. 8. Performance measures of the algorithm versus harmonic-to-noise ratio for different values of the reciprocation exponent α, and
two different pitches, C4 (261.6 Hz) and C7 (2093 Hz). The analysis is done with a Hamming window of Nt = 1024 samples, hop size
H = Nt/4, and averaging over Nf = 21 bins in the frequency domain.
the tremolo effect. However, for low tremolo frequencies, when the length Tw of the analysis window is much
shorter than the tremolo period, the HNR is quite different going from one analysis window to another and so
is the algorithm performance.
In the set of tests for vibrato we have used a sinusoidal modulation with frequencies fvib = 5, 10 Hz,
and modulation depths ∆ = 0.036, 0.123, corresponding to a ±60 and ±200 cents deviation, respectively
(a cent is 1/100 of a semitone interval on the log scale). In this case partial frequencies can wander during
one analysis frame and their energy be spread across many adjacent bins. The algorithm performance is then
heavily influenced by the length Tw of the analysis window. Within one window, the maximum variation in the
frequency of the k-th partial is given by V k fpitch, where the parameter V is defined as
V = 2∆ sin(π fvib Tw)   for fvib Tw < 1/2
V = 2∆                  for fvib Tw ≥ 1/2
We can see from Fig. 9 that by choosing a 1024-sample window, and considering typical values for western
instruments [37] in the parameters (e.g. ∆ = 0.036, fvib = 5 Hz, which yield V = 0.025) the algorithm
performance is nearly unaffected by the vibrato effect. On the other hand it can also be seen that with higher
values of V the performance degradation can be substantial, limiting the pitch range in which the algorithm is
effective.
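The spread parameter V is straightforward to compute; the sketch below reproduces the value V ≈ 0.025 quoted above for ∆ = 0.036, fvib = 5 Hz and a 1024-sample window, assuming Fs = 44.1 kHz (our assumption, consistent with the reported figure).

```python
import math

def vibrato_spread(delta, f_vib, T_w):
    # Maximum relative frequency deviation V of a partial within one
    # analysis window of length T_w seconds (cases equation above).
    if f_vib * T_w < 0.5:
        return 2.0 * delta * math.sin(math.pi * f_vib * T_w)
    return 2.0 * delta

T_w = 1024 / 44100.0                    # assumed Fs = 44.1 kHz
V = vibrato_spread(0.036, 5.0, T_w)     # ≈ 0.025, as in the text
```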
[Fig. 9 plots: d1 mean absolute log difference and m(0,Fs/4) energy ratio versus fpitch (125 Hz to 4 kHz), for V = 0, 0.025, 0.047, 0.087, 0.163.]
Fig. 9. Performance measures of the algorithm versus pitch frequency, with frequency modulation (vibrato) of the harmonic component,
with different depths ∆ and modulation frequencies fvib. The HNR is 10 dB, and each curve is identified by a different value of the
parameter V , with V = 0 corresponding to the unmodulated sound. The analysis is done with a Hamming window of Nt = 1024 samples,
hop size H = Nt/4, and averaging over Nf = 21 bins in the frequency domain.
V. COMPARISON WITH OTHER METHODS
In this section we compare the performance of the proposed method with the complex spectrum subtraction
presented in [16] and the pitch-scaled comb filtering in [20], which both also operate in the frequency domain.
These methods, unlike the proposed one, require prior estimation of the sinusoidal component, and the perfor-
mance of stochastic estimation is crucially affected by the accuracy in the estimates of amplitude and frequency
of each partial. Moreover the comb filtering method requires a pitch-synchronous analysis window.
We make use of synthetic signals2 with the harmonic component as in Subsection IV-A and white noise,
at different HNR values, and we assume that estimation of the harmonic parameters is done with the method
presented in [35], which makes use of signal derivatives. Thus, the parameter estimates of amplitude and
fundamental frequency are derived by corrupting their true values with random Gaussian errors with bias and
standard deviation derived from [35, tables 5-6]. We test the three methods with the use of 1024-sample analysis
windows, Blackman for the complex spectrum subtraction method, rectangular for the pitch-synchronous comb
filtering, and Hamming for the proposed one, and compare them in terms of the measure (14).
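The corruption of the true parameter values can be sketched as follows; the helper name and the treatment of the tabulated values as mean and standard deviation of the relative error are our reading of [35, tables 5-6].

```python
import numpy as np

def corrupt(true_value, rel_bias, rel_sigma, rng):
    # Simulate a parameter estimate whose relative error has the given
    # mean (m) and standard deviation (sigma), as listed in Table I.
    return true_value * (1.0 + rel_bias + rel_sigma * rng.standard_normal())

rng = np.random.default_rng(0)
# e.g. an amplitude estimate at HNR = 10 dB (m = 7.5e-3, sigma = 5.3e-5)
A_hat = corrupt(1.0, 7.5e-3, 5.3e-5, rng)
```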
The results for HNR = 0, 10, 20 dB, averaged over the different pitches are given in Table I. We observe
2 The tests are done on synthetic sounds for the ease of parameter modifications and performance comparisons, although a more detailed
comparison should be carried out on real-world applications. Unfortunately, this would require the careful definition of an application
context as well as a more exhaustive testing and optimization of the other methods, which go beyond the scope of this work.
TABLE I
PERFORMANCE COMPARISON BETWEEN THE COMPLEX SUBTRACTION AND COMB FILTERING METHODS AND THE PROPOSED ONE.
MEAN AND STANDARD DEVIATION OF THE ERRORS ∆Ak AND ∆fpitch IN THE ESTIMATE OF HARMONICS AMPLITUDE Ak AND PITCH
FREQUENCY fpitch ARE DERIVED FROM [35].
HNR     ∆Ak/Ak (m, σ)            ∆fpitch/fpitch (m, σ)   complex subtraction  comb filtering  proposed method
                                                         m(0,Fs/4)            m(0,Fs/4)       m(0,Fs/4)
0 dB    2.14·10−2, 2.13·10−4     1.4·10−3, 1.56·10−2     1.86 dB              2.38 dB         1.05 dB
10 dB   7.5·10−3, 5.3·10−5       6·10−4, 5·10−3          2.98 dB              3.77 dB         2.10 dB
20 dB   2.7·10−3, 2.7·10−5       2·10−4, 2.1·10−3        3.94 dB              5.01 dB         3.36 dB
that the three methods perform similarly, with errors increasing with HNR, while the proposed method has a much lower computational complexity and requires no estimation of the partial parameters. The error increase at higher HNR holds for the subtraction and filtering methods even though the relative errors in the estimated partial parameters decrease roughly in inverse proportion to the HNR. The comb filtering
method shows its limits in the presence of amplitude and frequency fluctuations, or erratic estimates. As for the
proposed method, we next examine a way to limit its dependence on the energy of the sinusoidal component.
VI. A FURTHER REFINEMENT OF THE ESTIMATE
The performance of the proposed method degrades for lower pitches and higher HNR values, since in these
cases its damping of the partials is not sufficient. Therefore, to extend its range of application we can pre-attenuate the highest partials, and to this purpose we use the same estimation method with a
higher reciprocation exponent. The result will have the partials strongly attenuated, and a biased estimate (as
shown in Section IV-C) in the stochastic part of the spectrum, so we only use it to substitute the spikes in the
amplitude spectrum, and then apply the method of Section II.
The overall estimate in this case will proceed as follows. Starting from the amplitude spectrum Sk:
1) Perform estimation on {Sk} with (3) and (5) replaced by (17) and α > 1. Call Gk the result.
2) Find the spikes as the points where Sk > KGk (a good threshold is obtained with K > 2) and replace them with the corresponding values of Gk, that is, consider the corrected amplitude spectrum
   Sck = Sk   for Sk ≤ KGk
   Sck = Gk   for Sk > KGk
3) Perform the estimation (3)-(5) on {Sck}, that is, with ordinary exponent α = 1.
In this refined version the algorithm has additional parameters: the filter length Nf in steps 1 and 3 can be
different, and the constant K determining the threshold for spike detection can be chosen.
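Assuming the core estimator of (3)-(5) amounts to a cyclic moving average of the reciprocated amplitude spectrum, raised back with the inverse exponent (as the Conclusions summarize it), the three steps might be sketched as below; the function names and the guard constant are ours.

```python
import numpy as np

def circ_smooth(x, Nf):
    # Cyclic (circular) moving average over Nf bins, centered on each bin,
    # implemented via the convolution theorem.
    half = Nf // 2
    k = np.zeros(len(x))
    k[:half + 1] = 1.0 / Nf
    k[-half:] = 1.0 / Nf
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(k)))

def recip_smooth(S, Nf=21, alpha=1.0):
    # Sketch of the core estimator (3)-(5): smooth the reciprocated
    # amplitude spectrum, then reciprocate back with exponent 1/alpha.
    Sa = np.maximum(S, 1e-12) ** (-alpha)       # guard against zero bins
    return circ_smooth(Sa, Nf) ** (-1.0 / alpha)

def refined_estimate(S, Nf=21, alpha=4.0, K=3.0):
    G = recip_smooth(S, Nf, alpha)              # step 1: strong partial damping
    Sc = np.where(S > K * G, G, S)              # step 2: replace spikes with G_k
    return recip_smooth(Sc, Nf, alpha=1.0)      # step 3: ordinary estimate

S = np.ones(256)
S[[32, 64, 96]] = 100.0                         # three strong "partials"
N_hat = refined_estimate(S)                     # spikes removed, floor preserved
```

With a high exponent in step 1 the strong bins contribute almost nothing to the smoothed reciprocal, so G stays close to the noise floor and the spikes are reliably detected in step 2.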
The results are shown in Fig. 10 where we can observe that satisfactory results are obtained with Nf = 21
for both steps 1 and 3, α = 4 in step 1, and K = 3 in step 2. We observe that the HNR range in which the
algorithm yields satisfactory results (d1 < 1.8 dB) is extended by more than 10 dB for the C4 and C5 pitches,
whereas the improvement is more limited for higher pitch frequencies, such as C6.
Like its original version, the above described procedure does not require any previous analysis of the sinusoidal
component, as this is one of the advantages in our approach. However, if our method were to be used in
[Fig. 10 plots: d1 mean absolute log difference and m(0,Fs/4) energy ratio versus HNR, refined versus original version, for 261 Hz (C4), 523 Hz (C5) and 1047 Hz (C6).]
Fig. 10. Performance measures of the refined (solid line) and the original (dashed line) version of the algorithm versus harmonic-to-noise
ratio for three different pitches, C4 (261.6 Hz), C5 (523.3 Hz) and C6 (1046.5 Hz). The analysis is done with a Hamming window of
Nt = 1024 samples, hop size H = Nt/4, and averaging over Nf = 21 bins in the frequency domain for both steps. The reciprocation
exponent in step 1 is α = 4 and the threshold constant in step 2 is K = 3.
TABLE II
PERFORMANCE OF THE PROPOSED METHOD IN TERMS OF ENERGY RATIO m(0,Fs/4) WITH LONGER WINDOWS (Nt = 2048) AND
HIGHER PITCH FREQUENCIES (fpitch > 1 KHZ).
HNR      original version        refined version
         Hamming    Blackman     Hamming    Blackman
0 dB     0.52 dB    0.58 dB      0.42 dB    0.47 dB
10 dB    0.80 dB    0.91 dB      0.58 dB    0.50 dB
20 dB    1.36 dB    1.15 dB      1.02 dB    0.51 dB
combination with other methods for estimation of the sinusoidal component, it would be advisable to include that information in the algorithm. A possible way is to use the knowledge of the location and amplitude of the partials to increase their damping, without going through steps 1 and 2. Alternatively, by making use of the frequency information only, one can directly find the regions of the spectrum that need to be corrected in step 2 above, thus eliminating both the comparison with the threshold KGk and the need to properly select the value of K.
The proposed method and its refined version may be used in calculating sound descriptors like the signal/noise
ratio (SNR) as in [22]–[24], or the spectral centroid (see [38]) of the noise component. In these cases time
resolution is not an issue, so it can be traded for frequency resolution by using longer analysis windows (tens
of pitch periods). In Table II we show the error measure (14) for both versions of the proposed method with
[Fig. 11 scatter plot: spectral centroid of sound versus spectral centroid of the stochastic component, both in Bark bands; instruments: trombone, bassoon, english horn, tuba, harp, saxophone, oboe, accordion, flute, bass clarinet.]
Fig. 11. Spectral centroid of instrumental sounds versus estimated spectral centroid of their stochastic component. All pitches correspond
to the C note of different octaves, indicated by each plot point, with lines connecting points that represent the same instrument.
the use of 2048-sample Blackman and Hamming windows, averaged over signals generated as in Section V
with pitch frequencies in the range (1 kHz, 4 kHz). Although the performance of both versions is similarly improved, we must note that the combination of a Blackman window with the refined version yields an estimation error
that is nearly constant over a wide range of HNR.
In Fig. 11 we see the results of applying our method to determine the spectral centroid of the stochastic
component. We plot for each analyzed sound the spectral centroid calculated on a 0–18 Bark band scale [6]
of the original sound versus that of its estimated stochastic component, and the clustering of data for each
instrument (shown by connecting points representing the same instrument) is evident.
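A minimal sketch of such a Bark-scale centroid computation follows; it uses a common Zwicker-style Hz-to-Bark approximation, which may differ slightly from the exact mapping of [6].

```python
import numpy as np

def hz_to_bark(f):
    # A common Zwicker-style Hz-to-Bark approximation; the paper's exact
    # Bark mapping follows [6] and may differ slightly.
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def spectral_centroid_bark(mag, freqs):
    # Energy-weighted mean spectral position on the Bark scale.
    w = mag ** 2
    return float(np.sum(hz_to_bark(freqs) * w) / np.sum(w))

freqs = np.linspace(0.0, 11025.0, 512)
c_low = spectral_centroid_bark(np.exp(-freqs / 500.0), freqs)          # low-heavy
c_high = spectral_centroid_bark(np.exp(-(11025.0 - freqs) / 500.0), freqs)
```

Applied once to the full amplitude spectrum and once to the estimated stochastic spectrum, this yields the two coordinates plotted for each sound.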
VII. CONCLUSIONS
We have presented a new method for estimating the spectrum of the noisy part in musical sounds and
evaluating its time envelope. The spectrum estimation is based on applying a cyclic convolution (smoothing)
on a nonlinear transformation (reciprocal) of the amplitude spectrum obtained from an STFT analysis. The time
envelope of the noisy component is calculated from its energy spectrum.
We have assessed the performance of our technique on synthetic test sounds with different features in the
sinusoidal and stochastic components, as well as studied their dependence on the parameters both of the
algorithm and of the test sound. The results are quite satisfactory, over a wide range of pitches and HNR
values, with a particular effectiveness for higher pitch frequencies and lower HNRs.
The comparison with other frequency-domain methods (complex subtraction and comb filtering) shows that our algorithm performs better and is computationally more efficient, since it does not depend on any
analysis of the sinusoidal component. It remains to investigate whether by performing the stochastic spectrum
estimation before the sinusoidal and transient analysis, the latter can be improved by the results of the former.
We have also shown an example of the method’s potential use in parameter extraction for sound classification.
REFERENCES
[1] X. Serra, “Musical Sound Modelling with Sinusoids plus Noise,” in C. Roads, S. Pope, A. Piccialli, G. De Poli editors, Musical
Signal Processing, Swets & Zeitlinger Publishers, 1997, pp. 91–102.
[2] Y. Ding, X. Qian, “Processing of Musical Tones Using a Combined Quadratic Polynomial-Phase Sinusoid and Residual (QUASAR)
Signal Model,” Journal of the Audio Engineering Society, vol. 45, n. 7, pp. 571–584, July 1997.
[3] K. Fitz, L. Haken, P. Christensen, “A new Algorithm for BandWidth association in Bandwidth-Enhanced Additive Sound Modeling,”
Proceedings of the 2000 International Computer Music Conference, ICMC 2000, Berlin, Germany, 27 August – 1 September 2000,
pp. 384–387.
[4] P. Polotti, G. Evangelista, “Multiresolution Sinusoidal Stochastic Model for Voiced-Sounds,” Proceedings of the COST G-6 Conference
on Digital Audio Effects, DAFX ‘01, Limerick, Ireland, 6–8 December 2001, pp. 120–124.
[5] S. N. Levine, J. O. Smith, “A Sines+Transients+Noise Audio Representation for Data Compression and Time/Pitch Scale
Modifications,” Proceedings of the 105th Convention of the Audio Engineering Society, San Francisco, CA, 26–29 September 1998,
preprint 4781.
[6] M. Goodwin, “Residual Modeling in Music Analysis-Synthesis,” Proceedings of 1996 IEEE International Conference on Acoustics,
Speech and Signal Processing, ICASSP ‘96, Atlanta, GA, 7–10 May 1996, vol. 2, pp. 1005–1008.
[7] H. Purnhagen, N. Meine, “HILN - The MPEG-4 Parametric Audio Coding Tools,” Proceedings of the 2000 IEEE International
Symposium on Circuits and Systems, ISCAS 2000, Geneva, Switzerland, 28–31 May 2000, vol. 3, pp. 201–204.
[8] R. Hendriks, J. Jensen, R. Heusdens, “Perceptual Linear Predictive Noise Modelling for Sinusoid-Plus-Noise Audio Coding,”
Proceedings of 2004 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP ‘04, Montreal, Canada,
17–21 May 2004, vol. 4, pp. 189–192.
[9] K. Hamdy, M. Ali, A. Tewfik, “Low Bit Rate High Quality Audio Coding with Combined Harmonic and Wavelet Representations,”
Proceedings of 1996 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP ‘96, Atlanta, GA, 7–10
May 1996, vol. 2, pp. 1045–1048.
[10] E. Zwicker, H. Fastl, Psychoacoustics. Facts and Models, Information Sciences Series, Springer-Verlag, New York, 1999.
[11] T. S. Verma, T. H. Y. Meng, “Time Scale Modification Using a Sines+Transient+Noise Signal Model,” Proceedings of the Digital
Audio Effect Workshop DAFX ‘98, Barcelona, Spain, 1998, pp. 49–52.
[12] T. S. Verma, T. H. Y. Meng, “Extending Spectral Modeling Synthesis with Transient Modeling Synthesis,” Computer Music Journal,
vol. 24, n. 2, pp. 47–49, Summer 2000.
[13] S. N. Levine, J. O. Smith, “A Switched Parametric and Transform Coder,” Proceedings of 1999 IEEE International Conference on
Acoustics, Speech and Signal Processing, ICASSP ‘99, Phoenix, AZ, 15–19 March 1999, vol. 2, pp. 985–988.
[14] E. B. George, M. Smith, “Analysis-by-synthesis / overlap-add sinusoidal modeling applied to the analysis and synthesis of musical
tones,” Journal of the Audio Engineering Society, vol. 40, n. 6, pp. 497–516, June 1992.
[15] M. Goodwin, “Multiscale Overlap-Add Sinusoidal Modeling Using Matching Pursuit and Refinements,” Proceedings of the 2001
IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, WASPAA01, Mohonk Mountain Resort, NY, 21–24
October 2001.
[16] X. Serra, J. Bonada, P. Herrera, R. Loureiro, “Integrating Complementary Spectral Models in the Design of a Musical Synthesizer,”
Proceedings of the 1997 International Computer Music Conference, ICMC ‘97, Thessaloniki, Greece, 25–30 September 1997, pp. 152–
159.
[17] J. Strawn, “Approximation and Syntactic Analysis of Amplitude and Frequency Function for Digital Sound Synthesis,” Computer
Music Journal, vol. 4, n. 3, pp. 3–23, Fall 1980.
[18] X. Serra, J. O. Smith, “Spectral Modelling Synthesis: A Sound Analysis/Synthesis System Based on a Deterministic plus Stochastic
decomposition,” Computer Music Journal, vol. 14, n. 4, pp. 12–24, Winter 1990.
[19] C. d’Alessandro, V. Darsinos, B. Yegnanarayana, “Effectiveness of a Periodic and Aperiodic Decomposition Method for Analysis of
Voice Sources,” IEEE Transactions on Speech and Audio Processing, vol. 6, n. 1, pp. 12–23, January 1998.
[20] P. J. B. Jackson, C. H. Shadle, “Pitch-Scaled Estimation of Simultaneous Voiced and Turbulence-Noise Components in Speech,” IEEE
Transactions on Speech and Audio Processing, vol. 9, n. 7, pp. 713–726, October 2001.
[21] K. Vos, R. Vafin, R. Heusdens, W. B. Kleijn, “High-Quality Consistent Analysis-Synthesis in Sinusoidal Coding,” Proceeding of the
AES 17th International Conference: High Quality Audio Coding, Firenze, Italy, 2–5 September 1999.
[22] Y. Qi, “Time Normalization in Voice Analysis,” Journal of the Acoustical Society of America, vol. 92, n. 5, pp. 2569–2577, November
1992.
[23] G. de Krom, “A Cepstrum-Based Technique for Determining a Harmonics-to-Noise Ratio in Speech Signals,” Journal of Speech and
Hearing Research, vol. 36, n. 2, pp. 254–266, April 1993.
[24] Y. Qi, “Temporal and Spectral Estimation of Harmonic-to-Noise Ratio in Human Voice Signals,” Journal of the Acoustical Society
of America, vol. 102, n. 1, pp. 537–543, July 1997.
[25] X. Rodet, P. Depalle, “Spectral Envelopes and Inverse FFT Synthesis,” Proceedings of the 93rd Convention of the Audio Engineering
Society, San Francisco, CA, 1–4 October 1992, preprint 3393.
[26] U. Zolzer, ed., DAFX, Digital Audio Effects, John Wiley & Sons, Chichester, 2002.
[27] C. Chafe, “Pulsed Noise in Self-sustained Oscillation of Musical Instruments,” Proceedings of 1990 IEEE International Conference
on Acoustics, Speech and Signal Processing, ICASSP ‘90, Albuquerque, NM, 3–6 April 1990, vol. 2, pp. 1157–1160.
[28] J. B. Allen, “Short Term Spectral Analysis, Synthesis, and Modification by Discrete Fourier Transform,” IEEE Transactions on
Acoustics, Speech and Signal Processing, vol. 25, n. 3, pp. 235–238, June 1977.
[29] SMS software, manual, examples and sound files at www.iua.upf.es/sms/
[30] N. J. Kasdin, “Discrete Simulation of Colored Noise and Stochastic Processes and 1/f^α Power Law Noise Generation,” Proceedings
of the IEEE, vol. 83, n. 5, pp. 803–827, May 1995.
[31] A. H. Gray, J. D. Markel, “Distance Measures for Speech Processing,” IEEE Transactions on Acoustics, Speech and Signal Processing,
vol. 24, n. 5, pp. 381–391, October 1976.
[32] S. Wang, A. Sekey, A. Gersho, “An Objective Measure for Predicting Subjective Quality of Speech Coders,” IEEE Journal on Selected
Areas in Communications, vol. 10, n. 3, pp. 819–829, June 1992.
[33] J. G. Beerends, J. A. Stemerdink, “A Perceptual Audio Quality Measure Based on a Psychoacoustic Sound Representation,” Journal
of the Audio Engineering Society, vol. 40, n. 12, pp. 963–978, December 1992.
[34] T. Thiede, W. C. Treurniet, R. Bitto, C. Schmidmer, T. Sporer, J. G. Beerends, C. Colomes, M. Keyhl, G. Stoll, K. Brandenburg,
B. Feiten, “PEAQ — The ITU Standard for Objective Measurement of Perceived Audio Quality,” Journal of the Audio Engineering
Society, vol. 48, n. 1, pp. 3–27, January 2000.
[35] M. Desainte-Catherine, S. Marchand, “High-Precision Fourier Analysis of Sounds Using Signal Derivatives,” Journal of the Audio
Engineering Society, vol. 48, n. 7, pp. 654–667, July 2000.
[36] F. J. Harris, “On the Use of Windows for Harmonic Analysis with Discrete Fourier Transform,” Proceedings of the IEEE, vol. 66,
n. 1, pp. 51–83, January 1978.
[37] J. Liljencrants, Tremolo and Vibrato Sounds, on line at mmd.foxtail.com/Tech/AM&FM.html, as published 21 October 1999,
accessed 2 May 2004.
[38] P. Herrera, G. Peeters, S. Dubnov, “Automatic Classification of Musical Instrument Sounds,” Journal of New Music Research, vol. 32,
n. 1, pp. 3–21, March 2003.