A Nonlinear Method for Stochastic Spectrum Estimation in the Modeling of Musical Sounds

Nicola Laurenti, Giovanni De Poli, Daniele Montagner
Abstract
We propose an original technique for separating the spectrum of the noisy component from that of the
sinusoidal, quasi-deterministic one, for the sinusoids + transients + noise (S+T+N) modeling of musical sounds.
It also enables estimation of the time-domain noise envelope and detection of transients with standard techniques.
The algorithm for spectrum separation relies on nonlinear transformations of the amplitude spectrum of the
sampled signal, obtained via the fast Fourier transform (FFT), which make it possible to eliminate the dominant
partials without the need for precisely tuned notch filters. The envelope estimation is performed by calculating the energy of the
signal in the frequency domain, over a sliding time window.
Several transformations (such as pitch shifting, time stretching, etc.) can be performed on the stochastic
spectrum obtained in this way prior to resynthesis. The synthesized sound is built via the inverse fast Fourier transform (IFFT)
with overlap-add method. The performance of the proposed algorithm is assessed on synthetic, instrumental and
natural sounds in terms of different quality measures.
Index Terms
Spectral modeling, nonlinear analysis, sound analysis, sinusoidal modeling, parametric modeling, residual
modeling
I. INTRODUCTION
Spectral analysis of sound produced by musical instruments shows that the spectral energy of the sound
signals can be interpreted as the sum of two main components: a sinusoidal component that is concentrated
around a discrete set of frequencies and a stochastic component that has a broadband characteristic. The
sinusoidal component normally corresponds to the main modes of vibration of the system. The stochastic
residual accounts for the energy produced by the excitation mechanism which is not turned into stationary
vibrations by the system and for any other energy component that is not sinusoidal. Hence, separation of the
two components has a robust physical foundation, and applications that preserve it (from sound transformation
to parametric coding and sound description) are quite effective. We devote our attention to the modeling of
the stochastic component and we focus on estimating its amplitude spectrum and time-domain envelope (from
now on, stochastic spectrum and stochastic envelope, respectively).
The authors are with Dipartimento di Ingegneria dell'Informazione, Università di Padova, via Gradenigo 6/B, 35131 Padova, Italy.
E-mail: {nil,depoli}@dei.unipd.it
[Fig. 1 diagram: analysis path — sn is multiplied by wn, passed through the FFT and |·| to give Sk, which feeds the stochastic spectrum estimation (output Bk) and the stochastic envelope estimation (output rn). Synthesis path — Bk and rn undergo a user-controlled transform yielding Ak and en; Ak is paired with e^{jφk}, passed through the IFFT, multiplied by w_n^syn, overlap-added, and multiplied by en to give ηn.]
Fig. 1. Block diagram of the stochastic analysis and synthesis, using the proposed spectrum separation procedure.
We propose an original technique based on nonlinear transformations of the amplitude spectrum Sk of the
signal sn multiplied by the analysis window wn, to estimate the spectrum Bk and the time envelope rn of the
noisy residual, as illustrated in the top of Fig. 1, and described in sections II and III.
The resynthesis, as in the bottom of Fig. 1, can be preceded by user controlled parametric transformations
of the estimated spectrum and time envelope (most commonly, these would be pitch shifting with preservation
of time structure and time stretching under tonal coherence condition [26]). Then, by generating random,
independent and uniformly distributed phases ϕk for each frequency bin of the signal discrete Fourier transform
(DFT) and pairing them with the corresponding amplitudes Ak we obtain, via IFFT with the synthesis window
w_n^syn, a quasi-stationary colored noise [25], which is finally multiplied by the time envelope en to obtain ηn.
The synthesized sound is perceptually plausible and contains no sinusoidal residual.
The paper has the following structure. We conclude the introduction by reviewing other existing methods.
Then, we separately describe in detail our algorithms for the estimation of the spectrum and time envelope in
sections II and III, respectively. In section IV we carry out a set of measurements for testing the separation
process by using composite synthetic signals, then, in section V, we compare it in performance to other
methods. In section VI we present a further refinement and different potential applications of the method, and
draw conclusions.
A. Review of existing methods
In the digital processing of musical sounds based on time-frequency representations, the sinusoidal quasi-
deterministic component, the noisy stochastic part, and temporal transients are often treated separately, due
to their quite different features. In particular, in the spectrum modeling technique called sinusoids + residual
(S+R) modeling [1], [2], the deterministic component is usually modeled as a sum of stable sinusoids (with slow
amplitude and frequency variations), whereas the noisy component is modeled through time-varying filtering
of stationary white noise. Different approaches for modeling the noisy component were proposed in [3]–[8].
However, the S+R approach, unless coherent structures in the residual are modeled accurately as in [9], has the
drawback of an improper modeling of temporal transients and instrument attacks, which play an important role
in psychoacoustic perception [10].
In the S+T+N model [5], [11], [12], the signal analysis and additive synthesis are extended to three components
in order to get a more general representation of the input sound. Therefore, in the analysis of digitally sampled
sound waveforms, the three components must be carefully separated in order to extract the three sets of
parameters that are needed for their processing and synthesis. On the other hand, since in most cases the
energy of the sinusoidal component is significantly larger than that of the others, the separation process is not
an easy task. Indeed, in some approaches transients are identified and temporally separated before sinusoidal
parameter estimation to avoid interference [13].
Methods for noise separation and estimation can follow two distinct approaches, based on subtraction or on
filtering, and both methods can be implemented either in the time or frequency domain. Time-domain subtraction
methods rely on a precise estimate of the sinusoidal component, then subtract its waveform from the original
sampled sound [1], [2], yielding the residual component. The latter is then processed and resynthesized using
the short time Fourier transform (STFT). The main drawback of such a direct method is the sensitivity of
the relative phase of the synthesized sinusoidal part to analysis parameters such as window and hop sizes [1]. The
residual from subtraction combines the effect of errors in the sinusoidal analysis with that of noise sources.
After subtraction this can give rise to an undesired and unstable “sinusoidal part” in the residual, the energy
of which can be larger than that of the non sinusoidal one. Such effects can result in perceptually annoying
artifacts which render the model unusable for further processing. In “analysis-by-synthesis” systems [14], [15]
the sinusoidal parameters are estimated iteratively, and at each iteration, errors introduced in the residual can
be counteracted by new deterministic components. This limits the introduction of sinusoidal components in
the residual, although it might lead to the introduction of spurious partials in the estimate of the sinusoidal
component.
Subtraction of the complex sinusoidal spectrum on a frame-by-frame basis is used in [16] to derive the
residual; this method is in principle equivalent to time-domain subtraction, and computationally more efficient
as only the few bins around each partial are involved in the calculation. However, in the case of monophonic
sounds, since the analysis is not done in a pitch-synchronous way, this method suffers from a much higher
sensitivity to errors in the estimation of partials parameters and to the window type.
On the other hand, time-domain filtering methods process the original sound through filters that exhibit deep
notches at the partial frequencies. They provide a more realistic noise residual, especially since the amplitude
envelope of the original noise is preserved; however, if the notches are not very selective, the resulting
spectrum turns out to be "anti-harmonic" rather than stochastic. This type of problem is generally solved by
performing filtering through cancellation of the sinusoidal component in the frequency domain with some sort
of curve fitting [17], i.e. finding a function that matches the general contour of the given filtered amplitude
spectrum. For example, the "straight-line approximation" method is used in [18] after eliminating the points in
the amplitude spectrum that are supposed to represent the partials.
In speech analysis, an iterative algorithm operating alternately in the time and frequency domains with erasure
and substitution of the partials was proposed in [19], but it was shown to exhibit convergence problems in
[20], where a different technique was developed. This is based on deriving the partial parameters in a pitch-
synchronous analysis, then subtracting the reconstructed partials from the unwindowed complex spectrum.
Substitution of the partials can then be performed on the power spectrum, but this method degrades in the
[Fig. 2 diagram: Sk → short filter → S′k → reciprocal (·)^{−1} → Rk → averaging filter → R′k → reciprocal (·)^{−1} → Bk]
Fig. 2. Block diagram of the stochastic spectrum estimation procedure.
presence of jitter and shimmer (i.e. random fluctuations in frequency and amplitude).
Further methods for stochastic estimation were developed in the framework of parametric audio coding [7],
[21] and of harmonic/noise and signal/noise ratio estimation [22]–[24], in which time resolution is not an issue, so
that much longer analysis windows can be used. In [22], Qi estimates the stochastic energy by comparing the
original signal in the time domain with an estimate of its sinusoidal component obtained through averaging of
several subsequent pitch periods of the signal. In [23] estimation is performed in the cepstrum domain with the
use of a comb filter and by shifting the spectral baseline. Improved versions of the two methods are compared
in [24], with similar performance.
As for the two basic transformations commonly applied in the resynthesis (pitch shifting and time stretching),
in general pitch shifting is performed only on the sinusoidal part, while the noisy and transient parts are
reproduced unshifted [1], [18]. Nevertheless some instruments (e.g. clarinet) present a tonal stochastic part
which should be taken into account in such a transformation to improve the quality of the output sound [27].
Time stretching is performed both on the sinusoidal and noisy part, whereas transients are only time shifted
in order to preserve psychoacoustic coherence [5], [12], as they would otherwise lose sharpness in their attack
and tend to sound dull after this transformation.
In our opinion, the complexity of the problem allows for further research and investigation of new approaches.
II. STOCHASTIC SPECTRUM ESTIMATION
In this section we describe the stochastic spectrum estimation technique in detail. Given the original sound
signal s(t), modeled as the sum of partials, a wideband stochastic signal g(t) and transient events (both often
with a much lower energy than the sinusoidal part), we are faced with the task of finding a smooth function
B(f) that approximates the time-varying spectrum of the stochastic component. Such a function represents
an average spectrum of the stochastic component realizations and should be updated as we move our analysis
window along the sound samples.
It is evident that by performing a mean filtering in frequency of the signal spectrum, we would only obtain
a spreading of the narrowband partials, since their amplitudes are much larger than the underlying stochastic
spectrum. Also, if we removed the partials with a comb filter before the mean filtering, the effect of the latter
would be to spread the rather wide comb notches, unless the comb filter is very precisely tuned and capable
of tracking the partial frequencies. As we said above, this would give rise to an “anti-harmonic” spectrum.
On the other hand, if we consider the reciprocal of amplitude spectrum R(f) = 1/|S(f)|, then in place of
the highly energetic partials in S(f) we will find deep and selective notches in R(f), which can be eliminated
through a mean filter, whereas the reciprocal of stochastic spectrum will play a prominent role in the averaging
performed by the filter. Once the filtered reciprocal spectrum is obtained, we must in turn take its reciprocal to
recover the required function B(f) approximating the stochastic residual spectrum. We note that when taking the
reciprocal R(f) we might end up with some very high accidental peaks that correspond to zeros of S(f),
due to the use of the instantaneous spectrum. Such peaks would corrupt the result of filtering R(f), in much the
same way that the partials would corrupt the result of filtering |S(f)|. However, since the zeros are much more
isolated and randomly distributed than the partials, they can be cancelled, without substantially altering the
spectrum shape, by passing |S(f)| through a very short mean filter of length ∆f (we choose ∆f = 3 bins)
before the nonlinear transformation, or alternatively by median filtering.
The above technique requires the following steps, as shown in Fig. 2:
1) consider the sequence of sound samples $s_n = s(t_0 + nT_s)$, $n = 0, \ldots, N_t - 1$, taken at rate $F_s = 1/T_s$, and belonging to an analysis frame $N_t$ samples long, starting at $t_0$;
2) calculate the amplitude spectrum of $\{s_n\}$ by taking the absolute value of its $N_t$-point DFT with a suitable window function $w_n$ (e.g. Hann, Hamming or Blackman)
$$S_k = \left| \sum_{n=0}^{N_t-1} s_n w_n e^{-j2\pi nk/N_t} \right|, \quad k = 0, \ldots, N_t - 1 \qquad (1)$$
3) remove incidental zeros in $\{S_k\}$, by replacing it with
$$S'_k = (S_{k-1} + S_k + S_{k+1})/3 \qquad (2)$$
4) calculate the reciprocal spectrum
$$R_k = \frac{1}{S'_k} \qquad (3)$$
5) smooth $R_k$ by cyclic convolution¹ with the $N_f$-point mean filter impulse response
$$R'_k = \frac{1}{N_f} \sum_{h=-\lfloor N_f/2 \rfloor}^{\lceil N_f/2 \rceil - 1} R_{(k-h) \bmod N_t} \qquad (4)$$
6) calculate the reciprocal of $R'_k$, which gives the required approximation to the residual spectrum
$$B_k = \frac{1}{R'_k} \qquad (5)$$
Observe that, as described, the algorithm has two adjustable analysis parameters: the length Nt of the
analysis window in the time domain and the length Nf of the mean filter in the frequency domain. Both
must be set according to the time and frequency variability of the sinusoidal component and of the stochastic
residual spectra. Typically they should be set to rather low values in order to track fast variations in
time and frequency shaping. The spectral shapes obtained from each step of the procedure are plotted in Fig. 3
for a flute sound with pitch at 1780 Hz (A6), by using Nt = 1024 and Nf = 25 with a sampling frequency
Fs = 44.1 kHz. As for the hop size H at the analysis stage, it should meet the condition for reconstruction
of the signal in the analysis/synthesis cascade with the analysis window (e.g. H ≤ Nt/4 for the Hamming
window [28]). Nt should be fixed to the lowest power of two (in samples) that is at least four periods of the
fundamental pitch in a harmonic sound, to avoid overlapping of the harmonics. For the same reason, Nt must
¹Cyclic convolution is the appropriate form of convolution when dealing with periodic signals. As the original signal is sampled, its
two-sided spectrum {Sk}, and likewise {S′k} and {Rk}, are to be considered periodic (in the frequency domain) with period Nt points.
[Fig. 3 panels, each showing amplitude (dB) versus f (kHz) from 0 to 20 kHz: a) original Sk; b) filtered S′k; c) reciprocal Rk; d) filtered R′k; e) noise Bk; f) Bk versus Sk]
Fig. 3. Spectra resulting from the consecutive steps in the estimation procedure.
be increased accordingly in the case of polyphonic sounds. We also note that the algorithm is homogeneous,
i.e. if the input signal {sn} is multiplied by a positive constant (and hence so is its amplitude spectrum {Sk}),
then the estimated noise spectrum {Bk} will turn out multiplied by the same factor. No normalization in the
signal level is therefore required, unless one is worried about the effects of finite precision arithmetic.
The synthesis of the stochastic component is the generation of a noise signal that has the frequency and
amplitude characteristics described by the spectral envelopes of the stochastic representation. In order to reduce
the amount of data for storage and computing transformations, significant parameters can be extracted by using
the Bark Band Noise Modeling proposed by Goodwin [6]. Starting from the spectral envelope or alternatively
from its Bark Band representation we generate random independently and uniformly distributed phases in
(−π, π) for each bin, and pair them with the corresponding amplitudes. The synthesized stochastic signal
is obtained via inverse STFT. After windowing, the resulting waveforms are overlapped (with a 50% overlap
factor), added and multiplied by a normalization constant. The IFFT size, hop size for synthesis and consequently
the normalization factor, may be changed for applying time-scaling effects to the input sound. [18], [25].
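The random-phase IFFT and overlap-add synthesis described above could be sketched as follows. This is a sketch, not the authors' implementation: the helper name `resynthesize_noise` is ours and the normalization constant is omitted.

```python
import numpy as np

def resynthesize_noise(B_frames, w_syn, hop):
    """Overlap-add resynthesis of colored noise from amplitude spectra.

    B_frames : sequence of stochastic amplitude spectra (one per frame);
    w_syn : synthesis window; hop : synthesis hop (50% overlap -> Nt//2).
    """
    Nt = len(w_syn)
    out = np.zeros(hop * (len(B_frames) - 1) + Nt)
    rng = np.random.default_rng()
    for i, B in enumerate(B_frames):
        # random phases, independent and uniform in (-pi, pi), per bin
        phi = rng.uniform(-np.pi, np.pi, Nt)
        spec = B * np.exp(1j * phi)
        # irfft uses the non-negative-frequency half and returns a real frame
        frame = np.fft.irfft(spec[: Nt // 2 + 1], n=Nt)
        out[i * hop : i * hop + Nt] += frame * w_syn
    return out
```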
III. TIME ENVELOPE ESTIMATION AND TRANSIENT DETECTION
To estimate the time envelope of the stochastic component we can make use of the frequency-domain
information obtained in the previous steps. Consider the spectrum separation procedure performed on the
portion of the signal samples within the analysis window [n0, n0 + Nt − 1]. From the noise spectral samples
{Bk} obtained in step 6 we can derive a measure of the energy of the noise process gn within the window. In
fact, since windowing and the DFT are linear operations, we can consider Bk to be a good approximation to
the amplitude spectrum of the windowed noise gnwn. Therefore we must have
$$E_B = \sum_{k=0}^{N_t-1} B_k^2 = N_t \sum_{n=0}^{N_t-1} (g_{n+n_0} w_n)^2 \qquad (6)$$
Assuming gn is a stationary random process within the analysis window, let Mg be its statistical power and
Eg = NtMg its average energy. Then, EB is a random variable with mean
$$m_{E_B} = N_t M_g E_w \qquad (7)$$
where $E_w = \sum_{n=0}^{N_t-1} w_n^2$. Consequently, the noise energy within the window can be estimated as
$$E_g^{[n_0,\,n_0+N_t]} = \frac{1}{E_w} \sum_{k=0}^{N_t-1} B_k^2 \qquad (8)$$
and an estimate of the stochastic envelope at the window midpoint is obtained as
$$r_{n_0+N_t/2} = \sqrt{\frac{E_g^{[n_0,\,n_0+N_t]}}{N_t}} \qquad (9)$$
By progressively shifting the analysis window along the signal sequence by small hops, we can obtain a rather
dense grid of envelope values, which have to be interpolated to yield the required envelope rn.
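The per-frame envelope estimate of (6)–(9) can be sketched as follows; the helper name `envelope_point` is ours:

```python
import numpy as np

def envelope_point(B, w):
    """Envelope estimate at the window midpoint, following (6)-(9).

    B : estimated noise amplitude spectrum for one frame;
    w : analysis window of length Nt.
    """
    Nt = len(w)
    Ew = np.sum(w ** 2)            # window energy, as in (7)
    Eg = np.sum(B ** 2) / Ew       # estimated noise energy, eq. (8)
    return np.sqrt(Eg / Nt)        # r at n0 + Nt/2, eq. (9)
```

Sliding this over the signal with a small hop yields the dense grid of envelope values mentioned in the text.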
However, the envelope obtained by this estimation procedure tends to smooth out the initial transients (note
attacks) and time-localized events, due to the long time windows used in the analysis stage, so these components
have to be treated separately. This phenomenon is clearly illustrated by comparing the original and estimated
envelopes for a purely stochastic time-varying sound, such as a water flow recording [29], in Fig. 4.
For transient detection we can make use of the above procedure, in conjunction with methods of residual
estimation based on time-domain subtraction of the sinusoidal component, such as the one described in [1]. This
residual contains both noise and transients, so that its envelope will be much larger than the estimated stochastic
envelope in the neighborhood of transient events. Following [1], frames in which the peak detection algorithm
does not yield correct points for the peak continuation process are marked as residual (noise + transient). For
each marked frame we compare our estimated stochastic envelope rn with the residual envelope rn,R obtained
after subtraction of the deterministic part from the input sound, similarly to the method used in [1] for correcting
residual envelopes. We can gather the points where rn,R − rn > rth, with rth a suitable threshold value. This allows
a fine localization of impulsive events that have been smoothed out by the nonlinear stochastic estimation. Following
[5], [12], these regions are considered as pure transient (no separation between noise and transients is attempted) and
are copied from the input sound into the resynthesized sound using a cross fade.
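The thresholding step can be sketched as follows (illustrative only; the helper name is ours and the two envelope arrays and rth are assumed given):

```python
import numpy as np

def transient_regions(r_est, r_res, r_th):
    """Boolean mask of candidate transient samples: points where the
    residual envelope r_res exceeds the estimated stochastic envelope
    r_est by more than the threshold r_th."""
    return r_res - r_est > r_th
```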
[Fig. 4 panels: sn (normalized) versus time (top); rn (normalized) versus time, original and estimated (bottom)]
Fig. 4. Original water flow sound (top) and comparison (bottom) between original (gray line) and estimated (black line) envelopes. The
original envelope was obtained by averaging over 64-sample non-overlapping rectangular windows. The estimated envelope was obtained
with a Hamming window of Nt = 1024 samples, hop size H = Nt/4, and averaging over Nf = 11 bins in the frequency domain.
IV. ASSESSMENT OF THE METHOD WITH SYNTHETIC SOUNDS
To evaluate the performance of our algorithm and determine its application range we tested it on different
synthetic sounds obtained by adding stationary noise with a known power spectral density to a purely harmonic
signal. We now compare the estimated stochastic spectrum with the result of directly smoothing the amplitude
spectrum of the noise signal.
In the following we describe the signal generation procedure, the parameters that we used for evaluation,
and show some plots of the results.
A. Test signals
In each test the noise signal u(t) is Gaussian and stationary, generated by filtering a zero mean, unit variance
Gaussian white noise uw(t) with a filter gi(t). In particular, we simulated pink noise (using the filter coefficients
given by the Kasdin algorithm [30]), white noise, and 12 auto-regressive moving-average models whose
spectral shapes are shown in Fig. 5.
Since the amplitudes of the noise DFT are Rayleigh distributed, our estimate will turn out to be biased even
in the ideal condition of absence of the sinusoidal component. The bias factor can be easily calculated for a
rectangular analysis window, considering that three independent samples of the DFT amplitude are averaged in
(2) before reciprocation, as
$$k_1 = \left[ \int\!\!\!\int\!\!\!\int_0^{+\infty} \frac{3\, p(a)\, p(b)\, p(c)}{a+b+c}\, da\, db\, dc \right]^{-1} \simeq 0.90 \simeq -0.9\ \text{dB} \qquad (10)$$
[Fig. 5 panels: Rn(f) = |Gi(f)|² (dB), from −20 to 0 dB, versus f (kHz) from 0 to 20 kHz, for the twelve noise models]
Fig. 5. Noise power spectral densities used in the test signals to assess the algorithm performance.
with
$$p(a) = \frac{\pi}{2}\, a\, e^{-\pi a^2/4}, \quad a > 0 \qquad (11)$$
the unit mean Rayleigh probability density function. For practical windows the above value of k1 is seen to be
a good approximation.
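The value of k1 in (10) can be checked by Monte Carlo simulation with unit-mean Rayleigh samples; this sketch is ours, not part of the original method, and mirrors the 3-bin averaging and double reciprocation in the noise-only case:

```python
import numpy as np

# Monte Carlo check of the bias factor k1 in (10): draw triples of
# independent unit-mean Rayleigh amplitudes, average each triple, and
# take the reciprocal of the mean of the reciprocals.
rng = np.random.default_rng(0)
# Rayleigh(scale) has mean scale*sqrt(pi/2); scale = sqrt(2/pi) gives mean 1
a = rng.rayleigh(scale=np.sqrt(2 / np.pi), size=(200_000, 3))
k1 = 1.0 / np.mean(3.0 / a.sum(axis=1))
# k1 should come out close to the paper's value of 0.90 (about -0.9 dB)
```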
As for the deterministic component, we choose to generate harmonic signals with all harmonics, bandlimited
to (0, 10 kHz). We set the amplitude of the k-th harmonic to be Ak = A1/k, with A1 determined by
the harmonic/noise ratio (HNR) and generate random phases φk as independent and uniformly distributed in
(0, 2π). The pitches were chosen among the B and E notes in the [60 Hz, 4000 Hz] range, and the HNR values
considered are {0, −3 dB, 0 dB, 3 dB, 10 dB, 20 dB}, where HNR = 0 (in linear scale) denotes a pure noise
signal. All signals are generated at rate Fs = 44.1 kHz for a
duration of 0.5 s.
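The harmonic part of a test signal can be sketched under the stated assumptions (Ak = A1/k, random phases, noise normalized to unit power); the helper name `harmonic_test_signal` is ours:

```python
import numpy as np

def harmonic_test_signal(f0, hnr_db, Fs=44100, dur=0.5, f_max=10000):
    """Harmonic component with Ak = A1/k, random phases, bandlimited
    to (0, f_max); A1 scaled so the harmonic/noise energy ratio is
    hnr_db, assuming a unit-power noise component."""
    rng = np.random.default_rng()
    n = np.arange(int(Fs * dur))
    K = int(f_max / f0)                      # number of harmonics kept
    s = np.zeros(len(n))
    for k in range(1, K + 1):
        phase = rng.uniform(0, 2 * np.pi)
        s += (1.0 / k) * np.sin(2 * np.pi * k * f0 * n / Fs + phase)
    # rescale so the harmonic power matches the requested HNR
    P = np.mean(s ** 2)
    return s * np.sqrt(10 ** (hnr_db / 10) / P)
```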
B. Performance measures
We assess the effectiveness of our algorithm against the estimate obtained by smoothing the amplitude
spectrum of the pure noise through averaging over a sliding window of Nf bins in the frequency domain.
This estimate can be expressed as
$$D_k = \frac{1}{N_f} \sum_{h=-\lfloor N_f/2 \rfloor}^{\lceil N_f/2 \rceil - 1} U_{(k-h) \bmod N_t} \qquad (12)$$
where
$$U_k = \left| \sum_{n=0}^{N_t-1} u_n w_n e^{-j2\pi nk/N_t} \right|, \quad k = 0, \ldots, N_t - 1 \qquad (13)$$
and both the values of Nf and the analysis windows {wn} are the same as in the proposed method (1)–(4).
When performed on the pure noise signal this estimate is unbiased and exhibits a lower variance than the
proposed method, but it is not applicable in the presence of the sinusoidal component.
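The reference estimate (12)–(13) can be sketched as follows (illustrative helper, not the authors' code):

```python
import numpy as np

def baseline_spectrum(u, w, Nf):
    """Reference estimate (12)-(13): smoothing of the pure-noise
    amplitude spectrum with a cyclic Nf-bin mean filter."""
    U = np.abs(np.fft.fft(u * w))                # eq. (13)
    h = np.arange(-(Nf // 2), (Nf + 1) // 2)
    return np.mean([np.roll(U, k) for k in h], axis=0)   # eq. (12)
```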
The deviation of our estimate is then measured in two different ways. We consider the average over all
frames of
1) the ratio of energies in a band (f0, f1),
$$m_{(f_0,f_1)} = 10 \log_{10} \frac{\displaystyle\sum_{k F_s/N_t \in (f_0,f_1)} B_k^2}{\displaystyle\sum_{k F_s/N_t \in (f_0,f_1)} D_k^2} \qquad (14)$$
2) the mean absolute log (MAL) spectral difference (in dB) [31]
$$d_1 = \frac{1}{N_t} \sum_{k=-N_t/2}^{N_t/2 - 1} \left| 20 \log_{10} \frac{B_k}{D_k} \right| \qquad (15)$$
We set as limits for an acceptable performance of the proposed method the bounds
$$-3\ \text{dB} < m_{(0,F_s/4)} < 3\ \text{dB}\,, \qquad d_1 < 1.8\ \text{dB} \qquad (16)$$
as a compromise between the considerations expressed in [31]–[34] about the correlation between error measures
and mean opinion score (MOS), and the concept of “just noticeable changes in amplitude” in [10].
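The two measures (14)–(15) can be sketched as follows (helper names are ours):

```python
import numpy as np

def energy_ratio_db(B, D, Nt, Fs, f0, f1):
    """Band energy ratio m_(f0,f1) of (14), in dB."""
    f = np.arange(Nt) * Fs / Nt          # bin center frequencies
    band = (f > f0) & (f < f1)
    return 10 * np.log10(np.sum(B[band] ** 2) / np.sum(D[band] ** 2))

def mal_db(B, D):
    """Mean absolute log spectral difference d1 of (15), in dB; by the
    periodicity of the spectra, the mean over all Nt bins equals the
    symmetric sum in (15)."""
    return np.mean(np.abs(20 * np.log10(B / D)))
```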
C. Results
We now illustrate and discuss the performance of the proposed method and its dependence on several
parameters of the original sound and of the algorithm.
1) Noise spectral shape: The method performance does not exhibit substantial differences from one
spectral shape to another; thus, in the following, we give results in dB (a more perception-related measure
than a linear scale) averaged over all spectral shapes. Some specific considerations do apply to the pink noise
case. As the amplitude spectrum of pink noise exhibits a 1/√f cusp at the origin, any smoothing operation
is bound to remove a large part of the signal energy. The loss in energy at low frequencies can be reduced
by shortening the averaging filter to Nf = 13 bins. The same problem was evident in the frequency-domain
method of [23]. A possible solution could be to exclude a narrow band at the lower end of the spectrum from
the smoothing procedure, and obtain those values from an average over multiple time frames, provided that the
fundamental frequency lies outside the chosen interval. This is justified by the slower variability of the narrowband
information. Another solution could take advantage of an estimate making use of the sound derivative, or
its approximation obtained with a high-pass filter as in [35], since the corresponding amplitude spectrum has a
√f shape around the origin. The effect of combining the two results should be investigated in future studies.
2) Window shape: First we tried different shapes of the analysis window, at different HNR values. For HNR
< 20 dB, Hamming, Hann and Bartlett windows exhibit superior performance with respect to Blackman and
Blackman-Harris windows, thanks to their narrower main lobe. In particular the Hamming window has a further
slight advantage given by its deeper first zero [36], and we have employed it in all the remaining tests. On the
other hand the Blackman and Blackman-Harris windows perform better for HNR > 20 dB, due to their higher
sidelobe attenuation. As the HNR grows, the use of Hamming, Hann or Bartlett windows becomes unviable.
[Fig. 6 panels: d1 (mean absolute log difference) and m(0,Fs/4) (energy ratio) versus fpitch from 125 Hz to 4 kHz, with curves for HNR = 0, −3 dB, 0 dB, 3 dB, 10 dB, 20 dB]
Fig. 6. Performance measures of the algorithm versus pitch frequency for different HNR values. The analysis is done with a Hamming
window of Nt = 1024 samples, hop size H = Nt/4, and averaging over Nf = 21 bins in the frequency domain.
3) Pitch frequency: In Fig. 6 we show the dependence of the algorithm performance on the fundamental
pitch frequency, at different HNR values, with the use of a 1024-sample analysis window. It can be observed
that the algorithm gives a better performance for higher pitch frequencies (the harmonics are more widely
spaced apart), and (of course) lower HNR. For example, at HNR = 10 dB, the pitch frequencies that allow a
satisfactory result in terms of the parameters (14) and (15) as stated in (16) are fpitch > 8/Tw, with Tw the
length of the analysis window. The results are confirmed by tests on 512- and 2048-sample windows. Observe
the nonideal behaviour of the estimate at HNR = 0, which is due to the bias (10).
4) Smoothing of the reciprocal spectrum: Fig. 7 shows how the estimate accuracy improves with increasing
the number of adjacent bins Nf considered in the smoothing of the reciprocal spectrum. The improvement is
noteworthy going from 9 (which is the minimum value for acceptable performance at HNR = 10 dB) to 25
bins, and is far less evident for Nf > 25. The choice of lowpass smoothing shapes other than the rectangular
one gives no improvement and, on the contrary, yields an even worse performance.
5) Spectrum reciprocation: In another set of tests we replaced the reciprocations in (3) and (5) with the
more general
$$R_k = \frac{1}{(S'_k)^\alpha}\,, \qquad B_k = \frac{1}{(R'_k)^{1/\alpha}} \qquad (17)$$
respectively, for different values of α. It should be observed that the combination of the α and 1/α powers in
(17) preserves the homogeneity of the algorithm, i.e. its invariance to multiplication of the input by a positive
[Fig. 7 panels: d1 (mean absolute log difference) and m(0,Fs/4) (energy ratio) versus Nf from 10 to 40 bins, with curves for HNR = 0, 0 dB, 10 dB, 20 dB]
Fig. 7. Performance measures of the algorithm versus length of the averaging filter in bins for different HNR values. The analysis is
done with a Hamming window of Nt = 1024 samples, hop size H = Nt/4, and the fundamental frequency is fpitch = 1046.5 Hz
(corresponding to C6).
constant. In the same ideal hypotheses that led to equation (10), the bias factor would now change to
$$k_\alpha = \left[ \int\!\!\!\int\!\!\!\int_0^{+\infty} \frac{3^\alpha\, p(a)\, p(b)\, p(c)}{(a+b+c)^\alpha}\, da\, db\, dc \right]^{-1/\alpha} \qquad (18)$$
yielding, for example,
$$k_2 \simeq 0.85 \simeq -1.46\ \text{dB}\,, \quad k_3 \simeq 0.78 \simeq -2.14\ \text{dB}\,, \quad k_4 \simeq 0.71 \simeq -3.01\ \text{dB}\,, \quad k_5 \simeq 0.61 \simeq -4.34\ \text{dB} \qquad (19)$$
From Fig. 8, we see that with α > 1 the estimator bias increases for low HNR, confirming the analysis in the
ideal case, while it decreases for high HNR (curves referring to pitch C4) and thus exhibits a stronger rejection
of the harmonic component. This advantage vanishes for high pitch frequencies (curves referring to pitch C7),
when the spacing between consecutive harmonics allows a good rejection also for α = 1. Higher values of
the reciprocation index also lead to a higher variance of the estimates, due to a sharper nonlinearity: indeed,
considering again the ideal hypotheses of (18), the variance even becomes infinite for α ≥ 3. However, a refinement of
the estimate that makes use of α > 1 will be seen in Section VI.
6) Tremolo and vibrato effects: We have investigated the effect of tremolo (amplitude modulation) and
vibrato (frequency modulation) in the harmonic component on the algorithm performance. In order to gain a
better insight we analyzed the two effects separately.
In the tests for tremolo we have used a 100% sinusoidal modulation of the amplitude with frequencies
ranging from 1 to 20 Hz. As can be expected, the average performance of the algorithm is little influenced by
[Fig. 8 panels: d1 (mean absolute log difference) and m(0,Fs/4) (energy ratio) versus HNR from −20 dB to 30 dB, with curves for α = 1, 2, 3 at pitches C4 (261 Hz) and C7 (2.093 kHz)]
Fig. 8. Performance measures of the algorithm versus harmonic-to-noise ratio for different values of the reciprocation exponent α, and
two different pitches, C4 (261.6 Hz) and C7 (2093 Hz). The analysis is done with a Hamming window of Nt = 1024 samples, hop size
H = Nt/4, and averaging over Nf = 21 bins in the frequency domain.
the tremolo effect. However, for low tremolo frequencies, when the length Tw of the analysis window is much
shorter than the tremolo period, the HNR is quite different going from one analysis window to another and so
is the algorithm performance.
In the set of tests for vibrato we have used a sinusoidal modulation with frequencies fvib = 5, 10 Hz,
and modulation depths ∆ = 0.036, 0.123, corresponding to a ±60 and ±200 cents deviation, respectively
(a cent is 1/100 of a semitone interval on the log scale). In this case partial frequencies can wander during
one analysis frame and their energy be spread across many adjacent bins. The algorithm performance is then
heavily influenced by the length Tw of the analysis window. Within one window, the maximum variation in the
frequency of the k-th partial is given by V k fpitch, where the parameter V is defined as
V = 2∆ sin(π fvib Tw)   for fvib Tw < 1/2
V = 2∆                  for fvib Tw ≥ 1/2
We can see from Fig. 9 that by choosing a 1024-sample window, and considering typical values for western
instruments [37] in the parameters (e.g. ∆ = 0.036, fvib = 5 Hz, which yield V = 0.025) the algorithm
performance is nearly unaffected by the vibrato effect. On the other hand it can also be seen that with higher
values of V the performance degradation can be substantial, limiting the pitch range in which the algorithm is
effective.
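The spread parameter V is straightforward to compute; the sketch below reproduces the value V ≈ 0.025 quoted above for ∆ = 0.036, fvib = 5 Hz and a 1024-sample window, assuming Fs = 44.1 kHz (our assumption, consistent with the reported figure).

```python
import math

def vibrato_spread(delta, f_vib, T_w):
    # Maximum relative frequency deviation V of a partial within one
    # analysis window of length T_w seconds (cases equation above).
    if f_vib * T_w < 0.5:
        return 2.0 * delta * math.sin(math.pi * f_vib * T_w)
    return 2.0 * delta

T_w = 1024 / 44100.0                    # assumed Fs = 44.1 kHz
V = vibrato_spread(0.036, 5.0, T_w)     # ≈ 0.025, as in the text
```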
[Fig. 9 plots: d1 mean absolute log difference and m(0,Fs/4) energy ratio versus fpitch (125 Hz to 4 kHz), for V = 0, 0.025, 0.047, 0.087, 0.163.]
Fig. 9. Performance measures of the algorithm versus pitch frequency, with frequency modulation (vibrato) of the harmonic component,
with different depths ∆ and modulation frequencies fvib. The HNR is 10 dB, and each curve is identified by a different value of the
parameter V , with V = 0 corresponding to the unmodulated sound. The analysis is done with a Hamming window of Nt = 1024 samples,
hop size H = Nt/4, and averaging over Nf = 21 bins in the frequency domain.
V. COMPARISON WITH OTHER METHODS
In this section we compare the performance of the proposed method with the complex spectrum subtraction
presented in [16] and the pitch-scaled comb filtering in [20], which both also operate in the frequency domain.
These methods, unlike the proposed one, require prior estimation of the sinusoidal component, and the perfor-
mance of stochastic estimation is crucially affected by the accuracy in the estimates of amplitude and frequency
of each partial. Moreover the comb filtering method requires a pitch-synchronous analysis window.
We make use of synthetic signals2 with the harmonic component as in Subsection IV-A and white noise,
at different HNR values, and we assume that estimation of the harmonic parameters is done with the method
presented in [35], which makes use of signal derivatives. Thus, the parameter estimates of amplitude and
fundamental frequency are derived by corrupting their true values with random Gaussian errors with bias and
standard deviation derived from [35, tables 5-6]. We test the three methods with the use of 1024-sample analysis
windows, Blackman for the complex spectrum subtraction method, rectangular for the pitch-synchronous comb
filtering, and Hamming for the proposed one, and compare them in terms of the measure (14).
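The corruption of the true parameter values can be sketched as follows; the helper name and the treatment of the tabulated values as mean and standard deviation of the relative error are our reading of [35, tables 5-6].

```python
import numpy as np

def corrupt(true_value, rel_bias, rel_sigma, rng):
    # Simulate a parameter estimate whose relative error has the given
    # mean (m) and standard deviation (sigma), as listed in Table I.
    return true_value * (1.0 + rel_bias + rel_sigma * rng.standard_normal())

rng = np.random.default_rng(0)
# e.g. an amplitude estimate at HNR = 10 dB (m = 7.5e-3, sigma = 5.3e-5)
A_hat = corrupt(1.0, 7.5e-3, 5.3e-5, rng)
```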
The results for HNR = 0, 10, 20 dB, averaged over the different pitches are given in Table I. We observe
2 The tests are done on synthetic sounds for the ease of parameter modifications and performance comparisons, although a more detailed
comparison should be carried out on real-world applications. Unfortunately, this would require the careful definition of an application
context as well as a more exhaustive testing and optimization of the other methods, which go beyond the scope of this work.
TABLE I
PERFORMANCE COMPARISON BETWEEN THE COMPLEX SUBTRACTION AND COMB FILTERING METHODS AND THE PROPOSED ONE.
MEAN AND STANDARD DEVIATION OF THE ERRORS ∆Ak AND ∆fpitch IN THE ESTIMATE OF HARMONICS AMPLITUDE Ak AND PITCH
FREQUENCY fpitch ARE DERIVED FROM [35].
HNR     ∆Ak/Ak (m, σ)            ∆fpitch/fpitch (m, σ)   complex subtraction  comb filtering  proposed method
                                                         m(0,Fs/4)            m(0,Fs/4)       m(0,Fs/4)
0 dB    2.14·10−2, 2.13·10−4     1.4·10−3, 1.56·10−2     1.86 dB              2.38 dB         1.05 dB
10 dB   7.5·10−3, 5.3·10−5       6·10−4, 5·10−3          2.98 dB              3.77 dB         2.10 dB
20 dB   2.7·10−3, 2.7·10−5       2·10−4, 2.1·10−3        3.94 dB              5.01 dB         3.36 dB
that the three methods perform similarly, with errors increasing with HNR, while the proposed method has a much lower computational complexity and requires no estimation of the partial parameters. The error increase at higher HNR holds for the subtraction and filtering methods even though the relative errors in the estimated partial parameters decrease roughly in inverse proportion to the HNR. The comb filtering
method shows its limits in the presence of amplitude and frequency fluctuations, or erratic estimates. As for the
proposed method, we next examine a way to limit its dependence on the energy of the sinusoidal component.
VI. A FURTHER REFINEMENT OF THE ESTIMATE
The performance of the proposed method degrades for lower pitches and higher HNR values, since in these
cases its damping of the partials is not sufficient. Therefore, to extend its range of application we can pre-attenuate the highest partials, and to this purpose we use the same estimation method with a
higher reciprocation exponent. The result will have the partials strongly attenuated, and a biased estimate (as
shown in Section IV-C) in the stochastic part of the spectrum, so we only use it to substitute the spikes in the
amplitude spectrum, and then apply the method of Section II.
The overall estimate in this case will proceed as follows. Starting from the amplitude spectrum Sk:
1) Perform estimation on {Sk} with (3) and (5) replaced by (17) and α > 1. Call Gk the result.
2) Find the spikes as the points where Sk > KGk (a good threshold is obtained with K > 2) and replace them with the corresponding values of Gk, that is, consider the corrected amplitude spectrum
   Sck = Sk   for Sk ≤ KGk
   Sck = Gk   for Sk > KGk
3) Perform the estimation (3)-(5) on {Sck}, that is, with ordinary exponent α = 1.
In this refined version the algorithm has additional parameters: the filter length Nf in steps 1 and 3 can be
different, and the constant K determining the threshold for spike detection can be chosen.
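Assuming the core estimator of (3)-(5) amounts to a cyclic moving average of the reciprocated amplitude spectrum, raised back with the inverse exponent (as the Conclusions summarize it), the three steps might be sketched as below; the function names and the guard constant are ours.

```python
import numpy as np

def circ_smooth(x, Nf):
    # Cyclic (circular) moving average over Nf bins, centered on each bin,
    # implemented via the convolution theorem.
    half = Nf // 2
    k = np.zeros(len(x))
    k[:half + 1] = 1.0 / Nf
    k[-half:] = 1.0 / Nf
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(k)))

def recip_smooth(S, Nf=21, alpha=1.0):
    # Sketch of the core estimator (3)-(5): smooth the reciprocated
    # amplitude spectrum, then reciprocate back with exponent 1/alpha.
    Sa = np.maximum(S, 1e-12) ** (-alpha)       # guard against zero bins
    return circ_smooth(Sa, Nf) ** (-1.0 / alpha)

def refined_estimate(S, Nf=21, alpha=4.0, K=3.0):
    G = recip_smooth(S, Nf, alpha)              # step 1: strong partial damping
    Sc = np.where(S > K * G, G, S)              # step 2: replace spikes with G_k
    return recip_smooth(Sc, Nf, alpha=1.0)      # step 3: ordinary estimate

S = np.ones(256)
S[[32, 64, 96]] = 100.0                         # three strong "partials"
N_hat = refined_estimate(S)                     # spikes removed, floor preserved
```

With a high exponent in step 1 the strong bins contribute almost nothing to the smoothed reciprocal, so G stays close to the noise floor and the spikes are reliably detected in step 2.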
The results are shown in Fig. 10 where we can observe that satisfactory results are obtained with Nf = 21
for both steps 1 and 3, α = 4 in step 1, and K = 3 in step 2. We observe that the HNR range in which the
algorithm yields satisfactory results (d1 < 1.8 dB) is extended by more than 10 dB for the C4 and C5 pitches,
whereas the improvement is more limited for higher pitch frequencies, such as C6.
Like its original version, the above described procedure does not require any previous analysis of the sinusoidal
component, as this is one of the advantages in our approach. However, if our method were to be used in
[Fig. 10 plots: d1 mean absolute log difference and m(0,Fs/4) energy ratio versus HNR, refined versus original version, for 261 Hz (C4), 523 Hz (C5) and 1047 Hz (C6).]
Fig. 10. Performance measures of the refined (solid line) and the original (dashed line) version of the algorithm versus harmonic-to-noise
ratio for three different pitches, C4 (261.6 Hz), C5 (523.3 Hz) and C6 (1046.5 Hz). The analysis is done with a Hamming window of
Nt = 1024 samples, hop size H = Nt/4, and averaging over Nf = 21 bins in the frequency domain for both steps. The reciprocation
exponent in step 1 is α = 4 and the threshold constant in step 2 is K = 3.
TABLE II
PERFORMANCE OF THE PROPOSED METHOD IN TERMS OF ENERGY RATIO m(0,Fs/4) WITH LONGER WINDOWS (Nt = 2048) AND
HIGHER PITCH FREQUENCIES (fpitch > 1 KHZ).
HNR      original version        refined version
         Hamming    Blackman     Hamming    Blackman
0 dB     0.52 dB    0.58 dB      0.42 dB    0.47 dB
10 dB    0.80 dB    0.91 dB      0.58 dB    0.50 dB
20 dB    1.36 dB    1.15 dB      1.02 dB    0.51 dB
combination with other methods for estimation of the sinusoidal component, it would be advisable to include that information in the algorithm. A possible way is to use the knowledge of the location and amplitude of the partials to increase their damping, without going through steps 1 and 2. Alternatively, by making use of the frequency information only, one can directly find the regions of the spectrum that need to be corrected in step 2 above, thus eliminating both the comparison with the threshold KGk and the need to properly select the value of K.
The proposed method and its refined version may be used in calculating sound descriptors like the signal/noise
ratio (SNR) as in [22]–[24], or the spectral centroid (see [38]) of the noise component. In these cases time
resolution is not an issue, so it can be traded for frequency resolution by using longer analysis windows (tens
of pitch periods). In Table II we show the error measure (14) for both versions of the proposed method with
[Fig. 11 scatter plot: spectral centroid of sound versus spectral centroid of the stochastic component, both in Bark bands; instruments: trombone, bassoon, english horn, tuba, harp, saxophone, oboe, accordion, flute, bass clarinet.]
Fig. 11. Spectral centroid of instrumental sounds versus estimated spectral centroid of their stochastic component. All pitches correspond
to the C note of different octaves, indicated by each plot point, with lines connecting points that represent the same instrument.
the use of 2048-sample Blackman and Hamming windows, averaged over signals generated as in Section V
with pitch frequencies in the range (1 kHz, 4 kHz). Although the performance of both versions is similarly improved, we must note that the combination of a Blackman window with the refined version yields an estimation error
that is nearly constant over a wide range of HNR.
In Fig. 11 we see the results of applying our method to determine the spectral centroid of the stochastic
component. We plot for each analyzed sound the spectral centroid calculated on a 0–18 Bark band scale [6]
of the original sound versus that of its estimated stochastic component, and the clustering of data for each
instrument (shown by connecting points representing the same instrument) is evident.
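A minimal sketch of such a Bark-scale centroid computation follows; it uses a common Zwicker-style Hz-to-Bark approximation, which may differ slightly from the exact mapping of [6].

```python
import numpy as np

def hz_to_bark(f):
    # A common Zwicker-style Hz-to-Bark approximation; the paper's exact
    # Bark mapping follows [6] and may differ slightly.
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def spectral_centroid_bark(mag, freqs):
    # Energy-weighted mean spectral position on the Bark scale.
    w = mag ** 2
    return float(np.sum(hz_to_bark(freqs) * w) / np.sum(w))

freqs = np.linspace(0.0, 11025.0, 512)
c_low = spectral_centroid_bark(np.exp(-freqs / 500.0), freqs)          # low-heavy
c_high = spectral_centroid_bark(np.exp(-(11025.0 - freqs) / 500.0), freqs)
```

Applied once to the full amplitude spectrum and once to the estimated stochastic spectrum, this yields the two coordinates plotted for each sound.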
VII. CONCLUSIONS
We have presented a new method for estimating the spectrum of the noisy part in musical sounds and
evaluating its time envelope. The spectrum estimation is based on applying a cyclic convolution (smoothing)
on a nonlinear transformation (reciprocal) of the amplitude spectrum obtained from an STFT analysis. The time
envelope of the noisy component is calculated from its energy spectrum.
We have assessed the performance of our technique on synthetic test sounds with different features in the
sinusoidal and stochastic components, as well as studied their dependence on the parameters both of the
algorithm and of the test sound. The results are quite satisfactory, over a wide range of pitches and HNR
values, with a particular effectiveness for higher pitch frequencies and lower HNRs.
The comparison with other frequency-domain methods (complex subtraction and comb filtering) shows that our algorithm performs better and is computationally more efficient, since it does not depend on any
analysis of the sinusoidal component. It remains to investigate whether by performing the stochastic spectrum
estimation before the sinusoidal and transient analysis, the latter can be improved by the results of the former.
We have also shown an example of the method’s potential use in parameter extraction for sound classification.
REFERENCES
[1] X. Serra, “Musical Sound Modelling with Sinusoids plus Noise,” in C. Roads, S. Pope, A. Piccialli, G. De Poli editors, Musical
Signal Processing, Swets & Zeitlinger Publishers, 1997, pp. 91–102.
[2] Y. Ding, X. Qian, “Processing of Musical Tones Using a Combined Quadratic Polynomial-Phase Sinusoid and Residual (QUASAR)
Signal Model,” Journal of the Audio Engineering Society, vol. 45, n. 7, pp. 571–584, July 1997.
[3] K. Fitz, L. Haken, P. Christensen, “A new Algorithm for BandWidth association in Bandwidth-Enhanced Additive Sound Modeling,”
Proceedings of the 2000 International Computer Music Conference, ICMC 2000, Berlin, Germany, 27 August – 1 September 2000,
pp. 384–387.
[4] P. Polotti, G. Evangelista, “Multiresolution Sinusoidal Stochastic Model for Voiced-Sounds,” Proceedings of the COST G-6 Conference
on Digital Audio Effects, DAFX ‘01, Limerick, Ireland, 6–8 December 2001, pp. 120–124.
[5] S. N. Levine, J. O. Smith, “A Sines+Transients+Noise Audio Representation for Data Compression and Time/Pitch Scale
Modifications,” Proceedings of the 105th Convention of the Audio Engineering Society, San Francisco, CA, 26–29 September 1998,
preprint 4781.
[6] M. Goodwin, “Residual Modeling in Music Analysis-Synthesis,” Proceedings of 1996 IEEE International Conference on Acoustics,
Speech and Signal Processing, ICASSP ‘96, Atlanta, GA, 7–10 May 1996, vol. 2, pp. 1005–1008.
[7] H. Purnhagen, N. Meine, “HILN - The MPEG-4 Parametric Audio Coding Tools,” Proceedings of the 2000 IEEE International
Symposium on Circuits and Systems, ISCAS 2000, Geneva, Switzerland, 28–31 May 2000, vol. 3, pp. 201–204.
[8] R. Hendriks, J. Jensen, R. Heusdens, “Perceptual Linear Predictive Noise Modelling for Sinusoid-Plus-Noise Audio Coding,”
Proceedings of 2004 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP ‘04, Montreal, Canada,
17–21 May 2004, vol. 4, pp. 189–192.
[9] K. Hamdy, M. Ali, A. Tewfik, “Low Bit Rate High Quality Audio Coding with Combined Harmonic and Wavelet Representations,”
Proceedings of 1996 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP ‘96, Atlanta, GA, 7–10
May 1996, vol. 2, pp. 1045–1048.
[10] E. Zwicker, H. Fastl, Psychoacoustics. Facts and Models, Information Sciences Series, Springer-Verlag, New York, 1999.
[11] T. S. Verma, T. H. Y. Meng, “Time Scale Modification Using a Sines+Transient+Noise Signal Model,” Proceedings of the Digital
Audio Effect Workshop DAFX ‘98, Barcelona, Spain, 1998, pp. 49–52.
[12] T. S. Verma, T. H. Y. Meng, “Extending Spectral Modeling Synthesis with Transient Modeling Synthesis,” Computer Music Journal,
vol. 24, n. 2, pp. 47–49, Summer 2000.
[13] S. N. Levine, J. O. Smith, “A Switched Parametric and Transform Coder,” Proceedings of 1999 IEEE International Conference on
Acoustics, Speech and Signal Processing, ICASSP ‘99, Phoenix, AZ, 15–19 March 1999, vol. 2, pp. 985–988.
[14] E. B. George, M. Smith, “Analysis-by-synthesis / overlap-add sinusoidal modeling applied to the analysis and synthesis of musical
tones,” Journal of the Audio Engineering Society, vol. 40, n. 6, pp. 497–516, June 1992.
[15] M. Goodwin, “Multiscale Overlap-Add Sinusoidal Modeling Using Matching Pursuit and Refinements,” Proceedings of the 2001
IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, WASPAA01, Mohonk Mountain Resort, NY, 21–24
October 2001.
[16] X. Serra, J. Bonada, P. Herrera, R. Loureiro, “Integrating Complementary Spectral Models in the Design of a Musical Synthesizer,”
Proceedings of the 1997 International Computer Music Conference, ICMC ‘97, Thessaloniki, Greece, 25–30 September 1997, pp. 152–
159.
[17] J. Strawn, “Approximation and Syntactic Analysis of Amplitude and Frequency Function for Digital Sound Synthesis,” Computer
Music Journal, vol. 4, n. 3, pp. 3–23, Fall 1980.
[18] X. Serra, J. O. Smith, “Spectral Modelling Synthesis: A Sound Analysis/Synthesis System Based on a Deterministic plus Stochastic
decomposition,” Computer Music Journal, vol. 14, n. 4, pp. 12–24, Winter 1990.
[19] C. d’Alessandro, V. Darsinos, B. Yegnanarayana, “Effectiveness of a Periodic and Aperiodic Decomposition Method for Analysis of
Voice Sources,” IEEE Transactions on Speech and Audio Processing, vol. 6, n. 1, pp. 12–23, January 1998.
[20] P. J. B. Jackson, C. H. Shadle, “Pitch-Scaled Estimation of Simultaneous Voiced and Turbulence-Noise Components in Speech,” IEEE
Transactions on Speech and Audio Processing, vol. 9, n. 7, pp. 713–726, October 2001.
[21] K. Vos, R. Vafin, R. Heusdens, W. B. Kleijn, “High-Quality Consistent Analysis-Synthesis in Sinusoidal Coding,” Proceeding of the
AES 17th International Conference: High Quality Audio Coding, Firenze, Italy, 2–5 September 1999.
[22] Y. Qi, “Time Normalization in Voice Analysis,” Journal of the Acoustical Society of America, vol. 92, n. 5, pp. 2569–2577, November
1992.
[23] G. de Krom, “A Cepstrum-Based Technique for Determining a Harmonics-to-Noise Ratio in Speech Signals,” Journal of Speech and
Hearing Research, vol. 36, n. 2, pp. 254–266, April 1993.
[24] Y. Qi, “Temporal and Spectral Estimation of Harmonic-to-Noise Ratio in Human Voice Signals,” Journal of the Acoustical Society
of America, vol. 102, n. 1, pp. 537–543, July 1997.
[25] X. Rodet, P. Depalle, “Spectral Envelopes and Inverse FFT Synthesis,” Proceedings of the 93rd Convention of the Audio Engineering
Society, San Francisco, CA, 1–4 October 1992, preprint 3393.
[26] U. Zolzer, ed., DAFX, Digital Audio Effects, John Wiley & Sons, Chichester, 2002.
[27] C. Chafe, “Pulsed Noise in Self-sustained Oscillation of Musical Instruments,” Proceedings of 1990 IEEE International Conference
on Acoustics, Speech and Signal Processing, ICASSP ‘90, Albuquerque, NM, 3–6 April 1990, vol. 2, pp. 1157–1160.
[28] J. B. Allen, “Short Term Spectral Analysis, Synthesis, and Modification by Discrete Fourier Transform,” IEEE Transactions on
Acoustics, Speech and Signal Processing, vol. 25, n. 3, pp. 235–238, June 1977.
[29] SMS software, manual, examples and sound files at www.iua.upf.es/sms/
[30] N. J. Kasdin, “Discrete Simulation of Colored Noise and Stochastic Processes and 1/f^α Power Law Noise Generation,” Proceedings
of the IEEE, vol. 83, n. 5, pp. 803–827, May 1995.
[31] A. H. Gray, J. D. Markel, “Distance Measures for Speech Processing,” IEEE Transactions on Acoustics, Speech and Signal Processing,
vol. 24, n. 5, pp. 381–391, October 1976.
[32] S. Wang, A. Sekey, A. Gersho, “An Objective Measure for Predicting Subjective Quality of Speech Coders,” IEEE Journal on Selected
Areas in Communications, vol. 10, n. 3, pp. 819–829, June 1992.
[33] J. G. Beerends, J. A. Stemerdink, “A Perceptual Audio Quality Measure Based on a Psychoacoustic Sound Representation,” Journal
of the Audio Engineering Society, vol. 40, n. 12, pp. 963–978, December 1992.
[34] T. Thiede, W. C. Treurniet, R. Bitto, C. Schmidmer, T. Sporer, J. G. Beerends, C. Colomes, M. Keyhl, G. Stoll, K. Brandenburg,
B. Feiten, “PEAQ — The ITU Standard for Objective Measurement of Perceived Audio Quality,” Journal of the Audio Engineering
Society, vol. 48, n. 1, pp. 3–27, January 2000.
[35] M. Desainte-Catherine, S. Marchand, “High-Precision Fourier Analysis of Sounds Using Signal Derivatives,” Journal of the Audio
Engineering Society, vol. 48, n. 7, pp. 654–667, July 2000.
[36] F. J. Harris, “On the Use of Windows for Harmonic Analysis with Discrete Fourier Transform,” Proceedings of the IEEE, vol. 66,
n. 1, pp. 51–83, January 1978.
[37] J. Liljencrants, Tremolo and Vibrato Sounds, on line at mmd.foxtail.com/Tech/AM&FM.html, as published 21 October 1999,
accessed 2 May 2004.
[38] P. Herrera, G. Peeters, S. Dubnov, “Automatic Classification of Musical Instrument Sounds,” Journal of New Music Research, vol. 32,
n. 1, pp. 3–21, March 2003.