
IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 12, NO. 2, MARCH 2004 121

A Perceptual Subspace Approach for Modeling of Speech and Audio Signals With Damped Sinusoids

Jesper Jensen, Member, IEEE, Richard Heusdens, and Søren Holdt Jensen, Senior Member, IEEE

Abstract—The problem of modeling a signal segment as a sum of exponentially damped sinusoidal components arises in many different application areas, including speech and audio processing. Often, model parameters are estimated using subspace based techniques which arrange the input signal in a structured matrix and exploit the so-called shift-invariance property related to certain vector spaces of the input matrix. A problem with this class of estimation algorithms, when used for speech and audio processing, is that the perceptual importance of the sinusoidal components is not taken into account. In this work we propose a solution to this problem. In particular, we show how to combine well-known subspace based estimation techniques with a recently developed perceptual distortion measure, in order to obtain an algorithm for extracting perceptually relevant model components. In analysis-synthesis experiments with wideband audio signals, objective and subjective evaluations show that the proposed algorithm improves perceived signal quality considerably over traditional subspace based analysis methods.

Index Terms—Complex exponentials, perceptually relevant sinusoids, psycho-acoustical distortion measure, sinusoidal modeling, speech and audio processing, subspace-based signal analysis.

I. INTRODUCTION

SINUSOIDAL models have proven to provide accurate and flexible representations of a large class of acoustic signals, including audio and speech signals. For speech and audio processing, sinusoidal models have been applied in areas such as speech coding (e.g., [1]–[3]) and enhancement (e.g., [4], [5]), speech signal transformations (e.g., [6]–[8]), music synthesis (e.g., [9], [10]), and more recently low bit-rate audio coding (e.g., [11]–[13]).

The applications above can be described in an analysis-modification-synthesis framework, where in the analysis stage model parameters are estimated for consecutive signal frames; in this stage it is typically assumed that each signal frame can be well represented as a linear combination of constant-amplitude, constant-frequency sinusoidal functions. In the modification phase, the estimated parameters may be quantized or otherwise modified. Finally, in the synthesis stage, the resulting parameters are

Manuscript received November 20, 2002; revised July 14, 2003. This research was conducted within the ARDOR project and was supported by the E.U. under Grant IST-2001-34095. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Ravi P. Ramchandran.

J. Jensen and R. Heusdens are with the Department of Mediamatics, Delft University of Technology, 2628 CD Delft, The Netherlands (e-mail: [email protected]; [email protected]).

S. H. Jensen is with the Department of Communication Technology, Aalborg University, 9220 Aalborg, Denmark (e-mail: [email protected]).

Digital Object Identifier 10.1109/TSA.2003.819948

used for reconstructing the possibly modified signal using interpolative (e.g., [14], [15]) or overlap/add synthesis (e.g., [3]).

Recently, a number of extended sinusoidal model variants have been proposed, where the constant-amplitude, constant-frequency assumption has been relaxed (e.g., [11], [16], [17]). An extended model of particular interest is the so-called exponential sinusoidal model (ESM), which aims at representing signal segments as sums of exponentially damped sinusoidal functions. Observing that damped oscillations occur commonly in many natural signals including speech and audio, the ESM is often a physically reasonable model. Furthermore, since exponentially damped sinusoids play a fundamental role in linear system theory, a vast amount of research supports the treatment of this model. The ESM has been applied to analysis and/or synthesis of audio (e.g., [18]–[21]) as well as speech signals (e.g., [22], [23]).

In many speech and audio applications it is of interest to represent only the perceptually relevant time/frequency regions of the signal in question by exploiting the masking properties of the human auditory system. For example, in most audio coding schemes (e.g., [24]) and in perceptual speech coding algorithms (e.g., [25], [26]), the parameter estimation and/or quantization stages have been tailored to represent the perceptually most significant signal regions.

For the ESM, the parameter estimation schemes can roughly be divided into two main groups: analysis-by-synthesis schemes such as the matching pursuit (MP) based algorithms described in [16], [27] and subspace-based schemes (e.g., [28]–[30]). While some work has been done on extracting perceptually relevant sinusoids using MP based schemes (e.g., [31], [32]), less effort has been directed toward estimating perceptually relevant ESM parameters using subspace techniques [21].

In [21] an attempt is made to combine psycho-acoustic information with a subspace based ESM parameter estimation scheme. In this scheme, the signal to be modeled is divided into subbands and an independent (low-order) ESM is used for each subband. The ESM components are estimated in an iterative manner, one at a time, by assigning at each iteration an additional damped sinusoid to the subband with the largest ratio of residual error noise to masking level, in much the same way as bits are assigned to different subbands in MPEG [24]. The approach in [21] operates at a lower computational complexity than a corresponding full-band scheme. However, it is not optimal because subbands are treated independently. Furthermore, no perceptual knowledge is used for estimating the sinusoids within each subband.

1063-6676/04$20.00 © 2004 IEEE



This paper describes an alternative approach for determining perceptually relevant ESM parameters. The aim is to minimize a perceptually motivated distortion measure. Furthermore, the presented framework allows for joint estimation of the parameters of interest. The presented algorithms combine well-known subspace based estimation algorithms and a distortion measure derived from a recently developed psycho-acoustical masking model.

The paper is structured as follows. In Section II, we introduce the perceptual distortion measure on which the proposed algorithm relies. Section III briefly describes a traditional scheme for ESM parameter estimation and then moves on to treat the proposed algorithm. Section IV reports results of simulation experiments performed with the proposed algorithm. First, a number of simple simulation experiments are conducted to illustrate differences between the proposed algorithm and traditional (nonperceptual) subspace estimation techniques. Then, it is shown through simulations with audio signals how the dimensions of the input matrix influence objective and subjective modeling performance. Furthermore, an analysis of the distribution of estimated sinusoids for speech and audio signal modeling reveals a simple way of reducing the computational load of the algorithms in the study. Finally, the subjective performance of the proposed algorithm is evaluated in a listening test. Section V summarizes and concludes the paper.

II. PERCEPTUAL DISTORTION MEASURE

In order to account for human auditory perception in the estimation of the ESM parameters, we use the recently developed perceptually relevant distortion measure described in [33]. This distortion measure has a number of desirable properties which make it particularly useful for the application at hand, namely audio and speech representation. First of all, the distortion measure is general in the sense that it does not make any assumptions on the origin of the target signals. Consequently, it is applicable to both speech and audio signals. This is in contrast to many of the well-known objective speech quality measures [34], which rely on an autoregressive signal production model. Secondly, in any practical situation, the distortion measure defines a norm on a Hilbert space. This property is important because it makes the distortion measure mathematically tractable and allows it to be incorporated in optimization algorithms aiming at minimizing norm based criteria.

Ignoring time-domain masking phenomena, signal distortion becomes audible when the log power spectrum of the modeling error exceeds a frequency dependent threshold, called the masking threshold. Many models exist for computing the masking threshold. In this paper we use the model proposed in [35], which differs from traditional spreading-function based models in the sense that it takes into account all auditory filters for computing the distortion, rather than considering only the auditory filter receiving most of the distortion. Furthermore, this psycho-acoustical model avoids the explicit classification of tonal and noise-like masker signal components, which is often used in traditional models (e.g., [24]). Although in this work we use the masking threshold derived from the model in [35], it is just as possible to use another psycho-acoustical model, e.g., the one used in MPEG-Audio [24]. It should, however, be noted that the model in [35] provides masking threshold predictions which are in better accordance with experimental psycho-acoustical data than those obtained with conventional masking models [35], [36].

The distortion measure can be written as [33]

    D = (1/2π) ∫_{−π}^{π} a(ω) |F{w·e}(ω)|² dω    (1)

where F indicates the Fourier transform operation, a(ω) is a weighting function representing the frequency-dependent sensitivity of the human auditory system, w is the analysis window, and

    e = x − x̂

is the modeling error, i.e., the difference between the original signal x and the modeled signal x̂. The weighting function a(ω) is usually chosen to be the reciprocal of the masking threshold.

In order for (1) to define a norm, the weighting function a(ω) must be positive and real for all ω, and the window sequence w must be nonzero within the signal frame. In this case, the distortion in (1) can be rewritten as a convolution of two (infinite) discrete-time sequences

    D = Σ_n |(h ∗ (w·e))(n)|² = ‖ h ∗ (w·e) ‖₂²    (2)

where h = F⁻¹{√a} is the inverse Fourier transform of √a, and ‖·‖₂ denotes the vector 2-norm.
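As a quick numerical sanity check, the equivalence of the frequency-domain form (1) (with the integral replaced by a sum over DFT bins) and the convolution form (2) can be sketched as follows. The weighting a(ω) below is a made-up positive placeholder, not a real masking threshold, and the convolution is implemented circularly via the FFT:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 64
x = rng.standard_normal(N)                 # original signal frame
xhat = x + 0.1 * rng.standard_normal(N)    # modeled frame (toy)
w = np.hanning(N)                          # analysis window
e = x - xhat                               # modeling error

# Hypothetical weighting a(omega): in the paper this is the reciprocal of a
# masking threshold; here it is just a positive, real placeholder.
a = 1.0 / (1.0 + np.linspace(0.0, 4.0, N))

# Frequency-domain form of (1), integral replaced by a sum over DFT bins.
E = np.fft.fft(w * e)
D_freq = np.sum(a * np.abs(E) ** 2) / N

# Time-domain form of (2): circular convolution with h = IDFT of sqrt(a).
h = np.fft.ifft(np.sqrt(a))
conv = np.fft.ifft(np.fft.fft(h) * np.fft.fft(w * e))
D_time = np.sum(np.abs(conv) ** 2)

assert np.isclose(D_freq, D_time)
```

The two values coincide by Parseval's relation, since the DFT of h ∗ (w·e) is √a · F{w·e}.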

In the context of the ESM, the modeled signal frame x̂ is given by

    x̂(n) = Σ_{k=1}^{K} α_k e^{−δ_k n} cos(ω_k n + φ_k),  n = 0, …, N−1    (3)

where α_k, δ_k, ω_k, and φ_k are amplitude, damping, angular frequency, and phase parameters, respectively. The problem at hand is, for a given original signal frame x, to find the set of ESM parameters which minimize the perceptual distortion measure in (2). Since a convolution operation can be formulated in terms of a matrix-vector multiplication, the minimization problem of interest can be stated as

    min_{α_k, δ_k, ω_k, φ_k} ‖ H W (x − x̂) ‖₂²    (4)

where W is a diagonal matrix with the elements of the analysis window on the main diagonal, and H is an (infinite) Toeplitz filtering matrix containing the elements of the, in this case symmetric, filter impulse response h. We treat the specific structure of H in further detail in Section III-B. The effect of premultiplication with H W may be interpreted as a transformation from the linear domain, where the ℓ₂-norm does not necessarily correlate well with subjective quality, to a perceptual domain, where the ℓ₂-norm is in better accordance with perceived quality.
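The matrix-vector formulation can be sketched in the same spirit; the diagonal W and the circulant H below are toy stand-ins for the window and perceptual filter matrices (a circulant matrix, rather than an infinite Toeplitz one, anticipating the circular-convolution structure treated in Section III-B):

```python
import numpy as np
from scipy.linalg import circulant

rng = np.random.default_rng(1)
N = 32
x = rng.standard_normal(N)                 # original frame
xhat = rng.standard_normal(N)              # modeled frame (toy)
w = np.hanning(N)
h = np.array([0.25, 0.5, 1.0, 0.5, 0.25])  # toy symmetric impulse response

# W: diagonal window matrix; H: circular Toeplitz (circulant) filter matrix
# whose first column is the zero-padded impulse response.
hc = np.zeros(N)
hc[:len(h)] = h
H = circulant(hc)
W = np.diag(w)

e = x - xhat
D = np.linalg.norm(H @ W @ e) ** 2         # ||H W e||_2^2 as in (4)

# The matrix-vector product reproduces the circular convolution of h with w*e.
conv = np.fft.ifft(np.fft.fft(hc) * np.fft.fft(w * e))
assert np.isclose(D, np.sum(np.abs(conv) ** 2))
```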



III. ESTIMATION OF PERCEPTUALLY RELEVANT ESM PARAMETERS

The algorithms to be presented rely on the observation that the modeled segment in (3) can be expressed as a sum of complex exponentials

    x̂(n) = Σ_{k=1}^{2K} c_k z_k^n    (5)

where c_k = (α_k/2) e^{±jφ_k} are complex amplitude parameters and z_k = e^{−δ_k ± jω_k} are so-called signal poles. From (5) we see that the signal poles contribute nonlinearly to the objective function in (4). The nonlinearity related to signal pole estimation can be circumvented by using the so-called HTLS algorithm by Van Huffel et al. [29], which is a total least squares (TLS) based variant of Kung et al.'s original state space algorithm [28]. These algorithms belong to the class of single shift-invariant methods within the set of subspace-based signal analysis algorithms [30]. The HTLS algorithm is not immediately suited for solving the weighted problem in (4). To take the filtering with the matrix H in (4) into account, we consider instead the so-called prefiltered HTLS algorithm described in [37]. Having estimated the signal poles using this algorithm, the complex amplitudes c_k can be found as the solution to a weighted linear least-squares problem. In the following we give a brief review of the HTLS algorithm and the prefiltered HTLS algorithm in order to support our discussions in the remainder of the paper; for an in-depth treatment of the algorithms, the reader is referred to [29] and [37], respectively.

A. Signal Poles With HTLS

Let us assume initially that the observed signal frame x can be represented by the ESM in (5) without error, and that the correct model order K is known. The HTLS algorithm first arranges the observed signal frame in an L × M Hankel data matrix X as follows:

    X = [ x(0)     x(1)     ⋯   x(M−1)
          x(1)     x(2)     ⋯   x(M)
           ⋮        ⋮             ⋮
          x(L−1)   x(L)     ⋯   x(N−1) ]    (6)

where L + M − 1 = N. The singular value decomposition (SVD) of X is given by

    X = U Σ Vᴴ    (7)

where Σ is a diagonal matrix containing the 2K nonzero singular values, and the matrices U and V contain as columns the corresponding left and right singular vectors, respectively. The shift invariance property of U (and V) ensures that the following matrix equations are satisfied:

    U↓ E_U = U↑,    V↓ E_V = V↑    (8)

where the superscripts ↑ (↓) denote deletion of the top (bottom) row of the matrix in question. The signal poles are found as the eigenvalues of the matrix E_U (or E_V). If the observed signal satisfies (5) and K is known, the underlying signal poles can be recovered without error. However, in practice, the observed signal frame will not satisfy the ESM exactly, the Hankel matrix will typically have a rank larger than 2K, and the shift invariance property in (8) will only be approximately valid. In this case, the matrix U contains the left singular vectors corresponding to the 2K largest singular values, and the matrix E_U (or E_V) is estimated as the total least squares solution [38] of the (incompatible) matrix equations in (8).
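A minimal sketch of this subspace pole estimation might look as follows; for simplicity it solves the shift-invariance equations with an ordinary least-squares solve, where HTLS proper uses total least squares:

```python
import numpy as np

def hankel_svd_poles(x, n_poles, L=None):
    """Estimate signal poles from the shift invariance of the Hankel data
    matrix; least-squares variant of the TLS step used by HTLS."""
    N = len(x)
    L = L or N // 2
    M = N - L + 1
    X = np.array([x[i:i + M] for i in range(L)])   # L x M Hankel matrix, cf. (6)
    U = np.linalg.svd(X)[0][:, :n_poles]           # dominant left singular vectors
    # Shift invariance, cf. (8): (U with bottom row deleted) E = (U with top
    # row deleted); solve for E in the least-squares sense.
    E = np.linalg.lstsq(U[:-1], U[1:], rcond=None)[0]
    return np.linalg.eigvals(E)

# Noiseless check: one damped sinusoid = two complex-conjugate signal poles.
n = np.arange(128)
x = np.exp(-0.02 * n) * np.cos(0.9 * n)
poles = hankel_svd_poles(x, n_poles=2)
z_true = np.exp(-0.02 + 0.9j)
assert min(abs(poles - z_true)) < 1e-6
assert min(abs(poles - np.conj(z_true))) < 1e-6
```

In the noiseless case the poles are recovered essentially exactly, mirroring the ideal scenario described above.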

B. Signal Poles With Prefiltered HTLS

The HTLS algorithm described above is not immediately suited for solving the problem in (4) because HTLS does not take the filtering operation of H (and W) into account. A first obvious choice for adapting HTLS to this problem is to filter the observed signal frame with the FIR filter h, i.e., calculate the convolution sequence h ∗ x, arrange this sequence in a data matrix, and then use this matrix as input to the HTLS algorithm. However, this approach is not acceptable because the resulting matrix will no longer have rank 2K; even when the observed signal frame satisfies the ESM and the model order is known, it is not possible to retrieve the underlying signal poles without error using this approach. Alternatively, the convolution sequence h ∗ x could be truncated before arranging it in the data matrix. Specifically, by discarding the first and last elements of h ∗ x and arranging the remaining middle part in the data matrix, the matrix retains the rank of X and the shift invariance property; that is, when x satisfies the ESM and the model order is known, the signal poles can be estimated without error. The problem with this approach, however, is that potentially useful data in the edges of h ∗ x is wasted during the truncation. Furthermore, in order to have samples left after the truncation, the length of the filter impulse response is limited.

The above mentioned drawbacks can be overcome by using the so-called prefiltered HTLS algorithm [37], which retains the rank of the original signal without discarding potentially useful data. In the prefiltered HTLS algorithm, the Hankel data matrix X is postmultiplied by a full rank filter matrix H and the HTLS algorithm is applied to the matrix product XH; a similar description can be derived when X is premultiplied with a filter matrix [37]. It is straightforward to show that XH, which generally is not Hankel structured, retains the rank and the shift-invariant property. That is, when x satisfies (5) and K is known, we have

    Ũ↓ E_Ũ = Ũ↑,    Ṽ↓ E_Ṽ = Ṽ↑

where Ũ (Ṽ) contains the left (right) singular vectors corresponding to the nonzero singular values of the filtered matrix XH, and the signal poles can be recovered without error as the eigenvalues of E_Ũ (or E_Ṽ). In this ideal scenario, the only requirement is that the filter matrix have full rank.
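The key property — that postmultiplying the Hankel matrix by a full-rank matrix preserves its column space, and hence the shift invariance of the dominant left singular vectors — can be checked numerically. The random circulant below is only a stand-in for a real perceptual filter matrix:

```python
import numpy as np
from scipy.linalg import circulant, hankel

def poles_from_matrix(Y, n_poles):
    """Signal poles from the shift invariance of Y's dominant left singular
    vectors (least-squares stand-in for the TLS step)."""
    U = np.linalg.svd(Y)[0][:, :n_poles]
    E = np.linalg.lstsq(U[:-1], U[1:], rcond=None)[0]
    return np.linalg.eigvals(E)

n = np.arange(100)
x = np.exp(-0.01 * n) * np.cos(0.5 * n)   # noiseless ESM frame, two poles
L = 50
X = hankel(x[:L], x[L - 1:])              # L x M Hankel data matrix

# Postmultiply by a full-rank M x M filter matrix: the column space, and with
# it the shift invariance of the left singular vectors, is preserved.
rng = np.random.default_rng(2)
H = circulant(rng.standard_normal(X.shape[1]))
poles = poles_from_matrix(X @ H, n_poles=2)

z_true = np.exp(-0.01 + 0.5j)
assert min(abs(poles - z_true)) < 1e-6
```

Note that the product XH is no longer Hankel structured, yet the poles are still recovered exactly in this ideal, noiseless case.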

The purpose of the filter matrix H is to implement the convolution in (2). The range of the summation in (2) is infinite, and its frequency domain counterpart in (1) involves an integral. In practice, however, we work with finite-length time-domain sequences, and the integral in (1) should be replaced by a summation, i.e., a point-wise multiplication in the frequency domain. This, in turn, means that the convolution in (2) becomes a circular convolution, rather than a linear one. Hence, we consider a circular Toeplitz filter matrix (see (9) at the bottom of the page). Initial experiments indeed show that this filter matrix structure leads to better performance than, e.g., the Toeplitz filter matrix proposed in [37]. Forming the product XH corresponds to circularly convolving each row in X with the FIR filter impulse response h.

C. Estimation of Complex Amplitudes

Having estimated the signal poles using the prefiltered HTLS algorithm, the complex amplitudes c_k (and thus the real amplitudes α_k and phases φ_k) are found as the solution to the weighted linear least squares problem

    min_c ‖ H W (x − V c) ‖₂²    (10)

where c = [c_1 ⋯ c_{2K}]ᵀ is the complex amplitude vector, H is a circular Toeplitz filter matrix constructed from the filter impulse response h, W is a diagonal matrix with the elements of the analysis window on the main diagonal, and V is a Vandermonde matrix constructed from the signal pole estimates

    V = [ 1          1          ⋯   1
          z_1        z_2        ⋯   z_{2K}
           ⋮          ⋮               ⋮
          z_1^{N−1}  z_2^{N−1}  ⋯   z_{2K}^{N−1} ]    (11)
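A sketch of the weighted amplitude solve in (10), with a random circulant standing in for the perceptual filter matrix and hypothetical pole and amplitude values; in the noiseless case the true complex amplitudes are recovered exactly:

```python
import numpy as np
from scipy.linalg import circulant

rng = np.random.default_rng(3)
N = 64
n = np.arange(N)

# Hypothetical conjugate pole pair and complex amplitudes.
z = np.array([np.exp(-0.03 + 0.7j), np.exp(-0.03 - 0.7j)])
c_true = np.array([0.5 * np.exp(0.4j), 0.5 * np.exp(-0.4j)])
V = z[None, :] ** n[:, None]            # N x 2 Vandermonde matrix, cf. (11)
x = (V @ c_true).real                   # real-valued signal frame

# Weighting: diagonal window matrix and a stand-in circulant filter matrix.
W = np.diag(np.hanning(N))
G = circulant(rng.standard_normal(N))

# Weighted linear least squares, cf. (10): min_c || G W (x - V c) ||_2
c_hat = np.linalg.lstsq(G @ W @ V, G @ W @ x, rcond=None)[0]
assert np.allclose(c_hat, c_true)
```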

D. Algorithm Outline

The proposed scheme, which we call P-ESM, for estimating perceptually relevant ESM parameters can be outlined as follows.

Input: signal frame x, model order K.
Output: signal pole and complex amplitude estimates.

1. Compute the perceptual weighting filter h from a psycho-acoustical masking model (e.g., [35]), and construct the filter matrices H and W.
2. Construct the Hankel structured data matrix X.
3. Compute the prefiltered data matrix XH.
4. Find perceptually relevant signal pole estimates using the HTLS algorithm [37].
5. Construct the Vandermonde matrix (11) from the estimated signal poles, and estimate the complex amplitude vector c from the weighted linear least-squares problem (10).
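The five steps above can be strung together in a toy end-to-end sketch, with least squares standing in for TLS and a simple made-up impulse response standing in for the perceptual weighting filter:

```python
import numpy as np
from scipy.linalg import circulant, hankel

def p_esm_sketch(x, h, w, n_poles, L):
    """Toy version of the P-ESM outline: prefiltered pole estimation
    (steps 1-4) followed by the weighted amplitude solve (step 5)."""
    N = len(x)
    M = N - L + 1
    X = hankel(x[:L], x[L - 1:])                   # step 2: Hankel data matrix
    hM = np.zeros(M); hM[:len(h)] = h
    XH = X @ circulant(hM)                         # steps 1 and 3: prefilter
    U = np.linalg.svd(XH)[0][:, :n_poles]          # step 4: signal subspace
    E = np.linalg.lstsq(U[:-1], U[1:], rcond=None)[0]
    z = np.linalg.eigvals(E)                       # signal pole estimates
    V = z[None, :] ** np.arange(N)[:, None]        # step 5: Vandermonde (11)
    hN = np.zeros(N); hN[:len(h)] = h
    G, W = circulant(hN), np.diag(w)
    c = np.linalg.lstsq(G @ W @ V, G @ W @ x, rcond=None)[0]
    return z, c, (V @ c).real

n = np.arange(128)
x = np.exp(-0.01 * n) * np.cos(0.6 * n + 0.3)      # noiseless test frame
h = np.array([1.0, 0.5, 0.25])                     # toy filter impulse response
z, c, xrec = p_esm_sketch(x, h, np.hanning(128), n_poles=2, L=64)
assert np.max(np.abs(x - xrec)) < 1e-6
```

With a noiseless single damped sinusoid as input, the reconstruction matches the original frame to numerical precision.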

IV. SIMULATION RESULTS

A number of simulation experiments were conducted to study and evaluate the performance of the proposed algorithm, P-ESM, and to demonstrate the differences between this scheme and the traditional scheme without perceptual prefiltering (i.e., with H and W replaced by identity matrices). We denote this latter scheme by ℓ₂-ESM, where "ℓ₂" reflects that processing is done to minimize an unweighted ℓ₂-norm (as opposed to a perceptually weighted ℓ₂-norm). Objective as well as subjective tests were performed.

Seven audio signals, sampled at a frequency of 44.1 kHz, were used in the experiments (see Appendix I). A fixed frame length of N samples (23.2 ms) was used, and frames were extracted with an overlap of 50%. The filter impulse response h had a length of 5.8 ms; initial experiments showed slightly better performance with this choice compared to shorter and longer impulse responses. For the P-ESM algorithm, the Hankel data matrix X (and thus the filtered matrix XH) had dimensions L × M with L > M, while for the standard ℓ₂-ESM algorithm, the data matrix was chosen "as square as possible," i.e., L ≈ M. Simulation studies reported in Section IV-C show that these matrix dimensions lead to the best overall performance of the algorithms. The window used in the experiments was a Hanning window.

    H = [ h(0)   h(1)   ⋯   0   ⋯   h(1)
          h(1)   h(0)   h(1)  ⋯       ⋮
           ⋮      ⋱     ⋱    ⋱      h(1)
          h(1)   ⋯     0    ⋯  h(1) h(0) ]    (9)

i.e., a circular Toeplitz matrix in which each row is a circular shift of the previous one, with zeros padding the wrapped ends of the symmetric filter impulse response h.



Fig. 1. Modeling of sum of sinusoids using P-ESM (left column) and ℓ₂-ESM (right column). (a)–(b) Power spectrum (solid) of original signal frame x and corresponding masking curve (dashed). (c)–(d) Modeling with P-ESM and ℓ₂-ESM for K = 2. (e)–(f) Modeling with P-ESM and ℓ₂-ESM for K = 4. (g)–(h) Modeling with P-ESM and ℓ₂-ESM for K = 6.

In order to have an objective quality measure we define the following "perceptual" signal-to-noise ratio (SNR) for the original signal frame x and its modeled counterpart x̂:

    SNR = 10 log₁₀ ( ‖ H W x ‖₂² / ‖ H W (x − x̂) ‖₂² )    (12)

where the Toeplitz matrix H and the diagonal matrix W are identical to the ones used in (10) for estimating the complex amplitudes in the prefiltered HTLS algorithm. The SNR measure aims at reflecting the quality of the modeled frame in a perceptual domain, and is valid to the extent that the perceptual model used for constructing the filter matrix H adequately represents the masking properties of the human auditory system. In some cases, it is useful to assign an objective quality measure to the modeling of several consecutive signal frames, e.g., an entire signal. To do so we use the segmental SNR, defined as the average SNR value taken across the signal frames in question.
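A hypothetical realization of this measure, with a random circulant standing in for the perceptual filter matrix, might look like:

```python
import numpy as np
from scipy.linalg import circulant

def perceptual_snr_db(x, xhat, G, W):
    """Perceptual SNR, cf. (12): energy ratio of the original frame to the
    modeling error, both measured in the weighted domain, in dB."""
    num = np.linalg.norm(G @ W @ x) ** 2
    den = np.linalg.norm(G @ W @ (x - xhat)) ** 2
    return 10.0 * np.log10(num / den)

rng = np.random.default_rng(4)
N = 32
x = rng.standard_normal(N)
G = circulant(rng.standard_normal(N))   # stand-in perceptual filter matrix
W = np.diag(np.hanning(N))

# A smaller weighted modeling error yields a higher perceptual SNR.
assert perceptual_snr_db(x, 0.9 * x, G, W) > perceptual_snr_db(x, 0.5 * x, G, W)
```

Averaging this value over consecutive frames gives the segmental SNR used in the experiments.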

A. Two Case Studies

We demonstrate the characteristics of the proposed method in two case studies.

Example 1: Sum of Sinusoids: In this example a synthetic signal frame x was generated as a sum of three stationary sinusoids: two closely spaced in frequency and one at 20 kHz carrying the largest energy.

The signal frame was modeled using P-ESM and ℓ₂-ESM for model orders K = 2, 4, 6. The result of the modeling procedure is illustrated in Fig. 1. Fig. 1(a) and (b) shows the spectrum of the original signal (solid) with the corresponding masking curve (dashed). Fig. 1(c) and (d) shows the modeled spectra for K = 2 for P-ESM and ℓ₂-ESM, respectively. In Fig. 1(c) we see that P-ESM tries to model the two closely spaced sinusoids, because these are the most important from a perceptual point of view, while as shown in Fig. 1(d), ℓ₂-ESM does not take perceptual information into account and selects the sinusoid at 20 kHz because it contains the largest energy. In Fig. 1(e), the model



Fig. 2. Modeling of noise-like signal frame using P-ESM (left column) and ℓ₂-ESM (right column). (a) and (b) Spectrum of original signal frame x with corresponding masking curve. (c) and (d) Modeling with P-ESM and ℓ₂-ESM for K = 2. (e) and (f) Modeling with P-ESM and ℓ₂-ESM for K = 4. (g) and (h) Modeling with P-ESM and ℓ₂-ESM for K = 16.

order is K = 4 and P-ESM represents almost perfectly the two closely spaced sinusoids, while ℓ₂-ESM in Fig. 1(f) maintains the sinusoid at 20 kHz. For K = 6, both P-ESM [Fig. 1(g)] and ℓ₂-ESM [Fig. 1(h)] retrieve the three sinusoids without error.

Example 2: Noise-Like Signal Frame: Here, we consider the case where the observed signal frame contains a noise-like signal. In this example, we have selected a frame from an unvoiced speech sound (/f/ in "Viel," German male speaker). The modeling performance for P-ESM and ℓ₂-ESM is illustrated, respectively, in the left and right columns of Fig. 2 for the frequency range 0–6 kHz, from which it is clear that P-ESM and ℓ₂-ESM attack different regions of the spectrum. The P-ESM scheme aims at representing lower frequency regions while ℓ₂-ESM models the high-energy regions around 3 kHz first. Even for a model order of K = 16, ℓ₂-ESM [Fig. 2(h)] does not represent the low-frequency regions around 500 Hz, which were found the perceptually most relevant by P-ESM [see Fig. 2(c) and (e)].

B. Performance versus Model Order

In this section, we compare the performance of P-ESM and ℓ₂-ESM for varying model order K and for two types of signal frames: a quasiperiodic signal frame taken from a steady-voiced region in a female speech signal, and a noise-like signal frame taken from an unvoiced region in a male speech signal. The two signal frames are modeled with the ESM using P-ESM and ℓ₂-ESM for parameter extraction over a range of model orders K. The modeling performance for the quasiperiodic and the noise-like signal frame is shown in Figs. 3 and 4, respectively.

From Fig. 3 we see that the performance of P-ESM and ℓ₂-ESM is almost identical for lower model orders. The reason is that with this signal frame, the harmonics with highest energy also have the highest perceptual relevance, i.e., P-ESM and ℓ₂-ESM extract the same sinusoids for low model orders. For larger model orders, however, the performance gap between the two schemes grows; at the upper end of the range, the difference is about 2 dB in favor of P-ESM. For noise-like frames such as the one in Fig. 4, the performance gap is approximately 1.5 dB at low model orders and almost 3 dB at higher model orders. The reason for the difference at low model orders is, as illustrated in Fig. 2, that P-ESM tends to represent the perceptually important low-frequency regions first, while ℓ₂-ESM models the high-energy regions.

C. Performance versus Matrix Dimensions

In [39] it was argued that data matrices should be constructed "as square as possible" for optimal performance with the standard HTLS algorithm. However, since it is not clear whether square data matrices are optimal with the P-ESM scheme, we


JENSEN et al.: PERCEPTUAL SUBSPACE APPROACH FOR MODELING OF SPEECH AND AUDIO SIGNALS 127

Fig. 3. Modeling performance versus model order K for a quasiperiodic signal frame. (a) Magnitude spectrum of signal frame x. (b) SNR as a function of K for P-ESM and l2-ESM.

Fig. 4. Modeling performance versus model order K for a noise-like signal frame. (a) Magnitude spectrum of signal frame x. (b) SNR as a function of K for P-ESM and l2-ESM.

study here the impact of the matrix dimensions on modeling performance.

The set of audio signals listed in Appendix I was modeled with P-ESM and l2-ESM for values of the ratio L/N in the range 0.08–0.93, where L denotes the number of rows of the data matrix. A constant signal frame length of N = 1024 samples was used, resulting in taller and narrower data matrices for increasing values of L/N. Fig. 5 shows the modeling performance in terms of SNR as a function of L/N. We see that the performance of l2-ESM remains essentially constant over a broad range of L/N up to about 2/3, which is consistent with the results reported in [39] (although for signals very different from the ones considered here). For P-ESM, we observe that square data matrices (L/N approximately 1/2) lead to good but not optimal performance in terms of SNR. Instead, the optimum is shifted toward smaller values of L/N, where P-ESM gives an SNR gain of more than 2 dB over l2-ESM. Detailed analysis of the results shows that these conclusions hold for each of the signals used in the test, and for both model orders considered.

The computational complexity of P-ESM and l2-ESM is dominated by the estimation of the column space, which in this work is obtained from an SVD of the (prefiltered) data matrix [see (7)]; consequently, the computational complexity of the algorithms grows roughly with the cube of the matrix dimensions. From this we conclude that the increased P-ESM performance is obtained at a reduced computational complexity compared to l2-ESM, because P-ESM uses "fat" matrices (L < M), whereas l2-ESM uses (approximately) square matrices (L ≈ M). In our implementation of the algorithms, P-ESM requires 80% of the computations used for l2-ESM (as measured with the "flops" counter in Matlab).
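The subspace machinery shared by both schemes can be illustrated with a minimal, unweighted sketch in the spirit of the HTLS family: build a Hankel data matrix, estimate the signal subspace with a truncated SVD, and read the damped-sinusoid poles off the subspace's shift invariance. This least-squares variant omits the total-least-squares refinement and the perceptual prefilter of P-ESM.

```python
import numpy as np

def estimate_poles(x, L, K):
    """Estimate K signal poles z_k = exp(-d_k + j*w_k) from frame x via
    a Hankel data matrix, a truncated SVD, and the shift invariance of
    the signal subspace (plain least squares, not total least squares)."""
    N = len(x)
    M = N - L + 1
    # L x M Hankel data matrix: X[i, j] = x[i + j].
    X = np.array([x[i:i + M] for i in range(L)])
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    Uk = U[:, :K]                       # basis of the K-dim signal subspace
    # Shift invariance: the down-shifted basis equals the up-shifted basis
    # times a K x K matrix whose eigenvalues are the signal poles.
    A, *_ = np.linalg.lstsq(Uk[:-1], Uk[1:], rcond=None)
    return np.linalg.eigvals(A)

# Two real damped sinusoids -> K = 4 complex poles.
fs, N = 8000.0, 256
n = np.arange(N)
x = (np.exp(-0.002 * n) * np.cos(2 * np.pi * 440 / fs * n)
     + 0.5 * np.exp(-0.004 * n) * np.cos(2 * np.pi * 1200 / fs * n))
poles = estimate_poles(x, L=N // 2, K=4)
freqs_hz = sorted({round(float(w) * fs / (2 * np.pi))
                   for w in np.abs(np.angle(poles))})
print(freqs_hz)   # → [440, 1200]
```

In the noiseless case the data matrix has rank exactly K, so the recovered poles are exact; the perceptual weighting in P-ESM changes which components dominate the retained subspace, not this estimation mechanics.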

D. Distribution of Model Components Across Frequency

An analysis of the distribution of model components across frequency is of interest for several reasons. First, it provides a clearer insight into the characteristics of the parameter extraction schemes. Secondly, it may lead to algorithmic advantages, for example a reduction in the computational expense of the parameter estimation schemes.

The audio signals described in Appendix I were represented using the P-ESM and l2-ESM schemes, each with the matrix dimensions found best in the previous section. A fixed model order of K = 50 was used for all signal frames. The estimated sinusoids (a total of more than 150,000 sinusoids for each estimation scheme) were collected and sorted into bins of width 500 Hz according to their estimated frequency parameter. The histograms resulting from this procedure are shown in Fig. 6. Clearly, most model components are found at lower frequencies. A careful examination of the histogram for P-ESM [Fig. 6(a)] shows that 79.6% of the components are found below 5 kHz, 92.7% below 8 kHz, and 99.6% below 11 kHz. A similar analysis of Fig. 6(b) reveals the corresponding l2-ESM values: 89.7% below 5 kHz, 95.2% below 8 kHz, and 99.1% below 11 kHz. The component distribution for the other fixed model order considered is almost identical to the K = 50 case.
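The binning procedure behind Fig. 6 amounts to a simple histogram; the frequency values below are placeholders, since the actual component estimates are not reproduced here.

```python
import numpy as np

def frequency_histogram(freqs_hz, bin_width_hz=500.0, fs=44100.0):
    """Sort estimated sinusoid frequencies into 500-Hz-wide bins (up to
    fs/2) and report the fraction of components below a few cutoffs."""
    edges = np.arange(0.0, fs / 2.0 + bin_width_hz, bin_width_hz)
    counts, _ = np.histogram(freqs_hz, bins=edges)
    f = np.asarray(freqs_hz, dtype=float)
    fractions = {c: float(np.mean(f < c)) for c in (5000.0, 8000.0, 11000.0)}
    return counts, fractions

# Placeholder frequency estimates (Hz), purely illustrative.
freqs = [300.0, 700.0, 1200.0, 4800.0, 9000.0, 12000.0]
counts, fractions = frequency_histogram(freqs)
print(int(counts.sum()), round(fractions[11000.0], 3))   # → 6 0.833
```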

To demonstrate the distribution of estimated components further, we have included Fig. 7, which shows the location of ESM components in the time-frequency plane for one of the audio excerpts listed in Appendix I. From this figure it is clear that P-ESM distributes the sinusoids more uniformly across the frequency axis [Fig. 7(b)], while l2-ESM tends to "cluster" sinusoids in high-energy regions [Fig. 7(c)].

The fact that sinusoidal components almost never occur in regions above 11 kHz is significant, because it may allow for a decimation of the signal frames under analysis without any noticeable loss in modeling performance. To pursue this idea further, we implemented the following decimated versions of the algorithms. First, the signal to be modeled was decimated by a factor of two using a 10th-order antialiasing FIR filter, after which the decimated signal was modeled using P-ESM and l2-ESM for parameter estimation; we denote the algorithms in


128 IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 12, NO. 2, MARCH 2004

Fig. 5. Average modeling performance across the test signals in Appendix I as a function of the dimensions of the data matrix X, for a fixed frame length of N = 1024 samples (N = L + M − 1).

Fig. 6. Distribution of model components across frequency (K = 50). (a) Components extracted with P-ESM. (b) Components extracted with l2-ESM.



Fig. 7. Distribution of model components in the time-frequency plane for a region of Signal no. 6. (a) Time-domain signal. (b) Components extracted with P-ESM. (c) Components extracted with l2-ESM.

Fig. 8. SNR values for the full-band estimation schemes l2-ESM and P-ESM and the decimated schemes l2-ESM-Dec and P-ESM-Dec, for K = 50.

the decimated domain by P-ESM-Dec and l2-ESM-Dec, respectively. In the decimated domain, a frame length of N = 512 samples was used, and the impulse response length of the perceptual filter was 128 samples (5.8 ms). The relative matrix dimensions remained the same as for the full-band schemes for P-ESM-Dec and l2-ESM-Dec, respectively (a study of SNR versus matrix dimensions for P-ESM-Dec and l2-ESM-Dec showed performance curves similar to that of Fig. 5 for the full-band algorithms). Finally, for comparison with the original signal, the modeled signal was reconstructed at the original sample rate of 44.1 kHz.
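The decimation front end can be sketched as follows. The text specifies only a 10th-order antialiasing FIR filter; the windowed-sinc design used here is an assumption standing in for the actual filter.

```python
import numpy as np

def decimate_by_two(x, fs):
    """Halve the sample rate: lowpass with a 10th-order (11-tap)
    windowed-sinc FIR as antialiasing filter, then keep every second
    sample. The Hamming-windowed design is an illustrative assumption."""
    M = 10                                   # filter order -> M + 1 taps
    n = np.arange(M + 1) - M / 2.0
    h = 0.5 * np.sinc(0.5 * n)               # ideal lowpass at fs/4
    h *= np.hamming(M + 1)                   # Hamming window
    h /= h.sum()                             # unit DC gain
    y = np.convolve(x, h, mode="same")       # antialiasing prefilter
    return y[::2], fs / 2.0

fs = 44100.0
n = np.arange(2048)
x = np.cos(2 * np.pi * 1000.0 / fs * n)      # 1-kHz tone, far below 11 kHz
x_dec, fs_dec = decimate_by_two(x, fs)
print(len(x_dec), fs_dec)                    # → 1024 22050.0
```

A full-band frame of 1024 samples thus becomes a 512-sample frame at 22.05 kHz, which is what halves the data-matrix dimensions in the decimated schemes.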

Fig. 8 compares the modeling performance of the decimated algorithms with that of the full-band schemes for K = 50. From this figure we conclude that the performance of P-ESM-Dec and l2-ESM-Dec is typically at least as good as that of their full-band counterparts. In fact, rejecting the upper frequency band before modeling tends to increase the SNR slightly; a reason for this is that the signal content in the high frequency band is mainly nonsinusoidal and therefore, in effect, contributes as noise in the estimation process. The main advantage, though, of performing the parameter estimation in the decimated domain is in terms of computational load, because the frame length, and thus the matrix dimensions L and M, have been reduced by a factor of 2. Since computations are dominated by the SVD of the data matrix, whose cost grows roughly with the cube of the matrix dimensions, we would expect a decrease in computational load by a factor of 8 for the SVD (assuming a constant



L/M ratio). In our algorithm implementations, the P-ESM-Dec algorithm requires 20% of the computations used for P-ESM and 16% of the computations used for l2-ESM (as measured with the "flops" counter in Matlab). The conclusions drawn from the K = 50 case in Fig. 8 remain valid when the other fixed model order is used.
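The factor-of-8 estimate follows directly from the roughly cubic SVD cost; a crude flop-count model (an assumption standing in for the exact operation count) makes the arithmetic explicit.

```python
def svd_cost(L, M):
    """Crude flop-count model for the SVD of an L x M data matrix:
    cost grows as L * M * min(L, M); constant factors are omitted."""
    return L * M * min(L, M)

# Decimation by two halves the frame length N, and with N = L + M - 1 and
# a fixed L/M ratio both matrix dimensions are (roughly) halved as well,
# so the modeled SVD cost drops by about 2 * 2 * 2 = 8.
full = svd_cost(512, 513)     # N = 1024, near-square matrix
half = svd_cost(256, 257)     # N = 512 after decimation
print(round(full / half, 2))  # → 7.98
```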

Informal listening tests support the objective performance measures in Fig. 8: signals generated with the decimated schemes P-ESM-Dec and l2-ESM-Dec are almost perceptually identical to signals generated with the full-band schemes.

E. Subjective Comparison Test With Audio Signals

To determine the possible subjective advantages of the proposed scheme, the seven test signals listed in Appendix I were modeled with P-ESM and l2-ESM at a fixed model order and compared in a listening test. We restricted ourselves to the full-band algorithms only. The test signals were presented to the listeners as triplets OAB or OBA, where O was the original signal, A was the signal modeled with l2-ESM, and B was the signal modeled with P-ESM. The task of the listener was to indicate which signal (A or B) was perceptually closest to the original O. Each test-signal triplet was presented a total of four times during the test, and the order (OAB or OBA) in which the signals were presented was selected randomly for each presentation. Nine listeners participated in the test; the authors did not participate. The preference for P-ESM averaged across the listeners is shown in Table I. As can be seen, P-ESM performs better than l2-ESM for all test signals.

A set of additional subjective evaluations was carried out by the authors and resulted in a number of conclusions on the perceptual quality of the modeled signals. Generally, P-ESM leads to modeled signals of considerably higher subjective quality than l2-ESM, the difference being larger at the higher of the two model orders considered. For some types of signals, e.g., Signals no. 1, 4, and 5, the subjective difference is very significant, while for other signals, e.g., Signals no. 2 and 3, the difference is less distinct although still clearly noticeable. We note that these subjective observations are well in line with Figs. 3 and 4, which showed larger improvements for more noise-like signals (such as Signals no. 1, 4, and 5) compared to signals with a larger periodic content (such as Signals no. 2 and 3). With l2-ESM, regions of the modeled signals sometimes sound "narrowband;" this artifact is eliminated with P-ESM (see also Fig. 7). Further, signals modeled with l2-ESM occasionally have a background of "musical noise;" this artifact, too, is not present in the P-ESM signals.

In a few cases P-ESM introduces artifacts which are not observed in the l2-ESM signals. For example, the unvoiced speech sounds /s/ in Signal no. 6 have a more reverberant/tonal quality with P-ESM. One explanation is that l2-ESM tends to model noise-like sounds by means of clusters of closely spaced sinusoidal components, thereby creating a signal resembling bandpass-filtered noise. P-ESM, on the other hand, tends to spread sinusoidal components more uniformly across frequency, creating a signal that is perceived as tonal. The tonal artifact can be eliminated by increasing the model order for the signal frames in question or, more efficiently, by introducing an additional signal model, e.g., a filtered-noise representation, to model the noise-like signal regions.
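One possible realization of the suggested filtered-noise extension is to fit a low-order all-pole (LPC) model to the noise-like frames and resynthesize them from filtered white noise. This sketch illustrates that idea only; it is not the method of the paper, and all parameter values are assumptions.

```python
import numpy as np

def lpc_coeffs(x, order):
    """Fit a low-order all-pole (LPC) model to x by the autocorrelation
    method, solving the normal equations directly (sketch only)."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    return np.concatenate(([1.0], -a))       # prediction-error filter A(z)

def ar_synthesize(a, excitation):
    """Run an excitation through 1/A(z) to produce a noise-like signal."""
    y = np.zeros_like(excitation)
    for n in range(len(excitation)):
        y[n] = excitation[n] - sum(a[k] * y[n - k]
                                   for k in range(1, len(a)) if n - k >= 0)
    return y

rng = np.random.default_rng(1)
# Noise-like "residual": an AR(1) process with coefficient 0.8 (assumed).
residual = ar_synthesize(np.array([1.0, -0.8]), rng.standard_normal(1024))
a = lpc_coeffs(residual, order=2)
noise_part = ar_synthesize(a, rng.standard_normal(1024))  # resynthesis
print(a.shape)                                            # → (3,)
```

Such a noise branch would replace the uniformly spread sinusoids in unvoiced regions, avoiding the tonal artifact at no increase in model order.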

TABLE I
PREFERENCE FOR P-ESM OVER l2-ESM IN SUBJECTIVE COMPARISON TEST

V. CONCLUSION

This paper considered the problem of approximating a signal frame with the exponential sinusoidal model (ESM), which represents the signal frame as a linear combination of exponentially damped constant-frequency sinusoidal functions. Traditionally, model components have been estimated using the HTLS algorithm [29], which does not take the perceptual relevance of the estimated components into account. In this work we proposed a method to incorporate auditory information in the estimation process in order to extract only perceptually relevant model components. Perception-based estimation algorithms are of importance in model-based audio and speech processing in cases where the signal receiver is human. The proposed algorithm combines the prefiltered HTLS algorithm [37] and a recently developed perceptual distortion measure [35], in an attempt to minimize a signal-dependent, weighted l2-norm. The proposed algorithm is named P-ESM.

The proposed method was studied in a number of simulation experiments. First, the differences between the proposed method and the traditional HTLS-based method, denoted l2-ESM here, were demonstrated in a number of simple modeling examples, using synthetically generated signal frames such as stationary sinusoidal signals, as well as natural signals such as voiced and unvoiced speech sounds. The examples verified that while l2-ESM focuses on high-energy spectral regions, the perceptually based algorithm, P-ESM, tends, as expected, to model spectral regions which are perceptually important.

Secondly, the influence of the data matrix dimensions on modeling performance was studied. In particular, modeling performance was evaluated as a function of the ratio L/N, where L is the number of rows in the data matrix and N is the number of samples in a signal frame (assumed constant); L/N was varied in the range 0.08–0.93, resulting in taller and narrower matrices for increasing L/N. The traditional l2-ESM method had a broad performance maximum roughly centered around L/N = 1/2, supporting the assumption that square data matrices lead to the best performance. The P-ESM method, however, had a narrower but significantly larger maximum, centered at a smaller value of L/N. Moreover, since the computational load in both P-ESM and l2-ESM is dominated by an SVD of the data matrix, and therefore grows roughly with the cube of the matrix dimensions, the P-ESM performance gain can be obtained at lower computational expense; in practice, P-ESM spent 80% of the computations used for l2-ESM (as measured by the "flops" counter in Matlab).

Thirdly, an analysis was performed which aimed at determining the distribution of sinusoidal components across frequency for P-ESM as well as the traditional l2-ESM algorithm. The analysis showed that for a range of different wideband audio signals sampled at 44.1 kHz, most sinusoidal components occurred at low frequencies. To be more exact, for a model order of K = 50 (approximately 25 real damped sinusoids per signal frame), P-ESM and l2-ESM placed 99.6% and 99.1%, respectively, of all components below 11 kHz. This observation is significant, because it suggests that typical wideband audio signals can be decimated by at least a factor of 2, thereby further reducing the computational demands of the estimation algorithms without sacrificing any modeling performance. Simulation experiments with these decimated algorithms verified that the objective as well as subjective modeling performance remains virtually identical to that of the full-band algorithms, while the computational expense of P-ESM-Dec and l2-ESM-Dec is roughly 16–20% of the computational load of P-ESM and

l2-ESM.

Finally, to determine the potential advantages of incorporating a perceptual distortion measure in the parameter estimation procedure, signals modeled with P-ESM were compared to signals modeled with the traditional l2-ESM method in a subjective listening test involving nine listeners. This comparison test showed that the signals generated with P-ESM were of considerably higher perceptual quality than those of l2-ESM (preference in the range 69%–94% for P-ESM). In particular, the "narrowband" quality of the l2-ESM signals was eliminated with P-ESM. Furthermore, the comparisons showed that the "musical" background noise often present in signals modeled with l2-ESM was not noticeable in the signals modeled with P-ESM.

APPENDIX

OVERVIEW OF TEST SIGNALS

The following test signals were used in the evaluation of the presented algorithms. All signals were sampled at a frequency of 44.1 kHz.

ACKNOWLEDGMENT

The authors wish to thank O. Niamut for helping prepare the listening test, and they thank the nine participants in the listening test. They would also like to thank the anonymous reviewers for their helpful and constructive remarks.

REFERENCES

[1] P. Hedelin, "A tone-oriented voice-excited vocoder," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1981, pp. 205–208.

[2] L. B. Almeida and J. M. Tribolet, "Harmonic coding: A low bit-rate, good-quality speech coding technique," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1982, pp. 1664–1667.

[3] R. J. McAulay and T. F. Quatieri, "Sinusoidal coding," in Speech Coding and Synthesis, W. B. Kleijn and K. K. Paliwal, Eds. New York: Elsevier, 1995, ch. 4.

[4] T. F. Quatieri and R. J. McAulay, "Noise reduction using a soft-decision sine-wave vector quantizer," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1990, pp. 821–824.

[5] J. Jensen and J. H. L. Hansen, "Speech enhancement using a constrained iterative sinusoidal model," IEEE Trans. Speech Audio Processing, vol. 9, pp. 731–740, Oct. 2001.

[6] T. F. Quatieri and R. J. McAulay, "Speech transformations based on a sinusoidal representation," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-34, pp. 1449–1464, 1986.

[7] E. B. George and M. J. T. Smith, "Speech analysis/synthesis and modification using an analysis-by-synthesis/overlap-add sinusoidal model," IEEE Trans. Speech Audio Processing, vol. 5, no. 5, pp. 389–406, 1997.

[8] Y. Stylianou, "Applying the harmonic plus noise model in concatenative speech synthesis," IEEE Trans. Speech Audio Processing, vol. 9, pp. 232–239, Mar. 2001.

[9] X. Serra and J. Smith III, "Spectral modeling synthesis: A sound analysis/synthesis system based on a deterministic plus stochastic decomposition," Comput. Music J., vol. 14, no. 4, pp. 12–24, 1990.

[10] E. B. George and M. J. T. Smith, "Analysis-by-synthesis overlap-add sinusoidal modeling applied to the analysis and synthesis of musical tones," J. Audio Eng. Soc., vol. 40, no. 6, pp. 497–516, June 1992.

[11] B. Edler, H. Purnhagen, and C. Ferekidis, "ASAC – Analysis/synthesis codec for very low bit rates," in Preprint 4179 (F-6), 100th AES Convention, 1996.

[12] K. N. Hamdy, M. Ali, and A. H. Tewfik, "Low bit rate high quality audio coding with combined harmonic and wavelet representation," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1996, pp. 1045–1048.

[13] T. S. Verma and T. H. Y. Meng, "A 6 kbps to 85 kbps scalable audio coder," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 2000, pp. 877–880.

[14] R. J. McAulay and T. F. Quatieri, "Magnitude-only reconstruction using a sinusoidal model," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1984, pp. 27.6.1–27.6.4.

[15] L. B. Almeida and F. M. Silva, "Variable-frequency synthesis: An improved harmonic coding scheme," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1984, pp. 27.5.1–27.5.4.

[16] M. Goodwin, "Matching pursuit with damped sinusoids," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1997, pp. 2037–2040.

[17] G. Li, L. Qiu, and L. K. Ng, "Signal representation based on instantaneous amplitude models with application to speech synthesis," IEEE Trans. Speech Audio Processing, vol. 8, pp. 353–357, May 2000.

[18] J. Laroche, "The use of the matrix pencil method for the spectrum analysis of musical signals," J. Acoust. Soc. Amer., vol. 94, no. 4, pp. 1958–1965, Oct. 1993.

[19] J. Nieuwenhuijse, R. Heusdens, and E. F. Deprettere, "Robust exponential modeling of audio signals," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1998, pp. 3581–3584.

[20] J. Jensen and R. Heusdens, "A comparison of sinusoidal model variants for speech and audio representation," in Proc. Eur. Signal Processing Conf., 2002, pp. 479–482.

[21] K. Hermus, W. Verhelst, and P. Wambacq, "Psycho-acoustic modeling of audio with exponentially damped sinusoids," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 2002, pp. 1821–1824.

[22] P. Lemmerling, I. Dologlou, and S. Van Huffel, "Speech compression based on exact modeling and structured total least norm," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1998, pp. 353–356.

[23] J. Jensen, S. H. Jensen, and E. Hansen, "Exponential sinusoidal modeling of transitional speech segments," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1999, pp. 473–476.

[24] "Coding of Moving Pictures and Associated Audio for Digital Storage Media at up to About 1.5 Mbit/s – Part 3: Audio," ISO/MPEG Committee, ISO/IEC 11172-3, 1993.

[25] R. D. di Iacovo and R. Montagna, "Some experiments in perceptual masking of quantizing noise in analysis-by-synthesis speech coders," in Proc. Eurospeech, 1991, pp. 825–828.

[26] D. Sen and W. Holmes, "Perceptual enhancement of CELP speech coders," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1994, pp. II-105–II-108.

[27] S. G. Mallat and Z. Zhang, "Matching pursuits with time-frequency dictionaries," IEEE Trans. Signal Processing, vol. 41, pp. 3397–3415, Dec. 1993.

[28] S. Y. Kung, K. S. Arun, and D. V. B. Rao, "State-space and singular-value decomposition-based approximation methods for the harmonic retrieval problem," J. Opt. Soc. Amer., vol. 73, no. 12, pp. 1799–1811, 1983.

[29] S. Van Huffel, H. Chen, C. Decanniere, and P. Van Hecke, "Algorithm for time-domain NMR data fitting based on total least squares," J. Magn. Reson. A, vol. 110, pp. 228–237, 1994.



[30] A.-J. van der Veen, E. F. Deprettere, and A. L. Swindlehurst, "Subspace-based signal analysis using singular value decomposition," Proc. IEEE, vol. 81, no. 9, pp. 1277–1308, 1993.

[31] T. S. Verma and T. H. Y. Meng, "Sinusoidal modeling using frame-based perceptually weighted matching pursuits," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1999, pp. 981–984.

[32] R. Heusdens, R. Vafin, and W. B. Kleijn, "Sinusoidal modeling using psychoacoustic-adaptive matching pursuits," IEEE Signal Processing Lett., vol. 9, pp. 262–265, Aug. 2002.

[33] R. Heusdens and S. van de Par, "Rate-distortion optimal sinusoidal modeling of audio and speech using psychoacoustical matching pursuits," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 2002, pp. 1809–1812.

[34] S. R. Quackenbush, T. P. Barnwell III, and M. A. Clements, Objective Measures of Speech Quality. Englewood Cliffs, NJ: Prentice-Hall, 1988.

[35] S. van de Par et al., "A new psychoacoustical masking model for audio coding applications," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 2002, pp. 1805–1808.

[36] G. Charestan, R. Heusdens, and S. van de Par, "A gammatone-based psychoacoustical modeling approach for speech and audio coding," in Proc. ProRISC/IEEE Workshop on Circuits, Systems and Signal Processing, 2001, pp. 321–326.

[37] H. Chen, S. Van Huffel, and J. Vandewalle, "Bandpass prefiltering for exponential data fitting with known frequency regions of interest," Signal Process., vol. 48, pp. 135–154, 1996.

[38] S. Van Huffel and J. Vandewalle, The Total Least Squares Problem: Computational Aspects and Analysis. Philadelphia, PA: SIAM, 1991.

[39] S. Van Huffel, "Enhanced resolution based on minimum variance estimation and exponential data modeling," Signal Process., vol. 33, no. 3, pp. 333–355, 1993.

Jesper Jensen (S'96–M'00) received the M.Sc. and Ph.D. degrees from Aalborg University, Aalborg, Denmark, in 1996 and 2000, respectively, both in electrical engineering.

From 1996 to 2001, he was with the Center for PersonKommunikation (CPK), Aalborg University, as a Researcher, Ph.D. student, and Assistant Research Professor. In 1999, he was a Visiting Researcher at the Center for Spoken Language Research, University of Colorado at Boulder. Currently, he is a Postdoctoral Researcher at Delft University of Technology, Delft, The Netherlands. His main research interests are digital speech and audio signal processing, including coding, synthesis, and enhancement.

Richard Heusdens received the M.Sc. and Ph.D. degrees from the Delft University of Technology, Delft, The Netherlands, in 1992 and 1997, respectively.

In the spring of 1992, he joined the Digital Signal Processing Group at the Philips Research Laboratories, Eindhoven, The Netherlands, where he worked on various topics in the field of signal processing, such as image/video compression and VLSI architectures for image-processing algorithms. In 1997, he joined the Circuits and Systems Group of the Delft University of Technology, where he was a Postdoctoral Researcher. In 2000, he moved to the Information and Communication Theory (ICT) Group, where he became an Assistant Professor responsible for the audio and speech processing activities within the ICT group. Since 2002, he has been an Associate Professor. His research interests include signal processing, in particular audio and speech processing, and information theory.

Søren Holdt Jensen (S'87–M'88–SM'00) was born in Denmark in 1964. He received the M.Sc. degree in electrical engineering from Aalborg University, Aalborg, Denmark, and the Ph.D. degree from the Technical University of Denmark, Lyngby, Denmark.

Currently, he is an Associate Professor with the Department of Communication Technology, Aalborg University. Before joining Aalborg University, he was with the Telecommunications Laboratory of Telecom Denmark, the Electronics Institute of the Technical University of Denmark, the Scientific Computing Group of the Danish Computing Center for Research and Education (UNI-C), and the Electrical Engineering Department of Katholieke Universiteit Leuven, Belgium. His research activities are in digital signal processing, digital communications, and speech and audio signal processing. He is a member of the editorial board of the Journal on Applied Signal Processing.

Dr. Jensen is a former Chairman of the IEEE Denmark Section and the IEEE Denmark Signal Processing Chapter.

