Regular Pulse Excitation


1054 IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL. ASSP-34, NO. 5, OCTOBER 1986

Regular-Pulse Excitation: A Novel Approach to Effective and Efficient Multipulse Coding of Speech

Abstract: This paper describes an effective and efficient time domain speech encoding technique that has an appealing low complexity, and produces toll quality speech at rates below 16 kbits/s. The proposed coder uses linear predictive techniques to remove the short-time correlation in the speech signal. The remaining (residual) information is then modeled by a low bit rate reduced excitation sequence that, when applied to the time-varying model filter, produces a signal that is close to the reference speech signal. The procedure for finding the optimal constrained excitation signal incorporates the solution of a few strongly coupled sets of linear equations and is of moderate complexity compared to competing coding systems such as adaptive transform coding and multipulse excitation coding. The paper describes the novel coding idea and the procedure for finding the excitation sequence. We then show that the coding procedure can be considered as an optimized baseband coder with spectral folding as high-frequency regeneration technique. The effect of various analysis parameters on the quality of the reconstructed speech is investigated using both objective and subjective tests. Further, modifications of the basic algorithm, and their impact on both the quality of the reconstructed speech signal and the complexity of the encoding algorithm, are discussed. Using the generalized baseband coder formulation, we demonstrate that under reasonable assumptions concerning the weighting filter, an attractive low-complexity/high-quality coder can be obtained.

I. INTRODUCTION

An interesting application area for digital speech coding can be found in mobile telephony systems and computer networks. For these applications, toll quality speech at bit rates below 16 kbits/s is a prerequisite. Many of the conventional speech coding techniques [1] fail to meet this condition. However, a class of coders, the so-called delayed decision coders (DDC) [1, ch. 9], seems to be promising for these applications. Coders that belong to this class utilize an encoding delay to find the best quantized version of the input speech signal or a transformed version of it. Quite effective algorithms can be designed by combining predictive and DDC techniques to yield low bit rate waveform matching encoding schemes. A powerful and common approach is to use a slowly time-varying linear predictive (LP) filter to model the short-time spectral envelope of the quasi-stationary speech signal. The problem that remains is how to describe the resulting prediction residual that contains the necessary information to describe the fine structure of the underlying spectrum. In other words, what is the best low-capacity model for the speech prediction residual subjected to one or more judgment criteria? These may include objective and subjective quality measures (such as rate distortions and listening scores, respectively), but coder complexity can also be taken into account. Although certain models have been shown to behave very satisfactorily [2]-[4], the question of optimality remains difficult to answer.

Manuscript received August 23, 1985; revised March 5, 1986. This work was supported in part by Philips Research Laboratories, Eindhoven, The Netherlands, and by the Dutch National Applied Science Foundation under Grant STW DEL 44.0643.
P. Kroon was with the Department of Electrical Engineering, Delft University of Technology, Delft, The Netherlands. He is now with the Acoustics Research Department, AT&T Bell Laboratories, Murray Hill, NJ 07974.
E. F. Deprettere is with the Department of Electrical Engineering, Delft University of Technology, Mekelweg 4, 2628 CD Delft, The Netherlands.
R. J. Sluyter is with the Philips Research Laboratories, 5600 MD Eindhoven, The Netherlands.
IEEE Log Number 8609633.

In this paper we address the problem of finding an excitation signal for an LP speech coder that not only ensures a comparable quality with existing approaches, but is also structurally powerful. By the latter we mean that a fast realization algorithm and corresponding high throughput (VLSI) implementation can be obtained. We propose a method in which the prediction residual is modeled by a signal that resembles an upsampled sequence and has, therefore, a regular (in time) structure. Because of this regularity, we refer to this coder as the regular-pulse excitation (RPE) coder [5]. The values of the nonzero samples in this signal are optimally determined by a least-squares analysis-by-synthesis fitting procedure that can be expressed in terms of matrix arithmetic.

In Section II we describe in more detail the regular-pulse excitation coding procedure and the algorithm for finding the excitation sequence. In Section III we show that the proposed encoding procedure can be interpreted in terms of optimized baseband coding. In Section IV, the influence of the various analysis parameters on the quality of the reconstructed speech is investigated. Further, to exploit the long-term correlation in the speech signal, the use of a pitch predictor is discussed. Modifications to the basic procedure, to attain a further reduction in complexity without noticeable quality loss, are described in Section V. Finally, in Section VI, we describe the effect of quantization on the quality of the reconstructed speech signal.

0096-3518/86/1000-1054$01.00 © 1986 IEEE

II. BASIC CODER STRUCTURE

Fig. 1. Block diagram of the regular-pulse excitation coder: (a) encoder, (b) decoder.

k = 1: |...|...|...|...|...|...|...|...|...|...
k = 2: .|...|...|...|...|...|...|...|...|...|..
k = 3: ..|...|...|...|...|...|...|...|...|...|.
k = 4: ...|...|...|...|...|...|...|...|...|...|

Fig. 2. Possible excitation patterns with L = 40 and N = 4.

The basic coder structure can be viewed as a residual modeling process, as depicted in Fig. 1. In this figure, the residual $r(n)$ is obtained by filtering the speech signal $s(n)$ through a $p$th-order time-varying filter $A(z)$,

$$A(z) = 1 + \sum_{k=1}^{p} a_k z^{-k}, \tag{1}$$

which can be determined with the use of linear prediction (LP) techniques as described in, e.g., [6]. The difference between the LP-residual $r(n)$ and a certain model residual $u(n)$ (to be defined below) is fed through the shaping filter $1/A(z/\gamma)$,

$$\frac{1}{A(z/\gamma)} = \frac{1}{1 + \sum_{k=1}^{p} a_k \gamma^{k} z^{-k}}. \tag{2}$$

This filter, which serves as an error weighting function, plays the same role as the feedback filter in adaptive predictive coding with noise shaping (APC-NS) [7] and the weighting filter in multipulse excitation (MPE) coders [2]. The resulting weighted difference $e(n)$ is squared and accumulated, and is used as a measure for determining the effectiveness of the presumed model $u(n)$ of the residual $r(n)$.

The excitation sequence $u(n)$ is determined for adjacent frames consisting of $L$ samples each, and is constrained as follows. Within a frame, it is required to correspond to an upsampled version of a certain "optimal" vector $b = (b(1), \ldots, b(Q))$ of length $Q$ ($Q < L$). Thus, each segment of the excitation signal contains $Q$ equidistant samples of nonzero amplitude, while the remaining samples are equal to zero. The spacing between nonzero samples is $N = L/Q$. For a particular coder, the parameters $L$ and $N$ are optimally chosen but are otherwise fixed quantities. The duration of a frame of size $L$ is typically 5 ms. Each excitation frame can support $N$ sets of $Q$ equidistant nonzero samples, resulting in $N$ candidate excitation sequences. Fig. 2 shows the possible excitation patterns for a frame containing 40 samples and a spacing of $N = 4$.

In this figure, the locations of the pulses are marked by a vertical dash and the zero samples by dots. If $k$ ($k = 1, 2, \ldots, N$) denotes the phase of the upsampled version of the vector $b^{(k)}$, i.e., the position of the first nonzero sample in a particular segment, then we have to compute for every value of $k$ the amplitudes $b^{(k)}(\cdot)$ that minimize the accumulated squared error. The vector that yields the minimum error is selected and transmitted. The decoding procedure is then straightforward, as is shown in Fig. 1(b).

A. Encoding Algorithm

Denoting by $M_k$ the $Q$ by $L$ position matrix with entries

$$m_{ij} = \begin{cases} 1, & \text{if } j = iN + k - 1 \\ 0, & \text{otherwise} \end{cases} \qquad 0 \le i \le Q - 1, \quad 0 \le j \le L - 1, \tag{3}$$

the segmental excitation row vector $u^{(k)}$, corresponding to the $k$th excitation pattern, can be written as

$$u^{(k)} = b^{(k)} M_k. \tag{4}$$

Let $H$ be an upper triangular $L$ by $L$ matrix whose $j$th row ($j = 0, \ldots, L - 1$) contains the (truncated) response $h(n)$ of the error weighting filter $1/A(z/\gamma)$ caused by a unit impulse $\delta(n - j)$. That is,

$$H = \begin{bmatrix} h(0) & h(1) & \cdots & h(L-1) \\ 0 & h(0) & \cdots & h(L-2) \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & h(0) \end{bmatrix}. \tag{5}$$

If $e_0$ denotes the output of the weighting filter due to the memory hangover (i.e., the output as a result of the initial filter state) of previous intervals, then the signal $e(n)$ produced by the input vector $b^{(k)}$ can be described as

$$e^{(k)} = e^{(0)} - b^{(k)} H_k, \qquad k = 1, \ldots, N, \tag{6}$$

where

$$e^{(0)} = e_0 + rH, \tag{7}$$

$$H_k = M_k H, \tag{8}$$

and the vector $r$ represents the residual $r(n)$ for the current frame. The objective is to minimize the squared error

$$E^{(k)} = e^{(k)} e^{(k)t}, \tag{9}$$

where $t$ denotes transpose. For a given phase $k$, the optimal amplitudes $b^{(k)}(\cdot)$ can be computed from (6) and (9), by requiring $e^{(k)} H_k^{t}$ to be equal to zero. Hence,

$$b^{(k)} = e^{(0)} H_k^{t} [H_k H_k^{t}]^{-1}. \tag{10}$$

By substituting (10) in (6) and thereafter the resulting expression in (9), we obtain the following expression for the error:

$$E^{(k)} = e^{(0)} \left[ I - H_k^{t} [H_k H_k^{t}]^{-1} H_k \right] e^{(0)t}. \tag{11}$$



The vector $b^{(k)}$ that yields the minimum value of $E^{(k)}$ over all $k$ is then selected. The resulting optimal excitation vector $u^{(k)}$ is entirely characterized by its phase $k$ and the corresponding amplitude vector $b^{(k)}$. The whole procedure comprises the solution of $N$ sets of linear equations as given by (10). A fast algorithm to compute the $N$ vectors $b^{(k)}$ simultaneously has been presented in [8] and [9]. We shall show in Section V that a further reduction in complexity can be obtained by exploiting the nature of the matrix product $H_k H_k^{t}$ in (10).
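The search over the $N$ phases in (3)-(11) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the fast algorithm of [8] and [9]: the function names `position_matrix`, `weighting_matrix`, and `rpe_encode_frame`, and the argument `e_mem` (standing for $e_0$), are hypothetical, and each phase is solved independently with a general linear solver.

```python
import numpy as np

def position_matrix(k, L, N):
    """Q-by-L position matrix M_k of (3): m_ij = 1 if j = i*N + k - 1."""
    Q = L // N
    M = np.zeros((Q, L))
    M[np.arange(Q), np.arange(Q) * N + k - 1] = 1.0
    return M

def weighting_matrix(h, L):
    """Upper triangular L-by-L matrix H of (5); row j holds h(n) delayed by j."""
    H = np.zeros((L, L))
    for j in range(L):
        H[j, j:] = h[:L - j]
    return H

def rpe_encode_frame(r, h, N, e_mem=None):
    """Return (k, b, E): best phase, pulse amplitudes, weighted squared error.

    r is the residual of the current frame, h holds at least len(r) samples
    of the weighting-filter impulse response, e_mem plays the role of e_0.
    """
    L = len(r)
    H = weighting_matrix(h, L)
    e_mem = np.zeros(L) if e_mem is None else e_mem
    e0 = e_mem + r @ H                             # (7)
    best = None
    for k in range(1, N + 1):
        Hk = position_matrix(k, L, N) @ H          # (8)
        b = np.linalg.solve(Hk @ Hk.T, Hk @ e0)    # (10)
        e = e0 - b @ Hk                            # (6)
        E = float(e @ e)                           # (9)
        if best is None or E < best[2]:
            best = (k, b, E)
    return best

# A residual that is itself a regular pulse train with phase 3 is matched
# exactly by the candidate with k = 3.
L, N = 40, 4
r = np.zeros(L)
r[2::N] = np.arange(1, 11)
k, b, E = rpe_encode_frame(r, 0.8 ** np.arange(L), N)
```

Solving each phase separately keeps the sketch close to the equations; it makes no attempt to exploit the structure that the $N$ systems share.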

III. GENERALIZED BASEBAND CODING

It may be observed that the regular-pulse excitation sequence bears some resemblance to the excitation signal of a residual excited baseband coder (BBC) using spectral folding as high-frequency regeneration technique [4], [10]. In this section we show that the RPE coder can be interpreted as a generalized version of this baseband coder. For this purpose we use the block diagram of Fig. 3. The blocks drawn with solid lines represent the conceptual structure of a residual excited BBC coder with spectral folding. For this coder, the index $k$ has no significance and is set to zero. In this scheme, the LP-residual signal $r(n)$, obtained by filtering the speech signal through the filter $A(z)$, is band-limited by an (almost) ideal low-pass filter $F_0(z)$, downsampled to $b(n)$ and transmitted. At the receiver, this signal is upsampled to $u^{(0)}(n)$ to recover the original bandwidth, and is fed through the synthesis filter to retrieve the speech signal $\hat{s}(n)$. When the dashed blocks are included in Fig. 3, one provides a possibility to optimize the filter $F_k(z)$, i.e., to replace the ideal low-pass filter $F_0(z)$ by another filter, which is more tailored to optimal waveform matching, where the optimality criterion is to minimize the (weighted) mean-squared error between the original and the reconstructed signal.

Fig. 3. Block diagram of a BBC coder (solid lines), and an RPE coder (solid and dashed lines).

We shall now show that for this optimized BBC version, the output of the filter $F_k(z)$, after down- and upsampling, is exactly the excitation signal $u^{(k)}(n)$ as computed by the RPE algorithm. Thus, let there exist for each $k$ ($k = 1, \ldots, N$) an FIR filter $F_k(z)$ such that the weighted least-squares error $\sum_n e^2(n)$ over the interval $L$ is minimal. Define $F_k(z)$ as

$$F_k(z) = \sum_{i=0}^{L-1} f^{(k)}(i) z^{-i}, \tag{12}$$

and

$$f^{(k)} = [f^{(k)}(0) \;\; f^{(k)}(1) \;\; \cdots \;\; f^{(k)}(L-1)]. \tag{13}$$

Let $r_+(n)$ and $r_-(n)$ ($n = 0, \ldots, L - 1$) denote the residual samples of the current frame and those of the previous frame, respectively. Then we can write for the output $v^{(k)}(n)$ of the filter $F_k(z)$

$$v^{(k)} = f^{(k)} \begin{bmatrix} r_+(0) & r_+(1) & \cdots & r_+(L-1) \\ r_-(L-1) & r_+(0) & \cdots & r_+(L-2) \\ r_-(L-2) & r_-(L-1) & \cdots & r_+(L-3) \\ \vdots & \vdots & & \vdots \\ r_-(1) & r_-(2) & \cdots & r_+(0) \end{bmatrix} = f^{(k)} R. \tag{14}$$

The vector $b^{(k)}$, which is the downsampled version of $v^{(k)}$ (with downsampling factor $N$), can be written as

$$b^{(k)} = f^{(k)} R M_k^{t} = f^{(k)} R_k, \tag{15}$$

with

$$R_k = \begin{bmatrix} r_+(k-1) & r_+(N+k-1) & \cdots & r_+((Q-1)N+k-1) \\ \vdots & \vdots & & \vdots \\ r_-(k) & r_-(N+k) & \cdots & r_-((Q-1)N+k) \end{bmatrix}, \tag{16}$$

where $M_k$ is the position matrix as defined in (3), and where the definition $r_-(L + k) = r_+(k)$ is used. The excitation vector $u^{(k)}$ can be expressed as the product

$$u^{(k)} = b^{(k)} M_k = f^{(k)} R_k M_k, \tag{17}$$

where $R_k M_k$ is the $L$ by $L$ matrix whose columns $iN + k - 1$ ($i = 0, \ldots, Q - 1$) are the columns of $R_k$ and whose remaining columns are zero.



Fig. 4. Power spectra $|F_k(e^{j\theta})|^2$ for different values of $k$, obtained from a 5 ms speech segment. [One panel per value of $k$; horizontal axis: frequency (kHz).]

Hence, with the matrix $H$ and the initial error $e^{(0)}$ as defined in the previous section,

$$e^{(k)} = e^{(0)} - f^{(k)} R_k M_k H = e^{(0)} - f^{(k)} R_k H_k. \tag{18}$$

Minimizing $e^{(k)} e^{(k)t}$, we obtain as solution

$$f^{(k)} = e^{(0)} (R_k H_k)^{t} [R_k H_k (R_k H_k)^{t}]^{-1}. \tag{19}$$

Substituting this result in (15), we obtain the vector $b^{(k)}$, which is equal to the pulse amplitude vector $b^{(k)}$ obtained via the procedure described in Section II (see the proof in the Appendix).

Fig. 4 gives an example of the spectra $|F_k(e^{j\theta})|^2$ obtained from real speech data. From this figure we see that the filters $F_k(z)$ are rather different from the one ($F_0(z)$) used in the classical baseband coder, and have more all-pass character.

Although the RPE algorithm and the optimal BBC algorithm are conceptually equivalent, the optimized BBC variant will in general not offer any computational advantage over the RPE approach. However, in Section V, it is demonstrated that under certain reasonable assumptions concerning the weighting filter, the BBC approach can provide an attractive alternative in practice.
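The claimed equality between the amplitudes from (10) and those obtained through (19) and (15) can be checked numerically on random data. The sketch below is only a sanity check under stated assumptions: the frame sizes and the impulse response `h` are arbitrary, and because $R_k H_k$ has rank at most $Q < L$, the inverse in (19) is interpreted here as a Moore-Penrose pseudo-inverse (minimum-norm least-squares solution). That interpretation is an assumption of this sketch, not something stated in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
L, N, k = 12, 3, 2
Q = L // N

r_cur = rng.standard_normal(L)    # r+(n), current frame
r_prev = rng.standard_normal(L)   # r-(n), previous frame
e0 = rng.standard_normal(L)       # stands in for the initial error e^(0)

# Convolution matrix R of (14): entry (i, n) is r(n - i), taken from the
# previous frame whenever n - i is negative.
r_ext = np.concatenate([r_prev, r_cur])
R = np.array([[r_ext[L + n - i] for n in range(L)] for i in range(L)])

# Position matrix M_k of (3) and weighting matrix H of (5).
M = np.zeros((Q, L))
M[np.arange(Q), np.arange(Q) * N + k - 1] = 1.0
h = 0.8 ** np.arange(L)
H = np.array([[h[n - j] if n >= j else 0.0 for n in range(L)]
              for j in range(L)])

Rk, Hk = R @ M.T, M @ H
b_rpe = e0 @ Hk.T @ np.linalg.inv(Hk @ Hk.T)       # amplitudes from (10)
f = e0 @ np.linalg.pinv(Rk @ Hk, rcond=1e-10)      # filter from (19), pseudo-inverse
b_bbc = f @ Rk                                     # amplitudes from (15)
agree = np.allclose(b_rpe, b_bbc)
```

With these choices `agree` evaluates to `True`, which is consistent with the equivalence stated above; the check says nothing about the proof in the Appendix.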

IV. EVALUATION OF THE RPE ALGORITHM

Fig. 5 shows a typical example of the waveforms as produced by the RPE coder, using the analysis parameters listed in Table I. The corresponding short-time power spectra of the speech signal $s(n)$ (solid line) and the reconstructed signal $\hat{s}(n)$ (dashed line) are shown in Fig. 6. To give an impression of the signal-to-noise ratio over a complete utterance, we show in Fig. 7 the segmental SNR (SNRSEG) computed every 10 ms for the utterance "a lathe is a big tool" spoken by both a female and a male speaker.

Fig. 5. (a) Speech signal $s(n)$, (b) reconstructed speech signal $\hat{s}(n)$, (c) excitation signal $u(n)$, and (d) difference signal $s(n) - \hat{s}(n)$ in the RPE coding procedure. [Horizontal axis: time (ms).]

TABLE I
DEFAULT PARAMETERS RPE ANALYSIS

Parameter                   Value
sampling frequency          8 kHz
LP analysis procedure       autocorrelation
order (p)                   12
update rate coefficients    10 ms
analysis frame size         25 ms Hamming window
pulse spacing N             4
frame size L                5 ms
weight factor γ             0.80

Fig. 6. Power spectra of the original speech segment (solid line) and the reconstructed speech segment (dashed line). The spectra were obtained with a Hamming window using the last 32 ms segment of the data displayed in Fig. 5. [Horizontal axis: frequency (kHz).]

A. RPE Analysis Parameters

The RPE analysis parameters that could affect the final speech quality are listed below:

1) predictor parameters,
2) pulse spacing N,
3) frame size L, and
4) error weighting filter.

To evaluate the coder behavior, we used a set of default parameter values (see Table I), while the parameter under investigation was varied.

The effects of the predictor parameters in APC-like schemes have been extensively studied in the literature (e.g., [1]), and will not be discussed in detail in this paper. We found that good results were obtained with the autocorrelation method using a Hamming window on 25 ms frames. The predictor coefficients were updated every 20 ms and the predictor order $p$ was chosen to be equal to 12.
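As an illustration of the autocorrelation method mentioned above, the following sketch computes the coefficients of $A(z) = 1 + \sum_k a_k z^{-k}$ for one Hamming-windowed frame. It solves the Toeplitz normal equations with a general linear solver rather than a dedicated recursion, which is a simplification chosen for this sketch. The function name `lp_coefficients` and the random test frame are hypothetical; the paper refers to [6] for the actual analysis procedure.

```python
import numpy as np

def lp_coefficients(frame, p=12):
    """Return (a, xw): a_1..a_p of A(z) = 1 + sum_k a_k z^-k and the
    Hamming-windowed frame the coefficients were estimated from."""
    xw = frame * np.hamming(len(frame))
    # Autocorrelation lags 0..p of the windowed frame.
    acf = np.array([xw[:len(xw) - m] @ xw[m:] for m in range(p + 1)])
    # Normal equations: sum_k a_k acf(|i - k|) = -acf(i), i = 1..p.
    idx = np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    a = np.linalg.solve(acf[idx], -acf[1:])
    return a, xw

fs = 8000                                        # sampling frequency in Hz
rng = np.random.default_rng(1)
frame = rng.standard_normal(int(0.025 * fs))     # 25 ms of placeholder data
a, xw = lp_coefficients(frame)
# Prediction residual of the windowed frame, obtained by filtering with A(z).
residual = np.convolve(xw, np.concatenate(([1.0], a)))
```

Because the coefficients minimize the energy of exactly this residual, its energy cannot exceed that of the windowed frame, which gives a simple consistency check on the sign convention.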



Fig. 7. Segmental SNR for successive time frames for a female speaker (a) and a male speaker (b). The upper curve represents the speech power. [Horizontal axis: time (s); utterance: "a lathe is a big tool".]

Fig. 8. Segmental SNR values for different frame sizes L and pulse spacings N. The results for L = X were obtained with a fixed value for k and L = 40. [Horizontal axis: frame size L.]

We mentioned earlier that for the case in which there is no phase adaptation (on a frame basis), that is, $k$ is fixed and equal to 1, the structure of the excitation signal resembles the upsampled residual signals used in BBC coders with spectral folding. This observation can give us a rough estimate of the maximum spacing ($N$) between the pulses, to ensure a good synthetic speech quality. Assuming a maximum fundamental frequency of 500 Hz, one has to use a sampling rate of minimally 1000 Hz. Hence, for an 8 kHz sampling rate, the pulse spacing should be less than or equal to 8.

To investigate the effect of different frame sizes $L$ and pulse spacings $N$, we computed the segmental SNR values of the reconstructed speech signals for various values of these parameters. Fig. 8 shows the averaged segmental SNR values for two female and two male speakers for different values of $N$ and $L$. As far as possible, we have chosen the same frame size for different values of $N$. From this figure, we see that the SNR increases with the number of pulses and decreases with increasing frame size. However, there is no real tradeoff between the values of $L$ and $N$. Informal listening tests confirmed the ranking as introduced by the SNR measurements. For values of $N$ greater than 5, some of the utterances (especially those by female speakers) sounded distorted. From our experiments, we found that $N = 4$ and $L = 5$ ms will give the best results considering the bit rate constraints.

The pulse amplitudes $b^{(k)}(\cdot)$ and phase $k$ are computed every $L$ samples, which means that the phase adaptation rate is equal to $1/L$. To investigate the effect of this disturbance, without changing the size of $L$, we considered phase adaptation every $D$ samples, where the value of $D$ is less than or equal to $L$, and $L/D$ must be an integer ratio. Within a frame of size $L$, the possible number of excitation sequences is then given by

$$B = N^{d}, \qquad d = L/D. \tag{20}$$

Hence, a value of $D$ smaller than $L$ results in a more complex procedure for the computation of the optimum excitation. Fig. 9 shows the possible excitation patterns for $L = 24$, $D = 12$, and $N = 3$. Table II lists the resulting averaged SNR values for different frame sizes $L$ and ratios $L/D = 1$ and 2. From this table we see a small improvement in SNR for values of $D$ less than $L$, at the expense of a much higher complexity.

Footnote 1: The utterances are: "A lathe is a big tool" and "An icy wind raked the beach."

Fig. 9. RPE excitation patterns with D = 12, L = 24, and N = 3.

TABLE II
SNR VALUES FOR DIFFERENT VALUES OF L AND D

L : D      SNRSEG      SNR
20 : 20    14.58 dB    11.50 dB
40 : 20    14.15 dB    11.90 dB
40 : 40    14.28 dB    11.17 dB
80 : 40    14.58 dB    11.29 dB
80 : 80    13.80 dB    10.44 dB

B. Application of a Pitch Predictor

An examination of the regular-pulse excitation (see, for example, Fig. 5) reveals the periodic structure of the excitation for voiced sounds. Obviously, the RPE algorithm aligns the excitation grid to the major pitch pulses, thereby introducing the possibility that the remaining pulses within the grid are not optimally located. If we model the major pitch pulses with a pitch predictor/synthesizer, the remaining excitation sequence can be modeled by the regular-pulse excitation sequence. A simple but effective pitch predictor is the so-called one-tap predictor,

$$1 - P(z) = \beta z^{-M}, \tag{21}$$

where $M$ represents the distance between adjacent pitch pulses and $\beta$ is a gain factor. The pitch predictor parameters can be determined either in an open-loop configuration [11], or in a closed-loop configuration [12]. In the latter case, the parameters can be optimally computed by including a pitch generator $1/P(z)$ in the closed-loop diagram of Fig. 1. The parameters $\beta$ and $M$ are determined such that the output of the pitch generator due to its initial state is optimally close (in the weighted sense) to the initial error signal $e^{(0)}(n)$. Once $\beta$ and $M$ have been determined, the remaining regular-pulse excitation signal is computed as described in Section II, except that this signal is now to be fed through both the pitch generator and the weighting filter. The advantage of determining the pitch parameters within the analysis loop is that the pitch generator is then optimally contributing to the minimization of the weighted error. To be more specific, let $y_M(n)$ be the response of the pitch generator to an input $u(n)$, which is zero for $n \ge 0$,

$$y_M(n) = u(n) + \beta y_M(n - M). \tag{22}$$

Let $z_M(n)$ represent the response of the weighting filter to the input signal $y_M(n)$, defined in (22), and let $e^{(0)}(n)$ represent the initial error as defined in (7). The error to be minimized will then be

$$E(M, \beta) = \sum_n \left( e^{(0)}(n) - \beta z_M(n) \right)^2. \tag{23}$$

The approach is to compute $\beta$ for all possible values of $M$ within a specified range, and then select the pair $(M, \beta)$ for which $E(M, \beta)$ is minimal.

The range of $M$ should be chosen to accommodate the variation in pitch frequency in the speech signal. However, in simulations with a one-tap predictor using different ranges of $M$, we found that a range of $M$ between 16 and 80 (i.e., a fundamental frequency between 100 and 470 Hz) is satisfactory. The effect of pitch prediction is demonstrated in Fig. 10, by using the same speech segment as used in Fig. 5. The short-time power spectra of the speech signal $s(n)$ (solid line) and of the error signal $s(n) - \hat{s}(n)$ (dashed line) for $\gamma = 0.80$, without and with pitch filter, are shown in Figs. 11 and 12, respectively. The effect of pitch prediction on the averaged segmental SNR values is shown in Fig. 13. These figures show that the effect of pitch prediction is to decrease the absolute level of noise power and to flatten its spectrum, thereby improving the performance in terms of SNR. This effect was most noticeable for high-pitched (average pitch ≥ 250 Hz) speakers.

C. Error Weighting Filter

Fig. 10. (a) Speech signal $s(n)$, (b) reconstructed speech signal $\hat{s}(n)$, (c) excitation signal (i.e., output of the pitch generator), (d) difference signal $s(n) - \hat{s}(n)$ in the RPE coding procedure with pitch prediction. [Horizontal axis: time (ms).]

Fig. 11. Power spectra of the speech signal (solid line) and the difference signal $s(n) - \hat{s}(n)$ (dashed line) for $\gamma = 0.80$. The spectra were obtained from the last 32 ms segment of Fig. 5. [Horizontal axis: frequency (kHz).]

Fig. 12. Power spectra of the speech signal (solid line) and the difference signal $s(n) - \hat{s}(n)$ (dashed line) for $\gamma = 0.80$, and pitch prediction. The spectra were obtained from the last 32 ms segment of Fig. 10. [Horizontal axis: frequency (kHz).]

Fig. 13. Segmental SNR values obtained from RPE encoded speech with (+pp) and without (-pp) pitch prediction for different update rates of the predictors (10 ms and 20 ms) and different pulse spacings N (f = female, m = male).

Although the effect of noise shaping can be heard, the real mechanism behind this effect is not clear. We will not pursue the question whether the proposed noise-shaping filter of (2) is an effective choice or not, and concentrate instead on the effect of the suggested filter and its control parameter $\gamma$. This parameter determines the amount of noise power in the formant regions of the speech spectrum. Noise shaping reduces the SNR, but improves the perceived speech quality. An optimal value for $\gamma$ was found to be between 0.80 and 0.90 at an 8 kHz sampling rate, and resulted in an average 2 dB decrease in SNR.
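To make the role of $\gamma$ concrete, the sketch below computes the first samples of the impulse response of $1/A(z/\gamma)$ by direct recursion, using the sign convention $A(z) = 1 + \sum_k a_k z^{-k}$ so that the weighted coefficients are $a_k \gamma^k$. The function name `impulse_response` and the second-order example coefficients are hypothetical and chosen only for illustration.

```python
import numpy as np

def impulse_response(a, n_samples, gamma=1.0):
    """First n_samples of the impulse response of 1/A(z/gamma),
    where A(z) = 1 + sum_k a_k z^-k and a = [a_1, ..., a_p]."""
    a_w = np.asarray(a, dtype=float) * gamma ** np.arange(1, len(a) + 1)
    h = np.zeros(n_samples)
    for n in range(n_samples):
        acc = 1.0 if n == 0 else 0.0          # unit impulse input
        for k in range(1, min(n, len(a_w)) + 1):
            acc -= a_w[k - 1] * h[n - k]      # all-pole recursion
        h[n] = acc
    return h

a_example = [-1.2, 0.5]                        # arbitrary stable example
g = impulse_response(a_example, 40)            # response of 1/A(z)
h = impulse_response(a_example, 40, gamma=0.8) # response of 1/A(z/gamma)
```

Scaling the coefficients by $\gamma^k$ is equivalent to multiplying the impulse response by $\gamma^n$, so for $\gamma < 1$ the response of the weighting filter decays faster than that of $1/A(z)$.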

Aside from the value of $\gamma$, the order of the noise-shaping filter could also be of importance. By default, the coefficients $\{a_k\}$ and the order $p$ of $1/A(z/\gamma)$ are equal to those of the predictor $A(z)$, but instead, we can compute a $q$th-order predictor ($q < p$) and use the resulting $q$ coefficients to define the weighting filter. While reducing the order, we nevertheless must take care that the noise remains properly weighted. We examined the effect of decreasing the order of the weighting filter $1/A(z/\gamma)$, and observed that for low orders (2-4), the results were close to those obtained with a 16th-order filter. However, the computational savings obtained by reducing the order of the weighting filter are marginal.

The time-varying nature of the weighting filter provides a significant contribution to the complexity of the analysis procedure, since the system of linear equations to be solved is entirely built on the impulse response of this filter. It is obvious that the computational complexity would be considerably lower in case a weighting filter could be chosen such that the matrix to be inverted no longer depends on short-time data. It turns out that this is possible by choosing the weighting filter equal to $1/C(z/\gamma)$,

$$\frac{1}{C(z/\gamma)} = \frac{1}{1 + \sum_{k=1}^{q} c_k \gamma^{k} z^{-k}}, \tag{24}$$

where $\{c_k\}$ are the coefficients of the fixed low-order predictors as used in DPCM systems, which are based on the averaged spectral characteristics of speech. We carried out comparative listening tests on the results obtained with fixed weighting filters of different orders ($q = 1$ to 3). The value of $\gamma$ was set to 0.80 and we used for $\{c_k\}$ the coefficients tabulated in [13]. It was surprising to find that the effects of the weighting filters $1/A(z/\gamma)$ and $1/C(z/\gamma)$ were judged to be almost equivalent. This remarkable result can be exploited to dramatically reduce the complexity of the proposed coder, as we will show in the next section.

V. COMPLEXITY REDUCTION OF THE RPE CODER

The analysis procedure of the RPE coder necessitates the solution of $N$ sets of linear equations, where $N$ represents the spacing between successive pulses within a frame in the excitation model. However, the systems involving the matrices $H_k H_k^{t}$, which have to be inverted, can be solved very efficiently as was described in [8] and [9]. We shall not pursue the details of this procedure here, but we shall instead look for modifications of the algorithm to reduce the complexity without affecting the coder performance.

A. Modification of $H_k H_k^{t}$ to a Toeplitz Matrix

To begin with, we can reconfigure the algorithm to force the matrix product $H_k H_k^{t}$ in (10) to become a single Toeplitz matrix which is independent of the phase $k$. Thus, let

$$h(n) = \gamma^{n} g(n), \qquad n = 0, 1, 2, \ldots, \tag{25}$$

be the impulse response of the weighting filter $1/A(z/\gamma)$, where $g(n)$ is the impulse response of the all-pole filter $1/A(z)$. For values of $\gamma$ less than one, $h(n)$ converges faster to zero than $g(n)$ and, as a result, the $L$ by $2L$ matrix built on $h(n)$ can be very well approximated by the upper triangular Toeplitz matrix $H$ in (26).

$$H = \begin{bmatrix} h(0) & h(1) & \cdots & h(L-1) & 0 & \cdots & \cdots & 0 \\ 0 & h(0) & \cdots & h(L-2) & h(L-1) & 0 & \cdots & 0 \\ \vdots & & \ddots & & & \ddots & & \vdots \\ 0 & \cdots & 0 & h(0) & \cdots & \cdots & h(L-1) & 0 \end{bmatrix} \tag{26}$$

Notice that the matrix $HH^{t}$ is also a Toeplitz matrix. Moreover, when substituting $H$ from (26) into (8), we shall have that the matrices $H_k H_k^{t}$ are independent of the phase index $k$ and are equal to a single Toeplitz matrix. It should also be remarked that the matrix of (26) is an $L$ by $2L$ matrix instead of an $L$ by $L$. Thus, when substituting $H$ of (26) in (7) and (8), the vectors $e_0$, $e^{(0)}$, and $e^{(k)}$ in (6) and (7) will now be of length $2L$, while the vectors $u^{(k)}$ and $r$ in (4) and (7) remain of dimension $L$. The RPE encoding procedure that is based on the mapping $H$ in (26), and for which $g(n)$ in (25) is the impulse response of the transfer function $1/A(z)$, will be referred to as RPM1. Fig. 14(a) shows the segmental SNR values per 10 ms for this method (dashed line) and the original method (solid line) for the utterance "a lathe is a big tool" spoken by both a male and a female speaker.

B. Modification of $H_k H_k^{t}$ to a Band Matrix

In the previous subsection, a computationally attractive scheme was obtained by forcing the matrix operator $H$ to be of the form of (26). Recall, however, that this structure is almost naturally emerging when the mapping originally defined via (5) is taken to be of dimension $L$ by $2L$ instead of $L$ by $L$. This is the more so when $h(n)$ in (26) is the impulse response of the fixed filter $1/C(z/\gamma)$ of (24). But an even more interesting observation is that the resulting single Toeplitz matrix, whether data dependent or not, is strongly diagonal dominant. Hence, when minimizing $E^{(k)}$ in (11), where now $H$ is built on $1/C(z/\gamma)$, or equivalently, when maximizing

$$T^{(k)} = e^{(0)} H_k^{t} [H_k H_k^{t}]^{-1} H_k e^{(0)t}, \tag{27}$$

we can conveniently replace the (Toeplitz) matrix $H_k H_k^{t}$ with a diagonal matrix $r_0 I$, where $r_0 = \sum_{i=0}^{L-1} h^2(i)$, yielding

$$T^{(k)} = \frac{1}{r_0} e^{(0)} H_k^{t} H_k e^{(0)t}, \tag{28}$$

which means that no matrix inversion is needed to find the optimum phase $k$. Table III lists the SNR and SNRSEG values for the different methods, obtained by averaging the results of the same four utterances used in previous examples. Method RPM2 refers to the procedure described in this subsection, where the optimal phase index $k$ is determined from (28), after which the excitation string $b^{(k)}$ is computed according to (10). From this table, we see that the modifications introduced resulted in a slight decrease in SNR. But from informal listening tests, the modified methods were judged to be almost equivalent to the original RPE method. Fig. 14(b) shows the segmental SNR values per 10 ms for RPM2 (dashed lines) and the original method (solid line) for the utterance "a lathe is a big tool" spoken by both a male and a female speaker.

Fig. 14. Segmental SNR for RPE (solid line) and modified methods, (a) RPM1 and (b) RPM2 (dashed line), for a female and male speaker. [Horizontal axis: time (s).]

TABLE III
SNR VALUES OBTAINED WITH THE ORIGINAL AND THE MODIFIED RPE ALGORITHMS RPM1 AND RPM2 DESCRIBED IN SECTIONS V-A AND V-B. THE PROCEDURES RPF1 AND RPF2 ARE DESCRIBED IN SECTION V-C.

Method    SNRSEG      SNR
RPE       14.28 dB    11.17 dB
RPM1      12.98 dB    10.93 dB
RPM2      13.00 dB    11.03 dB
RPF1      10.04 dB    9.38 dB
RPF2      10.40 dB    9.21 dB

C. Avoiding Matrix Inversion

The discussions in the previous two subsections haveled to the conclusion that the complexity of the RPE coder,although moderate by itself, can be substantially reducedwithout any significant degradation of the speech quality.We shall show in this subsection that it is even possibleto obtain an extremely simple encoding algorithm thatturns out to yield an applicable practical version of the(conceptual) optimal baseband coder which was describedin Section I11 and was shown there to be equivalent to theRPE coder. Thus, leth ( n ) n (26) be the impulse responseof the time-invariant filter l/C(zly) as defined in (24).Next, use in (8 ) the matrix H as defined in (26) and dis-card the zeroth-order approximation eo in (7). Then (6)and (10) become

e ( k ) = rH - b'')Hk, (29)and

b(k'[HkHi]= r H H ' M i , (30)respectively. Now denoting

S = HH' , (3)and recalling that

HkHi P o l , (32)with

L - Ir0 = C /z2(i),i = O

as a coder constant, it is easy to show that(33)

Interpreting $M_k^t$ as a downsampling operator, (33) says that $b^{(k)}$ resembles² a downsampled output of a smoother $S$ whose input is a scaled version of the residual $r$ [see Fig. 15(a)]. The excitation selection in the diagram of Fig. 15(a) is based on the minimization of the approximation error given by (11). Under the above-mentioned constraints, this equation becomes

$$E^{(k)} = r S r^t - r_0\, b^{(k)} b^{(k)t}. \tag{34}$$

² This statement must be carefully interpreted. In fact, (33) is a block smoother, and hence, the boundary conditions of the smoother's internal state must be properly taken into account.



Fig. 15. Simplified RPE procedure (a) and excitation selection (b). The smoother is represented by a triangle shape.

Fig. 16. Segmental SNR for RPF1 procedure (solid line) and RPF2 procedure (dashed line) for a female (a) and a male (b) speaker.

Hence,

$$\min \{E^{(k)}\} = \max \{b^{(k)} b^{(k)t}\}. \tag{35}$$
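The step leading to (34) is not written out in the text. A short derivation, offered here as a sketch, uses only (29) to (33), treats (32) as exact, and assumes that (11) defines the error as $E^{(k)} = e^{(k)} e^{(k)t}$ (an assumption, since (11) is not repeated in this section). It also uses $H_k = M_k H$, which is implied by comparing (10) with (30).

```latex
\begin{aligned}
E^{(k)} &= \bigl(rH - b^{(k)} H_k\bigr)\bigl(rH - b^{(k)} H_k\bigr)^t \\
        &= r H H^t r^t \;-\; 2\, b^{(k)} H_k H^t r^t \;+\; b^{(k)} H_k H_k^t\, b^{(k)t} \\
        &= r S r^t \;-\; 2\, r_0\, b^{(k)} b^{(k)t} \;+\; r_0\, b^{(k)} b^{(k)t}
           \qquad \text{since } r H H_k^t = r S M_k^t = r_0\, b^{(k)} \\
        &= r S r^t \;-\; r_0\, b^{(k)} b^{(k)t}.
\end{aligned}
```

The first term does not depend on the phase $k$, so minimizing $E^{(k)}$ over $k$ amounts to maximizing the pulse energy $b^{(k)} b^{(k)t}$, which is the content of (35).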

The whole procedure is now extremely simple. The residual signal $r$ is "smoothed" with the smoother $S = H H^t$. The resulting output vector is downsampled by applying $M_k^t$, and the $b^{(k)}$ for which $b^{(k)} b^{(k)t}$ is maximum is selected [see Fig. 15(b)]. Notice that since $H$ is built on $1/C(z/\gamma)$, the smoother $S$ will be of low order (typically 3rd order), since $h(n)$ is a rapidly decaying sequence. For comparison, the averaged SNRSEG values obtained with this procedure have been included in Table III. In this table, the RPE coder using a fixed weighting filter is referred to as RPF1, while the procedure outlined above is referred to as RPF2. In Fig. 16, the same comparison is made of the segmental SNR as a function of time for the utterance "a lathe is a big tool" spoken by both a male and a female speaker. From this figure, it is clear that, for a fixed weighting filter, procedure RPF2 provides a quality comparable to that of procedure RPF1. The advantage of the former is its ease of implementation.
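The smooth, downsample, and select steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `build_h_matrix` and `rpf2_select` are hypothetical, the impulse response values in the usage note are made up, and the sketch assumes a truncated response no longer than the pulse spacing $N$, so that (32) holds exactly. Each block is processed on its own, so the smoother boundary conditions mentioned in the footnote are ignored.

```python
import numpy as np

def build_h_matrix(h, L):
    """Build an L x (L + M - 1) matrix whose rows are shifted copies of the
    truncated impulse response h (length M), so that e = r @ H filters the
    row vector r. The exact shape of H in (26) is an assumption here."""
    M = len(h)
    H = np.zeros((L, L + M - 1))
    for i in range(L):
        H[i, i:i + M] = h
    return H

def rpf2_select(r, h, N):
    """Return (k, b): the phase and pulse amplitudes with maximum energy.

    r : residual block (length L), h : truncated impulse response,
    N : pulse spacing. Follows (31), (33), and (35) of the text."""
    L = len(r)
    H = build_h_matrix(h, L)
    S = H @ H.T                      # smoother S = H H^t, eq. (31)
    r0 = float(np.dot(h, h))         # coder constant r0
    y = (r @ S) / r0                 # smoothed, scaled residual
    # Applying M_k^t keeps every N-th sample starting at offset k, eq. (33).
    candidates = [y[k::N] for k in range(N)]
    energies = [float(c @ c) for c in candidates]
    k_best = int(np.argmax(energies))  # eq. (35)
    return k_best, candidates[k_best]
```

A call such as `rpf2_select(r, np.array([1.0, 0.6, 0.2]), 4)` on a 40-sample residual returns a phase between 0 and 3 and ten pulse amplitudes. Under the stated assumption, these amplitudes coincide with the least-squares solution of (10), so no matrix inversion is involved.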

VI. QUANTIZATION

To quantize the pulses (i.e., entries of $b^{(k)}$), we used an 8-level adaptive quantizer whose input range was adjusted to the largest pulse amplitude within the current frame of size $L$. The quantization bins can be determined by a Lloyd-Max procedure (nonuniform), but we found that a uniform quantizer also performs quite well. The quantizer normalization factor is logarithmically encoded with 6 bits and is transmitted every $L$ samples (typically 5 ms). The normalized pulses are encoded using 3 bits per pulse. To minimize quantization errors, the quantizer has to be incorporated in the minimization procedure. This can be done in two ways. In the first case (RPQ1), only the optimal excitation vector is quantized; and in the second case (RPQ2), every candidate $b^{(k)}$ is quantized and the quantized vector that produces a minimum error is selected. From segmental SNR measurements, we found that RPQ2 yields a higher SNR, and in listening tests the quality of the reconstructed speech of RPQ2 was judged to be somewhat better than that of RPQ1.

The 12 reflection coefficients were transformed to inverse sine coefficients and encoded with 44 bits/set. The bit allocation and quantizer characteristics were determined by the minimum deviation method [14]. Using 3 bits/pulse and a pulse spacing of $N = 4$, the excitation signal can be encoded with 7 kbits/s. The predictor coefficients can be encoded with 2.2 kbits/s, resulting in a total bit rate of 9.2 kbits/s. The quality of the reconstructed speech was judged to be good but definitely not transparent. In informal listening tests, it was determined that the RPE approach has fewer artifacts than the baseband coder as proposed in [4], and that the performance is comparable to that of the MPE schemes. A pitch predictor will enhance the coder performance but comes at the cost of an additional 1000 bits/s (4 bits for $\beta$ and 6 bits for $M$).
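The difference between RPQ1 and RPQ2 can be made concrete with a small Python sketch. It is illustrative only: `quantize_pulses` and `select_excitation` are hypothetical names, the gain limits `g_min` and `g_max` of the logarithmic scale are assumed values that the text does not specify, and a uniform 8-level quantizer is used in place of a Lloyd-Max design.

```python
import numpy as np

def quantize_pulses(b, levels=8, gain_bits=6, g_min=1e-3, g_max=1e3):
    """Block-adaptive uniform quantizer. Returns (gain_idx, pulse_idx, b_hat).
    g_min and g_max are hypothetical limits of the logarithmic gain scale."""
    n_gain = 2 ** gain_bits
    step = np.log(g_max / g_min) / (n_gain - 1)
    peak = min(max(float(np.max(np.abs(b))), g_min), g_max)
    # Round the log-gain up so the quantizer range covers the largest pulse.
    gain_idx = min(int(np.ceil(np.log(peak / g_min) / step)), n_gain - 1)
    gain = g_min * np.exp(gain_idx * step)
    x = np.clip(b / gain, -1.0, 1.0)
    # Map [-1, 1] onto `levels` equal bins and reconstruct at bin centers.
    pulse_idx = np.minimum((0.5 * (x + 1.0) * levels).astype(int), levels - 1)
    b_hat = gain * ((pulse_idx + 0.5) * 2.0 / levels - 1.0)
    return gain_idx, pulse_idx, b_hat

def select_excitation(e0, H, N, quantize_candidates):
    """Pick a phase k and return (k, quantized pulses, quantized error).

    quantize_candidates=False mimics RPQ1: choose k from the unquantized
    errors, then quantize the winner. True mimics RPQ2: quantize every
    candidate and choose k from the quantized errors."""
    results = []
    for k in range(N):
        Hk = H[k::N]                                   # rows selected by M_k
        b = np.linalg.lstsq(Hk.T, e0, rcond=None)[0]   # least squares, cf. (10)
        b_hat = quantize_pulses(b)[2]
        b_sel = b_hat if quantize_candidates else b
        sel_err = float(np.sum((e0 - b_sel @ Hk) ** 2))
        q_err = float(np.sum((e0 - b_hat @ Hk) ** 2))
        results.append((sel_err, q_err, k, b_hat))
    _, q_err, k, b_hat = min(results, key=lambda t: t[0])
    return k, b_hat, q_err
```

Because the RPQ2 branch searches all quantized candidates, its quantized error can never exceed that of the RPQ1 branch for the same block. The sketch does not show how large the gain is on real speech.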

VII. CONCLUSION

In this paper, a novel coding concept has been proposed that uses linear prediction to remove the short-time correlation in the speech signal. The remaining residual signal is then modeled by a regular (in time) excitation sequence that resembles an upsampled sequence. This model excitation signal is determined in such a way that the perceptual error between the original and the reconstructed signal is minimized. The computational effort is only moderate and can be further reduced by using a fixed error weighting filter and an appropriate vector size (minimization segment length). The coder can produce high-quality speech at bit rates around 9600 bits/s by using a pulse spacing equal to 4 and quantizing each pulse with 3 bits. The use of pitch prediction improves the speech quality but, in general, the RPE coder performs adequately without a pitch predictor. Other applications for the proposed coder can be found in the area of wide-band speech coding (7 kHz bandwidth) as encountered in tele- and video-conferencing applications [15].



APPENDIX

The excitation vector obtained with the optimized BBC (Section III) coincides with the vector $b^{(k)}$ produced by the RPE algorithm (Section II).

Proof: Equation (19) can be written as

$$f^{(k)} R_k [H_k H_k^t] R_k^t = e^{(0)} H_k^t R_k^t.$$

Multiplying both sides to the right by $R_k$ gives

$$f^{(k)} R_k [H_k H_k^t] R_k^t R_k = e^{(0)} H_k^t R_k^t R_k.$$

Now assuming that $R_k^t R_k$ is nonsingular (which will almost always be the case for speech signals), we can as well write

$$f^{(k)} R_k [H_k H_k^t] = e^{(0)} H_k^t.$$

Substituting $b^{(k)}$ for $f^{(k)} R_k$, see (15), in this equation, we obtain (10).

REFERENCES

[1] N. S. Jayant and P. Noll, Digital Coding of Waveforms. Englewood Cliffs, NJ: Prentice-Hall, 1984.
[2] B. S. Atal and J. R. Remde, "A new model of LPC excitation for producing natural-sounding speech at low bit rates," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Apr. 1982, pp. 614-617.
[3] M. R. Schroeder and B. S. Atal, "Code-excited linear prediction (CELP): High-quality speech at very low bit rates," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1985, pp. 937-940.
[4] R. J. Sluyter, G. J. Bosscha, and H. M. P. T. Schmitz, "A 9.6 kbit/s speech coder for mobile radio applications," in Proc. IEEE Int. Conf. Commun., May 1984, pp. 1159-1162.
[5] E. F. Deprettere and P. Kroon, "Regular excitation reduction for effective and efficient LP-coding of speech," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Mar. 1985, pp. 25.8.1-25.8.4.
[6] J. D. Markel and A. H. Gray, Linear Prediction of Speech. Berlin, Germany: Springer-Verlag, 1976.
[7] B. S. Atal, "Predictive coding of speech at low bit rates," IEEE Trans. Commun., vol. COM-30, pp. 600-614, Apr. 1982.
[8] E. F. Deprettere and K. Jainandunsing, "Design and VLSI implementation of a concurrent solver for N-coupled least-squares fitting problems," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Mar. 1985, pp. 6.3.1-6.3.4.
[9] K. Jainandunsing and E. F. Deprettere, "Design and VLSI implementation of a concurrent solver for N-coupled least-squares fitting problems," IEEE J. Select. Areas Commun., pp. 39-48, Jan. 1986.
[10] V. R. Viswanathan, A. L. Higgins, and W. H. Russel, "Design of a robust baseband LPC coder for speech transmission over noisy channels," IEEE Trans. Commun., vol. COM-30, pp. 663-673, Apr. 1982.
[11] P. Kroon and E. F. Deprettere, "Experimental evaluation of different approaches to the multi-pulse coder," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Mar. 1984, pp. 10.4.1-10.4.4.
[12] S. Singhal and B. S. Atal, "Improving performance of multipulse LPC coders at low bit rates," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Mar. 1984, pp. 1.3.1-1.3.4.
[13] J. L. Flanagan, M. R. Schroeder, B. S. Atal, R. E. Crochiere, N. S. Jayant, and J. M. Tribolet, "Speech coding," IEEE Trans. Commun., vol. COM-27, pp. 710-736, Apr. 1979.
[14] A. H. Gray and J. D. Markel, "Implementation and comparison of two transformed reflection coefficient scalar quantization methods," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-28, pp. 575-583, Oct. 1980.
[15] C. Horne, P. Kroon, and E. F. Deprettere, "VLSI implementable algorithm for transparent coding of wide band speech below 32 kb/s," in Proc. IASTED Symp. Appl. Signal Processing Dig. Filter., June 1985, pp. A3.1-A3.4.

Peter Kroon (S'82-M'86) was born in Vlaardingen, The Netherlands, on September 7, 1957. He received the B.S., M.S., and Ph.D. degrees in electrical engineering from Delft University of Technology, Delft, The Netherlands, in 197:, 1981, and 1985, respectively.
His Ph.D. work focused on time-domain techniques for toll quality speech coding at rates below 16 kbits/s. From 1982 to 1983 he was a Research Assistant at the Network Theory Group, Delft University of Technology. During the years 1984 and 1985 he was sponsored by Philips Research Labs to work on coders suitable for mobile radio applications. He is currently with the Acoustics Research Department, AT&T Bell Laboratories, Murray Hill, NJ. His research interests include speech coding, signal processing, and the development of software for signal processing.

Lecturer at DUT, where he is now Associate Professor in the Department of Electrical Engineering. His current research interests are in VLSI and signal processing, particularly speech and image processing, filter design and modeling, systolic signal processors, and matrix equation solvers.

Rob J. Sluyter was born in Nijmegen, The Netherlands, on July 12, 1946. In 1968 he graduated in electronic engineering from the Eindhoven Instituut voor Hoger Beroepsonderwijs.
He joined Philips Research Laboratories, Eindhoven, The Netherlands, in 1962. Until 1978 he was a Research Assistant involved in data transmission and low bit rate speech coding. In 1978 he became a Staff Researcher engaged in speech analysis, synthesis, digital coding of speech, and digital signal processing. Since 1982 he has been engaged in research on medium bit rate coding of speech for mobile radio applications as a member of the Digital Signal Processing Group. His current interests are in digital signal processing for television signals.
