
INTERNATIONAL JOURNAL OF SPEECH TECHNOLOGY 2, 215-225 (1998)
© 1998 Kluwer Academic Publishers. Manufactured in The Netherlands.

Interpolation of Pitch Contour Using Temporal Decomposition

SHAHROKH GHAEMMAGHAMI, MOHAMED DERICHE AND BOUALEM BOASHASH
Signal Processing Research Centre, School of Electrical and Electronic Systems Engineering, Queensland University of Technology, Brisbane, Australia
[email protected]
[email protected]
[email protected]

Received April 25, 1997; Revised February 25, 1998; Accepted May 18, 1998

Abstract. A new method for predicting the pitch contour of a speech signal from a small number of pitch values is presented for very low rate speech coding applications, relying on the correlation between phonetic evolution and pitch variations during voiced speech segments. To track the phonetic evolution and specify perceptually significant time points, Temporal Decomposition (TD) is used. TD provides the information required both for determining critical pitch values and for estimating the pitch contour, by detecting event functions, used as interpolation paths, and their centroids, the most steady points, in the spectral parameter space. It is shown that the proposed method reduces the amount of pitch information to about one-tenth of that in conventional frame-by-frame based techniques, with less than 5% error in pitch approximation.

Keywords: very low-rate speech coding, pitch detection, pitch interpolation, temporal decomposition

1. Introduction

In most speech analysis and processing systems, voiced speech is modelled using a time-varying linear filter excited by a quasi-periodic source (Childers and Wu, 1990). Generally, the filter parameters, which follow vocal tract movements, convey mainly the phonetic information embedded in speech, while the excitation parameters represent prosodic information (Gong and Haton, 1987). In most systems, these two types of parameters are considered separately and processed independently, despite the clear correlation between the phonetic evolution of speech and the evolutionary structure of the excitation signal (Wilgus and Barnwell, 1983). This correlation is different from the source-tract interaction phenomenon, which denotes the mutual effects of the glottal region and the vocal tract on both glottal flow spectra and formant characteristics.

Despite the considerable dependence of speech naturalness on source characteristics (Childers and Wu, 1990; Kleijn and Haagen, 1995), the quasi-periodic excitation is usually approximated by a train of pulses in very low-rate speech coders working at rates below 2 kb/s (Gong and Haton, 1987). In such systems, pitch, voicing, and gain represent the minimum information required to preserve the major features of speech excitation (Childers and Wu, 1990). Nevertheless, a large number of bits (about 30-50% of the total bit-rate) are needed to encode these source parameters (Knagenhjelm and Kleijn, 1995; Mouy et al., 1995).

It has been shown that further compression of source information is possible using appropriate approximation techniques based on the slowly varying characteristics of the excitation signals, at the expense of some degradation in speech quality (Kleijn and Haagen, 1995). Differential quantization of pitch and gain is a standard method for such an approximation at very low rates. This technique encodes the differences between consecutive pitch or gain samples, each corresponding to a short frame of speech,


rather than coding the samples themselves.¹ Usually, such a technique is applied to pitch and gain contours similarly, although there are certain differences between the two, in both their behavior and their effects on synthesized speech quality. Previous work has shown that pitch affects the quality of speech more than gain, because of the importance of stress in speech naturalness as compared to amplitude (Wilgus and Barnwell, 1983).

Typically, 4-7 bits are used to quantize the pitch information in each frame in very low-rate coders (O'Shaughnessy, 1987). Therefore, depending on the frame rate, 150-400 bits/s are required to encode pitch, i.e., fundamental frequency (Wilgus and Barnwell, 1983; Mouy et al., 1995). On the other hand, in segment coders operating at a rate of 10-15 segments/s (150-500 bits/s), often only one pitch value is calculated within each segment (mostly at the center of the segment) and the other pitch values are approximated by linear interpolation (Wilgus and Barnwell, 1983; Roucos et al., 1983; Shiraki and Honda, 1988). Hence, a much smaller number of bits is assigned to pitch information.

In (Roucos et al., 1983), 1 bit/segment has been used to follow the changes in pitch in a piece-wise linear approximation of the pitch contour. In (Shiraki and Honda, 1988), a 4-bit/segment differential pitch predictor has been proposed. These coders face two basic problems: the inaccuracy of linear pitch interpolation, and the ambiguity in identifying the frame within the segment at which the pitch value is to be calculated.

More effective, and more complicated, pitch interpolation techniques have been reported in association with more complex excitation coding at rates above 2 kb/s, in which the excitation waveform is interpolated through pitch-synchronous analysis (Kleijn and Haagen, 1995; Taori et al., 1995) using a conventional pitch detection technique.

In this paper, we propose a new method for estimating pitch over voiced segments of a given speech utterance through Temporal Decomposition (TD). The rationale for using TD in such an application relies on the substantial correlation between event functions, extracted by TD from spectral features of speech, and the evolutionary characteristics of the phonetic contents. It has been shown in earlier work that almost all (about 99%) of the phonetic information conveyed by the original spectral parameters is preserved by TD in a much more condensed space as compared to the original feature space (Bimbot et al., 1987; Van Dijk-Kappers, 1989).

In addition, our own findings on event approximation and TD-based coding (Ghaemmaghami and Deriche, 1996; Ghaemmaghami et al., 1997b) indicate that event centroids are key instants in the temporal evolution of the vocal tract configuration which can be employed efficiently to model articulatory dynamics. This paper is therefore another effort at using such a powerful tool, here for compressing pitch information through a near-optimal interpolation algorithm.

We show that this method can considerably improve the reliability of the pitch detection task, as well as the efficiency and accuracy of the interpolation of the pitch contour at very low rates. The performance of the method is assessed using a perceptually based spectral distance measure on speech reconstructed with a binary excitation LPC synthesizer.

The organisation of this paper is as follows. In Section 2, the TD algorithm and the event approximation technique are described. In Section 3, the proposed pitch determination/interpolation method is developed. Section 4 is devoted to the results obtained using the proposed algorithm in speech coding. A discussion of the performance of the method is given in Section 5, and conclusions are drawn in Section 6.

2. Temporal Decomposition

2.1. Description

Temporal Decomposition (TD) is a method for modelling the phonemic evolution of speech based on a sequence of spectral parameters (Atal, 1983). Such an evolutionary characteristic is expressed by a number of temporally overlapping compact interpolation functions, called target or event functions, which are interpreted as physical representations of speech acoustic events. TD is applied to a matrix of spectral parameters, Y, to extract the corresponding matrix of event functions, Φ:

$$Y = A\Phi \qquad (1)$$

where Y is a p × N matrix of parameters, Φ is an m × N matrix of event functions, and A is a p × m matrix of associated weightings.

Equation (1) can be written in scalar form as a set of linear equations, each corresponding to the time trajectory of the ith parameter across the columns (speech frames)


of Y, as

$$y_i(n) = \sum_{k=1}^{m} a_{ik}\,\phi_k(n), \quad 1 \le n \le N,\ 1 \le i \le p \qquad (2)$$

where p is the number of spectral parameters extracted from each frame, n is the frame index, y_i(n) is the ith parameter approximated by the model, φ_k(n) is the kth event function evaluated at frame n, a_ik is the weighting factor, and m represents the number of event functions in the interval n = 1 to n = N.
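To make the bookkeeping concrete, the following is a minimal numpy sketch of the shapes involved in Eqs. (1)-(2). The A and Φ matrices here are random placeholders, not the output of the SVD and refinement stages described below.

```python
import numpy as np

# Shapes of the TD model, Eqs. (1)-(2).  A and Phi are random
# placeholders standing in for the outputs of the SVD/refinement stages.
p, m, N = 10, 4, 55            # parameters per frame, events, frames
A = np.random.rand(p, m)       # p x m weighting matrix
Phi = np.random.rand(m, N)     # m x N event-function matrix
Y_hat = A @ Phi                # Eq. (1): Y ~ A * Phi

# Eq. (2), scalar form: trajectory of the i-th parameter across frames.
i = 0
y_i = sum(A[i, k] * Phi[k, :] for k in range(m))
assert np.allclose(y_i, Y_hat[i, :])
```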

To find the Φ and A matrices, we need to decompose the Y matrix through orthogonalization. This is basically performed in two steps: first, the locations of the event functions are detected using Singular Value Decomposition (SVD); second, the event functions are refined using an iterative method which minimizes the distance (or error) between the estimated and the original parameter sets (Atal, 1983).

Based on the method proposed in (Atal, 1983), event locations are obtained by finding the negative-going zero-crossings of the timing function V(n_c) along the speech utterance, given as

$$V(n_c) = \frac{\sum_{n}(n - n_c)\,\phi^2(n)}{\sum_{n}\phi^2(n)} \qquad (3)$$

where n_c is the index of the central frame of the speech segment, and the sums are computed over the whole segment of speech.
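The zero-crossing logic can be sketched as below. Note that the form of V(n_c) follows our reconstruction of Eq. (3) (a normalized centroid-offset measure), and that in the actual algorithm the event function would be re-estimated via SVD around each candidate n_c; for brevity this illustrative sketch treats φ² as fixed, so the negative-going zero-crossing simply lands at its centroid. The helper names are hypothetical.

```python
import numpy as np

def timing_function(phi_sq, n_c):
    """V(n_c) of Eq. (3) as reconstructed: centroid offset of phi^2
    relative to the candidate central frame n_c (illustrative only)."""
    n = np.arange(len(phi_sq))
    return np.sum((n - n_c) * phi_sq) / np.sum(phi_sq)

def event_locations(phi_sq):
    """Negative-going zero-crossings of V along the segment."""
    V = np.array([timing_function(phi_sq, nc) for nc in range(len(phi_sq))])
    return [nc for nc in range(len(V) - 1) if V[nc] > 0 and V[nc + 1] <= 0]
```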

The refinement stage is carried out by minimizing the mean square error, E, defined by

$$E = \sum_{i=1}^{p}\sum_{n=1}^{N}\left(y_i(n) - \sum_{k=1}^{m} a_{ik}\,\phi_k(n)\right)^{2} \qquad (4)$$

Based on Eq. (4), event functions are refined after eliminating their minor lobes, without significant degradation in the overall performance of the model given in Eq. (2).

2.2. Event Approximation

We have shown earlier that event functions can be approximated by fixed-shape simple functions, at the cost of a minor degradation in the quality of speech synthesized using the approximated spectral parameters (Ghaemmaghami and Deriche, 1996; Ghaemmaghami et al., 1997b). We showed that a suitable shape for events is a Gaussian function with a fixed σ between 30 and 50 ms. Such an approximation amounts to replacing each event function with a Gaussian function located at the event centroid. Therefore, the event refinement step, which is a time-consuming task, can be eliminated, and Eq. (1) changes to

'~" = A ~ (5)

where Ŷ is the matrix of estimated parameters and Φ̂ is the matrix of approximating functions, whose (k, n) element is given by

$$\hat{\phi}_k(n) = \exp\!\left(-(n - n_c)^2 / 2\sigma^2\right) \qquad (6)$$

which is nonzero only in the interval assigned to the corresponding event.
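As an illustration, here is a small sketch of the fixed-σ Gaussian event of Eq. (6). The `support` interval, which truncates the Gaussian to the event's assigned region, is our assumption about how the "nonzero only in the assigned interval" condition might be realized; the frame-domain values assume the 5 ms frame shift used later in Section 4.

```python
import numpy as np

def gaussian_event(n_frames, centroid, sigma_frames, support):
    """Fixed-shape Gaussian event of Eq. (6), truncated to `support`.

    centroid and sigma_frames are in frames; with a 5 ms frame shift,
    a sigma of 30-50 ms corresponds to 6-10 frames.  `support` is the
    (start, end) frame interval assigned to the event (illustrative).
    """
    n = np.arange(n_frames)
    phi = np.exp(-(n - centroid) ** 2 / (2.0 * sigma_frames ** 2))
    mask = (n >= support[0]) & (n < support[1])
    return phi * mask

# An event centred at frame 30 with sigma = 40 ms (8 frames at a 5 ms shift).
phi_k = gaussian_event(n_frames=60, centroid=30, sigma_frames=8, support=(15, 45))
```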

3. Determination of Pitch Contour

In this section, we describe the proposed method for estimating the pitch contour using event functions, extracted from the spectral parameters by TD, as interpolation paths. These events could be the original (refined) event functions or the Gaussian events described in Section 2. In Section 3.1, the algorithm for pitch estimation and interpolation is derived. Section 3.2 gives comments on both pitch detection, at selected frames, and the voicing decision, which is to be made over all frames within the speech segment.

3.1. Pitch Estimation/Interpolation Algorithm

The idea of TD-based interpolation of pitch relies on the correlation between the evolutionary configuration of the vocal tract, revealed by TD of spectral features, and the source characteristics (Schwartz and Roucos, 1983).

Given an N-element vector, v, of pitch values computed from consecutive frames within a segment of voiced speech, we assume that there exists a linear relation between v and the matrix of events, Φ, extracted from the same segment with the same frames using TD. This relation is described as

$$\underline{v} - v_a \approx A_v\Phi \qquad (7)$$

where v_a is a scalar representing the average of v, and A_v is a 1 × m row vector of weightings.

It is clear that for m < N, Eq. (7) does not hold unless there is a specific dependence between v and Φ.


This is indeed the main theme of this paper: to show that such a dependence exists and can be used for interpolating the pitch contour from a minimal subset of pitch values.

We begin by rewriting the p × N matrix of spectral parameters, Y, whose columns, y_1, y_2, ..., y_N, are p-element vectors of the spectral parameters of the speech frames:

$$Y = [\underline{y}_1\ \underline{y}_2\ \cdots\ \underline{y}_N] \qquad (8)$$

We then extract the m × N matrix of events, Φ, from the Y matrix using TD, as described in Section 2. In general, Y may correspond to a speech segment comprising both voiced and unvoiced parts. We compute the pitch values only at frames coinciding with the centroids of voiced events (detected in the first stage of TD) which are far enough from the voiced/unvoiced transitions. This condition is imposed considering the nature of TD, which may produce some spurious events due to nonlinearities in speech at such transitions. Extensive experiments showed that a safe distance from a transition is a little less than half the minimum duration of an event (about 15 ms).

The selected pitch values form a pitch vector, v_1, which is a small subset of the whole pitch vector v:

$$\underline{v} = [v_1\ v_2\ \cdots\ v_N], \quad \underline{v}_1 = [\tilde{v}_1\ \tilde{v}_2\ \cdots\ \tilde{v}_l] \subset \underline{v} \qquad (9)$$

where v_i (i = 1, 2, ..., N) represents the pitch value at frame i if the frame is voiced, and zero if it is unvoiced. Obviously, l ≪ N, because the event rate is typically much smaller than the frame rate (m ≪ N) and v_1 corresponds to a smaller subset of events (the voiced events).

Now, we construct the matrix Φ_1 of size m × l from the matrix Φ by discarding all columns of Φ which do not correspond to v_1. In other words, we retain only the l columns of Φ corresponding to ṽ_1, ṽ_2, ..., ṽ_l. Based on the formulation of Φ_1, we generate the m-element vector A_v1 of weighting factors using

$$\underline{v}_1 - v_{1a} = A_{v1}\Phi_1 \ \Rightarrow\ A_{v1} = (\underline{v}_1 - v_{1a})\,\Phi_1^{\#} \qquad (10)$$

where v_1a is the average value of v_1 and Φ_1^# is the pseudo-inverse of Φ_1 (Golub and Van Loan, 1983).

Given the formulations in (7) and (10), we would expect that for voiced segments, A_v1 ≈ A_v. Therefore, the difference vector of pitch over a voiced segment, v_d, is obtained as:

$$\underline{v}_d = \underline{\hat{v}} - v_a \approx A_{v1}\Phi \ \Rightarrow\ \underline{\hat{v}} = \underline{v}_d + v_a = \underline{v}_d + v_{1a} \qquad (11)$$

where v̂ is the estimated version of the pitch vector v, and v_a and v_1a are as in (7) and (10), respectively, and are assumed to be equal.

In the general case, where we deal with speech segments composed of voiced and unvoiced sounds, the algorithm to obtain the interpolated pitch vector is similar to the above, with the exception that we need to modify the extracted pitch vector using voiced/unvoiced information over the whole segment. Denoting by U the vector representing this information, as

$$U = [u_1\ u_2\ \cdots\ u_N], \quad u_i = \begin{cases}1, & v_i > 0\\ 0, & v_i = 0\end{cases} \quad i = 1, 2, \ldots, N \qquad (12)$$

where v_i is the ith element of the pitch vector v, the interpolated pitch vector is then computed using

$$\hat{v}_i = \left(A_{v1}\,\underline{\phi}_i + v_{1a}\right) u_i, \quad i = 1, 2, \ldots, N \qquad (13)$$

where v̂_i is the ith element of the interpolated pitch vector v̂; A_v1, v_1a, and u_i are defined in Eqs. (10) and (12), respectively; and φ_i is the ith column of Φ.
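The following sketch pulls Eqs. (9)-(13) together, assuming the event matrix Φ, the voiced event centroid frames, the pitch values detected at those frames, and the voicing vector u are already available. It is a minimal illustration rather than the authors' implementation; `numpy.linalg.pinv` stands in for the pseudo-inverse Φ_1^#.

```python
import numpy as np

def interpolate_pitch(Phi, centroid_frames, pitch_at_centroids, voiced):
    """TD-based pitch interpolation, Eqs. (9)-(13) (illustrative sketch).

    Phi                : m x N matrix of event functions from TD
    centroid_frames    : frame indices of the l selected voiced event centroids
    pitch_at_centroids : pitch values measured at those frames (vector v_1)
    voiced             : length-N 0/1 voicing vector u of Eq. (12)
    Returns the length-N interpolated pitch vector v_hat of Eq. (13).
    """
    v1 = np.asarray(pitch_at_centroids, dtype=float)
    v1a = v1.mean()                           # average of the selected pitches
    Phi1 = Phi[:, centroid_frames]            # keep only the l columns (m x l)
    A_v1 = (v1 - v1a) @ np.linalg.pinv(Phi1)  # Eq. (10)
    return (A_v1 @ Phi + v1a) * np.asarray(voiced, dtype=float)  # Eq. (13)
```

In practice, the centroid frames would come from the first stage of TD, and the pitch values at those frames from, e.g., the cepstral detector discussed in Section 3.2.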

3.2. Pitch Detection and Voicing Decision

As mentioned in Section 3.1, we only need to detect pitch at the voiced frames identified as event centroids. Such frames, marked by TD, coincide with voiced sounds located sufficiently far from voiced/unvoiced transitions, as argued earlier. Obviously, this necessitates having information about the voicing status beforehand. For such classification, any reliable voicing identification method could be employed. We used the LPC-10E voicing classification technique discussed in (Campbell and Tremain, 1986) because of its accuracy in voicing detection even without prior pitch detection, as pitch values were unknown at most instants in our selective pitch determination system. This classification method, called VC-10E (LPC-10E Voicing Classifier) here, also needs to be modified slightly for our particular purpose, as described in the following.

For each sub-frame of speech (10 ms), VC-10E uses an M-level adaptive linear discriminant classifier for two-class voicing identification, given as:

$$D_v = \underline{a}_j \cdot \underline{\theta}^{T} + c_j, \quad j = 0, 1, 2, \ldots, M - 1 \qquad (14)$$

where θ represents a set of L parameters computed from the speech (see below), a_j is a weighting vector, c_j is a scalar parameter depending on the SNR, and T denotes transposition.

Both a_j and c_j depend on the level of classification specified by the SNR in the system. At any particular classification level, a sub-frame of speech is classified as voiced if D_v > 0, and unvoiced otherwise. θ is composed of L elements representing the parameters on which the voicing decision is made, defined as

$$\underline{\theta} = [\theta_1\ \theta_2\ \cdots\ \theta_L] \qquad (15)$$

where

1. z91 = E (normalized low-band energy) 2. t~2 = ZC (zero-crossing rate) 3. b9 3 ~- RCI (the first reflection coefficient as the nor-

malized short-term auto-covariance coefficient at unit sample delay)

4. 04 = Qs (relative pre-emphasized high band differ- ence signal energy)

5. 05 = IV RC2 (the second reflection coefficient as the relative degree of spectral peak or Q)

6. 06 = aRb (causal pitch prediction gains) 7. 07 = aR f (noncausal pitch prediction gains)

The weighting coefficients in a_j are determined tentatively using a large number of speech frames (Campbell and Tremain, 1986). For relatively clean speech (SNR > 20 dB), the weighting vector is suggested as

$$\underline{a}_j = [1158\ \ {-108}\ \ 832\ \ {-4096}\ \ {-1018}\ \ 1195\ \ 1011], \quad c_0 = 3462 \qquad (16)$$
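A minimal sketch of the decision in Eqs. (14)-(16) follows. The feature values in θ are illustrative placeholders rather than measurements from real speech, and the two pitch-prediction gains are fixed at 0.7 (the substitution described later in this section).

```python
import numpy as np

# Two-class voicing decision of Eqs. (14)-(16) at one classification level.
# The feature values in theta are illustrative placeholders only.
a_j = np.array([1158, -108, 832, -4096, -1018, 1195, 1011], dtype=float)
c_0 = 3462.0

theta = np.array([
    0.60,   # E  : normalized low-band energy
    0.10,   # ZC : zero-crossing rate
    0.80,   # RC1: first reflection coefficient
    0.05,   # Qs : relative high-band difference signal energy
    0.40,   # RC2: second reflection coefficient
    0.70,   # aRb: causal pitch prediction gain, fixed at 0.7 here
    0.70,   # aRf: noncausal pitch prediction gain, fixed at 0.7 here
])

D_v = a_j @ theta + c_0   # Eq. (14)
is_voiced = D_v > 0       # voiced if D_v > 0, unvoiced otherwise
```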

Once D_v is computed from (14), it is smoothed to reduce its irregularities. This is carried out by a modified median smoother which, for each sub-frame, uses information about two frames in the future, on the basis of the value of D_v. We use this smoother only at the last stage of speech synthesis (to excite the LPC filter), while the raw voicing information is used for identifying voiced events. The rationale behind this choice is to avoid the error incurred by taking unvoiced events as voiced, which would decrease both the efficiency and the reliability of the proposed method.

The last two elements of θ, αRb and αRf, are used for indicating pitch trailing-off and pitch onset, respectively, based on pitch data obtained from the AMDF (Average Magnitude Difference Function) method (Rabiner et al., 1976). These parameters tend to 1 for voiced frames and approach zero for unvoiced frames. However, they are inapplicable in our method; hence, we dismiss them from the set of parameters, θ, and set them equal to 0.7 when computing D_v through Eqs. (14) and (16).

This reduces the accuracy of the voicing decision at transition instants, but the effect is negligible compared to the total error introduced by the interpolation. We could add this information back to our system by detecting pitch at such instants and using it to modify the interpolated pitch contour for better naturalness in the synthesized speech. Nevertheless, we forgo this modification because it directly affects the number of bits required for quantizing the pitch information, which is of crucial importance in very low-rate coding applications.

For pitch detection at the voiced event centroids, any pitch detection algorithm can be employed. Note that such detection is not critical, as we are mostly dealing with quasi-steady signals, from which pitch information is relatively easy to obtain. This is indeed the advantage of our method over frame-based pitch detection methods, which usually encounter many ambiguous situations. We used the cepstral method to detect pitch in our experiments, due to its high accuracy in pitch determination for clean speech and also its advantages in characterizing voicing information via the corresponding amplitude in the cepstral domain (Chung and Schafer, 1990; Hess, 1983).

4. Experiments

4.1. Analysis and Synthesis Models

The spectral parameters, required for both TD and speech synthesis, are extracted from speech on a frame-by-frame basis. Any set of spectral parameters may potentially be used for TD. Based on previous findings (Ghaemmaghami et al., 1997a; Van Dijk-Kappers, 1989), we used six different sets: Cepstrum, Log Area Ratio (LAR), Linear Predictive Coding filter coefficients (LPC), Log Area (LA), Area, and Band Filter (BF) parameters.

Cepstrum represents the cepstral coefficients extracted from speech in the homomorphic model (O'Shaughnessy, 1987). The BF parameters represent the speech spectrum at the output of a Bark-scaled filter bank (Sekey and Hanson, 1984). The other sets are different representations of spectral features in the LPC model (see O'Shaughnessy, 1987; Van Dijk-Kappers, 1989, for descriptions).

Except for the BF set, where 15 parameters per frame are used for speech of bandwidth 0-4 kHz, we take the first 10 cepstral coefficients per frame as the Cepstrum set. To compute the LPC parameters, a 10th-order LPC analysis based on the autocorrelation method is applied to short frames of speech pre-emphasized by a filter characterized as H_p(z) = 1 − 0.95z⁻¹. To synthesize speech for evaluating the pitch data estimated by the proposed method, an LPC synthesis model of order 10 is again used, in which only the pitch data are altered for comparison.

The frame length, frame shift, and segment length used in TD to search for event locations were 40, 5, and 275 ms, respectively (Atal, 1983; Ghaemmaghami and Deriche, 1996).

4.2. Pitch Approximation Error

To assess the performance of the proposed pitch determination method, the relative error (percentage) between the pitch values obtained using our algorithm and those obtained from a frame-based analysis is derived. This error is calculated as

$$e_v = \frac{100}{N}\sum_{i=1}^{N}\frac{|v_i - \hat{v}_i|}{v_i}, \quad \forall\, v_i \neq 0 \qquad (17)$$

where v_i and v̂_i are the traditional and predicted pitch values at frame i, respectively (taking v_i = v̂_i = 0 for unvoiced frames), and N is the total number of frames.
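A small sketch of Eq. (17) under our reading of the scanned formula: unvoiced frames contribute nothing to the sum, and the normalization by the total frame count N is an assumption.

```python
import numpy as np

def pitch_error_percent(v, v_hat):
    """Relative pitch-approximation error of Eq. (17), in percent.

    Unvoiced frames (v_i = 0) are skipped; normalizing by the total
    frame count N follows our reading of the scanned equation.
    """
    v, v_hat = np.asarray(v, dtype=float), np.asarray(v_hat, dtype=float)
    voiced = v != 0
    return 100.0 / len(v) * np.sum(np.abs(v[voiced] - v_hat[voiced]) / v[voiced])
```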

We evaluated the performance of the method in two cases. In the first case, the event functions extracted by conventional TD (the so-called original events) were used to interpolate the pitch contour, while in the second case, we used the approximated Gaussian event functions (Ghaemmaghami and Deriche, 1996).

Figure 1 displays the results obtained for case 1 with the different spectral parameters used in the detection of event functions, where the Cepstrum, LPC, LAR, LA, and BF (band filter, Bark-scaled) parameter sets are used. Figure 2 illustrates similar results using Gaussian event functions (case 2).

The performance of the proposed method in a CVC (consonant-vowel-consonant) section is illustrated in Fig. 3, and its behavior at a sudden change in pitch is shown in Fig. 4.

Table 1. Percentage of error in pitch estimation using different parameter sets.

Parameter set    Original events    Gaussian events
Cepstrum         8.43               4.43
LAR              7.02               5.39
LPC              4.67               3.98
LA               6.32               6.35
Area             24.49              12.2
BF               4.56               5.61
Mean             9.25               6.33


The overall results are summarized in Table 1; they were obtained from the analysis of a number of different speech utterances from the TIMIT database.

4.3. Objective Evaluation of Speech

In addition to the pitch error measurement, we computed the spectral distortion in speech reconstructed using the predicted pitch contour in the LPC synthesis model of order 10. No quantisation was performed on the LPC parameters, in order to focus on the errors introduced by the excitation. For this evaluation, we used a perceptually based spectral distance measure, defined as

$$d_s = \frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{15}\frac{\left|S_i^{o}(k) - S_i^{e}(k)\right|}{S_{av}^{o}} \qquad (18)$$

where i is the frame index, N is the total number of frames, S_av^o is the average power of the original speech over the whole utterance, and S_i^o(k) and S_i^e(k) are the powers of the original and encoded speech signals, respectively, for the ith frame at the kth filter of a 15-band Bark-scaled filter-bank whose center frequencies are obtained using (Sekey and Hanson, 1984):

$$f_v = 600\,\sinh\!\left(\frac{v + 1/2}{6}\right), \quad v = 1, 2, \ldots, 15 \qquad (19)$$
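For reference, Eq. (19) is straightforward to evaluate; the following sketch computes the 15 center frequencies (the symbol f_v follows our reconstruction of the scanned equation).

```python
import numpy as np

# Center frequencies of the 15-band Bark-scaled filter-bank, Eq. (19).
v = np.arange(1, 16)
f_c = 600.0 * np.sinh((v + 0.5) / 6.0)   # in Hz
# f_c[0] ~ 152 Hz ... f_c[14] ~ 3949 Hz, spanning the 0-4 kHz band
```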

Table 2 shows the results of the spectral assessment for the different parameter sets used in event detection. As a reference, we also obtained the distance between the original speech and the speech reconstructed using pitch values computed through the frame-based cepstral pitch detection method. This distance is displayed as d_ref in Table 2.



Figure 1. Relative error in pitch estimation using different spectral parameter sets. Solid: Conventional technique, Dashed: Proposed technique. (a) Cepstrum coefficients, (b) LAR parameters, (c) LPC parameters, (d) LA parameters, (e) BF parameters, (f) Speech waveform.


4.4. Improving Pitch Detection Performance

Figure 3 shows the cepstrum coefficients calculated at consecutive frames for a voiced segment, and the corresponding event, whose location is shown by a vertical line. As indicated, the maximum peak of the cepstrum coincides with the event location, and the degree of voicing, represented by the cepstral peak, decreases gradually as the distance from the event centroid increases.

4.5. Bit-Rate Consideration in Pitch Coding

We examined three methods to encode the pitch information: frame-by-frame coding of absolute pitch values, differential coding (coding of the changes in pitch values), and the proposed interpolative coding. The first method was taken as the reference for assessing the other two techniques.


Table 2. Spectral distance between original speech and speech reconstructed using predicted pitch.

Parameter set    Original events    Gaussian events
Cepstrum         0.0684             0.0656
LAR              0.0667             0.0673
LPC              0.0662             0.0654
LA               0.0684             0.0662
BF               0.0723             0.0657
Mean             0.0682             0.0660

d_ref = 0.0678

The experimental results showed that coding pitch values using the proposed method at a rate of 40 bits/s gave the same accuracy as the reference method at 400 bits/s or the differential method at 100 bits/s. In other words, no degradation was noticed in the reconstructed speech quality using the proposed method as compared to the other two at the mentioned rates.

5. Discussion

Table 1 indicates that, using the proposed method, the error in pitch estimation will not be perceivable, as the actual difference limen (DL) for fundamental frequency (the maximum perceivable deviation from the true value) is on the order of 10-15% (Harris and Umeda, 1987). This result is confirmed by the spectral distances shown in Table 2, where the distances obtained for some parameter sets are even less than d_ref (the distance associated with the frame-based pitch detection method).


Figure 2. As Fig. 1, but with fixed-σ Gaussian events in TD. (a) Cepstrum coefficients, (b) LAR parameters, (c) LPC parameters, (d) LA parameters, (e) BF parameters, (f) Speech waveform.



Figure 3. Event function (dashed curve), its location (centroid), and the cepstrum for successive frames. The maximum cepstral peak coincides with the event centroid.


Figure 4. Conventional (solid) and interpolated (dashed) pitch contours (top), event functions and their locations shown by vertical lines on the speech waveform (middle), and the corresponding residual signal from an LPC model (bottom).


As shown in the tables, satisfactory results are also obtained using the Gaussian event approximation. This stems from the fact that event locations, detected by TD, mostly coincide with phonetically invariant instants (Blumstein and Stevens, 1979), such that event functions can be interpreted as paths between these instants, which often overlap. Although any change in these paths (as introduced by the Gaussian approximation) changes the corresponding spectral parameters, it has only a minor effect on the phonetic content of the speech (Ghaemmaghami and Deriche, 1996). So, an appropriate Gaussian approximation of the paths not only gives an acceptable result, but can even produce better paths than the original events for pitch interpolation (see Tables 1 and 2), because such an approximation conforms to the smooth evolution of pitch as compared to vocal tract changes (Kleijn and Haagen, 1995).

The predictability of the pitch period across a voiced segment can be observed in Fig. 3. This figure shows a voiced segment in a CVC section of speech, along with the corresponding events and the cepstrum of the segment at consecutive frames. The peaks of the cepstrum correspond to the pitch of the voiced frames, all of which would have to be estimated using conventional pitch estimation algorithms. As seen, the amount of pitch information (degree of voicing) decreases gradually as the temporal distance of the observed frame from the event location (marked by the vertical line on the waveform) increases. This means that pitch detection reliability decreases at voiced frames far from the event centroid, which can result in a larger error in frame-by-frame pitch detection than in the proposed technique, where pitch is detected only at event centroids.

The results displayed in Tables 1 and 2 also show the effect of the different parameter sets used in event detection on the performance of the proposed algorithm. The first point concerns the differing results obtained with the two evaluation methods: the pitch error (Table 1) and the spectral distance measurements (Table 2). This difference arises from the error associated with our reference pitch contour, which is detected on a frame-by-frame basis, as well as the error produced by the pitch interpolation. Nevertheless, the LPC parameters gave the best results in all experiments. This comes from the fact that more events are usually extracted from LPC parameters than from most of the other sets used, due to the larger variance of these parameters. This leads to more points for the estimation of the pitch contour and, hence, a curve closer to the true pitch contour.

Cepstrum coefficients also have a relatively large variance, but they are more affected by the excitation signal than the others. As a result, they do not give good results with the original events. This is why a better result is obtained with the cepstrum when smooth (Gaussian) events are used.

6. Conclusion

This paper has presented a new method for predicting the pitch contour through TD-based interpolation. In this method, only the pitch values at voiced event locations (less than 10 points/s) need to be detected and used, leading to a rate of less than 40 bits/s to encode the pitch information. The key feature of the proposed technique is that it searches for pitch values where speech reaches its highest degree of periodicity, at the voiced event centroids, resulting in more reliable pitch determination as compared to conventional frame-by-frame pitch detection. This feature makes the method very robust in noisy environments.

The method can be used in most very low rate coding systems, but it would be most useful in TD-based coders, in which TD is used for compressing the spectral information. In general, however, it has an intrinsic drawback due to the need to perform TD, which is a time-consuming task. Although the Gaussian event approximation simplifies TD considerably (Ghaemmaghami and Deriche, 1996), the complex process of event detection still needs to be performed. Such complexity can be reduced by taking short segments on the basis of voiced/unvoiced information, but the method could still be quite complex as compared to frame-based analysis.

Note

1. Generally, for fixed frame-length systems, the analysis window should be wider than the maximum pitch period expected, and narrower than a critical length within which the signal can be assumed stationary. Accordingly, in speaker-independent coders, the window length is taken in the range of 20-45 ms. For more information, see (Hess, 1983).

References

Ahlbom, G., Bimbot, F., and Chollet, G. (1987). Modeling spectral speech transitions using temporal decomposition techniques. Proc. ICASSP'87, pp. 13-16.

Atal, B.S. (1983). Efficient coding of LPC parameters by temporal decomposition. Proc. ICASSP'83, pp. 81-84.

Bimbot, F. and Atal, B.S. (1991). An evaluation of temporal decomposition. Proc. EUROSPEECH'91, pp. 1089-1092.

Blumstein, S.E. and Stevens, K.N. (1979). Acoustic invariance in speech production: Evidence from measurements of the spectral characteristics of stop consonants. J. Acoust. Soc. Am., 66(4):1001-1017.

Campbell, J.P., Jr. and Tremain, T.E. (1986). Voiced/unvoiced classification of speech with application to the U.S. government LPC-10E algorithm. Proc. ICASSP'86, pp. 473-476.

Childers, D.G. and Wu, K. (1990). Quality of speech produced by analysis-synthesis. Speech Comm., 9:97-117.

Chung, J.H. and Schafer, R.W. (1990). Excitation modeling in a homomorphic vocoder. Proc. ICASSP'90, vol. 2, pp. 25-28.

Ghaemmaghami, S. and Deriche, M. (1996). A new approach to very low-rate speech coding using temporal decomposition. Proc. ICASSP'96, vol. 1, pp. 224-227.

Ghaemmaghami, S., Deriche, M., and Boashash, B. (1997a). Comparative study of different parameters for temporal decomposition based speech coding. Proc. ICASSP'97, vol. 3, pp. 1703-1706.

Ghaemmaghami, S., Deriche, M., and Boashash, B. (1997b). On modeling event functions in temporal decomposition based speech coding. Proc. EUROSPEECH'97, vol. 3, pp. 1299-1302.

Golub, G.H. and Van Loan, C.F. (1983). Matrix Computations. North Oxford Academic.

Gong, Y. and Haton, J. (1987). Time domain harmonic matching pitch estimation using time dependent speech modeling. IEEE Trans. ASSP, ASSP-35(10):1386-1400.

Harris, M.S. and Umeda, N. (1987). Difference limens for fundamental frequency contours in sentences. J. Acoust. Soc. Am., 81(4):1139-1145.

Hess, W.J. (1983). Pitch Determination of Speech Signals: Algorithms and Devices. Springer-Verlag.

Kleijn, W.B. and Haagen, J. (1995). A speech coder based on decomposition of characteristic waveforms. Proc. ICASSP'95, vol. 1, pp. 508-511.

Knagenhjelm, H.P. and Kleijn, W.B. (1995). Spectral dynamics is more important than spectral distortion. Proc. ICASSP'95, vol. 1, pp. 732-735.

Mouy, B., De La Noue, P., and Goudezeune, G. (1995). NATO STANAG 4479: A standard for an 800 BPS vocoder and channel coding in HF-ECCM system. Proc. ICASSP'95, vol. 1, pp. 480-483.

O'Shaughnessy, D. (1987). Speech Communication: Human and Machine. Addison-Wesley Pub. Co.

Rabiner, L.R., Cheng, M.J., Rosenberg, A.E., and McGonegal, C.A. (1976). A comparative performance study of several pitch detection algorithms. IEEE Trans. ASSP, ASSP-24(5):399-418.

Roucos, S., Schwartz, R., and Makhoul, J. (1983). A segment vocoder at 150 bits/s. Proc. ICASSP'83, pp. 61-64.

Schwartz, R.M. and Roucos, S. (1983). A comparison of methods for 300-400 bits/s vocoders. Proc. ICASSP'83, pp. 69-72.

Sekey, A. and Hanson, B.A. (1984). Improved 1-bark bandwidth auditory filter. J. Acoust. Soc. Am., 75(6):1902-1904.

Shiraki, Y. and Honda, M. (1988). LPC speech coding based on variable-length segment quantization. IEEE Trans. ASSP, ASSP-36:1437-1444.

Taori, R., Sluijter, R.J., and Kathmann, E. (1995). Speech compression using pitch synchronous interpolation. Proc. ICASSP'95, vol. 1, pp. 512-515.

Van Dijk-Kappers, A.M.L. (1989). Comparison of parameter sets for temporal decomposition. Speech Comm., 8(3):204-220.

Wilgus, A.M. and Barnwell, T.P. (1983). Data rate reduction of gain and pitch parameters in an LPC vocoder. Proc. ICASSP'83, pp. 77-80.

