
Interpretable parametric voice conversion functions based on Gaussian mixture models and constrained transformations

Daniel Erro a,b,∗, Agustin Alonso a, Luis Serrano a, Eva Navas a, Inma Hernaez a

a Aholab, University of the Basque Country, Bilbao, Spain
b Ikerbasque, Basque Foundation for Science, Bilbao, Spain

Received 14 October 2013; received in revised form 7 February 2014; accepted 7 March 2014

Abstract

Voice conversion functions based on Gaussian mixture models and parametric speech signal representations are opaque in the sense that it is not straightforward to interpret the physical meaning of the conversion parameters. Following the line of recent works based on the frequency warping plus amplitude scaling paradigm, in this article we show that voice conversion functions can be designed according to physically meaningful constraints in such a manner that they become highly informative. The resulting voice conversion method can be used to visualize the differences between source and target voices or styles in terms of formant location in frequency, spectral tilt and amplitude in a number of spectral bands.

© 2014 Elsevier Ltd. All rights reserved.

Keywords: Voice conversion; Gaussian mixture models; Frequency warping; Amplitude scaling; Spectral tilt

1. Introduction

Voice conversion (VC) (Moulines and Sagisaka, 1995) is the technology that allows transforming the voice characteristics of a speaker (the source speaker) into those of another speaker (the target speaker) without altering the linguistic message. The applications of VC include the personalization of artificial speaking devices, the transformation of voices in the movie, music and computer game industries, and the real-time repair of pathological voices.

Among all possible voice characteristics, the timbre, which is closely related to the short-time spectral envelope, has attracted most of the attention of researchers. During the training phase, given a number of speech recordings from the two involved speakers, VC systems extract their corresponding acoustic information and then learn a mapping function to transform the source speaker's acoustic space into that of the target speaker. During the conversion phase, this function is applied to transform new input utterances from the source speaker. Various types of VC techniques have been studied in the literature: vector quantization and mapping codebooks (Abe et al., 1988), more sophisticated solutions based on fuzzy vector quantization (Arslan, 1999), frequency warping transformations (Rentzos et al., 2004; Shuang et al., 2006; Suendermann and Ney, 2003; Valbret et al., 1992), artificial neural networks (Desai et al., 2010; Narendranath et al., 1995), hidden Markov models (Duxans et al., 2004; Lee et al., 2010; Zen et al., 2011), and Gaussian

∗ Corresponding author at: Aholab, University of the Basque Country, Bilbao, Spain. Tel.: +34 946017245.
E-mail address: [email protected] (D. Erro).



mixture model (GMM) based VC (Benisty and Malah, 2011; Helander et al., 2010; Kain, 2001; Stylianou et al., 1998; Toda et al., 2007), which is currently the dominant technique.

Fig. 1. Graphical explanation of FW + AS transformations applied to spectral envelopes (source, target, converted via FW alone, and converted via FW + AS).

Recently, the set of linear transforms characterizing traditional GMM-based VC systems was replaced by a set of frequency warping (FW) plus amplitude scaling (AS) transforms (Erro et al., 2010; Godoy et al., 2012; Tamura et al., 2011; Toda et al., 2001) to improve the quality and naturalness of the converted speech. Unlike the former, FW + AS transformations have a clear physical interpretation (see Fig. 1). FW is a nonlinear operation that maps the frequency axis of the source speaker's spectrum into that of the target speaker. Since it does not remove any detail of the source spectrum but just moves it to a different location in frequency, FW preserves the quality of the converted speech well. However, the conversion accuracy achieved via FW alone is moderate because it does not modify the relative amplitude of meaningful parts of the spectrum. For this reason, FW is complemented with AS to compensate for the differences along the amplitude axis, typically by means of smooth corrective filters.

In the works referenced above, particular signal representations were required for the specific FW + AS methods to be applicable, whereas current trends in speech synthesis are pushing research toward methods that can be applied to well-known parametric representations of speech. That is why it was shown in Zorila et al. (2012) that GMM-based FW + AS methods can be applied to a simple cepstral representation of speech, overcoming the need for specifically designed vocoders. In Erro et al. (2012), the FW functions were constrained to be bilinear (BLFW), which led to a more elegant formulation of FW + AS in the cepstral domain with very few conversion parameters. The performance of BLFW + AS was found to be as good as that of the best existing GMM-based parametric VC methods (Erro et al., 2013b).

Following the line of BLFW + AS and in continuation of our preliminary work (Erro et al., 2013a), this paper goes one step further in making VC functions more understandable and controllable by users while further reducing the number of conversion parameters involved. We suggest imposing constraints on the AS part of the VC function, as was done previously with the FW part. More specifically, we propose a new way of expressing the AS function as a combination of a spectral-tilt-related term and a set of smooth bandpass filters. We will show that the resulting VC functions are very informative in the sense that all their parameters can be interpreted from a physical point of view. Therefore, the method can be applied not only to synthesize high-quality converted voices but also to analyze the differences between the involved voices.

The remainder of the paper is structured as follows. Section 2 contains a brief description of the BLFW + AS VC method. In Section 3 we present the modified method and describe the corresponding automatic training procedures. The performance of this method is experimentally evaluated and discussed in Section 4. Section 5 shows how the proposed method can be used as an analysis and visualization tool. Finally, the conclusions of this work are summarized in Section 6.

2. Overview of BLFW + AS

In the cepstral domain, FW transformations are equivalent to multiplicative matrices (Pitz and Ney, 2005) and AS can be implemented by means of additive cepstral terms (Godoy et al., 2012). Given an input p-dimensional cepstral vector x and a GMM θ, the BLFW + AS operation proposed by Erro et al. (2013b) can be formulated mathematically


as follows:

$$F(\mathbf{x}) = \mathbf{W}_{\alpha(\mathbf{x},\theta)}\,\mathbf{x} + \mathbf{s}(\mathbf{x},\theta) \qquad (1)$$


Fig. 2. Shape of the bilinear frequency warping curve for different values of α. Positive values move formants toward higher frequencies and vice versa. For α = 0 no warping is performed.

where Wα is the matrix that implements the BLFW transform, which was proposed by McDonough and Byrne (1999). Wα can be expressed in terms of a single parameter α, which will be referred to as the warping factor:

$$\mathbf{W}_\alpha = \begin{bmatrix} 1-\alpha^2 & 2\alpha-2\alpha^3 & \cdots \\ -\alpha+\alpha^3 & 1-4\alpha^2+3\alpha^4 & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix} \qquad (2)$$

This parameter controls the shape of the desired FW transformation according to the following curve (see Fig. 2):

$$\omega_\alpha = \tan^{-1}\frac{(1-\alpha^2)\sin\omega}{(1+\alpha^2)\cos\omega - 2\alpha} \qquad (3)$$
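Expression (2) is a closed form, but Wα can also be approximated numerically from the warping curve (3) by resampling the cosine basis that links cepstra to log-spectra. The following Python sketch does this with a least-squares fit; the helper names are ours, not code from the paper, and for small α the top-left entries can be checked against expression (2), e.g. W[0, 0] ≈ 1 − α².

```python
import numpy as np

def bilinear_warp(omega, alpha):
    """Warping curve of expression (3), for omega in [0, pi]."""
    num = (1.0 - alpha ** 2) * np.sin(omega)
    den = (1.0 + alpha ** 2) * np.cos(omega) - 2.0 * alpha
    return np.arctan2(num, den)

def blfw_matrix(alpha, p=24, n_grid=512):
    """Numerical approximation of the BLFW matrix of expression (2).

    Fits the warped log-spectrum back onto the cosine (cepstral) basis;
    the 0th coefficient is excluded, as in the paper.
    """
    omega = np.linspace(0.0, np.pi, n_grid)
    idx = np.arange(1, p + 1)
    C = np.cos(np.outer(omega, idx))            # cepstrum -> log-spectrum
    # Sampling the source spectrum at omega_{-alpha}(omega) moves formants
    # toward higher frequencies for positive alpha, as in Fig. 2.
    Cw = np.cos(np.outer(bilinear_warp(omega, -alpha), idx))
    W, *_ = np.linalg.lstsq(C, Cw, rcond=None)
    return W                                    # y = W @ x warps cepstrum x
```

For instance, blfw_matrix(0.1)[0, 0] should be close to 1 − 0.1² = 0.99.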

In the original BLFW + AS implementation, both the current warping factor α(x, θ) and the AS vector s(x, θ) are obtained by means of a weighted combination of the individual contributions of each Gaussian acoustic class:

$$\alpha(\mathbf{x},\theta) = \sum_{k=1}^{m} p_k^{(\theta)}(\mathbf{x})\,\alpha_k \qquad (4)$$

$$\mathbf{s}(\mathbf{x},\theta) = \sum_{k=1}^{m} p_k^{(\theta)}(\mathbf{x})\,\mathbf{s}_k \qquad (5)$$

where pk(θ)(x) denotes the probability that x belongs to the kth Gaussian mixture of θ and m is the total number of mixtures. The elementary factors and vectors of the transformation, {αk} and {sk}, result from a data-driven training procedure based on error minimization. Given a training set of N source-target vector pairs, {xn} and {yn}, the training process is carried out sequentially in two steps. In the first step, the warping factors {αk} are calculated by minimizing the error between the warped source vectors and the target vectors. An iterative algorithm was proposed by Erro et al. (2012, 2013b) to train all the warping factors {αk} simultaneously while dealing with the strongly nonlinear relationship between α and Wα. Omitting the theoretical fundamentals – interested readers should refer to Erro et al. (2013b) – the algorithm can be summarized as follows:

Step 1 Null initialization: αk = 0 for k = 1, 2, ..., m


Step 2 Create partially warped vectors {zn} using the warping matrix that results from expressions (2) and (4) for the current warping factors:

$$\mathbf{z}_n = \mathbf{W}_{\alpha(\mathbf{x}_n,\theta)}\,\mathbf{x}_n \qquad (6)$$


Step 3 Calculate corrective terms {Δαk} for the warping factors {αk} by solving a least squares problem:

$$\boldsymbol{\Delta}_{m\times 1} = [\,\Delta\alpha_1\ \ldots\ \Delta\alpha_m\,]^T = (\mathbf{D}^T\mathbf{D})^{-1}\mathbf{D}^T\mathbf{e} \qquad (7)$$

where

$$\mathbf{D}_{Np\times m} = \begin{bmatrix} p_1^{(\theta)}(\mathbf{x}_1)\,\mathbf{d}(\mathbf{z}_1) & \cdots & p_m^{(\theta)}(\mathbf{x}_1)\,\mathbf{d}(\mathbf{z}_1) \\ \vdots & \ddots & \vdots \\ p_1^{(\theta)}(\mathbf{x}_N)\,\mathbf{d}(\mathbf{z}_N) & \cdots & p_m^{(\theta)}(\mathbf{x}_N)\,\mathbf{d}(\mathbf{z}_N) \end{bmatrix}, \qquad \mathbf{e}_{Np\times 1} = \begin{bmatrix} \mathbf{y}_1-\mathbf{z}_1 \\ \vdots \\ \mathbf{y}_N-\mathbf{z}_N \end{bmatrix} \qquad (8)$$

and d(·) is an operator that returns a vector whose ith element is

$$\mathbf{d}(\mathbf{z})[i] = (i+1)\cdot\mathbf{z}[i+1] - (i-1)\cdot\mathbf{z}[i-1], \qquad i = 1\ldots p \qquad (9)$$

Step 4 Update the warping factors {αk} by means of the corrective terms {Δαk}:

$$\alpha_k^{(\mathrm{updated})} = \frac{\alpha_k+\Delta\alpha_k}{1+\alpha_k\cdot\Delta\alpha_k} \qquad (10)$$

Step 5 If |Δαk| < 0.001 for all k, exit. Otherwise, go back to Step 2.
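A compact sketch of Steps 1–5 in Python, under the assumption that per-frame GMM posteriors are available (e.g., from scikit-learn's predict_proba) and reusing numpy (np) and the blfw_matrix helper sketched above; the per-frame matrix construction is deliberately naive.

```python
def d_op(z):
    """Operator of expression (9); z holds z[1..p], stored 0-based."""
    p = len(z)
    out = np.empty(p)
    for i in range(1, p + 1):                    # 1-based cepstral index
        z_next = z[i] if i < p else 0.0          # z[i+1], zero beyond order p
        z_prev = z[i - 2] if i >= 2 else 0.0     # z[i-1]; the i = 1 term vanishes
        out[i - 1] = (i + 1) * z_next - (i - 1) * z_prev
    return out

def train_warping_factors(X, Y, post, tol=1e-3, max_iter=50):
    """Steps 1-5: X, Y are (N, p) source/target cepstra, post is (N, m)."""
    N, m = post.shape
    p = X.shape[1]
    alphas = np.zeros(m)                                      # Step 1
    for _ in range(max_iter):
        a_inst = post @ alphas                                # expression (4)
        Z = np.stack([blfw_matrix(a, p) @ x
                      for a, x in zip(a_inst, X)])            # Step 2, (6)
        D = np.vstack([np.outer(d_op(z), w)
                       for z, w in zip(Z, post)])             # (8)
        e = (Y - Z).reshape(-1)                               # (8)
        delta = np.linalg.lstsq(D, e, rcond=None)[0]          # Step 3, (7)
        alphas = (alphas + delta) / (1.0 + alphas * delta)    # Step 4, (10)
        if np.max(np.abs(delta)) < tol:                       # Step 5
            break
    return alphas
```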

Once the warping factors have been determined, residual vectors containing the differences between warped and target vectors are calculated for the N training pairs:

$$\mathbf{r}_n = \mathbf{y}_n - \mathbf{W}_{\alpha(\mathbf{x}_n,\theta)}\,\mathbf{x}_n, \qquad n = 1\ldots N \qquad (11)$$

Then, class-specific scaling vectors {sk} are trained in such a manner that these differences are maximally compensated. The least squares training algorithm proposed by Erro et al. (2012, 2013b) is the following:

$$\mathbf{S}_{m\times p} = [\,\mathbf{s}_1\ \mathbf{s}_2\ \ldots\ \mathbf{s}_m\,]^T = \arg\min_{\mathbf{S}}\|\mathbf{R} - \mathbf{P}\mathbf{S}\|^2 = (\mathbf{P}^T\mathbf{P})^{-1}\mathbf{P}^T\mathbf{R} \qquad (12)$$

where p is the vector dimension (equal to the cepstral order) and P and R are given by

$$\mathbf{P}_{N\times m} = \begin{bmatrix} p_1^{(\theta)}(\mathbf{x}_1) & p_2^{(\theta)}(\mathbf{x}_1) & \cdots & p_m^{(\theta)}(\mathbf{x}_1) \\ p_1^{(\theta)}(\mathbf{x}_2) & p_2^{(\theta)}(\mathbf{x}_2) & \cdots & p_m^{(\theta)}(\mathbf{x}_2) \\ \vdots & \vdots & \ddots & \vdots \\ p_1^{(\theta)}(\mathbf{x}_N) & p_2^{(\theta)}(\mathbf{x}_N) & \cdots & p_m^{(\theta)}(\mathbf{x}_N) \end{bmatrix}, \qquad \mathbf{R}_{N\times p} = \begin{bmatrix} \mathbf{r}_1^T \\ \mathbf{r}_2^T \\ \vdots \\ \mathbf{r}_N^T \end{bmatrix} \qquad (13)$$
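Expressions (11)–(13) amount to a single linear least-squares solve. A minimal sketch, reusing the helpers introduced above:

```python
def train_scaling_vectors(X, Y, post, alphas):
    """AS vectors {s_k} via expressions (11)-(13); returns S of shape (m, p)."""
    p = X.shape[1]
    a_inst = post @ alphas
    R = np.stack([y - blfw_matrix(a, p) @ x
                  for y, a, x in zip(Y, a_inst, X)])   # residuals, (11)
    # argmin_S ||R - P S||^2 with P = post, i.e. expressions (12)-(13)
    S = np.linalg.lstsq(post, R, rcond=None)[0]
    return S
```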

Although the resulting vectors {sk} are optimal from a mathematical point of view, they are not informative in the sense that it is not straightforward to understand the information they convey. By contrast, the meaning of the warping factors {αk} is clear: positive values of α mean higher formant frequencies (α ≈ 0.1 for male-to-female conversion) and negative values mean lower formant frequencies. This informative transparency is partially due to the BL constraints imposed on the FW operation, which simplify the warping curves and make them dependent on a single meaningful parameter. At the same time, this helps increase the robustness of the system. Inspired by this idea, in this paper we propose to constrain the AS vectors as well in order to obtain a new type of transformation which can be understood, intuitively manipulated and even used to analyze the differences between the source voice and the target voice.

3. BLFW + constrained AS


Assuming that BLFW is sufficiently accurate, the aim of AS is not to make new formants appear in the converted spectrum, but simply to correct the relative intensity of the already existing ones after FW. Consequently, the spectral response of the AS filter represented by the elementary vectors {sk} should be smooth by definition. In the original BLFW + AS system (Erro et al., 2013b), this aspect was not taken into account explicitly. Smoothness was guaranteed


Fig. 3. Spectral shapes involved in constrained AS: tilt-related term (1 dB/decade slope) and 9 Hanning-like bandpass filters, all reconstructed from a 24th-order Mel-cepstral representation.

simply by using a GMM with few acoustic classes, because intra-class averaging prevented sharp peaks in {rn} from being transferred to {sk}. Additionally, BLFW + AS did not take into account that a significant portion of the additive terms {sk} might be directly related to the spectral tilt differences between the two involved speakers. In our new proposal we force the AS frequency response to be formed by a weighted combination of a spectral tilt related term with 1 dB/decade slope and B smooth Hanning-like bands equally spaced in the Mel frequency scale. The AS elementary vectors {sk} are now forced to be the result of the following combination:

$$\mathbf{s}_k = \tau_k\,\mathbf{t} + \sum_{j=1}^{B}\beta_{k,j}\,\mathbf{b}_j, \qquad k = 1\ldots m \qquad (14)$$

where t and {bj} are p-dimensional column vectors containing the cepstral representations1 of the tilt-related term and the B bands, respectively. These representations are constant and can be calculated numerically. This means that the shape of sk depends exclusively on τk and {βk,j}. Fig. 3 shows the spectral shapes that correspond to t and {bj}, all reconstructed from their 24th-order Mel-cepstral parameterization (p = 24) for fs = 16 kHz and B = 9.
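The paper states that t and {bj} can be calculated numerically but does not spell out the construction, so the following sketch makes explicit assumptions: a 2595·log10(1 + f/700) Mel axis, bands centered at B equally spaced Mel points with neighboring bands fully overlapping, and a least-squares fit of each dB-domain shape onto the cosine basis (the excluded 0th coefficient absorbs any constant offset).

```python
import numpy as np

def cepstral_shapes(fs=16000, p=24, B=9, n_grid=512):
    """Cepstral vectors t and {b_j} of expression (14), fitted numerically."""
    f = np.linspace(0.0, fs / 2.0, n_grid)
    mel = 2595.0 * np.log10(1.0 + f / 700.0)
    w = np.pi * mel / mel[-1]                    # Mel-warped axis in [0, pi]
    idx = np.arange(1, p + 1)
    C = np.cos(np.outer(w, idx))                 # Mel-cepstrum -> log-spectrum
    db = np.log(10.0) / 20.0                     # 1 dB in natural-log amplitude
    # Tilt term: 1 dB/decade slope over linear frequency (1 kHz reference,
    # an assumed anchor point).
    tilt = db * np.log10(np.maximum(f, f[1]) / 1000.0)
    # B Hanning-like bands equally spaced on the Mel axis (spacing assumed).
    centers = np.linspace(0.0, np.pi, B + 2)[1:-1]
    width = centers[1] - centers[0]
    bands = [db * np.where(np.abs(w - c) < width,
                           0.5 * (1.0 + np.cos(np.pi * (w - c) / width)),
                           0.0)
             for c in centers]
    fit = lambda shape: np.linalg.lstsq(C, shape, rcond=None)[0]
    return fit(tilt), np.stack([fit(b) for b in bands])   # t: (p,), b: (B, p)
```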

Thus, the value of τk coincides with the slope (dB/decade) by which the two involved voices differ at the kth acoustic class of model θ, whereas {βk,j} represent the exact amplitude (dB) of the complementary smooth amplitude correction envelope at equally spaced points in the Mel-frequency scale. Remarkably, the dimensionality of the resulting voice conversion function is reduced significantly because each amplitude scaling vector sk is now given by 1 + B weights. To prevent the tilt-related term from being diluted within the corrective bands, we propose the following two-step weight optimization during training. First we simultaneously optimize the weights of all the tilt-related terms by solving the corresponding least squares system:

$$\mathbf{T}_{m\times 1} = [\,\tau_1\ \tau_2\ \ldots\ \tau_m\,]^T = \arg\min_{\mathbf{T}}\|\boldsymbol{\Gamma} - \mathbf{Q}\mathbf{T}\|^2 = (\mathbf{Q}^T\mathbf{Q})^{-1}\mathbf{Q}^T\boldsymbol{\Gamma} \qquad (15)$$

where

$$\mathbf{Q}_{Np\times m} = \mathbf{P} \otimes \mathbf{t}, \qquad \boldsymbol{\Gamma}_{Np\times 1} = [\,\mathbf{r}_1^T\ \mathbf{r}_2^T\ \ldots\ \mathbf{r}_N^T\,]^T \qquad (16)$$

P and {rn} are given by expressions (13) and (11), respectively, and ⊗ denotes the Kronecker product. Next, we update the residuals by subtracting the tilt-related terms and we determine the weights of the corrective bands for all the classes by solving another least squares system. This can be expressed mathematically as follows:

$$\mathbf{B}_{mB\times 1} = [\,\beta_{1,1}\ldots\beta_{1,B}\ \ \beta_{2,1}\ldots\beta_{2,B}\ \ \ldots\ \ \beta_{m,1}\ldots\beta_{m,B}\,]^T = \arg\min_{\mathbf{B}}\|\tilde{\boldsymbol{\Gamma}} - \tilde{\mathbf{Q}}\mathbf{B}\|^2 = (\tilde{\mathbf{Q}}^T\tilde{\mathbf{Q}})^{-1}\tilde{\mathbf{Q}}^T\tilde{\boldsymbol{\Gamma}} \qquad (17)$$


1 The design of the method is strongly linked to the properties of cepstral parameterization. Other speech representations such as line spectral frequencies are not suitable because they cannot be combined as in expression (14) and they require a more careful definition of the error to be minimized during training.


where

$$\tilde{\mathbf{Q}}_{Np\times mB} = \mathbf{P} \otimes [\,\mathbf{b}_1\ \mathbf{b}_2\ \ldots\ \mathbf{b}_B\,], \qquad \tilde{\boldsymbol{\Gamma}}_{Np\times 1} = \boldsymbol{\Gamma} - \mathbf{Q}\mathbf{T} \qquad (18)$$

T, Q and Γ are given by expressions (15) and (16). Although these operations involve the inversion of large matrices, the efficiency of the computation can be increased by exploiting the properties of the Kronecker product.
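In code, the two-step optimization of expressions (15)–(18) is again two least-squares solves. The sketch below builds the Kronecker structure explicitly for clarity; a memory-friendlier implementation would exploit that structure instead of materializing Q.

```python
def train_constrained_scaling(R, post, t, b):
    """Two-step weights of expressions (15)-(18).
    R: (N, p) residuals; post: (N, m) posteriors; t: (p,); b: (B, p)."""
    N, m = post.shape
    gamma = R.reshape(-1)                              # Gamma, (16)
    Q = np.kron(post, t.reshape(-1, 1))                # (Np, m), (16)
    tau = np.linalg.lstsq(Q, gamma, rcond=None)[0]     # (15)
    gamma_b = gamma - Q @ tau                          # tilt removed, (18)
    Qb = np.kron(post, b.T)                            # (Np, mB), (18)
    beta = np.linalg.lstsq(Qb, gamma_b, rcond=None)[0] # (17)
    return tau, beta.reshape(m, -1)                    # beta[k, j] = beta_{k,j}
```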

4. Experiments and discussion

4.1. The database

The speech data used in the evaluation experiments were taken from the CMU ARCTIC database (Kominek and Black, 2003). Four US English speakers with the same dialectal characteristics were selected from this database: two female speakers, slt and clb, and two male speakers, bdl and rms. 50 parallel sentences per speaker were randomly selected for training and a different set of 50 sentences was set aside for testing purposes. The remaining sentences of the database were simply discarded. The sampling frequency was 16 kHz. We used the vocoder described in Erro et al. (2011) to parameterize the speech signals into Mel-cepstral coefficients and to reconstruct the waveforms from the converted vectors. The order of the Mel-cepstral analysis was 24 (plus the 0th coefficient containing the energy, which does not take part in the conversion). The frame shift was set to 8 ms. During conversion, the mean and variance of the source speaker's log f0 distribution were replaced by those of the target speaker by means of a linear transformation. In order to find the correspondence between the source and target cepstral vectors extracted from the parallel training utterances, we calculated a piecewise linear time warping function from the phoneme boundaries given by the available segmentation. As in Erro et al. (2012, 2013b), the GMMs used in our experiments had 32 mixtures with full-covariance matrices.
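As a concrete illustration of this setup, a GMM like the one described here can be fitted with scikit-learn; X_source below is a hypothetical placeholder for the matrix of source Mel-cepstral vectors, one 24-dimensional row per frame.

```python
from sklearn.mixture import GaussianMixture

# 32 full-covariance mixtures over the source speaker's Mel-cepstra,
# as in Section 4.1 (X_source: assumed array of shape (n_frames, 24)).
gmm = GaussianMixture(n_components=32, covariance_type="full",
                      random_state=0).fit(X_source)
posteriors = gmm.predict_proba(X_source)   # the p_k(x) of expressions (4)-(5)
```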

4.2. Objective evaluation

The first aspect to be evaluated is how much the conversion performance of the VC system is degraded with respect to the baseline case – BLFW + AS – when the aforementioned simplifications are made. Although objective measures such as Mel-cepstral distortion (MCD) between converted and target vectors are not always correlated with subjective measures – for instance, it is known that alleviating the oversmoothing effect of standard GMM-based VC results in worse MCD scores and better subjective scores at the same time (Benisty and Malah, 2011; Godoy et al., 2012) – they can still help to determine the best configuration or dimensions of a given method. In our case, given the similar nature of BLFW + AS and its AS-constrained version (from now on, BLFW + CAS), MCD is a reliable measure of how accurate the conversion will be for different numbers of AS bands, B. In this experiment, VC functions were trained for all voice pairs using the training dataset; then, MCD scores were calculated using the test material. For a test dataset containing N paired vectors {xn} and {yn}, the MCD score achieved by a given VC function F(·) is computed as2:

$$\mathrm{MCD}\{F(\cdot)\} = \frac{10}{\log 10}\cdot\frac{1}{N}\sum_{n=1}^{N}\|\mathbf{y}_n - F(\mathbf{x}_n)\| \qquad (19)$$
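A direct transcription of expression (19), where F is any conversion function mapping a source vector to a converted one (the optional factor of 2 from footnote 2 is omitted):

```python
def mcd(F, X, Y):
    """Mel-cepstral distortion of expression (19)."""
    return (10.0 / np.log(10.0)) * np.mean(
        [np.linalg.norm(y - F(x)) for x, y in zip(X, Y)])
```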

Fig. 4 shows the average MCD scores achieved by BLFW + CAS for B = {9, 14, 19}. Two more methods are included for comparison: BLFW + AS and the standard GMM-based method based on joint statistical modeling of source-target acoustic pairs (Kain, 2001) (from now on, JGMM). Given the inherent particularities of BLFW-based methods, we compute average scores not only for all possible combinations of voices but also separately for intra-gender and cross-gender combinations. JGMM performs better than the other methods in terms of MCD, which is consistent with the findings of previous works (Erro et al., 2013b). Regarding BLFW + CAS, we can observe that imposing constraints on


the AS terms reduces the accuracy of the method. Specifically, the use of 9, 14 and 19 AS bands produces an average MCD increment of 4.9%, 2.2% and 0.6%, respectively. The MCD scores achieved by BLFW + CAS get closer to those of BLFW + AS as the number of unknowns of their respective VC functions becomes similar.

2 An additional multiplicative factor 2 may be needed in expression (19) depending on how cepstral coefficients are defined.


Fig. 4. Global and gender-dependent average MCD scores with 95% confidence intervals for different configurations of BLFW + CAS. Baseline methods: BLFW + AS and JGMM. The numbers above the bars indicate the dimension of the VC function.

Cross-gender VC is more sensitive than intra-gender VC to the reduction of B. The main reason is that CAS is less capable than AS of absorbing the inaccuracies of BLFW, which are logically larger when the involved voices exhibit very different vocal tract lengths.

For a more precise objective characterization of the methods, we also computed the quotient between the variance of the converted parameters and that of the natural target parameters. Defining vari{·} as the variance of the ith element of its input vector set, the variance quotient (VarQ) measure we have used can be expressed as:

$$\mathrm{VarQ}\{F(\cdot)\} = \frac{1}{p}\sum_{i=1}^{p}\frac{\mathrm{var}_i\{\{F(\mathbf{x}_n)\}_{n=1\ldots N}\}}{\mathrm{var}_i\{\{\mathbf{y}_n\}_{n=1\ldots N}\}} \qquad (20)$$
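Expression (20) is equally simple to transcribe; in this sketch, values near 1 indicate that the converted parameters are as variable as the natural target ones:

```python
def varq(F, X, Y):
    """Variance quotient of expression (20), averaged over the p dimensions."""
    FX = np.stack([F(x) for x in X])
    return np.mean(np.var(FX, axis=0) / np.var(Y, axis=0))
```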

Variance measures have been shown to be good indicators of the degree of oversmoothing (Benisty and Malah, 2011; Godoy et al., 2012; Toda et al., 2007), and the BLFW + AS method was already shown to produce good variance-related scores even without explicit modeling of the variance of the converted parameters (Erro et al., 2013b). In this experiment, VarQ scores were calculated for all voice pairs and a single average score was calculated for each VC method and configuration. As shown in Table 1, the variance of the converted parameters is closer to natural for low values of B. Indeed, low B means smoother AS terms, which better preserve the variance of the natural source parameters. As reported in previous works (Erro et al., 2013b; Godoy et al., 2012; Toda et al., 2007), JGMM achieves quite poor scores for this specific type of measure due to the oversmoothing effect.

4.3. Robustness against data scarcity

One of the main advantages of reducing the dimension of the VC function is that it can be estimated more reliably from a given amount of training data. BLFW + AS was already shown to be more robust than other GMM-based methods (Erro et al., 2013b). The next experiment aims at confirming that BLFW + CAS (B = 9) is even more robust than BLFW + AS when the amount of training data is progressively reduced. Assuming a fixed GMM learned from the whole training dataset of the source speaker (50 utterances, as specified in Section 4.1) – a reasonable assumption, as discussed in Erro et al. (2013b) – we trained VC functions after reducing the available amount of parallel training


data (initially 50 parallel utterances) by factors M = {2, 4, 8, 16}. For each value of M, the parallel training dataset was split into M subsets, each one yielding a different VC function; then, their M corresponding MCD scores were averaged to get a single score per data reduction factor. Finally, the factor-dependent scores were also averaged over all possible voice pairs.

Table 1
Average VarQ scores for different VC methods.

Method:  JGMM    BLFW + CAS (B = 9)    BLFW + CAS (B = 14)    BLFW + CAS (B = 19)    BLFW + AS
VarQ:    0.36    0.93                  0.91                   0.89                   0.88


Fig. 5. MCD scores for different training data reduction factors. Discontinuous lines mean that correct training is not guaranteed due to insufficient data and numerical issues.

Remarkably, for M = 8 and M = 16 some of these partial MCD scores had to be discarded because the VC function had been badly estimated due to insufficient data. Both BLFW + AS and BLFW + CAS were equally prone to this phenomenon.

The results shown in Fig. 5 reveal that the MCD scores produced by BLFW + CAS grow less rapidly than those produced by BLFW + AS when the amount of parallel training data is reduced. This means that, in principle, BLFW + CAS is more robust against parallel training data scarcity. Unfortunately, for the amounts of data that allow a correct (numerical-error-free) training of the VC function, the MCD scores achieved by BLFW + CAS always remain higher (worse) than the baseline scores. Taking into account that numerical errors start to appear under the same training conditions for both methods, we can conclude that robustness is not a practical advantage of the constrained method unless the observed MCD gap is meaningless from a perceptual point of view.

4.4. Subjective evaluation

The next logical step is to quantify the loss of conversion accuracy of BLFW + CAS by means of a subjective test. To do this we conducted a perceptual test in which 20 volunteer listeners with good English-speaking skills (including 6 speech processing experts) were asked to listen to several converted-target pairs and rate two aspects: (i) the similarity between voices and (ii) the quality of the converted ones, both on a 1-to-5 scale (1 = "very bad", ..., 5 = "excellent"). The test was carried out via a web interface and listeners were asked to use headphones. The natural signals used as reference were analyzed and reconstructed by means of the same vocoder we used for VC in order to focus the attention of the evaluators on the methods themselves. Given the subtle differences between BLFW + CAS configurations and the annoyance of listeners when asked to rate very similar things, we evaluated only the most basic configuration of BLFW + CAS, namely that with B = 9 (although fewer bands can be used, this is a representative operating point with a good trade-off between dimension, objective accuracy and informative transparency). Again, we compared it with BLFW + AS and JGMM. According to Fig. 4, it is reasonable to assume that BLFW + CAS is equivalent to BLFW + AS when B is high enough.

The resulting mean opinion scores (MOSs) are shown in Fig. 6. Once again, since the relative scores of JGMM and BLFW + AS are consistent with those reported previously (Erro et al., 2012, 2013b), we focus our attention on BLFW + CAS. For this configuration, subjective scores confirm that the converted voices yielded by BLFW + CAS are not as similar to the target as those converted by BLFW + AS or JGMM. This means that the effect of reducing B is not negligible from a perceptual point of view. The quality scores, however, are similar for both BLFW-based methods,


far beyond the performance of JGMM. In short, we can affirm that the AS constraints have no negative impact on the subjective quality measures, and the similarity between converted and target voices can be effectively controlled through the number of AS bands, B, which also determines the number of parameters of the VC function.


Fig. 6. Results of the perceptual test: MOS and 95% confidence intervals for both similarity between converted and target voices and quality of the converted utterances.

Given all the experiments conducted in this section, we can conclude that in terms of performance BLFW + CAS has no particularly relevant advantage with respect to its predecessor, BLFW + AS. Thus, we can consider that the main advantage of BLFW + CAS is the fact that it can be used as an analysis tool which helps rapidly visualize the differences between source and target voices. The next section will illustrate this interesting property of the method.

5. BLFW + CAS as a speech analysis tool

This section studies the information provided by BLFW + CAS-based VC functions for three different types of data:


(i) several voices chosen from the CMU ARCTIC database; (ii) neutral and emotional speech from a single female speaker; (iii) normal and Lombard speech from a single male speaker. Given the performance limitations of the method when B = 9, in this section we set B = 14, which gives a good balance between conversion accuracy and suitability for graphical display of information.

Fig. 7. Evolution of BLFW + CAS parameters over time for two different CMU ARCTIC voice pairs: slt-bdl (1st column) and slt-clb (2nd column). (a) Source spectral envelope given by the MCEP coeffs.; (b) instantaneous BLFW factor α; (c) instantaneous spectral slope τ; (d) instantaneous amplitude of 14 Hanning-like bands {βj}.


Fig. 8. Evolution of BLFW + CAS parameters over time for neutral-angry (1st column) and neutral-happy (2nd column) spectral conversion. (a) Source spectral envelope given by the MCEP coeffs.; (b) instantaneous BLFW factor α; (c) instantaneous spectral slope τ; (d) instantaneous amplitude of 14 Hanning-like bands {βj}.


5.1. ARCTIC voices

Fig. 7 shows the evolution of the parameters of a BLFW + CAS based VC function trained for two different voice pairs, slt-bdl (female-male) and slt-clb (female-female), using the same experimental setup as in Section 4.1. Instantaneous values of the BLFW factor α are obtained through expression (4). For the remaining parameters, we get instantaneous values by combining expressions (5) and (14):

$$\tau(\mathbf{x},\theta) = \sum_{k=1}^{m} p_k^{(\theta)}(\mathbf{x})\,\tau_k, \qquad \beta_j(\mathbf{x},\theta) = \sum_{k=1}^{m} p_k^{(\theta)}(\mathbf{x})\,\beta_{k,j} \qquad (21)$$
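For plots like those in Figs. 7–9, the frame-wise trajectories follow directly from the GMM posteriors; a sketch assuming a fitted scikit-learn GaussianMixture and the per-class parameters trained in the previous sections:

```python
def instantaneous_params(gmm, X, alphas, taus, betas):
    """Frame-wise alpha, tau and beta_j trajectories, expressions (4) and (21).
    alphas: (m,), taus: (m,), betas: (m, B); X: (N, p) source cepstra."""
    post = gmm.predict_proba(X)        # (N, m) posteriors p_k(x_n)
    return post @ alphas, post @ taus, post @ betas
```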

Regarding the evolution of α, the differences between the two VC pairs are obvious: while conversion toward bdl requires a strongly negative factor, which means that the formants are located at lower frequencies, the factor required by clb is much closer to zero (slightly negative), which reveals that clb is quite similar to slt in terms of formant location. Unvoiced frames are not of interest here because, strictly speaking, they have no formants. Hence, abrupt α jumps are observed at voiced-unvoiced transitions for bdl. The spectral tilt related parameter of BLFW + CAS, referred to as τ in Section 3, is equally informative. While bdl requires the addition of a positive spectral slope, which denotes a more pressed phonation, clb requires a negative one, thus exhibiting a smoother phonation than slt. These observations correlate well with our initial expectations based on previous informal analyses by listening. Finally,


the fine details of the conversion are captured by the relative amplitudes of the 14 Hanning-like spectral bands. In this case, the variability of the values required by bdl reveals that the differences between slt and bdl are hardly captured by a simple warping + tilt model; specific corrections are needed at specific frequencies. By contrast, in the clb case the


Fig. 9. Evolution of BLFW + CAS parameters over time for normal-Lombard spectral conversion. (a) Source spectral envelope given by the MCEP coeffs.; (b) instantaneous BLFW factor α; (c) instantaneous spectral slope τ; (d) instantaneous amplitude of 14 Hanning-like bands {βj}.

amplitudes of the corrective bands are more uniform both in time and in frequency, which means that a warping + tilt model can capture most of the differences between slt and clb. This is probably the reason why slt-clb conversion was usually more accurate than other pairs in previously published works (Erro et al., 2013b; Godoy et al., 2012).

5.2. Neutral vs. emotional speech

For this experiment we selected 30 parallel sentences from one female emotional voice in the database reported by Sainz et al. (2012). Using dynamic time warping for source-target alignment, we trained BLFW + CAS functions with 32 Gaussian mixtures to convert neutral speech into all the available emotions: anger, happiness and sadness. However, we found the sad style to be spectrally similar to the neutral style; therefore, only the two remaining cases are plotted in Fig. 8. For easier visualization, we plot the transformation of a single word taken from a longer utterance. Once again, the irregularities observed at unvoiced segments can be excluded from the discussion since they are not significant from a physical point of view but just the result of a numerical optimization procedure. Given that the speaker is the same for all emotions, the BLFW factor α is not far from zero (null warping), the happy style exhibiting slightly positive values and the angry style negative ones at some segments. Thus, the BLFW seems to be somewhat correlated with the valence of the emotion, whereas the spectral tilt increment needed to transform neutral speech into both angry and happy speech is linked to arousal. Of course, prosody is out of the scope of this tool even though it plays a crucial role in rendering emotions. Finally, according to the amplitudes of the AS bands, there seem to be some differences between anger and happiness at high Mel frequencies (mid-high linear frequencies). Both emotions show a similar contrastive pattern there, its magnitude being larger for anger. Admittedly, this can be a consequence of


having used too simple a model for the spectral tilt. A more sophisticated model inspired by speech production theories and spectral models of the glottal source may result in a better visual analysis.


5.3. Normal vs. Lombard speech

In this last example, BLFW + CAS VC is used to illustrate the spectral changes related to the Lombard effect. Lombard speech is the type of speech humans produce in noisy environments in order to preserve intelligibility. We took 50 normal and Lombard utterances spoken by the same UK English speaker from the database described by Cooke et al. (2013). The parameters of the resulting VC function are shown in Fig. 9. We can observe that there is no particular shift of the formant structure, while the spectral tilt is substantially increased. Apart from this, the AS bands reveal that there is also a visible amplification of the 12th band, as happened in neutral-angry and neutral-happy conversion too. This coincidence, found for different voices, seems to confirm that a more sophisticated model of the spectral tilt should be considered in future versions of the method in order to mimic the enhancement of mid-high frequencies more properly.

6. Conclusions

We have shown that GMM-based voice conversion functions operating in the cepstral domain can be designed in such a manner that relevant information about the physical differences between voices or styles – relative location of the formants in frequency, relative spectral tilt and relative amplitude in specific frequency bands – becomes accessible. Therefore, the resulting system can be used as an automatic analysis and visualization tool. The new conversion method achieves good quality scores in comparison with the baseline systems, and the trade-off between conversion accuracy and dimensionality of the conversion function can be controlled by means of one of its parameters, namely the number of amplitude correction bands. In spite of the good overall performance of the method, our experiments suggest that it can still be improved by means of a more sophisticated modeling of the spectral tilt. In any case, even without any further improvement, we have shown that the proposed method is useful to analyze the differences between voices or styles whenever some parallel utterances are available for training. As a last remark, the tool has been tested using speech synthesis databases only; the use of noisy or reverberant material could produce unexpected results, though this needs to be confirmed in future work.

Acknowledgements

This work has been partially supported by the Spanish Ministry of Economy and Competitiveness (SpeechTech4All project, TEC2012-38939-C03-03), the Basque Government (Ber2tek project, IE12-333), and Euroregion Aquitaine-Euskadi (Iparrahotsa project, 2012-004).

References

Abe, M., Nakamura, S., Shikano, K., Kuwabara, H., 1988. Voice conversion through vector quantization. In: Proc. ICASSP, pp. 655–658.
Arslan, L.M., 1999. Speaker transformation algorithm using segmental codebooks (STASC). Speech Communication 28, 211–226.
Benisty, H., Malah, D., 2011. Voice conversion using GMM with enhanced global variance. In: Proc. Interspeech, pp. 669–672.
Cooke, M., Mayo, C., Valentini-Botinhao, C., Stylianou, Y., Sauert, B., Tang, Y., 2013. Evaluating the intelligibility benefit of speech modifications in known noise conditions. Speech Communication 55, 572–585.
Desai, S., Black, A.W., Yegnanarayana, B., Prahallad, K., 2010. Spectral mapping using artificial neural networks for voice conversion. IEEE Transactions on Audio, Speech and Language Processing 18, 954–964.
Duxans, H., Bonafonte, A., Kain, A., Van Santen, J., 2004. Including dynamic and phonetic information in voice conversion systems. In: Proc. ICSLP, pp. 1193–1196.
Erro, D., Moreno, A., Bonafonte, A., 2010. Voice conversion based on weighted frequency warping. IEEE Transactions on Audio, Speech and Language Processing 18, 922–931.
Erro, D., Sainz, I., Navas, E., Hernaez, I., 2011. HNM-based MFCC + F0 extractor applied to statistical speech synthesis. In: Proc. ICASSP, pp. 4728–4731.
Erro, D., Navas, E., Hernaez, I., 2012. Iterative MMSE estimation of vocal tract length normalization factors for voice transformation. In: Proc. Interspeech, pp. 86–89.
Erro, D., Alonso, A., Serrano, L., Navas, E., Hernaez, I., 2013a. Towards physically interpretable parametric voice conversion functions. Lecture Notes in Artificial Intelligence 7911, 75–82.
Erro, D., Navas, E., Hernaez, I., 2013b. Parametric voice conversion based on bilinear frequency warping plus amplitude scaling. IEEE Transactions on Audio, Speech and Language Processing 21, 556–566.


Godoy, E., Rosec, O., Chonavel, T., 2012. Voice conversion using dynamic frequency warping with amplitude scaling, for parallel or nonparallel corpora. IEEE Transactions on Audio, Speech and Language Processing 20, 1313–1323.
Helander, E., Virtanen, T., Nurminen, J., Gabbouj, M., 2010. Voice conversion using partial least squares regression. IEEE Transactions on Audio, Speech and Language Processing 18, 912–921.
Kain, A., 2001. High Resolution Voice Transformation. Oregon Health and Science University, Portland, Oregon, USA.
Kominek, J., Black, A.W., 2003. CMU Arctic Databases for Speech Synthesis.
Lee, C.H., Wu, C.H., Guo, J.C., 2010. Pronunciation variation generation for spontaneous speech synthesis using state-based voice transformation. In: Proc. ICASSP, pp. 4826–4829.
McDonough, J., Byrne, W., 1999. Speaker adaptation with all-pass transforms. In: Proc. ICASSP, pp. 757–760.
Moulines, E., Sagisaka, Y., 1995. Voice conversion: state of the art and perspectives. Speech Communication 16, 125–126.
Narendranath, M., Murthy, H.A., Rajendran, S., Yegnanarayana, B., 1995. Transformation of formants for voice conversion using artificial neural networks. Speech Communication 16, 207–216.
Pitz, M., Ney, H., 2005. Vocal tract normalization equals linear transformation in cepstral space. IEEE Transactions on Speech and Audio Processing 13, 930–944.
Rentzos, D., Vaseghi, S., Yan, Q., Ho, C.H., 2004. Voice conversion through transformation of spectral and intonation features. In: Proc. ICASSP, pp. 21–24.
Sainz, I., Erro, D., Navas, E., Hernaez, I., Sanchez, J., Saratxaga, I., Odriozola, I., 2012. Versatile speech databases for high quality synthesis for Basque. In: Proc. LREC, pp. 3308–3312.
Shuang, Z., Bakis, R., Shechtman, S., Chazan, D., Qin, Y., 2006. Frequency warping based on mapping formant parameters. In: Proc. Interspeech, pp. 2290–2293.
Stylianou, Y., Cappe, O., Moulines, E., 1998. Continuous probabilistic transform for voice conversion. IEEE Transactions on Speech and Audio Processing 6, 131–142.
Suendermann, D., Ney, H., 2003. VTLN-based voice conversion. In: Proc. ISSPIT, pp. 556–559.
Tamura, M., Morita, M., Kagoshima, T., Akamine, M., 2011. One sentence voice adaptation using GMM-based frequency-warping and shift with a sub-band basis spectrum model. In: Proc. ICASSP, pp. 5124–5127.
Toda, T., Saruwatari, H., Shikano, K., 2001. High quality voice conversion based on Gaussian mixture model with dynamic frequency warping. In: Proc. Interspeech, pp. 349–352.
Toda, T., Black, A.W., Tokuda, K., 2007. Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory. IEEE Transactions on Audio, Speech and Language Processing 15, 2222–2235.
Valbret, H., Moulines, E., Tubach, J.P., 1992. Voice transformation using PSOLA technique. Speech Communication 11, 175–187.
Zen, H., Nankaku, Y., Tokuda, K., 2011. Continuous stochastic feature mapping based on trajectory HMMs. IEEE Transactions on Audio, Speech and Language Processing 19, 417–430.
Zorila, T.C., Erro, D., Hernaez, I., 2012. Improving the quality of standard GMM-based voice conversion systems by considering physically motivated linear transformations. Communications in Computer and Information Science 328, 30–39.

