
Development of Croatian unit selection and statistical parametric speech synthesis

M. Pobar, I. Ipšić

University of Rijeka/Department of Informatics, Rijeka, Croatia

{mpobar, ipsic}@inf.uniri.hr

Abstract - This paper presents the development of Croatian speech synthesis systems. Three voices were built using the same recorded speech corpus. Two of these voices were built with the Festival speech synthesis system, using the clustering unit selection method and the statistical parametric method. The third voice uses a general unit selection algorithm implemented in a custom speech synthesis system. The obtained voices are compared with each other and with voices generated by a previously developed diphone-based TTS system. The comparison is based on subjective tests using MOS evaluation.

I. INTRODUCTION

In recent years, two methods relying on large quantities of natural speech data have dominated speech synthesis research. The unit selection method is based on selecting and concatenating samples, or units, of natural speech from the available corpus. If there is more than one instance of each unit, spoken in different styles, the system can choose the sequence of units that best conforms to the desired prosody and has the least audible joins. The quality of speech produced by unit selection can vary widely, depending on whether units that join well for the desired utterance can be found in the inventory. Variants of unit selection synthesis have been implemented in systems such as ν-Talk [1], CHATR [2], and Festival [3].

Statistical modeling of speech has recently been applied successfully in speech synthesis, leading to statistical parametric speech synthesis. The method is based on a parametrization of speech that can be both inverted and modeled. A set of models is trained on examples of natural speech. At synthesis time, these models produce parameter vectors from which the speech waveform is generated. Typically, the hidden Markov model (HMM) formalism is used along with mel-frequency cepstral coefficients (MFCCs) [4,5], but other parameters, such as formant trajectories, have been used as well [6]. The reported quality of such systems is generally very good [7], but still has some drawbacks compared to unit selection systems, especially the buzziness of the generated speech resulting from the filtering process used to generate the waveforms.

For the Croatian language, various speech synthesis systems have been reported, using diphone concatenation [8,9], unit selection [10], and the statistical parametric method [11].

This paper presents the development of two unit selection voices for the Croatian language, one for a custom system and one for the Festival system, and one statistical parametric voice, also for Festival. All voices were built using the same speech corpus so that the results could be compared. A preliminary subjective evaluation of the systems was conducted using the mean opinion score (MOS) scale.

The paper is organized as follows: in the next section, a short overview of the unit selection and statistical parametric speech synthesis methods is given. The development of the Croatian voices is described in section three. Section four presents the evaluation results. Finally, some conclusions and suggestions for future work are given.

II. SPEECH SYNTHESIS METHODS

A. Unit selection

The unit selection speech synthesis method is based on the concatenation of recorded segments, or units, of natural speech stored in a corpus. The corpus contains recordings of speech that are phonetically transcribed and segmented, so that the beginnings and endings of phones are known.

At synthesis time, the system searches the corpus for a sequence of units that matches the desired phonetic string and concatenates the corresponding waveforms. In general, many sequences will match, so the system must choose the one that gives the best speech quality.

In earlier diphone systems, the corpus was reduced to one instance of each unit (i.e. diphone) per unit class using some pre-selection. Selection from a large corpus improves speech quality when larger segments can be found in the corpus, leading to fewer concatenations. At each join point, the unit that leads to the least perceptible join can be chosen from potentially many candidates. In diphone synthesis there is only one choice, and since the units can be extracted from a different context than the one in the synthesized speech, bad, unnatural-sounding joins are difficult to avoid.

The size and character of the corpus are very important for the resulting speech quality, as a larger corpus increases the chances of finding a unit that fits well in the context. Another important factor is the unit selection algorithm itself, as it needs to reflect human preferences. Several solutions have been proposed for predicting the subjectively perceived quality of a join from the data available at synthesis time, e.g. [12].

A general unit selection scheme was proposed in [2].

Support for this work was provided by the Ministry of Science, Education and Sports of the Republic of Croatia (project number 318-0361935-0852).

The input to the synthesizer is a target specification $t_1^N = \{t_1, \dots, t_N\}$, a sequence of N units described with feature vectors. The goal is to find the optimal sequence of units $u_1^N = \{u_1, \dots, u_N\}$ in the corpus corresponding to the specification $t_1^N$.

The choice of best unit sequence is presented as the problem of finding the path with minimal total cost through a network where each node is a unit in the database, and the edges are possible joins.

Two cost functions are defined. The target cost $C^t(t_i, u_i)$ is a difference measure between a unit $u_i$ in the corpus and a target unit $t_i$, i.e. the desired rendition of this unit. The target cost is effectively the cost of a given node.

The join cost $C^c(u_{i-1}, u_i)$ is a measure of how perceptible the join between two consecutive units $u_{i-1}$ and $u_i$ is, and corresponds to the cost of the edge from $u_{i-1}$ to $u_i$.

Both the target and join costs are defined as weighted sums of $p$ and $q$ sub-costs:

$$C^t(t_i, u_i) = \sum_{j=1}^{p} w_j^t\, C_j^t(t_i, u_i)$$

$$C^c(u_{i-1}, u_i) = \sum_{k=1}^{q} w_k^c\, C_k^c(u_{i-1}, u_i)$$

where $w_j^t$ and $w_k^c$ are the target and join weights. The target and join sub-cost functions measure the contributions of individual features to the cost, usually in the form of a difference of feature values. These features may be a combination of linguistic and acoustic features for the join cost, and linguistic features only for the target cost, as acoustic features are usually not available for the target specification. The weights determine the relative importance of the sub-costs and are an important part of cost function design, as they should reflect the subjective perception of human listeners.

The total cost is the sum of target and join costs for the N units in the utterance:

$$C(t_1^N, u_1^N) = \sum_{i=1}^{N} C^t(t_i, u_i) + \sum_{i=2}^{N} C^c(u_{i-1}, u_i) \qquad (1)$$

and the optimal sequence is

$$\bar{u}_1^N = \arg\min_{u_1, \dots, u_N} C(t_1^N, u_1^N). \qquad (2)$$

The Viterbi dynamic programming algorithm can be used to find the optimal path through the network.
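As an illustration, a minimal sketch of this search in Python is given below. The candidate lists and the two cost functions are assumed to be given; all names are illustrative rather than taken from any of the described systems.

```python
# Minimal sketch of the Viterbi search over candidate units (illustrative,
# not the paper's implementation). `candidates[i]` holds the corpus units
# matching target t_i; `target_cost` and `join_cost` are assumed functions.

def viterbi_select(targets, candidates, target_cost, join_cost):
    # best[i][u] = (cheapest total cost of a path ending in unit u, backpointer)
    best = [{u: (target_cost(targets[0], u), None) for u in candidates[0]}]
    for i in range(1, len(targets)):
        column = {}
        for u in candidates[i]:
            # cheapest predecessor: previous path cost + join (edge) cost
            prev = min(candidates[i - 1],
                       key=lambda v: best[i - 1][v][0] + join_cost(v, u))
            cost = (best[i - 1][prev][0] + join_cost(prev, u)
                    + target_cost(targets[i], u))
            column[u] = (cost, prev)
        best.append(column)
    # pick the cheapest final unit and follow the backpointers
    u = min(best[-1], key=lambda v: best[-1][v][0])
    path = [u]
    for i in range(len(targets) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return path[::-1]
```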

B. Statistical parametric synthesis

The statistical parametric synthesis method uses hidden Markov models of speech like those often used in speech recognition.

The basic waveform generation method is similar to the source/filter model in the older rule-based formant synthesizers. A vector of parameters, or acoustic features (e.g. formant frequencies, voicing information, duration, etc.), is generated for the desired utterance. A source signal, generally noise for unvoiced phonemes or an impulse train for voiced phonemes, is generated and processed by a filter with coefficients set according to the parameters, producing speech. The key difference is in the way the parameters are generated. Instead of a set of manually inferred rules driving the synthesizer, statistical learning techniques are used to train hidden Markov models of sub-word units from a corpus of natural speech.
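The source/filter idea can be sketched as follows. This is a schematic illustration only, not the MLSA filter actually used; the per-frame filter coefficients are assumed to come from the trained models.

```python
import numpy as np
from scipy.signal import lfilter

def excitation(voiced, f0, n_samples, fs=16000):
    """Source signal: an impulse train at F0 for voiced frames,
    white noise for unvoiced frames (schematic)."""
    if voiced:
        src = np.zeros(n_samples)
        src[::int(fs / f0)] = 1.0   # one impulse per pitch period
        return src
    return np.random.randn(n_samples)

def synthesize_frame(voiced, f0, b, a, n_samples, fs=16000):
    """Filter the excitation with frame-specific coefficients (b, a),
    standing in for the spectral model's output."""
    return lfilter(b, a, excitation(voiced, f0, n_samples, fs))
```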

The modeling unit is typically a context-dependent phoneme described with probability distributions of mel-frequency cepstral coefficients (MFCCs) and F0 [4,5] or formant trajectories [6]. Multiple context-dependent models for each phoneme capture the differences in the expression of phonemes caused by coarticulation. Contexts are distinguished by linguistic features that can be extracted from text (the phonetic context, syllable structure, lexical stress, etc.).

The spectral part of the parameters is modeled using mixtures of continuous Gaussian distributions, while duration is modeled either by a Gaussian distribution [4] or a decision tree [5].

The model parameters are estimated from parameter vectors extracted from a corpus of natural speech, using the Baum-Welch algorithm.

At synthesis time, the context-dependent phoneme models are concatenated according to the desired phoneme string, and the state durations are calculated according to the duration model. The MFCC and F0 parameter vectors that maximize the state output probabilities are then produced at fixed time intervals. Finally, the waveform is generated using the MLSA filter [13].

This allows much faster development of voices than for a formant synthesizer, and the voices share the characteristics of the speaker whose speech was used to train the models. The intermediate parametric representation of speech is easier to manipulate than the raw waveforms in unit selection synthesis, so the prosody of speech can be controlled more easily. The voice characteristics can also be changed, producing synthetic speech unlike the original speaker's [14].

III. DESCRIPTION OF SYSTEMS

A. Corpus

The same speech data from the VEPRAD [15] radio news and weather report corpus was used to build all three voices. The VEPRAD corpus is a multi-speaker database containing speech from 11 male and 14 female professional speakers. From this corpus, a subset from a single male speaker (identified as sm04) was selected for building the voices. This subset consists of about two and a half hours of speech with word-level textual transcriptions. The size of this set is 267 MB.

Phone level segment labels were obtained automatically using HMM speech recognition in forced alignment mode. The recognition is done using the HTK toolkit, with monophone HMMs trained on the same speech corpus that is used in synthesis [15].

B. Phoneset

The phoneset used to build the voices is adapted from the one used in a previous diphone-based synthesizer [8]. It consists of the 30 standard phonemes of the Croatian language, five accented forms of vowels, the syllable-forming /r/, and a silence phoneme.

C. Lexicon and grapheme-to-phoneme rules

The grapheme-to-phoneme conversion in the system is done using a pronunciation lexicon and a set of grapheme-to-phoneme rules. The lexicon consists of about 10,000 entries with phonetic transcriptions of words in the news and weather domain, including stress position information.

In the Croatian language, grapheme-to-phoneme conversion is mostly straightforward, with a one-to-one mapping in the majority of cases. A basic set of manually produced mapping rules for 30 graphemes is enough to cover unknown words. Better speech quality can be obtained with a more extensive set of rules that take into account various sound changes occurring in speech [16], so these were adapted for these voices as well. Examples of such rules are the transformation of the sequence ts into c, as in Hrvatska [hrvacka], and the transformation of doubled consonants between words into a long consonant, e.g. "radit ću" [radić:u]. A complete list is presented in [16].
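A toy sketch of this lexicon-plus-rules approach is shown below; the lexicon entry and the rule set are illustrative fragments only, and the full rules are given in [16].

```python
import re

# Toy sketch of lexicon lookup with rule-based fallback (illustrative data,
# not the system's actual lexicon or complete rule set).
LEXICON = {"hrvatska": ["h", "r", "v", "a", "c", "k", "a"]}

RULES = [
    (r"ts", "c"),         # e.g. Hrvatska -> [hrvacka]
    (r"(\w)\1", r"\1:"),  # doubled consonants merge into a long consonant
]

def g2p(word):
    """Lexicon lookup first; rule-based mapping for unknown words
    (Croatian spelling is close to one grapheme per phoneme)."""
    if word.lower() in LEXICON:
        return LEXICON[word.lower()]
    s = word.lower()
    for pattern, repl in RULES:
        s = re.sub(pattern, repl, s)
    # split into symbols, keeping length marks attached (e.g. "t:")
    return re.findall(r"\w:?", s)
```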

D. Unit selection voice for Festival

One unit selection voice was built with the Festival and Festvox [17] tools. The Festival speech synthesis system supports a variant of the unit selection framework called clunits [18], or clustering unit selection, which restricts the search space by off-line clustering of similar units and searching only promising clusters at runtime.

In this implementation of unit selection, the basic unit of speech is the phoneme.

Linguistic (phonetic and prosodic) features that can be derived from text are used to cluster acoustically similar units of the same type. For each phoneme type, a decision tree is built using the classification and regression tree (CART) method [18]. The nodes are split according to questions about the linguistic features, minimizing the mean acoustic distance between units in the resulting clusters. For example, a question may be whether the next phoneme is voiced, or whether the current phoneme is the first one in a syllable. The acoustic distance is calculated as the average weighted distance between the feature vectors of all frames in the units and a part of the previous unit. To compare two units of different durations, the shorter unit is extended using linear interpolation to have the same number of frames as the longer one.

At synthesis time, linguistic features are extracted from the input text, and for each phoneme the corresponding decision tree is queried to select a cluster of candidate units. The selected cluster center is then set as the target specification, so the target costs are actually distances of units from the cluster means.
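Schematically, the cluster lookup is a walk down a tree of yes/no questions. The node structure and feature names below are hypothetical, not the actual Festival data structures.

```python
# Schematic CART lookup (hypothetical node layout and feature names).
# Internal nodes hold a yes/no question over linguistic features; leaves
# hold a cluster of candidate units plus the cluster's mean feature vector.

class Node:
    def __init__(self, question=None, yes=None, no=None, cluster=None, mean=None):
        self.question, self.yes, self.no = question, yes, no
        self.cluster, self.mean = cluster, mean   # set on leaves only

def select_cluster(tree, features):
    """Walk the tree for one phoneme using its linguistic features."""
    node = tree
    while node.question is not None:
        node = node.yes if node.question(features) else node.no
    return node.cluster, node.mean   # candidates + target spec (cluster mean)

# Example: a one-question tree ("is the next phoneme voiced?")
leaf_a = Node(cluster=["unit_12", "unit_40"], mean=[0.1, 0.3])
leaf_b = Node(cluster=["unit_7"], mean=[0.2, 0.1])
tree = Node(question=lambda f: f["next_voiced"], yes=leaf_a, no=leaf_b)
cands, target = select_cluster(tree, {"next_voiced": True})
```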

The Viterbi algorithm is then used similarly to the general unit selection scheme to find the optimal path through the network of candidate units.

To build the Croatian voice, the previously defined phoneset was adapted to the Festival system. For each phoneme, the linguistic features used in the CART questions were defined, such as whether the phone is a vowel or a consonant, its place of articulation, length, voicing, etc.

The grapheme-to-phoneme rules described above were converted from Matlab and Perl to the Scheme language for Festival. The lexicon, which was in plain text form, was also converted into the Festival Scheme format.

The next step is data preparation. The method requires a set of so-called utterance structure files, one for each utterance/wave file, containing segment labels and linguistic features. These are generated automatically by the Festvox tools from the textual transcriptions, which need to be provided, using the defined phoneset, grapheme-to-phoneme rules and lexicon.

Next, the phone-level segment labels for all utterances were prepared. The corpus had already been segmented using the HTK toolkit, but due to a slightly different phoneset used in segmentation, where non-verbal sounds like coughing and laughter were marked with additional phoneme marks, a few utterance files generated by Festival did not match the HTK-generated labels and had to be manually corrected or excluded from the training data. Next, following the procedure described in [17], the parameters for calculating the acoustic distance (pitch marks, F0 and MFC coefficients) were extracted from the data. Finally, the clustering unit selection voice was built using the scripts provided by Festvox.

E. Statistical parametric voice for Festival

The statistical parametric voice was also built using the Festival and Festvox tools, with the clustergen statistical parametric synthesis module.

The higher-level modules, i.e. the phoneset, lexicon and grapheme-to-phoneme rules, were kept the same as in the previously described unit selection voice.

To train the context-dependent HMMs, acoustic features (MFCCs, voicing information and log F0) first need to be extracted from the data. This is done using tools provided with Festvox, as is the preparation of utterance files with linguistic features.

Each context-dependent phoneme is modeled using a three-state HMM, with a parameter set consisting of 24 MFCCs and log F0, extracted at 5 ms intervals. The duration of each HMM state is predicted using a separate CART tree.

The next phase involves training the actual HMMs. The HMM training stage requires HMM-state-level segmentation of speech, whereas the unit selection voice required phone-level segmentation, so the data needed to be segmented again. The segment labels were generated using the Festvox scripts, and the EHMM recognizer tool was used in forced alignment mode to align the predicted segment labels with the speech data.

Finally, the HMM parameters and duration CART trees were estimated from the aligned utterances and extracted features. After the models are trained, the training data is no longer needed and the resulting voice takes up 10.2 MB.

F. Unit selection voice

The unit selection voice built for a custom system follows the general unit selection framework described in Section II.A.

In this system, the diphone was chosen as the fundamental acoustic unit. Concatenation of diphones is less sensitive to small labeling errors caused by the automatic segmentation of the speech corpus than concatenation of units that end on phoneme boundaries, as the joins occur in the middle of the same phone. If the labeling is consistent, a join results in a full phone with the correct identity, whereas if phones were joined, parts of mislabeled phones would be included. Since the corpus was labeled automatically, consistent labeling was expected.

First the unit database was populated with a number of units for each diphone class. The following features were stored for each unit:

- left and right phone identity,
- beginning and ending time of both phonemes,
- a pointer to the actual wave file,
- the F0 contour,
- 12 MFCCs,
- log energy,
- a context identifier.

Acoustic features used in the join cost calculation are first extracted from all waveforms in the corpus. F0 contours were extracted in 10 ms frames using the RAPT [19] algorithm. The obtained contours were smoothed using a three-point running median filter to remove spurious jumps in the detected contours. For unvoiced regions, missing F0 values were inserted using linear interpolation from neighboring values.
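This post-processing step can be sketched as follows, assuming the F0 contour and a voicing mask are available as arrays (a sketch, not the original Matlab code):

```python
import numpy as np
from scipy.signal import medfilt

def smooth_f0(f0, voiced):
    """Smooth an F0 contour with a three-point running median and fill
    unvoiced gaps by linear interpolation from neighboring voiced frames.
    `f0` is a float array (one value per 10 ms frame), `voiced` a boolean
    mask of the same length."""
    f0 = medfilt(f0, kernel_size=3)              # three-point median filter
    frames = np.arange(len(f0))
    if voiced.any():
        f0[~voiced] = np.interp(frames[~voiced],  # fill unvoiced regions
                                frames[voiced], f0[voiced])
    return f0
```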

Twelve MFCCs and the log energy were extracted in 16 ms frames with 8 ms overlap.

A unique context identifier is also stored with each unit so units that were originally joined in natural speech may be identified when calculating the join cost.

For the target function, the normalized Euclidean distance between the durations of the unit's phonemes and their corresponding means was used, as shown in (3):

$$C^t(u) = \sqrt{\left(\frac{d_l - \mu_l}{\sigma_l}\right)^2 + \left(\frac{d_r - \mu_r}{\sigma_r}\right)^2} \qquad (3)$$

where $u$ is the current unit (diphone), $d_l$ and $d_r$ are the durations of the left and right phoneme in the unit $u$ respectively, $\mu_l$ and $\mu_r$ are the mean durations of the left and right phoneme classes, and $\sigma_l$ and $\sigma_r$ are the standard deviations of the left and right phoneme class durations.
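A direct reading of (3) in code, with illustrative field names for the unit and the per-class duration statistics:

```python
import math
from dataclasses import dataclass

@dataclass
class Diphone:
    left_phone: str    # phone class of the left half, e.g. "a"
    right_phone: str   # phone class of the right half, e.g. "t"
    dur_left: float    # measured durations in seconds
    dur_right: float

def duration_target_cost(unit, stats):
    """Equation (3): normalized Euclidean distance between the unit's
    phone durations and the class statistics; `stats` maps a phone
    class to its (mean, std) duration (illustrative structure)."""
    mu_l, sigma_l = stats[unit.left_phone]
    mu_r, sigma_r = stats[unit.right_phone]
    return math.sqrt(((unit.dur_left - mu_l) / sigma_l) ** 2
                     + ((unit.dur_right - mu_r) / sigma_r) ** 2)
```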

The join cost is a weighted sum of the absolute differences of the F0 values at the point of concatenation, the log energy at the point of concatenation, and the distance between MFCC vectors. The MFCC distance is computed as the Euclidean distance between the MFCC vector of the first frame of the next diphone and that of one frame after the last frame of the current diphone. If two diphones that are adjacent in the original recording are considered, the MFCC distance becomes zero. When diphones come from different utterances, the one-frame overlap ensures that a lower cost is assigned to the unit whose spectral characteristics are similar to the continuation of the current diphone, not to the current diphone itself. This is particularly important for phones with abrupt spectral changes, e.g. plosives.

The join cost $C^c(u_{i-1}, u_i)$ between units $u_{i-1}$ and $u_i$ is defined as:

$$C^c(u_{i-1}, u_i) = w_{F0}\,\left|F0_{i-1} - F0_i\right| + w_E\,\left|E_{i-1} - E_i\right| + w_C\,\sqrt{\sum_j \left(C_{i-1,j} - C_{i,j}\right)^2} \qquad (4)$$

where $w_{F0}$, $w_E$ and $w_C$ are the F0, energy and MFCC cost weights, $F0$ is the fundamental frequency, $E$ the log energy and $C$ the MFCC vector.
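Equation (4) translates directly into code; the attribute names and the default weights below are placeholders, since the actual weights were set empirically:

```python
import numpy as np

def join_cost(prev, nxt, w_f0=1.0, w_e=1.0, w_c=1.0):
    """Equation (4): weighted sum of F0, log-energy and MFCC mismatches at
    the join point. `prev.next_mfcc` is the MFCC frame one past the end of
    `prev`, so for diphones that were adjacent in the original recording it
    equals `nxt.first_mfcc` and the spectral term vanishes."""
    return (w_f0 * abs(prev.f0_end - nxt.f0_start)            # F0 mismatch
            + w_e * abs(prev.energy_end - nxt.energy_start)   # energy mismatch
            + w_c * np.linalg.norm(prev.next_mfcc - nxt.first_mfcc))
```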

The total cost $S(u)$ of choosing the unit $u$ is

$$S(u) = w^t\, C^t(u) + w^c\, C^c(u),$$

where $w^t$ and $w^c$ are the target cost weight and the join cost weight. In this version of the synthesizer the weights were set empirically.

Finally the sequence of units that minimizes the total cost over the whole utterance has to be found.

The input to the synthesizer is plain text, which is first converted to a phoneme string using the grapheme-to-phoneme rules and lexicon described above.

The Viterbi algorithm is used to search for the optimal unit sequence.

The final speech waveform is generated by concatenating the chosen units' waveforms using the overlap-and-add technique. No extra signal processing was used. The system was implemented in Matlab; the Voicebox [20] toolbox was used for MFCC extraction and the RAPT algorithm.
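A minimal overlap-and-add concatenation might look as follows; the cross-fade length is illustrative, as the actual overlap used in the system is not stated:

```python
import numpy as np

def concatenate_ola(waveforms, overlap=128):
    """Concatenate unit waveforms with a linear cross-fade over `overlap`
    samples (a minimal overlap-and-add sketch; assumes every waveform is
    longer than the overlap)."""
    out = waveforms[0]
    fade_in = np.linspace(0.0, 1.0, overlap)
    for w in waveforms[1:]:
        out = np.concatenate([
            out[:-overlap],
            out[-overlap:] * (1.0 - fade_in) + w[:overlap] * fade_in,  # cross-fade
            w[overlap:],
        ])
    return out
```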

IV. EVALUATION AND COMPARISON

A preliminary evaluation using the mean opinion score (MOS) was conducted for all developed voices. A previously developed diphone voice [8] was also included in the evaluation.

For each voice, two samples of synthesized speech were prepared. The first sample is a synthesized text from the weather domain (labeled text A), using the same vocabulary as the training corpus. The wording of the text was different from any utterance in the training corpus.

The other sample is from the news domain, with a larger proportion of words not present in the lexicon. This text is labeled text B.

Both texts have three sentences, with total lengths of 34 and 37 words, respectively.

The following quality factors were evaluated:

- Overall quality: the listeners rated the speech on a scale from 1 to 5, with labels 1 – poor, 2 – satisfactory, 3 – good, 4 – very good and 5 – excellent.
- Intelligibility: the listeners were asked to specify how much of the speech they understood, on a scale from 1 to 5 labeled "I understood: 1 – nothing, 2 – a smaller part, 3 – partially, 4 – most of it, 5 – completely".
- Naturalness: the listeners rated the naturalness on a scale from 1 to 5, labeled 1 – completely unnatural, 2 – not natural, 3 – roughly natural, 4 – mostly natural and 5 – natural.
- Irregularities: the listeners reported whether they noticed irregularities in the speech, on a scale from 1 to 5 labeled 1 – very often, 2 – often, 3 – occasionally, 4 – in isolated cases and 5 – no.
- Acceptance: the listeners responded whether they thought the synthesized speech was acceptable for use in an automated information service over the telephone, with answers yes, yes with improvements to the system, and no, corresponding to scores 5, 3 and 1.

The samples were labeled clip1-clip4 for renditions of text A and clip5-clip8 for renditions of text B. The order of the systems corresponding to the clips was random and different for texts A and B.

In the following discussion, the voices described in sections III.D, III.E and III.F are labeled S1, S2 and S3 respectively, and the system described in [8] is labeled S4.

The participants listened to the samples in arbitrary order and could repeat each sample any number of times. The samples were presented on a web page, and the results were collected using an electronic form where the users ticked the boxes corresponding to the given scales.

Twelve listeners of both genders (mostly university students and staff) participated in the evaluation. Of those, 6 had previous contact with speech synthesis systems.

The mean opinion scores for the systems in the evaluation are presented in Table 1.

TABLE 1. RESULTS OF EVALUATION FOR SYSTEMS S1-S4, EXPRESSED ON THE MEAN OPINION SCORE (MOS) SCALE

                          S1     S2     S3     S4
Quality          text A   4.33   2.08   3.42   2.00
                 text B   3.33   1.83   2.83   2.50
Intelligibility  text A   4.92   3.67   4.75   3.67
                 text B   4.08   2.92   3.92   3.58
Naturalness      text A   4.58   1.25   4.00   2.50
                 text B   3.75   1.25   3.42   2.42
Irregularities   text A   4.25   2.67   3.67   2.25
                 text B   3.42   2.58   3.17   2.83
Acceptance       text A   4.67   1.83   4.00   2.17
                 text B   3.33   1.50   3.33   2.33

Figure 1 shows the MOS results for text A, and Figure 2 shows the results for text B.

For both texts, the two unit selection systems achieved the best scores on all questions, with system S1 consistently performing slightly better than system S3. This difference may be explained by the very simplistic target function used in system S3, which considers only the average duration of the diphone, while system S1 uses context features to select candidate units. The join cost alone may choose units that join well, but it considers only neighboring units, so the overall prosody of the whole utterance may still be poor; some listeners reported this as a "singing" quality.

All systems achieved good scores for intelligibility, and the unit selection voices also scored well on naturalness. Some irregularities in speech were present in all systems, and the scores dropped for text B, where more unknown words were present.

Significantly better scores for systems S1 and S3 can be seen for text A than for text B, while system S4 (the diphone system) performs about the same on both texts. This was expected, as the diphone system could not take advantage of the fact that text A contained words present in the training corpus, and its number of concatenations is always the same. The voice S2 also performed better on text A, but the difference was smaller. This suggests that this voice may generalize better outside the domain of the training corpus than voices S1 and S3; however, its overall scores were still lower than those of these voices.

The low scores for system S2, with even the diphone system outperforming it in most cases, were somewhat surprising considering other comparisons [7]. A particularly low score was given for its naturalness, which was expected due to the vocoding buzziness. Some participants noted that although the intelligibility of the voice was excellent, its metallic character made it unsuitable for the stated use.

For text A, 83% of the listeners described system S1 as acceptable for use in automated information services.

Figure 1. MOS results for text A

Figure 2. MOS results for text B

V. CONCLUSION

Two unit selection voices and one statistical parametric voice were built for Croatian using the same speech corpus. The systems were evaluated and rated on a MOS scale for quality, intelligibility, naturalness, frequency of speech irregularities and acceptance.

The two unit selection systems performed best, with mean quality scores of 4.33 and 3.42 and acceptance scores of 4.67 and 4.00 for text in the domain of the training corpus. For text with unknown vocabulary, the unit selection systems still performed best, but with lower scores and with less difference between the general and clustering unit selection.

To improve the quality of speech from the unit selection systems, several modifications should be made. The empirically set weights of the cost functions may not be optimal and should be estimated using some objective or perceptual measure.

The target function should be changed to accommodate sentence-level prosodic information. To this effect, the linguistic context information may be combined with statistical models of prosody trained from data.

Based on these evaluation results, future work will focus on defining appropriate cost functions and weights optimized for Croatian speech synthesis.

REFERENCES

[1] Y. Sagisaka, N. Kaiki, N. Iwahashi, and K. Mimura, "ATR ν-Talk speech synthesis system," in Second International Conference on Spoken Language Processing, 1992.
[2] A. J. Hunt and A. W. Black, "Unit selection in a concatenative speech synthesis system using a large speech database," in 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-96), vol. 1, 1996.
[3] R. A. J. Clark, K. Richmond, and S. King, "Festival 2 – build your own general purpose unit selection speech synthesiser," in Fifth ISCA Workshop on Speech Synthesis, 2004.
[4] J. Yamagishi, H. Zen, T. Toda, and K. Tokuda, "Speaker-independent HMM-based speech synthesis system – HTS-2007 system for the Blizzard Challenge 2007," in Proc. BLZ3-2007 (in Proc. SSW6), 2007.
[5] A. W. Black, "CLUSTERGEN: A statistical parametric synthesizer using trajectory modeling," in Ninth International Conference on Spoken Language Processing, 2006.
[6] A. Acero, "Formant analysis and synthesis using hidden Markov models," in Sixth European Conference on Speech Communication and Technology, 1999.
[7] R. Barra-Chicote, J. Yamagishi, S. King, J. M. Montero, and J. Macias-Guarasa, "Analysis of statistical parametric and unit selection speech synthesis systems applied to emotional speech," Speech Communication, vol. 52, no. 5, pp. 394-404, 2010.
[8] M. Pobar, S. Martincic-Ipsic, and I. Ipsic, "Text-to-speech synthesis: a prototype system for Croatian language," Engineering Review, vol. 28, no. 2, pp. 31-44, 2008.
[9] J. Bakran and N. Lazic, "Fonetski problemi difonske sinteze hrvatskoga govora (Phonetic problems with the diphone synthesis of Croatian speech)," Govor, vol. 15, no. 2, pp. 103-116, 1998.
[10] A. W. Black et al., "TONGUES: Rapid development of a speech-to-speech translation system," in Proceedings of the Second International Conference on Human Language Technology Research, 2002, pp. 183-186.
[11] S. Martincic-Ipsic and I. Ipsic, "Croatian HMM-based speech synthesis," in 28th International Conference on Information Technology Interfaces, 2006, pp. 251-256.
[12] H. Peng, Y. Zhao, and M. Chu, "Perpetually optimizing the cost function for unit selection in a TTS system with one single run of MOS evaluation," in Seventh International Conference on Spoken Language Processing, 2002.
[13] S. Imai, K. Sumita, and C. Furuichi, "Mel log spectrum approximation (MLSA) filter for speech synthesis," Electronics and Communications in Japan (Part I: Communications), vol. 66, no. 2, pp. 10-18, 1983.
[14] H. Zen, K. Tokuda, and A. W. Black, "Statistical parametric speech synthesis," Speech Communication, vol. 51, no. 11, pp. 1039-1064, 2009.
[15] S. Martincic-Ipsic, M. Matesic, and I. Ipsic, "Korpus hrvatskoga govora (Croatian speech corpus)," Govor, vol. I, no. 2, pp. 135-150, 2004.
[16] L. Nacinovic, M. Pobar, I. Ipsic, and S. Martincic-Ipsic, "Grapheme-to-phoneme conversion for Croatian speech synthesis," in 32nd International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO 2009), vol. 3, 2009, pp. 318-323.
[17] A. W. Black and K. A. Lenzo, Building Synthetic Voices. Language Technologies Institute, Carnegie Mellon University and Cepstral LLC, 2003.
[18] A. W. Black and P. Taylor, "Automatically clustering similar units for unit selection in speech synthesis," in Fifth European Conference on Speech Communication and Technology, 1997.
[19] D. Talkin, "A robust algorithm for pitch tracking (RAPT)," in Speech Coding and Synthesis, vol. 495, p. 518, 1995.
[20] M. Brookes et al., "Voicebox: Speech processing toolbox for MATLAB," 2000. [Online]. Available: http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html


