Diphone-Concatenation Speech Synthesis for Myanmar Language

Date post: 23-Feb-2020
Ei Phyu Phyu Soe, Aye Thida
University of Computer Studies, Mandalay, Myanmar
Page 1: Diphone-Concatenation Speech Synthesis for Myanmar …ijsetr.org/wp-content/uploads/2013/07/IJSETR-VOLUME-2-ISSUE-5-1078-1087.pdfThe purpose of this paper is to develop Myanmar Text-To-Speech

ISSN: 2278 – 7798

International Journal of Science, Engineering and Technology Research (IJSETR)

Volume 2, Issue 5, May 2013

1078

All Rights Reserved © 2013 IJSETR

Abstract: Speech synthesis is a popular field within natural language processing. A synthesis system is composed of a Natural Language Processing (NLP) part and a Digital Signal Processing (DSP) part. This paper presents the DSP part of a speech synthesizer for the Myanmar language that uses the diphone-concatenation method. The method concatenates speech at the diphone level and applies the Pitch Synchronous Overlap and Add (PSOLA) algorithm to smooth the joints between speech signals. PSOLA has two variants: Time-Domain and Frequency-Domain. This paper describes the Time-Domain variant (TD-PSOLA) as used in diphone-concatenation speech synthesis. One contribution of this paper is the construction of a Myanmar diphone database for diphone-concatenation synthesis. Concatenative synthesis is intended to reduce the problems that arise with formant synthesis. The diphone database for Myanmar pronunciations is constructed in this research to reduce ambiguity in pronunciation. In diphone-concatenation synthesis, space complexity and search time are lower than in other techniques. This paper illustrates techniques to improve the performance of Myanmar text-to-speech using TD-PSOLA, which is based on decomposing the signal into overlapping frames synchronized with the pitch period. Diphone-concatenation synthesis must maintain the consistency and accuracy of the pitch marks of the speech signal, and it relies on a diphone database that integrates the vowels and consonants of the Myanmar language. This paper reports test results with varying numbers of overlapping pitch marks for the speech waveforms of a Myanmar sentence. The results show that the larger the number of overlaps, the better the quality of the synthesized speech. The goal is to take a word sequence and produce “human-like” speech.

Index Terms: PSOLA, diphone-concatenation, speech synthesis, Myanmar speech, text-to-speech.

I. INTRODUCTION

A text-to-speech (TTS) synthesis system converts input Myanmar text into human-like spoken language. This generally requires extensive knowledge of the language, the context the text comes from, and a deep understanding of the semantics of the text and the relations within it. Nevertheless, many research and commercial speech synthesis systems have contributed to our understanding of these phenomena and have been successful in applications such as speech-to-speech machine translation, interactive voice response systems, reading software for the blind, linguistic research and language teaching.

Text-to-speech technology gives computers the ability to convert text into audible speech, with the goal of delivering information by voice. It has been used to provide easier means of communication and to improve access to textual information for people with visual impairment. Two criteria are used to judge the quality of a TTS synthesizer. Intelligibility refers to how easily the output can be understood. Naturalness refers to how much the output sounds like the speech of a real person. Most existing systems have reached a fairly satisfactory level of intelligibility, while significantly less success has been attained in producing highly natural speech [1].

II. TEXT-TO-SPEECH SYNTHESIS

A TTS system has two major parts: a natural language processing (NLP) part, which reads the input text and translates it into a phonetic representation, and a digital signal processing (DSP) part, which converts the phonetic representation into speech. The input text might be, for example, data from a word processor, standard ASCII from e-mail, a mobile text message, or scanned text from a newspaper. The character string is preprocessed and analyzed into a phonetic representation, usually a string of phonemes with some additional linguistic information. A TTS system generally consists of four modules: text analysis, phonetic analysis, prosody analysis and speech synthesis.

Text analysis has two parts: syllable segmentation and number conversion. The input Myanmar text is segmented into syllables. Syllable segmentation is the process of identifying syllable boundaries in a text, and its output is used to generate phonetic sequences with the phonetic dictionary. The number converter converts numbers to their textual form. Non-standard words are tokens such as numbers, which need to be expanded into sequences of Myanmar words before they are pronounced. A number is expanded into a string of words representing the cardinal.

Phonetic analysis is also called grapheme-to-phoneme conversion. It translates the syllables of Myanmar text into a phonetic sequence, determining the pronunciation of a syllable from its spelling. It also finds the best sequence of phonemes for words, numbers and symbols and converts them into phonetic sequences. We construct a Myanmar phonetic dictionary to generate the phoneme sequence and to pronounce these phonemes.

In the prosodic analysis step, the phonetic sequences are analyzed to produce prosodic features by applying phonological rules. This module analyzes duration and intonation, such as pitch variation and syllable length, to make the synthetic speech sound natural. The combination of a consonant phoneme and a vowel phoneme produces a syllable. The phonetic alphabet is usually divided into two main categories, vowels and consonants. Vowels are always voiced sounds, produced with the vocal cords in vibration, while consonants may be either voiced or unvoiced. Vowels have considerably higher amplitude than consonants, and they are also more stable and easier to analyze and describe acoustically. Because consonants involve very rapid changes, they are more difficult to synthesize properly.

Speech synthesis, the automatic generation of speech waveforms, has been under development for several decades [12], [13]. Synthesized speech can be produced by several different methods, which are usually classified into three groups:

Articulatory synthesis, which attempts to model the human speech production system directly.

Formant synthesis, which models the pole frequencies of the speech signal or the transfer function of the vocal tract, based on the source-filter model.

Concatenative synthesis, which uses prerecorded samples of different lengths derived from natural speech.

Concatenative synthesis is the method most commonly used in present synthesis systems, and it is becoming more and more popular. The articulatory method is still too complicated for high-quality implementations [1]. The aim of this paper is to improve the quality of Myanmar text-to-speech by applying concatenative speech synthesis with the TD-PSOLA (Time-Domain Pitch Synchronous Overlap and Add) algorithm.

The linguistic analysis stage maps the input text into a standard form, determines the structure of the input, and finally decides how to pronounce it. The synthesis stage converts the symbolic representation of what to say into an actual speech waveform [9].

III. MYANMAR LANGUAGE

Myanmar writing does not use white space between words or between syllables. The computer therefore has to determine syllable and word boundaries by means of an algorithm, such as a finite-state or rule-based method. Moreover, a Myanmar syllable can be composed of multiple characters. Syllable segmentation is the process of determining syllable boundaries in a piece of text.

A Myanmar word can consist of one or more morphemes that are linked more or less tightly together. Typically, a word consists of a root or stem and zero or more affixes. Words can be combined to form phrases, clauses and sentences. A word consisting of two or more stems joined together is known as a compound word. To process text computationally, words have to be determined first [2].

The purpose of this paper is to develop a Myanmar text-to-speech system and to improve synthesis quality by applying diphone-concatenation speech synthesis. The Myanmar language is the official language of Myanmar and is more than one thousand years old. Texts in the Myanmar language use the Myanmar script, which is descended from the Brahmi script of ancient South India. Other Southeast Asian descendants of Brahmi, known as Brahmic or Indic scripts, include Thai, Khmer and Lao.

A Myanmar text is a string of characters without explicit word boundary markup, written from left to right without regular inter-word spacing, although inter-phrase spacing may sometimes be used. Myanmar characters can be classified into three groups: consonants, medials and vowels. The basic consonants can be modified by medials. Syllables or words are formed by combining consonants with vowels. However, some syllables can be formed from consonants alone, without any vowel. Other characters in the Myanmar script include special characters, numerals, punctuation marks and signs.

Table1. Myanmar Character

There are 34 basic consonants in the Myanmar script, as displayed in Table1. They are known as “Byee” in the Myanmar language [2]. Consonants serve as the base characters of Myanmar words, and are similar in pronunciation to those of other Southeast Asian scripts such as Thai, Lao and Khmer.

Medials are known as “Byee Twe” in Myanmar. There are

4 basic medials and 6 combined medials in the Myanmar

script. The 10 medials can modify the 34 basic consonants to

form 340 additional multi-clustered consonants. Therefore, a

total of 374 consonants exist in the Myanmar script, although

some consonants have the same pronunciation.
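The count above can be checked with a short calculation. The Python sketch below only restates the figures given in the text (34 basic consonants, 4 basic medials and 6 combined medials); the variable names are illustrative and do not come from the paper.

```python
# Figures taken from the text; variable names are illustrative only.
BASIC_CONSONANTS = 34
BASIC_MEDIALS = 4
COMBINED_MEDIALS = 6

medials = BASIC_MEDIALS + COMBINED_MEDIALS        # all medials
clustered = BASIC_CONSONANTS * medials            # consonant + medial forms
total_consonants = BASIC_CONSONANTS + clustered   # basic + clustered forms

print(medials, clustered, total_consonants)  # prints: 10 340 374
```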

Vowels are known as “Thara”. Vowels are the basic building blocks of syllable formation in the Myanmar language, although a syllable or a word can be formed from consonants alone, without a vowel. As in other languages, multiple vowel characters can occur in a single syllable.

Special characters in the Myanmar script are used as prescription nouns and as conjunction words between two or more sentences.

Numerals in the Myanmar language are known as “Counting Numbers”. They are the 10 basic digits used for counting.

IV. MYANMAR LANGUAGE PHONOLOGY

A phoneme is the smallest unit that distinguishes words and morphemes. Changing a phoneme of a word to another phoneme therefore produces a different word or a nonsense utterance, whereas changing a phone to another phone of the same phoneme produces the same word with an odd or incomprehensible pronunciation. Phonemes are not physical segments themselves, but mental abstractions of them [5]. Different acoustic realizations of a phoneme are called allophones. The acoustic characteristics of phonemes come from the movement of the vocal tract during their articulation. The phonology of the Myanmar language has three types of phonetic parameters: place of articulation, articulator, and manner of articulation [3]. The pronunciation of Myanmar words depends on these parameters. A phoneme is a contrastive unit in the sound system of a particular language: a minimal unit that serves to distinguish between the meanings of words. A phoneme can be pronounced in one or more ways, depending on its number of allophones. By convention, it is written between slashes. Table2 lists the inventory of Myanmar consonant phonemes as defined by the International Phonetic Association (IPA) [3], [4], [11].

Table2. The inventory of Myanmar consonant phonemes

A. Myanmar Phonological Tones

The Myanmar language has four tones and a simple syllable structure that consists of an initial consonant followed by a vowel with an associated tone. This means that all syllables in Myanmar have prosodic features. Different tones give different meanings to syllables with the same phoneme structure. In the Myanmar writing system, a tone is represented by a diacritic mark [3], [4].

As shown in Figure1, the fundamental frequency rises gradually from Tone 1 to Tone 4. Tone 1 starts at a relatively level range and tends to go down slightly. Tone 2 starts at a relatively level range, goes up, and then falls relatively low. Tone 3 starts at a relatively high range, usually higher than or as high as the peak of Tone 2, and falls relatively low. Tone 4 starts at a high range, frequently higher than or as high as the peak of Tone 2, and falls low, but not as low as Tone 3 because it stops very suddenly before it can drop lower [4]. The general contrastive features of the four phonological tones, as given by the analysis of their fundamental frequency, are illustrated in Figure1.

Figure1. Four Tones of Myanmar Language

There are four tones in the Myanmar language. Their lengths are:

Tone 1: 18.50 Cs

Tone 2: 21.03 Cs

Tone 3: 15.44 Cs

Tone 4: 10.35 Cs

A Myanmar toneme is thus characterized by its rate or duration, where the length of the tone is defined as its rate or duration. Of the four tones, Tone 2 is the longest and Tone 4 is the shortest. The redundant features of the four tones are described in Table3.

Table3. Features of Myanmar Tones

Description    Tone1    Tone2    Tone3    Tone4
Rate           2        3        1        0
Duration       18.5     21.03    15.44    10.35
Low            +        +        -        -
High           -        +        +        +
Low-Falling    +        +        +        -
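Table3 can be held as a small lookup structure. The sketch below is one possible Python encoding; the names `TONES` and `rate_from_duration` and the dictionary layout are assumptions, not part of the paper. It also checks an observation that follows from the table: the Rate row is the rank of each tone's duration, with 0 for the shortest tone.

```python
# Values copied from Table3; the dictionary layout is an assumption.
TONES = {
    1: {"rate": 2, "duration": 18.50, "low": True,  "high": False, "low_falling": True},
    2: {"rate": 3, "duration": 21.03, "low": True,  "high": True,  "low_falling": True},
    3: {"rate": 1, "duration": 15.44, "low": False, "high": True,  "low_falling": True},
    4: {"rate": 0, "duration": 10.35, "low": False, "high": True,  "low_falling": False},
}

def rate_from_duration(tones):
    """Rank tones by duration: 0 for the shortest, increasing with length."""
    ordered = sorted(tones, key=lambda t: tones[t]["duration"])
    return {tone: rank for rank, tone in enumerate(ordered)}

ranks = rate_from_duration(TONES)
# The derived ranks agree with the Rate row of Table3.
print(all(ranks[t] == TONES[t]["rate"] for t in TONES))
```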

B. Phonological Structure of Myanmar Language

The Myanmar language uses a rather large set of 50 vowel

phonemes, including diphthongs, although its 22 to 26 consonants are close to average. Some languages, such as

French, have no phonemic tone or stress, while several of the

Kam-Sui languages have nine tones, and one of the Kru

languages, Wobe, has been claimed to have 14, though this is

disputed. The most common vowel system consists of the

five vowels /i/, /e/, /a/, /o/, /u/. The most common consonants

are /p/, /t/, /k/, /m/, /n/. Relatively few languages lack any of

these, although it does happen: for example, Arabic lacks /p/, standard Hawaiian lacks /t/, Mohawk and Tlingit lack /p/ and /m/, Hupa lacks both /p/ and a simple /k/, colloquial Samoan lacks /t/ and /n/, while Rotokas and Quileute lack /m/ and /n/ [5]. Table4 shows the phonetic signs of the 50 Myanmar vowels used to pronounce Myanmar words. These 50 phonemes comprise the basic symbols at four tone levels [3].

Figure2. Combination of Phoneme Syllable

Phonology describes how speech sounds are organized and how they affect one another in pronunciation. The combination of a consonant phoneme and a vowel phoneme produces a syllable, as shown in Figure2. The phonetic alphabet is usually divided into two main categories, vowels and consonants. Vowels are always voiced sounds, produced with the vocal cords in vibration, while consonants may be either voiced or unvoiced. Vowels have considerably higher amplitude than consonants, and they are also more stable and easier to analyze and describe acoustically. Because consonants involve very rapid changes, they are more difficult to synthesize properly [6].

Table4. Phonetic Signs of Myanmar Vowels

C. Five Phonological Rules for Myanmar Language

Why construct phonological rules? Myanmar speech is of two types: sentence-based speech and word-based speech. Both types are described in this paper. Word-based speech can be handled conveniently with the Myanmar phonetic dictionary. Sentence-based speech for the Myanmar language is proposed in this research, and its pronunciation problems can be solved by applying phonological rules. Phonological rules are often written using distinctive features, which are natural characteristics that describe the acoustic and articulatory makeup of a sound; by selecting a particular bundle, or "matrix," of features, it is possible to represent a group of sounds that form a natural class and pattern together in phonological rules [10].

The Myanmar language has many phonological rules, both rules that do not depend on part of speech and rules that depend on grammatical structure. The pronunciation problems of sentence-based speech for the Myanmar language are solved by applying the five phonological rules [4], [7].

D. Algorithms for Five Phonological Rules

Rule 1 uses substitution, as found in communication theory and computational linguistics (for instance, statistical natural language processing). It applies reduction rules to reduce vowels with glottalized (ɂ) and nasal (˜) tones. The vowel reduction algorithm for Rule 1 is shown in Figure3 [7].

Rule 2 changes the pronunciation of /tiɂ/ to [də], a rule of the metathesis type, when the next phoneme is /ka/, /sa/, /za/, /ta/ or /pa/. The system finds the number /tiɂ/ in the input phoneme sequence and checks whether the next phoneme is /ka/, /sa/, /za/, /ta/ or /pa/. If so, the system changes /tiɂ/ to [də]. If not, processing continues with the next rule. The “DA” pronunciation algorithm for this metathesis rule is illustrated in Figure4 [7].
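The check described for Rule 2 can be sketched in a few lines of Python. This is a minimal illustration rather than the algorithm of Figure4: the phoneme sequence is assumed to be a list of strings, and the names `TRIGGERS` and `apply_rule2` are hypothetical.

```python
# Hypothetical representation: a phoneme sequence is a list of strings.
TRIGGERS = {"ka", "sa", "za", "ta", "pa"}   # following phonemes named in the text

def apply_rule2(phonemes):
    """Replace /tiɂ/ with [də] when the next phoneme is one of TRIGGERS."""
    result = []
    for i, ph in enumerate(phonemes):
        nxt = phonemes[i + 1] if i + 1 < len(phonemes) else None
        if ph == "tiɂ" and nxt in TRIGGERS:
            result.append("də")
        else:
            result.append(ph)   # unchanged; later rules would be tried here
    return result

print(apply_rule2(["tiɂ", "ka"]))
print(apply_rule2(["tiɂ", "ma"]))
```

In the first call the following phoneme is in the trigger set, so /tiɂ/ is replaced; in the second it is not, so the sequence is returned unchanged.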

Figure3. Vowel Reduction Algorithm

Figure4. Changing to 'DA' Pronunciation Algorithm

Rule 3 inserts a nasal phoneme according to the type of obstruent, as shown in Table5. If a bilabial obstruent follows the voiced vowel, [m] is inserted. If a dental obstruent follows the voiced vowel, [ṉ] is inserted. If an alveolar obstruent follows the voiced vowel, [n] is inserted. If a palato-alveolar obstruent follows the voiced vowel, [ɲ] is inserted. If a velar obstruent follows the voiced vowel, [ŋ] is inserted.

Table5. Obstruent Types and Nasal Phonemes

Rule 3 inserts a nasal phoneme between a voiced asat and an obstruent consonant. The system finds the voiced vowel asat in the input phoneme sequence and then checks the type of the following obstruent consonant. If the type matches, the nasal phoneme is inserted between them; if not, the system goes on to the next rules. Figure5 (a) describes the process for the five obstruent types, and Figure5 (b) shows the detailed procedure for each of them.
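The mapping from obstruent type to inserted nasal lends itself to a table lookup. The Python sketch below takes the five pairs from the text; the `PLACE_OF` lookup is a hypothetical, partial assignment of a few consonants to places of articulation, and the function name `insert_nasal` is likewise an assumption.

```python
# Place of articulation -> nasal, as listed in the text for Rule 3.
NASAL_FOR_PLACE = {
    "bilabial": "m",
    "dental": "ṉ",
    "alveolar": "n",
    "palato-alveolar": "ɲ",
    "velar": "ŋ",
}

# Hypothetical, partial place lookup for a few obstruents (assumption).
PLACE_OF = {"p": "bilabial", "b": "bilabial", "t": "alveolar",
            "d": "alveolar", "k": "velar", "g": "velar"}

def insert_nasal(prev_is_voiced_asat, next_consonant):
    """Return the nasal to insert before next_consonant, or None if Rule 3 does not apply."""
    if not prev_is_voiced_asat:
        return None
    place = PLACE_OF.get(next_consonant)      # None for unknown consonants
    return NASAL_FOR_PLACE.get(place)

print(insert_nasal(True, "p"), insert_nasal(True, "k"), insert_nasal(False, "k"))
```

Returning `None` corresponds to the case in which the system moves on to the next rules.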

Figure5. (a) Filling Nasal Phoneme Algorithm

Figure5. (b) Processes of Procedures for Filling Nasal Algorithm

Rule 4 covers phonemes whose pronunciation is unchanged after /aɂ/ အ: an unvoiced phoneme is not changed to a voiced phoneme when the preceding phoneme is /aɂ/ အ. Rule 5 changes unvoiced phonemes to voiced phonemes depending on the voiced consonants, voiced vowels and voiced asats. If the unvoiced phoneme is in the first position, it is not changed to a voiced phoneme. If an unvoiced consonant occurs with a voiced vowel or voiced asat, it changes to the voiced phoneme. Rules 4 and 5 are handled together by the pronunciation algorithm [7].

The combined pronunciation algorithm for Rules 4 and 5 is presented in Figure6. It resolves the confusion between unchanged and changed pronunciation of unvoiced phonemes. The algorithm finds the unvoiced consonants in the input phoneme sequence. For each unvoiced consonant found, the first step is to check its position: if it is in the first position, it is not changed to a voiced consonant. If it is not in the first position and the current and previous vowels or asats in the phoneme sequence are unvoiced, it is not changed either. Otherwise, if the preceding phoneme is /aɂ/, the pronunciation is also left unvoiced [7].
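The decision sequence for Rules 4 and 5 can be sketched as follows. This is a rough Python illustration under several assumptions, not the algorithm of Figure6: each syllable is a dictionary with an onset, a rhyme and a caller-supplied flag saying whether the rhyme (vowel or asat) is voiced, and `VOICED` is a hypothetical table of unvoiced-to-voiced pairs.

```python
# Hypothetical unvoiced -> voiced pairs (assumption, for illustration only).
VOICED = {"k": "g", "s": "z", "t": "d", "p": "b"}

def apply_voicing(syllables):
    """Each syllable: {"onset": str, "rhyme": str, "rhyme_voiced": bool}.

    Returns the list of onsets after the voicing decision.
    """
    out = []
    for i, syl in enumerate(syllables):
        onset = syl["onset"]
        # An unvoiced onset in the first position is never changed.
        change = onset in VOICED and i > 0
        if change:
            prev = syllables[i - 1]
            if prev["rhyme"] == "aɂ":
                change = False      # Rule 4: no voicing after /aɂ/
            elif not syl["rhyme_voiced"] and not prev["rhyme_voiced"]:
                change = False      # current and previous rhymes both unvoiced
        out.append(VOICED[onset] if change else onset)
    return out

demo = [
    {"onset": "k", "rhyme": "a", "rhyme_voiced": True},
    {"onset": "p", "rhyme": "a", "rhyme_voiced": True},
]
print(apply_voicing(demo))
```

In `demo` the first onset stays unvoiced because of its position, while the second is voiced because it follows a voiced rhyme other than /aɂ/.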

Figure6. Changing Pronunciation Algorithm

V. DESIGN OF CONCATENATIVE SPEECH SYNTHESIS

Concatenative speech synthesis cuts and pastes short segments of speech: segments are selected from a pre-recorded database and joined one after another to produce the desired utterance. In theory, the use of real speech as the basis of synthetic speech brings the potential for very high quality, but in practice there are serious limitations, mainly due to the memory capacity required by such a system. The longer the selected units are, the fewer problematic concatenation points occur in the synthetic speech, but the memory requirements increase at the same time. Another limitation of concatenative synthesis is the strong dependency of the output speech on the chosen database. For example, the personality or the affective tone of the speech is hardly controllable. Despite this somewhat featureless nature, concatenative synthesis is well suited for certain limited applications [1]. Concatenative synthesis is based on the concatenation, or stringing together, of segments of recorded speech. Generally, it produces the most natural-sounding synthesized speech. It is easier to obtain a more natural sound with longer units, which can achieve high segmental quality. Among these techniques, this paper focuses on diphone concatenation-based synthesis for Myanmar text-to-speech research.

A. Myanmar Diphone Database Construction

The basic idea behind building a Myanmar diphone database is to list explicitly all possible phone-phone transitions in the language. One technique is to embed target words in carrier sentences to ensure that the diphones are pronounced with acceptable duration and prosody. The speech synthesis unit finds the corresponding pre-recorded sounds in its database and tries to concatenate them smoothly. It uses an algorithm such as TD-PSOLA (Time-Domain Pitch Synchronous Overlap and Add) to make a smooth transition between diphones. The PSOLA method takes two speech signals, one ending with a voiced part and the other starting with a voiced part, and changes their pitch values so that the pitch values on both sides of the joint become equal. The advantage of this technique is better output speech than that of other techniques [1].
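The overlap-add step at the heart of TD-PSOLA can be illustrated with a short, self-contained Python sketch. It is a deliberately simplified version and not the paper's implementation: it assumes a single constant pitch period, takes the pitch marks as given, and uses hypothetical names (`hann`, `td_psola`). Each frame spans two pitch periods, is weighted by a Hann window, and is added into the output at a new pitch mark.

```python
import math

def hann(length):
    """Periodic Hann window; copies shifted by length/2 sum to 1."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / length) for n in range(length)]

def td_psola(signal, marks, new_marks, period):
    """Overlap-add two-period frames centred on `marks` at positions `new_marks`.

    Simplification (assumption): one constant pitch period for the whole signal.
    """
    window = hann(2 * period)
    out = [0.0] * (max(new_marks) + period)
    for target in new_marks:
        # Use the analysis frame whose pitch mark is closest to the target mark.
        source = min(marks, key=lambda m: abs(m - target))
        for n in range(2 * period):
            i = source - period + n      # index into the input signal
            j = target - period + n      # index into the output signal
            if 0 <= i < len(signal) and 0 <= j < len(out):
                out[j] += window[n] * signal[i]
    return out

period = 20
signal = [math.sin(2.0 * math.pi * n / period) for n in range(200)]
marks = list(range(period, 181, period))   # one pitch mark per period
same = td_psola(signal, marks, marks, period)
```

When the new marks equal the original marks, the overlapping windows sum to one and the interior of the signal is reproduced unchanged. Placing the new marks closer together or further apart than the original ones raises or lowers the pitch, which is how the pitch on the two sides of a joint can be brought to the same value.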

The diphone database is labeled with Arpabet signs so that the retrieved phonemes can be identified. Once the phoneme sequence is known, one option is to retrieve each individual phoneme from a database of only 50 phonemes and concatenate them; this would be the most economical choice for saving space on embedded devices. Diphones, in contrast, are pairs of partial phonemes: instead of representing a single phoneme, a diphone represents the end of one phoneme and the beginning of another. Stress might be recovered from the pronouncing dictionary by taking into account the 1 or 0 designation applied to vowels. Using diphones is significant because there is less variation in the middle of a phoneme than at its beginning and ending edges [15]. The problem is that it greatly increases the size of the database, to around 10496 diphones for the Myanmar language: 114 x 114 - 2500, where 114 = 22 consonants + 42 exception words + 50 vowels and 2500 = 50 vowels x 50 vowels. Vowel-vowel pairs do not occur in Myanmar phoneme sequences, so the number of vowel-vowel pairs is subtracted from the total. The Arpabet signs for the 22 consonants are given in Table6, and the Arpabet signs for the 50 vowels [14] are shown in Table7.
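The database size quoted above can be checked with a few lines of arithmetic. The sketch below only restates the counts given in the text; the variable names are illustrative.

```python
# Unit inventory sizes as stated in the text.
consonants = 22
exception_words = 42
vowels = 50

units = consonants + exception_words + vowels   # all unit types
all_pairs = units * units                        # every ordered pair of units
vowel_vowel_pairs = vowels * vowels              # pairs excluded from the database
diphones = all_pairs - vowel_vowel_pairs

print(units, all_pairs, vowel_vowel_pairs, diphones)
```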

The diphone list is divided into 14 categories [15]: Consonants-Consonants, Consonants-Exception Words, Consonants-Vowels, Exception Words-Consonants, Exception Words-Exception Words, Exception Words-Vowels, Vowels-Consonants, Vowels-Exception Words, Consonants-Silence, Exception Words-Silence, Vowels-Silence, Silence-Consonants, Silence-Exception Words and Silence-Vowels pairs.
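As a rough illustration, the category list can be generated from the four unit classes by taking every ordered pair and dropping the two that are not listed. Excluding Vowels-Vowels follows the text; excluding Silence-Silence is an assumption made here only because that pair does not appear in the list.

```python
from itertools import product

classes = ["Consonants", "Exception Words", "Vowels", "Silence"]

# Pairs left out: Vowels-Vowels (stated in the text) and
# Silence-Silence (assumed, since it is absent from the list).
excluded = {("Vowels", "Vowels"), ("Silence", "Silence")}

categories = [
    f"{first}-{second}"
    for first, second in product(classes, repeat=2)
    if (first, second) not in excluded
]

for name in categories:
    print(name)
print(len(categories))
```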

Table6. Arpabet Signs for 22 Consonants

Phoneme for Consonants    Arpabet for Diphone
k                         K
kh                        KH
g                         G
ŋ                         NG
s                         S
sh                        SA
z                         Z
ɲ                         NYA
t                         T
th                        HT
d                         D
n                         N
t                         T
th                        TH
n                         N
p                         P
ph                        PH
b                         B
m                         M
j                         Y
l                         L
w                         W
ɵ                         TH
h                         HH
l                         L
á                         AA3

Table7. Arpabet Signs for 50 Vowels

B. Diphone Recording

The recordings were read by a native Myanmar speaker in a professional recording session at LA studio in Mandalay, Myanmar. The diphone database was completed in four hours. Two hundred sentences of different lengths were recorded, in order to start building a unit-selection database from which the sentences can be re-synthesized. The sentences were taken from two sources. The first is news items from a Myanmar daily newspaper, chosen for its use of modern formal language. Approximately thirty sentences were taken from this source. These sentences have an average length of 20 words, and they were difficult to record because of their length. The second source, for the remainder of the sentences, is the “Myanmar Grammar Book” published by the Myanmar language group. These sentences are short and easy to use since the vowelling is already done. The language and grammar in the book are modern and therefore a good starting point for testing the system.


VI. LABELING DIPHONE INDEX

A diphone database consists of a dictionary file, a set of waveform files and a set of pitch mark files. The dictionary file, also called the diphone index, identifies which file each diphone comes from, and from where in that file. The index consists of a simple header followed by one line per diphone: the diphone name, the file name without any extension, a start position in seconds, a mid position and an end position, also in seconds [8]. Table8 shows the labeled diphone index for the sentence “#-KYA-AY4-HT-IY2-Y-AA2-TH-IY1-D-AH-D-AW1-T-EH4-D-AH-D-AW1-Z-IH2-PHY-IH3-AA3-SA-IH2-AA3-T-EH4-MY-AA2-TH-IY1-#”.

Table8. Labeling the Diphone Index

Each diphone runs from the mid pitch mark of the first phone to the mid pitch mark of the following phone. Each pitch mark file consists of a simple list of the positions, in seconds and in order, of the pitch marks in the file, one per line.
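A minimal sketch of reading the two file types described above is given here, assuming whitespace-separated fields. The header text, file names and time values in the sample are hypothetical and are not taken from Table8; header lines are skipped simply because they do not have five fields ending in three numbers.

```python
from typing import NamedTuple

class DiphoneEntry(NamedTuple):
    name: str        # diphone name, e.g. "KYA-AY4"
    file_stem: str   # waveform file name without extension
    start: float     # start position in seconds
    mid: float       # mid position in seconds
    end: float       # end position in seconds

def parse_index(text):
    """Map diphone name -> DiphoneEntry, skipping header and blank lines."""
    entries = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue  # header or blank line
        try:
            start, mid, end = (float(value) for value in fields[2:])
        except ValueError:
            continue  # five fields, but not a diphone line
        entries[fields[0]] = DiphoneEntry(fields[0], fields[1], start, mid, end)
    return entries

def parse_pitch_marks(text):
    """Read one pitch mark position (in seconds) per non-empty line."""
    return [float(line) for line in text.splitlines() if line.strip()]

# Hypothetical sample data, for illustration only.
sample_index = """\
HYPOTHETICAL_HEADER
#-KYA rec_0001 0.100 0.165 0.230
KYA-AY4 rec_0001 0.230 0.310 0.402
"""
sample_marks = "0.0125\n0.0204\n0.0287\n"

index = parse_index(sample_index)
marks = parse_pitch_marks(sample_marks)
print(index["KYA-AY4"])
print(marks)
```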

VII. DIPHONE-CONCATENATIVE SYNTHESIS

Diphone-concatenative speech synthesis joins one phone to another in a way that reduces the discontinuity at the joints between phones. This paper highlights a synthesis technique based on diphone concatenation. The synthesis stage is a well-known challenge for high-quality speech production in a Myanmar Text-To-Speech system. In concatenative synthesis, the most common unit choices are phonemes and diphones because they are short enough to give sufficient flexibility and to keep the memory requirements reasonable. Using diphones helps output quality because a diphone contains the transition from one phoneme to the next: the latter half of the first phoneme and the former half of the second. Consequently, the concatenation points are located at the center of each phoneme, and since this is usually the steadiest part of the phoneme, the distortion at the boundaries can be expected to be minimal. While a sufficient number of different phonemes in a database is typically around 200, the corresponding number of diphones is from 4500 to 5000, but a synthesizer with a database of this size is generally implementable. To avoid audible distortions caused by differences between successive segments, at least the fundamental frequency and the intensity of the segments must be controllable. Fully natural prosody in synthetic speech cannot be achieved with present-day methods, but some promising methods for removing the discontinuities have been developed. Finally, concatenative speech synthesis is burdened by the laborious process of creating the database from which the units are selected. Each phoneme, together with all of the needed allophones, must be included in the recording, and then all of the needed units must be segmented and labeled to enable searching the database.
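To make the unit structure concrete, the sketch below splits a hyphen-separated phone sequence into adjacent phone pairs. It uses the first part of the example sentence quoted earlier in the paper. Treating every adjacent pair of symbols as one diphone is an assumption of this sketch and may differ from how the pairs are counted in the paper's tables.

```python
def to_diphones(sequence, sep="-"):
    """Split a phone sequence and pair each phone with its successor."""
    phones = sequence.split(sep)
    return [f"{first}{sep}{second}" for first, second in zip(phones, phones[1:])]

# '#' marks silence at the sentence boundary.
fragment = "#-KYA-AY4-HT-IY2-Y-AA2-TH-IY1"
pairs = to_diphones(fragment)

for pair in pairs:
    print(pair)
```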

VIII. TD-PSOLA METHOD

The Pitch Synchronous Overlap and Add method was developed at France Telecom (CNET). It allows prerecorded speech samples to be concatenated smoothly and provides good control of pitch and duration. The time-domain version, TD-PSOLA, is the most commonly used because of its computational efficiency. The basic algorithm consists of three steps:
1. the original speech signal is divided into separate short analysis signals,
2. each analysis signal is modified into a synthesis signal, and
3. the synthesis step, where these segments are recombined by means of overlap-adding [1].

The purpose of TD-PSOLA (Time-Domain) is to modify the pitch or timing of a signal, as shown in Figure7. The algorithm finds the pitch points of the signal and then applies a Hamming window centered on each pitch point and extending to the previous and next pitch points. To slow the speech down, the system duplicates frames; to speed it up, the system removes frames from the signal.
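The frame duplication and removal described above can be sketched in plain Python. This is an illustrative simplification, not the authors' implementation: it changes duration only (no pitch modification), it assumes the pitch marks are given as strictly increasing sample indices, and it uses Hann-shaped ramps rather than a Hamming window so that overlapping frames sum to one. The function names and the synthetic test signal are hypothetical.

```python
import math

def frame_window(left_len, right_len):
    """Hann-shaped window that rises over left_len samples, peaks at the
    pitch mark, and falls to zero over right_len samples."""
    rise = [0.5 - 0.5 * math.cos(math.pi * i / left_len) for i in range(left_len)]
    fall = [0.5 + 0.5 * math.cos(math.pi * i / right_len) for i in range(right_len + 1)]
    return rise + fall

def extract_frames(signal, marks):
    """One windowed frame per interior pitch mark, spanning previous to next mark."""
    frames = []
    for k in range(1, len(marks) - 1):
        left, center, right = marks[k - 1], marks[k], marks[k + 1]
        window = frame_window(center - left, right - center)
        samples = [s * w for s, w in zip(signal[left:right + 1], window)]
        frames.append((samples, center - left, right - center))
    return frames

def td_psola_stretch(signal, marks, factor):
    """Change duration by repeating (factor > 1) or dropping (factor < 1) frames."""
    frames = extract_frames(signal, marks)
    n_out = max(1, round(len(frames) * factor))
    out = []
    pos = frames[0][1]  # position of the first synthesis pitch mark
    for j in range(n_out):
        k = min(len(frames) - 1, int(j / factor))  # analysis frame to reuse
        samples, left_len, right_len = frames[k]
        start = pos - left_len
        end = start + len(samples)
        if end > len(out):
            out.extend([0.0] * (end - len(out)))
        for i, value in enumerate(samples):
            if start + i >= 0:
                out[start + i] += value  # overlap-add
        pos += right_len  # next synthesis mark, one local period later
    return out

# Synthetic periodic signal with one pitch mark per period (illustration only).
period = 50
signal = [math.sin(2 * math.pi * i / period) for i in range(500)]
marks = list(range(0, 500, period))

same = td_psola_stretch(signal, marks, 1.0)
slow = td_psola_stretch(signal, marks, 2.0)   # every frame used twice
fast = td_psola_stretch(signal, marks, 0.5)   # every second frame dropped
print(len(same), len(slow), len(fast))
```

With a factor of 1.0 the overlapping ramps sum to one, so the interior of the signal is reproduced unchanged; this is a convenient sanity check before trying other factors.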

Figure7. TD-PSOLA Algorithm

TD-PSOLA requires exact marking of the pitch points in the time-domain signal. Any point within a pitch period may be marked, as long as the algorithm marks the same point in every frame. The most common marking point is the instant of glottal closure, which is identified by a rapid descent in the time-domain signal. The algorithm creates an array of sample numbers that comprise an analysis epoch sequence $P = \{p_1, p_2, \ldots, p_n\}$, and it estimates half the distance between consecutive pitch marks, $(p_{k+1} - p_k)/2$, to get the mid-point between the pitch marks.
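The mid-point computation can be sketched as follows, assuming the epoch sequence is a list of increasing sample numbers. The function name and the sample values are hypothetical.

```python
def mark_midpoints(marks):
    """Return (half_distance, midpoint) for each consecutive pair of pitch marks."""
    result = []
    for p_k, p_next in zip(marks, marks[1:]):
        half = (p_next - p_k) / 2
        result.append((half, p_k + half))
    return result

# Hypothetical epoch sequence in samples.
epochs = [100, 180, 264]
print(mark_midpoints(epochs))
```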

Table9 gives the overlap times at the pitch marks, with an overlap of 0.03s, for the voice waveforms. The table lists the 22 diphone pairs for a 20-word Myanmar


sentence. Speech synthesis for this sentence uses 20 diphone-concatenation pairs.

The waveforms are compared below. Figure8 shows the original waveform, without the TD-PSOLA method, for the first 4 words, “#-KYA-AY4-HT-IY2-Y-AA2-TH-IY1”; its length is 1.206s.

Figure8. Original Waveform of First 4 Words

Table9 Defining the Pitch Marks with 0.03s

Figure9 shows the overlapped waveforms with 0.03s pitch marks. Using the TD-PSOLA method for the Myanmar language, the waveform is smoothed across each joint between one waveform and the next. With each joint overlapped by 0.03s, the speech is faster and smoother than the original speech waveform. The total length of the overlapped waveform is shorter than that of the original waveform without any method: it is reduced by 0.05s from 1.206s. Table10 then shows the overlapping of pitch marks with 0.05s between one joint and the next.

Figure9. Overlapping with 0.03s Pitch Marks

Table10 shows the overlapping pitch marks with 0.05s for the voice waveforms of the 20-word Myanmar sentence. The Hanning window is calculated with an overlap of 0.05s at the pitch marks in each diphone label. The start and end values of the Hanning window change according to the values of the overlapping pitch marks. The sound quality is better than with the 0.03s overlapping pitch marks.

Table10 Defining the Pitch Marks with 0.05s

Figure10 shows the overlapped waveforms with 0.05s pitch marks. Using the TD-PSOLA method for the Myanmar language, the waveform is smoothed across each joint between one waveform and the next. With each joint overlapped by 0.05s, the speech is faster and smoother than both the original speech waveform and the waveform with 0.03s overlapping pitch marks. The total length of the overlapped waveform is shorter than that of the original waveform without any method: it is reduced by 0.12s from 1.206s.
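Why overlapping shortens the output can be seen in a small cross-fade sketch: each joint shares a fixed number of samples between two neighbouring segments, so the total length drops by that amount per joint. This is an illustration under simplifying assumptions (Hann-shaped fade, fixed overlap in samples, every segment longer than the overlap) and does not attempt to reproduce the durations reported above. The function name and the test data are hypothetical.

```python
import math

def crossfade_concat(segments, overlap):
    """Join sample lists, overlapping each joint by `overlap` samples.

    The outgoing segment fades out and the incoming one fades in with
    complementary Hann-shaped weights, so the weights always sum to one.
    Every segment must be longer than `overlap`.
    """
    fade_in = [0.5 - 0.5 * math.cos(math.pi * (i + 0.5) / overlap) for i in range(overlap)]
    out = list(segments[0])
    for segment in segments[1:]:
        tail_start = len(out) - overlap
        for i in range(overlap):
            out[tail_start + i] = (
                out[tail_start + i] * (1.0 - fade_in[i]) + segment[i] * fade_in[i]
            )
        out.extend(segment[overlap:])
    return out

# Three constant segments of 100 samples each, joined with a 30-sample overlap.
segments = [[1.0] * 100 for _ in range(3)]
joined = crossfade_concat(segments, overlap=30)

total_before = sum(len(s) for s in segments)
print(total_before, len(joined))  # the difference is overlap * number of joints
```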

Figure10. Overlapping with 0.05s Pitch Marks

IX. EXPERIMENTAL RESULTS

This paper gives the results for diphone concatenation with the TD-PSOLA method. The system was tested with 200 Myanmar sentences whose structure is very complex. The Myanmar diphone database stores over 5000 diphones for these sentences. First, the system accepts a segmented Myanmar sentence and produces the phonetic sequence, as pairs of consonants and vowels, using the Myanmar phonetic dictionary in the grapheme-to-phoneme stage [4]. Next, the system checks the phonetic sequence against phonological rules to obtain the prosodic features. Finally, it produces high-quality speech by applying the Myanmar diphone database with the concatenation method that uses the TD-PSOLA algorithm. The experimental results of diphone-concatenation speech synthesis are measured with precision, recall and f-measure. The results for the 14 types of diphone pairs, over the total of 5275 diphone pairs for the 200 sentences, are shown in Table11.
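For reference, the three measures are computed from counts of correct and incorrect items using their standard definitions. The paper does not state how a diphone pair is judged correct, so the counts in this sketch are hypothetical and the function is only an illustration.

```python
def precision_recall_f(true_pos, false_pos, false_neg):
    """Standard precision, recall and F-measure (harmonic mean of the two)."""
    retrieved = true_pos + false_pos
    relevant = true_pos + false_neg
    precision = true_pos / retrieved if retrieved else 0.0
    recall = true_pos / relevant if relevant else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

# Hypothetical counts for one diphone category.
p, r, f = precision_recall_f(true_pos=90, false_pos=10, false_neg=30)
print(f"precision={p:.3f} recall={r:.3f} f-measure={f:.3f}")
```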

Table11 Experimental Results for Diphone-Concatenation


X. TESTING MYANMAR SPEECH QUALITY

The naturalness and intelligibility of the Myanmar speech were tested with 7 female listeners between the ages of 16 and 40. The test is divided into two parts and uses 20 pairs of confusable words. The first part tests the naturalness of the diphone-concatenative speech synthesis. The second part tests how much of what the voice said the participants understood. The participants heard one word at a time and marked on the answer sheet which of the two words they thought was correct.

A. Naturalness

The listening results for the grapheme-to-phoneme part compared with those for diphone concatenation are shown in Figure12. The system was tested with a complex 20-word sentence structure. The listeners were asked whether the voice is pleasant to listen to. For grapheme-to-phoneme conversion, 70% considered the voice natural, 60% thought that the naturalness of the voice was acceptable and 25% considered the voice unnatural. For diphone-concatenation synthesis, 95% considered the voice natural, 72% thought that the naturalness of the voice was acceptable and 5% considered the voice unnatural. The results changed slightly after the second listening.

Figure12 Comparison of Naturalness

B. Intelligibility

The intelligibility results for the voice from grapheme-to-phoneme conversion and for the speech from diphone-concatenation synthesis are shown in Figure13. The system was tested with a complex 20-word sentence structure. The listeners were asked how much of what the voice said they understood. For grapheme-to-phoneme conversion, 20% of the participants understood the voice very well, 56% understood well, 11% neither much nor little and another 15% understood a little, i.e. not very well. For diphone concatenation, 82% of the participants understood the voice very well, 88% understood well, 7% neither much nor little and another 5% understood a little.

Figure13 Intelligibility of the Voice

XI. CONCLUSIONS

This paper describes diphone-concatenation speech synthesis with the TD-PSOLA method, tested with 200 Myanmar sentences of complex structure. The Myanmar diphone database stores over 5000 diphones for these sentences. First, the system accepts a segmented Myanmar sentence and produces the phonetic sequence, as pairs of consonants and vowels, using the Myanmar phonetic dictionary in the grapheme-to-phoneme stage. Next, the system checks the phonetic sequence against phonological rules to obtain the prosodic features. Finally, it produces high-quality speech by applying the Myanmar diphone database with the concatenation method that uses the TD-PSOLA algorithm.

This paper compares the overlapping method with different overlap times, 0.03s and 0.05s, at the pitch marks of the Hanning windows. Overlapping the pitch marks by 0.05s gives better results than both the original waveforms and the waveforms with 0.03s overlapping pitch marks. The paper also compares the speech quality, in terms of naturalness and intelligibility, of the grapheme-to-phoneme and diphone-concatenation synthesis.

This system is suited to simple Myanmar sentences. It does not handle the Pali and Sanskrit words of the Myanmar language. The system currently applies five phonological rules for changing unvoiced to voiced pronunciations; these can be extended with 15 other phonological rules. This paper describes the TD-PSOLA method for a Myanmar language TTS system. The TTS system can be extended with other methods such as a source-filter model with the Festival tools or diphone-concatenative synthesis with the FD-PSOLA method.

ACKNOWLEDGMENT

I wish to express my deep gratitude and sincere appreciation to all persons who contributed towards the success of my research. It was a great opportunity


to study Myanmar text-to-speech research in one of the most well-known research fields in the world. I would like to respectfully thank Dr. Mie Mie Thet Thwin, Rector of the University of Computer Studies, Mandalay (UCSM), for her precious advice, patience and encouragement during the preparation of my research. I am grateful to my supervisor, Dr. Aye Thida, an Associate Professor of the Research and Development Department (1) at the University of Computer Studies, Mandalay (UCSM), Myanmar, and one of the leaders of the Natural Language Processing Project, for her help during the preparation of my research. I am also deeply thankful to Dr. Aye Thida for her advice from the point of view of natural language processing. I also take this opportunity to thank all our teachers at the University of Computer Studies, Mandalay (UCSM), for their teaching and guidance during my research life.

I especially thank my parents, my sisters and all my friends for their encouragement, help and kindness, for providing many useful suggestions and for giving me their precious time during the preparation of my research.

REFERENCES

[1] S. Lemmetty, “Review of Speech Synthesis Technology”, Master's Thesis, Helsinki University of Technology, 1999.

[2] Tun Thura Thet; Jin-Cheon Na; Wunna Ko Ko, “Word segmentation

for the Myanmar language”.

[3] Dr. Thein Tun, “Acoustic Phonetics and the Phonology of the

Myanmar Language”, School of Human Communication Sciences, La

Trobe University, Melbourne, Australia, 2007.

[4] Ei Phyu Phyu Soe, “Grapheme-to-Phoneme Conversion for Myanmar

Language”, the 11th International Conference on Computer

Applications (ICCA 2013).

[5] “Phoneme”, http://en.wikipedia.org/w/index.php, April 2012.

[6] D.J. RAVI Research Scholar, “Kannada Text to Speech Synthesis

Systems: Emotion Analysis”, JSS Research Foundation, S.J College of

Engg, Mysore-06, 2010.

[7] Ei Phyu Phyu Soe, “Prosodic Analysis with Phonological Rules for

Myanmar Text-to-Speech System”, AICT 2013.

[8] Alan W Black and Kevin A Lenzo. Building Synthetic Voices, For

FestVox 2.0 Edition. Language Technologies Institute, Carnegie

Mellon University and Cepstral, LLC, 2003b.

[9] Tractament Digital de la Parla, “Introduction to Speech Processing”.

[10] Hayes, Bruce (2009). “Introductory Phonology.” Blackwell Textbooks

in Linguistics. Wiley-Blackwell. ISBN 978-1-4051-8411-3.

[11] International Phonetic Association, “Phonetic description and the IPA chart”, Handbook of the International Phonetic Association: a guide to the use of the international phonetic alphabet, Cambridge University Press, 1999.

[12] Kleijn K., Paliwal K. (Editors), “Speech Coding and Synthesis”.

Elsevier Science B.V., the Netherlands, 1998.

[13] Santen J., Sproat R., Olive J., Hirschberg J. (editors), “Progress in

Speech Synthesis”, Springer-Verlag New York Inc, 1997

[14] “Arpabet”, 26 October 2012, [online] Available: http://en.wikipedia.org/wiki/Arpabet.

[15] Maria Moutran Assaf , “A Prototype of an Arabic Diphone Speech

Synthesizer in Festival”, 2005.

Biography

Miss Ei Phyu Phyu Soe is a Ph.D. candidate in computer science at the University of Computer Studies, Mandalay, Myanmar. Her research interests include Data Mining, Database Management Systems, Natural Language Processing and Linguistic Research. She is currently working on research in Speech Synthesis for the Myanmar Language. Ei Phyu Phyu Soe received B.C.Sc and M.C.Sc degrees from the Computer University, Mandalay, Myanmar.

Myanmar to English Translation System Project (NLP)

Dr. Aye Thida

University of Computer Studies, Mandalay (UCSM), Myanmar

Research and Development Department (1)

Dr. Aye Thida is an Associate Professor of the Research and Development Department (1) at the University of Computer Studies, Mandalay (UCSM), Myanmar. She was one of the leaders of the Natural Language Processing Project. Her team developed the Myanmar to English Translation System in 2011. Her research interests include Distributed Processing, Queuing and Natural Language Processing. She is currently working on the Myanmar to English Translation System Project. Dr. Aye Thida received her B.Sc(Hons) Maths degree from Mandalay University, Myanmar, and her M.I.Sc and Ph.D degrees in Computer Science from the University of Computer Studies, Yangon (UCSY), Myanmar.

