Diphone-Concatenation Speech Synthesis for Myanmar Language
Ei Phyu Phyu Soe, Aye Thida
University of Computer Studies, Mandalay, Myanmar

ISSN: 2278 – 7798
International Journal of Science, Engineering and Technology Research (IJSETR)
Volume 2, Issue 5, May 2013
All Rights Reserved © 2013 IJSETR
Abstract— Speech synthesis is a popular field in natural language processing. A text-to-speech system combines natural language processing (NLP) and digital signal processing (DSP). This paper describes the DSP part of speech synthesis for the Myanmar language using the diphone-concatenation method, which concatenates speech at the diphone level and applies the Pitch Synchronous Overlap and Add (PSOLA) algorithm to smooth the joints between speech signals. PSOLA has two variants, time-domain and frequency-domain; this paper describes the Time-Domain Pitch Synchronous Overlap and Add method in diphone-concatenation speech synthesis. One contribution of this paper is the construction of a Myanmar diphone database for diphone-concatenation synthesis. Concatenative synthesis reduces the problems that arise in formant synthesis, and the diphone database of Myanmar pronunciations constructed in this research reduces ambiguity in pronunciation. In diphone-concatenation synthesis, space complexity and searching time are lower than in other techniques. This paper illustrates techniques to improve the performance of text-to-speech in Myanmar speech synthesis using the TD-PSOLA (Time Domain Pitch Synchronous Overlap-Add) method, which decomposes the signal into overlapping frames synchronized to the pitch period. Diphone-concatenation synthesis maintains the consistency and accuracy of the pitch marks of the speech signal, together with a diphone database covering the vowels and consonants of the Myanmar language. Testing results are shown for a Myanmar sentence with varying overlaps of the pitch marks; the results show that the larger the overlap, the better the quality of the synthesized speech. The goal is a system that can take a word sequence and produce "human-like" speech.
Index Terms—PSOLA, diphone-concatenation, speech
synthesis, Myanmar speech, text-to-speech.
I. INTRODUCTION
A text-to-speech synthesis system imitates human speech, converting input Myanmar text into spoken language. Doing this well generally requires broad linguistic knowledge, the context the text comes from, and a deep understanding of the semantics of the text and its relations. Nevertheless, the many research and commercial speech synthesis systems developed so far have contributed to our understanding of these phenomena, and have succeeded in various ways for applications such as speech-to-speech machine translation, interactive voice response systems, reading software for the blind, linguistic research and language teaching.
Text-To-Speech technology gives computers the ability of
converting text into audible speech, with the goal of being able
to deliver information via voice message. It has been utilized to
provide easier means of communication and to improve
accessibility to textual information for people with visual impairment. Two criteria are commonly used to judge the quality of a TTS synthesizer: intelligibility, how easily the output can be understood, and naturalness, how much the output sounds like the speech of a real person. Most existing systems have reached a fairly satisfactory level of intelligibility, while significantly less success has been attained in producing highly natural speech [1].
II. TEXT-TO-SPEECH SYNTHESIS
A TTS voice is a computer program that has two major
parts: a natural language processing (NLP) which reads the
input text and translates it into a phonetic language and a
digital signal processing (DSP) that converts the phonetic
language into spoken speech. The input text might be for
example data from a word processor, standard ASCII from
e-mail, a mobile text-message, or scanned text from a
newspaper. The character string is then preprocessed and analyzed into a phonetic representation, usually a string of phonemes with additional linguistic information. A TTS system generally consists of four modules: text analysis, phonetic analysis, prosody analysis and speech synthesis.
The text analysis step has two parts: syllable segmentation and number conversion. The input text is segmented into Myanmar syllables. Syllable segmentation is the process of identifying syllable boundaries in a text; it enables generation of the phonetic sequences using the phonetic dictionary. The purpose of the number converter is to expand numbers into their textual versions. Non-standard words are tokens, such as numbers, that need to be expanded into sequences of Myanmar words before they can be pronounced; a number expands to the string of words representing its cardinal form.
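As a rough illustration of the number-conversion step, the sketch below expands a digit token into number words, digit by digit. The romanized Myanmar digit names are illustrative assumptions only, not the paper's actual lexicon, and a full cardinal expansion would additionally need place-value words.

```python
# Hedged sketch of the number converter in the text-analysis step.
# The romanized digit names below are hypothetical placeholders.
DIGIT_WORDS = {
    "0": "thone-nya", "1": "tit", "2": "hnit", "3": "thone", "4": "lay",
    "5": "ngar", "6": "chauk", "7": "khun-nit", "8": "shit", "9": "ko",
}

def expand_number(token: str) -> list:
    """Expand a digit token into number words; other tokens pass through."""
    if token.isdigit():
        return [DIGIT_WORDS[d] for d in token]
    return [token]
```

A real converter would sit before grapheme-to-phoneme conversion, so that the expanded words can be looked up in the phonetic dictionary like any other word.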
Phonetic analysis is also called grapheme-to-phoneme conversion. It translates the syllables of Myanmar text into phonetic sequences, determining the pronunciation of a syllable from its spelling. It also finds the best sequence of phonemes for words, numbers and symbols and converts them into phonetic sequences. We construct a Myanmar phonetic dictionary to generate the phoneme sequences and to pronounce these phonemes.
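The dictionary-based grapheme-to-phoneme step described above can be sketched as a simple lookup. The syllable keys and phoneme values below are hypothetical placeholders, not entries from the actual Myanmar phonetic dictionary.

```python
# Minimal sketch of dictionary-based grapheme-to-phoneme conversion.
# Keys and values are invented for illustration.
PHONETIC_DICT = {
    "syl_a": ["K", "AA2"],
    "syl_b": ["T", "IY1"],
}

def g2p(syllables, dictionary=PHONETIC_DICT):
    """Map each segmented syllable to its phoneme sequence."""
    phonemes = []
    for syl in syllables:
        if syl not in dictionary:
            raise KeyError(f"no pronunciation for syllable: {syl!r}")
        phonemes.extend(dictionary[syl])
    return phonemes
```

In the system described here, the output phoneme sequence would then be passed to the prosodic analysis step, where the phonological rules apply.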
In the prosodic analysis step, the phonetic sequences are analyzed to produce prosodic features by applying phonological rules. This module analyzes duration and intonation, such as pitch variation and syllable length, to create naturalness in the synthetic speech. The combination of a consonant phoneme and a vowel phoneme produces a syllable.
The phonetic alphabet is usually divided into two main categories, vowels and consonants. Vowels are always voiced sounds, produced with the vocal cords in vibration, while consonants may be either voiced or unvoiced. Vowels have considerably higher amplitude than consonants, and they are also more stable and easier to analyze and describe acoustically. Because consonants involve very rapid changes, they are more difficult to synthesize properly.
Speech synthesis, the automatic generation of speech waveforms, has been under development for several decades [12], [13]. Synthesized speech can be produced by several different methods, usually classified into three groups:
Articulatory synthesis, which attempts to model the human speech production system directly.
Formant synthesis, which models the pole frequencies of the speech signal, or the transfer function of the vocal tract, based on a source-filter model.
Concatenative synthesis, which uses prerecorded samples of different lengths derived from natural speech.
The concatenative method is the most commonly used in present synthesis systems and is becoming more and more popular, while the articulatory method is still too complicated for high quality implementations [1]. The aim of this paper is to improve the quality of Myanmar text-to-speech by applying concatenative speech synthesis with the TD-PSOLA (Time Domain Pitch Synchronous Overlap and Add) algorithm.
The linguistic analysis stage maps the input text into a standard form, determines the structure of the input, and finally decides how to pronounce it. The synthesis stage converts the symbolic representation of what to say into an actual speech waveform [9].
III. MYANMAR LANGUAGE
Myanmar writing does not use white space between words or between syllables, so the computer has to determine syllable and word boundaries by means of an algorithm, such as a finite-state or rule-based one. Moreover, a Myanmar syllable can be composed of multiple characters. Syllable segmentation is the process of determining such boundaries in a piece of text.
A Myanmar word can consist of one or more morphemes that are linked more or less tightly together. Typically, a word consists of a root or stem and zero or more affixes. Words can be combined to form phrases, clauses and sentences. A word consisting of two or more stems joined together is known as a compound word. To process text computationally, the words have to be determined first [2].
The purpose of this paper is to develop a Myanmar text-to-speech system and to improve the performance of high-quality synthesis by applying diphone-concatenation speech synthesis. The Myanmar language is the official language of Myanmar and is more than one thousand years old. Texts in the Myanmar language use the Myanmar script, which is descended from the Brahmi script of ancient South India. Other Southeast Asian descendants of Brahmi, known as Brahmic or Indic scripts, include Thai, Khmer and Lao.
A Myanmar text is a string of characters without explicit
word boundary markup, written in sequence from left to right
without regular inter-word spacing, although inter-phrase
spacing may sometimes be used. Myanmar characters can be classified into three groups: consonants, medials and vowels.
The basic consonants in Myanmar can be combined with medials to multiply the consonant inventory.
Syllables or words are formed by consonants combining with
vowels. However, some syllables can be formed by just
consonants, without any vowel. Other characters in the
Myanmar script include special characters, numerals,
punctuation marks and signs.
Table1. Myanmar Character
There are 34 basic consonants in the Myanmar script, as
displayed in Table1. They are known as “Byee” in the
Myanmar language [2]. Consonants serve as the base characters of Myanmar words, and are similar in pronunciation to consonants in other Southeast Asian scripts such as Thai, Lao and Khmer.
Medials are known as “Byee Twe” in Myanmar. There are
4 basic medials and 6 combined medials in the Myanmar
script. The 10 medials can modify the 34 basic consonants to
form 340 additional multi-clustered consonants. Therefore, a
total of 374 consonants exist in the Myanmar script, although
some consonants have the same pronunciation.
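The consonant counts stated above can be checked arithmetically; the short sketch below simply reproduces the paper's figures (34 basic consonants, 4 basic plus 6 combined medials).

```python
# Checking the consonant counts given in the text.
BASE_CONSONANTS = 34
MEDIALS = 4 + 6                                 # 4 basic + 6 combined medials
multi_clustered = BASE_CONSONANTS * MEDIALS     # medial-modified consonants
total = BASE_CONSONANTS + multi_clustered       # all consonants in the script
```

This reproduces the figures of 340 multi-clustered consonants and 374 consonants in total.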
Vowels are known as “Thara”. Vowels are the basic
building blocks of syllable formation in the Myanmar
language, although a syllable or a word can be formed from
just consonants, without a vowel. Like other languages, multiple vowel characters can exist in a single syllable.
Special characters in the Myanmar language are used for prescriptive nouns and as conjunctions between two or more sentences.
Numerals for Myanmar language are known as “Counting
Numbers”. Numerals are the 10 basic digits for counting.
IV. MYANMAR LANGUAGE PHONOLOGY
A phoneme is the smallest unit that distinguishes words and morphemes. Therefore, changing a phoneme of a word to
another phoneme produces a different word or a nonsense
utterance, whereas changing a phone to another phone, when
both belong to the same phoneme, produces the same word
with an odd or an incomprehensible pronunciation.
Phonemes are not physical segments themselves, but mental
abstractions of them [5]. Different acoustic realizations of a
phoneme are called allophones. The acoustic characteristics
of phonemes come from the vocal tract movement during
their articulation. There are three types of phonetic
parameters in phonology of Myanmar language: first is place of articulation, second is articulator and third is manner of
articulation [3]. The pronunciation of Myanmar words depends on these parameters. A phoneme is a contrastive unit in the sound system of a particular language, a minimal unit that serves to distinguish between the meanings of words. A phoneme can be pronounced in one or more ways, depending on its number of allophones, and is by convention written between slashes. Table2 describes the inventory of Myanmar consonant phonemes defined by the International Phonetic Association (IPA) [3], [4], [11].
Table2. The inventory of Myanmar consonant phonemes
A. Myanmar Phonological Tones
Myanmar language has four tones and a simple syllable
structure that consists of an initial consonant followed by a
vowel with an associate tone. This means all syllables in
Myanmar have prosodic features. Different tone makes
different meanings for syllables with the same structure of
phonemes. In the Myanmar writing system, a tone is
presented by a diacritic mark [3], [4].
The fundamental frequency, as shown in figure1, rises gradually from Tone 1 to Tone 4. Tone 1 starts in a relatively level range and tends to go down slightly; Tone 2 starts in a relatively level range, goes up, and then falls relatively low; Tone 3 starts in a relatively high range, usually higher than or as high as the peak of Tone 2, and falls relatively low; Tone 4 starts in a high range, frequently higher than or as high as the peak of Tone 2, and falls low, but not as low as Tone 3 because it stops very suddenly before it can drop lower [4]. The general contrastive features of the four phonological tones, derived from the analysis of their fundamental frequency, are shown in figure1:
Figure1. Four Tones of Myanmar Language
There are four tones in the Myanmar language, with the following lengths:
Tone 1 has 18.50 Cs
Tone 2 has 21.03 Cs
Tone 3 has 15.44 Cs
Tone 4 has 10.35 Cs
The Myanmar toneme is therefore described by variations in rate, or duration: the length of the tone defines its rate. Tone 2 has the longest duration and Tone 4 the shortest of the four tones. Table3 summarizes the redundant features of the four tones.
Table3. Features of Myanmar Tones
Description Tone1 Tone2 Tone3 Tone4
Rate 2 3 1 0
Duration 18.5 21.03 15.44 10.35
Low + + - -
High - + + +
Low-Falling + + + -
B. Phonological Structure of Myanmar Language
The Myanmar language uses a rather large set of 50 vowel phonemes, including diphthongs, although its 22 to 26 consonants are close to the average. Some languages, such as French, have no phonemic tone or stress, while several of the Kam-Sui languages have nine tones, and one of the Kru languages, Wobe, has been claimed to have 14, though this is disputed. The most common vowel system consists of the five vowels /i/, /e/, /a/, /o/, /u/, and the most common consonants are /p/, /t/, /k/, /m/, /n/. Relatively few languages lack any of these, although it does happen: for example, Arabic lacks /p/, standard Hawaiian lacks /t/, Mohawk and Tlingit lack /p/ and /m/, Hupa lacks both /p/ and a simple /k/, colloquial Samoan lacks /t/ and /n/, while Rotokas and Quileute lack /m/ and /n/ [5]. Table4 shows the phonetic signs of the 50 Myanmar vowels used to pronounce Myanmar words; these 50 phonemes show the basic symbols with four tone levels [3].
Figure2. Combination of Phoneme Syllable
Phonology is the study of how speech sounds are organized and affect one another in pronunciation. The combination of a consonant phoneme and a vowel phoneme produces a syllable, as in figure2. The phonetic alphabet is usually divided into two main categories, vowels and consonants. Vowels are always voiced sounds, produced with the vocal cords in vibration, while consonants may be either voiced or unvoiced. Vowels have considerably higher amplitude than consonants, and they are also more stable and easier to analyze and describe acoustically. Because consonants involve very rapid changes, they are more difficult to synthesize properly [6].
Table4. Phonetic Signs of Myanmar Vowels
C. Five Phonological Rules for Myanmar Language
Why construct phonological rules? Myanmar speech is of two types, sentence-based and word-based, and both are described in this paper. Word-based sentences are handled conveniently by applying the Myanmar phonetic dictionary; sentence-based speech for the Myanmar language is proposed in this research, and its pronunciation problems can be solved by applying phonological rules. Phonological rules are often written using distinctive features, which are natural characteristics that describe the acoustic and articulatory makeup of a sound; by selecting a particular bundle, or "matrix," of features, it is possible to represent a group of sounds that form a natural class and pattern together in phonological rules [10].
There are many phonological rules in the Myanmar language, both phonological rules without part-of-speech levels and phonological rules involving grammatical structure. The pronunciation problems of sentence-based speech for Myanmar are solved here by applying five phonological rules [4], [7].
D. Algorithms for Five Phonological Rules
Rule 1 applies vowel reduction: it uses reduction phonological rules to reduce the vowels with glottalized (ɂ) and nasal (˜) tones. The vowel reduction algorithm for Rule 1 is shown in figure3 [7]:
Rule 2 describes the pronunciation change from /tiɂ/ to [də] by applying the metathesis rule type when the next phoneme is one of /ka/, /sa/, /za/, /ta/ or /pa/. The system finds the numeral /tiɂ/ in the input phoneme sequence and checks whether the next phoneme is /ka/, /sa/, /za/, /ta/ or /pa/. If so, the system changes /tiɂ/ to [də]; if not, it continues to the next rule. The "DA"-pronunciation algorithm for the metathesis rule type is illustrated in figure4 [7].
Figure3. Vowel Reduction Algorithm
Figure4. Changing to „DA‟ Pronunciation Algorithm
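Rule 2 as described can be sketched as a single pass over the phoneme sequence. ASCII stand-ins are used here for the phonetic symbols ("ti?" for /tiɂ/ and "d@" for [də]); these encodings are assumptions for illustration only.

```python
# Sketch of Rule 2 (metathesis rule type): /tiɂ/ becomes [də] when the
# next phoneme is one of /ka/, /sa/, /za/, /ta/, /pa/.
TRIGGERS = {"ka", "sa", "za", "ta", "pa"}

def apply_rule2(phonemes):
    """Return a copy of the sequence with Rule 2 applied."""
    out = list(phonemes)
    for i in range(len(out) - 1):
        if out[i] == "ti?" and out[i + 1] in TRIGGERS:
            out[i] = "d@"
    return out
```

When the following phoneme is not one of the five triggers, the sequence passes through unchanged and, in the full system, the next rule is consulted.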
Rule 3 inserts a nasal phoneme according to the obstruent type, as shown in Table5. If the obstruent next to the voiced vowel is bilabial, [m] fills this position; if dental, [ṉ]; if alveolar, [n]; if palato-alveolar, [ɲ]; if velar, [ŋ].
Table5. Obstruent Types and Nasal Phonemes
The process of Rule 3 inserts a nasal phoneme between voiced asats and obstruent consonants. The system finds the voiced vowel asat in the input phoneme sequence and then checks the type of the following obstruent consonant. If the check succeeds, the nasal phoneme is inserted between them; if not, the system goes on to the next rules. Figure5 (a) describes the process for the five obstruent types, and Figure5 (b) shows the detailed processes for each of the five obstruent types.
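The nasal-insertion process of Rule 3 can be sketched as a table-driven pass over the phoneme sequence. ASCII stand-ins replace the IPA nasals, and the voiced-asat test is passed in as a predicate, since the actual test in the paper's algorithm is not specified here; both are assumptions for illustration.

```python
# Sketch of Rule 3: insert a nasal between a voiced asat and a following
# obstruent, chosen by the obstruent's place of articulation (cf. Table5).
NASAL_FOR_PLACE = {
    "bilabial": "m", "dental": "n_", "alveolar": "n",
    "palato-alveolar": "ny", "velar": "ng",
}

def insert_nasal(phonemes, place_of, is_voiced_asat):
    """place_of: phoneme -> place of articulation; is_voiced_asat: predicate."""
    out = []
    for i, ph in enumerate(phonemes):
        out.append(ph)
        nxt = phonemes[i + 1] if i + 1 < len(phonemes) else None
        if nxt is not None and is_voiced_asat(ph) \
                and place_of.get(nxt) in NASAL_FOR_PLACE:
            out.append(NASAL_FOR_PLACE[place_of[nxt]])
    return out
```

When the obstruent's place of articulation is not in the table, no nasal is inserted and control would pass to the next rule.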
Figure5. (a) Filling Nasal Phoneme Algorithm
Figure5. (b) Processes of Procedures for Filling Nasal
Algorithm
Rule 4 covers phonemes whose pronunciation is unchanged next to /aɂ/ အ: unvoiced phonemes are not changed to voiced phonemes when the preceding phoneme is /aɂ/ အ. Rule 5 covers the pronunciation change from unvoiced to voiced phonemes depending on the surrounding voiced consonants, voiced vowels and voiced asats. If the unvoiced phoneme is in the first position, it is not changed to a voiced phoneme; if an unvoiced consonant occurs with voiced vowels or a voiced asat, it changes to the voiced phoneme. Rules 4 and 5 are implemented together as a pronunciation algorithm [7].
The combined Rule 4 and Rule 5 pronunciation algorithm is presented in figure6. This algorithm resolves the confusion between unchanged and changed pronunciation for unvoiced-to-voiced phonemes. It finds the unvoiced consonants in the input phoneme sequence. For each unvoiced consonant found, the first step is to check its location: if it is in the first position, it is not changed to a voiced consonant. If it is not in the first position but the current and previous vowels and asats in the phoneme sequence are unvoiced, the consonant is likewise not changed. Otherwise, if the previous consonant is /aɂ/, the pronunciation is not changed to a voiced consonant [7].
Figure6. Changing Pronunciation Algorithm
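The combined Rule 4/5 decision can be sketched as follows. The unvoiced-to-voiced mapping and the "a?" encoding of /aɂ/ are illustrative assumptions, and the voiced environment is passed in precomputed rather than derived from the sequence, since the paper's exact test is shown only in its figure.

```python
# Sketch of the combined Rule 4/5 voicing decision: an unvoiced consonant
# stays unvoiced in first position or after /aɂ/ ("a?" here); otherwise,
# in a voiced environment it is replaced by its voiced counterpart.
VOICED_OF = {"k": "g", "s": "z", "t": "d", "p": "b"}   # illustrative map

def apply_voicing(phonemes, voiced_context):
    """voiced_context[i]: True when position i has voiced vowels/asats."""
    out = list(phonemes)
    for i, ph in enumerate(out):
        if ph not in VOICED_OF:
            continue
        if i == 0:                      # first position: unchanged
            continue
        if out[i - 1] == "a?":          # Rule 4: unchanged after /aɂ/
            continue
        if voiced_context[i]:           # Rule 5: voiced environment
            out[i] = VOICED_OF[ph]
    return out
```

The three guards mirror the three checks in the prose above: position, the /aɂ/ context, and the voicing of the surrounding vowels and asats.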
V. DESIGN OF CONCATENATIVE SPEECH SYNTHESIS
Concatenative speech synthesis cuts and pastes short segments of speech: segments are selected from a pre-recorded database and joined one after another to produce the desired utterances. In theory, the use of real speech as the basis of synthetic speech brings the potential for very high quality, but in practice there are serious limitations, mainly due to the memory capacity required by such a system. The longer the selected units are, the fewer problematic concatenation points occur in the synthetic speech, but at the same time the memory requirements increase. Another limitation of concatenative synthesis is the strong dependence of the output speech on the chosen database; for example, the personality or the affective tone of the speech is hardly controllable. Despite this somewhat featureless nature, concatenative synthesis is well suited for certain limited applications [1]. Concatenative synthesis is based on the concatenation, or stringing together, of segments of recorded speech. Generally, it produces the most natural-sounding synthesized speech: it is easier to obtain a natural sound with longer units, and a high segmental quality can be achieved. Among these techniques, this paper highlights a diphone concatenation-based synthesis technique for Myanmar text-to-speech research.
A. Myanmar Diphone Database Construction
The basic idea behind building a Myanmar diphone database is to explicitly list all possible phone-phone transitions in the language. One technique is to use target words embedded in carrier sentences, to ensure that the diphones are pronounced with acceptable duration and prosody. The speech synthesis unit finds the corresponding pre-recorded sounds in its database and tries to concatenate them smoothly, using an algorithm such as TD-PSOLA (Time-Domain Pitch Synchronous Overlap and Add) to make smooth transitions between diphones. The PSOLA method takes two speech signals, one ending with a voiced part and the other starting with a voiced part, and changes their pitch values so that the pitch values at both sides become equal. The advantage of this technique is a better output speech compared to other techniques [1].
The diphone database is structured with Arpabet signs to make the retrieved phonemes easy to interpret. After retrieval, each individual phoneme could in principle be fetched from a database and concatenated using only 50 phonemes, which would be the most economical choice for saving space on embedded devices. Diphones, however, are pairs of partial phonemes: instead of representing a single phoneme, a diphone represents the end of one phoneme and the beginning of another (these can be recovered from the pronouncing dictionary, taking into account the 1 or 0 stress designation applied to vowels). This is significant because there is less variation in the middle of a phoneme than at its beginning and ending edges [15]. The drawback is that it greatly increases the size of the database, to around 10496 diphones for the Myanmar language: 114 units (22 consonants + 42 exception words + 50 vowels) × 114 units − 2500 (50 vowels × 50 vowels). Vowel-vowel pairs do not occur in Myanmar phoneme sequences, so the number of double vowels is subtracted from the total. The Arpabet signs for the 22 consonants are described in Table6, and the Arpabet signs for the 50 vowels [14] are shown in Table7.
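The database size quoted above follows directly from the unit counts, as the short check below shows.

```python
# Checking the diphone count: 114 units squared, minus vowel-vowel pairs.
consonants, exception_words, vowels = 22, 42, 50
units = consonants + exception_words + vowels      # 114 phoneme units
diphones = units * units - vowels * vowels         # exclude vowel-vowel pairs
```

This reproduces the figure of around 10496 diphones for the Myanmar language.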
The diphone list will be categorized in different categories
[15]: Consonants-Consonants, Consonants-Exception
Words, Consonants-Vowels, Exception Words-Consonants,
Exception Words-Exception Words, Exception
Words-Vowels, Vowels-Consonants, Vowels-Exception
Words, Consonants-Silence, Exception Words-Silence, and
Vowels-Silence, Silence-Consonants, Silence-Exception
Words and Silence-Vowels pairs.
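The category list above is every ordered pair of the three unit classes, minus the Vowels-Vowels pair (consistent with the vowel-vowel exclusion from the database size), plus Silence on either side. A small sketch can enumerate it:

```python
# Enumerating the 14 diphone categories listed in the text.
from itertools import product

classes = ["Consonants", "Exception Words", "Vowels"]
pairs = [f"{a}-{b}" for a, b in product(classes, repeat=2)
         if (a, b) != ("Vowels", "Vowels")]          # no Vowels-Vowels pairs
pairs += [f"{c}-Silence" for c in classes]           # units before silence
pairs += [f"Silence-{c}" for c in classes]           # units after silence
```

This yields the same 14 categories given in the list above.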
Table6. Arpabet Signs for 22 Consonants
Phoneme for Consonants Arpabet for Diphone
k K
kh KH
g G
ŋ NG
s S
sh SA
z Z
ɲ NYA
t T
th HT
d D
n N
t T
th TH
n N
p P
ph PH
b B
m M
j Y
l L
w W
ɵ TH
h HH
l L
a ́ AA3
Table7. Arpabet Signs for 50 Vowels
B. Diphone Recording
The recordings were read by a native Myanmar speaker and were made in a professional recording studio, LA Studio, in Mandalay, Myanmar. The diphone database was completed in four hours. Two hundred sentences of different lengths were recorded; the reason for recording whole sentences was to start building a unit selection database capable of re-synthesizing them. The sentences were taken from two sources. The first was news items in a Myanmar daily newspaper, chosen for its use of modern formal language; approximately thirty sentences, with an average length of 20 words, were taken from this source. These recordings were hard to make because of the sentence length. The second source, for the remainder of the sentences, was a "Myanmar Grammar Book" published by the Myanmar language group. These sentences are short and easy to use since the vowel labelling is already done, and the language and grammar within the book are modern, making it a good starting point for testing the system.
VI. LABELING DIPHONE INDEX
A diphone database consists of a dictionary file, a set of
waveform files and a set of pitch mark files. The dictionary
file, also called the diphone index, identifies which diphone
comes with which files, and from where. The index consists of a simple header, followed by a single line for each diphone: the diphone name, the file name without any extension, a start position in seconds, a mid position and an end position, also in seconds [8]. Table8 describes the labeled diphone index for the sentence "#-KYA-AY4-HT-IY2-Y-AA2-TH-IY1-D-AH-D-AW1-T-EH4-D-AH-D-AW1-Z-IH2-PHY-IH3-AA3-SA-IH2-AA3-T-EH4-MY-AA2-TH-IY1-#".
Table8. Labeling the Diphone Index
The structure of a diphone runs from the mid pitch mark of the first phone to the mid pitch mark of the following phone. Each pitch mark file consists of a simple list of positions in seconds, in order, one pitch mark per line.
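The index format just described (a header line, then one line per diphone with name, file stem, and start/mid/end positions in seconds) can be parsed as follows. The sample text, including the header token and file stem, is made up for illustration.

```python
# Hedged sketch of reading a diphone index of the form described above.
def parse_diphone_index(text):
    """Parse 'name file start mid end' lines into a lookup dictionary."""
    entries = {}
    lines = text.strip().splitlines()
    for line in lines[1:]:                     # skip the simple header
        name, filename, start, mid, end = line.split()
        entries[name] = (filename, float(start), float(mid), float(end))
    return entries

sample = """HEADER
K-AA2 rec001 0.120 0.155 0.210
AA2-T rec001 0.210 0.250 0.305
"""
index = parse_diphone_index(sample)
```

At synthesis time, such a lookup gives the waveform file and the time span to cut for each requested diphone.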
VII. DIPHONE-CONCATENATIVE SYNTHESIS
Diphone-concatenative speech synthesis joins one phone to another so as to reduce the discontinuity at the joints between phones. This paper highlights a diphone concatenation-based synthesis technique; this synthesis stage is a central challenge in producing high-quality speech in a Myanmar text-to-speech system. In concatenative synthesis the most common unit choices are phonemes and diphones, because they are short enough to give sufficient flexibility while keeping the memory requirements reasonable. Using diphones for concatenation gives good quality because a diphone contains the transition from one phoneme to another, together with the latter half of the first phoneme and the former half of the second. Consequently, the concatenation points are located at the centers of phonemes, and since the center is usually the steadiest part of a phoneme, the distortion at the boundaries can be expected to be minimal. While a sufficient number of different phonemes in a database is typically around 200, the corresponding number of diphones is 4500 to 5000, yet a synthesizer with a database of this size is still generally implementable. To avoid audible distortions caused by differences between successive segments, at least the fundamental frequency and the intensity of the segments must be controllable. Creating fully natural prosody in synthetic speech is not possible with present-day methods, but some promising methods for removing the discontinuities have been developed. Finally, concatenative speech synthesis is burdened by the laborious process of creating the database from which the units are selected: each phoneme, together with all of its needed allophones, must be included in the recording, and then all of the needed units must be segmented and labeled to enable search of the database.
VIII. TD-PSOLA METHOD
The Pitch Synchronous Overlap and Add method was developed by France Telecom (CNET). It allows prerecorded speech samples to be concatenated smoothly and provides good control of pitch and duration. The time-domain version, TD-PSOLA, is the most commonly used because of its computational efficiency. The basic algorithm consists of three steps:
1. the original speech signal is divided into separate short analysis signals,
2. each analysis signal is modified into a synthesis signal, and
3. in the synthesis step these segments are recombined by means of overlap-adding [1].
The purpose of TD-PSOLA is to modify the pitch or timing of a signal, as shown in figure7. The TD-PSOLA algorithm finds the pitch points of the signal and then applies a Hamming window centered on each pitch point and extending to the next and previous pitch points. To slow the speech down, the system duplicates frames; to speed it up, it removes frames from the signal.
Figure7. TD-PSOLA Algorithm
TD-PSOLA requires exact marking of pitch points in the time-domain signal. Marking any point within a pitch period is acceptable as long as the algorithm marks the same point in every frame; the most common marking point is the instant of glottal closure, identified by a rapid descent in the time-domain signal. The algorithm creates an array of sample numbers comprising an analysis epoch sequence P = {p1, p2, …, pn} and estimates the pitch period distance as (pk+1 − pk)/2 to obtain the mid-point between pitch marks.
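The epoch bookkeeping above can be sketched directly: given the analysis pitch marks as sample numbers, each period's mid-point is the mark plus half the distance to the next mark, and a window spanning neighbouring marks tapers each frame for overlap-add. A Hanning taper is used here for illustration (the text mentions both Hamming and Hanning windows).

```python
# Sketch of the TD-PSOLA epoch bookkeeping described above.
import math

def epoch_midpoints(pitch_marks):
    """Mid-point of each pitch period from consecutive marks (samples)."""
    return [pitch_marks[k] + (pitch_marks[k + 1] - pitch_marks[k]) // 2
            for k in range(len(pitch_marks) - 1)]

def hanning(n):
    """Hanning window of length n, used to taper each analysis frame."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]
```

In a full TD-PSOLA implementation, each windowed frame would then be repositioned at the synthesis epochs and summed, with frames duplicated or dropped to change duration.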
Table9 gives the data for the overlap times for pitch marks of 0.03 s in the waveforms of the voices. The table shows the 22 diphone pairs for a 20-word Myanmar sentence; the diphone-concatenation pairs for this sentence give 20 pairs for speech synthesis.
Comparisons of the waveforms are shown below. Figure8 shows the original waveform, without the TD-PSOLA method, for the first 4 words, "#-KYA-AY4-HT-IY2-Y-AA2-TH-IY1"; its length is 1.206 s.
Figure8. Original Waveform of First 4 Words
Table9 Defining the Pitch Marks with 0.03s
Figure9 shows the waveforms overlapped with 0.03 s pitch marks. The joints between one waveform and the next are smoothed by the TD-PSOLA method for the Myanmar language. Overlapping with 0.03 s makes the speech faster and smoother than the original speech waveforms, and the total length of the overlapped speech waveforms is shorter than the original without any method: the overlapping reduces the length by 0.05 s from 1.206 s. Table10 next shows the overlapping of pitch marks with 0.05 s between one joint and the next.
Figure 9. Overlapping with 0.03 s Pitch Marks
Table 10 shows the overlapping pitch marks with 0.05 s for
the voice waveforms of the 20-word Myanmar sentence.
The Hanning window is computed with an overlap of 0.05 s
pitch marks in each diphone label; the start and end values
of the Hanning window change according to the overlapping
pitch-mark values. The sound quality is better than with the
0.03 s overlapping pitch marks.
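The Hanning-windowed joint between two diphone waveforms can be sketched as a cross-fade over the overlap region. This is an assumed reconstruction of the concatenation step, not the paper's code; the function name and the 16 kHz sample rate are illustrative.

```python
import numpy as np

def crossfade_concat(a, b, overlap_s, sr=16000):
    """Join two diphone waveforms with a Hanning cross-fade.

    `overlap_s` is the overlap length in seconds (0.03 or 0.05 in
    the experiments above); `sr` is an assumed sample rate.
    """
    n = int(overlap_s * sr)
    n = min(n, len(a), len(b))
    win = np.hanning(2 * n)
    fade_out = win[n:]     # falling half, applied to the end of `a`
    fade_in = win[:n]      # rising half, applied to the start of `b`
    return np.concatenate([
        a[:-n],
        a[-n:] * fade_out + b[:n] * fade_in,   # overlap-add at the joint
        b[n:],
    ])
```

Note that the output is shorter than the plain concatenation by exactly the overlap length, which matches the length reductions reported for the 0.03 s and 0.05 s overlaps.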
Table 10. Defining the Pitch Marks with 0.05 s
Figure 10 shows the overlapping waveforms with 0.05 s
pitch marks. The TD-PSOLA method smooths the waveform
between one joint and the next for the Myanmar language.
With 0.05 s overlap at each joint, the speech is faster and
smoother than both the original waveforms and the 0.03 s
overlapping. The total length of the overlapped speech is
again shorter than the original waveform without any
method: the overlapping reduces the length by
0.12 s from 1.206 s.
Figure 10. Overlapping with 0.05 s Pitch Marks
IX. EXPERIMENTAL RESULTS
This section gives the results for diphone concatenation with the TD-PSOLA method. The system was tested with 200
Myanmar sentences of complex structure. The Myanmar
diphone database stores over 5000 diphones for these
sentences. First, the system accepts a segmented Myanmar
sentence and produces the phonetic sequence, with pairs of
consonants and vowels, using the Myanmar phonetic
dictionary in the grapheme-to-phoneme stage [4]. The system
then checks the phonetic sequence against phonological
rules to obtain the prosodic features. Finally, it produces
high-quality speech by applying the Myanmar diphone
database with the concatenation method that uses the
TD-PSOLA algorithm. The experimental results of
diphone-concatenation speech synthesis are evaluated with
precision, recall and F-measure. The results for 14 types of
diphone pairs, out of a total of 5275 diphone pairs for the
200 sentences, are shown in Table 11.
Table 11. Experimental Results for Diphone Concatenation
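For reference, the scores in Table 11 can be derived from counts of correctly and incorrectly matched diphone pairs in the usual way. This is a generic sketch of the metrics, not the paper's evaluation code; the count names are illustrative.

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall and F-measure from match counts.

    tp: diphone pairs correctly produced
    fp: pairs produced but incorrect
    fn: expected pairs the system failed to produce
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f
```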
X. TESTING MYANMAR SPEECH QUALITY
The test of the naturalness and intelligibility of the Myanmar
speech involved 7 female participants between the ages of 16
and 40. The test is divided into two parts using 20 pairs of
confusable words. The first part tests the naturalness of the
diphone-concatenative speech synthesis. The second part
tests how much of what the voice said the participants
understood. The participants heard one word at a time and
marked on the answer sheet which of the two words they
thought was correct.
A. Naturalness
Figure 12 below compares the listening results for the
grapheme-to-phoneme output with those for the
diphone-concatenation output. The system was tested with a
20-word sentence of complex structure. Asked whether the
voice was pleasant to listen to, for the grapheme-to-phoneme
conversion 70% of listeners considered the voice natural,
60% thought the naturalness of the voice was acceptable and
25% considered the voice unnatural. For the
diphone-concatenation synthesis, 95% considered the voice
natural, 72% thought the naturalness was acceptable and 5%
considered the voice unnatural. The results changed only
slightly after a second listening.
Figure 12. Comparison of Naturalness
B. Intelligibility
Figure 13 shows the intelligibility results for the
grapheme-to-phoneme conversion and for the
diphone-concatenation synthesis. The system was tested with a
20-word sentence of complex structure. Listeners were asked
how much of what the voice said they understood. For the
grapheme-to-phoneme conversion, 20% of the participants
understood the voice very well, 56% understood it well, 11%
understood neither much nor little, and another 15%
understood only a little. For the diphone-concatenation
synthesis, 82% of the participants understood the voice very
well, 88% understood it well, 7% neither much nor little, and
another 5% only a little.
Figure 13. Intelligibility of the Voice
XI. CONCLUSIONS
This paper describes diphone-concatenation speech
synthesis with the TD-PSOLA method, tested with 200
Myanmar sentences of complex structure. The Myanmar
diphone database stores over 5000 diphones for these
sentences. First, the system accepts a segmented Myanmar
sentence and produces the phonetic sequence, with pairs of
consonants and vowels, using the Myanmar phonetic
dictionary in the grapheme-to-phoneme stage. The system
then checks the phonetic sequence against phonological
rules to obtain the prosodic features. Finally, it produces
high-quality speech by applying the Myanmar diphone
database with the concatenation method that uses the
TD-PSOLA algorithm.
This paper compares the overlapping method with different
time-domain overlaps, 0.03 s and 0.05 s pitch marks, in the
Hanning windows. Overlapping pitch marks at 0.05 s
performs better than both the original waveforms and the
0.03 s overlapping pitch marks. The comparison of speech
quality, in terms of naturalness and intelligibility, between
the grapheme-to-phoneme output and the
diphone-concatenation synthesis is also illustrated in this paper.
This system can handle simple Myanmar sentences, but it
does not cover the Pali and Sanskrit components of the
Myanmar language. The phonological rules can be extended
with 15 further phonological rules; the current system applies
five phonological rules for changing unvoiced to voiced
pronunciations. This paper describes the TD-PSOLA method
for a Myanmar-language TTS system. The TTS system can be
extended with other methods, such as a source-filter model
with Festival tools or diphone-concatenative synthesis with
the FD-PSOLA method.
ACKNOWLEDGMENT
I wish to express my deep gratitude and sincere
appreciation to all who contributed to the
success of my research. It was a great opportunity
to study Myanmar text-to-speech research among some of the
most renowned researchers in the world. I would like to
thank Dr. Mie Mie Thet Thwin, Rector of the University of
Computer Studies, Mandalay (UCSM), for her precious
advice, patience and encouragement during the preparation
of my research. I am deeply grateful to my supervisor,
Dr. Aye Thida, Associate Professor of Research and
Development Department (1) at UCSM, Myanmar, and one
of the leaders of the Natural Language Processing Project,
for helping me during the preparation of my research and for
her advice from the point of view of natural language
processing. I also take this opportunity to thank all the
teachers of the University of Computer Studies, Mandalay
(UCSM), for their teaching and guidance during my research.
I especially thank my parents, my sisters and all my
friends for their encouragement, help, kindness, many useful
suggestions and the precious time they gave me during the
preparation of my research.
REFERENCES
[1] S. Lemmetty, "Review of Speech Synthesis Technology", Master's
Thesis, Helsinki University of Technology, 1999.
[2] Tun Thura Thet, Jin-Cheon Na and Wunna Ko Ko, "Word segmentation
for the Myanmar language".
[3] Dr. Thein Tun, “Acoustic Phonetics and the Phonology of the
Myanmar Language”, School of Human Communication Sciences, La
Trobe University, Melbourne, Australia, 2007.
[4] Ei Phyu Phyu Soe, “Grapheme-to-Phoneme Conversion for Myanmar
Language”, the 11th International Conference on Computer
Applications (ICCA 2013).
[5] “Phoneme”, http://en.wikipedia.org/w/index.php, April 2012.
[6] D.J. RAVI Research Scholar, “Kannada Text to Speech Synthesis
Systems: Emotion Analysis”, JSS Research Foundation, S.J College of
Engg, Mysore-06, 2010.
[7] Ei Phyu Phyu Soe, “Prosodic Analysis with Phonological Rules for
Myanmar Text-to-Speech System”, AICT 2013.
[8] Alan W Black and Kevin A Lenzo. Building Synthetic Voices, For
FestVox 2.0 Edition. Language Technologies Institute, Carnegie
Mellon University and Cepstral, LLC, 2003b.
[9] Tractament Digital de la Parla, “Introduction to Speech Processing”.
[10] Hayes, Bruce (2009). “Introductory Phonology.” Blackwell Textbooks
in Linguistics. Wiley-Blackwell. ISBN 978-1-4051-8411-3.
[11] International Phonetic Association, “Phonetic description and the IPA
chart", Handbook of the International Phonetic Association: a guide to
the use of the international phonetic alphabet, Cambridge University
Press, 1999.
[12] Kleijn K., Paliwal K. (Editors), “Speech Coding and Synthesis”.
Elsevier Science B.V., the Netherlands, 1998.
[13] Santen J., Sproat R., Olive J., Hirschberg J. (editors), “Progress in
Speech Synthesis”, Springer-Verlag New York Inc, 1997
[14] “Arpabet”, 26 October 2012, [online] Available:
http://en.wikipedia.org/wiki/Arpabet. 2012.
[15] Maria Moutran Assaf , “A Prototype of an Arabic Diphone Speech
Synthesizer in Festival”, 2005.
Biography
Miss. Ei Phyu Phyu Soe is a candidate of Ph.D of computer science in
University of Computer Studies, Mandalay, and Myanmar. Her research
interests include Data Mining, Database Management System, Natural
Languge Processing and Liguistic Research. She is currently working in the
research of Speech Synthesis for Myanmar Language. Ei Phyu Phyu Soe
received B.C.Sc, M.C.Sc degrees from the Computer University, Mandalay,
and Myanmar.
Myanmar to English Translation System Project (NLP)
Dr. Aye Thida
University of Computer Studies, Mandalay (UCSM), Myanmar
Research and Development Department (1)
Dr. Aye Thida is an Associate Professor in Research and Development
Department (1) at the University of Computer Studies, Mandalay (UCSM),
Myanmar. She was one of the leaders of the Natural Language Processing
Project; her team developed a Myanmar to English Translation System in
2011. Her research interests include Distributed Processing, Queuing and
Natural Language Processing. She is currently working on the Myanmar to
English Translation System Project. Dr. Aye Thida received her B.Sc (Hons)
Maths degree from Mandalay University, Myanmar, and her M.I.Sc and Ph.D
degrees in Computer Science from the University of Computer Studies,
Yangon (UCSY), Myanmar.