VOICE CONVERSION USING ARTICULATORY FEATURES
A THESIS
submitted by
BAJIBABU BOLLEPALLI
200731002
Master of Science (by Research)
in
Electronics and Communication Engineering
Language Technologies Research Centre
International Institute of Information Technology
Hyderabad- 500 032, India
JUNE 2012
To my parents, friends and guide
INTERNATIONAL INSTITUTE OF INFORMATION TECHNOLOGY
Hyderabad, India
CERTIFICATE
It is certified that the work contained in this thesis, titled “Voice conversion using artic-
ulatory features” by Bajibabu Bollepalli (200731002), has been carried out under my
supervision and is not submitted elsewhere for a degree.
Date Adviser: Dr. Kishore Prahallad
Acknowledgements
I would like to express my deepest respect and most sincere gratitude to Dr. Kishore Prahallad for his constant guidance and encouragement at all stages of my work. I am fortunate to have had numerous technical discussions with him, from which I have benefited enormously. I thank him for allowing me to explore the challenging world of speech technology and for always finding the time to discuss the difficulties along the way.
I thank my thesis committee members, Dr. Garimella Rama Murthy and Dr. Anil Ku-
mar Vuppala for sparing their valuable time to evaluate the progress of my research work.
I am thankful to Prof. B. Yegnanarayana, Dr. Suryakanth, and Dr. Rajendran for their immense support and help throughout my research work, and for all their invaluable advice on both technical and nontechnical matters. I thank my senior laboratory
members for all the cooperation, understanding and help I received from them.
I am very grateful for having had the opportunity to study among my colleagues:
Ronanki srikanth, Sathya adithya thati, Ch. Nivedita, E. Naresh kumar, P. Gangamohan,
Santosh Kesiraju, Anand Swaroop, Bharghav, Gautam Mantena, Buchi babu, Vasanth Sai,
Sivanand, Ravi shankar Prasad, Karthik, Sridhar, Aneeja, Vishala, Rambabu, Sudharasan,
Dhanujaya, Guru Prasad, Anand Joseph, Chetana, D. Rajesh, Vinay kumar Mittal - for all
the support, fruitful discussions and fun times together.
Needless to mention the love and moral support of my family. This work would not
have been possible without their support.
Bajibabu Bollepalli
Abstract
The aim of voice conversion is to transform an utterance spoken by an arbitrary (source) speaker to that of a specific (target) speaker. Text-to-speech (TTS), speech-to-speech translation, mimicry generation and human-machine interaction systems are among the numerous applications which can greatly benefit from a voice conversion module. Generally, voice conversion systems require parallel data between the source and target speakers. Parallel data is a set of utterances recorded by both the source and target speakers. With such data, one can build a mapping function at the frame level to transform the characteristics of the source speaker to those of a specified target speaker using machine learning techniques (GMMs, ANNs, etc.). These techniques perform well in the sense that humans typically perceive the transformed speech to sound more like the target speaker than the source speaker. But parallel data is not always feasible, especially in cross-lingual voice conversion, where the languages of the source and target speakers differ. In the literature, voice conversion techniques have been proposed which do not require parallel data, but they require speech data a priori from the source speaker. These techniques cannot be applied when an arbitrary source speaker wants to transform his/her voice to a target speaker without any a priori recording.
In this dissertation, we propose a method to perform voice conversion without the need for training data from the source speaker. It alleviates the need for any a priori speech data from the source speaker, and can be used for cross-lingual voice conversion. In this method, we capture the speaker-specific characteristics of the target speaker. The problem of capturing speaker-specific characteristics can be viewed as a noisy-channel modelling problem. The idea behind the noisy-channel model is as follows. Suppose a canonical form C of a speech signal (a generic, speaker-independent representation of the message in the speech signal) passes through the speech production system of a target speaker to produce
a surface form S. This surface form S carries the message as well as the identity of the speaker. One can interpret S as the output of a noisy channel for the input C. Here, the noisy channel is the speech production system of the target speaker. We used an artificial neural network (ANN) to model the speech production system of a target speaker, which captures the essential speaker characteristics of the target speaker. The choice of representation of C and S plays an important role in this method. We used articulatory features (AFs), which represent the characteristics of the speech production process, as the canonical, speaker-independent representation of the speech signal, as they are assumed to be speaker independent. However, our analysis showed that AFs carry a significant amount of speaker information in their trajectories. Thus, we propose suitable techniques to normalize the speaker-specific information in AF trajectories, and the resultant AFs are used for voice conversion. We show that the proposed method alleviates the need for source speaker data and can be applied to cross-lingual voice conversion. Subjective and objective evaluations reveal that speech transformed using the proposed approach is intelligible and possesses the characteristics of the target speaker. A set of transformed utterances corresponding to the results discussed in this work is available for listening at http://researchweb.iiit.ac.in/~bajibabu.b/vc_evaluation.html
Keywords: Voice Conversion, Artificial Neural Networks, Articulatory Features, Spectral Mapping, Noisy-channel Modelling, Cross-Lingual Voice Conversion.
Contents
Abstract iii
List of Tables viii
List of Figures xi
Abbreviations xii
1 Introduction to voice conversion 0
1.1 Voice conversion and its applications . . . . . . . . . . . . . . . . . . . 0
1.2 Acoustic cues for voice conversion system . . . . . . . . . . . . . . . . 1
1.3 Voice conversion using parallel data . . . . . . . . . . . . . . . . . . . 2
1.3.1 Feature extraction . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3.2 Alignment of parallel data . . . . . . . . . . . . . . . . . . . . 3
1.3.3 Training/testing in voice conversion . . . . . . . . . . . . . . . 4
1.3.4 Mapping function . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3.5 Evaluation metrics for voice conversion . . . . . . . . . . . . . 6
1.4 Voice conversion using non-parallel data . . . . . . . . . . . . . . . . . 8
1.5 Limitations of the current systems . . . . . . . . . . . . . . . . . . . . 11
1.5.1 Limitations using parallel data . . . . . . . . . . . . . . . . . . 11
1.5.2 Limitations using non-parallel data . . . . . . . . . . . . . . . 12
1.6 Objective and scope of the work . . . . . . . . . . . . . . . . . . . . . 13
1.7 Contributions of this thesis . . . . . . . . . . . . . . . . . . . . . . . . 14
1.8 Organization of the thesis . . . . . . . . . . . . . . . . . . . . . . . . . 15
2 Articulatory features 17
2.1 Human speech production . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 Types of articulatory features . . . . . . . . . . . . . . . . . . . . . . . 19
2.3 Extraction of articulatory features . . . . . . . . . . . . . . . . . . . . 20
2.3.1 Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3.2 Feature extraction . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3.3 Analysis phase (Encoder): MCEP to AF . . . . . . . . . . . . . 21
2.3.4 Evaluation of mapping accuracy . . . . . . . . . . . . . . . . . 25
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3 Analysis of speaker information in articulatory features 31
3.1 Speaker identification using articulatory features . . . . . . . . . . . . . 32
3.1.1 Preparation of the data . . . . . . . . . . . . . . . . . . . . . . 32
3.1.2 Feature extraction . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.1.3 Speaker modelling . . . . . . . . . . . . . . . . . . . . . . . . 33
3.1.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2 Normalizing speaker specific information . . . . . . . . . . . . . . . . 36
3.2.1 Use of smoothed AFs for speech recognition . . . . . . . . . . 38
3.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4 Use of articulatory features for voice conversion 40
4.1 Noisy-channel model . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2 Intra lingual voice conversion . . . . . . . . . . . . . . . . . . . . . . . 42
4.2.1 Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.2.2 Training target speaker’s model . . . . . . . . . . . . . . . . . 43
4.2.3 Conversion process . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2.4 Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2.5 Experiments on multiple speaker database . . . . . . . . . . . . 45
4.2.6 Mapping of excitation features . . . . . . . . . . . . . . . . . . 46
4.3 Cross-lingual voice conversion . . . . . . . . . . . . . . . . . . . . . . 47
4.3.1 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5 Summary and Conclusion 52
5.1 Summary of the work . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.2 Major contributions of the work . . . . . . . . . . . . . . . . . . . . . 53
5.3 Directions for future work . . . . . . . . . . . . . . . . . . . . . . . . 54
References 55
List of Publications 63
List of Tables
2.1 Eight articulatory properties, the classes of each property, and the number of bits required to represent each property. . . . . . . . . . . . . . . . 22
2.2 Average MCD and MOS scores of analysis-by-synthesis approach. . . 26
2.3 Frame-wise recognition using TIMIT database. . . . . . . . . . . . . . 27
2.4 Frame-wise recognition using ARCTIC database. . . . . . . . . . . . . 28
3.1 Accuracies (%) of speaker identification system using MCEPs and AFs 35
3.2 Accuracies (%) of four groups of articulatory features . . . . . . . . . 36
3.3 Phone recognition accuracies using MCEPs and smoothed AFs. AFs-‘k’
correspond to applying 11-point mean-smoothing window ‘k’ times. . . 39
4.1 MCD scores obtained between multiple speaker pairs with SLT and BDL
as target speakers. Scores in parenthesis are obtained using parallel data. 45
4.2 Subjective evaluation of voice conversion models built by using parallel
and Noisy-channel models. . . . . . . . . . . . . . . . . . . . . . . . . 47
4.3 Subjective evaluation of cross-lingual voice conversion models. Scores in
parenthesis are obtained using multi-speaker and multi-lingual encoder. 50
List of Figures
1.1 Plot of an utterance recorded by two speakers showing that their durations
differ even if the spoken sentence is the same. The spoken sentence is
“Will we ever forget it” which has 18 phones “pau w ih l w iy eh v er f
er g eh t ih t pau pau” according to the US English phoneset. Adopted
from [1]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Plot of an utterance recorded by two speakers showing that their durations
match after applying DTW. The spoken sentence is “Will we ever forget
it” which has 18 phones “pau w ih l w iy eh v er f er g eh t ih t pau pau”
according to the US English phoneset. Adopted from [1]. . . . . . . . . 4
1.3 Block diagram of training and testing modules in the voice conversion
framework. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Flowchart of the training and conversion modules of a voice conversion
system capturing speaker-specific characteristics. Notice that during training, only the target speaker's data is used. Adopted from [2] . . . . . . . 14
2.1 (1) nasal cavity, (2) hard palate, (3) alveolar ridge, (4) soft palate (velum),
(5) tip of the tongue (apex), (6) dorsum, (7) uvula, (8) radix, (9) pharynx,
(10) epiglottis, (11) false vocal cords, (12) vocal cords, (13) larynx, (14)
esophagus, and (15) trachea. Adopted from [3] . . . . . . . . . . . . . 18
2.2 Architecture of a five-layered MLFFNN with number of nodes in each
layer and type of activation function. . . . . . . . . . . . . . . . . . . . 23
2.3 (a) Waveform of the sentence “The angry boy answered but they didn’t
look up.”, (b) Expected output in binary (phonologically derived AFs).
(c) Actual output in continuous (acoustically derived AFs). . . . . . . . 25
2.4 Block diagram representation of both analysis and synthesis of AFs. . . 25
2.5 Articulatory contours of vowels w.r.t. tongue movements. Dashed line corresponds to actual binary AFs, solid line corresponds to predicted AFs. . 29
3.1 (a),(b) and (c) show unsmoothed AF contours of stops, fricatives, approx-
imants for speaker-1 and speaker-2, respectively. (d),(e) and (f) show
smoothed AF contours of Stops, Fricatives, Approximants for speaker-1
and speaker-2, respectively. . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2 Speaker identification accuracies for different levels of smoothing by 5-
point and 11-point mean-smoothing window. Level ‘k’ corresponds to
applying mean-smoothing window ‘k’ times. . . . . . . . . . . . . . . 37
3.3 Speaker identification accuracies and MCD scores for different levels of
smoothing. Level ‘k’ corresponds to applying 11-point mean-smoothing
window ‘k’ times. All scores are normalized with respect to scores with-
out smoothing (Level 0). . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.1 Mapping of arbitrary source speaker into target speaker . . . . . . . . . 41
4.2 Capturing speaker-specific characteristics as a speaker-coloring function. 42
4.3 Plot of MCD scores obtained between different speaker pairs. . . . . . . 44
4.4 Flow-chart of the training and conversion modules of a voice conversion
system capturing speaker-specific characteristics. . . . . . . . . . . . . 45
4.5 (a) Waveform of sentence “enduku babu, annadu pujari ascharyanga!.”,
(b) Phonologically derived AFs. (c) Acoustically derived AFs using multi-
speaker and mono-lingual data (English). . . . . . . . . . . . . . . . . 49
4.6 (a) Waveform of sentence “enduku babu, annadu pujari ascharyanga!.”,
(b) Phonologically derived AFs. (c) Acoustically derived AFs using multi-
speaker and multi-lingual data (English + Telugu). . . . . . . . . . . . 50
Abbreviations
AFs - Articulatory features
ANN - Artificial Neural Networks
DFW - Dynamic Frequency Warping
DTW - Dynamic Time Warping
EM - Expectation Maximization
GMM - Gaussian Mixture Models
HMM - Hidden Markov Models
LPC - Linear Prediction Coefficients
LSF - Line Spectral Frequency
MAP - Maximum A Posteriori
MCD - Mel Cepstral Distortion
MCEP - Mel-cepstral Coefficients
MFCC - Mel Frequency Cepstral Coefficients
MLSA - Mel Log Spectrum Approximation
MOS - Mean Opinion Scores
RMS - Root Mean Square
TTS - Text-to-Speech
VC - Voice Conversion
VQ - Vector Quantization
VTLN - Vocal Tract Length Normalization
Chapter 1
Introduction to voice conversion
1.1 Voice conversion and its applications
The aim of a voice conversion system is to transform the utterance of an arbitrary speaker (referred to as the source speaker) to sound as if spoken by a specific speaker (referred to as the target speaker), so that listeners perceive the source speaker's speech as if uttered by the target speaker. Voice conversion is also referred to as voice transformation or voice morphing. It involves two main processes: 1) identification of the source speaker's characteristics, and 2) replacement of the source speaker's characteristics with the target speaker's characteristics, without any loss of information in the given speech signal.
Due to its wide range of applications, there has been a considerable amount of research
effort directed towards this problem in the past few years.
• One of the most important applications of voice conversion is its use as a module in a text-to-speech (TTS) synthesis system. A unit-selection based TTS system generates synthetic speech by selecting the most appropriate sequence of natural sound units from a large database. Typically, these systems require the recording of large speech corpora by professional speakers to achieve good speech quality [4] [5]. In addition to the time and effort involved in such recordings, most people are unable to read such a large number of sentences/transcriptions in a consistent manner, and such inconsistencies can decrease the quality of the final synthesizers. However, with the use of voice conversion techniques, it is possible to build a TTS system in a new voice with typically 30-50 utterances (10-15 minutes) from a new speaker. Hence, it is advantageous to employ voice conversion to create new TTS voices with the help of existing voices [6] [7]. Voice conversion techniques can also be used to build multilingual TTS systems [7]. In this framework, units from multiple languages are recorded by one speaker per language and combined to improve the coverage of units. However, a TTS system built on such a database has multiple speaker identities in the synthesized speech. Hence, a voice conversion technique is applied to transform the synthesized utterance to a particular target speaker.
• A speech recognition system, which converts a spoken utterance into a textual sentence, has to be sufficiently robust to a large variety of speakers. Hence, voice conversion can also be used as a method for speaker normalization, by converting every speaker's data into a single speaker's voice.
• A speech-to-speech (S2S) translation system transforms speech spoken in one language into another language [8]. The speaker may be native or non-native to the target language, and S2S translation systems usually produce the synthesized voice in some other speaker's voice rather than the source speaker's. Voice conversion techniques can therefore be used in these systems to transform the synthesized speech in the target language to the source speaker's voice. Voice conversion can also be used in the film dubbing industry, or to transform an ordinary voice singing karaoke into a famous singer's voice.
1.2 Acoustic cues for voice conversion system
The objective of a voice conversion system is to transform the identity of a source speaker's voice so that it is perceived as if spoken by a specified target speaker. Hence, a complete voice conversion system should be capable of transforming all types of speaker-dependent characteristics of speech. The speaker-dependent characteristics lie at both the acoustic and linguistic levels of the speech signal [9]. The acoustic-level parameters are divided into segmental (spectral + fundamental frequency F0) and supra-segmental (prosodic) levels. The segmental-level parameters analyzed are the locations and bandwidths of formants, spectral tilt, and the characteristics of the voiced excitation by the vocal folds [10]. The supra-segmental-level parameters relate to the style of speaking and include phoneme duration, the evolution of fundamental frequency (intonation) and energy (stress) over the utterance [11]. Linguistic cues include the language of the speaker, as well as the choice of lexical patterns, details of dialect, choice of syntactic constructs, and the semantic context.
Due to difficulties in extracting and modeling supra-segmental and linguistic cues from speech signals, most current voice conversion systems focus on the segmental-level features of voice, i.e., spectral characteristics and fundamental frequency F0. A vast majority of them focus only on spectral feature transformation. Transformation of the spectral parameters of a source speaker to those of a target speaker is done by employing machine learning techniques such as vector quantization (VQ) [12] [13] [14] [15], Gaussian mixture models (GMM) [16] [17] [18] [19], artificial neural networks (ANN) [20] [21], and hidden Markov models (HMM) [22] [23]. In the scope of this thesis, we also focus on the transformation of spectral features.
1.3 Voice conversion using parallel data
A vast majority of voice conversion systems found in the literature use parallel data to find the correspondence between the spectral features of a source and a target speaker at the frame level. Parallel data is obtained when exactly the same sentences are uttered by both the source speaker and the target speaker. The availability of such parallel data enables us to establish a relationship between the utterances of the source and target speakers at the phone/frame level and build a voice conversion system. This is considered the baseline voice conversion system.
1.3.1 Feature extraction
To extract features from a given speech signal, we assume a model which is a mathemat-
ical representation of the speech production mechanism that makes the analysis, manip-
ulation and transformation of the speech signal possible. The source-filter model [24] is
widely used in various areas of speech research such as speech synthesis, speech recog-
nition, speech coding, speech enhancement, etc. In the context of source-filter modeling,
speech is defined as the output of a time-varying vocal tract system excited by a time-
varying excitation signal [25]. The time-varying filter represents the vocal tract shape,
which selectively boosts/attenuates certain frequencies of the excitation spectrum depend-
ing on the location and position of the tongue, jaw, lips and velum. The input to this filter
is an excitation signal which is a mixture of a quasi periodic signal and a noise source.
Both the excitation and the filter characteristics are represented by features which are usually extracted from the speech signal by performing a frame-by-frame analysis, where the size of a frame can vary from 5 ms to 30 ms. Spectral features are generally used to characterize the vocal tract shape or the filter (formant frequencies, linear prediction coefficients (LPCs) [26], cepstrum, line spectral frequencies (LSFs) [27], bandwidths [28], mel-frequency cepstral coefficients (MFCCs) [29], etc.). Features such as pitch period, residual, and glottal closure instants are derived from the excitation signal.
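The frame-by-frame analysis described above can be sketched in a few lines. The following Python function is a minimal illustration (the function name, the numpy dependency, and the 25 ms frame / 10 ms shift defaults are our own choices for the sketch, not a specification from this work):

```python
import numpy as np

def frame_signal(signal, sr, frame_ms=25, shift_ms=10):
    """Split a speech signal into overlapping, Hamming-windowed frames.

    The 25 ms / 10 ms settings are illustrative defaults; as noted
    above, frame sizes in practice vary from about 5 ms to 30 ms.
    Returns an array of shape (n_frames, frame_len).
    """
    frame_len = int(sr * frame_ms / 1000)
    shift_len = int(sr * shift_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // shift_len)
    window = np.hamming(frame_len)  # taper each frame to reduce spectral leakage
    frames = np.stack([
        signal[i * shift_len: i * shift_len + frame_len] * window
        for i in range(n_frames)
    ])
    return frames
```

Each row of the returned array would then be passed to a spectral analysis step (cepstrum, LPC, etc.) to produce one feature vector per frame.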
1.3.2 Alignment of parallel data
Voice conversion systems learn transformation functions from the training data of the source and target speakers. In order to map the source speaker's acoustic space to the target speaker's acoustic space, it is necessary to know the source-target correspondence between different training units. The process by which this correspondence is established is called alignment. The most preferred frame-alignment technique is dynamic time warping (DTW), almost a standard in voice conversion systems [12] [18] [30].
As the durations of the parallel utterances typically differ (as shown in Fig. 1.1), dynamic time warping is used to align the vectors of the source and target speakers. Fig. 1.1 is a plot of an utterance recorded by two speakers. The utterance consists of 18 phones, the boundaries of which are indicated by the vertical lines. It is clear from this figure that the durations of the phones in the two recorded utterances differ even though the spoken sentence is the same. Fig. 1.2 shows that the durations of the two utterances match after applying DTW.
Fig. 1.1: Plot of an utterance recorded by two speakers showing that their durations differ even if the spoken sentence is the same. The spoken sentence is “Will we ever forget it” which has 18 phones “pau w ih l w iy eh v er f er g eh t ih t pau pau” according to the US English phoneset. Adopted from [1].
Fig. 1.2: Plot of an utterance recorded by two speakers showing that their durations match after applying DTW. The spoken sentence is “Will we ever forget it” which has 18 phones “pau w ih l w iy eh v er f er g eh t ih t pau pau” according to the US English phoneset. Adopted from [1].
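The DTW alignment described above is a standard dynamic programming recursion over a frame-to-frame distance matrix. The following Python function is a minimal sketch (the plain O(N·M) formulation with Euclidean frame distance is a common textbook variant; the function and variable names are ours, and this is not the exact implementation used in the thesis):

```python
import numpy as np

def dtw_align(src, tgt):
    """Align two feature-vector sequences by dynamic time warping.

    src: (N, D) source frames; tgt: (M, D) target frames.
    Returns a list of (i, j) pairs of aligned frame indices along
    the minimum-cost warping path.
    """
    N, M = len(src), len(tgt)
    # Pairwise Euclidean distances between all source/target frames.
    dist = np.linalg.norm(src[:, None, :] - tgt[None, :, :], axis=2)
    cost = np.full((N + 1, M + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j - 1],  # match
                cost[i - 1, j],      # source frame repeated
                cost[i, j - 1],      # target frame repeated
            )
    # Backtrack from (N, M) to recover the warping path.
    path, i, j = [], N, M
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```

Applied to parallel utterances, the returned index pairs give the frame-level source-target correspondence used to train the mapping function.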
1.3.3 Training/testing in voice conversion
The schematic diagrams of the training and testing modules in parallel voice conversion are shown in Fig. 1.3(a) and Fig. 1.3(b), respectively. The training module of a voice conversion system, which transforms both the excitation and the spectral features (filter parameters) from the source speaker's acoustic space to the target speaker's acoustic space, is shown in Fig. 1.3(a). Fig. 1.3(b) shows the block diagram of the modules involved in the voice conversion testing process. During testing or conversion, the transformed spectral features, along with the excitation features, are used as input to a speech production model (source-filter) to synthesize the transformed utterance.
Fig. 1.3: Block diagram of training and testing modules in the voice conversion framework.
1.3.4 Mapping function
Mapping of spectral features
After the alignment is done, machine learning techniques such as vector quantization (VQ) [12], hidden Markov models (HMM) [22] [31] [32], Gaussian mixture models (GMM) [33] [16] [34] [35], artificial neural networks (ANN) [20] [21] [36], dynamic frequency warping (DFW) [13] and unit selection [37] are applied to obtain a transformation function between the spectral features of the source speaker's acoustic space and the target speaker's acoustic space.
Mapping of excitation features
Though the residual signal is impulse-like for voiced frames and noise-like for unvoiced frames, it contains glottal characteristics of speech that are not modeled by spectral features. The excitation signal thus also contains information that can help achieve the required conversion performance and quality.
A logarithmic Gaussian normalized transformation [38] is used to transform the fundamental frequency F0 of a source speaker to the F0 of a target speaker, as indicated in Eq. 1.1 below. The assumption here is that the major cues of speaker identity lie in the spectral features, and hence a simple linear transformation is sufficient to transform the excitation characteristics.
log(F0c) = µt + (σt/σs) (log(F0s) − µs)    (1.1)
where µs, σs and µt, σt are the means and standard deviations of the fundamental frequency in the logarithmic domain for the source and target speakers, respectively; F0s is the pitch of the source speaker and F0c is the converted pitch frequency.
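Eq. 1.1 amounts to a mean/variance normalization in the log-F0 domain: remove the source speaker's log-F0 statistics, then impose the target speaker's. A minimal Python sketch (the function name and interface are our own; in practice µ and σ would be estimated from the voiced frames of each speaker's training data):

```python
import math

def convert_f0(f0_src, mu_s, sigma_s, mu_t, sigma_t):
    """Log-Gaussian normalized F0 transformation (Eq. 1.1).

    mu_s, sigma_s / mu_t, sigma_t: mean and standard deviation of
    log(F0) for the source / target speaker.
    f0_src: F0 (Hz) of one voiced source frame.
    Returns the converted F0 in Hz.
    """
    log_f0_c = mu_t + (sigma_t / sigma_s) * (math.log(f0_src) - mu_s)
    return math.exp(log_f0_c)
```

Note that a source frame at the source speaker's mean log-F0 maps exactly to the target speaker's mean log-F0, as the equation requires.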
1.3.5 Evaluation metrics for voice conversion
A successful voice conversion system must be good in terms of naturalness, intelligibility, and identity. Naturalness is how human-like the produced speech sounds. Intelligibility is the degree to which the spoken words can be correctly understood, and identity is the recognizability of the individuality of the speech. Different methods have been proposed to measure these qualities. Some are objective measures, which can be computed automatically from the audio data; they are typically faster and cheaper to obtain as they do not involve human experiments. Others are subjective measures, which are based on the opinions expressed by humans in listening evaluations, or on other human behaviour.
6
Objective measures
Distance measures are most commonly used for providing objective scores. One of them is spectral distortion (SD), which has been widely used to quantify spectral envelope conversions. For example, Abe et al., 1988 [12] measured the ratio of the spectral distortion between the transformed and target speech to that between the source and target speech as follows:
R = SD(trans, tgt) / SD(src, tgt)    (1.2)
where R is the normalized distance, SD(trans, tgt) is the spectral distortion between the transformed and the target speaker utterances, and SD(src, tgt) is the spectral distortion between the source and the target speaker utterances.
A comparison of the performance of different types of conversion functions using a warped root mean square (RMS) log-spectral distortion measure was reported in [16]. Similar spectral distortion measures have been reported by other researchers [33] [39]. In addition, excitation spectrum, RMS energy, F0 and duration distances have also been used to measure excitation, energy, fundamental frequency and duration conversions [23].
Mel cepstral distortion (MCD) is another objective error measure, which appears to correlate with subjective test results [35]. Thus MCD is used to measure the quality of voice transformation [34]. MCD is related to vocal characteristics and hence is an important measure to check the performance of the mapping obtained by an ANN/GMM network. MCD is essentially a weighted Euclidean distance defined as:
ANN/GMM network. MCD is essentially a weighted Euclidean distance defined as:
MCD = (10/ln10) ∗
√√2 ∗
25∑k=1
(cek − ct
k)2 (1.3)
where cti and ce
i denote the target and the estimated Mel-cepstral coefficients, respectively.
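Eq. 1.3 can be computed directly from two frames of Mel-cepstral coefficients. A minimal Python sketch (the function name is ours; following the equation, the sum runs over 25 coefficients, excluding the 0th energy term):

```python
import numpy as np

def mel_cepstral_distortion(c_target, c_estimated):
    """Mel cepstral distortion (Eq. 1.3) between one frame of target
    and estimated Mel-cepstral coefficients (25 values each).
    Returns the distortion in dB; identical frames give 0.
    """
    diff = np.asarray(c_target) - np.asarray(c_estimated)
    # (10 / ln 10) converts the natural-log cepstral distance to dB.
    return (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2))
```

In an evaluation, this per-frame distortion would be averaged over all (DTW-aligned) frames of the test utterances to give the reported MCD score.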
Subjective measures
Subjective measures are based on collecting and analyzing human opinions. Their advantage is that they relate directly to human perception, which is typically the standard for judging the quality of transformed speech. Their disadvantages are that they are time-consuming, expensive, and difficult to interpret.
Two popular listening tests are:
1. Mean opinion score (MOS): This test is used to evaluate the naturalness and intelligibility of converted speech. The participants are asked to rate the transformed speech in terms of its quality and/or intelligibility. It is similar to the similarity test, the major difference being that the similarity test concentrates on speaker characteristics, whereas the MOS test concentrates on quality and intelligibility.
2. Similarity test: The MOS score does not determine how similar the transformed speech and the target speech are. Hence, a similarity measure is used, where the participants are asked to grade, on a scale of 1 to 5, how close the transformed speech is to the target speaker's speech. A score of 5 means that the transformed and the target speech sound as if spoken by the same speaker, and a score of 1 indicates that the two utterances sound as if they are from totally different speakers.
1.4 Voice conversion using non-parallel data
Most voice conversion techniques use parallel corpora for training, i.e., the source speaker and the target speaker record the same set of utterances. In a realistic voice conversion application, however, only non-parallel corpora may be available during the training phase. Since it is not always feasible to obtain parallel utterances for training, methods have been proposed with the goal of reducing the recordings required from the source speaker. All such methods use non-parallel training data, and their goal is to find a one-to-one correspondence between the frames of the source and target speakers. The different kinds of methods that work with non-parallel data are described below.
1. Class mapping: In this method, the source and target vectors are separately classi-
fied into clusters using vector quantization. It involves two levels of alignment:
(a) First level: Each source speaker acoustic class is aligned to one of the target
speaker acoustic classes by searching the closest frequency-warped centroid.
(b) Second level: The vectors inside each class are mean-normalized and frame-
level alignment is performed by finding the nearest neighbour of each source
vector in the corresponding target class.
This technique was evaluated using objective measures, and it was found that its
performance was not as good as that obtained using parallel data [40]. However, this
method was proposed as a starting point for further improvements that led to the
development of the dynamic programming method.
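The two alignment levels above can be sketched in a few lines. This is a rough illustration only: it substitutes a simple k-means for the vector quantizer and plain Euclidean centroid matching for the frequency-warped centroid search of [40].

```python
import numpy as np

def vq_class_alignment(src, tgt, k=4, seed=0):
    """Two-level alignment sketch: cluster both speakers' frames with a
    simple k-means (VQ), map each source class to the nearest target
    centroid, then pair mean-normalized frames by nearest neighbour."""
    rng = np.random.default_rng(seed)

    def kmeans(X, k, iters=20):
        C = X[rng.choice(len(X), k, replace=False)]
        for _ in range(iters):
            lab = np.argmin(((X[:, None] - C[None]) ** 2).sum(-1), axis=1)
            C = np.array([X[lab == j].mean(0) if np.any(lab == j) else C[j]
                          for j in range(k)])
        return C, lab

    Cs, ls = kmeans(src, k)
    Ct, lt = kmeans(tgt, k)
    # First level: each source class -> closest target centroid (the
    # original method searches frequency-warped centroids; plain
    # Euclidean distance is used here for simplicity).
    cls_map = np.argmin(((Cs[:, None] - Ct[None]) ** 2).sum(-1), axis=1)
    pairs = []
    for j in range(k):
        S = src[ls == j] - Cs[j]                  # mean-normalized source class
        T = tgt[lt == cls_map[j]] - Ct[cls_map[j]]
        if len(S) == 0 or len(T) == 0:
            continue
        # Second level: nearest neighbour of each source vector in the
        # corresponding target class.
        nn = np.argmin(((S[:, None] - T[None]) ** 2).sum(-1), axis=1)
        pairs += [(s, t) for s, t in zip(np.flatnonzero(ls == j),
                                         np.flatnonzero(lt == cls_map[j])[nn])]
    return pairs
```

Each returned pair is a (source frame index, target frame index) correspondence on which a conversion function could then be trained.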
2. Speech recognition: Typically, speech recognition systems use a set of speaker-
independent HMMs to model the parameters of the speech signal. In this technique [41],
speaker-independent HMMs are used to label the source and target speaker utter-
ances at the frame level with state indices. Given the state sequence of one speaker, the
alignment procedure consists of finding the longest matching sub-sequences from
the other speaker, until all the frames are paired. The HMMs used for this task give
good results for intra-lingual alignment. However, the suitability of such models
for cross-lingual alignment tasks has not been tested yet.
3. Pseudo parallel corpora created for TTS: In some applications, like customiza-
tion of a text-to-speech synthesizer, a huge database of speech from the source
speaker is available. So, the TTS system can be used to generate the same sentences
that have been recorded from the target speaker. Given that a parallel training cor-
pus is now available, the parameter vectors can be aligned by DTW or HMM. The
main disadvantage of this method is that it can be applied only when there is enough
data from the source speaker to build a TTS system. This strategy is incompatible
with cross-lingual applications [32].
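Once the TTS has produced pseudo-parallel utterances, the frame alignment itself is standard dynamic time warping. A minimal DTW sketch with a Euclidean frame distance and no path constraints (real systems usually add slope and band constraints):

```python
import numpy as np

def dtw_align(X, Y):
    """Plain DTW sketch: align two feature sequences (e.g. source MCEPs
    and TTS-generated pseudo-parallel target MCEPs) and return the
    warping path as (source_frame, target_frame) pairs."""
    n, m = len(X), len(Y)
    D = np.full((n + 1, m + 1), np.inf)   # accumulated cost matrix
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(X[i - 1] - Y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack from (n, m) to recover the optimal path
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        k = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if k == 0:
            i, j = i - 1, j - 1
        elif k == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```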
4. Dynamic programming: This method is based on the unit selection paradigm.
Given a set of N source vectors S, dynamic programming is used to find the sequence of N target vectors T that minimizes the acoustic distance between the two
speakers. The distance measure is computed by a cost function such as the one
used in TTS systems to concatenate two units. In a unit selection based TTS sys-
tem, there are two costs involved: the target cost and the concatenation cost. In
TTS systems, the target cost considers the distance between the acoustic, prosodic
and phonetic characteristics of the target units and those predicted by the TTS itself
according to previously trained models, whereas in this alignment system the
target cost considers only the acoustic distance between the vectors of the source
speaker and those of the target speaker [37] [42] [43].
One important advantage of the alignment technique based on dynamic programming is that it establishes the correspondence between frames using only acoustic
information. Its performance is satisfactory even for cross-lingual applications.
However, it has two drawbacks: (a) it is very time-consuming, and (b) increasing
the size of the training database implies worsening the conversion scores, since the
optimal sequence of the target speaker is closer to the source speaker when there
are more frames available for selection.
Therefore, a new method for estimating pseudo-parallel data was proposed in [9].
Each source vector was mapped to its nearest neighbor in the target acoustic space,
and each target vector to its nearest neighbor in the source acoustic space, allowing
one-to-many and many-to-one alignments. When a voice conversion system using
GMM was trained on such aligned data, it was observed that an intermediate converted voice was obtained. That is, it was neither recognized as the source speaker's
voice nor as the target speaker's voice. When this proposed approach was applied
on the transformed data and the target speaker data, it resulted in an output closer to
the target speaker than the previously transformed sentences. If this procedure was
followed iteratively, the final voice was found to converge to the target speaker’s
voice.
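The bidirectional nearest-neighbour pairing described above can be sketched as follows; in the full method of [9], a GMM conversion function is trained on these pairs and the pairing is then re-estimated iteratively on the transformed data.

```python
import numpy as np

def nn_pseudo_parallel(src, tgt):
    """Pair every source vector with its nearest target vector and every
    target vector with its nearest source vector, allowing one-to-many
    and many-to-one alignments."""
    # Pairwise squared Euclidean distances, shape (n_src, n_tgt)
    d = ((src[:, None] - tgt[None]) ** 2).sum(-1)
    pairs = {(i, int(j)) for i, j in enumerate(d.argmin(axis=1))}   # src -> tgt
    pairs |= {(int(i), j) for j, i in enumerate(d.argmin(axis=0))}  # tgt -> src
    return sorted(pairs)
```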
5. Adaptation technique: This technique is based on building a transformation mod-
ule on the existing parallel data of an arbitrary source-target speaker pair and then
adapting this model to a particular pair of speakers for which no parallel data is
available [44]. Suppose A and B are the two speakers between whom we need to
build a transformation function, but the recorded utterances by these speakers are
not parallel. Suppose we also have parallel recorded utterances from speakers C
and D. We could then estimate a transformation function between speakers C and
D and use adaptation techniques to adapt the conversion model to speakers A and
B.
Cross-lingual voice conversion is the most extreme situation in terms of alignment.
Voice conversion systems dealing with different languages have some special require-
ments because the utterances available for training are characterized by different phoneme
sets. Obviously, the main difference between intra-lingual and cross-lingual alignment is
that it is not possible to obtain parallel corpora from utterances in different languages, so
the most popular alignment strategies are no longer valid. On the other hand, training
cross-lingual voice conversion functions would not be problematic at all if the
alignment problem were solved.
1.5 Limitations of the current systems
In the previous section, the state-of-the-art speech modelling and feature transformation
techniques employed in voice conversion framework have been discussed. The existing
methods have been shown to work reasonably well and are capable of achieving convincing identity transformations when a pair of speakers with similar characteristics is
involved. However, if the conventional conversion techniques are extended to more
extreme applications, such as cross-lingual voice conversion, emotion conversion and
speech repair, the results are far from convincing.
1.5.1 Limitations using parallel data
One of the main limitations of current voice conversion systems is the need to have both
the source and the target speakers record a matching set of utterances, referred to as parallel data. A
mapping function obtained on such parallel data can be used to transform spectral char-
acteristics from the source speaker to the target speaker [12] [33] [16] [20] [36] [45] [46].
However, the use of parallel data has many limitations:
1. If either of the speakers changes, then a new transformation function has to be
estimated which requires collection of parallel data from a new speaker.
2. If there are differences between the utterances of source and target speakers in terms
of recording conditions, duration, prosody, etc., then it introduces alignment errors,
which in turn leads to a poorer estimation of the transformation function.
3. Collection of parallel data is not always feasible. Collecting a parallel set of recordings from both speakers in a naturally time-aligned fashion [30] is a costly and
time-consuming task.
4. When applying voice conversion in a speech-to-speech translation, we desire the
target voice that is synthesized by a text-to-speech system to be identical to the
source speaker's voice. Since the source and target languages are different, it is very
unlikely to have parallel utterances of both speakers. We can classify this problem
as the need to acquire training data for cross-lingual voice conversion.
1.5.2 Limitations using non-parallel data
Section 1.4 explains the methods which align non-parallel data for training a voice conversion system. While these techniques avoid the need for parallel data, they still require
speech data (non-parallel data) from the source speakers a priori to build the conversion
models. This is a limitation for an application where an arbitrary user intends to transform
his/her speech to a pre-defined target speaker without recording anything a priori. Thus, it
is worthwhile to investigate conversion models which capture the speaker-specific char-
acteristics of a target speaker and avoid the need for speech data from source speaker for
training. Such conversion models not only allow an arbitrary speaker to transform his/her
voice to a pre-defined target speaker but also find applications in cross-lingual voice con-
version systems.
1.6 Objective and scope of the work
The main objective of this work is to alleviate the requirement of source speaker data in
intra-lingual voice conversion and reduce the complexity in obtaining training data for a
cross-lingual voice conversion system. We propose a method to capture the speaker-specific
characteristics of a target speaker. Such a method needs to be trained only on target
speaker data, and hence any arbitrary source speaker's speech could be transformed to the
specified target speaker.
Desai et al., 2010 [1] and Prahallad, 2010 [2] proposed a method to capture the speaker-specific characteristics of a target speaker. To our knowledge, this is the only previous
work which does not require source data a priori. They used an ANN model to
capture the speaker-specific characteristics. The core idea of this work is as follows.
Let L and S be two different representations of the target speaker’s speech signal. A
mapping function Ω(L) could be built to transform L to S . Such a function would be
specific to the target speaker and could be considered as capturing the essential speaker-
specific characteristics. The choice of representations L and S plays an important role
in building such mapping networks and their interpretation. In their work, they assume
that L represents speaker-independent (linguistic) information, and S represents linguistic
and speaker information. Then a mapping function from L to S should capture speaker-
specific information in the process. They used the first six formants, their bandwidths, and
delta features as the representation of L. The formants undergo a normalization technique
such as vocal tract length normalization (VTLN) to compensate for the speaker effect. S
is represented by traditional mel-cepstral features (MCEPs). They introduce the concept
of an error-correction network, which is essentially an additional ANN used to
map the predicted MCEPs to the target MCEPs so that the final output features represent
the target speaker in a better way. A schematic diagram of the training and conversion
modules is shown in Fig. 1.4. Notice that during training, only the target speaker's data is
used. The limitations of this work are:
• The formants are used to represent the language information (L) in the speech signal.
So, it is necessary to extract correct formants from the speech signal, but it is very
difficult to find a method that extracts exact formants from a given signal [47].
• Theoretically speaking, the number of formants varies from phone to phone. However, in this work, 6 formants are used for every phone, so this is not an optimal
representation for a phone.
• VTLN is used to normalize the speaker effect in formants; the method does not work
without it.
• This work uses an error correction network to improve the performance of the sys-
tem. It is a separate ANN mapper which adds more computations and parameters
to the system.
In this work, we investigate alternatives such as articulatory features for speaker-independent
representation of speech signal.
Fig. 1.4: Flowchart of the training and conversion modules of a voice conversion system capturing speaker-specific characteristics. Notice that during training, only the target speaker's data is used. Adopted from [2].
1.7 Contributions of this thesis
In this thesis, we propose articulatory features (AFs) as the canonical form, or speaker-independent representation, of the speech signal. The AFs used in this work represent characteristics of the speech production process
like the manner of articulation, place of articulation, lip rounding, etc. These features are
motivated by the human speech production mechanism. Chapter 2 briefly explains AFs
and how they can be extracted from a given speech signal. These features have been used
for automatic speech recognition (ASR) with the aim of better pronunciation modeling,
better co-articulation modeling, robustness to cross-speaker variation and noise, multi-lingual and cross-lingual portability of systems, language identification and expressive
speech synthesis. In these studies, the articulatory features derived from the acoustics are often treated as a generic or speaker-independent representation of the speech signal. However,
we show that AFs contain a significant amount of speaker information in their trajectories.
Thus, we propose suitable techniques to normalize the speaker-specific information in AF
trajectories and the resultant AFs are used for voice conversion.
1.8 Organization of the thesis
The contents of this thesis are organized as follows: In Chapter 2, we briefly explain
articulatory features and the features we use in this work. The methods to extract these
features from a given speech signal are also discussed. We summarize previous research
on the use of articulatory features for various speech systems.
In Chapter 3, we analyze the speaker-specific information in articulatory features by
conducting speaker identification experiments with Gaussian mixture models. We show
that AFs contain significant amounts of speaker information in their contours. We propose a technique to normalize the speaker-specific information in the AFs. Finally, we
conclude that the speaker-specific information in AFs has to be normalized before they
are used in voice conversion as a canonical form of the speech signal.
Chapter 4 proposes a new method that captures speaker-specific characteristics and
hence resolves the issue of requiring source speaker data for voice conversion training.
Finally, we conclude this chapter with experiments and results of this method when tested
in a cross-lingual voice conversion scenario.
In Chapter 5, we summarize the contributions of the present work and highlight some
issues arising out of the study.
Chapter 2
Articulatory features
This chapter introduces the concept of articulatory features that are used in this work and
methods to extract these features from a given speech signal. Section 2.1 gives a brief in-
troduction about the human speech production process, and the role articulatory features
play in describing it. In Section 2.2, we describe different types of articulatory features
(AFs) and the type of articulatory features that are modeled in this work. Section 2.3 ex-
plains the extraction of AFs from speech signal using ANNs and discusses some objective
measures used to evaluate the accuracy of extracted AFs. The summary of this chapter is
presented in Section 2.4.
2.1 Human speech production
The production of human speech is mainly based on the modification of an egressive
air stream by the articulators in the human vocal tract [3]. The activity of the vocal
organs in making a speech sound is called articulation. It involves three major processes:
1) the air stream process, 2) the phonation process, and 3) the configuration of the vocal
tract (oro-nasal process). The air stream process describes how sounds are produced and
manipulated by the source of air. The pulmonic egressive mechanism is based on air
being exhaled from the lungs, while the pulmonic ingressive mechanism produces sounds
while inhaling air. Ingressive sounds, however, are rather rare. The phonation process
Fig. 2.1: (1) nasal cavity, (2) hard palate, (3) alveolar ridge, (4) soft palate (velum), (5) tip of the tongue (apex), (6) dorsum, (7) uvula, (8) radix, (9) pharynx, (10) epiglottis, (11) false vocal cords, (12) vocal cords, (13) larynx, (14) esophagus, and (15) trachea. Adopted from [3]
occurs at the vocal cords. Voiced sounds are produced by narrowing the vocal cords
when air passes through them. An open glottis leads to unvoiced sounds; in that case,
air passes through the glottis without obstruction, so that the air stream is continuous.
In the oro-nasal process, the vocal tract can be described as a system of cavities.
The major components of the vocal tract are illustrated in Figure 2.1. The vocal tract
consists of three cavities: the oral cavity, the nasal cavity, and the pharyngeal cavity.
These components provide a mechanism for the production of different speech sounds,
by obstructing the air stream or by changing the frequency spectrum. Several articulators
can be moved in order to change the vocal tract characteristic.
2.2 Types of articulatory features
AFs can be broadly classified into three types, based on the model used to
capture them:
1. Theoretical models
2. Medical scanning models
3. Linguistically derived models
Theoretical models, such as Maeda's model [48] or the lossless tube model [49],
were proposed in the late 1970s. These models were used for early work on acoustic-to-articulatory
inversion. They are classic, but have little practical interest.
Medical scanning models use scanning devices such as X-ray cineradiography [50],
electromagnetic midsagittal articulography (EMMA) or electromagnetic articulography
(EMA) [51], and electropalatography (EPG) [52] to acquire the articulatory state directly
from a human subject. These devices measure the trajectories of the movements of the
articulators, which vary slowly with time. They show that the speech organs are in
continuous motion during the act of speaking. The same can be observed by looking at a
spectrogram representation of speech. Only a few limited datasets of this type are freely
available, such as the MOCHA database [53], with EMA, EPG and laryngography data, and the
EUR-ACCOR database [54], with EPG, laryngography and pneumotachography (measurements of
nasal and oral airflow velocity) data. Obviously, the acquisition of such data is a
difficult and quite expensive task, so it is not possible to use these features as acoustic
features in general.
Linguistically derived models describe the AFs in a different manner. These models
use the knowledge of linguistics, and particularly phonetics. Each phoneme of the spoken
language is related to a vector of features that describe, in a somewhat abstract sense, the
articulatory state. These features can be either multivalued or binary. Multivalued features
often describe the articulatory state in terms of the place and the manner of articulation.
An example of such a set of features can be found in [55]. On the other hand, binary
features describe the articulatory state as the presence or absence of a specific phonetic
quality. The justification of these features is based on the seminal work found in [56]. A
third kind of features, Government Phonology primes [57], has also been used in the
same sense. These features can also be called “Phonological Features”, as they may have
a functional, as opposed to a strictly articulatory, meaning. We prefer to have a unified
view, believing that there is a great degree of overlap between the two kinds of features. Due to
the ease of extraction of these features, we have used them in our work. In the following
section, we briefly explain the extraction of these features from a given speech signal.
2.3 Extraction of articulatory features
The problem of extraction/prediction of articulatory features (AFs) from a speech signal
is called 'acoustic-to-articulatory inversion of speech', or simply 'inversion', and has
attracted many researchers and scientists during the last 35 years [58]. It is very difficult
to extract AFs from a given speech signal; the problem is considered ill-posed. The
reasons for this difficulty are its one-to-many nature (a given articulatory state has
only one acoustic realization, but an acoustic signal can be the outcome of more than one
articulatory state) and its high non-linearity (two somewhat similar articulatory states may
give rise to totally different acoustic signals).
A number of approaches have been proposed in the quest for a solution to the acoustic-
to-articulatory inversion problem such as codebook approaches [59], neural network ap-
proaches [60], constrained optimization approaches [61], analytical approaches [62] as
well as stochastic modelling [63] and statistical inference methods such as mixture den-
sity networks [64] or Kalman filtering [65]. Most of these methods build a separate artic-
ulatory classifier for each AF type. Models are trained to learn to predict the presence or
absence of an AF type, and finally the outputs of these classifiers are concatenated to form
an AF vector. In this work, we use artificial neural networks (ANNs) to extract AFs
from the acoustic signal by building a mapper from the acoustic space to the articulatory
space, as they give promising results compared to other methods. Such a mapper uses
a smaller number of parameters and also preserves the dependencies or correlations among
AFs.
2.3.1 Database
The experiments presented here are carried out on the TIMIT database [66]. TIMIT comprises hand-labeled and segmented data of quasi-phonetically balanced sentences read by
native speakers of American English. It consists of 630 speakers (70% male and 30% female) from 8 different dialect regions of the United States. Each speaker has approximately 30
seconds of speech, spread over ten utterances. The data is designed to have rich phonetic
content, consisting of 2 dialect sentences (SA), 450 phonetically compact sentences
(SX) and 1890 phonetically diverse sentences (SI). The training set (3698 utterances)
consists of all SI and SX sentences from 432 speakers. The test set (1344 utterances)
consists of all SI and SX sentences from the 168-speaker test set. The speaker sets in
training and testing are mutually exclusive.
2.3.2 Feature extraction
To extract spectral features from the speech signal, a source-filter model of speech is
applied. Mel-cepstral coefficients (MCEPs) are extracted with a frame size of 25 ms and a
fixed frame advance of 5 ms. The number of MCEPs extracted for every 5 ms frame is 25 [67].
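As a quick sanity check on these analysis parameters, the number of MCEP frames an utterance yields can be computed as below (assuming a frame is emitted whenever a full 25 ms window fits; exact counts depend on the analysis tool's padding policy):

```python
def num_frames(duration_s, frame_size_ms=25, frame_shift_ms=5):
    """Number of analysis frames for an utterance of the given duration,
    counting one frame per full window position."""
    ms = duration_s * 1000.0
    if ms < frame_size_ms:
        return 0
    return int((ms - frame_size_ms) // frame_shift_ms) + 1
```

A 30-second TIMIT-style recording thus yields roughly 6000 frames of 25 MCEPs each.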
2.3.3 Analysis phase (Encoder): MCEP to AF
The AFs used in this work are shown in Table 2.1. We have used eight different articulatory properties, as listed in the first column of Table 2.1. Each articulatory property
has a different number of classes, where each class is denoted by a separate dimension in
the AF space. For example, vowel length has four classes: short, long, schwa and diphthong.
To represent these four classes, we use four bits. The dimension of an AF vector is
26, which is equal to the total number of bits in the third column of Table 2.1.
Table 2.1: Eight articulatory properties, the classes of each property, and the number of bits required to represent each property.

Articulatory property      Classes                                          # bits
Voicing                    +voiced, -voiced                                   1
Vowel length               short, long, diphthong, schwa                      4
Vowel height               high, mid, low                                     3
Vowel frontness            front, mid, back                                   3
Lip rounding               +round, -round                                     1
Consonant type (manner)    stop, fricative, affricative, nasal,
                           liquid, approximant                                6
Place of articulation      labial, velar, alveolar, palatal,
                           labio-dental, dental, glottal                      7
Silence                    +silence, -silence                                 1
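The bit layout of Table 2.1 can be made concrete with a small helper. The property names, class ordering, and dictionary interface below are our own illustrative choices, not the actual implementation used in this work:

```python
# Hypothetical property/class inventory transcribed from Table 2.1;
# the bit layout (property order, class order) is an assumption.
AF_LAYOUT = [
    ("voicing", ["+voiced"]),                                   # 1 bit
    ("vowel_length", ["short", "long", "diphthong", "schwa"]),  # 4 bits
    ("vowel_height", ["high", "mid", "low"]),                   # 3 bits
    ("vowel_frontness", ["front", "mid", "back"]),              # 3 bits
    ("lip_rounding", ["+round"]),                               # 1 bit
    ("manner", ["stop", "fricative", "affricative",
                "nasal", "liquid", "approximant"]),             # 6 bits
    ("place", ["labial", "velar", "alveolar", "palatal",
               "labio-dental", "dental", "glottal"]),           # 7 bits
    ("silence", ["+silence"]),                                  # 1 bit
]

def af_vector(props):
    """Build the 26-dimensional binary AF vector for one phone from a
    dict of its phonological properties."""
    vec = []
    for name, classes in AF_LAYOUT:
        vec += [1 if props.get(name) == c else 0 for c in classes]
    return vec
```

For a phone such as /m/, passing {"voicing": "+voiced", "manner": "nasal", "place": "labial"} would set exactly the corresponding three bits.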
Mapping using ANN
Artificial Neural Network (ANN) models consist of interconnected processing nodes,
where each node represents the model of an artificial neuron, and the interconnection be-
tween two nodes has a weight associated with it. ANN models with different topologies
perform different pattern recognition tasks. For example, a feed-forward neural network
can be designed to perform the task of pattern mapping, whereas a feedback network
could be designed for the task of pattern association. A multi-layer feed forward neural
network is used in this work to obtain the mapping function between the acoustic and the
articulatory vectors.
Figure 2.2 shows the architecture of the five-layer ANN used to capture the transformation function for mapping the acoustic features onto the articulatory space. The ANN is
trained to map an MCEP vector to an AF vector, i.e., if G(x_t) denotes the ANN mapping
of x_t, then the error of the mapping is given by ε = Σ_t ||y_t − G(x_t)||². G(x_t) is defined as

G(x_t) = g(w(4) g(w(3) g(w(2) g(w(1) x_t)))),   (2.1)

where

g(κ) = κ (linear),   g(κ) = a tanh(bκ) (nonlinear).   (2.2)
Fig. 2.2: Architecture of the five-layered MLFFNN, with the number of nodes in each layer and the type of activation function.
Here w(1), w(2), w(3) and w(4) represent the weight matrices of the first, second, third and
fourth layers of the ANN, respectively. The values of the constants a and b used
in the tanh function are 1.7159 and 2/3, respectively. Generalized backpropagation
learning [20] is used to adjust the weights of the neural network so as to minimize ε, i.e.,
the mean squared error between the desired and the actual output values. Selection of
initial weights, architecture of ANN, learning rate, momentum and number of iterations
are some of the optimization parameters in training an ANN [25]. Once the training
is complete, we get a weight matrix that represents the mapping function between the
acoustic features and articulatory features. Such a weight matrix can be used to transform
a feature vector from acoustic space to a feature vector of the articulatory space.
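A minimal sketch of the forward pass of Eq. (2.1), with the alternating linear (L) and tanh (N) layers and the stated constants, is given below; the random weights stand in for trained ones, and biases are omitted for brevity:

```python
import numpy as np

# Activation constants from the thesis: a*tanh(b*k) with a = 1.7159, b = 2/3
A, B = 1.7159, 2.0 / 3.0

def mlp_forward(x, weights, kinds):
    """Forward pass G(x) of the feed-forward mapper: apply the weight
    matrices w(1)..w(4) in turn, with a tanh nonlinearity after the
    layers marked 'N' and the identity after those marked 'L'."""
    h = x
    for w, kind in zip(weights, kinds):
        h = w @ h
        if kind == "N":
            h = A * np.tanh(B * h)
    return h

# A tiny random instance of the 25L 50N 12L 50N 26L architecture:
rng = np.random.default_rng(0)
dims = [25, 50, 12, 50, 26]
ws = [rng.normal(scale=0.1, size=(dims[i + 1], dims[i])) for i in range(4)]
y = mlp_forward(rng.normal(size=25), ws, kinds=["N", "L", "N", "L"])
```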
Input and Output Representation
In this work, MCEPs are used as inputs to train the ANN mapper. The use of excitation
features or other representations of the speech signal has not been explored within the scope of
this work.
To train an MCEP-to-AF mapper, an AF representation is required for each MCEP
vector. Such knowledge could be obtained from a phonetic segmentation of speech, produced
manually or automatically. The utterances in the TIMIT database have time stamps at
phone level. These time stamps are used to determine the beginning and ending of each phone in the utterance. Also, given a phone symbol, we rely on its phonological properties to derive an
AF representation. This representation is binary in nature, and the number of bits used
to denote this representation is explained in Table 2.1. For example, the 1st bit in the
AF representation could take a value 1 or 0 based on whether the phone is voiced or un-
voiced. Thus an ANN model is trained to map an MCEP vector to the corresponding
phonologically derived AF. Although the training of the ANN model is done using a bi-
nary representation at the output layer, the final output of the ANN model is continuous.
That is, the output of the ANN model at each node is a continuous value between
0 and 1, as shown in Fig. 2.3. Figure 2.3(b) shows the phonologically derived AFs,
which are binary values, where black corresponds to bit value 1 and white
corresponds to bit value 0. Figure 2.3(c) shows the acoustically derived AFs, which
are continuous values varying from 0 to 1. The implicit assumption in the binary AF
representation of a phone is that the speech production of one phone is independent of the
others; hence, phonologically derived AFs are discrete. But actual speech production is
continuous in nature, and the production of one phone is dependent upon the next phone;
hence, acoustically derived AFs are continuous.
This difference in the expected and the actual values at the nodes in the output layer
could be attributed to contextual effects of the phones which are not captured in the phono-
logically derived AF representation. Thus, the output of the ANN model is treated as an
acoustically derived AF representation which encapsulates co-articulation, emotion and
speaker characteristics [68].
The structure of the ANN model used is 25L 50N 12L 50N 26L, where the integer
value indicates the number of nodes in each layer and L / N indicates a linear or nonlinear activation function. It is a five-layer feed-forward neural network with
three hidden layers. Generally, we set the dimension of the expansion layer to about
twice the dimension of the input layer, and the size of the compression layer to almost half
the dimension of the input layer.
Fig. 2.3: (a) Waveform of the sentence “The angry boy answered but they didn’t look up.”, (b) expected output in binary (phonologically derived AFs), (c) actual output in continuous values (acoustically derived AFs).
Fig. 2.4: Block diagram of both the analysis (MCEPs → AFs, MLFFNN 25L 50N 20L 50N 26L) and synthesis (AFs → MCEPs, MLFFNN 26L 50N 20L 50N 25L) mappings.
2.3.4 Evaluation of mapping accuracy
Measuring cepstral distortion
To evaluate how well the AFs are predicted from MCEPs, we used another ANN to map
the predicted AFs back to the original MCEPs. This mapping can be called the synthesis phase,
and the whole framework is referred to as analysis-by-synthesis, as shown in Fig. 2.4. The
structure of the ANN model used is 26L 50N 12L 50N 25L. The performance of the analysis-by-synthesis approach can be measured by the Mel-cepstral distortion (MCD) computed
between the output of the synthesis phase and the original MCEPs. MCD is related to the
filter characteristics and hence is an important measure to check the performance of the
Table 2.2: Average MCD and MOS scores of the analysis-by-synthesis approach.

Approach                 MCD     MOS    Similarity test
analysis-by-synthesis    4.604   3.97   4.44
mapping obtained by an ANN model. The MCD is computed as follows:

MCD = (10 / ln 10) * sqrt( 2 * Σ_{i=1}^{24} (c_i^o − c_i^e)² ),   (2.3)

where c_i^o and c_i^e denote the original and the estimated mel-cepstral coefficients, respectively [18].
Given that the MCEP-to-AF and AF-to-MCEP mapping networks are trained on the TIMIT
training set, we computed the MCD for all utterances in the test set (1344 utterances).
We synthesized 10 utterances from the predicted MCEPs and the original F0, using the mel-log
spectral approximation (MLSA) filter [67]. For all the experiments in this work, we
used pulse excitation for voiced sounds and random noise excitation for unvoiced
sounds.
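Eq. (2.3) is straightforward to implement. The sketch below assumes (n_frames, 25) MCEP arrays with the 0th (energy) coefficient excluded from the sum, and averages the per-frame distortion over all frames:

```python
import numpy as np

def mel_cepstral_distortion(c_orig, c_est):
    """Average MCD (in dB) of Eq. (2.3): the 0th coefficient is excluded,
    so the sum runs over coefficients 1..24 of the 25 MCEPs per frame."""
    diff = c_orig[:, 1:25] - c_est[:, 1:25]
    per_frame = (10.0 / np.log(10)) * np.sqrt(2.0 * (diff ** 2).sum(axis=1))
    return per_frame.mean()
```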
We conducted mean opinion score (MOS) and similarity tests to evaluate the performance of the analysis-by-synthesis method. These are subjective evaluations in which listeners
rate the quality of the synthesized speech on a 5-point scale (5: excellent,
4: good, 3: fair, 2: poor, 1: bad) and the closeness of the synthesized speech to the original speech
signal. Table 2.2 shows the average MCD, MOS and similarity test scores over all test
sets. The MOS and similarity scores were obtained from 10 subjects, each performing
the listening tests on 10 utterances. These results show that AFs
capture sufficient information about the speech signal. It is typically observed that an MCD
score less than 6.0 produces good quality speech in speech synthesis/voice conversion.
Measuring frame-wise recognition
The evaluation method used is a comparison of overall accuracy in terms of frame error
rate (FER), together with insertion and deletion rates. FER is widely used for evaluating
articulatory feature extraction [69], because in current speech technology articulatory
features are commonly used as an alternative or additional speech representation. Speech
Table 2.3: Frame-wise recognition using the TIMIT database.

Articulatory feature        Correct (%)   Deletion (%)   Insertion (%)
Voiced                      91.07          3.68           5.25
Short vowels                91.00          3.20           5.80
Long vowels                 91.74          4.08           4.18
Diphthongs                  89.64          4.92           5.44
Schwa                       93.53          1.39           5.08
High vowels                 93.59          3.41           3.00
Mid vowels                  85.96          3.93          10.11
Low vowels                  92.61          5.01           2.38
Front vowels                91.30          6.74           1.96
Central vowels              88.84          2.77           8.39
Back vowels                 92.51          3.72           3.77
Round vowels                94.56          2.15           3.29
Stops                       91.60          4.14           4.26
Fricatives                  93.81          3.98           2.21
Affricatives                98.49          0.17           1.34
Nasals                      96.85          1.69           1.46
Liquids                     96.12          0.74           3.14
Approximants                95.09          2.10           2.81
Labials                     93.10          1.95           4.95
Velars                      98.04          0.65           1.31
Alveolars                   98.37          0.06           1.57
Palatals                    82.91          8.24           8.85
Labio-dentals               97.30          0.62           2.08
Dentals                     95.86          0.99           3.15
Glottals                    99.28          0.05           0.67
Silence                     95.79          4.05           0.16
Average over all features   93.86          2.86           3.28
All correct together        37             NA             NA
representation is a sequence of numeric vectors where each numeric vector represents
speech in each time frame. Therefore, AFs extraction systems are usually evaluated on
the frame level. To understand the achieved accuracies better, we present insertion and
deletion rates in Table 2.3. Given that the MCEP-to-AF mapping network is trained on the
TIMIT training set, we computed the FER for all utterances in both training (432 speak-
ers) and testing (168 speakers) sets, where speakers in both sets are mutually exclusive.
The deletion rate is defined as the ratio

\[
\text{Deletion} = \frac{\text{No. of correct classes not classified as correct}}{\text{Total no. of classes}} \tag{2.4}
\]
Table 2.4: Frame-wise recognition using ARCTIC database.

Articulatory features       Correct (%)  Deletion (%)  Insertion (%)
Voiced                      78.89        17.98         3.13
Short vowels                90.32        2.60          7.08
Long vowels                 86.08        8.97          4.95
Diphthongs                  85.51        9.56          4.93
Schwa                       93.40        1.44          5.16
High vowels                 86.42        11.25         2.33
Mid vowels                  85.28        3.48          11.24
Low vowels                  89.53        5.28          5.19
Front vowels                81.44        16.62         1.94
Central vowels              87.23        3.15          9.62
Back vowels                 91.16        2.04          6.80
Round vowels                93.12        1.71          5.17
Stops                       82.81        6.42          10.77
Fricatives                  85.37        8.45          6.18
Affricatives                97.81        0.95          1.24
Nasals                      88.88        6.71          4.41
Liquids                     94.29        3.60          2.11
Approximants                91.56        2.59          5.85
Labials                     88.45        5.10          6.45
Velars                      95.28        1.36          3.36
Alveolars                   97.42        0.00          2.58
Palatals                    74.21        11.53         14.26
Labio-dentals               93.96        4.38          1.66
Dentals                     95.39        0.09          4.52
Glottals                    95.76        3.03          1.21
Silence                     88.22        1.17          10.61
Average over all features   89.13        6.50          4.37
All correct together        17           NA            NA
and the insertion rate is defined as the ratio

\[
\text{Insertion} = \frac{\text{No. of incorrect classes classified as correct}}{\text{Total no. of classes}} \tag{2.5}
\]
It is clear from Table 2.3 that the general recognition accuracy is high, and in all cases recognition is substantially above chance level. The performance on the training and testing portions of the database did not differ greatly, which indicates that the network learned to generalise well. The accuracies for palatals, mid vowels and central vowels are lower than 90%. The insertion rates for mid vowels and central vowels are higher
Fig. 2.5: Articulatory contours of vowels w.r.t. tongue movements, shown in six panels (high, middle, low, front, central and back vowels) against time in seconds. Dashed lines correspond to the actual binary AFs; solid lines correspond to the predicted continuous AFs.
because there are more confusions between central vowels and front vowels, and between central vowels and back vowels. Similarly, there are more confusions between mid vowels and high vowels, and between mid vowels and low vowels. This is because, whenever the tongue moves from high to low or from front to back, it has to pass through the middle and central positions, respectively. Fig. 2.5 shows the articulatory contours of vowels; from this figure we can observe that middle and central vowels have more confusions. The “all correct together” value gives the percentage of frames for which all features are correct. This means that the network has found the right combination 37% of the time from a possible choice of 2^26 = 67,108,864 feature combinations. This accuracy is only meant as a guide to the overall network accuracy, as it takes no account of the asynchronous nature of the features: simple frame-wise phone classification is not our aim. Table 2.4 shows the frame-wise accuracy for the ARCTIC database [70], obtained by passing the ARCTIC speech files through the TIMIT-trained MCEP-to-AF mapper. The ARCTIC and TIMIT databases differ in speakers and environmental conditions.
2.4 Summary
In this chapter, we briefly explained about human speech production and the role of ar-
ticulators in the production of speech. We showed that AFs can be classified into three
models based on the model which is used to capture them. We also explained that ex-
traction of AFs using linguistically derived models is easy so we have used this model to
capture the AFs. We showed the representation of AFs using phonological information.
The use of an ANN mapper to extract the AFs from the speech signal is explained. To
evaluate the mapping accuracy of the ANN mapper we described two objective evalua-
tions. These are 1) Mel-cepstral distortion (MCD) and 2) Frame-wise recognition. The
MCD score showed that AFs capture sufficient amounts of speech information, whereas
frame-wise recognition showed that AFs are predicted very well from speech signal.
Chapter 3
Analysis of speaker information in
articulatory features
Extraction of articulatory features from an acoustic speech signal has various applications in speech research. These features have been used for automatic speech recognition (ASR) with the aim of better pronunciation modeling, better co-articulation modeling, robustness to cross-speaker variation and noise, and multi-lingual and cross-lingual portability of systems [71, 55, 72, 73, 74, 3], as well as in language identification [75] and in emotional speech synthesis [68]. In these studies, the articulatory features derived from the acoustics are often treated as a generic or speaker-independent representation of the speech signal. So, we too intend to use these features as the canonical form (generic across speaker variation) in voice conversion. There has been earlier work on using AFs for speaker verification [76], which used articulatory feature-based conditional pronunciation modeling (AFCPM) to capture the pronunciation characteristics of speakers, since different people have their own way of pronouncing. AFCPM models the linkage between the states of articulation during speech production and the actual phonemes produced by a speaker. That work showed that AFs contain speaker-specific information complementary to spectral features. In order to understand the nature of speaker dependence in AFs, in this chapter we perform a detailed analysis of AFs by conducting speaker identification experiments.
Section 3.1 describes speaker identification using AFs, along with the results. In this approach, each speaker is modeled by a Gaussian mixture model (GMM). The results of the speaker identification system show that AFs contain a significant amount of speaker-specific information. Since our goal is to obtain a speaker-independent representation of the speech signal, in Section 3.2 we discuss a method to normalize the speaker-specific information in AFs and its impact on the performance of speaker identification and speech recognition.
3.1 Speaker identification using articulatory features
The purpose of a speaker identification (SID) system is to identify a speaker from his/her
voice samples. The goal of this experiment is to find out the amount of speaker informa-
tion present in AFs and compare it with that of MCEPs.
3.1.1 Preparation of the data
The experiments presented here are carried out on the TIMIT database [66]. It consists of 630 speakers, 70% male and 30% female, from 10 different dialect regions in America. Each speaker has approximately 30 seconds of speech, spread over ten utterances. The speech is designed to have rich phonetic content, consisting of 2 dialect sentences (SA), 450 phonetically compact sentences (SX) and 1890 phonetically diverse sentences (SI). All the SX and SI wave-files in each speaker's directory are concatenated to form a single utterance (of approximately 25 seconds duration) for training. The two utterances in the SA directory of each speaker are concatenated to form a test utterance.
3.1.2 Feature extraction
In these experiments we used both MCEPs and AFs as feature vectors for all speakers.
MCEPs are extracted as explained in Section 2.3.2. AFs are extracted for all 630 speakers in the TIMIT data using the same encoder (MCEP-to-AF) that was built in Section 2.3.3.
3.1.3 Speaker modelling
A Gaussian mixture model (GMM) is used to model the distribution of features of a given
speaker. Given a feature vector x_i, the mixture density for speaker s is defined by

\[
p(x_i \mid \Phi^s) = \sum_{j=1}^{M} w_j^s \, \mathcal{N}_j^s(x_i; \mu_j, \Sigma_j) \tag{3.1}
\]

and can be thought of as the weighted linear combination of M Gaussian densities N_j^s(x_i; µ_j, Σ_j). Each trained speaker is represented by a model Φ^s = {w_j^s, µ_j, Σ_j}, j = 1, 2, ..., M, where µ_j, Σ_j and w_j^s represent the mean, covariance and weight of the j-th mixture component, respectively. Models are trained using the expectation-maximization (EM) algorithm.
The EM algorithm is an iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models. The EM iteration alternates between an expectation (E) step, which computes the expectation of the log-likelihood evaluated using the parameters of the current model Φ, and a maximization (M) step, which computes the parameters maximizing the expected log-likelihood found in the E step, so that the likelihood does not decrease, i.e., p(x_i | Φ̄) ≥ p(x_i | Φ), where Φ̄ is the re-estimated model. The re-estimated model parameters are then used as the initial model in the next E step.
After each iteration, the following re-estimation formulae are used, which guarantee
a monotonic increase in the model’s likelihood value:
1. Mixture weights:
\[
w_j = \frac{1}{T} \sum_{i=1}^{T} p(j \mid x_i, \Phi) \tag{3.2}
\]

2. Means:
\[
\mu_j = \frac{\sum_{i=1}^{T} x_i \, p(j \mid x_i, \Phi)}{\sum_{i=1}^{T} p(j \mid x_i, \Phi)} \tag{3.3}
\]

3. Variances:
\[
\Sigma_j = \frac{\sum_{i=1}^{T} x_i^2 \, p(j \mid x_i, \Phi)}{\sum_{i=1}^{T} p(j \mid x_i, \Phi)} - \mu_j^2 \tag{3.4}
\]
where T is the total number of feature vectors for a speaker.
Two critical factors in training GMMs are selecting the order M of the mixture and initializing the model parameters prior to running the EM algorithm. Since there are generally about 40 significant acoustic classes in speech, a model order of M = 32 was chosen, and the parameters are initialized using the K-means algorithm.
During testing, given the feature representation of a test utterance X = {x_1, x_2, ..., x_T} and a group of N speakers represented by GMMs Φ_1, Φ_2, ..., Φ_N, the objective is to find the speaker model which has the maximum a posteriori probability for the given utterance. Formally,

\[
S = \operatorname*{arg\,max}_{1 \leq k \leq N} p(\Phi_k \mid X) = \operatorname*{arg\,max}_{1 \leq k \leq N} \frac{p(X \mid \Phi_k) \, p(\Phi_k)}{p(X)} \tag{3.5}
\]

Assuming equally likely speakers (i.e., p(Φ_k) = 1/N) and noting that p(X) is the same for all speaker models, the classification rule simplifies to

\[
S = \operatorname*{arg\,max}_{1 \leq k \leq N} p(X \mid \Phi_k) \tag{3.6}
\]

Using logarithms and the independence between observations, the speaker identification system computes the log-likelihood of the utterance for all N speakers. Identification is implemented using a maximum likelihood classification rule: the speaker's identity is given by the model that produces the maximum log-likelihood, i.e.,

\[
S = \operatorname*{arg\,max}_{1 \leq k \leq N} \sum_{i=1}^{T} \log p(x_i \mid \Phi_k) \tag{3.7}
\]

where N is the total number of trained speakers. The accuracy (ACC) of the identification system is defined as

\[
\text{ACC}(\%) = \frac{N_c}{N} \times 100 \tag{3.8}
\]

where N_c is the number of speakers identified correctly.
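The training and classification steps above can be sketched with scikit-learn's GaussianMixture, which implements EM with k-means initialization. This is an illustrative sketch under our own naming, not the thesis implementation; the diagonal covariances and the toy synthetic data are assumptions made here for brevity.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_speaker_models(features_per_speaker, n_mix=32):
    """Train one GMM per speaker via EM with k-means initialization,
    cf. Eqs. (3.1)-(3.4)."""
    models = {}
    for speaker, feats in features_per_speaker.items():
        gmm = GaussianMixture(n_components=n_mix, covariance_type="diag",
                              init_params="kmeans", max_iter=100, random_state=0)
        gmm.fit(feats)
        models[speaker] = gmm
    return models

def identify(models, test_feats):
    """Eq. (3.7): pick the speaker whose model maximizes the total
    log-likelihood of the test utterance."""
    scores = {s: m.score_samples(test_feats).sum() for s, m in models.items()}
    return max(scores, key=scores.get)

# Toy demonstration with two well-separated synthetic "speakers"
rng = np.random.default_rng(0)
data = {"spk1": rng.normal(0.0, 1.0, (500, 4)),
        "spk2": rng.normal(5.0, 1.0, (500, 4))}
models = train_speaker_models(data, n_mix=2)
print(identify(models, rng.normal(5.0, 1.0, (50, 4))))  # spk2
```

In the actual experiments, the feature matrices would be the MCEP or AF frames of each TIMIT speaker's training utterance.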
Table 3.1: Accuracies (%) of speaker identification system using MCEPs and AFs

Features   ACC (%)
MCEPs      100
AFs        85.24
3.1.4 Results
Table 3.1 shows the performance of the SID system using MCEPs and AFs. To compute ACC, each test utterance is matched against all 630 speakers. The MCEPs identify all speakers correctly, because the TIMIT database contains no noise other than the speech and speaker information itself. Surprisingly, AFs gave 85.24% accuracy, which is far above chance level. This indicates that AFs do capture a significant amount of speaker information in their trajectories, although perhaps not as much as MCEPs. In order to further understand which sections of AFs contribute most to speaker identification, we conducted experiments using AFs corresponding to 1) vowels, 2) manner of articulation, 3) place of articulation and 4) consonants (manner + place); these groups are not mutually exclusive.
Table 3.2 shows the performance of the system using different sections of AFs. The performance of the consonants group (manner + place) is higher than that of vowels, and the performance using AFs for both vowels and consonants does not differ much from that using consonants only. The reason is that the degree of freedom in articulating vowels is higher than for consonants: most vowels are produced with an open configuration of the vocal tract, whereas consonants are produced with a constriction in the vocal tract. However, the number of bits we used to represent vowel features is smaller than for consonants, and is therefore not sufficient to capture all the variations of vowels that are significant for each speaker.
Thus one could conclude that speaker information is largely embedded in the manner and place of articulation and in their temporal variations. This raises the question of how one could normalize the speaker information in AFs so that they act as a speaker-independent representation of the speech signal. Such representations have applications in both speech recognition and voice conversion.
Table 3.2: Accuracies (%) of four groups of articulatory features

Group name   Articulatory properties                                    ACC (%)
Vowels       length, height, frontness, lip rounding                    41.90
Manner       stop, fricative, affricative, nasal, liquid, approximant   79.37
Place        labial, alveolar, palatal, labio-dental, dental, velar     71.75
Consonants   Place + Manner                                             84.44
3.2 Normalizing speaker specific information
In order to normalize the speaker-specific information in AF streams, we experimented with mean-smoothing the AF trajectories using a 5-point and an 11-point window. The idea is to smooth out the fine variations among the samples in the AF trajectories, so that the smoothed trajectories normalize the effect of speaker-specific characteristics on the AF streams.
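The smoothing operation amounts to a moving-average filter applied repeatedly to each AF trajectory. A minimal sketch (our own code; edges are zero-padded by the convolution, which is an assumption about boundary handling) is:

```python
import numpy as np

def mean_smooth(trajectory, window=11, iterations=1):
    """Repeatedly apply a mean-smoothing window to one AF trajectory.

    window     : window length in frames (5 or 11 in the experiments)
    iterations : smoothing level 'k'
    """
    x = np.asarray(trajectory, dtype=float)
    kernel = np.ones(window) / window
    for _ in range(iterations):
        x = np.convolve(x, kernel, mode="same")  # 'same' preserves the length
    return x

# A short binary AF contour, smoothed twice with a 5-point window
contour = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0], dtype=float)
smoothed = mean_smooth(contour, window=5, iterations=2)
```

Each extra iteration widens the effective window, progressively removing the fine, speaker-dependent variations.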
Fig. 3.1: (a), (b) and (c) show unsmoothed AF contours of stops, fricatives and approximants for speaker-1 and speaker-2, respectively; (d), (e) and (f) show the corresponding smoothed AF contours of stops, fricatives and approximants for speaker-1 and speaker-2, respectively.
Figure 3.1 shows the unsmoothed and smoothed contours of stops, fricatives and approximants for two different speakers. The unsmoothed contours contain very small variations that differ among the speakers. After applying a mean-smoothing window, these variations are smoothed out, and the smoothed contours represent the AFs in a form that is similar across speakers.
Figure 3.2 shows the speaker identification performance after applying the mean-smoothing repeatedly, up to five times. It can be observed that the performance of the SID system decreases with every iteration of mean-smoothing, and more so for the 11-point window spanning 225 milliseconds (frame shift is 5 ms).
Fig. 3.2: Speaker identification accuracies for different levels of smoothing with the 5-point and 11-point mean-smoothing windows. Level ‘k’ corresponds to applying the mean-smoothing window ‘k’ times.
A relevant question here is: do these smoothing operations reduce only the speaker information, or the speech information as well? To study the effect of mean-smoothing on speech quality, we built an AF-to-MCEP mapper after every iteration of smoothing. This mapper was tested on the held-out test set, and an MCD score was computed as described in Section 2.3.4. In Fig. 3.3, we show the speaker identification accuracies and MCD scores normalized with respect to the initial accuracy (85.24%) and the initial MCD score (4.604), respectively. It can be seen that, after five iterations of mean-smoothing of the AFs, the accuracy of speaker identification decreases by about 60%, whereas the MCD score increases by only about 20%, indicating that the loss of spectral information is smaller than the loss of speaker information.
Fig. 3.3: Speaker identification accuracies and MCD scores (normalized performance on the y-axis) versus the number of times the 11-point mean-smoothing window is applied. Level ‘k’ corresponds to applying the 11-point mean-smoothing window ‘k’ times. All scores are normalized with respect to the scores without smoothing (Level 0).
3.2.1 Use of smoothed AFs for speech recognition
The goal of a speech recognition system is to decode the textual message in the speech signal. Such systems have to work for all speakers in all environments, so it is necessary to normalize the speaker-specific information in speech signals before using them for speech recognition; otherwise it acts like noise to the system and performance degrades significantly. In this section, we describe the speech recognition experiments we conducted using MCEPs, unsmoothed AFs and smoothed AFs. The system is monophone-based: we trained context-independent HMM models for each phoneme, using 16 Gaussian mixtures, with the HMM toolkit (HTK). We used the training directory (468 speakers) of the TIMIT database for training, and the testing directory (162 speakers) to test the system. Table 3.3 shows the performance of the system using these three features. From this table one can observe that smoothing the AFs iteratively increases the performance of the system gradually. After the 4th iteration, the accuracy decreases slightly, which means that speech information is also beginning to be lost; this suggested stopping the smoothing after 4 iterations. These experiments show that AFs contain speaker characteristics, and that these can be reduced by smoothing.
Table 3.3: Phone recognition accuracies using MCEPs and smoothed AFs. AFs-‘k’ corresponds to applying the 11-point mean-smoothing window ‘k’ times.

Features   Accuracy
MCEPs      52.16%
AFs-0      26.32%
AFs-1      40.68%
AFs-2      42.94%
AFs-3      44.24%
AFs-4      44.65%
AFs-5      44.05%
3.3 Summary
This chapter described a speaker identification approach using only AFs. We modeled each speaker using GMMs, whose parameters were estimated with the EM algorithm. The results on the full TIMIT corpus show that AFs contain significant speaker-specific information, more so in the AF streams of consonants. To remove the speaker-specific information in AFs, we smoothed the AF trajectories and used them for speaker identification. The results show a significant decrease in the performance of the speaker identification system with smoothed AFs. It was also shown that smoothed AFs perform better than unsmoothed AFs for speech recognition.
Chapter 4
Use of articulatory features for voice
conversion
Chapters 2 and 3 covered the extraction of articulatory features (AFs), the analysis of speaker information in AFs, and the normalization of speaker information in AFs. In this chapter, we propose a voice conversion method in which AFs are used to capture the speaker-specific characteristics of a target speaker. Such a method avoids the need for speech data from a source speaker, and hence can be used to transform an arbitrary speaker, including a cross-lingual speaker. The basic idea used in this work is shown in the block diagram of Fig. 4.1. It involves two steps: 1) projecting the source speaker space into a speaker-independent space that retains only the message part of the signal; and 2) mapping the speaker-independent space to the target speaker space using an ANN mapper which captures the target speaker-specific characteristics. Here, AFs are used to represent the speaker-independent space in the process of capturing the speaker-specific characteristics of a target speaker.

Section 4.1 explains the model used to capture the speaker-specific characteristics of a target speaker and its mathematical formulation. Section 4.2 describes the use of this model for intra-lingual voice conversion using AFs, and discusses both the subjective and objective evaluations used to assess the performance of the system. In Section 4.3 we discuss how the model can be extended to cross-lingual
Fig. 4.1: Mapping of an arbitrary source speaker into a target speaker: the source speaker space is projected into a speaker-independent space, which is then mapped to the target speaker space.
voice conversion using AFs.
4.1 Noisy-channel model
As discussed in Chapter 1, the assumption that parallel or pseudo-parallel data exist is not valid in many practical applications. Hence, we posed an alternative but relevant research question: “How can we capture the speaker-specific characteristics of a target speaker from the speech signal (independent of any assumptions about a source speaker) and impose these characteristics on the speech signal of an arbitrary source speaker to perform voice conversion?”

The problem of capturing speaker-specific characteristics can be viewed as modeling a noisy channel [2]. Suppose C is a canonical form of the speech signal, i.e., a generic and speaker-independent representation of the message, which passes through the speech production system of a target speaker to produce a surface form S. This surface form S carries the message as well as the identity of the speaker.
One can interpret S as the output of a noisy channel for the input C. Here, the noisy channel is the speech production system of the target speaker. The mathematical formulation of this noisy-channel model is

\[
\operatorname*{arg\,max}_{S} \, p(S \mid C) = \operatorname*{arg\,max}_{S} \frac{p(C \mid S) \, p(S)}{p(C)} \tag{4.1}
\]
\[
= \operatorname*{arg\,max}_{S} \, p(C \mid S) \, p(S) \tag{4.2}
\]

as p(C) is constant for all S. Here, p(C | S) can be interpreted as a production model.
Fig. 4.2: Capturing speaker-specific characteristics as a speaker-coloring function Ω(·), which maps the canonical form C (articulatory features) to an approximation S′ of the surface form (Mel-cepstrum).
p(S) is the prior probability of S; it can be interpreted as the continuity constraints imposed on the production of S, analogous to a language model over S. In this work, p(S | C) is directly modeled as a mapping function between C and S using artificial neural networks (ANNs). The process of capturing speaker-specific characteristics and its application to voice conversion is explained below:
Suppose we derive two different representations, C and S, from the speech signal with the following properties. Let C be a canonical form of the speech signal, i.e., a generic and speaker-independent form, approximately represented by the articulatory features (AFs) extracted from the speech signal. Let S be a surface form represented by Mel-cepstral coefficients (MCEPs). If there exists a function Ω(·) such that S′ = Ω(C), where S′ is an approximation of S, then Ω(C) can be considered specific to a speaker. The function Ω(·) can be interpreted as a speaker-coloring function, and we treat it as capturing the speaker-specific characteristics. It is this property of Ω(·) that we exploit for the task of voice conversion. Fig. 4.2 depicts the concept of capturing speaker-specific characteristics as a speaker-coloring function.
4.2 Intra lingual voice conversion
4.2.1 Database
The experiments here were carried out on the CMU ARCTIC database consisting of ut-
terances recorded by seven speakers. Each speaker has recorded a set of 1132 phonet-
ically balanced utterances [70]. The ARCTIC database includes utterances of SLT (US
Female), CLB (US Female), BDL (US Male), RMS (US Male), JMK (Canadian Male),
AWB (Scottish Male), and KSP (Indian Male).
4.2.2 Training target speaker’s model
Given the utterances of a target speaker T, the corresponding canonical form C_T of the speaker is represented by AFs. To alleviate the effect of speaker characteristics, the AFs undergo a normalization technique, namely the smoothing explained in Section 3.2. The surface form S_T is represented by traditional MCEP features, as these allow us to synthesize speech using the MLSA synthesis technique, which generates a speech waveform from the transformed MCEPs and F0 values using pulse or random-noise excitation. An ANN model is trained to map C_T to S_T using the backpropagation learning algorithm, minimizing the Euclidean error ||S_T − S′_T||, where S′_T = Ω(C_T).
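The mapping Ω can be sketched as a small feed-forward network trained with backpropagation to minimize the squared (Euclidean) error between predicted and actual MCEPs. The sketch below uses scikit-learn's MLPRegressor on synthetic data as a stand-in; the dimensions, layer size and data are illustrative assumptions, not the architecture used in the thesis.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Illustrative dimensions: 26-dim AF frames (canonical form C_T) to
# 25-dim MCEP frames (surface form S_T); the data here are synthetic.
rng = np.random.default_rng(0)
C_T = rng.random((2000, 26))
W = rng.normal(size=(26, 25))
S_T = C_T @ W + 0.01 * rng.normal(size=(2000, 25))

# Backpropagation training, minimizing ||S_T - Omega(C_T)||^2
omega = MLPRegressor(hidden_layer_sizes=(50,), activation="tanh",
                     max_iter=500, random_state=0)
omega.fit(C_T, S_T)

S_T_pred = omega.predict(C_T)  # Omega(C_T), the approximation S'_T
```

In practice C_T would hold the smoothed AF frames of the target speaker and S_T the corresponding MCEP frames, aligned frame-by-frame.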
4.2.3 Conversion process
Once the target speaker’s model is trained, it can be used to convert CR to S ′T where CR is
the canonical form from an arbitrary source speaker R. To get the canonical form for any
arbitrary source speaker, one of the three methods below, could be followed. The process
to build any of the encoders is below, explained in Section 2.3.3.
1. Use source speaker encoder. This requires building an encoder specific to a source
speaker, and hence a large amount of speech data (along with transcription) from
the source speaker is required.
2. Use target speaker encoder. This maps MCEPs of an arbitrary source speaker onto
AFs using target speaker’s encoder.
3. Use average speaker encoder. This maps MCEPs of an arbitrary source speaker
onto AFs using an average speaker encoder which is trained using all speakers’
data except that of source and target speakers. Since an average model is used
to generate AFs, a form of speaker normalization takes place on AFs even before
smoothing is applied.
Fig. 4.3: Mel-cepstral distortion (in dB) obtained between different speaker pairs (RMS to SLT, BDL to SLT, SLT to BDL, RMS to BDL) using the target speaker, source speaker and average speaker encoders.
4.2.4 Validation
Using the three methods discussed in Section 4.2.3, we predicted the AFs for three source speakers (SLT, BDL and RMS). The AFs were smoothed to normalize the speaker-specific information, and the smoothed AFs were mapped through the BDL and SLT speaker-specific models. To test the effectiveness of the voice conversion model, we calculated the Mel-cepstral distortion (MCD) between the predicted and actual MCEPs. MCD is a standard measure used in speech synthesis and voice conversion evaluations [18].
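For reference, MCD between two aligned MCEP sequences is commonly computed as below. This sketch follows the widely used definition, excluding the 0th (energy) coefficient; whether the thesis includes the energy coefficient is not stated here, so that choice is our assumption.

```python
import numpy as np

def mel_cepstral_distortion(target_mceps, predicted_mceps):
    """Average MCD in dB between two aligned MCEP sequences of
    shape (frames, dims); the 0th (energy) coefficient is excluded."""
    t = np.asarray(target_mceps, dtype=float)[:, 1:]
    p = np.asarray(predicted_mceps, dtype=float)[:, 1:]
    diff_sq = np.sum((t - p) ** 2, axis=1)            # per-frame squared error
    mcd_per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * diff_sq)
    return float(np.mean(mcd_per_frame))
```

Identical sequences give 0 dB; a constant difference of 1 in a single coefficient gives 10/ln(10) · √2 ≈ 6.14 dB per frame.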
Fig. 4.3 shows the MCD scores obtained using the three methods. We observe that the average speaker encoder gives a lower MCD score than the other two methods. This suggests that the use of an average speaker encoder generates normalized AFs, and that smoothing the AF trajectories further helps in realizing the speaker-independent form. The rest of the experiments were therefore carried out using the average speaker encoder to obtain the canonical form for an arbitrary source speaker; it can also be used to obtain the canonical form for the target speaker.
Fig. 4.4: Flow-chart of the training and conversion modules of a voice conversion system capturing speaker-specific characteristics. (a) Training: the target speaker's MCEPs are converted to AFs by the encoder, smoothed, and used as input to train the target speaker ANN with the MCEPs as output. (b) Conversion: the source speaker's MCEPs are converted to AFs by the encoder, smoothed, and passed through the target speaker ANN to predict the target MCEPs.
4.2.5 Experiments on multiple speaker database
To test the validity of the proposed method, we conducted experiments on other speakers from the CMU ARCTIC set, namely RMS, CLB, AWB and KSP. Fig. 4.4(a) shows the block diagram of the training process, and Fig. 4.4(b) shows the block diagram of the conversion process. Table 4.1 provides the results of mapping C_R (where R = BDL, RMS, CLB, AWB, KSP or SLT) onto the acoustic spaces of SLT and BDL.
Table 4.1: MCD scores obtained between multiple speaker pairs with SLT and BDL as target speakers. Scores in parentheses are obtained using parallel data.

Source speakers   SLT (target)    BDL (target)
SLT               -               7.563 (6.709)
RMS               6.604 (5.717)   7.364 (6.394)
AWB               6.797 (6.261)   7.731 (6.950)
KSP               7.808 (6.755)   8.695 (7.374)
BDL               6.637 (5.423)   -
CLB               6.339 (5.380)   7.249 (6.172)
In Table 4.1, the performance of voice conversion models built following the noisy-channel model approach is compared with that of a traditional model using parallel data. The MCD scores indicate that the use of parallel data gives better performance than the noisy-channel model approach. Here, the voice conversion system using parallel data was implemented as explained in Chapter 1. The use of parallel data allows us to capture an explicit mapping function between source and target speakers, whereas the noisy-channel model approach captures the target speaker-specific characteristics, which can later be imposed on any source speaker. This approach yields an MCD in the range of 6.3 to 8.6.
4.2.6 Mapping of excitation features
Our focus in this work is on a better transformation of the spectral features; hence, we use the traditional approach to F0 transformation as used in GMM-based conversion. A logarithmic Gaussian normalized transformation [38] is used to transform the F0 of a source speaker to the F0 of a target speaker, as indicated in the equation below:

\[
\log(F0_c) = \mu_t + \frac{\sigma_t}{\sigma_s} \left( \log(F0_s) - \mu_s \right) \tag{4.3}
\]

where µ_s and σ_s are the mean and standard deviation of the fundamental frequency in the logarithmic domain for the source speaker, µ_t and σ_t are the corresponding statistics for the target speaker, F0_s is the pitch of the source speaker and F0_c is the converted pitch frequency.
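Equation (4.3) can be implemented directly. The sketch below (our own helper names) estimates the log-domain statistics from voiced frames of each speaker's training pitch contours and converts a source F0 contour; leaving unvoiced frames (F0 = 0) unchanged is a common convention we assume here.

```python
import numpy as np

def convert_f0(f0_source, source_train_f0, target_train_f0):
    """Logarithmic Gaussian normalized F0 transformation, Eq. (4.3)."""
    def log_stats(f0):
        f0 = np.asarray(f0, dtype=float)
        voiced = np.log(f0[f0 > 0])        # statistics over voiced frames only
        return voiced.mean(), voiced.std()

    mu_s, sigma_s = log_stats(source_train_f0)
    mu_t, sigma_t = log_stats(target_train_f0)

    f0 = np.asarray(f0_source, dtype=float)
    converted = np.zeros_like(f0)
    voiced = f0 > 0
    # log(F0_c) = mu_t + (sigma_t / sigma_s) * (log(F0_s) - mu_s)
    converted[voiced] = np.exp(mu_t + (sigma_t / sigma_s)
                               * (np.log(f0[voiced]) - mu_s))
    return converted

# Example: a source frame at the source's log-mean pitch maps to the
# target's log-mean pitch (120 Hz -> 220 Hz for these toy statistics)
out = convert_f0([120.0, 0.0], [100.0, 120.0, 144.0], [200.0, 220.0, 242.0])
```

The transform shifts the contour to the target's average pitch and rescales its range by the ratio of the two log-domain deviations.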
Subjective evaluation
We also performed perceptual tests, whose results are provided in Table 4.2, for mean opinion scores (MOS) on a scale of 1 to 5 (5: excellent, 4: good, 3: fair, 2: poor, 1: bad). For the listening tests, we randomly chose 10 utterances from the two transformed pairs (SLT to BDL and BDL to SLT). Fifteen listeners participated in the evaluation tests, and the MOS scores in Table 4.2 are averaged over these fifteen listeners. From the MOS scores, one can say that the noisy-channel model approach does capture the speaker-specific characteristics of the target speaker. The transformed waveforms are available at http://researchweb.iiit.ac.in/~bajibabu.b/vc_evaluation.html.
Table 4.2: Subjective evaluation of voice conversion models built using parallel data and the noisy-channel model.

Transformation using    SLT to BDL   BDL to SLT
Parallel data           3.34         3.58
Noisy-channel model     3.14         3.40
By using smoothed AFs, we can transform any arbitrary speaker into a predefined target speaker without requiring any utterances from the source speaker when training the voice conversion model. This indicates that the methodology of training an ANN model to capture speaker-specific characteristics for voice conversion can be generalized across different datasets.
4.3 Cross-lingual voice conversion
Cross-lingual voice conversion is a task in which the source and target speakers speak different languages. We employ the same model explained in Section 4.1 to capture speaker-specific characteristics for the task of cross-lingual voice conversion. We performed an experiment to transform the utterances of two speakers (speaking Kannada and Telugu) into a male voice speaking English (US male, BDL). Our goal here is to transform the two speakers' voices into the BDL voice; hence, the output will sound as if BDL were speaking Kannada and Telugu, respectively.
Here, we can pose some research questions:

• How do we extract the canonical form (AFs) for a cross-lingual source speaker?

• Can we use the average encoder used for intra-lingual voice conversion, explained in Section 4.2.3? (It is built using the data of many speakers of the same language.)

• Do we need to include data from other languages in the average model, to normalize the language information in the speech signal?
To answer the above questions, we used two encoders to extract the canonical form for the cross-lingual source speaker.
1. A multi-speaker, mono-lingual encoder, trained using many speakers of the same language without the source and target speakers' data. It performs a form of speaker normalization in the AFs, and is similar to the average speaker encoder model used in intra-lingual voice conversion.
2. A multi-speaker, multi-lingual encoder, trained using many speakers of multiple languages without the source and target speakers' data. This encoder offers a form of both language and speaker normalization in the AFs.
The process to build the above encoders is the same as explained in Section 2.3.3. The articulation of some phones in one language differs from that in other languages. Since we used phone information to derive the AFs from the speech signal in this work, it was necessary to account for the significant articulations of other languages.
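As a toy illustration of deriving AFs from phone information, phonologically derived AFs can be viewed as a lookup from each phone label to a binary feature vector. The feature names and phone entries below are hypothetical simplifications; the actual feature set in this work is larger (26 features, or 27 with aspiration).

```python
# Hypothetical, simplified phone-to-AF lookup: each phone maps to binary
# phonological features. The real inventory in this work is larger;
# this sketch uses a tiny illustrative subset.
FEATURES = ["vowel", "voiced", "bilabial", "nasal", "aspirated"]

PHONE_AF = {
    "aa": [1, 1, 0, 0, 0],   # open vowel
    "p":  [0, 0, 1, 0, 0],   # voiceless bilabial stop
    "b":  [0, 1, 1, 0, 0],   # voiced bilabial stop
    "m":  [0, 1, 1, 1, 0],   # bilabial nasal
    "bh": [0, 1, 1, 0, 1],   # aspirated stop, common in Indian languages
}

def phones_to_afs(phones):
    """Map a phone-level transcription to a sequence of AF vectors."""
    return [PHONE_AF[p] for p in phones]

afs = phones_to_afs(["b", "aa", "m"])
```

Adding a language whose phones carry an articulation absent from the table (here, aspiration) is what forces the extra feature dimension discussed below.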
To compare the performance of the two encoders, we extracted the AFs with each encoder separately. The AFs extracted using the multi-speaker, mono-lingual encoder are shown in Fig. 4.5. This encoder is the same as that used in intra-lingual voice conversion, built using all of the speakers in the ARCTIC database. The AFs extracted using the multi-speaker, multi-lingual encoder are shown in Fig. 4.6. This encoder was built using ARCTIC (English) and Telugu data. Since aspiration is a significant articulation in Indian languages, we used an extra bit to represent this information, which increased the dimension of the AFs to 27.
From Fig. 4.5 and Fig. 4.6 it can be observed that the prediction of AFs in Fig. 4.5 is not accurate, whereas most of the AFs are correctly predicted in Fig. 4.6. We infer that using multiple languages to build the MCEP-to-AF encoder produces some form of language normalization, so the prediction of AFs for a cross-lingual source speaker using such an encoder is more accurate. In the following experiments we used both encoders for cross-lingual voice conversion.
Fig. 4.5: (a) Waveform of the sentence “enduku babu, annadu pujari ascharyanga!”. (b) Phonologically derived AFs. (c) Acoustically derived AFs using multi-speaker and mono-lingual data (English).
4.3.1 Experiments
Using the two encoders above, we predicted the AFs for the native Telugu and Kannada source speakers. The AFs were smoothed to normalize speaker-specific information. The smoothed AFs were then mapped onto the BDL speaker-specific model, which was built as explained in Section 4.2. Five utterances from the two speakers were transformed into the BDL voice, and we performed a MOS test and a similarity test to evaluate the performance of this transformation. Table 4.3 provides the MOS and similarity test results averaged over all listeners. Ten native listeners of Telugu and Kannada participated in the tests. The similarity test indicates the closeness of the transformed speech to the characteristics of the target speaker. Table 4.3 shows the performance using the two methods mentioned in the previous section. We observe that the performance using the multi-speaker, multi-lingual encoder is better than that using the other method. This justifies the
Fig. 4.6: (a) Waveform of the sentence “enduku babu, annadu pujari ascharyanga!”. (b) Phonologically derived AFs. (c) Acoustically derived AFs using multi-speaker and multi-lingual data (English + Telugu).
Table 4.3: Subjective evaluation of cross-lingual voice conversion models. Scores in parentheses are obtained using the multi-speaker, multi-lingual encoder.

Source Speaker (Lang.)   Target Speaker (Lang.)   MOS          Similarity test
Speaker1 (Telugu)        BDL (English)            1.85 (3.1)   2.40 (2.95)
Speaker2 (Kannada)       BDL (English)            2.00 (3.0)   2.50 (2.8)
use of a multi-speaker, multi-lingual encoder for generating language-normalized AFs. Smoothing the AF trajectories further helps in realizing a speaker-independent form. These tests indicate that cross-lingual transformation can be achieved using AFs as the canonical form in the noisy-channel model, and that the output possesses the characteristics of the BDL voice.
4.4 Summary
In this chapter, we have shown that it is possible to build a voice conversion system by capturing the speaker-specific characteristics of a target speaker (the noisy-channel model). We used an ANN model to capture the speaker-specific characteristics. Such a model does not require any speech data from source speakers and hence can be considered independent of the source speaker. We used smoothed AFs to represent the canonical form of a speech signal. Our results indicate that AFs can be used as the canonical form of the speech signal in the noisy-channel model to capture speaker-specific characteristics for voice conversion. An effective process of normalization or transformation of AFs for cross-lingual voice conversion remains to be investigated further.
Chapter 5
Summary and Conclusion
5.1 Summary of the work
The main limitation of most current voice conversion systems is the requirement of parallel data between source and target speakers. Parallel data is a set of recordings of the same sentences uttered by both source and target speakers. However, parallel data is not always feasible, especially in cross-lingual voice conversion, where the languages of the source and target speakers differ. Voice conversion techniques that do not require parallel data have been proposed in the literature, but they still require a priori data from the source speaker. Such techniques cannot be applied when an arbitrary source speaker wants to transform his/her voice into a target speaker's voice without any prior recording.
In this dissertation, we proposed a method to perform voice conversion without the need for training data from the source speaker. In this method, we used articulatory features (AFs) to capture the speaker-specific characteristics of a target speaker. This alleviates the need for parallel data and can be used in a cross-lingual voice conversion system. To capture the speaker-specific characteristics of a target speaker, we formulated a noisy-channel model. The idea behind modelling a noisy channel is as follows. Suppose C is a canonical form of a speech signal (a generic, speaker-independent representation of the message in the speech signal), which passes through the speech production system of a target speaker to produce a surface form S. This surface form S carries the message as well as
the identity of the speaker.
One can interpret S as the output of a noisy channel for the input C. Here, the noisy channel is the speech production system of the target speaker. We used an artificial neural network (ANN) to model the speech production system of a target speaker, which captures the essential speaker characteristics of the target speaker. The choice of representation of C and S for a speech signal plays an important role in this method. We used articulatory features (AFs), which represent the characteristics of the speech production process and are assumed to be speaker independent, as the canonical form of the speech signal. MCEPs, used as the surface form, capture both speech and speaker information. However, speaker identification experiments using AFs showed that AFs contain significant amounts of speaker information in their trajectories. We therefore proposed a method called mean smoothing to normalize the speaker-specific information in AFs. Results showed that smoothing reduces speaker-specific information significantly without losing much speech information. The smoothed AFs were then used as the canonical form in the noisy-channel model. Finally, subjective and objective evaluations revealed that the transformed speech produced by the proposed method is intelligible and possesses the characteristics of the target speaker.
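The mean smoothing referred to above can be sketched as a moving average applied independently to each AF trajectory. This is a minimal sketch assuming numpy; the window length and edge-padding scheme are assumed details, not necessarily those used in this thesis.

```python
import numpy as np

def mean_smooth(af_traj, win=5):
    """Moving-average smoothing of AF trajectories.

    af_traj: (frames, dims) array of AF values.
    win:     odd window length (an assumed value; the exact window
             used in the thesis is not reproduced here).
    """
    half = win // 2
    # Repeat edge frames so the output keeps the same number of frames.
    padded = np.pad(af_traj, ((half, half), (0, 0)), mode="edge")
    kernel = np.ones(win) / win
    # Smooth each AF dimension independently.
    return np.stack(
        [np.convolve(padded[:, d], kernel, mode="valid")
         for d in range(af_traj.shape[1])], axis=1)

# A rapidly alternating 1-D trajectory: smoothing damps the oscillation.
traj = np.array([[0.0], [1.0], [0.0], [1.0], [0.0], [1.0]])
smoothed = mean_smooth(traj, win=3)
```

The smoothing suppresses frame-to-frame fluctuations, which is where much of the speaker-specific variation in the AF trajectories resides, while the slowly varying phonetic content survives.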
5.2 Major contributions of the work
The central contribution of the research reported in this thesis is voice conversion using articulatory features. The articulatory features used in this work are motivated by the phonological properties of sounds, not by actual articulator position data collected using medical devices. The major contributions of this work are:
1. Studied the significance of voice conversion in human speech communication.
2. Extracted the articulatory features from a given speech signal.
3. Analyzed the speaker information in articulatory features.
4. Proposed a method to normalize the speaker information in articulatory features.
5. Proposed a method to capture speaker-specific characteristics using articulatory fea-
tures which can be used in voice conversion.
6. Analyzed the use of articulatory features in cross-lingual voice conversion.
5.3 Directions for future work
• The research work in this thesis studied the transformation of spectral features and average pitch frequency only. These are not sufficient for a good voice transformation. Duration and pitch contours are also important features that affect transformation performance and should be addressed in future work.
• In this work, we used machine learning techniques to extract articulatory features from a given speech signal. These techniques require large amounts of transcribed speech data for training. There is thus a need for alternative signal processing techniques that can extract these features with little training data.
• In our approach to cross-lingual voice conversion, since we did not use a bilingual speaker, we had no means of performing an objective evaluation. Hence, there is a need for an algorithm that can assess the quality of the transformation objectively.
List of Publications
The work done during my Master's has been disseminated through the following journal and conference publications.
Journal
Bajibabu Bollepalli, Alan W Black, Kishore Prahallad, “Use of Articulatory Features for
Non-parallel Voice Conversion”, in preparation for IEEE Transactions on Acoustics and
Speech Signal Processing.
Conferences
1. Bajibabu Bollepalli, Alan W Black, Kishore Prahallad, “Modelling a Noisy-channel
for Voice Conversion Using Articulatory Features”, accepted in INTERSPEECH-
2012, Portland, USA.
2. Sathya adithya Thati, Bajibabu B, B. Yegnanarayana, “Analysis of Breathy Voice
Based on Excitation Characteristics of Speech Production”, accepted in SPCOM
2012, Bangalore, India.
3. Srikanth Ronanki, Bajibabu B, Kishore Prahallad, “Duration Modelling In Voice
Conversion Using Artificial Neural Networks”, International Conference on Sys-
tems, Signals and Image Processing (IWSSIP), Vienna, Austria, April, 2012.
4. Bajibabu, Ronanki Srikanth, Sathya Adithya Thati, Bhiksha Raj, B Yegnanarayana,
Kishore Prahallad. “A comparison of prosody modification using instants of sig-
nificant excitation and mel-cepstral vocoder”, Centenary Conference of the Indian
Institute of Science, 14-17 Dec 2011, Bangalore.
5. Gautam Verma Mantena, Bajibabu B, Kishore Prahallad, “SWS task: Articulatory
Phonetic Units and Sliding DTW”, MediaEval 2011, Satellite Events in INTER-
SPEECH 2011, Italy, September 1-2, 2011.