VOICE CONVERSION USING ARTICULATORY FEATURES
A THESIS
submitted by
BAJIBABU BOLLEPALLI
200731002
Master of Science (by Research)
in
Electronics and Communication Engineering
Language Technologies Research Centre
International Institute of Information Technology
Hyderabad- 500 032, India
JUNE 2012
To my parents, friends and guide
INTERNATIONAL INSTITUTE OF INFORMATION TECHNOLOGY
Hyderabad, India
CERTIFICATE
It is certified that the work contained in this thesis, titled “Voice conversion using artic-
ulatory features” by Bajibabu Bollepalli (200731002), has been carried out under my
supervision and is not submitted elsewhere for a degree.
Date Adviser: Dr. Kishore Prahallad
Acknowledgements
I would like to express my deepest respect and most sincere gratitude to Dr. Kishore Prahallad for his constant guidance and encouragement at all stages of my work. I am fortunate to have had numerous technical discussions with him, from which I have benefited enormously. I thank him for allowing me to explore the challenging world of speech technology and for always finding the time to discuss the difficulties along the way.
I thank my thesis committee members, Dr. Garimella Rama Murthy and Dr. Anil Ku-
mar Vuppala for sparing their valuable time to evaluate the progress of my research work.
I am thankful to Prof. B. Yegnanarayana, Dr. Suryakanth, and Dr. Rajendran for their immense support and help throughout my research work, and for all their invaluable advice on both technical and nontechnical matters. I thank my senior laboratory
members for all the cooperation, understanding and help I received from them.
I am very grateful for having had the opportunity to study among my colleagues:
Ronanki srikanth, Sathya adithya thati, Ch. Nivedita, E. Naresh kumar, P. Gangamohan,
Santosh Kesiraju, Anand Swaroop, Bharghav, Gautam Mantena, Buchi babu, Vasanth Sai,
Sivanand, Ravi shankar Prasad, Karthik, Sridhar, Aneeja, Vishala, Rambabu, Sudharasan,
Dhanujaya, Guru Prasad, Anand Joseph, Chetana, D. Rajesh, Vinay kumar Mittal - for all
the support, fruitful discussions and fun times together.
Needless to mention the love and moral support of my family. This work would not
have been possible without their support.
Bajibabu Bollepalli
Abstract
The aim of voice conversion is to transform an utterance spoken by an arbitrary (source) speaker to that of a specific (target) speaker. Text-to-speech (TTS), speech-to-speech translation, mimicry generation and human-machine interaction systems are among the numerous applications which can greatly benefit from a voice conversion module. Generally, voice conversion systems require parallel data between the source and target speakers. Parallel data is a set of utterances recorded by both the source and target speakers. With such data, one can build a mapping function at the frame level to transform the characteristics of the source speaker to those of a specified target speaker using machine learning techniques (GMMs, ANNs, etc.). These techniques perform well in the sense that humans typically perceive the transformed speech to sound more like the target speaker than the source speaker. But parallel data is not always feasible, especially in cross-lingual voice conversion, where the languages of the source and target speakers differ. In the literature, voice conversion techniques have been proposed which do not require parallel data, but they require speech data a priori from the source speaker. These techniques cannot be applied when an arbitrary source speaker wants to transform his/her voice to a target speaker without any a priori recording.
In this dissertation, we propose a method to perform voice conversion without the need for training data from the source speaker. It alleviates the need for any a priori speech data from the source speaker, and can be used for cross-lingual voice conversion. In this method, we capture the speaker-specific characteristics of the target speaker. The problem of capturing speaker-specific characteristics can be viewed as a noisy-channel modelling problem. The idea behind the noisy-channel model is as follows. Suppose a canonical form C of a speech signal (a generic, speaker-independent representation of the message in the speech signal) passes through the speech production system of a target speaker to produce
a surface form S. This surface form S carries the message as well as the identity of the speaker. One can interpret S as the output of a noisy channel for the input C. Here, the noisy channel is the speech production system of the target speaker. We used an artificial neural network (ANN) to model the speech production system of a target speaker, which captures the essential speaker characteristics of the target speaker. The choice of representation of C and S plays an important role in this method. We used articulatory features (AFs), which represent the characteristics of the speech production process, as the canonical, speaker-independent representation of the speech signal, as they are assumed to be speaker independent. However, our analysis showed that AFs carry a significant amount of speaker information in their trajectories. Thus, we propose suitable techniques to normalize the speaker-specific information in AF trajectories, and the resultant AFs are used for voice conversion. We show that the proposed method alleviates the need for source speaker data and can be applied to cross-lingual voice conversion. Subjective and objective evaluations reveal that speech transformed using the proposed approach is intelligible and possesses the characteristics of the target speaker. A set of transformed utterances corresponding to the results discussed in this work is available for listening at http://researchweb.iiit.ac.in/~bajibabu.b/vc_evaluation.html
Keywords: Voice Conversion, Artificial Neural Networks, Articulatory Features, Spectral Mapping, Noisy-channel Modelling, Cross-Lingual Voice Conversion.
Contents
Abstract iii
List of Tables viii
List of Figures xi
Abbreviations xii
1 Introduction to voice conversion 0
1.1 Voice conversion and its applications . . . . . . . . . . . . . . . . . . . 0
1.2 Acoustic cues for voice conversion system . . . . . . . . . . . . . . . . 1
1.3 Voice conversion using parallel data . . . . . . . . . . . . . . . . . . . 2
1.3.1 Feature extraction . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3.2 Alignment of parallel data . . . . . . . . . . . . . . . . . . . . 3
1.3.3 Training/testing in voice conversion . . . . . . . . . . . . . . . 4
1.3.4 Mapping function . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3.5 Evaluation metrics for voice conversion . . . . . . . . . . . . . 6
1.4 Voice conversion using non-parallel data . . . . . . . . . . . . . . . . . 8
1.5 Limitations of the current systems . . . . . . . . . . . . . . . . . . . . 11
1.5.1 Limitations using parallel data . . . . . . . . . . . . . . . . . . 11
1.5.2 Limitations using non-parallel data . . . . . . . . . . . . . . . 12
1.6 Objective and scope of the work . . . . . . . . . . . . . . . . . . . . . 13
1.7 Contributions of this thesis . . . . . . . . . . . . . . . . . . . . . . . . 14
1.8 Organization of the thesis . . . . . . . . . . . . . . . . . . . . . . . . . 15
2 Articulatory features 17
2.1 Human speech production . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 Types of articulatory features . . . . . . . . . . . . . . . . . . . . . . . 19
2.3 Extraction of articulatory features . . . . . . . . . . . . . . . . . . . . 20
2.3.1 Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3.2 Feature extraction . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3.3 Analysis phase (Encoder): MCEP to AF . . . . . . . . . . . . . 21
2.3.4 Evaluation of mapping accuracy . . . . . . . . . . . . . . . . . 25
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3 Analysis of speaker information in articulatory features 31
3.1 Speaker identification using articulatory features . . . . . . . . . . . . . 32
3.1.1 Preparation of the data . . . . . . . . . . . . . . . . . . . . . . 32
3.1.2 Feature extraction . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.1.3 Speaker modelling . . . . . . . . . . . . . . . . . . . . . . . . 33
3.1.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2 Normalizing speaker specific information . . . . . . . . . . . . . . . . 36
3.2.1 Use of smoothed AFs for speech recognition . . . . . . . . . . 38
3.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4 Use of articulatory features for voice conversion 40
4.1 Noisy-channel model . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2 Intra lingual voice conversion . . . . . . . . . . . . . . . . . . . . . . . 42
4.2.1 Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.2.2 Training target speaker’s model . . . . . . . . . . . . . . . . . 43
4.2.3 Conversion process . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2.4 Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2.5 Experiments on multiple speaker database . . . . . . . . . . . . 45
4.2.6 Mapping of excitation features . . . . . . . . . . . . . . . . . . 46
4.3 Cross-lingual voice conversion . . . . . . . . . . . . . . . . . . . . . . 47
4.3.1 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5 Summary and Conclusion 52
5.1 Summary of the work . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.2 Major contributions of the work . . . . . . . . . . . . . . . . . . . . . 53
5.3 Directions for future work . . . . . . . . . . . . . . . . . . . . . . . . 54
References 55
List of Publications 63
List of Tables
2.1 Eight articulatory properties, the classes of each property, and the number of bits required to represent each property. . . . . . . . . . . . . . . . 22
2.2 Average MCD and MOS scores of analysis-by-synthesis approach. . . 26
2.3 Frame-wise recognition using TIMIT database. . . . . . . . . . . . . . 27
2.4 Frame-wise recognition using ARCTIC database. . . . . . . . . . . . . 28
3.1 Accuracies (%) of speaker identification system using MCEPs and AFs 35
3.2 Accuracies (%) of four groups of articulatory features . . . . . . . . . 36
3.3 Phone recognition accuracies using MCEPs and smoothed AFs. AFs-‘k’
correspond to applying 11-point mean-smoothing window ‘k’ times. . . 39
4.1 MCD scores obtained between multiple speaker pairs with SLT and BDL
as target speakers. Scores in parenthesis are obtained using parallel data. 45
4.2 Subjective evaluation of voice conversion models built by using parallel
and Noisy-channel models. . . . . . . . . . . . . . . . . . . . . . . . . 47
4.3 Subjective evaluation of cross-lingual voice conversion models. Scores in
parenthesis are obtained using multi-speaker and multi-lingual encoder. 50
List of Figures
1.1 Plot of an utterance recorded by two speakers showing that their durations
differ even if the spoken sentence is the same. The spoken sentence is
“Will we ever forget it” which has 18 phones “pau w ih l w iy eh v er f
er g eh t ih t pau pau” according to the US English phoneset. Adopted
from [1]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Plot of an utterance recorded by two speakers showing that their durations
match after applying DTW. The spoken sentence is “Will we ever forget
it” which has 18 phones “pau w ih l w iy eh v er f er g eh t ih t pau pau”
according to the US English phoneset. Adopted from [1]. . . . . . . . . 4
1.3 Block diagram of training and testing modules in the voice conversion
framework. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Flowchart of the training and conversion modules of a voice conversion
system capturing speaker-specific characteristics. Notice that during training, only the target speaker's data is used. Adopted from [2] . . . . . . . 14
2.1 (1) nasal cavity, (2) hard palate, (3) alveolar ridge, (4) soft palate (velum),
(5) tip of the tongue (apex), (6) dorsum, (7) uvula, (8) radix, (9) pharynx,
(10) epiglottis, (11) false vocal cords, (12) vocal cords, (13) larynx, (14)
esophagus, and (15) trachea. Adopted from [3] . . . . . . . . . . . . . 18
2.2 Architecture of a five-layered MLFFNN with number of nodes in each
layer and type of activation function. . . . . . . . . . . . . . . . . . . . 23
2.3 (a) Waveform of the sentence “The angry boy answered but they didn’t
look up.”, (b) Expected output in binary (phonologically derived AFs).
(c) Actual output in continuous (acoustically derived AFs). . . . . . . . 25
2.4 Block diagram representation of both analysis and synthesis of AFs. . . 25
2.5 Articulatory contours of vowels w.r.t. tongue movements. Dashed line corresponds to actual binary AFs, solid line corresponds to predicted AFs. . 29
3.1 (a),(b) and (c) show unsmoothed AF contours of stops, fricatives, approx-
imants for speaker-1 and speaker-2, respectively. (d),(e) and (f) show
smoothed AF contours of Stops, Fricatives, Approximants for speaker-1
and speaker-2, respectively. . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2 Speaker identification accuracies for different levels of smoothing by 5-
point and 11-point mean-smoothing window. Level ‘k’ corresponds to
applying mean-smoothing window ‘k’ times. . . . . . . . . . . . . . . 37
3.3 Speaker identification accuracies and MCD scores for different levels of
smoothing. Level ‘k’ corresponds to applying 11-point mean-smoothing
window ‘k’ times. All scores are normalized with respect to scores with-
out smoothing (Level 0). . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.1 Mapping of arbitrary source speaker into target speaker . . . . . . . . . 41
4.2 Capturing speaker-specific characteristics as a speaker-coloring function. 42
4.3 Plot of MCD scores obtained between different speaker pairs. . . . . . . 44
4.4 Flow-chart of the training and conversion modules of a voice conversion
system capturing speaker-specific characteristics. . . . . . . . . . . . . 45
4.5 (a) Waveform of sentence “enduku babu, annadu pujari ascharyanga!.”,
(b) Phonologically derived AFs. (c) Acoustically derived AFs using multi-
speaker and mono-lingual data (English). . . . . . . . . . . . . . . . . 49
4.6 (a) Waveform of sentence “enduku babu, annadu pujari ascharyanga!.”,
(b) Phonologically derived AFs. (c) Acoustically derived AFs using multi-
speaker and multi-lingual data (English + Telugu). . . . . . . . . . . . 50
Abbreviations
AFs - Articulatory features
ANN - Artificial Neural Networks
DFW - Dynamic Frequency Warping
DTW - Dynamic Time Warping
EM - Expectation Maximization
GMM - Gaussian Mixture Models
HMM - Hidden Markov Models
LPC - Linear Prediction Coefficients
LSF - Line Spectral Frequency
MAP - Maximum A Posteriori
MCD - Mel Cepstral Distortion
MCEP - Mel-cepstral Coefficients
MFCC - Mel Frequency Cepstral Coefficients
MLSA - Mel Log Spectrum Approximation
MOS - Mean Opinion Scores
RMS - Root Mean Square
TTS - Text-to-Speech
VC - Voice Conversion
VQ - Vector Quantization
VTLN - Vocal Tract Length Normalization
Chapter 1
Introduction to voice conversion
1.1 Voice conversion and its applications
The aim of a voice conversion system is to transform the utterance of an arbitrary speaker (referred to as the source speaker) to sound as if spoken by a specific speaker (referred to as the target speaker), so that listeners perceive the source speaker's speech as if uttered by the target speaker. Voice conversion is also referred to as voice transformation or voice morphing. It involves two main processes: 1) identification of the source speaker's characteristics, and 2) replacement of the source speaker's characteristics with the target speaker's characteristics, without any loss of information in the given speech signal.
Due to its wide range of applications, there has been a considerable amount of research
effort directed towards this problem in the past few years.
• One of the most important applications of voice conversion is its use as a module in a text-to-speech (TTS) synthesis system. A unit-selection based TTS system generates synthetic speech by selecting the most appropriate sequence of natural sound units from a large database. Typically, these systems require the recording of large speech corpora by professional speakers to achieve good speech quality [4] [5]. In addition to the time and effort involved in such recordings, most people are unable to read such a large number of sentences/transcriptions in a consistent manner, and such inconsistencies can decrease the quality of the final synthesizers. However, with the use of voice conversion techniques, it is possible to build a TTS system in a new voice with typically 30-50 utterances (10-15 minutes) from a new speaker. Hence, it is advantageous to employ voice conversion to create new TTS voices with the help of existing voices [6] [7]. Voice conversion techniques can also be used to build multilingual TTS systems [7]. In this framework, units from multiple languages are recorded by one speaker per language and combined to improve the coverage of units. However, a TTS system built on such a database has multiple speaker identities in the synthesized speech. Hence, a voice conversion technique is applied to transform the synthesized utterance to a particular target speaker.
• A speech recognition system, which converts a spoken utterance into a textual sentence, has to be sufficiently robust to a large variety of speakers. Hence, voice conversion can also be used as a method for speaker normalization, by converting every speaker's data into a single speaker's voice.
• A speech-to-speech (S2S) translation system transforms speech spoken in one language into another language [8]. The speaker may be native or non-native to the target language, and S2S translation systems usually produce the synthesized voice in some other speaker's voice rather than the source speaker's. Voice conversion techniques can therefore be used in these systems to transform the synthesized speech in the target language to the source speaker's voice. Voice conversion can also be used in the film dubbing industry, or to transform an ordinary voice singing karaoke into a famous singer's voice.
1.2 Acoustic cues for voice conversion system
The objective of a voice conversion system is to transform the identity of a source speaker's voice so that it is perceived as if spoken by a specified target speaker. Hence, a complete voice conversion system should be capable of transforming all types of speaker-dependent characteristics of speech. The speaker-dependent characteristics lie at both the acoustic and linguistic levels of the speech signal [9]. The acoustic-level parameters are divided into segmental (spectral + fundamental frequency F0) and supra-segmental (prosodic) levels. The segmental-level parameters analyzed are the locations and bandwidths of formants, spectral tilt, and the characteristics of the voiced excitation by the vocal folds [10]. The supra-segmental-level parameters relate to the style of speaking and include phoneme duration, the evolution of fundamental frequency (intonation) and energy (stress) over the utterance [11]. Linguistic cues include the language of the speaker, as well as the choice of lexical patterns, details of dialect, choice of syntactic constructs, and the semantic context.
Due to difficulties in extracting and modeling supra-segmental and linguistic cues from speech signals, most current voice conversion systems focus on the segmental-level features of voice, i.e., spectral characteristics and fundamental frequency F0. A vast majority of them focus only on spectral feature transformation. Transformation of the spectral parameters of a source speaker to those of a target speaker is done by employing machine learning techniques such as vector quantization (VQ) [12] [13] [14] [15], Gaussian mixture models (GMM) [16] [17] [18] [19], artificial neural networks (ANN) [20] [21], and hidden Markov models (HMM) [22] [23]. In the scope of this thesis, we also focus on the transformation of spectral features.
1.3 Voice conversion using parallel data
A vast majority of voice conversion systems found in the literature use parallel data to find the correspondence between the spectral features of a source and a target speaker at the frame level. Parallel data is obtained when exactly the same sentences are uttered by both the source speaker and the target speaker. The availability of such parallel data enables us to establish a relationship between the utterances of the source and target speakers at the phone/frame level and build a voice conversion system. This is considered the baseline voice conversion system.
1.3.1 Feature extraction
To extract features from a given speech signal, we assume a model which is a mathemat-
ical representation of the speech production mechanism that makes the analysis, manip-
ulation and transformation of the speech signal possible. The source-filter model [24] is
widely used in various areas of speech research such as speech synthesis, speech recog-
nition, speech coding, speech enhancement, etc. In the context of source-filter modeling,
speech is defined as the output of a time-varying vocal tract system excited by a time-
varying excitation signal [25]. The time-varying filter represents the vocal tract shape,
which selectively boosts/attenuates certain frequencies of the excitation spectrum depend-
ing on the location and position of the tongue, jaw, lips and velum. The input to this filter
is an excitation signal which is a mixture of a quasi periodic signal and a noise source.
Both the excitation and the filter characteristics are represented by features which are usually extracted from the speech signal by performing a frame-by-frame analysis, where the size of a frame can vary from 5 ms to 30 ms. Spectral features are generally used to characterize the vocal tract shape or the filter (formant frequencies, linear prediction coefficients (LPCs) [26], cepstrum, line spectral frequencies (LSFs) [27], bandwidths [28], mel-frequency cepstral coefficients (MFCCs) [29], etc.). Features such as pitch period, residual, and glottal closure instants are derived from the excitation signal.
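The frame-by-frame analysis described above can be sketched in a few lines. The following Python function is a minimal illustration (the function name, the numpy dependency, and the 25 ms frame / 10 ms shift defaults are our own choices for the sketch, not a specification from this work):

```python
import numpy as np

def frame_signal(signal, sr, frame_ms=25, shift_ms=10):
    """Split a speech signal into overlapping, Hamming-windowed frames.

    The 25 ms / 10 ms settings are illustrative defaults; as noted
    above, frame sizes in practice vary from about 5 ms to 30 ms.
    Returns an array of shape (n_frames, frame_len).
    """
    frame_len = int(sr * frame_ms / 1000)
    shift_len = int(sr * shift_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // shift_len)
    window = np.hamming(frame_len)  # taper each frame to reduce spectral leakage
    frames = np.stack([
        signal[i * shift_len: i * shift_len + frame_len] * window
        for i in range(n_frames)
    ])
    return frames
```

Each row of the returned array would then be passed to a spectral analysis step (cepstrum, LPC, etc.) to produce one feature vector per frame.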
1.3.2 Alignment of parallel data
Voice conversion systems learn transformation functions from the training data of the source and target speakers. In order to map the source speaker's acoustic space to the target speaker's acoustic space, it is necessary to know the source-target correspondence between different training units. The process by which this correspondence is established is called alignment. The most preferred frame-alignment technique is dynamic time warping (DTW), almost a standard in voice conversion systems [12] [18] [30].
As the durations of the parallel utterances typically differ (as shown in Fig. 1.1), dynamic time warping is used to align the vectors of the source and target speakers. Fig. 1.1 is a plot of an utterance recorded by two speakers. The utterance consists of 18 phones, the boundaries of which are indicated by the vertical lines. It is clear from this figure that the durations of the phones in the two recorded utterances differ even though the spoken sentence is the same. Fig. 1.2 shows that the durations of the two utterances match after applying DTW.
Fig. 1.1: Plot of an utterance recorded by two speakers showing that their durations differ even if the spoken sentence is the same. The spoken sentence is “Will we ever forget it” which has 18 phones “pau w ih l w iy eh v er f er g eh t ih t pau pau” according to the US English phoneset. Adopted from [1].
Fig. 1.2: Plot of an utterance recorded by two speakers showing that their durations match after applying DTW. The spoken sentence is “Will we ever forget it” which has 18 phones “pau w ih l w iy eh v er f er g eh t ih t pau pau” according to the US English phoneset. Adopted from [1].
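The DTW alignment described above is a standard dynamic programming recursion over a frame-to-frame distance matrix. The following Python function is a minimal sketch (the plain O(N·M) formulation with Euclidean frame distance is a common textbook variant; the function and variable names are ours, and this is not the exact implementation used in the thesis):

```python
import numpy as np

def dtw_align(src, tgt):
    """Align two feature-vector sequences by dynamic time warping.

    src: (N, D) source frames; tgt: (M, D) target frames.
    Returns a list of (i, j) pairs of aligned frame indices along
    the minimum-cost warping path.
    """
    N, M = len(src), len(tgt)
    # Pairwise Euclidean distances between all source/target frames.
    dist = np.linalg.norm(src[:, None, :] - tgt[None, :, :], axis=2)
    cost = np.full((N + 1, M + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j - 1],  # match
                cost[i - 1, j],      # source frame repeated
                cost[i, j - 1],      # target frame repeated
            )
    # Backtrack from (N, M) to recover the warping path.
    path, i, j = [], N, M
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```

Applied to parallel utterances, the returned index pairs give the frame-level source-target correspondence used to train the mapping function.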
1.3.3 Training/testing in voice conversion
The schematic diagrams of the training and testing modules in parallel voice conversion are shown in Fig. 1.3(a) and Fig. 1.3(b), respectively. The training module of a voice conversion system, which transforms both the excitation and the spectral features (filter parameters) from the source speaker's acoustic space to the target speaker's acoustic space, is shown in Fig. 1.3(a). Fig. 1.3(b) shows the block diagram of the modules involved in the voice conversion testing process. During testing or conversion, the transformed spectral features, along with the excitation features, are used as input to a speech production model (source-filter) to synthesize the transformed utterance.
Fig. 1.3: Block diagram of training and testing modules in the voice conversion framework.
1.3.4 Mapping function
Mapping of spectral features
After the alignment is done, machine learning techniques such as vector quantization (VQ) [12], hidden Markov models (HMM) [22] [31] [32], Gaussian mixture models (GMM) [33] [16] [34] [35], artificial neural networks (ANN) [20] [21] [36], dynamic frequency warping (DFW) [13] and unit selection [37] are applied to obtain a transformation function between the spectral features of the source speaker's acoustic space and the target speaker's acoustic space.
Mapping of excitation features
Though the residual signal is impulse-like for voiced frames and noise-like for unvoiced frames, it contains glottal characteristics of speech that are not modeled by spectral features. The excitation signal thus also contains information that can help achieve the required conversion performance and quality.
A logarithmic Gaussian normalized transformation [38] is used to transform the fundamental frequency F0 of a source speaker to the F0 of a target speaker, as indicated in Eq. 1.1 below. The assumption here is that the major cues of speaker identity lie in the spectral features, and hence a simple linear transformation is sufficient to transform the excitation characteristics.
log(F0c) = µt + (σt/σs) (log(F0s) − µs)    (1.1)
where µs, σs and µt, σt are the means and standard deviations of the fundamental frequency in the logarithmic domain for the source and target speakers, respectively; F0s is the pitch of the source speaker and F0c is the converted pitch frequency.
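Eq. 1.1 amounts to a mean/variance normalization in the log-F0 domain: remove the source speaker's log-F0 statistics, then impose the target speaker's. A minimal Python sketch (the function name and interface are our own; in practice µ and σ would be estimated from the voiced frames of each speaker's training data):

```python
import math

def convert_f0(f0_src, mu_s, sigma_s, mu_t, sigma_t):
    """Log-Gaussian normalized F0 transformation (Eq. 1.1).

    mu_s, sigma_s / mu_t, sigma_t: mean and standard deviation of
    log(F0) for the source / target speaker.
    f0_src: F0 (Hz) of one voiced source frame.
    Returns the converted F0 in Hz.
    """
    log_f0_c = mu_t + (sigma_t / sigma_s) * (math.log(f0_src) - mu_s)
    return math.exp(log_f0_c)
```

Note that a source frame at the source speaker's mean log-F0 maps exactly to the target speaker's mean log-F0, as the equation requires.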
1.3.5 Evaluation metrics for voice conversion
A successful voice conversion system must be good in terms of naturalness, intelligibility, and identity. Naturalness is how human-like the produced speech sounds. Intelligibility is the degree to which the spoken words can be correctly understood, and identity is the recognizability of the individuality of the speech. Different methods have been proposed to measure these qualities. Some are objective measures, which can be computed automatically from the audio data; they are typically faster and cheaper to obtain as they do not involve human experiments. Others are subjective measures, which are based on the opinions expressed by humans in listening evaluations, or on other human behaviour.
6
Objective measures
Distance measures are most commonly used for providing objective scores. One of them is spectral distortion (SD), which has been widely used to quantify spectral envelope conversions. For example, Abe et al., 1988 [12] measured the ratio of the spectral distortion between the transformed and target speech to that between the source and target speech as follows:
R = SD(trans, tgt) / SD(src, tgt)    (1.2)
where R is the normalized distance, SD(trans, tgt) is the spectral distortion between the transformed and the target speaker utterances, and SD(src, tgt) is the spectral distortion between the source and the target speaker utterances.
A comparison of the performance of different types of conversion functions using a warped root mean square (RMS) log-spectral distortion measure was reported in [16]. Similar spectral distortion measures have been reported by other researchers [33] [39]. In addition, excitation spectrum, RMS energy, F0 and duration distances have also been used to measure excitation, energy, fundamental frequency and duration conversions [23].
Mel cepstral distortion (MCD) is another objective error measure, which appears to correlate with subjective test results [35]. Thus MCD is used to measure the quality of voice transformation [34]. MCD is related to vocal characteristics and hence is an important measure to check the performance of the mapping obtained by an ANN/GMM network. MCD is essentially a weighted Euclidean distance defined as:
ANN/GMM network. MCD is essentially a weighted Euclidean distance defined as:
MCD = (10/ln10) ∗
√√2 ∗
25∑k=1
(cek − ct
k)2 (1.3)
where cti and ce
i denote the target and the estimated Mel-cepstral coefficients, respectively.
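Eq. 1.3 can be computed directly from two frames of Mel-cepstral coefficients. A minimal Python sketch (the function name is ours; following the equation, the sum runs over 25 coefficients, excluding the 0th energy term):

```python
import numpy as np

def mel_cepstral_distortion(c_target, c_estimated):
    """Mel cepstral distortion (Eq. 1.3) between one frame of target
    and estimated Mel-cepstral coefficients (25 values each).
    Returns the distortion in dB; identical frames give 0.
    """
    diff = np.asarray(c_target) - np.asarray(c_estimated)
    # (10 / ln 10) converts the natural-log cepstral distance to dB.
    return (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2))
```

In an evaluation, this per-frame distortion would be averaged over all (DTW-aligned) frames of the test utterances to give the reported MCD score.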
Subjective measures
Subjective measures are based on collecting and analyzing human opinions. Their advantage is that they relate directly to human perception, which is typically the standard for judging the quality of transformed speech. Their disadvantages are that they are time-consuming, expensive, and difficult to interpret.
Two popular listening tests are:
1. Mean opinion score (MOS): This test is used to evaluate the naturalness and intelligibility of converted speech. The participants are asked to rate the transformed speech in terms of its quality and/or intelligibility. It is similar to the similarity test, the major difference being that the similarity test concentrates on speaker characteristics, whereas the MOS test concentrates on quality and intelligibility.
2. Similarity test: The MOS score does not determine how similar the transformed speech and the target speech are. Hence, a similarity measure is used, where the participants are asked to grade, on a scale of 1 to 5, how close the transformed speech is to the target speaker's speech. A score of 5 means that the transformed and the target speech sound as if spoken by the same speaker, and a score of 1 indicates that the two utterances sound as if they are from totally different speakers.
1.4 Voice conversion using non-parallel data
Most voice conversion techniques use parallel corpora for training, i.e., the source speaker and the target speaker record the same set of utterances. In a realistic voice conversion application, however, only non-parallel corpora may be available during the training phase. Since it is not always feasible to obtain parallel utterances for training, methods have been proposed with the goal of reducing the recordings required from the source speaker. All such methods use non-parallel training data, and their goal is to find a one-to-one correspondence between the frames of the source and target speakers. The different kinds of methods that work with non-parallel data are described below.
1. Class mapping: In this method, the source and target vectors are separately classi-
fied into clusters using vector quantization. It involves two levels of alignment:
(a) First level: Each source speaker acoustic class is aligned to one of the target
speaker acoustic classes by searching the closest frequency-warped centroid.
(b) Second level: The vectors inside each class are mean-normalized and frame-
level alignment is performed by finding the nearest neighbour of each source
vector in the corresponding target class.
This technique was evaluated using objective measures, and it was found that its
performance was not as good as that obtained using parallel data [40]. However, this
method was proposed as a starting point for further improvements that led to the
development of the dynamic programming method.
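The two alignment levels above can be sketched in a few lines. This is a rough illustration only: it substitutes a simple k-means for the vector quantizer and plain Euclidean centroid matching for the frequency-warped centroid search of [40].

```python
import numpy as np

def vq_class_alignment(src, tgt, k=4, seed=0):
    """Two-level alignment sketch: cluster both speakers' frames with a
    simple k-means (VQ), map each source class to the nearest target
    centroid, then pair mean-normalized frames by nearest neighbour."""
    rng = np.random.default_rng(seed)

    def kmeans(X, k, iters=20):
        C = X[rng.choice(len(X), k, replace=False)]
        for _ in range(iters):
            lab = np.argmin(((X[:, None] - C[None]) ** 2).sum(-1), axis=1)
            C = np.array([X[lab == j].mean(0) if np.any(lab == j) else C[j]
                          for j in range(k)])
        return C, lab

    Cs, ls = kmeans(src, k)
    Ct, lt = kmeans(tgt, k)
    # First level: each source class -> closest target centroid (the
    # original method searches frequency-warped centroids; plain
    # Euclidean distance is used here for simplicity).
    cls_map = np.argmin(((Cs[:, None] - Ct[None]) ** 2).sum(-1), axis=1)
    pairs = []
    for j in range(k):
        S = src[ls == j] - Cs[j]                  # mean-normalized source class
        T = tgt[lt == cls_map[j]] - Ct[cls_map[j]]
        if len(S) == 0 or len(T) == 0:
            continue
        # Second level: nearest neighbour of each source vector in the
        # corresponding target class.
        nn = np.argmin(((S[:, None] - T[None]) ** 2).sum(-1), axis=1)
        pairs += [(s, t) for s, t in zip(np.flatnonzero(ls == j),
                                         np.flatnonzero(lt == cls_map[j])[nn])]
    return pairs
```

Each returned pair is a (source frame index, target frame index) correspondence on which a conversion function could then be trained.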
2. Speech recognition: Typically, speech recognition systems use a set of speaker-
independent HMMs to model the parameters of the speech signal. In this technique [41],
speaker-independent HMMs are used to label the source and target speaker utter-
ances at the frame level with state indices. Given the state sequence of one speaker, the
alignment procedure consists of finding the longest matching sub-sequences from
the other speaker, until all the frames are paired. The HMMs used for this task give
good results for intra-lingual alignment. However, the suitability of such models
for cross-lingual alignment tasks has not been tested yet.
3. Pseudo parallel corpora created for TTS: In some applications, like customiza-
tion of a text-to-speech synthesizer, a huge database of speech from the source
speaker is available. So, the TTS system can be used to generate the same sentences
that have been recorded from the target speaker. Given that a parallel training cor-
pus is now available, the parameter vectors can be aligned by DTW or HMM. The
main disadvantage of this method is that it can be applied only when there is enough
data from the source speaker to build a TTS system. This strategy is incompatible
with cross-lingual applications [32].
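Once the TTS has produced pseudo-parallel utterances, the frame alignment itself is standard dynamic time warping. A minimal DTW sketch with a Euclidean frame distance and no path constraints (real systems usually add slope and band constraints):

```python
import numpy as np

def dtw_align(X, Y):
    """Plain DTW sketch: align two feature sequences (e.g. source MCEPs
    and TTS-generated pseudo-parallel target MCEPs) and return the
    warping path as (source_frame, target_frame) pairs."""
    n, m = len(X), len(Y)
    D = np.full((n + 1, m + 1), np.inf)   # accumulated cost matrix
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(X[i - 1] - Y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack from (n, m) to recover the optimal path
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        k = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if k == 0:
            i, j = i - 1, j - 1
        elif k == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```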
4. Dynamic programming: This method is based on the unit selection paradigm.
Given a set of N source vectors S, dynamic programming is used to find the sequence of N target vectors T that minimizes the acoustic distance between the two
speakers. The distance measure is computed by a cost function such as the one
used in TTS systems to concatenate two units. In a unit selection based TTS sys-
tem, there are two costs involved: the target cost and the concatenation cost. In
TTS systems, the target cost considers the distance between the acoustic, prosodic
and phonetic characteristics of the target units and those predicted by the TTS itself
according to previously trained models, whereas in this alignment system the
target cost considers only the acoustic distance between the vectors of the source
speaker and those of the target speaker [37] [42] [43].
One important advantage of the alignment technique based on dynamic programming is that it establishes the correspondence between frames using only acoustic
information. Its performance is satisfactory even for cross-lingual applications.
However, it has two drawbacks: (a) it is very time-consuming, and (b) increasing
the size of the training database implies worsening the conversion scores, since the
optimal sequence of the target speaker is closer to the source speaker when there
are more frames available for selection.
Therefore, a new method for estimating pseudo-parallel data was proposed in [9].
Each source vector was mapped to its nearest neighbor in the target acoustic space,
and each target vector to its nearest neighbor in the source acoustic space, allowing
one-to-many and many-to-one alignments. When a voice conversion system using
GMM was trained on such aligned data, it was observed that an intermediate converted voice was obtained. That is, it was neither recognized as the source speaker's
voice nor as the target speaker's voice. When this proposed approach was applied
on the transformed data and the target speaker data, it resulted in an output closer to
the target speaker than the previously transformed sentences. If this procedure was
followed iteratively, the final voice was found to converge to the target speaker’s
voice.
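The bidirectional nearest-neighbour pairing described above can be sketched as follows; in the full method of [9], a GMM conversion function is trained on these pairs and the pairing is then re-estimated iteratively on the transformed data.

```python
import numpy as np

def nn_pseudo_parallel(src, tgt):
    """Pair every source vector with its nearest target vector and every
    target vector with its nearest source vector, allowing one-to-many
    and many-to-one alignments."""
    # Pairwise squared Euclidean distances, shape (n_src, n_tgt)
    d = ((src[:, None] - tgt[None]) ** 2).sum(-1)
    pairs = {(i, int(j)) for i, j in enumerate(d.argmin(axis=1))}   # src -> tgt
    pairs |= {(int(i), j) for j, i in enumerate(d.argmin(axis=0))}  # tgt -> src
    return sorted(pairs)
```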
5. Adaptation technique: This technique is based on building a transformation mod-
ule on the existing parallel data of an arbitrary source-target speaker pair and then
adapting this model to a particular pair of speakers for which no parallel data is
available [44]. Suppose A and B are the two speakers between whom we need to
build a transformation function, but the recorded utterances by these speakers are
not parallel. Suppose we also have parallel recorded utterances from speakers C
and D. We could then estimate a transformation function between speakers C and
D and use adaptation techniques to adapt the conversion model to speakers A and
B.
Cross-lingual voice conversion is the most extreme situation in terms of alignment.
Voice conversion systems dealing with different languages have some special require-
ments because the utterances available for training are characterized by different phoneme
sets. Obviously, the main difference between intra-lingual and cross-lingual alignment is
that it is not possible to obtain parallel corpora from utterances in different languages, so
the most popular alignment strategies are no longer valid. On the other hand, training
cross-lingual voice conversion functions would not be problematic at all if the
alignment problem were solved.
1.5 Limitations of the current systems
In the previous section, the state-of-the-art speech modelling and feature transformation
techniques employed in voice conversion framework have been discussed. The existing
methods have been shown to work reasonably well and are capable of achieving convincing identity transformations when a pair of speakers with similar characteristics is
involved. However, if the conventional conversion techniques are extended to more
extreme applications, such as cross-lingual voice conversion, emotion conversion and
speech repair, the results are far from convincing.
1.5.1 Limitations using parallel data
One of the main limitations of current voice conversion systems is the need to have both
the source and the target speakers record a matching set of utterances, referred to as parallel data. A
mapping function obtained on such parallel data can be used to transform spectral char-
acteristics from the source speaker to the target speaker [12] [33] [16] [20] [36] [45] [46].
However, the use of parallel data has many limitations:
1. If either of the speakers changes, then a new transformation function has to be
estimated which requires collection of parallel data from a new speaker.
2. If there are differences between the utterances of source and target speakers in terms
of recording conditions, duration, prosody, etc., then it introduces alignment errors,
which in turn leads to a poorer estimation of the transformation function.
3. Collection of parallel data is not always feasible. Collecting a parallel set of recordings from both speakers in a naturally time-aligned fashion [30] is a costly and
time-consuming task.
4. When applying voice conversion in a speech-to-speech translation, we desire the
target voice that is synthesized by a text-to-speech system to be identical to the
source speaker's voice. Since the source and target languages are different, it is very
unlikely to have parallel utterances of both speakers. We can classify this problem
as the need to acquire training data for cross-lingual voice conversion.
1.5.2 Limitations using non-parallel data
Section 1.4 explains the methods which align non-parallel data for training a voice conversion system. While these techniques avoid the need for parallel data, they still require
speech data (non-parallel data) from the source speakers a priori to build the conversion
models. This is a limitation for an application where an arbitrary user intends to transform
his/her speech to a pre-defined target speaker without recording anything a priori. Thus, it
is worthwhile to investigate conversion models which capture the speaker-specific char-
acteristics of a target speaker and avoid the need for speech data from source speaker for
training. Such conversion models not only allow an arbitrary speaker to transform his/her
voice to a pre-defined target speaker but also find applications in cross-lingual voice con-
version systems.
1.6 Objective and scope of the work
The main objective of this work is to alleviate the requirement of source speaker data in
intra-lingual voice conversion and reduce the complexity in obtaining training data for a
cross-lingual voice conversion system. We propose a method to capture the speaker-specific
characteristics of a target speaker. Such a method needs to be trained only on target
speaker data, and hence any arbitrary source speaker's speech could be transformed to the
specified target speaker.
Desai et al., 2010 [1] and Prahallad, 2010 [2] proposed a method to capture the speaker-specific characteristics of a target speaker. To our knowledge, this is the only previous
work which does not require source data a priori. They used an ANN model to
capture the speaker-specific characteristics. The core idea of this work is as follows.
Let L and S be two different representations of the target speaker’s speech signal. A
mapping function Ω(L) could be built to transform L to S . Such a function would be
specific to the target speaker and could be considered as capturing the essential speaker-
specific characteristics. The choice of representations L and S plays an important role
in building such mapping networks and their interpretation. In their work, they assume
that L represents speaker-independent (linguistic) information, and S represents linguistic
and speaker information. Then a mapping function from L to S should capture speaker-
specific information in the process. They used the first six formants, their bandwidths, and
delta features as the representation of L. The formants undergo a normalization technique
such as vocal tract length normalization (VTLN) to compensate for the speaker effect. S
is represented by traditional mel-cepstral features (MCEPs). They introduce the concept
of an error-correction network, which is essentially an additional ANN used to
map the predicted MCEPs to the target MCEPs so that the final output features represent
the target speaker in a better way. A schematic diagram of the training and conversion
modules is shown in Fig. 1.4. Notice that during training, only the target speaker's data is
used. The limitations of this work are:
• The formants are used to represent the language information (L) in the speech signal.
So, it is necessary to extract correct formants from the speech signal, but it is very
difficult to find a method that extracts exact formants from a given signal [47].
• Theoretically speaking, the number of formants varies from phone to phone. However, in this work, 6 formants are used for every phone, so this is not an optimal
representation for a phone.
• VTLN is used to normalize the speaker effect in formants; the method does not work
without it.
• This work uses an error correction network to improve the performance of the sys-
tem. It is a separate ANN mapper which adds more computations and parameters
to the system.
In this work, we investigate alternatives such as articulatory features for speaker-independent
representation of speech signal.
Fig. 1.4: Flowchart of the training and conversion modules of a voice conversion system capturing speaker-specific characteristics. Notice that during training, only the target speaker's data is used. Adopted from [2].
1.7 Contributions of this thesis
In this thesis, we propose articulatory features (AFs) as the canonical form, or speaker-independent representation, of the speech signal. The AFs used in this work represent characteristics of the speech production process
like the manner of articulation, place of articulation, lip rounding, etc. These features are
motivated by the human speech production mechanism. Chapter 2 briefly explains AFs
and how they can be extracted from a given speech signal. These features have been used
for automatic speech recognition (ASR) with the aim of better pronunciation modeling,
better co-articulation modeling, robustness to cross-speaker variation and noise, multi-lingual and cross-lingual portability of systems, language identification and expressive
speech synthesis. In these studies, the articulatory features derived from the acoustics are often treated as a generic or speaker-independent representation of the speech signal. However,
we show that AFs contain a significant amount of speaker information in their trajectories.
Thus, we propose suitable techniques to normalize the speaker-specific information in AF
trajectories and the resultant AFs are used for voice conversion.
1.8 Organization of the thesis
The contents of this thesis are organized as follows: In Chapter 2, we briefly explain
articulatory features and the features we use in this work. The methods to extract these
features from a given speech signal are also discussed. We summarize previous research
on the use of articulatory features for various speech systems.
In Chapter 3, we analyze the speaker-specific information in articulatory features by
conducting speaker identification experiments with Gaussian mixture models. We show
that AFs contain significant amounts of speaker information in their contours. We propose a technique to normalize the speaker-specific information in the AFs. Finally, we
conclude that the speaker-specific information in AFs has to be normalized before they
are used in voice conversion as a canonical form of the speech signal.
Chapter 4 proposes a new method that captures speaker-specific characteristics and
hence resolves the issue of requiring source speaker data for voice conversion training.
Finally, we conclude this chapter with experiments and results of this method when tested
in a cross-lingual voice conversion scenario.
In Chapter 5, we summarize the contributions of the present work and highlight some
issues arising out of the study.
Chapter 2
Articulatory features
This chapter introduces the concept of articulatory features that are used in this work and
methods to extract these features from a given speech signal. Section 2.1 gives a brief in-
troduction about the human speech production process, and the role articulatory features
play in describing it. In Section 2.2, we describe different types of articulatory features
(AFs) and the type of articulatory features that are modeled in this work. Section 2.3 ex-
plains the extraction of AFs from speech signal using ANNs and discusses some objective
measures used to evaluate the accuracy of extracted AFs. The summary of this chapter is
presented in Section 2.4.
2.1 Human speech production
The production of human speech is mainly based on the modification of an egressive
air stream by the articulators in the human vocal tract [3]. The activity of the vocal
organs in making a speech sound is called articulation. It involves three major processes:
1) the air stream process, 2) the phonation process, and 3) the configuration of the vocal
tract (oro-nasal process). The air stream process describes how sounds are produced and
manipulated by the source of air. The pulmonic egressive mechanism is based on air
being exhaled from the lungs, while the pulmonic ingressive mechanism produces sounds
while inhaling air. Ingressive sounds, however, are rather rare. The phonation process
Fig. 2.1: (1) nasal cavity, (2) hard palate, (3) alveolar ridge, (4) soft palate (velum), (5) tip of the tongue (apex), (6) dorsum, (7) uvula, (8) radix, (9) pharynx, (10) epiglottis, (11) false vocal cords, (12) vocal cords, (13) larynx, (14) esophagus, and (15) trachea. Adopted from [3]
occurs at the vocal cords. Voiced sounds are produced by narrowing the vocal cords
when air passes through them. An open glottis leads to unvoiced sounds; in that case,
air passes through the glottis without obstruction, so that the air stream is continuous.
In the oro-nasal process, the vocal tract can be described as a system of cavities.
The major components of the vocal tract are illustrated in Figure 2.1. The vocal tract
consists of three cavities: the oral cavity, the nasal cavity, and the pharyngeal cavity.
These components provide a mechanism for the production of different speech sounds,
by obstructing the air stream or by changing the frequency spectrum. Several articulators
can be moved in order to change the vocal tract characteristic.
2.2 Types of articulatory features
AFs can be broadly classified into three types, based on the model used to
capture them:
1. Theoretical models
2. Medical scanning models
3. Linguistically derived models
Theoretical models, such as Maeda's model [48] or the lossless tube model [49],
were proposed in the late 1970s. These models were used for early work on acoustic-to-articulatory
inversion. They are classic, but have little practical interest.
Medical scanning models use scanning devices such as X-ray cineradiography [50],
electromagnetic midsagittal articulography (EMMA) or electromagnetic articulography
(EMA) [51], and electropalatography (EPG) [52] to acquire the articulatory state directly
from a human subject. These devices measure the trajectories of the movements of the
articulators, which vary slowly with time. They show that the speech organs are in
continuous motion during the act of speaking. The same can be observed by looking at a
spectrogram representation of speech. Only a few limited datasets of this type are freely
available, such as the MOCHA database [53], with EMA, EPG and laryngography data, and the
EUR-ACCOR database [54], with EPG, laryngography and pneumotachography (measurements of
nasal and oral airflow velocity) data. Obviously, the acquisition of such data is a
difficult and quite expensive task, so it is not possible to use these features as acoustic
features in general.
Linguistically derived models describe the AFs in a different manner. These models
use the knowledge of linguistics, and particularly phonetics. Each phoneme of the spoken
language is related to a vector of features that describe, in a somewhat abstract sense, the
articulatory state. These features can be either multivalued or binary. Multivalued features
often describe the articulatory state in terms of the place and the manner of articulation.
An example of such a set of features can be found in [55]. On the other hand, binary
features describe the articulatory state as the presence or absence of a specific phonetic
quality. The justification of these features is based on the seminal work found in [56]. A
third kind of features, Government Phonology primes [57], has also been used in the
same sense. These features can also be called “Phonological Features”, as they may have
a functional, as opposed to a strictly articulatory, meaning. We prefer to have a unified
view, believing that there is a great degree of overlap between the two kinds of features. Due to
the ease of extraction of these features, we have used them in our work. In the following
section, we briefly explain the extraction of these features from a given speech signal.
2.3 Extraction of articulatory features
The problem of extraction/prediction of articulatory features (AFs) from a speech signal
is called 'acoustic-to-articulatory inversion of speech', or simply 'inversion', and has
attracted many researchers and scientists during the last 35 years [58]. It is very difficult
to extract AFs from a given speech signal; the problem is considered ill-posed. The
reasons for this difficulty are its one-to-many nature (a given articulatory state has
only one acoustic realization, but an acoustic signal can be the outcome of more than one
articulatory state) and its high non-linearity (two somewhat similar articulatory states may
give rise to totally different acoustic signals).
A number of approaches have been proposed in the quest for a solution to the acoustic-
to-articulatory inversion problem such as codebook approaches [59], neural network ap-
proaches [60], constrained optimization approaches [61], analytical approaches [62] as
well as stochastic modelling [63] and statistical inference methods such as mixture den-
sity networks [64] or Kalman filtering [65]. Most of these methods build a separate artic-
ulatory classifier for each AF type. Models are trained to learn to predict the presence or
absence of an AF type, and finally the outputs of these classifiers are concatenated to form
an AF vector. In this work, we use artificial neural networks (ANNs) to extract AFs
from the acoustic signal by building a mapper from the acoustic space to the articulatory
space, as they give promising results compared to other methods. Such a mapper uses
a smaller number of parameters and also preserves the dependencies or correlations among
AFs.
2.3.1 Database
The experiments presented here are carried out on the TIMIT database [66]. TIMIT comprises hand-labeled and segmented data of quasi-phonetically balanced sentences read by
native speakers of American English. It consists of 630 speakers (70% male and 30% female) from 8 different dialect regions of the United States. Each speaker has approximately 30
seconds of speech, spread over ten utterances. The data is designed to have rich phonetic
content, consisting of 2 dialect sentences (SA), 450 phonetically compact sentences
(SX) and 1890 phonetically diverse sentences (SI). The training set (3698 utterances)
consists of all SI and SX sentences from 432 speakers. The test set (1344 utterances)
consists of all SI and SX sentences from the 168-speaker test set. The speaker sets in
training and testing are mutually exclusive.
2.3.2 Feature extraction
To extract spectral features from the speech signal, a source-filter model of speech is
applied. Mel-cepstral coefficients (MCEPs) are extracted with a frame size of 25 ms and a
fixed frame advance of 5 ms. The number of MCEPs extracted for every 5 ms frame is 25 [67].
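As a quick sanity check on these analysis parameters, the number of MCEP frames an utterance yields can be computed as below (assuming a frame is emitted whenever a full 25 ms window fits; exact counts depend on the analysis tool's padding policy):

```python
def num_frames(duration_s, frame_size_ms=25, frame_shift_ms=5):
    """Number of analysis frames for an utterance of the given duration,
    counting one frame per full window position."""
    ms = duration_s * 1000.0
    if ms < frame_size_ms:
        return 0
    return int((ms - frame_size_ms) // frame_shift_ms) + 1
```

A 30-second TIMIT-style recording thus yields roughly 6000 frames of 25 MCEPs each.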
2.3.3 Analysis phase (Encoder): MCEP to AF
The AFs used in this work are shown in Table 2.1. We have used eight different articulatory properties, as listed in the first column of Table 2.1. Each articulatory property
has a different number of classes, where each class is denoted by a separate dimension in
the AF space. For example, vowel length has four classes: short, long, schwa and diphthong.
To represent these four classes, we use four bits. The dimension of an AF vector is
26, which is equal to the total number of bits in the third column of Table 2.1.
Table 2.1: Eight articulatory properties, the classes of each property, and the number of bits required to represent each property.

Articulatory property      Classes                                          # bits
Voicing                    +voiced, -voiced                                   1
Vowel length               short, long, diphthong, schwa                      4
Vowel height               high, mid, low                                     3
Vowel frontness            front, mid, back                                   3
Lip rounding               +round, -round                                     1
Consonant type (manner)    stop, fricative, affricative, nasal,
                           liquid, approximant                                6
Place of articulation      labial, velar, alveolar, palatal,
                           labio-dental, dental, glottal                      7
Silence                    +silence, -silence                                 1
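The bit layout of Table 2.1 can be made concrete with a small helper. The property names, class ordering, and dictionary interface below are our own illustrative choices, not the actual implementation used in this work:

```python
# Hypothetical property/class inventory transcribed from Table 2.1;
# the bit layout (property order, class order) is an assumption.
AF_LAYOUT = [
    ("voicing", ["+voiced"]),                                   # 1 bit
    ("vowel_length", ["short", "long", "diphthong", "schwa"]),  # 4 bits
    ("vowel_height", ["high", "mid", "low"]),                   # 3 bits
    ("vowel_frontness", ["front", "mid", "back"]),              # 3 bits
    ("lip_rounding", ["+round"]),                               # 1 bit
    ("manner", ["stop", "fricative", "affricative",
                "nasal", "liquid", "approximant"]),             # 6 bits
    ("place", ["labial", "velar", "alveolar", "palatal",
               "labio-dental", "dental", "glottal"]),           # 7 bits
    ("silence", ["+silence"]),                                  # 1 bit
]

def af_vector(props):
    """Build the 26-dimensional binary AF vector for one phone from a
    dict of its phonological properties."""
    vec = []
    for name, classes in AF_LAYOUT:
        vec += [1 if props.get(name) == c else 0 for c in classes]
    return vec
```

For a phone such as /m/, passing {"voicing": "+voiced", "manner": "nasal", "place": "labial"} would set exactly the corresponding three bits.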
Mapping using ANN
Artificial Neural Network (ANN) models consist of interconnected processing nodes,
where each node represents the model of an artificial neuron, and the interconnection be-
tween two nodes has a weight associated with it. ANN models with different topologies
perform different pattern recognition tasks. For example, a feed-forward neural network
can be designed to perform the task of pattern mapping, whereas a feedback network
could be designed for the task of pattern association. A multi-layer feed forward neural
network is used in this work to obtain the mapping function between the acoustic and the
articulatory vectors.
Figure 2.2 shows the architecture of the five-layer ANN used to capture the transformation function for mapping the acoustic features onto the articulatory space. The ANN is
trained to map an MCEP vector to an AF vector, i.e., if G(x_t) denotes the ANN mapping
of x_t, then the error of the mapping is given by ε = Σ_t ||y_t − G(x_t)||². G(x_t) is defined as

G(x_t) = g(w(4) g(w(3) g(w(2) g(w(1) x_t)))),   (2.1)

where

g(κ) = κ (linear),   g(κ) = a tanh(bκ) (nonlinear).   (2.2)
Fig. 2.2: Architecture of the five-layered MLFFNN, with the number of nodes in each layer and the type of activation function.
Here w(1), w(2), w(3) and w(4) represent the weight matrices of the first, second, third and
fourth layers of the ANN, respectively. The values of the constants a and b used
in the tanh function are 1.7159 and 2/3, respectively. Generalized backpropagation
learning [20] is used to adjust the weights of the neural network so as to minimize ε, i.e.,
the mean squared error between the desired and the actual output values. Selection of
initial weights, architecture of ANN, learning rate, momentum and number of iterations
are some of the optimization parameters in training an ANN [25]. Once the training
is complete, we get a weight matrix that represents the mapping function between the
acoustic features and articulatory features. Such a weight matrix can be used to transform
a feature vector from acoustic space to a feature vector of the articulatory space.
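A minimal sketch of the forward pass of Eq. (2.1), with the alternating linear (L) and tanh (N) layers and the stated constants, is given below; the random weights stand in for trained ones, and biases are omitted for brevity:

```python
import numpy as np

# Activation constants from the thesis: a*tanh(b*k) with a = 1.7159, b = 2/3
A, B = 1.7159, 2.0 / 3.0

def mlp_forward(x, weights, kinds):
    """Forward pass G(x) of the feed-forward mapper: apply the weight
    matrices w(1)..w(4) in turn, with a tanh nonlinearity after the
    layers marked 'N' and the identity after those marked 'L'."""
    h = x
    for w, kind in zip(weights, kinds):
        h = w @ h
        if kind == "N":
            h = A * np.tanh(B * h)
    return h

# A tiny random instance of the 25L 50N 12L 50N 26L architecture:
rng = np.random.default_rng(0)
dims = [25, 50, 12, 50, 26]
ws = [rng.normal(scale=0.1, size=(dims[i + 1], dims[i])) for i in range(4)]
y = mlp_forward(rng.normal(size=25), ws, kinds=["N", "L", "N", "L"])
```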
Input and Output Representation
In this work, MCEPs are used as inputs to train the ANN mapper. The use of excitation
features or other representations of the speech signal has not been explored within the scope of
this work.
To train an MCEP-to-AF mapper, an AF representation is required for each MCEP
vector. Such knowledge could be obtained from a phonetic segmentation of speech, produced
manually or automatically. The utterances in the TIMIT database have time stamps at
phone level. These time stamps are used to determine the beginning and ending of each phone in the utterance. Also, given a phone symbol, we rely on its phonological properties to derive an
AF representation. This representation is binary in nature, and the number of bits used
to denote this representation is explained in Table 2.1. For example, the 1st bit in the
AF representation could take a value 1 or 0 based on whether the phone is voiced or un-
voiced. Thus an ANN model is trained to map an MCEP vector to the corresponding
phonologically derived AF. Although the training of the ANN model is done using a bi-
nary representation at the output layer, the final output of the ANN model is continuous.
That is, the output of the ANN model at each node is a continuous value between
0 and 1, as shown in Fig. 2.3. Figure 2.3(b) shows the phonologically derived AFs,
which are binary values, where black corresponds to bit value 1 and white
corresponds to bit value 0. Figure 2.3(c) shows the acoustically derived AFs, which
are continuous values varying from 0 to 1. The implicit assumption in the binary AF
representation of a phone is that the speech production of one phone is independent of the
others; hence, phonologically derived AFs are discrete. But actual speech production is
continuous in nature, and the production of one phone is dependent upon the next phone;
hence, acoustically derived AFs are continuous.
This difference in the expected and the actual values at the nodes in the output layer
could be attributed to contextual effects of the phones which are not captured in the phono-
logically derived AF representation. Thus, the output of the ANN model is treated as an
acoustically derived AF representation which encapsulates co-articulation, emotion and
speaker characteristics [68].
The structure of the ANN model used is 25L 50N 12L 50N 26L, where the integer
value indicates the number of nodes in each layer and L / N indicates a linear or nonlinear activation function. It is a five-layer feed-forward neural network with
three hidden layers. Generally, we set the dimension of the expansion layer to about
twice the dimension of the input layer, and the size of the compression layer to almost half
the dimension of the input layer.
Fig. 2.3: (a) Waveform of the sentence “The angry boy answered but they didn’t look up.”, (b) expected output in binary (phonologically derived AFs), (c) actual output in continuous values (acoustically derived AFs).
Fig. 2.4: Block diagram of both the analysis (MCEPs → AFs, MLFFNN 25L 50N 20L 50N 26L) and synthesis (AFs → MCEPs, MLFFNN 26L 50N 20L 50N 25L) mappings.
2.3.4 Evaluation of mapping accuracy
Measuring cepstral distortion
To evaluate how well the AFs are predicted from MCEPs, we used another ANN to map
the predicted AFs back to the original MCEPs. This mapping can be called the synthesis phase,
and the whole framework is referred to as analysis-by-synthesis, as shown in Fig. 2.4. The
structure of the ANN model used is 26L 50N 12L 50N 25L. The performance of the analysis-by-synthesis approach can be measured by the Mel-cepstral distortion (MCD) computed
between the output of the synthesis phase and the original MCEPs. MCD is related to the
filter characteristics and hence is an important measure to check the performance of the
Table 2.2: Average MCD and MOS scores of the analysis-by-synthesis approach.

Approach                 MCD     MOS    Similarity test
analysis-by-synthesis    4.604   3.97   4.44
mapping obtained by an ANN model. The MCD is computed as follows:

MCD = (10 / ln 10) * sqrt( 2 * Σ_{i=1}^{24} (c_i^o − c_i^e)² ),   (2.3)

where c_i^o and c_i^e denote the original and the estimated mel-cepstral coefficients, respectively [18].
Given that the MCEP-to-AF and AF-to-MCEP mapping networks are trained on the TIMIT
training set, we computed the MCD for all utterances in the test set (1344 utterances).
We synthesized 10 utterances from the predicted MCEPs and the original F0, using the mel-log
spectral approximation (MLSA) filter [67]. For all the experiments in this work, we
used pulse excitation for voiced sounds and random noise excitation for unvoiced
sounds.
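Eq. (2.3) is straightforward to implement. The sketch below assumes (n_frames, 25) MCEP arrays with the 0th (energy) coefficient excluded from the sum, and averages the per-frame distortion over all frames:

```python
import numpy as np

def mel_cepstral_distortion(c_orig, c_est):
    """Average MCD (in dB) of Eq. (2.3): the 0th coefficient is excluded,
    so the sum runs over coefficients 1..24 of the 25 MCEPs per frame."""
    diff = c_orig[:, 1:25] - c_est[:, 1:25]
    per_frame = (10.0 / np.log(10)) * np.sqrt(2.0 * (diff ** 2).sum(axis=1))
    return per_frame.mean()
```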
We conducted mean opinion score (MOS) and similarity tests to evaluate the performance of the analysis-by-synthesis method. These are subjective evaluations in which listeners
rate the quality of the synthesized speech on a 5-point scale (5: excellent,
4: good, 3: fair, 2: poor, 1: bad) and the closeness of the synthesized speech to the original speech
signal. Table 2.2 shows the average MCD, MOS and similarity test scores over all test
sets. The MOS and similarity scores were obtained from 10 subjects, each performing
the listening tests on 10 utterances. These results show that AFs
capture sufficient information about the speech signal. It is typically observed that an MCD
score less than 6.0 produces good quality speech in speech synthesis/voice conversion.
Measuring frame-wise recognition
The evaluation method used is a comparison of overall accuracy in terms of frame error
rate (FER), together with insertion and deletion rates. FER is widely used for evaluating
articulatory feature extraction [69], because in current speech technology articulatory
features are commonly used as an alternative or additional speech representation. Speech
Table 2.3: Frame-wise recognition using the TIMIT database.

Articulatory feature        Correct (%)   Deletion (%)   Insertion (%)
Voiced                      91.07          3.68           5.25
Short vowels                91.00          3.20           5.80
Long vowels                 91.74          4.08           4.18
Diphthongs                  89.64          4.92           5.44
Schwa                       93.53          1.39           5.08
High vowels                 93.59          3.41           3.00
Mid vowels                  85.96          3.93          10.11
Low vowels                  92.61          5.01           2.38
Front vowels                91.30          6.74           1.96
Central vowels              88.84          2.77           8.39
Back vowels                 92.51          3.72           3.77
Round vowels                94.56          2.15           3.29
Stops                       91.60          4.14           4.26
Fricatives                  93.81          3.98           2.21
Affricatives                98.49          0.17           1.34
Nasals                      96.85          1.69           1.46
Liquids                     96.12          0.74           3.14
Approximants                95.09          2.10           2.81
Labials                     93.10          1.95           4.95
Velars                      98.04          0.65           1.31
Alveolars                   98.37          0.06           1.57
Palatals                    82.91          8.24           8.85
Labio-dentals               97.30          0.62           2.08
Dentals                     95.86          0.99           3.15
Glottals                    99.28          0.05           0.67
Silence                     95.79          4.05           0.16
Average over all features   93.86          2.86           3.28
All correct together        37             NA             NA
representation is a sequence of numeric vectors where each numeric vector represents
speech in each time frame. Therefore, AFs extraction systems are usually evaluated on
the frame level. To understand the achieved accuracies better, we present insertion and
deletion rates in Table 2.3. Given that the MCEP-to-AF mapping network is trained on the
TIMIT training set, we computed the FER for all utterances in both training (432 speak-
ers) and testing (168 speakers) sets, where speakers in both sets are mutually exclusive.
The deletion rate is defined as the ratio

\[
\text{Deletion} = \frac{\text{No. of correct classes not classified as correct}}{\text{Total no. of classes}} \tag{2.4}
\]
Table 2.4: Frame-wise recognition using ARCTIC database.

Articulatory features       Correct (%)  Deletion (%)  Insertion (%)
Voiced                      78.89        17.98         3.13
Short vowels                90.32        2.60          7.08
Long vowels                 86.08        8.97          4.95
Diphthongs                  85.51        9.56          4.93
Schwa                       93.40        1.44          5.16
High vowels                 86.42        11.25         2.33
Mid vowels                  85.28        3.48          11.24
Low vowels                  89.53        5.28          5.19
Front vowels                81.44        16.62         1.94
Central vowels              87.23        3.15          9.62
Back vowels                 91.16        2.04          6.80
Round vowels                93.12        1.71          5.17
Stops                       82.81        6.42          10.77
Fricatives                  85.37        8.45          6.18
Affricatives                97.81        0.95          1.24
Nasals                      88.88        6.71          4.41
Liquids                     94.29        3.60          2.11
Approximants                91.56        2.59          5.85
Labials                     88.45        5.10          6.45
Velars                      95.28        1.36          3.36
Alveolars                   97.42        0.00          2.58
Palatals                    74.21        11.53         14.26
Labio-dentals               93.96        4.38          1.66
Dentals                     95.39        0.09          4.52
Glottals                    95.76        3.03          1.21
Silence                     88.22        1.17          10.61
Average over all features   89.13        6.50          4.37
All correct together        17           NA            NA
and the insertion rate is defined as the ratio

\[
\text{Insertion} = \frac{\text{No. of incorrect classes classified as correct}}{\text{Total no. of classes}} \tag{2.5}
\]
It is clear from Table 2.3 that the general recognition accuracy is high, and in all cases recognition is substantially above chance level. The performance on the training and testing portions of the database did not differ greatly, which indicates that the network learned to generalise well. The accuracies for palatals, mid vowels and central vowels are lower than 90%. The insertion rates for mid vowels and central vowels are higher
Fig. 2.5: Articulatory contours of vowels w.r.t. tongue movements, shown in six panels (high, middle, low, front, central and back vowels) against time in seconds. Dashed lines correspond to the actual binary AFs; solid lines correspond to the predicted continuous AFs.
because there are more confusions between central vowels and front vowels, and between central vowels and back vowels. Similarly, there are more confusions between mid vowels and high vowels, and between mid vowels and low vowels. This is because, whenever the tongue moves from high to low or from front to back, it has to pass through the middle and central positions, respectively. Fig. 2.5 shows the articulatory contours of vowels; from this figure we can observe that middle and central vowels have more confusions. The “all correct together” value gives the percentage of frames for which all features are correct. This means that the network has found the right combination 37% of the time from a possible choice of 2^26 = 67,108,864 feature combinations. This accuracy is only meant as a guide to the overall network accuracy, as it takes no account of the asynchronous nature of the features: simple frame-wise phone classification is not our aim. Table 2.4 shows the frame-wise accuracy for the ARCTIC database [70], obtained by passing the ARCTIC speech files through the TIMIT-trained MCEP-to-AF mapper. The ARCTIC and TIMIT databases differ in speakers and environmental conditions.
2.4 Summary
In this chapter, we briefly explained about human speech production and the role of ar-
ticulators in the production of speech. We showed that AFs can be classified into three
models based on the model which is used to capture them. We also explained that ex-
traction of AFs using linguistically derived models is easy so we have used this model to
capture the AFs. We showed the representation of AFs using phonological information.
The use of an ANN mapper to extract the AFs from the speech signal is explained. To
evaluate the mapping accuracy of the ANN mapper we described two objective evalua-
tions. These are 1) Mel-cepstral distortion (MCD) and 2) Frame-wise recognition. The
MCD score showed that AFs capture sufficient amounts of speech information, whereas
frame-wise recognition showed that AFs are predicted very well from speech signal.
Chapter 3
Analysis of speaker information in
articulatory features
Extraction of articulatory features from an acoustic speech signal has various applications in speech research. These features have been used for automatic speech recognition (ASR) with the aim of better pronunciation modeling, better co-articulation modeling, robustness to cross-speaker variation and noise, and multi-lingual and cross-lingual portability of systems [71, 55, 72, 73, 74, 3], as well as in language identification [75] and in emotional speech synthesis [68]. In these studies, the articulatory features derived from the acoustics are often treated as a generic or speaker-independent representation of the speech signal. So, we too intend to use these features as the canonical form (generic across speaker variation) in voice conversion. There has been earlier work on using AFs for speaker verification [76], which used articulatory feature-based conditional pronunciation modeling (AFCPM) to capture the pronunciation characteristics of speakers, since different people have their own way of pronouncing. AFCPM models the linkage between the states of articulation during speech production and the actual phonemes produced by a speaker. That work showed that AFs contain speaker-specific information complementary to spectral features. In order to understand the nature of speaker dependence in AFs, in this chapter we perform a detailed analysis of AFs by conducting speaker identification experiments.
Section 3.1 describes speaker identification using AFs, along with the results. In this approach, each speaker is modeled by a Gaussian mixture model (GMM). The results of the speaker identification system show that AFs contain a significant amount of speaker-specific information. Since our goal is to obtain a speaker-independent representation of the speech signal, in Section 3.2 we discuss a method to normalize the speaker-specific information in AFs and its impact on the performance of speaker identification and speech recognition.
3.1 Speaker identification using articulatory features
The purpose of a speaker identification (SID) system is to identify a speaker from his/her
voice samples. The goal of this experiment is to find out the amount of speaker informa-
tion present in AFs and compare it with that of MCEPs.
3.1.1 Preparation of the data
The experiments presented here are carried out on the TIMIT database [66]. It consists of 630 speakers, 70% male and 30% female, from 10 different dialect regions in America. Each speaker has approximately 30 seconds of speech, spread over ten utterances. The speech is designed to have rich phonetic content, consisting of 2 dialect sentences (SA), 450 phonetically compact sentences (SX) and 1890 phonetically diverse sentences (SI). All the SX and SI wave-files in each speaker's directory are concatenated to form a single utterance (of approximately 25 seconds duration) for training. The two utterances in the SA directory of each speaker are concatenated to form a test utterance.
3.1.2 Feature extraction
In these experiments we used both MCEPs and AFs as feature vectors for all speakers.
MCEPs are extracted as explained in Section 2.3.2. AFs are extracted for all 630 speakers in the TIMIT data using the same encoder (MCEP-to-AF) that was built in Section 2.3.3.
3.1.3 Speaker modelling
A Gaussian mixture model (GMM) is used to model the distribution of features of a given
speaker. Given a feature vector x_i, the mixture density for speaker s is defined by

\[
p(x_i \mid \Phi^s) = \sum_{j=1}^{M} w_j^s \, \mathcal{N}_j^s(x_i; \mu_j, \Sigma_j) \tag{3.1}
\]

and can be thought of as the weighted linear combination of M Gaussian densities N_j^s(x_i; µ_j, Σ_j). Each trained speaker is represented by a model Φ^s = {w_j^s, µ_j, Σ_j}, j = 1, 2, ..., M, where µ_j, Σ_j and w_j^s represent the mean, covariance and weight of the j-th mixture component, respectively. Models are trained using the expectation-maximization (EM) algorithm.
The EM algorithm is an iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models. The EM iteration alternates between an expectation (E) step, which computes the expectation of the log-likelihood evaluated using the parameters of the current model Φ, and a maximization (M) step, which computes the parameters maximizing the expected log-likelihood found in the E step, so that the likelihood does not decrease, i.e., p(x_i | Φ̄) ≥ p(x_i | Φ), where Φ̄ is the re-estimated model. The re-estimated model parameters are then used as the initial model in the next E step.
After each iteration, the following re-estimation formulae are used, which guarantee
a monotonic increase in the model’s likelihood value:
1. Mixture weights:
\[
w_j = \frac{1}{T} \sum_{i=1}^{T} p(j \mid x_i, \Phi) \tag{3.2}
\]

2. Means:
\[
\mu_j = \frac{\sum_{i=1}^{T} x_i \, p(j \mid x_i, \Phi)}{\sum_{i=1}^{T} p(j \mid x_i, \Phi)} \tag{3.3}
\]

3. Variances:
\[
\Sigma_j = \frac{\sum_{i=1}^{T} x_i^2 \, p(j \mid x_i, \Phi)}{\sum_{i=1}^{T} p(j \mid x_i, \Phi)} - \mu_j^2 \tag{3.4}
\]
where T is the total number of feature vectors for a speaker.
Two critical factors in training GMMs are selecting the order M of the mixture and initializing the model parameters prior to running the EM algorithm. Since there are generally about 40 significant acoustic classes in speech, a model order of M = 32 was chosen, and the parameters are initialized using the K-means algorithm.
During testing, given the feature representation of a test utterance X = {x_1, x_2, ..., x_T} and a group of N speakers represented by GMMs Φ_1, Φ_2, ..., Φ_N, the objective is to find the speaker model which has the maximum a posteriori probability for the given utterance. Formally,

\[
S = \operatorname*{arg\,max}_{1 \leq k \leq N} p(\Phi_k \mid X) = \operatorname*{arg\,max}_{1 \leq k \leq N} \frac{p(X \mid \Phi_k) \, p(\Phi_k)}{p(X)} \tag{3.5}
\]

Assuming equally likely speakers (i.e., p(Φ_k) = 1/N) and noting that p(X) is the same for all speaker models, the classification rule simplifies to

\[
S = \operatorname*{arg\,max}_{1 \leq k \leq N} p(X \mid \Phi_k) \tag{3.6}
\]

Using logarithms and the independence between observations, the speaker identification system computes the log-likelihood of the utterance for all N speakers. Identification is implemented using a maximum likelihood classification rule: the speaker's identity is given by the model that produces the maximum log-likelihood, i.e.,

\[
S = \operatorname*{arg\,max}_{1 \leq k \leq N} \sum_{i=1}^{T} \log p(x_i \mid \Phi_k) \tag{3.7}
\]

where N is the total number of trained speakers. The accuracy (ACC) of the identification system is defined as

\[
\text{ACC}(\%) = \frac{N_c}{N} \times 100 \tag{3.8}
\]

where N_c is the number of speakers identified correctly.
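The training and classification steps above can be sketched with scikit-learn's GaussianMixture, which implements EM with k-means initialization. This is an illustrative sketch under our own naming, not the thesis implementation; the diagonal covariances and the toy synthetic data are assumptions made here for brevity.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_speaker_models(features_per_speaker, n_mix=32):
    """Train one GMM per speaker via EM with k-means initialization,
    cf. Eqs. (3.1)-(3.4)."""
    models = {}
    for speaker, feats in features_per_speaker.items():
        gmm = GaussianMixture(n_components=n_mix, covariance_type="diag",
                              init_params="kmeans", max_iter=100, random_state=0)
        gmm.fit(feats)
        models[speaker] = gmm
    return models

def identify(models, test_feats):
    """Eq. (3.7): pick the speaker whose model maximizes the total
    log-likelihood of the test utterance."""
    scores = {s: m.score_samples(test_feats).sum() for s, m in models.items()}
    return max(scores, key=scores.get)

# Toy demonstration with two well-separated synthetic "speakers"
rng = np.random.default_rng(0)
data = {"spk1": rng.normal(0.0, 1.0, (500, 4)),
        "spk2": rng.normal(5.0, 1.0, (500, 4))}
models = train_speaker_models(data, n_mix=2)
print(identify(models, rng.normal(5.0, 1.0, (50, 4))))  # spk2
```

In the actual experiments, the feature matrices would be the MCEP or AF frames of each TIMIT speaker's training utterance.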
Table 3.1: Accuracies (%) of speaker identification system using MCEPs and AFs

Features   ACC (%)
MCEPs      100
AFs        85.24
3.1.4 Results
Table 3.1 shows the performance of the SID system using MCEPs and AFs. To compute ACC, each test utterance is matched against all 630 speakers. The MCEPs identify all speakers correctly, because the TIMIT database contains no noise other than the speech and speaker information itself. Surprisingly, AFs gave 85.24% accuracy, which is far above chance level. This indicates that AFs do capture a significant amount of speaker information in their trajectories, although perhaps not as much as MCEPs. In order to further understand which sections of AFs contribute most to speaker identification, we conducted experiments using AFs corresponding to 1) vowels, 2) manner of articulation, 3) place of articulation and 4) consonants (manner + place); these groups are not mutually exclusive.
Table 3.2 shows the performance of the system using different sections of AFs. The performance of the consonants group (manner + place) is higher than that of vowels, and the performance using AFs for both vowels and consonants does not differ much from that using consonants only. The reason is that the degree of freedom in articulating vowels is higher than for consonants: most vowels are produced with an open configuration of the vocal tract, whereas consonants are produced with a constriction in the vocal tract. However, the number of bits we used to represent vowel features is smaller than for consonants, and is therefore not sufficient to capture all the variations of vowels that are significant for each speaker.
Thus one could conclude that speaker information is largely embedded in the manner and place of articulation and in their temporal variations. This raises the question of how one could normalize the speaker information in AFs so that they act as a speaker-independent representation of the speech signal. Such representations have applications in both speech recognition and voice conversion.
Table 3.2: Accuracies (%) of four groups of articulatory features

Group name   Articulatory properties                                    ACC (%)
Vowels       length, height, frontness, lip rounding                    41.90
Manner       stop, fricative, affricative, nasal, liquid, approximant   79.37
Place        labial, alveolar, palatal, labio-dental, dental, velar     71.75
Consonants   Place + Manner                                             84.44
3.2 Normalizing speaker specific information
In order to normalize the speaker-specific information in AF streams, we experimented with mean-smoothing the AF trajectories using a 5-point and an 11-point window. The idea is to smooth out the fine variations among the samples in the AF trajectories, so that the smoothed trajectories normalize the effect of speaker-specific characteristics on the AF streams.
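The smoothing operation amounts to a moving-average filter applied repeatedly to each AF trajectory. A minimal sketch (our own code; edges are zero-padded by the convolution, which is an assumption about boundary handling) is:

```python
import numpy as np

def mean_smooth(trajectory, window=11, iterations=1):
    """Repeatedly apply a mean-smoothing window to one AF trajectory.

    window     : window length in frames (5 or 11 in the experiments)
    iterations : smoothing level 'k'
    """
    x = np.asarray(trajectory, dtype=float)
    kernel = np.ones(window) / window
    for _ in range(iterations):
        x = np.convolve(x, kernel, mode="same")  # 'same' preserves the length
    return x

# A short binary AF contour, smoothed twice with a 5-point window
contour = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0], dtype=float)
smoothed = mean_smooth(contour, window=5, iterations=2)
```

Each extra iteration widens the effective window, progressively removing the fine, speaker-dependent variations.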
Fig. 3.1: (a), (b) and (c) show unsmoothed AF contours of stops, fricatives and approximants for speaker-1 and speaker-2, respectively; (d), (e) and (f) show the corresponding smoothed AF contours of stops, fricatives and approximants for speaker-1 and speaker-2, respectively.
Figure 3.1 shows the unsmoothed and smoothed contours of stops, fricatives and approximants for two different speakers. The unsmoothed contours contain very small variations that differ among the speakers. After applying a mean-smoothing window, these variations are smoothed out, and the smoothed contours represent the AFs in a form that is similar across speakers.
Figure 3.2 shows the speaker identification performance after applying the mean-smoothing repeatedly, up to five times. It can be observed that the performance of the SID system decreases with every iteration of mean-smoothing, and more so for the 11-point window spanning 225 milliseconds (frame shift is 5 ms).
Fig. 3.2: Speaker identification accuracies for different levels of smoothing with the 5-point and 11-point mean-smoothing windows. Level ‘k’ corresponds to applying the mean-smoothing window ‘k’ times.
A relevant question here is: do these smoothing operations reduce only the speaker information, or the speech information as well? To study the effect of mean-smoothing on speech quality, we built an AF-to-MCEP mapper after every iteration of smoothing. This mapper was tested on the held-out test set, and an MCD score was computed as described in Section 2.3.4. In Fig. 3.3, we show the speaker identification accuracies and MCD scores normalized with respect to the initial accuracy (85.24%) and the initial MCD score (4.604), respectively. It can be seen that, after five iterations of mean-smoothing of the AFs, the accuracy of speaker identification decreases by about 60%, whereas the MCD score increases by only about 20%, indicating that the loss of spectral information is smaller than the loss of speaker information.
Fig. 3.3: Speaker identification accuracies and MCD scores (normalized performance on the y-axis) versus the number of times the 11-point mean-smoothing window is applied. Level ‘k’ corresponds to applying the 11-point mean-smoothing window ‘k’ times. All scores are normalized with respect to the scores without smoothing (Level 0).
3.2.1 Use of smoothed AFs for speech recognition
The goal of a speech recognition system is to decode the textual message in the speech signal. Such systems have to work for all speakers in all environments, so it is necessary to normalize the speaker-specific information in speech signals before using them for speech recognition; otherwise it acts like noise to the system and performance degrades significantly. In this section, we describe the speech recognition experiments we conducted using MCEPs, unsmoothed AFs and smoothed AFs. The system is monophone-based: we trained context-independent HMM models for each phoneme, using 16 Gaussian mixtures, with the HMM toolkit (HTK). We used the training directory (468 speakers) of the TIMIT database for training, and the testing directory (162 speakers) to test the system. Table 3.3 shows the performance of the system using these three features. From this table one can observe that smoothing the AFs iteratively increases the performance of the system gradually. After the 4th iteration, the accuracy decreases slightly, which means that speech information is also beginning to be lost; this suggested stopping the smoothing after 4 iterations. These experiments show that AFs contain speaker characteristics, and that these can be reduced by smoothing.
Table 3.3: Phone recognition accuracies using MCEPs and smoothed AFs. AFs-‘k’ corresponds to applying the 11-point mean-smoothing window ‘k’ times.

Features   Accuracy
MCEPs      52.16%
AFs-0      26.32%
AFs-1      40.68%
AFs-2      42.94%
AFs-3      44.24%
AFs-4      44.65%
AFs-5      44.05%
3.3 Summary
This chapter described a speaker identification approach using only AFs. We modeled each speaker using GMMs, whose parameters were estimated with the EM algorithm. The results on the full TIMIT corpus show that AFs contain significant speaker-specific information, more so in the AF streams of consonants. To remove the speaker-specific information in AFs, we smoothed the AF trajectories and used them for speaker identification. The results show a significant decrease in the performance of the speaker identification system with smoothed AFs. It was also shown that smoothed AFs perform better than unsmoothed AFs for speech recognition.
Chapter 4
Use of articulatory features for voice
conversion
Chapters 2 and 3 covered the extraction of articulatory features (AFs), the analysis of speaker information in AFs, and the normalization of speaker information in AFs. In this chapter, we propose a voice conversion method in which AFs are used to capture the speaker-specific characteristics of a target speaker. Such a method avoids the need for speech data from a source speaker, and hence can be used to transform an arbitrary speaker, including a cross-lingual speaker. The basic idea used in this work is shown in the block diagram of Fig. 4.1. It involves two steps: 1) projecting the source speaker space into a speaker-independent space that retains only the message part of the signal; and 2) mapping the speaker-independent space to the target speaker space using an ANN mapper which captures the target speaker-specific characteristics. Here, AFs are used to represent the speaker-independent space in the process of capturing the speaker-specific characteristics of a target speaker.

Section 4.1 explains the model used to capture the speaker-specific characteristics of a target speaker and its mathematical formulation. Section 4.2 describes the use of this model for intra-lingual voice conversion using AFs, and discusses both the subjective and objective evaluations used to assess the performance of the system. In Section 4.3 we discuss how the model can be extended to cross-lingual
Fig. 4.1: Mapping of an arbitrary source speaker into a target speaker: the source speaker space is projected into a speaker-independent space, which is then mapped to the target speaker space.
voice conversion using AFs.
4.1 Noisy-channel model
As discussed in Chapter 1, the assumption that parallel or pseudo-parallel data exist is not valid in many practical applications. Hence, we posed an alternative but relevant research question: “How can we capture the speaker-specific characteristics of a target speaker from the speech signal (independent of any assumptions about a source speaker) and impose these characteristics on the speech signal of an arbitrary source speaker to perform voice conversion?”

The problem of capturing speaker-specific characteristics can be viewed as modeling a noisy channel [2]. Suppose C is a canonical form of the speech signal, i.e., a generic and speaker-independent representation of the message, which passes through the speech production system of a target speaker to produce a surface form S. This surface form S carries the message as well as the identity of the speaker.
One can interpret S as the output of a noisy channel for the input C. Here, the noisy channel is the speech production system of the target speaker. The mathematical formulation of this noisy-channel model is

\[
\operatorname*{arg\,max}_{S} \, p(S \mid C) = \operatorname*{arg\,max}_{S} \frac{p(C \mid S) \, p(S)}{p(C)} \tag{4.1}
\]
\[
= \operatorname*{arg\,max}_{S} \, p(C \mid S) \, p(S) \tag{4.2}
\]

as p(C) is constant for all S. Here, p(C | S) can be interpreted as a production model.
Fig. 4.2: Capturing speaker-specific characteristics as a speaker-coloring function Ω(·), which maps the canonical form C (articulatory features) to an approximation S′ of the surface form (Mel-cepstrum).
p(S) is the prior probability of S; it can be interpreted as the continuity constraints imposed on the production of S, analogous to a language model over S. In this work, p(S | C) is directly modeled as a mapping function between C and S using artificial neural networks (ANNs). The process of capturing speaker-specific characteristics and its application to voice conversion is explained below:
Suppose we derive two different representations, C and S, from the speech signal with the following properties. Let C be a canonical form of the speech signal, i.e., a generic and speaker-independent form, approximately represented by the articulatory features (AFs) extracted from the speech signal. Let S be a surface form represented by Mel-cepstral coefficients (MCEPs). If there exists a function Ω(·) such that S′ = Ω(C), where S′ is an approximation of S, then Ω(C) can be considered specific to a speaker. The function Ω(·) can be interpreted as a speaker-coloring function, and we treat it as capturing the speaker-specific characteristics. It is this property of Ω(·) that we exploit for the task of voice conversion. Fig. 4.2 depicts the concept of capturing speaker-specific characteristics as a speaker-coloring function.
4.2 Intra lingual voice conversion
4.2.1 Database
The experiments here were carried out on the CMU ARCTIC database consisting of ut-
terances recorded by seven speakers. Each speaker has recorded a set of 1132 phonet-
ically balanced utterances [70]. The ARCTIC database includes utterances of SLT (US
Female), CLB (US Female), BDL (US Male), RMS (US Male), JMK (Canadian Male),
AWB (Scottish Male), and KSP (Indian Male).
4.2.2 Training target speaker’s model
Given the utterances of a target speaker T, the corresponding canonical form C_T of the speaker is represented by AFs. To alleviate the effect of speaker characteristics, the AFs undergo a normalization technique, namely the smoothing explained in Section 3.2. The surface form S_T is represented by traditional MCEP features, as these allow us to synthesize speech using the MLSA synthesis technique, which generates a speech waveform from the transformed MCEPs and F0 values using pulse or random-noise excitation. An ANN model is trained to map C_T to S_T using the backpropagation learning algorithm, minimizing the Euclidean error ||S_T − S′_T||, where S′_T = Ω(C_T).
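The mapping Ω can be sketched as a small feed-forward network trained with backpropagation to minimize the squared (Euclidean) error between predicted and actual MCEPs. The sketch below uses scikit-learn's MLPRegressor on synthetic data as a stand-in; the dimensions, layer size and data are illustrative assumptions, not the architecture used in the thesis.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Illustrative dimensions: 26-dim AF frames (canonical form C_T) to
# 25-dim MCEP frames (surface form S_T); the data here are synthetic.
rng = np.random.default_rng(0)
C_T = rng.random((2000, 26))
W = rng.normal(size=(26, 25))
S_T = C_T @ W + 0.01 * rng.normal(size=(2000, 25))

# Backpropagation training, minimizing ||S_T - Omega(C_T)||^2
omega = MLPRegressor(hidden_layer_sizes=(50,), activation="tanh",
                     max_iter=500, random_state=0)
omega.fit(C_T, S_T)

S_T_pred = omega.predict(C_T)  # Omega(C_T), the approximation S'_T
```

In practice C_T would hold the smoothed AF frames of the target speaker and S_T the corresponding MCEP frames, aligned frame-by-frame.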
4.2.3 Conversion process
Once the target speaker’s model is trained, it can be used to convert CR to S ′T where CR is
the canonical form from an arbitrary source speaker R. To get the canonical form for any
arbitrary source speaker, one of the three methods below, could be followed. The process
to build any of the encoders is below, explained in Section 2.3.3.
1. Use source speaker encoder. This requires building an encoder specific to a source
speaker, and hence a large amount of speech data (along with transcription) from
the source speaker is required.
2. Use target speaker encoder. This maps MCEPs of an arbitrary source speaker onto
AFs using target speaker’s encoder.
3. Use average speaker encoder. This maps MCEPs of an arbitrary source speaker
onto AFs using an average speaker encoder which is trained using all speakers’
data except that of source and target speakers. Since an average model is used
to generate AFs, a form of speaker normalization takes place on AFs even before
smoothing is applied.
Fig. 4.3: Mel-cepstral distortion (in dB) obtained between different speaker pairs (RMS to SLT, BDL to SLT, SLT to BDL, RMS to BDL) using the target speaker, source speaker and average speaker encoders.
4.2.4 Validation
Using the three methods discussed in Section 4.2.3, we predicted the AFs for three source speakers (SLT, BDL and RMS). The AFs were smoothed to normalize the speaker-specific information, and the smoothed AFs were mapped through the BDL and SLT speaker-specific models. To test the effectiveness of the voice conversion model, we calculated the Mel-cepstral distortion (MCD) between the predicted and actual MCEPs. MCD is a standard measure used in speech synthesis and voice conversion evaluations [18].
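For reference, MCD between two aligned MCEP sequences is commonly computed as below. This sketch follows the widely used definition, excluding the 0th (energy) coefficient; whether the thesis includes the energy coefficient is not stated here, so that choice is our assumption.

```python
import numpy as np

def mel_cepstral_distortion(target_mceps, predicted_mceps):
    """Average MCD in dB between two aligned MCEP sequences of
    shape (frames, dims); the 0th (energy) coefficient is excluded."""
    t = np.asarray(target_mceps, dtype=float)[:, 1:]
    p = np.asarray(predicted_mceps, dtype=float)[:, 1:]
    diff_sq = np.sum((t - p) ** 2, axis=1)            # per-frame squared error
    mcd_per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * diff_sq)
    return float(np.mean(mcd_per_frame))
```

Identical sequences give 0 dB; a constant difference of 1 in a single coefficient gives 10/ln(10) · √2 ≈ 6.14 dB per frame.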
Fig. 4.3 shows the MCD scores obtained using the three methods. We observe that the average speaker encoder gives a lower MCD score than the other two methods. This suggests that the use of an average speaker encoder generates normalized AFs, and that smoothing the AF trajectories further helps in realizing the speaker-independent form. The rest of the experiments were therefore carried out using the average speaker encoder to obtain the canonical form for an arbitrary source speaker; it can also be used to obtain the canonical form for the target speaker.
Fig. 4.4: Flow-chart of the training and conversion modules of a voice conversion system capturing speaker-specific characteristics. (a) Training: the target speaker's MCEPs are converted to AFs by the encoder, smoothed, and used as input to train the target speaker ANN with the MCEPs as output. (b) Conversion: the source speaker's MCEPs are converted to AFs by the encoder, smoothed, and passed through the target speaker ANN to predict the target MCEPs.
4.2.5 Experiments on multiple speaker database
To test the validity of the proposed method, we conducted experiments on other speakers from the CMU ARCTIC set, namely RMS, CLB, AWB and KSP. Fig. 4.4(a) shows the block diagram of the training process, and Fig. 4.4(b) shows the block diagram of the conversion process. Table 4.1 provides the results of mapping C_R (where R = BDL, RMS, CLB, AWB, KSP or SLT) onto the acoustic spaces of SLT and BDL.
Table 4.1: MCD scores obtained between multiple speaker pairs with SLT and BDL as target speakers. Scores in parentheses are obtained using parallel data.

Source speakers   SLT (target)    BDL (target)
SLT               -               7.563 (6.709)
RMS               6.604 (5.717)   7.364 (6.394)
AWB               6.797 (6.261)   7.731 (6.950)
KSP               7.808 (6.755)   8.695 (7.374)
BDL               6.637 (5.423)   -
CLB               6.339 (5.380)   7.249 (6.172)
In Table 4.1, the performance of voice conversion models built following the noisy-channel model approach is compared with that of a traditional model using parallel data. The MCD scores indicate that the use of parallel data gives better performance than the noisy-channel model approach. Here, the voice conversion system using parallel data was implemented as explained in Chapter 1. The use of parallel data allows us to capture an explicit mapping function between source and target speakers, whereas the noisy-channel model approach captures the target speaker-specific characteristics, which can later be imposed on any source speaker. This approach yields an MCD in the range of 6.3 to 8.6.
4.2.6 Mapping of excitation features
Our focus in this work is on a better transformation of the spectral features; hence, we use the traditional approach to F0 transformation as used in GMM-based conversion. A logarithmic Gaussian normalized transformation [38] is used to transform the F0 of a source speaker to the F0 of a target speaker, as indicated in the equation below:

\[
\log(F0_c) = \mu_t + \frac{\sigma_t}{\sigma_s} \left( \log(F0_s) - \mu_s \right) \tag{4.3}
\]

where µ_s and σ_s are the mean and standard deviation of the fundamental frequency in the logarithmic domain for the source speaker, µ_t and σ_t are the corresponding statistics for the target speaker, F0_s is the pitch of the source speaker and F0_c is the converted pitch frequency.
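Equation (4.3) can be implemented directly. The sketch below (our own helper names) estimates the log-domain statistics from voiced frames of each speaker's training pitch contours and converts a source F0 contour; leaving unvoiced frames (F0 = 0) unchanged is a common convention we assume here.

```python
import numpy as np

def convert_f0(f0_source, source_train_f0, target_train_f0):
    """Logarithmic Gaussian normalized F0 transformation, Eq. (4.3)."""
    def log_stats(f0):
        f0 = np.asarray(f0, dtype=float)
        voiced = np.log(f0[f0 > 0])        # statistics over voiced frames only
        return voiced.mean(), voiced.std()

    mu_s, sigma_s = log_stats(source_train_f0)
    mu_t, sigma_t = log_stats(target_train_f0)

    f0 = np.asarray(f0_source, dtype=float)
    converted = np.zeros_like(f0)
    voiced = f0 > 0
    # log(F0_c) = mu_t + (sigma_t / sigma_s) * (log(F0_s) - mu_s)
    converted[voiced] = np.exp(mu_t + (sigma_t / sigma_s)
                               * (np.log(f0[voiced]) - mu_s))
    return converted

# Example: a source frame at the source's log-mean pitch maps to the
# target's log-mean pitch (120 Hz -> 220 Hz for these toy statistics)
out = convert_f0([120.0, 0.0], [100.0, 120.0, 144.0], [200.0, 220.0, 242.0])
```

The transform shifts the contour to the target's average pitch and rescales its range by the ratio of the two log-domain deviations.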
Subjective evaluation
We also performed perceptual tests, whose results are provided in Table 4.2, for mean opinion scores (MOS) on a scale of 1 to 5 (5: excellent, 4: good, 3: fair, 2: poor, 1: bad). For the listening tests, we randomly chose 10 utterances from the two transformed pairs (SLT to BDL and BDL to SLT). Fifteen listeners participated in the evaluation tests, and the MOS scores in Table 4.2 are averaged over these fifteen listeners. From the MOS scores, one can say that the noisy-channel model approach does capture the speaker-specific characteristics of the target speaker. The transformed waveforms are available at http://researchweb.iiit.ac.in/~bajibabu.b/vc_evaluation.html.
Table 4.2: Subjective evaluation of voice conversion models built using parallel data and the noisy-channel model.

Transformation using    SLT to BDL   BDL to SLT
Parallel data           3.34         3.58
Noisy-channel model     3.14         3.40
By using smoothed AFs, we can transform any arbitrary speaker into a predefined target speaker without requiring any utterances from the source speaker when training the voice conversion model. This indicates that the methodology of training an ANN model to capture speaker-specific characteristics for voice conversion can be generalized across different datasets.
4.3 Cross-lingual voice conversion
Cross-lingual voice conversion is a task in which the source and target speakers speak different languages. We employ the same model explained in Section 4.1 to capture speaker-specific characteristics for the task of cross-lingual voice conversion. We performed an experiment to transform the utterances of two speakers (speaking Kannada and Telugu) into a male voice speaking English (US male, BDL). Our goal here is to transform the two speakers' voices into the BDL voice; hence, the output will sound as if BDL were speaking Kannada and Telugu, respectively.
Here, we can pose some research questions:

• How do we extract the canonical form (AFs) for a cross-lingual source speaker?

• Can we use the average encoder used for intra-lingual voice conversion, explained in Section 4.2.3? (It is built using the data of many speakers of the same language.)

• Do we need to include data from other languages in the average model, to normalize the language information in the speech signal?
To answer the above questions, we used two encoders to extract the canonical form for the cross-lingual source speaker.
1. A multi-speaker, mono-lingual encoder, trained using many speakers of the same language without the source and target speakers' data. It performs a form of speaker normalization in the AFs, and is similar to the average speaker encoder model used in intra-lingual voice conversion.
2. A multi-speaker, multi-lingual encoder, trained using many speakers of multiple languages without the source and target speakers' data. This encoder offers a form of both language and speaker normalization in the AFs.
The process to build the above encoders is the same as explained in Section 2.3.3. The articulation of some phones in one language differs from that in other languages. Since we used phone information to derive the AFs from the speech signal in this work, it was necessary to account for the significant articulations of other languages.
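As a toy illustration of deriving AFs from phone information, phonologically derived AFs can be viewed as a lookup from each phone label to a binary feature vector. The feature names and phone entries below are hypothetical simplifications; the actual feature set in this work is larger (26 features, or 27 with aspiration).

```python
# Hypothetical, simplified phone-to-AF lookup: each phone maps to binary
# phonological features. The real inventory in this work is larger;
# this sketch uses a tiny illustrative subset.
FEATURES = ["vowel", "voiced", "bilabial", "nasal", "aspirated"]

PHONE_AF = {
    "aa": [1, 1, 0, 0, 0],   # open vowel
    "p":  [0, 0, 1, 0, 0],   # voiceless bilabial stop
    "b":  [0, 1, 1, 0, 0],   # voiced bilabial stop
    "m":  [0, 1, 1, 1, 0],   # bilabial nasal
    "bh": [0, 1, 1, 0, 1],   # aspirated stop, common in Indian languages
}

def phones_to_afs(phones):
    """Map a phone-level transcription to a sequence of AF vectors."""
    return [PHONE_AF[p] for p in phones]

afs = phones_to_afs(["b", "aa", "m"])
```

Adding a language whose phones carry an articulation absent from the table (here, aspiration) is what forces the extra feature dimension discussed below.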
To compare the performance of the two encoders, we extracted the AFs with each encoder separately. The AFs extracted using the multi-speaker, mono-lingual encoder are shown in Fig. 4.5. This encoder is the same as that used in intra-lingual voice conversion, built using all of the speakers in the ARCTIC database. The AFs extracted using the multi-speaker, multi-lingual encoder are shown in Fig. 4.6. This encoder was built using ARCTIC (English) and Telugu data. Since aspiration is a significant articulation in Indian languages, we used an extra bit to represent this information, which increased the dimension of the AFs to 27.
From Fig. 4.5 and Fig. 4.6 it can be observed that the prediction of AFs in Fig. 4.5 is not accurate, whereas most of the AFs are correctly predicted in Fig. 4.6. We infer that using multiple languages to build the MCEP-to-AF encoder produces some form of language normalization, so the prediction of AFs for a cross-lingual source speaker using such an encoder is more accurate. In the following experiments we used both encoders for cross-lingual voice conversion.
Fig. 4.5: (a) Waveform of the sentence “enduku babu, annadu pujari ascharyanga!”. (b) Phonologically derived AFs. (c) Acoustically derived AFs using multi-speaker and mono-lingual data (English).
4.3.1 Experiments
Using the two encoders above, we predicted the AFs for the native Telugu and Kannada source speakers. The AFs were smoothed to normalize speaker-specific information. The smoothed AFs were then mapped onto the BDL speaker-specific model, which was built as explained in Section 4.2. Five utterances from the two speakers were transformed into the BDL voice, and we performed a MOS test and a similarity test to evaluate the performance of this transformation. Table 4.3 provides the MOS and similarity test results averaged over all listeners. Ten native listeners of Telugu and Kannada participated in the tests. The similarity test indicates the closeness of the transformed speech to the characteristics of the target speaker. Table 4.3 shows the performance using the two methods mentioned in the previous section. We observe that the performance using the multi-speaker, multi-lingual encoder is better than that using the other method. This justifies the
Fig. 4.6: (a) Waveform of the sentence “enduku babu, annadu pujari ascharyanga!”. (b) Phonologically derived AFs. (c) Acoustically derived AFs using multi-speaker and multi-lingual data (English + Telugu).
Table 4.3: Subjective evaluation of cross-lingual voice conversion models. Scores in parentheses are obtained using the multi-speaker, multi-lingual encoder.

Source Speaker (Lang.)   Target Speaker (Lang.)   MOS          Similarity test
Speaker1 (Telugu)        BDL (English)            1.85 (3.1)   2.40 (2.95)
Speaker2 (Kannada)       BDL (English)            2.00 (3.0)   2.50 (2.8)
use of a multi-speaker, multi-lingual encoder for generating language-normalized AFs. Smoothing the AF trajectories further helps in realizing a speaker-independent form. These tests indicate that cross-lingual transformation can be achieved using AFs as the canonical form in the noisy-channel model, and that the output possesses the characteristics of the BDL voice.
4.4 Summary
In this chapter, we have shown that it is possible to build a voice conversion system by capturing the speaker-specific characteristics of a target speaker (the noisy-channel model). We used an ANN model to capture the speaker-specific characteristics. Such a model does not require any speech data from source speakers and hence can be considered independent of the source speaker. We used smoothed AFs to represent the canonical form of a speech signal. Our results indicate that AFs can be used as the canonical form of the speech signal in the noisy-channel model to capture speaker-specific characteristics for voice conversion. An effective process of normalization or transformation of AFs for cross-lingual voice conversion remains to be investigated further.
Chapter 5
Summary and Conclusion
5.1 Summary of the work
The main limitation of most current voice conversion systems is the requirement of parallel data between source and target speakers. Parallel data is a set of recordings of the same sentences uttered by both source and target speakers. However, parallel data is not always feasible, especially in cross-lingual voice conversion, where the languages of the source and target speakers differ. Voice conversion techniques that do not require parallel data have been proposed in the literature, but they still require a priori data from the source speaker. Such techniques cannot be applied when an arbitrary source speaker wants to transform his/her voice into a target speaker's voice without any prior recording.
In this dissertation, we proposed a method to perform voice conversion without the need for training data from the source speaker. In this method, we used articulatory features (AFs) to capture the speaker-specific characteristics of a target speaker. This alleviates the need for parallel data and can be used in a cross-lingual voice conversion system. To capture the speaker-specific characteristics of a target speaker, we formulated a noisy-channel model. The idea behind modelling a noisy channel is as follows. Suppose C is a canonical form of a speech signal (a generic, speaker-independent representation of the message in the speech signal), which passes through the speech production system of a target speaker to produce a surface form S. This surface form S carries the message as well as
the identity of the speaker.
One can interpret S as the output of a noisy channel for the input C. Here, the noisy channel is the speech production system of the target speaker. We used an artificial neural network (ANN) to model the speech production system of a target speaker, which captures the essential speaker characteristics of the target speaker. The choice of representation of C and S for a speech signal plays an important role in this method. We used articulatory features (AFs), which represent the characteristics of the speech production process and are assumed to be speaker independent, as the canonical form of the speech signal. MCEPs, used as the surface form, capture both speech and speaker information. However, speaker identification experiments using AFs showed that AFs contain significant amounts of speaker information in their trajectories. We therefore proposed a method called mean smoothing to normalize the speaker-specific information in AFs. Results showed that smoothing reduces speaker-specific information significantly without losing much speech information. The smoothed AFs were then used as the canonical form in the noisy-channel model. Finally, subjective and objective evaluations revealed that the transformed speech produced by the proposed method is intelligible and possesses the characteristics of the target speaker.
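The mean smoothing referred to above can be sketched as a moving average applied independently to each AF trajectory. This is a minimal sketch assuming numpy; the window length and edge-padding scheme are assumed details, not necessarily those used in this thesis.

```python
import numpy as np

def mean_smooth(af_traj, win=5):
    """Moving-average smoothing of AF trajectories.

    af_traj: (frames, dims) array of AF values.
    win:     odd window length (an assumed value; the exact window
             used in the thesis is not reproduced here).
    """
    half = win // 2
    # Repeat edge frames so the output keeps the same number of frames.
    padded = np.pad(af_traj, ((half, half), (0, 0)), mode="edge")
    kernel = np.ones(win) / win
    # Smooth each AF dimension independently.
    return np.stack(
        [np.convolve(padded[:, d], kernel, mode="valid")
         for d in range(af_traj.shape[1])], axis=1)

# A rapidly alternating 1-D trajectory: smoothing damps the oscillation.
traj = np.array([[0.0], [1.0], [0.0], [1.0], [0.0], [1.0]])
smoothed = mean_smooth(traj, win=3)
```

The smoothing suppresses frame-to-frame fluctuations, which is where much of the speaker-specific variation in the AF trajectories resides, while the slowly varying phonetic content survives.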
5.2 Major contributions of the work
The central contribution of the research reported in this thesis is voice conversion using articulatory features. The articulatory features used in this work are motivated by the phonological properties of sounds, not by actual articulator position data collected using medical devices. The major contributions of this work are:
1. Studied the significance of voice conversion in human speech communication.
2. Extracted the articulatory features from a given speech signal.
3. Analyzed the speaker information in articulatory features.
4. Proposed a method to normalize the speaker information in articulatory features.
5. Proposed a method to capture speaker-specific characteristics using articulatory fea-
tures which can be used in voice conversion.
6. Analyzed the use of articulatory features in cross-lingual voice conversion.
5.3 Directions for future work
• The research work in this thesis studied the transformation of spectral features and average pitch frequency only. These are not sufficient for a good voice transformation. Duration and pitch contours are also important features that affect transformation performance and should be addressed in future work.
• In this work, we used machine learning techniques to extract articulatory features from a given speech signal. These techniques require large amounts of transcribed speech data for training. There is thus a need for alternative signal processing techniques that can extract these features with little training data.
• In our approach to cross-lingual voice conversion, since we did not use a bilingual speaker, we had no means of performing an objective evaluation. Hence, there is a need for an algorithm that can assess the quality of the transformation objectively.
List of Publications
The work done during my Master's has been disseminated through the following journal and conference publications.
Journal
Bajibabu Bollepalli, Alan W Black, Kishore Prahallad, “Use of Articulatory Features for
Non-parallel Voice Conversion”, in preparation for IEEE Transactions on Acoustics and
Speech Signal Processing.
Conferences
1. Bajibabu Bollepalli, Alan W Black, Kishore Prahallad, “Modelling a Noisy-channel
for Voice Conversion Using Articulatory Features”, accepted in INTERSPEECH-
2012, Portland, USA.
2. Sathya adithya Thati, Bajibabu B, B. Yegnanarayana, “Analysis of Breathy Voice
Based on Excitation Characteristics of Speech Production”, accepted in SPCOM
2012, Bangalore, India.
3. Srikanth Ronanki, Bajibabu B, Kishore Prahallad, “Duration Modelling In Voice
Conversion Using Artificial Neural Networks”, International Conference on Sys-
tems, Signals and Image Processing (IWSSIP), Vienna, Austria, April, 2012.
4. Bajibabu, Ronanki Srikanth, Sathya Adithya Thati, Bhiksha Raj, B Yegnanarayana,
Kishore Prahallad. “A comparison of prosody modification using instants of sig-
nificant excitation and mel-cepstral vocoder”, Centenary Conference of the Indian
Institute of Science, 14-17 Dec 2011, Bangalore.
5. Gautam Verma Mantena, Bajibabu B, Kishore Prahallad, “SWS task: Articulatory
Phonetic Units and Sliding DTW”, MediaEval 2011, Satellite Events in INTER-
SPEECH 2011, Italy, September 1-2, 2011.