
NAUTILUS: a Versatile Voice Cloning System

Hieu-Thi Luong, Student Member, IEEE, Junichi Yamagishi, Senior Member, IEEE

Abstract—We introduce a novel speech synthesis system, called NAUTILUS, that can generate speech with a target voice either from a text input or a reference utterance of an arbitrary source speaker. By using a multi-speaker speech corpus to train all requisite encoders and decoders in the initial training stage, our system can clone unseen voices using untranscribed speech of target speakers on the basis of the backpropagation algorithm. Moreover, depending on the data circumstance of the target speaker, the cloning strategy can be adjusted to take advantage of additional data and modify the behaviors of text-to-speech (TTS) and/or voice conversion (VC) systems to accommodate the situation. We test the performance of the proposed framework by using deep convolution layers to model the encoders, decoders, and WaveNet vocoder. Evaluations show that it achieves comparable quality with state-of-the-art TTS and VC systems when cloning with just five minutes of untranscribed speech. Moreover, it is demonstrated that the proposed framework has the ability to switch between TTS and VC with high speaker consistency, which will be useful for many applications.

Index Terms—voice cloning, text-to-speech, voice conversion, speaker adaptation, neural network.

I. INTRODUCTION

SPEECH synthesis is the technology of generating speech from an input interface. In its narrow sense, speech synthesis is used to refer to text-to-speech (TTS) systems [1], which play an essential role in a spoken dialog system as a way for machine-human communication. In its broader definition, speech synthesis can refer to all kinds of speech generation interfaces like voice conversion (VC) [2], video-to-speech [3], [4], and others [5]. Recent state-of-the-art speech synthesis systems can generate speech with natural-sounding quality, some of which is indistinguishable from recorded speech [6]. Deep neural networks are used in various components of these speech synthesis systems. Many use sequence-to-sequence (seq2seq) models to unfold a compact phoneme sequence into acoustic features in the case of TTS [6], [7] or to handle the misalignment of acoustic sequences in the case of VC [8], [9], [10]. A neural vocoder, which generates waveforms sample by sample [11], [12], [13], is also a staple of many high-quality speech-generation recipes [6], [14]. Generally speaking, the performance of deep learning approaches is high when training on a large amount of data. For speech generation models, this means that we need many hours of speech from a target speaker to train a model. This limits the ability to scale the technology to many different voices.

H.-T. Luong is with the National Institute of Informatics, and with the Department of Informatics, SOKENDAI (The Graduate University for Advanced Studies), Tokyo 101-8340, Japan (e-mail: [email protected])

J. Yamagishi is with the National Institute of Informatics, with the Department of Informatics, SOKENDAI (The Graduate University for Advanced Studies), Tokyo 101-8340, Japan, and also with the Centre for Speech Technology Research, University of Edinburgh, Edinburgh EH8 9AB, U.K. (e-mail: [email protected]).

Besides improving naturalness, cloning new voices with a small amount of data is also an active research topic. While many different approaches have been proposed to tackle this problem, they all share the same fundamental principle, which is to use an abundant corpus to compensate for the lack of data of a target speaker [15]. For neural TTS, we can fine-tune all or part of a well-trained acoustic model using transcribed speech from a target speaker [16]. For neural VC, we can pool the speech data of multiple source and target speakers and share the knowledge learned from each [17]. In most of these cases, the data used for training or adaptation is either paired or labeled. However, as all acoustic characteristics of a speaker are fully contained within the speech signal, we should hypothetically be able to clone voices by using untranscribed speech only, and this would greatly reduce the cost of building speech generation systems. Disentangling speaker characteristics from linguistic information and representing them as a speaker vector is hence a popular way of cloning voices [18]. Another approach is to use labels auto-generated by a speaker-independent automatic speech recognition (ASR) model trained on large-scale multi-speaker corpora [16]. Either way, the cloning method is usually formulated for a specific data scenario of a specific speech generation system (either TTS or VC), while a truly data-efficient method should work on extremely limited data as well as abundant data, with or without labels.

From the perspective of voice cloning, TTS and VC can be seen as similar systems that use different inputs for generating speech with a target voice. They share almost the same objective as well as many functional components, but they are normally treated as different systems and are modeled using vastly different frameworks. Several works have used this similarity to combine these two systems into one [19], [20]. However, in the end, these works only focus on using one to improve the other [21], [10], [22].

In this work we present our new speech synthesis system, NAUTILUS, which can act as TTS or VC with state-of-the-art (SOTA) quality and highly consistent similarity.¹ More importantly, this combination is not for convenience but to open up the ability to clone unseen voices with a versatile cloning strategy that can be adjusted to the data situation of target speakers. Given this versatility, we show that our system can handle unique speaker characteristics such as L2 accents.

This paper is structured as follows: Section II reviews work on text-to-speech and voice conversion in the context of cloning voices. Section III explains the principles of our framework. Section IV gives details on the proposed NAUTILUS system used in this paper. Section V presents the experiment scenarios and their evaluations. We conclude our findings in Section VI.

¹The basics of the voice cloning method for TTS were proposed in [23], and as a proof of concept it was also shown that the same principle is applicable to VC in [20]. This new work builds upon that methodology and presents a SOTA unified voice cloning system for TTS and VC.

arXiv:2005.11004v1 [eess.AS] 22 May 2020



II. RELATED WORK ON VOICE CLONING

A. Definition of voice cloning

The term voice cloning is used to refer to a specific speaker adaptation scenario for TTS with untranscribed speech in several works [18], [24]. However, in pop culture it is loosely used to describe technology that resembles VC. In this paper, we use voice cloning as an umbrella term for any type of system that generates speech imitating the voice of a particular speaker. The main difference between voice cloning and speech synthesis is that the former puts an emphasis on the identity of the target speaker [25], while the latter sometimes disregards this aspect in favor of naturalness [26]. Given this definition, a voice cloning system can be a TTS, a VC, or any other type of speech synthesis system [4], [5]. The NAUTILUS system is designed to be expandable to other input interfaces. However, we focus on TTS and VC, two common speech synthesis tasks, in this work, as they play an irreplaceable role in our voice cloning method.

The performance of a voice cloning system is judged on many aspects. As a speech generation system, naturalness and similarity to the target speakers are important [6]. As a computer system, a small memory footprint [15] and fast computing time [18], [27] are desirable for practical reasons. However, the defining property of a voice cloning system compared with generic speech synthesis is its data efficiency, as this determines its scalability [28]. While data efficiency can be casually interpreted as using as little data as possible [15], a better voice cloning system should not only work in situations with an extremely limited amount of data but also be able to take advantage of abundant speech data [28] when such data become available, regardless of the availability of transcriptions [23].

B. Training voice conversion system for target speaker

The conventional VC approach is text-dependent, i.e., it expects training data to be parallel utterances of source and target speakers [29], [30]. As obtaining these utterances is expensive and labor-intensive, a parallel VC system commonly has to be built with as little as five minutes of data from a speaker [31]. This is inconvenient, and it limits the quality of VC systems in general. Many have worked on methodologies for building VC systems with non-parallel utterances [32]. With HMM models, we can formulate a transformation function to adapt pretrained models using non-parallel speech [33], [34]. With recent deep representation learning approaches, the popular method is training a speaker-disentangled linguistic representation either implicitly or explicitly. For the implicit case, Hsu et al. [35] used a variational autoencoder (VAE), while Kameoka et al. [32] used a generative adversarial network (GAN) to train a many-to-many non-parallel VC system. These methods use multi-speaker data, conditional labels, and various kinds of regularization to encourage a model to disentangle linguistic content from speaker characteristics via a self-supervised training process. For the explicit case, Sun et al. [36] used phonetic posteriorgrams (PPG) obtained from an ASR model to train an any-to-one non-parallel VC system. As the ASR model is speaker-independent, a PPG-based VC system can theoretically convert the speech of arbitrary source speakers into a target voice.

Even though a typical VC system is only trained on speech data, recent works have suggested that using transcriptions of the training data or jointly training TTS along with VC can further improve the naturalness of generated samples [19], [10].

C. Adapting text-to-speech system to unseen target

A TTS system is typically trained on dozens of hours of transcribed speech [6], [37]. Due to the high requirements for quantity and quality, a professional voice actor is commonly commissioned to record such data in a controlled environment. This makes the conventional approach ill-fitted for the voice cloning task, in which we have no control over the target speaker, the recording environment, or the amount of data. To build a TTS system for speakers with a limited amount of labeled data, we can adapt a pretrained model. The initial model can be trained on the data of a single speaker [38] or data pooled from multiple speakers [39], [40]. This simple fine-tuning produces a high-quality model when the data of the target speaker is sufficient (e.g., one hour) [28]. When the data is extremely limited (e.g., one minute), we can restrict the tuning to certain speaker components instead of the entire network to prevent overfitting [40], [41], [28]. In summary, speaker adaptation transfers knowledge learned from the abundant data of one or multiple speakers to reduce the data demand for a target speaker.

The costly part of a voice cloning system is the data collection process, especially the transcription of speech. Theoretically speaking, as speaker characteristics are self-contained within an utterance, we should be able to clone voices without using text. One practical approach is obtaining automatically annotated transcriptions using a SOTA ASR system [16]. However, ASR-predicted transcriptions contain erroneous annotations, which affects the performance of the adaptation. Moreover, this approach assumes that a well-trained ASR system is obtainable for the target language, which makes it impractical for low-resource languages [26] or for performing cross-language adaptation [42], [24]. Given the disentanglement ability of deep learning models, another approach is to train a speaker-adaptive model conditioned on a speaker representation extracted from speech [18], [43]. The speaker representation can be an i-vector [44], d-vector [45], [15], or x-vector [46], which are all byproducts of speaker recognition systems. This approach has a computational advantage in that it does not involve an optimization loop [18]. However, the drawback is its limited scalability; in other words, the speaker similarity seems to stop improving when more than a few seconds of speech are used [28].

D. TTS as speech-chain component

Even though TTS and ASR, two essential modules of spoken dialog systems, are placed at the two ends of the human-machine communication interface and complement each other, historically they have been built independently under different frameworks [1], [47]. Recent end-to-end speech models have reduced the technical difference between TTS and ASR systems and opened up the possibility of integrating them into a single ecosystem. Tjandra et al. [48] developed the Speech Chain model, which consists of a TTS and an ASR that consume each other's output as their own inputs. Karita et al. [49] factorized TTS and ASR into encoders and decoders and then jointly trained them all together by putting a constraint on the common latent space. The purpose of these unified systems is to combine resources and enable semi-supervised training.

Similar to the situation with ASR, several works have tried to combine VC with TTS [19], [10] or to bootstrap VC from TTS [20], [22]. Hypothetically speaking, given a perfect ASR system, there is no difference between TTS and VC systems. Specifically, the PPG-based VC system [36] is essentially a TTS model stacked on top of an ASR model. Polyak et al. [21] trained a TTS system with a target voice by combining an any-to-one VC system and a robot-voice TTS system.

III. VERSATILE VOICE CLONING FRAMEWORK

Our voice cloning system, “NAUTILUS”, is a multimodal neural network that can be used as a TTS [23] or a VC [20] system. It is not just a combination of conventional TTS and VC systems [19] but a carefully designed system that has the ability to clone unseen voices using either transcribed or untranscribed speech [23]. The core concept is to train a latent linguistic embedding (LLE) for use as a stand-in for text when transcriptions are difficult to obtain. The architecture of our multimodal system resembles the model proposed by Karita et al. [49]; however, they focus on the performance of the ASR system instead of speaker adaptation. While the emphasis on linguistic latent features is similar to the PPG-based VC system proposed by Sun et al. [36], their phonetic representation extractor is trained independently of the VC model, while our linguistic latent features are jointly trained with the speech generation model. Given the similarity in techniques, we will compare our system with the PPG-based VC system in the experiments.

A. Training latent linguistic embedding with multimodal neural network

The principal components of the voice cloning framework are presented in Fig. 1. The multimodal neural network is essential for our voice cloning methodology. While the neural vocoder is optional, we included it since it is necessary for generating high-quality speech in most recent setups [6], [14]. The proposed system contains four modules, which are encoders and decoders of either text, x, or speech, y. In combinations of encoders and decoders, the modules can perform four transformations: text-to-speech (TTS), speech-to-speech (STS), speech-to-text (STT), and text-to-text (TTT). Combining these modules into a single system is not just for convenience but serves an important purpose. The speech encoder helps the TTS system adapt with untranscribed speech [23], while the text encoder helps the VC system disentangle speaker characteristics from the linguistic representation. The text decoder is the new addition in this paper. While Karita et al. [49] use a similar combination for speech recognition, we focus on speech generation tasks and use the text decoder only as an auxiliary regularizer.

[Fig. 1 (diagram): Principal components and initial training stage of NAUTILUS, the proposed voice cloning system: text encoder (TEnc), speech encoder (SEnc), speech decoder (SDec), text decoder (TDec), and neural vocoder (Voc), together with the MAE and CE goal losses, the vocoder loss, and the KLD tied-layer loss.]

Our methodology is designed around the training of a speaker-disentangled LLE, z. The LLE in our setup plays the same role as the PPG proposed for VC [35]. However, the LLE is jointly trained with the speech generation modules and contains linguistic information as a whole (instead of phonemes only). There are several ways to train the multimodal neural network: it can be trained stochastically [50], step by step [22], or jointly [51], [49]. We proposed two methods for joint training in our previous work [51]: 1) joint-goal, where several losses calculated between an output inferred by each decoder and its ground truth are combined, and 2) tied-layer, where the two latent spaces obtained from the encoders are constrained to be identical through a distance or distortion penalty. Using one or the other is enough [51], [23], but as they are complementary, we could use them together:

loss_train = loss_goal + β loss_tie
           = loss_tts + α_sts loss_sts + α_stt loss_stt + β loss_tie ,    (1)

where loss_tts in Equation 1 is a TTS loss defined by the text encoder and speech decoder and is used as the anchor to adjust the other hyperparameters. loss_sts is an STS loss defined by the speech encoder and speech decoder, and we de-emphasize it with a weighting parameter α_sts. loss_stt is an STT loss defined by the speech encoder and text decoder. Even though the speech-to-text task is not a target one, its loss is also included to encourage the latent space to focus more on phonemes (but not entirely). Some other works have shown that an auxiliary phoneme classifier helps in boosting the quality of speech generation systems in general [10]. A TTT loss defined by the text encoder and text decoder, loss_ttt, is not included as we do not think that it helps. The last term, loss_tie, is for the tied-layer constraint.


[Fig. 2 (diagrams): Cloning procedure with untranscribed speech of the target speaker: (a) Step 1 - Adaptation, (b) Step 2 - Welding, (c) Step 3 - Inference.]


In each training step, we calculate each term of loss_train using a transcribed speech sample and then optimize all parameters in a supervised manner. Karita et al. [49] used a similar loss to jointly train their system, but with one important difference: two separate speech samples, one with its transcription and one without, are used to calculate a single training loss. Specifically, loss_tts, loss_stt, and loss_tie are calculated using the transcribed sample, while loss_sts and loss_ttt are calculated on the untranscribed sample. This semi-supervised training strategy was proposed to take advantage of an abundant unlabeled corpus [49]. Our system can also benefit from this semi-supervised strategy, but we only focus on supervised training in this work.
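To make the joint objective concrete, the following PyTorch-style sketch combines the four terms of Equation 1 for one transcribed sample. It is a minimal illustration, not the authors' implementation: the module names (text_encoder, speech_encoder, speech_decoder, text_decoder), the frame-level label argument x_frames, and the assumption that α_sts = α_stt = 0.1 (Sec. IV-C reports a single α = 0.1 and β = 0.25) are all ours.

```python
import torch
import torch.nn.functional as F

def training_loss(text_encoder, speech_encoder, speech_decoder, text_decoder,
                  x, x_frames, y, spk_code, sym_kld,
                  alpha_sts=0.1, alpha_stt=0.1, beta=0.25):
    """One supervised step of the joint-goal + tied-layer objective (Eq. 1).

    x:        compact phoneme sequence fed to the text encoder
    x_frames: frame-level phoneme labels (from forced alignment), CE target
    y:        mel-spectrogram of the same utterance
    sym_kld:  a function computing the symmetrized KLD of Eq. 2
    """
    mu_t, sigma_t = text_encoder(x)      # text-encoded LLE distribution
    mu_s, sigma_s = speech_encoder(y)    # speech-encoded LLE distribution

    # Re-parameterized sampling: the added noise trains the decoders in a denoising fashion.
    z_text = mu_t + sigma_t * torch.randn_like(sigma_t)
    z_speech = mu_s + sigma_s * torch.randn_like(sigma_s)

    loss_tts = F.l1_loss(speech_decoder(z_text, spk_code), y)     # MAE goal loss, TTS path
    loss_sts = F.l1_loss(speech_decoder(z_speech, spk_code), y)   # MAE goal loss, STS path
    loss_stt = F.cross_entropy(text_decoder(z_speech), x_frames)  # CE goal loss, STT path
    loss_tie = sym_kld(mu_t, sigma_t, mu_s, sigma_s)              # tied-layer constraint, Eq. 2

    return loss_tts + alpha_sts * loss_sts + alpha_stt * loss_stt + beta * loss_tie
```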

For the tied-layer loss, we calculated the symmetrized Kullback-Leibler divergence between the outputs of the text and speech encoders instead of the asymmetric one [23]:

loss_tie = 1/2 L_KLD(TEnc(x), SEnc(y)) + 1/2 L_KLD(SEnc(y), TEnc(x))    (2)

This constraint helps obtain a consistent latent space between the text and speech encoders. Through experiments, we found that KL divergence is an effective tied-layer loss [23]².
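Because both encoders output the mean and standard deviation of a diagonal Gaussian over the LLE, the symmetrized KL divergence of Equation 2 has a closed form. The sketch below is one way to compute it (the function names and the mean reduction over frames and dimensions are our choices); it could serve as the sym_kld helper assumed in the previous sketch.

```python
import torch

def gaussian_kld(mu_p, sigma_p, mu_q, sigma_q, eps=1e-8):
    """Closed-form KL( N(mu_p, sigma_p^2) || N(mu_q, sigma_q^2) ), element-wise."""
    var_p, var_q = sigma_p.pow(2), sigma_q.pow(2)
    return (torch.log((sigma_q + eps) / (sigma_p + eps))
            + (var_p + (mu_p - mu_q).pow(2)) / (2.0 * var_q + eps)
            - 0.5)

def sym_kld(mu_text, sigma_text, mu_speech, sigma_speech):
    """Symmetrized KLD between the text- and speech-encoded LLE distributions (Eq. 2)."""
    kld_ts = gaussian_kld(mu_text, sigma_text, mu_speech, sigma_speech)
    kld_st = gaussian_kld(mu_speech, sigma_speech, mu_text, sigma_text)
    return 0.5 * (kld_ts + kld_st).mean()   # average over frames and latent dimensions
```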

Another important aspect is random sampling at the output of the encoders. Thanks to the noise added by the sampling process of the LLE in the training stage, the text and speech decoders are trained in a denoising fashion. This, in turn, makes the speech decoder robust to unseen samples, which is helpful for speaker adaptation.

²Karita et al. [49] reported that KL divergence is unstable for training. The reason for this contrast is that in their work the autoencoder-based latent space is assumed to be a Gaussian distribution, while in our case it is forced to be an isotropic Gaussian distribution through a VAE-like structure [52].

B. Speaker adaptation framework

The multimodal network trained in the previous stage is essentially a multi-speaker TTS/VC system; however, our goal is to perform voice cloning for unseen speakers. Next, we describe the cloning protocol for a standard scenario that uses untranscribed speech and a supervised scenario that uses transcribed speech in the following subsections.

1) Cloning voices using untranscribed speech: The core mechanism for unsupervised speaker adaptation is the same as in our prior work [23], [20]; however, the details of the execution have been updated. The voice cloning stage now contains three steps, which take the neural vocoder into account.

Step 1 - Adaptation: This is essentially our legacy unsupervised adaptation stage [23], in which the speech decoder and neural vocoder are adapted separately. We first remove all speaker components and then fine-tune the remaining parameters of the speech decoder using the following loss:

loss_adapt = loss_sts + β loss_cycle    (3)

The speech distortion loss_sts by itself is enough for the adaptation [23], but we further add a linguistic cycle-consistency term, loss_cycle, to try to improve the performance. loss_cycle is the KL divergence between the LLE distributions of natural speech and reconstructed speech, as follows:

loss_cycle = 1/2 L_KLD(SEnc(y), SEnc(ỹ)) + 1/2 L_KLD(SEnc(ỹ), SEnc(y))    (4)

Even though both loss_sts and loss_cycle try to force the reconstructed features to be close to natural speech, they focus on different aspects; loss_sts is an l1 or l2 frame-based hard distortion of the acoustic features, while loss_cycle focuses on linguistic content with a soft divergence. We adapt the neural vocoder in a similar manner using its goal loss:

loss′_adapt = loss_voc    (5)

As a neural vocoder depends on speech only, it can be used in an unsupervised adaptation strategy. This is a simple yet effective approach [14].
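As a rough sketch of Step 1, the snippet below removes the per-speaker biases, freezes the speech encoder, and fine-tunes the remaining speech-decoder parameters with loss_sts + β·loss_cycle (Equations 3 and 4). The parameter-naming convention ("speaker_bias"), the optimizer choice, and the call signatures are assumptions for illustration; the neural vocoder is adapted separately with its own loss (Equation 5) and is omitted here.

```python
import torch
import torch.nn.functional as F

def adapt_speech_decoder(speech_encoder, speech_decoder, sym_kld,
                         target_mels, n_epochs=256, beta=0.25, lr=0.1):
    """Step 1 (unsupervised): fine-tune the speech decoder on untranscribed target speech."""
    for p in speech_encoder.parameters():     # the LLE extractor stays speaker-independent
        p.requires_grad = False
    for name, p in speech_decoder.named_parameters():
        if "speaker_bias" in name:            # remove speaker components (naming assumed)
            p.data.zero_()
            p.requires_grad = False

    params = [p for p in speech_decoder.parameters() if p.requires_grad]
    optim = torch.optim.Adam(params, lr=lr)   # optimizer type is our choice
    for _ in range(n_epochs):
        for y in target_mels:                 # mel-spectrograms of the target speaker
            mu, sigma = speech_encoder(y)
            z = mu + sigma * torch.randn_like(sigma)
            y_hat = speech_decoder(z)                           # reconstruction, no speaker code
            loss_sts = F.l1_loss(y_hat, y)                      # frame-wise distortion
            mu_hat, sigma_hat = speech_encoder(y_hat)
            loss_cycle = sym_kld(mu_hat, sigma_hat, mu, sigma)  # Eq. 4
            loss = loss_sts + beta * loss_cycle                 # Eq. 3
            optim.zero_grad()
            loss.backward()
            optim.step()
```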

Step 2 - Welding: Even though fine-tuning the acoustic model and the neural vocoder separately can produce sufficient quality [14], there are still mismatches between the generated features and the natural features used to train the vocoder. For text-to-speech systems, Zhao et al. [53] fine-tuned an acoustic model with the losses propagating from a neural vocoder, while Ping et al. [37] jointly trained them together. For voice conversion, due to the duration mismatch between source and target utterances, Huang et al. proposed that the WaveNet vocoder be fine-tuned by using reconstructed acoustic features of a target speaker [54]. Motivated by these works, we deploy a “welding” strategy, illustrated in Fig. 2b, that conducts fine-tuning using the reconstructed features of the target speaker, in a similar way to Huang's approach [54], but for both the speech decoder and neural vocoder, as in Ping's method [37], based on the loss function below:

loss_weld = loss_sts + γ loss_voc ,    (6)

where loss_sts is included to preserve the acoustic space even after the welding process, as the speech decoder is assumed to be autoregressive in that domain.

Two practical tactics are further introduced for this step. 1) Mean-value LLE: to let the acoustic model learn fine-grained details, we remove the sampling process from the speech encoder and use the mean value instead. 2) Mix-in: as losses propagating from the neural vocoder can overpower the speech decoder [53], we propose a mix-in tactic, inspired by dropout, to ease this problem. Specifically, the output of the speech decoder is randomly mixed with natural frames at a certain percentage to reduce the amount of loss propagated back.
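The mix-in tactic can be read as a per-frame random choice between decoder-generated and natural frames before the features are fed to the vocoder, so that only a fraction of the vocoder loss reaches the speech decoder. The sketch below reflects our reading of the description; using the reported mix-in rate of 0.9 (Sec. IV-C) as the probability of keeping a generated frame is an assumption.

```python
import torch

def mix_in(generated_mel, natural_mel, rate=0.9):
    """Randomly mix generated and natural frames (mix-in tactic of the welding step).

    generated_mel, natural_mel: time-aligned tensors of shape (frames, n_mels).
    rate: probability of keeping a generated frame (interpretation assumed);
          frames taken from the natural recording block gradient flow from the
          vocoder loss back into the speech decoder at those positions.
    """
    keep = (torch.rand(generated_mel.size(0), 1) < rate).to(generated_mel.dtype)
    return keep * generated_mel + (1.0 - keep) * natural_mel
```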

Step 3 - Inference: Even though we use the speech encoder to tune the speech decoder and neural vocoder in the adaptation and welding steps, the text encoder can utilize these tuned modules without any further adjustment at inference time (see Fig. 2c), thanks to the consistency between the latent spaces of the text and speech encoders. As our cloning method tunes entire modules, the more data available, the better the performance.

2) Alternative strategy for cloning voices with transcribed speech: The strategy for supervised speaker adaptation using transcribed speech was also refined compared with our previous work [23]. Instead of using exactly the same steps as in the unsupervised strategy above, we first tune the speech decoder and text encoder together using the transcribed speech, since transcriptions could benefit the TTS system.

Step 1 - Adaptation (supervised alternative): The supervised strategy for the adaptation step is illustrated in Fig. 3a. We adapt both the speech decoder and text encoder using the following function:

loss″_adapt = loss_tts + α loss_sts + β loss_tie    (7)

[Fig. 3 (diagram): Cloning procedure with transcribed speech of the target speaker: (a) Step 1 - Adaptation (supervised alternative).]

The optimizing loss is similar to that used in the training stage (Equation 1). We use loss_sts and loss_tie to maintain the linguistic latent space for VC. The welding and inference steps are the same as in the unsupervised strategy.

IV. DETAILS OF NAUTILUS SYSTEM

The methodology explained in Sec. III can be applied to any neural architecture, from the conventional acoustic model [23] to an end-to-end (E2E) model [6]. Next, we give the details of the system used in the experiments. It is not a fully E2E system, but it is inspired by the E2E model in various ways.

A. Text-speech multimodal system

Our system is shown in Fig. 4. The text representation x is a phoneme sequence, and the speech representation y is a mel-spectrogram.

1) Text encoder: the text encoder transforms a compact phoneme sequence x into the LLE sequence z, which has the same length as the acoustic sequence. Our specifications for the text encoder are illustrated in Fig. 4a. The input phoneme sequence is represented as one-hot vectors. As engineered linguistic features are no longer provided, tenc-linguistic-context is used to learn the linguistic context. This is a direct imitation of Tacotron 2 [6], but with a quasi-RNN used in place of the standard RNN to speed up training. An attention mechanism is essential in an E2E setup to unroll the phoneme sequence; our setup, however, uses an explicit duration/alignment module called “tenc-alignment” in training and inference to have direct control over the prosody of the generated samples.³ The coarse linguistic features then go through several dilated convolution layers called “tenc-latent-context” to capture the local context and smooth out the coarseness. tenc-latent-context has essentially the same design as the acoustic models used in our prior work [23], which used

³The tenc-alignment module could be replaced with an attention mechanism for convenience, and this could also potentially improve the quality further [55].


[Fig. 4 (diagrams): Blueprint of the text-speech multimodal system: (a) text encoder, (b) speech decoder, (c) text decoder, (d) speech encoder. Naming convention: type-[filter]-unit-function. Most layers are either causal (CConv) or non-causal (Conv) convolution layers with a filter width of 3. Besides regular non-linear activation functions like tanh or relu, we also use the non-linear filter-gate (FG), filter-gate with skip connection (FGS), and highway (HW) layers. Dilation rates are indicated where applicable.]

residual connections, skip connections, and the filter-gate function (Fig. 4a in [23]) to help the gradient flow:

h_l = tanh(W_l^f h_{l−1} + c_l^f) ⊙ σ(W_l^g h_{l−1} + c_l^g) ,    (8)

where h_l is the output of the l-th layer, and W_l^f, W_l^g, c_l^f, and c_l^g are the weights and biases for the filters and gates. The output of the text encoder consists of the mean and standard deviation of a text-encoded LLE sequence.
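A dilated convolution layer with the filter-gate non-linearity of Equation 8, optionally carrying the per-speaker biases of Equation 9 below, could be written as follows. The class and argument names are ours, and the residual wiring of the FGS variant is simplified.

```python
import torch
import torch.nn as nn

class GatedConv1d(nn.Module):
    """Dilated 1-D convolution with filter-gate activation (Eq. 8) and optional
    per-speaker biases (Eq. 9); the residual connection gives the FGS variant."""
    def __init__(self, channels, kernel_size=3, dilation=1, n_speakers=0):
        super().__init__()
        pad = (kernel_size - 1) // 2 * dilation   # "same" padding (non-causal variant)
        self.filter_conv = nn.Conv1d(channels, channels, kernel_size,
                                     padding=pad, dilation=dilation)
        self.gate_conv = nn.Conv1d(channels, channels, kernel_size,
                                   padding=pad, dilation=dilation)
        # One learnable bias vector per training speaker; removed at adaptation time.
        self.bias_f = nn.Embedding(n_speakers, channels) if n_speakers else None
        self.bias_g = nn.Embedding(n_speakers, channels) if n_speakers else None

    def forward(self, h, spk_id=None):
        f = self.filter_conv(h)                   # W_l^f h_{l-1} + c_l^f
        g = self.gate_conv(h)                     # W_l^g h_{l-1} + c_l^g
        if spk_id is not None and self.bias_f is not None:
            f = f + self.bias_f(spk_id).unsqueeze(-1)   # + b_l^{f,(k)}
            g = g + self.bias_g(spk_id).unsqueeze(-1)   # + b_l^{g,(k)}
        return h + torch.tanh(f) * torch.sigmoid(g)     # residual/skip connection (FGS)
```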

2) Speech decoder: the speech decoder takes in an LLE sequence z to generate a corresponding acoustic sequence y with a particular voice. It is essentially a multi-speaker speech synthesis model, and there are three components that significantly affect the performance: temporal context capturing [56], the autoregressive mechanism [57], [55], and speaker modeling [41]. sdec-context-blk captures the LLE temporal context by using time-domain convolution (1dconv) layers, which also contain speaker biases in their filters and gates (Fig. 4b in [23]):

h_l = tanh(W_l^f h_{l−1} + c_l^f + b_l^{f,(k)}) ⊙ σ(W_l^g h_{l−1} + c_l^g + b_l^{g,(k)}) ,    (9)

where b_l^{f,(k)} and b_l^{g,(k)} are the speaker biases of the k-th speaker in the training speaker pool. The effective type of speaker component depends on the network structure as well as the acoustic features [41]. We previously found that speaker biases work best for our setup [23].

An autoregressive mechanism is introduced to improve the overall naturalness. sdec-prenet is responsible for the autoregressive dependency that captures the past outputs using causal layers. This is a direct imitation of the AudioEnc proposed by Tachibana et al. [27]. The layers in sdec-prenet use the highway function in the same way as [27], as follows:

h_l^f = W_l^f h_{l−1} ,    (10)
h_l^g = σ(W_l^g h_{l−1}) ,    (11)
h_l = h_l^f ⊙ h_l^g + h_{l−1} ⊙ (1 − h_l^g)    (12)

The linguistic context and the past-state token are fed into more causal layers before being transformed into the acoustic features. The architecture of the speech decoder is shown in Fig. 4b. We use the mean absolute error (MAE) as the loss function for the speech generation goals. In the adaptation stage, speaker biases are removed from the speech decoder.
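For completeness, a minimal highway layer following Equations 10-12, as used in sdec-prenet after [27], might look like this; the linear projections stand in for the causal convolutions of the actual prenet.

```python
import torch
import torch.nn as nn

class Highway(nn.Module):
    """Highway activation of Eqs. 10-12: h = h_f * h_g + h_prev * (1 - h_g)."""
    def __init__(self, dim):
        super().__init__()
        self.proj_f = nn.Linear(dim, dim)   # W_l^f
        self.proj_g = nn.Linear(dim, dim)   # W_l^g

    def forward(self, h_prev):
        h_f = self.proj_f(h_prev)                 # Eq. 10
        h_g = torch.sigmoid(self.proj_g(h_prev))  # Eq. 11
        return h_f * h_g + h_prev * (1.0 - h_g)   # Eq. 12
```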

3) Speech encoder: the speech encoder extracts the LLE z from a given acoustic sequence y while stripping away unnecessary information (i.e., speaker characteristics). It is similar to an ASR model in that its output needs to be independent of the training speakers, and the model needs to generalize to unseen targets. We have no strong preference for the speech encoder specification and simply use several dilated layers to capture the local context, as illustrated in Fig. 4d.

4) Text decoder: the text decoder takes an LLE sequence z and predicts the phoneme posterior at each frame. This is a new component introduced in this work compared with previous ones [23]. Unlike the other modules, which are reused in various stages, the shallow text decoder is included in training only and acts as an auxiliary regularizer. Its purpose is to force the latent linguistic embedding to focus more on phoneme information, which we found important for generating utterances with clear pronunciation. The balance between phoneme and other linguistic information is adjustable through the joint-goal weight α_stt and the representational power of the text decoder itself. This is why we use only a couple of layers to model the text decoder (Fig. 4c). The cross-entropy criterion is used as the loss function of the phoneme classifier.

B. WaveNet vocoder

An autoregressive WaveNet model conditioned on a mel-spectrogram [58], [6], [14] is used as the neural vocoder in our setup. WaveNet is trained on either 22.05-kHz or 24-kHz speech depending on the scenario. Waveform amplitudes are quantized by using 10-bit µ-law coding. The network consists of 40 dilated causal layers containing speaker biases. Both the residual and skip channels are set at 128. This is a typical setup for WaveNet [11]. In the adaptation stage, the speaker biases are removed before fine-tuning.
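The 10-bit µ-law quantization mentioned above corresponds to companding with µ = 2¹⁰ − 1 = 1023 and mapping to 1024 discrete classes. A small NumPy sketch (our own, not taken from the paper) is given below.

```python
import numpy as np

def mulaw_encode(x, bits=10):
    """Map waveform amplitudes in [-1, 1] to integer classes in [0, 2**bits - 1]."""
    mu = 2 ** bits - 1
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)   # companded, in [-1, 1]
    return ((y + 1.0) / 2.0 * mu + 0.5).astype(np.int64)       # quantize to 1024 levels

def mulaw_decode(q, bits=10):
    """Inverse of mulaw_encode: integer classes back to amplitudes in [-1, 1]."""
    mu = 2 ** bits - 1
    y = 2.0 * q.astype(np.float64) / mu - 1.0
    return np.sign(y) * ((1.0 + mu) ** np.abs(y) - 1.0) / mu
```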

C. Training, adapting, and inferring configurations

The General American English lexicon [59] was used for the text representation, and 56 distinct phonemes were found in our training data. An 80-dimensional mel-spectrogram was used as the acoustic representation. The mel-spectrogram was calculated by using a 50-ms window size and a 12.5-ms shift size. This was inspired by the setup of E2E TTS models [6], [27]. The weighting parameters of the optimizing losses were α = 0.1, β = 0.25, and γ = 0.01. The learning rate was set at 0.1 for all optimizing stages. The dropout rate was set at 0.2 for most components, apart from tenc-linguistic-context and sdec-prenet, for which the rate was set at 0.5. Training was stopped when the loss on the validation set had not improved for ten consecutive epochs.
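For reference, an 80-dimensional mel-spectrogram with the 50-ms window and 12.5-ms shift described above can be extracted with librosa roughly as follows; the FFT size, log compression, and other defaults are our choices and may differ from the authors' feature pipeline.

```python
import librosa
import numpy as np

def extract_mel(path, sr=24000, n_mels=80, win_ms=50.0, hop_ms=12.5):
    """Load audio and compute a log mel-spectrogram with the paper's window/shift sizes."""
    wav, _ = librosa.load(path, sr=sr)
    win_length = int(sr * win_ms / 1000.0)   # 50 ms   -> 1200 samples at 24 kHz
    hop_length = int(sr * hop_ms / 1000.0)   # 12.5 ms -> 300 samples at 24 kHz
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=2048,
                                         win_length=win_length,
                                         hop_length=hop_length, n_mels=n_mels)
    return np.log(np.maximum(mel, 1e-10)).T  # (frames, 80)
```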

One hundred speakers of the VCTK corpus [60] were used to train the multi-speaker text-speech system and the WaveNet vocoder. The sampling rate was converted to match the target scenarios. Among the remaining speakers, one male and one female with an American accent were used as targets for the experiment described in Sec. V-B. All common sentences were removed from the training data so they could be used for evaluation. As VCTK lacks diversity in linguistic content, we first used the 24-kHz LibriTTS corpus [61] to warm up the text-speech network. Only the train-clean-100 and train-clean-360 sets, which total 245 hours, were used to reduce the warm-up time. The phoneme alignments of each corpus were extracted using an ASR model trained on the same corpus with the KALDI toolkit [62]. For the evaluated utterances, the model trained on the LibriTTS corpus was used to extract their phoneme alignments.

TABLE I
TARGET SPEAKERS OF SCENARIO A

Speaker  | Database | Gender | Accent   | Quantity | Duration
VCC2TF1  | VCC2018  | female | American | 81 utt.  | 5.2 min
VCC2TF2  | VCC2018  | female | American | 81 utt.  | 5.0 min
VCC2TM1  | VCC2018  | male   | American | 81 utt.  | 5.2 min
VCC2TM2  | VCC2018  | male   | American | 81 utt.  | 5.3 min


There were two voice cloning experiments, scenarios A and B. For the voice cloning stage, the number of epochs was fixed to create a uniform process. Specifically, for scenario A, described in Sec. V-A, we first adapted the text-speech model for 256 epochs and the vocoder for 128 epochs, and then welded them together for 64 more. For scenario B, described in Sec. V-B, the numbers of epochs were 256, 64, and 32, respectively. The mix-in rate in the welding step was set at 0.9.

For the inference stage, the speech encoder used its mean output for VC, while the text encoder sampled an LLE sequence from the Gaussian distributions for TTS, as shown in Fig. 2c. To maintain stochasticity but reduce the chance of sampling undesirable outliers, we multiplied the standard deviation output of the text encoder by 0.1 before random sampling.
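The inference-time treatment of the LLE described above (mean output for VC, Gaussian sampling with the standard deviation scaled by 0.1 for TTS) can be sketched as follows; the function name and interface are illustrative only.

```python
import torch

def sample_lle(mu, sigma, mode="tts", std_scale=0.1):
    """Draw an LLE sequence from an encoder output.

    mode="vc":  use the mean of the speech-encoded distribution (deterministic).
    mode="tts": sample from the text-encoded distribution with the standard
                deviation shrunk by std_scale (0.1 in Sec. IV-C) to avoid outliers.
    """
    if mode == "vc":
        return mu
    return mu + std_scale * sigma * torch.randn_like(sigma)
```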

D. Evaluation measurements

In this work, we treat our system as a whole instead of focusing on individual techniques, and we compare it with other third-party systems. For objective evaluation, we used an ASR model⁴ to calculate the word error rate (WER) of the generated speech. Note that the WER was only used as a reference point since it is highly sensitive to the training data of the ASR model. For subjective evaluation, we used MOS on a 5-point scale for quality and DMOS on a 4-point scale for speaker similarity [31]. In most questions on speaker similarity, participants were asked to compare the speaker similarity of a generated utterance with a natural utterance. However, scenario A included additional questions comparing speaker similarity between generated utterances. In scenario B, participants were also asked to do several AB tests on quality and speaker similarity. In an AB test, two speech samples were presented on each test page and participants were asked to choose the better of the two. These questions were used to highlight fine-grained differences between generation systems. Each participant in our subjective listening tests was asked to do ten sessions.
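The word error rate used here is the word-level edit distance between the ASR transcript and the reference text, normalized by the number of reference words. A small self-contained implementation (independent of any ASR toolkit) is shown below.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: word_error_rate("please call stella", "please call stela") == 1/3
```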

V. EXPERIMENT SCENARIOS AND EVALUATIONS

As our system can clone voices by using either transcribed or untranscribed speech and can be used as a TTS or VC system, it would be difficult to evaluate all of these tasks in a single experiment. Therefore, we tested its performance and versatility under two separate scenarios. The first scenario focuses more on VC and on cloning voices with untranscribed speech, while the second scenario focuses more on TTS and on the performance of the supervised and unsupervised speaker adaptation strategies⁵.

⁴A chain system based on TDNN-F pretrained on the Librispeech corpus [63] was used for calculation (http://kaldi-asr.org/models/m13).


TABLE II
WORD ERROR RATE FOR OBJECTIVE EVALUATION OF SCENARIO A

System | Target speakers (%WER)
       | VCC2TF1 | VCC2TF2 | VCC2TM1 | VCC2TM2
XV     | 3.25    | 2.98    | 3.66    | 10.57
N10=   | 9.21    | 7.99    | 11.79   | 9.89
N10×   | 9.62    | 11.52   | 8.67    | 9.21
N13=   | 23.31   | 21.68   | 31.57   | 27.37
N13×   | 32.25   | 24.80   | 21.41   | 26.96
N17=   | 25.47   | 24.39   | 33.47   | 23.71
N17×   | 38.08   | 31.44   | 35.23   | 25.88
VCA=u  | 25.34   | 26.02   | 27.37   | 25.75
VCA×u  | 30.62   | 27.51   | 23.71   | 22.63
TTSu   | 7.72    | 8.40    | 6.23    | 7.18

System | Source speakers (%WER)
       | VCC2SF3 | VCC2SF4 | VCC2SM3 | VCC2SM4
S00    | 5.69    | 4.88    | 5.69    | 7.32


A. Cloning voices using untranscribed speech

In the first scenario, scenario A, we tested the ability to clone voices by using a small amount of untranscribed speech (about five minutes). A system showing good performance under this scenario is expected to have the capability to clone thousands of voices efficiently and cheaply.

1) Description of scenario A: we re-enacted the SPOKE task of the Voice Conversion Challenge 2018 (VCC2018) [31] for this scenario. The original goal of the task was to build VC systems for 4 target English speakers (2 males and 2 females) using 81 utterances (Table I). These systems were used to convert the speech of 4 source speakers (2 males and 2 females) into each of the target voices. We followed the VCC2018 guideline [31] faithfully with one extension: we evaluated TTS systems as well as VC systems at the same time. The TTS systems were required to be trained on the untranscribed speech of the target speakers. In the inference stage, transcriptions of the source utterances were used to generate speech with the TTS systems. As there were only 35 unique sentences, we generated each sentence twice. In summary, each TTS system produced 70 utterances for each target speaker, while each VC system produced 140 utterances. We split each VC system into two entities, one for same-gender conversion, denoted by the superscript “=”, and the other for cross-gender conversion, denoted by “×”.

2) Systems: We evaluated the following TTS and VC systems in scenario A:

• XV: a speaker-adaptive E2E TTS system using the x-vector [18], [15], [46]. XV was used as a third-party unsupervised TTS baseline. We used the libritts.tacotron2.v1 model and the speaker-independent WaveNet vocoder libritts.wavenet.mol.v1, both trained on the LibriTTS corpus, to realize this approach. Both are available in the ESPnet [64] repository⁶. As the x-vector is utterance-based, we randomly picked five utterances (about ten seconds) from the training pool of each target speaker to extract the x-vector each time we generated an utterance.

⁵The generated speech samples of both experiment scenarios are available at https://nii-yamagishilab.github.io/sample-versatile-voice-cloning/

⁶https://github.com/espnet/espnet

[Fig. 5 (scatter plot): Subjective results of scenario A, plotting quality against similarity for XV, N10, NR, VCAu, TTSu, T00, and S00. Lines indicate 95% confidence intervals. Cross-gender and same-gender conversion of the VC systems were treated as separate entities. Extra similarity evaluations: TTSu / VCA=u mean = 3.68 (lower = 3.63, upper = 3.73); TTSu / N10= mean = 3.41 (lower = 3.35, upper = 3.48).]

• N10: the winner of the VCC2018 SPOKE task. N10 contains a PPG-based acoustic model [36] and a fine-tuned WaveNet vocoder [14]. It uses a speaker-independent ASR model trained on hundreds of hours of labeled data to extract PPGs from speech. N10 clones voices without using the speech data of source speakers.

• N13\N17 (NR): the runners-up of the VCC2018 SPOKE task in terms of quality and similarity, respectively. To reduce the number of systems, we treat them as one (denoted as NR) and use N13 in the quality evaluation while using N17 [65] in the similarity evaluation.

• VCAu: our unsupervised VC system, which follows the adaptation process described in Sec. III-B1. The letter “A”, as in “any-to-one”, indicates that the model is not trained on source speakers. The word “unsupervised” means that the cloning is performed with untranscribed speech in the context of our current work. It is operated at 22.05 kHz to be compatible with the target speakers.

• TTSu: our unsupervised TTS system. As we did not train an automatic duration model, we used the durations extracted from the same-gender source speakers to generate speech from text. This means that TTSu shares the same duration model as VCA=u (and the other same-gender VC systems). This reduces the difference in experimental conditions between them and allows us to make more insightful observations.

• T00 and S00: natural utterances of the target and source speakers, used as references, respectively.

3) Evaluation: twenty-eight native English speakers participated in our subjective test for scenario A. Listeners were asked to answer 18 quality and 22 similarity questions in each session. In summary, each system was judged 560 times for each measurement, while the natural speech systems (T00 and S00) were judged 280 times. The objective and subjective evaluation results are shown in Table II and Fig. 5, with many interesting observations. a) XV had better quality but worse similarity than the runners-up of VCC2018. It also had the lowest WER; one reason is that it was trained on LibriTTS, which is derived from LibriSpeech. b) Our systems had high scores in both subjective measurements. Interestingly, our TTS system has a lower WER than our VC systems. c) Even though we had a lower score for quality than N10, the similarity seems to be higher. d) Our TTS and VC systems had highly consistent results, while there was a gap between the same-gender and cross-gender N10 subsystems. This was further supported by the extra similarity evaluations between the generated systems presented in Fig. 5: the similarity between our TTSu and VCA=u systems was higher than the similarity between TTSu and N10=.


TABLE III
TARGET SPEAKERS OF SCENARIO B

Speaker | Database | Gender | Accent/L1 | Quantity | Duration
p294    | VCTK     | female | American  | 325 utt. | 11.2 min
p345    | VCTK     | male   | American  | 325 utt. | 11.0 min
MF6     | EMIME    | female | Mandarin  | 145 utt. | 10.2 min
MM6     | EMIME    | male   | Mandarin  | 145 utt. | 11.3 min


4) Scenario conclusion: Even though the naturalness of our voice cloning system was slightly worse than that of N10 (again, the best system at VCC2018), generally speaking it achieved performance comparable to SOTA systems considering the differences in experimental conditions (e.g., the amount of data used in the training stage). More importantly, our system can seamlessly switch between TTS and VC modes with high consistency in terms of speaker characteristics. This is a desirable trait that will be useful for many applications.

B. Capturing unique speaker characteristics

As mentioned earlier, what differentiates voice cloning from speech synthesis is that it should prioritize capturing the unique characteristics of target speakers. While it is easy for listeners to grasp general global characteristics (e.g., average pitch), it is more difficult to notice subtle local traits (e.g., the pronunciation of particular words) with just a single reference utterance. We could use famous individuals as targets [25], but this assumes that listeners are familiar with them. In scenario B, we therefore used non-native speakers as targets to highlight their unique characteristics. This is convenient for subjective evaluation, as native speakers can generally spot their distinctiveness without any explanation of the linguistic aspects involved [66]. In simple terms, the goal of scenario B was to reproduce the accent of non-native speakers. This scenario is closely related to the accent reduction [67], [68] and accent control [24] tasks.

1) Scenario description: the target speakers for this scenario included two American English speakers and two non-native English speakers whose native language is Mandarin. Each speaker had about 10 minutes of speech, as listed in Table III. As the base model was trained with native speakers of English, the speakers from the VCTK corpus represented the standard, easy task, while the speakers from the EMIME corpus [69] represented difficult and unique target speakers. The evaluated systems were required to be built with either the transcribed or untranscribed speech of the targets. Twenty common sentences from the VCTK corpus were used for the evaluations. Each sentence was generated twice by each TTS system, which totaled 40 utterances. In the case of VC, one female (p299) and one male (p311) with a general American accent, included in the training pool, were used as source speakers.

TABLE IV
WORD ERROR RATE FOR OBJECTIVE EVALUATION OF SCENARIO B

System | Target speakers (%WER)
       | VCTK-p294 | VCTK-p345 | EMIME-MF6 | EMIME-MM6
NAT*   | 6.09      | 8.69      | 56.24     | 43.39
XV     | 3.50      | 24.05     | 5.33      | 3.81
FT     | 13.39     | 20.09     | 57.53     | 42.01
VCMu   | 22.22     | 24.05     | 27.70     | 27.09
VCMs   | 23.29     | 24.81     | 29.07     | 29.53
TTSu   | 8.37      | 9.74      | 13.39     | 14.92
TTSs   | 9.28      | 10.05     | 36.38     | 38.20

System | Source speakers (%WER)
       | VCTK-p299 | VCTK-p311 | -         | -
SRC**  | 5.64      | 6.51      | -         | -

*Calculated on all training utterances of the target speakers.
**Calculated on natural utterances of the source speakers.


2) Systems: The following TTS and VC systems were used for the evaluation in scenario B:

• XV: the same x-vector system as in scenario A, reused as the unsupervised TTS baseline.

• FT: a fine-tuned E2E TTS system used as the supervised baseline. We used ljspeech.tacotron2.v3, implemented with ESPnet [70], as the initial model. It was trained with 24 hours of transcribed speech of a female speaker from the LJSpeech corpus [71]. An initial WaveNet vocoder was also trained with the same corpus. When cloning voices, we fine-tuned both the acoustic and vocoder models with the transcribed speech of the targets. This system represents the simple supervised approach of fine-tuning a well-trained single-speaker model [16].

• VCMu: our unsupervised VC system, which followed the adaptation process described in Sec. III-B1 using untranscribed speech. The letter “M”, as in “many-to-one”, indicates that the source speakers were included in the training pool of the base model. The system was operated at 24 kHz.

• VCMs: our supervised VC system, which followed the cloning process described in Sec. III-B2 using transcribed speech. The supervised strategy is more relevant to TTS, but we still included its VC counterpart.

• TTSu: our unsupervised TTS system. The durations were extracted from the source speakers of the VC systems. This means our TTS and VC systems share the same duration model.

• TTSs: our supervised TTS system, which used the alternative supervised adaptation strategy.

• NAT: the natural utterances of the target speakers.
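
For the FT baseline mentioned above, the following is a minimal sketch of the general idea of adapting a pretrained single-speaker acoustic model to roughly ten minutes of a target speaker's transcribed data. The model class, data tensors, and hyperparameters are hypothetical stand-ins, not the paper's or ESPnet's actual API; it only illustrates the generic fine-tuning loop.

```python
# Illustrative sketch only: generic fine-tuning of a pretrained acoustic model
# on a small amount of target-speaker data (in the spirit of the FT baseline).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class ToyAcousticModel(nn.Module):
    """Stand-in for a pretrained text-to-mel acoustic model (hypothetical)."""
    def __init__(self, text_dim=64, mel_dim=80):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(text_dim, 256), nn.ReLU(),
                                 nn.Linear(256, mel_dim))

    def forward(self, text_feats):
        return self.net(text_feats)

model = ToyAcousticModel()
# In practice the weights would come from a model pretrained on LJSpeech, e.g.:
# model.load_state_dict(torch.load("pretrained_ljspeech.pt"))

# Hypothetical target-speaker data: (linguistic features, mel-spectrogram frames).
text_feats = torch.randn(512, 64)
mel_targets = torch.randn(512, 80)
loader = DataLoader(TensorDataset(text_feats, mel_targets), batch_size=32, shuffle=True)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # small LR for adaptation
criterion = nn.L1Loss()

model.train()
for epoch in range(5):            # only a few epochs to limit overfitting
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
```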

3) Evaluation: Thirty-two native English speakers took part in our subjective evaluation for scenario B. As the participants were native English speakers living in Japan, many of whom work as English teachers, we expected that they could quickly pick up on the non-native accents. Each session had 18 quality and 18 similarity questions containing utterances of both native and non-native speakers. Besides the standard MOS tests, we also included several AB preference tests in this scenario. In summary, each system was evaluated 640 times for each assessment. The objective evaluation results are listed in Table IV.
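As background for the error bars in Fig. 6, the snippet below is a minimal sketch of how a MOS mean and its 95% confidence interval can be computed from per-rating scores; the ratings array is simulated, since the raw listener scores are not published here.

```python
# Illustrative sketch only: MOS mean and 95% confidence interval from ratings.
import numpy as np
from scipy import stats

def mos_with_ci(scores, confidence=0.95):
    """Return (mean, half-width of the confidence interval) for 1-5 ratings."""
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean()
    sem = stats.sem(scores)                      # standard error of the mean
    # t-distribution half-width; close to 1.96 * sem for large n such as 640
    half_width = sem * stats.t.ppf((1 + confidence) / 2, df=len(scores) - 1)
    return mean, half_width

rng = np.random.default_rng(0)
ratings = rng.integers(1, 6, size=640)           # 640 simulated 1-5 MOS ratings
mean, ci = mos_with_ci(ratings)
print(f"MOS = {mean:.2f} +/- {ci:.2f}")
```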


[Figure 6 appears here. Each panel plots similarity (y-axis, 1-5) against quality (x-axis, 1-5) for the XV, FT, VCMu, VCMs, TTSu, TTSs, and NAT systems: (a) native speakers; (b) non-native speakers; (c) details of the per-listener evaluation of non-native natural utterances (NAT), with the global average marked. AB-preference bars (0-100%) for quality and similarity, comparing TTSu and TTSs with FT, appear below panels (a) and (b).]

Fig. 6. Subjective evaluations of scenario B. Lines indicate 95% confidence intervals.

The subjective evaluation results are shown in Fig. 6, with the results for native and non-native target speakers shown separately.

For the standard case with native target speakers, the subjective results show high MOS scores for most systems, as shown in Fig. 6a. The new result here is the comparison between supervised and unsupervised approaches. Comparing the XV and FT systems, which represent the unsupervised and supervised TTS baselines, we see that the fine-tuned system was significantly better than the speaker-embedding one, as it benefited from all ten minutes of data. Similar to scenario A, the XV system had a better WER than FT for many targets. Among our systems, the difference between the supervised and unsupervised strategies was marginal, but all of them were better than the supervised baseline FT. One hypothesis is that our approaches are less sensitive to overfitting thanks to the multi-speaker corpus, speaker factorization, and denoising training, while FT has a higher chance of overfitting when fine-tuned on ten minutes of speech [16], [22]. These observations are also supported by the AB-preference tests (see the bottom part of Fig. 6a).

For the challenging case with non-native target speakers, the subjective results revealed more interesting tendencies, as shown in Fig. 6b. This scenario tested not only the robustness of the voice cloning methods but also the listeners' behaviors. First, our systems had higher similarity scores than the TTS baselines, FT and XV. The difference between our supervised and unsupervised strategies was more pronounced in the non-native case: TTSs appeared to have higher similarity than TTSu. Next, and interestingly, the natural speech of the non-native speakers (NAT) received lower quality scores than its native counterpart. This is presumably because our native listeners perceived the "quality" of speech with strong non-native accents as low. As a result, quality and similarity were no longer positively correlated in this case. The average per-listener results for non-native NAT are plotted in Fig. 6c. A negative correlation was even found in the subjective results of the TTS baselines, FT and XV, indicating that higher-quality speech corresponded to less accented speech and hence lower speaker similarity to the non-native targets. This highlights the pros and cons of these adaptation methods. The WER of the non-native natural speech (NAT) was significantly worse than that of the native speakers, as expected, while TTSs was worse than TTSu.
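The correlation between quality and similarity discussed above can be estimated per listener as in the minimal sketch below; the score arrays are simulated stand-ins for the listening-test data, and the negative trend is injected purely for illustration.

```python
# Illustrative sketch only: correlation between per-listener quality and
# similarity scores, as discussed for Fig. 6c. The data here are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
quality = rng.uniform(1.5, 4.5, size=32)                    # one value per listener
similarity = 5.5 - quality + rng.normal(0.0, 0.3, size=32)  # simulated negative trend

r, p_value = stats.pearsonr(quality, similarity)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```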

In summary, the proposed system had higher speaker similarity than the baseline systems. Our TTS system, in particular, benefited from the supervised strategy, although the improvement was relatively small. Regarding TTSu and the two VC systems, which had slightly higher quality scores than the natural speech, we suspect that this is due to the reduced or absent accent in their generated speech. This hints at potential uses in other accent-related tasks [67].

4) Scenario conclusion: The subjective results showed that the fine-tuning approach is better at capturing unique speaker characteristics than the speaker-embedding approach when data are sufficient. Our systems, in particular, achieved high performance for native as well as non-native speakers. Moreover, our cloning strategy can be adjusted to take advantage of transcriptions when they are available. At the same time, the experiment also pointed out the limitations of the subjective evaluation: while the current quality and similarity questions work well for native speakers, the listeners' judgements were biased when they had to evaluate the voices of non-native speakers.

VI. CONCLUSION

In this paper, we showed that our voice cloning system, NAUTILUS, can achieve state-of-the-art performance. More importantly, it can act as a text-to-speech or voice conversion system with highly consistent speaker characteristics when switching between the two. With its versatile cloning strategy, which can be adjusted to the specific data situation of a target speaker, it is potentially useful for many other interesting tasks such as accent reduction [67] and cross-lingual voice cloning [72], [73]. For future work, we will focus on evaluating our system with different architectures for the text-speech components [7], [22] and neural vocoders [74], [13] to address specific voice cloning scenarios [24], [20]. Finally, given the multimodal structure, extending our system to other speech generation tasks (e.g., video-to-speech [3]) would be a natural direction toward a unified voice cloning framework.


ACKNOWLEDGMENTS

This work was partially supported by a JST CREST Grant (JPMJCR18A6, VoicePersonae project), Japan, and MEXT KAKENHI Grants (16H06302, 17H04687, 18H04120, 18H04112, 18KT0051), Japan.

REFERENCES

[1] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura, "Speech parameter generation algorithms for HMM-based speech synthesis," in Proc. ICASSP, 2000, pp. 1315–1318.
[2] A. Kain and M. W. Macon, "Spectral voice conversion for text-to-speech synthesis," in Proc. ICASSP, 1998, pp. 285–288.
[3] T. L. Cornu and B. Milner, "Reconstructing intelligible audio speech from visual speech features," in Proc. INTERSPEECH, 2015, pp. 3355–3359.
[4] D. Michelsanti, O. Slizovskaia, G. Haro, E. Gomez, Z.-H. Tan, and J. Jensen, "Vocoder-based speech synthesis from silent videos," arXiv preprint arXiv:2004.02541, 2020.
[5] G. Krishna, C. Tran, Y. Han, and M. Carnahan, "Speech synthesis using EEG," in Proc. ICASSP, 2020, pp. 1235–1238.
[6] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis, and Y. Wu, "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," in Proc. ICASSP, 2018, pp. 4779–4783.
[7] N. Li, S. Liu, Y. Liu, S. Zhao, and M. Liu, "Neural speech synthesis with Transformer network," in Proc. AAAI Conf. AI, vol. 33, 2019, pp. 6706–6713.
[8] H. Miyoshi, Y. Saito, S. Takamichi, and H. Saruwatari, "Voice conversion using sequence-to-sequence learning of context posterior probabilities," Proc. INTERSPEECH, pp. 1268–1272, 2017.
[9] K. Tanaka, H. Kameoka, T. Kaneko, and N. Hojo, "AttS2S-VC: Sequence-to-sequence voice conversion with attention and context preservation mechanisms," in Proc. ICASSP, 2019, pp. 6805–6809.
[10] J. Zhang, Z. Ling, and L.-R. Dai, "Non-parallel sequence-to-sequence voice conversion with disentangled linguistic and speaker representations," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 28, pp. 540–552, 2019.
[11] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016.
[12] R. Prenger, R. Valle, and B. Catanzaro, "WaveGlow: A flow-based generative network for speech synthesis," in Proc. ICASSP, 2019, pp. 3617–3621.
[13] X. Wang, S. Takaki, and J. Yamagishi, "Neural source-filter waveform models for statistical parametric speech synthesis," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 28, pp. 402–415, 2019.
[14] L.-J. Liu, Z.-H. Ling, Y. Jiang, M. Zhou, and L.-R. Dai, "WaveNet vocoder with limited training data for voice conversion," in Proc. INTERSPEECH, 2018, pp. 1983–1987.
[15] Y. Jia, Y. Zhang, R. J. Weiss, Q. Wang, J. Shen, F. Ren, Z. Chen, P. Nguyen, R. Pang, I. L. Moreno, and Y. Wu, "Transfer learning from speaker verification to multispeaker text-to-speech synthesis," arXiv preprint arXiv:1806.04558, 2018.
[16] K. Inoue, S. Hara, M. Abe, T. Hayashi, R. Yamamoto, and S. Watanabe, "Semi-supervised speaker adaptation for end-to-end speech synthesis with pretrained models," in Proc. ICASSP, 2020, pp. 7634–7638.
[17] X. Tian, J. Wang, H. Xu, E. S. Chng, and H. Li, "Average modeling approach to voice conversion with non-parallel data," in Proc. Odyssey, 2018, pp. 227–232.
[18] S. Arik, J. Chen, K. Peng, W. Ping, and Y. Zhou, "Neural voice cloning with a few samples," in Proc. NIPS, 2018, pp. 10040–10050.
[19] M. Zhang, X. Wang, F. Fang, H. Li, and J. Yamagishi, "Joint training framework for text-to-speech and voice conversion using multi-source Tacotron and WaveNet," Proc. INTERSPEECH, pp. 1298–1302, 2019.
[20] H.-T. Luong and J. Yamagishi, "Bootstrapping non-parallel voice conversion from speaker-adaptive text-to-speech," Proc. ASRU, pp. 200–207, 2019.
[21] A. Polyak and L. Wolf, "Attention-based WaveNet autoencoder for universal voice conversion," in Proc. ICASSP, 2019, pp. 6800–6804.
[22] W.-C. Huang, T. Hayashi, Y.-C. Wu, H. Kameoka, and T. Toda, "Voice Transformer Network: Sequence-to-sequence voice conversion using Transformer with text-to-speech pretraining," arXiv preprint arXiv:1912.06813, 2019.
[23] H.-T. Luong and J. Yamagishi, "A unified speaker adaptation method for speech synthesis using transcribed and untranscribed speech with backpropagation," arXiv preprint arXiv:1906.07414, 2019.
[24] Y. Zhang, R. J. Weiss, H. Zen, Y. Wu, Z. Chen, R. Skerry-Ryan, Y. Jia, A. Rosenberg, and B. Ramabhadran, "Learning to speak fluently in a foreign language: Multilingual speech synthesis and cross-language voice cloning," Proc. INTERSPEECH, pp. 2080–2084, 2019.
[25] J. Lorenzo-Trueba, F. Fang, X. Wang, I. Echizen, J. Yamagishi, and T. Kinnunen, "Can we steal your vocal identity from the Internet?: Initial investigation of cloning Obama's voice using GAN, WaveNet and low-quality found data," in Proc. Odyssey, 2018, pp. 240–247.
[26] A. Gutkin, L. Ha, M. Jansche, K. Pipatsrisawat, and R. Sproat, "TTS for low resource languages: A Bangla synthesizer," in Proc. LREC, 2016, pp. 2005–2010.
[27] H. Tachibana, K. Uenoyama, and S. Aihara, "Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention," in Proc. ICASSP, 2018, pp. 4784–4788.
[28] Y. Chen, Y. Assael, B. Shillingford, D. Budden, S. Reed, H. Zen, Q. Wang, L. C. Cobo, A. Trask, B. Laurie, C. Gulcehre, A. van den Oord, O. Vinyals, and N. de Freitas, "Sample efficient adaptive text-to-speech," arXiv preprint arXiv:1809.10460, 2018.
[29] Y. Stylianou, O. Cappe, and E. Moulines, "Statistical methods for voice quality transformation," in Proc. EUROSPEECH, 1995, pp. 447–450.
[30] T. Toda, A. W. Black, and K. Tokuda, "Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory," IEEE Trans. Audio, Speech, Language Process., vol. 15, no. 8, pp. 2222–2235, 2007.
[31] J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, T. Kinnunen, and Z. Ling, "The Voice Conversion Challenge 2018: Promoting development of parallel and nonparallel methods," in Proc. Odyssey, 2018, pp. 195–202.
[32] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, "StarGAN-VC: Non-parallel many-to-many voice conversion using star generative adversarial networks," in Proc. SLT, 2018, pp. 266–273.
[33] Y. Chen, M. Chu, E. Chang, J. Liu, and R. Liu, "Voice conversion with smoothed GMM and MAP adaptation," in Proc. EUROSPEECH, 2003, pp. 2413–2416.
[34] T. Toda, Y. Ohtani, and K. Shikano, "Eigenvoice conversion based on Gaussian mixture model," in Proc. INTERSPEECH, 2006, pp. 2446–2449.
[35] C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang, "Voice conversion from non-parallel corpora using variational auto-encoder," in Proc. APSIPA, 2016, pp. 1–6.
[36] L. Sun, K. Li, H. Wang, S. Kang, and H. Meng, "Phonetic posteriorgrams for many-to-one voice conversion without parallel data training," in Proc. ICME, 2016, pp. 1–6.
[37] W. Ping, K. Peng, and J. Chen, "ClariNet: Parallel wave generation in end-to-end text-to-speech," in Proc. ICLR, 2019, pp. 1–15.
[38] Z. Huang, H. Lu, M. Lei, and Z. Yan, "Linear networks based speaker adaptation for speech synthesis," in Proc. ICASSP, 2018, pp. 5319–5323.
[39] J. Yamagishi and T. Kobayashi, "Average-voice-based speech synthesis using HSMM-based speaker adaptation and adaptive training," IEICE Trans. Inf. Syst., vol. 90, no. 2, pp. 533–543, 2007.
[40] Y. Fan, Y. Qian, F. K. Soong, and L. He, "Multi-speaker modeling and speaker adaptation for DNN-based TTS synthesis," in Proc. ICASSP, 2015, pp. 4475–4479.
[41] H.-T. Luong and J. Yamagishi, "Scaling and bias codes for modeling speaker-adaptive DNN-based speech synthesis systems," in Proc. SLT, 2018, pp. 610–617.
[42] Y.-N. Chen, Y. Jiao, Y. Qian, and F. K. Soong, "State mapping for cross-language speaker adaptation in TTS," in Proc. ICASSP, 2009, pp. 4273–4276.
[43] S. Takaki, Y. Nishimura, and J. Yamagishi, "Unsupervised speaker adaptation for DNN-based speech synthesis using input codes," in Proc. APSIPA, 2018, pp. 649–658.
[44] Z. Wu, P. Swietojanski, C. Veaux, S. Renals, and S. King, "A study of speaker adaptation for DNN-based speech synthesis," in Proc. INTERSPEECH, 2015, pp. 879–883.
[45] R. Doddipatla, N. Braunschweiler, and R. Maia, "Speaker adaptation in DNN-based speech synthesis using d-vectors," in Proc. INTERSPEECH, 2017, pp. 3404–3408.
[46] E. Cooper, C.-I. Lai, Y. Yasuda, F. Fang, X. Wang, N. Chen, and J. Yamagishi, "Zero-shot multi-speaker text-to-speech with state-of-the-art neural speaker embeddings," arXiv preprint arXiv:1910.10838, 2019.
[47] M. Gales, S. Young et al., "The application of hidden Markov models in speech recognition," Foundations and Trends in Signal Processing, vol. 1, no. 3, pp. 195–304, 2008.


[48] A. Tjandra, S. Sakti, and S. Nakamura, "Listening while speaking: Speech chain by deep learning," in Proc. ASRU, 2017, pp. 301–308.
[49] S. Karita, S. Watanabe, T. Iwata, M. Delcroix, A. Ogawa, and T. Nakatani, "Semi-supervised end-to-end speech recognition using text-to-speech and autoencoders," in Proc. ICASSP, 2019, pp. 6166–6170.
[50] B. Li and H. Zen, "Multi-language multi-speaker acoustic modeling for LSTM-RNN based statistical parametric speech synthesis," in Proc. INTERSPEECH, 2016, pp. 2468–2472.
[51] H.-T. Luong and J. Yamagishi, "Multimodal speech synthesis architecture for unsupervised speaker adaptation," in Proc. INTERSPEECH, 2018, pp. 2494–2498.
[52] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther, "Autoencoding beyond pixels using a learned similarity metric," arXiv preprint arXiv:1512.09300, 2015.
[53] Y. Zhao, S. Takaki, H.-T. Luong, J. Yamagishi, D. Saito, and N. Minematsu, "Wasserstein GAN and waveform loss-based acoustic model training for multi-speaker text-to-speech synthesis systems using a WaveNet vocoder," IEEE Access, vol. 6, pp. 60478–60488, 2018.
[54] W.-C. Huang, Y.-C. Wu, H.-T. Hwang, P. L. Tobing, T. Hayashi, K. Kobayashi, T. Toda, Y. Tsao, and H.-M. Wang, "Refined WaveNet vocoder for variational autoencoder based voice conversion," in Proc. EUSIPCO, 2019, pp. 1–5.
[55] O. Watts, G. E. Henter, J. Fong, and C. Valentini-Botinhao, "Where do the improvements come from in sequence-to-sequence neural TTS?" in Proc. 10th SSW, 2019.
[56] H. Zen and H. Sak, "Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis," in Proc. ICASSP, 2015, pp. 4470–4474.
[57] X. Wang, S. Takaki, and J. Yamagishi, "An autoregressive recurrent mixture density network for parametric speech synthesis," in Proc. ICASSP, 2017, pp. 4895–4899.
[58] T. Hayashi, A. Tamamori, K. Kobayashi, K. Takeda, and T. Toda, "An investigation of multi-speaker training for WaveNet vocoder," in Proc. ASRU, 2017, pp. 712–718.
[59] K. Richmond, R. A. Clark, and S. Fitt, "Robust LTS rules with the Combilex speech technology lexicon," in Proc. INTERSPEECH, 2009.
[60] C. Veaux, J. Yamagishi, and K. MacDonald, "CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit," 2017, http://dx.doi.org/10.7488/ds/1994.
[61] H. Zen, V. Dang, R. Clark, Y. Zhang, R. J. Weiss, Y. Jia, Z. Chen, and Y. Wu, "LibriTTS: A corpus derived from LibriSpeech for text-to-speech," in Proc. INTERSPEECH, 2019, pp. 1526–1530.
[62] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, "The Kaldi speech recognition toolkit," in Proc. ASRU, 2011.
[63] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "LibriSpeech: An ASR corpus based on public domain audio books," in Proc. ICASSP, 2015, pp. 5206–5210.
[64] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N.-E. Y. Soplin, J. Heymann, M. Wiesner, N. Chen et al., "ESPnet: End-to-end speech processing toolkit," Proc. INTERSPEECH, pp. 2207–2211, 2018.
[65] Y.-C. Wu, P. L. Tobing, T. Hayashi, K. Kobayashi, and T. Toda, "The NU non-parallel voice conversion system for the Voice Conversion Challenge 2018," in Proc. Odyssey, 2018, pp. 211–218.
[66] A. C. Janska and R. A. Clark, "Native and non-native speaker judgements on the quality of synthesized speech," in Proc. INTERSPEECH, 2010.
[67] S. Aryal and R. Gutierrez-Osuna, "Can voice conversion be used to reduce non-native accents?" in Proc. ICASSP, 2014, pp. 7879–7883.
[68] Y. Oshima, S. Takamichi, T. Toda, G. Neubig, S. Sakti, and S. Nakamura, "Non-native text-to-speech preserving speaker individuality based on partial correction of prosodic and phonetic characteristics," IEICE Trans. Inf. & Syst., vol. 99, no. 12, pp. 3132–3139, 2016.
[69] M. Wester and H. Liang, "The EMIME Mandarin bilingual database," 2011, http://hdl.handle.net/1842/4862.
[70] T. Hayashi, R. Yamamoto, K. Inoue, T. Yoshimura, S. Watanabe, T. Toda, K. Takeda, Y. Zhang, and X. Tan, "ESPnet-TTS: Unified, reproducible, and integratable open source end-to-end text-to-speech toolkit," in Proc. ICASSP, 2020, pp. 7654–7658.
[71] K. Ito, "The LJ Speech dataset," 2017, https://keithito.com/LJ-Speech-Dataset/.
[72] M. Abe, K. Shikano, and H. Kuwabara, "Statistical analysis of bilingual speakers' speech for cross-language voice conversion," J. Acoust. Soc. Am., vol. 90, no. 1, pp. 76–82, 1991.
[73] Y. Zhou, X. Tian, H. Xu, R. K. Das, and H. Li, "Cross-lingual voice conversion with bilingual phonetic posteriorgram and average modeling," in Proc. ICASSP, 2019, pp. 6790–6794.
[74] S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio, "SampleRNN: An unconditional end-to-end neural audio generation model," arXiv preprint arXiv:1612.07837, 2016.


Hieu-Thi Luong received B.E. and M.E. degrees in computer science from Vietnam National University, Ho Chi Minh City, University of Science, Vietnam, in 2014 and 2016, respectively. In 2017, he was awarded a Japanese Government (Monbukagakusho: MEXT) Scholarship to pursue a Ph.D. degree in statistical speech synthesis and machine learning at the National Institute of Informatics, Tokyo, Japan.


Junichi Yamagishi received a Ph.D. degree from the Tokyo Institute of Technology in 2006 for a thesis that pioneered speaker-adaptive speech synthesis. He is currently a Professor with the National Institute of Informatics, Tokyo, Japan, and also a Senior Research Fellow with the Centre for Speech Technology Research, University of Edinburgh, Edinburgh, U.K. Since 2006, he has authored and co-authored more than 250 refereed papers in international journals and conferences.

He was the recipient of the Tejima Prize for the best Ph.D. thesis of the Tokyo Institute of Technology in 2007. He was awarded the Itakura Prize from the Acoustical Society of Japan in 2010, the Kiyasu Special Industrial Achievement Award from the Information Processing Society of Japan in 2013, the Young Scientists' Prize from the Minister of Education, Science and Technology in 2014, the JSPS Prize from the Japan Society for the Promotion of Science in 2016, and the Docomo Mobile Science Award from the Mobile Communication Fund in 2018.

He was one of the organizers of the special sessions on "Spoofing and Countermeasures for Automatic Speaker Verification" at Interspeech 2013, "ASVspoof evaluation" at Interspeech 2015, "Voice Conversion Challenge 2016" at Interspeech 2016, "2nd ASVspoof evaluation" at Interspeech 2017, and "Voice Conversion Challenge 2018" at Speaker Odyssey 2018. He is currently an organizing committee member for ASVspoof 2019, an organizing committee member for the 10th ISCA Speech Synthesis Workshop 2019, a technical program committee member for IEEE ASRU 2019, and an award committee member for ISCA Speaker Odyssey 2020.

He was a member of the Speech and Language Technical Committee and a Lead Guest Editor for a special issue of the IEEE Journal of Selected Topics in Signal Processing on spoofing and countermeasures for automatic speaker verification. He is currently a Senior Area Editor of the IEEE/ACM Transactions on Audio, Speech, and Language Processing and a chairperson of ISCA SynSIG.

