Comparing Recent Waveform Generation and Acoustic Modeling Methods for Neural-network-based Speech Synthesis
Xin WANG, Jaime Lorenzo-Trueba, Shinji TAKAKI, Lauri Juvela, Junichi YAMAGISHI
National Institute of Informatics, Japan & Aalto University, Finland
2018-04-17
Contact: [email protected] (suggestions and discussion)
ICASSP 2018, Calgary, Canada

OVERVIEW
❑ Motivation
• Better modules for the statistical parametric speech synthesis (SPSS) framework?
❑ Method
• Plug and test new acoustic models and waveform generators
❑ Results
• Best combination: autoregressive (AR) acoustic models + WaveNet-based vocoder
• Quality: as good as vocoded speech (at 16 kHz)

CONTENTS
❑ Introduction
❑ Models and modules
❑ Experiments
❑ Summary

INTRODUCTION
Background
❑ Conventional TTS pipeline [1]
[Figure: Text → Front-end → Linguistic features → Back-end → Speech]
❑ SPSS back-end [2, 3]
[Figure: Linguistic features → Acoustic models → Spectral features (MGC & BAP) and F0 → Waveform generators → Speech]
• MGC: Mel-generalized cepstral coefficients [4]
• BAP: band aperiodicity

[1] T. Dutoit. An Introduction to Text-to-speech Synthesis. Kluwer Academic Publishers, Norwell, MA, USA, 1997.
[2] K. Tokuda et al. Speech synthesis based on hidden Markov models. Proceedings of the IEEE, 101(5):1234–1252, 2013.
[3] H. Zen et al. Statistical parametric speech synthesis. Speech Communication, 51:1039–1064, 2009.
[4] K. Tokuda, T. Kobayashi, T. Masuko, and S. Imai. Mel-generalized cepstral analysis: a unified approach. In Proc. ICSLP, pages 1043–1046, 1994.

Topic of this work
❑ Better modules for the SPSS back-end?
• Acoustic models (linguistic features → MGC & BAP, F0): recurrent neural networks (RNNs), autoregressive (AR) models, generative adversarial network (GAN), …
• Waveform generators (MGC & BAP, F0 → speech): WORLD vocoder, WORLD + phase recovery, WaveNet-based vocoder, …

MODELS & METHODS
[Figure: baseline pipeline. Linguistic features → Acoustic models (RNN) → MGC & BAP & F0 → Waveform generators (WORLD) → Speech waveforms]

Acoustic models
❑ Baseline RNN
• Sequence of linguistic features: $\{\boldsymbol{x}_1, \cdots, \boldsymbol{x}_t, \cdots\}$
• Sequence of generated acoustic features: $\{\hat{\boldsymbol{o}}_1, \cdots, \hat{\boldsymbol{o}}_t, \cdots\}$
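
Acoustic models
❑ Baseline RNN
• Neural network: $\hat{\boldsymbol{o}}_t = \boldsymbol{\mu}_t$, where $\mathcal{M}_t = \{\boldsymbol{\mu}_t\}$ and $\boldsymbol{\mu}_t = H^{(\mathrm{RNN})}_{\Theta}(\boldsymbol{x}_{1:T}, t)$
• Probabilistic model:
$$p(\boldsymbol{o}_{1:T} \mid \boldsymbol{x}_{1:T}; \Theta) = \prod_{t=1}^{T} p(\boldsymbol{o}_t \mid \boldsymbol{x}_{1:T}; \Theta) = \prod_{t=1}^{T} \mathcal{N}(\boldsymbol{o}_t; \boldsymbol{\mu}_t, \boldsymbol{I})$$

[5] C. M. Bishop. Neural networks for pattern recognition. Oxford University Press, 1995.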
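
Because the covariance in the Gaussian above is the identity matrix, maximizing the likelihood is the same as minimizing the sum of squared errors between natural and predicted frames, up to an additive constant. The sketch below illustrates this with plain Python lists; the function names and the toy values are hypothetical and are not taken from the authors' code.

```python
import math


def gaussian_nll_identity(o, mu):
    """Negative log-likelihood of frames o under N(o_t; mu_t, I), summed over t."""
    nll = 0.0
    for o_t, mu_t in zip(o, mu):
        dim = len(o_t)
        sq_err = sum((a - b) ** 2 for a, b in zip(o_t, mu_t))
        # -log N(o_t; mu_t, I) = 0.5 * ||o_t - mu_t||^2 + 0.5 * D * log(2 * pi)
        nll += 0.5 * sq_err + 0.5 * dim * math.log(2.0 * math.pi)
    return nll


def sum_squared_error(o, mu):
    """Sum of squared errors over all frames and dimensions."""
    return sum((a - b) ** 2 for o_t, mu_t in zip(o, mu) for a, b in zip(o_t, mu_t))


# Toy data (hypothetical values): T = 3 frames, D = 2 dimensions.
natural = [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]]
predicted = [[0.1, 0.9], [0.4, 0.7], [1.2, -0.1]]

nll = gaussian_nll_identity(natural, predicted)
# Constant term that does not depend on the network output.
const = 0.5 * 3 * 2 * math.log(2.0 * math.pi)
```

The difference `nll - const` equals half of `sum_squared_error(natural, predicted)`, which is why this baseline is often described as mean-squared-error training.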

Acoustic models
❑ Baseline RNN
$$p(\boldsymbol{o}_{1:T} \mid \boldsymbol{x}_{1:T}; \Theta) = \prod_{t=1}^{T} p(\boldsymbol{o}_t \mid \boldsymbol{x}_{1:T}; \Theta) = \prod_{t=1}^{T} \mathcal{N}(\boldsymbol{o}_t; \boldsymbol{\mu}_t, \boldsymbol{I})$$
• Limitations
1. Conditional independence → AR models
2. Maximum-likelihood training → GAN
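
Acoustic models
❑ Shallow AR (SAR)
$$p(\boldsymbol{o}_{1:T} \mid \boldsymbol{x}_{1:T}; \Theta) = \prod_{t=1}^{T} p(\boldsymbol{o}_t \mid \boldsymbol{o}_{t-K:t-1}, \boldsymbol{x}_{1:T}; \Theta) = \prod_{t=1}^{T} \mathcal{N}(\boldsymbol{o}_t; \boldsymbol{\mu}_t + f(\boldsymbol{o}_{t-K:t-1}), \boldsymbol{I})$$
• The figure shows the case K = 2
• Alternative interpretation: trainable filter + RNN

[6] X. Wang, S. Takaki, and J. Yamagishi. An autoregressive recurrent mixture density network for parametric speech synthesis. In Proc. ICASSP, pages 4895–4899, 2017.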
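
The "trainable filter + RNN" reading of SAR can be illustrated with a small sketch. It assumes a scalar feature stream and a linear $f$ with coefficients $a_1, \dots, a_K$ (the linear form is the one used in the appendix slides); the function names and numbers are hypothetical, and frames before the first one are treated as zero.

```python
def sar_generate(mu, a):
    """Mean-based generation: o_hat[t] = mu[t] + sum_k a[k-1] * o_hat[t-k]."""
    order = len(a)
    o_hat = []
    for t in range(len(mu)):
        # Feedback from previously generated frames (zero before the sequence starts).
        past = sum(a[k - 1] * o_hat[t - k] for k in range(1, order + 1) if t - k >= 0)
        o_hat.append(mu[t] + past)
    return o_hat


def sar_transform(o, a):
    """Inverse direction: c[t] = o[t] - sum_k a[k-1] * o[t-k]."""
    order = len(a)
    c = []
    for t in range(len(o)):
        past = sum(a[k - 1] * o[t - k] for k in range(1, order + 1) if t - k >= 0)
        c.append(o[t] - past)
    return c


a = [0.5, -0.2]                  # hypothetical AR coefficients, K = 2
mu = [1.0, 0.0, 0.0, 0.0, 0.0]   # hypothetical RNN outputs (an impulse)
o_hat = sar_generate(mu, a)      # impulse response of the feedback filter
recovered = sar_transform(o_hat, a)
```

Feeding an impulse through `sar_generate` shows how the RNN output is shaped by the AR feedback, and `sar_transform` undoes that shaping exactly, recovering `mu`.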

Acoustic models
❑ Deep AR (DAR)
$$p(\boldsymbol{o}_{1:T} \mid \boldsymbol{x}_{1:T}; \Theta) = \prod_{t=1}^{T} p(\boldsymbol{o}_t \mid \boldsymbol{o}_{1:t-1}, \boldsymbol{x}_{1:T}; \Theta)$$
• Only for (quantized) F0 modeling

[7] X. Wang, S. Takaki, and J. Yamagishi. Autoregressive neural F0 model for statistical parametric speech synthesis. IEEE Transactions on Audio, Speech and Language Processing. (Accepted)

Acoustic models
❑ GAN
• GAN-based post-filter [8]
[Figure: Linguistic features → Acoustic model → Acoustic features (generated). A residual generator driven by noise adds (+) a residual to the generated features; a discriminator compares the result with natural acoustic features and outputs 0/1]

[8] T. Kaneko, H. Kameoka, N. Hojo, Y. Ijima, K. Hiramatsu, and K. Kashino. Generative adversarial network-based postfilter for statistical parametric speech synthesis. In Proc. ICASSP, pages 4910–4914, 2017.

Acoustic models
[Figure: Linguistic features → Acoustic models → F0 and MGC → Waveform generators (WORLD) → Speech waveforms. MGC is generated by SAR or RNN, optionally followed by GAN; F0 is generated by DAR]
• Systems: SAR-Wo, SGA-Wo, RGA-Wo, RNN-Wo
• BAP is not shown

Waveform generators
❑ Deterministic approaches
• WORLD [9]
  o Binary voicing decision
  o Minimum phase
• A log domain pulse model (PML) [10]
  o Source-filter model, additive in log-domain
  o Binary noise mask
• WORLD + phase recovery
  o Phase recovery [11]: generated waveform → STFT amplitude → inverse STFT → "phase-recovered" waveform

[9] M. Morise et al. WORLD: A vocoder-based high-quality speech synthesis system for real-time applications. IEICE Trans. on Information and Systems, 99(7):1877–1884, 2016.
[10] G. Degottex et al. A log domain pulse model for parametric speech synthesis. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2017.
[11] D. Griffin and J. Lim. Signal estimation from modified short-time Fourier transform. IEEE Trans. ASSP, 32(2):236–243, 1984.

Waveform generators
❑ WaveNet-based vocoder [12, 13]
• How to generate (search) a waveform:
1. Exploration: sampling
2. Exploitation: picking one-best
• Sampling in unvoiced regions
• Picking one-best in voiced regions
• Less distortion of harmonics ☛ appendix & paper
[Figure: predicted probability (0.0 to 0.5) over waveform levels at each sampling point, for an unvoiced segment (sampling points 9300 to 9320) and a voiced region (sampling points 8900 to 8920)]

[12] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
[13] A. Tamamori, T. Hayashi, K. Kobayashi, K. Takeda, and T. Toda. Speaker-dependent WaveNet vocoder. In Proc. Interspeech, pages 1118–1122, 2017.
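
The generation strategy described above (one-best in voiced regions, random sampling in unvoiced regions) can be sketched as a per-sample decision rule. This is only an illustration: `pick_level`, the three-level distribution and the fixed random seed are hypothetical, and a real vocoder would apply the rule to the softmax output at every sampling point.

```python
import random


def pick_level(probs, voiced, rng):
    """Choose one quantized waveform level from a predicted distribution."""
    levels = range(len(probs))
    if voiced:
        # Exploitation: take the most probable level (one-best).
        return max(levels, key=lambda i: probs[i])
    # Exploration: draw a level at random according to the distribution.
    return rng.choices(levels, weights=probs, k=1)[0]


rng = random.Random(0)
probs = [0.1, 0.6, 0.3]  # hypothetical softmax output over 3 levels

voiced_choice = pick_level(probs, True, rng)
unvoiced_choices = [pick_level(probs, False, rng) for _ in range(1000)]
```

In the voiced case the result is deterministic, while the unvoiced draws spread over all levels in proportion to their probabilities.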
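
Acoustic models and waveform generators
[Figure: Linguistic features → Acoustic models (SAR or RNN, optionally with GAN, for MGC; DAR for F0) → Waveform generators: WORLD (minimum phase), WORLD + phase recovery, PML, WaveNet]
• Systems with WORLD: SAR-Wo, SGA-Wo, RGA-Wo, RNN-Wo
• Systems with other waveform generators: SAR-Pm (PML), SAR-Pr (phase recovery), SAR-Wa (WaveNet)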

EXPERIMENTS
Configuration
❑ Data
• Corpus: ATR Ximera F009 voice [14]
• Size: ~30,000 utterances, 48 hours
• Note: sampling rate 48 kHz; Japanese, neutral style, reading
• Recording period: over 1 year
❑ Front-end: Open JTalk [15]

Feature      Content                              Dimension
Linguistic   Phone identity, prosodic tags, ...   ~390
Acoustic     MGC                                  60
Acoustic     BAP                                  25
Acoustic     F0                                   1

[14] H. Kawai, T. Toda, J. Ni, M. Tsuzaki, and K. Tokuda. Ximera: A new TTS from ATR based on corpus-based technologies. In Proc. SSW5, pages 179–184, 2004.
[15] HTS Working Group. The Japanese TTS System 'Open JTalk', 2015.

Configuration
❑ Listening test
• Quality: MOS (1-5)
• Similarity: rated 1-5, natural reference at 48 kHz
• Participants: 235 native Japanese listeners, 1500 sets of results
❑ Systems
• Common network configuration (cf. the paper)
• Without […], nor formant enhancement
• Sampling rate: 48 kHz & 16 kHz, except SAR-Wa at 16 kHz (10 bits, μ-law)
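
For readers unfamiliar with the "10 bits, μ-law" setting, the sketch below shows standard μ-law companding of a waveform sample in [-1, 1] into 1024 discrete levels and back. It assumes μ = 2^10 − 1, which is the usual convention; the slides do not state the exact constant or rounding rule, so both are assumptions here.

```python
import math

MU = 2 ** 10 - 1  # assumed mu for 10-bit quantization: levels 0..1023


def mulaw_encode(x, mu=MU):
    """Map a sample in [-1, 1] to an integer level in [0, mu]."""
    x = max(-1.0, min(1.0, x))
    # Logarithmic compression keeps more resolution near zero amplitude.
    y = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int((y + 1.0) / 2.0 * mu + 0.5)


def mulaw_decode(q, mu=MU):
    """Map an integer level in [0, mu] back to a sample in [-1, 1]."""
    y = 2.0 * q / mu - 1.0
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


level = mulaw_encode(0.5)
restored = mulaw_decode(level)
```

The WaveNet softmax then predicts a distribution over these integer levels rather than over raw amplitudes.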

Systems

System    Acoustic features (MGC)   Waveform generator
SAR-Wo    SAR                       WORLD
SGA-Wo    SAR + GAN                 WORLD
RGA-Wo    RNN + GAN                 WORLD
RNN-Wo    RNN                       WORLD
SAR-Pm    SAR                       PML
SAR-Pr    SAR                       WORLD + phase recovery
SAR-Wa    SAR                       WaveNet
Abs-Wo    natural                   WORLD
Abs-Pm    natural                   PML
Natural   natural speech

[Figure: quality scores of all systems at 16 kHz and 48 kHz]
[Figure: similarity scores of all systems]

SUMMARY
Plug and test
❑ SAR
• Avoids the conditional independence assumption
• Alleviates the over-smoothing effect ☛ appendix & paper
❑ WaveNet
• Generation method: one-best generation + random sampling
• Less distortion of harmonics ☛ appendix & paper
❑ WaveNet vocoder + SAR & DAR
• Better than other combinations
• Worse than natural speech
• Close to vocoded speech

FURTHER IMPROVEMENT?
Recent work ☛ appendix
❑ SAR
• Special case of volume-preserving normalizing flow [16]
• Extended SAR = time-variant filter + RNN
❑ WaveNet vocoders
• Training based on generated conditional features
❑ Annotated linguistic features
• Reduce the gap between natural & synthetic speech
Future work?
❑ Reduce the variability of recordings
❑ Use complex-valued neural models

[16] D. Rezende and S. Mohamed. Variational inference with normalizing flows. In Proc. ICML, pages 1530–1538, 2015.

MESSAGE
Code, recipes, slides
❑ Acoustic models & WaveNet (CUDA/C++)
• https://github.com/TonyWangX/CURRENNT_MODIFIED
• https://github.com/TonyWangX/CURRENNT_Recipes
❑ Simple explanation on WaveNet and acoustic models
• http://tonywangx.github.io/slides.html

APPENDIX - WAVENET
• Conditional network
• WaveNet-backend
• Learning curve
• Generation method
• Generation with other acoustic models
• Training based on generated features
Tutorial slides: http://tonywangx.github.io/pdfs/wavenet.pdf
• No cherry picking
• All samples based on generated acoustic features or automatically inferred linguistic features
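
Structure
[Figure: WaveNet vocoder structure, modeling $P(o_t \mid o_{t-R:t-1}, \boldsymbol{c}_{1:N})$]
• Conditional feature network (clock rate: 200 Hz, frame shift = 5 ms): $\boldsymbol{c}_{1:N}$ → Bi-LSTM → CNN → concatenation with F0 → Linear → up-sampling → $\boldsymbol{l}_t$
• WaveNet blocks 1, 2, …, M (clock rate: 16 kHz): $o_{t-1}$ → Linear → blocks conditioned on $\boldsymbol{l}_t$, with block outputs $s^{(1)}_t, s^{(2)}_t, \dots, s^{(M)}_t$
• Post-processing network: sum (+) of the block outputs → Linear + tanh → Linear + tanh → Softmax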
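
The two clock rates in the structure above imply that every 5 ms conditional frame must cover 16000 / 200 = 80 waveform samples. The sketch below shows the simplest way to bridge the rates, by repeating each frame; the slide only says "Up-sampling", so repetition is an assumption and the function name is hypothetical.

```python
def upsample_by_repetition(frames, frame_rate_hz=200, sample_rate_hz=16000):
    """Repeat each conditional frame so that one frame exists per waveform sample."""
    factor = sample_rate_hz // frame_rate_hz  # 80 samples per 5 ms frame
    return [frame for frame in frames for _ in range(factor)]


# Three hypothetical 2-dimensional conditional frames.
frames = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
upsampled = upsample_by_repetition(frames)
```

After up-sampling, `upsampled[t]` plays the role of the per-sample conditional feature fed to the WaveNet blocks.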

Conditional network
❑ Architecture of the conditional network
• Trial 1: none ($\boldsymbol{c}_{1:N}$ used directly as $\boldsymbol{l}_{1:N}$)
• Trial 2: LSTM + CNN (F0 and MGC → Bi-LSTM → CNN → Linear → $\boldsymbol{l}_{1:N}$)
• Trial 3: LSTM + CNN + skip-F0 (Bi-LSTM → CNN, concatenated with F0 → Linear → $\boldsymbol{l}_{1:N}$)
❑ Experiments on WaveNet vocoder
• Given generated MGC/F0
[Audio samples 1-3: Natural; Trial 1 (none); Trial 2 (LSTM+CNN); Trial 3 (LSTM+CNN+skip-F0)]

WaveNet backend
❑ Architecture
[Figure: Text → Text analyzer → Textual features → WaveNet backend → Waveform, with F0 provided by the F0 model (DAR)]
❑ WaveNet-backend only uses random sampling
[Audio samples 1-5: Natural; WaveNet-vocoder; WaveNet-backend]

Learning curve
• We trained the WaveNet backend for more than 100 epochs
[Figure: negative log-likelihood (24700 to 32700) versus epoch (1 to 43) on the training and validation sets; one panel for the WaveNet vocoder and one for the WaveNet backend]

Generation method
[Figure: generated waveform (μ-law levels) and predicted probability over waveform levels (0-1024) for sampling points 8900 to 9900]
❑ Experiments on WaveNet vocoder
[Figure: predicted probability (0.0 to 0.5) over waveform levels for sampling points 9300 to 9320 and 8900 to 8920]

Generation method
❑ One-best + random sampling
• Given generated MGC/F0
• Mix: voiced region: one-best; unvoiced region: random sampling
• Random: random sampling all the time
[Audio samples 1-3: Natural; Mix; Random]
[Figure: Natural; voiced: one-best, unvoiced: sampling; voiced: sampling, unvoiced: sampling]

Generation method
❑ One-best + random sampling: investigation
• Mix: voiced region: one-best; unvoiced region: random sampling
• Mix2: 75% voiced frames: one-best; else: random sampling
• …
• Mix4: 25% voiced frames: one-best; else: random sampling
[Audio samples 1-3: Natural; Mix (100%); Mix2 (75%); Mix3 (50%); Mix4 (20%); Random (0%)]

Generation method
❑ One-best + random sampling
[Audio samples 1-3: Natural; WaveNet backend (Mix, Random); WaveNet vocoder (Mix)]
[Figure: Natural; Mix; Random; WaveNet-backend Mix; WaveNet-backend Random]

WaveNet-vocoder + other acoustic models
[Audio samples 1-3: Natural; SAR+DAR; SGA+DAR; RGA+DAR; RNN+DAR; extended SAR+DAR]

WaveNet-vocoder: training using generated MGC
[Audio samples 1-3: Natural; WaveNet-backend; WaveNet-vocoder trained on natural MGC; WaveNet-vocoder trained on generated MGC (epoch 35, epoch 45, epoch 55)]

APPENDIX – ACOUSTIC MODELS
• General comparison
• SAR & DAR
• SAR extension
More details: http://tonywangx.github.io/pdfs/talk.pdf

General comparison
❑ Generated trajectories
[Figure: 1st and 5th MGC dimensions versus frame index (0 to 900, utterance ATR Ximera F009 AOZORAR 03372 T01) for Natural, RNN, RNN+GAN (RGA), SAR, SAR+GAN (SGA)]
[Figure: 15th and 31st MGC dimensions versus frame index for Natural, RNN, SAR]
[Figure: 15th and 31st MGC dimensions versus frame index for Natural, RNN, RNN+GAN (RGA), SAR, SAR+GAN (SGA)]

General comparison
❑ GV
[Figure: global variance. Utterance-level GV of MGC versus dimension index of MGC for Natural, RNN, RNN+GAN (RGA), SAR, SAR+GAN (SGA)]
[Figure: modulation spectrum (31st MGC dimension)]

SAR & DAR
❑ Toy SAR example
• SAR versus RMDN with a recurrent output layer [11]
• Assume $o_t \in \mathbb{R}$ and $\Sigma_t = 1$, linear activation function
[Figure: two-step networks with inputs $x_1, x_2$, hidden layer $h_1, h_2$, output layer $\mu_1, \mu_2$ and targets $o_1, o_2$. RMDN: recurrent weight $w_\mu$ from $\mu_1$ to $\mu_2$. SAR: AR weight $a$ from $o_1$ to $o_2$]

[11] H. Zen and H. Sak. Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis. In Proc. ICASSP, pages 4470–4474, 2015.
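
SAR & DAR
❑ Toy SAR example
• RMDN:
$$\mu_1 = \boldsymbol{w}^\top \boldsymbol{h}_1 + b$$
$$\mu_2 = \boldsymbol{w}^\top \boldsymbol{h}_2 + b + w_\mu \mu_1 = \tilde{\mu}_2 + w_\mu \mu_1$$
$$p(o_{1:2}) = \mathcal{N}(o_1; \mu_1, 1)\,\mathcal{N}(o_2; \tilde{\mu}_2 + w_\mu \mu_1, 1) = \frac{1}{2\pi}\exp\left(-\frac{1}{2}(\boldsymbol{o}-\boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1}(\boldsymbol{o}-\boldsymbol{\mu})\right)$$
with $\boldsymbol{o} = [o_1, o_2]^\top$, $\boldsymbol{\mu} = [\mu_1, \tilde{\mu}_2 + w_\mu \mu_1]^\top$, $\boldsymbol{\Sigma} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$
• SAR:
$$\mu_1 = \boldsymbol{w}^\top \boldsymbol{h}_1 + b$$
$$\mu_2 = \boldsymbol{w}^\top \boldsymbol{h}_2 + b$$
$$p(o_{1:2}) = \mathcal{N}(o_1; \mu_1, 1)\,\mathcal{N}(o_2; \mu_2 + a o_1, 1) = \frac{1}{2\pi}\exp\left(-\frac{1}{2}(\boldsymbol{o}-\boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1}(\boldsymbol{o}-\boldsymbol{\mu})\right)$$
with $\boldsymbol{o} = [o_1, o_2]^\top$, $\boldsymbol{\mu} = [\mu_1, \mu_2 + a \mu_1]^\top$, $\boldsymbol{\Sigma} = \begin{bmatrix} 1 & a \\ a & 1 + a^2 \end{bmatrix}$
• Dependency between $\mu_t$ or $o_t$?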
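
SAR & DAR
❑ Toy SAR example
• Feature transformation, with $\boldsymbol{o} = [o_1, o_2]^\top$:
$$\boldsymbol{c} = \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} = \begin{bmatrix} o_1 \\ o_2 - a o_1 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ -a & 1 \end{bmatrix} \begin{bmatrix} o_1 \\ o_2 \end{bmatrix} = \boldsymbol{A}\boldsymbol{o}$$
• In the transformed domain (an RMDN on $c_1, c_2$):
$$p(\boldsymbol{o}) = p(\boldsymbol{c}) = \mathcal{N}(\boldsymbol{c}; \boldsymbol{\mu}_c, \boldsymbol{\Sigma}_c), \quad \boldsymbol{\mu}_c = [\mu_1, \mu_2]^\top, \quad \boldsymbol{\Sigma}_c = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$
• In the original domain (SAR on $o_1, o_2$):
$$p(\boldsymbol{o}) = \mathcal{N}(\boldsymbol{o}; \boldsymbol{\mu}_o, \boldsymbol{\Sigma}_o), \quad \boldsymbol{\mu}_o = [\mu_1, \mu_2 + a \mu_1]^\top, \quad \boldsymbol{\Sigma}_o = \begin{bmatrix} 1 & a \\ a & 1 + a^2 \end{bmatrix}$$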
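
The toy result can be checked numerically: since $\boldsymbol{c} = \boldsymbol{A}\boldsymbol{o}$ and $\boldsymbol{\Sigma}_c = \boldsymbol{I}$, the mean and covariance in the original domain are $\boldsymbol{\mu}_o = \boldsymbol{A}^{-1}\boldsymbol{\mu}_c$ and $\boldsymbol{\Sigma}_o = \boldsymbol{A}^{-1}\boldsymbol{A}^{-\top}$. The sketch below verifies this with plain Python lists; the values of `a`, `mu1` and `mu2` are arbitrary and hypothetical.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def transpose(X):
    """Transpose a matrix given as a list of rows."""
    return [list(row) for row in zip(*X)]


a = 0.7               # hypothetical AR coefficient
mu1, mu2 = 0.3, -1.2  # hypothetical network outputs

A = [[1.0, 0.0], [-a, 1.0]]     # c = A o
A_inv = [[1.0, 0.0], [a, 1.0]]  # o = A^{-1} c

identity = matmul(A, A_inv)                 # sanity check of the inverse
sigma_o = matmul(A_inv, transpose(A_inv))   # Sigma_o = A^{-1} I A^{-T}
mu_o = [row[0] * mu1 + row[1] * mu2 for row in A_inv]  # mu_o = A^{-1} mu_c
```

`sigma_o` comes out as [[1, a], [a, 1 + a²]] and `mu_o` as [μ1, μ2 + aμ1], matching the expressions on the slide.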

SAR & DAR
❑ SAR: invertible linear feature/model transformation
• For $\boldsymbol{o}_{1:T} \in \mathbb{R}^{D \times T}$ with rows $\boldsymbol{o}_{1:T,1}, \cdots, \boldsymbol{o}_{1:T,D}$, and $\boldsymbol{c}_{1:T}$ with rows $\boldsymbol{c}_{1:T,1}, \cdots, \boldsymbol{c}_{1:T,D}$
• SAR is equivalent to:
  o Training: $\boldsymbol{o}_{1:T}$ → $\boldsymbol{A}^{(1)}, \dots, \boldsymbol{A}^{(D)}$ (one per dimension) → $\boldsymbol{c}_{1:T}$, modeled by $\prod_{t=1}^{T} p(\boldsymbol{c}_t; \mathcal{M}_t)$ given $\boldsymbol{x}_{1:T}$
  o Generation: $\boldsymbol{x}_{1:T}$ → $\prod_{t=1}^{T} p(\boldsymbol{c}_t; \mathcal{M}_t)$ → $\hat{\boldsymbol{c}}_{1:T}$ → $\boldsymbol{A}^{(1)-1}, \dots, \boldsymbol{A}^{(D)-1}$ → $\hat{\boldsymbol{o}}_{1:T}$
• Filter view: the matrices correspond to filters 1, …, D, with $A_1(z), \dots, A_D(z)$ applied in training and $1/A_1(z), \dots, 1/A_D(z)$ applied in generation

SAR & DAR
❑ SAR: invertible linear feature/model transformation
• Only due to $\{A_d(z), 1/A_d(z)\}$?
• Due to […], less mismatch between […] and RMDN (symbol shown on the slide: $\boldsymbol{c}_{1:T}$)
[Figure: $\boldsymbol{o}_{1:T}$ → filters $A_1(z), \dots, A_D(z)$ → $\boldsymbol{c}_{1:T}$; $\hat{\boldsymbol{c}}_{1:T}$ → filters $1/A_1(z), \dots, 1/A_D(z)$ → $\hat{\boldsymbol{o}}_{1:T}$]
[Figure: magnitude (dB, -10 to 5) versus frequency bin (π/1024) for $H_1(z)$, $A_1(z)$ and $1/A_1(z)$]
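
SAR extension: normalizing flow
❑ Basic idea
• Transform: $\boldsymbol{c}_{1:T} = f(\boldsymbol{o}_{1:T})$
• De-transform: $\hat{\boldsymbol{o}}_{1:T} = f^{-1}(\hat{\boldsymbol{c}}_{1:T})$
• Likelihood, with $\boldsymbol{c}_{1:T}$ modeled by $\prod_{t=1}^{T} p(\boldsymbol{c}_t; \mathcal{M}_t)$:
$$p_o(\boldsymbol{o}_{1:T} \mid \boldsymbol{x}_{1:T}) = p_c(\boldsymbol{c}_{1:T} \mid \boldsymbol{x}_{1:T}) \left| \det \frac{\partial \boldsymbol{c}_{1:T}}{\partial \boldsymbol{o}_{1:T}} \right|$$
• Jacobian matrix must be simple
• $f(\cdot)$ must be invertible

[13] D. Rezende and S. Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, pages 1530–1538, 2015.
[14] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling. Improved variational inference with inverse autoregressive flow. In Proc. NIPS, pages 4743–4751, 2016.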
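
SAR extension: normalizing flow
❑ Basic idea
$$p_o(\boldsymbol{o}_{1:T} \mid \boldsymbol{x}_{1:T}) = p_c(\boldsymbol{c}_{1:T} \mid \boldsymbol{x}_{1:T}) \left| \det \frac{\partial \boldsymbol{c}_{1:T}}{\partial \boldsymbol{o}_{1:T}} \right|$$
• Simple for SAR: $\left| \det \frac{\partial \boldsymbol{c}_{1:T}}{\partial \boldsymbol{o}_{1:T}} \right| = 1$
• SAR
  o Transform: $\boldsymbol{c}_t = \boldsymbol{o}_t - \sum_{k=1}^{K} \boldsymbol{a}_k \odot \boldsymbol{o}_{t-k}$
  o De-transform: $\hat{\boldsymbol{o}}_t = \boldsymbol{c}_t + \sum_{k=1}^{K} \boldsymbol{a}_k \odot \hat{\boldsymbol{o}}_{t-k}$
• AR flow, with $\boldsymbol{\mu}_t = \mathrm{RNN}(\boldsymbol{o}_{1:t-1})$
  o Transform: $\boldsymbol{c}_t = \boldsymbol{o}_t - \sum_{k=1}^{K} f^{(k)}(\boldsymbol{o}_{1:t-k}) \odot \boldsymbol{o}_{t-k}$
  o De-transform: $\hat{\boldsymbol{o}}_t = \boldsymbol{c}_t + \sum_{k=1}^{K} f^{(k)}(\hat{\boldsymbol{o}}_{1:t-k}) \odot \hat{\boldsymbol{o}}_{t-k}$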
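
The unit-determinant property of the SAR transform follows from the shape of its Jacobian: $\partial c_t / \partial o_t = 1$, $\partial c_t / \partial o_{t-k} = -a_k$, and $c_t$ never depends on future frames, so the matrix is lower-triangular with ones on the diagonal. The sketch below builds this Jacobian for a scalar feature sequence and checks the determinant with a generic routine; the sequence length and coefficients are hypothetical.

```python
def sar_jacobian(num_frames, a):
    """Jacobian d c / d o of c[t] = o[t] - sum_k a[k-1] * o[t-k] for a scalar sequence."""
    J = [[0.0] * num_frames for _ in range(num_frames)]
    for t in range(num_frames):
        J[t][t] = 1.0                  # d c_t / d o_t
        for k in range(1, len(a) + 1):
            if t - k >= 0:
                J[t][t - k] = -a[k - 1]  # d c_t / d o_{t-k}
    return J


def determinant(M):
    """Determinant by Gaussian elimination with partial pivoting."""
    M = [row[:] for row in M]
    n = len(M)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if M[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            M[col], M[pivot] = M[pivot], M[col]
            det = -det  # a row swap flips the sign
        det *= M[col][col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n):
                M[r][c] -= factor * M[col][c]
    return det


J = sar_jacobian(6, [0.5, -0.2])  # T = 6 frames, K = 2
det_J = determinant(J)
```

Because the determinant is 1 regardless of the coefficients, the likelihood in the original domain equals the likelihood in the transformed domain, which is what makes SAR volume-preserving.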

SAR extension
❑ SAR can be extended
More details: http://tonywangx.github.io/pdfs/talk.pdf

DAR
❑ Same autoregressive principle
❑ But non-invertible and nonlinear
[Figure: MOS score (3.00 to 4.25) for NAT, RNN, RMDN, SAR, DAR]
[Figure: F0 GV. Utterance-level GV of F0 (Hz, 20 to 120) for NAT, DAR, SAR, RMDN, RNN]

p-value
        NAT      DAR       SAR       RMDN      RNN
NAT     -        <1e-30    <1e-30    <1e-30    <1e-30
DAR     <1e-30   -         1.6e-28   6.3e-19   2.4e-30
SAR     <1e-30   1.6e-28   -         0.015     0.949
RMDN    <1e-30   6.3e-19   0.015     -         0.014
RNN     <1e-30   2.4e-30   0.949     0.014     -

[7] X. Wang, S. Takaki, and J. Yamagishi. Autoregressive neural F0 model for statistical parametric speech synthesis. IEEE Transactions on Audio, Speech and Language Processing. (Accepted)