
TIME-DOMAIN NEURAL NETWORK APPROACH FOR SPEECH BANDWIDTH EXTENSION

Xiang Hao1,2, Chenglin Xu2, Nana Hou2, Lei Xie1∗, Eng Siong Chng2, Haizhou Li3,4

1 School of Computer Science, Northwestern Polytechnical University, Xi'an, China
2 School of Computer Science and Engineering, Nanyang Technological University, Singapore

3 Department of Electrical and Computer Engineering, National University of Singapore, Singapore
4 Machine Listening Lab, University of Bremen, Germany

ABSTRACT

In this paper, we study the time-domain neural network approach for speech bandwidth extension. We propose a network architecture, named multi-scale fusion neural network (MfNet), that gradually restores the low-frequency signal and predicts the high-frequency signal through the exchange of information across different scale representations. We propose a training scheme to optimize the network with a combination of perceptual loss and time-domain adversarial loss. Experiments show the proposed multi-scale fusion network consistently outperforms the competing methods in terms of perceptual evaluation of speech quality (PESQ), signal to distortion ratio (SDR), signal to noise ratio (SNR), log-spectral distance (LSD) and word error rate (WER). More promisingly, the multi-scale fusion network requires only 10% of the parameters of the time-domain reference baseline.

Index Terms— speech bandwidth extension, multi-scale fusion, neural networks, deep learning

1. INTRODUCTION

Speech bandwidth extension, which expands narrowband signal to wideband signal, has been widely studied for many years [1]. This technique plays an important role in many practical scenarios, e.g., narrowband signal enhancement [2] and automatic speech recognition (ASR) [3]. Other applications include speech audio compression and text-to-speech synthesis [4].

There have been a large variety of methods for speech bandwidth extension. In general, these approaches can be classified into two categories, namely rule-based and statistical approaches. The rule-based methods generate the high-frequency spectrum based on acoustic knowledge of the speech signal [5], while the statistical approaches assume that there exists a non-linear relationship between the spectral features of the low-frequency and high-frequency components. The statistical approaches try to model such a relationship by learning a mapping function between the narrowband signal and the wideband signal.

The statistical methods can be implemented either in the frequency domain or in the time domain. One of the frequency domain methods is to predict the spectral envelope of the high-frequency part. Linear predictive coding (LPC) [6], Gaussian mixture models [7, 8], hidden Markov models [9, 10], and neural networks [11, 12, 13, 14, 15, 16] have been used to estimate the spectral envelope. These methods, however, face a common problem: the excitation, which defines the spectral fine structure of the signal, has to be estimated.

∗Lei Xie is the corresponding author, [email protected].

With the advent of deep learning (DL), many approaches have studied how to estimate the high-frequency spectrum directly [17, 18, 19, 20, 21, 22]. Such techniques require the phases of the high-frequency components, which are usually unknown, to reconstruct the signal. One of the most recent advances in speech signal processing is the ability to directly model the raw signal in the time domain using neural networks [23, 24, 25], which avoids the phase estimation problem. The time-domain neural network approach has opened up a new direction for speech bandwidth extension [26, 27, 4, 28, 29]. The time-frequency network (TFNet) [30] represents one of the successful implementations.

In this paper, we investigate a multi-scale fusion network (MfNet) to improve the performance of speech bandwidth extension in the time domain. MfNet is inspired by the idea of time-domain speech bandwidth extension [26] as well as work on image super-resolution [31]. Multi-scale learning achieves better performance by capturing multi-resolution information. For example, in source separation and audio classification, learning multi-scale representations with different convolutional filter sizes [32] or a cascade of wavelet filter banks [33] yields performance improvements. In this work, we study how MfNet performs speech bandwidth extension by aggregating speech information across different scale representations, analogous to the use of multi-resolution information in image super-resolution.

Note that the training objective plays a key role in neural network performance. In the field of speech bandwidth extension, the mean squared error (MSE) is most commonly used. In addition, a perceptual loss is adopted in [28] and adversarial losses are used in [15, 16, 21]. In these prior works, the input data and the adversarial loss are computed in the frequency domain. In this paper, we study networks that take time-domain speech as input and measure the training objective in the time domain as well. We study the effect of different loss functions, including a perceptual loss and an adversarial loss.

2. MULTI-SCALE FUSION NEURAL NETWORK

2.1. Architecture of MfNet

Our aim is to reconstruct a 16kHz wideband signal from an 8kHz narrowband signal. Suppose we have a 16kHz wideband signal $\mathbf{x} = [x_1, \ldots, x_t, \ldots, x_T]$. As in [26], by applying a Chebyshev low-pass filter, we obtain an 8kHz narrowband signal $\hat{\mathbf{x}} = [\hat{x}_1, \ldots, \hat{x}_t, \ldots, \hat{x}_{T/2}]$. Then, we use bicubic interpolation to generate a "fake" 16kHz wideband signal $\bar{\mathbf{x}} = [\bar{x}_1, \ldots, \bar{x}_t, \ldots, \bar{x}_T]$. MfNet takes the "fake" 16kHz signal as input and is trained to produce a 16kHz signal using the original 16kHz wideband signal as the target in supervised training.
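For illustration, the following is a minimal sketch (not the authors' code) of how such an input could be prepared with SciPy. The Chebyshev filter order implied by `scipy.signal.decimate` and the use of 1-D cubic interpolation in place of "bicubic" interpolation are our assumptions.

```python
# Sketch of the input pipeline: 16 kHz wideband -> 8 kHz narrowband
# (Chebyshev low-pass + decimation) -> interpolation back to 16 kHz
# as the "fake" wideband input of MfNet.
import numpy as np
from scipy.signal import decimate
from scipy.interpolate import interp1d

def make_fake_wideband(x_wb: np.ndarray) -> np.ndarray:
    """x_wb: 16 kHz waveform of length T (assumed even)."""
    T = len(x_wb)
    # decimate() applies an order-8 Chebyshev type I low-pass before
    # keeping every 2nd sample, giving the 8 kHz narrowband signal.
    x_nb = decimate(x_wb, 2, ftype="iir")
    # Cubic interpolation back to T samples (a stand-in for the paper's
    # "bicubic" upsampling of the 1-D signal).
    t_nb = np.arange(T // 2) * 2.0
    t_wb = np.arange(T, dtype=float)
    f = interp1d(t_nb, x_nb, kind="cubic", fill_value="extrapolate")
    return f(t_wb).astype(np.float32)
```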


[Figure 1: block diagram of MfNet. The narrowband signal passes through 1D convolution and downscaling layers that produce feature maps at several time resolutions; these feed a sequence of multi-scale fusion blocks and a final concatenation block that outputs the wideband signal.]

Fig. 1. Schematic diagram of the proposed multi-scale fusion model for speech bandwidth extension. '⊕' represents an element-wise add operation.

The architecture of MfNet is illustrated in Fig. 1. We use $C_l^r$ to represent a feature map, where $l$ indicates that the feature map is the output of the $l$-th layer and $r$ indicates its time resolution, or scale. In our approach, the convolution does not change the feature size, while downscaling is achieved by a 1-dimensional convolution of stride 2 that halves the time resolution. In this way, the neural network consists of feature maps at different scales. The multi-scale fusion block, which is used to aggregate information among the different scale representations, is composed of convolution, downscaling and upscaling operations. For the upscaling operation, we use a convolution to smooth the input feature and then perform bilinear interpolation in the time direction by a factor of two. Suppose the feature maps of the $l$-th layer are $\{C_l^1, \ldots, C_l^i, \ldots, C_l^s\}$. Through the multi-scale fusion block, the feature maps of the $(l+1)$-th layer are $\{C_{l+1}^1, \ldots, C_{l+1}^i, \ldots, C_{l+1}^s\}$, where

$$C_{l+1}^r = \frac{1}{s} \sum_{i=1}^{s} S(C_l^i).$$

Here $S(\cdot)$ is one of convolution, downscaling or upscaling, and is used to resize feature maps.

To better explain the multi-scale fusion block, we show an example in Fig. 2(a), which uses the feature maps $\{C_4^1, C_4^2, C_4^3\}$ to calculate the feature map $C_5^2$. The concatenation block, whose detailed structure is shown in Fig. 2(b), is another way to aggregate information among the different scale representations. We first use the upscaling operation to resize the different scale representations, then concatenate these high-level feature maps along the channel dimension. Finally, we aggregate the information through a convolutional layer.
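The following PyTorch sketch illustrates our reading of one fusion step, $C_{l+1}^r = \frac{1}{s}\sum_i S(C_l^i)$. It is a simplified illustration, not the released implementation: the kernel size, the shared channel count across scales, and the reuse of one stride-2 convolution for repeated downscaling are all assumptions.

```python
# Minimal sketch of a multi-scale fusion step: each scale of layer l+1 is
# the average of all scales of layer l after resizing them to that scale.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Resize1d(nn.Module):
    """S(.) in the paper: convolution, downscaling or upscaling."""
    def __init__(self, channels: int, kernel_size: int = 9):
        super().__init__()
        pad = kernel_size // 2
        self.same = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.down = nn.Conv1d(channels, channels, kernel_size, stride=2, padding=pad)
        self.up_smooth = nn.Conv1d(channels, channels, kernel_size, padding=pad)

    def forward(self, c: torch.Tensor, target_len: int) -> torch.Tensor:
        if c.shape[-1] == target_len:        # same scale: plain convolution
            return self.same(c)
        if c.shape[-1] > target_len:         # downscaling: stride-2 convolution(s)
            out = c
            while out.shape[-1] > target_len:
                out = self.down(out)
            return out
        # Upscaling: smooth with a convolution, then "bilinear" (linear in 1-D)
        out = self.up_smooth(c)
        return F.interpolate(out, size=target_len, mode="linear", align_corners=False)

def fuse(features, resizers):
    """features: list of (B, C, T_r) tensors, one per scale (same C assumed);
    resizers: one Resize1d per source scale. Returns the fused feature list."""
    fused = []
    for c_r in features:
        target_len = c_r.shape[-1]
        resized = [resizers[i](c_i, target_len) for i, c_i in enumerate(features)]
        fused.append(torch.stack(resized, dim=0).mean(dim=0))  # (1/s) * sum_i S(C_l^i)
    return fused
```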

2.2. Training Objective

As defined in Section 2.1, the input signal is $\bar{\mathbf{x}}$ and the original wideband signal is $\mathbf{x}$. Furthermore, we denote the predicted wideband signal as $\mathbf{y}$, so $\mathbf{y} = G(\bar{\mathbf{x}})$, where $G(\cdot)$ is the mapping function represented by the neural network. We explore the use of several loss functions, namely a time-domain loss, a perceptually-motivated loss, an adversarial loss (GAN loss), and a composite loss which combines the perceptually-motivated loss and the GAN loss.

[Figure 2: (a) the feature maps $C_4^1$, $C_4^2$ and $C_4^3$ are downscaled, convolved or upscaled and summed to produce $C_5^2$; (b) upscaled feature maps are concatenated and passed through a 1D convolution in the concatenation block.]

Fig. 2. (a) An example of the process in which the multi-scale fusion block aggregates multi-scale information; (b) the detailed structure of the concatenation block.

2.2.1. time-domain loss

The time-domain loss between the predicted wideband signal and the original wideband signal is formulated as:

$$L_t(\mathbf{x}, \mathbf{y}) = \frac{1}{M} \sum_{m=1}^{M} \|\mathbf{x} - \mathbf{y}\|_2, \quad (1)$$

where $M$ is the batch size.
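Read literally, Eq. (1) averages per-utterance L2 distances over a batch of $M$ waveforms. A minimal PyTorch sketch of that reading (the batch indexing is our assumption, since it is left implicit in the formula):

```python
# Hedged sketch of Eq. (1): x and y are (M, T) batches of reference and
# predicted waveforms; the loss is the batch mean of per-utterance L2 norms.
import torch

def time_domain_loss(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    return torch.norm(x - y, p=2, dim=-1).mean()
```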

2.2.2. perceptually-motivated loss

The intuition behind the perceptually-motivated loss is that mel-scale spectrograms approximate the auditory information perceived by humans in psychoacoustic experiments. The perceptual loss is defined as the L1 loss between the mel-spectrograms of the predicted wideband signal and the original wideband signal:

$$L_f(\mathbf{x}, \mathbf{y}) = \|\mathrm{Mel}(\mathbf{x}) - \mathrm{Mel}(\mathbf{y})\|_1, \quad (2)$$

where $\mathrm{Mel}(\cdot)$ is the mel-spectrum transformation. The spectrogram, derived from the STFT, is transformed into a mel-spectrogram using triangular filters ranging from 3.8kHz to 8kHz. We calculate the loss over this frequency range to emphasize our objective, which is to recover the high-frequency part of the signal. In order to keep the lower frequencies intact, the final perceptually-motivated loss is defined as:

$$L_p(\mathbf{x}, \mathbf{y}) = L_t(\mathbf{x}, \mathbf{y}) + \lambda_f L_f(\mathbf{x}, \mathbf{y}), \quad (3)$$

where $\lambda_f$ is the weighting parameter for the perceptual loss and its value is 0.001.
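One plausible realization of Eq. (2) (our sketch, not the authors' code) uses `torchaudio.transforms.MelSpectrogram` with its triangular filters restricted to the 3.8–8kHz band via `f_min`/`f_max`; the STFT size, hop length and number of mel bands below are assumptions, not values given in the paper.

```python
# Sketch of the high-band mel L1 loss in Eq. (2).
import torch
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=512, hop_length=128, n_mels=64,
    f_min=3800.0, f_max=8000.0,   # restrict triangular filters to the high band
)

def mel_l1_loss(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """x, y: (M, T) reference and predicted waveforms at 16 kHz."""
    return torch.mean(torch.abs(mel(x) - mel(y)))
```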

2.2.3. adversarial loss

In a Generative Adversarial Network (GAN) [34], the adversarial model plays a two-player minimax game between a generator and a discriminator. The generator, $G$, captures the data distribution and maximizes the similarity between real and generated data, while the discriminator, $D$, estimates the probability that a sample comes from natural speech as opposed to synthetic speech generated by $G$. We adopt our proposed MfNet as the generator $G$, and the architecture of $D$ is the same as in [35]. The value function can be defined as follows:

$$V(D, G) = \mathbb{E}_{\mathbf{x} \sim P_{\mathbf{x}}}[\log D(\mathbf{x})] + \mathbb{E}_{\bar{\mathbf{x}} \sim P_{\bar{\mathbf{x}}}}[\log(1 - D(G(\bar{\mathbf{x}})))], \quad (4)$$

where $G$ is trained to minimize this value function and $D$ is trained to maximize it. Optimizing Equation (4) amounts to minimizing the Jensen-Shannon divergence between the real data distribution and the distribution of generated data.


However, this model is notoriously difficult to train. [36] proposes minimizing the Wasserstein-1 distance instead, so the value function becomes:

$$V_{\mathrm{WGAN}}(D, G) = \mathbb{E}_{\mathbf{x} \sim P_{\mathbf{x}}}[D(\mathbf{x})] - \mathbb{E}_{\bar{\mathbf{x}} \sim P_{\bar{\mathbf{x}}}}[D(G(\bar{\mathbf{x}}))]. \quad (5)$$

In Equation (5), $D$ must be 1-Lipschitz. Weight clipping [36] and the gradient penalty [37] are two methods to enforce this constraint; in our experiments, the gradient penalty is used. Finally, the GAN loss is given as follows:

$$L_G(\mathbf{x}, \mathbf{y}) = L_t(\mathbf{x}, \mathbf{y}) - \lambda_a D(\mathbf{y}), \quad (6)$$

$$L_D(\mathbf{x}, \mathbf{y}) = D(\mathbf{y}) - D(\mathbf{x}) + \lambda \left(\|\nabla_{\tilde{\mathbf{x}}} D(\tilde{\mathbf{x}})\|_2 - 1\right)^2, \quad \text{s.t.} \ \ \tilde{\mathbf{x}} = \mathbf{x} + \alpha(\mathbf{y} - \mathbf{x}), \quad (7)$$

where $\alpha$ is a random number sampled from the uniform distribution $U[0, 1]$, and $\lambda$ and $\lambda_a$ are set to 10 and 0.001, respectively.
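A minimal PyTorch sketch (our own, not the paper's code) of the discriminator objective in Eq. (7), assuming a standard WGAN-GP critic that maps a batch of raw waveforms to scalar scores:

```python
# Sketch of Eq. (7): interpolate between real and generated waveforms,
# then penalize critic gradients whose norm deviates from 1.
import torch

def discriminator_loss(critic, x_real, y_fake, lam=10.0):
    """x_real, y_fake: (M, T) waveforms; critic maps (M, T) -> (M,) scores."""
    alpha = torch.rand(x_real.size(0), 1, device=x_real.device)
    x_tilde = (x_real + alpha * (y_fake - x_real)).detach().requires_grad_(True)
    grads = torch.autograd.grad(
        outputs=critic(x_tilde).sum(), inputs=x_tilde, create_graph=True
    )[0]
    penalty = ((grads.norm(2, dim=-1) - 1.0) ** 2).mean()
    return critic(y_fake).mean() - critic(x_real).mean() + lam * penalty
```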

2.2.4. composite loss

The composite loss is a combination of the perceptual loss and the adversarial loss. The discriminator loss $L_D$ is the same as in our adversarial loss, and the generator loss is formulated as:

$$L_G(\mathbf{x}, \mathbf{y}) = L_t(\mathbf{x}, \mathbf{y}) + \lambda_f L_f(\mathbf{x}, \mathbf{y}) - \lambda_a D(\mathbf{y}), \quad (8)$$

where $\lambda_f$ and $\lambda_a$ are both set to 0.001.

3. EXPERIMENTS

3.1. Datasets

To evaluate the performance of the proposed method, the Valentini-Botinhao corpus [38] of 86 different speakers is adopted in our experiments. A total of 84 speakers are included in the training set and the 2 remaining speakers in the test set. In our experiments, we split the original wave files into segments with 50% overlap, where every segment consists of 16,384 samples. We randomly select 133,096 and 14,789 segments from the training set for training and for monitoring convergence, respectively. The test set, consisting of 824 sentences, is used to evaluate the performance of the different approaches.

3.2. Comparative Study

We implemented several systems in both the frequency domain and the time domain as benchmarking references for the proposed MfNet methods. They are described in detail below.

• Spline: simple bicubic interpolation.

• LSM: frequency domain method [17]. The input of this method is the log-spectrum of the narrowband signal and the output is the high-frequency log-spectrum of the wideband signal. This model has 3 hidden layers, and every hidden layer has 2,048 hidden nodes.

• DRCNN: time domain method [26]. This approach is an encoder-decoder architecture which consists of a series of upsampling and downsampling blocks. Notably, each convolutional layer has filters of the same size. The re-sampling can be viewed as extracting features with filters at different time resolutions, which differs from re-scaling, which keeps the time resolution the same but shifts the filters with different strides. Upsampling is performed by subpixel shuffling. In the downsampling stage, the numbers of channels are [128, 256, 512, 512] with corresponding filter sizes [65, 33, 17, 9]. In the upsampling stage, the numbers of channels are [1024, 1024, 512, 256] with corresponding filter sizes [9, 17, 33, 65].

• MfNet: the proposed MfNet architecture trained with the basic time-domain loss.

• MfNet+P: the proposed MfNet architecture trained with the perceptually-motivated loss.

• MfNet+A: the proposed MfNet architecture trained with the adversarial loss.

• MfNet+C: the proposed MfNet architecture trained with the composite loss, which combines the perceptual loss and the GAN loss.

3.3. Evaluation Metrics

We use two groups of evaluation metrics. One is the group of signal-based indicators, which includes the perceptual evaluation of speech quality (PESQ), signal to distortion ratio (SDR), signal to noise ratio (SNR) and log-spectral distance (LSD); the other is the word error rate (WER) of automatic speech recognition (ASR). The ASR system is an end-to-end model trained on LibriSpeech data [39]. For a reference speech utterance $\mathbf{x} = [x_1, \ldots, x_t, \ldots, x_T]$ and the corresponding predicted speech $\mathbf{y} = [y_1, \ldots, y_t, \ldots, y_T]$, the SNR is calculated according to the following formula:

$$\mathrm{SNR} = 10 \log_{10} \frac{\sum_{t=1}^{T} x_t^2}{\sum_{t=1}^{T} (x_t - y_t)^2}. \quad (9)$$

The LSD is defined as follows:

$$\mathrm{LSD} = \frac{1}{L} \sum_{l=1}^{L} \sqrt{\frac{1}{K} \sum_{k=1}^{K} \left( X(l, k) - Y(l, k) \right)^2}, \quad (10)$$

where $X(l, k)$ and $Y(l, k)$ are the log-spectral power magnitudes of $\mathbf{x}$ and $\mathbf{y}$, respectively, and $k$ and $l$ index frequency and frame. $K$ is the number of frequency bins in a frame and $L$ is the total number of frames of a speech utterance.
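A small NumPy sketch of these two metrics; the STFT settings and the choice of base-10 log power for Eq. (10) are assumptions on our part:

```python
# Sketch of Eqs. (9) and (10): waveform-level SNR in dB and the
# log-spectral distance computed over STFT frames.
import numpy as np
from scipy.signal import stft

def snr_db(x: np.ndarray, y: np.ndarray) -> float:
    """Eq. (9): x is the reference waveform, y the prediction."""
    return 10.0 * np.log10(np.sum(x ** 2) / np.sum((x - y) ** 2))

def lsd(x: np.ndarray, y: np.ndarray, n_fft: int = 512) -> float:
    """Eq. (10): mean over frames of the RMS log-spectral difference."""
    _, _, X = stft(x, nperseg=n_fft)          # (K, L) complex spectrogram
    _, _, Y = stft(y, nperseg=n_fft)
    eps = 1e-10
    log_x = np.log10(np.abs(X) ** 2 + eps)    # log power spectra
    log_y = np.log10(np.abs(Y) ** 2 + eps)
    return float(np.mean(np.sqrt(np.mean((log_x - log_y) ** 2, axis=0))))
```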

3.4. Experiment Results

3.4.1. signal-based evaluation results

The evaluation results of the different models are summarized in Table 1. In general, the proposed MfNet methods consistently outperform the baseline systems. More specifically, we first focus on MfNet whose training objective is the time-domain loss defined in Equation (1). We observe that MfNet consistently outperforms the Spline, LSM and DRCNN baselines on all signal-based evaluation metrics. We further observe that the MfNet methods outperform the time-domain DRCNN baseline in both the low-frequency and high-frequency parts; the LSD on the high- and low-frequency parts is also summarized in Table 1. The results suggest that MfNet can restore the low-frequency part well and learn better representations to predict the high-frequency part. The results also validate our claim that multi-scale information, aggregated from different bandwidth representations through the multi-scale fusion unit, helps to achieve better bandwidth extension performance.

https://drive.google.com/open?id=1BtQvAnsFvVi-dp qsaFP7n4A 5cwnlR6


Table 1. The SNR, SDR, PESQ and LSD of the different methods evaluated on the test set. LSD Full, LSD LF and LSD HF show the LSD computed over the whole spectrogram, the low-frequency range and the high-frequency range, respectively.

Method       #Params   SNR (dB)   SDR     PESQ   LSD Full   LSD LF   LSD HF
Spline       –         21.88      26.09   3.75   2.14       1.34     2.67
LSM [17]     13.38M    21.26      25.57   2.76   1.80       1.10     2.26
DRCNN [26]   56.41M    23.95      27.76   3.55   1.84       0.47     2.60
MfNet        5.96M     24.50      28.08   3.76   1.61       0.19     2.26
MfNet+P      5.96M     24.55      28.20   3.88   1.40       0.21     1.97
MfNet+A      5.96M     24.77      28.48   3.80   1.72       0.18     2.42
MfNet+C      5.96M     24.70      28.47   3.82   1.46       0.21     2.05

We further investigate the improvement of the proposed MfNet with different training objectives. We first add the perceptually-motivated L1 loss over the high-frequency bands of the mel-spectrogram, computed between the estimated signal and the original wideband signal, to the time-domain loss with a weight, as defined in Equation (3). We name MfNet with the perceptually-motivated loss "MfNet+P". MfNet+P achieves 12.8% and 3.2% relative improvements over MfNet in terms of the LSD on the high-frequency bands and PESQ, respectively. This verifies our motivation that adding perceptual information to the training objective improves the perceptual quality of the estimated signal.

We further study whether a GAN scheme improves the perceptual quality and intelligibility of the estimated signal, where the weighted time-domain GAN loss is defined by Equations (6) and (7). We name this system "MfNet+A". We observe that MfNet+A outperforms MfNet in terms of SNR, SDR, PESQ and LSD LF. However, the performance of MfNet+A is worse than that of MfNet+P in terms of PESQ and LSD. The results suggest that the perceptual loss is particularly effective at improving the perceptual quality of the estimated signal.

Finally, we interpolate between the perceptual loss and the adversarial loss, as defined in Equation (8), and name this system "MfNet+C". We observe that MfNet+C balances the importance of the perceptual loss and the adversarial loss, and achieves performance between MfNet+P and MfNet+A in terms of SNR, SDR, PESQ and LSD.

To facilitate the comparison, we also visualize the spectrograms of the different approaches in Figure 3. In area A, it is obvious that our proposed methods do better at restoring high-frequency information. We also find that none of the methods performs well in area B, probably because this part corresponds to a low-energy consonant whose low-frequency content, in particular, is not informative. We will explore phonotactic information to address this problem in the future.

3.4.2. speech recognition results

We now conduct experiments using the ASR system on the bandwidth-extended speech signals. We also use this ASR system to decode the original 16kHz wideband signal $\mathbf{x}$. The word error rates are reported in Table 2. We observe that MfNet has a clear advantage over the Spline, LSM and DRCNN methods. We also observe that training MfNet with the different loss functions reduces the WER, with MfNet+P achieving the best result. We believe that re-training the ASR system with bandwidth-extended speech would lead to further performance improvements.

[Figure 3: spectrograms of the original signal and of the Spline, LSM, DRCNN, MfNet, MfNet+P, MfNet+A and MfNet+C outputs, with two regions of interest marked A and B.]

Fig. 3. Visualization of the spectrograms of different systems.

Table 2. WER (%) of different systems on the test set. "Real" refers to the ASR result on the original 16kHz speech signals.

System    Real   Spline   LSM [17]   DRCNN [26]   MfNet   MfNet+P   MfNet+A   MfNet+C
WER (%)   7.2    8.7      9.3        8.5          8.7     8.2       8.5       8.3

3.4.3. Comparison of network complexity

In general, a network with a larger number of parameters offers better performance. We therefore compare the number of parameters of the different networks; the statistics are summarized in Table 1. Notably, our proposed MfNet requires only about 10% of the parameters of the time-domain DRCNN baseline. The compact structure of the proposed MfNet is an obvious advantage, especially since it also provides better performance. These results further demonstrate that the proposed MfNet, which learns representations at different scales, has an advantage in estimating the wideband signal.

4. CONCLUSION

In this work, we first show the promising ability of the multi-scale fusion neural network for speech bandwidth extension. Based on this network structure, we explore the effect of different loss functions and propose a composite loss. Compared with a simple interpolation method, a frequency domain method and a time domain method, the proposed approaches consistently achieve better performance in terms of SNR, SDR, PESQ, LSD and WER. In addition, an obvious advantage is that MfNet needs fewer parameters than the time-domain baseline to achieve better performance. In the future, we will explore other techniques which have proved useful in audio generation tasks, such as μ-law compression, dilated convolutions and temporal convolutional modules, and further combine the multi-scale neural network with these techniques.

5. REFERENCES

[1] Yan Ming Cheng, Douglas O'Shaughnessy, and Paul Mermelstein, "Statistical recovery of wideband speech from narrowband speech," IEEE Transactions on Speech and Audio Processing, 1994.

[2] Bernd Iser and Gerhard Schmidt, "Bandwidth extension of telephony speech," in Speech and Audio Processing in Adverse Environments, 2008.

[3] Kehuang Li, Zhen Huang, Yong Xu, and Chin-Hui Lee, "DNN-based speech bandwidth expansion and its application to adding high-frequency missing features for automatic speech recognition of narrowband speech," in INTERSPEECH, 2015.

[4] Mu Wang, Zhiyong Wu, Shiyin Kang, Xixin Wu, Jia Jia, Dan Su, Dong Yu, and Helen Meng, "Speech super-resolution using parallel WaveNet," in ISCSLP, 2018.

[5] Martin Dietz, Lars Liljeryd, Kristofer Kjorling, and Oliver Kunz, "Spectral band replication, a novel approach in audio coding," in Audio Engineering Society Convention, 2002.

[6] Frank K. Soong and B.-H. Juang, "Optimal quantization of LSP parameters," IEEE Transactions on Speech and Audio Processing, 1993.

[7] Kun-Youl Park and Hyung Soon Kim, "Narrowband to wideband conversion of speech using GMM based transformation," in ICASSP, 2000.

[8] Hyunson Seo, Hong-Goo Kang, and Frank Soong, "A maximum a posterior-based reconstruction approach to speech bandwidth expansion in noise," in ICASSP, 2014.

[9] Peter Jax and Peter Vary, "Artificial bandwidth extension of speech signals using MMSE estimation based on a hidden Markov model," in ICASSP, 2003.

[10] Geun-Bae Song and Pavel Martynovich, "A study of HMM-based bandwidth extension of speech signals," Signal Processing, 2009.

[11] Juho Kontio, Laura Laaksonen, and Paavo Alku, "Neural network-based artificial bandwidth expansion of speech," IEEE Transactions on Audio, Speech and Language Processing, 2007.

[12] Yingxue Wang, Shenghui Zhao, Wenbo Liu, Ming Li, and Jingming Kuang, "Speech bandwidth expansion based on deep neural networks," in INTERSPEECH, 2015.

[13] Johannes Abel and Tim Fingscheidt, "Artificial speech bandwidth extension using deep neural networks for wideband spectral envelope estimation," IEEE/ACM Transactions on Audio, Speech and Language Processing, 2017.

[14] Konstantin Schmidt and Bernd Edler, "Blind bandwidth extension based on convolutional and recurrent deep neural networks," in ICASSP, 2018.

[15] Sen Li, Stephane Villette, Pravin Ramadas, and Daniel J. Sinder, "Speech bandwidth extension using generative adversarial networks," in ICASSP, 2018.

[16] Jonas Sautter, Friedrich Faubel, Markus Buck, and Gerhard Schmidt, "Artificial bandwidth extension using a conditional generative adversarial network with discriminative training," in ICASSP, 2019.

[17] Kehuang Li and Chin-Hui Lee, "A deep neural network approach to speech bandwidth expansion," in ICASSP, 2015.

[18] Liu Bin, Tao Jianhua, Wen Zhengqi, Li Ya, Danish Bukhari, et al., "A novel method of artificial bandwidth extension using deep architecture," in INTERSPEECH, 2015.

[19] Yu Gu, Zhen-Hua Ling, and Li-Rong Dai, "Speech bandwidth extension using bottleneck features and deep recurrent neural networks," in INTERSPEECH, 2016.

[20] Johannes Abel, Maximilian Strake, and Tim Fingscheidt, "A simple cepstral domain DNN approach to artificial speech bandwidth extension," in ICASSP, 2018.

[21] Sefik Emre Eskimez and Kazuhito Koishida, "Speech super resolution generative adversarial network," in ICASSP, 2019.

[22] Pramod Bachhav, Massimiliano Todisco, and Nicholas Evans, "Latent representation learning for artificial bandwidth extension using a conditional variational auto-encoder," in ICASSP, 2019.

[23] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu, "WaveNet: A generative model for raw audio," in SSW, 2016.

[24] Santiago Pascual, Antonio Bonafonte, and Joan Serra, "SEGAN: Speech enhancement generative adversarial network," in INTERSPEECH, 2017.

[25] Yi Luo and Nima Mesgarani, "TasNet: Surpassing ideal time-frequency masking for speech separation," arXiv preprint arXiv:1809.07454, 2018.

[26] Volodymyr Kuleshov, S. Zayd Enam, and Stefano Ermon, "Audio super resolution using neural networks," in ICLR, 2017.

[27] Yu Gu and Zhen-Hua Ling, "Waveform modeling using stacked dilated convolutional neural networks for speech bandwidth extension," in INTERSPEECH, 2017.

[28] Berthy Feng, Zeyu Jin, Jiaqi Su, and Adam Finkelstein, "Learning bandwidth expansion using perceptually-motivated loss," in ICASSP, 2019.

[29] Zhen-Hua Ling, Yang Ai, Yu Gu, and Li-Rong Dai, "Waveform modeling and generation using hierarchical recurrent neural networks for speech bandwidth extension," IEEE/ACM Transactions on Audio, Speech and Language Processing, 2018.

[30] Teck Yian Lim, Raymond A. Yeh, Yijia Xu, Minh N. Do, and Mark Hasegawa-Johnson, "Time-frequency networks for audio super-resolution," in ICASSP, 2018.

[31] Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang, "Deep high-resolution representation learning for human pose estimation," in CVPR, 2019.

[32] Emad M. Grais, Dominic Ward, and Mark D. Plumbley, "Raw multi-channel audio source separation using multi-resolution convolutional auto-encoders," in EUSIPCO, 2018.

[33] Joakim Anden and Stephane Mallat, "Multiscale scattering for audio classification," in ISMIR, 2011.

[34] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, "Generative adversarial nets," in NIPS, 2014.

[35] Chris Donahue, Julian McAuley, and Miller Puckette, "Adversarial audio synthesis," in ICLR, 2019.

[36] Martin Arjovsky, Soumith Chintala, and Leon Bottou, "Wasserstein GAN," in ICML, 2017.

[37] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C. Courville, "Improved training of Wasserstein GANs," in NIPS, 2017.

[38] Cassia Valentini-Botinhao, Xin Wang, Shinji Takaki, and Junichi Yamagishi, "Investigating RNN-based speech enhancement methods for noise-robust text-to-speech," in SSW, 2016.

[39] Shinji Watanabe, Takaaki Hori, Shigeki Karita, and Tomoki Hayashi, "ESPnet: End-to-end speech processing toolkit," in INTERSPEECH, 2018.

