
DIFFERENTIABLE CONSISTENCY CONSTRAINTS FOR IMPROVED DEEP SPEECH ENHANCEMENT

Scott Wisdom, John R. Hershey, Kevin Wilson, Jeremy Thorpe, Michael Chinen, Brian Patton, Rif A. Saurous

Google Research

ABSTRACT

In recent years, deep networks have led to dramatic improvements in speech enhancement by framing it as a data-driven pattern recognition problem. In many modern enhancement systems, large amounts of data are used to train a deep network to estimate masks for complex-valued short-time Fourier transforms (STFTs) to suppress noise and preserve speech. However, current masking approaches often neglect two important constraints: STFT consistency and mixture consistency. Without STFT consistency, the system's output is not necessarily the STFT of a time-domain signal, and without mixture consistency, the sum of the estimated sources does not necessarily equal the input mixture. Furthermore, the only previous approaches that apply mixture consistency use real-valued masks; mixture consistency has been ignored for complex-valued masks.

In this paper, we show that STFT consistency and mixture consistency can be jointly imposed by adding simple differentiable projection layers to the enhancement network. These layers are compatible with real or complex-valued masks. Using both of these constraints with complex-valued masks provides a 0.7 dB increase in scale-invariant signal-to-distortion ratio (SI-SDR) on a large dataset of speech corrupted by a wide variety of nonstationary noise across a range of input SNRs.

Index Terms— Speech enhancement, STFT consistency, mixture consistency, deep learning

1. INTRODUCTION

In recent years, deep neural networks (DNNs) have led to dramatic improvements in speech enhancement. Typically deep networks are trained to estimate masks for complex-valued short-time Fourier transforms (STFTs) to suppress noise and preserve speech. That is, given N time-domain samples of an input mixture of J sources:

y = ∑_{j=1}^{J} x_j,    (1)

a DNN is trained to produce a mask M_j for each source given the mixture. For each of the J sources, the mask is applied to the mixture STFT Y = S{y} of shape F × T, and the separated audio signal is recovered by applying the inverse STFT operator:

x̂_j = S⁻¹{M_j ⊙ Y},    (2)

where S and S⁻¹ are the forward and inverse STFT operators, and ⊙ denotes element-wise multiplication.

Such masking-based DNN approaches have been very successful [1, 2, 3, 4]. However, existing approaches have two deficiencies. First, the loss function used to train the enhancement network is typically measured on the masked noisy STFT. The problem with this is that applying an arbitrary mask does not produce a consistent STFT, in the sense that the masked STFT could not be computed from any real-valued time-domain signal. Therefore, if the masked STFT is inverted then recomputed, the resulting magnitudes and phases will be different. Second, some approaches that use a real-valued mask, and all approaches that use a complex-valued mask, do not leverage the basic assumption given by the model (1): that the separated sources should add up to the original mixture signal.
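To make the setup of (1)-(2) concrete, here is a minimal sketch of masking-based enhancement using SciPy's STFT routines. This is our own illustration, not the paper's system: the oracle magnitude-ratio mask below is just a stand-in for the mask a trained DNN would produce.

```python
# Illustrative masking-based enhancement (not the paper's code): build a mixture,
# apply a stand-in real-valued mask to its STFT, and invert, as in (1)-(2).
import numpy as np
from scipy.signal import stft, istft

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)        # stand-in clean source
noise = 0.3 * rng.standard_normal(16000)   # stand-in noise source
y = speech + noise                         # mixture, eq. (1)

_, _, Y = stft(y, nperseg=512, noverlap=256)       # Y = S{y}, shape F x T
_, _, S = stft(speech, nperseg=512, noverlap=256)
M = np.clip(np.abs(S) / (np.abs(Y) + 1e-8), 0.0, 1.0)  # stand-in real mask
_, x_hat = istft(M * Y, nperseg=512, noverlap=256)     # x_hat = S^{-1}{M (.) Y}, eq. (2)
```

Note that nothing here forces M ⊙ Y to be a consistent STFT, nor the estimated sources to sum to y; those are exactly the two constraints this paper addresses.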

In this paper, we show that STFT and mixture consistency can be enforced by adding simple end-to-end layers in the enhancement network. These two techniques can be combined with any masking-based speech enhancement model to improve performance. Also, these constraints are compatible with complex-valued masks, which for the first time allows systems that use complex-valued masks to take advantage of the basic assumption of mixture consistency.

2. RELATION TO PRIOR WORK

STFT consistency has been exploited by a number of works [5, 6, 7, 8, 9], but usually in a model-based context that requires iterative algorithms. In contrast to our work, none of these approaches have combined STFT consistency with DNNs. Since we use a single differentiable consistency constraint layer within a DNN, we do not require iterations to enforce consistency.

Gunawan and Sen [10] proposed adding mixture consistency constraints to improve Griffin-Lim, resulting in the multiple input spectrogram inversion (MISI) algorithm. Wang et al. [11] trained a real-valued masking-based system for speech separation through unfolded MISI iterations. Sturmel and Daudet [12] proposed iterative partitioned phase reconstruction (PPR), which also relied on mixture consistency. In contrast to these works, we do not require multiple iterations for phase estimation, and furthermore we experiment with estimating phase using a complex-valued mask.

Lee et al. [13] recently used a soft squared-error penalty between the warped magnitudes of the true and estimated mixture STFTs to promote mixture consistency. In contrast, we use an exact projection step implemented as a neural network layer.

3. CONSISTENCY

3.1. STFT consistency

When an STFT uses overlapping frames, applying a mask to the STFT of a mixture signal, whether the mask is real or complex-valued, does not necessarily produce a consistent STFT [5]. A consistent STFT X is one such that there exists a real-valued time-domain signal x satisfying X = S{x}. An STFT X is consistent if it satisfies

X = S{S⁻¹{X}} = P_S{X},    (3)
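The projection P_S in (3) is easy to demonstrate numerically: inverting a masked STFT and re-taking the STFT changes it (the masked STFT was inconsistent), while projecting a second time changes nothing, since the result of the first projection is already the STFT of a time-domain signal. A sketch with scipy.signal (our illustration, not the authors' code):

```python
# Demonstration (ours) of the consistency projection P_S{X} = S{S^{-1}{X}}.
import numpy as np
from scipy.signal import stft, istft

NPERSEG, NOVERLAP = 256, 128

def project(X):
    """Apply P_S: inverse STFT, then forward STFT."""
    _, x = istft(X, nperseg=NPERSEG, noverlap=NOVERLAP)
    return stft(x, nperseg=NPERSEG, noverlap=NOVERLAP)[2]

rng = np.random.default_rng(0)
y = rng.standard_normal(8000)
_, _, Y = stft(y, nperseg=NPERSEG, noverlap=NOVERLAP)

X = rng.uniform(0.0, 1.0, size=Y.shape) * Y   # arbitrary real mask: inconsistent STFT
X_proj = project(X)

err_masked = np.max(np.abs(X - X_proj))               # large: X was not consistent
err_again = np.max(np.abs(project(X_proj) - X_proj))  # tiny: P_S is idempotent
```

The idempotence of P_S (the second projection being a no-op) is what makes it a true projection onto the set of consistent STFTs.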

arXiv:1811.08521v1 [cs.SD] 20 Nov 2018


Fig. 1. Illustration of the effect of STFT consistency. A noisy STFT (spectrogram in lower left) is masked using an ideal phase-sensitive mask (4) to recover a target speech signal (spectrogram in upper left). Then, this masked STFT is made consistent using the projection (3). Notice that the masked spectrogram (upper middle panel) is substantially different from the STFT-consistent spectrogram (upper right panel), with the magnitude-squared error visualized in the lower right panel.

where P_S refers to the projection performed by the sequence of inverse and forward STFT operations.

An important reason to enforce consistency on separated estimates M_j ⊙ Y is that these estimates appear in the loss functions and metrics used during deep network training. If the masked STFT is simply used in a loss function, e.g. mean-squared error between magnitudes of the masked and ground-truth spectrograms, the magnitudes |X̂_j| = |M_j ⊙ Y| do not necessarily correspond to the STFT magnitudes of the reconstructed time-domain signal, |S{x̂_j}|. Thus, the loss function will not accurately measure the spectral magnitude of the estimate.

3.1.1. Illustration of STFT consistency

Figure 1 illustrates the effect of STFT consistency. A clean speech example s (spectrogram in upper left panel) is embedded in nonstationary noise v at 8 dB SNR (mixture spectrogram in lower left panel). Then, we compute an oracle phase-sensitive mask [2] as

M_{f,t} = (|S_{f,t}| / |S_{f,t} + V_{f,t}|) · cos(∠S_{f,t} − ∠Y_{f,t}).    (4)
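As a concrete reference, the oracle phase-sensitive mask of (4) can be computed per TF bin from the clean-speech and noise STFTs. This is our own numpy sketch, not the authors' code; the small epsilon is our addition for numerical stability:

```python
# Oracle phase-sensitive mask of eq. (4) (illustrative sketch; eps is our addition).
import numpy as np

def phase_sensitive_mask(S, V, eps=1e-8):
    """|S| / |S + V| * cos(angle(S) - angle(Y)) per TF bin, with mixture Y = S + V."""
    Y = S + V
    return np.abs(S) / (np.abs(Y) + eps) * np.cos(np.angle(S) - np.angle(Y))

# sanity checks on single TF bins
m_clean = phase_sensitive_mask(np.array([[1.0 + 1.0j]]), np.array([[0.0 + 0.0j]]))
m_noise = phase_sensitive_mask(np.array([[0.0 + 0.0j]]), np.array([[1.0 + 0.0j]]))
# m_clean is ~1 (bin is all speech); m_noise is 0 (bin is all noise)
```

The cosine term makes this mask real-valued yet phase-aware: it shrinks the magnitude in bins where the noisy phase disagrees with the clean phase.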

This mask is applied to the noisy STFT to create a masked STFT (spectrogram shown in upper middle panel). Then, we use the projection (3) to create a consistent STFT of the enhanced speech (spectrogram in upper right panel). The lower right panel visualizes the magnitude-squared error between the masked and consistent STFTs.

Note that the masked STFT (upper middle panel) differs substantially from the consistent STFT (upper right panel). This indicates the importance of STFT consistency: if a neural network training loss is measured on the masked magnitude instead of the consistent magnitude, then the loss is not looking at the actual spectrogram of the enhanced time-domain signal. In particular, the magnitude-squared error between consistent and true STFTs (9.57e-3) is less than the magnitude-squared error between masked and true STFTs (1.58e-2). This discrepancy emphasizes the importance of having the network compute training losses on consistent STFTs, since otherwise the network is focusing on irrelevant parts of the error.

3.1.2. Backpropagating through STFTs

A natural way to enforce STFT consistency is to simply project estimated signals using the constraint (3). Since the forward and inverse STFT operations are linear transforms implemented in TensorFlow¹, this projection can simply be treated as an extra layer in the network.

3.2. Mixture consistency

An obvious constraint on separated signals comes from the original mixing model (1), as it is natural to assume that estimates of the separated signals should add up to the original mixture. This is equivalent to the complex STFTs of the estimated sources adding up to the complex mixture STFT:

∑_j X̄_{j,f,t} = Y_{f,t}  ∀ t, f,    (5)

where X̄_{j,f,t} is the mixture-consistent TF bin for source j.

In the past, this mixture consistency constraint has been enforced using real-valued masks that are made to sum to one across the sources, which is equivalent to using the projection (7). However, when the masks are complex-valued, constraining these masks to sum to one across sources is too restrictive, since complex masks might need to modify the phase differently for different sources in the same time-frequency (TF) bin.

In this section, we describe a simple differentiable mixture projection layer that can enforce mixture consistency for any type of masking-based enhancement or separation method, including real-valued and complex-valued masks and explicit phase prediction.

3.2.1. Backpropagating through mixture-consistent projection

To ensure mixture consistency without putting explicit constraints on the masks, we project the masked estimates to the nearest points on the subspace of mixture-consistent estimates. To do this, we solve the following optimization problem for each TF bin:

minimize_{X̄_{f,t} ∈ ℂ^J}  (1/2) ∑_j |X̄_{j,f,t} − X̂_{j,f,t}|²
subject to  ∑_j X̄_{j,f,t} = Y_{f,t}.    (6)
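Solving the equality-constrained least squares (6) with a Lagrange multiplier yields a closed form in which each source absorbs an equal 1/J share of the mixture residual. The following numpy sketch of this unweighted projection is our illustration of that closed form, not the authors' implementation:

```python
# Unweighted mixture-consistency projection: the minimizer of (6) distributes the
# mixture residual equally across the J sources (our sketch, derived via a
# Lagrange multiplier on the equality constraint).
import numpy as np

def mixture_consistency(X_hat, Y):
    """Project masked estimates X_hat (J x F x T, complex) onto the subspace
    where the sources sum to the mixture STFT Y (F x T)."""
    J = X_hat.shape[0]
    residual = Y - X_hat.sum(axis=0)
    return X_hat + residual[None, :, :] / J

rng = np.random.default_rng(0)
Y = rng.standard_normal((5, 4)) + 1j * rng.standard_normal((5, 4))
X_hat = rng.standard_normal((2, 5, 4)) + 1j * rng.standard_normal((2, 5, 4))
X_bar = mixture_consistency(X_hat, Y)   # the projected estimates satisfy (5)
```

Because the operation is just an addition of a scaled residual, it is trivially differentiable and can sit at the end of any masking-based network.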

¹ https://www.tensorflow.org/api_docs/python/tf/contrib/signal/inverse_stft


[Figure: illustration of the mixture-consistency projection for two sources, with labeled quantities Y_{f,t} = X̄_{1,f,t} + X̄_{2,f,t}, the masked estimates X̂_{1,f,t} and X̂_{2,f,t}, and the weighted and unweighted projected estimates X̄_{f,t}.]

pv2,f,t

<latexit sha1_base64="4bD8AKiE9mQjO3ktK5EqjwrCHzI=">AAAB+XicdVDJSgNBEO2JW4zbqEcvjUHwEEJPTNRj0IvHCGaBZAg9nZ6kSc9id00gDPMnXjwo4tU/8ebf2FkEFX1Q8Hiviqp6XiyFBkI+rNzK6tr6Rn6zsLW9s7tn7x+0dJQoxpsskpHqeFRzKULeBAGSd2LFaeBJ3vbG1zO/PeFKiyi8g2nM3YAOQ+ELRsFIfdvu6XsF6aSfVkp+CbKsbxdJuUZI5ZzgBanWlqR2hp0ymaOIlmj07ffeIGJJwENgkmrddUgMbkoVCCZ5VuglmseUjemQdw0NacC1m84vz/CJUQbYj5SpEPBc/T6R0kDraeCZzoDCSP/2ZuJfXjcB/9JNRRgnwEO2WOQnEkOEZzHggVCcgZwaQpkS5lbMRlRRBiasggnh61P8P2lVyg4pO7fVYv1qGUceHaFjdIocdIHq6AY1UBMxNEEP6Ak9W6n1aL1Yr4vWnLWcOUQ/YL19AtLwk8Y=</latexit><latexit sha1_base64="4bD8AKiE9mQjO3ktK5EqjwrCHzI=">AAAB+XicdVDJSgNBEO2JW4zbqEcvjUHwEEJPTNRj0IvHCGaBZAg9nZ6kSc9id00gDPMnXjwo4tU/8ebf2FkEFX1Q8Hiviqp6XiyFBkI+rNzK6tr6Rn6zsLW9s7tn7x+0dJQoxpsskpHqeFRzKULeBAGSd2LFaeBJ3vbG1zO/PeFKiyi8g2nM3YAOQ+ELRsFIfdvu6XsF6aSfVkp+CbKsbxdJuUZI5ZzgBanWlqR2hp0ymaOIlmj07ffeIGJJwENgkmrddUgMbkoVCCZ5VuglmseUjemQdw0NacC1m84vz/CJUQbYj5SpEPBc/T6R0kDraeCZzoDCSP/2ZuJfXjcB/9JNRRgnwEO2WOQnEkOEZzHggVCcgZwaQpkS5lbMRlRRBiasggnh61P8P2lVyg4pO7fVYv1qGUceHaFjdIocdIHq6AY1UBMxNEEP6Ak9W6n1aL1Yr4vWnLWcOUQ/YL19AtLwk8Y=</latexit><latexit sha1_base64="4bD8AKiE9mQjO3ktK5EqjwrCHzI=">AAAB+XicdVDJSgNBEO2JW4zbqEcvjUHwEEJPTNRj0IvHCGaBZAg9nZ6kSc9id00gDPMnXjwo4tU/8ebf2FkEFX1Q8Hiviqp6XiyFBkI+rNzK6tr6Rn6zsLW9s7tn7x+0dJQoxpsskpHqeFRzKULeBAGSd2LFaeBJ3vbG1zO/PeFKiyi8g2nM3YAOQ+ELRsFIfdvu6XsF6aSfVkp+CbKsbxdJuUZI5ZzgBanWlqR2hp0ymaOIlmj07ffeIGJJwENgkmrddUgMbkoVCCZ5VuglmseUjemQdw0NacC1m84vz/CJUQbYj5SpEPBc/T6R0kDraeCZzoDCSP/2ZuJfXjcB/9JNRRgnwEO2WOQnEkOEZzHggVCcgZwaQpkS5lbMRlRRBiasggnh61P8P2lVyg4pO7fVYv1qGUceHaFjdIocdIHq6AY1UBMxNEEP6Ak9W6n1aL1Yr4vWnLWcOUQ/YL19AtLwk8Y=</latexit><latexit 
sha1_base64="4bD8AKiE9mQjO3ktK5EqjwrCHzI=">AAAB+XicdVDJSgNBEO2JW4zbqEcvjUHwEEJPTNRj0IvHCGaBZAg9nZ6kSc9id00gDPMnXjwo4tU/8ebf2FkEFX1Q8Hiviqp6XiyFBkI+rNzK6tr6Rn6zsLW9s7tn7x+0dJQoxpsskpHqeFRzKULeBAGSd2LFaeBJ3vbG1zO/PeFKiyi8g2nM3YAOQ+ELRsFIfdvu6XsF6aSfVkp+CbKsbxdJuUZI5ZzgBanWlqR2hp0ymaOIlmj07ffeIGJJwENgkmrddUgMbkoVCCZ5VuglmseUjemQdw0NacC1m84vz/CJUQbYj5SpEPBc/T6R0kDraeCZzoDCSP/2ZuJfXjcB/9JNRRgnwEO2WOQnEkOEZzHggVCcgZwaQpkS5lbMRlRRBiasggnh61P8P2lVyg4pO7fVYv1qGUceHaFjdIocdIHq6AY1UBMxNEEP6Ak9W6n1aL1Yr4vWnLWcOUQ/YL19AtLwk8Y=</latexit>

Yf,t = X1,f,t + X2,f,t<latexit sha1_base64="6hskNATgnsm3exJb5mBo0L6yJWg=">AAACCHicbZDLSsNAFIZP6q3WW9SlCweLIFhKUgTdCEU3LivYi7QhTKaTdujkwsxEKKFLN76KGxeKuPUR3Pk2TtMstPWHgY//nMOZ83sxZ1JZ1rdRWFpeWV0rrpc2Nre2d8zdvZaMEkFok0Q8Eh0PS8pZSJuKKU47saA48Dhte6Prab39QIVkUXinxjF1AjwImc8IVtpyzcN7N/UraoIuUcdN7UrGp1OuZeyaZatqZUKLYOdQhlwN1/zq9SOSBDRUhGMpu7YVKyfFQjHC6aTUSySNMRnhAe1qDHFApZNmh0zQsXb6yI+EfqFCmft7IsWBlOPA050BVkM5X5ua/9W6ifIvnJSFcaJoSGaL/IQjFaFpKqjPBCWKjzVgIpj+KyJDLDBROruSDsGeP3kRWrWqbVXt27Ny/SqPowgHcAQnYMM51OEGGtAEAo/wDK/wZjwZL8a78TFrLRj5zD78kfH5A2FEl6Q=</latexit><latexit sha1_base64="6hskNATgnsm3exJb5mBo0L6yJWg=">AAACCHicbZDLSsNAFIZP6q3WW9SlCweLIFhKUgTdCEU3LivYi7QhTKaTdujkwsxEKKFLN76KGxeKuPUR3Pk2TtMstPWHgY//nMOZ83sxZ1JZ1rdRWFpeWV0rrpc2Nre2d8zdvZaMEkFok0Q8Eh0PS8pZSJuKKU47saA48Dhte6Prab39QIVkUXinxjF1AjwImc8IVtpyzcN7N/UraoIuUcdN7UrGp1OuZeyaZatqZUKLYOdQhlwN1/zq9SOSBDRUhGMpu7YVKyfFQjHC6aTUSySNMRnhAe1qDHFApZNmh0zQsXb6yI+EfqFCmft7IsWBlOPA050BVkM5X5ua/9W6ifIvnJSFcaJoSGaL/IQjFaFpKqjPBCWKjzVgIpj+KyJDLDBROruSDsGeP3kRWrWqbVXt27Ny/SqPowgHcAQnYMM51OEGGtAEAo/wDK/wZjwZL8a78TFrLRj5zD78kfH5A2FEl6Q=</latexit><latexit sha1_base64="6hskNATgnsm3exJb5mBo0L6yJWg=">AAACCHicbZDLSsNAFIZP6q3WW9SlCweLIFhKUgTdCEU3LivYi7QhTKaTdujkwsxEKKFLN76KGxeKuPUR3Pk2TtMstPWHgY//nMOZ83sxZ1JZ1rdRWFpeWV0rrpc2Nre2d8zdvZaMEkFok0Q8Eh0PS8pZSJuKKU47saA48Dhte6Prab39QIVkUXinxjF1AjwImc8IVtpyzcN7N/UraoIuUcdN7UrGp1OuZeyaZatqZUKLYOdQhlwN1/zq9SOSBDRUhGMpu7YVKyfFQjHC6aTUSySNMRnhAe1qDHFApZNmh0zQsXb6yI+EfqFCmft7IsWBlOPA050BVkM5X5ua/9W6ifIvnJSFcaJoSGaL/IQjFaFpKqjPBCWKjzVgIpj+KyJDLDBROruSDsGeP3kRWrWqbVXt27Ny/SqPowgHcAQnYMM51OEGGtAEAo/wDK/wZjwZL8a78TFrLRj5zD78kfH5A2FEl6Q=</latexit><latexit 
sha1_base64="6hskNATgnsm3exJb5mBo0L6yJWg=">AAACCHicbZDLSsNAFIZP6q3WW9SlCweLIFhKUgTdCEU3LivYi7QhTKaTdujkwsxEKKFLN76KGxeKuPUR3Pk2TtMstPWHgY//nMOZ83sxZ1JZ1rdRWFpeWV0rrpc2Nre2d8zdvZaMEkFok0Q8Eh0PS8pZSJuKKU47saA48Dhte6Prab39QIVkUXinxjF1AjwImc8IVtpyzcN7N/UraoIuUcdN7UrGp1OuZeyaZatqZUKLYOdQhlwN1/zq9SOSBDRUhGMpu7YVKyfFQjHC6aTUSySNMRnhAe1qDHFApZNmh0zQsXb6yI+EfqFCmft7IsWBlOPA050BVkM5X5ua/9W6ifIvnJSFcaJoSGaL/IQjFaFpKqjPBCWKjzVgIpj+KyJDLDBROruSDsGeP3kRWrWqbVXt27Ny/SqPowgHcAQnYMM51OEGGtAEAo/wDK/wZjwZL8a78TFrLRj5zD78kfH5A2FEl6Q=</latexit>

Fig. 2. Geometric illustration of mixture consistency for two real-valued sources in a single time-frequency bin.

Using the method of Lagrange multipliers and defining the estimated mixture $\hat{Y}_{f,t} = \sum_j \hat{X}_{j,f,t}$ yields the following update, which is a simple projection that can be added as a layer in the network and backpropagated through:

$$\bar{X}_{j,f,t} = \hat{X}_{j,f,t} + \frac{1}{J}\big(Y_{f,t} - \hat{Y}_{f,t}\big). \qquad (7)$$
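The projection (7) is simple to implement as a differentiable layer. Below is a minimal NumPy sketch; the function and variable names are ours, not taken from the paper's implementation:

```python
import numpy as np

def mixture_consistency(est_sources, mixture):
    """Project source estimates so they sum to the mixture (eq. 7).

    est_sources: complex array of shape (J, F, T) -- raw network outputs.
    mixture:     complex array of shape (F, T)    -- STFT of the input mix.
    Returns projected sources of the same shape whose sum equals `mixture`.
    """
    J = est_sources.shape[0]
    residual = mixture - est_sources.sum(axis=0)  # Y - sum_j X_hat_j
    # Distribute the residual equally across the J sources.
    return est_sources + residual[np.newaxis] / J

# Toy check: the projected estimates sum exactly to the mixture.
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 4, 5)) + 1j * rng.standard_normal((2, 4, 5))
Y = rng.standard_normal((4, 5)) + 1j * rng.standard_normal((4, 5))
X_proj = mixture_consistency(X, Y)
assert np.allclose(X_proj.sum(axis=0), Y)
```

In a deep learning framework the same arithmetic, written with framework ops, is automatically differentiable, which is all that is needed to train through the constraint.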

3.2.2. Weighted mixture-consistent projection with uncertainty

If we have a priori knowledge of the uncertainty of the source estimates, a weighted version of the problem (6) can be used to project the sources. Assume that in each frequency bin, each estimated source can be modeled as a zero-mean complex-valued circular Gaussian with variance $v_{j,f,t}$.

Given variances $v_{j,f,t}$, the weighted problem is

$$\underset{\mathbf{X}_{f,t} \in \mathbb{C}^J}{\text{minimize}} \;\; \frac{1}{2}\sum_j \frac{1}{v_{j,f,t}}\big|X_{j,f,t} - \hat{X}_{j,f,t}\big|^2 \quad \text{subject to} \quad \sum_j X_{j,f,t} = Y_{f,t}. \qquad (8)$$

Using the method of Lagrange multipliers yields

$$\bar{X}_{j,f,t} = \hat{X}_{j,f,t} + \frac{v_{j,f,t}}{\sum_{j'} v_{j',f,t}}\big(Y_{f,t} - \hat{Y}_{f,t}\big). \qquad (9)$$

Notice that if all sources have equal uncertainty, i.e. $v_{j,f,t} = v_{j',f,t}$ for all $j \neq j'$, then the weighted projection (9) is equivalent to the unweighted projection (7).
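The weighted projection (9) and its reduction to (7) under equal variances can be checked numerically. A sketch (names are ours, illustrative only):

```python
import numpy as np

def weighted_mixture_consistency(est_sources, mixture, var):
    """Weighted projection (eq. 9): sources with larger variance absorb
    more of the mixture residual. `var` has the same shape as `est_sources`."""
    residual = mixture - est_sources.sum(axis=0)
    weights = var / var.sum(axis=0, keepdims=True)  # v_j / sum_j' v_j'
    return est_sources + weights * residual[np.newaxis]

rng = np.random.default_rng(1)
X = rng.standard_normal((3, 4, 5)) + 1j * rng.standard_normal((3, 4, 5))
Y = rng.standard_normal((4, 5)) + 1j * rng.standard_normal((4, 5))

# Equal variances reduce (9) to the unweighted projection (7).
v_equal = np.ones(X.shape)
X_w = weighted_mixture_consistency(X, Y, v_equal)
X_u = X + (Y - X.sum(axis=0))[np.newaxis] / X.shape[0]
assert np.allclose(X_w, X_u)
assert np.allclose(X_w.sum(axis=0), Y)
```

Because the per-bin weights sum to one over sources, the projected estimates satisfy the mixture constraint for any choice of variances.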

Figure 2 illustrates unweighted and weighted mixture consistency for two sources. For visualization, we depict the sources' TF bins $X_{1,f,t}$ and $X_{2,f,t}$ as real-valued instead of complex-valued. Note that the projected mixture-consistent estimate $\bar{X}_{f,t}$ is given by the intersection of the mixture constraint line and the ellipse representing contours of a diagonal-covariance Gaussian pdf. When the variances $v_{j,f,t}$ are the same for all sources, the ellipse becomes a circle, and the mixture-consistent projection is orthogonal to the constraint surface.

3.3. Order of consistency operations

The STFT and mixture consistency operations (3) and (7) are linear projections. Though linear projections do not in general commute, the order in which these two projections are applied does not matter. As a proof, we start from the case where STFT consistency (3) is applied after mixture consistency (7). Notice that since $P_S$ is linear, and since a linear combination of consistent STFTs is still a consistent STFT, we can write this as being equivalent to applying mixture consistency after STFT consistency:

$$\bar{X}_j = P_S\Big\{\hat{X}_j + \frac{1}{J}\Big(Y - \sum_{j'} \hat{X}_{j'}\Big)\Big\} \;\;\Leftrightarrow\;\; \bar{X}_j = P_S\{\hat{X}_j\} + \frac{1}{J}Y - \frac{1}{J}\sum_{j'} P_S\{\hat{X}_{j'}\}. \qquad (10)$$

However, weighted mixture projections that apply a different weight per TF bin are not orthogonal to the STFT consistency projection, since the relation (10) relies on the factor $1/J$ not depending on time or frequency. Despite this lack of commutativity, we still observe benefits from combining STFT consistency with weighted mixture consistency. Jointly imposing STFT and weighted mixture consistency is more complicated, and we defer it to future work.

4. EXPERIMENTS

4.1. Dataset

We use a dataset constructed from publicly-available data. Speech is taken from a subset of LibriSpeech [14], where train, validation, and test sets use nonoverlapping sets of speakers, and audio has been filtered using WADA-SNR [15] to ensure low levels of background noise. Noise data is sourced from freesound.org, which contains a large variety of music, musical instruments, field recordings, and sound effects. Files that are very long, have a large number of zero samples, or exhibit clipping are filtered out. The sampling rate is 16 kHz.

Speech and noise are mixed with normally-distributed SNRs, with a mean of 5 dB and a standard deviation of 10 dB. We also apply a random gain to the mixture signals, with a mean of -10 dB and a standard deviation of 5 dB. These random gain parameters were chosen such that clipping is minimal in the audio. After mixing, the duration of the training set is 134.2 hours, and the validation and test sets are 6.2 hours each.
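The mixing procedure above can be sketched as follows. This is our own illustrative reconstruction, not the paper's data pipeline; the function name and the epsilon guard are assumptions:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db, gain_db):
    """Scale `noise` so the speech-to-noise ratio is `snr_db`, sum the two
    signals, then apply an overall gain of `gain_db` to the mixture."""
    eps = 1e-8  # guard against silent inputs
    speech_power = np.mean(speech**2) + eps
    noise_power = np.mean(noise**2) + eps
    # Scale noise to achieve the target SNR relative to the speech.
    noise_scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    mixture = speech + noise_scale * noise
    return 10 ** (gain_db / 20) * mixture

rng = np.random.default_rng(2)
speech = rng.standard_normal(16000)  # stand-ins for real 1 s clips at 16 kHz
noise = rng.standard_normal(16000)
snr_db = rng.normal(5.0, 10.0)       # SNR ~ N(5 dB, 10 dB), as in the paper
gain_db = rng.normal(-10.0, 5.0)     # gain ~ N(-10 dB, 5 dB)
mix = mix_at_snr(speech, noise, snr_db, gain_db)
```

Note that the overall gain changes absolute level (and hence clipping risk) but leaves the speech-to-noise ratio unchanged.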

4.2. Model architecture and training

Our model, shown in Figure 3, uses a modified version of an architecture that recently achieved state-of-the-art performance on the CHiME2 dataset [4]. The model consists of a convolutional front-end operating on spectral input features, a single unidirectional LSTM with residual connections and a width of 400, and two fully-connected layers with 600 hidden units each. The input features are power-compressed STFTs, computed as $X^{0.3} := |X|^{0.3} e^{j\angle X}$. STFTs are computed using 50 ms Hann windows with a 10 ms hop and an FFT length of 1024.

The training loss, L, for all networks is as follows:

$$\mathcal{L} = \sum_{j=1}^{2} z_j \sum_{f,t}\Big[\big(|X_{j,f,t}|^{0.3} - |\hat{X}_{j,f,t}|^{0.3}\big)^2 + 0.2\,\big|X_{j,f,t}^{0.3} - \hat{X}_{j,f,t}^{0.3}\big|^2\Big], \qquad (11)$$

where $z_1 = 0.8$ is the speech loss weight and $z_2 = 0.2$ is the noise loss weight. We train and validate on fixed-length, three-second clips. The Adam optimizer [16] is used with a batch size of 8, a learning rate of $3\times 10^{-5}$, and default parameters otherwise.
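The loss (11) combines a compressed-magnitude term and a compressed complex term per source. A NumPy sketch of this computation (names and defaults are ours; a real training setup would use framework tensors):

```python
import numpy as np

def compress(X, p=0.3):
    """Power-law compression X^p := |X|^p * exp(j * angle(X))."""
    return np.abs(X) ** p * np.exp(1j * np.angle(X))

def enhancement_loss(ref, est, weights=(0.8, 0.2), cplx_weight=0.2, p=0.3):
    """Eq. (11): per-source weighted sum of a squared compressed-magnitude
    difference and a squared compressed complex difference.
    `ref` and `est` are complex arrays of shape (J, F, T)."""
    loss = 0.0
    for j, z in enumerate(weights):  # z_1 = 0.8 (speech), z_2 = 0.2 (noise)
        mag_term = (np.abs(ref[j]) ** p - np.abs(est[j]) ** p) ** 2
        cplx_term = np.abs(compress(ref[j], p) - compress(est[j], p)) ** 2
        loss += z * np.sum(mag_term + cplx_weight * cplx_term)
    return loss
```

The magnitude term is phase-blind, while the complex term penalizes phase errors; the 0.2 factor keeps the complex term from dominating.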


Fig. 3. System architecture when both STFT consistency and mixture consistency are used. (Block diagram: the noisy input passes through an STFT and power-law compression into convolutional layers, a unidirectional LSTM, and fully-connected layers with sigmoid or tanh activation that produce masks (JxFxT); the masks multiply the noisy STFT (1xFxT), then the STFT consistency projection of eq. (3) and the mixture consistency projection of eq. (7) or eq. (9) are applied, followed by an iSTFT producing the enhanced output (J,N); training compares the enhanced source STFTs against the clean source STFTs using the squared-error loss on compressed magnitudes and compressed complex values of eq. (11).)

Fig. 4. Mean improvement in SI-SDR at different input SNRs on the test set. Legend: mask type (real or complex), phase method (noisy, real/imag., mag./angle, residual), mixture consistency (none, unweighted, weighted mag. sq., weighted learned), and STFT consistency (empty marker: no, filled marker: yes). Notice that using both types of consistency almost always improves performance, for both real-valued and complex-valued masks. The dashed lines indicate the performance of the baseline model, which uses a real-valued mask with noisy phase and neither consistency constraint (blue empty circle).

Our proposed consistency constraints are compatible with both real and complex-valued masks. For real-valued masking, the network predicts a single scalar value through a sigmoid nonlinearity for each TF bin, and the noisy phase is used for reconstruction. To perform complex-valued masking, we use an approach similar to that of Williamson et al. [3]. For each TF bin, the network predicts the real and imaginary parts of a complex-valued mask through a hyperbolic tangent (tanh) nonlinearity. This mask is multiplied with the complex noisy STFT, and the result is then reconstructed.
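The two masking schemes can be sketched as below. This is our own illustration of the described behavior, assuming the masks are applied to raw (pre-activation) network outputs; the function names are hypothetical:

```python
import numpy as np

def apply_real_mask(logits, noisy_stft):
    """Real-valued masking: a sigmoid mask scales each TF bin,
    so the noisy phase is preserved in the output."""
    mask = 1.0 / (1.0 + np.exp(-logits))  # sigmoid, mask in (0, 1)
    return mask * noisy_stft

def apply_complex_mask(logits_re, logits_im, noisy_stft):
    """Complex-valued masking in the spirit of Williamson et al. [3]:
    tanh-bounded real and imaginary components form a complex mask
    that multiplies the noisy STFT, so the mask can also alter phase."""
    mask = np.tanh(logits_re) + 1j * np.tanh(logits_im)
    return mask * noisy_stft
```

Since the sigmoid mask is a positive real number, the real-masked output shares the noisy STFT's phase exactly, whereas the complex mask can rotate each bin's phase by up to the full circle.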

To implement weighted mixture consistency (9), we consider two types of weighting schemes:

1. Weights $v_{j,f,t}$ are the squared estimated source magnitudes, $|\hat{X}_{j,f,t}|^2$. This has the advantage of not adding any correction signal to TF bins that have a low magnitude, which helps when there is only one signal active within a TF bin.

2. Weights are learned by the network. The network outputs a scalar for each time, frequency, and source. A sigmoid nonlinearity is applied, producing weights $w_{1,f,t}$ for the speech source and $w_{2,f,t} = 1 - w_{1,f,t}$ for the noise source. The terms $v_{j,f,t}/(\sum_{j'} v_{j',f,t})$ in (9) are replaced by $w_{j,f,t}$.

4.3. Results

Results are shown in Figure 4 in terms of scale-invariant SDR (SI-SDR) improvement. SI-SDR measures signal fidelity with respect to a reference signal while allowing for a gain mismatch [17, 18]:

$$\text{SI-SDR} = 10\log_{10}\frac{\|\alpha x\|^2}{\|\alpha x - \hat{x}\|^2}, \qquad (12)$$

where $\alpha = \arg\min_a \|a x - \hat{x}\|^2 = x^T\hat{x}/\|x\|^2$.

These results are grouped into five bins based on input SNR,

from -15 dB to 15 dB with a bin width of 6 dB. Models that use both STFT and mixture consistency constraints almost always outperform models that do not use these constraints. The best models tend to use weighted mixture consistency with learned weights. Phase prediction provides a slight improvement in performance, especially when using a complex mask for phase prediction at lower input SNRs. The system with the best overall mean SI-SDR improvement of 10.6 dB uses complex masking, STFT consistency, and weighted mixture consistency with learned weights. This improves 0.7 dB over the baseline system, which has a mean SI-SDR improvement of 9.9 dB.
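The SI-SDR metric of eq. (12) is straightforward to compute; here is a short NumPy sketch (the epsilon guard is our addition for numerical safety):

```python
import numpy as np

def si_sdr(ref, est, eps=1e-12):
    """Scale-invariant SDR in dB (eq. 12). `ref` is the clean signal x,
    `est` is the estimate x_hat; both are 1-D arrays. The optimal gain
    alpha = x^T x_hat / ||x||^2 removes any scale mismatch."""
    alpha = np.dot(ref, est) / (np.dot(ref, ref) + eps)
    target = alpha * ref
    return 10 * np.log10(np.sum(target**2) / (np.sum((target - est) ** 2) + eps))
```

Because the optimal gain is recomputed for each estimate, rescaling the estimate leaves the score unchanged, which is exactly the invariance the metric is designed for.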

5. CONCLUSION

In this paper, we have shown that the simple addition of differentiable neural network layers can be used to enforce STFT and mixture consistency on the source estimates of an audio source separation network. Adding these consistency constraint layers improves speech enhancement performance on a dataset with a wide variety of nonstationary noise and SNR levels. These constraints are also compatible with any STFT-based enhancement system, including those that use complex-valued masks. In future work, we will combine these constraints with other types of phase estimation and generative models, and apply them to general audio source separation rather than just speech enhancement. Another interesting direction for future work is defining a weighted version of STFT consistency and joint weighted STFT and mixture consistency, both of which have the potential to further improve performance.


6. REFERENCES

[1] A. Narayanan and D. Wang, "Ideal ratio mask estimation using deep neural networks for robust speech recognition," in Proc. ICASSP, May 2013.

[2] H. Erdogan, J. R. Hershey, S. Watanabe, and J. Le Roux, "Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks," in Proc. ICASSP, Apr. 2015.

[3] D. S. Williamson, Y. Wang, and D. Wang, "Complex ratio masking for joint enhancement of magnitude and phase," in Proc. ICASSP, Mar. 2016.

[4] K. Wilson, M. Chinen, J. Thorpe, B. Patton, J. Hershey, R. A. Saurous, J. Skoglund, and R. F. Lyon, "Exploring tradeoffs in models for low-latency speech enhancement," in Proc. IWAENC, Sep. 2018.

[5] J. Le Roux, N. Ono, and S. Sagayama, "Explicit consistency constraints for STFT spectrograms and their application to phase reconstruction," in Proc. SAPA, Sep. 2008.

[6] J. Le Roux, H. Kameoka, N. Ono, and S. Sagayama, "Fast signal reconstruction from magnitude STFT spectrogram based on spectrogram consistency," in Proc. 13th International Conference on Digital Audio Effects (DAFx-10), Sep. 2010.

[7] J. Le Roux, E. Vincent, Y. Mizuno, H. Kameoka, N. Ono, and S. Sagayama, "Consistent Wiener filtering: Generalized time-frequency masking respecting spectrogram consistency," in Proc. LVA/ICA, Sep. 2010.

[8] P. Magron, J. Le Roux, and T. Virtanen, "Consistent anisotropic Wiener filtering for audio source separation," in Proc. WASPAA, Oct. 2017.

[9] J. Le Roux and E. Vincent, "Consistent Wiener filtering for audio source separation," IEEE Signal Processing Letters, vol. 20, no. 3, 2013.

[10] D. Gunawan and D. Sen, "Iterative phase estimation for the synthesis of separated sources from single-channel mixtures," IEEE Signal Processing Letters, vol. 17, no. 5, 2010.

[11] Z.-Q. Wang, J. Le Roux, D. Wang, and J. R. Hershey, "End-to-end speech separation with unfolded iterative phase reconstruction," in Proc. Interspeech, Sep. 2018.

[12] N. Sturmel and L. Daudet, "Iterative phase reconstruction of Wiener filtered signals," in Proc. ICASSP, Mar. 2012.

[13] J. Lee, J. Skoglund, T. Shabestary, and H.-G. Kang, "Phase-sensitive joint learning algorithms for deep learning-based speech enhancement," IEEE Signal Processing Letters, vol. 25, no. 8, 2018.

[14] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "LibriSpeech: An ASR corpus based on public domain audio books," in Proc. ICASSP, Apr. 2015.

[15] C. Kim and R. M. Stern, "Robust signal-to-noise ratio estimation based on waveform amplitude distribution analysis," in Proc. Interspeech, Sep. 2008.

[16] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.

[17] Y. Isik, J. Le Roux, Z. Chen, S. Watanabe, and J. R. Hershey, "Single-channel multi-speaker separation using deep clustering," in Proc. Interspeech, Sep. 2016.

[18] J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, "SDR: half-baked or well done?" submitted to ICASSP, 2019.

