[Lecture Notes in Computer Science] Advances in Nonlinear Speech Processing Volume 7015 ||...

Morphological Processing of Spectrograms for Speech Enhancement

Joyner Cadore, Ascensión Gallardo-Antolín, and Carmen Peláez-Moreno

Universidad Carlos III de Madrid, Escuela Politécnica Superior, Avda. de la Universidad 30, 28911 Madrid, Spain

{jcadore,gallardo,carmen}@tsc.uc3m.es

http://gpm.tsc.uc3m.es/

Abstract. This paper presents a method for removing noise from speech signals that improves their quality from the perceptual point of view. It combines spectral subtraction with two-dimensional non-linear filtering techniques most commonly employed in image processing. In particular, morphological operations such as erosion and dilation are applied to a noisy speech spectrogram that has previously been enhanced by a conventional spectral subtraction procedure. Anisotropic structural elements on gray-scale spectrograms have been found to provide better perceptual quality than isotropic ones and prove more appropriate for retaining the speech structure while removing background noise. Our procedure has been evaluated using a number of perceptual quality estimation measures for several Signal-to-Noise Ratios on the Aurora database.

Keywords: noise compensation, spectral subtraction, spectrogram, morphological processing, image filtering, speech enhancement.

1 Introduction

Noisy speech signals are a common problem in many applications, e.g. automatic speech recognition (ASR), landline and mobile phone communications, etc. In ASR, the problem is harder because machine understanding is still far from human performance [2], and speech enhancement is sometimes performed as a preprocessing stage for those systems. However, in this paper, we have concentrated our efforts on enhancing speech for human consumption.

It is well known that normal-hearing people [8],[9],[12] do not need all the information to understand a speech signal. Therefore, conventional speech enhancement techniques, like Spectral Subtraction (SS), produce a more intelligible signal by removing background noise but generate the so-called musical noise as a side effect.

In this paper we employ SS as a preprocessing stage and subsequently apply two-dimensional image processing techniques to the filtered spectrogram, producing a perceptually enhanced signal in which musical noise has been reduced. As in the work presented in [3], our goal is to emphasize the areas of interest by morphological filtering as well as to eliminate as much noise as possible by mimicking some properties of the human auditory system (HAS), such as masking effects. However, we have replaced the binary mask (thus avoiding the need for thresholding) with the full gray-scale spectrogram information, and we propose anisotropic structural elements based on the spectro-temporal masking of the HAS.

C.M. Travieso-González, J.B. Alonso-Hernández (Eds.): NOLISP 2011, LNAI 7015, pp. 224–231, 2011. © Springer-Verlag Berlin Heidelberg 2011

In order to evaluate our proposal thoroughly we have employed a large number of speech utterances (section 4.2), which, on the other hand, precludes the use of subjective quality measures. For this reason, estimates of these subjective opinions are computed from a set of objective quality measures [7].

This paper is organized as follows. In section 2, we present the preprocessing stage: spectrogram calculation and spectral subtraction. Section 3 is devoted to the explanation of our proposed method, and section 4 describes the experiments and results. Section 5 ends with some conclusions and ideas for future work.

2 Spectrogram and Spectral Subtraction

A spectrogram [10] expresses the speech signal spectral energy density as a function of time. It shows the temporal evolution of formant positions, harmonics and other components of speech. Spectrograms are usually displayed as gray-scale or heatmap images. Typically, the larger energy magnitudes in the spectrum are displayed in white (or warm, in the case of heatmaps) colors and the valleys (e.g. silences) in dark (or cold) colors. This is illustrated in Figure 1 by the spectrograms for clean and noisy signals.

Fig. 1. Left panel shows the spectrogram of a clean utterance. Right panel shows the spectrogram of the same utterance with added metro noise at 10 dB SNR (see Section 4).

A conventional SS procedure [1] is applied to the noisy spectrogram as a preprocessing stage that will also be regarded as a baseline system. However, this method is known to exhibit what is called musical noise, i.e., rough transitions between the speech signal and the areas with removed noise become noticeable and unpleasant to a human listener. Our proposal attenuates this behavior, as the two-dimensional processing inherently produces some temporal smoothing while preserving the main speech features.
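As an illustration of the preprocessing stage, the following is a minimal sketch of spectral subtraction with over-subtraction and a spectral floor, in the spirit of [1]. The function names, the default values of alpha and beta, and the estimation of the noise from the first frames of the utterance are assumptions made for this example; they are not taken from the paper.

```python
import numpy as np


def estimate_noise(noisy_power, n_noise_frames=5):
    """Average the first frames, assumed (hypothetically) to contain noise only."""
    return noisy_power[:, :n_noise_frames].mean(axis=1)


def spectral_subtraction(noisy_power, noise_power, alpha=4.0, beta=0.01):
    """Over-subtraction with a spectral floor on a power spectrogram.

    noisy_power: array (n_bins, n_frames) with the noisy power per bin and frame.
    noise_power: array (n_bins,) with the estimated noise power per bin.
    alpha: over-subtraction factor (illustrative default).
    beta: spectral floor factor (illustrative default).
    """
    noise = noise_power[:, np.newaxis]
    subtracted = noisy_power - alpha * noise
    floor = beta * noise
    # Bins that would fall below the floor are set to the floor instead.
    return np.where(subtracted > floor, subtracted, floor)
```

Isolated bins that survive the subtraction in otherwise floored regions are one way to picture the residual peaks perceived as musical noise, which the subsequent two-dimensional filtering is meant to attenuate.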

3 Morphological Filtering

Morphological filtering [5] is a tool for extracting image components that are useful for purposes such as thinning, pruning, structure enhancement, object marking, segmentation and noise filtering. It may be used on both black and white and gray-scale images. In this paper, we put forward that, for our purposes, its application to gray-scale spectrograms is more advantageous.

As in [3], we used the morphological operation known as opening. Opening consists of performing erosion followed by dilation in order to remove small objects from images. Erosion removes noise, and dilation amplifies shapes and fills holes. The remaining objects in the image are smoothed versions of the original objects.

The goal is to remove most of the remaining noise and enhance the time-frequency components of the speech signal. From this operation we obtain a normalized “mask” that is subsequently applied to the “noisy spectrogram” to produce the filtered speech signal.
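The opening operation described above can be sketched for gray-scale images with a flat structuring element as follows. This is a plain NumPy illustration, not the implementation used in the experiments; the helper names are hypothetical, and the sketch assumes odd-sized structuring elements whose origin is the central pixel.

```python
import numpy as np


def _scan(image, se, reduce_fn, pad_value):
    """Reduce each pixel's SE-shaped neighborhood with reduce_fn (min or max)."""
    image = np.asarray(image, dtype=float)
    se = np.asarray(se, dtype=bool)
    rows, cols = se.shape
    if rows % 2 == 0 or cols % 2 == 0:
        raise ValueError("this sketch assumes odd-sized SEs centered on their origin")
    top, left = rows // 2, cols // 2
    # Pad so that pixels outside the image never win the min/max.
    padded = np.pad(image, ((top, top), (left, left)),
                    mode="constant", constant_values=pad_value)
    out = np.full(image.shape, pad_value, dtype=float)
    height, width = image.shape
    for r in range(rows):
        for c in range(cols):
            if se[r, c]:
                out = reduce_fn(out, padded[r:r + height, c:c + width])
    return out


def erode(image, se):
    """Gray-scale erosion: minimum over the SE neighborhood."""
    return _scan(image, se, np.minimum, np.inf)


def dilate(image, se):
    """Gray-scale dilation: maximum over the reflected SE neighborhood."""
    return _scan(image, np.asarray(se)[::-1, ::-1], np.maximum, -np.inf)


def opening(image, se):
    """Erosion followed by dilation with the same SE."""
    return dilate(erode(image, se), se)
```

With a 3x3 square SE, an isolated bright pixel is removed, whereas a bright 3x3 block is preserved, which matches the intuition of removing small objects while keeping larger structures.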

3.1 Structuring Elements and Mask

Each mask was treated using a different Structuring Element (SE). From the observation of the irregular shapes of the objects in the spectrogram (i.e. the evolution of formants and harmonics) we decided to test different SEs.

The first attempt was the combination of 3 different anisotropic¹ SEs (Figure 2): 3 rectangles of different sizes and angles (0°, 45° and 90°). The mask is obtained as a combination of those generated by the 3 different SEs used independently. After the combination, we normalize the mask and then multiply it (or sum it, if a logarithmic scale is used) with the “noisy spectrogram”, pixel by pixel. Values equal to 0 in the mask are replaced by a small value (close to 0) to avoid introducing musical noise into the filtered signal.

Fig. 2. Anisotropic SE: rectangles of different sizes and angles
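The mask construction described above can be sketched as follows. Several details are not specified in the text and are assumptions of this example: the SEs are one-pixel-thick lines of a hypothetical length, the per-SE results are combined by a pixel-wise maximum, the normalization is a min-max scaling to [0, 1], and the logarithmic variant assumes a natural-log spectrogram.

```python
import numpy as np


def rectangle_ses(length=5):
    """Three flat SEs at 0, 45 and 90 degrees (hypothetical sizes).

    The visual orientation of the diagonal depends on how the rows
    (frequency bins) are displayed.
    """
    horizontal = np.ones((1, length), dtype=bool)
    diagonal = np.eye(length, dtype=bool)[::-1]
    vertical = np.ones((length, 1), dtype=bool)
    return horizontal, diagonal, vertical


def combine_masks(opened_list):
    """Combine the per-SE openings by pixel-wise maximum (an assumption)."""
    return np.maximum.reduce([np.asarray(m, dtype=float) for m in opened_list])


def normalize_mask(mask, floor=1e-3):
    """Min-max normalize to [0, 1] and replace exact zeros by a small value."""
    lo, hi = mask.min(), mask.max()
    if hi == lo:
        return np.ones_like(mask)  # flat mask: leave the spectrogram untouched
    norm = (mask - lo) / (hi - lo)
    return np.where(norm == 0.0, floor, norm)


def apply_mask(noisy_spec, mask, log_scale=False):
    """Multiply pixel by pixel, or add log(mask) to a natural-log spectrogram."""
    if log_scale:
        return noisy_spec + np.log(mask)
    return noisy_spec * mask
```

Replacing the zeros before applying the mask also keeps the logarithmic variant finite, since log(0) would otherwise produce minus infinity.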

¹ Isotropic: uniform in all orientations. Anisotropic: non-uniform, asymmetric.


The second attempt used a single anisotropic SE (Figure 3), avoiding the need to combine masks. Its design is inspired by the masking effect in the human auditory system (HAS), both in time and in frequency [13],[4]. On the time scale the masking effect is asymmetric and there are two different effects: the masking before the masker (pre-masking or backward masking) and the masking after the masker (post-masking or forward masking). Both effects depend on the duration of the masker, and post-masking lasts longer than pre-masking. In the frequency domain the masking effect is asymmetric on a linear scale and almost symmetric on the critical-band scale, which is the one we use. The procedure to obtain the mask and combine it with the spectrogram is the same as already explained.

Fig. 3. Left panel shows the masking effect in the human auditory system. Right panel shows the structuring element designed to emulate that effect.
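To make the asymmetry concrete, the sketch below builds a flat SE that extends further toward later frames than toward earlier ones and is symmetric in frequency. The exact shape and size of the SE in Figure 3 are not given in the text, so the rectangular footprint, the function name and the default extents are purely illustrative assumptions.

```python
import numpy as np


def masking_se(pre_frames=1, post_frames=4, freq_bins=1):
    """Flat SE with rows = frequency and columns = time, origin at the center.

    pre_frames:  extent before the origin (pre-masking), hypothetical value.
    post_frames: extent after the origin (post-masking), hypothetical value.
    freq_bins:   symmetric extent above and below the origin in frequency.
    """
    half_t = max(pre_frames, post_frames)
    rows = 2 * freq_bins + 1
    cols = 2 * half_t + 1  # odd width keeps the origin on the central column
    se = np.zeros((rows, cols), dtype=bool)
    se[:, half_t - pre_frames: half_t + post_frames + 1] = True
    return se
```

With the default values the element spans one frame before the origin and four frames after it, reflecting the longer duration of post-masking.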

4 Experiments

In this section we present the evaluation of our proposed method (morphological filtering with two different anisotropic SEs, Figs. 2 and 3) on a speech enhancement task. A block diagram of our proposed procedure is shown in Figure 4.

Fig. 4. Proposed filtering procedure, step by step

4.1 General Description

The first step is to obtain the spectrogram of the noisy speech signal, sampled at 8 kHz. A resolution of 128 pixels (256-point FFT) is used for every spectrogram, as it was empirically determined to be appropriate for this task. Next, a conventional SS is applied to the noisy spectrogram and the contrast of the resulting gray-scale image is increased. The idea behind these operations is to emphasize the speech signal over the remaining noise and thus make the subsequent morphological filtering easier. The latter is performed by applying an opening operation. Finally, the filtered signal in the time domain is recovered using a conventional overlap-add method.
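The analysis and synthesis ends of this chain can be sketched as follows. The sampling rate and FFT size come from the text; the periodic Hann window, the 50% overlap and the reuse of the noisy phase at synthesis are assumptions of this example. Note that a 256-point FFT yields 129 non-negative frequency bins including the Nyquist bin; how the paper arrives at 128 pixels is not detailed.

```python
import numpy as np

FS = 8000          # sampling rate in Hz (from the text)
N_FFT = 256        # FFT size (from the text)
HOP = N_FFT // 2   # 50% overlap (assumption)


def _window():
    # Periodic Hann window: shifted copies at 50% overlap sum to one.
    return np.hanning(N_FFT + 1)[:-1]


def stft(signal):
    """Return a complex spectrogram of shape (N_FFT // 2 + 1, n_frames).

    Assumes the signal has at least N_FFT samples.
    """
    signal = np.asarray(signal, dtype=float)
    window = _window()
    n_frames = 1 + (len(signal) - N_FFT) // HOP
    frames = np.stack(
        [signal[i * HOP: i * HOP + N_FFT] * window for i in range(n_frames)],
        axis=1,
    )
    return np.fft.rfft(frames, axis=0)


def istft(spec):
    """Overlap-add synthesis of the frames produced by stft."""
    frames = np.fft.irfft(spec, n=N_FFT, axis=0)
    n_frames = frames.shape[1]
    out = np.zeros(N_FFT + HOP * (n_frames - 1))
    for i in range(n_frames):
        out[i * HOP: i * HOP + N_FFT] += frames[:, i]
    return out


def filter_signal(noisy, magnitude_fn):
    """Modify the magnitude spectrogram and resynthesize with the noisy phase."""
    spec = stft(noisy)
    magnitude, phase = np.abs(spec), np.angle(spec)
    return istft(magnitude_fn(magnitude) * np.exp(1j * phase))
```

Here magnitude_fn stands for whatever processing is applied to the spectrogram image (SS, contrast increase, opening and masking). With the identity function, the interior of the signal is reconstructed exactly, which is a convenient sanity check before inserting any filtering.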


4.2 AURORA Database

The evaluation of the signals filtered with the proposed method was conducted on the AURORA Project Database [6], which makes use of a speech database based on TI digits with artificially added noise over a range of SNRs.

We have considered four noise types: metro, car, airport and restaurant noise. We employed around a thousand speech files, each contaminated individually with every additive noise at 5 different SNR values (-5 dB, 0 dB, 5 dB, 10 dB and 15 dB). In this paper, a clean speech signal with additive noise (regardless of the SNR) is called a noisy signal.

4.3 Estimation of Perceptual Quality with Objective Quality Measures

We used three objective quality measures (OQM) to evaluate the filtered signals: Sig (adequate for predicting the distortion of the speech), Bak (for predicting background intrusiveness or interference) and Ovl (for predicting the overall quality). These measures [7] consist of combinations of the following: Log-Likelihood Ratio (LLR), Weighted-Slope Spectral Distance (WSS), Segmental Signal-to-Noise Ratio (segSNR) and Perceptual Evaluation of Speech Quality (PESQ). All the OQM are evaluated on a five-point scale where 1 is the worst scenario and 5 the best.
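For readers who wish to build comparable noisy signals, the sketch below scales a noise segment so that the mixture reaches a target SNR computed from the average power of the whole utterance. This is a simplified assumption made for illustration; the procedure actually used to build the database is described in [6] and may differ (for instance in how the speech level is measured). The function name is hypothetical.

```python
import numpy as np


def mix_at_snr(clean, noise, snr_db):
    """Add noise to clean speech at a target global SNR in dB.

    Assumes the noise is at least as long as the clean signal
    and that neither signal is silent.
    """
    clean = np.asarray(clean, dtype=float)
    noise = np.asarray(noise, dtype=float)[:len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # SNR = 10 log10(P_clean / P_noise), solved for the required noise power.
    target_noise_power = clean_power / (10.0 ** (snr_db / 10.0))
    gain = np.sqrt(target_noise_power / noise_power)
    return clean + gain * noise
```

Calling the function once per SNR value (-5, 0, 5, 10 and 15 dB) with the same clean and noise signals yields one noisy signal per condition.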
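Of the four component measures, segSNR is the simplest to illustrate. The following is a minimal sketch of a frame-wise SNR averaged over the utterance; the frame length, the hop and the clamping of each frame to the range [-10, 35] dB are common conventions adopted here as assumptions, and the code used for the experiments is the one referenced in section 4.4, not this sketch.

```python
import numpy as np


def segmental_snr(clean, processed, frame_len=256, hop=128,
                  min_db=-10.0, max_db=35.0):
    """Mean of per-frame SNRs in dB, each clamped to [min_db, max_db].

    Assumes both signals are time-aligned and contain at least one full frame.
    """
    clean = np.asarray(clean, dtype=float)
    processed = np.asarray(processed, dtype=float)
    n = min(len(clean), len(processed))
    eps = np.finfo(float).eps  # avoids division by zero and log of zero
    values = []
    for start in range(0, n - frame_len + 1, hop):
        ref = clean[start:start + frame_len]
        err = ref - processed[start:start + frame_len]
        snr = 10.0 * np.log10((np.sum(ref ** 2) + eps) / (np.sum(err ** 2) + eps))
        values.append(np.clip(snr, min_db, max_db))
    return float(np.mean(values))
```

Because the error is computed sample by sample, even a small delay between the reference and the processed signal changes the result considerably, which is why the measure requires carefully aligned signals.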

4.4 Results

In order to evaluate the performance of the proposed method we have used the measures mentioned in section 4.3, computed with the code available in [11] and taking the clean speech signal as the reference. We have compared five different methods. The first one corresponds to spectral subtraction and the other four correspond to different morphological filters: black and white mask with an isotropic SE, black and white mask with anisotropic-SE, gray-scale mask with anisotropic-SE and gray-scale mask with anisotropic-SE-2. The last two are the proposed methods: anisotropic-SE refers to the rectangles and anisotropic-SE-2 to the HAS-inspired SE.

Overall, similar trends have been observed for all of the noises, with the results for car and metro on the one hand, and for restaurant and airport on the other, being very similar. Therefore, we have chosen metro and airport as a representative sample.

Metro Noise. Results for Metro noise and several SNRs in terms of the relative measures with respect to the noisy signal are shown in Figure 5a.

As can be observed, the method with gray-scale mask and anisotropic SE (Gray & aSE) provides the best performance for high SNRs in terms of Sig. The Gray & aSE-2 method is better only at low SNRs. The largest margin with respect to the other three methods is obtained for SNR = -5 dB.

With respect to the Bak measure, the Gray & aSE method achieves the best performance for SNRs of -5 dB, 0 dB and 5 dB. However, the filtering with black and white mask and isotropic SE (BW & iSE) reaches the highest values of Bak for higher SNRs (10 dB and 15 dB). It is worth mentioning that this measure employs segSNR, which is known to be very sensitive to misalignments.

The best results for the Ovl measure are obtained for SNRs of 0 dB, 5 dB and 10 dB when using the Gray & aSE filtering. For SNR = -5 dB the Gray & aSE-2 method is the best. For SNR = 15 dB the BW & iSE method is slightly better.

In summary, for the Metro noise, the proposed methods (and, in general, the use of anisotropic structural elements) provide the best performance for low and medium SNRs (-5 dB, 0 dB and 5 dB). For higher SNRs, where the speech signal may not need to be denoised, the filtering with black and white mask and isotropic SE performs similarly to the other methods, or slightly better in terms of Bak.

Airport Noise. Figure 5b shows results for airport noise and several SNRs in terms of relative Sig, Bak and Ovl measures.

First of all, it is worth mentioning that for low SNRs, all the evaluated methods degrade the quality of the processed signals. One possible explanation is the acoustic nature of the airport environment, in which babble noise is present. Spectrograms of babble noise show the typical energy distribution of speech, which makes denoising speech signals contaminated in this way more difficult.

As can be observed, in terms of the Sig and Ovl measures, the Gray & aSE and the Gray & aSE-2 methods achieve the best performance, but the latter is more suitable for the low range of SNRs. For the Bak measure, the Gray & aSE-2 and Gray & aSE methods provide the highest performance at low and medium SNRs (-5 dB, 0 dB, 5 dB and 10 dB), respectively; the exception is SNR = 15 dB, at which both SS and BW & iSE filtering perform better.

5 Conclusions and Future Work

In this paper we have explored an alternative to the morphological filtering for speech enhancement and noise compensation proposed in [3]. In particular, we have proposed the use of morphological filtering with anisotropic structural elements, motivated by the HAS, applied to gray-scale spectrograms.

Looking at the results for both noises in the speech enhancement task, we can infer that the proposed methods (using aSE and aSE-2) provide better performance than the other alternatives for SNRs of -5 dB, 0 dB and 5 dB, a very important range of SNRs for speech enhancement. In addition, the proposed methods seem to be more suitable for non-stationary noise. However, subjective measures of the different alternatives could shed more light on the evaluation, given that the objective estimates employed in this paper have several limitations.

For future work, we plan to explore other shapes for the anisotropic structural elements, with the aim of emulating the filtering effects (in time and frequency) of the human ear. Experimentation on real noisy signals, instead of the artificially distorted ones employed in this paper, is also desirable. Finally, we intend to extend the experiments to ASR, as in [3], to provide a better input to the feature extraction stage.

Fig. 5. (a) Metro noise. (b) Airport noise. From the top panel to the bottom panel: relative improvements for the Objective Quality Measures. Five different methods are compared (SS: Spectral subtraction, BW: Black and white mask, Gray: Gray-scale mask, iSE: Isotropic SE, aSE: Anisotropic SE).

Acknowledgments. This work has been partially supported by the Spanish Ministry of Science and Innovation CICYT Project No. TEC2008-06382/TEC.


References

1. Berouti, M., Schwartz, R., Makhoul, J.: Enhancement of speech corrupted by acoustic noise. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 1979, vol. 4, pp. 208–211. IEEE (1979)

2. ten Bosch, L., Kirchhoff, K.: Editorial note: Bridging the gap between human and automatic speech recognition. Speech Communication 49(5), 331–335 (2007)

3. Evans, N., Mason, J., Roach, M., et al.: Noise compensation using spectrogram morphological filtering. In: Proc. 4th IASTED International Conf. on Signal and Image Processing, pp. 157–161 (2002)

4. Flynn, R., Jones, E.: Combined speech enhancement and auditory modelling for robust distributed speech recognition. Speech Communication 50(10), 797–809 (2008)

5. Gonzalez, R., Woods, R.: Digital image processing (1993)

6. Hirsch, H., Pearce, D.: The aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions. In: ASR 2000 - Automatic Speech Recognition: Challenges for the new Millenium, ISCA Tutorial and Research Workshop, ITRW (2000)

7. Hu, Y., Loizou, P.: Evaluation of objective quality measures for speech enhancement. IEEE Transactions on Audio, Speech, and Language Processing 16(1), 229–238 (2008)

8. Li, N., Loizou, P.: Factors influencing glimpsing of speech in noise. The Journal of the Acoustical Society of America 122(2)

9. Li, N., Loizou, P.: Effect of spectral resolution on the intelligibility of ideal binary masked speech. The Journal of the Acoustical Society of America 123(4) (2008)

10. Loizou, P.: Speech enhancement: Theory and practice (2007)

11. Loizou, P.: Matlab software (January 2011), http://www.utdallas.edu/~loizou/speech/software.htm

12. Wang, D., Kjems, U., Pedersen, M., Boldt, J., Lunner, T.: Speech perception of noise with binary gains. The Journal of the Acoustical Society of America 124(4), 2303–2307 (2008)

13. Zwicker, E., Zwicker, U.: Audio engineering and psychoacoustics: Matching signals to the final receiver, the human auditory system. J. Audio Eng. Soc. 39(3), 115–126 (1991)
