
Microphone Array Post-filter based on Spatially-Correlated Noise Measurements for Distant Speech Recognition

Kenichi Kumatani, Disney Research, Pittsburgh
Bhiksha Raj, Carnegie Mellon University
Rita Singh, Carnegie Mellon University
John McDonough, Carnegie Mellon University

Organization of Presentation
- Our Goal: Distant Speech Recognition (DSR)
- Background
- Conventional Post-filtering Methods
- Motivation
- Our Post-filtering Method
- DSR Experiments on Real Array Data
- Conclusions

Our Goal: Distant Speech Recognition (DSR) System

Goal: Replace the close-talking microphone with far-field sensors to make human-machine interfaces more interactive.

Overview of our DSR System:
[Block diagram: Distant speech -> Microphone array -> Speaker Tracking (speaker's position) -> Beamforming -> Post-filtering (enhanced speech) -> Speech Recognition -> Recognition result]

Merits of this Approach: By using the geometry of the microphone array and the speaker's position, our system offers stable performance in real environments and extends straightforwardly to other information sources. Avoid being blind!

Background of this Work

Beamforming does not provide the optimal solution in the minimum mean square error (MMSE) sense. Post-filtering can further improve speech recognition performance.

Basic Block Chart:
[Block diagram: Multi-channel Input Vector X -> Time Delay Compensation -> Beamforming -> Post-filtering H, with a Post-filter Estimation block that supplies H]

Key issue: Estimate the power spectral densities (PSDs) of the target and noise signals to build the Wiener filter.

Conventional Post-filter Design Method 1

Zelinski Post-filter: Zelinski assumed that
- the target and noise signals are uncorrelated,
- the noise signals are uncorrelated between different channels, and
- the noise PSD is the same in all channels.

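Then the auto- and cross-spectral densities between two channels i and j simplify, because the cross terms between target and noise and between the noise of different channels are 0:

$$\phi_{x_i x_i} = \phi_{ss} + \phi_{nn} + 2\,\Re\{\phi_{s n_i}\} = \phi_{ss} + \phi_{nn},$$

$$\phi_{x_i x_j} = \phi_{ss} + \phi_{s n_j} + \phi_{n_i s} + \phi_{n_i n_j} = \phi_{ss} \quad (i \neq j).$$

By substituting them into the Wiener filter formulation, we have the Zelinski post-filter for an array of M sensors:

$$H_z = \frac{\frac{2}{M(M-1)} \sum_{i=1}^{M-1} \sum_{j=i+1}^{M} \Re\{\phi_{x_i x_j}\}}{\frac{1}{M} \sum_{i=1}^{M} \phi_{x_i x_i}}.$$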
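The Zelinski estimate can be illustrated with a minimal Python sketch for a single frequency bin. It assumes the channels are already time-aligned and that the spectral densities are approximated by plain averages over frames; the function name `zelinski_gain`, the helper `complex_gauss` and the synthetic two-channel data are hypothetical and not part of the original slides.

```python
import random

def zelinski_gain(frames):
    """Zelinski post-filter gain for one frequency bin.

    frames: list of frames; each frame is a list of M complex subband
    samples, one per time-aligned channel (assumed input layout).
    """
    num_frames = len(frames)
    m = len(frames[0])
    # Denominator: auto-spectral densities averaged over frames and channels.
    auto = sum(abs(x) ** 2 for frame in frames for x in frame) / (num_frames * m)
    # Numerator: real part of the cross-spectral densities,
    # averaged over frames and over all sensor pairs i < j.
    cross = 0.0
    for frame in frames:
        for i in range(m - 1):
            for j in range(i + 1, m):
                cross += (frame[i] * frame[j].conjugate()).real
    cross /= num_frames * m * (m - 1) / 2
    return cross / auto

def complex_gauss(rng, power):
    """Circular complex Gaussian sample with the given power (illustration only)."""
    s = (power / 2.0) ** 0.5
    return complex(rng.gauss(0.0, s), rng.gauss(0.0, s))

# Synthetic check: unit-power target plus unit-power noise that is
# uncorrelated across two channels, so the ideal Wiener gain is 0.5.
rng = random.Random(0)
frames = []
for _ in range(5000):
    target = complex_gauss(rng, 1.0)
    frames.append([target + complex_gauss(rng, 1.0) for _ in range(2)])
gain = zelinski_gain(frames)
```

With these synthetic inputs the gain should come out close to 0.5. When the noise is correlated between channels, the numerator also picks up noise power and the estimate is biased upward.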

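Conventional Post-filter Design Method 2

Issues of the Zelinski Post-filter: In many situations, the noise signals are spatially correlated.

McCowan Post-filter: McCowan and Bourlard introduced the coherence of the diffuse noise field, an indicator of the similarity of signals at different positions:

$$\Gamma_{ij}(\omega) = \frac{\sin(\omega d_{ij} / c)}{\omega d_{ij} / c},$$

where d_{ij} is the distance between sensors i and j and c is the speed of sound, and compute the cross- and auto-spectral densities as

$$\phi_{x_i x_i} = \phi_{ss} + \phi_{nn}, \quad \phi_{x_j x_j} = \phi_{ss} + \phi_{nn}, \quad \phi_{x_i x_j} = \phi_{ss} + \Gamma_{ij}\,\phi_{nn}.$$

Then, the McCowan post-filter can be written as

$$H_m = \frac{\frac{2}{M(M-1)} \sum_{i=1}^{M-1} \sum_{j=i+1}^{M} \hat{\phi}_{ss}^{(ij)}}{\frac{1}{M} \sum_{i=1}^{M} \phi_{x_i x_i}}, \qquad \hat{\phi}_{ss}^{(ij)} = \frac{\Re\{\phi_{x_i x_j}\} - \frac{1}{2}\,\Re\{\Gamma_{ij}\}\,(\phi_{x_i x_i} + \phi_{x_j x_j})}{1 - \Re\{\Gamma_{ij}\}},$$

where \hat{\phi}_{ss}^{(ij)} is a PSD estimate of the target signal for each sensor pair. This is different from the Zelinski method.

Lefkimmiatis Post-filter: Lefkimmiatis et al. model the diffuse noise field more accurately by applying the coherence to the denominator of the McCowan post-filter.

Motivation of our Method

Common Problem of Conventional Methods: The static noise field model will not match every situation.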
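As a concrete instance of such a static noise field model, the diffuse-field coherence and the per-pair target PSD estimate of the McCowan post-filter can be sketched as follows. The function names, the sound speed of 343 m/s and the example numbers are assumptions made for illustration; only the 3.8 cm sensor spacing matches the array used in the experiments.

```python
import math

def diffuse_coherence(freq_hz, distance_m, sound_speed=343.0):
    """Coherence of a diffuse (spherically isotropic) noise field
    between two sensors: sin(x) / x with x = omega * d / c."""
    x = 2.0 * math.pi * freq_hz * distance_m / sound_speed
    return 1.0 if x == 0.0 else math.sin(x) / x

def mccowan_target_psd(phi_ii, phi_jj, phi_ij, gamma):
    """Per-pair target PSD estimate under the coherence model.

    phi_ii, phi_jj: auto-spectral densities of channels i and j.
    phi_ij: cross-spectral density (complex), gamma: real coherence.
    """
    return (phi_ij.real - 0.5 * gamma * (phi_ii + phi_jj)) / (1.0 - gamma)

# Coherence predicted by the diffuse model at 1 kHz for a 3.8 cm spacing.
gamma_1khz = diffuse_coherence(1000.0, 0.038)

# Consistency check with made-up densities: phi_ss = 2, phi_nn = 1, gamma = 0.5
# gives phi_ii = phi_jj = 3 and phi_ij = 2 + 0.5 * 1 = 2.5.
recovered = mccowan_target_psd(3.0, 3.0, complex(2.5, 0.0), 0.5)
```

For closely spaced sensors the model coherence stays near 1 at low frequencies, so the denominator `1 - gamma` becomes small; practical implementations usually cap the coherence below 1 to keep the estimate stable.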

Example of Noise Coherence in a Car: The figures show the magnitude-squared coherence observed in a car.
[Figure panels: Engine idling state; Driving at a speed of 65 mph]
It is clear that the actual noise field is neither uncorrelated nor diffuse.

Our Motivation: Measure the most dominant noise signal instead of relying on those static noise field assumptions.
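The magnitude-squared coherence shown in such figures can be estimated per frequency bin from subband samples of two channels. The sketch below is a minimal version that replaces expectations with sums over frames; the function name and the synthetic data are hypothetical.

```python
import random

def magnitude_squared_coherence(x1, x2):
    """Estimate |phi_12|^2 / (phi_11 * phi_22) for one frequency bin.

    x1, x2: equal-length sequences of complex subband samples from two
    channels; expectations are replaced by sums over the frames.
    """
    cross = sum(a * b.conjugate() for a, b in zip(x1, x2))
    power1 = sum(abs(a) ** 2 for a in x1)
    power2 = sum(abs(b) ** 2 for b in x2)
    return abs(cross) ** 2 / (power1 * power2)

rng = random.Random(1)
noise_a = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(2000)]
noise_b = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(2000)]

msc_identical = magnitude_squared_coherence(noise_a, noise_a)    # fully coherent
msc_independent = magnitude_squared_coherence(noise_a, noise_b)  # nearly incoherent
```

Values near 0 across frequency would support the uncorrelated-noise assumption, and values following the sinc-shaped curve would support the diffuse assumption; the car measurements on this slide match neither.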

Our Strategy: How can we measure a noise signal?
1. Estimate the speaker's position,
2. Build a beamformer and steer a beam toward the target source,
3. Find where the most dominant interfering source is, and
4. Build another beamformer to measure a noise signal.

[Block diagram: the microphones observe the Speaker and the Noise. Beamformer 1 (for the target speech) feeds the Post-filter (further noise removal), which outputs the enhanced speech. Beamformer 2 (Noise Extractor), given the steering direction for the noise source, supplies the separated noise to the Post-filter.]

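Our Post-filter System

We build a maximum negentropy beamformer for the target source and a null-steering beamformer for extracting the noise signal.

[Block diagram: the input X feeds two branches. The Maximum Negentropy Beamformer (w_SD^H, B^H, w_a^H, with a subtraction node) serves the target source; the Null-steering Beamformer (w_null^H) serves the noise source. Both outputs feed the post-filter estimation, which sets the post-filter H_p.]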

Our Post-filter System: Maximum Negentropy (MN) Beamformer (Speech Emphasizer)

[Block diagram of the post-filter system (labels: X, w_SD^H, B^H, w_a^H, w_null^H, post-filter estimation, H_p), with the MN beamformer for the target source highlighted.]

Maximum Negentropy Criterion:
- The distribution of clean speech is non-Gaussian, and that of noisy and reverberant speech becomes closer to Gaussian.
- Negentropy is an indicator of how far the distribution of a signal is from Gaussian.

Maximum Negentropy Beamformer:
1. Build a super-directive beamformer for the quiescent vector w_SD.
2. Compute the blocking matrix B to maintain the distortionless constraint for the look direction, B^H w_SD = 0.
3. Find the active weight vector w_a that maximizes the negentropy of the output Y_SDMN = (w_SD - B w_a)^H X.

Advantage: We can enhance a structured, information-bearing signal coming from the direction of interest without signal cancellation or distortion.
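The role of the blocking matrix can be illustrated for a two-sensor array. The sketch below is a simplification under stated assumptions: it uses a delay-and-sum quiescent vector as a stand-in for the super-directive design, a far-field manifold vector for a two-element linear array, and it leaves out the negentropy search for w_a entirely. All function names and numbers are hypothetical.

```python
import cmath
import math

def manifold_vector(freq_hz, spacing_m, angle_rad, sound_speed=343.0):
    """Far-field array manifold vector of a two-sensor linear array (assumed model)."""
    delay = spacing_m * math.cos(angle_rad) / sound_speed
    return [1.0 + 0j, cmath.exp(-2j * math.pi * freq_hz * delay)]

def gsc_weights(v, w_a):
    """Weights w = w_q - B * w_a for two sensors and a scalar active weight w_a."""
    # Delay-and-sum quiescent vector (stand-in for w_SD): w_q^H v = 1.
    w_q = [x / 2.0 for x in v]
    # Blocking vector orthogonal to the look direction: B^H v = 0, hence B^H w_q = 0.
    b = [v[1].conjugate(), -v[0].conjugate()]
    return [q - bi * w_a for q, bi in zip(w_q, b)]

def beamformer_output(w, x):
    """Y = w^H x."""
    return sum(wi.conjugate() * xi for wi, xi in zip(w, x))

v_target = manifold_vector(1000.0, 0.038, math.radians(60.0))
v_other = manifold_vector(1000.0, 0.038, math.radians(150.0))

# The look-direction response stays at 1 whatever the active weight is,
# while the response toward another direction changes with w_a.
look_responses = [beamformer_output(gsc_weights(v_target, w_a), v_target)
                  for w_a in (0j, 0.3 - 0.2j, -1.5 + 2.0j)]
other_responses = [beamformer_output(gsc_weights(v_target, w_a), v_other)
                   for w_a in (0j, 0.3 - 0.2j)]
```

Because the look-direction response is fixed by construction, any criterion used to choose w_a, negentropy in the authors' system, can only reshape the response to signals from other directions.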

Our Post-filter System: Null-Steering Beamformer (Noise Extractor)

[Block diagram of the post-filter system, with the null-steering beamformer for the noise source highlighted.]

Null-steering Beamformer (Noise Extractor):
- Place a null on the direction of interest (DOI) while maintaining unity gain for the direction of the noise source.
- Given the array manifold vectors v for the target source and v_N for the noise source, we obtain the beamformer's weights by solving the linear equation [ v v_N ]^H w_null = [ 0 1 ]^T.

Advantage: We can extract the noise signal alone by eliminating the target signal arriving directly from the source point.
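For a two-sensor array the linear equation above is a 2x2 system and can be solved in closed form. The following sketch assumes a far-field manifold vector for a two-element linear array and uses Cramer's rule; the function names, angles and frequency are hypothetical choices for illustration.

```python
import cmath
import math

def manifold_vector(freq_hz, spacing_m, angle_rad, sound_speed=343.0):
    """Far-field array manifold vector of a two-sensor linear array (assumed model)."""
    delay = spacing_m * math.cos(angle_rad) / sound_speed
    return [1.0 + 0j, cmath.exp(-2j * math.pi * freq_hz * delay)]

def null_steering_weights(v, v_n):
    """Solve [v v_n]^H w = [0, 1]^T for two sensors by Cramer's rule."""
    a00, a01 = v[0].conjugate(), v[1].conjugate()      # row 1: v^H
    a10, a11 = v_n[0].conjugate(), v_n[1].conjugate()  # row 2: v_n^H
    det = a00 * a11 - a01 * a10
    if abs(det) < 1e-12:
        raise ValueError("target and noise directions cannot be separated")
    return [-a01 / det, a00 / det]

def beamformer_output(w, x):
    """Y = w^H x."""
    return sum(wi.conjugate() * xi for wi, xi in zip(w, x))

v = manifold_vector(1000.0, 0.038, math.radians(90.0))    # target direction
v_n = manifold_vector(1000.0, 0.038, math.radians(20.0))  # noise direction
w_null = null_steering_weights(v, v_n)

target_response = beamformer_output(w_null, v)   # approximately 0
noise_response = beamformer_output(w_null, v_n)  # approximately 1
```

The determinant shrinks as the two directions approach each other or as the frequency goes to zero, which is why the sketch raises an error rather than returning very large weights.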

Our Post-filter Design:

Now that we have an estimate of the target signal, Y_SDMN = (w_SD - B w_a)^H X, and a noise observation, Y_null = w_null^H X, we can design the post-filter H_p as
[equation image missing in the source]

Distant Speech Recognition Experiments

Microphone Array
  No. Sensors: 2
  Distance btw. Sensors: 3.8 cm
Speech Recognizer
  Training Data: WSJ1 corpus corrupted with noise recorded in different cars and noise states
  Language Model: Tri-gram model with 27,000 words
Test Condition
  Test Material: Real data recorded in a car (NOT artificially convolved with measured impulse responses)
  Car states: Engine running in a stationary state (Idle), moving at speeds of 35 mph (35mph) or 65 mph (65mph), with a fan on (Fan), turning signal on (Turn) and keeping the passenger's window open (Wind).

Speech Recognition Results

[Figure: Word Error Rates in Different Conditions; vertical axis: Word Error Rate]

Conclusions
- We used actual noise measurements for the microphone array post-filter.
- It turned out that the noise fields in car conditions are neither uncorrelated nor spherically isotropic (diffuse).
- It has been demonstrated that our post-filter method can provide the best recognition performance among the popular post-filter methods.
- This is because our method can update the noise PSD adaptively without any static noise coherence assumption.
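The adaptive noise PSD update mentioned in the conclusions can be sketched as a Wiener-type gain driven by the two beamformer outputs. This is not the authors' exact formula, which is not recoverable from the slide text: the sketch assumes first-order recursive smoothing of |Y_SDMN|^2 and |Y_null|^2 and the gain phi_s / (phi_s + phi_n). The function names and the forgetting factor of 0.9 are hypothetical.

```python
def smooth_psd(previous, sample, forgetting=0.9):
    """First-order recursive PSD estimate (assumed smoothing rule)."""
    return forgetting * previous + (1.0 - forgetting) * abs(sample) ** 2

def wiener_gain(phi_target, phi_noise, floor=1e-12):
    """Wiener-type gain phi_s / (phi_s + phi_n), guarded against division by zero."""
    return phi_target / max(phi_target + phi_noise, floor)

def post_filter(y_target, y_noise, forgetting=0.9):
    """Apply a time-varying gain to the target beamformer output of one bin.

    y_target: subband samples of the speech-emphasizing beamformer.
    y_noise:  subband samples of the noise-extracting beamformer.
    """
    phi_s = 0.0
    phi_n = 0.0
    enhanced = []
    for y, n in zip(y_target, y_noise):
        phi_s = smooth_psd(phi_s, y, forgetting)  # tracks the target branch
        phi_n = smooth_psd(phi_n, n, forgetting)  # tracks the measured noise
        enhanced.append(wiener_gain(phi_s, phi_n) * y)
    return enhanced

# With no measured noise the signal passes unchanged; with equal
# target and noise power every sample is scaled by 0.5.
passthrough = post_filter([1 + 1j, 2 + 0j, -1j], [0j, 0j, 0j])
halved = post_filter([1 + 0j] * 50, [1 + 0j] * 50)
```

Because phi_n follows the measured noise frame by frame, the gain adapts when the dominant noise changes, which a fixed coherence model cannot do.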

Thank you

Speech Samples (65-Wind)

- Single Distant Channel
- Post-filtered Speech
- Extracted Noise Signal

Actual Speech Distribution ~ Super-Gaussian

[Figure: distributions of clean speech compared with super-Gaussian distributions]
The distribution of speech is not Gaussian: it is spiky and heavy-tailed, which is characteristic of super-Gaussian distributions.
*The histograms are computed from the real part of actual subband samples.

How about maximizing the degree of super-Gaussianity?

Why do we need non-Gaussianity measures?

The reasoning rests on two points:
1. The distribution of a sum of independent random variables (r.v.s) approaches Gaussian in the limit as more components are added.
2. Information-bearing signals have a structure that makes them predictable.

If we want the original independent components that bear information, we have to look for a signal that is not Gaussian.

[Figures: distributions of clean and noise-corrupted speech; distributions of clean and reverberated speech]
The distributions of noise-corrupted and reverberated speech are closer to Gaussian than that of clean speech.

Negentropy Criterion for super-Gaussianity

Definition of entropy: The entropy of an r.v. Y is defined as

$$H(Y) = -\int p_Y(v) \log p_Y(v)\, dv.$$

Entropy indicates the degree of uncertainty of information.

Definition of negentropy: Negentropy is defined as the difference between the entropy of a Gaussian r.v. and that of a super-Gaussian r.v.:

$$J(Y) = H(Y_{\mathrm{gauss}}) - H(Y),$$

where H(Y_gauss) is the entropy of a Gaussian r.v. with the same variance as Y and H(Y) is the entropy of the super-Gaussian r.v. Higher negentropy indicates that the distribution of the r.v. is farther from Gaussian. Negentropy is generally more robust than other criteria.
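A small worked example may help: for a real-valued Laplacian r.v., a common super-Gaussian model, both entropies have closed forms, so the negentropy can be evaluated directly. The choice of the Laplacian pdf and the function names are assumptions for illustration; the slides work with complex subband samples and do not name a specific pdf here.

```python
import math

def gaussian_entropy(variance):
    """Differential entropy of a real Gaussian r.v.: 0.5 * ln(2 * pi * e * sigma^2)."""
    return 0.5 * math.log(2.0 * math.pi * math.e * variance)

def laplacian_entropy(variance):
    """Differential entropy of a real Laplacian r.v.: 1 + ln(2 * b)."""
    scale = math.sqrt(variance / 2.0)  # Laplacian variance is 2 * b^2
    return 1.0 + math.log(2.0 * scale)

def laplacian_negentropy(variance=1.0):
    """J(Y) = H(Y_gauss) - H(Y) for a Laplacian Y of the given variance."""
    return gaussian_entropy(variance) - laplacian_entropy(variance)

negentropy_unit = laplacian_negentropy(1.0)
negentropy_scaled = laplacian_negentropy(4.0)
```

The result is about 0.072 nats and does not depend on the variance: negentropy is positive for the super-Gaussian variable and is zero only for a Gaussian one.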

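Analysis of the MN Beamforming Algorithm

Simulated environment by the image method: signal cancellation will occur because of the strong reflection.

[Figure: simulated environment with the target source, a reflection and the image source; labels: 30, 70.9, 4 m. Beam patterns at 650 Hz and 1600 Hz.]

Observe that MN beamforming can enhance the target signal by strengthening the reflection, which suggests that it does not suffer from signal cancellation.

Measures for non-Gaussianity

Definition of kurtosis: The kurtosis of an r.v. Y is defined as

$$\mathrm{kurt}(Y) = E\{|Y|^4\} - \beta \left( E\{|Y|^2\} \right)^2,$$

where \beta is a positive value.
- Super-Gaussian pdfs have positive kurtosis,
- sub-Gaussian pdfs have negative kurtosis, and
- the Gaussian pdf has zero kurtosis.
Kurtosis can measure the degree of non-Gaussianity.

Empirical approximation of kurtosis:

$$\mathrm{kurt}(Y) \approx \frac{1}{K} \sum_{k=0}^{K-1} |Y_k|^4 - \beta \left( \frac{1}{K} \sum_{k=0}^{K-1} |Y_k|^2 \right)^2,$$

where K is the number of frames.

[Figure: Negentropy vs. empirical kurtosis]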
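The empirical kurtosis can be checked on synthetic data. The sketch below assumes real-valued samples and beta = 3, the value for which a real Gaussian r.v. has zero kurtosis; the function name, sample size and random seed are arbitrary illustration choices.

```python
import random

def empirical_kurtosis(samples, beta=3.0):
    """(1/K) * sum |Y_k|^4 - beta * ((1/K) * sum |Y_k|^2)^2 over K frames."""
    k = len(samples)
    fourth_moment = sum(abs(y) ** 4 for y in samples) / k
    second_moment = sum(abs(y) ** 2 for y in samples) / k
    return fourth_moment - beta * second_moment ** 2

rng = random.Random(0)
num_samples = 50000

# Unit-variance Gaussian samples: kurtosis should be close to zero.
gaussian = [rng.gauss(0.0, 1.0) for _ in range(num_samples)]

# Unit-variance Laplacian samples (super-Gaussian): an exponential magnitude
# with rate sqrt(2) and a random sign; kurtosis should be clearly positive.
laplacian = [rng.choice((-1.0, 1.0)) * rng.expovariate(2.0 ** 0.5)
             for _ in range(num_samples)]

kurt_gaussian = empirical_kurtosis(gaussian)
kurt_laplacian = empirical_kurtosis(laplacian)
```

The fourth power makes this measure sensitive to a few large samples, which is consistent with the earlier remark that negentropy is generally the more robust criterion.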

Date post:	15-Jan-2016