Thesis

Introduction

Feature Compensation based on Stereo Data

Feature Compensation based on a Masking Model

Temporal Modelling and Uncertainty Decoding

Conclusions

Noise Robust Speech Recognition of Missing or Uncertain Data

Jose Andres Gonzalez Lopez

Advisors: Dr. Antonio M. Peinado Herreros, Dr. Angel M. Gomez García

Dpt. Signal Theory, Telecommunications and Networking

University of Granada

Ph.D. Defence

February 25th, 2013

1 / 49 Jose A. Gonzalez Noise Robust Speech Recognition of Missing or Uncertain Data




Outline

1 Introduction

2 Feature Compensation based on Stereo Data

3 Feature Compensation based on a Masking Model

4 Temporal Modelling and Uncertainty Decoding

5 Conclusions




1 Introduction




Robust ASR

The performance of ASR (Automatic Speech Recognition) systems degrades when training and testing conditions differ.

This mismatch can be due to different factors:

  Language complexity: grammar, vocabulary, spontaneous speech, ...
  Speaker variability: accent, age, gender, ...
  Environmental factors: background noise, channel distortion, room acoustics, ...

In this work, we will focus on the environmental factors, especially on the background noise and the channel distortion.

Effect of noise on speech: noise modifies the speech distributions and causes loss of information.




Approaches for Noise Robustness

Different approaches to achieve noise robustness: robust feature extraction, model adaptation and feature modification.

Feature compensation enhances the noisy features used for speech recognition.

$y_t$ and $\hat{x}_t$ are, respectively, the feature vectors for noisy speech and estimated clean speech at time $t$.

Uncertainty: information about the reliability of $\hat{x}_t$.




Objectives

Development of a set of compensation techniques for speech feature enhancement.

To do this, a Bayesian estimation framework is adopted here.

Two different approaches for estimating clean speech will be explored:

  Feature compensation based on stereo data: clean and noisy recordings are used to derive a set of transformations applied to noisy speech.
  Feature compensation based on a masking model: parametric models of speech degradation are used to estimate clean speech.

Finally, an uncertainty decoding approach and temporal modelling of speech will also be investigated.




2 Feature Compensation based on Stereo Data




Introduction

Stereo data: simultaneous recordings of clean and noisy speech signals,

$$(X, Y) = (\langle x_1, y_1 \rangle, \langle x_2, y_2 \rangle, \ldots, \langle x_T, y_T \rangle)$$

The stereo data is used to learn the statistical relationship between the clean and noisy feature spaces.

As a result, a set of transformations is derived to enhance speech in a certain acoustic environment.

Acoustic environment: combination of additive and convolutional noises at a given SNR.




MMSE Estimation (I)

MMSE estimation is chosen to obtain suitable estimates for the clean feature vectors,

$$\hat{x} = E[x \mid y] = \int x \, p(x \mid y) \, dx$$

Problem: $p(x \mid y)$ must be expressed in a convenient form.

Solution: clean and noisy feature spaces are represented by VQ codebooks $\mathcal{M}_x$ and $\mathcal{M}_y$, respectively.




MMSE Estimation (II)

Using these codebooks, the MMSE estimation can be expressed as,

$$\hat{x} = \sum_{k_x=1}^{M_x} P(k_x \mid k_y^*) \, \hat{x}^{(k_x)}$$

$P(k_x \mid k_y^*)$: mapping between the clean and noisy cells for a certain environment. Estimated using stereo data.

$\hat{x}^{(k_x)} = E[x \mid y, k_x, k_y^*]$: 3 alternatives (Q-VQMMSE, S-VQMMSE and W-VQMMSE).
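The codebook-based estimator can be sketched in a few lines of Python. This is only an illustration under assumptions the slides do not spell out: the noisy cell is taken to be the nearest codeword to the observation, and the cell mapping is estimated by counting co-occurrences over the stereo pairs. The helper names (`nearest`, `learn_mapping`, `q_vqmmse`) are hypothetical, and the partial estimate used here is the simplest alternative (the clean centroid).

```python
from collections import Counter

def nearest(codebook, v):
    """Index of the codeword closest to v (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(codebook[k], v)))

def learn_mapping(stereo_pairs, cb_x, cb_y):
    """Estimate P(kx | ky) by counting clean/noisy cell co-occurrences
    (assumed training procedure, not taken from the slides)."""
    counts, totals = Counter(), Counter()
    for x, y in stereo_pairs:
        kx, ky = nearest(cb_x, x), nearest(cb_y, y)
        counts[(kx, ky)] += 1
        totals[ky] += 1
    return {(kx, ky): c / totals[ky] for (kx, ky), c in counts.items()}

def q_vqmmse(y, cb_x, cb_y, mapping):
    """Weighted sum of clean centroids, weights P(kx | ky*).
    Returns a zero vector if the noisy cell was never seen in training."""
    ky = nearest(cb_y, y)
    est = [0.0] * len(cb_x[0])
    for kx, mu in enumerate(cb_x):
        p = mapping.get((kx, ky), 0.0)
        est = [e + p * m for e, m in zip(est, mu)]
    return est

# Toy one-dimensional example with two cells per codebook.
cb_x = [[0.0], [10.0]]
cb_y = [[2.0], [12.0]]
pairs = [([0.0], [2.0]), ([0.2], [1.8]), ([9.8], [2.5]), ([10.0], [12.0])]
mapping = learn_mapping(pairs, cb_x, cb_y)
x_hat = q_vqmmse([2.2], cb_x, cb_y, mapping)
```

In the toy data, two of the three noisy vectors in cell 0 come from clean cell 0 and one from clean cell 1, so the estimate is pulled one third of the way towards the second clean centroid.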




Computation of $\hat{x}^{(k_x)}$

Q-VQMMSE

Assumes that both spaces are quantized. Also, this approach assumes that the spaces are independent. Then,

$$\hat{x}^{(k_x)} = \mu_x^{(k_x)}$$

S-VQMMSE

A correction is applied to $y$,

$$\hat{x}^{(k_x)} = y - \left(\mu_y^{(k_y^*)} - \mu_x^{(k_x)}\right) = \mu_x^{(k_x)} + \underbrace{\left(y - \mu_y^{(k_y^*)}\right)}_{\Delta}$$

$\Delta$: quantization error.
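A small Python sketch of the S-VQMMSE correction, assuming one-dimensional lists as feature vectors and a precomputed list of mapping probabilities for the selected noisy cell. The function name `s_vqmmse` and its argument layout are hypothetical.

```python
def s_vqmmse(y, ky, cb_x, cb_y, p_kx_given_ky):
    """Shift every clean centroid by the quantization error of the noisy
    observation (y minus its noisy centroid), then average the shifted
    centroids with the mapping probabilities P(kx | ky)."""
    delta = [yi - mi for yi, mi in zip(y, cb_y[ky])]
    est = [0.0] * len(y)
    for kx, mu_x in enumerate(cb_x):
        p = p_kx_given_ky[kx]
        est = [e + p * (m + d) for e, m, d in zip(est, mu_x, delta)]
    return est

# Toy example: the observation lies 0.5 above the centroid of noisy cell 0.
cb_x = [[0.0], [10.0]]
cb_y = [[2.0], [12.0]]
x_hat = s_vqmmse([2.5], 0, cb_x, cb_y, [0.75, 0.25])
```

Because the probabilities sum to one, the result equals the Q-VQMMSE estimate plus the quantization error, so the fine detail of the observation inside its cell is preserved.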




Improving the Mapping Accuracy

Subregion modelling

$C_y^{(k_x, k_y)}$ is the subset of the noisy cell $k_y$ whose corresponding clean vectors belong to $k_x$.

Similarly, $C_x^{(k_x, k_y)}$ is the subset of $k_x$ whose corresponding noisy vectors are $C_y^{(k_x, k_y)}$.




Whitening-transformation based VQMMSE

W-VQMMSE assumes that the subregions of both feature spaces are Gaussian distributed, e.g.

$$C_x^{(k_x, k_y)} \sim \mathcal{N}\left(\mu_x^{(k_x, k_y)}, \Sigma_x^{(k_x, k_y)}\right)$$

Computation of $E[x \mid y, k_x, k_y]$: the following whitening transformation is applied,

$$E[x \mid y, k_x, k_y] = \mu_x^{(k_x, k_y)} + \left(\Sigma_x^{(k_x, k_y)}\right)^{1/2} \left(\Sigma_y^{(k_x, k_y)}\right)^{-1/2} \left(y - \mu_y^{(k_x, k_y)}\right)$$

After some manipulations the MMSE estimation becomes,

$$\hat{x} = A^{(k_y^*)} y + b^{(k_y^*)}$$

where the parameters of the affine transformation can be precomputed offline for each noisy cell $k_y = 1, \ldots, M_y$.
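One way to read the step from the whitening transformation to the affine form is to substitute the per-subregion expectation into the weighted sum over clean cells and collect the terms that multiply the observation. The Python sketch below does this for diagonal covariances only (as in the dW- variant). The function names and the parameter layout are hypothetical, and the grouping of terms is our reading of the manipulation rather than a formula from the slides.

```python
import math

def affine_params(priors, mu_x, var_x, mu_y, var_y):
    """Collapse the whitening transforms of one noisy cell into (A, b).
    priors[kx] = P(kx | ky); mu_*[kx] and var_*[kx] are per-dimension lists
    holding the subregion means and diagonal variances."""
    dim = len(mu_x[0])
    A = [0.0] * dim
    b = [0.0] * dim
    for p, mx, vx, my, vy in zip(priors, mu_x, var_x, mu_y, var_y):
        for d in range(dim):
            w = math.sqrt(vx[d] / vy[d])   # clean std divided by noisy std
            A[d] += p * w
            b[d] += p * (mx[d] - w * my[d])
    return A, b

def apply_affine(A, b, y):
    """Run-time compensation: one multiply-add per feature."""
    return [a * yi + bi for a, bi, yi in zip(A, b, y)]

# Toy cell with two subregions in one dimension.
A, b = affine_params([0.5, 0.5],
                     mu_x=[[0.0], [4.0]], var_x=[[1.0], [4.0]],
                     mu_y=[[1.0], [5.0]], var_y=[[4.0], [4.0]])
x_hat = apply_affine(A, b, [3.0])
```

Precomputing `A` and `b` per noisy cell means that the run-time cost no longer depends on the number of clean cells.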




Experimental Setup

Recognition task: based on the Aurora2 noisy digits database.

Acoustic environments: 9 noises at 7 SNRs (clean, 20, 15, 10, 5, 0, and -5 dB).

Speech features: ETSI FE Standard (13 MFCCs + Δ + Δ²).

Front-end speech models: codebooks with 256 components.

SPLICE and MEMLIN are also evaluated (i.e. GMM-based MMSE estimation).

A priori knowledge of the acoustic environment is assumed.




FE Results

System      Clean   20 dB   15 dB   10 dB    5 dB    0 dB   -5 dB    Avg.
Baseline    99.02   90.79   75.53   50.70   25.86   11.27    6.18   50.83
Matched     99.02   98.66   98.29   97.02   92.16   75.78   34.88   92.38
SPLICE      99.02   98.09   95.87   88.88   70.62   39.04   15.99   78.50
MEMLIN      99.02   98.36   97.01   92.43   78.26   47.03   18.76   82.62
Q-VQMMSE    96.19   93.72   90.21   81.24   61.82   31.33   14.39   71.66
S-VQMMSE    99.02   97.93   96.28   90.57   74.70   43.02   18.57   80.50
iW-VQMMSE   99.02   98.23   96.79   91.60   76.82   46.60   20.02   82.01
dW-VQMMSE   99.02   98.33   97.06   92.43   78.70   48.88   20.26   83.08
fW-VQMMSE   99.02   98.37   97.15   92.88   79.61   50.04   20.89   83.61

Matched: HMMs trained under the same conditions as in testing.

iW-, dW-, fW-: identity, diagonal and full covariance matrices.

MEMLIN and iW-VQMMSE behave almost identically, but our proposal is more efficient.

When the dynamic features are also processed, MEMLIN and fW-VQMMSE achieve similar results: 87.67 % vs. 87.31 %.




AFE Results

System      Clean   20 dB   15 dB   10 dB    5 dB    0 dB   -5 dB    Avg.
Baseline    99.02   90.79   75.53   50.70   25.86   11.27    6.18   50.83
AFE         99.22   98.24   96.95   93.68   84.37   62.46   29.53   87.14
Matched     99.02   98.66   98.29   97.02   92.16   75.78   34.88   92.38
Q-VQMMSE    95.60   93.56   91.28   85.25   70.23   39.20   12.84   75.90
S-VQMMSE    99.22   98.32   97.39   94.71   86.30   63.07   27.46   87.96
iW-VQMMSE   99.22   98.61   97.93   95.89   89.19   69.46   32.62   90.22
dW-VQMMSE   99.22   98.70   98.05   96.19   89.93   71.47   34.94   90.87
fW-VQMMSE   99.22   98.65   97.99   96.10   89.92   72.29   36.57   90.99

AFE: ETSI Advanced Front-End.

The proposed techniques are applied to the features extracted by AFE.

The combined systems AFE+VQMMSE increase the robustness of AFE against noise.




3 Feature Compensation based on a Masking Model




Introduction

Speech degradation model: an analytical model that relates $y$ with $x$ and $n$ (the additive noise vector).

Model-based compensation: the degradation model is used to derive the MMSE estimator.

  ✓ No stereo data is required.
  ✓ Thus, unknown distortions can be mitigated.
  × MMSE estimation only tackles the distortions considered in the degradation model, e.g. additive and convolutional noises.
  × Noise needs to be estimated.

We will only consider the robustness to additive noise here.




Speech Masking Model

In the log-Mel domain, the degradation model is approximated by

$$y = \log(e^x + e^n)$$

This model can be rewritten as,

$$y = \max(x, n) + \varepsilon(x, n)$$

Disregarding $\varepsilon(x, n)$, the speech masking model is

$$y \approx \max(x, n)$$

[Figure: distribution of $\varepsilon(x, n)$ in Aurora2. Horizontal axis: $\varepsilon(x, n)$; vertical axis: probability.]
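For the log-sum approximation, the residual has a closed form: factoring out the larger of the two terms gives log(1 + exp(-|x - n|)), which lies between 0 and log 2 and vanishes as one source dominates the other. A short Python check of this identity (function names are illustrative):

```python
import math

def log_sum(x, n):
    """Degradation model in the log-Mel domain: log(exp(x) + exp(n))."""
    return math.log(math.exp(x) + math.exp(n))

def masking_residual(x, n):
    """Residual left after replacing the log-sum by max(x, n)."""
    return math.log1p(math.exp(-abs(x - n)))

# The residual is largest when speech and noise have equal energy
# and becomes negligible when one of them is much stronger.
equal_energy = masking_residual(2.0, 2.0)
speech_dominant = masking_residual(20.0, 0.0)
```

This only characterises the approximated model; it says nothing about the residual measured on real noisy data.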




Spectral Reconstruction: Problems

According to the speech masking model, the observation can be rearranged into $y = (y_r, y_u)$:

  Reliable features ($x_r \approx y_r$): speech is dominant.
  Unreliable features ($-\infty \le x_u \le y_u$): speech is masked by noise.

Thus, feature compensation can be reformulated as two different problems:

  1 Segregation of the noisy spectra into speech and noise. This yields a mask where the reliable and unreliable features are identified.
  2 Spectral reconstruction, i.e. estimating the speech energy in the unreliable features.

Two alternative techniques are proposed here:

  TGI only deals with problem 2.
  MMSR addresses both 1 & 2.




Truncated-Gaussian based Imputation

TGI estimates the speech energy in the unreliable regions of the observed spectrogram.

To do this, the correlation between features is exploited.

Prerequisites: the binary segregation mask is known in advance.

After spectral reconstruction, MFCC features can be computed and used for recognition.




MMSE Estimation of the Unreliable Features

MMSE estimation is used again to reconstruct the unreliable features,

$$\hat{x}_u = E[x_u \mid x_r = y_r, -\infty \le x_u \le y_u]$$

Speech model: $p(x)$ is modelled as a Gaussian Mixture Model (GMM),

$$p(x) = \sum_{k=1}^{M} P(k) \, \mathcal{N}\left(x; \mu^{(k)}, \Sigma^{(k)}\right)$$

Applying this model, the MMSE estimation is given by,

$$\hat{x}_u = \sum_{k=1}^{M} P(k \mid y_r, y_u) \, \hat{x}_u^{(k)}$$

Problem: computation of $P(k \mid y_r, y_u)$ and $\hat{x}_u^{(k)}$.




Posterior Computation

After applying Bayes' rule, the posterior can be expressed as,

$$P(k \mid y_r, y_u) = \frac{p(y_r, y_u \mid k) \, P(k)}{\sum_{k'=1}^{M} p(y_r, y_u \mid k') \, P(k')}$$

$p(y_r, y_u \mid k)$ is factorized as the following product,

$$p(y_r, y_u \mid k) = p(y_r \mid k) \int_{-\infty}^{y_u} p(x_u \mid y_r, k) \, dx_u$$

$p(y_r \mid k) = \mathcal{N}(y_r; \mu_r^{(k)}, \Sigma_r^{(k)})$: marginal PDF of the reliable features.

$p(x_u \mid y_r, k) = \mathcal{N}(x_u; \mu_{u|r}^{(k)}, \Sigma_{u|r}^{(k)})$: conditional PDF of the unreliable features given the reliable ones.
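A minimal Python sketch of this posterior, under the simplifying assumption of diagonal covariances. In that case the conditional PDF of an unreliable feature no longer depends on the reliable ones, so each reliable feature contributes a Gaussian density and each unreliable feature contributes a Gaussian CDF evaluated at the observed value. The function names are hypothetical, and a practical implementation would work with log-probabilities to avoid underflow.

```python
import math

def norm_pdf(v, mu, var):
    return math.exp(-0.5 * (v - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def norm_cdf(v, mu, var):
    return 0.5 * (1.0 + math.erf((v - mu) / math.sqrt(2.0 * var)))

def tgi_posteriors(y, reliable, priors, means, variances):
    """P(k | y_r, y_u) for a diagonal-covariance GMM.
    reliable[i] is True when feature i is marked reliable by the mask."""
    scores = []
    for pk, mu, var in zip(priors, means, variances):
        lik = pk
        for yi, rel, m, v in zip(y, reliable, mu, var):
            # Reliable: x = y exactly. Unreliable: x lies anywhere below y.
            lik *= norm_pdf(yi, m, v) if rel else norm_cdf(yi, m, v)
        scores.append(lik)
    total = sum(scores)
    return [s / total for s in scores]

# Two components, two features; the second feature is masked by noise.
post = tgi_posteriors(y=[0.2, 6.0], reliable=[True, False],
                      priors=[0.5, 0.5],
                      means=[[0.0, 0.0], [5.0, 5.0]],
                      variances=[[1.0, 1.0], [1.0, 1.0]])
```

In the example the reliable feature clearly favours the first component, while the masked feature only states that the clean value is at most 6.0, which both components can explain.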




Partial Estimates

According to the speech masking model $x_u \in (-\infty, y_u]$. Thus,

$$\hat{x}_u^{(k)} = \int_{-\infty}^{y_u} x_u \, p(x_u \mid y_r, k) \, dx_u$$

Independence is assumed to solve the integral.

The partial estimate $\hat{x}_u^{(k)} = \mu^{(k)}(y)$ corresponds to the mean of a right-truncated Gaussian PDF.
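The mean of a right-truncated Gaussian has a well-known closed form: the untruncated mean minus the standard deviation times the ratio of the standard normal PDF to the standard normal CDF, both evaluated at the standardised truncation point. A Python sketch for a single feature (the function name is illustrative):

```python
import math

def truncated_mean(mu, var, upper):
    """Mean of a Gaussian N(mu, var) restricted to (-inf, upper].
    For truncation points far below the mean the CDF underflows, so a
    robust implementation would need an asymptotic approximation there."""
    sigma = math.sqrt(var)
    alpha = (upper - mu) / sigma
    pdf = math.exp(-0.5 * alpha * alpha) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(alpha / math.sqrt(2.0)))
    return mu - sigma * pdf / cdf

# Truncating a standard normal at its mean keeps only the lower half.
half_normal = truncated_mean(0.0, 1.0, 0.0)
# With a loose upper bound the estimate falls back to the plain mean.
loose_bound = truncated_mean(0.0, 1.0, 50.0)
# With a tight bound the estimate stays below the observed value.
tight_bound = truncated_mean(5.0, 1.0, 0.0)
```

The estimate never exceeds the observed value, which is what the masking model requires for unreliable features.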




Example

[Figure: spectrograms of the utterance "eight six zero one one six two". Panels: Clean, Noisy (0 dB), Oracle mask, TGI reconstruction. Horizontal axis: time (s); vertical axis: Mel channel.]




Experimental Setup

Databases: Aurora2 & Aurora4.

The 3 test sets (A, B and C) of Aurora2 are considered.

Aurora4: 5000-word recognition task based on the Wall Street Journal corpus. Two testing conditions:

  Test 01-07: utterances with artificially added acoustic noise (random SNR between 10 dB and 20 dB).
  Test 08-14: acoustic noise + different microphones.

TGI is evaluated using both oracle (OR) and estimated (EST) binary masks.

Noise estimation: linear interpolation of the first and last frames of the utterance.

Front-end speech model: GMM with 256 components.
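The noise estimation step mentioned in this setup can be sketched as follows. The slides only say that the first and last frames are linearly interpolated; averaging a fixed number of edge frames (`n_edge`, a hypothetical parameter) before interpolating is an assumption made here for illustration.

```python
def interpolate_noise(frames, n_edge=10):
    """Per-frame noise estimate for one utterance.
    Averages the first and the last n_edge frames (assumed to contain
    noise only) and interpolates linearly between the two averages."""
    T, dim = len(frames), len(frames[0])
    n_edge = min(n_edge, T)
    head = [sum(f[d] for f in frames[:n_edge]) / n_edge for d in range(dim)]
    tail = [sum(f[d] for f in frames[-n_edge:]) / n_edge for d in range(dim)]
    noise = []
    for t in range(T):
        a = t / (T - 1) if T > 1 else 0.0   # 0 at the start, 1 at the end
        noise.append([(1.0 - a) * h + a * l for h, l in zip(head, tail)])
    return noise

# Five one-dimensional frames: noise-only edges, speech in the middle.
frames = [[1.0], [9.0], [9.0], [9.0], [3.0]]
noise = interpolate_noise(frames, n_edge=1)
```

The speech frames in the middle do not affect the estimate, which drifts linearly from the level at the start of the utterance to the level at the end.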




Experimental Results

[Figure: WAcc (%) on Aurora2 and Aurora4 for Baseline, CBR-OR, TGI-OR, CBR-EST and TGI-EST.]

CBR: Cluster-Based Reconstruction (Raj et al., 2004).

TGI outperforms CBR when oracle masks are used.

The difference is small when the masks are estimated.

Large margin for improvement between OR and EST ⇒ a more robust approach for speech/noise segregation is required.




Masking-Model based Spectral Reconstruction

As we have seen, TGI achieves excellent results when oracle masks are used.

However, its performance diminishes when the masks are estimated ⇒ the noise estimation errors can be magnified by the hard decision implemented by the binary masks.

MMSR uses the noise estimates directly in the MMSE estimation.

Advantages with respect to TGI:

  No a priori segregation mask is required now.
  Therefore, the feature reliability and the speech energy in the unreliable regions are jointly estimated.




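MMSR: Diagram

$\mathcal{M}_x$: GMM with $M_x$ Gaussians.

$\mathcal{M}_n$: GMM with $M_n$ Gaussians (alternatively a noise estimate $n_t \sim \mathcal{N}_n(\hat{n}_t, \Sigma_{n,t})$ for each frame).

MMSE estimation

$$\hat{x} = \sum_{k_x=1}^{M_x} \sum_{k_n=1}^{M_n} P(k_x, k_n \mid y) \, \hat{x}^{(k_x, k_n)}$$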




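Posterior Computation

Applying Bayes' rule, $P(k_x, k_n \mid \mathbf{y}) \propto p(\mathbf{y} \mid k_x, k_n) P(k_x) P(k_n)$.

Independence assumption: the vector likelihood $p(\mathbf{y} \mid k_x, k_n)$ is expressed as the product of the scalar likelihoods $p(y \mid k_x, k_n)$ of every observed feature $y$.

According to the masking model, $p(y \mid k_x, k_n)$ is computed as,

$$p(y \mid k_x, k_n) = \underbrace{p(x = y, n \le y \mid k_x, k_n)}_{p_x(y \mid k_x) \, P_n(n \le y \mid k_n)} + \underbrace{p(n = y, x < y \mid k_x, k_n)}_{p_n(y \mid k_n) \, P_x(x < y \mid k_x)}$$

The first term is the probability that speech is dominant; the second term is the probability that noise is dominant.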




Partial Estimates

Contrary to TGI, the reliability of the observed feature $y$ is unknown in MMSR.

Hence, both the reliable and unreliable cases are taken into account,

$$\hat{x}^{(k_x, k_n)} = w^{(k_x, k_n)} y + \left(1 - w^{(k_x, k_n)}\right) \mu_x^{(k_x)}$$

  $y$: estimate for high SNRs.
  $\mu_x^{(k_x)}$: estimate for masked speech (i.e. truncated PDF mean).
  $w^{(k_x, k_n)} = P(x = y, n < y \mid k_x, k_n)$ is the normalized speech presence probability.
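For a single feature and a single pair of speech and noise Gaussians, the MMSR quantities can be sketched in Python as below. The function name `mmsr_component` is hypothetical; the masked-speech term is computed as the mean of the speech Gaussian truncated at the observation, following the "truncated PDF mean" label, and the weight is the speech-dominant term normalised by the total likelihood.

```python
import math

def pdf(v, mu, var):
    return math.exp(-0.5 * (v - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def cdf(v, mu, var):
    return 0.5 * (1.0 + math.erf((v - mu) / math.sqrt(2.0 * var)))

def mmsr_component(y, mu_x, var_x, mu_n, var_n):
    """Returns (likelihood, speech presence weight, partial estimate)
    for one observed feature y and one (speech, noise) Gaussian pair."""
    speech_dom = pdf(y, mu_x, var_x) * cdf(y, mu_n, var_n)   # x = y, n <= y
    noise_dom = pdf(y, mu_n, var_n) * cdf(y, mu_x, var_x)    # n = y, x < y
    lik = speech_dom + noise_dom
    w = speech_dom / lik
    # Mean of the speech Gaussian truncated to (-inf, y].
    masked = mu_x - var_x * pdf(y, mu_x, var_x) / cdf(y, mu_x, var_x)
    return lik, w, w * y + (1.0 - w) * masked

# Observation explained by speech: the estimate stays at the observation.
_, w_speech, x_speech = mmsr_component(10.0, mu_x=10.0, var_x=1.0, mu_n=0.0, var_n=1.0)
# Observation explained by noise: the estimate falls back to the speech model.
_, w_noise, x_noise = mmsr_component(10.0, mu_x=0.0, var_x=1.0, mu_n=10.0, var_n=1.0)
```

Summing these partial estimates over all component pairs, weighted by their posteriors, gives the full MMSE estimate without any hard reliability decision.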




MMSR: Mask Estimation

MMSR can also be considered as a robust method for speech segregation.

To see this, we reproduce here the final expression for the MMSE estimator,

$$\hat{x} = \underbrace{\sum_{k_x=1}^{M_x} \sum_{k_n=1}^{M_n} P(k_x, k_n \mid y) \, w^{(k_x, k_n)}}_{m} \, y + \sum_{k_x=1}^{M_x} \sum_{k_n=1}^{M_n} P(k_x, k_n \mid y) \left(1 - w^{(k_x, k_n)}\right) \mu_x^{(k_x)}$$

$m \in [0, 1]$ acts as a soft mask: $m \approx 1$ for the reliable features and $m \approx 0$ for the unreliable ones.

Advantages regarding other methods:

  Parameter free.
  Mask estimation is fully integrated within the reconstruction.




Experimental Results

[Figure: WAcc (%) on Aurora2 and Aurora4 for Baseline, TGI-OR, TGI-EST, MMSR and VTS.]

VTS: well-known model-based compensation technique (Moreno, 1996).

MMSR outperforms TGI-EST and is upper-bounded by TGI-OR.

VTS is slightly better than MMSR ⇒ more accurate noise models can reduce the gap.








EM-based Noise Model Estimation

Objective: estimate the noise model used in MMSR.

Noise model: GMM with $M_n$ Gaussians,

$$\mathcal{M}_n = \left\{ \left\langle \pi_n^{(1)}, \mu_n^{(1)}, \Sigma_n^{(1)} \right\rangle, \ldots, \left\langle \pi_n^{(M_n)}, \mu_n^{(M_n)}, \Sigma_n^{(M_n)} \right\rangle \right\}$$

where $\pi_n^{(k_n)}$ ($k_n = 1, \ldots, M_n$) are the component priors.

Maximum Likelihood estimation

$$\hat{\mathcal{M}}_n = \arg\max_{\mathcal{M}_n} \, p(y_1, \ldots, y_T \mid \mathcal{M}_n, \mathcal{M}_x)$$

Direct optimization of this expression is unfeasible ⇒ an iterative EM approach is used.




Overview

Problems

The oracle mask is unknown ⇒ the soft mask estimated by MMSR is used.

Treatment of the speech-dominated regions: the noise in these regions can be estimated using the model obtained in the previous iteration.




Experimental Results

[Figure: WAcc (%) versus number of noise GMM components on Aurora2 and Aurora4, comparing the estimated noise with the GMM noise model.]

A small but consistent performance improvement is achieved when using GMM noise models in MMSR.

GMMs are worse than the estimated noise in 2 cases:

  1-Gaussian GMMs: unable to properly model non-stationary noises.
  Complex GMMs: not enough data to robustly estimate the GMM parameters.



4 Temporal Modelling and Uncertainty Decoding




Temporal Modelling

More accurate MMSE estimates are obtained with better speech models.

Here, the temporal correlation of speech is considered.

Two alternative approaches:

  Patch-based modelling: short segments of speech are modelled instead of single frames.
  HMM modelling: the previous speech models (GMMs or VQ codebooks) are augmented with transition probabilities. Then,

$$\hat{x}_t = \sum_{k=1}^{M} P(k \mid y_1, \ldots, y_t, \ldots, y_T) \, E[x \mid y_t, k]$$
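The state posteriors given the whole utterance can be obtained with the standard forward-backward recursion. The Python sketch below assumes that the per-frame observation likelihoods have already been computed by the front-end model; the function names and the scalar partial estimates are illustrative, not taken from the thesis.

```python
def state_posteriors(init, trans, obs_lik):
    """Scaled forward-backward recursion.
    init[k]       : probability of state k at the first frame
    trans[j][k]   : probability of moving from state j to state k
    obs_lik[t][k] : likelihood of frame t under state k
    Returns gamma[t][k], the posterior of state k at frame t given all frames."""
    T, M = len(obs_lik), len(init)
    alpha = []
    cur = [init[k] * obs_lik[0][k] for k in range(M)]
    s = sum(cur)
    alpha.append([c / s for c in cur])
    for t in range(1, T):
        cur = [obs_lik[t][k] * sum(alpha[t - 1][j] * trans[j][k] for j in range(M))
               for k in range(M)]
        s = sum(cur)
        alpha.append([c / s for c in cur])
    beta = [[1.0] * M for _ in range(T)]
    for t in range(T - 2, -1, -1):
        cur = [sum(trans[k][j] * obs_lik[t + 1][j] * beta[t + 1][j] for j in range(M))
               for k in range(M)]
        s = sum(cur)
        beta[t] = [c / s for c in cur]
    gamma = []
    for t in range(T):
        g = [alpha[t][k] * beta[t][k] for k in range(M)]
        s = sum(g)
        gamma.append([v / s for v in g])
    return gamma

def hmm_mmse(gamma_t, partial_estimates):
    """Frame estimate: partial estimates weighted by the state posteriors."""
    return sum(g * e for g, e in zip(gamma_t, partial_estimates))

# The middle frame is uninformative on its own, but its neighbours are not.
gamma = state_posteriors(init=[0.5, 0.5],
                         trans=[[0.9, 0.1], [0.1, 0.9]],
                         obs_lik=[[0.9, 0.1], [0.5, 0.5], [0.9, 0.1]])
x_hat_mid = hmm_mmse(gamma[1], [1.0, 5.0])
```

Without the transition probabilities the middle frame would split its posterior evenly between the two states; with them, the evidence from the neighbouring frames resolves the ambiguity.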




[Figure: WAcc (%) on Aurora2 and Aurora4 for TGI-OR, PATCH-OR, HMM-OR, TGI-EST, PATCH-EST and HMM-EST.]

The PATCH and HMM approaches are applied in combination with TGI.

Spectral reconstruction benefits from temporal redundancy, especially at low SNRs.

The HMM-based modelling achieves the best results.




Uncertainty Decoding (I)

The accuracy of MMSE estimation depends on many factors, such as the SNR of the signal, stationarity of the noise, etc.

An inaccurate $\hat{x}_t$ can degrade the performance of ASR.

Two objectives:

  1 Estimate the uncertainty/reliability of $\hat{x}_t$.
  2 Account for this information in the recognizer.




Uncertainty Decoding (II)

Uncertainty of $\hat{x}$

Depends on $p(x \mid y)$ that appears in the MMSE estimator.

If $p(x \mid y) = \delta_y(x)$, then we will consider that $\hat{x}$ is fully reliable.

If $p(x \mid y)$ is uniformly distributed, then $\hat{x}$ is badly estimated.

How to measure the uncertainty of $\hat{x}$?

Entropy of $p(x \mid y)$.

Variance of the MMSE estimate: $\Sigma_{\hat{x}}$.

Exploitation in the recognizer

Soft-data decoding: $\Sigma_{\hat{x}}$ increases the variance of the Gaussians in the acoustic model.

Weighted Viterbi Algorithm: the exponential factor $\rho \in [0, 1]$ used to weight the observation probabilities of $\hat{x}$ is obtained after applying a sigmoid function to $\mathrm{MSE} = \mathrm{tr}(\Sigma_{\hat{x}})$.
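Both ways of exploiting the uncertainty can be illustrated with a few lines of Python for diagonal covariances. The sigmoid parameters `slope` and `midpoint` are hypothetical (the slides do not give the actual mapping), and the function names are illustrative only.

```python
import math

def log_gauss_diag(x, mu, var):
    """Log-density of a diagonal-covariance Gaussian."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mu, var))

def soft_data_loglik(x_hat, var_hat, mu, var):
    """Soft-data decoding: the estimate variance is added to the variance
    of the acoustic-model Gaussian before scoring the estimate."""
    inflated = [v + u for v, u in zip(var, var_hat)]
    return log_gauss_diag(x_hat, mu, inflated)

def wva_weight(var_hat, slope=1.0, midpoint=1.0):
    """Reliability weight in (0, 1): a decreasing sigmoid of the MSE, taken
    as the trace of the diagonal estimate covariance (assumed mapping)."""
    mse = sum(var_hat)
    return 1.0 / (1.0 + math.exp(slope * (mse - midpoint)))

def weighted_loglik(x_hat, var_hat, mu, var):
    """Weighted Viterbi: raising a probability to the power rho is the same
    as multiplying its logarithm by rho."""
    return wva_weight(var_hat) * log_gauss_diag(x_hat, mu, var)

# An estimate far from the model mean is penalised less when it is uncertain.
certain = soft_data_loglik([5.0], [0.0], mu=[0.0], var=[1.0])
uncertain = soft_data_loglik([5.0], [4.0], mu=[0.0], var=[1.0])
```

In both schemes an unreliable frame has less influence on the decoding: soft-data decoding flattens the Gaussians, while the weighted Viterbi algorithm scales down the log-likelihood of the frame.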




[Figure: WAcc (%) on Aurora2 and Aurora4 for Baseline, TGI-OR, UD-OR, TGI-EST and UD-EST.]

UD: TGI + Weighted Viterbi Algorithm.

OR vs. EST: oracle masks and oracle uncertainties vs. estimated masks and uncertainties.

The recognition performance is improved after accounting for the uncertainty, especially in Aurora4.




5 Conclusions




Conclusions (I)

The performance of ASR is severely affected by noise.

To improve the robustness of ASR to noise, a feature compensation approach has been adopted in this thesis.

Stereo-data based compensation: stereo recordings are used to estimate a set of transformations that are later applied to noisy speech.

  Excellent results for the environments seen during training.
  Efficient implementation without a significant performance degradation when VQ codebooks are used.
  The proposed techniques can be used to reduce the residual noise of other robust techniques.




Conclusions (II)

Model-based compensation: a model that considers the distortion of speech as a masking problem is used to derive two reconstruction techniques.

  TGI estimates the masked regions in the noisy spectra. Good results if the masking pattern is perfectly known, otherwise its performance is significantly affected.
  MMSR uses clean speech and noise models to enhance noisy speech. Unlike TGI, mask estimation is an integrated part of the reconstruction algorithm.
  An EM-based iterative algorithm has been proposed to estimate the noise models used by MMSR.

Finally, several approaches to account for temporal correlations and to decode uncertain speech evidence were also investigated.




Future Work

Speech masking model vs. perceptual masking.

EM algorithm: joint estimation of additive and convolutional noises.

Using more information in MMSR, e.g. pitch, onset/offset position, etc.

Joint speaker and noise compensation.





Thank you!


Date post:	28-Oct-2014
Category:	Technology
Upload:	joseangl