Thesis

Date post: 28-Oct-2014
Upload: joseangl
Page 1: Thesis

Introduction | Feature Compensation based on Stereo Data | Feature Compensation based on a Masking Model | Temporal Modelling and Uncertainty Decoding | Conclusions

Noise Robust Speech Recognition of Missing or Uncertain Data

José Andrés González López

Advisors: Dr. Antonio M. Peinado Herreros, Dr. Ángel M. Gómez García

Dpt. Signal Theory, Telecommunications and Networking
University of Granada

Ph.D. Defence
February 25th, 2013

1 / 49 Jose A. Gonzalez Noise Robust Speech Recognition of Missing or Uncertain Data

Page 2: Thesis


Outline

1 Introduction

2 Feature Compensation based on Stereo Data

3 Feature Compensation based on a Masking Model

4 Temporal Modelling and Uncertainty Decoding

5 Conclusions

Page 3: Thesis

1 Introduction

Page 4: Thesis

Robust ASR

The performance of ASR (Automatic Speech Recognition) systems degrades when training and testing conditions differ.

This mismatch can be due to different factors:

- Language complexity: grammar, vocabulary, spontaneous speech, ...
- Speaker variability: accent, age, gender, ...
- Environmental factors: background noise, channel distortion, room acoustics, ...

In this work, we will focus on the environmental factors, especially on the background noise and the channel distortion.

Effect of noise on speech: noise modifies the speech distributions and causes loss of information.

Page 5: Thesis

Approaches for Noise Robustness

Different approaches to achieve noise robustness: robust feature extraction, model adaptation and feature modification.

Feature compensation enhances the noisy features used for speech recognition.

$y_t$ and $\hat{x}_t$ are, respectively, the feature vectors for noisy speech and estimated clean speech at time $t$.

Uncertainty: information about the reliability of $\hat{x}_t$.

Page 6: Thesis

Objectives

Development of a set of compensation techniques for speech feature enhancement.

To do this, a Bayesian estimation framework is adopted here.

Two different approaches for estimating clean speech will be explored:

- Feature compensation based on stereo data: clean and noisy recordings are used to derive a set of transformations applied to noisy speech.
- Feature compensation based on a masking model: parametric models of speech degradation are used to estimate clean speech.

Finally, an uncertainty decoding approach and temporal modelling of speech will also be investigated.

Page 7: Thesis

2 Feature Compensation based on Stereo Data

Introduction | MMSE Estimation | Experimental Results

Page 8: Thesis

Introduction

Stereo data: simultaneous recordings of clean and noisy speech signals,

$$(X, Y) = (\langle x_1, y_1 \rangle, \langle x_2, y_2 \rangle, \ldots, \langle x_T, y_T \rangle)$$

The stereo data is used to learn the statistical relationship between the clean and noisy feature spaces.

As a result, a set of transformations is derived to enhance speech in a certain acoustic environment.

Acoustic environment: combination of additive and convolutional noises at a given SNR.

Page 9: Thesis

MMSE Estimation (I)

MMSE estimation is chosen to obtain suitable estimates for the clean feature vectors,

$$\hat{x} = E[x|y] = \int x \, p(x|y) \, dx$$

Problem: $p(x|y)$ must be expressed in a convenient form.

Solution: clean and noisy feature spaces are represented by VQ codebooks $\mathcal{M}_x$ and $\mathcal{M}_y$, respectively.

Page 10: Thesis

MMSE Estimation (II)

Using these codebooks, the MMSE estimation can be expressed as,

$$\hat{x} = \sum_{k_x=1}^{M_x} P(k_x | k_y^*) \, \hat{x}^{(k_x)}$$

- $P(k_x | k_y^*)$: mapping between the clean and noisy cells for a certain environment. Estimated using stereo data.
- $\hat{x}^{(k_x)} = E[x | y, k_x, k_y^*]$: 3 alternatives (Q-VQMMSE, S-VQMMSE and W-VQMMSE).

Page 11: Thesis

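Computation of $\hat{x}^{(k_x)}$

Q-VQMMSE

- Assumes that both spaces are quantized.
- Also, this approach assumes that the spaces are independent.

Then, $\hat{x}^{(k_x)} = \mu_x^{(k_x)}$.

S-VQMMSE

A correction is applied to $y$,

$$\hat{x}^{(k_x)} = y - \left( \mu_y^{(k_y^*)} - \mu_x^{(k_x)} \right) = \mu_x^{(k_x)} + \underbrace{\left( y - \mu_y^{(k_y^*)} \right)}_{\Delta}$$

where $\Delta$ is the quantization error.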
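The Q-VQMMSE and S-VQMMSE estimators can be sketched in a few lines. This is only an illustrative sketch under stated assumptions: the slides do not say how the mapping P(kx | ky) is estimated, so simple co-occurrence counting over the stereo pairs is assumed here, the noisy vector is assumed to be quantized to its nearest centroid, and the names `nearest`, `train_mapping` and `vqmmse` are hypothetical.

```python
from collections import Counter

def nearest(codebook, v):
    """Index of the centroid closest to v (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(codebook[k], v)))

def train_mapping(stereo, cb_x, cb_y):
    """Estimate P(kx | ky) by counting co-occurring cells (assumed estimator)."""
    counts = Counter((nearest(cb_y, y), nearest(cb_x, x)) for x, y in stereo)
    totals = Counter()
    for (ky, _), c in counts.items():
        totals[ky] += c
    return {(ky, kx): c / totals[ky] for (ky, kx), c in counts.items()}

def vqmmse(y, cb_x, cb_y, mapping, variant="S"):
    """Q-VQMMSE ("Q") or S-VQMMSE ("S") estimate of the clean vector."""
    ky = nearest(cb_y, y)                          # noisy cell k_y^*
    delta = [a - b for a, b in zip(y, cb_y[ky])]   # quantization error of y
    est = [0.0] * len(y)
    for kx, mu_x in enumerate(cb_x):
        p = mapping.get((ky, kx), 0.0)             # P(kx | ky), 0 if never seen
        for d in range(len(y)):
            partial = mu_x[d] + (delta[d] if variant == "S" else 0.0)
            est[d] += p * partial
    return est

# Toy one-dimensional example with two cells per codebook.
cb_x = [[0.0], [10.0]]
cb_y = [[2.0], [11.0]]
stereo = [([0.1], [2.2]), ([-0.2], [1.9]), ([9.8], [10.7]), ([10.3], [11.4])]
mapping = train_mapping(stereo, cb_x, cb_y)
q_est = vqmmse([2.5], cb_x, cb_y, mapping, "Q")
s_est = vqmmse([2.5], cb_x, cb_y, mapping, "S")
```

In this toy case the Q variant returns the clean centroid itself, while the S variant additionally carries over the quantization error of the noisy vector.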

Page 12: Thesis

Improving the Mapping Accuracy

Subregion modelling

$C_y^{(k_x,k_y)}$ is the subset of the noisy cell $k_y$ whose corresponding clean vectors belong to $k_x$.

Similarly, $C_x^{(k_x,k_y)}$ is the subset of $k_x$ whose corresponding noisy vectors are $C_y^{(k_x,k_y)}$.

Page 13: Thesis

Whitening-transformation based VQMMSE

W-VQMMSE assumes that the subregions of both feature spaces are Gaussian distributed, e.g.

$$C_x^{(k_x,k_y)} \sim \mathcal{N}\left( \mu_x^{(k_x,k_y)}, \Sigma_x^{(k_x,k_y)} \right)$$

Computation of $E[x | y, k_x, k_y]$: the following whitening transformation is applied,

$$E[x | y, k_x, k_y] = \mu_x^{(k_x,k_y)} + \left( \Sigma_x^{(k_x,k_y)} \right)^{1/2} \left( \Sigma_y^{(k_x,k_y)} \right)^{-1/2} \left( y - \mu_y^{(k_x,k_y)} \right)$$

After some manipulations the MMSE estimation becomes,

$$\hat{x} = A^{(k_y^*)} y + b^{(k_y^*)}$$

where the parameters of the affine transformation can be precomputed offline for each noisy cell $k_y = 1, \ldots, M_y$.
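The "manipulations" amount to collecting the terms that multiply y and those that do not once the whitening-based partial estimates are weighted by P(kx | ky). A minimal sketch for the diagonal-covariance case (the dW- variant) follows, where the matrix square roots reduce to per-dimension ratios of standard deviations. The function names and the toy numbers are illustrative assumptions, not taken from the thesis.

```python
import math

def affine_params(post, mu_x, var_x, mu_y, var_y):
    """Per-dimension slope A and offset b for one noisy cell ky.

    post[kx]            : P(kx | ky)
    mu_x[kx], var_x[kx] : mean / variance of the clean subregion (kx, ky)
    mu_y[kx], var_y[kx] : mean / variance of the noisy subregion (kx, ky)
    """
    dim = len(mu_x[0])
    A = [0.0] * dim
    b = [0.0] * dim
    for kx, p in enumerate(post):
        for d in range(dim):
            # Diagonal case of Sigma_x^{1/2} Sigma_y^{-1/2}
            w = math.sqrt(var_x[kx][d] / var_y[kx][d])
            A[d] += p * w
            b[d] += p * (mu_x[kx][d] - w * mu_y[kx][d])
    return A, b

def direct_estimate(y, post, mu_x, var_x, mu_y, var_y):
    """Posterior-weighted sum of the whitening-based partial estimates."""
    est = [0.0] * len(y)
    for kx, p in enumerate(post):
        for d in range(len(y)):
            w = math.sqrt(var_x[kx][d] / var_y[kx][d])
            est[d] += p * (mu_x[kx][d] + w * (y[d] - mu_y[kx][d]))
    return est

# Toy numbers: two clean cells, one feature dimension.
post = [0.7, 0.3]
mu_x, var_x = [[1.0], [4.0]], [[1.0], [4.0]]
mu_y, var_y = [[2.0], [5.0]], [[4.0], [1.0]]
A, b = affine_params(post, mu_x, var_x, mu_y, var_y)   # done once, offline
y = [3.0]
fast = [A[d] * y[d] + b[d] for d in range(len(y))]      # run-time cost: one affine map
slow = direct_estimate(y, post, mu_x, var_x, mu_y, var_y)
```

Both routes give the same estimate; the affine form only moves the sum over clean cells to the offline stage.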

Page 14: Thesis

Experimental Setup

Recognition task: based on the Aurora2 noisy digits database.

Acoustic environments: 9 noises at 7 SNRs (clean, 20, 15, 10, 5, 0, and -5 dB).

Speech features: ETSI FE Standard (13 MFCCs + Δ + Δ²).

Front-end speech models: codebooks with 256 components.

SPLICE and MEMLIN are also evaluated (i.e. GMM-based MMSE estimation).

A priori knowledge on the acoustic environment is assumed.

Page 15: Thesis

FE Results

| System | Clean | 20 dB | 15 dB | 10 dB | 5 dB | 0 dB | -5 dB | Avg. (20 to 0 dB) |
|---|---|---|---|---|---|---|---|---|
| Baseline | 99.02 | 90.79 | 75.53 | 50.70 | 25.86 | 11.27 | 6.18 | 50.83 |
| Matched | 99.02 | 98.66 | 98.29 | 97.02 | 92.16 | 75.78 | 34.88 | 92.38 |
| SPLICE | 99.02 | 98.09 | 95.87 | 88.88 | 70.62 | 39.04 | 15.99 | 78.50 |
| MEMLIN | 99.02 | 98.36 | 97.01 | 92.43 | 78.26 | 47.03 | 18.76 | 82.62 |
| Q-VQMMSE | 96.19 | 93.72 | 90.21 | 81.24 | 61.82 | 31.33 | 14.39 | 71.66 |
| S-VQMMSE | 99.02 | 97.93 | 96.28 | 90.57 | 74.70 | 43.02 | 18.57 | 80.50 |
| iW-VQMMSE | 99.02 | 98.23 | 96.79 | 91.60 | 76.82 | 46.60 | 20.02 | 82.01 |
| dW-VQMMSE | 99.02 | 98.33 | 97.06 | 92.43 | 78.70 | 48.88 | 20.26 | 83.08 |
| fW-VQMMSE | 99.02 | 98.37 | 97.15 | 92.88 | 79.61 | 50.04 | 20.89 | 83.61 |

Matched: HMMs trained under the same conditions as in testing.

iW-, dW-, fW-: identity, diagonal and full covariance matrices.

MEMLIN and iW-VQMMSE behave almost identically, but our proposal is more efficient.

When the dynamic features are also processed, MEMLIN and fW-VQMMSE achieve similar results: 87.67 % vs. 87.31 %.
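The Avg. column of the table appears to be the mean over the 20 dB to 0 dB conditions only, with the clean and -5 dB columns left out. A quick arithmetic check on three rows copied from the table supports this reading:

```python
# Accuracies at 20, 15, 10, 5 and 0 dB, copied from the FE results table.
rows = {
    "Baseline":  [90.79, 75.53, 50.70, 25.86, 11.27],
    "Matched":   [98.66, 98.29, 97.02, 92.16, 75.78],
    "fW-VQMMSE": [98.37, 97.15, 92.88, 79.61, 50.04],
}
# Mean over the five SNR conditions, rounded as in the table.
avg = {name: round(sum(v) / len(v), 2) for name, v in rows.items()}
```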

Page 16: Thesis

AFE Results

| System | Clean | 20 dB | 15 dB | 10 dB | 5 dB | 0 dB | -5 dB | Avg. (20 to 0 dB) |
|---|---|---|---|---|---|---|---|---|
| Baseline | 99.02 | 90.79 | 75.53 | 50.70 | 25.86 | 11.27 | 6.18 | 50.83 |
| AFE | 99.22 | 98.24 | 96.95 | 93.68 | 84.37 | 62.46 | 29.53 | 87.14 |
| Matched | 99.02 | 98.66 | 98.29 | 97.02 | 92.16 | 75.78 | 34.88 | 92.38 |
| Q-VQMMSE | 95.60 | 93.56 | 91.28 | 85.25 | 70.23 | 39.20 | 12.84 | 75.90 |
| S-VQMMSE | 99.22 | 98.32 | 97.39 | 94.71 | 86.30 | 63.07 | 27.46 | 87.96 |
| iW-VQMMSE | 99.22 | 98.61 | 97.93 | 95.89 | 89.19 | 69.46 | 32.62 | 90.22 |
| dW-VQMMSE | 99.22 | 98.70 | 98.05 | 96.19 | 89.93 | 71.47 | 34.94 | 90.87 |
| fW-VQMMSE | 99.22 | 98.65 | 97.99 | 96.10 | 89.92 | 72.29 | 36.57 | 90.99 |

AFE: ETSI Advanced Front-End.

The proposed techniques are applied to the features extracted by AFE.

The combined systems AFE+VQMMSE increase the robustness of AFE against noise.

Page 17: Thesis

3 Feature Compensation based on a Masking Model

Speech Masking Model | TGI | MMSR | Noise Model Estimation

Page 18: Thesis

Introduction

Speech degradation model: an analytical model that relates $y$ with $x$ and $n$ (the additive noise vector).

Model-based compensation: the degradation model is used to derive the MMSE estimator.

✓ No stereo data is required.
✓ Thus, unknown distortions can be mitigated.
× MMSE estimation only tackles the distortions considered in the degradation model, e.g. additive and convolutional noises.
× Noise needs to be estimated.

We will only consider the robustness to additive noise here.

Page 19: Thesis

Speech Masking Model

In the log-Mel domain, the degradation model is approximated by

$$y = \log(e^x + e^n)$$

This model can be rewritten as,

$$y = \max(x, n) + \varepsilon(x, n)$$

Disregarding $\varepsilon(x, n)$, the speech masking model is

$$y \approx \max(x, n)$$

[Figure: Distribution of ε(x, n) in Aurora2. Horizontal axis: ε(x, n); vertical axis: probability.]
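A small numerical check makes the rewriting concrete. Under the approximated model above, the residual has the closed form ε(x, n) = log(1 + e^{-|x - n|}), so it lies between 0 and log 2 and vanishes quickly as speech and noise energies move apart. The sketch below only illustrates this identity; the function names are hypothetical.

```python
import math

def log_sum(x, n):
    """y = log(e^x + e^n), evaluated in a numerically stable way."""
    return max(x, n) + math.log1p(math.exp(-abs(x - n)))

def residual(x, n):
    """epsilon(x, n) = y - max(x, n)."""
    return log_sum(x, n) - max(x, n)

# Residual for equal energies, a 2-unit gap and a 10-unit gap (log domain).
eps_equal = residual(5.0, 5.0)
eps_close = residual(5.0, 3.0)
eps_far = residual(5.0, -5.0)
```

The largest residual occurs when speech and noise have the same energy; for a gap of 10 log units it is already below 1e-4, which is why y ≈ max(x, n) is a reasonable approximation away from the x ≈ n region.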

Page 20: Thesis

Spectral Reconstruction: Problems

According to the speech masking model, the observation can be rearranged into $y = (y_r, y_u)$.

- Reliable features ($x_r \approx y_r$), i.e. speech is dominant.
- Unreliable features ($-\infty \le x_u \le y_u$): speech is masked by noise.

Thus, feature compensation can be reformulated as two different problems:

1 Segregation of the noisy spectra into speech and noise. This yields a mask where the reliable and unreliable features are identified.

2 Spectral reconstruction, i.e. estimating the speech energy in the unreliable features.

Two alternative techniques are proposed here:

- TGI only deals with problem 2.
- MMSR addresses both 1 & 2.

Page 21: Thesis

Truncated-Gaussian based Imputation

TGI estimates the speech energy in the unreliable regions of the observed spectrogram.

To do this, the correlation between features is exploited.

Prerequisites: the segregation binary mask is known in advance.

After spectral reconstruction, MFCC features can be computed and used for recognition.

Page 22: Thesis

MMSE Estimation of the Unreliable Features

MMSE estimation is used again to reconstruct the unreliable features,

$$\hat{x}_u = E[x_u | x_r = y_r, -\infty \le x_u \le y_u]$$

Speech model: $p(x)$ is modelled as a Gaussian Mixture Model (GMM),

$$p(x) = \sum_{k=1}^{M} P(k) \, \mathcal{N}\left( x; \mu^{(k)}, \Sigma^{(k)} \right)$$

Applying this model, the MMSE estimation is given by,

$$\hat{x}_u = \sum_{k=1}^{M} P(k | y_r, y_u) \, \hat{x}_u^{(k)}$$

Problem: computation of $P(k | y_r, y_u)$ and $\hat{x}_u^{(k)}$.

Page 23: Thesis

Posterior Computation

After applying Bayes' rule, the posterior can be expressed as,

$$P(k | y_r, y_u) = \frac{p(y_r, y_u | k) \, P(k)}{\sum_{k'=1}^{M} p(y_r, y_u | k') \, P(k')}$$

$p(y_r, y_u | k)$ is factorized as the following product,

$$p(y_r, y_u | k) = p(y_r | k) \int_{-\infty}^{y_u} p(x_u | y_r, k) \, dx_u$$

- $p(y_r | k) = \mathcal{N}(y_r; \mu_r^{(k)}, \Sigma_r^{(k)})$: marginal PDF of the reliable features.
- $p(x_u | y_r, k) = \mathcal{N}(x_u; \mu_{u|r}^{(k)}, \Sigma_{u|r}^{(k)})$: conditional PDF of the unreliable features given the reliable ones.

Page 24: Thesis

Partial Estimates

According to the speech masking model, $x_u \in (-\infty, y_u]$. Thus,

$$\hat{x}_u^{(k)} = \int_{-\infty}^{y_u} x_u \, p(x_u | y_r, k) \, dx_u$$

Independence is assumed to solve the integral.

The partial estimate $\hat{x}_u^{(k)} = \mu^{(k)}(y)$ corresponds to the mean of a right-truncated Gaussian PDF.
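The posterior and the partial estimates can be combined into a short imputation routine. The sketch below is a simplification under loud assumptions: the GMM has diagonal covariances, so the conditional PDF of an unreliable feature given the reliable ones collapses to its marginal, and every dimension is treated independently. The thesis model may use richer covariance structure; the function names are hypothetical.

```python
import math

def norm_pdf(v, mu, var):
    return math.exp(-0.5 * (v - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def norm_cdf(v, mu, var):
    return 0.5 * (1.0 + math.erf((v - mu) / math.sqrt(2.0 * var)))

def truncated_mean(upper, mu, var):
    """Mean of N(mu, var) restricted to (-inf, upper]."""
    mass = max(norm_cdf(upper, mu, var), 1e-300)   # guard against division by zero
    return mu - var * norm_pdf(upper, mu, var) / mass

def tgi(y, reliable, priors, means, variances):
    """Impute the unreliable entries of y with a diagonal-covariance GMM (assumption)."""
    M, D = len(priors), len(y)
    lik = []
    for k in range(M):
        l = priors[k]
        for d in range(D):
            if reliable[d]:
                l *= norm_pdf(y[d], means[k][d], variances[k][d])   # p(y_r | k)
            else:
                l *= norm_cdf(y[d], means[k][d], variances[k][d])   # integral up to y_u
        lik.append(l)
    total = sum(lik)
    post = [l / total for l in lik]                                  # P(k | y_r, y_u)
    x_hat = list(y)                                                  # reliable: x_r = y_r
    for d in range(D):
        if not reliable[d]:
            x_hat[d] = sum(post[k] * truncated_mean(y[d], means[k][d], variances[k][d])
                           for k in range(M))
    return x_hat, post

# Two components, two features; the second feature is masked by noise at level 9.0.
priors = [0.5, 0.5]
means = [[0.0, 0.0], [6.0, 6.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
x_hat, post = tgi([5.8, 9.0], [True, False], priors, means, variances)
```

Here the reliable feature selects the second component almost exclusively, and the masked feature is pulled from the observed value 9.0 down to roughly that component's mean. For high-dimensional features the likelihood product should be accumulated in the log domain to avoid underflow.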

Page 25: Thesis

Example

[Figure: four log-Mel spectrogram panels (Mel channel versus time in seconds, 0.5 to 3.0 s) for the utterance "eight six zero one one six two": Clean, Noisy (0 dB), Oracle mask, and TGI reconstruction.]

Page 26: Thesis

Experimental Setup

Databases: Aurora2 & Aurora4.

The 3 test sets (A, B and C) of Aurora2 are considered.

Aurora4: 5000-word recognition task based on the Wall Street Journal corpus. Two testing conditions:

- Test 01-07 includes utterances with artificially added acoustic noise (random SNR between 10 dB and 20 dB).
- Test 08-14: acoustic noise + different microphones.

TGI is evaluated using both oracle (OR) and estimated (EST) binary masks.

Noise estimation: linear interpolation of the first and last frames of the utterance.

Front-end speech model: GMM with 256 components.
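The noise estimator described above can be sketched as follows. Several details are assumptions made only for illustration: the number of edge frames averaged (`n_edge`) is not given in the slides, and the way the estimated binary mask is derived from the noise estimate is not described either, so a plain threshold on the difference between observation and noise estimate is used here as a stand-in.

```python
def interpolate_noise(spec, n_edge=20):
    """Per-frame noise estimate interpolated between the utterance edges.

    spec   : list of T frames, each a list of D log-Mel values
    n_edge : number of leading / trailing frames averaged (assumed value)
    """
    T, D = len(spec), len(spec[0])
    n_edge = min(n_edge, T)
    head = [sum(f[d] for f in spec[:n_edge]) / n_edge for d in range(D)]
    tail = [sum(f[d] for f in spec[-n_edge:]) / n_edge for d in range(D)]
    noise = []
    for t in range(T):
        a = t / (T - 1) if T > 1 else 0.0          # 0 at the first frame, 1 at the last
        noise.append([(1.0 - a) * head[d] + a * tail[d] for d in range(D)])
    return noise

def binary_mask(spec, noise, threshold=0.0):
    """1 (reliable) where the observation exceeds the noise estimate by more than
    `threshold`, else 0 (unreliable). Hypothetical rule, not from the slides."""
    return [[1 if y - n > threshold else 0 for y, n in zip(fy, fn)]
            for fy, fn in zip(spec, noise)]

# Five frames, one Mel channel: noise drifts from 1.0 to 2.0, speech peaks in the middle.
spec = [[1.0], [1.0], [6.0], [2.0], [2.0]]
noise = interpolate_noise(spec, n_edge=2)
mask = binary_mask(spec, noise, threshold=1.0)
```

Only the middle frame is marked reliable in this toy case. The hard decision is also where noise estimation errors bite: a small error near the threshold flips a feature between reliable and unreliable.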

Page 27: Thesis

Experimental Results

[Figure: bar chart of word accuracy, WAcc (%), on Aurora2 and Aurora4 for Baseline, CBR-OR, TGI-OR, CBR-EST and TGI-EST.]

CBR: Cluster-Based Reconstruction (Raj et al., 2004).

TGI outperforms CBR when oracle masks are used.

The difference is small when the masks are estimated.

Large margin for improvement between OR and EST ⇒ a more robust approach for speech/noise segregation is required.

Page 28: Thesis

Masking-Model based Spectral Reconstruction

As we have seen, TGI achieves excellent results when oracle masks are used.

However, its performance diminishes when the masks are estimated ⇒ the noise estimation errors can be magnified by the hard decision implemented by the binary masks.

MMSR uses the noise estimates directly in the MMSE estimation.

Advantages with respect to TGI:

- No a priori segregation mask is required now.
- Therefore, the feature reliability and the speech energy in the unreliable regions are jointly estimated.

Page 29: Thesis

MMSR: Diagram

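- $\mathcal{M}_x$: GMM with $M_x$ Gaussians.
- $\mathcal{M}_n$: GMM with $M_n$ Gaussians (alternatively, a noise estimate $n_t \sim \mathcal{N}_n(\hat{n}_t, \Sigma_{n,t})$ for each frame).

MMSE estimation:

$$\hat{x} = \sum_{k_x=1}^{M_x} \sum_{k_n=1}^{M_n} P(k_x, k_n | y) \, \hat{x}^{(k_x,k_n)}$$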

Page 30: Thesis

Posterior Computation

Applying Bayes' rule, $P(k_x, k_n | \mathbf{y}) \propto p(\mathbf{y} | k_x, k_n) \, P(k_x) \, P(k_n)$.

Independence assumption: $p(\mathbf{y} | k_x, k_n)$ is expressed as the product of $p(y | k_x, k_n)$ over every observed feature $y$.

According to the masking model, $p(y | k_x, k_n)$ is computed as,

$$p(y | k_x, k_n) = \underbrace{p(x = y, n \le y | k_x, k_n)}_{p_x(y | k_x) \, P_n(n \le y | k_n)} + \underbrace{p(n = y, x < y | k_x, k_n)}_{p_n(y | k_n) \, P_x(x < y | k_x)}$$

The first term is the probability that speech is dominant; the second term is the probability that noise is dominant.
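For a single feature with Gaussian speech and noise components, the two terms of the observation likelihood are products of one PDF and one CDF. The sketch below is illustrative only (hypothetical function names, one-dimensional components). It also checks numerically that the sum of the two terms is a proper density in y, as expected for the maximum of two independent variables.

```python
import math

def pdf(v, mu, var):
    return math.exp(-0.5 * (v - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def cdf(v, mu, var):
    return 0.5 * (1.0 + math.erf((v - mu) / math.sqrt(2.0 * var)))

def masking_likelihood(y, mu_x, var_x, mu_n, var_n):
    """Return the (speech dominant, noise dominant) terms of p(y | kx, kn)."""
    speech_dominant = pdf(y, mu_x, var_x) * cdf(y, mu_n, var_n)   # x = y, n <= y
    noise_dominant = pdf(y, mu_n, var_n) * cdf(y, mu_x, var_x)    # n = y, x < y
    return speech_dominant, noise_dominant

# Observation well above the noise component: the speech term dominates.
s_term, n_term = masking_likelihood(6.0, 5.0, 1.0, 0.0, 1.0)

# Riemann-sum check that the two terms together integrate to 1 over y.
step = 0.01
total = sum(sum(masking_likelihood(-20.0 + i * step, 3.0, 1.0, 0.0, 4.0)) * step
            for i in range(4001))
```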

Page 31: Thesis

Partial Estimates

Contrary to TGI, the reliability of the observed feature $y$ is unknown in MMSR.

Hence, both the reliable and unreliable cases are taken into account,

$$\hat{x}^{(k_x,k_n)} = w^{(k_x,k_n)} \, y + \left( 1 - w^{(k_x,k_n)} \right) \mu_x^{(k_x)}$$

- First term: estimate for high SNRs.
- Second term: estimate for masked speech (i.e. truncated PDF mean).
- $w^{(k_x,k_n)} = P(x = y, n < y | k_x, k_n)$ is the normalized speech presence probability.

Page 32: Thesis

MMSR: Mask Estimation

MMSR can also be considered as a robust method for speech segregation.

To see this, we reproduce here the final expression for the MMSE estimator,

$$\hat{x} = \underbrace{\sum_{k_x=1}^{M_x} \sum_{k_n=1}^{M_n} P(k_x, k_n | y) \, w^{(k_x,k_n)}}_{m} \; y + \sum_{k_x=1}^{M_x} \sum_{k_n=1}^{M_n} P(k_x, k_n | y) \left( 1 - w^{(k_x,k_n)} \right) \mu_x^{(k_x)}$$

$m \in [0, 1]$ acts as a soft-mask: $m \approx 1$ for the reliable features and $m \approx 0$ for the unreliable ones.

Advantages regarding other methods:

- Parameter free.
- Mask estimation is fully integrated within the reconstruction.
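The estimator and its soft-mask can be illustrated for a single feature. This is a sketch under assumptions: one-dimensional Gaussian mixtures for speech and noise, w taken as the speech-dominant term divided by the sum of both terms, and the masked-speech estimate taken as the mean of the speech Gaussian truncated at y. Function names and toy models are hypothetical.

```python
import math

def pdf(v, mu, var):
    return math.exp(-0.5 * (v - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def cdf(v, mu, var):
    return 0.5 * (1.0 + math.erf((v - mu) / math.sqrt(2.0 * var)))

def truncated_mean(upper, mu, var):
    """Mean of N(mu, var) restricted to (-inf, upper]."""
    return mu - var * pdf(upper, mu, var) / max(cdf(upper, mu, var), 1e-300)

def mmsr_1d(y, speech, noise):
    """speech / noise: lists of (prior, mean, var). Returns (x_hat, soft mask m)."""
    terms = []
    for p_x, mu_x, var_x in speech:
        for p_n, mu_n, var_n in noise:
            s = pdf(y, mu_x, var_x) * cdf(y, mu_n, var_n)    # speech dominant
            n = pdf(y, mu_n, var_n) * cdf(y, mu_x, var_x)    # noise dominant
            lik = max(s + n, 1e-300)
            w = s / lik                                      # normalized speech presence
            terms.append((p_x * p_n * lik, w, truncated_mean(y, mu_x, var_x)))
    z = sum(t[0] for t in terms)                             # normalizes P(kx, kn | y)
    m = sum(t[0] / z * t[1] for t in terms)
    x_hat = m * y + sum(t[0] / z * (1.0 - t[1]) * t[2] for t in terms)
    return x_hat, m

# Speech is either quiet (mean 0) or loud (mean 10); noise sits around 5.
speech = [(0.5, 0.0, 1.0), (0.5, 10.0, 1.0)]
noise = [(1.0, 5.0, 1.0)]
x_high, m_high = mmsr_1d(10.0, speech, noise)   # observation far above the noise
x_low, m_low = mmsr_1d(5.0, speech, noise)      # observation at the noise level
```

With the observation far above the noise the mask is close to 1 and the observation passes through nearly unchanged; at the noise level the mask is close to 0 and the estimate falls back to the quiet-speech component.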

Page 33: Thesis

Experimental Results

[Figure: bar chart of word accuracy, WAcc (%), on Aurora2 and Aurora4 for Baseline, TGI-OR, TGI-EST, MMSR and VTS.]

VTS: well-known model-based compensation technique (Moreno, 1996).

MMSR outperforms TGI-EST and is upper-bounded by TGI-OR.

VTS is slightly better than MMSR ⇒ more accurate noise models can reduce the gap.

Page 34: Thesis

MMSR: Diagram

Page 35: Thesis

MMSR: Diagram

Page 36: Thesis

EM-based Noise Model Estimation

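Objective: estimate the noise model used in MMSR.

Noise model: GMM with $M_n$ Gaussians,

$$\mathcal{M}_n = \left\{ \left\langle \pi_n^{(1)}, \mu_n^{(1)}, \Sigma_n^{(1)} \right\rangle, \ldots, \left\langle \pi_n^{(M_n)}, \mu_n^{(M_n)}, \Sigma_n^{(M_n)} \right\rangle \right\}$$

where $\pi_n^{(k_n)}$ ($k_n = 1, \ldots, M_n$) are the component priors.

Maximum Likelihood estimation:

$$\hat{\mathcal{M}}_n = \arg\max_{\mathcal{M}_n} \; p(y_1, \ldots, y_T | \mathcal{M}_n, \mathcal{M}_x)$$

Direct optimization of this expression is infeasible ⇒ an iterative EM approach is used.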

Page 37: Thesis

Overview

Problems

- The oracle mask is unknown ⇒ the soft-mask estimated by MMSR is used.
- Treatment of the speech-dominated regions: the noise in these regions can be estimated using the model obtained in the previous iteration.

Page 38: Thesis

Experimental Results

[Figure: WAcc (%) versus number of GMM components on Aurora2 and Aurora4, comparing the estimated noise with the GMM noise model.]

Small but consistent performance improvement is achieved when using GMM noise models in MMSR.

GMMs worse than estimated noise in 2 cases:

- 1-Gaussian GMMs: unable to properly model non-stationary noises.
- Complex GMMs: not enough data to robustly estimate the GMM parameters.

Page 39: Thesis

4 Temporal Modelling and Uncertainty Decoding

Temporal Modelling | Uncertainty Decoding

Page 40: Thesis

Temporal Modelling

More accurate MMSE estimates are obtained with better speech models.

Here, the temporal correlation of speech is considered.

Two alternative approaches:

- Patch-based modelling: short segments of speech are modelled instead of single frames.
- HMM modelling: the previous speech models (GMMs or VQ codebooks) are augmented with transition probabilities. Then,

$$\hat{x}_t = \sum_{k=1}^{M} P(k | y_1, \ldots, y_t, \ldots, y_T) \, E[x | y_t, k]$$
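The posteriors P(k | y_1, ..., y_T) in the HMM-based estimator depend on the whole utterance. One standard way to obtain them is the scaled forward-backward recursion, sketched below. Whether the thesis uses exactly this recursion is an assumption, and the per-frame likelihoods p(y_t | k) are taken as given inputs.

```python
def state_posteriors(obs_lik, trans, init):
    """gamma[t][k] = P(k | y_1, ..., y_T) via scaled forward-backward.

    obs_lik[t][k] : p(y_t | k)
    trans[i][j]   : P(k_t = j | k_{t-1} = i)
    init[k]       : P(k_1 = k)
    """
    T, M = len(obs_lik), len(init)
    alpha, scale = [], []
    for t in range(T):
        if t == 0:
            a = [init[k] * obs_lik[0][k] for k in range(M)]
        else:
            a = [obs_lik[t][k] * sum(alpha[t - 1][i] * trans[i][k] for i in range(M))
                 for k in range(M)]
        c = sum(a)
        alpha.append([v / c for v in a])      # scaled forward variables
        scale.append(c)
    beta = [[1.0] * M for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(M):
            beta[t][i] = sum(trans[i][j] * obs_lik[t + 1][j] * beta[t + 1][j]
                             for j in range(M)) / scale[t + 1]
    gamma = []
    for t in range(T):
        g = [alpha[t][k] * beta[t][k] for k in range(M)]
        s = sum(g)
        gamma.append([v / s for v in g])
    return gamma

# Three frames, two states; the middle frame is ambiguous on its own.
obs_lik = [[0.9, 0.1], [0.5, 0.5], [0.9, 0.1]]
trans = [[0.9, 0.1], [0.1, 0.9]]
gamma = state_posteriors(obs_lik, trans, [0.5, 0.5])
```

The middle frame, uninformative in isolation, is assigned to the first state with high probability because of its neighbours. The clean-speech estimate at frame t is then the gamma-weighted sum of the partial estimates E[x | y_t, k].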

Page 41: Thesis

Experimental Results

[Figure: bar chart of word accuracy, WAcc (%), on Aurora2 and Aurora4 for TGI-OR, PATCH-OR, HMM-OR, TGI-EST, PATCH-EST and HMM-EST.]

The PATCH and HMM approaches are applied in combination with TGI.

Spectral reconstruction benefits from temporal redundancy, especially at low SNRs.

The HMM-based modelling achieves the best results.

Page 42: Thesis

Uncertainty Decoding (I)

The accuracy of MMSE estimation depends on many factors, such as the SNR of the signal, stationarity of the noise, etc.

Inaccurate $\hat{x}_t$ can degrade the performance of ASR.

Two objectives:

1 Estimate the uncertainty/reliability of $\hat{x}_t$.

2 Account for this information in the recognizer.

Page 43: Thesis

Uncertainty Decoding (II)

Uncertainty of $\hat{x}$

- Depends on $p(x|y)$, which appears in the MMSE estimator.
- If $p(x|y) = \delta_y(x)$, then we will consider that $\hat{x}$ is fully reliable.
- If $p(x|y)$ is uniformly distributed, then $\hat{x}$ is badly estimated.

How to measure the uncertainty of $\hat{x}$?

- Entropy of $p(x|y)$.
- Variance of the MMSE estimate: $\Sigma_{\hat{x}}$.

Exploitation in the recognizer

- Soft-data decoding: $\Sigma_{\hat{x}}$ increases the variance of the Gaussians in the acoustic model.
- Weighted Viterbi Algorithm: the exponential factor $\rho \in [0, 1]$ used to weight the observation probabilities of $\hat{x}$ is obtained after applying a sigmoid function to $\mathrm{MSE} = \mathrm{tr}(\Sigma_{\hat{x}})$.
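Both ways of exploiting the uncertainty can be sketched per frame. The sigmoid parameters (`slope`, `midpoint`) and its orientation (decreasing in the MSE, so that unreliable frames get a small weight) are assumptions made for illustration only; the slides state only that a sigmoid of the MSE is used. The soft-data function shows the one-dimensional case.

```python
import math

def weight_from_mse(mse, slope=1.0, midpoint=1.0):
    """rho in [0, 1]: near 1 for small MSE, near 0 for large MSE (assumed shape)."""
    z = slope * (mse - midpoint)
    if z > 700.0:                      # avoid overflow in exp for very large MSE
        return 0.0
    return 1.0 / (1.0 + math.exp(z))

def weighted_log_obs(log_b, rho):
    """Weighted Viterbi: the observation probability b(x)^rho, in the log domain."""
    return rho * log_b

def soft_data_log_pdf(x_hat, var_hat, mu, var):
    """Soft-data decoding: log N(x_hat; mu, var + var_hat) for one dimension."""
    v = var + var_hat
    return -0.5 * (math.log(2.0 * math.pi * v) + (x_hat - mu) ** 2 / v)

rho_reliable = weight_from_mse(0.0)
rho_unreliable = weight_from_mse(5.0)
# An outlying estimate is penalized less once its uncertainty is accounted for.
plain = soft_data_log_pdf(5.0, 0.0, 0.0, 1.0)
softened = soft_data_log_pdf(5.0, 4.0, 0.0, 1.0)
```

In both schemes a poorly estimated frame contributes less to the path score, either because its log-probability is scaled down or because the Gaussian it is scored against is broadened.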

Page 44: Thesis

Experimental Results

[Figure: bar chart of word accuracy, WAcc (%), on Aurora2 and Aurora4 for Baseline, TGI-OR, UD-OR, TGI-EST and UD-EST.]

UD: TGI + Weighted Viterbi Algorithm.

OR vs. EST: oracle masks and oracle uncertainties vs. estimated masks and uncertainties.

The recognition performance is improved after accounting for the uncertainty, especially in Aurora4.

Page 45: Thesis

5 Conclusions

Page 46: Thesis

Conclusions (I)

The performance of ASR is severely affected by noise.

To improve the robustness of ASR to noise, a feature compensation approach has been adopted in this thesis.

Stereo-data based compensation: stereo recordings are used to estimate a set of transformations that are later applied to noisy speech.

- Excellent results for the environments seen during training.
- Efficient implementation without a significant performance degradation when VQ codebooks are used.
- The proposed techniques can be used to reduce the residual noise of other robust techniques.

Page 47: Thesis

Conclusions (II)

Model-based compensation: a model that considers the distortion of speech as a masking problem is used to derive two reconstruction techniques.

- TGI estimates the masked regions in the noisy spectra. Good results if the masking pattern is perfectly known, otherwise its performance is significantly affected.
- MMSR uses clean speech and noise models to enhance noisy speech. Unlike TGI, mask estimation is an integrated part of the reconstruction algorithm.
- An EM-based iterative algorithm has been proposed to estimate the noise models used by MMSR.

Finally, several approaches to account for temporal correlations and to decode uncertain speech evidence were also investigated.

Page 48: Thesis

Future Work

Speech masking model vs. perceptual masking.

EM algorithm: joint estimation of additive and convolutional noises.

Using more information in MMSR, e.g. pitch, onset/offset position, etc.

Joint speaker and noise compensation.

Page 49: Thesis

Thank you!
