Histogram-based Quantization
for Distributed / Robust Speech
Recognition
Chia-yu Wan, Lin-shan Lee
College of EECS, National Taiwan University, R. O. C.
2007/08/16
Outline
Introduction
Histogram-based Quantization (HQ)
Joint Uncertainty Decoding (JUD)
Three-stage Error Concealment (EC)
Conclusion
Problems of Distance-based VQ
Conventional distance-based VQ (e.g. SVQ) has been popularly used in DSR
Dynamic environmental noise and codebook mismatch jointly degrade the performance of SVQ:
- noise moves clean speech to another partition cell (X to Y)
- mismatch between the fixed VQ codebook and the test data increases distortion
- quantization increases the difference between clean and noisy features
Histogram-based Quantization (HQ) is proposed to solve these problems
Decision boundaries y_i, i = 1, ..., N, are dynamically defined by C(y); the representative values z_i, i = 1, ..., N, are fixed, transformed from a standard Gaussian.
Histogram-based Quantization (HQ)
{D_i, z_i, b_i (vertical scale), i = 1, ..., N} are determined by Lloyd-Max and a standard Gaussian distribution
Histogram-based Quantization (HQ)

x_t → z_i, if b_{i-1} < C(x_t) ≤ b_i, or equivalently y_{i-1} < x_t ≤ y_i, where i = 1, 2, ..., N

The actual decision boundaries (horizontal scale) for x_t are dynamically defined by the inverse transformation of C(y).
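As a concrete illustration, the quantization rule above can be sketched as follows (a minimal sketch; the function and variable names are hypothetical, since the slides give no implementation). The local block is ranked to form the empirical CDF C, and each sample is mapped to the fixed representative z_i of the vertical-scale cell (b_{i-1}, b_i] that C(x_t) falls in:

```python
import numpy as np

def hq_quantize(block, z, b):
    """Sketch of the HQ rule: x_t -> z_i if b_{i-1} < C(x_t) <= b_i,
    where C is the empirical CDF of the local block (vertical scale),
    z holds the fixed representatives z_1..z_N (e.g. Lloyd-Max on a
    standard Gaussian), and b holds boundaries b_0 = 0 < ... < b_N = 1."""
    block = np.asarray(block, dtype=float)
    order = np.argsort(block, kind="stable")
    ranks = np.empty(len(block), dtype=float)
    ranks[order] = np.arange(1, len(block) + 1)
    cdf = ranks / len(block)                      # C(x_t) in (0, 1]
    # cell index i-1 such that b[i-1] < C(x_t) <= b[i]
    cells = np.clip(np.searchsorted(b, cdf, side="left") - 1, 0, len(z) - 1)
    return np.asarray(z)[cells]
```

When the block statistics change to a new C'(y'), the same fixed vertical-scale boundaries b_i automatically correspond to shifted horizontal boundaries y'_i, which is exactly the adaptation the slides describe.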
Histogram-based Quantization (HQ)
With a new histogram C'(y'), the decision boundaries automatically change to (y'_{i-1}, y'_i].
The decision boundaries are adjusted according to local statistics, so there is no codebook mismatch problem.

x_t → z_i, if b_{i-1} < C'(x_t) ≤ b_i, or equivalently y'_{i-1} < x_t ≤ y'_i, where i = 1, 2, ..., N
Histogram-based Quantization (HQ)
Based on the CDF on the vertical scale and the histogram, HQ is less sensitive to noise on the horizontal scale
Disturbances are automatically absorbed into the HQ block
Dynamic nature of HQ: the hidden codebook on the vertical scale is transformed by the dynamic C(y) into {y_i}, so HQ is dynamic on the horizontal scale
Histogram-based Vector Quantization (HVQ)
Discussions about robustness of Histogram-based Quantization (HQ)
Distributed speech recognition: SVQ vs. HQ
Robust speech recognition: HEQ vs. HQ
Comparison of Distance-based VQ and Histogram-based Quantization (HQ)
Distance-based VQ (SVQ): the fixed codebook cannot well represent the noisy speech; quantization increases the difference between clean and noisy speech
Histogram-based Quantization (HQ): dynamically adjusted to local statistics, so no codebook mismatch; inherent robust nature, with noise disturbances automatically absorbed by C(y)
HQ solves the major problems of conventional distance-based VQ
HEQ (Histogram Equalization) vs. HQ (Histogram-based Quantization)
HEQ performs a point-to-point transformation
- point-based order statistics are more disturbed by noise
HQ performs a block-based transformation
- disturbances are automatically absorbed within a block
- with a proper choice of block size, the block uncertainty can be compensated by GMM modeling and uncertainty decoding
Averaged normalized distance between clean and corrupted speech features, based on the AURORA 2 database
HQ gives a smaller distance d for all SNR conditions, i.e., it is less influenced by the noise disturbance
HQ as a feature transformation method
HQ as a feature quantization method
Further analysis: bit rates vs. SNR
Clean-condition training vs. multi-condition training
HQ-JUD: for both robust and distributed speech recognition
For robust speech recognition
• HQ is used as the front-end feature transformation
• JUD as the enhancement approach at the back-end recognizer
For Distributed Speech Recognition (DSR)
• HQ is applied at the client for data compression
• JUD at the server
Joint Uncertainty Decoding (1/4) - Uncertainty Observation Decoding
HMMs should be less discriminative on features with higher uncertainty: increase the variance for more uncertain features
w: observation, o: uncorrupted features
Assume the conditional distribution of o given w is Gaussian (the standard uncertainty-decoding assumption)
Joint Uncertainty Decoding (2/4) - Uncertainty for quantization errors
The codeword is the observation w
The samples in the partition cell are the uncorrupted features o
p(o) is the pdf of the samples within the partition cell
Variance of the samples within the partition cell:
the more uncertain regions are the loosely quantized cells
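A sketch of how such per-cell uncertainty variances could be obtained (a hypothetical helper, not taken from the slides): group the block samples by the HQ cell they fall in and take the sample variance of each group.

```python
import numpy as np

def cell_uncertainty_variances(block, z, b):
    """For each HQ partition cell, the variance of the block samples
    that fall in it: loosely quantized cells collect widely spread
    samples and thus get larger uncertainty variances."""
    block = np.asarray(block, dtype=float)
    n = len(block)
    # empirical CDF C(x) = #{x_s <= x} / n, then cell i-1 such that
    # b[i-1] < C(x) <= b[i]
    cdf = np.array([(block <= x).sum() / n for x in block])
    cells = np.clip(np.searchsorted(b, cdf, side="left") - 1, 0, len(z) - 1)
    return np.array([block[cells == i].var() if np.any(cells == i) else 0.0
                     for i in range(len(z))])
```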
Increases the variances for the loosely quantized cells
Joint Uncertainty Decoding (3/4) - Uncertainty for environmental noise
Increase the variances for HQ features with a larger histogram shift
Jointly consider the uncertainty caused by both the environmental noise and the quantization errors
One of the above two would dominate:
- quantization errors (high SNR): disturbances absorbed into the HQ block
- environmental noise (low SNR): noisy features moved to other partition cells
Joint Uncertainty Decoding (4/4)
HQ-JUD for robust speech recognition
[Figure: different types of noise, averaged over all SNR values; systems compared: Client HEQ-SVQ, Client HEQ-SVQ + Server UD, Client HQ, Client HQ + Server JUD]
HQ-JUD for distributed speech recognition
[Figure: different types of noise, averaged over all SNR values; systems compared: Client HEQ-SVQ, Client HEQ-SVQ + Server UD, Client HQ, Client HQ + Server JUD]
HEQSVQ-UD was slightly worse than HEQ for set C
HQ-JUD consistently improved the performance of HQ
HQ performed better than HEQ-SVQ for all types of noise
HQ-JUD consistently performed better than HEQSVQ-UD
[Figure: different SNR conditions, averaged over all noise types; systems compared: Client SVQ + Server UD, Client HEQ-SVQ + Server UD, Client HQ + Server JUD]
HQ-JUD significantly improved the performance of SVQ-UD
Three-stage error concealment (EC)
Stage 1 : error detection
Frame-level error detection
- the received frame-pairs are first checked with CRC
Subvector-level error detection
- the erroneous frame-pairs are then checked by the HQ consistency check
- the quantized codewords for HQ represent the order-statistics information of the original parameters
- the quantization process does not change the order statistics
- re-performing HQ on a received subvector, each codeword should fall in the same partition cell
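One way this consistency check could look (an interpretive sketch with hypothetical names; the paper's exact procedure may differ): decode each received codeword to its representative z[i], re-perform HQ on the decoded block, and flag codewords that no longer fall in their own cell.

```python
import numpy as np

def hq_consistency_flags(codewords, z, b):
    """Sketch of the subvector-level HQ consistency check.

    Since quantization does not change the order statistics of a
    block, re-quantizing the decoded representatives should return
    every codeword to its own partition cell; codewords that land
    elsewhere are flagged as suspected transmission errors."""
    codewords = np.asarray(codewords)
    decoded = np.asarray(z)[codewords]
    n = len(decoded)
    # empirical CDF of the decoded block: C(x) = #{x_s <= x} / n
    cdf = np.array([(decoded <= x).sum() / n for x in decoded])
    cells = np.clip(np.searchsorted(b, cdf, side="left") - 1, 0, len(z) - 1)
    return cells != codewords        # True = inconsistent
```

A bit error that moves too many codewords into one cell shifts the empirical CDF, so the re-quantized cells disagree with the received ones; which specific codeword is wrong remains ambiguous, so the flags only mark suspects.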
Stage 1 : error detection
Noise seriously affects SVQ with the data consistency check
- precision degrades from 66% under clean conditions down to 12% at 0 dB
The HQ-based consistency approach is much more stable at all SNR values
- both recall and precision rates are higher
Stage 2 : reconstruction
Based on the Maximum A Posteriori (MAP) criterion
- considering the probability of all possible codewords S_t(i) at time t, given the current and previous received subvector codewords R_t and R_{t-1}
- prior speech source statistics: HQ codeword bigram model
- channel transition probability: the estimated BER from stage 1
- reliability of the received subvectors: the relative reliability between the prior speech source and the wireless channel

Ŝ_t = argmax_{S_t(i)} P(S_t(i) | R_t, R_{t-1}) = argmax_{S_t(i)} [ P(S_t(i) | R_{t-1}) · P(R_t | S_t(i)) ]
                                                                      (prior)            (channel)

Channel transition probability: P(R_t | S_t(i)) = BER^{d(S_t(i), R_t)} · (1 − BER)^{M − d(S_t(i), R_t)}
(d: Hamming distance between codeword S_t(i) and R_t; M: number of bits in the subvector)
- significantly differentiated (over different codewords i, with different d) when R_t is more reliable (BER is smaller)
- puts more emphasis on the prior speech source when R_t is less reliable
- the estimated BER is the number of inconsistent subvectors in the present frame divided by the total number of bits in the frame
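The MAP scoring above can be sketched as follows (array names and shapes are assumptions; the slides give only the formula): each candidate codeword is scored by the bigram prior times the binary-symmetric-channel likelihood.

```python
import numpy as np

def map_reconstruct(r_t, prev_idx, bigram, codebook_bits, ber):
    """MAP estimate of an erroneous subvector codeword (sketch).

    r_t           : received bits for the subvector, shape (M,).
    prev_idx      : codeword index decided for the previous frame.
    bigram        : bigram[j, i] = P(S_t = i | S_{t-1} = j).
    codebook_bits : bit patterns of all N codewords, shape (N, M).
    ber           : bit error rate estimated in stage 1.
    Scores candidate i by P(S_t(i) | R_{t-1}) * P(R_t | S_t(i)), with
    P(R_t | S_t(i)) = BER^d * (1 - BER)^(M - d), d = Hamming distance."""
    d = (codebook_bits != np.asarray(r_t)).sum(axis=1)   # Hamming distances
    m = codebook_bits.shape[1]
    channel = ber ** d * (1.0 - ber) ** (m - d)
    prior = bigram[prev_idx]                             # P(S_t(i) | S_{t-1})
    return int(np.argmax(prior * channel))
```

When BER is small the channel term is sharply peaked at small d and dominates; as BER grows toward 0.5 it flattens and the bigram prior takes over, which is the reliability trade-off described above.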
Stage 2 : reconstruction
Prior source information P(S_t(i) | R_{t-1})
- based on the codeword bigram trained from the clean training data of AURORA 2
- HQ can estimate the lost subvectors more precisely than SVQ
- the conditional entropy measure:

H(S_t | S_{t-1}) = E[ −Σ_i P(S_t(i) | S_{t-1}) · log P(S_t(i) | S_{t-1}) ]
Stage 3 : compensation in Viterbi decoding
The distribution P(S_t(i) | R_t, R_{t-1}) characterizes the uncertainty of the estimated features
Assuming P(S_t(i) | R_t, R_{t-1}) is Gaussian, its variance is used in uncertainty decoding
This makes the HMMs less discriminative for the estimated subvectors with higher uncertainty
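The variance inflation in stage 3 can be sketched as follows (a standard uncertainty-decoding form; the exact formulation in the paper may differ): the HMM Gaussian is evaluated with its variance increased by the estimate's uncertainty variance.

```python
import math

def ud_log_likelihood(x, mean, var, uncertainty_var):
    """Gaussian log-likelihood with the HMM variance inflated by the
    feature's uncertainty variance; large uncertainty flattens the
    likelihood, making the model less discriminative for that frame."""
    v = var + uncertainty_var       # inflate variance by uncertainty
    return -0.5 * (math.log(2.0 * math.pi * v) + (x - mean) ** 2 / v)
```

As uncertainty_var grows, the log-likelihood gap between competing state means shrinks, so unreliable subvectors carry less weight in Viterbi decoding.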
HQ-based DSR system with transmission errors
Features corrupted by noise are more susceptible to transmission errors: for SVQ, accuracy dropped from 98% to 87% under clean conditions, and from 60% to 36% at 10 dB SNR
HQ-based DSR system with transmission errors
The improvements that HQ offered over HEQ-SVQ when transmission errors were present are consistent and significant at all SNR values
HQ is robust against both environmental noise and transmission errors
Analyze the degradation of recognition accuracy caused by transmission errors
Comparison of SVQ, HEQ-SVQ and HQ in the percentage of words that were correctly recognized without transmission errors but incorrectly recognized after transmission.
HQ-Based DSR with Wireless Channels and Error Concealment
The ETSI repetition technique actually degraded the performance of HEQ-SVQ: whole feature vectors, including the correct subvectors, are replaced by inaccurate estimates
g: GPRS, r: ETSI repetition, c: three-stage EC
HQ-Based DSR with Wireless Channels and Error Concealment
The three-stage EC improved the performance significantly in all cases: robust not only against transmission errors, but against environmental noise as well
g: GPRS, r: ETSI repetition, c: three-stage EC
HQ-Based DSR with Wireless Channels and Error Concealment
Different client traveling speed (1/3)
Different client traveling speed (2/3)
Different client traveling speed (3/3)
Conclusions
Histogram-based Quantization (HQ) is proposed
- a novel approach for robust and/or distributed speech recognition (DSR)
- robust against environmental noise (for all types of noise and all SNR conditions) and transmission errors
For future personalized and context-aware DSR environments
- HQ can be adapted to network and terminal capabilities, with recognition performance optimized based on environmental conditions
Thank you for your attention