Histogram-based Quantization
for Distributed / Robust Speech
Recognition
Chia-yu Wan, Lin-shan Lee
College of EECS, National Taiwan University, R. O. C.
2007/08/16
Outline
Introduction
Histogram-based Quantization (HQ)
Joint Uncertainty Decoding (JUD)
Three-stage Error Concealment (EC)
Conclusion
Problems of Distance-based VQ
Conventional distance-based VQ (e.g. SVQ) has been popularly used in DSR
Dynamic environmental noise and codebook mismatch jointly degrade the performance of SVQ:
- noise moves clean speech to another partition cell (X to Y)
- mismatch between the fixed VQ codebook and the test data increases distortion
- quantization increases the difference between clean and noisy features
Histogram-based Quantization (HQ) is proposed to solve these problems
Decision boundaries y_i, i = 1, ..., N, are dynamically defined by C(y); the representative values z_i, i = 1, ..., N, are fixed, transformed from a standard Gaussian.
Histogram-based Quantization (HQ)
{D_i, z_i, b_i (vertical scale), i = 1, ..., N} are determined by Lloyd-Max and a standard Gaussian distribution
Histogram-based Quantization (HQ)

x_t → z_i, if b_{i-1} < C(x_t) ≤ b_i, or equivalently y_{i-1} < x_t ≤ y_i, where i = 1, 2, ..., N

The actual decision boundaries (horizontal scale) for x_t are dynamically defined by the inverse transformation of C(y).
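As a concrete illustration, the quantization rule above can be sketched as follows (a minimal sketch; the function and variable names are hypothetical, since the slides give no implementation). The local block is ranked to form the empirical CDF C, and each sample is mapped to the fixed representative z_i of the vertical-scale cell (b_{i-1}, b_i] that C(x_t) falls in:

```python
import numpy as np

def hq_quantize(block, z, b):
    """Sketch of the HQ rule: x_t -> z_i if b_{i-1} < C(x_t) <= b_i,
    where C is the empirical CDF of the local block (vertical scale),
    z holds the fixed representatives z_1..z_N (e.g. Lloyd-Max on a
    standard Gaussian), and b holds boundaries b_0 = 0 < ... < b_N = 1."""
    block = np.asarray(block, dtype=float)
    order = np.argsort(block, kind="stable")
    ranks = np.empty(len(block), dtype=float)
    ranks[order] = np.arange(1, len(block) + 1)
    cdf = ranks / len(block)                      # C(x_t) in (0, 1]
    # cell index i-1 such that b[i-1] < C(x_t) <= b[i]
    cells = np.clip(np.searchsorted(b, cdf, side="left") - 1, 0, len(z) - 1)
    return np.asarray(z)[cells]
```

When the block statistics change to a new C'(y'), the same fixed vertical-scale boundaries b_i automatically correspond to shifted horizontal boundaries y'_i, which is exactly the adaptation the slides describe.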
Histogram-based Quantization (HQ)
With a new histogram C'(y'), the decision boundaries automatically change to (y'_{i-1}, y'_i].
The decision boundaries are adjusted according to local statistics, so there is no codebook mismatch problem.

x_t → z_i, if b_{i-1} < C'(x_t) ≤ b_i, or equivalently y'_{i-1} < x_t ≤ y'_i, where i = 1, 2, ..., N
Histogram-based Quantization (HQ)
Based on the CDF on the vertical scale and the histogram, HQ is less sensitive to noise on the horizontal scale
Disturbances are automatically absorbed into the HQ block
Dynamic nature of HQ: the hidden codebook on the vertical scale is transformed by the dynamic C(y) into {y_i}, so HQ is dynamic on the horizontal scale
Histogram-based Vector Quantization (HVQ)
Discussions about robustness of Histogram-based Quantization (HQ)
Distributed speech recognition: SVQ vs. HQ
Robust speech recognition: HEQ vs. HQ
Comparison of Distance-based VQ and Histogram-based Quantization (HQ)
Distance-based VQ (SVQ): the fixed codebook cannot well represent the noisy speech; quantization increases the difference between clean and noisy speech
Histogram-based Quantization (HQ): dynamically adjusted to local statistics, so no codebook mismatch; inherent robust nature, with noise disturbances automatically absorbed by C(y)
HQ solves the major problems of conventional distance-based VQ
HEQ (Histogram Equalization) vs. HQ (Histogram-based Quantization)
HEQ performs a point-to-point transformation
- point-based order statistics are more disturbed by noise
HQ performs a block-based transformation
- disturbances are automatically absorbed within a block
- with a proper choice of block size, the block uncertainty can be compensated by GMM modeling and uncertainty decoding
Averaged normalized distance between clean and corrupted speech features, based on the AURORA 2 database
HQ gives a smaller distance d for all SNR conditions, i.e., it is less influenced by the noise disturbance
HQ as a feature transformation method
HQ as a feature quantization method
Further analysis: bit rates vs. SNR
Clean-condition training vs. multi-condition training
HQ-JUD: for both robust and distributed speech recognition
For robust speech recognition
• HQ is used as the front-end feature transformation
• JUD as the enhancement approach at the back-end recognizer
For Distributed Speech Recognition (DSR)
• HQ is applied at the client for data compression
• JUD at the server
Joint Uncertainty Decoding (1/4) - Uncertainty Observation Decoding
HMMs should be less discriminative on features with higher uncertainty: increase the variance for more uncertain features
w: observation, o: uncorrupted features
Assume the conditional distribution of o given w is Gaussian (the standard uncertainty-decoding assumption)
Joint Uncertainty Decoding (2/4) - Uncertainty for quantization errors
The codeword is the observation w
The samples in the partition cell are the uncorrupted features o
p(o) is the pdf of the samples within the partition cell
Variance of the samples within the partition cell:
the more uncertain regions are the loosely quantized cells
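A sketch of how such per-cell uncertainty variances could be obtained (a hypothetical helper, not taken from the slides): group the block samples by the HQ cell they fall in and take the sample variance of each group.

```python
import numpy as np

def cell_uncertainty_variances(block, z, b):
    """For each HQ partition cell, the variance of the block samples
    that fall in it: loosely quantized cells collect widely spread
    samples and thus get larger uncertainty variances."""
    block = np.asarray(block, dtype=float)
    n = len(block)
    # empirical CDF C(x) = #{x_s <= x} / n, then cell i-1 such that
    # b[i-1] < C(x) <= b[i]
    cdf = np.array([(block <= x).sum() / n for x in block])
    cells = np.clip(np.searchsorted(b, cdf, side="left") - 1, 0, len(z) - 1)
    return np.array([block[cells == i].var() if np.any(cells == i) else 0.0
                     for i in range(len(z))])
```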
Increases the variances for the loosely quantized cells
Joint Uncertainty Decoding (3/4) - Uncertainty for environmental noise
Increase the variances for HQ features with a larger histogram shift
Jointly consider the uncertainty caused by both the environmental noise and the quantization errors
One of the above two would dominate:
- quantization errors (high SNR): disturbances absorbed into the HQ block
- environmental noise (low SNR): noisy features moved to other partition cells
Joint Uncertainty Decoding (4/4)
HQ-JUD for robust speech recognition
[Figure: different types of noise, averaged over all SNR values; systems compared: Client HEQ-SVQ, Client HEQ-SVQ + Server UD, Client HQ, Client HQ + Server JUD]
HQ-JUD for distributed speech recognition
[Figure: different types of noise, averaged over all SNR values; systems compared: Client HEQ-SVQ, Client HEQ-SVQ + Server UD, Client HQ, Client HQ + Server JUD]
HEQSVQ-UD was slightly worse than HEQ for set C
HQ-JUD consistently improved the performance of HQ
HQ performed better than HEQ-SVQ for all types of noise
HQ-JUD consistently performed better than HEQSVQ-UD
[Figure: different SNR conditions, averaged over all noise types; systems compared: Client SVQ + Server UD, Client HEQ-SVQ + Server UD, Client HQ + Server JUD]
HQ-JUD significantly improved the performance of SVQ-UD
Three-stage error concealment (EC)
Stage 1 : error detection
Frame-level error detection
- the received frame-pairs are first checked with CRC
Subvector-level error detection
- the erroneous frame-pairs are then checked by the HQ consistency check
- the quantized codewords for HQ represent the order-statistics information of the original parameters
- the quantization process does not change the order statistics
- re-performing HQ on a received subvector, each codeword should fall in the same partition cell
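One way this consistency check could look (an interpretive sketch with hypothetical names; the paper's exact procedure may differ): decode each received codeword to its representative z[i], re-perform HQ on the decoded block, and flag codewords that no longer fall in their own cell.

```python
import numpy as np

def hq_consistency_flags(codewords, z, b):
    """Sketch of the subvector-level HQ consistency check.

    Since quantization does not change the order statistics of a
    block, re-quantizing the decoded representatives should return
    every codeword to its own partition cell; codewords that land
    elsewhere are flagged as suspected transmission errors."""
    codewords = np.asarray(codewords)
    decoded = np.asarray(z)[codewords]
    n = len(decoded)
    # empirical CDF of the decoded block: C(x) = #{x_s <= x} / n
    cdf = np.array([(decoded <= x).sum() / n for x in decoded])
    cells = np.clip(np.searchsorted(b, cdf, side="left") - 1, 0, len(z) - 1)
    return cells != codewords        # True = inconsistent
```

A bit error that moves too many codewords into one cell shifts the empirical CDF, so the re-quantized cells disagree with the received ones; which specific codeword is wrong remains ambiguous, so the flags only mark suspects.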
Stage 1 : error detection
Noise seriously affects SVQ with the data consistency check
- precision degrades from 66% under clean conditions down to 12% at 0 dB
The HQ-based consistency approach is much more stable at all SNR values
- both recall and precision rates are higher
Stage 2 : reconstruction
Based on the Maximum A Posteriori (MAP) criterion
- considering the probability of all possible codewords S_t(i) at time t, given the current and previous received subvector codewords R_t and R_{t-1}
- prior speech source statistics: HQ codeword bigram model
- channel transition probability: the estimated BER from stage 1
- reliability of the received subvectors: the relative reliability between the prior speech source and the wireless channel

Ŝ_t = argmax_{S_t(i)} P(S_t(i) | R_t, R_{t-1}) = argmax_{S_t(i)} [ P(S_t(i) | R_{t-1}) · P(R_t | S_t(i)) ]
                                                                      (prior)            (channel)

Channel transition probability: P(R_t | S_t(i)) = BER^{d(S_t(i), R_t)} · (1 − BER)^{M − d(S_t(i), R_t)}
(d: Hamming distance between codeword S_t(i) and R_t; M: number of bits in the subvector)
- significantly differentiated (over different codewords i, with different d) when R_t is more reliable (BER is smaller)
- puts more emphasis on the prior speech source when R_t is less reliable
- the estimated BER is the number of inconsistent subvectors in the present frame divided by the total number of bits in the frame
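The MAP scoring above can be sketched as follows (array names and shapes are assumptions; the slides give only the formula): each candidate codeword is scored by the bigram prior times the binary-symmetric-channel likelihood.

```python
import numpy as np

def map_reconstruct(r_t, prev_idx, bigram, codebook_bits, ber):
    """MAP estimate of an erroneous subvector codeword (sketch).

    r_t           : received bits for the subvector, shape (M,).
    prev_idx      : codeword index decided for the previous frame.
    bigram        : bigram[j, i] = P(S_t = i | S_{t-1} = j).
    codebook_bits : bit patterns of all N codewords, shape (N, M).
    ber           : bit error rate estimated in stage 1.
    Scores candidate i by P(S_t(i) | R_{t-1}) * P(R_t | S_t(i)), with
    P(R_t | S_t(i)) = BER^d * (1 - BER)^(M - d), d = Hamming distance."""
    d = (codebook_bits != np.asarray(r_t)).sum(axis=1)   # Hamming distances
    m = codebook_bits.shape[1]
    channel = ber ** d * (1.0 - ber) ** (m - d)
    prior = bigram[prev_idx]                             # P(S_t(i) | S_{t-1})
    return int(np.argmax(prior * channel))
```

When BER is small the channel term is sharply peaked at small d and dominates; as BER grows toward 0.5 it flattens and the bigram prior takes over, which is the reliability trade-off described above.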
Stage 2 : reconstruction
Prior source information P(S_t(i) | R_{t-1})
- based on the codeword bigram trained from the clean training data of AURORA 2
- HQ can estimate the lost subvectors more precisely than SVQ
- the conditional entropy measure:

H(S_t | S_{t-1}) = E[ −Σ_i P(S_t(i) | S_{t-1}) · log P(S_t(i) | S_{t-1}) ]
Stage 3 : compensation in Viterbi decoding
The distribution P(S_t(i) | R_t, R_{t-1}) characterizes the uncertainty of the estimated features
Assuming P(S_t(i) | R_t, R_{t-1}) is Gaussian, its variance is used in uncertainty decoding
This makes the HMMs less discriminative for the estimated subvectors with higher uncertainty
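The variance inflation in stage 3 can be sketched as follows (a standard uncertainty-decoding form; the exact formulation in the paper may differ): the HMM Gaussian is evaluated with its variance increased by the estimate's uncertainty variance.

```python
import math

def ud_log_likelihood(x, mean, var, uncertainty_var):
    """Gaussian log-likelihood with the HMM variance inflated by the
    feature's uncertainty variance; large uncertainty flattens the
    likelihood, making the model less discriminative for that frame."""
    v = var + uncertainty_var       # inflate variance by uncertainty
    return -0.5 * (math.log(2.0 * math.pi * v) + (x - mean) ** 2 / v)
```

As uncertainty_var grows, the log-likelihood gap between competing state means shrinks, so unreliable subvectors carry less weight in Viterbi decoding.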
HQ-based DSR system with transmission errors
Features corrupted by noise are more susceptible to transmission errors: for SVQ, accuracy dropped from 98% to 87% under clean conditions, and from 60% to 36% at 10 dB SNR
HQ-based DSR system with transmission errors
The improvements that HQ offered over HEQ-SVQ when transmission errors were present are consistent and significant at all SNR values
HQ is robust against both environmental noise and transmission errors
Analyze the degradation of recognition accuracy caused by transmission errors
Comparison of SVQ, HEQ-SVQ and HQ in the percentage of words that were correctly recognized without transmission errors but incorrectly recognized after transmission.
HQ-Based DSR with Wireless Channels and Error Concealment
The ETSI repetition technique actually degraded the performance of HEQ-SVQ: whole feature vectors, including the correct subvectors, are replaced by inaccurate estimates
g: GPRS, r: ETSI repetition, c: three-stage EC
HQ-Based DSR with Wireless Channels and Error Concealment
The three-stage EC improved the performance significantly in all cases: robust not only against transmission errors, but against environmental noise as well
g: GPRS, r: ETSI repetition, c: three-stage EC
HQ-Based DSR with Wireless Channels and Error Concealment
Different client traveling speed (1/3)
Different client traveling speed (2/3)
Different client traveling speed (3/3)
Conclusions
Histogram-based Quantization (HQ) is proposed
- a novel approach for robust and/or distributed speech recognition (DSR)
- robust against environmental noise (for all types of noise and all SNR conditions) and transmission errors
For future personalized and context-aware DSR environments
- HQ can be adapted to network and terminal capabilities, with recognition performance optimized based on environmental conditions
Thank you for your attention