COMPENSATION FOR NONLINEAR DISTORTION IN NOISE FOR ROBUST SPEECH RECOGNITION
Mark J. Harvilla Ph.D. Thesis Defense October 27, 2014
Introduction
Topic                                                                   Symbol        Fraction of thesis work
Dynamic range compression (DRC) and automatic speech recognition (ASR)  DRC & ASR     11%
Blind amplitude normalization (BAN)                                     BAN           14%
Blind amplitude reconstruction (BAR)                                    BAR           28%
Robust estimation of distortion (RED)                                   RED           28%
Artificially-matched training (AMT)                                     AMT           9%
The Big Picture                                                         Big Picture   10%
DRC & ASR BAN BAR RED AMT Big Picture Conclusion Introduction
Dynamic Range Compression (DRC)
• A form of nonlinear distortion
  ➢ Nonlinear systems are common (e.g., AM/FM radio, rectifiers)
• DRC is used extensively in audio engineering, typically for one of three reasons:
  1. Adhere to dynamic range limitations of a signal transmission system, while increasing average signal power
  2. Increase perceived signal loudness
  3. Eliminate drastic changes in volume (e.g., automatic gain control)
• Because of the ubiquity of DRC, speech systems, like ASR, are likely to encounter compressed speech
Dynamic Range Compression (DRC)
• DRC is characterized by two parameters, ratio (R) and threshold (τ).
[Figure: output amplitude vs. input amplitude for R = 1.5, 2.5, ∞, with thresholds τ = 0.1 and τ = 0.6 marked]
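The two parameters can be made concrete with a minimal sketch of a static, memoryless compressor. The piecewise-linear curve below (unit slope up to τ, slope 1/R above it) is the usual textbook form and matches the shape of the plotted curves; it is an assumption, since the slides do not state the formula. The function names `compress` and `percentile_threshold`, and the reading of "P75" as a percentile of the absolute sample values, are likewise illustrative.

```python
import numpy as np

def compress(x, ratio, threshold):
    """Static DRC sketch: samples with |x| <= threshold pass unchanged;
    the part of |x| above the threshold is divided by `ratio`."""
    x = np.asarray(x, dtype=float)
    magnitude = np.abs(x)
    excess = np.maximum(magnitude - threshold, 0.0)
    # ratio = inf makes excess / ratio zero, i.e. hard clipping at the threshold
    return np.sign(x) * (np.minimum(magnitude, threshold) + excess / ratio)

def percentile_threshold(x, percentile):
    """Threshold given as a percentile of the absolute sample values (assumed convention)."""
    return np.percentile(np.abs(x), percentile)

x = np.linspace(-1.0, 1.0, 9)
y_mild = compress(x, ratio=2.0, threshold=0.5)     # gentle compression
y_clip = compress(x, ratio=np.inf, threshold=0.5)  # R = inf behaves as clipping
```

With R = 1 the function returns its input unchanged, and with R = ∞ it reduces to clipping at ±τ.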
Dynamic Range Compression (DRC)
[Figure: waveform (amplitude vs. time in seconds) and output amplitude vs. input amplitude for R = 1, 1.5, 2.5, ∞]
Dynamic Range Compression (DRC)
[Figure: waveform (time in seconds vs. amplitude)]
[Figure: SNR (dB) vs. τ, threshold (percentile), for R = 1.5, 2, 3, 6, ∞]
Some examples
Threshold (τ)  Ratio (R)  Crest factor  Word error rate (WER)  WER after processing
P100           1          17.1 dB       6.4%                   6.4%
P75            4          7.7 dB        20.3%                  6.4%
P75            ∞          4.1 dB        30.8%                  13.5%
P50            4          6.7 dB        30.2%                  6.4%
P50            ∞          2.2 dB        49.5%                  23.0%
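The crest factor column can be illustrated with a short sketch. It uses the standard definition (peak-to-RMS ratio in dB); how the thesis measured it is not stated on the slide, and the test tone and the name `crest_factor_db` are illustrative.

```python
import numpy as np

def crest_factor_db(x):
    """Peak-to-RMS ratio of a signal, in dB."""
    x = np.asarray(x, dtype=float)
    peak = np.max(np.abs(x))
    rms = np.sqrt(np.mean(x ** 2))
    return 20.0 * np.log10(peak / rms)

t = np.arange(48000) / 48000.0
tone = np.sin(2 * np.pi * 100.0 * t)   # a pure sine has a crest factor of about 3.01 dB
clipped = np.clip(tone, -0.5, 0.5)     # clipping lowers the peak more than the RMS

cf_tone = crest_factor_db(tone)
cf_clipped = crest_factor_db(clipped)
```

Stronger compression pulls the peak toward the RMS level, which is consistent with the crest factor falling as τ decreases and R grows in the table.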
Measuring the effect of DRC on ASR
Experiment 1 (no additive noise):
[Diagram: clean speech signal → DRC with controlled parameter values (R, τ) → ASR with clean acoustic model → measure word error rate (WER)]
Measuring the effect of DRC on ASR
Experiment 2 (additive, channel noise):
[Diagram: clean speech signal → DRC with controlled parameter values (R, τ) → additive noise at controlled SNR → ASR with clean acoustic model → measure word error rate (WER)]
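A minimal sketch of the "additive noise at controlled SNR" stage in Experiment 2 follows. White Gaussian noise and the name `add_noise_at_snr` are assumptions; the slides only state that the SNR is controlled and, on the results slides, that it is measured with respect to the compressed signal.

```python
import numpy as np

def add_noise_at_snr(signal, snr_db, rng):
    """Add white Gaussian noise scaled so that the SNR, measured against
    `signal` (here the already-compressed waveform), equals `snr_db`."""
    signal = np.asarray(signal, dtype=float)
    noise = rng.standard_normal(signal.shape)
    signal_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2)
    # Choose the gain so that 10*log10(signal_power / scaled_noise_power) == snr_db
    gain = np.sqrt(signal_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return signal + gain * noise

rng = np.random.default_rng(0)
compressed = np.clip(np.sin(np.linspace(0, 20 * np.pi, 8000)), -0.5, 0.5)
noisy = add_noise_at_snr(compressed, snr_db=20.0, rng=rng)
measured_snr = 10.0 * np.log10(np.mean(compressed ** 2) / np.mean((noisy - compressed) ** 2))
```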
Measuring the effect of DRC on ASR
Experiment 1 (no additive noise):
[Figure: word error rate (%) vs. τ, threshold (percentile), for R = 1, 2, 4, 6, 10, 20, ∞; no additive noise]
Measuring the effect of DRC on ASR
Experiment 2 (additive, channel noise):
[Figure: word error rate (%) vs. τ, threshold (percentile), for R = 1, 2, 4, 6, 10, 20, ∞; additive noise at 20-dB SNR w.r.t. compressed signal]
Measuring the effect of DRC on ASR
Experiment 2 (additive, channel noise):
[Figure: word error rate (%) vs. τ, threshold (percentile); additive noise at 15-dB SNR w.r.t. compressed signal]
Counteracting the effects of DRC
[Diagram: DRC divides into saturating “clipping”, addressed by blind amplitude reconstruction (BAR), and non-saturating “compression”, addressed by blind amplitude normalization (BAN); also shown: artificially-matched training (AMT) and robust estimation of nonlinear distortion function (RED)]
Blind Amplitude Normalization (BAN) (Balchandran & Mammone; ICASSP 1998)
• Step 1: Obtain estimate of the cumulative distribution function (CDF) of the observed speech, and of clean, unadulterated reference speech.
[Figure: CDFs (cumulative probability vs. amplitude) of observed speech (R = 10, τ = P50) and clean speech]
• Step 2: For a given reference signal amplitude, find the amplitude in the observed CDF with the same cumulative probability.
  ➢ Input amplitude of 0.061 maps to 0.2
[Figure: CDFs (cumulative probability vs. amplitude) of the observed and clean speech]
• Step 3: Repeat for each input signal amplitude to obtain a full non-parametric estimate of the nonlinear mapping.
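Steps 1 to 3 amount to matching the two empirical CDFs. The sketch below is one way to do that with NumPy and is not taken from the cited paper: the function name `ban`, the midpoint CDF convention, and the Laplacian stand-in for speech amplitudes are all assumptions.

```python
import numpy as np

def ban(observed, reference):
    """Map each observed amplitude to the reference amplitude that has the
    same cumulative probability (non-parametric CDF matching)."""
    observed = np.asarray(observed, dtype=float)
    reference = np.sort(np.asarray(reference, dtype=float))
    # Step 1: empirical CDFs of the observed and the reference signals
    values, counts = np.unique(observed, return_counts=True)
    obs_cdf = (np.cumsum(counts) - 0.5 * counts) / observed.size
    ref_cdf = (np.arange(reference.size) + 0.5) / reference.size
    # Step 2: cumulative probability of every observed sample ...
    prob = np.interp(observed, values, obs_cdf)
    # ... and the reference amplitude at that probability.
    # Step 3: vectorizing over all samples applies the full estimated mapping.
    return np.interp(prob, ref_cdf, reference)

rng = np.random.default_rng(0)
clean = rng.laplace(scale=0.1, size=20000)       # stand-in for speech amplitudes
reference = rng.laplace(scale=0.1, size=20000)   # separate clean reference data
tau = np.percentile(np.abs(clean), 50)           # tau = P50
mag = np.abs(clean)
observed = np.sign(clean) * (np.minimum(mag, tau) + np.maximum(mag - tau, 0.0) / 10.0)  # R = 10
restored = ban(observed, reference)
```

Because the mapping is built from amplitude statistics alone, it needs no knowledge of R or τ, but it relies on the reference distribution resembling that of the undistorted input.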
How well does BAN work?
• Experiment 1 (no additive noise):
[Figure: word error rate (%) vs. τ, threshold (percentile), for R = 1, 2, 4, 6, 10, 20, ∞; before BAN]
How well does BAN work?
• Experiment 1 (no additive noise):
[Figure: word error rate (%) vs. τ, threshold (percentile), for R = 1, 2, 4, 6, 10, 20, ∞; after BAN]
How well does BAN work?
• Experiment 2 (additive, channel noise at 20-dB SNR):
[Figure: word error rate (%) vs. τ, threshold (percentile), for R = 1, 2, 4, 6, 10, 20, ∞; before BAN]
How well does BAN work?
• Experiment 2 (additive, channel noise at 20-dB SNR):
[Figure: word error rate (%) vs. τ, threshold (percentile); after BAN]
How well does BAN work?
• Experiment 2 (additive, channel noise at 15-dB SNR):
[Figure: word error rate (%) vs. τ, threshold (percentile); before BAN]
How well does BAN work?
• Experiment 2 (additive, channel noise at 15-dB SNR):
[Figure: word error rate (%) vs. τ, threshold (percentile); after BAN]
Robust BAN (Harvilla & Stern; unpub.)
• Idea: Shift each input sample by the amount that the centroid of the sample and its neighbors changes when the nonlinearity is inverted.
[Figure: CDFs (cumulative probability vs. amplitude) of observed speech after low-pass filter (R = 10, τ = P50, SNR = 15 dB) and of clean speech after low-pass filter]
Robust BAN (Harvilla & Stern; unpub.)
• Step 1: As before, for a given reference signal amplitude, find the amplitude in the observed CDF with the same cumulative probability.
Robust BAN (Harvilla & Stern; unpub.)
• Step 2: The difference between the output and the input is the offset to be added to the original, noisy and compressed waveform.
Offset = output – input = 0.2 – 0.061 = 0.139
Robust BAN (Harvilla & Stern; unpub.)
• Step 3: Repeat for each input signal amplitude, always using the inverse mapping defined by the smoothed signals.
Robust BAN (Harvilla & Stern; unpub.)
• Step 1: For each sample, find the centroid of the value and its surrounding 4 samples.
• Step 2: Pass the centroid value through the inverse nonlinearity estimate.
• Step 3: Find the difference (“offset”) between the output of the inverse nonlinearity and the centroid.
• Step 4: Add the offset to the original noisy and compressed sample value from Step 1.
• Step 5: Repeat for each sample in the input signal.
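The five steps can be sketched as follows. Several choices here are assumptions rather than details from the slides: the centroid is taken as a 5-point moving average, the same moving average stands in for the low-pass filter applied before estimating the CDFs, and the inverse nonlinearity is estimated by quantile matching on 1001 points. All function names are illustrative.

```python
import numpy as np

def moving_centroid(x, width=5):
    """Mean of each sample and its neighbours (width=5: the sample plus 4 surrounding samples)."""
    return np.convolve(x, np.ones(width) / width, mode="same")

def estimate_inverse(observed_smooth, reference_smooth, points=1001):
    """Inverse nonlinearity estimate: match quantiles of the smoothed signals."""
    probs = np.linspace(0.0, 1.0, points)
    obs_q = np.quantile(observed_smooth, probs)
    ref_q = np.quantile(reference_smooth, probs)
    return lambda amplitude: np.interp(amplitude, obs_q, ref_q)

def robust_ban(noisy, reference, width=5):
    noisy = np.asarray(noisy, dtype=float)
    reference = np.asarray(reference, dtype=float)
    centroid = moving_centroid(noisy, width)                                  # Step 1
    inverse = estimate_inverse(centroid, moving_centroid(reference, width))
    offset = inverse(centroid) - centroid                                     # Steps 2 and 3
    return noisy + offset                                                     # Step 4, for all samples (Step 5)

rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
clean = (0.2 + 0.8 * np.abs(np.sin(2 * np.pi * 3 * t))) * np.sin(2 * np.pi * 200 * t)
tau = np.percentile(np.abs(clean), 50)
mag = np.abs(clean)
compressed = np.sign(clean) * (np.minimum(mag, tau) + np.maximum(mag - tau, 0.0) / 4.0)  # R = 4
noise = rng.standard_normal(clean.size)
noise *= np.sqrt(np.mean(compressed ** 2) / (np.mean(noise ** 2) * 10.0 ** 2.0))  # 20-dB SNR
noisy = compressed + noise
repaired = robust_ban(noisy, clean)
```

Adding an offset computed from the smoothed value, instead of passing each noisy sample through the steep inverse curve, keeps the per-sample noise from being amplified.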
Robust BAN (Harvilla & Stern; unpub.)
[Figure: waveforms (amplitude vs. time in seconds): original and DRC + noise (SNR = 15 dB); repaired using BAN]
Robust BAN (Harvilla & Stern; unpub.)
[Figure: waveforms (amplitude vs. time in seconds): original and DRC + noise (SNR = 15 dB); repaired using Robust BAN]
Results summary
• RBAN is more useful as R becomes large and SNR decreases:
[Figure: (RBAN−BAN) relative improvement (%) for R = 2, 4, 6, 10, 20 at 15-dB and 20-dB SNR]
Blind Amplitude Reconstruction (BAR)
• When R = ∞, BAN techniques are ineffective.
• All samples greater than |τ| are completely lost (“clipping”).
[Figure: clipped waveform (amplitude vs. time in seconds)]
Consistent Iterative Hard Thresholding (Kitic et al.; ICASSP 2013)
• Kitic-IHT works by learning a sparse representation of the incoming clipped speech in terms of Gabor basis vectors.
• Learning is done using a modified version of the Iterative Hard Thresholding (IHT) algorithm.
• The learned sparse representation is then used to reconstruct the signal on a frame-by-frame basis.
Kitic-IHT will be used as the baseline against which the novel declipping algorithms are compared.
[Diagram: Gabor basis vectors; sparse representation, learned from clipped observation; repaired signal frame]
Constrained BAR (Harvilla & Stern; Interspeech 2014)
• Declip the signal by interpolating missing samples such that the energy in the second derivative is minimized (i.e., for smoothness).
• Ensure the interpolation matches the sign of the clipped signal and is greater than |τ| in the absolute sense.
[Figure: waveform (amplitude vs. time in seconds)]
Constrained BAR (Harvilla & Stern; Interspeech 2014)
• Explaining masking matrices
[Equation: two masking matrices; one isolates reliable samples, the other isolates clipped samples]
Constrained BAR (Harvilla & Stern; Interspeech 2014)
CBAR objective function:
[Equation: minimize over xc, subject to a constraint]
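The smoothness criterion behind CBAR can be illustrated with a least-squares sketch. It minimizes the energy of the second difference over the clipped samples while keeping the reliable samples fixed, but it omits CBAR's hard constraint (matching sign, magnitude above τ), so it is a simplified stand-in and not the CBAR algorithm itself. The name `smooth_declip`, the second-difference matrix `d2`, and the use of boolean masks in place of masking matrices are all assumptions.

```python
import numpy as np

def smooth_declip(frame, tau):
    """Replace clipped samples with the values that minimize the energy of the
    second difference of the frame (unconstrained least squares)."""
    frame = np.asarray(frame, dtype=float)
    n = frame.size
    clipped = np.abs(frame) >= tau          # clipped sample locations assumed known
    # Second-difference operator: (d2 @ x)[i] = x[i] - 2*x[i+1] + x[i+2]
    d2 = np.zeros((n - 2, n))
    rows = np.arange(n - 2)
    d2[rows, rows] = 1.0
    d2[rows, rows + 1] = -2.0
    d2[rows, rows + 2] = 1.0
    # Split the operator into the columns acting on clipped and on reliable samples
    a_clipped = d2[:, clipped]
    b = -d2[:, ~clipped] @ frame[~clipped]
    x_c, *_ = np.linalg.lstsq(a_clipped, b, rcond=None)
    out = frame.copy()
    out[clipped] = x_c
    return out

n = np.arange(200)
clean = np.sin(2 * np.pi * n / 100.0)
clipped_frame = np.clip(clean, -0.7, 0.7)
restored = smooth_declip(clipped_frame, tau=0.7)
```

On this sine example the unconstrained solution already rises above τ inside each clipped segment; on real speech that is not guaranteed, which is the role the hard constraint plays.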
Constrained BAR (Harvilla & Stern; Interspeech 2014)
• Because Constrained BAR (CBAR) imposes a hard constraint when minimizing the objective function, it is very slow.
• A line search algorithm is used to solve the constrained optimization separately for every frame.
• In the worst case, it is 400 times slower than real time.
• This motivates the development of a declipping algorithm that does not require a hard constraint.
Regularized BAR (Harvilla & Stern; ICASSP 2015)
• Replace CBAR’s hard constraint with regularization terms:
CBAR objective function:
[Equation: minimize over xc, subject to a constraint]
Regularized BAR (Harvilla & Stern; ICASSP 2015)
• Replace CBAR’s hard constraint with regularization terms:
RBAR objective function:
[Equation: unconstrained minimization over xc]
xc can be solved for in closed form!
Regularized BAR (Harvilla & Stern; ICASSP 2015)
• Replace CBAR’s hard constraint with regularization terms:
Frame-specific solution:
[Equation: closed-form solution for xc]
xc can be solved for in closed form!
Regularized BAR (Harvilla & Stern; ICASSP 2015)
• The t0 and t1 terms are target vectors.
• They “float” above the clipped segments at the target amplitude.
• They are defined as a function of the fraction of clipped samples in a frame.
Regularized BAR (Harvilla & Stern; ICASSP 2015)
[Figure sequence: waveform segment (amplitude vs. time in seconds) with the target vectors t0 and t1 marked]
The target amplitudes underestimate the true peak (future research).
Regularized BAR (Harvilla & Stern; ICASSP 2015)
• Amplitude prediction
[Figure: P95 / τ vs. fraction of clipped samples, with curves labeled exponential and power-law]
ρ: fraction of clipped samples in frame
Processing speed
[Figure: log(TRT) [run time / input duration] vs. τ, threshold (percentile), for CBAR, Kitic-IHT, and RBAR]
Declipping performance
• Experiment 1 (no additive noise):
[Figure: word error rate (%) vs. τ, threshold (percentile), for no declipping, RBAR, Kitic-IHT, and CBAR]
Declipping performance
• Experiment 1 (no additive noise), relative improvements:
[Figure: relative decrease in WER (%) vs. τ, threshold (percentile); CBAR panel: relative to no declipping, RBAR, and Kitic-IHT; RBAR panel: relative to no declipping and Kitic-IHT]
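The quantity on the vertical axis, relative decrease in WER, is presumably the usual percentage reduction with respect to a baseline. A small worked example, using the before/after WER pairs from the examples table earlier in the talk (the helper name is illustrative):

```python
def relative_wer_decrease(wer_baseline, wer_new):
    """Relative decrease in WER, in percent; negative when the WER increases."""
    return 100.0 * (wer_baseline - wer_new) / wer_baseline

# (WER, WER after processing) pairs from the examples table
examples = {
    "P75, R=4": (20.3, 6.4),
    "P75, R=inf": (30.8, 13.5),
    "P50, R=4": (30.2, 6.4),
    "P50, R=inf": (49.5, 23.0),
}
decreases = {name: relative_wer_decrease(*pair) for name, pair in examples.items()}
```

For instance, going from 20.3% to 6.4% WER is a relative decrease of about 68.5%.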
Declipping performance
• Experiment 2 (additive noise):
[Figure: word error rate (%) vs. SNR (dB) for no declipping, RBAR, CBAR, and Kitic-IHT; panels for τ = P75 and τ = P95]
The location of all clipped samples is assumed known.
Kitic-IHT is more robust to additive noise (future research).
Robust Estimation of Distortion (RED)
• Given a received speech signal, how does one determine whether declipping (BAR) or decompression (BAN) needs to be performed?
[Flowchart: receive audio → is audio exposed to DRC? If no, extract features. If yes, is audio clipped? If yes, apply BAR; if no, apply BAN. Then extract features.]
Robust Estimation of Distortion (RED)
• Given a received speech signal, how does one determine whether declipping (BAR) or decompression (BAN) needs to be performed?
  ➢ Is audio exposed to DRC? Search for peaks in the probability distribution of the waveform amplitudes ✔
  ➢ Is audio clipped? Accurately estimate the value of R (recall: if R is “very” large, speech is effectively clipped) ✗
  ➢ Apply BAR: Requires estimation of which samples are clipped and must assume the possibility of noise (e.g., as in Experiment 2) ✔
Clipped speech detection & τ estimation (Harvilla & Stern; ICASSP 2015)
• Exposure to DRC significantly modifies the waveform amplitude distribution of the speech
[Figure: amplitude distributions (probability vs. amplitude) of uncompressed speech with noise at 15-dB SNR and of DRC’ed speech (R = 6, τ = 0.06) + noise at 15 dB]
Clipped speech detection & τ estimation (Harvilla & Stern; ICASSP 2015)
• Exposure to DRC significantly modifies the waveform amplitude distribution of the speech
[Figure: amplitude distribution (probability vs. amplitude) of DRC’ed speech (R = 6, τ = 0.06) + noise at 15 dB]
Clipping detection and τ estimation algorithm:
  1. Detect peaks in the distribution
  2. Compute: [Equation]
  3. Output indicates clipping occurrence and amplitude value of τ (0.5*( |-τ| + 0 + |τ| )) (if output is ∞, no clipping)
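A rough sketch of the peak-based idea follows. The expression computed in step 2 did not survive in the slide text, so the code only follows the surviving annotation: with peaks near −τ, 0 and +τ it returns 0.5·(|−τ| + 0 + |τ|), and it returns ∞ when fewer than three peaks are found. The histogram size, the minimum peak height, the simple neighbor-comparison peak detector, and the noise-free Laplacian test signal are all assumptions.

```python
import numpy as np

def estimate_tau(x, bins=41, min_prob=0.02):
    """Estimate the clipping threshold from peaks in the amplitude histogram.
    Returns np.inf when the three-peak pattern of clipped audio is not found."""
    x = np.asarray(x, dtype=float)
    counts, edges = np.histogram(x, bins=bins)
    prob = counts / counts.sum()
    centers = 0.5 * (edges[:-1] + edges[1:])
    padded = np.concatenate(([0.0], prob, [0.0]))
    # A peak is a bin above its left neighbour, not below its right neighbour,
    # and tall enough to rule out sparsely populated tail bins
    is_peak = (prob > padded[:-2]) & (prob >= padded[2:]) & (prob >= min_prob)
    peaks = centers[is_peak]
    if peaks.size < 3:
        return np.inf
    middle = peaks[np.argmin(np.abs(peaks))]
    return 0.5 * (abs(peaks[0]) + abs(middle) + abs(peaks[-1]))

# Deterministic Laplacian-distributed amplitudes as a stand-in for speech
u = (np.arange(20000) + 0.5) / 20000
speech_like = -0.05 * np.sign(u - 0.5) * np.log(1.0 - 2.0 * np.abs(u - 0.5))
tau_hat = estimate_tau(np.clip(speech_like, -0.1, 0.1))
no_clip = estimate_tau(speech_like)
```

The estimate is quantized to the histogram bin centers, so a finer histogram trades resolution against noisier peak detection.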
Clipped speech detection & τ estimation (Harvilla & Stern; ICASSP 2015)
Clipped signal detection accuracies
[Figure: clipped signal detection accuracy (%) vs. SNR (dB); panels for τ = P95 and τ = P75]
Because the amplitude distribution merges into one lobe (thus, one peak) with decreasing SNR and τ, detection accuracy correspondingly decreases.
Clipped speech detection & τ estimation (Harvilla & Stern; ICASSP 2015)
τ-estimation accuracies for R = ∞
[Figure: estimated τ vs. actual τ; panels for SNR = 20 dB, 15 dB, 10 dB, and 5 dB]
Clipped sample estimation (Harvilla & Stern; ICASSP 2015)
• Given the amplitude value of τ, how do we determine the location of clipped samples?
[Figure: waveforms (amplitude vs. time in seconds) showing signal samples and the clipping threshold; clipped speech with no noise, and clipped speech + noise at 10-dB SNR]
Clipped sample estimation (Harvilla & Stern; ICASSP 2015)
• Given the amplitude value of τ, how do we determine the location of clipped samples?
• Solution:
  Given:
    ➢ amplitude value of τ
    ➢ percentile value of τ
    ➢ variance of the additive noise (σ_w²)
    ➢ variance of the observed signal (σ_y²)
• Model the clean speech and noise with separate Gaussians
• For each sample, classify as clipped if
  Pr(clipped | observed sample, τ, σ_w², σ_y²) > Pr(not clipped | observed sample, τ, σ_w², σ_y²)
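The decision rule can be sketched with Bayes' rule. The slides do not give the class-conditional densities, so the ones below are one plausible reading and not the thesis's model: a clipped sample is treated as ±τ plus Gaussian noise, an unclipped sample as a zero-mean Gaussian with the observed-signal variance, and the prior probability of clipping is taken from the percentile value of τ. The function names and the numeric values in the usage lines are illustrative.

```python
import numpy as np

def gaussian_pdf(y, mean, var):
    return np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def classify_clipped(y, tau, tau_percentile, noise_var, observed_var):
    """Return True where Pr(clipped | y) > Pr(not clipped | y)."""
    y = np.asarray(y, dtype=float)
    # Prior: samples above the tau percentile (in magnitude) are clipped
    prior_clipped = 1.0 - tau_percentile / 100.0
    # Clipped samples sit at +tau or -tau, blurred by the additive noise (assumed model)
    like_clipped = 0.5 * (gaussian_pdf(y, tau, noise_var) + gaussian_pdf(y, -tau, noise_var))
    # Unclipped samples: zero-mean Gaussian with the observed variance (assumed model)
    like_not_clipped = gaussian_pdf(y, 0.0, observed_var)
    return prior_clipped * like_clipped > (1.0 - prior_clipped) * like_not_clipped

samples = np.array([0.07, -0.07, 0.0, 0.03])
decisions = classify_clipped(samples, tau=0.07, tau_percentile=75.0,
                             noise_var=6e-5, observed_var=2e-3)
```

Comparing the two unnormalized posteriors is enough, because the evidence term is common to both sides of the inequality.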
Clipped sample estimation (Harvilla & Stern; ICASSP 2015)
[Figure: probability density vs. amplitude for the clipped and not clipped classes; speech clipped at τ = 0.07 and added to noise at 15-dB SNR]
Clipped sample estimation (Harvilla & Stern; ICASSP 2015)
[Figure: mean classification accuracy vs. SNR (dB) for τ = P95, P75, P55, P35]
Clipped sample estimation (Harvilla & Stern; ICASSP 2015)
[Diagram: stages of “Apply BAR”: voice activity detection; estimation of noise variance; estimation of τ percentile; clipped sample estimation; declipping]
Clipped sample estimation (Harvilla & Stern; ICASSP 2015)
• Experiment 2 (additive noise):
[Figure: word error rate (%) vs. SNR (dB) for no declipping, RBAR, CBAR, and Kitic-IHT; panels for τ = P75 and τ = P95]
The location of all clipped samples is assumed known.
Clipped sample estimation (Harvilla & Stern; ICASSP 2015)
• Experiment 2 (additive noise):
[Figure: word error rate (%) vs. SNR (dB) for no declipping, RBAR, CBAR, and Kitic-IHT; panels for τ = P75 and τ = P95]
Clipping occurrence and location are detected using RED techniques.
Clipped sample estimation (Harvilla & Stern; ICASSP 2015)
• Experiment 2 (additive noise):
[Figure: word error rate (%) vs. SNR (dB) for no declipping, RBAR, CBAR, and Kitic-IHT, with the clipped signal detection accuracy overlaid; panels for τ = P75 and τ = P95]
Clipping occurrence and location are detected using RED techniques.
Artificially-Matched Training (AMT)
• So far, the developed techniques have sought to repair clipped, compressed and noisy speech to “look like” clean speech:
[Diagram: noisy observations → compensation → clean models]
Artificially-Matched Training (AMT)
• Ultimately, it’s only important for the acoustic model and testing data conditions to match. They need not both be “clean.”
[Diagram: noisy observations → noisy models]
Artificially-Matched Training (AMT)
• Experiment 1 (no additive noise):
[Figure: word error rate (%) vs. τ, threshold (percentile), for R = 1, 2, 4, 6, 10, 20, ∞; clean training]
Artificially-Matched Training (AMT)
• Experiment 1 (no additive noise):
[Figure: word error rate (%) vs. τ, threshold (percentile), for R = 1, 2, 4, 6, 10, 20, ∞; DRC-matched training]
Artificially-Matched Training (AMT)
• One approach to achieving this in practice:
[Diagram: xn → MFCC → ASR → WER; xn → regression on DRC parameters {R, τ} → selection from a bank of acoustic models {R0, τ0}, {R1, τ1}, …, {Rk-1, τk-1}]
Artificially-Matched Training with Acoustic Model Selection (AMT-AMS)
Current implementation uses the following parameter sets: R = {∞}, τ = {P15, P35, P55, P75, P95}
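The selection step can be sketched as a nearest-neighbor lookup over the bank. The slide does not say how the regressed parameters are mapped to a model, so choosing the closest τ percentile is an assumption, and the model identifiers are hypothetical placeholders; only the parameter sets (R = ∞, τ in {P15, P35, P55, P75, P95}) come from the slide.

```python
# Hypothetical identifiers for acoustic models trained on speech clipped (R = inf)
# at each threshold percentile listed on the slide
MODEL_BANK = {
    15: "am_clip_p15",
    35: "am_clip_p35",
    55: "am_clip_p55",
    75: "am_clip_p75",
    95: "am_clip_p95",
}

def select_acoustic_model(estimated_tau_percentile, bank=MODEL_BANK):
    """Pick the model whose training threshold is closest to the estimate."""
    nearest = min(bank, key=lambda p: abs(p - estimated_tau_percentile))
    return bank[nearest]

chosen = select_acoustic_model(70.0)
```

A finer grid of thresholds in the bank would reduce the mismatch between the estimated and the trained condition, at the cost of training more models.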
Artificially-Matched Training (AMT)
• Experiment 1 (no additive noise):
[Figure: word error rate (%) vs. τ, threshold (percentile), for R = 1, 2, 4, 6, 10, 20, ∞; clean training]
Artificially-Matched Training (AMT)
• Experiment 1 (no additive noise):
[Figure: word error rate (%) vs. τ, threshold (percentile), for R = 1, 2, 4, 6, 10, 20, ∞; AMT-AMS]
The Big Picture
• With no knowledge of the noise conditions and characteristics of the incoming speech, how well does the combination of algorithms from the thesis work in practice?
[Diagram: x[n] → compress with probability pc (τ drawn uniformly in [τ0, τ1]; R drawn from Gamma dist., [kR, θR]) → add noise w[n] with probability pn (SNR in dB drawn from N(µ, σ²)) → y[n]]
Compression: pc = 0.9, t0 = 60, t1 = 98, pn = 0.75, µ = 20, σ² = 25, k = 3, θ = 2
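The randomized test condition can be sketched as below, using the parameter values listed on the slide. Several details are assumptions: t0 and t1 are read as percentiles of the absolute amplitude, the compressor is the piecewise-linear static form, the noise is white Gaussian, and a Gamma draw below 1 is clamped to R = 1 (the slide does not say how such draws are handled). The function name `degrade` is illustrative.

```python
import numpy as np

def degrade(x, rng, pc=0.9, tau_range=(60.0, 98.0), pn=0.75,
            snr_mean=20.0, snr_var=25.0, k=3.0, theta=2.0):
    """Randomly compress and/or add noise to a waveform."""
    y = np.asarray(x, dtype=float).copy()
    if rng.random() < pc:                                  # compress with probability pc
        tau = np.percentile(np.abs(y), rng.uniform(*tau_range))
        ratio = max(rng.gamma(shape=k, scale=theta), 1.0)  # clamp at 1: an assumption
        mag = np.abs(y)
        y = np.sign(y) * (np.minimum(mag, tau) + np.maximum(mag - tau, 0.0) / ratio)
    if rng.random() < pn:                                  # add noise with probability pn
        snr_db = rng.normal(snr_mean, np.sqrt(snr_var))    # SNR in dB ~ N(mu, sigma^2)
        noise = rng.standard_normal(y.shape)
        gain = np.sqrt(np.mean(y ** 2) / (np.mean(noise ** 2) * 10.0 ** (snr_db / 10.0)))
        y = y + gain * noise
    return y

x = np.sin(np.linspace(0, 40 * np.pi, 4000))
y = degrade(x, np.random.default_rng(0))
```

For the clipping variant of the experiment, the Gamma draw would be replaced by R = ∞.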
The Big Picture
Compression
[Figure: word error rate (%) for none, RBAN, and BAN]
The Big Picture
• With no knowledge of the noise conditions and characteristics of the incoming speech, how well does the combination of algorithms from the thesis work in practice?
[Diagram: x[n] → clip with probability pc (τ drawn uniformly in [τ0, τ1]) → add noise w[n] with probability pn (SNR in dB drawn from N(µ, σ²)) → y[n]]
Clipping: pc = 0.9, t0 = 60, t1 = 98, pn = 0.75, µ = 20, σ² = 25
The Big Picture
Clipping
[Figure: word error rate (%) for none, RBAR, CBAR, Kitic-IHT, AMT-AMS, and AMT-AMS (RBAR)]
Summary & Conclusions
• A previously-unexplored problem in speech recognition, DRC, was introduced.
• Novel solutions to the two primary aspects of the problem, clipping and compression, were developed.
• Techniques for detecting the occurrence of DRC were considered.
• A comprehensive solution to DRC for speech recognition was proposed.
• DRC, especially in noise, is a very hard problem, but this thesis lays the groundwork for very promising future research.
Summary & Conclusions
• Areas of future research include:
  • Improving target amplitude estimates for RBAR [BAR]
  • Improving the robustness of BAR methods to additive noise [BAR]
  • Improving the robustness of clipped/compressed signal detection to low-valued SNR and τ [RED, Big Picture]
  • Development of an R-estimation algorithm [RED, Big Picture]
  • Further investigation of the performance of AMT-AMS with an increasing granularity of acoustic model references [AMT]
Thank you!
• Questions?