2010. 4. 26.
Hyung-Min Park
Audio Segregation
2
Contents
• Independent component analysis (ICA)
  – Conventional methods for acoustic mixtures
  – Filter bank approach to ICA
• Degenerate unmixing and estimation technique (DUET)
  – Target speech enhancement
• Zero-crossing-based binaural processing
  – Inter-aural time difference (ITD)
  – Zero crossings vs. cross-correlation
  – Continuously-variable mask vs. binary mask
3
Cocktail Party Problem
4
Independent Component Analysis
5
Blind Source Separation: A Demo
sources and the mixing environment
6
Independent Component Analysis
• Blind source separation
  – Sensor signals
  – Recover the original source signals without knowing how they are mixed
• ICA
  – Assume sources are independent
  – Estimate the unmixing system W from mixtures x(t)
[Diagram: sources s → mixing system A → mixtures x → unmixing system W → outputs u]
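As a rough illustration of estimating an unmixing system W from mixtures x, the sketch below runs a batch natural-gradient ICA update on an instantaneous 2x2 mixture. The tanh score function, step size, iteration count, Laplacian sources, and the mixing matrix A are all assumptions chosen for the example; they are not taken from the slides.

```python
import numpy as np

def natural_gradient_ica(x, n_iter=500, mu=0.1):
    """Batch natural-gradient ICA for an instantaneous mixture.

    x: array of shape (n_channels, n_samples). Returns W such that u = W @ x.
    """
    n, t = x.shape
    W = np.eye(n)
    for _ in range(n_iter):
        u = W @ x
        phi = np.tanh(u)  # score function for super-Gaussian sources (assumption)
        W += mu * (np.eye(n) - (phi @ u.T) / t) @ W
    return W

# Hypothetical demo data: two independent Laplacian sources, instantaneous mixing.
rng = np.random.default_rng(0)
s = rng.laplace(size=(2, 5000))
A = np.array([[1.0, 0.6], [0.5, 1.0]])
x = A @ s
W = natural_gradient_ica(x)
P = W @ A  # close to a scaled permutation matrix when separation succeeds
```

If separation works, each row of `P` has one dominant entry, since ICA recovers sources only up to scaling and permutation.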
7
Acoustic Mixtures
• Instantaneous mixtures
• Acoustic mixing environments
  – Time delay
  – Reverberation
  – Convolutive mixing
[Figure: sources s1 and s2, sensors x1 and x2, and a reflecting wall]
8
Time Domain Approach to ICA
• Feedback architecture
• Adaptation rules (Torkkola, 1996)
• Intensive computations and slow convergence
[Diagram: 2x2 feedback network with filters W11, W12, W21, W22; inputs x1(n), x2(n); outputs u1(n), u2(n)]
9
Frequency Domain Approach to ICA (1)
• In the frequency domain
• Complex score function
• Adaptation rule (Smaragdis, 1998)
[Diagram: x1(n), ..., xN(n) → short-time Fourier transform → unmixing matrices W1, W2, ..., WK → inverse short-time Fourier transform → u1(n), ..., uN(n)]
10
Frequency Domain Approach to ICA (2)
• Performance limitation
  – Contradiction between covering long reverberation and insufficient learning data
    Long reverberation → long frame size
    Small number of frames → insufficient input data
  – Mixtures combined from different time ranges of sources (delayed mixtures)
[Figure: within the kth block, x1(n) = s1(n) + s2(n-d1) and x2(n) = s1(n-d2) + s2(n)]
11
Design of a Filter Bank
• Filter bank design
  – Frequency response of analysis filters
  – Uniform sixteen-channel filter bank
  – Decimation factor: 10
  – Filter length: 220 taps
12
Filter Bank Approach to ICA (1)
• 2x2 network for the filter bank approach to ICA
[Diagram: each input x1(n), x2(n) → analysis filters H1(z), H2(z), ..., HK(z) → decimation by M → ICA networks W1(z), W2(z), ..., WK(z) → expansion by M → synthesis filters F1(z), F2(z), ..., FK(z) → outputs u1(n), u2(n)]
13
Filter Bank Approach to ICA (2)
• Adaptation rules
• Total number of multiplications
Time domain approach Filter bank approach
The number of required filter coefficients Uniform K-channel oversampled filter bank
is the number of adaptive filter coefficients
14
Experimental Setup (1)
• Measure for blind source separation
  – SIR for a 2x2 mixing/unmixing system
• Sources
  – Two streams of speech
  – 5 second length
  – 16 kHz sampling rate
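The SIR measure for a 2x2 system can be sketched as follows. This uses a common definition (target-to-interference power ratio at each output, averaged in dB); the exact formula on the slide was not recoverable, so the function name `sir_db_2x2` and the averaging are assumptions. Here `u_ij` denotes the contribution of source j to output i, obtained by passing each source alone through the mixing and unmixing systems.

```python
import numpy as np

def sir_db_2x2(u11, u12, u21, u22):
    """Average signal-to-interference ratio (dB) of a 2x2 mixing/unmixing system.

    u_ij: contribution of source j at output i. Assumes output 1 targets
    source 1 and output 2 targets source 2 (no permutation).
    """
    power = lambda v: np.mean(np.square(np.asarray(v, dtype=float)))
    sir1 = 10.0 * np.log10(power(u11) / power(u12))
    sir2 = 10.0 * np.log10(power(u22) / power(u21))
    return 0.5 * (sir1 + sir2)
```

If the outputs come out permuted, the roles of the direct and cross terms would need to be swapped before calling the function.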
15
Experimental Setup (2)
• Mixing system
  – Virtual room to simulate impulse responses
16
Experimental Results
• Learning curves of the three different approaches
17
Experiment on Real-Recorded Data (1)
• Mixing environment
• Filter bank approach
  – Using the sixteen-channel filter bank
  – Each adaptive filter: 103 taps
[Figure: speakers and microphones; labeled distances 40 cm and 60 cm]
18
Experiment on Real-Recorded Data (2)
• Blind source separation of real recorded mixtures
Mixture 1
Mixture 2
Result 1
Result 2
19
Motivation of a Nonuniform Filter Bank Approach
• Time-averaged magnitude responses of signals
  – The energy decreases exponentially as the frequency increases.
[Figure panels: Speech, Car noise, Music]
• Subband division
  – Result of a trade-off between mitigating the undesired properties of the uniform filter bank approach and those of a large adaptive filter length
20
Relationship between Performances and Filter Length
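• Convergence of gradient-based algorithms
  – Controlled by the condition number
• Bordering theorem
  $R_{L_a+1} = \begin{bmatrix} r(0) & \mathbf{r}^H \\ \mathbf{r} & R_{L_a} \end{bmatrix}$, so $\lambda_{\max}^{L_a+1} \ge \lambda_{\max}^{L_a}$ and $\lambda_{\min}^{L_a+1} \le \lambda_{\min}^{L_a}$
• Condition number
  $\chi(R_{L_a}) = \lambda_{\max}^{L_a} / \lambda_{\min}^{L_a}$
  – Monotonically nondecreasing function of filter length
• The longer the filter length, the slower the convergence speed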
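The claim that the condition number does not decrease with filter length can be checked numerically. The sketch below builds Toeplitz autocorrelation matrices of growing size for an assumed first-order autoregressive input, r(k) = rho^|k|, and computes the eigenvalue spread for each length; the input model and the function names are illustrative assumptions.

```python
import numpy as np

def toeplitz_autocorr(r):
    """Symmetric Toeplitz autocorrelation matrix built from lags r(0..L-1)."""
    lags = np.abs(np.arange(len(r))[:, None] - np.arange(len(r))[None, :])
    return r[lags]

def condition_numbers(rho=0.9, max_len=32):
    """Eigenvalue spread lambda_max / lambda_min for filter lengths 1..max_len."""
    r = rho ** np.arange(max_len)  # AR(1) autocorrelation (assumed input model)
    spreads = []
    for length in range(1, max_len + 1):
        eig = np.linalg.eigvalsh(toeplitz_autocorr(r[:length]))  # ascending order
        spreads.append(eig[-1] / eig[0])
    return np.array(spreads)

spreads = condition_numbers()
```

Because each matrix is a leading principal submatrix of the next, the largest eigenvalue can only grow and the smallest can only shrink, so `spreads` is nondecreasing.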
21
Bark-Scale Filter Banks
• Subband division
  – Result of a trade-off between mitigating the undesired properties of the uniform filter bank approach and those of a large adaptive filter length
• Bark frequency warping function
  $\Omega(\omega) = 6 \log\left\{ \frac{\omega}{1200\pi} + \left[ \left( \frac{\omega}{1200\pi} \right)^2 + 1 \right]^{0.5} \right\}$
• Bark-scale filter banks
  – Resemble the frequency resolution of the mammalian cochlea
  – Somewhat narrow subbands in the low frequency region
  – Wide subbands in the high frequency region
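A small sketch of Bark-scale subband division, assuming the warping function has the widely used form 6 log(x + sqrt(x^2 + 1)) with x = omega / (1200 pi), that is x = f / 600 for f in Hz. The helper names and the choice of eight bands up to 8 kHz are illustrative assumptions.

```python
import numpy as np

def bark_warp(f_hz):
    """Bark warping of frequency in Hz, assuming 6*log(x + sqrt(x^2 + 1)), x = f/600."""
    x = np.asarray(f_hz, dtype=float) / 600.0
    return 6.0 * np.log(x + np.sqrt(x ** 2 + 1.0))

def bark_band_edges(f_max_hz=8000.0, n_bands=8):
    """Band edges in Hz that are equally spaced on the Bark axis."""
    bark_edges = np.linspace(0.0, bark_warp(f_max_hz), n_bands + 1)
    return 600.0 * np.sinh(bark_edges / 6.0)  # inverse of the warping

edges = bark_band_edges()
widths = np.diff(edges)  # narrow at low frequencies, wide at high frequencies
```

Equal spacing on the Bark axis yields band widths that grow with frequency, which matches the narrow-low, wide-high behavior described in the bullets.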
22
Nonuniform Filter Bank Approach to BSS
• 2x2 network for the nonuniform oversampled filter bank approach to BSS
[Diagram: each input x1(n), x2(n) → analysis filters H1(z), H2(z), ..., HK(z) → decimation by M1, M2, ..., MK → ICA networks W1(z), W2(z), ..., WK(z) → expansion by M1, M2, ..., MK → synthesis filters F1(z), F2(z), ..., FK(z) → outputs u1(n), u2(n)]
23
Design of a Bark-Scale Filter Bank
• Filter design of a Bark-scale oversampled filter bank
  – 16-channel, $M = [22\ 22\ 20\ 18\ 14\ 11\ 7\ 3]$, $L_q = 220$, OSR = 167%
[Figure panels: Bark-scale filter bank, Uniform filter bank]
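The quoted oversampling ratio can be cross-checked with a few lines. The sketch assumes the decimation factors on this slide read [22 22 20 18 14 11 7 3] for the eight subbands covering 0 to half the sampling rate, and that each subband signal is complex-valued, so it counts as two real values per sample. Both points are assumptions about the garbled slide text.

```python
def oversampling_ratio(decimation_factors, complex_subbands=True):
    """Subband samples produced per input sample, summed over all subbands."""
    weight = 2.0 if complex_subbands else 1.0
    return weight * sum(1.0 / m for m in decimation_factors)

bark_osr = oversampling_ratio([22, 22, 20, 18, 14, 11, 7, 3])
uniform_osr = oversampling_ratio([10] * 8)  # uniform bank with decimation factor 10
```

Under these assumptions the Bark-scale bank gives roughly 1.67 and the uniform bank gives 1.6.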
24
Experimental Results
• Results on blind source separation in the oversampled filter bank
SIR PESQ score
25
FPGA Implementation (1)
[Figure labels: noise references, mic. signals, outputs, noises, female speech, male speech, microphones]
26
FPGA Implementation (2)
• 4 adaptive noise canceling (4 music signals) + 2 blind source separation (2 speech signals)
[Audio demo: MIC1, MIC2, OUT1, OUT2]
27
Application to Hearing Aids
• BTE-type hearing aids
[Figure: noise and speech sources, front mic. and rear mic.; labeled distances 1 m and 1 m]
  – Front mic. SNR = 3.20 dB
  – Rear mic. SNR = 2.45 dB
  – Output SNR = 21.38 dB
28
Discussion on ICA
• Assume sources are independent
• Time domain approach
  – Intensive computations and slow convergence
• Frequency domain approach
  – Less computation but inferior performance
• Filter bank approach
  – Moderate computations and good performance
  – Suitable for parallel processing
  – Bark-scale filter bank approach
29
Degenerate Unmixing and Estimation Technique
30
Introduction
• Independent component analysis for blind source separation
  – Good performance
  – In general, the number of microphones should not be smaller than the number of sources.
  – Too many parameters
    Heavy computational load and slow convergence
  – Problem with a source that is active only for a short period
31
Binaural Processing
• Auditory scene analysis (ASA)
  – Cues: harmonics, pitch, on-set, etc.
• Spatial cues
  – Inter-aural time difference (ITD)
  – Inter-aural intensity difference (IID)
[Figure: target and noise sources]
32
DUET Algorithm (1)
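• Mixing model
  $x_1(t) = \sum_{j=1}^{N} s_j(t) + n_1(t)$
  $x_2(t) = \sum_{j=1}^{N} a_j s_j(t - \delta_j) + n_2(t)$
• In the time-frequency domain
  $\begin{bmatrix} X_1(\omega,\tau) \\ X_2(\omega,\tau) \end{bmatrix} = \begin{bmatrix} 1 & \cdots & 1 \\ a_1 e^{-i\omega\delta_1} & \cdots & a_N e^{-i\omega\delta_N} \end{bmatrix} \begin{bmatrix} S_1(\omega,\tau) \\ \vdots \\ S_N(\omega,\tau) \end{bmatrix}$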
33
DUET Algorithm (2)
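• W-disjoint orthogonality assumption
  $S_i(\omega,\tau)\, S_j(\omega,\tau) = 0, \quad \forall \omega, \tau, \ i \ne j$
  $\begin{bmatrix} X_1(\omega,\tau) \\ X_2(\omega,\tau) \end{bmatrix} = \begin{bmatrix} 1 \\ a_j e^{-i\omega\delta_j} \end{bmatrix} S_j(\omega,\tau)$, for some $j$
• Parameter estimation
  $\left( \hat a(\omega,\tau), \hat\delta(\omega,\tau) \right) = \left( \left| \frac{X_2(\omega,\tau)}{X_1(\omega,\tau)} \right|, \ -\frac{1}{\omega} \angle \frac{X_2(\omega,\tau)}{X_1(\omega,\tau)} \right)$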
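The parameter estimation step can be sketched as below, assuming that at each time-frequency point a single source is active so that X2 = a e^{-i omega delta} X1. The function name, the array layout (frequency by time), and the small regularizer `eps` are assumptions made for the example. The delay estimate is only unambiguous while |omega delta| stays below pi.

```python
import numpy as np

def duet_estimates(X1, X2, omega, eps=1e-12):
    """Per-bin attenuation and delay estimates from two STFTs.

    X1, X2: complex arrays of shape (F, T); omega: nonzero frequencies of shape (F,).
    """
    ratio = X2 / (X1 + eps)           # eps guards against empty bins (assumption)
    a_hat = np.abs(ratio)             # attenuation estimate
    delta_hat = -np.angle(ratio) / np.asarray(omega, dtype=float)[:, None]
    return a_hat, delta_hat
```

In practice the estimates from all bins would be collected into an attenuation-delay histogram whose peaks indicate the individual sources.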
34
DUET Algorithm (3)
• 2D Histogram of amplitude-delay estimates from two mixtures of five sources
♦ Amplitude parameters
( .98, 1.02, .93, 1.06, .93)
♦ Delay parameters
( .3, -.2, .8, -.7, -.2)
35
DUET Algorithm (4)
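• If the j-th source is active,
  $a_j e^{-i\omega\delta_j} X_1(\omega,\tau) - X_2(\omega,\tau) = 0$
• Cost function
  $\rho(\hat a_j, \hat\delta_j, \omega, \tau) = \frac{1}{1 + \hat a_j^2} \left| \hat a_j e^{-i\omega\hat\delta_j} X_1(\omega,\tau) - X_2(\omega,\tau) \right|^2$
  $\min\left( \rho(\hat a_1, \hat\delta_1, \omega, \tau), \ldots, \rho(\hat a_N, \hat\delta_N, \omega, \tau) \right)$
• Parameter estimation
• Stochastic gradient descent algorithm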
36
DUET Algorithm (5)
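• Mask
  $M_j(\omega,\tau) = 1$ if $\rho(\hat a_j, \hat\delta_j, \omega, \tau) \le \rho(\hat a_m, \hat\delta_m, \omega, \tau), \ \forall m \ne j$; 0 otherwise
• Demixing
  $\hat S_j(\omega,\tau) = M_j(\omega,\tau)\, X_1(\omega,\tau)$
[Figure: example binary masks for s1 and s2]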
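The mask and demixing steps can be sketched as follows: each time-frequency point is assigned to the source whose attenuation-delay pair best explains the two mixtures, and the binary mask is applied to the first mixture. The cost used here, |a e^{-i omega delta} X1 - X2|^2 / (1 + a^2), is the usual DUET likelihood cost and is an assumption about the slide's exact formula, as are the function name and the array layout.

```python
import numpy as np

def duet_binary_masks(X1, X2, omega, a, delta):
    """Binary masks and masked source estimates.

    X1, X2: complex STFTs of shape (F, T); omega: shape (F,);
    a, delta: length-N attenuation and delay parameters.
    Returns (masks, estimates), each of shape (N, F, T).
    """
    a = np.asarray(a, dtype=float)[:, None, None]
    delta = np.asarray(delta, dtype=float)[:, None, None]
    w = np.asarray(omega, dtype=float)[None, :, None]
    rho = np.abs(a * np.exp(-1j * w * delta) * X1[None] - X2[None]) ** 2 / (1.0 + a ** 2)
    winner = np.argmin(rho, axis=0)  # index of the best-matching source per bin
    masks = (winner[None] == np.arange(a.shape[0])[:, None, None]).astype(float)
    return masks, masks * X1[None]
```

The time-domain source estimates would then follow from an inverse STFT of each masked spectrogram.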
37
Target Speech Enhancement
• In many practical applications,
  – Need to estimate a signal from a target source
• The target source
  – Frequently, we can expect its approximate direction.
  – Strong utterance in a noisy environment
38
Proposed Method (1)
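• Continuously variable mask
  $M_j^c(\omega,\tau) = 1 - \frac{\left| \hat a_j e^{-i\omega\hat\delta_j} X_1(\omega,\tau) - X_2(\omega,\tau) \right|}{\left| \hat a_j X_1(\omega,\tau) \right| + \left| X_2(\omega,\tau) \right|}$
• Real mask
  $M_j^r(\omega,\tau) = \frac{\left| X_1^j(\omega,\tau) \right|}{\left| X_1(\omega,\tau) \right|}$
[Figure panels: Continuously variable mask, Real mask]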
39
Proposed Method (2)
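• Determine a threshold.
  – Using a top ranking
  $Th(\omega_p) = \text{Top 35\% of } M_j^c(\omega_p, \tau)$
• Binary mask using a threshold
  $M_j^b(\omega_p, \tau, Th(\omega_p)) = 1$ if $M_j^c(\omega_p, \tau) \ge Th(\omega_p)$; 0 otherwise
[Figure panels: Real mask, Binary mask]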
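A minimal sketch of thresholding by top ranking: for each frequency bin, the threshold is set so that the largest 35% of the continuous mask values over time are kept. Reading "Top 35%" as the 65th percentile per frequency bin is an assumption, as are the function name and the (frequency, time) array layout.

```python
import numpy as np

def binary_mask_top_fraction(mask_c, top=0.35):
    """Binarize a continuous mask of shape (F, T), keeping the top fraction per bin."""
    threshold = np.quantile(mask_c, 1.0 - top, axis=1, keepdims=True)
    return (mask_c >= threshold).astype(float), threshold
```

Because the threshold is computed separately for each frequency bin, every bin contributes about the same share of frames to the binary mask regardless of its overall energy.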
40
Proposed Method (3)
• Overall procedure
• Overall procedure of the DUET algorithm
[Figure: block diagrams of the two procedures, each from inputs x_1(t), x_2(t) to output s_j(t). Blocks shown: STFT, initial continuous mask, thresholding, attenuation-delay histogram, initializing attenuation-delay parameters, learning attenuation-delay parameters, continuous mask, comparing likelihoods, binary mask, ISTFT]
41
Experimental Setup (1)
• Number of sources: 2 (1 target source and 1 noise source)
• Input SIR: 5 dB
• Simulated mixing in an anechoic environment
• Source signals
  – 10-second-long speech signals uttered by 4 males and 4 females in the TIMIT database
• Microphones
  – Spacing: 2 cm
• Angle differences between two sources
  – 30˚, 60˚, 90˚, 120˚, 150˚, and 180˚
[Figure: Mic1 and Mic2 with source positions at a distance of 50 cm and angles 80˚, 50˚, 20˚, -10˚, -40˚, -70˚, and -100˚]
42
Experimental Results (1)
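[Bar charts comparing the proposed method and DUET over the six angle differences. Data labels as extracted, one series per line:
  9.17, 10.99, 13.61, 13.77, 13.79, 13.82
  96.26, 95.38, 97.24, 98.35, 98.34, 98.26 (followed by one more label, 96.26)
  20.63, 21.09, 22.03, 21.96, 22.03, 22.11
  89.51, 89.53, 89.55, 89.55, 89.87, 89.79
Legend: Proposed method, DUET]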
43
Experimental Setup (2)
• Number of sources: 2 (1 target source and 1 noise source)
• Input SIR: 5 dB
• Real recorded mixtures in a normal office room
• Source signals
  – 10-second-long speech signals uttered by 3 male and 3 female speakers in the TIMIT database
• Microphones
  – Spacing: 2 cm
• Angle differences between two sources
  – 30˚, 60˚, 90˚, 120˚, 150˚, and 180˚
[Figure: Mic1 and Mic2 with source positions at a distance of 50 cm and angles 90˚, 60˚, 30˚, 0˚, -30˚, -60˚, and -90˚]
44
Experimental Results (2)
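[Bar charts comparing the proposed method and DUET over the six angle differences. Data labels as extracted, one series per line:
  76.67, 81.77, 85.81, 96.93, 97.14, 97.03
  6.29, 10.02, 11.49, 12.70, 12.46, 12.57
  72.27, 80.02, 83.90, 83.37, 84.67, 85.14
  5.51, 10.99, 14.88, 15.93, 15.92, 15.99
Legend: Proposed method, DUET]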
45
Discussion on DUET
• DUET (Degenerate Unmixing and Estimation Technique)
  – Simple
  – We should know the number of sources in advance.
  – Estimate the attenuation and delay parameters for all sources.
• Described target speech enhancement technique
  – Estimate the parameters for only one target source
  – Much faster convergence of all the required parameters
• Not robust to reverberation
46
Zero-Crossing-Based Binaural Processing
47
Binaural Processing
• Auditory scene analysis (ASA)
  – Spatial cues: ITD, IID
  – Others: harmonics, pitch, on-set, etc.
• Conventional methods
  – Inter-aural cross-correlation
  – Binary mask (all-or-none)
• Developed method
  – Inter-aural zero-crossing difference
  – Continuously variable mask
[Figure: target and noise sources]
48
Jeffress’ Model
[Diagram: left-ear signal l(m,n) and right-ear signal r(m,n) on delay lines; multiplication followed by running integration gives the running interaural cross-correlation as a function of (m,n)]
49
Source Localization Based on Cross-Correlation
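• Signal model for the sensor outputs
  $x_i(t) = \sum_j h_{ij}(t) * s_j(t) + n_i(t)$
  $X_i(\omega) = \sum_j H_{ij}(\omega) S_j(\omega) + N_i(\omega)$
• ITD estimation based on generalized cross-correlation
  $R_{ik}(\tau) = \int W(\omega)\, X_i(\omega)\, X_k^*(\omega)\, e^{j\omega\tau}\, d\omega$
  $\hat\tau_{ik} = \arg\max_{\tau \in D} R_{ik}(\tau)$
• Phase transform (PHAT)
  $W_{PHAT}(\omega) = \left| X_i(\omega)\, X_k^*(\omega) \right|^{-1}$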
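Generalized cross-correlation with the phase transform can be sketched with FFTs as below. The function name, the zero-padding length, and the small floor on the magnitude are assumptions for illustration. With the convention used here (correlating x_i against x_k), a signal x_k that lags x_i by d samples produces a peak at lag -d.

```python
import numpy as np

def gcc_phat(x_i, x_k, max_lag):
    """Lag (in samples) that maximizes the PHAT-weighted cross-correlation."""
    n = len(x_i) + len(x_k)                    # zero-pad to avoid circular wrap-around
    X_i = np.fft.rfft(x_i, n)
    X_k = np.fft.rfft(x_k, n)
    cross = X_i * np.conj(X_k)
    cross /= np.maximum(np.abs(cross), 1e-12)  # PHAT weighting: keep phase only
    r = np.fft.irfft(cross, n)
    # Reorder so that index 0 corresponds to lag -max_lag.
    r = np.concatenate((r[-max_lag:], r[: max_lag + 1]))
    return int(np.argmax(r)) - max_lag
```

The search range `max_lag` plays the role of the admissible delay set D, which for two microphones is bounded by the spacing divided by the speed of sound.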
50
Finding Zero-Crossings
[Figure: waveforms from two microphones; the time offset between corresponding zero crossings gives the ITD]
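A minimal sketch of ITD estimation from zero crossings, assuming closely spaced microphones so that the nearest zero crossing in the other channel is the matching one. Upward zero crossings are located by linear interpolation between samples; the function names and the use of upward crossings only are assumptions made for the example.

```python
import numpy as np

def upward_zero_crossings(x, fs):
    """Times (seconds) of negative-to-nonnegative crossings, linearly interpolated."""
    x = np.asarray(x, dtype=float)
    idx = np.where((x[:-1] < 0) & (x[1:] >= 0))[0]
    frac = -x[idx] / (x[idx + 1] - x[idx])
    return (idx + frac) / fs

def zc_itds(left, right, fs):
    """ITD sample for each left-channel crossing: nearest right crossing minus left."""
    t_l = upward_zero_crossings(left, fs)
    t_r = upward_zero_crossings(right, fs)
    if len(t_l) == 0 or len(t_r) == 0:
        return np.array([])
    nearest = np.abs(t_r[None, :] - t_l[:, None]).argmin(axis=1)
    return t_r[nearest] - t_l
```

Each zero crossing yields its own ITD sample, so a band-pass signal produces many estimates per frame whose statistics can be examined afterwards.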
51
Noise Robustness of the Zero-Crossing-Based Method
Y.-I. Kim and R. M. Kil, “Estimation of Interaural Time Differences Based on Zero-Crossings in Noisy Multisource Environments,” IEEE Trans. ASLP, vol.15, no. 2, 2007.
5-dB SNR
otherwise.,0
,1)( if,)(
1log10
)(SNR22
2210 jSjSj ii
iii
52
Application to Source Localization
• Four sources located at azimuth angles of -10°, 0°, 10°, and 40°
53
Speech Segregation
i: band (frequency) index
j: frame (time) index
τ: time lag
M: frame length
T: frame shift
[Figure: left (L) and right (R) channel signals arranged by band index i and frame index j]
• Cross-correlation-based ITD estimation
54
Overall Procedure
[Block diagram. Signals: input signal 1, input signal 2, subband signals, enhanced signal. Blocks: Gammatone filterbank (BPF1, BPF2, ..., BPFN), ITD estimation using ZCs (per subband), scaling factor estimation (per subband), amplitude scaling, time reverse, BPF1, BPF2, ..., BPFN]
55
Amplitude Scaling
[Figure: scaling function s(ITD) and actual scale factors, e.g. 0.9, 0.7, 0.2, 0.8]
56
Relationship between Zero-Crossing-Based ITDs and the SNRs
• Band-pass signals from two microphones
• The mean of the estimated ITDs, , can be approximated by
where
57
Zero Crossing vs. Cross-Correlation (1)
• Relative strength
• Criterion
  – Confidence of the conversion between the relative strength and an ITD
  – Measure the sample standard deviation of ITDs for each relative strength
58
Zero Crossing vs. Cross-Correlation (2)
• Estimate the sample standard deviation by simulation 100,000 randomly generated samples Parameters
Phase : uniform distribution on Frequency : uniform distribution
on
59
Zero Crossing vs. Cross-Correlation (3)
• ITDs normalized by
[Figure columns: ZC-based ITDs, CC-based ITDs; rows: the lowest freq. band, middle freq. band, the highest freq. band]
60
Recognition Experimental Setup
• Recognizer
  – The CMU SPHINX-III speech recognition system
  – Fully-continuous hidden Markov models
• Database
  – The DARPA Resource Management database
  – Training data: 2,880 sentences
  – Test data: 600 sentences
  – The target and interfering speech were combined with different delays from sensor to sensor. (SIR = 0 dB)
• Feature
  – 13th-order mel-frequency cepstral coefficients
61
Recognition Results (1)
• Word error rates (WERs) (%)
[Bar chart: WER (%) from 0 to 100 for no proc., CC-binary, ZC-binary, CC-cont., ZC-cont., and target alone; legend for added white Gaussian noise (SNR: 20 dB): no, identical, independent]
62
Recognition Results (2)
• Word error rates (WERs) (%)
63
What if There is Reverberation?
target noise
Room
Direct path
Echoic path
64
Jeffress’ Model
[Diagram: left-ear signal l(m,n) and right-ear signal r(m,n) on delay lines; multiplication followed by running integration gives the running interaural cross-correlation as a function of (m,n)]
65
Lindemann’s Model (1)
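[Diagram: left-ear signal l(m,n) and right-ear signal r(m,n) on delay lines with inhibition components i_l(m,n) and i_r(m,n); the products k(m,n) give the inhibited interaural cross-correlation as a function of (m,n)]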
66
Lindemann’s Model (2)
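[Diagram: delay-line signals l(m,n) and r(m,n), inhibition components i_l(m,n) and i_r(m,n), and correlation k(m,n)]
  $k(m,n) = l(m,n)\, r(m,n)$
• Inhibition: stationary inhibition ($c_s$) and dynamic inhibition ($c_d$)
  $r(m+1, n+1) = r(m,n)\, [1 - c_s\, l(m,n)]\, (1 - \Phi\{c_d\, k(m,n)\})$
  $l(m-1, n+1) = l(m,n)\, [1 - c_s\, r(m,n)]\, (1 - \Phi\{c_d\, k(m,n)\})$
  $\Phi\{x(n)\} = x(n-1) + \Phi\{x(n-1)\}\, e^{-T_d / T_{inh}}\, [1 - x(n-1)]$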
)()](1)[,(),(
)()](1)[,(),(
mmnmrnmr
mmnmlnml
ll
rr
Monaural sensitivities
fMMmfrl emmm /)()()()(
67
Simulation (1)
• Input signal
  – Gammatone filterbank impulse response with 1 kHz center freq.
  – Half-wave rectified
  – Low-pass filtered (cf = 1.6 kHz)
[Figure: left and right signals, each with direct sound and reflected sound; time intervals t_i and t_d; t_d = 0.6 msec]
68
Simulation (2)
• Input signal: t_i = 10 msec
[Figure panels: Jeffress’ model, Lindemann’s model]
69
Onset Detection
• Onset intervals
  – Dominantly contain direct-path components
[Figure: room impulse response with direct path, early reflection, and late reflection; target and noise sources]
70
Palomäki’s Model of the Precedence Effect
[Block diagram for target and noise inputs. Blocks in each of the two channels: i-th BPF, envelope e(t), inhibition h(t), inhibited envelope; shared blocks: cross-corr., ITD output]
71
Filter to Generate the Inhibition from an Envelope
• Low pass filter
  $h_{lp}(n) = A\, n \exp(-n / \alpha)$
  – $A$ is chosen to give a unity gain at DC.
  – $\alpha$ is a time constant: $\alpha = F_s\,(16000\ \mathrm{Hz}) \times 15\ \mathrm{ms}$
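The inhibition filter and the resulting onset test can be sketched as follows, assuming the impulse response has the form A n exp(-n / alpha) with alpha equal to 15 ms expressed in samples at 16 kHz. The truncation length (10 alpha), the function names, and the simple comparison of the envelope against the inhibition signal are assumptions made for the example.

```python
import numpy as np

def inhibition_filter(fs=16000, tau_s=0.015, length_factor=10):
    """Low-pass impulse response n*exp(-n/alpha), scaled to unity gain at DC."""
    alpha = fs * tau_s                       # time constant in samples
    n = np.arange(int(length_factor * alpha))
    h = n * np.exp(-n / alpha)
    return h / h.sum()                       # A = 1 / sum, so that sum(h) = 1

def onset_mask(envelope, h):
    """True where the envelope exceeds its low-pass filtered (inhibition) version."""
    envelope = np.asarray(envelope, dtype=float)
    inhibition = np.convolve(envelope, h)[: len(envelope)]
    return envelope > inhibition, inhibition
```

Because the filter response starts at zero and rises slowly, a sudden rise in the envelope passes the test at first, while the sustained part that follows is progressively inhibited.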
72
Envelope and Inhibition
• Envelope: blue line, Inhibition: red line (cf = 1,037Hz)
73
Source Localization in Reverberant Environments (1)
• Energy-based onset detection
  – Simple (small computation)
  – Not robust to parameters and environments
[Figure panels: Smoothed envelope, Onset detection]
74
Source Localization in Reverberant Environments (2)
• Echo-free onset detection
  – Detection by comparing the sound-to-echo ratio with a threshold
  – More robust to parameters
[Figure: amplitude level vs. time. Labels: observed sounds; possible echo at time n caused by the preceding sound at time n_p; maximum possible echo; the total estimated echoes; threshold Th_efo]
75
Source Localization in Reverberant Environments (3)
[Block diagram. Signals: input signal 1, input signal 2, subband waveforms, subband envelopes. Blocks: Gammatone filterbank (BPF1, BPF2, ..., BPFN for each input); per subband: onset detection, ITD estimation based on zero crossing, SNR estimation, reliable ITD sample selection, angle conversion; then weighted histogram and source localization]
76
Experimental Setup (1)
• Recording rooms
  – Moderately reverberant room (a normal office room)
  – Higher reverberant room (a bathroom)
• Height of both rooms: 3 m
• Height of speakers and mics: 1.5 m
77
Experimental Results (1)
[Figure panels: Moderately reverberant room, Higher reverberant room; methods compared: Described method, Energy-based onset detection, Echo-free onset detection]
78
Experimental Setup (2)
• Simulated mixing environment
[Figure: mic.1 and mic.2 with speakers at 0°, 30°, and 60°; labeled dimensions 43 mm, 4.0 m, 5.0 m, 2.0 m, 1.0 m, and 3.0 m]
• Height of both rooms: 3 m
• Height of speakers and mics: 1.1 m
• 30-dB SNR observations by adding white Gaussian noise
• 320 utterances by 16 speakers from the TI-DIGIT database
79
Experimental Results (2)
• Rates of localizations where errors of estimated angles were less than 3°
80
Discussion on Binaural Processing
• Describe a method that enhances speech by estimating continuously variable masking weights
• Estimation of ITDs from zero crossings
  – More reliable than that from cross-correlation
• Continuously variable mask
  – Estimate relative target intensity in the t-f domain
  – Better accuracy than binary mask
• Reverberation
  – Precedence effect
  – Onset detection and SNR estimation
81
Thank you very much.
82
Multi-rate Systems
• Decimation and expansion
  $Y_D(e^{j\omega}) = \frac{1}{M} \sum_{m=0}^{M-1} X\left( e^{j(\omega - 2\pi m)/M} \right)$
  $V(e^{j\omega}) = Y_D(e^{j\omega L})$
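The decimation relation can be verified on a finite-length signal with the DFT: the spectrum of the M-fold decimated signal equals the average of M shifted (aliased) copies of the original spectrum. The sketch assumes the signal length is a multiple of M; the function name is hypothetical.

```python
import numpy as np

def decimated_spectrum_from_aliases(x, M):
    """DFT of x[::M] computed as the average of M aliased copies of the DFT of x."""
    N = len(x)
    assert N % M == 0, "signal length must be a multiple of M"
    X = np.fft.fft(x)
    # Row m of the reshaped array holds X[k + m*N/M] for k = 0 .. N/M - 1.
    return X.reshape(M, N // M).sum(axis=0) / M
```

Comparing the result with `np.fft.fft(x[::M])` shows the aliasing terms explicitly, which is why analysis filters must suppress out-of-band energy before decimation.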
83
Filter Banks (1)
• Multirate System
84
Filter Banks (2)
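• In the z-domain, ( $W = e^{-j2\pi/M}$ )
  $V_k(z) = \frac{1}{M} \sum_{m=0}^{M-1} X_k(z^{1/M} W^m) = \frac{1}{M} \sum_{m=0}^{M-1} H_k(z^{1/M} W^m)\, X(z^{1/M} W^m)$
  $U_k(z) = V_k(z^M) = \frac{1}{M} \sum_{m=0}^{M-1} H_k(z W^m)\, X(z W^m)$
  $\hat X(z) = \sum_{k=0}^{K-1} F_k(z)\, U_k(z) = \frac{1}{M} \sum_{m=0}^{M-1} X(z W^m) \sum_{k=0}^{K-1} H_k(z W^m)\, F_k(z)$
• Perfect reconstruction system
  $\hat X(z) = c\, z^{-n_0} X(z), \quad c \ne 0$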
85
Polyphase Representation (1)
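• Analysis filter (Type 1 polyphase)
  $H_k(z) = \sum_n h_k(n) z^{-n} = \sum_n h_k(2n) z^{-2n} + z^{-1} \sum_n h_k(2n+1) z^{-2n}$ (for $M = 2$)
  $H_k(z) = \sum_{m=0}^{M-1} z^{-m} E_{km}(z^M)$, where $E_{km}(z) = \sum_n h_k(nM + m) z^{-n}$
• Using matrix notations,
  $\mathbf{E}(z) = \begin{bmatrix} E_{00}(z) & E_{01}(z) & \cdots & E_{0,M-1}(z) \\ E_{10}(z) & E_{11}(z) & \cdots & E_{1,M-1}(z) \\ \vdots & \vdots & & \vdots \\ E_{K-1,0}(z) & E_{K-1,1}(z) & \cdots & E_{K-1,M-1}(z) \end{bmatrix}$
  $\mathbf{h}(z) = [H_0(z)\ H_1(z)\ \cdots\ H_{K-1}(z)]^T$
  $\mathbf{h}(z) = \mathbf{E}(z^M)\, [1\ z^{-1}\ \cdots\ z^{-(M-1)}]^T$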
86
Polyphase Representation (2)
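• Synthesis filter (Type 2 polyphase)
  $F_k(z) = \sum_n f_k(n) z^{-n} = z^{-1} \sum_n f_k(2n+1) z^{-2n} + \sum_n f_k(2n) z^{-2n}$ (for $M = 2$)
  $F_k(z) = \sum_{m=0}^{M-1} z^{-(M-1-m)} R_{mk}(z^M)$, where $R_{mk}(z) = \sum_n f_k(nM + M - 1 - m) z^{-n}$
• Using matrix notations,
  $\mathbf{R}(z) = \begin{bmatrix} R_{00}(z) & R_{01}(z) & \cdots & R_{0,K-1}(z) \\ R_{10}(z) & R_{11}(z) & \cdots & R_{1,K-1}(z) \\ \vdots & \vdots & & \vdots \\ R_{M-1,0}(z) & R_{M-1,1}(z) & \cdots & R_{M-1,K-1}(z) \end{bmatrix}$
  $\mathbf{f}(z) = [F_0(z)\ F_1(z)\ \cdots\ F_{K-1}(z)]^T$
  $\mathbf{f}^T(z) = [z^{-(M-1)}\ z^{-(M-2)}\ \cdots\ 1]\, \mathbf{R}(z^M)$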
87
Polyphase Representation (3)
• Polyphase representation
• Rearrangement using noble identities
88
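Paraunitary Property for Perfect Reconstruction
• Paraunitary property
  $\tilde{\mathbf{E}}(z)\, \mathbf{E}(z) = d\, \mathbf{I}, \quad d > 0$
  – $\tilde{\mathbf{E}}(z)$: transposed, with its entries complex-conjugated and time-reversed
• Perfect reconstruction condition
  $\mathbf{R}(z) = c\, z^{-l}\, \tilde{\mathbf{E}}(z), \quad c \ne 0$
  $f_k(n) = c\, h_k^*(L - n), \quad 0 \le k \le K-1$
  $F_k(z) = c\, z^{-L} \tilde H_k(z), \quad 0 \le k \le K-1$
  where $L = Ml + M - 1$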
89
Critically Sampled Filter Banks (1)
• Overall system
• Critically sampled filter banks: $K = M$
90
Critically Sampled Filter Banks (2)
• NotationsTMzWXzWXzXz )]()()([)( 1 x
T
MK
MM
K
K
zWHzWHzWH
zWHzWHzWH
zHzHzH
z
)()()(
)()()(
)()()(
)(
11
11
10
110
110
H
)]}()()({[diag)( 1 MzWSzWSzSz S
)()()(1
)]()()([)( /1/1/1110
MMMTM zzz
MzYzYzYz xSHy
)()()(1
)](ˆ)(ˆ)(ˆ[)(ˆ /1/1110
MMTM zzz
MzYzYzYz xHWy
91
Critically Sampled Filter Banks (3)
• The subband error signals are zero if
• Two subbands scheme (Gilloire and Vetterli, ’92) Assume the classical QMF filters
Therefore,
)()()()( zzzzM SHHW
)()(
)()(
)(det
1)(1
zHzH
zHzH
zz
HH
)()(
)()(
1
0
zHzH
zHzH
)()(
)()()(
zHzH
zHzHzH
92
Critically Sampled Filter Banks (4)
• Adaptive filters
• is diagonal only if or
)()()()()]()()[()(
)]()()[()()()()()(
)(det
1
)()()()(
22
22
12
zSzHzSzHzSzSzHzH
zSzSzHzHzSzHzSzH
z
zzzz
H
HSHW
)( 2zW
0)()( zHzH 0)()( zSzS
A general physical system (X)
0)()( zHzHPR (X)
93
Critically Sampled Filter Banks (5)
• Multiband scheme Assumptions
Adaptive filters (require the use of cross filters)
Slow convergence and performance degradation
1||,0)()( jizHzH ji )()( ii zWHzH
)(00)(
0)()(0
0)()()(
)(0)()(
)(
1,10,1
2221
121110
1,00100
zWzW
zWzW
zWzWzW
zWzWzW
z
MMM
M
M
W
where
94
Oversampled Filter Banks (1)
• $K > M$ such that $H_i(z)\, H_j(z) = 0, \ i \ne j$, where $H_i(z) = H(z W^i)$
• Non-critical decimation can avoid the aliasing problem.
  – The redundancy provides enough information for successful adaptation in every band.
  – Diagonal adaptive filter matrix
95
Oversampled Filter Banks (2)
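• Recall
  $\hat X(z) = \frac{1}{M} \sum_{m=0}^{M-1} X(z W^m) \sum_{k=0}^{K-1} H_k(z W^m)\, F_k(z)$
• For perfect reconstruction,
  $F_k(z) = c\, z^{-L} \tilde H_k(z), \quad 0 \le k \le K-1$
• Remove the aliasing terms
  $\hat X(z) = \frac{1}{M} X(z) \sum_{k=0}^{K-1} H_k(z)\, F_k(z)$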
96
Oversampled Filter Banks (3)
• Analysis filters from a real-valued linear phase prototype filter by generalized DFT
  $h_k(n) = e^{j \frac{2\pi}{K} (k + k_0)(n + n_0)}\, q(n), \quad k = 0, 1, \ldots, K-1, \quad n = 0, 1, \ldots, L_q - 1$
  – To cover the frequency range $[0; \pi]$ by exactly $K/2$ subbands: $k_0 = \frac{1}{2}$
  – For the linear phase property: $n_0 = -\frac{L_q - 1}{2}$
• Synthesis filters
  $f_k(n) = \tilde h_k(n) = h_k^*(L_q - 1 - n) = h_k(n)$
• All filters can be derived from one prototype filter.
97
Design of Oversampled Filter Banks (1)
• Cost function Combination of filter bank reconstruction error and
stopband energy of the analysis filters
• Impulse response of the overall filter bank system
• Using matrix notation,
1
0
)()(1
)(K
kkk nfnh
Mnt
kk
qk
k
k
qk
kk
k
qk
k
k
k
Lf
f
f
Lh
hh
h
Lt
t
t
fHt
)1(
)1(
)0(
)1(00
0)0()1(
00)0(
)22(
)1(
)0(
98
Design of Oversampled Filter Banks (2)
• Impulse response of the overall filter bank system
• Measure of the reconstruction error
• Measure for the energy contained in the stopband
fHfffHHHt TT
KTT
KM 110110
1
2
1 ))1(( qLnδt
2
2
11
00
2
)1(
)1(
)0(
))1(cos()1cos(1
))1(cos()1cos(1
))1(cos()1cos(1
kk
qk
k
k
qNN
q
q
k
Lh
h
h
L
L
L
hP
99
Design of Oversampled Filter Banks (3)
• To enforce linear phase filters,
Therefore, , where .
• Cost function
where
qLJI
q
T
qT
LL
Tq
Lqqq
Lqqq
qq)12/()1()0(
)1()1()0(
2/2/
TTK
TT110 hhhhf
Tqkkkk Lhhh )1()1()0( h
2
1
02
21
))1((1
0
δf
P
H qK
kk
LnM
},,,{diag 110 KPPPP
100
Design of Oversampled Filter Banks (4)
• GDFT
• Cost function
• Iterative least-squares design algorithm• Initialize• Minimize with respect to• Apply relaxation
,qMh kk
21
0
))1((1
0
δq
LP
LMH
q
q
K
kkk
LnM
where are diagonal matrices with transform factorskM
)(iq
)(iq
)1()1()()( iii qqq
101
ITD Using Zero Crossings (1)
• Two microphone signals
  – Ignore attenuation between microphones because of closeness
  $x_1(t) = A \cos(w_1 t) + a \cos(w_2 t)$
  $x_2(t) = A \cos(w_1 (t - d_1)) + a \cos(w_2 (t - d_2))$
• Assume and
0)( 11 tx
))(sin()sin(
))(cos()cos(
))(sin()sin(
))(cos()cos()(
2212
2212
1111
111112
dwtwa
dwtwa
dwtwA
dwtwAtx
0)( 12 tx
102
ITD Using Zero Crossings (2)
• Since
• Therefore,
• Since
)()sin(
)()sin()(
2212
111112
dwtwa
dwtwAtx
}2,1,{ ,)( jidw ji
1))(cos( ji dw )())(sin( jiji dwdw
0)( 12 tx
21221111
122111
)sin()sin(
))sin()sin((
dtwawdtwAw
twawtwAw
103
ITD Using Zero Crossings (3)
• Recall
)sin()(cos
)sin()(cos
)sin()))cos((sin(cos
)sin()))cos((sin(cos
12212222
1
2122112222
1
122121
1
21221121
1
twawtwaAw
dtwawdtwaAw
twawtwAa
Aw
dtwawdtwAa
Aw
)(cos)sin(
)(cos)sin(
11222
2111
211222
21111
twAawtwAw
dtwAawdtwAw
for aA
for aA
0)cos()cos()( 121111 twatwAtx
104
ITD Using Zero Crossings (4)
• Assume uniformly-distributed frequencies over a narrow band and phases over the interval ),(
21 )),(1(),( daAgdaAg
otherwise )(
)(tan2
, if 2
1
, if )(
)(tan)2
(
),(
22
22
22
122
22
22
22
122
2
Aa
AaAAa
AaA
aA
aAaA
aAaaA
aA
aA
aAg
105
ITD Using Cross-Correlation (1)
• Two microphone signals
  $x_1(t) = A \cos(w_1 t) + a \cos(w_2 t)$
  $x_2(t) = A \cos(w_1 (t - d_1)) + a \cos(w_2 (t - d_2))$
• Cross-correlation
T
Tdttxtx
Tc )()(
2
1)( 21
)))](cos()2)(2(cos(
)))(cos())(cos(
))(cos())((cos(
)))(cos()2)(2(cos([2
1
)()(
222222
211211
221221
111112
21
dwdwtwa
twdtwtwdtw
dtwtwdtwtwAa
dwdwtwA
txtx
106
ITD Using Cross-Correlation (2)• ITD at the maximum of cross-correlation,
)]2sin()2sin()sin())sin((2
)sin())sin((2)2sin()2sin([
)]2)2cos()2(sin()cos())sin((2
)cos())sin((2
[
)]cos())sin((2
)cos())sin((2
)2)2cos()2(sin([
)]2)2cos()2(sin()cos())()sin((2
)cos())()sin((2
)2)2cos()2(sin([
22
21
2112
2222
22221
21
2221
212
2121
21
2121
21
2111
21
2222
222
2121
21
22
2121
21
2111
2
TwaTwwAa
TwwAaTwA
TwwTwawTwwww
Aa
wTwwww
Aad
wTwwww
Aa
wTwwww
AaTwwTwAd
TwwTwawwTwwww
Aa
wwTwwww
AaTwwTwA
0)(
d
dc
107
Spatial Aliasing
• To avoid the spatial aliasing
  – The delay between sensors should be smaller than a half period of the signal.
  – If we have a t-sample delay for a noisy signal, the alias-free condition is $\frac{t}{F_s} < \frac{1}{2F}$
  – 1 sample delay at 16 kHz sampling rate: $F < \frac{F_s}{2} = 8\ \mathrm{kHz}$
  – Center frequencies of all Gammatone filters are lower than 8 kHz.
108
Close Microphones
• ITD estimation without phase ambiguity
  – The closest zero crossings provide the desired ITD value.
  – Do not need to estimate IIDs
• Easy to derive a relationship between the ITDs and the scaling factors
• Reduce search region of p2
• Reduce signal distortion
• Compact implementation
109
Results on Source Segregation
• Target signal
• Without processing
• Scaling with the following factors
  – factor = [summation of CC values in the neighborhood of ITD of the desired source (male speaker)] / [summation of CC values in the whole range]
  – factor = 1 if the peak value of CC is in the neighborhood of ITD of the desired source; factor = 0 otherwise.
[Audio demo columns: Jeffress’ model, Lindemann’s model]
110
Combining the Lindemann’s Model and the Precedence Effect
[Block diagram for target and noise inputs: i-th BPF in each channel → Jeffress’ or Lindemann’s model → enhanced speech; envelope e(t) and inhibition h(t) feed an on-set test "e(t) > h(t)?"]
The model operates only when e(t) > h(t). Otherwise, the model provides the previous factor.
111
Results on Source Segregation for Reverberated Signal
• Target signal
  – With reverberation
  – Without reverberation (ideal solution)
• Without processing
• Scaling with the following factors
  – factor = 1 if the peak value of CC is in the neighborhood of ITD of the desired source; factor = 0 otherwise.
[Audio demo columns: Jeffress’ model, Lindemann’s model; variants: w/o on-set enh., res: 1/8000 sec, res: 1/48000 sec]
112
Dereverberation
• Early reflections are especially problematic.
  – They affect the same frame as the direct sound wave.
• To remove early reflection components
  – Dereverberate the linear prediction (LP) residual of incoming speech
  – Filter estimation for nearly exponentially-decaying reverberation like a typical room impulse response
  – Corresponds to the inverse of the truncated auto-correlation
]))00 )()0( 00([DFT/.1(IDFT)( Rccnhderev
113
Dereverberation and Echo Suppression
Dereverberation
Dereverberation
Gammatonefilterbank
Gammatonefilterbank
Cross-correlation
Inhibition
Frameenergies
Frame energies with inhibition
Maskestimation
Featureestimation
Input 1
Input 2
114
Simulated Room Impulse Response
• Virtual room to simulate impulse responses
• Mixing environments (reverberation time: 0.5 s)
[Figure: target, interference, and mics; labeled dimensions 2 m, 1.5 m, 1.1 m, 17 cm, 1 m, and 30°]
115
Recognition Results
• Word error rates (WERs) (%)
  A: no processing
  B: seg. + inhib.
  C: derev. + seg. + inhib.
  D: ideal masks
[Bar chart: WER (%) from 0 to 100 for A, B, C, D; conditions: infinity, 0 sec; 10 dB, 0.3 sec; 10 dB, 0.5 sec; 0 dB, 0.3 sec; 0 dB, 0.5 sec]