Audio Segregation

2010. 4. 26.

Hyung-Min Park

Contents

• Independent component analysis (ICA)
   Conventional methods for acoustic mixtures
   Filter bank approach to ICA

• Degenerate unmixing and estimation technique (DUET)
   Target speech enhancement

• Zero-crossing-based binaural processing
   Inter-aural time difference (ITD)
   Zero crossings vs. cross-correlation
   Continuously-variable mask vs. binary mask

Cocktail Party Problem

Independent Component Analysis

Blind Source Separation: A Demo

sources and the mixing environment

Independent Component Analysis

• Blind source separation
   Sensor signals
   Recover the original source signals without knowing how they are mixed

• ICA
   Assume sources are independent
   Estimate the unmixing system W from mixtures x(t)

[Figure: sources s → mixing system A → mixtures x → unmixing system W → outputs u]

Acoustic Mixtures

• Instantaneous mixtures

• Acoustic mixing environments
   Time delay
   Reverberation
   Convolutive mixing

[Figure: sources s1, s2 and sensors x1, x2 in a room with a reflecting wall]
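A minimal sketch of convolutive mixing, assuming each source-to-sensor path is modeled as an FIR filter (time delay plus reflections). The function name `convolutive_mix` and the filter taps below are hypothetical and chosen only for illustration; they are not measured room responses.

```python
import numpy as np

def convolutive_mix(sources, filters):
    """Mix sources through FIR paths.

    sources: array (n_sources, n_samples)
    filters: array (n_sensors, n_sources, n_taps); filters[i, j] is the
             assumed impulse response from source j to sensor i.
    """
    n_sensors, n_sources, _ = filters.shape
    n_samples = sources.shape[1]
    mixtures = np.zeros((n_sensors, n_samples))
    for i in range(n_sensors):
        for j in range(n_sources):
            # Full convolution, truncated to the observation length.
            mixtures[i] += np.convolve(sources[j], filters[i, j])[:n_samples]
    return mixtures

rng = np.random.default_rng(0)
s = rng.standard_normal((2, 1000))
a = np.zeros((2, 2, 8))
a[0, 0, 0] = 1.0   # direct path s1 -> x1
a[1, 1, 0] = 1.0   # direct path s2 -> x2
a[0, 1, 3] = 0.5   # s2 reaches x1 after a 3-sample delay
a[0, 1, 7] = 0.2   # a later, weaker reflection of s2 at x1
a[1, 0, 5] = 0.4   # s1 reaches x2 after a 5-sample delay
x = convolutive_mix(s, a)
```

With a single nonzero tap at lag 0 in every path, the same function reduces to an instantaneous mixture.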

Time Domain Approach to ICA

• Feedback architecture

• Adaptation rules (Torkkola, 1996)

• Intensive computations and slow convergence

[Figure: 2x2 feedback network with inputs x1(n), x2(n), filters W11, W12, W21, W22, and outputs u1(n), u2(n)]

Frequency Domain Approach to ICA (1)

• In the frequency domain

• Complex score function

• Adaptation rule (Smaragdis, 1998)

[Figure: x1(n), ..., xN(n) → short-time Fourier transform → ICA networks W1, W2, ..., WK → inverse short-time Fourier transform → u1(n), ..., uN(n)]

Frequency Domain Approach to ICA (2)

• Performance limitation
   Contradiction between long reverberation covering and insufficient learning data
      Long reverberation → long frame size
      Small number of frames → insufficient input data
   Mixtures combined from different time ranges of sources
      Delayed mixtures

[Figure: within the k-th block, x1(n) = s1(n) + s2(n-d1) and x2(n) = s1(n-d2) + s2(n)]

Design of a Filter Bank

• Filter bank design
   Frequency response of analysis filters
   Uniform sixteen-channel filter bank
      Decimation factor: 10
      Filter length: 220 taps

Filter Bank Approach to ICA (1)

• 2x2 network for the filter bank approach to ICA

[Figure: x1(n) and x2(n) each pass through analysis filters H1(z), H2(z), ..., HK(z) and decimation by M; subband k is processed by ICA network Wk(z); the subband outputs are expanded by M, filtered by synthesis filters F1(z), F2(z), ..., FK(z), and combined into u1(n) and u2(n)]

Filter Bank Approach to ICA (2)

• Adaptation rules

• Total number of multiplications
   Time domain approach
   Filter bank approach

   The number of required filter coefficients
   Uniform K-channel oversampled filter bank
   is the number of adaptive filter coefficients

Experimental Setup (1)

• Measure for blind source separation
   SIR for a 2x2 mixing/unmixing system

• Sources
   Two streams of speech
   5 second length
   16 kHz sampling rate

Experimental Setup (2)

• Mixing system
   Virtual room to simulate impulse responses

Experimental Results

• Learning curves of the three different approaches


Experiment on Real-Recorded Data (1)

• Mixing environment

[Figure: two speakers and two microphones, with labeled distances of 40 cm and 60 cm]

• Filter bank approach
   Using the sixteen-channel filter bank
   Each adaptive filter: 103 taps

Experiment on Real-Recorded Data (2)

• Blind source separation of real recorded mixtures

[Figure: Mixture 1, Mixture 2, Result 1, and Result 2]

Motivation of a Nonuniform Filter Bank Approach

• Time-averaged magnitude responses of signals
   The energy exponentially decreases as the frequency increases.

[Figure: time-averaged magnitude responses of speech, car noise, and music]

• Subband division
   Result of a trade-off between mitigating the undesired properties of the uniform filter bank approach and avoiding large adaptive filter lengths

Relationship between Performances and Filter Length

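• Convergence of gradient-based algorithms
   Controlled by condition number

• Bordering theorem
   $\mathbf{R}_{L_a+1} = \begin{bmatrix} r(0) & \mathbf{r}^H \\ \mathbf{r} & \mathbf{R}_{L_a} \end{bmatrix}$
   $\lambda_{\max}(L_a+1) \ge \lambda_{\max}(L_a)$ and $\lambda_{\min}(L_a+1) \le \lambda_{\min}(L_a)$

• Condition number
   $\chi(\mathbf{R}_{L_a}) = \lambda_{\max}(L_a) / \lambda_{\min}(L_a)$
   Monotonically nondecreasing function of filter length

• The longer the filter length, the slower the convergence speed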
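The statement that the condition number is a nondecreasing function of filter length can be checked numerically. The sketch below assumes a hypothetical first-order autoregressive input with autocorrelation r(k) = 0.9^|k|; this input model is an illustration only and is not taken from the slides.

```python
import numpy as np

def autocorr_matrix(r, length):
    """Toeplitz autocorrelation matrix of the given size built from r(0), r(1), ..."""
    idx = np.abs(np.arange(length)[:, None] - np.arange(length)[None, :])
    return r[idx]

def condition_number(R):
    """Ratio of the largest to the smallest eigenvalue of a symmetric matrix."""
    eig = np.linalg.eigvalsh(R)   # eigenvalues in ascending order
    return eig[-1] / eig[0]

rho = 0.9                          # assumed AR(1) coefficient
r = rho ** np.arange(64)           # r(k) = rho^|k|
lengths = [2, 4, 8, 16, 32, 64]    # adaptive filter lengths to compare
conds = [condition_number(autocorr_matrix(r, L)) for L in lengths]
```

Each shorter matrix is a leading principal submatrix of the longer one, so the eigenvalue spread cannot shrink as the filter grows.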

Bark-Scale Filter Banks

• Subband division
   Result of a trade-off between mitigating the undesired properties of the uniform filter bank approach and avoiding large adaptive filter lengths

• Bark frequency warping function
   $\Omega(\omega) = 6 \log\left( \frac{\omega}{1200\pi} + \left[ \left( \frac{\omega}{1200\pi} \right)^2 + 1 \right]^{0.5} \right)$

• Bark-scale filter banks
   Resemble that of the mammalian cochlea
   Somewhat narrow subbands in low frequency region
   Wide subbands in high frequency region
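A small sketch of a Bark frequency warping function, assuming the commonly used form Ω(ω) = 6 ln(ω/1200π + [(ω/1200π)² + 1]^0.5) with ω in rad/s and a natural logarithm. The function name `bark_warp` is hypothetical.

```python
import numpy as np

def bark_warp(omega):
    """Map angular frequency (rad/s) to the Bark scale under the assumed formula."""
    x = omega / (1200.0 * np.pi)
    return 6.0 * np.log(x + np.sqrt(x * x + 1.0))

f = np.array([0.0, 600.0, 8000.0])   # frequencies in Hz
bark = bark_warp(2.0 * np.pi * f)
```

Because ω/1200π equals f/600, the same mapping can be written as 6·asinh(f/600). It grows roughly linearly at low frequencies and logarithmically at high frequencies, which is why equal Bark steps give narrow subbands at low frequencies and wide ones at high frequencies.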

Nonuniform Filter Bank Approach to BSS

• 2x2 network for the nonuniform oversampled filter bank approach to BSS

[Figure: x1(n) and x2(n) each pass through analysis filters H1(z), H2(z), ..., HK(z) and decimation by M1, M2, ..., MK; subband k is processed by ICA network Wk(z); the subband outputs are expanded by M1, M2, ..., MK, filtered by synthesis filters F1(z), F2(z), ..., FK(z), and combined into u1(n) and u2(n)]

Design of a Bark-Scale Filter Bank

• Filter design of a Bark-scale oversampled filter bank
   16-channel, $M = [22\ 22\ 20\ 18\ 14\ 11\ 7\ 3]$, $L_q = 220$, OSR = 167%

[Figure: frequency responses of the Bark-scale filter bank and the uniform filter bank]
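The stated oversampling ratio can be reproduced under two assumptions that are not spelled out on the slide: the decimation factors are M = [22, 22, 20, 18, 14, 11, 7, 3] for the eight processed subbands of the 16-channel bank, and each subband sample is complex, so it counts as two real values. The helper name `oversampling_ratio` is hypothetical.

```python
from fractions import Fraction

def oversampling_ratio(decimation_factors):
    """Real-valued subband samples produced per input sample (assumed definition)."""
    return 2 * sum(Fraction(1, m) for m in decimation_factors)

# Uniform sixteen-channel bank with decimation factor 10, eight processed subbands.
uniform = oversampling_ratio([10] * 8)
# Bark-scale bank with the decimation factors read from the slide.
bark = oversampling_ratio([22, 22, 20, 18, 14, 11, 7, 3])
bark_percent = round(float(bark) * 100)
```

Under these assumptions the Bark-scale value rounds to the 167% quoted above, and the uniform bank comes out at 160%.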

Experimental Results

• Results on blind source separation in the oversampled filter bank

[Figure panels: SIR; PESQ score]

FPGA Implementation (1)

[Figure: microphones picking up male speech, female speech, and noises; labels: noise references, mic. signals, outputs]

FPGA Implementation (2)

• 4 adaptive noise canceling (4 music signals) + 2 blind source separation (2 speech signals)

[Figure: signals MIC1, MIC2, OUT1, and OUT2]

Application to Hearing Aids

• BTE-type hearing aids

[Figure: speech and noise sources, each at 1 m, with front and rear microphones]

front mic.: SNR = 3.20 dB
rear mic.: SNR = 2.45 dB
output: SNR = 21.38 dB

Discussion on ICA

• Assume sources are independent

• Time domain approach
   Intensive computations and slow convergence

• Frequency domain approach
   Fewer computations but inferior performance

• Filter bank approach
   Moderate computations and good performance
   Suitable for parallel processing
   Bark-scale filter bank approach

Degenerate Unmixing and Estimation Technique

Introduction

• Independent component analysis for blind source separation
   Good performance
   In general, the number of microphones should not be smaller than the number of sources.
   Too many parameters
      Heavy computational load and slow convergence
   Problem with a source that is active only for a short period

Binaural Processing

• Auditory scene analysis (ASA)
   Cues: harmonics, pitch, onset, etc.

• Spatial cues
   Inter-aural time difference (ITD)
   Inter-aural intensity difference (IID)

[Figure: target and noise sources]

DUET Algorithm (1)

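• Mixing model
   $x_1(t) = \sum_{j=1}^{N} s_j(t) + n_1(t)$
   $x_2(t) = \sum_{j=1}^{N} a_j s_j(t - \delta_j) + n_2(t)$

• In the time-frequency domain
   $\begin{bmatrix} X_1(\omega,\tau) \\ X_2(\omega,\tau) \end{bmatrix} = \begin{bmatrix} 1 & \cdots & 1 \\ a_1 e^{-i\omega\delta_1} & \cdots & a_N e^{-i\omega\delta_N} \end{bmatrix} \begin{bmatrix} S_1(\omega,\tau) \\ \vdots \\ S_N(\omega,\tau) \end{bmatrix} + \begin{bmatrix} N_1(\omega,\tau) \\ N_2(\omega,\tau) \end{bmatrix}$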

DUET Algorithm (2)

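• W-disjoint orthogonality assumption
   $S_i(\omega,\tau) S_j(\omega,\tau) = 0, \quad \forall \omega, \tau, \; i \ne j$
   $\begin{bmatrix} X_1(\omega,\tau) \\ X_2(\omega,\tau) \end{bmatrix} = \begin{bmatrix} 1 \\ a_j e^{-i\omega\delta_j} \end{bmatrix} S_j(\omega,\tau)$ for some $j$

• Parameter estimation
   $\left( \hat{a}(\omega,\tau), \hat{\delta}(\omega,\tau) \right) = \left( \left| \frac{X_2(\omega,\tau)}{X_1(\omega,\tau)} \right|, \; -\frac{1}{\omega} \angle \frac{X_2(\omega,\tau)}{X_1(\omega,\tau)} \right)$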
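A minimal sketch of the per-bin parameter estimation, assuming a single active source per time-frequency point and the estimates a = |X2/X1| and δ = -(1/ω)·angle(X2/X1). The function name `duet_estimates`, the synthetic spectra, and the chosen attenuation and delay are illustrative assumptions.

```python
import numpy as np

def duet_estimates(X1, X2, omega):
    """Attenuation and delay estimates at every time-frequency point."""
    ratio = X2 / X1
    a_hat = np.abs(ratio)
    delta_hat = -np.angle(ratio) / omega
    return a_hat, delta_hat

# Synthetic single-source example: 30 frequency bins, 20 frames.
omega = np.linspace(0.1, 3.0, 30)[:, None]   # rad/sample, omega = 0 is excluded
rng = np.random.default_rng(0)
S = rng.standard_normal((30, 20)) + 1j * rng.standard_normal((30, 20))
a_true, delta_true = 0.93, 0.8                # assumed attenuation and delay (samples)
X1 = S
X2 = a_true * np.exp(-1j * omega * delta_true) * S
a_hat, delta_hat = duet_estimates(X1, X2, omega)
```

The delay estimate is unambiguous only while |ω·δ| stays below π, which is why the example keeps the product under that limit.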

DUET Algorithm (3)

• 2D histogram of amplitude-delay estimates from two mixtures of five sources
   ♦ Amplitude parameters: (.98, 1.02, .93, 1.06, .93)
   ♦ Delay parameters: (.3, -.2, .8, -.7, -.2)

DUET Algorithm (4)

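• If the j-th source is active,
   $a_j e^{-i\omega\delta_j} X_1(\omega,\tau) - X_2(\omega,\tau) = 0$

• Cost function
   $\rho_j(\omega,\tau) = \rho(\hat{a}_j, \hat{\delta}_j, \omega, \tau) = \frac{1}{1+\hat{a}_j^2} \left| \hat{a}_j e^{-i\omega\hat{\delta}_j} X_1(\omega,\tau) - X_2(\omega,\tau) \right|^2$
   $J(\omega,\tau) = \min\left( \rho_1(\omega,\tau), \ldots, \rho_N(\omega,\tau) \right)$

• Parameter estimation

• Stochastic gradient descent algorithm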

DUET Algorithm (5)

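• Mask
   $\Omega_j(\omega,\tau) = 1$ if $\rho(\hat{a}_j, \hat{\delta}_j, \omega, \tau) \le \rho(\hat{a}_m, \hat{\delta}_m, \omega, \tau), \; \forall m \ne j$; $0$ otherwise

• Demixing
   $\hat{S}_j(\omega,\tau) = \Omega_j(\omega,\tau) X_1(\omega,\tau)$

[Figure: example binary time-frequency masks for sources s1 and s2]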
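The mask and demixing steps can be sketched as follows, assuming a per-source cost of the form |a·e^{-iωδ}·X1 - X2|² / (1 + a²) and assigning each time-frequency point to the source with the smallest cost. The function names and the synthetic two-source data are hypothetical; the parameter pairs reuse values listed for the histogram example.

```python
import numpy as np

def duet_cost(a, delta, X1, X2, omega):
    """Assumed mismatch cost between the two mixtures for one source."""
    return np.abs(a * np.exp(-1j * omega * delta) * X1 - X2) ** 2 / (1.0 + a * a)

def binary_masks(params, X1, X2, omega):
    """One 0/1 mask per source: 1 where that source has the smallest cost."""
    costs = np.stack([duet_cost(a, d, X1, X2, omega) for a, d in params])
    winner = np.argmin(costs, axis=0)
    return np.stack([winner == j for j in range(len(params))]).astype(float)

rng = np.random.default_rng(0)
F, T = 16, 10
omega = np.linspace(0.2, 3.0, F)[:, None]
active1 = (np.arange(T) % 2 == 0)[None, :]   # source 1 owns the even frames
S1 = (rng.standard_normal((F, T)) + 1j * rng.standard_normal((F, T))) * active1
S2 = (rng.standard_normal((F, T)) + 1j * rng.standard_normal((F, T))) * ~active1
params = [(0.98, 0.3), (1.06, -0.7)]         # (attenuation, delay) per source
X1 = S1 + S2
X2 = sum(a * np.exp(-1j * omega * d) * S for (a, d), S in zip(params, (S1, S2)))
masks = binary_masks(params, X1, X2, omega)
S1_hat = masks[0] * X1                        # demixing: mask applied to mixture 1
```

The two synthetic sources never overlap in time, so the W-disjoint orthogonality assumption holds exactly and masking the first mixture returns source 1.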

Target Speech Enhancement

• In many practical applications,
   Need to estimate a signal from a target source
   The target source
      Frequently, its approximate direction can be anticipated.
      Strong utterance in a noisy environment

Proposed Method (1)

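• Continuously variable mask
   $\Omega_j^c(\omega,\tau) = 1 - \frac{\left| \hat{a}_j e^{-i\omega\hat{\delta}_j} X_1(\omega,\tau) - X_2(\omega,\tau) \right|}{\left| \hat{a}_j X_1(\omega,\tau) \right| + \left| X_2(\omega,\tau) \right|}$

• Real mask
   $\Omega_j^r(\omega,\tau) = \frac{\left| X_1^j(\omega,\tau) \right|}{\left| X_1(\omega,\tau) \right|}$

[Figure panels: continuously variable mask; real mask]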

Proposed Method (2)

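• Determine a threshold.
   Using a top ranking
   $Th(\omega_p) = \text{top } 35\% \text{ of } \Omega_j^c(\omega_p, \tau)$

• Binary mask using a threshold
   $\Omega_j^b(\omega_p, \tau, Th(\omega_p)) = 1$ if $\Omega_j^c(\omega_p, \tau) \ge Th(\omega_p)$; $0$ otherwise

[Figure panels: real mask; binary mask]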
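A sketch of the thresholding step, under the assumption that "top 35%" means that, for each frequency bin, the threshold is the value exceeded by 35% of the frames of the continuous mask. The function name `binary_mask_top_fraction` and the random stand-in mask are hypothetical.

```python
import numpy as np

def binary_mask_top_fraction(cont_mask, top=0.35):
    """Binarize a continuous mask (bins x frames) with one threshold per bin."""
    # The (1 - top) quantile over frames leaves the top fraction above it.
    th = np.quantile(cont_mask, 1.0 - top, axis=1, keepdims=True)
    return (cont_mask >= th).astype(float), th

rng = np.random.default_rng(0)
cont = rng.random((8, 100))          # stand-in continuous mask: 8 bins, 100 frames
mask, th = binary_mask_top_fraction(cont)
kept_per_bin = mask.sum(axis=1)      # number of frames kept in each frequency bin
```

A rank-based threshold keeps the same share of frames in every bin, regardless of how the continuous mask values are scaled in that bin.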

Proposed Method (3)

• Overall procedure

[Figure: block diagram of the proposed method with inputs x1(t), x2(t) and output sj(t); blocks: STFT, initial continuous mask, thresholding, attenuation-delay histogram, initializing attenuation-delay parameters, learning attenuation-delay parameters, continuous mask, thresholding, binary mask, ISTFT]

• Overall procedure of the DUET algorithm

[Figure: block diagram of DUET with inputs x1(t), x2(t) and output sj(t); blocks: STFT, attenuation-delay histogram, initializing attenuation-delay parameters, learning attenuation-delay parameters, comparing likelihoods, binary mask, ISTFT]

Experimental Setup (1)

Number of sources: 2 (1 target source and 1 noise source)
Input SIR: 5 dB
Simulated mixing in an anechoic environment

• Source signals
   10-second-long speech signals uttered by 4 males and 4 females in the TIMIT database

• Microphones
   Spacing: 2 cm

• Angle differences between two sources
   30°, 60°, 90°, 120°, 150°, and 180°

[Figure: Mic1 and Mic2 with sources at 50 cm and azimuths 80°, 50°, 20°, -10°, -40°, -70°, and -100°]

Experimental Results (1)

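[Figure: results for the proposed method and DUET. Value labels as extracted, one group per series: 9.17, 10.99, 13.61, 13.77, 13.79, 13.82 | 96.26, 95.38, 97.24, 98.35, 98.34, 98.26, 96.26 | 20.63, 21.09, 22.03, 21.96, 22.03, 22.11 | 89.51, 89.53, 89.55, 89.55, 89.87, 89.79]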

Experimental Setup (2)

Number of sources: 2 (1 target source and 1 noise source)
Input SIR: 5 dB
Real recorded mixtures in a normal office room

• Source signals
   10-second-long speech signals uttered by 3 male and 3 female speakers in the TIMIT database

• Microphones
   Spacing: 2 cm

• Angle differences between two sources
   30°, 60°, 90°, 120°, 150°, and 180°

[Figure: Mic1 and Mic2 with sources at 50 cm and azimuths 90°, 60°, 30°, 0°, -30°, -60°, and -90°]


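Experimental Results (2)

[Figure: results for the proposed method and DUET. Value labels as extracted, one group per series: 76.67, 81.77, 85.81, 96.93, 97.14, 97.03 | 6.29, 10.02, 11.49, 12.70, 12.46, 12.57 | 72.27, 80.02, 83.90, 83.37, 84.67, 85.14 | 5.51, 10.99, 14.88, 15.93, 15.92, 15.99]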

Discussion on DUET

• DUET (Degenerate Unmixing and Estimation Technique)
   Simple
   We should know the number of sources in advance.
   Estimate the attenuation and delay parameters for all sources.

• Described target speech enhancement technique
   Estimate the parameters for only one target source
   Much faster convergence of all the required parameters

• Not robust to reverberation

Zero-Crossing-Based Binaural Processing

Binaural Processing

• Auditory scene analysis (ASA)
   Spatial cues: ITD, IID
   Others: harmonics, pitch, onset, etc.

• Conventional methods
   Inter-aural cross-correlation
   Binary mask (all-or-none)

• Developed method
   Inter-aural zero-crossing difference
   Continuously variable mask

[Figure: target and noise sources]

Jeffress’ Model

[Figure: Jeffress’ model: left-ear signal l(m, n) and right-ear signal r(m, n) are combined by multiplication and running integration to give the running interaural cross-correlation over (m, n)]

Source Localization Based on Cross-Correlation

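• Signal model for the sensor outputs
   $x_i(t) = \sum_j h_{ij}(t) * s_j(t) + n_i(t)$
   $X_i(\omega) = \sum_j H_{ij}(\omega) S_j(\omega) + N_i(\omega)$

• ITD estimation based on generalized cross-correlation
   $R_{ik}(\tau) = \int W(\omega) X_i(\omega) X_k^*(\omega) e^{j\omega\tau} \, d\omega$
   $\hat{D}_{ik} = \arg\max_{\tau} R_{ik}(\tau)$

• Phase transform (PHAT)
   $W_{PHAT}(\omega) = \left| X_i(\omega) X_k^*(\omega) \right|^{-1}$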
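A minimal discrete-time sketch of delay estimation with generalized cross-correlation and PHAT weighting, assuming whole-sample delays and a circular correlation computed with the FFT. The function name `gcc_phat_delay`, the lag search range, and the test signal are illustrative assumptions.

```python
import numpy as np

def gcc_phat_delay(x_i, x_k, max_lag):
    """Lag (in samples) that maximizes the PHAT-weighted cross-correlation."""
    n = len(x_i)
    X_i = np.fft.rfft(x_i)
    X_k = np.fft.rfft(x_k)
    cross = X_i * np.conj(X_k)
    cross /= np.maximum(np.abs(cross), 1e-12)   # PHAT: keep phase, drop magnitude
    r = np.fft.irfft(cross, n)                  # circular cross-correlation
    lags = np.arange(-max_lag, max_lag + 1)
    return lags[np.argmax(r[lags])]             # negative indices wrap to negative lags

rng = np.random.default_rng(0)
x = rng.standard_normal(1024)
x_delayed = np.roll(x, 5)                       # x delayed by 5 samples (circularly)
lag_a = gcc_phat_delay(x, x_delayed, 20)
lag_b = gcc_phat_delay(x_delayed, x, 20)
```

With this sign convention the returned lag is positive when the first argument is the delayed signal and negative when the second one is.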

Finding Zero-Crossings

[Figure: zero crossings of the signals from two microphones; the time offset between corresponding crossings is the ITD]
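A sketch of zero-crossing-based ITD estimation for one band-pass signal pair, assuming upward zero crossings located by linear interpolation and each crossing of the first signal paired with the closest crossing of the second. The function names, the 500 Hz test tone, and the 50 µs delay are hypothetical choices for illustration.

```python
import numpy as np

def upward_zero_crossings(x, fs):
    """Times (s) of negative-to-nonnegative crossings, refined by linear interpolation."""
    idx = np.where((x[:-1] < 0) & (x[1:] >= 0))[0]
    frac = -x[idx] / (x[idx + 1] - x[idx])
    return (idx + frac) / fs

def zc_itd(x1, x2, fs):
    """One ITD sample per zero crossing of x1, using the closest crossing of x2."""
    t1 = upward_zero_crossings(x1, fs)
    t2 = upward_zero_crossings(x2, fs)
    nearest = t2[np.argmin(np.abs(t2[None, :] - t1[:, None]), axis=1)]
    return nearest - t1

fs = 16000
t = np.arange(1600) / fs
f0, d = 500.0, 50e-6                 # tone frequency (Hz) and true delay (s)
x1 = np.sin(2 * np.pi * f0 * t + 0.3)
x2 = np.sin(2 * np.pi * f0 * (t - d) + 0.3)
itd = zc_itd(x1, x2, fs)
itd_estimate = np.median(itd)        # median discards unmatched crossings at the edges
```

Each zero crossing yields its own ITD sample, so the estimate is available at a much finer time resolution than a frame-based cross-correlation, and sub-sample delays are resolved by the interpolation.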

Noise Robustness of the Zero-Crossing-Based Method

Y.-I. Kim and R. M. Kil, “Estimation of Interaural Time Differences Based on Zero-Crossings in Noisy Multisource Environments,” IEEE Trans. ASLP, vol.15, no. 2, 2007.

5-dB SNR

otherwise.,0

,1)( if,)(

1log10

)(SNR22

2210 jSjSj ii

iii

52

Application to Source Localization

• Four sources located at azimuth angles of -10o, 0o, 10o, and 40o

53

Speech Segregation

i: band (frequency) index
j: frame (time) index
τ: time lag
M: frame length
T: frame shift

• Cross-correlation-based ITD estimation

[Figure: left (L) and right (R) channel time-frequency grids indexed by band i and frame j]

Overall Procedure

[Figure: block diagram with input signals 1 and 2 and an enhanced signal as output; blocks: Gammatone filterbank (BPF1, BPF2, ..., BPFN), subband signals, ITD estimation using ZCs, scaling factor estimation, amplitude scaling, time reverse]

Amplitude Scaling

actualscalefactor

0.9 0.7 0.2 0.8

s(ITD)

56

Relationship between Zero-Crossing-Based ITDs and the SNRs

• Band-pass signals from two microphones

• The mean of the estimated ITDs, , can be approximated by

where

57

Zero Crossing vs. Cross-Correlation (1)

• Relative strength

• Criterion Confidence of the conversion between the relative

strength and an ITD Measure the sample standard deviation of ITDs for

each relative strength

58


• Estimate the sample standard deviation by simulation 100,000 randomly generated samples Parameters

Phase : uniform distribution on Frequency : uniform distribution

on

59

• ITDs normalized byZC-based ITDs CC-based ITDs

the lowestfreq. band

middlefreq. band

the highestfreq. band


60

Recognition Experimental Setup

• Recognizer
   The CMU SPHINX-III speech recognition system
   Fully-continuous hidden Markov models

• Database
   The DARPA Resource Management database
   Training data: 2,880 sentences
   Test data: 600 sentences
   – The target and interfering speech were combined with different delays from sensor to sensor. (SIR = 0 dB)

• Feature
   13th-order mel-frequency cepstral coefficients

Recognition Results (1)

• Word error rates (WERs) (%)

[Figure: bar chart of WERs (0 to 100%) for no proc., CC-binary, ZC-binary, CC-cont., ZC-cont., and target alone; legend entries as extracted: no, identical, independent; added white Gaussian noise (SNR: 20 dB)]

Recognition Results (2)• Word error rates (WERs) (%)

63

What if There is Reverberation?

target noise

Room

Direct path

Echoic path

64

Jeffress’ Model

running interaural cross-correlation ),( nm


leftear

multiplication

runningintegration

65

Lindemann’s Model (1)

inhibited interaural cross-correlation ),( nm


leftear

),( nmk

),( nmil),( nmir

66

Lindemann’s Model (2)

),( nml

),( nmr

),( nmil

),( nmir

l1

r1

1

),( nmk

),( nm

),(),(),( nmrnmlnmk

)}),({1))(,(1)(,(

)1,1(

nmkcnmlcnmr

nmr

ds

)}),({1))(,(1)(,(

)1,1(

nmkcnmrcnml

nml

ds

Inhibition stationaryinhibition

dynamicinhibition

)]1(1[)1()1(

)(/

nxennx

ninhd TT

)()](1)[,(),(

)()](1)[,(),(

mmnmrnmr

mmnmlnml

ll

rr

Monaural sensitivities

fMMmfrl emmm /)()()()(

67

Simulation (1)

• Input signal Gammatone filterbank impulse response with 1kHz

center freq. Half-wave rectified Low-pass filtered (cf = 1.6kHz)

it dt

left

right

directsound

reflectedsound

sec6.0 mtd

68

• Input signal

sec10mti

Simulation (2)

Jeffress’ model

Lindemann’s model

69

Onset Detection

• Onset intervals Dominantly contain direct-path components

Room Impulse Response

Direct path

Late Reflection

Early Reflection

target noise

70

Palomäki’s Model of the Precedence Effect

target noise

i-th BPF i-th BPF

Cross-corr.

ITD output

Envelope e(t)

Inhibition h(t)

Envelope e(t)

Inhibition h(t)

Inhibitedenvelope

Inhibitedenvelope

71

Filter to Generate the Inhibition from an Envelope

• Low pass filter
   $h_{lp}(n) = A \exp(-n/\alpha)$
   $A$ is chosen to give a unity gain at DC.
   $\alpha$ is a time constant.
   $\alpha = 15\,\text{ms} \times F_s \quad (F_s = 16000\,\text{Hz})$
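A sketch of the inhibition filter, assuming a one-sided exponential impulse response h(n) = A·exp(-n/α) for n ≥ 0 with a 15 ms time constant at 16 kHz. The symbol name `alpha`, the recursive implementation, and the function name `inhibition` are assumptions made for illustration.

```python
import numpy as np

fs = 16000
alpha = 15 * fs / 1000            # 15 ms time constant in samples (240)
decay = np.exp(-1.0 / alpha)
A = 1.0 - decay                   # makes the sum of A * exp(-n / alpha) equal to 1

def inhibition(envelope):
    """Low-pass filter an envelope with h(n) = A * exp(-n / alpha)."""
    out = np.zeros_like(envelope, dtype=float)
    state = 0.0
    for n, e in enumerate(envelope):
        # One-pole recursion equivalent to convolving with the exponential.
        state = decay * state + A * e
        out[n] = state
    return out

impulse = np.zeros(500)
impulse[0] = 1.0
h = inhibition(impulse)                    # impulse response of the filter
step_final = inhibition(np.ones(5000))[-1] # settles at the DC gain
```

The unity DC gain means that, for a steady envelope, the inhibition settles at the envelope level, so only rising segments (onsets) exceed it.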

Envelope and Inhibition

• Envelope: blue line, Inhibition: red line (cf = 1,037Hz)

73

Source Localization in Reverberant Environments (1)

• Energy-based onset detection
   Simple (small computation)
   Not robust to parameters and environments

[Figure panels: smoothed envelope; onset detection]


• Echo-free onset detection
   Detection by comparing the sound-to-echo ratio with a threshold
   More robust to parameters

[Figure: amplitude level vs. time for the observed sounds, the possible echo at time n caused by the preceding sound at time np, the maximum possible echo, the total estimated echoes, and the threshold Th_efo]


ITD estimationbased on zero crossing

SNR estimation

ReliableITD sampleselection

angleconversion

BPF1

BPF2

BPFN

BPF1

BPF2

BPFN

weightedhistogram


SNR estimation


angleconversion


SNR estimation


angleconversion

Sourcelocalization

envelope

envelope

envelope

waveform

waveform

waveform

waveform

waveform

waveform

Gammatonefilterbank

inputsignal1

inputsignal2

Onset detection

Onset detection

Onset detection

76


• Recording rooms

Moderately reverberant room(a normal office room)

Higher reverberant room(a bathroom)

• Height of both rooms : 3 m• Height of speakers and mics : 1.5 m

77


[Figure: results in the moderately reverberant room and the higher reverberant room for the described method, energy-based onset detection, and echo-free onset detection]


• Simulated mixing environment

• Height of both rooms : 3 m• Height of speakers and mics : 1.1 m• 30-dB SNR observations by adding white Gaussian noise• 320 utterances by 16 speakers from the TI-DIGIT database

43mm4.0m

5.0m

2.0m

1.0m

3.0m

0°

30°mic.1

mic.2

speakers

60°

79


• Rates of localizations where errors of estimated angles were less than 3o

80

Discussion on Binaural Processing

• Described a method that enhances speech by estimating continuously variable masking weights

• Estimation of ITDs from zero crossings
   More reliable than that from cross-correlation

• Continuously variable mask
   Estimate relative target intensity in the t-f domain
   Better accuracy than binary mask

• Reverberation
   Precedence effect
   Onset detection and SNR estimation

Thank you very much.

82

Multi-rate Systems

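• Decimation and expansion
   Expansion by $L$: $V(e^{j\omega}) = Y_D(e^{j\omega L})$
   Decimation by $M$: $Y_D(e^{j\omega}) = \frac{1}{M} \sum_{m=0}^{M-1} X\left(e^{j(\omega - 2\pi m)/M}\right)$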
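The decimation and expansion relations can be checked with the DFT on a finite-length signal. The sketch assumes the signal length is a multiple of the rate-change factor; the helper names `decimate` and `expand` are hypothetical.

```python
import numpy as np

def decimate(x, M):
    """Keep every M-th sample."""
    return x[::M]

def expand(x, L):
    """Insert L - 1 zeros between consecutive samples."""
    v = np.zeros(len(x) * L, dtype=x.dtype)
    v[::L] = x
    return v

rng = np.random.default_rng(0)
M = 4
x = rng.standard_normal(64)
X = np.fft.fft(x)

# Decimation: the output spectrum is the average of M aliased copies of X.
Y = np.fft.fft(decimate(x, M))
Y_alias = X.reshape(M, -1).sum(axis=0) / M

# Expansion: the output spectrum is X repeated M times (spectral images).
V = np.fft.fft(expand(x, M))
V_images = np.tile(X, M)
```

The aliased copies in `Y_alias` are the terms that analysis filters must suppress before decimation, and the images in `V_images` are what the synthesis filters remove after expansion.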

Filter Banks (1)

• Multirate System

84

Filter Banks (2)

• In the z-domain, ( )

• Perfect reconstruction system

MjeW 2

1

0

/1/11

0

/1 )()(1

)(1

)(M

m

mMmMk

M

m

mMkk WzXWzH

MWzX

MzV

1

0

)()(1

)()(M

m

mmk

Mkk zWXzWH

MzVzU

1

0

1

0

1

0

)()()(1

)()()(ˆK

kk

mk

M

m

mK

kkk zFzWHzWX

MzUzFzX

0),()(ˆ 0 czXczzX n

85

Polyphase Representation (1)

• Analysis filter (Type 1 polyphase)

• Using matrix notations,

1

0

212

)(

)12()2()()(

M

m

Mkm

m

n

nk

n

nk

n

nkk

zEz

znhzznhznhzH

n

nkkm zmMnhzE )()(where

)()()(

)()()(

)()()(

)(

1,11,10,1

1,11110

1,00100

zEzEzE

zEzEzE

zEzEzE

z

MKKK

M

M

E

)(

)(

)(

)(

1

1

0

zH

zH

zH

z

K

h

TMM zzzz )1(11)()( Eh

86


• Synthesis filter (Type 2 polyphase)

• Using matrix notations,

1

0

)1(

221

)(

)2()12()()(

M

m

Mmk

mM

n

nk

n

nk

n

nkk

zRz

znfznfzznfzF

n

nkmk zmMMnfzR )1()(where

)()()(

)()()(

)()()(

)(

1,11,10,1

1,11110

1,00100

zRzRzR

zRzRzR

zRzRzR

z

KMMM

K

K

R

)(

)(

)(

)(

1

1

0

zF

zF

zF

z

K

f

)(1)( )2()1( MMMT zzzz Rf

87


• Polyphase representation

• Rearrangement using noble identities

88

Paraunitary Propertyfor Perfect Reconstruction

• Paraunitary property

Transposed with its entries complex-conjugated and time-

reversed

• Perfect reconstruction condition

0,)()(~ ddzz IEE

)(~

zE

0),(~

)( czczz l ER10),()( * KknLchnf kk

10),(~

)( KkzHczzF kL

k

1 MMlLwhere

89

Critically Sampled Filter Banks (1)

• Overall system

• Critically sampled filter banks MK

90


• NotationsTMzWXzWXzXz )]()()([)( 1 x

T

MK

MM

K

K

zWHzWHzWH

zWHzWHzWH

zHzHzH

z

)()()(

)()()(

)()()(

)(

11

11

10

110

110

H

)]}()()({[diag)( 1 MzWSzWSzSz S

)()()(1

)]()()([)( /1/1/1110

MMMTM zzz

MzYzYzYz xSHy

)()()(1

)](ˆ)(ˆ)(ˆ[)(ˆ /1/1110

MMTM zzz

MzYzYzYz xHWy

91


• The subband error signals are zero if

• Two subbands scheme (Gilloire and Vetterli, ’92) Assume the classical QMF filters

Therefore,

)()()()( zzzzM SHHW

)()(

)()(

)(det

1)(1

zHzH

zHzH

zz

HH

)()(

)()(

1

0

zHzH

zHzH

)()(

)()()(

zHzH

zHzHzH

92


• Adaptive filters

• is diagonal only if or

)()()()()]()()[()(

)]()()[()()()()()(

)(det

1

)()()()(

22

22

12

zSzHzSzHzSzSzHzH

zSzSzHzHzSzHzSzH

z

zzzz

H

HSHW

)( 2zW

0)()( zHzH 0)()( zSzS

A general physical system (X)

0)()( zHzHPR (X)

93


• Multiband scheme Assumptions

Adaptive filters (require the use of cross filters)

Slow convergence and performance degradation

1||,0)()( jizHzH ji )()( ii zWHzH

)(00)(

0)()(0

0)()()(

)(0)()(

)(

1,10,1

2221

121110

1,00100

zWzW

zWzW

zWzWzW

zWzWzW

z

MMM

M

M

W

where

94

Oversampled Filter Banks (1)

• such that

• Non-critical decimation can avoid the aliasing problem. The redundancy

Provide enough information for successful adaptation in every bands.

Diagonal adaptive filter matrix

MK jizHzH ji ,0)()( )()( i

i zWHzH where

95


• Recall

• For perfect reconstruction,

• Remove the aliasing terms

10),(~

)( KkzHczzF kL

k

1

0

1

0

)()()(1

)(ˆK

kk

mk

M

m

m zFzWHzWXM

zX

1

0

)()()(1

)(ˆK

kkk zFzHzX

MzX

96


• Analysis filters from a real-valued linear phase prototype filter by generalized DFT

To cover the frequency range by exactly subbands

For the linear phase property • Synthesis filters

• All filters can be derived from one prototype filter.

1 , ,1 ,0 ,1 , ,1 ,0 ),()())((

200

q

nnkkK

j

k LnKknqenh

];0[ 2/K

2

10 k

2)1(0 qLn

)()1()(~

)( nhnLhnhnf kqkkk

97

Design of Oversampled Filter Banks (1)

• Cost function Combination of filter bank reconstruction error and

stopband energy of the analysis filters

• Impulse response of the overall filter bank system

• Using matrix notation,

1

0

)()(1

)(K

kkk nfnh

Mnt

kk

qk

k

k

qk

kk

k

qk

k

k

k

Lf

f

f

Lh

hh

h

Lt

t

t

fHt

)1(

)1(

)0(

)1(00

0)0()1(

00)0(

)22(

)1(

)0(

98


• Impulse response of the overall filter bank system

• Measure of the reconstruction error

• Measure for the energy contained in the stopband

fHfffHHHt TT

KTT

KM 110110

1

2

1 ))1(( qLnδt

2

2

11

00

2

)1(

)1(

)0(

))1(cos()1cos(1

))1(cos()1cos(1

))1(cos()1cos(1

kk

qk

k

k

qNN

q

q

k

Lh

h

h

L

L

L

hP

99


• To enforce linear phase filters,

Therefore, , where .

• Cost function

where

qLJI

q

T

qT

LL

Tq

Lqqq

Lqqq

qq)12/()1()0(

)1()1()0(

2/2/

TTK

TT110 hhhhf

Tqkkkk Lhhh )1()1()0( h

2

1

02

21

))1((1

0

δf

P

H qK

kk

LnM

},,,{diag 110 KPPPP

100


• GDFT

• Cost function

• Iterative least-squares design algorithm• Initialize• Minimize with respect to• Apply relaxation

,qMh kk

21

0

))1((1

0

δq

LP

LMH

q

q

K

kkk

LnM

where are diagonal matrices with transform factorskM

)(iq

)(iq

)1()1()()( iii qqq

101

ITD Using Zero Crossings (1)• Two microphone signals

Ignore attenuation between microphones because of closeness

• Assume and

))(cos())(cos()(

)cos()cos()(

22112

211

dtwadtwAtx

twatwAtx

0)( 11 tx

))(sin()sin(

))(cos()cos(

))(sin()sin(

))(cos()cos()(

2212

2212

1111

111112

dwtwa

dwtwa

dwtwA

dwtwAtx

0)( 12 tx

102

ITD Using Zero Crossings (2)

• Since

• Therefore,

• Since

)()sin(

)()sin()(

2212

111112

dwtwa

dwtwAtx

}2,1,{ ,)( jidw ji

1))(cos( ji dw )())(sin( jiji dwdw

0)( 12 tx

21221111

122111

)sin()sin(

))sin()sin((

dtwawdtwAw

twawtwAw

103


• Recall

)sin()(cos

)sin()(cos

)sin()))cos((sin(cos

)sin()))cos((sin(cos

12212222

1

2122112222

1

122121

1

21221121

1

twawtwaAw

dtwawdtwaAw

twawtwAa

Aw

dtwawdtwAa

Aw

)(cos)sin(

)(cos)sin(

11222

2111

211222

21111

twAawtwAw

dtwAawdtwAw

for aA

for aA

0)cos()cos()( 121111 twatwAtx

104


• Assume uniformly-distributed frequencies over a narrow band and phases over the interval ),(

21 )),(1(),( daAgdaAg

otherwise )(

)(tan2

, if 2

1

, if )(

)(tan)2

(

),(

22

22

22

122

22

22

22

122

2

Aa

AaAAa

AaA

aA

aAaA

aAaaA

aA

aA

aAg

105

ITD Using Cross-Correlation (1)• Two microphone signals

• Cross-correlation))(cos())(cos()(

)cos()cos()(

22112

211

dtwadtwAtx

twatwAtx

T

Tdttxtx

Tc )()(

2

1)( 21

)))](cos()2)(2(cos(

)))(cos())(cos(

))(cos())((cos(

)))(cos()2)(2(cos([2

1

)()(

222222

211211

221221

111112

21

dwdwtwa

twdtwtwdtw

dtwtwdtwtwAa

dwdwtwA

txtx

106

ITD Using Cross-Correlation (2)• ITD at the maximum of cross-correlation,

)]2sin()2sin()sin())sin((2

)sin())sin((2)2sin()2sin([

)]2)2cos()2(sin()cos())sin((2

)cos())sin((2

[

)]cos())sin((2

)cos())sin((2

)2)2cos()2(sin([

)]2)2cos()2(sin()cos())()sin((2

)cos())()sin((2

)2)2cos()2(sin([

22

21

2112

2222

22221

21

2221

212

2121

21

2121

21

2111

21

2222

222

2121

21

22

2121

21

2111

2

TwaTwwAa

TwwAaTwA

TwwTwawTwwww

Aa

wTwwww

Aad

wTwwww

Aa

wTwwww

AaTwwTwAd

TwwTwawwTwwww

Aa

wwTwwww

AaTwwTwA

0)(

d

dc

107

Spatial Aliasing

• To avoid the spatial aliasing
   The delay between sensors should be smaller than a half period of the signal.
   If we have a t-sample delay for a noisy signal of frequency $F$, the alias-free condition is
      $\frac{t}{F_s} < \frac{1}{2F}$
   1 sample delay at 16 kHz sampling rate
      $F < \frac{F_s}{2} = 8\,\text{kHz}$
   Center frequencies of all Gammatone filters are lower than 8 kHz.
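The alias-free condition can be turned into a one-line check, assuming the inter-sensor delay is expressed in samples and the condition is delay / Fs < 1 / (2F). The function name `max_alias_free_frequency` is hypothetical.

```python
def max_alias_free_frequency(fs_hz, delay_samples):
    """Upper bound on signal frequency (Hz) for which the delay stays below half a period."""
    return fs_hz / (2.0 * delay_samples)

limit_1_sample = max_alias_free_frequency(16000, 1)    # closely spaced microphones
limit_2_samples = max_alias_free_frequency(16000, 2)   # doubling the delay halves the limit
```

For a one-sample delay at 16 kHz the bound coincides with the Nyquist frequency, so every band of the filterbank stays free of phase ambiguity.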

Close Microphones

• ITD estimation without phase ambiguity
   The closest zero crossings provide the desired ITD value.
   Do not need to estimate IIDs

• Easy to derive a relationship between the ITDs and the scaling factors

• Reduce search region of p2

• Reduce signal distortion

• Compact implementation

Results on Source Segregation

• Target signal• Without processing• Scaling with the following factors

factor = [summation of CC values in the neighborhood of ITD of the desired source (male speaker)] / [summation of CC values in the whole range]

factor = 1 if the peak value of CC is in the neighborhood of ITD of the desired source.factor = 0 otherwise.

Jeffress’ model Lindemann’s model

110

Combining the Lindemann’s Model and the Precedence Effect

target noise

i-th BPF i-th BPF

Jeffress’or

Lindemann’smodel

The model will be operated only when e(t) > h(t).Otherwise, the model will provide the previous factor.

Envelope e(t) Inhibition h(t)

On-sete(t)>h(t) ?

Yes

Enhanced speech

111

Results on Source Segregationfor Reverberated Signal

• Target signal with reverberation without reverberation (ideal solution)

• Without processing• Scaling with the following factors

factor = 1 if the peak value of CC is in the neighborhood of ITD of the desired source.factor = 0 otherwise. Jeffress’ model Lindemann’s model

w/o on-set enh. res: 1/8000 sec res: 1/48000 sec

112

Dereverberation

• Early reflections are especially problematic. Affect on the same frame as the direct sound wave

• To remove early reflection components Dereverberate the linear prediction (LP) residual of

incoming speech Filter estimation for nearly exponentially-decaying

reverberation like a typical room impulse response Correspond to the inverse of the truncated auto-correlation

]))00 )()0( 00([DFT/.1(IDFT)( Rccnhderev

113

Dereverberation and Echo Suppression

Dereverberation

Dereverberation

Gammatonefilterbank

Gammatonefilterbank

Cross-correlation

Inhibition

Frameenergies

Frame energies with inhibition

Maskestimation

Featureestimation

Input 1

Input 2

114

Simulated Room Impulse Response

Virtual room to simulate impulse responses

2m1.5m

1.1m17cm

1m

30o

target

interferencemics

• Mixing environments (reverberation time: 0.5s)

115

Recognition Results

• Word error rates (WERs) (%) A : no processing B : seg. + inhib. C : derev. + seg.

+ inhib. D : ideal masks

0

50

100

A B C D

infinity, 0sec

10dB, 0.3sec

10dB, 0.5sec

0dB, 0.3sec

0dB, 0.5sec
