
Audio Segregation

Date post: 08-Jan-2016
Audio Segregation. Hyung-Min Park. 2010. 4. 26. Contents: independent component analysis (ICA); conventional methods for acoustic mixtures; filter bank approach to ICA; degenerate unmixing and estimation technique (DUET); target speech enhancement; zero-crossing-based binaural processing.
Page 1: Audio Segregation

2010. 4. 26.

Hyung-Min Park

Audio Segregation

Page 2: Audio Segregation

2

Contents

• Independent component analysis (ICA)
  - Conventional methods for acoustic mixtures
  - Filter bank approach to ICA
• Degenerate unmixing and estimation technique (DUET)
  - Target speech enhancement
• Zero-crossing-based binaural processing
  - Inter-aural time difference (ITD)
  - Zero crossings vs. cross-correlation
  - Continuously-variable mask vs. binary mask

Page 3: Audio Segregation

3

Cocktail Party Problem

Page 4: Audio Segregation

4

Independent Component Analysis

Page 5: Audio Segregation

5

Blind Source Separation: A Demo

sources and the mixing environment

Page 6: Audio Segregation

6

Independent Component Analysis

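• Blind source separation
  - Sensor signals
  - Recover the original source signals without knowing how they are mixed
• ICA
  - Assume sources are independent
  - Estimate the unmixing system W from mixtures x(t)

[Figure: sources s pass through the mixing system A to give mixtures x; the unmixing system W produces outputs u]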
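The signal flow s → A → x → W → u can be illustrated with a minimal sketch for the instantaneous 2x2 case. The mixing matrix `A` and the source samples below are hypothetical, and `W` is simply set to the inverse of `A` to show what an ideal unmixing system achieves. An ICA algorithm has to estimate such a `W` blindly from the mixtures, and in general only up to scaling and permutation.

```python
def matvec(M, v):
    """Multiply a 2x2 matrix by a length-2 vector."""
    return [M[0][0] * v[0] + M[0][1] * v[1],
            M[1][0] * v[0] + M[1][1] * v[1]]

def inverse_2x2(M):
    """Closed-form inverse of a 2x2 matrix."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / det, -M[0][1] / det],
            [-M[1][0] / det, M[0][0] / det]]

# Hypothetical mixing system A and a few source sample pairs s(t).
A = [[1.0, 0.6], [0.5, 1.0]]
sources = [[0.3, -1.2], [1.0, 0.4], [-0.7, 0.9]]

# Mixtures x(t) = A s(t), as observed at the sensors.
mixtures = [matvec(A, s) for s in sources]

# Ideal unmixing system W = A^-1 (known here only because A is known).
W = inverse_2x2(A)
outputs = [matvec(W, x) for x in mixtures]  # u(t) = W x(t)
```

With the ideal `W`, `outputs` matches `sources` to rounding error.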

Page 7: Audio Segregation

7

Acoustic Mixtures

• Instantaneous mixtures
• Acoustic mixing environments
  - Time delay
  - Reverberation
  - Convolutive mixing

[Figure: sources s1 and s2, sensors x1 and x2, and a reflecting wall]

Page 8: Audio Segregation

8

Time Domain Approach to ICA

• Feedback architecture
• Adaptation rules (Torkkola, 1996)
• Intensive computations and slow convergence

[Figure: 2x2 feedback network with filters W11, W12, W21, and W22, inputs x1(n) and x2(n), and outputs u1(n) and u2(n)]

Page 9: Audio Segregation

9

Frequency Domain Approach to ICA (1)

• In the frequency domain
• Complex score function
• Adaptation rule (Smaragdis, 1998)

[Figure: inputs x1(n), ..., xN(n) → short-time Fourier transform → unmixing matrices W1, W2, ..., WK → inverse short-time Fourier transform → outputs u1(n), ..., uN(n)]

Page 10: Audio Segregation

10

Frequency Domain Approach to ICA (2)

• Performance limitation
  - Contradiction between covering long reverberation and insufficient learning data
    Long reverberation → long frame size
    Small number of frames → insufficient input data
  - Mixtures combined from different time ranges of sources
    Delayed mixtures

[Figure: in the k-th block, x1(n) = s1(n) + s2(n-d1) and x2(n) = s1(n-d2) + s2(n)]

Page 11: Audio Segregation

11

Design of a Filter Bank

• Filter bank design
  - Uniform sixteen-channel filter bank
  - Decimation factor: 10
  - Filter length: 220 taps

[Figure: frequency response of the analysis filters]

Page 12: Audio Segregation

12

Filter Bank Approach to ICA (1)

• 2x2 network for the filter bank approach to ICA

[Figure: x1(n) and x2(n) each pass through analysis filters H1(z), H2(z), ..., HK(z) and decimators M; ICA networks W1(z), W2(z), ..., WK(z) process the subbands; expanders M and synthesis filters F1(z), F2(z), ..., FK(z) give u1(n) and u2(n)]

Page 13: Audio Segregation

13

Filter Bank Approach to ICA (2)

• Adaptation rules
• Total number of multiplications
  - Time domain approach
  - Filter bank approach
• The number of required filter coefficients
  - Uniform K-channel oversampled filter bank
  - [symbol lost in extraction] is the number of adaptive filter coefficients

Page 14: Audio Segregation

14

Experimental Setup (1)

• Measure for blind source separation
  - SIR for a 2x2 mixing/unmixing system
• Sources
  - Two streams of speech
  - 5 second length
  - 16 kHz sampling rate

Page 15: Audio Segregation

15

Experimental Setup (2)

• Mixing system
  - Virtual room to simulate impulse responses

Page 16: Audio Segregation

16

Experimental Results

• Learning curves of the three different approaches

Page 17: Audio Segregation

17

Experiment on Real-Recorded Data (1)

• Mixing environment

• Filter bank approach
  - Using the sixteen-channel filter bank
  - Each adaptive filter: 103 taps

[Figure: speakers and microphones; labeled distances 40 cm and 60 cm]

Page 18: Audio Segregation

18

Experiment on Real-Recorded Data (2)

• Blind source separation of real recorded mixtures

Mixture 1

Mixture 2

Result 1

Result 2

Page 19: Audio Segregation

19

Motivation of a Nonuniform Filter Bank Approach

• Time-averaged magnitude responses of signals
  - The energy decreases exponentially as the frequency increases.

[Figure: time-averaged magnitude responses of speech, car noise, and music]

• Subband division
  - Result of a trade-off between mitigating the undesired properties of the uniform filter bank approach and avoiding a large adaptive filter length

Page 20: Audio Segregation

20

Relationship between Performances and Filter Length

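• Convergence of gradient-based algorithms
  - Controlled by the condition number
• Bordering theorem

$$ \mathbf{R}_{L_a+1} = \begin{bmatrix} r(0) & \mathbf{r}^H \\ \mathbf{r} & \mathbf{R}_{L_a} \end{bmatrix} $$

• Condition number

$$ \lambda_{\max}^{(L_a+1)} \ge \lambda_{\max}^{(L_a)} \quad \text{and} \quad \lambda_{\min}^{(L_a+1)} \le \lambda_{\min}^{(L_a)}, \qquad \chi = \frac{\lambda_{\max}}{\lambda_{\min}} \ge 1 $$

  - Monotonically nondecreasing function of filter length
• The longer the filter length, the slower the convergence speed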

Page 21: Audio Segregation

21

Bark-Scale Filter Banks

• Subband division
  - Result of a trade-off between mitigating the undesired properties of the uniform filter bank approach and avoiding a large adaptive filter length
• Bark frequency warping function

$$ \Omega(\omega) = 6 \log\left\{ \frac{\omega}{1200\pi} + \left[ \left( \frac{\omega}{1200\pi} \right)^2 + 1 \right]^{0.5} \right\} $$

• Bark-scale filter banks
  - Resemble the frequency analysis of the mammalian cochlea
  - Somewhat narrow subbands in the low-frequency region
  - Wide subbands in the high-frequency region
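As a quick check on the warping function, read here as Ω(ω) = 6 log{ω/(1200π) + [(ω/(1200π))² + 1]^0.5} with a natural logarithm, the following sketch evaluates it at a few band edges. The band edges are hypothetical and serve only to show that a fixed step in Hz covers fewer Bark at high frequencies, which is why Bark-scale subbands become wider there.

```python
import math

def bark_warp(omega):
    """Bark warping of an angular frequency omega in rad/s."""
    x = omega / (1200.0 * math.pi)
    return 6.0 * math.log(x + math.sqrt(x * x + 1.0))

def bark_from_hz(f_hz):
    """Same warping, with the frequency given in Hz."""
    return bark_warp(2.0 * math.pi * f_hz)

# Hypothetical band edges in Hz (not taken from the slides).
edges_hz = [0.0, 1000.0, 2000.0, 4000.0, 8000.0]
edges_bark = [bark_from_hz(f) for f in edges_hz]

# Bark width of each band; the 4 kHz wide top band spans fewer Bark
# than the 1 kHz wide bottom band.
widths_bark = [hi - lo for lo, hi in zip(edges_bark, edges_bark[1:])]
```

Since log(x + sqrt(x² + 1)) is the inverse hyperbolic sine, the same value is obtained from `6 * math.asinh(f_hz / 600)`.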

Page 22: Audio Segregation

22

Nonuniform Filter Bank Approach to BSS

• 2x2 network for the nonuniform oversampled filter bank approach to BSS

[Figure: x1(n) and x2(n) each pass through analysis filters H1(z), H2(z), ..., HK(z) and band-specific decimators M1, M2, ..., MK; ICA networks W1(z), W2(z), ..., WK(z) process the subbands; expanders M1, M2, ..., MK and synthesis filters F1(z), F2(z), ..., FK(z) give u1(n) and u2(n)]

Page 23: Audio Segregation

23

Design of a Bark-Scale Filter Bank

• Filter design of a Bark-scale oversampled filter bank
  - 16-channel, M = [22 22 20 18 14 11 7 3], L_q = 220, OSR = 167%

[Figure: frequency responses of the Bark-scale filter bank and the uniform filter bank]
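The oversampling ratio quoted on this slide can be cross-checked with a short calculation. The sketch assumes that the decimation factors read M = [22, 22, 20, 18, 14, 11, 7, 3], that the input is real, and that each complex subband sample counts as two real values. That accounting is an assumption, not something stated on the slide.

```python
from fractions import Fraction

def oversampling_ratio(decimation_factors):
    """Real-valued output samples per input sample.

    Assumption: real input, one complex subband per decimation factor,
    and each complex sample counted as two real values.
    """
    return 2 * sum(Fraction(1, m) for m in decimation_factors)

bark_factors = [22, 22, 20, 18, 14, 11, 7, 3]  # Bark-scale bank, as read from the slide
uniform_factors = [10] * 8                     # uniform bank, decimation factor 10

osr_bark = float(oversampling_ratio(bark_factors))        # about 1.67
osr_uniform = float(oversampling_ratio(uniform_factors))  # 1.6
```

Under these assumptions the Bark-scale bank comes out at about 167%, and the uniform bank with decimation factor 10 at 160%.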

Page 24: Audio Segregation

24

Experimental Results

• Results on blind source separation in the oversampled filter bank

SIR PESQ score

Page 25: Audio Segregation

25

FPGA Implementation (1)

[Figure: FPGA demonstration setup with noise references, microphone signals, and outputs; the sources are noises, female speech, and male speech recorded by the microphones]

Page 26: Audio Segregation

26

FPGA Implementation (2)

• 4 adaptive noise canceling (4 music signals) + 2 blind source separation (2 speech signals)

[Audio demo: MIC1, MIC2, OUT1, OUT2]

Page 27: Audio Segregation

27

Application to Hearing Aids

• BTE-type hearing aids
  - Front mic.: SNR = 3.20 dB
  - Rear mic.: SNR = 2.45 dB
  - Output: SNR = 21.38 dB

[Figure: noise and speech sources with front and rear microphones; labeled distances 1 m and 1 m]

Page 28: Audio Segregation

28

Discussion on ICA

• Assume sources are independent
• Time domain approach
  - Intensive computations and slow convergence
• Frequency domain approach
  - Fewer computations but inferior performance
• Filter bank approach
  - Moderate computations and good performance
  - Suitable for parallel processing
  - Bark-scale filter bank approach

Page 29: Audio Segregation

29

Degenerate Unmixing and Estimation Technique

Page 30: Audio Segregation

30

Introduction

• Independent component analysis for blind source separation
  - Good performance
  - In general, the number of microphones should not be smaller than the number of sources.
  - Too many parameters: heavy computational load and slow convergence
  - Problem with a source that is active only for a short period

Page 31: Audio Segregation

31

Binaural Processing

• Auditory scene analysis (ASA)

Cues: harmonics, pitch, on-set, etc

• Spatial cues Inter-aural time difference (ITD) Inter-aural intensity difference (IID)

target noise

Page 32: Audio Segregation

32

DUET Algorithm (1)

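• Mixing model

$$ x_1(t) = \sum_{j=1}^{N} s_j(t) + n_1(t) $$

$$ x_2(t) = \sum_{j=1}^{N} a_j s_j(t - \delta_j) + n_2(t) $$

• In the time-frequency domain

$$ \begin{bmatrix} X_1(\tau,\omega) \\ X_2(\tau,\omega) \end{bmatrix} = \begin{bmatrix} 1 & \cdots & 1 \\ a_1 e^{-i\omega\delta_1} & \cdots & a_N e^{-i\omega\delta_N} \end{bmatrix} \begin{bmatrix} S_1(\tau,\omega) \\ \vdots \\ S_N(\tau,\omega) \end{bmatrix} $$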

Page 33: Audio Segregation

33

DUET Algorithm (2)

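• W-disjoint orthogonality assumption

$$ S_i(\tau,\omega)\, S_j(\tau,\omega) = 0, \quad \forall \tau, \omega, \; i \ne j $$

$$ \begin{bmatrix} X_1(\tau,\omega) \\ X_2(\tau,\omega) \end{bmatrix} = \begin{bmatrix} 1 \\ a_j e^{-i\omega\delta_j} \end{bmatrix} S_j(\tau,\omega), \quad \text{for some } j $$

• Parameter estimation

$$ \left( \hat{a}(\tau,\omega),\; \hat{\delta}(\tau,\omega) \right) = \left( \left| \frac{X_2(\tau,\omega)}{X_1(\tau,\omega)} \right|,\; -\frac{1}{\omega} \angle \frac{X_2(\tau,\omega)}{X_1(\tau,\omega)} \right) $$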
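A minimal sketch of the parameter estimation step at a single time-frequency point, assuming the estimates are the magnitude and the scaled negative phase of the ratio X2/X1. The source value `S`, the frequency `omega`, and the true parameters are hypothetical. The phase is only unambiguous while |ωδ| < π.

```python
import cmath
import math

def duet_estimate(X1, X2, omega):
    """Attenuation and delay estimates at one time-frequency point."""
    ratio = X2 / X1
    a_hat = abs(ratio)                          # relative attenuation
    delta_hat = -cmath.phase(ratio) / omega     # relative delay (valid for |omega*delta| < pi)
    return a_hat, delta_hat

# Hypothetical point where only one source is active (W-disjoint orthogonality).
omega = 2.0 * math.pi * 0.05       # rad/sample
a_true, delta_true = 0.93, 0.8     # attenuation and delay of the active source
S = complex(0.4, -1.1)             # source coefficient at this point

X1 = S
X2 = a_true * cmath.exp(-1j * omega * delta_true) * S

a_hat, delta_hat = duet_estimate(X1, X2, omega)
```

Here `a_hat` and `delta_hat` reproduce the true values to rounding error because exactly one source is active at the point. In the algorithm, the estimates from all points are collected in the two-dimensional amplitude-delay histogram.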

Page 34: Audio Segregation

34

DUET Algorithm (3)

• 2D Histogram of amplitude-delay estimates from two mixtures of five sources

♦ Amplitude parameters

( .98, 1.02, .93, 1.06, .93)

♦ Delay parameters

( .3, -.2, .8, -.7, -.2)

Page 35: Audio Segregation

35

DUET Algorithm (4)

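• If the j-th source is active,

$$ a_j e^{-i\omega\delta_j} X_1(\tau,\omega) - X_2(\tau,\omega) = 0 $$

• Cost function

$$ \rho_j(\tau,\omega) = \rho(\hat{a}_j, \hat{\delta}_j, \tau, \omega) = \frac{1}{1 + \hat{a}_j^2} \left| \hat{a}_j e^{-i\omega\hat{\delta}_j} X_1(\tau,\omega) - X_2(\tau,\omega) \right|^2 $$

$$ \min\left( \rho_1(\tau,\omega), \ldots, \rho_N(\tau,\omega) \right) $$

• Parameter estimation
• Stochastic gradient descent algorithm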

Page 36: Audio Segregation

36

DUET Algorithm (5)

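• Mask

$$ \Omega_j(\tau,\omega) = \begin{cases} 1, & \rho(\hat{a}_j, \hat{\delta}_j, \tau, \omega) \le \rho(\hat{a}_m, \hat{\delta}_m, \tau, \omega) \;\; \forall m \ne j \\ 0, & \text{otherwise} \end{cases} $$

• Demixing

$$ \hat{S}_j(\tau,\omega) = \Omega_j(\tau,\omega)\, X_1(\tau,\omega) $$

[Figure: example binary masks (grids of 1s and 0s) for s1 and s2]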
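The mask and demixing steps can be sketched for one time-frequency point as follows. The cost is assumed to be ρ = |a e^{-iωδ} X1 - X2|² / (1 + a²), the parameter pairs reuse the five amplitude and delay values listed on the histogram slide, and `omega` and `S` are hypothetical.

```python
import cmath

def rho(a, delta, X1, X2, omega):
    """Cost of assigning one time-frequency point to a source with parameters (a, delta)."""
    return abs(a * cmath.exp(-1j * omega * delta) * X1 - X2) ** 2 / (1.0 + a * a)

def binary_masks(params, X1, X2, omega):
    """Mask value per source at one point: 1 for the lowest-cost source, 0 for the rest."""
    costs = [rho(a, d, X1, X2, omega) for a, d in params]
    winner = costs.index(min(costs))
    return [1 if j == winner else 0 for j in range(len(params))]

# (amplitude, delay) pairs as listed on the histogram slide.
params = [(0.98, 0.3), (1.02, -0.2), (0.93, 0.8), (1.06, -0.7), (0.93, -0.2)]

# Hypothetical point dominated by the fourth source.
omega = 1.0
S = complex(0.7, 0.2)
X1 = S
X2 = 1.06 * cmath.exp(-1j * omega * (-0.7)) * S

masks = binary_masks(params, X1, X2, omega)
S_hat = [m * X1 for m in masks]   # demixing: masked copy of X1 per source
```

Only the fourth source receives the point, so `masks` is `[0, 0, 0, 1, 0]` and its estimate equals `X1` there.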

Page 37: Audio Segregation

37

Target Speech Enhancement

• In many practical applications,
  - We need to estimate a signal from a target source.
• The target source
  - Frequently, its approximate direction can be expected in advance.
  - Strong utterance in a noisy environment

Page 38: Audio Segregation

38

Proposed Method (1)

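• Continuously variable mask
  - [Equation: Ω_j^c(τ,ω), computed from the estimates â_j and δ̂_j and the mixtures X_1(τ,ω) and X_2(τ,ω); the exact expression is not recoverable from the extracted text]
• Real mask
  - [Equation: Ω_j^r(τ,ω), a ratio involving X_1^j(τ,ω) and X_1(τ,ω); the exact expression is not recoverable from the extracted text]

[Figure: continuously variable mask compared with the real mask]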

Page 39: Audio Segregation

39

Proposed Method (2)

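• Determine a threshold.
  - Using a top ranking

$$ Th(p) = \text{top } p\% \text{ of } \Omega_j^c(\tau,\omega), \quad p = 35 $$

• Binary mask using a threshold

$$ \Omega_j^b(\tau,\omega,Th(p)) = \begin{cases} 1, & \Omega_j^c(\tau,\omega) \ge Th(p) \\ 0, & \text{otherwise} \end{cases} $$

[Figure: real mask compared with the binary mask]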
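The thresholding step can be sketched as follows, assuming that "top ranking" means keeping the time-frequency points whose continuous mask values fall in the top p percent. The mask values are hypothetical, and rounding the count up with `math.ceil` is one possible convention.

```python
import math

def top_percent_threshold(values, p):
    """Smallest value that still belongs to the top p percent of `values`."""
    ranked = sorted(values, reverse=True)
    k = max(1, math.ceil(len(ranked) * p / 100.0))  # number of points to keep
    return ranked[k - 1]

def binary_mask(cont_mask, p=35.0):
    """Binary mask: 1 where the continuous mask reaches the top-p threshold."""
    th = top_percent_threshold(cont_mask, p)
    return [1 if v >= th else 0 for v in cont_mask]

# Hypothetical continuous mask values at ten time-frequency points.
cont = [0.91, 0.12, 0.55, 0.78, 0.30, 0.05, 0.66, 0.49, 0.83, 0.21]
mask = binary_mask(cont, 35.0)
```

With ten points and p = 35, four points are kept, the threshold is 0.66, and `mask` is `[1, 0, 0, 1, 0, 0, 1, 0, 1, 0]`.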

Page 40: Audio Segregation

40

Proposed Method (3)

• Overall procedure

[Figure: x1(t), x2(t) → STFT → initial continuous mask → thresholding → attenuation-delay histogram → initializing attenuation-delay parameters → learning attenuation-delay parameters → continuous mask → thresholding → binary mask → ISTFT → s_j(t)]

• Overall procedure of the DUET algorithm

[Figure: x1(t), x2(t) → STFT → attenuation-delay histogram → initializing attenuation-delay parameters → learning attenuation-delay parameters → comparing likelihoods → binary mask → ISTFT → s_j(t)]

Page 41: Audio Segregation

41

Experimental Setup (1)

• Number of sources: 2 (1 target source and 1 noise source)
• Input SIR: 5 dB
• Simulated mixing in an anechoic environment
• Source signals
  - 10-second-long speech signals uttered by 4 males and 4 females in the TIMIT database
• Microphones
  - Spacing: 2 cm
• Angle differences between two sources
  - 30˚, 60˚, 90˚, 120˚, 150˚, and 180˚

[Figure: Mic1 and Mic2 with source positions at a distance of 50 cm and angles of 80˚, 50˚, 20˚, -10˚, -40˚, -70˚, and -100˚]

Page 42: Audio Segregation

42

Experimental Results (1)

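[Figure: charts comparing the proposed method with DUET over the angle differences. Data labels in extraction order, with series and axis assignments not recoverable: 9.17, 10.99, 13.61, 13.77, 13.79, 13.82; 96.26, 95.38, 97.24, 98.35, 98.34, 98.26, 96.26; 20.63, 21.09, 22.03, 21.96, 22.03, 22.11; 89.51, 89.53, 89.55, 89.55, 89.87, 89.79]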

Page 43: Audio Segregation

43

Experimental Setup (2)

• Number of sources: 2 (1 target source and 1 noise source)
• Input SIR: 5 dB
• Real recorded mixtures in a normal office room
• Source signals
  - 10-second-long speech signals uttered by 3 male and 3 female speakers in the TIMIT database
• Microphones
  - Spacing: 2 cm
• Angle differences between two sources
  - 30˚, 60˚, 90˚, 120˚, 150˚, and 180˚

[Figure: Mic1 and Mic2 with source positions at a distance of 50 cm and angles of 90˚, 60˚, 30˚, 0˚, -30˚, -60˚, and -90˚]

Page 44: Audio Segregation

44

Experimental Results (2)

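[Figure: charts comparing the proposed method with DUET over the angle differences. Data labels in extraction order, with series and axis assignments not recoverable: 76.67, 81.77, 85.81, 96.93, 97.14, 97.03; 6.29, 10.02, 11.49, 12.70, 12.46, 12.57; 72.27, 80.02, 83.90, 83.37, 84.67, 85.14; 5.51, 10.99, 14.88, 15.93, 15.92, 15.99]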

Page 45: Audio Segregation

45

Discussion on DUET

• DUET (Degenerate Unmixing and Estimation Technique)
  - Simple
  - The number of sources must be known in advance.
  - Estimates the attenuation and delay parameters for all sources.
• Described target speech enhancement technique
  - Estimates the parameters for only one target source
  - Much faster convergence of all the required parameters
• Not robust to reverberation

Page 46: Audio Segregation

46

Zero-Crossing-Based Binaural Processing

Page 47: Audio Segregation

47

Binaural Processing

• Auditory scene analysis (ASA)
  - Spatial cues: ITD, IID
  - Others: harmonics, pitch, on-set, etc.
• Conventional methods
  - Inter-aural cross-correlation
  - Binary mask (all-or-none)
• Developed method
  - Inter-aural zero-crossing difference
  - Continuously variable mask

target noise

Page 48: Audio Segregation

48

Jeffress’ Model

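[Figure: Jeffress' model. The left-ear signal l(m,n) and the right-ear signal r(m,n) run along opposing delay lines; multiplication followed by running integration at each tap gives the running interaural cross-correlation as a function of (m,n)]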

Page 49: Audio Segregation

49

Source Localization Based on Cross-Correlation

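• Signal model for the sensor outputs

$$ x_i(t) = \sum_j h_{ij}(t) * s_j(t) + n_i(t) $$

$$ X_i(\omega) = \sum_j H_{ij}(\omega)\, S_j(\omega) + N_i(\omega) $$

• ITD estimation based on generalized cross-correlation

$$ R_{ik}(\tau) = \int W(\omega)\, X_i(\omega)\, X_k^*(\omega)\, e^{j\omega\tau}\, d\omega $$

$$ \hat{\tau}_{ik} = \arg\max_{\tau \in D} R_{ik}(\tau) $$

• Phase transform (PHAT)

$$ W_{PHAT}(\omega) = \left| X_i(\omega)\, X_k^*(\omega) \right|^{-1} $$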
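Generalized cross-correlation with the PHAT weighting can be sketched in discrete form, replacing the integral by a sum over DFT bins and using a naive DFT to stay dependency-free. The test signal is hypothetical: a random sequence and a circularly delayed copy, so the search over integer lags should return the delay exactly.

```python
import cmath
import math
import random

def dft(x):
    """Naive discrete Fourier transform (adequate for short frames)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def gcc_phat_delay(xi, xk, max_lag):
    """Integer lag maximizing the PHAT-weighted cross-correlation R_ik.

    A positive result means xi is delayed relative to xk.
    """
    n = len(xi)
    cross = [a * b.conjugate() for a, b in zip(dft(xi), dft(xk))]
    # PHAT weighting: keep only the phase of the cross-spectrum.
    weighted = [c / abs(c) if abs(c) > 1e-12 else 0.0 for c in cross]
    best_lag, best_val = 0, -float("inf")
    for lag in range(-max_lag, max_lag + 1):
        r = sum(w * cmath.exp(2j * math.pi * k * lag / n)
                for k, w in enumerate(weighted)).real
        if r > best_val:
            best_lag, best_val = lag, r
    return best_lag

# Hypothetical test frame: xi is xk circularly delayed by 3 samples.
rng = random.Random(0)
n, true_delay = 32, 3
xk = [rng.gauss(0.0, 1.0) for _ in range(n)]
xi = [xk[(t - true_delay) % n] for t in range(n)]

estimated_delay = gcc_phat_delay(xi, xk, max_lag=8)
```

Because PHAT discards the magnitude, the weighted correlation of a pure delay is a single sharp peak at the true lag. Real recordings need framing, zero-padding, and sub-sample interpolation, which this sketch omits.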

Page 50: Audio Segregation

50

Finding Zero-Crossings

[Figure: waveforms from two microphones; the time offset between corresponding zero crossings is the ITD]

Page 51: Audio Segregation

51

Noise Robustness of the Zero-Crossing-Based Method

Y.-I. Kim and R. M. Kil, “Estimation of Interaural Time Differences Based on Zero-Crossings in Noisy Multisource Environments,” IEEE Trans. ASLP, vol.15, no. 2, 2007.

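[Figure: results at 5-dB SNR]

[Equation: band-wise estimate SNR_i(j), a piecewise expression of the form 10 log10(·) that is set to 0 otherwise; the remaining terms are not recoverable from the extracted text]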

Page 52: Audio Segregation

52

Application to Source Localization

• Four sources located at azimuth angles of -10°, 0°, 10°, and 40°

Page 53: Audio Segregation

53

Speech Segregation

• Cross-correlation-based ITD estimation
  - i: band (frequency) index
  - j: frame (time) index
  - τ: time lag
  - M: frame length
  - T: frame shift

[Figure: left (L) and right (R) time-frequency representations indexed by band i and frame j]

Page 54: Audio Segregation

54

Overall Procedure

[Figure: block diagram with these blocks: input signals 1 and 2, Gammatone filterbank (BPF1, BPF2, ..., BPFN for each input), subband signals, ITD estimation using ZCs (per band), scaling factor estimation (per band), amplitude scaling, time reverse, and the enhanced signal]

Page 55: Audio Segregation

55

Amplitude Scaling

[Figure: scaling factor s(ITD) compared with the actual scale factor; example values 0.9, 0.7, 0.2, 0.8]

Page 56: Audio Segregation

56

Relationship between Zero-Crossing-Based ITDs and the SNRs

• Band-pass signals from two microphones

• The mean of the estimated ITDs, , can be approximated by

where

Page 57: Audio Segregation

57

Zero Crossing vs. Cross-Correlation (1)

• Relative strength
• Criterion
  - Confidence of the conversion between the relative strength and an ITD
  - Measure the sample standard deviation of ITDs for each relative strength

Page 58: Audio Segregation

58

Zero Crossing vs. Cross-Correlation (2)

• Estimate the sample standard deviation by simulation
  - 100,000 randomly generated samples
  - Parameters
    Phase: uniform distribution on [interval not recoverable]
    Frequency: uniform distribution on [interval not recoverable]

Page 59: Audio Segregation

59

Zero Crossing vs. Cross-Correlation (3)

• ITDs normalized by [symbol not recoverable]

[Figure: ZC-based ITDs and CC-based ITDs for the lowest, middle, and highest frequency bands]

Page 60: Audio Segregation

60

Recognition Experimental Setup

• Recognizer
  - The CMU SPHINX-III speech recognition system
  - Fully-continuous hidden Markov models
• Database
  - The DARPA Resource Management database
  - Training data: 2,880 sentences
  - Test data: 600 sentences
  - The target and interfering speech were combined with different delays from sensor to sensor (SIR = 0 dB).
• Feature
  - 13th-order mel-frequency cepstral coefficients

Page 61: Audio Segregation

61

Recognition Results (1)

• Word error rates (WERs) (%)
  - Added white Gaussian noise (SNR: 20 dB)

[Figure: bar chart of WERs (0 to 100%) for no proc., CC-binary, ZC-binary, CC-cont., ZC-cont., and target alone; legend entries: no, identical, independent]

Page 62: Audio Segregation

62

Recognition Results (2)

• Word error rates (WERs) (%)

Page 63: Audio Segregation

63

What if There is Reverberation?

[Figure: target and noise sources in a room, showing the direct path and the echoic path]

Page 64: Audio Segregation

64

Jeffress’ Model

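[Figure: Jeffress' model. The left-ear signal l(m,n) and the right-ear signal r(m,n) run along opposing delay lines; multiplication followed by running integration at each tap gives the running interaural cross-correlation as a function of (m,n)]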

Page 65: Audio Segregation

65

Lindemann’s Model (1)

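[Figure: Lindemann's model. The left-ear signal l(m,n) and the right-ear signal r(m,n) run along opposing delay lines with inhibitory components i_l(m,n) and i_r(m,n); the products k(m,n) form the inhibited interaural cross-correlation as a function of (m,n)]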

Page 66: Audio Segregation

66

Lindemann’s Model (2)

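[Figure: one stage of the delay lines with signals l(m,n) and r(m,n), inhibitory components i_l(m,n) and i_r(m,n), and the product k(m,n)]

• Multiplication

$$ k(m,n) = l(m,n)\, r(m,n) $$

• Inhibition (stationary inhibition and dynamic inhibition)

$$ r(m-1, n+1) = r(m,n)\, [1 - c_s\, l(m,n)]\, (1 - c_d\, \Phi\{k(m,n)\}) $$

$$ l(m+1, n+1) = l(m,n)\, [1 - c_s\, r(m,n)]\, (1 - c_d\, \Phi\{k(m,n)\}) $$

$$ \Phi\{x(n)\} = x(n-1) + \Phi\{x(n-1)\}\, e^{-T_d/T_{inh}}\, [1 - x(n-1)] $$

• Monaural sensitivities

$$ r'(m,n) = r(m,n)\, [1 - w_r(m)] + w_r(m) $$

$$ l'(m,n) = l(m,n)\, [1 - w_l(m)] + w_l(m) $$

$$ w_l(m) = w_r(-m) = w_f\, e^{(m-M)/M_f} $$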

Page 67: Audio Segregation

67

Simulation (1)

• Input signal
  - Gammatone filterbank impulse response with 1 kHz center frequency
  - Half-wave rectified
  - Low-pass filtered (cf = 1.6 kHz)

[Figure: direct sound and reflected sound in the left and right channels, with time intervals t_i and t_d marked; t_d = 0.6 ms]

Page 68: Audio Segregation

68

Simulation (2)

• Input signal
  - t_i = 10 ms

[Figure: outputs of Jeffress' model and Lindemann's model]

Page 69: Audio Segregation

69

Onset Detection

• Onset intervals
  - Dominantly contain direct-path components

[Figure: room impulse response showing the direct path, early reflection, and late reflection; target and noise sources]

Page 70: Audio Segregation

70

Palomäki’s Model of the Precedence Effect

[Figure: target and noise signals pass through the i-th BPF in each channel; in each channel the envelope e(t) and the inhibition h(t) give an inhibited envelope, and cross-correlation of the two channels gives the ITD output]

Page 71: Audio Segregation

71

Filter to Generate the Inhibition from an Envelope

• Low-pass filter

$$ h_{lp}(n) = A\, n \exp\left( -\frac{n}{\alpha} \right) $$

  - A is chosen to give a unity gain at DC.
  - α is a time constant: α = 15 ms × F_s (F_s = 16000 Hz)
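The filter can be generated numerically under two assumptions: the impulse response is read as h_lp(n) = A n exp(-n/α), and the time constant is 15 ms at F_s = 16000 Hz, that is 240 samples. The gain A follows from the closed form of the series Σ n qⁿ = q/(1-q)² with q = exp(-1/α). The truncation length of 20 time constants is an arbitrary choice.

```python
import math

def inhibition_filter(alpha, length):
    """h(n) = A * n * exp(-n / alpha), with A giving unity gain at DC."""
    q = math.exp(-1.0 / alpha)
    A = (1.0 - q) ** 2 / q   # since sum over n >= 0 of n * q**n equals q / (1 - q)**2
    return [A * n * math.exp(-n / alpha) for n in range(length)]

fs = 16000                  # sampling rate in Hz
alpha = (15 * fs) // 1000   # 15 ms expressed in samples (240)

# Truncate after 20 time constants; the neglected tail is negligible.
h = inhibition_filter(alpha, 20 * alpha)
dc_gain = sum(h)            # close to 1
peak_index = h.index(max(h))
```

The response peaks at n = α, 15 ms after an envelope sample. This is the interval in which early reflections tend to arrive.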

Page 72: Audio Segregation

72

Envelope and Inhibition

• Envelope: blue line, Inhibition: red line (cf = 1,037Hz)

Page 73: Audio Segregation

73

Source Localization in Reverberant Environments (1)

• Energy-based onset detection
  - Simple (small computation)
  - Not robust to parameters and environments

[Figure: smoothed envelope and onset detection]

Page 74: Audio Segregation

74

Source Localization in Reverberant Environments (2)

• Echo-free onset detection
  - Detection by comparing the sound-to-echo ratio with a threshold
  - More robust to parameters

[Figure: amplitude level versus time for the observed sounds, showing the possible echo at time n caused by the preceding sound at time n_p, the maximum possible echo, the total estimated echoes, and the threshold Th_efo]

Page 75: Audio Segregation

75

Source Localization in Reverberant Environments (3)

[Figure: block diagram with these blocks: input signals 1 and 2, Gammatone filterbank (BPF1, BPF2, ..., BPFN for each input), and for each band a waveform path to ITD estimation based on zero crossing, an envelope path to onset detection, SNR estimation, reliable ITD sample selection, and angle conversion; the band outputs feed a weighted histogram for source localization]

Page 76: Audio Segregation

76

Experimental Setup (1)

• Recording rooms

  - Moderately reverberant room (a normal office room)
  - Higher reverberant room (a bathroom)
• Height of both rooms: 3 m
• Height of speakers and mics: 1.5 m

Page 77: Audio Segregation

77

Experimental Results (1)

[Figure: results for the moderately reverberant room and the higher reverberant room, comparing the described method, energy-based onset detection, and echo-free onset detection]

Page 78: Audio Segregation

78

Experimental Setup (2)

• Simulated mixing environment

• Height of both rooms: 3 m
• Height of speakers and mics: 1.1 m
• 30-dB SNR observations by adding white Gaussian noise
• 320 utterances by 16 speakers from the TI-DIGIT database

[Figure: room layout with mic. 1, mic. 2, and speakers at 30° and 60°; labeled dimensions 43 mm, 4.0 m, 5.0 m, 2.0 m, 1.0 m, and 3.0 m]

Page 79: Audio Segregation

79

Experimental Results (2)

• Rates of localizations where errors of estimated angles were less than 3°

Page 80: Audio Segregation

80

Discussion on Binaural Processing

• Describe a method that enhances speech by estimating continuously variable masking weights
• Estimation of ITDs from zero crossings
  - More reliable than that from cross-correlation
• Continuously variable mask
  - Estimate relative target intensity in the t-f domain
  - Better accuracy than binary mask
• Reverberation
  - Precedence effect
  - Onset detection and SNR estimation

Page 81: Audio Segregation

81

Thank you very much.

Page 82: Audio Segregation

82

Multi-rate Systems

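• Decimation and expansion

$$ Y_D(e^{j\omega}) = \frac{1}{M} \sum_{m=0}^{M-1} X\left( e^{j(\omega - 2\pi m)/M} \right) $$

$$ V(e^{j\omega}) = Y_D\left( e^{j\omega L} \right) $$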

Page 83: Audio Segregation

83

Filter Banks (1)

• Multirate System

Page 84: Audio Segregation

84

Filter Banks (2)

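• In the z-domain ( $W = e^{-j2\pi/M}$ ),

$$ V_k(z) = \frac{1}{M} \sum_{m=0}^{M-1} X_k\left( z^{1/M} W^m \right) = \frac{1}{M} \sum_{m=0}^{M-1} H_k\left( z^{1/M} W^m \right) X\left( z^{1/M} W^m \right) $$

$$ U_k(z) = V_k(z^M) = \frac{1}{M} \sum_{m=0}^{M-1} H_k(zW^m)\, X(zW^m) $$

$$ \hat{X}(z) = \sum_{k=0}^{K-1} F_k(z)\, U_k(z) = \frac{1}{M} \sum_{m=0}^{M-1} X(zW^m) \sum_{k=0}^{K-1} H_k(zW^m)\, F_k(z) $$

• Perfect reconstruction system

$$ \hat{X}(z) = c\, z^{-n_0} X(z), \quad c \ne 0 $$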

Page 85: Audio Segregation

85

Polyphase Representation (1)

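• Analysis filter (Type 1 polyphase)

$$ H_k(z) = \sum_n h_k(n) z^{-n} = \sum_n h_k(2n) z^{-2n} + z^{-1} \sum_n h_k(2n+1) z^{-2n} $$

$$ H_k(z) = \sum_{m=0}^{M-1} z^{-m} E_{km}(z^M), \quad \text{where } E_{km}(z) = \sum_n h_k(nM + m)\, z^{-n} $$

• Using matrix notations,

$$ \mathbf{E}(z) = \begin{bmatrix} E_{00}(z) & E_{01}(z) & \cdots & E_{0,M-1}(z) \\ E_{10}(z) & E_{11}(z) & \cdots & E_{1,M-1}(z) \\ \vdots & \vdots & & \vdots \\ E_{K-1,0}(z) & E_{K-1,1}(z) & \cdots & E_{K-1,M-1}(z) \end{bmatrix}, \qquad \mathbf{h}(z) = \begin{bmatrix} H_0(z) \\ H_1(z) \\ \vdots \\ H_{K-1}(z) \end{bmatrix} $$

$$ \mathbf{h}(z) = \mathbf{E}(z^M) \begin{bmatrix} 1 & z^{-1} & \cdots & z^{-(M-1)} \end{bmatrix}^T $$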

Page 86: Audio Segregation

86

Polyphase Representation (2)

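• Synthesis filter (Type 2 polyphase)

$$ F_k(z) = \sum_n f_k(n) z^{-n} = z^{-1} \sum_n f_k(2n+1) z^{-2n} + \sum_n f_k(2n) z^{-2n} $$

$$ F_k(z) = \sum_{m=0}^{M-1} z^{-(M-1-m)} R_{mk}(z^M), \quad \text{where } R_{mk}(z) = \sum_n f_k(nM + M - 1 - m)\, z^{-n} $$

• Using matrix notations,

$$ \mathbf{R}(z) = \begin{bmatrix} R_{00}(z) & R_{01}(z) & \cdots & R_{0,K-1}(z) \\ R_{10}(z) & R_{11}(z) & \cdots & R_{1,K-1}(z) \\ \vdots & \vdots & & \vdots \\ R_{M-1,0}(z) & R_{M-1,1}(z) & \cdots & R_{M-1,K-1}(z) \end{bmatrix}, \qquad \mathbf{f}(z) = \begin{bmatrix} F_0(z) \\ F_1(z) \\ \vdots \\ F_{K-1}(z) \end{bmatrix} $$

$$ \mathbf{f}^T(z) = \begin{bmatrix} z^{-(M-1)} & z^{-(M-2)} & \cdots & 1 \end{bmatrix} \mathbf{R}(z^M) $$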

Page 87: Audio Segregation

87

Polyphase Representation (3)

• Polyphase representation

• Rearrangement using noble identities

Page 88: Audio Segregation

88

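Paraunitary Property for Perfect Reconstruction

• Paraunitary property

$$ \tilde{\mathbf{E}}(z)\, \mathbf{E}(z) = d\, \mathbf{I}, \quad d > 0 $$

  - $\tilde{\mathbf{E}}(z)$: $\mathbf{E}(z)$ transposed with its entries complex-conjugated and time-reversed
• Perfect reconstruction condition

$$ \mathbf{R}(z) = c\, z^{-l}\, \tilde{\mathbf{E}}(z), \quad c \ne 0 $$

$$ f_k(n) = c\, h_k^*(L - n), \quad 0 \le k \le K-1 $$

$$ F_k(z) = c\, z^{-L}\, \tilde{H}_k(z), \quad 0 \le k \le K-1, \quad \text{where } L = Ml + M - 1 $$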

Page 89: Audio Segregation

89

Critically Sampled Filter Banks (1)

• Overall system

• Critically sampled filter banks: K = M

Page 90: Audio Segregation

90

Critically Sampled Filter Banks (2)

• NotationsTMzWXzWXzXz )]()()([)( 1 x

T

MK

MM

K

K

zWHzWHzWH

zWHzWHzWH

zHzHzH

z

)()()(

)()()(

)()()(

)(

11

11

10

110

110

H

)]}()()({[diag)( 1 MzWSzWSzSz S

)()()(1

)]()()([)( /1/1/1110

MMMTM zzz

MzYzYzYz xSHy

)()()(1

)](ˆ)(ˆ)(ˆ[)(ˆ /1/1110

MMTM zzz

MzYzYzYz xHWy

Page 91: Audio Segregation

91

Critically Sampled Filter Banks (3)

• The subband error signals are zero if

• Two subbands scheme (Gilloire and Vetterli, ’92) Assume the classical QMF filters

Therefore,

)()()()( zzzzM SHHW

)()(

)()(

)(det

1)(1

zHzH

zHzH

zz

HH

)()(

)()(

1

0

zHzH

zHzH

)()(

)()()(

zHzH

zHzHzH

Page 92: Audio Segregation

92

Critically Sampled Filter Banks (4)

• Adaptive filters

• is diagonal only if or

)()()()()]()()[()(

)]()()[()()()()()(

)(det

1

)()()()(

22

22

12

zSzHzSzHzSzSzHzH

zSzSzHzHzSzHzSzH

z

zzzz

H

HSHW

)( 2zW

0)()( zHzH 0)()( zSzS

A general physical system (X)

0)()( zHzHPR (X)

Page 93: Audio Segregation

93

Critically Sampled Filter Banks (5)

• Multiband scheme Assumptions

Adaptive filters (require the use of cross filters)

Slow convergence and performance degradation

1||,0)()( jizHzH ji )()( ii zWHzH

)(00)(

0)()(0

0)()()(

)(0)()(

)(

1,10,1

2221

121110

1,00100

zWzW

zWzW

zWzWzW

zWzWzW

z

MMM

M

M

W

where

Page 94: Audio Segregation

94

Oversampled Filter Banks (1)

• K > M such that

$$ H_i(z)\, H_j(z) = 0, \quad i \ne j, \qquad \text{where } H_i(z) = H(zW^i) $$

• Non-critical decimation can avoid the aliasing problem.
  - The redundancy provides enough information for successful adaptation in every band.
  - Diagonal adaptive filter matrix

Page 95: Audio Segregation

95

Oversampled Filter Banks (2)

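• Recall

$$ F_k(z) = c\, z^{-L}\, \tilde{H}_k(z), \quad 0 \le k \le K-1 $$

• For perfect reconstruction,

$$ \hat{X}(z) = \frac{1}{M} \sum_{m=0}^{M-1} X(zW^m) \sum_{k=0}^{K-1} H_k(zW^m)\, F_k(z) $$

• Remove the aliasing terms

$$ \hat{X}(z) = \frac{1}{M} X(z) \sum_{k=0}^{K-1} H_k(z)\, F_k(z) $$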

Page 96: Audio Segregation

96

Oversampled Filter Banks (3)

• Analysis filters from a real-valued linear phase prototype filter by generalized DFT

$$ h_k(n) = q(n)\, e^{j\frac{2\pi}{K}(k + k_0)(n + n_0)}, \quad k = 0, 1, \ldots, K-1, \quad n = 0, 1, \ldots, L_q - 1 $$

  - To cover the frequency range $[0; \pi]$ by exactly $K/2$ subbands: $k_0 = \frac{1}{2}$
  - For the linear phase property: $n_0 = -\frac{L_q - 1}{2}$
• Synthesis filters

$$ f_k(n) = \tilde{h}_k(n) = h_k^*(L_q - 1 - n) = h_k(n) $$

• All filters can be derived from one prototype filter.

Page 97: Audio Segregation

97

Design of Oversampled Filter Banks (1)

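• Cost function
  - Combination of filter bank reconstruction error and stopband energy of the analysis filters
• Impulse response of the overall filter bank system

$$ t(n) = \frac{1}{M} \sum_{k=0}^{K-1} h_k(n) * f_k(n) $$

• Using matrix notation,

$$ \mathbf{t}_k = \mathbf{H}_k \mathbf{f}_k: \qquad \begin{bmatrix} t_k(0) \\ t_k(1) \\ \vdots \\ t_k(2L_q - 2) \end{bmatrix} = \begin{bmatrix} h_k(0) & 0 & \cdots & 0 \\ h_k(1) & h_k(0) & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & \cdots & 0 & h_k(L_q - 1) \end{bmatrix} \begin{bmatrix} f_k(0) \\ f_k(1) \\ \vdots \\ f_k(L_q - 1) \end{bmatrix} $$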

Page 98: Audio Segregation

98

Design of Oversampled Filter Banks (2)

• Impulse response of the overall filter bank system

• Measure of the reconstruction error

• Measure for the energy contained in the stopband

fHfffHHHt TT

KTT

KM 110110

1

2

1 ))1(( qLnδt

2

2

11

00

2

)1(

)1(

)0(

))1(cos()1cos(1

))1(cos()1cos(1

))1(cos()1cos(1

kk

qk

k

k

qNN

q

q

k

Lh

h

h

L

L

L

hP

Page 99: Audio Segregation

99

Design of Oversampled Filter Banks (3)

• To enforce linear phase filters,

Therefore, , where .

• Cost function

where

qLJI

q

T

qT

LL

Tq

Lqqq

Lqqq

qq)12/()1()0(

)1()1()0(

2/2/

TTK

TT110 hhhhf

Tqkkkk Lhhh )1()1()0( h

2

1

02

21

))1((1

0

δf

P

H qK

kk

LnM

},,,{diag 110 KPPPP

Page 100: Audio Segregation

100

Design of Oversampled Filter Banks (4)

• GDFT

• Cost function

• Iterative least-squares design algorithm• Initialize• Minimize with respect to• Apply relaxation

,qMh kk

21

0

))1((1

0

δq

LP

LMH

q

q

K

kkk

LnM

where are diagonal matrices with transform factorskM

)(iq

)(iq

)1()1()()( iii qqq

Page 101: Audio Segregation

101

ITD Using Zero Crossings (1)

• Two microphone signals
  - Ignore attenuation between microphones because of closeness

$$ x_1(t) = A\cos(w_1 t) + a\cos(w_2 t) $$

$$ x_2(t) = A\cos(w_1(t - d_1)) + a\cos(w_2(t - d_2)) $$

• Assume $x_1(t_1) = 0$ and $x_2(t_1 + \Delta) = 0$

$$ x_2(t_1 + \Delta) = A\cos(w_1 t_1)\cos(w_1(\Delta - d_1)) - A\sin(w_1 t_1)\sin(w_1(\Delta - d_1)) + a\cos(w_2 t_1)\cos(w_2(\Delta - d_2)) - a\sin(w_2 t_1)\sin(w_2(\Delta - d_2)) $$

Page 102: Audio Segregation

102

ITD Using Zero Crossings (2)

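• Since $w_i(\Delta - d_j) \ll 1$, $i, j \in \{1, 2\}$,

$$ \cos(w_i(\Delta - d_j)) \approx 1, \qquad \sin(w_i(\Delta - d_j)) \approx w_i(\Delta - d_j) $$

• Therefore, using $x_1(t_1) = 0$,

$$ x_2(t_1 + \Delta) \approx -A\sin(w_1 t_1)\, w_1(\Delta - d_1) - a\sin(w_2 t_1)\, w_2(\Delta - d_2) $$

• Since $x_2(t_1 + \Delta) = 0$,

$$ \Delta = \frac{A w_1 \sin(w_1 t_1)\, d_1 + a w_2 \sin(w_2 t_1)\, d_2}{A w_1 \sin(w_1 t_1) + a w_2 \sin(w_2 t_1)} $$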

Page 103: Audio Segregation

103

ITD Using Zero Crossings (3)

• Recall

)sin()(cos

)sin()(cos

)sin()))cos((sin(cos

)sin()))cos((sin(cos

12212222

1

2122112222

1

122121

1

21221121

1

twawtwaAw

dtwawdtwaAw

twawtwAa

Aw

dtwawdtwAa

Aw

)(cos)sin(

)(cos)sin(

11222

2111

211222

21111

twAawtwAw

dtwAawdtwAw

for aA

for aA

0)cos()cos()( 121111 twatwAtx

Page 104: Audio Segregation

104

ITD Using Zero Crossings (4)

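• Assume uniformly distributed frequencies over a narrow band and phases over the interval $(-\pi, \pi)$

$$ \bar{\Delta} \approx g(A, a)\, d_1 + (1 - g(A, a))\, d_2 $$

[Equation: g(A, a) is defined piecewise, with separate cases depending on the relation between A and a, in terms of $\tan^{-1}$ and combinations of $A^2$ and $a^2$; the full expression is not recoverable from the extracted text]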

Page 105: Audio Segregation

105

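ITD Using Cross-Correlation (1)

• Two microphone signals

$$ x_1(t) = A\cos(w_1 t) + a\cos(w_2 t) $$

$$ x_2(t) = A\cos(w_1(t - d_1)) + a\cos(w_2(t - d_2)) $$

• Cross-correlation

$$ c(\tau) = \frac{1}{2T} \int_{-T}^{T} x_1(t)\, x_2(t + \tau)\, dt $$

$$ x_1(t)\, x_2(t + \tau) = \frac{1}{2} \Big[ A^2 \big( \cos(2w_1 t + w_1(\tau - d_1)) + \cos(w_1(\tau - d_1)) \big) + Aa \big( \cos(w_1 t + w_2(t + \tau - d_2)) + \cos(w_1 t - w_2(t + \tau - d_2)) \big) + Aa \big( \cos(w_2 t + w_1(t + \tau - d_1)) + \cos(w_2 t - w_1(t + \tau - d_1)) \big) + a^2 \big( \cos(2w_2 t + w_2(\tau - d_2)) + \cos(w_2(\tau - d_2)) \big) \Big] $$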

Page 106: Audio Segregation

106

ITD Using Cross-Correlation (2)

• ITD at the maximum of cross-correlation,

$$ \frac{d\, c(\tau)}{d\tau} = 0 $$

[Equation: expansion of c(τ) and of the stationarity condition in terms of A, a, w_1, w_2, d_1, d_2, and T; not recoverable from the extracted text]

Page 107: Audio Segregation

107

Spatial Aliasing

• To avoid spatial aliasing
  - The delay between sensors should be smaller than half a period of the signal.
  - If we have a t-sample delay for a noisy signal, the alias-free condition is

$$ \frac{t}{F_s} < \frac{1}{2F} $$

  - 1-sample delay at 16 kHz sampling rate

$$ F < \frac{F_s}{2} = 8\ \text{kHz} $$

  - Center frequencies of all Gammatone filters are lower than 8 kHz.
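The alias-free condition, read here as t/F_s < 1/(2F), can be turned into a small calculation. The 1-sample case reproduces the 8 kHz bound. The second case is a hypothetical illustration that assumes a 2 cm microphone spacing and a sound speed of 343 m/s, neither of which is stated on this slide.

```python
def max_alias_free_frequency(delay_samples, fs):
    """Upper frequency bound from delay/fs < 1/(2F), i.e. F < fs / (2 * delay)."""
    return fs / (2.0 * delay_samples)

fs = 16000.0  # sampling rate in Hz

# Case from the slide: a delay of 1 sample between the sensors.
f_max_one_sample = max_alias_free_frequency(1.0, fs)   # 8000.0 Hz

# Hypothetical case: 2 cm spacing, sound arriving along the microphone axis.
spacing_m = 0.02
speed_of_sound = 343.0                                 # m/s, assumed
max_delay_samples = spacing_m / speed_of_sound * fs    # just under 1 sample
f_max_2cm = max_alias_free_frequency(max_delay_samples, fs)
```

With these assumed values the largest inter-sensor delay stays below one sample, so the bound of about 8.6 kHz lies above the 8 kHz Nyquist limit.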

Page 108: Audio Segregation

108

Close Microphones

• ITD estimation without phase ambiguity
  - The closest zero crossings provide the desired ITD value.
  - Do not need to estimate IIDs
• Easy to derive a relationship between the ITDs and the scaling factors
• Reduce search region of p2
• Reduce signal distortion
• Compact implementation
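The remark that the closest zero crossings give the ITD can be sketched as follows: upward zero crossings are located by linear interpolation in each channel, and each crossing in the first channel is paired with the nearest crossing in the second. The test tone, the delay, and the `max_itd` guard are hypothetical, and a real system would apply this per Gammatone band.

```python
import math

def upward_zero_crossings(x, fs):
    """Times (s) of negative-to-positive crossings, linearly interpolated."""
    times = []
    for n in range(len(x) - 1):
        if x[n] < 0.0 <= x[n + 1]:
            frac = -x[n] / (x[n + 1] - x[n])
            times.append((n + frac) / fs)
    return times

def zc_itd(x1, x2, fs, max_itd):
    """Mean offset from each crossing in x1 to the closest crossing in x2.

    Pairs farther apart than max_itd are dropped (e.g. at frame edges).
    Assumes at least one valid pair exists.
    """
    z1 = upward_zero_crossings(x1, fs)
    z2 = upward_zero_crossings(x2, fs)
    diffs = []
    for t1 in z1:
        d = min((t2 - t1 for t2 in z2), key=abs)
        if abs(d) <= max_itd:
            diffs.append(d)
    return sum(diffs) / len(diffs)

# Hypothetical narrow-band test: a 500 Hz tone delayed by 50 microseconds.
fs, f0, delay = 16000.0, 500.0, 50e-6
x1 = [math.sin(2 * math.pi * f0 * (k / fs) + 0.3) for k in range(640)]
x2 = [math.sin(2 * math.pi * f0 * (k / fs - delay) + 0.3) for k in range(640)]

itd = zc_itd(x1, x2, fs, max_itd=1e-4)
```

For this clean tone the estimate agrees with the 50 µs delay to within a small fraction of a sample period, even though the delay is below one sample.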

Page 109: Audio Segregation

109

Results on Source Segregation

• Target signal
• Without processing
• Scaling with the following factors
  - factor = [summation of CC values in the neighborhood of the ITD of the desired source (male speaker)] / [summation of CC values in the whole range]
  - factor = 1 if the peak value of CC is in the neighborhood of the ITD of the desired source; factor = 0 otherwise.

[Audio demo: Jeffress' model, Lindemann's model]

Page 110: Audio Segregation

110

Combining the Lindemann’s Model and the Precedence Effect

[Figure: target and noise signals pass through the i-th BPF in each channel into Jeffress' or Lindemann's model, gated by the on-set test e(t) > h(t) on the envelope e(t) and the inhibition h(t), giving enhanced speech]

• The model is operated only when e(t) > h(t). Otherwise, the model provides the previous factor.

Page 111: Audio Segregation

111

Results on Source Segregation for Reverberated Signal

• Target signal
  - With reverberation
  - Without reverberation (ideal solution)
• Without processing
• Scaling with the following factors
  - factor = 1 if the peak value of CC is in the neighborhood of the ITD of the desired source; factor = 0 otherwise.

[Audio demo: Jeffress' model and Lindemann's model; w/o on-set enh., res: 1/8000 sec, res: 1/48000 sec]

Page 112: Audio Segregation

112

Dereverberation

• Early reflections are especially problematic.
  - They affect the same frame as the direct sound wave.
• To remove early reflection components
  - Dereverberate the linear prediction (LP) residual of incoming speech
  - Filter estimation for nearly exponentially decaying reverberation, like a typical room impulse response
  - Corresponds to the inverse of the truncated auto-correlation

$$ h_{derev}(n) = \mathrm{IDFT}\left( 1 \,./\, \mathrm{DFT}\left( [\, 0 \cdots 0 \;\; c(0) \cdots c(R) \;\; 0 \cdots 0 \,] \right) \right) $$

Page 113: Audio Segregation

113

Dereverberation and Echo Suppression

[Figure: block diagram with these blocks: inputs 1 and 2, dereverberation (per input), Gammatone filterbank (per input), cross-correlation, inhibition, frame energies, frame energies with inhibition, mask estimation, and feature estimation]

Page 114: Audio Segregation

114

Simulated Room Impulse Response

• Mixing environments (reverberation time: 0.5 s)
  - Virtual room to simulate impulse responses

[Figure: virtual room with target, interference, and mics; labeled dimensions 2 m, 1.5 m, 1.1 m, 17 cm, and 1 m, and an angle of 30°]

Page 115: Audio Segregation

115

Recognition Results

• Word error rates (WERs) (%)
  - A: no processing
  - B: seg. + inhib.
  - C: derev. + seg. + inhib.
  - D: ideal masks

[Figure: bar chart of WERs (0 to 100%) for A, B, C, and D under the conditions (infinity, 0 sec), (10 dB, 0.3 sec), (10 dB, 0.5 sec), (0 dB, 0.3 sec), and (0 dB, 0.5 sec)]

