Audio Segregation


2010. 4. 26.

Hyung-Min Park


Contents

• Independent component analysis (ICA)
  Conventional methods for acoustic mixtures
  Filter bank approach to ICA
• Degenerate unmixing and estimation technique (DUET)
  Target speech enhancement
• Zero-crossing-based binaural processing
  Inter-aural time difference (ITD)
  Zero crossings vs. cross-correlation
  Continuously-variable mask vs. binary mask

Cocktail Party Problem

Independent Component Analysis

Blind Source Separation: A Demo

sources and the mixing environment

Independent Component Analysis

• Blind source separation
  Sensor signals
  Recover the original source signals without knowing how they are mixed
• ICA
  Assume sources are independent
  Estimate the unmixing system W from mixtures x(t)
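As a rough illustration of estimating W from x(t), here is a minimal sketch for instantaneous (non-convolutive) mixtures. The natural-gradient update with a tanh score function, the whitening step, the Laplacian test sources, and the mixing matrix A are all assumptions chosen for the example; the slides do not state which rule was used at this point.

```python
import numpy as np

def natural_gradient_ica(x, n_iter=500, lr=0.1):
    """Estimate an unmixing matrix for instantaneous mixtures x (channels x samples)."""
    n_ch, n_samp = x.shape
    # Whitening is an added preprocessing step (assumption) that eases adaptation.
    x = x - x.mean(axis=1, keepdims=True)
    eigval, eigvec = np.linalg.eigh(np.cov(x))
    whiten = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T
    z = whiten @ x
    W = np.eye(n_ch)
    for _ in range(n_iter):
        u = W @ z
        # Batch natural-gradient update; tanh is a score function that suits
        # super-Gaussian sources such as speech.
        W += lr * (np.eye(n_ch) - np.tanh(u) @ u.T / n_samp) @ W
    return W @ whiten  # unmixing matrix for the original (unwhitened) mixtures

rng = np.random.default_rng(0)
s = rng.laplace(size=(2, 5000))          # hypothetical independent sources
A = np.array([[1.0, 0.6], [0.5, 1.0]])   # hypothetical mixing matrix
x = A @ s
W = natural_gradient_ica(x)
G = np.abs(W @ A)                        # global system: ideally a scaled permutation
crosstalk = np.sort(G, axis=1)[:, 0] / np.sort(G, axis=1)[:, 1]
```

A small `crosstalk` value in each row indicates that each output is dominated by one source, up to the usual scaling and permutation ambiguity.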

Acoustic Mixtures

• Instantaneous mixtures

• Acoustic mixing environments
  Time delay
  Reverberation
  Convolutive mixing

sensors

sources
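A convolutive mixture can be sketched as each sensor observing the sum of every source filtered by its own impulse response. The function name and the short impulse responses below are hypothetical and stand in for real room responses.

```python
import numpy as np

def convolutive_mix(sources, filters):
    """sources: list of 1-D arrays; filters[i][j]: impulse response from source j to sensor i."""
    n_sens = len(filters)
    length = max(len(s) + len(filters[i][j]) - 1
                 for i in range(n_sens) for j, s in enumerate(sources))
    x = np.zeros((n_sens, length))
    for i in range(n_sens):
        for j, s in enumerate(sources):
            y = np.convolve(s, filters[i][j])  # delay and reverberation of one path
            x[i, :len(y)] += y
    return x

rng = np.random.default_rng(0)
s1, s2 = rng.standard_normal(100), rng.standard_normal(100)
# Hypothetical responses: a direct path, and a cross path with a 3-sample
# delay followed by a weaker reflection.
direct = np.array([1.0])
cross = np.array([0.0, 0.0, 0.0, 0.6, 0.0, 0.2])
x = convolutive_mix([s1, s2], [[direct, cross], [cross, direct]])
```

An instantaneous mixture is the special case in which every impulse response has a single tap.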

Time Domain Approach to ICA

• Feedback architecture

• Adaptation rules (Torkkola, 1996)

• Intensive computations and slow convergence

x2(n) u2(n)

Frequency Domain Approach to ICA (1)

• In the frequency domain

• Complex score function

• Adaptation rule (Smaragdis, 1998)

[Figure: block diagram with Short-Time Fourier Transform and Inverse Short-Time Fourier Transform stages]

Frequency Domain Approach to ICA (2)

• Performance limitation
  Contradiction between long reverberation covering and insufficient learning data
    Long reverberation → long frame size
    Small number of frames → insufficient input data
  Mixtures combined from different time ranges of sources
    Delayed mixtures
[Figure: kth blocks of the mixtures; x1(n) contains s1(n) and s2(n-d1), and the other mixture contains s1(n-d2) and s2(n), with delays d1 and d2]

Design of a Filter Bank

• Filter bank design
  Frequency response of analysis filters
  Uniform sixteen-channel filter bank
  Decimation factor: 10
  Filter length: 220 taps

Filter Bank Approach to ICA (1)

• 2x2 network for the filter bank approach to ICA

[Figure: ICA network W1(z), ICA network W2(z), ..., ICA network WK(z)]

Filter Bank Approach to ICA (2)

• Adaptation rules

• Total number of multiplications

Time domain approach Filter bank approach

The number of required filter coefficients Uniform K-channel oversampled filter bank

is the number of adaptive filter coefficients

Experimental Setup (1)

• Measure for blind source separation
  SIR for a 2x2 mixing/unmixing system
• Sources
  Two streams of speech
  5 second length
  16 kHz sampling rate

Experimental Setup (2)

• Mixing system

Virtual room to simulate impulse responses

Experimental Results

• Learning curves of the three different approaches

Experiment on Real-Recorded Data (1)

• Mixing environment

• Filter bank approach
  Using the sixteen-channel filter bank
  Each adaptive filter: 103 taps

Speakers

Microphones

Experiment on Real-Recorded Data (2)

• Blind source separation of real recorded mixtures

Mixture 1

Mixture 2

Result 1

Result 2

Motivation of a Nonuniform Filter Bank Approach

• Time-averaged magnitude responses of signals
  The energy exponentially decreases as the frequency increases.
[Figure panels: Speech, Car noise, Music]
• Subband division
  Result of a trade-off between mitigating the undesired properties of the uniform filter bank approach and mitigating those of a large adaptive filter length

Relationship between Performances and Filter Length

• Convergence of gradient-based algorithms
  Controlled by the condition number of the input correlation matrix
• Bordering theorem
• Condition number
  Monotonically nondecreasing function of filter length
• The longer the filter length, the slower the convergence speed

rH 11 aa LL 1and
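The claim that the condition number never decreases as the filter grows can be checked numerically. The sketch below assumes a first-order autoregressive input with correlation coefficient 0.9, a choice made only for illustration; its L x L autocorrelation matrix is Toeplitz with entries rho^|i-j|.

```python
import numpy as np

def autocorr_matrix(rho, L):
    """L x L autocorrelation matrix of a unit-variance AR(1) process (assumed input model)."""
    idx = np.arange(L)
    return rho ** np.abs(idx[:, None] - idx[None, :])

lengths = np.arange(1, 33)
conds = np.array([np.linalg.cond(autocorr_matrix(0.9, L)) for L in lengths])
# Each matrix is a leading principal submatrix of the next one, so its extreme
# eigenvalues can only spread further apart as L grows.
```

Plotting `conds` against `lengths` shows the eigenvalue spread widening with filter length, which is what slows gradient-based adaptation of long filters.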

Bark-Scale Filter Banks

• Subband division
  Result of a trade-off between mitigating the undesired properties of the uniform filter bank approach and mitigating those of a large adaptive filter length
• Bark frequency warping function
  $\Omega(\omega) = 6 \log\left( \frac{\omega}{1200\pi} + \sqrt{ \left( \frac{\omega}{1200\pi} \right)^2 + 1 } \right)$
• Bark-scale filter banks
  Resemble that of the mammalian cochlea
  Somewhat narrow subbands in low frequency region
  Wide subbands in high frequency region
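To make the subband division concrete, the sketch below assumes the commonly used Bark warping 6 log(x + sqrt(x^2 + 1)) with x = f / 600 Hz (equivalently omega / (1200 pi)); whether this matches the slide's exact formula is an assumption. The helper names and the choice of 8 bands up to 8 kHz are illustrative only.

```python
import numpy as np

def hz_to_bark(f_hz):
    """Bark warping, assumed form: 6 * asinh(f / 600)."""
    x = np.asarray(f_hz, dtype=float) / 600.0
    return 6.0 * np.log(x + np.sqrt(x ** 2 + 1.0))

def bark_band_edges(n_bands, f_max_hz):
    """Edges (Hz) of n_bands subbands that are equally wide on the Bark scale."""
    bark_edges = np.linspace(0.0, hz_to_bark(f_max_hz), n_bands + 1)
    return 600.0 * np.sinh(bark_edges / 6.0)  # inverse of the warping

edges = bark_band_edges(8, 8000.0)
widths = np.diff(edges)  # narrow at low frequencies, wide at high frequencies
```

Equal steps on the Bark axis map to bandwidths in Hz that grow with frequency, which is the narrow-low, wide-high division described above.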

Nonuniform Filter Bank Approach to BSS

• 2x2 network for the nonuniform oversampled filter bank approach to BSS

[Figure: ICA network W1(z), ICA network W2(z), ..., ICA network WK(z)]

Design of a Bark-Scale Filter Bank

• Filter design of a Bark-scale oversampled filter bank
  16-channel, M = [22 22 20 18 14 11 7 3], L_q = 220, OSR = 167%
[Figure panels: Bark-scale filter bank; Uniform filter bank]
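The quoted oversampling ratio can be cross-checked with a little arithmetic. This sketch reads the decimation factors on this slide as [22 22 20 18 14 11 7 3] (an interpretation of the transcript) and assumes that the 16 channels form 8 conjugate-symmetric pairs, so each listed factor counts twice; both points are assumptions, not statements from the slides.

```python
def oversampling_ratio(decimation_factors, symmetric=True):
    """Total subband sample rate relative to the fullband rate."""
    total = sum(1.0 / m for m in decimation_factors)
    # For a real input, each listed band is assumed to have a mirrored twin.
    return 2.0 * total if symmetric else total

bark_osr = oversampling_ratio([22, 22, 20, 18, 14, 11, 7, 3])
uniform_osr = oversampling_ratio([10] * 8)  # uniform 16-channel bank, decimation 10
```

Under these assumptions the Bark-scale bank comes out at about 1.67 and the uniform bank at 1.6, consistent with the 167% figure.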

Experimental Results

• Results on blind source separation in the oversampled filter bank

SIR PESQ score

FPGA Implementation (1)

[Figure: system diagram with noises, female speech, and male speech sources; microphones; noise references; mic. signals; outputs]

FPGA Implementation (2)

• 4 adaptive noise canceling (4 music signals) + 2 blind source separation (2 speech signals)

Application to Hearing Aids

• BTE-type hearing aids

noise speech

front mic.

rear mic.

front mic. SNR=3.20dB

rear mic. SNR=2.45dB

output SNR=21.38dB

Discussion on ICA

• Assume sources are independent
• Time domain approach
  Intensive computations and slow convergence
• Frequency domain approach
  Fewer computations but inferior performance
• Filter bank approach
  Moderate computations and good performance
  Suitable for parallel processing
  Bark-scale filter bank approach

Degenerate Unmixing and Estimation Technique

Introduction

• Independent component analysis for blind source separation
  Good performance
  In general, the number of microphones should not be smaller than the number of sources.
  Too many parameters
    Heavy computational load and slow convergence
  Problem with a source which is active in a short period

Binaural Processing

• Auditory scene analysis (ASA)

Cues: harmonics, pitch, on-set, etc

• Spatial cues
  Inter-aural time difference (ITD)
  Inter-aural intensity difference (IID)

target noise

DUET Algorithm (1)

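• Mixing model
  $x_1(t) = \sum_{j=1}^{N} s_j(t) + n_1(t)$
  $x_2(t) = \sum_{j=1}^{N} a_j s_j(t - \delta_j) + n_2(t)$
• In the time-frequency domain
  $X_1(\tau, \omega) = \sum_{j=1}^{N} S_j(\tau, \omega) + N_1(\tau, \omega)$
  $X_2(\tau, \omega) = \sum_{j=1}^{N} a_j e^{-i\omega\delta_j} S_j(\tau, \omega) + N_2(\tau, \omega)$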

DUET Algorithm (2)

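• W-disjoint orthogonality assumption
  $S_i(\tau, \omega) S_j(\tau, \omega) = 0, \quad \forall \tau, \omega, \; i \neq j$
• Parameter estimation
  $\begin{bmatrix} X_1(\tau, \omega) \\ X_2(\tau, \omega) \end{bmatrix} = \begin{bmatrix} 1 \\ a_j e^{-i\omega\delta_j} \end{bmatrix} S_j(\tau, \omega)$, for some $j$
  $\hat{a}(\tau, \omega) = \left| \frac{X_2(\tau, \omega)}{X_1(\tau, \omega)} \right|, \quad \hat{\delta}(\tau, \omega) = -\frac{1}{\omega} \angle \frac{X_2(\tau, \omega)}{X_1(\tau, \omega)}$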
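When only one source is active at a time-frequency point, the ratio of the two mixtures reveals that source's attenuation and delay. The sketch below uses a single DFT frame with a circular one-sample delay so that the relation X2 = a e^{-i omega delta} X1 holds exactly; a real DUET front end would use a windowed STFT, and the function name and test values here are hypothetical.

```python
import numpy as np

def estimate_attenuation_delay(X1, X2, omega):
    """Per-bin attenuation and delay estimates from the ratio of two mixtures."""
    ratio = X2 / X1
    a_hat = np.abs(ratio)
    delta_hat = -np.angle(ratio) / omega  # valid while |omega * delta| < pi
    return a_hat, delta_hat

rng = np.random.default_rng(0)
n = 512
s = rng.standard_normal(n)
a_true, d_true = 0.8, 1                # attenuation and delay (in samples)
x1 = s
x2 = a_true * np.roll(s, d_true)       # circular delay keeps the DFT relation exact
k = np.arange(1, n // 2)               # skip DC and Nyquist bins
omega = 2 * np.pi * k / n              # rad/sample
X1 = np.fft.fft(x1)[k]
X2 = np.fft.fft(x2)[k]
a_hat, delta_hat = estimate_attenuation_delay(X1, X2, omega)
```

With several sources, these per-bin estimates cluster around each source's true parameters, which is what the amplitude-delay histogram visualizes.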

DUET Algorithm (3)

• 2D Histogram of amplitude-delay estimates from two mixtures of five sources

♦ Amplitude parameters

( .98, 1.02, .93, 1.06, .93)

♦ Delay parameters

( .3, -.2, .8, -.7, -.2)

DUET Algorithm (4)

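• If the j-th source is active,
  $a_j e^{-i\omega\delta_j} X_1(\tau, \omega) - X_2(\tau, \omega) = 0$
• Cost function
  $\rho(\hat{a}_j, \hat{\delta}_j, \tau, \omega) = \frac{1}{1 + \hat{a}_j^2} \left| \hat{a}_j e^{-i\omega\hat{\delta}_j} X_1(\tau, \omega) - X_2(\tau, \omega) \right|^2$
  $J = \sum_{\tau, \omega} \min\left( \rho(\hat{a}_1, \hat{\delta}_1, \tau, \omega), \ldots, \rho(\hat{a}_N, \hat{\delta}_N, \tau, \omega) \right)$
• Parameter estimation
• Stochastic gradient descent algorithm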

DUET Algorithm (5)

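• Mask
  $M_j(\tau, \omega) = 1$ if $\rho(\hat{a}_j, \hat{\delta}_j, \tau, \omega) \le \rho(\hat{a}_m, \hat{\delta}_m, \tau, \omega), \; \forall m \neq j$; $0$ otherwise
• Demixing
  $\hat{S}_j(\tau, \omega) = M_j(\tau, \omega) X_1(\tau, \omega)$
[Figure: example binary mask of 0/1 values over time-frequency cells]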
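The mask-and-demix step can be sketched as assigning each time-frequency point to the source whose attenuation-delay pair best explains the two mixtures. The cost used below, |a e^{-i omega delta} X1 - X2|^2 / (1 + a^2), is the usual DUET form and is assumed to match the slide; the toy parameters and spectra are hypothetical.

```python
import numpy as np

def duet_cost(a, delta, X1, X2, omega):
    """How poorly the pair (a, delta) explains X2 given X1 at each point."""
    return np.abs(a * np.exp(-1j * omega * delta) * X1 - X2) ** 2 / (1.0 + a ** 2)

def binary_masks(params, X1, X2, omega):
    """One 0/1 mask per source: 1 where that source has the smallest cost."""
    costs = np.stack([duet_cost(a, d, X1, X2, omega) for a, d in params])
    winner = np.argmin(costs, axis=0)
    return np.stack([(winner == j).astype(float) for j in range(len(params))])

# Toy example: four time-frequency points, each owned by exactly one source.
omega = np.array([0.5, 1.0, 1.5, 2.0])
params = [(0.9, 0.5), (1.1, -0.7)]     # (attenuation, delay) per source
S = np.array([1 + 1j, 2 - 1j, -1 + 0.5j, 0.3 + 2j])
owner = np.array([0, 1, 1, 0])
X1 = S.copy()
X2 = np.array([params[j][0] * np.exp(-1j * w * params[j][1]) * s
               for j, w, s in zip(owner, omega, S)])
masks = binary_masks(params, X1, X2, omega)
S_hat = masks * X1                      # demixed spectra, one row per source
```

Each row of `S_hat` keeps only the points attributed to one source; an inverse time-frequency transform of a row would give that source's waveform.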

Target Speech Enhancement

• In many practical applications,
  Need to estimate a signal from a target source
  The target source
    Frequently, its approximate direction can be anticipated.
    Strong utterance in a noisy environment

Proposed Method (1)

• Continuously variable mask

• Real mask

),(),(

),(),(ˆ

),(),(ˆ1),(

Continuously variable mask

Real mask

Proposed Method (2)

• Determine a threshold.
  Using a top ranking: the threshold Th is set at the top 35% of the continuous mask values
• Binary mask using a threshold
  1 where the continuous mask value is at least Th; 0 otherwise
[Figure panels: Real mask; Binary mask]
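A top-ranking threshold can be sketched with a quantile: keep the time-frequency cells whose continuous mask value falls in the top 35%. The random mask below is placeholder data, and reading "top 35%" as a quantile over all cells is an assumption about the slide's rule.

```python
import numpy as np

def top_fraction_binary_mask(cont_mask, top_fraction=0.35):
    """Binary mask that keeps the cells ranked in the top `top_fraction`."""
    threshold = np.quantile(cont_mask, 1.0 - top_fraction)
    return (cont_mask >= threshold).astype(float), threshold

rng = np.random.default_rng(0)
cont_mask = rng.random((64, 100))   # placeholder continuous mask (bands x frames)
bin_mask, th = top_fraction_binary_mask(cont_mask)
```

Because the threshold is taken from the ranking rather than fixed in advance, the kept fraction stays near 35% whatever the scale of the continuous mask.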

• Overall procedure
[Figure: block diagram from inputs x1(t), x2(t) to output sj(t), with blocks: attenuation-delay histogram, initializing attenuation-delay parameters, initial continuous mask, thresholding, learning attenuation-delay parameters, continuous mask, thresholding, binary mask]

• Overall procedure of the DUET algorithm
[Figure: block diagram from inputs x1(t), x2(t), with blocks: attenuation-delay histogram, initializing attenuation-delay parameters, learning attenuation-delay parameters, comparing likelihoods, binary mask]

Proposed Method (3)

• Number of sources: 2 (1 target source and 1 noise source)
• Input SIR: 5 dB
• Simulated mixing in an anechoic environment
• Source signals
  10-second-long speech signals uttered by 4 males and 4 females in the TIMIT database
• Microphones
  Space: 2 cm
• Angle differences between two sources
  30°, 60°, 90°, 120°, 150°, and 180°
[Figure: source directions at 20°, -10°, -40°, -70°, and -100° around Mic1 and Mic2]

Experimental Results (1)

9.1710.9

96.26 95.38

Proposed method    DUET

[Figure: recording layout with sources 50 cm from the Mic at 60°, 30°, 0°, -30°, -60°, and -90°]

Experimental Setup (2)

• Number of sources: 2 (1 target source and 1 noise source)
• Input SIR: 5 dB
• Real recorded mixtures in a normal office room
• Source signals
  10-second-long speech signals uttered by 3 male and 3 female speakers in the TIMIT database
• Microphones
  Space: 2 cm
• Angle differences between two sources
  30°, 60°, 90°, 120°, 150°, and 180°

Proposed method    DUET

Discussion on DUET

• DUET (Degenerate Unmixing and Estimation Technique)
  Simple
  We should know the number of sources in advance.
  Estimate the attenuation and delay parameters for all sources.
• Described target speech enhancement technique
  Estimate the parameters for only one target source
  Much faster convergence of all the required parameters
• Not robust to reverberation

Zero-Crossing-Based Binaural Processing

Binaural Processing

• Auditory scene analysis (ASA)
  Spatial cues: ITD, IID
  Others: harmonics, pitch, on-set, etc.
• Conventional methods
  Inter-aural cross-correlation
  Binary mask (all-or-none)
• Developed method
  Inter-aural zero-crossing difference
  Continuously variable mask

target noise

Jeffress’ Model

[Figure: running interaural cross-correlation over (m, n); the left-ear signal l(m, n) and the right-ear signal r(m, n) are combined by multiplication followed by running integration]

Source Localization Based on Cross-Correlation

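• Signal model for the sensor outputs
  $x_i(t) = \sum_j h_{ij}(t) * s_j(t) + n_i(t)$
  $X_i(\omega) = \sum_j H_{ij}(\omega) S_j(\omega) + N_i(\omega)$
• ITD estimation based on generalized cross-correlation
  $R_{ik}(\tau) = \int W(\omega) X_i(\omega) X_k^*(\omega) e^{j\omega\tau} \, d\omega$
  $\hat{\tau}_{ik} = \arg\max_{\tau} R_{ik}(\tau)$
• Phase transform (PHAT)
  $W_{PHAT}(\omega) = \left| X_i(\omega) X_k^*(\omega) \right|^{-1}$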
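A generalized cross-correlation with the PHAT weighting can be sketched with FFTs: whiten the cross-spectrum, transform back, and pick the lag of the peak. The function name, the zero-padding choice, and the white-noise test signal are assumptions made for illustration.

```python
import numpy as np

def gcc_phat_delay(x_i, x_k, max_lag):
    """Delay of x_i relative to x_k, in samples, from the GCC-PHAT peak."""
    n = len(x_i) + len(x_k)              # zero-pad to avoid circular wrap-around
    X_i = np.fft.rfft(x_i, n)
    X_k = np.fft.rfft(x_k, n)
    cross = X_i * np.conj(X_k)
    cross /= np.abs(cross) + 1e-12       # PHAT weighting: keep only the phase
    r = np.fft.irfft(cross, n)
    # Reorder so that the array covers lags -max_lag .. +max_lag.
    r = np.concatenate((r[-max_lag:], r[:max_lag + 1]))
    return int(np.argmax(r)) - max_lag

rng = np.random.default_rng(0)
s = rng.standard_normal(4000)
d = 5
x_k = s
x_i = np.concatenate((np.zeros(d), s[:-d]))   # x_i is s delayed by d samples
lag = gcc_phat_delay(x_i, x_k, max_lag=20)
```

The returned lag is positive when the first signal lags the second. Dividing it by the sampling rate gives the time difference in seconds.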

Finding Zero-Crossings

[Figure: zero crossings of the signals from two microphones]
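One way to sketch a zero-crossing-based ITD estimate: locate upward zero crossings in each band-pass signal by linear interpolation, pair each crossing in the first signal with the closest crossing in the second, and take the time difference. The helper names, the 500 Hz test tone, and the three-sample delay are hypothetical, and this omits the SNR-based reliability weighting discussed in the slides.

```python
import numpy as np

def upward_zero_crossings(x, fs):
    """Times (s) of negative-to-nonnegative crossings, linearly interpolated."""
    n = np.where((x[:-1] < 0) & (x[1:] >= 0))[0]
    frac = x[n] / (x[n] - x[n + 1])      # position of the crossing between n and n+1
    return (n + frac) / fs

def zero_crossing_itds(x1, x2, fs):
    """ITD samples: time from each crossing in x1 to the closest crossing in x2."""
    t1 = upward_zero_crossings(x1, fs)
    t2 = upward_zero_crossings(x2, fs)
    nearest = np.abs(t2[None, :] - t1[:, None]).argmin(axis=1)
    return t2[nearest] - t1

fs = 16000
n = np.arange(1600)
x1 = np.sin(2 * np.pi * 500 * n / fs + 0.3)
delay = 3                                              # samples
x2 = np.concatenate((np.zeros(delay), x1[:-delay]))    # x2 lags x1
itds = zero_crossing_itds(x1, x2, fs)
itd = np.median(itds)
```

Pairing with the closest crossing is unambiguous only when the true delay is below half a period of the band's signal, which is why closely spaced microphones help.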

Noise Robustness of the Zero-Crossing-Based Method

Y.-I. Kim and R. M. Kil, “Estimation of Interaural Time Differences Based on Zero-Crossings in Noisy Multisource Environments,” IEEE Trans. ASLP, vol.15, no. 2, 2007.

5-dB SNR

otherwise.,0

,1)( if,)(

1log10

)(SNR22

2210 jSjSj ii

Application to Source Localization

• Four sources located at azimuth angles of -10°, 0°, 10°, and 40°

Speech Segregation

i : band (frequency) index
j : frame (time) index
τ : time lag
M : frame length
T : frame shift

• Cross-correlation-based ITD estimation

Overall Procedure

[Figure: block diagram with blocks: input signal 1, input signal 2, Gammatone filterbank, subband signals, ITD estimation using ZCs, scaling factor estimation, amplitude scaling, time reverse, enhanced signal]

Amplitude Scaling

actualscalefactor

0.9 0.7 0.2 0.8

s(ITD)

Relationship between Zero-Crossing-Based ITDs and the SNRs

• Band-pass signals from two microphones

• The mean of the estimated ITDs, , can be approximated by

Zero Crossing vs. Cross-Correlation (1)

• Relative strength

• Criterion
  Confidence of the conversion between the relative strength and an ITD
  Measure the sample standard deviation of ITDs for each relative strength
• Estimate the sample standard deviation by simulation
  100,000 randomly generated samples
  Parameters
    Phase: uniform distribution on
    Frequency: uniform distribution
• ITDs normalized by
[Figure panels: ZC-based ITDs and CC-based ITDs for the lowest freq. band, a middle freq. band, and the highest freq. band]

Recognition Experimental Setup

• Recognizer
  The CMU SPHINX-III speech recognition system
  Fully-continuous hidden Markov models
• Database
  The DARPA Resource Management database
  Training data: 2,880 sentences
  Test data: 600 sentences
    – The target and interfering speech were combined with different delays from sensor to sensor. (SIR = 0 dB)
• Feature
  13th-order mel-frequency cepstral coefficients

Recognition Results (1)

• Word error rates (WERs) (%)

[Figure: bar chart of WERs (axis 0 to 90) for no proc., CC-binary, ZC-binary, CC-cont., ZC-cont., and target alone; added white Gaussian noise (SNR: 20 dB); legend: no, identical, independent]

Recognition Results (2)

• Word error rates (WERs) (%)

What if There is Reverberation?

target noise

Direct path

Echoic path

Jeffress’ Model

[Figure: Jeffress’ model with left ear input, multiplication, and running integration stages producing the running interaural cross-correlation over (m, n)]

Lindemann’s Model (1)

inhibited interaural cross-correlation ),( nm

leftear

),( nmk

),( nmil),( nmir

Lindemann’s Model (2)

),( nml

),( nmr

),( nmil

),( nmir

),( nmk

),( nm

),(),(),( nmrnmlnmk

)}),({1))(,(1)(,(

nmkcnmlcnmr

)}),({1))(,(1)(,(

nmkcnmrcnml

Inhibition stationaryinhibition

dynamicinhibition

)]1(1[)1()1(

nxennx

ninhd TT

)()](1)[,(),(

mmnmrnmr

mmnmlnml

Monaural sensitivities

fMMmfrl emmm /)()()()(

Simulation (1)

• Input signal
  Gammatone filterbank impulse response with 1 kHz center freq.
  Half-wave rectified
  Low-pass filtered (cf = 1.6 kHz)
[Figure: direct sound and reflected sound, t_d = 6.0 msec; input signal, t_i = 10 msec]

Simulation (2)

Jeffress’ model

Lindemann’s model

Onset Detection

• Onset intervals Dominantly contain direct-path components

Room Impulse Response

Direct path

Late Reflection

Early Reflection

target noise

Palomäki’s Model of the Precedence Effect

target noise

i-th BPF i-th BPF

Cross-corr.

ITD output

Envelope e(t)

Inhibition h(t)

Envelope e(t)

Inhibition h(t)

Inhibited envelope

Filter to Generate the Inhibition from an Envelope

• Low pass filter
  $h_{lp}(n) = A n \exp(-n / \alpha)$
  $A$ is chosen to give a unity gain at DC. $\alpha$ is a time constant.
  $\alpha = 15\ \mathrm{ms} \times F_s$ ($F_s = 16000$ Hz)
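The inhibition filter can be sketched as the impulse response n exp(-n / alpha), normalized to unity DC gain, with alpha = 15 ms at a 16 kHz sampling rate (240 samples). The functional form, the truncation to ten time constants, and the step-shaped test envelope are assumptions made for illustration; the onset rule e(t) > h(t) follows the comparison described elsewhere in the slides.

```python
import numpy as np

def inhibition_filter(fs=16000, tau_ms=15.0, length_factor=10):
    """Low-pass impulse response n * exp(-n / alpha), scaled to unity gain at DC."""
    alpha = tau_ms * 1e-3 * fs                 # time constant in samples
    n = np.arange(int(length_factor * alpha))  # truncation length is an assumption
    h = n * np.exp(-n / alpha)
    return h / h.sum()                         # the scale factor A gives sum(h) = 1

h_lp = inhibition_filter()
envelope = np.concatenate((np.zeros(100), np.ones(4000)))   # toy envelope e(t)
inhibition = np.convolve(envelope, h_lp)[:len(envelope)]    # inhibition h(t)
onset = envelope > inhibition                               # onset where e(t) > h(t)
```

Because the impulse response starts at zero and peaks one time constant later, the inhibition lags the envelope, so the comparison is true right at the sound onset and the inhibition then approaches the steady envelope level.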

Envelope and Inhibition

• Envelope: blue line, Inhibition: red line (cf = 1,037Hz)

Source Localization in Reverberant Environments (1)

• Energy-based onset detection
  Simple (small computation)
  Not robust to parameters and environments
[Figure panels: Smoothed envelope; Onset detection]
• Echo-free onset detection
  Detection by comparing the sound to echo ratio with threshold
  More robust to parameters
[Figure labels: possible echo at time n caused by the preceding sound at time n_p; maximum possible echo; the total estimated echoes; observed sounds]

[Figure: block diagram with blocks: input signal 1, input signal 2, Gammatone filterbank, envelope, waveform, onset detection, ITD estimation based on zero crossing, SNR estimation, reliable ITD sample selection, angle conversion, weighted histogram, source localization]

• Recording rooms

Moderately reverberant room (a normal office room)

Higher reverberant room (a bathroom)

• Height of both rooms: 3 m
• Height of speakers and mics: 1.5 m

Described method Energy-based onset detection Echo-free onset detection

• Simulated mixing environment

• Height of both rooms: 3 m
• Height of speakers and mics: 1.1 m
• 30-dB SNR observations by adding white Gaussian noise
• 320 utterances by 16 speakers from the TI-DIGIT database

43mm4.0m

30°mic.1

speakers

• Rates of localizations where errors of estimated angles were less than 3°

Discussion on Binaural Processing

• Describe a method that enhances speech by estimating continuously variable masking weights
• Estimation of ITDs from zero crossings
  More reliable than that from cross-correlation
• Continuously variable mask
  Estimate relative target intensity in the t-f domain
  Better accuracy than binary mask
• Reverberation
  Precedence effect
  Onset detection and SNR estimation

Thank you very much.

Multi-rate Systems

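• Decimation and expansion
  Decimation by $M$: $Y_D(e^{j\omega}) = \frac{1}{M} \sum_{m=0}^{M-1} X(e^{j(\omega - 2\pi m)/M})$
  Expansion by $L$: $V(e^{j\omega}) = Y_D(e^{j\omega L})$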
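The spectral effect of decimation and expansion can be verified numerically with the DFT: keeping every M-th sample averages M shifted copies of the spectrum (aliasing), while inserting zeros repeats the spectrum (imaging). The signal length, the factor of 4, and the helper names are arbitrary choices for this sketch.

```python
import numpy as np

def decimate(x, M):
    """Keep every M-th sample (no anti-aliasing filter)."""
    return x[::M]

def expand(x, L):
    """Insert L - 1 zeros between consecutive samples."""
    y = np.zeros(len(x) * L, dtype=x.dtype)
    y[::L] = x
    return y

rng = np.random.default_rng(0)
N, M = 64, 4
x = rng.standard_normal(N)
X = np.fft.fft(x)
V = np.fft.fft(decimate(x, M))
# Aliasing: the decimated spectrum is the average of M shifted copies of X.
aliased = X.reshape(M, N // M).sum(axis=0) / M
# Imaging: the expanded spectrum is X repeated M times.
Y = np.fft.fft(expand(x, M))
```

The analysis filters placed before a decimator limit how much the shifted copies overlap, and the synthesis filters after an expander remove the repeated images.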

Filter Banks (1)

• Multirate System

Filter Banks (2)

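• In the z-domain, ($W = e^{-j2\pi/M}$)
  $U_k(z) = \frac{1}{M} \sum_{m=0}^{M-1} H_k(z^{1/M} W^m) X(z^{1/M} W^m)$
  $\hat{X}(z) = \sum_{k=0}^{K-1} F_k(z) U_k(z^M) = \frac{1}{M} \sum_{k=0}^{K-1} \sum_{m=0}^{M-1} X(zW^m) H_k(zW^m) F_k(z)$
• Perfect reconstruction system
  $\hat{X}(z) = c z^{-n_0} X(z), \quad c \neq 0$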

Polyphase Representation (1)

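• Analysis filter (Type 1 polyphase)
  For $M = 2$: $H(z) = \sum_n h(n) z^{-n} = \sum_n h(2n) z^{-2n} + z^{-1} \sum_n h(2n+1) z^{-2n}$
  In general, $H_k(z) = \sum_{m=0}^{M-1} z^{-m} E_{km}(z^M)$, where $E_{km}(z) = \sum_n h_k(nM + m) z^{-n}$
• Using matrix notations,
  $\mathbf{E}(z) = \begin{bmatrix} E_{00}(z) & E_{01}(z) & \cdots & E_{0,M-1}(z) \\ E_{10}(z) & E_{11}(z) & \cdots & E_{1,M-1}(z) \\ \vdots & \vdots & & \vdots \\ E_{K-1,0}(z) & E_{K-1,1}(z) & \cdots & E_{K-1,M-1}(z) \end{bmatrix}$
  $\mathbf{h}(z) = \mathbf{E}(z^M) \begin{bmatrix} 1 & z^{-1} & \cdots & z^{-(M-1)} \end{bmatrix}^T$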
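The practical payoff of the Type 1 polyphase form is that filtering followed by decimation can be computed at the low rate. The sketch below checks that summing M polyphase branches, each with coefficients h(nM + m), reproduces direct filtering followed by keeping every M-th sample; the function name and the random test data are hypothetical.

```python
import numpy as np

def polyphase_decimate(x, h, M):
    """Filter x with h and keep every M-th output, using Type 1 polyphase branches."""
    h = np.concatenate((h, np.zeros((-len(h)) % M)))   # pad h to a multiple of M
    n_out = -(-(len(x) + len(h) - 1) // M)             # ceiling division
    y = np.zeros(n_out)
    for m in range(M):
        e_m = h[m::M]                                  # branch filter e_m(n) = h(nM + m)
        x_m = np.concatenate((np.zeros(m), x))[::M]    # branch input x_m(n) = x(nM - m)
        branch = np.convolve(x_m, e_m)
        n = min(n_out, len(branch))
        y[:n] += branch[:n]
    return y

rng = np.random.default_rng(0)
x = rng.standard_normal(50)
h = rng.standard_normal(11)
M = 4
direct = np.convolve(x, h)[::M]                        # filter at the high rate, then decimate
poly = polyphase_decimate(x, h, M)[:len(direct)]
```

Each branch runs at 1/M of the input rate, so the arithmetic per input sample drops by roughly a factor of M compared with the direct form.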

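• Synthesis filter (Type 2 polyphase)
  For $M = 2$: $F(z) = \sum_n f(n) z^{-n} = z^{-1} \sum_n f(2n+1) z^{-2n} + \sum_n f(2n) z^{-2n}$
  In general, $F_k(z) = \sum_{m=0}^{M-1} z^{-(M-1-m)} R_{mk}(z^M)$, where $R_{mk}(z) = \sum_n f_k(nM + M - 1 - m) z^{-n}$
• Using matrix notations,
  $\mathbf{R}(z) = \begin{bmatrix} R_{00}(z) & R_{01}(z) & \cdots & R_{0,K-1}(z) \\ R_{10}(z) & R_{11}(z) & \cdots & R_{1,K-1}(z) \\ \vdots & \vdots & & \vdots \\ R_{M-1,0}(z) & R_{M-1,1}(z) & \cdots & R_{M-1,K-1}(z) \end{bmatrix}$
  $\mathbf{f}^T(z) = \begin{bmatrix} z^{-(M-1)} & z^{-(M-2)} & \cdots & 1 \end{bmatrix} \mathbf{R}(z^M)$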

• Polyphase representation

• Rearrangement using noble identities

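Paraunitary Property for Perfect Reconstruction

• Paraunitary property
  $\tilde{\mathbf{E}}(z) \mathbf{E}(z) = d\,\mathbf{I}, \quad d > 0$
  $\tilde{\mathbf{E}}(z)$: transposed with its entries complex-conjugated and time-reversed
• Perfect reconstruction condition
  $\mathbf{R}(z) = c z^{-l} \tilde{\mathbf{E}}(z)$
  $f_k(n) = c\, h_k^*(L - n), \quad 0 \le k \le K - 1$
  $F_k(z) = c z^{-L} \tilde{H}_k(z), \quad 0 \le k \le K - 1$
  where $L = Ml + M - 1$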

Critically Sampled Filter Banks (1)

• Overall system

• Critically sampled filter banks: $K = M$

• NotationsTMzWXzWXzXz )]()()([)( 1 x

zWHzWHzWH

zHzHzH

)()()(

)]}()()({[diag)( 1 MzWSzWSzSz S

)()()(1

)]()()([)( /1/1/1110

MMMTM zzz

MzYzYzYz xSHy

)()()(1

)](ˆ)(ˆ)(ˆ[)(ˆ /1/1110

MMTM zzz

MzYzYzYz xHWy

• The subband error signals are zero if

• Two subbands scheme (Gilloire and Vetterli, ’92) Assume the classical QMF filters

Therefore,

)()()()( zzzzM SHHW

)()()(

zHzHzH

• Adaptive filters

• is diagonal only if or

)()()()()]()()[()(

)]()()[()()()()()(

)()()()(

zSzHzSzHzSzSzHzH

zSzSzHzHzSzHzSzH

)( 2zW

0)()( zHzH 0)()( zSzS

A general physical system (X)

0)()( zHzHPR (X)

• Multiband scheme Assumptions

Adaptive filters (require the use of cross filters)

Slow convergence and performance degradation

1||,0)()( jizHzH ji )()( ii zWHzH

)(00)(

0)()(0

0)()()(

)(0)()(

1,10,1

121110

1,00100

zWzWzW

Oversampled Filter Banks (1)

• such that

• Non-critical decimation can avoid the aliasing problem.
  The redundancy provides enough information for successful adaptation in every band.
  Diagonal adaptive filter matrix

MK jizHzH ji ,0)()( )()( i

i zWHzH where

• Recall

• For perfect reconstruction,

• Remove the aliasing terms

10),(~

)( KkzHczzF kL

)()()(1

m zFzWHzWXM

)()()(1

kkk zFzHzX

• Analysis filters from a real-valued linear phase prototype filter by generalized DFT
  $h_k(n) = e^{j\frac{2\pi}{K}(k + k_0)(n + n_0)} q(n), \quad k = 0, 1, \ldots, K - 1, \; n = 0, 1, \ldots, L_q - 1$
  $k_0 = 1/2$: to cover the frequency range $[0; \pi]$ by exactly $K/2$ subbands
  $n_0 = -(L_q - 1)/2$: for the linear phase property
• Synthesis filters
  $f_k(n) = \tilde{h}_k(n) = h_k^*(L_q - 1 - n) = h_k(n)$
• All filters can be derived from one prototype filter.
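A GDFT-modulated bank can be sketched by modulating one real, symmetric prototype q(n) with complex exponentials. The offsets k0 = 1/2 and n0 = -(L_q - 1)/2 are the usual choices and are assumed to match the slide; the Hamming-windowed sinc below is only a placeholder and is not the optimized prototype from the design procedure.

```python
import numpy as np

def gdft_analysis_filters(q, K):
    """Analysis filters h_k(n) = exp(j*2*pi/K*(k + k0)*(n + n0)) * q(n), one row per band."""
    Lq = len(q)
    n = np.arange(Lq)
    k = np.arange(K)[:, None]
    k0, n0 = 0.5, -(Lq - 1) / 2.0       # assumed frequency and time offsets
    return np.exp(1j * 2 * np.pi / K * (k + k0) * (n + n0)) * q

K, Lq = 16, 220
n = np.arange(Lq)
# Placeholder prototype: symmetric Hamming-windowed sinc (not an optimized design).
q = np.sinc((n - (Lq - 1) / 2) / K) * np.hamming(Lq)
h = gdft_analysis_filters(q, K)
f = np.conj(h[:, ::-1])                 # synthesis filters: conjugated and time-reversed
```

With a symmetric prototype and these offsets, the time-reversed conjugate of each analysis filter equals the filter itself, so a single prototype fixes every analysis and synthesis filter.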

Design of Oversampled Filter Banks (1)

• Cost function Combination of filter bank reconstruction error and

stopband energy of the analysis filters

• Impulse response of the overall filter bank system

• Using matrix notation,

kkk nfnh

0)0()1(

• Impulse response of the overall filter bank system

• Measure of the reconstruction error

• Measure for the energy contained in the stopband

fHfffHHHt TT

KM 110110

1 ))1(( qLnδt

))1(cos()1cos(1

• To enforce linear phase filters,

Therefore, , where .

• Cost function

qq)12/()1()0(

)1()1()0(

TT110 hhhhf

Tqkkkk Lhhh )1()1()0( h

))1((1

},,,{diag 110 KPPPP

• GDFT

• Cost function

• Iterative least-squares design algorithm
• Initialize
• Minimize with respect to
• Apply relaxation

,qMh kk

))1((1

where are diagonal matrices with transform factorskM

)1()1()()( iii qqq

ITD Using Zero Crossings (1)

• Two microphone signals

Ignore attenuation between microphones because of closeness

• Assume and

))(cos())(cos()(

)cos()cos()(

dtwadtwAtx

twatwAtx

0)( 11 tx

))(sin()sin(

))(cos()cos(

))(sin()sin(

))(cos()cos()(

111112

dwtwAtx

0)( 12 tx

ITD Using Zero Crossings (2)

• Since

• Therefore,

• Since

)()sin(

)()sin()(

111112

dwtwAtx

}2,1,{ ,)( jidw ji

1))(cos( ji dw )())(sin( jiji dwdw

0)( 12 tx

21221111

122111

)sin()sin(

))sin()sin((

dtwawdtwAw

twawtwAw

• Recall

)sin()(cos

)sin()))cos((sin(cos

12212222

2122112222

122121

21221121

twawtwaAw

dtwawdtwaAw

twawtwAa

dtwawdtwAa

)(cos)sin(

211222

twAawtwAw

dtwAawdtwAw

for aA

0)cos()cos()( 121111 twatwAtx

• Assume uniformly-distributed frequencies over a narrow band and phases over the interval ),(

21 )),(1(),( daAgdaAg

otherwise )(

)(tan2

, if 2

, if )(

)(tan)2

ITD Using Cross-Correlation (1)• Two microphone signals

• Cross-correlation))(cos())(cos()(

)cos()cos()(

dtwadtwAtx

twatwAtx

Tdttxtx

Tc )()(

1)( 21

)))](cos()2)(2(cos(

)))(cos())(cos(

))(cos())((cos(

)))(cos()2)(2(cos([2

222222

211211

221221

111112

dwdwtwa

twdtwtwdtw

dtwtwdtwtwAa

dwdwtwA

ITD Using Cross-Correlation (2)

• ITD at the maximum of cross-correlation,

)]2sin()2sin()sin())sin((2

)sin())sin((2)2sin()2sin([

)]2)2cos()2(sin()cos())sin((2

)cos())sin((2

)]cos())sin((2

)cos())sin((2

)2)2cos()2(sin([

)]2)2cos()2(sin()cos())()sin((2

)cos())()sin((2

)2)2cos()2(sin([

TwaTwwAa

TwwAaTwA

TwwTwawTwwww

wTwwww

AaTwwTwAd

TwwTwawwTwwww

wwTwwww

AaTwwTwA

Spatial Aliasing

• To avoid the spatial aliasing
  The delay between sensors should be smaller than a half period of the signal.
  If we have a t-sample delay for the noisy signal, the alias-free condition is $f < F_s / (2t)$.
  1 sample delay at 16 kHz sampling rate: $f < F_s / 2 = 8$ kHz
  Center frequencies of all Gammatone filters are lower than 8 kHz.
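The half-period rule translates into a simple bound on frequency. For a delay of t samples at sampling rate Fs, requiring t / Fs < 1 / (2f) gives f < Fs / (2t); the function name below is hypothetical.

```python
def alias_free_max_frequency(fs_hz, delay_samples):
    """Upper frequency bound for which the inter-sensor delay stays below half a period."""
    return fs_hz / (2.0 * delay_samples)

f_max = alias_free_max_frequency(16000, 1)       # 1-sample delay at 16 kHz
f_max_two = alias_free_max_frequency(16000, 2)   # a larger delay lowers the bound
```

A one-sample delay at 16 kHz gives a bound of 8 kHz, so all bands with center frequencies below that stay free of phase ambiguity. Doubling the delay halves the bound.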

Close Microphones

• ITD estimation without phase ambiguity
  The closest zero crossings provide the desired ITD value.
  Do not need to estimate IIDs
• Easy to derive a relationship between the ITDs and the scaling factors
• Reduce search region of p2
• Reduce signal distortion
• Compact implementation

Results on Source Segregation

• Target signal
• Without processing
• Scaling with the following factors
  factor = [summation of CC values in the neighborhood of ITD of the desired source (male speaker)] / [summation of CC values in the whole range]
  factor = 1 if the peak value of CC is in the neighborhood of ITD of the desired source; factor = 0 otherwise.
[Figure panels: Jeffress’ model; Lindemann’s model]

Combining the Lindemann’s Model and the Precedence Effect

target noise

i-th BPF i-th BPF

Jeffress’or

Lindemann’smodel

The model operates only when e(t) > h(t). Otherwise, it outputs the previous factor.

Envelope e(t) Inhibition h(t)

On-sete(t)>h(t) ?

Enhanced speech

Results on Source Segregation for Reverberated Signal

• Target signal
  with reverberation
  without reverberation (ideal solution)
• Without processing
• Scaling with the following factors
  factor = 1 if the peak value of CC is in the neighborhood of ITD of the desired source; factor = 0 otherwise.
[Figure panels: Jeffress’ model; Lindemann’s model; w/o on-set enh.; res: 1/8000 sec; res: 1/48000 sec]

Dereverberation

• Early reflections are especially problematic.
  Affect the same frame as the direct sound wave
• To remove early reflection components
  Dereverberate the linear prediction (LP) residual of incoming speech
  Filter estimation for nearly exponentially-decaying reverberation like a typical room impulse response
  Correspond to the inverse of the truncated auto-correlation
  $h_{derev}(n) = \mathrm{IDFT}\left( 1 ./ \mathrm{DFT}([0 \cdots 0 \;\; c(0) \cdots c(R) \;\; 0 \cdots 0]) \right)$

Dereverberation and Echo Suppression

Dereverberation

Gammatonefilterbank

Cross-correlation

Inhibition

Frameenergies

Frame energies with inhibition

Maskestimation

Featureestimation

Input 1

Input 2

Simulated Room Impulse Response

Virtual room to simulate impulse responses

2m1.5m

1.1m17cm

target

interferencemics

• Mixing environments (reverberation time: 0.5s)

Recognition Results

• Word error rates (WERs) (%)
  A : no processing
  B : seg. + inhib.
  C : derev. + seg. + inhib.
  D : ideal masks

A B C D

infinity, 0sec

10dB, 0.3sec

10dB, 0.5sec

0dB, 0.3sec

0dB, 0.5sec
