Reverberant Speech Processing for Human Communication and Automatic Speech Recognition
Tomohiro Nakatani, Armin Sehr, Walter Kellermann
[email protected], {sehr,wk}@LNT.de
NTT Communication Science Laboratories
LMS, University of Erlangen-Nuremberg
March 26, 2012
Transcript
Page 1: Title

Reverberant Speech Processing for Human Communication

and Automatic Speech Recognition

Tomohiro Nakatani, Armin Sehr, Walter Kellermann
[email protected], {sehr,wk}@LNT.de

NTT Communication Science Laboratories

LMS, University of Erlangen-Nuremberg

March 26, 2012

Pages 2–8: Generic Scenario: Natural Interactive Human/Machine Interface

Mobile users, distant microphones/loudspeakers

Digital Signal Processing

Tasks:

• Rendering - Reproduce desired signals at distant ears

• Acquisition - Localize sources and capture clean signals from distance

Challenges:

• Feedback of loudspeaker signals

• Noise and interferers

• Reverberation

Nakatani, Sehr, Kellermann: Reverberant Speech Processing 2

Page 9: Applications

Hands-free equipment

for telecommunication and natural human/machine interaction

for mobile phones / smartphones, mobile computing devices, PDAs

in car interiors ('command & control', telecommunication, in-car communication, ...)

for desktop computers, info-/edutainment terminals, interactive TV, game stations, simulators

for telepresence systems (offices, ..., classrooms, ..., auditoria)

for ambient communication (smart meeting rooms, smart homes, information kiosks, museums and exhibitions, ...)

for voice-driven navigation systems in cars, operating rooms, ...

Nakatani, Sehr, Kellermann: Reverberant Speech Processing 3

Pages 10–11: Applications (cont'd)

Professional Audio

equipment for stages and recording studios

virtual acoustic environments (virtual concert halls, telepresence studios, ...)

Safety and Surveillance

acoustic displays in control centers, cockpits

monitoring in health care environments (advanced 'babyphones')

acoustic scene analysis (train stations, ...)

Nakatani, Sehr, Kellermann: Reverberant Speech Processing 4

Pages 12–18: Another Scenario: 'Listening devices'

DSP

Tasks:

• Rendering - Reproduce undistorted signals with binaural cues

• Acquisition - Localize desired source(s) and enhance desired signal(s)

Challenges:

• Loudspeaker feedback (howling)

• Noise and interferers

• Reverberation

Nakatani, Sehr, Kellermann: Reverberant Speech Processing 5

Page 19: Applications

Hearing aids, of course

Headsets, e.g., for mobile phones, mobile computing devices, personal digital assistants

hearing protection in noisy environments (construction work, mining, ...)

active noise cancellation systems

...

Nakatani, Sehr, Kellermann: Reverberant Speech Processing 6

Pages 20–22: Example 1: DICIT - an Interactive TV system

Voice-controlled home entertainment system (EU project DICIT, 2005-2009; see e.g., Marquardt et al., 2009; YouTube)

featuring

Multichannel AEC (GFDAF; Buchner/Benesty et al., 2003ff)

Multibeamforming (Mabande et al., 2009; Kellermann, 1997)

Source localization (GCF; Brutti et al., 2007)

Speech/non-speech classification (Omologo, 2009)

Noise-robust automatic speech recognition (ViaVoice, IBM 2009)

Challenge: Reverberation for large source distances in more reverberant rooms

Nakatani, Sehr, Kellermann: Reverberant Speech Processing 7

Page 23: Example 2: Meeting recognition system

Who spoke when, what, to whom, and how?

Real-time meeting browser

Recognize speech and other audio events

Page 24: Example 3: Audio postproduction system

Microphone(s), actor/actress

Step 1: Sound & video recording (on location)

Step 2: Audio post-production (de-noising, de-reverb, sound effects)

[Movies/TV creation]

Pages 25–41: Overview

Part I: Introduction

  Fundamentals

  Approaches

Part II: Multichannel blind inverse filtering

  Example applications

    Professional audio post production

    Meeting speech recognition with microphone arrays

  Fundamentals: Dereverberation with inverse filtering

    What is 'inverse' filtering?

    Robust 'approximate' inverse filtering

  Blind inverse filtering

    Overview of basic approaches

    Closer look: multichannel linear prediction with time-varying source model

  Integration with blind source separation

Part III: Robust ASR in reverberant environments

  Feature-based approaches

    Cepstral mean normalization

    Model-based feature enhancement

  Model-based approaches

    Matched training

    Multi-style training

    Adaptive training

    MAP and MLLR adaptation

    Parametric adaptation tailored to reverberation

    Frame-wise adaptation

  Decoder-based approaches

    Missing feature techniques

    Uncertainty decoding

  A generic approach: REMOS

Part IV: Summary, Conclusions, and Outlook

Nakatani, Sehr, Kellermann: Reverberant Speech Processing 10

Pages 42–44: Fundamental Signal Processing Problems - Formulation

[Figure: Digital Signal Processing block W connecting loudspeaker signals v, microphone signals x, inputs u and outputs y; sources s_1...s_M, speakers S_1...S_M, listener signals z_1...z_2M, noise n; acoustic paths H_xs, H_xv, H_zv]

Linear MIMO system W ('multiple input / multiple output'):

  [v; y] = W * [u; x] = [W_vu, W_vx; W_yu, W_yx] * [u; x]

Listeners' signals:

  z = H_zv * v + n_z

Microphone signals:

  x = H_xs * s + H_xv * v + n_x

Nakatani, Sehr, Kellermann: Reverberant Speech Processing 11
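The microphone-signal model above is simply a sum of convolutions, which can be sketched directly in code. A minimal sketch, assuming toy white-noise signals and short made-up impulse responses in place of real speech and measured RIRs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy signals: one source s and one loudspeaker signal v, 1000 samples each
# (stand-ins for speech; lengths and values are illustrative only).
s = rng.standard_normal(1000)   # desired source
v = rng.standard_normal(1000)   # loudspeaker playback signal

# Toy impulse responses standing in for entries of H_xs and H_xv:
# a direct path plus a few decaying echoes.
h_xs = np.array([1.0, 0.0, 0.5, 0.0, 0.25])
h_xv = np.array([0.8, 0.4, 0.2, 0.1])

n_x = 0.01 * rng.standard_normal(1000)  # sensor noise

# Microphone signal: x = H_xs * s + H_xv * v + n_x  (all '*' are convolutions)
x = (np.convolve(s, h_xs)[:1000]
     + np.convolve(v, h_xv)[:1000]
     + n_x)

print(x.shape)  # → (1000,)
```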

Pages 45–49: Fundamental Problems for Signal Acquisition

[Figure: sources s_1...s_M, speakers S_1...S_M, acoustic paths H_xs, H_xv, processing blocks W_vu, W_yx, W_yu, signals u, v, x, y, noise n]

Goal: Undistorted source signals

  y = W_yu * u + W_yx * x, required to equal s * delta(k - k0)

where x = H_xs * s + H_xv * v + n_x

3 subproblems:

• Echo cancellation: (W_yu + W_yx * H_xv * W_vu) * u = 0

• Source separation and dereverberation: W_yx * H_xs * s = s * delta(k - k0)

• Noise and interference suppression: W_yx * n_x = 0

Components of x, i.e., H_xs * s, H_xv * v, n_x, must be separated by W!

Nakatani, Sehr, Kellermann: Reverberant Speech Processing 12
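The echo-cancellation subproblem is typically attacked by identifying the loudspeaker-to-microphone path with an adaptive filter and subtracting the estimated echo from the microphone signal. A minimal NLMS sketch (a standard adaptive-filtering choice, not one prescribed by the slides), with a hypothetical 5-tap echo path and no local speech or noise:

```python
import numpy as np

rng = np.random.default_rng(1)

h_echo = np.array([0.9, -0.4, 0.2, 0.1, -0.05])  # unknown echo path (toy values)
L = len(h_echo)
u = rng.standard_normal(5000)                     # far-end (loudspeaker) signal
x = np.convolve(u, h_echo)[:len(u)]               # echo picked up by the microphone

w = np.zeros(L)        # adaptive estimate of the echo path
mu, eps = 0.5, 1e-8    # NLMS step size and regularizer (assumed values)
for k in range(L, len(u)):
    u_vec = u[k - L + 1:k + 1][::-1]              # most recent L far-end samples
    e = x[k] - w @ u_vec                          # echo-cancelled (error) signal
    w += mu * e * u_vec / (u_vec @ u_vec + eps)   # NLMS coefficient update

# Normalized misalignment between the estimate and the true path.
misalignment = np.linalg.norm(w - h_echo) / np.linalg.norm(h_echo)
print(misalignment < 0.01)  # → True
```

In the noiseless single-channel case the filter converges to the echo path and the error signal goes to zero, which is exactly the condition (W_yu + W_yx * H_xv * W_vu) * u = 0 restated for one microphone.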

Page 50: Fundamentals - Room Impulse Response (RIR) properties

Elements of H_zv, H_xv, H_xs are room impulse responses (RIRs).

Typical structure of RIRs:

[Figure: RIR h over time t: direct sound, early reflections, late reverberation]

Main characteristic parameters:

T60: time for the exponential decay of the envelope by 60 dB

DRR: Direct-to-Reverberant (Energy) Ratio

Nakatani, Sehr, Kellermann: Reverberant Speech Processing 13
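The DRR is the ratio of direct-sound energy to reverberant energy, usually in dB. A minimal sketch on a synthetic RIR, assuming a common statistical room model (unit direct sound plus an exponentially decaying noise tail) and a 2.5 ms direct/reverberant boundary; both are illustrative choices, not values from the tutorial:

```python
import numpy as np

fs = 12000                         # sampling rate in Hz (assumed)
t60 = 0.3                          # reverberation time in s (assumed)
n = np.arange(int(t60 * fs))

rng = np.random.default_rng(2)
# Synthetic RIR: exponentially decaying noise tail; the envelope drops
# by 60 dB over T60 by construction.
decay = np.exp(-3 * np.log(10) * n / (t60 * fs))
h = 0.1 * rng.standard_normal(len(n)) * decay
h[0] = 1.0                         # direct sound

# DRR: energy of the direct sound (first 2.5 ms here) vs. the remainder.
n_direct = int(0.0025 * fs)
drr_db = 10 * np.log10(np.sum(h[:n_direct]**2) / np.sum(h[n_direct:]**2))
print(round(drr_db, 1))
```

As the slides show later, DRR falls with source-microphone distance even though T60 stays fixed, so the two parameters describe different aspects of the same RIR.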

Pages 51–53: Fundamentals - Room Impulse Response (RIR) properties

• Reverberation time T60

  car ≈ 50 ms; concert halls ≈ 1...2 s

• FIR models

  typically L_H ≈ T60 · fs / 3 coefficients

  nonminimum-phase

  many zeros close to the unit circle

Example: office 5.5 m × 3 m × 2.8 m, T60 ≈ 300 ms, fs = 12 kHz.

[Figure: measured RIR amplitude over taps, and zero locations in the complex plane clustering near the unit circle]

Nakatani, Sehr, Kellermann: Reverberant Speech Processing 14
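The L_H ≈ T60 · fs / 3 rule of thumb makes the scale of the problem concrete; the helper below is a trivial sketch of it (the car and concert-hall sampling rates are assumed, only the office example matches the slide):

```python
def rir_fir_length(t60_s, fs_hz):
    """Rule-of-thumb FIR model length L_H ≈ T60 * fs / 3 (from the slides)."""
    return int(t60_s * fs_hz / 3)

# Office example from the slides: T60 ≈ 300 ms at fs = 12 kHz.
print(rir_fir_length(0.3, 12000))   # → 1200
print(rir_fir_length(0.05, 16000))  # car, T60 ≈ 50 ms (fs assumed)
print(rir_fir_length(2.0, 16000))   # concert hall, T60 ≈ 2 s (fs assumed)
```

Thousand-tap nonminimum-phase filters with zeros near the unit circle are exactly why naive inversion of RIRs is ill-conditioned, which motivates the robust and blind inverse-filtering methods of Part II.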

Pages 54–57: Fundamentals - RIR properties (cont'd)

RIRs for varying source-mic distance (d1 = 1 m vs. d2 = 4 m, T60 ≈ 900 ms)

[Figure: RIRs h(n) over t in s for distances 1 m and 4 m]

Energy decay curves (EDC [Schröder 1965])

[Figure: energy decay curves in dB over t in s for distances 1 m and 4 m]

DRR (Direct-to-Reverberant Energy Ratio): 4.9 dB / -4.0 dB

RIR, DRR ⇔ Reverberation time T60

Nakatani, Sehr, Kellermann: Reverberant Speech Processing 15
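The Schröder EDC is the backward-integrated squared RIR, and T60 is usually read off its straight-line region. A minimal sketch on a synthetic RIR with a known T60; the -5 dB to -25 dB evaluation range (the common T20 convention) and all signal parameters are assumptions for illustration:

```python
import numpy as np

def edc_db(h):
    """Schröder energy decay curve: backward-integrated RIR energy in dB."""
    energy = np.cumsum(h[::-1]**2)[::-1]       # tail energy from each tap onward
    return 10 * np.log10(energy / energy[0])

# Synthetic exponentially decaying RIR with a known T60 (assumed 0.5 s).
fs, t60 = 8000, 0.5
n = np.arange(int(1.5 * t60 * fs))
rng = np.random.default_rng(3)
h = rng.standard_normal(len(n)) * np.exp(-3 * np.log(10) * n / (t60 * fs))

# Estimate T60 from the -5 dB to -25 dB span of the EDC (T20 method):
# extrapolate the 20 dB decay to 60 dB.
edc = edc_db(h)
i5 = int(np.argmax(edc <= -5))
i25 = int(np.argmax(edc <= -25))
t60_est = 3 * (i25 - i5) / fs
print(round(t60_est, 3))
```

For an ideal exponential decay the EDC is a straight line at -60 dB per T60, so the estimate recovers the construction value closely.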

Page 58: Fundamentals - RIR properties (cont'd)

Variability with displacements:

Mic displacement of 4.2 cm (source distance d = 4 m):

[Figure: RIR 1, RIR 2, and the difference between RIR 1 and RIR 2, h(n) over t in s]

System error norm: 0.23 dB

Shift of the RIR by 1 sample:

[Figure: RIR 1 and the difference between RIR 1 and RIR 1 shifted by 1 sample, h(n) over t in s]

System error norm: 2.56 dB

Nakatani, Sehr, Kellermann: Reverberant Speech Processing 16
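The system error norm quantifies how different two RIRs are. One common definition is the normalized misalignment ||h_ref - h||² / ||h_ref||² in dB; the slides' exact normalization may differ, so treat the sketch below, with its toy decaying-noise RIR, as illustrative of the effect rather than a reproduction of the slide's numbers:

```python
import numpy as np

def system_error_norm_db(h_ref, h):
    """Normalized system error ||h_ref - h||^2 / ||h_ref||^2 in dB
    (one common definition; the slides may normalize differently)."""
    return 10 * np.log10(np.sum((h_ref - h)**2) / np.sum(h_ref**2))

rng = np.random.default_rng(4)
# Toy RIR: decaying white noise (illustrative stand-in for a measured RIR).
h = rng.standard_normal(2000) * np.exp(-0.002 * np.arange(2000))

# Shifting the RIR by a single sample decorrelates it almost completely,
# so the error energy is close to twice the RIR energy (about +3 dB).
h_shift = np.roll(h, 1)
h_shift[0] = 0.0
err_db = system_error_norm_db(h, h_shift)
print(round(err_db, 2))
```

This sensitivity to sample-level displacement is the point of the slide: an equalizer designed for one RIR can be badly mismatched after a tiny movement of source or microphone.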

Page 59: Fundamentals - Reverberation in signal representations

Clean vs. reverberated with T60 ≈ 900 ms, d1 = 4 m and T60 ≈ 3.1 s, d2 = 5 m

Time domain:

[Figure: clean signal s and the two reverberated signals x over t in s - pauses filled!]

STFT domain:

[Figure: spectrograms (f in Hz over t in s, magnitude in dB) of the clean and reverberated signals - pauses filled!]

Nakatani, Sehr, Kellermann: Reverberant Speech Processing 17
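The "pauses filled" effect is just convolution smearing energy forward in time, and it can be reproduced with toy signals. A minimal sketch, assuming a synthetic burst-plus-pause "utterance" and a decaying-noise RIR (all parameters made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy "speech": a 4000-sample burst followed by a 4000-sample pause.
s = np.concatenate([rng.standard_normal(4000), np.zeros(4000)])

# Toy RIR: unit direct sound plus an exponentially decaying noise tail.
n = np.arange(2000)
h = rng.standard_normal(2000) * np.exp(-0.003 * n)
h[0] = 1.0
x = np.convolve(s, h)[:len(s)]           # reverberant signal

# Energy in the pause region: reverberation "fills" the pause.
pause_clean = np.sum(s[4000:]**2)
pause_rev = np.sum(x[4000:]**2)
print(pause_clean == 0.0, pause_rev > 0.0)  # → True True
```

The same smearing appears in every downstream representation, which is why the next slide shows pauses filled in the log-mel and MFCC domains used by ASR front ends.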

Page 60: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann

Fundamentals - Reverberation in ASR features

Clean vs. reverberated with T60 ≈ 900ms, d1 = 4m and T60 ≈ 3.1s, d2 = 5m

Logmelspec domain:

[Figure: log mel spectrograms (mel channel over t in s) of the clean and the two reverberated signals. Pauses filled!]

MFCC domain:

[Figure: MFCC trajectories (cepstral coefficient over t in s) of the clean and the two reverberated signals. Pauses of c0 filled!]


Dereverberation for Speech Enhancement

Basic Idea: Separate speech production from RIR, equalize the latter

[Diagram: glottal excitation → vocal tract (to be preserved) → room (to be equalized)]

’Blind’ problem! (no reference signal for the RIR input)

Distinction:

• Partial Deconvolution (removes reverberation by RIR inversion, ideally without speech distortion)

vs.

• Reverberation Suppression (a compromise between dereverberation and signal distortion is necessary)


Dereverberation for Signal Enhancement (cont’d)

Dealing with ’Blindness’ by exploiting

Prior Knowledge on

A) Speech production models (e.g., source-filter model, HMM) and signal properties (nonwhiteness, nonstationarity, non-Gaussianity)

B) Room acoustics parameters (e.g., T60)

C) Location and radiation characteristics of the speech source

and some Useful Assumptions

D) Joint moments (e.g., correlation) of signal samples: small lags characterize speech ⇔ large lags characterize reverberation

E) Speech signal statistics change faster than RIRs

F) Multichannel recordings: the speech component is the same ⇔ the RIRs are different



Dereverberation for Signal Enhancement - Approaches

SignalDereverberation

PartialDeconvolution

Single-channel Multichannel

ReverberationSuppression

Single-channel Multichannel


Dereverberation - Single-Channel Partial Deconvolution

Single-channel partial deconvolution

Can exploit speech models and properties (A) and correlation and stationarity assumptions (D, E) for identifying an RIR estimate

Inversion of a single RIR involves [Neely 1979]:

- removing the allpass component of the nonminimum-phase RIR → approximated by a delay

- inverting zeros close to, or on, the unit circle → approximated by ’channel shortening’

For realization problems see, e.g., [Mourjopoulos 1994], [Naylor 2010]


Dereverberation - Multichannel Partial Deconvolution

Multichannel partial deconvolution

Can additionally exploit spatial diversity (incl. assumption F) and prior knowledge of source location and radiation characteristics (C) for identifying the RIRs

Spatial diversity facilitates RIR identification

Perfect inversion with FIR filters is possible (MINT [Miyoshi 1988]):

- exact knowledge of RIR lengths required

- no common zeros of the RIRs allowed

Indirect approaches often invert in subbands for robustness (e.g., [Naylor 2005])

Direct approaches to identify a robust inverse exist (e.g., [Buchner 2004], [Buchner 2010], and below!)


Dereverberation - Single-Channel Reverberation Suppression

Single-channel Reverberation Suppression

can exploit speech models and properties (A) and correlation and stationarity assumptions (D, E), e.g., for equalizing the vocal tract IR and suppressing reverberation in the LPC residual (e.g., [Yegnanarayana 2000], [Gaubitch 2006])

can exploit prior knowledge on room acoustics (e.g., T60) to estimate the PSD of the reverberation and use spectral subtraction methods as common for additive noise (e.g., [Lebart 2001])


Dereverberation - Multichannel Reverberation Suppression

Multichannel Reverberation Suppression

can additionally exploit spatial diversity (incl. assumption F) and prior knowledge on source location and radiation characteristics (C), e.g.,

beamforming using only prior knowledge of source location and radiation characteristics (C) (e.g., [Griebel 2001])

spatial diversity for multichannel spectral subtraction (e.g., [Allen 1977]) or subspace methods (e.g., [Gannot 2003])

spatial diversity complemented by prior knowledge on room acoustics parameters (e.g., [Habets 2005])


Handling Reverberation for Automatic Speech Recognition

Block diagram of ASR system

pre−

training

speechsignal

transcription

transcriptionrecog−nition

processing extractionfeature acoustic

modellanguagemodel

A

B

C

D

REMOS

Strategies

A) signal-based approaches

B) feature-based approaches

C) model-based approaches

D) decoder-based approaches



Part II. Multichannel blind inverse filtering


Two approaches for signal dereverberation

Signal dereverberation

- Partial deconvolution: “robust” blind inverse filtering is the main topic of Part II

- Reverberation suppression: [Lebart 2001], [Habets 2005], [Löllmann 2009], [Erkelens 2010], [Kameoka 2009], [Jeub 2010] and others


Multichannel inverse filtering

Linear filtering:

y_t = Σ_{m=1..M} Σ_{k=0..K} w_k^(m) x_{t−k}^(m)

[Diagram: clean speech s_t → RIRs h_t^(1), …, h_t^(M) → reverberant speech x_t^(1), …, x_t^(M) (mics) → inverse filter w_t^(1), …, w_t^(M) → + → dereverberated signal y_t]

Goal: estimate w_t^(m) s.t. y_t = s_t

(m: mic index; t: time index; an overline denotes the set of variables for all t and m)


Part II. Multichannel blind inverse filtering

- Example applications
  - Professional audio post-production
  - Meeting recognition with microphone arrays
- Fundamentals: dereverberation with inverse filtering
  - What is an inverse filter
  - Robust ‘approximate’ inverse filter
- Blind inverse filtering
  - Overview of basic approaches
  - Closer look: multichannel linear prediction with a time-varying source model
- Integration with blind source separation


Application to audio post-production

[Movie/TV creation]

Step 1: Sound & video recording on location (actor/actress, microphone(s))
Step 2: Audio post-production (de-noising, de-reverb, sound effects)


Dereverberation plug-in for Pro Tools: NML RevCon-RR (sold by TAC System, Inc.)

Dereverberation system for audio post production [Kinoshita 2008]


Who spoke when, what, to whom, and how?

Show & Tell ST-3.2: Thursday, March 29, 10:30-12:30: Real-time Meeting Browser

Recognize speech and other audio events

Online meeting recognition [Hori 2012]

Challenges: reverberation, simultaneous speech, background noise


Online/offline processing flow of meeting recognition

Mic signals → Dereverberation → dereverberated microphone signals → Voice activity detection → Speech separation → separated signals → Noise suppression → cleaned signals → ASR → word sequence

(Dereverberation serves as preprocessing for all following signal processing units)


ASR performance w/ and w/o dereverberation

Test data: meetings by 4 speakers (15 min x 8 sessions)
Recording: 8 mics (T60: about 350 ms, speaker-mic distance: 100 cm)
Acoustic model: trained on CSJ (Corpus of Spontaneous Japanese), headset recordings
Language model: vocabulary size 156K (LVCSR)

Conditions: Baseline: distant microphone (w/o enhancement); w/o derev: BSS + denoise; w/ derev: derev. + BSS + denoise; Headset: close microphone (w/o enhancement)

[Bar chart of word error rates (%):
Online processing (latency = 1 s for preprocessing, w/o speaker adaptation): Baseline 86.5 %, w/o derev 72.1 %, w/ derev 56.6 %, Headset 30.6 %
Offline processing (w/ unsupervised speaker adaptation): Baseline 78.9 %, w/o derev 38.0 %, w/ derev 35.9 %, Headset 27.4 %]


Questions to be answered

• What is inverse filtering?
• Is the inverse filter robust against interferences?
• Can we estimate the inverse filter with blind processing?

[Diagram: s_t → h_t^(1), …, h_t^(M) → x_t^(1), …, x_t^(M) → w_t^(1), …, w_t^(M) → + → dereverberated signal y_t]


Answers at a glance

• What is inverse filtering? Inversion of room impulse responses (RIRs)

• Is the inverse filter robust against interferences? Unfortunately no, but there is a robust ‘approximate’ inverse filter

• Can we estimate the inverse filter with blind processing? Yes, we can, by using cues for distinguishing speech from RIRs


Part II. Multichannel blind inverse filtering

- Example applications
- Fundamentals: dereverberation with inverse filtering ← assume non-blind processing for analysis purposes
- Blind inverse filtering
- Integration with blind source separation


Inversion of RIRs = Inversion of matrix transformation

[Diagram: clean speech s_t → RIRs (viewed as matrix transformation) → reverberant speech x_t^(1), …, x_t^(M) → inverse filtering (viewed as matrix inversion) → dereverberated speech y_t]


Matrix/vector representations of RIR convolution/filtering

Single-channel filtering:

y_t = Σ_{k=0..K} w_k^(m) x_{t−k}^(m) = w^(m)T x_t^(m)

with w^(m) = [w_0^(m), …, w_K^(m)]^T and x_t^(m) = [x_t^(m), x_{t−1}^(m), …, x_{t−K}^(m)]^T

Multichannel filtering:

y_t = Σ_{m=1..M} w^(m)T x_t^(m) = w^T x_t

with w = [w^(1)T, …, w^(M)T]^T and x_t = [x_t^(1)T, …, x_t^(M)T]^T

Single-channel RIR convolution:

x_t^(m) = H^(m) s_t

where H^(m) is the (K+1) × (K+L_h) banded Toeplitz matrix with the RIR taps h_0^(m), …, h_{L_h−1}^(m) on its diagonals, and s_t = [s_t, s_{t−1}, …, s_{t−K−L_h+1}]^T

Multichannel RIR convolution:

x_t = H s_t with H = [H^(1)T, …, H^(M)T]^T
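As a minimal numpy sketch (not from the tutorial itself), the single-channel convolution matrix H^(m) can be built and checked against numpy's convolution; the RIR, signal, and sizes are arbitrary toy choices.

```python
import numpy as np

def conv_matrix(h, K):
    """(K+1) x (K+len(h)) Toeplitz matrix H such that
    H @ [s_t, ..., s_{t-K-Lh+1}] = [x_t, ..., x_{t-K}] for x = h * s."""
    Lh = len(h)
    H = np.zeros((K + 1, K + Lh))
    for i in range(K + 1):
        H[i, i:i + Lh] = h      # row i computes x_{t-i}
    return H

rng = np.random.default_rng(1)
h = rng.standard_normal(5)       # toy RIR, L_h = 5
s = rng.standard_normal(50)      # clean signal
x = np.convolve(s, h)            # reverberant signal

K = 10
t = 30                           # any index with t - K - len(h) + 1 >= 0
s_vec = s[t - K - len(h) + 1: t + 1][::-1]   # [s_t, s_{t-1}, ...]
x_vec = conv_matrix(h, K) @ s_vec
assert np.allclose(x_vec, x[t - K: t + 1][::-1])   # matches [x_t, ..., x_{t-K}]
```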


Existence of inverse filter

• A column vector w is an inverse filter when it satisfies y_t = w^T x_t = w^T H s_t = s_t, i.e.,

w^T H = e^T, where e = [1, 0, …, 0]^T and s_t = [s_t, s_{t−1}, …, s_{t−K−L_h+1}]^T

• An inverse filter exists when H is invertible, i.e., it has full column rank, and is obtained as

w^T = e^T H^+, where H^+ = (H^T H)^{−1} H^T
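A numerical sanity check of this construction, under the assumptions of the slides (two random RIRs, hence no common zeros, and noise-free observations); the convolution-matrix helper and the sizes are toy choices of this sketch.

```python
import numpy as np

rng = np.random.default_rng(2)
Lh, K = 8, 20                      # RIR length and filter order per channel
h = [rng.standard_normal(Lh) for _ in range(2)]   # M = 2 random RIRs

def conv_matrix(h, K):
    """(K+1) x (K+len(h)) convolution matrix of the RIR h."""
    H = np.zeros((K + 1, K + len(h)))
    for i in range(K + 1):
        H[i, i:i + len(h)] = h
    return H

# Stacked multichannel matrix: 2*(K+1) rows, K+Lh columns.
H = np.vstack([conv_matrix(h[0], K), conv_matrix(h[1], K)])
assert H.shape[0] >= H.shape[1]    # more rows than columns needs M > 1

# w^T = e^T H^+, with H^+ the pseudo-inverse (full column rank assumed).
e = np.zeros(H.shape[1]); e[0] = 1.0
w = np.linalg.pinv(H).T @ e

# Perfect equalization: w^T H = e^T, so y_t = w^T x_t = s_t.
assert np.allclose(w @ H, e, atol=1e-6)
```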


M (#mics) > 1 is required for single source case

• H is invertible, i.e., full column rank, if and only if (#rows of H) >= (#columns of H) and all columns are linearly independent

• In the case of a single source, (#rows of H) >= (#columns of H) is satisfied if and only if M (#mics) > 1

[Diagram: H = [H^(1); H^(2); …; H^(M)] stacked over channels, comparing #rows and #columns]


Generalization to the N sources × M microphones case

– An inverse filter W exists when H is full column rank:

• M (#mics) > N (#sources)
• H(z) does not contain common zeros

[Diagram: sources s_t^(1), …, s_t^(N) → H → observations x_t^(1), …, x_t^(M) → W → outputs y_t^(n) = s_t^(n), n = 1, …, N]

Multiple-input/output inverse theorem (MINT) [Miyoshi 1988]


Assumptions for inverse filtering

• Invertible RIRs
• No additive noise
• Time-invariant RIRs

→ Not realistic!

Problem of inverse filtering: the inverse filter is too sensitive to modeling errors (noise or RIR changes)


Inverse filter greatly amplifies noise

Noise-free reverberant case
• Clean speech
• Reverberant speech: synthesized using a fixed RIR (RT60 = 0.5 s)
• Dereverberated speech using an inverse filter for known RIRs (2-channel)

Noisy reverberant case
• Noisy reverberant speech (SNR = 30 dB)
• Speech processed using the same inverse filter (2-channel)


Why the inverse filter is so sensitive to additive noise

With additive noise n_t, the observation is x_t = H s_t + n_t, and the inverse-filter output becomes

y_t = w^T x_t = s_t + ñ_t, where ñ_t = e^T H^+ n_t

The minimum singular value σ_min of H is often extremely small (compared to the maximum singular value), so the maximum singular value of H^+, σ_max^inv = 1/σ_min, is often extremely large → the filter extremely amplifies the noise


Standard numerical approach for robustness [Engl 1996]

• Regularization
– A general technique for robust matrix inversion

• Add a very small positive constant δ to the diagonal of H^T H when calculating the pseudo-inverse:

H̃^+ = (H^T H + δ I)^{−1} H^T (I: identity matrix), instead of H^+ = (H^T H)^{−1} H^T

– This reduces the maximum singular value σ_max^inv of H̃^+

[Audio demo: noisy reverberant vs. processed]

Noise amplification is greatly mitigated
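A toy illustration of the effect, not the tutorial's own experiment: the near-identical RIR pair below is an assumption chosen to make H ill-conditioned, and δ = 0.01 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)

def conv_matrix(h, K):
    """(K+1) x (K+len(h)) convolution matrix of the RIR h."""
    H = np.zeros((K + 1, K + len(h)))
    for i in range(K + 1):
        H[i, i:i + len(h)] = h
    return H

# Two nearly identical RIRs give an ill-conditioned stacked H.
h1 = rng.standard_normal(8)
h2 = h1 + 1e-3 * rng.standard_normal(8)
H = np.vstack([conv_matrix(h1, 20), conv_matrix(h2, 20)])

H_pinv = np.linalg.pinv(H)                         # (H^T H)^{-1} H^T
delta = 1e-2                                       # small positive constant
H_reg = np.linalg.solve(H.T @ H + delta * np.eye(H.shape[1]), H.T)

# Noise amplification is governed by the largest singular value of the inverse.
amp_plain = np.linalg.norm(H_pinv, 2)
amp_reg = np.linalg.norm(H_reg, 2)
assert amp_reg < amp_plain                         # regularization tames it
assert amp_reg <= 1.0 / (2.0 * np.sqrt(delta)) + 1e-9
```

The last assertion reflects that the singular values of H̃^+ are σ/(σ² + δ), which is bounded by 1/(2√δ) no matter how small σ_min gets.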


Room-acoustics-motivated approach for robustness

• Channel shortening
– Set “direct signal + early reflections” as the target signal, and reduce only the late reverberation

[Illustration of an RIR over t: direct path, early reflections (about 30 ms, h_e), late reverberation (h_l)]

Inverse filtering: w^T = e^T H̃^+ → target: direct path only; target-to-reverberation ratio (TRR) e.g. −3 dB

Channel shortening: w^T = (e ∗ h_e)^T H̃^+ → target: direct path + early reflections; TRR e.g. 8 dB

The TRR with channel shortening is much higher than the TRR with inverse filtering

[Audio demo: noisy reverberant vs. processed]
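A sketch of the channel-shortening idea in the noise-free case, with plain pseudo-inversion standing in for H̃^+ and a 4-tap early part of channel 1 as the assumed target response; all sizes are toy choices.

```python
import numpy as np

rng = np.random.default_rng(4)

def conv_matrix(h, K):
    """(K+1) x (K+len(h)) convolution matrix of the RIR h."""
    H = np.zeros((K + 1, K + len(h)))
    for i in range(K + 1):
        H[i, i:i + len(h)] = h
    return H

h = [rng.standard_normal(12) for _ in range(2)]    # M = 2 toy RIRs
K = 30
H = np.vstack([conv_matrix(h[0], K), conv_matrix(h[1], K)])

# Target: direct path + early reflections (first 4 taps of channel 1);
# the late part of the overall response is forced to zero.
h_e = np.zeros(H.shape[1])
h_e[:4] = h[0][:4]

w = np.linalg.pinv(H).T @ h_e                      # w^T = h_e^T H^+
g = w @ H                                          # overall response s -> y
assert np.allclose(g, h_e, atol=1e-6)              # late reverberation removed
```

Instead of equalizing the RIR down to a single impulse, the filter only shortens the overall response, which is what makes the approach less aggressive and hence more robust.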


Intermediate summary II-1

• Dereverberation: inversion of RIRs
– assuming the RIRs to be a time-invariant linear system

• An inverse filter exists
– when we have more microphones than sources
– but it may be very sensitive to additive noise

• An ‘approximate’ inverse filter is robust against noise
– based on regularization and channel shortening

Page 116

Blind inverse filtering based dereverberation

[Diagram: unknown speech production system → clean speech s_t → unknown RIRs → mic observations x_t^(m) → inverse filtering w → dereverberated speech y_t; the inverse filter w is estimated blindly from the observations]

Two approaches:
• RIR estimation + RIR inversion
• Direct estimation of the inverse filter

Page 117

Conventional decorrelation approaches for a stationary white signal

[Diagram: s_t → unknown RIRs → mics x_t^(m) → inverse filtering w → y_t; approach: estimate w that decorrelates x_t^(m)]

A stationary white source satisfies E[s_t s_t'] = 0 for t ≠ t'; reverberation increases the correlation, E[x_t^(m) x_t'^(m')] ≠ 0. The filter w is chosen so that the output is again decorrelated, E[y_t y_t'] = 0.

• SOS approach assumes s_t to be stationary white Gaussian
  – Multichannel linear prediction (MCLP) [Slock 1994], [Abed-Meraim 1997]
• HOS approach assumes s_t to be an i.i.d. sequence
  – Higher-order decorrelation [Sato 1975], [Bellini 1994]

Page 118

Multichannel linear prediction (MCLP)

Predict the reverberation in the current observation from the past observations of all M mics:

  x_t^(1) = s_t + r_t^(1)        (current observation = direct signal + reverberation)

  r_t^(1) = Σ_{m=1}^{M} Σ_{τ=1}^{L} c_τ^(m) x_{t-τ}^(m)     (r_t^(1): reverberation)

Page 119

MCLP based decorrelation [Slock 1994], [Abed-Meraim 1997]

• x_t^(1) is modeled by

    x_t^(1) = Σ_{m=1}^{M} Σ_{τ=1}^{K} c_τ^(m) x_{t-τ}^(m) + s_t  =  x̄_{t-1}^T c + s_t

  where c = [c_1^(1), …, c_K^(1), …, c_1^(M), …, c_K^(M)]^T are the prediction coeffs.
  – c is equivalent to the inverse filter w

• c can be estimated by minimizing the prediction error when the sources are stationary and uncorrelated in time:

    ĉ = argmin_c Σ_t | x_t^(1) - x̄_{t-1}^T c |²

    predicted signal (= reverberation):  x̄_{t-1}^T c
    prediction error (= direct signal):  s_t = x_t^(1) - x̄_{t-1}^T c

  – Quadratic form: optimized using a closed-form solution
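The closed-form estimation on this slide can be sketched directly with an ordinary least-squares solver; the channel impulse responses, prediction order, and white test source below are illustrative.

```python
import numpy as np

# Sketch of MCLP: stack delayed multichannel samples, solve the quadratic
# problem min_c sum_t |x_t - xbar_{t-1}^T c|^2 in closed form, and take
# the prediction error as the dereverberated (direct) signal.
def mclp(X, K):
    """X: (M, T) multichannel observation, K: prediction order."""
    M, T = X.shape
    rows, target = [], []
    for t in range(K, T):
        # xbar_{t-1}: K past samples from each of the M channels
        rows.append(np.concatenate([X[m, t - K:t][::-1] for m in range(M)]))
        target.append(X[0, t])
    A, b = np.asarray(rows), np.asarray(target)
    c, *_ = np.linalg.lstsq(A, b, rcond=None)
    err = b - A @ c          # prediction error = estimate of the direct signal
    return c, err

rng = np.random.default_rng(2)
s = rng.standard_normal(2000)                 # stationary white source
h = [np.array([1.0, 0.6, 0.3, 0.1]),          # two short illustrative RIRs
     np.array([0.8, 0.5, 0.2, 0.05])]
X = np.stack([np.convolve(s, hi)[:2000] for hi in h])
c, err = mclp(X, K=8)
```

With two channels that share no common zeros, an exact multichannel predictor of the reverberation exists, so the prediction error closely matches the white source.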

Page 120

Why dereverberation can be achieved by MCLP

Assume E[s_t s_t'] = 0 for t ≠ t'  (h_0^(1) = 1 is usually assumed for MCLP without loss of generality)

Write the observation as the direct signal plus the true reverberation, x_t^(1) = s_t + x̄_{t-1}^T h̄, where x̄_{t-1}^T h̄ expresses the reverberation through past observations. Since s_t is uncorrelated with the past observations, the cross terms vanish and the cost function separates:

  Σ_t | x_t^(1) - x̄_{t-1}^T c |²
    = Σ_t | s_t + x̄_{t-1}^T h̄ - x̄_{t-1}^T c |²
    = Σ_t |s_t|² + Σ_t | x̄_{t-1}^T h̄ - x̄_{t-1}^T c |²
                     (true reverberation - predicted reverberation)

Minimization is achieved only when the predicted reverberation x̄_{t-1}^T c equals the true reverberation x̄_{t-1}^T h̄, and the prediction error then equals the direct signal s_t.

Page 121

Robustness of MCLP against noise

Let z_t^(m) = x_t^(m) + n_t^(m) be the noisy reverberant observation, where n_t^(m) is additive noise (or can be viewed as a modeling error).

Assume x_t^(m) and n_t^(m) to be uncorrelated; then the cost function decomposes as

  Σ_t | z_t^(1) - z̄_{t-1}^T c |²  =  Σ_t | x_t^(1) - x̄_{t-1}^T c |²  +  Σ_t | n_t^(1) - n̄_{t-1}^T c |²
                                      (cost for dereverberation)        (cost for noise amplification)

→ Regularization is inherently included

Page 122

Problem of the decorrelation approach for speech dereverberation

[Diagram: unknown speech production system → clean speech s_t → unknown RIRs → mics x_t^(m) → inverse filtering w → y_t; w is estimated so that it decorrelates x_t^(m)]

Problem: the estimated filter not only dereverberates but also decorrelates the speech s_t itself; both the speech and the RIRs are decorrelated.

Key to the solution: use cues to separate speech and RIRs.

Page 123

Cues for separating speech and RIRs

Cue                       | Speech                                 | RIRs
--------------------------|----------------------------------------|------------------------------------------
Nonstationarity           | Stationary only within short time      | Stationary over long time periods, of
                          | periods, of the order of 30 ms         | the order of 1000 ms or larger
Auto-correlation duration | Correlated only within short time      | Correlated within long time intervals,
                          | intervals, of the order of 30 ms       | over 100 ms
Inter-channel difference  | Common to all the microphone signals   | Different for each microphone

Page 124

Approaches to blind inverse filtering

• Subspace method (RIR estimation + inversion)
  – [Furuya 1997], [Gannot 2003], [Gaubitch 2006]
• Pre-whitening + decorrelation
  – Second-order statistics (SOS): [Gaubitch 2003], [Furuya 2007], [Triki 2007]
  – Higher-order statistics (HOS): [Gillespie 2001]
• Channel shortening
  – [Gillespie 2003], [Kinoshita 2009]
• Joint speech and reverberation modeling
  – [Hopgood 2003], [Buchner (TRINICON) 2010], [Yoshioka 2007], [Nakatani 2008]

(Cues exploited: inter-channel difference; auto-correlation duration; auto-correlation duration and nonstationarity)

Page 125

Pre-whitening + decorrelation

Assumption: pre-whitening removes only the short-time correlation of the speech, so that from the reverberant speech x_t = H s_t we obtain x̃_t = H s̃_t, where s̃_t is an (unknown) decorrelated version of the speech.

[Diagram: reverberant speech x_t^(m) → pre-whitening (reduce correlation within a short time interval) → x̃_t^(m) → estimate w that decorrelates x̃_t^(m) → inverse filtering of x_t^(m) with w → y_t^(m)]

• A typical method for pre-whitening
  – Low-dimensional (e.g., 12-dim) single-channel linear prediction is often used

Page 126

Channel shortening

• Introduce constraints so that dereverberation reduces only the late reverberation
  – Makes dereverberation robust and does not decorrelate the speech
• Techniques:
  – Correlation shaping [Gillespie 2003]
  – Multistep MCLP [Kinoshita 2009]

[Illustration: RIR split into direct path, early reflections, late reverberation; channel shortening removes only the late part]

Page 127

Multistep MCLP [Gesbert 1997], [Kinoshita 2009]

Predict only the late reverberation in the observation from past observations delayed by D (mics 1..M):

  x_t^(1) = s_t + r_t^(1)

  r_t^(1) = Σ_{m=1}^{M} Σ_{τ=D}^{K} c_τ^(m) x_{t-τ}^(m)

  s_t: direct signal + early reflections
  r_t^(1): late reverberation
  D: prediction delay (= 30-50 ms)

Page 128

Approaches to blind inverse filtering

• Subspace method (RIR estimation + inversion)
  – [Furuya 1997], [Gannot 2001], [Gaubitch 2006]
• Pre-whitening + decorrelation
  – Second-order statistics (SOS): [Gaubitch 2003], [Furuya 2006], [Triki 2006]
  – Higher-order statistics (HOS): [Gillespie 2001]
• Channel shortening
  – [Gillespie 2003], [Kinoshita 2006]
• Joint speech and reverberation modeling
  – [Hopgood 2003], [Buchner (TRINICON) 2010], [Yoshioka 2007], [Nakatani 2008]

(Cues exploited: spatial diversity; duration of auto-correlation; duration of auto-correlation and nonstationarity)

Page 129

Joint speech and reverberation modeling for dereverberation

[Diagram: unknown true generative system = source process → reverberation process → reverberant observation x_t; model of the generative system = parametric source model (parameters θ_s) + parametric reverberation model (parameters θ_h), giving p̃(x_t; θ_s, θ_h)]

Parameter estimation by:
• Likelihood maximization [Hopgood 2003], [Yoshioka 2007], [Nakatani 2008]
• Kullback-Leibler divergence minimization [Buchner (TRINICON) 2010]

Are the source and reverberation models distinguishable?

Page 130

Models for the source process and the reverberation process

Source model (SOS or HOS): time-varying & correlated only within a short interval
Reverberation model: stationary & correlated over a long interval

→ Distinguishable

Page 131

Multichannel blind partial deconvolution (MCBPD) by TRINICON

Cost function for SOS-TRINICON [Buchner 2010]:

  J_SOS = Σ_t ( log det R̂_{s,t} - log det R̂_{y,t} )

  R̂_{y,t}: autocorrelation matrix of the output signal y_t
  R̂_{s,t}: goal autocorrelation matrix from the source model

Full decorrelation (deconvolution) would drive the output toward whiteness; MCBPD instead drives it toward the goal source correlation R̂_{s,t}.

Page 132

Part II. Multichannel blind inverse filtering

- Example applications
  - Professional audio post production
  - Meeting recognition with microphone arrays
- Fundamentals: dereverberation with inverse filtering
  - What is an inverse filter
  - Robust 'approximate' inverse filter
- Blind inverse filtering
  - Overview of basic approaches
  - Closer look: multichannel linear prediction with time-varying source model
- Integration with blind source separation

Page 133

MCLP with a time-varying source model for dereverberation
[Yoshioka 2007], [Nakatani 2008, 2011]

Source process (SOS): time-varying short-time Gaussian
Reverberation process: MCLP

→ Distinguishable by likelihood maximization

Page 134

Reformulation of MCLP based on likelihood maximization

  L(c) = log p(x; c)
       = Σ_t log p(x_t^(1) | x_{1:t-1}; c) + const.     (conditional probability rule)
       = Σ_t log p_s(s_t) + const.                      (source model, s_t = x_t^(1) - x̄_{t-1}^T c)

Assume p_s(s_t) = N(s_t; 0, 1) (stationary white Gaussian); then

  L(c) = -(1/2) Σ_t | x_t^(1) - x̄_{t-1}^T c |² + const.

so that

  max_c L(c)   ⇔   min_c Σ_t | x_t^(1) - x̄_{t-1}^T c |²
  (maximize likelihood)   (minimize prediction error)

Page 135

Time-varying Gaussian source model (TVGSM)

Let s̄_t = [s_t, s_{t-1}, …, s_{t-N+1}]^T  (N frames, of the order of 30 ms)

1. Each short time segment is stationary multivariate Gaussian and can be characterized by R_t:
     p(s̄_t; R_t) = N(s̄_t; 0, R_t),   where R_t = E[ s̄_t s̄_t^T ] is an autocorrelation matrix
2. R_t varies over different time segments

R_t: parameters to be estimated

Page 136

MCLP with a multivariate source model

• The prediction error is assumed to follow the TVGSM:

    s_t = x_t^(1) - x̄_{t-1}^T c,   or in vector form   s̄_t = x̄_t^(1) - X_{t-1} c

  where (each vector spans N samples, of the order of 30 ms)
    s̄_t = [s_t, …, s_{t-N+1}]^T
    x̄_t^(1) = [x_t^(1), …, x_{t-N+1}^(1)]^T
    X_{t-1} = [x̄_{t-1}, x̄_{t-2}, …, x̄_{t-N}]^T

Page 137

Likelihood function of MCLP with TVGSM

  L(c, R) = Σ_t log p(s̄_t; c, R_t)

  where p(s̄_t; R_t) = N(s̄_t; 0, R_t) and s̄_t = x̄_t^(1) - X_{t-1} c, so that

  L(c, R) = - Σ_t ( ‖ x̄_t^(1) - X_{t-1} c ‖²_{R_t⁻¹} + log |R_t| )

  with the quadratic form ‖s̄‖²_{R⁻¹} = s̄^T R⁻¹ s̄

  First term: prediction error weighted by R_t⁻¹;  second term: normalization.
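For reference, the closed-form coefficient update used in the iterative procedure follows by zeroing the gradient of the weighted quadratic term (a sketch; the normalization term log|R_t| does not depend on c):

```latex
\hat{\mathbf{c}}
  = \Big( \sum_t \mathbf{X}_{t-1}^{T} \mathbf{R}_t^{-1} \mathbf{X}_{t-1} \Big)^{-1}
    \sum_t \mathbf{X}_{t-1}^{T} \mathbf{R}_t^{-1} \bar{\mathbf{x}}_t^{(1)}
```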

Page 138

Iterative optimization procedure

Initialize the source model:  R̂_t = E[ x̄_t x̄_t^T ]

Repeat:
1. Update the prediction coeffs. (closed form):
     ĉ = argmin_c Σ_t ‖ x̄_t^(1) - X_{t-1} c ‖²_{R̂_t⁻¹}
2. Update the source model:
   a. Dereverberate:  ŝ̄_t = x̄_t^(1) - X_{t-1} ĉ
   b. Calculate the autocorrelation matrix of ŝ̄_t:  R̂_t = E[ ŝ̄_t ŝ̄_t^T ]

A few iterations are sufficient for convergence.

Page 139

Importance of the time-varying source model

[Spectrograms (frequency in kHz vs. time): source signal, observation, processed A, processed B]

(A) MCLP with stationary white Gaussian source model
(B) MCLP with TVGSM

Conditions: T60: 0.5 s, recording: 2.5 s, source-mic distance: 1.5 m, # mics: 2

A few seconds of observation are sufficient for dereverberation

Page 140

Blind inverse filtering works in noisy environments

Noisy reverberant speech → dereverberation (multistep MCLP with TVGSM) → processed signal

                       T60 = 0.65 s          T60 = 0.39 s
SNR                    10 dB     15 dB       10 dB     15 dB
TRR* (unprocessed)     0.1 dB    0.1 dB      5.8 dB    5.8 dB
TRR (processed)        10.3 dB   11.4 dB     13.8 dB   15.2 dB

*TRR: target-to-reverberation ratio (target = direct signal + early reflections)

Conditions: noise = additive white noise (reproduced and recorded by 8 mics), # mics: 8, source-mic distance: 2 m

Reverberation is reduced; the noise may increase slightly, but not significantly.

Page 141

Computationally efficient implementation

• Subband decomposition approach [Nakatani 2010], [Yoshioka 2009b]
• Computational efficiency improves greatly

[Diagram: x_t^(m) → subband analysis → x_{n,f}^(m) for subbands f = 1..F → MCLP with TVGSM in each subband → y_{n,f} → subband synthesis → y_t]

Real-time factor (RTF) using MATLAB (RT60: 0.5 s, # mics: 2):
  Time-domain: 170    Subband: 0.8

Page 142

Processing flow with subband decomposition [Nakatani 2010]

1. Set analysis parameters:
   D: prediction delay (D should be the number of subband samples corresponding to 30 ms, or larger)
   L: length of the prediction filter
   M: # of mics
   m0: index of the target channel to be dereverberated
   η: coefficient for the flooring constant (e.g., η = 10⁻⁴)

2. Decompose the multichannel observed signal into a set of subband signals x_{n,f}^(m)
   (m: channel index, n: sample index, f: subband index; e.g., the filterbank of [Weiss 2000], or an STFT can also be used).
   E.g., the # of subbands is 512 (including negative frequencies) for 16 kHz sampling.

3. In each subband f, set initial estimates of the source variance as
     λ_{n,f} = max( |x_{n,f}^(m0)|², ε_f ),   with flooring constant ε_f = η max_n |x_{n,f}^(m0)|²

4. Obtain the vector representation of x_{n,f}^(m) over all channels:
     x̄_{n,f} = [ x̄_{n,f}^(1)T, x̄_{n,f}^(2)T, …, x̄_{n,f}^(M)T ]^T,
     x̄_{n,f}^(m) = [ x_{n,f}^(m), x_{n-1,f}^(m), …, x_{n-L+1,f}^(m) ]^T
   where T denotes non-conjugate transposition.

5. In each subband f, iterate the following until convergence is achieved:
   i.  Obtain the prediction filter as
         c_f = ( Σ_n x̄_{n-D,f} x̄_{n-D,f}^{*T} / λ_{n,f} )⁺ ( Σ_n x̄_{n-D,f} x_{n,f}^{(m0)*} / λ_{n,f} )
       where (·)⁺ and (·)* are the Moore-Penrose pseudo-inverse and complex conjugate operations
       (see [Yoshioka 2009b] for efficient calculation).
   ii. Obtain the dereverberated subband signal as
         y_{n,f} = x_{n,f}^(m0) - c_f^{*T} x̄_{n-D,f}
   iii. Update the source variance estimates as
         λ_{n,f} = max( |y_{n,f}|², ε_f )

6. Compose the dereverberated signal from the set of dereverberated subband signals.
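The flow above can be sketched end to end with an STFT filterbank (which the slide permits in place of the [Weiss 2000] filterbank); for brevity a single weighted-LS pass replaces the full iteration, and the window size, small diagonal regularization, and test signals are illustrative choices, not from the slides.

```python
import numpy as np

# Sketch of the subband flow: analysis -> per-subband delayed multichannel
# linear prediction with a floored variance model -> overlap-add synthesis.
def subband_dereverb(X, n=256, D=2, L=8, eta=1e-4):
    M, T = X.shape
    hop = n // 2
    win = np.sqrt(0.5 - 0.5 * np.cos(2 * np.pi * np.arange(n) / n))  # sqrt-Hann
    nf = 1 + (T - n) // hop
    # Analysis: Z[m, f, k] = DFT bin f of windowed frame k of channel m
    Z = np.stack([np.stack([np.fft.rfft(win * X[m, k*hop:k*hop+n])
                            for k in range(nf)], axis=1) for m in range(M)])
    Y = Z[0].copy()                          # target channel m0 = 0
    n0 = D + L - 1
    for f in range(Y.shape[0]):
        Xf = Z[:, f, :]
        A = np.stack([np.concatenate([Xf[m, k-D-L+1:k-D+1][::-1]
                                      for m in range(M)])
                      for k in range(n0, nf)])
        b = Xf[0, n0:]
        lam = np.maximum(np.abs(b)**2, eta * (np.max(np.abs(b)**2) + 1e-12))
        W = A.conj() / lam[:, None]
        G = W.T @ A + 1e-6 * np.eye(M * L)   # small regularization
        Y[f, n0:] = b - A @ np.linalg.solve(G, W.T @ b)
    # Synthesis: overlap-add with the same sqrt-Hann window
    y = np.zeros(T)
    for k in range(nf):
        y[k*hop:k*hop+n] += win * np.fft.irfft(Y[:, k], n)
    return y

rng = np.random.default_rng(5)
s = rng.standard_normal(8000)
decay = np.exp(-np.arange(800) / 200.0)
h = [rng.standard_normal(800) * decay * 0.3 for _ in range(2)]
h[0][0], h[1][0] = 1.0, 0.7                  # direct paths
X = np.stack([np.convolve(s, hm)[:8000] for hm in h])
y = subband_dereverb(X)
```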

Page 143

Part II. Multichannel blind inverse filtering

- Example applications
  - Professional audio post production
  - Meeting recognition with microphone arrays
- Fundamentals: dereverberation with inverse filtering
  - What is an inverse filter
  - Robust 'approximate' inverse filter
- Blind inverse filtering
  - Overview of basic approaches
  - Closer look: multichannel linear prediction with time-varying source model
- Integration with blind source separation

Page 144

Integration with blind source separation (BSS)

[Diagram: two clean speech sources, each reverberated, form a mixed reverberant speech signal; integrated BSS + dereverberation recovers clean speech estimates (the direct signals)]

Approaches:
• MCLP based approach [Yoshioka 2009b, 2011]
• TRINICON [Buchner 2010]

Page 145

Generative model for a reverberant sound mixture

[Diagram: source processes 1 and 2, each a time-varying Gaussian, generate s_t^(1) and s_t^(2); an instantaneous mixing process produces the non-reverberant mixtures x̃_t^(m); a multi-input multi-output MCLP (the reverberation process) produces the reverberant mixtures x_t^(m)]

  x̃_t^(m): non-reverberant mixture
  x_t^(m): reverberant mixture

All stages are jointly optimized by a maximum likelihood estimation approach [Yoshioka 2009b, 2011]

Page 146

Optimization procedure (subband-based implementation)

[Flowchart: initialization → loop { compute source estimates ŝ_t^(m) → update source model parameters → update de-mixing matrices → update prediction coefficients } until converged; each update is a closed-form optimization]

Page 147

Improvement in signal-to-interference ratio (SIR)

[Bar chart: SIR in dB (0 to 16) for unprocessed, BSS, and dereverb + BSS, at T60 = 0.3 s and T60 = 0.5 s]

Conditions: # sources: 2, # mics: 4, source-mic distance: 1.5 m, recording: 1 to 8 s (average 3.5 s), BSS: [Sawada 2007]

Results averaged over 672 pairs of utterances (TIMIT test set)

Page 148

Live demo

Page 149

TRINICON: a general framework for blind MIMO signal processing

[Diagram: P sources s^(1)…s^(P) → unknown mixing system → M observations x^(1)…x^(M) → unmixing system W_b → outputs y^(1)…y^(P); source models attached to the outputs]

Cost function [Buchner 2010]:

  J(b, W) = - Σ_i β(i, b) Σ_{j=0}^{N-1} ( log p̂_{s,PD}(y(i, j)) - log p̂_{y,PD}(y(i, j)) )

with PD-variate pdfs
• p̂_{s,PD} for the source (assumed or estimated)
• p̂_{y,PD} for the output
(P: source number, D: filter length; b: index of signal blocks)

Page 150

Comparison of SOS and HOS by TRINICON [Buchner 2010]

[Plots: SIR improvement (dB) and signal-to-reverberation ratio (SRR) improvement (dB) vs. number of iterations, for SOS, SOS+HOS, and BSS (w/o derev.)]

Conditions: # mics: 4, # sources: 2, T60: 700 ms, source-mic distance: 1.65 m, recording: 30 s

Page 151

Summary II-2

• Robust blind inverse filtering is possible
  – Using joint speech and reverberation modeling
    • Based only on a few seconds of observation (e.g., 2.5 s)
    • With a relatively small computational cost (e.g., RTF < 1)
    • In an online processing manner (e.g., latency = 1 s)
  – Under low SNR conditions (e.g., 10 dB SNR)
• Future challenges
  – Real-time adaptation of the inverse filter [Yoshioka 2009a], [Evers 2011]
  – Single-channel inverse filtering [Gillespie 2001]
  – Processing under more adverse noise conditions such as nonstationary diffuse noise
  – Optimal integration of inverse filtering and spectral-enhancement-based dereverberation

Page 152

Part III:

Robust Automatic Speech Recognition (ASR)

in Reverberant Environments

Nakatani, Sehr, Kellermann: Reverberant Speech Processing 88

Page 153

Part III: Robust ASR in Reverberant Environments

Introduction
Feature-based Approaches
Model-based Approaches
Decoder-based Approaches
REMOS


Page 155

ASR System: Block Diagram

[Block diagram: speech signal → pre-processing → feature extraction → recognition → transcription; the acoustic model and the language model are obtained by training on transcribed speech and are used in recognition; labels A-D mark the stages addressed by the approaches of this part, with REMOS acting at the recognition (decoder) stage]

Page 156-160

Feature Extraction: Calculation of MFCCs

Processing chain:

  s_t → Hamming window → DFT → |(·)|² → mel filtering → s_n^MEL (melspec coefficients) → log → s_n (logmelspec coefficients) → DCT → s_n^MFCC (MFCCs)

Goal: dimensionality reduction

MFCCs: Mel Frequency Cepstral Coefficients
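The chain on this slide can be sketched compactly in numpy; the filterbank details (number of filters, mel formula, frame sizes) are common defaults assumed here, not taken from the slides.

```python
import numpy as np

# Sketch of the MFCC chain: window -> DFT -> |.|^2 -> mel filtering
# -> log -> DCT (dimensionality reduction to n_ceps coefficients).
def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filt, n_fft, fs):
    pts = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2), n_filt + 2))
    bins = np.floor((n_fft + 1) * pts / fs).astype(int)
    fb = np.zeros((n_filt, n_fft // 2 + 1))
    for i in range(n_filt):              # triangular filters
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def mfcc(x, fs, n_fft=512, hop=160, n_filt=24, n_ceps=13):
    win = np.hamming(n_fft)
    fb = mel_filterbank(n_filt, n_fft, fs)
    nf = 1 + (len(x) - n_fft) // hop
    S = np.stack([np.abs(np.fft.rfft(win * x[k*hop:k*hop+n_fft]))**2
                  for k in range(nf)], axis=1)     # power spectrogram
    logmel = np.log(fb @ S + 1e-10)                # logmelspec coefficients
    k = np.arange(n_filt)                          # DCT-II -> MFCCs
    C = np.cos(np.pi * np.outer(np.arange(n_ceps), (k + 0.5)) / n_filt)
    return C @ logmel

rng = np.random.default_rng(6)
feats = mfcc(rng.standard_normal(16000), fs=16000)
```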

Page 161-162

Acoustic Modeling

Hidden Markov Model (HMM) λ

[6-state left-to-right HMM: states 1…6, self-transitions a22, a33, a44, a55, forward transitions a12, a23, a34, a45, a56; emitting states have output pdfs p(s_n | q_n = j), e.g., p(s_n | q_n = 2) and p(s_n | q_n = 5)]

Powerful model for:
  temporal variation
  spectral variation

Page 163-165

Dispersive Effect of Reverberation

[Figure: logmelspec features (dB scale) of the clean and the reverberant utterance "four, two, seven"; mel channel l vs. frame n]

Dispersive effect of reverberation: features smeared along the time axis
  • Time-frequency pattern is changed
  • Inter-frame correlation is increased
  • Different statistical properties to be captured by the acoustic model
  • Contradiction to the conditional independence assumption of HMMs

Page 166-167

Explanation of Dispersive Effect

[Figure: time-domain and feature-domain (melspec) representations of the initial RIR segment; analysis frames 1, 2, 3 with the frame shift indicated]

Time-domain (TD) description of reverberant speech x_t:

  x_t = h_t ∗ s_t

The RIR is typically much longer than the analysis window.

Feature-domain (FD) description of x_n^MEL: melspec convolution

  x_n^MEL = Σ_{τ=0}^{T_H - 1} h_τ^MEL ⊙ s_{n-τ}^MEL

  s_n^MEL: clean-speech feature vector
  x_n^MEL: reverberant feature vector
  h_τ^MEL: melspec RIR representation
  ⊙: element-wise multiplication

Page 168-170

Illustration of Melspec Convolution

The melspec convolution x_n^MEL = h_n^MEL ∗ s_n^MEL expands to

  x_n^MEL = h_0^MEL ⊙ s_n^MEL + h_1^MEL ⊙ s_{n-1}^MEL + h_2^MEL ⊙ s_{n-2}^MEL + …

[Figure: each reverberant frame is the element-wise weighted sum of the current and the preceding clean frames]
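The expansion above is a per-channel convolution across frames; a direct numpy sketch (the random S and H below are stand-ins for real melspec features and a real melspec RIR representation):

```python
import numpy as np

# Sketch: x_n^MEL = sum_tau h_tau^MEL (elementwise) s_{n-tau}^MEL,
# i.e., an ordinary convolution applied independently in each mel channel.
def melspec_convolve(S, H):
    """S: (L, N) clean melspec frames; H: (L, TH) melspec RIR.
    Returns X with X[:, n] = sum_tau H[:, tau] * S[:, n - tau]."""
    L, N = S.shape
    TH = H.shape[1]
    X = np.zeros((L, N))
    for n in range(N):
        for tau in range(min(TH, n + 1)):
            X[:, n] += H[:, tau] * S[:, n - tau]
    return X

rng = np.random.default_rng(7)
S = rng.random((24, 40))                            # 24 channels, 40 frames
H = rng.random((24, 8)) * np.exp(-np.arange(8) / 3.0)   # decaying melspec RIR
X = melspec_convolve(S, H)
```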

Page 171

Accuracy of Melspec Convolution

[Figure: logmelspec features (dB scale), mel channel l vs. frame n, for
 a) clean utterance
 b) reverberant utterance
 c) melspec convolution
 d) simple multiplication]

Page 172

Statistical Properties of Reverberant Speech Features

Example: digit "seven"

[Figure: logmelspec of the clean and the reverberant utterance (mel channel l vs. frame n); means of the clean logmelspec HMM (mel channel l vs. state j); logmelspec RIR representation (mel channel l vs. frame delay τ)]

Page 174:

Statistical Properties of Reverberant Speech Features

Histograms

[Figure: estimated pdfs of clean vs. reverberant features x, for state j = 1, channel l = 3 and for state j = 5, channel l = 21]

Auto-CoVariances (ACVs)

[Figure: ACVs (frame lag τ vs. mel channel l) of clean speech, j = 9, and of reverberant speech, j = 9]

Page 175:

Recognition Results

Word Accuracy as Function of Reverberation Time

[Figure: word accuracy in % vs. reverberation time T60 in ms (0 – 800 ms)]

Task: Read sentences from the Wall Street Journal (WSJ 5K task)

Features: MFCCs + ∆ + ∆∆ coefficients

Recognizer: Cross-word triphones, 3 states per triphone, 16 Gaussians per state

Page 176:

Which Part of Reverberation is Harmful for ASR?

Word Accuracy as Function of Dereverberation Start Time [Sehr 2010a]

[Figure: word accuracy in % vs. TDEREV in ms (0 – 700 ms), one curve each for 0 dB, 5 dB, 10 dB, 15 dB, 20 dB, 30 dB, and ∞ dB]

Task: Connected digits (TI digits)

Features: MFCCs + ∆ coefficients

Recognizer: Word-level HMMs, 16 states per digit, 3 Gaussians per state

Page 182:

Strategies for Reverberation-Robust ASR

Block diagram of ASR system

[Block diagram: speech signal → pre-processing → feature extraction → recognition → transcription; the acoustic model (obtained by training) and the language model feed the recognition stage; markers A–D indicate where the strategies below intervene, and REMOS is located at the recognition stage]

Strategies

A) signal-based approaches

B) feature-based approaches

C) model-based approaches

D) decoder-based approaches

Page 183:

Part III: Robust ASR in Reverberant Environments

Introduction

Feature-based Approaches

Model-based Approaches

Decoder-based Approaches

REMOS


Page 186:

Key Ideas of Feature-based Approaches

Three Different Approaches

Feature compensation

⇒ Example: Cepstral mean normalization (CMN)

Features insensitive to reverberation

⇒ Example: RASTA features

Features facilitating the capture of statistical properties

⇒ Example: Dynamic features


Page 193:

Cepstral Mean Normalization [Atal 1974]

If the impulse response is much shorter than the STFT analysis window:

x_t = h_t ∗ s_t

|X^STFT_{n,k}|² ≈ |H^STFT_k|² · |S^STFT_{n,k}|²

x^MFCC_{n,c} ≈ h^MFCC_c + s^MFCC_{n,c}

Cepstral Mean Normalization

x^CMN_{n,c} = x^MFCC_{n,c} − x̄^MFCC_c

x̄^MFCC_c = (1/N) Σ_{n=1}^{N} x^MFCC_{n,c} ≈ h^MFCC_c + s̄^MFCC_c

x^CMN_{n,c} ≈ h^MFCC_c + s^MFCC_{n,c} − (h^MFCC_c + s̄^MFCC_c) = s^MFCC_{n,c} − s̄^MFCC_c = s^CMN_{n,c}

⇒ convolution compensated
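As a minimal sketch (a hypothetical helper, not from the slides), CMN amounts to one mean subtraction per utterance:

```python
import numpy as np

def cmn(x_mfcc):
    """Cepstral mean normalization over one utterance.

    x_mfcc: (N, C) MFCC sequence (N frames, C cepstral coefficients).
    A short channel adds a constant h_c to every frame of coefficient c,
    so subtracting the per-coefficient time average removes it."""
    return x_mfcc - x_mfcc.mean(axis=0, keepdims=True)
```

Applying cmn to clean features and to the same features shifted by any constant per-coefficient offset yields identical results, which is exactly the channel invariance derived above.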

Page 195:

CMN - Illustration 1st-order Highpass Filter

Clean vs. Highpass Filtered Logmel Features

[Figure: logmel features (mel channel vs. t in s) of the clean and the highpass-filtered utterance, shown without CMN and with CMN]

Page 197:

CMN - Illustration Reverberation

Clean vs. Reverberant (T60 = 900 ms) Logmel Features

[Figure: logmel features (mel channel vs. t in s) of the clean and the reverberant utterance, shown without CMN and with CMN]

Page 201:

CMN - Discussion

Approach

Apply CMN to both training and test data

⇒ Short impulse responses can be compensated

+ Good for compensating different microphone characteristics or different telephone channels

+ Good for compensating coloration due to early reflections

− But: not suitable for compensating late reverberation

Further considerations

Reliable only if the utterance is long enough (> 4 s [Droppo 2008])

Extensions necessary for different speech activity rates of training and test data [Droppo 2008]

Page 205:

RASTA (RelAtive SpecTrA) Features [Hermansky 1994]

Background

Speed of spectral changes of speech:

⇒ limited by movements of articulators in vocal tract

Many non-speech effects:

⇒ characterized by short time-invariant impulse responses

Examples: microphone characteristics, telephone channels

Analysis artifacts:

⇒ very fast spectral changes

Idea

Remove very slow and fast spectral changes from features:

⇒ bandpass filtering in each channel

+ Insensitivity to slow and fast spectral changes


Page 206:

RASTA Features: Block Diagram

Calculation of RASTA Features

[Block diagram: the input x_t is split into spectral channels; each channel trajectory passes through log(), a bandpass filter H_l(e^{jΩ}), l = 0, …, L, and exp(); the channel outputs are stacked to form the feature vectors x^RASTA_n]
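A minimal sketch of the per-channel bandpass step. The coefficients below are the classic RASTA filter H(z) = 0.1 · (2 + z⁻¹ − z⁻³ − 2z⁻⁴) / (1 − 0.98 z⁻¹) from [Hermansky 1994]; the surrounding log/exp stages and the spectral analysis are omitted, and all names are illustrative.

```python
import numpy as np

# Classic RASTA bandpass: FIR numerator and single-pole IIR feedback
B = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])
A_POLE = 0.98

def rasta_filter(log_spec):
    """Bandpass-filter each channel's log-spectral trajectory over time.

    log_spec: (N frames, L channels). Returns the filtered trajectories;
    very slow changes (e.g. a constant channel offset) are removed."""
    n_frames, n_ch = log_spec.shape
    y = np.zeros_like(log_spec, dtype=float)
    for t in range(n_frames):
        # FIR part over the last five frames
        acc = np.zeros(n_ch)
        for k, b in enumerate(B):
            if t - k >= 0:
                acc += b * log_spec[t - k]
        # single-pole IIR feedback
        y[t] = acc + (A_POLE * y[t - 1] if t > 0 else 0.0)
    return y
```

Because the FIR taps sum to zero, a time-invariant (DC) component in any channel is driven toward zero, which is the channel invariance the slide describes.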

Page 208:

RASTA Features: Discussion

RASTA Features

Effective for short time-invariant impulse responses (like CMN)

+ Good for compensating different microphone characteristics or

different telephone channels

+ Good for compensating coloration due to early reflections

Reverberation described by long RIRs

− Therefore: not suitable for compensating late reverberation


Page 212:

Dynamic Features [Furui 1986]

Idea

Temporal changes of short-time spectra:

⇒ important for discriminating phonemes

First and second derivatives of the static features (∆ and ∆∆ features):

⇒ capture these changes

∆ Feature Calculation

Simple difference: ∆s_n = s_{n+κ} − s_{n−κ},   typical: κ ∈ {1, 2}

Regression: ∆s_n = ( Σ_{κ=1}^{N∆} κ · (s_{n+κ} − s_{n−κ}) ) / ( 2 · Σ_{κ=1}^{N∆} κ² ),   typical: N∆ ∈ {2, 3, 4}

∆∆ features: calculated in a similar way from the ∆ features
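The regression formula can be sketched as follows; this is an illustrative helper, with edge frames handled by replication (an assumption, in the style of common toolkits):

```python
import numpy as np

def delta(feats, n_delta=2):
    """Regression-based delta features:
    delta_n = sum_{k=1..n_delta} k*(s[n+k] - s[n-k]) / (2*sum k^2).

    feats: (N frames, D dims); frames beyond the edges are replicated."""
    n_frames = feats.shape[0]
    denom = 2.0 * sum(k * k for k in range(1, n_delta + 1))
    # pad by edge replication so the regression window exists everywhere
    padded = np.pad(feats, ((n_delta, n_delta), (0, 0)), mode="edge")
    out = np.zeros_like(feats, dtype=float)
    for k in range(1, n_delta + 1):
        out += k * (padded[n_delta + k:n_delta + k + n_frames]
                    - padded[n_delta - k:n_delta - k + n_frames])
    return out / denom
```

For a linear ramp the interior delta values equal the slope, and ∆∆ features are obtained by applying the same function to the ∆ sequence.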

Page 216:

Why are Dynamic Features interesting for Reverberant ASR?

Reverberant Speech

Long-term relations between feature vectors

Cannot be captured by HMMs

⇒ Mitigation by feature vectors with long temporal reach

Temporal reach of features

Static features: typically 10 ms – 40 ms

∆ features: typically 20 ms – 120 ms

∆∆ features: typically 30 ms – 200 ms

Dynamic features can partly capture long-term relations


Page 217:

Model-based Feature Enhancement [Krueger 2010]

[Block diagram: feature extraction turns the reverberant speech x_t into reverberant logmelspec coefficients x_n; RIR parameter estimation provides T60; inference combines the a priori model p(s_n|s_{n−1}) with the observation model p(x_n|s_{n−T_H:n}) to obtain p(s_n|x_{1:n}) and the enhanced logmelspec coefficients ŝ_n; a DCT yields the enhanced MFCCs ŝ^MFCC_n, which the ASR back-end decodes into the estimated transcription]

Page 219:

Model-based Feature Enhancement [Krueger 2010]

A Priori Model: Clean Speech Model

Linear dynamic model over the feature sequence s_{n−3}, s_{n−2}, s_{n−1}, s_n, …:

s_n = A s_{n−1} + b + u_n

Page 222:

Model-based Feature Enhancement [Krueger 2010]

A Priori Model: Clean Speech Model

Switching linear dynamic model with hidden states q_{n−3}, q_{n−2}, q_{n−1}, q_n:

s_n = A(q_n) s_{n−1} + b(q_n) + u_n

p(s_n|s_{n−1}, q_n) = N( s_n; A(q_n) s_{n−1} + b(q_n), Σ_u(q_n) )

Model for non-stationary feature vector sequences of clean speech
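To make the generative reading of the switching linear dynamic model concrete, here is a hedged sketch that draws a feature trajectory for a given hidden-state sequence; all names, shapes, and the sampling interface are assumptions, not part of [Krueger 2010].

```python
import numpy as np

def sample_sldm(A, b, chol_u, q_seq, s0, rng):
    """Sample s_n = A(q_n) s_{n-1} + b(q_n) + u_n with
    u_n ~ N(0, Sigma_u(q_n)), for a given hidden-state sequence.

    A: (M, D, D) transition matrices, b: (M, D) offsets,
    chol_u: (M, D, D) Cholesky factors of the noise covariances,
    q_seq: iterable of discrete states, s0: (D,) initial feature vector."""
    s, traj = np.asarray(s0, dtype=float), []
    for q in q_seq:
        u = chol_u[q] @ rng.standard_normal(s.shape[0])  # u_n draw
        s = A[q] @ s + b[q] + u                          # state update
        traj.append(s)
    return np.stack(traj)
```

Each hidden state selects its own (A, b, Σ_u), which is what lets the model track non-stationary clean-speech dynamics.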

Page 227:

Model-based Feature Enhancement [Krueger 2010]

Observation Model: Reverberation Model

Based on melspec convolution:

x_n = log( Σ_{τ=0}^{T_H} exp(h_τ + s_{n−τ}) ) + v_n = f( s_{n−T_H:n}, h_{0:T_H} ) + v_n

p(v_n) = N( v_n; μ_v, Σ_v )

p( x_n | s_{n−T_H:n} ) = N( x_n; f( s_{n−T_H:n}, h_{0:T_H} ) + μ_v, Σ_v )

v_n: captures the approximation error

h_{0:T_H}: based on a strictly exponentially decaying RIR model

⇒ Only T60 needs to be estimated
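The deterministic part f of the observation model is a log-sum-exp over frame delays; a numerically stable sketch (function name and frame ordering are assumptions):

```python
import numpy as np

def obs_model_f(s_hist, h):
    """f(s_{n-T_H:n}, h_{0:T_H}) = log( sum_tau exp(h_tau + s_{n-tau}) ).

    s_hist: (T_H+1, L) log-mel frames ordered s_n, s_{n-1}, ..., s_{n-T_H}
    h: (T_H+1, L) log-mel RIR representation
    Returns the (L,) predicted reverberant log-mel frame (before v_n)."""
    z = h + s_hist
    m = z.max(axis=0)                       # log-sum-exp for stability
    return m + np.log(np.exp(z - m).sum(axis=0))
```

For T_H = 0 this reduces to f = h_0 + s_n, the short-channel case that CMN already handles; longer T_H brings in the earlier frames that cause the temporal smearing.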

Page 232: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann

Model-based Feature Enhancement [Krueger 2010]

Bayesian Inference

MMSE estimate

\hat{s}_n = E\{ s_n \,|\, x_{1:n} \}

p(s_n | x_{1:n}) = \frac{ p(x_n | s_n, x_{1:n-1}) \, p(s_n | x_{1:n-1}) }{ \int p(x_n | s_n, x_{1:n-1}) \, p(s_n | x_{1:n-1}) \, ds_n }

\approx \frac{ p(x_n | s_{n-T_H:n}) \sum_{i=1}^{M} p(s_n | s_{n-1}, q_n = i) \, p(q_n = i) }{ \int p(x_n | s_{n-T_H:n}) \sum_{i=1}^{M} p(s_n | s_{n-1}, q_n = i) \, p(q_n = i) \, ds_n }

⇒ Inference performed by a bank of iterated extended Kalman filters
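The measurement update of one such filter can be sketched in scalar form (our notation and a deliberate simplification; the actual system runs a bank of iterated extended Kalman filters over full feature vectors):

```python
def iekf_update(m0, p0, x, f, fprime, r, iters=3):
    """One iterated EKF measurement update for x = f(s) + v, v ~ N(0, r),
    starting from the prior N(m0, p0). The relinearization loop is what
    makes the filter 'iterated'. Scalar sketch only."""
    m, K, H = m0, 0.0, 0.0
    for _ in range(iters):
        H = fprime(m)                            # relinearize f at current estimate
        S = H * p0 * H + r                       # innovation variance
        K = p0 * H / S                           # Kalman gain
        m = m0 + K * (x - f(m) - H * (m0 - m))   # iterated-EKF mean update
    p = (1.0 - K * H) * p0                       # posterior variance
    return m, p
```

For a linear f the loop reproduces the standard Kalman update after the first iteration.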


Model-based Feature Enhancement [Krueger 2010]

Discussion

Approach tailored to reverberant feature vector sequences

long-term relations explicitly captured by observation model

+ Promising results reported on AURORA 5 task (connected digits)

+ Moderate computational complexity

+ Latency of only a few frames

Suitable for online recognition in reverberant environments


Further Feature-based Approaches

[Petrick 2008] Harmonicity-based feature analysis

[Thomas 2008] Frequency-domain linear prediction

[Wölfel 2009] Particle filter-based feature enhancement

[Kumar 2010] Cepstral inverse filtering


Part III: Robust ASR in Reverberant Environments

Introduction

Feature-based Approaches

Model-based Approaches

Decoder-based Approaches

REMOS


Key Idea of Model-based Approaches

Mismatch between clean HMM and reverberant data

[Figure: statistical properties of the reverberant test data do not match those of the clean HMM]


Key Idea of Model-based Approaches

Feature-based: “dereverberate” data

[Figure: dereverberated test data matched to the statistical properties of the clean HMM]


Key Idea of Model-based Approaches

Model-based: “reverberate” acoustic model

Adjust acoustic model to statistical properties of reverberant data

[Figure: clean HMM adapted to a reverberant HMM whose statistical properties match the reverberant test data]


Training with Reverberant Data

Matched Training

Record training data in target environment

+ Training data perfectly capture statistical properties

− Extremely high effort

Generate training data by convolution with RIR [Giuliani 1999, Stahl 2001, Matassoni 2002]

+ Significantly reduced effort

+ Only slight degradation in recognition performance [Stahl 2001]

Multi-Style Training

Use training data from many different rooms

+ Robust HMMs

+ Very flexible

− Discrimination capability reduced compared to matched training
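The RIR-convolution step for generating reverberant training data can be sketched as follows (minimal NumPy sketch; the function name is ours):

```python
import numpy as np

def reverberate(clean, rir):
    """Reverberant training utterance = clean speech convolved with a room
    impulse response (RIR), truncated to the clean-signal length.
    Sketch of the data-generation idea in [Giuliani 1999, Stahl 2001]."""
    return np.convolve(clean, rir)[: len(clean)]
```

Convolving one clean corpus with several measured RIRs also yields the room variety needed for multi-style training.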


Matched Training

[Figure: matched HMM, trained on reverberant data, shares the statistical properties of the reverberant test data]


Matched Training: Modeling Accuracy

Histograms

[Figure: estimated pdfs for state j = 1, channel l = 3 and state j = 5, channel l = 21: histogram of reverberant data, output pdf of the clean HMM, and output pdf of the reverberantly trained HMM]

Auto-Covariances (ACVs)

[Figure: ACVs of reverberant speech and ACVs captured by the HMM for state j = 9, plotted over mel channel l and frame lag τ]


Multi-Style Training

[Figure: multi-style HMM, trained on reverberant data from many different rooms, applied to the reverberant test data]


Reverberation-Adaptive Training

Idea

Capture only linguistic variabilities by acoustic model

Remove acoustic variabilities by appropriate transforms

Approach

Multi-style training with dereverberated data

Similar to noise-adaptive training [Deng 2000] or model-independent adaptive training [Gales 2001]

+ long-term relations partly removed by dereverberation

+ room dependency reduced

⇒ discrimination capability increased compared to multi-style training

Successfully applied, e.g., in [Kinoshita 2006]


Reverberation-Adaptive Training

[Figure: training data is dereverberated before training the adaptive HMM; the reverberant test data is dereverberated before recognition]


Data-driven Adaptation

Approaches

Maximum A Posteriori adaptation (MAP) [Gauvain 1994]

Maximum Likelihood Linear Regression (MLLR)

[Leggetter 1995, Gales 1998]

Successfully used for speaker and noise adaptation

Can also be used for reducing mismatch due to reverberation


MLLR

Adaptation of the HMM mean vectors and covariance matrices:

\mu_X = D \mu_S + d

\Sigma_{XX} = E \, \Sigma_{SS} \, E^T

Transformation parameters D, d, E estimated by the EM algorithm

Supervised MLLR: known transcription

Unsupervised MLLR: during recognition

CMLLR (Constrained MLLR)

Same transformation matrix for mean vector and covariance matrix:

\mu_X = D \mu_S + d

\Sigma_{XX} = D \, \Sigma_{SS} \, D^T

+ fewer adaptation parameters
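Applying an estimated CMLLR transform to a single Gaussian is a one-liner per parameter (sketch under our naming; the EM estimation of D and d is not shown):

```python
import numpy as np

def cmllr_adapt(mu_s, sigma_s, D, d):
    """Constrained MLLR: mu_X = D mu_S + d, Sigma_XX = D Sigma_SS D^T.
    mu_s: (n,) mean and sigma_s: (n, n) covariance of one HMM Gaussian."""
    return D @ mu_s + d, D @ sigma_s @ D.T
```

Because the same matrix D transforms both parameters, roughly half the parameters of unconstrained MLLR need to be estimated.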


Data-driven Approaches

Illustration: Example Matched Training on Reverberated Data

[Figure: clean-speech training data, reverberated using a description of the acoustic environment (e.g., a set of RIRs), yields reverberated training data for a reverberantly trained HMM]


Data-driven Approaches

Discussion

Very accurate description of statistical properties by reverberant training/adaptation data

Loss of accuracy: only when turning data into model

Reverberant training: requires large amount of reverberant data

Data-driven adaptation: moderate amount of reverberant data

(but more than model-based adaptation)

Main Limitation

Conventional HMMs cannot accurately capture long-term relations


Parametric Model-Based Approaches

Illustration

[Figure: a clean-speech HMM, trained on clean-speech training data, is combined with a reverberation representation derived from a description of the acoustic environment (e.g., a set of RIRs) to obtain the adapted HMM]


Parametric Model-based Adaptation

[Diagram: clean-speech HMMs + reverberation model → adaptation algorithm → adapted HMMs]

Discussion

proposed in [Raut 2006, Hirsch 2008, Sehr 2009]

based on melspec convolution

+ long-term relations considered for HMM parameter estimation

+ no adaptation utterances necessary

− reduced accuracy due to approximation errors

− additional loss of accuracy when mapping the combination to an HMM

Main Limitation

Conventional HMMs cannot accurately capture long-term relations


Parametric Model-based Adaptation

Mean Adaptation Approach [Raut 2006, Hirsch 2008, Sehr 2009]

[Diagram: calculate cepstral averages → µ_S^MFCC → transform to melspec domain → µ_S^MEL → perform adaptation using β → µ_X^MEL → transform to cepstral domain → µ_X^MFCC]

Adaptation Equation

\mu_X^{MEL}(l, j) = \sum_p \beta(l, j, j-p) \, \mu_S^{MEL}(l, j-p)

β(l, j, i): state-level reverberation representation, describes energy dispersion from state i to state j in channel l

i, j: state indices

l: mel channel index
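The adaptation equation can be sketched as follows (array layout and function name are our assumptions):

```python
import numpy as np

def adapt_means(mu_s, beta):
    """Mean adaptation: mu_X(l, j) = sum_p beta(l, j, j - p) mu_S(l, j - p),
    i.e. the adapted mean of state j collects energy dispersed from state j
    and its predecessor states. mu_s: (L, J), beta: (L, J, J)."""
    L, J = mu_s.shape
    mu_x = np.zeros_like(mu_s)
    for l in range(L):
        for j in range(J):
            for i in range(j + 1):  # i = j - p for p = 0..j
                mu_x[l, j] += beta[l, j, i] * mu_s[l, i]
    return mu_x
```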


Estimation of Reverberation Representation [Hirsch 2008]

[Figure: squared RIR envelope h_t^2 spanning HMM states 1 to 3, with the integration limits t_start(2,1) and t_end(2,1) marked on the time axis]

h_t^2 = \frac{6 \log(10)}{T_{60}} \cdot \exp\left( -\frac{6 \log(10)}{T_{60}} \cdot t \right), \quad \text{for } t \ge 0

\beta(j, i) = \int_{t_{start}(j,i)}^{t_{end}(j,i)} h_t^2 \, dt


Limitation of Conventional HMMs

Emission pdf of Conventional HMMs

p(xn|j)

⇒ conditional independence assumption

Conditional Emission pdf

capturing long-term relationships by

p(xn|j ,x1:n−1)

Approximation by: Context-aware Methods:

Frame-wise HMM adaptation

REMOS

Nakatani, Sehr, Kellermann: Reverberant Speech Processing 135

Conventional Adaptation versus Context-Aware Methods

[Flowcharts: (a) conventional HMM adaptation performs the HMM adaptation once, then iterates Viterbi initialization and Viterbi score calculation until finished; (b) frame-wise adaptation moves the HMM adaptation inside the decoding loop, re-adapting before every Viterbi score calculation; (c) REMOS replaces the adaptation step by an inner optimization inside the decoding loop]

Frame-wise Adaptation

xn ≈ log(exp(h0 + sn) + exp(rn))
µxn(j) = log(exp(h0 + µs(j)) + exp(rn))

rn: late reverberation
j: state index

Autoregressive Modeling [Takiguchi 2006]
rn = a + xn−1, a: prediction coefficient

Moving-Average Modeling [Sehr 2011]
rn = log( Σ_{τ=1}^{TH} exp(µhτ + sn−τ) )
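In the logmelspec domain the moving-average model [Sehr 2011] is a log-sum-exp over delayed clean features, and the adapted state mean follows the approximation above. A NumPy sketch under my own naming and array-shape conventions:

```python
import numpy as np

def late_reverb_estimate(mu_h, s_past):
    """Moving-average estimate r_n = log(sum_tau exp(mu_h[tau] + s_{n-tau})).

    mu_h:   (T_H, D) logmelspec RIR means for frame delays 1..T_H
    s_past: (T_H, D) clean logmelspec features s_{n-1}, ..., s_{n-T_H}
    """
    return np.log(np.sum(np.exp(mu_h + s_past), axis=0))

def adapted_mean(h0, mu_s, r_n):
    """Adapted state mean mu_x(j) = log(exp(h0 + mu_s(j)) + exp(r_n))."""
    return np.log(np.exp(h0 + mu_s) + np.exp(r_n))
```

The log-sum-exp form reflects that energies add in the linear melspec domain while the models operate on their logarithms.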

Frame-wise Adaptation

Discussion
+ Overcomes the conditional independence assumption
+ Accurate modeling of long-term relations
− Increased computational complexity
− Increased effort for integration into ASR systems

⇒ Full potential not yet demonstrated
⇒ Promising direction for future research

Further Model-based Approaches

[Couvreur 2001]: Reverberant training of several HMMs plus model selection
[Sehr 2010b]: Training of reverberant HMMs on stereo data
[Gales 2011]: Extension of MLLR and VTS to reverberant environments


Part III: Robust ASR in Reverberant Environments

Introduction

Feature-based Approaches

Model-based Approaches

Decoder-based Approaches

REMOS


Overview of Decoder-based Approaches

Key Idea
Modify the decoding algorithm to increase reverberation robustness

Two Approaches
Missing feature techniques
⇒ Distinguish between reliable and unreliable observations
⇒ Estimate or discard the unreliable parts
Uncertainty decoding
⇒ Combined with signal or feature enhancement techniques
⇒ Exploit reliability information about the enhanced data

Decoder-based approaches bridge the gap between feature-based and model-based approaches

Missing Feature Techniques

For overviews see [Cooke 2001, Raj 2005, Kolossa 2011]

Key Ideas
Partition the observations into reliable and missing components
Use only the reliable components for recognition

Main Steps
Mask estimation: mark observations as either reliable or missing
Handle the missing data appropriately

How to handle missing data?
Marginalization: eliminate unreliable data by integration over the corresponding dimensions
Bounded marginalization: exploit known bounds of the missing data for the integration
Data imputation: determine state-dependent estimates for the unreliable data, given the reliable data
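As a concrete (hypothetical) sketch of bounded marginalization with diagonal Gaussians: reliable dimensions contribute their density, while missing dimensions contribute an integral up to the observed value, which acts as an upper bound on the clean feature when the distortion is additive in the spectral domain. Names and the choice of −∞ as the lower bound are my own simplifications:

```python
import math

def gauss_pdf(x, mu, var):
    """Univariate Gaussian density."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def gauss_cdf(x, mu, var):
    """Univariate Gaussian cumulative distribution function."""
    return 0.5 * (1.0 + math.erf((x - mu) / math.sqrt(2.0 * var)))

def bounded_marginal_likelihood(x, mu, var, reliable):
    """Per-Gaussian likelihood with bounded marginalization (diagonal cov.).

    Reliable dimensions: Gaussian density at the observation.
    Missing dimensions:  integral from -inf up to the observed upper bound.
    """
    like = 1.0
    for xi, mi, vi, rel in zip(x, mu, var, reliable):
        like *= gauss_pdf(xi, mi, vi) if rel else gauss_cdf(xi, mi, vi)
    return like
```

Plain marginalization corresponds to replacing the CDF factor by 1, i.e. integrating the missing dimension over its entire range.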

Missing Feature Techniques for Reverberation Robustness

[Palomaki 2004]
Modulation filtering for the mask estimation
Bounded marginalization for handling missing data

[Gemmeke 2011]
Oracle masks based on clean and reverberant features
Semi-oracle masks based on clean features and estimated RIRs
Gaussian-dependent bounded imputation

Uncertainty Decoding

Conventional Feature Enhancement Methods

[Block diagram: reverberant features xn → feature enhancement algorithm → point estimate ŝn → decoding with the acoustic model → transcription]

Use only the point estimate ŝn of the clean features

Contribution of each Gaussian component m:
p(ŝn|m) = N(ŝn; µs^(m), Σs^(m))

Uncertainty Decoding
[Droppo 2002, Deng 2005, Liao 2008, Haeb-Umbach 2011]

Feature Enhancement Combined with Uncertainty Decoding

[Block diagram: reverberant features xn → feature enhancement algorithm → posterior p(sn|ŝn) → decoding with the acoustic model → transcription]

Signal/feature enhancement inevitably introduces distortions
⇒ Use reliability information in addition to the point estimate
⇒ Use p(sn|ŝn) instead of ŝn

Uncertainty Decoding
[Droppo 2002, Deng 2005, Liao 2008, Haeb-Umbach 2011]

Mismatch Model
ŝn = sn + bn
p(bn) = N(bn; 0, Σbn)
p(sn|ŝn, m) ≈ p(sn|ŝn) = p(bn)

Contribution of Gaussian Component m
p(ŝn|m) = ∫ p(ŝn, sn|m) dsn = ∫ p(ŝn|sn, m) p(sn|m) dsn
≈ N(ŝn; µs^(m), Σs^(m) + Σbn)

Unreliable features ⇒ large Σbn ⇒ little effect on the Viterbi score
Main challenge: estimation of the time-variant feature covariance Σbn
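The effect of the inflated covariance on the score can be sketched per frame for diagonal covariances (names are illustrative, not from the cited papers):

```python
import math

def ud_log_likelihood(s_hat, mu, var_model, var_bias):
    """Sum over dimensions of log N(s_hat; mu, var_model + var_bias).

    var_bias plays the role of the time-variant mismatch variance Sigma_bn:
    large values flatten the Gaussian, so unreliable frames barely
    influence the Viterbi score.
    """
    ll = 0.0
    for sh, m, vm, vb in zip(s_hat, mu, var_model, var_bias):
        v = vm + vb
        ll += -0.5 * (math.log(2.0 * math.pi * v) + (sh - m) ** 2 / v)
    return ll
```

With var_bias = 0 the score reduces to the conventional point-estimate likelihood; with very large var_bias, matched and mismatched means yield nearly the same score, which is exactly the intended down-weighting of unreliable observations.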

Uncertainty Decoding for Reverberation-Robust ASR
[Delcroix 2009, Delcroix 2011a]

Key Idea
Strong reverberation
⇒ Large effect of speech enhancement
⇒ Large mismatch between clean and enhanced features

Effect of speech enhancement captured by bn = xn − ŝn
⇒ Mismatch covariance assumed proportional to the difference between observed and enhanced features

Model the elements of the time-variant diagonal mismatch covariance matrix Σbn as
(Σbn)ii = αi b²n,i

The scaling factors αi are estimated by an EM algorithm using adaptation data
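The variance model itself is a one-liner; a sketch with hypothetical names, following the diagonal form above:

```python
def mismatch_variance(x, s_hat, alpha):
    """Diagonal mismatch variances (Sigma_bn)_ii = alpha_i * (x_n,i - s_hat_n,i)^2.

    The difference b_n = x_n - s_hat_n between observed and enhanced features
    serves as a proxy for how strongly the enhancement altered each dimension;
    the alpha_i are assumed to have been estimated beforehand (e.g. via EM on
    adaptation data).
    """
    return [a * (xi - si) ** 2 for xi, si, a in zip(x, s_hat, alpha)]
```

Dimensions where enhancement changed little thus get a small variance (treated as reliable), while strongly altered dimensions are down-weighted during decoding.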

Uncertainty Decoding for Reverberation-Robust ASR
[Delcroix 2009, Delcroix 2011a]

[Block diagram: the reverberant speech xt is dereverberated, and features ŝn are extracted from the dereverberated speech while xn is extracted from the reverberant speech; a variance compensation stage derives the mismatch covariance sequence Σbn from xn and ŝn; recognition then uses the acoustic-model Gaussian means µs^(m) together with the compensated covariances Σs^(m) + Σbn to produce the word sequence w]

Uncertainty Decoding for Reverberation-Robust ASR
[Delcroix 2009, Delcroix 2011a]

Discussion
− Accounting for the time-variant covariance matrix Σbn increases the computational complexity
+ Can be combined with static variance compensation and mean adaptation by MLLR
+ Independent of the enhancement algorithm ⇒ highly flexible
+ Has also been used successfully for non-stationary interferences [Delcroix 2011b]

⇒ Promising approach for interconnecting signal/feature-based methods and ASR systems


Part III: Robust ASR in Reverberant Environments

Introduction

Feature-based Approaches

Model-based Approaches

Decoder-based Approaches

REMOS


REMOS: REverberation MOdeling for Speech Recognition

Online Model Combination

[Diagram: the current observation is modeled by applying a combination operator to the output of the clean-speech model (CSM) and the output of the reverberation model (RVM) acting on the previous observations]

CSM: clean-speech model ⇒ HMM network
RVM: reverberation model
Combination of CSM and RVM ⇒ context-aware acoustic model

Advantages [Sehr 2010c]
CSM and RVM are trained independently
Changing environment: adjust only the RVM
Changing task: adjust only the CSM
⇒ High degree of flexibility

REMOS Decoding [Sehr 2010c]

[Block diagram: features xn obtained by feature extraction are decoded by an extended Viterbi algorithm that draws on both the CSM and the RVM to produce the transcription]

Extended Viterbi algorithm: finds the most likely path through the CSM
Inner optimization: accounts for the RVM and the previous observations; determines the most likely contributions of CSM and RVM to the current observation

REMOS [Sehr 2010c]

Online combination of model outputs from the clean-speech HMM and the reverberation model, capturing long-term relations:

Combination Operator
xn = f(sn, sn−TH:n−1, hn, an)
= log(exp(hn + sn) + exp(rn + an))

Late Reverberation Estimate
rn = log( Σ_{τ=1}^{TH} exp(µHτ + sn−τ) )

rn: logmelspec late reverberation estimate
an: captures the approximation error of rn
hn: logmelspec representation of the direct-sound component of the RIR
µH1:TH: mean vectors of the logmelspec representation of the late reverberation
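The combination operator and the late-reverberation estimate can be sketched jointly; a NumPy sketch under my own naming conventions (mu_h stands for the logmelspec RIR mean vectors µH for frame delays 1..TH):

```python
import numpy as np

def remos_combine(s_n, s_past, h_n, a_n, mu_h):
    """REMOS combination operator in the logmelspec domain:

    x_n = log(exp(h_n + s_n) + exp(r_n + a_n)),
    r_n = log(sum_tau exp(mu_h[tau] + s_{n-tau})).

    s_n:    (D,)     current clean logmelspec feature
    s_past: (T_H, D) previous clean features s_{n-1}, ..., s_{n-T_H}
    h_n:    (D,)     direct-sound RIR component
    a_n:    (D,)     approximation error of the late reverberation estimate
    mu_h:   (T_H, D) logmelspec RIR means for delays 1..T_H
    """
    r_n = np.log(np.sum(np.exp(mu_h + s_past), axis=0))
    return np.log(np.exp(h_n + s_n) + np.exp(r_n + a_n))
```

Setting a_n to −∞ removes the late-reverberation branch entirely, so the operator then reduces to the direct-path term h_n + s_n, which is a quick sanity check of the log-sum-exp structure.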

REMOS [Sehr 2010c]

Illustration of Generative Model

[Diagram: the clean-speech model emits sn according to p(sn|j); the reverberation model supplies hn ~ p(hn) and an ~ p(an); the combination operator f merges sn, the previous clean features sn−1, …, sn−TH, hn, and an into the observation xn]

REMOS [Sehr 2010c]

The conditional emission pdf is decomposed into the reverberation model and the clean-speech HMM:

p(xn|j, x1:n−1) = ∫ p(xn|sn, x1:n−1) p(sn|j) dsn

Reverberation Model:
p(xn|sn, x1:n−1) = ∫∫ p(hn) p(an) δ(xn − f(sn, sn−TH:n−1, hn, an)) dhn dan

REMOS [Sehr 2010c]

Approximation of the Conditional Emission pdf
by the maximum values of the integrand:

p(xn|j, x1:n−1) ≈ p(ĥn) p(ân) p(ŝn|j)

The maximum values ĥn, ân, ŝn are determined by the inner optimization:

(ĥn, ân, ŝn) = argmax_{(hn, an, sn)} p(hn) p(an) p(sn|j)
subject to xn = f(sn, sn−TH:n−1, hn, an)

Page 353: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann

Detailed Illustration of REMOS Decoding

model

clean−speech HMMs

network of

reverberation

find previous

vectors

backtracking matrix

Viterbi score matrix

vectors (3D tensor)clean−speech

matrix of clean−speech

optimization

inner Viterbi

calculate

score

n

n

j

j

sn

αij

p(xn|j, x1:n−1)

γn−1(i)

γn(j)

ψn(j)

sn(j)

Nakatani, Sehr, Kellermann: Reverberant Speech Processing 157

Page 354: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann

Detailed Illustration of REMOS Decoding

model

clean−speech HMMs

network of

reverberation

find previous

vectors

backtracking matrix

Viterbi score matrix

vectors (3D tensor)clean−speech

matrix of clean−speech

optimization

inner Viterbi

calculate

score

n

n

j

j

xn

RVM

CSM

sn−TH

sn−1

sn

hn

αij

p(xn|j, x1:n−1)

γn−1(i)

γn(j)

ψn(j)

sn(j)

p(hn) p(an)

p(sn|j)

Nakatani, Sehr, Kellermann: Reverberant Speech Processing 157

Page 355: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann

Detailed Illustration of REMOS Decoding

model

clean−speech HMMs

network of

reverberation

find previous

vectors

backtracking matrix

Viterbi score matrix

vectors (3D tensor)clean−speech

matrix of clean−speech

optimization

inner Viterbi

calculate

score

n

n

nj

jj

l

xn

RVM

CSM

sn−TH

sn−1

sn

hn

αij

p(xn|j, x1:n−1)

γn−1(i)

γn(j)

ψn(j)

sn(j)

p(hn) p(an)

p(sn|j)

Nakatani, Sehr, Kellermann: Reverberant Speech Processing 157

Page 356: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann

Detailed Illustration of REMOS Decoding

model

clean−speech HMMs

network of

reverberation

find previous

vectors

backtracking matrix

Viterbi score matrix

vectors (3D tensor)clean−speech

matrix of clean−speech

optimization

inner Viterbi

calculate

score

n

n

nj

jj

l

xn

RVM

CSM

sn−TH

sn−1

sn

hn

αij

p(xn|j, x1:n−1)

γn−1(i)

γn(j)

ψn(j)

sn(j)

p(hn) p(an)

p(sn|j)

Nakatani, Sehr, Kellermann: Reverberant Speech Processing 157
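The decoding loop illustrated above can be sketched as an ordinary Viterbi recursion whose emission score is supplied by an inner optimization per (frame, state). Everything below is a hedged illustration: `inner_opt` is a stand-in callable, and taking the clean-speech history from the best predecessor's state column (rather than along the backtracked path) is a simplification for brevity.

```python
import numpy as np

# Hedged sketch of the REMOS decoding loop: a standard Viterbi recursion
# whose emission score p(xn|j, x1:n-1) is produced by an inner optimization
# per (frame, state), passed in as a callable stand-in.

def remos_viterbi(X, log_A, log_pi, inner_opt, TH):
    """X: (N, D) reverberant features; log_A: (J, J) log transition probs;
    log_pi: (J,) log initial probs; inner_opt(x, S_prev, j) -> (s_hat, log_b).
    Returns (best state path, Viterbi scores, clean-speech estimates)."""
    N, D = X.shape
    J = log_A.shape[0]
    gamma = np.full((N, J), -np.inf)     # Viterbi score matrix
    psi = np.zeros((N, J), dtype=int)    # backtracking matrix
    S = np.zeros((N, J, D))              # clean-speech estimates (3-D tensor)
    for j in range(J):
        s_hat, log_b = inner_opt(X[0], np.zeros((0, D)), j)
        gamma[0, j] = log_pi[j] + log_b
        S[0, j] = s_hat
    for n in range(1, N):
        for j in range(J):
            i = int(np.argmax(gamma[n - 1] + log_A[:, j]))  # best predecessor
            psi[n, j] = i
            S_prev = S[max(0, n - TH):n, i]  # history from predecessor column
            s_hat, log_b = inner_opt(X[n], S_prev, j)
            gamma[n, j] = gamma[n - 1, i] + log_A[i, j] + log_b
            S[n, j] = s_hat
    path = [int(np.argmax(gamma[-1]))]   # backtrack the best state sequence
    for n in range(N - 1, 0, -1):
        path.append(int(psi[n, path[-1]]))
    return path[::-1], gamma, S
```

With an `inner_opt` that ignores the history, this reduces to plain Viterbi decoding, which makes the structural difference, the per-frame, per-state optimization, easy to see.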


Modeling Accuracy of REMOS

Example: digit “seven”

[Figure: four panels over mel channel l. (1) logmelspec of the clean utterance (vs. frame n); (2) logmelspec of the reverberant utterance (vs. frame n); (3) means of the clean logmelspec HMM (vs. state j); (4) logmelspec RIR representation (vs. frame delay τ, color scale −14 to 0).]


Modeling Accuracy of REMOS

Histograms

[Figure: estimated pdfs over x for state j = 1, mel channel l = 3, and for state j = 5, channel l = 21, each comparing the histogram of reverberant speech, the output pdf of the clean HMM, and the prior and posterior REMOS histograms.]

Auto-CoVariances (ACVs)

[Figure: ACVs over mel channel l vs. frame lag τ for reverberant speech (j = 9) and for the posterior REMOS output (j = 9).]


Recognition Results [Sehr 2010c]

[Figure: bar chart of word accuracy in % for rooms A, B, and C, comparing clean HMM, clean HMM + MLLR adaptation [Sehr 2009], multi-style HMM, multi-style HMM + MLLR, matched HMM, and REMOS.]

Setup

- Task: connected digits (TI digits)
- Features: logmelspec coefficients
- Recognizer: word-level HMMs, 16 states/digit, 1 Gaussian/state
- Rooms (T60, DRR): A: 300 ms, 4.0 dB; B: 700 ms, −4.0 dB; C: 900 ms, −4.0 dB
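As a reference for the feature type named in the setup, logmelspec coefficients (log mel-filterbank energies, without a DCT) can be computed as sketched below. The sampling rate, frame sizes, 24 mel channels, and the filterbank construction are illustrative assumptions, not the experiment's exact front-end.

```python
import numpy as np

# Hedged sketch of logmelspec feature extraction: windowed STFT power
# spectra mapped through a triangular mel filterbank, then log-compressed.

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # triangular filters with edges equally spaced on the mel scale
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fb[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[m - 1, k] = (r - k) / max(r - c, 1)
    return fb

def logmelspec(x, sr=16000, n_fft=512, hop=160, n_mels=24):
    window = np.hamming(n_fft)
    frames = [x[i:i + n_fft] * window
              for i in range(0, len(x) - n_fft + 1, hop)]
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    fb = mel_filterbank(n_mels, n_fft, sr)
    return np.log(power @ fb.T + 1e-10)    # (frames, mel channels)
```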


REMOS [Sehr 2010c]

Discussion

+ Approach tailored to reverberant feature vector sequences

+ Long-term relations explicitly captured by reverberation model

+ Reverberation exploited for discrimination

+ Very promising results in logmelspec domain

− Inner optimization increases decoding complexity

− Implementation requires changes in decoding routines

Promising direction for future research


IV. Summary, Conclusions, and Outlook

Dereverberation for Signal Enhancement: State-of-the-art

- Close to 12 dB DRR gain for T60 ≈ 0.7 s (offline)
  - with 4 mics, d = 1.65 m, no noise (TRINICON, 2 sources)
  - with 8 mics, d = 2 m, SNR = 10 dB (MCLP)
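The DRR (direct-to-reverberant ratio) figures quoted above can be illustrated by a simple energy ratio on a room impulse response; the 2.5 ms window around the direct-path peak used below is a common convention, assumed here rather than taken from the tutorial.

```python
import numpy as np

# Hedged sketch of DRR computation from an RIR: the energy of the RIR in a
# short window around the direct-path arrival, relative to the remaining
# (reverberant) energy, expressed in dB.

def drr_db(rir, sr, direct_win_ms=2.5):
    t0 = int(np.argmax(np.abs(rir)))                 # direct-path arrival
    w = int(sr * direct_win_ms / 1000.0)
    direct = np.sum(rir[max(t0 - w, 0):t0 + w + 1] ** 2)
    reverberant = np.sum(rir ** 2) - direct
    return 10.0 * np.log10(direct / reverberant)
```

A "12 dB DRR gain" then means the processed signal's effective RIR concentrates roughly 16 times more energy in the direct-path window, relative to the tail, than the unprocessed one.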

Challenges

- Larger distances, more reverberant rooms
- Robustness to speech-like interferers, nonstationary/diffuse noise, transient echo cancellation residuals
- Robust tracking of time-varying acoustics
- Low-latency (≪ 1 s) and efficient real-time implementations
- Joint optimization with spectral subtraction techniques


IV. Summary, Conclusions, and Outlook (cont’d)

Dereverberation as preprocessing for ASR

Example: 20k WSJ convolved with RIRs (T60 = 0.78 s, d = 2 m), NTT ASR

WER [%]   Preproc.     Acoustic model
85.5      none         clean speech
43.4      none         multi-condition training
26.1      1-ch derev   multi-condition training
14.2      2-ch derev   clean w/ unsupervised speaker adaptation by MLLR
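For reference, a reverberation time T60 like the one quoted in this example is commonly estimated from a measured RIR by Schroeder backward integration. The sketch below fits the energy decay curve between −5 dB and −25 dB and extrapolates to −60 dB; the fit range is a common choice, assumed here rather than the tutorial's exact procedure.

```python
import numpy as np

# Hedged sketch of T60 estimation via Schroeder backward integration:
# the backward-integrated RIR energy gives the energy decay curve (EDC),
# whose decay slope in dB/s is extrapolated to a 60 dB decay.

def t60_schroeder(rir, sr, fit_range=(-5.0, -25.0)):
    edc = np.cumsum(rir[::-1] ** 2)[::-1]            # energy decay curve
    edc_db = 10.0 * np.log10(edc / edc[0] + 1e-120)
    hi, lo = fit_range
    idx = np.where((edc_db <= hi) & (edc_db >= lo))[0]
    t = idx / sr
    slope, intercept = np.polyfit(t, edc_db[idx], 1)  # dB per second
    return -60.0 / slope
```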

Challenges for approaching close-talk performance

- Transition from reverberated signals to real recordings
- Self-adaptation to changing acoustics and front-ends, including
  - variable number and changing, unconstrained positions of talkers
  - different nodes in distributed microphone arrays
- Joint optimization with ASR methods to handle reverberation and noise


Summary, Conclusions, and Outlook (cont’d)

Reverberation-specific ASR Techniques: State-of-the-art

- Feature-based techniques account for the inter-frame relations caused by dispersion and efficiently exploit the predictability of reverberation
- Model-based techniques could not yet show their full potential, as framewise adaptation and optimization is computationally complex
- Decoder-based techniques compromise between the above regarding complexity

Outlook

- Integration into state-of-the-art ASR systems
  - expected soon for signal enhancement- and feature-based methods
  - model-based methods must become more efficient for widespread use


Concluding remarks

Dereverberation, the ’Holy Grail’ of Acoustic Signal Processing?

- Blind deconvolution of the acoustic paths seems to come closer
- Less ambitious algorithms are also effective, and their progress follows the typical DSP objectives:
  - increase algorithmic performance and robustness
  - reduce computational load
  - integrate with other functionalities
- As a follow-up to the CHIME Challenge 2011:
  ⇒ the next challenge for reverberation-robust speech processing is underway!


Acknowledgements

We are especially grateful to

- Dr. Keisuke Kinoshita, Dr. Marc Delcroix, Dr. Shoko Araki, Dr. Mehrez Souden, and Dr. Takaaki Hori (NTT)
- Dr. Herbert Buchner, Edwin Mabande, and Lutz Marquardt (formerly LMS)
- Roland Maas and Christian Hofmann (LMS)

for their contributions to the course material, and wish to acknowledge the support of parts of the LMS by

- Deutsche Forschungsgemeinschaft (DFG) under contract number KE 890/4-1


Thank you for your attention.
