Dominant Speaker Identification for Multipoint Videoconferencing Under the supervision of Prof. Israel Cohen Ilana Volfin
Page 1

Dominant Speaker Identification for Multipoint Videoconferencing

Under the supervision of Prof. Israel Cohen

Ilana Volfin

Page 2

Outline

Videoconferencing – an introduction

Discussed realization

Proposed method

Speech activity score evaluation

— Single observation

— Sequence of observations

SNR estimation

Experimental Results

Page 3

Videoconferencing

Used for :

Remote education

Medical consulting

Business meetings

Personal communications

Introduction

Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results

Page 4

Videoconferencing

History: [1]

1964: AT&T ‘Picturephone’. Not commercialized.

1982: Compression Labs. System cost: $250,000. Line cost per hour: $1,000.

1986: PictureTel. System cost: $80,000. Line cost per hour: $100.

1991: System cost: $20,000. Line cost per hour: $30.

Today

[1] Sprey, “Videoconferencing as a Communication Tool”, IEEE Transactions on Professional Communication, 1997

Page 5

What is a Multipoint Conference?

3 or more participants

Each at his own location

Single microphone and video camera

The information is administered through a central control unit

[Figure: audio waveforms of three channels (Ch 1, Ch 2, Ch 3), time axis 0 to 25 sec]

Page 6

Central control unit - MCU

Multipoint Control Unit

The conference is administered through an MCU:

[Diagram: Participant 1, Participant 2, ..., Participant N, each connected to the MCU]

Page 7

Central control unit - MCU

Multipoint Control Unit

The conference is administered through an MCU:

[Diagram: Participant 1, Participant 2, ..., Participant N, each connected to the MCU; blocks inside the MCU: decoding, mixing, encoding, processing]

Heavy processing!

Page 8

Central control unit - MCU

Multipoint Control Unit

The conference is administered through an MCU:

[Diagram: Participant 1, Participant 2, ..., Participant N, each connected to the MCU; blocks inside the MCU: decoding, mixing, encoding, processing]

Reduce the amount of information to improve conference quality

Page 9

Speaker selection

Select M most active participants

Discard all remaining information

Guidelines [2]

The selection process should not introduce audio artifacts

Transparent to the participants

Resistance to noise

Lack of discrimination

[2] P. J. Smith, “Voice conferencing over IP networks”, Master's thesis, Department of Electrical and Computer Engineering, McGill University, Montreal, Canada, 2002.

Page 10

Speaker selection - Challenges

Equipment

Low-quality sensors lower the SNR

Speakers produce crosstalk

Surrounding

General noises

Transient noises

Personal characteristics

Loud/Quiet participants

Page 11

Classic Voice Activity Detection (VAD)

Binary classification problem

Classify each signal frame as either speech or non-speech → ‘High resolution’ decision

— Representation

Use a speech-specific representation to maximize discrimination ability

— Classification method

Thresholding

Machine learning techniques

Page 12

VAD and Speaker Selection

Speaker selection is a generalization of the VAD task

For each signal frame in each channel:

— Determine whether it contains prolonged speech activity or not

‘Lower resolution’ decision is needed

— Non-speech frames may belong to a speech burst too

Page 13

Speaker Selection Methods

Yu et al., Computers and Communications (ISCC), 1998

“Linear PCM signal processing unit in multi-point video conferencing system”

— The channel with the highest power is declared dominant

— To prevent frequent changes of the current speaker, the decision is made every 1 or 2 seconds

Chang, Lucent Technologies, 2001

“Multimedia conference call participant identification system and method”

— Speaking participants are identified as those who pass a volume threshold

Kwak, Verizon Laboratories, 2002

“Speaker identifier for multi-party conference”

— Dominant speaker determined based on signal amplitude
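
The power-based and threshold-based rules listed on this slide can be sketched in a few lines. This is only an illustration under assumptions: each channel frame is a plain list of samples, "volume" is taken to mean the mean squared amplitude, and the helper names and threshold value are hypothetical rather than taken from the cited works.

```python
def mean_power(frame):
    """Average squared amplitude of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)


def loudest_channel(frames):
    """Index of the channel whose current frame has the highest power."""
    powers = [mean_power(f) for f in frames]
    return max(range(len(powers)), key=powers.__getitem__)


def active_channels(frames, power_threshold):
    """Indices of channels whose frame power exceeds a fixed threshold."""
    return [i for i, f in enumerate(frames) if mean_power(f) > power_threshold]


# Toy frames for three channels (values are made up for illustration).
frames = [
    [0.01, -0.02, 0.01, 0.00],   # quiet channel
    [0.30, -0.25, 0.28, -0.31],  # loud channel
    [0.05, 0.04, -0.06, 0.05],   # moderate channel
]
dominant = loudest_channel(frames)
active = active_channels(frames, power_threshold=0.001)
```

Both rules look only at instantaneous level, so a loud transient on a silent channel would win the comparison just as easily as speech.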

Page 14

Speaker Selection Methods

Smith, MSc thesis, McGill University, 2002

“Voice conferencing over IP networks”

— Participants are ranked in the order in which they became active speakers

— A participant can move up in the ranking if the smoothed power of his signal is above a barge-in threshold

Xu et al., ICME, IEEE, 2006

“Pass: Peer-aware silence suppression for internet voice conferences”

— Very advanced VAD for ranking

— Barge-in mechanism

Page 15

Summary of existing methods

Provides a solution for:

Method                              Stationary noise   Frequent switching   Transient noise   Assumption
Yu 1998 (Power)                     -                  -                    -                 Non-dominant channels are completely silent
Chang 2001 (Volume threshold)       -                  -                    -                 Non-dominant channels are completely silent
Kwak 2002 (Amplitude)               -                  -                    -                 Non-dominant channels are completely silent
Smith 2002 (Barge-in)               √                  √                    -                 A change in signal power originates from a change in speaker
Xu 2006 (Barge-in & advanced VAD)   √                  √                    -                 A change in signal power originates from a change in speaker

Transient noise occurrences are not addressed

Page 16

Discussed realization

Dominant speaker identification (Speaker selection with M=1)

Only one participant appears on each screen

Most video information is discarded

Further traffic reduction by discarding some audio

Conversation is focused on the dominant speaker

[Diagram: Participant 1, Participant 2, ..., Participant N, with the dominant speaker and the previous dominant speaker marked]

Page 17

Discussed realization

N channels

Speech Burst – a speech event composed of three sequential phases: initiation, steady state and termination.

Speaker Switch – the point where a change in dominant speaker occurs

Objective:

Follow speech bursts

Detect speaker switches

[Figure: audio waveforms of three channels, time axis in sec]

Page 18

Proposed Method

Dominant speaker identification relying on speech activity in time intervals of different lengths

[Diagram: time axis ending at $t$ = now, marking the currently observed frame, a time interval of medium length and a long time interval]

Page 19

Dominant speaker identification relying on speech activity in time intervals of different lengths

Two stages:

Local processing

— Individual processing on each channel

— Speech activity score evaluation for the immediate, medium and long time intervals

Global decision

— Speech activity score comparison across channels

Page 20

Local processing

Channel $i$, time frame $l$:

[Diagram: speech detection in sub-bands → immediate-time speech activity evaluation → medium-interval speech activity evaluation → long-interval speech activity evaluation; the three evaluations output the scores $\Phi^{i}_{immed}(l)$, $\Phi^{i}_{medium}(l)$ and $\Phi^{i}_{long}(l)$]

Speech activity evaluation

Speech activity is evaluated on time intervals of three characteristic lengths

Three sequential steps

The input into each step consists of smaller sub-units, of a length characteristic of the previous step

Activity is determined by the number of active sub-units

Activity in a sub-unit is determined by thresholding
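
The threshold-and-count cascade described here can be sketched as follows. The sketch rests on several assumptions that the slides do not fix: the value thresholded in each sub-band is an estimated SNR, the voiced band covers `n1` bins starting at `k1`, the comparison is strict, and the long interval is sampled once every `n2` frames. All function and parameter names are hypothetical.

```python
def count_above(values, threshold):
    """Number of entries strictly greater than the threshold."""
    return sum(1 for v in values if v > threshold)


def local_activity_counts(snr, k1, n1, n2, n3, th1, th2, th3):
    """Return (a1, a2, a3) for the most recent frame of one channel.

    snr[l][k] is the estimated SNR in frame l and frequency bin k.
    """
    # Step 1: per frame, count active sub-bands inside the voiced band.
    a1 = [count_above(frame[k1:k1 + n1], th1) for frame in snr]
    # Step 2: per frame, count active frames among the last n2 frames.
    a2 = [count_above(a1[max(0, l - n2 + 1):l + 1], th2)
          for l in range(len(a1))]
    # Step 3: count active blocks among the last n3 blocks, reading a2
    # once per block of n2 frames (assumed layout of the long interval).
    last = len(a2) - 1
    block_ends = [last - b * n2 for b in range(n3) if last - b * n2 >= 0]
    a3 = count_above([a2[e] for e in block_ends], th3)
    return a1[-1], a2[-1], a3


# Six frames, four bins; bins 1 and 2 form the (toy) voiced band.
snr = [
    [9, 0, 0, 9],
    [0, 2, 0, 0],
    [0, 2, 3, 0],
    [0, 0, 0, 0],
    [0, 5, 5, 0],
    [0, 5, 0, 0],
]
counts = local_activity_counts(snr, k1=1, n1=2, n2=2, n3=3,
                               th1=1.0, th2=1, th3=0)
```

Because each stage passes on only a count of active sub-units, a single very strong bin or frame contributes no more than a moderately active one.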

Page 21

Local processing

Currently observed frame

[Figure: time-frequency representation of the currently observed frame, with time index $l$ and frequency bins $k_1$ to $k_1+N_1$ marked]

• Time-frequency representation

• $l$ – time index

• $k_1$, $k_1+N_1$ – frequency range of voiced speech

Page 22

Local processing

Currently observed frame

Time interval of medium length

[Diagram: the sub-bands of frame $l$ are thresholded ($> th_1$) into a binary vector, which is summed ($\Sigma$) to give $a_1(l)$; the medium interval holds $a_1(l),\ a_1(l-1),\ \cdots,\ a_1(l-N_2+1)$]

Page 23

Local processing

Currently observed frame

Time interval of medium length

$$\alpha_l = \left[\,a_1(l-N_2+1),\ \ldots,\ a_1(l-1),\ a_1(l)\,\right]$$

Page 24

Local processing

Currently observed frame

Time interval of medium length

[Diagram: the entries of $\alpha_l$ are thresholded ($> th_2$) into a binary vector, which is summed ($\Sigma$) to give $a_2(l)$]

Page 25

Local processing

Currently observed frame

Time interval of medium length

Long time interval

[Diagram: $\alpha_l$ is thresholded ($> th_2$) into a binary vector and summed ($\Sigma$) to give $a_2(l)$; the long interval is divided into blocks of $N_2$ frames, labelled $a_2(l),\ a_2(l-N_2+1),\ a_2(l-2N_2+1),\ \cdots,\ a_2(l-(N_3-1)N_2+1)$]

Page 26

Local processing

Currently observed frame

Time interval of medium length

Long time interval

$$\beta_l = \left[\,a_2(l-N_3N_2+1),\ \ldots,\ a_2(l-N_2+1),\ a_2(l)\,\right]$$

Page 27

Local processing

Currently observed frame

Time interval of medium length

Long time interval

[Diagram: the entries of $\beta_l$ are thresholded ($> th_3$) into a binary vector, which is summed ($\Sigma$) to give $a_3(l)$]

Page 28

Local processing

Currently observed frame

Time interval of medium length

Long time interval

[Diagram: $a_1(l)$, $a_2(l)$ and $a_3(l)$ shown for the currently observed frame, the medium-length interval and the long interval, respectively]

Page 29

Active sub-units & Transients

Currently observed frame

Time interval of medium length

Long time interval

Why thresholding and counting?

Smoothing

Equalization

Noise spike suppression

$a_1(l)$ – # of active sub-bands

$a_2(l)$ – # of active frames

$a_3(l)$ – # of active blocks

Page 30

Active sub-units & Transients

Currently observed frame

Time interval of medium length

Long time interval

Discriminating isolated transients from fluent audio activity:

[Figure: binary vectors for the immediate, medium and long intervals, with $a_1(l)$ = # of active sub-bands, $a_2(l)$ = # of active frames and $a_3(l)$ = # of active blocks]

Page 31

Global Decision

Time frame 𝑙:

[Diagram: for each channel $i = 1, \ldots, N$, the counts $a_1^i(l),\ a_2^i(l),\ a_3^i(l)$ are passed to the dominant speaker identification block, which outputs the dominant speaker in frame $l$]

Page 32

Global Decision

Time frame 𝑙:

[Diagram: for each channel $i = 1, \ldots, N$, the scores $\Phi^{i}_{immed}(l),\ \Phi^{i}_{medium}(l),\ \Phi^{i}_{long}(l)$ are passed to the dominant speaker identification block, which outputs the dominant speaker in frame $l$]

Page 33

Dominant speaker identification

Activated once in a decision-interval

Objective:

Detect speaker switch events

Method:

Compare speech activity scores across channels

Each channel is evaluated with respect to:

— the dominant channel

— minimal activity level

The current dominant speaker remains unless activity on one of the other channels justifies a speaker switch
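
One way such a decision rule could look is sketched below. The slides do not give the exact comparison, so everything here is an assumption: a challenger must beat the current dominant channel by a margin on all three time scales, must exceed a minimal activity level on all three, and ties between challengers are broken by the medium-interval gain. The names `update_dominant`, `margins` and `min_levels` are hypothetical.

```python
def update_dominant(current, scores, margins, min_levels):
    """Return the dominant channel index after one decision interval.

    scores[i] is a tuple (immediate, medium, long) of speech activity
    scores for channel i.
    """
    best = current
    best_gain = float("-inf")
    for i, channel_scores in enumerate(scores):
        if i == current:
            continue
        # Score advantage of channel i over the current dominant channel.
        gains = [s - d for s, d in zip(channel_scores, scores[current])]
        beats_dominant = all(g > m for g, m in zip(gains, margins))
        active_enough = all(s > lo for s, lo in zip(channel_scores, min_levels))
        # Among qualifying channels, prefer the largest medium-interval gain.
        if beats_dominant and active_enough and gains[1] > best_gain:
            best, best_gain = i, gains[1]
    return best


scores = [(1.0, 2.0, 3.0), (5.0, 6.0, 4.5), (0.2, 0.1, 0.0)]
dominant = update_dominant(0, scores, margins=(1.0, 1.0, 1.0),
                           min_levels=(0.0, 0.0, 0.0))
```

If no channel qualifies, the function returns `current`, which mirrors the statement that the dominant speaker remains unless another channel justifies a switch.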

Page 34

Score Evaluation

Single

Score computed on a single observation

Each observation is analyzed independently

More responsive to changes

Sequential

Score computed on a sequence of observations

Exploits natural temporal dependence in speech

Smoother

Page 35

Likelihood Models

Modeling the number of active sub-units $a$

Two hypotheses: $H_1$ – speech is present, $H_0$ – speech is absent

$N$ – number of sub-units

$a$ – number of active sub-units

$$p(a \mid H_1) = \mathrm{Bin}(N, p) = \binom{N}{a}\, p^{a} (1-p)^{N-a}$$

$$p(a \mid H_0) = \mathrm{Exp}(\lambda) = \lambda e^{-\lambda a}$$

[Diagram: the likelihoods under $H_1$ and $H_0$ feed either the single or the sequential score evaluation]

Page 36

Score based on a single observation

Likelihood ratio:

$$\Lambda = \frac{p(a \mid H_1)}{p(a \mid H_0)}$$

Log-likelihood ratio (speech activity score):

$$\Phi = \log \Lambda$$

$$\Phi = \log\binom{N}{a} + a \log(p) + (N-a)\log(1-p) - \log\lambda + \lambda a$$
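
A minimal sketch of the single-observation score, written directly from the binomial and exponential likelihood models. The function names and the parameter values in the example are illustrative assumptions; the slides do not give numerical values for $p$ or $\lambda$.

```python
import math


def log_binom_coeff(n, a):
    """log of the binomial coefficient C(n, a), via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(a + 1) - math.lgamma(n - a + 1)


def single_score(a, n, p, lam):
    """Log-likelihood ratio for a active sub-units out of n.

    H1: a ~ Binomial(n, p); H0: a ~ Exponential(lam). Requires 0 < p < 1
    and lam > 0.
    """
    log_h1 = log_binom_coeff(n, a) + a * math.log(p) + (n - a) * math.log(1.0 - p)
    log_h0 = math.log(lam) - lam * a
    return log_h1 - log_h0


# Illustrative values: 10 sub-units, p = 0.5, lam = 1.0.
few_active = single_score(1, 10, 0.5, 1.0)
many_active = single_score(8, 10, 0.5, 1.0)
```

With these example parameters the score grows with the number of active sub-units, so `many_active` is larger than `few_active`.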

Page 37

Score based on a sequence of observations

The score is computed on a sequence of observations

$$\mathbf{X}_l = \left[\,a(l-M+1),\ \ldots,\ a(l)\,\right]$$

Observations are not independent

A speech frame is more likely to be followed by another speech frame [4]:

$$p(q_n = H_1 \mid q_{n-1} = H_1) > p(q_n = H_1)$$

First-order Markovian dependency in the transitions between consecutive frames

[4] J. Sohn et al., “A statistical model-based voice activity detection”, IEEE Signal Processing Letters, 1999

Page 38

Score based on a sequence of observations

HMM of two states:

$$\zeta_l = \begin{cases} H_{1,(l)} & \text{speech is present in frame } l \\ H_{0,(l)} & \text{speech is absent in frame } l \end{cases}$$

State dynamics: $b_{ij} = p(\zeta_l = j \mid \zeta_{l-1} = i),\quad i, j \in \{0, 1\}$

— $b_{00} + b_{01} = 1,\quad b_{10} + b_{11} = 1$

Steady state probabilities

$$p_{H_0} = \frac{b_{10}}{b_{10} + b_{01}}, \qquad p_{H_1} = \frac{b_{01}}{b_{10} + b_{01}}$$

Current observation only depends on current state: $p(X_l \mid \zeta_l, \mathbf{X}_l) = p(X_l \mid \zeta_l)$
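
The steady state probabilities can be checked numerically: applying one step of the state dynamics to $(p_{H_0}, p_{H_1})$ should return the same pair. The sketch below assumes the convention stated on this slide, $b_{ij} = p(\zeta_l = j \mid \zeta_{l-1} = i)$; the transition values are made up for illustration.

```python
def steady_state(b01, b10):
    """Return (p_H0, p_H1) for a two-state chain with the given transitions."""
    total = b01 + b10
    return b10 / total, b01 / total


def one_step(p0, p1, b01, b10):
    """Propagate state probabilities through one transition."""
    b00, b11 = 1.0 - b01, 1.0 - b10
    return p0 * b00 + p1 * b10, p0 * b01 + p1 * b11


# Illustrative transition probabilities (not from the slides).
p_h0, p_h1 = steady_state(b01=0.1, b10=0.3)
next_p_h0, next_p_h1 = one_step(p_h0, p_h1, b01=0.1, b10=0.3)
```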

Page 39

Score based on a sequence of observations

Denote: $\alpha_l(1) \triangleq p(\mathbf{X}_l, H_{1,(l)})$ and $\alpha_l(0) \triangleq p(\mathbf{X}_l, H_{0,(l)})$

Likelihood ratio:

$$\Lambda_l^{seq} = \frac{p(\mathbf{X}_l \mid H_{1,(l)})}{p(\mathbf{X}_l \mid H_{0,(l)})} = \frac{p(\mathbf{X}_l, H_{1,(l)})\, p_{H_0}}{p(\mathbf{X}_l, H_{0,(l)})\, p_{H_1}} = \frac{\alpha_l(1)}{\alpha_l(0)} \cdot \frac{p_{H_0}}{p_{H_1}}$$

Log-likelihood ratio:

$$\Phi_l^{seq} = \log\frac{\alpha_l(1)}{\alpha_l(0)} + \log\frac{p_{H_0}}{p_{H_1}}$$

Recursive formula (forward procedure) for $\alpha_l(j)$, $j \in \{0, 1\}$:

$$\alpha_q(j) = \begin{cases} p(X_q = a(q) \mid H_{j,(q)}) \left[\alpha_{q-1}(0)\, b_{j0} + \alpha_{q-1}(1)\, b_{j1}\right], & l-M+1 < q \le l \\ p(X_q = a(q) \mid H_{j,(q)})\, p(H_j), & q = l-M+1 \end{cases}$$
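
A small sketch of the forward procedure for the two-state model follows. It assumes that `b[i][j]` is the probability of moving from state `i` to state `j`, so the recursion sums `alpha[i] * b[i][j]` over the previous state `i` (the standard forward recursion; the index order printed in the formula may follow a different convention). The likelihood models are the binomial and exponential ones used for the single-observation score, and all parameter values are illustrative. The forward variables are rescaled at each step for numerical stability, which leaves their ratio unchanged.

```python
import math


def binomial_pmf(a, n, p):
    return math.comb(n, a) * p ** a * (1.0 - p) ** (n - a)


def exponential_pdf(a, lam):
    return lam * math.exp(-lam * a)


def sequential_score(obs, n, p, lam, b01, b10):
    """Log-likelihood ratio over a sequence of active-sub-unit counts."""
    # b[i][j]: probability of moving from state i to state j (assumed convention).
    b = [[1.0 - b01, b01], [b10, 1.0 - b10]]
    prior = [b10 / (b01 + b10), b01 / (b01 + b10)]  # steady state (p_H0, p_H1)

    def likelihoods(a):
        return [exponential_pdf(a, lam), binomial_pmf(a, n, p)]

    like = likelihoods(obs[0])
    alpha = [like[j] * prior[j] for j in (0, 1)]
    for a in obs[1:]:
        like = likelihoods(a)
        alpha = [like[j] * (alpha[0] * b[0][j] + alpha[1] * b[1][j])
                 for j in (0, 1)]
        scale = alpha[0] + alpha[1]          # rescale; the ratio is unchanged
        alpha = [x / scale for x in alpha]
    return math.log(alpha[1] / alpha[0]) + math.log(prior[0] / prior[1])


# Illustrative parameters: 10 sub-units, p = 0.5, lam = 1.0, sticky states.
speech_run = sequential_score([8, 8, 8], 10, 0.5, 1.0, b01=0.1, b10=0.3)
speech_single = sequential_score([8], 10, 0.5, 1.0, b01=0.1, b10=0.3)
```

For a sequence of length one the function reduces to the single-observation log-likelihood ratio, which gives a convenient sanity check.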

Page 40

Score Evaluation - Summary

Single vs. Sequential

Denote: $\Gamma(n) = \dfrac{\alpha_n(1)}{\alpha_n(0)}$, so that $\Phi^{seq} = \log\Gamma(n) + \log\dfrac{p_{H_0}}{p_{H_1}}$

$$\Gamma(n) = \frac{\alpha_n(1)}{\alpha_n(0)} = \frac{p(a_n \mid H_1)}{p(a_n \mid H_0)} \cdot \frac{\alpha_{n-1}(0)\, b_{10} + \alpha_{n-1}(1)\, b_{11}}{\alpha_{n-1}(0)\, b_{00} + \alpha_{n-1}(1)\, b_{01}} = \Lambda_n \cdot \frac{b_{10} + \Gamma(n-1)\, b_{11}}{b_{00} + \Gamma(n-1)\, b_{01}}$$

$$\Phi^{seq} = \Phi^{single} + \log\frac{b_{10} + \Gamma(n-1)\, b_{11}}{b_{00} + \Gamma(n-1)\, b_{01}} + \log\frac{p_{H_0}}{p_{H_1}}$$

Page 41

Score Evaluation - Summary

Single vs. Sequential

$$\Phi^{seq} = \Phi^{single} + \log\frac{b_{10} + \Gamma(n-1)\, b_{11}}{b_{00} + \Gamma(n-1)\, b_{01}} + \log\frac{p_{H_0}}{p_{H_1}}$$

In the presence of speech

$$\Gamma(n-1) \gg 1 \;\Rightarrow\; \log\frac{b_{10} + \Gamma(n-1)\, b_{11}}{b_{00} + \Gamma(n-1)\, b_{01}} \to \log\frac{b_{11}}{b_{01}} > 0$$

$$\Rightarrow\; \Phi^{seq} > \Phi^{single}$$

Page 42

Score Evaluation - Summary

Single vs. Sequential

$$\Phi^{seq} = \Phi^{single} + \log\frac{b_{10} + \Gamma(n-1)\, b_{11}}{b_{00} + \Gamma(n-1)\, b_{01}} + \log\frac{p_{H_0}}{p_{H_1}}$$

In the absence of speech

$$\Gamma(n-1) \ll 1 \;\Rightarrow\; \log\frac{b_{10} + \Gamma(n-1)\, b_{11}}{b_{00} + \Gamma(n-1)\, b_{01}} \to \log\frac{b_{10}}{b_{00}} < 0$$

$$\Rightarrow\; \Phi^{seq} < \Phi^{single}$$

Sequential processing improves separability

Page 43

A-priori SNR Estimation

Let: $y(n) = x(n) + d(n) \;\xrightarrow{\text{STFT}}\; Y_l = X_l + D_l$

$\lambda_l = E\{|X_l|^2\}$ – spectral variance of speech

$X_l$ is a complex vector, $X_l = A_l e^{j\phi_l}$, so $\lambda_l = E\{|X_l|^2\} = E\{A_l^2\}$

$\lambda_{D_l} = E\{|D_l|^2\}$ – spectral variance of noise

A-priori SNR: $\xi_l = \dfrac{\lambda_l}{\lambda_{D_l}}$

A-priori SNR estimate: $\hat{\xi}_{l|l} = \dfrac{\hat{\lambda}_{l|l}}{\lambda_{D_l}}$

Page 44

A-priori SNR Estimation

The estimator is derived in two steps: [5]

1. Propagation Step

Assuming we have all information up to frame $l-1$

Obtain the one-frame-ahead conditional variance $\hat{\lambda}_{l|l-1}$

2. Update Step

Update the estimator using information from the current frame, $l$

Obtain $\hat{\lambda}_{l|l}$

[5] I. Cohen, “Relaxed statistical model for speech enhancement and a priori SNR estimation”, IEEE Transactions on Speech and Audio Processing, 2005.

Page 45

A-priori SNR Estimation

The estimator is derived in two steps:

1. Propagation Step

Assuming we have all information up to frame $l-1$

The conditional variance of speech is assumed to propagate as a GARCH(1,1) model:

$$\lambda_{l|l-1} = \lambda_{min} + \mu\, |X_{l-1}|^2 + \delta \left(\lambda_{l-1|l-2} + \lambda_{min}\right)$$

$$\lambda_{min} > 0,\quad \mu \ge 0,\quad \delta \ge 0,\quad \mu + \delta < 1$$

$$\hat{\lambda}_{l|l-1} = E\left\{\lambda_{l|l-1} \mid \hat{A}_{l-1},\ \hat{\lambda}_{l-1|l-2}\right\} = \lambda_{min} + \mu\, \hat{A}_{l-1}^{2} + \delta \left(\hat{\lambda}_{l-1|l-2} + \lambda_{min}\right)$$
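
The propagation step amounts to a one-line recursion. The sketch below transcribes the expression as it is printed on the slide, including the sign inside the $\delta$ term, and enforces the stated parameter constraints; the cited reference [5] remains the authority for the exact form. The function name and the numerical values are illustrative assumptions.

```python
def propagate_variance(prev_pred, prev_amp_sq, lam_min, mu, delta):
    """One-frame-ahead conditional variance, as printed on the slide.

    prev_pred   : previous one-frame-ahead variance estimate
    prev_amp_sq : squared spectral amplitude estimate of the previous frame
    """
    if not (lam_min > 0 and mu >= 0 and delta >= 0 and mu + delta < 1):
        raise ValueError("need lam_min > 0, mu >= 0, delta >= 0, mu + delta < 1")
    return lam_min + mu * prev_amp_sq + delta * (prev_pred + lam_min)


# Illustrative values only.
lam_pred = propagate_variance(prev_pred=1.0, prev_amp_sq=4.0,
                              lam_min=0.1, mu=0.2, delta=0.7)
```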

Page 46

A-priori SNR Estimation

The estimator is derived in two steps:

1. Propagation Step

$$\hat{\lambda}_{l|l-1} = E\left\{\lambda_{l|l-1} \mid \hat{A}_{l-1},\ \hat{\lambda}_{l-1|l-2}\right\} = \lambda_{min} + \mu\, \hat{A}_{l-1}^{2} + \delta \left(\hat{\lambda}_{l-1|l-2} + \lambda_{min}\right)$$

2. Update Step

Update the estimator using information from the current frame, $l$

Obtain $\hat{\lambda}_{l|l}$

Page 47

A-priori SNR estimation

2. Update Step:

Let: $y(n) = x(n) + d(n) \;\xrightarrow{\text{STFT}}\; Y_l = X_l + D_l$

Spectral enhancement: $\hat{X}_l = G\, Y_l$

$G$ – spectral gain function

$G$ is chosen as the minimizer of a certain distortion measure

Page 48

A-priori SNR estimation

2. Update Step:

We chose $G_{SP}$, the minimizer of the power distortion measure

$$d_{SP} = \left(A_l^2 - \hat{A}_l^2\right)^2$$

$\gamma_l = \dfrac{|Y_l|^2}{\lambda_{D_l}}$ – a posteriori SNR

$\hat{\lambda}_{l|l-1}$ – one-frame-ahead conditional variance (speech)

The resulting spectral gain function is:

$$G_{SP}\left(\hat{\lambda}_{l|l-1}, \gamma_l\right) = \sqrt{\frac{\hat{\lambda}_{l|l-1}}{\lambda_{D_l} + \hat{\lambda}_{l|l-1}} \left(\frac{1}{\gamma_l} + \frac{\hat{\lambda}_{l|l-1}}{\lambda_{D_l} + \hat{\lambda}_{l|l-1}}\right)}$$

Page 49

2. Update Step

$$\hat{X}_l = G_{SP}\left(\hat{\lambda}_{l|l-1}, \gamma_l\right) Y_l$$

$$\hat{\lambda}_{l|l} = E\left\{|\hat{X}_l|^2\right\} = G_{SP}^{2}\left(\hat{\lambda}_{l|l-1}, \gamma_l\right) |Y_l|^2$$

The a-priori SNR estimator:

$$\hat{\xi}_{l|l} = \frac{\hat{\lambda}_{l|l}}{\lambda_{D_l}} = \frac{\hat{\lambda}_{l|l-1}}{\lambda_{D_l} + \hat{\lambda}_{l|l-1}} \left(1 + \frac{\hat{\lambda}_{l|l-1}\, \gamma_l}{\lambda_{D_l} + \hat{\lambda}_{l|l-1}}\right)$$

→ Speech detection in sub-bands
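
The update step can be sketched for a single time-frequency bin as below. The gain is taken with a square root so that $G_{SP}^2\,\gamma_l$ reproduces the closed-form a-priori SNR estimator; the function name and the numbers in the example are illustrative assumptions.

```python
import math


def update_step(lam_pred, noise_var, y_abs_sq):
    """Return (gain, updated speech variance, a-priori SNR estimate).

    lam_pred  : one-frame-ahead conditional variance of speech
    noise_var : spectral variance of the noise
    y_abs_sq  : squared magnitude of the noisy STFT coefficient
    """
    gamma = y_abs_sq / noise_var                 # a posteriori SNR
    w = lam_pred / (noise_var + lam_pred)
    gain = math.sqrt(w * (1.0 / gamma + w))      # spectral gain G_SP
    lam_upd = gain ** 2 * y_abs_sq               # updated speech variance
    xi = lam_upd / noise_var                     # a-priori SNR estimate
    return gain, lam_upd, xi


# Illustrative values only.
gain, lam_upd, xi = update_step(lam_pred=1.0, noise_var=1.0, y_abs_sq=2.0)
```

The returned `xi` is the per-bin quantity that a sub-band speech detector could then compare with a threshold.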

Page 50

Experimental results

Three Experiments:

1. Identification of the dominant speaker

2. Robustness to transient audio occurrences

3. Results on a real multipoint conference

Page 51

Experimental results

The proposed method was compared to:

The speaker with the highest VAD score in the decision interval is selected, using one of the following VADs:

1. RAMIREZ [6]

2. SOHN [4]

3. GARCH [7]

4. The speaker with the highest SNR is selected

5. The speaker with the highest signal POWER is selected

[6] Ramirez et al., “Statistical voice activity detection using a multiple observation likelihood ratio test”, IEEE Signal Processing Letters, 2005

[7] S. Mousazadeh and I. Cohen, “AR-GARCH in presence of noise: Parameter estimation and its application to voice activity detection”, IEEE Transactions on Audio, Speech, and Language Processing, 2011

Page 52

Performance Evaluation

Quantitative Evaluation:

Number of false speaker switches

Mid Sentence Clipping (MC) – percentage of the middle section of a speech burst that goes undetected

[Figure: true versus detected speech burst, with the MC region marked]
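
A minimal sketch of how the two measures could be computed from per-interval labels. The slides do not spell out the exact definitions, so the ones used here are assumptions: a detected switch is counted as false when it lands on a channel that is not the true dominant speaker at that moment, and MC is the share of a burst, between its first and last correct detection, that was assigned to another channel. Function names and data are hypothetical.

```python
def false_switches(true_labels, detected_labels):
    """Count detected switches that land on a non-dominant channel."""
    count = 0
    for t in range(1, len(detected_labels)):
        switched = detected_labels[t] != detected_labels[t - 1]
        if switched and detected_labels[t] != true_labels[t]:
            count += 1
    return count


def mid_sentence_clipping(detected_labels, start, end, speaker):
    """Percent of the burst [start, end) clipped in its middle section."""
    burst = detected_labels[start:end]
    hits = [i for i, d in enumerate(burst) if d == speaker]
    if not hits:
        return 0.0  # never detected: no middle section to clip (assumed convention)
    middle = burst[hits[0]:hits[-1] + 1]
    clipped = sum(1 for d in middle if d != speaker)
    return 100.0 * clipped / len(burst)


# Toy example: channel 0 speaks during all ten decision intervals.
true_labels = [0] * 10
detected = [1, 0, 0, 0, 2, 2, 0, 0, 0, 0]
n_false = false_switches(true_labels, detected)
mc = mid_sentence_clipping(detected, 0, 10, speaker=0)
```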

Page 53

Experiment #1

Signals:

One or more concatenated TIMIT sentences of the same speaker (with added white noise)

The non-silent part of the sentence is expected to be detected as a continuous speech burst
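A test channel of this kind could be assembled as in the sketch below. The SNR value, the function names, and the convention of measuring signal power over the whole concatenated signal (silences included) are assumptions; the slides state only that white noise was added.

```python
import numpy as np


def add_white_noise(speech, snr_db, rng=None):
    """Add white Gaussian noise so that the overall SNR equals snr_db."""
    rng = np.random.default_rng() if rng is None else rng
    speech = np.asarray(speech, dtype=float)
    signal_power = np.mean(speech ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=speech.shape)
    return speech + noise


def build_channel(sentences, snr_db, rng=None):
    """Concatenate the sentences of one speaker and add white noise."""
    return add_white_noise(np.concatenate(sentences), snr_db, rng)
```

Passing a seeded generator, for example `np.random.default_rng(0)`, makes the noisy channel reproducible across runs.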


Page 54: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed

Experiment #1

Figure: dominant speaker identification results, proposed method versus Sohn VAD-based method, decision interval = 0.1 sec

Page 55: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed

Experiment #1

Figure, two panels, both plotted against the decision interval [sec] (0 to 0.7):

False Speaker Switches (axis labeled "False camera switches", 0 to 90)

Mid Sentence Clipping in % (0 to 3)

Curves in both panels: Single, POWER, SNR, RAMIREZ, SOHN, GARCH, Sequential

Page 56: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed

Experiment #2

Signals:

One or more concatenated TIMIT sentences of the same speaker (with added white noise)

A sneezing sound was added to channel 2

Door-knock sounds were added to channel 3
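Inserting a transient into one channel amounts to adding a short clip at a chosen offset. The sketch below is an assumption about how such a signal might be built; the function name, the offset, and the gain are hypothetical and do not come from the slides.

```python
import numpy as np


def add_transient(channel, transient, start_sample, gain=1.0):
    """Return a copy of channel with a transient mixed in at start_sample.

    The transient is truncated if it would run past the end of the channel.
    """
    out = np.array(channel, dtype=float)  # copy, the input is left untouched
    transient = np.asarray(transient, dtype=float)
    if start_sample < 0 or start_sample >= len(out):
        raise ValueError("start_sample must lie inside the channel")
    end = min(start_sample + len(transient), len(out))
    out[start_sample:end] += gain * transient[:end - start_sample]
    return out
```

Because the clip is added rather than substituted, the underlying speech and background noise of the channel are preserved during the transient.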


Page 57: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed

Experiment #2

Figure: waveforms of channels 1 to 3 over 0 to 25 sec, with the sneezing and the knocks marked

Page 58: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed

Experiment #2

Figure, two panels, both plotted against the decision interval [sec] (0 to 0.7):

False Speaker Switches (axis labeled "False camera switches", 0 to 90)

Mid Sentence Clipping in % (0 to 3.5)

Curves in both panels: Single, POWER, SNR, RAMIREZ, SOHN, GARCH, Sequential

Page 59: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed

Experiment #2

Figure: False Speaker Switches and Mid Sentence Clipping for the Single and Sequential scores, decision interval = 0.05 sec

Page 60: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed

Experiment #2

Figure: dominant speaker identification results, proposed method versus GARCH VAD-based method, decision interval = 0.3 sec

Red: hand-labeled. Black: algorithm's result.

Page 61: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed

Experiment #2

Figure: dominant speaker identification results, proposed method versus POWER-based method, decision interval = 0.3 sec

Red: hand-labeled. Black: algorithm's result.

Page 62: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed

Experiment #3

Signals:

Real multi-channel conversation, 5 channels:

Channels 2 and 4: clean speech

Channel 1: mostly noise

Channels 3 and 5: crosstalk


Page 63: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed

Experiment #3

Figure: dominant speaker identification results, proposed method versus POWER-based method, decision interval = 0.4 sec

Red: hand-labeled. Black: algorithm's result.

Page 64: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed

Summary

Novel method for dominant speaker identification

Two approaches to speech activity score evaluation

Experimental framework

Fewer false speaker switches

Robustness to transient audio occurrences

Page 65: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed

Future Research

Non-causal processing

Temporally variable thresholds

Preprocessing

Speech enhancement

Echo detection (or cancellation)


Page 66: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed

Thank You
