Page 1: Robust Speaker Recognition

Robust Speaker Recognition

JHU Summer School 2008

Lukas Burget, Brno University of Technology

Page 2: Robust Speaker Recognition

NIST SRE2008 - Interview speech

The same microphone in training and test

< 1% EER

Different microphone in training and test

about 3% EER

• Variability refers to changes in channel effects between training and successive detection attempts

• Channel/session variability encompasses several factors:
  – The microphones
    • carbon-button, electret, hands-free, array, etc.
  – The acoustic environment
    • office, car, airport, etc.
  – The transmission channel
    • landline, cellular, VoIP, etc.
  – The differences in speaker voice
    • aging, mood, spoken language, etc.

• Anything which affects the spectrum can cause problems

– Speaker and channel effects are bound together in the spectrum, and hence in the features used by speaker verifiers

The largest challenge to practical use of speaker detection systems is channel/session variability

Intersession variability

Page 3: Robust Speaker Recognition
Page 4: Robust Speaker Recognition

Channel/Session Compensation

Channel/session compensation occurs at several levels in a speaker detection system

[Diagram: speaker detection pipeline – front-end processing feeds a target model and a background model (the target is adapted from the background), whose scores are combined by LR score normalization. Compensation techniques by domain:]

Signal domain:
• Noise removal
• Tone removal

Feature domain:
• Cepstral mean subtraction
• RASTA filtering
• Mean & variance normalization
• Feature warping
• Feature Mapping
• Eigenchannel adaptation in feature domain

Model domain:
• Speaker Model Synthesis
• Eigenchannel compensation
• Joint Factor Analysis
• Nuisance Attribute Projection

Score domain:
• Z-norm
• T-norm
• ZT-norm

Page 5: Robust Speaker Recognition

[Channel/session compensation overview diagram repeated; see Page 4.]

Page 6: Robust Speaker Recognition

Adaptive Noise Suppression

Basic idea of spectral subtraction (or Wiener filter):

Y(n) = X(n) − N(n)

• Y(n) – enhanced speech spectrum
• X(n) – spectrum of the n-th frame of noisy speech
• N(n) – estimate of the stationary additive noise spectrum

Reformulated as filtering: Y(n) = H(n)X(n), where H(n) = (X(n) − N(n)) / X(n)

It is also necessary to:
• smooth H(n) in time
• make sure the magnitude spectrum is not negative
• …
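A minimal sketch of this scheme in Python/NumPy, assuming the noise spectrum is estimated from the first few frames of the signal; the flooring and smoothing constants are illustrative choices, not values from the slides:

import numpy as np

def wiener_gain_enhance(frames, n_noise_frames=10, floor=0.01, alpha=0.9):
    """Enhance framed noisy speech (n_frames x frame_len) by spectral
    subtraction expressed as a time-smoothed Wiener-like gain H(n)."""
    X = np.abs(np.fft.rfft(frames, axis=1))        # magnitude spectra X(n)
    N = X[:n_noise_frames].mean(axis=0)            # noise estimate N(n)
    H = (X - N) / np.maximum(X, 1e-10)             # H(n) = (X(n) - N(n)) / X(n)
    H = np.maximum(H, floor)                       # keep magnitudes non-negative
    for t in range(1, len(H)):                     # smooth H(n) in time
        H[t] = alpha * H[t - 1] + (1 - alpha) * H[t]
    return H * X                                   # Y(n) = H(n) X(n)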

Page 7: Robust Speaker Recognition

[Block diagram: degraded speech → short-time spectral magnitude → speech/background detection → adaptive suppression filter → enhanced speech. The smoothed enhanced output (speech |spectrum|), the background |spectrum|, and a spectral-derivative measure control the suppression time constant.]

• Goal: suppress wideband noise and preserve the speech
• Approach: maintain transient and dynamic speech components, such as energy bursts in consonants, that are important “information-carriers”
• The suppression algorithm has two primary components:
  – Detection of speech or background in each frame
  – A suppression component using an adaptive Wiener filter requiring:
    • the underlying speech signal spectrum, obtained by smoothing the enhanced output
    • the background spectrum
    • a signal change measure, given by a spectral derivative, for controlling smoothing constants

Adaptive Noise Suppression

Page 8: Robust Speaker Recognition

• C3 example from ICSI
• Processed with the LLEnhance toolkit for wideband noise reduction

[Audio examples: SNR = 15 dB and SNR = 25 dB]

Adaptive Noise Suppression

Page 9: Robust Speaker Recognition

Front-end processing

Front-end processing

Target modelTarget model

Background model

Background model

LR scorenormalization

LR scorenormalization

Adapt

Signal domainFeature domain Model domain Score domain

• Noise removal

• Tone removal

• Cepstral mean subtraction

• RASTA filtering

• Mean & variance normalization

• Feature warping

• Speaker Model Synthesis

• Eigenchannel compensation

•Joint Factor Analysis

• Nuisance Attribute Projection

• Z-norm

• T-norm

• ZT-norm

•Feature Mapping

•Eigenchannel adaptation in feature domain

Page 10: Robust Speaker Recognition

Cepstral Mean Subtraction

[Diagram: MFCC feature extraction – Fourier transform → magnitude → log() → cosine transform. Scaling the magnitude spectrum by 0.5 shifts the corresponding log filter bank output by log(0.5) ≈ −0.3 across all frames.]

• MFCC feature extraction scheme
• Consider the same speech signal recorded over two different microphones, one attenuating certain frequencies by a factor of two
• Scaling in the magnitude spectrum domain corresponds to a constant shift of the log filter bank outputs

Page 11: Robust Speaker Recognition

Cepstral Mean Subtraction

[Plot: temporal trajectories of a log filter bank output for the two microphones, before and after mean subtraction.]

• Assuming the frequency characteristics of the two microphones do not change over time, the whole temporal trajectories of the affected log filter bank outputs differ by a constant.
• The shift disappears after subtracting the mean computed over the segment.
• Usually only speech frames are considered for the mean estimation.
• Since the cosine transform is a linear operation, the same trick can be applied directly in the cepstral domain.
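A minimal CMS sketch in Python/NumPy; the optional boolean speech mask reflects the note above that usually only speech frames enter the mean estimate:

import numpy as np

def cms(cepstra, speech_mask=None):
    """Subtract the per-coefficient mean over the segment.
    cepstra: (n_frames, n_coeffs); speech_mask: optional (n_frames,) bool."""
    frames = cepstra if speech_mask is None else cepstra[speech_mask]
    return cepstra - frames.mean(axis=0)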

Page 12: Robust Speaker Recognition

[DET plot, NIST SRE 2005, all trials (miss probability [%] vs. false alarm probability [%]): 2048 Gaussians, 13 MFCC + deltas, CMS.]

Page 13: Robust Speaker Recognition

RASTA filtering

[Plots: RASTA filter impulse response and band-pass frequency characteristic (magnitude [dB] vs. frequency [Hz]); example log filter bank trajectory, original vs. RASTA filtered.]

• Filtering the temporal trajectories of log filter bank outputs (or, equivalently, cepstra) with a band-pass filter
• Removes slow changes to compensate for the channel effect (≈ CMS over a 0.5 sec sliding window)
• Removes fast changes (> 25 Hz), which are unlikely to be caused by the speaker, whose ability to quickly change vocal tract configuration is limited
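A sketch of RASTA filtering in Python/SciPy using the classic transfer function (numerator 0.1·[2, 1, 0, −1, −2] with a single pole; the pole value 0.94 follows common implementations), ignoring the start-up handling a reference implementation would add:

import numpy as np
from scipy.signal import lfilter

def rasta_filter(trajectories):
    """Band-pass filter each log filter bank (or cepstral) temporal
    trajectory; trajectories: (n_frames, n_bands) array."""
    b = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])  # band-pass numerator
    a = np.array([1.0, -0.94])                       # integrating pole
    return lfilter(b, a, trajectories, axis=0)       # filter along time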

Page 14: Robust Speaker Recognition

[DET plot, NIST SRE 2005, all trials: 2048 Gaussians, 13 MFCC + deltas, CMS; with RASTA.]

Page 15: Robust Speaker Recognition

Mean and Variance Normalization

[Plot: cepstral coefficient trajectories for clean speech and speech with additive noise, original and after CMN/CVN.]

• While convolutive noise causes a constant shift of the cepstral coefficient temporal trajectories, noise additive in the spectral domain fills the valleys in the trajectories
• In addition to subtracting the mean, each trajectory can be normalized to unit variance (i.e. divided by its standard deviation) to compensate for this effect
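A minimal CMN/CVN sketch in Python/NumPy:

import numpy as np

def mvn(cepstra):
    """Normalize each coefficient trajectory to zero mean, unit variance."""
    mu = cepstra.mean(axis=0)
    sigma = np.maximum(cepstra.std(axis=0), 1e-10)   # avoid division by zero
    return (cepstra - mu) / sigma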

Page 16: Robust Speaker Recognition

Feature Warping

[Diagram: empirical CDF values in (0.0, 1.0) mapped through the inverse Gaussian cumulative distribution function.]

• Warping each cepstral coefficient within a 3-second sliding window into a Gaussian distribution
• Combines advantages of the previous techniques (CMN/CVN, RASTA)
• Resulting coefficients are (locally) Gaussianized, which is more suitable for GMM modeling
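A sketch of feature warping in Python (NumPy/SciPy), with the 3-second window taken as 301 frames under an assumed 10 ms frame rate:

import numpy as np
from scipy.stats import norm

def feature_warp(cepstra, win=301):
    """Warp each coefficient to a standard normal distribution within a
    sliding window; cepstra: (n_frames, n_coeffs) array."""
    half = win // 2
    n_frames = len(cepstra)
    warped = np.empty_like(cepstra, dtype=float)
    for t in range(n_frames):
        lo, hi = max(0, t - half), min(n_frames, t + half + 1)
        window = cepstra[lo:hi]
        rank = (window < cepstra[t]).sum(axis=0) + 1      # rank within window
        warped[t] = norm.ppf((rank - 0.5) / len(window))  # inverse Gaussian CDF
    return warped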

Page 17: Robust Speaker Recognition

[DET plot, NIST SRE 2005, all trials: 2048 Gaussians, 13 MFCC + deltas, CMS; with RASTA; with Feature Warping.]

Page 18: Robust Speaker Recognition

[DET plot, NIST SRE 2005, all trials: 2048 Gaussians, 13 MFCC + deltas, CMS; with RASTA; with Feature Warping; + triple deltas; + HLDA.]

Page 19: Robust Speaker Recognition

Example of 2D GMM

Page 20: Robust Speaker Recognition

HLDA

Heteroscedastic Linear Discriminant Analysis provides a linear transformation that de-correlates the features within each class, so that the classes can be modeled with diagonal covariances.

Page 21: Robust Speaker Recognition

HLDA

HLDA allows for dimensionality reduction while preserving the discriminability between classes (HLDA without dimensionality reduction is also called MLLT).

[Figure: data projected onto a useful dimension, discarding a nuisance dimension.]
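As background (following Kumar & Andreou, 1998), a sketch of the maximum-likelihood objective HLDA optimizes under the diagonal-covariance assumption, where a_j are the rows of the p×p transform A, rows 1…d span the useful dimensions and rows d+1…p the nuisance dimensions:

L(A) = N log|det A|
       − (1/2) Σ_k N_k Σ_{j=1…d} log( a_j Σ_k a_jᵀ )     (useful dims: class covariances Σ_k)
       − (N/2) Σ_{j=d+1…p} log( a_j Σ a_jᵀ ) + const     (nuisance dims: global covariance Σ)

The nuisance rows see only the global statistics, so whatever they carry cannot discriminate between classes and can be dropped.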

Page 22: Robust Speaker Recognition

[Channel/session compensation overview diagram repeated; see Page 4.]

Page 23: Robust Speaker Recognition

• It is generally difficult to get enrollment speech from all microphone types to be used
• The SMS approach addresses this by synthetically generating speaker models as if they came from different microphones (Teunen, ICSLP 2000)
  – A mapping of model parameters between different microphone types is applied

[Diagram: an electret-microphone speaker model synthesized into carbon-button and cellular models.]

Speaker Model Synthesis

Page 24: Robust Speaker Recognition

Speaker Model Synthesis

Learning the mapping of model parameters between different microphone types:

• Start with a channel-independent root model
• Create channel models by adapting the root with channel-specific data
• Learn the mean shift between channel models

Page 25: Robust Speaker Recognition

Speaker Model Synthesis

Training a speaker model:

• Adapt the channel model that scores highest on the training data to get the target model
• Synthesize a new target channel model by applying the learned shift

[Diagram: training data adapts one channel model; the synthesized model handles test data from another channel.]

Per Gaussian i, with channel-model parameters (w_i^CD1, μ_i^CD1, σ²_i^CD1) and (w_i^CD2, μ_i^CD2, σ²_i^CD2), the target model parameters are mapped as:

w_i → T(w_i) = w_i · ( w_i^CD2 / w_i^CD1 )
μ_i → T(μ_i) = μ_i + ( μ_i^CD2 − μ_i^CD1 )
σ²_i → T(σ²_i) = σ²_i · ( σ²_i^CD2 / σ²_i^CD1 )

• GMM weights and variances can also be adapted and used to improve the mapping of model parameters between different microphone types
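A sketch of applying these mappings in Python/NumPy, assuming the two channel models were MAP-adapted from the root as described on the previous slide; the dict layout is an assumption for illustration:

import numpy as np

def synthesize(target, cd1, cd2):
    """Map a target model trained on channel CD1 into channel CD2.
    Each model is a dict with 'w' (M,), 'mu' (M, D), 'var' (M, D)."""
    w = target['w'] * (cd2['w'] / cd1['w'])
    w = w / w.sum()                                  # renormalize weights
    mu = target['mu'] + (cd2['mu'] - cd1['mu'])      # apply mean shifts
    var = target['var'] * (cd2['var'] / cd1['var'])  # scale variances
    return {'w': w, 'mu': mu, 'var': var}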

Page 26: Robust Speaker Recognition

[Channel/session compensation overview diagram repeated; see Page 4.]

Page 27: Robust Speaker Recognition

• Aim: apply a transform to map the channel-dependent feature space into a channel-independent feature space
• Approach:
  – Train a channel-independent (CI) model by pooling data from all types of channels
  – Train channel-dependent (CD) models using MAP adaptation
  – For an utterance, find the top-scoring CD model (channel detection)
  – Map each feature vector of the utterance into the CI space

[Diagram: channel-dependent models CD 1 … CD N, each with a mapping y_t = M_CD→CI(x_t) into the channel-independent (CI) model.]

D.A. Reynolds, “Channel Robust Speaker Verification via Feature Mapping,” ICASSP 2003

Feature mapping

Page 28: Robust Speaker Recognition

Feature mapping

• As for SMS, create channel models by adapting the root with channel-specific data
• Learn the mean shifts between each channel model and the channel-independent root model

Page 29: Robust Speaker Recognition

Feature mapping

• For each (training or test) speech segment, determine the maximum likelihood channel model
• For each frame of the segment, record the top-1 Gaussian of that channel model:

i = argmax_{j=1…M} w_j^CD p_j^CD(x_t)

• For each frame, apply the mapping taking x_t with the CD pdf to y_t with the CI pdf:

y_t = ( x_t − μ_i^CD ) · ( σ_i^CI / σ_i^CD ) + μ_i^CI

• The target model is adapted from the CI model using the mapped features
• Mapped features and CI models are used in test
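A per-frame sketch of this mapping in Python/NumPy for diagonal-covariance GMMs; the dict layout for model parameters is an assumption for illustration:

import numpy as np

def log_gauss_diag(x, mu, var):
    """Log density of x under each of M diagonal Gaussians; mu, var: (M, D)."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var).sum(axis=1)

def feature_map(x, cd, ci):
    """Map frame x from the CD pdf to the CI pdf via the top-1 Gaussian.
    cd, ci: dicts with 'w' (M,), 'mu' (M, D), 'var' (M, D)."""
    i = np.argmax(np.log(cd['w']) + log_gauss_diag(x, cd['mu'], cd['var']))
    scale = np.sqrt(ci['var'][i]) / np.sqrt(cd['var'][i])
    return (x - cd['mu'][i]) * scale + ci['mu'][i]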

Page 30: Robust Speaker Recognition

[DET plot, NIST SRE 2005, all trials: 2048 Gaussians, 13 MFCC + deltas, CMS; with RASTA; with Feature Warping; + triple deltas; + HLDA; + Feature mapping (14 classes).]

Page 31: Robust Speaker Recognition

Session variability in mean supervector space

• GMM mean supervector – a column vector created by concatenating the mean vectors of all GMM components.
• For the case of variances shared by all speaker models, the supervector M fully defines the speaker model.
• Speaker Model Synthesis can be rewritten as: M_CD2 = M_CD1 + k_CD1→CD2, where k_CD1→CD2 is the cross-channel shift.
• Drawbacks of SMS (and Feature Mapping):
  – Channel-dependent models must be created for each channel
  – Different factors causing intersession variability may combine (e.g. channel and language), so compensation must be trained for each such combination
  – The factors are not discrete (i.e. their effects on the intersession variability may be more or less strong)
• There is evidence that only a limited number of directions in the supervector space are strongly affected by intersession variability. Different directions possibly correspond to different factors.
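A sketch in Python/NumPy of obtaining a mean supervector by relevance-MAP adaptation of the UBM means from Baum-Welch statistics; the relevance factor r = 16 is a typical choice, not a value from the slides:

import numpy as np

def map_adapt_means(ubm_mu, N, F, r=16.0):
    """Relevance-MAP adapted means; ubm_mu, F: (M, D), N: (M,) counts."""
    alpha = (N / (N + r))[:, None]             # per-component adaptation weight
    xbar = F / np.maximum(N, 1e-10)[:, None]   # data mean per component
    return alpha * xbar + (1.0 - alpha) * ubm_mu

def supervector(means):
    """Concatenate all component means into one (M*D,) column vector."""
    return means.reshape(-1)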

Page 32: Robust Speaker Recognition

[Figure: UBM and target speaker model with directions of high inter-session variability and high speaker variability.]

Session variability in mean supervector space

Example: single Gaussian model with 2D features

Page 33: Robust Speaker Recognition

[Figure: UBM, target speaker model and test data, with directions of high intersession variability and high speaker variability.]

For recognition, move both models along the high inter-session variability direction(s) so that they fit the test data well (e.g. in the ML sense)

Session compensation in supervector space

Page 34: Robust Speaker Recognition

6D example of supervector space

Page 35: Robust Speaker Recognition

Identifying high intersession variability directions

• Take multiple speech segments from many training speakers recorded under different channel conditions. For each segment, derive a supervector by MAP adapting the UBM.
• From each supervector, subtract the mean computed over the supervectors of the corresponding speaker.
• Find the directions with the largest intersession variability using PCA (the eigenvectors of the average within-speaker covariance matrix).

[Figure: supervectors of speakers 1–3 clustered by speaker; the eigenchannel U is the principal within-speaker direction.]
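A sketch of these three steps in Python/NumPy, with the supervectors assumed precomputed; SVD of the within-speaker-centered matrix yields the same directions as PCA of the average within-speaker covariance:

import numpy as np

def eigenchannels(supervecs, speaker_ids, n_dirs=50):
    """Estimate eigenchannel directions U from segment supervectors.
    supervecs: (n_segments, dim); speaker_ids: (n_segments,) labels."""
    centered = np.asarray(supervecs, dtype=float).copy()
    for spk in np.unique(speaker_ids):
        idx = speaker_ids == spk
        centered[idx] -= centered[idx].mean(axis=0)  # subtract speaker mean
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return Vt[:n_dirs].T                             # U: (dim, n_dirs)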

Page 36: Robust Speaker Recognition

Eigenchannel adaptation

[Figure: target speaker model M and UBM shifted along eigenchannel direction U towards the test data.]

N. Brummer, SDV NIST SRE’04 System description, 2004.

• The speaker model is obtained in the usual way by MAP adapting the UBM
• For a test, adapt the speaker model and the UBM by moving their supervectors in the direction(s) of the eigenchannel(s) to fit the test data well, i.e. find the factors x maximizing the likelihood of the test frames o_t:

x̂ = argmax_x Σ_t log p(o_t | M + Ux)

• The score is the LLR computed using the adapted speaker model and UBM
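A sketch of the factor estimate in Python/NumPy from zero- and first-order Baum-Welch statistics, using the standard closed form; with the identity term this is the point estimate under a standard normal prior on x (dropping it gives the pure ML estimate). Variable names and the statistics layout are assumptions:

import numpy as np

def channel_factors(U, sigma, N, F, M):
    """Solve (I + U^T S^-1 N U) x = U^T S^-1 (F - N*M) for factors x.
    U: (dim, R) eigenchannels; sigma: (dim,) UBM diagonal covariances;
    N: (dim,) zero-order stats (per-component counts repeated over D);
    F: (dim,) first-order stats; M: (dim,) model supervector."""
    UtSi = U.T / sigma                        # U^T Sigma^{-1}
    A = np.eye(U.shape[1]) + (UtSi * N) @ U   # I + U^T Sigma^{-1} N U
    b = UtSi @ (F - N * M)
    return np.linalg.solve(A, b)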

Page 37: Robust Speaker Recognition

[DET plot, NIST SRE 2005, all trials: 2048 Gaussians, 13 MFCC + deltas, CMS; with RASTA; with Feature Warping; + triple deltas; + HLDA; + Feature mapping (14 classes); + Eigenchannel adaptation.]

Page 38: Robust Speaker Recognition

38

Nuisance Attribute Projection

• NAP is an intersession compensation technique proposed for SVMs
• Project out the eigenchannel directions from the supervectors before using them for SVM training or testing

[Figure: supervectors projected onto the complement of the eigenchannel direction U.]
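A minimal NAP sketch in Python/NumPy, assuming the columns of U are orthonormal nuisance (eigenchannel) directions:

import numpy as np

def nap_project(v, U):
    """Project out the nuisance subspace: v <- (I - U U^T) v."""
    return v - U @ (U.T @ v)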

Page 39: Robust Speaker Recognition

• Speaker Model Synthesis: M_CD2 = M_CD1 + k_CD1→CD2
  – constant supervector shift for the recognized training and test channels
• Eigenchannel adaptation: M_test = M_train + Ux
  – the shift is given by a linear combination of the eigenchannel basis U with factors x tuned for the test data
• Eigenvoice adaptation
  – Consider also a supervector subspace V with high speaker variability and use it to obtain the speaker model
  – M = M_UBM + Vy – speaker model given by a linear combination of the UBM supervector and the eigenvoice bases
  – speaker factors y tuned to match the enrollment data
  – Can be combined with the channel subspace: M = M_UBM + Vy + Ux
    • both x and y estimated on the enrollment data
    • only x updated for the test data, to adapt the speaker model to the test channel condition

[Figure: directions of high intersession variability and high speaker variability in supervector space.]

Constructing models in supervector space

Page 40: Robust Speaker Recognition

• M = M_UBM + Vy + Dz + Ux
• Probabilistic model
  – Gaussian priors assumed for the factors y, z, x
  – Hyperparameters M_UBM, V, D, U can be trained using the EM algorithm
  – D – a diagonal matrix describing the remaining speaker variability not covered by the eigenvoices

Joint Factor Analysis

[Figure: eigenvoice directions v1, v2, eigenchannel directions u1, u2, and diagonal entries d11, d22, d33 in supervector space.]
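A toy sketch in Python/NumPy of composing a supervector from the JFA decomposition; estimating the factors and hyperparameters requires the EM machinery mentioned above, so this only illustrates the model:

import numpy as np

def jfa_supervector(m_ubm, V, y, D, z, U, x):
    """M = m_ubm + V y + D z + U x, with D stored as a (dim,) diagonal.
    V: (dim, Ry) eigenvoices; U: (dim, Rx) eigenchannels."""
    return m_ubm + V @ y + D * z + U @ x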

Page 41: Robust Speaker Recognition

[DET plot, NIST SRE 2005, all trials: 2048 Gaussians, 13 MFCC + deltas, CMS; with RASTA; with Feature Warping; + triple deltas; + HLDA; + Feature mapping (14 classes); + Eigenchannel adaptation; Joint Factor Analysis (extrapolated result).]

Page 42: Robust Speaker Recognition

[Channel/session compensation overview diagram repeated; see Page 4.]

Page 43: Robust Speaker Recognition

• Target model LR scores have different biases and scales for different test data
  – Unusual channel or poor quality speech in the training segments → lower scores from the target model
  – Little training data → target model close to the UBM → all LLR scores close to 0
• Z-norm attempts to remove these bias and scale differences from the LR scores

[Plot: pooled LR score distributions for Tgt1 and Tgt2 before normalization, aligned after Z-norm.]

• Estimate the mean and standard deviation of the target model’s scores on non-target, same-sex utterances from data similar to the test data
• During testing, normalize the LR score:

Z(x) = ( x − μ_Tgt ) / σ_Tgt

• This aligns each model’s non-target scores to N(0,1)

Z-norm
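A minimal Z-norm sketch in Python/NumPy, assuming the impostor scores for the target model were collected offline as described above:

import numpy as np

def znorm(score, impostor_scores):
    """Z(x) = (x - mu_tgt) / sigma_tgt, aligning non-target scores to N(0,1)."""
    return (score - np.mean(impostor_scores)) / np.std(impostor_scores)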

Page 44: Robust Speaker Recognition

• Similar idea to Z-norm, but compensating for differences in the test data
• Estimates bias and scale parameters for score normalization using a fixed “cohort” set of speaker models
  – Normalizes the target score relative to a non-target model ensemble
  – Similar to standard cohort normalization except for the standard deviation scaling

T(u) = ( u_tgt − μ_coh ) / σ_coh

[Diagram: the test segment is scored against the target model and a set of cohort models; the cohort scores supply (μ_coh, σ_coh) for computing the T-norm score.]

• Cohorts of the same gender as the target were used
• Can be used in conjunction with Z-norm
  – ZT-norm or TZ-norm depending on the order

T-norm

Introduced in 1999 by Ensigma (DSP Journal January 2000)
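A minimal T-norm sketch in Python/NumPy, assuming the same test segment has been scored against every cohort model:

import numpy as np

def tnorm(target_score, cohort_scores):
    """T(u) = (u_tgt - mu_coh) / sigma_coh, estimated per test segment."""
    return (target_score - np.mean(cohort_scores)) / np.std(cohort_scores)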

Page 45: Robust Speaker Recognition

Effect of ZT-norm

[DET plot, NIST SRE 2006, telephone trials (miss probability [%] vs. false alarm probability [%]): eigenchannel adaptation and Joint Factor Analysis, each with no normalization and with ZT-norm.]

Page 46: Robust Speaker Recognition

Score fusion

NIST SRE 2006, all trials

Linear logistic regression fusion of scores from:

•GMM with eigenchannel adaptation

•SVM based on GMM supervectors

•SVM based on MLLR transformations (the transformations adapting a speaker-independent LVCSR system to the speaker)

The linear logistic regression is trained using many target and non-target trials from a development set
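A sketch of the fusion in Python/scikit-learn, assuming a per-system score matrix for labelled development trials; in practice, calibration-oriented tools such as FoCal were typically used for this step:

import numpy as np
from sklearn.linear_model import LogisticRegression

def train_fusion(dev_scores, dev_labels):
    """dev_scores: (n_trials, n_systems); dev_labels: 1 target, 0 non-target."""
    return LogisticRegression().fit(dev_scores, dev_labels)

def fuse(model, scores):
    """Fused score is the linear combination w^T s + b learned above."""
    return model.decision_function(scores)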

Page 47: Robust Speaker Recognition

Conclusions

