MIT Lincoln Laboratory Nuance Communications ICASSP01 Tutorial 5/7/01 1
© 2001 D.A.Reynolds and L.P.Heck
Speaker Verification: From Research to Reality
Douglas A. Reynolds, PhDSenior Member of Technical
StaffM.I.T. Lincoln Laboratory
Larry P. Heck, PhDSpeaker Verification R&DNuance Communications
This work was sponsored by the Department of Defense under Air Force contract F19628-00-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the Department of Defense.
ICASSP Tutorial Salt Lake City, UT
7 May 2001
This material may not be reproduced in whole or part without written permission from the authors
Tutorial Outline
• Part I: Background and Theory
– Overview of area
– Terminology
– Theory and structure of verification systems
– Channel compensation and adaptation
• Part II: Evaluation and Performance
– Evaluation tools and metrics
– Evaluation design
– Publicly available corpora
– Performance survey
• Part III: Applications and Deployments
– Brief overview of commercial speaker verification systems
– Design requirements for commercial verifiers
– Steps to deployment
– Examples of deployments
Goals of Tutorial
• Understand major concepts behind modern speaker verification systems
• Identify the key elements in evaluating performance of a speaker verification system
• Define the main issues and tasks in deploying a speaker verification system
Part I: Background and Theory
Outline

• Overview of area
– Applications
– Terminology
• General Theory
– Features for speaker recognition
– Speaker models
– Verification decision
• Channel compensation
• Adaptation
• Combination of speech and speaker recognizers
Extracting Information from Speech
Goal: automatically extract the information transmitted in the speech signal

[Figure: the speech signal feeds three systems. Speech recognition outputs words ("How are you?"); language recognition outputs a language name (English); speaker recognition outputs a speaker name (James Wilson).]
Evolution of Speaker Recognition
• This tutorial will focus on the techniques and performance of state-of-the-art systems

[Figure: timeline of speaker recognition. 1930s onward: aural and spectrogram matching; 1960s: template matching; 1970s: dynamic time warping and vector quantization; 1980s: hidden Markov models and Gaussian mixture models; 1990-2001: commercial application of speaker recognition technology. Databases evolved from small collections of clean, controlled speech to large databases of realistic, unconstrained speech.]
Speaker Recognition Applications
• Access control: physical facilities; computer networks and websites
• Transaction authentication: telephone banking; remote credit card purchases
• Speech data management: voice mail browsing; speech skimming
• Personalization: intelligent answering machines; voice-web / device customization
• Law enforcement: forensics; home parole
Terminology
• The general area of speaker recognition can be divided into two fundamental tasks
[Figure: speaker recognition divides into identification and verification.]
• Any work on speaker recognition should identify which task is being addressed
Terminology: Identification

• Determines who is talking from a set of known voices
• No identity claim from the user (one-to-many mapping)
• Often assumed that the unknown voice must come from the set of known speakers; referred to as closed-set identification

[Figure: "Whose voice is this?" An unknown voice is compared against each known speaker.]
Terminology: Verification / Authentication / Detection

• Determine whether a person is who they claim to be
• The user makes an identity claim (one-to-one mapping)
• The unknown voice could come from a large set of unknown speakers; referred to as open-set verification
• Adding a "none-of-the-above" option to closed-set identification gives open-set identification

[Figure: "Is this Bob's voice?"]
Terminology: Segmentation and Clustering

• Determine when a speaker change has occurred in the speech signal (segmentation)
• Group together speech segments from the same speaker (clustering)
• Prior speaker information may or may not be available

[Figure: a two-speaker waveform (Speaker A, Speaker B). Where are the speaker changes? Which segments are from the same speaker?]
Terminology: Speech Modalities

The application dictates the speech modality:

• Text-dependent recognition
– The recognition system knows the text spoken by the person
– Examples: fixed phrase, prompted phrase
– Used for applications with strong control over user input
– Knowledge of the spoken text can improve system performance

• Text-independent recognition
– The recognition system does not know the text spoken by the person
– Examples: user-selected phrase, conversational speech
– Used for applications with less control over user input
– More flexible system but also a more difficult problem
– Speech recognition can provide knowledge of the spoken text
Terminology: Voice Biometric

• Speaker verification is often referred to as a voice biometric
• Biometric: a human-generated signal or attribute for authenticating a person's identity
• Voice is a popular biometric:
– natural signal to produce
– does not require a specialized input device
– ubiquitous: telephones and microphone-equipped PCs
• The voice biometric can be combined with other forms of security:
– Something you have (e.g., a badge)
– Something you know (e.g., a password)
– Something you are (e.g., your voice)

[Figure: overlapping "have", "know", and "are" circles; combining all three gives the strongest security.]
Part I: Background and Theory
Outline

• Overview of area
– Applications
– Terminology
• General Theory
– Features for speaker recognition
– Speaker models
– Verification decision
• Channel compensation
• Adaptation
• Combination of speech and speaker recognizers
General Theory: Phases of Speaker Verification System

There are two distinct phases to any speaker verification system:

• Enrollment phase: enrollment speech from each speaker (Bob, Sally) → feature extraction → model training → a model (voiceprint) for each speaker
• Verification phase: claimed identity ("Sally") → feature extraction → verification decision → accept or reject
General Theory: Features for Speaker Recognition

• Humans use several levels of perceptual cues for speaker recognition

Hierarchy of perceptual cues, from high-level (learned traits) to low-level (physical traits):
– Semantics, diction, pronunciations, idiosyncrasies (reflecting socio-economic status, education, place of birth)
– Prosodics, rhythm, speed, intonation, volume modulation (reflecting personality type, parental influence)
– Acoustic aspects of speech: nasal, deep, breathy, rough (reflecting the anatomical structure of the vocal apparatus)
High-level cues are difficult to extract automatically; low-level cues are easy to extract automatically.

• There are no exclusive speaker identity cues
• Low-level acoustic cues are most applicable for automatic systems
General Theory: Features for Speaker Recognition

• Desirable attributes of features for an automatic system (Wolf '72):
– Practical: occur naturally and frequently in speech; easily measurable
– Robust: do not change over time or with the speaker's health; not affected by reasonable background noise; do not depend on specific transmission characteristics
– Secure: not subject to mimicry

• No feature has all these attributes
• Features derived from the spectrum of speech have proven to be the most effective in automatic systems
General Theory: Speech Production

• Speech production model: source-filter interaction
– Anatomical structure (vocal tract/glottis) is conveyed in the speech spectrum

[Figure: glottal pulses (source) filtered by the vocal tract produce the speech signal; time-domain waveforms shown.]
General Theory: Features for Speaker Recognition

• Different speakers will have different spectra for similar sounds

[Figure: cross sections of the vocal tract for the sounds /AE/ and /I/, with magnitude (dB) vs. frequency (Hz) spectra for a male and a female speaker.]

• Differences are in the location and magnitude of the peaks in the spectrum
– The peaks are known as formants and represent resonances of the vocal cavity
• The spectrum captures the formant locations and, to some extent, pitch without explicit formant or pitch tracking
General Theory: Features for Speaker Recognition

• Speech is a continuous evolution of the vocal tract
– Need to extract a time series of spectra
– Use a sliding window: 20 ms window, 10 ms shift

[Figure: each windowed frame is Fourier transformed and its magnitude taken, producing a spectrogram (frequency in Hz vs. time in sec).]

• Produces the time-frequency evolution of the spectrum
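A minimal sketch of the sliding-window framing described above, assuming the signal is a plain Python list of samples; `frame_signal` is a hypothetical helper name, not part of any system in this tutorial.

```python
def frame_signal(samples, sample_rate, win_ms=20, shift_ms=10):
    """Slice a sampled signal into overlapping analysis frames."""
    win = int(sample_rate * win_ms / 1000)      # e.g., 160 samples at 8 kHz
    shift = int(sample_rate * shift_ms / 1000)  # e.g., 80 samples at 8 kHz
    frames = []
    start = 0
    while start + win <= len(samples):
        frames.append(samples[start:start + win])
        start += shift
    return frames
```

One second of 8 kHz audio yields 99 overlapping 20 ms frames, each of which would then be Fourier transformed.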
General Theory: Features for Speaker Recognition

• The number of discrete Fourier transform samples representing the spectrum is reduced by averaging frequency bins together
– Typically done by a simulated filterbank
• A perceptually based filterbank is used, such as a Mel- or Bark-scale filterbank
– Linearly spaced filters at low frequencies
– Logarithmically spaced filters at high frequencies

[Figure: triangular filterbank, magnitude vs. frequency.]
General Theory: Features for Speaker Recognition

• The primary features used in speaker recognition systems are cepstral feature vectors
• The log() function turns linear convolutional effects into additive biases
– Easy to remove using blind-deconvolution techniques
• The cosine transform helps decorrelate the elements of the feature vector
– Less burden on the model and empirically better performance

[Pipeline: windowed frames → Fourier transform → magnitude → log() → cosine transform → one feature vector (e.g., 3.4 3.6 2.1 0.0 -0.9 0.3 …) every 10 ms]
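The final two stages of the pipeline (log followed by a cosine transform) can be sketched as below; a DCT-II is one common choice of cosine transform, and `cepstral_coeffs` is an illustrative name.

```python
import math

def cepstral_coeffs(filterbank_energies, num_coeffs):
    """DCT-II of the log filterbank energies -> cepstral coefficients."""
    logs = [math.log(e) for e in filterbank_energies]
    n = len(logs)
    return [sum(logs[j] * math.cos(math.pi * i * (j + 0.5) / n)
                for j in range(n))
            for i in range(num_coeffs)]
```

For a perfectly flat log spectrum only the zeroth coefficient is nonzero, which is why the cosine transform concentrates (and decorrelates) the information in the low-order coefficients.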
General Theory: Features for Speaker Recognition

[Figure: example signals at each stage of the pipeline: Fourier transform → magnitude → log() → cosine transform.]
General Theory: Features for Speaker Recognition

• Additional processing steps for speaker recognition features

• To help capture some temporal information about the spectra, delta cepstra are often computed and appended to the cepstral feature vector
– A first-order linear fit over a 5-frame (50 ms) span, i.e., $K = 2$:

$\Delta c(t) = \sum_{k=-K}^{K} k\,c(t+k) \Big/ \sum_{k=-K}^{K} k^2$

• For telephone speech processing, only the voice pass-band frequency region is used
– Use only the outputs of filters in the range 300-3300 Hz
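The delta computation above can be sketched as follows, assuming cepstra are stored as a list of per-frame coefficient lists; clamping at the utterance edges is one common convention, and `delta_cepstra` is an illustrative name.

```python
def delta_cepstra(cepstra, K=2):
    """First-order linear-fit slope over a (2K+1)-frame span, per coefficient."""
    T, D = len(cepstra), len(cepstra[0])
    norm = sum(k * k for k in range(-K, K + 1))  # denominator, = 10 for K = 2
    deltas = []
    for t in range(T):
        row = []
        for d in range(D):
            # clamp frame indices at the utterance edges
            num = sum(k * cepstra[min(max(t + k, 0), T - 1)][d]
                      for k in range(-K, K + 1))
            row.append(num / norm)
        deltas.append(row)
    return deltas
```

On a linearly increasing coefficient the interior deltas recover the slope, which is exactly what the linear fit is meant to capture.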
General Theory: Features for Speaker Recognition

• To help remove channel convolutional effects, cepstral mean subtraction (CMS) or RASTA filtering is applied to the cepstral vectors

A linear channel $h(t)$ convolves the speech, multiplies the magnitude spectrum, and becomes an additive bias in the cepstral domain:

$s(t,n) * h(t) \;\rightarrow\; |S(f,n)|\,|H(f)| \;\rightarrow\; \log|S(f,n)| + \log|H(f)| \;\rightarrow\; c(n) = c_s(n) + c_h$

CMS subtracts the per-utterance mean, removing the channel term:

$\hat{c}(n) = c(n) - \frac{1}{N}\sum_{m=1}^{N} c(m)$

• Some speaker information is lost, but generally CMS is highly beneficial to performance
• RASTA filtering is like a time-varying version of CMS (Hermansky, '92)
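CMS itself is a one-line idea; a minimal sketch, with `cms` as an illustrative name:

```python
def cms(cepstra):
    """Cepstral mean subtraction: subtract the per-utterance mean of each
    cepstral coefficient, removing a stationary linear channel bias."""
    T, D = len(cepstra), len(cepstra[0])
    mean = [sum(frame[d] for frame in cepstra) / T for d in range(D)]
    return [[frame[d] - mean[d] for d in range(D)] for frame in cepstra]
```

Adding any constant bias to every frame leaves the CMS output unchanged, which is the channel-removal property being exploited.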
General Theory: Phases of Speaker Verification System

There are two distinct phases to any speaker verification system:

• Enrollment phase: enrollment speech from each speaker (Bob, Sally) → feature extraction → model training → a model (voiceprint) for each speaker
• Verification phase: claimed identity ("Sally") → feature extraction → verification decision → accept or reject
General Theory: Speaker Models
• Speaker models are used to represent the speaker-specific information conveyed in the feature vectors
• Desirable attributes of a speaker model– Theoretical underpinning– Generalizable to new data– Parsimonious representation (size and computation)
• Modern speaker verification systems employ some form of Hidden Markov Models (HMM)
– Statistical model for speech sound representation– Solid theoretical basis– Existing parameter estimation techniques
General Theory: Speaker Models

• Treat the speaker as a hidden random source generating the observed feature vectors
– The source has "states" corresponding to different speech sounds

[Figure: the speaker (source) moves through hidden speech states, emitting observed feature vectors x1, x2, x3, … (e.g., 3.4 3.6 2.1 0.0 -0.9 0.3 …).]
General Theory: Speaker Models

• Feature vectors generated from each state follow a Gaussian mixture distribution:

$p(x \mid s_i) = b_i(x) = \sum_j p_{i,j}\,\mathcal{N}(x;\,\mu_{i,j}, \Sigma_{i,j})$

– Each state $s_i$ carries a transition probability and a feature distribution (mixture weights, means, and covariances)

• Transitions between states are based on the modality of speech
– The text-dependent case will have ordered states
– The text-independent case will allow all transitions
• Model parameters:
– Transition probabilities
– State mixture parameters
• Parameters are estimated from training speech using the Expectation-Maximization (EM) algorithm
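The per-state mixture density above can be sketched for the one-dimensional case (each dimension of a diagonal-covariance mixture reduces to this); `gmm_log_likelihood` is an illustrative name, not the EM training procedure itself.

```python
import math

def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of a scalar observation under a 1-D Gaussian mixture:
    log sum_j w_j * N(x; mu_j, var_j)."""
    density = sum(w * math.exp(-(x - m) ** 2 / (2.0 * v))
                  / math.sqrt(2.0 * math.pi * v)
                  for w, m, v in zip(weights, means, variances))
    return math.log(density)
```

With a single standard-normal component the value at x = 0 is -log sqrt(2π), a handy sanity check.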
General Theory: Speaker Models

• HMMs encode the temporal evolution of the features (spectrum)
• HMMs represent the underlying statistical variations in the speech state (e.g., phoneme) and the temporal changes of speech between states
• This provides a statistical model of how a speaker produces sounds
• The designer needs to set:
– Topology (number of states and allowed transitions)
– Number of mixtures
General Theory: Speaker Models

The form of the HMM depends on the application:
– Fixed phrase ("Open sesame"): word/phrase models
– Prompted phrases/passwords (/t/ /e/ /n/): phoneme models
– General speech: text-independent single-state HMM (GMM)
General Theory: Speaker Models

• The dominant model factor in speaker recognition performance is the number of mixtures used (Matsui and Furui, ICASSP '92)
• Selection of the mixture order depends on a number of factors:
– Topology of the HMM
– Amount of training data
– Desired model size
• There is no good theoretical technique for picking the mixture order
– Usually set empirically
• Parameter-tying techniques can help increase the effective number of Gaussians with limited growth in total parameters
General Theory: Speaker Models

• The likelihood of an HMM given a sequence of feature vectors is computed as

Full likelihood score: $p(x(1), x(2), \ldots, x(T) \mid HMM) = \sum_{S}\prod_{t} a_{s(t-1),s(t)}\,p(x(t) \mid s(t))$

Viterbi (best-path) score: $p(x(1), x(2), \ldots, x(T) \mid HMM) \approx \max_{S}\prod_{t} a_{s(t-1),s(t)}\,p(x(t) \mid s(t))$

[Figure: trellis of states vs. time for observations x(1)…x(4), showing the best path.]
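The best-path score above can be sketched with a standard Viterbi recursion in the log domain; the function and argument names are illustrative.

```python
def viterbi_log_score(obs_logprob, log_trans, log_init):
    """Best-path (Viterbi) log score of an observation sequence.
    obs_logprob[t][s] = log p(x(t) | state s); log_trans[r][s] = log a_{r,s}."""
    n = len(log_init)
    scores = [log_init[s] + obs_logprob[0][s] for s in range(n)]
    for t in range(1, len(obs_logprob)):
        # keep, per state, the best predecessor path ending in that state
        scores = [max(scores[r] + log_trans[r][s] for r in range(n))
                  + obs_logprob[t][s]
                  for s in range(n)]
    return max(scores)
```

For a single-state model (the text-independent GMM case) this collapses to a plain sum of per-frame log-likelihoods.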
General Theory: Phases of Speaker Verification System

There are two distinct phases to any speaker verification system:

• Enrollment phase: enrollment speech from each speaker (Bob, Sally) → feature extraction → model training → a model (voiceprint) for each speaker
• Verification phase: claimed identity ("Sally") → feature extraction → verification decision → accept or reject
General Theory: Verification Decision

• The verification task is fundamentally a two-class hypothesis test
– H0: the speech S is from an impostor
– H1: the speech S is from the claimed speaker

• We select the most likely hypothesis (Bayes test for minimum error):

$\Pr(H_1 \mid S) \gtrless \Pr(H_0 \mid S)$

Applying Bayes' rule: $\frac{p(S \mid H_1)\Pr(H_1)}{p(S)} \gtrless \frac{p(S \mid H_0)\Pr(H_0)}{p(S)}$

• Rearranging gives the likelihood ratio test:

$LR = \frac{p(S \mid H_1)}{p(S \mid H_0)}$: accept $H_1$ if $LR > \theta$, accept $H_0$ if $LR < \theta$, where $\theta = \frac{\Pr(H_0)}{\Pr(H_1)}$
General Theory: Verification Decision

• Usually the log-likelihood ratio is used:

$LLR(S) = \log p(S \mid H_1) - \log p(S \mid H_0)$

[Figure: front-end processing feeds the speaker model (+) and the impostor model (-); the difference of their scores is thresholded to accept or reject.]

• The H1 likelihood is computed using the claimed speaker's model
• An alternative, or impostor, model is required for the H0 likelihood
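The resulting decision rule is a one-line threshold test; a minimal sketch with illustrative names and scores:

```python
def verify(log_p_speaker, log_p_impostor, threshold):
    """Accept the identity claim iff the log-likelihood ratio
    LLR = log p(S|H1) - log p(S|H0) exceeds the decision threshold."""
    llr = log_p_speaker - log_p_impostor
    return llr > threshold
```

Moving the threshold trades false accepts against false rejects, which is exactly the trade-off examined in Part II.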
General Theory: Background Model

• There are two main approaches for creating an alternative model for the likelihood ratio test

Cohorts / likelihood sets / background sets (Higgins, DSPJ '91):
– Use a collection of other speaker models
– The likelihood of the alternative is some function, such as the average, of the individual impostor model likelihoods:

$p(S \mid H_0) = f\big(p(S \mid Bkg_b),\ b = 1, \ldots, B\big)$

General / world / universal background model (Carey, ICASSP '91):
– Use a single speaker-independent model, trained on speech from a large number of speakers to represent general speech patterns:

$p(S \mid H_0) = p(S \mid UBM)$
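The cohort-averaging choice of f can be sketched as below; averaging is done in the linear-probability domain even when the models return log-likelihoods, and the function name is illustrative.

```python
import math

def cohort_alternative_logprob(cohort_logprobs):
    """log p(S|H0) as the average of B cohort likelihoods, given the
    per-model log-likelihoods log p(S|Bkg_b)."""
    b = len(cohort_logprobs)
    return math.log(sum(math.exp(lp) for lp in cohort_logprobs) / b)
```

(In practice a max- or log-domain-stable formulation may be preferred; this shows the plain average named on the slide.)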
General Theory: Background Model

• The background model is crucial to good performance
– It acts as a normalization to help minimize non-speaker-related variability in the decision score
• Using only the speaker model's likelihood does not perform well
– Too unstable for setting decision thresholds
– Influenced by too many non-speaker-dependent factors
• The background model should be trained on speech representative of the expected impostor speech
– Same type of speech as speaker enrollment (modality, language, channel)
– Representative of the impostor genders and microphone types to be encountered
General Theory: Background Model

• Selected highlights of research on background models:
– Near/far cohort selection (Reynolds, SpeechComm '95): select cohort speakers to cover the speaker space around the speaker model
– Phonetic-based cohort selection (Rosenberg, ICASSP '96): select speech and speakers to match the speech modality used for speaker enrollment
– Microphone-dependent background models (Heck, ICASSP '97): train the background model using speech from the same type of microphone as used for speaker enrollment
– Adapting the speaker model from the background model (Reynolds, Eurospeech '97, DSPJ '00): use maximum a posteriori (MAP) estimation to derive the speaker model from a background model
General Theory: Components of Speaker Verification System

[Figure: block diagram. Input speech ("My name is Bob") and an identity claim enter the system; feature extraction feeds both Bob's speaker model and the impostor model(s); the decision module outputs ACCEPT or REJECT.]
Part I: Background and Theory
Outline

• Overview of area
– Applications
– Terminology
• General Theory
– Features for speaker recognition
– Speaker models
– Verification decision
• Channel compensation
• Adaptation
• Combination of speech and speaker recognizers
Channel Compensation
• Channel variability refers to changes in channel effects between enrollment and successive verification attempts
• Channel effects encompass several factors:
– The microphone: carbon-button, electret, hands-free, etc.
– The acoustic environment: office, car, airport, etc.
– The transmission channel: landline, cellular, VoIP, etc.
• Anything which affects the spectrum can cause problems
– Speaker and channel effects are bound together in the spectrum, and hence in the features used by speaker verifiers
• Unlike speech recognizers, speaker verifiers cannot "average out" these effects using large amounts of speech
– Enrollment speech is limited

The largest challenge to practical use of speaker verification systems is channel variability
Channel Compensation: Examples
Channel Compensation: Examples
Channel Compensation
• Three areas where compensation has been applied:
– Feature-based approaches: CMS and RASTA; nonlinear mappings
– Model-based approaches: handset-dependent background models; Synthetic Model Synthesis (SMS)
– Score-based approaches: Hnorm, Tnorm

[Figure: NIST evaluation error rates for matched vs. mismatched handsets. In '96, mismatched handsets were a factor of 20 worse than matched; by '99, only a factor of 2.5 worse.]

• Using compensation techniques has driven down error rates in NIST evaluations
Channel Compensation: Feature-based Approaches

• CMS and RASTA only address linear channel effects on the features
• Several approaches have looked at nonlinear effects

Nonlinear mapping (Quatieri, TrSAP 2000):
– Use a Volterra series to map speech between different types of handsets
[Figure: carbon-button speech modeled as electret speech passed through a cascade of linear, nonlinear, and linear filters.]

Discriminative feature design (Heck, SpeechCom 2000):
– Use a neural net to find features that discriminate between speakers, not channels
[Figure: feature analysis → discriminatively trained ANN transform → speaker recognition system.]
Channel Compensation: Model-based Approaches

• It is generally difficult to get enrollment speech from all microphone types to be used
• The SMS approach addresses this by synthetically generating speaker models as if they came from different microphones (Teunen, ICSLP 2000)
– A mapping of model parameters between different microphone types is applied

[Figure: an electret speaker model is synthesized into cellular and carbon-button models.]
Channel Compensation: Score-based Approaches

• Speaker model LR scores have different biases and scales for utterances from different handset types
• Hnorm attempts to remove these bias and scale differences from the LR scores (Reynolds, NIST eval '96)
– Estimate the mean and standard deviation of LR scores for impostor, same-sex utterances from each microphone type
[Figure: per-speaker LR score distributions for electret and carbon-button handsets align after Hnorm.]
– During verification, normalize the LR score based on the microphone label of the utterance:

$Hnorm(u) = \frac{LR(u) - \mu(spk, HS(u))}{\sigma(spk, HS(u))}$

where $HS(u)$ is the handset type of utterance $u$.
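Applying Hnorm at verification time is then a table lookup plus a standardization; a minimal sketch, where `handset_params` holding the per-handset impostor statistics is an assumed data layout.

```python
def hnorm(raw_score, handset_label, handset_params):
    """Normalize an LR score using handset-dependent impostor statistics.
    handset_params maps a handset label to (mean, std) of impostor scores
    estimated offline for this speaker's model."""
    mean, std = handset_params[handset_label]
    return (raw_score - mean) / std
```

After normalization, impostor scores from every handset type sit near zero with unit spread, so a single decision threshold can serve all channels.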
Channel Compensation: Score-based Approaches

• Tnorm/HTnorm estimates bias and scale parameters for score normalization using a "cohort" set of speaker models (Auckenthaler, DSP Journal 2000)
– Test-time score normalization
– Normalizes the target score relative to a non-target model ensemble:

$Tnorm(u) = \frac{s_{spk}(u) - \mu_{coh}(u)}{\sigma_{coh}(u)}$

– Similar to standard cohort normalization except for the standard-deviation scaling
[Figure: the utterance is scored against the speaker model and a set of cohort models; the cohort score statistics (μ_coh, σ_coh) normalize the target score.]
• Cohorts of the same gender and channel as the speaker were used
• Can be used in conjunction with Hnorm
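Unlike Hnorm, the Tnorm statistics come from scoring the same test utterance against the cohort models, so they can be computed on the fly; a minimal sketch with illustrative names:

```python
import math

def tnorm(target_score, cohort_scores):
    """Normalize the target model's score by the mean and standard deviation
    of the cohort model scores on the same test utterance."""
    n = len(cohort_scores)
    mu = sum(cohort_scores) / n
    sigma = math.sqrt(sum((s - mu) ** 2 for s in cohort_scores) / n)
    return (target_score - mu) / sigma
```

The design choice here is when the statistics are estimated: Hnorm fixes them offline per handset, while Tnorm re-estimates them per utterance.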
Part I: Background and Theory
Outline

• Overview of area
– Applications
– Terminology
• General Theory
– Features for speaker recognition
– Speaker models
– Verification decision
• Channel compensation
• Adaptation
• Combination of speech and speaker recognizers
Adaptation

• Model adaptation is important for maintaining performance in speaker verification systems
– Enrollment speech is limited
– The speaker and the speech environment change over time (Furui)
• The most useful approach is unsupervised adaptation
– Use the verifier decision to select data with which to update the speaker model
– Adjust the model parameters to better match the new data (MAP adaptation):

$\lambda_{adapt} = \alpha\,\lambda_{new} + (1 - \alpha)\,\lambda_{current}$

[Figure: accepted utterances are fed back from the decision module to adapt the speaker model.]
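The interpolation above, applied element-wise to a parameter vector, can be sketched as follows; `map_interpolate` is an illustrative name for the update step, not a full MAP estimator.

```python
def map_interpolate(current, new, alpha):
    """lambda_adapt = alpha * lambda_new + (1 - alpha) * lambda_current,
    applied element-wise to a flat parameter vector."""
    return [alpha * n + (1.0 - alpha) * c for c, n in zip(current, new)]
```

With alpha near 0 the model barely moves; with alpha near 1 it tracks the newest accepted utterance aggressively, which connects directly to the scheduling choices on the next slide.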
Adaptation

• The adaptation parameter can be set in several ways:
– As a fixed value: continuous adaptation
– As a function of the likelihood score: adjust adaptation based on the certainty of the decision
– As a function of verification sessions: adapt aggressively early and taper off later
• Experiments have shown that adapting with N utterances produces performance comparable to having N extra utterances during initial training
• Potential problems with adaptation:
– Impostor contamination
– Novel channels may be rejected and so never learned

[Figure: EER vs. adaptation session; the largest gain comes at the start.]
Part I: Background and Theory
Outline

• Overview of area
– Applications
– Terminology
• General Theory
– Features for speaker recognition
– Speaker models
– Verification decision
• Channel compensation
• Adaptation
• Combination of speech and speaker recognizers
Combination of Speech and Speaker Recognizers
• There are four basic ways speech recognition is used with speaker verifiers
– For front-end speech segmentation
– For prompted text verification
– For knowledge verification
– To extract idiolectal information
Speech and Speaker Recognizers: Front-end Segmentation

• A speech recognizer is used to segment speech for training and verification
• Depending on the task, different linguistic units are recognized
– Words, phones, broad phonetic classes
• The recognized phrase could also provide the claimed identity to the verifier
– E.g., an account number

[Figure: the speech recognizer segments the input into units (/a/ /b/ /c/ …), which the speaker verifier scores against speaker and impostor HMMs.]
Speech and Speaker Recognizers: Prompted Text Verification

• Prompted-text systems are used to help thwart playback attacks
• Need to verify the voice and that the prompted text was said
• Possible to have integrated speaker and text verification using speaker-dependent phrase decoding

[Figure: for a prompted phrase (e.g., 82-32-71), the speech recognizer produces a text score and the speaker verifier (using speaker and impostor HMMs) produces a speaker score; the two scores are combined.]
Speech and Speaker Recognizers: Knowledge Verification

• Compare the response to a personal question against the known answer
– E.g., "What is your date of birth?"
• Can be used for initial enrollment speech collection
– Use KV for the first three accesses while collecting speech
• Can also be used as fall-back verification when the speaker verifier is unsure after some number of attempts
• Also known as Verbal Information Verification (Q. Li, ICSLP '98)

[Figure: a question drawn from a personal-info database prompts a response; the speech recognizer's output is compared to the stored answer to give a KV score, used when the speaker verifier's decision is "unsure".]
Speech and Speaker Recognizers: Idiolectal Information Extraction

• Recent work by Doddington has found significant speaker information in n-grams of recognized words (Eurospeech 2001 and NIST website)
• During training, create counts of n-grams from the speaker training data and from a collection of background speakers
• During verification, compute an LR score between the speaker and background n-gram models
• Good example of using higher levels of speaker information
[Diagram: recognized transcript “Uh I think yeah …” → bigram (n = 2) relative frequencies; speaker n-grams (Uh-I 0.022, Uh-yeah 0.001, Uh-well 0.025) vs. background n-grams (Uh-I 0.001, Uh-yeah 0.049, Uh-well 0.071) → LR computation]
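The n-gram LR scoring described above can be sketched in a few lines of Python. This is an illustrative reconstruction, not Doddington's actual implementation: the counts, smoothing constant, and vocabulary size below are made-up placeholders.

```python
from collections import Counter
from math import log

def bigram_logprob(tokens, counts, total, vocab_size, alpha=1.0):
    """Average log-probability of the token sequence's bigrams under
    add-alpha smoothed relative-frequency estimates."""
    bigrams = list(zip(tokens, tokens[1:]))
    lp = sum(log((counts.get(bg, 0) + alpha) / (total + alpha * vocab_size))
             for bg in bigrams)
    return lp / max(len(bigrams), 1)

def llr_score(tokens, spk_counts, bkg_counts, vocab_size):
    """Log-likelihood ratio of the speaker bigram model vs. the
    background bigram model for one recognized transcript."""
    return (bigram_logprob(tokens, spk_counts, sum(spk_counts.values()), vocab_size)
            - bigram_logprob(tokens, bkg_counts, sum(bkg_counts.values()), vocab_size))

# Hypothetical bigram counts gathered from recognizer transcripts
spk = Counter({("uh", "i"): 22, ("uh", "well"): 25, ("uh", "yeah"): 1})
bkg = Counter({("uh", "i"): 1, ("uh", "well"): 71, ("uh", "yeah"): 49})

score = llr_score(["uh", "i", "think"], spk, bkg, vocab_size=1000)
```

A positive score favors the claimed speaker's word habits; a negative score favors the background population.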
Speaker Verification: From Research to Reality
Part II : Evaluation and Performance
Part II : Evaluation and Performance – Outline
• Evaluation metrics
• Evaluation design
• Publicly available corpora
• Performance survey
Evaluation metrics
• In speaker verification, there are two types of errors that can occur
– False reject: incorrectly reject a true speaker; also known as a miss or a Type I error
– False accept: incorrectly accept an impostor; also known as a false alarm or a Type II error
• The performance of a verification system is a measure of the trade-off between these two errors
– The tradeoff is usually controlled by adjustment of the decision threshold
• In an evaluation, N_true true trials (speech from the claimed speaker) and N_false false trials (speech from an impostor) are conducted, and the probabilities of false reject and false accept are estimated at different thresholds θ:

Pr(false reject | θ) = (# true trial scores < θ) / N_true
Pr(false accept | θ) = (# false trial scores ≥ θ) / N_false
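Both estimates can be computed directly from pooled trial scores. A minimal Python sketch, with toy score values chosen only for illustration, assuming a trial is accepted when its score meets the threshold:

```python
def error_rates(true_scores, false_scores, threshold):
    """Estimate Pr(false reject) and Pr(false accept) at one threshold.
    A trial is accepted when its score is >= threshold."""
    p_miss = sum(s < threshold for s in true_scores) / len(true_scores)
    p_fa = sum(s >= threshold for s in false_scores) / len(false_scores)
    return p_miss, p_fa

# Toy pooled scores: true trials tend to score higher than false trials
true_scores = [2.1, 1.7, 0.4, 2.5, 1.9]
false_scores = [-1.2, 0.3, -0.5, 1.1, -2.0]

p_miss, p_fa = error_rates(true_scores, false_scores, threshold=1.0)
```

Sweeping the threshold over the full range of observed scores traces out the system's entire operating curve.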
Evaluation metrics
• Evaluation errors are estimates of true errors using a finite number of trials
Pr(miss | θ) = (# true trial scores < θ) / N_true → ∫_{-∞}^{θ} p(x | speaker) dx
Pr(fa | θ) = (# false trial scores ≥ θ) / N_false → ∫_{θ}^{+∞} p(x | impostor) dx
Evaluation metricsROC and DET Curves
• A plot of Pr(miss) vs. Pr(fa) across thresholds shows system performance
– Receiver Operating Characteristic (ROC) vs. Detection Error Tradeoff (DET)
• The DET curve plots Pr(miss) and Pr(fa) on a normal-deviate scale
• Moving along either curve corresponds to decreasing the decision threshold; curves closer to the origin indicate better performance
[Figure: side-by-side ROC and DET plots; both axes are probability of false accept (%) vs. probability of false reject (%)]
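The normal-deviate warping that turns a ROC into a DET plot is the inverse standard-normal CDF applied to both error rates. A small sketch using only the Python standard library:

```python
from statistics import NormalDist

def det_coords(p_fa, p_miss):
    """Map an operating point onto DET axes: each error probability is
    transformed by the standard-normal inverse CDF (probit), so systems
    with Gaussian score distributions trace straight lines."""
    probit = NormalDist().inv_cdf
    return probit(p_fa), probit(p_miss)

# Example: 1% false accept, 10% miss
x, y = det_coords(p_fa=0.01, p_miss=0.10)
```

Axis labels on a DET plot still read in percent; only the spacing of the tick marks is warped by the probit transform.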
Evaluation metrics: DET Curve

[Figure: DET curve with Equal Error Rate (EER) = 1% marked; operating points range from high security (low false accept) through balance to high convenience (low false reject)]

• Equal Error Rate (EER) is often quoted as a summary performance measure
• The application operating point depends on the relative costs of the two errors
– Wire transfer: false acceptance is very costly; users may tolerate rejections for security
– Toll fraud: false rejections alienate customers; any fraud rejection is beneficial
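The EER quoted above can be estimated from pooled trial scores by sweeping the threshold and finding where the two error rates cross. A rough sketch (linear interpolation between operating points is omitted for brevity):

```python
def equal_error_rate(true_scores, false_scores):
    """Approximate the EER by sweeping the threshold over all observed
    scores and averaging miss/false-accept rates where they are closest."""
    best = None
    for t in sorted(true_scores + false_scores):
        p_miss = sum(s < t for s in true_scores) / len(true_scores)
        p_fa = sum(s >= t for s in false_scores) / len(false_scores)
        if best is None or abs(p_miss - p_fa) < abs(best[0] - best[1]):
            best = (p_miss, p_fa)
    return (best[0] + best[1]) / 2

# Toy overlapping score distributions
eer = equal_error_rate([0.0, 1.0, 2.0, 3.0], [-2.0, -1.0, 0.0, 1.0])
```

On these toy scores the miss and false-accept rates meet at 25%; well-separated distributions would give an EER of zero.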
Evaluation metrics: Decision Cost Function

• In addition to EER, a decision cost function (DCF) is also used to measure performance

DCF(θ) = C(miss) · Pr(miss | θ) · Pr(spkr) + C(fa) · Pr(fa | θ) · Pr(imp)
C(miss) = cost of a miss
Pr(spkr) = prior probability of true speaker attempt
Pr(imp) = 1-Pr(spkr) = prior probability of impostor attempt
C(fa) = cost of a false alarm
• For application specific costs and priors, compare systems based on minimum value of DCF
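A sketch of evaluating and minimizing the DCF over a set of operating points. The cost and prior values below are illustrative placeholders, not values prescribed by the tutorial:

```python
def dcf(p_miss, p_fa, c_miss=10.0, c_fa=1.0, p_spkr=0.01):
    """Decision cost: C(miss)*Pr(miss)*Pr(spkr) + C(fa)*Pr(fa)*Pr(imp)."""
    return c_miss * p_miss * p_spkr + c_fa * p_fa * (1.0 - p_spkr)

def min_dcf(operating_points, **costs):
    """Minimum DCF over a DET curve given as (p_miss, p_fa) pairs."""
    return min(dcf(pm, pf, **costs) for pm, pf in operating_points)

# Three hypothetical operating points along a DET curve
curve = [(0.01, 0.20), (0.05, 0.05), (0.20, 0.01)]
best = min_dcf(curve)
```

With a low speaker prior, the false-accept term dominates, so the minimum lands at the low-false-accept end of the curve.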
Evaluation metrics: Thresholds
• A deployed verification system must make decisions – that is, set and use a priori thresholds
– DET curves and EER are independent of setting thresholds
• The DCF can be used as an objective target for setting and measuring goodness of a priori thresholds
– Set the threshold during development to minimize DCF
– Measure how close the a priori threshold comes to the minimum DCF in evaluation
• For measuring system performance, speaker-independent thresholds should be used
– Pr(miss) is computed by pooling all true trial scores from all speakers in evaluation
– Pr(fa) is computed by pooling all false trial scores from all impostors in evaluation
• Using speaker-dependent thresholds to produce DETs yields very optimistic performance that cannot be achieved in practice
Evaluation Design: Data Selection Factors
• Performance numbers are only meaningful when evaluation conditions are known
– Speech quality: channel and microphone characteristics; ambient noise level and type; variability between enrollment and verification speech
– Speech modality: fixed/prompted/user-selected phrases; free text
– Speech duration: duration and number of sessions of enrollment and verification speech
– Speaker population: size and composition; experience

The evaluation data and design should match the target application domain of interest
Evaluation Design: Sizing of Evaluation
• The overarching concern is to design an evaluation that produces statistically significant results
– Number and composition of speakers
– Number of true and false trials
• For the number of trials, we can use the “rule of 30,” based on the binomial distribution and an independence assumption (Doddington): to be 90% confident that the true error rate is within ±30% of the observed error rate, there must be at least 30 errors
• For performance goals of Pr(miss) = 1% and Pr(fa) = 0.1%, this implies
– 3,000 true trials: 0.7% < Pr(miss) < 1.3% with 90% confidence
– 30,000 impostor trials: 0.07% < Pr(fa) < 0.13% with 90% confidence
• Independence of trials is still an open issue
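The quoted confidence intervals follow from the normal approximation to the binomial distribution. A quick check of the numbers, assuming independent trials:

```python
from math import sqrt

def ci_halfwidth(p, n, z=1.645):
    """Half-width of the normal-approximation confidence interval for an
    observed error rate p over n independent trials (z = 1.645 for 90%)."""
    return z * sqrt(p * (1.0 - p) / n)

# 3,000 true trials with an observed miss rate of 1%
hw = ci_halfwidth(0.01, 3000)
lo, hi = 0.01 - hw, 0.01 + hw
```

With p·n = 30 observed errors the half-width comes out close to 30% of p, which is exactly the rule of 30.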
Evaluation Design: Trials
• True trials are the limiting factor in evaluation design
• False trials are easily generated by scoring all speaker models against all utterances
– May not be possible for speaker-specific fixed phrases
Test utterances vs. speaker models:

              | Speaker 1   | Speaker 2   | Speaker 3
Utterance 1   | True trial  | False trial | False trial
Utterance 2   | False trial | True trial  | False trial
Utterance 3   | False trial | False trial | True trial
• Important that each trial only uses utterance and model under test
– Otherwise system is using “known” impostors (closed set)
• Can design trials to examine performance on sub-conditions
– E.g., train on electret and test on carbon-button
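Generating the full trial matrix is a simple cross product of test utterances and speaker models: the diagonal yields true trials and everything else false trials. A sketch with hypothetical speaker labels:

```python
def build_trials(speakers):
    """One utterance per speaker scored against every speaker model:
    matching pairs are true trials, non-matching pairs false trials."""
    return [(utt, model, "true" if utt == model else "false")
            for utt in speakers for model in speakers]

trials = build_trials(["spk1", "spk2", "spk3"])
n_true = sum(kind == "true" for _, _, kind in trials)
n_false = sum(kind == "false" for _, _, kind in trials)
```

With N speakers and one test utterance each, this yields N true trials but N(N-1) false trials, which is why true trials are usually the limiting factor.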
Publicly Available Corpora: Data Providers
• Linguistic Data Consortium: http://www.ldc.upenn.edu/
• European Language Resources Association: http://www.icp.inpg.fr/ELRA/home.html
Publicly Available Corpora: Partial Listing
• TIMIT et al. (LDC) – not particularly good for evaluations
• SIVA (ELRA) – Italian telephone prompted speech
• PolyVar (ELRA) – French telephone prompted and spontaneous speech
• POLYCOST (ELRA) – European languages prompted and spontaneous speech
• KING (LDC) – Dual wideband and telephone monologs
• YOHO (LDC) – Office environment combination lock phrases
• Switchboard I-II & NIST Eval Subsets (LDC) – Telephone conversational speech
• Tactical Speaker Identification, TSID (LDC) – Military radio communications
• Speaker Recognition Corpus (OGI) – Long term telephone prompted and spontaneous speech
A summary of corpora characteristics can be found at http://www.apl.jhu.edu/Classes/Notes/Campbell/SpkrRec/
Performance Survey Range of Performance
[Figure: DET curves (probability of false accept vs. probability of false reject, in %) spanning the range of performance; error rates decrease with increasing constraints]

• ~0.1% EER: text-dependent (combinations), clean data, single microphone, large amount of train/test speech
• ~1% EER: text-dependent (digit strings), telephone data, multiple microphones, small amount of training data
• ~10% EER: text-independent (conversational), telephone data, multiple microphones, moderate amount of training data
• ~25% EER: text-independent (read sentences), military radio data, multiple radios & microphones, moderate amount of training data
Performance Survey NIST Speaker Recognition Evaluations
• Annual NIST evaluations of speaker verification technology (since 1995)
• Aim: Provide a common paradigm for comparing technologies
• Focus: Conversational telephone speech (text-independent)
[Diagram: NIST (evaluation coordinator) and the Linguistic Data Consortium (data provider) support technology developers in an evaluate/improve cycle, comparing technologies on a common task]
http://www.nist.gov/speech/tests/spk/index.htm
Performance Survey NIST Speaker Recognition Evaluation 2000
• DET curves for 10 US and European sites
– Variable duration test segments (average 30 sec)
– Two minutes of training speech per speaker
– 1003 speakers (546 female, 457 male)
– 6096 true trials, 66520 false trials
• Equal error rates range between 8% and 19%
• Dominant approach is adapted Gaussian Mixture Model based system (single state HMM)
Performance Survey Effect of Training and Testing Duration
• Results from 1998 NIST evaluation
[Figure: DET curves improve with increasing training data and with increasing testing data]
Performance Survey Effect of Microphone Mismatch
• In the NIST evaluation, performance was measured when speakers used the same and different telephone handset microphone types (carbon-button vs electret)
• With microphone mismatch, equal error rate increases by over a factor of 2
[Figure: DET curves using same vs. different handset types; mismatch raises the EER by roughly 2.5×]
Performance Survey Effect of Speech Coding
• Recognition from reconstructed speech
• Error rate increases as bit rate decreases
– GSM speech performs as well as uncoded speech
[Bar chart: EER (%) for uncoded telephone speech and speech reconstructed from GSM, G.729, G.723, and MELP coders. Coder rates: T1 64.0 kb/s, GSM 12.2 kb/s, G.729 8.0 kb/s, G.723 5.3 kb/s, MELP 2.4 kb/s]
• Recognition from speech coder parameters
• Negligible increase in EER with increased computational efficiency
[Bar chart: EER (%) for detection directly from G.729 coder parameters]
Performance Survey Human vs. Machine
• Motivation for comparing human to machine
– Evaluating speech coders and potential forensic applications
• Schmidt-Nielsen and Crystal used NIST evaluation (DSP Journal, January 2000)
– Same amount of training data
– Matched Handset-type tests
– Mismatched Handset-type tests
– Used 3-sec conversational utterances from telephone speech
• Humans have more robustness to channel variabilities
– Use different levels of information
[Bar chart: error rates for computer vs. human listeners; humans were 44% better in one handset condition and 15% worse in the other]
Performance Survey Human Forensic Performance
• In 1986, the Federal Bureau of Investigation published a survey of two thousand voice identification comparisons made by FBI examiners
– Forensic comparisons completed over a period of fifteen years, under actual law enforcement conditions
– The examiners had a minimum of two years experience, and had completed over 100 actual cases
– The examiners used both aural and spectrographic methods
– http://www.owlinvestigations.com/forensic_articles/aural_specetrographic/fulltext.html#research

Results (comparing a recorded threat against a suspect):
– No decision: 65.2% (1304)
– Non-match: 18.8% (378); false rejections: 0.53% (2)
– Match: 15.9% (318); false acceptances: 0.31% (1)

From “Spectrographic voice identification: A forensic survey,” B. E. Koenig, J. Acoust. Soc. Am. 79(6), June 1986
Performance Survey Comparison to Other Biometrics
• Raw accuracy is generally not a good way to compare different biometric techniques
– The application will dictate other important factors
– See “Fundamentals of Biometric Technology” at http://www.engr.sjsu.edu/biometrics/publications_tech.html for a good discussion and comparison of biometrics
Characteristic      | Fingerprints       | Hand geometry    | Retina    | Iris          | Face                         | Signature           | Voice
Ease of use         | High               | High             | Low       | Medium        | Medium                       | High                | High
Error incidence     | Dryness, dirt, age | Hand injury, age | Glasses   | Poor lighting | Lighting, hair, glasses, age | Changing signatures | Microphones, channels, noise, colds
Accuracy            | High               | High             | Very high | Very high     | High                         | High                | High
User acceptance     | Medium             | Medium           | Medium    | Medium        | Medium                       | Medium              | High
Long-term stability | High               | Medium           | High      | High          | Medium                       | Medium              | Medium

From “A Practical Guide to Biometric Security Technology,” S. Liu and M. Silverman, IEEE IT Professional, Jan-Feb 2001
Performance Survey Comparison to Other Biometrics
From CESG Biometric Test Programme Report (http://www.cesg.gov.uk/biometrics/)
Speaker Verification: From Research to Reality
Part III : Applications and Deployments
Part III : Applications and Deployments – Outline
• Brief overview of commercial speaker verification systems
• Design requirements for commercial verification systems – General considerations– Dialog design
• Steps to deploying speaker verification systems– Initial data collection– Tuning– Limited Deployment and Final Rollout
• Examples of real deployments
Commercial Speaker Verification
Timeline (1980–2001), from small-scale deployments (100s of users) to large-scale deployments (1M+):

• Access control: TI corporate facility (Texas Instruments), mid-1980s
• Law enforcement: home incarceration (ITT Industries)
• Telecom: Sprint's Voice FONCARD (Texas Instruments)
• Law enforcement: prison call monitoring (T-Netix)
• Commerce: Home Shopping Network (Nuance)
• Financial: Charles Schwab (Nuance)
• Telecom: Swisscom (Nuance)
• Access control: Mac OS 9 (Apple)
Applications and Deployments Commercial Speaker Verification Systems
[Slide: logos of commercial speaker verification products, including the Nuance Verifier]
Applications and Deployments Design Requirements in Commercial Verifiers
• Fast– Example: 50 simultaneous verifications on single PIII 500MHz processor
• Accurate– < 0.1% FAR @ < 5% FRR with ~1-5% reprompt rate
• Robust (channel/noise variability)
• Compact Storage of Speaker Models – < 100KB/model with support for 1M+ users on standard DBs (e.g., Oracle)
• Scalable (1 Million+ users with standard DBs)
• Easy to deploy
• International language/region support
• Variety of operating modes:– Text-independent, text-prompted, text-dependent
• Fully Integrated with state-of-the-art speech recognizer
Biggest challenge: robustness to channel variability
Applications and Deployments Design Requirements: Online Adaptation
Online unsupervised adaptation
• Adaptation is one of the most powerful technologies to address robustness and ease of deployment
• Additional requirements:
– Minimize cross-channel corruption: adapting on cellular improves performance on an office phone
– Minimize cross-channel effects with no growth in storage: save new information from additional channels in one channel model
– Minimize model corruption from impostor attack
• SMS with online adaptation (Heck, ICSLP 2000) addresses the above requirements
– 5222 speakers, 8 calls, 12.5% impostor attack rate: 61% reduction in EER (unsupervised)
Applications and Deployments Dialog Design: General Principles
• Dialog should be designed to be secure and convenient
– Security is often compromised by users if the dialog is not convenient
– Example: is a 4-digit PIN's security really 1 in 10,000 false accepts? No! Users compromise the security of PINs to make them easier to remember (writing them down in a wallet, storing them on-line, etc.)
• Dialog should be maximally constrained but flexible
– More constraints give better accuracy for fixed-length training
– Example: digit sequences balance constraints on the acoustic space with flexibility

Dialog design goal: a constrained but flexible dialog that maximizes security while maintaining convenience
Applications and Deployments Dialog Design: Rules of Thumb
• Enrollment:– must be secure (e.g., rely on knowledge)– should be completed in single session
• Identity claim should be:– unique (but perhaps not unique for multi-user accounts)– easy to recognize over large populations– useful for simultaneous verification
• Verification utterances should be:– easy to remember
YES: SSN, DOB, home telephone number NO: PIN, password
– easy to recognize (both recognizer and verifier)– perceived as short, but contain lots of speech
Names: “Smith S M I T H” Digits: “3 5 6 7, 3 5 6 7”
– known only by user– widely accepted by user population (e.g., not too private)– difficult to record/synthesize
Applications and Deployments Dialog Design: Simultaneous ID Claim/Verification

Simultaneous identity claim and verification:
• Buffer the identity-claim utterance
• Recognize the identity claim and retrieve the corresponding model
• “Re-process” the buffered data by verifying the same utterance against the model

[Diagram: caller says “My name is John Doe” → start buffering data → recognize claim → start verification against “john_doe.model” → stop verification and stop buffering data]
Applications and Deployments Dialog Design: Confidence-based Reprompting

Confidence-based reprompting:
• Minimizes the average length of the authentication process
• Improves effective FAR/FRR by reprompting when unsure
• The “reprompt rate” (RPR) is controlled by two new thresholds: an initial negative score threshold and an initial positive score threshold

RPR = Pr(spkr) · (FRR1 − FRR2) + Pr(imp) · (FAR2 − FAR1)

1st utterance example: FAR = 0.07%, FRR = 0.2%, RPR = 5.7%

[Figure: score distributions with the two thresholds marking FRR1/FRR2 and FAR1/FAR2]
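The RPR formula is straightforward to evaluate. The threshold-dependent error rates below are made-up values chosen only to land near the slide's 5.7% figure:

```python
def reprompt_rate(frr1, frr2, far1, far2, p_spkr=0.5):
    """Fraction of first attempts whose score falls between the negative
    and positive decision thresholds (the 'unsure' region):
    RPR = Pr(spkr)*(FRR1 - FRR2) + Pr(imp)*(FAR2 - FAR1)."""
    return p_spkr * (frr1 - frr2) + (1.0 - p_spkr) * (far2 - far1)

# Hypothetical error rates at the two thresholds
rpr = reprompt_rate(frr1=0.062, frr2=0.002, far1=0.0007, far2=0.0547)
```

Widening the gap between the two thresholds trades a higher reprompt rate for lower effective error rates on the calls that are decided immediately.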
Applications and Deployments Dialog Design: Knowledge Verification

[Diagram: caller dialog – “Please enter your account number” → “5551234”; “Say your date of birth” → “October 13, 1964” → “You're accepted by the system.” Knowledge verification (“what you know,” backed by a knowledge base) and speaker verification (“who you are,” backed by voiceprints) are combined into an accept/reject decision]
Applications and Deployments Dialog Design: Knowledge Verification
Methods to combine knowledge and speaker verification:

• Sequential (“and” of decisions): knowledge verification followed by speaker verification
FAR = FAR(sv) × FAR(kv)
FRR = FRR(kv) + (1 − FRR(kv)) × FRR(sv)

• Parallel (“or” of decisions):
FAR = FAR(sv) + FAR(kv)
FRR = FRR(sv) × FRR(kv)

• Weighted scores: combine the two scores before making a single decision
Applications and Deployments Dialog Design: Knowledge Verification
Example: sequential combination of decisions – easy to implement, focuses on improving overall security

FAR = FAR(sv) × FAR(kv): 0.01% = 0.1% × 10%
FRR = FRR(kv) + (1 − FRR(kv)) × FRR(sv): 1.1% = 0.1% + (1 − 0.1%) × 1%
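The sequential-combination arithmetic can be checked directly; the sketch below reproduces the slide's numbers under the usual assumption that the two stages err independently:

```python
def sequential_and(far_sv, frr_sv, far_kv, frr_kv):
    """Error rates when a caller must pass knowledge verification AND
    speaker verification, assuming independent errors."""
    far = far_sv * far_kv                   # both stages must falsely accept
    frr = frr_kv + (1.0 - frr_kv) * frr_sv  # either stage may falsely reject
    return far, frr

# Slide values: FAR(sv)=0.1%, FRR(sv)=1%, FAR(kv)=10%, FRR(kv)=0.1%
far, frr = sequential_and(0.001, 0.01, 0.10, 0.001)
```

ANDing decisions drives the combined false-accept rate down (0.01%) at the cost of a slightly higher false-reject rate (about 1.1%).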
Applications and Deployments Dialog Design: Security Against Recordings
Prompted Text Verification:• Prompt user to repeat random phrase
– Example: “Please say 82-32-71, 82-32-71”– Serves as “liveness” test
• Requires modification of enrollment dialog– (typically) longer enrollment to adequately cover acoustics
[Diagram: prompted phrase (e.g., 82-32-71) → speech recognizer produces a text score; speaker verifier with speaker and impostor HMMs produces a speaker score → combine]
Applications and Deployments Deployment Steps

Initial Data Collection → Tune → Limited Deployment → Tune → Rollout
Applications and Deployments Deployment Steps: Initial Data Collection

How do you collect the data?
• “Probing/sampling” approach?
– Employ persons to call the application under supervision
– Not widely used (too difficult to collect enough data)
• Assessment from actual in-field data?
– Much easier to get volumes of data, and more realistic
– Impostor trials: common enrollment utterance for impostor trials
– True speaker trials: sort scores; manually transcribe poor-scoring utterances
• Need ~50 callers per gender
• Need to observe 30 errors of each type/condition (“rule of 30”)
• Each speaker enrolls/verifies several times (across multiple channels)
Applications and Deployments Deployment: Tuning

What components can be tuned?
• Operating point (threshold)
– Setting the operating point a priori is very difficult!
– Speaker-independent and/or speaker-dependent thresholds?
– Picking the correct operating point is key to a successful deployment!
• Dialog design
– Customer feedback and/or usage patterns can be used to simplify dialog design (e.g., removing confirmation steps, reducing reprompt rate)
• Impostor models (acoustic)
– Training with real application data results in more competitive impostor models, with better representation of linguistics, noise, and channels
Applications and Deployments Deployment Steps: Limited Deployment/Rollout

• Begin with a limited set of actual users
– Representative of the entire caller population
– Representative sampling of the (telephone) network
– Representative of noise and channel mismatch conditions
• After rollout, track the following statistics:
– Successful enrollment sessions (# of speaker models)
– Successful verification sessions
– In-grammar/out-of-grammar analysis (recognition)
– Verification rejects (correct & false) for each speaker
– Duration of sessions
Applications and Deployments First High-Volume Deployment
Application
• Speaker verification and identification based on home phone number
• Provides secure access to customer record & credit card information

Implementation
• Nuance Verifier™
• Edify telephony platform
• Deployed July 1999

Benefits
• Security
• Personalization

Size & volume
• 600K customers currently enrolled, at 20K calls/day
• Full deployment: 5 million customers at 170K calls/day
[Call-flow examples: a successful enrollment and a successful authentication]
Applications and Deployments First High-Volume Deployment
• Toll fraud prevention
• Telephone credit card purchases
• Telephone brokerage (e.g., stock trading)
Applications and Deployments Transaction Authentication
• Charles Schwab “Service Broker” “No PIN to remember, no PIN to forget”
– Built on Nuance Verifier pilot (2000): 10,000 users (SF bay area, NY) deployment: ~3 million users
– National rollout beginning Q3, 2001
Call flow:
• Caller speaks account number (identity claim)
• Caller repeats a random 4-digit phrase
• Confident? Yes: make decision
• Confident? No: if this was the first utterance, prompt a second random phrase; otherwise fall back to a PIN
Applications and Deployments
1st Large-Scale High Security Deployment
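The flowchart on this slide amounts to a confidence-gated decision loop. The following is a sketch of our reading of it, with hypothetical callback names; it is illustrative only, not Schwab's or Nuance's actual implementation.

```python
# Hedged sketch of the confidence-gated call flow: prompt for the
# account number, then a random 4-digit phrase; if the verifier is not
# confident, allow a second utterance, and fall back to a PIN last.

def schwab_style_flow(verify, prompt, check_pin, max_utts=2):
    """verify(utt) -> (accept: bool, confident: bool);
    prompt(name) -> utterance; check_pin() -> bool (PIN correct?)."""
    prompt("account number")                    # claimed identity
    for _ in range(max_utts):
        utt = prompt("random 4-digit phrase")   # fresh prompt defeats replay
        accept, confident = verify(utt)
        if confident:
            return accept                       # confident: make the decision
    return check_pin()                          # still unsure: PIN fallback
```

Asking for a *random* phrase each time is what makes a tape-recorded impostor attempt ineffective, since the required utterance cannot be predicted.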
Applications and Deployments
Law Enforcement

• Monitoring
  – Remote time and attendance logging
  – Home parole verification
  – Prison telephone usage
ITT:
• SpeakerKey - telephone based home incarceration service
• Deployed at 10 sites in Wisconsin, Ohio, Pennsylvania, Georgia, and California
• More than 12,000 home incarceration sessions in June 1995
• 0.66% false acceptance, 4.3% false rejection
T-Netix:
• Contain - validating the identity and location of a parolee
• PIN-LOCK - validating the identity of an inmate prior to allowing an outbound prison call
• Deployed in Arizona, Colorado, and Maryland
• 10K inmates using PIN-LOCK
• Roughly 25,000-30,000 verifications performed daily
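The false-acceptance and false-rejection figures quoted for these deployments are simple ratios over labeled trials at a fixed decision threshold. A minimal sketch of how such operating-point rates are computed (the scores and threshold below are illustrative, not from any deployed system):

```python
# Compute false-acceptance and false-rejection rates at one threshold.
# trials: list of (score, is_target) pairs from labeled test calls.

def error_rates(trials, threshold):
    """Return (FA rate, FR rate) at the given decision threshold."""
    fa = sum(1 for s, tgt in trials if not tgt and s >= threshold)  # impostor accepted
    fr = sum(1 for s, tgt in trials if tgt and s < threshold)       # target rejected
    n_imp = sum(1 for _, tgt in trials if not tgt)
    n_tgt = sum(1 for _, tgt in trials if tgt)
    return fa / n_imp, fr / n_tgt
```

Sweeping the threshold over its full range and plotting the two rates against each other yields the DET curve used in Part II to compare systems.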
Applications and Deployments
Demonstrations
• Nuance 1-888-NUANCE-8 – http://www.nuance.com/demos/demo-shoppingnetwork.html
• T-Netix 1-800-443-2748– http://www.t-netix.com/SpeakEZ/SpeakEZDemo.html
• ITT– http://www.buytel.com/WebKey/index.asp
• Voice Security– http://www.Voice-Security.com/KeyPad.html
Speaker Verification: From Research to Reality
Recap

• Part I: Background and Theory
  – Major concepts behind theory and operation of modern speaker verification systems
• Part II: Evaluation and Performance
  – Key elements in evaluating performance of a speaker verification system
• Part III: Applications and Deployments
  – Main issues and tasks in deploying a speaker verification system
Conclusions
• Speaker recognition is one of the few recognition areas where machines can outperform humans
• Speaker recognition technology is a viable technique currently available for applications
• Speaker recognition can be augmented with other authentication techniques to increase security
Future Directions

• Speaker recognition technology will become an integral part of speech interfaces
  – Personalization of services and devices
  – Unobtrusive protection of transactions and information
• Research will focus on using speaker recognition for more unconstrained, uncontrolled situations
  – Audio search and retrieval
  – Increasing robustness to channel variability
  – Incorporating higher levels of knowledge into decisions
To Probe Further
General Resources

Conferences and Workshops:
1) 2001: A Speaker Odyssey - The Speaker Recognition Workshop, Crete, Greece, 2001. http://www.odyssey.westhost.com/
2) Reconnaissance du Locuteur et ses Applications Commerciales et Criminalistiques (RLA2C), Avignon, France, 1998 [proceedings in English].
3) ESCA Workshop on Automatic Speaker Recognition, Identification, and Verification, Martigny, Switzerland, 1994. http://www.isca-speech.org/workshops.html
4) Audio and Visual Based Person Authentication (AVBPA), 1997, 1999, 2001. http://www.hh.se/avbpa/
5) International Conference on Acoustics, Speech, and Signal Processing (ICASSP), annual [sessions on speaker recognition]. http://www.icassp2001.org/
6) European Conference on Speech Communication and Technology (Eurospeech), biennial [sessions on speaker recognition]. http://eurospeech2001.org/
7) International Conference on Spoken Language Processing (ICSLP), biennial [sessions on speaker recognition]. http://www.icslp2000.org/

Journals:
1) IEEE Transactions on Speech and Audio Processing. http://www.ieee.org/organizations/society/sp/tsa.html
2) Speech Communication. http://www.elsevier.nl/locate/specom
3) Computer Speech & Language. http://www.academicpress.com/www/journal/0/@/0/la.htm

Web:
1) The Linguistic Data Consortium. http://www.ldc.upenn.edu/
2) European Language Resources Association. http://www.icp.inpg.fr/ELRA/home.html
3) NIST Speaker Recognition Benchmarks. http://www.nist.gov/speech/tests/spk/index.htm
4) Joe Campbell's Site for Speaker Recognition Speech Corpora. http://www.apl.jhu.edu/Classes/Notes/Campbell/SpkrRec/
5) The Biometric Consortium. http://www.biometrics.org/
6) Comp.Speech FAQ on speaker recognition. http://www.speech.cs.cmu.edu/comp.speech/Section6/Q6.6.html
7) Search for "speaker verification" in the Google search engine. http://www.google.com/
To Probe Further
Selected References

Tutorials:
1) B. Atal, "Automatic recognition of speakers from their voices," Proceedings of the IEEE, vol. 64, pp. 460-475, April 1976.
2) A. Rosenberg, "Automatic speaker verification: a review," Proceedings of the IEEE, vol. 64, pp. 475-487, April 1976.
3) G. Doddington, "Speaker recognition: identifying people by their voices," Proceedings of the IEEE, vol. 73, pp. 1651-1664, November 1985.
4) D. O'Shaughnessy, "Speaker recognition," IEEE ASSP Magazine, vol. 3, pp. 4-17, October 1986.
5) J. Naik, "Speaker verification: a tutorial," IEEE Communications Magazine, vol. 28, pp. 42-48, January 1990.
6) H. Gish and M. Schmidt, "Text-independent speaker identification," IEEE Signal Processing Magazine, vol. 11, pp. 18-32, October 1994.
7) S. Furui, "An overview of speaker recognition technology," in Automatic Speech and Speaker Recognition (C.-H. Lee, F. K. Soong, ed.), pp. 31-56, Kluwer Academic, 1996.
8) J. P. Campbell, "Speaker recognition: a tutorial," Proceedings of the IEEE, vol. 85, pp. 1437-1462, September 1997.
9) Special Issue on Speaker Recognition, Digital Signal Processing, vol. 10, January 2000. http://www.idealibrary.com/links/toc/dspr/10/1/0

Technology:
1) M. Carey, E. Parris, and J. Bridle, "A speaker verification system using alphanets," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pp. 397-400, May 1991.
2) G. Doddington, "Speaker recognition based on idiolectal differences between speakers," in Proceedings of the European Conference on Speech Communication and Technology, 2001.
3) C. Fredouille, J. Mariethoz, C. Jaboulet, J. Hennebert, J.-F. Bonastre, C. Mokbel, and F. Bimbot, "Behavior of a Bayesian adaptation method for incremental enrollment in speaker verification," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, 2000.
4) L. P. Heck and M. Weintraub, "Handset-dependent background models for robust text-independent speaker recognition," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pp. 1071-1073, April 1997.
10) G. Doddington, M. Przybocki, A. Martin, and D. A. Reynolds, "The NIST speaker recognition evaluation: overview, methodology, systems, results, perspective," Speech Communication, vol. 31, pp. 225-254, March 2000.
5) L. Heck and N. Mirghafori, "On-line unsupervised adaptation for speaker verification," in Proceedings of the International Conference on Spoken Language Processing, 2000.
6) L. Heck, Y. Konig, M. K. Sonmez, and M. Weintraub, "Robustness to telephone handset distortion in speaker recognition by discriminative feature design," Speech Communication, vol. 31, pp. 181-192, 2000.
7) H. Hermansky, N. Morgan, A. Bayya, and P. Kohn, "RASTA-PLP speech analysis technique," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pp. I.121-I.124, March 1992.
8) A. Higgins, L. Bahler, and J. Porter, "Speaker verification using randomized phrase prompting," Digital Signal Processing, vol. 1, pp. 89-106, 1991.
9) B. Koenig, "Spectrographic voice identification: a forensic survey," Journal of the Acoustical Society of America, vol. 79, pp. 2088-2090, June 1986.
10) Q. Li and B.-H. Juang, "Speaker verification using verbal information verification for automatic enrollment," in Proceedings of the International Conference on Spoken Language Processing, 1998.
11) A. Martin, G. Doddington, T. Kamm, M. Ordowski, and M. Przybocki, "The DET curve in assessment of detection task performance," in Proceedings of the European Conference on Speech Communication and Technology, pp. 1895-1898, 1997.
12) T. Matsui and S. Furui, "Comparison of text-independent speaker recognition methods using VQ-distortion and discrete/continuous HMMs," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pp. II-157-II-164, March 1992.
13) M. Newman, L. Gillick, Y. Ito, D. McAllaster, and B. Peskin, "Speaker verification through large vocabulary continuous speech recognition," in Proceedings of the International Conference on Spoken Language Processing, pp. 2419-2422, 1996.
14) T. F. Quatieri, D. A. Reynolds, and G. C. O'Leary, "Estimation of handset nonlinearity with application to speaker recognition," IEEE Transactions on Speech and Audio Processing, August 2000.
15) D. A. Reynolds, "Speaker identification and verification using Gaussian mixture speaker models," Speech Communication, vol. 17, pp. 91-108, August 1995.
16) D. A. Reynolds, "HTIMIT and LLHDB: speech corpora for the study of handset transducer effects," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pp. 1535-1538, April 1997.
17) D. A. Reynolds, "Comparison of background normalization methods for text-independent speaker verification," in Proceedings of the European Conference on Speech Communication and Technology, pp. 963-967, September 1997.
18) D. A. Reynolds, M. Zissman, T. Quatieri, G. O'Leary, and B. Carlson, "The effects of telephone transmission degradations on speaker recognition performance," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pp. 329-332, May 1995.
19) D. A. Reynolds and B. A. Carlson, "Text-dependent speaker verification using decoupled and integrated speaker and speech recognizers," in Proceedings of the European Conference on Speech Communication and Technology, pp. 647-650, September 1995.
20) D. A. Reynolds, "The effects of handset variability on speaker recognition performance: experiments on the Switchboard corpus," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pp. 113-116, May 1996.
21) A. E. Rosenberg and C.-H. Lee, "Connected word talker verification using whole word hidden Markov models," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pp. 381-384, 1991.
22) A. E. Rosenberg and S. Parthasarathy, "Speaker background models for connected digit password speaker verification," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pp. 81-84, May 1996.
23) F. Soong and A. Rosenberg, "On the use of instantaneous and transitional spectral information in speaker recognition," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pp. 877-880, 1986.
24) R. Teunen, B. Shahshahani, and L. Heck, "A model-based transformational approach to robust speaker recognition," in Proceedings of the International Conference on Spoken Language Processing, 2000.