Speaker and Speech Recognition: Speaker Recognition and Verification
Josef Kittler
Centre for Vision, Speech and Signal Processing
University of Surrey, Guildford GU2 7XH
www.ee.surrey.ac.uk/Personal/J.Kittler/lecturenotes
OUTLINE
• Introduction
• Terminology
• Problem formulation
• Speech representation
• Text independent methods
• Text dependent methods
Introduction
Person identification is crucial to the fabric of society:
• Security
• Access to services
• Business transactions
• Law enforcement
• Border control
Establishing Identity
One or more of the following:
• What the entity knows (e.g. password)
• What the entity has (e.g. badge, smart card)
• What the entity is (e.g. fingerprints, retinal characteristics)
• Where the entity is (e.g. in front of a particular terminal)
Authentication Overview
Authenticator type | Proof | Defense | Traditional example | Digital example | Flaw
Secret | Secrecy, obscurity | Closely kept | Combo lock | Computer password | Less secret with each use
Token | Possession | Closely held | Metal key | Key-less car entry | Insecure if lost
ID | Uniqueness | Copy-resistant | Drivers license | Biometric access | Difficult to replace
Authentication Overview
Authenticator subtypes:
1. Secret
- secrecy, e.g., password
- obscurity, e.g., mother’s maiden name, SSN
2. Token
- active, e.g., synchronised password generator
- passive, e.g., smart card password storage
3. ID
- inalterable, e.g., fingerprint, face, hand, eye
- alterable biometric signal, e.g., voice, keystroke, signature
Biometric functionalities
Biometrics: a means to prevent identity theft
Biometric functionalities:
• Verification
• Identification
• Screening (watch list)
• Retrieval (detection of multiple identities)
• Negation of denial
Historical notes
• Bertillon system
• Advantages of biometrics
- Convenience
- Fraud reduction
• Disadvantages
- Output is a score
- Cannot be replaced
- Open to attack
- Privacy concerns
Application Characteristics
• Cooperative / non-cooperative
• Scalable / non-scalable
• Private / public
• Closed / open
• High-level / low-level security
• Habituated / non-habituated
Selection criteria
• Universality (all users possess this biometric)
• Permanence
• Uniqueness
• Collectability
• Acceptability
• Open to attack
• Failure to enroll
Introduction
Voice based speaker recognition/verification: why?
• Voice is one of the main biometric modalities used by humans for personal identity recognition and verification
• User friendly
• Natural interface
• Conveys emotion
• Can be used over the telephone
Terminology
Speaker identification -- using utterances from a speaker, determine who he/she is out of a set of known speakers
Speaker verification -- using utterances from a speaker, determine whether the caller is who he/she claims to be (requires an identity claim)
Training -- using utterances from a speaker to train a unique voiceprint that can later be used to identify/verify a speaker. Applies to both SI/SV.
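The difference between the two tasks can be sketched in a few lines of Python. This is only an illustration: the `scores` dictionary, the speaker names and the convention that a higher score means a better match are assumptions made for the example, not part of the lecture.

```python
def identify(scores):
    """Speaker identification: pick the best-matching speaker from a known set."""
    return max(scores, key=scores.get)


def verify(scores, claimed_id, threshold):
    """Speaker verification: test one identity claim against a threshold."""
    return scores[claimed_id] >= threshold


# Hypothetical similarity scores of one utterance against three enrolled voiceprints.
scores = {"alice": 0.2, "bob": 1.7, "carol": -0.4}
best = identify(scores)
bob_accepted = verify(scores, "bob", threshold=1.0)
alice_accepted = verify(scores, "alice", threshold=1.0)
```

Identification needs no identity claim but compares against every enrolled speaker; verification needs the claim and makes a single accept/reject decision.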
Voice biometrics properties
Biometric signal (voice)
Speaker verification is very compelling:
• Voice is convenient.
• Voice is ubiquitous.
• Voice is inexpensive.
• Voice provides challenge-response security.
Unfortunately:
• It is sometimes inconvenient because it does not work well everywhere, and it requires a backup solution.
Voice Recognition Applications
• Voice based identity verification for
- Web banking
- Physical access control
- Border control
- Identity cards/driver’s licence
- Personalisation
• Entertainment
- Robotic toys
• Law enforcement/surveillance
- Telephone surveillance
Application to web banking
[Figure: block diagram with the components Internet, smart card, client, server, card reader, microphone, services provider and private network]
Biometric Applications
Forensic: criminal investigation
Government: identity card/passport, driver’s licence, social security
Commercial: ATM, e-commerce/banking, nomadic working, mobile phone, personalisation
Market Outlook for the Future
* Voice ID: Applications and Markets for the New Millennium 1999 J. Markowitz, Consultants
[Figure: diagram with the labels Speech Processing, Speech Recognition, Speech Synthesis, Digitized Speech, Input, Voice Biometrics, Speaker Verification, Speaker Identification, …]
Voice template
Set of features extracted from the raw biometric data during enrolment
Represents typical values of voice biometric
Multiple templates may be stored to account for intra class variability
Template issues
• Aging (maintenance)
• Central/distributed storage
• Privacy and protection
Deficiencies of existing commercial solutions
• Sensitive to speech acquisition conditions
• Sensitive to background noise
• Sensitive to emotional state
• Sensitive to physical state of the user
• Quite effective for closed set applications under tightly controlled voice acquisition conditions
Main causes of acoustic variation in speech
[Figure: sources of variation acting on the input to a speaker recognition system]
• Speaker: voice quality, pitch, gender, dialect
• Speaking style: stress/emotion, speaking rate, Lombard effect
• Task/context: man-machine dialogue, dictation, free conversation
• Phonetic/prosodic context
• Noise: other speakers, background noise, reverberations
• Channel: distortion, noise, echoes, dropouts
• Microphone: distortion, electrical noise, directional characteristics
Voice recognition challenge
• Large number of classes
• Segmentation
• Noise and distortion
• Variability of deployed microphones (interoperability)
• Population coverage and scalability
• System performance
• System attacks
• Aging
• Non-uniqueness of biometrics
• Privacy concerns (smart cards)
System Attack
[Figure: attack points in the verification chain; labels: impersonation, speech data, features, template, decision, MFCC, GMM]
Generic Architecture
[Figure: architecture diagram with the blocks silence detection, energy normalisation, feature extraction, recognition, background noise removal and phoneme/speech recognition; group label: related processes]
Privacy issues
• Template protection
• Biometrics can be used to track people secretly, a violation of their right to privacy (big brother)
• Biometric data may be used for other than intended purposes
• Biometric database linking
Evaluation protocol
Test data and procedures adopted to evaluate a biometric system
Evaluation should be conducted by an independent body
Test and biometric data used should not have previously been seen by the system
Data use cases: training set, evaluation set, test set
Data set sizes should allow statistically significant evaluation
XM2VTS database
• Face images and speech recordings of 295 people
• Subjects recorded in 4 sessions
• Lausanne Protocol
Performance Evaluation
Performance criteria
• Failure to enrol
• Accuracy
• Speed
• Storage
• Costs
• Ease of use
• Failure to acquire
Accuracy
Measured in terms of
• False rejections/identification
• False acceptances
Falsely accepted users are impostors
Performance characterisation issues
• Genuine ambiguity
• Confidence
• Competence
Performance characterisation (verification)
• False rejection
• False acceptance
• Total error rate/Half total error rate
• Operating point
- Equal error rate (civilian)
- Zero false acceptance (high security, forensic)
• Test set/evaluation set
• Receiver operating characteristic
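The error measures listed above can be computed directly from two sets of scores. The sketch below assumes distance-type scores (a claim is accepted when the score does not exceed the threshold) and uses made-up score values; the function name `error_rates` is hypothetical.

```python
import numpy as np


def error_rates(client_scores, impostor_scores, threshold):
    """Return (FAR, FRR, HTER) for distance-type scores at one operating point."""
    client = np.asarray(client_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    frr = float(np.mean(client > threshold))      # genuine claims rejected
    far = float(np.mean(impostor <= threshold))   # impostor claims accepted
    hter = 0.5 * (far + frr)                      # half total error rate
    return far, frr, hter


# Illustrative scores only (not taken from any database).
client = [0.8, 1.1, 1.9, 2.6]
impostor = [1.5, 2.9, 3.4, 4.2]
far, frr, hter = error_rates(client, impostor, threshold=2.0)
```

In keeping with the test set/evaluation set distinction, the threshold would normally be chosen on the evaluation set and the error rates reported on the test set.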
Performance characterisation (identification)
Confusion matrix
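As a small illustration, a confusion matrix for identification counts, for each true speaker (row), how often each speaker was returned (column). The integer speaker labels and the example decisions below are assumptions made for the sketch.

```python
import numpy as np


def confusion_matrix(true_ids, predicted_ids, n_speakers):
    """Rows index the true speaker, columns the speaker chosen by the system."""
    cm = np.zeros((n_speakers, n_speakers), dtype=int)
    for t, p in zip(true_ids, predicted_ids):
        cm[t, p] += 1
    return cm


# Hypothetical identification results for three enrolled speakers.
true_ids = [0, 0, 1, 1, 2, 2]
predicted_ids = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(true_ids, predicted_ids, n_speakers=3)
identification_rate = np.trace(cm) / cm.sum()  # fraction on the diagonal
```

Off-diagonal entries show which speakers are confused with which, information that a single error rate hides.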
Reference
1. D. A. Reynolds, T. F. Quatieri, R. B. Dunn. Speaker Verification Using Adapted Gaussian Mixture Models. Digital Signal Processing 10 (2000), 19-41.
2. A. Martin, G. Doddington, T. Kamm, M. Ordowski, M. Przybocki. The DET curve in assessment of detection task performance. Eurospeech 97, 1895-1898.
Preprocessing and feature extraction
[Figure: block diagram with A/D converter, silence detector, Hamming window and LPC analysis; output: cepstrum and delta cepstrum coefficients]
Speech input and spectra
[Figure: speech input and spectra for a client and an impostor]
Speech representation
• MFCC feature vectors (24-channel filterbank analysis), with delta coefficients and delta log-energy appended (2-coefficient window)
• 33-component feature vectors
• Energy normalisation:
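Delta coefficients are commonly obtained by a linear regression over neighbouring frames. The sketch below assumes that standard regression formula and reads the "2-coefficient window" as a half-width of 2 frames on each side; both are assumptions, since the slide does not give the formula. Edge frames are handled by repeating the first and last frame.

```python
import numpy as np


def delta(features, window=2):
    """Regression-based delta coefficients for a (frames x coefficients) array."""
    features = np.asarray(features, dtype=float)
    n_frames = features.shape[0]
    # Repeat the edge frames so every frame has `window` neighbours on each side.
    padded = np.pad(features, ((window, window), (0, 0)), mode="edge")
    denom = 2.0 * sum(n * n for n in range(1, window + 1))
    out = np.zeros_like(features)
    for n in range(1, window + 1):
        ahead = padded[window + n: window + n + n_frames]
        behind = padded[window - n: window - n + n_frames]
        out += n * (ahead - behind)
    return out / denom


# A linear ramp has slope 1, so interior deltas should equal 1.
ramp = np.arange(10, dtype=float).reshape(-1, 1)
d = delta(ramp, window=2)
```

The static coefficients, their deltas and the delta log-energy would then be concatenated frame by frame to form the final feature vector.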
[Figure: comparison of the FFT-based signal spectrum, the LP spectrum, the spectrum derived from the LP-cepstrum and the spectrum after cepstral processing; axes: amplitude (dB) against frequency (Hz)]
Speaker verification problem
Consider that the system has been trained using samples of the input waveform provided by the client.
Each sample is represented by a feature vector.
The training speech segment is long enough to create a representative model for each client i and a model for all speakers.
Hypothesis testing
Now a test speech segment is acquired from a speaker claiming to be client i.
Given a feature vector x corresponding to a waveform sample, the probability that the claim is true is given by the a posteriori probability of client i given x.
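In assumed notation (x is the feature vector, \omega_i the hypothesis that the speaker is client i, and \bar{\omega}_i the hypothesis that the speaker is an impostor; these symbols are chosen for illustration), Bayes' rule gives this probability as:

```latex
P(\omega_i \mid x) =
  \frac{p(x \mid \omega_i)\, P(\omega_i)}
       {p(x \mid \omega_i)\, P(\omega_i) + p(x \mid \bar{\omega}_i)\, P(\bar{\omega}_i)}
```

The two class-conditional densities come from the client model and from the model trained on all speakers.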
Likelihood Ratio
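The claim is accepted if the a posteriori probability of the client hypothesis exceeds that of the impostor hypothesis.
Assuming the priors are equal, the test reduces to checking whether the ratio of the client and impostor class-conditional densities exceeds one.
This ratio is also referred to as the likelihood ratio.
The decision is based on more than one sample, hence on the joint density of the whole sequence of test samples.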
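Written out in assumed notation (x a feature vector, \omega_i the client hypothesis, \bar{\omega}_i the impostor hypothesis, X = \{x_1, \dots, x_N\} the test sequence; the symbols are illustrative), the acceptance test and its equal-prior form would read:

```latex
P(\omega_i \mid x) > P(\bar{\omega}_i \mid x)
\;\Longleftrightarrow\;
\frac{p(x \mid \omega_i)}{p(x \mid \bar{\omega}_i)} > \frac{P(\bar{\omega}_i)}{P(\omega_i)}
\qquad\text{and, for equal priors,}\qquad
\frac{p(X \mid \omega_i)}{p(X \mid \bar{\omega}_i)} > 1
```

The first equivalence follows from Bayes' rule, because the evidence p(x) is common to both posteriors and cancels.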
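Assuming the samples are independent and identically distributed, the joint probability density (i.e. the likelihood) can be expressed as the product of the marginal densities of the individual samples.
Taking the log, the log-likelihood becomes the sum of the per-sample log densities.
The log-likelihood ratio is the difference between the client and the impostor log-likelihoods.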
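A minimal numerical sketch of the log-likelihood ratio under the independence assumption follows. It assumes a single Gaussian for the client and a single Gaussian for the all-speaker model, purely to keep the example short; the model parameters and frames are invented.

```python
import numpy as np


def gaussian_logpdf(x, mean, cov):
    """Log density of a multivariate Gaussian, evaluated per row of x."""
    x = np.atleast_2d(x)
    d = x.shape[1]
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = np.einsum("nd,dk,nk->n", diff, np.linalg.inv(cov), diff)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)


def log_likelihood_ratio(frames, client, world):
    """Sum of per-frame log density differences (i.i.d. assumption)."""
    return float(np.sum(gaussian_logpdf(frames, *client)
                        - gaussian_logpdf(frames, *world)))


# Hypothetical (mean, covariance) pairs for the two models.
client = (np.zeros(2), np.eye(2))
world = (np.full(2, 3.0), np.eye(2))
frames = np.array([[0.0, 0.0], [0.1, -0.1]])
llr = log_likelihood_ratio(frames, client, world)
```

With equal priors the claim would be accepted when `llr` is positive; here the frames lie close to the client mean, so the ratio is strongly positive.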
Discussion
The independence assumption may not be satisfied in practice
The log-likelihood is a function of the number of samples N. It may be desirable to normalise it by the factor 1/N.
For a large (infinite) sample, the normalised summation asymptotically becomes an ‘integration’ of the log model density weighted by the test sample density.
The log-likelihood has a meaning only in a relative sense.
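In assumed notation (q(x) denotes the test sample density and p(x \mid \omega) the model density; the symbols are illustrative), the normalised log-likelihood and its limit can be written as:

```latex
\frac{1}{N} \sum_{j=1}^{N} \log p(x_j \mid \omega)
\;\xrightarrow[N \to \infty]{}\;
\int q(x)\, \log p(x \mid \omega)\, dx
\;=\; -H(q) \;-\; \mathrm{KL}\!\left(q \,\|\, p(\cdot \mid \omega)\right)
```

The entropy term H(q) depends only on the test data, which is one way to see why the value is meaningful only when compared across models.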
Gaussian model
Assuming a Gaussian model, the feature vectors are normally distributed with mean vector μ and covariance matrix Σ.
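For reference, the d-dimensional Gaussian density with mean vector \mu and covariance matrix \Sigma has the standard form below (the symbol names are the conventional ones and are assumed here):

```latex
p(x \mid \omega) = \frac{1}{(2\pi)^{d/2}\, |\Sigma|^{1/2}}
  \exp\!\left( -\tfrac{1}{2}\, (x - \mu)^{T}\, \Sigma^{-1}\, (x - \mu) \right)
```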
Substituting and taking natural log of the client density we find
The left hand side of the inequality can be expanded as
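A standard identity makes this kind of expansion concrete: the average Gaussian log-likelihood of N probe frames depends on the probe only through its sample mean m and (1/N-normalised) sample covariance S, namely -½ [ d log 2π + log|Σ| + tr(Σ⁻¹S) + (m − μ)ᵀ Σ⁻¹ (m − μ) ]. The sketch below checks this numerically on synthetic data; the model parameters are invented and the function names are hypothetical.

```python
import numpy as np


def avg_loglik_direct(frames, mean, cov):
    """Average per-frame Gaussian log-likelihood, computed frame by frame."""
    d = frames.shape[1]
    diff = frames - mean
    inv = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    maha = np.einsum("nd,dk,nk->n", diff, inv, diff)
    return float(np.mean(-0.5 * (d * np.log(2 * np.pi) + logdet + maha)))


def avg_loglik_from_stats(probe_mean, probe_cov, mean, cov):
    """Same quantity from the probe's mean and covariance only."""
    d = mean.shape[0]
    inv = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    shift = probe_mean - mean
    return float(-0.5 * (d * np.log(2 * np.pi) + logdet
                         + np.trace(inv @ probe_cov) + shift @ inv @ shift))


rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 3))                 # synthetic probe frames
model_mean = np.array([0.1, -0.2, 0.0])
model_cov = np.array([[1.0, 0.2, 0.0], [0.2, 1.5, 0.1], [0.0, 0.1, 0.8]])
probe_mean = frames.mean(axis=0)
probe_cov = np.cov(frames, rowvar=False, bias=True)  # normalised by N
direct = avg_loglik_direct(frames, model_mean, model_cov)
from_stats = avg_loglik_from_stats(probe_mean, probe_cov, model_mean, model_cov)
```

Because only second-order statistics of the probe enter, matching reduces to comparing means and covariance matrices.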
The first term can be rewritten as
Thus the decision rule finally becomes
Its symmetric form
Notes
• The number of samples for both model and probe should be as large as possible
• Ignoring the means, we have
• If model and probe match, the product of the two matrices is an identity matrix, i.e. an isotropic distribution. Hence the matching criterion measures ‘sphericity’.
• It is a sphericity measure
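One possible measure of this kind, given here as an assumption rather than as the formula on the slide, is (1/d) [ tr(Σ⁻¹S) − log|Σ⁻¹S| ] − 1, where Σ is the model covariance and S the probe covariance. It is zero when Σ⁻¹S is the identity and positive otherwise, since it compares the arithmetic mean of the eigenvalues of Σ⁻¹S with the log of their geometric mean.

```python
import numpy as np


def gaussian_sphericity(model_cov, probe_cov):
    """Zero if inv(model_cov) @ probe_cov is the identity, positive otherwise."""
    d = model_cov.shape[0]
    product = np.linalg.inv(model_cov) @ probe_cov
    _, logdet = np.linalg.slogdet(product)
    return float((np.trace(product) - logdet) / d - 1.0)


cov = np.array([[1.0, 0.3], [0.3, 2.0]])        # illustrative covariance
matched = gaussian_sphericity(cov, cov)         # probe equals model
mismatched = gaussian_sphericity(cov, 2 * cov)  # probe has twice the spread
```

For the mismatched case the product is 2I, so the measure evaluates to 1 − log 2.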
Other sphericity measures
In essence, in matching a probe and a model, we are measuring the distance between two Gaussian probability densities
Any feature selection criterion could be used for that purpose
The derived sphericity measure resembles divergence
Bhattacharyya measure
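For two Gaussian densities the Bhattacharyya distance has a closed form: a term in the difference of means plus a term comparing the averaged covariance with the two individual covariances. A small sketch with made-up parameters:

```python
import numpy as np


def bhattacharyya_distance(mean1, cov1, mean2, cov2):
    """Bhattacharyya distance between two multivariate Gaussians."""
    cov = 0.5 * (cov1 + cov2)
    diff = mean1 - mean2
    term_mean = 0.125 * diff @ np.linalg.inv(cov) @ diff
    _, logdet = np.linalg.slogdet(cov)
    _, logdet1 = np.linalg.slogdet(cov1)
    _, logdet2 = np.linalg.slogdet(cov2)
    term_cov = 0.5 * (logdet - 0.5 * (logdet1 + logdet2))
    return float(term_mean + term_cov)


identity = np.eye(2)
same = bhattacharyya_distance(np.zeros(2), identity, np.zeros(2), identity)
shifted = bhattacharyya_distance(np.zeros(2), identity,
                                 np.array([2.0, 0.0]), identity)
```

When the means are ignored, only the covariance term remains, which again compares the probe and model covariance matrices.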
Decision threshold
The matching process maps multidimensional speech data into 1D space
In theory, the decision threshold could be derived from the known parameters
In practice
• The distributions will not be exactly Gaussian
• The parameters are estimates subject to error
• We may wish to control the trade-off between false acceptances and false rejections
Hence, the decision threshold is determined
• Experimentally
• By modelling score distributions
If the score s is
• > Threshold: reject the claimant
• ≤ Threshold: accept the claimant
[Figure: score axis s divided by the threshold into an accept region and a reject region]
The selected threshold defines an operating point
ROC curve
• ROC: receiver operating characteristic
• Defines a relationship between the operating point, false acceptances and false rejections in verification
• DET curve: the same trade-off plotted as false rejection against false acceptance on normal deviate scales
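The curve can be traced by sweeping the threshold over the observed scores. The sketch below assumes distance-type scores (accept when the score does not exceed the threshold), uses invented score values, and approximates the equal error rate by the threshold at which the two error rates are closest.

```python
import numpy as np


def roc_points(client_scores, impostor_scores):
    """False acceptance and false rejection rates at every observed score."""
    client = np.asarray(client_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    thresholds = np.unique(np.concatenate([client, impostor]))
    far = np.array([np.mean(impostor <= t) for t in thresholds])
    frr = np.array([np.mean(client > t) for t in thresholds])
    return thresholds, far, frr


def equal_error_rate(far, frr):
    """Approximate EER: the operating point where FAR and FRR are closest."""
    k = int(np.argmin(np.abs(far - frr)))
    return float(0.5 * (far[k] + frr[k]))


client = [0.8, 1.1, 1.9, 2.6]     # illustrative genuine-claim scores
impostor = [1.5, 2.9, 3.4, 4.2]   # illustrative impostor scores
thresholds, far, frr = roc_points(client, impostor)
eer = equal_error_rate(far, frr)
```

Plotting `frr` against `far` gives the ROC; applying the inverse normal cumulative distribution function to both axes gives the DET presentation.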
Score normalisation
It may be desirable to normalise the scores, e.g. for the purposes of fusion or threshold determination
Possibilities
• Map to posterior probabilities
• Map to designated means, e.g. so that the client and impostor means coincide with -1 and +1 respectively
• Map so that the variance of the score is normalised
• Others
Score normalisation (cont)
Min-max
Scaling
Z-score
Score normalisation (cont)
Median
Double sigmoid
Tanh
Min-max, Z-score and tanh are efficient; median, double-sigmoid and tanh are robust
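Commonly used forms of four of these normalisations are sketched below; the exact formulas on the slides are not reproduced here, so these are assumptions. In particular, the tanh variant uses the plain mean and standard deviation where robust (Hampel) estimates are usually preferred, and scaling and the double sigmoid are omitted. `ref` is a hypothetical set of reference scores from which the parameters are estimated.

```python
import numpy as np


def min_max(scores, ref):
    """Map the reference range onto [0, 1]."""
    return (scores - ref.min()) / (ref.max() - ref.min())


def z_score(scores, ref):
    """Zero mean, unit standard deviation with respect to the reference set."""
    return (scores - ref.mean()) / ref.std()


def median_mad(scores, ref):
    """Robust variant: centre on the median, scale by the median absolute deviation."""
    med = np.median(ref)
    mad = np.median(np.abs(ref - med))
    return (scores - med) / mad


def tanh_norm(scores, ref):
    """Tanh mapping into (0, 1); mean/std used here instead of Hampel estimators."""
    return 0.5 * (np.tanh(0.01 * (scores - ref.mean()) / ref.std()) + 1.0)


ref = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # illustrative reference scores
mm = min_max(ref, ref)
zs = z_score(ref, ref)
mad_scores = median_mad(ref, ref)
th = tanh_norm(ref, ref)
```

Min-max and Z-score rely on the extremes, mean and variance, so a single outlier shifts them; the median-based variant is far less sensitive to outliers.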
Score normalisation (cont)
Mapping to designated means (for verification)
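A linear map that sends the client mean to -1 and the impostor mean to +1 is one simple way to do this. The sketch assumes distance-type scores and estimates the two means from invented score samples.

```python
import numpy as np


def map_to_designated_means(score, client_mean, impostor_mean):
    """Linear map with client_mean -> -1 and impostor_mean -> +1."""
    return 2.0 * (score - client_mean) / (impostor_mean - client_mean) - 1.0


client_mean = float(np.mean([0.8, 1.1, 1.9, 2.6]))     # illustrative scores
impostor_mean = float(np.mean([1.5, 2.9, 3.4, 4.2]))
at_client = map_to_designated_means(client_mean, client_mean, impostor_mean)
at_impostor = map_to_designated_means(impostor_mean, client_mean, impostor_mean)
midway = map_to_designated_means(0.5 * (client_mean + impostor_mean),
                                 client_mean, impostor_mean)
```

After such a mapping, scores from different experts share a common scale, with 0 lying halfway between the two class means.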
Score normalisation: a posteriori class probabilities
• A posteriori class probabilities are automatically normalised to [0,1]
• Some systems compute a matching score rather than an a posteriori class probability
• Scores have to be normalised, e.g. to an a posteriori probability estimate, to facilitate fusion by simple rules
Score distribution modelling
• The probability density functions of the scores for authentic claims and for impostors can be estimated
• Parametric/nonparametric pdfs, e.g. Gaussian pdf
• The standard deviation for true claims is likely to be smaller than for impostors
• For distance type scores, the mean of the true claim scores is lower than the mean of the impostor scores
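As a sketch of the parametric route, a Gaussian can be fitted to each score population and combined through Bayes' rule to turn a raw score into an a posteriori probability of a true claim. The score samples, the equal priors and the function names are assumptions made for the example.

```python
import numpy as np


def fit_gaussian(scores):
    """Estimate the mean and standard deviation of a score population."""
    scores = np.asarray(scores, dtype=float)
    return float(scores.mean()), float(scores.std())


def gaussian_pdf(s, mean, std):
    """Univariate Gaussian density."""
    return np.exp(-0.5 * ((s - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))


def client_posterior(s, client_model, impostor_model, prior_client=0.5):
    """A posteriori probability that a claim with score s is genuine."""
    pc = prior_client * gaussian_pdf(s, *client_model)
    pi = (1.0 - prior_client) * gaussian_pdf(s, *impostor_model)
    return float(pc / (pc + pi))


client_model = fit_gaussian([1.0, 2.0, 3.0])     # illustrative distance scores
impostor_model = fit_gaussian([4.0, 5.0, 6.0])
p_low = client_posterior(2.0, client_model, impostor_model)
p_mid = client_posterior(3.5, client_model, impostor_model)
p_high = client_posterior(5.0, client_model, impostor_model)
```

The score at which the posterior crosses 0.5 is the threshold implied by the fitted distributions under equal priors.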
Example