Ala’a Spaih, Abeer Abu-Hantash
Directed by Dr. Allam Mousa
Outline for Today
1. Speaker Recognition Field
2. System Overview
3. MFCC & VQ
4. Experimental Results
5. Live Demo
Speaker Recognition Field
Speaker recognition divides into two tasks:
- Speaker verification: text-dependent or text-independent.
- Speaker identification: text-dependent or text-independent.
System Overview
Speech input → feature extraction, followed by one of two modes:
- Training mode: speaker modeling → speaker model database.
- Testing mode: feature matching against the speaker model database → decision logic → speaker ID.
Feature Extraction
Feature extraction is a special form of dimensionality reduction.
The aim is to extract the formants.
Feature Extraction
The extracted features must have specific characteristics:
- Easily measurable; occur naturally and frequently in speech.
- Do not change over time.
- Vary as much as possible among speakers, yet remain consistent for each speaker.
- Not affected by speaker health or background noise.
There are many algorithms to extract them: LPC, LPCC, HFCC, MFCC.
We used the Mel Frequency Cepstral Coefficients (MFCC) algorithm.
Feature Extraction Using MFCC
Input speech → framing and windowing → fast Fourier transform → absolute value → mel-scaled filter bank → log → discrete cosine transform → feature vectors.
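The pipeline above can be sketched end to end in Python with NumPy. This is a minimal illustration, not the project's exact implementation: the defaults (400-sample frames, 160-sample hop, 512-point FFT, Hamming window, unnormalized DCT-II) are assumptions for the sketch.

```python
import numpy as np

def mfcc(signal, fs, frame_len=400, hop=160, n_fft=512,
         n_filters=29, n_coeffs=12):
    """Sketch of the MFCC chain: framing/windowing -> FFT -> absolute
    value -> mel filter bank -> log -> DCT. Defaults are illustrative."""
    # 1. Framing and windowing (Hamming window assumed)
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    # 2-3. FFT and absolute value -> magnitude spectrum
    mag = np.abs(np.fft.rfft(frames, n_fft))        # (n_frames, n_fft//2+1)
    # 4. Mel-scaled filter bank: triangular filters equally spaced in mel
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = inv_mel(np.linspace(mel(0.0), mel(fs / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for j in range(1, n_filters + 1):
        l, c, r = bins[j - 1], bins[j], bins[j + 1]
        fbank[j - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[j - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # 5. Log of the mel spectrum (small offset avoids log(0))
    mel_spec = np.log(mag @ fbank.T + 1e-10)
    # 6. DCT-II, keeping the first n_coeffs cepstral coefficients
    k = np.arange(n_coeffs)[:, None]
    n = np.arange(n_filters)[None, :]
    dct_mat = np.cos(np.pi * k * (2 * n + 1) / (2.0 * n_filters))
    return mel_spec @ dct_mat.T                      # (n_frames, n_coeffs)
```

One second of 16 kHz audio yields a (98 × 12) feature matrix with these defaults.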
Framing and Windowing
(Figure: FFT of a windowed frame; the resulting spectrum combines the vocal-tract response and the glottal pulse.)
Mel Scaled-Filter Bank
The spectrum is mapped to the mel spectrum through a mel-scaled filter bank:
mel(f) = 2595 * log10(1 + f/700)
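The mapping and its inverse (needed to place the triangular filter edges back in Hz) are simple to write down; a small sketch:

```python
import math

def hz_to_mel(f):
    """Mel scale from the slide: mel(f) = 2595 * log10(1 + f/700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse mapping, solving the formula above for f."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```

By construction 1000 Hz maps to approximately 1000 mel, and the scale grows roughly logarithmically above that.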
Cepstrum
The mel spectrum is converted to the MFCC coefficients. By taking the DCT of the logarithm of the magnitude spectrum, the glottal pulse and the vocal-tract impulse response can be separated.
Classification
Classification builds a unique model for each speaker in the database.
There are two major types of models for classification:
- Stochastic models: GMM, HMM, ANN.
- Template models: VQ, DTW.
We used the VQ algorithm.
VQ Algorithm
The VQ technique consists of extracting a small number of representative feature vectors.
The first step is to build a speaker database consisting of N codebooks, one for each speaker in the database.
A speaker's feature vectors are clustered into codewords, which together form the speaker model (codebook). This is done by the K-means clustering algorithm.
K-means Clustering
1. Start: choose the number of clusters k.
2. Initialize the centroids.
3. Compute the distance from every object to every centroid.
4. Group the objects based on minimum distance.
5. If no grouping changes, end; otherwise recompute the centroids and repeat from step 3.
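The flowchart above can be sketched directly in NumPy. This is an illustrative implementation, not the project's code; random initialization from the data points is an assumption.

```python
import numpy as np

def kmeans(vectors, k, n_iter=100, seed=0):
    """K-means as in the flowchart: initialize k centroids, assign each
    vector to its nearest centroid, recompute centroids, and stop once
    the grouping no longer changes."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from k distinct data points (assumption)
    centroids = vectors[rng.choice(len(vectors), size=k,
                                   replace=False)].astype(float)
    labels = None
    for _ in range(n_iter):
        # Distance of every vector to every centroid
        d = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = d.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                          # no change -> converged
        labels = new_labels
        for j in range(k):                 # recompute centroids
            if np.any(labels == j):
                centroids[j] = vectors[labels == j].mean(axis=0)
    return centroids, labels
```

On well-separated data the centroids converge to the cluster means after a few iterations.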
VQ Example
Given data points, split into 4 codebook vectors with initial values at (2,2), (4,6), (6,5), (8,8).
VQ Example
Once there is no more change, the feature space is partitioned into 4 regions. Any input feature vector is classified as belonging to one of the 4 regions, and the entire codebook is specified by the 4 centroid points.
If we set the codebook size to 8, the output of the clustering will be:
K-means Clustering
(Figures: scatter plots of the feature space before and after clustering into 8 codewords.)
VQ: the MFCCs of a speaker (1000 × 12) are reduced to a speaker codebook (8 × 12).
Feature Matching
For each codebook a distortion measure is computed, and the speaker with the lowest distortion is chosen. The distortion measure is defined as the squared Euclidean distance:
d²(x, y) = Σ_{i=1}^{D} (x_i − y_i)²
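The matching step can be sketched as follows. Averaging the per-vector minimum distances into a single distortion score is an assumption of this sketch; the source only specifies the squared Euclidean distance and the lowest-distortion decision rule.

```python
import numpy as np

def identify(test_mfccs, codebooks):
    """VQ matching: for each speaker codebook, score every test feature
    vector by its squared Euclidean distance to the nearest codeword,
    average the scores (assumption), and pick the lowest-distortion
    speaker. `codebooks` maps speaker name -> (k x D) codeword array."""
    scores = {}
    for speaker, cb in codebooks.items():
        # (n_vectors x k) matrix of squared distances to all codewords
        d2 = ((test_mfccs[:, None, :] - cb[None, :, :]) ** 2).sum(axis=2)
        scores[speaker] = d2.min(axis=1).mean()
    return min(scores, key=scores.get), scores
```

The returned dictionary of scores corresponds to one row of the distortion table in the results section.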
System Operates in Two Modes
- Offline.
- Online: monitor microphone inputs → MFCC feature extraction → calculate VQ distortion → make decision and display.
Applications
- Speaker recognition for authentication: banking applications.
- Forensic speaker recognition: proving the identity of a recorded voice can help to convict a criminal or discharge an innocent in court.
- Speaker recognition for surveillance: electronic eavesdropping of telephone and radio conversations.
Results
The table shows how the system identifies the speaker according to the Euclidean distance calculation; the lowest distortion in each row lies on the diagonal, so every speaker is identified correctly.
Sp 1 Sp 2 Sp 3 Sp 4 Sp 5
Sp 1 10.7492 13.2712 17.8646 14.7885 13.2859
Sp 2 13.2364 10.2740 13.2884 11.7941 14.0461
Sp 3 17.5438 16.1177 11.9029 16.2916 17.7199
Sp 4 16.1360 13.7095 15.5633 11.7528 16.7327
Sp 5 14.9324 15.7028 17.2842 17.8917 12.3504
Settings: 12 MFCCs, 29 filter banks, codebook size 64; ELSDSR database.
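The decision logic over the distortion matrix above reduces to an arg-min per row:

```python
import numpy as np

# Distortion matrix from the table above: rows are the tested speakers
# (Sp 1..Sp 5), columns are the five speaker codebooks.
D = np.array([
    [10.7492, 13.2712, 17.8646, 14.7885, 13.2859],
    [13.2364, 10.2740, 13.2884, 11.7941, 14.0461],
    [17.5438, 16.1177, 11.9029, 16.2916, 17.7199],
    [16.1360, 13.7095, 15.5633, 11.7528, 16.7327],
    [14.9324, 15.7028, 17.2842, 17.8917, 12.3504],
])

# Decision: pick the codebook with the lowest distortion in each row.
decisions = D.argmin(axis=1)
print(decisions)  # -> [0 1 2 3 4]: every minimum is on the diagonal
```

All five minima fall on the diagonal, so all five speakers are identified correctly.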
Results
Number of MFCCs vs. ID rate:
No. of MFCC | ID rate
5  | 76 %
12 | 91 %
20 | 91 %
Frame size vs. ID rate:
Frame size 10–30 ms: good.
Above 30 ms: bad.
Results
The effect of the codebook size on the ID rate and VQ distortion.
(Figures: ID rate (%) and matching score vs. codebook size, 0–300.)
Results
Number of filter banks vs. ID rate and VQ distortion.
(Figures: ID rate (%) and matching score vs. number of filters in the filter bank, 0–50.)
Results
The performance of the system on different test shot lengths:
Test speech length | ID rate
0.2 sec | 60 %
2 sec   | 85 %
6 sec   | 90 %
10 sec  | 95 %
(Figure: ID rate (%) vs. test speech length, 0–12 sec.)
Summary
- Effect of changing some parameters on the MFCC and VQ algorithms.
- Our system identifies the speaker regardless of the language and the text.
- Satisfactory results require the same training and testing environment, and the test data needs to be several tens of seconds long.