2. Review of Literature
Speech is a natural means of communication for humans. It is not
surprising that humans can recognize the identity of a person by
hearing his voice. About 2-3 seconds of speech is sufficient for a
human to identify a voice. One review of human speech recognition
[33] states that many studies with 8-10 speakers report accuracies
of more than 97% when a sentence or more of speech is heard.
Performance falls when the speech is short and the number of
speakers is large. Speaker Recognition is one area of artificial
intelligence where machine performance can exceed human
performance: with short test utterances and a large number of
speakers, machine accuracy often exceeds that of humans.
Research on Speaker Identification systems dates back more than
fifty years [3]. This work is briefly surveyed in the subsequent
sections.
2.1 Early Systems (1960-1980)
The first reported work on Speaker Recognition can be attributed
to Pruzansky at Bell Labs [34], as early as 1963, who initiated
research by using filter banks and correlating two digital
spectrograms as a similarity measure. The system used several
utterances of commonly spoken words by ten talkers and converted
them to time-frequency-energy patterns. Some of each talker's
utterances were used to form reference patterns and the remaining
utterances served as test patterns. The recognition procedure
consisted of cross-correlating the test patterns with the reference
patterns and selecting the talker corresponding to the reference
pattern with the highest correlation as the talker of the test
utterance. The recognition score for three-dimensional patterns was
89%. Reducing the original patterns to time-energy patterns
resulted in a lower recognition score; however, when only spectral
information was retained, recognition results were the same as
those for three-dimensional patterns. The work was further
improved in [35] by using a small subset of features. Features
were formed as the average of the speech energy over certain
rectangular areas on the spectrograms. Results were computed as a
function of the number of features used and as a function of the
size of the areas used to form the features. The filter bank approach
used in the earlier two cases was replaced by formant analysis by
Doddington [36], who proposed a speaker-verification system
using eight known speakers and 32 impostors. Formant frequencies,
voicing pitch period, and speech energy—all as functions of time—
were used in verification. Proper time normalization was shown to
be an important factor in improving verification error performance.
Intra-speaker variation in speech was investigated by Endres et al.
[37] and Furui [38]. In [37], spectrograms of utterances produced
by seven speakers and recorded over periods of up to 29 years
showed that the frequency position of formants and pitch of voiced
sounds shift to lower frequencies with increasing age of test
persons. Speech spectrograms of texts spoken in a normal and a
disguised voice revealed strong variations in formant structure.
Speech spectrograms of utterances of well-known people were
compared with those of imitators. The imitators succeeded in
varying the formant structure and fundamental frequency of their
voices, but they were not able to adapt these parameters to match
or even approximate those of the imitated persons.
In [39], B. S. Atal evaluated several different parametric
representations of speech derived from the linear prediction model
for their effectiveness in automatic recognition of speakers from
their voices. Twelve predictor coefficients were determined approximately
once every 50 msec from speech sampled at 10 kHz. The predictor
coefficients and other speech parameters derived from them, such
as the impulse response function, the autocorrelation function, the
area function, and the cepstrum function were used as input to an
automatic speaker recognition system. S. Furui [40] and A. E.
Rosenberg and M. R. Sambur [41] used cepstrum coefficients
extracted by means of LPC analysis successively throughout an
utterance to form time functions. In time-domain methods, with
adequate time alignment, one can make precise and reliable
comparisons between two utterances of the same text, in similar
phonetic environments. Hence, text-dependent methods have a
much higher level of performance than text-independent methods.
The Texas Instruments system based on filter banks and the Bell
Labs system based on cepstral analysis were the first Speaker
Recognition systems to be tried commercially.
2.2 Intermediate Systems (1980-2000)
In this period there was considerable development in Speaker
Identification technology, with advances in both feature extraction
and feature matching.
2.2.1 Feature Extraction
Voice pitch (F0) and formant frequencies (F1, F2, F3) extracted
from time-aligned, uncoded and coded speech samples were
compared to establish the statistical distribution of error attributed
to the coding system [42]. The mel-warped cepstrum is a very
popular feature domain. The mel warping transforms the frequency
scale to place less emphasis on high frequencies. It is based on the
nonlinear human perception of the frequency of sounds [43]. The
cepstrum can be considered as the spectrum of the log spectrum.
Removing its mean reduces the effects of linear time-invariant
filtering (e.g., channel distortion). Often, the time derivatives of the
mel cepstra (also known as delta cepstra) are used as additional
features to model trajectory information.
Studies on automatically extracting the speech periods of each
person separately from a dialogue/conversation/meeting involving
more than two people have appeared as an extension of speaker
recognition technology [46 – 48]. Increasingly, speaker
segmentation and clustering techniques have been used to aid in
the adaptation of speech recognizers and for supplying metadata for
audio indexing and searching.
2.2.2 Feature Matching
Hidden Markov Model
As an alternative to the template-matching approach for text-
dependent speaker recognition, the Hidden Markov Model (HMM)
technique was introduced. HMMs have the same advantages for
speaker recognition as they do for speech recognition. Remarkably
robust models of speech events can be obtained with only small
amounts of specification or information accompanying training
utterances. Speaker recognition systems based on an HMM
architecture [44] used speaker models derived from a multi-word
sentence, a single word, or a phoneme. Typically, multi-word
phrases (a string of seven to ten digits, for example) were used,
and models for each individual word and for “silence” were
combined at a sentence level according to a predefined sentence-
level grammar.
Robustness
Research on increasing robustness became a central theme in the
1990s. Matsui et al. [24] compared the VQ-based method with the
discrete/continuous ergodic HMM-based method, particularly from
the viewpoint of robustness against utterance variations. They
found that the continuous ergodic HMM method is far superior to
the discrete ergodic HMM method and that the continuous ergodic
HMM method is as robust as the VQ-based method when enough
training data is available. They investigated speaker identification
rates using the continuous HMM as a function of the number of
states and mixtures. It was shown that speaker recognition rates
were strongly correlated with the total number of mixtures,
irrespective of the number of states. This means that using
information about transitions between different states is ineffective
for text-independent speaker recognition and, therefore, GMM
achieves almost the same performance as the multiple-state ergodic
HMM.
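To make the GMM approach implied by these findings concrete, the
following is a minimal sketch of GMM-based text-independent
speaker identification, assuming scikit-learn; the mixture count and
diagonal covariances are illustrative choices, not values prescribed
by the cited works.

```python
# Minimal sketch of GMM-based text-independent speaker identification.
# Assumes scikit-learn; mixture count and covariance type are illustrative.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_speaker_models(features_per_speaker, n_mix=32):
    # One GMM per enrolled speaker, trained on that speaker's feature vectors
    return [GaussianMixture(n_components=n_mix, covariance_type='diag',
                            random_state=0).fit(X)
            for X in features_per_speaker]

def identify(models, test_features):
    # Choose the speaker whose model yields the highest average log-likelihood
    scores = [m.score(test_features) for m in models]
    return int(np.argmax(scores))
```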
Text-Prompted Speaker Recognition
Matsui et al. proposed a text-prompted speaker recognition
method, in which key sentences are completely changed every time
the system is used [45]. The system accepts the input utterance
only when it determines that the registered speaker uttered the
prompted sentence. Because the vocabulary is unlimited,
prospective impostors cannot know in advance the sentence they
will be prompted to say. This method not only accurately recognizes
speakers, but can also reject an utterance whose text differs from
the prompted text, even if it is uttered by a registered speaker.
Thus, a recorded and played back voice can be correctly rejected.
Normalization
How to normalize intra-speaker variation of likelihood (similarity)
values is one of the most difficult problems in speaker verification.
Variations arise from the speaker him/herself, from differences in
recording and transmission conditions, and from noise. Speakers
cannot repeat an utterance precisely the same way from trial to
trial. Likelihood-ratio- and a posteriori probability-based techniques
were investigated [49 - 51]. In order to reduce the computational
cost for calculating the normalization term, methods using “cohort
speakers” or a “world model” were proposed.
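As an illustration, world-model normalization reduces to a
log-likelihood ratio. The sketch below assumes speaker and world
models that expose a score() method returning the average
log-likelihood of the test features (as in scikit-learn's
GaussianMixture); the decision threshold is an illustrative
parameter.

```python
# Minimal sketch of likelihood normalization with a "world model".
# Assumes models expose score(X) = average log-likelihood (e.g.,
# scikit-learn's GaussianMixture); the threshold is illustrative.
def verify(speaker_model, world_model, test_features, threshold=0.0):
    # Normalized score: log p(X | speaker) - log p(X | world)
    llr = speaker_model.score(test_features) - world_model.score(test_features)
    return llr > threshold  # accept the identity claim if above threshold
```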
2.3 Recent Trends in Speaker Identification (2000
onwards)
Recent advances in Speaker Identification can be divided into two
categories: feature extraction and feature matching.
2.3.1 Feature Extraction
Recently, feature extraction techniques such as MFCC, wavelet
decomposition and transform-domain techniques have been
explored.
Mel-Frequency Cepstral Coefficients (MFCC):
There has been a shift from LPC parameters to Mel-Frequency
Cepstral Coefficients (MFCC) for feature extraction. MFCCs are
based on the known variation of the human ear's critical bandwidths
with frequency. The MFCC technique makes use of two types of
filters, namely, linearly spaced filters and logarithmically spaced
filters. To capture the phonetically important characteristics of
speech, the signal is expressed on the mel frequency scale. This
scale has linear frequency spacing below 1000 Hz and logarithmic
spacing above 1000 Hz. As a reference point, the pitch of a 1 kHz
tone, 40 dB above the perceptual hearing threshold, is defined as
1000 mels.
Fig. 2.1 shows a block diagram of the process that converts the
speech signal into MFCCs. The speech signal is first divided into
frames and then windowed (e.g., with a Hamming window) to
minimize the signal discontinuities at the beginning and end of
each frame. The next step is to convert the signal into the
frequency domain by applying the DFT to the windowed frames.
Then comes mel-frequency warping, where the mel scale is applied;
Eq. 2.1 shows the conversion of frequency (f) to mel frequency,
and in practice it is implemented with a filter bank. In the final
step, the log mel spectrum is converted back to the time domain
using the DCT, and the resulting coefficients are the MFCCs.
mel(f) = 2595 × log10(1 + f / 700)          (2.1)
Fig. 2.1: Block diagram of the MFCC processor
MFCC techniques have become a common approach among
researchers [13-16, 52-55].
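The pipeline above can be summarized in code. The following is a
minimal sketch, assuming NumPy and SciPy; the sampling rate,
frame length, filter count and coefficient count are illustrative
choices, not values prescribed by the cited works.

```python
# Minimal sketch of the MFCC pipeline of Fig. 2.1, assuming NumPy/SciPy.
# signal: 1-D NumPy array; parameter values are illustrative.
import numpy as np
from scipy.fftpack import dct

def mel(f):
    # Eq. 2.1: frequency (Hz) to mel scale
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_inv(m):
    # Inverse of Eq. 2.1: mel scale back to frequency (Hz)
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, fs=8000, frame_len=256, hop=128, n_filt=20, n_ceps=13):
    # 1. Framing and Hamming windowing
    frames = [signal[i:i + frame_len] * np.hamming(frame_len)
              for i in range(0, len(signal) - frame_len, hop)]
    # 2. Magnitude spectrum of each frame via the DFT
    spec = np.abs(np.fft.rfft(frames, axis=1))
    # 3. Triangular mel filter bank: centers equally spaced in mel,
    #    i.e. roughly linear below 1 kHz and logarithmic above
    pts = mel_inv(np.linspace(0.0, mel(fs / 2), n_filt + 2))
    bins = np.floor((frame_len + 1) * pts / fs).astype(int)
    fb = np.zeros((n_filt, frame_len // 2 + 1))
    for j in range(1, n_filt + 1):
        fb[j - 1, bins[j - 1]:bins[j]] = np.linspace(
            0, 1, bins[j] - bins[j - 1], endpoint=False)
        fb[j - 1, bins[j]:bins[j + 1]] = np.linspace(
            1, 0, bins[j + 1] - bins[j], endpoint=False)
    # 4. Log mel spectrum, then DCT back to the "time" (quefrency) axis
    logmel = np.log(spec @ fb.T + 1e-10)
    return dct(logmel, type=2, axis=1, norm='ortho')[:, :n_ceps]
```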
Wavelets:
Another feature extraction technique being explored is wavelet
decomposition [17-19, 55-58]. Speech signals have a wide variety
of characteristics in both the time and frequency domains. To
analyze non-stationary signals like speech, both time and frequency
resolution are important. Therefore, while extracting features, it is
useful to analyze the signal from a multi-resolution perspective.
Wavelets provide both
time as well as frequency resolution. The wavelet analysis
procedure is to adopt a wavelet prototype function, called an
analyzing wavelet or mother wavelet. Temporal analysis is
performed with a contracted, high frequency version of the
prototype wavelet, while frequency analysis is performed with a
dilated, low-frequency version of the same wavelet. In [57],
Speaker Identification using different levels of decomposition of the
speech signal with the discrete wavelet transform (DWT) and
Daubechies mother wavelets has been demonstrated. Fig. 2.2
shows how the speech signal is decomposed into approximate
(a1,…, a7) and detail coefficients (d1,…, d7) by using low-pass and
high-pass filters at each stage. The speech signal was decomposed
up to seven levels using the DWT with different Daubechies mother
wavelets, and the mean of the approximate and detail coefficients
at every level was taken as the feature vector.
Fig. 2.2: Seven-level wavelet decomposition of the speech signal
into approximate and detail coefficients
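The following is a minimal sketch of this feature extraction,
assuming the PyWavelets (pywt) library; 'db4' stands in for the
various Daubechies mother wavelets tried in [57].

```python
# Minimal sketch of the DWT features of [57], assuming PyWavelets (pywt).
import numpy as np
import pywt

def dwt_features(signal, wavelet='db4', level=7):
    # wavedec returns [a7, d7, d6, ..., d1]: the approximation band and
    # the seven detail bands of Fig. 2.2
    bands = pywt.wavedec(signal, wavelet, level=level)
    # Mean of each band's coefficients -> an 8-dimensional feature vector
    return np.array([band.mean() for band in bands])
```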
K. Daqrouq et al. [58] have used the DWT for text-dependent
Speaker Identification, which works in two steps: first gender
discrimination and then feature extraction for identification. The
wavelet packet transform is an extension of the wavelet transform
that provides a more precise way to analyze signals. Unlike the
wavelet transform, the wavelet packet transform not only
decomposes the signal in the low-frequency region, but also
decomposes it in the high-frequency region according to the
signal's entropy, so dynamic features are preserved well. In [59],
the features extracted through the wavelet packet transform are
the input to an artificial neural network, and the classifier then
determines the recognition result.
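A minimal sketch of wavelet packet feature extraction in the spirit
of [59] follows, assuming PyWavelets; a fixed-depth full tree with
log sub-band energies is shown, while [59]'s entropy-based basis
selection and ANN classifier are omitted for brevity.

```python
# Minimal sketch of wavelet packet features, assuming PyWavelets (pywt).
# Fixed-depth full tree; entropy-based basis selection as in [59] omitted.
import numpy as np
import pywt

def wpt_features(signal, wavelet='db4', depth=3):
    wp = pywt.WaveletPacket(data=signal, wavelet=wavelet, maxlevel=depth)
    # Unlike the DWT, all 2**depth sub-bands at this depth are kept,
    # covering the high-frequency region as well
    nodes = wp.get_level(depth, order='freq')
    # Log energy of each sub-band as one feature
    return np.array([np.log(np.sum(n.data ** 2) + 1e-12) for n in nodes])
```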
High-level features:
High-level features such as word idiolect, pronunciation, phone
usage, prosody, etc. have been successfully used in text-
independent speaker verification. Typically, high-level-feature
recognition systems produce a sequence of symbols from the
acoustic signal and then perform recognition using the frequency
and co-occurrence of symbols. In Doddington’s idiolect work [60],
word unigrams and bigrams from manually transcribed
conversations were used to characterize a particular speaker in a
traditional target/background likelihood ratio framework.
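To make the symbol-frequency idea concrete, the following is a
minimal sketch of bigram statistics over word transcripts; the
overlap score is a simplified stand-in for the target/background
likelihood ratio framework of [60].

```python
# Minimal sketch of idiolect-style bigram features over word transcripts.
# The overlap score is a simplified stand-in for the likelihood-ratio
# framework of [60].
from collections import Counter

def bigram_counts(words):
    # Co-occurrence statistics: counts of adjacent word pairs
    return Counter(zip(words, words[1:]))

def bigram_overlap(target_counts, test_words):
    # Fraction of the test bigrams that the target speaker also produced
    test = bigram_counts(test_words)
    shared = sum(min(c, target_counts[b]) for b, c in test.items())
    return shared / max(1, sum(test.values()))
```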
Feature Extraction in Transform Domain:
The work reported on Speaker Identification shows that the
Discrete Fourier Transform (DFT) is the most explored
transformation technique. The DFT has been used to compute LPC
parameters [12] and MFCC features [13 - 16, 52 - 55]. In [61, 62],
a novel technique of
utilizing the magnitude and phase of the speech signal has been
proposed. The complex DFT plane, plotted by taking the real part of
the DFT as the X-axis and the imaginary part as the Y-axis, has
been sectorized as shown in Fig. 2.3. The mean and density of the
sample points in each sector are used as features for Speaker
Identification. The concept of Sectorization is further extended to
Walsh Hadamard Transform (WHT) in [63], by plotting cal
coefficients versus sal coefficients.
Fig. 2.3: (a) Speech signal (amplitude versus number of samples);
(b), (c) its circular sectors in the complex DFT plane
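A minimal sketch of this sectorization follows, assuming NumPy; the
sector count is an illustrative parameter, since [61, 62] evaluate
several sector geometries.

```python
# Minimal sketch of complex-plane sectorization of the DFT [61, 62],
# assuming NumPy; the sector count is an illustrative parameter.
import numpy as np

def sector_features(signal, n_sectors=8):
    X = np.fft.fft(signal)
    angles = np.angle(X) % (2 * np.pi)  # angular position of each sample
    sector = (angles * n_sectors / (2 * np.pi)).astype(int) % n_sectors
    mags = np.abs(X)
    feats = []
    for s in range(n_sectors):
        pts = mags[sector == s]
        feats.append(pts.mean() if pts.size else 0.0)  # mean of the sector
        feats.append(pts.size / len(X))                # density of the sector
    return np.array(feats)
```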
The concept of using the amplitude distribution in the transform
domain for feature extraction has also been explored. The DFT and
Discrete Cosine Transform (DCT) have been utilized for obtaining
MFCC coefficients [14-15, 53, 54]. Fig. 2.4 shows the sum of the
magnitudes of the DFT samples when the transform domain is
divided into 32 groups. By making use of the symmetry of the DFT,
only the first 16 sums are considered as the feature vector here. It has been
shown that the amplitude distribution of DFT, DCT, Discrete Sine
Transform (DST), WHT, Discrete Hartley Transform (DHT), Kekre
Transform (KT) and Haar Transform can be used for feature
extraction [66 - 68] for Speaker Identification.
Fig. 2.4: Feature vectors for the FFT, dividing the samples into 32 groups
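A minimal sketch of these amplitude-distribution features follows,
assuming NumPy; the group count of 32 follows Fig. 2.4.

```python
# Minimal sketch of the 32-group amplitude-distribution features of
# Fig. 2.4, assuming NumPy.
import numpy as np

def dft_group_features(signal, n_groups=32):
    mags = np.abs(np.fft.fft(signal))
    sums = np.array([g.sum() for g in np.array_split(mags, n_groups)])
    # By the conjugate symmetry of the DFT of a real signal, the second
    # half mirrors the first, so only the first 16 sums are kept
    return sums[:n_groups // 2]
```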
Another approach, utilizing the row mean of the column transform
of the speech signal, has been used for feature extraction and has
given very good results for Speaker Identification [69 - 71].
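A minimal sketch of the row-mean idea follows, assuming
NumPy/SciPy; the DCT stands in for whichever orthogonal transform
is applied column-wise in [69 - 71], and the matrix width is an
illustrative choice.

```python
# Minimal sketch of row-mean-of-column-transform features; the DCT
# stands in for the orthogonal transforms used in [69 - 71], and the
# matrix width is illustrative. signal: 1-D NumPy array.
import numpy as np
from scipy.fftpack import dct

def row_mean_features(signal, width=256):
    # Arrange the signal into the columns of a 2-D matrix
    n_cols = len(signal) // width
    M = signal[:n_cols * width].reshape(n_cols, width).T  # (width, n_cols)
    # Apply the 1-D transform to every column, then average each row
    T = dct(M, type=2, axis=0, norm='ortho')
    return T.mean(axis=1)  # width-dimensional feature vector
```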
Vector Quantization:
Vector Quantization (VQ) has been extensively used for text-
independent Speaker Identification [20 - 22, 24, 52]. VQ has been
used for text-dependent recognition in [121], where each speaker
is represented by a sequence of vector quantization codebooks;
input utterances of known text are classified using these codebook
sequences, and the resulting classification distortion is compared to
a rejection threshold. In [122], MFCC features are quantized to a
number of centroids using Vector Quantization. In [23], a text-
dependent speaker verification system based on VQ source coding
has been explored. Vector Quantization has been utilized for feature
extraction in the spatial domain by using clustering algorithms such
as Linde-Buzo-Gray (LBG), Kekre's Fast Codebook Generation
(KFCG) and Kekre's Median Codebook Generation (KMCG) [72 -
74]. These techniques have been further extended to the transform
domain by using the DFT, DCT and DST [75].
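A minimal sketch of LBG codebook generation and minimum-
distortion matching follows, assuming NumPy; the splitting
perturbation and fixed iteration count are simplifications, and the
codebook size is taken to be a power of two.

```python
# Minimal sketch of LBG codebook training and distortion-based matching,
# assuming NumPy. The split perturbation and fixed iteration count are
# simplifications; size should be a power of two. features: (N, D) array.
import numpy as np

def lbg_codebook(features, size=16, eps=0.01, n_iter=10):
    codebook = features.mean(axis=0, keepdims=True)
    while len(codebook) < size:
        # Split each codevector in two, then refine with k-means updates
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        for _ in range(n_iter):
            d = np.linalg.norm(features[:, None] - codebook[None], axis=2)
            nearest = d.argmin(axis=1)
            for k in range(len(codebook)):
                if np.any(nearest == k):
                    codebook[k] = features[nearest == k].mean(axis=0)
    return codebook

def avg_distortion(codebook, features):
    # Mean distance of each test vector to its nearest codevector; the
    # speaker whose codebook gives the smallest distortion is selected
    d = np.linalg.norm(features[:, None] - codebook[None], axis=2)
    return d.min(axis=1).mean()
```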
2.3.2 Feature Matching
Artificial Neural Network:
As seen in sections 2.1 and 2.2, feature matching techniques have
shifted from template matching to statistical modeling (e.g., HMM),
and from distance-based to likelihood-based methods. The non-
parametric approach of VQ is still being used. A recent trend is the
use of Artificial Neural Networks (ANN). Being widely used in
pattern recognition tasks, neural networks have also been applied
to speaker recognition [77, 78].
Dynamic Time Warping (DTW):
The most popular method to compensate for speaking-rate
variability in template-based systems is known as DTW [5]. This
method accounts for the variation over time (trajectories) of
parameters corresponding to the dynamic configuration of the
articulators and vocal tract. In [79], M. Pandit proposes a technique
for optimizing the feature sets in a dynamic time warping (DTW)
based text-dependent speaker verification system. An investigation
of the Gaussian mixture model (GMM), comparing it with
preliminary experiments on a multilayer perceptron (MLP) trained
with the backpropagation learning algorithm (BKP) and dynamic
time warping (DTW) techniques on a Thai text-dependent speaker
identification system, is given in [80].
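A minimal sketch of the DTW alignment cost between two feature
sequences follows, assuming NumPy; a plain recursion with a
Euclidean local distance and no slope constraints is shown.

```python
# Minimal sketch of DTW between two feature-vector sequences, assuming
# NumPy; Euclidean local distance, no slope constraints.
import numpy as np

def dtw_cost(A, B):
    n, m = len(A), len(B)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = np.linalg.norm(A[i - 1] - B[j - 1])
            # Best predecessor: insertion, deletion, or match step
            D[i, j] = local + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]  # smaller cost = more similar utterances
```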
2.3.3 Similarity Measures
There are various distance measures that can be used as similarity
measures in the decision logic stage of Speaker Identification. The
distance measures used in the literature include:
• Manhattan Distance
• Euclidean Distance
• Mahalanobis distance
• Bhattacharyya distance
• Earth mover’s distance
Of these, the Euclidean distance, which is the straight-line distance
between two points in an n-dimensional space, is the most popular
similarity measure [62 - 75]. Evgeny Karpov et al. [130] have
compared the performance of the Euclidean distance and the
Manhattan distance for Speaker Identification. In [127], Shingo
Kuroiwa et al. used the Earth Mover's distance for the CCC Speaker
Recognition Evaluation 2006. The Bhattacharyya distance has been
used as a similarity measure in [128, 129].
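For reference, the two most commonly used measures reduce to
one-line computations, sketched below with NumPy.

```python
# The two most commonly used similarity measures, assuming NumPy vectors.
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))  # straight-line distance

def manhattan(x, y):
    return np.sum(np.abs(x - y))          # city-block distance
```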
2.4 Summary of the progress in Speaker Identification
The technological progress in Speaker Identification can be
summarized as follows:
• Features of speech
Filter bank/spectral resonance – LPC – MFCC – magnitude
and row mean in transform domain – VQ codebook
• Feature Matching
Template matching – corpus-based statistical modeling
(e.g. HMM and n-grams) – DTW – Artificial Neural
Networks
• Type of speech signal
Clean speech – noisy speech – telephone speech
• System
Hardware recognizer – Software recognizer
Although these advances have taken place, there are still many
practical limitations that hinder the widespread commercial
deployment of applications and services. Overcoming them will
require a more thorough understanding of the complex speech
signal and its parameters.