358 Int. J. Biometrics, Vol. 2, No. 4, 2010

On the use of perceptual Line Spectral pairs

Frequencies and higher-order residual moments

for Speaker Identification

Md. Sahidullah*, Sandipan Chakroborty

and Goutam Saha

Department of Electronics and Electrical Communication Engineering,
Indian Institute of Technology Kharagpur,
Kharagpur 721 302, India
E-mail: [email protected]
E-mail: [email protected]
E-mail: [email protected]
*Corresponding author

Abstract: Conventional Speaker Identification (SI) systems utilise spectral features like Mel-Frequency Cepstral Coefficients (MFCC) or Perceptual Linear Prediction (PLP) as a frontend module. Line Spectral pairs Frequencies (LSF) are a popular alternative representation of Linear Prediction Coefficients (LPC). In this paper, an investigation is carried out to extract LSF from perceptually modified speech. A new feature set extracted from the residual signal is also proposed. An SI system based on this residual feature, which contains information complementary to the spectral characteristics, shows improved performance when fused with the conventional spectral feature based system as well as with the proposed perceptually modified LSF.

Keywords: SI; speaker identification; LSF; line spectral pairs frequencies; perceptual linear prediction; residual signal; higher order statistics.

Reference to this paper should be made as follows: Sahidullah, M., Chakroborty, S. and Saha, G. (2010) ‘On the use of perceptual Line Spectral pairs Frequencies and higher-order residual moments for Speaker Identification’, Int. J. Biometrics, Vol. 2, No. 4, pp.358–378.

Biographical notes: Md. Sahidullah graduated in 2004 from Vidyasagar University in Electronics and Communication Engineering and obtained his Masters in Computer Science and Engineering in 2006 from West Bengal University of Technology, Kolkata, with specialisation in Embedded Systems. He is currently an Institute Research Scholar in the Department of Electronics and Electrical Communication Engineering at Indian Institute of Technology, Kharagpur 721 302, India. His research interests are speaker recognition, pattern classification and speech processing.

Sandipan Chakroborty received his Bachelor of Engineering in Electronics from Nagpur University, India in 2001 and his Masters of Engineering with specialisation in Digital System and Instrumentation

Copyright © 2010 Inderscience Enterprises Ltd.


with highest honours from Bengal Engineering and Science University, Shibpur, Howrah, India in 2003. He was a Research Scholar in the Department of Electronics and Electrical Communication Engineering, Indian Institute of Technology, Kharagpur, India. His current areas of research include pattern recognition, neural networks, speech processing, speaker recognition and data fusion analysis. Presently, he is with Samsung India Software Operations Ltd, Bangalore, India.

Goutam Saha received his BTech and PhD Degrees from Indian Institute of Technology (IIT), Kharagpur and a short Management Training from XLRI, Jamshedpur, India. He served Tata Steel and the Institute of Engineering and Management before joining the Faculty of Electronics and Communication Engineering at IIT Kharagpur in 2002, where he is currently serving as Associate Professor. He briefly served the Department of Biomedical Engineering, University of Southern California, USA and Trento University, Italy. He is a winner of the DST-Lockheed Martin India Innovation Growth Program 2009. His research interests include characterisation of biomedical signals, neurosignal processing and audio surveillance. He is also coauthor of two popular engineering textbooks titled Digital Principles and Applications and Principles of Communication Systems, published by Tata McGraw Hill.

1 Introduction

Speaker Identification (SI) (Campbell, 1997; Reynolds, 2002; Ramachandran et al., 2002; Kinnunen and Li, 2010) is the task of determining the identity of a subject from his or her voice. A robust acoustic feature extraction technique (Faundez-Zanuy and Monte-Moreno, 2005) followed by an efficient modelling scheme (Matsui and Tanabe, 2006) are the key requirements of an SI system. Feature extraction transforms (Kinnunen, 2004) the crude speech signal into a compact but effective representation that is more stable and discriminative than the original signal. The central idea behind feature extraction techniques for speaker recognition is to obtain an approximation of the short-term spectral characteristics of speech in order to characterise the vocal tract. Most existing SI systems use Mel Frequency Cepstral Coefficients (MFCC) or Perceptual Linear Predictive Cepstral Coefficients (PLPCC) for parameterising speech (Kinnunen and Li, 2010). MFCC is based on filterbank analysis: the speech signal is passed through triangular bandpass filters that are equally spaced on the mel scale, and the de-correlated log energies of the filters are treated as the MFCC. PLPCC, on the other hand, is derived from the Linear Prediction Coefficients (LPC) of a perceptually modified speech signal (Hermansky, 1990). Perceptual modification of the speech signal shows significant improvement in speech recognition, and better performance was attained at lower model orders of linear prediction analysis. Though PLPCC is mainly used for speech recognition, it has been successfully applied to speaker recognition (Reynolds, 1994). Various derivatives of LPC, such as Log Area Ratios (LAR), Line Spectral pairs Frequencies (LSF) and inverse sine coefficients (ARCSIN), are also used as frontends for the speaker recognition task (Campbell, 1997). Effective representation of


speaker specific information is still a challenging task in developing an SI system. For filterbank based approaches, some experiments are reported in Chakroborty (2008). Recently, an approach has been taken to use the perceptual version of the Log Area Ratio (LAR) coefficients as speech parametrisation for a robust SI task (Abdulla, 2007). The Perceptual Log Area Ratio (PLAR) feature outperforms the conventional MFCC and PLP based speaker recognition systems. Experimentally it has been found that various kinds of psychoacoustic processing have a significant effect on the performance of SI systems. LSFs are popular for representing linear prediction coefficients in LPC based coders because of filter stability and representational efficiency (Itakura, 1975). They also have other robust properties, such as an ordering related to the spectral properties of the underlying data. LSF parameters are uncorrelated, unlike other representations of LPC (Erkelens and Broersen, 1995). The vocal tract resonance frequencies fall between a pair of LSF frequencies (Bäckström and Magi, 2006; McLoughlin, 2008), and the bandwidth of a formant is also directly related to the adjacent LSFs. These properties make LSFs popular for the analysis, classification and transmission of speech signals. A detailed study of the LSF feature as a parametrisation for the speech recognition task is available in Paliwal (1992). LSF parameters are integrated in speech compression schemes like G.729, which are very popular especially in VoIP communication. LSFs have also been successfully introduced in the speaker recognition task (Liu et al., 1990; Lee et al., 2004). All these features try to represent the vocal tract through short-term spectral estimation. Vocal cord information, which can be obtained from the residual of LP analysis, also carries speaker related traits. Pitch and other glottal information can be extracted from this residual signal. In Murty and Yegnanarayana (2006), Zheng et al. (2007) and Prasanna et al. (2006), attempts are made at parameterising residual information and combining this complementary information to improve speaker recognition performance.

Our contribution in this work is twofold. The first is to study the effectiveness of LSF after pairing psychoacoustic operations with the original speech signal; the LSF coefficients so extracted are termed Perceptual LSF (PLSF). The second introduces a novel complementary feature based on the residual signal, obtained by applying higher order statistical moments, which we term HOSMR. While PLSF is spectral based and captures vocal tract information, HOSMR is residual based and captures vocal cord information. From both feature sets separate models are developed and, finally, the log-likelihood scores of the two systems are linearly fused to exploit their complementarity in improving the overall performance.
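The score-level fusion just described can be sketched in a few lines of numpy; the fusion weight and the example scores below are hypothetical illustrations, not values from the paper:

```python
import numpy as np

def fuse_scores(spectral_scores, residual_scores, alpha=0.5):
    """Linearly fuse per-speaker log-likelihood scores from a
    spectral (PLSF) system and a residual (HOSMR) system.
    alpha is a hypothetical fusion weight, not the paper's value."""
    spectral = np.asarray(spectral_scores, dtype=float)
    residual = np.asarray(residual_scores, dtype=float)
    return alpha * spectral + (1.0 - alpha) * residual

# the identified speaker is the arg max of the fused per-speaker scores
fused = fuse_scores([-120.3, -118.7, -125.1], [-88.2, -85.9, -90.4])
best = int(np.argmax(fused))
```

The decision is then taken on the fused scores rather than on either stream alone.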

SI experiments are performed with the two newly proposed features using GMM (Reynolds and Rose, 1995; Reynolds, 1992) as a classifier. Both newly proposed features were evaluated to observe their individual speaker discriminating ability. Finally, experiments were conducted in dual stream mode by fusing the contributions of both systems. A comparison is also shown with experiments on other existing feature based SI systems. Two popular speaker recognition corpora, YOHO and POLYCOST, are used for conducting the experiments. The SI system using the PLSF feature outperforms conventional feature extraction techniques among the spectral feature based systems. It also performs best in fused mode, when the contribution from residual information is included.


The rest of the paper is organised as follows. The theoretical concepts behind linear prediction analysis, line spectral pairs frequencies and perceptual analysis are presented in Section 2. The proposed schemes are elaborated in Section 3. The experimental setup and results are discussed in Section 4. Finally, the paper is concluded in Section 5.

2 Theoretical background

2.1 Linear prediction analysis and residual signal

In the LP model, the (n − 1)th to (n − p)th samples of the speech wave are used to predict the nth sample. The predicted value of the nth speech sample (Atal, 1974) is given by

\[ s(n) = \sum_{k=1}^{p} a(k)\, s(n-k) \tag{1} \]

where \(\{a(k)\}_{k=1}^{p}\) are the predictor coefficients and s(n) is the nth speech sample.

The value of p is chosen such that it can effectively capture the real and complex poles of the vocal tract in a frequency range equal to half the sampling frequency. The Prediction Coefficients (PC) are determined by minimising the mean square prediction error (Campbell, 1997), where the error is defined as

\[ e(n) = s(n) - \sum_{k=1}^{p} a(k)\, s(n-k). \tag{2} \]

The LP transfer function can be defined as,

\[ H(z) = \frac{G}{1 - \sum_{k=1}^{p} a(k)\, z^{-k}} = \frac{G}{A(z)} \tag{3} \]

where G is the gain scaling factor for the present input and A(z) is the pth order inverse filter. The LP coefficients themselves can be used for speaker recognition, as they contain speaker specific information such as the vocal tract resonance frequencies and their bandwidths.

The prediction error e(n) is called the residual signal, and it contains complementary information that is not captured by the PC. It is worth mentioning here that the residual signal conveys vocal source cues such as the fundamental frequency and pitch period. However, it is difficult to extract meaningful features directly from the residual signal, although there have been some attempts (Zheng et al., 2007; Murty and Yegnanarayana, 2006; Markov and Nakagawa, 1999) to do so.
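As a concrete illustration of equations (1)–(3), the sketch below computes the predictor coefficients by the autocorrelation method and derives the residual. It is an illustrative implementation under stated assumptions (numpy only, direct solve of the normal equations), not the paper's exact code:

```python
import numpy as np

def lp_residual(frame, p=10):
    """LP coefficients a(k) of equation (1), estimated by the
    autocorrelation method, and the residual e(n) of equation (2)."""
    n = len(frame)
    # autocorrelation sequence r(0)..r(p)
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(p + 1)])
    # normal equations: Toeplitz system R a = [r(1)..r(p)]
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    a = np.linalg.solve(R, r[1:p + 1])
    # residual: e(n) = s(n) - sum_k a(k) s(n - k)
    pred = np.zeros(n)
    for k in range(1, p + 1):
        pred[k:] += a[k - 1] * frame[:-k]
    return a, frame - pred
```

For a frame generated by a stable all-pole model, the residual energy is much smaller than the frame energy, which is what makes e(n) a useful complementary signal.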

2.2 Line Spectral pairs Frequencies (LSF)

The LSFs are a representation of the predictor coefficients of the inverse filter A(z). First, A(z) is decomposed into a pair of auxiliary (p + 1)th order polynomials as follows:


\[ A(z) = \tfrac{1}{2}\,(P(z) + Q(z)) \]
\[ P(z) = A(z) - z^{-(p+1)} A(z^{-1}) \tag{4} \]
\[ Q(z) = A(z) + z^{-(p+1)} A(z^{-1}). \]

The LSFs are the frequencies of the zeros of P(z) and Q(z). They are determined by computing the complex roots of the polynomials and consequently their angles, which can be done in different ways, such as the complex root method, the real root method and the ratio filter method (Kondoz, 2004). The roots of P(z) and Q(z) occur in symmetrical pairs, hence the name Line Spectrum pairs Frequencies. P(z) corresponds to the vocal tract with the glottis closed and Q(z) with the glottis open (Bäckström and Magi, 2006). However, speech production in general corresponds to neither of these extreme cases but to something in between, where the glottis is neither fully open nor fully closed. For analysis purposes, therefore, a linear combination of these two extreme cases is considered.

On the other hand, the inverse filter A(z) is a minimum phase filter, as all of its poles lie inside the unit circle in the z-plane. Any minimum phase polynomial can be mapped by this transform to represent each of its roots by a pair of frequencies with unit amplitude. Another benefit of LSFs is that the Power Spectral Density (PSD) at a particular frequency tends to depend only on the nearby LSFs, and vice versa. In other words, an LSF of a certain frequency value affects mainly the PSD at that same frequency value. This is known as the localisation property (Chu, 2003), whereby modifications to the PSD have only a local effect on the LSFs. This is an advantage over other representations like LPCC, Reflection Coefficients (RC) and LAR, where changes in a particular parameter affect the whole spectrum. The LSF parameters are themselves frequency values directly linked to the signal's frequency description. The power spectrum can also be computed directly from the LSF values (McLoughlin, 2008).

In Soong and Juang (1984), it is stated that LSF coefficients are sufficiently sensitive to speaker characteristics. Though the popularity of LSFs lies in low bit rate speech coding (Lepschy et al., 1988; Chu, 2003; Bishnu et al., 2003), they have also been successfully employed in speaker recognition (Campbell, 1997; Liu et al., 1990; Lee et al., 2004; Yuan et al., 1999). LSF coefficients are furthermore appropriate for pattern classification, contrary to other representations of LPC (Tourneret, 1998).
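A minimal sketch of the complex root method for equation (4), using numpy's polynomial root finder (an illustration under stated assumptions, not a production-quality routine):

```python
import numpy as np

def lsf(a):
    """LSFs from predictor coefficients a = [a(1), ..., a(p)] of
    A(z) = 1 - sum_k a(k) z^{-k}, via the roots of P(z) and Q(z)."""
    A = np.concatenate(([1.0], -np.asarray(a, dtype=float), [0.0]))
    P = A - A[::-1]   # P(z) = A(z) - z^{-(p+1)} A(1/z)
    Q = A + A[::-1]   # Q(z) = A(z) + z^{-(p+1)} A(1/z)
    angles = []
    for poly in (P, Q):
        for root in np.roots(poly):
            w = np.angle(root)
            if 1e-8 < w < np.pi - 1e-8:   # drop trivial roots at z = +/-1
                angles.append(w)
    return np.sort(angles)
```

For a stable second-order predictor the routine returns two frequencies in (0, π), and the pole angle of A(z) falls between them, consistent with the interlacing property noted above.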

2.3 Perceptual Linear Prediction (PLP) analysis

The PLP technique converts the speech signal in a perceptually meaningful way through psychoacoustic processing (Hermansky, 1990). It improves the performance of speech recognition over the conventional LP analysis technique. The various stages of this method are based on our perceptual auditory characteristics. The significant blocks of PLP analysis are as follows:

2.3.1 Critical Band Integration (CBI)

In this step the power spectrum is warped along its frequency axis onto the Bark frequency scale. In brief, the speech signal is passed through trapezoidal filters equally spaced on the Bark scale.


2.3.2 Equal Loudness Pre-emphasis (ELP)

Different frequency components of the speech spectrum are weighted by a simulated equal-loudness curve.

2.3.3 Intensity-loudness Power Law (IPL)

Cube-root compression of the modified speech spectrum is carried out according to the power law of hearing (Stevens, 1957).

In addition, RASTA processing (Hermansky and Morgan, 1994) is done with PLP analysis as an initial spectral operation to enhance the speech signal against diverse communication channel and environmental variability. The integrated method is often referred to as RASTA-PLP.

3 Proposed framework

3.1 Perceptual Line Spectral pairs Frequency

One of the contributions of the present work is to combine the strength of PLP with LSF for automatic SI. Towards this, an alteration of the standard PLP scheme is addressed and a strategy is formulated to use the modified PLP coefficients for the generation of LSFs. A drawback of the PLP analysis technique is that the nonlinear frequency warping, or Critical Band Integration (CBI), stage introduces undesired spectral smoothing. This work analyses the scatter plot of training data with and without the CBI step. A pair of male and a pair of female speakers have been arbitrarily chosen from the POLYCOST database. The extracted training features are plotted in Figure 1 by reducing the 19-D vectors to a 2-D space using principal component analysis. It shows that the speakers' data are more separable if the critical band integration step is omitted.

In contrast to the work of Abdulla (2007), we include pre-emphasis in this part of the scheme together with LSF. Perceptual weighting of the different frequency components enhances the speech signal in accordance with human listening characteristics. The pre-emphasis stage, however, emphasises the high frequency components of speech to overcome the roll-off of −6 dB/octave due to human speaking characteristics. Hermansky (1990) also included this step in his work, and we have experimentally found that it improves recognition performance. The overall schematic diagram of the proposed Perceptual Line Spectral Pairs feature extraction technique, which is based on modified perceptual linear prediction analysis, is shown in Figure 2.
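The spectrum shaping before LP analysis in this scheme can be sketched as below. The equal-loudness curve is the analytic approximation given in Hermansky (1990); the function name, frame length and sampling rate are illustrative assumptions:

```python
import numpy as np

def plsf_spectrum(frame, fs=8000):
    """Pre-emphasis, power spectrum, Equal Loudness Pre-emphasis and
    cube-root compression (IPL) -- with the Critical Band Integration
    step deliberately omitted, as proposed for PLSF."""
    emphasised = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])
    power = np.abs(np.fft.rfft(emphasised * np.hamming(len(frame)))) ** 2
    # omega^2 in (rad/s)^2 for each DFT bin
    w2 = (2 * np.pi * np.fft.rfftfreq(len(frame), d=1.0 / fs)) ** 2
    # equal-loudness approximation from Hermansky (1990)
    elp = ((w2 + 56.8e6) * w2 ** 2) / ((w2 + 6.3e6) ** 2 * (w2 + 0.38e9))
    return (power * elp) ** (1.0 / 3.0)
```

LP analysis is then performed on this shaped spectrum (e.g., via the inverse DFT of the shaped power spectrum to obtain autocorrelations), and the LSFs of the resulting predictor give the PLSF feature.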

The proposed perceptual operation represents the lower frequency region more precisely than the higher frequency zone. In Figure 3, comparative plots of the speech spectrum, LP-spectrum and LSFs of a speech frame and its perceptual version are shown.

The spectral peaks which are sharply approximated by conventional LP are smoothed by the modified PLP. Note that the spectral tilt carries speaker related information (Yoma and Pegoraro, 2002). The proposed perceptual modification retains speaker dependent spectral information that is removed by conventional PLP [Section II-D in Hermansky (1990)]. LSFs reveal vocal tract spectral information including mouth shape, tongue position and the contribution of


Figure 1 Scatter plots of the first two principal components of two speakers' training data, shown for two cases: (i) with the critical band analysis step and (ii) without it. The plot uses two different colours. Part (a) shows the scatter plot for two male speakers and Part (b) for two female speakers. The data are taken from the POLYCOST database (see online version for colours)

the nasal cavity. Its perceptually motivated version represents those characteristics more effectively and hence is expected to improve speaker recognition performance. This is verified by rigorous experimentation, as presented in Section 4.

3.2 Statistical moments of residual signal

The residual signal, introduced in Section 2.1, generally has impulse-like (for voiced frames) or noise-like (for unvoiced frames) behaviour and a flat spectral response. Though it contains vocal source information, it is very difficult to characterise perfectly. In the literature, Wavelet Octave Coefficients Of Residues (WOCOR)

Figure 2 Block diagram showing the different stages of the Perceptual Line Spectral Pairs (PLSF) based feature extraction technique


Figure 3 Plots showing (a) the speech spectrum (light line), LP-spectrum (dark line) and LSFs (vertical lines), and (b) the speech spectrum (light line), PLP-spectrum (dark line) and PLSFs (vertical lines). The odd LSFs are denoted by continuous lines and the even LSFs by dotted lines. The speech signal has been taken arbitrarily from a male speaker of the YOHO database

(Zheng et al., 2007), Auto-Associative Neural Networks (AANN) (Prasanna et al., 2006) and the residual phase (Murty and Yegnanarayana, 2006) have been used to extract the residual information. It is worth mentioning here that higher-order statistics were found significant in a number of signal processing applications (Nandi, 1994) where the nature of the signal is non-Gaussian. Higher order statistics have also drawn the attention of researchers for retrieving information from LP residual signals (Nemer et al., 2001) in voice activity detection. Recently, higher order cumulants of the LP residual signal have been investigated (Chetouani et al., 2009) for improving the performance of SI systems.

Higher order statistical moments of a signal parameterise the shape of its distribution (Lo and Don, 1989). Let the distribution of a random signal x be denoted by P(x); the central moment of order k of x is then

\[ M_k = \int_{-\infty}^{\infty} (x - \mu)^k \, dP \tag{5} \]

for k = 1, 2, 3, \ldots, where μ is the mean of x.


On the other hand, the characteristic function of the probability distribution of the random variable is given by

\[ \varphi_X(t) = \int_{-\infty}^{\infty} e^{jtx} \, dP = \sum_{k=0}^{\infty} M_k \frac{(jt)^k}{k!}. \tag{6} \]

From equation (6) it is clear that the moments \(M_k\) are the coefficients of the expansion of the characteristic function. Hence, they can be treated as one set of expressive constants of a distribution. Moments and their various modifications are successfully used in image analysis (Liao and Pawlak, 1996; Teh and Chin, 1988). Moments can also effectively capture the randomness of the residual signal in autoregressive modelling (Mattson and Pandit, 2006).

In this work, we use higher order statistical moments of the residual signal to parameterise the vocal source information. A further motivation for using moments of the residual signal is that the speech production system is a non-Gaussian process, for which higher order statistics exist and meaningful vocal source features can be extracted. The feature derived by the proposed technique is termed Higher Order Statistical Moments of Residual (HOSMR). The different blocks of the proposed residual feature extraction technique are shown in Figure 4. Note that these features are complementary to the PLSF features proposed in Section 3.1, as they are derived from the residual signal.

Figure 4 Block diagram of residual moment based feature extraction technique

The steps for the computation of HOSMR are as follows:

1 Inverse filtering of the speech signal by the LP analysis filter generates the residual.

2 The residual signal is first normalised to the range [−1, +1].

3 Then the central moment of order k of a residual frame e is computed as

\[ m_k = \frac{1}{N} \sum_{n=0}^{N-1} (e(n) - \mu)^k \tag{7} \]

where μ is the mean of the residual signal over the frame.


The residual signal is first scaled at the frame level and each frame is normalised around its mean; therefore, the first order central moment (i.e., the mean) is zero. The higher order moments (for k = 2, …, K) are taken as vocal source features, as they represent the shape of the distribution of the random signal. The lower order moments are a coarse parametrisation, whereas the higher orders are a finer representation of the residual signal. In Figure 5, the LP residual signals of a voiced and an unvoiced frame are shown, along with their higher order moments. Our experiments show that six moments give good results and that still higher moments are not necessary.
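The HOSMR computation is compact; the following sketch implements steps 2–3 and equation (7) for K = 6 (the function name is illustrative):

```python
import numpy as np

def hosmr(residual, K=6):
    """Central moments of orders 2..K of a residual frame,
    after normalisation to [-1, +1] (the first moment is zero)."""
    e = np.asarray(residual, dtype=float)
    e = e / np.max(np.abs(e))          # scale to [-1, +1]
    mu = e.mean()
    return np.array([np.mean((e - mu) ** k) for k in range(2, K + 1)])
```

With K = 6 this yields a five-dimensional vocal source feature per frame.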

Figure 5 Residual signals and higher order moments of two speech frames, one voiced (a) and one unvoiced (b). The residual signals are shown in (c) and (d) and the moments in (e) and (f) correspondingly. The number of higher order moments shown here is 10 (see online version for colours)

4 Speaker Identification experiment

4.1 Experimental setup

4.1.1 Pre-processing stage

In this work, the pre-processing stage is kept the same for the proposed methods as well as for those with which they are compared. It is performed in the following steps:


• silence removal and end-point detection are done using an energy threshold criterion

• the speech signal is then pre-emphasised with a pre-emphasis factor of 0.97

• the pre-emphasised speech signal is segmented into frames of 20 ms each with 50% overlap, i.e., the total number of samples in each frame is N = 160 (sampling frequency Fs = 8 kHz)

• in the last step of pre-processing, each frame is windowed using a Hamming window.
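The framing steps above can be sketched as follows; silence removal and end-point detection are omitted from this sketch, and the function name is an assumption:

```python
import numpy as np

def preprocess(signal, fs=8000, frame_ms=20, overlap=0.5, pre=0.97):
    """Pre-emphasise, then split the signal into 50%-overlapped
    Hamming-windowed frames (N = 160 samples at 8 kHz)."""
    x = np.append(signal[0], signal[1:] - pre * signal[:-1])
    n = int(fs * frame_ms / 1000)        # samples per frame
    hop = int(n * (1 - overlap))         # frame shift
    return np.array([x[i:i + n] * np.hamming(n)
                     for i in range(0, len(x) - n + 1, hop)])
```

An 800-sample input at 8 kHz therefore yields 9 frames of 160 samples each.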

4.1.2 Classification and identification stage

The idea of GMM is to use a weighted summation of multivariate Gaussian functions to represent the probability density of the feature vectors, given by

\[ p(\mathbf{x}) = \sum_{i=1}^{M} p_i\, b_i(\mathbf{x}) \tag{8} \]

where x is a d-dimensional feature vector, \(b_i(\mathbf{x})\), i = 1, …, M, are the component densities and \(p_i\), i = 1, …, M, are the mixture weights or priors of the individual Gaussians. Each component density is given by

\[ b_i(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}\, |\Sigma_i|^{1/2}} \exp\left\{ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)^t\, \Sigma_i^{-1}\, (\mathbf{x} - \boldsymbol{\mu}_i) \right\} \tag{9} \]

with mean vector \(\boldsymbol{\mu}_i\) and covariance matrix \(\Sigma_i\). A speaker model is denoted as

\[ \lambda = \{p_i, \boldsymbol{\mu}_i, \Sigma_i\}_{i=1}^{M}. \]

The parameters of λ are optimised using the Expectation Maximisation (EM) algorithm (Dempster et al., 1977). In these experiments, the GMMs are trained with 10 iterations, where the clusters are initialised by the vector quantisation algorithm (Linde et al., 1980).
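A model-training sketch using scikit-learn's GaussianMixture as a stand-in for the paper's own EM implementation; the paper initialises EM with vector quantisation, whereas k-means initialisation is used here, and the diagonal covariance type is an assumption:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_speaker_model(features, M=16, seed=0):
    """Fit an M-component GMM (equation (8)) to one speaker's
    feature vectors, with 10 EM iterations as in the experiments."""
    gmm = GaussianMixture(n_components=M, covariance_type='diag',
                          max_iter=10, random_state=seed)
    return gmm.fit(features)
```

One such model λ is trained per enrolled speaker from that speaker's training frames.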

In the closed-set SI task, an unknown utterance \(X = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T\}\) is identified as an utterance of the particular speaker whose model gives the maximum log-likelihood. It can be written as

\[ \hat{S} = \arg\max_{1 \le k \le S} \log p(X|\lambda_k) = \arg\max_{1 \le k \le S} \sum_{t=1}^{T} \log p(\mathbf{x}_t|\lambda_k) \tag{10} \]

where \(\hat{S}\) is the identified speaker from the speaker model set \(\Lambda = \{\lambda_1, \lambda_2, \ldots, \lambda_S\}\) and S is the total number of speakers.
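Equation (10) translates directly into code. Here `models` is assumed to be a list of fitted objects exposing a per-frame log-likelihood method (e.g., `score_samples` in scikit-learn); that interface is an assumption of this sketch:

```python
import numpy as np

def identify(X, models):
    """Closed-set identification: return the index of the model
    with the maximum total log-likelihood over the utterance X."""
    totals = [np.sum(m.score_samples(X)) for m in models]
    return int(np.argmax(totals))
```

Summing per-frame log-likelihoods assumes the frames are treated as independent, which is the standard GMM scoring convention.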

4.1.3 Databases for experiments

YOHO database: The YOHO voice identification corpus (Campbell, 1997; Higgins et al., 1989) was collected while testing ITT's prototype speaker verification


system in an office environment. Most subjects were from the New York City area, although there were many exceptions, including some non-native English speakers. A high-quality telephone handset (Shure XTH-383) was used to collect the speech; however, the speech was not passed through a telephone channel. There are 138 speakers (106 males and 32 females); for each speaker, there are 4 enrollment sessions of 24 utterances each and 10 test sessions of 4 utterances each. In this work, a closed set text-independent SI problem is attempted where we consider all 138 speakers as client speakers. For each speaker, all 96 (4 sessions × 24 utterances) utterances are used for developing the speaker model, while 40 (10 sessions × 4 utterances) utterances are put under test. Therefore, for 138 speakers we put 138 × 40 = 5520 utterances under test and evaluated the identification accuracies.

POLYCOST database: The POLYCOST database (Melin and Lindberg, 1996) was recorded as a common initiative within the COST 250 action during January–March 1996. It contains around 10 sessions recorded by 134 subjects from 14 countries. Each session consists of 14 items, two of which (the MOT01 and MOT02 files) contain speech in the subject's mother tongue. The database was collected through the European telephone network, and the recording was performed with ISDN cards on two XTL SUN platforms at an 8 kHz sampling rate. In this work, a closed set text-independent SI problem is addressed where only the mother tongue (MOT) files are used. The guideline specified in Melin and Lindberg (1996) for conducting closed set SI experiments is adhered to, i.e., ‘MOT02’ files from the first four sessions are used to build a speaker model while ‘MOT01’ files from session five onwards are taken for testing. As with the YOHO database, all speakers (131 after deletion of three speakers) in the database were registered as clients.

4.1.4 Score calculation

In closed-set SI problem, identification accuracy as defined in Reynolds and Rose(1995) and given by the equation (11) is followed.

Percentage of Identification Accuracy (PIA)

= (No. of utterances correctly identified / Total no. of utterances under test) × 100. (11)
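Equation (11) can be computed directly from the lists of true and predicted speaker labels; the labels below are hypothetical and purely for illustration.

```python
def identification_accuracy(true_ids, predicted_ids):
    """Percentage of Identification Accuracy (PIA) as in equation (11)."""
    assert len(true_ids) == len(predicted_ids)
    correct = sum(t == p for t, p in zip(true_ids, predicted_ids))
    return 100.0 * correct / len(true_ids)

# Hypothetical labels: 3 of 4 test utterances correctly identified.
print(identification_accuracy([1, 2, 3, 4], [1, 2, 9, 4]))   # -> 75.0
```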

4.2 Speaker Identification experiments and results

The work uses a GMM-based classifier with model orders that are powers of two, i.e., 2, 4, 8, 16, etc. The number of Gaussians is limited by the amount of available training data (average training speech length per speaker from all sessions after silence removal: 40 s for POLYCOST and 150 s for YOHO). The number of mixtures is incremented up to 16 for the POLYCOST database and 64 for the YOHO database.
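The GMM modelling and maximum-likelihood identification described above can be sketched as follows. This is not the original implementation: the speakers, feature arrays, and scikit-learn API choices are illustrative stand-ins for the 19-dimensional spectral features and per-speaker models used in the paper.

```python
# One GaussianMixture per speaker; a test utterance is scored by its
# average log-likelihood under each model and assigned to the argmax.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic stand-ins for 19-dimensional spectral feature vectors.
train = {"spk_a": rng.normal(0.0, 1.0, (500, 19)),
         "spk_b": rng.normal(3.0, 1.0, (500, 19))}

# Model order is a power of two (2, 4, 8, ...), limited by training data.
models = {spk: GaussianMixture(n_components=4, covariance_type="diag",
                               random_state=0).fit(feats)
          for spk, feats in train.items()}

test = rng.normal(3.0, 1.0, (100, 19))                      # utterance from spk_b
scores = {spk: m.score(test) for spk, m in models.items()}  # mean log-likelihood
print(max(scores, key=scores.get))                          # -> spk_b
```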

We have conducted a series of experiments using the two databases. First, we observed the effect of different perceptual operations on the performance of the LSF-based SI system. The individual effects of CBI, ELP and ILP, as well as their combined effect, were evaluated. Including the ELP and ILP steps independently improves the identification performance over the baseline system.


370 M. Sahidullah et al.

On the other hand, including the CBI step degrades the SI performance. Hence, the inclusion of ELP and ILP and the exclusion of the CBI step is followed in the proposed perceptual LSF (PLSF) based SI system. This combination is also used for PLAR for the same reason.
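As an illustration of the final step of such a front end, the sketch below converts an LP polynomial to Line Spectral Frequencies via the symmetric polynomial P(z) and the antisymmetric polynomial Q(z). This is the generic LP-to-LSF conversion, not the paper's exact pipeline, which additionally applies the perceptual ELP and ILP steps to the speech before LP analysis.

```python
import numpy as np

def lpc_to_lsf(a):
    """a = [1, a1, ..., ap]: LP polynomial; returns the LSFs in (0, pi)."""
    a = np.asarray(a, dtype=float)
    ext = np.append(a, 0.0)
    p_poly = ext + ext[::-1]   # symmetric P(z) = A(z) + z^-(p+1) A(1/z)
    q_poly = ext - ext[::-1]   # antisymmetric Q(z) = A(z) - z^-(p+1) A(1/z)
    lsf = []
    for poly in (p_poly, q_poly):
        w = np.angle(np.roots(poly))
        # keep one frequency per conjugate pair; drop trivial roots at z = +/-1
        lsf.extend(w[(w > 1e-6) & (w < np.pi - 1e-6)])
    return np.sort(np.array(lsf))

# For a stable A(z), the P and Q frequencies interleave on the unit circle.
print(lpc_to_lsf([1.0, -1.2, 0.5]))   # approx. [0.5548, 1.2132] rad
```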

Next, we conduct experiments using different baseline features to compare the PIA of the proposed technique. For LP-based methods, comparisons are shown with LSF; for the filterbank-based method, with MFCC. Two perceptually motivated features are also evaluated: PLPCC and PLAR. The feature dimension is set at 19 for all features. In the LP-based systems, an LP order of 19 is used for all-pole modelling of the speech signal. On the other hand, 20 filters are used for the filterbank-based system, and 19 coefficients are taken for extracting MFCC after discarding the first coefficient, which represents the dc component. Detailed descriptions are available in Chakroborty et al. (2007) and Chakroborty (2008). The formulation of the other features is available in Campbell (1997) and Rabiner and Juang (2003). The results are shown in Tables 1 and 2 for the POLYCOST and YOHO databases, respectively. The last column of each table corresponds to the proposed PLSF-based SI system, while the rest are based on the other baseline features. From the results it is clear that the proposed feature outperforms the other existing techniques. The proposed perceptual feature gives a 6.26% relative improvement over the traditional PLPCC-based system and 6.08% over the recently proposed PLAR feature based system on the POLYCOST database. The corresponding improvements on YOHO are 2.77% and 0.224%. The POLYCOST database consists of speech signals collected over a telephone channel; the improvement for this database is significant compared with YOHO, which is microphonic.

Table 1 Comparative Speaker Identification results (PIA) using various spectral features for the POLYCOST database

Model order   LSF       MFCC      PLPCC     PLAR      PLSF
2             60.7427   63.9257   62.9973   64.9867   65.6499
4             66.8435   72.9443   72.2812   74.6684   74.4032
8             75.7294   78.2493   75.0663   78.6472   80.9019
16            78.1167   78.9125   78.3820   78.5146   83.2891

Table 2 Comparative Speaker Identification results (PIA) using various spectral features for the YOHO database

Model order   LSF       MFCC      PLPCC     PLAR      PLSF
2             70.7428   75.5797   66.5761   83.4420   78.1884
4             81.3768   86.1594   76.9203   90.1449   89.0036
8             90.4529   91.4855   85.3080   94.0761   94.1486
16            93.2246   94.5471   90.6341   95.6884   96.1413
32            95.5978   96.0688   93.5326   96.5036   96.9565
64            96.5761   97.0109   94.6920   97.1014   97.3188

The Speaker Identification (SI) performance of the residual moment based feature is also evaluated for both databases. The order of the LP analysis used to extract the residual signal is


kept between 10 and 18, as these orders are often used in speech processing applications. This range is also sufficient to capture effective speaker-specific information (Prasanna et al., 2006). We have performed SI experiments for different orders of moments. Empirically, we have observed that 4–6 higher-order moments are sufficient to capture the vocal cord information. In Tables 3 and 4, the results of the SI experiments using HOSMR are shown. The identification performance is very low because the vocal cord parameters are the only cues for identifying speakers; its importance lies in its complementarity to the vocal tract information.
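One plausible reading of the residual-moment idea can be sketched as follows: inverse-filter a frame with its LP polynomial to obtain the residual, then take normalised higher-order central moments of the residual as a compact vocal-cord feature. The toy frame, the starting moment order, and the normalisation below are assumptions for illustration; the paper's exact configuration may differ.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter
from scipy.stats import moment

def lp_residual(frame, order=17):
    """Autocorrelation-method LP analysis, then inverse filtering."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a_coeffs = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])
    a = np.concatenate(([1.0], -a_coeffs))   # A(z) = 1 - sum_k a_k z^-k
    return lfilter(a, [1.0], frame)          # residual e(n)

def hosmr(frame, order=17, n_moments=6):
    """Normalised central moments of the LP residual (orders 3 and up)."""
    e = lp_residual(frame, order)
    e = (e - e.mean()) / (e.std() + 1e-12)   # zero-mean, unit-variance residual
    return np.array([moment(e, k) for k in range(3, 3 + n_moments)])

rng = np.random.default_rng(1)
# Toy "speech" frame: Laplacian excitation through a one-pole filter.
frame = lfilter([1.0], [1.0, -0.9], rng.laplace(size=400))
print(hosmr(frame).round(3))
```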

Table 3 Speaker Identification accuracy for the POLYCOST database using the HOSMR feature (number of mixtures = 16)

% of accuracy for different number of features

LP order      1        2         3         4         5         6         7
10         7.1618  15.6499   17.5066   21.4854   19.4960   19.3634   17.9045
11         7.1618  14.9867   16.4456   20.1592   16.9761   20.5570   18.5676
12         5.5703  15.5172   17.5066   21.4854   21.0875   20.9549   20.6897
13         5.5703  14.7215   17.5066   22.1485   19.7613   20.6897   20.1592
14         7.1618  13.9257   16.0477   21.8833   19.6286   23.3422   19.3634
15         6.7639  15.5172   18.9655   21.7507   20.5570   20.0265   18.0371
16         7.1618  15.5172   17.5066   22.2812   20.2918   21.2202   20.2918
17         7.0292  14.9867   16.8435   21.4854   19.4960   22.4138   18.8329
18         6.7639  16.0477   17.3740   21.6180   21.3528   22.1485   18.0371

Table 4 Speaker Identification accuracy for the YOHO database using the HOSMR feature (number of mixtures = 64)

% of accuracy for different number of features

LP order      1        2         3         4         5         6         7
10         6.7754  15.3986   19.9638   24.5471   20.3804   21.6304   19.9638
11         6.7754  16.2862   20.4891   25.0181   21.6486   21.2862   20.2355
12         7.3732  16.1413   20.0725   25.5072   22.3913   22.5181   20.8514
13         7.4819  16.1775   20.6341   25.3804   22.2826   22.1558   19.8732
14         7.6268  16.1051   20.9964   25.9783   22.6268   22.9348   21.1594
15         7.4819  16.6486   20.5797   24.8913   22.3370   22.6993   19.8913
16         7.8442  16.1957   19.8370   23.8587   22.0652   21.9486   20.7428
17         7.6993  16.6486   21.0688   25.0725   21.2500   21.9565   20.2717
18         7.4819  16.1413   20.2899   24.9094   22.1739   22.4094   19.9094

4.3 Fusion of vocal tract and vocal cord information

Here, vocal tract and vocal cord parameters are successfully integrated for identifying speakers. The ways PLSF and HOSMR represent the speech signal are complementary to one another. Hence, it is expected that combining the advantages of both features will improve (Kittler et al., 1998) the overall performance of the SI system. The block diagram of the combined system is shown in Figure 6. Spectral


Figure 6 Block diagram of the fusion technique: score fusion of the vocal tract information based feature (short-term spectral) and vocal cord information (residual) (see online version for colours)

features and residual features are extracted from the training data in two separate streams.

Consequently, speaker modelling is performed for the respective features independently, and the model parameters are stored in the model database. At the time of testing, the same process is adopted for feature extraction. The log-likelihoods of the two different features are computed w.r.t. their corresponding models. Finally, the output scores are weighted and combined. To get the advantages of both systems and their complementarity, the score-level linear fusion can be formulated as in equation (12):

LLRcombined = ηLLRspectral + (1 − η)LLRresidual. (12)

where LLRspectral and LLRresidual are the log-likelihood ratios calculated from the spectral and residual based systems, respectively. The fusion weight is determined by the parameter η. In this experiment, we take equal evidence from the two systems and set the value of η to 0.5.
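Equation (12) amounts to a weighted sum of the two per-speaker score vectors before the argmax decision; the scores below are made-up numbers purely for illustration.

```python
import numpy as np

def fuse(llr_spectral, llr_residual, eta=0.5):
    """Score-level linear fusion as in equation (12)."""
    return eta * np.asarray(llr_spectral) + (1 - eta) * np.asarray(llr_residual)

# Per-speaker log-likelihood scores for one test utterance (3 speakers).
spec = np.array([-41.2, -39.8, -40.5])
resid = np.array([-12.9, -12.1, -11.8])
print(int(np.argmax(fuse(spec, resid, eta=0.5))))   # -> 1 (identified speaker index)
```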

In Tables 5 and 6, the SI performance of the fused system is shown. The performance of the combined system is better than that of the single-feature based systems. The LP order used is 17 and the number of HOSMR features is 6. We find that the HOSMR feature always improves the result due to its complementarity, and the PLSF-HOSMR


combination is better than the other combinations. Note that the improvement is higher at lower GMM model orders due to the base effect usually experienced in this type of performance evaluation. For example, for the model order 2 system using the PLSF feature, the PIA of the dual-stream system is improved by 6.46% (POLYCOST) and 5.61% (YOHO) compared to the single-stream approach; on the other hand, the improvements for model order 64 are 1.6% and 0.52%, respectively. In a separate experiment we have seen that if the single-stream model dimension of the spectral feature is increased to 25, the same as the combined dimension of features in the two-stream model, the performance of the latter is always better due to the complementarity of the residual. The performance of the 25-dimensional pooled system is also better than that of the single-stream 25-dimensional spectral feature based system.

Table 5 Speaker Identification accuracy for the POLYCOST database using the HOSMR based feature and the fused system (score-level linear fusion with η = 0.5; HOSMR configuration: LP order = 17, number of higher-order moments = 6)

% HOSMR fused with

No. of mixtures   LSF       MFCC      PLPCC     PLAR      PLSF
2                 65.1194   69.3634   65.7825   70.6897   69.8939
4                 70.9549   76.3926   75.5968   77.7188   77.9841
8                 77.8515   80.6366   77.3210   81.1671   81.9629
16                80.3714   80.5040   80.5040   81.5650   84.6154

Table 6 Speaker Identification accuracy for the YOHO database using the HOSMR based feature and the fused system (score-level linear fusion with η = 0.5; HOSMR configuration: LP order = 17, number of higher-order moments = 6)

% HOSMR fused with

No. of mixtures   LSF       MFCC      PLPCC     PLAR      PLSF
2                 76.2319   79.6377   72.5543   86.3768   82.5725
4                 84.2754   88.1159   81.0507   91.5761   91.1232
8                 91.2862   92.9167   87.7717   95.0000   95.1993
16                94.0217   95.0543   91.9022   96.4312   96.5036
32                95.9058   96.5399   94.3116   97.0290   97.2826
64                96.7935   97.1558   95.3986   97.5906   97.8261

We have varied the value of the weight (η) empirically. It is observed that unequal weighting improves the identification result. In Figure 7, the identification accuracy of the fused system vs. the fusion weight is shown. The PLSF feature gives improved accuracy across the various fusion weights compared to the other spectral feature based combined systems. With the help of an exhaustive search, at η = 0.427 we get a PIA of 84.7480% for the POLYCOST database and at η = 0.448 we get a PIA of 97.8623% for the YOHO database.

The residual information extracted through HOSM contains more speaker-specific information than that captured by pitch alone. In a different experiment


Figure 7 Speaker Identification accuracy on: (a) POLYCOST and (b) YOHO with the fused system for different values of the fusion weight (η)

we have used a standard pitch detection algorithm to find the pitch of the speech frames. The SI performance is observed for the combined system, i.e., for pitch and spectral features. Table 7 shows the comparative results for the pitch based and HOSMR based fused systems. The performance of the HOSMR based system is better when both voiced and unvoiced frames are taken instead of only voiced frames.

Table 7 Comparative Speaker Identification results (fused system) on the two databases. The number of Gaussians for POLYCOST is 16 and for YOHO is 64. The fusion weight (η) is set at 0.5. HOSMR configuration: LP order = 17, number of higher-order moments = 6

                   POLYCOST                    YOHO
Spectral feature   Pitch based  HOSMR based    Pitch based  HOSMR based
LSF                81.2997      80.3714        96.6123      96.7935
MFCC               78.9125      80.5040        95.7065      97.1558
PLPCC              78.5146      80.5040        93.2971      95.3986
PLAR               79.9735      81.5650        96.7572      97.5906
PLSF               84.0849      84.6154        97.1377      97.8261

The databases used for our experiments have the same training and testing conditions with significant session variability. To check the performance of the system when the test data is corrupted with noise, we have conducted a separate experiment.


Additive white Gaussian noise is added to the test signals at different SNR levels for the full POLYCOST database, and no speech enhancement is done. It is observed that the performance of the fused system is still significantly better than the single-feature based systems, even at low SNR. The results are shown in Table 8. It is worth mentioning that the proposed HOSMR based fused system with MFCC improves performance over the single-stream vocal tract based features at higher noise levels, while the fused system with PLSF outperforms at higher SNR.
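Corrupting a test signal at a target SNR can be sketched as below: scale white Gaussian noise so that the ratio of signal power to noise power matches the requested level. The tone used as a stand-in signal is an assumption for illustration.

```python
import numpy as np

def add_awgn(signal, snr_db, rng=None):
    """Add white Gaussian noise so the result has the target SNR in dB."""
    rng = np.random.default_rng(0) if rng is None else rng
    p_signal = np.mean(signal ** 2)
    p_noise = p_signal / (10.0 ** (snr_db / 10.0))
    return signal + rng.normal(0.0, np.sqrt(p_noise), size=signal.shape)

x = np.sin(2 * np.pi * 200 * np.arange(8000) / 8000.0)   # 1 s tone at 8 kHz
y = add_awgn(x, snr_db=20.0)
measured = 10 * np.log10(np.mean(x ** 2) / np.mean((y - x) ** 2))
print(round(measured, 1))   # close to 20 dB
```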

Table 8 Performance of the Speaker Identification system in the presence of additive white Gaussian noise. The results are shown for the POLYCOST database (telephonic), where the speakers are modelled using 16 Gaussian components. The fusion weight (η) is set at 0.5. HOSMR configuration: LP order = 17, number of higher-order moments = 6

SNR (dB)   MFCC      PLSF      MFCC+HOSMR   PLSF+HOSMR
40         78.9125   82.8912   80.2387      83.4218
30         72.2812   71.4854   75.7294      75.8621
20         43.8992   34.4828   52.6525      35.4111

5 Conclusion

A novel scheme to improve the performance of a speaker recognition system is presented. Speaker information captured by the vocal tract and vocal cord parameters is fused together to achieve the best performance. A novel spectral feature, Perceptual Line Spectral pairs Frequencies (PLSF), is proposed in this paper, which effectively exploits the advantages of LSF and the perceptual analysis of the speech signal to capture the vocal tract parameters. The PLSF outperforms the other spectral features in terms of identification accuracy. Unlike the others, it does not show a dip as the dimension increases; rather, it shows incremental improvement. In addition, a novel complementary feature is also proposed based on the Higher Order Statistical Moments of the Residual (HOSMR) signal, which gives the vocal cord characteristics. Experiments with two standard databases show the superiority of the proposed PLSF feature and its fusion with HOSMR over the others.

References

Abdulla, W.H. (2007) ‘Robust speaker modelling using perceptually motivated feature’, Pattern Recogn. Lett., Vol. 28, No. 11, pp.1333–1342.

Atal, B.S. (1974) ‘Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification’, The Journal of the Acoustical Society of America, Vol. 55, No. 6, pp.1304–1312.

Bäckström, T. and Magi, C. (2006) ‘Properties of line spectrum pair polynomials: a review’, Signal Process., Vol. 86, No. 11, pp.3286–3298.

Bishnu, S.A., Cuperman, V. and Gersho, A. (2003) Advances in Speech Coding, Kluwer Academic Publishers, USA.

Campbell, J.P. (1997) ‘Speaker recognition: a tutorial’, Proceedings of the IEEE, Vol. 85, No. 9, pp.1437–1462.

Chakroborty, S., Roy, A., Majumdar, S. and Saha, G. (2007) ‘Capturing complementary information via reversed filter bank and parallel implementation with MFCC for improved text-independent speaker identification’, International Conference on Computing: Theory and Applications, ICCTA’07, Kolkata, India, pp.463–467.

Chakroborty, S. (2008) Some Studies on Acoustic Feature Extraction, Feature Selection and Multi-Level Fusion Strategies for Robust Text-Independent Speaker Identification, PhD Dissertation, Indian Institute of Technology Kharagpur, Kharagpur, India.

Chetouani, M., Faundez-Zanuy, M., Gas, M.B. and Zarader, J. (2009) ‘Investigation on LP-residual representations for speaker identification’, Pattern Recognition, Vol. 42, No. 3, pp.487–494.

Chu, W.C. (2003) Speech Coding Algorithms, John Wiley, USA.

Dempster, A.P., Laird, N.M. and Rubin, D.B. (1977) ‘Maximum likelihood from incomplete data via the EM algorithm’, Journal of the Royal Statistical Society, Series B (Methodological), Vol. 39, pp.1–38.

Erkelens, J. and Broersen, P. (1995) ‘On the statistical properties of line spectrum pairs’, International Conference on Acoustics, Speech, and Signal Processing, ICASSP-95, Vol. 1, pp.768–771.

Faundez-Zanuy, M. and Monte-Moreno, E. (2005) ‘State-of-the-art in speaker recognition’, Aerospace and Electronic Systems Magazine, IEEE, Vol. 20, No. 5, pp.7–12.

Hermansky, H. (1990) ‘Perceptual linear predictive (PLP) analysis of speech’, The Journal of the Acoustical Society of America, Vol. 87, No. 4, pp.1738–1752.

Hermansky, H. and Morgan, N. (1994) ‘RASTA processing of speech’, IEEE Transactions on Speech and Audio Processing, Vol. 2, No. 4, pp.578–589.

Higgins, A., Porter, J. and Bahler, L. (1989) YOHO Speaker Authentication Final Report, ITT Defense Communications Division, Tech. Rep.

Itakura, F. (1975) ‘Line spectrum representation of linear predictor coefficients of speech signals’, The Journal of the Acoustical Society of America, Vol. 57, No. S1, p.S35.

Kinnunen, T. (2004) Spectral Features for Automatic Text-independent Speaker Recognition, PhD Dissertation, University of Joensuu, Joensuu, Finland.

Kinnunen, T. and Li, H. (2010) ‘An overview of text-independent speaker recognition: from features to supervectors’, Speech Communication, Vol. 52, No. 1, pp.12–40.

Kittler, J., Hatef, M., Duin, R. and Matas, J. (1998) ‘On combining classifiers’, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 20, No. 3, pp.226–239.

Kondoz, A.M. (2004) Digital Speech Coding for Low Bit Rate Communication Systems, 2nd ed., John Wiley & Sons Ltd., England.

Lee, B.J., Kim, S. and Kang, H.-G. (2004) ‘Speaker recognition based on transformed line spectral frequencies’, International Symposium on Intelligent Signal Processing and Communication Systems, ISPACS 2004, pp.177–180.

Lepschy, A., Mian, G. and Viaro, U. (1988) ‘A note on line spectral frequencies [speech coding]’, IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. 36, No. 8, pp.1355–1357.

Liao, S. and Pawlak, M. (1996) ‘On image analysis by moments’, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 18, No. 3, pp.254–266.

Linde, Y., Buzo, A. and Gray, R. (1980) ‘An algorithm for vector quantisation design’, IEEE Transactions on Communications, Vol. COM-28, No. 4, pp.84–95.

Liu, C-S., Wang, W-J., Lin, M-T. and Wang, H-C. (1990) ‘Study of line spectrum pair frequencies for speaker recognition’, International Conference on Acoustics, Speech, and Signal Processing, ICASSP-90, Albuquerque, New Mexico, USA, Vol. 1, pp.277–280.

Lo, C.H. and Don, H.S. (1989) ‘3-D moment forms: their construction and application to object identification and positioning’, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 11, No. 10, pp.1053–1064.

Markov, K.P. and Nakagawa, S. (1999) ‘Integrating pitch and LPC-residual information with LPC-cepstrum for text-independent speaker recognition’, J. Acoust. Soc. Jpn. E, Vol. 20, No. 4, pp.281–291.

Matsui, T. and Tanabe, K. (2006) ‘Comparative study of speaker identification methods: DPLRM, SVM and GMM’, IEICE Trans. Inf. Syst., Vol. E89-D, No. 3, pp.1066–1073.

Mattson, S.G. and Pandit, S.M. (2006) ‘Statistical moments of autoregressive model residuals for damage localisation’, Mechanical Systems and Signal Processing, Vol. 20, No. 3, pp.627–645.

McLoughlin, I.V. (2008) ‘Review: line spectral pairs’, Signal Process., Vol. 88, No. 3, pp.448–467.

Melin, H. and Lindberg, J. (1996) ‘Guidelines for experiments on the POLYCOST database’, Proceedings of a COST 250 Workshop on Application of Speaker Recognition Techniques in Telephony, Vigo, Spain, pp.59–69.

Murty, K. and Yegnanarayana, B. (2006) ‘Combining evidence from residual phase and MFCC features for speaker recognition’, Signal Processing Letters, IEEE, Vol. 13, No. 1, pp.52–55.

Nandi, A. (1994) ‘Higher order statistics for digital signal processing’, IEE Colloquium on Mathematical Aspects of Digital Signal Processing, London, England, pp.6/1–6/4.

Nemer, E., Goubran, R. and Mahmoud, S. (2001) ‘Robust voice activity detection using higher-order statistics in the LPC residual domain’, IEEE Transactions on Speech and Audio Processing, Vol. 9, No. 3, pp.217–231.

Paliwal, K. (1992) ‘On the use of line spectral frequency parameters for speech recognition’, Digital Signal Processing, Vol. 2, No. 2, pp.80–87.

Prasanna, S.M., Gupta, C.S. and Yegnanarayana, B. (2006) ‘Extraction of speaker-specific excitation information from linear prediction residual of speech’, Speech Communication, Vol. 48, No. 10, pp.1243–1261.

Rabiner, L. and Juang, B.H. (2003) Fundamentals of Speech Recognition, Pearson Education, First Indian Reprint, India.

Ramachandran, R.P., Farrell, K.R., Ramachandran, R. and Mammone, R.J. (2002) ‘Speaker recognition – general classifier approaches and data fusion methods’, Pattern Recognition, Vol. 35, No. 12, pp.2801–2821.

Reynolds, D.A. (1992) A Gaussian Mixture Modeling Approach to Text-independent Speaker Identification, PhD Dissertation, Georgia Institute of Technology, Georgia, USA.

Reynolds, D. (1994) ‘Experimental evaluation of features for robust speaker identification’, IEEE Transactions on Speech and Audio Processing, Vol. 2, No. 4, pp.639–643.

Reynolds, D. and Rose, R. (1995) ‘Robust text-independent speaker identification using Gaussian mixture speaker models’, IEEE Transactions on Speech and Audio Processing, Vol. 3, No. 1, pp.72–83.

Reynolds, D. (2002) ‘An overview of automatic speaker recognition technology’, IEEE International Conference on Acoustics, Speech, and Signal Processing, Proceedings (ICASSP’02), Vol. 4, pp.IV-4072–IV-4075.

Soong, F. and Juang, B. (1984) ‘Line spectrum pair (LSP) and speech data compression’, Vol. 9, pp.37–40.

Stevens, S.S. (1957) ‘On the psychophysical law’, Psychological Review, Vol. 64, No. 3, pp.153–181.

Teh, C.H. and Chin, R. (1988) ‘On image analysis by the methods of moments’, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 10, No. 4, pp.496–513.

Tourneret, J.Y. (1998) ‘Statistical properties of line spectrum pairs’, Signal Processing, Vol. 65, No. 2, pp.239–255.

Yoma, N.B. and Pegoraro, T.F. (2002) ‘Robust speaker verification with state duration modelling’, Speech Communication, Vol. 38, Nos. 1–2, pp.77–88.

Yuan, Z.X., Xu, B.L. and Yu, C.Z. (1999) ‘Binary quantisation of feature vectors for robust text-independent speaker identification’, IEEE Transactions on Speech and Audio Processing, Vol. 7, No. 1, pp.70–78.

Zheng, N., Lee, T. and Ching, P.C. (2007) ‘Integration of complementary acoustic features for speaker recognition’, Signal Processing Letters, IEEE, Vol. 14, No. 3, pp.181–184.

