
Copyright © IJIFR 2014

Original Paper

International Journal of Informative & Futuristic Research ISSN (Online): 2347-1697

Volume 2 Issue 4 December 2014

Language Identification Using RASTA-PLP Feature Extraction Technique

Paper ID: IJIFR/ V2/ E4/ 033
Page No.: 967-973
Subject Area: Electronics & Telecommunication
Key Words: LID (Language Identification), Feature Extraction, Feature Matching, RASTA-PLP, Vector Quantization (VQ).

Varsha Singh 1
Research Scholar,
Department of Electronics & Telecommunication,
The Chhattisgarh Swami Vivekanand Technical University,
Bhilai, Chhattisgarh

Vinay Kumar Jain 2
Associate Professor,
Department of Electronics & Telecommunication

Dr. Neeta Tripathi 3
Principal,
Shri Shankaracharya Institute of Technology & Management

Abstract

Speech processing is a major area of digital signal processing. Its fields include speech/speaker recognition and language identification. The goal of a language identification (LID) system is to automatically identify the language of an unknown speech utterance. Such a system should identify the language the user is speaking quickly and accurately. An LID system has two processes: feature extraction and feature matching. The performance and accuracy of the system depend on the efficiency of the feature extraction process. Feature extraction techniques include LPC, cepstral coefficients, MFCC, PLP, RASTA, etc. In this paper RASTA-PLP is used for feature extraction, and vector quantization is used in both the training and testing phases. The Linde-Buzo-Gray (LBG) algorithm is used for codebook generation.

1. Introduction

Speech is the most natural form of communication between humans. Changes in a signal carry information; speech is thus a signal that contains information in terms of frequency, power, energy, etc. Speech processing is a special case of digital signal processing that deals with speech signals and the methods used to process them. It includes areas such as speech/speaker recognition, speech coding, speech synthesis, voice analysis, language recognition, speech compression, etc. A language recognition system automatically determines the identity of the language from a sample of speech. Language recognition can be divided into two types of tasks: language identification and language verification. A language identification system automatically identifies the language being spoken. Language identification is also known as a closed-set identification task because the input speech signal is assumed to come from a fixed set of known languages. Language verification, on the other hand, is the process in which the system tries to detect whether a spoken speech segment is from a certain language; if the language is verified the segment is accepted, otherwise it is rejected.

A language verification task is of two types: closed-set and open-set verification. In closed-set verification, all speech segments come from languages that are known to the system. In open-set verification, the speech segments need not all come from languages known to the system. LID systems have many applications, such as speech translation systems [1], multilingual speech recognition systems [2], spoken document retrieval systems [3], etc. Many factors can be considered for differentiating between languages or identifying a language. Studies have concluded that human listeners, given only a little previous knowledge, can effectively identify a language without much lexical knowledge. In this case, human listeners rely on prominent phonetic, phonotactic, and prosodic cues to characterize the languages [4, 5]. These factors range from spectral features of the acoustic speech signal to language-specific words or phrases. Feature extraction is the process in which the important characteristics of the speech are represented while reducing the amount of speech data. Many techniques are used for feature extraction, such as LPC, MFCC, PLP, RASTA-PLP, spectrograph, etc.

2. Literature Review

LID systems are used to route an incoming telephone call, from a caller who speaks a language that is not recognized by the receiver, to an operator who is fluent in the corresponding language. Such a service is known as a Language Line Service. Zissman [6] compared four approaches to language identification of telephone speech: Gaussian mixture modelling (GMM), single-language phone recognition followed by language-dependent language modelling (PRLM), parallel PRLM, and language-dependent parallel phone recognition (PPR). The performance of these approaches was tested on the OGI-TS corpus. The experimental results show that GMM performs poorly but runs faster than real time on a conventional UNIX workstation; PRLM runs a bit slower than real time on a conventional workstation; and parallel PRLM and PPR perform almost equally and better than the other two, but run more slowly.

Language identification using phone-based acoustic likelihoods [7] is the process in which an unknown incoming speech signal is processed in parallel by different sets of language-specific phone models, and the language associated with the model set having the highest likelihood is selected. For spoken language identification, a PPRLM system is built using a proposed method called acoustic diversification [8]. Phone recognizers are used with different acoustic models trained on the same speech data with the same phone set. The method aims to provide better acoustic diversification by using different acoustic models that give complementary acoustic emphasis. Experimental results show that acoustic diversity is as effective as phonetic diversity in characterizing spoken languages in the phonotactic approach to LID. In the language recognition evaluations organized by NIST, two major approaches have proved efficient: acoustic modelling, which is based on short-context information, and phonotactic modelling, which tries to capture longer patterns in speech [9]. Prosody is linked to linguistic units such as syllables, and it is represented in terms of changes in measurable parameters such as fundamental frequency, duration and energy. Prosodic features are therefore used for language and speaker recognition [10].

The work in [10] evaluated the effectiveness of prosodic features, extracted using the approach proposed there, for language recognition on the NIST LRE 2003 task. Though the success of language recognition was constrained by the limited speech data available for training, it clearly illustrates the potential of prosodic features for distinguishing languages. In [11] a novel approach is proposed for text-independent language identification which does not require annotated corpora. An LID system is developed using vector quantization and a discrete hidden Markov model (DHMM). The performance of speaker identification under the effect of different feature extraction techniques is presented in [12], where the identification rates of the techniques are compared using vector quantization.

3. Methodology

The proposed LID system contains two main modules:

Feature Extraction: - A speech signal conveys information about a speaker in various forms, for example speaking style, context, and the emotional state of the speaker. The objective of signal processing is to extract the important information from a signal by means of a transformation and to store it in a vector of feature coefficients.

Feature Matching: - In feature matching a decision-making procedure is applied to identify the speaker's language based on a previously stored database. This step is divided into two parts, namely training and testing.

The following are the steps of the training and testing phases:

During the training phase a comprehensive database is prepared.
The database consists of recordings of speech from speakers who speak different Indian languages such as Hindi, Marathi, etc.
Short clips of vowels and semivowels are separated from the recorded voice.
The short clips of speech signals are used for further analysis.
Vocal tract characteristics and other features of speech are extracted from the short clips.
A proposed set of rules is formed for extracting the features used to identify the language.
During the testing phase voice is recorded and features are extracted.
The extracted features are compared with the features of the trained data for identification.
Modular programs are developed in MATLAB for feature extraction, training and testing.

The block diagram of the training phase is shown in Fig. 3.1. Feature vectors representing the voice characteristics of the speaker are extracted from the training utterances and are used for building the reference models. During testing, similar feature vectors are extracted from the test utterance, and the degree of their match with the reference is obtained using a matching technique. The level of match is used to arrive at the decision. The block diagram of the testing phase is also given in Fig. 3.1.


Figure 3.1: Two distinct phases of implementing an LID system

4. Feature Extraction using RASTA-PLP

RASTA-PLP: - Perceptual Linear Prediction (PLP) is a technique for speech analysis [13] that is based on the short-term spectrum of speech. The PLP technique is susceptible to the frequency response of the communication channel, which modifies the values of the short-term spectrum. To make PLP more robust to linear spectral distortion, the RASTA method [14] is used, and it yields better results than PLP in noisy environments. The term RASTA comes from the words RelAtive SpecTrA. In the RASTA technique a bandpass filter (a filter with a sharp spectral zero at zero frequency) is applied to each spectral component in the critical-band spectrum estimate. This filtering emphasizes frame-to-frame spectral changes that occur at rates between 1 and 10 Hz. Before applying the bandpass filter, log-RASTA takes the natural logarithm of each spectral component. The logarithm converts multiplicative distortions in the frequency domain into additive distortions, which can be filtered out. Conversion to the log-spectrum domain is a common approach in signal deconvolution problems. The steps of RASTA-PLP for each analysis frame are as follows:

1. Compute the critical-band power spectrum.

2. Transform the spectral amplitude through a compressing static nonlinear transformation (the natural logarithm in log-RASTA).

3. Filter the time trajectory of each transformed spectral component.

4. Transform the filtered speech representation through an expanding static nonlinear transformation (the inverse logarithm in log-RASTA).

5. As in conventional PLP, multiply by the equal-loudness curve and raise to the power 0.33 to simulate the power law of hearing.

6. Compute an all-pole model of the resulting spectrum, following the conventional PLP technique.
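
As an illustration of steps 2 to 4 (not the authors' MATLAB implementation), the following minimal Python sketch filters the time trajectory of each log critical-band energy with a RASTA-style bandpass filter. It assumes the transfer function commonly quoted for RASTA, H(z) = 0.1 (2 + z^-1 - z^-3 - 2 z^-4) / (1 - p z^-1). The pole value p = 0.94 and the function name `rasta_filter` are assumptions made for this example; 0.98 also appears in the RASTA literature.

```python
import numpy as np

def rasta_filter(log_spec, pole=0.94):
    """Filter each band's time trajectory with a RASTA-style bandpass filter.

    log_spec: array of shape (n_frames, n_bands) holding the natural log of
    the critical-band power spectrum (step 2 of the list above).
    pole: assumed pole position of the recursive part (hypothetical default).
    """
    # FIR part: coefficients sum to zero, which gives the spectral zero at 0 Hz.
    numer = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])
    x = np.asarray(log_spec, dtype=float)
    y = np.zeros_like(x)
    for t in range(x.shape[0]):
        acc = np.zeros(x.shape[1])
        for k, b in enumerate(numer):
            if t - k >= 0:
                acc += b * x[t - k]
        if t >= 1:
            acc += pole * y[t - 1]  # single-pole recursive part
        y[t] = acc
    return y

# A fixed channel multiplies the power spectrum by a constant gain, which
# becomes a constant offset after the logarithm. The filter removes it
# once the start-up transient has decayed.
rng = np.random.default_rng(0)
clean = rng.normal(size=(300, 4))
offset = 5.0  # hypothetical channel gain in the log domain
filtered_clean = rasta_filter(clean)
filtered_channel = rasta_filter(clean + offset)
# Step 4 would apply np.exp to the filtered values before steps 5 and 6.
```

Because the numerator coefficients sum to zero, any component that stays constant from frame to frame is suppressed, which is how a convolutional channel effect is reduced in the log domain.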

Figure 4.1: RASTA-PLP Implementation Process (blocks: Speech Signal; DFT; Logarithm & Filtering; Equal-Loudness Curve; Power-Law of Hearing; Inverse Logarithm; IDFT; Solving of Set of Linear Equations; Cepstral Recursion; Cepstral Coefficients of RASTA-PLP)





5. Feature Matching using Vector Quantization

Vector quantization techniques play a dominant role in the compression of speech signals. Vector quantization is used in many applications, such as image and voice compression and voice recognition. It is a lossy data compression method and a fixed-to-fixed length algorithm. In a speaker/language recognition system, each speaker or language is represented uniquely and efficiently by vector quantization, which is based on the principle of block coding. Vector Quantization (VQ) maps a k-dimensional vector space to a finite set C = {C1, C2, C3, …, CN}. The set C is called a codebook; it consists of N codevectors, and each codevector Ci = {ci1, ci2, ci3, …, cik} is of dimension k. The key to VQ is a good codebook. The method most commonly used to generate a codebook is the Linde-Buzo-Gray (LBG) algorithm [15], [16], which is also called the Generalized Lloyd Algorithm (GLA).
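
The matching step can be sketched in Python as follows. The paper does not spell out its decision rule, so this example assumes the common VQ rule of choosing the language whose codebook gives the lowest average squared Euclidean distortion over the test feature vectors. The function names and the toy codebooks are hypothetical.

```python
import numpy as np

def average_distortion(features, codebook):
    """Mean squared distance from each feature vector to its nearest codevector."""
    features = np.asarray(features, dtype=float)   # shape (n_frames, k)
    codebook = np.asarray(codebook, dtype=float)   # shape (N, k)
    # Pairwise squared distances, shape (n_frames, N).
    dist = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return dist.min(axis=1).mean()

def identify_language(features, codebooks):
    """Return the key of the codebook with the smallest average distortion.

    codebooks: dict mapping a language name to its (N, k) codebook.
    """
    return min(codebooks, key=lambda lang: average_distortion(features, codebooks[lang]))

# Toy 2-dimensional codebooks (made-up numbers, not RASTA-PLP features).
toy_codebooks = {
    "English": [[0.0, 0.0], [1.0, 1.0]],
    "Hindi": [[5.0, 5.0], [6.0, 6.0]],
}
test_features = [[5.2, 4.9], [5.8, 6.1], [5.5, 5.5]]
decision = identify_language(test_features, toy_codebooks)
```

In this toy case the test vectors lie close to the second codebook, so `decision` is "Hindi".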

6. LBG Algorithm

The LBG algorithm [16] is used for generating the codebooks. Its steps are as follows [16]:

1. Design a 1-vector codebook; this is the centroid of the entire set of training vectors.

2. Double the size of the codebook by splitting each current codevector yn according to the rule

yn+ = yn(1+ε)

yn- = yn(1-ε)

where n varies from 1 to the current size of the codebook, and ε is a splitting parameter.

3. Find the centroids for the split codebook (i.e., the codebook of twice the size).

4. Iterate steps 2 and 3 until a codebook of size M is designed.
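
The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' MATLAB code: M is assumed to be a power of two, distortion is the mean squared Euclidean distance, and the default values of `eps`, `tol` and `max_iter` are hypothetical. Note that the multiplicative split in step 2 only separates the two copies when the codevector is not the zero vector.

```python
import numpy as np

def lbg(train, M, eps=0.01, tol=1e-4, max_iter=100):
    """Grow a codebook of M codevectors by repeated splitting (M a power of two)."""
    train = np.asarray(train, dtype=float)
    # Step 1: a 1-vector codebook, the centroid of all training vectors.
    codebook = train.mean(axis=0, keepdims=True)
    while codebook.shape[0] < M:
        # Step 2: split every codevector into y(1 + eps) and y(1 - eps).
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        prev = np.inf
        # Step 3: re-estimate centroids until the distortion stops improving.
        for _ in range(max_iter):
            dist = ((train[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
            nearest = dist.argmin(axis=1)
            distortion = dist[np.arange(len(train)), nearest].mean()
            for j in range(codebook.shape[0]):
                members = train[nearest == j]
                if len(members):  # an empty cell keeps its old codevector
                    codebook[j] = members.mean(axis=0)
            if prev - distortion <= tol * distortion:
                break
            prev = distortion
    return codebook  # Step 4: the loop ends once the codebook has M codevectors

# Synthetic 2-D training data around four made-up cluster centres.
rng = np.random.default_rng(1)
centres = np.array([[1.0, 1.0], [2.0, 7.0], [8.0, 3.0], [9.0, 9.0]])
train = np.vstack([c + 0.3 * rng.normal(size=(50, 2)) for c in centres])
codebook = lbg(train, M=4)
```

The relative-improvement test inside the loop plays the role of the distortion check in the flow diagram: centroid updates stop once the drop in distortion falls below the threshold.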

Figure 5.1: Flow diagram of the LBG algorithm (blocks: Find Centroid; Split Each Centroid, m = 2*m; Cluster Vectors; Find Centroids; Compute D (distortion); test (D′ − D)/D < ε; D′ = D; test m < M; Stop)





7. Result

The training database contains speech samples of three languages, English, Hindi and Marathi, spoken by 6 speakers (3 male and 3 female). A total of 120 sentences are recorded, 40 for each language. The testing database contains speech samples of 3 speakers (1 male and 2 female) who are different from the speakers of the training database. For the testing phase 60 sentences are recorded, 20 for each language. The training sentences have a duration of 10 seconds, and the testing sentences have durations that vary from 3 seconds to 10 seconds. Vector quantization is used for recognition, with 16 centroids. Table 7.1 shows the identification (ID) rate of each language. The overall identification rate achieved is 71.67%.

Table 7.1: Result of Language Identification using RASTA-PLP feature extraction method

Language (no. of test sentences)    English    Hindi    Marathi    % ID rate
English (20)                        15         1        4          75
Hindi (20)                          6          11       3          55
Marathi (20)                        0          3        17         85

Overall average ID rate = 71.67%
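
The identification rates follow directly from the counts in Table 7.1. The short Python check below assumes that each row of the table holds the 20 test sentences of one spoken language and that each column is the language the system reported; the function name `id_rates` is hypothetical.

```python
import numpy as np

def id_rates(confusion):
    """Return per-language and overall identification rates in percent."""
    confusion = np.asarray(confusion, dtype=float)
    correct = np.diag(confusion)  # sentences assigned to their own language
    per_language = 100.0 * correct / confusion.sum(axis=1)
    overall = 100.0 * correct.sum() / confusion.sum()
    return per_language, overall

# Rows: spoken English, Hindi, Marathi; columns: identified as the same order.
counts = [[15, 1, 4],
          [6, 11, 3],
          [0, 3, 17]]
per_language, overall = id_rates(counts)
print(per_language)       # 15/20, 11/20 and 17/20 as percentages
print(round(overall, 2))  # 43 correct out of 60 -> 71.67
```

With equal numbers of test sentences per language, the overall rate equals the plain average of the three per-language rates.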

8. Conclusion

It is clear from the results that the ID rates of English and Marathi are better than that of Hindi. VQ is used because of its ease of implementation, and it gives satisfactory results with RASTA-PLP features. The identification rate is independent of the speakers and the text used in the training phase; it depends only on the feature extraction method.

References

[1] V. W. Zue and J. R. Glass, “Conversational Interfaces: Advances and Challenges”, Proceedings of the IEEE, Vol. 88, No. 8, pp. 1166-1180, 2000.

[2] A. Waibel, P. Geutner, L. M. Tomokiyo, T. Schultz and M. Woszczyna, “Multilinguality in Speech and Spoken Language Systems”, Proceedings of the IEEE, Vol. 88, No. 8, pp. 1181-1190, 2000.

[3] N. Bertoldi and M. Federico, “Cross-Language Spoken Document Retrieval on the TREC SDR Collection”, Proc. CLEF 2002: Workshop on Cross-Language Information Retrieval and Evaluation, Rome, 2002, Springer Verlag.

[4] Y. K. Muthusamy, N. Jain, and R. A. Cole, “Perceptual benchmarks for automatic language identification”, in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Adelaide, Australia, 1994, vol. 1, pp. 333-336.

[5] D. A. van Leeuwen, M. Boer, and R. Orr, “A human benchmark for the NIST language recognition evaluation 2005”, presented at the Odyssey: Speaker Language Recognition.

[6] M. A. Zissman, “Comparison of Four Approaches to Automatic Language Identification of Telephone Speech”, IEEE Transactions on Speech and Audio Processing, Vol. 4, No. 1, January 1996.

[7] L. F. Lamel and J. L. Gauvain, “Language Identification Using Phone-based Acoustic Likelihoods”, ICASSP, 1994.

[8] Khe Chai Sim and Haizhou Li, “On Acoustic Diversification Front-End for Spoken Language Identification”, IEEE Transactions on Audio, Speech, and Language Processing, Vol. 16, No. 5, July 2008.

[9] O. Glembek, P. Matejka, L. Burget, and T. Mikolov, “Advances in phonotactic language recognition”, in Proc. Interspeech, 2008, p. 4.

[10] Leena Mary and B. Yegnanarayana, “Extraction and representation of prosodic features for language and speaker recognition”, Speech Communication 50 (2008) 782-796.

[11] M. Sadanandam, V. K. Prasad, V. Janaki, “Text Independent Language Recognition using Dhmm”, International Journal of Computer Applications (0975 – 888) Volume 48, No. 7, June 2012.

[12] M. Elkholy and N. Korany, “Effect of Feature Extraction Techniques on the Performance of Speaker Identification”, International Journal of Signal Processing Systems, Vol. 1, No. 1, June 2013.

[13] H. Hermansky, “Perceptual linear predictive (PLP) analysis for speech”, J. Acoust. Soc. Am., pp. 1738-1752, 1990.

[14] H. Hermansky and N. Morgan, “RASTA Processing of Speech”, IEEE Trans. on Speech and Audio Processing, Vol. 2, pp. 578-589, Oct. 1994.

[15] R. M. Gray, “Vector quantization”, IEEE ASSP Mag., pp. 4-29, Apr. 1984.

[16] Y. Linde, A. Buzo, and R. M. Gray, “An algorithm for vector quantizer design”, IEEE Trans. Commun., vol. COM-28, no. 1, pp. 84-95, 1980.
