967 www.ijifr.com
Copyright © IJIFR 2014
Original Paper
International Journal of Informative & Futuristic Research ISSN (Online): 2347-1697
Volume 2 Issue 4 December 2014
Language Identification Using RASTA-PLP Feature Extraction Technique
Paper ID: IJIFR/V2/E4/033
Page No.: 967-973 (16th Edition)
Subject Area: Electronics & Telecommunication
Key Words: LID (Language Identification), Feature Extraction, Feature Matching, RASTA-PLP, Vector Quantization (VQ)
Varsha Singh (1)
Research Scholar,
Department of Electronics & Telecommunication,
The Chhattisgarh Swami Vivekanand Technical University,
Bhilai, Chhattisgarh
Vinay Kumar Jain (2)
Associate Professor,
Department of Electronics & Telecommunication,
The Chhattisgarh Swami Vivekanand Technical University,
Bhilai, Chhattisgarh
Dr. Neeta Tripathi (3)
Principal,
Shri Shankaracharya Institute of Technology & Management,
The Chhattisgarh Swami Vivekanand Technical University,
Bhilai, Chhattisgarh
Abstract
Speech processing is a major area of digital signal processing. Fields within speech processing include speech/speaker recognition and language identification. The goal of a language identification (LID) system is to automatically identify the language of an unknown speech utterance. Such a system should be able to identify the language the user is speaking quickly and accurately. An LID system has two processes: feature extraction and feature matching. The performance and accuracy of the system depend on the efficiency of the feature extraction process. Feature extraction techniques include LPC, cepstral coefficients, MFCC, PLP, RASTA, etc. In this paper, RASTA-PLP is used for feature extraction, and vector quantization is used in both the training and testing phases. The Linde-Buzo-Gray (LBG) algorithm is used for codebook generation.
1. Introduction
Speech is the most natural form of communication between humans. Changes in a signal carry information, and speech is a signal that conveys information through its frequency, power, energy, etc. Speech processing is a special case of digital signal processing that deals with speech signals and the methods used to process them. It includes areas such as speech/speaker recognition, speech coding, speech synthesis, voice analysis, language recognition, and speech compression. A language recognition system automatically determines the identity of the language from a sample of speech. Language recognition can be divided into two tasks: language identification and language verification. A language identification system automatically identifies the language being spoken. Language identification is also known as a closed-set identification task because the input speech signal is assumed to come from a fixed set of known languages. Language verification, on the other hand, is the process in which the system tries to determine whether a spoken speech segment is from a particular language; if the language is verified the segment is accepted, otherwise it is rejected.
Language verification is of two types: closed-set and open-set. In a closed-set language verification system, all spoken speech segments come from languages known to the system. In an open-set language verification system, the speech segments need not all come from languages known to the system. LID systems have many applications, such as speech translation systems [1], multilingual speech recognition systems [2], and spoken document retrieval systems [3]. Many factors can be used to differentiate between languages or to identify a language. Studies have shown that human listeners, given only a little prior knowledge, can effectively identify a language without much lexical knowledge. In this case, listeners rely on prominent phonetic, phonotactic, and prosodic cues to characterize the languages [4, 5]. These cues range from spectral features of the acoustic speech signal to language-specific words or phrases. Feature extraction is the process of representing the important characteristics of speech while reducing the amount of speech data. Many techniques are used for feature extraction, such as LPC, MFCC, PLP, RASTA-PLP, and the spectrogram.
2. Literature Review
LID systems are used to route an incoming telephone call (from a caller who speaks a language that the receiver does not recognize) to an operator who is fluent in the corresponding language. Such a service is known as a Language Line Service. Zissman [6] compared four approaches to language identification of telephone speech: Gaussian mixture modelling (GMM), single-language phone recognition followed by language-dependent language modelling (PRLM), parallel PRLM, and language-dependent parallel phone recognition (PPR). The approaches were tested on the OGI-TS corpus. The experimental results show that GMM performs poorly but runs faster than real time on a conventional UNIX workstation; PRLM runs a bit slower than real time on a conventional workstation; and parallel PRLM and PPR perform almost equally well and better than the other two, but run more slowly.
Language identification using phone-based acoustic likelihoods [7] processes an unknown incoming speech signal in parallel with different sets of language-specific phone models and selects the language associated with the model set that yields the highest likelihood. For spoken language identification, a PPRLM system has been built using a proposed method called acoustic diversification [8]. Phone recognizers are used with different acoustic models trained on the same speech data with the same phone set. The method aims to provide better acoustic diversification by using different acoustic models that give complementary acoustic emphasis. Experimental results show that acoustic diversity is as effective as phonetic diversity in characterizing spoken languages in the phonotactic approach to LID. For language recognition, two major approaches have proved efficient in the evaluations organized by NIST: acoustic modelling, which is based on short-context information, and phonotactic modelling, which tries to capture longer patterns in speech [9]. Prosodic features are linked to linguistic units such as syllables and are represented in terms of changes in measurable parameters such as fundamental frequency, duration, and energy. Prosodic features are therefore used for language and speaker recognition [10].
That work evaluated the effectiveness of prosodic features, extracted using its proposed approach, for language recognition on the NIST LRE 2003 task. Although the success of language recognition was constrained by the limited speech data available for training, the results clearly illustrate the potential of prosodic features for distinguishing languages. In [11], a novel approach is proposed for text-independent language identification that does not require annotated corpora; the LID system is developed using vector quantization and a discrete hidden Markov model (DHMM). The performance of speaker identification under different feature extraction techniques is presented in [12], where the identification rates of the techniques are compared using vector quantization.
3. Methodology
The proposed LID system contains two main modules:
Feature Extraction: A speech signal conveys information about a speaker in various forms, for example speaking style, context, and the emotional state of the speaker. The objective of signal processing is to extract the important information from the signal by means of a transformation and to store it in a vector of feature coefficients.
Feature Matching: In feature matching, a decision-making procedure is applied to identify the speaker's language based on a previously stored database. This step is divided into two parts, namely training and testing.
The steps of the training and testing phases are as follows:
1. During the training phase, a comprehensive database is prepared.
2. The database consists of recordings of speech in different Indian languages such as Hindi and Marathi.
3. Short clips of vowels and semivowels are separated from the recorded voice.
4. These short clips of the speech signal are used for further analysis.
5. Vocal tract characteristics and other features of speech are extracted from the short clips.
6. A set of rules is formed for extracting the features used for language identification.
7. During the testing phase, voice is recorded and features are extracted.
8. The extracted features are compared with the features of the trained data for identification.
9. Modular programs are developed in MATLAB for feature extraction, training, and testing.
The block diagrams of the training and testing phases are shown in Fig. 3.1. During training, feature vectors representing the voice characteristics of the speaker are extracted from the training utterances and are used to build the reference models. During testing, similar feature vectors are extracted from the test utterance, and the degree of their match with the reference models is obtained using a matching technique. The level of match is used to arrive at the decision.
Figure 3.1: Two distinct phases of implementing an LID system
4. Feature Extraction using RASTA-PLP
Perceptual Linear Prediction (PLP) [13] is a speech analysis technique based on the short-term spectrum of speech. PLP is sensitive to the frequency response of the communication channel, which modifies the values of the short-term spectrum. To make PLP more robust to linear spectral distortion, the RASTA method [14] is used, and it yields better results than PLP in noisy environments. The term RASTA comes from the words RelAtive SpecTrA. In the RASTA technique, a bandpass filter (with a sharp spectral zero at zero frequency) is applied to each spectral component of the critical-band spectrum estimate. This filtering emphasizes frame-to-frame spectral changes that occur at rates between 1 and 10 Hz. Before the bandpass filter is applied, log-RASTA takes the natural logarithm of each spectral component. The logarithm converts a multiplicative distortion in the frequency domain into an additive one, which can then be filtered out. Conversion to the log-spectral domain is a common approach in signal deconvolution problems. For each analysis frame, the steps of RASTA-PLP are as follows:
1. Compute the critical-band power spectrum.
2. Transform the spectral amplitude through a compressing static nonlinear transformation (the logarithm, in log-RASTA).
3. Filter the time trajectory of each transformed spectral component.
4. Transform the filtered speech representation through an expanding static nonlinear transformation (the inverse logarithm).
5. As in conventional PLP, multiply by the equal-loudness curve and raise to the power 0.33 to simulate the power law of hearing.
6. Compute an all-pole model of the resulting spectrum, following the conventional PLP technique.
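The compress, filter, and expand steps (steps 2 to 4) can be illustrated with a minimal Python sketch. The filter taps and the pole value 0.98 are assumptions taken from a commonly quoted form of the RASTA filter; this paper does not state them, and the function names `rasta_filter` and `log_rasta` are hypothetical. Critical-band analysis, equal-loudness weighting, and the all-pole model are not shown.

```python
import math

def rasta_filter(trajectory, pole=0.98):
    """Band-pass filter the time trajectory of one log-spectral component.

    Assumed transfer function (not given in this paper):
    H(z) = 0.1 * (2 + z^-1 - z^-3 - 2 z^-4) / (1 - pole * z^-1)
    """
    taps = [0.2, 0.1, 0.0, -0.1, -0.2]  # taps sum to zero: spectral zero at 0 Hz
    out = []
    for t in range(len(trajectory)):
        acc = sum(taps[k] * trajectory[t - k] for k in range(len(taps)) if t - k >= 0)
        if t > 0:
            acc += pole * out[t - 1]  # single-pole recursive (IIR) part
        out.append(acc)
    return out

def log_rasta(power_spectra, pole=0.98):
    """Apply steps 2-4 to a list of frames of critical-band powers (all > 0)."""
    n_frames = len(power_spectra)
    n_bands = len(power_spectra[0])
    # Step 2: compressing nonlinearity (natural logarithm).
    log_spec = [[math.log(p) for p in frame] for frame in power_spectra]
    # Step 3: filter each band's trajectory across frames.
    filtered = [rasta_filter([frame[b] for frame in log_spec], pole)
                for b in range(n_bands)]
    # Step 4: expanding nonlinearity (inverse logarithm).
    return [[math.exp(filtered[b][t]) for b in range(n_bands)]
            for t in range(n_frames)]
```

Because the filter has a zero at zero frequency, a fixed channel gain, which becomes a constant offset in the log domain, is progressively suppressed as more frames are processed.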
Figure 4.1: RASTA-PLP Implementation Process [block diagram; blocks labeled: Speech Signal, DFT, Logarithm & Filtering, Inverse Logarithm, Equal-Loudness Curve, Power Law of Hearing, IDFT, Solving of Set of Linear Equations, Cepstral Recursion, Cepstral Coefficients of RASTA-PLP]
5. Feature Matching using Vector Quantization
Vector quantization (VQ) techniques play a dominant role in the compression of speech signals. VQ is used in many applications, such as image and voice compression and voice recognition. It is a lossy data compression method and a fixed-to-fixed length algorithm. In a speaker/language recognition system, each speaker is represented uniquely and efficiently by vector quantization, which is based on the principle of block coding. VQ maps a $k$-dimensional vector space to a finite set $C = \{C_1, C_2, C_3, \dots, C_N\}$. The set $C$ is called a codebook; it consists of $N$ codevectors, and each codevector $C_i = \{c_{i1}, c_{i2}, c_{i3}, \dots, c_{ik}\}$ is of dimension $k$. The key to VQ is a good codebook. The method most commonly used to generate a codebook is the Linde-Buzo-Gray (LBG) algorithm [15], [16], which is also called the Generalized Lloyd Algorithm (GLA).
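As an illustration of how a codebook can be used for matching, the following Python sketch maps each feature vector to its nearest codevector and picks the language whose codebook gives the lowest average distortion. The squared Euclidean distance and the minimum-distortion decision rule are assumptions, since the paper does not specify the matching rule; all function names are hypothetical.

```python
def sq_dist(x, y):
    """Squared Euclidean distance between two k-dimensional vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def quantize(vector, codebook):
    """Return the index of the codevector nearest to `vector`."""
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))

def average_distortion(features, codebook):
    """Mean distance from each feature vector to its nearest codevector."""
    total = sum(sq_dist(v, codebook[quantize(v, codebook)]) for v in features)
    return total / len(features)

def identify_language(features, codebooks):
    """Pick the language whose codebook quantizes the features best.

    `codebooks` maps a language name to its trained codebook
    (a list of codevectors).
    """
    return min(codebooks, key=lambda lang: average_distortion(features, codebooks[lang]))
```

In this sketch, one codebook is trained per language, and a test utterance is assigned to the language whose codebook represents its feature vectors with the least distortion.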
6. LBG Algorithm
The LBG algorithm [16] is used to generate the codebooks. Its steps are as follows [16]:
1. Design a 1-vector codebook; this is the centroid of the entire set of training vectors.
2. Double the size of the codebook by splitting each current codevector $y_n$ according to the rule
$y_n^{+} = y_n(1+\varepsilon)$
$y_n^{-} = y_n(1-\varepsilon)$
where $n$ varies from 1 to the current size of the codebook, and $\varepsilon$ is a splitting parameter.
3. Find the centroids for the split codebook (i.e., the codebook of twice the size).
4. Iterate steps 2 and 3 until a codebook of size M is designed.
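The steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' MATLAB implementation: the squared Euclidean distance, the relative-distortion stopping threshold `tol`, the iteration cap `max_iter`, and all function names are assumptions. The target size is assumed to be a power of two, since each split doubles the codebook.

```python
def sq_dist(x, y):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    k = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(k)]

def lbg(training, size, eps=0.01, tol=1e-3, max_iter=100):
    """Grow a codebook by repeated splitting until it has `size` codevectors."""
    codebook = [centroid(training)]  # step 1: 1-vector codebook
    while len(codebook) < size:      # step 4: repeat until the target size
        # Step 2: split every codevector y into y(1 + eps) and y(1 - eps).
        # Note: a codevector of all zeros would split into two identical copies.
        codebook = [[c * (1 + s * eps) for c in y]
                    for y in codebook for s in (+1, -1)]
        prev = float("inf")
        for _ in range(max_iter):    # step 3: refine the centroids
            cells = [[] for _ in codebook]
            total = 0.0
            for v in training:       # assign each vector to its nearest codevector
                i = min(range(len(codebook)), key=lambda j: sq_dist(v, codebook[j]))
                cells[i].append(v)
                total += sq_dist(v, codebook[i])
            dist = total / len(training)
            # Replace each codevector by the centroid of its cell (keep it if empty).
            codebook = [centroid(cell) if cell else y
                        for cell, y in zip(cells, codebook)]
            # Stop when the relative drop in distortion falls below tol.
            if dist == 0 or (prev - dist) / dist < tol:
                break
            prev = dist
    return codebook
```

For the system described in this paper, `size` would be 16, the number of centroids reported in the results.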
Figure 5.1: Flow diagram of the LBG algorithm [flowchart; blocks labeled: Find Centroid; Split Each Centroid (m = 2*m); Cluster Vectors; Find Centroids; Compute D (distortion); test (D′ − D)/D < ε; D′ = D; test m < M; Stop]
7. Result
The training database contains speech samples of three languages, English, Hindi, and Marathi, spoken by 6 speakers (3 male and 3 female). A total of 120 sentences were recorded, 40 for each language. The testing database contains speech samples from 3 speakers (1 male and 2 female) who are different from the speakers of the training database. For the testing phase, 60 sentences were recorded, 20 for each language. The training sentences have a duration of 10 seconds, and the testing sentences have durations that vary from 3 to 10 seconds. Vector quantization is used for recognition, with 16 centroids. Table 7.1 shows the identification (ID) rate of each language. The overall identification rate achieved is 71.67%.
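Table 7.1: Result of language identification using the RASTA-PLP feature extraction method (rows: spoken language; columns: identified language)
| Spoken language (no. of test sentences) | English | Hindi | Marathi | % ID rate |
|---|---|---|---|---|
| English (20) | 15 | 1 | 4 | 75 |
| Hindi (20) | 6 | 11 | 3 | 55 |
| Marathi (20) | 0 | 3 | 17 | 85 |
Overall average ID rate = 71.67%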
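The identification rates can be recomputed from the counts in Table 7.1 as the number of correctly identified sentences divided by the number of test sentences for each language. The short Python sketch below does this; the variable and function names are hypothetical, and the counts are copied from the table rows.

```python
languages = ["English", "Hindi", "Marathi"]

# Rows: spoken language; columns: identified language, in the order of `languages`.
confusion = {
    "English": [15, 1, 4],
    "Hindi": [6, 11, 3],
    "Marathi": [0, 3, 17],
}

def id_rates(confusion, languages):
    """Per-language ID rate (%): diagonal count divided by the row total."""
    return {lang: 100.0 * confusion[lang][i] / sum(confusion[lang])
            for i, lang in enumerate(languages)}

def overall_rate(confusion, languages):
    """Overall ID rate (%): all correct decisions divided by all test sentences."""
    correct = sum(confusion[lang][i] for i, lang in enumerate(languages))
    total = sum(sum(confusion[lang]) for lang in languages)
    return 100.0 * correct / total

for lang, rate in id_rates(confusion, languages).items():
    print(f"{lang}: {rate:.2f}%")
print(f"Overall: {overall_rate(confusion, languages):.2f}%")
```

Since every language has the same number of test sentences, the overall rate equals the mean of the three per-language rates.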
8. Conclusion
The results show that the ID rates for English and Marathi are better than that for Hindi. VQ is used because of its ease of implementation, and it gives satisfactory results with RASTA-PLP features. The identification rate is independent of the speakers and the text used in the training phase and depends on the feature extraction method.
References
[1] V. W. Zue and J. R. Glass, “Conversational Interfaces: Advances and Challenges”, Proceedings of the
IEEE, Vol. 88, No. 8, pp. 1166-1180, 2000.
[2] A. Waibel, P. Geutner, L. M. Tomokiyo, T. Schultz and M. Woszczyna, “Multilinguality in Speech and
Spoken Language Systems”, Proceedings of the IEEE, Vol. 88, No. 8, pp. 1181-1190, 2000.
[3] N. Bertoldi and M. Federico, “Cross-Language Spoken Document Retrieval on the TREC SDR
Collection”, Proc. CLEF 2002: Workshop on Cross-Language Information Retrieval and Evaluation,
Rome, 2002, Springer Verlag.
[4] Y. K. Muthusamy, N. Jain, and R. A. Cole, “Perceptual benchmarks for automatic language
identification”, in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Adelaide, Australia, 1994, vol.
1, pp. 333–336.
[5] D. A. van Leeuwen, M. Boer, and R. Orr, “A human benchmark for the NIST language recognition
evaluation 2005”, presented at the Odyssey: Speaker Language Recognition.
[6] M. A. Zissman, “Comparison of Four Approaches to Automatic Language Identification of Telephone
Speech”, IEEE Transactions On Speech And Audio Processing, Vol. 4, No. 1, January 1996.
[7] L.F. Lamel and J.L. Gauvain, “Language Identification Using Phone-based Acoustic Likelihoods”,
ICASSP, 1994.
[8] Khe Chai Sim and Haizhou Li, “On Acoustic Diversification Front-End for Spoken Language
Identification”, IEEE Transactions On Audio, Speech, And Language Processing, Vol. 16, No. 5, July
2008.
[9] O. Glembek, P. Matejka, L. Burget, and T. Mikolov, “Advances in phonotactic language recognition”, in
Proc. Interspeech, 2008, p. 4.
[10] Leena Mary, B. Yegnanarayana, “Extraction and representation of prosodic features for language and
speaker recognition”, Speech Communication 50 (2008) 782–796.
[11] M. Sadanandam, V. K. Prasad, V. Janaki, “Text Independent Language Recognition using Dhmm”,
International Journal of Computer Applications (0975 – 888) Volume 48– No.7, June 2012.
[12] M. Elkholy and N. Korany, “Effect of Feature Extraction Techniques on the Performance of Speaker
Identification”, International Journal of Signal Processing Systems, Vol. 1, No. 1 June 2013.
[13] H. Hermansky, “Perceptual linear predictive (PLP) analysis for speech”, J. Acoust. Soc. Am., pp. 1738-
1752, 1990.
[14] H. Hermansky and N. Morgan, “RASTA Processing of Speech”, IEEE Trans. On Speech and Audio
Processing, Vol. 2, pp. 578-589, Oct. 1994.
[15] R. M. Gray, “Vector quantization”, IEEE ASSP Mag., pp. 4-29, Apr. 1984.
[16] Y. Linde, A. Buzo, and R. M. Gray, “An algorithm for vector quantizer design”, IEEE Trans. Commun.,
vol. COM-28, no. 1, pp. 84-95, 1980.