Development And Suitability Of Indian Languages Speech Database For Building Watson Based ASR System

Dipti Pandey, KIIT College of Engg., Gurgaon, India (dips.pande@gmail.com)
Tapabrata Mondal, Jadavpur University, Kolkata, India (tapabratamondal@gmail.com)
S. S. Agrawal, KIIT College of Engg., Gurgaon, India (dr.shyamsagrawal@gmail.com)
Srinivas Bangalore, AT&T Labs, Florham Park, NJ (…com)
Abstract—In this paper, we discuss our efforts in the development of Indian spoken language corpora for building large-vocabulary speech recognition systems using the WATSON Toolkit. The paper demonstrates that these corpora can be reduced to a varying degree for various phonemes by comparing the similarity among phonemes of different languages. We also discuss the design and methodology of collecting the speech databases and the challenges we faced during database creation. The experiments have been conducted on commonly known Indian languages by training the ASR system with the WATSON toolkit and evaluating it with Sclite. The results of these experiments show that different Indian languages have great similarity among their phoneme structures and phoneme sequences, and we have exploited these features to create a speech recognition system. We have also developed an algorithm for bootstrapping the phonemes of one language into another by mapping the phonemes of different languages. The performance of Hindi and Bangla ASR systems built using these databases has been compared.
Keywords: Speech Recognition, Speech Databases, Indian Languages
I. INTRODUCTION
Researchers are currently striving to improve the accuracy of speech processing techniques for various applications. In recent years, several researchers have focused on the development of suitable speech databases for Indian languages for building speech recognition systems: Samudravijaya et al. [1], R. K. Agarwal [2], Chourasia et al. [3], Shweta Sinha & S. S. Agarwal [4], Srinivas Bangalore [5], Ahuja et al. [6], and Maya Ingle and Manohar Chandwani [7].
In this paper, our goal is to develop a speech recognition system that uses Indian language corpora through the WATSON Toolkit. For developing a large-vocabulary speech recognition system, we concentrate on languages that have strong similarities. The work could benefit the large number of people working in the field of speech recognition, as our research involves a comparative study of the phonemes of different languages. Indian languages are basically phonetic in nature, and there exists a one-to-one correspondence between orthography and pronunciation for all the sounds, barring a few exceptions.
II. ASR SYSTEM ARCHITECTURE
The architecture of the speech recognition system is shown in Fig 1. It contains two modules: the Training Module and the Testing Module. The Training Module generates the system model against which test data is compared to obtain the performance percentage. The Testing Module compares the test data with the trained model and yields the 1-best hypothesis.
First, the Pronunciation Dictionary is created using the G2P model (Section V-A), which is trained with 30,000 linguistically correct words. Based on these words, English phonemes for the different Hindi graphemes have been generated. For creating the G2P model, we have used Moses [8]. The pronunciation dictionary, along with the mapping dictionary (Section III-C), represents the different possible pronunciations or occurrences of a word. The Language Model has been created using a large set of text data to capture all the possible occurrences of a phoneme in a word, or of a word in a sentence. Thus, the Language Model has been created to strengthen the Acoustic Models (Section III-A).
In the Testing Module, Sclite [9] is used for evaluating the 1-best hypothesis of each word. The average of the accuracies over the different words gives the overall accuracy of the speech recognition system as a word accuracy percentage. Insertion, deletion and substitution errors can also be computed.
Fig 1. ASR SYSTEM ARCHITECTURE
III. BUILDING ASR SYSTEMS
Typically, an ASR system comprises three major constituents: the acoustic models, the language model and the phonetic lexicons.
A. Acoustic Models
In this experiment, context-independent as well as context-dependent models of Hindi & Bangla have been created by borrowing phonemes from English. Context-independent models are mono-phone models, taking each phone as an individual sound unit. Context-dependent models additionally take into account the probability of occurrence of a phone relative to its neighbouring phones. The data used for creating the acoustic models for Hindi and Bangla is shown in Table 1 and Table 2 respectively.
TABLE 1. CORPUS USED FOR HINDI ACOUSTIC MODELS
Corpus                    Sentences   Speakers (Male/Female)
General Messages          1260        3 Male, 2 Female
Health & Tourism Corpus   41282       2 Male, 2 Female
News Feeds                800         8 Male, 4 Female
Philosophical Data        1000        3 Male, 2 Female
TABLE 2. CORPUS USED FOR BENGALI ACOUSTIC MODELS
Corpus                        Sentences   Speakers (Male/Female)
Shruti Bangla Speech Corpus   7383        2 Male, 4 Female
TDIL Data                     1000        1 Male
Health & Tourism Corpus       41282       2 Male, 2 Female
We have trained the HMM models using the Watson Toolkit [10]. For parameterization, Mel Frequency Cepstral Coefficients (MFCC) have been computed. At recognition time, various words are hypothesized against the speech signal. To compute the likelihood of a word, the 1-best hypothesis of each individual word of the text data has been taken with the help of Sclite. The combined likelihood of all the phonemes represents the likelihood of the word in the acoustic models.
B. Language Model
For the language model, a very large set of text data is required so that all the possible occurrences of a word in Indian languages can be captured. The text data used for the language model is shown in Table 3 and Table 4.
TABLE 3: TEXT DATA USED FOR HINDI LANGUAGE MODEL
Corpus                    Sentences   Total Words   Unique Words
General Messages          1260        65300         54324
Health & Tourism Corpus   41282       90140         67522
News Feeds                800         7727          3351
Philosophical Data        1000        135400        64360
Wikipedia                 19020       415818        175265
TABLE 4: TEXT DATA USED FOR BANGLA LANGUAGE MODEL
Corpus                        Sentences   Total Words   Unique Words
Shruti Bangla Speech Corpus   7383        22012         10054
TDIL Data                     1000        25240         6720
Health & Tourism              41282       675915        91033
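To illustrate how a language model of this kind captures word-sequence statistics, the following is a minimal sketch of a count-based bigram model in Python. It is illustrative only and does not reflect the internal language-model format of the WATSON toolkit; the example sentences are placeholders.

```python
from collections import Counter

def train_bigram_counts(sentences):
    """Count unigrams and bigrams over tokenized sentences,
    padding each sentence with start/end markers."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_prob(unigrams, bigrams, w1, w2):
    """Maximum-likelihood estimate of P(w2 | w1); 0.0 if w1 is unseen."""
    if unigrams[w1] == 0:
        return 0.0
    return bigrams[(w1, w2)] / unigrams[w1]

# Toy corpus of two romanized Hindi sentences (placeholders).
uni, bi = train_bigram_counts(["main ghar ja raha hoon", "main ghar gaya"])
print(bigram_prob(uni, bi, "main", "ghar"))  # 1.0: "ghar" always follows "main"
```

A real system would add smoothing for unseen bigrams; this sketch only shows why a large and varied text corpus matters: every unseen word sequence gets zero probability.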
C. Lexicon Model
The lexicon model is a dictionary which maps words to phoneme sequences. In this experiment, we have developed a pronunciation dictionary, a mapping dictionary and a grouping of the phones.
Pronunciation Dictionary: It contains the lexicon entries for each individual word, transcribing how the word can be pronounced using English phoneme labels.
Mapping Dictionary: It maps each phoneme of a particular language to English phonemes. This mapping is shown in Appendix 1 (vowels) and Appendix 2 (consonants). In this way, we have represented the sounds of Hindi & Bangla using English phonemes.
Grouping of Phones: The phones are grouped on the basis of place and manner of articulation (Appendix 1 & 2). When the engine is unable to decide on a particular phone, it can find the correct phone with the help of such grouping, by looking into the category the phoneme belongs to.
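A minimal sketch of the mapping-dictionary and grouping idea is shown below. The table entries and group names are illustrative placeholders based on Appendix 1 & 2, not the actual data structures used in the experiment.

```python
# Illustrative fragment of a mapping dictionary:
# native grapheme/phoneme -> English phoneme label (cf. Appendix 1 & 2).
HINDI_TO_ENGLISH = {"क": "k", "ख": "kh", "ग": "g", "अ": "ax", "आ": "aa"}

# Illustrative grouping of English phoneme labels by articulation.
PHONE_GROUPS = {
    "velar_stops": ["k", "kh", "g", "gh"],
    "short_vowels": ["ax", "i", "u"],
}

def group_of(phone):
    """Return the articulatory group a phone belongs to, or None.
    Such a lookup lets the engine fall back to a phone category
    when a specific phone cannot be decided."""
    for group, members in PHONE_GROUPS.items():
        if phone in members:
            return group
    return None

print(group_of("kh"))  # velar_stops
```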
IV. HINDI & BANGLA PHONE SETS
To represent the sounds of the acoustic space, a set of phonemes [11] is required, which can come either from a particular language or from the sounds of a combination of languages.
The IPA [12] has defined phone sets for labeling speech databases for a large number of languages (including Hindi). However, some sounds used for the purpose of speech recognition are not included in the IPA. In a continuous speech recognition task, the purpose of defining a phonetic space is to form a well-defined phone set which can represent all the sounds that exist in a language. We have therefore used a set of phoneme sequences from which all the sounds can be derived, either individually or by clustering these phonemes.
Some phonemes exist in the text data only, not in the audio files. Since a phoneme may be pronounced by speakers in different ways, these variabilities have been captured.
For example: व (/v/) is written in text form in Bangla too, but it is pronounced as ब /b/.
A. Challenges
While dealing with Indian phone sets, the following challenges have been faced.
Nasal Sounds: Handling nasal sounds is a real task, especially when a vowel is followed by a consonant. For example, in अं, the vowel अ is followed by the nasal consonant न. We have clustered the respective vowel and consonant, using their 3 HMM states, in order to have a strong recognition system which is able to recognize almost all the phonemes.
OOV (Out-of-Vocabulary) Problem: During our experiments, the OOV problem occurred frequently. OOV refers to words in the test speech that are not present in the dictionary. To handle this, we have added such phones to the vocabulary.
Clusters of Sounds: Some sounds in Hindi are clusters of two or more different phonemes. To define these sounds, we have taken the 3 HMM states of each constituent phoneme and clustered them to obtain a new sound. Examples of clustering are shown in Table 5.
TABLE 5: EXAMPLES OF CLUSTERING OF SOUNDS
Phonemes Clustering of Sounds
ओं ao,2 ao,3 n,2
उं uh,2 uh,3 n,2
ईं iy,2 iy,3 n,2
ञ y,2 y,3 n,2
त्र t,2 t,3 r,2
झ j,3 h,2 h,3
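The Table 5 entries can be read as lists of (phone, HMM-state) pairs; a clustered sound is defined by concatenating selected states of its constituent phones. The parser below is an illustrative sketch of that notation, not part of the actual toolkit.

```python
def cluster_states(spec):
    """Parse a Table-5 style cluster spec such as 't,2 t,3 r,2'
    into a list of (phone, state) pairs, i.e. the sequence of
    HMM states that together model the clustered sound."""
    return [(phone, int(state))
            for phone, state in (token.split(",") for token in spec.split())]

# The cluster for त्र from Table 5: states 2 and 3 of /t/ followed by state 2 of /r/.
print(cluster_states("t,2 t,3 r,2"))  # [('t', 2), ('t', 3), ('r', 2)]
```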
Phonemes not common to Hindi & Bengali: Some phonemes have been observed to exist in Hindi but not in Bengali, and vice versa. The list of these phones is given in Table 6. As we have dealt with both languages, we have trained the ASR individually for each language with its own phoneme set, and computed their accuracies separately.
TABLE 6: LIST OF PHONEMES NOT COMMON IN HINDI & BENGALI AUDIO FILES
Common in Hindi & Bangla: 47 phonemes
Only in Hindi (10 phonemes): व /v/, क़ /q/, ञ /ɲ/, य /j/, ष /ʂ/, ख़ /x/, ग़ /ɣ/, ज़ /z/, झ़ /ʒ/, फ़ /f/
Only in Bangla (3 phonemes): রং /ŋ/, ঐ /oj/, ঔ /ow/
V. METHODOLOGY TO DEVELOP TEXT CORPORA
A. Grapheme to Phoneme Conversion (G2P)
To analyze the distribution of the basic recognition units (phones, di-phones, syllables, etc.) in the text corpus, the corpus has to be phonetized. G2P converters are tools that convert the text corpus into its phonetic equivalent. The phonetic nature of Indian languages reduces the effort of building individual mapping tables and rules for the lexical representation. These rules and the mapping tables (Appendix 1 & 2) together constitute the Grapheme to Phoneme converter.
We have used Moses [8] for G2P conversion, training it with 4280 unique words whose phonetic equivalents were prepared by linguists. The remaining graphemes are given as input, and their phoneme equivalents are taken as output.
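Because Indian scripts are largely phonetic, a G2P converter can in principle be approximated by a longest-match lookup over a grapheme-to-phoneme table. The sketch below shows that idea only; the actual system uses Moses [8], and the table entries here are illustrative placeholders in the style of Appendix 1 & 2.

```python
# Illustrative grapheme -> phoneme table (placeholders, not the trained model).
G2P_TABLE = {"क": "k", "ख": "kh", "म": "m", "ल": "l", "अ": "ax", "ा": "aa"}

def g2p(word):
    """Greedy longest-match grapheme-to-phoneme conversion:
    at each position, consume the longest grapheme chunk found
    in the table; skip graphemes the table does not cover."""
    phones, i = [], 0
    while i < len(word):
        for length in range(min(3, len(word) - i), 0, -1):
            chunk = word[i:i + length]
            if chunk in G2P_TABLE:
                phones.append(G2P_TABLE[chunk])
                i += length
                break
        else:
            i += 1  # grapheme not in table: skip it
    return phones

print(g2p("कमल"))  # ['k', 'm', 'l']
```

A statistical G2P model like Moses additionally learns context-dependent mappings from the training pairs, which pure table lookup cannot express.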
B. Rapid Bootstrapping
Language adaptation technology enables us to rapidly bootstrap a speech recognizer in a new target language.
Converting the phonemes of one language into another: In this experiment, we have developed an algorithm to convert each phoneme of a particular language into the corresponding Hindi phoneme, so that we can use more data for Hindi taken from other Indian languages. For this, we have mapped each phoneme of the source language individually to the respective phoneme in Hindi; if it matches a particular Hindi phoneme, the algorithm outputs that Hindi character as the converted phoneme. Thus, for the text data of a particular Indian language, we obtain the data in Hindi phonemes, and further processing can then be done.
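The bootstrapping step above can be sketched as a per-phoneme mapping through a shared (English-phoneme) pivot, as in the mapping dictionary of Section III-C. This is a sketch of the idea only; the table entries are illustrative placeholders, not the full mapping used in the experiment.

```python
# Illustrative source-language (Bangla) phoneme -> pivot (English phoneme) table.
BANGLA_TO_PIVOT = {"ক": "k", "খ": "kh", "ম": "m"}
# Illustrative pivot -> Hindi character table.
PIVOT_TO_HINDI = {"k": "क", "kh": "ख", "m": "म"}

def bootstrap_to_hindi(source_phonemes):
    """Map each source-language phoneme to its Hindi counterpart via
    the pivot set; phonemes with no Hindi match are dropped."""
    out = []
    for ph in source_phonemes:
        pivot = BANGLA_TO_PIVOT.get(ph)
        if pivot in PIVOT_TO_HINDI:
            out.append(PIVOT_TO_HINDI[pivot])
    return out

print(bootstrap_to_hindi(["ক", "ম"]))  # ['क', 'म']
```

Phonemes that exist only in the source language (Table 6) have no Hindi match and must be handled separately, e.g. by mapping to the nearest phone in the same articulatory group.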
VI. COLLECTION OF AUDIO DATA
In this section, the steps involved in building the speech corpora are discussed. Two channels, a head-mounted microphone and a mobile phone, have been used to record the data simultaneously.
A. Speaker Selection & Transcription of Audio Files
For training purposes, speech data was collected from native speakers of the different languages who were comfortable in speaking and reading the particular language, so as to sufficiently capture the diversity attributable to gender, age and dialect.
B. Transcription Corrections
Although care was taken to record the speech with minimal background noise and pronunciation mistakes, some errors still crept in during recording. These errors had to be identified manually by listening to the speech. The pronunciation mistakes were carefully identified and, where possible, the corresponding changes were made in the transcriptions so that the utterance and transcription correspond to each other. The idea behind this was to make the utmost use of the data and to let it serve as a corpus for further related research work.
C. Data Statistics
For open-set speech recognition, the system has been trained with 70% of the overall corpus and the remaining 30% has been used as test data. For closed-set recognition, the overall data is used for training, and some data from the same set is used as test data.
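The 70/30 open-set split described above can be sketched as follows; this is a generic illustration, not the partitioning script actually used for the corpora.

```python
import random

def open_set_split(utterances, train_fraction=0.7, seed=0):
    """Shuffle the corpus and split it into training and test sets,
    so that no test utterance appears in the training data (open set)."""
    items = list(utterances)
    random.Random(seed).shuffle(items)  # fixed seed for reproducibility
    cut = int(len(items) * train_fraction)
    return items[:cut], items[cut:]

train, test = open_set_split(range(100))
print(len(train), len(test))  # 70 30
```

For closed-set evaluation, by contrast, the test utterances are drawn from the training data itself.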
VII. ASR EVALUATION RESULTS
We have conducted several experiments to assess the relevance of our approach. Two individual recognition engines, for Hindi and Bengali, have been developed. Each system has been trained with the corpus of its respective language.
A. Overall Performance of Hindi and Bangla ASR
The overall performance of the Hindi and Bangla ASR when using 70% of the data as the training set and the remaining 30% as the test set is shown in Table 7 and Table 8 respectively.
TABLE 7: OVERALL PERFORMANCE OF HINDI ASR
Task Name    Num Phrases   Beam Width   Word Accuracy   Clock Time
Output 170   174           170          51.5            221.00
Output 190   174           190          57.2            311.81
Output 210   174           210          61.0            431.27
Output 230   174           230          61.8            589.81
Output 250   174           250          62.1            781.11
TABLE 8: OVERALL PERFORMANCE OF BANGLA ASR
Task Name    Num Phrases   Beam Width   Word Accuracy   Clock Time
Output 170   174           170          47.3            81.04
Output 190   174           190          51.8            108.23
Output 210   174           210          54.3            144.43
Output 230   174           230          54.2            195.63
Output 250   174           250          54.9            266.24
B. Hindi Speech Recognition
For the Hindi recognition engine, we have both trained and tested the system with the Hindi database. For testing, we have used both closed and open sets. The accuracy for Hindi is shown below, where for the open set the test set is outside the training data, and for the closed set the test set is drawn from the training data:
Fig 2. Word accuracy percentage of Hindi ASR
C. Bangla Speech Recognition
For Bangla recognition, we have trained the system with Bengali data and then tested it both with the same data and with a different subset, giving the accuracy of both the closed and the open set. The performance of the Bangla recognition engine is as follows:
Fig 3: Word accuracy percentage of Bangla ASR
This shows that the best way to improve accuracy is to add more speakers to the training set. The evaluation of the experiment was made according to the recognition accuracy, computed using the word error rate (WER), which aligns each recognized word against the correct word and counts the number of substitutions (S), deletions (D) and insertions (I), together with the number of words in the correct sentence (N):
WER = 100 × (S + D + I) / N
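The WER formula above can be computed by aligning the two word sequences with edit distance, where the total S + D + I is the Levenshtein distance between reference and hypothesis. A minimal sketch (Sclite performs this alignment and reporting in practice):

```python
def wer(ref, hyp):
    """Word error rate: 100 * (S + D + I) / N, where the error counts
    come from a Levenshtein alignment of the word sequences."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(h) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            substitution = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[len(r)][len(h)] / len(r)

print(wer("the cat sat", "the cat sat"))  # 0.0
print(wer("the cat sat", "the mat"))      # one substitution + one deletion
```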
D. Using the Transliterated Data
In this experiment, 50 sentences of Bangla were taken. After their transliteration into Hindi [13], the sentences were recorded by a native Hindi speaker and used as an additional test set. The parallel sentences of Bangla and Hindi were then tested with the same Bangla ASR. As the acoustic model is the same for both languages, we can use both parallel sets of files as test sets with the same system.
The accuracy of the Bengali corpus and its Hindi transliteration when tested with the same Bengali ASR is shown in Table 9.
TABLE 9: ACCURACIES OF ORIGINAL BANGLA SENTENCES & THEIR TRANSLITERATED VERSION
Language   Sentences from Testing Set   Accuracy (%)
Hindi      50                           57.4
Bangla     50                           64.2
The system has been trained on the original Bangla sentences, so it gives better accuracy for the Bangla sentences than for their Hindi transliterated versions. Since the difference in accuracy is not large, the transliteration is effective. Thus, we can increase the text corpus of a particular language by using transliterated data obtained from another language.
E. Comparison of Hindi & Bengali ASR Models
In this experiment, we have built acoustic models for the Bangla corpus and its Hindi transliterated corpus individually, keeping the language model the same. Thus, for the Hindi ASR we have used 50 Hindi sentences and 50 Bangla-to-Hindi transliterated sentences as test sentences, and similarly for the Bangla ASR. The accuracies observed in these cases are shown in Table 10.
TABLE 10: COMPARISON OF HINDI & BANGLA ASR
Language   Testing Sentences   Accuracy (%)
Hindi      50                  74.2
Bangla     50                  65.6
As we have a larger Hindi corpus than Bangla, the accuracy of the Hindi ASR is better. It can thus be concluded that accuracy can be improved by increasing the corpus size.
VIII. IMPROVING ASR ACCURACIES USING MORE DATA FROM VARIOUS LANGUAGES
We have collected data from various commonly known Indian languages and transliterated the whole dataset into the Hindi alphabet, so that we can capture many of the variations which occur across Indian languages. With this data, both the acoustic and the language model are improved. Since Indian languages are phonetically rich, with a close correspondence between orthography and pronunciation, we can capture all the possible phoneme realizations. The experiments show that the ASR accuracy can also be improved by using a large corpus built in this manner.
IX. CONCLUSION & FUTURE WORK
In this paper, we discussed the design and development of speech databases for two Indian languages, Hindi and Bangla, and their suitability for developing ASR systems using the WATSON tool. The simple methodology of database creation presented here will serve as a catalyst for the creation of speech databases in other Indian languages. Some of the conclusions of our study are:
Female speakers perform better when the system is trained with a female-voice database alone.
The accuracy of the system is better when it is trained with a variety of speakers and speaking styles, as compared to simply increasing the corpus from a limited number of speakers.
Native speakers perform better than non-native speakers under all conditions.
As the decoding beam-width increases, word accuracy also increases.
Word accuracy also increases with clock time.
We hope that the ASRs created using the databases developed in this work will serve as baseline systems for further research in improving the accuracies in each of the languages. Our future work is focused on tuning these models and testing them using language and acoustic models built from a much larger corpus collected from a large number of speakers.
X. ACKNOWLEDGEMENT
We would like to acknowledge the help and support received from Mr. Anirudhha of IIIT, Hyderabad in conducting these experiments. We are also thankful to Prof. Michael Carl of CBS, Copenhagen, and to the KIIT management, in particular Dr. Harsh V. Kamrah and Mrs. Neelima Kamrah, for providing the necessary facilities, financial help and encouragement, and to DIETY for providing a fellowship to one of the authors, Dipti Pandey.
REFERENCES
[1] Samudravijaya K, P. V. S. Rao, and S. S. Agrawal, "Hindi speech database," Proc. Int. Conf. on Spoken Language Processing (ICSLP 2000), Beijing, China, October 2000, CDROM paper 00192.pdf.
[2] K. Kumar and R. K. Agarwal, "Hindi Speech Recognition System Using HTK," International Journal of Computing and Business Research, Vol. 2, No. 2, 2011, ISSN (Online): 2229-6166.
[3] Chourasia, K. Samudravijaya, and Chandwani, "Phonetically rich Hindi sentences corpus for creation of speech database," Proc. O-COCOSDA 2005, pp. 132-137.
[4] Shweta Sinha, S. S. Agrawal, and Jesper Olsen, "Mobile speech Hindi database," O-COCOSDA 2011, Hsinchu, Taiwan.
[5] www.mastar.jp/wfdtr/presentation/2_Dr.Bangalore
[6] Ahuja, R., Bondale, N., Furtado, X., Krishnan, S., Poddar, P., Rao, P. V. S., Raveendran, R., Samudravijaya K, and Sen, A., "Recognition and Synthesis in the Hindi Language," Proceedings of the Workshop on Speech Technology, IIT Madras, pp. 3-19, Dec. 1992.
[7] Vishal Chourasia, Samudravijaya K, Maya Ingle, and Manohar Chandwani, "Hindi speech recognition under noisy conditions," J. Acoust. Soc. India, 54(1), pp. 41-46, January 2007.
[8] http://www.statmt.org/moses/manual/manual.pdf
[9] http://www1.icsi.berkeley.edu/Speech/docs/sctk-1.2/sclite.htm
[10] http://www.research.att.com/projects/WATSON/?fbid=2tgRMa1CfjG
[11] S. S. Agrawal, K. Samudravijaya, and Karunesh Arora, "Text and Speech Corpora Development in Indian Languages," Proceedings of ICSLT-O-COCOSDA 2004, New Delhi, India.
[12] www.madore.org/~david/misc/linguistic/ipa/
[13] http://en.wikipedia.org/wiki/Devanagari_transliteration
APPENDIX
The characterization of Hindi and Bangla phonemes has been done as follows:

Appendix 1: Characterization of Vowels
CATEGORY              HINDI   BENGALI   IPA     ENGLISH PHONEMES
Monophthongs (Short)  अ       অ         /ə/     AX
                      इ       ই         /i/     I
                      उ       উ         /u/     U
                      ऋ       ঋ         /       RR
Monophthongs (Long)   आ       আ         /aː/    AA
                      ई       ঈ         /iː/    II
                      ऊ       ঊ         /uː/    UU
                      ए       এ         /e/     E
                      ओ       ও         /o/     O
Diphthongs            ऐ       ঐ         /æ/     AI
                      औ       ঔ         /ɔː/    AU
Appendix 2: Characterization of Consonants
CATEGORY                  HINDI   BENGALI   IPA      ENGLISH PHONEMES
Unaspirated (Unvoiced)    क       ক         /k/      k
                          च       চ         /tʃ/     c
                          ट       ট         /ʈ/      tt
                          त       ত         /t/      t
                          प       প         /p/      p
Aspirated (Unvoiced)      ख       খ         /kʰ/     kh
                          छ       ছ         /tʃʰ/    ch
                          ठ       ঠ         /ʈʰ/     tth
                          थ       থ         /tʰ/     th
                          फ       ফ         /pʰ/     ph
Unaspirated (Voiced)      ग       গ         /g/      g
                          ज       জ         /dʒ/     j
                          ड       ড         /ɖ/      dd
                          द       দ         /d/      d
                          ब       ব         /b/      b
Aspirated (Voiced)        घ       ঘ         /gʰ/     gh
                          झ       ঝ         /dʒʰ/    jh
                          ढ       ঢ         /ɖʱ/     ddh
                          ध       ধ         /dʰ/     dh
                          भ       ভ         /bʰ/     bh
Nasals                    ड़       ড়         /ɽ/      ddn
                          ञ       ঞ         /ɲ/      ny
                          ण       ণ         /ɳ/      nn
                          न       ন         /n/      n
                          म       ম         /m/      m
Semivowels/Approximants   य       য         /j/      y
                          र       র         /r/      r
                          ल       ল         /l/      l
                          व       ব         /v/      w
Sibilants                 श       শ         /ʃ/      sh
                          ष       ষ         /ʂ/      sh^
                          स       স         /s/      s
Glottal                   ह       হ         /h/      h