Development And Suitability Of Indian Languages Speech Database For Building Watson Based ASR System

Dipti Pandey, KIIT College of Engg., Gurgaon, India (dips.pande@gmail.com)
Tapabrata Mondal, Jadavpur University, Kolkata, India (tapabratamondal@gmail.com)
S. S. Agrawal, KIIT College of Engg., Gurgaon, India (dr.shyamsagrawal@gmail.com)
Srinivas Bangalore, AT&T Labs, Florham Park, NJ (…com)
Abstract—In this paper, we discuss our efforts in the development of Indian spoken language corpora for building large-vocabulary speech recognition systems using the WATSON Toolkit. The paper demonstrates that these corpora can be reduced to a varying degree for various phonemes by comparing the similarity among phonemes of different languages. We also discuss the design and methodology of collecting the speech databases and the challenges we faced during database creation. The experiments have been conducted on commonly known Indian languages by training the ASR system with the WATSON toolkit and evaluating it with Sclite. The results of these experiments show that different Indian languages have great similarity among their phoneme structures and phoneme sequences, and we have exploited these features to create a speech recognition system. We have also developed an algorithm for bootstrapping the phonemes of one language into another by mapping the phonemes of different languages. The performance of Hindi and Bangla ASR systems built using these databases has been compared.
Keywords: Speech Recognition, Speech Databases, Indian Languages
I. INTRODUCTION
Researchers are currently striving to improve the accuracy of speech processing techniques for various applications. In recent years, several researchers have focused on the development of suitable speech databases for Indian languages for building speech recognition systems: Samudravijaya et al. [1], R. K. Agarwal [2], Chourasia et al. [3], Shweta Sinha & S. S. Agarwal [4], Srinivas Bangalore [5], Ahuja et al. [6], and Maya Ingle and Manohar Chandwani [7].
In this paper, our goal is to develop a speech recognition system that uses Indian language corpora through the WATSON Toolkit. For developing a large-vocabulary speech recognition system, we concentrate on languages that have strong similarities. The work could benefit the large number of people working in the field of speech recognition, as our research involves a comparative study of the phonemes of different languages. Indian languages are basically phonetic in nature, and there exists a one-to-one correspondence between orthography and pronunciation for all the sounds, barring a few exceptions.
II. ASR SYSTEM ARCHITECTURE
The architecture of the speech recognition system is shown in Fig 1. It contains two modules: the Training Module and the Testing Module. The Training Module generates the system model against which test data is compared to obtain the performance percentage. The Testing Module compares the test data with the trained model and yields the 1-best hypothesis.
First, the Pronunciation Dictionary is created using the G2P model (Section V-A), which is trained with 30,000 linguistically correct words. Based on these words, English phonemes for the different Hindi graphemes have been generated. For creating the G2P model, we have used Moses [8]. The pronunciation dictionary, along with the mapping dictionary (Section III-C), represents the different possible pronunciations or occurrences of a word. The Language Model has been created using a large set of text data to capture all the possible occurrences of a phoneme in a word, or of a word in a sentence. Thus, the Language Model has been created to strengthen the Acoustic Models (Section III-A).
In the Testing Module, Sclite [9] is used for evaluating the 1-best hypothesis of each word. The average of the accuracies over the different words gives the overall accuracy of the speech recognition system as a word accuracy percentage. Insertion, deletion and substitution errors can also be computed.
Fig 1. ASR SYSTEM ARCHITECTURE
III. BUILDING ASR SYSTEMS
Typically, an ASR system comprises three major constituents: the acoustic models, the language model and the phonetic lexicons.
A. Acoustic Models
In this experiment, context-independent as well as context-dependent models of Hindi & Bangla have been created by borrowing phonemes from English. Context-independent models are mono-phone models, taking each phone as an individual sound unit. Context-dependent models additionally take into account the probability of occurrence of a phone relative to its neighbouring phones. The data used for creating the acoustic models for Hindi and Bangla is shown in Table 1 and Table 2 respectively.
TABLE 1. CORPUS USED FOR HINDI ACOUSTIC MODELS
Corpus                    Sentences   Speakers (Male/Female)
General Messages          1260        3 Male, 2 Female
Health & Tourism Corpus   41282       2 Male, 2 Female
News Feeds                800         8 Male, 4 Female
Philosophical Data        1000        3 Male, 2 Female
TABLE 2. CORPUS USED FOR BENGALI ACOUSTIC MODELS
Corpus                        Sentences   Speakers (Male/Female)
Shruti Bangla Speech Corpus   7383        2 Male, 4 Female
TDIL Data                     1000        1 Male
Health & Tourism Corpus       41282       2 Male, 2 Female
We have trained the HMM models using the Watson Toolkit [10]. For parameterization, Mel Frequency Cepstral Coefficients (MFCC) have been computed. At recognition time, various words are hypothesized against the speech signal. To compute the likelihood of a word, the 1-best hypothesis of each individual word of the text data has been taken with the help of Sclite. The combined likelihood of all the phonemes represents the likelihood of the word in the acoustic models.
B. Language Model
For the language model, a very large set of text data is required so that all the possible occurrences of a word in Indian languages can be captured. The text data used for the language model is shown in Table 3 and Table 4.
TABLE 3: TEXT DATA USED FOR HINDI LANGUAGE MODEL
Corpus                    Sentences   Total Words   Unique Words
General Messages          1260        65300         54324
Health & Tourism Corpus   41282       90140         67522
News Feeds                800         7727          3351
Philosophical Data        1000        135400        64360
Wikipedia                 19020       415818        175265
TABLE 4: TEXT DATA USED FOR BANGLA LANGUAGE MODEL
Corpus                        Sentences   Total Words   Unique Words
Shruti Bangla Speech Corpus   7383        22012         10054
TDIL Data                     1000        25240         6720
Health & Tourism              41282       675915        91033
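To illustrate how a language model of this kind captures word-sequence statistics, the following is a minimal sketch of a count-based bigram model in Python. It is illustrative only and does not reflect the internal language-model format of the WATSON toolkit; the example sentences are placeholders.

```python
from collections import Counter

def train_bigram_counts(sentences):
    """Count unigrams and bigrams over tokenized sentences,
    padding each sentence with start/end markers."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_prob(unigrams, bigrams, w1, w2):
    """Maximum-likelihood estimate of P(w2 | w1); 0.0 if w1 is unseen."""
    if unigrams[w1] == 0:
        return 0.0
    return bigrams[(w1, w2)] / unigrams[w1]

# Toy corpus of two romanized Hindi sentences (placeholders).
uni, bi = train_bigram_counts(["main ghar ja raha hoon", "main ghar gaya"])
print(bigram_prob(uni, bi, "main", "ghar"))  # 1.0: "ghar" always follows "main"
```

A real system would add smoothing for unseen bigrams; this sketch only shows why a large and varied text corpus matters: every unseen word sequence gets zero probability.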
C. Lexicon Model
The lexicon model is a dictionary which maps words to phoneme sequences. In this experiment, we have developed a pronunciation dictionary, a mapping dictionary and a grouping of the phones.
Pronunciation Dictionary: It contains the lexicon entries for each individual word, transcribing how the word can be pronounced using English phoneme labels.
Mapping Dictionary: It maps each phoneme of a particular language to English phonemes. This mapping is shown in Appendix 1 (vowels) and Appendix 2 (consonants). In this way, we have represented the sounds of Hindi & Bangla using English phonemes.
Grouping of Phones: The phones are grouped on the basis of place and manner of articulation (Appendix 1 & 2). When the engine is unable to decide on a particular phone, it can find the correct phone with the help of such grouping, by looking into the category the phoneme belongs to.
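A minimal sketch of the mapping-dictionary and grouping idea is shown below. The table entries and group names are illustrative placeholders based on Appendix 1 & 2, not the actual data structures used in the experiment.

```python
# Illustrative fragment of a mapping dictionary:
# native grapheme/phoneme -> English phoneme label (cf. Appendix 1 & 2).
HINDI_TO_ENGLISH = {"क": "k", "ख": "kh", "ग": "g", "अ": "ax", "आ": "aa"}

# Illustrative grouping of English phoneme labels by articulation.
PHONE_GROUPS = {
    "velar_stops": ["k", "kh", "g", "gh"],
    "short_vowels": ["ax", "i", "u"],
}

def group_of(phone):
    """Return the articulatory group a phone belongs to, or None.
    Such a lookup lets the engine fall back to a phone category
    when a specific phone cannot be decided."""
    for group, members in PHONE_GROUPS.items():
        if phone in members:
            return group
    return None

print(group_of("kh"))  # velar_stops
```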
IV. HINDI & BANGLA PHONE SETS
To represent the sounds of the acoustic space, a set of phonemes [11] is required, which can come either from a particular language or from the sounds of a combination of languages.
The IPA [12] has defined phone sets for labeling speech databases for a large number of languages (including Hindi). However, some sounds used for the purpose of speech recognition are not included in the IPA. In a continuous speech recognition task, the purpose of defining a phonetic space is to form a well-defined phone set which can represent all the sounds that exist in a language. We have therefore used a set of phoneme sequences from which all the sounds can be derived, either individually or by clustering these phonemes.
Some phonemes exist in the text data only, not in the audio files. Since a phoneme may be pronounced by speakers in different ways, these variabilities have been captured.
For example: व (/v/) is written in text form in Bangla too, but it is pronounced as ब /b/.
A. Challenges
While dealing with Indian phone sets, the following challenges have been faced.
Nasal Sounds: Handling nasal sounds is a real task, especially when a vowel is followed by a consonant. For example, in अं, the vowel अ is followed by the nasal consonant न. We have clustered the respective vowel and consonant, using their 3 HMM states, in order to have a strong recognition system which is able to recognize almost all the phonemes.
OOV (Out-of-Vocabulary) Problem: During our experiments, the OOV problem occurred frequently. OOV refers to words in the test speech that are not present in the dictionary. To handle this, we have added such phones to the vocabulary.
Clusters of Sounds: Some sounds in Hindi are clusters of two or more different phonemes. To define these sounds, we have taken the 3 HMM states of each constituent phoneme and clustered them to obtain a new sound. Examples of clustering are shown in Table 5.
TABLE 5: EXAMPLES OF CLUSTERING OF SOUNDS
Phonemes Clustering of Sounds
ओं ao,2 ao,3 n,2
उं uh,2 uh,3 n,2
ईं iy,2 iy,3 n,2
ञ y,2 y,3 n,2
त्र t,2 t,3 r,2
झ j,3 h,2 h,3
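The Table 5 entries can be read as lists of (phone, HMM-state) pairs; a clustered sound is defined by concatenating selected states of its constituent phones. The parser below is an illustrative sketch of that notation, not part of the actual toolkit.

```python
def cluster_states(spec):
    """Parse a Table-5 style cluster spec such as 't,2 t,3 r,2'
    into a list of (phone, state) pairs, i.e. the sequence of
    HMM states that together model the clustered sound."""
    return [(phone, int(state))
            for phone, state in (token.split(",") for token in spec.split())]

# The cluster for त्र from Table 5: states 2 and 3 of /t/ followed by state 2 of /r/.
print(cluster_states("t,2 t,3 r,2"))  # [('t', 2), ('t', 3), ('r', 2)]
```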
Phonemes not common to Hindi & Bengali: Some phonemes have been observed to exist in Hindi but not in Bengali, and vice versa. The list of these phones is given in Table 6. As we have dealt with both languages, we have trained the ASR individually for each language with its own phoneme set, and computed their accuracies separately.
TABLE 6: LIST OF PHONEMES NOT COMMON IN HINDI & BENGALI AUDIO FILES
Common in Hindi & Bangla: 47 phonemes
Only in Hindi (10 phonemes): व /v/, क़ /q/, ञ /ɲ/, य /j/, ष /ʂ/, ख़ /x/, ग़ /ɣ/, ज़ /z/, झ़ /ʒ/, फ़ /f/
Only in Bangla (3 phonemes): রং /ŋ/, ঐ /oj/, ঔ /ow/
V. METHODOLOGY TO DEVELOP TEXT CORPORA
A. Grapheme to Phoneme Conversion (G2P)
To analyze the distribution of the basic recognition units (phones, di-phones, syllables, etc.) in the text corpus, the corpus has to be phonetized. G2P converters are tools that convert the text corpus into its phonetic equivalent. The phonetic nature of Indian languages reduces the effort of building individual mapping tables and rules for the lexical representation. These rules and the mapping tables (Appendix 1 & 2) together constitute the Grapheme to Phoneme converter.
We have used Moses [8] for G2P conversion, training it with 4280 unique words whose phonetic equivalents were prepared by linguists. The remaining graphemes are given as input, and their phoneme equivalents are taken as output.
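Because Indian scripts are largely phonetic, a G2P converter can in principle be approximated by a longest-match lookup over a grapheme-to-phoneme table. The sketch below shows that idea only; the actual system uses Moses [8], and the table entries here are illustrative placeholders in the style of Appendix 1 & 2.

```python
# Illustrative grapheme -> phoneme table (placeholders, not the trained model).
G2P_TABLE = {"क": "k", "ख": "kh", "म": "m", "ल": "l", "अ": "ax", "ा": "aa"}

def g2p(word):
    """Greedy longest-match grapheme-to-phoneme conversion:
    at each position, consume the longest grapheme chunk found
    in the table; skip graphemes the table does not cover."""
    phones, i = [], 0
    while i < len(word):
        for length in range(min(3, len(word) - i), 0, -1):
            chunk = word[i:i + length]
            if chunk in G2P_TABLE:
                phones.append(G2P_TABLE[chunk])
                i += length
                break
        else:
            i += 1  # grapheme not in table: skip it
    return phones

print(g2p("कमल"))  # ['k', 'm', 'l']
```

A statistical G2P model like Moses additionally learns context-dependent mappings from the training pairs, which pure table lookup cannot express.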
B. Rapid Bootstrapping
Language adaptation technology enables us to rapidly bootstrap a speech recognizer in a new target language.
Converting the phonemes of one language into another: In this experiment, we have developed an algorithm to convert each phoneme of a particular language into the corresponding Hindi phoneme, so that we can use more data for Hindi taken from other Indian languages. For this, we have mapped each phoneme of the source language individually to the respective phoneme in Hindi; if it matches a particular Hindi phoneme, the algorithm outputs that Hindi character as the converted phoneme. Thus, for the text data of a particular Indian language, we obtain the data in Hindi phonemes, and further processing can then be done.
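The bootstrapping step above can be sketched as a per-phoneme mapping through a shared (English-phoneme) pivot, as in the mapping dictionary of Section III-C. This is a sketch of the idea only; the table entries are illustrative placeholders, not the full mapping used in the experiment.

```python
# Illustrative source-language (Bangla) phoneme -> pivot (English phoneme) table.
BANGLA_TO_PIVOT = {"ক": "k", "খ": "kh", "ম": "m"}
# Illustrative pivot -> Hindi character table.
PIVOT_TO_HINDI = {"k": "क", "kh": "ख", "m": "म"}

def bootstrap_to_hindi(source_phonemes):
    """Map each source-language phoneme to its Hindi counterpart via
    the pivot set; phonemes with no Hindi match are dropped."""
    out = []
    for ph in source_phonemes:
        pivot = BANGLA_TO_PIVOT.get(ph)
        if pivot in PIVOT_TO_HINDI:
            out.append(PIVOT_TO_HINDI[pivot])
    return out

print(bootstrap_to_hindi(["ক", "ম"]))  # ['क', 'म']
```

Phonemes that exist only in the source language (Table 6) have no Hindi match and must be handled separately, e.g. by mapping to the nearest phone in the same articulatory group.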
VI. COLLECTION OF AUDIO DATA
In this section, the steps involved in building the speech corpora are discussed. Two channels, a head-mounted microphone and a mobile phone, have been used to record the data simultaneously.
A. Speaker Selection & Transcription of Audio Files
For training purposes, speech data was collected from native speakers of the different languages who were comfortable in speaking and reading the particular language, so as to sufficiently capture the diversity attributable to gender, age and dialect.
B. Transcription Corrections
Although care was taken to record the speech with minimal background noise and pronunciation mistakes, some errors still crept in during recording. These errors had to be identified manually by listening to the speech. The pronunciation mistakes were carefully identified and, where possible, the corresponding changes were made in the transcriptions so that the utterance and transcription correspond to each other. The idea behind this was to make the utmost use of the data and to let it serve as a corpus for further related research work.
C. Data Statistics
For open-set speech recognition, the system has been trained with 70% of the overall corpus and the remaining 30% has been used as test data. For closed-set recognition, the overall data is used for training, and some data from the same set is used as test data.
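The 70/30 open-set split described above can be sketched as follows; this is a generic illustration, not the partitioning script actually used for the corpora.

```python
import random

def open_set_split(utterances, train_fraction=0.7, seed=0):
    """Shuffle the corpus and split it into training and test sets,
    so that no test utterance appears in the training data (open set)."""
    items = list(utterances)
    random.Random(seed).shuffle(items)  # fixed seed for reproducibility
    cut = int(len(items) * train_fraction)
    return items[:cut], items[cut:]

train, test = open_set_split(range(100))
print(len(train), len(test))  # 70 30
```

For closed-set evaluation, by contrast, the test utterances are drawn from the training data itself.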
VII. ASR EVALUATION RESULTS
We have conducted several experiments to assess the relevance of our approach. Two individual recognition engines, for Hindi and Bengali, have been developed. Each system has been trained with the corpus of its respective language.
A. Overall Performance of Hindi and Bangla ASR
The overall performance of the Hindi and Bangla ASR when using 70% of the data as the training set and the remaining 30% as the test set is shown in Table 7 and Table 8 respectively.
TABLE 7: OVERALL PERFORMANCE OF HINDI ASR
Task Name    Num Phrases   Beam Width   Word Accuracy   Clock Time
Output 170   174           170          51.5            221.00
Output 190   174           190          57.2            311.81
Output 210   174           210          61.0            431.27
Output 230   174           230          61.8            589.81
Output 250   174           250          62.1            781.11
TABLE 8: OVERALL PERFORMANCE OF BANGLA ASR
Task Name    Num Phrases   Beam Width   Word Accuracy   Clock Time
Output 170   174           170          47.3            81.04
Output 190   174           190          51.8            108.23
Output 210   174           210          54.3            144.43
Output 230   174           230          54.2            195.63
Output 250   174           250          54.9            266.24
B. Hindi Speech Recognition
For the Hindi recognition engine, we have both trained and tested the system with the Hindi database. For testing, we have used both closed and open sets. The accuracy for Hindi is shown below, where for the open set the test set is outside the training data, and for the closed set the test set is drawn from the training data:
Fig 2. Word accuracy percentage of Hindi ASR
C. Bangla Speech Recognition
For Bangla recognition, we have trained the system with Bengali data and then tested it both with the same data and with a different subset, giving the accuracy of both the closed and the open set. The performance of the Bangla recognition engine is as follows:
Fig 3: Word accuracy percentage of Bangla ASR
This shows that the best way to improve accuracy is to add more speakers to the training set. The evaluation of the experiment was made according to the recognition accuracy, computed using the word error rate (WER), which aligns each recognized word against the correct word and counts the number of substitutions (S), deletions (D) and insertions (I), together with the number of words in the correct sentence (N):
WER = 100 × (S + D + I) / N
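The WER formula above can be computed by aligning the two word sequences with edit distance, where the total S + D + I is the Levenshtein distance between reference and hypothesis. A minimal sketch (Sclite performs this alignment and reporting in practice):

```python
def wer(ref, hyp):
    """Word error rate: 100 * (S + D + I) / N, where the error counts
    come from a Levenshtein alignment of the word sequences."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(h) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            substitution = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[len(r)][len(h)] / len(r)

print(wer("the cat sat", "the cat sat"))  # 0.0
print(wer("the cat sat", "the mat"))      # one substitution + one deletion
```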
D. Using the Transliterated Data
In this experiment, 50 sentences of Bangla were taken. After their transliteration into Hindi [13], the sentences were recorded by a native Hindi speaker and used as an additional test set. The parallel sentences of Bangla and Hindi were then tested with the same Bangla ASR. As the acoustic model is the same for both languages, we can use both parallel sets of files as test sets with the same system.
The accuracy of the Bengali corpus and its Hindi transliteration when tested with the same Bengali ASR is shown in Table 9.
TABLE 9: ACCURACIES OF ORIGINAL BANGLA SENTENCES & THEIR TRANSLITERATED VERSION
Language   Sentences from Testing Set   Accuracy (%)
Hindi      50                           57.4
Bangla     50                           64.2
The system has been trained on the original Bangla sentences, so it gives better accuracy for the Bangla sentences than for their Hindi transliterated versions. Since the difference in accuracy is not large, the transliteration is effective. Thus, we can increase the text corpus of a particular language by using transliterated data obtained from another language.
E. Comparison of Hindi & Bengali ASR Models
In this experiment, we have built acoustic models for the Bangla corpus and its Hindi transliterated corpus individually, keeping the language model the same. Thus, for the Hindi ASR we have used 50 Hindi sentences and 50 Bangla-to-Hindi transliterated sentences as test sentences, and similarly for the Bangla ASR. The accuracies observed in these cases are shown in Table 10.
TABLE 10: COMPARISON OF HINDI & BANGLA ASR
Language   Testing Sentences   Accuracy (%)
Hindi      50                  74.2
Bangla     50                  65.6
As we have a larger Hindi corpus than Bangla, the accuracy of the Hindi ASR is better. It can thus be concluded that accuracy can be improved by increasing the corpus size.
VIII. IMPROVING ASR ACCURACIES USING MORE DATA FROM VARIOUS LANGUAGES
We have collected data from various commonly known Indian languages and transliterated the whole dataset into the Hindi alphabet, so that we can capture many of the variations which occur across Indian languages. With this data, both the acoustic and the language model are improved. Since Indian languages are phonetically rich, with a close correspondence between orthography and pronunciation, we can capture all the possible phoneme realizations. The experiments show that the ASR accuracy can also be improved by using a large corpus built in this manner.
IX. CONCLUSION & FUTURE WORK
In this paper, we discussed the design and development of speech databases for two Indian languages, Hindi and Bangla, and their suitability for developing ASR systems using the WATSON tool. The simple methodology of database creation presented here will serve as a catalyst for the creation of speech databases in other Indian languages. Some of the conclusions of our study are:
Female speakers perform better when the system is trained with a female-voice database alone.
The accuracy of the system is better when it is trained with a variety of speakers and speaking styles, as compared to simply increasing the corpus from a limited number of speakers.
Native speakers perform better than non-native speakers under all conditions.
As the decoding beam-width increases, word accuracy also increases.
Word accuracy also increases with clock time.
We hope that the ASRs created using the databases developed in this work will serve as baseline systems for further research in improving the accuracies in each of the languages. Our future work is focused on tuning these models and testing them using language and acoustic models built from a much larger corpus collected from a large number of speakers.
X. ACKNOWLEDGEMENT
We would like to acknowledge the help and support received from Mr. Anirudhha of IIIT, Hyderabad in conducting these experiments. We are also thankful to Prof. Michael Carl of CBS, Copenhagen, and to the KIIT management, in particular Dr. Harsh V. Kamrah and Mrs. Neelima Kamrah, for providing the necessary facilities, financial help and encouragement, and to DIETY for providing a fellowship to one of the authors, Dipti Pandey.
REFERENCES
[1] Samudravijaya K, P. V. S. Rao, and S. S. Agrawal, "Hindi speech database," Proc. Int. Conf. on Spoken Language Processing (ICSLP 2000), Beijing, China, October 2000, CDROM paper 00192.pdf.
[2] K. Kumar and R. K. Agarwal, "Hindi Speech Recognition System Using HTK," International Journal of Computing and Business Research, Vol. 2, No. 2, 2011, ISSN (Online): 2229-6166.
[3] Chourasia, K. Samudravijaya, and Chandwani, "Phonetically rich Hindi sentences corpus for creation of speech database," Proc. O-COCOSDA 2005, pp. 132-137.
[4] Shweta Sinha, S. S. Agrawal, and Jesper Olsen, "Mobile speech Hindi database," O-COCOSDA 2011, Hsinchu, Taiwan.
[5] www.mastar.jp/wfdtr/presentation/2_Dr.Bangalore
[6] Ahuja, R., Bondale, N., Furtado, X., Krishnan, S., Poddar, P., Rao, P. V. S., Raveendran, R., Samudravijaya K, and Sen, A., "Recognition and Synthesis in the Hindi Language," Proceedings of the Workshop on Speech Technology, IIT Madras, pp. 3-19, Dec. 1992.
[7] Vishal Chourasia, Samudravijaya K, Maya Ingle, and Manohar Chandwani, "Hindi speech recognition under noisy conditions," J. Acoust. Soc. India, 54(1), pp. 41-46, January 2007.
[8] http://www.statmt.org/moses/manual/manual.pdf
[9] http://www1.icsi.berkeley.edu/Speech/docs/sctk-1.2/sclite.htm
[10] http://www.research.att.com/projects/WATSON/?fbid=2tgRMa1CfjG
[11] S. S. Agrawal, K. Samudravijaya, and Karunesh Arora, "Text and Speech Corpora Development in Indian Languages," Proceedings of ICSLT-O-COCOSDA 2004, New Delhi, India.
[12] www.madore.org/~david/misc/linguistic/ipa/
[13] http://en.wikipedia.org/wiki/Devanagari_transliteration
APPENDIX
The characterization of Hindi and Bangla phonemes has been done as follows:

Appendix 1: Characterization of Vowels
CATEGORY              HINDI   BENGALI   IPA     ENGLISH PHONEMES
Monophthongs (Short)  अ       অ         /ə/     AX
                      इ       ই         /i/     I
                      उ       উ         /u/     U
                      ऋ       ঋ         /       RR
Monophthongs (Long)   आ       আ         /aː/    AA
                      ई       ঈ         /iː/    II
                      ऊ       ঊ         /uː/    UU
                      ए       এ         /e/     E
                      ओ       ও         /o/     O
Diphthongs            ऐ       ঐ         /æ/     AI
                      औ       ঔ         /ɔː/    AU
Appendix 2: Characterization of Consonants
CATEGORY                  HINDI   BENGALI   IPA      ENGLISH PHONEMES
Unaspirated (Unvoiced)    क       ক         /k/      k
                          च       চ         /tʃ/     c
                          ट       ট         /ʈ/      tt
                          त       ত         /t/      t
                          प       প         /p/      p
Aspirated (Unvoiced)      ख       খ         /kʰ/     kh
                          छ       ছ         /tʃʰ/    ch
                          ठ       ঠ         /ʈʰ/     tth
                          थ       থ         /tʰ/     th
                          फ       ফ         /pʰ/     ph
Unaspirated (Voiced)      ग       গ         /g/      g
                          ज       জ         /dʒ/     j
                          ड       ড         /ɖ/      dd
                          द       দ         /d/      d
                          ब       ব         /b/      b
Aspirated (Voiced)        घ       ঘ         /gʰ/     gh
                          झ       ঝ         /dʒʰ/    jh
                          ढ       ঢ         /ɖʱ/     ddh
                          ध       ধ         /dʰ/     dh
                          भ       ভ         /bʰ/     bh
Nasals                    ड़       ড়         /ɽ/      ddn
                          ञ       ঞ         /ɲ/      ny
                          ण       ণ         /ɳ/      nn
                          न       ন         /n/      n
                          म       ম         /m/      m
Semivowels/Approximants   य       য         /j/      y
                          र       র         /r/      r
                          ल       ল         /l/      l
                          व       ব         /v/      w
Sibilants                 श       শ         /ʃ/      sh
                          ष       ষ         /ʂ/      sh^
                          स       স         /s/      s
Glottal                   ह       হ         /h/      h